[PoC] Improve dead tuple storage for lazy vacuum
Hi all,
Index vacuuming is one of the most time-consuming processes in lazy
vacuuming, and lazy_tid_reaped() accounts for a large part of it. The
attached flame graph shows the profile of a vacuum on a table that has
one index, 80 million live rows, and 20 million dead rows, where
lazy_tid_reaped() accounts for about 47% of the total vacuum execution
time.
lazy_tid_reaped() is essentially an existence check; for every index
tuple, it checks whether the heap TID it points to exists in the set
of dead tuple TIDs. The maximum amount of memory for dead tuple TIDs
is limited by maintenance_work_mem, and if the upper limit is reached,
the heap scan is suspended, index vacuum and heap vacuum are
performed, and then the heap scan is resumed. Therefore, in terms of
index vacuuming performance, there are two important factors: the
performance of looking up TIDs in the set of dead tuples, and its
memory usage. The former is obvious, whereas the latter affects the
number of index vacuuming passes. In many index AMs, index vacuuming
(i.e., ambulkdelete) performs a full scan of the index, so it is
important for performance to avoid executing index vacuuming more than
once during lazy vacuum.
Currently, the TIDs of dead tuples are stored in an array that is
collectively allocated at the start of lazy vacuum and TID lookup uses
bsearch(). There are the following challenges and limitations:
1. It cannot allocate more than 1GB. There was a discussion about
eliminating this limitation by using MemoryContextAllocHuge(), but
there were concerns about point 2 [1].
2. The whole memory space is allocated at once.
3. Lookup performance is slow (O(log N)).
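For reference, here is a minimal sketch of the array-plus-bsearch()
scheme described above (an illustration only, not the actual
vacuumlazy.c code; the comparator mirrors what vac_cmp_itemptr() does):

#include "postgres.h"
#include "storage/itemptr.h"

/* Simplified picture of the current dead tuple storage. */
typedef struct DeadTuplesArray
{
	int			num_tuples;		/* current number of dead TIDs */
	int			max_tuples;		/* allocated slots, capped at 1GB of memory */
	ItemPointerData itemptrs[FLEXIBLE_ARRAY_MEMBER];	/* sorted in heap order */
} DeadTuplesArray;

/* Comparator over (block, offset), as vac_cmp_itemptr() does today. */
static int
dead_tuple_cmp(const void *left, const void *right)
{
	BlockNumber lblk = ItemPointerGetBlockNumber((ItemPointer) left);
	BlockNumber rblk = ItemPointerGetBlockNumber((ItemPointer) right);
	OffsetNumber loff;
	OffsetNumber roff;

	if (lblk != rblk)
		return (lblk < rblk) ? -1 : 1;

	loff = ItemPointerGetOffsetNumber((ItemPointer) left);
	roff = ItemPointerGetOffsetNumber((ItemPointer) right);

	if (loff != roff)
		return (loff < roff) ? -1 : 1;
	return 0;
}

/* The existence check done once per index tuple: O(log N) per call. */
static bool
dead_tuple_exists(DeadTuplesArray *dt, ItemPointer itemptr)
{
	return bsearch(itemptr, dt->itemptrs, dt->num_tuples,
				   sizeof(ItemPointerData), dead_tuple_cmp) != NULL;
}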
I’ve done some experiments in this area and would like to share the
results and discuss ideas.
Possible Solutions
===============
Firstly, I've considered using existing data structures:
IntegerSet (src/backend/lib/integerset.c) and
TIDBitmap (src/backend/nodes/tidbitmap.c). Both address point 1, but
each addresses only one of points 2 and 3. IntegerSet uses less memory
thanks to simple-8b encoding but is slow at lookup, still O(log N),
since it's a tree structure. On the other hand, TIDBitmap has good
lookup performance, O(1), but can use unnecessarily large amounts of
memory in some cases since it always allocates enough bitmap space to
store all possible offsets. With 8kB blocks, the maximum number of
line pointers in a heap page is 291 (cf. MaxHeapTuplesPerPage), so the
bitmap is 40 bytes long and we always need 46 bytes in total per
block, including other meta information.
So I prototyped a new data structure dedicated to storing dead tuples
during lazy vacuum, borrowing the idea from Roaring Bitmap[2]. The
authors provide an implementation of Roaring Bitmap[3] (Apache 2.0
license), but I've implemented this idea from scratch because we need
to integrate it with Dynamic Shared Memory/Area to support parallel
vacuum, and we need to support ItemPointerData, a 6-byte integer,
whereas that implementation supports only 4-byte integers. Also, when
it comes to vacuum, we need neither intersection, union, nor
difference between sets; we need only an existence check.
The data structure is somewhat similar to TIDBitmap. It consists of a
hash table and a container area; the hash table has one entry per
block, and each block entry allocates its memory space, called a
container, in the container area to store its offset numbers. The
container area is actually an array of bytes and can be enlarged as
needed. In the container area, the data representation of offset
numbers varies depending on their cardinality. There are three
container types: array, bitmap, and run.
For example, if there are two dead tuples at offsets 1 and 150, it
uses the array container, which has an array of two 2-byte integers
representing 1 and 150, using 4 bytes in total. If we used the bitmap
container in this case, we would need 20 bytes instead. On the other
hand, if there are 20 consecutive dead tuples from offset 1 to 20, it
uses the run container, which has an array of pairs of 2-byte
integers: the first value of each pair represents a starting offset
number, and the second value represents the run length. Therefore, in
this case, the run container uses only 4 bytes in total. Finally, if
there are dead tuples at every other offset from 1 to 100, it uses the
bitmap container, which has an uncompressed bitmap, using 13 bytes. We
also need another 16 bytes per block for the hash table entry.
The lookup complexity of a bitmap container is O(1), whereas that of
an array or a run container is O(log N) or O(N). But since the number
of elements in those two containers should not be large, that should
not be a problem.
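To make the layout concrete, here is a minimal sketch of the idea; the
names (RTbmBlockEntry and so on) are hypothetical and this is not the
PoC code itself, just an illustration of the hash-table-plus-container-area
design described above:

#include "postgres.h"
#include "storage/itemptr.h"

typedef enum RTbmContainerType
{
	RTBM_CONTAINER_ARRAY,		/* sorted array of 2-byte offset numbers */
	RTBM_CONTAINER_BITMAP,		/* uncompressed bitmap of offsets */
	RTBM_CONTAINER_RUN			/* pairs of (start offset, run length) */
} RTbmContainerType;

/* Hash table entry: one per heap block that has at least one dead tuple. */
typedef struct RTbmBlockEntry
{
	BlockNumber blkno;			/* hash key */
	RTbmContainerType type;		/* which representation is used */
	uint32		offset;			/* location of the container in the container area */
	uint16		len;			/* number of elements, bytes, or runs */
} RTbmBlockEntry;

/* Existence check within one container; O(1), O(N), or O(log N) by type. */
static bool
rtbm_container_lookup(RTbmContainerType type, const char *container,
					  uint16 len, OffsetNumber off)
{
	const uint16 *vals = (const uint16 *) container;
	const unsigned char *bits = (const unsigned char *) container;
	uint16		i;

	switch (type)
	{
		case RTBM_CONTAINER_BITMAP:
			/* one bit per offset number; offsets are 1-based */
			return (bits[(off - 1) / 8] & (1 << ((off - 1) % 8))) != 0;

		case RTBM_CONTAINER_ARRAY:
			/* small sorted array; a linear scan is shown, bsearch also works */
			for (i = 0; i < len; i++)
				if (vals[i] == off)
					return true;
			return false;

		case RTBM_CONTAINER_RUN:
			/* 'len' runs of (start, length) pairs */
			for (i = 0; i < len; i++)
				if (off >= vals[2 * i] && off < vals[2 * i] + vals[2 * i + 1])
					return true;
			return false;
	}
	return false;				/* unreachable, keeps the compiler quiet */
}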
Evaluation
========
Before implementing this idea in the lazy vacuum code, I've
implemented a benchmark tool dedicated to evaluating lazy_tid_reaped()
performance[4]. It has functions for generating TIDs for both index
tuples and dead tuples, loading dead tuples into the data structure,
and simulating lazy_tid_reaped() using those virtual index tuples and
dead tuples. The code lacks many features such as iteration and
DSM/DSA support, but it makes testing the data structures easier.
FYI, I've confirmed the validity of this tool. When I ran a vacuum on
a 3GB table, index vacuuming took 12.3 sec and lazy_tid_reaped() took
approximately 8.5 sec. Simulating a similar situation with the tool,
the lookup benchmark with the array data structure took approximately
8.0 sec. Given that the tool doesn't simulate the cost of function
calls, it seems to simulate the real behavior reasonably well.
I've evaluated the lookup performance and memory footprint of four
data structures: array, integerset (intset), tidbitmap (tbm), and
roaring tidbitmap (rtbm), while changing the distribution of dead
tuples within blocks. Since tbm doesn't have a function for existence
checks, I added one, and I allocated enough memory to make sure that
tbm never becomes lossy during the evaluation. In all test cases, I
simulated a table with 1,000,000 blocks where every block has at least
one dead tuple. The benchmark scenario is that for each virtual heap
tuple we check whether its TID is in the dead tuple storage. Here are
the results, with execution time in milliseconds and memory usage in
bytes:
* Test-case 1 (10 dead tuples at 20-offset intervals)
An array container is selected in this test case, using 20 bytes for each block.
         Execution Time    Memory Usage
array         14,140.91      60,008,248
intset         9,350.08      50,339,840
tbm            1,299.62     100,671,544
rtbm           1,892.52      58,744,944
* Test-case 2 (10 consecutive dead tuples from offset 1)
A bitmap container is selected in this test case, using 2 bytes for each block.
         Execution Time    Memory Usage
array          1,056.60      60,008,248
intset           650.85      50,339,840
tbm              194.61     100,671,544
rtbm             154.57      27,287,664
* Test-case 3 (2 dead tuples at 1 and 100 offsets)
An array container is selected in this test case, using 4 bytes for
each block. Since the 'array' data structure (not the array container
of rtbm) uses only 12 bytes per block, and rtbm additionally needs a
hash table entry per block, the 'array' data structure uses less
memory here.
         Execution Time    Memory Usage
array          6,054.22      12,008,248
intset         4,203.41      16,785,408
tbm              759.17     100,671,544
rtbm             750.08      29,384,816
* Test-case 4 (100 consecutive dead tuples from offset 1)
A run container is selected in this test case, using 4 bytes for each block.
         Execution Time    Memory Usage
array          8,883.03     600,008,248
intset         7,358.23     100,671,488
tbm              758.81     100,671,544
rtbm             764.33      29,384,816
Overall, 'rtbm' has much better lookup performance and good memory
usage, especially when there are relatively many dead tuples. However,
in some cases 'intset' and 'array' use less memory.
Feedback is very welcome. Thank you for reading the email through to the end.
Regards,
[1]: /messages/by-id/CAGTBQpbDCaR6vv9=scXzuT8fSbckf=a3NgZdWFWZbdVugVht6Q@mail.gmail.com
[2]: http://roaringbitmap.org/
[3]: https://github.com/RoaringBitmap/CRoaring
[4]: https://github.com/MasahikoSawada/pgtools/tree/master/bdbench
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
On Wed, 7 Jul 2021 at 13:47, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Hi all,
Index vacuuming is one of the most time-consuming processes in lazy
vacuuming. lazy_tid_reaped() is a large part among them. The attached
flame graph shows a profile of a vacuum on a table that has one index
and 80 million live rows and 20 million dead rows, where
lazy_tid_reaped() accounts for about 47% of the total vacuum execution
time.[...]
Overall, 'rtbm' has a much better lookup performance and good memory
usage especially if there are relatively many dead tuples. However, in
some cases, 'intset' and 'array' have a better memory usage.
Those are some great results, with a good path to meaningful improvements.
Feedback is very welcome. Thank you for reading the email through to the end.
The currently available infrastructure for TIDs is quite ill-defined
for TableAM authors [0], and other TableAMs might want to use more
than just the 11 bits needed by heapam's MaxHeapTuplesPerPage at the
maximum BLCKSZ to identify tuples. (MaxHeapTuplesPerPage is 1169 at
the maximum 32kB BLCKSZ, which requires 11 bits to fit.)
Could you also check what the (performance, memory) impact would be if
these proposed structures were to support the maximum
MaxHeapTuplesPerPage of 1169 or the full uint16-range of offset
numbers that could be supported by our current TID struct?
Kind regards,
Matthias van de Meent
[0]: /messages/by-id/0bbeb784050503036344e1f08513f13b2083244b.camel@j-davis.com
On Wed, Jul 7, 2021 at 4:47 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Currently, the TIDs of dead tuples are stored in an array that is
collectively allocated at the start of lazy vacuum and TID lookup uses
bsearch(). There are the following challenges and limitations:
1. Don't allocate more than 1GB. There was a discussion to eliminate
this limitation by using MemoryContextAllocHuge() but there were
concerns about point 2[1].
I think that the main problem with the 1GB limitation is that it is
surprising -- it can cause disruption when we first exceed the magical
limit of ~174 million TIDs. This can cause us to dirty index pages a
second time when we might have been able to just do it once with
sufficient memory for TIDs. OTOH there are actually cases where having
less memory for TIDs makes performance *better* because of locality
effects. This perverse behavior with memory sizing isn't a rare case
that we can safely ignore -- unfortunately it's fairly common.
My point is that we should be careful to choose the correct goal.
Obviously memory use matters. But it might be more helpful to think of
memory use as just a proxy for what truly matters, not a goal in
itself. It's hard to know what this means (what is the "real goal"?),
and hard to measure it even if you know for sure. It could still be
useful to think of it like this.
A run container is selected in this test case, using 4 bytes for each block.
Execution Time Memory Usage
array 8,883.03 600,008,248
intset 7,358.23 100,671,488
tbm 758.81 100,671,544
rtbm 764.33 29,384,816
Overall, 'rtbm' has a much better
usage especially if there are relatively many dead tuples. However, in
some cases, 'intset' and 'array' have a better memory usage.
This seems very promising.
I wonder how much you have thought about the index AM side. It makes
sense to initially evaluate these techniques using this approach of
separating the data structure from how it is used by VACUUM -- I think
that that was a good idea. But at the same time there may be certain
important theoretical questions that cannot be answered this way --
questions about how everything "fits together" in a real VACUUM might
matter a lot. You've probably thought about this at least a little
already. Curious to hear how you think it "fits together" with the
work that you've done already.
The loop inside btvacuumpage() makes each loop iteration call the
callback -- this is always a call to lazy_tid_reaped() in practice.
And that's where we do binary searches. These binary searches are
usually where we see a huge number of cycles spent when we look at
profiles, including the profile that produced your flame graph. But I
worry that that might be a bit misleading -- the way that profilers
attribute costs is very complicated and can never be fully trusted.
While it is true that lazy_tid_reaped() often accesses main memory,
which will of course add a huge amount of latency and make it a huge
bottleneck, the "big picture" is still relevant.
I think that the compiler currently has to make very conservative
assumptions when generating the machine code used by the loop inside
btvacuumpage(), which calls through an opaque function pointer at
least once per loop iteration -- anything can alias, so the compiler
must be conservative. The data dependencies are hard for both the
compiler and the CPU to analyze. The cost of using a function pointer
compared to a direct function call is usually quite low, but there are
important exceptions -- cases where it prevents other useful
optimizations. Maybe this is an exception.
I wonder how much it would help to break up that loop into two loops.
Make the callback into a batch operation that generates state that
describes what to do with each and every index tuple on the leaf page.
The first loop would build a list of TIDs, then you'd call into
vacuumlazy.c and get it to process the TIDs, and finally the second
loop would physically delete the TIDs that need to be deleted. This
would mean that there would be only one call per leaf page per
btbulkdelete(). This would reduce the number of calls to the callback
by at least 100x, and maybe more than 1000x.
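To make that concrete, here is a rough sketch of what the two-loop
structure could look like; IndexBulkDeleteBatchCallback and the batch
callback itself are hypothetical names (today's code uses the per-tuple
IndexBulkDeleteCallback), and for simplicity this ignores posting-list
tuples from deduplication:

#include "postgres.h"
#include "access/itup.h"
#include "storage/bufpage.h"

/* Hypothetical batch interface: classify all heap TIDs of one leaf page at once. */
typedef void (*IndexBulkDeleteBatchCallback) (ItemPointer htids, int nhtids,
											  bool *dead, void *state);

/*
 * Sketch of the two loops: loop 1 collects the heap TIDs of every index
 * tuple on the page, one batch call classifies them, and loop 2 collects
 * the offsets to physically delete (which would then be handed to
 * _bt_delitems_vacuum() as today).
 */
static int
collect_deletable(Page page, OffsetNumber minoff,
				  IndexBulkDeleteBatchCallback batch_callback, void *state,
				  OffsetNumber *deletable)
{
	ItemPointerData htids[MaxIndexTuplesPerPage];
	OffsetNumber offnums[MaxIndexTuplesPerPage];
	bool		dead[MaxIndexTuplesPerPage];
	OffsetNumber offnum;
	OffsetNumber maxoff = PageGetMaxOffsetNumber(page);
	int			nhtids = 0;
	int			ndeletable = 0;
	int			i;

	/* Loop 1: gather the heap TID of each index tuple on the leaf page. */
	for (offnum = minoff; offnum <= maxoff; offnum = OffsetNumberNext(offnum))
	{
		IndexTuple	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));

		offnums[nhtids] = offnum;
		htids[nhtids] = itup->t_tid;
		nhtids++;
	}

	/* One call into vacuumlazy.c per leaf page instead of one per index tuple. */
	batch_callback(htids, nhtids, dead, state);

	/* Loop 2: remember which offsets need to be physically deleted. */
	for (i = 0; i < nhtids; i++)
	{
		if (dead[i])
			deletable[ndeletable++] = offnums[i];
	}

	return ndeletable;
}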
This approach would make btbulkdelete() similar to
_bt_simpledel_pass() + _bt_delitems_delete_check(). This is not really
independent of your ideas -- I imagine that this would work
far better when combined with a more compact data structure, which is
naturally more capable of batch processing than a simple array of
TIDs. Maybe this will help the compiler and the CPU to fully
understand the *natural* data dependencies, so that they can be as
effective as possible in making the code run fast. It's possible that
a modern CPU will be able to *hide* the latency more intelligently
than what we have today. The latency is such a big problem that we may
be able to justify "wasting" other CPU resources, just because it
sometimes helps with hiding the latency. For example, it might
actually be okay to sort all of the TIDs on the page to make the bulk
processing work -- though you might still do a precheck that is
similar to the precheck inside lazy_tid_reaped() that was added by you
in commit bbaf315309e.
Of course it's very easy to be wrong about stuff like this. But it
might not be that hard to prototype. You can literally copy and paste
code from _bt_delitems_delete_check() to do this. It does the same
basic thing already.
--
Peter Geoghegan
On Wed, Jul 7, 2021 at 1:24 PM Peter Geoghegan <pg@bowt.ie> wrote:
I wonder how much it would help to break up that loop into two loops.
Make the callback into a batch operation that generates state that
describes what to do with each and every index tuple on the leaf page.
The first loop would build a list of TIDs, then you'd call into
vacuumlazy.c and get it to process the TIDs, and finally the second
loop would physically delete the TIDs that need to be deleted. This
would mean that there would be only one call per leaf page per
btbulkdelete(). This would reduce the number of calls to the callback
by at least 100x, and maybe more than 1000x.
Maybe for something like rtbm.c (which is inspired by Roaring
bitmaps), you would really want to use an "intersection" operation for
this. The TIDs that we need to physically delete from the leaf page
inside btvacuumpage() are the intersection of two bitmaps: our bitmap
of all TIDs on the leaf page, and our bitmap of all TIDs that need to
be deleted by the ongoing btbulkdelete() call.
Obviously the typical case is that most TIDs in the index do *not* get
deleted -- needing to delete more than ~20% of all TIDs in the index
will be rare. Ideally it would be very cheap to figure out that a TID
does not need to be deleted at all. Something a little like a negative
cache (but not a true negative cache). This is a little bit like how
hash joins can be made faster by adding a Bloom filter -- most hash
probes don't need to join a tuple in the real world, and we can make
these hash probes even faster by using a Bloom filter as a negative
cache.
If you had the list of TIDs from a leaf page sorted for batch
processing, and if you had roaring bitmap style "chunks" with
"container" metadata stored in the data structure, you could then use
merging/intersection -- that has some of the same advantages. I think
that this would be a lot more efficient than having one binary search
per TID. Most TIDs from the leaf page can be skipped over very
quickly, in large groups. It's very rare for VACUUM to need to delete
TIDs from completely random heap table blocks in the real world (some
kind of pattern is much more common).
When this merging process finds 1 TID that might really be deletable
then it's probably going to find much more than 1 -- better to make
that cache miss take care of all of the TIDs together. Also seems like
the CPU could do some clever prefetching with this approach -- it
could prefetch TIDs where the initial chunk metadata is insufficient
to eliminate them early -- these are the groups of TIDs that will have
many TIDs that we actually need to delete. ISTM that improving
temporal locality through batching could matter a lot here.
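A minimal sketch of that kind of block-at-a-time merging, assuming the
leaf page's TIDs have been sorted in heap order and the dead-tuple
store can hand back a per-block container; rtbm_get_block() and
rtbm_container_contains() are hypothetical names:

#include "postgres.h"
#include "storage/itemptr.h"

/* Hypothetical accessors into the dead-tuple store (names are made up). */
extern bool rtbm_get_block(void *rtbm, BlockNumber blkno, void **container);
extern bool rtbm_container_contains(void *container, OffsetNumber off);

/*
 * Walk the page's sorted heap TIDs block group by block group.  Blocks with
 * no dead tuples at all -- the common case -- are skipped with a single
 * lookup, and per-offset checks only run for blocks that have a container.
 */
static void
mark_dead_tids(void *rtbm, ItemPointer sorted_htids, int nhtids, bool *dead)
{
	int			i = 0;

	while (i < nhtids)
	{
		BlockNumber blkno = ItemPointerGetBlockNumber(&sorted_htids[i]);
		void	   *container;
		bool		has_dead = rtbm_get_block(rtbm, blkno, &container);

		/* process all TIDs pointing at this heap block as one group */
		for (; i < nhtids &&
			 ItemPointerGetBlockNumber(&sorted_htids[i]) == blkno; i++)
		{
			if (!has_dead)
				dead[i] = false;	/* cheap skip, no per-offset work */
			else
				dead[i] = rtbm_container_contains(container,
								ItemPointerGetOffsetNumber(&sorted_htids[i]));
		}
	}
}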
--
Peter Geoghegan
On Wed, Jul 7, 2021 at 11:25 PM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
On Wed, 7 Jul 2021 at 13:47, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Hi all,
Index vacuuming is one of the most time-consuming processes in lazy
vacuuming. lazy_tid_reaped() is a large part among them. The attached
flame graph shows a profile of a vacuum on a table that has one index
and 80 million live rows and 20 million dead rows, where
lazy_tid_reaped() accounts for about 47% of the total vacuum execution
time.[...]
Overall, 'rtbm' has a much better lookup performance and good memory
usage especially if there are relatively many dead tuples. However, in
some cases, 'intset' and 'array' have a better memory usage.
Those are some great results, with a good path to meaningful improvements.
Feedback is very welcome. Thank you for reading the email through to the end.
The current available infrastructure for TIDs is quite ill-defined for
TableAM authors [0], and other TableAMs might want to use more than
just the 11 bits in use by max-BLCKSZ HeapAM MaxHeapTuplesPerPage to
identify tuples. (MaxHeapTuplesPerPage is 1169 at the maximum 32k
BLCKSZ, which requires 11 bits to fit).
Could you also check what the (performance, memory) impact would be if
these proposed structures were to support the maximum
MaxHeapTuplesPerPage of 1169 or the full uint16-range of offset
numbers that could be supported by our current TID struct?
I think tbm will be the most affected by the memory impact of the
larger maximum MaxHeapTuplesPerPage. For example, with 32kB blocks
(MaxHeapTuplesPerPage = 1169), even if there is only one dead tuple in
a block, it will always require at least 147 bytes per block.
Rtbm chooses the container type among array, bitmap, or run depending
on the number and distribution of dead tuples in a block, and only
bitmap containers can be searched with O(1). Run containers depend on
the distribution of dead tuples within a block. So let’s compare array
and bitmap containers.
With 8kB blocks (MaxHeapTuplesPerPage = 291), 36 bytes are needed for
a bitmap container at maximum. In other words, when compared to an
array container, bitmap will be chosen if there are more than 18 dead
tuples in a block. On the other hand, with 32kB blocks
(MaxHeapTuplesPerPage = 1169), 147 bytes are needed for a bitmap
container at maximum, so bitmap container will be chosen if there are
more than 74 dead tuples in a block. And, with full uint16-range
(MaxHeapTuplesPerPage = 65535), 8192 bytes are needed at maximum, so
bitmap container will be chosen if there are more than 4096 dead
tuples in a block. Therefore, in any case, if more than about 6% of
the tuples in a block are garbage, a bitmap container will be chosen,
bringing faster lookup performance. (Of course, if a run container is
chosen, the container size gets smaller but the lookup performance is
O(log N).) But if the number of dead tuples in the table is small and
we have a larger MaxHeapTuplesPerPage, it's likely that an array
container is chosen and the lookup performance becomes O(log N).
Still, it should be faster than the array data structure because the
range of search targets within an array container is much smaller.
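In other words, the choice between an array and a bitmap container
boils down to a size comparison; a small sketch of that arithmetic
(the thresholds above come from the bitmap container sizes quoted in
this mail, and the exact rounding in the PoC may differ slightly):

/*
 * An array container costs 2 bytes per dead tuple; a bitmap container costs
 * a fixed number of bytes per block, roughly MaxHeapTuplesPerPage / 8.
 * Using the figures above: 36 bytes with 8kB blocks (bitmap wins above ~18
 * dead tuples), 147 bytes with 32kB blocks (above ~74), and 8192 bytes with
 * the full uint16 offset range (above ~4096).
 */
static inline bool
use_bitmap_container(int ndead, int bitmap_container_bytes)
{
	int			array_container_bytes = ndead * 2;	/* sizeof(OffsetNumber) */

	return array_container_bytes > bitmap_container_bytes;
}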
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
On Thu, Jul 8, 2021 at 5:24 AM Peter Geoghegan <pg@bowt.ie> wrote:
On Wed, Jul 7, 2021 at 4:47 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Currently, the TIDs of dead tuples are stored in an array that is
collectively allocated at the start of lazy vacuum and TID lookup uses
bsearch(). There are the following challenges and limitations:
1. Don't allocate more than 1GB. There was a discussion to eliminate
this limitation by using MemoryContextAllocHuge() but there were
concerns about point 2[1].
I think that the main problem with the 1GB limitation is that it is
surprising -- it can cause disruption when we first exceed the magical
limit of ~174 million TIDs. This can cause us to dirty index pages a
second time when we might have been able to just do it once with
sufficient memory for TIDs. OTOH there are actually cases where having
less memory for TIDs makes performance *better* because of locality
effects. This perverse behavior with memory sizing isn't a rare case
that we can safely ignore -- unfortunately it's fairly common.
My point is that we should be careful to choose the correct goal.
Obviously memory use matters. But it might be more helpful to think of
memory use as just a proxy for what truly matters, not a goal in
itself. It's hard to know what this means (what is the "real goal"?),
and hard to measure it even if you know for sure. It could still be
useful to think of it like this.
As I wrote in the first email, I think there are two important factors
in index vacuuming performance: the performance of checking whether
the heap TID that an index tuple points to is dead, and the number of
times index bulk-deletion is performed. The flame graph I attached in
the first mail shows that the CPU spent much time in
lazy_tid_reaped(), but vacuum is a disk-intensive operation in
practice. Given that most index AMs' bulk-deletion does a full index
scan and a table can have multiple indexes, reducing the number of
index bulk-deletion passes really contributes to reducing the
execution time, especially for large tables. I think that a more
compact data structure for dead tuple TIDs is one of the ways to
achieve that.
A run container is selected in this test case, using 4 bytes for each block.
Execution Time Memory Usage
array 8,883.03 600,008,248
intset 7,358.23 100,671,488
tbm 758.81 100,671,544
rtbm 764.33 29,384,816
Overall, 'rtbm' has a much better lookup performance and good memory
usage especially if there are relatively many dead tuples. However, in
some cases, 'intset' and 'array' have a better memory usage.
This seems very promising.
I wonder how much you have thought about the index AM side. It makes
sense to initially evaluate these techniques using this approach of
separating the data structure from how it is used by VACUUM -- I think
that that was a good idea. But at the same time there may be certain
important theoretical questions that cannot be answered this way --
questions about how everything "fits together" in a real VACUUM might
matter a lot. You've probably thought about this at least a little
already. Curious to hear how you think it "fits together" with the
work that you've done already.
Yeah, that definitely needs to be considered. Currently, what we need
from the dead tuple storage for lazy vacuum is store, lookup, and
iteration. And given parallel vacuum, it has to be allocatable in DSM
or DSA. While implementing the PoC code, I'm trying to integrate it
with the current lazy vacuum code. As far as I've seen so far, the
integration is not hard, at least with the *current* lazy vacuum code
and index AM code.
The loop inside btvacuumpage() makes each loop iteration call the
callback -- this is always a call to lazy_tid_reaped() in practice.
And that's where we do binary searches. These binary searches are
usually where we see a huge number of cycles spent when we look at
profiles, including the profile that produced your flame graph. But I
worry that that might be a bit misleading -- the way that profilers
attribute costs is very complicated and can never be fully trusted.
While it is true that lazy_tid_reaped() often accesses main memory,
which will of course add a huge amount of latency and make it a huge
bottleneck, the "big picture" is still relevant.I think that the compiler currently has to make very conservative
assumptions when generating the machine code used by the loop inside
btvacuumpage(), which calls through an opaque function pointer at
least once per loop iteration -- anything can alias, so the compiler
must be conservative. The data dependencies are hard for both the
compiler and the CPU to analyze. The cost of using a function pointer
compared to a direct function call is usually quite low, but there are
important exceptions -- cases where it prevents other useful
optimizations. Maybe this is an exception.
I wonder how much it would help to break up that loop into two loops.
Make the callback into a batch operation that generates state that
describes what to do with each and every index tuple on the leaf page.
The first loop would build a list of TIDs, then you'd call into
vacuumlazy.c and get it to process the TIDs, and finally the second
loop would physically delete the TIDs that need to be deleted. This
would mean that there would be only one call per leaf page per
btbulkdelete(). This would reduce the number of calls to the callback
by at least 100x, and maybe more than 1000x.
This approach would make btbulkdelete() similar to
_bt_simpledel_pass() + _bt_delitems_delete_check(). This is not really
an independent idea to your ideas -- I imagine that this would work
far better when combined with a more compact data structure, which is
naturally more capable of batch processing than a simple array of
TIDs. Maybe this will help the compiler and the CPU to fully
understand the *natural* data dependencies, so that they can be as
effective as possible in making the code run fast. It's possible that
a modern CPU will be able to *hide* the latency more intelligently
than what we have today. The latency is such a big problem that we may
be able to justify "wasting" other CPU resources, just because it
sometimes helps with hiding the latency. For example, it might
actually be okay to sort all of the TIDs on the page to make the bulk
processing work -- though you might still do a precheck that is
similar to the precheck inside lazy_tid_reaped() that was added by you
in commit bbaf315309e.
Interesting idea. I remember you mentioned this idea somewhere, and
I considered it too while implementing the PoC code. It's definitely
worth trying. Maybe we can work on this as a separate patch? It will
change the index AM interface and could also improve the current
bulk-deletion. We can consider a better data structure on top of this
idea.
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
Very nice results.
I have been working on the same problem but with a slightly different
solution - a mix of binary search for (sub)pages and 32-bit bitmaps
for tid-in-page.
Even with the current allocation heuristics (allocate 291 tids per
page) it initially allocates much less space: instead of the current
291*6=1746 bytes per page it needs to allocate only 80 bytes.
Also it can be laid out so that it is friendly to parallel SIMD
searches doing up to 8 tid lookups in parallel.
That said, for allocating the tid array, the best solution is to
postpone it as much as possible and to do the initial collection into
a file, which
1) postpones the memory allocation to the beginning of index cleanups
2) lets you select the correct size and structure as you know more
about the distribution at that time
3) do the first heap pass in one go and then advance frozenxmin
*before* index cleanup
Also, collecting dead tids into a file makes it trivial (well, almost
:) ) to parallelize the initial heap scan, so more resources can be
thrown at it if available.
Cheers
-----
Hannu Krosing
Google Cloud - We have a long list of planned contributions and we are hiring.
Contact me if interested.
Resending as forgot to send to the list (thanks Peter :) )
On Wed, Jul 7, 2021 at 10:24 PM Peter Geoghegan <pg@bowt.ie> wrote:
The loop inside btvacuumpage() makes each loop iteration call the
callback -- this is always a call to lazy_tid_reaped() in practice.
And that's where we do binary searches. These binary searches are
usually where we see a huge number of cycles spent when we look at
profiles, including the profile that produced your flame graph. But I
worry that that might be a bit misleading -- the way that profilers
attribute costs is very complicated and can never be fully trusted.
While it is true that lazy_tid_reaped() often accesses main memory,
which will of course add a huge amount of latency and make it a huge
bottleneck, the "big picture" is still relevant.
This is why I have mainly focused on making it possible to use SIMD and
run 4-8 binary searches in parallel, mostly 8, for AVX2.
How I am approaching this is separating "page search" to run over a
(naturally) sorted array of 32-bit page pointers, and only when the
page is found are the indexes in this array used to look up the
in-page bitmaps.
This allows the heavier bsearch activity to run on a smaller range of
memory, hopefully reducing the cache thrashing.
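For illustration, a minimal sketch of that two-level lookup (binary
search over a dense sorted array of block numbers, then a bit test in
the per-page bitmap); the layout below is only an illustration, not
the actual 80-byte page_bsearch_vector structure:

#include "postgres.h"
#include "storage/itemptr.h"

/*
 * Illustration only: the binary search touches only the compact array of
 * block numbers (better cache behaviour, and amenable to running several
 * searches in parallel with SIMD); the per-page bitmap is read only after
 * the page has been found.
 */
typedef struct PageDeadBitmap
{
	int			npages;			/* number of heap pages with dead tuples */
	BlockNumber *blknos;		/* sorted block numbers, 4 bytes each */
	uint32	   (*bitmaps)[10];	/* per-page bitmaps: 10 x 32 bits >= 291 offsets */
} PageDeadBitmap;

static bool
page_bitmap_lookup(const PageDeadBitmap *pdb, ItemPointer tid)
{
	BlockNumber blkno = ItemPointerGetBlockNumber(tid);
	OffsetNumber off = ItemPointerGetOffsetNumber(tid);
	int			lo = 0;
	int			hi = pdb->npages - 1;

	while (lo <= hi)
	{
		int			mid = lo + (hi - lo) / 2;

		if (pdb->blknos[mid] == blkno)
			return (pdb->bitmaps[mid][(off - 1) / 32] &
					((uint32) 1 << ((off - 1) % 32))) != 0;
		if (pdb->blknos[mid] < blkno)
			lo = mid + 1;
		else
			hi = mid - 1;
	}
	return false;				/* block has no dead tuples at all */
}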
There are opportunities to optimise this further for cache hits, by
collecting the tids from indexes in larger batches and then
constraining the searches in the main is-deleted bitmap to run over
sections of it, but at some point this becomes a very complex
balancing act, as the manipulation of the bits-to-check from indexes
also takes time, not to mention the need to release the index pages
and then later chase the tid pointers in case they have moved while
checking them.
I have not measured anything yet, but one of my concerns is that, in
the case of very large dead tuple collections, an 8-way parallel
bsearch could actually get close to saturating RAM bandwidth by
reading (8 x 32 bits x cache-line-size) bytes from main memory every
few cycles, so we may need some inner-loop level throttling similar to
the current vacuum_cost_limit for data pages.
I think that the compiler currently has to make very conservative
assumptions when generating the machine code used by the loop inside
btvacuumpage(), which calls through an opaque function pointer at
least once per loop iteration -- anything can alias, so the compiler
must be conservative.
Definitely this! The lookup function needs to be turned into an inline
function or #define as well to give the compiler maximum freedom.
The data dependencies are hard for both the
compiler and the CPU to analyze. The cost of using a function pointer
compared to a direct function call is usually quite low, but there are
important exceptions -- cases where it prevents other useful
optimizations. Maybe this is an exception.
Yes. Also this could be a place where unrolling the loop could make a
real difference.
Maybe not unrolling the full 32 loops for a 32-bit bsearch, but
something like an 8-loop unroll for getting most of the benefit.
The 32x unroll would not really be that bad for performance if all 32
loops were needed, but mostly we would need to jump into the last 10
to 20 iterations when looking up between 1,000 and 1,000,000 pages,
and I suspect this is such a weird corner case that the compiler is
really unlikely to support this optimisation. Of course I may be wrong
and it is a common enough case for the optimiser.
I wonder how much it would help to break up that loop into two loops.
Make the callback into a batch operation that generates state that
describes what to do with each and every index tuple on the leaf page.
The first loop would build a list of TIDs, then you'd call into
vacuumlazy.c and get it to process the TIDs, and finally the second
loop would physically delete the TIDs that need to be deleted. This
would mean that there would be only one call per leaf page per
btbulkdelete(). This would reduce the number of calls to the callback
by at least 100x, and maybe more than 1000x.
While it may make sense to have different bitmap encodings for
different distributions, it likely would not be good for optimisations
if all these are used at the same time.
This is why I propose that the first bitmap-collecting phase collect
into a file and then - when reading it into memory for the lookup
phase - possibly rewrite the initial structure to something else if it
sees that it is more efficient. Like for example where the first half
of the file consists of only empty pages.
This approach would make btbulkdelete() similar to
_bt_simpledel_pass() + _bt_delitems_delete_check(). This is not really
an independent idea to your ideas -- I imagine that this would work
far better when combined with a more compact data structure, which is
naturally more capable of batch processing than a simple array of
TIDs. Maybe this will help the compiler and the CPU to fully
understand the *natural* data dependencies, so that they can be as
effective as possible in making the code run fast. It's possible that
a modern CPU will be able to *hide* the latency more intelligently
than what we have today. The latency is such a big problem that we may
be able to justify "wasting" other CPU resources, just because it
sometimes helps with hiding the latency. For example, it might
actually be okay to sort all of the TIDs on the page to make the bulk
processing work
Then again it may be so much extra work that it starts to dominate
some parts of profiles.
For example see the work that was done in improving the mini-vacuum
part where it was actually faster to copy data out to a separate
buffer and then back in than shuffle it around inside the same 8k page
:)
So only testing will tell.
-- though you might still do a precheck that is
similar to the precheck inside lazy_tid_reaped() that was added by you
in commit bbaf315309e.
Of course it's very easy to be wrong about stuff like this. But it
might not be that hard to prototype. You can literally copy and paste
code from _bt_delitems_delete_check() to do this. It does the same
basic thing already.
Also a lot of testing would be needed to figure out which strategy
fits best for which distribution of dead tuples, and possibly their
relation to the order of tuples to check from indexes.
Cheers
--
Hannu Krosing
Google Cloud - We have a long list of planned contributions and we are hiring.
Contact me if interested.
On Thu, Jul 8, 2021 at 1:47 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
As I wrote in the first email, I think there are two important factors
in index vacuuming performance: the performance to check if heap TID
that an index tuple points to is dead, and the number of times to
perform index bulk-deletion. The flame graph I attached in the first
mail shows CPU spent much time on lazy_tid_reaped() but vacuum is a
disk-intensive operation in practice.
Maybe. But I recently bought an NVME SSD that can read at over
6GB/second. So "disk-intensive" is not what it used to be -- at least
not for reads. In general it's not good if we do multiple scans of an
index -- no question. But there is a danger in paying a little too
much attention to what is true in general -- we should not ignore what
might be true in specific cases either. Maybe we can solve some
problems by spilling the TID data structure to disk -- if we trade
sequential I/O for random I/O, we may be able to do only one pass over
the index (especially when we have *almost* enough memory to fit all
TIDs, but not quite enough).
The big problem with multiple passes over the index is not the extra
read bandwidth -- it's the extra page dirtying (writes), especially
with things like indexes on UUID columns. We want to dirty each leaf
page in each index at most once per VACUUM, and should be willing to
pay some cost in order to get a larger benefit with page dirtying.
After all, writes are much more expensive on modern flash devices --
if we have to do more random read I/O to spill the TIDs then that
might actually be 100% worth it. And, we don't need much memory for
something that works well as a negative cache, either -- so maybe the
extra random read I/O needed to spill the TIDs will be very limited
anyway.
There are many possibilities. You can probably think of other
trade-offs yourself. We could maybe use a cost model for all this --
it is a little like a hash join IMV. This is just something to think
about while refining the design.
Interesting idea. I remember you mentioned this idea somewhere and
I've considered this idea too while implementing the PoC code. It's
definitely worth trying. Maybe we can write a patch for this as a
separate patch? It will change index AM and could improve also the
current bulk-deletion. We can consider a better data structure on top
of this idea.
I'm happy to write it as a separate patch, either by leaving it to you
or by collaborating directly. It's not necessary to tie it to the
first patch. But at the same time it is highly related to what you're
already doing.
As I said I am totally prepared to be wrong here. But it seems worth
it to try. In Postgres 14, the _bt_delitems_vacuum() function (which
actually carries out VACUUM's physical page modifications to a leaf
page) is almost identical to _bt_delitems_delete(). And
_bt_delitems_delete() was already built with these kinds of problems
in mind -- it batches work to get the most out of synchronizing with
distant state describing which tuples to delete. It's not exactly the
same situation, but it's *kinda* similar. More importantly, it's a
relatively cheap and easy experiment to run, since we already have
most of what we need (we can take it from
_bt_delitems_delete_check()).
Usually this kind of micro optimization is not very valuable -- 99.9%+
of all code just isn't that sensitive to having the right
optimizations. But this is one of the rare important cases where we
really should look at the raw machine code, and do some kind of
microarchitectural level analysis through careful profiling, using
tools like perf. The laws of physics (or electronic engineering) make
it inevitable that searching for TIDs to match is going to be kind of
slow. But we should at least make sure that we use every trick
available to us to reduce the bottleneck, since it really does matter
a lot to users. Users should be able to expect that this code will at
least be as fast as the hardware that they paid for can allow (or
close to it). There is a great deal of microarchitectural
sophistication with modern CPUs, much of which is designed to make
problems like this one less bad [1].
[1]: https://www.agner.org/optimize/microarchitecture.pdf
--
Peter Geoghegan
On Thu, Jul 8, 2021 at 1:53 PM Hannu Krosing <hannuk@google.com> wrote:
How I am approaching this is separating "page search" to run over a
(naturally) sorted array of 32 bit page pointers and only when the
page is found the indexes in this array are used to look up the
in-page bitmaps.
This allows the heavier bsearch activity to run on smaller range of
memory, hopefully reducing the cache thrashing.
I think that the really important thing is to figure out roughly the
right data structure first.
There are opportunities to optimise this further for cache hits, by
collecting the tids from indexes in larger batches and then
constraining the searches in the main is-deleted-bitmap to run over
sections of it, but at some point this becomes a very complex
balancing act, as the manipulation of the bits-to-check from indexes
also takes time, not to mention the need to release the index pages
and then later chase the tid pointers in case they have moved while
checking them.
I would say that 200 TIDs per leaf page is common and ~1350 TIDs per
leaf page is not uncommon (with deduplication). Seems like that might
be enough?
I have not measured anything yet, but one of my concerns in case of
very large dead tuple collections searched by 8-way parallel bsearch
could actually get close to saturating RAM bandwidth by reading (8 x
32bits x cache-line-size) bytes from main memory every few cycles, so
we may need some inner-loop level throttling similar to current
vacuum_cost_limit for data pages.
If it happens then it'll be a nice problem to have, I suppose.
Maybe not unrolling the full 32 loops for 32-bit bsearch, but
something like 8-loop unroll for getting most of the benefit.
My current assumption is that we're bound by memory speed right now,
and that that is the big bottleneck to eliminate -- we must keep the
CPU busy with data to process first. That seems like the most
promising thing to focus on right now.
While it may make sense to have different bitmap encodings for
different distributions, it likely would not be good for optimisations
if all these are used at the same time.
To some degree designs like Roaring bitmaps are just that -- a way of
dynamically figuring out which strategy to use based on data
characteristics.
This is why I propose the first bitmap collecting phase to collect
into a file and then - when reading into memory for lookups phase -
possibly rewrite the initial structure to something else if it sees
that it is more efficient. Like for example where the first half of
the file consists of only empty pages.
Yeah, I agree that something like that could make sense. Although
rewriting it doesn't seem particularly promising, since we can easily
make it cheap to process any TID that falls into a range of blocks
that have no dead tuples. We don't need to rewrite the data structure
to make it do that well, AFAICT.
When I said that I thought of this a little like a hash join, I was
being more serious than you might imagine. Note that the number of
index tuples that VACUUM will delete from each index can now be far
less than the total number of TIDs stored in memory. So even when we
have (say) 20% of all of the TIDs from the table in our in memory list
managed by vacuumlazy.c, it's now quite possible that VACUUM will only
actually "match"/"join" (i.e. delete) as few as 2% of the index tuples
it finds in the index (there really is no way to predict how many).
The opportunistic deletion stuff could easily be doing most of the
required cleanup in an eager fashion following recent improvements --
VACUUM need only take care of "floating garbage" these days. In other
words, thinking about this as something that is a little bit like a
hash join makes sense because hash joins do very well with high join
selectivity, and high join selectivity is common in the real world.
The intersection of TIDs from each leaf page with the in-memory TID
delete structure will often be very small indeed.
Then again it may be so much extra work that it starts to dominate
some parts of profiles.
For example see the work that was done in improving the mini-vacuum
part where it was actually faster to copy data out to a separate
buffer and then back in than shuffle it around inside the same 8k page
Some of what I'm saying is based on the experience of improving
similar code used by index tuple deletion in Postgres 14. That did
quite a lot of sorting of TIDs and things like that. In the end the
sorting had no more than a negligible impact on performance. What
really mattered was that we efficiently coordinate with distant heap
pages that describe which index tuples we can delete from a given leaf
page. Sorting hundreds of TIDs is cheap. Reading hundreds of random
locations in memory (or even far fewer) is not so cheap. It might even
be very slow indeed. Sorting in order to batch could end up looking
like cheap insurance that we should be glad to pay for.
So only testing will tell.
True.
--
Peter Geoghegan
On Fri, Jul 9, 2021 at 12:34 AM Peter Geoghegan <pg@bowt.ie> wrote:
...
I would say that 200 TIDs per leaf page is common and ~1350 TIDs per
leaf page is not uncommon (with deduplication). Seems like that might
be enough?
Likely yes, and also it would have the nice property of not changing
the index page locking behaviour.
Are deduplicated tids in the leaf page already sorted in heap order?
This could potentially simplify / speed up the sort.
I have not measured anything yet, but one of my concerns in case of
very large dead tuple collections searched by 8-way parallel bsearch
could actually get close to saturating RAM bandwidth by reading (8 x
32bits x cache-line-size) bytes from main memory every few cycles, so
we may need some inner-loop level throttling similar to current
vacuum_cost_limit for data pages.
If it happens then it'll be a nice problem to have, I suppose.
Maybe not unrolling the full 32 loops for 32-bit bsearch, but
something like 8-loop unroll for getting most of the benefit.
My current assumption is that we're bound by memory speed right now,
Most likely yes, and this should also be easy to check by manually
unrolling perhaps 4 loops and measuring any speed increase.
and that that is the big bottleneck to eliminate -- we must keep the
CPU busy with data to process first. That seems like the most
promising thing to focus on right now.
This actually has two parts:
- trying to make sure that we can serve as much as possible from cache
- if we need to go out of cache, then trying to parallelise this as
much as possible
At the same time we need to watch that we are not making the index
tuple preparation work so heavy that it starts to dominate over memory
access.
While it may make sense to have different bitmap encodings for
different distributions, it likely would not be good for optimisations
if all these are used at the same time.
To some degree designs like Roaring bitmaps are just that -- a way of
dynamically figuring out which strategy to use based on data
characteristics.
it is, but as I am keeping one eye open for vectorisation, I don't
like when different parts of the same bitmap have radically different
encoding strategies.
This is why I propose the first bitmap collecting phase to collect
into a file and then - when reading into memory for lookups phase -
possibly rewrite the initial structure to something else if it sees
that it is more efficient. Like for example where the first half of
the file consists of only empty pages.
Yeah, I agree that something like that could make sense. Although
rewriting it doesn't seem particularly promising,
yeah, I hope to prove (or verify :) ) the structure is good enough so
that it does not need the rewrite.
since we can easily
make it cheap to process any TID that falls into a range of blocks
that have no dead tuples.
I actually meant the opposite case, where we could replace a full
80-byte, 291-bit "all dead" bitmap with just a range - an int4 for the
page and two int2s for min and max tid-in-page - for an extra 10x
reduction, on top of the original 21x reduction from the current
6-bytes-per-tid encoding to my page_bsearch_vector bitmaps, which
encode one page in at most 80 bytes (5 x int4 sub-page pointers + 5 x
int4 bitmaps).
I also started out by investigating RoaringBitmaps, but when I
realized that we would likely have to rewrite it anyway, I continued
working on getting to a single uniform encoding which fits most use
cases Good Enough, and then using that uniformity to let the compiler
do its optimisation and hopefully also vectorization magic.
We don't need to rewrite the data structure
to make it do that well, AFAICT.
When I said that I thought of this a little like a hash join, I was
being more serious than you might imagine. Note that the number of
index tuples that VACUUM will delete from each index can now be far
less than the total number of TIDs stored in memory. So even when we
have (say) 20% of all of the TIDs from the table in our in memory list
managed by vacuumlazy.c, it's now quite possible that VACUUM will only
actually "match"/"join" (i.e. delete) as few as 2% of the index tuples
it finds in the index (there really is no way to predict how many).
The opportunistic deletion stuff could easily be doing most of the
required cleanup in an eager fashion following recent improvements --
VACUUM need only take care of "floating garbage" these days.
Ok, this points to the need to mainly optimise for a quite sparse
population of dead tuples, which is still mainly clustered page-wise?
In other
words, thinking about this as something that is a little bit like a
hash join makes sense because hash joins do very well with high join
selectivity, and high join selectivity is common in the real world.
The intersection of TIDs from each leaf page with the in-memory TID
delete structure will often be very small indeed.
The hard-to-optimize case is still when we have dead tuple counts in
the hundreds of millions, or even billions, like on an HTAP database
where a few hours of OLAP queries have accumulated loads of dead tuples
in tables getting heavy OLTP traffic.
There of course we could do a totally different optimisation, where we
also allow reaping tuples newer than the OLAP query's snapshot if we
can prove that when the snapshot moves forward next time, it has to
jump over said transactions, making them indeed DEAD and not RECENTLY
DEAD. Currently we let a single OLAP query ruin everything :)
Then again it may be so much extra work that it starts to dominate
some parts of profiles.
For example see the work that was done in improving the mini-vacuum
part where it was actually faster to copy data out to a separate
buffer and then back in than to shuffle it around inside the same 8k page.
Some of what I'm saying is based on the experience of improving
similar code used by index tuple deletion in Postgres 14. That did
quite a lot of sorting of TIDs and things like that. In the end the
sorting had no more than a negligible impact on performance.
Good to know :)
What
really mattered was that we efficiently coordinate with distant heap
pages that describe which index tuples we can delete from a given leaf
page. Sorting hundreds of TIDs is cheap. Reading hundreds of random
locations in memory (or even far fewer) is not so cheap. It might even
be very slow indeed. Sorting in order to batch could end up looking
like cheap insurance that we should be glad to pay for.
If the most expensive operation is sorting a few hundred tids, then
this should be fast enough.
My worries were more that after the sorting we cannot do simple
index lookups for them, but each needs to be found via bsearch (or maybe
even a plain linear search if that is faster under some size limit), and
that these could add up. Or some other needed thing that also has to be
done, like allocating extra memory or moving other data around in a
way that the CPU does not like.
Cheers
-----
Hannu Krosing
Google Cloud - We have a long list of planned contributions and we are hiring.
Contact me if interested.
Hi,
On 2021-07-07 20:46:38 +0900, Masahiko Sawada wrote:
1. Don't allocate more than 1GB. There was a discussion to eliminate
this limitation by using MemoryContextAllocHuge() but there were
concerns about point 2[1].
2. Allocate the whole memory space at once.
3. Slow lookup performance (O(logN)).
I’ve done some experiments in this area and would like to share the
results and discuss ideas.
Yea, this is a serious issue.
3) could possibly be addressed to a decent degree without changing the
fundamental datastructure too much. There's some sizable and trivial
wins by just changing vac_cmp_itemptr() to compare int64s and by using
an open coded bsearch().
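For illustration, a minimal sketch of that kind of encoding (mirroring
what the backend's itemptr_encode() does; the comparator name below is
made up), which turns the TID comparison into a single int64 compare:

/* Hypothetical sketch: pack a TID into an int64 so the bsearch
 * comparator becomes one integer comparison instead of comparing
 * block and offset separately. */
static inline int64
tid_encode(ItemPointer tid)
{
    return ((int64) ItemPointerGetBlockNumber(tid) << 16) |
        ItemPointerGetOffsetNumber(tid);
}

static int
vac_cmp_itemptr64(const void *left, const void *right)
{
    int64   l = tid_encode((ItemPointer) left);
    int64   r = tid_encode((ItemPointer) right);

    return (l > r) - (l < r);
}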
The big problem with bsearch isn't imo the O(log(n)) complexity - it's
that it has abominably bad cache locality. And that can be addressed:
https://arxiv.org/ftp/arxiv/papers/1509/1509.05053.pdf
Imo 2) isn't really that hard a problem to improve, even if we were to
stay with the current bsearch approach. Reallocation with an aggressive
growth factor or such isn't that bad.
That's not to say we ought to stay with binary search...
Problems Solutions
===============
Firstly, I've considered using existing data structures:
IntegerSet(src/backend/lib/integerset.c) and
TIDBitmap(src/backend/nodes/tidbitmap.c). Those address point 1 but
only either point 2 or 3. IntegerSet uses lower memory thanks to
simple-8b encoding but is slow at lookup, still O(logN), since it’s a
tree structure. On the other hand, TIDBitmap has a good lookup
performance, O(1), but could unnecessarily use larger memory in some
cases since it always allocates the space for bitmap enough to store
all possible offsets. With 8kB blocks, the maximum number of line
pointers in a heap page is 291 (c.f., MaxHeapTuplesPerPage) so the
bitmap is 40 bytes long and we always need 46 bytes in total per block
including other meta information.
Imo tidbitmap isn't particularly good, even in the current use cases -
it's constraining in what we can store (a problem for other AMs), not
actually that dense, the lossy mode doesn't choose what information to
lose well, etc.
It'd be nice if we came up with a datastructure that could also replace
the bitmap scan cases.
The data structure is somewhat similar to TIDBitmap. It consists of
the hash table and the container area; the hash table has entries per
block and each block entry allocates its memory space, called a
container, in the container area to store its offset numbers. The
container area is actually an array of bytes and can be enlarged as
needed. In the container area, the data representation of offset
numbers varies depending on their cardinality. It has three container
types: array, bitmap, and run.
Not a huge fan of encoding this much knowledge about the tid layout...
For example, if there are two dead tuples at offset 1 and 150, it uses
the array container that has an array of two 2-byte integers
representing 1 and 150, using 4 bytes in total. If we used the bitmap
container in this case, we would need 20 bytes instead. On the other
hand, if there are consecutive 20 dead tuples from offset 1 to 20, it
uses the run container that has an array of 2-byte integers. The first
value in each pair represents a starting offset number, whereas the
second value represents its length. Therefore, in this case, the run
container uses only 4 bytes in total. Finally, if there are dead
tuples at every other offset from 1 to 100, it uses the bitmap
container that has an uncompressed bitmap, using 13 bytes. We need
another 16 bytes per block entry for the hash table entry.
The lookup complexity of a bitmap container is O(1) whereas the one of
an array and a run container is O(N) or O(logN), but since the number of
elements in those two containers should not be large, it would not be a
problem.
Hm. Why is O(N) not an issue? Consider e.g. the case of a table in which
many tuples have been deleted. In cases where the "run" storage is
cheaper (e.g. because there's high offset numbers due to HOT pruning),
we could end up regularly scanning a few hundred entries for a
match. That's not cheap anymore.
Evaluation
========
Before implementing this idea and integrating it with lazy vacuum
code, I've implemented a benchmark tool dedicated to evaluating
lazy_tid_reaped() performance[4].
Good idea!
In all test cases, I simulated that the table has 1,000,000 blocks and
every block has at least one dead tuple.
That doesn't strike me as a particularly common scenario? I think it's
quite rare for dead tuples to be spread so evenly but sparsely. In
particular it's very common for there to be long runs of dead tuples
separated by long ranges of no dead tuples at all...
The benchmark scenario is that for
each virtual heap tuple we check if there is its TID in the dead
tuple storage. Here are the results of execution time in milliseconds
and memory usage in bytes:
In which order are the dead tuples checked? Looks like in sequential
order? In the case of an index over a column that's not correlated with
the heap order the lookups are often much more random - which can
influence lookup performance drastically, due to differences in
cache locality. Which will make some structures look worse/better than
others.
Greetings,
Andres Freund
Hi,
On 2021-07-08 20:53:32 -0700, Andres Freund wrote:
On 2021-07-07 20:46:38 +0900, Masahiko Sawada wrote:
1. Don't allocate more than 1GB. There was a discussion to eliminate
this limitation by using MemoryContextAllocHuge() but there were
concerns about point 2[1].
2. Allocate the whole memory space at once.
3. Slow lookup performance (O(logN)).
I’ve done some experiments in this area and would like to share the
results and discuss ideas.
Yea, this is a serious issue.
3) could possibly be addressed to a decent degree without changing the
fundamental datastructure too much. There's some sizable and trivial
wins by just changing vac_cmp_itemptr() to compare int64s and by using
an open coded bsearch().
Just using itemptr_encode() makes array in test #1 go from 8s to 6.5s on my
machine.
Another thing I just noticed is that you didn't include the build times for the
datastructures. They are lower than the lookups currently, but it does seem
like a relevant thing to measure as well. E.g. for #1 I see the following build
times
array 24.943 ms
tbm 206.456 ms
intset 93.575 ms
vtbm 134.315 ms
rtbm 145.964 ms
that's a significant range...
Randomizing the lookup order (using a random shuffle in
generate_index_tuples()) changes the benchmark results for #1 significantly:
shuffled time unshuffled time
array 6551.726 ms 6478.554 ms
intset 67590.879 ms 10815.810 ms
rtbm 17992.487 ms 2518.492 ms
tbm 364.917 ms 360.128 ms
vtbm 12227.884 ms 1288.123 ms
FWIW, I get an assertion failure when using an assertion build:
#2 0x0000561800ea02e0 in ExceptionalCondition (conditionName=0x7f9115a88e91 "found", errorType=0x7f9115a88d11 "FailedAssertion",
fileName=0x7f9115a88e8a "rtbm.c", lineNumber=242) at /home/andres/src/postgresql/src/backend/utils/error/assert.c:69
#3 0x00007f9115a87645 in rtbm_add_tuples (rtbm=0x561806293280, blkno=0, offnums=0x7fffdccabb00, nitems=10) at rtbm.c:242
#4 0x00007f9115a8363d in load_rtbm (rtbm=0x561806293280, itemptrs=0x7f908a203050, nitems=10000000) at bdbench.c:618
#5 0x00007f9115a834b9 in rtbm_attach (lvtt=0x7f9115a8c300 <LVTestSubjects+352>, nitems=10000000, minblk=2139062143, maxblk=2139062143, maxoff=32639)
at bdbench.c:587
#6 0x00007f9115a83837 in attach (lvtt=0x7f9115a8c300 <LVTestSubjects+352>, nitems=10000000, minblk=2139062143, maxblk=2139062143, maxoff=32639)
at bdbench.c:658
#7 0x00007f9115a84190 in attach_dead_tuples (fcinfo=0x56180322d690) at bdbench.c:873
I assume you just inverted the Assert(found) assertion?
Greetings,
Andres Freund
On Fri, Jul 9, 2021 at 12:53 PM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2021-07-07 20:46:38 +0900, Masahiko Sawada wrote:
1. Don't allocate more than 1GB. There was a discussion to eliminate
this limitation by using MemoryContextAllocHuge() but there were
concerns about point 2[1].
2. Allocate the whole memory space at once.
3. Slow lookup performance (O(logN)).
I’ve done some experiments in this area and would like to share the
results and discuss ideas.
Yea, this is a serious issue.
3) could possibly be addressed to a decent degree without changing the
fundamental datastructure too much. There's some sizable and trivial
wins by just changing vac_cmp_itemptr() to compare int64s and by using
an open coded bsearch().
The big problem with bsearch isn't imo the O(log(n)) complexity - it's
that it has abominably bad cache locality. And that can be addressed:
https://arxiv.org/ftp/arxiv/papers/1509/1509.05053.pdf
Imo 2) isn't really that hard a problem to improve, even if we were to
stay with the current bsearch approach. Reallocation with an aggressive
growth factor or such isn't that bad.
That's not to say we ought to stay with binary search...
Problems Solutions
===============
Firstly, I've considered using existing data structures:
IntegerSet(src/backend/lib/integerset.c) and
TIDBitmap(src/backend/nodes/tidbitmap.c). Those address point 1 but
only either point 2 or 3. IntegerSet uses lower memory thanks to
simple-8b encoding but is slow at lookup, still O(logN), since it’s a
tree structure. On the other hand, TIDBitmap has a good lookup
performance, O(1), but could unnecessarily use larger memory in some
cases since it always allocates the space for bitmap enough to store
all possible offsets. With 8kB blocks, the maximum number of line
pointers in a heap page is 291 (c.f., MaxHeapTuplesPerPage) so the
bitmap is 40 bytes long and we always need 46 bytes in total per block
including other meta information.
Imo tidbitmap isn't particularly good, even in the current use cases -
it's constraining in what we can store (a problem for other AMs), not
actually that dense, the lossy mode doesn't choose what information to
lose well, etc.
It'd be nice if we came up with a datastructure that could also replace
the bitmap scan cases.
Agreed.
The data structure is somewhat similar to TIDBitmap. It consists of
the hash table and the container area; the hash table has entries per
block and each block entry allocates its memory space, called a
container, in the container area to store its offset numbers. The
container area is actually an array of bytes and can be enlarged as
needed. In the container area, the data representation of offset
numbers varies depending on their cardinality. It has three container
types: array, bitmap, and run.
Not a huge fan of encoding this much knowledge about the tid layout...
For example, if there are two dead tuples at offset 1 and 150, it uses
the array container that has an array of two 2-byte integers
representing 1 and 150, using 4 bytes in total. If we used the bitmap
container in this case, we would need 20 bytes instead. On the other
hand, if there are consecutive 20 dead tuples from offset 1 to 20, it
uses the run container that has an array of 2-byte integers. The first
value in each pair represents a starting offset number, whereas the
second value represents its length. Therefore, in this case, the run
container uses only 4 bytes in total. Finally, if there are dead
tuples at every other offset from 1 to 100, it uses the bitmap
container that has an uncompressed bitmap, using 13 bytes. We need
another 16 bytes per block entry for the hash table entry.
The lookup complexity of a bitmap container is O(1) whereas the one of
an array and a run container is O(N) or O(logN), but since the number of
elements in those two containers should not be large, it would not be a
problem.
Hm. Why is O(N) not an issue? Consider e.g. the case of a table in which
many tuples have been deleted. In cases where the "run" storage is
cheaper (e.g. because there's high offset numbers due to HOT pruning),
we could end up regularly scanning a few hundred entries for a
match. That's not cheap anymore.
With 8kB blocks, the maximum size of a bitmap container is 37 bytes.
IOW, the other two types of containers are always smaller than 37 bytes.
Since the run container uses 4 bytes per run, the number of runs in a
run container is never more than 9. Even with 32kB blocks, we don't
have more than 37 runs. So I think N is small enough in this case.
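As a rough illustration of that size argument (the selection function
and names below are mine, not from the patch): with
MaxHeapTuplesPerPage = 291 the bitmap container is capped at
(291 + 7) / 8 = 37 bytes, so a per-block loader can simply pick
whichever representation is smallest:

/* Hypothetical sketch of choosing the smallest container type. */
typedef enum { CONTAINER_ARRAY, CONTAINER_RUN, CONTAINER_BITMAP } ContainerType;

static ContainerType
choose_container(int noffsets, int nruns)
{
    int     bitmap_bytes = (MaxHeapTuplesPerPage + 7) / 8; /* 37 with 8kB blocks */
    int     array_bytes = noffsets * sizeof(uint16);       /* 2 bytes per offset */
    int     run_bytes = nruns * 2 * sizeof(uint16);        /* start + length per run */

    if (array_bytes <= run_bytes && array_bytes <= bitmap_bytes)
        return CONTAINER_ARRAY;
    if (run_bytes <= bitmap_bytes)
        return CONTAINER_RUN;
    return CONTAINER_BITMAP;
}

Since a run container is only chosen while run_bytes stays at or below
37, it can hold at most 9 runs with 8kB blocks, which is why the linear
scan stays cheap.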
Evaluation
========
Before implementing this idea and integrating it with lazy vacuum
code, I've implemented a benchmark tool dedicated to evaluating
lazy_tid_reaped() performance[4].
Good idea!
In all test cases, I simulated that the table has 1,000,000 blocks and
every block has at least one dead tuple.
That doesn't strike me as a particularly common scenario? I think it's
quite rare for dead tuples to be spread so evenly but sparsely. In
particular it's very common for there to be long runs of dead tuples
separated by long ranges of no dead tuples at all...
Agreed. I'll test with such scenarios.
The benchmark scenario is that for
each virtual heap tuple we check if there is its TID in the dead
tuple storage. Here are the results of execution time in milliseconds
and memory usage in bytes:
In which order are the dead tuples checked? Looks like in sequential
order? In the case of an index over a column that's not correlated with
the heap order the lookups are often much more random - which can
influence lookup performance drastically, due to differences in
cache locality. Which will make some structures look worse/better than
others.
Good point. It's sequential order, which is not good. I'll test again
after shuffling virtual index tuples.
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
On Fri, Jul 9, 2021 at 2:37 PM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2021-07-08 20:53:32 -0700, Andres Freund wrote:
On 2021-07-07 20:46:38 +0900, Masahiko Sawada wrote:
1. Don't allocate more than 1GB. There was a discussion to eliminate
this limitation by using MemoryContextAllocHuge() but there were
concerns about point 2[1].
2. Allocate the whole memory space at once.
3. Slow lookup performance (O(logN)).
I’ve done some experiments in this area and would like to share the
results and discuss ideas.
Yea, this is a serious issue.
3) could possibly be addressed to a decent degree without changing the
fundamental datastructure too much. There's some sizable and trivial
wins by just changing vac_cmp_itemptr() to compare int64s and by using
an open coded bsearch().
Just using itemptr_encode() makes array in test #1 go from 8s to 6.5s on my
machine.
Another thing I just noticed is that you didn't include the build times for the
datastructures. They are lower than the lookups currently, but it does seem
like a relevant thing to measure as well. E.g. for #1 I see the following build
times
array 24.943 ms
tbm 206.456 ms
intset 93.575 ms
vtbm 134.315 ms
rtbm 145.964 ms
that's a significant range...
Good point. I got similar results when measuring on my machine:
array 57.987 ms
tbm 297.720 ms
intset 113.796 ms
vtbm 165.268 ms
rtbm 199.658 ms
Randomizing the lookup order (using a random shuffle in
generate_index_tuples()) changes the benchmark results for #1 significantly:
shuffled time unshuffled time
array 6551.726 ms 6478.554 ms
intset 67590.879 ms 10815.810 ms
rtbm 17992.487 ms 2518.492 ms
tbm 364.917 ms 360.128 ms
vtbm 12227.884 ms 1288.123 ms
I believe that in your test, tbm_reaped() actually always returned
true. That could explain why tbm was very fast in both cases. Since
TIDBitmap in core doesn't support the existence check, tbm_reaped()
in bdbench.c always returns true. I added a patch in the repository to
add existence check support to TIDBitmap, although it assumes the bitmap
is never lossy.
That being said, I'm surprised that rtbm is slower than array even in
the unshuffled case. I've also measured the shuffle cases and got
different results. To be clear, I used prepare() SQL function to
prepare both virtual dead tuples and index tuples, load them by
attach_dead_tuples() SQL function, and executed bench() SQL function
for each data structure. Here are the results:
shuffled time unshuffled time
array 88899.513 ms 12616.521 ms
intset 73476.055 ms 10063.405 ms
rtbm 22264.671 ms 2073.171 ms
tbm 10285.092 ms 1417.312 ms
vtbm 14488.581 ms 1240.666 ms
FWIW, I get an assertion failure when using an assertion build:
#2 0x0000561800ea02e0 in ExceptionalCondition (conditionName=0x7f9115a88e91 "found", errorType=0x7f9115a88d11 "FailedAssertion",
fileName=0x7f9115a88e8a "rtbm.c", lineNumber=242) at /home/andres/src/postgresql/src/backend/utils/error/assert.c:69
#3 0x00007f9115a87645 in rtbm_add_tuples (rtbm=0x561806293280, blkno=0, offnums=0x7fffdccabb00, nitems=10) at rtbm.c:242
#4 0x00007f9115a8363d in load_rtbm (rtbm=0x561806293280, itemptrs=0x7f908a203050, nitems=10000000) at bdbench.c:618
#5 0x00007f9115a834b9 in rtbm_attach (lvtt=0x7f9115a8c300 <LVTestSubjects+352>, nitems=10000000, minblk=2139062143, maxblk=2139062143, maxoff=32639)
at bdbench.c:587
#6 0x00007f9115a83837 in attach (lvtt=0x7f9115a8c300 <LVTestSubjects+352>, nitems=10000000, minblk=2139062143, maxblk=2139062143, maxoff=32639)
at bdbench.c:658
#7 0x00007f9115a84190 in attach_dead_tuples (fcinfo=0x56180322d690) at bdbench.c:873I assume you just inverted the Assert(found) assertion?
Right. Fixed it.
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
On Thu, Jul 8, 2021 at 7:51 AM Peter Geoghegan <pg@bowt.ie> wrote:
On Wed, Jul 7, 2021 at 1:24 PM Peter Geoghegan <pg@bowt.ie> wrote:
I wonder how much it would help to break up that loop into two loops.
Make the callback into a batch operation that generates state that
describes what to do with each and every index tuple on the leaf page.
The first loop would build a list of TIDs, then you'd call into
vacuumlazy.c and get it to process the TIDs, and finally the second
loop would physically delete the TIDs that need to be deleted. This
would mean that there would be only one call per leaf page per
btbulkdelete(). This would reduce the number of calls to the callback
by at least 100x, and maybe more than 1000x.
Maybe for something like rtbm.c (which is inspired by Roaring
bitmaps), you would really want to use an "intersection" operation for
this. The TIDs that we need to physically delete from the leaf page
inside btvacuumpage() are the intersection of two bitmaps: our bitmap
of all TIDs on the leaf page, and our bitmap of all TIDs that need to
be deleted by the ongoing btbulkdelete() call.
Agreed. In such a batch operation, what we need to do here is to
compute the intersection of two bitmaps.
Obviously the typical case is that most TIDs in the index do *not* get
deleted -- needing to delete more than ~20% of all TIDs in the index
will be rare. Ideally it would be very cheap to figure out that a TID
does not need to be deleted at all. Something a little like a negative
cache (but not a true negative cache). This is a little bit like how
hash joins can be made faster by adding a Bloom filter -- most hash
probes don't need to join a tuple in the real world, and we can make
these hash probes even faster by using a Bloom filter as a negative
cache.
Agreed.
If you had the list of TIDs from a leaf page sorted for batch
processing, and if you had roaring bitmap style "chunks" with
"container" metadata stored in the data structure, you could then use
merging/intersection -- that has some of the same advantages. I think
that this would be a lot more efficient than having one binary search
per TID. Most TIDs from the leaf page can be skipped over very
quickly, in large groups. It's very rare for VACUUM to need to delete
TIDs from completely random heap table blocks in the real world (some
kind of pattern is much more common).
When this merging process finds 1 TID that might really be deletable
then it's probably going to find much more than 1 -- better to make
that cache miss take care of all of the TIDs together. Also seems like
the CPU could do some clever prefetching with this approach -- it
could prefetch TIDs where the initial chunk metadata is insufficient
to eliminate them early -- these are the groups of TIDs that will have
many TIDs that we actually need to delete. ISTM that improving
temporal locality through batching could matter a lot here.
That's a promising approach.
In rtbm, the pair of one hash entry and one container is used per
block. Therefore, if a block has no dead tuples, we can skip its TIDs
from the leaf page just by checking the hash table. If the hash entry
exists, meaning the block has at least one dead tuple, we look up the
TID's offset from the leaf page in that block's container.
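A minimal sketch of that lookup path (the entry type and the
container_contains() helper are hypothetical):

/* Hypothetical sketch: a miss in the per-block hash table lets us skip
 * every TID on that heap block; a hit narrows the check to that block's
 * small container. */
static bool
rtbm_style_reaped(HTAB *block_table, ItemPointer tid)
{
    BlockNumber blkno = ItemPointerGetBlockNumber(tid);
    OffsetNumber off = ItemPointerGetOffsetNumber(tid);
    BlockEntry *entry;

    entry = hash_search(block_table, &blkno, HASH_FIND, NULL);
    if (entry == NULL)
        return false;       /* no dead tuples at all on this block */

    return container_contains(entry, off); /* array/run/bitmap specific */
}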
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
On Thu, Jul 8, 2021 at 10:40 PM Hannu Krosing <hannuk@google.com> wrote:
Very nice results.
I have been working on the same problem but a bit different solution -
a mix of binary search for (sub)pages and 32-bit bitmaps for
tid-in-page.
Even with current allocation heuristics (allocate 291 tids per page)
it initially allocates much less space; instead of the current 291*6=1746
bytes per page it needs to allocate 80 bytes.
Also it can be laid out so that it is friendly to parallel SIMD
searches doing up to 8 tid lookups in parallel.
Interesting.
That said, for allocating the tid array, the best solution is to
postpone it as much as possible and to do the initial collection into
a file, which
1) postpones the memory allocation to the beginning of index cleanups
2) lets you select the correct size and structure as you know more
about the distribution at that time
3) lets you do the first heap pass in one go and then advance frozenxmin
*before* index cleanup
I think we have to do index vacuuming before heap vacuuming (2nd heap
pass). So do you mean that it advances relfrozenxid of pg_class before
both index vacuuming and heap vacuuming?
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
Hi,
On 2021-07-07 20:46:38 +0900, Masahiko Sawada wrote:
Currently, the TIDs of dead tuples are stored in an array that is
collectively allocated at the start of lazy vacuum and TID lookup uses
bsearch(). There are the following challenges and limitations:
So I prototyped a new data structure dedicated to storing dead tuples
during lazy vacuum while borrowing the idea from Roaring Bitmap[2].
The authors provide an implementation of Roaring Bitmap[3] (Apache
2.0 license). But I've implemented this idea from scratch because we
need to integrate it with Dynamic Shared Memory/Area to support
parallel vacuum and need to support ItemPointerData, 6-bytes integer
in total, whereas the implementation supports only 4-bytes integers.
Also, when it comes to vacuum, we neither need to compute the
intersection, the union, nor the difference between sets, but need
only an existence check.
The data structure is somewhat similar to TIDBitmap. It consists of
the hash table and the container area; the hash table has entries per
block and each block entry allocates its memory space, called a
container, in the container area to store its offset numbers. The
container area is actually an array of bytes and can be enlarged as
needed. In the container area, the data representation of offset
numbers varies depending on their cardinality. It has three container
types: array, bitmap, and run.
How are you thinking of implementing iteration efficiently for rtbm? The
second heap pass needs that obviously... I think the only option would
be to qsort the whole thing?
Greetings,
Andres Freund
Hi,
On 2021-07-09 10:17:49 -0700, Andres Freund wrote:
On 2021-07-07 20:46:38 +0900, Masahiko Sawada wrote:
Currently, the TIDs of dead tuples are stored in an array that is
collectively allocated at the start of lazy vacuum and TID lookup uses
bsearch(). There are the following challenges and limitations:
So I prototyped a new data structure dedicated to storing dead tuples
during lazy vacuum while borrowing the idea from Roaring Bitmap[2].
The authors provide an implementation of Roaring Bitmap[3] (Apache
2.0 license). But I've implemented this idea from scratch because we
need to integrate it with Dynamic Shared Memory/Area to support
parallel vacuum and need to support ItemPointerData, 6-bytes integer
in total, whereas the implementation supports only 4-bytes integers.
Also, when it comes to vacuum, we neither need to compute the
intersection, the union, nor the difference between sets, but need
only an existence check.
The data structure is somewhat similar to TIDBitmap. It consists of
the hash table and the container area; the hash table has entries per
block and each block entry allocates its memory space, called a
container, in the container area to store its offset numbers. The
container area is actually an array of bytes and can be enlarged as
needed. In the container area, the data representation of offset
numbers varies depending on their cardinality. It has three container
types: array, bitmap, and run.
How are you thinking of implementing iteration efficiently for rtbm? The
second heap pass needs that obviously... I think the only option would
be to qsort the whole thing?
I experimented further, trying to use an old radix tree implementation I
had lying around to store dead tuples. With a bit of trickery that seems
to work well.
The radix tree implementation I have basically maps an int64 to another
int64. Each level of the radix tree stores 6 bits of the key, and uses
those 6 bits to index a 1<<6 entry array leading to the next level.
My first idea was to use itemptr_encode() to convert tids into an int64
and store the lower 6 bits in the value part of the radix tree. That
turned out to work well performance-wise, but awful memory-usage-wise.
The problem is that we use at most 9 bits for offsets, but reserve
16 bits for it in the ItemPointerData. Which means that there's often a
lot of empty "tree levels" for those 0 bits, making it hard to get to a
decent memory usage.
The simplest way to address that was to simply compress out those
guaranteed-to-be-zero bits. That results in memory usage that's quite
good - nearly always beating array, occasionally beating rtbm. It's an
ordered datastructure, so the latter isn't too surprising. For lookup
performance the radix approach is commonly among the best, if not the
best.
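For illustration, the compressed key encoding might look roughly like
this (the names are mine; MaxHeapTuplesPerPage fits in 9 bits):

/* Hypothetical sketch: itemptr_encode() reserves 16 bits for the
 * offset, which leaves several radix levels nearly empty; packing the
 * key with only the 9 bits an offset can actually use keeps the tree
 * shallower and denser. */
static inline uint64
tid_to_radix_key(ItemPointer tid)
{
    uint64  block = ItemPointerGetBlockNumber(tid);
    uint64  offset = ItemPointerGetOffsetNumber(tid);   /* <= 291, fits in 9 bits */

    return (block << 9) | offset;
}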
A variation of the storage approach is to just use the block number as
the index, and store the tids as the value. Even with the absolutely
naive approach of just using a Bitmapset that reduces memory usage
substantially - at a small cost to search performance. Of course it'd be
better to use an adaptive approach like you did for rtbm, I just thought
this is good enough.
This largely works well, except when there are a large number of evenly
spread out dead tuples. I don't think that's a particularly common
situation, but it's worth considering anyway.
The reason the memory usage can be larger for sparse workloads is that
they obviously can lead to tree nodes with only one child. As those
nodes are quite large (1<<6 pointers to further children), that can
then lead to a large increase in memory usage.
I have toyed with implementing adaptively large radix nodes like
proposed in https://db.in.tum.de/~leis/papers/ART.pdf - but haven't
gotten it quite working.
Greetings,
Andres Freund
On Sat, Jul 10, 2021 at 2:17 AM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2021-07-07 20:46:38 +0900, Masahiko Sawada wrote:
Currently, the TIDs of dead tuples are stored in an array that is
collectively allocated at the start of lazy vacuum and TID lookup uses
bsearch(). There are the following challenges and limitations:
So I prototyped a new data structure dedicated to storing dead tuples
during lazy vacuum while borrowing the idea from Roaring Bitmap[2].
The authors provide an implementation of Roaring Bitmap[3] (Apache
2.0 license). But I've implemented this idea from scratch because we
need to integrate it with Dynamic Shared Memory/Area to support
parallel vacuum and need to support ItemPointerData, 6-bytes integer
in total, whereas the implementation supports only 4-bytes integers.
Also, when it comes to vacuum, we neither need to compute the
intersection, the union, nor the difference between sets, but need
only an existence check.
The data structure is somewhat similar to TIDBitmap. It consists of
the hash table and the container area; the hash table has entries per
block and each block entry allocates its memory space, called a
container, in the container area to store its offset numbers. The
container area is actually an array of bytes and can be enlarged as
needed. In the container area, the data representation of offset
numbers varies depending on their cardinality. It has three container
types: array, bitmap, and run.
How are you thinking of implementing iteration efficiently for rtbm? The
second heap pass needs that obviously... I think the only option would
be to qsort the whole thing?
Yes, I'm thinking that the iteration of rtbm would be somewhat similar to
tbm. That is, we collect the hash table entries, qsort them by block
number, and then fetch each entry along with its container, one by one,
in block number order.
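A minimal sketch of that iteration order (the entry type and fields are
hypothetical):

/* Hypothetical sketch: sort the collected block-hash entries by block
 * number so the second heap pass can visit heap blocks in order. */
static int
blockentry_cmp(const void *a, const void *b)
{
    BlockNumber ba = (*(BlockEntry *const *) a)->blkno;
    BlockNumber bb = (*(BlockEntry *const *) b)->blkno;

    return (ba > bb) - (ba < bb);
}

static void
rtbm_begin_iterate(BlockEntry **entries, int nentries)
{
    qsort(entries, nentries, sizeof(BlockEntry *), blockentry_cmp);
    /* the caller then walks entries[] and each entry's container in block order */
}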
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
Sorry for the late reply.
On Sat, Jul 10, 2021 at 11:55 AM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2021-07-09 10:17:49 -0700, Andres Freund wrote:
On 2021-07-07 20:46:38 +0900, Masahiko Sawada wrote:
Currently, the TIDs of dead tuples are stored in an array that is
collectively allocated at the start of lazy vacuum and TID lookup uses
bsearch(). There are the following challenges and limitations:
So I prototyped a new data structure dedicated to storing dead tuples
during lazy vacuum while borrowing the idea from Roaring Bitmap[2].
The authors provide an implementation of Roaring Bitmap[3] (Apache
2.0 license). But I've implemented this idea from scratch because we
need to integrate it with Dynamic Shared Memory/Area to support
parallel vacuum and need to support ItemPointerData, 6-bytes integer
in total, whereas the implementation supports only 4-bytes integers.
Also, when it comes to vacuum, we neither need to compute the
intersection, the union, nor the difference between sets, but need
only an existence check.
The data structure is somewhat similar to TIDBitmap. It consists of
the hash table and the container area; the hash table has entries per
block and each block entry allocates its memory space, called a
container, in the container area to store its offset numbers. The
container area is actually an array of bytes and can be enlarged as
needed. In the container area, the data representation of offset
numbers varies depending on their cardinality. It has three container
types: array, bitmap, and run.
How are you thinking of implementing iteration efficiently for rtbm? The
second heap pass needs that obviously... I think the only option would
be to qsort the whole thing?
I experimented further, trying to use an old radix tree implementation I
had lying around to store dead tuples. With a bit of trickery that seems
to work well.
Thank you for experimenting with another approach.
The radix tree implementation I have basically maps an int64 to another
int64. Each level of the radix tree stores 6 bits of the key, and uses
those 6 bits to index a 1<<6 entry array leading to the next level.
My first idea was to use itemptr_encode() to convert tids into an int64
and store the lower 6 bits in the value part of the radix tree. That
turned out to work well performance wise, but awfully memory usage
wise. The problem is that we at most use 9 bits for offsets, but reserve
16 bits for it in the ItemPointerData. Which means that there's often a
lot of empty "tree levels" for those 0 bits, making it hard to get to a
decent memory usage.
The simplest way to address that was to simply compress out those
guaranteed-to-be-zero bits. That results in memory usage that's quite
good - nearly always beating array, occasionally beating rtbm. It's an
ordered datastructure, so the latter isn't too surprising. For lookup
performance the radix approach is commonly among the best, if not the
best.
How do its lookup performance and memory usage compare to
intset? I guess the performance trends of those two approaches are
similar since both consist of a tree. Intset encodes uint64 values with
Simple-8b encoding, so I'm also interested in the comparison in terms
of memory usage.
A variation of the storage approach is to just use the block number as
the index, and store the tids as the value. Even with the absolutely
naive approach of just using a Bitmapset that reduces memory usage
substantially - at a small cost to search performance. Of course it'd be
better to use an adaptive approach like you did for rtbm, I just thought
this is good enough.
This largely works well, except when there are a large number of evenly
spread out dead tuples. I don't think that's a particularly common
situation, but it's worth considering anyway.
The reason the memory usage can be larger for sparse workloads is that
they obviously can lead to tree nodes with only one child. As those
nodes are quite large (1<<6 pointers to further children), that can
then lead to a large increase in memory usage.
Interesting. How big was it in such workloads compared to other data
structures?
I personally like adaptive approaches, especially in the context of
vacuum improvements. We know common patterns of dead tuple
distribution, but they don't necessarily hold, since the distribution
depends on the data and on autovacuum timings etc., even with the same
workload. And we might be able to provide a new approach that works
well in 95% of use cases, but if things get worse than before in the
other 5%, I don't think the approach is a good one. Ideally, it
should be better in common cases and at least be the same as before in
other cases.
BTW is the implementation of the radix tree approach available
somewhere? If so I'd like to experiment with that too.
I have toyed with implementing adaptively large radix nodes like
proposed in https://db.in.tum.de/~leis/papers/ART.pdf - but haven't
gotten it quite working.
That seems a promising approach.
Regards,
[1]: /messages/by-id/CA+TgmoakKFXwUv1Cx2mspUuPQHzYF74BfJ8koF5YdgVLCvhpwA@mail.gmail.com
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
Hi,
On 2021-07-19 15:20:54 +0900, Masahiko Sawada wrote:
BTW is the implementation of the radix tree approach available
somewhere? If so I'd like to experiment with that too.
I have toyed with implementing adaptively large radix nodes like
proposed in https://db.in.tum.de/~leis/papers/ART.pdf - but haven't
gotten it quite working.
That seems a promising approach.
I've since implemented some, but not all of the ideas of that paper
(adaptive node sizes, but not the tree compression pieces).
E.g. for
select prepare(
1000000, -- max block
20, -- # of dead tuples per page
10, -- dead tuples interval within a page
1 -- page interval
);
attach size shuffled ordered
array 69 ms 120 MB 84.87 s 8.66 s
intset 173 ms 65 MB 68.82 s 11.75 s
rtbm 201 ms 67 MB 11.54 s 1.35 s
tbm 232 ms 100 MB 8.33 s 1.26 s
vtbm 162 ms 58 MB 10.01 s 1.22 s
radix 88 ms 42 MB 11.49 s 1.67 s
and for
select prepare(
1000000, -- max block
10, -- # of dead tuples per page
1, -- dead tuples interval within a page
1 -- page interval
);
attach size shuffled ordered
array 24 ms 60MB 3.74s 1.02 s
intset 97 ms 49MB 3.14s 0.75 s
rtbm 138 ms 36MB 0.41s 0.14 s
tbm 198 ms 101MB 0.41s 0.14 s
vtbm 118 ms 27MB 0.39s 0.12 s
radix 33 ms 10MB 0.28s 0.10 s
(this is an almost unfairly good case for radix)
Running out of time to format the results of the other testcases before
I have to run, unfortunately. radix uses 42MB both in test case 3 and
4.
The radix tree code isn't good right now. A ridiculous amount of
duplication etc. The naming clearly shows its origins from a buffer
mapping radix tree...
Currently in a bunch of the cases 20% of the time is spent in
radix_reaped(). If I move that into radix.c and allow bfm_lookup() to be
inlined, I get reduced overhead. rtbm for example essentially already
does that, because it does the splitting of the ItemPointer in rtbm.c.
I've attached my current patches against your tree.
Greetings,
Andres Freund
Attachments:
0001-Fix-build-warnings.patch (text/x-diff; charset=us-ascii)
From 5dfbe02000aefd3e085bdea0ec809247e1fb71b3 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 19 Jul 2021 16:03:28 -0700
Subject: [PATCH 1/3] Fix build warnings.
---
bdbench/bdbench.c | 2 +-
bdbench/rtbm.c | 4 ++--
bdbench/vtbm.c | 10 +++++++---
3 files changed, 10 insertions(+), 6 deletions(-)
diff --git a/bdbench/bdbench.c b/bdbench/bdbench.c
index 800567d..1df5c53 100644
--- a/bdbench/bdbench.c
+++ b/bdbench/bdbench.c
@@ -655,7 +655,7 @@ _bench(LVTestType *lvtt)
fclose(f);
#endif
- elog(NOTICE, "\"%s\": dead tuples %lu, index tuples %lu, mathed %d, mem %zu",
+ elog(NOTICE, "\"%s\": dead tuples %lu, index tuples %lu, matched %d, mem %zu",
lvtt->name,
lvtt->dtinfo.nitems,
IndexTids_cache->dtinfo.nitems,
diff --git a/bdbench/rtbm.c b/bdbench/rtbm.c
index 025d2a9..eac277a 100644
--- a/bdbench/rtbm.c
+++ b/bdbench/rtbm.c
@@ -449,9 +449,9 @@ dump_entry(RTbm *rtbm, DtEntry *entry)
}
}
- elog(NOTICE, "%s (offset %d len %d)",
+ elog(NOTICE, "%s (offset %llu len %d)",
str.data,
- entry->offset, len);
+ (long long unsigned) entry->offset, len);
}
static int
diff --git a/bdbench/vtbm.c b/bdbench/vtbm.c
index c59d6e1..63320f5 100644
--- a/bdbench/vtbm.c
+++ b/bdbench/vtbm.c
@@ -72,7 +72,8 @@ vtbm_add_tuples(VTbm *vtbm, const BlockNumber blkno,
DtEntry *entry;
bool found;
char oldstatus;
- int wordnum, bitnum;
+ int wordnum = 0;
+ int bitnum;
entry = dttable_insert(vtbm->dttable, blkno, &found);
Assert(!found);
@@ -216,8 +217,10 @@ vtbm_dump(VTbm *vtbm)
vtbm->bitmap_size, vtbm->npages);
for (int i = 0; i < vtbm->npages; i++)
{
+ char *bitmap;
+
entry = entries[i];
- char *bitmap = &(vtbm->bitmap[entry->offset]);
+ bitmap = &(vtbm->bitmap[entry->offset]);
appendStringInfo(&str, "[%5d] : ", entry->blkno);
for (int off = 0; off < entry->len; off++)
@@ -239,6 +242,7 @@ vtbm_dump_blk(VTbm *vtbm, BlockNumber blkno)
{
DtEntry *entry;
StringInfoData str;
+ char *bitmap;
initStringInfo(&str);
@@ -252,7 +256,7 @@ vtbm_dump_blk(VTbm *vtbm, BlockNumber blkno)
return;
}
- char *bitmap = &(vtbm->bitmap[entry->offset]);
+ bitmap = &(vtbm->bitmap[entry->offset]);
appendStringInfo(&str, "[%5d] : ", entry->blkno);
for (int off = 1; off < entry->len; off++)
--
2.32.0.rc2
0002-Add-radix-tree.patch (text/x-diff; charset=us-ascii)
From 5ba05ffad4a9605a6fb5a24fe625542aee226ec8 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 19 Jul 2021 16:04:55 -0700
Subject: [PATCH 2/3] Add radix tree.
---
bdbench/radix.c | 3088 +++++++++++++++++++++++++++++++++++++++++++++++
bdbench/radix.h | 76 ++
2 files changed, 3164 insertions(+)
create mode 100644 bdbench/radix.c
create mode 100644 bdbench/radix.h
diff --git a/bdbench/radix.c b/bdbench/radix.c
new file mode 100644
index 0000000..c7061f0
--- /dev/null
+++ b/bdbench/radix.c
@@ -0,0 +1,3088 @@
+/*
+ *
+ */
+
+#include "postgres.h"
+
+#include "radix.h"
+
+#include "lib/stringinfo.h"
+#include "port/pg_bitutils.h"
+#include "utils/memutils.h"
+
+
+/*
+ * How many bits are encoded in one tree level.
+ *
+ * Linux uses 6, ART uses 8. In a non-adaptive radix tree the disadvantage of
+ * a higher fanout is increased memory usage - but the adaptive node size
+ * addresses that to a good degree. Using a common multiple of 8 (i.e. bits
+ * in a byte) has the advantage of making it easier to eventually support
+ * variable length data. Therefore go with 8 for now.
+ */
+#define BFM_FANOUT 8
+
+#define BFM_MAX_CLASS (1<<BFM_FANOUT)
+
+#define BFM_MASK ((1 << BFM_FANOUT) - 1)
+
+
+/*
+ * Base type for all node types.
+ */
+struct bfm_tree_node_inner;
+typedef struct bfm_tree_node
+{
+ /*
+ * Size class of entry (stored as uint8 instead of bfm_tree_node_kind to
+ * save space).
+ *
+ * XXX: For efficiency in random access cases it'd be a good idea to
+ * encode the kind of a node in the pointer value of upper nodes, in the
+ * low bits. Being able to do the node type dispatch during traversal
+ * before the memory for the node has been fetched from memory would
+ * likely improve performance significantly. But that'd require at least
+ * 8 byte alignment, which we don't currently guarantee on all platforms.
+ */
+ uint8 kind;
+
+ /*
+ * Shift indicates which part of the key space is represented by this
+ * node. I.e. the key is shifted by `shift` and the lowest BFM_FANOUT bits
+ * are then represented in chunk.
+ */
+ uint8 node_shift;
+ uint8 node_chunk;
+
+ /*
+ * Number of children - currently uint16 to be able to indicate 256
+ * children at a fanout of 8.
+ */
+ uint16 count;
+
+ /* FIXME: Right now there's always unused bytes here :( */
+
+ /*
+ * FIXME: could be removed by using a stack while walking down to deleted
+ * node.
+ */
+ struct bfm_tree_node_inner *parent;
+} bfm_tree_node;
+
+/*
+ * Base type for all inner nodes.
+ */
+typedef struct bfm_tree_node_inner
+{
+ bfm_tree_node b;
+} bfm_tree_node_inner;
+
+/*
+ * Base type for all leaf nodes.
+ */
+typedef struct bfm_tree_node_leaf
+{
+ bfm_tree_node b;
+} bfm_tree_node_leaf;
+
+
+/*
+ * Size classes.
+ *
+ * To reduce memory usage compared to a simple radix tree with a fixed fanout
+ * we use adaptive node sizes, with different storage methods for different
+ * numbers of elements.
+ *
+ * FIXME: These are currently not well chosen. To reduce memory fragmentation
+ * smaller class should optimally fit neatly into the next larger class
+ * (except perhaps at the lowest end). Right now its
+ * 32->56->160->304->1296->2064/2096 bytes for inner/leaf nodes, repeatedly
+ * just above a power of 2, leading to large amounts of allocator padding with
+ * aset.c. Hence the use of slab.
+ *
+ * FIXME: Duplication.
+ *
+ * XXX: Consider implementing path compression, it reduces worst case memory
+ * usage substantially. I.e. collapse sequences of nodes with just one child
+ * into one node. That would make it feasible to use this datastructure for
+ * wide keys. Gut feeling: When compressing inner nodes a limited number of
+ * tree levels should be skippable to keep nodes of a constant size. But when
+ * collapsing to leaf nodes it likely is worth to make them variable width,
+ * it's such a common scenario (a sparse key will always end with such a chain
+ * of nodes).
+ */
+
+/*
+ * Inner node size classes.
+ */
+typedef struct bfm_tree_node_inner_1
+{
+ bfm_tree_node_inner b;
+
+ /* single child, for key chunk */
+ uint8 chunk;
+ bfm_tree_node *slot;
+} bfm_tree_node_inner_1;
+
+typedef struct bfm_tree_node_inner_4
+{
+ bfm_tree_node_inner b;
+
+ /* four children, for key chunks */
+ uint8 chunks[4];
+ bfm_tree_node *slots[4];
+} bfm_tree_node_inner_4;
+
+typedef struct bfm_tree_node_inner_16
+{
+ bfm_tree_node_inner b;
+
+ /* four children, for key chunks */
+ uint8 chunks[16];
+ bfm_tree_node *slots[16];
+} bfm_tree_node_inner_16;
+
+#define BFM_TREE_NODE_32_INVALID 0xFF
+typedef struct bfm_tree_node_inner_32
+{
+ bfm_tree_node_inner b;
+
+ /*
+ * 32 children. Offsets is indexed by the key chunk and points into
+ * ->slots. An offset of BFM_TREE_NODE_32_INVALID indicates a non-existing
+ * entry.
+ *
+ * XXX: It'd be nice to shrink the offsets array to use fewer bits - we
+ * only need to index into an array of 32 entries. But 32 offsets already
+ * is 5 bits, making a simple & fast encoding nontrivial.
+ */
+ uint8 chunks[32];
+ bfm_tree_node *slots[32];
+} bfm_tree_node_inner_32;
+
+#define BFM_TREE_NODE_128_INVALID 0xFF
+typedef struct bfm_tree_node_inner_128
+{
+ bfm_tree_node_inner b;
+
+ uint8 offsets[BFM_MAX_CLASS];
+ bfm_tree_node *slots[128];
+} bfm_tree_node_inner_128;
+
+typedef struct bfm_tree_node_inner_max
+{
+ bfm_tree_node_inner b;
+ bfm_tree_node *slots[BFM_MAX_CLASS];
+} bfm_tree_node_inner_max;
+
+
+/*
+ * Leaf node size classes.
+ *
+ * Currently these are separate from inner node size classes for two main
+ * reasons:
+ *
+ * 1) the value type might be different than something fitting into a pointer
+ * width type
+ * 2) Need to represent non-existing values in a key-type independent way.
+ *
+ * 1) is clearly worth being concerned about, but it's not clear 2) is as
+ * good. It might be better to just indicate non-existing entries the same way
+ * in inner nodes.
+ */
+
+typedef struct bfm_tree_node_leaf_1
+{
+ bfm_tree_node_leaf b;
+ uint8 chunk;
+ bfm_value_type value;
+} bfm_tree_node_leaf_1;
+
+#define BFM_TREE_NODE_LEAF_4_INVALID 0xFFFF
+typedef struct bfm_tree_node_leaf_4
+{
+ bfm_tree_node_leaf b;
+ uint8 chunks[4];
+ bfm_value_type values[4];
+} bfm_tree_node_leaf_4;
+
+#define BFM_TREE_NODE_LEAF_16_INVALID 0xFFFF
+typedef struct bfm_tree_node_leaf_16
+{
+ bfm_tree_node_leaf b;
+ uint8 chunks[16];
+ bfm_value_type values[16];
+} bfm_tree_node_leaf_16;
+
+typedef struct bfm_tree_node_leaf_32
+{
+ bfm_tree_node_leaf b;
+ uint8 chunks[32];
+ bfm_value_type values[32];
+} bfm_tree_node_leaf_32;
+
+typedef struct bfm_tree_node_leaf_128
+{
+ bfm_tree_node_leaf b;
+ uint8 offsets[BFM_MAX_CLASS];
+ bfm_value_type values[128];
+} bfm_tree_node_leaf_128;
+
+typedef struct bfm_tree_node_leaf_max
+{
+ bfm_tree_node_leaf b;
+ uint8 set[BFM_MAX_CLASS / (sizeof(uint8) * BITS_PER_BYTE)];
+ bfm_value_type values[BFM_MAX_CLASS];
+} bfm_tree_node_leaf_max;
+
+
+typedef struct bfm_tree_size_class_info
+{
+ const char *const name;
+ int elements;
+ size_t size;
+} bfm_tree_size_class_info;
+
+const bfm_tree_size_class_info inner_class_info[] =
+{
+ [BFM_KIND_1] = {"1", 1, sizeof(bfm_tree_node_inner_1)},
+ [BFM_KIND_4] = {"4", 4, sizeof(bfm_tree_node_inner_4)},
+ [BFM_KIND_16] = {"16", 16, sizeof(bfm_tree_node_inner_16)},
+ [BFM_KIND_32] = {"32", 32, sizeof(bfm_tree_node_inner_32)},
+ [BFM_KIND_128] = {"128", 128, sizeof(bfm_tree_node_inner_128)},
+ [BFM_KIND_MAX] = {"max", BFM_MAX_CLASS, sizeof(bfm_tree_node_inner_max)},
+};
+
+const bfm_tree_size_class_info leaf_class_info[] =
+{
+ [BFM_KIND_1] = {"1", 1, sizeof(bfm_tree_node_leaf_1)},
+ [BFM_KIND_4] = {"4", 4, sizeof(bfm_tree_node_leaf_4)},
+ [BFM_KIND_16] = {"16", 16, sizeof(bfm_tree_node_leaf_16)},
+ [BFM_KIND_32] = {"32", 32, sizeof(bfm_tree_node_leaf_32)},
+ [BFM_KIND_128] = {"128", 128, sizeof(bfm_tree_node_leaf_128)},
+ [BFM_KIND_MAX] = {"max", BFM_MAX_CLASS, sizeof(bfm_tree_node_leaf_max)},
+};
+
+static void *
+bfm_alloc_node(bfm_tree *root, bool inner, bfm_tree_node_kind kind, size_t size)
+{
+ bfm_tree_node *node;
+
+#ifdef BFM_USE_SLAB
+ if (inner)
+ node = (bfm_tree_node *) MemoryContextAlloc(root->inner_slabs[kind], size);
+ else
+ node = (bfm_tree_node *) MemoryContextAlloc(root->leaf_slabs[kind], size);
+#elif defined(BFM_USE_OS)
+ node = (bfm_tree_node *) malloc(size);
+#else
+ node = (bfm_tree_node *) MemoryContextAlloc(root->context, size);
+#endif
+
+ return node;
+}
+
+static bfm_tree_node_inner *
+bfm_alloc_inner(bfm_tree *root, bfm_tree_node_kind kind, size_t size)
+{
+ bfm_tree_node_inner *node;
+
+ Assert(inner_class_info[kind].size == size);
+#ifdef BFM_STATS
+ root->inner_nodes[kind]++;
+#endif
+
+ node = bfm_alloc_node(root, true, kind, size);
+
+ memset(&node->b, 0, sizeof(node->b));
+ node->b.kind = kind;
+
+ return node;
+}
+
+static bfm_tree_node_inner *
+bfm_alloc_leaf(bfm_tree *root, bfm_tree_node_kind kind, size_t size)
+{
+ bfm_tree_node_inner *node;
+
+ Assert(leaf_class_info[kind].size == size);
+#ifdef BFM_STATS
+ root->leaf_nodes[kind]++;
+#endif
+
+ node = bfm_alloc_node(root, false, kind, size);
+
+ memset(&node->b, 0, sizeof(node->b));
+ node->b.kind = kind;
+
+ return node;
+}
+
+
+static bfm_tree_node_inner_1 *
+bfm_alloc_inner_1(bfm_tree *root)
+{
+ bfm_tree_node_inner_1 *node =
+ (bfm_tree_node_inner_1 *) bfm_alloc_inner(root, BFM_KIND_1, sizeof(*node));
+
+ return node;
+}
+
+#define BFM_TREE_NODE_INNER_4_INVALID 0xFF
+static bfm_tree_node_inner_4 *
+bfm_alloc_inner_4(bfm_tree *root)
+{
+ bfm_tree_node_inner_4 *node =
+ (bfm_tree_node_inner_4 *) bfm_alloc_inner(root, BFM_KIND_4, sizeof(*node));
+
+ return node;
+}
+
+#define BFM_TREE_NODE_INNER_16_INVALID 0xFF
+static bfm_tree_node_inner_16 *
+bfm_alloc_inner_16(bfm_tree *root)
+{
+ bfm_tree_node_inner_16 *node =
+ (bfm_tree_node_inner_16 *) bfm_alloc_inner(root, BFM_KIND_16, sizeof(*node));
+
+ return node;
+}
+
+#define BFM_TREE_NODE_INNER_32_INVALID 0xFF
+static bfm_tree_node_inner_32 *
+bfm_alloc_inner_32(bfm_tree *root)
+{
+ bfm_tree_node_inner_32 *node =
+ (bfm_tree_node_inner_32 *) bfm_alloc_inner(root, BFM_KIND_32, sizeof(*node));
+
+ return node;
+}
+
+static bfm_tree_node_inner_128 *
+bfm_alloc_inner_128(bfm_tree *root)
+{
+ bfm_tree_node_inner_128 *node =
+ (bfm_tree_node_inner_128 *) bfm_alloc_inner(root, BFM_KIND_128, sizeof(*node));
+
+ memset(&node->offsets, BFM_TREE_NODE_128_INVALID, sizeof(node->offsets));
+
+ return node;
+}
+
+static bfm_tree_node_inner_max *
+bfm_alloc_inner_max(bfm_tree *root)
+{
+ bfm_tree_node_inner_max *node =
+ (bfm_tree_node_inner_max *) bfm_alloc_inner(root, BFM_KIND_MAX, sizeof(*node));
+
+ memset(&node->slots, 0, sizeof(node->slots));
+
+ return node;
+}
+
+static bfm_tree_node_leaf_1 *
+bfm_alloc_leaf_1(bfm_tree *root)
+{
+ bfm_tree_node_leaf_1 *node =
+ (bfm_tree_node_leaf_1 *) bfm_alloc_leaf(root, BFM_KIND_1, sizeof(*node));
+
+ return node;
+}
+
+static bfm_tree_node_leaf_4 *
+bfm_alloc_leaf_4(bfm_tree *root)
+{
+ bfm_tree_node_leaf_4 *node =
+ (bfm_tree_node_leaf_4 *) bfm_alloc_leaf(root, BFM_KIND_4, sizeof(*node));
+
+ return node;
+}
+
+static bfm_tree_node_leaf_16 *
+bfm_alloc_leaf_16(bfm_tree *root)
+{
+ bfm_tree_node_leaf_16 *node =
+ (bfm_tree_node_leaf_16 *) bfm_alloc_leaf(root, BFM_KIND_16, sizeof(*node));
+
+ return node;
+}
+
+static bfm_tree_node_leaf_32 *
+bfm_alloc_leaf_32(bfm_tree *root)
+{
+ bfm_tree_node_leaf_32 *node =
+ (bfm_tree_node_leaf_32 *) bfm_alloc_leaf(root, BFM_KIND_32, sizeof(*node));
+
+ return node;
+}
+
+static bfm_tree_node_leaf_128 *
+bfm_alloc_leaf_128(bfm_tree *root)
+{
+ bfm_tree_node_leaf_128 *node =
+ (bfm_tree_node_leaf_128 *) bfm_alloc_leaf(root, BFM_KIND_128, sizeof(*node));
+
+ memset(node->offsets, BFM_TREE_NODE_128_INVALID, sizeof(node->offsets));
+
+ return node;
+}
+
+static bfm_tree_node_leaf_max *
+bfm_alloc_leaf_max(bfm_tree *root)
+{
+ bfm_tree_node_leaf_max *node =
+ (bfm_tree_node_leaf_max *) bfm_alloc_leaf(root, BFM_KIND_MAX, sizeof(*node));
+
+ memset(node->set, 0, sizeof(node->set));
+
+ return node;
+}
+
+static void
+bfm_free_internal(bfm_tree *root, void *p)
+{
+#if defined(BFM_USE_OS)
+ free(p);
+#else
+ pfree(p);
+#endif
+}
+
+static void
+bfm_free_inner(bfm_tree *root, bfm_tree_node_inner *node)
+{
+ Assert(node->b.node_shift != 0);
+
+#ifdef BFM_STATS
+ root->inner_nodes[node->b.kind]--;
+#endif
+
+ bfm_free_internal(root, node);
+}
+
+static void
+bfm_free_leaf(bfm_tree *root, bfm_tree_node_leaf *node)
+{
+ Assert(node->b.node_shift == 0);
+
+#ifdef BFM_STATS
+ root->leaf_nodes[node->b.kind]--;
+#endif
+
+ bfm_free_internal(root, node);
+}
+
+#define BFM_LEAF_MAX_SET_OFFSET(i) (i / (sizeof(uint8) * BITS_PER_BYTE))
+#define BFM_LEAF_MAX_SET_BIT(i) (UINT64_C(1) << (i & ((sizeof(uint8) * BITS_PER_BYTE)-1)))
+
+static inline bool
+bfm_leaf_max_isset(bfm_tree_node_leaf_max *node_max, uint32 i)
+{
+ return node_max->set[BFM_LEAF_MAX_SET_OFFSET(i)] & BFM_LEAF_MAX_SET_BIT(i);
+}
+
+static inline void
+bfm_leaf_max_set(bfm_tree_node_leaf_max *node_max, uint32 i)
+{
+ node_max->set[BFM_LEAF_MAX_SET_OFFSET(i)] |= BFM_LEAF_MAX_SET_BIT(i);
+}
+
+static inline void
+bfm_leaf_max_unset(bfm_tree_node_leaf_max *node_max, uint32 i)
+{
+ node_max->set[BFM_LEAF_MAX_SET_OFFSET(i)] &= ~BFM_LEAF_MAX_SET_BIT(i);
+}
+
+static uint64
+bfm_maxval_shift(uint32 shift)
+{
+ uint32 maxshift = (sizeof(bfm_key_type) * BITS_PER_BYTE) / BFM_FANOUT * BFM_FANOUT;
+
+ Assert(shift <= maxshift);
+
+ if (shift == maxshift)
+ return UINT64_MAX;
+
+ return (UINT64_C(1) << (shift + BFM_FANOUT)) - 1;
+}
+
+static inline int
+search_chunk_array_4_eq(uint8 *chunks, uint8 match, uint8 count)
+{
+ int index = -1;
+
+ for (int i = 0; i < count; i++)
+ {
+ if (chunks[i] == match)
+ {
+ index = i;
+ break;
+ }
+ }
+
+ return index;
+}
+
+static inline int
+search_chunk_array_4_le(uint8 *chunks, uint8 match, uint8 count)
+{
+ int index;
+
+ for (index = 0; index < count; index++)
+ if (chunks[index] >= match)
+ break;
+
+ return index;
+}
+
+
+#if defined(__SSE2__)
+#include <emmintrin.h> // x86 SSE intrinsics
+#endif
+
+static inline int
+search_chunk_array_16_eq(uint8 *chunks, uint8 match, uint8 count)
+{
+#if !defined(__SSE2__) || defined(USE_ASSERT_CHECKING)
+ int index = -1;
+#endif
+
+#ifdef __SSE2__
+ int index_sse;
+ __m128i spread_chunk = _mm_set1_epi8(match);
+ __m128i haystack = _mm_loadu_si128((__m128i_u*) chunks);
+ __m128i cmp=_mm_cmpeq_epi8(spread_chunk, haystack);
+ uint32_t bitfield=_mm_movemask_epi8(cmp);
+
+ bitfield &= ((1<<count)-1);
+
+ if (bitfield)
+ index_sse = __builtin_ctz(bitfield);
+ else
+ index_sse = -1;
+
+#endif
+
+#if !defined(__SSE2__) || defined(USE_ASSERT_CHECKING)
+ for (int i = 0; i < count; i++)
+ {
+ if (chunks[i] == match)
+ {
+ index = i;
+ break;
+ }
+ }
+
+#if defined(__SSE2__)
+ Assert(index_sse == index);
+#endif
+
+#endif
+
+#if defined(__SSE2__)
+ return index_sse;
+#else
+ return index;
+#endif
+}
+
+/*
+ * This is a bit more complicated than search_chunk_array_16_eq(), because
+ * until recently no unsigned uint8 comparison instruction existed on x86. So
+ * we need to play some trickery using _mm_min_epu8() to effectively get
+ * <=. There never will be any equal elements in the current uses, but that's
+ * what we get here...
+ */
+static inline int
+search_chunk_array_16_le(uint8 *chunks, uint8 match, uint8 count)
+{
+#if !defined(__SSE2__) || defined(USE_ASSERT_CHECKING)
+ int index;
+#endif
+
+#ifdef __SSE2__
+ int index_sse;
+ __m128i spread_chunk = _mm_set1_epi8(match);
+ __m128i haystack = _mm_loadu_si128((__m128i_u*) chunks);
+ __m128i min = _mm_min_epu8(haystack, spread_chunk);
+ __m128i cmp = _mm_cmpeq_epi8(spread_chunk, min);
+ uint32_t bitfield=_mm_movemask_epi8(cmp);
+
+ bitfield &= ((1<<count)-1);
+
+ if (bitfield)
+ index_sse = __builtin_ctz(bitfield);
+ else
+ index_sse = count;
+#endif
+
+#if !defined(__SSE2__) || defined(USE_ASSERT_CHECKING)
+ for (index = 0; index < count; index++)
+ if (chunks[index] >= match)
+ break;
+
+#if defined(__SSE2__)
+ Assert(index_sse == index);
+#endif
+
+#endif
+
+#if defined(__SSE2__)
+ return index_sse;
+#else
+ return index;
+#endif
+}
+
+#if defined(__AVX2__)
+#include <immintrin.h> // x86 SSE intrinsics
+#endif
+
+static inline int
+search_chunk_array_32_eq(uint8 *chunks, uint8 match, uint8 count)
+{
+#if !defined(__AVX2__) || defined(USE_ASSERT_CHECKING)
+ int index = -1;
+#endif
+
+#ifdef __AVX2__
+ int index_sse;
+ __m256i spread_chunk = _mm256_set1_epi8(match);
+ __m256i haystack = _mm256_loadu_si256((__m256i_u*) chunks);
+ __m256i cmp= _mm256_cmpeq_epi8(spread_chunk, haystack);
+ uint32_t bitfield = _mm256_movemask_epi8(cmp);
+
+ bitfield &= ((UINT64_C(1)<<count)-1);
+
+ if (bitfield)
+ index_sse = __builtin_ctz(bitfield);
+ else
+ index_sse = -1;
+
+#endif
+
+#if !defined(__AVX2__) || defined(USE_ASSERT_CHECKING)
+ for (int i = 0; i < count; i++)
+ {
+ if (chunks[i] == match)
+ {
+ index = i;
+ break;
+ }
+ }
+
+#if defined(__AVX2__)
+ Assert(index_sse == index);
+#endif
+
+#endif
+
+#if defined(__AVX2__)
+ return index_sse;
+#else
+ return index;
+#endif
+}
+
+/*
+ * This is a bit more complicated than search_chunk_array_32_eq(), because
+ * until recently no unsigned 8-bit comparison instruction existed on x86. So
+ * we need to play some trickery using _mm256_min_epu8() to effectively get
+ * <=. There will never be any equal elements in the current uses, but that's
+ * what we get here...
+ */
+static inline int
+search_chunk_array_32_le(uint8 *chunks, uint8 match, uint8 count)
+{
+#if !defined(__AVX2__) || defined(USE_ASSERT_CHECKING)
+ int index;
+#endif
+
+#ifdef __AVX2__
+ int index_sse;
+ __m256i spread_chunk = _mm256_set1_epi8(match);
+ __m256i haystack = _mm256_loadu_si256((__m256i_u*) chunks);
+ __m256i min = _mm256_min_epu8(haystack, spread_chunk);
+ __m256i cmp=_mm256_cmpeq_epi8(spread_chunk, min);
+ uint32_t bitfield=_mm256_movemask_epi8(cmp);
+
+ bitfield &= ((UINT64_C(1)<<count)-1);
+
+ if (bitfield)
+ index_sse = __builtin_ctz(bitfield);
+ else
+ index_sse = count;
+#endif
+
+#if !defined(__AVX2__) || defined(USE_ASSERT_CHECKING)
+ for (index = 0; index < count; index++)
+ if (chunks[index] >= match)
+ break;
+
+#if defined(__AVX2__)
+ Assert(index_sse == index);
+#endif
+
+#endif
+
+#if defined(__AVX2__)
+ return index_sse;
+#else
+ return index;
+#endif
+}
+
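+/*
+ * Copy the chunk and slot arrays from a smaller inner node into its larger
+ * replacement, and repoint the children's parent pointers at the new node.
+ */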
+static inline void
+chunk_slot_array_grow(uint8 *source_chunks, bfm_tree_node **source_slots,
+ uint8 *target_chunks, bfm_tree_node **target_slots,
+ bfm_tree_node_inner *oldnode, bfm_tree_node_inner *newnode)
+{
+ memcpy(target_chunks, source_chunks, sizeof(source_chunks[0]) * oldnode->b.count);
+ memcpy(target_slots, source_slots, sizeof(source_slots[0]) * oldnode->b.count);
+
+ for (int i = 0; i < oldnode->b.count; i++)
+ {
+ Assert(source_slots[i]->parent == oldnode);
+ source_slots[i]->parent = newnode;
+ }
+}
+
+/*
+ * FIXME: Find a way to deduplicate with bfm_find_one_level_leaf()
+ */
+pg_attribute_always_inline static bfm_tree_node *
+bfm_find_one_level_inner(bfm_tree_node_inner * pg_restrict node, uint8 chunk)
+{
+ bfm_tree_node *slot = NULL;
+
+ Assert(node->b.node_shift != 0); /* is inner node */
+
+ /* tell the compiler it doesn't need a bounds check */
+ if ((bfm_tree_node_kind) node->b.kind > BFM_KIND_MAX)
+ pg_unreachable();
+
+ switch((bfm_tree_node_kind) node->b.kind)
+ {
+ case BFM_KIND_1:
+ {
+ bfm_tree_node_inner_1 *node_1 =
+ (bfm_tree_node_inner_1 *) node;
+
+ Assert(node_1->b.b.count <= 1);
+ if (node_1->chunk == chunk)
+ slot = node_1->slot;
+ break;
+ }
+
+ case BFM_KIND_4:
+ {
+ bfm_tree_node_inner_4 *node_4 =
+ (bfm_tree_node_inner_4 *) node;
+ int index;
+
+ Assert(node_4->b.b.count <= 4);
+ index = search_chunk_array_4_eq(node_4->chunks, chunk, node_4->b.b.count);
+
+ if (index != -1)
+ slot = node_4->slots[index];
+
+ break;
+ }
+
+ case BFM_KIND_16:
+ {
+ bfm_tree_node_inner_16 *node_16 =
+ (bfm_tree_node_inner_16 *) node;
+ int index;
+
+ Assert(node_16->b.b.count <= 16);
+
+ index = search_chunk_array_16_eq(node_16->chunks, chunk, node_16->b.b.count);
+ if (index != -1)
+ slot = node_16->slots[index];
+
+ break;
+ }
+
+ case BFM_KIND_32:
+ {
+ bfm_tree_node_inner_32 *node_32 =
+ (bfm_tree_node_inner_32 *) node;
+ int index;
+
+ Assert(node_32->b.b.count <= 32);
+
+ index = search_chunk_array_32_eq(node_32->chunks, chunk, node_32->b.b.count);
+ if (index != -1)
+ slot = node_32->slots[index];
+
+ break;
+ }
+
+ case BFM_KIND_128:
+ {
+ bfm_tree_node_inner_128 *node_128 =
+ (bfm_tree_node_inner_128 *) node;
+
+ Assert(node_128->b.b.count <= 128);
+
+ if (node_128->offsets[chunk] != BFM_TREE_NODE_128_INVALID)
+ {
+ slot = node_128->slots[node_128->offsets[chunk]];
+ }
+ break;
+ }
+
+ case BFM_KIND_MAX:
+ {
+ bfm_tree_node_inner_max *node_max =
+ (bfm_tree_node_inner_max *) node;
+
+ Assert(node_max->b.b.count <= BFM_MAX_CLASS);
+ slot = node_max->slots[chunk];
+
+ break;
+ }
+ }
+
+ return slot;
+}
+
+/*
+ * FIXME: Find a way to deduplicate with bfm_find_one_level_inner()
+ */
+pg_attribute_always_inline static bool
+bfm_find_one_level_leaf(bfm_tree_node_leaf * pg_restrict node, uint8 chunk, bfm_value_type * pg_restrict valp)
+{
+ bool found = false;
+
+ Assert(node->b.node_shift == 0); /* is leaf node */
+
+ /* tell the compiler it doesn't need a bounds check */
+ if ((bfm_tree_node_kind) node->b.kind > BFM_KIND_MAX)
+ pg_unreachable();
+
+ switch((bfm_tree_node_kind) node->b.kind)
+ {
+ case BFM_KIND_1:
+ {
+ bfm_tree_node_leaf_1 *node_1 =
+ (bfm_tree_node_leaf_1 *) node;
+
+ Assert(node_1->b.b.count <= 1);
+ if (node_1->b.b.count == 1 &&
+ node_1->chunk == chunk)
+ {
+ *valp = node_1->value;
+ found = true;
+ break;
+ }
+ break;
+ }
+
+ case BFM_KIND_4:
+ {
+ bfm_tree_node_leaf_4 *node_4 =
+ (bfm_tree_node_leaf_4 *) node;
+ int index;
+
+ Assert(node_4->b.b.count <= 4);
+ index = search_chunk_array_4_eq(node_4->chunks, chunk, node_4->b.b.count);
+
+ if (index != -1)
+ {
+ *valp = node_4->values[index];
+ found = true;
+ }
+ break;
+ }
+
+ case BFM_KIND_16:
+ {
+ bfm_tree_node_leaf_16 *node_16 =
+ (bfm_tree_node_leaf_16 *) node;
+ int index;
+
+ Assert(node_16->b.b.count <= 16);
+
+ index = search_chunk_array_16_eq(node_16->chunks, chunk, node_16->b.b.count);
+ if (index != -1)
+ {
+ *valp = node_16->values[index];
+ found = true;
+ break;
+ }
+ break;
+ }
+
+ case BFM_KIND_32:
+ {
+ bfm_tree_node_leaf_32 *node_32 =
+ (bfm_tree_node_leaf_32 *) node;
+ int index;
+
+ Assert(node_32->b.b.count <= 32);
+
+ index = search_chunk_array_32_eq(node_32->chunks, chunk, node_32->b.b.count);
+ if (index != -1)
+ {
+ *valp = node_32->values[index];
+ found = true;
+ break;
+ }
+ break;
+ }
+
+ case BFM_KIND_128:
+ {
+ bfm_tree_node_leaf_128 *node_128 =
+ (bfm_tree_node_leaf_128 *) node;
+
+ Assert(node_128->b.b.count <= 128);
+
+ if (node_128->offsets[chunk] != BFM_TREE_NODE_128_INVALID)
+ {
+ *valp = node_128->values[node_128->offsets[chunk]];
+ found = true;
+ }
+ break;
+ }
+
+ case BFM_KIND_MAX:
+ {
+ bfm_tree_node_leaf_max *node_max =
+ (bfm_tree_node_leaf_max *) node;
+
+ Assert(node_max->b.b.count <= BFM_MAX_CLASS);
+
+ if (bfm_leaf_max_isset(node_max, chunk))
+ {
+ *valp = node_max->values[chunk];
+ found = true;
+ }
+ break;
+ }
+ }
+
+ return found;
+}
+
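+/*
+ * Descend from the root towards the leaf level, consuming BFM_FANOUT bits
+ * of the key per level. Returns true and stores the value in *valp if the
+ * key is present. Otherwise *nodep is set to the lowest node reached, or
+ * NULL if the key cannot be in the tree at all.
+ */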
+pg_attribute_always_inline static bool
+bfm_walk(bfm_tree *root, bfm_tree_node **nodep, bfm_value_type *valp, uint64_t key)
+{
+ bfm_tree_node *rnode;
+ bfm_tree_node *cur;
+ uint8 chunk;
+ uint32 shift;
+
+ rnode = root->rnode;
+
+ /* can't be contained in the tree */
+ if (!rnode || key > root->maxval)
+ {
+ *nodep = NULL;
+ return false;
+ }
+
+ shift = rnode->node_shift;
+ chunk = (key >> shift) & BFM_MASK;
+ cur = rnode;
+
+ while (shift > 0)
+ {
+ bfm_tree_node_inner *cur_inner;
+ bfm_tree_node *slot;
+
+ Assert(cur->node_shift > 0); /* leaf nodes look different */
+ Assert(cur->node_shift == shift);
+
+ cur_inner = (bfm_tree_node_inner *) cur;
+
+ slot = bfm_find_one_level_inner(cur_inner, chunk);
+
+ if (slot == NULL)
+ {
+ *nodep = cur;
+ return false;
+ }
+
+ Assert(&slot->parent->b == cur);
+ Assert(slot->node_chunk == chunk);
+
+ cur = slot;
+ shift -= BFM_FANOUT;
+ chunk = (key >> shift) & BFM_MASK;
+ }
+
+ Assert(cur->node_shift == shift && shift == 0);
+
+ *nodep = cur;
+
+ return bfm_find_one_level_leaf((bfm_tree_node_leaf*) cur, chunk, valp);
+}
+
+/*
+ * Redirect parent pointers to oldnode by newnode, for the key chunk
+ * chunk. Used when growing or shrinking nodes.
+ */
+static void
+bfm_redirect(bfm_tree *root, bfm_tree_node *oldnode, bfm_tree_node *newnode, uint8 chunk)
+{
+ bfm_tree_node_inner *parent = oldnode->parent;
+
+ if (parent == NULL)
+ {
+ Assert(root->rnode == oldnode);
+ root->rnode = newnode;
+ return;
+ }
+
+ /* if there is a parent, it needs to be an inner node */
+ Assert(parent->b.node_shift != 0);
+
+ if ((bfm_tree_node_kind) parent->b.kind > BFM_KIND_MAX)
+ pg_unreachable();
+
+ switch((bfm_tree_node_kind) parent->b.kind)
+ {
+ case BFM_KIND_1:
+ {
+ bfm_tree_node_inner_1 *parent_1 =
+ (bfm_tree_node_inner_1 *) parent;
+
+ Assert(parent_1->slot == oldnode);
+ Assert(parent_1->chunk == chunk);
+
+ parent_1->slot = newnode;
+ break;
+ }
+
+ case BFM_KIND_4:
+ {
+ bfm_tree_node_inner_4 *parent_4 =
+ (bfm_tree_node_inner_4 *) parent;
+ int index;
+
+ Assert(parent_4->b.b.count <= 4);
+ index = search_chunk_array_4_eq(parent_4->chunks, chunk, parent_4->b.b.count);
+ Assert(index != -1);
+
+ Assert(parent_4->slots[index] == oldnode);
+ parent_4->slots[index] = newnode;
+
+ break;
+ }
+
+ case BFM_KIND_16:
+ {
+ bfm_tree_node_inner_16 *parent_16 =
+ (bfm_tree_node_inner_16 *) parent;
+ int index;
+
+ index = search_chunk_array_16_eq(parent_16->chunks, chunk, parent_16->b.b.count);
+ Assert(index != -1);
+
+ Assert(parent_16->slots[index] == oldnode);
+ parent_16->slots[index] = newnode;
+ break;
+ }
+
+ case BFM_KIND_32:
+ {
+ bfm_tree_node_inner_32 *parent_32 =
+ (bfm_tree_node_inner_32 *) parent;
+ int index;
+
+ index = search_chunk_array_32_eq(parent_32->chunks, chunk, parent_32->b.b.count);
+ Assert(index != -1);
+
+ Assert(parent_32->slots[index] == oldnode);
+ parent_32->slots[index] = newnode;
+ break;
+ }
+
+ case BFM_KIND_128:
+ {
+ bfm_tree_node_inner_128 *parent_128 =
+ (bfm_tree_node_inner_128 *) parent;
+ uint8 offset;
+
+ offset = parent_128->offsets[chunk];
+ Assert(offset != BFM_TREE_NODE_128_INVALID);
+ Assert(parent_128->slots[offset] == oldnode);
+ parent_128->slots[offset] = newnode;
+ break;
+ }
+
+ case BFM_KIND_MAX:
+ {
+ bfm_tree_node_inner_max *parent_max =
+ (bfm_tree_node_inner_max *) parent;
+
+ Assert(parent_max->slots[chunk] == oldnode);
+ parent_max->slots[chunk] = newnode;
+
+ break;
+ }
+ }
+}
+
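+/*
+ * Copy the common node header (shift, chunk, count, parent) from oldnode to
+ * its newly allocated replacement.
+ */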
+static void
+bfm_node_copy_common(bfm_tree *root, bfm_tree_node *oldnode, bfm_tree_node *newnode)
+{
+ newnode->node_shift = oldnode->node_shift;
+ newnode->node_chunk = oldnode->node_chunk;
+ newnode->count = oldnode->count;
+ newnode->parent = oldnode->parent;
+}
+
+/*
+ * Insert child into node.
+ *
+ * NB: `node` cannot be used after this call anymore, it changes if the node
+ * needs to be grown to fit the insertion.
+ *
+ * FIXME: Find a way to deduplicate with bfm_set_leaf()
+ */
+static void
+bfm_insert_inner(bfm_tree *root, bfm_tree_node_inner *node, bfm_tree_node *child, int child_chunk)
+{
+ Assert(node->b.node_shift != 0); /* is inner node */
+
+ child->node_chunk = child_chunk;
+
+ /* tell the compiler it doesn't need a bounds check */
+ if ((bfm_tree_node_kind) node->b.kind > BFM_KIND_MAX)
+ pg_unreachable();
+
+ switch((bfm_tree_node_kind) node->b.kind)
+ {
+ case BFM_KIND_1:
+ {
+ bfm_tree_node_inner_1 *node_1 =
+ (bfm_tree_node_inner_1 *) node;
+
+ Assert(node_1->b.b.count <= 1);
+
+ if (unlikely(node_1->b.b.count == 1))
+ {
+ /* grow node from 1 -> 4 */
+ bfm_tree_node_inner_4 *newnode_4;
+
+ newnode_4 = bfm_alloc_inner_4(root);
+ bfm_node_copy_common(root, &node->b, &newnode_4->b.b);
+
+ Assert(node_1->slot->parent != NULL);
+ Assert(node_1->slot->parent == node);
+ newnode_4->chunks[0] = node_1->chunk;
+ newnode_4->slots[0] = node_1->slot;
+ node_1->slot->parent = &newnode_4->b;
+
+ bfm_redirect(root, &node->b, &newnode_4->b.b, newnode_4->b.b.node_chunk);
+ bfm_free_inner(root, node);
+ node = &newnode_4->b;
+ }
+ else
+ {
+ child->parent = node;
+ node_1->chunk = child_chunk;
+ node_1->slot = child;
+ break;
+ }
+ }
+ /* fallthrough */
+
+ case BFM_KIND_4:
+ {
+ bfm_tree_node_inner_4 *node_4 =
+ (bfm_tree_node_inner_4 *) node;
+
+ Assert(node_4->b.b.count <= 4);
+ if (unlikely(node_4->b.b.count == 4))
+ {
+ /* grow node from 4 -> 16 */
+ bfm_tree_node_inner_16 *newnode_16;
+
+ newnode_16 = bfm_alloc_inner_16(root);
+ bfm_node_copy_common(root, &node->b, &newnode_16->b.b);
+
+ chunk_slot_array_grow(node_4->chunks, node_4->slots,
+ newnode_16->chunks, newnode_16->slots,
+ &node_4->b, &newnode_16->b);
+
+ bfm_redirect(root, &node->b, &newnode_16->b.b, newnode_16->b.b.node_chunk);
+ bfm_free_inner(root, node);
+ node = &newnode_16->b;
+ }
+ else
+ {
+ int insertpos;
+
+ for (insertpos = 0; insertpos < node_4->b.b.count; insertpos++)
+ if (node_4->chunks[insertpos] >= child_chunk)
+ break;
+
+ child->parent = node;
+
+ memmove(&node_4->slots[insertpos + 1],
+ &node_4->slots[insertpos],
+ (node_4->b.b.count - insertpos) * sizeof(node_4->slots[0]));
+ memmove(&node_4->chunks[insertpos + 1],
+ &node_4->chunks[insertpos],
+ (node_4->b.b.count - insertpos) * sizeof(node_4->chunks[0]));
+
+ node_4->chunks[insertpos] = child_chunk;
+ node_4->slots[insertpos] = child;
+ break;
+ }
+ }
+ /* fallthrough */
+
+ case BFM_KIND_16:
+ {
+ bfm_tree_node_inner_16 *node_16 =
+ (bfm_tree_node_inner_16 *) node;
+
+ Assert(node_16->b.b.count <= 16);
+ if (unlikely(node_16->b.b.count == 16))
+ {
+ /* grow node from 16 -> 32 */
+ bfm_tree_node_inner_32 *newnode_32;
+
+ newnode_32 = bfm_alloc_inner_32(root);
+ bfm_node_copy_common(root, &node->b, &newnode_32->b.b);
+
+ chunk_slot_array_grow(node_16->chunks, node_16->slots,
+ newnode_32->chunks, newnode_32->slots,
+ &node_16->b, &newnode_32->b);
+
+ bfm_redirect(root, &node->b, &newnode_32->b.b, newnode_32->b.b.node_chunk);
+ bfm_free_inner(root, node);
+ node = &newnode_32->b;
+ }
+ else
+ {
+ int insertpos;
+
+ insertpos = search_chunk_array_16_le(node_16->chunks, child_chunk, node_16->b.b.count);
+
+ child->parent = node;
+
+ memmove(&node_16->slots[insertpos + 1],
+ &node_16->slots[insertpos],
+ (node_16->b.b.count - insertpos) * sizeof(node_16->slots[0]));
+ memmove(&node_16->chunks[insertpos + 1],
+ &node_16->chunks[insertpos],
+ (node_16->b.b.count - insertpos) * sizeof(node_16->chunks[0]));
+
+ node_16->chunks[insertpos] = child_chunk;
+ node_16->slots[insertpos] = child;
+ break;
+ }
+ }
+ /* fallthrough */
+
+ case BFM_KIND_32:
+ {
+ bfm_tree_node_inner_32 *node_32 =
+ (bfm_tree_node_inner_32 *) node;
+
+ Assert(node_32->b.b.count <= 32);
+ if (unlikely(node_32->b.b.count == 32))
+ {
+ /* grow node from 32 -> 128 */
+ bfm_tree_node_inner_128 *newnode_128;
+
+ newnode_128 = bfm_alloc_inner_128(root);
+ bfm_node_copy_common(root, &node->b, &newnode_128->b.b);
+
+ memcpy(newnode_128->slots, node_32->slots, sizeof(node_32->slots));
+
+ /* change parent pointers of children */
+ for (int i = 0; i < 32; i++)
+ {
+ Assert(node_32->slots[i]->parent == node);
+ newnode_128->offsets[node_32->chunks[i]] = i;
+ node_32->slots[i]->parent = &newnode_128->b;
+ }
+
+ bfm_redirect(root, &node->b, &newnode_128->b.b, newnode_128->b.b.node_chunk);
+ bfm_free_inner(root, node);
+ node = &newnode_128->b;
+ }
+ else
+ {
+ int insertpos;
+
+ insertpos = search_chunk_array_32_le(node_32->chunks, child_chunk, node_32->b.b.count);
+
+ child->parent = node;
+
+ memmove(&node_32->slots[insertpos + 1],
+ &node_32->slots[insertpos],
+ (node_32->b.b.count - insertpos) * sizeof(node_32->slots[0]));
+ memmove(&node_32->chunks[insertpos + 1],
+ &node_32->chunks[insertpos],
+ (node_32->b.b.count - insertpos) * sizeof(node_32->chunks[0]));
+
+ node_32->chunks[insertpos] = child_chunk;
+ node_32->slots[insertpos] = child;
+ break;
+ }
+ }
+ /* fallthrough */
+
+ case BFM_KIND_128:
+ {
+ bfm_tree_node_inner_128 *node_128 =
+ (bfm_tree_node_inner_128 *) node;
+ uint8 offset;
+
+ Assert(node_128->b.b.count <= 128);
+ if (unlikely(node_128->b.b.count == 128))
+ {
+ /* grow node from 128 -> max */
+ bfm_tree_node_inner_max *newnode_max;
+
+ newnode_max = bfm_alloc_inner_max(root);
+ bfm_node_copy_common(root, &node->b, &newnode_max->b.b);
+
+ for (int i = 0; i < BFM_MAX_CLASS; i++)
+ {
+ uint8 offset = node_128->offsets[i];
+
+ if (offset == BFM_TREE_NODE_128_INVALID)
+ continue;
+
+ Assert(node_128->slots[offset] != NULL);
+ Assert(node_128->slots[offset]->parent == node);
+
+ node_128->slots[offset]->parent = &newnode_max->b;
+
+ newnode_max->slots[i] = node_128->slots[offset];
+ }
+
+ bfm_redirect(root, &node->b, &newnode_max->b.b, newnode_max->b.b.node_chunk);
+ bfm_free_inner(root, node);
+ node = &newnode_max->b;
+ }
+ else
+ {
+ child->parent = node;
+ offset = node_128->b.b.count;
+ /* FIXME: this may overwrite entry if there had been deletions */
+ node_128->offsets[child_chunk] = offset;
+ node_128->slots[offset] = child;
+ break;
+ }
+ }
+ /* fallthrough */
+
+ case BFM_KIND_MAX:
+ {
+ bfm_tree_node_inner_max *node_max =
+ (bfm_tree_node_inner_max *) node;
+
+ Assert(node_max->b.b.count <= (BFM_MAX_CLASS - 1));
+ Assert(node_max->slots[child_chunk] == NULL);
+
+ child->parent = node;
+ node_max->slots[child_chunk] = child;
+
+ break;
+ }
+ }
+
+ node->b.count++;
+}
+
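+/*
+ * The bfm_grow_leaf_* helpers replace a full leaf node with the next larger
+ * size class: copy the existing entries over in sorted (or offset/bitmap)
+ * form, insert the new entry, redirect the parent to the new node and free
+ * the old one.
+ */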
+static bool pg_noinline
+bfm_grow_leaf_1(bfm_tree *root, bfm_tree_node_leaf_1 *node_1,
+ int child_chunk, bfm_value_type val)
+{
+ /* grow node from 1 -> 4 */
+ bfm_tree_node_leaf_4 *newnode_4;
+
+ Assert(node_1->b.b.count == 1);
+
+ newnode_4 = bfm_alloc_leaf_4(root);
+ bfm_node_copy_common(root, &node_1->b.b, &newnode_4->b.b);
+
+ /* copy old & insert new value in the right order */
+ if (child_chunk < node_1->chunk)
+ {
+ newnode_4->chunks[0] = child_chunk;
+ newnode_4->values[0] = val;
+ newnode_4->chunks[1] = node_1->chunk;
+ newnode_4->values[1] = node_1->value;
+ }
+ else
+ {
+ newnode_4->chunks[0] = node_1->chunk;
+ newnode_4->values[0] = node_1->value;
+ newnode_4->chunks[1] = child_chunk;
+ newnode_4->values[1] = val;
+ }
+
+ newnode_4->b.b.count++;
+#ifdef BFM_STATS
+ root->entries++;
+#endif
+
+ bfm_redirect(root, &node_1->b.b, &newnode_4->b.b, newnode_4->b.b.node_chunk);
+ bfm_free_leaf(root, &node_1->b);
+
+ return false;
+}
+
+static bool pg_noinline
+bfm_grow_leaf_4(bfm_tree *root, bfm_tree_node_leaf_4 *node_4,
+ int child_chunk, bfm_value_type val)
+{
+ /* grow node from 4 -> 16 */
+ bfm_tree_node_leaf_16 *newnode_16;
+ int insertpos;
+
+ Assert(node_4->b.b.count == 4);
+
+ newnode_16 = bfm_alloc_leaf_16(root);
+ bfm_node_copy_common(root, &node_4->b.b, &newnode_16->b.b);
+
+ insertpos = search_chunk_array_4_le(node_4->chunks, child_chunk, node_4->b.b.count);
+
+ /* first copy old elements ordering before */
+ memcpy(&newnode_16->chunks[0],
+ &node_4->chunks[0],
+ sizeof(node_4->chunks[0]) * insertpos);
+ memcpy(&newnode_16->values[0],
+ &node_4->values[0],
+ sizeof(node_4->values[0]) * insertpos);
+
+ /* then the new element */
+ newnode_16->chunks[insertpos] = child_chunk;
+ newnode_16->values[insertpos] = val;
+
+ /* and lastly the old elements after */
+ memcpy(&newnode_16->chunks[insertpos + 1],
+ &node_4->chunks[insertpos],
+ (node_4->b.b.count-insertpos) * sizeof(node_4->chunks[0]));
+ memcpy(&newnode_16->values[insertpos + 1],
+ &node_4->values[insertpos],
+ (node_4->b.b.count-insertpos) * sizeof(node_4->values[0]));
+
+ newnode_16->b.b.count++;
+#ifdef BFM_STATS
+ root->entries++;
+#endif
+
+ bfm_redirect(root, &node_4->b.b, &newnode_16->b.b, newnode_16->b.b.node_chunk);
+ bfm_free_leaf(root, &node_4->b);
+
+ return false;
+}
+
+static bool pg_noinline
+bfm_grow_leaf_16(bfm_tree *root, bfm_tree_node_leaf_16 *node_16,
+ int child_chunk, bfm_value_type val)
+{
+ /* grow node from 16 -> 32 */
+ bfm_tree_node_leaf_32 *newnode_32;
+ int insertpos;
+
+ Assert(node_16->b.b.count == 16);
+
+ newnode_32 = bfm_alloc_leaf_32(root);
+ bfm_node_copy_common(root, &node_16->b.b, &newnode_32->b.b);
+
+ insertpos = search_chunk_array_16_le(node_16->chunks, child_chunk, node_16->b.b.count);
+
+ /* first copy old elements ordering before */
+ memcpy(&newnode_32->chunks[0],
+ &node_16->chunks[0],
+ sizeof(node_16->chunks[0]) * insertpos);
+ memcpy(&newnode_32->values[0],
+ &node_16->values[0],
+ sizeof(node_16->values[0]) * insertpos);
+
+ /* then the new element */
+ newnode_32->chunks[insertpos] = child_chunk;
+ newnode_32->values[insertpos] = val;
+
+ /* and lastly the old elements after */
+ memcpy(&newnode_32->chunks[insertpos + 1],
+ &node_16->chunks[insertpos],
+ (node_16->b.b.count-insertpos) * sizeof(node_16->chunks[0]));
+ memcpy(&newnode_32->values[insertpos + 1],
+ &node_16->values[insertpos],
+ (node_16->b.b.count-insertpos) * sizeof(node_16->values[0]));
+
+ newnode_32->b.b.count++;
+#ifdef BFM_STATS
+ root->entries++;
+#endif
+
+ bfm_redirect(root, &node_16->b.b, &newnode_32->b.b, newnode_32->b.b.node_chunk);
+ bfm_free_leaf(root, &node_16->b);
+
+ return false;
+}
+
+static bool pg_noinline
+bfm_grow_leaf_32(bfm_tree *root, bfm_tree_node_leaf_32 *node_32,
+ int child_chunk, bfm_value_type val)
+{
+ /* grow node from 32 -> 128 */
+ bfm_tree_node_leaf_128 *newnode_128;
+ uint8 offset;
+
+ newnode_128 = bfm_alloc_leaf_128(root);
+ bfm_node_copy_common(root, &node_32->b.b, &newnode_128->b.b);
+
+ memcpy(newnode_128->values, node_32->values, sizeof(node_32->values));
+
+ for (int i = 0; i < 32; i++)
+ newnode_128->offsets[node_32->chunks[i]] = i;
+
+ offset = newnode_128->b.b.count;
+ newnode_128->offsets[child_chunk] = offset;
+ newnode_128->values[offset] = val;
+
+ newnode_128->b.b.count++;
+#ifdef BFM_STATS
+ root->entries++;
+#endif
+
+ bfm_redirect(root, &node_32->b.b, &newnode_128->b.b, newnode_128->b.b.node_chunk);
+ bfm_free_leaf(root, &node_32->b);
+
+ return false;
+}
+
+static bool pg_noinline
+bfm_grow_leaf_128(bfm_tree *root, bfm_tree_node_leaf_128 *node_128,
+ int child_chunk, bfm_value_type val)
+{
+ /* grow node from 128 -> max */
+ bfm_tree_node_leaf_max *newnode_max;
+ int i;
+
+ newnode_max = bfm_alloc_leaf_max(root);
+ bfm_node_copy_common(root, &node_128->b.b, &newnode_max->b.b);
+
+ /*
+ * The bitmask manipulation is a surprisingly large portion of the
+ * overhead in the naive implementation. Unrolling the bit manipulation
+ * removes a lot of that overhead.
+ */
+ i = 0;
+ for (int byte = 0; byte < BFM_MAX_CLASS / BITS_PER_BYTE; byte++)
+ {
+ uint8 bitmap = 0;
+
+ for (int bit = 0; bit < BITS_PER_BYTE; bit++)
+ {
+ uint8 offset = node_128->offsets[i];
+
+ if (offset != BFM_TREE_NODE_128_INVALID)
+ {
+ bitmap |= 1 << bit;
+ newnode_max->values[i] = node_128->values[offset];
+ }
+
+ i++;
+ }
+
+ newnode_max->set[byte] = bitmap;
+ }
+
+ bfm_leaf_max_set(newnode_max, child_chunk);
+ newnode_max->values[child_chunk] = val;
+ newnode_max->b.b.count++;
+#ifdef BFM_STATS
+ root->entries++;
+#endif
+
+ bfm_redirect(root, &node_128->b.b, &newnode_max->b.b, newnode_max->b.b.node_chunk);
+ bfm_free_leaf(root, &node_128->b);
+
+ return false;
+}
+
+/*
+ * Set key to val. Return false if entry doesn't yet exist, true if it did.
+ *
+ * See comments to bfm_insert_inner().
+ */
+static bool pg_noinline
+bfm_set_leaf(bfm_tree *root, bfm_key_type key, bfm_value_type val,
+ bfm_tree_node_leaf *node, int child_chunk)
+{
+ Assert(node->b.node_shift == 0); /* is leaf node */
+
+ /* tell the compiler it doesn't need a bounds check */
+ if ((bfm_tree_node_kind) node->b.kind > BFM_KIND_MAX)
+ pg_unreachable();
+
+ switch((bfm_tree_node_kind) node->b.kind)
+ {
+ case BFM_KIND_1:
+ {
+ bfm_tree_node_leaf_1 *node_1 =
+ (bfm_tree_node_leaf_1 *) node;
+
+ Assert(node_1->b.b.count <= 1);
+
+ if (node_1->b.b.count == 1 &&
+ node_1->chunk == child_chunk)
+ {
+ node_1->value = val;
+ return true;
+ }
+ else if (likely(node_1->b.b.count < 1))
+ {
+ node_1->chunk = child_chunk;
+ node_1->value = val;
+ }
+ else
+ return bfm_grow_leaf_1(root, node_1, child_chunk, val);
+
+ break;
+ }
+
+ case BFM_KIND_4:
+ {
+ bfm_tree_node_leaf_4 *node_4 =
+ (bfm_tree_node_leaf_4 *) node;
+ int index;
+
+ Assert(node_4->b.b.count <= 4);
+
+ index = search_chunk_array_4_eq(node_4->chunks, child_chunk, node_4->b.b.count);
+ if (index != -1)
+ {
+ node_4->values[index] = val;
+ return true;
+ }
+
+ if (likely(node_4->b.b.count < 4))
+ {
+ int insertpos;
+
+ insertpos = search_chunk_array_4_le(node_4->chunks, child_chunk, node_4->b.b.count);
+
+ for (int i = node_4->b.b.count - 1; i >= insertpos; i--)
+ {
+ /* workaround for https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101481 */
+#ifdef __GNUC__
+ __asm__("");
+#endif
+ node_4->values[i + 1] = node_4->values[i];
+ node_4->chunks[i + 1] = node_4->chunks[i];
+ }
+
+ node_4->chunks[insertpos] = child_chunk;
+ node_4->values[insertpos] = val;
+ }
+ else
+ return bfm_grow_leaf_4(root, node_4, child_chunk, val);
+
+ break;
+ }
+
+ case BFM_KIND_16:
+ {
+ bfm_tree_node_leaf_16 *node_16 =
+ (bfm_tree_node_leaf_16 *) node;
+ int index;
+
+ Assert(node_16->b.b.count <= 16);
+
+ index = search_chunk_array_16_eq(node_16->chunks, child_chunk, node_16->b.b.count);
+ if (index != -1)
+ {
+ node_16->values[index] = val;
+ return true;
+ }
+
+ if (likely(node_16->b.b.count < 16))
+ {
+ int insertpos;
+
+ insertpos = search_chunk_array_16_le(node_16->chunks, child_chunk, node_16->b.b.count);
+
+ if (node_16->b.b.count > 16 || insertpos > 15)
+ pg_unreachable();
+
+ for (int i = node_16->b.b.count - 1; i >= insertpos; i--)
+ {
+ /* workaround for https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101481 */
+#ifdef __GNUC__
+ __asm__("");
+#endif
+ node_16->values[i + 1] = node_16->values[i];
+ node_16->chunks[i + 1] = node_16->chunks[i];
+ }
+ node_16->chunks[insertpos] = child_chunk;
+ node_16->values[insertpos] = val;
+ }
+ else
+ return bfm_grow_leaf_16(root, node_16, child_chunk, val);
+
+ break;
+ }
+
+ case BFM_KIND_32:
+ {
+ bfm_tree_node_leaf_32 *node_32 =
+ (bfm_tree_node_leaf_32 *) node;
+ int index;
+
+ Assert(node_32->b.b.count <= 32);
+
+ index = search_chunk_array_32_eq(node_32->chunks, child_chunk, node_32->b.b.count);
+ if (index != -1)
+ {
+ node_32->values[index] = val;
+ return true;
+ }
+
+ if (likely(node_32->b.b.count < 32))
+ {
+ int insertpos;
+
+ insertpos = search_chunk_array_32_le(node_32->chunks, child_chunk, node_32->b.b.count);
+
+ if (node_32->b.b.count > 32 || insertpos > 31)
+ pg_unreachable();
+
+ for (int i = node_32->b.b.count - 1; i >= insertpos; i--)
+ {
+ /* workaround for https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101481 */
+#ifdef __GNUC__
+ __asm__("");
+#endif
+ node_32->values[i + 1] = node_32->values[i];
+ node_32->chunks[i + 1] = node_32->chunks[i];
+ }
+ node_32->chunks[insertpos] = child_chunk;
+ node_32->values[insertpos] = val;
+ }
+ else
+ return bfm_grow_leaf_32(root, node_32, child_chunk, val);
+
+ break;
+ }
+
+ case BFM_KIND_128:
+ {
+ bfm_tree_node_leaf_128 *node_128 =
+ (bfm_tree_node_leaf_128 *) node;
+ uint8 offset;
+
+ Assert(node_128->b.b.count <= 128);
+
+ if (node_128->offsets[child_chunk] != BFM_TREE_NODE_128_INVALID)
+ {
+ offset = node_128->offsets[child_chunk];
+ node_128->values[offset] = val;
+
+ return true;
+ }
+ else if (likely(node_128->b.b.count < 128))
+ {
+ offset = node_128->b.b.count;
+ node_128->offsets[child_chunk] = offset;
+ node_128->values[offset] = val;
+ }
+ else
+ return bfm_grow_leaf_128(root, node_128, child_chunk, val);
+
+ break;
+ }
+
+ case BFM_KIND_MAX:
+ {
+ bfm_tree_node_leaf_max *node_max =
+ (bfm_tree_node_leaf_max *) node;
+
+ Assert(node_max->b.b.count <= (BFM_MAX_CLASS - 1));
+
+ if (bfm_leaf_max_isset(node_max, child_chunk))
+ {
+ node_max->values[child_chunk] = val;
+ return true;
+ }
+
+ bfm_leaf_max_set(node_max, child_chunk);
+ node_max->values[child_chunk] = val;
+
+ break;
+ }
+ }
+
+ node->b.count++;
+#ifdef BFM_STATS
+ root->entries++;
+#endif
+
+ return false;
+}
+
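+/*
+ * Create the missing path from cur_inner down to the leaf level for key: a
+ * chain of single-child inner nodes followed by a single-entry leaf holding
+ * val. Used when an insertion reaches an inner node that has no child for
+ * the relevant chunk.
+ */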
+static bool pg_noinline
+bfm_set_extend(bfm_tree *root, bfm_key_type key, bfm_value_type val,
+ bfm_tree_node_inner *cur_inner,
+ uint32 shift, uint8 chunk)
+{
+ bfm_tree_node_leaf_1 *new_leaf_1;
+
+ while (shift > BFM_FANOUT)
+ {
+ bfm_tree_node_inner_1 *new_inner_1;
+
+ Assert(shift == cur_inner->b.node_shift);
+
+ new_inner_1 = bfm_alloc_inner_1(root);
+ new_inner_1->b.b.node_shift = shift - BFM_FANOUT;
+
+ bfm_insert_inner(root, cur_inner, &new_inner_1->b.b, chunk);
+
+ shift -= BFM_FANOUT;
+ chunk = (key >> shift) & BFM_MASK;
+ cur_inner = &new_inner_1->b;
+ }
+
+ Assert(shift == BFM_FANOUT && cur_inner->b.node_shift == BFM_FANOUT);
+
+ new_leaf_1 = bfm_alloc_leaf_1(root);
+ new_leaf_1->b.b.count = 1;
+ new_leaf_1->b.b.node_shift = 0;
+
+ new_leaf_1->chunk = key & BFM_MASK;
+ new_leaf_1->value = val;
+
+#ifdef BFM_STATS
+ root->entries++;
+#endif
+
+ bfm_insert_inner(root, cur_inner, &new_leaf_1->b.b, chunk);
+
+ return false;
+}
+
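+/*
+ * First insertion into an empty tree: create a root node just tall enough
+ * for key, then insert via bfm_set_leaf() or bfm_set_extend().
+ */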
+static bool pg_noinline
+bfm_set_empty(bfm_tree *root, bfm_key_type key, bfm_value_type val)
+{
+ uint32 shift;
+
+ Assert(root->rnode == NULL);
+
+ if (key == 0)
+ shift = 0;
+ else
+ shift = (pg_leftmost_one_pos64(key)/BFM_FANOUT)*BFM_FANOUT;
+
+ if (shift == 0)
+ {
+ bfm_tree_node_leaf_1 *nroot = bfm_alloc_leaf_1(root);
+
+ Assert((key & BFM_MASK) == key);
+
+ nroot->b.b.node_shift = 0;
+ nroot->b.b.node_chunk = 0;
+ nroot->b.b.parent = NULL;
+
+ root->maxval = bfm_maxval_shift(0);
+
+ root->rnode = &nroot->b.b;
+
+ return bfm_set_leaf(root, key, val, &nroot->b, key);
+ }
+ else
+ {
+ bfm_tree_node_inner_1 *nroot = bfm_alloc_inner_1(root);
+
+ nroot->b.b.node_shift = shift;
+ nroot->b.b.node_chunk = 0;
+ nroot->b.b.parent = NULL;
+
+ root->maxval = bfm_maxval_shift(shift);
+ root->rnode = &nroot->b.b;
+
+
+ return bfm_set_extend(root, key, val, &nroot->b,
+ shift, (key >> shift) & BFM_MASK);
+ }
+}
+
+/*
+ * Tree doesn't have sufficient height. Put new tree node(s) on top, move
+ * the old node below it, and then insert.
+ */
+static bool pg_noinline
+bfm_set_shallow(bfm_tree *root, bfm_key_type key, bfm_value_type val)
+{
+ uint32 shift;
+ bfm_tree_node_inner_1 *nroot = NULL;
+
+ Assert(root->rnode != NULL);
+
+ if (key == 0)
+ shift = 0;
+ else
+ shift = (pg_leftmost_one_pos64(key)/BFM_FANOUT)*BFM_FANOUT;
+
+ Assert(root->rnode->node_shift < shift);
+
+ while (unlikely(root->rnode->node_shift < shift))
+ {
+ nroot = bfm_alloc_inner_1(root);
+
+ nroot->slot = root->rnode;
+ nroot->chunk = 0;
+ nroot->b.b.count = 1;
+ nroot->b.b.parent = NULL;
+ nroot->b.b.node_shift = root->rnode->node_shift + BFM_FANOUT;
+
+ root->rnode->parent = &nroot->b;
+ root->rnode = &nroot->b.b;
+
+ root->maxval = bfm_maxval_shift(nroot->b.b.node_shift);
+ }
+
+ Assert(nroot != NULL);
+
+ return bfm_set_extend(root, key, val, &nroot->b,
+ shift, (key >> shift) & BFM_MASK);
+}
+
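+/*
+ * Remove the entry for child_chunk from an inner node. If the node becomes
+ * empty it is freed and the deletion recurses into its parent (or the tree
+ * becomes empty if it was the root).
+ */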
+static void
+bfm_delete_inner(bfm_tree * pg_restrict root, bfm_tree_node_inner * pg_restrict node, bfm_tree_node *pg_restrict child, int child_chunk)
+{
+ switch((bfm_tree_node_kind) node->b.kind)
+ {
+ case BFM_KIND_1:
+ {
+ bfm_tree_node_inner_1 *node_1 =
+ (bfm_tree_node_inner_1 *) node;
+
+ Assert(node_1->slot == child);
+ Assert(node_1->chunk == child_chunk);
+
+ node_1->chunk = 17;
+ node_1->slot = NULL;
+
+ break;
+ }
+
+ case BFM_KIND_4:
+ {
+ bfm_tree_node_inner_4 *node_4 =
+ (bfm_tree_node_inner_4 *) node;
+ int index;
+
+ index = search_chunk_array_4_eq(node_4->chunks, child_chunk, node_4->b.b.count);
+ Assert(index != -1);
+
+ Assert(node_4->slots[index] == child);
+ memmove(&node_4->slots[index],
+ &node_4->slots[index + 1],
+ (node_4->b.b.count-index-1) * sizeof(void*));
+ memmove(&node_4->chunks[index],
+ &node_4->chunks[index + 1],
+ node_4->b.b.count-index-1);
+
+ node_4->chunks[node_4->b.b.count - 1] = BFM_TREE_NODE_INNER_4_INVALID;
+ node_4->slots[node_4->b.b.count - 1] = NULL;
+
+ break;
+ }
+
+ case BFM_KIND_16:
+ {
+ bfm_tree_node_inner_16 *node_16 =
+ (bfm_tree_node_inner_16 *) node;
+ int index;
+
+ index = search_chunk_array_16_eq(node_16->chunks, child_chunk, node_16->b.b.count);
+ Assert(index != -1);
+
+ Assert(node_16->slots[index] == child);
+ memmove(&node_16->slots[index],
+ &node_16->slots[index + 1],
+ (node_16->b.b.count - index - 1) * sizeof(node_16->slots[0]));
+ memmove(&node_16->chunks[index],
+ &node_16->chunks[index + 1],
+ (node_16->b.b.count - index - 1) * sizeof(node_16->chunks[0]));
+
+ node_16->chunks[node_16->b.b.count - 1] = BFM_TREE_NODE_INNER_16_INVALID;
+ node_16->slots[node_16->b.b.count - 1] = NULL;
+
+ break;
+ }
+
+ case BFM_KIND_32:
+ {
+ bfm_tree_node_inner_32 *node_32 =
+ (bfm_tree_node_inner_32 *) node;
+ int index;
+
+ index = search_chunk_array_32_eq(node_32->chunks, child_chunk, node_32->b.b.count);
+ Assert(index != -1);
+
+ Assert(node_32->slots[index] == child);
+ memmove(&node_32->slots[index],
+ &node_32->slots[index + 1],
+ (node_32->b.b.count - index - 1) * sizeof(node_32->slots[0]));
+ memmove(&node_32->chunks[index],
+ &node_32->chunks[index + 1],
+ (node_32->b.b.count - index - 1) * sizeof(node_32->chunks[0]));
+
+ node_32->chunks[node_32->b.b.count - 1] = BFM_TREE_NODE_INNER_32_INVALID;
+ node_32->slots[node_32->b.b.count - 1] = NULL;
+
+ break;
+ }
+
+ case BFM_KIND_128:
+ {
+ bfm_tree_node_inner_128 *node_128 =
+ (bfm_tree_node_inner_128 *) node;
+ uint8 offset;
+
+ offset = node_128->offsets[child_chunk];
+ Assert(offset != BFM_TREE_NODE_128_INVALID);
+ Assert(node_128->slots[offset] == child);
+ node_128->offsets[child_chunk] = BFM_TREE_NODE_128_INVALID;
+ node_128->slots[offset] = NULL;
+ break;
+ }
+
+ case BFM_KIND_MAX:
+ {
+ bfm_tree_node_inner_max *node_max =
+ (bfm_tree_node_inner_max *) node;
+
+ Assert(node_max->slots[child_chunk] == child);
+ node_max->slots[child_chunk] = NULL;
+
+ break;
+ }
+ }
+
+ node->b.count--;
+
+ if (node->b.count == 0)
+ {
+ if (node->b.parent)
+ bfm_delete_inner(root, node->b.parent, &node->b, node->b.node_chunk);
+ else
+ root->rnode = NULL;
+ bfm_free_inner(root, node);
+ }
+}
+
+/*
+ * NB: After this call node cannot be used anymore, it may have been freed or
+ * shrunk.
+ *
+ * FIXME: this should implement shrinking of nodes
+ */
+static void pg_noinline
+bfm_delete_leaf(bfm_tree * pg_restrict root, bfm_tree_node_leaf *pg_restrict node, int child_chunk)
+{
+ /* tell the compiler it doesn't need a bounds check */
+ if ((bfm_tree_node_kind) node->b.kind > BFM_KIND_MAX)
+ pg_unreachable();
+
+ switch((bfm_tree_node_kind) node->b.kind)
+ {
+ case BFM_KIND_1:
+ {
+ bfm_tree_node_leaf_1 *node_1 =
+ (bfm_tree_node_leaf_1 *) node;
+
+ Assert(node_1->chunk == child_chunk);
+
+ node_1->chunk = 17;
+ break;
+ }
+
+ case BFM_KIND_4:
+ {
+ bfm_tree_node_leaf_4 *node_4 =
+ (bfm_tree_node_leaf_4 *) node;
+ int index;
+
+ index = search_chunk_array_4_eq(node_4->chunks, child_chunk, node_4->b.b.count);
+ Assert(index != -1);
+
+ memmove(&node_4->values[index],
+ &node_4->values[index + 1],
+ (node_4->b.b.count - index - 1) * sizeof(node_4->values[0]));
+ memmove(&node_4->chunks[index],
+ &node_4->chunks[index + 1],
+ (node_4->b.b.count - index - 1) * sizeof(node_4->chunks[0]));
+
+ node_4->chunks[node_4->b.b.count - 1] = BFM_TREE_NODE_INNER_4_INVALID;
+ node_4->values[node_4->b.b.count - 1] = 0xFF;
+
+ break;
+ }
+
+ case BFM_KIND_16:
+ {
+ bfm_tree_node_leaf_16 *node_16 =
+ (bfm_tree_node_leaf_16 *) node;
+ int index;
+
+ index = search_chunk_array_16_eq(node_16->chunks, child_chunk, node_16->b.b.count);
+ Assert(index != -1);
+
+ memmove(&node_16->values[index],
+ &node_16->values[index + 1],
+ (node_16->b.b.count - index - 1) * sizeof(node_16->values[0]));
+ memmove(&node_16->chunks[index],
+ &node_16->chunks[index + 1],
+ (node_16->b.b.count - index - 1) * sizeof(node_16->chunks[0]));
+
+ node_16->chunks[node_16->b.b.count - 1] = BFM_TREE_NODE_INNER_16_INVALID;
+ node_16->values[node_16->b.b.count - 1] = 0xFF;
+
+ break;
+ }
+
+ case BFM_KIND_32:
+ {
+ bfm_tree_node_leaf_32 *node_32 =
+ (bfm_tree_node_leaf_32 *) node;
+ int index;
+
+ index = search_chunk_array_32_eq(node_32->chunks, child_chunk, node_32->b.b.count);
+ Assert(index != -1);
+
+ memmove(&node_32->values[index],
+ &node_32->values[index + 1],
+ (node_32->b.b.count - index - 1) * sizeof(node_32->values[0]));
+ memmove(&node_32->chunks[index],
+ &node_32->chunks[index + 1],
+ (node_32->b.b.count - index - 1) * sizeof(node_32->chunks[0]));
+
+ node_32->chunks[node_32->b.b.count - 1] = BFM_TREE_NODE_INNER_32_INVALID;
+ node_32->values[node_32->b.b.count - 1] = 0xFF;
+
+ break;
+ }
+
+ case BFM_KIND_128:
+ {
+ bfm_tree_node_leaf_128 *node_128 =
+ (bfm_tree_node_leaf_128 *) node;
+
+ Assert(node_128->offsets[child_chunk] != BFM_TREE_NODE_128_INVALID);
+ node_128->offsets[child_chunk] = BFM_TREE_NODE_128_INVALID;
+ break;
+ }
+
+ case BFM_KIND_MAX:
+ {
+ bfm_tree_node_leaf_max *node_max =
+ (bfm_tree_node_leaf_max *) node;
+
+ Assert(bfm_leaf_max_isset(node_max, child_chunk));
+ bfm_leaf_max_unset(node_max, child_chunk);
+
+ break;
+ }
+ }
+
+#ifdef BFM_STATS
+ root->entries--;
+#endif
+ node->b.count--;
+
+ if (node->b.count == 0)
+ {
+ if (node->b.parent)
+ bfm_delete_inner(root, node->b.parent, &node->b, node->b.node_chunk);
+ else
+ root->rnode = NULL;
+ bfm_free_leaf(root, node);
+ }
+}
+
+void
+bfm_init(bfm_tree *root)
+{
+ memset(root, 0, sizeof(*root));
+
+#if 1
+ root->context = AllocSetContextCreate(CurrentMemoryContext, "radix bench internal",
+ ALLOCSET_DEFAULT_SIZES);
+#else
+ root->context = CurrentMemoryContext;
+#endif
+
+#ifdef BFM_USE_SLAB
+ for (int i = 0; i < BFM_KIND_COUNT; i++)
+ {
+ root->inner_slabs[i] = SlabContextCreate(root->context,
+ inner_class_info[i].name,
+ Max(pg_nextpower2_32((MAXALIGN(inner_class_info[i].size) + 16) * 32), 1024),
+ inner_class_info[i].size);
+ root->leaf_slabs[i] = SlabContextCreate(root->context,
+ leaf_class_info[i].name,
+ Max(pg_nextpower2_32((MAXALIGN(leaf_class_info[i].size) + 16) * 32), 1024),
+ leaf_class_info[i].size);
+#if 0
+ elog(LOG, "%s %s size original %zu, mult %zu, round %u",
+ "leaf",
+ leaf_class_info[i].name,
+ leaf_class_info[i].size,
+ leaf_class_info[i].size * 32,
+ pg_nextpower2_32(leaf_class_info[i].size * 32));
+#endif
+ }
+#endif
+
+ /*
+ * XXX: Might be worth to always allocate a root node, to avoid related
+ * branches?
+ */
+}
+
+bool
+bfm_lookup(bfm_tree *root, uint64_t key, bfm_value_type *val)
+{
+ bfm_tree_node *node;
+
+ return bfm_walk(root, &node, val, key);
+}
+
+/*
+ * Set key to val. Returns false if entry doesn't yet exist, true if it did.
+ */
+bool
+bfm_set(bfm_tree *root, bfm_key_type key, bfm_value_type val)
+{
+ bfm_tree_node *cur;
+ bfm_tree_node_leaf *target;
+ uint8 chunk;
+ uint32 shift;
+
+ if (unlikely(!root->rnode))
+ return bfm_set_empty(root, key, val);
+ else if (key > root->maxval)
+ return bfm_set_shallow(root, key, val);
+
+ shift = root->rnode->node_shift;
+ chunk = (key >> shift) & BFM_MASK;
+ cur = root->rnode;
+
+ while (shift > 0)
+ {
+ bfm_tree_node_inner *cur_inner;
+ bfm_tree_node *slot;
+
+ Assert(cur->node_shift > 0); /* leaf nodes look different */
+ Assert(cur->node_shift == shift);
+
+ cur_inner = (bfm_tree_node_inner *) cur;
+
+ slot = bfm_find_one_level_inner(cur_inner, chunk);
+
+ if (slot == NULL)
+ return bfm_set_extend(root, key, val, cur_inner, shift, chunk);
+
+ Assert(&slot->parent->b == cur);
+ Assert(slot->node_chunk == chunk);
+
+ cur = slot;
+ shift -= BFM_FANOUT;
+ chunk = (key >> shift) & BFM_MASK;
+ }
+
+ Assert(shift == 0 && cur->node_shift == 0);
+
+ target = (bfm_tree_node_leaf *) cur;
+
+ /*
+ * FIXME: what is the best API to deal with existing values? Overwrite?
+ * Overwrite and return old value? Just return true?
+ */
+ return bfm_set_leaf(root, key, val, target, chunk);
+}
+
+bool
+bfm_delete(bfm_tree *root, uint64 key)
+{
+ bfm_tree_node *node;
+ bfm_value_type val;
+
+ if (!bfm_walk(root, &node, &val, key))
+ return false;
+
+ Assert(node != NULL && node->node_shift == 0);
+
+ /* recurses upwards and deletes parent nodes if necessary */
+ bfm_delete_leaf(root, (bfm_tree_node_leaf *) node, key & BFM_MASK);
+
+ return true;
+}
+
+
+StringInfo
+bfm_stats(bfm_tree *root)
+{
+ StringInfo s;
+#ifdef BFM_STATS
+ size_t total;
+ size_t inner_bytes;
+ size_t leaf_bytes;
+ size_t allocator_bytes;
+#endif
+
+ s = makeStringInfo();
+
+ /* FIXME: Some of the below could be printed even without BFM_STATS */
+#ifdef BFM_STATS
+ appendStringInfo(s, "%zu entries and depth %d\n",
+ root->entries,
+ root->rnode ? root->rnode->node_shift / BFM_FANOUT : 0);
+
+ {
+ appendStringInfo(s, "\tinner nodes:");
+ total = 0;
+ inner_bytes = 0;
+ for (int i = 0; i < BFM_KIND_COUNT; i++)
+ {
+ total += root->inner_nodes[i];
+ inner_bytes += inner_class_info[i].size * root->inner_nodes[i];
+ appendStringInfo(s, " %s: %zu, ",
+ inner_class_info[i].name,
+ root->inner_nodes[i]);
+ }
+ appendStringInfo(s, " total: %zu, total_bytes: %zu\n", total,
+ inner_bytes);
+ }
+
+ {
+ appendStringInfo(s, "\tleaf nodes:");
+ total = 0;
+ leaf_bytes = 0;
+ for (int i = 0; i < BFM_KIND_COUNT; i++)
+ {
+ total += root->leaf_nodes[i];
+ leaf_bytes += leaf_class_info[i].size * root->leaf_nodes[i];
+ appendStringInfo(s, " %s: %zu, ",
+ leaf_class_info[i].name,
+ root->leaf_nodes[i]);
+ }
+ appendStringInfo(s, " total: %zu, total_bytes: %zu\n", total,
+ leaf_bytes);
+ }
+
+ allocator_bytes = MemoryContextMemAllocated(root->context, true);
+
+ appendStringInfo(s, "\t%.2f MB excluding allocator overhead, %.2f MiB including\n",
+ (inner_bytes + leaf_bytes) / (double) (1024 * 1024),
+ allocator_bytes / (double) (1024 * 1024));
+ appendStringInfo(s, "\t%.2f bytes/entry excluding allocator overhead\n",
+ root->entries > 0 ?
+ (inner_bytes + leaf_bytes)/(double)root->entries : 0);
+ appendStringInfo(s, "\t%.2f bytes/entry including allocator overhead\n",
+ root->entries > 0 ?
+ allocator_bytes/(double)root->entries : 0);
+#endif
+
+ if (0)
+ bfm_print(root);
+
+ return s;
+}
+
+static void
+bfm_print_node(StringInfo s, int indent, bfm_value_type key, bfm_tree_node *node);
+
+static void
+bfm_print_node_child(StringInfo s, int indent, bfm_value_type key, bfm_tree_node *node,
+ int i, uint8 chunk, bfm_tree_node *child)
+{
+ appendStringInfoSpaces(s, indent + 2);
+ appendStringInfo(s, "%u: child chunk: 0x%.2X, child: %p\n",
+ i, chunk, child);
+ key |= ((uint64) chunk) << node->node_shift;
+
+ bfm_print_node(s, indent + 4, key, child);
+}
+
+static void
+bfm_print_value(StringInfo s, int indent, bfm_value_type key, bfm_tree_node *node,
+ int i, uint8 chunk, bfm_value_type value)
+{
+ key |= chunk;
+
+ appendStringInfoSpaces(s, indent + 2);
+ appendStringInfo(s, "%u: chunk: 0x%.2X, key: 0x%llX/%llu, value: 0x%llX/%llu\n",
+ i,
+ chunk,
+ (unsigned long long) key,
+ (unsigned long long) key,
+ (unsigned long long) value,
+ (unsigned long long) value);
+}
+
+static void
+bfm_print_node(StringInfo s, int indent, bfm_value_type key, bfm_tree_node *node)
+{
+ appendStringInfoSpaces(s, indent);
+ appendStringInfo(s, "%s: kind %d, children: %u, shift: %u, node chunk: 0x%.2X, partial key: 0x%llX\n",
+ node->node_shift != 0 ? "inner" : "leaf",
+ node->kind,
+ node->count,
+ node->node_shift,
+ node->node_chunk,
+ (long long unsigned) key);
+
+ if (node->node_shift != 0)
+ {
+ bfm_tree_node_inner *inner = (bfm_tree_node_inner *) node;
+
+ switch((bfm_tree_node_kind) inner->b.kind)
+ {
+ case BFM_KIND_1:
+ {
+ bfm_tree_node_inner_1 *node_1 =
+ (bfm_tree_node_inner_1 *) node;
+
+ if (node_1->b.b.count > 0)
+ bfm_print_node_child(s, indent, key, node,
+ 0, node_1->chunk, node_1->slot);
+
+ break;
+ }
+
+ case BFM_KIND_4:
+ {
+ bfm_tree_node_inner_4 *node_4 =
+ (bfm_tree_node_inner_4 *) node;
+
+ for (int i = 0; i < node_4->b.b.count; i++)
+ {
+ bfm_print_node_child(s, indent, key, node,
+ i, node_4->chunks[i], node_4->slots[i]);
+ }
+
+ break;
+ }
+
+ case BFM_KIND_16:
+ {
+ bfm_tree_node_inner_16 *node_16 =
+ (bfm_tree_node_inner_16 *) node;
+
+ for (int i = 0; i < node_16->b.b.count; i++)
+ {
+ bfm_print_node_child(s, indent, key, node,
+ i, node_16->chunks[i], node_16->slots[i]);
+ }
+
+ break;
+ }
+
+ case BFM_KIND_32:
+ {
+ bfm_tree_node_inner_32 *node_32 =
+ (bfm_tree_node_inner_32 *) node;
+
+ for (int i = 0; i < node_32->b.b.count; i++)
+ {
+ bfm_print_node_child(s, indent, key, node,
+ i, node_32->chunks[i], node_32->slots[i]);
+ }
+
+ break;
+ }
+
+ case BFM_KIND_128:
+ {
+ bfm_tree_node_inner_128 *node_128 =
+ (bfm_tree_node_inner_128 *) node;
+
+ for (int i = 0; i < BFM_MAX_CLASS; i++)
+ {
+ uint8 offset = node_128->offsets[i];
+
+ if (offset == BFM_TREE_NODE_128_INVALID)
+ continue;
+
+ bfm_print_node_child(s, indent, key, node,
+ offset, i, node_128->slots[offset]);
+ }
+
+ break;
+ }
+
+ case BFM_KIND_MAX:
+ {
+ bfm_tree_node_inner_max *node_max =
+ (bfm_tree_node_inner_max *) node;
+
+ for (int i = 0; i < BFM_MAX_CLASS; i++)
+ {
+ if (node_max->slots[i] == NULL)
+ continue;
+
+ bfm_print_node_child(s, indent, key, node,
+ i, i, node_max->slots[i]);
+ }
+
+ break;
+ }
+ }
+ }
+ else
+ {
+ bfm_tree_node_leaf *leaf = (bfm_tree_node_leaf *) node;
+
+ switch((bfm_tree_node_kind) leaf->b.kind)
+ {
+ case BFM_KIND_1:
+ {
+ bfm_tree_node_leaf_1 *node_1 =
+ (bfm_tree_node_leaf_1 *) node;
+
+ if (node_1->b.b.count > 0)
+ bfm_print_value(s, indent, key, node,
+ 0, node_1->chunk, node_1->value);
+
+ break;
+ }
+
+ case BFM_KIND_4:
+ {
+ bfm_tree_node_leaf_4 *node_4 =
+ (bfm_tree_node_leaf_4 *) node;
+
+ for (int i = 0; i < node_4->b.b.count; i++)
+ {
+ bfm_print_value(s, indent, key, node,
+ i, node_4->chunks[i], node_4->values[i]);
+ }
+
+ break;
+ }
+
+ case BFM_KIND_16:
+ {
+ bfm_tree_node_leaf_16 *node_16 =
+ (bfm_tree_node_leaf_16 *) node;
+
+ for (int i = 0; i < node_16->b.b.count; i++)
+ {
+ bfm_print_value(s, indent, key, node,
+ i, node_16->chunks[i], node_16->values[i]);
+ }
+
+ break;
+ }
+
+ case BFM_KIND_32:
+ {
+ bfm_tree_node_leaf_32 *node_32 =
+ (bfm_tree_node_leaf_32 *) node;
+
+ for (int i = 0; i < node_32->b.b.count; i++)
+ {
+ bfm_print_value(s, indent, key, node,
+ i, node_32->chunks[i], node_32->values[i]);
+ }
+
+ break;
+ }
+
+ case BFM_KIND_128:
+ {
+ bfm_tree_node_leaf_128 *node_128 =
+ (bfm_tree_node_leaf_128 *) node;
+
+ for (int i = 0; i < BFM_MAX_CLASS; i++)
+ {
+ uint8 offset = node_128->offsets[i];
+
+ if (offset == BFM_TREE_NODE_128_INVALID)
+ continue;
+
+ bfm_print_value(s, indent, key, node,
+ offset, i, node_128->values[offset]);
+ }
+
+ break;
+ }
+
+ case BFM_KIND_MAX:
+ {
+ bfm_tree_node_leaf_max *node_max =
+ (bfm_tree_node_leaf_max *) node;
+
+ for (int i = 0; i < BFM_MAX_CLASS; i++)
+ {
+ if (!bfm_leaf_max_isset(node_max, i))
+ continue;
+
+ bfm_print_value(s, indent, key, node,
+ i, i, node_max->values[i]);
+ }
+
+ break;
+ }
+ }
+ }
+}
+
+void
+bfm_print(bfm_tree *root)
+{
+ StringInfoData s;
+
+ initStringInfo(&s);
+
+ if (root->rnode)
+ bfm_print_node(&s, 0 /* indent */, 0 /* key */, root->rnode);
+
+ elog(LOG, "radix debug print:\n%s", s.data);
+ pfree(s.data);
+}
+
+
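+/*
+ * Minimal test harness: the EXPECT_* macros ERROR out with file and line
+ * information on the first failed check.
+ */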
+#define EXPECT_TRUE(expr) \
+ do { \
+ if (!(expr)) \
+ elog(ERROR, \
+ "%s was unexpectedly false in file \"%s\" line %u", \
+ #expr, __FILE__, __LINE__); \
+ } while (0)
+
+#define EXPECT_FALSE(expr) \
+ do { \
+ if (expr) \
+ elog(ERROR, \
+ "%s was unexpectedly true in file \"%s\" line %u", \
+ #expr, __FILE__, __LINE__); \
+ } while (0)
+
+#define EXPECT_EQ_U32(result_expr, expected_expr) \
+ do { \
+ uint32 result = (result_expr); \
+ uint32 expected = (expected_expr); \
+ if (result != expected) \
+ elog(ERROR, \
+ "%s yielded %u, expected %s in file \"%s\" line %u", \
+ #result_expr, result, #expected_expr, __FILE__, __LINE__); \
+ } while (0)
+
+static void
+bfm_test_insert_leaf_grow(bfm_tree *root)
+{
+ bfm_value_type val;
+
+ /* 0->1 */
+ EXPECT_FALSE(bfm_set(root, 0, 0+3));
+ EXPECT_TRUE(bfm_lookup(root, 0, &val));
+ EXPECT_EQ_U32(val, 0+3);
+
+ /* node 1->4 */
+ for (int i = 1; i < 4; i++)
+ {
+ EXPECT_FALSE(bfm_set(root, i, i+3));
+ }
+ for (int i = 0; i < 4; i++)
+ {
+ EXPECT_TRUE(bfm_lookup(root, i, &val));
+ EXPECT_EQ_U32(val, i+3);
+ }
+
+ /* node 4->16, reverse order, for giggles */
+ for (int i = 15; i >= 4; i--)
+ {
+ EXPECT_FALSE(bfm_set(root, i, i+3));
+ }
+ for (int i = 0; i < 16; i++)
+ {
+ EXPECT_TRUE(bfm_lookup(root, i, &val));
+ EXPECT_EQ_U32(val, i+3);
+ }
+
+ /* node 16->32 */
+ for (int i = 16; i < 32; i++)
+ {
+ EXPECT_FALSE(bfm_set(root, i, i+3));
+ }
+ for (int i = 0; i < 32; i++)
+ {
+ EXPECT_TRUE(bfm_lookup(root, i, &val));
+ EXPECT_EQ_U32(val, i+3);
+ }
+
+ /* node 32->128 */
+ for (int i = 32; i < 128; i++)
+ {
+ EXPECT_FALSE(bfm_set(root, i, i+3));
+ }
+ for (int i = 0; i < 128; i++)
+ {
+ EXPECT_TRUE(bfm_lookup(root, i, &val));
+ EXPECT_EQ_U32(val, i+3);
+ }
+
+ /* node 128->max */
+ for (int i = 128; i < BFM_MAX_CLASS; i++)
+ {
+ EXPECT_FALSE(bfm_set(root, i, i+3));
+ }
+ for (int i = 0; i < BFM_MAX_CLASS; i++)
+ {
+ EXPECT_TRUE(bfm_lookup(root, i, &val));
+ EXPECT_EQ_U32(val, i+3);
+ }
+
+}
+
+static void
+bfm_test_insert_inner_grow(void)
+{
+ bfm_tree root;
+ bfm_value_type val;
+ bfm_value_type cur;
+
+ bfm_init(&root);
+
+ cur = 1025;
+
+ while (!root.rnode ||
+ root.rnode->node_shift == 0 ||
+ root.rnode->count < 4)
+ {
+ EXPECT_FALSE(bfm_set(&root, cur, -cur));
+ cur += BFM_MAX_CLASS;
+ }
+
+ for (int i = 1025; i < cur; i += BFM_MAX_CLASS)
+ {
+ EXPECT_TRUE(bfm_lookup(&root, i, &val));
+ EXPECT_EQ_U32(val, -i);
+ }
+
+ while (root.rnode->count < 32)
+ {
+ EXPECT_FALSE(bfm_set(&root, cur, -cur));
+ cur += BFM_MAX_CLASS;
+ }
+
+ for (int i = 1025; i < cur; i += BFM_MAX_CLASS)
+ {
+ EXPECT_TRUE(bfm_lookup(&root, i, &val));
+ EXPECT_EQ_U32(val, -i);
+ }
+
+ while (root.rnode->count < 128)
+ {
+ EXPECT_FALSE(bfm_set(&root, cur, -cur));
+ cur += BFM_MAX_CLASS;
+ }
+
+ for (int i = 1025; i < cur; i += BFM_MAX_CLASS)
+ {
+ EXPECT_TRUE(bfm_lookup(&root, i, &val));
+ EXPECT_EQ_U32(val, -i);
+ }
+
+ while (root.rnode->count < BFM_MAX_CLASS)
+ {
+ EXPECT_FALSE(bfm_set(&root, cur, -cur));
+ cur += BFM_MAX_CLASS;
+ }
+
+ for (int i = 1025; i < cur; i += BFM_MAX_CLASS)
+ {
+ EXPECT_TRUE(bfm_lookup(&root, i, &val));
+ EXPECT_EQ_U32(val, -i);
+ }
+
+ while (root.rnode->count == BFM_MAX_CLASS)
+ {
+ EXPECT_FALSE(bfm_set(&root, cur, -cur));
+ cur += BFM_MAX_CLASS;
+ }
+
+ for (int i = 1025; i < cur; i += BFM_MAX_CLASS)
+ {
+ EXPECT_TRUE(bfm_lookup(&root, i, &val));
+ EXPECT_EQ_U32(val, -i);
+ }
+
+}
+
+static void
+bfm_test_delete_lots(void)
+{
+ bfm_tree root;
+ bfm_value_type val;
+ bfm_key_type insertval;
+
+ bfm_init(&root);
+
+ insertval = 0;
+ while (!root.rnode ||
+ root.rnode->node_shift != (BFM_FANOUT * 2))
+ {
+ EXPECT_FALSE(bfm_set(&root, insertval, -insertval));
+ insertval++;
+ }
+
+ for (bfm_key_type i = 0; i < insertval; i++)
+ {
+ EXPECT_TRUE(bfm_lookup(&root, i, &val));
+ EXPECT_EQ_U32(val, -i);
+ EXPECT_TRUE(bfm_delete(&root, i));
+ EXPECT_FALSE(bfm_lookup(&root, i, &val));
+ }
+
+ EXPECT_TRUE(root.rnode == NULL);
+}
+
+#include "portability/instr_time.h"
+
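+/*
+ * Simple micro-benchmark: ordered insertions, lookups and deletions of
+ * `count` keys, reporting throughput and the node statistics from
+ * bfm_stats().
+ */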
+static void
+bfm_test_insert_bulk(int count)
+{
+ bfm_tree root;
+ bfm_value_type val;
+ instr_time start, end, diff;
+ int misses;
+ int mult = 1;
+
+ bfm_init(&root);
+
+ INSTR_TIME_SET_CURRENT(start);
+
+ for (int i = 0; i < count; i++)
+ bfm_set(&root, i*mult, -i);
+
+ INSTR_TIME_SET_CURRENT(end);
+ INSTR_TIME_SET_ZERO(diff);
+ INSTR_TIME_ACCUM_DIFF(diff, end, start);
+
+ elog(NOTICE, "%d ordered insertions in %f seconds, %d/sec",
+ count,
+ INSTR_TIME_GET_DOUBLE(diff),
+ (int)(count/INSTR_TIME_GET_DOUBLE(diff)));
+
+ INSTR_TIME_SET_CURRENT(start);
+
+ misses = 0;
+ for (int i = 0; i < count; i++)
+ {
+ if (unlikely(!bfm_lookup(&root, i*mult, &val)))
+ misses++;
+ }
+ if (misses > 0)
+ elog(ERROR, "not present for lookup: %d entries", misses);
+
+ INSTR_TIME_SET_CURRENT(end);
+ INSTR_TIME_SET_ZERO(diff);
+ INSTR_TIME_ACCUM_DIFF(diff, end, start);
+
+ elog(NOTICE, "%d ordered lookups in %f seconds, %d/sec",
+ count,
+ INSTR_TIME_GET_DOUBLE(diff),
+ (int)(count/INSTR_TIME_GET_DOUBLE(diff)));
+
+ elog(LOG, "stats after lookup are: %s",
+ bfm_stats(&root)->data);
+
+ INSTR_TIME_SET_CURRENT(start);
+
+ misses = 0;
+ for (int i = 0; i < count; i++)
+ {
+ if (unlikely(!bfm_delete(&root, i*mult)))
+ misses++;
+ }
+ if (misses > 0)
+ elog(ERROR, "not present for deletion: %d entries", misses);
+
+ INSTR_TIME_SET_CURRENT(end);
+ INSTR_TIME_SET_ZERO(diff);
+ INSTR_TIME_ACCUM_DIFF(diff, end, start);
+
+ elog(NOTICE, "%d ordered deletions in %f seconds, %d/sec",
+ count,
+ INSTR_TIME_GET_DOUBLE(diff),
+ (int)(count/INSTR_TIME_GET_DOUBLE(diff)));
+
+ elog(LOG, "stats after deletion are: %s",
+ bfm_stats(&root)->data);
+}
+
+void
+bfm_tests(void)
+{
+ bfm_tree root;
+ bfm_value_type val;
+
+ /* initialize a tree starting with a large value */
+ bfm_init(&root);
+ EXPECT_FALSE(bfm_set(&root, 1024, 1));
+ EXPECT_TRUE(bfm_lookup(&root, 1024, &val));
+ EXPECT_EQ_U32(val, 1);
+ /* there should only be the key we inserted */
+#ifdef BFM_STATS
+ EXPECT_EQ_U32(root.leaf_nodes[0], 1);
+#endif
+
+ /* check that we can subsequently insert a small value */
+ EXPECT_FALSE(bfm_set(&root, 1, 2));
+ EXPECT_TRUE(bfm_lookup(&root, 1, &val));
+ EXPECT_EQ_U32(val, 2);
+ EXPECT_TRUE(bfm_lookup(&root, 1024, &val));
+ EXPECT_EQ_U32(val, 1);
+
+ /* check that a 0 key and 0 value are correctly recognized */
+ bfm_init(&root);
+ EXPECT_FALSE(bfm_lookup(&root, 0, &val));
+ EXPECT_FALSE(bfm_set(&root, 0, 17));
+ EXPECT_TRUE(bfm_lookup(&root, 0, &val));
+ EXPECT_EQ_U32(val, 17);
+
+ EXPECT_FALSE(bfm_lookup(&root, 2, &val));
+ EXPECT_FALSE(bfm_set(&root, 2, 0));
+ EXPECT_TRUE(bfm_lookup(&root, 2, &val));
+ EXPECT_EQ_U32(val, 0);
+
+ /* check that repeated insertion of the same key updates value */
+ bfm_init(&root);
+ EXPECT_FALSE(bfm_set(&root, 9, 12));
+ EXPECT_TRUE(bfm_lookup(&root, 9, &val));
+ EXPECT_EQ_U32(val, 12);
+ EXPECT_TRUE(bfm_set(&root, 9, 13));
+ EXPECT_TRUE(bfm_lookup(&root, 9, &val));
+ EXPECT_EQ_U32(val, 13);
+
+
+ /* initialize a tree starting with a leaf value */
+ bfm_init(&root);
+ EXPECT_FALSE(bfm_set(&root, 3, 1));
+ EXPECT_TRUE(bfm_lookup(&root, 3, &val));
+ EXPECT_EQ_U32(val, 1);
+ /* there should only be the key we inserted */
+#ifdef BFM_STATS
+ EXPECT_EQ_U32(root.leaf_nodes[0], 1);
+#endif
+ /* and no inner ones */
+#ifdef BFM_STATS
+ EXPECT_EQ_U32(root.inner_nodes[0], 0);
+#endif
+
+ EXPECT_FALSE(bfm_set(&root, 1717, 17));
+ EXPECT_TRUE(bfm_lookup(&root, 1717, &val));
+ EXPECT_EQ_U32(val, 17);
+
+ /* check that a root leaf node grows correctly */
+ bfm_init(&root);
+ bfm_test_insert_leaf_grow(&root);
+
+ /* check that a non-root leaf node grows correctly */
+ bfm_init(&root);
+ EXPECT_FALSE(bfm_set(&root, 1024, 1024));
+ bfm_test_insert_leaf_grow(&root);
+
+ /* check that an inner node grows correctly */
+ bfm_test_insert_inner_grow();
+
+
+ bfm_init(&root);
+ EXPECT_FALSE(bfm_set(&root, 1, 1));
+ EXPECT_TRUE(bfm_lookup(&root, 1, &val));
+
+ /* deletion from leaf node at root */
+ EXPECT_TRUE(bfm_delete(&root, 1));
+ EXPECT_FALSE(bfm_lookup(&root, 1, &val));
+
+ /* repeated deletion fails */
+ EXPECT_FALSE(bfm_delete(&root, 1));
+ EXPECT_TRUE(root.rnode == NULL);
+
+ /* one deletion doesn't disturb other values in leaf */
+ EXPECT_FALSE(bfm_set(&root, 1, 1));
+ EXPECT_FALSE(bfm_set(&root, 2, 2));
+ EXPECT_TRUE(bfm_delete(&root, 1));
+ EXPECT_FALSE(bfm_lookup(&root, 1, &val));
+ EXPECT_TRUE(bfm_lookup(&root, 2, &val));
+ EXPECT_EQ_U32(val, 2);
+
+ EXPECT_TRUE(bfm_delete(&root, 2));
+ EXPECT_FALSE(bfm_lookup(&root, 2, &val));
+ EXPECT_TRUE(root.rnode == NULL);
+
+ /* deletion from a leaf node succeeds */
+ EXPECT_FALSE(bfm_set(&root, 0xFFFF02, 0xFFFF02));
+ EXPECT_FALSE(bfm_set(&root, 1, 1));
+ EXPECT_FALSE(bfm_set(&root, 2, 2));
+
+ EXPECT_TRUE(bfm_delete(&root, 1));
+ EXPECT_TRUE(bfm_lookup(&root, 0xFFFF02, &val));
+ EXPECT_FALSE(bfm_lookup(&root, 1, &val));
+ EXPECT_TRUE(bfm_lookup(&root, 2, &val));
+
+ EXPECT_TRUE(bfm_delete(&root, 2));
+ EXPECT_TRUE(bfm_lookup(&root, 0xFFFF02, &val));
+ EXPECT_FALSE(bfm_lookup(&root, 1, &val));
+
+ EXPECT_TRUE(bfm_delete(&root, 0xFFFF02));
+ EXPECT_FALSE(bfm_delete(&root, 0xFFFF02));
+ EXPECT_FALSE(bfm_lookup(&root, 0xFFFF02, &val));
+ EXPECT_TRUE(root.rnode == NULL);
+
+ /* check that repeatedly inserting and deleting the same value works */
+ bfm_init(&root);
+ EXPECT_FALSE(bfm_set(&root, 0x10000, -0x10000));
+ EXPECT_FALSE(bfm_set(&root, 0, 0));
+ EXPECT_TRUE(bfm_lookup(&root, 0, &val));
+ EXPECT_TRUE(bfm_delete(&root, 0));
+ EXPECT_FALSE(bfm_lookup(&root, 0, &val));
+ EXPECT_FALSE(bfm_set(&root, 0, 0));
+ EXPECT_TRUE(bfm_set(&root, 0, 0));
+ EXPECT_TRUE(bfm_lookup(&root, 0, &val));
+
+ bfm_test_delete_lots();
+
+ if (0)
+ {
+ int cnt = 300;
+
+ bfm_init(&root);
+ MemoryContextStats(root.context);
+ for (int i = 0; i < cnt; i++)
+ EXPECT_FALSE(bfm_set(&root, i, i));
+ MemoryContextStats(root.context);
+ for (int i = 0; i < cnt; i++)
+ EXPECT_TRUE(bfm_delete(&root, i));
+ MemoryContextStats(root.context);
+ }
+
+ if (1)
+ {
+ //bfm_test_insert_bulk( 100 * 1000);
+ //bfm_test_insert_bulk( 1000 * 1000);
+#ifdef USE_ASSERT_CHECKING
+ bfm_test_insert_bulk( 1 * 1000 * 1000);
+#endif
+ //bfm_test_insert_bulk( 10 * 1000 * 1000);
+#ifndef USE_ASSERT_CHECKING
+ bfm_test_insert_bulk( 100 * 1000 * 1000);
+#endif
+ //bfm_test_insert_bulk(1000 * 1000 * 1000);
+ }
+
+ //bfm_print(&root);
+}
diff --git a/bdbench/radix.h b/bdbench/radix.h
new file mode 100644
index 0000000..c908aa5
--- /dev/null
+++ b/bdbench/radix.h
@@ -0,0 +1,76 @@
+/*-------------------------------------------------------------------------
+ *
+ * radix.h
+ * radix tree, yay.
+ *
+ *
+ * Portions Copyright (c) 2014-2021, PostgreSQL Global Development Group
+ *
+ * src/include/storage/radix.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#ifndef RADIX_H
+#define RADIX_H
+
+typedef uint64 bfm_key_type;
+typedef uint64 bfm_value_type;
+//typedef uint32 bfm_value_type;
+//typedef char bfm_value_type;
+
+/* How many different size classes are there */
+#define BFM_KIND_COUNT 6
+
+typedef enum bfm_tree_node_kind
+{
+ BFM_KIND_1,
+ BFM_KIND_4,
+ BFM_KIND_16,
+ BFM_KIND_32,
+ BFM_KIND_128,
+ BFM_KIND_MAX
+} bfm_tree_node_kind;
+
+struct MemoryContextData;
+struct bfm_tree_node;
+
+/* NB: makes things a bit slower */
+#define BFM_STATS
+
+#define BFM_USE_SLAB
+//#define BFM_USE_OS
+
+/*
+ * A radix tree with nodes that are sized based on occupancy.
+ */
+typedef struct bfm_tree
+{
+ struct bfm_tree_node *rnode;
+ uint64 maxval;
+
+ struct MemoryContextData *context;
+#ifdef BFM_USE_SLAB
+ struct MemoryContextData *inner_slabs[BFM_KIND_COUNT];
+ struct MemoryContextData *leaf_slabs[BFM_KIND_COUNT];
+#endif
+
+#ifdef BFM_STATS
+ /* stats */
+ size_t entries;
+ size_t inner_nodes[BFM_KIND_COUNT];
+ size_t leaf_nodes[BFM_KIND_COUNT];
+#endif
+} bfm_tree;
+
+extern void bfm_init(bfm_tree *root);
+extern bool bfm_lookup(bfm_tree *root, bfm_key_type key, bfm_value_type *val);
+extern bool bfm_set(bfm_tree *root, bfm_key_type key, bfm_value_type val);
+extern bool bfm_delete(bfm_tree *root, bfm_key_type key);
+
+extern struct StringInfoData* bfm_stats(bfm_tree *root);
+extern void bfm_print(bfm_tree *root);
+
+extern void bfm_tests(void);
+
+#endif
--
2.32.0.rc2
Attachment: 0003-Add-radix-tree-benchmark-integration.patch (text/x-diff)
From 131074dcbe72ff8af00cb879c7c92747dc100e69 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 19 Jul 2021 16:05:44 -0700
Subject: [PATCH 3/3] Add radix tree benchmark integration.
---
bdbench/Makefile | 2 +-
bdbench/bdbench--1.0.sql | 4 +
bdbench/bdbench.c | 181 ++++++++++++++++++++++++++++++++++++++-
bdbench/bench.sql | 6 +-
4 files changed, 189 insertions(+), 4 deletions(-)
diff --git a/bdbench/Makefile b/bdbench/Makefile
index 6d52940..723132a 100644
--- a/bdbench/Makefile
+++ b/bdbench/Makefile
@@ -2,7 +2,7 @@
MODULE_big = bdbench
DATA = bdbench--1.0.sql
-OBJS = bdbench.o vtbm.o rtbm.o
+OBJS = bdbench.o vtbm.o rtbm.o radix.o
EXTENSION = bdbench
REGRESS= bdbench
diff --git a/bdbench/bdbench--1.0.sql b/bdbench/bdbench--1.0.sql
index 933cf71..bd59293 100644
--- a/bdbench/bdbench--1.0.sql
+++ b/bdbench/bdbench--1.0.sql
@@ -109,3 +109,7 @@ RETURNS text
AS 'MODULE_PATHNAME'
LANGUAGE C STRICT VOLATILE;
+CREATE FUNCTION radix_run_tests()
+RETURNS void
+AS 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE;
diff --git a/bdbench/bdbench.c b/bdbench/bdbench.c
index 1df5c53..85d8eaa 100644
--- a/bdbench/bdbench.c
+++ b/bdbench/bdbench.c
@@ -19,6 +19,7 @@
#include "vtbm.h"
#include "rtbm.h"
+#include "radix.h"
//#define DEBUG_DUMP_MATCHED 1
@@ -89,6 +90,7 @@ PG_FUNCTION_INFO_V1(attach_dead_tuples);
PG_FUNCTION_INFO_V1(bench);
PG_FUNCTION_INFO_V1(test_generate_tid);
PG_FUNCTION_INFO_V1(rtbm_test);
+PG_FUNCTION_INFO_V1(radix_run_tests);
PG_FUNCTION_INFO_V1(prepare);
/*
@@ -137,6 +139,16 @@ static void rtbm_attach(LVTestType *lvtt, uint64 nitems, BlockNumber minblk,
static bool rtbm_reaped(LVTestType *lvtt, ItemPointer itemptr);
static Size rtbm_mem_usage(LVTestType *lvtt);
+/* radix */
+static void radix_init(LVTestType *lvtt, uint64 nitems);
+static void radix_fini(LVTestType *lvtt);
+static void radix_attach(LVTestType *lvtt, uint64 nitems, BlockNumber minblk,
+ BlockNumber maxblk, OffsetNumber maxoff);
+static bool radix_reaped(LVTestType *lvtt, ItemPointer itemptr);
+static Size radix_mem_usage(LVTestType *lvtt);
+static void radix_load(void *tbm, ItemPointerData *itemptrs, int nitems);
+
+
/* Misc functions */
static void generate_index_tuples(uint64 nitems, BlockNumber minblk,
BlockNumber maxblk, OffsetNumber maxoff);
@@ -156,12 +168,13 @@ static void load_rtbm(RTbm *vtbm, ItemPointerData *itemptrs, int nitems);
.dtinfo = {0}, \
.name = #n, \
.init_fn = n##_init, \
+ .fini_fn = n##_fini, \
.attach_fn = n##_attach, \
.reaped_fn = n##_reaped, \
.mem_usage_fn = n##_mem_usage, \
}
-#define TEST_SUBJECT_TYPES 5
+#define TEST_SUBJECT_TYPES 6
static LVTestType LVTestSubjects[TEST_SUBJECT_TYPES] =
{
DECLARE_SUBJECT(array),
@@ -169,6 +182,7 @@ static LVTestType LVTestSubjects[TEST_SUBJECT_TYPES] =
DECLARE_SUBJECT(intset),
DECLARE_SUBJECT(vtbm),
DECLARE_SUBJECT(rtbm),
+ DECLARE_SUBJECT(radix)
};
static bool
@@ -192,6 +206,31 @@ update_info(DeadTupleInfo *info, uint64 nitems, BlockNumber minblk,
info->maxoff = maxoff;
}
+
+/* from geqo's init_tour(), geqo_randint() */
+static int
+shuffle_randrange(unsigned short xseed[3], int lower, int upper)
+{
+ return (int) floor( pg_erand48(xseed) * ((upper-lower)+0.999999)) + lower;
+}
+
+/* Naive Fisher-Yates implementation*/
+static void
+shuffle_itemptrs(uint64 nitems, ItemPointer itemptrs)
+{
+ /* reproducability */
+ unsigned short xseed[3] = {0};
+
+ for (int i = 0; i < nitems - 1; i++)
+ {
+ int j = shuffle_randrange(xseed, i, nitems - 1);
+ ItemPointerData t = itemptrs[j];
+
+ itemptrs[j] = itemptrs[i];
+ itemptrs[i] = t;
+ }
+}
+
static void
generate_index_tuples(uint64 nitems, BlockNumber minblk, BlockNumber maxblk,
OffsetNumber maxoff)
@@ -586,6 +625,138 @@ load_rtbm(RTbm *rtbm, ItemPointerData *itemptrs, int nitems)
rtbm_add_tuples(rtbm, curblkno, offs, noffs);
}
+/* ---------- radix ---------- */
+static void
+radix_init(LVTestType *lvtt, uint64 nitems)
+{
+ MemoryContext old_ctx;
+
+ lvtt->mcxt = AllocSetContextCreate(TopMemoryContext,
+ "radix bench",
+ ALLOCSET_DEFAULT_SIZES);
+ old_ctx = MemoryContextSwitchTo(lvtt->mcxt);
+ lvtt->private = palloc(sizeof(bfm_tree));
+ bfm_init(lvtt->private);
+ MemoryContextSwitchTo(old_ctx);
+}
+static void
+radix_fini(LVTestType *lvtt)
+{
+#if 0
+ if (lvtt->private)
+ bfm_free((RTbm *) lvtt->private);
+#endif
+}
+
+/* log(sizeof(bfm_value_type) * BITS_PER_BYTE, 2) = log(64, 2) = 6 */
+#define ENCODE_BITS 6
+
+static uint64
+radix_to_key_off(ItemPointer tid, uint32 *off)
+{
+ uint64 upper;
+
+ uint32 shift = pg_ceil_log2_32(MaxHeapTuplesPerPage);
+ int64 tid_i;
+
+ Assert(ItemPointerGetOffsetNumber(tid) < MaxHeapTuplesPerPage);
+
+ tid_i = ItemPointerGetOffsetNumber(tid);
+ tid_i |= ((uint64) ItemPointerGetBlockNumber(tid)) << shift;
+
+ *off = tid_i & ((1 << ENCODE_BITS)-1);
+ upper = tid_i >> ENCODE_BITS;
+ Assert(*off < (sizeof(bfm_value_type) * BITS_PER_BYTE));
+
+ Assert(*off < 64);
+
+ return upper;
+}
+
+static void
+radix_attach(LVTestType *lvtt, uint64 nitems, BlockNumber minblk,
+ BlockNumber maxblk, OffsetNumber maxoff)
+{
+ MemoryContext oldcontext = MemoryContextSwitchTo(lvtt->mcxt);
+
+ radix_load(lvtt->private,
+ DeadTuples_orig->itemptrs,
+ DeadTuples_orig->dtinfo.nitems);
+
+ MemoryContextSwitchTo(oldcontext);
+}
+
+
+static bool
+radix_reaped(LVTestType *lvtt, ItemPointer itemptr)
+{
+ uint64 key;
+ uint32 off;
+ bfm_value_type val;
+
+ key = radix_to_key_off(itemptr, &off);
+
+ if (!bfm_lookup((bfm_tree *) lvtt->private, key, &val))
+ return false;
+
+ return val & ((bfm_value_type)1 << off);
+}
+
+static uint64
+radix_mem_usage(LVTestType *lvtt)
+{
+ bfm_tree *root = lvtt->private;
+ size_t mem = MemoryContextMemAllocated(lvtt->mcxt, true);
+ StringInfo s;
+
+ s = bfm_stats(root);
+
+ ereport(NOTICE,
+ errmsg("radix tree of %.2f MB, %s",
+ (double) mem / (1024 * 1024),
+ s->data),
+ errhidestmt(true),
+ errhidecontext(true));
+
+ pfree(s->data);
+ pfree(s);
+
+ return mem;
+}
+
+static void
+radix_load(void *tbm, ItemPointerData *itemptrs, int nitems)
+{
+ bfm_tree *root = (bfm_tree *) tbm;
+ uint64 last_key = PG_UINT64_MAX;
+ uint64 val = 0;
+
+ for (int i = 0; i < nitems; i++)
+ {
+ ItemPointer tid = &(itemptrs[i]);
+ uint64 key;
+ uint32 off;
+
+ key = radix_to_key_off(tid, &off);
+
+ if (last_key != PG_UINT64_MAX &&
+ last_key != key)
+ {
+ bfm_set(root, last_key, val);
+ val = 0;
+ }
+
+ last_key = key;
+ val |= (uint64)1 << off;
+ }
+
+ if (last_key != PG_UINT64_MAX)
+ {
+ bfm_set(root, last_key, val);
+ }
+}
+
+
static void
attach(LVTestType *lvtt, uint64 nitems, BlockNumber minblk, BlockNumber maxblk,
OffsetNumber maxoff)
@@ -952,3 +1123,11 @@ rtbm_test(PG_FUNCTION_ARGS)
PG_RETURN_NULL();
}
+
+Datum
+radix_run_tests(PG_FUNCTION_ARGS)
+{
+ bfm_tests();
+
+ PG_RETURN_VOID();
+}
diff --git a/bdbench/bench.sql b/bdbench/bench.sql
index c5ef2d3..94cfde0 100644
--- a/bdbench/bench.sql
+++ b/bdbench/bench.sql
@@ -11,10 +11,11 @@ select prepare(
-- Load dead tuples to all data structures.
select 'array', attach_dead_tuples('array');
-select 'tbm', attach_dead_tuples('tbm');
select 'intset', attach_dead_tuples('intset');
-select 'vtbm', attach_dead_tuples('vtbm');
select 'rtbm', attach_dead_tuples('rtbm');
+select 'tbm', attach_dead_tuples('tbm');
+select 'vtbm', attach_dead_tuples('vtbm');
+select 'radix', attach_dead_tuples('radix');
-- Do benchmark of lazy_tid_reaped.
select 'array bench', bench('array');
@@ -22,6 +23,7 @@ select 'intset bench', bench('intset');
select 'rtbm bench', bench('rtbm');
select 'tbm bench', bench('tbm');
select 'vtbm bench', bench('vtbm');
+select 'radix', bench('radix');
-- Check the memory usage.
select * from pg_backend_memory_contexts where name ~ 'bench' or name = 'TopMemoryContext' order by name;
--
2.32.0.rc2
Hi,
On 2021-07-19 16:49:15 -0700, Andres Freund wrote:
E.g. for
select prepare(
1000000, -- max block
20, -- # of dead tuples per page
10, -- dead tuples interval within a page
1 -- page interval
);
attach size shuffled ordered
array 69 ms 120 MB 84.87 s 8.66 s
intset 173 ms 65 MB 68.82 s 11.75 s
rtbm 201 ms 67 MB 11.54 s 1.35 s
tbm 232 ms 100 MB 8.33 s 1.26 s
vtbm 162 ms 58 MB 10.01 s 1.22 s
radix 88 ms 42 MB 11.49 s 1.67 s
and for
select prepare(
1000000, -- max block
10, -- # of dead tuples per page
1, -- dead tuples interval within a page
1 -- page interval
);
attach size shuffled ordered
array 24 ms 60MB 3.74s 1.02 s
intset 97 ms 49MB 3.14s 0.75 s
rtbm 138 ms 36MB 0.41s 0.14 s
tbm 198 ms 101MB 0.41s 0.14 s
vtbm 118 ms 27MB 0.39s 0.12 s
radix 33 ms 10MB 0.28s 0.10 s
Oh, I forgot: The performance numbers are with the fixes in
/messages/by-id/20210717194333.mr5io3zup3kxahfm@alap3.anarazel.de
applied.
Greetings,
Andres Freund
Hi,
I've dreamed of writing a more compact structure for vacuum for three
years, but life didn't give me the time.
Let me join the friendly competition.
I've bet on the HAMT approach: popcount-ing bitmaps for non-empty elements.
Novelties:
- 32 consecutive pages are stored together in a single sparse array
(called "chunks").
Chunk contains:
- its number,
- 4 byte bitmap of non-empty pages,
- array of non-empty page headers 2 byte each.
Page header contains offset of page's bitmap in bitmaps container.
(Except if there is just one dead tuple in a page. Then it is
written into header itself).
- container of concatenated bitmaps.
I.e., page metadata overhead varies from 2.4 bytes (32 pages in a single
chunk) to 18 bytes (1 page in a single chunk) per page.
- If a page's bitmap is sparse, i.e. contains a lot of "all-zero" bytes,
it is compressed by removing the zero bytes and indexing with a two-level
bitmap index.
Two-level index - zero bytes in first level are removed using
second level. It is mostly done for 32kb pages, but let it stay since
it is almost free.
- If a page's bitmap contains a lot of "all-one" bytes, it is inverted
and then encoded as sparse.
- Chunks are allocated with custom "allocator" that has no
per-allocation overhead. It is possible because there is no need
to perform "free": allocator is freed as whole at once.
- The array of pointers to chunks is also bitmap indexed. It saves CPU time
when not every run of 32 consecutive pages has at least one dead tuple,
but consumes time otherwise. Therefore an additional optimization is added
to quickly skip the lookup for the first non-empty run of chunks.
(Ahhh, I believe this explanation is awful).
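(To make the popcount indexing concrete, a minimal illustrative sketch follows;
the type and field names are hypothetical stand-ins, not the structures from
the attached patch:)

#include "postgres.h"
#include "port/pg_bitutils.h"

/* Hypothetical, simplified chunk: up to 32 pages, headers stored sparsely. */
typedef struct ChunkSketch
{
	uint32		present;		/* bit i set => page i of this chunk is non-empty */
	uint16		headers[32];	/* headers only for pages that are actually present */
} ChunkSketch;

/*
 * Fetch the header for page_in_chunk (0..31), or return false if that page
 * has no dead tuples.  The index into headers[] is simply the number of
 * present pages before this one: popcount of the lower bits of the bitmap.
 */
static bool
chunk_get_header(const ChunkSketch *chunk, int page_in_chunk, uint16 *header)
{
	uint32		bit = (uint32) 1 << page_in_chunk;

	if ((chunk->present & bit) == 0)
		return false;

	*header = chunk->headers[pg_popcount32(chunk->present & (bit - 1))];
	return true;
}

(The patch additionally defines its own small byte-popcount table; see svtm.c below.)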
Andres Freund wrote 2021-07-20 02:49:
Hi,
On 2021-07-19 15:20:54 +0900, Masahiko Sawada wrote:
BTW is the implementation of the radix tree approach available
somewhere? If so I'd like to experiment with that too.
I have toyed with implementing adaptively large radix nodes like
proposed in https://db.in.tum.de/~leis/papers/ART.pdf - but haven't
gotten it quite working.
That seems promising approach.
I've since implemented some, but not all of the ideas of that paper
(adaptive node sizes, but not the tree compression pieces).
E.g. for
select prepare(
1000000, -- max block
20, -- # of dead tuples per page
10, -- dead tuples interval within a page
1 -- page interval
);
attach size shuffled ordered
array 69 ms 120 MB 84.87 s 8.66 s
intset 173 ms 65 MB 68.82 s 11.75 s
rtbm 201 ms 67 MB 11.54 s 1.35 s
tbm 232 ms 100 MB 8.33 s 1.26 s
vtbm 162 ms 58 MB 10.01 s 1.22 s
radix 88 ms 42 MB 11.49 s 1.67 s
and for
select prepare(
1000000, -- max block
10, -- # of dead tuples per page
1, -- dead tuples interval within a page
1 -- page interval
);
attach size shuffled ordered
array 24 ms 60MB 3.74s 1.02 s
intset 97 ms 49MB 3.14s 0.75 s
rtbm 138 ms 36MB 0.41s 0.14 s
tbm 198 ms 101MB 0.41s 0.14 s
vtbm 118 ms 27MB 0.39s 0.12 s
radix 33 ms 10MB 0.28s 0.10 s
(this is an almost unfairly good case for radix)
Running out of time to format the results of the other testcases before
I have to run, unfortunately. radix uses 42MB both in test case 3 and
4.
My results (Ubuntu 20.04 Intel Core i7-1165G7):
Test1.
select prepare(1000000, 10, 20, 1); -- original
attach size shuffled
array 29ms 60MB 93.99s
intset 93ms 49MB 80.94s
rtbm 171ms 67MB 14.05s
tbm 238ms 100MB 8.36s
vtbm 148ms 59MB 9.12s
radix 100ms 42MB 11.81s
svtm 75ms 29MB 8.90s
select prepare(1000000, 20, 10, 1); -- Andres's variant
attach size shuffled
array 61ms 120MB 111.91s
intset 163ms 66MB 85.00s
rtbm 236ms 67MB 10.72s
tbm 290ms 100MB 8.40s
vtbm 190ms 59MB 9.28s
radix 117ms 42MB 12.00s
svtm 98ms 29MB 8.77s
Test2.
select prepare(1000000, 10, 1, 1);
attach size shuffled
array 31ms 60MB 4.68s
intset 97ms 49MB 4.03s
rtbm 163ms 36MB 0.42s
tbm 240ms 100MB 0.42s
vtbm 136ms 27MB 0.36s
radix 60ms 10MB 0.72s
svtm 39ms 6MB 0.19s
(Bad radix result probably due to smaller cache in notebook's CPU ?)
Test3
select prepare(1000000, 2, 100, 1);
attach size shuffled
array 6ms 12MB 53.42s
intset 23ms 16MB 54.99s
rtbm 115ms 38MB 8.19s
tbm 186ms 100MB 8.37s
vtbm 105ms 59MB 9.08s
radix 64ms 42MB 10.41s
svtm 73ms 10MB 7.49s
Test4
select prepare(1000000, 100, 1, 1);
attach size shuffled
array 304ms 600MB 75.12s
intset 775ms 98MB 47.49s
rtbm 356ms 38MB 4.11s
tbm 539ms 100MB 4.20s
vtbm 493ms 42MB 4.44s
radix 263ms 42MB 6.05s
svtm 360ms 8MB 3.49s
Therefore the Specialized Vacuum Tid Map always consumes the least memory
and is usually faster.
(I've applied Andres's patch for slab allocator before testing)
The attached patch is against commit 6753911a444e12e4b55 of your pgtools,
with Andres's patches for the radix method applied.
I've also pushed it to github:
https://github.com/funny-falcon/pgtools/tree/svtm/bdbench
regards,
Yura Sokolov
Attachments:
0001-svtm-specialized-vacuum-tid-map.patch (text/x-diff)
From 3a6c96cc705b1af412cf9300be6f676f6c5e4aa6 Mon Sep 17 00:00:00 2001
From: Yura Sokolov <funny.falcon@gmail.com>
Date: Sun, 25 Jul 2021 03:06:48 +0300
Subject: [PATCH] svtm - specialized vacuum tid map
---
bdbench/Makefile | 2 +-
bdbench/bdbench.c | 91 ++++++-
bdbench/bench.sql | 2 +
bdbench/svtm.c | 635 ++++++++++++++++++++++++++++++++++++++++++++++
bdbench/svtm.h | 19 ++
5 files changed, 746 insertions(+), 3 deletions(-)
create mode 100644 bdbench/svtm.c
create mode 100644 bdbench/svtm.h
diff --git a/bdbench/Makefile b/bdbench/Makefile
index 723132a..a6f758f 100644
--- a/bdbench/Makefile
+++ b/bdbench/Makefile
@@ -2,7 +2,7 @@
MODULE_big = bdbench
DATA = bdbench--1.0.sql
-OBJS = bdbench.o vtbm.o rtbm.o radix.o
+OBJS = bdbench.o vtbm.o rtbm.o radix.o svtm.o
EXTENSION = bdbench
REGRESS= bdbench
diff --git a/bdbench/bdbench.c b/bdbench/bdbench.c
index 85d8eaa..a8bc49a 100644
--- a/bdbench/bdbench.c
+++ b/bdbench/bdbench.c
@@ -7,6 +7,7 @@
#include "postgres.h"
+#include <math.h>
#include "catalog/index.h"
#include "fmgr.h"
#include "funcapi.h"
@@ -20,6 +21,7 @@
#include "vtbm.h"
#include "rtbm.h"
#include "radix.h"
+#include "svtm.h"
//#define DEBUG_DUMP_MATCHED 1
@@ -148,6 +150,15 @@ static bool radix_reaped(LVTestType *lvtt, ItemPointer itemptr);
static Size radix_mem_usage(LVTestType *lvtt);
static void radix_load(void *tbm, ItemPointerData *itemptrs, int nitems);
+/* svtm */
+static void svtm_init(LVTestType *lvtt, uint64 nitems);
+static void svtm_fini(LVTestType *lvtt);
+static void svtm_attach(LVTestType *lvtt, uint64 nitems, BlockNumber minblk,
+ BlockNumber maxblk, OffsetNumber maxoff);
+static bool svtm_reaped(LVTestType *lvtt, ItemPointer itemptr);
+static Size svtm_mem_usage(LVTestType *lvtt);
+static void svtm_load(SVTm *tbm, ItemPointerData *itemptrs, int nitems);
+
/* Misc functions */
static void generate_index_tuples(uint64 nitems, BlockNumber minblk,
@@ -174,7 +185,7 @@ static void load_rtbm(RTbm *vtbm, ItemPointerData *itemptrs, int nitems);
.mem_usage_fn = n##_mem_usage, \
}
-#define TEST_SUBJECT_TYPES 6
+#define TEST_SUBJECT_TYPES 7
static LVTestType LVTestSubjects[TEST_SUBJECT_TYPES] =
{
DECLARE_SUBJECT(array),
@@ -182,7 +193,8 @@ static LVTestType LVTestSubjects[TEST_SUBJECT_TYPES] =
DECLARE_SUBJECT(intset),
DECLARE_SUBJECT(vtbm),
DECLARE_SUBJECT(rtbm),
- DECLARE_SUBJECT(radix)
+ DECLARE_SUBJECT(radix),
+ DECLARE_SUBJECT(svtm)
};
static bool
@@ -756,6 +768,81 @@ radix_load(void *tbm, ItemPointerData *itemptrs, int nitems)
}
}
+/* ------------ svtm ----------- */
+static void
+svtm_init(LVTestType *lvtt, uint64 nitems)
+{
+ MemoryContext old_ctx;
+
+ lvtt->mcxt = AllocSetContextCreate(TopMemoryContext,
+ "svtm bench",
+ ALLOCSET_DEFAULT_SIZES);
+ old_ctx = MemoryContextSwitchTo(lvtt->mcxt);
+ lvtt->private = svtm_create();
+ MemoryContextSwitchTo(old_ctx);
+}
+
+static void
+svtm_fini(LVTestType *lvtt)
+{
+ if (lvtt->private != NULL)
+ svtm_free(lvtt->private);
+}
+
+static void
+svtm_attach(LVTestType *lvtt, uint64 nitems, BlockNumber minblk,
+ BlockNumber maxblk, OffsetNumber maxoff)
+{
+ MemoryContext oldcontext = MemoryContextSwitchTo(lvtt->mcxt);
+
+ svtm_load(lvtt->private,
+ DeadTuples_orig->itemptrs,
+ DeadTuples_orig->dtinfo.nitems);
+
+ MemoryContextSwitchTo(oldcontext);
+}
+
+static bool
+svtm_reaped(LVTestType *lvtt, ItemPointer itemptr)
+{
+ return svtm_lookup(lvtt->private, itemptr);
+}
+
+static uint64
+svtm_mem_usage(LVTestType *lvtt)
+{
+ svtm_stats((SVTm *) lvtt->private);
+ return MemoryContextMemAllocated(lvtt->mcxt, true);
+}
+
+static void
+svtm_load(SVTm *svtm, ItemPointerData *itemptrs, int nitems)
+{
+ BlockNumber curblkno = InvalidBlockNumber;
+ OffsetNumber offs[1024];
+ int noffs = 0;
+
+ for (int i = 0; i < nitems; i++)
+ {
+ ItemPointer tid = &(itemptrs[i]);
+ BlockNumber blkno = ItemPointerGetBlockNumber(tid);
+
+ if (curblkno != InvalidBlockNumber &&
+ curblkno != blkno)
+ {
+ svtm_add_page(svtm, curblkno, offs, noffs);
+ curblkno = blkno;
+ noffs = 0;
+ }
+
+ curblkno = blkno;
+ offs[noffs++] = ItemPointerGetOffsetNumber(tid);
+ }
+
+ svtm_add_page(svtm, curblkno, offs, noffs);
+ svtm_finalize_addition(svtm);
+}
+
static void
attach(LVTestType *lvtt, uint64 nitems, BlockNumber minblk, BlockNumber maxblk,
diff --git a/bdbench/bench.sql b/bdbench/bench.sql
index 94cfde0..b303591 100644
--- a/bdbench/bench.sql
+++ b/bdbench/bench.sql
@@ -16,6 +16,7 @@ select 'rtbm', attach_dead_tuples('rtbm');
select 'tbm', attach_dead_tuples('tbm');
select 'vtbm', attach_dead_tuples('vtbm');
select 'radix', attach_dead_tuples('radix');
+select 'svtm', attach_dead_tuples('svtm');
-- Do benchmark of lazy_tid_reaped.
select 'array bench', bench('array');
@@ -24,6 +25,7 @@ select 'rtbm bench', bench('rtbm');
select 'tbm bench', bench('tbm');
select 'vtbm bench', bench('vtbm');
select 'radix', bench('radix');
+select 'svtm', bench('svtm');
-- Check the memory usage.
select * from pg_backend_memory_contexts where name ~ 'bench' or name = 'TopMemoryContext' order by name;
diff --git a/bdbench/svtm.c b/bdbench/svtm.c
new file mode 100644
index 0000000..6ce4ed9
--- /dev/null
+++ b/bdbench/svtm.c
@@ -0,0 +1,635 @@
+/*------------------------------------------------------------------------------
+ *
+ * svtm.c - Specialized Vacuum TID Map
+ * Data structure to hold TIDs of dead tuples during vacuum.
+ *
+ * It takes into account the following properties of PostgreSQL ItemPointers
+ * and the vacuum heap scan process:
+ * - the page (block) number is a 32 bit integer,
+ * - 14 bits are enough for a tuple offset,
+ * - but usually the number of tuples per page is significantly smaller,
+ * - and 0 is InvalidOffsetNumber,
+ * - the heap is scanned sequentially, therefore pages arrive in increasing order,
+ * - all tuples of a single page can be added at once.
+ *
+ * It uses techniques from HAMT (Hash Array Mapped Trie) and Roaring bitmaps.
+ *
+ * # Page.
+ *
+ * Page information consists of 16 bit page header and bitmap or sparse bitmap
+ * container. Header and bitmap contains different information
+ * depending on high bits of header.
+ *
+ * A sparse bitmap is made from the raw bitmap by skipping all-zero bytes. The
+ * non-zero bytes are then indexed with a sparseness bitmap.
+ *
+ * If the bitmap contains a lot of all-one bytes, it is inverted before
+ * being made sparse.
+ *
+ * Kinds of header/bitmap:
+ * - embedded 1 offset
+ * high bits: 11
+ * lower bits: 14bit tuple offset
+ * bitmap: no external bitmap
+ *
+ * - raw bitmap
+ * high bits: 00
+ * lower bits: 14bit offset in bitmap container
+ * bitmap: 1 byte bitmap length = K
+ * K byte raw bitmap
+ * This container is used if there is no detectable pattern in offsets.
+ *
+ * - sparse bitmap
+ * high bits: 10
+ * lower bits: 14bit offset in bitmap container
+ * bitmap: 1 byte raw bitmap length = K
+ * 1 byte sparseness bitmap length = S
+ * S bytes sparseness bitmap
+ * Z bytes of non-zero bitmap bytes
+ * If raw bitmap contains > 62.5% of zero bytes, then sparse bitmap format is
+ * chosen.
+ *
+ * - inverted sparse bitmap
+ * high bits: 10
+ * lower bits: 14bit offset in bitmap container
+ * bitmap: 1 byte raw bitmap length = K
+ * 1 byte sparseness bitmap length = S
+ * S bytes sparseness bitmap
+ * Z bytes of non-zero inverted bitmap bytes
+ * If raw bitmap contains > 62.5% of all-ones bytes, then sparse bitmap format
+ * is used to encode whenever tuple is not dead instead.
+ *
+ * # Page map chunk.
+ *
+ * 32 consecutive page headers are stored in a sparse array together with
+ * their bitmaps. Pages without any dead tuple are skipped from this array.
+ *
+ * Therefore chunk map contains:
+ * - 32bitmap of pages presence
+ * - array of 0-32 page headers
+ * - byte array of concatenated bitmaps for all pages in a chunk (with offsets
+ * encoded in page headers).
+ *
+ * Maximum chunk size:
+ * - page header map: 4 + 32*2 = 68 bytes
+ * - bitmaps byte array:
+ * 32kb page: 32 * 148 = 4736 byte
+ * 8kb page: 32 * 36 = 1152 byte
+ * - sum:
+ * 32kb page: 4804 bytes
+ * 8kb page: 1220 bytes
+ *
+ * Each chunk is allocated as a single blob.
+ *
+ * # Page chunk map.
+ *
+ * Pointers to chunks are stored into sparse array indexed with ixmap bitmap.
+ * Number of first non-empty chunk and first empty chunk after it are
+ * remembered to reduce size of bitmap and speedup access to first run
+ * of non-empty chunks.
+ */
+
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "utils/builtins.h"
+#include "utils/memutils.h"
+#include "lib/stringinfo.h"
+#include "port/pg_bitutils.h"
+
+#include "svtm.h"
+
+#define PAGES_PER_CHUNK (1<<5)
+#define BITMAP_PER_PAGE (MaxHeapTuplesPerPage/8 + 1)
+#define PAGE_TO_CHUNK(blkno) ((uint32)(blkno)>>5)
+#define CHUNK_TO_PAGE(chunkno) ((chunkno)<<5)
+
+#define SVTAllocChunk ((1<<19)-128)
+
+typedef struct SVTPagesChunk SVTPagesChunk;
+typedef struct SVTChunkBuilder SVTChunkBuilder;
+typedef struct SVTAlloc SVTAlloc;
+typedef struct IxMap IxMap;
+typedef uint16 SVTHeader;
+
+struct SVTAlloc {
+ SVTAlloc* next;
+ Size pos;
+ Size limit;
+ uint8 bytes[FLEXIBLE_ARRAY_MEMBER];
+};
+
+struct SVTChunkBuilder
+{
+ uint32 chunk_number;
+ uint32 npages;
+ uint32 bitmaps_pos;
+ uint32 hcnt[4];
+ BlockNumber pages[PAGES_PER_CHUNK];
+ SVTHeader headers[PAGES_PER_CHUNK];
+ /* we add 3 for BITMAP_PER_PAGE for 4 byte roundup */
+ uint8 bitmaps[(BITMAP_PER_PAGE+3)*PAGES_PER_CHUNK];
+};
+
+struct IxMap {
+ uint32 bitmap;
+ uint32 offset;
+};
+
+struct SVTm
+{
+ BlockNumber lastblock; /* max block number + 1 */
+ struct {
+ uint32 start, end;
+ } firstrun;
+ uint32 nchunks;
+ SVTPagesChunk **chunks; /* chunks pointers */
+ IxMap *ixmap; /* compression map for chunks */
+ Size total_size;
+ SVTAlloc *alloc;
+
+ uint32 npages;
+ uint32 hcnt[4];
+
+ SVTChunkBuilder builder; /* builder for current chunk */
+};
+
+struct SVTPagesChunk
+{
+ uint32 chunk_number;
+ uint32 bitmap;
+ SVTHeader headers[FLEXIBLE_ARRAY_MEMBER];
+};
+
+#define bm2(b,c) (((b)<<1)|(c))
+enum SVTHeaderType {
+ SVTH_rawBitmap = bm2(0,0),
+ SVTH_inverseBitmap = bm2(0,1),
+ SVTH_sparseBitmap = bm2(1,0),
+ SVTH_single = bm2(1,1),
+};
+#define HeaderTypeOffset (14)
+#define MakeHeaderType(l) ((SVTHeader)(l) << HeaderTypeOffset)
+#define HeaderType(h) (((h)>>14)&3)
+
+#define BitmapPosition(h) ((h) & ((1<<14)-1))
+#define MakeBitmapPosition(l) ((l) & ((1<<14)-1))
+#define MaxBitmapPosition ((1<<14)-1)
+
+#define SingleItem(h) ((h) & ((1<<14)-1))
+#define MakeSingleItem(h) ((h) & ((1<<14)-1))
+
+/*
+ * we could not use pg_popcount32 in contrib in windows,
+ * therefore define our own.
+ */
+#define INVALID_INDEX (~(uint32)0)
+const uint8 four_bit_cnt[32] = {
+ 0, 1, 1, 2, 1, 2, 2, 3,
+ 1, 2, 2, 3, 2, 3, 3, 4,
+ 1, 2, 2, 3, 2, 3, 3, 4,
+ 2, 3, 3, 4, 3, 4, 4, 5,
+};
+
+#define makeoff(v, bits) ((v)/bits)
+#define makebit(v, bits) (1<<((v)&((bits)-1)))
+#define maskbits(v, bits) ((v) & ((1<<(bits))-1))
+#define bitszero(v, bits) (maskbits((v), (bits)) == 0)
+
+static inline uint32 svt_popcnt32(uint32 val);
+static void svtm_build_chunk(SVTm *store);
+
+static inline uint32
+svt_popcnt8(uint8 val)
+{
+ return four_bit_cnt[val&15] + four_bit_cnt[(val>>4)&15];
+}
+
+static inline uint32
+svt_popcnt32(uint32 val)
+{
+ return pg_popcount32(val);
+}
+
+static SVTAlloc*
+svtm_alloc_alloc(void)
+{
+ SVTAlloc *alloc = palloc0(SVTAllocChunk);
+ alloc->limit = SVTAllocChunk - offsetof(SVTAlloc, bytes);
+ return alloc;
+}
+
+SVTm*
+svtm_create(void)
+{
+ SVTm* store = palloc0(sizeof(SVTm));
+ /* preallocate chunks just to pass it to repalloc later */
+ store->chunks = palloc(sizeof(SVTPagesChunk*)*2);
+ store->alloc = svtm_alloc_alloc();
+ return store;
+}
+
+static void*
+svtm_alloc(SVTm *store, Size size)
+{
+ SVTAlloc *alloc = store->alloc;
+ void *res;
+
+ size = INTALIGN(size);
+
+ if (alloc->limit - alloc->pos < size)
+ {
+ alloc = svtm_alloc_alloc();
+ alloc->next = store->alloc;
+ store->alloc = alloc;
+ }
+
+ res = alloc->bytes + alloc->pos;
+ alloc->pos += size;
+
+ return res;
+}
+
+void
+svtm_free(SVTm *store)
+{
+ SVTAlloc *alloc, *next;
+
+ if (store == NULL)
+ return;
+ if (store->ixmap != NULL)
+ pfree(store->ixmap);
+ if (store->chunks != NULL)
+ pfree(store->chunks);
+ alloc = store->alloc;
+ while (alloc != NULL)
+ {
+ next = alloc->next;
+ pfree(alloc);
+ alloc = next;
+ }
+ pfree(store);
+}
+
+void
+svtm_add_page(SVTm *store, const BlockNumber blkno,
+ const OffsetNumber *offnums, uint32 nitems)
+{
+ SVTChunkBuilder *bld = &store->builder;
+ SVTHeader header = 0;
+ uint32 chunkno = PAGE_TO_CHUNK(blkno);
+ uint32 bmlen = 0, bbmlen = 0, bbbmlen = 0;
+ uint32 sbmlen = 0;
+ uint32 nonzerocnt;
+ uint32 allzerocnt = 0, allonecnt = 0;
+ uint32 firstoff, lastoff;
+ uint32 i, j;
+ uint8 *append;
+ uint8 bitmap[BITMAP_PER_PAGE] = {0};
+ uint8 spix1[BITMAP_PER_PAGE/8+1] = {0};
+ uint8 spix2[BITMAP_PER_PAGE/64+2] = {0};
+#define off(i) (offnums[i]-1)
+
+ if (nitems == 0)
+ return;
+
+ if (chunkno != bld->chunk_number)
+ {
+ Assert(chunkno > bld->chunk_number);
+ svtm_build_chunk(store);
+ bld->chunk_number = chunkno;
+ }
+
+ Assert(bld->npages == 0 || blkno > bld->pages[bld->npages-1]);
+
+ firstoff = off(0);
+ lastoff = off(nitems-1);
+ Assert(lastoff < (1<<11));
+
+ if (nitems == 1 && lastoff < (1<<10))
+ {
+ /* 1 embedded item */
+ header = MakeHeaderType(SVTH_single);
+ header |= firstoff;
+ }
+ else
+ {
+ Assert(bld->bitmaps_pos < MaxBitmapPosition);
+
+ append = bld->bitmaps + bld->bitmaps_pos;
+ header = MakeBitmapPosition(bld->bitmaps_pos);
+ /* calculate bitmap */
+ for (i = 0; i < nitems; i++)
+ {
+ Assert(i == 0 || off(i) > off(i-1));
+ bitmap[makeoff(off(i),8)] |= makebit(off(i), 8);
+ }
+
+ bmlen = lastoff/8 + 1;
+ append[0] = bmlen;
+
+ for (i = 0; i < bmlen; i++)
+ {
+ allzerocnt += bitmap[i] == 0;
+ allonecnt += bitmap[i] == 0xff;
+ }
+
+ /* if we could not abuse sparness of bitmap, pack it as is */
+ if (allzerocnt <= bmlen*5/8 && allonecnt <= bmlen*5/8)
+ {
+ header |= MakeHeaderType(SVTH_rawBitmap);
+ memmove(append+1, bitmap, bmlen);
+ bld->bitmaps_pos += bmlen + 1;
+ }
+ else
+ {
+ /* if there is more present tuples than absent, invert map */
+ if (allonecnt > bmlen*5/8)
+ {
+ header |= MakeHeaderType(SVTH_inverseBitmap);
+ for (i = 0; i < bmlen; i++)
+ bitmap[i] ^= 0xff;
+ nonzerocnt = bmlen - allonecnt;
+ }
+ else
+ {
+ header |= MakeHeaderType(SVTH_sparseBitmap);
+ nonzerocnt = bmlen - allzerocnt;
+ }
+
+ /* Then we compose two level bitmap index for bitmap. */
+
+ /* First compress bitmap itself with first level index */
+ bbmlen = (bmlen+7)/8;
+ j = 0;
+ for (i = 0; i < bmlen; i++)
+ {
+ if (bitmap[i] != 0)
+ {
+ spix1[makeoff(i, 8)] |= makebit(i, 8);
+ bitmap[j] = bitmap[i];
+ j++;
+ }
+ }
+ Assert(j == nonzerocnt);
+
+ /* Then compress first level index with second level */
+ bbbmlen = (bbmlen+7)/8;
+ Assert(bbbmlen <= 3);
+ sbmlen = 0;
+ for (i = 0; i < bbmlen; i++)
+ {
+ if (spix1[i] != 0)
+ {
+ spix2[makeoff(i, 8)] |= makebit(i, 8);
+ spix1[sbmlen] = spix1[i];
+ sbmlen++;
+ }
+ }
+ Assert(sbmlen < 19);
+
+ /*
+ * second byte contains length of first level and offset
+ * to compressed bitmap itself.
+ */
+ append[1] = (bbbmlen << 5) | (bbbmlen + sbmlen);
+ memmove(append+2, spix2, bbbmlen);
+ memmove(append+2+bbbmlen, spix1, sbmlen);
+ memmove(append+2+bbbmlen+sbmlen, bitmap, nonzerocnt);
+ bld->bitmaps_pos += bbbmlen + sbmlen + nonzerocnt + 2;
+ }
+ Assert(bld->bitmaps_pos <= MaxBitmapPosition);
+ }
+ bld->pages[bld->npages] = blkno;
+ bld->headers[bld->npages] = header;
+ bld->npages++;
+ bld->hcnt[HeaderType(header)]++;
+}
+#undef off
+
+static void
+svtm_build_chunk(SVTm *store)
+{
+ SVTChunkBuilder *bld = &store->builder;
+ SVTPagesChunk *chunk;
+ uint32 bitmap = 0;
+ BlockNumber startblock;
+ uint32 off;
+ uint32 i;
+ Size total_size;
+
+ Assert(bld->npages < ~(uint16)0);
+
+ if (bld->npages == 0)
+ return;
+
+ startblock = CHUNK_TO_PAGE(bld->chunk_number);
+ for (i = 0; i < bld->npages; i++)
+ {
+ off = bld->pages[i] - startblock;
+ bitmap |= makebit(off, 32);
+ }
+
+ total_size = offsetof(SVTPagesChunk, headers) +
+ sizeof(SVTHeader)*bld->npages +
+ bld->bitmaps_pos;
+
+ chunk = svtm_alloc(store, total_size);
+ chunk->chunk_number = bld->chunk_number;
+ chunk->bitmap = bitmap;
+ memmove(chunk->headers,
+ bld->headers, sizeof(SVTHeader)*bld->npages);
+ memmove((char*)(chunk->headers + bld->npages),
+ bld->bitmaps, bld->bitmaps_pos);
+
+ /*
+ * We allocate store->chunks in power-of-two sizes.
+ * Then check for "we will overflow" is equal to "nchunks is power of two".
+ */
+ if ((store->nchunks & (store->nchunks-1)) == 0)
+ {
+ Size new_nchunks = store->nchunks ? (store->nchunks<<1) : 1;
+ store->chunks = (SVTPagesChunk**) repalloc(store->chunks,
+ new_nchunks * sizeof(SVTPagesChunk*));
+ }
+ store->chunks[store->nchunks] = chunk;
+ store->nchunks++;
+ store->lastblock = bld->pages[bld->npages-1];
+ store->total_size += total_size;
+
+ for (i = 0; i<4; i++)
+ store->hcnt[i] += bld->hcnt[i];
+ store->npages += bld->npages;
+
+ memset(bld, 0, sizeof(SVTChunkBuilder));
+}
+
+void
+svtm_finalize_addition(SVTm *store)
+{
+ SVTPagesChunk **chunks = store->chunks;
+ IxMap *ixmap;
+ uint32 last_chunk, chunkno;
+ uint32 firstrun, firstrunend;
+ uint32 nmaps;
+ uint32 i;
+
+ if (store->nchunks == 0)
+ {
+ /*
+ * block number will be rejected with:
+ * block <= lastblock, lastblock == 0
+ * chunk >= firstrun.start, firstrun.start = 1
+ */
+ store->firstrun.start = 1;
+ return;
+ }
+
+ firstrun = chunks[0]->chunk_number;
+ firstrunend = firstrun+1;
+
+ /* adsorb last chunk */
+ svtm_build_chunk(store);
+
+ /* Now we need to build ixmap */
+ last_chunk = PAGE_TO_CHUNK(store->lastblock);
+ nmaps = makeoff(last_chunk, 32) + 1;
+ ixmap = palloc0(nmaps * sizeof(IxMap));
+
+ for (i = 0; i < store->nchunks; i++)
+ {
+ chunkno = chunks[i]->chunk_number;
+ if (chunkno == firstrunend)
+ firstrunend++;
+ chunkno -= firstrun;
+ ixmap[makeoff(chunkno,32)].bitmap |= makebit(chunkno,32);
+ }
+
+ for (i = 1; i < nmaps; i++)
+ {
+ ixmap[i].offset = ixmap[i-1].offset;
+ ixmap[i].offset += svt_popcnt32(ixmap[i-1].bitmap);
+ }
+
+ store->firstrun.start = firstrun;
+ store->firstrun.end = firstrunend;
+ store->ixmap = ixmap;
+}
+
+bool
+svtm_lookup(SVTm *store, ItemPointer tid)
+{
+ BlockNumber blkno = ItemPointerGetBlockNumber(tid);
+ OffsetNumber offset = ItemPointerGetOffsetNumber(tid) - 1;
+ SVTPagesChunk *chunk;
+ IxMap *ixmap = store->ixmap;
+ uint32 off, bit;
+
+ SVTHeader header;
+ uint8 *bitmaps;
+ uint8 *bitmap;
+ uint32 index;
+ uint32 chunkno, blk_in_chunk;
+ uint8 type;
+ uint8 bmoff, bmbit, bmlen, bmbyte;
+ uint8 bmstart, bbmoff, bbmbit, bbmbyte;
+ uint8 bbbmlen, bbbmoff, bbbmbit;
+ uint8 six1off, sbmoff;
+ bool inverse, bitset;
+
+ if (blkno > store->lastblock)
+ return false;
+
+ chunkno = PAGE_TO_CHUNK(blkno);
+ if (chunkno < store->firstrun.start)
+ return false;
+
+ if (chunkno < store->firstrun.end)
+ index = chunkno - store->firstrun.start;
+ else
+ {
+ off = makeoff(chunkno - store->firstrun.start, 32);
+ bit = makebit(chunkno - store->firstrun.start, 32);
+ if ((ixmap[off].bitmap & bit) == 0)
+ return false;
+
+ index = ixmap[off].offset + svt_popcnt32(ixmap[off].bitmap & (bit-1));
+ }
+ chunk = store->chunks[index];
+ Assert(chunkno == chunk->chunk_number);
+
+ blk_in_chunk = blkno - CHUNK_TO_PAGE(chunkno);
+ bit = makebit(blk_in_chunk, 32);
+
+ if ((chunk->bitmap & bit) == 0)
+ return false;
+ index = svt_popcnt32(chunk->bitmap & (bit - 1));
+ header = chunk->headers[index];
+
+ type = HeaderType(header);
+ if (type == SVTH_single)
+ return offset == SingleItem(header);
+
+ bitmaps = (uint8*)(chunk->headers + svt_popcnt32(chunk->bitmap));
+ bmoff = makeoff(offset, 8);
+ bmbit = makebit(offset, 8);
+ inverse = false;
+
+ bitmap = bitmaps + BitmapPosition(header);
+ bmlen = bitmap[0];
+ if (bmoff >= bmlen)
+ return false;
+
+ switch (type)
+ {
+ case SVTH_rawBitmap:
+ return (bitmap[bmoff+1] & bmbit) != 0;
+
+ case SVTH_inverseBitmap:
+ inverse = true;
+ /* fallthrough */
+ case SVTH_sparseBitmap:
+ bmstart = bitmap[1] & 0x1f;
+ bbbmlen = bitmap[1] >> 5;
+ bitmap += 2;
+ bbmoff = makeoff(bmoff, 8);
+ bbmbit = makebit(bmoff, 8);
+ bbbmoff = makeoff(bbmoff, 8);
+ bbbmbit = makebit(bbmoff, 8);
+ /* check bit in second level index */
+ if ((bitmap[bbbmoff] & bbbmbit) == 0)
+ return inverse;
+ /* calculate sparse offset into compressed first level index */
+ six1off = pg_popcount((char*)bitmap, bbbmoff) +
+ svt_popcnt8(bitmap[bbbmoff] & (bbbmbit-1));
+ /* check bit in first level index */
+ bbmbyte = bitmap[bbbmlen+six1off];
+ if ((bbmbyte & bbmbit) == 0)
+ return inverse;
+ /* and sparse offset into compressed bitmap itself */
+ sbmoff = pg_popcount((char*)bitmap+bbbmlen, six1off) +
+ svt_popcnt8(bbmbyte & (bbmbit-1));
+ bmbyte = bitmap[bmstart + sbmoff];
+ /* finally check bit in bitmap */
+ bitset = (bmbyte & bmbit) != 0;
+ return bitset != inverse;
+ }
+ Assert(false);
+ return false;
+}
+
+void svtm_stats(SVTm *store)
+{
+ StringInfo s;
+
+ s = makeStringInfo();
+ appendStringInfo(s, "svtm: nchunks %u npages %u\n",
+ store->nchunks, store->npages);
+ appendStringInfo(s, "single=%u raw=%u inserse=%u sparse=%u",
+ store->hcnt[SVTH_single], store->hcnt[SVTH_rawBitmap],
+ store->hcnt[SVTH_inverseBitmap], store->hcnt[SVTH_sparseBitmap]);
+
+ elog(NOTICE, "%s", s->data);
+ pfree(s->data);
+ pfree(s);
+}
diff --git a/bdbench/svtm.h b/bdbench/svtm.h
new file mode 100644
index 0000000..fdb5e3f
--- /dev/null
+++ b/bdbench/svtm.h
@@ -0,0 +1,19 @@
+#ifndef _SVTM_H
+#define _SVTM_H
+
+/* Specialized Vacuum TID Map */
+typedef struct SVTm SVTm;
+
+SVTm *svtm_create(void);
+void svtm_free(SVTm *store);
+/*
+ * Add page tuple offsets to map.
+ * offnums should be sorted. Max offset number should be < 2048.
+ */
+void svtm_add_page(SVTm *store, const BlockNumber blkno,
+ const OffsetNumber *offnums, uint32 nitems);
+void svtm_finalize_addition(SVTm *store);
+bool svtm_lookup(SVTm *store, ItemPointer tid);
+void svtm_stats(SVTm *store);
+
+#endif
--
2.32.0
On Mon, Jul 26, 2021 at 1:07 AM Yura Sokolov <y.sokolov@postgrespro.ru> wrote:
Hi,
I've dreamed of writing a more compact structure for vacuum for three
years, but life didn't give me the time.
Let me join the friendly competition.
I've bet on the HAMT approach: popcount-ing bitmaps for non-empty elements.
Thank you for proposing the new idea!
Novelties:
- 32 consecutive pages are stored together in a single sparse array
(called "chunks").
Chunk contains:
- its number,
- 4 byte bitmap of non-empty pages,
- array of non-empty page headers 2 byte each.
Page header contains offset of page's bitmap in bitmaps container.
(Except if there is just one dead tuple in a page. Then it is
written into header itself).
- container of concatenated bitmaps.
I.e., page metadata overhead varies from 2.4 bytes (32 pages in a single
chunk) to 18 bytes (1 page in a single chunk) per page.
- If a page's bitmap is sparse, i.e. contains a lot of "all-zero" bytes,
it is compressed by removing the zero bytes and indexing with a two-level
bitmap index.
Two-level index - zero bytes in first level are removed using
second level. It is mostly done for 32kb pages, but let it stay since
it is almost free.
- If a page's bitmap contains a lot of "all-one" bytes, it is inverted
and then encoded as sparse.
- Chunks are allocated with custom "allocator" that has no
per-allocation overhead. It is possible because there is no need
to perform "free": allocator is freed as whole at once.
- The array of pointers to chunks is also bitmap indexed. It saves CPU time
when not every run of 32 consecutive pages has at least one dead tuple,
but consumes time otherwise. Therefore an additional optimization is added
to quickly skip the lookup for the first non-empty run of chunks.
(Ahhh, I believe this explanation is awful).
It sounds better than my proposal.
Andres Freund wrote 2021-07-20 02:49:
[...]
Therefore the Specialized Vacuum Tid Map always consumes the least memory
and is usually faster.
I'll experiment with the proposed ideas including this idea in more
scenarios and share the results tomorrow.
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
On Mon, Jul 26, 2021 at 11:01 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
I'll experiment with the proposed ideas including this idea in more
scenarios and share the results tomorrow.
I've done some benchmarks for the proposed data structures. In this trial,
I've added a scenario where dead tuples are concentrated on a
particular range of table blocks (tests 5-8), in addition to the
scenarios I've done in the previous trial. Also, I've done benchmarks
of each scenario while increasing table size. In the first test, the
maximum block number of the table is 1,000,000 (i.e., an 8GB table) and
in the second test, it's 10,000,000 (80GB table). We can see how
performance and memory consumption changes with a large-scale table.
Here are the results:
* Test 1
select prepare(
1000000, -- max block
10, -- # of dead tuples per page
1, -- dead tuples interval within a page
1, -- # of consecutive pages having dead tuples
20 -- page interval
);
name | size | attach | shuffled | size_x10 | attach_x10 | shuffled_x10
--------+-----------+--------+----------+------------+-----------+-------------
array | 57.23 MB | 0.040 | 98.613 | 572.21 MB | 0.387 | 1521.981
intset | 46.88 MB | 0.114 | 75.944 | 468.67 MB | 0.961 | 997.760
radix | 40.26 MB | 0.102 | 18.427 | 336.64 MB | 0.797 | 266.146
rtbm | 64.02 MB | 0.234 | 22.443 | 512.02 MB | 2.230 | 275.143
svtm | 27.28 MB | 0.060 | 13.568 | 274.07 MB | 0.476 | 211.073
tbm | 96.01 MB | 0.273 | 10.347 | 768.01 MB | 2.882 | 128.103
* Test 2
select prepare(
1000000, -- max block
10, -- # of dead tuples per page
1, -- dead tuples interval within a page
1, -- # of consecutive pages having dead tuples
1 -- page interval
);
name | size | attach | shuffled | size_x10 | attach_x10 | shuffled_x10
--------+-----------+--------+----------+------------+-----------+-------------
array | 57.23 MB | 0.041 | 4.757 | 572.21 MB | 0.344 | 71.228
intset | 46.88 MB | 0.127 | 3.762 | 468.67 MB | 1.093 | 49.573
radix | 9.95 MB | 0.048 | 0.679 | 82.57 MB | 0.371 | 16.211
rtbm | 34.02 MB | 0.179 | 0.534 | 288.02 MB | 2.092 | 8.693
svtm | 5.78 MB | 0.043 | 0.239 | 54.60 MB | 0.342 | 7.759
tbm | 96.01 MB | 0.274 | 0.521 | 768.01 MB | 2.685 | 6.360
* Test 3
select prepare(
1000000, -- max block
2, -- # of dead tuples per page
100, -- dead tuples interval within a page
1, -- # of consecutive pages having dead tuples
1 -- page interval
);
name | size | attach | shuffled | size_x10 | attach_x10 | shuffled_x10
--------+-----------+--------+----------+------------+-----------+-------------
array | 11.45 MB | 0.009 | 57.698 | 114.45 MB | 0.076 | 1045.639
intset | 15.63 MB | 0.031 | 46.083 | 156.23 MB | 0.243 | 848.525
radix | 40.26 MB | 0.063 | 13.755 | 336.64 MB | 0.501 | 223.413
rtbm | 36.02 MB | 0.123 | 11.527 | 320.02 MB | 1.843 | 180.977
svtm | 9.28 MB | 0.053 | 9.631 | 92.59 MB | 0.438 | 212.626
tbm | 96.01 MB | 0.228 | 10.381 | 768.01 MB | 2.258 | 126.630
* Test 4
select prepare(
1000000, -- max block
100, -- # of dead tuples per page
1, -- dead tuples interval within a page
1, -- # of consecutive pages having dead tuples
1 -- page interval
);
name | size | attach | shuffled | size_x10 | attach_x10 | shuffled_x10
--------+-----------+--------+----------+------------+-----------+-------------
array | 572.21 MB | 0.367 | 78.047 | 5722.05 MB | 3.942 | 1154.776
intset | 93.74 MB | 0.777 | 45.146 | 937.34 MB | 7.716 | 643.708
radix | 40.26 MB | 0.203 | 9.015 | 336.64 MB | 1.775 | 133.294
rtbm | 36.02 MB | 0.369 | 5.639 | 320.02 MB | 3.823 | 88.832
svtm | 7.28 MB | 0.294 | 3.891 | 73.60 MB | 2.690 | 103.744
tbm | 96.01 MB | 0.534 | 5.223 | 768.01 MB | 5.679 | 60.632
* Test 5
select prepare(
1000000, -- max block
150, -- # of dead tuples per page
1, -- dead tuples interval within a page
10000, -- # of consecutive pages having dead tuples
20000 -- page interval
);
There are 10000 consecutive pages that have 150 dead tuples at every
20000 pages.
name | size | attach | shuffled | size_x10 | attach_x10 | shuffled_x10
--------+-----------+--------+----------+------------+-----------+-------------
array | 429.16 MB | 0.274 | 75.664 | 4291.54 MB | 3.067 | 1259.501
intset | 46.88 MB | 0.559 | 36.449 | 468.67 MB | 4.565 | 517.445
radix | 20.26 MB | 0.166 | 8.466 | 196.90 MB | 1.273 | 166.587
rtbm | 18.02 MB | 0.242 | 8.491 | 160.02 MB | 2.407 | 171.725
svtm | 3.66 MB | 0.243 | 3.635 | 37.10 MB | 2.022 | 86.165
tbm | 48.01 MB | 0.344 | 9.763 | 384.01 MB | 3.327 | 151.824
* Test 6
select prepare(
1000000, -- max block
10, -- # of dead tuples per page
1, -- dead tuples interval within a page
10000, -- # of consecutive pages having dead tuples
20000 -- page interval
);
There are 10000 consecutive pages that have 10 dead tuples at every 20000 pages.
name | size | attach | shuffled | size_x10 | attach_x10 | shuffled_x10
--------+-----------+--------+----------+------------+-----------+-------------
array | 28.62 MB | 0.022 | 2.791 | 286.11 MB | 0.170 | 46.920
intset | 23.45 MB | 0.061 | 2.156 | 234.34 MB | 0.501 | 32.577
radix | 5.04 MB | 0.026 | 0.433 | 48.57 MB | 0.191 | 11.060
rtbm | 17.02 MB | 0.074 | 0.533 | 144.02 MB | 0.954 | 11.502
svtm | 3.16 MB | 0.023 | 0.206 | 27.60 MB | 0.175 | 4.886
tbm | 48.01 MB | 0.132 | 0.656 | 384.01 MB | 1.284 | 10.231
* Test 7
select prepare(
1000000, -- max block
150, -- # of dead tuples per page
1, -- dead tuples interval within a page
1000, -- # of consecutive pages having dead tuples
999000 -- page interval
);
There are pages that have 150 dead tuples at first 1000 blocks and
last 1000 blocks.
name | size | attach | shuffled | size_x10 | attach_x10 | shuffled_x10
--------+-----------+--------+----------+------------+-----------+-------------
array | 1.72 MB | 0.002 | 7.507 | 17.17 MB | 0.011 | 76.510
intset | 0.20 MB | 0.003 | 6.742 | 1.89 MB | 0.022 | 52.122
radix | 0.20 MB | 0.001 | 1.023 | 1.07 MB | 0.007 | 12.023
rtbm | 0.15 MB | 0.001 | 2.637 | 0.65 MB | 0.009 | 34.528
svtm | 0.52 MB | 0.002 | 0.721 | 0.61 MB | 0.010 | 6.434
tbm | 0.20 MB | 0.002 | 2.733 | 1.51 MB | 0.015 | 38.538
* Test 8
select prepare(
1000000, -- max block
100, -- # of dead tuples per page
1, -- dead tuples interval within a page
50, -- # of consecutive pages having dead tuples
100 -- page interval
);
There are 50 consecutive pages that have 100 dead tuples at every 100 pages.
name | size | attach | shuffled | size_x10 | attach_x10 | shuffled_x10
--------+-----------+--------+----------+------------+-----------+-------------
array | 286.11 MB | 0.184 | 67.233 | 2861.03 MB | 1.743 | 979.070
intset | 46.88 MB | 0.389 | 35.176 | 468.67 MB | 3.698 | 505.322
radix | 21.82 MB | 0.116 | 6.160 | 186.86 MB | 0.891 | 117.730
rtbm | 18.02 MB | 0.182 | 5.909 | 160.02 MB | 1.870 | 112.550
svtm | 4.28 MB | 0.152 | 3.213 | 37.60 MB | 1.383 | 79.073
tbm | 48.01 MB | 0.265 | 6.673 | 384.01 MB | 2.586 | 101.327
Overall, 'svtm' is faster and consumes less memory. 'radix' tree also
has good performance and memory usage.
From these results, svtm is the best data structure among proposed
ideas for dead tuple storage used during lazy vacuum in terms of
performance and memory usage. I think it can support iteration by
extracting the offset of dead tuples for each block while iterating
chunks.
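(For illustration, a minimal sketch of what such block-at-a-time iteration
could look like; the type and field names are hypothetical stand-ins rather
than the SVTm definitions from the patch:)

#include "postgres.h"
#include "port/pg_bitutils.h"
#include "storage/block.h"

/* Hypothetical, simplified view of one chunk: which of its 32 blocks are non-empty. */
typedef struct ChunkIterSketch
{
	uint32		chunk_number;	/* chunk covers blocks [chunk_number * 32, chunk_number * 32 + 32) */
	uint32		present;		/* bit i set => block (chunk_number * 32 + i) has dead tuples */
} ChunkIterSketch;

/*
 * Visit every non-empty block of one chunk in ascending order.  A real
 * iterator would additionally decode the page's (possibly sparse) bitmap
 * into an array of OffsetNumbers at each step.
 */
static void
chunk_iterate_blocks(const ChunkIterSketch *chunk,
					 void (*callback) (BlockNumber blkno, void *arg),
					 void *arg)
{
	uint32		remaining = chunk->present;

	while (remaining != 0)
	{
		int			i = pg_rightmost_one_pos32(remaining);

		callback((BlockNumber) (chunk->chunk_number * 32 + i), arg);
		remaining &= remaining - 1;		/* clear the bit just visited */
	}
}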
Apart from performance and memory usage points of view, we also need
to consider the reusability of the code. When I started this thread, I
thought the best data structure would be the one optimized for
vacuum's dead tuple storage. However, if we can use a data structure
that can also be used in general, we can use it also for other
purposes. Moreover, if it's too optimized for the current TID system
(32 bits block number, 16 bits offset number, maximum block/offset
number, etc.) it may become a blocker for future changes.
In that sense, radix tree also seems good since it can also be used in
gist vacuum as a replacement for intset, or a replacement for hash
table for shared buffer as discussed before. Are there any other use
cases? On the other hand, I'm concerned that a radix tree would be
over-engineering in terms of vacuum's dead tuple storage since the
dead tuple storage is static data and requires only lookup operations,
so if we want to use a radix tree as dead tuple storage, I'd like to see
further use cases.
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
Masahiko Sawada wrote 2021-07-27 07:06:
On Mon, Jul 26, 2021 at 11:01 PM Masahiko Sawada
<sawada.mshk@gmail.com> wrote:I'll experiment with the proposed ideas including this idea in more
scenarios and share the results tomorrow.I've done some benchmarks for proposed data structures. In this trial,
I've done with the scenario where dead tuples are concentrated on a
particular range of table blocks (test 5-8), in addition to the
scenarios I've done in the previous trial. Also, I've done benchmarks
of each scenario while increasing table size. In the first test, the
maximum block number of the table is 1,000,000 (i.g., 8GB table) and
in the second test, it's 10,000,000 (80GB table). We can see how
performance and memory consumption changes with a large-scale table.
Here are the results:* Test 1
select prepare(
1000000, -- max block
10, -- # of dead tuples per page
1, -- dead tuples interval within a page
1, -- # of consecutive pages having dead tuples
20 -- page interval
);name | attach | attach | shuffled | size_x10 | attach_x10|
shuffled_x10
--------+-----------+--------+----------+------------+-----------+-------------
array | 57.23 MB | 0.040 | 98.613 | 572.21 MB | 0.387 |
1521.981
intset | 46.88 MB | 0.114 | 75.944 | 468.67 MB | 0.961 |
997.760
radix | 40.26 MB | 0.102 | 18.427 | 336.64 MB | 0.797 |
266.146
rtbm | 64.02 MB | 0.234 | 22.443 | 512.02 MB | 2.230 |
275.143
svtm | 27.28 MB | 0.060 | 13.568 | 274.07 MB | 0.476 |
211.073
tbm | 96.01 MB | 0.273 | 10.347 | 768.01 MB | 2.882 |
128.103* Test 2
select prepare(
1000000, -- max block
10, -- # of dead tuples per page
1, -- dead tuples interval within a page
1, -- # of consecutive pages having dead tuples
1 -- page interval
);name | attach | attach | shuffled | size_x10 | attach_x10|
shuffled_x10
--------+-----------+--------+----------+------------+-----------+-------------
array | 57.23 MB | 0.041 | 4.757 | 572.21 MB | 0.344 |
71.228
intset | 46.88 MB | 0.127 | 3.762 | 468.67 MB | 1.093 |
49.573
radix | 9.95 MB | 0.048 | 0.679 | 82.57 MB | 0.371 |
16.211
rtbm | 34.02 MB | 0.179 | 0.534 | 288.02 MB | 2.092 |
8.693
svtm | 5.78 MB | 0.043 | 0.239 | 54.60 MB | 0.342 |
7.759
tbm | 96.01 MB | 0.274 | 0.521 | 768.01 MB | 2.685 |
6.360* Test 3
select prepare(
1000000, -- max block
2, -- # of dead tuples per page
100, -- dead tuples interval within a page
1, -- # of consecutive pages having dead tuples
1 -- page interval
);name | attach | attach | shuffled | size_x10 | attach_x10|
shuffled_x10
--------+-----------+--------+----------+------------+-----------+-------------
array | 11.45 MB | 0.009 | 57.698 | 114.45 MB | 0.076 |
1045.639
intset | 15.63 MB | 0.031 | 46.083 | 156.23 MB | 0.243 |
848.525
radix | 40.26 MB | 0.063 | 13.755 | 336.64 MB | 0.501 |
223.413
rtbm | 36.02 MB | 0.123 | 11.527 | 320.02 MB | 1.843 |
180.977
svtm | 9.28 MB | 0.053 | 9.631 | 92.59 MB | 0.438 |
212.626
tbm | 96.01 MB | 0.228 | 10.381 | 768.01 MB | 2.258 |
126.630* Test 4
select prepare(
1000000, -- max block
100, -- # of dead tuples per page
1, -- dead tuples interval within a page
1, -- # of consecutive pages having dead tuples
1 -- page interval
);name | attach | attach | shuffled | size_x10 | attach_x10|
shuffled_x10
--------+-----------+--------+----------+------------+-----------+-------------
array | 572.21 MB | 0.367 | 78.047 | 5722.05 MB | 3.942 |
1154.776
intset | 93.74 MB | 0.777 | 45.146 | 937.34 MB | 7.716 |
643.708
radix | 40.26 MB | 0.203 | 9.015 | 336.64 MB | 1.775 |
133.294
rtbm | 36.02 MB | 0.369 | 5.639 | 320.02 MB | 3.823 |
88.832
svtm | 7.28 MB | 0.294 | 3.891 | 73.60 MB | 2.690 |
103.744
tbm | 96.01 MB | 0.534 | 5.223 | 768.01 MB | 5.679 |
60.632* Test 5
select prepare(
1000000, -- max block
150, -- # of dead tuples per page
1, -- dead tuples interval within a page
10000, -- # of consecutive pages having dead tuples
20000 -- page interval
);There are 10000 consecutive pages that have 150 dead tuples at every
20000 pages.name | attach | attach | shuffled | size_x10 | attach_x10|
shuffled_x10
--------+-----------+--------+----------+------------+-----------+-------------
array | 429.16 MB | 0.274 | 75.664 | 4291.54 MB | 3.067 |
1259.501
intset | 46.88 MB | 0.559 | 36.449 | 468.67 MB | 4.565 |
517.445
radix | 20.26 MB | 0.166 | 8.466 | 196.90 MB | 1.273 |
166.587
rtbm | 18.02 MB | 0.242 | 8.491 | 160.02 MB | 2.407 |
171.725
svtm | 3.66 MB | 0.243 | 3.635 | 37.10 MB | 2.022 |
86.165
tbm | 48.01 MB | 0.344 | 9.763 | 384.01 MB | 3.327 |
151.824* Test 6
select prepare(
1000000, -- max block
10, -- # of dead tuples per page
1, -- dead tuples interval within a page
10000, -- # of consecutive pages having dead tuples
20000 -- page interval
);There are 10000 consecutive pages that have 10 dead tuples at every
20000 pages.name | attach | attach | shuffled | size_x10 | attach_x10|
shuffled_x10
--------+-----------+--------+----------+------------+-----------+-------------
array | 28.62 MB | 0.022 | 2.791 | 286.11 MB | 0.170 |
46.920
intset | 23.45 MB | 0.061 | 2.156 | 234.34 MB | 0.501 |
32.577
radix | 5.04 MB | 0.026 | 0.433 | 48.57 MB | 0.191 |
11.060
rtbm | 17.02 MB | 0.074 | 0.533 | 144.02 MB | 0.954 |
11.502
svtm | 3.16 MB | 0.023 | 0.206 | 27.60 MB | 0.175 |
4.886
tbm | 48.01 MB | 0.132 | 0.656 | 384.01 MB | 1.284 |
10.231* Test 7
select prepare(
1000000, -- max block
150, -- # of dead tuples per page
1, -- dead tuples interval within a page
1000, -- # of consecutive pages having dead tuples
999000 -- page interval
);There are pages that have 150 dead tuples at first 1000 blocks and
last 1000 blocks.name | attach | attach | shuffled | size_x10 | attach_x10|
shuffled_x10
--------+-----------+--------+----------+------------+-----------+-------------
array | 1.72 MB | 0.002 | 7.507 | 17.17 MB | 0.011 |
76.510
intset | 0.20 MB | 0.003 | 6.742 | 1.89 MB | 0.022 |
52.122
radix | 0.20 MB | 0.001 | 1.023 | 1.07 MB | 0.007 |
12.023
rtbm | 0.15 MB | 0.001 | 2.637 | 0.65 MB | 0.009 |
34.528
svtm | 0.52 MB | 0.002 | 0.721 | 0.61 MB | 0.010 |
6.434
tbm | 0.20 MB | 0.002 | 2.733 | 1.51 MB | 0.015 |
38.538* Test 8
select prepare(
1000000, -- max block
100, -- # of dead tuples per page
1, -- dead tuples interval within a page
50, -- # of consecutive pages having dead tuples
100 -- page interval
);There are 50 consecutive pages that have 100 dead tuples at every 100
pages.name | attach | attach | shuffled | size_x10 | attach_x10|
shuffled_x10
--------+-----------+--------+----------+------------+-----------+-------------
 array  | 286.11 MB | 0.184  | 67.233   | 2861.03 MB | 1.743     | 979.070
 intset | 46.88 MB  | 0.389  | 35.176   | 468.67 MB  | 3.698     | 505.322
 radix  | 21.82 MB  | 0.116  | 6.160    | 186.86 MB  | 0.891     | 117.730
 rtbm   | 18.02 MB  | 0.182  | 5.909    | 160.02 MB  | 1.870     | 112.550
 svtm   | 4.28 MB   | 0.152  | 3.213    | 37.60 MB   | 1.383     | 79.073
 tbm    | 48.01 MB  | 0.265  | 6.673    | 384.01 MB  | 2.586     | 101.327

Overall, 'svtm' is faster and consumes less memory. 'radix' tree also
has good performance and memory usage.
From these results, svtm is the best data structure among proposed
ideas for dead tuple storage used during lazy vacuum in terms of
performance and memory usage. I think it can support iteration by
extracting the offset of dead tuples for each block while iterating
chunks.

Apart from performance and memory usage points of view, we also need
to consider the reusability of the code. When I started this thread, I
thought the best data structure would be the one optimized for
vacuum's dead tuple storage. However, if we can use a data structure
that can also be used in general, we can use it also for other
purposes. Moreover, if it's too optimized for the current TID system
(32 bits block number, 16 bits offset number, maximum block/offset
number, etc.) it may become a blocker for future changes.

In that sense, radix tree also seems good since it can also be used in
gist vacuum as a replacement for intset, or a replacement for hash
table for shared buffer as discussed before. Are there any other use
cases? On the other hand, I’m concerned that radix tree would be an
over-engineering in terms of vacuum's dead tuples storage since the
dead tuple storage is static data and requires only lookup operation,
so if we want to use radix tree as dead tuple storage, I'd like to see
further use cases.
I can certainly evolve svtm into a transparent intset replacement. Using
the same trick as radix_to_key, it will store TIDs efficiently:

shift = pg_ceil_log2_32(MaxHeapTuplesPerPage);
tid_i = ItemPointerGetOffsetNumber(tid);
tid_i |= ItemPointerGetBlockNumber(tid) << shift;

I will do it this evening.
regards
Yura Sokolov aka funny_falcon
Hi,
On 2021-07-25 19:07:18 +0300, Yura Sokolov wrote:
I've dreamed of writing a more compact structure for vacuum for three
years, but life didn't give me the time to.

Let me join the friendly competition.

I've bet on the HATM approach: popcount-ing bitmaps for non-empty elements.
My concern with several of the proposals in this thread is that they
over-optimize for this specific case. It's not actually that crucial to
have a crazily optimized vacuum dead tid storage datatype. Having
something more general that also performs reasonably for the dead tuple
storage, but also performs well in a number of other cases, makes a lot
more sense to me.
(Bad radix result probably due to smaller cache in notebook's CPU ?)
Probably largely due to the node dispatch. a) For some reason gcc likes
jump tables too much, I get better numbers when disabling those b) the
node type dispatch should be stuffed into the low bits of the pointer.
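For readers unfamiliar with that trick, here is a minimal, hypothetical
sketch (none of these names come from the posted patches): because radix
tree nodes are allocated with at least 8-byte alignment, the low bits of a
child pointer are always zero in a real address and can carry the node
type, so dispatching on the node kind needs no extra memory access.

/* Hypothetical sketch of tagging the node type into low pointer bits. */
typedef enum { NODE_4, NODE_16, NODE_48, NODE_256 } NodeKind;

#define NODE_KIND_MASK	((uintptr_t) 0x7)

static inline void *
tagged_make(void *node, NodeKind kind)
{
	/* nodes are assumed to be at least 8-byte aligned */
	Assert(((uintptr_t) node & NODE_KIND_MASK) == 0);
	return (void *) ((uintptr_t) node | (uintptr_t) kind);
}

static inline NodeKind
tagged_kind(void *tagged)
{
	return (NodeKind) ((uintptr_t) tagged & NODE_KIND_MASK);
}

static inline void *
tagged_node(void *tagged)
{
	return (void *) ((uintptr_t) tagged & ~NODE_KIND_MASK);
}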
select prepare(1000000, 2, 100, 1);
attach size shuffled
array 6ms 12MB 53.42s
intset 23ms 16MB 54.99s
rtbm 115ms 38MB 8.19s
tbm 186ms 100MB 8.37s
vtbm 105ms 59MB 9.08s
radix 64ms 42MB 10.41s
svtm 73ms 10MB 7.49s
Test4
select prepare(1000000, 100, 1, 1);
attach size shuffled
array 304ms 600MB 75.12s
intset 775ms 98MB 47.49s
rtbm 356ms 38MB 4.11s
tbm 539ms 100MB 4.20s
vtbm 493ms 42MB 4.44s
radix 263ms 42MB 6.05s
svtm 360ms 8MB 3.49s

Therefore the Specialized Vacuum Tid Map always consumes the least
memory and is usually faster.
Impressive.
Greetings,
Andres Freund
Hi,
On 2021-07-27 13:06:56 +0900, Masahiko Sawada wrote:
Apart from performance and memory usage points of view, we also need
to consider the reusability of the code. When I started this thread, I
thought the best data structure would be the one optimized for
vacuum's dead tuple storage. However, if we can use a data structure
that can also be used in general, we can use it also for other
purposes. Moreover, if it's too optimized for the current TID system
(32 bits block number, 16 bits offset number, maximum block/offset
number, etc.) it may become a blocker for future changes.
Indeed.
In that sense, radix tree also seems good since it can also be used in
gist vacuum as a replacement for intset, or a replacement for hash
table for shared buffer as discussed before. Are there any other use
cases?
Yes, I think there are. Whenever there is some spatial locality it has a
decent chance of winning over a hash table, and it will most of the time
win over ordered datastructures like rbtrees (which perform very poorly
due to the number of branches and pointer dispatches). There's plenty
hashtables, e.g. for caches, locks, etc, in PG that have a medium-high
degree of locality, so I'd expect a few potential uses. When adding
"tree compression" (i.e. skip inner nodes that have a single incoming &
outgoing node) radix trees even can deal quite performantly with
variable width keys.
On the other hand, I’m concerned that radix tree would be an
over-engineering in terms of vacuum's dead tuples storage since the
dead tuple storage is static data and requires only lookup operation,
so if we want to use radix tree as dead tuple storage, I'd like to see
further use cases.
I don't think we should rely on the read-only-ness. It seems pretty
clear that we'd want parallel dead-tuple scans at a point not too far
into the future?
Greetings,
Andres Freund
On Thu, Jul 29, 2021 at 3:53 AM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2021-07-27 13:06:56 +0900, Masahiko Sawada wrote:
Apart from performance and memory usage points of view, we also need
to consider the reusability of the code. When I started this thread, I
thought the best data structure would be the one optimized for
vacuum's dead tuple storage. However, if we can use a data structure
that can also be used in general, we can use it also for other
purposes. Moreover, if it's too optimized for the current TID system
(32 bits block number, 16 bits offset number, maximum block/offset
number, etc.) it may become a blocker for future changes.

Indeed.

In that sense, radix tree also seems good since it can also be used in
gist vacuum as a replacement for intset, or a replacement for hash
table for shared buffer as discussed before. Are there any other use
cases?

Yes, I think there are. Whenever there is some spatial locality it has a
decent chance of winning over a hash table, and it will most of the time
win over ordered datastructures like rbtrees (which perform very poorly
due to the number of branches and pointer dispatches). There's plenty
hashtables, e.g. for caches, locks, etc, in PG that have a medium-high
degree of locality, so I'd expect a few potential uses. When adding
"tree compression" (i.e. skip inner nodes that have a single incoming &
outgoing node) radix trees even can deal quite performantly with
variable width keys.
Good point.
On the other hand, I’m concerned that radix tree would be an
over-engineering in terms of vacuum's dead tuples storage since the
dead tuple storage is static data and requires only lookup operation,
so if we want to use radix tree as dead tuple storage, I'd like to see
further use cases.

I don't think we should rely on the read-only-ness. It seems pretty
clear that we'd want parallel dead-tuple scans at a point not too far
into the future?
Indeed. Given that the radix tree itself has other use cases, I have
no concern about using radix tree for vacuum's dead tuples storage. It
will be better to have one that can be generally used and has some
optimizations that are helpful also for vacuum's use case, rather than
having one that is very optimized only for vacuum's use case.
During the performance benchmark, I found some bugs in the radix tree
implementation. Also, we need the functionality of tree iteration, and
if we have the radix tree in the source tree as a general library, we
need some changes since the current implementation seems to be for a
replacement for shared buffer’s hash table. I'll try to work on that
stuff as a PoC if you don't. What do you think?
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
Masahiko Sawada wrote 2021-07-29 12:11:
On Thu, Jul 29, 2021 at 3:53 AM Andres Freund <andres@anarazel.de>
wrote:

Hi,
On 2021-07-27 13:06:56 +0900, Masahiko Sawada wrote:
Apart from performance and memory usage points of view, we also need
to consider the reusability of the code. When I started this thread, I
thought the best data structure would be the one optimized for
vacuum's dead tuple storage. However, if we can use a data structure
that can also be used in general, we can use it also for other
purposes. Moreover, if it's too optimized for the current TID system
(32 bits block number, 16 bits offset number, maximum block/offset
number, etc.) it may become a blocker for future changes.

Indeed.

In that sense, radix tree also seems good since it can also be used in
gist vacuum as a replacement for intset, or a replacement for hash
table for shared buffer as discussed before. Are there any other use
cases?

Yes, I think there are. Whenever there is some spatial locality it has
a
decent chance of winning over a hash table, and it will most of the
time
win over ordered datastructures like rbtrees (which perform very
poorly
due to the number of branches and pointer dispatches). There's plenty
hashtables, e.g. for caches, locks, etc, in PG that have a medium-high
degree of locality, so I'd expect a few potential uses. When adding
"tree compression" (i.e. skip inner nodes that have a single incoming
&
outgoing node) radix trees even can deal quite performantly with
variable width keys.

Good point.

On the other hand, I’m concerned that radix tree would be an
over-engineering in terms of vacuum's dead tuples storage since the
dead tuple storage is static data and requires only lookup operation,
so if we want to use radix tree as dead tuple storage, I'd like to see
further use cases.

I don't think we should rely on the read-only-ness. It seems pretty
clear that we'd want parallel dead-tuple scans at a point not too far
into the future?

Indeed. Given that the radix tree itself has other use cases, I have
no concern about using radix tree for vacuum's dead tuples storage. It
will be better to have one that can be generally used and has some
optimizations that are helpful also for vacuum's use case, rather than
having one that is very optimized only for vacuum's use case.
Main portion of svtm that leads to memory saving is compression of many
pages at once (CHUNK). It could be combined with radix as a storage for
pointers to CHUNKs.
For a moment I'm benchmarking IntegerSet replacement based on Trie (HATM
like)
and CHUNK compression, therefore datastructure could be used for gist
vacuum as well.
Since it is generic (allows to index all 64bit) it lacks of trick used
to speedup svtm. Still on 10x test it is faster than radix.
I'll send result later today after all benchmarks complete.
And I'll try then to make mix of radix and CHUNK compression.
During the performance benchmark, I found some bugs in the radix tree
implementation.
There is a bug in radix_to_key_off as well:
tid_i |= ItemPointerGetBlockNumber(tid) << shift;
ItemPointerGetBlockNumber returns uint32, therefore the result after the
shift is uint32 as well.
It leads to lower memory consumption (and therefore better times) on the
10x test when the page number exceeds 2^23 (8M). It still produces a
"correct" result for the test since every page is filled in the same way.
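To illustrate the truncation, a hedged sketch (the fixed form mirrors what
intset2_encode in the attached patch does; the exact radix_to_key_off code
is not reproduced here):

uint32		shift = pg_ceil_log2_32(MaxHeapTuplesPerPage);	/* 9 for 8kB pages */
uint64		tid_i;

/* Buggy: ItemPointerGetBlockNumber() returns uint32, so the shift happens
 * in 32 bits and block numbers >= 2^23 silently wrap before the value is
 * widened to uint64. */
tid_i = ItemPointerGetOffsetNumber(tid);
tid_i |= ItemPointerGetBlockNumber(tid) << shift;

/* Fixed: widen to uint64 before shifting. */
tid_i = ItemPointerGetOffsetNumber(tid);
tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << shift;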
Could you push your fixes for radix, please?
regards,
Yura Sokolov
y.sokolov@postgrespro.ru
funny.falcon@gmail.com
On Thu, Jul 29, 2021 at 8:03 PM Yura Sokolov <y.sokolov@postgrespro.ru> wrote:
Masahiko Sawada wrote 2021-07-29 12:11:
On Thu, Jul 29, 2021 at 3:53 AM Andres Freund <andres@anarazel.de>
wrote:

Hi,
On 2021-07-27 13:06:56 +0900, Masahiko Sawada wrote:
Apart from performance and memory usage points of view, we also need
to consider the reusability of the code. When I started this thread, I
thought the best data structure would be the one optimized for
vacuum's dead tuple storage. However, if we can use a data structure
that can also be used in general, we can use it also for other
purposes. Moreover, if it's too optimized for the current TID system
(32 bits block number, 16 bits offset number, maximum block/offset
number, etc.) it may become a blocker for future changes.

Indeed.

In that sense, radix tree also seems good since it can also be used in
gist vacuum as a replacement for intset, or a replacement for hash
table for shared buffer as discussed before. Are there any other use
cases?

Yes, I think there are. Whenever there is some spatial locality it has
a
decent chance of winning over a hash table, and it will most of the
time
win over ordered datastructures like rbtrees (which perform very
poorly
due to the number of branches and pointer dispatches). There's plenty
hashtables, e.g. for caches, locks, etc, in PG that have a medium-high
degree of locality, so I'd expect a few potential uses. When adding
"tree compression" (i.e. skip inner nodes that have a single incoming
&
outgoing node) radix trees even can deal quite performantly with
variable width keys.

Good point.

On the other hand, I’m concerned that radix tree would be an
over-engineering in terms of vacuum's dead tuples storage since the
dead tuple storage is static data and requires only lookup operation,
so if we want to use radix tree as dead tuple storage, I'd like to see
further use cases.

I don't think we should rely on the read-only-ness. It seems pretty
clear that we'd want parallel dead-tuple scans at a point not too far
into the future?

Indeed. Given that the radix tree itself has other use cases, I have
no concern about using radix tree for vacuum's dead tuples storage. It
will be better to have one that can be generally used and has some
optimizations that are helpful also for vacuum's use case, rather than
having one that is very optimized only for vacuum's use case.

Main portion of svtm that leads to memory saving is compression of many
pages at once (CHUNK). It could be combined with radix as a storage for
pointers to CHUNKs.

For a moment I'm benchmarking IntegerSet replacement based on Trie (HATM
like) and CHUNK compression, therefore datastructure could be used for
gist vacuum as well.

Since it is generic (allows to index all 64bit) it lacks of trick used
to speedup svtm. Still on 10x test it is faster than radix.
BTW, how does svtm work when we add two sets of dead tuple TIDs to one
svtm? Dead tuple TIDs are unique sets but those sets could have TIDs
of the different offsets on the same block. The case I imagine is the
idea discussed on this thread[1]. With this idea, we store the
collected dead tuple TIDs somewhere and skip index vacuuming for some
reason (index skipping optimization, failsafe mode, or interruptions
etc.). Then, in the next lazy vacuum timing, we load the dead tuple
TIDs and start to scan the heap. During the heap scan in the second
lazy vacuum, it's possible that new dead tuples will be found on the
pages that we have already stored in svtm during the first lazy
vacuum. How can we efficiently update the chunk in the svtm?
Regards,
[1]: /messages/by-id/CA+TgmoZgapzekbTqdBrcH8O8Yifi10_nB7uWLB8ajAhGL21M6A@mail.gmail.com
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
Masahiko Sawada wrote 2021-07-29 17:29:
On Thu, Jul 29, 2021 at 8:03 PM Yura Sokolov <y.sokolov@postgrespro.ru>
wrote:

Masahiko Sawada wrote 2021-07-29 12:11:
On Thu, Jul 29, 2021 at 3:53 AM Andres Freund <andres@anarazel.de>
wrote:

Hi,
On 2021-07-27 13:06:56 +0900, Masahiko Sawada wrote:
Apart from performance and memory usage points of view, we also need
to consider the reusability of the code. When I started this thread, I
thought the best data structure would be the one optimized for
vacuum's dead tuple storage. However, if we can use a data structure
that can also be used in general, we can use it also for other
purposes. Moreover, if it's too optimized for the current TID system
(32 bits block number, 16 bits offset number, maximum block/offset
number, etc.) it may become a blocker for future changes.

Indeed.

In that sense, radix tree also seems good since it can also be used in
gist vacuum as a replacement for intset, or a replacement for hash
table for shared buffer as discussed before. Are there any other use
cases?

Yes, I think there are. Whenever there is some spatial locality it has
a
decent chance of winning over a hash table, and it will most of the
time
win over ordered datastructures like rbtrees (which perform very
poorly
due to the number of branches and pointer dispatches). There's plenty
hashtables, e.g. for caches, locks, etc, in PG that have a medium-high
degree of locality, so I'd expect a few potential uses. When adding
"tree compression" (i.e. skip inner nodes that have a single incoming
&
outgoing node) radix trees even can deal quite performantly with
variable width keys.

Good point.

On the other hand, I’m concerned that radix tree would be an
over-engineering in terms of vacuum's dead tuples storage since the
dead tuple storage is static data and requires only lookup operation,
so if we want to use radix tree as dead tuple storage, I'd like to see
further use cases.

I don't think we should rely on the read-only-ness. It seems pretty
clear that we'd want parallel dead-tuple scans at a point not too far
into the future?

Indeed. Given that the radix tree itself has other use cases, I have
no concern about using radix tree for vacuum's dead tuples storage. It
will be better to have one that can be generally used and has some
optimizations that are helpful also for vacuum's use case, rather than
having one that is very optimized only for vacuum's use case.

Main portion of svtm that leads to memory saving is compression of many
pages at once (CHUNK). It could be combined with radix as a storage for
pointers to CHUNKs.

For a moment I'm benchmarking IntegerSet replacement based on Trie (HATM
like) and CHUNK compression, therefore datastructure could be used for
gist vacuum as well.

Since it is generic (allows to index all 64bit) it lacks of trick used
to speedup svtm. Still on 10x test it is faster than radix.
I've attached the IntegerSet2 patch for the pgtools repo and benchmark results.
Branch: https://github.com/funny-falcon/pgtools/tree/integerset2

SVTM is measured with a couple of changes from commit 5055ef72d23482dd3e11ce
in that branch: 1) compress the bitmap more often, but slower, 2) a couple of
popcount tricks.

IntegerSet2 consists of a trie index to CHUNKs. A CHUNK is a compressed bitmap
of 2^15 (6+9) bits (almost like in SVTM, but for a fixed bit width).

Well, IntegerSet2 is always faster than IntegerSet and always uses
significantly less memory (radix uses more memory than IntegerSet in a
couple of tests and comparable memory in others).

IntegerSet2 is not always faster than radix. It is more like radix.
That is because both are generic prefix trees with a comparable number of
memory accesses. SVTM did the trick by being not a multilevel prefix tree,
but just a 1-level bitmap index to chunks.

I believe the trie part of IntegerSet2 could be replaced with radix,
i.e. use radix as the storage for pointers to CHUNKs.
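To make that layout concrete, here is a small illustrative decomposition of
a value, following the macros in the attached integerset2.c (assuming 64-bit
bitmapwords, i.e. SHIFT = 6, LEAF_SHIFT = 9, CHUNK_SHIFT = 15); the function
name is made up for the example:

static void
intset2_explain_value(uint64 v)		/* v is relative to intset->firstvalue */
{
	uint64		trie_key = v >> 15;			/* which 2^15-value CHUNK (trie key) */
	uint32		leaf = (v >> 9) & 63;		/* CPOS: 512-bit leaf within the chunk */
	uint32		lbyte = (v >> 3) & 63;		/* LBYTE: byte within the leaf bitmap */
	uint32		lbit = 1 << (v & 7);		/* LBIT: bit within that byte */

	elog(DEBUG1, "chunk=" UINT64_FORMAT " leaf=%u byte=%u mask=0x%x",
		 trie_key, leaf, lbyte, lbit);
}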
BTW, how does svtm work when we add two sets of dead tuple TIDs to one
svtm? Dead tuple TIDs are unique sets but those sets could have TIDs
of the different offsets on the same block. The case I imagine is the
idea discussed on this thread[1]. With this idea, we store the
collected dead tuple TIDs somewhere and skip index vacuuming for some
reason (index skipping optimization, failsafe mode, or interruptions
etc.). Then, in the next lazy vacuum timing, we load the dead tuple
TIDs and start to scan the heap. During the heap scan in the second
lazy vacuum, it's possible that new dead tuples will be found on the
pages that we have already stored in svtm during the first lazy
vacuum. How can we efficiently update the chunk in the svtm?
If we store the tidmap to disk, then it will be serialized. Since SVTM/
IntegerSet2 are ordered, they could be loaded in order. Then we can just
merge tuples on a per-page basis: deserialize the page (or CHUNK), put the
new tuples in, and store it again. Since both scans (the scan of the
serialized map and the scan of the table) are in order, merging will be
cheap enough, as sketched below.
SVTM and IntegerSet2 already work in a "buffered" way on insertion.
(As well as IntegerSet, which also does compression but in small parts.)
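As a rough sketch of that merge (a hypothetical helper, reusing
intset2_add_member/intset2_encode from the attached benchmark code; not part
of any posted patch): because both inputs are sorted by TID, one pass
suffices and each CHUNK is rebuilt at most once.

static void
merge_dead_tids(IntegerSet2 *dst,
				ItemPointer old_tids, int nold,	/* from the serialized map */
				ItemPointer new_tids, int nnew)	/* from the current heap scan */
{
	int			i = 0,
				j = 0;

	while (i < nold || j < nnew)
	{
		int			cmp;

		if (i >= nold)
			cmp = 1;
		else if (j >= nnew)
			cmp = -1;
		else
			cmp = ItemPointerCompare(&old_tids[i], &new_tids[j]);

		if (cmp < 0)
			intset2_add_member(dst, intset2_encode(&old_tids[i++]));
		else if (cmp > 0)
			intset2_add_member(dst, intset2_encode(&new_tids[j++]));
		else
		{
			/* duplicate TID; add once */
			intset2_add_member(dst, intset2_encode(&old_tids[i++]));
			j++;
		}
	}
}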
regards,
Yura Sokolov
y.sokolov@postgrespro.ru
funny.falcon@gmail.com
Attachments:
0001-integerset2.patch
From c555983109cf202a2bd395de77711f302b7a5024 Mon Sep 17 00:00:00 2001
From: Yura Sokolov <funny.falcon@gmail.com>
Date: Wed, 28 Jul 2021 17:21:02 +0300
Subject: [PATCH] integerset2
---
bdbench/Makefile | 2 +-
bdbench/bdbench--1.0.sql | 5 +
bdbench/bdbench.c | 86 +++-
bdbench/bench.sql | 2 +
bdbench/integerset2.c | 887 +++++++++++++++++++++++++++++++++++++++
bdbench/integerset2.h | 15 +
6 files changed, 994 insertions(+), 3 deletions(-)
create mode 100644 bdbench/integerset2.c
create mode 100644 bdbench/integerset2.h
diff --git a/bdbench/Makefile b/bdbench/Makefile
index a6f758f..0b00211 100644
--- a/bdbench/Makefile
+++ b/bdbench/Makefile
@@ -2,7 +2,7 @@
MODULE_big = bdbench
DATA = bdbench--1.0.sql
-OBJS = bdbench.o vtbm.o rtbm.o radix.o svtm.o
+OBJS = bdbench.o vtbm.o rtbm.o radix.o svtm.o integerset2.o
EXTENSION = bdbench
REGRESS= bdbench
diff --git a/bdbench/bdbench--1.0.sql b/bdbench/bdbench--1.0.sql
index 0ba10a8..ae15514 100644
--- a/bdbench/bdbench--1.0.sql
+++ b/bdbench/bdbench--1.0.sql
@@ -115,3 +115,8 @@ CREATE FUNCTION radix_run_tests()
RETURNS void
AS 'MODULE_PATHNAME'
LANGUAGE C STRICT VOLATILE;
+
+CREATE FUNCTION intset2_run_tests()
+RETURNS void
+AS 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE;
diff --git a/bdbench/bdbench.c b/bdbench/bdbench.c
index d15526e..883099c 100644
--- a/bdbench/bdbench.c
+++ b/bdbench/bdbench.c
@@ -22,6 +22,7 @@
#include "rtbm.h"
#include "radix.h"
#include "svtm.h"
+#include "integerset2.h"
//#define DEBUG_DUMP_MATCHED 1
@@ -93,6 +94,7 @@ PG_FUNCTION_INFO_V1(bench);
PG_FUNCTION_INFO_V1(test_generate_tid);
PG_FUNCTION_INFO_V1(rtbm_test);
PG_FUNCTION_INFO_V1(radix_run_tests);
+PG_FUNCTION_INFO_V1(intset2_run_tests);
PG_FUNCTION_INFO_V1(prepare);
/*
@@ -159,6 +161,14 @@ static bool svtm_reaped(LVTestType *lvtt, ItemPointer itemptr);
static Size svtm_mem_usage(LVTestType *lvtt);
static void svtm_load(SVTm *tbm, ItemPointerData *itemptrs, int nitems);
+/* intset2 */
+static void intset2_init(LVTestType *lvtt, uint64 nitems);
+static void intset2_fini(LVTestType *lvtt);
+static void intset2_attach(LVTestType *lvtt, uint64 nitems, BlockNumber minblk,
+ BlockNumber maxblk, OffsetNumber maxoff);
+static bool intset2_reaped(LVTestType *lvtt, ItemPointer itemptr);
+static Size intset2_mem_usage(LVTestType *lvtt);
+
/* Misc functions */
static void generate_index_tuples(uint64 nitems, BlockNumber minblk,
@@ -185,7 +195,7 @@ static void load_rtbm(RTbm *vtbm, ItemPointerData *itemptrs, int nitems);
.mem_usage_fn = n##_mem_usage, \
}
-#define TEST_SUBJECT_TYPES 7
+#define TEST_SUBJECT_TYPES 8
static LVTestType LVTestSubjects[TEST_SUBJECT_TYPES] =
{
DECLARE_SUBJECT(array),
@@ -194,7 +204,8 @@ static LVTestType LVTestSubjects[TEST_SUBJECT_TYPES] =
DECLARE_SUBJECT(vtbm),
DECLARE_SUBJECT(rtbm),
DECLARE_SUBJECT(radix),
- DECLARE_SUBJECT(svtm)
+ DECLARE_SUBJECT(svtm),
+ DECLARE_SUBJECT(intset2)
};
static bool
@@ -843,6 +854,69 @@ svtm_load(SVTm *svtm, ItemPointerData *itemptrs, int nitems)
svtm_finalize_addition(svtm);
}
+/* ------------ intset2 ----------- */
+static void
+intset2_init(LVTestType *lvtt, uint64 nitems)
+{
+ MemoryContext old_ctx;
+
+ lvtt->mcxt = AllocSetContextCreate(TopMemoryContext,
+ "intset2 bench",
+ ALLOCSET_DEFAULT_SIZES);
+ old_ctx = MemoryContextSwitchTo(lvtt->mcxt);
+ lvtt->private = intset2_create();
+ MemoryContextSwitchTo(old_ctx);
+}
+
+static void
+intset2_fini(LVTestType *lvtt)
+{
+ if (lvtt->private != NULL)
+ intset2_free(lvtt->private);
+}
+
+static inline uint64
+intset2_encode(ItemPointer tid)
+{
+ uint64 tid_i;
+ uint32 shift = pg_ceil_log2_32(MaxHeapTuplesPerPage);
+
+ Assert(ItemPointerGetOffsetNumber(tid)>0);
+ tid_i = ItemPointerGetOffsetNumber(tid) - 1;
+ tid_i |= (uint64)ItemPointerGetBlockNumber(tid) << shift;
+
+ return tid_i;
+}
+
+static void
+intset2_attach(LVTestType *lvtt, uint64 nitems, BlockNumber minblk,
+ BlockNumber maxblk, OffsetNumber maxoff)
+{
+ uint64 i;
+ MemoryContext oldcontext = MemoryContextSwitchTo(lvtt->mcxt);
+
+ for (i = 0; i < nitems; i++)
+ {
+ intset2_add_member(lvtt->private,
+ intset2_encode(DeadTuples_orig->itemptrs + i));
+ }
+
+ MemoryContextSwitchTo(oldcontext);
+}
+
+static bool
+intset2_reaped(LVTestType *lvtt, ItemPointer itemptr)
+{
+ return intset2_is_member(lvtt->private, intset2_encode(itemptr));
+}
+
+static uint64
+intset2_mem_usage(LVTestType *lvtt)
+{
+ //svtm_stats((SVTm *) lvtt->private);
+ return MemoryContextMemAllocated(lvtt->mcxt, true);
+}
+
static void
attach(LVTestType *lvtt, uint64 nitems, BlockNumber minblk, BlockNumber maxblk,
@@ -1229,3 +1303,11 @@ radix_run_tests(PG_FUNCTION_ARGS)
PG_RETURN_VOID();
}
+
+Datum
+intset2_run_tests(PG_FUNCTION_ARGS)
+{
+ intset2_test_1();
+
+ PG_RETURN_VOID();
+}
diff --git a/bdbench/bench.sql b/bdbench/bench.sql
index b303591..01ee846 100644
--- a/bdbench/bench.sql
+++ b/bdbench/bench.sql
@@ -17,6 +17,7 @@ select 'tbm', attach_dead_tuples('tbm');
select 'vtbm', attach_dead_tuples('vtbm');
select 'radix', attach_dead_tuples('radix');
select 'svtm', attach_dead_tuples('svtm');
+select 'intset2', attach_dead_tuples('intset2');
-- Do benchmark of lazy_tid_reaped.
select 'array bench', bench('array');
@@ -26,6 +27,7 @@ select 'tbm bench', bench('tbm');
select 'vtbm bench', bench('vtbm');
select 'radix', bench('radix');
select 'svtm', bench('svtm');
+select 'intset2', bench('intset2');
-- Check the memory usage.
select * from pg_backend_memory_contexts where name ~ 'bench' or name = 'TopMemoryContext' order by name;
diff --git a/bdbench/integerset2.c b/bdbench/integerset2.c
new file mode 100644
index 0000000..441e224
--- /dev/null
+++ b/bdbench/integerset2.c
@@ -0,0 +1,887 @@
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "utils/builtins.h"
+#include "utils/memutils.h"
+#include "lib/stringinfo.h"
+#include "port/pg_bitutils.h"
+#include "nodes/bitmapset.h"
+
+#include "integerset2.h"
+
+#define ONE ((bitmapword)1)
+#if BITS_PER_BITMAPWORD == 64
+#define SHIFT 6
+#define pg_popcountW pg_popcount64
+#else
+#define SHIFT 5
+#define pg_popcountW pg_popcount32
+#endif
+#define START(x) ((x) >> SHIFT)
+#define STARTN(x, n) ((x) >> (SHIFT*(n)))
+#define NBIT(x) ((x) & (((uint64)1 << SHIFT)-1))
+#define BIT(x) (ONE << NBIT(x))
+#define NBITN(x, n) NBIT(STARTN(x, (n)-1))
+#define BITN(x, n) (ONE << NBITN((x), (n)))
+
+/*
+ * Compressed leaf bitmap is indexed with 2 level bitmap index with
+ * 1 byte in root level. Therefore there is 8 bytes in second level
+ * and 64 bytes in third level.
+ */
+#define LEAF_SHIFT (3+3+3)
+#define LEAF_BITS (1 << LEAF_SHIFT)
+#define LEAF_BYTES (LEAF_BITS / 8)
+#define LBYTE(x) (((x) / 8) & (LEAF_BYTES-1))
+#define LBIT(x) (1 << ((x) & 7));
+
+#define CHUNK_LEAFS BITS_PER_BITMAPWORD
+#define CHUNK_SHIFT (LEAF_SHIFT + SHIFT)
+#define CSTART(x) ((x) & ~(((uint64)1 << CHUNK_SHIFT)-1))
+#define CPOS(x) NBIT((x) >> LEAF_SHIFT)
+
+#define VAL_TO_PAGE(val) ((val) >> LEAF_SHIFT)
+#define VAL_TO_CHUNK(val) ((val) >> CHUNK_SHIFT)
+#define TRIE_LEVELS (64 / SHIFT)
+
+#define ISAllocBatch (1<<18)
+
+typedef struct IntsetAllocator IntsetAllocator;
+struct IntsetAllocator
+{
+ Size total_size;
+ Size alloc_size;
+ Size pos;
+ Size limit;
+ uint8 *current;
+ List *chunks;
+};
+
+/* TRIE (HAMT like) */
+typedef struct IntsetTrieVal IntsetTrieVal;
+typedef struct IntsetTrieElem IntsetTrieElem;
+typedef void* (*trie_alloc)(Size size, void *arg);
+typedef struct IntsetTrie IntsetTrie;
+
+struct IntsetTrieElem
+{
+ uint64 key;
+ bitmapword bitmap;
+ union
+ {
+ void *val;
+ IntsetTrieElem *children;
+ } p;
+};
+
+struct IntsetTrie
+{
+ trie_alloc alloc;
+ void *alloc_arg;
+
+ int root_level;
+ IntsetTrieElem root;
+ uint32 n[TRIE_LEVELS - 1];
+ IntsetTrieElem l[TRIE_LEVELS - 1][BITS_PER_BITMAPWORD];
+};
+
+struct IntsetTrieVal
+{
+ bitmapword bitmap;
+ void *val;
+};
+
+/* Intset */
+
+typedef enum IntsetLeafType IntsetLeafType;
+typedef struct IntsetLeafBitmap IntsetLeafBitmap;
+typedef struct IntsetLeafEmbed IntsetLeafEmbed;
+typedef union IntsetLeafHeader IntsetLeafHeader;
+/* alias for pointer */
+typedef IntsetLeafHeader IntsetChunk;
+typedef struct IntsetLeafBuilder IntsetLeafBuilder;
+typedef struct IntsetChunkBuilder IntsetChunkBuilder;
+
+#define bm2(b,c) (((b)<<1)|(c))
+enum IntsetLeafType {
+ LT_RAW = bm2(0, 0),
+ LT_INVERSE = bm2(0, 1),
+ LT_SPARSE = bm2(1, 0),
+ LT_EMBED = bm2(1, 1),
+};
+
+struct IntsetLeafBitmap
+{
+ IntsetLeafType type:2;
+ uint32 minbyte:6;
+ uint32 maxbyte:6;
+ uint32 offset:16;
+};
+
+struct IntsetLeafEmbed
+{
+ IntsetLeafType type:2;
+ uint32 v0:9;
+ uint32 v1:9;
+ uint32 v2:9;
+};
+
+union IntsetLeafHeader
+{
+ IntsetLeafBitmap b;
+ IntsetLeafEmbed e;
+ uint32 v;
+};
+
+StaticAssertDecl(sizeof(IntsetLeafBitmap) == sizeof(IntsetLeafEmbed),
+ "incompatible bit field packing");
+StaticAssertDecl(sizeof(IntsetLeafBitmap) == sizeof(uint32),
+ "incompatible bit field packing");
+
+
+struct IntsetLeafBuilder
+{
+ uint16 nvals;
+ uint16 embed[3];
+ uint8 minbyte;
+ uint8 maxbyte;
+ uint8 bytes[LEAF_BYTES];
+};
+
+struct IntsetChunkBuilder
+{
+ uint64 chunk;
+ bitmapword bitmap;
+ IntsetLeafBuilder leafs[CHUNK_LEAFS];
+};
+
+struct IntegerSet2
+{
+ uint64 firstvalue;
+ uint64 nvalues;
+
+ IntsetAllocator alloc;
+
+ IntsetChunkBuilder current;
+ IntsetTrie trie;
+};
+
+
+/* Allocator functions */
+
+static void *intset2_alloc(Size size, IntsetAllocator *alloc);
+static void intset2_alloc_free(IntsetAllocator *alloc);
+
+/* Trie functions */
+
+static inline void intset2_trie_init(IntsetTrie *trie,
+ trie_alloc alloc,
+ void* arg);
+static void intset2_trie_insert(IntsetTrie *trie,
+ uint64 key,
+ IntsetTrieVal val);
+static IntsetTrieVal intset2_trie_lookup(IntsetTrie *trie, uint64 key);
+
+/* Intset functions */
+
+static uint8 intset2_leafbuilder_add(IntsetLeafBuilder *leaf, uint64 v);
+static inline bool intset2_leafbuilder_is_member(IntsetLeafBuilder *leaf,
+ uint64 v);
+static uint8 intset2_chunkbuilder_add(IntsetChunkBuilder *chunk, uint64 v);
+static bool intset2_chunkbuilder_is_member(IntsetChunkBuilder *chunk,
+ uint64 v);
+static bool intset2_chunk_is_member(IntsetChunk *chunk,
+ bitmapword bitmap,
+ uint64 v);
+
+static void intset2_compress_current(IntegerSet2 *intset);
+
+static inline uint8 pg_popcount8(uint8 b);
+static inline uint8 pg_popcount8_lowbits(uint8 b, uint8 nbits);
+static inline uint8 pg_popcount_small(uint8 *b, uint8 len);
+static inline uint32 intset2_compact(uint8 *dest, uint8 *src, uint8 len, bool inverse);
+
+/* Allocator */
+
+static void*
+intset2_alloc(Size size, IntsetAllocator *alloc)
+{
+ Assert(size < ISAllocBatch);
+
+ size = MAXALIGN(size);
+
+ if (alloc->limit - alloc->pos < size)
+ {
+ alloc->current = palloc0(ISAllocBatch);
+ alloc->chunks = lappend(alloc->chunks, alloc->current);
+ alloc->pos = 0;
+ alloc->limit = ISAllocBatch;
+ alloc->total_size += ISAllocBatch;
+ }
+
+ alloc->pos += size;
+ alloc->alloc_size += size;
+ return alloc->current + (alloc->pos - size);
+}
+
+static void
+intset2_alloc_free(IntsetAllocator *alloc)
+{
+ list_free_deep(alloc->chunks);
+}
+
+/* Trie */
+
+static inline void
+intset2_trie_init(IntsetTrie *trie, trie_alloc alloc, void* arg)
+{
+ memset(trie, 0, sizeof(*trie));
+ trie->root_level = -1;
+ trie->alloc = alloc;
+ trie->alloc_arg = arg;
+}
+
+static void
+intset2_trie_insert(IntsetTrie *trie, uint64 key, IntsetTrieVal val)
+{
+ IntsetTrieElem *root = &trie->root;
+ IntsetTrieElem *chunk;
+ IntsetTrieElem *parent;
+ IntsetTrieElem insert;
+ int level = trie->root_level;
+
+ if (level == -1)
+ {
+ trie->root_level = 0;
+ root->key = key;
+ root->bitmap = val.bitmap;
+ root->p.val = val.val;
+ return;
+ }
+
+ Assert(root->key <= STARTN(key, level));
+ Assert(trie->root_level != 0 || root->key < key);
+
+ /* Adjust root level */
+ while (root->key != STARTN(key, level))
+ {
+ trie->l[level][0] = *root;
+ trie->n[level] = 1;
+ root->p.children = trie->l[level];
+ root->bitmap = BIT(root->key);
+ root->key >>= SHIFT;
+ level++;
+ }
+ trie->root_level = level;
+
+ /* Actual insert */
+ insert.key = key;
+ insert.bitmap = val.bitmap;
+ insert.p.val = val.val;
+
+ /*
+ * Iterate while we need to move current level to alloced
+ * space.
+ *
+ * Since we've fixed root in the loop above, we certainly
+ * will quit.
+ */
+ for (level = 0;; level++) {
+ IntsetTrieElem *alloced;
+ uint32 n = trie->n[level];
+ Size asize;
+
+ chunk = trie->l[level];
+ Assert(chunk[n-1].key <= insert.key);
+
+ if (level < trie->root_level-1)
+ parent = &trie->l[level+1][trie->n[level+1]-1];
+ else
+ parent = root;
+
+ Assert(pg_popcountW(parent->bitmap) == n);
+
+ if (parent->key == START(insert.key))
+ /* Yes, we are in the same chunk */
+ break;
+
+ /*
+ * We are not in the same chunk. We need to move
+ * layer to allocated space and start new one.
+ */
+ asize = n * sizeof(IntsetTrieElem);
+ alloced = trie->alloc(asize, trie->alloc_arg);
+ memmove(alloced, chunk, asize);
+ parent->p.children = alloced;
+
+ /* insert into this level */
+ memset(chunk, 0, sizeof(*chunk) * BITS_PER_BITMAPWORD);
+ chunk[0] = insert;
+ trie->n[level] = 1;
+
+ /* prepare insertion into upper level */
+ insert.bitmap = BIT(insert.key);
+ insert.p.children = chunk;
+ insert.key >>= SHIFT;
+ }
+
+ Assert((parent->bitmap & BIT(insert.key)) == 0);
+
+ parent->bitmap |= BIT(insert.key);
+ chunk[trie->n[level]] = insert;
+ trie->n[level]++;
+
+ Assert(pg_popcountW(parent->bitmap) == trie->n[level]);
+}
+
+static IntsetTrieVal
+intset2_trie_lookup(IntsetTrie *trie, uint64 key)
+{
+ IntsetTrieVal result = {0, NULL};
+ IntsetTrieElem *current = &trie->root;
+ int level = trie->root_level;
+
+ if (level == -1)
+ return result;
+
+ /* root is out of bound */
+ if (current->key != STARTN(key, level))
+ return result;
+
+ for (; level > 0; level--)
+ {
+ int n;
+ uint64 bit = BITN(key, level);
+
+ if ((current->bitmap & bit) == 0)
+ /* Not found */
+ return result;
+ n = pg_popcountW(current->bitmap & (bit-1));
+ current = &current->p.children[n];
+ }
+
+ Assert(current->key == key);
+
+ result.bitmap = current->bitmap;
+ result.val = current->p.val;
+
+ return result;
+}
+
+/* Intset */
+
+/* returns 1 if new element were added, 0 otherwise */
+static uint8
+intset2_leafbuilder_add(IntsetLeafBuilder *leaf, uint64 v)
+{
+ uint16 bv;
+ uint8 lbyte, lbit, missing;
+
+ bv = v % LEAF_BITS;
+ lbyte = LBYTE(bv);
+ lbit = LBIT(bv);
+
+ if (leaf->nvals < 3)
+ leaf->embed[leaf->nvals] = bv;
+ if (leaf->nvals == 0)
+ leaf->minbyte = leaf->maxbyte = lbyte;
+ else
+ {
+ Assert(lbyte >= leaf->maxbyte);
+ leaf->maxbyte = lbyte;
+ }
+
+ lbyte -= leaf->minbyte;
+
+ missing = (leaf->bytes[lbyte] & lbit) == 0;
+ leaf->bytes[lbyte] |= lbit;
+ leaf->nvals += missing;
+ return missing;
+}
+
+static inline bool
+intset2_leafbuilder_is_member(IntsetLeafBuilder *leaf, uint64 v)
+{
+ uint16 bv;
+ uint8 lbyte, lbit;
+
+ bv = v % LEAF_BITS;
+ lbyte = LBYTE(bv);
+ lbit = LBIT(bv);
+
+ /* we shouldn't be here unless we set something */
+ Assert(leaf->nvals != 0);
+
+ if (lbyte < leaf->minbyte || lbyte > leaf->maxbyte)
+ return false;
+ lbyte -= leaf->minbyte;
+ return (leaf->bytes[lbyte] & lbit) != 0;
+}
+
+static uint8
+intset2_chunkbuilder_add(IntsetChunkBuilder *chunk, uint64 v)
+{
+ IntsetLeafBuilder *leafs = chunk->leafs;
+
+ Assert(CSTART(v) == chunk->chunk);
+ chunk->bitmap |= (bitmapword)1<<CPOS(v);
+ return intset2_leafbuilder_add(&leafs[CPOS(v)], v);
+}
+
+static bool
+intset2_chunkbuilder_is_member(IntsetChunkBuilder *chunk, uint64 v)
+{
+ IntsetLeafBuilder *leafs = chunk->leafs;
+
+ Assert(CSTART(v) == chunk->chunk);
+ if ((chunk->bitmap & ((bitmapword)1<<CPOS(v))) == 0)
+ return false;
+ return intset2_leafbuilder_is_member(&leafs[CPOS(v)], v);
+}
+
+static bool
+intset2_chunk_is_member(IntsetChunk *chunk, bitmapword bitmap, uint64 v)
+{
+ IntsetLeafHeader h;
+
+ uint32 cpos;
+ bitmapword cbit;
+ uint8 *buf;
+ uint32 bv;
+ uint8 root;
+ uint8 lbyte;
+ uint8 l1bm;
+ uint8 l1len;
+ uint8 l1pos;
+ uint8 lbit;
+ bool found;
+ bool inverse;
+
+ cpos = CPOS(v);
+ cbit = ONE << cpos;
+
+ if ((bitmap & cbit) == 0)
+ return false;
+ h = chunk[pg_popcountW(bitmap & (cbit-1))];
+
+ bv = v % LEAF_BITS;
+ if (h.e.type == LT_EMBED)
+ return bv == h.e.v0 || bv == h.e.v1 || bv == h.e.v2;
+
+ lbyte = LBYTE(bv);
+ lbit = LBIT(bv);
+ buf = (uint8*)(chunk + pg_popcountW(bitmap)) + h.b.offset;
+
+ if (lbyte < h.b.minbyte || lbyte > h.b.maxbyte)
+ return false;
+ lbyte -= h.b.minbyte;
+
+ if (h.b.type == LT_RAW)
+ return (buf[lbyte] & lbit) != 0;
+
+ inverse = h.b.type == LT_INVERSE;
+
+ /*
+ * Bitmap is sparse, so we have to recalculate lbyte.
+ * lbyte = popcount(bits in level1 up to lbyte)
+ */
+ root = buf[0];
+ if ((root & (1<<(lbyte/8))) == 0)
+ return inverse;
+
+ /* Calculate position in sparse level1 index. */
+ l1pos = pg_popcount8_lowbits(root, lbyte/8);
+ l1bm = buf[1+l1pos];
+ if ((l1bm & (1<<(lbyte&7))) == 0)
+ return inverse;
+ /* Now we have to check bitmap byte itself */
+ /* Calculate length of sparse level1 index */
+ l1len = pg_popcount8(root);
+ /*
+ * Corrected lbyte position is count of bits set in the level1 upto
+ * our original position.
+ */
+ lbyte = pg_popcount_small(buf+1, l1pos) +
+ pg_popcount8_lowbits(l1bm, lbyte&7);
+ found = (buf[1+l1len+lbyte] & lbit) != 0;
+ return found != inverse;
+}
+
+IntegerSet2*
+intset2_create(void)
+{
+ IntegerSet2 *intset = palloc0(sizeof(IntegerSet2));
+
+ intset2_trie_init(&intset->trie,
+ (trie_alloc)intset2_alloc,
+ &intset->alloc);
+
+ return intset;
+}
+
+void
+intset2_free(IntegerSet2 *intset)
+{
+ intset2_alloc_free(&intset->alloc);
+ pfree(intset);
+}
+
+void
+intset2_add_member(IntegerSet2 *intset, uint64 v)
+{
+ uint64 cstart;
+ if (intset->nvalues == 0)
+ {
+ uint8 add;
+
+ intset->firstvalue = CSTART(v);
+ v -= intset->firstvalue;
+ add = intset2_chunkbuilder_add(&intset->current, v);
+ Assert(add == 1);
+ intset->nvalues += add;
+ return;
+ }
+
+ v -= intset->firstvalue;
+ cstart = CSTART(v);
+ Assert(cstart >= intset->current.chunk);
+ if (cstart != intset->current.chunk)
+ {
+ intset2_compress_current(intset);
+ intset->current.chunk = cstart;
+ }
+
+ intset->nvalues += intset2_chunkbuilder_add(&intset->current, v);
+}
+
+bool
+intset2_is_member(IntegerSet2 *intset, uint64 v)
+{
+ IntsetTrieVal trieval;
+
+ if (intset->nvalues == 0)
+ return false;
+
+ if (v < intset->firstvalue)
+ return false;
+
+ v -= intset->firstvalue;
+
+ if (intset->current.chunk < CSTART(v))
+ return false;
+
+ if (intset->current.chunk == CSTART(v))
+ return intset2_chunkbuilder_is_member(&intset->current, v);
+
+ trieval = intset2_trie_lookup(&intset->trie, v>>CHUNK_SHIFT);
+ return intset2_chunk_is_member(trieval.val, trieval.bitmap, v);
+}
+
+uint64
+intset2_num_entries(IntegerSet2 *intset)
+{
+ return intset->nvalues;
+}
+
+uint64
+intset2_memory_usage(IntegerSet2 *intset)
+{
+ /* we are missing alloc->chunks here */
+ return sizeof(IntegerSet2) + intset->alloc.total_size;
+}
+
+static void
+intset2_compress_current(IntegerSet2 *intset)
+{
+ IntsetChunkBuilder *bld = &intset->current;
+ IntsetLeafBuilder *leaf;
+ uint32 nheaders = 0;
+ IntsetLeafHeader headers[BITS_PER_BITMAPWORD];
+ IntsetLeafHeader h = {.v = 0};
+ IntsetTrieVal trieval = {0, NULL};
+ uint64 triekey;
+ uint32 hlen, totallen;
+ uint32 bufpos = 0;
+ uint32 i;
+ uint8 buffer[BITS_PER_BITMAPWORD * LEAF_BYTES];
+
+ for (i = 0; i < BITS_PER_BITMAPWORD; i++)
+ {
+ if ((bld->bitmap & (ONE<<i)) == 0)
+ continue;
+
+ leaf = &bld->leafs[i];
+ Assert(leaf->nvals != 0);
+
+ if (leaf->nvals < 3)
+ {
+ h.e.type = LT_EMBED;
+ /*
+ * Header elements should be all filled because we doesn't store
+ * their amount;
+ * do the trick to fill possibly empty place
+ * n = 1 => n/2 = 0, n-1 = 0
+ * n = 2 => n/2 = 1, n-1 = 1
+ * n = 3 => n/2 = 1, n-1 = 2
+ */
+ h.e.v0 = leaf->embed[0];
+ h.e.v1 = leaf->embed[leaf->nvals/2];
+ h.e.v2 = leaf->embed[leaf->nvals-1];
+ }
+ else
+ {
+ /* root raw and root inverse */
+ uint8 rraw = 0,
+ rinv = 0;
+ /* level 1 index raw and index inverse */
+ uint8 raw[LEAF_BYTES/8] = {0},
+ inv[LEAF_BYTES/8] = {0};
+ /* zero count for raw map and inverse map */
+ uint8 cnt_00 = 0,
+ cnt_ff = 0;
+ uint8 mlen, llen;
+ uint8 splen, invlen, threshold;
+ uint8 b00, bff;
+ uint8 *buf;
+ int j;
+
+ h.b.minbyte = leaf->minbyte;
+ h.b.maxbyte = leaf->maxbyte;
+ h.b.offset = bufpos;
+
+ mlen = leaf->maxbyte+1 - leaf->minbyte;
+ for (j = 0; j < mlen; j++)
+ {
+ b00 = leaf->bytes[j] == 0;
+ bff = leaf->bytes[j] == 0xff;
+ cnt_00 += b00;
+ cnt_ff += bff;
+ raw[j/8] |= (1-b00) << (j&7);
+ inv[j/8] |= (1-bff) << (j&7);
+ Assert(j/64 == 0);
+ rraw |= (1-b00) << ((j/8)&7);
+ rinv |= (1-bff) << ((j/8)&7);
+ }
+
+ llen = (mlen-1)/8+1;
+ for (j = 0; j < llen; j++)
+ {
+ cnt_00 += raw[j] == 0;
+ cnt_ff += inv[j] == 0;
+ }
+
+ buf = buffer + bufpos;
+
+ splen = mlen + llen + 1 - cnt_00;
+ invlen = mlen + llen + 1 - cnt_ff;
+ threshold = mlen <= 4 ? 0 : /* don't compress */
+ mlen <= 8 ? mlen - 2 :
+ mlen * 3 / 4;
+
+ /* sparse map compresses well */
+ if (splen <= threshold && splen <= invlen)
+ {
+ h.b.type = LT_SPARSE;
+ *buf++ = rraw;
+ buf += intset2_compact(buf, raw, llen, false);
+ buf += intset2_compact(buf, leaf->bytes, mlen, false);
+ }
+ /* inverse sparse map compresses well */
+ else if (invlen <= threshold)
+ {
+ h.b.type = LT_INVERSE;
+ *buf++ = rinv;
+ buf += intset2_compact(buf, inv, llen, false);
+ buf += intset2_compact(buf, leaf->bytes, mlen, true);
+ }
+ /* fallback to raw type */
+ else
+ {
+ h.b.type = LT_RAW;
+ memmove(buf, leaf->bytes, mlen);
+ buf += mlen;
+ }
+
+ bufpos = buf - buffer;
+ }
+ headers[nheaders] = h;
+ nheaders++;
+ }
+
+ hlen = nheaders * sizeof(h);
+ totallen = hlen + bufpos;
+
+ trieval.bitmap = bld->bitmap;
+ trieval.val = intset2_alloc(totallen, &intset->alloc);
+ memmove(trieval.val, headers, hlen);
+ memmove((char*)trieval.val + hlen, buffer, bufpos);
+
+ triekey = bld->chunk >> CHUNK_SHIFT;
+ intset2_trie_insert(&intset->trie, triekey, trieval);
+
+ memset(&intset->current, 0, sizeof(intset->current));
+}
+
+#define EXPECT_TRUE(expr) \
+ do { \
+ Assert(expr); \
+ if (!(expr)) \
+ elog(ERROR, \
+ "%s was unexpectedly false in file \"%s\" line %u", \
+ #expr, __FILE__, __LINE__); \
+ } while (0)
+
+#define EXPECT_FALSE(expr) \
+ do { \
+ Assert(!(expr)); \
+ if (expr) \
+ elog(ERROR, \
+ "%s was unexpectedly true in file \"%s\" line %u", \
+ #expr, __FILE__, __LINE__); \
+ } while (0)
+
+#define EXPECT_EQ_U32(result_expr, expected_expr) \
+ do { \
+ uint32 result = (result_expr); \
+ uint32 expected = (expected_expr); \
+ Assert(result == expected); \
+ if (result != expected) \
+ elog(ERROR, \
+ "%s yielded %u, expected %s in file \"%s\" line %u", \
+ #result_expr, result, #expected_expr, __FILE__, __LINE__); \
+ } while (0)
+
+static void
+intset2_test_1_off(uint64 off)
+{
+ IntegerSet2 *intset;
+ uint64 i, d, v;
+
+ intset = intset2_create();
+
+#define K 799
+
+ for (i = 0, d = 1; d < (ONE << (CHUNK_SHIFT + SHIFT + 1)); i+=(d=1+i/K))
+ {
+ v = i + off;
+ EXPECT_FALSE(intset2_is_member(intset, v));
+ EXPECT_FALSE(intset2_is_member(intset, v+1));
+ if (i != 0)
+ {
+ EXPECT_TRUE(intset2_is_member(intset, v-d));
+ }
+ if (d > 1)
+ {
+ EXPECT_FALSE(intset2_is_member(intset, v-1));
+ EXPECT_FALSE(intset2_is_member(intset, v-(d-1)));
+ }
+ intset2_add_member(intset, v);
+ EXPECT_TRUE(intset2_is_member(intset, v));
+ if (i != 0)
+ {
+ EXPECT_TRUE(intset2_is_member(intset, v-d));
+ }
+ if (d > 1)
+ {
+ EXPECT_FALSE(intset2_is_member(intset, v-1));
+ EXPECT_FALSE(intset2_is_member(intset, v-(d-1)));
+ }
+ EXPECT_FALSE(intset2_is_member(intset, v+1));
+ }
+
+ for (i = 0, d = 0; d < (1 << (CHUNK_SHIFT + SHIFT + 1)); i+=(d=1+i/K))
+ {
+ v = i + off;
+
+ EXPECT_TRUE(intset2_is_member(intset, v));
+ if (d != 0)
+ {
+ EXPECT_TRUE(intset2_is_member(intset, v-d));
+ }
+ if (d > 1)
+ {
+ EXPECT_FALSE(intset2_is_member(intset, v+1));
+ EXPECT_FALSE(intset2_is_member(intset, v-1));
+ EXPECT_FALSE(intset2_is_member(intset, v-(d-1)));
+ }
+ }
+
+ intset2_free(intset);
+}
+
+void
+intset2_test_1(void)
+{
+ intset2_test_1_off(0);
+ intset2_test_1_off(1001);
+ intset2_test_1_off(10000001);
+ intset2_test_1_off(100000000001);
+}
+
+/* Tools */
+
+static inline uint32
+intset2_compact(uint8 *dest, uint8 *src, uint8 len, bool inverse)
+{
+ uint32 i, j;
+ uint8 b;
+
+ for (i = j = 0; i < len; i++)
+ {
+ b = inverse ? ~src[i] : src[i];
+ dest[j] = b;
+ j += b != 0;
+ }
+
+ return j;
+}
+
+static const uint8 popcnt[256] = {
+ 0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
+ 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
+ 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
+ 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
+ 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
+ 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
+ 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
+ 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
+ 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
+ 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
+ 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
+ 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
+ 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
+ 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
+ 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
+ 4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8
+};
+
+static inline uint8
+pg_popcount8(uint8 b)
+{
+ return popcnt[b];
+}
+
+static inline uint8
+pg_popcount8_lowbits(uint8 b, uint8 nbits)
+{
+ Assert(nbits < 8);
+ return popcnt[b&((1<<nbits)-1)];
+}
+
+static inline uint8
+pg_popcount_small(uint8 *b, uint8 len)
+{
+ uint8 r = 0;
+ switch (len&7)
+ {
+ case 7: r += popcnt[b[6]]; /* fallthrough */
+ case 6: r += popcnt[b[5]]; /* fallthrough */
+ case 5: r += popcnt[b[4]]; /* fallthrough */
+ case 4: r += popcnt[b[3]]; /* fallthrough */
+ case 3: r += popcnt[b[2]]; /* fallthrough */
+ case 2: r += popcnt[b[1]]; /* fallthrough */
+ case 1: r += popcnt[b[0]]; /* fallthrough */
+ }
+ return r;
+}
+
diff --git a/bdbench/integerset2.h b/bdbench/integerset2.h
new file mode 100644
index 0000000..b987605
--- /dev/null
+++ b/bdbench/integerset2.h
@@ -0,0 +1,15 @@
+#ifndef INTEGERSET2_H
+#define INTEGERSET2_H
+
+typedef struct IntegerSet2 IntegerSet2;
+
+extern IntegerSet2 *intset2_create(void);
+extern void intset2_free(IntegerSet2 *intset);
+extern void intset2_add_member(IntegerSet2 *intset, uint64 x);
+extern bool intset2_is_member(IntegerSet2 *intset, uint64 x);
+
+extern uint64 intset2_num_entries(IntegerSet2 *intset);
+extern uint64 intset2_memory_usage(IntegerSet2 *intset);
+
+extern void intset2_test_1(void);
+#endif /* INTEGERSET2_H */
--
2.32.0
Yura Sokolov wrote 2021-07-29 18:29:
I've attached IntegerSet2 patch for pgtools repo and benchmark results.
Branch https://github.com/funny-falcon/pgtools/tree/integerset2
Strange web-mail client... I can never be sure what it will attach...
Reattaching the benchmark results.
regards,
Yura Sokolov
y.sokolov@postgrespro.ru
funny.falcon@gmail.com
Attachments:
On Thu, Jul 29, 2021 at 5:11 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Indeed. Given that the radix tree itself has other use cases, I have
no concern about using radix tree for vacuum's dead tuples storage. It
will be better to have one that can be generally used and has some
optimizations that are helpful also for vacuum's use case, rather than
having one that is very optimized only for vacuum's use case.
What I'm about to say might be a really stupid idea, especially since
I haven't looked at any of the code already posted, but what I'm
wondering about is whether we need a full radix tree or maybe just a
radix-like lookup aid. For example, suppose that for a relation <= 8MB
in size, we create an array of 1024 elements indexed by block number.
Each element of the array stores an offset into the dead TID array.
When you need to probe for a TID, you look up blkno and blkno + 1 in
the array and then bsearch only between those two offsets. For bigger
relations, a two or three level structure could be built, or it could
always be 3 levels. This could even be done on demand, so you
initialize all of the elements to some special value that means "not
computed yet" and then fill them the first time they're needed,
perhaps with another special value that means "no TIDs in that block".
I don't know if this is better, but I do kind of like the fact that
the basic representation is just an array. It makes it really easy to
predict how much memory will be needed for a given number of dead
TIDs, and it's very DSM-friendly as well.
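A minimal sketch of that lookup aid under the stated assumptions (all names
are hypothetical, not from any posted patch): block_start[] maps a block
number to the index of its first entry in the sorted dead-TID array, so a
probe only bsearches within that block's slice.

static bool
tid_is_dead(ItemPointer tid,
			ItemPointerData *dead_tids,	/* sorted dead TID array */
			uint32 *block_start,		/* per-block start index, length nblocks + 1 */
			BlockNumber nblocks)
{
	BlockNumber blkno = ItemPointerGetBlockNumber(tid);
	uint32		lo,
				hi;

	if (blkno >= nblocks)
		return false;

	lo = block_start[blkno];
	hi = block_start[blkno + 1];

	/* binary search, but only within this block's slice of the array */
	while (lo < hi)
	{
		uint32		mid = lo + (hi - lo) / 2;
		int			cmp = ItemPointerCompare(tid, &dead_tids[mid]);

		if (cmp == 0)
			return true;
		else if (cmp < 0)
			hi = mid;
		else
			lo = mid + 1;
	}
	return false;
}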
--
Robert Haas
EDB: http://www.enterprisedb.com
Robert Haas wrote 2021-07-29 20:15:
On Thu, Jul 29, 2021 at 5:11 AM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:

Indeed. Given that the radix tree itself has other use cases, I have
no concern about using radix tree for vacuum's dead tuples storage. It
will be better to have one that can be generally used and has some
optimizations that are helpful also for vacuum's use case, rather than
having one that is very optimized only for vacuum's use case.

What I'm about to say might be a really stupid idea, especially since
I haven't looked at any of the code already posted, but what I'm
wondering about is whether we need a full radix tree or maybe just a
radix-like lookup aid. For example, suppose that for a relation <= 8MB
in size, we create an array of 1024 elements indexed by block number.
Each element of the array stores an offset into the dead TID array.
When you need to probe for a TID, you look up blkno and blkno + 1 in
the array and then bsearch only between those two offsets. For bigger
relations, a two or three level structure could be built, or it could
always be 3 levels. This could even be done on demand, so you
initialize all of the elements to some special value that means "not
computed yet" and then fill them the first time they're needed,
perhaps with another special value that means "no TIDs in that block".
An 8MB relation is not a problem, imo. There is no need to do anything to
handle an 8MB relation.

The problem is a 2TB relation. It has 256M pages and, let's suppose, 3G dead
tuples. Then the offset array will be 2GB and the tuple offset array will be
6GB (2-byte offset per tuple), for 8GB in total.

We could build the offset array only for the higher 3 bytes of the block
number. We would then have a 1M-entry offset array weighing 8MB, and an
array of 3-byte tuple pointers (1 remaining byte from the block number, and
2 bytes from the tuple) weighing 9GB.

But using per-batch compression schemes, it could be amortized to 4 bytes
per page and 1 byte per tuple: 1GB + 3GB = 4GB of memory.
Yes, it is not as guaranteed as in the array approach. But 95% of the time
it is that low or even lower. And better: the more tuples are dead, the
better the compression works. A page with all tuples dead could be encoded
in as little as 5 bytes. Therefore, overall memory consumption is more
stable and predictable.

Lower memory consumption of the tuple storage means there is less chance
that indexes have to be scanned twice or more. That gives more
predictability in the user experience.
I don't know if this is better, but I do kind of like the fact that
the basic representation is just an array. It makes it really easy to
predict how much memory will be needed for a given number of dead
TIDs, and it's very DSM-friendly as well.
The whole thing could be encoded in one single array of bytes. Just give
"pointer-to-array"+"array-size" to the constructor, and use a "bump
allocator" inside. A complex logical structure doesn't imply
"DSM-unfriendliness". Hmm.... I mean, if it is suitably designed.

In fact, my code uses a bump allocator internally to avoid the
"per-allocation overhead" of "aset", "slab" or "generational". And the
IntegerSet2 version even uses it for all allocations since it has no
reallocatable parts.
Well, if a datastructure has reallocatable parts, it could be less friendly
to DSM.
regards,
---
Yura Sokolov
y.sokolov@postgrespro.ru
funny.falcon@gmail.com
Hi,
On 2021-07-29 13:15:53 -0400, Robert Haas wrote:
I don't know if this is better, but I do kind of like the fact that
the basic representation is just an array. It makes it really easy to
predict how much memory will be needed for a given number of dead
TIDs, and it's very DSM-friendly as well.
I think those advantages are far outstripped by the big disadvantage of
needing to either size the array accurately from the start, or to
reallocate the whole array. Our current pre-allocation behaviour is
very wasteful for most vacuums but doesn't handle large work_mem at all,
causing unnecessary index scans.
Greetings,
Andres Freund
On Thu, Jul 29, 2021 at 3:14 PM Andres Freund <andres@anarazel.de> wrote:
I think those advantages are far outstripped by the big disadvantage of
needing to either size the array accurately from the start, or to
reallocate the whole array. Our current pre-allocation behaviour is
very wasteful for most vacuums but doesn't handle large work_mem at all,
causing unnecessary index scans.
I agree that the current pre-allocation behavior is bad, but I don't
really see that as an issue with my idea. Fixing that would require
allocating the array in chunks, but that doesn't really affect the
core of the idea much, at least as I see it.
But I accept that Yura has a very good point about the memory usage of
what I was proposing.
--
Robert Haas
EDB: http://www.enterprisedb.com
Hi,
On 2021-07-30 15:13:49 -0400, Robert Haas wrote:
On Thu, Jul 29, 2021 at 3:14 PM Andres Freund <andres@anarazel.de> wrote:
I think those advantages are far outstripped by the big disadvantage of
needing to either size the array accurately from the start, or to
reallocate the whole array. Our current pre-allocation behaviour is
very wasteful for most vacuums but doesn't handle large work_mem at all,
causing unnecessary index scans.

I agree that the current pre-allocation behavior is bad, but I don't
really see that as an issue with my idea. Fixing that would require
allocating the array in chunks, but that doesn't really affect the
core of the idea much, at least as I see it.
Well, then it'd not really be the "simple array approach" anymore :)
But I accept that Yura has a very good point about the memory usage of
what I was proposing.
The lower memory usage also often will result in a better cache
utilization - which is a crucial factor for index vacuuming when the
index order isn't correlated with the heap order. Cache misses really
are a crucial performance factor there.
Greetings,
Andres Freund
On Fri, Jul 30, 2021 at 3:34 PM Andres Freund <andres@anarazel.de> wrote:
The lower memory usage also often will result in a better cache
utilization - which is a crucial factor for index vacuuming when the
index order isn't correlated with the heap order. Cache misses really
are a crucial performance factor there.
Fair enough.
--
Robert Haas
EDB: http://www.enterprisedb.com
Hi,
Today I noticed the inefficiencies of our dead tuple storage once
again, and started theorizing about a better storage method; which is
when I remembered that this thread exists, and that this thread
already has amazing results.
Are there any plans to get the results of this thread from PoC to committable?
Kind regards,
Matthias van de Meent
Hi,
On 2022-02-11 13:47:01 +0100, Matthias van de Meent wrote:
Today I noticed the inefficiencies of our dead tuple storage once
again, and started theorizing about a better storage method; which is
when I remembered that this thread exists, and that this thread
already has amazing results.

Are there any plans to get the results of this thread from PoC to committable?

I'm not currently planning to work on it personally. It would be awesome if
somebody did...
Greetings,
Andres Freund
On Sun, Feb 13, 2022 at 11:02 AM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2022-02-11 13:47:01 +0100, Matthias van de Meent wrote:
Today I noticed the inefficiencies of our dead tuple storage once
again, and started theorizing about a better storage method; which is
when I remembered that this thread exists, and that this thread
already has amazing results.

Are there any plans to get the results of this thread from PoC to committable?

I'm not currently planning to work on it personally. It would be awesome if
somebody did...
Actually, I'm working on simplifying and improving radix tree
implementation for PG16 dev cycle. From the discussion so far I think
it's better to have a data structure that can be used for
general-purpose and is also good for storing TID, not very specific to
store TID. So I think radix tree would be a potent candidate. I have
done the insertion and search implementation.
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
On 2022-02-13 12:36:13 +0900, Masahiko Sawada wrote:
Actually, I'm working on simplifying and improving the radix tree
implementation for the PG16 dev cycle. From the discussion so far I think
it's better to have a data structure that can be used for general
purposes and is also good for storing TIDs, rather than one very
specific to storing TIDs. So I think a radix tree would be a strong
candidate. I have done the insertion and search implementation.
Awesome!
Hi,
On Sun, Feb 13, 2022 at 12:39 PM Andres Freund <andres@anarazel.de> wrote:
On 2022-02-13 12:36:13 +0900, Masahiko Sawada wrote:
Actually, I'm working on simplifying and improving the radix tree
implementation for the PG16 dev cycle. From the discussion so far I think
it's better to have a data structure that can be used for general
purposes and is also good for storing TIDs, rather than one very
specific to storing TIDs. So I think a radix tree would be a strong
candidate. I have done the insertion and search implementation.
Awesome!
To move this project forward, I've implemented a radix tree from
scratch while studying Andres's implementation. It supports insertion,
search, and iteration but not deletion yet. In my implementation, I use
Datum as the value so internal and leaf nodes have the same data
structure, simplifying the implementation. Iteration on the radix tree
returns keys with their values in ascending order of the key. The patch
has regression tests for the radix tree but is still in a PoC state: it
still contains a lot of debugging code, doesn't support SSE2 SIMD
instructions, and the -mavx2 flag is hard-coded.
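For reference, here is a minimal, untested sketch of how the API in the
attached patch could be used, based only on the functions declared in
radixtree.h in the patch:

#include "postgres.h"
#include "lib/radixtree.h"

static void
radix_tree_example(void)
{
    radix_tree      *tree;
    radix_tree_iter *iter;
    uint64           key;
    Datum            val;
    bool             found;

    tree = radix_tree_create(CurrentMemoryContext);

    /* Insert key-value pairs; found reports whether the key already existed */
    radix_tree_insert(tree, 42, UInt64GetDatum(4200), &found);
    radix_tree_insert(tree, 7, UInt64GetDatum(700), &found);

    /* Point lookup */
    val = radix_tree_search(tree, 42, &found);
    if (found)
        elog(NOTICE, "value " UINT64_FORMAT, DatumGetUInt64(val));

    /* Iterate in ascending key order */
    iter = radix_tree_begin_iterate(tree);
    while (radix_tree_iterate_next(iter, &key, &val))
        elog(NOTICE, "key " UINT64_FORMAT, key);
    radix_tree_end_iterate(iter);

    radix_tree_destroy(tree);
}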
I've measured the memory size, loading performance, and lookup
performance of each candidate data structure with two test cases, dense and sparse,
using the test tool[1]https://github.com/MasahikoSawada/pgtools/tree/master/bdbench. Here are the results:
* Case1 - Dense (simulating the case where there are 1000 consecutive
pages each of which has 100 dead tuples, at 100 page intervals.)
select prepare(
1000000, -- max block
100, -- # of dead tuples per page
1, -- dead tuples interval within a page
1000, -- # of consecutive pages having dead tuples
1100 -- page interval
);
name          size      attach        lookup
array         520 MB     248.60 ms   89891.92 ms
hash         3188 MB   28029.59 ms   50850.32 ms
intset         85 MB     644.96 ms   39801.17 ms
tbm            96 MB     474.06 ms    6641.38 ms
radix          37 MB     173.03 ms    9145.97 ms
radix_tree     36 MB     184.51 ms    9729.94 ms
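(For scale, the array result implies roughly 91 million dead TIDs in this
case, i.e. 520 MB at 6 bytes per TID, so the radix tree's 37 MB works out
to well under half a byte per TID.)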
* Case2 - Sparse (simulating a case where there are pages that have 2
dead tuples every 1000 pages.)
select prepare(
10000000, -- max block
2, -- # of dead tuples per page
50, -- dead tuples interval within a page
1, -- # of consecutive pages having dead tuples
1000 -- page interval
);
name          size     attach      lookup
array         125 kB   0.53 ms   82183.61 ms
hash         1032 kB   1.31 ms   28128.33 ms
intset        222 kB   0.51 ms   87775.68 ms
tbm           768 MB   1.24 ms   98674.60 ms
radix        1080 kB   1.66 ms   20698.07 ms
radix_tree    949 kB   1.50 ms   21465.23 ms
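(For scale, the sparse case has only about 20,000 dead TIDs, matching the
125 kB array at 6 bytes per TID, so per-entry and per-block overhead
dominates here rather than raw encoding density.)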
Each test virtually generates TIDs and loads them into the data
structure, and then searches for virtual index TIDs.
'array' is a sorted array (the current method), 'hash' is HTAB,
'intset' is IntegerSet, and 'tbm' is TIDBitmap. The last two results
are radix tree implementations: 'radix' is Andres's radix tree
implementation and 'radix_tree' is my radix tree implementation. In
both radix tree tests, I converted TIDs into an int64 and stored the
lower 6 bits in the value part of the radix tree.
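To make that encoding concrete, here is a hypothetical sketch of one way to
do it (the exact packing in the benchmark tool may differ): the TID is
packed into a 64-bit integer, the upper bits become the radix tree key, and
the low 6 bits select a bit in a 64-bit bitmap stored as the value.

#include "postgres.h"
#include "storage/itemptr.h"
#include "lib/radixtree.h"

static void
encode_and_insert_tid(radix_tree *tree, ItemPointer tid)
{
    uint64      tid_int;
    uint64      key;
    uint64      bitmap;
    bool        found;

    /* Pack block number and offset number into one 64-bit integer;
     * the 16-bit shift assumes offset numbers fit in 16 bits. */
    tid_int = ((uint64) ItemPointerGetBlockNumber(tid) << 16) |
        ItemPointerGetOffsetNumber(tid);

    /* Upper bits form the key; the low 6 bits pick a bit in the value */
    key = tid_int >> 6;

    /* radix_tree_search returns (Datum) 0 when the key is not present */
    bitmap = DatumGetUInt64(radix_tree_search(tree, key, &found));
    bitmap |= UINT64_C(1) << (tid_int & 0x3F);

    radix_tree_insert(tree, key, UInt64GetDatum(bitmap), &found);
}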
Overall, radix tree implementations have good numbers. Once we get
agreement on moving in this direction, I'll start a new thread for
that and move the implementation further; there are many things to do
and discuss: deletion, API design, SIMD support, more tests, etc.
Regards,
[1]: https://github.com/MasahikoSawada/pgtools/tree/master/bdbench
[2]: /messages/by-id/CAFiTN-visUO9VTz2+h224z5QeUjKhKNdSfjaCucPhYJdbzxx0g@mail.gmail.com
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
Attachments:
radixtree.patch (application/octet-stream)
diff --git a/src/backend/lib/Makefile b/src/backend/lib/Makefile
index 9dad31398a..fd002d594a 100644
--- a/src/backend/lib/Makefile
+++ b/src/backend/lib/Makefile
@@ -22,6 +22,9 @@ OBJS = \
integerset.o \
knapsack.o \
pairingheap.o \
+ radixtree.o \
rbtree.o \
+radixtree.o: CFLAGS+=-mavx2
+
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
new file mode 100644
index 0000000000..a5ad897ee9
--- /dev/null
+++ b/src/backend/lib/radixtree.c
@@ -0,0 +1,1377 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.c
+ * Implementation for adaptive radix tree.
+ *
+ * This module is based on the paper "The Adaptive Radix Tree: ARTful Indexing
+ * for Main-Memory Databases" by Viktor Leis, Alfons Kemper, and Thomas Neumann,
+ * 2013.
+ *
+ * There are some differences from the proposed implementation. For instance,
+ * this radix tree module utilizes AVX2 instructions, enabling us to use 256-bit
+ * width SIMD vectors, whereas 128-bit width SIMD vectors are used in the paper.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/lib/radixtree.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "miscadmin.h"
+#include "port/pg_bitutils.h"
+#include "utils/memutils.h"
+#include "lib/radixtree.h"
+#include "lib/stringinfo.h"
+
+#if defined(__AVX2__)
+#include <immintrin.h> // x86 AVX2 intrinsics
+#endif
+
+/* How many bits are encoded in one tree level */
+#define RADIX_TREE_NODE_FANOUT 8
+
+#define RADIX_TREE_NODE_MAX_SLOTS (1 << RADIX_TREE_NODE_FANOUT)
+#define RADIX_TREE_NODE_MAX_SLOT_BITS \
+ (RADIX_TREE_NODE_MAX_SLOTS / (sizeof(uint8) * BITS_PER_BYTE))
+
+#define RADIX_TREE_CHUNK_MASK ((1 << RADIX_TREE_NODE_FANOUT) - 1)
+#define RADIX_TREE_MAX_SHIFT key_get_shift(UINT64_MAX)
+#define RADIX_TREE_MAX_LEVEL ((sizeof(uint64) * BITS_PER_BYTE) / RADIX_TREE_NODE_FANOUT)
+
+#define GET_KEY_CHUNK(key, shift) \
+ ((uint8) (((key) >> (shift)) & RADIX_TREE_CHUNK_MASK))
+
+typedef enum radix_tree_node_kind
+{
+ RADIX_TREE_NODE_KIND_4 = 0,
+ RADIX_TREE_NODE_KIND_32,
+ RADIX_TREE_NODE_KIND_128,
+ RADIX_TREE_NODE_KIND_256
+} radix_tree_node_kind;
+#define RADIX_TREE_NODE_KIND_COUNT 4
+
+/*
+ * Base type for all nodes types.
+ *
+ * The key is a 64-bit unsigned integer and the value is a Datum. The internal
+ * tree nodes, shift > 0, store pointers to their child nodes as Datum values.
+ * The leaf nodes, shift == 0, store the value that the user specified as a Datum
+ * value.
+ */
+typedef struct radix_tree_node
+{
+ /*
+ * Number of children. We use uint16 to be able to indicate 256 children
+ * at fanout of 8.
+ */
+ uint16 count;
+
+ /*
+ * Shift indicates which part of the key space is represented by this node.
+ * That is, the key is shifted by 'shift' and the lowest RADIX_TREE_NODE_FANOUT
+ * bits are then represented in chunk.
+ */
+ uint8 shift;
+ uint8 chunk;
+
+ /* Size class of the node */
+ radix_tree_node_kind kind;
+} radix_tree_node;
+#define NodeIsLeaf(n) (((radix_tree_node *) (n))->shift == 0)
+#define NodeHasFreeSlot(n) \
+ (((radix_tree_node *) (n))->count < \
+ radix_tree_node_info[((radix_tree_node *) (n))->kind].max_slots)
+
+/*
+ * To reduce memory usage compared to a simple radix tree with a fixed fanout,
+ * we use adaptive node sizes, with different storage methods for different
+ * numbers of elements.
+ */
+typedef struct radix_tree_node_4
+{
+ radix_tree_node n;
+
+ /* 4 children, for key chunks */
+ uint8 chunks[4];
+ Datum slots[4];
+} radix_tree_node_4;
+
+typedef struct radix_tree_node_32
+{
+ radix_tree_node n;
+
+ /* 32 children, for key chunks */
+ uint8 chunks[32];
+ Datum slots[32];
+} radix_tree_node_32;
+
+typedef struct radix_tree_node_128
+{
+ radix_tree_node n;
+
+ /*
+ * The 1-based index into slots for each chunk. 0 means unused, while the
+ * slots array is 0-indexed, so the slot for chunk C is slots[slot_idxs[C] - 1].
+ */
+ uint8 slot_idxs[RADIX_TREE_NODE_MAX_SLOTS];
+
+ Datum slots[128];
+} radix_tree_node_128;
+
+typedef struct radix_tree_node_256
+{
+ radix_tree_node n;
+
+ /* A bitmap to track which slot is in use */
+ uint8 set[RADIX_TREE_NODE_MAX_SLOT_BITS];
+
+ Datum slots[RADIX_TREE_NODE_MAX_SLOTS];
+} radix_tree_node_256;
+#define RADIX_TREE_NODE_256_SET_BYTE(v) ((v) / RADIX_TREE_NODE_FANOUT)
+#define RADIX_TREE_NODE_256_SET_BIT(v) (UINT64_C(1) << ((v) % RADIX_TREE_NODE_FANOUT))
+
+/* Information of each size class */
+typedef struct radix_tree_node_info_elem
+{
+ const char *name;
+ int max_slots;
+ Size size;
+} radix_tree_node_info_elem;
+
+static radix_tree_node_info_elem radix_tree_node_info[] =
+{
+ {"radix tree node 4", 4, sizeof(radix_tree_node_4)},
+ {"radix tree node 32", 32, sizeof(radix_tree_node_32)},
+ {"radix tree node 128", 128, sizeof(radix_tree_node_128)},
+ {"radix tree node 256", 256, sizeof(radix_tree_node_256)},
+};
+
+/*
+ * Iteration support.
+ *
+ * Iterating the radix tree returns each pair of key and value in the ascending order
+ * of the key. To support this, we iterate over the nodes of each level.
+ * radix_tree_iter_node_data struct is used to track the iteration within a node.
+ * radix_tree_iter has the array of this struct, stack, in order to track the iteration
+ * of every level. During the iteration, we also construct the key to return. The key
+ * is updated whenever we update the node iteration information, e.g., when advancing
+ * the current index within the node or when moving to the next node at the same level.
+ */
+typedef struct radix_tree_iter_node_data
+{
+ radix_tree_node *node; /* current node being iterated */
+ int current_idx; /* current position. -1 for initial value */
+} radix_tree_iter_node_data;
+
+struct radix_tree_iter
+{
+ radix_tree *tree;
+
+ /* Track the iteration on nodes of each level */
+ radix_tree_iter_node_data stack[RADIX_TREE_MAX_LEVEL];
+ int stack_len;
+
+ /* The key is being constructed during the iteration */
+ uint64 key;
+};
+
+/* A radix tree with nodes */
+struct radix_tree
+{
+ MemoryContext context;
+
+ radix_tree_node *root;
+ uint64 max_val;
+ uint64 num_keys;
+ MemoryContextData *slabs[RADIX_TREE_NODE_KIND_COUNT];
+
+ /* stats */
+ uint64 mem_used;
+ int32 cnt[RADIX_TREE_NODE_KIND_COUNT];
+};
+
+static radix_tree_node *radix_tree_node_grow(radix_tree *tree, radix_tree_node *parent, radix_tree_node *node);
+static radix_tree_node *radix_tree_find_child(radix_tree_node *node, uint64 key);
+static Datum *radix_tree_find_slot_ptr(radix_tree_node *node, uint8 chunk);
+static void radix_tree_replace_slot(radix_tree_node *parent, radix_tree_node *node,
+ uint8 chunk);
+static void radix_tree_extend(radix_tree *tree, uint64 key);
+static void radix_tree_new_root(radix_tree *tree, uint64 key, Datum val);
+static radix_tree_node *radix_tree_insert_child(radix_tree *tree, radix_tree_node *parent, radix_tree_node *node,
+ uint64 key);
+static void radix_tree_insert_val(radix_tree *tree, radix_tree_node *parent, radix_tree_node *node,
+ uint64 key, Datum val, bool *replaced_p);
+
+static inline void radix_tree_iter_update_key(radix_tree_iter *iter, uint8 chunk, uint8 shift);
+static Datum radix_tree_node_iterate_next(radix_tree_iter *iter, radix_tree_iter_node_data *node_iter,
+ bool *found_p);
+static void radix_tree_store_iter_node(radix_tree_iter *iter, radix_tree_iter_node_data *node_iter,
+ radix_tree_node *node);
+static void radix_tree_update_iter_stack(radix_tree_iter *iter, int from);
+
+static inline int
+node_32_search_eq(radix_tree_node_32 *node, uint8 chunk)
+{
+#ifdef __AVX2__
+ __m256i _key = _mm256_set1_epi8(chunk);
+ __m256i _data = _mm256_loadu_si256((__m256i_u *) node->chunks);
+ __m256i _cmp = _mm256_cmpeq_epi8(_key, _data);
+ uint32 bitfield = _mm256_movemask_epi8(_cmp);
+
+ bitfield &= ((UINT64_C(1) << node->n.count) - 1);
+
+ return (bitfield) ? __builtin_ctz(bitfield) : -1;
+
+#else
+ for (int i = 0; i < node->n.count; i++)
+ {
+ if (node->chunks[i] > chunk)
+ return -1;
+
+ if (node->chunks[i] == chunk)
+ return i;
+ }
+
+ return -1;
+#endif /* __AVX2__ */
+}
+
+/*
+ * This is a bit more complicated than search_chunk_array_16_eq(), because
+ * until recently no unsigned uint8 comparison instruction existed on x86. So
+ * we need to play some trickery using _mm_min_epu8() to effectively get
+ * <=. There never will be any equal elements in the current uses, but that's
+ * what we get here...
+ */
+static inline int
+node_32_search_le(radix_tree_node_32 *node, uint8 chunk)
+{
+#ifdef __AVX2__
+ __m256i _key = _mm256_set1_epi8(chunk);
+ __m256i _data = _mm256_loadu_si256((__m256i_u*) node->chunks);
+ __m256i _min = _mm256_min_epu8(_key, _data);
+ __m256i cmp = _mm256_cmpeq_epi8(_key, _min);
+ uint32_t bitfield=_mm256_movemask_epi8(cmp);
+
+ bitfield &= ((UINT64_C(1) << node->n.count) - 1);
+
+ return (bitfield) ? __builtin_ctz(bitfield) : node->n.count;
+#else
+ int index;
+
+ for (index = 0; index < node->n.count; index++)
+ {
+ if (node->chunks[index] >= chunk)
+ break;
+ }
+
+ return index;
+#endif /* __AVX2__ */
+}
+
+static inline int
+node_128_get_slot_pos(radix_tree_node_128 *node, uint8 chunk)
+{
+ return node->slot_idxs[chunk] - 1;
+}
+
+static inline bool
+node_128_is_slot_used(radix_tree_node_128 *node, uint8 chunk)
+{
+ return (node_128_get_slot_pos(node, chunk) >= 0);
+}
+
+/* Return true if the slot corresponding to the given chunk is in use */
+static inline bool
+node_256_is_slot_used(radix_tree_node_256 *node, uint8 chunk)
+{
+ return (node->set[RADIX_TREE_NODE_256_SET_BYTE(chunk)] &
+ RADIX_TREE_NODE_256_SET_BIT(chunk)) != 0;
+
+}
+
+/* Set the slot at the given chunk position */
+static inline void
+node_256_set(radix_tree_node_256 *node, uint8 chunk, Datum slot)
+{
+ node->slots[chunk] = slot;
+ node->set[RADIX_TREE_NODE_256_SET_BYTE(chunk)] |= RADIX_TREE_NODE_256_SET_BIT(chunk);
+}
+
+/* Return the shift needed to store the given key */
+inline static int
+key_get_shift(uint64 key)
+{
+ return (key == 0)
+ ? 0
+ : (pg_leftmost_one_pos64(key) / RADIX_TREE_NODE_FANOUT) * RADIX_TREE_NODE_FANOUT;
+}
+
+/* Return the max value stored in a node with the given shift */
+static uint64
+shift_get_max_val(int shift)
+{
+ if (shift == RADIX_TREE_MAX_SHIFT)
+ return UINT64_MAX;
+
+ return (UINT64_C(1) << (shift + RADIX_TREE_NODE_FANOUT)) - 1;
+}
+
+/* Allocate a new node with the given node kind */
+static radix_tree_node *
+radix_tree_alloc_node(radix_tree *tree, radix_tree_node_kind kind)
+{
+ radix_tree_node *newnode;
+
+ newnode = (radix_tree_node *) MemoryContextAllocZero(tree->slabs[kind],
+ radix_tree_node_info[kind].size);
+ newnode->kind = kind;
+
+ /* update stats */
+ tree->mem_used += GetMemoryChunkSpace(newnode);
+ tree->cnt[kind]++;
+
+ return newnode;
+}
+
+/* Free the given node */
+static void
+radix_tree_free_node(radix_tree *tree, radix_tree_node *node)
+{
+ /* update stats */
+ tree->mem_used -= GetMemoryChunkSpace(node);
+ tree->cnt[node->kind]--;
+
+ pfree(node);
+}
+
+/* Copy the common fields without the node kind */
+static void
+radix_tree_copy_node_common(radix_tree_node *src, radix_tree_node *dst)
+{
+ dst->shift = src->shift;
+ dst->chunk = src->chunk;
+ dst->count = src->count;
+}
+
+/* The tree doesn't have sufficient height, so grow it */
+static void
+radix_tree_extend(radix_tree *tree, uint64 key)
+{
+ int max_shift;
+ int shift = tree->root->shift + RADIX_TREE_NODE_FANOUT;
+
+ max_shift = key_get_shift(key);
+
+ /* Grow tree from 'shift' to 'max_shift' */
+ while (shift <= max_shift)
+ {
+ radix_tree_node_4 *node =
+ (radix_tree_node_4 *) radix_tree_alloc_node(tree, RADIX_TREE_NODE_KIND_4);
+
+ node->n.count = 1;
+ node->n.shift = shift;
+ node->chunks[0] = 0;
+ node->slots[0] = PointerGetDatum(tree->root);
+
+ tree->root->chunk = 0;
+ tree->root = (radix_tree_node *) node;
+
+ shift += RADIX_TREE_NODE_FANOUT;
+ }
+
+ tree->max_val = shift_get_max_val(max_shift);
+}
+
+/*
+ * Return the pointer to the child node corresponding to the key. Otherwise (if
+ * not found) return NULL.
+ */
+static radix_tree_node *
+radix_tree_find_child(radix_tree_node *node, uint64 key)
+{
+ Datum *slot_ptr;
+ int chunk = GET_KEY_CHUNK(key, node->shift);
+
+ slot_ptr = radix_tree_find_slot_ptr(node, chunk);
+
+ return (slot_ptr == NULL) ? NULL : (radix_tree_node *) DatumGetPointer(*slot_ptr);
+}
+
+/*
+ * Return the address of the slot corresponding to chunk in the node, if found.
+ * Otherwise return NULL.
+ */
+static Datum *
+radix_tree_find_slot_ptr(radix_tree_node *node, uint8 chunk)
+{
+
+ switch (node->kind)
+ {
+ case RADIX_TREE_NODE_KIND_4:
+ {
+ radix_tree_node_4 *n4 = (radix_tree_node_4 *) node;
+
+ /* Do linear search */
+ for (int i = 0; i < n4->n.count; i++)
+ {
+ if (n4->chunks[i] > chunk)
+ break;
+
+ if (n4->chunks[i] == chunk)
+ return &(n4->slots[i]);
+ }
+
+ break;
+ }
+ case RADIX_TREE_NODE_KIND_32:
+ {
+ radix_tree_node_32 *n32 = (radix_tree_node_32 *) node;
+ int ret;
+
+ /* Search by SIMD instructions */
+ ret = node_32_search_eq(n32, chunk);
+
+ if (ret < 0)
+ break;
+
+ return &(n32->slots[ret]);
+ break;
+ }
+ case RADIX_TREE_NODE_KIND_128:
+ {
+ radix_tree_node_128 *n128 = (radix_tree_node_128 *) node;
+
+ if (!node_128_is_slot_used(n128, chunk))
+ break;
+
+ return &(n128->slots[node_128_get_slot_pos(n128, chunk)]);
+ break;
+ }
+ case RADIX_TREE_NODE_KIND_256:
+ {
+ radix_tree_node_256 *n256 = (radix_tree_node_256 *) node;
+
+ if (!node_256_is_slot_used(n256, chunk))
+ break;
+
+ return &(n256->slots[chunk]);
+ break;
+ }
+ }
+
+ return NULL;
+}
+
+/* Link from the parent to the node */
+static void
+radix_tree_replace_slot(radix_tree_node *parent, radix_tree_node *node, uint8 chunk)
+{
+ Datum *slot_ptr;
+
+ slot_ptr = radix_tree_find_slot_ptr(parent, chunk);
+ *slot_ptr = PointerGetDatum(node);
+}
+
+/*
+ * Create a new node as the root. Subordinate nodes will be created during
+ * the insertion.
+ */
+static void
+radix_tree_new_root(radix_tree *tree, uint64 key, Datum val)
+{
+ radix_tree_node_4 * n4 =
+ (radix_tree_node_4 * ) radix_tree_alloc_node(tree, RADIX_TREE_NODE_KIND_4);
+ int shift = key_get_shift(key);
+
+ n4->n.shift = shift;
+ tree->max_val = shift_get_max_val(shift);
+ tree->root = (radix_tree_node *) n4;
+}
+
+/* Insert 'node' as a child node of 'parent' */
+static radix_tree_node *
+radix_tree_insert_child(radix_tree *tree, radix_tree_node *parent, radix_tree_node *node,
+ uint64 key)
+{
+ radix_tree_node *newchild =
+ (radix_tree_node *) radix_tree_alloc_node(tree, RADIX_TREE_NODE_KIND_4);
+
+ Assert(!NodeIsLeaf(node));
+
+ newchild->shift = node->shift - RADIX_TREE_NODE_FANOUT;
+ newchild->chunk = GET_KEY_CHUNK(key, node->shift);
+
+ radix_tree_insert_val(tree, parent, node, key, PointerGetDatum(newchild), NULL);
+
+ return (radix_tree_node *) newchild;
+}
+
+/*
+ * Insert the value to the node. The node grows if it's full.
+ *
+ * *replaced_p is set to true if the key already exists and its value is updated
+ * by this function.
+ */
+static void
+radix_tree_insert_val(radix_tree *tree, radix_tree_node *parent, radix_tree_node *node,
+ uint64 key, Datum val, bool *replaced_p)
+{
+ int chunk = GET_KEY_CHUNK(key, node->shift);
+ bool replaced = false;
+
+ switch (node->kind)
+ {
+ case RADIX_TREE_NODE_KIND_4:
+ {
+ radix_tree_node_4 *n4 = (radix_tree_node_4 *) node;
+ int idx;
+
+ for (idx = 0; idx < n4->n.count; idx++)
+ {
+ if (n4->chunks[idx] >= chunk)
+ break;
+ }
+
+ if (NodeHasFreeSlot(n4))
+ {
+ if (n4->n.count == 0)
+ {
+ /* the first key for this node, add it */
+ }
+ else if (n4->chunks[idx] == chunk)
+ {
+ /* found the key, replace it */
+ replaced = true;
+ }
+ else if (idx != n4->n.count)
+ {
+ /*
+ * the key needs to be inserted in the middle of the array,
+ * make space for the new key.
+ */
+ memmove(&(n4->chunks[idx + 1]), &(n4->chunks[idx]),
+ sizeof(uint8) * (n4->n.count - idx));
+ memmove(&(n4->slots[idx + 1]), &(n4->slots[idx]),
+ sizeof(radix_tree_node *) * (n4->n.count - idx));
+ }
+
+ n4->chunks[idx] = chunk;
+ n4->slots[idx] = val;
+
+ /* Done */
+ break;
+ }
+
+ /* The node needs to grow */
+ node = radix_tree_node_grow(tree, parent, node);
+ Assert(node->kind == RADIX_TREE_NODE_KIND_32);
+ }
+ /* FALLTHROUGH */
+ case RADIX_TREE_NODE_KIND_32:
+ {
+ radix_tree_node_32 *n32 = (radix_tree_node_32 *) node;
+ int idx;
+
+ idx = node_32_search_le(n32, chunk);
+
+ if (NodeHasFreeSlot(n32))
+ {
+ if (n32->n.count == 0)
+ {
+ /* first key for this node, add it */
+ }
+ else if (n32->chunks[idx] == chunk)
+ {
+ /* found the key, replace it */
+ replaced = true;
+ }
+ else if (idx != n32->n.count)
+ {
+ /*
+ * the key needs to be inserted in the middle of the array,
+ * make space for the new key.
+ */
+ memmove(&(n32->chunks[idx + 1]), &(n32->chunks[idx]),
+ sizeof(uint8) * (n32->n.count - idx));
+ memmove(&(n32->slots[idx + 1]), &(n32->slots[idx]),
+ sizeof(radix_tree_node *) * (n32->n.count - idx));
+ }
+
+ n32->chunks[idx] = chunk;
+ n32->slots[idx] = val;
+ break;
+ }
+
+ /* The node needs to grow */
+ node = radix_tree_node_grow(tree, parent, node);
+ Assert(node->kind == RADIX_TREE_NODE_KIND_128);
+ }
+ /* FALLTHROUGH */
+ case RADIX_TREE_NODE_KIND_128:
+ {
+ radix_tree_node_128 *n128 = (radix_tree_node_128 *) node;
+
+ if (node_128_is_slot_used(n128, chunk))
+ {
+ n128->slots[node_128_get_slot_pos(n128, chunk)] = val;
+ replaced = true;
+ break;
+ }
+
+ if (NodeHasFreeSlot(n128))
+ {
+ uint8 pos = n128->n.count + 1;
+
+ n128->slot_idxs[chunk] = pos;
+ n128->slots[pos - 1] = val;
+ break;
+ }
+
+ node = radix_tree_node_grow(tree, parent, node);
+ Assert(node->kind == RADIX_TREE_NODE_KIND_256);
+ }
+ /* FALLTHROUGH */
+ case RADIX_TREE_NODE_KIND_256:
+ {
+ radix_tree_node_256 *n256 = (radix_tree_node_256 *) node;
+
+ if (node_256_is_slot_used(n256, chunk))
+ replaced = true;
+
+ node_256_set(n256, chunk, val);
+ break;
+ }
+ }
+
+ if (!replaced)
+ node->count++;
+
+ if (replaced_p)
+ *replaced_p = replaced;
+}
+
+/* Change the node type to a larger one */
+static radix_tree_node *
+radix_tree_node_grow(radix_tree *tree, radix_tree_node *parent, radix_tree_node *node)
+{
+ radix_tree_node *newnode = NULL;
+
+ Assert(node->count ==
+ radix_tree_node_info[node->kind].max_slots);
+
+ switch (node->kind)
+ {
+ case RADIX_TREE_NODE_KIND_4:
+ {
+ radix_tree_node_4 *n4 = (radix_tree_node_4 *) node;
+ radix_tree_node_32 *new32 =
+ (radix_tree_node_32 *) radix_tree_alloc_node(tree, RADIX_TREE_NODE_KIND_32);
+
+ radix_tree_copy_node_common((radix_tree_node *) n4,
+ (radix_tree_node *) new32);
+
+ memcpy(&(new32->chunks), &(n4->chunks), sizeof(uint8) * 4);
+ memcpy(&(new32->slots), &(n4->slots), sizeof(Datum) * 4);
+
+ newnode = (radix_tree_node *) new32;
+ break;
+ }
+ case RADIX_TREE_NODE_KIND_32:
+ {
+ radix_tree_node_32 *n32 = (radix_tree_node_32 *) node;
+ radix_tree_node_128 *new128 =
+ (radix_tree_node_128 *) radix_tree_alloc_node(tree,RADIX_TREE_NODE_KIND_128);
+
+ radix_tree_copy_node_common((radix_tree_node *) n32,
+ (radix_tree_node *) new128);
+
+ for (int i = 0; i < n32->n.count; i++)
+ {
+ new128->slot_idxs[n32->chunks[i]] = i + 1;
+ new128->slots[i] = n32->slots[i];
+ }
+
+ newnode = (radix_tree_node *) new128;
+ break;
+ }
+ case RADIX_TREE_NODE_KIND_128:
+ {
+ radix_tree_node_128 *n128 = (radix_tree_node_128 *) node;
+ radix_tree_node_256 *new256 =
+ (radix_tree_node_256 *) radix_tree_alloc_node(tree,RADIX_TREE_NODE_KIND_256);
+ int cnt = 0;
+
+ radix_tree_copy_node_common((radix_tree_node *) n128,
+ (radix_tree_node *) new256);
+
+ for (int i = 0; i < RADIX_TREE_NODE_MAX_SLOTS && cnt < n128->n.count; i++)
+ {
+ if (!node_128_is_slot_used(n128, i))
+ continue;
+
+ node_256_set(new256, i, n128->slots[node_128_get_slot_pos(n128, i)]);
+ cnt++;
+ }
+
+ newnode = (radix_tree_node *) new256;
+ break;
+ }
+ case RADIX_TREE_NODE_KIND_256:
+ elog(ERROR, "radix tree node_256 cannot grow");
+ break;
+ }
+
+ /* Replace the old node with the new one */
+ if (parent == node)
+ tree->root = newnode;
+ else
+ radix_tree_replace_slot(parent, newnode, node->chunk);
+
+ /* Free the old node */
+ radix_tree_free_node(tree, node);
+
+ return newnode;
+}
+
+/* Create the radix tree in the given memory context */
+radix_tree *
+radix_tree_create(MemoryContext ctx)
+{
+ radix_tree *tree;
+ MemoryContext old_ctx;
+
+ old_ctx = MemoryContextSwitchTo(ctx);
+
+ tree = palloc(sizeof(radix_tree));
+ tree->max_val = 0;
+ tree->root = NULL;
+ tree->context = ctx;
+ tree->num_keys = 0;
+ tree->mem_used = 0;
+
+ /* Create the slab allocator for each size class */
+ for (int i = 0; i < RADIX_TREE_NODE_KIND_COUNT; i++)
+ {
+ tree->slabs[i] = SlabContextCreate(ctx,
+ radix_tree_node_info[i].name,
+ SLAB_DEFAULT_BLOCK_SIZE,
+ radix_tree_node_info[i].size);
+ tree->cnt[i] = 0;
+ }
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return tree;
+}
+
+void
+radix_tree_destroy(radix_tree *tree)
+{
+ for (int i = 0; i < RADIX_TREE_NODE_KIND_COUNT; i++)
+ MemoryContextDelete(tree->slabs[i]);
+
+ pfree(tree);
+}
+
+/*
+ * Insert the key with the val.
+ *
+ * found_p, if not NULL, is set to true if the key is already present,
+ * otherwise false.
+ *
+ * XXX: consider a better API. Is it better to support an 'update' flag
+ * instead of 'found_p' so the user can ask to update the value if the key
+ * already exists?
+ */
+void
+radix_tree_insert(radix_tree *tree, uint64 key, Datum val, bool *found_p)
+{
+ int shift;
+ bool replaced;
+ radix_tree_node *node;
+ radix_tree_node *parent = tree->root;
+
+ /* Empty tree, create the root */
+ if (!tree->root)
+ radix_tree_new_root(tree, key, val);
+
+ /* Extend the tree if necessary */
+ if (key > tree->max_val)
+ radix_tree_extend(tree, key);
+
+ Assert(tree->root);
+
+ shift = tree->root->shift;
+ node = tree->root;
+ while (shift > 0)
+ {
+ radix_tree_node *child;
+
+ child = radix_tree_find_child(node, key);
+
+ if (child == NULL)
+ child = radix_tree_insert_child(tree, parent, node, key);
+
+ parent = node;
+ node = child;
+ shift -= RADIX_TREE_NODE_FANOUT;
+ }
+
+ /* arrived at a leaf, so insert the value */
+ Assert(NodeIsLeaf(node));
+ radix_tree_insert_val(tree, parent, node, key, val, &replaced);
+
+ if (!replaced)
+ tree->num_keys++;
+
+ if (found_p)
+ *found_p = replaced;
+}
+
+/*
+ * Return the Datum value of the given key.
+ *
+ * found_p is set to true if it's found, otherwise false.
+ */
+Datum
+radix_tree_search(radix_tree *tree, uint64 key, bool *found_p)
+{
+ radix_tree_node *node;
+ int shift;
+
+ if (!tree->root || key > tree->max_val)
+ goto not_found;
+
+ node = tree->root;
+ shift = tree->root->shift;
+ while (shift >= 0)
+ {
+ radix_tree_node *child;
+
+ if (NodeIsLeaf(node))
+ {
+ Datum *slot_ptr;
+ int chunk = GET_KEY_CHUNK(key, node->shift);
+
+ /* We reached a leaf node, find the corresponding slot */
+ slot_ptr = radix_tree_find_slot_ptr(node, chunk);
+
+ if (slot_ptr == NULL)
+ goto not_found;
+
+ /* Found! */
+ *found_p = true;
+ return *slot_ptr;
+ }
+
+ child = radix_tree_find_child(node, key);
+
+ if (child == NULL)
+ goto not_found;
+
+ node = child;
+ shift -= RADIX_TREE_NODE_FANOUT;
+ }
+
+not_found:
+ *found_p = false;
+ return (Datum) 0;
+}
+
+/* Create and return the iterator for the given radix tree */
+radix_tree_iter *
+radix_tree_begin_iterate(radix_tree *tree)
+{
+ MemoryContext old_ctx;
+ radix_tree_iter *iter;
+ int top_level;
+
+ old_ctx = MemoryContextSwitchTo(tree->context);
+
+ iter = (radix_tree_iter *) palloc0(sizeof(radix_tree_iter));
+ iter->tree = tree;
+
+ /* empty tree */
+ if (!iter->tree)
+ return iter;
+
+ top_level = iter->tree->root->shift / RADIX_TREE_NODE_FANOUT;
+
+ iter->stack_len = top_level;
+ iter->stack[top_level].node = iter->tree->root;
+ iter->stack[top_level].current_idx = -1;
+
+ /* Descend to the leftmost leaf node from the root */
+ radix_tree_update_iter_stack(iter, top_level);
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return iter;
+}
+
+/*
+ * Return true and set key_p and value_p if there is a next key. Otherwise,
+ * return false.
+ */
+bool
+radix_tree_iterate_next(radix_tree_iter *iter, uint64 *key_p, Datum *value_p)
+{
+ bool found = false;
+ Datum slot = (Datum) 0;
+ int level;
+
+ /* Empty tree */
+ if (!iter->tree)
+ return false;
+
+ for (;;)
+ {
+ radix_tree_node *node;
+ radix_tree_iter_node_data *node_iter;
+
+ /*
+ * Iterate node at each level from the bottom of the tree until we find
+ * the next slot.
+ */
+ for (level = 0; level <= iter->stack_len; level++)
+ {
+ slot = radix_tree_node_iterate_next(iter, &(iter->stack[level]), &found);
+
+ if (found)
+ break;
+ }
+
+ /* end of iteration */
+ if (!found)
+ return false;
+
+ /* found the next slot at the leaf node, return it */
+ if (level == 0)
+ {
+ *key_p = iter->key;
+ *value_p = slot;
+ return true;
+ }
+
+ /*
+ * We have advanced more than one node, including internal nodes, so we need
+ * to update the stack by descending to the leftmost leaf node from this level.
+ */
+ node = (radix_tree_node *) DatumGetPointer(slot);
+ node_iter = &(iter->stack[level - 1]);
+ radix_tree_store_iter_node(iter, node_iter, node);
+
+ radix_tree_update_iter_stack(iter, level - 1);
+ }
+}
+
+void
+radix_tree_end_iterate(radix_tree_iter *iter)
+{
+ pfree(iter);
+}
+
+/*
+ * Update the part of the key being constructed during the iteration with the
+ * given chunk
+ */
+static inline void
+radix_tree_iter_update_key(radix_tree_iter *iter, uint8 chunk, uint8 shift)
+{
+ iter->key &= ~(((uint64) RADIX_TREE_CHUNK_MASK) << shift);
+ iter->key |= (((uint64) chunk) << shift);
+}
+
+/*
+ * Iterate over the given radix tree node and return its next slot, if any,
+ * setting *found_p to true. Otherwise, set *found_p to false.
+ */
+static Datum
+radix_tree_node_iterate_next(radix_tree_iter *iter, radix_tree_iter_node_data *node_iter,
+ bool *found_p)
+{
+ radix_tree_node *node = node_iter->node;
+ Datum slot = (Datum) 0;
+
+ switch (node->kind)
+ {
+ case RADIX_TREE_NODE_KIND_4:
+ {
+ radix_tree_node_4 *n4 = (radix_tree_node_4 *) node_iter->node;
+
+ node_iter->current_idx++;
+
+ if (node_iter->current_idx >= n4->n.count)
+ goto not_found;
+
+ slot = n4->slots[node_iter->current_idx];
+
+ /* Update the part of the key with the current chunk */
+ if (NodeIsLeaf(node))
+ radix_tree_iter_update_key(iter, n4->chunks[node_iter->current_idx], 0);
+
+ break;
+ }
+ case RADIX_TREE_NODE_KIND_32:
+ {
+ radix_tree_node_32 *n32 = (radix_tree_node_32 *) node;
+
+ node_iter->current_idx++;
+
+ if (node_iter->current_idx >= n32->n.count)
+ goto not_found;
+
+ slot = n32->slots[node_iter->current_idx];
+
+ /* Update the part of the key with the current chunk */
+ if (NodeIsLeaf(node))
+ radix_tree_iter_update_key(iter, n32->chunks[node_iter->current_idx], 0);
+
+ break;
+ }
+ case RADIX_TREE_NODE_KIND_128:
+ {
+ radix_tree_node_128 *n128 = (radix_tree_node_128 *) node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RADIX_TREE_NODE_MAX_SLOTS; i++)
+ {
+ if (node_128_is_slot_used(n128, i))
+ break;
+ }
+
+ if (i >= RADIX_TREE_NODE_MAX_SLOTS)
+ goto not_found;
+
+ node_iter->current_idx = i;
+ slot = n128->slots[node_128_get_slot_pos(n128, i)];
+
+ /* Update the part of the key */
+ if (NodeIsLeaf(node))
+ radix_tree_iter_update_key(iter, node_iter->current_idx, 0);
+
+ break;
+ }
+ case RADIX_TREE_NODE_KIND_256:
+ {
+ radix_tree_node_256 *n256 = (radix_tree_node_256 *) node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RADIX_TREE_NODE_MAX_SLOTS; i++)
+ {
+ if (node_256_is_slot_used(n256, i))
+ break;
+ }
+
+ if (i >= RADIX_TREE_NODE_MAX_SLOTS)
+ goto not_found;
+
+ node_iter->current_idx = i;
+ slot = n256->slots[i];
+
+ /* Update the part of the key */
+ if (NodeIsLeaf(node))
+ radix_tree_iter_update_key(iter, node_iter->current_idx, 0);
+
+ break;
+ }
+ }
+
+ *found_p = true;
+ return slot;
+
+not_found:
+ *found_p = false;
+ return (Datum) 0;
+}
+
+/*
+ * Initialize and update the node iteration struct with the given radix tree node.
+ * This function also updates the part of the key with the chunk of the given node.
+ */
+static void
+radix_tree_store_iter_node(radix_tree_iter *iter, radix_tree_iter_node_data *node_iter,
+ radix_tree_node *node)
+{
+ node_iter->node = node;
+ node_iter->current_idx = -1;
+
+ radix_tree_iter_update_key(iter, node->chunk, node->shift + RADIX_TREE_NODE_FANOUT);
+}
+
+/*
+ * Build the stack of the radix tree node while descending to the leaf from the 'from'
+ * level.
+ */
+static void
+radix_tree_update_iter_stack(radix_tree_iter *iter, int from)
+{
+ radix_tree_node *node = iter->stack[from].node;
+ int level = from;
+
+ for (;;)
+ {
+ radix_tree_iter_node_data *node_iter = &(iter->stack[level--]);
+ bool found;
+
+ /* Set the current node */
+ radix_tree_store_iter_node(iter, node_iter, node);
+
+ if (NodeIsLeaf(node))
+ break;
+
+ node = (radix_tree_node *)
+ DatumGetPointer(radix_tree_node_iterate_next(iter, node_iter, &found));
+
+ /*
+ * Since we always take the first slot in the node, we must have found
+ * a slot here.
+ */
+ Assert(found);
+ }
+}
+
+uint64
+radix_tree_num_entries(radix_tree *tree)
+{
+ return tree->num_keys;
+}
+
+uint64
+radix_tree_memory_usage(radix_tree *tree)
+{
+ return tree->mem_used;
+}
+
+/***************** DEBUG FUNCTIONS *****************/
+#ifdef RADIX_TREE_DEBUG
+void
+radix_tree_stats(radix_tree *tree)
+{
+ fprintf(stderr, "num_keys = %lu, height = %u, n4 = %u(%lu), n32 = %u(%lu), n128 = %u(%lu), n256 = %u(%lu)",
+ tree->num_keys,
+ tree->root->shift / RADIX_TREE_NODE_FANOUT,
+ tree->cnt[0], tree->cnt[0] * sizeof(radix_tree_node_4),
+ tree->cnt[1], tree->cnt[1] * sizeof(radix_tree_node_32),
+ tree->cnt[2], tree->cnt[2] * sizeof(radix_tree_node_128),
+ tree->cnt[3], tree->cnt[3] * sizeof(radix_tree_node_256));
+ //radix_tree_dump(tree);
+}
+
+static void
+radix_tree_print_slot(StringInfo buf, uint8 chunk, Datum slot, int idx, bool is_leaf, int level)
+{
+ char space[128] = {0};
+
+ if (level > 0)
+ sprintf(space, "%*c", level * 4, ' ');
+
+ if (is_leaf)
+ appendStringInfo(buf, "%s[%d] \"0x%X\" val(0x%lX) LEAF\n",
+ space,
+ idx,
+ chunk,
+ DatumGetInt64(slot));
+ else
+ appendStringInfo(buf , "%s[%d] \"0x%X\" -> ",
+ space,
+ idx,
+ chunk);
+}
+
+static void
+radix_tree_dump_node(radix_tree_node *node, int level, StringInfo buf, bool recurse)
+{
+ bool is_leaf = NodeIsLeaf(node);
+
+ appendStringInfo(buf, "[\"%s\" type %d, cnt %u, shift %u, chunk \"0x%X\"] chunks:\n",
+ NodeIsLeaf(node) ? "LEAF" : "INNR",
+ (node->kind == RADIX_TREE_NODE_KIND_4) ? 4 :
+ (node->kind == RADIX_TREE_NODE_KIND_32) ? 32 :
+ (node->kind == RADIX_TREE_NODE_KIND_128) ? 128 : 256,
+ node->count, node->shift, node->chunk);
+
+ switch (node->kind)
+ {
+ case RADIX_TREE_NODE_KIND_4:
+ {
+ radix_tree_node_4 *n4 = (radix_tree_node_4 *) node;
+
+ for (int i = 0; i < n4->n.count; i++)
+ {
+ radix_tree_print_slot(buf, n4->chunks[i], n4->slots[i], i, is_leaf, level);
+
+ if (!is_leaf)
+ {
+ if (recurse)
+ {
+ StringInfoData buf2;
+
+ initStringInfo(&buf2);
+ radix_tree_dump_node((radix_tree_node *) n4->slots[i], level + 1, &buf2, recurse);
+ appendStringInfo(buf, "%s", buf2.data);
+ }
+ else
+ appendStringInfo(buf, "\n");
+ }
+ }
+ break;
+ }
+ case RADIX_TREE_NODE_KIND_32:
+ {
+ radix_tree_node_32 *n32 = (radix_tree_node_32 *) node;
+
+ for (int i = 0; i < n32->n.count; i++)
+ {
+ radix_tree_print_slot(buf, n32->chunks[i], n32->slots[i], i, is_leaf, level);
+
+ if (!is_leaf)
+ {
+ if (recurse)
+ {
+ StringInfoData buf2;
+
+ initStringInfo(&buf2);
+ radix_tree_dump_node((radix_tree_node *) n32->slots[i], level + 1, &buf2, recurse);
+ appendStringInfo(buf, "%s", buf2.data);
+ }
+ else
+ appendStringInfo(buf, "\n");
+ }
+ }
+ break;
+ }
+ case RADIX_TREE_NODE_KIND_128:
+ {
+ radix_tree_node_128 *n128 = (radix_tree_node_128 *) node;
+
+ for (int i = 0; i < RADIX_TREE_NODE_MAX_SLOTS; i++)
+ {
+ if (!node_128_is_slot_used(n128, i))
+ continue;
+
+ radix_tree_print_slot(buf, i, n128->slots[node_128_get_slot_pos(n128, i)],
+ i, is_leaf, level);
+
+ if (!is_leaf)
+ {
+ if (recurse)
+ {
+ StringInfoData buf2;
+
+ initStringInfo(&buf2);
+ radix_tree_dump_node((radix_tree_node *) n128->slots[node_128_get_slot_pos(n128, i)],
+ level + 1, &buf2, recurse);
+ appendStringInfo(buf, "%s", buf2.data);
+ }
+ else
+ appendStringInfo(buf, "\n");
+ }
+ }
+ break;
+ }
+ case RADIX_TREE_NODE_KIND_256:
+ {
+ radix_tree_node_256 *n256 = (radix_tree_node_256 *) node;
+
+ for (int i = 0; i < RADIX_TREE_NODE_MAX_SLOTS; i++)
+ {
+ if (!node_256_is_slot_used(n256, i))
+ continue;
+
+ radix_tree_print_slot(buf, i, n256->slots[i], i, is_leaf, level);
+
+ if (!is_leaf)
+ {
+ if (recurse)
+ {
+ StringInfoData buf2;
+
+ initStringInfo(&buf2);
+ radix_tree_dump_node((radix_tree_node *) n256->slots[i], level + 1, &buf2, recurse);
+ appendStringInfo(buf, "%s", buf2.data);
+ }
+ else
+ appendStringInfo(buf, "\n");
+ }
+ }
+ break;
+ }
+ }
+}
+
+void
+radix_tree_dump_search(radix_tree *tree, uint64 key)
+{
+ StringInfoData buf;
+ radix_tree_node *node;
+ int shift;
+ int level = 0;
+
+ elog(WARNING, "-----------------------------------------------------------");
+ elog(WARNING, "max_val = %lu (0x%lX)", tree->max_val, tree->max_val);
+
+ if (!tree->root)
+ {
+ elog(WARNING, "tree is empty");
+ return;
+ }
+
+ if (key > tree->max_val)
+ {
+ elog(WARNING, "key %lu (0x%lX) is larger than max val",
+ key, key);
+ return;
+ }
+
+ initStringInfo(&buf);
+ node = tree->root;
+ shift = tree->root->shift;
+ while (shift >= 0)
+ {
+ radix_tree_node *child;
+
+ radix_tree_dump_node(node, level, &buf, false);
+
+ if (NodeIsLeaf(node))
+ {
+ int chunk = GET_KEY_CHUNK(key, node->shift);
+
+ /* We reached a leaf node, find the corresponding slot */
+ radix_tree_find_slot_ptr(node, chunk);
+
+ break;
+ }
+
+ child = radix_tree_find_child(node, key);
+
+ if (child == NULL)
+ break;
+
+ node = child;
+ shift -= RADIX_TREE_NODE_FANOUT;
+ level++;
+ }
+
+ elog(WARNING, "\n%s", buf.data);
+}
+
+void
+radix_tree_dump(radix_tree *tree)
+{
+ StringInfoData buf;
+
+ initStringInfo(&buf);
+
+ elog(WARNING, "-----------------------------------------------------------");
+ elog(WARNING, "max_val = %lu", tree->max_val);
+ radix_tree_dump_node(tree->root, 0, &buf, true);
+ elog(WARNING, "\n%s", buf.data);
+ elog(WARNING, "-----------------------------------------------------------");
+}
+#endif
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
new file mode 100644
index 0000000000..fe5a4fd79a
--- /dev/null
+++ b/src/include/lib/radixtree.h
@@ -0,0 +1,41 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.h
+ * Interface for radix tree.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/lib/radixtree.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef RADIXTREE_H
+#define RADIXTREE_H
+
+#include "postgres.h"
+
+#define RADIX_TREE_DEBUG 1
+
+typedef struct radix_tree radix_tree;
+typedef struct radix_tree_iter radix_tree_iter;
+
+extern radix_tree *radix_tree_create(MemoryContext ctx);
+extern Datum radix_tree_search(radix_tree *tree, uint64 key, bool *found);
+extern void radix_tree_destroy(radix_tree *tree);
+extern void radix_tree_insert(radix_tree *tree, uint64 key, Datum val, bool *found_p);
+extern uint64 radix_tree_memory_usage(radix_tree *tree);
+extern uint64 radix_tree_num_entries(radix_tree *tree);
+
+extern radix_tree_iter *radix_tree_begin_iterate(radix_tree *tree);
+extern bool radix_tree_iterate_next(radix_tree_iter *iter, uint64 *key_p, Datum *value_p);
+extern void radix_tree_end_iterate(radix_tree_iter *iter);
+
+
+#ifdef RADIX_TREE_DEBUG
+extern void radix_tree_dump(radix_tree *tree);
+extern void radix_tree_dump_search(radix_tree *tree, uint64 key);
+extern void radix_tree_stats(radix_tree *tree);
+#endif
+
+#endif /* RADIXTREE_H */
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 9090226daa..51b2514faf 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -24,6 +24,7 @@ SUBDIRS = \
test_parser \
test_pg_dump \
test_predtest \
+ test_radixtree \
test_rbtree \
test_regex \
test_rls_hooks \
diff --git a/src/test/modules/test_radixtree/.gitignore b/src/test/modules/test_radixtree/.gitignore
new file mode 100644
index 0000000000..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/test_radixtree/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_radixtree/Makefile b/src/test/modules/test_radixtree/Makefile
new file mode 100644
index 0000000000..da06b93da3
--- /dev/null
+++ b/src/test/modules/test_radixtree/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_radixtree/Makefile
+
+MODULE_big = test_radixtree
+OBJS = \
+ $(WIN32RES) \
+ test_radixtree.o
+PGFILEDESC = "test_radixtree - test code for src/backend/lib/radixtree.c"
+
+EXTENSION = test_radixtree
+DATA = test_radixtree--1.0.sql
+
+REGRESS = test_radixtree
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_radixtree
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_radixtree/README b/src/test/modules/test_radixtree/README
new file mode 100644
index 0000000000..a8b271869a
--- /dev/null
+++ b/src/test/modules/test_radixtree/README
@@ -0,0 +1,7 @@
+test_radixtree contains unit tests for testing the radix tree implementation
+in src/backend/lib/radixtree.c.
+
+The tests verify the correctness of the implementation, but they can also be
+used as a micro-benchmark. If you set the 'intset_test_stats' flag in
+test_radixtree.c, the tests will print extra information about execution time
+and memory usage.
diff --git a/src/test/modules/test_radixtree/expected/test_radixtree.out b/src/test/modules/test_radixtree/expected/test_radixtree.out
new file mode 100644
index 0000000000..0c96ebc739
--- /dev/null
+++ b/src/test/modules/test_radixtree/expected/test_radixtree.out
@@ -0,0 +1,20 @@
+CREATE EXTENSION test_radixtree;
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
+NOTICE: testing radix tree with pattern "all ones"
+NOTICE: testing radix tree with pattern "alternating bits"
+NOTICE: testing radix tree with pattern "clusters of ten"
+NOTICE: testing radix tree with pattern "clusters of hundred"
+NOTICE: testing radix tree with pattern "one-every-64k"
+NOTICE: testing radix tree with pattern "sparse"
+NOTICE: testing radix tree with pattern "single values, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^60"
+ test_radixtree
+----------------
+
+(1 row)
+
diff --git a/src/test/modules/test_radixtree/sql/test_radixtree.sql b/src/test/modules/test_radixtree/sql/test_radixtree.sql
new file mode 100644
index 0000000000..41ece5e9f5
--- /dev/null
+++ b/src/test/modules/test_radixtree/sql/test_radixtree.sql
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_radixtree;
+
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
diff --git a/src/test/modules/test_radixtree/test_radixtree--1.0.sql b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
new file mode 100644
index 0000000000..074a5a7ea7
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_radixtree/test_radixtree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_radixtree" to load this file. \quit
+
+CREATE FUNCTION test_radixtree()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
new file mode 100644
index 0000000000..e9fe7e0124
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -0,0 +1,397 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_radixtree.c
+ * Test radixtree set data structure.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/test/modules/test_radixtree/test_radixtree.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "lib/radixtree.h"
+#include "miscadmin.h"
+#include "nodes/bitmapset.h"
+#include "storage/block.h"
+#include "storage/itemptr.h"
+#include "utils/memutils.h"
+#include "utils/timestamp.h"
+
+#define UINT64_HEX_FORMAT "%" INT64_MODIFIER "X"
+
+/*
+ * If you enable this, the "pattern" tests will print information about
+ * how long populating, probing, and iterating the test set takes, and
+ * how much memory the test set consumed. That can be used as
+ * micro-benchmark of various operations and input patterns (you might
+ * want to increase the number of values used in each of the test, if
+ * you do that, to reduce noise).
+ *
+ * The information is printed to the server's stderr, mostly because
+ * that's where MemoryContextStats() output goes.
+ */
+static const bool intset_test_stats = true;
+
+static int radix_tree_node_max_entries[] = {4, 32, 128, 256};
+
+/*
+ * A struct to define a pattern of integers, for use with the test_pattern()
+ * function.
+ */
+typedef struct
+{
+ char *test_name; /* short name of the test, for humans */
+ char *pattern_str; /* a bit pattern */
+ uint64 spacing; /* pattern repeats at this interval */
+ uint64 num_values; /* number of integers to set in total */
+} test_spec;
+
+static const test_spec test_specs[] = {
+ {
+ "all ones", "1111111111",
+ 10, 1000000
+ },
+ {
+ "alternating bits", "0101010101",
+ 10, 1000000
+ },
+ {
+ "clusters of ten", "1111111111",
+ 10000, 1000000
+ },
+ {
+ "clusters of hundred",
+ "1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111",
+ 10000, 10000000
+ },
+ {
+ "one-every-64k", "1",
+ 65536, 1000000
+ },
+ {
+ "sparse", "100000000000000000000000000000001",
+ 10000000, 1000000
+ },
+ {
+ "single values, distance > 2^32", "1",
+ UINT64CONST(10000000000), 100000
+ },
+ {
+ "clusters, distance > 2^32", "10101010",
+ UINT64CONST(10000000000), 1000000
+ },
+ {
+ "clusters, distance > 2^60", "10101010",
+ UINT64CONST(2000000000000000000),
+ 23 /* can't be much higher than this, or we
+ * overflow uint64 */
+ }
+};
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(test_radixtree);
+
+static void test_empty(void);
+
+static void
+test_empty(void)
+{
+ radix_tree *radixtree;
+ bool found;
+
+ radixtree = radix_tree_create(CurrentMemoryContext);
+
+ radix_tree_search(radixtree, 0, &found);
+ if (found)
+ elog(ERROR, "radix_tree_search on empty tree returned true");
+
+ radix_tree_search(radixtree, 1, &found);
+ if (found)
+ elog(ERROR, "radix_tree_search on empty tree returned true");
+
+ radix_tree_search(radixtree, PG_UINT64_MAX, &found);
+ if (found)
+ elog(ERROR, "radix_tree_search on empty tree returned true");
+
+ if (radix_tree_num_entries(radixtree) != 0)
+ elog(ERROR, "radix_tree_num_entries on empty tree return non-zero");
+
+ radix_tree_destroy(radixtree);
+}
+
+static void
+check_search_on_node(radix_tree *radixtree, uint8 shift, int start, int end)
+{
+ for (int i = start; i < end; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+ Datum val;
+
+ val = radix_tree_search(radixtree, key, &found);
+ if (!found)
+ elog(ERROR, "key 0x" UINT64_HEX_FORMAT " is not found on node-%d",
+ key, end);
+ if (DatumGetUInt64(val) != key)
+ elog(ERROR, "radix_tree_search with key 0x" UINT64_HEX_FORMAT " returns 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ key, DatumGetUInt64(val), key);
+ }
+}
+
+static void
+test_node_types(uint8 shift)
+{
+ radix_tree *radixtree;
+ uint64 num_entries;
+
+ radixtree = radix_tree_create(CurrentMemoryContext);
+
+ for (int i = 0; i < 256; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ radix_tree_insert(radixtree, key, Int64GetDatum(key), &found);
+
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " found", key);
+
+ for (int j = 0; j < lengthof(radix_tree_node_max_entries); j++)
+ {
+ if (i == (radix_tree_node_max_entries[j] - 1))
+ {
+ check_search_on_node(radixtree, shift,
+ (j == 0) ? 0 : radix_tree_node_max_entries[j - 1],
+ radix_tree_node_max_entries[j]);
+ break;
+ }
+ }
+ }
+
+ num_entries = radix_tree_num_entries(radixtree);
+
+ if (num_entries != 256)
+ elog(ERROR,
+ "radix_tree_num_entries returned" UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(256));
+}
+
+/*
+ * Test with a repeating pattern, defined by the 'spec'.
+ */
+static void
+test_pattern(const test_spec *spec)
+{
+ radix_tree *radixtree;
+ radix_tree_iter *iter;
+ MemoryContext radixtree_ctx;
+ TimestampTz starttime;
+ TimestampTz endtime;
+ uint64 n;
+ uint64 last_int;
+ int patternlen;
+ uint64 *pattern_values;
+ uint64 pattern_num_values;
+
+ elog(NOTICE, "testing radix tree with pattern \"%s\"", spec->test_name);
+ if (intset_test_stats)
+ fprintf(stderr, "-----\ntesting radix tree with pattern \"%s\"\n", spec->test_name);
+
+ /* Pre-process the pattern, creating an array of integers from it. */
+ patternlen = strlen(spec->pattern_str);
+ pattern_values = palloc(patternlen * sizeof(uint64));
+ pattern_num_values = 0;
+ for (int i = 0; i < patternlen; i++)
+ {
+ if (spec->pattern_str[i] == '1')
+ pattern_values[pattern_num_values++] = i;
+ }
+
+ /*
+ * Allocate the radix tree.
+ *
+ * Allocate it in a separate memory context, so that we can print its
+ * memory usage easily. (radix_tree_create() creates a memory context of its
+ * own, too, but we don't have direct access to it, so we cannot call
+ * MemoryContextStats() on it directly).
+ */
+ radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
+ "radixtree test",
+ ALLOCSET_SMALL_SIZES);
+ MemoryContextSetIdentifier(radixtree_ctx, spec->test_name);
+ radixtree = radix_tree_create(radixtree_ctx);
+
+ /*
+ * Add values to the set.
+ */
+ starttime = GetCurrentTimestamp();
+
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ uint64 x = 0;
+
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ bool found;
+
+ x = last_int + pattern_values[i];
+
+ radix_tree_insert(radixtree, x, Int64GetDatum(x), &found);
+
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " found", x);
+
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+
+ endtime = GetCurrentTimestamp();
+
+ if (intset_test_stats)
+ fprintf(stderr, "added " UINT64_FORMAT " values in %d ms\n",
+ spec->num_values, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Print stats on the amount of memory used.
+ *
+ * We print the usage reported by radix_tree_memory_usage(), as well as the
+ * stats from the memory context. They should be in the same ballpark,
+ * but it's hard to automate testing that, so if you're making changes to
+ * the implementation, just observe that manually.
+ */
+ if (intset_test_stats)
+ {
+ uint64 mem_usage;
+
+ /*
+ * Also print memory usage as reported by radix_tree_memory_usage(). It
+ * should be in the same ballpark as the usage reported by
+ * MemoryContextStats().
+ */
+ mem_usage = radix_tree_memory_usage(radixtree);
+ fprintf(stderr, "radix_tree_memory_usage() reported " UINT64_FORMAT " (%0.2f bytes / integer)\n",
+ mem_usage, (double) mem_usage / spec->num_values);
+
+ MemoryContextStats(radixtree_ctx);
+ }
+
+ /* Check that radix_tree_num_entries works */
+ n = radix_tree_num_entries(radixtree);
+ if (n != spec->num_values)
+ elog(ERROR, "radix_tree_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT, n, spec->num_values);
+
+ /*
+ * Test random-access probes with radix_tree_search()
+ */
+ starttime = GetCurrentTimestamp();
+
+ for (n = 0; n < 100000; n++)
+ {
+ bool found;
+ bool expected;
+ uint64 x;
+ Datum v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Do we expect this value to be present in the set? */
+ if (x >= last_int)
+ expected = false;
+ else
+ {
+ uint64 idx = x % spec->spacing;
+
+ if (idx >= patternlen)
+ expected = false;
+ else if (spec->pattern_str[idx] == '1')
+ expected = true;
+ else
+ expected = false;
+ }
+
+ /* Is it present according to radix_tree_search()? */
+ v = radix_tree_search(radixtree, x, &found);
+
+ if (found != expected)
+ elog(ERROR, "mismatch at 0x" UINT64_HEX_FORMAT ": %d vs %d", x, found, expected);
+ if (found && (DatumGetUInt64(v) != x))
+ elog(ERROR, "found 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ DatumGetUInt64(v), x);
+ }
+ endtime = GetCurrentTimestamp();
+ if (intset_test_stats)
+ fprintf(stderr, "probed " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Test iterator
+ */
+ starttime = GetCurrentTimestamp();
+
+ iter = radix_tree_begin_iterate(radixtree);
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ uint64 expected = last_int + pattern_values[i];
+ uint64 x;
+ uint64 val;
+
+ if (!radix_tree_iterate_next(iter, &x, &val))
+ break;
+
+ if (x != expected)
+ elog(ERROR,
+ "iterate returned wrong key; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d", x, expected, i);
+ if (DatumGetUInt64(val) != expected)
+ elog(ERROR,
+ "iterate returned wrong value; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d", x, expected, i);
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+ endtime = GetCurrentTimestamp();
+ if (intset_test_stats)
+ fprintf(stderr, "iterated " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ if (n < spec->num_values)
+ elog(ERROR, "iterator stopped short after " UINT64_FORMAT " entries, expected " UINT64_FORMAT, n, spec->num_values);
+ if (n > spec->num_values)
+ elog(ERROR, "iterator returned " UINT64_FORMAT " entries, " UINT64_FORMAT " was expected", n, spec->num_values);
+
+ MemoryContextDelete(radixtree_ctx);
+}
+
+Datum
+test_radixtree(PG_FUNCTION_ARGS)
+{
+ test_empty();
+
+ for (int shift = 0; shift <= (64 - 8); shift += 8)
+ test_node_types(shift);
+
+ /* Test different test patterns, with lots of entries */
+ for (int i = 0; i < lengthof(test_specs); i++)
+ test_pattern(&test_specs[i]);
+
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_radixtree/test_radixtree.control b/src/test/modules/test_radixtree/test_radixtree.control
new file mode 100644
index 0000000000..e53f2a3e0c
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.control
@@ -0,0 +1,4 @@
+comment = 'Test code for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/test_radixtree'
+relocatable = true
On Tue, May 10, 2022 at 8:52 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Overall, radix tree implementations have good numbers. Once we get
agreement on moving in this direction, I'll start a new thread for
that and move the implementation further; there are many things to do
and discuss: deletion, API design, SIMD support, more tests, etc.
+1
(FWIW, I think the current thread is still fine.)
--
John Naylor
EDB: http://www.enterprisedb.com
On Tue, May 10, 2022 at 6:58 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Tue, May 10, 2022 at 8:52 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Overall, radix tree implementations have good numbers. Once we get
agreement on moving in this direction, I'll start a new thread for
that and move the implementation further; there are many things to do
and discuss: deletion, API design, SIMD support, more tests, etc.
+1
Thanks!
I've attached an updated version of the patch. It is still WIP, but I've
implemented deletion and improved the test cases and comments.
(FWIW, I think the current thread is still fine.)
Okay, agreed.
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
Attachments:
radixtree_wip_v2.patch (application/octet-stream)
diff --git a/src/backend/lib/Makefile b/src/backend/lib/Makefile
index 9dad31398a..fd002d594a 100644
--- a/src/backend/lib/Makefile
+++ b/src/backend/lib/Makefile
@@ -22,6 +22,9 @@ OBJS = \
integerset.o \
knapsack.o \
pairingheap.o \
+ radixtree.o \
rbtree.o \
+radixtree.o: CFLAGS+=-mavx2
+
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
new file mode 100644
index 0000000000..ad08f45fd8
--- /dev/null
+++ b/src/backend/lib/radixtree.c
@@ -0,0 +1,1632 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.c
+ * Implementation for adaptive radix tree.
+ *
+ * This module employs the idea from the paper "The Adaptive Radix Tree: ARTful
+ * Indexing for Main-Memory Databases" by Viktor Leis, Alfons Kemper, and Thomas
+ * Neumann, 2013.
+ *
+ * There are some differences from the proposed implementation. For instance,
+ * this radix tree module utilizes AVX2 instructions, enabling us to use 256-bit
+ * wide SIMD vectors, whereas 128-bit wide SIMD vectors are used in the paper.
+ * Also, there is no support for path compression or lazy path expansion. The
+ * radix tree supports a fixed key length, so we don't expect the tree to be
+ * very high.
+ *
+ * The key is a 64-bit unsigned integer and the value is a Datum. Internal
+ * nodes and leaf nodes have the identical structure. Internal nodes (shift > 0)
+ * store pointers to their child nodes as values, while leaf nodes (shift == 0)
+ * store the Datum values specified by the user.
+ *
+ * Interface
+ * ---------
+ *
+ * radix_tree_create - Create a new, empty radix tree
+ * radix_tree_destroy - Destroy the radix tree
+ * radix_tree_insert - Insert a key-value pair
+ * radix_tree_delete - Delete a key-value pair
+ * radix_tree_begin_iterate - Begin iterating through all key-value pairs
+ * radix_tree_iterate_next - Return next key-value pair, if any
+ * radix_tree_end_iterate - End iteration
+ *
+ * radix_tree_create() creates an empty radix tree in the given memory context,
+ * along with child memory contexts for each kind of radix tree node.
+ *
+ * radix_tree_iterate_next() returns key-value pairs in ascending key order.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/lib/radixtree.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "miscadmin.h"
+#include "port/pg_bitutils.h"
+#include "utils/memutils.h"
+#include "lib/radixtree.h"
+#include "lib/stringinfo.h"
+
+#if defined(__AVX2__)
+#include <immintrin.h>			/* x86 AVX2 intrinsics */
+#endif
+
+/* The number of bits encoded in one tree level */
+#define RADIX_TREE_NODE_FANOUT 8
+
+/* The maximum number of slots in a node, used in node-256 */
+#define RADIX_TREE_NODE_MAX_SLOTS (1 << RADIX_TREE_NODE_FANOUT)
+
+/*
+ * Return the number of bytes needed for an is-set bitmap covering nslots
+ * slots, used in node-128 and node-256.
+ */
+#define RADIX_TREE_NODE_NSLOTS_BITS(nslots) ((nslots) / (sizeof(uint8) * BITS_PER_BYTE))
+
+/* Mask for extracting a chunk from the key */
+#define RADIX_TREE_CHUNK_MASK ((1 << RADIX_TREE_NODE_FANOUT) - 1)
+
+/* Maximum shift the radix tree uses */
+#define RADIX_TREE_MAX_SHIFT key_get_shift(UINT64_MAX)
+
+/* Maximum number of levels in the radix tree */
+#define RADIX_TREE_MAX_LEVEL ((sizeof(uint64) * BITS_PER_BYTE) / RADIX_TREE_NODE_FANOUT)
+
+/* Get a chunk from the key */
+#define GET_KEY_CHUNK(key, shift) \
+ ((uint8) (((key) >> (shift)) & RADIX_TREE_CHUNK_MASK))
+
+/* Mapping from value to the bit in is-set bitmap in the node */
+#define NODE_BITMAP_BYTE(v) ((v) / RADIX_TREE_NODE_FANOUT)
+#define NODE_BITMAP_BIT(v) (UINT64_C(1) << ((v) % RADIX_TREE_NODE_FANOUT))
+
+/* Enum used by radix_tree_node_search */
+typedef enum radix_tree_action
+{
+ RADIX_TREE_FIND = 0, /* find the key-value */
+ RADIX_TREE_DELETE, /* delete the key-value */
+} radix_tree_action;
+
+/*
+ * Supported radix tree node kinds.
+ *
+ * XXX: should we add KIND_16 as we can utilize SSE2 SIMD instructions?
+ */
+typedef enum radix_tree_node_kind
+{
+ RADIX_TREE_NODE_KIND_4 = 0,
+ RADIX_TREE_NODE_KIND_32,
+ RADIX_TREE_NODE_KIND_128,
+ RADIX_TREE_NODE_KIND_256
+} radix_tree_node_kind;
+#define RADIX_TREE_NODE_KIND_COUNT 4
+
+/*
+ * Base type for all nodes types.
+ */
+typedef struct radix_tree_node
+{
+ /*
+ * Number of children. We use uint16 to be able to indicate 256 children
+ * at a fanout of 8.
+ */
+ uint16 count;
+
+ /*
+ * Shift indicates which part of the key space is represented by this node.
+ * That is, the key is shifted by 'shift' and the lowest RADIX_TREE_NODE_FANOUT
+ * bits are then represented in chunk.
+ */
+ uint8 shift;
+ uint8 chunk;
+
+ /* Size class of the node */
+ radix_tree_node_kind kind;
+} radix_tree_node;
+/* Macros for radix tree nodes */
+#define IS_LEAF_NODE(n) (((radix_tree_node *) (n))->shift == 0)
+#define IS_EMPTY_NODE(n) (((radix_tree_node *) (n))->count == 0)
+#define HAS_FREE_SLOT(n) \
+ (((radix_tree_node *) (n))->count < \
+ radix_tree_node_info[((radix_tree_node *) (n))->kind].max_slots)
+
+/*
+ * To reduce memory usage compared to a simple radix tree with a fixed fanout,
+ * we use adaptive node sizes, with different storage methods for different
+ * numbers of elements.
+ */
+typedef struct radix_tree_node_4
+{
+ radix_tree_node n;
+
+ /* 4 children, for key chunks */
+ uint8 chunks[4];
+ Datum slots[4];
+} radix_tree_node_4;
+
+typedef struct radix_tree_node_32
+{
+ radix_tree_node n;
+
+ /* 32 children, for key chunks */
+ uint8 chunks[32];
+ Datum slots[32];
+} radix_tree_node_32;
+
+#define RADIX_TREE_NODE_128_BITS RADIX_TREE_NODE_NSLOTS_BITS(128)
+typedef struct radix_tree_node_128
+{
+ radix_tree_node n;
+
+ /*
+ * The 1-based index into 'slots' for each chunk; 0 means the chunk is
+ * unused. So the slot for chunk C is slots[slot_idxs[C] - 1].
+ */
+ uint8 slot_idxs[RADIX_TREE_NODE_MAX_SLOTS];
+
+ /* A bitmap to track which slot is in use */
+ uint8 isset[RADIX_TREE_NODE_128_BITS];
+ Datum slots[128];
+} radix_tree_node_128;
+
+#define RADIX_TREE_NODE_MAX_BITS RADIX_TREE_NODE_NSLOTS_BITS(RADIX_TREE_NODE_MAX_SLOTS)
+typedef struct radix_tree_node_256
+{
+ radix_tree_node n;
+
+ /* A bitmap to track which slot is in use */
+ uint8 isset[RADIX_TREE_NODE_MAX_BITS];
+
+ Datum slots[RADIX_TREE_NODE_MAX_SLOTS];
+} radix_tree_node_256;
+
+/* Information of each size class */
+typedef struct radix_tree_node_info_elem
+{
+ const char *name;
+ int max_slots;
+ Size size;
+} radix_tree_node_info_elem;
+
+static radix_tree_node_info_elem radix_tree_node_info[] =
+{
+ {"radix tree node 4", 4, sizeof(radix_tree_node_4)},
+ {"radix tree node 32", 32, sizeof(radix_tree_node_32)},
+ {"radix tree node 128", 128, sizeof(radix_tree_node_128)},
+ {"radix tree node 256", 256, sizeof(radix_tree_node_256)},
+};
+
+/*
+ * As we descend the radix tree, we push the visited nodes onto a stack. The
+ * stack is used during deletion.
+ */
+typedef struct radix_tree_stack_data
+{
+ radix_tree_node *node;
+ struct radix_tree_stack_data *parent;
+} radix_tree_stack_data;
+typedef radix_tree_stack_data *radix_tree_stack;
+
+/*
+ * Iteration support.
+ *
+ * Iterating the radix tree returns each key-value pair in ascending key order.
+ * To support this, we iterate over the nodes at each level. The
+ * radix_tree_iter_node_data struct is used to track the iteration within a node.
+ * radix_tree_iter has an array of these structs, 'stack', in order to track the
+ * iteration at every level. During the iteration, we also construct the key to return. The key
+ * is updated whenever we update the node iteration information, e.g., when advancing
+ * the current index within the node or when moving to the next node at the same level.
+ */
+typedef struct radix_tree_iter_node_data
+{
+ radix_tree_node *node; /* current node being iterated */
+ int current_idx; /* current position. -1 for initial value */
+} radix_tree_iter_node_data;
+
+struct radix_tree_iter
+{
+ radix_tree *tree;
+
+ /* Track the iteration on nodes of each level */
+ radix_tree_iter_node_data stack[RADIX_TREE_MAX_LEVEL];
+ int stack_len;
+
+ /* The key is being constructed during the iteration */
+ uint64 key;
+};
+
+/* The radix tree structure itself */
+struct radix_tree
+{
+ MemoryContext context;
+
+ radix_tree_node *root;
+ uint64 max_val;
+ uint64 num_keys;
+ MemoryContextData *slabs[RADIX_TREE_NODE_KIND_COUNT];
+
+ /* stats */
+ uint64 mem_used;
+ int32 cnt[RADIX_TREE_NODE_KIND_COUNT];
+};
+
+static radix_tree_node *radix_tree_node_grow(radix_tree *tree, radix_tree_node *parent,
+ radix_tree_node *node, uint64 key);
+static bool radix_tree_node_search_child(radix_tree_node *node, radix_tree_node **child_p,
+ uint64 key);
+static bool radix_tree_node_search(radix_tree_node *node, Datum **slot_p, uint64 key,
+ radix_tree_action action);
+static void radix_tree_extend(radix_tree *tree, uint64 key);
+static void radix_tree_new_root(radix_tree *tree, uint64 key, Datum val);
+static radix_tree_node *radix_tree_node_insert_child(radix_tree *tree,
+ radix_tree_node *parent,
+ radix_tree_node *node,
+ uint64 key);
+static void radix_tree_node_insert_val(radix_tree *tree, radix_tree_node *parent,
+ radix_tree_node *node, uint64 key, Datum val,
+ bool *replaced_p);
+static inline void radix_tree_iter_update_key(radix_tree_iter *iter, uint8 chunk, uint8 shift);
+static Datum radix_tree_node_iterate_next(radix_tree_iter *iter, radix_tree_iter_node_data *node_iter,
+ bool *found_p);
+static void radix_tree_store_iter_node(radix_tree_iter *iter, radix_tree_iter_node_data *node_iter,
+ radix_tree_node *node);
+static void radix_tree_update_iter_stack(radix_tree_iter *iter, int from);
+
+/*
+ * Helper functions for accessing each kind of nodes.
+ */
+static inline int
+node_32_search_eq(radix_tree_node_32 *node, uint8 chunk)
+{
+#ifdef __AVX2__
+ __m256i _key = _mm256_set1_epi8(chunk);
+ __m256i _data = _mm256_loadu_si256((__m256i_u *) node->chunks);
+ __m256i _cmp = _mm256_cmpeq_epi8(_key, _data);
+ uint32 bitfield = _mm256_movemask_epi8(_cmp);
+
+ bitfield &= ((UINT64_C(1) << node->n.count) - 1);
+
+ return (bitfield) ? __builtin_ctz(bitfield) : -1;
+
+#else
+ for (int i = 0; i < node->n.count; i++)
+ {
+ if (node->chunks[i] > chunk)
+ return -1;
+
+ if (node->chunks[i] == chunk)
+ return i;
+ }
+
+ return -1;
+#endif /* __AVX2__ */
+}
+
+/*
+ * This is a bit more complicated than node_32_search_eq(), because until
+ * recently no unsigned 8-bit comparison instruction existed on x86. So we
+ * need to play some trickery using _mm256_min_epu8() to effectively get <=.
+ */
+static inline int
+node_32_search_le(radix_tree_node_32 *node, uint8 chunk)
+{
+#ifdef __AVX2__
+ __m256i _key = _mm256_set1_epi8(chunk);
+ __m256i _data = _mm256_loadu_si256((__m256i_u*) node->chunks);
+ __m256i _min = _mm256_min_epu8(_key, _data);
+ __m256i cmp = _mm256_cmpeq_epi8(_key, _min);
+ uint32_t bitfield=_mm256_movemask_epi8(cmp);
+
+ bitfield &= ((UINT64_C(1) << node->n.count) - 1);
+
+ return (bitfield) ? __builtin_ctz(bitfield) : node->n.count;
+#else
+ int index;
+
+ for (index = 0; index < node->n.count; index++)
+ {
+ if (node->chunks[index] >= chunk)
+ break;
+ }
+
+ return index;
+#endif /* __AVX2__ */
+}
+
+/* Does the given chunk in the node have a value? */
+static inline bool
+node_128_is_chunk_used(radix_tree_node_128 *node, uint8 chunk)
+{
+ return (node->slot_idxs[chunk] != 0);
+}
+
+/* Is the given slot in the node in use? */
+static inline bool
+node_128_is_slot_used(radix_tree_node_128 *node, uint8 slot)
+{
+ return ((node->isset[NODE_BITMAP_BYTE(slot)] & NODE_BITMAP_BIT(slot)) != 0);
+}
+
+/* Set the slot at the corresponding chunk */
+static inline void
+node_128_set(radix_tree_node_128 *node, uint8 chunk, Datum slot)
+{
+ int slotpos = 0;
+
+ while (node_128_is_slot_used(node, slotpos))
+ slotpos++;
+ node->slot_idxs[chunk] = slotpos + 1;
+ node->slots[slotpos] = slot;
+ node->isset[NODE_BITMAP_BYTE(slotpos)] |= NODE_BITMAP_BIT(slotpos);
+}
+
+/* Delete the slot at the corresponding chunk */
+static inline void
+node_128_unset(radix_tree_node_128 *node, uint8 chunk)
+{
+ int slotpos = node->slot_idxs[chunk] - 1;
+
+ /* Clear the is-set bit of the slot, then mark the chunk unused */
+ node->isset[NODE_BITMAP_BYTE(slotpos)] &= ~(NODE_BITMAP_BIT(slotpos));
+ node->slot_idxs[chunk] = 0;
+}
+
+/* Return the slot data corresponding to the chunk */
+static inline Datum
+node_128_get_chunk_slot(radix_tree_node_128 *node, uint8 chunk)
+{
+ return node->slots[node->slot_idxs[chunk] - 1];
+}
+
+/* Return true if the slot corresponding to the given chunk is in use */
+static inline bool
+node_256_is_chunk_used(radix_tree_node_256 *node, uint8 chunk)
+{
+ return (node->isset[NODE_BITMAP_BYTE(chunk)] & NODE_BITMAP_BIT(chunk)) != 0;
+}
+
+/* Set the slot at the given chunk position */
+static inline void
+node_256_set(radix_tree_node_256 *node, uint8 chunk, Datum slot)
+{
+ node->slots[chunk] = slot;
+ node->isset[NODE_BITMAP_BYTE(chunk)] |= NODE_BITMAP_BIT(chunk);
+}
+
+/* Unset the slot at the given chunk position */
+static inline void
+node_256_unset(radix_tree_node_256 *node, uint8 chunk)
+{
+ node->isset[NODE_BITMAP_BYTE(chunk)] &= ~(NODE_BITMAP_BIT(chunk));
+}
+
+/*
+ * Return the shift needed to store the given key.
+ */
+inline static int
+key_get_shift(uint64 key)
+{
+ return (key == 0)
+ ? 0
+ : (pg_leftmost_one_pos64(key) / RADIX_TREE_NODE_FANOUT) * RADIX_TREE_NODE_FANOUT;
+}
+
+/*
+ * Return the max value stored in a node with the given shift.
+ */
+static uint64
+shift_get_max_val(int shift)
+{
+ if (shift == RADIX_TREE_MAX_SHIFT)
+ return UINT64_MAX;
+
+ return (UINT64_C(1) << (shift + RADIX_TREE_NODE_FANOUT)) - 1;
+}
+
+/*
+ * Allocate a new node with the given node kind.
+ */
+static radix_tree_node *
+radix_tree_alloc_node(radix_tree *tree, radix_tree_node_kind kind)
+{
+ radix_tree_node *newnode;
+
+ newnode = (radix_tree_node *) MemoryContextAllocZero(tree->slabs[kind],
+ radix_tree_node_info[kind].size);
+ newnode->kind = kind;
+
+ /* stats */
+ tree->mem_used += GetMemoryChunkSpace(newnode);
+ tree->cnt[kind]++;
+
+ return newnode;
+}
+
+/* Free the given node */
+static void
+radix_tree_free_node(radix_tree *tree, radix_tree_node *node)
+{
+ /* stats */
+ tree->mem_used -= GetMemoryChunkSpace(node);
+ tree->cnt[node->kind]--;
+
+ pfree(node);
+}
+
+/* Free a stack made by radix_tree_delete */
+static void
+radix_tree_free_stack(radix_tree_stack stack)
+{
+ radix_tree_stack ostack;
+
+ while (stack != NULL)
+ {
+ ostack = stack;
+ stack = stack->parent;
+ pfree(ostack);
+ }
+}
+
+/* Copy the common fields without the kind */
+static void
+radix_tree_copy_node_common(radix_tree_node *src, radix_tree_node *dst)
+{
+ dst->shift = src->shift;
+ dst->chunk = src->chunk;
+ dst->count = src->count;
+}
+
+/* The tree doesn't have sufficient height, so grow it */
+static void
+radix_tree_extend(radix_tree *tree, uint64 key)
+{
+ int max_shift;
+ int shift = tree->root->shift + RADIX_TREE_NODE_FANOUT;
+
+ max_shift = key_get_shift(key);
+
+ /* Grow tree from 'shift' to 'max_shift' */
+ while (shift <= max_shift)
+ {
+ radix_tree_node_4 *node =
+ (radix_tree_node_4 *) radix_tree_alloc_node(tree, RADIX_TREE_NODE_KIND_4);
+
+ node->n.count = 1;
+ node->n.shift = shift;
+ node->chunks[0] = 0;
+ node->slots[0] = PointerGetDatum(tree->root);
+
+ tree->root->chunk = 0;
+ tree->root = (radix_tree_node *) node;
+
+ shift += RADIX_TREE_NODE_FANOUT;
+ }
+
+ tree->max_val = shift_get_max_val(max_shift);
+}
+
+/*
+ * Wrapper around radix_tree_node_search() to look up the pointer to a child
+ * node within the given node.
+ *
+ * Return true if the corresponding child is found, otherwise return false. On success,
+ * it sets child_p.
+ */
+static bool
+radix_tree_node_search_child(radix_tree_node *node, radix_tree_node **child_p, uint64 key)
+{
+ bool found = false;
+ Datum *slot_ptr;
+
+ if (radix_tree_node_search(node, &slot_ptr, key, RADIX_TREE_FIND))
+ {
+ /* Found the pointer to the child node */
+ found = true;
+ *child_p = (radix_tree_node *) DatumGetPointer(*slot_ptr);
+ }
+
+ return found;
+}
+
+/*
+ * Return true if the corresponding slot is used, otherwise return false. On success,
+ * sets the pointer to the slot to slot_p.
+ */
+static bool
+radix_tree_node_search(radix_tree_node *node, Datum **slot_p, uint64 key,
+ radix_tree_action action)
+{
+ int chunk = GET_KEY_CHUNK(key, node->shift);
+ bool found = false;
+
+ switch (node->kind)
+ {
+ case RADIX_TREE_NODE_KIND_4:
+ {
+ radix_tree_node_4 *n4 = (radix_tree_node_4 *) node;
+
+ /* Do linear search */
+ for (int i = 0; i < n4->n.count; i++)
+ {
+ if (n4->chunks[i] > chunk)
+ break;
+
+ if (n4->chunks[i] == chunk)
+ {
+ if (action == RADIX_TREE_FIND)
+ *slot_p = &(n4->slots[i]);
+ else /* RADIX_TREE_DELETE */
+ {
+ memmove(&(n4->chunks[i]), &(n4->chunks[i + 1]),
+ sizeof(uint8) * (n4->n.count - i - 1));
+ memmove(&(n4->slots[i]), &(n4->slots[i + 1]),
+ sizeof(radix_tree_node *) * (n4->n.count - i - 1));
+ }
+
+ found = true;
+ break;
+ }
+ }
+
+ break;
+ }
+ case RADIX_TREE_NODE_KIND_32:
+ {
+ radix_tree_node_32 *n32 = (radix_tree_node_32 *) node;
+ int idx;
+
+ /* Search by SIMD instructions */
+ idx = node_32_search_eq(n32, chunk);
+
+ if (idx >= 0)
+ {
+ if (action == RADIX_TREE_FIND)
+ *slot_p = &(n32->slots[idx]);
+ else /* RADIX_TREE_DELETE */
+ {
+ memmove(&(n32->chunks[idx]), &(n32->chunks[idx + 1]),
+ sizeof(uint8) * (n32->n.count - idx - 1));
+ memmove(&(n32->slots[idx]), &(n32->slots[idx + 1]),
+ sizeof(radix_tree_node *) * (n32->n.count - idx - 1));
+ }
+
+ found = true;
+ }
+
+ break;
+ }
+ case RADIX_TREE_NODE_KIND_128:
+ {
+ radix_tree_node_128 *n128 = (radix_tree_node_128 *) node;
+
+ if (node_128_is_chunk_used(n128, chunk))
+ {
+ if (action == RADIX_TREE_FIND)
+ *slot_p = &(n128->slots[n128->slot_idxs[chunk] - 1]);
+ else /* RADIX_TREE_DELETE */
+ node_128_unset(n128, chunk);
+
+ found = true;
+ }
+
+ break;
+ }
+ case RADIX_TREE_NODE_KIND_256:
+ {
+ radix_tree_node_256 *n256 = (radix_tree_node_256 *) node;
+
+ if (node_256_is_chunk_used(n256, chunk))
+ {
+ if (action == RADIX_TREE_FIND)
+ *slot_p = &(n256->slots[chunk]);
+ else /* RADIX_TREE_DELETE */
+ node_256_unset(n256, chunk);
+
+ found = true;
+ }
+
+ break;
+ }
+ }
+
+ /* Update statistics */
+ if (action == RADIX_TREE_DELETE && found)
+ node->count--;
+
+ return found;
+}
+
+/*
+ * Create a new node as the root. Subordinate nodes will be created during
+ * the insertion.
+ */
+static void
+radix_tree_new_root(radix_tree *tree, uint64 key, Datum val)
+{
+ radix_tree_node_4 * n4 =
+ (radix_tree_node_4 * ) radix_tree_alloc_node(tree, RADIX_TREE_NODE_KIND_4);
+ int shift = key_get_shift(key);
+
+ n4->n.shift = shift;
+ tree->max_val = shift_get_max_val(shift);
+ tree->root = (radix_tree_node *) n4;
+}
+
+/* Insert 'node' as a child node of 'parent' */
+static radix_tree_node *
+radix_tree_node_insert_child(radix_tree *tree, radix_tree_node *parent,
+ radix_tree_node *node, uint64 key)
+{
+ radix_tree_node *newchild =
+ (radix_tree_node *) radix_tree_alloc_node(tree, RADIX_TREE_NODE_KIND_4);
+
+ Assert(!IS_LEAF_NODE(node));
+
+ newchild->shift = node->shift - RADIX_TREE_NODE_FANOUT;
+ newchild->chunk = GET_KEY_CHUNK(key, node->shift);
+
+ radix_tree_node_insert_val(tree, parent, node, key, PointerGetDatum(newchild), NULL);
+
+ return (radix_tree_node *) newchild;
+}
+
+/*
+ * Insert the value to the node. The node grows if it's full.
+ */
+static void
+radix_tree_node_insert_val(radix_tree *tree, radix_tree_node *parent,
+ radix_tree_node *node, uint64 key, Datum val,
+ bool *replaced_p)
+{
+ int chunk = GET_KEY_CHUNK(key, node->shift);
+ bool replaced = false;
+
+ switch (node->kind)
+ {
+ case RADIX_TREE_NODE_KIND_4:
+ {
+ radix_tree_node_4 *n4 = (radix_tree_node_4 *) node;
+ int idx;
+
+ for (idx = 0; idx < n4->n.count; idx++)
+ {
+ if (n4->chunks[idx] >= chunk)
+ break;
+ }
+
+ if (HAS_FREE_SLOT(n4))
+ {
+ if (n4->n.count == 0)
+ {
+ /* the first key for this node, add it */
+ }
+ else if (n4->chunks[idx] == chunk)
+ {
+ /* found the key, replace it */
+ replaced = true;
+ }
+ else if (idx != n4->n.count)
+ {
+ /*
+ * the key needs to be inserted in the middle of the array,
+ * make space for the new key.
+ */
+ memmove(&(n4->chunks[idx + 1]), &(n4->chunks[idx]),
+ sizeof(uint8) * (n4->n.count - idx));
+ memmove(&(n4->slots[idx + 1]), &(n4->slots[idx]),
+ sizeof(radix_tree_node *) * (n4->n.count - idx));
+ }
+
+ n4->chunks[idx] = chunk;
+ n4->slots[idx] = val;
+
+ /* Done */
+ break;
+ }
+
+ /* The node needs to grow */
+ node = radix_tree_node_grow(tree, parent, node, key);
+ Assert(node->kind == RADIX_TREE_NODE_KIND_32);
+ }
+ /* FALLTHROUGH */
+ case RADIX_TREE_NODE_KIND_32:
+ {
+ radix_tree_node_32 *n32 = (radix_tree_node_32 *) node;
+ int idx;
+
+ idx = node_32_search_le(n32, chunk);
+
+ if (HAS_FREE_SLOT(n32))
+ {
+ if (n32->n.count == 0)
+ {
+ /* first key for this node, add it */
+ }
+ else if (n32->chunks[idx] == chunk)
+ {
+ /* found the key, replace it */
+ replaced = true;
+ }
+ else if (idx != n32->n.count)
+ {
+ /*
+ * the key needs to be inserted in the middle of the array,
+ * make space for the new key.
+ */
+ memmove(&(n32->chunks[idx + 1]), &(n32->chunks[idx]),
+ sizeof(uint8) * (n32->n.count - idx));
+ memmove(&(n32->slots[idx + 1]), &(n32->slots[idx]),
+ sizeof(radix_tree_node *) * (n32->n.count - idx));
+ }
+
+ n32->chunks[idx] = chunk;
+ n32->slots[idx] = val;
+ break;
+ }
+
+ /* The node needs to grow */
+ node = radix_tree_node_grow(tree, parent, node, key);
+ Assert(node->kind == RADIX_TREE_NODE_KIND_128);
+ }
+ /* FALLTHROUGH */
+ case RADIX_TREE_NODE_KIND_128:
+ {
+ radix_tree_node_128 *n128 = (radix_tree_node_128 *) node;
+
+ if (node_128_is_chunk_used(n128, chunk))
+ {
+ /* found the existing value */
+ node_128_set(n128, chunk, val);
+ replaced = true;
+ break;
+ }
+
+ if (HAS_FREE_SLOT(n128))
+ {
+ node_128_set(n128, chunk, val);
+ break;
+ }
+
+ node = radix_tree_node_grow(tree, parent, node, key);
+ Assert(node->kind == RADIX_TREE_NODE_KIND_256);
+ }
+ /* FALLTHROUGH */
+ case RADIX_TREE_NODE_KIND_256:
+ {
+ radix_tree_node_256 *n256 = (radix_tree_node_256 *) node;
+
+ if (node_256_is_chunk_used(n256, chunk))
+ replaced = true;
+
+ node_256_set(n256, chunk, val);
+ break;
+ }
+ }
+
+ if (!replaced)
+ node->count++;
+
+ if (replaced_p)
+ *replaced_p = replaced;
+}
+
+/* Change the node type to a larger one */
+static radix_tree_node *
+radix_tree_node_grow(radix_tree *tree, radix_tree_node *parent, radix_tree_node *node,
+ uint64 key)
+{
+ radix_tree_node *newnode = NULL;
+
+ Assert(node->count ==
+ radix_tree_node_info[node->kind].max_slots);
+
+ switch (node->kind)
+ {
+ case RADIX_TREE_NODE_KIND_4:
+ {
+ radix_tree_node_4 *n4 = (radix_tree_node_4 *) node;
+ radix_tree_node_32 *new32 =
+ (radix_tree_node_32 *) radix_tree_alloc_node(tree, RADIX_TREE_NODE_KIND_32);
+
+ radix_tree_copy_node_common((radix_tree_node *) n4,
+ (radix_tree_node *) new32);
+
+ memcpy(&(new32->chunks), &(n4->chunks), sizeof(uint8) * 4);
+ memcpy(&(new32->slots), &(n4->slots), sizeof(Datum) * 4);
+
+#ifdef USE_ASSERT_CHECKING
+ {
+ /* Check if the chunks in the new node are sorted */
+ for (int i = 1; i < new32->n.count ; i++)
+ Assert(new32->chunks[i - 1] <= new32->chunks[i]);
+ Assert(new32->n.count == 4);
+ }
+#endif
+
+ newnode = (radix_tree_node *) new32;
+ break;
+ }
+ case RADIX_TREE_NODE_KIND_32:
+ {
+ radix_tree_node_32 *n32 = (radix_tree_node_32 *) node;
+ radix_tree_node_128 *new128 =
+ (radix_tree_node_128 *) radix_tree_alloc_node(tree,RADIX_TREE_NODE_KIND_128);
+
+ radix_tree_copy_node_common((radix_tree_node *) n32,
+ (radix_tree_node *) new128);
+
+ for (int i = 0; i < n32->n.count; i++)
+ node_128_set(new128, n32->chunks[i], n32->slots[i]);
+
+#ifdef USE_ASSERT_CHECKING
+ {
+ for (int i = 0; i < n32->n.count; i++)
+ Assert(node_128_is_chunk_used(new128, n32->chunks[i]));
+ Assert(new128->n.count == 32);
+ }
+#endif
+
+ newnode = (radix_tree_node *) new128;
+ break;
+ }
+ case RADIX_TREE_NODE_KIND_128:
+ {
+ radix_tree_node_128 *n128 = (radix_tree_node_128 *) node;
+ radix_tree_node_256 *new256 =
+ (radix_tree_node_256 *) radix_tree_alloc_node(tree,RADIX_TREE_NODE_KIND_256);
+ int cnt = 0;
+
+ radix_tree_copy_node_common((radix_tree_node *) n128,
+ (radix_tree_node *) new256);
+
+ for (int i = 0; i < 256 && cnt < n128->n.count; i++)
+ {
+ if (!node_128_is_chunk_used(n128, i))
+ continue;
+
+ node_256_set(new256, i, node_128_get_chunk_slot(n128, i));
+ cnt++;
+ }
+
+#ifdef USE_ASSERT_CHECKING
+ {
+ int n = 0;
+ for (int i = 0; i < RADIX_TREE_NODE_MAX_BITS; i++)
+ n += pg_popcount32(new256->isset[i]);
+
+ Assert(new256->n.count == n);
+ }
+#endif
+
+ newnode = (radix_tree_node *) new256;
+ break;
+ }
+ case RADIX_TREE_NODE_KIND_256:
+ elog(ERROR, "radix tree node_256 cannot be grew");
+ break;
+ }
+
+ if (parent == node)
+ {
+ /* Replace the root node with the new large node */
+ tree->root = newnode;
+ }
+ else
+ {
+ Datum *slot_ptr = NULL;
+
+ /* Redirect from the parent to the node */
+ radix_tree_node_search(parent, &slot_ptr, key, RADIX_TREE_FIND);
+ Assert(*slot_ptr);
+ *slot_ptr = PointerGetDatum(newnode);
+ }
+
+ radix_tree_free_node(tree, node);
+
+ return newnode;
+}
+
+radix_tree *
+radix_tree_create(MemoryContext ctx)
+{
+ radix_tree *tree;
+ MemoryContext old_ctx;
+
+ old_ctx = MemoryContextSwitchTo(ctx);
+
+ tree = palloc(sizeof(radix_tree));
+ tree->max_val = 0;
+ tree->root = NULL;
+ tree->context = ctx;
+ tree->num_keys = 0;
+ tree->mem_used = 0;
+
+ /* Create the slab allocator for each size class */
+ for (int i = 0; i < RADIX_TREE_NODE_KIND_COUNT; i++)
+ {
+ tree->slabs[i] = SlabContextCreate(ctx,
+ radix_tree_node_info[i].name,
+ SLAB_DEFAULT_BLOCK_SIZE,
+ radix_tree_node_info[i].size);
+ tree->cnt[i] = 0;
+ }
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return tree;
+}
+
+void
+radix_tree_destroy(radix_tree *tree)
+{
+ for (int i = 0; i < RADIX_TREE_NODE_KIND_COUNT; i++)
+ MemoryContextDelete(tree->slabs[i]);
+
+ pfree(tree);
+}
+
+/*
+ * Insert the key with the val.
+ *
+ * If found_p is not NULL, it is set to true if the key was already present,
+ * otherwise to false.
+ *
+ * XXX: do we need to support update_if_exists behavior?
+ */
+void
+radix_tree_insert(radix_tree *tree, uint64 key, Datum val, bool *found_p)
+{
+ int shift;
+ bool replaced;
+ radix_tree_node *node;
+ radix_tree_node *parent = tree->root;
+
+ /* Empty tree, create the root */
+ if (!tree->root)
+ radix_tree_new_root(tree, key, val);
+
+ /* Extend the tree if necessary */
+ if (key > tree->max_val)
+ radix_tree_extend(tree, key);
+
+ Assert(tree->root);
+
+ shift = tree->root->shift;
+ node = tree->root;
+ while (shift > 0)
+ {
+ radix_tree_node *child;
+
+ if (!radix_tree_node_search_child(node, &child, key))
+ child = radix_tree_node_insert_child(tree, parent, node, key);
+
+ Assert(child != NULL);
+
+ parent = node;
+ node = child;
+ shift -= RADIX_TREE_NODE_FANOUT;
+ }
+
+ /* arrived at a leaf */
+ Assert(IS_LEAF_NODE(node));
+
+ radix_tree_node_insert_val(tree, parent, node, key, val, &replaced);
+
+ /* Update the statistics */
+ if (!replaced)
+ tree->num_keys++;
+
+ if (found_p)
+ *found_p = replaced;
+}
+
+/*
+ * Search for the given key in the radix tree. Return true if the key is found,
+ * otherwise return false. On success, the value is stored in *val_p, so val_p
+ * must not be NULL.
+ */
+bool
+radix_tree_search(radix_tree *tree, uint64 key, Datum *val_p)
+{
+ radix_tree_node *node;
+ Datum *value_ptr;
+ int shift;
+
+ Assert(val_p);
+
+ if (!tree->root || key > tree->max_val)
+ return false;
+
+ node = tree->root;
+ shift = tree->root->shift;
+ while (shift > 0)
+ {
+ radix_tree_node *child;
+
+ if (!radix_tree_node_search_child(node, &child, key))
+ return false;
+
+ node = child;
+ shift -= RADIX_TREE_NODE_FANOUT;
+ }
+
+ /* We reached a leaf node; search the corresponding slot */
+ Assert(IS_LEAF_NODE(node));
+
+ if (!radix_tree_node_search(node, &value_ptr, key, RADIX_TREE_FIND))
+ return false;
+
+ /* Found, set the value to return */
+ *val_p = *value_ptr;
+ return true;
+}
+
+bool
+radix_tree_delete(radix_tree *tree, uint64 key)
+{
+ radix_tree_node *node;
+ int shift;
+ radix_tree_stack stack = NULL;
+ bool deleted;
+
+ if (!tree->root || key > tree->max_val)
+ return false;
+
+ node = tree->root;
+ shift = tree->root->shift;
+ while (shift >= 0)
+ {
+ radix_tree_node *child;
+ radix_tree_stack new_stack;
+
+ new_stack = (radix_tree_stack) palloc(sizeof(radix_tree_stack_data));
+ new_stack->node = node;
+ new_stack->parent = stack;
+ stack = new_stack;
+
+ if (IS_LEAF_NODE(node))
+ break;
+
+ if (!radix_tree_node_search_child(node, &child, key))
+ {
+ radix_tree_free_stack(stack);
+ return false;
+ }
+
+ node = child;
+ shift -= RADIX_TREE_NODE_FANOUT;
+ }
+
+ Assert(IS_LEAF_NODE(stack->node));
+ while (stack != NULL)
+ {
+ radix_tree_node *node = stack->node;
+ Datum *slot;
+
+ stack = stack->parent;
+
+ deleted = radix_tree_node_search(node, &slot, key, RADIX_TREE_DELETE);
+
+ if (!IS_EMPTY_NODE(node))
+ break;
+
+ Assert(deleted);
+ radix_tree_free_node(tree, node);
+ }
+
+ if (deleted)
+ tree->num_keys--;
+
+ radix_tree_free_stack(stack);
+ return deleted;
+}
+
+/* Create and return the iterator for the given radix tree */
+radix_tree_iter *
+radix_tree_begin_iterate(radix_tree *tree)
+{
+ MemoryContext old_ctx;
+ radix_tree_iter *iter;
+ int top_level;
+
+ old_ctx = MemoryContextSwitchTo(tree->context);
+
+ iter = (radix_tree_iter *) palloc0(sizeof(radix_tree_iter));
+ iter->tree = tree;
+
+ /* Empty tree */
+ if (!iter->tree->root)
+ {
+ MemoryContextSwitchTo(old_ctx);
+ return iter;
+ }
+
+ top_level = iter->tree->root->shift / RADIX_TREE_NODE_FANOUT;
+
+ iter->stack_len = top_level;
+ iter->stack[top_level].node = iter->tree->root;
+ iter->stack[top_level].current_idx = -1;
+
+ /* Descend to the left most leaf node from the root */
+ radix_tree_update_iter_stack(iter, top_level);
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return iter;
+}
+
+/*
+ * Return true and set *key_p and *value_p if there is a next key. Otherwise,
+ * return false.
+ */
+bool
+radix_tree_iterate_next(radix_tree_iter *iter, uint64 *key_p, Datum *value_p)
+{
+ bool found = false;
+ Datum slot = (Datum) 0;
+ int level;
+
+ /* Empty tree */
+ if (!iter->tree->root)
+ return false;
+
+ for (;;)
+ {
+ radix_tree_node *node;
+ radix_tree_iter_node_data *node_iter;
+
+ /*
+ * Iterate node at each level from the bottom of the tree until we search
+ * the next slot.
+ */
+ for (level = 0; level <= iter->stack_len; level++)
+ {
+ slot = radix_tree_node_iterate_next(iter, &(iter->stack[level]), &found);
+
+ if (found)
+ break;
+ }
+
+ /* end of iteration */
+ if (!found)
+ return false;
+
+ /* found the next slot at the leaf node, return it */
+ if (level == 0)
+ {
+ *key_p = iter->key;
+ *value_p = slot;
+ return true;
+ }
+
+ /*
+ * We have advanced at an upper (internal) level, so we need to update the
+ * stack by descending to the leftmost leaf node from this level.
+ */
+ node = (radix_tree_node *) DatumGetPointer(slot);
+ node_iter = &(iter->stack[level - 1]);
+ radix_tree_store_iter_node(iter, node_iter, node);
+
+ radix_tree_update_iter_stack(iter, level - 1);
+ }
+}
+
+void
+radix_tree_end_iterate(radix_tree_iter *iter)
+{
+ pfree(iter);
+}
+
+/*
+ * Update the part of the key being constructed during the iteration with the
+ * given chunk
+ */
+static inline void
+radix_tree_iter_update_key(radix_tree_iter *iter, uint8 chunk, uint8 shift)
+{
+ iter->key &= ~(((uint64) RADIX_TREE_CHUNK_MASK) << shift);
+ iter->key |= (((uint64) chunk) << shift);
+}
+
+/*
+ * Advance the iteration within the given radix tree node and return its next
+ * slot, setting *found_p to true if one exists. Otherwise, set *found_p to false.
+ */
+static Datum
+radix_tree_node_iterate_next(radix_tree_iter *iter, radix_tree_iter_node_data *node_iter,
+ bool *found_p)
+{
+ radix_tree_node *node = node_iter->node;
+ Datum slot = (Datum) 0;
+
+ switch (node->kind)
+ {
+ case RADIX_TREE_NODE_KIND_4:
+ {
+ radix_tree_node_4 *n4 = (radix_tree_node_4 *) node_iter->node;
+
+ node_iter->current_idx++;
+
+ if (node_iter->current_idx >= n4->n.count)
+ goto not_found;
+
+ slot = n4->slots[node_iter->current_idx];
+
+ /* Update the part of the key with the current chunk */
+ if (IS_LEAF_NODE(node))
+ radix_tree_iter_update_key(iter, n4->chunks[node_iter->current_idx], 0);
+
+ break;
+ }
+ case RADIX_TREE_NODE_KIND_32:
+ {
+ radix_tree_node_32 *n32 = (radix_tree_node_32 *) node;
+
+ node_iter->current_idx++;
+
+ if (node_iter->current_idx >= n32->n.count)
+ goto not_found;
+
+ slot = n32->slots[node_iter->current_idx];
+
+ /* Update the part of the key with the current chunk */
+ if (IS_LEAF_NODE(node))
+ radix_tree_iter_update_key(iter, n32->chunks[node_iter->current_idx], 0);
+
+ break;
+ }
+ case RADIX_TREE_NODE_KIND_128:
+ {
+ radix_tree_node_128 *n128 = (radix_tree_node_128 *) node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < 256; i++)
+ {
+ if (node_128_is_chunk_used(n128, i))
+ break;
+ }
+
+ if (i >= 256)
+ goto not_found;
+
+ node_iter->current_idx = i;
+ slot = node_128_get_chunk_slot(n128, i);
+
+ /* Update the part of the key */
+ if (IS_LEAF_NODE(node))
+ radix_tree_iter_update_key(iter, node_iter->current_idx, 0);
+
+ break;
+ }
+ case RADIX_TREE_NODE_KIND_256:
+ {
+ radix_tree_node_256 *n256 = (radix_tree_node_256 *) node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < 256; i++)
+ {
+ if (node_256_is_chunk_used(n256, i))
+ break;
+ }
+
+ if (i >= 256)
+ goto not_found;
+
+ node_iter->current_idx = i;
+ slot = n256->slots[i];
+
+ /* Update the part of the key */
+ if (IS_LEAF_NODE(node))
+ radix_tree_iter_update_key(iter, node_iter->current_idx, 0);
+
+ break;
+ }
+ }
+
+ *found_p = true;
+ return slot;
+
+not_found:
+ *found_p = false;
+ return (Datum) 0;
+}
+
+/*
+ * Initialize and update the node iteration struct with the given radix tree node.
+ * This function also updates the part of the key with the chunk of the given node.
+ */
+static void
+radix_tree_store_iter_node(radix_tree_iter *iter, radix_tree_iter_node_data *node_iter,
+ radix_tree_node *node)
+{
+ node_iter->node = node;
+ node_iter->current_idx = -1;
+
+ radix_tree_iter_update_key(iter, node->chunk, node->shift + RADIX_TREE_NODE_FANOUT);
+}
+
+/*
+ * Build the stack of the radix tree node while descending to the leaf from the 'from'
+ * level.
+ */
+static void
+radix_tree_update_iter_stack(radix_tree_iter *iter, int from)
+{
+ radix_tree_node *node = iter->stack[from].node;
+ int level = from;
+
+ for (;;)
+ {
+ radix_tree_iter_node_data *node_iter = &(iter->stack[level--]);
+ bool found;
+
+ /* Set the current node */
+ radix_tree_store_iter_node(iter, node_iter, node);
+
+ if (IS_LEAF_NODE(node))
+ break;
+
+ node = (radix_tree_node *)
+ DatumGetPointer(radix_tree_node_iterate_next(iter, node_iter, &found));
+
+ /*
+ * Since we always fetch the first slot in the node, a slot must be
+ * found here.
+ */
+ Assert(found);
+ }
+}
+
+uint64
+radix_tree_num_entries(radix_tree *tree)
+{
+ return tree->num_keys;
+}
+
+uint64
+radix_tree_memory_usage(radix_tree *tree)
+{
+ return tree->mem_used;
+}
+
+/***************** DEBUG FUNCTIONS *****************/
+#ifdef RADIX_TREE_DEBUG
+void
+radix_tree_stats(radix_tree *tree)
+{
+ fprintf(stderr, "num_keys = %lu, height = %u, n4 = %u(%lu), n32 = %u(%lu), n128 = %u(%lu), n256 = %u(%lu)",
+ tree->num_keys,
+ tree->root->shift / RADIX_TREE_NODE_FANOUT,
+ tree->cnt[0], tree->cnt[0] * sizeof(radix_tree_node_4),
+ tree->cnt[1], tree->cnt[1] * sizeof(radix_tree_node_32),
+ tree->cnt[2], tree->cnt[2] * sizeof(radix_tree_node_128),
+ tree->cnt[3], tree->cnt[3] * sizeof(radix_tree_node_256));
+ //radix_tree_dump(tree);
+}
+
+static void
+radix_tree_print_slot(StringInfo buf, uint8 chunk, Datum slot, int idx, bool is_leaf, int level)
+{
+ char space[128] = {0};
+
+ if (level > 0)
+ sprintf(space, "%*c", level * 4, ' ');
+
+ if (is_leaf)
+ appendStringInfo(buf, "%s[%d] \"0x%X\" val(0x%lX) LEAF\n",
+ space,
+ idx,
+ chunk,
+ DatumGetInt64(slot));
+ else
+ appendStringInfo(buf , "%s[%d] \"0x%X\" -> ",
+ space,
+ idx,
+ chunk);
+}
+
+static void
+radix_tree_dump_node(radix_tree_node *node, int level, StringInfo buf, bool recurse)
+{
+ bool is_leaf = IS_LEAF_NODE(node);
+
+ appendStringInfo(buf, "[\"%s\" type %d, cnt %u, shift %u, chunk \"0x%X\"] chunks:\n",
+ IS_LEAF_NODE(node) ? "LEAF" : "INNR",
+ (node->kind == RADIX_TREE_NODE_KIND_4) ? 4 :
+ (node->kind == RADIX_TREE_NODE_KIND_32) ? 32 :
+ (node->kind == RADIX_TREE_NODE_KIND_128) ? 128 : 256,
+ node->count, node->shift, node->chunk);
+
+ switch (node->kind)
+ {
+ case RADIX_TREE_NODE_KIND_4:
+ {
+ radix_tree_node_4 *n4 = (radix_tree_node_4 *) node;
+
+ for (int i = 0; i < n4->n.count; i++)
+ {
+ radix_tree_print_slot(buf, n4->chunks[i], n4->slots[i], i, is_leaf, level);
+
+ if (!is_leaf)
+ {
+ if (recurse)
+ {
+ StringInfoData buf2;
+
+ initStringInfo(&buf2);
+ radix_tree_dump_node((radix_tree_node *) n4->slots[i], level + 1, &buf2, recurse);
+ appendStringInfo(buf, "%s", buf2.data);
+ }
+ else
+ appendStringInfo(buf, "\n");
+ }
+ }
+ break;
+ }
+ case RADIX_TREE_NODE_KIND_32:
+ {
+ radix_tree_node_32 *n32 = (radix_tree_node_32 *) node;
+
+ for (int i = 0; i < n32->n.count; i++)
+ {
+ radix_tree_print_slot(buf, n32->chunks[i], n32->slots[i], i, is_leaf, level);
+
+ if (!is_leaf)
+ {
+ if (recurse)
+ {
+ StringInfoData buf2;
+
+ initStringInfo(&buf2);
+ radix_tree_dump_node((radix_tree_node *) n32->slots[i], level + 1, &buf2, recurse);
+ appendStringInfo(buf, "%s", buf2.data);
+ }
+ else
+ appendStringInfo(buf, "\n");
+ }
+ }
+ break;
+ }
+ case RADIX_TREE_NODE_KIND_128:
+ {
+ radix_tree_node_128 *n128 = (radix_tree_node_128 *) node;
+
+ for (int j = 0; j < 256; j++)
+ {
+ if (!node_128_is_chunk_used(n128, j))
+ continue;
+
+ appendStringInfo(buf, "slot_idxs[%d]=%d, ", j, n128->slot_idxs[j]);
+ }
+ appendStringInfo(buf, "\nisset-bitmap:");
+ for (int j = 0; j < 16; j++)
+ {
+ appendStringInfo(buf, "%X ", (uint8) n128->isset[j]);
+ }
+ appendStringInfo(buf, "\n");
+
+ for (int i = 0; i < 256; i++)
+ {
+ if (!node_128_is_chunk_used(n128, i))
+ continue;
+
+ radix_tree_print_slot(buf, i, node_128_get_chunk_slot(n128, i),
+ i, is_leaf, level);
+
+ if (!is_leaf)
+ {
+ if (recurse)
+ {
+ StringInfoData buf2;
+
+ initStringInfo(&buf2);
+ radix_tree_dump_node((radix_tree_node *) node_128_get_chunk_slot(n128, i),
+ level + 1, &buf2, recurse);
+ appendStringInfo(buf, "%s", buf2.data);
+ }
+ else
+ appendStringInfo(buf, "\n");
+ }
+ }
+ break;
+ }
+ case RADIX_TREE_NODE_KIND_256:
+ {
+ radix_tree_node_256 *n256 = (radix_tree_node_256 *) node;
+
+ for (int i = 0; i < 256; i++)
+ {
+ if (!node_256_is_chunk_used(n256, i))
+ continue;
+
+ radix_tree_print_slot(buf, i, n256->slots[i], i, is_leaf, level);
+
+ if (!is_leaf)
+ {
+ if (recurse)
+ {
+ StringInfoData buf2;
+
+ initStringInfo(&buf2);
+ radix_tree_dump_node((radix_tree_node *) n256->slots[i], level + 1, &buf2, recurse);
+ appendStringInfo(buf, "%s", buf2.data);
+ }
+ else
+ appendStringInfo(buf, "\n");
+ }
+ }
+ break;
+ }
+ }
+}
+
+void
+radix_tree_dump_search(radix_tree *tree, uint64 key)
+{
+ StringInfoData buf;
+ radix_tree_node *node;
+ int shift;
+ int level = 0;
+
+ elog(NOTICE, "-----------------------------------------------------------");
+ elog(NOTICE, "max_val = %lu (0x%lX)", tree->max_val, tree->max_val);
+
+ if (!tree->root)
+ {
+ elog(NOTICE, "tree is empty");
+ return;
+ }
+
+ if (key > tree->max_val)
+ {
+ elog(NOTICE, "key %lu (0x%lX) is larger than max val",
+ key, key);
+ return;
+ }
+
+ initStringInfo(&buf);
+ node = tree->root;
+ shift = tree->root->shift;
+ while (shift >= 0)
+ {
+ radix_tree_node *child;
+
+ radix_tree_dump_node(node, level, &buf, false);
+
+ if (IS_LEAF_NODE(node))
+ {
+ Datum *dummy;
+
+ /* We reached a leaf node; find the corresponding slot */
+ radix_tree_node_search(node, &dummy, key, RADIX_TREE_FIND);
+
+ break;
+ }
+
+ if (!radix_tree_node_search_child(node, &child, key))
+ break;
+
+ node = child;
+ shift -= RADIX_TREE_NODE_FANOUT;
+ level++;
+ }
+
+ elog(NOTICE, "\n%s", buf.data);
+}
+
+void
+radix_tree_dump(radix_tree *tree)
+{
+ StringInfoData buf;
+
+ initStringInfo(&buf);
+
+ elog(NOTICE, "-----------------------------------------------------------");
+ elog(NOTICE, "max_val = %lu", tree->max_val);
+ radix_tree_dump_node(tree->root, 0, &buf, true);
+ elog(NOTICE, "\n%s", buf.data);
+ elog(NOTICE, "-----------------------------------------------------------");
+}
+#endif
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
new file mode 100644
index 0000000000..c072f8ea98
--- /dev/null
+++ b/src/include/lib/radixtree.h
@@ -0,0 +1,42 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.h
+ * Interface for radix tree.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/lib/radixtree.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef RADIXTREE_H
+#define RADIXTREE_H
+
+#include "postgres.h"
+
+#define RADIX_TREE_DEBUG 1
+
+typedef struct radix_tree radix_tree;
+typedef struct radix_tree_iter radix_tree_iter;
+
+extern radix_tree *radix_tree_create(MemoryContext ctx);
+extern bool radix_tree_search(radix_tree *tree, uint64 key, Datum *val_p);
+extern void radix_tree_insert(radix_tree *tree, uint64 key, Datum val, bool *found_p);
+extern bool radix_tree_delete(radix_tree *tree, uint64 key);
+extern void radix_tree_destroy(radix_tree *tree);
+extern uint64 radix_tree_memory_usage(radix_tree *tree);
+extern uint64 radix_tree_num_entries(radix_tree *tree);
+
+extern radix_tree_iter *radix_tree_begin_iterate(radix_tree *tree);
+extern bool radix_tree_iterate_next(radix_tree_iter *iter, uint64 *key_p, Datum *value_p);
+extern void radix_tree_end_iterate(radix_tree_iter *iter);
+
+
+#ifdef RADIX_TREE_DEBUG
+extern void radix_tree_dump(radix_tree *tree);
+extern void radix_tree_dump_search(radix_tree *tree, uint64 key);
+extern void radix_tree_stats(radix_tree *tree);
+#endif
+
+#endif /* RADIXTREE_H */
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 9090226daa..51b2514faf 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -24,6 +24,7 @@ SUBDIRS = \
test_parser \
test_pg_dump \
test_predtest \
+ test_radixtree \
test_rbtree \
test_regex \
test_rls_hooks \
diff --git a/src/test/modules/test_radixtree/.gitignore b/src/test/modules/test_radixtree/.gitignore
new file mode 100644
index 0000000000..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/test_radixtree/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_radixtree/Makefile b/src/test/modules/test_radixtree/Makefile
new file mode 100644
index 0000000000..da06b93da3
--- /dev/null
+++ b/src/test/modules/test_radixtree/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_radixtree/Makefile
+
+MODULE_big = test_radixtree
+OBJS = \
+ $(WIN32RES) \
+ test_radixtree.o
+PGFILEDESC = "test_radixtree - test code for src/backend/lib/radixtree.c"
+
+EXTENSION = test_radixtree
+DATA = test_radixtree--1.0.sql
+
+REGRESS = test_radixtree
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_radixtree
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_radixtree/README b/src/test/modules/test_radixtree/README
new file mode 100644
index 0000000000..a8b271869a
--- /dev/null
+++ b/src/test/modules/test_radixtree/README
@@ -0,0 +1,7 @@
+test_radixtree contains unit tests for the radix tree implementation in
+src/backend/lib/radixtree.c.
+
+The tests verify the correctness of the implementation, but they can also be
+used as a micro-benchmark. If you set the 'radix_tree_test_stats' flag in
+test_radixtree.c, the tests will print extra information about execution time
+and memory usage.
diff --git a/src/test/modules/test_radixtree/expected/test_integerset.out b/src/test/modules/test_radixtree/expected/test_integerset.out
new file mode 100644
index 0000000000..822dd031e9
--- /dev/null
+++ b/src/test/modules/test_radixtree/expected/test_integerset.out
@@ -0,0 +1,31 @@
+CREATE EXTENSION test_integerset;
+--
+-- All the logic is in the test_integerset() function. It will throw
+-- an error if something fails.
+--
+SELECT test_integerset();
+NOTICE: testing intset with empty set
+NOTICE: testing intset with distances > 2^60 between values
+NOTICE: testing intset with single value 0
+NOTICE: testing intset with single value 1
+NOTICE: testing intset with single value 18446744073709551614
+NOTICE: testing intset with single value 18446744073709551615
+NOTICE: testing intset with value 0, and all between 1000 and 2000
+NOTICE: testing intset with value 1, and all between 1000 and 2000
+NOTICE: testing intset with value 1, and all between 1000 and 2000000
+NOTICE: testing intset with value 18446744073709551614, and all between 1000 and 2000
+NOTICE: testing intset with value 18446744073709551615, and all between 1000 and 2000
+NOTICE: testing intset with pattern "all ones"
+NOTICE: testing intset with pattern "alternating bits"
+NOTICE: testing intset with pattern "clusters of ten"
+NOTICE: testing intset with pattern "clusters of hundred"
+NOTICE: testing intset with pattern "one-every-64k"
+NOTICE: testing intset with pattern "sparse"
+NOTICE: testing intset with pattern "single values, distance > 2^32"
+NOTICE: testing intset with pattern "clusters, distance > 2^32"
+NOTICE: testing intset with pattern "clusters, distance > 2^60"
+ test_integerset
+-----------------
+
+(1 row)
+
diff --git a/src/test/modules/test_radixtree/expected/test_radixtree.out b/src/test/modules/test_radixtree/expected/test_radixtree.out
new file mode 100644
index 0000000000..0c96ebc739
--- /dev/null
+++ b/src/test/modules/test_radixtree/expected/test_radixtree.out
@@ -0,0 +1,20 @@
+CREATE EXTENSION test_radixtree;
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
+NOTICE: testing radix tree with pattern "all ones"
+NOTICE: testing radix tree with pattern "alternating bits"
+NOTICE: testing radix tree with pattern "clusters of ten"
+NOTICE: testing radix tree with pattern "clusters of hundred"
+NOTICE: testing radix tree with pattern "one-every-64k"
+NOTICE: testing radix tree with pattern "sparse"
+NOTICE: testing radix tree with pattern "single values, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^60"
+ test_radixtree
+----------------
+
+(1 row)
+
diff --git a/src/test/modules/test_radixtree/sql/test_radixtree.sql b/src/test/modules/test_radixtree/sql/test_radixtree.sql
new file mode 100644
index 0000000000..41ece5e9f5
--- /dev/null
+++ b/src/test/modules/test_radixtree/sql/test_radixtree.sql
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_radixtree;
+
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
diff --git a/src/test/modules/test_radixtree/test_radixtree--1.0.sql b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
new file mode 100644
index 0000000000..074a5a7ea7
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_radixtree/test_radixtree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_radixtree" to load this file. \quit
+
+CREATE FUNCTION test_radixtree()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
new file mode 100644
index 0000000000..e93c7f6676
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -0,0 +1,446 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_radixtree.c
+ * Test the radix tree data structure.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/test/modules/test_radixtree/test_radixtree.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "lib/radixtree.h"
+#include "miscadmin.h"
+#include "nodes/bitmapset.h"
+#include "storage/block.h"
+#include "storage/itemptr.h"
+#include "utils/memutils.h"
+#include "utils/timestamp.h"
+
+#define UINT64_HEX_FORMAT "%" INT64_MODIFIER "X"
+
+/*
+ * If you enable this, the "pattern" tests will print information about
+ * how long populating, probing, and iterating the test set takes, and
+ * how much memory the test set consumed. That can be used as
+ * micro-benchmark of various operations and input patterns (you might
+ * want to increase the number of values used in each of the test, if
+ * you do that, to reduce noise).
+ *
+ * The information is printed to the server's stderr, mostly because
+ * that's where MemoryContextStats() output goes.
+ */
+static const bool radix_tree_test_stats = true;
+
+static int radix_tree_node_max_entries[] = {4, 32, 128, 256};
+
+/*
+ * A struct to define a pattern of integers, for use with the test_pattern()
+ * function.
+ */
+typedef struct
+{
+ char *test_name; /* short name of the test, for humans */
+ char *pattern_str; /* a bit pattern */
+ uint64 spacing; /* pattern repeats at this interval */
+ uint64 num_values; /* number of integers to set in total */
+} test_spec;
+
+static const test_spec test_specs[] = {
+ {
+ "all ones", "1111111111",
+ 10, 1000000
+ },
+ {
+ "alternating bits", "0101010101",
+ 10, 1000000
+ },
+ {
+ "clusters of ten", "1111111111",
+ 10000, 1000000
+ },
+ {
+ "clusters of hundred",
+ "1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111",
+ 10000, 10000000
+ },
+ {
+ "one-every-64k", "1",
+ 65536, 1000000
+ },
+ {
+ "sparse", "100000000000000000000000000000001",
+ 10000000, 1000000
+ },
+ {
+ "single values, distance > 2^32", "1",
+ UINT64CONST(10000000000), 100000
+ },
+ {
+ "clusters, distance > 2^32", "10101010",
+ UINT64CONST(10000000000), 1000000
+ },
+ {
+ "clusters, distance > 2^60", "10101010",
+ UINT64CONST(2000000000000000000),
+ 23 /* can't be much higher than this, or we
+ * overflow uint64 */
+ }
+};
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(test_radixtree);
+
+static void test_empty(void);
+
+static void
+test_empty(void)
+{
+ radix_tree *radixtree;
+ Datum dummy;
+
+ radixtree = radix_tree_create(CurrentMemoryContext);
+
+ if (radix_tree_search(radixtree, 0, &dummy))
+ elog(ERROR, "radix_tree_search on empty tree returned true");
+
+ if (radix_tree_search(radixtree, 1, &dummy))
+ elog(ERROR, "radix_tree_search on empty tree returned true");
+
+ if (radix_tree_search(radixtree, PG_UINT64_MAX, &dummy))
+ elog(ERROR, "radix_tree_search on empty tree returned true");
+
+ if (radix_tree_num_entries(radixtree) != 0)
+ elog(ERROR, "radix_tree_num_entries on empty tree return non-zero");
+
+ radix_tree_destroy(radixtree);
+}
+
+static void
+check_search_on_node(radix_tree *radixtree, uint8 shift, int start, int end)
+{
+ for (int i = start; i < end; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ Datum val;
+
+ if (!radix_tree_search(radixtree, key, &val))
+ elog(ERROR, "key 0x" UINT64_HEX_FORMAT " is not found on node-%d",
+ key, end);
+ if (DatumGetUInt64(val) != key)
+ elog(ERROR, "radix_tree_search with key 0x" UINT64_HEX_FORMAT " returns 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ key, DatumGetUInt64(val), key);
+ }
+}
+
+static void
+test_node_types(uint8 shift)
+{
+ radix_tree *radixtree;
+ uint64 num_entries;
+
+ radixtree = radix_tree_create(CurrentMemoryContext);
+
+ for (int i = 0; i < 256; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ radix_tree_insert(radixtree, key, Int64GetDatum(key), &found);
+
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " found", key);
+
+ for (int j = 0; j < lengthof(radix_tree_node_max_entries); j++)
+ {
+ if (i == (radix_tree_node_max_entries[j] - 1))
+ {
+ check_search_on_node(radixtree, shift,
+ (j == 0) ? 0 : radix_tree_node_max_entries[j - 1],
+ radix_tree_node_max_entries[j]);
+ break;
+ }
+ }
+ }
+
+ num_entries = radix_tree_num_entries(radixtree);
+
+ if (num_entries != 256)
+ elog(ERROR,
+ "radix_tree_num_entries returned" UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(256));
+}
+
+/*
+ * Test with a repeating pattern, defined by the 'spec'.
+ */
+static void
+test_pattern(const test_spec *spec)
+{
+ radix_tree *radixtree;
+ radix_tree_iter *iter;
+ MemoryContext radixtree_ctx;
+ TimestampTz starttime;
+ TimestampTz endtime;
+ uint64 n;
+ uint64 last_int;
+ uint64 ndeleted;
+ uint64 nbefore;
+ uint64 nafter;
+ int patternlen;
+ uint64 *pattern_values;
+ uint64 pattern_num_values;
+
+ elog(NOTICE, "testing radix tree with pattern \"%s\"", spec->test_name);
+ if (radix_tree_test_stats)
+ fprintf(stderr, "-----\ntesting radix tree with pattern \"%s\"\n", spec->test_name);
+
+ /* Pre-process the pattern, creating an array of integers from it. */
+ patternlen = strlen(spec->pattern_str);
+ pattern_values = palloc(patternlen * sizeof(uint64));
+ pattern_num_values = 0;
+ for (int i = 0; i < patternlen; i++)
+ {
+ if (spec->pattern_str[i] == '1')
+ pattern_values[pattern_num_values++] = i;
+ }
+
+ /*
+ * Allocate the radix tree.
+ *
+ * Allocate it in a separate memory context, so that we can print its
+ * memory usage easily.
+ */
+ radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
+ "radixtree test",
+ ALLOCSET_SMALL_SIZES);
+ MemoryContextSetIdentifier(radixtree_ctx, spec->test_name);
+ radixtree = radix_tree_create(radixtree_ctx);
+
+ /*
+ * Add values to the set.
+ */
+ starttime = GetCurrentTimestamp();
+
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ uint64 x = 0;
+
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ bool found;
+
+ x = last_int + pattern_values[i];
+
+ radix_tree_insert(radixtree, x, Int64GetDatum(x), &found);
+
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " found", x);
+
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+
+ endtime = GetCurrentTimestamp();
+
+ if (radix_tree_test_stats)
+ fprintf(stderr, "added " UINT64_FORMAT " values in %d ms\n",
+ spec->num_values, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Print stats on the amount of memory used.
+ *
+ * We print the usage reported by radix_tree_memory_usage(), as well as the
+ * stats from the memory context. They should be in the same ballpark,
+ * but it's hard to automate testing that, so if you're making changes to
+ * the implementation, just observe that manually.
+ */
+ if (radix_tree_test_stats)
+ {
+ uint64 mem_usage;
+
+ /*
+ * Also print memory usage as reported by radix_tree_memory_usage(). It
+ * should be in the same ballpark as the usage reported by
+ * MemoryContextStats().
+ */
+ mem_usage = radix_tree_memory_usage(radixtree);
+ fprintf(stderr, "radix_tree_memory_usage() reported " UINT64_FORMAT " (%0.2f bytes / integer)\n",
+ mem_usage, (double) mem_usage / spec->num_values);
+
+ MemoryContextStats(radixtree_ctx);
+ }
+
+ /* Check that radix_tree_num_entries works */
+ n = radix_tree_num_entries(radixtree);
+ if (n != spec->num_values)
+ elog(ERROR, "radix_tree_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT, n, spec->num_values);
+
+ /*
+ * Test random-access probes with radix_tree_search()
+ */
+ starttime = GetCurrentTimestamp();
+
+ for (n = 0; n < 100000; n++)
+ {
+ bool found;
+ bool expected;
+ uint64 x;
+ Datum v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Do we expect this value to be present in the set? */
+ if (x >= last_int)
+ expected = false;
+ else
+ {
+ uint64 idx = x % spec->spacing;
+
+ if (idx >= patternlen)
+ expected = false;
+ else if (spec->pattern_str[idx] == '1')
+ expected = true;
+ else
+ expected = false;
+ }
+
+ /* Is it present according to radix_tree_search() ? */
+ found = radix_tree_search(radixtree, x, &v);
+
+ if (found != expected)
+ elog(ERROR, "mismatch at 0x" UINT64_HEX_FORMAT ": %d vs %d", x, found, expected);
+ if (found && (DatumGetUInt64(v) != x))
+ {
+ radix_tree_dump_search(radixtree, x);
+ elog(ERROR, "found 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ DatumGetUInt64(v), x);
+ }
+ }
+ endtime = GetCurrentTimestamp();
+ if (radix_tree_test_stats)
+ fprintf(stderr, "probed " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Test iterator
+ */
+ starttime = GetCurrentTimestamp();
+
+ iter = radix_tree_begin_iterate(radixtree);
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ uint64 expected = last_int + pattern_values[i];
+ uint64 x;
+ uint64 val;
+
+ if (!radix_tree_iterate_next(iter, &x, &val))
+ break;
+
+ if (x != expected)
+ elog(ERROR,
+ "iterate returned wrong key; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d", x, expected, i);
+ if (DatumGetUInt64(val) != expected)
+ elog(ERROR,
+ "iterate returned wrong value; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d", x, expected, i);
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+ endtime = GetCurrentTimestamp();
+ if (radix_tree_test_stats)
+ fprintf(stderr, "iterated " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ if (n < spec->num_values)
+ elog(ERROR, "iterator stopped short after " UINT64_FORMAT " entries, expected " UINT64_FORMAT, n, spec->num_values);
+ if (n > spec->num_values)
+ elog(ERROR, "iterator returned " UINT64_FORMAT " entries, " UINT64_FORMAT " was expected", n, spec->num_values);
+
+ /*
+ * Test random-access probes with radix_tree_delete()
+ */
+ starttime = GetCurrentTimestamp();
+
+ nbefore = radix_tree_num_entries(radixtree);
+ ndeleted = 0;
+ for (n = 0; n < 100000; n++)
+ {
+ bool found;
+ uint64 x;
+ Datum v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Is it present according to radix_tree_search() ? */
+ found = radix_tree_search(radixtree, x, &v);
+
+ if (!found)
+ continue;
+
+ /* If the key is found, delete it and check again */
+ if (!radix_tree_delete(radixtree, x))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, x);
+ if (radix_tree_search(radixtree, x, &v))
+ elog(ERROR, "found deleted key 0x" UINT64_HEX_FORMAT, x);
+ if (radix_tree_delete(radixtree, x))
+ elog(ERROR, "deleted already-deleted key 0x" UINT64_HEX_FORMAT, x);
+
+ ndeleted++;
+ }
+ endtime = GetCurrentTimestamp();
+ if (radix_tree_test_stats)
+ fprintf(stderr, "deleted " UINT64_FORMAT " values in %d ms\n",
+ ndeleted, (int) (endtime - starttime) / 1000);
+
+ nafter = radix_tree_num_entries(radixtree);
+
+ /* Check that radix_tree_num_entries works */
+ if ((nbefore - ndeleted) != nafter)
+ elog(ERROR, "radix_tree_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT "after " UINT64_FORMAT " deletion",
+ nafter, (nbefore - ndeleted), ndeleted);
+
+ MemoryContextDelete(radixtree_ctx);
+}
+
+Datum
+test_radixtree(PG_FUNCTION_ARGS)
+{
+ test_empty();
+
+ for (int shift = 0; shift <= (64 - 8); shift += 8)
+ test_node_types(shift);
+
+ /* Test different test patterns, with lots of entries */
+ for (int i = 0; i < lengthof(test_specs); i++)
+ test_pattern(&test_specs[i]);
+
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_radixtree/test_radixtree.control b/src/test/modules/test_radixtree/test_radixtree.control
new file mode 100644
index 0000000000..e53f2a3e0c
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.control
@@ -0,0 +1,4 @@
+comment = 'Test code for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/test_radixtree'
+relocatable = true
On Wed, May 25, 2022 at 11:48 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Tue, May 10, 2022 at 6:58 PM John Naylor
> <john.naylor@enterprisedb.com> wrote:
> > On Tue, May 10, 2022 at 8:52 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > Overall, radix tree implementations have good numbers. Once we got an
> > > agreement on moving in this direction, I'll start a new thread for
> > > that and move the implementation further; there are many things to do
> > > and discuss: deletion, API design, SIMD support, more tests etc.
> >
> > +1
>
> Thanks!
>
> I've attached an updated version patch. It is still WIP but I've
> implemented deletion and improved test cases and comments.
I've attached an updated version of the patch that changes the configure
script. I'm still studying how to support AVX2 in the MSVC build. Also,
I added more regression tests.

The integration with lazy vacuum and parallel vacuum is still missing.
In order to support parallel vacuum, the radix tree needs to support
being created on a DSA area.
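As a very rough sketch of that direction (the function names here are
purely hypothetical, nothing is implemented yet), I imagine something
like radix_tree_create_dsa(dsa_area *area) returning a handle that the
parallel vacuum leader can pass to workers, plus a corresponding attach
function, so that all processes can probe the same tree.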
Added this item to the next CF.
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
Attachments:
radixtree_wip_v3.patch (application/x-patch)
diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index d3562d6fee..a56d6e89da 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -676,3 +676,27 @@ if test x"$Ac_cachevar" = x"yes"; then
fi
undefine([Ac_cachevar])dnl
])# PGAC_ARMV8_CRC32C_INTRINSICS
+
+# PGAC_AVX2_INTRINSICS
+# --------------------
+# Check if the compiler supports the Intel AVX2 instructions.
+#
+# If the intrinsics are supported, sets pgac_avx2_intrinsics, and CFLAGS_AVX2.
+AC_DEFUN([PGAC_AVX2_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx2_intrinsics_$1])])dnl
+AC_CACHE_CHECK([for _mm256_set1_epi8 _mm256_cmpeq_epi8 _mm256_movemask_epi8 CFLAGS=$1], [Ac_cachevar],
+[pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS $1"
+AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>],
+ [__m256i vec = _mm256_set1_epi8(0);
+ __m256i cmp = _mm256_cmpeq_epi8(vec, vec);
+ return _mm256_movemask_epi8(cmp) > 0;])],
+ [Ac_cachevar=yes],
+ [Ac_cachevar=no])
+CFLAGS="$pgac_save_CFLAGS"])
+if test x"$Ac_cachevar" = x"yes"; then
+ CFLAGS_AVX2="$1"
+ pgac_avx2_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_AVX2_INTRINSICS
diff --git a/configure b/configure
index 7dec6b7bf9..6ebc15a8c1 100755
--- a/configure
+++ b/configure
@@ -645,6 +645,7 @@ XGETTEXT
MSGMERGE
MSGFMT_FLAGS
MSGFMT
+CFLAGS_AVX2
PG_CRC32C_OBJS
CFLAGS_ARMV8_CRC32C
CFLAGS_SSE42
@@ -18829,6 +18830,82 @@ $as_echo "slicing-by-8" >&6; }
fi
+# Check for Intel AVX2 intrinsics.
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm256i CFLAGS=" >&5
+$as_echo_n "checking for _mm256i CFLAGS=... " >&6; }
+if ${pgac_cv_avx2_intrinsics_+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS "
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <immintrin.h>
+int
+main ()
+{
+__m256i vec = _mm256_set1_epi8(0);
+ __m256i cmp = _mm256_cmpeq_epi8(vec, vec);
+ return _mm256_movemask_epi8(cmp) > 0;
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv_avx2_intrinsics_=yes
+else
+ pgac_cv_avx2_intrinsics_=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx2_intrinsics_" >&5
+$as_echo "$pgac_cv_avx2_intrinsics_" >&6; }
+if test x"$pgac_cv_avx2_intrinsics_" = x"yes"; then
+ CFLAGS_AVX2=""
+ pgac_avx2_intrinsics=yes
+fi
+
+if test x"pgac_avx2_intrinsics" != x"yes"; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm256i CFLAGS=-mavx2" >&5
+$as_echo_n "checking for _mm256i CFLAGS=-mavx2... " >&6; }
+if ${pgac_cv_avx2_intrinsics__mavx2+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS -mavx2"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <immintrin.h>
+int
+main ()
+{
+__m256i vec = _mm256_set1_epi8(0);
+ __m256i cmp = _mm256_cmpeq_epi8(vec, vec);
+ return _mm256_movemask_epi8(cmp) > 0;
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv_avx2_intrinsics__mavx2=yes
+else
+ pgac_cv_avx2_intrinsics__mavx2=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx2_intrinsics__mavx2" >&5
+$as_echo "$pgac_cv_avx2_intrinsics__mavx2" >&6; }
+if test x"$pgac_cv_avx2_intrinsics__mavx2" = x"yes"; then
+ CFLAGS_AVX2="-mavx2"
+ pgac_avx2_intrinsics=yes
+fi
+
+fi
+
# Select semaphore implementation type.
if test "$PORTNAME" != "win32"; then
diff --git a/configure.ac b/configure.ac
index d093fb88dd..6b6d095306 100644
--- a/configure.ac
+++ b/configure.ac
@@ -2300,6 +2300,12 @@ else
fi
AC_SUBST(PG_CRC32C_OBJS)
+# Check for Intel AVX2 intrinsics.
+PGAC_AVX2_INTRINSICS([])
+if test x"pgac_avx2_intrinsics" != x"yes"; then
+ PGAC_AVX2_INTRINSICS([-mavx2])
+fi
+AC_SUBST(CFLAGS_AVX2)
# Select semaphore implementation type.
if test "$PORTNAME" != "win32"; then
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index 051718e4fe..9717094724 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -263,6 +263,7 @@ CFLAGS_UNROLL_LOOPS = @CFLAGS_UNROLL_LOOPS@
CFLAGS_VECTORIZE = @CFLAGS_VECTORIZE@
CFLAGS_SSE42 = @CFLAGS_SSE42@
CFLAGS_ARMV8_CRC32C = @CFLAGS_ARMV8_CRC32C@
+CFLAGS_AVX2 = @CFLAGS_AVX2@
PERMIT_DECLARATION_AFTER_STATEMENT = @PERMIT_DECLARATION_AFTER_STATEMENT@
CXXFLAGS = @CXXFLAGS@
diff --git a/src/backend/lib/Makefile b/src/backend/lib/Makefile
index 9dad31398a..5e4516ca90 100644
--- a/src/backend/lib/Makefile
+++ b/src/backend/lib/Makefile
@@ -22,6 +22,10 @@ OBJS = \
integerset.o \
knapsack.o \
pairingheap.o \
+ radixtree.o \
rbtree.o \
+# radixtree.o needs CFLAGS_AVX2
+radixtree.o: CFLAGS+=$(CFLAGS_AVX2)
+
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
new file mode 100644
index 0000000000..bf87f932fd
--- /dev/null
+++ b/src/backend/lib/radixtree.c
@@ -0,0 +1,1763 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.c
+ * Implementation of an adaptive radix tree.
+ *
+ * This module employs the idea from the paper "The Adaptive Radix Tree: ARTful
+ * Indexing for Main-Memory Databases" by Viktor Leis, Alfons Kemper, and Thomas
+ * Neumann, 2013.
+ *
+ * There are some differences from the proposed implementation. For instance,
+ * this radix tree module utilizes AVX2 instructions, enabling us to use 256-bit
+ * wide SIMD vectors, whereas 128-bit wide SIMD vectors are used in the paper.
+ * Also, there is no support for path compression and lazy path expansion. The
+ * radix tree supports only fixed-length keys, so we don't expect the tree to
+ * become very high.
+ *
+ * The key is a 64-bit unsigned integer and the value is a Datum. Both internal
+ * nodes and leaf nodes have an identical structure. Internal tree nodes
+ * (shift > 0) store pointers to their child nodes as values, whereas leaf
+ * nodes (shift == 0) store the Datum values specified by the user.
+ *
+ * XXX: radix tree nodes are never shrunk.
+ *
+ * Interface
+ * ---------
+ *
+ * radix_tree_create - Create a new, empty radix tree
+ * radix_tree_free - Free the radix tree
+ * radix_tree_insert - Insert a key-value pair
+ * radix_tree_delete - Delete a key-value pair
+ * radix_tree_begin_iterate - Begin iterating through all key-value pairs
+ * radix_tree_iterate_next - Return next key-value pair, if any
+ * radix_tree_end_iterate - End iteration
+ *
+ * radix_tree_create() creates an empty radix tree in the given memory context,
+ * along with child memory contexts for each kind of radix tree node.
+ *
+ * radix_tree_iterate_next() returns the key-value pairs in ascending key
+ * order.
+ *
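+ * A minimal usage sketch (within a caller-provided memory context; error
+ * handling and iteration are omitted):
+ *
+ *    radix_tree *tree = radix_tree_create(CurrentMemoryContext);
+ *    uint64  key = 42;
+ *    Datum   val;
+ *    bool    found;
+ *
+ *    radix_tree_insert(tree, key, Int64GetDatum(123), &found);
+ *    if (radix_tree_search(tree, key, &val))
+ *        Assert(DatumGetInt64(val) == 123);
+ *    radix_tree_free(tree);
+ *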
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/lib/radixtree.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "miscadmin.h"
+#include "port/pg_bitutils.h"
+#include "utils/memutils.h"
+#include "lib/radixtree.h"
+#include "lib/stringinfo.h"
+
+#if defined(__AVX2__)
+#include <immintrin.h> /* AVX2 intrinsics */
+#endif
+
+/* The number of bits encoded in one tree level */
+#define RADIX_TREE_NODE_FANOUT 8
+
+/* The number of maximum slots in the node, used in node-256 */
+#define RADIX_TREE_NODE_MAX_SLOTS (1 << RADIX_TREE_NODE_FANOUT)
+
+/*
+ * Return the number of bytes in the is-set bitmap required to cover nslots
+ * slots, used in node-128 and node-256.
+ */
+#define RADIX_TREE_NODE_NSLOTS_BITS(nslots) ((nslots) / (sizeof(uint8) * BITS_PER_BYTE))
+
+/* Mask for extracting a chunk from the key */
+#define RADIX_TREE_CHUNK_MASK ((1 << RADIX_TREE_NODE_FANOUT) - 1)
+
+/* Maximum shift the radix tree uses */
+#define RADIX_TREE_MAX_SHIFT key_get_shift(UINT64_MAX)
+
+/* Tree level the radix tree uses */
+#define RADIX_TREE_MAX_LEVEL ((sizeof(uint64) * BITS_PER_BYTE) / RADIX_TREE_NODE_FANOUT)
+
+/* Get a chunk from the key */
+#define GET_KEY_CHUNK(key, shift) \
+ ((uint8) (((key) >> (shift)) & RADIX_TREE_CHUNK_MASK))
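+
+/*
+ * For example, GET_KEY_CHUNK(0x1234, 0) is 0x34 and GET_KEY_CHUNK(0x1234, 8)
+ * is 0x12; chunks at all higher shifts are zero.
+ */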
+
+/* Map a slot/chunk number to its byte and bit in the is-set bitmap of node-128 and node-256 */
+#define NODE_BITMAP_BYTE(v) ((v) / RADIX_TREE_NODE_FANOUT)
+#define NODE_BITMAP_BIT(v) (UINT64_C(1) << ((v) % RADIX_TREE_NODE_FANOUT))
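+
+/* For example, slot 9 maps to bit 0x02 of isset[1]. */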
+
+/* Enum used by radix_tree_node_search() */
+typedef enum
+{
+ RADIX_TREE_FIND = 0, /* find the key-value */
+ RADIX_TREE_DELETE, /* delete the key-value */
+} radix_tree_action;
+
+/*
+ * Supported radix tree node kinds.
+ *
+ * XXX: should we add KIND_16 as we can utilize SSE2 SIMD instructions?
+ */
+typedef enum radix_tree_node_kind
+{
+ RADIX_TREE_NODE_KIND_4 = 0,
+ RADIX_TREE_NODE_KIND_32,
+ RADIX_TREE_NODE_KIND_128,
+ RADIX_TREE_NODE_KIND_256
+} radix_tree_node_kind;
+#define RADIX_TREE_NODE_KIND_COUNT 4
+
+/*
+ * Base type for all nodes types.
+ */
+typedef struct radix_tree_node
+{
+ /*
+ * Number of children. We use uint16 to be able to indicate 256 children
+ * at a fanout of 8.
+ */
+ uint16 count;
+
+ /*
+ * Shift indicates which part of the key space is represented by this
+ * node. That is, the key is shifted by 'shift' and the lowest
+ * RADIX_TREE_NODE_FANOUT bits are then represented in chunk.
+ */
+ uint8 shift;
+ uint8 chunk;
+
+ /* Size class of the node */
+ radix_tree_node_kind kind;
+} radix_tree_node;
+
+/* Macros for radix tree nodes */
+#define IS_LEAF_NODE(n) (((radix_tree_node *) (n))->shift == 0)
+#define IS_EMPTY_NODE(n) (((radix_tree_node *) (n))->count == 0)
+#define NODE_HAS_FREE_SLOT(n) \
+ (((radix_tree_node *) (n))->count < \
+ radix_tree_node_info[((radix_tree_node *) (n))->kind].max_slots)
+
+/*
+ * To reduce memory usage compared to a simple radix tree with a fixed fanout
+ * we use adaptive node sizes, with different storage methods for different
+ * numbers of elements.
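+ * For example, a node with at most four children is stored as a node-4; once
+ * a fifth child is inserted, the node is grown to a node-32, and so on up to
+ * node-256.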
+ */
+typedef struct radix_tree_node_4
+{
+ radix_tree_node n;
+
+ /* 4 children, for key chunks */
+ uint8 chunks[4];
+ Datum slots[4];
+} radix_tree_node_4;
+
+typedef struct radix_tree_node_32
+{
+ radix_tree_node n;
+
+ /* 32 children, for key chunks */
+ uint8 chunks[32];
+ Datum slots[32];
+} radix_tree_node_32;
+
+#define RADIX_TREE_NODE_128_BITS RADIX_TREE_NODE_NSLOTS_BITS(128)
+typedef struct radix_tree_node_128
+{
+ radix_tree_node n;
+
+ /*
+ * The index into the slots array for each chunk. 0 means unused, whereas the
+ * slots array is 0-indexed. So the slot for chunk C is slots[slot_idxs[C] - 1].
+ */
+ uint8 slot_idxs[RADIX_TREE_NODE_MAX_SLOTS];
+
+ /* A bitmap to track which slot is in use */
+ uint8 isset[RADIX_TREE_NODE_128_BITS];
+
+ Datum slots[128];
+} radix_tree_node_128;
+
+#define RADIX_TREE_NODE_MAX_BITS RADIX_TREE_NODE_NSLOTS_BITS(RADIX_TREE_NODE_MAX_SLOTS)
+typedef struct radix_tree_node_256
+{
+ radix_tree_node n;
+
+ /* A bitmap to track which slot is in use */
+ uint8 isset[RADIX_TREE_NODE_MAX_BITS];
+
+ Datum slots[RADIX_TREE_NODE_MAX_SLOTS];
+} radix_tree_node_256;
+
+/* Information of each size class */
+typedef struct radix_tree_node_info_elem
+{
+ const char *name;
+ int max_slots;
+ Size size;
+} radix_tree_node_info_elem;
+
+static radix_tree_node_info_elem radix_tree_node_info[] =
+{
+ {"radix tree node 4", 4, sizeof(radix_tree_node_4)},
+ {"radix tree node 32", 32, sizeof(radix_tree_node_32)},
+ {"radix tree node 128", 128, sizeof(radix_tree_node_128)},
+ {"radix tree node 256", 256, sizeof(radix_tree_node_256)},
+};
+
+/*
+ * As we descend the radix tree, we push each visited node onto a stack. The
+ * stack is used during deletion.
+ */
+typedef struct radix_tree_stack_data
+{
+ radix_tree_node *node;
+ struct radix_tree_stack_data *parent;
+} radix_tree_stack_data;
+typedef radix_tree_stack_data *radix_tree_stack;
+
+/*
+ * Iteration support.
+ *
+ * Iterating the radix tree returns each key-value pair in ascending key order.
+ * To support this, we iterate over the nodes at each level. The
+ * radix_tree_iter_node_data struct is used to track the iteration within a node.
+ * radix_tree_iter has an array of this struct, 'stack', to track the iteration
+ * at every level. During the iteration, we also construct the key to return. The key
+ * is updated whenever we update the node iteration information, e.g., when advancing
+ * the current index within the node or when moving to the next node at the same level.
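+ *
+ * A typical iteration loop looks like this (sketch):
+ *
+ *    radix_tree_iter *iter = radix_tree_begin_iterate(tree);
+ *    uint64  key;
+ *    Datum   val;
+ *
+ *    while (radix_tree_iterate_next(iter, &key, &val))
+ *        ... process key and val ...
+ *    radix_tree_end_iterate(iter);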
+ */
+typedef struct radix_tree_iter_node_data
+{
+ radix_tree_node *node; /* current node being iterated */
+ int current_idx; /* current position. -1 for initial value */
+} radix_tree_iter_node_data;
+
+struct radix_tree_iter
+{
+ radix_tree *tree;
+
+ /* Track the iteration on nodes of each level */
+ radix_tree_iter_node_data stack[RADIX_TREE_MAX_LEVEL];
+ int stack_len;
+
+ /* The key is being constructed during the iteration */
+ uint64 key;
+};
+
+/* The radix tree itself */
+struct radix_tree
+{
+ MemoryContext context;
+
+ radix_tree_node *root;
+ uint64 max_val;
+ uint64 num_keys;
+ MemoryContextData *slabs[RADIX_TREE_NODE_KIND_COUNT];
+
+ /* statistics */
+ uint64 mem_used;
+ int32 cnt[RADIX_TREE_NODE_KIND_COUNT];
+};
+
+static radix_tree_node *radix_tree_node_grow(radix_tree *tree, radix_tree_node *parent,
+ radix_tree_node *node, uint64 key);
+static bool radix_tree_node_search_child(radix_tree_node *node, radix_tree_node **child_p,
+ uint64 key);
+static bool radix_tree_node_search(radix_tree_node *node, Datum **slot_p, uint64 key,
+ radix_tree_action action);
+static void radix_tree_extend(radix_tree *tree, uint64 key);
+static void radix_tree_new_root(radix_tree *tree, uint64 key, Datum val);
+static radix_tree_node *radix_tree_node_insert_child(radix_tree *tree,
+ radix_tree_node *parent,
+ radix_tree_node *node,
+ uint64 key);
+static void radix_tree_node_insert_val(radix_tree *tree, radix_tree_node *parent,
+ radix_tree_node *node, uint64 key, Datum val,
+ bool *replaced_p);
+static inline void radix_tree_iter_update_key(radix_tree_iter *iter, uint8 chunk, uint8 shift);
+static Datum radix_tree_node_iterate_next(radix_tree_iter *iter, radix_tree_iter_node_data *node_iter,
+ bool *found_p);
+static void radix_tree_store_iter_node(radix_tree_iter *iter, radix_tree_iter_node_data *node_iter,
+ radix_tree_node *node);
+static void radix_tree_update_iter_stack(radix_tree_iter *iter, int from);
+static void radix_tree_verify_node(radix_tree_node *node);
+
+/*
+ * Helper functions for accessing each kind of nodes.
+ */
+static inline int
+node_32_search_eq(radix_tree_node_32 *node, uint8 chunk)
+{
+#ifdef __AVX2__
+ __m256i _key = _mm256_set1_epi8(chunk);
+ __m256i _data = _mm256_loadu_si256((__m256i_u *) node->chunks);
+ __m256i _cmp = _mm256_cmpeq_epi8(_key, _data);
+ uint32 bitfield = _mm256_movemask_epi8(_cmp);
+
+ bitfield &= ((UINT64_C(1) << node->n.count) - 1);
+
+ return (bitfield) ? __builtin_ctz(bitfield) : -1;
+
+#else
+ for (int i = 0; i < node->n.count; i++)
+ {
+ if (node->chunks[i] > chunk)
+ return -1;
+
+ if (node->chunks[i] == chunk)
+ return i;
+ }
+
+ return -1;
+#endif /* __AVX2__ */
+}
+
+/*
+ * This is a bit more complicated than search_chunk_array_16_eq(), because
+ * until recently no unsigned uint8 comparison instruction existed on x86. So
+ * we need to play some trickery using _mm_min_epu8() to effectively get
+ * <=. There never will be any equal elements in the current uses, but that's
+ * what we get here...
+ */
+static inline int
+node_32_search_le(radix_tree_node_32 *node, uint8 chunk)
+{
+#ifdef __AVX2__
+ __m256i _key = _mm256_set1_epi8(chunk);
+ __m256i _data = _mm256_loadu_si256((__m256i_u *) node->chunks);
+ __m256i _min = _mm256_min_epu8(_key, _data);
+ __m256i cmp = _mm256_cmpeq_epi8(_key, _min);
+ uint32_t bitfield = _mm256_movemask_epi8(cmp);
+
+ bitfield &= ((UINT64_C(1) << node->n.count) - 1);
+
+ return (bitfield) ? __builtin_ctz(bitfield) : node->n.count;
+#else
+ int index;
+
+ for (index = 0; index < node->n.count; index++)
+ {
+ if (node->chunks[index] >= chunk)
+ break;
+ }
+
+ return index;
+#endif /* __AVX2__ */
+}
+
+/* Does the given chunk in the node have a value? */
+static inline bool
+node_128_is_chunk_used(radix_tree_node_128 *node, uint8 chunk)
+{
+ return (node->slot_idxs[chunk] != 0);
+}
+
+/* Is the slot in the node used? */
+static inline bool
+node_128_is_slot_used(radix_tree_node_128 *node, uint8 slot)
+{
+ return ((node->isset[NODE_BITMAP_BYTE(slot)] & NODE_BITMAP_BIT(slot)) != 0);
+}
+
+/* Set the slot at the corresponding chunk */
+static inline void
+node_128_set(radix_tree_node_128 *node, uint8 chunk, Datum val)
+{
+ int slotpos = 0;
+
+ /* Search an unused slot */
+ while (node_128_is_slot_used(node, slotpos))
+ slotpos++;
+
+ node->slot_idxs[chunk] = slotpos + 1;
+ node->slots[slotpos] = val;
+ node->isset[NODE_BITMAP_BYTE(slotpos)] |= NODE_BITMAP_BIT(slotpos);
+}
+
+/* Delete the slot at the corresponding chunk */
+static inline void
+node_128_unset(radix_tree_node_128 *node, uint8 chunk)
+{
+ int slotpos = node->slot_idxs[chunk] - 1;
+
+ if (!node_128_is_chunk_used(node, chunk))
+ return;
+
+ node->isset[NODE_BITMAP_BYTE(slotpos)] &= ~(NODE_BITMAP_BIT(slotpos));
+ node->slot_idxs[chunk] = 0;
+}
+
+/* Return the slot data corresponding to the chunk */
+static inline Datum
+node_128_get_chunk_slot(radix_tree_node_128 *node, uint8 chunk)
+{
+ return node->slots[node->slot_idxs[chunk] - 1];
+}
+
+/* Return true if the slot corresponding to the given chunk is in use */
+static inline bool
+node_256_is_chunk_used(radix_tree_node_256 *node, uint8 chunk)
+{
+ return (node->isset[NODE_BITMAP_BYTE(chunk)] & NODE_BITMAP_BIT(chunk)) != 0;
+}
+
+/* Set the slot at the given chunk position */
+static inline void
+node_256_set(radix_tree_node_256 *node, uint8 chunk, Datum slot)
+{
+ node->slots[chunk] = slot;
+ node->isset[NODE_BITMAP_BYTE(chunk)] |= NODE_BITMAP_BIT(chunk);
+}
+
+/* Set the slot at the given chunk position */
+static inline void
+node_256_unset(radix_tree_node_256 *node, uint8 chunk)
+{
+ node->isset[NODE_BITMAP_BYTE(chunk)] &= ~(NODE_BITMAP_BIT(chunk));
+}
+
+/*
+ * Return the shift needed to store the given key. For example,
+ * key_get_shift(0x1234) returns 8.
+ */
+inline static int
+key_get_shift(uint64 key)
+{
+ return (key == 0)
+ ? 0
+ : (pg_leftmost_one_pos64(key) / RADIX_TREE_NODE_FANOUT) * RADIX_TREE_NODE_FANOUT;
+}
+
+/*
+ * Return the max value stored in a node with the given shift.
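+ * For example, shift_get_max_val(8) is 0xFFFF, i.e., a node with shift 8
+ * covers 16 bits of the key space.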
+ */
+static uint64
+shift_get_max_val(int shift)
+{
+ if (shift == RADIX_TREE_MAX_SHIFT)
+ return UINT64_MAX;
+
+ return (UINT64_C(1) << (shift + RADIX_TREE_NODE_FANOUT)) - 1;
+}
+
+/*
+ * Allocate a new node with the given node kind.
+ */
+static radix_tree_node *
+radix_tree_alloc_node(radix_tree *tree, radix_tree_node_kind kind)
+{
+ radix_tree_node *newnode;
+
+ newnode = (radix_tree_node *) MemoryContextAllocZero(tree->slabs[kind],
+ radix_tree_node_info[kind].size);
+ newnode->kind = kind;
+
+ /* update the statistics */
+ tree->mem_used += GetMemoryChunkSpace(newnode);
+ tree->cnt[kind]++;
+
+ return newnode;
+}
+
+/* Free the given node */
+static void
+radix_tree_free_node(radix_tree *tree, radix_tree_node *node)
+{
+ /*
+ * XXX: If we're deleting the root node, make the tree empty
+ */
+ if (tree->root == node)
+ {
+ tree->root = NULL;
+ }
+
+ /* update the statistics */
+ tree->mem_used -= GetMemoryChunkSpace(node);
+ tree->cnt[node->kind]--;
+
+ Assert(tree->mem_used >= 0);
+ Assert(tree->cnt[node->kind] >= 0);
+
+ pfree(node);
+}
+
+/* Free a stack made by radix_tree_delete */
+static void
+radix_tree_free_stack(radix_tree_stack stack)
+{
+ radix_tree_stack ostack;
+
+ while (stack != NULL)
+ {
+ ostack = stack;
+ stack = stack->parent;
+ pfree(ostack);
+ }
+}
+
+/* Copy the common fields without the kind */
+static void
+radix_tree_copy_node_common(radix_tree_node *src, radix_tree_node *dst)
+{
+ dst->shift = src->shift;
+ dst->chunk = src->chunk;
+ dst->count = src->count;
+}
+
+/*
+ * The radix tree doesn't have sufficient height. Extend the radix tree so it
+ * can store the key. For example, if the root has shift 8 (max key 0xFFFF)
+ * and the new key is 0x1000000, we add new root nodes with shifts 16 and 24.
+ */
+static void
+radix_tree_extend(radix_tree *tree, uint64 key)
+{
+ int max_shift;
+ int shift = tree->root->shift + RADIX_TREE_NODE_FANOUT;
+
+ max_shift = key_get_shift(key);
+
+ /* Grow tree from 'shift' to 'max_shift' */
+ while (shift <= max_shift)
+ {
+ radix_tree_node_4 *node =
+ (radix_tree_node_4 *) radix_tree_alloc_node(tree, RADIX_TREE_NODE_KIND_4);
+
+ node->n.count = 1;
+ node->n.shift = shift;
+ node->chunks[0] = 0;
+ node->slots[0] = PointerGetDatum(tree->root);
+
+ tree->root->chunk = 0;
+ tree->root = (radix_tree_node *) node;
+
+ shift += RADIX_TREE_NODE_FANOUT;
+ }
+
+ tree->max_val = shift_get_max_val(max_shift);
+}
+
+/*
+ * Wrapper around radix_tree_node_search() to search for the pointer to a child
+ * node within the node.
+ *
+ * Return true if the corresponding child is found, otherwise return false. On success,
+ * it sets child_p.
+ */
+static bool
+radix_tree_node_search_child(radix_tree_node *node, radix_tree_node **child_p, uint64 key)
+{
+ bool found = false;
+ Datum *slot_ptr;
+
+ if (radix_tree_node_search(node, &slot_ptr, key, RADIX_TREE_FIND))
+ {
+ /* Found the pointer to the child node */
+ found = true;
+ *child_p = (radix_tree_node *) DatumGetPointer(*slot_ptr);
+ }
+
+ return found;
+}
+
+/*
+ * Return true if the corresponding slot is used, otherwise return false. On
+ * success with RADIX_TREE_FIND, sets *slot_p to point to the slot.
+ */
+static bool
+radix_tree_node_search(radix_tree_node *node, Datum **slot_p, uint64 key,
+ radix_tree_action action)
+{
+ int chunk = GET_KEY_CHUNK(key, node->shift);
+ bool found = false;
+
+ switch (node->kind)
+ {
+ case RADIX_TREE_NODE_KIND_4:
+ {
+ radix_tree_node_4 *n4 = (radix_tree_node_4 *) node;
+
+ /* Do linear search */
+ for (int i = 0; i < n4->n.count; i++)
+ {
+ if (n4->chunks[i] > chunk)
+ break;
+
+ /*
+ * If we find the chunk in the node, do the specified
+ * action
+ */
+ if (n4->chunks[i] == chunk)
+ {
+ if (action == RADIX_TREE_FIND)
+ *slot_p = &(n4->slots[i]);
+ else /* RADIX_TREE_DELETE */
+ {
+ memmove(&(n4->chunks[i]), &(n4->chunks[i + 1]),
+ sizeof(uint8) * (n4->n.count - i - 1));
+ memmove(&(n4->slots[i]), &(n4->slots[i + 1]),
+ sizeof(radix_tree_node *) * (n4->n.count - i - 1));
+ }
+
+ found = true;
+ break;
+ }
+ }
+
+ break;
+ }
+ case RADIX_TREE_NODE_KIND_32:
+ {
+ radix_tree_node_32 *n32 = (radix_tree_node_32 *) node;
+ int idx;
+
+ /* Search by SIMD instructions */
+ idx = node_32_search_eq(n32, chunk);
+
+ /* If we find the chunk in the node, do the specified action */
+ if (idx >= 0)
+ {
+ if (action == RADIX_TREE_FIND)
+ *slot_p = &(n32->slots[idx]);
+ else /* RADIX_TREE_DELETE */
+ {
+ memmove(&(n32->chunks[idx]), &(n32->chunks[idx + 1]),
+ sizeof(uint8) * (n32->n.count - idx - 1));
+ memmove(&(n32->slots[idx]), &(n32->slots[idx + 1]),
+ sizeof(radix_tree_node *) * (n32->n.count - idx - 1));
+ }
+
+ found = true;
+ }
+
+ break;
+ }
+ case RADIX_TREE_NODE_KIND_128:
+ {
+ radix_tree_node_128 *n128 = (radix_tree_node_128 *) node;
+
+ /* If we find the chunk in the node, do the specified action */
+ if (node_128_is_chunk_used(n128, chunk))
+ {
+ if (action == RADIX_TREE_FIND)
+ *slot_p = &(n128->slots[n128->slot_idxs[chunk] - 1]);
+ else /* RADIX_TREE_DELETE */
+ node_128_unset(n128, chunk);
+
+ found = true;
+ }
+
+ break;
+ }
+ case RADIX_TREE_NODE_KIND_256:
+ {
+ radix_tree_node_256 *n256 = (radix_tree_node_256 *) node;
+
+ /* If we find the chunk in the node, do the specified action */
+ if (node_256_is_chunk_used(n256, chunk))
+ {
+ if (action == RADIX_TREE_FIND)
+ *slot_p = &(n256->slots[chunk]);
+ else /* RADIX_TREE_DELETE */
+ node_256_unset(n256, chunk);
+
+ found = true;
+ }
+
+ break;
+ }
+ }
+
+ /* Update the statistics */
+ if (action == RADIX_TREE_DELETE && found)
+ node->count--;
+
+ return found;
+}
+
+/*
+ * Create a new node as the root. Subordinate nodes will be created during
+ * the insertion.
+ */
+static void
+radix_tree_new_root(radix_tree *tree, uint64 key, Datum val)
+{
+ radix_tree_node_4 *n4 =
+ (radix_tree_node_4 *) radix_tree_alloc_node(tree, RADIX_TREE_NODE_KIND_4);
+ int shift = key_get_shift(key);
+
+ n4->n.shift = shift;
+ tree->max_val = shift_get_max_val(shift);
+ tree->root = (radix_tree_node *) n4;
+}
+
+/* Insert 'node' as a child node of 'parent' */
+static radix_tree_node *
+radix_tree_node_insert_child(radix_tree *tree, radix_tree_node *parent,
+ radix_tree_node *node, uint64 key)
+{
+ radix_tree_node *newchild =
+ (radix_tree_node *) radix_tree_alloc_node(tree, RADIX_TREE_NODE_KIND_4);
+
+ Assert(!IS_LEAF_NODE(node));
+
+ newchild->shift = node->shift - RADIX_TREE_NODE_FANOUT;
+ newchild->chunk = GET_KEY_CHUNK(key, node->shift);
+
+ radix_tree_node_insert_val(tree, parent, node, key, PointerGetDatum(newchild), NULL);
+
+ return (radix_tree_node *) newchild;
+}
+
+/*
+ * Insert the value into the node. The node grows if it's full.
+ */
+static void
+radix_tree_node_insert_val(radix_tree *tree, radix_tree_node *parent,
+ radix_tree_node *node, uint64 key, Datum val,
+ bool *replaced_p)
+{
+ int chunk = GET_KEY_CHUNK(key, node->shift);
+ bool replaced = false;
+
+ switch (node->kind)
+ {
+ case RADIX_TREE_NODE_KIND_4:
+ {
+ radix_tree_node_4 *n4 = (radix_tree_node_4 *) node;
+ int idx;
+
+ for (idx = 0; idx < n4->n.count; idx++)
+ {
+ if (n4->chunks[idx] >= chunk)
+ break;
+ }
+
+ if (NODE_HAS_FREE_SLOT(n4))
+ {
+ if (n4->n.count == 0)
+ {
+ /* the first key for this node, add it */
+ }
+ else if (n4->chunks[idx] == chunk)
+ {
+ /* found the key, replace it */
+ replaced = true;
+ }
+ else if (idx != n4->n.count)
+ {
+ /*
+ * the key needs to be inserted in the middle of the
+ * array, make space for the new key.
+ */
+ memmove(&(n4->chunks[idx + 1]), &(n4->chunks[idx]),
+ sizeof(uint8) * (n4->n.count - idx));
+ memmove(&(n4->slots[idx + 1]), &(n4->slots[idx]),
+ sizeof(radix_tree_node *) * (n4->n.count - idx));
+ }
+
+ n4->chunks[idx] = chunk;
+ n4->slots[idx] = val;
+
+ /* Done */
+ break;
+ }
+
+ /* The node doesn't have free slot so needs to grow */
+ node = radix_tree_node_grow(tree, parent, node, key);
+ Assert(node->kind == RADIX_TREE_NODE_KIND_32);
+ }
+ /* FALLTHROUGH */
+ case RADIX_TREE_NODE_KIND_32:
+ {
+ radix_tree_node_32 *n32 = (radix_tree_node_32 *) node;
+ int idx;
+
+ idx = node_32_search_le(n32, chunk);
+
+ if (NODE_HAS_FREE_SLOT(n32))
+ {
+ if (n32->n.count == 0)
+ {
+ /* first key for this node, add it */
+ }
+ else if (n32->chunks[idx] == chunk)
+ {
+ /* found the key, replace it */
+ replaced = true;
+ }
+ else if (idx != n32->n.count)
+ {
+ /*
+ * the key needs to be inserted in the middle of the
+ * array, make space for the new key.
+ */
+ memmove(&(n32->chunks[idx + 1]), &(n32->chunks[idx]),
+ sizeof(uint8) * (n32->n.count - idx));
+ memmove(&(n32->slots[idx + 1]), &(n32->slots[idx]),
+ sizeof(radix_tree_node *) * (n32->n.count - idx));
+ }
+
+ n32->chunks[idx] = chunk;
+ n32->slots[idx] = val;
+
+ /* Done */
+ break;
+ }
+
+ /* The node doesn't have free slot so needs to grow */
+ node = radix_tree_node_grow(tree, parent, node, key);
+ Assert(node->kind == RADIX_TREE_NODE_KIND_128);
+ }
+ /* FALLTHROUGH */
+ case RADIX_TREE_NODE_KIND_128:
+ {
+ radix_tree_node_128 *n128 = (radix_tree_node_128 *) node;
+
+ if (node_128_is_chunk_used(n128, chunk))
+ {
+ /* found the existing value */
+ node_128_set(n128, chunk, val);
+ replaced = true;
+ break;
+ }
+
+ if (NODE_HAS_FREE_SLOT(n128))
+ {
+ node_128_set(n128, chunk, val);
+
+ /* Done */
+ break;
+ }
+
+ /* The node doesn't have free slot so needs to grow */
+ node = radix_tree_node_grow(tree, parent, node, key);
+ Assert(node->kind == RADIX_TREE_NODE_KIND_256);
+ }
+ /* FALLTHROUGH */
+ case RADIX_TREE_NODE_KIND_256:
+ {
+ radix_tree_node_256 *n256 = (radix_tree_node_256 *) node;
+
+ if (node_256_is_chunk_used(n256, chunk))
+ replaced = true;
+
+ node_256_set(n256, chunk, val);
+
+ break;
+ }
+ }
+
+ /* Update statistics */
+ if (!replaced)
+ node->count++;
+
+ if (replaced_p)
+ *replaced_p = replaced;
+
+ /*
+ * Done. Finally, verify that the chunk and value were inserted or replaced
+ * properly in the node.
+ */
+ radix_tree_verify_node(node);
+}
+
+/* Change the node type to the next larger one */
+static radix_tree_node *
+radix_tree_node_grow(radix_tree *tree, radix_tree_node *parent, radix_tree_node *node,
+ uint64 key)
+{
+ radix_tree_node *newnode = NULL;
+
+ Assert(node->count == radix_tree_node_info[node->kind].max_slots);
+
+ switch (node->kind)
+ {
+ case RADIX_TREE_NODE_KIND_4:
+ {
+ radix_tree_node_4 *n4 = (radix_tree_node_4 *) node;
+ radix_tree_node_32 *new32 =
+ (radix_tree_node_32 *) radix_tree_alloc_node(tree, RADIX_TREE_NODE_KIND_32);
+
+ radix_tree_copy_node_common((radix_tree_node *) n4,
+ (radix_tree_node *) new32);
+
+ /* Copy both chunks and slots to the new node */
+ memcpy(&(new32->chunks), &(n4->chunks), sizeof(uint8) * 4);
+ memcpy(&(new32->slots), &(n4->slots), sizeof(Datum) * 4);
+
+ newnode = (radix_tree_node *) new32;
+ break;
+ }
+ case RADIX_TREE_NODE_KIND_32:
+ {
+ radix_tree_node_32 *n32 = (radix_tree_node_32 *) node;
+ radix_tree_node_128 *new128 =
+ (radix_tree_node_128 *) radix_tree_alloc_node(tree, RADIX_TREE_NODE_KIND_128);
+
+ /* Copy both chunks and slots to the new node */
+ radix_tree_copy_node_common((radix_tree_node *) n32,
+ (radix_tree_node *) new128);
+
+ for (int i = 0; i < n32->n.count; i++)
+ node_128_set(new128, n32->chunks[i], n32->slots[i]);
+
+ newnode = (radix_tree_node *) new128;
+ break;
+ }
+ case RADIX_TREE_NODE_KIND_128:
+ {
+ radix_tree_node_128 *n128 = (radix_tree_node_128 *) node;
+ radix_tree_node_256 *new256 =
+ (radix_tree_node_256 *) radix_tree_alloc_node(tree, RADIX_TREE_NODE_KIND_256);
+ int cnt = 0;
+
+ radix_tree_copy_node_common((radix_tree_node *) n128,
+ (radix_tree_node *) new256);
+
+ for (int i = 0; i < RADIX_TREE_NODE_MAX_SLOTS && cnt < n128->n.count; i++)
+ {
+ if (!node_128_is_chunk_used(n128, i))
+ continue;
+
+ node_256_set(new256, i, node_128_get_chunk_slot(n128, i));
+ cnt++;
+ }
+
+ newnode = (radix_tree_node *) new256;
+ break;
+ }
+ case RADIX_TREE_NODE_KIND_256:
+ elog(ERROR, "radix tree node-256 cannot grow");
+ break;
+ }
+
+ if (parent == node)
+ {
+ /* Replace the root node with the new large node */
+ tree->root = newnode;
+ }
+ else
+ {
+ Datum *slot_ptr = NULL;
+
+ /* Redirect from the parent to the node */
+ radix_tree_node_search(parent, &slot_ptr, key, RADIX_TREE_FIND);
+ Assert(*slot_ptr);
+ *slot_ptr = PointerGetDatum(newnode);
+ }
+
+ /* Verify the node has grown properly */
+ radix_tree_verify_node(newnode);
+
+ /* Free the old node */
+ radix_tree_free_node(tree, node);
+
+ return newnode;
+}
+
+/*
+ * Create the radix tree in the given memory context and return it.
+ */
+radix_tree *
+radix_tree_create(MemoryContext ctx)
+{
+ radix_tree *tree;
+ MemoryContext old_ctx;
+
+ old_ctx = MemoryContextSwitchTo(ctx);
+
+ tree = palloc(sizeof(radix_tree));
+ tree->max_val = 0;
+ tree->root = NULL;
+ tree->context = ctx;
+ tree->num_keys = 0;
+ tree->mem_used = 0;
+
+ /* Create the slab allocator for each size class */
+ for (int i = 0; i < RADIX_TREE_NODE_KIND_COUNT; i++)
+ {
+ tree->slabs[i] = SlabContextCreate(ctx,
+ radix_tree_node_info[i].name,
+ SLAB_DEFAULT_BLOCK_SIZE,
+ radix_tree_node_info[i].size);
+ tree->cnt[i] = 0;
+ }
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return tree;
+}
+
+/*
+ * Free the given radix tree.
+ */
+void
+radix_tree_free(radix_tree *tree)
+{
+ for (int i = 0; i < RADIX_TREE_NODE_KIND_COUNT; i++)
+ MemoryContextDelete(tree->slabs[i]);
+
+ pfree(tree);
+}
+
+/*
+ * Insert the key with the val.
+ *
+ * If found_p is not NULL, it is set to true if the key is already present,
+ * otherwise false.
+ *
+ * XXX: do we need to support update_if_exists behavior?
+ */
+void
+radix_tree_insert(radix_tree *tree, uint64 key, Datum val, bool *found_p)
+{
+ int shift;
+ bool replaced;
+ radix_tree_node *node;
+ radix_tree_node *parent = tree->root;
+
+ /* Empty tree, create the root */
+ if (!tree->root)
+ radix_tree_new_root(tree, key, val);
+
+ /* Extend the tree if necessary */
+ if (key > tree->max_val)
+ radix_tree_extend(tree, key);
+
+ Assert(tree->root);
+
+ shift = tree->root->shift;
+ node = tree->root;
+ while (shift > 0)
+ {
+ radix_tree_node *child;
+
+ if (!radix_tree_node_search_child(node, &child, key))
+ child = radix_tree_node_insert_child(tree, parent, node, key);
+
+ Assert(child != NULL);
+
+ parent = node;
+ node = child;
+ shift -= RADIX_TREE_NODE_FANOUT;
+ }
+
+ /* arrived at a leaf */
+ Assert(IS_LEAF_NODE(node));
+
+ radix_tree_node_insert_val(tree, parent, node, key, val, &replaced);
+
+ /* Update the statistics */
+ if (!replaced)
+ tree->num_keys++;
+
+ if (found_p)
+ *found_p = replaced;
+}
+
+/*
+ * Search for the given key in the radix tree. Return true if the key is found,
+ * otherwise return false. On success, we store the value in *val_p, so val_p
+ * must not be NULL.
+ */
+bool
+radix_tree_search(radix_tree *tree, uint64 key, Datum *val_p)
+{
+ radix_tree_node *node;
+ Datum *value_ptr;
+ int shift;
+
+ Assert(val_p);
+
+ if (!tree->root || key > tree->max_val)
+ return false;
+
+ node = tree->root;
+ shift = tree->root->shift;
+ while (shift > 0)
+ {
+ radix_tree_node *child;
+
+ if (!radix_tree_node_search_child(node, &child, key))
+ return false;
+
+ node = child;
+ shift -= RADIX_TREE_NODE_FANOUT;
+ }
+
+ /* We reached a leaf node; search for the corresponding slot */
+ Assert(IS_LEAF_NODE(node));
+
+ if (!radix_tree_node_search(node, &value_ptr, key, RADIX_TREE_FIND))
+ return false;
+
+ /* Found, set the value to return */
+ *val_p = *value_ptr;
+ return true;
+}
+
+/*
+ * Delete the given key from the radix tree. Return true if the key is found (and
+ * deleted), otherwise do nothing and return false.
+ */
+bool
+radix_tree_delete(radix_tree *tree, uint64 key)
+{
+ radix_tree_node *node;
+ int shift;
+ radix_tree_stack stack = NULL;
+ bool deleted;
+
+ if (!tree->root || key > tree->max_val)
+ return false;
+
+ /*
+ * Descend the tree to search for the key while building a stack of the nodes
+ * we visit.
+ */
+ node = tree->root;
+ shift = tree->root->shift;
+ while (shift >= 0)
+ {
+ radix_tree_node *child;
+ radix_tree_stack new_stack;
+
+ new_stack = (radix_tree_stack) palloc(sizeof(radix_tree_stack_data));
+ new_stack->node = node;
+ new_stack->parent = stack;
+ stack = new_stack;
+
+ if (IS_LEAF_NODE(node))
+ break;
+
+ if (!radix_tree_node_search_child(node, &child, key))
+ {
+ radix_tree_free_stack(stack);
+ return false;
+ }
+
+ node = child;
+ shift -= RADIX_TREE_NODE_FANOUT;
+ }
+
+ /*
+ * Delete the key from the leaf node and recursively delete internal nodes
+ * if necessary.
+ */
+ Assert(IS_LEAF_NODE(stack->node));
+ while (stack != NULL)
+ {
+ radix_tree_node *node;
+ Datum *slot;
+
+ /* pop the node from the stack */
+ node = stack->node;
+ stack = stack->parent;
+
+ deleted = radix_tree_node_search(node, &slot, key, RADIX_TREE_DELETE);
+
+ /* If the node didn't become empty, we're done propagating the deletion */
+ if (!IS_EMPTY_NODE(node))
+ break;
+
+ Assert(deleted);
+
+ /* The node became empty */
+ radix_tree_free_node(tree, node);
+
+ /*
+ * If we eventually deleted the root node while recursively deleting
+ * empty nodes, we make the tree empty.
+ */
+ if (stack == NULL)
+ {
+ tree->root = NULL;
+ tree->max_val = 0;
+ }
+ }
+
+ if (deleted)
+ tree->num_keys--;
+
+ radix_tree_free_stack(stack);
+ return deleted;
+}
+
+/* Create and return the iterator for the given radix tree */
+radix_tree_iter *
+radix_tree_begin_iterate(radix_tree *tree)
+{
+ MemoryContext old_ctx;
+ radix_tree_iter *iter;
+ int top_level;
+
+ old_ctx = MemoryContextSwitchTo(tree->context);
+
+ iter = (radix_tree_iter *) palloc0(sizeof(radix_tree_iter));
+ iter->tree = tree;
+
+ /* empty tree */
+ if (!iter->tree->root)
+ return iter;
+
+ top_level = iter->tree->root->shift / RADIX_TREE_NODE_FANOUT;
+
+ iter->stack_len = top_level;
+ iter->stack[top_level].node = iter->tree->root;
+ iter->stack[top_level].current_idx = -1;
+
+ /* Descend to the left most leaf node from the root */
+ radix_tree_update_iter_stack(iter, top_level);
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return iter;
+}
+
+/*
+ * If there is a next key, return true and set *key_p and *value_p. Otherwise,
+ * return false.
+ */
+bool
+radix_tree_iterate_next(radix_tree_iter *iter, uint64 *key_p, Datum *value_p)
+{
+ bool found = false;
+ Datum slot = (Datum) 0;
+ int level;
+
+ /* Empty tree */
+ if (!iter->tree->root)
+ return false;
+
+ for (;;)
+ {
+ radix_tree_node *node;
+ radix_tree_iter_node_data *node_iter;
+
+ /*
+ * Iterate over the nodes at each level, from the bottom of the tree
+ * up, until we find the next slot.
+ */
+ for (level = 0; level <= iter->stack_len; level++)
+ {
+ slot = radix_tree_node_iterate_next(iter, &(iter->stack[level]), &found);
+
+ if (found)
+ break;
+ }
+
+ /* end of iteration */
+ if (!found)
+ return false;
+
+ /* found the next slot at the leaf node, return it */
+ if (level == 0)
+ {
+ *key_p = iter->key;
+ *value_p = slot;
+ return true;
+ }
+
+ /*
+ * We have advanced past one or more internal nodes, so we need to
+ * update the stack by descending to the leftmost leaf node from this
+ * level.
+ */
+ node = (radix_tree_node *) DatumGetPointer(slot);
+ node_iter = &(iter->stack[level - 1]);
+ radix_tree_store_iter_node(iter, node_iter, node);
+
+ radix_tree_update_iter_stack(iter, level - 1);
+ }
+}
+
+void
+radix_tree_end_iterate(radix_tree_iter *iter)
+{
+ pfree(iter);
+}
+
+/*
+ * Update the part of the key being constructed during the iteration with the
+ * given chunk
+ */
+static inline void
+radix_tree_iter_update_key(radix_tree_iter *iter, uint8 chunk, uint8 shift)
+{
+ iter->key &= ~(((uint64) RADIX_TREE_CHUNK_MASK) << shift);
+ iter->key |= (((uint64) chunk) << shift);
+}
+
+/*
+ * Iterate over the given radix tree node and return its next slot, setting
+ * *found_p to true. If there is no next slot, set *found_p to false.
+ */
+static Datum
+radix_tree_node_iterate_next(radix_tree_iter *iter, radix_tree_iter_node_data *node_iter,
+ bool *found_p)
+{
+ radix_tree_node *node = node_iter->node;
+ Datum slot = (Datum) 0;
+
+ switch (node->kind)
+ {
+ case RADIX_TREE_NODE_KIND_4:
+ {
+ radix_tree_node_4 *n4 = (radix_tree_node_4 *) node_iter->node;
+
+ node_iter->current_idx++;
+
+ if (node_iter->current_idx >= n4->n.count)
+ goto not_found;
+
+ slot = n4->slots[node_iter->current_idx];
+
+ /* Update the part of the key with the current chunk */
+ if (IS_LEAF_NODE(node))
+ radix_tree_iter_update_key(iter, n4->chunks[node_iter->current_idx], 0);
+
+ break;
+ }
+ case RADIX_TREE_NODE_KIND_32:
+ {
+ radix_tree_node_32 *n32 = (radix_tree_node_32 *) node;
+
+ node_iter->current_idx++;
+
+ if (node_iter->current_idx >= n32->n.count)
+ goto not_found;
+
+ slot = n32->slots[node_iter->current_idx];
+
+ /* Update the part of the key with the current chunk */
+ if (IS_LEAF_NODE(node))
+ radix_tree_iter_update_key(iter, n32->chunks[node_iter->current_idx], 0);
+
+ break;
+ }
+ case RADIX_TREE_NODE_KIND_128:
+ {
+ radix_tree_node_128 *n128 = (radix_tree_node_128 *) node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < 256; i++)
+ {
+ if (node_128_is_chunk_used(n128, i))
+ break;
+ }
+
+ if (i >= 256)
+ goto not_found;
+
+ node_iter->current_idx = i;
+ slot = node_128_get_chunk_slot(n128, i);
+
+ /* Update the part of the key */
+ if (IS_LEAF_NODE(node))
+ radix_tree_iter_update_key(iter, node_iter->current_idx, 0);
+
+ break;
+ }
+ case RADIX_TREE_NODE_KIND_256:
+ {
+ radix_tree_node_256 *n256 = (radix_tree_node_256 *) node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < 256; i++)
+ {
+ if (node_256_is_chunk_used(n256, i))
+ break;
+ }
+
+ if (i >= 256)
+ goto not_found;
+
+ node_iter->current_idx = i;
+ slot = n256->slots[i];
+
+ /* Update the part of the key */
+ if (IS_LEAF_NODE(node))
+ radix_tree_iter_update_key(iter, node_iter->current_idx, 0);
+
+ break;
+ }
+ }
+
+ *found_p = true;
+ return slot;
+
+not_found:
+ *found_p = false;
+ return (Datum) 0;
+}
+
+/*
+ * Initialize and update the node iteration struct with the given radix tree node.
+ * This function also updates the part of the key with the chunk of the given node.
+ */
+static void
+radix_tree_store_iter_node(radix_tree_iter *iter, radix_tree_iter_node_data *node_iter,
+ radix_tree_node *node)
+{
+ node_iter->node = node;
+ node_iter->current_idx = -1;
+
+ radix_tree_iter_update_key(iter, node->chunk, node->shift + RADIX_TREE_NODE_FANOUT);
+}
+
+/*
+ * Build the stack of radix tree nodes while descending to the leftmost leaf
+ * from the 'from' level.
+ */
+static void
+radix_tree_update_iter_stack(radix_tree_iter *iter, int from)
+{
+ radix_tree_node *node = iter->stack[from].node;
+ int level = from;
+
+ for (;;)
+ {
+ radix_tree_iter_node_data *node_iter = &(iter->stack[level--]);
+ bool found;
+
+ /* Set the current node */
+ radix_tree_store_iter_node(iter, node_iter, node);
+
+ if (IS_LEAF_NODE(node))
+ break;
+
+ node = (radix_tree_node *)
+ DatumGetPointer(radix_tree_node_iterate_next(iter, node_iter, &found));
+
+ /*
+ * Since we always fetch the first slot in the node, we must find a
+ * slot here.
+ */
+ Assert(found);
+ }
+}
+
+/*
+ * Return the number of keys in the radix tree.
+ */
+uint64
+radix_tree_num_entries(radix_tree *tree)
+{
+ return tree->num_keys;
+}
+
+/*
+ * Return the statistics of the amount of memory used by the radix tree.
+ */
+uint64
+radix_tree_memory_usage(radix_tree *tree)
+{
+ return tree->mem_used;
+}
+
+/*
+ * Verify the radix tree node.
+ */
+static void
+radix_tree_verify_node(radix_tree_node *node)
+{
+#ifdef USE_ASSERT_CHECKING
+ Assert(node->count >= 0);
+
+ switch (node->kind)
+ {
+ case RADIX_TREE_NODE_KIND_4:
+ {
+ radix_tree_node_4 *n4 = (radix_tree_node_4 *) node;
+
+ /* Check if the chunks in the node are sorted */
+ for (int i = 1; i < n4->n.count; i++)
+ Assert(n4->chunks[i - 1] < n4->chunks[i]);
+
+ break;
+ }
+ case RADIX_TREE_NODE_KIND_32:
+ {
+ radix_tree_node_32 *n32 = (radix_tree_node_32 *) node;
+
+ /* Check if the chunks in the node are sorted */
+ for (int i = 1; i < n32->n.count; i++)
+ Assert(n32->chunks[i - 1] < n32->chunks[i]);
+
+ break;
+ }
+ case RADIX_TREE_NODE_KIND_128:
+ {
+ radix_tree_node_128 *n128 = (radix_tree_node_128 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RADIX_TREE_NODE_MAX_SLOTS; i++)
+ {
+ if (!node_128_is_chunk_used(n128, i))
+ continue;
+
+ /* Check if the corresponding slot is used */
+ Assert(node_128_is_slot_used(n128, n128->slot_idxs[i] - 1));
+
+ cnt++;
+ }
+
+ Assert(n128->n.count == cnt);
+ break;
+ }
+ case RADIX_TREE_NODE_KIND_256:
+ {
+ radix_tree_node_256 *n256 = (radix_tree_node_256 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RADIX_TREE_NODE_MAX_BITS; i++)
+ cnt += pg_popcount32(n256->isset[i]);
+
+ /* Check that the number of used chunks matches */
+ Assert(n256->n.count == cnt);
+
+ break;
+ }
+ }
+#endif
+}
+
+/***************** DEBUG FUNCTIONS *****************/
+#ifdef RADIX_TREE_DEBUG
+void
+radix_tree_stats(radix_tree *tree)
+{
+ fprintf(stderr, "num_keys = %lu, height = %u, n4 = %u(%lu), n32 = %u(%lu), n128 = %u(%lu), n256 = %u(%lu)",
+ tree->num_keys,
+ tree->root->shift / RADIX_TREE_NODE_FANOUT,
+ tree->cnt[0], tree->cnt[0] * sizeof(radix_tree_node_4),
+ tree->cnt[1], tree->cnt[1] * sizeof(radix_tree_node_32),
+ tree->cnt[2], tree->cnt[2] * sizeof(radix_tree_node_128),
+ tree->cnt[3], tree->cnt[3] * sizeof(radix_tree_node_256));
+ /* radix_tree_dump(tree); */
+}
+
+static void
+radix_tree_print_slot(StringInfo buf, uint8 chunk, Datum slot, int idx, bool is_leaf, int level)
+{
+ char space[128] = {0};
+
+ if (level > 0)
+ sprintf(space, "%*c", level * 4, ' ');
+
+ if (is_leaf)
+ appendStringInfo(buf, "%s[%d] \"0x%X\" val(0x%lX) LEAF\n",
+ space,
+ idx,
+ chunk,
+ DatumGetInt64(slot));
+ else
+ appendStringInfo(buf, "%s[%d] \"0x%X\" -> ",
+ space,
+ idx,
+ chunk);
+}
+
+static void
+radix_tree_dump_node(radix_tree_node *node, int level, StringInfo buf, bool recurse)
+{
+ bool is_leaf = IS_LEAF_NODE(node);
+
+ appendStringInfo(buf, "[\"%s\" type %d, cnt %u, shift %u, chunk \"0x%X\"] chunks:\n",
+ IS_LEAF_NODE(node) ? "LEAF" : "INNR",
+ (node->kind == RADIX_TREE_NODE_KIND_4) ? 4 :
+ (node->kind == RADIX_TREE_NODE_KIND_32) ? 32 :
+ (node->kind == RADIX_TREE_NODE_KIND_128) ? 128 : 256,
+ node->count, node->shift, node->chunk);
+
+ switch (node->kind)
+ {
+ case RADIX_TREE_NODE_KIND_4:
+ {
+ radix_tree_node_4 *n4 = (radix_tree_node_4 *) node;
+
+ for (int i = 0; i < n4->n.count; i++)
+ {
+ radix_tree_print_slot(buf, n4->chunks[i], n4->slots[i], i, is_leaf, level);
+
+ if (!is_leaf)
+ {
+ if (recurse)
+ {
+ StringInfoData buf2;
+
+ initStringInfo(&buf2);
+ radix_tree_dump_node((radix_tree_node *) n4->slots[i], level + 1, &buf2, recurse);
+ appendStringInfo(buf, "%s", buf2.data);
+ }
+ else
+ appendStringInfo(buf, "\n");
+ }
+ }
+ break;
+ }
+ case RADIX_TREE_NODE_KIND_32:
+ {
+ radix_tree_node_32 *n32 = (radix_tree_node_32 *) node;
+
+ for (int i = 0; i < n32->n.count; i++)
+ {
+ radix_tree_print_slot(buf, n32->chunks[i], n32->slots[i], i, is_leaf, level);
+
+ if (!is_leaf)
+ {
+ if (recurse)
+ {
+ StringInfoData buf2;
+
+ initStringInfo(&buf2);
+ radix_tree_dump_node((radix_tree_node *) n32->slots[i], level + 1, &buf2, recurse);
+ appendStringInfo(buf, "%s", buf2.data);
+ }
+ else
+ appendStringInfo(buf, "\n");
+ }
+ }
+ break;
+ }
+ case RADIX_TREE_NODE_KIND_128:
+ {
+ radix_tree_node_128 *n128 = (radix_tree_node_128 *) node;
+
+ for (int j = 0; j < 256; j++)
+ {
+ if (!node_128_is_chunk_used(n128, j))
+ continue;
+
+ appendStringInfo(buf, "slot_idxs[%d]=%d, ", j, n128->slot_idxs[j]);
+ }
+ appendStringInfo(buf, "\nisset-bitmap:");
+ for (int j = 0; j < 16; j++)
+ {
+ appendStringInfo(buf, "%X ", (uint8) n128->isset[j]);
+ }
+ appendStringInfo(buf, "\n");
+
+ for (int i = 0; i < 256; i++)
+ {
+ if (!node_128_is_chunk_used(n128, i))
+ continue;
+
+ radix_tree_print_slot(buf, i, node_128_get_chunk_slot(n128, i),
+ i, is_leaf, level);
+
+ if (!is_leaf)
+ {
+ if (recurse)
+ {
+ StringInfoData buf2;
+
+ initStringInfo(&buf2);
+ radix_tree_dump_node((radix_tree_node *) node_128_get_chunk_slot(n128, i),
+ level + 1, &buf2, recurse);
+ appendStringInfo(buf, "%s", buf2.data);
+ }
+ else
+ appendStringInfo(buf, "\n");
+ }
+ }
+ break;
+ }
+ case RADIX_TREE_NODE_KIND_256:
+ {
+ radix_tree_node_256 *n256 = (radix_tree_node_256 *) node;
+
+ for (int i = 0; i < 256; i++)
+ {
+ if (!node_256_is_chunk_used(n256, i))
+ continue;
+
+ radix_tree_print_slot(buf, i, n256->slots[i], i, is_leaf, level);
+
+ if (!is_leaf)
+ {
+ if (recurse)
+ {
+ StringInfoData buf2;
+
+ initStringInfo(&buf2);
+ radix_tree_dump_node((radix_tree_node *) n256->slots[i], level + 1, &buf2, recurse);
+ appendStringInfo(buf, "%s", buf2.data);
+ }
+ else
+ appendStringInfo(buf, "\n");
+ }
+ }
+ break;
+ }
+ }
+}
+
+void
+radix_tree_dump_search(radix_tree *tree, uint64 key)
+{
+ StringInfoData buf;
+ radix_tree_node *node;
+ int shift;
+ int level = 0;
+
+ elog(NOTICE, "-----------------------------------------------------------");
+ elog(NOTICE, "max_val = %lu (0x%lX)", tree->max_val, tree->max_val);
+
+ if (!tree->root)
+ {
+ elog(NOTICE, "tree is empty");
+ return;
+ }
+
+ if (key > tree->max_val)
+ {
+ elog(NOTICE, "key %lu (0x%lX) is larger than max val",
+ key, key);
+ return;
+ }
+
+ initStringInfo(&buf);
+ node = tree->root;
+ shift = tree->root->shift;
+ while (shift >= 0)
+ {
+ radix_tree_node *child;
+
+ radix_tree_dump_node(node, level, &buf, false);
+
+ if (IS_LEAF_NODE(node))
+ {
+ Datum *dummy;
+
+ /* We reached at a leaf node, find the corresponding slot */
+ radix_tree_node_search(node, &dummy, key, RADIX_TREE_FIND);
+
+ break;
+ }
+
+ if (!radix_tree_node_search_child(node, &child, key))
+ break;
+
+ node = child;
+ shift -= RADIX_TREE_NODE_FANOUT;
+ level++;
+ }
+
+ elog(NOTICE, "\n%s", buf.data);
+}
+
+void
+radix_tree_dump(radix_tree *tree)
+{
+ StringInfoData buf;
+
+ initStringInfo(&buf);
+
+ elog(NOTICE, "-----------------------------------------------------------");
+ elog(NOTICE, "max_val = %lu", tree->max_val);
+ radix_tree_dump_node(tree->root, 0, &buf, true);
+ elog(NOTICE, "\n%s", buf.data);
+ elog(NOTICE, "-----------------------------------------------------------");
+}
+#endif
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
new file mode 100644
index 0000000000..7e864d124b
--- /dev/null
+++ b/src/include/lib/radixtree.h
@@ -0,0 +1,42 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.h
+ * Interface for radix tree.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/lib/radixtree.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef RADIXTREE_H
+#define RADIXTREE_H
+
+#include "postgres.h"
+
+/* #define RADIX_TREE_DEBUG 1 */
+
+typedef struct radix_tree radix_tree;
+typedef struct radix_tree_iter radix_tree_iter;
+
+extern radix_tree *radix_tree_create(MemoryContext ctx);
+extern bool radix_tree_search(radix_tree *tree, uint64 key, Datum *val_p);
+extern void radix_tree_insert(radix_tree *tree, uint64 key, Datum val, bool *found_p);
+extern bool radix_tree_delete(radix_tree *tree, uint64 key);
+extern void radix_tree_free(radix_tree *tree);
+extern uint64 radix_tree_memory_usage(radix_tree *tree);
+extern uint64 radix_tree_num_entries(radix_tree *tree);
+
+extern radix_tree_iter *radix_tree_begin_iterate(radix_tree *tree);
+extern bool radix_tree_iterate_next(radix_tree_iter *iter, uint64 *key_p, Datum *value_p);
+extern void radix_tree_end_iterate(radix_tree_iter *iter);
+
+
+#ifdef RADIX_TREE_DEBUG
+extern void radix_tree_dump(radix_tree *tree);
+extern void radix_tree_dump_search(radix_tree *tree, uint64 key);
+extern void radix_tree_stats(radix_tree *tree);
+#endif
+
+#endif /* RADIXTREE_H */
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 9090226daa..51b2514faf 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -24,6 +24,7 @@ SUBDIRS = \
test_parser \
test_pg_dump \
test_predtest \
+ test_radixtree \
test_rbtree \
test_regex \
test_rls_hooks \
diff --git a/src/test/modules/test_radixtree/.gitignore b/src/test/modules/test_radixtree/.gitignore
new file mode 100644
index 0000000000..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/test_radixtree/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_radixtree/Makefile b/src/test/modules/test_radixtree/Makefile
new file mode 100644
index 0000000000..da06b93da3
--- /dev/null
+++ b/src/test/modules/test_radixtree/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_radixtree/Makefile
+
+MODULE_big = test_radixtree
+OBJS = \
+ $(WIN32RES) \
+ test_radixtree.o
+PGFILEDESC = "test_radixtree - test code for src/backend/lib/radixtree.c"
+
+EXTENSION = test_radixtree
+DATA = test_radixtree--1.0.sql
+
+REGRESS = test_radixtree
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_radixtree
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_radixtree/README b/src/test/modules/test_radixtree/README
new file mode 100644
index 0000000000..a8b271869a
--- /dev/null
+++ b/src/test/modules/test_radixtree/README
@@ -0,0 +1,7 @@
+test_radixtree contains unit tests for the radix tree implementation in
+src/backend/lib/radixtree.c.
+
+The tests verify the correctness of the implementation, but they can also be
+used as a micro-benchmark. If you set the 'radix_tree_test_stats' flag in
+test_radixtree.c, the tests will print extra information about execution time
+and memory usage.
diff --git a/src/test/modules/test_radixtree/expected/test_radixtree.out b/src/test/modules/test_radixtree/expected/test_radixtree.out
new file mode 100644
index 0000000000..cc6970c87c
--- /dev/null
+++ b/src/test/modules/test_radixtree/expected/test_radixtree.out
@@ -0,0 +1,28 @@
+CREATE EXTENSION test_radixtree;
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
+NOTICE: testing radix tree node types with shift "0"
+NOTICE: testing radix tree node types with shift "8"
+NOTICE: testing radix tree node types with shift "16"
+NOTICE: testing radix tree node types with shift "24"
+NOTICE: testing radix tree node types with shift "32"
+NOTICE: testing radix tree node types with shift "40"
+NOTICE: testing radix tree node types with shift "48"
+NOTICE: testing radix tree node types with shift "56"
+NOTICE: testing radix tree with pattern "all ones"
+NOTICE: testing radix tree with pattern "alternating bits"
+NOTICE: testing radix tree with pattern "clusters of ten"
+NOTICE: testing radix tree with pattern "clusters of hundred"
+NOTICE: testing radix tree with pattern "one-every-64k"
+NOTICE: testing radix tree with pattern "sparse"
+NOTICE: testing radix tree with pattern "single values, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^60"
+ test_radixtree
+----------------
+
+(1 row)
+
diff --git a/src/test/modules/test_radixtree/sql/test_radixtree.sql b/src/test/modules/test_radixtree/sql/test_radixtree.sql
new file mode 100644
index 0000000000..41ece5e9f5
--- /dev/null
+++ b/src/test/modules/test_radixtree/sql/test_radixtree.sql
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_radixtree;
+
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
diff --git a/src/test/modules/test_radixtree/test_radixtree--1.0.sql b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
new file mode 100644
index 0000000000..074a5a7ea7
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_radixtree/test_radixtree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_radixtree" to load this file. \quit
+
+CREATE FUNCTION test_radixtree()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
new file mode 100644
index 0000000000..6d5b06a800
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -0,0 +1,502 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_radixtree.c
+ * Test radixtree set data structure.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/test/modules/test_radixtree/test_radixtree.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "lib/radixtree.h"
+#include "miscadmin.h"
+#include "nodes/bitmapset.h"
+#include "storage/block.h"
+#include "storage/itemptr.h"
+#include "utils/memutils.h"
+#include "utils/timestamp.h"
+
+#define UINT64_HEX_FORMAT "%" INT64_MODIFIER "X"
+
+/*
+ * If you enable this, the "pattern" tests will print information about
+ * how long populating, probing, and iterating the test set takes, and
+ * how much memory the test set consumed. That can be used as
+ * micro-benchmark of various operations and input patterns (you might
+ * want to increase the number of values used in each of the test, if
+ * you do that, to reduce noise).
+ *
+ * The information is printed to the server's stderr, mostly because
+ * that's where MemoryContextStats() output goes.
+ */
+static const bool radix_tree_test_stats = false;
+
+/* The maximum number of entries each node type can have */
+static int radix_tree_node_max_entries[] = {
+ 4, /* RADIX_TREE_NODE_KIND_4 */
+ 16, /* RADIX_TREE_NODE_KIND_16 */
+ 128, /* RADIX_TREE_NODE_KIND_128 */
+ 256 /* RADIX_TREE_NODE_KIND_256 */
+};
+
+/*
+ * A struct to define a pattern of integers, for use with the test_pattern()
+ * function.
+ */
+typedef struct
+{
+ char *test_name; /* short name of the test, for humans */
+ char *pattern_str; /* a bit pattern */
+ uint64 spacing; /* pattern repeats at this interval */
+ uint64 num_values; /* number of integers to set in total */
+} test_spec;
+
+/* Test patterns borrowed from test_integerset.c */
+static const test_spec test_specs[] = {
+ {
+ "all ones", "1111111111",
+ 10, 1000000
+ },
+ {
+ "alternating bits", "0101010101",
+ 10, 1000000
+ },
+ {
+ "clusters of ten", "1111111111",
+ 10000, 1000000
+ },
+ {
+ "clusters of hundred",
+ "1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111",
+ 10000, 10000000
+ },
+ {
+ "one-every-64k", "1",
+ 65536, 1000000
+ },
+ {
+ "sparse", "100000000000000000000000000000001",
+ 10000000, 1000000
+ },
+ {
+ "single values, distance > 2^32", "1",
+ UINT64CONST(10000000000), 100000
+ },
+ {
+ "clusters, distance > 2^32", "10101010",
+ UINT64CONST(10000000000), 1000000
+ },
+ {
+ "clusters, distance > 2^60", "10101010",
+ UINT64CONST(2000000000000000000),
+ 23 /* can't be much higher than this, or we
+ * overflow uint64 */
+ }
+};
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(test_radixtree);
+
+static void
+test_empty(void)
+{
+ radix_tree *radixtree;
+ Datum dummy;
+
+ radixtree = radix_tree_create(CurrentMemoryContext);
+
+ if (radix_tree_search(radixtree, 0, &dummy))
+ elog(ERROR, "radix_tree_search on empty tree returned true");
+
+ if (radix_tree_search(radixtree, 1, &dummy))
+ elog(ERROR, "radix_tree_search on empty tree returned true");
+
+ if (radix_tree_search(radixtree, PG_UINT64_MAX, &dummy))
+ elog(ERROR, "radix_tree_search on empty tree returned true");
+
+ if (radix_tree_num_entries(radixtree) != 0)
+ elog(ERROR, "radix_tree_num_entries on empty tree return non-zero");
+
+ radix_tree_free(radixtree);
+}
+
+/*
+ * Check if keys from start to end with the shift exist in the tree.
+ */
+static void
+check_search_on_node(radix_tree *radixtree, uint8 shift, int start, int end)
+{
+ for (int i = start; i < end; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ Datum val;
+
+ if (!radix_tree_search(radixtree, key, &val))
+ elog(ERROR, "key 0x" UINT64_HEX_FORMAT " is not found on node-%d",
+ key, end);
+ if (DatumGetUInt64(val) != key)
+ elog(ERROR, "radix_tree_search with key 0x" UINT64_HEX_FORMAT " returns 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ key, DatumGetUInt64(val), key);
+ }
+}
+
+static void
+test_node_types_insert(radix_tree *radixtree, uint8 shift)
+{
+ uint64 num_entries;
+
+ for (int i = 0; i < 256; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ radix_tree_insert(radixtree, key, Int64GetDatum(key), &found);
+
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " found", key);
+
+ for (int j = 0; j < lengthof(radix_tree_node_max_entries); j++)
+ {
+ /*
+ * After filling all slots in each node type, check if the values are
+ * stored properly.
+ */
+ if (i == (radix_tree_node_max_entries[j] - 1))
+ {
+ check_search_on_node(radixtree, shift,
+ (j == 0) ? 0 : radix_tree_node_max_entries[j - 1],
+ radix_tree_node_max_entries[j]);
+ break;
+ }
+ }
+ }
+
+ num_entries = radix_tree_num_entries(radixtree);
+
+ if (num_entries != 256)
+ elog(ERROR,
+ "radix_tree_num_entries returned" UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(256));
+}
+
+static void
+test_node_types_delete(radix_tree *radixtree, uint8 shift)
+{
+ uint64 num_entries;
+
+ for (int i = 0; i < 256; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = radix_tree_delete(radixtree, key);
+
+ if (!found)
+ elog(ERROR, "inserted key 0x" UINT64_HEX_FORMAT " is not found", key);
+ }
+
+ num_entries = radix_tree_num_entries(radixtree);
+
+ /* The tree must be empty */
+ if (num_entries != 0)
+ elog(ERROR,
+ "radix_tree_num_entries returned" UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(256));
+}
+
+/*
+ * Test for inserting and deleting key-value pairs to each node type at the given shift
+ * level.
+ */
+static void
+test_node_types(uint8 shift)
+{
+ radix_tree *radixtree;
+
+ elog(NOTICE, "testing radix tree node types with shift \"%d\"", shift);
+
+ radixtree = radix_tree_create(CurrentMemoryContext);
+
+ /*
+ * Insert and search entries for every node type at the 'shift' level,
+ * then delete all entries to make it empty, and insert and search
+ * entries again.
+ */
+ test_node_types_insert(radixtree, shift);
+ test_node_types_delete(radixtree, shift);
+ test_node_types_insert(radixtree, shift);
+
+ radix_tree_free(radixtree);
+}
+
+/*
+ * Test with a repeating pattern, defined by the 'spec'.
+ */
+static void
+test_pattern(const test_spec *spec)
+{
+ radix_tree *radixtree;
+ radix_tree_iter *iter;
+ MemoryContext radixtree_ctx;
+ TimestampTz starttime;
+ TimestampTz endtime;
+ uint64 n;
+ uint64 last_int;
+ uint64 ndeleted;
+ uint64 nbefore;
+ uint64 nafter;
+ int patternlen;
+ uint64 *pattern_values;
+ uint64 pattern_num_values;
+
+ elog(NOTICE, "testing radix tree with pattern \"%s\"", spec->test_name);
+ if (radix_tree_test_stats)
+ fprintf(stderr, "-----\ntesting radix tree with pattern \"%s\"\n", spec->test_name);
+
+ /* Pre-process the pattern, creating an array of integers from it. */
+ patternlen = strlen(spec->pattern_str);
+ pattern_values = palloc(patternlen * sizeof(uint64));
+ pattern_num_values = 0;
+ for (int i = 0; i < patternlen; i++)
+ {
+ if (spec->pattern_str[i] == '1')
+ pattern_values[pattern_num_values++] = i;
+ }
+
+ /*
+ * Allocate the radix tree.
+ *
+ * Allocate it in a separate memory context, so that we can print its
+ * memory usage easily.
+ */
+ radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
+ "radixtree test",
+ ALLOCSET_SMALL_SIZES);
+ MemoryContextSetIdentifier(radixtree_ctx, spec->test_name);
+ radixtree = radix_tree_create(radixtree_ctx);
+
+ /*
+ * Add values to the set.
+ */
+ starttime = GetCurrentTimestamp();
+
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ uint64 x = 0;
+
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ bool found;
+
+ x = last_int + pattern_values[i];
+
+ radix_tree_insert(radixtree, x, Int64GetDatum(x), &found);
+
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " found", x);
+
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+
+ endtime = GetCurrentTimestamp();
+
+ if (radix_tree_test_stats)
+ fprintf(stderr, "added " UINT64_FORMAT " values in %d ms\n",
+ spec->num_values, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Print stats on the amount of memory used.
+ *
+ * We print the usage reported by radix_tree_memory_usage(), as well as the
+ * stats from the memory context. They should be in the same ballpark,
+ * but it's hard to automate testing that, so if you're making changes to
+ * the implementation, just observe that manually.
+ */
+ if (radix_tree_test_stats)
+ {
+ uint64 mem_usage;
+
+ /*
+ * Also print memory usage as reported by radix_tree_memory_usage(). It
+ * should be in the same ballpark as the usage reported by
+ * MemoryContextStats().
+ */
+ mem_usage = radix_tree_memory_usage(radixtree);
+ fprintf(stderr, "radix_tree_memory_usage() reported " UINT64_FORMAT " (%0.2f bytes / integer)\n",
+ mem_usage, (double) mem_usage / spec->num_values);
+
+ MemoryContextStats(radixtree_ctx);
+ }
+
+ /* Check that radix_tree_num_entries works */
+ n = radix_tree_num_entries(radixtree);
+ if (n != spec->num_values)
+ elog(ERROR, "radix_tree_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT, n, spec->num_values);
+
+ /*
+ * Test random-access probes with radix_tree_search()
+ */
+ starttime = GetCurrentTimestamp();
+
+ for (n = 0; n < 100000; n++)
+ {
+ bool found;
+ bool expected;
+ uint64 x;
+ Datum v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Do we expect this value to be present in the set? */
+ if (x >= last_int)
+ expected = false;
+ else
+ {
+ uint64 idx = x % spec->spacing;
+
+ if (idx >= patternlen)
+ expected = false;
+ else if (spec->pattern_str[idx] == '1')
+ expected = true;
+ else
+ expected = false;
+ }
+
+ /* Is it present according to radix_tree_search() ? */
+ found = radix_tree_search(radixtree, x, &v);
+
+ if (found != expected)
+ elog(ERROR, "mismatch at 0x" UINT64_HEX_FORMAT ": %d vs %d", x, found, expected);
+ if (found && (DatumGetUInt64(v) != x))
+ elog(ERROR, "found 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ DatumGetUInt64(v), x);
+ }
+ endtime = GetCurrentTimestamp();
+ if (radix_tree_test_stats)
+ fprintf(stderr, "probed " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Test iterator
+ */
+ starttime = GetCurrentTimestamp();
+
+ iter = radix_tree_begin_iterate(radixtree);
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ uint64 expected = last_int + pattern_values[i];
+ uint64 x;
+ Datum val;
+
+ if (!radix_tree_iterate_next(iter, &x, &val))
+ break;
+
+ if (x != expected)
+ elog(ERROR,
+ "iterate returned wrong key; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d", x, expected, i);
+ if (DatumGetUInt64(val) != expected)
+ elog(ERROR,
+ "iterate returned wrong value; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d", x, expected, i);
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+ endtime = GetCurrentTimestamp();
+ if (radix_tree_test_stats)
+ fprintf(stderr, "iterated " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ if (n < spec->num_values)
+ elog(ERROR, "iterator stopped short after " UINT64_FORMAT " entries, expected " UINT64_FORMAT, n, spec->num_values);
+ if (n > spec->num_values)
+ elog(ERROR, "iterator returned " UINT64_FORMAT " entries, " UINT64_FORMAT " was expected", n, spec->num_values);
+
+ /*
+ * Test random-access probes with radix_tree_delete()
+ */
+ starttime = GetCurrentTimestamp();
+
+ nbefore = radix_tree_num_entries(radixtree);
+ ndeleted = 0;
+ for (n = 0; n < 100000; n++)
+ {
+ bool found;
+ uint64 x;
+ Datum v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Is it present according to radix_tree_search() ? */
+ found = radix_tree_search(radixtree, x, &v);
+
+ if (!found)
+ continue;
+
+ /* If the key is found, delete it and check again */
+ if (!radix_tree_delete(radixtree, x))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, x);
+ if (radix_tree_search(radixtree, x, &v))
+ elog(ERROR, "found deleted key 0x" UINT64_HEX_FORMAT, x);
+ if (radix_tree_delete(radixtree, x))
+ elog(ERROR, "deleted already-deleted key 0x" UINT64_HEX_FORMAT, x);
+
+ ndeleted++;
+ }
+ endtime = GetCurrentTimestamp();
+ if (radix_tree_test_stats)
+ fprintf(stderr, "deleted " UINT64_FORMAT " values in %d ms\n",
+ ndeleted, (int) (endtime - starttime) / 1000);
+
+ nafter = radix_tree_num_entries(radixtree);
+
+ /* Check that radix_tree_num_entries works */
+ if ((nbefore - ndeleted) != nafter)
+ elog(ERROR, "radix_tree_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT "after " UINT64_FORMAT " deletion",
+ nafter, (nbefore - ndeleted), ndeleted);
+
+ MemoryContextDelete(radixtree_ctx);
+}
+
+Datum
+test_radixtree(PG_FUNCTION_ARGS)
+{
+ test_empty();
+
+ for (int shift = 0; shift <= (64 - 8); shift += 8)
+ test_node_types(shift);
+
+ /* Test different test patterns, with lots of entries */
+ for (int i = 0; i < lengthof(test_specs); i++)
+ test_pattern(&test_specs[i]);
+
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_radixtree/test_radixtree.control b/src/test/modules/test_radixtree/test_radixtree.control
new file mode 100644
index 0000000000..e53f2a3e0c
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.control
@@ -0,0 +1,4 @@
+comment = 'Test code for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/test_radixtree'
+relocatable = true
On Thu, Jun 16, 2022 at 11:57 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
I've attached an updated version patch that changes the configure
script. I'm still studying how to support AVX2 on msvc build. Also,
added more regression tests.
Thanks for the update, I will take a closer look at the patch in the
near future, possibly next week. For now, though, I'd like to question
why we even need to use 32-byte registers in the first place. For one,
the paper referenced has 16-pointer nodes, but none for 32 (next level
is 48 and uses a different method to find the index of the next
pointer). Andres' prototype has 32-pointer nodes, but in a quick read
of his patch a couple weeks ago I don't recall a reason mentioned for
it. Even if 32-pointer nodes are better from a memory perspective, I
imagine it should be possible to use two SSE2 registers to find the
index. It'd be locally slightly more complex, but not much. It might
not even cost much more in cycles since AVX2 would require indirecting
through a function pointer. It's much more convenient if we don't need
a runtime check. There are also thermal and power disadvantages when
using AVX2 in some workloads. I'm not sure that's the case here, but
if it is, we'd better be getting something in return.
One more thing in general: In an earlier version, I noticed that
Andres used the slab allocator and documented why. The last version of
your patch that I saw had the same allocator, but not the "why".
Especially in early stages of review, we want to document design
decisions so it's more clear for the reader.
--
John Naylor
EDB: http://www.enterprisedb.com
On 2022-06-16 Th 00:56, Masahiko Sawada wrote:
I've attached an updated version patch that changes the configure
script. I'm still studying how to support AVX2 on msvc build. Also,
added more regression tests.
I think you would need to add '/arch:AVX2' to the compiler flags in
MSBuildProject.pm.
See
<https://docs.microsoft.com/en-us/cpp/build/reference/arch-x64?view=msvc-170>
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Hi,
On Thu, Jun 16, 2022 at 4:30 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Thu, Jun 16, 2022 at 11:57 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
I've attached an updated version patch that changes the configure
script. I'm still studying how to support AVX2 on msvc build. Also,
added more regression tests.
Thanks for the update, I will take a closer look at the patch in the
near future, possibly next week.
Thanks!
For now, though, I'd like to question
why we even need to use 32-byte registers in the first place. For one,
the paper referenced has 16-pointer nodes, but none for 32 (next level
is 48 and uses a different method to find the index of the next
pointer). Andres' prototype has 32-pointer nodes, but in a quick read
of his patch a couple weeks ago I don't recall a reason mentioned for
it.
I might be wrong, but since the AVX2 instruction set was introduced with the
Haswell microarchitecture in 2013 and the referenced paper was published in
the same year, ART didn't use the AVX2 instruction set.
32-pointer nodes are better from a memory perspective as you
mentioned. Andres' prototype supports both 16-pointer nodes and
32-pointer nodes (out of 6 node types). This would provide better
memory usage but on the other hand, it would also bring overhead of
switching the node type. Anyway, which node sizes to support is an important
design decision. It should be made based on experiment results and
documented.
Even if 32-pointer nodes are better from a memory perspective, I
imagine it should be possible to use two SSE2 registers to find the
index. It'd be locally slightly more complex, but not much. It might
not even cost much more in cycles since AVX2 would require indirecting
through a function pointer. It's much more convenient if we don't need
a runtime check.
Right.
There are also thermal and power disadvantages when
using AVX2 in some workloads. I'm not sure that's the case here, but
if it is, we'd better be getting something in return.
Good point.
One more thing in general: In an earlier version, I noticed that
Andres used the slab allocator and documented why. The last version of
your patch that I saw had the same allocator, but not the "why".
Especially in early stages of review, we want to document design
decisions so it's more clear for the reader.
Indeed. I'll add comments in the next version patch.
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
On Mon, Jun 20, 2022 at 7:57 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
[v3 patch]
Hi Masahiko,
Since there are new files, and they are pretty large, I've attached
most specific review comments and questions as a diff rather than in
the email body. This is not a full review, which will take more time
-- this is a first pass mostly to aid my understanding, and discuss
some of the design and performance implications.
I tend to think it's a good idea to avoid most cosmetic review until
it's close to commit, but I did mention a couple things that might
enhance readability during review.
As I mentioned to you off-list, I have some thoughts on the nodes using SIMD:
On Thu, Jun 16, 2022 at 4:30 PM John Naylor
<john.naylor@enterprisedb.com> wrote:For now, though, I'd like to question
why we even need to use 32-byte registers in the first place. For one,
the paper referenced has 16-pointer nodes, but none for 32 (next level
is 48 and uses a different method to find the index of the next
pointer). Andres' prototype has 32-pointer nodes, but in a quick read
of his patch a couple weeks ago I don't recall a reason mentioned for
it.
I might be wrong but since AVX2 instruction set is introduced in
Haswell microarchitecture in 2013 and the referenced paper is
published in the same year, the art didn't use AVX2 instruction set.
Sure, but with a bit of work the same technique could be done on that
node size with two 16-byte registers.
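For illustration only (not from the patch), a two-register SSE2 probe of a
32-entry chunk array could look roughly like this, assuming the node stores
its chunks in a plain uint8 array alongside a count of valid entries:

#include "postgres.h"
#include "port/pg_bitutils.h"
#include <emmintrin.h>          /* SSE2 intrinsics */

/* Sketch: return the index of 'chunk' among the first 'count' entries, or -1. */
static inline int
chunk_array_32_search_eq(const uint8 *chunks, int count, uint8 chunk)
{
    __m128i key = _mm_set1_epi8(chunk);
    __m128i cmp_lo = _mm_cmpeq_epi8(key, _mm_loadu_si128((const __m128i *) chunks));
    __m128i cmp_hi = _mm_cmpeq_epi8(key, _mm_loadu_si128((const __m128i *) (chunks + 16)));
    uint32  mask;

    /* Combine the two 16-bit match masks into one 32-bit mask. */
    mask = (uint32) _mm_movemask_epi8(cmp_lo) |
        ((uint32) _mm_movemask_epi8(cmp_hi) << 16);

    /* Ignore any matches beyond the valid entries. */
    if (count < 32)
        mask &= ((uint32) 1 << count) - 1;

    return mask ? pg_rightmost_one_pos32(mask) : -1;
}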
32-pointer nodes are better from a memory perspective as you
mentioned. Andres' prototype supports both 16-pointer nodes and
32-pointer nodes (out of 6 node types). This would provide better
memory usage but on the other hand, it would also bring overhead of
switching the node type.
Right, using more node types provides smaller increments of node size.
Just changing node type can be better or worse, depending on the
input.
Anyway, it's an important design decision to
support which size of node to support. It should be done based on
experiment results and documented.
Agreed. I would add that in the first step, we want something
straightforward to read and easy to integrate into our codebase. I
suspect other optimizations would be worth a lot more than using AVX2:
- collapsing inner nodes
- taking care when constructing the key (more on this when we
integrate with VACUUM)
...and a couple Andres mentioned:
- memory management: in
/messages/by-id/20210717194333.mr5io3zup3kxahfm@alap3.anarazel.de
- node dispatch:
/messages/by-id/20210728184139.qhvx6nbwdcvo63m6@alap3.anarazel.de
Therefore, I would suggest that we use SSE2 only, because:
- portability is very easy
- to avoid a performance hit from indirecting through a function pointer
When the PG16 cycle opens, I will work separately on ensuring the
portability of using SSE2, so you can focus on other aspects. I think
it would be a good idea to have both node16 and node32 for testing.
During benchmarking we can delete one or the other and play with the
other thresholds a bit.
Ideally, node16 and node32 would have the same code with a different
loop count (1 or 2). More generally, there is too much duplication of
code (noted by Andres in his PoC), and there are many variable names
with the node size embedded. This is a bit tricky to make more
general, so we don't need to try it yet, but ideally we would have
something similar to:
switch (node->kind) // todo: inspect tagged pointer
{
case RADIX_TREE_NODE_KIND_4:
idx = node_search_eq(node, chunk, 4);
do_action(node, idx, 4, ...);
break;
case RADIX_TREE_NODE_KIND_32:
idx = node_search_eq(node, chunk, 32);
do_action(node, idx, 32, ...);
...
}
static pg_alwaysinline void
node_search_eq(radix_tree_node node, uint8 chunk, int16 node_fanout)
{
if (node_fanout <= SIMPLE_LOOP_THRESHOLD)
// do simple loop with (node_simple *) node;
else if (node_fanout <= VECTORIZED_LOOP_THRESHOLD)
// do vectorized loop where available with (node_vec *) node;
...
}
...and let the compiler do loop unrolling and branch removal. Not sure
how difficult this is to do, but something to think about.
Another thought: for non-x86 platforms, the SIMD nodes degenerate to
"simple loop", and looping over up to 32 elements is not great
(although possibly okay). We could do binary search, but that has bad
branch prediction.
--
John Naylor
EDB: http://www.enterprisedb.com
Attachments:
v3-radix-review-diff-20220627.txt (text/plain)
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
index bf87f932fd..2bb04eba86 100644
--- a/src/backend/lib/radixtree.c
+++ b/src/backend/lib/radixtree.c
@@ -16,6 +16,11 @@
*
* The key is a 64-bit unsigned integer and the value is a Datum. Both internal
* nodes and leaf nodes have the identical structure. For internal tree nodes,
+It might be worth mentioning:
+- the paper refers to this technique as "Multi-value leaves"
+- we chose it (I assume) for simplicity and to avoid an additional pointer traversal
+- it is the reason this code currently does not support variable-length keys.
+
* shift > 0, store the pointer to its child node as the value. The leaf nodes,
* shift == 0, also have the Datum value that is specified by the user.
*
@@ -24,6 +29,7 @@
* Interface
* ---------
*
+*_search belongs here too.
* radix_tree_create - Create a new, empty radix tree
* radix_tree_free - Free the radix tree
* radix_tree_insert - Insert a key-value pair
@@ -58,12 +64,18 @@
#include <immintrin.h> /* AVX2 intrinsics */
#endif
+// The name prefixes are a bit long, to shorten, maybe s/radix_tree_/rt_/ ?
+// ...and same for capitalized macros -> RT_
+
/* The number of bits encoded in one tree level */
+// terminology: this is not fanout, it's "span" -- ART has variable fanout (the different node types)
+// maybe BITS_PER_BYTE since the entire code assumes that chunks are byte-addressable
#define RADIX_TREE_NODE_FANOUT 8
/* The number of maximum slots in the node, used in node-256 */
#define RADIX_TREE_NODE_MAX_SLOTS (1 << RADIX_TREE_NODE_FANOUT)
+// maybe call them "nodes indexed by array lookups" -- the actual size is unimportant and could change
/*
* Return the number of bits required to represent nslots slots, used
* in node-128 and node-256.
@@ -84,7 +96,9 @@
((uint8) (((key) >> (shift)) & RADIX_TREE_CHUNK_MASK))
/* Mapping from the value to the bit in is-set bitmap in the node-128 and node-256 */
+// these macros assume we're addressing bytes, so maybe BITS_PER_BYTE instead of span (here referred to as fanout)?
#define NODE_BITMAP_BYTE(v) ((v) / RADIX_TREE_NODE_FANOUT)
+// Should this be UINT64CONST?
#define NODE_BITMAP_BIT(v) (UINT64_C(1) << ((v) % RADIX_TREE_NODE_FANOUT))
/* Enum used radix_tree_node_search() */
@@ -132,6 +146,7 @@ typedef struct radix_tree_node
} radix_tree_node;
/* Macros for radix tree nodes */
+// not sure why are we doing casts here?
#define IS_LEAF_NODE(n) (((radix_tree_node *) (n))->shift == 0)
#define IS_EMPTY_NODE(n) (((radix_tree_node *) (n))->count == 0)
#define NODE_HAS_FREE_SLOT(n) \
@@ -161,11 +176,14 @@ typedef struct radix_tree_node_32
Datum slots[32];
} radix_tree_node_32;
+// unnecessary symbol
#define RADIX_TREE_NODE_128_BITS RADIX_TREE_NODE_NSLOTS_BITS(128)
typedef struct radix_tree_node_128
{
radix_tree_node n;
+// maybe use 0xFF for INVALID_IDX ? then we can use 0-indexing
+// and if we do that, do we need isset? on creation, we can just memset slot_idx to INVALID_IDX
/*
* The index of slots for each fanout. 0 means unused whereas slots is
* 0-indexed. So we can get the slot of the chunk C by slots[C] - 1.
@@ -178,6 +196,7 @@ typedef struct radix_tree_node_128
Datum slots[128];
} radix_tree_node_128;
+// unnecessary symbol
#define RADIX_TREE_NODE_MAX_BITS RADIX_TREE_NODE_NSLOTS_BITS(RADIX_TREE_NODE_MAX_SLOTS)
typedef struct radix_tree_node_256
{
@@ -205,6 +224,7 @@ static radix_tree_node_info_elem radix_tree_node_info[] =
{"radix tree node 256", 256, sizeof(radix_tree_node_256)},
};
+// this comment is about a data structure, but talks about code somewhere else
/*
* As we descend a radix tree, we push the node to the stack. The stack is used
* at deletion.
@@ -262,6 +282,7 @@ struct radix_tree
static radix_tree_node *radix_tree_node_grow(radix_tree *tree, radix_tree_node *parent,
radix_tree_node *node, uint64 key);
+// maybe _node_find_child or _get_child because "search child" implies to me that we're searching within the child.
static bool radix_tree_node_search_child(radix_tree_node *node, radix_tree_node **child_p,
uint64 key);
static bool radix_tree_node_search(radix_tree_node *node, Datum **slot_p, uint64 key,
@@ -289,14 +310,19 @@ static void radix_tree_verify_node(radix_tree_node *node);
static inline int
node_32_search_eq(radix_tree_node_32 *node, uint8 chunk)
{
+// If we use SSE intrinsics on Windows, this code might be still be slow (see below),
+// so also guard with HAVE__BUILTIN_CTZ
#ifdef __AVX2__
__m256i _key = _mm256_set1_epi8(chunk);
__m256i _data = _mm256_loadu_si256((__m256i_u *) node->chunks);
__m256i _cmp = _mm256_cmpeq_epi8(_key, _data);
uint32 bitfield = _mm256_movemask_epi8(_cmp);
+// bitfield is uint32, so we don't need UINT64_C
bitfield &= ((UINT64_C(1) << node->n.count) - 1);
+// To make this portable, should be pg_rightmost_one_pos32().
+// Future TODO: This is slow on Windows, until will need to add the correct interfaces to pg_bitutils.h.
return (bitfield) ? __builtin_ctz(bitfield) : -1;
#else
@@ -313,6 +339,7 @@ node_32_search_eq(radix_tree_node_32 *node, uint8 chunk)
#endif /* __AVX2__ */
}
+// copy-paste error: search_chunk_array_16_eq
/*
* This is a bit more complicated than search_chunk_array_16_eq(), because
* until recently no unsigned uint8 comparison instruction existed on x86. So
@@ -346,6 +373,7 @@ node_32_search_le(radix_tree_node_32 *node, uint8 chunk)
#endif /* __AVX2__ */
}
+// see 0xFF idea above
/* Does the given chunk in the node has the value? */
static inline bool
node_128_is_chunk_used(radix_tree_node_128 *node, uint8 chunk)
@@ -367,6 +395,8 @@ node_128_set(radix_tree_node_128 *node, uint8 chunk, Datum val)
int slotpos = 0;
/* Search an unused slot */
+ // this could be slow - maybe iterate over the bytes and if the byte < 0xFF then check each bit
+ //
while (node_128_is_slot_used(node, slotpos))
slotpos++;
@@ -516,6 +546,7 @@ radix_tree_extend(radix_tree *tree, uint64 key)
max_shift = key_get_shift(key);
+ // why do we need the "max height" and not just one more?
/* Grow tree from 'shift' to 'max_shift' */
while (shift <= max_shift)
{
@@ -752,6 +783,7 @@ radix_tree_node_insert_val(radix_tree *tree, radix_tree_node *parent,
memmove(&(n4->chunks[idx + 1]), &(n4->chunks[idx]),
sizeof(uint8) * (n4->n.count - idx));
memmove(&(n4->slots[idx + 1]), &(n4->slots[idx]),
+ // sizeof(Datum) ?
sizeof(radix_tree_node *) * (n4->n.count - idx));
}
Another thought: for non-x86 platforms, the SIMD nodes degenerate to
"simple loop", and looping over up to 32 elements is not great
(although possibly okay). We could do binary search, but that has bad
branch prediction.
I am not sure that for relevant non-x86 platforms SIMD / vector
instructions would not be used (though it would be a good idea to
verify)
Do you know any modern platforms that do not have SIMD ?
I would definitely test before assuming binary search is better.
Often other approaches like counting search over such small vectors is
much better when the vector fits in cache (or even a cache line) and
you always visit all items as this will completely avoid branch
predictions and allows compiler to vectorize and / or unroll the loop
as needed.
Cheers
Hannu
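To make the counting-search idea above concrete, here is an illustrative
sketch (not code from the patch), assuming the node keeps its chunks sorted
the way the node-4/node-32 insertion code does:

#include "postgres.h"

/*
 * Sketch: branch-free search of a small sorted chunk array.  Every element
 * is visited, so there is no data-dependent branch to mispredict and the
 * loop is easy for the compiler to unroll or vectorize.
 */
static inline int
chunk_array_count_search(const uint8 *chunks, int count, uint8 chunk)
{
    int index = 0;

    for (int i = 0; i < count; i++)
        index += (chunks[i] < chunk);   /* an add, not a branch */

    if (index < count && chunks[index] == chunk)
        return index;
    return -1;
}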
Hi,
On 2022-06-27 18:12:13 +0700, John Naylor wrote:
Another thought: for non-x86 platforms, the SIMD nodes degenerate to
"simple loop", and looping over up to 32 elements is not great
(although possibly okay). We could do binary search, but that has bad
branch prediction.
I'd be quite quite surprised if binary search were cheaper. Particularly on
less fancy platforms.
- Andres
On Mon, Jun 27, 2022 at 10:23 PM Hannu Krosing <hannuk@google.com> wrote:
Another thought: for non-x86 platforms, the SIMD nodes degenerate to
"simple loop", and looping over up to 32 elements is not great
(although possibly okay). We could do binary search, but that has bad
branch prediction.
I am not sure that for relevant non-x86 platforms SIMD / vector
instructions would not be used (though it would be a good idea to
verify)
By that logic, we can also dispense with intrinsics on x86 because the
compiler will autovectorize there too (if I understand your claim
correctly). I'm not quite convinced of that in this case.
I would definitely test before assuming binary search is better.
I wasn't very clear in my language, but I did reject binary search as
having bad branch prediction.
--
John Naylor
EDB: http://www.enterprisedb.com
Hi,
On 2022-06-28 11:17:42 +0700, John Naylor wrote:
On Mon, Jun 27, 2022 at 10:23 PM Hannu Krosing <hannuk@google.com> wrote:
Another thought: for non-x86 platforms, the SIMD nodes degenerate to
"simple loop", and looping over up to 32 elements is not great
(although possibly okay). We could do binary search, but that has bad
branch prediction.I am not sure that for relevant non-x86 platforms SIMD / vector
instructions would not be used (though it would be a good idea to
verify)By that logic, we can also dispense with intrinsics on x86 because the
compiler will autovectorize there too (if I understand your claim
correctly). I'm not quite convinced of that in this case.
Last time I checked (maybe a year ago?) none of the popular compilers could
autovectorize that code pattern.
Greetings,
Andres Freund
Hi,
On Mon, Jun 27, 2022 at 8:12 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Mon, Jun 20, 2022 at 7:57 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
[v3 patch]
Hi Masahiko,
Since there are new files, and they are pretty large, I've attached
most specific review comments and questions as a diff rather than in
the email body. This is not a full review, which will take more time
-- this is a first pass mostly to aid my understanding, and discuss
some of the design and performance implications.
I tend to think it's a good idea to avoid most cosmetic review until
it's close to commit, but I did mention a couple things that might
enhance readability during review.
Thank you for reviewing the patch!
As I mentioned to you off-list, I have some thoughts on the nodes using SIMD:
On Thu, Jun 16, 2022 at 4:30 PM John Naylor
<john.naylor@enterprisedb.com> wrote:For now, though, I'd like to question
why we even need to use 32-byte registers in the first place. For one,
the paper referenced has 16-pointer nodes, but none for 32 (next level
is 48 and uses a different method to find the index of the next
pointer). Andres' prototype has 32-pointer nodes, but in a quick read
of his patch a couple weeks ago I don't recall a reason mentioned for
it.
I might be wrong but since AVX2 instruction set is introduced in
Haswell microarchitecture in 2013 and the referenced paper is
published in the same year, the art didn't use AVX2 instruction set.
Sure, but with a bit of work the same technique could be done on that
node size with two 16-byte registers.
32-pointer nodes are better from a memory perspective as you
mentioned. Andres' prototype supports both 16-pointer nodes and
32-pointer nodes (out of 6 node types). This would provide better
memory usage but on the other hand, it would also bring overhead of
switching the node type.
Right, using more node types provides smaller increments of node size.
Just changing node type can be better or worse, depending on the
input.
Anyway, it's an important design decision to
support which size of node to support. It should be done based on
experiment results and documented.
Agreed. I would add that in the first step, we want something
straightforward to read and easy to integrate into our codebase.
Agreed.
I
suspect other optimizations would be worth a lot more than using AVX2:
- collapsing inner nodes
- taking care when constructing the key (more on this when we
integrate with VACUUM)
...and a couple Andres mentioned:
- memory management: in
/messages/by-id/20210717194333.mr5io3zup3kxahfm@alap3.anarazel.de
- node dispatch:
/messages/by-id/20210728184139.qhvx6nbwdcvo63m6@alap3.anarazel.de
Therefore, I would suggest that we use SSE2 only, because:
- portability is very easy
- to avoid a performance hit from indirecting through a function pointer
Okay, I'll try these optimizations and see if the performance becomes better.
When the PG16 cycle opens, I will work separately on ensuring the
portability of using SSE2, so you can focus on other aspects.
Thanks!
I think it would be a good idea to have both node16 and node32 for testing.
During benchmarking we can delete one or the other and play with the
other thresholds a bit.
I've done benchmark tests while changing the node types. The code base
is v3 patch that doesn't have the optimization you mentioned below
(memory management and node dispatch) but I added the code to use SSE2
for node-16 and node-32. The 'name' in the below result indicates the
kind of instruction set (AVX2 or SSE2) and the node type used. For
instance, sse2_4_32_48_256 means the radix tree has four kinds of node
types for each which have 4, 32, 48, and 256 pointers, respectively,
and use SSE2 instruction set.
* Case1 - Dense (simulating the case where there are 1000 consecutive
pages each of which has 100 dead tuples, at 100 page intervals.)
select prepare(
1000000, -- max block
100, -- # of dead tuples per page
1, -- dead tuples interval within a page
1000, -- # of consecutive pages having dead tuples
1100 -- page interval
);
name                   size     attach      lookup
avx2_4_32_128_256      1154 MB  6742.53 ms  47765.63 ms
avx2_4_32_48_256       1839 MB  4239.35 ms  40528.39 ms
sse2_4_16_128_256      1154 MB  6994.43 ms  40383.85 ms
sse2_4_16_32_128_256   1154 MB  7239.35 ms  43542.39 ms
sse2_4_16_48_256       1839 MB  4404.63 ms  36048.96 ms
sse2_4_32_128_256      1154 MB  6688.50 ms  44902.64 ms
* Case2 - Sparse (simulating a case where there are pages that have 2
dead tuples every 1000 pages.)
select prepare(
10000000, -- max block
2, -- # of dead tuples per page
50, -- dead tuples interval within a page
1, -- # of consecutive pages having dead tuples
1000 -- page interval
);
name                   size     attach   lookup
avx2_4_32_128_256      1535 kB  1.85 ms  17427.42 ms
avx2_4_32_48_256       1472 kB  2.01 ms  22176.75 ms
sse2_4_16_128_256      1582 kB  2.16 ms  15391.12 ms
sse2_4_16_32_128_256   1535 kB  2.14 ms  18757.86 ms
sse2_4_16_48_256       1489 kB  1.91 ms  19210.39 ms
sse2_4_32_128_256      1535 kB  2.05 ms  17777.55 ms
The statistics of the number of each node types are:
* avx2_4_32_128_256 (dense and sparse)
* nkeys = 90910000, height = 3, n4 = 0, n32 = 285, n128 = 916629, n256 = 31
* nkeys = 20000, height = 3, n4 = 20000, n32 = 48, n128 = 208, n256 = 1
* avx2_4_32_48_256
* nkeys = 90910000, height = 3, n4 = 0, n32 = 285, n48 = 227, n256 = 916433
* nkeys = 20000, height = 3, n4 = 20000, n32 = 48, n48 = 159, n256 = 50
* sse2_4_16_128_256
* nkeys = 90910000, height = 3, n4 = 0, n16 = 0, n128 = 916914, n256 = 31
* nkeys = 20000, height = 3, n4 = 20000, n16 = 0, n128 = 256, n256 = 1
* sse2_4_16_32_128_256
* nkeys = 90910000, height = 3, n4 = 0, n16 = 0, n32 = 285, n128 =
916629, n256 = 31
* nkeys = 20000, height = 3, n4 = 20000, n16 = 0, n32 = 48, n128 =
208, n256 = 1
* sse2_4_16_48_256
* nkeys = 90910000, height = 3, n4 = 0, n16 = 0, n48 = 512, n256 = 916433
* nkeys = 20000, height = 3, n4 = 20000, n16 = 0, n48 = 207, n256 = 50
* sse2_4_32_128_256
* nkeys = 90910000, height = 3, n4 = 0, n32 = 285, n128 = 916629, n256 = 31
* nkeys = 20000, height = 3, n4 = 20000, n32 = 48, n128 = 208, n256 = 1
Observations are:
In both test cases, there is not much difference between using AVX2
and SSE2. The more node types, the more time it takes for loading the
data (see sse2_4_16_32_128_256).
In dense case, since most nodes have around 100 children, the radix
tree that has node-128 had a good figure in terms of memory usage. On
the other hand, the radix tree that doesn't have node-128 has a better
number in terms of insertion performance. This is probably because we
need to iterate over 'isset' flags from the beginning of the array in
order to find an empty slot when inserting new data. We do the same
thing also for node-48 but it was better than node-128 as it's up to
48.
In terms of lookup performance, the results vary but I could not find
any common pattern that makes the performance better or worse. Getting
more statistics such as the number of each node type per tree level
might help me.
Ideally, node16 and node32 would have the same code with a different
loop count (1 or 2). More generally, there is too much duplication of
code (noted by Andres in his PoC), and there are many variable names
with the node size embedded. This is a bit tricky to make more
general, so we don't need to try it yet, but ideally we would have
something similar to:
switch (node->kind) // todo: inspect tagged pointer
{
case RADIX_TREE_NODE_KIND_4:
idx = node_search_eq(node, chunk, 4);
do_action(node, idx, 4, ...);
break;
case RADIX_TREE_NODE_KIND_32:
idx = node_search_eq(node, chunk, 32);
do_action(node, idx, 32, ...);
...
}
static pg_alwaysinline void
node_search_eq(radix_tree_node node, uint8 chunk, int16 node_fanout)
{
if (node_fanout <= SIMPLE_LOOP_THRESHOLD)
// do simple loop with (node_simple *) node;
else if (node_fanout <= VECTORIZED_LOOP_THRESHOLD)
// do vectorized loop where available with (node_vec *) node;
...
}
...and let the compiler do loop unrolling and branch removal. Not sure
how difficult this is to do, but something to think about.
Agreed.
I'll update my patch based on your review comments and use SSE2.
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
On Tue, Jun 28, 2022 at 1:24 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
I
suspect other optimizations would be worth a lot more than using AVX2:
- collapsing inner nodes
- taking care when constructing the key (more on this when we
integrate with VACUUM)
...and a couple Andres mentioned:
- memory management: in
/messages/by-id/20210717194333.mr5io3zup3kxahfm@alap3.anarazel.de
- node dispatch:
/messages/by-id/20210728184139.qhvx6nbwdcvo63m6@alap3.anarazel.de
Therefore, I would suggest that we use SSE2 only, because:
- portability is very easy
- to avoid a performance hit from indirecting through a function pointer
Okay, I'll try these optimizations and see if the performance becomes better.
FWIW, I think it's fine if we delay these until after committing a
good-enough version. The exception is key construction and I think
that deserves some attention now (more on this below).
I've done benchmark tests while changing the node types. The code base
is v3 patch that doesn't have the optimization you mentioned below
(memory management and node dispatch) but I added the code to use SSE2
for node-16 and node-32.
Great, this is helpful to visualize what's going on!
* sse2_4_16_48_256
* nkeys = 90910000, height = 3, n4 = 0, n16 = 0, n48 = 512, n256 = 916433
* nkeys = 20000, height = 3, n4 = 20000, n16 = 0, n48 = 207, n256 = 50
* sse2_4_32_128_256
* nkeys = 90910000, height = 3, n4 = 0, n32 = 285, n128 = 916629, n256 = 31
* nkeys = 20000, height = 3, n4 = 20000, n32 = 48, n128 = 208, n256 = 1
Observations are:
In both test cases, There is not much difference between using AVX2
and SSE2. The more mode types, the more time it takes for loading the
data (see sse2_4_16_32_128_256).
Good to know. And as Andres mentioned in his PoC, more node types
would be a barrier for pointer tagging, since 32-bit platforms only
have two spare bits in the pointer.
In dense case, since most nodes have around 100 children, the radix
tree that has node-128 had a good figure in terms of memory usage. On
Looking at the node stats, and then your benchmark code, I think key
construction is a major influence, maybe more than node type. The
key/value scheme tested now makes sense:
blockhi || blocklo || 9 bits of item offset
(with the leaf nodes containing a bit map of the lowest few bits of
this whole thing)
We want the lower fanout nodes at the top of the tree and higher
fanout ones at the bottom.
Note some consequences: If the table has enough columns such that much
fewer than 100 tuples fit on a page (maybe 30 or 40), then in the
dense case the nodes above the leaves will have lower fanout (maybe
they will fit in a node32). Also, the bitmap values in the leaves will
be more empty. In other words, many tables in the wild *resemble* the
sparse case a bit, even if truly all tuples on the page are dead.
Note also that the dense case in the benchmark above has ~4500 times
more keys than the sparse case, and uses about ~1000 times more
memory. But the runtime is only 2-3 times longer. That's interesting
to me.
To optimize for the sparse case, it seems to me that the key/value would be
blockhi || 9 bits of item offset || blocklo
I believe that would make the leaf nodes more dense, with fewer inner
nodes, and could drastically speed up the sparse case, and maybe many
realistic dense cases. I'm curious to hear your thoughts.
the other hand, the radix tree that doesn't have node-128 has a better
number in terms of insertion performance. This is probably because we
need to iterate over 'isset' flags from the beginning of the array in
order to find an empty slot when inserting new data. We do the same
thing also for node-48 but it was better than node-128 as it's up to
48.
I mentioned in my diff, but for those following along, I think we can
improve that by iterating over the bytes and if it's 0xFF all 8 bits
are set already so keep looking...
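Roughly like this, for illustration (the 16-byte isset bitmap of a node-128
is an assumption about the layout, not code from the patch):

/* Sketch: find a free slot, skipping bitmap bytes that are already full. */
static inline int
node_128_find_free_slot(const uint8 *isset)
{
    for (int byte = 0; byte < 128 / 8; byte++)
    {
        if (isset[byte] == 0xFF)
            continue;           /* all 8 slots covered by this byte are taken */

        for (int bit = 0; bit < 8; bit++)
        {
            if ((isset[byte] & (1 << bit)) == 0)
                return byte * 8 + bit;
        }
    }

    return -1;                  /* node is full; caller must grow it */
}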
In terms of lookup performance, the results vary but I could not find
any common pattern that makes the performance better or worse. Getting
more statistics such as the number of each node type per tree level
might help me.
I think that's a sign that the choice of node types might not be
terribly important for these two cases. That's good if that's true in
general -- a future performance-critical use of this code might tweak
things for itself without upsetting vacuum.
--
John Naylor
EDB: http://www.enterprisedb.com
On Tue, Jun 28, 2022 at 10:10 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Tue, Jun 28, 2022 at 1:24 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
I
suspect other optimizations would be worth a lot more than using AVX2:
- collapsing inner nodes
- taking care when constructing the key (more on this when we
integrate with VACUUM)
...and a couple Andres mentioned:
- memory management: in
/messages/by-id/20210717194333.mr5io3zup3kxahfm@alap3.anarazel.de
- node dispatch:
/messages/by-id/20210728184139.qhvx6nbwdcvo63m6@alap3.anarazel.de
Therefore, I would suggest that we use SSE2 only, because:
- portability is very easy
- to avoid a performance hit from indirecting through a function pointer
Okay, I'll try these optimizations and see if the performance becomes better.
FWIW, I think it's fine if we delay these until after committing a
good-enough version. The exception is key construction and I think
that deserves some attention now (more on this below).
Agreed.
I've done benchmark tests while changing the node types. The code base
is v3 patch that doesn't have the optimization you mentioned below
(memory management and node dispatch) but I added the code to use SSE2
for node-16 and node-32.
Great, this is helpful to visualize what's going on!
* sse2_4_16_48_256
* nkeys = 90910000, height = 3, n4 = 0, n16 = 0, n48 = 512, n256 = 916433
* nkeys = 20000, height = 3, n4 = 20000, n16 = 0, n48 = 207, n256 = 50
* sse2_4_32_128_256
* nkeys = 90910000, height = 3, n4 = 0, n32 = 285, n128 = 916629, n256 = 31
* nkeys = 20000, height = 3, n4 = 20000, n32 = 48, n128 = 208, n256 = 1
Observations are:
In both test cases, There is not much difference between using AVX2
and SSE2. The more mode types, the more time it takes for loading the
data (see sse2_4_16_32_128_256).
Good to know. And as Andres mentioned in his PoC, more node types
would be a barrier for pointer tagging, since 32-bit platforms only
have two spare bits in the pointer.
In dense case, since most nodes have around 100 children, the radix
tree that has node-128 had a good figure in terms of memory usage. On
Looking at the node stats, and then your benchmark code, I think key
construction is a major influence, maybe more than node type. The
key/value scheme tested now makes sense:
blockhi || blocklo || 9 bits of item offset
(with the leaf nodes containing a bit map of the lowest few bits of
this whole thing)
We want the lower fanout nodes at the top of the tree and higher
fanout ones at the bottom.
So more inner nodes can fit in CPU cache, right?
Note some consequences: If the table has enough columns such that much
fewer than 100 tuples fit on a page (maybe 30 or 40), then in the
dense case the nodes above the leaves will have lower fanout (maybe
they will fit in a node32). Also, the bitmap values in the leaves will
be more empty. In other words, many tables in the wild *resemble* the
sparse case a bit, even if truly all tuples on the page are dead.
Note also that the dense case in the benchmark above has ~4500 times
more keys than the sparse case, and uses about ~1000 times more
memory. But the runtime is only 2-3 times longer. That's interesting
to me.
To optimize for the sparse case, it seems to me that the key/value would be
blockhi || 9 bits of item offset || blocklo
I believe that would make the leaf nodes more dense, with fewer inner
nodes, and could drastically speed up the sparse case, and maybe many
realistic dense cases.
Does it have an effect on the number of inner nodes?
I'm curious to hear your thoughts.
Thank you for your analysis. It's worth trying. We use 9 bits for item
offset but most pages don't use all bits in practice. So probably it
might be better to move the most significant bit of item offset to the
left of blockhi. Or more simply:
9 bits of item offset || blockhi || blocklo
the other hand, the radix tree that doesn't have node-128 has a better
number in terms of insertion performance. This is probably because we
need to iterate over 'isset' flags from the beginning of the array in
order to find an empty slot when inserting new data. We do the same
thing also for node-48 but it was better than node-128 as it's up to
48.
I mentioned in my diff, but for those following along, I think we can
improve that by iterating over the bytes and if it's 0xFF all 8 bits
are set already so keep looking...
Right. Using 0xFF also makes the code readable so I'll change that.
In terms of lookup performance, the results vary but I could not find
any common pattern that makes the performance better or worse. Getting
more statistics such as the number of each node type per tree level
might help me.
I think that's a sign that the choice of node types might not be
terribly important for these two cases. That's good if that's true in
general -- a future performance-critical use of this code might tweak
things for itself without upsetting vacuum.
Agreed.
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
Hi,
I just noticed that I had a reply forgotten in drafts...
On 2022-05-10 10:51:46 +0900, Masahiko Sawada wrote:
To move this project forward, I've implemented radix tree
implementation from scratch while studying Andres's implementation. It
supports insertion, search, and iteration but not deletion yet. In my
implementation, I use Datum as the value so internal and leaf nodes
have the same data structure, simplifying the implementation. The
iteration on the radix tree returns keys with the value in ascending
order of the key. The patch has regression tests for radix tree but is
still in PoC state: left many debugging codes, not supported SSE2 SIMD
instructions, added -mavx2 flag is hard-coded.
Very cool - thanks for picking this up.
Greetings,
Andres Freund
Hi,
On 2022-06-16 13:56:55 +0900, Masahiko Sawada wrote:
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
new file mode 100644
index 0000000000..bf87f932fd
--- /dev/null
+++ b/src/backend/lib/radixtree.c
@@ -0,0 +1,1763 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.c
+ * Implementation for adaptive radix tree.
+ *
+ * This module employs the idea from the paper "The Adaptive Radix Tree: ARTful
+ * Indexing for Main-Memory Databases" by Viktor Leis, Alfons Kemper, and Thomas
+ * Neumann, 2013.
+ *
+ * There are some differences from the proposed implementation. For instance,
+ * this radix tree module utilizes AVX2 instruction, enabling us to use 256-bit
+ * width SIMD vector, whereas 128-bit width SIMD vector is used in the paper.
+ * Also, there is no support for path compression and lazy path expansion. The
+ * radix tree supports fixed length of the key so we don't expect the tree level
+ * wouldn't be high.
I think we're going to need path compression at some point, fwiw. I'd bet on
it being beneficial even for the tid case.
+ * The key is a 64-bit unsigned integer and the value is a Datum.
I don't think it's a good idea to define the value type to be a datum.
+/*
+ * As we descend a radix tree, we push the node to the stack. The stack is used
+ * at deletion.
+ */
+typedef struct radix_tree_stack_data
+{
+    radix_tree_node *node;
+    struct radix_tree_stack_data *parent;
+} radix_tree_stack_data;
+typedef radix_tree_stack_data *radix_tree_stack;
I think it's a very bad idea for traversal to need allocations. I really want
to eventually use this for shared structures (eventually with lock-free
searches at least), and needing to do allocations while traversing the tree is
a no-go for that.
Particularly given that the tree currently has a fixed depth, can't you just
allocate this on the stack once?
+/*
+ * Allocate a new node with the given node kind.
+ */
+static radix_tree_node *
+radix_tree_alloc_node(radix_tree *tree, radix_tree_node_kind kind)
+{
+    radix_tree_node *newnode;
+
+    newnode = (radix_tree_node *) MemoryContextAllocZero(tree->slabs[kind],
+                                                         radix_tree_node_info[kind].size);
+    newnode->kind = kind;
+
+    /* update the statistics */
+    tree->mem_used += GetMemoryChunkSpace(newnode);
+    tree->cnt[kind]++;
+
+    return newnode;
+}
Why are you tracking the memory usage at this level of detail? It's *much*
cheaper to track memory usage via the memory contexts? Since they're dedicated
for the radix tree, that ought to be sufficient?
+    else if (idx != n4->n.count)
+    {
+        /*
+         * the key needs to be inserted in the middle of the
+         * array, make space for the new key.
+         */
+        memmove(&(n4->chunks[idx + 1]), &(n4->chunks[idx]),
+                sizeof(uint8) * (n4->n.count - idx));
+        memmove(&(n4->slots[idx + 1]), &(n4->slots[idx]),
+                sizeof(radix_tree_node *) * (n4->n.count - idx));
+    }
Maybe we could add a static inline helper for these memmoves? Both because
it's repetitive (for different node types) and because the last time I looked
gcc was generating quite bad code for this. And having to put workarounds into
multiple places is obviously worse than having to do it in one place.
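A minimal sketch of such a helper (names are illustrative, not from the
patch; any compiler-specific workaround would then live in one place):

/* Sketch: make room at 'idx' in parallel chunk and slot arrays. */
static inline void
chunk_array_make_room(uint8 *chunks, Datum *slots, int count, int idx)
{
    memmove(&chunks[idx + 1], &chunks[idx], sizeof(uint8) * (count - idx));
    memmove(&slots[idx + 1], &slots[idx], sizeof(Datum) * (count - idx));
}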
+/*
+ * Insert the key with the val.
+ *
+ * found_p is set to true if the key already present, otherwise false, if
+ * it's not NULL.
+ *
+ * XXX: do we need to support update_if_exists behavior?
+ */
Yes, I think that's needed - hence using bfm_set() instead of insert() in the
prototype.
+void
+radix_tree_insert(radix_tree *tree, uint64 key, Datum val, bool *found_p)
+{
+    int         shift;
+    bool        replaced;
+    radix_tree_node *node;
+    radix_tree_node *parent = tree->root;
+
+    /* Empty tree, create the root */
+    if (!tree->root)
+        radix_tree_new_root(tree, key, val);
+
+    /* Extend the tree if necessary */
+    if (key > tree->max_val)
+        radix_tree_extend(tree, key);
FWIW, the reason I used separate functions for these in the prototype is that
it turns out to generate a lot better code, because it allows non-inlined
function calls to be sibling calls - thereby avoiding the need for a dedicated
stack frame. That's not possible once you need a palloc or such, so splitting
off those call paths into dedicated functions is useful.
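Roughly this shape, as a hypothetical sketch (the helper name and the elided
bodies are illustrative, not taken from the prototype):

static pg_noinline void
radix_tree_insert_extend(radix_tree *tree, uint64 key, Datum val, bool *found_p)
{
	/* ... grow the tree to cover 'key', then finish the insert ... */
}

void
radix_tree_insert(radix_tree *tree, uint64 key, Datum val, bool *found_p)
{
	if (unlikely(key > tree->max_val))
	{
		/* sibling (tail) call: no dedicated stack frame needed here */
		radix_tree_insert_extend(tree, key, val, found_p);
		return;
	}

	/* ... common descend-and-insert path ... */
}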
Greetings,
Andres Freund
Hi,
On 2022-06-28 15:24:11 +0900, Masahiko Sawada wrote:
In both test cases, there is not much difference between using AVX2
and SSE2. The more node types, the more time it takes for loading the
data (see sse2_4_16_32_128_256).
Yea, at some point the compiler starts using a jump table instead of branches,
and that turns out to be a good bit more expensive. And even with branches, it
obviously adds hard to predict branches. IIRC I fought a bit with the compiler
to avoid some of that cost, it's possible that got "lost" in Sawada-san's
patch.
Sawada-san, what led you to discard the 1 and 16 node types? IIRC the 1 node
one is not unimportant until we have path compression.
Right now the node struct sizes are:
4 - 48 bytes
32 - 296 bytes
128 - 1304 bytes
256 - 2088 bytes
I guess radix_tree_node_128->isset is just 16 bytes compared to 1288 other
bytes, but needing that separate isset array somehow is sad :/. I wonder if a
smaller "free index" would do the trick? Point to the element + 1 where we
searched last and start a plain loop there. Particularly in an insert-only
workload that'll always work, and in other cases it'll still often work I
think.
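One possible reading of that, as a rough sketch: 'next_free' would be a new
one-byte field in node-128 (hypothetical, not in the patch), and the bitmap
test below is the same one the patch already performs:

static inline int
node_128_find_free_slot(radix_tree_node_128 *node)
{
	int		slotpos = node->next_free;

	/* the caller must already have checked that a free slot exists */
	while (node->isset[slotpos / 8] & (1 << (slotpos % 8)))
		slotpos = (slotpos + 1) % 128;

	node->next_free = (slotpos + 1) % 128;
	return slotpos;
}

In an insert-only workload the hint always points at a free slot, so the loop
body never executes.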
One thing I was wondering about is trying to choose node types in
roughly-power-of-two struct sizes. It's pretty easy to end up with significant
fragmentation in the slabs right now when inserting as you go, because some of
the smaller node types will be freed but not enough to actually free blocks of
memory. If we instead have ~power-of-two sizes we could just use a single slab
of the max size, and carve out the smaller node types out of that largest
allocation.
Btw, that fragmentation is another reason why I think it's better to track
memory usage via memory contexts, rather than doing so based on
GetMemoryChunkSpace().
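For example, a context-based rt_memory_usage() could simply ask the slab
contexts; this is a sketch using the per-kind contexts from the attached
patch, and it naturally accounts for fragmentation too:

uint64
rt_memory_usage(radix_tree *tree)
{
	Size	total = 0;

	for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
		total += MemoryContextMemAllocated(tree->slabs[i], true);

	return total;
}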
Ideally, node16 and node32 would have the same code with a different
loop count (1 or 2). More generally, there is too much duplication of
code (noted by Andres in his PoC), and there are many variable names
with the node size embedded. This is a bit tricky to make more
general, so we don't need to try it yet, but ideally we would have
something similar to:

switch (node->kind) // todo: inspect tagged pointer
{
case RADIX_TREE_NODE_KIND_4:
idx = node_search_eq(node, chunk, 4);
do_action(node, idx, 4, ...);
break;
case RADIX_TREE_NODE_KIND_32:
idx = node_search_eq(node, chunk, 32);
do_action(node, idx, 32, ...);
...
}
FWIW, that should be doable with an inline function, if you pass it the memory
to the "array" rather than the node directly. Not so sure it's a good idea to
do dispatch between node types / search methods inside the helper, as you
suggest below:
static pg_alwaysinline void
node_search_eq(radix_tree_node node, uint8 chunk, int16 node_fanout)
{
if (node_fanout <= SIMPLE_LOOP_THRESHOLD)
// do simple loop with (node_simple *) node;
else if (node_fanout <= VECTORIZED_LOOP_THRESHOLD)
// do vectorized loop where available with (node_vec *) node;
...
}
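Something along these lines, sketched with illustrative names; the caller's
switch keeps the per-kind (and per-ISA) dispatch while node-4/16/32 share the
loop:

static inline int
chunk_array_search_eq(const uint8 *chunks, int count, uint8 chunk)
{
	for (int i = 0; i < count; i++)
	{
		if (chunks[i] > chunk)
			break;				/* chunks are sorted, no match possible */
		if (chunks[i] == chunk)
			return i;
	}
	return -1;
}

/* e.g. case RT_NODE_KIND_4:
 *          idx = chunk_array_search_eq(n4->chunks, n4->n.count, chunk);
 */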
Greetings,
Andres Freund
On Mon, Jul 4, 2022 at 2:07 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Tue, Jun 28, 2022 at 10:10 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Tue, Jun 28, 2022 at 1:24 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
I
suspect other optimizations would be worth a lot more than using AVX2:
- collapsing inner nodes
- taking care when constructing the key (more on this when we
integrate with VACUUM)
...and a couple Andres mentioned:
- memory management: in
/messages/by-id/20210717194333.mr5io3zup3kxahfm@alap3.anarazel.de
- node dispatch:
/messages/by-id/20210728184139.qhvx6nbwdcvo63m6@alap3.anarazel.de

Therefore, I would suggest that we use SSE2 only, because:
- portability is very easy
- to avoid a performance hit from indirecting through a function pointer

Okay, I'll try these optimizations and see if the performance becomes better.
FWIW, I think it's fine if we delay these until after committing a
good-enough version. The exception is key construction and I think
that deserves some attention now (more on this below).

Agreed.
I've done benchmark tests while changing the node types. The code base
is the v3 patch, which doesn't have the optimizations you mentioned below
(memory management and node dispatch), but I added code to use SSE2
for node-16 and node-32.

Great, this is helpful to visualize what's going on!
* sse2_4_16_48_256
* nkeys = 90910000, height = 3, n4 = 0, n16 = 0, n48 = 512, n256 = 916433
* nkeys = 20000, height = 3, n4 = 20000, n16 = 0, n48 = 207, n256 = 50

* sse2_4_32_128_256
* nkeys = 90910000, height = 3, n4 = 0, n32 = 285, n128 = 916629, n256 = 31
* nkeys = 20000, height = 3, n4 = 20000, n32 = 48, n128 = 208, n256 = 1

Observations are:
In both test cases, there is not much difference between using AVX2
and SSE2. The more node types, the more time it takes for loading the
data (see sse2_4_16_32_128_256).

Good to know. And as Andres mentioned in his PoC, more node types
would be a barrier for pointer tagging, since 32-bit platforms only
have two spare bits in the pointer.

In the dense case, since most nodes have around 100 children, the radix
tree that has node-128 had a good figure in terms of memory usage. On

Looking at the node stats, and then your benchmark code, I think key
construction is a major influence, maybe more than node type. The
key/value scheme tested now makes sense:

blockhi || blocklo || 9 bits of item offset
(with the leaf nodes containing a bit map of the lowest few bits of
this whole thing)

We want the lower fanout nodes at the top of the tree and higher
fanout ones at the bottom.

So more inner nodes can fit in CPU cache, right?
Note some consequences: If the table has enough columns such that much
fewer than 100 tuples fit on a page (maybe 30 or 40), then in the
dense case the nodes above the leaves will have lower fanout (maybe
they will fit in a node32). Also, the bitmap values in the leaves will
be more empty. In other words, many tables in the wild *resemble* the
sparse case a bit, even if truly all tuples on the page are dead.

Note also that the dense case in the benchmark above has ~4500 times
more keys than the sparse case, and uses about ~1000 times more
memory. But the runtime is only 2-3 times longer. That's interesting
to me.

To optimize for the sparse case, it seems to me that the key/value would be
blockhi || 9 bits of item offset || blocklo
I believe that would make the leaf nodes more dense, with fewer inner
nodes, and could drastically speed up the sparse case, and maybe many
realistic dense cases.

Does it have an effect on the number of inner nodes?
I'm curious to hear your thoughts.
Thank you for your analysis. It's worth trying. We use 9 bits for item
offset but most pages don't use all bits in practice. So it might be
better to move the most significant bit of the item offset to the left of
blockhi. Or, more simply:

9 bits of item offset || blockhi || blocklo
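To make the layouts concrete, here is a rough sketch of two of the encodings
being discussed; the helper names and constants are illustrative, not from
the patch:

static inline uint64
tid_to_key_current(BlockNumber blkno, OffsetNumber off)
{
	/* blockhi || blocklo || 9 bits of item offset */
	return ((uint64) blkno << 9) | (off & 0x1FF);
}

static inline uint64
tid_to_key_offset_first(BlockNumber blkno, OffsetNumber off)
{
	/* 9 bits of item offset || blockhi || blocklo */
	return ((uint64) (off & 0x1FF) << 32) | blkno;
}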
the other hand, the radix tree that doesn't have node-128 has a better
number in terms of insertion performance. This is probably because we
need to iterate over 'isset' flags from the beginning of the array in
order to find an empty slot when inserting new data. We do the same
thing also for node-48 but it was better than node-128 as it's up to
48.

I mentioned in my diff, but for those following along, I think we can
improve that by iterating over the bytes and if it's 0xFF all 8 bits
are set already so keep looking...

Right. Using 0xFF also makes the code readable so I'll change that.
In terms of lookup performance, the results vary but I could not find
any common pattern that makes the performance better or worse. Getting
more statistics such as the number of each node type per tree level
might help me.

I think that's a sign that the choice of node types might not be
terribly important for these two cases. That's good if that's true in
general -- a future performance-critical use of this code might tweak
things for itself without upsetting vacuum.

Agreed.
I've attached an updated patch that incorporates the comments from John.
Here are some comments I could not address, and the reasons:
+// bitfield is uint32, so we don't need UINT64_C
bitfield &= ((UINT64_C(1) << node->n.count) - 1);
Since node->n.count could be 32, I think we need to use UINT64CONST() here.
/* Macros for radix tree nodes */
+// not sure why are we doing casts here?
#define IS_LEAF_NODE(n) (((radix_tree_node *) (n))->shift == 0)
#define IS_EMPTY_NODE(n) (((radix_tree_node *) (n))->count == 0)
I've left the casts as I use IS_LEAF_NODE for rt_node_4/16/32/128/256.
Also, I've dropped the configure script support for AVX2, and support
for SSE2 is missing. I'll update it later.
I've not addressed the comments I got from Andres yet, so I'll update
the patch according to the discussion, but the current patch should be
more readable than the previous one thanks to the comments from John.
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
Attachments:
radixtree_wip_v4.patch
diff --git a/src/backend/lib/Makefile b/src/backend/lib/Makefile
index 9dad31398a..ead0755d25 100644
--- a/src/backend/lib/Makefile
+++ b/src/backend/lib/Makefile
@@ -22,6 +22,9 @@ OBJS = \
integerset.o \
knapsack.o \
pairingheap.o \
+ radixtree.o \
rbtree.o \
+radixtree.o: CFLAGS+=-msse2
+
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
new file mode 100644
index 0000000000..f1118679d6
--- /dev/null
+++ b/src/backend/lib/radixtree.c
@@ -0,0 +1,2040 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.c
+ * Implementation for adaptive radix tree.
+ *
+ * This module employs the idea from the paper "The Adaptive Radix Tree: ARTful
+ * Indexing for Main-Memory Databases" by Viktor Leis, Alfons Kemper, and Thomas
+ * Neumann, 2013. The radix tree uses adaptive node sizes, a small number of node
+ * types, each with a different numbers of elements. Depending on the number of
+ * children, the appropriate node type is used.
+ *
+ * There are some differences from the proposed implementation. For instance,
+ * this radix tree module utilizes AVX2 instructions, enabling us to use 256-bit
+ * wide SIMD vectors, whereas 128-bit wide SIMD vectors are used in the paper.
+ * Also, there is no support for path compression and lazy path expansion. The
+ * radix tree supports a fixed-length key, so we don't expect the tree to
+ * become very high.
+ *
+ * The key is a 64-bit unsigned integer and the value is a Datum. Both internal
+ * nodes and leaf nodes have the identical structure. Internal tree nodes
+ * (shift > 0) store pointers to their child nodes as the values. Leaf nodes
+ * (shift == 0) store the Datum values specified by the user. The
+ * paper refers to this technique as "Multi-value leaves". We choose it for
+ * simplicity and to avoid an additional pointer traversal. It is the reason
+ * this code currently does not support variable-length keys.
+ *
+ * XXX: radix tree nodes are never shrunk.
+ *
+ * Interface
+ * ---------
+ *
+ * rt_create - Create a new, empty radix tree
+ * rt_free - Free the radix tree
+ * rt_search - Search a key-value pair
+ * rt_insert - Insert a key-value pair
+ * rt_delete - Delete a key-value pair
+ * rt_begin_iterate - Begin iterating through all key-value pairs
+ * rt_iterate_next - Return next key-value pair, if any
+ * rt_end_iterate - End iteration
+ *
+ * rt_create() creates an empty radix tree in the given memory context, along
+ * with child memory contexts for each kind of radix tree node.
+ *
+ * rt_iterate_next() returns the key-value pairs in ascending order of the
+ * key.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/lib/radixtree.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "miscadmin.h"
+#include "port/pg_bitutils.h"
+#include "utils/memutils.h"
+#include "lib/radixtree.h"
+#include "lib/stringinfo.h"
+
+#if defined(__SSE2__)
+#include <emmintrin.h> /* SSE2 intrinsics */
+#endif
+
+/* The number of bits encoded in one tree level */
+#define RT_NODE_SPAN BITS_PER_BYTE
+
+/* The number of maximum slots in the node */
+#define RT_NODE_MAX_SLOTS (1 << RT_NODE_SPAN)
+
+/*
+ * Return the number of bytes required for an is-set bitmap covering nslots
+ * slots; used by the node kinds that track slot usage with a bitmap.
+ */
+#define RT_NODE_NSLOTS_BITS(nslots) ((nslots) / (sizeof(uint8) * BITS_PER_BYTE))
+
+/* Mask for extracting a chunk from the key */
+#define RT_CHUNK_MASK ((1 << RT_NODE_SPAN) - 1)
+
+/* Maximum shift the radix tree uses */
+#define RT_MAX_SHIFT key_get_shift(UINT64_MAX)
+
+/* Tree level the radix tree uses */
+#define RT_MAX_LEVEL ((sizeof(uint64) * BITS_PER_BYTE) / RT_NODE_SPAN)
+
+/* Get a chunk from the key */
+#define RT_GET_KEY_CHUNK(key, shift) \
+ ((uint8) (((key) >> (shift)) & RT_CHUNK_MASK))
+
+/*
+ * Mapping from the value to the bit in is-set bitmap in the node-128
+ * and node-256.
+ */
+#define RT_NODE_BITMAP_BYTE(v) ((v) / BITS_PER_BYTE)
+#define RT_NODE_BITMAP_BIT(v) (UINT64CONST(1) << ((v) % RT_NODE_SPAN))
+
+/* Enum used by rt_node_search() */
+typedef enum
+{
+ RT_ACTION_FIND = 0, /* find the key-value */
+ RT_ACTION_DELETE, /* delete the key-value */
+} rt_action;
+
+/*
+ * Supported radix tree nodes.
+ *
+ * XXX: These are currently not well chosen. To reduce memory fragmentation
+ * a smaller class should optimally fit neatly into the next larger class
+ * (except perhaps at the lowest end). Right now it's
+ * 48 -> 152 -> 296 -> 1304 -> 2088 bytes for inner/leaf nodes, leading to
+ * large amounts of allocator padding with aset.c. Hence the use of slab.
+ *
+ * XXX: need to explain why we choose these node types based on benchmark
+ * results etc.
+ */
+typedef enum rt_node_kind
+{
+ RT_NODE_KIND_4 = 0,
+ RT_NODE_KIND_16,
+ RT_NODE_KIND_32,
+ RT_NODE_KIND_128,
+ RT_NODE_KIND_256
+} rt_node_kind;
+#define RT_NODE_KIND_COUNT 5
+
+/*
+ * Base type for all nodes types.
+ */
+typedef struct rt_node
+{
+ /*
+ * Number of children. We use uint16 to be able to indicate 256 children
+ * at the fanout of 8.
+ */
+ uint16 count;
+
+ /*
+ * Shift indicates which part of the key space is represented by this
+ * node. That is, the key is shifted by 'shift' and the lowest
+ * RT_NODE_SPAN bits are then represented in chunk.
+ */
+ uint8 shift;
+ uint8 chunk;
+
+ /* Size class of the node */
+ rt_node_kind kind;
+} rt_node;
+
+/* Macros for radix tree nodes */
+#define IS_LEAF_NODE(n) (((rt_node *) (n))->shift == 0)
+#define IS_EMPTY_NODE(n) (((rt_node *) (n))->count == 0)
+#define NODE_HAS_FREE_SLOT(n) \
+ (((rt_node *) (n))->count < rt_node_info[((rt_node *) (n))->kind].max_slots)
+
+/*
+ * To reduce memory usage compared to a simple radix tree with a fixed
+ * fanout, we use adaptive node sizes, with different storage methods
+ * for different numbers of elements.
+ */
+typedef struct rt_node_4
+{
+ rt_node n;
+
+ /* 4 children, for key chunks */
+ uint8 chunks[4];
+ Datum slots[4];
+} rt_node_4;
+
+typedef struct rt_node_16
+{
+ rt_node n;
+
+ /* 16 children, for key chunks */
+ uint8 chunks[16];
+ Datum slots[16];
+} rt_node_16;
+
+typedef struct rt_node_32
+{
+ rt_node n;
+
+ /* 32 children, for key chunks */
+ uint8 chunks[32];
+ Datum slots[32];
+} rt_node_32;
+
+#define RT_NODE_128_INVALID_IDX 0xFF
+typedef struct rt_node_128
+{
+ rt_node n;
+
+ /* The index of slots for each fanout */
+ uint8 slot_idxs[RT_NODE_MAX_SLOTS];
+
+ /*
+ * Slots for 128 children.
+ *
+ * Since the rt_node_xxx node is used by both inner and leaf nodes,
+ * we need to distinguish between a null pointer in inner nodes and
+ * a (Datum) 0 value in leaf nodes. isset is a bitmap to track which
+ * slot is in use.
+ */
+ Datum slots[128];
+ uint8 isset[RT_NODE_NSLOTS_BITS(128)];
+} rt_node_128;
+
+typedef struct rt_node_256
+{
+ rt_node n;
+
+ /*
+ * Slots for 256 children. The isset is a bitmap to track which slot
+ * is in use.
+ */
+ Datum slots[RT_NODE_MAX_SLOTS];
+ uint8 isset[RT_NODE_NSLOTS_BITS(RT_NODE_MAX_SLOTS)];
+} rt_node_256;
+
+/* Information of each size class */
+typedef struct rt_node_info_elem
+{
+ const char *name;
+ int max_slots;
+ Size size;
+} rt_node_info_elem;
+
+static rt_node_info_elem rt_node_info[] =
+{
+ {"radix tree node 4", 4, sizeof(rt_node_4)},
+ {"radix tree node 16", 16, sizeof(rt_node_16)},
+ {"radix tree node 32", 32, sizeof(rt_node_32)},
+ {"radix tree node 128", 128, sizeof(rt_node_128)},
+ {"radix tree node 256", 256, sizeof(rt_node_256)},
+};
+
+/*
+ * The data structure for stacking the radix tree nodes.
+ *
+ * When deleting a key-value pair, we descend the radix tree, pushing the
+ * visited inner nodes. The stack can be freed by using rt_free_stack.
+ */
+typedef struct rt_stack_data
+{
+ rt_node *node;
+ struct rt_stack_data *parent;
+} rt_stack_data;
+typedef rt_stack_data *rt_stack;
+
+/*
+ * Iteration support.
+ *
+ * Iterating the radix tree returns each pair of key and value in the ascending order
+ * of the key. To support this, we iterate over nodes at each level.
+ * rt_iter_node_data struct is used to track the iteration within a node.
+ * rt_iter has the array of this struct, stack, in order to track the iteration
+ * of every level. During the iteration, we also construct the key to return. The key
+ * is updated whenever we update the node iteration information, e.g., when advancing
+ * the current index within the node or when moving to the next node at the same level.
+ */
+typedef struct rt_iter_node_data
+{
+ rt_node *node; /* current node being iterated */
+ int current_idx; /* current position. -1 for initial value */
+} rt_iter_node_data;
+
+struct rt_iter
+{
+ radix_tree *tree;
+
+ /* Track the iteration on nodes of each level */
+ rt_iter_node_data stack[RT_MAX_LEVEL];
+ int stack_len;
+
+ /* The key is being constructed during the iteration */
+ uint64 key;
+};
+
+/* A radix tree with nodes */
+struct radix_tree
+{
+ MemoryContext context;
+
+ rt_node *root;
+ uint64 max_val;
+ uint64 num_keys;
+ MemoryContextData *slabs[RT_NODE_KIND_COUNT];
+
+ /* statistics */
+ uint64 mem_used;
+ int32 cnt[RT_NODE_KIND_COUNT];
+};
+
+static rt_node *rt_node_grow(radix_tree *tree, rt_node *parent,
+ rt_node *node, uint64 key);
+static bool rt_node_find_child(rt_node *node, rt_node **child_p, uint64 key);
+static bool rt_node_search(rt_node *node, Datum **slot_p, uint64 key,
+ rt_action action);
+static void rt_extend(radix_tree *tree, uint64 key);
+static void rt_new_root(radix_tree *tree, uint64 key, Datum val);
+static rt_node *rt_node_insert_child(radix_tree *tree,
+ rt_node *parent,
+ rt_node *node,
+ uint64 key);
+static void rt_node_insert_val(radix_tree *tree, rt_node *parent,
+ rt_node *node, uint64 key, Datum val,
+ bool *replaced_p);
+static inline void rt_iter_update_key(rt_iter *iter, uint8 chunk, uint8 shift);
+static Datum rt_node_iterate_next(rt_iter *iter, rt_iter_node_data *node_iter,
+ bool *found_p);
+static void rt_store_iter_node(rt_iter *iter, rt_iter_node_data *node_iter,
+ rt_node *node);
+static void rt_update_iter_stack(rt_iter *iter, int from);
+static void rt_verify_node(rt_node *node);
+
+/*
+ * Helper functions for accessing each kind of nodes.
+ */
+
+static inline int
+node_16_search_eq(rt_node_16 *node, uint8 chunk)
+{
+/*
+ * On Windows, even if we use SSE intrinsics, pg_rightmost_one_pos32 is slow.
+ * So we guard with HAVE__BUILTIN_CTZ as well.
+ *
+ * XXX: once we have the correct interfaces to pg_bitutils.h for Windows
+ * we can remove the HAVE__BUILTIN_CTZ condition.
+ */
+#if defined(__SSE2__) && defined(HAVE__BUILTIN_CTZ)
+ __m128i key_v = _mm_set1_epi8(chunk);
+ __m128i data_v = _mm_loadu_si128((__m128i_u *) node->chunks);
+ __m128i cmp_v = _mm_cmpeq_epi8(key_v, data_v);
+ uint32 bitfield = _mm_movemask_epi8(cmp_v);
+
+ bitfield &= ((1 << node->n.count) - 1);
+
+ return bitfield ? pg_rightmost_one_pos32(bitfield) : -1;
+#else
+ for (int i = 0; i < node->n.count; i++)
+ {
+ if (node->chunks[i] > chunk)
+ return -1;
+
+ if (node->chunks[i] == chunk)
+ return i;
+ }
+
+ return -1;
+#endif
+}
+
+/*
+ * This is a bit more complicated than node_16_search_eq(), because
+ * until recently no unsigned uint8 comparison instruction existed on x86. So
+ * we need to play some trickery using _mm_min_epu8() to effectively get
+ * <=. There never will be any equal elements in the current uses, but that's
+ * what we get here...
+ */
+static inline int
+node_16_search_le(rt_node_16 *node, uint8 chunk)
+{
+#if defined(__SSE2__) && defined(HAVE__BUILTIN_CTZ)
+ __m128i key_v = _mm_set1_epi8(chunk);
+ __m128i data_v = _mm_loadu_si128((__m128i_u *) node->chunks);
+ __m128i min_v = _mm_min_epu8(data_v, key_v);
+ __m128i cmp_v = _mm_cmpeq_epi8(key_v, min_v);
+ uint32 bitfield = _mm_movemask_epi8(cmp_v);
+
+ bitfield &= ((1 << node->n.count) - 1);
+
+ return (bitfield) ? pg_rightmost_one_pos32(bitfield) : node->n.count;
+#else
+ int index;
+
+ for (index = 0; index < node->n.count; index++)
+ {
+ if (node->chunks[index] >= chunk)
+ break;
+ }
+
+ return index;
+#endif
+}
+
+static inline int
+node_32_search_eq(rt_node_32 *node, uint8 chunk)
+{
+#if defined(__SSE2__) && defined(HAVE__BUILTIN_CTZ)
+ int index = 0;
+ __m128i key_v = _mm_set1_epi8(chunk);
+
+ while (index < node->n.count)
+ {
+ __m128i data_v = _mm_loadu_si128((__m128i_u *) &(node->chunks[index]));
+ __m128i cmp_v = _mm_cmpeq_epi8(key_v, data_v);
+ uint32 bitfield = _mm_movemask_epi8(cmp_v);
+
+ bitfield &= ((UINT64CONST(1) << node->n.count) - 1);
+
+ if (bitfield)
+ {
+ index += pg_rightmost_one_pos32(bitfield);
+ break;
+ }
+
+ index += 16;
+ }
+
+ return (index < node->n.count) ? index : -1;
+#else
+ for (int i = 0; i < node->n.count; i++)
+ {
+ if (node->chunks[i] > chunk)
+ return -1;
+
+ if (node->chunks[i] == chunk)
+ return i;
+ }
+
+ return -1;
+#endif
+}
+
+/*
+ * Similar to node_16_search_le we need to play some trickery using
+ * _mm_min_epu8() to effectively get <=. There never will be any equal elements
+ * in the current uses, but that's what we get here...
+ */
+static inline int
+node_32_search_le(rt_node_32 *node, uint8 chunk)
+{
+#if defined(__SSE2__) && defined(HAVE__BUILTIN_CTZ)
+ int index = 0;
+ bool found = false;
+ __m128i key_v = _mm_set1_epi8(chunk);
+
+ while (index < node->n.count)
+ {
+ __m128i data_v = _mm_loadu_si128((__m128i_u *) &(node->chunks[index]));
+ __m128i min_v = _mm_min_epu8(data_v, key_v);
+ __m128i cmp_v = _mm_cmpeq_epi8(key_v, min_v);
+ uint32 bitfield = _mm_movemask_epi8(cmp_v);
+
+ bitfield &= ((UINT64CONST(1) << node->n.count)-1);
+
+ if (bitfield)
+ {
+ index += pg_rightmost_one_pos32(bitfield);
+ found = true;
+ break;
+ }
+
+ index += 16;
+ }
+
+ return found ? index : node->n.count;
+#else
+ int index;
+
+ for (index = 0; index < node->n.count; index++)
+ {
+ if (node->chunks[index] >= chunk)
+ break;
+ }
+
+ return index;
+#endif
+}
+
+/* Does the given chunk in the node have a value? */
+static inline bool
+node_128_is_chunk_used(rt_node_128 *node, uint8 chunk)
+{
+ return node->slot_idxs[chunk] != RT_NODE_128_INVALID_IDX;
+}
+
+/* Is the slot in the node used? */
+static inline bool
+node_128_is_slot_used(rt_node_128 *node, uint8 slot)
+{
+ return ((node->isset[RT_NODE_BITMAP_BYTE(slot)] & RT_NODE_BITMAP_BIT(slot)) != 0);
+}
+
+/* Set the slot at the corresponding chunk */
+static inline void
+node_128_set(rt_node_128 *node, uint8 chunk, Datum val)
+{
+ int slotpos;
+
+ /*
+ * Find an unused slot. We iterate over the isset bitmap per byte
+ * then check each bit.
+ */
+ for (slotpos = 0; slotpos < RT_NODE_NSLOTS_BITS(128); slotpos++)
+ {
+ if (node->isset[slotpos] < 0xFF)
+ break;
+ }
+ Assert(slotpos < RT_NODE_NSLOTS_BITS(128));
+
+ slotpos *= BITS_PER_BYTE;
+ while (node_128_is_slot_used(node, slotpos))
+ slotpos++;
+
+ node->slot_idxs[chunk] = slotpos;
+ node->slots[slotpos] = val;
+ node->isset[RT_NODE_BITMAP_BYTE(slotpos)] |= RT_NODE_BITMAP_BIT(slotpos);
+}
+
+/* Delete the slot at the corresponding chunk */
+static inline void
+node_128_unset(rt_node_128 *node, uint8 chunk)
+{
+ int slotpos = node->slot_idxs[chunk];
+
+ if (!node_128_is_chunk_used(node, chunk))
+ return;
+
+ node->isset[RT_NODE_BITMAP_BYTE(slotpos)] &= ~(RT_NODE_BITMAP_BIT(slotpos));
+ node->slot_idxs[chunk] = RT_NODE_128_INVALID_IDX;
+}
+
+/* Return the slot data corresponding to the chunk */
+static inline Datum
+node_128_get_chunk_slot(rt_node_128 *node, uint8 chunk)
+{
+ return node->slots[node->slot_idxs[chunk]];
+}
+
+/* Return true if the slot corresponding to the given chunk is in use */
+static inline bool
+node_256_is_chunk_used(rt_node_256 *node, uint8 chunk)
+{
+ return (node->isset[RT_NODE_BITMAP_BYTE(chunk)] & RT_NODE_BITMAP_BIT(chunk)) != 0;
+}
+
+/* Set the slot at the given chunk position */
+static inline void
+node_256_set(rt_node_256 *node, uint8 chunk, Datum slot)
+{
+ node->slots[chunk] = slot;
+ node->isset[RT_NODE_BITMAP_BYTE(chunk)] |= RT_NODE_BITMAP_BIT(chunk);
+}
+
+/* Set the slot at the given chunk position */
+static inline void
+node_256_unset(rt_node_256 *node, uint8 chunk)
+{
+ node->isset[RT_NODE_BITMAP_BYTE(chunk)] &= ~(RT_NODE_BITMAP_BIT(chunk));
+}
+
+/*
+ * Return the shift that suffices to store the given key.
+ */
+inline static int
+key_get_shift(uint64 key)
+{
+ return (key == 0)
+ ? 0
+ : (pg_leftmost_one_pos64(key) / RT_NODE_SPAN) * RT_NODE_SPAN;
+}
+
+/*
+ * Return the max value stored in a node with the given shift.
+ */
+static uint64
+shift_get_max_val(int shift)
+{
+ if (shift == RT_MAX_SHIFT)
+ return UINT64_MAX;
+
+ return (UINT64CONST(1) << (shift + RT_NODE_SPAN)) - 1;
+}
+
+/*
+ * Allocate a new node with the given node kind.
+ */
+static rt_node *
+rt_alloc_node(radix_tree *tree, rt_node_kind kind)
+{
+ rt_node *newnode;
+
+ newnode = (rt_node *) MemoryContextAllocZero(tree->slabs[kind],
+ rt_node_info[kind].size);
+ newnode->kind = kind;
+
+ /* Initialize slot_idxs to invalid values */
+ if (kind == RT_NODE_KIND_128)
+ {
+ rt_node_128 *n128 = (rt_node_128 *) newnode;
+
+ memset(&(n128->slot_idxs), RT_NODE_128_INVALID_IDX,
+ sizeof(n128->slot_idxs));
+ }
+
+ /* update the statistics */
+ tree->mem_used += GetMemoryChunkSpace(newnode);
+ tree->cnt[kind]++;
+
+ return newnode;
+}
+
+/* Free the given node */
+static void
+rt_free_node(radix_tree *tree, rt_node *node)
+{
+ /* If we're deleting the root node, make the tree empty */
+ if (tree->root == node)
+ tree->root = NULL;
+
+ /* update the statistics */
+ tree->mem_used -= GetMemoryChunkSpace(node);
+ tree->cnt[node->kind]--;
+
+ Assert(tree->mem_used >= 0);
+ Assert(tree->cnt[node->kind] >= 0);
+
+ pfree(node);
+}
+
+/* Free a stack made by rt_delete */
+static void
+rt_free_stack(rt_stack stack)
+{
+ rt_stack ostack;
+
+ while (stack != NULL)
+ {
+ ostack = stack;
+ stack = stack->parent;
+ pfree(ostack);
+ }
+}
+
+/* Copy the common fields without the kind */
+static void
+rt_copy_node_common(rt_node *src, rt_node *dst)
+{
+ dst->shift = src->shift;
+ dst->chunk = src->chunk;
+ dst->count = src->count;
+}
+
+/*
+ * The radix tree doesn't have sufficient height. Extend the radix tree so it can
+ * store the key.
+ */
+static void
+rt_extend(radix_tree *tree, uint64 key)
+{
+ int target_shift;
+ int shift = tree->root->shift + RT_NODE_SPAN;
+
+ target_shift = key_get_shift(key);
+
+ /* Grow tree from 'shift' to 'target_shift' */
+ while (shift <= target_shift)
+ {
+ rt_node_4 *node =
+ (rt_node_4 *) rt_alloc_node(tree, RT_NODE_KIND_4);
+
+ node->n.count = 1;
+ node->n.shift = shift;
+ node->chunks[0] = 0;
+ node->slots[0] = PointerGetDatum(tree->root);
+
+ tree->root->chunk = 0;
+ tree->root = (rt_node *) node;
+
+ shift += RT_NODE_SPAN;
+ }
+
+ tree->max_val = shift_get_max_val(target_shift);
+}
+
+/*
+ * Wrapper for rt_node_search to search the pointer to the child node in the
+ * node.
+ *
+ * Return true if the corresponding child is found, otherwise return false. On success,
+ * it sets child_p.
+ */
+static bool
+rt_node_find_child(rt_node *node, rt_node **child_p, uint64 key)
+{
+ bool found = false;
+ Datum *slot_ptr;
+
+ if (rt_node_search(node, &slot_ptr, key, RT_ACTION_FIND))
+ {
+ /* Found the pointer to the child node */
+ found = true;
+ *child_p = (rt_node *) DatumGetPointer(*slot_ptr);
+ }
+
+ return found;
+}
+
+/*
+ * Return true if the corresponding slot is used, otherwise return false. On success,
+ * sets the pointer to the slot to slot_p.
+ */
+static bool
+rt_node_search(rt_node *node, Datum **slot_p, uint64 key,
+ rt_action action)
+{
+ int chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool found = false;
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_4 *n4 = (rt_node_4 *) node;
+
+ /* Do linear search */
+ for (int i = 0; i < n4->n.count; i++)
+ {
+ if (n4->chunks[i] > chunk)
+ break;
+
+ /*
+ * If we find the chunk in the node, do the specified
+ * action
+ */
+ if (n4->chunks[i] == chunk)
+ {
+ if (action == RT_ACTION_FIND)
+ *slot_p = &(n4->slots[i]);
+ else /* RT_ACTION_DELETE */
+ {
+ memmove(&(n4->chunks[i]), &(n4->chunks[i + 1]),
+ sizeof(uint8) * (n4->n.count - i - 1));
+ memmove(&(n4->slots[i]), &(n4->slots[i + 1]),
+ sizeof(rt_node *) * (n4->n.count - i - 1));
+ }
+
+ found = true;
+ break;
+ }
+ }
+
+ break;
+ }
+ case RT_NODE_KIND_16:
+ {
+ rt_node_16 *n16 = (rt_node_16 *) node;
+ int idx;
+
+ /* Search by SIMD instructions */
+ idx = node_16_search_eq(n16, chunk);
+
+ /* If we find the chunk in the node, do the specified action */
+ if (idx >= 0)
+ {
+ if (action == RT_ACTION_FIND)
+ *slot_p = &(n16->slots[idx]);
+ else /* RT_ACTION_DELETE */
+ {
+ memmove(&(n16->chunks[idx]), &(n16->chunks[idx + 1]),
+ sizeof(uint8) * (n16->n.count - idx - 1));
+ memmove(&(n16->slots[idx]), &(n16->slots[idx + 1]),
+ sizeof(rt_node *) * (n16->n.count - idx - 1));
+ }
+
+ found = true;
+ }
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_32 *n32 = (rt_node_32 *) node;
+ int idx;
+
+ /* Search by SIMD instructions */
+ idx = node_32_search_eq(n32, chunk);
+
+ /* If we find the chunk in the node, do the specified action */
+ if (idx >= 0)
+ {
+ if (action == RT_ACTION_FIND)
+ *slot_p = &(n32->slots[idx]);
+ else /* RT_ACTION_DELETE */
+ {
+ memmove(&(n32->chunks[idx]), &(n32->chunks[idx + 1]),
+ sizeof(uint8) * (n32->n.count - idx - 1));
+ memmove(&(n32->slots[idx]), &(n32->slots[idx + 1]),
+ sizeof(rt_node *) * (n32->n.count - idx - 1));
+ }
+
+ found = true;
+ }
+
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ rt_node_128 *n128 = (rt_node_128 *) node;
+
+ /* If we find the chunk in the node, do the specified action */
+ if (node_128_is_chunk_used(n128, chunk))
+ {
+ if (action == RT_ACTION_FIND)
+ *slot_p = &(n128->slots[n128->slot_idxs[chunk]]);
+ else /* RT_ACTION_DELETE */
+ node_128_unset(n128, chunk);
+
+ found = true;
+ }
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_256 *n256 = (rt_node_256 *) node;
+
+ /* If we find the chunk in the node, do the specified action */
+ if (node_256_is_chunk_used(n256, chunk))
+ {
+ if (action == RT_ACTION_FIND)
+ *slot_p = &(n256->slots[chunk]);
+ else /* RT_ACTION_DELETE */
+ node_256_unset(n256, chunk);
+
+ found = true;
+ }
+
+ break;
+ }
+ }
+
+ /* Update the statistics */
+ if (action == RT_ACTION_DELETE && found)
+ node->count--;
+
+ return found;
+}
+
+/*
+ * Create a new node as the root. Subordinate nodes will be created during
+ * the insertion.
+ */
+static void
+rt_new_root(radix_tree *tree, uint64 key, Datum val)
+{
+ rt_node_4 *n4 =
+ (rt_node_4 *) rt_alloc_node(tree, RT_NODE_KIND_4);
+ int shift = key_get_shift(key);
+
+ n4->n.shift = shift;
+ tree->max_val = shift_get_max_val(shift);
+ tree->root = (rt_node *) n4;
+}
+
+/* Insert 'node' as a child node of 'parent' */
+static rt_node *
+rt_node_insert_child(radix_tree *tree, rt_node *parent,
+ rt_node *node, uint64 key)
+{
+ rt_node *newchild =
+ (rt_node *) rt_alloc_node(tree, RT_NODE_KIND_4);
+
+ Assert(!IS_LEAF_NODE(node));
+
+ newchild->shift = node->shift - RT_NODE_SPAN;
+ newchild->chunk = RT_GET_KEY_CHUNK(key, node->shift);
+
+ rt_node_insert_val(tree, parent, node, key, PointerGetDatum(newchild), NULL);
+
+ return (rt_node *) newchild;
+}
+
+/*
+ * Insert the value to the node. The node grows if it's full.
+ */
+static void
+rt_node_insert_val(radix_tree *tree, rt_node *parent,
+ rt_node *node, uint64 key, Datum val,
+ bool *replaced_p)
+{
+ int chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool replaced = false;
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_4 *n4 = (rt_node_4 *) node;
+ int idx;
+
+ for (idx = 0; idx < n4->n.count; idx++)
+ {
+ if (n4->chunks[idx] >= chunk)
+ break;
+ }
+
+ if (NODE_HAS_FREE_SLOT(n4))
+ {
+ if (n4->n.count == 0)
+ {
+ /* the first key for this node, add it */
+ }
+ else if (n4->chunks[idx] == chunk)
+ {
+ /* found the key, replace it */
+ replaced = true;
+ }
+ else if (idx != n4->n.count)
+ {
+ /*
+ * the key needs to be inserted in the middle of the
+ * array, make space for the new key.
+ */
+ memmove(&(n4->chunks[idx + 1]), &(n4->chunks[idx]),
+ sizeof(uint8) * (n4->n.count - idx));
+ memmove(&(n4->slots[idx + 1]), &(n4->slots[idx]),
+ sizeof(Datum) * (n4->n.count - idx));
+ }
+
+ n4->chunks[idx] = chunk;
+ n4->slots[idx] = val;
+
+ /* Done */
+ break;
+ }
+
+ /* The node doesn't have free slot so needs to grow */
+ node = rt_node_grow(tree, parent, node, key);
+ Assert(node->kind == RT_NODE_KIND_16);
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_16:
+ {
+ rt_node_16 *n16 = (rt_node_16 *) node;
+ int idx;
+
+ idx = node_16_search_le(n16, chunk);
+
+ if (NODE_HAS_FREE_SLOT(n16))
+ {
+ if (n16->n.count == 0)
+ {
+ /* first key for this node, add it */
+ }
+ else if (n16->chunks[idx] == chunk)
+ {
+ /* found the key, replace it */
+ replaced = true;
+ }
+ else if (idx != n16->n.count)
+ {
+ /*
+ * the key needs to be inserted in the middle of the
+ * array, make space for the new key.
+ */
+ memmove(&(n16->chunks[idx + 1]), &(n16->chunks[idx]),
+ sizeof(uint8) * (n16->n.count - idx));
+ memmove(&(n16->slots[idx + 1]), &(n16->slots[idx]),
+ sizeof(Datum) * (n16->n.count - idx));
+ }
+
+ n16->chunks[idx] = chunk;
+ n16->slots[idx] = val;
+
+ /* Done */
+ break;
+ }
+
+ /* The node doesn't have free slot so needs to grow */
+ node = rt_node_grow(tree, parent, node, key);
+ Assert(node->kind == RT_NODE_KIND_32);
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_32:
+ {
+ rt_node_32 *n32 = (rt_node_32 *) node;
+ int idx;
+
+ idx = node_32_search_le(n32, chunk);
+
+ if (NODE_HAS_FREE_SLOT(n32))
+ {
+ if (n32->n.count == 0)
+ {
+ /* first key for this node, add it */
+ }
+ else if (n32->chunks[idx] == chunk)
+ {
+ /* found the key, replace it */
+ replaced = true;
+ }
+ else if (idx != n32->n.count)
+ {
+ /*
+ * the key needs to be inserted in the middle of the
+ * array, make space for the new key.
+ */
+ memmove(&(n32->chunks[idx + 1]), &(n32->chunks[idx]),
+ sizeof(uint8) * (n32->n.count - idx));
+ memmove(&(n32->slots[idx + 1]), &(n32->slots[idx]),
+ sizeof(Datum) * (n32->n.count - idx));
+ }
+
+ n32->chunks[idx] = chunk;
+ n32->slots[idx] = val;
+
+ /* Done */
+ break;
+ }
+
+ /* The node doesn't have free slot so needs to grow */
+ node = rt_node_grow(tree, parent, node, key);
+ Assert(node->kind == RT_NODE_KIND_128);
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_128:
+ {
+ rt_node_128 *n128 = (rt_node_128 *) node;
+
+ if (node_128_is_chunk_used(n128, chunk))
+ {
+ /* found the existing value */
+ node_128_set(n128, chunk, val);
+ replaced = true;
+ break;
+ }
+
+ if (NODE_HAS_FREE_SLOT(n128))
+ {
+ node_128_set(n128, chunk, val);
+
+ /* Done */
+ break;
+ }
+
+ /* The node doesn't have free slot so needs to grow */
+ node = rt_node_grow(tree, parent, node, key);
+ Assert(node->kind == RT_NODE_KIND_256);
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_256:
+ {
+ rt_node_256 *n256 = (rt_node_256 *) node;
+
+ if (node_256_is_chunk_used(n256, chunk))
+ replaced = true;
+
+ node_256_set(n256, chunk, val);
+
+ break;
+ }
+ }
+
+ /* Update statistics */
+ if (!replaced)
+ node->count++;
+
+ if (replaced_p)
+ *replaced_p = replaced;
+
+ /*
+ * Done. Finally, verify that the chunk and value are inserted or replaced
+ * properly in the node.
+ */
+ rt_verify_node(node);
+}
+
+/* Change the node type to the next larger one */
+static rt_node *
+rt_node_grow(radix_tree *tree, rt_node *parent, rt_node *node,
+ uint64 key)
+{
+ rt_node *newnode = NULL;
+
+ Assert(node->count == rt_node_info[node->kind].max_slots);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_4 *n4 = (rt_node_4 *) node;
+ rt_node_16 *new16 =
+ (rt_node_16 *) rt_alloc_node(tree, RT_NODE_KIND_16);
+
+ rt_copy_node_common((rt_node *) n4,
+ (rt_node *) new16);
+
+ /* Copy both chunks and slots to the new node */
+ memcpy(&(new16->chunks), &(n4->chunks), sizeof(uint8) * 4);
+ memcpy(&(new16->slots), &(n4->slots), sizeof(Datum) * 4);
+
+ newnode = (rt_node *) new16;
+ break;
+ }
+ case RT_NODE_KIND_16:
+ {
+ rt_node_16 *n16 = (rt_node_16 *) node;
+ rt_node_32 *new32 =
+ (rt_node_32 *) rt_alloc_node(tree, RT_NODE_KIND_32);
+
+ rt_copy_node_common((rt_node *) n16,
+ (rt_node *) new32);
+
+ /* Copy both chunks and slots to the new node */
+ memcpy(&(new32->chunks), &(n16->chunks), sizeof(uint8) * 16);
+ memcpy(&(new32->slots), &(n16->slots), sizeof(Datum) * 16);
+
+ newnode = (rt_node *) new32;
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_32 *n32 = (rt_node_32 *) node;
+ rt_node_128 *new128 =
+ (rt_node_128 *) rt_alloc_node(tree, RT_NODE_KIND_128);
+
+ /* Copy both chunks and slots to the new node */
+ rt_copy_node_common((rt_node *) n32,
+ (rt_node *) new128);
+
+ for (int i = 0; i < n32->n.count; i++)
+ node_128_set(new128, n32->chunks[i], n32->slots[i]);
+
+ newnode = (rt_node *) new128;
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ rt_node_128 *n128 = (rt_node_128 *) node;
+ rt_node_256 *new256 =
+ (rt_node_256 *) rt_alloc_node(tree, RT_NODE_KIND_256);
+ int cnt = 0;
+
+ rt_copy_node_common((rt_node *) n128,
+ (rt_node *) new256);
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n128->n.count; i++)
+ {
+ if (!node_128_is_chunk_used(n128, i))
+ continue;
+
+ node_256_set(new256, i, node_128_get_chunk_slot(n128, i));
+ cnt++;
+ }
+
+ newnode = (rt_node *) new256;
+ break;
+ }
+ case RT_NODE_KIND_256:
+ elog(ERROR, "radix tree node-256 cannot grow");
+ break;
+ }
+
+ if (parent == node)
+ {
+ /* Replace the root node with the new large node */
+ tree->root = newnode;
+ }
+ else
+ {
+ Datum *slot_ptr = NULL;
+
+ /* Redirect from the parent to the node */
+ rt_node_search(parent, &slot_ptr, key, RT_ACTION_FIND);
+ Assert(*slot_ptr);
+ *slot_ptr = PointerGetDatum(newnode);
+ }
+
+ /* Verify the node has grown properly */
+ rt_verify_node(newnode);
+
+ /* Free the old node */
+ rt_free_node(tree, node);
+
+ return newnode;
+}
+
+/*
+ * Create the radix tree in the given memory context and return it.
+ */
+radix_tree *
+rt_create(MemoryContext ctx)
+{
+ radix_tree *tree;
+ MemoryContext old_ctx;
+
+ old_ctx = MemoryContextSwitchTo(ctx);
+
+ tree = palloc(sizeof(radix_tree));
+ tree->context = ctx;
+ tree->root = NULL;
+ tree->max_val = 0;
+ tree->num_keys = 0;
+ tree->mem_used = 0;
+
+ /* Create the slab allocator for each size class */
+ for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ {
+ tree->slabs[i] = SlabContextCreate(ctx,
+ rt_node_info[i].name,
+ SLAB_DEFAULT_BLOCK_SIZE,
+ rt_node_info[i].size);
+ tree->cnt[i] = 0;
+ }
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return tree;
+}
+
+/*
+ * Free the given radix tree.
+ */
+void
+rt_free(radix_tree *tree)
+{
+ for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ MemoryContextDelete(tree->slabs[i]);
+
+ pfree(tree);
+}
+
+/*
+ * Insert the key with the val.
+ *
+ * If found_p is not NULL, it is set to true if the key is already present,
+ * otherwise to false.
+ *
+ * XXX: do we need to support update_if_exists behavior?
+ */
+void
+rt_insert(radix_tree *tree, uint64 key, Datum val, bool *found_p)
+{
+ int shift;
+ bool replaced;
+ rt_node *node;
+ rt_node *parent = tree->root;
+
+ /* Empty tree, create the root */
+ if (!tree->root)
+ rt_new_root(tree, key, val);
+
+ /* Extend the tree if necessary */
+ if (key > tree->max_val)
+ rt_extend(tree, key);
+
+ Assert(tree->root);
+
+ shift = tree->root->shift;
+ node = tree->root;
+ while (shift > 0)
+ {
+ rt_node *child;
+
+ if (!rt_node_find_child(node, &child, key))
+ child = rt_node_insert_child(tree, parent, node, key);
+
+ Assert(child != NULL);
+
+ parent = node;
+ node = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ /* arrived at a leaf */
+ Assert(IS_LEAF_NODE(node));
+
+ rt_node_insert_val(tree, parent, node, key, val, &replaced);
+
+ /* Update the statistics */
+ if (!replaced)
+ tree->num_keys++;
+
+ if (found_p)
+ *found_p = replaced;
+}
+
+/*
+ * Search the given key in the radix tree. Return true if the key is successfully
+ * found, otherwise return false. On success, we store the value in *val_p, so
+ * val_p must not be NULL.
+ */
+bool
+rt_search(radix_tree *tree, uint64 key, Datum *val_p)
+{
+ rt_node *node;
+ Datum *value_ptr;
+ int shift;
+
+ Assert(val_p);
+
+ if (!tree->root || key > tree->max_val)
+ return false;
+
+ node = tree->root;
+ shift = tree->root->shift;
+ while (shift > 0)
+ {
+ rt_node *child;
+
+ if (!rt_node_find_child(node, &child, key))
+ return false;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ /* We reached a leaf node, search the corresponding slot */
+ Assert(IS_LEAF_NODE(node));
+
+ if (!rt_node_search(node, &value_ptr, key, RT_ACTION_FIND))
+ return false;
+
+ /* Found, set the value to return */
+ *val_p = *value_ptr;
+ return true;
+}
+
+/*
+ * Delete the given key from the radix tree. Return true if the key is found (and
+ * deleted), otherwise do nothing and return false.
+ */
+bool
+rt_delete(radix_tree *tree, uint64 key)
+{
+ rt_node *node;
+ int shift;
+ rt_stack stack = NULL;
+ bool deleted;
+
+ if (!tree->root || key > tree->max_val)
+ return false;
+
+ /*
+ * Descend the tree to search for the key, while building a stack of the
+ * nodes we visited.
+ */
+ node = tree->root;
+ shift = tree->root->shift;
+ while (shift >= 0)
+ {
+ rt_node *child;
+ rt_stack new_stack;
+
+ new_stack = (rt_stack) palloc(sizeof(rt_stack_data));
+ new_stack->node = node;
+ new_stack->parent = stack;
+ stack = new_stack;
+
+ if (IS_LEAF_NODE(node))
+ break;
+
+ if (!rt_node_find_child(node, &child, key))
+ {
+ rt_free_stack(stack);
+ return false;
+ }
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ /*
+ * Delete the key from the leaf node and recursively delete internal nodes
+ * if necessary.
+ */
+ Assert(IS_LEAF_NODE(stack->node));
+ while (stack != NULL)
+ {
+ rt_node *node;
+ Datum *slot;
+
+ /* pop the node from the stack */
+ node = stack->node;
+ stack = stack->parent;
+
+ deleted = rt_node_search(node, &slot, key, RT_ACTION_DELETE);
+
+ /* If the node didn't become empty, we stop deleting the key */
+ if (!IS_EMPTY_NODE(node))
+ break;
+
+ Assert(deleted);
+
+ /* The node became empty */
+ rt_free_node(tree, node);
+
+ /*
+ * If we eventually deleted the root node while recursively deleting
+ * empty nodes, we make the tree empty.
+ */
+ if (stack == NULL)
+ {
+ tree->root = NULL;
+ tree->max_val = 0;
+ }
+ }
+
+ if (deleted)
+ tree->num_keys--;
+
+ rt_free_stack(stack);
+ return deleted;
+}
+
+/* Create and return the iterator for the given radix tree */
+rt_iter *
+rt_begin_iterate(radix_tree *tree)
+{
+ MemoryContext old_ctx;
+ rt_iter *iter;
+ int top_level;
+
+ old_ctx = MemoryContextSwitchTo(tree->context);
+
+ iter = (rt_iter *) palloc0(sizeof(rt_iter));
+ iter->tree = tree;
+
+ /* empty tree */
+ if (!iter->tree)
+ return iter;
+
+ top_level = iter->tree->root->shift / RT_NODE_SPAN;
+
+ iter->stack_len = top_level;
+ iter->stack[top_level].node = iter->tree->root;
+ iter->stack[top_level].current_idx = -1;
+
+ /*
+ * Descend to the leftmost leaf node from the root. The key is being
+ * constructed while descending to the leaf.
+ */
+ rt_update_iter_stack(iter, top_level);
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return iter;
+}
+
+/*
+ * Update the stack of the radix tree node while descending to the leaf from
+ * the 'from' level.
+ */
+static void
+rt_update_iter_stack(rt_iter *iter, int from)
+{
+ rt_node *node = iter->stack[from].node;
+ int level = from;
+
+ for (;;)
+ {
+ rt_iter_node_data *node_iter = &(iter->stack[level--]);
+ bool found;
+
+ /* Set the node to this level */
+ rt_store_iter_node(iter, node_iter, node);
+
+ /* Finish if we reached the leaf node */
+ if (IS_LEAF_NODE(node))
+ break;
+
+ /* Advance to the next slot in the node */
+ node = (rt_node *)
+ DatumGetPointer(rt_node_iterate_next(iter, node_iter, &found));
+
+ /*
+ * Since we always get the first slot in the node, we must find
+ * the slot.
+ */
+ Assert(found);
+ }
+}
+
+/*
+ * Return true and set key_p and value_p if there is a next key. Otherwise,
+ * return false.
+ */
+bool
+rt_iterate_next(rt_iter *iter, uint64 *key_p, Datum *value_p)
+{
+ bool found = false;
+ Datum slot = (Datum) 0;
+
+ /* Empty tree */
+ if (!iter->tree)
+ return false;
+
+ for (;;)
+ {
+ rt_node *node;
+ rt_iter_node_data *node_iter;
+ int level;
+
+ /*
+ * Iterate over the node at each level, from the bottom of the tree, i.e.,
+ * the leaf node, until we find the next slot.
+ */
+ for (level = 0; level <= iter->stack_len; level++)
+ {
+ slot = rt_node_iterate_next(iter, &(iter->stack[level]), &found);
+
+ if (found)
+ break;
+ }
+
+ /* We could not find any new key-value pair, the iteration finished */
+ if (!found)
+ break;
+
+ /* found the next slot at the leaf node, return it */
+ if (level == 0)
+ {
+ *key_p = iter->key;
+ *value_p = slot;
+ break;
+ }
+
+ /*
+ * We have advanced the slots of more than one node, including both the leaf
+ * node and internal nodes. So we update the stack by descending to the
+ * leftmost leaf node from this level.
+ */
+ node = (rt_node *) DatumGetPointer(slot);
+ node_iter = &(iter->stack[level - 1]);
+ rt_store_iter_node(iter, node_iter, node);
+ rt_update_iter_stack(iter, level - 1);
+ }
+
+ return found;
+}
+
+void
+rt_end_iterate(rt_iter *iter)
+{
+ pfree(iter);
+}
+
+/*
+ * Iterate over the given radix tree node and return its next slot, setting
+ * *found_p to true, if any. Otherwise, set *found_p to false.
+ */
+static Datum
+rt_node_iterate_next(rt_iter *iter, rt_iter_node_data *node_iter, bool *found_p)
+{
+ rt_node *node = node_iter->node;
+ Datum slot = (Datum) 0;
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_4 *n4 = (rt_node_4 *) node_iter->node;
+
+ node_iter->current_idx++;
+
+ if (node_iter->current_idx >= n4->n.count)
+ goto not_found;
+
+ slot = n4->slots[node_iter->current_idx];
+
+ /* Update the part of the key by the current chunk */
+ if (IS_LEAF_NODE(n4))
+ rt_iter_update_key(iter, n4->chunks[node_iter->current_idx], 0);
+
+ break;
+ }
+ case RT_NODE_KIND_16:
+ {
+ rt_node_16 *n16 = (rt_node_16 *) node;
+
+ node_iter->current_idx++;
+
+ if (node_iter->current_idx >= n16->n.count)
+ goto not_found;
+
+ slot = n16->slots[node_iter->current_idx];
+
+ /* Update the part of the key */
+ if (IS_LEAF_NODE(n16))
+ rt_iter_update_key(iter, n16->chunks[node_iter->current_idx], 0);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_32 *n32 = (rt_node_32 *) node;
+
+ node_iter->current_idx++;
+
+ if (node_iter->current_idx >= n32->n.count)
+ goto not_found;
+
+ slot = n32->slots[node_iter->current_idx];
+
+ /* Update the part of the key */
+ if (IS_LEAF_NODE(n32))
+ rt_iter_update_key(iter, n32->chunks[node_iter->current_idx], 0);
+
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ rt_node_128 *n128 = (rt_node_128 *) node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < 256; i++)
+ {
+ if (node_128_is_chunk_used(n128, i))
+ break;
+ }
+
+ if (i >= 256)
+ goto not_found;
+
+ node_iter->current_idx = i;
+ slot = node_128_get_chunk_slot(n128, i);
+
+ /* Update the part of the key */
+ if (IS_LEAF_NODE(n128))
+ rt_iter_update_key(iter, node_iter->current_idx, 0);
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_256 *n256 = (rt_node_256 *) node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < 256; i++)
+ {
+ if (node_256_is_chunk_used(n256, i))
+ break;
+ }
+
+ if (i >= 256)
+ goto not_found;
+
+ node_iter->current_idx = i;
+ slot = n256->slots[i];
+
+ /* Update the part of the key */
+ if (IS_LEAF_NODE(n256))
+ rt_iter_update_key(iter, node_iter->current_idx, 0);
+
+ break;
+ }
+ }
+
+ *found_p = true;
+ return slot;
+
+not_found:
+ *found_p = false;
+ return (Datum) 0;
+}
+
+/*
+ * Initialize and update the node iteration struct with the given radix tree
+ * node. This function also updates the part of the key by the chunk of the
+ * given node.
+ */
+static void
+rt_store_iter_node(rt_iter *iter, rt_iter_node_data *node_iter,
+ rt_node *node)
+{
+ node_iter->node = node;
+ node_iter->current_idx = -1;
+
+ rt_iter_update_key(iter, node->chunk, node->shift + RT_NODE_SPAN);
+}
+
+static inline void
+rt_iter_update_key(rt_iter *iter, uint8 chunk, uint8 shift)
+{
+ iter->key &= ~(((uint64) RT_CHUNK_MASK) << shift);
+ iter->key |= (((uint64) chunk) << shift);
+}
+
+/*
+ * Return the number of keys in the radix tree.
+ */
+uint64
+rt_num_entries(radix_tree *tree)
+{
+ return tree->num_keys;
+}
+
+/*
+ * Return the statistics of the amount of memory used by the radix tree.
+ */
+uint64
+rt_memory_usage(radix_tree *tree)
+{
+ return tree->mem_used;
+}
+
+/*
+ * Verify the radix tree node.
+ */
+static void
+rt_verify_node(rt_node *node)
+{
+#ifdef USE_ASSERT_CHECKING
+ Assert(node->count >= 0);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_4 *n4 = (rt_node_4 *) node;
+
+ /* Check if the chunks in the node are sorted */
+ for (int i = 1; i < n4->n.count; i++)
+ Assert(n4->chunks[i - 1] < n4->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_16:
+ {
+ rt_node_16 *n16 = (rt_node_16 *) node;
+
+ /* Check if the chunks in the node are sorted */
+ for (int i = 1; i < n16->n.count; i++)
+ Assert(n16->chunks[i - 1] < n16->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_32 *n32 = (rt_node_32 *) node;
+
+ /* Check if the chunks in the node are sorted */
+ for (int i = 1; i < n32->n.count; i++)
+ Assert(n32->chunks[i - 1] < n32->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ rt_node_128 *n128 = (rt_node_128 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!node_128_is_chunk_used(n128, i))
+ continue;
+
+ /* Check if the corresponding slot is used */
+ Assert(node_128_is_slot_used(n128, n128->slot_idxs[i]));
+
+ cnt++;
+ }
+
+ Assert(n128->n.count == cnt);
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_256 *n256 = (rt_node_256 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RT_NODE_NSLOTS_BITS(RT_NODE_MAX_SLOTS); i++)
+ cnt += pg_popcount32(n256->isset[i]);
+
+ /* Check if the number of used chunks matches */
+ Assert(n256->n.count == cnt);
+
+ break;
+ }
+ }
+#endif
+}
+
+/***************** DEBUG FUNCTIONS *****************/
+#ifdef RT_DEBUG
+void
+rt_stats(radix_tree *tree)
+{
+ fprintf(stderr, "num_keys = %lu, height = %u, n4 = %u(%lu), n16 = %u(%lu), n32 = %u(%lu), n128 = %u(%lu), n256 = %u(%lu)\n",
+ tree->num_keys,
+ tree->root->shift / RT_NODE_SPAN,
+ tree->cnt[0], tree->cnt[0] * sizeof(rt_node_4),
+ tree->cnt[1], tree->cnt[1] * sizeof(rt_node_16),
+ tree->cnt[2], tree->cnt[2] * sizeof(rt_node_32),
+ tree->cnt[3], tree->cnt[3] * sizeof(rt_node_128),
+ tree->cnt[4], tree->cnt[4] * sizeof(rt_node_256));
+ /* rt_dump(tree); */
+}
+
+static void
+rt_print_slot(StringInfo buf, uint8 chunk, Datum slot, int idx, bool is_leaf, int level)
+{
+ char space[128] = {0};
+
+ if (level > 0)
+ sprintf(space, "%*c", level * 4, ' ');
+
+ if (is_leaf)
+ appendStringInfo(buf, "%s[%d] \"0x%X\" val(0x%lX) LEAF\n",
+ space,
+ idx,
+ chunk,
+ DatumGetInt64(slot));
+ else
+ appendStringInfo(buf, "%s[%d] \"0x%X\" -> ",
+ space,
+ idx,
+ chunk);
+}
+
+static void
+rt_dump_node(rt_node *node, int level, StringInfo buf, bool recurse)
+{
+ bool is_leaf = IS_LEAF_NODE(node);
+
+ appendStringInfo(buf, "[\"%s\" type %d, cnt %u, shift %u, chunk \"0x%X\"] chunks:\n",
+ IS_LEAF_NODE(node) ? "LEAF" : "INNR",
+ (node->kind == RT_NODE_KIND_4) ? 4 :
+ (node->kind == RT_NODE_KIND_32) ? 32 :
+ (node->kind == RT_NODE_KIND_128) ? 128 : 256,
+ node->count, node->shift, node->chunk);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_4 *n4 = (rt_node_4 *) node;
+
+ for (int i = 0; i < n4->n.count; i++)
+ {
+ rt_print_slot(buf, n4->chunks[i], n4->slots[i], i, is_leaf, level);
+
+ if (!is_leaf)
+ {
+ if (recurse)
+ {
+ StringInfoData buf2;
+
+ initStringInfo(&buf2);
+ rt_dump_node((rt_node *) n4->slots[i], level + 1, &buf2, recurse);
+ appendStringInfo(buf, "%s", buf2.data);
+ }
+ else
+ appendStringInfo(buf, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_16:
+ {
+ rt_node_16 *n16 = (rt_node_16 *) node;
+
+ for (int i = 0; i < n16->n.count; i++)
+ {
+ rt_print_slot(buf, n16->chunks[i], n16->slots[i], i, is_leaf, level);
+
+ if (!is_leaf)
+ {
+ if (recurse)
+ {
+ StringInfoData buf2;
+
+ initStringInfo(&buf2);
+ rt_dump_node((rt_node *) n16->slots[i], level + 1, &buf2, recurse);
+ appendStringInfo(buf, "%s", buf2.data);
+ }
+ else
+ appendStringInfo(buf, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_32 *n32 = (rt_node_32 *) node;
+
+ for (int i = 0; i < n32->n.count; i++)
+ {
+ rt_print_slot(buf, n32->chunks[i], n32->slots[i], i, is_leaf, level);
+
+ if (!is_leaf)
+ {
+ if (recurse)
+ {
+ StringInfoData buf2;
+
+ initStringInfo(&buf2);
+ rt_dump_node((rt_node *) n32->slots[i], level + 1, &buf2, recurse);
+ appendStringInfo(buf, "%s", buf2.data);
+ }
+ else
+ appendStringInfo(buf, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ rt_node_128 *n128 = (rt_node_128 *) node;
+
+ for (int j = 0; j < 256; j++)
+ {
+ if (!node_128_is_chunk_used(n128, j))
+ continue;
+
+ appendStringInfo(buf, "slot_idxs[%d]=%d, ", j, n128->slot_idxs[j]);
+ }
+ appendStringInfo(buf, "\nisset-bitmap:");
+ for (int j = 0; j < 16; j++)
+ {
+ appendStringInfo(buf, "%X ", (uint8) n128->isset[j]);
+ }
+ appendStringInfo(buf, "\n");
+
+ for (int i = 0; i < 256; i++)
+ {
+ if (!node_128_is_chunk_used(n128, i))
+ continue;
+
+ rt_print_slot(buf, i, node_128_get_chunk_slot(n128, i),
+ i, is_leaf, level);
+
+ if (!is_leaf)
+ {
+ if (recurse)
+ {
+ StringInfoData buf2;
+
+ initStringInfo(&buf2);
+ rt_dump_node((rt_node *) node_128_get_chunk_slot(n128, i),
+ level + 1, &buf2, recurse);
+ appendStringInfo(buf, "%s", buf2.data);
+ }
+ else
+ appendStringInfo(buf, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_256 *n256 = (rt_node_256 *) node;
+
+ for (int i = 0; i < 256; i++)
+ {
+ if (!node_256_is_chunk_used(n256, i))
+ continue;
+
+ rt_print_slot(buf, i, n256->slots[i], i, is_leaf, level);
+
+ if (!is_leaf)
+ {
+ if (recurse)
+ {
+ StringInfoData buf2;
+
+ initStringInfo(&buf2);
+ rt_dump_node((rt_node *) n256->slots[i], level + 1, &buf2, recurse);
+ appendStringInfo(buf, "%s", buf2.data);
+ }
+ else
+ appendStringInfo(buf, "\n");
+ }
+ }
+ break;
+ }
+ }
+}
+
+void
+rt_dump_search(radix_tree *tree, uint64 key)
+{
+ StringInfoData buf;
+ rt_node *node;
+ int shift;
+ int level = 0;
+
+ elog(NOTICE, "-----------------------------------------------------------");
+ elog(NOTICE, "max_val = %lu (0x%lX)", tree->max_val, tree->max_val);
+
+ if (!tree->root)
+ {
+ elog(NOTICE, "tree is empty");
+ return;
+ }
+
+ if (key > tree->max_val)
+ {
+ elog(NOTICE, "key %lu (0x%lX) is larger than max val",
+ key, key);
+ return;
+ }
+
+ initStringInfo(&buf);
+ node = tree->root;
+ shift = tree->root->shift;
+ while (shift >= 0)
+ {
+ rt_node *child;
+
+ rt_dump_node(node, level, &buf, false);
+
+ if (IS_LEAF_NODE(node))
+ {
+ Datum *dummy;
+
+ /* We reached a leaf node, find the corresponding slot */
+ rt_node_search(node, &dummy, key, RT_ACTION_FIND);
+
+ break;
+ }
+
+ if (!rt_node_find_child(node, &child, key))
+ break;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ level++;
+ }
+
+ elog(NOTICE, "\n%s", buf.data);
+}
+
+void
+rt_dump(radix_tree *tree)
+{
+ StringInfoData buf;
+
+ initStringInfo(&buf);
+
+ elog(NOTICE, "-----------------------------------------------------------");
+ elog(NOTICE, "max_val = %lu", tree->max_val);
+ rt_dump_node(tree->root, 0, &buf, true);
+ elog(NOTICE, "\n%s", buf.data);
+ elog(NOTICE, "-----------------------------------------------------------");
+}
+#endif
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
new file mode 100644
index 0000000000..7efd4bb735
--- /dev/null
+++ b/src/include/lib/radixtree.h
@@ -0,0 +1,42 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.h
+ * Interface for radix tree.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/lib/radixtree.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef RADIXTREE_H
+#define RADIXTREE_H
+
+#include "postgres.h"
+
+/* #define RT_DEBUG 1 */
+
+typedef struct radix_tree radix_tree;
+typedef struct rt_iter rt_iter;
+
+extern radix_tree *rt_create(MemoryContext ctx);
+extern bool rt_search(radix_tree *tree, uint64 key, Datum *val_p);
+extern void rt_insert(radix_tree *tree, uint64 key, Datum val, bool *found_p);
+extern bool rt_delete(radix_tree *tree, uint64 key);
+extern void rt_free(radix_tree *tree);
+extern uint64 rt_memory_usage(radix_tree *tree);
+extern uint64 rt_num_entries(radix_tree *tree);
+
+extern rt_iter *rt_begin_iterate(radix_tree *tree);
+extern bool rt_iterate_next(rt_iter *iter, uint64 *key_p, Datum *value_p);
+extern void rt_end_iterate(rt_iter *iter);
+
+
+#ifdef RT_DEBUG
+extern void rt_dump(radix_tree *tree);
+extern void rt_dump_search(radix_tree *tree, uint64 key);
+extern void rt_stats(radix_tree *tree);
+#endif
+
+#endif /* RADIXTREE_H */
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 9090226daa..51b2514faf 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -24,6 +24,7 @@ SUBDIRS = \
test_parser \
test_pg_dump \
test_predtest \
+ test_radixtree \
test_rbtree \
test_regex \
test_rls_hooks \
diff --git a/src/test/modules/test_radixtree/.gitignore b/src/test/modules/test_radixtree/.gitignore
new file mode 100644
index 0000000000..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/test_radixtree/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_radixtree/Makefile b/src/test/modules/test_radixtree/Makefile
new file mode 100644
index 0000000000..da06b93da3
--- /dev/null
+++ b/src/test/modules/test_radixtree/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_radixtree/Makefile
+
+MODULE_big = test_radixtree
+OBJS = \
+ $(WIN32RES) \
+ test_radixtree.o
+PGFILEDESC = "test_radixtree - test code for src/backend/lib/radixtree.c"
+
+EXTENSION = test_radixtree
+DATA = test_radixtree--1.0.sql
+
+REGRESS = test_radixtree
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_radixtree
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_radixtree/README b/src/test/modules/test_radixtree/README
new file mode 100644
index 0000000000..a8b271869a
--- /dev/null
+++ b/src/test/modules/test_radixtree/README
@@ -0,0 +1,7 @@
+test_radixtree contains unit tests for testing the radix tree implementation
+in src/backend/lib/radixtree.c.
+
+The tests verify the correctness of the implementation, but they can also be
+used as a micro-benchmark. If you set the 'rt_test_stats' flag in
+test_radixtree.c, the tests will print extra information about execution time
+and memory usage.
diff --git a/src/test/modules/test_radixtree/expected/test_radixtree.out b/src/test/modules/test_radixtree/expected/test_radixtree.out
new file mode 100644
index 0000000000..cc6970c87c
--- /dev/null
+++ b/src/test/modules/test_radixtree/expected/test_radixtree.out
@@ -0,0 +1,28 @@
+CREATE EXTENSION test_radixtree;
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
+NOTICE: testing radix tree node types with shift "0"
+NOTICE: testing radix tree node types with shift "8"
+NOTICE: testing radix tree node types with shift "16"
+NOTICE: testing radix tree node types with shift "24"
+NOTICE: testing radix tree node types with shift "32"
+NOTICE: testing radix tree node types with shift "40"
+NOTICE: testing radix tree node types with shift "48"
+NOTICE: testing radix tree node types with shift "56"
+NOTICE: testing radix tree with pattern "all ones"
+NOTICE: testing radix tree with pattern "alternating bits"
+NOTICE: testing radix tree with pattern "clusters of ten"
+NOTICE: testing radix tree with pattern "clusters of hundred"
+NOTICE: testing radix tree with pattern "one-every-64k"
+NOTICE: testing radix tree with pattern "sparse"
+NOTICE: testing radix tree with pattern "single values, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^60"
+ test_radixtree
+----------------
+
+(1 row)
+
diff --git a/src/test/modules/test_radixtree/sql/test_radixtree.sql b/src/test/modules/test_radixtree/sql/test_radixtree.sql
new file mode 100644
index 0000000000..41ece5e9f5
--- /dev/null
+++ b/src/test/modules/test_radixtree/sql/test_radixtree.sql
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_radixtree;
+
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
diff --git a/src/test/modules/test_radixtree/test_radixtree--1.0.sql b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
new file mode 100644
index 0000000000..074a5a7ea7
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_radixtree/test_radixtree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_radixtree" to load this file. \quit
+
+CREATE FUNCTION test_radixtree()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
new file mode 100644
index 0000000000..384b1fc41d
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -0,0 +1,503 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_radixtree.c
+ * Test radixtree set data structure.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/test/modules/test_radixtree/test_radixtree.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "lib/radixtree.h"
+#include "miscadmin.h"
+#include "nodes/bitmapset.h"
+#include "storage/block.h"
+#include "storage/itemptr.h"
+#include "utils/memutils.h"
+#include "utils/timestamp.h"
+
+#define UINT64_HEX_FORMAT "%" INT64_MODIFIER "X"
+
+/*
+ * If you enable this, the "pattern" tests will print information about
+ * how long populating, probing, and iterating the test set takes, and
+ * how much memory the test set consumed. That can be used as
+ * micro-benchmark of various operations and input patterns (you might
+ * want to increase the number of values used in each of the test, if
+ * you do that, to reduce noise).
+ *
+ * The information is printed to the server's stderr, mostly because
+ * that's where MemoryContextStats() output goes.
+ */
+static const bool rt_test_stats = false;
+
+/* The maximum number of entries each node type can have */
+static int rt_node_max_entries[] = {
+ 4, /* RT_NODE_KIND_4 */
+ 16, /* RT_NODE_KIND_16 */
+ 32, /* RT_NODE_KIND_32 */
+ 128, /* RT_NODE_KIND_128 */
+ 256 /* RT_NODE_KIND_256 */
+};
+
+/*
+ * A struct to define a pattern of integers, for use with the test_pattern()
+ * function.
+ */
+typedef struct
+{
+ char *test_name; /* short name of the test, for humans */
+ char *pattern_str; /* a bit pattern */
+ uint64 spacing; /* pattern repeats at this interval */
+ uint64 num_values; /* number of integers to set in total */
+} test_spec;
+
+/* Test patterns borrowed from test_integerset.c */
+static const test_spec test_specs[] = {
+ {
+ "all ones", "1111111111",
+ 10, 1000000
+ },
+ {
+ "alternating bits", "0101010101",
+ 10, 1000000
+ },
+ {
+ "clusters of ten", "1111111111",
+ 10000, 1000000
+ },
+ {
+ "clusters of hundred",
+ "1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111",
+ 10000, 10000000
+ },
+ {
+ "one-every-64k", "1",
+ 65536, 1000000
+ },
+ {
+ "sparse", "100000000000000000000000000000001",
+ 10000000, 1000000
+ },
+ {
+ "single values, distance > 2^32", "1",
+ UINT64CONST(10000000000), 100000
+ },
+ {
+ "clusters, distance > 2^32", "10101010",
+ UINT64CONST(10000000000), 1000000
+ },
+ {
+ "clusters, distance > 2^60", "10101010",
+ UINT64CONST(2000000000000000000),
+ 23 /* can't be much higher than this, or we
+ * overflow uint64 */
+ }
+};
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(test_radixtree);
+
+static void
+test_empty(void)
+{
+ radix_tree *radixtree;
+ Datum dummy;
+
+ radixtree = rt_create(CurrentMemoryContext);
+
+ if (rt_search(radixtree, 0, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, 1, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, PG_UINT64_MAX, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_num_entries(radixtree) != 0)
+ elog(ERROR, "rt_num_entries on empty tree return non-zero");
+
+ rt_free(radixtree);
+}
+
+/*
+ * Check if keys from start to end with the shift exist in the tree.
+ */
+static void
+check_search_on_node(radix_tree *radixtree, uint8 shift, int start, int end)
+{
+ for (int i = start; i < end; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ Datum val;
+
+ if (!rt_search(radixtree, key, &val))
+ elog(ERROR, "key 0x" UINT64_HEX_FORMAT " is not found on node-%d",
+ key, end);
+ if (DatumGetUInt64(val) != key)
+ elog(ERROR, "rt_search with key 0x" UINT64_HEX_FORMAT " returns 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ key, DatumGetUInt64(val), key);
+ }
+}
+
+static void
+test_node_types_insert(radix_tree *radixtree, uint8 shift)
+{
+ uint64 num_entries;
+
+ for (int i = 0; i < 256; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ rt_insert(radixtree, key, Int64GetDatum(key), &found);
+
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " found", key);
+
+ for (int j = 0; j < lengthof(rt_node_max_entries); j++)
+ {
+ /*
+ * After filling all slots in each node type, check if the values are
+ * stored properly.
+ */
+ if (i == (rt_node_max_entries[j] - 1))
+ {
+ check_search_on_node(radixtree, shift,
+ (j == 0) ? 0 : rt_node_max_entries[j - 1],
+ rt_node_max_entries[j]);
+ break;
+ }
+ }
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ if (num_entries != 256)
+ elog(ERROR,
+ "rt_num_entries returned" UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(256));
+}
+
+static void
+test_node_types_delete(radix_tree *radixtree, uint8 shift)
+{
+ uint64 num_entries;
+
+ for (int i = 0; i < 256; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_delete(radixtree, key);
+
+ if (!found)
+ elog(ERROR, "inserted key 0x" UINT64_HEX_FORMAT " is not found", key);
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ /* The tree must be empty */
+ if (num_entries != 0)
+ elog(ERROR,
+ "rt_num_entries returned" UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(256));
+}
+
+/*
+ * Test for inserting and deleting key-value pairs to each node type at the given shift
+ * level.
+ */
+static void
+test_node_types(uint8 shift)
+{
+ radix_tree *radixtree;
+
+ elog(NOTICE, "testing radix tree node types with shift \"%d\"", shift);
+
+ radixtree = rt_create(CurrentMemoryContext);
+
+ /*
+ * Insert and search entries for every node type at the 'shift' level,
+ * then delete all entries to make it empty, and insert and search
+ * entries again.
+ */
+ test_node_types_insert(radixtree, shift);
+ test_node_types_delete(radixtree, shift);
+ test_node_types_insert(radixtree, shift);
+
+ rt_free(radixtree);
+}
+
+/*
+ * Test with a repeating pattern, defined by the 'spec'.
+ */
+static void
+test_pattern(const test_spec *spec)
+{
+ radix_tree *radixtree;
+ rt_iter *iter;
+ MemoryContext radixtree_ctx;
+ TimestampTz starttime;
+ TimestampTz endtime;
+ uint64 n;
+ uint64 last_int;
+ uint64 ndeleted;
+ uint64 nbefore;
+ uint64 nafter;
+ int patternlen;
+ uint64 *pattern_values;
+ uint64 pattern_num_values;
+
+ elog(NOTICE, "testing radix tree with pattern \"%s\"", spec->test_name);
+ if (rt_test_stats)
+ fprintf(stderr, "-----\ntesting radix tree with pattern \"%s\"\n", spec->test_name);
+
+ /* Pre-process the pattern, creating an array of integers from it. */
+ patternlen = strlen(spec->pattern_str);
+ pattern_values = palloc(patternlen * sizeof(uint64));
+ pattern_num_values = 0;
+ for (int i = 0; i < patternlen; i++)
+ {
+ if (spec->pattern_str[i] == '1')
+ pattern_values[pattern_num_values++] = i;
+ }
+
+ /*
+ * Allocate the radix tree.
+ *
+ * Allocate it in a separate memory context, so that we can print its
+ * memory usage easily.
+ */
+ radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
+ "radixtree test",
+ ALLOCSET_SMALL_SIZES);
+ MemoryContextSetIdentifier(radixtree_ctx, spec->test_name);
+ radixtree = rt_create(radixtree_ctx);
+
+ /*
+ * Add values to the set.
+ */
+ starttime = GetCurrentTimestamp();
+
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ uint64 x = 0;
+
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ bool found;
+
+ x = last_int + pattern_values[i];
+
+ rt_insert(radixtree, x, Int64GetDatum(x), &found);
+
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " found", x);
+
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+
+ endtime = GetCurrentTimestamp();
+
+ if (rt_test_stats)
+ fprintf(stderr, "added " UINT64_FORMAT " values in %d ms\n",
+ spec->num_values, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Print stats on the amount of memory used.
+ *
+ * We print the usage reported by rt_memory_usage(), as well as the
+ * stats from the memory context. They should be in the same ballpark,
+ * but it's hard to automate testing that, so if you're making changes to
+ * the implementation, just observe that manually.
+ */
+ if (rt_test_stats)
+ {
+ uint64 mem_usage;
+
+ /*
+ * Also print memory usage as reported by rt_memory_usage(). It
+ * should be in the same ballpark as the usage reported by
+ * MemoryContextStats().
+ */
+ mem_usage = rt_memory_usage(radixtree);
+ fprintf(stderr, "rt_memory_usage() reported " UINT64_FORMAT " (%0.2f bytes / integer)\n",
+ mem_usage, (double) mem_usage / spec->num_values);
+
+ MemoryContextStats(radixtree_ctx);
+ }
+
+ /* Check that rt_num_entries works */
+ n = rt_num_entries(radixtree);
+ if (n != spec->num_values)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT, n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_search()
+ */
+ starttime = GetCurrentTimestamp();
+
+ for (n = 0; n < 100000; n++)
+ {
+ bool found;
+ bool expected;
+ uint64 x;
+ Datum v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Do we expect this value to be present in the set? */
+ if (x >= last_int)
+ expected = false;
+ else
+ {
+ uint64 idx = x % spec->spacing;
+
+ if (idx >= patternlen)
+ expected = false;
+ else if (spec->pattern_str[idx] == '1')
+ expected = true;
+ else
+ expected = false;
+ }
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (found != expected)
+ elog(ERROR, "mismatch at 0x" UINT64_HEX_FORMAT ": %d vs %d", x, found, expected);
+ if (found && (DatumGetUInt64(v) != x))
+ elog(ERROR, "found 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ DatumGetUInt64(v), x);
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "probed " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Test iterator
+ */
+ starttime = GetCurrentTimestamp();
+
+ iter = rt_begin_iterate(radixtree);
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ uint64 expected = last_int + pattern_values[i];
+ uint64 x;
+ Datum val;
+
+ if (!rt_iterate_next(iter, &x, &val))
+ break;
+
+ if (x != expected)
+ elog(ERROR,
+ "iterate returned wrong key; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d", x, expected, i);
+ if (DatumGetUInt64(val) != expected)
+ elog(ERROR,
+ "iterate returned wrong value; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d", x, expected, i);
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "iterated " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ if (n < spec->num_values)
+ elog(ERROR, "iterator stopped short after " UINT64_FORMAT " entries, expected " UINT64_FORMAT, n, spec->num_values);
+ if (n > spec->num_values)
+ elog(ERROR, "iterator returned " UINT64_FORMAT " entries, " UINT64_FORMAT " was expected", n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_delete()
+ */
+ starttime = GetCurrentTimestamp();
+
+ nbefore = rt_num_entries(radixtree);
+ ndeleted = 0;
+ for (n = 0; n < 100000; n++)
+ {
+ bool found;
+ uint64 x;
+ Datum v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (!found)
+ continue;
+
+ /* If the key is found, delete it and check again */
+ if (!rt_delete(radixtree, x))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_search(radixtree, x, &v))
+ elog(ERROR, "found deleted key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_delete(radixtree, x))
+ elog(ERROR, "deleted already-deleted key 0x" UINT64_HEX_FORMAT, x);
+
+ ndeleted++;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "deleted " UINT64_FORMAT " values in %d ms\n",
+ ndeleted, (int) (endtime - starttime) / 1000);
+
+ nafter = rt_num_entries(radixtree);
+
+ /* Check that rt_num_entries works */
+ if ((nbefore - ndeleted) != nafter)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT "after " UINT64_FORMAT " deletion",
+ nafter, (nbefore - ndeleted), ndeleted);
+
+ MemoryContextDelete(radixtree_ctx);
+}
+
+Datum
+test_radixtree(PG_FUNCTION_ARGS)
+{
+ test_empty();
+
+ for (int shift = 0; shift <= (64 - 8); shift += 8)
+ test_node_types(shift);
+
+ /* Test different test patterns, with lots of entries */
+ for (int i = 0; i < lengthof(test_specs); i++)
+ test_pattern(&test_specs[i]);
+
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_radixtree/test_radixtree.control b/src/test/modules/test_radixtree/test_radixtree.control
new file mode 100644
index 0000000000..e53f2a3e0c
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.control
@@ -0,0 +1,4 @@
+comment = 'Test code for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/test_radixtree'
+relocatable = true
On Tue, Jul 5, 2022 at 6:18 AM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2022-06-16 13:56:55 +0900, Masahiko Sawada wrote:
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
new file mode 100644
index 0000000000..bf87f932fd
--- /dev/null
+++ b/src/backend/lib/radixtree.c
@@ -0,0 +1,1763 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.c
+ * Implementation for adaptive radix tree.
+ *
+ * This module employs the idea from the paper "The Adaptive Radix Tree: ARTful
+ * Indexing for Main-Memory Databases" by Viktor Leis, Alfons Kemper, and Thomas
+ * Neumann, 2013.
+ *
+ * There are some differences from the proposed implementation. For instance,
+ * this radix tree module utilizes AVX2 instruction, enabling us to use 256-bit
+ * width SIMD vector, whereas 128-bit width SIMD vector is used in the paper.
+ * Also, there is no support for path compression and lazy path expansion. The
+ * radix tree supports fixed length of the key so we don't expect the tree level
+ * wouldn't be high.

I think we're going to need path compression at some point, fwiw. I'd bet on
it being beneficial even for the tid case.

+ * The key is a 64-bit unsigned integer and the value is a Datum.
I don't think it's a good idea to define the value type to be a datum.
A datum value is convenient to represent both a pointer and a value, so
I used it to avoid defining node types for inner and leaf nodes
separately. Since a datum could be 4 bytes or 8 bytes depending on the
platform, it might not be good for some platforms. But which aspects of
using datum do you not like?
+/*
+ * As we descend a radix tree, we push the node to the stack. The stack is used
+ * at deletion.
+ */
+typedef struct radix_tree_stack_data
+{
+	radix_tree_node *node;
+	struct radix_tree_stack_data *parent;
+} radix_tree_stack_data;
+typedef radix_tree_stack_data *radix_tree_stack;

I think it's a very bad idea for traversal to need allocations. I really want
to eventually use this for shared structures (eventually with lock-free
searches at least), and needing to do allocations while traversing the tree is
a no-go for that.

Particularly given that the tree currently has a fixed depth, can't you just
allocate this on the stack once?
Yes, we can do that.
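
A minimal sketch of that, given that a 64-bit key bounds the depth to 64 / RT_NODE_SPAN levels (the struct and function names here are illustrative, not from the patch):

#define RT_MAX_LEVEL	(64 / RT_NODE_SPAN)		/* 8 levels for 64-bit keys */

typedef struct rt_node_stack
{
	int			depth;
	radix_tree_node *nodes[RT_MAX_LEVEL];
} rt_node_stack;

static inline void
rt_stack_push(rt_node_stack *stack, radix_tree_node *node)
{
	/* No allocation: the caller keeps this struct on its own C stack. */
	Assert(stack->depth < RT_MAX_LEVEL);
	stack->nodes[stack->depth++] = node;
}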
+/*
+ * Allocate a new node with the given node kind.
+ */
+static radix_tree_node *
+radix_tree_alloc_node(radix_tree *tree, radix_tree_node_kind kind)
+{
+	radix_tree_node *newnode;
+
+	newnode = (radix_tree_node *) MemoryContextAllocZero(tree->slabs[kind],
+														 radix_tree_node_info[kind].size);
+	newnode->kind = kind;
+
+	/* update the statistics */
+	tree->mem_used += GetMemoryChunkSpace(newnode);
+	tree->cnt[kind]++;
+
+	return newnode;
+}

Why are you tracking the memory usage at this level of detail? It's *much*
cheaper to track memory usage via the memory contexts? Since they're dedicated
for the radix tree, that ought to be sufficient?
Indeed. I'll use MemoryContextMemAllocated instead.
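
A sketch of what that could look like, summing over the dedicated slab contexts instead of doing per-chunk bookkeeping (RT_NODE_KIND_COUNT is an assumed name for the number of node kinds):

uint64
radix_tree_memory_usage(radix_tree *tree)
{
	Size		total = 0;

	/* Ask the per-node-kind slab contexts instead of tracking each chunk. */
	for (int kind = 0; kind < RT_NODE_KIND_COUNT; kind++)
		total += MemoryContextMemAllocated(tree->slabs[kind], true);

	return (uint64) total;
}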
+			else if (idx != n4->n.count)
+			{
+				/*
+				 * the key needs to be inserted in the middle of the
+				 * array, make space for the new key.
+				 */
+				memmove(&(n4->chunks[idx + 1]), &(n4->chunks[idx]),
+						sizeof(uint8) * (n4->n.count - idx));
+				memmove(&(n4->slots[idx + 1]), &(n4->slots[idx]),
+						sizeof(radix_tree_node *) * (n4->n.count - idx));
+			}

Maybe we could add a static inline helper for these memmoves? Both because
it's repetitive (for different node types) and because the last time I looked
gcc was generating quite bad code for this. And having to put workarounds into
multiple places is obviously worse than having to do it in one place.
Agreed, I'll update it.
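
Something along these lines could be shared by the node kinds that keep a sorted chunk array plus a parallel slot array (a sketch; the helper name is illustrative, and a caller would pass e.g. n4->chunks, (void **) n4->slots, n4->n.count and idx):

static inline void
chunk_array_make_room(uint8 *chunks, void **slots, int count, int insertpos)
{
	/* Shift the tails of both parallel arrays one position to the right. */
	memmove(&chunks[insertpos + 1], &chunks[insertpos],
			sizeof(uint8) * (count - insertpos));
	memmove(&slots[insertpos + 1], &slots[insertpos],
			sizeof(void *) * (count - insertpos));
}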
+/*
+ * Insert the key with the val.
+ *
+ * found_p is set to true if the key already present, otherwise false, if
+ * it's not NULL.
+ *
+ * XXX: do we need to support update_if_exists behavior?
+ */

Yes, I think that's needed - hence using bfm_set() instead of insert() in the
prototype.
Agreed.
+void
+radix_tree_insert(radix_tree *tree, uint64 key, Datum val, bool *found_p)
+{
+	int			shift;
+	bool		replaced;
+	radix_tree_node *node;
+	radix_tree_node *parent = tree->root;
+
+	/* Empty tree, create the root */
+	if (!tree->root)
+		radix_tree_new_root(tree, key, val);
+
+	/* Extend the tree if necessary */
+	if (key > tree->max_val)
+		radix_tree_extend(tree, key);

FWIW, the reason I used separate functions for these in the prototype is that
it turns out to generate a lot better code, because it allows non-inlined
function calls to be sibling calls - thereby avoiding the need for a dedicated
stack frame. That's not possible once you need a palloc or such, so splitting
off those call paths into dedicated functions is useful.
Thank you for the info. How much does using sibling call optimization
help the performance in this case? I think that these two cases are
used only a limited number of times: inserting the first key and
extending the tree.
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
On Tue, Jul 5, 2022 at 7:00 AM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2022-06-28 15:24:11 +0900, Masahiko Sawada wrote:
In both test cases, there is not much difference between using AVX2
and SSE2. The more node types, the more time it takes for loading the
data (see sse2_4_16_32_128_256).

Yea, at some point the compiler starts using a jump table instead of branches,
and that turns out to be a good bit more expensive. And even with branches, it
obviously adds hard to predict branches. IIRC I fought a bit with the compiler
to avoid some of that cost, it's possible that got "lost" in Sawada-san's
patch.

Sawada-san, what led you to discard the 1 and 16 node types? IIRC the 1 node
one is not unimportant until we have path compression.
I wanted to start with a smaller number of node types for simplicity.
The node-16 type has been added in the v4 patch I submitted[1]. I think it's
a trade-off between better memory usage and the overhead of growing (and
shrinking) node types. I'm going to add more node types once benchmarks
show that it's beneficial.
Right now the node struct sizes are:
4 - 48 bytes
32 - 296 bytes
128 - 1304 bytes
256 - 2088 bytes

I guess radix_tree_node_128->isset is just 16 bytes compared to 1288 other
bytes, but needing that separate isset array somehow is sad :/. I wonder if a
smaller "free index" would do the trick? Point to the element + 1 where we
searched last and start a plain loop there. Particularly in an insert-only
workload that'll always work, and in other cases it'll still often work I
think.
radix_tree_node_128->isset is used to distinguish between a null pointer
in inner nodes and the value 0 in leaf nodes. So I guess we could have a flag
indicating whether a node is a leaf or an inner node, so that we can interpret
(Datum) 0 as either a null pointer or the value 0. Or, if we define different
data types for inner and leaf nodes, we probably don't need it.
One thing I was wondering about is trying to choose node types in
roughly-power-of-two struct sizes. It's pretty easy to end up with significant
fragmentation in the slabs right now when inserting as you go, because some of
the smaller node types will be freed but not enough to actually free blocks of
memory. If we instead have ~power-of-two sizes we could just use a single slab
of the max size, and carve out the smaller node types out of that largest
allocation.
Do you mean that we manage memory allocation (and freeing) for the smaller
node types ourselves?
How about using different block size for different node types?
Btw, that fragmentation is another reason why I think it's better to track
memory usage via memory contexts, rather than doing so based on
GetMemoryChunkSpace().
Agreed.
Ideally, node16 and node32 would have the same code with a different
loop count (1 or 2). More generally, there is too much duplication of
code (noted by Andres in his PoC), and there are many variable names
with the node size embedded. This is a bit tricky to make more
general, so we don't need to try it yet, but ideally we would have
something similar to:

switch (node->kind) // todo: inspect tagged pointer
{
case RADIX_TREE_NODE_KIND_4:
idx = node_search_eq(node, chunk, 4);
do_action(node, idx, 4, ...);
break;
case RADIX_TREE_NODE_KIND_32:
idx = node_search_eq(node, chunk, 32);
do_action(node, idx, 32, ...);
...
}

FWIW, that should be doable with an inline function, if you pass it the memory
to the "array" rather than the node directly. Not so sure it's a good idea to
do dispatch between node types / search methods inside the helper, as you
suggest below:

static pg_alwaysinline void
node_search_eq(radix_tree_node node, uint8 chunk, int16 node_fanout)
{
if (node_fanout <= SIMPLE_LOOP_THRESHOLD)
// do simple loop with (node_simple *) node;
else if (node_fanout <= VECTORIZED_LOOP_THRESHOLD)
// do vectorized loop where available with (node_vec *) node;
...
}
Yeah, it's worth trying at some point.
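
For the record, a sketch of the kind of helper being discussed, taking the chunk array plus a per-call-site constant count so the compiler can specialize the loop (simple-loop variant only; the vectorized path is omitted, and the name is illustrative):

static inline int
node_chunk_array_search_eq(const uint8 *chunks, int count, uint8 chunk)
{
	for (int i = 0; i < count; i++)
	{
		if (chunks[i] == chunk)
			return i;
	}

	return -1;			/* not found */
}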
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
Hi,
On 2022-07-05 16:33:17 +0900, Masahiko Sawada wrote:
On Tue, Jul 5, 2022 at 6:18 AM Andres Freund <andres@anarazel.de> wrote:
A datum value is convenient to represent both a pointer and a value so
I used it to avoid defining node types for inner and leaf nodes
separately.
I'm not convinced that's a good goal. I think we're going to want to have
different key and value types, and trying to unify leaf and inner nodes is
going to make that impossible.
Consider e.g. using it for something like a buffer mapping table - your key
might be way too wide to fit it sensibly into 64bit.
Since a datum could be 4 bytes or 8 bytes depending it might not be good for
some platforms.
Right - thats another good reason why it's problematic. A lot of key types
aren't going to be 4/8 bytes dependent on 32/64bit, but either / or.
+void +radix_tree_insert(radix_tree *tree, uint64 key, Datum val, bool *found_p) +{ + int shift; + bool replaced; + radix_tree_node *node; + radix_tree_node *parent = tree->root; + + /* Empty tree, create the root */ + if (!tree->root) + radix_tree_new_root(tree, key, val); + + /* Extend the tree if necessary */ + if (key > tree->max_val) + radix_tree_extend(tree, key);FWIW, the reason I used separate functions for these in the prototype is that
it turns out to generate a lot better code, because it allows non-inlined
function calls to be sibling calls - thereby avoiding the need for a dedicated
stack frame. That's not possible once you need a palloc or such, so splitting
off those call paths into dedicated functions is useful.Thank you for the info. How much does using sibling call optimization
help the performance in this case? I think that these two cases are
used only a limited number of times: inserting the first key and
extending the tree.
It's not that it helps in the cases moved into separate functions - it's that
not having that code in the "normal" paths keeps the normal path faster.
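
Roughly, the shape being described looks like this (a sketch only; pg_noinline and unlikely() are existing PostgreSQL macros, the helper names are illustrative and their bodies are omitted):

static pg_noinline void rt_new_root(radix_tree *tree, uint64 key, Datum val);
static pg_noinline void rt_extend_tree(radix_tree *tree, uint64 key);

void
rt_insert(radix_tree *tree, uint64 key, Datum val, bool *found_p)
{
	/* Rare cases go through non-inlined helpers, keeping this frame small. */
	if (unlikely(tree->root == NULL))
		rt_new_root(tree, key, val);
	else if (unlikely(key > tree->max_val))
		rt_extend_tree(tree, key);

	/* ... the common path descends the tree without allocating ... */
}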
Greetings,
Andres Freund
Hi,
On 2022-07-05 16:33:29 +0900, Masahiko Sawada wrote:
One thing I was wondering about is trying to choose node types in
roughly-power-of-two struct sizes. It's pretty easy to end up with significant
fragmentation in the slabs right now when inserting as you go, because some of
the smaller node types will be freed but not enough to actually free blocks of
memory. If we instead have ~power-of-two sizes we could just use a single slab
of the max size, and carve out the smaller node types out of that largest
allocation.You meant to manage memory allocation (and free) for smaller node
types by ourselves?
For all of them basically. Using a single slab allocator and then subdividing
the "common block size" into however many chunks that fit into a single node
type.
How about using different block size for different node types?
Not following...
Greetings,
Andres Freund
On Mon, Jul 4, 2022 at 12:07 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Looking at the node stats, and then your benchmark code, I think key
construction is a major influence, maybe more than node type. The
key/value scheme tested now makes sense:

blockhi || blocklo || 9 bits of item offset

(with the leaf nodes containing a bit map of the lowest few bits of
this whole thing)

We want the lower fanout nodes at the top of the tree and higher
fanout ones at the bottom.

So more inner nodes can fit in CPU cache, right?
My thinking is, on average, there will be more dense space utilization
in the leaf bitmaps, and fewer inner nodes. I'm not quite sure about
cache, since with my idea a search might have to visit more nodes to
get the common negative result (indexed tid not found in vacuum's
list).
Note some consequences: If the table has enough columns such that much
fewer than 100 tuples fit on a page (maybe 30 or 40), then in the
dense case the nodes above the leaves will have lower fanout (maybe
they will fit in a node32). Also, the bitmap values in the leaves will
be more empty. In other words, many tables in the wild *resemble* the
sparse case a bit, even if truly all tuples on the page are dead.

Note also that the dense case in the benchmark above has ~4500 times
more keys than the sparse case, and uses about ~1000 times more
memory. But the runtime is only 2-3 times longer. That's interesting
to me.

To optimize for the sparse case, it seems to me that the key/value would be
blockhi || 9 bits of item offset || blocklo
I believe that would make the leaf nodes more dense, with fewer inner
nodes, and could drastically speed up the sparse case, and maybe many
realistic dense cases.

Does it have an effect on the number of inner nodes?
I'm curious to hear your thoughts.
Thank you for your analysis. It's worth trying. We use 9 bits for item
offset but most pages don't use all bits in practice. So probably it
might be better to move the most significant bit of item offset to the
left of blockhi. Or more simply:

9 bits of item offset || blockhi || blocklo
A concern here is most tids won't use many bits in blockhi either,
most often far fewer, so this would make the tree higher, I think.
Each value of blockhi represents 0.5GB of heap (32TB max). Even with
very large tables I'm guessing most pages of interest to vacuum are
concentrated in a few of these 0.5GB "segments".
And it's possible path compression would change the tradeoffs here.
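
For reference, a minimal sketch of the key construction currently tested (block number in the high bits, the 9-bit item offset in the low bits); the macro and function names are illustrative:

#define TID_OFFSET_BITS		9	/* MaxHeapTuplesPerPage fits in 9 bits */

static inline uint64
vac_tid_to_key(ItemPointer tid)
{
	uint64		block = ItemPointerGetBlockNumber(tid);
	uint64		offset = ItemPointerGetOffsetNumber(tid);

	/* "blockhi || blocklo || 9 bits of item offset" */
	return (block << TID_OFFSET_BITS) | offset;
}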
--
John Naylor
EDB: http://www.enterprisedb.com
On Tue, Jul 5, 2022 at 5:09 PM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2022-07-05 16:33:17 +0900, Masahiko Sawada wrote:
On Tue, Jul 5, 2022 at 6:18 AM Andres Freund <andres@anarazel.de> wrote:
A datum value is convenient to represent both a pointer and a value so
I used it to avoid defining node types for inner and leaf nodes
separately.I'm not convinced that's a good goal. I think we're going to want to have
different key and value types, and trying to unify leaf and inner nodes is
going to make that impossible.Consider e.g. using it for something like a buffer mapping table - your key
might be way too wide to fit it sensibly into 64bit.
Right. It seems better to have an interface that lets the user of
the radix tree specify an arbitrary key size (and perhaps the value
size too?) at creation time. And we can have separate leaf node types that
hold values instead of pointers. If the value size is no larger than the
pointer size, we can store values within leaf nodes, but if it's bigger,
the leaf node can probably hold pointers to memory where the value is
stored.
Since a datum could be 4 bytes or 8 bytes depending it might not be good for
some platforms.Right - thats another good reason why it's problematic. A lot of key types
aren't going to be 4/8 bytes dependent on 32/64bit, but either / or.+void +radix_tree_insert(radix_tree *tree, uint64 key, Datum val, bool *found_p) +{ + int shift; + bool replaced; + radix_tree_node *node; + radix_tree_node *parent = tree->root; + + /* Empty tree, create the root */ + if (!tree->root) + radix_tree_new_root(tree, key, val); + + /* Extend the tree if necessary */ + if (key > tree->max_val) + radix_tree_extend(tree, key);FWIW, the reason I used separate functions for these in the prototype is that
it turns out to generate a lot better code, because it allows non-inlined
function calls to be sibling calls - thereby avoiding the need for a dedicated
stack frame. That's not possible once you need a palloc or such, so splitting
off those call paths into dedicated functions is useful.Thank you for the info. How much does using sibling call optimization
help the performance in this case? I think that these two cases are
used only a limited number of times: inserting the first key and
extending the tree.It's not that it helps in the cases moved into separate functions - it's that
not having that code in the "normal" paths keeps the normal path faster.
Thanks, understood.
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
On Tue, Jul 5, 2022 at 5:49 PM John Naylor <john.naylor@enterprisedb.com> wrote:
On Mon, Jul 4, 2022 at 12:07 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Looking at the node stats, and then your benchmark code, I think key
construction is a major influence, maybe more than node type. The
key/value scheme tested now makes sense:blockhi || blocklo || 9 bits of item offset
(with the leaf nodes containing a bit map of the lowest few bits of
this whole thing)We want the lower fanout nodes at the top of the tree and higher
fanout ones at the bottom.So more inner nodes can fit in CPU cache, right?
My thinking is, on average, there will be more dense space utilization
in the leaf bitmaps, and fewer inner nodes. I'm not quite sure about
cache, since with my idea a search might have to visit more nodes to
get the common negative result (indexed tid not found in vacuum's
list).Note some consequences: If the table has enough columns such that much
fewer than 100 tuples fit on a page (maybe 30 or 40), then in the
dense case the nodes above the leaves will have lower fanout (maybe
they will fit in a node32). Also, the bitmap values in the leaves will
be more empty. In other words, many tables in the wild *resemble* the
sparse case a bit, even if truly all tuples on the page are dead.Note also that the dense case in the benchmark above has ~4500 times
more keys than the sparse case, and uses about ~1000 times more
memory. But the runtime is only 2-3 times longer. That's interesting
to me.To optimize for the sparse case, it seems to me that the key/value would be
blockhi || 9 bits of item offset || blocklo
I believe that would make the leaf nodes more dense, with fewer inner
nodes, and could drastically speed up the sparse case, and maybe many
realistic dense cases.Does it have an effect on the number of inner nodes?
I'm curious to hear your thoughts.
Thank you for your analysis. It's worth trying. We use 9 bits for item
offset but most pages don't use all bits in practice. So probably it
might be better to move the most significant bit of item offset to the
left of blockhi. Or more simply:9 bits of item offset || blockhi || blocklo
A concern here is most tids won't use many bits in blockhi either,
most often far fewer, so this would make the tree higher, I think.
Each value of blockhi represents 0.5GB of heap (32TB max). Even with
very large tables I'm guessing most pages of interest to vacuum are
concentrated in a few of these 0.5GB "segments".
Right.
I guess that the tree height is affected by where the garbage is, right?
For example, even if all garbage in the table is concentrated within
0.5GB, if it lies between block 2^17 and block 2^18 we use the first
byte of blockhi. If the table is larger than 128GB, the second byte of
blockhi could be used, depending on where the garbage is.
Another variation of how to store TIDs would be to use the block
number as the key and store a bitmap of the offsets as the value. We could
use Bitmapset, for example, or an approach like Roaring bitmap.
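
A sketch of that variation, using the rt_* interface from the earlier patch, with one entry per heap block whose value is a pointer to a Bitmapset of dead offsets (the helper name is illustrative, and it assumes rt_insert overwrites the value of an existing key):

static void
record_dead_tid(radix_tree *tree, ItemPointer tid)
{
	uint64		key = (uint64) ItemPointerGetBlockNumber(tid);
	Datum		val;
	Bitmapset  *offsets = NULL;
	bool		found;

	/* Fetch the existing offset bitmap for this block, if any. */
	if (rt_search(tree, key, &val))
		offsets = (Bitmapset *) DatumGetPointer(val);

	offsets = bms_add_member(offsets, ItemPointerGetOffsetNumber(tid));
	rt_insert(tree, key, PointerGetDatum(offsets), &found);
}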
I think that at this stage it's better to define the design first. For
example, the key size and value size: are these sizes fixed, or can they be
set to an arbitrary size? Given the buffer mapping use case, we would
need a wider key to store RelFileNode, ForkNumber, and BlockNumber. On
the other hand, limiting the key size to a 64-bit integer makes the
logic simple, and possibly it could still be used in buffer mapping
cases by using a tree of trees. For the value size, if we support
different value sizes specified by the user, we can either embed
multiple values in the leaf node (called Multi-value leaves in the ART
paper) or introduce a leaf node that stores one value (called
Single-value leaves).
And it's possible path compression would change the tradeoffs here.
Agreed.
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
On Fri, Jul 8, 2022 at 9:10 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
I guess that the tree height is affected by where garbages are, right?
For example, even if all garbage in the table is concentrated in
0.5GB, if they exist between 2^17 and 2^18 block, we use the first
byte of blockhi. If the table is larger than 128GB, the second byte of
the blockhi could be used depending on where the garbage exists.
Right.
Another variation of how to store TID would be that we use the block
number as a key and store a bitmap of the offset as a value. We can
use Bitmapset for example,
I like the idea of using existing code to set/check a bitmap if it's
convenient. But (in case that was implied here) I'd really like to
stay away from variable-length values, which would require
"Single-value leaves" (slow). I also think it's fine to treat the
key/value as just bits, and not care where exactly they came from, as
we've been talking about.
or an approach like Roaring bitmap.
This would require two new data structures instead of one. That
doesn't seem like a path to success.
I think that at this stage it's better to define the design first. For
example, key size and value size, and these sizes are fixed or can be
set the arbitary size?
I don't think we need to start over. Andres' prototype had certain
design decisions built in for the intended use case (although maybe
not clearly documented as such). Subsequent patches in this thread
substantially changed many design aspects. If there were any changes
that made things wonderful for vacuum, it wasn't explained, but Andres
did explain how some of these changes were not good for other uses.
Going to fixed 64-bit keys and values should still allow many future
applications, so let's do that if there's no reason not to.
For value size, if we support
different value sizes specified by the user, we can either embed
multiple values in the leaf node (called Multi-value leaves in ART
paper)
I don't think "Multi-value leaves" allow for variable-length values,
FWIW. And now I see I also used this term wrong in my earlier review
comment -- v3/4 don't actually use "multi-value leaves", but Andres'
does (going by the multiple leaf types). From the paper: "Multi-value
leaves: The values are stored in one of four different leaf node
types, which mirror the structure of inner nodes, but contain values
instead of pointers."
(It seems v3/v4 could be called a variation of "Combined pointer/value
slots: If values fit into pointers, no separate node types are
necessary. Instead, each pointer storage location in an inner node can
either store a pointer or a value." But without the advantage of
variable length keys).
--
John Naylor
EDB: http://www.enterprisedb.com
On Fri, Jul 8, 2022 at 3:43 PM John Naylor <john.naylor@enterprisedb.com> wrote:
On Fri, Jul 8, 2022 at 9:10 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
I guess that the tree height is affected by where garbages are, right?
For example, even if all garbage in the table is concentrated in
0.5GB, if they exist between 2^17 and 2^18 block, we use the first
byte of blockhi. If the table is larger than 128GB, the second byte of
the blockhi could be used depending on where the garbage exists.Right.
Another variation of how to store TID would be that we use the block
number as a key and store a bitmap of the offset as a value. We can
use Bitmapset for example,I like the idea of using existing code to set/check a bitmap if it's
convenient. But (in case that was implied here) I'd really like to
stay away from variable-length values, which would require
"Single-value leaves" (slow). I also think it's fine to treat the
key/value as just bits, and not care where exactly they came from, as
we've been talking about.or an approach like Roaring bitmap.
This would require two new data structures instead of one. That
doesn't seem like a path to success.
Agreed.
I think that at this stage it's better to define the design first. For
example, key size and value size, and these sizes are fixed or can be
set the arbitary size?I don't think we need to start over. Andres' prototype had certain
design decisions built in for the intended use case (although maybe
not clearly documented as such). Subsequent patches in this thread
substantially changed many design aspects. If there were any changes
that made things wonderful for vacuum, it wasn't explained, but Andres
did explain how some of these changes were not good for other uses.
Going to fixed 64-bit keys and values should still allow many future
applications, so let's do that if there's no reason not to.
I thought Andres pointed out that, given that we store a BufferTag (or
part of it) in the key, fixed 64-bit keys might not be enough
for buffer mapping use cases. If we want to support keys wider than
64 bits, we would need to consider that.
For value size, if we support
different value sizes specified by the user, we can either embed
multiple values in the leaf node (called Multi-value leaves in ART
paper)I don't think "Multi-value leaves" allow for variable-length values,
FWIW. And now I see I also used this term wrong in my earlier review
comment -- v3/4 don't actually use "multi-value leaves", but Andres'
does (going by the multiple leaf types). From the paper: "Multi-value
leaves: The values are stored in one of four different leaf node
types, which mirror the structure of inner nodes, but contain values
instead of pointers."
Right, but sorry, I meant that the user specifies an arbitrary fixed value
size at creation time, as we do in dynahash.c.
(It seems v3/v4 could be called a variation of "Combined pointer/value
slots: If values fit into pointers, no separate node types are
necessary. Instead, each pointer storage location in an inner node can
either store a pointer or a value." But without the advantage of
variable length keys).
Agreed.
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
On Tue, Jul 12, 2022 at 8:16 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
I think that at this stage it's better to define the design first. For
example, key size and value size, and these sizes are fixed or can be
set the arbitary size?I don't think we need to start over. Andres' prototype had certain
design decisions built in for the intended use case (although maybe
not clearly documented as such). Subsequent patches in this thread
substantially changed many design aspects. If there were any changes
that made things wonderful for vacuum, it wasn't explained, but Andres
did explain how some of these changes were not good for other uses.
Going to fixed 64-bit keys and values should still allow many future
applications, so let's do that if there's no reason not to.I thought Andres pointed out that given that we store BufferTag (or
part of that) into the key, the fixed 64-bit keys might not be enough
for buffer mapping use cases. If we want to use wider keys more than
64-bit, we would need to consider it.
It sounds like you've answered your own question, then. If so, I'm
curious what your current thinking is.
If we *did* want to have maximum flexibility, then "single-value
leaves" method would be the way to go, since it seems to be the
easiest way to have variable-length both keys and values. I do have a
concern that the extra pointer traversal would be a drag on
performance, and also require lots of small memory allocations. If we
happened to go that route, your idea upthread of using a bitmapset of
item offsets in the leaves sounds like a good fit for that.
I also have some concerns about also simultaneously trying to design
for the use for buffer mappings. I certainly want to make this good
for as many future uses as possible, and I'd really like to preserve
any optimizations already fought for. However, to make concrete
progress on the thread subject, I also don't think it's the most
productive use of time to get tied up about the fine details of
something that will not likely happen for several years at the
earliest.
--
John Naylor
EDB: http://www.enterprisedb.com
On Thu, Jul 14, 2022 at 1:17 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Tue, Jul 12, 2022 at 8:16 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
I think that at this stage it's better to define the design first. For
example, key size and value size, and these sizes are fixed or can be
set the arbitary size?I don't think we need to start over. Andres' prototype had certain
design decisions built in for the intended use case (although maybe
not clearly documented as such). Subsequent patches in this thread
substantially changed many design aspects. If there were any changes
that made things wonderful for vacuum, it wasn't explained, but Andres
did explain how some of these changes were not good for other uses.
Going to fixed 64-bit keys and values should still allow many future
applications, so let's do that if there's no reason not to.I thought Andres pointed out that given that we store BufferTag (or
part of that) into the key, the fixed 64-bit keys might not be enough
for buffer mapping use cases. If we want to use wider keys more than
64-bit, we would need to consider it.It sounds like you've answered your own question, then. If so, I'm
curious what your current thinking is.If we *did* want to have maximum flexibility, then "single-value
leaves" method would be the way to go, since it seems to be the
easiest way to have variable-length both keys and values. I do have a
concern that the extra pointer traversal would be a drag on
performance, and also require lots of small memory allocations.
Agreed.
I also have some concerns about also simultaneously trying to design
for the use for buffer mappings. I certainly want to make this good
for as many future uses as possible, and I'd really like to preserve
any optimizations already fought for. However, to make concrete
progress on the thread subject, I also don't think it's the most
productive use of time to get tied up about the fine details of
something that will not likely happen for several years at the
earliest.
I'd like to keep the first version simple. We can improve it and add
more optimizations later. Using a radix tree for vacuum TID storage
would still be a big win compared to using a flat array, even without
all these optimizations. As for the single-value leaves method, I'm
also concerned about the extra pointer traversal and extra memory
allocation. It's the most flexible, but the multi-value leaves method is also
flexible enough for many use cases. Using the single-value method
seems to be too much as the first step for me.

Overall, using 64-bit keys and 64-bit values would be a reasonable
choice for me as the first step. It can cover wider use cases,
including the vacuum TID use case. And possibly it can cover other use cases
by combining it with a hash table or using a tree of trees, for example.
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
Hi,
On 2022-07-08 11:09:44 +0900, Masahiko Sawada wrote:
I think that at this stage it's better to define the design first. For
example, key size and value size, and these sizes are fixed or can be
set the arbitary size? Given the use case of buffer mapping, we would
need a wider key to store RelFileNode, ForkNumber, and BlockNumber. On
the other hand, limiting the key size is 64 bit integer makes the
logic simple, and possibly it could still be used in buffer mapping
cases by using a tree of a tree. For value size, if we support
different value sizes specified by the user, we can either embed
multiple values in the leaf node (called Multi-value leaves in ART
paper) or introduce a leaf node that stores one value (called
Single-value leaves).
FWIW, I think the best path forward would be to do something similar to the
simplehash.h approach, so it can be customized to the specific user.
Greetings,
Andres Freund
On Tue, Jul 19, 2022 at 9:24 AM Andres Freund <andres@anarazel.de> wrote:
FWIW, I think the best path forward would be to do something similar to
the
simplehash.h approach, so it can be customized to the specific user.
I figured that would come up at some point. It may be worth doing in the
future, but I think it's way too much to ask for the first use case.
--
John Naylor
EDB: http://www.enterprisedb.com
On Mon, Jul 18, 2022 at 9:10 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Tue, Jul 19, 2022 at 9:24 AM Andres Freund <andres@anarazel.de> wrote:
FWIW, I think the best path forward would be to do something similar to the
simplehash.h approach, so it can be customized to the specific user.

I figured that would come up at some point. It may be worth doing in the future, but I think it's way too much to ask for the first use case.
I have a prototype patch that creates a read-only snapshot of the
visibility map, and has vacuumlazy.c work off of that when determining
which pages to skip. The patch also gets rid of the
SKIP_PAGES_THRESHOLD stuff. This is very effective with TPC-C,
principally because it really cuts down on the number of scanned_pages
that are scanned only because the VM bit is unset concurrently by DML.
The window for this is very large when the table is large (and
naturally takes a long time to scan), resulting in many more "dead but
not yet removable" tuples being encountered than necessary. Which
itself causes bogus information in the FSM -- information about the
space that VACUUM could free from the page, which is often highly
misleading.
There are remaining questions about how to do this properly. Right now
I'm just copying pages from the VM into local memory, right after
OldestXmin is first acquired -- we "lock in" a snapshot of the VM at
the earliest opportunity, which is what lazy_scan_skip() actually
works off now. There needs to be some consideration given to the
resource management aspects of this -- it needs to use memory
sensibly, which the current prototype patch doesn't do at all. I'm
probably going to seriously pursue this as a project soon, and will
probably need some kind of data structure for the local copy. The raw
pages are usually quite space inefficient, considering we only need an
immutable snapshot of the VM.
I wonder if it makes sense to use this as part of this project. It
will be possible to know the exact heap pages that will become
scanned_pages before scanning even one page with this design (perhaps
with caveats about low memory conditions). It could also be very
effective as a way of speeding up TID lookups in the reasonably common
case where most scanned_pages don't have any LP_DEAD items -- just
look it up in our local/materialized copy of the VM first. But even
when LP_DEAD items are spread fairly evenly, it could still give us
reliable information about the distribution of LP_DEAD items very
early on.
Maybe the two data structures could even be combined in some way? You
can use more memory for the local copy of the VM if you know that you
won't need the memory for dead_items. It's kinda the same problem, in
a way.
--
Peter Geoghegan
On Tue, Jul 19, 2022 at 9:11 AM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
I’d like to keep the first version simple. We can improve it and add
more optimizations later. Using radix tree for vacuum TID storage
would still be a big win comparing to using a flat array, even without
all these optimizations. In terms of single-value leaves method, I'm
also concerned about an extra pointer traversal and extra memory
allocation. It's most flexible but multi-value leaves method is also
flexible enough for many use cases. Using the single-value method
seems to be too much as the first step for me.Overall, using 64-bit keys and 64-bit values would be a reasonable
choice for me as the first step . It can cover wider use cases
including vacuum TID use cases. And possibly it can cover use cases by
combining a hash table or using tree of tree, for example.
These two aspects would also bring it closer to Andres' prototype, which 1)
makes review easier and 2) makes it easier to preserve optimization work
already done, so +1 from me.
--
John Naylor
EDB: http://www.enterprisedb.com
On Tue, Jul 19, 2022 at 1:30 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Tue, Jul 19, 2022 at 9:11 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
I’d like to keep the first version simple. We can improve it and add
more optimizations later. Using radix tree for vacuum TID storage
would still be a big win comparing to using a flat array, even without
all these optimizations. In terms of single-value leaves method, I'm
also concerned about an extra pointer traversal and extra memory
allocation. It's most flexible but multi-value leaves method is also
flexible enough for many use cases. Using the single-value method
seems to be too much as the first step for me.

Overall, using 64-bit keys and 64-bit values would be a reasonable
choice for me as the first step. It can cover wider use cases
including vacuum TID use cases. And possibly it can cover use cases by
combining a hash table or using a tree of trees, for example.

These two aspects would also bring it closer to Andres' prototype, which 1)
makes review easier and 2) makes it easier to preserve the optimization work
already done, so +1 from me.
Thanks.
I've updated the patch. It now implements 64-bit keys, 64-bit values,
and the multi-value leaves method. I've tried to remove duplicated
code but we might find a better way to do that.
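For reference, a small usage sketch against the interface declared in
radixtree.h from the attached patch. The TID-to-key encoding below is purely
illustrative (one possible way to pack a block and offset number into a
64-bit key), not something the patch itself prescribes:

#include "postgres.h"

#include "lib/radixtree.h"
#include "storage/itemptr.h"

/*
 * Illustrative only: pack a TID into a uint64 key by combining the block
 * number and the offset number.  Whether vacuum would use exactly this
 * layout is an open question.
 */
static inline uint64
tid_to_key(ItemPointer tid)
{
	return ((uint64) ItemPointerGetBlockNumber(tid) << 16) |
		ItemPointerGetOffsetNumber(tid);
}

void
example_dead_tid_set(void)
{
	radix_tree *tree = rt_create(CurrentMemoryContext);
	ItemPointerData tid;
	uint64		value;

	ItemPointerSet(&tid, 10, 3);

	/* remember this TID; the value is unused here, so store 0 */
	rt_set(tree, tid_to_key(&tid), 0);

	/* later: membership test, as lazy_tid_reaped() would do */
	if (rt_search(tree, tid_to_key(&tid), &value))
	{
		/* found */
	}

	rt_free(tree);
}

A vacuum integration would mostly care about the membership test, i.e. the
rt_search() call at the end.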
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
Attachments:
radixtree_v5.patch (application/octet-stream)
diff --git a/src/backend/lib/Makefile b/src/backend/lib/Makefile
index 9dad31398a..ead0755d25 100644
--- a/src/backend/lib/Makefile
+++ b/src/backend/lib/Makefile
@@ -22,6 +22,9 @@ OBJS = \
integerset.o \
knapsack.o \
pairingheap.o \
+ radixtree.o \
rbtree.o \
+radixtree.o: CFLAGS+=-msse2
+
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
new file mode 100644
index 0000000000..1aececbf46
--- /dev/null
+++ b/src/backend/lib/radixtree.c
@@ -0,0 +1,2336 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.c
+ * Implementation for adaptive radix tree.
+ *
+ * This module employs the idea from the paper "The Adaptive Radix Tree: ARTful
+ * Indexing for Main-Memory Databases" by Viktor Leis, Alfons Kemper, and Thomas
+ * Neumann, 2013. The radix tree uses adaptive node sizes, a small number of node
+ * types, each with a different number of elements. Depending on the number of
+ * children, the appropriate node type is used.
+ *
+ * There are some differences from the proposed implementation. For instance,
+ * this radix tree module utilizes AVX2 instructions, enabling us to use 256-bit
+ * width SIMD vector, whereas 128-bit width SIMD vector is used in the paper.
+ * Also, there is no support for path compression and lazy path expansion. The
+ * radix tree supports only fixed-length keys, so we don't expect the tree
+ * height to be high.
+ *
+ * Both the key and the value are 64-bit unsigned integers. The internal nodes
+ * and the leaf nodes have slightly different structures: internal tree nodes,
+ * with shift > 0, store the pointer to their child node as the value, while
+ * leaf nodes, with shift == 0, store the 64-bit unsigned integer specified by
+ * the user as the value. The paper refers to this technique as "Multi-value leaves". We
+ * choose it to avoid an additional pointer traversal. It is the reason this code
+ * currently does not support variable-length keys.
+ *
+ * XXX: radix tree nodes are never shrunk.
+ *
+ * Interface
+ * ---------
+ *
+ * rt_create - Create a new, empty radix tree
+ * rt_free - Free the radix tree
+ * rt_search - Search a key-value pair
+ * rt_set - Set a key-value pair
+ * rt_delete - Delete a key-value pair
+ * rt_begin_iterate - Begin iterating through all key-value pairs
+ * rt_iterate_next - Return next key-value pair, if any
+ * rt_end_iterate - End iteration
+ *
+ * rt_create() creates an empty radix tree in the given memory context
+ * and creates memory contexts for each kind of radix tree node under it.
+ *
+ * rt_iterate_next() returns key-value pairs in the ascending order of
+ * the key.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/lib/radixtree.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "miscadmin.h"
+#include "port/pg_bitutils.h"
+#include "utils/memutils.h"
+#include "lib/radixtree.h"
+#include "lib/stringinfo.h"
+
+#if defined(__SSE2__)
+#include <emmintrin.h> /* SSE2 intrinsics */
+#endif
+
+/* The number of bits encoded in one tree level */
+#define RT_NODE_SPAN BITS_PER_BYTE
+
+/* The number of maximum slots in the node */
+#define RT_NODE_MAX_SLOTS (1 << RT_NODE_SPAN)
+
+/*
+ * Return the number of bytes required for a bitmap covering nslots slots.
+ */
+#define RT_NODE_NSLOTS_BITS(nslots) ((nslots) / (sizeof(uint8) * BITS_PER_BYTE))
+
+/* Mask for extracting a chunk from the key */
+#define RT_CHUNK_MASK ((1 << RT_NODE_SPAN) - 1)
+
+/* Maximum shift the radix tree uses */
+#define RT_MAX_SHIFT key_get_shift(UINT64_MAX)
+
+/* Maximum number of levels the radix tree uses */
+#define RT_MAX_LEVEL ((sizeof(uint64) * BITS_PER_BYTE) / RT_NODE_SPAN)
+
+/* Invalid index used in node-128 */
+#define RT_NODE_128_INVALID_IDX 0xFF
+
+/* Get a chunk from the key */
+#define RT_GET_KEY_CHUNK(key, shift) \
+ ((uint8) (((key) >> (shift)) & RT_CHUNK_MASK))
+
+/*
+ * Mapping from the chunk or slot number to the byte and bit in the is-set
+ * bitmap in node-128 and node-256.
+ */
+#define RT_NODE_BITMAP_BYTE(v) ((v) / BITS_PER_BYTE)
+#define RT_NODE_BITMAP_BIT(v) (UINT64CONST(1) << ((v) % RT_NODE_SPAN))
+
+/* Enum used by rt_node_search() */
+typedef enum
+{
+ RT_ACTION_FIND = 0, /* find the key-value */
+ RT_ACTION_DELETE, /* delete the key-value */
+} rt_action;
+
+/*
+ * Supported radix tree nodes.
+ *
+ * XXX: These are currently not well chosen. To reduce memory fragmentation
+ * a smaller class should optimally fit neatly into the next larger class
+ * (except perhaps at the lowest end). Right now it's
+ * 48 -> 152 -> 296 -> 1304 -> 2088 bytes for inner/leaf nodes, leading to
+ * large amounts of allocator padding with aset.c. Hence the use of slab.
+ *
+ * XXX: do we need a node-1 as long as there is no path compression optimization?
+ *
+ * XXX: need to explain why we choose these node types based on benchmark
+ * results etc.
+ */
+typedef enum rt_node_kind
+{
+ RT_NODE_KIND_4 = 0,
+ RT_NODE_KIND_16,
+ RT_NODE_KIND_32,
+ RT_NODE_KIND_128,
+ RT_NODE_KIND_256
+} rt_node_kind;
+#define RT_NODE_KIND_COUNT (RT_NODE_KIND_256 + 1)
+
+/*
+ * Base type for all nodes types.
+ */
+typedef struct rt_node
+{
+ /*
+ * Number of children. We use uint16 to be able to indicate 256 children
+ * at the fanout of 8.
+ */
+ uint16 count;
+
+ /*
+ * Shift indicates which part of the key space is represented by this
+ * node. That is, the key is shifted by 'shift' and the lowest
+ * RT_NODE_SPAN bits are then represented in chunk.
+ */
+ uint8 shift;
+ uint8 chunk;
+
+ /* Size class of the node */
+ rt_node_kind kind;
+} rt_node;
+
+/* Macros for radix tree nodes */
+#define IS_LEAF_NODE(n) (((rt_node *) (n))->shift == 0)
+#define IS_EMPTY_NODE(n) (((rt_node *) (n))->count == 0)
+#define NODE_HAS_FREE_SLOT(n) \
+ (((rt_node *) (n))->count < rt_node_info[((rt_node *) (n))->kind].fanout)
+
+/* Base types for inner and leaf nodes of each node type */
+typedef struct rd_node_base_4
+{
+ rt_node n;
+
+ /* 4 children, for key chunks */
+ uint8 chunks[4];
+} rt_node_base_4;
+
+typedef struct rd_node_base_16
+{
+ rt_node n;
+
+ /* 16 children, for key chunks */
+ uint8 chunks[16];
+} rt_node_base_16;
+
+typedef struct rd_node_base_32
+{
+ rt_node n;
+
+ /* 32 children, for key chunks */
+ uint8 chunks[32];
+} rt_node_base_32;
+
+typedef struct rd_node_base_128
+{
+ rt_node n;
+
+ /* The index of slots for each fanout */
+ uint8 slot_idxs[RT_NODE_MAX_SLOTS];
+
+ /* isset is a bitmap to track which slot is in use */
+ uint8 isset[RT_NODE_NSLOTS_BITS(128)];
+} rt_node_base_128;
+
+typedef struct rd_node_base_256
+{
+ rt_node n;
+
+ /* isset is a bitmap to track which slot is in use */
+ uint8 isset[RT_NODE_NSLOTS_BITS(RT_NODE_MAX_SLOTS)];
+} rt_node_base_256;
+
+/*
+ * Inner and leaf nodes.
+ *
+ * Leaf nodes are separate from the inner node size classes for two main reasons:
+ *
+ * 1) the value type might be different than something fitting into a pointer
+ * width type
+ * 2) Need to represent non-existing values in a key-type independent way.
+ *
+ * 1) is clearly worth being concerned about, but it's not clear 2) is as
+ * good. It might be better to just indicate non-existing entries the same way
+ * in inner nodes.
+ */
+typedef struct rt_node_inner_4
+{
+ rt_node_base_4 base;
+
+ /* 4 children, for key chunks */
+ rt_node *children[4];
+} rt_node_inner_4;
+
+typedef struct rt_node_leaf_4
+{
+ rt_node_base_4 base;
+
+ /* 4 values, for key chunks */
+ uint64 values[4];
+} rt_node_leaf_4;
+
+typedef struct rt_node_inner_16
+{
+ rt_node_base_16 base;
+
+ /* 16 children, for key chunks */
+ rt_node *children[16];
+} rt_node_inner_16;
+
+typedef struct rt_node_leaf_16
+{
+ rt_node_base_16 base;
+
+ /* 16 values, for key chunks */
+ uint64 values[16];
+} rt_node_leaf_16;
+
+typedef struct rt_node_inner_32
+{
+ rt_node_base_32 base;
+
+ /* 32 children, for key chunks */
+ rt_node *children[32];
+} rt_node_inner_32;
+
+typedef struct rt_node_leaf_32
+{
+ rt_node_base_32 base;
+
+ /* 32 values, for key chunks */
+ uint64 values[32];
+} rt_node_leaf_32;
+
+typedef struct rt_node_inner_128
+{
+ rt_node_base_128 base;
+
+ /* Slots for 128 children */
+ rt_node *children[128];
+} rt_node_inner_128;
+
+typedef struct rt_node_leaf_128
+{
+ rt_node_base_128 base;
+
+ /* Slots for 128 values */
+ uint64 values[128];
+} rt_node_leaf_128;
+
+typedef struct rt_node_inner_256
+{
+ rt_node_base_256 base;
+
+ /* Slots for 256 children */
+ rt_node *children[RT_NODE_MAX_SLOTS];
+} rt_node_inner_256;
+
+typedef struct rt_node_leaf_256
+{
+ rt_node_base_256 base;
+
+ /* Slots for 256 values */
+ uint64 values[RT_NODE_MAX_SLOTS];
+} rt_node_leaf_256;
+
+/* Information of each size class */
+typedef struct rt_node_info_elem
+{
+ const char *name;
+ int fanout;
+ Size inner_size;
+ Size leaf_size;
+} rt_node_info_elem;
+
+static rt_node_info_elem rt_node_info[RT_NODE_KIND_COUNT] = {
+
+ [RT_NODE_KIND_4] = {
+ .name = "radix tree node 4",
+ .fanout = 4,
+ .inner_size = sizeof(rt_node_inner_4),
+ .leaf_size = sizeof(rt_node_leaf_4),
+ },
+ [RT_NODE_KIND_16] = {
+ .name = "radix tree node 16",
+ .fanout = 16,
+ .inner_size = sizeof(rt_node_inner_16),
+ .leaf_size = sizeof(rt_node_leaf_16),
+ },
+ [RT_NODE_KIND_32] = {
+ .name = "radix tree node 32",
+ .fanout = 32,
+ .inner_size = sizeof(rt_node_inner_32),
+ .leaf_size = sizeof(rt_node_leaf_32),
+ },
+ [RT_NODE_KIND_128] = {
+ .name = "radix tree node 128",
+ .fanout = 128,
+ .inner_size = sizeof(rt_node_inner_128),
+ .leaf_size = sizeof(rt_node_leaf_128),
+ },
+ [RT_NODE_KIND_256] = {
+ .name = "radix tree node 256",
+ .fanout = 256,
+ .inner_size = sizeof(rt_node_inner_256),
+ .leaf_size = sizeof(rt_node_leaf_256),
+ },
+};
+
+/*
+ * Iteration support.
+ *
+ * Iterating the radix tree returns each pair of key and value in the ascending
+ * order of the key. To support this, we iterate over the nodes of each level.
+ * rt_iter_node_data struct is used to track the iteration within a node.
+ * rt_iter has the array of this struct, stack, in order to track the iteration
+ * of every level. During the iteration, we also construct the key to return
+ * whenever we update the node iteration information, e.g., when advancing the
+ * current index within the node or when moving to the next node at the same level.
+ */
+typedef struct rt_iter_node_data
+{
+ rt_node *node; /* current node being iterated */
+ int current_idx; /* current position. -1 for initial value */
+} rt_iter_node_data;
+
+struct rt_iter
+{
+ radix_tree *tree;
+
+ /* Track the iteration on nodes of each level */
+ rt_iter_node_data stack[RT_MAX_LEVEL];
+ int stack_len;
+
+ /* The key is being constructed during the iteration */
+ uint64 key;
+};
+
+/* A radix tree with nodes */
+struct radix_tree
+{
+ MemoryContext context;
+
+ rt_node *root;
+ uint64 max_val;
+ uint64 num_keys;
+
+ MemoryContextData *inner_slabs[RT_NODE_KIND_COUNT];
+ MemoryContextData *leaf_slabs[RT_NODE_KIND_COUNT];
+
+ /* statistics */
+ int32 cnt[RT_NODE_KIND_COUNT];
+};
+
+static rt_node *rt_node_grow(radix_tree *tree, rt_node *parent,
+ rt_node *node, uint64 key);
+static rt_node *rt_alloc_node(radix_tree *tree, rt_node_kind kind, bool inner);
+static void rt_free_node(radix_tree *tree, rt_node *node);
+static void rt_copy_node_common(rt_node *src, rt_node *dst);
+static void rt_extend(radix_tree *tree, uint64 key);
+static void rt_new_root(radix_tree *tree, uint64 key);
+
+/* search */
+static bool rt_node_search_inner(rt_node *node, uint64 key, rt_action action,
+ rt_node **child_p);
+static bool rt_node_search_leaf(rt_node *node, uint64 key, rt_action action,
+ uint64 *value_p);
+static bool rt_node_search(rt_node *node, uint64 key, rt_action action, void **slot_p);
+
+/* insertion */
+static rt_node *rt_node_add_new_child(radix_tree *tree, rt_node *parent,
+ rt_node *node, uint64 key);
+static int rt_node_prepare_insert(radix_tree *tree, rt_node *parent,
+ rt_node **node_p, uint64 key,
+ bool *will_replace_p);
+static void rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node,
+ uint64 key, rt_node *child, bool *replaced_p);
+static void rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
+ uint64 key, uint64 value, bool *replaced_p);
+
+/* iteration */
+static pg_attribute_always_inline void rt_iter_update_key(rt_iter *iter, uint8 chunk,
+ uint8 shift);
+static void *rt_node_iterate_next(rt_iter *iter, rt_iter_node_data *node_iter,
+ bool *found_p);
+static void rt_store_iter_node(rt_iter *iter, rt_iter_node_data *node_iter,
+ rt_node *node);
+static void rt_update_iter_stack(rt_iter *iter, int from);
+
+/* verification (available only with assertion) */
+static void rt_verify_node(rt_node *node);
+
+/*
+ * The fanout thresholds used to choose how to search for the key in the chunk
+ * array.
+ *
+ * On platforms where vector instructions are not available, we use the simple
+ * for-loop approach for all cases.
+ */
+#define RT_SIMPLE_LOOP_THRESHOLD 4 /* use simple for-loop */
+#define RT_VECRTORIZED_LOOP_THRESHOLD 32 /* use SIMD instructions */
+
+static pg_attribute_always_inline int
+search_chunk_array_eq(uint8 *chunks, uint8 key, uint8 node_fanout, uint8 node_count)
+{
+ if (node_fanout <= RT_SIMPLE_LOOP_THRESHOLD)
+ {
+ for (int i = 0; i < node_count; i++)
+ {
+ if (chunks[i] > key)
+ return -1;
+
+ if (chunks[i] == key)
+ return i;
+ }
+
+ return -1;
+ }
+ else if (node_fanout <= RT_VECRTORIZED_LOOP_THRESHOLD)
+ {
+ /*
+ * On Windows, even if we use SSE intrinsics, pg_rightmost_one_pos32
+ * is slow. So we guard with HAVE__BUILTIN_CTZ as well.
+ *
+ * XXX: once we have the correct interfaces to pg_bitutils.h for
+ * Windows we can remove the HAVE__BUILTIN_CTZ condition.
+ */
+#if defined(__SSE2__) && defined(HAVE__BUILTIN_CTZ)
+ int index = 0;
+ __m128i key_v = _mm_set1_epi8(key);
+
+ while (index < node_count)
+ {
+ __m128i data_v = _mm_loadu_si128((__m128i_u *) & (chunks[index]));
+ __m128i cmp_v = _mm_cmpeq_epi8(key_v, data_v);
+ uint32 bitfield = _mm_movemask_epi8(cmp_v);
+
+ bitfield &= ((UINT64CONST(1) << node_count) - 1);
+
+ if (bitfield)
+ {
+ index += pg_rightmost_one_pos32(bitfield);
+ break;
+ }
+
+ index += 16;
+ }
+
+ return (index < node_count) ? index : -1;
+#else
+ for (int i = 0; i < node_count; i++)
+ {
+ if (chunks[i] > key)
+ return -1;
+
+ if (chunks[i] == key)
+ return i;
+ }
+
+ return -1;
+#endif
+ }
+ else
+ elog(ERROR, "unsupported fanout size %u for chunk array search",
+ node_fanout);
+}
+
+/*
+ * This is a bit more complicated than search_chunk_array_eq(), because
+ * until recently no unsigned uint8 comparison instruction existed on x86. So
+ * we need to play some trickery using _mm_min_epu8() to effectively get
+ * <=. There never will be any equal elements in the current uses, but that's
+ * what we get here...
+ */
+static pg_attribute_always_inline int
+search_chunk_array_le(uint8 *chunks, uint8 key, uint8 node_fanout, uint8 node_count)
+{
+ if (node_fanout <= RT_SIMPLE_LOOP_THRESHOLD)
+ {
+ int index;
+
+ for (index = 0; index < node_count; index++)
+ {
+ if (chunks[index] >= key)
+ break;
+ }
+
+ return index;
+ }
+ else if (node_fanout <= RT_VECRTORIZED_LOOP_THRESHOLD)
+ {
+#if defined(__SSE2__) && defined(HAVE__BUILTIN_CTZ)
+ int index = 0;
+ bool found = false;
+ __m128i key_v = _mm_set1_epi8(key);
+
+ while (index < node_count)
+ {
+ __m128i data_v = _mm_loadu_si128((__m128i_u *) & (chunks[index]));
+ __m128i min_v = _mm_min_epu8(data_v, key_v);
+ __m128i cmp_v = _mm_cmpeq_epi8(key_v, min_v);
+ uint32 bitfield = _mm_movemask_epi8(cmp_v);
+
+ bitfield &= ((UINT64CONST(1) << node_count) - 1);
+
+ if (bitfield)
+ {
+ index += pg_rightmost_one_pos32(bitfield);
+ found = true;
+ break;
+ }
+
+ index += 16;
+ }
+
+ return found ? index : node_count;
+#else
+ int index;
+
+ for (index = 0; index < node_count; index++)
+ {
+ if (chunks[index] >= key)
+ break;
+ }
+
+ return index;
+#endif
+ }
+ else
+ elog(ERROR, "unsupported fanout size %u for chunk array search",
+ node_fanout);
+}
+
+/* Node support functions for all node types to get their children or values */
+
+/* Return the array of children in the inner node */
+static rt_node **
+rt_node_get_inner_children(rt_node *node)
+{
+ rt_node **children = NULL;
+
+ Assert(!IS_LEAF_NODE(node));
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ children = (rt_node **) ((rt_node_inner_4 *) node)->children;
+ break;
+ case RT_NODE_KIND_16:
+ children = (rt_node **) ((rt_node_inner_16 *) node)->children;
+ break;
+ case RT_NODE_KIND_32:
+ children = (rt_node **) ((rt_node_inner_32 *) node)->children;
+ break;
+ case RT_NODE_KIND_128:
+ children = (rt_node **) ((rt_node_inner_128 *) node)->children;
+ break;
+ case RT_NODE_KIND_256:
+ children = (rt_node **) ((rt_node_inner_256 *) node)->children;
+ break;
+ default:
+ elog(ERROR, "unexpected node type %u", node->kind);
+ }
+
+ return children;
+}
+
+/* Return the array of values in the leaf node */
+static uint64 *
+rt_node_get_leaf_values(rt_node *node)
+{
+ uint64 *values = NULL;
+
+ Assert(IS_LEAF_NODE(node));
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ values = ((rt_node_leaf_4 *) node)->values;
+ break;
+ case RT_NODE_KIND_16:
+ values = ((rt_node_leaf_16 *) node)->values;
+ break;
+ case RT_NODE_KIND_32:
+ values = ((rt_node_leaf_32 *) node)->values;
+ break;
+ case RT_NODE_KIND_128:
+ values = ((rt_node_leaf_128 *) node)->values;
+ break;
+ case RT_NODE_KIND_256:
+ values = ((rt_node_leaf_256 *) node)->values;
+ break;
+ default:
+ elog(ERROR, "unexpected node type %u", node->kind);
+ }
+
+ return values;
+}
+
+/*
+ * Node support functions for node-4, node-16, and node-32.
+ *
+ * These three node types have similar structure -- they have the array of chunks with
+ * different length and corresponding pointers or values depending on inner nodes or
+ * leaf nodes.
+ */
+#define ENSURE_CHUNK_ARRAY_NODE(node) \
+ Assert(((((rt_node*) node)->kind) == RT_NODE_KIND_4) || \
+ ((((rt_node*) node)->kind) == RT_NODE_KIND_16) || \
+ ((((rt_node*) node)->kind) == RT_NODE_KIND_32))
+
+/* Get the pointer to either the child or the value at 'idx' */
+static void *
+chunk_array_node_get_slot(rt_node *node, int idx)
+{
+ void *slot;
+
+ ENSURE_CHUNK_ARRAY_NODE(node);
+
+ if (IS_LEAF_NODE(node))
+ {
+ uint64 *values = rt_node_get_leaf_values(node);
+
+ slot = (void *) &(values[idx]);
+ }
+ else
+ {
+ rt_node **children = rt_node_get_inner_children(node);
+
+ slot = (void *) children[idx];
+ }
+
+ return slot;
+}
+
+/* Return the chunk array in the node */
+static uint8 *
+chunk_array_node_get_chunks(rt_node *node)
+{
+ uint8 *chunk = NULL;
+
+ ENSURE_CHUNK_ARRAY_NODE(node);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ chunk = (uint8 *) ((rt_node_base_4 *) node)->chunks;
+ break;
+ case RT_NODE_KIND_16:
+ chunk = (uint8 *) ((rt_node_base_16 *) node)->chunks;
+ break;
+ case RT_NODE_KIND_32:
+ chunk = (uint8 *) ((rt_node_base_32 *) node)->chunks;
+ break;
+ default:
+			/* this function doesn't support node-128 and node-256 */
+ elog(ERROR, "unsupported node type %d", node->kind);
+ }
+
+ return chunk;
+}
+
+/* Copy the contents of the node from 'src' to 'dst' */
+static void
+chunk_array_node_copy_chunks_and_slots(rt_node *src, rt_node *dst)
+{
+ uint8 *chunks_src,
+ *chunks_dst;
+
+ ENSURE_CHUNK_ARRAY_NODE(src);
+ ENSURE_CHUNK_ARRAY_NODE(dst);
+
+ /* Copy base type */
+ rt_copy_node_common(src, dst);
+
+ /* Copy chunk array */
+ chunks_src = chunk_array_node_get_chunks(src);
+ chunks_dst = chunk_array_node_get_chunks(dst);
+ memcpy(chunks_dst, chunks_src, sizeof(uint8) * src->count);
+
+ /* Copy children or values */
+ if (IS_LEAF_NODE(src))
+ {
+ uint64 *values_src,
+ *values_dst;
+
+ Assert(IS_LEAF_NODE(dst));
+ values_src = rt_node_get_leaf_values(src);
+ values_dst = rt_node_get_leaf_values(dst);
+ memcpy(values_dst, values_src, sizeof(uint64) * src->count);
+ }
+ else
+ {
+ rt_node **children_src,
+ **children_dst;
+
+ Assert(!IS_LEAF_NODE(dst));
+ children_src = rt_node_get_inner_children(src);
+ children_dst = rt_node_get_inner_children(dst);
+ memcpy(children_dst, children_src, sizeof(rt_node *) * src->count);
+ }
+}
+
+/*
+ * Return the index of the (sorted) chunk array where the chunk is inserted.
+ * Set *found_p to true if the chunk already exists in the array.
+ */
+static int
+chunk_array_node_find_insert_pos(rt_node *node, uint8 chunk, bool *found_p)
+{
+ uint8 *chunks;
+ int idx;
+
+ ENSURE_CHUNK_ARRAY_NODE(node);
+
+ *found_p = false;
+ chunks = chunk_array_node_get_chunks(node);
+
+ /* Find the insert pos */
+ idx = search_chunk_array_le(chunks, chunk,
+ rt_node_info[node->kind].fanout,
+ node->count);
+
+ if (idx < node->count && chunks[idx] == chunk)
+ *found_p = true;
+
+ return idx;
+}
+
+/* Delete the chunk at idx */
+static void
+chunk_array_node_delete(rt_node *node, int idx)
+{
+ uint8 *chunks = chunk_array_node_get_chunks(node);
+
+ /* delete the chunk from the chunk array */
+ memmove(&(chunks[idx]), &(chunks[idx + 1]),
+ sizeof(uint8) * (node->count - idx - 1));
+
+ /* delete either the value or the child as well */
+ if (IS_LEAF_NODE(node))
+ {
+ uint64 *values = rt_node_get_leaf_values(node);
+
+ memmove(&(values[idx]),
+ &(values[idx + 1]),
+ sizeof(uint64) * (node->count - idx - 1));
+ }
+ else
+ {
+ rt_node **children = rt_node_get_inner_children(node);
+
+ memmove(&(children[idx]),
+ &(children[idx + 1]),
+ sizeof(rt_node *) * (node->count - idx - 1));
+ }
+}
+
+/* Support functions for node-128 */
+
+/* Does the given chunk in the node have a value? */
+static pg_attribute_always_inline bool
+node_128_is_chunk_used(rt_node_base_128 *node, uint8 chunk)
+{
+ return node->slot_idxs[chunk] != RT_NODE_128_INVALID_IDX;
+}
+
+/* Is the slot in the node used? */
+static pg_attribute_always_inline bool
+node_128_is_slot_used(rt_node_base_128 *node, uint8 slot)
+{
+ return ((node->isset[RT_NODE_BITMAP_BYTE(slot)] & RT_NODE_BITMAP_BIT(slot)) != 0);
+}
+
+/* Get the pointer to either the child or the value corresponding to chunk */
+static void *
+node_128_get_slot(rt_node_base_128 *node, uint8 chunk)
+{
+ int slotpos;
+ void *slot;
+
+ slotpos = node->slot_idxs[chunk];
+ Assert(slotpos != RT_NODE_128_INVALID_IDX);
+
+ if (IS_LEAF_NODE(node))
+ slot = (void *) &(((rt_node_leaf_128 *) node)->values[slotpos]);
+ else
+ slot = (void *) (((rt_node_inner_128 *) node)->children[slotpos]);
+
+ return slot;
+}
+
+/* Delete the chunk in the node */
+static void
+node_128_delete(rt_node_base_128 *node, uint8 chunk)
+{
+ int slotpos = node->slot_idxs[chunk];
+
+ node->isset[RT_NODE_BITMAP_BYTE(slotpos)] &= ~(RT_NODE_BITMAP_BIT(slotpos));
+ node->slot_idxs[chunk] = RT_NODE_128_INVALID_IDX;
+}
+
+/* Return an unused slot in node-128 */
+static int
+node_128_find_unused_slot(rt_node_base_128 *node, uint8 chunk)
+{
+ int slotpos;
+
+ /*
+ * Find an unused slot. We iterate over the isset bitmap per byte then
+ * check each bit.
+ */
+ for (slotpos = 0; slotpos < RT_NODE_NSLOTS_BITS(128); slotpos++)
+ {
+ if (node->isset[slotpos] < 0xFF)
+ break;
+ }
+ Assert(slotpos < RT_NODE_NSLOTS_BITS(128));
+
+ slotpos *= BITS_PER_BYTE;
+ while (node_128_is_slot_used(node, slotpos))
+ slotpos++;
+
+ return slotpos;
+}
+
+
+/* XXX: duplicate with node_128_set_leaf */
+static void
+node_128_set_inner(rt_node_base_128 *node, uint8 chunk, rt_node *child)
+{
+ int slotpos;
+ rt_node_inner_128 *n128 = (rt_node_inner_128 *) node;
+
+ /* Overwrite the existing value if exists */
+ if (node_128_is_chunk_used(node, chunk))
+ {
+ n128->children[n128->base.slot_idxs[chunk]] = child;
+ return;
+ }
+
+ /* find unused slot */
+ slotpos = node_128_find_unused_slot(node, chunk);
+
+ n128->base.slot_idxs[chunk] = slotpos;
+ n128->base.isset[RT_NODE_BITMAP_BYTE(slotpos)] |= RT_NODE_BITMAP_BIT(slotpos);
+ n128->children[slotpos] = child;
+}
+
+/* Set the slot at the corresponding chunk */
+static void
+node_128_set_leaf(rt_node_base_128 *node, uint8 chunk, uint64 value)
+{
+ int slotpos;
+ rt_node_leaf_128 *n128 = (rt_node_leaf_128 *) node;
+
+ /* Overwrite the existing value if exists */
+ if (node_128_is_chunk_used(node, chunk))
+ {
+ n128->values[n128->base.slot_idxs[chunk]] = value;
+ return;
+ }
+
+ /* find unused slot */
+ slotpos = node_128_find_unused_slot(node, chunk);
+
+ n128->base.slot_idxs[chunk] = slotpos;
+ n128->base.isset[RT_NODE_BITMAP_BYTE(slotpos)] |= RT_NODE_BITMAP_BIT(slotpos);
+ n128->values[slotpos] = value;
+}
+
+/* Return true if the slot corresponding to the given chunk is in use */
+static bool
+node_256_is_chunk_used(rt_node_base_256 *node, uint8 chunk)
+{
+ return (node->isset[RT_NODE_BITMAP_BYTE(chunk)] & RT_NODE_BITMAP_BIT(chunk)) != 0;
+}
+
+/* Get the pointer to either the child or the value corresponding to chunk */
+static void *
+node_256_get_slot(rt_node_base_256 *node, uint8 chunk)
+{
+ void *slot;
+
+ Assert(node_256_is_chunk_used(node, chunk));
+ if (IS_LEAF_NODE(node))
+ slot = (void *) &(((rt_node_leaf_256 *) node)->values[chunk]);
+ else
+ slot = (void *) (((rt_node_inner_256 *) node)->children[chunk]);
+
+ return slot;
+}
+
+/* Set the child in the node-256 */
+static pg_attribute_always_inline void
+node_256_set_inner(rt_node_base_256 *node, uint8 chunk, rt_node *child)
+{
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+
+ n256->base.isset[RT_NODE_BITMAP_BYTE(chunk)] |= RT_NODE_BITMAP_BIT(chunk);
+ n256->children[chunk] = child;
+}
+
+/* Set the value in the node-256 */
+static pg_attribute_always_inline void
+node_256_set_leaf(rt_node_base_256 *node, uint8 chunk, uint64 value)
+{
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+
+ n256->base.isset[RT_NODE_BITMAP_BYTE(chunk)] |= RT_NODE_BITMAP_BIT(chunk);
+ n256->values[chunk] = value;
+}
+
+/* Clear the slot at the given chunk position */
+static pg_attribute_always_inline void
+node_256_delete(rt_node_base_256 *node, uint8 chunk)
+{
+ node->isset[RT_NODE_BITMAP_BYTE(chunk)] &= ~(RT_NODE_BITMAP_BIT(chunk));
+}
+
+/*
+ * Return the shift needed to store the given key.
+ */
+static pg_attribute_always_inline int
+key_get_shift(uint64 key)
+{
+ return (key == 0)
+ ? 0
+ : (pg_leftmost_one_pos64(key) / RT_NODE_SPAN) * RT_NODE_SPAN;
+}
+
+/*
+ * Return the max value stored in a node with the given shift.
+ */
+static uint64
+shift_get_max_val(int shift)
+{
+ if (shift == RT_MAX_SHIFT)
+ return UINT64_MAX;
+
+ return (UINT64CONST(1) << (shift + RT_NODE_SPAN)) - 1;
+}
+
+/*
+ * Allocate a new node with the given node kind.
+ */
+static rt_node *
+rt_alloc_node(radix_tree *tree, rt_node_kind kind, bool inner)
+{
+ rt_node *newnode;
+
+ if (inner)
+ newnode = (rt_node *) MemoryContextAllocZero(tree->inner_slabs[kind],
+ rt_node_info[kind].inner_size);
+ else
+ newnode = (rt_node *) MemoryContextAllocZero(tree->leaf_slabs[kind],
+ rt_node_info[kind].leaf_size);
+
+ newnode->kind = kind;
+
+ /* Initialize slot_idxs to invalid values */
+ if (kind == RT_NODE_KIND_128)
+ {
+ rt_node_base_128 *n128 = (rt_node_base_128 *) newnode;
+
+ memset(n128->slot_idxs, RT_NODE_128_INVALID_IDX, sizeof(n128->slot_idxs));
+ }
+
+ /* update the statistics */
+ tree->cnt[kind]++;
+
+ return newnode;
+}
+
+/* Free the given node */
+static void
+rt_free_node(radix_tree *tree, rt_node *node)
+{
+ /* If we're deleting the root node, make the tree empty */
+ if (tree->root == node)
+ tree->root = NULL;
+
+ /* update the statistics */
+ tree->cnt[node->kind]--;
+
+ Assert(tree->cnt[node->kind] >= 0);
+
+ pfree(node);
+}
+
+/* Copy the common fields without the node kind */
+static void
+rt_copy_node_common(rt_node *src, rt_node *dst)
+{
+ dst->shift = src->shift;
+ dst->chunk = src->chunk;
+ dst->count = src->count;
+}
+
+/*
+ * The radix tree doesn't have sufficient height. Extend the radix tree so it can
+ * store the key.
+ */
+static void
+rt_extend(radix_tree *tree, uint64 key)
+{
+ int target_shift;
+ int shift = tree->root->shift + RT_NODE_SPAN;
+
+ target_shift = key_get_shift(key);
+
+ /* Grow tree from 'shift' to 'target_shift' */
+ while (shift <= target_shift)
+ {
+ rt_node_inner_4 *node =
+ (rt_node_inner_4 *) rt_alloc_node(tree, RT_NODE_KIND_4, true);
+
+ node->base.n.count = 1;
+ node->base.n.shift = shift;
+ node->base.chunks[0] = 0;
+ node->children[0] = tree->root;
+
+ tree->root->chunk = 0;
+ tree->root = (rt_node *) node;
+
+ shift += RT_NODE_SPAN;
+ }
+
+ tree->max_val = shift_get_max_val(target_shift);
+}
+
+/*
+ * Wrapper for rt_node_search to search the pointer to the child node in the
+ * node.
+ *
+ * Return true if the corresponding child is found, otherwise return false. On success,
+ * it sets child_p.
+ */
+static bool
+rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **child_p)
+{
+ rt_node *child;
+
+ if (!rt_node_search(node, key, action, (void **) &child))
+ return false;
+
+ if (child_p)
+ *child_p = child;
+
+ return true;
+}
+
+static bool
+rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p)
+{
+ uint64 *value;
+
+ if (!rt_node_search(node, key, action, (void **) &value))
+ return false;
+
+ if (value_p)
+ *value_p = *value;
+
+ return true;
+}
+
+/*
+ * Return true if the corresponding slot is used, otherwise return false. On success,
+ * sets the pointer to the slot to slot_p.
+ */
+static bool
+rt_node_search(rt_node *node, uint64 key, rt_action action, void **slot_p)
+{
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool found = false;
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ case RT_NODE_KIND_16:
+ case RT_NODE_KIND_32:
+ {
+ int idx;
+ uint8 *chunks = chunk_array_node_get_chunks(node);
+
+ idx = search_chunk_array_eq(chunks, chunk,
+ rt_node_info[node->kind].fanout,
+ node->count);
+
+ if (idx < 0)
+ break;
+
+ found = true;
+ if (action == RT_ACTION_FIND)
+ *slot_p = chunk_array_node_get_slot(node, idx);
+ else /* RT_ACTION_DELETE */
+ chunk_array_node_delete(node, idx);
+
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ rt_node_base_128 *n128 = (rt_node_base_128 *) node;
+
+ /* If we find the chunk in the node, do the specified action */
+ if (node_128_is_chunk_used(n128, chunk))
+ {
+ if (action == RT_ACTION_FIND)
+ *slot_p = node_128_get_slot(n128, chunk);
+ else /* RT_ACTION_DELETE */
+ node_128_delete(n128, chunk);
+
+ found = true;
+ }
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_base_256 *n256 = (rt_node_base_256 *) node;
+
+ /* If we find the chunk in the node, do the specified action */
+ if (node_256_is_chunk_used(n256, chunk))
+ {
+ found = true;
+
+ if (action == RT_ACTION_FIND)
+ *slot_p = node_256_get_slot(n256, chunk);
+ else /* RT_ACTION_DELETE */
+ node_256_delete(n256, chunk);
+ }
+
+ break;
+ }
+ }
+
+ /* Update the statistics */
+ if (action == RT_ACTION_DELETE && found)
+ node->count--;
+
+ return found;
+}
+
+/*
+ * Create a new node as the root. Subordinate nodes will be created during
+ * the insertion.
+ */
+static void
+rt_new_root(radix_tree *tree, uint64 key)
+{
+ int shift = key_get_shift(key);
+ rt_node *node;
+
+ node = (rt_node *) rt_alloc_node(tree, RT_NODE_KIND_4, shift > 0);
+ node->shift = shift;
+ tree->max_val = shift_get_max_val(shift);
+ tree->root = node;
+}
+
+/* Create a new child node and insert it into 'node' */
+static rt_node *
+rt_node_add_new_child(radix_tree *tree, rt_node *parent, rt_node *node, uint64 key)
+{
+ uint8 newshift = node->shift - RT_NODE_SPAN;
+ rt_node *newchild =
+ (rt_node *) rt_alloc_node(tree, RT_NODE_KIND_4, newshift > 0);
+
+ Assert(!IS_LEAF_NODE(node));
+
+ newchild->shift = newshift;
+ newchild->chunk = RT_GET_KEY_CHUNK(key, node->shift);
+
+ rt_node_insert_inner(tree, parent, node, key, newchild, NULL);
+
+ return (rt_node *) newchild;
+}
+
+/*
+ * For upcoming insertions, we make sure that the node has enough free slots,
+ * growing the node if necessary. We set *will_replace_p to true if the chunk
+ * already exists and will be replaced on insertion.
+ *
+ * Return the index in the chunk array where the key can be inserted. We always
+ * return 0 in node-128 and node-256 cases.
+ */
+static int
+rt_node_prepare_insert(radix_tree *tree, rt_node *parent, rt_node **node_p,
+ uint64 key, bool *will_replace_p)
+{
+ rt_node *node = *node_p;
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool will_replace = false;
+ int idx = 0;
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ case RT_NODE_KIND_16:
+ case RT_NODE_KIND_32:
+ {
+ bool can_insert = false;
+
+ while ((node->kind == RT_NODE_KIND_4) ||
+ (node->kind == RT_NODE_KIND_16) ||
+ (node->kind == RT_NODE_KIND_32))
+ {
+ /* Find the insert pos */
+ idx = chunk_array_node_find_insert_pos(node, chunk, &will_replace);
+
+ if (will_replace || NODE_HAS_FREE_SLOT(node))
+ {
+ can_insert = true;
+ break;
+ }
+
+ node = rt_node_grow(tree, parent, node, key);
+ }
+
+ if (can_insert)
+ {
+ uint8 *chunks = chunk_array_node_get_chunks(node);
+
+ /*
+ * The node has unused slot for this chunk. If the key
+ * needs to be inserted in the middle of the array, we
+ * make space for the new key.
+ */
+ if (!will_replace && node->count != 0 && idx != node->count)
+ {
+ memmove(&(chunks[idx + 1]), &(chunks[idx]),
+ sizeof(uint8) * (node->count - idx));
+
+ /* shift either the values array or the children array */
+ if (IS_LEAF_NODE(node))
+ {
+ uint64 *values = rt_node_get_leaf_values(node);
+
+ memmove(&(values[idx + 1]),
+ &(values[idx]),
+ sizeof(uint64) * (node->count - idx));
+ }
+ else
+ {
+ rt_node **children = rt_node_get_inner_children(node);
+
+ memmove(&(children[idx + 1]),
+ &(children[idx]),
+ sizeof(rt_node *) * (node->count - idx));
+ }
+ }
+
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_128:
+ {
+ rt_node_base_128 *n128 = (rt_node_base_128 *) node;
+
+ if (node_128_is_chunk_used(n128, chunk) || NODE_HAS_FREE_SLOT(n128))
+ {
+ if (node_128_is_chunk_used(n128, chunk))
+ will_replace = true;
+
+ break;
+ }
+
+ node = rt_node_grow(tree, parent, node, key);
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_256:
+ {
+ rt_node_base_256 *n256 = (rt_node_base_256 *) node;
+
+ if (node_256_is_chunk_used(n256, chunk))
+ will_replace = true;
+
+ break;
+ }
+ }
+
+ *node_p = node;
+ *will_replace_p = will_replace;
+
+ return idx;
+}
+
+/* Insert the child to the inner node */
+static void
+rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node,
+ uint64 key, rt_node *child, bool *replaced_p)
+{
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ int idx;
+ bool replaced;
+
+ Assert(!IS_LEAF_NODE(node));
+
+ idx = rt_node_prepare_insert(tree, parent, &node, key, &replaced);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ case RT_NODE_KIND_16:
+ case RT_NODE_KIND_32:
+ {
+ uint8 *chunks = chunk_array_node_get_chunks(node);
+ rt_node **children = rt_node_get_inner_children(node);
+
+ Assert(idx >= 0);
+ chunks[idx] = chunk;
+ children[idx] = child;
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ node_128_set_inner((rt_node_base_128 *) node, chunk, child);
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ node_256_set_inner((rt_node_base_256 *) node, chunk, child);
+ break;
+ }
+ }
+
+ /* Update statistics */
+ if (!replaced)
+ node->count++;
+
+ if (replaced_p)
+ *replaced_p = replaced;
+
+ /*
+ * Done. Finally, verify the chunk and value is inserted or replaced
+ * properly in the node.
+ */
+ rt_verify_node(node);
+}
+
+/* Insert the value to the leaf node */
+static void
+rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
+ uint64 key, uint64 value, bool *replaced_p)
+{
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ int idx;
+ bool replaced;
+
+ Assert(IS_LEAF_NODE(node));
+
+ idx = rt_node_prepare_insert(tree, parent, &node, key, &replaced);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ case RT_NODE_KIND_16:
+ case RT_NODE_KIND_32:
+ {
+ uint8 *chunks = chunk_array_node_get_chunks(node);
+ uint64 *values = rt_node_get_leaf_values(node);
+
+ Assert(idx >= 0);
+ chunks[idx] = chunk;
+ values[idx] = value;
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ node_128_set_leaf((rt_node_base_128 *) node, chunk, value);
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ node_256_set_leaf((rt_node_base_256 *) node, chunk, value);
+ break;
+ }
+ }
+
+ /* Update statistics */
+ if (!replaced)
+ node->count++;
+
+ *replaced_p = replaced;
+
+ /*
+ * Done. Finally, verify the chunk and value is inserted or replaced
+ * properly in the node.
+ */
+ rt_verify_node(node);
+}
+
+/* Change the node type to the next larger one */
+static rt_node *
+rt_node_grow(radix_tree *tree, rt_node *parent, rt_node *node, uint64 key)
+{
+ rt_node *newnode = NULL;
+
+ Assert(node->count == rt_node_info[node->kind].fanout);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ newnode = rt_alloc_node(tree, RT_NODE_KIND_16,
+ IS_LEAF_NODE(node));
+
+ /* Copy both chunks and slots to the new node */
+ chunk_array_node_copy_chunks_and_slots(node, newnode);
+ break;
+ }
+ case RT_NODE_KIND_16:
+ {
+ newnode = rt_alloc_node(tree, RT_NODE_KIND_32,
+ IS_LEAF_NODE(node));
+
+ /* Copy both chunks and slots to the new node */
+ chunk_array_node_copy_chunks_and_slots(node, newnode);
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ newnode = rt_alloc_node(tree, RT_NODE_KIND_128,
+ IS_LEAF_NODE(node));
+
+ /* Copy both chunks and slots to the new node */
+ rt_copy_node_common(node, newnode);
+
+ if (IS_LEAF_NODE(node))
+ {
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+
+ for (int i = 0; i < node->count; i++)
+ node_128_set_leaf((rt_node_base_128 *) newnode,
+ n32->base.chunks[i], n32->values[i]);
+ }
+ else
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+
+ for (int i = 0; i < node->count; i++)
+ node_128_set_inner((rt_node_base_128 *) newnode,
+ n32->base.chunks[i], n32->children[i]);
+ }
+
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ rt_node_base_128 *n128 = (rt_node_base_128 *) node;
+ int cnt = 0;
+
+ newnode = rt_alloc_node(tree, RT_NODE_KIND_256,
+ IS_LEAF_NODE(node));
+
+ /* Copy both chunks and slots to the new node */
+ rt_copy_node_common(node, newnode);
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n128->n.count; i++)
+ {
+ void *slot;
+
+ if (!node_128_is_chunk_used(n128, i))
+ continue;
+
+ slot = node_128_get_slot(n128, i);
+
+ if (IS_LEAF_NODE(node))
+ node_256_set_leaf((rt_node_base_256 *) newnode, i,
+ *(uint64 *) slot);
+ else
+ node_256_set_inner((rt_node_base_256 *) newnode, i,
+ (rt_node *) slot);
+
+ cnt++;
+ }
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ elog(ERROR, "radix tree node-256 cannot grow");
+ break;
+ }
+
+ if (parent == node)
+ {
+ /* Replace the root node with the new large node */
+ tree->root = newnode;
+ }
+ else
+ {
+ /* Set the new node to the parent node */
+ rt_node_insert_inner(tree, NULL, parent, key, newnode, NULL);
+ }
+
+ /* Verify the node has grown properly */
+ rt_verify_node(newnode);
+
+ /* Free the old node */
+ rt_free_node(tree, node);
+
+ return newnode;
+}
+
+/*
+ * Create the radix tree in the given memory context and return it.
+ */
+radix_tree *
+rt_create(MemoryContext ctx)
+{
+ radix_tree *tree;
+ MemoryContext old_ctx;
+
+ old_ctx = MemoryContextSwitchTo(ctx);
+
+ tree = palloc(sizeof(radix_tree));
+ tree->context = ctx;
+ tree->root = NULL;
+ tree->max_val = 0;
+ tree->num_keys = 0;
+
+ /* Create the slab allocator for each size class */
+ for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ {
+ tree->inner_slabs[i] = SlabContextCreate(ctx,
+ rt_node_info[i].name,
+ SLAB_DEFAULT_BLOCK_SIZE,
+ rt_node_info[i].inner_size);
+ tree->leaf_slabs[i] = SlabContextCreate(ctx,
+ rt_node_info[i].name,
+ SLAB_DEFAULT_BLOCK_SIZE,
+ rt_node_info[i].leaf_size);
+ tree->cnt[i] = 0;
+ }
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return tree;
+}
+
+/*
+ * Free the given radix tree.
+ */
+void
+rt_free(radix_tree *tree)
+{
+ for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ {
+ MemoryContextDelete(tree->inner_slabs[i]);
+ MemoryContextDelete(tree->leaf_slabs[i]);
+ }
+
+ pfree(tree);
+}
+
+/*
+ * Set key to value. If the entry exists, we update its value to 'value' and return
+ * true. Returns false if entry doesn't yet exist.
+ */
+bool
+rt_set(radix_tree *tree, uint64 key, uint64 value)
+{
+ int shift;
+ bool replaced;
+ rt_node *node;
+ rt_node *parent = tree->root;
+
+ /* Empty tree, create the root */
+ if (!tree->root)
+ rt_new_root(tree, key);
+
+ /* Extend the tree if necessary */
+ if (key > tree->max_val)
+ rt_extend(tree, key);
+
+ Assert(tree->root);
+
+ shift = tree->root->shift;
+ node = tree->root;
+
+ while (shift > 0)
+ {
+ rt_node *child;
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ child = rt_node_add_new_child(tree, parent, node, key);
+
+ Assert(child);
+
+ parent = node;
+ node = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ /* arrived at a leaf */
+ Assert(IS_LEAF_NODE(node));
+
+ rt_node_insert_leaf(tree, parent, node, key, value, &replaced);
+
+ /* Update the statistics */
+ if (!replaced)
+ tree->num_keys++;
+
+ return replaced;
+}
+
+/*
+ * Search the given key in the radix tree. Return true if the key is successfully
+ * found, otherwise return false. On success, we set the value to *value_p, so
+ * it must not be NULL.
+ */
+bool
+rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
+{
+ rt_node *node;
+ int shift;
+
+ Assert(value_p);
+
+ if (!tree->root || key > tree->max_val)
+ return false;
+
+ node = tree->root;
+ shift = tree->root->shift;
+ while (shift > 0)
+ {
+ rt_node *child;
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ return false;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+	/* We reached a leaf node, search the corresponding slot */
+ Assert(IS_LEAF_NODE(node));
+
+ if (!rt_node_search_leaf(node, key, RT_ACTION_FIND, value_p))
+ return false;
+
+ return true;
+}
+
+/*
+ * Delete the given key from the radix tree. Return true if the key is found (and
+ * deleted), otherwise do nothing and return false.
+ */
+bool
+rt_delete(radix_tree *tree, uint64 key)
+{
+ rt_node *node;
+ int shift;
+ rt_node *stack[RT_MAX_LEVEL] = {0};
+ int level;
+ bool deleted;
+
+ if (!tree->root || key > tree->max_val)
+ return false;
+
+ /*
+	 * Descend the tree to search for the key while building a stack of the
+	 * nodes we visited.
+ */
+ node = tree->root;
+ shift = tree->root->shift;
+ level = 0;
+ while (shift >= 0)
+ {
+ rt_node *child;
+
+ /* Push the current node to the stack */
+ stack[level] = node;
+
+ if (IS_LEAF_NODE(node))
+ break;
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ return false;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ level++;
+ }
+
+ /*
+ * Delete the key from the leaf node and recursively delete internal nodes
+ * if necessary.
+ */
+ Assert(IS_LEAF_NODE(stack[level]));
+ while (level >= 0)
+ {
+ rt_node *node = stack[level--];
+
+ if (IS_LEAF_NODE(node))
+ deleted = rt_node_search_leaf(node, key, RT_ACTION_DELETE, NULL);
+ else
+ deleted = rt_node_search_inner(node, key, RT_ACTION_DELETE, NULL);
+
+ /* If the node didn't become empty, we stop deleting the key */
+ if (!IS_EMPTY_NODE(node))
+ break;
+
+ Assert(deleted);
+
+ /* The node became empty */
+ rt_free_node(tree, node);
+
+ }
+
+ /*
+ * If we eventually deleted the root node while recursively deleting empty
+ * nodes, we make the tree empty.
+ */
+ if (level == 0)
+ {
+ tree->root = NULL;
+ tree->max_val = 0;
+ }
+
+ if (deleted)
+ tree->num_keys--;
+
+ return deleted;
+}
+
+/* Create and return the iterator for the given radix tree */
+rt_iter *
+rt_begin_iterate(radix_tree *tree)
+{
+ MemoryContext old_ctx;
+ rt_iter *iter;
+ int top_level;
+
+ old_ctx = MemoryContextSwitchTo(tree->context);
+
+ iter = (rt_iter *) palloc0(sizeof(rt_iter));
+ iter->tree = tree;
+
+	/* empty tree */
+	if (!iter->tree->root)
+	{
+		MemoryContextSwitchTo(old_ctx);
+		return iter;
+	}
+
+ top_level = iter->tree->root->shift / RT_NODE_SPAN;
+
+ iter->stack_len = top_level;
+ iter->stack[top_level].node = iter->tree->root;
+ iter->stack[top_level].current_idx = -1;
+
+ /*
+	 * Descend to the leftmost leaf node from the root. The key is being
+ * constructed while descending to the leaf.
+ */
+ rt_update_iter_stack(iter, top_level);
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return iter;
+}
+
+/*
+ * Update the stack of the radix tree node while descending to the leaf from
+ * the 'from' level.
+ */
+static void
+rt_update_iter_stack(rt_iter *iter, int from)
+{
+ rt_node *node = iter->stack[from].node;
+ int level = from;
+
+ for (;;)
+ {
+ rt_iter_node_data *node_iter = &(iter->stack[level--]);
+ bool found;
+
+ /* Set the node to this level */
+ rt_store_iter_node(iter, node_iter, node);
+
+		/* Finish if we reached the leaf node */
+ if (IS_LEAF_NODE(node))
+ break;
+
+ /* Advance to the next slot in the node */
+ node = (rt_node *) rt_node_iterate_next(iter, node_iter, &found);
+
+ /*
+		 * Since we always get the first slot in the node, we must have
+		 * found the slot.
+ */
+ Assert(found);
+ }
+}
+
+/*
+ * Return true, setting *key_p and *value_p, if there is a next key. Otherwise,
+ * return false.
+ */
+bool
+rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p)
+{
+ bool found = false;
+ void *slot;
+
+	/* Empty tree */
+	if (!iter->tree->root)
+		return false;
+
+ for (;;)
+ {
+ rt_node *node;
+ rt_iter_node_data *node_iter;
+ int level;
+
+ /*
+		 * Iterate over the node at each level from the bottom of the tree,
+		 * i.e., the leaf node, until we find the next slot.
+ */
+ for (level = 0; level <= iter->stack_len; level++)
+ {
+ slot = rt_node_iterate_next(iter, &(iter->stack[level]), &found);
+
+ if (found)
+ break;
+ }
+
+ /* We could not find any new key-value pair, the iteration finished */
+ if (!found)
+ break;
+
+ /* found the next slot at the leaf node, return it */
+ if (level == 0)
+ {
+ *key_p = iter->key;
+ *value_p = *((uint64 *) slot);
+ break;
+ }
+
+ /*
+		 * We have advanced the slots in more than one node, including both
+		 * the leaf node and inner nodes. So we update the stack by
+		 * descending to the leftmost leaf node from this level.
+ */
+		node = (rt_node *) slot;
+ node_iter = &(iter->stack[level - 1]);
+ rt_store_iter_node(iter, node_iter, node);
+ rt_update_iter_stack(iter, level - 1);
+ }
+
+ return found;
+}
+
+void
+rt_end_iterate(rt_iter *iter)
+{
+ pfree(iter);
+}
+
+/*
+ * Iterate over the given radix tree node and return the node's next slot,
+ * setting *found_p to true, if any. Otherwise, set *found_p to false.
+ */
+static void *
+rt_node_iterate_next(rt_iter *iter, rt_iter_node_data *node_iter, bool *found_p)
+{
+ rt_node *node = node_iter->node;
+ void *slot = NULL;
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ case RT_NODE_KIND_16:
+ case RT_NODE_KIND_32:
+ {
+ node_iter->current_idx++;
+
+ if (node_iter->current_idx >= node->count)
+ goto not_found;
+
+ slot = chunk_array_node_get_slot(node, node_iter->current_idx);
+
+ /* Update the part of the key by the current chunk */
+ if (IS_LEAF_NODE(node))
+ {
+ uint8 *chunks = chunk_array_node_get_chunks(node);
+
+ rt_iter_update_key(iter, chunks[node_iter->current_idx], 0);
+ }
+
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ rt_node_base_128 *n128 = (rt_node_base_128 *) node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < 256; i++)
+ {
+ if (node_128_is_chunk_used(n128, i))
+ break;
+ }
+
+ if (i >= 256)
+ goto not_found;
+
+ node_iter->current_idx = i;
+ slot = node_128_get_slot(n128, i);
+
+ /* Update the part of the key */
+ if (IS_LEAF_NODE(n128))
+ rt_iter_update_key(iter, node_iter->current_idx, 0);
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_base_256 *n256 = (rt_node_base_256 *) node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < 256; i++)
+ {
+ if (node_256_is_chunk_used(n256, i))
+ break;
+ }
+
+ if (i >= 256)
+ goto not_found;
+
+ node_iter->current_idx = i;
+ slot = node_256_get_slot(n256, i);
+
+ /* Update the part of the key */
+ if (IS_LEAF_NODE(n256))
+ rt_iter_update_key(iter, node_iter->current_idx, 0);
+
+ break;
+ }
+ }
+
+ Assert(slot);
+ *found_p = true;
+ return slot;
+
+not_found:
+ *found_p = false;
+ return NULL;
+}
+
+/*
+ * Store the node in node_iter so we can begin iterating over the node.
+ * Also, update the part of the key using the chunk of the given node.
+ */
+static void
+rt_store_iter_node(rt_iter *iter, rt_iter_node_data *node_iter,
+ rt_node *node)
+{
+ node_iter->node = node;
+ node_iter->current_idx = -1;
+
+ rt_iter_update_key(iter, node->chunk, node->shift + RT_NODE_SPAN);
+}
+
+static pg_attribute_always_inline void
+rt_iter_update_key(rt_iter *iter, uint8 chunk, uint8 shift)
+{
+ iter->key &= ~(((uint64) RT_CHUNK_MASK) << shift);
+ iter->key |= (((uint64) chunk) << shift);
+}
+
+/*
+ * Return the number of keys in the radix tree.
+ */
+uint64
+rt_num_entries(radix_tree *tree)
+{
+ return tree->num_keys;
+}
+
+/*
+ * Return the amount of memory used by the radix tree.
+ */
+uint64
+rt_memory_usage(radix_tree *tree)
+{
+ Size total = 0;
+
+ for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ {
+ total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
+ total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
+ }
+
+ return total;
+}
+
+/*
+ * Verify the radix tree node.
+ */
+static void
+rt_verify_node(rt_node *node)
+{
+#ifdef USE_ASSERT_CHECKING
+ Assert(node->count >= 0);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ case RT_NODE_KIND_16:
+ case RT_NODE_KIND_32:
+ {
+ uint8 *chunks = chunk_array_node_get_chunks(node);
+
+ /* Check if the chunks in the node are sorted */
+ for (int i = 1; i < node->count; i++)
+ Assert(chunks[i - 1] < chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ rt_node_base_128 *n128 = (rt_node_base_128 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!node_128_is_chunk_used(n128, i))
+ continue;
+
+ /* Check if the corresponding slot is used */
+ Assert(node_128_is_slot_used(n128, n128->slot_idxs[i]));
+
+ cnt++;
+ }
+
+ Assert(n128->n.count == cnt);
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_base_256 *n256 = (rt_node_base_256 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RT_NODE_NSLOTS_BITS(RT_NODE_MAX_SLOTS); i++)
+ cnt += pg_popcount32(n256->isset[i]);
+
+			/* Check if the number of used chunks matches */
+ Assert(n256->n.count == cnt);
+
+ break;
+ }
+ }
+#endif
+}
+
+/***************** DEBUG FUNCTIONS *****************/
+#ifdef RT_DEBUG
+void
+rt_stats(radix_tree *tree)
+{
+ fprintf(stderr, "num_keys = %lu, height = %u, n4 = %u, n16 = %u,n32 = %u, n128 = %u, n256 = %u",
+ tree->num_keys,
+ tree->root->shift / RT_NODE_SPAN,
+ tree->cnt[0],
+ tree->cnt[1],
+ tree->cnt[2],
+ tree->cnt[3],
+ tree->cnt[4]);
+ /* rt_dump(tree); */
+}
+
+static void
+rt_print_slot(StringInfo buf, uint8 chunk, uint64 value, int idx, bool is_leaf, int level)
+{
+ char space[128] = {0};
+
+ if (level > 0)
+ sprintf(space, "%*c", level * 4, ' ');
+
+ if (is_leaf)
+ appendStringInfo(buf, "%s[%d] \"0x%X\" val(0x%lX) LEAF\n",
+ space,
+ idx,
+ chunk,
+ value);
+ else
+ appendStringInfo(buf, "%s[%d] \"0x%X\" -> ",
+ space,
+ idx,
+ chunk);
+}
+
+static void
+rt_dump_node(rt_node *node, int level, StringInfo buf, bool recurse)
+{
+ bool is_leaf = IS_LEAF_NODE(node);
+
+ appendStringInfo(buf, "[\"%s\" type %d, cnt %u, shift %u, chunk \"0x%X\"] chunks:\n",
+ IS_LEAF_NODE(node) ? "LEAF" : "INNR",
+ (node->kind == RT_NODE_KIND_4) ? 4 :
+ (node->kind == RT_NODE_KIND_32) ? 32 :
+ (node->kind == RT_NODE_KIND_128) ? 128 : 256,
+ node->count, node->shift, node->chunk);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ case RT_NODE_KIND_16:
+ case RT_NODE_KIND_32:
+ {
+ uint8 *chunks = chunk_array_node_get_chunks(node);
+
+ for (int i = 0; i < node->count; i++)
+ {
+ if (IS_LEAF_NODE(node))
+ {
+ uint64 *values = rt_node_get_leaf_values(node);
+
+ rt_print_slot(buf, chunks[i],
+ values[i],
+ i, is_leaf, level);
+ }
+ else
+ rt_print_slot(buf, chunks[i],
+ UINT64_MAX,
+ i, is_leaf, level);
+
+ if (!is_leaf)
+ {
+ if (recurse)
+ {
+ rt_node **children = rt_node_get_inner_children(node);
+ StringInfoData buf2;
+
+ initStringInfo(&buf2);
+ rt_dump_node(children[i],
+ level + 1, &buf2, recurse);
+ appendStringInfo(buf, "%s", buf2.data);
+ }
+ else
+ appendStringInfo(buf, "\n");
+ }
+ }
+
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ rt_node_base_128 *n128 = (rt_node_base_128 *) node;
+ uint8 *tmp = (uint8 *) n128->isset;
+
+ appendStringInfo(buf, "slot_idxs:");
+ for (int j = 0; j < 256; j++)
+ {
+ if (!node_128_is_chunk_used(n128, j))
+ continue;
+
+ appendStringInfo(buf, " [%d]=%d, ", j, n128->slot_idxs[j]);
+ }
+ appendStringInfo(buf, "\nisset-bitmap:");
+ for (int j = 0; j < 16; j++)
+ {
+ appendStringInfo(buf, "%X ", (uint8) tmp[j]);
+ }
+ appendStringInfo(buf, "\n");
+
+ for (int i = 0; i < 256; i++)
+ {
+ void *slot;
+
+ if (!node_128_is_chunk_used(n128, i))
+ continue;
+
+ slot = node_128_get_slot(n128, i);
+
+ if (is_leaf)
+ rt_print_slot(buf, i, *(uint64 *) slot,
+ i, is_leaf, level);
+ else
+ rt_print_slot(buf, i, UINT64_MAX, i, is_leaf, level);
+
+ if (!is_leaf)
+ {
+ if (recurse)
+ {
+ StringInfoData buf2;
+
+ initStringInfo(&buf2);
+ rt_dump_node((rt_node *) slot,
+ level + 1, &buf2, recurse);
+ appendStringInfo(buf, "%s", buf2.data);
+ }
+ else
+ appendStringInfo(buf, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_base_256 *n256 = (rt_node_base_256 *) node;
+
+ for (int i = 0; i < 256; i++)
+ {
+ void *slot;
+
+ if (!node_256_is_chunk_used(n256, i))
+ continue;
+
+ slot = node_256_get_slot(n256, i);
+
+ if (is_leaf)
+ rt_print_slot(buf, i, *(uint64 *) slot, i, is_leaf, level);
+ else
+ rt_print_slot(buf, i, UINT64_MAX, i, is_leaf, level);
+
+ if (!is_leaf)
+ {
+ if (recurse)
+ {
+ StringInfoData buf2;
+
+ initStringInfo(&buf2);
+ rt_dump_node((rt_node *) slot, level + 1, &buf2, recurse);
+ appendStringInfo(buf, "%s", buf2.data);
+ }
+ else
+ appendStringInfo(buf, "\n");
+ }
+ }
+ break;
+ }
+ }
+}
+
+void
+rt_dump_search(radix_tree *tree, uint64 key)
+{
+ StringInfoData buf;
+ rt_node *node;
+ int shift;
+ int level = 0;
+
+ elog(NOTICE, "-----------------------------------------------------------");
+ elog(NOTICE, "max_val = %lu (0x%lX)", tree->max_val, tree->max_val);
+
+ if (!tree->root)
+ {
+ elog(NOTICE, "tree is empty");
+ return;
+ }
+
+ if (key > tree->max_val)
+ {
+ elog(NOTICE, "key %lu (0x%lX) is larger than max val",
+ key, key);
+ return;
+ }
+
+ initStringInfo(&buf);
+ node = tree->root;
+ shift = tree->root->shift;
+ while (shift >= 0)
+ {
+ rt_node *child;
+
+ rt_dump_node(node, level, &buf, false);
+
+ if (IS_LEAF_NODE(node))
+ {
+ uint64 dummy;
+
+			/* We reached a leaf node, find the corresponding slot */
+ rt_node_search_leaf(node, key, RT_ACTION_FIND, &dummy);
+
+ break;
+ }
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ break;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ level++;
+ }
+
+ elog(NOTICE, "\n%s", buf.data);
+}
+
+void
+rt_dump(radix_tree *tree)
+{
+ StringInfoData buf;
+
+ initStringInfo(&buf);
+
+ elog(NOTICE, "-----------------------------------------------------------");
+ elog(NOTICE, "max_val = %lu", tree->max_val);
+ rt_dump_node(tree->root, 0, &buf, true);
+ elog(NOTICE, "\n%s", buf.data);
+ elog(NOTICE, "-----------------------------------------------------------");
+}
+#endif
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
new file mode 100644
index 0000000000..788eb13204
--- /dev/null
+++ b/src/include/lib/radixtree.h
@@ -0,0 +1,42 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.h
+ * Interface for radix tree.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/lib/radixtree.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef RADIXTREE_H
+#define RADIXTREE_H
+
+#include "postgres.h"
+
+#define RT_DEBUG 1
+
+typedef struct radix_tree radix_tree;
+typedef struct rt_iter rt_iter;
+
+extern radix_tree *rt_create(MemoryContext ctx);
+extern bool rt_search(radix_tree *tree, uint64 key, uint64 *val_p);
+extern bool rt_set(radix_tree *tree, uint64 key, uint64 val);
+extern bool rt_delete(radix_tree *tree, uint64 key);
+extern void rt_free(radix_tree *tree);
+extern uint64 rt_memory_usage(radix_tree *tree);
+extern uint64 rt_num_entries(radix_tree *tree);
+
+extern rt_iter *rt_begin_iterate(radix_tree *tree);
+extern bool rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p);
+extern void rt_end_iterate(rt_iter *iter);
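+
+/*
+ * A minimal usage sketch (based on the functions declared above; error
+ * handling and iteration omitted):
+ *
+ *     radix_tree *tree = rt_create(CurrentMemoryContext);
+ *     uint64      val;
+ *
+ *     rt_set(tree, 42, 4200);         (returns false: key was not present)
+ *     if (rt_search(tree, 42, &val))  (sets val to 4200)
+ *         elog(NOTICE, "val = " UINT64_FORMAT, val);
+ *     rt_delete(tree, 42);            (returns true: key was present)
+ *     rt_free(tree);
+ */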
+
+
+#ifdef RT_DEBUG
+extern void rt_dump(radix_tree *tree);
+extern void rt_dump_search(radix_tree *tree, uint64 key);
+extern void rt_stats(radix_tree *tree);
+#endif
+
+#endif /* RADIXTREE_H */
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 9090226daa..51b2514faf 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -24,6 +24,7 @@ SUBDIRS = \
test_parser \
test_pg_dump \
test_predtest \
+ test_radixtree \
test_rbtree \
test_regex \
test_rls_hooks \
diff --git a/src/test/modules/test_radixtree/.gitignore b/src/test/modules/test_radixtree/.gitignore
new file mode 100644
index 0000000000..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/test_radixtree/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_radixtree/Makefile b/src/test/modules/test_radixtree/Makefile
new file mode 100644
index 0000000000..da06b93da3
--- /dev/null
+++ b/src/test/modules/test_radixtree/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_radixtree/Makefile
+
+MODULE_big = test_radixtree
+OBJS = \
+ $(WIN32RES) \
+ test_radixtree.o
+PGFILEDESC = "test_radixtree - test code for src/backend/lib/radixtree.c"
+
+EXTENSION = test_radixtree
+DATA = test_radixtree--1.0.sql
+
+REGRESS = test_radixtree
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_radixtree
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_radixtree/README b/src/test/modules/test_radixtree/README
new file mode 100644
index 0000000000..a8b271869a
--- /dev/null
+++ b/src/test/modules/test_radixtree/README
@@ -0,0 +1,7 @@
+test_radixtree contains unit tests for the radix tree implementation in
+src/backend/lib/radixtree.c.
+
+The tests verify the correctness of the implementation, but they can also be
+used as a micro-benchmark. If you set the 'rt_test_stats' flag in
+test_radixtree.c, the tests will print extra information about execution time
+and memory usage.
diff --git a/src/test/modules/test_radixtree/expected/test_radixtree.out b/src/test/modules/test_radixtree/expected/test_radixtree.out
new file mode 100644
index 0000000000..cc6970c87c
--- /dev/null
+++ b/src/test/modules/test_radixtree/expected/test_radixtree.out
@@ -0,0 +1,28 @@
+CREATE EXTENSION test_radixtree;
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
+NOTICE: testing radix tree node types with shift "0"
+NOTICE: testing radix tree node types with shift "8"
+NOTICE: testing radix tree node types with shift "16"
+NOTICE: testing radix tree node types with shift "24"
+NOTICE: testing radix tree node types with shift "32"
+NOTICE: testing radix tree node types with shift "40"
+NOTICE: testing radix tree node types with shift "48"
+NOTICE: testing radix tree node types with shift "56"
+NOTICE: testing radix tree with pattern "all ones"
+NOTICE: testing radix tree with pattern "alternating bits"
+NOTICE: testing radix tree with pattern "clusters of ten"
+NOTICE: testing radix tree with pattern "clusters of hundred"
+NOTICE: testing radix tree with pattern "one-every-64k"
+NOTICE: testing radix tree with pattern "sparse"
+NOTICE: testing radix tree with pattern "single values, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^60"
+ test_radixtree
+----------------
+
+(1 row)
+
diff --git a/src/test/modules/test_radixtree/sql/test_radixtree.sql b/src/test/modules/test_radixtree/sql/test_radixtree.sql
new file mode 100644
index 0000000000..41ece5e9f5
--- /dev/null
+++ b/src/test/modules/test_radixtree/sql/test_radixtree.sql
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_radixtree;
+
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
diff --git a/src/test/modules/test_radixtree/test_radixtree--1.0.sql b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
new file mode 100644
index 0000000000..074a5a7ea7
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_radixtree/test_radixtree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_radixtree" to load this file. \quit
+
+CREATE FUNCTION test_radixtree()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
new file mode 100644
index 0000000000..671c3e0f47
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -0,0 +1,507 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_radixtree.c
+ * Test radixtree set data structure.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/test/modules/test_radixtree/test_radixtree.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "lib/radixtree.h"
+#include "miscadmin.h"
+#include "nodes/bitmapset.h"
+#include "storage/block.h"
+#include "storage/itemptr.h"
+#include "utils/memutils.h"
+#include "utils/timestamp.h"
+
+#define UINT64_HEX_FORMAT "%" INT64_MODIFIER "X"
+
+/*
+ * If you enable this, the "pattern" tests will print information about
+ * how long populating, probing, and iterating the test set takes, and
+ * how much memory the test set consumed. That can be used as a
+ * micro-benchmark of various operations and input patterns (if you do
+ * that, you might want to increase the number of values used in each
+ * test to reduce noise).
+ *
+ * The information is printed to the server's stderr, mostly because
+ * that's where MemoryContextStats() output goes.
+ */
+static const bool rt_test_stats = false;
+
+/* The maximum number of entries each node type can have */
+static int rt_node_max_entries[] = {
+ 4, /* RT_NODE_KIND_4 */
+ 16, /* RT_NODE_KIND_16 */
+ 32, /* RT_NODE_KIND_32 */
+ 128, /* RT_NODE_KIND_128 */
+ 256 /* RT_NODE_KIND_256 */
+};
+
+/*
+ * A struct to define a pattern of integers, for use with the test_pattern()
+ * function.
+ */
+typedef struct
+{
+ char *test_name; /* short name of the test, for humans */
+ char *pattern_str; /* a bit pattern */
+ uint64 spacing; /* pattern repeats at this interval */
+ uint64 num_values; /* number of integers to set in total */
+} test_spec;
+
+/* Test patterns borrowed from test_integerset.c */
+static const test_spec test_specs[] = {
+ {
+ "all ones", "1111111111",
+ 10, 1000000
+ },
+ {
+ "alternating bits", "0101010101",
+ 10, 1000000
+ },
+ {
+ "clusters of ten", "1111111111",
+ 10000, 1000000
+ },
+ {
+ "clusters of hundred",
+ "1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111",
+ 10000, 10000000
+ },
+ {
+ "one-every-64k", "1",
+ 65536, 1000000
+ },
+ {
+ "sparse", "100000000000000000000000000000001",
+ 10000000, 1000000
+ },
+ {
+ "single values, distance > 2^32", "1",
+ UINT64CONST(10000000000), 100000
+ },
+ {
+ "clusters, distance > 2^32", "10101010",
+ UINT64CONST(10000000000), 1000000
+ },
+ {
+ "clusters, distance > 2^60", "10101010",
+ UINT64CONST(2000000000000000000),
+ 23 /* can't be much higher than this, or we
+ * overflow uint64 */
+ }
+};
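+
+/*
+ * For example, given how test_pattern() below expands a spec, the
+ * "clusters of ten" entry ("1111111111", spacing 10000) sets the keys
+ * 0..9, 10000..10009, 20000..20009, and so on: each '1' in pattern_str
+ * becomes an offset within a cluster, and the pattern repeats every
+ * 'spacing' keys until 'num_values' keys have been inserted.
+ */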
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(test_radixtree);
+
+static void
+test_empty(void)
+{
+ radix_tree *radixtree;
+ uint64 dummy;
+
+ radixtree = rt_create(CurrentMemoryContext);
+
+ if (rt_search(radixtree, 0, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, 1, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, PG_UINT64_MAX, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_num_entries(radixtree) != 0)
+ elog(ERROR, "rt_num_entries on empty tree return non-zero");
+
+ rt_free(radixtree);
+}
+
+/*
+ * Check if keys from start to end with the shift exist in the tree.
+ */
+static void
+check_search_on_node(radix_tree *radixtree, uint8 shift, int start, int end)
+{
+ for (int i = start; i < end; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ uint64 val;
+
+ if (!rt_search(radixtree, key, &val))
+ elog(ERROR, "key 0x" UINT64_HEX_FORMAT " is not found on node-%d",
+ key, end);
+ if (val != key)
+ {
+ rt_dump(radixtree);
+ elog(ERROR, "rt_search with key 0x" UINT64_HEX_FORMAT " returns 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ key, val, key);
+ }
+ }
+}
+
+static void
+test_node_types_insert(radix_tree *radixtree, uint8 shift)
+{
+ uint64 num_entries;
+
+ for (int i = 0; i < 256; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_set(radixtree, key, key);
+
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " found", key);
+
+ for (int j = 0; j < lengthof(rt_node_max_entries); j++)
+ {
+ /*
+ * After filling all slots in each node type, check if the values are
+ * stored properly.
+ */
+ if (i == (rt_node_max_entries[j] - 1))
+ {
+ check_search_on_node(radixtree, shift,
+ (j == 0) ? 0 : rt_node_max_entries[j - 1],
+ rt_node_max_entries[j]);
+ break;
+ }
+ }
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ if (num_entries != 256)
+ elog(ERROR,
+ "rt_num_entries returned" UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(256));
+}
+
+static void
+test_node_types_delete(radix_tree *radixtree, uint8 shift)
+{
+ uint64 num_entries;
+
+ for (int i = 0; i < 256; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_delete(radixtree, key);
+
+ if (!found)
+ elog(ERROR, "inserted key 0x" UINT64_HEX_FORMAT " is not found", key);
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ /* The tree must be empty */
+ if (num_entries != 0)
+ elog(ERROR,
+ "rt_num_entries returned" UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(256));
+}
+
+/*
+ * Test for inserting and deleting key-value pairs to each node type at the given shift
+ * level.
+ */
+static void
+test_node_types(uint8 shift)
+{
+ radix_tree *radixtree;
+
+ elog(NOTICE, "testing radix tree node types with shift \"%d\"", shift);
+
+ radixtree = rt_create(CurrentMemoryContext);
+
+ /*
+ * Insert and search entries for every node type at the 'shift' level,
+ * then delete all entries to make it empty, and insert and search
+ * entries again.
+ */
+ test_node_types_insert(radixtree, shift);
+ test_node_types_delete(radixtree, shift);
+ test_node_types_insert(radixtree, shift);
+
+ rt_free(radixtree);
+}
+
+/*
+ * Test with a repeating pattern, defined by the 'spec'.
+ */
+static void
+test_pattern(const test_spec *spec)
+{
+ radix_tree *radixtree;
+ rt_iter *iter;
+ MemoryContext radixtree_ctx;
+ TimestampTz starttime;
+ TimestampTz endtime;
+ uint64 n;
+ uint64 last_int;
+ uint64 ndeleted;
+ uint64 nbefore;
+ uint64 nafter;
+ int patternlen;
+ uint64 *pattern_values;
+ uint64 pattern_num_values;
+
+ elog(NOTICE, "testing radix tree with pattern \"%s\"", spec->test_name);
+ if (rt_test_stats)
+ fprintf(stderr, "-----\ntesting radix tree with pattern \"%s\"\n", spec->test_name);
+
+ /* Pre-process the pattern, creating an array of integers from it. */
+ patternlen = strlen(spec->pattern_str);
+ pattern_values = palloc(patternlen * sizeof(uint64));
+ pattern_num_values = 0;
+ for (int i = 0; i < patternlen; i++)
+ {
+ if (spec->pattern_str[i] == '1')
+ pattern_values[pattern_num_values++] = i;
+ }
+
+ /*
+ * Allocate the radix tree.
+ *
+ * Allocate it in a separate memory context, so that we can print its
+ * memory usage easily.
+ */
+ radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
+ "radixtree test",
+ ALLOCSET_SMALL_SIZES);
+ MemoryContextSetIdentifier(radixtree_ctx, spec->test_name);
+ radixtree = rt_create(radixtree_ctx);
+
+ /*
+ * Add values to the set.
+ */
+ starttime = GetCurrentTimestamp();
+
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ uint64 x = 0;
+
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ bool found;
+
+ x = last_int + pattern_values[i];
+
+ found = rt_set(radixtree, x, x);
+
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " found", x);
+
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+
+ endtime = GetCurrentTimestamp();
+
+ if (rt_test_stats)
+ fprintf(stderr, "added " UINT64_FORMAT " values in %d ms\n",
+ spec->num_values, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Print stats on the amount of memory used.
+ *
+ * We print the usage reported by rt_memory_usage(), as well as the
+ * stats from the memory context. They should be in the same ballpark,
+ * but it's hard to automate testing that, so if you're making changes to
+ * the implementation, just observe that manually.
+ */
+ if (rt_test_stats)
+ {
+ uint64 mem_usage;
+
+ /*
+ * Also print memory usage as reported by rt_memory_usage(). It
+ * should be in the same ballpark as the usage reported by
+ * MemoryContextStats().
+ */
+ mem_usage = rt_memory_usage(radixtree);
+ fprintf(stderr, "rt_memory_usage() reported " UINT64_FORMAT " (%0.2f bytes / integer)\n",
+ mem_usage, (double) mem_usage / spec->num_values);
+
+ MemoryContextStats(radixtree_ctx);
+ }
+
+ /* Check that rt_num_entries works */
+ n = rt_num_entries(radixtree);
+ if (n != spec->num_values)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT, n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_search()
+ */
+ starttime = GetCurrentTimestamp();
+
+ for (n = 0; n < 100000; n++)
+ {
+ bool found;
+ bool expected;
+ uint64 x;
+ uint64 v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Do we expect this value to be present in the set? */
+ if (x >= last_int)
+ expected = false;
+ else
+ {
+ uint64 idx = x % spec->spacing;
+
+ if (idx >= patternlen)
+ expected = false;
+ else if (spec->pattern_str[idx] == '1')
+ expected = true;
+ else
+ expected = false;
+ }
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (found != expected)
+ elog(ERROR, "mismatch at 0x" UINT64_HEX_FORMAT ": %d vs %d", x, found, expected);
+ if (found && (v != x))
+ elog(ERROR, "found 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ v, x);
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "probed " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Test iterator
+ */
+ starttime = GetCurrentTimestamp();
+
+ iter = rt_begin_iterate(radixtree);
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ uint64 expected = last_int + pattern_values[i];
+ uint64 x;
+ uint64 val;
+
+ if (!rt_iterate_next(iter, &x, &val))
+ break;
+
+ if (x != expected)
+ elog(ERROR,
+ "iterate returned wrong key; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d",
+ x, expected, i);
+ if (val != expected)
+ elog(ERROR,
+ "iterate returned wrong value; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d", x, expected, i);
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "iterated " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ if (n < spec->num_values)
+ elog(ERROR, "iterator stopped short after " UINT64_FORMAT " entries, expected " UINT64_FORMAT, n, spec->num_values);
+ if (n > spec->num_values)
+ elog(ERROR, "iterator returned " UINT64_FORMAT " entries, " UINT64_FORMAT " was expected", n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_delete()
+ */
+ starttime = GetCurrentTimestamp();
+
+ nbefore = rt_num_entries(radixtree);
+ ndeleted = 0;
+ for (n = 0; n < 100000; n++)
+ {
+ bool found;
+ uint64 x;
+ uint64 v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (!found)
+ continue;
+
+ /* If the key is found, delete it and check again */
+ if (!rt_delete(radixtree, x))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_search(radixtree, x, &v))
+ elog(ERROR, "found deleted key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_delete(radixtree, x))
+ elog(ERROR, "deleted already-deleted key 0x" UINT64_HEX_FORMAT, x);
+
+ ndeleted++;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "deleted " UINT64_FORMAT " values in %d ms\n",
+ ndeleted, (int) (endtime - starttime) / 1000);
+
+ nafter = rt_num_entries(radixtree);
+
+ /* Check that rt_num_entries works */
+ if ((nbefore - ndeleted) != nafter)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT "after " UINT64_FORMAT " deletion",
+ nafter, (nbefore - ndeleted), ndeleted);
+
+ MemoryContextDelete(radixtree_ctx);
+}
+
+Datum
+test_radixtree(PG_FUNCTION_ARGS)
+{
+ test_empty();
+
+ for (int shift = 0; shift <= (64 - 8); shift += 8)
+ test_node_types(shift);
+
+ /* Test different test patterns, with lots of entries */
+ for (int i = 0; i < lengthof(test_specs); i++)
+ test_pattern(&test_specs[i]);
+
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_radixtree/test_radixtree.control b/src/test/modules/test_radixtree/test_radixtree.control
new file mode 100644
index 0000000000..e53f2a3e0c
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.control
@@ -0,0 +1,4 @@
+comment = 'Test code for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/test_radixtree'
+relocatable = true
On Fri, Jul 22, 2022 at 10:43 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Tue, Jul 19, 2022 at 1:30 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Tue, Jul 19, 2022 at 9:11 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
I’d like to keep the first version simple. We can improve it and add
more optimizations later. Using radix tree for vacuum TID storage
would still be a big win comparing to using a flat array, even without
all these optimizations. In terms of single-value leaves method, I'm
also concerned about an extra pointer traversal and extra memory
allocation. It's most flexible but multi-value leaves method is also
flexible enough for many use cases. Using the single-value method
seems to be too much as the first step for me.
Overall, using 64-bit keys and 64-bit values would be a reasonable
choice for me as the first step. It can cover wider use cases
including vacuum TID use cases. And possibly it can cover use cases by
combining a hash table or using tree of tree, for example.
These two aspects would also bring it closer to Andres' prototype, which 1) makes review easier and 2) easier to preserve optimization work already done, so +1 from me.
Thanks.
I've updated the patch. It now implements 64-bit keys, 64-bit values,
and the multi-value leaves method. I've tried to remove duplicated
codes but we might find a better way to do that.
With the recent changes related to simd, I'm going to split the patch
into at least two parts: introduce other simd optimized functions used
by the radix tree and the radix tree implementation. Particularly we
need two functions for radix tree: a function like pg_lfind32 but for
8 bits integers and return the index, and a function that returns the
index of the first element that is >= key.
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
On Mon, Aug 15, 2022 at 12:39 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Fri, Jul 22, 2022 at 10:43 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Tue, Jul 19, 2022 at 1:30 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Tue, Jul 19, 2022 at 9:11 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
I’d like to keep the first version simple. We can improve it and add
more optimizations later. Using radix tree for vacuum TID storage
would still be a big win comparing to using a flat array, even without
all these optimizations. In terms of single-value leaves method, I'm
also concerned about an extra pointer traversal and extra memory
allocation. It's most flexible but multi-value leaves method is also
flexible enough for many use cases. Using the single-value method
seems to be too much as the first step for me.
Overall, using 64-bit keys and 64-bit values would be a reasonable
choice for me as the first step. It can cover wider use cases
including vacuum TID use cases. And possibly it can cover use cases by
combining a hash table or using tree of tree, for example.
These two aspects would also bring it closer to Andres' prototype, which 1) makes review easier and 2) easier to preserve optimization work already done, so +1 from me.
Thanks.
I've updated the patch. It now implements 64-bit keys, 64-bit values,
and the multi-value leaves method. I've tried to remove duplicated
codes but we might find a better way to do that.
With the recent changes related to simd, I'm going to split the patch
into at least two parts: introduce other simd optimized functions used
by the radix tree and the radix tree implementation. Particularly we
need two functions for radix tree: a function like pg_lfind32 but for
8 bits integers and return the index, and a function that returns the
index of the first element that is >= key.
I recommend looking at
/messages/by-id/CAFBsxsESLUyJ5spfOSyPrOvKUEYYNqsBosue9SV1j8ecgNXSKA@mail.gmail.com
since I did the work just now for searching bytes and returning a
bool, both = and <=. Should be pretty close. Also, I believe if you
left this for last as a possible refactoring, it might save some work.
In any case, I'll take a look at the latest patch next month.
--
John Naylor
EDB: http://www.enterprisedb.com
On Mon, Aug 15, 2022 at 10:39 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Mon, Aug 15, 2022 at 12:39 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Fri, Jul 22, 2022 at 10:43 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Tue, Jul 19, 2022 at 1:30 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Tue, Jul 19, 2022 at 9:11 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
I’d like to keep the first version simple. We can improve it and add
more optimizations later. Using radix tree for vacuum TID storage
would still be a big win comparing to using a flat array, even without
all these optimizations. In terms of single-value leaves method, I'm
also concerned about an extra pointer traversal and extra memory
allocation. It's most flexible but multi-value leaves method is also
flexible enough for many use cases. Using the single-value method
seems to be too much as the first step for me.
Overall, using 64-bit keys and 64-bit values would be a reasonable
choice for me as the first step. It can cover wider use cases
including vacuum TID use cases. And possibly it can cover use cases by
combining a hash table or using tree of tree, for example.
These two aspects would also bring it closer to Andres' prototype, which 1) makes review easier and 2) easier to preserve optimization work already done, so +1 from me.
Thanks.
I've updated the patch. It now implements 64-bit keys, 64-bit values,
and the multi-value leaves method. I've tried to remove duplicated
codes but we might find a better way to do that.
With the recent changes related to simd, I'm going to split the patch
into at least two parts: introduce other simd optimized functions used
by the radix tree and the radix tree implementation. Particularly we
need two functions for radix tree: a function like pg_lfind32 but for
8 bits integers and return the index, and a function that returns the
index of the first element that is >= key.
I recommend looking at
/messages/by-id/CAFBsxsESLUyJ5spfOSyPrOvKUEYYNqsBosue9SV1j8ecgNXSKA@mail.gmail.com
since I did the work just now for searching bytes and returning a
bool, both = and <=. Should be pretty close. Also, I believe if you
left this for last as a possible refactoring, it might save some work.
In any case, I'll take a look at the latest patch next month.
I've updated the radix tree patch. It's now separated into two patches.
0001 patch introduces pg_lsearch8() and pg_lsearch8_ge() (we may find
better names) that are similar to the pg_lfind8() family but they
return the index of the key in the vector instead of true/false. The
patch includes regression tests.
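For reference, the helpers in the attached 0001 patch have the following
signatures (see src/include/port/pg_lfind.h in that patch):

    static inline int pg_lsearch8(uint8 key, uint8 *base, uint32 nelem);
        /* index of 'key' in 'base', or -1 if not found */
    static inline int pg_lsearch8_ge(uint8 key, uint8 *base, uint32 nelem);
        /* index of the first element >= 'key', or nelem if none (assumes sorted input) */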
0002 patch is the main radix tree implementation. I've removed some
duplicated code for node manipulation. For instance, since node-4,
node-16, and node-32 have a similar structure with different fanouts,
I introduced common functions for them.
In addition to these two patches, I've attached a third patch. It's not
part of the radix tree implementation but introduces a contrib module,
bench_radix_tree, a tool for radix tree performance benchmarking. It
measures loading and lookup performance of both the radix tree and a
flat array.
Regards,
--
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachments:
v6-0001-Support-pg_lsearch8_eq-and-pg_lsearch8_ge.patchapplication/x-patch; name=v6-0001-Support-pg_lsearch8_eq-and-pg_lsearch8_ge.patchDownload
From 5d0115b068ecb01d791eab5f8a78a6d25b9cf45c Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 14 Sep 2022 12:38:01 +0000
Subject: [PATCH v6 1/3] Support pg_lsearch8_eq and pg_lsearch8_ge
---
src/include/port/pg_lfind.h | 71 ++++++++
src/include/port/simd.h | 155 +++++++++++++++++-
.../test_lfind/expected/test_lfind.out | 12 ++
.../modules/test_lfind/sql/test_lfind.sql | 2 +
.../modules/test_lfind/test_lfind--1.0.sql | 8 +
src/test/modules/test_lfind/test_lfind.c | 139 ++++++++++++++++
6 files changed, 378 insertions(+), 9 deletions(-)
diff --git a/src/include/port/pg_lfind.h b/src/include/port/pg_lfind.h
index 0625cac6b5..583f204763 100644
--- a/src/include/port/pg_lfind.h
+++ b/src/include/port/pg_lfind.h
@@ -80,6 +80,77 @@ pg_lfind8_le(uint8 key, uint8 *base, uint32 nelem)
return false;
}
+/*
+ * pg_lsearch8
+ *
+ * Return the index of the element in 'base' that is equal to 'key'; otherwise return
+ * -1.
+ */
+static inline int
+pg_lsearch8(uint8 key, uint8 *base, uint32 nelem)
+{
+ uint32 i;
+
+ /* round down to multiple of vector length */
+ uint32 tail_idx = nelem & ~(sizeof(Vector8) - 1);
+ Vector8 chunk;
+
+ for (i = 0; i < tail_idx; i += sizeof(Vector8))
+ {
+ int idx;
+
+ vector8_load(&chunk, &base[i]);
+ if ((idx = vector8_search_eq(chunk, key)) != -1)
+ return i + idx;
+ }
+
+ /* Process the remaining elements one at a time. */
+ for (; i < nelem; i++)
+ {
+ if (key == base[i])
+ return i;
+ }
+
+ return -1;
+}
+
+
+/*
+ * pg_lsearch8_ge
+ *
+ * Return the index of the first element in 'base' that is greater than or equal to
+ * 'key'. Return nelem if there is no such element.
+ *
+ * Note that this function assumes the elements in 'base' are sorted.
+ */
+static inline int
+pg_lsearch8_ge(uint8 key, uint8 *base, uint32 nelem)
+{
+ uint32 i;
+
+ /* round down to multiple of vector length */
+ uint32 tail_idx = nelem & ~(sizeof(Vector8) - 1);
+ Vector8 chunk;
+
+ for (i = 0; i < tail_idx; i += sizeof(Vector8))
+ {
+ int idx;
+
+ vector8_load(&chunk, &base[i]);
+ if ((idx = vector8_search_ge(chunk, key)) != sizeof(Vector8))
+ return i + idx;
+ }
+
+ /* Process the remaining elements one at a time. */
+ for (; i < nelem; i++)
+ {
+ if (base[i] >= key)
+ break;
+ }
+
+ return i;
+}
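+
+/*
+ * Example of the expected behavior of the two helpers above: with a sorted
+ * array base = {1, 3, 5, 7} and nelem = 4, pg_lsearch8(5, base, 4) returns 2
+ * and pg_lsearch8(4, base, 4) returns -1, while pg_lsearch8_ge(4, base, 4)
+ * returns 2 (the index of 5) and pg_lsearch8_ge(8, base, 4) returns 4
+ * (== nelem).
+ */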
+
/*
* pg_lfind32
*
diff --git a/src/include/port/simd.h b/src/include/port/simd.h
index 61ae4ecf60..e2a99578a5 100644
--- a/src/include/port/simd.h
+++ b/src/include/port/simd.h
@@ -18,6 +18,8 @@
#ifndef SIMD_H
#define SIMD_H
+#include "port/pg_bitutils.h"
+
#if (defined(__x86_64__) || defined(_M_AMD64))
/*
* SSE2 instructions are part of the spec for the 64-bit x86 ISA. We assume
@@ -88,14 +90,9 @@ static inline Vector32 vector32_or(const Vector32 v1, const Vector32 v2);
static inline Vector8 vector8_ssub(const Vector8 v1, const Vector8 v2);
#endif
-/*
- * comparisons between vectors
- *
- * Note: These return a vector rather than boolean, which is why we don't
- * have non-SIMD implementations.
- */
-#ifndef USE_NO_SIMD
+/* comparisons between vectors */
static inline Vector8 vector8_eq(const Vector8 v1, const Vector8 v2);
+#ifndef USE_NO_SIMD
static inline Vector32 vector32_eq(const Vector32 v1, const Vector32 v2);
#endif
@@ -277,6 +274,140 @@ vector8_is_highbit_set(const Vector8 v)
#endif
}
+/*
+ * Return a bitmask built from the high bit of each element.
+ */
+static inline uint32
+vector8_highbit_mask(const Vector8 v)
+{
+#ifdef USE_SSE2
+ return (uint32) _mm_movemask_epi8(v);
+#elif defined(USE_NEON)
+ static const uint8 mask[16] = {
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ };
+
+ uint8x16_t masked = vandq_u8(vld1q_u8(mask), (uint8x16_t) vshrq_n_s8(v, 7));
+ uint8x16_t maskedhi = vextq_u8(masked, masked, 8);
+
+ return (uint32) vaddvq_u16((uint16x8_t) vzip1q_u8(masked, maskedhi));
+#else
+ uint32 mask = 0;
+
+ for (Size i = 0; i < sizeof(Vector8); i++)
+ mask |= (((const uint8 *) &v)[i] >> 7) << i;
+
+ return mask;
+#endif
+}
+
+/*
+ * Compare the given vectors and return the vector of minimum elements.
+ */
+static inline Vector8
+vector8_min(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+ return _mm_min_epu8(v1, v2);
+#elif defined(USE_NEON)
+ return vminq_u8(v1, v2);
+#else /* USE_NO_SIMD */
+ Vector8 r = 0;
+ uint8 *rp = (uint8 *) &r;
+
+ for (Size i = 0; i < sizeof(Vector8); i++)
+ rp[i] = Min(((const uint8 *) &v1)[i], ((const uint8 *) &v2)[i]);
+
+ return r;
+#endif
+}
+
+/*
+ * Return the index of the first element in the vector that is equal to the
+ * given scalar, or -1 if there is no such element.
+ */
+static inline int
+vector8_search_eq(const Vector8 v, const uint8 c)
+{
+ Vector8 keys = vector8_broadcast(c);
+ Vector8 cmp;
+ uint32 mask;
+ int result;
+
+#ifdef USE_ASSERT_CHECKING
+ int assert_result = -1;
+
+ for (Size i = 0; i < sizeof(Vector8); i++)
+ {
+ if (((const uint8 *) &v)[i] == c)
+ {
+ assert_result = i;
+ break;
+ }
+ }
+#endif /* USE_ASSERT_CHECKING */
+
+ cmp = vector8_eq(keys, v);
+ mask = vector8_highbit_mask(cmp);
+
+ if (mask)
+ result = pg_rightmost_one_pos32(mask);
+ else
+ result = -1;
+
+ Assert(assert_result == result);
+ return result;
+}
+
+/*
+ * Return the index of the first element in the vector that is greater than
+ * or equal to the given scalar. Return sizeof(Vector8) if there is no such
+ * element.
+ *
+ * Note that this function assumes the elements in the vector are sorted.
+ */
+static inline int
+vector8_search_ge(const Vector8 v, const uint8 c)
+{
+ Vector8 keys = vector8_broadcast(c);
+ Vector8 min;
+ Vector8 cmp;
+ uint32 mask;
+ int result;
+
+#ifdef USE_ASSERT_CHECKING
+ int assert_result = -1;
+ Size i;
+
+ for (i = 0; i < sizeof(Vector8); i++)
+ {
+ if (((const uint8 *) &v)[i] >= c)
+ break;
+ }
+ assert_result = i;
+#endif /* USE_ASSERT_CHECKING */
+
+ /*
+ * This is a bit more complicated than vector8_search_eq() because SSE2 has
+ * no unsigned 8-bit comparison instruction. Instead we use vector8_min():
+ * min(v[i], key) equals key exactly when v[i] >= key, so the first element
+ * for which the comparison below matches is the first element >= key.
+ */
+ min = vector8_min(v, keys);
+ cmp = vector8_eq(keys, min);
+ mask = vector8_highbit_mask(cmp);
+
+ if (mask)
+ result = pg_rightmost_one_pos32(mask);
+ else
+ result = sizeof(Vector8);
+
+ Assert(assert_result == result);
+ return result;
+}
+
/*
* Exactly like vector8_is_highbit_set except for the input type, so it
* looks at each byte separately.
@@ -348,7 +479,6 @@ vector8_ssub(const Vector8 v1, const Vector8 v2)
 * Return a vector with all bits set in each lane where the corresponding
* lanes in the inputs are equal.
*/
-#ifndef USE_NO_SIMD
static inline Vector8
vector8_eq(const Vector8 v1, const Vector8 v2)
{
@@ -356,9 +486,16 @@ vector8_eq(const Vector8 v1, const Vector8 v2)
return _mm_cmpeq_epi8(v1, v2);
#elif defined(USE_NEON)
return vceqq_u8(v1, v2);
+#else /* USE_NO_SIMD */
+ Vector8 r = 0;
+ uint8 *rp = (uint8 *) &r;
+
+ for (Size i = 0; i < sizeof(Vector8); i++)
+ rp[i] = (((const uint8 *) &v1)[i] == ((const uint8 *) &v2)[i]) ? 0xFF : 0;
+
+ return r;
#endif
}
-#endif /* ! USE_NO_SIMD */
#ifndef USE_NO_SIMD
static inline Vector32
diff --git a/src/test/modules/test_lfind/expected/test_lfind.out b/src/test/modules/test_lfind/expected/test_lfind.out
index 1d4b14e703..9416161955 100644
--- a/src/test/modules/test_lfind/expected/test_lfind.out
+++ b/src/test/modules/test_lfind/expected/test_lfind.out
@@ -22,3 +22,15 @@ SELECT test_lfind32();
(1 row)
+SELECT test_lsearch8();
+ test_lsearch8
+---------------
+
+(1 row)
+
+SELECT test_lsearch8_ge();
+ test_lsearch8_ge
+------------------
+
+(1 row)
+
diff --git a/src/test/modules/test_lfind/sql/test_lfind.sql b/src/test/modules/test_lfind/sql/test_lfind.sql
index 766c640831..d0dbb142ec 100644
--- a/src/test/modules/test_lfind/sql/test_lfind.sql
+++ b/src/test/modules/test_lfind/sql/test_lfind.sql
@@ -8,3 +8,5 @@ CREATE EXTENSION test_lfind;
SELECT test_lfind8();
SELECT test_lfind8_le();
SELECT test_lfind32();
+SELECT test_lsearch8();
+SELECT test_lsearch8_ge();
diff --git a/src/test/modules/test_lfind/test_lfind--1.0.sql b/src/test/modules/test_lfind/test_lfind--1.0.sql
index 81801926ae..13857cec3b 100644
--- a/src/test/modules/test_lfind/test_lfind--1.0.sql
+++ b/src/test/modules/test_lfind/test_lfind--1.0.sql
@@ -14,3 +14,11 @@ CREATE FUNCTION test_lfind8()
CREATE FUNCTION test_lfind8_le()
RETURNS pg_catalog.void
AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION test_lsearch8()
+ RETURNS pg_catalog.void
+ AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION test_lsearch8_ge()
+ RETURNS pg_catalog.void
+ AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_lfind/test_lfind.c b/src/test/modules/test_lfind/test_lfind.c
index 82673d54c6..c494c27436 100644
--- a/src/test/modules/test_lfind/test_lfind.c
+++ b/src/test/modules/test_lfind/test_lfind.c
@@ -14,6 +14,7 @@
#include "postgres.h"
#include "fmgr.h"
+#include "lib/stringinfo.h"
#include "port/pg_lfind.h"
/*
@@ -115,6 +116,144 @@ test_lfind8_le(PG_FUNCTION_ARGS)
PG_RETURN_VOID();
}
+static void
+test_lsearch8_internal(uint8 key)
+{
+ uint8 charbuf[LEN_WITH_TAIL(Vector8)];
+ const int len_no_tail = LEN_NO_TAIL(Vector8);
+ const int len_with_tail = LEN_WITH_TAIL(Vector8);
+ int keypos;
+
+ memset(charbuf, 0xFF, len_with_tail);
+ /* search tail to test one-byte-at-a-time path */
+ keypos = len_with_tail - 1;
+ charbuf[keypos] = key;
+ if (key > 0x00 && (pg_lsearch8(key - 1, charbuf, len_with_tail) != -1))
+ elog(ERROR, "pg_lsearch8() found nonexistent element '0x%x'", key - 1);
+ if (key < 0xFF && (pg_lsearch8(key, charbuf, len_with_tail) != keypos))
+ elog(ERROR, "pg_lsearch8() did not find existing element '0x%x'", key);
+ if (key < 0xFE && (pg_lsearch8(key + 1, charbuf, len_with_tail) != -1))
+ elog(ERROR, "pg_lsearch8() found nonexistent element '0x%x'", key + 1);
+
+ memset(charbuf, 0xFF, len_with_tail);
+ /* search with vector operations */
+ keypos = len_no_tail - 1;
+ charbuf[keypos] = key;
+ if (key > 0x00 && (pg_lsearch8(key - 1, charbuf, len_no_tail) != -1))
+ elog(ERROR, "pg_lsearch8() found nonexistent element '0x%x'", key - 1);
+ if (key < 0xFF && (pg_lsearch8(key, charbuf, len_no_tail) != keypos))
+ elog(ERROR, "pg_lsearch8() did not find existing element '0x%x'", key);
+ if (key < 0xFE && (pg_lsearch8(key + 1, charbuf, len_no_tail) != -1))
+ elog(ERROR, "pg_lsearch8() found nonexistent element '0x%x'", key + 1);
+}
+
+PG_FUNCTION_INFO_V1(test_lsearch8);
+Datum
+test_lsearch8(PG_FUNCTION_ARGS)
+{
+ test_lsearch8_internal(0);
+ test_lsearch8_internal(1);
+ test_lsearch8_internal(0x7F);
+ test_lsearch8_internal(0x80);
+ test_lsearch8_internal(0x81);
+ test_lsearch8_internal(0xFD);
+ test_lsearch8_internal(0xFE);
+ test_lsearch8_internal(0xFF);
+
+ PG_RETURN_VOID();
+}
+
+static void
+report_lsearch8_error(uint8 *buf, int size, uint8 key, int result, int expected)
+{
+ StringInfoData bufstr;
+ char *sep = "";
+
+ initStringInfo(&bufstr);
+
+ for (int i = 0; i < size; i++)
+ {
+ appendStringInfo(&bufstr, "%s0x%02x", sep, buf[i]);
+ sep = ",";
+ }
+
+ elog(ERROR,
+ "pg_lsearch8_ge returned %d, expected %d, key 0x%02x buffer %s",
+ result, expected, key, bufstr.data);
+}
+
+/* workhorse for test_lsearch8_ge */
+static void
+test_lsearch8_ge_internal(uint8 *buf, uint8 key)
+{
+ const int len_no_tail = LEN_NO_TAIL(Vector8);
+ const int len_with_tail = LEN_WITH_TAIL(Vector8);
+ int expected;
+ int result;
+ int i;
+
+ /* search tail to test one-byte-at-a-time path */
+ for (i = 0; i < len_with_tail; i++)
+ {
+ if (buf[i] >= key)
+ break;
+ }
+ expected = i;
+ result = pg_lsearch8_ge(key, buf, len_with_tail);
+
+ if (result != expected)
+ report_lsearch8_error(buf, len_with_tail, key, result, expected);
+
+ /* search with vector operations */
+ for (i = 0; i < len_no_tail; i++)
+ {
+ if (buf[i] >= key)
+ break;
+ }
+ expected = i;
+ result = pg_lsearch8_ge(key, buf, len_no_tail);
+
+ if (result != expected)
+ report_lsearch8_error(buf, len_no_tail, key, result, expected);
+}
+
+static int
+cmp(const void *p1, const void *p2)
+{
+ uint8 v1 = *((const uint8 *) p1);
+ uint8 v2 = *((const uint8 *) p2);
+
+ if (v1 < v2)
+ return -1;
+ if (v1 > v2)
+ return 1;
+ return 0;
+}
+
+PG_FUNCTION_INFO_V1(test_lsearch8_ge);
+Datum
+test_lsearch8_ge(PG_FUNCTION_ARGS)
+{
+ uint8 charbuf[LEN_WITH_TAIL(Vector8)];
+ const int len_with_tail = LEN_WITH_TAIL(Vector8);
+
+ for (int i = 0; i < len_with_tail; i++)
+ charbuf[i] = (uint8) rand();
+
+ qsort(charbuf, len_with_tail, sizeof(uint8), cmp);
+
+ test_lsearch8_ge_internal(charbuf, 0);
+ test_lsearch8_ge_internal(charbuf, 1);
+ test_lsearch8_ge_internal(charbuf, 0x7F);
+ test_lsearch8_ge_internal(charbuf, 0x80);
+ test_lsearch8_ge_internal(charbuf, 0x81);
+ test_lsearch8_ge_internal(charbuf, 0xFD);
+ test_lsearch8_ge_internal(charbuf, 0xFE);
+ test_lsearch8_ge_internal(charbuf, 0xFF);
+
+ PG_RETURN_VOID();
+}
+
PG_FUNCTION_INFO_V1(test_lfind32);
Datum
test_lfind32(PG_FUNCTION_ARGS)
--
2.31.1
v6-0002-Add-radix-implementation.patchapplication/x-patch; name=v6-0002-Add-radix-implementation.patchDownload
From f49e91ec2a2dcb19259cbf1bc0fd73f36b29a201 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 14 Sep 2022 12:38:51 +0000
Subject: [PATCH v6 2/3] Add radix implementation.
---
src/backend/lib/Makefile | 1 +
src/backend/lib/radixtree.c | 2225 +++++++++++++++++
src/include/lib/radixtree.h | 42 +
src/test/modules/Makefile | 1 +
src/test/modules/test_radixtree/.gitignore | 4 +
src/test/modules/test_radixtree/Makefile | 23 +
src/test/modules/test_radixtree/README | 7 +
.../expected/test_radixtree.out | 28 +
.../test_radixtree/sql/test_radixtree.sql | 7 +
.../test_radixtree/test_radixtree--1.0.sql | 8 +
.../modules/test_radixtree/test_radixtree.c | 504 ++++
.../test_radixtree/test_radixtree.control | 4 +
12 files changed, 2854 insertions(+)
create mode 100644 src/backend/lib/radixtree.c
create mode 100644 src/include/lib/radixtree.h
create mode 100644 src/test/modules/test_radixtree/.gitignore
create mode 100644 src/test/modules/test_radixtree/Makefile
create mode 100644 src/test/modules/test_radixtree/README
create mode 100644 src/test/modules/test_radixtree/expected/test_radixtree.out
create mode 100644 src/test/modules/test_radixtree/sql/test_radixtree.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree--1.0.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree.c
create mode 100644 src/test/modules/test_radixtree/test_radixtree.control
diff --git a/src/backend/lib/Makefile b/src/backend/lib/Makefile
index 9dad31398a..4c1db794b6 100644
--- a/src/backend/lib/Makefile
+++ b/src/backend/lib/Makefile
@@ -22,6 +22,7 @@ OBJS = \
integerset.o \
knapsack.o \
pairingheap.o \
+ radixtree.o \
rbtree.o \
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
new file mode 100644
index 0000000000..b163eac480
--- /dev/null
+++ b/src/backend/lib/radixtree.c
@@ -0,0 +1,2225 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.c
+ * Implementation for adaptive radix tree.
+ *
+ * This module employs the idea from the paper "The Adaptive Radix Tree: ARTful
+ * Indexing for Main-Memory Databases" by Viktor Leis, Alfons Kemper, and Thomas
+ * Neumann, 2013. The radix tree uses adaptive node sizes: a small number of
+ * node types, each with a different number of elements. Depending on the
+ * number of children, the appropriate node type is used.
+ *
+ * There are some differences from the proposed implementation. For instance,
+ * this radix tree module utilizes AVX2 instructions, enabling 256-bit wide
+ * SIMD vectors, whereas the paper uses 128-bit wide SIMD vectors. Also, there
+ * is no support for path compression or lazy path expansion. The radix tree
+ * supports only fixed-length keys, so we do not expect the tree to become
+ * very deep.
+ *
+ * Both the key and the value are 64-bit unsigned integers. Inner nodes
+ * (shift > 0) and leaf nodes (shift == 0) have slightly different structures:
+ * inner nodes store pointers to their child nodes as values, whereas leaf
+ * nodes store the 64-bit unsigned integers specified by the user as values.
+ * The paper refers to this technique as "Multi-value leaves". We chose it to
+ * avoid an additional pointer traversal; it is also the reason this code
+ * currently does not support variable-length keys.
+ *
+ * XXX: radix tree nodes are never shrunk.
+ *
+ * Interface
+ * ---------
+ *
+ * rt_create - Create a new, empty radix tree
+ * rt_free - Free the radix tree
+ * rt_search - Search a key-value pair
+ * rt_set - Set a key-value pair
+ * rt_delete - Delete a key-value pair
+ * rt_begin_iterate - Begin iterating through all key-value pairs
+ * rt_iterate_next - Return next key-value pair, if any
+ * rt_end_iterate - End iteration
+ * rt_memory_usage - Get the memory usage
+ * rt_num_entries - Get the number of key-value pairs
+ *
+ * rt_create() creates an empty radix tree in the given memory context, along
+ * with child memory contexts for each kind of radix tree node.
+ *
+ * rt_iterate_next() returns the key-value pairs in ascending order of the key.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/lib/radixtree.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "miscadmin.h"
+#include "port/pg_bitutils.h"
+#include "port/pg_lfind.h"
+#include "utils/memutils.h"
+#include "lib/radixtree.h"
+#include "lib/stringinfo.h"
+
+/* The number of bits encoded in one tree level */
+#define RT_NODE_SPAN BITS_PER_BYTE
+
+/* The number of maximum slots in the node */
+#define RT_NODE_MAX_SLOTS (1 << RT_NODE_SPAN)
+
+/*
+ * Return the number of bytes of is-set bitmap needed to cover nslots slots
+ * (e.g., RT_NODE_NSLOTS_BITS(128) is 16). Used by node kinds that track slot
+ * usage with a bitmap.
+ */
+#define RT_NODE_NSLOTS_BITS(nslots) ((nslots) / (sizeof(uint8) * BITS_PER_BYTE))
+
+/* Mask for extracting a chunk from the key */
+#define RT_CHUNK_MASK ((1 << RT_NODE_SPAN) - 1)
+
+/* Maximum shift the radix tree uses */
+#define RT_MAX_SHIFT key_get_shift(UINT64_MAX)
+
+/* Tree level the radix tree uses */
+#define RT_MAX_LEVEL ((sizeof(uint64) * BITS_PER_BYTE) / RT_NODE_SPAN)
+
+/* Invalid index used in node-128 */
+#define RT_NODE_128_INVALID_IDX 0xFF
+
+/* Get a chunk from the key */
+#define RT_GET_KEY_CHUNK(key, shift) \
+ ((uint8) (((key) >> (shift)) & RT_CHUNK_MASK))
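+
+/*
+ * For example, with the 8-bit span the key 0x0102030405060708 decomposes
+ * into the chunks 0x01 (shift 56, the root level), 0x02 (shift 48), and so
+ * on down to 0x08 (shift 0, the leaf level); each chunk selects the slot to
+ * follow at that level.
+ */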
+
+/*
+ * Mapping from a slot number to the byte and bit in the is-set bitmap
+ * (used by node-128 and node-256).
+ */
+#define RT_NODE_BITMAP_BYTE(v) ((v) / BITS_PER_BYTE)
+#define RT_NODE_BITMAP_BIT(v) (UINT64CONST(1) << ((v) % RT_NODE_SPAN))
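+
+/*
+ * For example, slot number 130 maps to byte isset[16] (130 / 8) and bit
+ * mask 0x04 (1 << (130 % 8)).
+ */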
+
+/* Enum used by rt_node_search() */
+typedef enum
+{
+ RT_ACTION_FIND = 0, /* find the key-value */
+ RT_ACTION_DELETE, /* delete the key-value */
+} rt_action;
+
+/*
+ * Supported radix tree nodes.
+ *
+ * XXX: These are currently not well chosen. To reduce memory fragmentation
+ * smaller class should optimally fit neatly into the next larger class
+ * (except perhaps at the lowest end). Right now it's
+ * 48 -> 152 -> 296 -> 1304 -> 2088 bytes for inner/leaf nodes, leading to
+ * large amounts of allocator padding with aset.c. Hence the use of slab.
+ *
+ * XXX: need to have node-1 until there is no path compression optimization?
+ *
+ * XXX: need to explain why we choose these node types based on benchmark
+ * results etc.
+ */
+typedef enum rt_node_kind
+{
+ RT_NODE_KIND_4 = 0,
+ RT_NODE_KIND_16,
+ RT_NODE_KIND_32,
+ RT_NODE_KIND_128,
+ RT_NODE_KIND_256
+} rt_node_kind;
+#define RT_NODE_KIND_COUNT (RT_NODE_KIND_256 + 1)
+
+/*
+ * Base type for all nodes types.
+ */
+typedef struct rt_node
+{
+ /*
+ * Number of children. We use uint16 to be able to represent the full
+ * fanout of 256 children, which does not fit in uint8.
+ */
+ uint16 count;
+
+ /*
+ * Shift indicates which part of the key space is represented by this
+ * node. That is, the key is shifted by 'shift' and the lowest
+ * RT_NODE_SPAN bits are then represented in chunk.
+ */
+ uint8 shift;
+ uint8 chunk;
+
+ /* Size class of the node */
+ rt_node_kind kind;
+} rt_node;
+
+/* Macros for radix tree nodes */
+#define IS_LEAF_NODE(n) (((rt_node *) (n))->shift == 0)
+#define IS_EMPTY_NODE(n) (((rt_node *) (n))->count == 0)
+#define NODE_HAS_FREE_SLOT(n) \
+ (((rt_node *) (n))->count < rt_node_info[((rt_node *) (n))->kind].fanout)
+
+/*
+ * Definitions of the base types for inner and leaf nodes of each node type.
+ */
+
+/*
+ * node-4, node-16, and node-32 have a similar structure and differ only in
+ * fanout. Each has a chunk array and an equally sized array of values (or
+ * child pointers in inner nodes); entries are stored at corresponding
+ * positions and the chunks are kept sorted.
+*/
+typedef struct rt_node_base_4
+{
+ rt_node n;
+
+ /* 4 children, for key chunks */
+ uint8 chunks[4];
+} rt_node_base_4;
+
+typedef struct rt_node_base_16
+{
+ rt_node n;
+
+ /* 16 children, for key chunks */
+ uint8 chunks[16];
+} rt_node_base_16;
+
+typedef struct rt_node_base_32
+{
+ rt_node n;
+
+ /* 32 children, for key chunks */
+ uint8 chunks[32];
+} rt_node_base_32;
+
+/*
+ * node-128 uses the slot_idxs array, an array of RT_NODE_MAX_SLOTS (256)
+ * entries, to store indexes into a second array that contains up to 128
+ * values (or child pointers in inner nodes).
+ */
+typedef struct rt_node_base_128
+{
+ rt_node n;
+
+ /* Index into the slots array for each chunk; RT_NODE_128_INVALID_IDX if unused */
+ uint8 slot_idxs[RT_NODE_MAX_SLOTS];
+
+ /* isset is a bitmap to track which slot is in use */
+ uint8 isset[RT_NODE_NSLOTS_BITS(128)];
+} rt_node_base_128;
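+
+/*
+ * For example, if the chunk 0x07 is stored in slot 3, then
+ * slot_idxs[0x07] == 3 and bit 3 of the isset bitmap is set; a chunk that is
+ * not present has slot_idxs[chunk] == RT_NODE_128_INVALID_IDX.
+ */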
+
+/*
+ * node-256 is the largest node type. It has an array of RT_NODE_MAX_SLOTS
+ * entries for directly storing values (or child pointers in inner nodes),
+ * indexed by chunk.
+ */
+typedef struct rt_node_base_256
+{
+ rt_node n;
+
+ /* isset is a bitmap to track which slot is in use */
+ uint8 isset[RT_NODE_NSLOTS_BITS(RT_NODE_MAX_SLOTS)];
+} rt_node_base_256;
+
+/*
+ * Inner and leaf nodes.
+ *
+ * Leaf node size classes are kept separate from inner node size classes for two main reasons:
+ *
+ * 1) the value type might be different than something fitting into a pointer
+ * width type
+ * 2) Need to represent non-existing values in a key-type independent way.
+ *
+ * 1) is clearly worth being concerned about, but it's not clear 2) is as
+ * good. It might be better to just indicate non-existing entries the same way
+ * in inner nodes.
+ */
+typedef struct rt_node_inner_4
+{
+ rt_node_base_4 base;
+
+ /* 4 children, for key chunks */
+ rt_node *children[4];
+} rt_node_inner_4;
+
+typedef struct rt_node_leaf_4
+{
+ rt_node_base_4 base;
+
+ /* 4 values, for key chunks */
+ uint64 values[4];
+} rt_node_leaf_4;
+
+typedef struct rt_node_inner_16
+{
+ rt_node_base_16 base;
+
+ /* 16 children, for key chunks */
+ rt_node *children[16];
+} rt_node_inner_16;
+
+typedef struct rt_node_leaf_16
+{
+ rt_node_base_16 base;
+
+ /* 16 values, for key chunks */
+ uint64 values[16];
+} rt_node_leaf_16;
+
+typedef struct rt_node_inner_32
+{
+ rt_node_base_32 base;
+
+ /* 32 children, for key chunks */
+ rt_node *children[32];
+} rt_node_inner_32;
+
+typedef struct rt_node_leaf_32
+{
+ rt_node_base_32 base;
+
+ /* 32 values, for key chunks */
+ uint64 values[32];
+} rt_node_leaf_32;
+
+typedef struct rt_node_inner_128
+{
+ rt_node_base_128 base;
+
+ /* Slots for 128 children */
+ rt_node *children[128];
+} rt_node_inner_128;
+
+typedef struct rt_node_leaf_128
+{
+ rt_node_base_128 base;
+
+ /* Slots for 128 values */
+ uint64 values[128];
+} rt_node_leaf_128;
+
+typedef struct rt_node_inner_256
+{
+ rt_node_base_256 base;
+
+ /* Slots for 256 children */
+ rt_node *children[RT_NODE_MAX_SLOTS];
+} rt_node_inner_256;
+
+typedef struct rt_node_leaf_256
+{
+ rt_node_base_256 base;
+
+ /* Slots for 256 values */
+ uint64 values[RT_NODE_MAX_SLOTS];
+} rt_node_leaf_256;
+
+/* Information of each size class */
+typedef struct rt_node_info_elem
+{
+ const char *name;
+ int fanout;
+ Size inner_size;
+ Size leaf_size;
+} rt_node_info_elem;
+
+static rt_node_info_elem rt_node_info[RT_NODE_KIND_COUNT] = {
+
+ [RT_NODE_KIND_4] = {
+ .name = "radix tree node 4",
+ .fanout = 4,
+ .inner_size = sizeof(rt_node_inner_4),
+ .leaf_size = sizeof(rt_node_leaf_4),
+ },
+ [RT_NODE_KIND_16] = {
+ .name = "radix tree node 16",
+ .fanout = 16,
+ .inner_size = sizeof(rt_node_inner_16),
+ .leaf_size = sizeof(rt_node_leaf_16),
+ },
+ [RT_NODE_KIND_32] = {
+ .name = "radix tree node 32",
+ .fanout = 32,
+ .inner_size = sizeof(rt_node_inner_32),
+ .leaf_size = sizeof(rt_node_leaf_32),
+ },
+ [RT_NODE_KIND_128] = {
+ .name = "radix tree node 128",
+ .fanout = 128,
+ .inner_size = sizeof(rt_node_inner_128),
+ .leaf_size = sizeof(rt_node_leaf_128),
+ },
+ [RT_NODE_KIND_256] = {
+ .name = "radix tree node 256",
+ .fanout = 256,
+ .inner_size = sizeof(rt_node_inner_256),
+ .leaf_size = sizeof(rt_node_leaf_256),
+ },
+};
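+
+/*
+ * A sketch of how these size classes are intended to be used (based on the
+ * functions declared below): when an insertion finds no free slot in a node
+ * (NODE_HAS_FREE_SLOT is false), rt_node_grow() is expected to replace the
+ * node with one of the next larger kind, e.g. a full node-4 becomes a
+ * node-16. Nodes are never shrunk (see the XXX above).
+ */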
+
+/*
+ * Iteration support.
+ *
+ * Iterating the radix tree returns each pair of key and value in the ascending
+ * order of the key. To support this, we iterate over the nodes at each level.
+ *
+ * rt_node_iter struct is used to track the iteration within a node.
+ *
+ * rt_iter is the struct for iteration of the radix tree, and uses rt_node_iter
+ * in order to track the iteration of each level. During the iteration, we also
+ * construct the key whenever updating the node iteration information, e.g., when
+ * advancing the current index within the node or when moving to the next node
+ * at the same level.
+ */
+typedef struct rt_node_iter
+{
+ rt_node *node; /* current node being iterated */
+ int current_idx; /* current position. -1 for initial value */
+} rt_node_iter;
+
+struct rt_iter
+{
+ radix_tree *tree;
+
+ /* Track the iteration on nodes of each level */
+ rt_node_iter stack[RT_MAX_LEVEL];
+ int stack_len;
+
+ /* The key is being constructed during the iteration */
+ uint64 key;
+};
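+
+/*
+ * A sketch of the key construction during iteration (see
+ * rt_iter_update_key()): as the iterator descends, each level's chunk is
+ * placed into the key at that level's shift, so a leaf reached through the
+ * chunks 0x01, 0x02, ..., 0x08 yields the key 0x0102030405060708.
+ */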
+
+/* A radix tree with nodes */
+struct radix_tree
+{
+ MemoryContext context;
+
+ rt_node *root;
+ uint64 max_val;
+ uint64 num_keys;
+
+ MemoryContextData *inner_slabs[RT_NODE_KIND_COUNT];
+ MemoryContextData *leaf_slabs[RT_NODE_KIND_COUNT];
+
+ /* statistics */
+#ifdef RT_DEBUG
+ int32 cnt[RT_NODE_KIND_COUNT];
+#endif
+};
+
+static void rt_new_root(radix_tree *tree, uint64 key);
+static rt_node *rt_alloc_node(radix_tree *tree, rt_node_kind kind, bool inner);
+static void rt_free_node(radix_tree *tree, rt_node *node);
+static void rt_copy_node_common(rt_node *src, rt_node *dst);
+static void rt_extend(radix_tree *tree, uint64 key);
+static bool rt_node_search(rt_node *node, uint64 key, rt_action action, void **slot_p);
+static bool rt_node_search_inner(rt_node *node, uint64 key, rt_action action,
+ rt_node **child_p);
+static bool rt_node_search_leaf(rt_node *node, uint64 key, rt_action action,
+ uint64 *value_p);
+static rt_node *rt_node_add_new_child(radix_tree *tree, rt_node *parent,
+ rt_node *node, uint64 key);
+static int rt_node_prepare_insert(radix_tree *tree, rt_node *parent,
+ rt_node **node_p, uint64 key,
+ bool *will_replace_p);
+static void rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node,
+ uint64 key, rt_node *child, bool *replaced_p);
+static void rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
+ uint64 key, uint64 value, bool *replaced_p);
+static rt_node *rt_node_grow(radix_tree *tree, rt_node *parent,
+ rt_node *node, uint64 key);
+static void rt_update_iter_stack(rt_iter *iter, int from);
+static void *rt_node_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
+ bool *found_p);
+static void rt_update_node_iter(rt_iter *iter, rt_node_iter *node_iter,
+ rt_node *node);
+static pg_attribute_always_inline void rt_iter_update_key(rt_iter *iter, uint8 chunk,
+ uint8 shift);
+
+/* verification (available only with assertion) */
+static void rt_verify_node(rt_node *node);
+
+/* Return the array of children in the given inner node */
+static rt_node **
+rt_node_get_children(rt_node *node)
+{
+ rt_node **children = NULL;
+
+ Assert(!IS_LEAF_NODE(node));
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ children = (rt_node **) ((rt_node_inner_4 *) node)->children;
+ break;
+ case RT_NODE_KIND_16:
+ children = (rt_node **) ((rt_node_inner_16 *) node)->children;
+ break;
+ case RT_NODE_KIND_32:
+ children = (rt_node **) ((rt_node_inner_32 *) node)->children;
+ break;
+ case RT_NODE_KIND_128:
+ children = (rt_node **) ((rt_node_inner_128 *) node)->children;
+ break;
+ case RT_NODE_KIND_256:
+ children = (rt_node **) ((rt_node_inner_256 *) node)->children;
+ break;
+ default:
+ elog(ERROR, "unexpected node type %u", node->kind);
+ }
+
+ return children;
+}
+
+/* Return the array of values in the given leaf node */
+static uint64 *
+rt_node_get_values(rt_node *node)
+{
+ uint64 *values = NULL;
+
+ Assert(IS_LEAF_NODE(node));
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ values = ((rt_node_leaf_4 *) node)->values;
+ break;
+ case RT_NODE_KIND_16:
+ values = ((rt_node_leaf_16 *) node)->values;
+ break;
+ case RT_NODE_KIND_32:
+ values = ((rt_node_leaf_32 *) node)->values;
+ break;
+ case RT_NODE_KIND_128:
+ values = ((rt_node_leaf_128 *) node)->values;
+ break;
+ case RT_NODE_KIND_256:
+ values = ((rt_node_leaf_256 *) node)->values;
+ break;
+ default:
+ elog(ERROR, "unexpected node type %u", node->kind);
+ }
+
+ return values;
+}
+
+/*
+ * Node support functions for node-4, node-16, and node-32.
+ *
+ * These three node types have similar structure -- they have the array of chunks with
+ * different length and corresponding pointers or values depending on inner nodes or
+ * leaf nodes.
+ */
+#define CHECK_CHUNK_ARRAY_NODE(node) \
+ Assert(((((rt_node*) node)->kind) == RT_NODE_KIND_4) || \
+ ((((rt_node*) node)->kind) == RT_NODE_KIND_16) || \
+ ((((rt_node*) node)->kind) == RT_NODE_KIND_32))
+
+/* Get the slot at 'idx': a pointer to the value for leaf nodes, or the child pointer for inner nodes */
+static void *
+chunk_array_node_get_slot(rt_node *node, int idx)
+{
+ void *slot;
+
+ CHECK_CHUNK_ARRAY_NODE(node);
+
+ if (IS_LEAF_NODE(node))
+ {
+ uint64 *values = rt_node_get_values(node);
+
+ slot = (void *) &(values[idx]);
+ }
+ else
+ {
+ rt_node **children = rt_node_get_children(node);
+
+ slot = (void *) children[idx];
+ }
+
+ return slot;
+}
+
+/* Return the chunk array in the node */
+static uint8 *
+chunk_array_node_get_chunks(rt_node *node)
+{
+ uint8 *chunk = NULL;
+
+ CHECK_CHUNK_ARRAY_NODE(node);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ chunk = (uint8 *) ((rt_node_base_4 *) node)->chunks;
+ break;
+ case RT_NODE_KIND_16:
+ chunk = (uint8 *) ((rt_node_base_16 *) node)->chunks;
+ break;
+ case RT_NODE_KIND_32:
+ chunk = (uint8 *) ((rt_node_base_32 *) node)->chunks;
+ break;
+ default:
+ /* this function doesn't support node-128 and node-256 */
+ elog(ERROR, "unsupported node type %d", node->kind);
+ }
+
+ return chunk;
+}
+
+/* Copy the contents of the node from 'src' to 'dst' */
+static void
+chunk_array_node_copy_contents(rt_node *src, rt_node *dst)
+{
+ uint8 *chunks_src,
+ *chunks_dst;
+
+ CHECK_CHUNK_ARRAY_NODE(src);
+ CHECK_CHUNK_ARRAY_NODE(dst);
+
+ /* Copy base type */
+ rt_copy_node_common(src, dst);
+
+ /* Copy chunk array */
+ chunks_src = chunk_array_node_get_chunks(src);
+ chunks_dst = chunk_array_node_get_chunks(dst);
+ memcpy(chunks_dst, chunks_src, sizeof(uint8) * src->count);
+
+ /* Copy children or values */
+ if (IS_LEAF_NODE(src))
+ {
+ uint64 *values_src,
+ *values_dst;
+
+ Assert(IS_LEAF_NODE(dst));
+ values_src = rt_node_get_values(src);
+ values_dst = rt_node_get_values(dst);
+ memcpy(values_dst, values_src, sizeof(uint64) * src->count);
+ }
+ else
+ {
+ rt_node **children_src,
+ **children_dst;
+
+ Assert(!IS_LEAF_NODE(dst));
+ children_src = rt_node_get_children(src);
+ children_dst = rt_node_get_children(dst);
+ memcpy(children_dst, children_src, sizeof(rt_node *) * src->count);
+ }
+}
+
+/*
+ * Return the index in the (sorted) chunk array at which the chunk should be
+ * inserted. Set *found_p to true if the chunk already exists in the array.
+ */
+static int
+chunk_array_node_find_insert_pos(rt_node *node, uint8 chunk, bool *found_p)
+{
+ uint8 *chunks;
+ int idx;
+
+ CHECK_CHUNK_ARRAY_NODE(node);
+
+ *found_p = false;
+ chunks = chunk_array_node_get_chunks(node);
+
+ /* Find the insert pos */
+ idx = pg_lsearch8_ge(chunk, chunks, node->count);
+
+ if (idx < node->count && chunks[idx] == chunk)
+ *found_p = true;
+
+ return idx;
+}
+
+/* Delete the chunk at idx */
+static void
+chunk_array_node_delete(rt_node *node, int idx)
+{
+ uint8 *chunks = chunk_array_node_get_chunks(node);
+
+ /* delete the chunk from the chunk array */
+ memmove(&(chunks[idx]), &(chunks[idx + 1]),
+ sizeof(uint8) * (node->count - idx - 1));
+
+ /* delete either the value or the child as well */
+ if (IS_LEAF_NODE(node))
+ {
+ uint64 *values = rt_node_get_values(node);
+
+ memmove(&(values[idx]),
+ &(values[idx + 1]),
+ sizeof(uint64) * (node->count - idx - 1));
+ }
+ else
+ {
+ rt_node **children = rt_node_get_children(node);
+
+ memmove(&(children[idx]),
+ &(children[idx + 1]),
+ sizeof(rt_node *) * (node->count - idx - 1));
+ }
+}
+
+/* Support functions for node-128 */
+
+/* Does the given chunk in the node have a value? */
+static pg_attribute_always_inline bool
+node_128_is_chunk_used(rt_node_base_128 *node, uint8 chunk)
+{
+ return node->slot_idxs[chunk] != RT_NODE_128_INVALID_IDX;
+}
+
+/* Is the slot in the node used? */
+static pg_attribute_always_inline bool
+node_128_is_slot_used(rt_node_base_128 *node, uint8 slot)
+{
+ return ((node->isset[RT_NODE_BITMAP_BYTE(slot)] & RT_NODE_BITMAP_BIT(slot)) != 0);
+}
+
+/* Get the pointer to either the child or the value corresponding to chunk */
+static void *
+node_128_get_slot(rt_node_base_128 *node, uint8 chunk)
+{
+ int slotpos;
+ void *slot;
+
+ slotpos = node->slot_idxs[chunk];
+ Assert(slotpos != RT_NODE_128_INVALID_IDX);
+
+ if (IS_LEAF_NODE(node))
+ slot = (void *) &(((rt_node_leaf_128 *) node)->values[slotpos]);
+ else
+ slot = (void *) (((rt_node_inner_128 *) node)->children[slotpos]);
+
+ return slot;
+}
+
+/* Delete the chunk in the node */
+static void
+node_128_delete(rt_node_base_128 *node, uint8 chunk)
+{
+ int slotpos = node->slot_idxs[chunk];
+
+ node->isset[RT_NODE_BITMAP_BYTE(slotpos)] &= ~(RT_NODE_BITMAP_BIT(slotpos));
+ node->slot_idxs[chunk] = RT_NODE_128_INVALID_IDX;
+}
+
+/* Return an unused slot in node-128 */
+static int
+node_128_find_unused_slot(rt_node_base_128 *node, uint8 chunk)
+{
+ int slotpos;
+
+ /*
+ * Find an unused slot. We iterate over the isset bitmap per byte then
+ * check each bit.
+ */
+ for (slotpos = 0; slotpos < RT_NODE_NSLOTS_BITS(128); slotpos++)
+ {
+ if (node->isset[slotpos] < 0xFF)
+ break;
+ }
+ Assert(slotpos < RT_NODE_NSLOTS_BITS(128));
+
+ slotpos *= BITS_PER_BYTE;
+ while (node_128_is_slot_used(node, slotpos))
+ slotpos++;
+
+ return slotpos;
+}
+
+
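+/* Set the child in the node-128 at the slot corresponding to the given chunk */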
+/* XXX: duplicate with node_128_set_leaf */
+static void
+node_128_set_inner(rt_node_base_128 *node, uint8 chunk, rt_node *child)
+{
+ int slotpos;
+ rt_node_inner_128 *n128 = (rt_node_inner_128 *) node;
+
+ /* Overwrite the existing child if the chunk already exists */
+ if (node_128_is_chunk_used(node, chunk))
+ {
+ n128->children[n128->base.slot_idxs[chunk]] = child;
+ return;
+ }
+
+ /* find unused slot */
+ slotpos = node_128_find_unused_slot(node, chunk);
+
+ n128->base.slot_idxs[chunk] = slotpos;
+ n128->base.isset[RT_NODE_BITMAP_BYTE(slotpos)] |= RT_NODE_BITMAP_BIT(slotpos);
+ n128->children[slotpos] = child;
+}
+
+/* Set the slot at the corresponding chunk */
+static void
+node_128_set_leaf(rt_node_base_128 *node, uint8 chunk, uint64 value)
+{
+ int slotpos;
+ rt_node_leaf_128 *n128 = (rt_node_leaf_128 *) node;
+
+ /* Overwrite the existing value if the chunk already exists */
+ if (node_128_is_chunk_used(node, chunk))
+ {
+ n128->values[n128->base.slot_idxs[chunk]] = value;
+ return;
+ }
+
+ /* find unused slot */
+ slotpos = node_128_find_unused_slot(node, chunk);
+
+ n128->base.slot_idxs[chunk] = slotpos;
+ n128->base.isset[RT_NODE_BITMAP_BYTE(slotpos)] |= RT_NODE_BITMAP_BIT(slotpos);
+ n128->values[slotpos] = value;
+}
+
+/* Return true if the slot corresponding to the given chunk is in use */
+static bool
+node_256_is_chunk_used(rt_node_base_256 *node, uint8 chunk)
+{
+ return (node->isset[RT_NODE_BITMAP_BYTE(chunk)] & RT_NODE_BITMAP_BIT(chunk)) != 0;
+}
+
+/* Get the pointer to either the child or the value corresponding to chunk */
+static void *
+node_256_get_slot(rt_node_base_256 *node, uint8 chunk)
+{
+ void *slot;
+
+ Assert(node_256_is_chunk_used(node, chunk));
+ if (IS_LEAF_NODE(node))
+ slot = (void *) &(((rt_node_leaf_256 *) node)->values[chunk]);
+ else
+ slot = (void *) (((rt_node_inner_256 *) node)->children[chunk]);
+
+ return slot;
+}
+
+/* Set the child in the node-256 */
+static pg_attribute_always_inline void
+node_256_set_inner(rt_node_base_256 *node, uint8 chunk, rt_node *child)
+{
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+
+ n256->base.isset[RT_NODE_BITMAP_BYTE(chunk)] |= RT_NODE_BITMAP_BIT(chunk);
+ n256->children[chunk] = child;
+}
+
+/* Set the value in the node-256 */
+static pg_attribute_always_inline void
+node_256_set_leaf(rt_node_base_256 *node, uint8 chunk, uint64 value)
+{
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+
+ n256->base.isset[RT_NODE_BITMAP_BYTE(chunk)] |= RT_NODE_BITMAP_BIT(chunk);
+ n256->values[chunk] = value;
+}
+
+/* Clear the slot at the given chunk position */
+static pg_attribute_always_inline void
+node_256_delete(rt_node_base_256 *node, uint8 chunk)
+{
+ node->isset[RT_NODE_BITMAP_BYTE(chunk)] &= ~(RT_NODE_BITMAP_BIT(chunk));
+}
+
+/*
+ * Return the shift sufficient to store the given key.
+ */
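+/*
+ * For example, assuming an 8-bit chunk span (RT_NODE_SPAN = 8),
+ * key_get_shift(0xFF) returns 0 and key_get_shift(0x100) returns 8.
+ */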
+static pg_attribute_always_inline int
+key_get_shift(uint64 key)
+{
+ return (key == 0)
+ ? 0
+ : (pg_leftmost_one_pos64(key) / RT_NODE_SPAN) * RT_NODE_SPAN;
+}
+
+/*
+ * Return the maximum key value that a tree whose root node has the given
+ * shift can store.
+ */
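+/*
+ * For example, with an 8-bit span, shift 0 gives 0xFF and shift 8 gives
+ * 0xFFFF.
+ */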
+static uint64
+shift_get_max_val(int shift)
+{
+ if (shift == RT_MAX_SHIFT)
+ return UINT64_MAX;
+
+ return (UINT64CONST(1) << (shift + RT_NODE_SPAN)) - 1;
+}
+
+/*
+ * Create a new node as the root. Subordinate nodes will be created during
+ * the insertion.
+ */
+static void
+rt_new_root(radix_tree *tree, uint64 key)
+{
+ int shift = key_get_shift(key);
+ rt_node *node;
+
+ node = (rt_node *) rt_alloc_node(tree, RT_NODE_KIND_4, shift > 0);
+ node->shift = shift;
+ tree->max_val = shift_get_max_val(shift);
+ tree->root = node;
+}
+
+/*
+ * Allocate a new node with the given node kind.
+ */
+static rt_node *
+rt_alloc_node(radix_tree *tree, rt_node_kind kind, bool inner)
+{
+ rt_node *newnode;
+
+ if (inner)
+ newnode = (rt_node *) MemoryContextAllocZero(tree->inner_slabs[kind],
+ rt_node_info[kind].inner_size);
+ else
+ newnode = (rt_node *) MemoryContextAllocZero(tree->leaf_slabs[kind],
+ rt_node_info[kind].leaf_size);
+
+ newnode->kind = kind;
+
+ /* Initialize slot_idxs to invalid values */
+ if (kind == RT_NODE_KIND_128)
+ {
+ rt_node_base_128 *n128 = (rt_node_base_128 *) newnode;
+
+ memset(n128->slot_idxs, RT_NODE_128_INVALID_IDX, sizeof(n128->slot_idxs));
+ }
+
+#ifdef RT_DEBUG
+ /* update the statistics */
+ tree->cnt[kind]++;
+#endif
+
+ return newnode;
+}
+
+/* Free the given node */
+static void
+rt_free_node(radix_tree *tree, rt_node *node)
+{
+ /* If we're deleting the root node, make the tree empty */
+ if (tree->root == node)
+ tree->root = NULL;
+
+#ifdef RT_DEBUG
+ /* update the statistics */
+ tree->cnt[node->kind]--;
+
+ Assert(tree->cnt[node->kind] >= 0);
+#endif
+
+ pfree(node);
+}
+
+/* Copy the common fields except the node kind */
+static void
+rt_copy_node_common(rt_node *src, rt_node *dst)
+{
+ dst->shift = src->shift;
+ dst->chunk = src->chunk;
+ dst->count = src->count;
+}
+
+/*
+ * The radix tree doesn't have sufficient height. Extend the radix tree so it
+ * can store the key.
+ */
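+/*
+ * For example, if the current root has shift 8 (covering keys up to 0xFFFF)
+ * and the new key is 0x10000, one new node-4 root with shift 16 is added
+ * above the old root.
+ */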
+static void
+rt_extend(radix_tree *tree, uint64 key)
+{
+ int target_shift;
+ int shift = tree->root->shift + RT_NODE_SPAN;
+
+ target_shift = key_get_shift(key);
+
+ /* Grow tree from 'shift' to 'target_shift' */
+ while (shift <= target_shift)
+ {
+ rt_node_inner_4 *node =
+ (rt_node_inner_4 *) rt_alloc_node(tree, RT_NODE_KIND_4, true);
+
+ node->base.n.count = 1;
+ node->base.n.shift = shift;
+ node->base.chunks[0] = 0;
+ node->children[0] = tree->root;
+
+ tree->root->chunk = 0;
+ tree->root = (rt_node *) node;
+
+ shift += RT_NODE_SPAN;
+ }
+
+ tree->max_val = shift_get_max_val(target_shift);
+}
+
+/*
+ * Search for the given key in the node. Return true if the key is found,
+ * otherwise return false. On success, we perform the specified action for
+ * the key and, for RT_ACTION_FIND, set *slot_p to the found slot.
+ */
+static bool
+rt_node_search(rt_node *node, uint64 key, rt_action action, void **slot_p)
+{
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool found = false;
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ case RT_NODE_KIND_16:
+ case RT_NODE_KIND_32:
+ {
+ int idx;
+ uint8 *chunks = chunk_array_node_get_chunks(node);
+
+ idx = pg_lsearch8(chunk, chunks, node->count);
+ if (idx < 0)
+ break;
+
+ found = true;
+ if (action == RT_ACTION_FIND)
+ *slot_p = chunk_array_node_get_slot(node, idx);
+ else /* RT_ACTION_DELETE */
+ chunk_array_node_delete(node, idx);
+
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ rt_node_base_128 *n128 = (rt_node_base_128 *) node;
+
+ if (!node_128_is_chunk_used(n128, chunk))
+ break;
+
+ found = true;
+ if (action == RT_ACTION_FIND)
+ *slot_p = node_128_get_slot(n128, chunk);
+ else /* RT_ACTION_DELETE */
+ node_128_delete(n128, chunk);
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_base_256 *n256 = (rt_node_base_256 *) node;
+
+ if (!node_256_is_chunk_used(n256, chunk))
+ break;
+
+ found = true;
+ if (action == RT_ACTION_FIND)
+ *slot_p = node_256_get_slot(n256, chunk);
+ else /* RT_ACTION_DELETE */
+ node_256_delete(n256, chunk);
+
+ break;
+ }
+ }
+
+ /* Update the statistics */
+ if (action == RT_ACTION_DELETE && found)
+ node->count--;
+
+ return found;
+}
+
+/*
+ * Search for the child pointer corresponding to the key in the given node.
+ *
+ * Return true if the key is found, otherwise return false. On success, the child
+ * pointer is set to child_p.
+ */
+static bool
+rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **child_p)
+{
+ rt_node *child;
+
+ if (!rt_node_search(node, key, action, (void **) &child))
+ return false;
+
+ if (child_p)
+ *child_p = child;
+
+ return true;
+}
+
+/*
+ * Search for the value corresponding to the key in the given node.
+ *
+ * Return true if the key is found, otherwise return false. On success, the pointer
+ * to the value is set to value_p.
+ */
+static bool
+rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p)
+{
+ uint64 *value;
+
+ if (!rt_node_search(node, key, action, (void **) &value))
+ return false;
+
+ if (value_p)
+ *value_p = *value;
+
+ return true;
+}
+
+/* Create a new child node for 'key', insert it into 'node', and return it */
+static rt_node *
+rt_node_add_new_child(radix_tree *tree, rt_node *parent, rt_node *node, uint64 key)
+{
+ uint8 newshift = node->shift - RT_NODE_SPAN;
+ rt_node *newchild =
+ (rt_node *) rt_alloc_node(tree, RT_NODE_KIND_4, newshift > 0);
+
+ Assert(!IS_LEAF_NODE(node));
+
+ newchild->shift = newshift;
+ newchild->chunk = RT_GET_KEY_CHUNK(key, node->shift);
+
+ rt_node_insert_inner(tree, parent, node, key, newchild, NULL);
+
+ return (rt_node *) newchild;
+}
+
+/*
+ * For an upcoming insertion, make sure that the node has a free slot, growing
+ * the node if necessary. *node_p is updated to point to the (possibly grown)
+ * node. *will_replace_p is set to true to tell the caller that the given
+ * chunk already exists in the node.
+ *
+ * Return the index in the chunk array where the key can be inserted. We
+ * always return 0 for node-128 and node-256.
+ */
+static int
+rt_node_prepare_insert(radix_tree *tree, rt_node *parent, rt_node **node_p,
+ uint64 key, bool *will_replace_p)
+{
+ rt_node *node = *node_p;
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool will_replace = false;
+ int idx = 0;
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ case RT_NODE_KIND_16:
+ case RT_NODE_KIND_32:
+ {
+ bool can_insert = false;
+
+ while ((node->kind == RT_NODE_KIND_4) ||
+ (node->kind == RT_NODE_KIND_16) ||
+ (node->kind == RT_NODE_KIND_32))
+ {
+ /* Find the insert pos */
+ idx = chunk_array_node_find_insert_pos(node, chunk, &will_replace);
+
+ if (will_replace || NODE_HAS_FREE_SLOT(node))
+ {
+ /*
+ * Found a place. We can insert a new entry or replace the
+ * existing value.
+ */
+ can_insert = true;
+ break;
+ }
+
+ node = rt_node_grow(tree, parent, node, key);
+ }
+
+ if (can_insert)
+ {
+ uint8 *chunks = chunk_array_node_get_chunks(node);
+
+ Assert(idx >= 0);
+
+ /*
+ * Make the space for the new key if it will be inserted in
+ * the middle of the array.
+ */
+ if (!will_replace && node->count != 0 && idx < node->count)
+ {
+ /* shift chunks array */
+ memmove(&(chunks[idx + 1]), &(chunks[idx]),
+ sizeof(uint8) * (node->count - idx));
+
+ /* shift either the values array or the children array */
+ if (IS_LEAF_NODE(node))
+ {
+ uint64 *values = rt_node_get_values(node);
+
+ memmove(&(values[idx + 1]), &(values[idx]),
+ sizeof(uint64) * (node->count - idx));
+ }
+ else
+ {
+ rt_node **children = rt_node_get_children(node);
+
+ memmove(&(children[idx + 1]), &(children[idx]),
+ sizeof(rt_node *) * (node->count - idx));
+ }
+ }
+
+ break;
+ }
+
+ Assert(node->kind == RT_NODE_KIND_128);
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_128:
+ {
+ rt_node_base_128 *n128 = (rt_node_base_128 *) node;
+
+ if (node_128_is_chunk_used(n128, chunk) || NODE_HAS_FREE_SLOT(n128))
+ {
+ if (node_128_is_chunk_used(n128, chunk))
+ will_replace = true;
+
+ break;
+ }
+
+ node = rt_node_grow(tree, parent, node, key);
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_256:
+ {
+ rt_node_base_256 *n256 = (rt_node_base_256 *) node;
+
+ if (node_256_is_chunk_used(n256, chunk))
+ will_replace = true;
+
+ break;
+ }
+ }
+
+ *node_p = node;
+ *will_replace_p = will_replace;
+
+ return idx;
+}
+
+/* Insert the child to the inner node */
+static void
+rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node,
+ uint64 key, rt_node *child, bool *replaced_p)
+{
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ int idx;
+ bool replaced;
+
+ Assert(!IS_LEAF_NODE(node));
+
+ idx = rt_node_prepare_insert(tree, parent, &node, key, &replaced);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ case RT_NODE_KIND_16:
+ case RT_NODE_KIND_32:
+ {
+ uint8 *chunks = chunk_array_node_get_chunks(node);
+ rt_node **children = rt_node_get_children(node);
+
+ Assert(idx >= 0);
+ chunks[idx] = chunk;
+ children[idx] = child;
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ node_128_set_inner((rt_node_base_128 *) node, chunk, child);
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ node_256_set_inner((rt_node_base_256 *) node, chunk, child);
+ break;
+ }
+ }
+
+ /* Update statistics */
+ if (!replaced)
+ node->count++;
+
+ if (replaced_p)
+ *replaced_p = replaced;
+
+ /*
+ * Done. Finally, verify that the chunk and child were inserted or
+ * replaced properly in the node.
+ */
+ rt_verify_node(node);
+}
+
+/* Insert the value to the leaf node */
+static void
+rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
+ uint64 key, uint64 value, bool *replaced_p)
+{
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ int idx;
+ bool replaced;
+
+ Assert(IS_LEAF_NODE(node));
+
+ idx = rt_node_prepare_insert(tree, parent, &node, key, &replaced);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ case RT_NODE_KIND_16:
+ case RT_NODE_KIND_32:
+ {
+ uint8 *chunks = chunk_array_node_get_chunks(node);
+ uint64 *values = rt_node_get_values(node);
+
+ Assert(idx >= 0);
+ chunks[idx] = chunk;
+ values[idx] = value;
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ node_128_set_leaf((rt_node_base_128 *) node, chunk, value);
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ node_256_set_leaf((rt_node_base_256 *) node, chunk, value);
+ break;
+ }
+ }
+
+ /* Update statistics */
+ if (!replaced)
+ node->count++;
+
+ *replaced_p = replaced;
+
+ /*
+ * Done. Finally, verify that the chunk and value were inserted or
+ * replaced properly in the node.
+ */
+ rt_verify_node(node);
+}
+
+/* Change the node type to the next larger one */
+static rt_node *
+rt_node_grow(radix_tree *tree, rt_node *parent, rt_node *node, uint64 key)
+{
+ rt_node *newnode = NULL;
+
+ Assert(node->count == rt_node_info[node->kind].fanout);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ newnode = rt_alloc_node(tree, RT_NODE_KIND_16,
+ IS_LEAF_NODE(node));
+
+ /* Copy both chunks and slots to the new node */
+ chunk_array_node_copy_contents(node, newnode);
+ break;
+ }
+ case RT_NODE_KIND_16:
+ {
+ newnode = rt_alloc_node(tree, RT_NODE_KIND_32,
+ IS_LEAF_NODE(node));
+
+ /* Copy both chunks and slots to the new node */
+ chunk_array_node_copy_contents(node, newnode);
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ newnode = rt_alloc_node(tree, RT_NODE_KIND_128,
+ IS_LEAF_NODE(node));
+
+ /* Copy the common node fields to the new node */
+ rt_copy_node_common(node, newnode);
+
+ if (IS_LEAF_NODE(node))
+ {
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+
+ for (int i = 0; i < node->count; i++)
+ node_128_set_leaf((rt_node_base_128 *) newnode,
+ n32->base.chunks[i], n32->values[i]);
+ }
+ else
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+
+ for (int i = 0; i < node->count; i++)
+ node_128_set_inner((rt_node_base_128 *) newnode,
+ n32->base.chunks[i], n32->children[i]);
+ }
+
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ rt_node_base_128 *n128 = (rt_node_base_128 *) node;
+ int cnt = 0;
+
+ newnode = rt_alloc_node(tree, RT_NODE_KIND_256,
+ IS_LEAF_NODE(node));
+
+ /* Copy the common node fields to the new node */
+ rt_copy_node_common(node, newnode);
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n128->n.count; i++)
+ {
+ void *slot;
+
+ if (!node_128_is_chunk_used(n128, i))
+ continue;
+
+ slot = node_128_get_slot(n128, i);
+
+ if (IS_LEAF_NODE(node))
+ node_256_set_leaf((rt_node_base_256 *) newnode, i,
+ *(uint64 *) slot);
+ else
+ node_256_set_inner((rt_node_base_256 *) newnode, i,
+ (rt_node *) slot);
+
+ cnt++;
+ }
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ elog(ERROR, "radix tree node-256 cannot grow");
+ break;
+ }
+
+ if (parent == node)
+ {
+ /* Replace the root node with the new large node */
+ tree->root = newnode;
+ }
+ else
+ {
+ /* Set the new node to the parent node */
+ rt_node_insert_inner(tree, NULL, parent, key, newnode, NULL);
+ }
+
+ /* Verify if the node has grown properly */
+ rt_verify_node(newnode);
+
+ /* Free the old node */
+ rt_free_node(tree, node);
+
+ return newnode;
+}
+
+/*
+ * Create the radix tree in the given memory context and return it.
+ */
+radix_tree *
+rt_create(MemoryContext ctx)
+{
+ radix_tree *tree;
+ MemoryContext old_ctx;
+
+ old_ctx = MemoryContextSwitchTo(ctx);
+
+ tree = palloc(sizeof(radix_tree));
+ tree->context = ctx;
+ tree->root = NULL;
+ tree->max_val = 0;
+ tree->num_keys = 0;
+
+ /* Create the slab allocator for each size class */
+ for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ {
+ tree->inner_slabs[i] = SlabContextCreate(ctx,
+ rt_node_info[i].name,
+ SLAB_DEFAULT_BLOCK_SIZE,
+ rt_node_info[i].inner_size);
+ tree->leaf_slabs[i] = SlabContextCreate(ctx,
+ rt_node_info[i].name,
+ SLAB_DEFAULT_BLOCK_SIZE,
+ rt_node_info[i].leaf_size);
+#ifdef RT_DEBUG
+ tree->cnt[i] = 0;
+#endif
+ }
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return tree;
+}
+
+/*
+ * Free the given radix tree.
+ */
+void
+rt_free(radix_tree *tree)
+{
+ for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ {
+ MemoryContextDelete(tree->inner_slabs[i]);
+ MemoryContextDelete(tree->leaf_slabs[i]);
+ }
+
+ pfree(tree);
+}
+
+/*
+ * Set key to value. If the entry already exists, update its value to 'value'
+ * and return true; otherwise insert a new entry and return false.
+ */
+bool
+rt_set(radix_tree *tree, uint64 key, uint64 value)
+{
+ int shift;
+ bool replaced;
+ rt_node *node;
+ rt_node *parent = tree->root;
+
+ /* Empty tree, create the root */
+ if (!tree->root)
+ rt_new_root(tree, key);
+
+ /* Extend the tree if necessary */
+ if (key > tree->max_val)
+ rt_extend(tree, key);
+
+ Assert(tree->root);
+
+ shift = tree->root->shift;
+ node = tree->root;
+
+ /* Descend the tree until a leaf node */
+ while (shift >= 0)
+ {
+ rt_node *child;
+
+ if (IS_LEAF_NODE(node))
+ break;
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ child = rt_node_add_new_child(tree, parent, node, key);
+
+ Assert(child);
+
+ parent = node;
+ node = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ /* arrived at a leaf */
+ Assert(IS_LEAF_NODE(node));
+
+ rt_node_insert_leaf(tree, parent, node, key, value, &replaced);
+
+ /* Update the statistics */
+ if (!replaced)
+ tree->num_keys++;
+
+ return replaced;
+}
+
+/*
+ * Search for the given key in the radix tree. Return true if the key is
+ * found, otherwise return false. On success, we set the value to *value_p,
+ * so it must not be NULL.
+ */
+bool
+rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
+{
+ rt_node *node;
+ int shift;
+
+ Assert(value_p != NULL);
+
+ if (!tree->root || key > tree->max_val)
+ return false;
+
+ node = tree->root;
+ shift = tree->root->shift;
+
+ /* Descend the tree until a leaf node */
+ while (shift >= 0)
+ {
+ rt_node *child;
+
+ if (IS_LEAF_NODE(node))
+ break;
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ return false;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ /* We reached a leaf node, so search the corresponding slot */
+ Assert(IS_LEAF_NODE(node));
+ if (!rt_node_search_leaf(node, key, RT_ACTION_FIND, value_p))
+ return false;
+
+ return true;
+}
+
+/*
+ * Delete the given key from the radix tree. Return true if the key is found (and
+ * deleted), otherwise do nothing and return false.
+ */
+bool
+rt_delete(radix_tree *tree, uint64 key)
+{
+ rt_node *node;
+ int shift;
+ rt_node *stack[RT_MAX_LEVEL] = {0};
+ int level;
+
+ if (!tree->root || key > tree->max_val)
+ return false;
+
+ /*
+ * Descend the tree to search the key while building a stack of nodes
+ * we visited.
+ */
+ node = tree->root;
+ shift = tree->root->shift;
+ level = 0;
+ while (shift >= 0)
+ {
+ rt_node *child;
+
+ /* Push the current node to the stack */
+ stack[level] = node;
+
+ if (IS_LEAF_NODE(node))
+ break;
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ return false;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ level++;
+ }
+
+ Assert(IS_LEAF_NODE(node));
+
+ /* there is no key to delete */
+ if (!rt_node_search_leaf(node, key, RT_ACTION_FIND, NULL))
+ return false;
+
+ /* Update the statistics */
+ tree->num_keys--;
+
+ /*
+ * Delete the key from the leaf node and recursively delete the key in
+ * inner nodes if necessary.
+ */
+ Assert(IS_LEAF_NODE(stack[level]));
+ while (level >= 0)
+ {
+ rt_node *node = stack[level--];
+
+ if (IS_LEAF_NODE(node))
+ rt_node_search_leaf(node, key, RT_ACTION_DELETE, NULL);
+ else
+ rt_node_search_inner(node, key, RT_ACTION_DELETE, NULL);
+
+ /* If the node didn't become empty, we stop deleting the key */
+ if (!IS_EMPTY_NODE(node))
+ break;
+
+ /* The node became empty */
+ rt_free_node(tree, node);
+ }
+
+ /*
+ * If we eventually deleted the root node while recursively deleting empty
+ * nodes (rt_free_node has already reset tree->root), the tree has become
+ * empty, so reset the max value as well.
+ */
+ if (tree->root == NULL)
+ tree->max_val = 0;
+
+ return true;
+}
+
+/* Create and return the iterator for the given radix tree */
+rt_iter *
+rt_begin_iterate(radix_tree *tree)
+{
+ MemoryContext old_ctx;
+ rt_iter *iter;
+ int top_level;
+
+ old_ctx = MemoryContextSwitchTo(tree->context);
+
+ iter = (rt_iter *) palloc0(sizeof(rt_iter));
+ iter->tree = tree;
+
+ /* Return the iterator as-is if the tree is empty */
+ if (!iter->tree->root)
+ {
+ MemoryContextSwitchTo(old_ctx);
+ return iter;
+ }
+
+ top_level = iter->tree->root->shift / RT_NODE_SPAN;
+
+ iter->stack_len = top_level;
+ iter->stack[top_level].node = iter->tree->root;
+ iter->stack[top_level].current_idx = -1;
+
+ /*
+ * Descend to the leftmost leaf node from the root. The key is
+ * constructed while descending to the leaf.
+ */
+ rt_update_iter_stack(iter, top_level);
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return iter;
+}
+
+/*
+ * Update the iterator's stack of per-node iterators while descending from
+ * the 'from' level to the leftmost leaf.
+ */
+static void
+rt_update_iter_stack(rt_iter *iter, int from)
+{
+ rt_node *node = iter->stack[from].node;
+ int level = from;
+
+ for (;;)
+ {
+ rt_node_iter *node_iter = &(iter->stack[level--]);
+ bool found;
+
+ /* Set the node to this level */
+ rt_update_node_iter(iter, node_iter, node);
+
+ /* Finish if we reached the leaf node */
+ if (IS_LEAF_NODE(node))
+ break;
+
+ /* Advance to the next slot in the node */
+ node = (rt_node *) rt_node_iterate_next(iter, node_iter, &found);
+
+ /*
+ * Since we always get the first slot in the node, we must have found
+ * a slot.
+ */
+ Assert(found);
+ }
+}
+
+/*
+ * Return true with setting key_p and value_p if there is next key. Otherwise,
+ * return false.
+ */
+bool
+rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p)
+{
+ bool found = false;
+ void *slot;
+
+ /* Empty tree */
+ if (!iter->tree->root)
+ return false;
+
+ for (;;)
+ {
+ rt_node *node;
+ rt_node_iter *node_iter;
+ int level;
+
+ /*
+ * Iterate over the node at each level, from the bottom of the tree
+ * (i.e., the leaf node) upward, until we find the next slot.
+ */
+ for (level = 0; level <= iter->stack_len; level++)
+ {
+ slot = rt_node_iterate_next(iter, &(iter->stack[level]), &found);
+
+ if (found)
+ break;
+ }
+
+ /* We could not find any new key-value pair, the iteration finished */
+ if (!found)
+ break;
+
+ /* found the next slot at the leaf node, return it */
+ if (level == 0)
+ {
+ *key_p = iter->key;
+ *value_p = *((uint64 *) slot);
+ break;
+ }
+
+ /*
+ * We have advanced the slot in an inner node as well as in the leaf
+ * node. So we update the stack by descending to the leftmost leaf
+ * node from this level.
+ */
+ node = (rt_node *) slot;
+ node_iter = &(iter->stack[level - 1]);
+ rt_update_node_iter(iter, node_iter, node);
+ rt_update_iter_stack(iter, level - 1);
+ }
+
+ return found;
+}
+
+void
+rt_end_iterate(rt_iter *iter)
+{
+ pfree(iter);
+}
+
+/*
+ * Advance the iteration within the given radix tree node. If there is a next
+ * slot, return it and set *found_p to true; otherwise return NULL and set
+ * *found_p to false.
+ */
+static void *
+rt_node_iterate_next(rt_iter *iter, rt_node_iter *node_iter, bool *found_p)
+{
+ rt_node *node = node_iter->node;
+ void *slot = NULL;
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ case RT_NODE_KIND_16:
+ case RT_NODE_KIND_32:
+ {
+ node_iter->current_idx++;
+
+ if (node_iter->current_idx >= node->count)
+ goto not_found;
+
+ slot = chunk_array_node_get_slot(node, node_iter->current_idx);
+
+ /* Update the part of the key by the current chunk */
+ if (IS_LEAF_NODE(node))
+ {
+ uint8 *chunks = chunk_array_node_get_chunks(node);
+
+ rt_iter_update_key(iter, chunks[node_iter->current_idx], 0);
+ }
+
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ rt_node_base_128 *n128 = (rt_node_base_128 *) node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < 256; i++)
+ {
+ if (node_128_is_chunk_used(n128, i))
+ break;
+ }
+
+ if (i >= 256)
+ goto not_found;
+
+ node_iter->current_idx = i;
+ slot = node_128_get_slot(n128, i);
+
+ /* Update the part of the key */
+ if (IS_LEAF_NODE(n128))
+ rt_iter_update_key(iter, node_iter->current_idx, 0);
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_base_256 *n256 = (rt_node_base_256 *) node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < 256; i++)
+ {
+ if (node_256_is_chunk_used(n256, i))
+ break;
+ }
+
+ if (i >= 256)
+ goto not_found;
+
+ node_iter->current_idx = i;
+ slot = node_256_get_slot(n256, i);
+
+ /* Update the part of the key */
+ if (IS_LEAF_NODE(n256))
+ rt_iter_update_key(iter, node_iter->current_idx, 0);
+
+ break;
+ }
+ }
+
+ Assert(slot);
+ *found_p = true;
+ return slot;
+
+not_found:
+ *found_p = false;
+ return NULL;
+}
+
+/*
+ * Set the node in node_iter so we can begin iterating over the node.
+ * Also, update the part of the key corresponding to the given node's chunk.
+ */
+static void
+rt_update_node_iter(rt_iter *iter, rt_node_iter *node_iter,
+ rt_node *node)
+{
+ node_iter->node = node;
+ node_iter->current_idx = -1;
+
+ rt_iter_update_key(iter, node->chunk, node->shift + RT_NODE_SPAN);
+}
+
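+/* Replace the chunk of the iterator's key at the given shift position */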
+static pg_attribute_always_inline void
+rt_iter_update_key(rt_iter *iter, uint8 chunk, uint8 shift)
+{
+ iter->key &= ~(((uint64) RT_CHUNK_MASK) << shift);
+ iter->key |= (((uint64) chunk) << shift);
+}
+
+/*
+ * Return the number of keys in the radix tree.
+ */
+uint64
+rt_num_entries(radix_tree *tree)
+{
+ return tree->num_keys;
+}
+
+/*
+ * Return the amount of memory used by the radix tree.
+ */
+uint64
+rt_memory_usage(radix_tree *tree)
+{
+ Size total = 0;
+
+ for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ {
+ total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
+ total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
+ }
+
+ return total;
+}
+
+/*
+ * Verify the radix tree node.
+ */
+static void
+rt_verify_node(rt_node *node)
+{
+#ifdef USE_ASSERT_CHECKING
+ Assert(node->count >= 0);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ case RT_NODE_KIND_16:
+ case RT_NODE_KIND_32:
+ {
+ uint8 *chunks = chunk_array_node_get_chunks(node);
+
+ /* Check if the chunks in the node are sorted */
+ for (int i = 1; i < node->count; i++)
+ Assert(chunks[i - 1] < chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ rt_node_base_128 *n128 = (rt_node_base_128 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!node_128_is_chunk_used(n128, i))
+ continue;
+
+ /* Check if the corresponding slot is used */
+ Assert(node_128_is_slot_used(n128, n128->slot_idxs[i]));
+
+ cnt++;
+ }
+
+ Assert(n128->n.count == cnt);
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_base_256 *n256 = (rt_node_base_256 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RT_NODE_NSLOTS_BITS(RT_NODE_MAX_SLOTS); i++)
+ cnt += pg_popcount32(n256->isset[i]);
+
+ /* Check if the number of used chunks matches */
+ Assert(n256->n.count == cnt);
+
+ break;
+ }
+ }
+#endif
+}
+
+/***************** DEBUG FUNCTIONS *****************/
+#ifdef RT_DEBUG
+void
+rt_stats(radix_tree *tree)
+{
+ fprintf(stderr, "num_keys = %lu, height = %u, n4 = %u, n16 = %u, n32 = %u, n128 = %u, n256 = %u\n",
+ tree->num_keys,
+ tree->root->shift / RT_NODE_SPAN,
+ tree->cnt[0],
+ tree->cnt[1],
+ tree->cnt[2],
+ tree->cnt[3],
+ tree->cnt[4]);
+ /* rt_dump(tree); */
+}
+
+static void
+rt_print_slot(StringInfo buf, uint8 chunk, uint64 value, int idx, bool is_leaf, int level)
+{
+ char space[128] = {0};
+
+ if (level > 0)
+ sprintf(space, "%*c", level * 4, ' ');
+
+ if (is_leaf)
+ appendStringInfo(buf, "%s[%d] \"0x%X\" val(0x%lX) LEAF\n",
+ space,
+ idx,
+ chunk,
+ value);
+ else
+ appendStringInfo(buf, "%s[%d] \"0x%X\" -> ",
+ space,
+ idx,
+ chunk);
+}
+
+static void
+rt_dump_node(rt_node *node, int level, StringInfo buf, bool recurse)
+{
+ bool is_leaf = IS_LEAF_NODE(node);
+
+ appendStringInfo(buf, "[\"%s\" type %d, cnt %u, shift %u, chunk \"0x%X\"] chunks:\n",
+ IS_LEAF_NODE(node) ? "LEAF" : "INNR",
+ (node->kind == RT_NODE_KIND_4) ? 4 :
+ (node->kind == RT_NODE_KIND_16) ? 16 :
+ (node->kind == RT_NODE_KIND_32) ? 32 :
+ (node->kind == RT_NODE_KIND_128) ? 128 : 256,
+ node->count, node->shift, node->chunk);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ case RT_NODE_KIND_16:
+ case RT_NODE_KIND_32:
+ {
+ uint8 *chunks = chunk_array_node_get_chunks(node);
+
+ for (int i = 0; i < node->count; i++)
+ {
+ if (IS_LEAF_NODE(node))
+ {
+ uint64 *values = rt_node_get_values(node);
+
+ rt_print_slot(buf, chunks[i],
+ values[i],
+ i, is_leaf, level);
+ }
+ else
+ rt_print_slot(buf, chunks[i],
+ UINT64_MAX,
+ i, is_leaf, level);
+
+ if (!is_leaf)
+ {
+ if (recurse)
+ {
+ rt_node **children = rt_node_get_children(node);
+ StringInfoData buf2;
+
+ initStringInfo(&buf2);
+ rt_dump_node(children[i],
+ level + 1, &buf2, recurse);
+ appendStringInfo(buf, "%s", buf2.data);
+ }
+ else
+ appendStringInfo(buf, "\n");
+ }
+ }
+
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ rt_node_base_128 *n128 = (rt_node_base_128 *) node;
+ uint8 *tmp = (uint8 *) n128->isset;
+
+ appendStringInfo(buf, "slot_idxs:");
+ for (int j = 0; j < 256; j++)
+ {
+ if (!node_128_is_chunk_used(n128, j))
+ continue;
+
+ appendStringInfo(buf, " [%d]=%d, ", j, n128->slot_idxs[j]);
+ }
+ appendStringInfo(buf, "\nisset-bitmap:");
+ for (int j = 0; j < 16; j++)
+ {
+ appendStringInfo(buf, "%X ", (uint8) tmp[j]);
+ }
+ appendStringInfo(buf, "\n");
+
+ for (int i = 0; i < 256; i++)
+ {
+ void *slot;
+
+ if (!node_128_is_chunk_used(n128, i))
+ continue;
+
+ slot = node_128_get_slot(n128, i);
+
+ if (is_leaf)
+ rt_print_slot(buf, i, *(uint64 *) slot,
+ i, is_leaf, level);
+ else
+ rt_print_slot(buf, i, UINT64_MAX, i, is_leaf, level);
+
+ if (!is_leaf)
+ {
+ if (recurse)
+ {
+ StringInfoData buf2;
+
+ initStringInfo(&buf2);
+ rt_dump_node((rt_node *) slot,
+ level + 1, &buf2, recurse);
+ appendStringInfo(buf, "%s", buf2.data);
+ }
+ else
+ appendStringInfo(buf, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_base_256 *n256 = (rt_node_base_256 *) node;
+
+ for (int i = 0; i < 256; i++)
+ {
+ void *slot;
+
+ if (!node_256_is_chunk_used(n256, i))
+ continue;
+
+ slot = node_256_get_slot(n256, i);
+
+ if (is_leaf)
+ rt_print_slot(buf, i, *(uint64 *) slot, i, is_leaf, level);
+ else
+ rt_print_slot(buf, i, UINT64_MAX, i, is_leaf, level);
+
+ if (!is_leaf)
+ {
+ if (recurse)
+ {
+ StringInfoData buf2;
+
+ initStringInfo(&buf2);
+ rt_dump_node((rt_node *) slot, level + 1, &buf2, recurse);
+ appendStringInfo(buf, "%s", buf2.data);
+ }
+ else
+ appendStringInfo(buf, "\n");
+ }
+ }
+ break;
+ }
+ }
+}
+
+void
+rt_dump_search(radix_tree *tree, uint64 key)
+{
+ StringInfoData buf;
+ rt_node *node;
+ int shift;
+ int level = 0;
+
+ elog(NOTICE, "-----------------------------------------------------------");
+ elog(NOTICE, "max_val = %lu (0x%lX)", tree->max_val, tree->max_val);
+
+ if (!tree->root)
+ {
+ elog(NOTICE, "tree is empty");
+ return;
+ }
+
+ if (key > tree->max_val)
+ {
+ elog(NOTICE, "key %lu (0x%lX) is larger than max val",
+ key, key);
+ return;
+ }
+
+ initStringInfo(&buf);
+ node = tree->root;
+ shift = tree->root->shift;
+ while (shift >= 0)
+ {
+ rt_node *child;
+
+ rt_dump_node(node, level, &buf, false);
+
+ if (IS_LEAF_NODE(node))
+ {
+ uint64 dummy;
+
+ /* We reached a leaf node; find the corresponding slot */
+ rt_node_search_leaf(node, key, RT_ACTION_FIND, &dummy);
+
+ break;
+ }
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ break;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ level++;
+ }
+
+ elog(NOTICE, "\n%s", buf.data);
+}
+
+void
+rt_dump(radix_tree *tree)
+{
+ StringInfoData buf;
+
+ initStringInfo(&buf);
+
+ elog(NOTICE, "-----------------------------------------------------------");
+ elog(NOTICE, "max_val = %lu", tree->max_val);
+ rt_dump_node(tree->root, 0, &buf, true);
+ elog(NOTICE, "\n%s", buf.data);
+ elog(NOTICE, "-----------------------------------------------------------");
+}
+#endif
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
new file mode 100644
index 0000000000..38cc6abf4c
--- /dev/null
+++ b/src/include/lib/radixtree.h
@@ -0,0 +1,42 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.h
+ * Interface for radix tree.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/lib/radixtree.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef RADIXTREE_H
+#define RADIXTREE_H
+
+#include "postgres.h"
+
+/* #define RT_DEBUG 1 */
+
+typedef struct radix_tree radix_tree;
+typedef struct rt_iter rt_iter;
+
+extern radix_tree *rt_create(MemoryContext ctx);
+extern void rt_free(radix_tree *tree);
+extern bool rt_search(radix_tree *tree, uint64 key, uint64 *val_p);
+extern bool rt_set(radix_tree *tree, uint64 key, uint64 val);
+extern rt_iter *rt_begin_iterate(radix_tree *tree);
+
+extern bool rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p);
+extern void rt_end_iterate(rt_iter *iter);
+extern bool rt_delete(radix_tree *tree, uint64 key);
+
+extern uint64 rt_memory_usage(radix_tree *tree);
+extern uint64 rt_num_entries(radix_tree *tree);
+
+#ifdef RT_DEBUG
+extern void rt_dump(radix_tree *tree);
+extern void rt_dump_search(radix_tree *tree, uint64 key);
+extern void rt_stats(radix_tree *tree);
+#endif
+
+#endif /* RADIXTREE_H */
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 6c31c8707c..8252ec41c4 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -25,6 +25,7 @@ SUBDIRS = \
test_parser \
test_pg_dump \
test_predtest \
+ test_radixtree \
test_rbtree \
test_regex \
test_rls_hooks \
diff --git a/src/test/modules/test_radixtree/.gitignore b/src/test/modules/test_radixtree/.gitignore
new file mode 100644
index 0000000000..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/test_radixtree/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_radixtree/Makefile b/src/test/modules/test_radixtree/Makefile
new file mode 100644
index 0000000000..da06b93da3
--- /dev/null
+++ b/src/test/modules/test_radixtree/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_radixtree/Makefile
+
+MODULE_big = test_radixtree
+OBJS = \
+ $(WIN32RES) \
+ test_radixtree.o
+PGFILEDESC = "test_radixtree - test code for src/backend/lib/radixtree.c"
+
+EXTENSION = test_radixtree
+DATA = test_radixtree--1.0.sql
+
+REGRESS = test_radixtree
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_radixtree
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_radixtree/README b/src/test/modules/test_radixtree/README
new file mode 100644
index 0000000000..a8b271869a
--- /dev/null
+++ b/src/test/modules/test_radixtree/README
@@ -0,0 +1,7 @@
+test_radixtree contains unit tests for testing the radix tree implementation
+in src/backend/lib/radixtree.c.
+
+The tests verify the correctness of the implementation, but they can also be
+used as a micro-benchmark. If you set the 'rt_test_stats' flag in
+test_radixtree.c, the tests will print extra information about execution time
+and memory usage.
diff --git a/src/test/modules/test_radixtree/expected/test_radixtree.out b/src/test/modules/test_radixtree/expected/test_radixtree.out
new file mode 100644
index 0000000000..cc6970c87c
--- /dev/null
+++ b/src/test/modules/test_radixtree/expected/test_radixtree.out
@@ -0,0 +1,28 @@
+CREATE EXTENSION test_radixtree;
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
+NOTICE: testing radix tree node types with shift "0"
+NOTICE: testing radix tree node types with shift "8"
+NOTICE: testing radix tree node types with shift "16"
+NOTICE: testing radix tree node types with shift "24"
+NOTICE: testing radix tree node types with shift "32"
+NOTICE: testing radix tree node types with shift "40"
+NOTICE: testing radix tree node types with shift "48"
+NOTICE: testing radix tree node types with shift "56"
+NOTICE: testing radix tree with pattern "all ones"
+NOTICE: testing radix tree with pattern "alternating bits"
+NOTICE: testing radix tree with pattern "clusters of ten"
+NOTICE: testing radix tree with pattern "clusters of hundred"
+NOTICE: testing radix tree with pattern "one-every-64k"
+NOTICE: testing radix tree with pattern "sparse"
+NOTICE: testing radix tree with pattern "single values, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^60"
+ test_radixtree
+----------------
+
+(1 row)
+
diff --git a/src/test/modules/test_radixtree/sql/test_radixtree.sql b/src/test/modules/test_radixtree/sql/test_radixtree.sql
new file mode 100644
index 0000000000..41ece5e9f5
--- /dev/null
+++ b/src/test/modules/test_radixtree/sql/test_radixtree.sql
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_radixtree;
+
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
diff --git a/src/test/modules/test_radixtree/test_radixtree--1.0.sql b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
new file mode 100644
index 0000000000..074a5a7ea7
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_radixtree/test_radixtree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_radixtree" to load this file. \quit
+
+CREATE FUNCTION test_radixtree()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
new file mode 100644
index 0000000000..a4aa80a99c
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -0,0 +1,504 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_radixtree.c
+ * Test radixtree set data structure.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/test/modules/test_radixtree/test_radixtree.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "lib/radixtree.h"
+#include "miscadmin.h"
+#include "nodes/bitmapset.h"
+#include "storage/block.h"
+#include "storage/itemptr.h"
+#include "utils/memutils.h"
+#include "utils/timestamp.h"
+
+#define UINT64_HEX_FORMAT "%" INT64_MODIFIER "X"
+
+/*
+ * If you enable this, the "pattern" tests will print information about
+ * how long populating, probing, and iterating the test set takes, and
+ * how much memory the test set consumed. That can be used as
+ * micro-benchmark of various operations and input patterns (you might
+ * want to increase the number of values used in each of the test, if
+ * you do that, to reduce noise).
+ *
+ * The information is printed to the server's stderr, mostly because
+ * that's where MemoryContextStats() output goes.
+ */
+static const bool rt_test_stats = false;
+
+/* The maximum number of entries each node type can have */
+static int rt_node_max_entries[] = {
+ 4, /* RT_NODE_KIND_4 */
+ 16, /* RT_NODE_KIND_16 */
+ 32, /* RT_NODE_KIND_32 */
+ 128, /* RT_NODE_KIND_128 */
+ 256 /* RT_NODE_KIND_256 */
+};
+
+/*
+ * A struct to define a pattern of integers, for use with the test_pattern()
+ * function.
+ */
+typedef struct
+{
+ char *test_name; /* short name of the test, for humans */
+ char *pattern_str; /* a bit pattern */
+ uint64 spacing; /* pattern repeats at this interval */
+ uint64 num_values; /* number of integers to set in total */
+} test_spec;
+
+/* Test patterns borrowed from test_integerset.c */
+static const test_spec test_specs[] = {
+ {
+ "all ones", "1111111111",
+ 10, 1000000
+ },
+ {
+ "alternating bits", "0101010101",
+ 10, 1000000
+ },
+ {
+ "clusters of ten", "1111111111",
+ 10000, 1000000
+ },
+ {
+ "clusters of hundred",
+ "1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111",
+ 10000, 10000000
+ },
+ {
+ "one-every-64k", "1",
+ 65536, 1000000
+ },
+ {
+ "sparse", "100000000000000000000000000000001",
+ 10000000, 1000000
+ },
+ {
+ "single values, distance > 2^32", "1",
+ UINT64CONST(10000000000), 100000
+ },
+ {
+ "clusters, distance > 2^32", "10101010",
+ UINT64CONST(10000000000), 1000000
+ },
+ {
+ "clusters, distance > 2^60", "10101010",
+ UINT64CONST(2000000000000000000),
+ 23 /* can't be much higher than this, or we
+ * overflow uint64 */
+ }
+};
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(test_radixtree);
+
+static void
+test_empty(void)
+{
+ radix_tree *radixtree;
+ uint64 dummy;
+
+ radixtree = rt_create(CurrentMemoryContext);
+
+ if (rt_search(radixtree, 0, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, 1, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, PG_UINT64_MAX, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_num_entries(radixtree) != 0)
+ elog(ERROR, "rt_num_entries on empty tree returned non-zero");
+
+ rt_free(radixtree);
+}
+
+/*
+ * Check if keys from start to end with the shift exist in the tree.
+ */
+static void
+check_search_on_node(radix_tree *radixtree, uint8 shift, int start, int end)
+{
+ for (int i = start; i < end; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ uint64 val;
+
+ if (!rt_search(radixtree, key, &val))
+ elog(ERROR, "key 0x" UINT64_HEX_FORMAT " is not found on node-%d",
+ key, end);
+ if (val != key)
+ elog(ERROR, "rt_search with key 0x" UINT64_HEX_FORMAT " returns 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ key, val, key);
+ }
+}
+
+static void
+test_node_types_insert(radix_tree *radixtree, uint8 shift)
+{
+ uint64 num_entries;
+
+ for (int i = 0; i < 256; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_set(radixtree, key, key);
+
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " found", key);
+
+ for (int j = 0; j < lengthof(rt_node_max_entries); j++)
+ {
+ /*
+ * After filling all slots in each node type, check if the values are
+ * stored properly.
+ */
+ if (i == (rt_node_max_entries[j] - 1))
+ {
+ check_search_on_node(radixtree, shift,
+ (j == 0) ? 0 : rt_node_max_entries[j - 1],
+ rt_node_max_entries[j]);
+ break;
+ }
+ }
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ if (num_entries != 256)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(256));
+}
+
+static void
+test_node_types_delete(radix_tree *radixtree, uint8 shift)
+{
+ uint64 num_entries;
+
+ for (int i = 0; i < 256; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_delete(radixtree, key);
+
+ if (!found)
+ elog(ERROR, "inserted key 0x" UINT64_HEX_FORMAT " is not found", key);
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ /* The tree must be empty */
+ if (num_entries != 0)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(0));
+}
+
+/*
+ * Test inserting and deleting key-value pairs into each node type at the
+ * given shift level.
+ */
+static void
+test_node_types(uint8 shift)
+{
+ radix_tree *radixtree;
+
+ elog(NOTICE, "testing radix tree node types with shift \"%d\"", shift);
+
+ radixtree = rt_create(CurrentMemoryContext);
+
+ /*
+ * Insert and search entries for every node type at the 'shift' level,
+ * then delete all entries to make it empty, and insert and search
+ * entries again.
+ */
+ test_node_types_insert(radixtree, shift);
+ test_node_types_delete(radixtree, shift);
+ test_node_types_insert(radixtree, shift);
+
+ rt_free(radixtree);
+}
+
+/*
+ * Test with a repeating pattern, defined by the 'spec'.
+ */
+static void
+test_pattern(const test_spec *spec)
+{
+ radix_tree *radixtree;
+ rt_iter *iter;
+ MemoryContext radixtree_ctx;
+ TimestampTz starttime;
+ TimestampTz endtime;
+ uint64 n;
+ uint64 last_int;
+ uint64 ndeleted;
+ uint64 nbefore;
+ uint64 nafter;
+ int patternlen;
+ uint64 *pattern_values;
+ uint64 pattern_num_values;
+
+ elog(NOTICE, "testing radix tree with pattern \"%s\"", spec->test_name);
+ if (rt_test_stats)
+ fprintf(stderr, "-----\ntesting radix tree with pattern \"%s\"\n", spec->test_name);
+
+ /* Pre-process the pattern, creating an array of integers from it. */
+ patternlen = strlen(spec->pattern_str);
+ pattern_values = palloc(patternlen * sizeof(uint64));
+ pattern_num_values = 0;
+ for (int i = 0; i < patternlen; i++)
+ {
+ if (spec->pattern_str[i] == '1')
+ pattern_values[pattern_num_values++] = i;
+ }
+
+ /*
+ * Allocate the radix tree.
+ *
+ * Allocate it in a separate memory context, so that we can print its
+ * memory usage easily.
+ */
+ radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
+ "radixtree test",
+ ALLOCSET_SMALL_SIZES);
+ MemoryContextSetIdentifier(radixtree_ctx, spec->test_name);
+ radixtree = rt_create(radixtree_ctx);
+
+ /*
+ * Add values to the set.
+ */
+ starttime = GetCurrentTimestamp();
+
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ uint64 x = 0;
+
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ bool found;
+
+ x = last_int + pattern_values[i];
+
+ found = rt_set(radixtree, x, x);
+
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " found", x);
+
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+
+ endtime = GetCurrentTimestamp();
+
+ if (rt_test_stats)
+ fprintf(stderr, "added " UINT64_FORMAT " values in %d ms\n",
+ spec->num_values, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Print stats on the amount of memory used.
+ *
+ * We print the usage reported by rt_memory_usage(), as well as the
+ * stats from the memory context. They should be in the same ballpark,
+ * but it's hard to automate testing that, so if you're making changes to
+ * the implementation, just observe that manually.
+ */
+ if (rt_test_stats)
+ {
+ uint64 mem_usage;
+
+ /*
+ * Also print memory usage as reported by rt_memory_usage(). It
+ * should be in the same ballpark as the usage reported by
+ * MemoryContextStats().
+ */
+ mem_usage = rt_memory_usage(radixtree);
+ fprintf(stderr, "rt_memory_usage() reported " UINT64_FORMAT " (%0.2f bytes / integer)\n",
+ mem_usage, (double) mem_usage / spec->num_values);
+
+ MemoryContextStats(radixtree_ctx);
+ }
+
+ /* Check that rt_num_entries works */
+ n = rt_num_entries(radixtree);
+ if (n != spec->num_values)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT, n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_search()
+ */
+ starttime = GetCurrentTimestamp();
+
+ for (n = 0; n < 100000; n++)
+ {
+ bool found;
+ bool expected;
+ uint64 x;
+ uint64 v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Do we expect this value to be present in the set? */
+ if (x >= last_int)
+ expected = false;
+ else
+ {
+ uint64 idx = x % spec->spacing;
+
+ if (idx >= patternlen)
+ expected = false;
+ else if (spec->pattern_str[idx] == '1')
+ expected = true;
+ else
+ expected = false;
+ }
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (found != expected)
+ elog(ERROR, "mismatch at 0x" UINT64_HEX_FORMAT ": %d vs %d", x, found, expected);
+ if (found && (v != x))
+ elog(ERROR, "found 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ v, x);
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "probed " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Test iterator
+ */
+ starttime = GetCurrentTimestamp();
+
+ iter = rt_begin_iterate(radixtree);
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ uint64 expected = last_int + pattern_values[i];
+ uint64 x;
+ uint64 val;
+
+ if (!rt_iterate_next(iter, &x, &val))
+ break;
+
+ if (x != expected)
+ elog(ERROR,
+ "iterate returned wrong key; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d",
+ x, expected, i);
+ if (val != expected)
+ elog(ERROR,
+ "iterate returned wrong value; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d", val, expected, i);
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "iterated " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ if (n < spec->num_values)
+ elog(ERROR, "iterator stopped short after " UINT64_FORMAT " entries, expected " UINT64_FORMAT, n, spec->num_values);
+ if (n > spec->num_values)
+ elog(ERROR, "iterator returned " UINT64_FORMAT " entries, " UINT64_FORMAT " was expected", n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_delete()
+ */
+ starttime = GetCurrentTimestamp();
+
+ nbefore = rt_num_entries(radixtree);
+ ndeleted = 0;
+ for (n = 0; n < 100000; n++)
+ {
+ bool found;
+ uint64 x;
+ uint64 v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (!found)
+ continue;
+
+ /* If the key is found, delete it and check again */
+ if (!rt_delete(radixtree, x))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_search(radixtree, x, &v))
+ elog(ERROR, "found deleted key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_delete(radixtree, x))
+ elog(ERROR, "deleted already-deleted key 0x" UINT64_HEX_FORMAT, x);
+
+ ndeleted++;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "deleted " UINT64_FORMAT " values in %d ms\n",
+ ndeleted, (int) (endtime - starttime) / 1000);
+
+ nafter = rt_num_entries(radixtree);
+
+ /* Check that rt_num_entries works */
+ if ((nbefore - ndeleted) != nafter)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT " after " UINT64_FORMAT " deletions",
+ nafter, (nbefore - ndeleted), ndeleted);
+
+ MemoryContextDelete(radixtree_ctx);
+}
+
+Datum
+test_radixtree(PG_FUNCTION_ARGS)
+{
+ test_empty();
+
+ for (int shift = 0; shift <= (64 - 8); shift += 8)
+ test_node_types(shift);
+
+ /* Test different test patterns, with lots of entries */
+ for (int i = 0; i < lengthof(test_specs); i++)
+ test_pattern(&test_specs[i]);
+
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_radixtree/test_radixtree.control b/src/test/modules/test_radixtree/test_radixtree.control
new file mode 100644
index 0000000000..e53f2a3e0c
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.control
@@ -0,0 +1,4 @@
+comment = 'Test code for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/test_radixtree'
+relocatable = true
--
2.31.1
Attachment: v6-0003-tool-for-measuring-radix-tree-performance.patch (application/x-patch)
From 39f0019d95eb4808d235a07d107aee2ff46856e2 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 16 Sep 2022 11:57:03 +0900
Subject: [PATCH v6 3/3] tool for measuring radix tree performance
---
contrib/bench_radix_tree/Makefile | 21 ++
.../bench_radix_tree--1.0.sql | 42 +++
contrib/bench_radix_tree/bench_radix_tree.c | 301 ++++++++++++++++++
.../bench_radix_tree/bench_radix_tree.control | 6 +
contrib/bench_radix_tree/expected/bench.out | 13 +
contrib/bench_radix_tree/sql/bench.sql | 16 +
6 files changed, 399 insertions(+)
create mode 100644 contrib/bench_radix_tree/Makefile
create mode 100644 contrib/bench_radix_tree/bench_radix_tree--1.0.sql
create mode 100644 contrib/bench_radix_tree/bench_radix_tree.c
create mode 100644 contrib/bench_radix_tree/bench_radix_tree.control
create mode 100644 contrib/bench_radix_tree/expected/bench.out
create mode 100644 contrib/bench_radix_tree/sql/bench.sql
diff --git a/contrib/bench_radix_tree/Makefile b/contrib/bench_radix_tree/Makefile
new file mode 100644
index 0000000000..b8f70e12d1
--- /dev/null
+++ b/contrib/bench_radix_tree/Makefile
@@ -0,0 +1,21 @@
+# contrib/bench_radix_tree/Makefile
+
+MODULE_big = bench_radix_tree
+OBJS = \
+ bench_radix_tree.o
+
+EXTENSION = bench_radix_tree
+DATA = bench_radix_tree--1.0.sql
+
+REGRESS = bench
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/bench_radix_tree
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
new file mode 100644
index 0000000000..6663abe6a4
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
@@ -0,0 +1,42 @@
+/* contrib/bench_radix_tree/bench_radix_tree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION bench_radix_tree" to load this file. \quit
+
+create function bench_shuffle_search(
+minblk int4,
+maxblk int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT array_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT array_load_ms int8,
+OUT rt_search_ms int8,
+OUT array_search_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_seq_search(
+minblk int4,
+maxblk int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT array_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT array_load_ms int8,
+OUT rt_search_ms int8,
+OUT array_search_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_load_random_int(
+cnt int8,
+OUT mem_allocated int8,
+OUT load_ms int8)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c
new file mode 100644
index 0000000000..5806ef7519
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -0,0 +1,301 @@
+/*-------------------------------------------------------------------------
+ *
+ * bench_radix_tree.c
+ *
+ * Copyright (c) 2016-2022, PostgreSQL Global Development Group
+ *
+ * contrib/bench_radix_tree/bench_radix_tree.c
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "funcapi.h"
+#include "lib/radixtree.h"
+#include "miscadmin.h"
+#include "utils/timestamp.h"
+
+PG_MODULE_MAGIC;
+
+#define TIDS_PER_BLOCK_FOR_LOAD 30
+#define TIDS_PER_BLOCK_FOR_LOOKUP 50
+
+PG_FUNCTION_INFO_V1(bench_seq_search);
+PG_FUNCTION_INFO_V1(bench_shuffle_search);
+PG_FUNCTION_INFO_V1(bench_load_random_int);
+
+static radix_tree *rt = NULL;
+static ItemPointer itemptrs = NULL;
+
+static uint64
+tid_to_key_off(ItemPointer tid, uint32 *off)
+{
+ uint32 upper;
+ uint32 shift = pg_ceil_log2_32(MaxHeapTuplesPerPage);
+ int64 tid_i;
+
+ Assert(ItemPointerGetOffsetNumber(tid) < MaxHeapTuplesPerPage);
+
+ tid_i = ItemPointerGetOffsetNumber(tid);
+ tid_i |= ItemPointerGetBlockNumber(tid) << shift;
+
+ /* log(sizeof(uint64) * BITS_PER_BYTE, 2) = log(64, 2) = 6 */
+ *off = tid_i & ((1 << 6) - 1);
+ upper = tid_i >> 6;
+ Assert(*off < (sizeof(uint64) * BITS_PER_BYTE));
+
+ Assert(*off < 64);
+
+ return upper;
+}
+
+static int
+shuffle_randrange(pg_prng_state *state, int lower, int upper)
+{
+ return (int) floor(pg_prng_double(state) * ((upper-lower)+0.999999)) + lower;
+}
+
+/* Naive Fisher-Yates implementation */
+static void
+shuffle_itemptrs(ItemPointer itemptr, uint64 nitems)
+{
+ /* for reproducibility */
+ pg_prng_state state;
+
+ pg_prng_seed(&state, 0);
+
+ for (int i = 0; i < nitems - 1; i++)
+ {
+ int j = shuffle_randrange(&state, i, nitems - 1);
+ ItemPointerData t = itemptrs[j];
+
+ itemptrs[j] = itemptrs[i];
+ itemptrs[i] = t;
+ }
+}
+
+static ItemPointer
+generate_tids(BlockNumber minblk, BlockNumber maxblk, int ntids_per_blk, uint64 *ntids_p)
+{
+ ItemPointer tids;
+ uint64 maxitems;
+ uint64 ntids = 0;
+
+ maxitems = (maxblk - minblk + 1) * ntids_per_blk;
+ tids = MemoryContextAllocHuge(TopTransactionContext,
+ sizeof(ItemPointerData) * maxitems);
+
+ for (BlockNumber blk = minblk; blk < maxblk; blk++)
+ {
+ for (OffsetNumber off = FirstOffsetNumber;
+ off <= ntids_per_blk; off++)
+ {
+ CHECK_FOR_INTERRUPTS();
+
+ ItemPointerSetBlockNumber(&(tids[ntids]), blk);
+ ItemPointerSetOffsetNumber(&(tids[ntids]), off);
+
+ ntids++;
+ }
+ }
+
+ *ntids_p = ntids;
+ return tids;
+}
+
+static int
+vac_cmp_itemptr(const void *left, const void *right)
+{
+ BlockNumber lblk,
+ rblk;
+ OffsetNumber loff,
+ roff;
+
+ lblk = ItemPointerGetBlockNumber((ItemPointer) left);
+ rblk = ItemPointerGetBlockNumber((ItemPointer) right);
+
+ if (lblk < rblk)
+ return -1;
+ if (lblk > rblk)
+ return 1;
+
+ loff = ItemPointerGetOffsetNumber((ItemPointer) left);
+ roff = ItemPointerGetOffsetNumber((ItemPointer) right);
+
+ if (loff < roff)
+ return -1;
+ if (loff > roff)
+ return 1;
+
+ return 0;
+}
+
+static Datum
+bench_search(FunctionCallInfo fcinfo, bool shuffle)
+{
+ BlockNumber minblk = PG_GETARG_INT32(0);
+ BlockNumber maxblk = PG_GETARG_INT32(1);
+ uint64 ntids;
+ uint64 key;
+ uint64 last_key = PG_UINT64_MAX;
+ uint64 val = 0;
+ ItemPointer tids;
+ TupleDesc tupdesc;
+ TimestampTz start_time, end_time;
+ long secs;
+ int usecs;
+ int64 rt_load_ms, rt_search_ms, ar_load_ms, ar_search_ms;
+ Datum values[7];
+ bool nulls[7];
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ tids = generate_tids(minblk, maxblk, TIDS_PER_BLOCK_FOR_LOAD, &ntids);
+
+ /* measure the load time of the radix tree */
+ rt = rt_create(CurrentMemoryContext);
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ uint32 off;
+
+ CHECK_FOR_INTERRUPTS();
+
+ key = tid_to_key_off(tid, &off);
+
+ if (last_key != PG_UINT64_MAX && last_key != key)
+ {
+ rt_set(rt, last_key, val);
+ val = 0;
+ }
+
+ last_key = key;
+ val |= (uint64) 1 << off;
+ }
+ if (last_key != PG_UINT64_MAX)
+ rt_set(rt, last_key, val);
+
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_load_ms = secs * 1000 + usecs / 1000;
+
+ /* measure the load time of the array */
+ itemptrs = MemoryContextAllocHuge(CurrentMemoryContext,
+ sizeof(ItemPointerData) * ntids);
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointerSetBlockNumber(&(itemptrs[i]),
+ ItemPointerGetBlockNumber(&(tids[i])));
+ ItemPointerSetOffsetNumber(&(itemptrs[i]),
+ ItemPointerGetOffsetNumber(&(tids[i])));
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ ar_load_ms = secs * 1000 + usecs / 1000;
+
+ if (shuffle)
+ shuffle_itemptrs(tids, ntids);
+
+ /* measure the search time of the radix tree */
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ uint64 key, val;
+ uint32 off;
+
+ CHECK_FOR_INTERRUPTS();
+
+ key = tid_to_key_off(tid, &off);
+
+ rt_search(rt, key, &val);
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_search_ms = secs * 1000 + usecs / 1000;
+
+ /* next, measure the search time of the array */
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+
+ bsearch((void *) tid,
+ (void *) itemptrs,
+ ntids,
+ sizeof(ItemPointerData),
+ vac_cmp_itemptr);
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ ar_search_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int64GetDatum(rt_num_entries(rt));
+ values[1] = Int64GetDatum(rt_memory_usage(rt));
+ values[2] = Int64GetDatum(sizeof(ItemPointerData) * ntids);
+ values[3] = Int64GetDatum(rt_load_ms);
+ values[4] = Int64GetDatum(ar_load_ms);
+ values[5] = Int64GetDatum(rt_search_ms);
+ values[6] = Int64GetDatum(ar_search_ms);
+
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_seq_search(PG_FUNCTION_ARGS)
+{
+ return bench_search(fcinfo, false);
+}
+
+Datum
+bench_shuffle_search(PG_FUNCTION_ARGS)
+{
+ return bench_search(fcinfo, true);
+}
+
+Datum
+bench_load_random_int(PG_FUNCTION_ARGS)
+{
+ uint64 cnt = (uint64) PG_GETARG_INT64(0);
+ radix_tree *rt;
+ pg_prng_state state;
+ TupleDesc tupdesc;
+ TimestampTz start_time, end_time;
+ long secs;
+ int usecs;
+ int64 load_time_ms;
+ Datum values[2];
+ bool nulls[2];
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ pg_prng_seed(&state, 0);
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ uint64 key = pg_prng_uint64(&state);
+
+ rt_set(rt, key, key);
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ load_time_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int64GetDatum(rt_memory_usage(rt));
+ values[1] = Int64GetDatum(load_time_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
diff --git a/contrib/bench_radix_tree/bench_radix_tree.control b/contrib/bench_radix_tree/bench_radix_tree.control
new file mode 100644
index 0000000000..1d988e6c9a
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.control
@@ -0,0 +1,6 @@
+# bench_radix_tree extension
+comment = 'benchmark suite for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/bench_radix_tree'
+relocatable = true
+trusted = true
diff --git a/contrib/bench_radix_tree/expected/bench.out b/contrib/bench_radix_tree/expected/bench.out
new file mode 100644
index 0000000000..60c303892e
--- /dev/null
+++ b/contrib/bench_radix_tree/expected/bench.out
@@ -0,0 +1,13 @@
+create extension bench_radix_tree;
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
diff --git a/contrib/bench_radix_tree/sql/bench.sql b/contrib/bench_radix_tree/sql/bench.sql
new file mode 100644
index 0000000000..a46018c9d4
--- /dev/null
+++ b/contrib/bench_radix_tree/sql/bench.sql
@@ -0,0 +1,16 @@
+create extension bench_radix_tree;
+
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
--
2.31.1
On Fri, Sep 16, 2022 at 1:01 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Mon, Aug 15, 2022 at 10:39 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
bool, both = and <=. Should be pretty close. Also, I believe if you
left this for last as a possible refactoring, it might save some work.
v6 demonstrates why this should have been put off towards the end. (more below)
In any case, I'll take a look at the latest patch next month.
Since the CF entry said "Needs Review", I began looking at v5 again
this week. Hopefully not too much has changed, but in the future I
strongly recommend setting to "Waiting on Author" if a new version is
forthcoming. I realize many here share updated patches at any time,
but I'd like to discourage the practice especially for large patches.
I've updated the radix tree patch. It's now separated into two patches.
0001 patch introduces pg_lsearch8() and pg_lsearch8_ge() (we may find
better names) that are similar to the pg_lfind8() family but they
return the index of the key in the vector instead of true/false. The
patch includes regression tests.
I don't want to do a full review of this just yet, but I'll just point
out some problems from a quick glance.
+/*
+ * Return the index of the first element in the vector that is greater than
+ * or equal to the given scalar. Return sizeof(Vector8) if there is no such
+ * element.
That's a bizarre API to indicate non-existence.
+ *
+ * Note that this function assumes the elements in the vector are sorted.
+ */
That is *completely* unacceptable for a general-purpose function.
+#else /* USE_NO_SIMD */
+ Vector8 r = 0;
+ uint8 *rp = (uint8 *) &r;
+
+ for (Size i = 0; i < sizeof(Vector8); i++)
+ rp[i] = (((const uint8 *) &v1)[i] == ((const uint8 *) &v2)[i]) ? 0xFF : 0;
I don't think we should try to force the non-simd case to adopt the
special semantics of vector comparisons. It's much easier to just use
the same logic as the assert builds.
+#ifdef USE_SSE2
+ return (uint32) _mm_movemask_epi8(v);
+#elif defined(USE_NEON)
+ static const uint8 mask[16] = {
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ };
+
+ uint8x16_t masked = vandq_u8(vld1q_u8(mask), (uint8x16_t)
vshrq_n_s8(v, 7));
+ uint8x16_t maskedhi = vextq_u8(masked, masked, 8);
+
+ return (uint32) vaddvq_u16((uint16x8_t) vzip1q_u8(masked, maskedhi));
For Arm, we need to be careful here. This article goes into a lot of
detail for this situation:
Here again, I'd rather put this off and focus on getting the "large
details" in good enough shape so we can got towards integrating with
vacuum.
In addition to two patches, I've attached the third patch. It's not
part of radix tree implementation but introduces a contrib module
bench_radix_tree, a tool for radix tree performance benchmarking. It
measures loading and lookup performance of both the radix tree and a
flat array.
Excellent! This was high on my wish list.
--
John Naylor
EDB: http://www.enterprisedb.com
On Fri, Sep 16, 2022 at 02:54:14PM +0700, John Naylor wrote:
Here again, I'd rather put this off and focus on getting the "large
details" in good enough shape so we can got towards integrating with
vacuum.
I started a new thread for the SIMD patch [0]/messages/by-id/20220917052903.GA3172400@nathanxps13 so that this thread can
remain focused on the radix tree stuff.
[0]: /messages/by-id/20220917052903.GA3172400@nathanxps13
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Fri, Sep 16, 2022 at 4:54 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Fri, Sep 16, 2022 at 1:01 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Mon, Aug 15, 2022 at 10:39 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
bool, both = and <=. Should be pretty close. Also, I believe if you
left this for last as a possible refactoring, it might save some work.
v6 demonstrates why this should have been put off towards the end. (more below)
In any case, I'll take a look at the latest patch next month.
Since the CF entry said "Needs Review", I began looking at v5 again
this week. Hopefully not too much has changed, but in the future I
strongly recommend setting to "Waiting on Author" if a new version is
forthcoming. I realize many here share updated patches at any time,
but I'd like to discourage the practice especially for large patches.
Understood. Sorry for the inconveniences.
I've updated the radix tree patch. It's now separated into two patches.
0001 patch introduces pg_lsearch8() and pg_lsearch8_ge() (we may find
better names) that are similar to the pg_lfind8() family but they
return the index of the key in the vector instead of true/false. The
patch includes regression tests.
I don't want to do a full review of this just yet, but I'll just point
out some problems from a quick glance.
+/*
+ * Return the index of the first element in the vector that is greater than
+ * or equal to the given scalar. Return sizeof(Vector8) if there is no such
+ * element.
That's a bizarre API to indicate non-existence.
+ *
+ * Note that this function assumes the elements in the vector are sorted.
+ */
That is *completely* unacceptable for a general-purpose function.
+#else /* USE_NO_SIMD */
+ Vector8 r = 0;
+ uint8 *rp = (uint8 *) &r;
+
+ for (Size i = 0; i < sizeof(Vector8); i++)
+ rp[i] = (((const uint8 *) &v1)[i] == ((const uint8 *) &v2)[i]) ? 0xFF : 0;
I don't think we should try to force the non-simd case to adopt the
special semantics of vector comparisons. It's much easier to just use
the same logic as the assert builds.
+#ifdef USE_SSE2
+ return (uint32) _mm_movemask_epi8(v);
+#elif defined(USE_NEON)
+ static const uint8 mask[16] = {
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ };
+
+ uint8x16_t masked = vandq_u8(vld1q_u8(mask), (uint8x16_t) vshrq_n_s8(v, 7));
+ uint8x16_t maskedhi = vextq_u8(masked, masked, 8);
+
+ return (uint32) vaddvq_u16((uint16x8_t) vzip1q_u8(masked, maskedhi));
For Arm, we need to be careful here. This article goes into a lot of
detail for this situation:
Here again, I'd rather put this off and focus on getting the "large
details" in good enough shape so we can go towards integrating with
vacuum.
Thank you for the comments! These above comments are addressed by
Nathan in a newly derived thread. I'll work on the patch.
I'll consider how to integrate with vacuum as the next step. One
concern for me is how to limit the memory usage to
maintenance_work_mem. Unlike using a flat array, memory space for
adding one TID varies depending on the situation. If we want strictly
not to allow using memory more than maintenance_work_mem, probably we
need to estimate the memory consumption in a conservative way.
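For illustration only, a conservative check could be shaped roughly like this; rt_memory_usage() is from the patch, while dead_items_reached_limit() and RT_MAX_NODE_SIZE are invented names for this sketch:

/*
 * Sketch of a conservative limit check while accumulating dead TIDs.
 * Adding the size of the largest node means that inserting one more TID,
 * which may force a node to grow, cannot push us over the limit.
 */
static inline bool
dead_items_reached_limit(radix_tree *dead_items)
{
	uint64 limit = (uint64) maintenance_work_mem * 1024;	/* GUC is in kB */

	return rt_memory_usage(dead_items) + RT_MAX_NODE_SIZE >= limit;
}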
Regards,
--
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Tue, Sep 20, 2022 at 3:19 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
On Fri, Sep 16, 2022 at 4:54 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
Here again, I'd rather put this off and focus on getting the "large
details" in good enough shape so we can got towards integrating with
vacuum.Thank you for the comments! These above comments are addressed by
Nathan in a newly derived thread. I'll work on the patch.
I still seem to be out-voted on when to tackle this particular
optimization, so I've extended the v6 benchmark code with a hackish
function that populates a fixed number of keys, but with different fanouts.
(diff attached as a text file)
I didn't take particular care to make this scientific, but the following
seems pretty reproducible. Note what happens to load and search performance
when node16 has 15 entries versus 16:
fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+--------+------------------+------------+--------------
15 | 327680 | 3776512 | 39 | 20
(1 row)
num_keys = 327680, height = 4, n4 = 1, n16 = 23408, n32 = 0, n128 = 0, n256 = 0
fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+--------+------------------+------------+--------------
16 | 327680 | 3514368 | 25 | 11
(1 row)
num_keys = 327680, height = 4, n4 = 0, n16 = 21846, n32 = 0, n128 = 0, n256 = 0
In trying to wrap the SIMD code behind layers of abstraction, the latest
patch (and Nathan's cleanup) threw it away in almost all cases. To explain,
we need to talk about how vectorized code deals with the "tail" that is too
small for the register:
1. Use a one-by-one algorithm, like we do for the pg_lfind* variants.
2. Read some junk into the register and mask off false positives from the
result.
There are advantages to both depending on the situation.
Patch v5 and earlier used #2. Patch v6 used #1, so if a node16 has 15
elements or less, it will iterate over them one-by-one exactly like a
node4. Only when full with 16 will the vector path be taken. When another
entry is added, the elements are copied to the next bigger node, so there's
a *small* window where it's fast.
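To make #2 concrete, here is a minimal standalone sketch of what a node16-style equality search could look like on an SSE2 build; the function name and shape are invented for illustration, and the entries past "count" are exactly the junk being masked off:

#include <emmintrin.h>
#include <stdint.h>

static inline int
node16_search_eq(const uint8_t chunks[16], int count, uint8_t key)
{
	__m128i  keyv = _mm_set1_epi8((char) key);
	/* load all 16 bytes; entries past "count" are junk but safely readable */
	__m128i  datav = _mm_loadu_si128((const __m128i *) chunks);
	uint32_t bits = (uint32_t) _mm_movemask_epi8(_mm_cmpeq_epi8(keyv, datav));

	/* mask off false positives coming from the uninitialized tail */
	bits &= (1u << count) - 1;

	/* __builtin_ctz (gcc/clang) used here only for brevity */
	return bits ? __builtin_ctz(bits) : -1;
}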
In short, this code needs to be lower level so that we still have full
control while being portable. I will work on this, and also the related
code for node dispatch.
Since v6 has some good infrastructure to do low-level benchmarking, I also
want to do some experiments with memory management.
(I have further comments about the code, but I will put that off until
later)
I'll consider how to integrate with vacuum as the next step. One
concern for me is how to limit the memory usage to
maintenance_work_mem. Unlike using a flat array, memory space for
adding one TID varies depending on the situation. If we want strictly
not to allow using memory more than maintenance_work_mem, probably we
need to estimate the memory consumption in a conservative way.
+1
--
John Naylor
EDB: http://www.enterprisedb.com
Attachments:
v6addendum-bench-node16.diff.txt (text/plain)
commit 18407962e96ccec6c9aeeba97412edd762a5a4fe
Author: John Naylor <john.naylor@postgresql.org>
Date: Wed Sep 21 11:44:43 2022 +0700
Add special benchmark function to test effect of fanout
diff --git a/contrib/bench_radix_tree/Makefile b/contrib/bench_radix_tree/Makefile
index b8f70e12d1..952bb0ceae 100644
--- a/contrib/bench_radix_tree/Makefile
+++ b/contrib/bench_radix_tree/Makefile
@@ -7,7 +7,7 @@ OBJS = \
EXTENSION = bench_radix_tree
DATA = bench_radix_tree--1.0.sql
-REGRESS = bench
+REGRESS = bench_fixed_height
ifdef USE_PGXS
PG_CONFIG = pg_config
diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
index 6663abe6a4..f2fee15b17 100644
--- a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
+++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
@@ -40,3 +40,15 @@ OUT load_ms int8)
returns record
as 'MODULE_PATHNAME'
LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_fixed_height_search(
+fanout int4,
+OUT fanout int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT rt_search_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c
index 5806ef7519..0778da2d7b 100644
--- a/contrib/bench_radix_tree/bench_radix_tree.c
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -13,6 +13,7 @@
#include "fmgr.h"
#include "funcapi.h"
#include "lib/radixtree.h"
+#include <math.h>
#include "miscadmin.h"
#include "utils/timestamp.h"
@@ -24,6 +25,7 @@ PG_MODULE_MAGIC;
PG_FUNCTION_INFO_V1(bench_seq_search);
PG_FUNCTION_INFO_V1(bench_shuffle_search);
PG_FUNCTION_INFO_V1(bench_load_random_int);
+PG_FUNCTION_INFO_V1(bench_fixed_height_search);
static radix_tree *rt = NULL;
static ItemPointer itemptrs = NULL;
@@ -299,3 +301,108 @@ bench_load_random_int(PG_FUNCTION_ARGS)
rt_free(rt);
PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
}
+
+Datum
+bench_fixed_height_search(PG_FUNCTION_ARGS)
+{
+ int fanout = PG_GETARG_INT32(0);
+ radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time, end_time;
+ long secs;
+ int usecs;
+ int64 rt_load_ms, rt_search_ms;
+ Datum values[5];
+ bool nulls[5];
+
+ /* test boundary between vector and iteration */
+ const int n_keys = 5 * 16 * 16 * 16 * 16;
+ uint64 r, h, i, j, k;
+ int key_id;
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+
+ key_id = 0;
+ /* lower nodes have limited fanout, the top is only limited by bits-per-byte */
+ for (r=1;;r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ for (i = 1; i <= fanout; i++)
+ {
+ for (j = 1; j <= fanout; j++)
+ {
+ for (k = 1; k <= fanout; k++)
+ {
+ uint64 key;
+ key = (r<<32) | (h<<24) | (i<<16) | (j<<8) | (k);
+
+ CHECK_FOR_INTERRUPTS();
+
+ key_id++;
+ if (key_id > n_keys)
+ goto finish_set;
+
+ rt_set(rt, key, key_id);
+ }
+ }
+ }
+ }
+ }
+finish_set:
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_load_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ /* measure the search time of the radix tree */
+ start_time = GetCurrentTimestamp();
+
+ key_id = 0;
+ for (r=1;;r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ for (i = 1; i <= fanout; i++)
+ {
+ for (j = 1; j <= fanout; j++)
+ {
+ for (k = 1; k <= fanout; k++)
+ {
+ uint64 key, val;
+ key = (r<<32) | (h<<24) | (i<<16) | (j<<8) | (k);
+
+ CHECK_FOR_INTERRUPTS();
+
+ key_id++;
+ if (key_id > n_keys)
+ goto finish_search;
+
+ rt_search(rt, key, &val);
+ }
+ }
+ }
+ }
+ }
+finish_search:
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_search_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int32GetDatum(fanout);
+ values[1] = Int64GetDatum(rt_num_entries(rt));
+ values[2] = Int64GetDatum(rt_memory_usage(rt));
+ values[3] = Int64GetDatum(rt_load_ms);
+ values[4] = Int64GetDatum(rt_search_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
diff --git a/contrib/bench_radix_tree/expected/bench_fixed_height.out b/contrib/bench_radix_tree/expected/bench_fixed_height.out
new file mode 100644
index 0000000000..c4995afc13
--- /dev/null
+++ b/contrib/bench_radix_tree/expected/bench_fixed_height.out
@@ -0,0 +1,6 @@
+create extension bench_radix_tree;
+\o fixed_height_search.data
+begin;
+select * from bench_fixed_height_search(15);
+select * from bench_fixed_height_search(16);
+commit;
diff --git a/contrib/bench_radix_tree/sql/bench_fixed_height.sql b/contrib/bench_radix_tree/sql/bench_fixed_height.sql
new file mode 100644
index 0000000000..0c06570e9a
--- /dev/null
+++ b/contrib/bench_radix_tree/sql/bench_fixed_height.sql
@@ -0,0 +1,7 @@
+create extension bench_radix_tree;
+
+\o fixed_height_search.data
+begin;
+select * from bench_fixed_height_search(15);
+select * from bench_fixed_height_search(16);
+commit;
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
index b163eac480..4ce8e9ad9d 100644
--- a/src/backend/lib/radixtree.c
+++ b/src/backend/lib/radixtree.c
@@ -1980,7 +1980,7 @@ rt_verify_node(rt_node *node)
void
rt_stats(radix_tree *tree)
{
- fprintf(stderr, "num_keys = %lu, height = %u, n4 = %u, n16 = %u,n32 = %u, n128 = %u, n256 = %u",
+ fprintf(stderr, "num_keys = %lu, height = %u, n4 = %u, n16 = %u,n32 = %u, n128 = %u, n256 = %u\n",
tree->num_keys,
tree->root->shift / RT_NODE_SPAN,
tree->cnt[0],
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 38cc6abf4c..6016d593ee 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -15,7 +15,7 @@
#include "postgres.h"
-/* #define RT_DEBUG 1 */
+#define RT_DEBUG 1
typedef struct radix_tree radix_tree;
typedef struct rt_iter rt_iter;
On Wed, Sep 21, 2022 at 01:17:21PM +0700, John Naylor wrote:
In trying to wrap the SIMD code behind layers of abstraction, the latest
patch (and Nathan's cleanup) threw it away in almost all cases. To explain,
we need to talk about how vectorized code deals with the "tail" that is too
small for the register:
1. Use a one-by-one algorithm, like we do for the pg_lfind* variants.
2. Read some junk into the register and mask off false positives from the
result.
There are advantages to both depending on the situation.
Patch v5 and earlier used #2. Patch v6 used #1, so if a node16 has 15
elements or less, it will iterate over them one-by-one exactly like a
node4. Only when full with 16 will the vector path be taken. When another
entry is added, the elements are copied to the next bigger node, so there's
a *small* window where it's fast.
In short, this code needs to be lower level so that we still have full
control while being portable. I will work on this, and also the related
code for node dispatch.
Is it possible to use approach #2 here, too? AFAICT space is allocated for
all of the chunks, so there wouldn't be any danger in searching all of them
and discarding any results >= node->count. Granted, we're depending on the
number of chunks always being a multiple of elements-per-vector in order to
avoid the tail path, but that seems like a reasonably safe assumption that
can be covered with comments.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Thu, Sep 22, 2022 at 1:01 AM Nathan Bossart <nathandbossart@gmail.com>
wrote:
On Wed, Sep 21, 2022 at 01:17:21PM +0700, John Naylor wrote:
In short, this code needs to be lower level so that we still have full
control while being portable. I will work on this, and also the related
code for node dispatch.Is it possible to use approach #2 here, too? AFAICT space is allocated
for
all of the chunks, so there wouldn't be any danger in searching all them
and discarding any results >= node->count.
Sure, the caller could pass the maximum node capacity, and then check if
the returned index is within the range of the node count.
Granted, we're depending on the
number of chunks always being a multiple of elements-per-vector in order to
avoid the tail path, but that seems like a reasonably safe assumption that
can be covered with comments.
Actually, we don't need to depend on that at all. When I said "junk" above,
that can be any bytes, as long as we're not reading off the end of
allocated memory. We'll never do that here, since the child pointers/values
follow. In that case, the caller can hard-code the size (it would even
happen to work now to multiply rt_node_kind by 16, to be sneaky). One thing
I want to try soon is storing fewer than 16/32 etc entries, so that the
whole node fits comfortably inside a power-of-two allocation. That would
allow us to use aset without wasting space for the smaller nodes, which
would be faster and possibly would solve the fragmentation problem Andres
referred to in
/messages/by-id/20220704220038.at2ane5xkymzzssb@awork3.anarazel.de
While on the subject, I wonder how important it is to keep the chunks in
the small nodes in sorted order. That adds branches and memmove calls, and
is the whole reason for the recent "pg_lfind_ge" function.
--
John Naylor
EDB: http://www.enterprisedb.com
On Thu, Sep 22, 2022 at 1:46 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Thu, Sep 22, 2022 at 1:01 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
On Wed, Sep 21, 2022 at 01:17:21PM +0700, John Naylor wrote:
In short, this code needs to be lower level so that we still have full
control while being portable. I will work on this, and also the related
code for node dispatch.
Is it possible to use approach #2 here, too? AFAICT space is allocated for
all of the chunks, so there wouldn't be any danger in searching all of them
and discarding any results >= node->count.
Sure, the caller could pass the maximum node capacity, and then check if the returned index is within the range of the node count.
Granted, we're depending on the
number of chunks always being a multiple of elements-per-vector in order to
avoid the tail path, but that seems like a reasonably safe assumption that
can be covered with comments.
Actually, we don't need to depend on that at all. When I said "junk" above, that can be any bytes, as long as we're not reading off the end of allocated memory. We'll never do that here, since the child pointers/values follow. In that case, the caller can hard-code the size (it would even happen to work now to multiply rt_node_kind by 16, to be sneaky). One thing I want to try soon is storing fewer than 16/32 etc entries, so that the whole node fits comfortably inside a power-of-two allocation. That would allow us to use aset without wasting space for the smaller nodes, which would be faster and possibly would solve the fragmentation problem Andres referred to in
/messages/by-id/20220704220038.at2ane5xkymzzssb@awork3.anarazel.de
While on the subject, I wonder how important it is to keep the chunks in the small nodes in sorted order. That adds branches and memmove calls, and is the whole reason for the recent "pg_lfind_ge" function.
Good point. While keeping the chunks in the small nodes in sorted
order is useful for visiting all keys in sorted order, additional
branches and memmove calls could be slow.
Regards,
--
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Thu, Sep 22, 2022 at 1:26 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
On Thu, Sep 22, 2022 at 1:46 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
While on the subject, I wonder how important it is to keep the chunks
in the small nodes in sorted order. That adds branches and memmove calls,
and is the whole reason for the recent "pg_lfind_ge" function.
Good point. While keeping the chunks in the small nodes in sorted
order is useful for visiting all keys in sorted order, additional
branches and memmove calls could be slow.
Right, the ordering is a property that some users will need, so best to
keep it. Although the node128 doesn't have that property -- too slow to do
so, I think.
--
John Naylor
EDB: http://www.enterprisedb.com
On Thu, Sep 22, 2022 at 7:52 PM John Naylor <john.naylor@enterprisedb.com>
wrote:
On Thu, Sep 22, 2022 at 1:26 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
Good point. While keeping the chunks in the small nodes in sorted
order is useful for visiting all keys in sorted order, additional
branches and memmove calls could be slow.
Right, the ordering is a property that some users will need, so best to
keep it. Although the node128 doesn't have that property -- too slow to do
so, I think.
Nevermind, I must have been mixing up keys and values there...
--
John Naylor
EDB: http://www.enterprisedb.com
On Thu, Sep 22, 2022 at 11:46 AM John Naylor <john.naylor@enterprisedb.com>
wrote:
One thing I want to try soon is storing fewer than 16/32 etc entries, so
that the whole node fits comfortably inside a power-of-two allocation. That
would allow us to use aset without wasting space for the smaller nodes,
which would be faster and possibly would solve the fragmentation problem
Andres referred to in
/messages/by-id/20220704220038.at2ane5xkymzzssb@awork3.anarazel.de
While calculating node sizes that fit within a power-of-two size, I noticed
the current base node is a bit wasteful, taking up 8 bytes. The node kind
only has a small number of values, so it doesn't really make sense to use
an enum here in the struct (in fact, Andres' prototype used a uint8 for
node_kind). We could use a bitfield for the count and kind:
uint16 -- kind and count bitfield
uint8 shift;
uint8 chunk;
That's only 4 bytes. Plus, if the kind is ever encoded in a pointer tag,
the bitfield can just go back to being count only.
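As a sketch, the 4-byte header could look like this (field names are illustrative and assume PostgreSQL's uint8/uint16 typedefs):

/* Hypothetical 4-byte base node header using a bitfield for kind and count. */
typedef struct rt_node_base
{
	uint16		kind:2,		/* which of the node kinds this is */
				count:14;	/* number of children; 9 bits would suffice */
	uint8		shift;		/* bit shift for the key byte at this level */
	uint8		chunk;		/* key byte this node represents in its parent */
} rt_node_base;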
Here are the v6 node kinds:
node4: 8 + 4 +(4) + 4*8 = 48 bytes
node16: 8 + 16 + 16*8 = 152
node32: 8 + 32 + 32*8 = 296
node128: 8 + 256 + 128/8 + 128*8 = 1304
node256: 8 + 256/8 + 256*8 = 2088
And here are the possible ways we could optimize nodes for space using aset
allocation. Parentheses are padding bytes. Even if my math has mistakes,
the numbers shouldn't be too far off:
node3: 4 + 3 +(1) + 3*8 = 32 bytes
node6: 4 + 6 +(6) + 6*8 = 64
node13: 4 + 13 +(7) + 13*8 = 128
node28: 4 + 28 + 28*8 = 256
node31: 4 + 256 + 32/8 + 31*8 = 512 (XXX not good)
node94: 4 + 256 + 96/8 + 94*8 = 1024
node220: 4 + 256 + 224/8 + 220*8 = 2048
node256: = 4096
The main disadvantage is that node256 would balloon in size.
--
John Naylor
EDB: http://www.enterprisedb.com
On Fri, Sep 23, 2022 at 12:11 AM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Thu, Sep 22, 2022 at 11:46 AM John Naylor <john.naylor@enterprisedb.com> wrote:
One thing I want to try soon is storing fewer than 16/32 etc entries, so that the whole node fits comfortably inside a power-of-two allocation. That would allow us to use aset without wasting space for the smaller nodes, which would be faster and possibly would solve the fragmentation problem Andres referred to in
/messages/by-id/20220704220038.at2ane5xkymzzssb@awork3.anarazel.de
While calculating node sizes that fit within a power-of-two size, I noticed the current base node is a bit wasteful, taking up 8 bytes. The node kind only has a small number of values, so it doesn't really make sense to use an enum here in the struct (in fact, Andres' prototype used a uint8 for node_kind). We could use a bitfield for the count and kind:
uint16 -- kind and count bitfield
uint8 shift;
uint8 chunk;
That's only 4 bytes. Plus, if the kind is ever encoded in a pointer tag, the bitfield can just go back to being count only.
Good point, agreed.
Here are the v6 node kinds:
node4: 8 + 4 +(4) + 4*8 = 48 bytes
node16: 8 + 16 + 16*8 = 152
node32: 8 + 32 + 32*8 = 296
node128: 8 + 256 + 128/8 + 128*8 = 1304
node256: 8 + 256/8 + 256*8 = 2088
And here are the possible ways we could optimize nodes for space using aset allocation. Parentheses are padding bytes. Even if my math has mistakes, the numbers shouldn't be too far off:
node3: 4 + 3 +(1) + 3*8 = 32 bytes
node6: 4 + 6 +(6) + 6*8 = 64
node13: 4 + 13 +(7) + 13*8 = 128
node28: 4 + 28 + 28*8 = 256
node31: 4 + 256 + 32/8 + 31*8 = 512 (XXX not good)
node94: 4 + 256 + 96/8 + 94*8 = 1024
node220: 4 + 256 + 224/8 + 220*8 = 2048
node256: = 4096
The main disadvantage is that node256 would balloon in size.
Yeah, node31 and node256 are bloated. We probably could use slab for
node256 independently. It's worth trying a benchmark to see how it
affects the performance and the tree size.
BTW We need to consider not only aset/slab but also DSA since we
allocate dead tuple TIDs on DSM in parallel vacuum cases. FYI DSA uses
the following size classes:
static const uint16 dsa_size_classes[] = {
sizeof(dsa_area_span), 0, /* special size classes */
8, 16, 24, 32, 40, 48, 56, 64, /* 8 classes separated by 8 bytes */
80, 96, 112, 128, /* 4 classes separated by 16 bytes */
160, 192, 224, 256, /* 4 classes separated by 32 bytes */
320, 384, 448, 512, /* 4 classes separated by 64 bytes */
640, 768, 896, 1024, /* 4 classes separated by 128 bytes */
1280, 1560, 1816, 2048, /* 4 classes separated by ~256 bytes */
2616, 3120, 3640, 4096, /* 4 classes separated by ~512 bytes */
5456, 6552, 7280, 8192 /* 4 classes separated by ~1024 bytes */
};
node256 will be classed as 2616, which is still not good.
Anyway, I'll implement DSA support for radix tree.
Regards,
--
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Wed, Sep 28, 2022 at 10:49 AM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
BTW We need to consider not only aset/slab but also DSA since we
allocate dead tuple TIDs on DSM in parallel vacuum cases. FYI DSA uses
the following size classes:static const uint16 dsa_size_classes[] = {
[...]
Thanks for that info -- I wasn't familiar with the details of DSA. For the
non-parallel case, I plan to at least benchmark using aset because I gather
it's the most heavily optimized. I'm thinking that will allow other problem
areas to be more prominent. I'll also want to compare total context size
compared to slab to see if possibly less fragmentation makes up for other
wastage.
Along those lines, one thing I've been thinking about is the number of size
classes. There is a tradeoff between memory efficiency and number of
branches when searching/inserting. My current thinking is there is too much
coupling between size class and data type. Each size class currently uses a
different data type and a different algorithm to search and set it, which
in turn requires another branch. We've found that a larger number of size
classes leads to poor branch prediction [1]/messages/by-id/20220704220038.at2ane5xkymzzssb@awork3.anarazel.de and (I imagine) code density.
I'm thinking we can use "flexible array members" for the values/pointers,
and keep the rest of the control data in the struct the same. That way, we
never have more than 4 actual "kinds" to code and branch on. As a bonus,
when migrating a node to a larger size class of the same kind, we can
simply repalloc() to the next size. To show what I mean, consider this new
table:
node2: 5 + 6 +(5)+ 2*8 = 32 bytes
node6: 5 + 6 +(5)+ 6*8 = 64
node12: 5 + 27 + 12*8 = 128
node27: 5 + 27 + 27*8 = 248(->256)
node91: 5 + 256 + 28 +(7)+ 91*8 = 1024
node219: 5 + 256 + 28 +(7)+219*8 = 2048
node256: 5 + 32 +(3)+256*8 = 2088(->4096)
Seven size classes are grouped into the four kinds.
The common base at the front is here 5 bytes because there is a new uint8
field for "capacity", which we can ignore for node256 since we assume we
can always insert/update that node. The control data is the same in each
pair, and so the offset to the pointer/value array is the same. Thus,
migration would look something like:
case FOO_KIND:
if (unlikely(count == capacity))
{
if (capacity == XYZ) /* for smaller size class of the pair */
{
<repalloc to next size class>;
capacity = next-higher-capacity;
goto do_insert;
}
else
<migrate data to next node kind>;
}
else
{
do_insert:
<...>;
break;
}
/* FALLTHROUGH */
...
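For concreteness, a node covering the smallest pair of size classes might be laid out roughly like this; the struct and field names are invented for the sketch, and FLEXIBLE_ARRAY_MEMBER is the usual PostgreSQL macro:

/*
 * Sketch of one "kind" covering the node2/node6 pair of size classes;
 * growing from capacity 2 to 6 is just a repalloc() to the larger class.
 */
typedef struct rt_node_small
{
	uint16		kind:2,
				count:14;
	uint8		shift;
	uint8		chunk;
	uint8		capacity;		/* 2 or 6 depending on the size class */
	uint8		chunks[6];		/* search keys; only "capacity" slots are usable */
	uint64		slots[FLEXIBLE_ARRAY_MEMBER];	/* values or child pointers */
} rt_node_small;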
One disadvantage is that this wastes some space by reserving the full set
of control data in the smaller size class of the pair, but it's usually
small compared to array size. Somewhat unrelated, we could still implement
Andres' idea [1]/messages/by-id/20220704220038.at2ane5xkymzzssb@awork3.anarazel.de to dispense with the isset array in inner nodes of the
indirect array type (now node128), since we can just test if the pointer is
null.
[1]: /messages/by-id/20220704220038.at2ane5xkymzzssb@awork3.anarazel.de
/messages/by-id/20220704220038.at2ane5xkymzzssb@awork3.anarazel.de
--
John Naylor
EDB: http://www.enterprisedb.com
On Wed, Sep 28, 2022 at 1:18 PM John Naylor <john.naylor@enterprisedb.com>
wrote:
[stuff about size classes]
I kind of buried the lede here on one thing: If we only have 4 kinds
regardless of the number of size classes, we can use 2 bits of the pointer
for dispatch, which would only require 4-byte alignment. That should make
that technique more portable.
--
John Naylor
EDB: http://www.enterprisedb.com
On Wed, Sep 28, 2022 at 3:18 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Wed, Sep 28, 2022 at 10:49 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
BTW We need to consider not only aset/slab but also DSA since we
allocate dead tuple TIDs on DSM in parallel vacuum cases. FYI DSA uses
the following size classes:
static const uint16 dsa_size_classes[] = {
[...]
Thanks for that info -- I wasn't familiar with the details of DSA. For the non-parallel case, I plan to at least benchmark using aset because I gather it's the most heavily optimized. I'm thinking that will allow other problem areas to be more prominent. I'll also want to compare total context size compared to slab to see if possibly less fragmentation makes up for other wastage.
Thanks!
Along those lines, one thing I've been thinking about is the number of size classes. There is a tradeoff between memory efficiency and number of branches when searching/inserting. My current thinking is there is too much coupling between size class and data type. Each size class currently uses a different data type and a different algorithm to search and set it, which in turn requires another branch. We've found that a larger number of size classes leads to poor branch prediction [1] and (I imagine) code density.
I'm thinking we can use "flexible array members" for the values/pointers, and keep the rest of the control data in the struct the same. That way, we never have more than 4 actual "kinds" to code and branch on. As a bonus, when migrating a node to a larger size class of the same kind, we can simply repalloc() to the next size.
Interesting idea. Using flexible array members for values would be
good also for the case in the future where we want to support other
value types than uint64.
With this idea, we can just repalloc() to grow to the larger size in a
pair but I'm slightly concerned that the more size class we use, the
more frequent the node needs to grow. If we want to support node
shrink, the deletion is also affected.
To show what I mean, consider this new table:
node2: 5 + 6 +(5)+ 2*8 = 32 bytes
node6: 5 + 6 +(5)+ 6*8 = 64
node12: 5 + 27 + 12*8 = 128
node27: 5 + 27 + 27*8 = 248(->256)
node91: 5 + 256 + 28 +(7)+ 91*8 = 1024
node219: 5 + 256 + 28 +(7)+219*8 = 2048
node256: 5 + 32 +(3)+256*8 = 2088(->4096)
Seven size classes are grouped into the four kinds.
The common base at the front is here 5 bytes because there is a new uint8 field for "capacity", which we can ignore for node256 since we assume we can always insert/update that node. The control data is the same in each pair, and so the offset to the pointer/value array is the same. Thus, migration would look something like:
I think we can use a bitfield for capacity. That way, we can pack
count (9bits), kind (2bits)and capacity (4bits) in uint16.
Somewhat unrelated, we could still implement Andres' idea [1] to dispense with the isset array in inner nodes of the indirect array type (now node128), since we can just test if the pointer is null.
Right. I didn't do that to use the common logic for inner node128 and
leaf node128.
Regards,
--
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Hi,
On 2022-09-16 15:00:31 +0900, Masahiko Sawada wrote:
I've updated the radix tree patch. It's now separated into two patches.
cfbot notices a compiler warning:
https://cirrus-ci.com/task/6247907681632256?logs=gcc_warning#L446
[11:03:05.343] radixtree.c: In function ‘rt_iterate_next’:
[11:03:05.343] radixtree.c:1758:15: error: ‘slot’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
[11:03:05.343] 1758 | *value_p = *((uint64 *) slot);
[11:03:05.343] | ^~~~~~~~~~~~~~~~~~
Greetings,
Andres Freund
On Mon, Oct 3, 2022 at 2:04 AM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2022-09-16 15:00:31 +0900, Masahiko Sawada wrote:
I've updated the radix tree patch. It's now separated into two patches.
cfbot notices a compiler warning:
https://cirrus-ci.com/task/6247907681632256?logs=gcc_warning#L446
[11:03:05.343] radixtree.c: In function ‘rt_iterate_next’:
[11:03:05.343] radixtree.c:1758:15: error: ‘slot’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
[11:03:05.343] 1758 | *value_p = *((uint64 *) slot);
[11:03:05.343] | ^~~~~~~~~~~~~~~~~~
Thanks, I'll fix it in the next version patch.
Regards,
--
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Wed, Sep 28, 2022 at 12:49 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Fri, Sep 23, 2022 at 12:11 AM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Thu, Sep 22, 2022 at 11:46 AM John Naylor <john.naylor@enterprisedb.com> wrote:
One thing I want to try soon is storing fewer than 16/32 etc entries, so that the whole node fits comfortably inside a power-of-two allocation. That would allow us to use aset without wasting space for the smaller nodes, which would be faster and possibly would solve the fragmentation problem Andres referred to in
/messages/by-id/20220704220038.at2ane5xkymzzssb@awork3.anarazel.de
While calculating node sizes that fit within a power-of-two size, I noticed the current base node is a bit wasteful, taking up 8 bytes. The node kind only has a small number of values, so it doesn't really make sense to use an enum here in the struct (in fact, Andres' prototype used a uint8 for node_kind). We could use a bitfield for the count and kind:
uint16 -- kind and count bitfield
uint8 shift;
uint8 chunk;
That's only 4 bytes. Plus, if the kind is ever encoded in a pointer tag, the bitfield can just go back to being count only.
Good point, agreed.
Here are the v6 node kinds:
node4: 8 + 4 +(4) + 4*8 = 48 bytes
node16: 8 + 16 + 16*8 = 152
node32: 8 + 32 + 32*8 = 296
node128: 8 + 256 + 128/8 + 128*8 = 1304
node256: 8 + 256/8 + 256*8 = 2088
And here are the possible ways we could optimize nodes for space using aset allocation. Parentheses are padding bytes. Even if my math has mistakes, the numbers shouldn't be too far off:
node3: 4 + 3 +(1) + 3*8 = 32 bytes
node6: 4 + 6 +(6) + 6*8 = 64
node13: 4 + 13 +(7) + 13*8 = 128
node28: 4 + 28 + 28*8 = 256
node31: 4 + 256 + 32/8 + 31*8 = 512 (XXX not good)
node94: 4 + 256 + 96/8 + 94*8 = 1024
node220: 4 + 256 + 224/8 + 220*8 = 2048
node256: = 4096
The main disadvantage is that node256 would balloon in size.
Yeah, node31 and node256 are bloated. We probably could use slab for
node256 independently. It's worth trying a benchmark to see how it
affects the performance and the tree size.
BTW We need to consider not only aset/slab but also DSA since we
allocate dead tuple TIDs on DSM in parallel vacuum cases. FYI DSA uses
the following size classes:
static const uint16 dsa_size_classes[] = {
sizeof(dsa_area_span), 0, /* special size classes */
8, 16, 24, 32, 40, 48, 56, 64, /* 8 classes separated by 8 bytes */
80, 96, 112, 128, /* 4 classes separated by 16 bytes */
160, 192, 224, 256, /* 4 classes separated by 32 bytes */
320, 384, 448, 512, /* 4 classes separated by 64 bytes */
640, 768, 896, 1024, /* 4 classes separated by 128 bytes */
1280, 1560, 1816, 2048, /* 4 classes separated by ~256 bytes */
2616, 3120, 3640, 4096, /* 4 classes separated by ~512 bytes */
5456, 6552, 7280, 8192 /* 4 classes separated by ~1024 bytes */
};
node256 will be classed as 2616, which is still not good.
Anyway, I'll implement DSA support for radix tree.
Regarding DSA support, IIUC we need to use dsa_pointer in inner nodes
to point to its child nodes, instead of C pointers (i.e., backend-local
address). I'm thinking of a straightforward approach as the first
step; inner nodes have a union of rt_node* and dsa_pointer and we
choose either one based on whether the radix tree is shared or not. We
allocate and free the shared memory for individual nodes by
dsa_allocate() and dsa_free(), respectively. Therefore we need to get
a C pointer from dsa_pointer by using dsa_get_address() while
descending the tree. I'm a bit concerned that calling
dsa_get_address() for every descent could be performance overhead but
I'm going to measure it anyway.
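As a rough sketch of that straightforward approach (the union, the tree->dsa field, and the helper name are all invented for illustration):

typedef union rt_ptr
{
	struct rt_node *local;		/* backend-local tree */
	dsa_pointer		shared;		/* tree allocated in a DSA area */
} rt_ptr;

static inline struct rt_node *
rt_ptr_get_node(radix_tree *tree, rt_ptr ptr)
{
	if (tree->dsa == NULL)
		return ptr.local;

	/* shared case: translate the dsa_pointer on every descent */
	return (struct rt_node *) dsa_get_address(tree->dsa, ptr.shared);
}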
Regards,
--
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Wed, Oct 5, 2022 at 1:46 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
On Wed, Sep 28, 2022 at 12:49 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
On Fri, Sep 23, 2022 at 12:11 AM John Naylor
<john.naylor@enterprisedb.com> wrote:
Yeah, node31 and node256 are bloated. We probably could use slab for
node256 independently. It's worth trying a benchmark to see how it
affects the performance and the tree size.
This wasn't the focus of your current email, but while experimenting with
v6 I had another thought about local allocation: If we use the default slab
block size of 8192 bytes, then only 3 chunks of size 2088 can fit, right?
If so, since aset and DSA also waste at least a few hundred bytes, we could
store a useless 256-byte slot array within node256. That way, node128 and
node256 share the same start of pointers/values array, so there would be
one less branch for getting that address. In v6, rt_node_get_values and
rt_node_get_children are not inlined (aside: gcc uses a jump table for 5
kinds but not for 4), but possibly should be, and the smaller the better.
Regarding DSA support, IIUC we need to use dsa_pointer in inner nodes
to point to its child nodes, instead of C pointers (i.e., backend-local
address). I'm thinking of a straightforward approach as the first
step; inner nodes have a union of rt_node* and dsa_pointer and we
choose either one based on whether the radix tree is shared or not. We
allocate and free the shared memory for individual nodes by
dsa_allocate() and dsa_free(), respectively. Therefore we need to get
a C pointer from dsa_pointer by using dsa_get_address() while
descending the tree. I'm a bit concerned that calling
dsa_get_address() for every descent could be performance overhead but
I'm going to measure it anyway.
Are dsa pointers aligned the same as pointers to locally allocated memory?
Meaning, is the offset portion always a multiple of 4 (or 8)? It seems that
way from a glance, but I can't say for sure. If the lower 2 bits of a DSA
pointer are never set, we can tag them the same way as a regular pointer.
That same technique could help hide the latency of converting the pointer,
by the same way it would hide the latency of loading parts of a node into
CPU registers.
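A minimal sketch of such 2-bit tagging, assuming at least 4-byte alignment for both representations (macro and function names are invented):

#define RT_PTR_KIND_MASK	UINT64CONST(0x3)

static inline uint64
rt_tag_ptr(uint64 ptr_or_dp, uint8 kind)
{
	/* works the same whether ptr_or_dp holds a local pointer or a dsa_pointer */
	return ptr_or_dp | (kind & RT_PTR_KIND_MASK);
}

static inline uint64
rt_untag_ptr(uint64 tagged, uint8 *kind)
{
	*kind = (uint8) (tagged & RT_PTR_KIND_MASK);
	return tagged & ~RT_PTR_KIND_MASK;
}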
One concern is, handling both local and dsa cases in the same code requires
more (predictable) branches and reduces code density. That might be a
reason in favor of templating to handle each case in its own translation
unit. But that might be overkill.
--
John Naylor
EDB: http://www.enterprisedb.com
On Wed, Oct 5, 2022 at 6:40 PM John Naylor <john.naylor@enterprisedb.com> wrote:
On Wed, Oct 5, 2022 at 1:46 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Wed, Sep 28, 2022 at 12:49 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Fri, Sep 23, 2022 at 12:11 AM John Naylor
<john.naylor@enterprisedb.com> wrote:
Yeah, node31 and node256 are bloated. We probably could use slab for
node256 independently. It's worth trying a benchmark to see how it
affects the performance and the tree size.
This wasn't the focus of your current email, but while experimenting with v6 I had another thought about local allocation: If we use the default slab block size of 8192 bytes, then only 3 chunks of size 2088 can fit, right? If so, since aset and DSA also waste at least a few hundred bytes, we could store a useless 256-byte slot array within node256. That way, node128 and node256 share the same start of pointers/values array, so there would be one less branch for getting that address. In v6, rt_node_get_values and rt_node_get_children are not inlined (aside: gcc uses a jump table for 5 kinds but not for 4), but possibly should be, and the smaller the better.
It would be good for performance but I'm a bit concerned that it's
highly optimized to the design of aset and DSA. Since size 2088 will
be currently classed as 2616 in DSA, DSA wastes 528 bytes. However, if
we introduce a new class of 2304 (=2048 + 256) bytes we cannot store a
useless 256-byte and the assumption will be broken.
Regarding DSA support, IIUC we need to use dsa_pointer in inner nodes
to point to its child nodes, instead of C pointers (i.e., backend-local
address). I'm thinking of a straightforward approach as the first
step; inner nodes have a union of rt_node* and dsa_pointer and we
choose either one based on whether the radix tree is shared or not. We
allocate and free the shared memory for individual nodes by
dsa_allocate() and dsa_free(), respectively. Therefore we need to get
a C pointer from dsa_pointer by using dsa_get_address() while
descending the tree. I'm a bit concerned that calling
dsa_get_address() for every descent could be performance overhead but
I'm going to measure it anyway.
Are dsa pointers aligned the same as pointers to locally allocated memory? Meaning, is the offset portion always a multiple of 4 (or 8)?
I think so.
It seems that way from a glance, but I can't say for sure. If the lower 2 bits of a DSA pointer are never set, we can tag them the same way as a regular pointer. That same technique could help hide the latency of converting the pointer, by the same way it would hide the latency of loading parts of a node into CPU registers.
One concern is, handling both local and dsa cases in the same code requires more (predictable) branches and reduces code density. That might be a reason in favor of templating to handle each case in its own translation unit.
Right. We also need to support locking for shared radix tree, which
would require more branches.
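The straightforward approach I have in mind would look roughly like this (a sketch only; rt_child and the field names are not final):

#include "postgres.h"
#include "utils/dsa.h"

typedef struct rt_node rt_node;

typedef union rt_child
{
	rt_node    *local;			/* backend-local address */
	dsa_pointer shared;			/* offset into the DSA area */
} rt_child;

typedef struct radix_tree
{
	bool		is_shared;
	dsa_area   *area;			/* valid only if is_shared */
	/* ... */
} radix_tree;

/* Resolve a child slot to a usable pointer while descending. */
static inline rt_node *
rt_child_get_node(radix_tree *tree, rt_child child)
{
	if (tree->is_shared)
		return (rt_node *) dsa_get_address(tree->area, child.shared);
	else
		return child.local;
}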
Regards,
--
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Thu, Oct 6, 2022 at 2:53 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
On Wed, Oct 5, 2022 at 6:40 PM John Naylor <john.naylor@enterprisedb.com>
wrote:
This wasn't the focus of your current email, but while experimenting
with v6 I had another thought about local allocation: If we use the default
slab block size of 8192 bytes, then only 3 chunks of size 2088 can fit,
right? If so, since aset and DSA also waste at least a few hundred bytes,
we could store a useless 256-byte slot array within node256. That way,
node128 and node256 share the same start of pointers/values array, so there
would be one less branch for getting that address. In v6,
rt_node_get_values and rt_node_get_children are not inlined (asde: gcc uses
a jump table for 5 kinds but not for 4), but possibly should be, and the
smaller the better.
It would be good for performance but I'm a bit concerned that it's
highly optimized to the design of aset and DSA. Since size 2088 will
be currently classed as 2616 in DSA, DSA wastes 528 bytes. However, if
we introduce a new class of 2304 (=2048 + 256) bytes we cannot store a
useless 256-byte and the assumption will be broken.
A new DSA class is hypothetical. A better argument against my idea is that
SLAB_DEFAULT_BLOCK_SIZE is arbitrary. FWIW, I looked at the prototype just
now and the slab block sizes are:
Max(pg_nextpower2_32((MAXALIGN(inner_class_info[i].size) + 16) * 32), 1024)
...which would be 128kB for nodemax. I'm curious about the difference.
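For reference, plugging in the 2088-byte size from upthread (and assuming the +16 covers the memory chunk header): MAXALIGN(2088) = 2088, (2088 + 16) * 32 = 67328, and pg_nextpower2_32(67328) = 131072, i.e. 128kB, versus the 8192-byte SLAB_DEFAULT_BLOCK_SIZE.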
One concern is, handling both local and dsa cases in the same code
requires more (predictable) branches and reduces code density. That might
be a reason in favor of templating to handle each case in its own
translation unit.
Right. We also need to support locking for shared radix tree, which
would require more branches.
Hmm, now it seems we'll likely want to template local vs. shared as a later
step...
--
John Naylor
EDB: http://www.enterprisedb.com
On Fri, Sep 16, 2022 at 1:01 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
In addition to two patches, I've attached the third patch. It's not
part of radix tree implementation but introduces a contrib module
bench_radix_tree, a tool for radix tree performance benchmarking. It
measures loading and lookup performance of both the radix tree and a
flat array.
Hi Masahiko, I've been using these benchmarks, along with my own
variations, to try various things that I've mentioned. I'm long overdue for
an update, but the picture is not yet complete.
For now, I have two questions that I can't figure out on my own:
1. There seems to be some non-obvious limit on the number of keys that are
loaded (or at least what the numbers report). This is independent of the
number of tids per block. Example below:
john=# select * from bench_shuffle_search(0, 8*1000*1000);
NOTICE: num_keys = 8000000, height = 3, n4 = 0, n16 = 1, n32 = 0, n128 =
250000, n256 = 981
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
8000000 | 268435456 | 48000000 | 661 |
29 | 276 | 389
john=# select * from bench_shuffle_search(0, 9*1000*1000);
NOTICE: num_keys = 8388608, height = 3, n4 = 0, n16 = 1, n32 = 0, n128 =
262144, n256 = 1028
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
8388608 | 276824064 | 54000000 | 718 |
33 | 311 | 446
The array is the right size, but nkeys hasn't kept pace. Can you reproduce
this? Attached is the patch I'm using to show the stats when running the
test. (Side note: The numbers look unfavorable for radix tree because I'm
using 1 tid per block here.)
2. I found that bench_shuffle_search() is much *faster* for traditional
binary search on an array than bench_seq_search(). I've found this to be
true in every case. This seems counterintuitive to me -- any idea why this
is? Example:
john=# select * from bench_seq_search(0, 1000000);
NOTICE: num_keys = 1000000, height = 2, n4 = 0, n16 = 0, n32 = 31251, n128
= 1, n256 = 122
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
1000000 | 10199040 | 180000000 | 168 |
106 | 827 | 3348
john=# select * from bench_shuffle_search(0, 1000000);
NOTICE: num_keys = 1000000, height = 2, n4 = 0, n16 = 0, n32 = 31251, n128
= 1, n256 = 122
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
1000000 | 10199040 | 180000000 | 171 |
107 | 827 | 1400
--
John Naylor
EDB: http://www.enterprisedb.com
Attachments:
v65-0001-Turn-on-per-node-counts-in-benchmark.patch
From 43a50a385930ee340d0a3b003910c704a0ff342c Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Thu, 6 Oct 2022 09:07:41 +0700
Subject: [PATCH v65 1/5] Turn on per-node counts in benchmark
Also add gitignore, fix whitespace, and change to NOTICE
---
contrib/bench_radix_tree/.gitignore | 3 +++
contrib/bench_radix_tree/bench_radix_tree.c | 5 +++++
src/backend/lib/radixtree.c | 2 +-
src/include/lib/radixtree.h | 2 +-
4 files changed, 10 insertions(+), 2 deletions(-)
create mode 100644 contrib/bench_radix_tree/.gitignore
diff --git a/contrib/bench_radix_tree/.gitignore b/contrib/bench_radix_tree/.gitignore
new file mode 100644
index 0000000000..8830f5460d
--- /dev/null
+++ b/contrib/bench_radix_tree/.gitignore
@@ -0,0 +1,3 @@
+*data
+log/*
+results/*
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c
index 5806ef7519..36c5218ae7 100644
--- a/contrib/bench_radix_tree/bench_radix_tree.c
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -13,6 +13,7 @@
#include "fmgr.h"
#include "funcapi.h"
#include "lib/radixtree.h"
+#include <math.h>
#include "miscadmin.h"
#include "utils/timestamp.h"
@@ -183,6 +184,8 @@ bench_search(FunctionCallInfo fcinfo, bool shuffle)
TimestampDifference(start_time, end_time, &secs, &usecs);
rt_load_ms = secs * 1000 + usecs / 1000;
+ rt_stats(rt);
+
/* measure the load time of the array */
itemptrs = MemoryContextAllocHuge(CurrentMemoryContext,
sizeof(ItemPointerData) * ntids);
@@ -292,6 +295,8 @@ bench_load_random_int(PG_FUNCTION_ARGS)
TimestampDifference(start_time, end_time, &secs, &usecs);
load_time_ms = secs * 1000 + usecs / 1000;
+ rt_stats(rt);
+
MemSet(nulls, false, sizeof(nulls));
values[0] = Int64GetDatum(rt_memory_usage(rt));
values[1] = Int64GetDatum(load_time_ms);
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
index b163eac480..a84c06f0d4 100644
--- a/src/backend/lib/radixtree.c
+++ b/src/backend/lib/radixtree.c
@@ -1980,7 +1980,7 @@ rt_verify_node(rt_node *node)
void
rt_stats(radix_tree *tree)
{
- fprintf(stderr, "num_keys = %lu, height = %u, n4 = %u, n16 = %u,n32 = %u, n128 = %u, n256 = %u",
+ elog(NOTICE, "num_keys = %lu, height = %u, n4 = %u, n16 = %u, n32 = %u, n128 = %u, n256 = %u",
tree->num_keys,
tree->root->shift / RT_NODE_SPAN,
tree->cnt[0],
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 38cc6abf4c..d5d7668617 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -15,7 +15,7 @@
#include "postgres.h"
-/* #define RT_DEBUG 1 */
+#define RT_DEBUG 1
typedef struct radix_tree radix_tree;
typedef struct rt_iter rt_iter;
--
2.37.3
On Fri, Oct 7, 2022 at 2:29 PM John Naylor <john.naylor@enterprisedb.com> wrote:
On Fri, Sep 16, 2022 at 1:01 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
In addition to two patches, I've attached the third patch. It's not
part of radix tree implementation but introduces a contrib module
bench_radix_tree, a tool for radix tree performance benchmarking. It
measures loading and lookup performance of both the radix tree and a
flat array.
Hi Masahiko, I've been using these benchmarks, along with my own variations, to try various things that I've mentioned. I'm long overdue for an update, but the picture is not yet complete.
Thanks!
For now, I have two questions that I can't figure out on my own:
1. There seems to be some non-obvious limit on the number of keys that are loaded (or at least what the numbers report). This is independent of the number of tids per block. Example below:
john=# select * from bench_shuffle_search(0, 8*1000*1000);
NOTICE: num_keys = 8000000, height = 3, n4 = 0, n16 = 1, n32 = 0, n128 = 250000, n256 = 981
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
8000000 | 268435456 | 48000000 | 661 | 29 | 276 | 389
john=# select * from bench_shuffle_search(0, 9*1000*1000);
NOTICE: num_keys = 8388608, height = 3, n4 = 0, n16 = 1, n32 = 0, n128 = 262144, n256 = 1028
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
8388608 | 276824064 | 54000000 | 718 | 33 | 311 | 446
The array is the right size, but nkeys hasn't kept pace. Can you reproduce this? Attached is the patch I'm using to show the stats when running the test. (Side note: The numbers look unfavorable for radix tree because I'm using 1 tid per block here.)
Yes, I can reproduce this. In tid_to_key_off() we need to cast to
uint64 when packing offset number and block number:
tid_i = ItemPointerGetOffsetNumber(tid);
tid_i |= ItemPointerGetBlockNumber(tid) << shift;
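To spell out why: with the default 8kB block size, pg_ceil_log2_32(MaxHeapTuplesPerPage) is 9, and ItemPointerGetBlockNumber() returns a 32-bit value, so without the cast the shift is evaluated in 32-bit arithmetic and the top bits of the block number are lost before the OR. Block numbers therefore wrap at 2^23 = 8388608, which is where num_keys stalls in your example. The attached patch widens the operand first:

    tid_i = ItemPointerGetOffsetNumber(tid);
    tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << shift;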
2. I found that bench_shuffle_search() is much *faster* for traditional binary search on an array than bench_seq_search(). I've found this to be true in every case. This seems counterintuitive to me -- any idea why this is? Example:
john=# select * from bench_seq_search(0, 1000000);
NOTICE: num_keys = 1000000, height = 2, n4 = 0, n16 = 0, n32 = 31251, n128 = 1, n256 = 122
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
1000000 | 10199040 | 180000000 | 168 | 106 | 827 | 3348
john=# select * from bench_shuffle_search(0, 1000000);
NOTICE: num_keys = 1000000, height = 2, n4 = 0, n16 = 0, n32 = 31251, n128 = 1, n256 = 122
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
1000000 | 10199040 | 180000000 | 171 | 107 | 827 | 1400
Ugh, in shuffle_itemptrs(), we shuffled the file-level itemptrs variable instead of the itemptr argument:
for (int i = 0; i < nitems - 1; i++)
{
int j = shuffle_randrange(&state, i, nitems - 1);
ItemPointerData t = itemptrs[j];
itemptrs[j] = itemptrs[i];
itemptrs[i] = t;
With the fix, the results on my environment were:
postgres(1:4093192)=# select * from bench_seq_search(0, 10000000);
2022-10-07 16:57:03.124 JST [4093192] LOG: num_keys = 10000000,
height = 3, n4 = 0, n16 = 1, n32 = 312500, n128 = 0, n256 = 1226
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
----------+------------------+---------------------+------------+---------------+--------------+-----------------
10000000 | 101826560 | 1800000000 | 846 |
486 | 6096 | 21128
(1 row)
Time: 28975.566 ms (00:28.976)
postgres(1:4093192)=# select * from bench_shuffle_search(0, 10000000);
2022-10-07 16:57:37.476 JST [4093192] LOG: num_keys = 10000000,
height = 3, n4 = 0, n16 = 1, n32 = 312500, n128 = 0, n256 = 1226
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
----------+------------------+---------------------+------------+---------------+--------------+-----------------
10000000 | 101826560 | 1800000000 | 845 |
484 | 32700 | 152583
(1 row)
I've attached a patch to fix them. Also, I realized that bsearch()
could be optimized out so I added code to prevent it:
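(The guard, as in the attached fix_bench_radix_tree.patch, keeps the result in a volatile variable so the call cannot be elided:)

    volatile bool ret;	/* prevent calling bsearch from being optimized out */

    CHECK_FOR_INTERRUPTS();

    ret = bsearch((void *) tid,
                  (void *) itemptrs,
                  ntids,
                  sizeof(ItemPointerData),
                  vac_cmp_itemptr);
    (void) ret;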
Regards,
--
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachments:
fix_bench_radix_tree.patch
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c
index 0778da2d7b..d4c8040357 100644
--- a/contrib/bench_radix_tree/bench_radix_tree.c
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -27,20 +27,17 @@ PG_FUNCTION_INFO_V1(bench_shuffle_search);
PG_FUNCTION_INFO_V1(bench_load_random_int);
PG_FUNCTION_INFO_V1(bench_fixed_height_search);
-static radix_tree *rt = NULL;
-static ItemPointer itemptrs = NULL;
-
static uint64
tid_to_key_off(ItemPointer tid, uint32 *off)
{
- uint32 upper;
+ uint64 upper;
uint32 shift = pg_ceil_log2_32(MaxHeapTuplesPerPage);
int64 tid_i;
Assert(ItemPointerGetOffsetNumber(tid) < MaxHeapTuplesPerPage);
tid_i = ItemPointerGetOffsetNumber(tid);
- tid_i |= ItemPointerGetBlockNumber(tid) << shift;
+ tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << shift;
/* log(sizeof(uint64) * BITS_PER_BYTE, 2) = log(64, 2) = 6 */
*off = tid_i & ((1 << 6) - 1);
@@ -70,10 +67,10 @@ shuffle_itemptrs(ItemPointer itemptr, uint64 nitems)
for (int i = 0; i < nitems - 1; i++)
{
int j = shuffle_randrange(&state, i, nitems - 1);
- ItemPointerData t = itemptrs[j];
+ ItemPointerData t = itemptr[j];
- itemptrs[j] = itemptrs[i];
- itemptrs[i] = t;
+ itemptr[j] = itemptr[i];
+ itemptr[i] = t;
}
}
@@ -138,6 +135,8 @@ bench_search(FunctionCallInfo fcinfo, bool shuffle)
{
BlockNumber minblk = PG_GETARG_INT32(0);
BlockNumber maxblk = PG_GETARG_INT32(1);
+ ItemPointer itemptrs = NULL;
+ radix_tree *rt = NULL;
uint64 ntids;
uint64 key;
uint64 last_key = PG_UINT64_MAX;;
@@ -185,6 +184,8 @@ bench_search(FunctionCallInfo fcinfo, bool shuffle)
TimestampDifference(start_time, end_time, &secs, &usecs);
rt_load_ms = secs * 1000 + usecs / 1000;
+ rt_stats(rt);
+
/* measure the load time of the array */
itemptrs = MemoryContextAllocHuge(CurrentMemoryContext,
sizeof(ItemPointerData) * ntids);
@@ -210,12 +211,14 @@ bench_search(FunctionCallInfo fcinfo, bool shuffle)
ItemPointer tid = &(tids[i]);
uint64 key, val;
uint32 off;
+ volatile bool ret; /* prevent calling rt_search from being optimized out */
CHECK_FOR_INTERRUPTS();
key = tid_to_key_off(tid, &off);
- rt_search(rt, key, &val);
+ ret = rt_search(rt, key, &val);
+ (void) ret;
}
end_time = GetCurrentTimestamp();
TimestampDifference(start_time, end_time, &secs, &usecs);
@@ -226,12 +229,16 @@ bench_search(FunctionCallInfo fcinfo, bool shuffle)
for (int i = 0; i < ntids; i++)
{
ItemPointer tid = &(tids[i]);
+ volatile bool ret; /* prevent calling bsearch from being optimized out */
- bsearch((void *) tid,
- (void *) itemptrs,
- ntids,
- sizeof(ItemPointerData),
- vac_cmp_itemptr);
+ CHECK_FOR_INTERRUPTS();
+
+ ret = bsearch((void *) tid,
+ (void *) itemptrs,
+ ntids,
+ sizeof(ItemPointerData),
+ vac_cmp_itemptr);
+ (void) ret;
}
end_time = GetCurrentTimestamp();
TimestampDifference(start_time, end_time, &secs, &usecs);
@@ -294,6 +301,8 @@ bench_load_random_int(PG_FUNCTION_ARGS)
TimestampDifference(start_time, end_time, &secs, &usecs);
load_time_ms = secs * 1000 + usecs / 1000;
+ rt_stats(rt);
+
MemSet(nulls, false, sizeof(nulls));
values[0] = Int64GetDatum(rt_memory_usage(rt));
values[1] = Int64GetDatum(load_time_ms);
The following is not quite a full review, but has plenty to think about.
There is too much to cover at once, and I have to start somewhere...
My main concerns are that internal APIs:
1. are difficult to follow
2. lead to poor branch prediction and too many function calls
Some of the measurements are picking on the SIMD search code, but I go into
details in order to demonstrate how a regression there can go completely
unnoticed. Hopefully the broader themes are informative.
On Fri, Oct 7, 2022 at 3:09 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
[fixed benchmarks]
Thanks for that! Now I can show clear results on some aspects in a simple
way. The attached patches (apply on top of v6) are not intended to be
incorporated as-is quite yet, but do point the way to some reorganization
that I think is necessary. I've done some testing on loading, but will
leave it out for now in the interest of length.
0001-0003 are your performance test fix and some small conveniences for
testing. Binary search is turned off, for example, because we already know
how it performs. And the sleep call is so I can run perf in a different shell
session, on only the search portion.
Note the v6 test loads all block numbers in the range. Since the test item
ids are all below 64 (reasonable), there are always 32 leaf chunks, so all
the leaves are node32 and completely full. This had the effect of never
taking the byte-wise loop in the proposed pg_lsearch function. These two
aspects make this an easy case for the branch predictor:
john=# select * from bench_seq_search(0, 1*1000*1000);
NOTICE: num_keys = 1000000, height = 2, n4 = 0, n16 = 0, n32 = 31251, n128
= 1, n256 = 122
NOTICE: sleeping for 2 seconds...
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
1000000 | 10199040 | 180000000 | 167 |
0 | 822 | 0
1,470,141,841 branches:u
63,693 branch-misses:u # 0.00% of all
branches
john=# select * from bench_shuffle_search(0, 1*1000*1000);
NOTICE: num_keys = 1000000, height = 2, n4 = 0, n16 = 0, n32 = 31251, n128
= 1, n256 = 122
NOTICE: sleeping for 2 seconds...
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
1000000 | 10199040 | 180000000 | 168 |
0 | 2174 | 0
1,470,142,569 branches:u
15,023,983 branch-misses:u # 1.02% of all branches
0004 randomizes block selection in the load part of the search test so that
each block has a 50% chance of being loaded. Note that now we have many
node16s where we had none before. Although node 16 and node32 appear to
share the same path in the switch statement of rt_node_search(), the chunk
comparison and node_get_values() calls each must go through different
branches. The shuffle case is most affected, but even the sequential case
slows down. (The leaves are less full -> there are more of them, so memory
use is larger, but it shouldn't matter much, in the sequential case at
least)
john=# select * from bench_seq_search(0, 2*1000*1000);
NOTICE: num_keys = 999654, height = 2, n4 = 1, n16 = 35610, n32 = 26889,
n128 = 1, n256 = 245
NOTICE: sleeping for 2 seconds...
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
999654 | 14893056 | 179937720 | 173 |
0 | 907 | 0
1,684,114,926 branches:u
1,989,901 branch-misses:u # 0.12% of all branches
john=# select * from bench_shuffle_search(0, 2*1000*1000);
NOTICE: num_keys = 999654, height = 2, n4 = 1, n16 = 35610, n32 = 26889,
n128 = 1, n256 = 245
NOTICE: sleeping for 2 seconds...
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
999654 | 14893056 | 179937720 | 173 |
0 | 2890 | 0
1,684,115,844 branches:u
34,215,740 branch-misses:u # 2.03% of all branches
0005 replaces pg_lsearch with a branch-free SIMD search. Note that it
retains full portability and gains predictable performance. For
demonstration, it's used on all three linear-search types. Although I'm
sure it'd be way too slow for node4, this benchmark hardly has any so it's
ok.
john=# select * from bench_seq_search(0, 2*1000*1000);
NOTICE: num_keys = 999654, height = 2, n4 = 1, n16 = 35610, n32 = 26889,
n128 = 1, n256 = 245
NOTICE: sleeping for 2 seconds...
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
999654 | 14893056 | 179937720 | 176 |
0 | 867 | 0
1,469,540,357 branches:u
96,678 branch-misses:u # 0.01% of all
branches
john=# select * from bench_shuffle_search(0, 2*1000*1000);
NOTICE: num_keys = 999654, height = 2, n4 = 1, n16 = 35610, n32 = 26889,
n128 = 1, n256 = 245
NOTICE: sleeping for 2 seconds...
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
999654 | 14893056 | 179937720 | 171 |
0 | 2530 | 0
1,469,540,533 branches:u
15,019,975 branch-misses:u # 1.02% of all branches
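To give a flavor of what 0005 does (this is a simplified scalar stand-in with invented names, not the actual patch code): compare every chunk unconditionally, collect the results in a bitmask, and derive the match index from the mask, so there is no early-exit branch for the predictor to miss.

#include "postgres.h"
#include "port/pg_bitutils.h"

/* Branch-free linear search over up to 32 chunk bytes; illustrative only. */
static inline int
node_lsearch_eq(const uint8 *chunks, int count, uint8 key)
{
	uint32		bitfield = 0;

	for (int i = 0; i < count; i++)
		bitfield |= ((uint32) (chunks[i] == key)) << i;

	return bitfield ? pg_rightmost_one_pos32(bitfield) : -1;
}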
0006 removes node16, and 0007 avoids a function call to introspect node
type. 0006 is really to make 0007 simpler to code. The crucial point here
is that calling out to rt_node_get_values/children() to figure out what
type we are is costly. With these patches, searching an unevenly populated
load is the same or faster than the original sequential load, despite
taking twice as much memory. (And, as I've noted before, decoupling size
class from node kind would win the memory back.)
john=# select * from bench_seq_search(0, 2*1000*1000);
NOTICE: num_keys = 999654, height = 2, n4 = 1, n32 = 62499, n128 = 1, n256
= 245
NOTICE: sleeping for 2 seconds...
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
999654 | 20381696 | 179937720 | 171 |
0 | 717 | 0
1,349,614,294 branches:u
1,313 branch-misses:u # 0.00% of all
branches
john=# select * from bench_shuffle_search(0, 2*1000*1000);
NOTICE: num_keys = 999654, height = 2, n4 = 1, n32 = 62499, n128 = 1, n256
= 245
NOTICE: sleeping for 2 seconds...
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
999654 | 20381696 | 179937720 | 172 |
0 | 2202 | 0
1,349,614,741 branches:u
30,592 branch-misses:u # 0.00% of all
branches
Expanding this point, once a path branches based on node kind, there should
be no reason to ever forget the kind. The abstractions in v6 have
disadvantages. I understand the reasoning -- to reduce duplication of code.
However, done this way, less code in the text editor leads to *more* code
(i.e. costly function calls and branches) on the machine level.
I haven't looked at insert/load performance carefully, but it's clear it
suffers from the same amnesia. prepare_node_for_insert() branches based on
the kind. If it must call rt_node_grow(), that function has no idea where
it came from and must branch again. When prepare_node_for_insert() returns
we again have no idea what the kind is, so must branch again. And if we are
one of the three linear-search nodes, we later do another function call,
where we encounter a 5-way jump table because the caller could be anything
at all.
Some of this could be worked around with always-inline functions to which
we pass a const node kind, and let the compiler get rid of the branches
etc. But many cases are probably not even worth doing that. For example, I
don't think prepare_node_for_insert() is a useful abstraction to begin
with. It returns an index, but only for linear nodes. Lookup nodes get a
return value of zero. There is not enough commonality here.
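As a sketch of what I mean by always-inline functions taking a const kind (toy definitions here, not the real nodes): each specialized caller passes a literal kind, and the compiler folds the branch away.

#include "postgres.h"

#define RT_NODE_KIND_4	0
#define RT_NODE_KIND_32	1

typedef struct rt_node
{
	uint8		kind;
	uint16		count;
} rt_node;

typedef struct rt_node_leaf_4
{
	rt_node		base;
	uint8		chunks[4];
	uint64		values[4];
} rt_node_leaf_4;

typedef struct rt_node_leaf_32
{
	rt_node		base;
	uint8		chunks[32];
	uint64		values[32];
} rt_node_leaf_32;

static pg_attribute_always_inline uint64 *
node_get_values(rt_node *node, const int kind)
{
	if (kind == RT_NODE_KIND_4)
		return ((rt_node_leaf_4 *) node)->values;
	else
		return ((rt_node_leaf_32 *) node)->values;
}

/*
 * A caller that already knows its kind writes e.g.
 *		uint64 *values = node_get_values(node, RT_NODE_KIND_32);
 * and the branch above disappears at compile time.
 */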
Along the same lines, there are a number of places that have branches as a
consequence of treating inner nodes and leaves with the same api:
rt_node_iterate_next
chunk_array_node_get_slot
node_128/256_get_slot
rt_node_search
I'm leaning towards splitting these out into specialized functions for each
inner and leaf. This is a bit painful for the last one, but perhaps if we
are resigned to templating the shared-mem case, maybe we can template some
of the inner/leaf stuff. Something to think about for later, but for now I
believe we have to accept some code duplication as a prerequisite for
decent performance as well as readability.
For the next steps, we need to proceed cautiously because there is a lot in
the air at the moment. Here are some aspects I would find desirable. If
there are impracticalities I haven't thought of, we can discuss further. I
don't pretend to know the practical consequences of every change I mention.
- If you have started coding the shared memory case, I'd advise to continue
so we can see what that looks like. If that has not gotten beyond the
design stage, I'd like to first see an attempt at tearing down some of the
clumsier abstractions in the current patch.
- As a "smoke test", there should ideally be nothing as general as
rt_node_get_children/values(). We should ideally always know what kind we
are if we found out earlier.
- For distinguishing between linear nodes, perhaps some always-inline
functions can help hide details. But at the same time, trying to treat them
the same is not always worthwhile.
- Start to separate treatment of inner/leaves and see how it goes.
- I firmly believe we only need 4 node *kinds*, and later we can decouple
the size classes as a separate concept. I'm willing to put serious time
into that once the broad details are right. I will also investigate pointer
tagging if we can confirm that can work similarly for dsa pointers.
Regarding size class decoupling, I'll respond to a point made earlier:
On Fri, Sep 30, 2022 at 10:47 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
With this idea, we can just repalloc() to grow to the larger size in a
pair but I'm slightly concerned that the more size class we use, the
more frequent the node needs to grow.
Well, yes, but that's orthogonal. For example, v6 has 5 node kinds. Imagine
that we have 4 node kinds, but the SIMD node kind used 2 size classes. Then
the nodes would grow at *exactly* the same frequency as they do today. I
listed many ways a size class could fit into a power-of-two (and there are
more), but we have a choice in how many to actually use. It's a trade off
between memory usage and complexity.
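A rough sketch of the decoupling (invented names and sizes): one linear-search kind backed by two allocation sizes, where growing within the kind is just a repalloc to the next class because the layout is identical.

#include "postgres.h"

/* Illustrative only: one node kind, two size classes */
typedef struct rt_size_class_elem
{
	int			fanout;			/* slots available in this class */
	Size		allocsize;		/* bytes to allocate for this class */
} rt_size_class_elem;

static const rt_size_class_elem rt_linear_classes[] = {
	{16, 320},					/* hypothetical sizes */
	{32, 576},
};

/* Same kind, same layout, just more room. */
static void *
rt_grow_to_class(void *node, int to_class)
{
	return repalloc(node, rt_linear_classes[to_class].allocsize);
}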
If we want to support node
shrink, the deletion is also affected.
Not necessarily. We don't have to shrink at the same granularity as
growing. My evidence is simple: we don't shrink at all now. :-)
--
John Naylor
EDB: http://www.enterprisedb.com
On Mon, Oct 10, 2022 at 12:16 PM John Naylor <john.naylor@enterprisedb.com>
wrote:
Thanks for that! Now I can show clear results on some aspects in a simple
way. The attached patches (apply on top of v6)
Forgot the patchset...
--
John Naylor
EDB: http://www.enterprisedb.com
Attachments:
Hi,
On Mon, Oct 10, 2022 at 2:16 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
The following is not quite a full review, but has plenty to think about. There is too much to cover at once, and I have to start somewhere...
My main concerns are that internal APIs:
1. are difficult to follow
2. lead to poor branch prediction and too many function calls
Some of the measurements are picking on the SIMD search code, but I go into details in order to demonstrate how a regression there can go completely unnoticed. Hopefully the broader themes are informative.
On Fri, Oct 7, 2022 at 3:09 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
[fixed benchmarks]
Thanks for that! Now I can show clear results on some aspects in a simple way. The attached patches (apply on top of v6) are not intended to be incorporated as-is quite yet, but do point the way to some reorganization that I think is necessary. I've done some testing on loading, but will leave it out for now in the interest of length.
0001-0003 are your performance test fix and and some small conveniences for testing. Binary search is turned off, for example, because we know it already. And the sleep call is so I can run perf in a different shell session, on only the search portion.
Note the v6 test loads all block numbers in the range. Since the test item ids are all below 64 (reasonable), there are always 32 leaf chunks, so all the leaves are node32 and completely full. This had the effect of never taking the byte-wise loop in the proposed pg_lsearch function. These two aspects make this an easy case for the branch predictor:
john=# select * from bench_seq_search(0, 1*1000*1000);
NOTICE: num_keys = 1000000, height = 2, n4 = 0, n16 = 0, n32 = 31251, n128 = 1, n256 = 122
NOTICE: sleeping for 2 seconds...
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
1000000 | 10199040 | 180000000 | 167 | 0 | 822 | 01,470,141,841 branches:u
63,693 branch-misses:u # 0.00% of all branchesjohn=# select * from bench_shuffle_search(0, 1*1000*1000);
NOTICE: num_keys = 1000000, height = 2, n4 = 0, n16 = 0, n32 = 31251, n128 = 1, n256 = 122
NOTICE: sleeping for 2 seconds...
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
1000000 | 10199040 | 180000000 | 168 | 0 | 2174 | 01,470,142,569 branches:u
15,023,983 branch-misses:u # 1.02% of all branches0004 randomizes block selection in the load part of the search test so that each block has a 50% chance of being loaded. Note that now we have many node16s where we had none before. Although node 16 and node32 appear to share the same path in the switch statement of rt_node_search(), the chunk comparison and node_get_values() calls each must go through different branches. The shuffle case is most affected, but even the sequential case slows down. (The leaves are less full -> there are more of them, so memory use is larger, but it shouldn't matter much, in the sequential case at least)
john=# select * from bench_seq_search(0, 2*1000*1000);
NOTICE: num_keys = 999654, height = 2, n4 = 1, n16 = 35610, n32 = 26889, n128 = 1, n256 = 245
NOTICE: sleeping for 2 seconds...
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
999654 | 14893056 | 179937720 | 173 | 0 | 907 | 01,684,114,926 branches:u
1,989,901 branch-misses:u # 0.12% of all branchesjohn=# select * from bench_shuffle_search(0, 2*1000*1000);
NOTICE: num_keys = 999654, height = 2, n4 = 1, n16 = 35610, n32 = 26889, n128 = 1, n256 = 245
NOTICE: sleeping for 2 seconds...
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
999654 | 14893056 | 179937720 | 173 | 0 | 2890 | 01,684,115,844 branches:u
34,215,740 branch-misses:u # 2.03% of all branches0005 replaces pg_lsearch with a branch-free SIMD search. Note that it retains full portability and gains predictable performance. For demonstration, it's used on all three linear-search types. Although I'm sure it'd be way too slow for node4, this benchmark hardly has any so it's ok.
john=# select * from bench_seq_search(0, 2*1000*1000);
NOTICE: num_keys = 999654, height = 2, n4 = 1, n16 = 35610, n32 = 26889, n128 = 1, n256 = 245
NOTICE: sleeping for 2 seconds...
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
999654 | 14893056 | 179937720 | 176 | 0 | 867 | 01,469,540,357 branches:u
96,678 branch-misses:u # 0.01% of all branchesjohn=# select * from bench_shuffle_search(0, 2*1000*1000);
NOTICE: num_keys = 999654, height = 2, n4 = 1, n16 = 35610, n32 = 26889, n128 = 1, n256 = 245
NOTICE: sleeping for 2 seconds...
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
999654 | 14893056 | 179937720 | 171 | 0 | 2530 | 01,469,540,533 branches:u
15,019,975 branch-misses:u # 1.02% of all branches0006 removes node16, and 0007 avoids a function call to introspect node type. 0006 is really to make 0007 simpler to code. The crucial point here is that calling out to rt_node_get_values/children() to figure out what type we are is costly. With these patches, searching an unevenly populated load is the same or faster than the original sequential load, despite taking twice as much memory. (And, as I've noted before, decoupling size class from node kind would win the memory back.)
john=# select * from bench_seq_search(0, 2*1000*1000);
NOTICE: num_keys = 999654, height = 2, n4 = 1, n32 = 62499, n128 = 1, n256 = 245
NOTICE: sleeping for 2 seconds...
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
999654 | 20381696 | 179937720 | 171 | 0 | 717 | 01,349,614,294 branches:u
1,313 branch-misses:u # 0.00% of all branchesjohn=# select * from bench_shuffle_search(0, 2*1000*1000);
NOTICE: num_keys = 999654, height = 2, n4 = 1, n32 = 62499, n128 = 1, n256 = 245
NOTICE: sleeping for 2 seconds...
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
999654 | 20381696 | 179937720 | 172 | 0 | 2202 | 01,349,614,741 branches:u
30,592 branch-misses:u # 0.00% of all branchesExpanding this point, once a path branches based on node kind, there should be no reason to ever forget the kind. Ther abstractions in v6 have disadvantages. I understand the reasoning -- to reduce duplication of code. However, done this way, less code in the text editor leads to *more* code (i.e. costly function calls and branches) on the machine level.
Right. When updating the patch from v4 to v5, I eliminated the
duplication of code between each node type as much as possible, which
in turn produced more code on the machine level. The results of your
experiment clearly showed the downside of this work. FWIW I've also
confirmed your changes in my environment (I've added a third
argument to turn the randomized block selection proposed in the
0004 patch on and off):
* w/o patches
postgres(1:361692)=# select * from bench_seq_search(0, 1 * 1000 * 1000, false);
2022-10-14 11:33:15.460 JST [361692] LOG: num_keys = 1000000, height
= 2, n4 = 0, n16 = 0, n32 = 31251, n128 = 1, n256 = 122
NOTICE: sleeping for 2 seconds...
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
1000000 | 10199040 | 180000000 | 87 |
| 462 |
(1 row)
1590104944 branches:u # 3.430 G/sec
65957 branch-misses:u # 0.00% of all branches
postgres(1:361692)=# select * from bench_seq_search(0, 2 * 1000 * 1000, true);
2022-10-14 11:33:28.934 JST [361692] LOG: num_keys = 999654, height =
2, n4 = 1, n16 = 35610, n32 = 26889, n128 = 1, n256 = 245
NOTICE: sleeping for 2 seconds...
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
999654 | 14893056 | 179937720 | 91 |
| 497 |
(1 row)
1748249456 branches:u # 3.506 G/sec
481074 branch-misses:u # 0.03% of all branches
postgres(1:361692)=# select * from bench_shuffle_search(0, 1 * 1000 *
1000, false);
2022-10-14 11:33:38.378 JST [361692] LOG: num_keys = 1000000, height
= 2, n4 = 0, n16 = 0, n32 = 31251, n128 = 1, n256 = 122
NOTICE: sleeping for 2 seconds...
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
1000000 | 10199040 | 180000000 | 86 |
| 1290 |
(1 row)
1590105370 branches:u # 1.231 G/sec
15039443 branch-misses:u # 0.95% of all branches
Time: 4166.346 ms (00:04.166)
postgres(1:361692)=# select * from bench_shuffle_search(0, 2 * 1000 *
1000, true);
2022-10-14 11:33:51.556 JST [361692] LOG: num_keys = 999654, height =
2, n4 = 1, n16 = 35610, n32 = 26889, n128 = 1, n256 = 245
NOTICE: sleeping for 2 seconds...
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
999654 | 14893056 | 179937720 | 90 |
| 1536 |
(1 row)
1748250497 branches:u # 1.137 G/sec
28125016 branch-misses:u # 1.61% of all branches
* w/ all patches
postgres(1:360358)=# select * from bench_seq_search(0, 1 * 1000 * 1000, false);
2022-10-14 11:29:27.232 JST [360358] LOG: num_keys = 1000000, height
= 2, n4 = 0, n32 = 31251, n128 = 1, n256 = 122
NOTICE: sleeping for 2 seconds...
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
1000000 | 10199040 | 180000000 | 81 |
| 432 |
(1 row)
1380062209 branches:u # 3.185 G/sec
1066 branch-misses:u # 0.00% of all branches
postgres(1:360358)=# select * from bench_seq_search(0, 2 * 1000 * 1000, true);
2022-10-14 11:29:46.380 JST [360358] LOG: num_keys = 999654, height =
2, n4 = 1, n32 = 62499, n128 = 1, n256 = 245
NOTICE: sleeping for 2 seconds...
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
999654 | 20381696 | 179937720 | 88 |
| 438 |
(1 row)
1379640815 branches:u # 3.133 G/sec
1332 branch-misses:u # 0.00% of all branches
postgres(1:360358)=# select * from bench_shuffle_search(0, 1 * 1000 *
1000, false);
2022-10-14 11:30:00.943 JST [360358] LOG: num_keys = 1000000, height
= 2, n4 = 0, n32 = 31251, n128 = 1, n256 = 122
NOTICE: sleeping for 2 seconds...
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
1000000 | 10199040 | 180000000 | 81 |
| 994 |
(1 row)
1380062386 branches:u # 1.386 G/sec
18368 branch-misses:u # 0.00% of all branches
postgres(1:360358)=# select * from bench_shuffle_search(0, 2 * 1000 *
1000, true);
2022-10-14 11:30:15.944 JST [360358] LOG: num_keys = 999654, height =
2, n4 = 1, n32 = 62499, n128 = 1, n256 = 245
NOTICE: sleeping for 2 seconds...
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
999654 | 20381696 | 179937720 | 88 |
| 1098 |
(1 row)
1379641503 branches:u # 1.254 G/sec
18973 branch-misses:u # 0.00% of all branches
I haven't looked at insert/load performance carefully, but it's clear it suffers from the same amnesia. prepare_node_for_insert() branches based on the kind. If it must call rt_node_grow(), that function has no idea where it came from and must branch again. When prepare_node_for_insert() returns we again have no idea what the kind is, so must branch again. And if we are one of the three linear-search nodes, we later do another function call, where we encounter a 5-way jump table because the caller could be anything at all.
Some of this could be worked around with always-inline functions to which we pass a const node kind, and let the compiler get rid of the branches etc. But many cases are probably not even worth doing that. For example, I don't think prepare_node_for_insert() is a useful abstraction to begin with. It returns an index, but only for linear nodes. Lookup nodes get a return value of zero. There is not enough commonality here.
Agreed.
Along the same lines, there are a number of places that have branches as a consequence of treating inner nodes and leaves with the same api:
rt_node_iterate_next
chunk_array_node_get_slot
node_128/256_get_slot
rt_node_search
I'm leaning towards splitting these out into specialized functions for each inner and leaf. This is a bit painful for the last one, but perhaps if we are resigned to templating the shared-mem case, maybe we can template some of the inner/leaf stuff. Something to think about for later, but for now I believe we have to accept some code duplication as a prerequisite for decent performance as well as readability.
Agreed.
For the next steps, we need to proceed cautiously because there is a lot in the air at the moment. Here are some aspects I would find desirable. If there are impracticalities I haven't thought of, we can discuss further. I don't pretend to know the practical consequences of every change I mention.
- If you have started coding the shared memory case, I'd advise to continue so we can see what that looks like. If that has not gotten beyond the design stage, I'd like to first see an attempt at tearing down some of the clumsier abstractions in the current patch.
- As a "smoke test", there should ideally be nothing as general as rt_node_get_children/values(). We should ideally always know what kind we are if we found out earlier.
- For distinguishing between linear nodes, perhaps some always-inline functions can help hide details. But at the same time, trying to treat them the same is not always worthwhile.
- Start to separate treatment of inner/leaves and see how it goes.
Since I've not started coding the shared memory case seriously, I'm
going to start with eliminating abstractions and splitting the
treatment of inner and leaf nodes.
- I firmly believe we only need 4 node *kinds*, and later we can decouple the size classes as a separate concept. I'm willing to put serious time into that once the broad details are right. I will also investigate pointer tagging if we can confirm that can work similarly for dsa pointers.
I'll keep 4 node kinds. And we can later try to introduce classes into
each node kind.
Regarding size class decoupling, I'll respond to a point made earlier:
On Fri, Sep 30, 2022 at 10:47 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
With this idea, we can just repalloc() to grow to the larger size in a
pair but I'm slightly concerned that the more size class we use, the
more frequent the node needs to grow.
Well, yes, but that's orthogonal. For example, v6 has 5 node kinds. Imagine that we have 4 node kinds, but the SIMD node kind used 2 size classes. Then the nodes would grow at *exactly* the same frequency as they do today. I listed many ways a size class could fit into a power-of-two (and there are more), but we have a choice in how many to actually use. It's a trade off between memory usage and complexity.
Agreed.
Regards,
--
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Fri, Oct 14, 2022 at 4:12 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Hi,
On Mon, Oct 10, 2022 at 2:16 PM John Naylor
<john.naylor@enterprisedb.com> wrote:The following is not quite a full review, but has plenty to think about. There is too much to cover at once, and I have to start somewhere...
My main concerns are that internal APIs:
1. are difficult to follow
2. lead to poor branch prediction and too many function callsSome of the measurements are picking on the SIMD search code, but I go into details in order to demonstrate how a regression there can go completely unnoticed. Hopefully the broader themes are informative.
On Fri, Oct 7, 2022 at 3:09 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
[fixed benchmarks]
Thanks for that! Now I can show clear results on some aspects in a simple way. The attached patches (apply on top of v6) are not intended to be incorporated as-is quite yet, but do point the way to some reorganization that I think is necessary. I've done some testing on loading, but will leave it out for now in the interest of length.
0001-0003 are your performance test fix and and some small conveniences for testing. Binary search is turned off, for example, because we know it already. And the sleep call is so I can run perf in a different shell session, on only the search portion.
Note the v6 test loads all block numbers in the range. Since the test item ids are all below 64 (reasonable), there are always 32 leaf chunks, so all the leaves are node32 and completely full. This had the effect of never taking the byte-wise loop in the proposed pg_lsearch function. These two aspects make this an easy case for the branch predictor:
john=# select * from bench_seq_search(0, 1*1000*1000);
NOTICE: num_keys = 1000000, height = 2, n4 = 0, n16 = 0, n32 = 31251, n128 = 1, n256 = 122
NOTICE: sleeping for 2 seconds...
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
1000000 | 10199040 | 180000000 | 167 | 0 | 822 | 01,470,141,841 branches:u
63,693 branch-misses:u # 0.00% of all branchesjohn=# select * from bench_shuffle_search(0, 1*1000*1000);
NOTICE: num_keys = 1000000, height = 2, n4 = 0, n16 = 0, n32 = 31251, n128 = 1, n256 = 122
NOTICE: sleeping for 2 seconds...
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
1000000 | 10199040 | 180000000 | 168 | 0 | 2174 | 01,470,142,569 branches:u
15,023,983 branch-misses:u # 1.02% of all branches0004 randomizes block selection in the load part of the search test so that each block has a 50% chance of being loaded. Note that now we have many node16s where we had none before. Although node 16 and node32 appear to share the same path in the switch statement of rt_node_search(), the chunk comparison and node_get_values() calls each must go through different branches. The shuffle case is most affected, but even the sequential case slows down. (The leaves are less full -> there are more of them, so memory use is larger, but it shouldn't matter much, in the sequential case at least)
john=# select * from bench_seq_search(0, 2*1000*1000);
NOTICE: num_keys = 999654, height = 2, n4 = 1, n16 = 35610, n32 = 26889, n128 = 1, n256 = 245
NOTICE: sleeping for 2 seconds...
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
999654 | 14893056 | 179937720 | 173 | 0 | 907 | 01,684,114,926 branches:u
1,989,901 branch-misses:u # 0.12% of all branchesjohn=# select * from bench_shuffle_search(0, 2*1000*1000);
NOTICE: num_keys = 999654, height = 2, n4 = 1, n16 = 35610, n32 = 26889, n128 = 1, n256 = 245
NOTICE: sleeping for 2 seconds...
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
999654 | 14893056 | 179937720 | 173 | 0 | 2890 | 01,684,115,844 branches:u
34,215,740 branch-misses:u # 2.03% of all branches0005 replaces pg_lsearch with a branch-free SIMD search. Note that it retains full portability and gains predictable performance. For demonstration, it's used on all three linear-search types. Although I'm sure it'd be way too slow for node4, this benchmark hardly has any so it's ok.
john=# select * from bench_seq_search(0, 2*1000*1000);
NOTICE: num_keys = 999654, height = 2, n4 = 1, n16 = 35610, n32 = 26889, n128 = 1, n256 = 245
NOTICE: sleeping for 2 seconds...
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
999654 | 14893056 | 179937720 | 176 | 0 | 867 | 01,469,540,357 branches:u
96,678 branch-misses:u # 0.01% of all branchesjohn=# select * from bench_shuffle_search(0, 2*1000*1000);
NOTICE: num_keys = 999654, height = 2, n4 = 1, n16 = 35610, n32 = 26889, n128 = 1, n256 = 245
NOTICE: sleeping for 2 seconds...
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
999654 | 14893056 | 179937720 | 171 | 0 | 2530 | 01,469,540,533 branches:u
15,019,975 branch-misses:u # 1.02% of all branches0006 removes node16, and 0007 avoids a function call to introspect node type. 0006 is really to make 0007 simpler to code. The crucial point here is that calling out to rt_node_get_values/children() to figure out what type we are is costly. With these patches, searching an unevenly populated load is the same or faster than the original sequential load, despite taking twice as much memory. (And, as I've noted before, decoupling size class from node kind would win the memory back.)
john=# select * from bench_seq_search(0, 2*1000*1000);
NOTICE: num_keys = 999654, height = 2, n4 = 1, n32 = 62499, n128 = 1, n256 = 245
NOTICE: sleeping for 2 seconds...
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
999654 | 20381696 | 179937720 | 171 | 0 | 717 | 01,349,614,294 branches:u
1,313 branch-misses:u # 0.00% of all branchesjohn=# select * from bench_shuffle_search(0, 2*1000*1000);
NOTICE: num_keys = 999654, height = 2, n4 = 1, n32 = 62499, n128 = 1, n256 = 245
NOTICE: sleeping for 2 seconds...
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
999654 | 20381696 | 179937720 | 172 | 0 | 2202 | 01,349,614,741 branches:u
30,592 branch-misses:u # 0.00% of all branchesExpanding this point, once a path branches based on node kind, there should be no reason to ever forget the kind. Ther abstractions in v6 have disadvantages. I understand the reasoning -- to reduce duplication of code. However, done this way, less code in the text editor leads to *more* code (i.e. costly function calls and branches) on the machine level.
Right. When updating the patch from v4 to v5, I've eliminated the
duplication of code between each node type as much as possible, which
in turn produced more code on the machine level. The resulst of your
experiment clearly showed the bad side of this work. FWIW I've also
confirmed your changes in my environment (I've added the third
argument to turn on and off the randomizes block selection proposed in
0004 patch):* w/o patches
postgres(1:361692)=# select * from bench_seq_search(0, 1 * 1000 * 1000, false);
2022-10-14 11:33:15.460 JST [361692] LOG: num_keys = 1000000, height
= 2, n4 = 0, n16 = 0, n32 = 31251, n128 = 1, n256 = 122
NOTICE: sleeping for 2 seconds...
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
1000000 | 10199040 | 180000000 | 87 |
| 462 |
(1 row)1590104944 branches:u # 3.430 G/sec
65957 branch-misses:u # 0.00% of all branchespostgres(1:361692)=# select * from bench_seq_search(0, 2 * 1000 * 1000, true);
2022-10-14 11:33:28.934 JST [361692] LOG: num_keys = 999654, height =
2, n4 = 1, n16 = 35610, n32 = 26889, n128 = 1, n256 = 245
NOTICE: sleeping for 2 seconds...
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
999654 | 14893056 | 179937720 | 91 |
| 497 |
(1 row)1748249456 branches:u # 3.506 G/sec
481074 branch-misses:u # 0.03% of all branchespostgres(1:361692)=# select * from bench_shuffle_search(0, 1 * 1000 *
1000, false);
2022-10-14 11:33:38.378 JST [361692] LOG: num_keys = 1000000, height
= 2, n4 = 0, n16 = 0, n32 = 31251, n128 = 1, n256 = 122
NOTICE: sleeping for 2 seconds...
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
1000000 | 10199040 | 180000000 | 86 |
| 1290 |
(1 row)1590105370 branches:u # 1.231 G/sec
15039443 branch-misses:u # 0.95% of all branchesTime: 4166.346 ms (00:04.166)
postgres(1:361692)=# select * from bench_shuffle_search(0, 2 * 1000 *
1000, true);
2022-10-14 11:33:51.556 JST [361692] LOG: num_keys = 999654, height =
2, n4 = 1, n16 = 35610, n32 = 26889, n128 = 1, n256 = 245
NOTICE: sleeping for 2 seconds...
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
999654 | 14893056 | 179937720 | 90 |
| 1536 |
(1 row)1748250497 branches:u # 1.137 G/sec
28125016 branch-misses:u # 1.61% of all branches* w/ all patches
postgres(1:360358)=# select * from bench_seq_search(0, 1 * 1000 * 1000, false);
2022-10-14 11:29:27.232 JST [360358] LOG: num_keys = 1000000, height
= 2, n4 = 0, n32 = 31251, n128 = 1, n256 = 122
NOTICE: sleeping for 2 seconds...
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
1000000 | 10199040 | 180000000 | 81 |
| 432 |
(1 row)1380062209 branches:u # 3.185 G/sec
1066 branch-misses:u # 0.00% of all branchespostgres(1:360358)=# select * from bench_seq_search(0, 2 * 1000 * 1000, true);
2022-10-14 11:29:46.380 JST [360358] LOG: num_keys = 999654, height =
2, n4 = 1, n32 = 62499, n128 = 1, n256 = 245
NOTICE: sleeping for 2 seconds...
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
999654 | 20381696 | 179937720 | 88 |
| 438 |
(1 row)1379640815 branches:u # 3.133 G/sec
1332 branch-misses:u # 0.00% of all branchespostgres(1:360358)=# select * from bench_shuffle_search(0, 1 * 1000 *
1000, false);
2022-10-14 11:30:00.943 JST [360358] LOG: num_keys = 1000000, height
= 2, n4 = 0, n32 = 31251, n128 = 1, n256 = 122
NOTICE: sleeping for 2 seconds...
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
1000000 | 10199040 | 180000000 | 81 |
| 994 |
(1 row)

1380062386 branches:u # 1.386 G/sec
18368 branch-misses:u # 0.00% of all branches

postgres(1:360358)=# select * from bench_shuffle_search(0, 2 * 1000 * 1000, true);
2022-10-14 11:30:15.944 JST [360358] LOG: num_keys = 999654, height =
2, n4 = 1, n32 = 62499, n128 = 1, n256 = 245
NOTICE: sleeping for 2 seconds...
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
999654 | 20381696 | 179937720 | 88 |
| 1098 |
(1 row)

1379641503 branches:u # 1.254 G/sec
18973 branch-misses:u # 0.00% of all branches

I haven't looked at insert/load performance carefully, but it's clear it suffers from the same amnesia. prepare_node_for_insert() branches based on the kind. If it must call rt_node_grow(), that function has no idea where it came from and must branch again. When prepare_node_for_insert() returns, we again have no idea what the kind is, so we must branch again. And if we are one of the three linear-search nodes, we later do another function call, where we encounter a 5-way jump table because the caller could be anything at all.
Some of this could be worked around with always-inline functions to which we pass a const node kind, and let the compiler get rid of the branches etc. But many cases are probably not even worth doing that. For example, I don't think prepare_node_for_insert() is a useful abstraction to begin with. It returns an index, but only for linear nodes. Lookup nodes get a return value of zero. There is not enough commonality here.
Agreed.
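To make the idea concrete, here is a rough sketch of the always-inline direction (an illustration only, building on the definitions in the attached radixtree.c; node_search_eq_internal() and node_4_search_eq_fast() are hypothetical names, not functions in the patch):

/*
 * Hypothetical sketch: the helper takes the node kind as a constant, so each
 * caller that passes a literal kind gets a specialized copy in which the
 * compiler can fold the switch away. Forcing inlining (e.g. with an
 * always-inline attribute) would make that more reliable.
 */
static inline int
node_search_eq_internal(rt_node *node, uint8 chunk, const int kind)
{
	switch (kind)
	{
		case RT_NODE_KIND_4:
			return node_4_search_eq((rt_node_base_4 *) node, chunk);
		case RT_NODE_KIND_32:
			return node_32_search_eq((rt_node_base_32 *) node, chunk);
		default:
			return -1;	/* lookup-table node kinds handled elsewhere */
	}
}

/* A call site that already knows the kind pays no dispatch cost. */
static inline int
node_4_search_eq_fast(rt_node *node, uint8 chunk)
{
	return node_search_eq_internal(node, chunk, RT_NODE_KIND_4);
}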
Along the same lines, there are a number of places that have branches as a consequence of treating inner nodes and leaves with the same api:
rt_node_iterate_next
chunk_array_node_get_slot
node_128/256_get_slot
rt_node_search

I'm leaning towards splitting these out into specialized functions for each inner and leaf. This is a bit painful for the last one, but perhaps if we are resigned to templating the shared-mem case, maybe we can template some of the inner/leaf stuff. Something to think about for later, but for now I believe we have to accept some code duplication as a prerequisite for decent performance as well as readability.
Agreed.
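For illustration, the kind of split I have in mind for the iteration path is roughly the following (a sketch only; the "generic" variant shown first does not exist in the attached patch, it just stands in for the combined treatment being criticized):

/* Combined form: every step pays a branch on whether the node is a leaf. */
static bool
rt_node_iterate_next_generic(rt_iter *iter, rt_node_iter *node_iter,
							 rt_node **child_p, uint64 *value_p)
{
	if (NODE_IS_LEAF(node_iter->node))
		return rt_node_leaf_iterate_next(iter, node_iter, value_p);

	*child_p = rt_node_inner_iterate_next(iter, node_iter);
	return (*child_p != NULL);
}

In the split form, the iterator keeps the leaf node at stack[0] and inner nodes above it, so rt_iterate_next() calls the leaf variant only for level 0 and the inner variant for higher levels, and the leaf-vs-inner test disappears from the per-step path.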
For the next steps, we need to proceed cautiously because there is a lot in the air at the moment. Here are some aspects I would find desirable. If there are impracticalities I haven't thought of, we can discuss further. I don't pretend to know the practical consequences of every change I mention.
- If you have started coding the shared memory case, I'd advise to continue so we can see what that looks like. If that has not gotten beyond the design stage, I'd like to first see an attempt at tearing down some of the clumsier abstractions in the current patch.
- As a "smoke test", there should ideally be nothing as general as rt_node_get_children/values(). We should ideally always know what kind we are if we found out earlier.
- For distinguishing between linear nodes, perhaps some always-inline functions can help hide details. But at the same time, trying to treat them the same is not always worthwhile.
- Start to separate treatment of inner/leaves and see how it goes.

Since I've not started coding the shared memory case seriously, I'm
going to start with eliminating abstractions and splitting the
treatment of inner and leaf nodes.
I've attached updated PoC patches for discussion and cfbot. From the
previous version, I mainly changed the following things:
* Separate treatment of inner and leaf nodes
* Pack both the node kind and the node count into a uint16 value (see the sketch below).
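For reference, the packing looks roughly like this (the NODE_GET_COUNT()/NODE_GET_KIND() macros in the attached patch are the authoritative definitions; the helper functions below are only an illustrative sketch):

/*
 * uint16 'info' layout in this version:
 *   bits 0-8  : number of children (a node-256 can hold up to 256)
 *   bits 9-10 : node kind (currently 4 kinds)
 */
static inline uint16
info_pack(uint8 kind, uint16 count)
{
	return (uint16) ((kind << RT_NODE_INFO_COUNT_BITS) | (count & RT_NODE_INFO_COUNT_MASK));
}

static inline uint8
info_get_kind(uint16 info)
{
	return (uint8) ((info >> RT_NODE_INFO_COUNT_BITS) & RT_NODE_INFO_KIND_MASK);
}

static inline uint16
info_get_count(uint16 info)
{
	return info & RT_NODE_INFO_COUNT_MASK;
}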
I've also made a change to the functions in the bench_radix_tree test
module: the third argument of bench_seq/shuffle_search() is a flag to
turn randomized block selection on and off. The results of performance
tests in my environment are:
postgres(1:1665989)=# select * from bench_seq_search(0, 1* 1000 * 1000, false);
2022-10-24 14:29:40.705 JST [1665989] LOG: num_keys = 1000000, height
= 2, n4 = 0, n32 = 31251, n128 = 1, n256 = 122
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
1000000 | 9871104 | 180000000 | 65 |
| 248 |
(1 row)
postgres(1:1665989)=# select * from bench_seq_search(0, 2* 1000 * 1000, true);
2022-10-24 14:29:47.999 JST [1665989] LOG: num_keys = 999654, height
= 2, n4 = 1, n32 = 62499, n128 = 1, n256 = 245
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
999654 | 19680736 | 179937720 | 71 |
| 237 |
(1 row)
postgres(1:1665989)=# select * from bench_shuffle_search(0, 1 * 1000 *
1000, false);
2022-10-24 14:29:55.955 JST [1665989] LOG: num_keys = 1000000, height
= 2, n4 = 0, n32 = 31251, n128 = 1, n256 = 122
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
1000000 | 9871104 | 180000000 | 65 |
| 641 |
(1 row)
postgres(1:1665989)=# select * from bench_shuffle_search(0, 2 * 1000 *
1000, true);
2022-10-24 14:30:04.140 JST [1665989] LOG: num_keys = 999654, height
= 2, n4 = 1, n32 = 62499, n128 = 1, n256 = 245
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
999654 | 19680736 | 179937720 | 71 |
| 654 |
(1 row)
I've not worked on the SIMD part seriously yet, but overall the
performance seems good so far. If we agree with the current approach,
I think we can proceed with verifying the decoupling of node sizes
from node kinds, and I'll investigate DSA support.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachments:
0001-introduce-vector8_min-and-vector8_highbit_mask.patch
From fcf76629b46732b56e424111f3fb8b53c05fd07a Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 24 Oct 2022 14:07:09 +0900
Subject: [POC PATCH 1/3] introduce vector8_min and vector8_highbit_mask
---
src/include/port/simd.h | 62 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 62 insertions(+)
diff --git a/src/include/port/simd.h b/src/include/port/simd.h
index 61ae4ecf60..039d7e5235 100644
--- a/src/include/port/simd.h
+++ b/src/include/port/simd.h
@@ -60,6 +60,15 @@ typedef uint32x4_t Vector32;
typedef uint64 Vector8;
#endif
+/*
+ * Some of the functions with SIMD implementations use bitwise operations
+ * available in pg_bitutils.h. There are currently no non-SIMD implementations
+ * that require these bitwise operations.
+ */
+#ifndef USE_NO_SIMD
+#include "port/pg_bitutils.h"
+#endif
+
/* load/store operations */
static inline void vector8_load(Vector8 *v, const uint8 *s);
#ifndef USE_NO_SIMD
@@ -79,6 +88,8 @@ static inline bool vector8_has_le(const Vector8 v, const uint8 c);
static inline bool vector8_is_highbit_set(const Vector8 v);
#ifndef USE_NO_SIMD
static inline bool vector32_is_highbit_set(const Vector32 v);
+static inline int vector8_find(const Vector8 v, const uint8 c);
+static inline int vector8_find_ge(const Vector8 v, const uint8 c);
#endif
/* arithmetic operations */
@@ -262,6 +273,27 @@ vector8_has_le(const Vector8 v, const uint8 c)
return result;
}
+/*
+ * Compare the given vectors and return the vector of minimum elements.
+ */
+static inline Vector8
+vector8_min(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+ return _mm_min_epu8(v1, v2);
+#elif defined(USE_NEON)
+ return vminq_u8(v1, v2);
+#else /* USE_NO_SIMD */
+ Vector8 r = 0;
+ uint8 *rp = (uint8 *) &r;
+
+ for (Size i = 0; i < sizeof(Vector8); i++)
+ rp[i] = Min(((const uint8 *) &v1)[i], ((const uint8 *) &v2)[i]);
+
+ return r;
+#endif
+}
+
/*
* Return true if the high bit of any element is set
*/
@@ -277,6 +309,36 @@ vector8_is_highbit_set(const Vector8 v)
#endif
}
+/*
+ * Return the bitmask of the high bit of each element.
+ */
+static inline uint32
+vector8_highbit_mask(const Vector8 v)
+{
+#ifdef USE_SSE2
+ return (uint32) _mm_movemask_epi8(v);
+#elif defined(USE_NEON)
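+ /*
+ * Collect the high bit of each byte into a 16-bit mask: the arithmetic
+ * shift turns each byte into 0xFF or 0x00, the AND keeps one distinct bit
+ * per lane, and interleaving the low and high halves lets the horizontal
+ * add of uint16 lanes OR the disjoint bits together.
+ */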
+ static const uint8 mask[16] = {
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ };
+
+ uint8x16_t masked = vandq_u8(vld1q_u8(mask), (uint8x16_t) vshrq_n_s8(v, 7));
+ uint8x16_t maskedhi = vextq_u8(masked, masked, 8);
+
+ return (uint32) vaddvq_u16((uint16x8_t) vzip1q_u8(masked, maskedhi));
+#else
+ uint32 mask = 0;
+
+ for (Size i = 0; i < sizeof(Vector8); i++)
+ mask |= (((const uint8 *) &v)[i] >> 7) << i;
+
+ return mask;
+#endif
+}
+
/*
* Exactly like vector8_is_highbit_set except for the input type, so it
* looks at each byte separately.
--
2.31.1
0002-Add-radix-implementation.patch
From 6cd239b14d521f2f1377730874c27b4eb9281217 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 14 Sep 2022 12:38:51 +0000
Subject: [POC PATCH 2/3] Add radix implementation.
---
src/backend/lib/Makefile | 1 +
src/backend/lib/radixtree.c | 2439 +++++++++++++++++
src/include/lib/radixtree.h | 42 +
src/test/modules/Makefile | 1 +
src/test/modules/test_radixtree/.gitignore | 4 +
src/test/modules/test_radixtree/Makefile | 23 +
src/test/modules/test_radixtree/README | 7 +
.../expected/test_radixtree.out | 28 +
.../test_radixtree/sql/test_radixtree.sql | 7 +
.../test_radixtree/test_radixtree--1.0.sql | 8 +
.../modules/test_radixtree/test_radixtree.c | 504 ++++
.../test_radixtree/test_radixtree.control | 4 +
12 files changed, 3068 insertions(+)
create mode 100644 src/backend/lib/radixtree.c
create mode 100644 src/include/lib/radixtree.h
create mode 100644 src/test/modules/test_radixtree/.gitignore
create mode 100644 src/test/modules/test_radixtree/Makefile
create mode 100644 src/test/modules/test_radixtree/README
create mode 100644 src/test/modules/test_radixtree/expected/test_radixtree.out
create mode 100644 src/test/modules/test_radixtree/sql/test_radixtree.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree--1.0.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree.c
create mode 100644 src/test/modules/test_radixtree/test_radixtree.control
diff --git a/src/backend/lib/Makefile b/src/backend/lib/Makefile
index 9dad31398a..4c1db794b6 100644
--- a/src/backend/lib/Makefile
+++ b/src/backend/lib/Makefile
@@ -22,6 +22,7 @@ OBJS = \
integerset.o \
knapsack.o \
pairingheap.o \
+ radixtree.o \
rbtree.o \
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
new file mode 100644
index 0000000000..93c81b843f
--- /dev/null
+++ b/src/backend/lib/radixtree.c
@@ -0,0 +1,2439 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.c
+ * Implementation for adaptive radix tree.
+ *
+ * This module employs the idea from the paper "The Adaptive Radix Tree: ARTful
+ * Indexing for Main-Memory Databases" by Viktor Leis, Alfons Kemper, and Thomas
+ * Neumann, 2013. The radix tree uses adaptive node sizes, a small number of node
+ * types, each with a different number of elements. Depending on the number of
+ * children, the appropriate node type is used.
+ *
+ * There are some differences from the proposed implementation. For instance,
+ * this radix tree module utilizes AVX2 instructions, enabling us to use 256-bit
+ * wide SIMD vectors, whereas 128-bit wide SIMD vectors are used in the paper.
+ * Also, there is no support for path compression and lazy path expansion. The
+ * radix tree supports only fixed-length keys, so we don't expect the tree to
+ * become very high.
+ *
+ * Both the key and the value are 64-bit unsigned integers. The inner nodes and
+ * the leaf nodes have slightly different structures: inner tree nodes,
+ * shift > 0, store the pointer to their child node as the value. The leaf nodes,
+ * shift == 0, have the 64-bit unsigned integer that is specified by the user as
+ * the value. The paper refers to this technique as "Multi-value leaves". We
+ * choose it to avoid an additional pointer traversal. It is the reason this code
+ * currently does not support variable-length keys.
+ *
+ * XXX: Most functions in this file have two variants, for inner nodes and leaf
+ * nodes, and therefore there is duplicated code. While this sometimes makes
+ * code maintenance tricky, it reduces branch prediction misses when judging
+ * whether the node is an inner node or a leaf node.
+ *
+ * XXX: radix tree nodes are never shrunk.
+ *
+ * Interface
+ * ---------
+ *
+ * rt_create - Create a new, empty radix tree
+ * rt_free - Free the radix tree
+ * rt_search - Search a key-value pair
+ * rt_set - Set a key-value pair
+ * rt_delete - Delete a key-value pair
+ * rt_begin_iterate - Begin iterating through all key-value pairs
+ * rt_iterate_next - Return next key-value pair, if any
+ * rt_end_iter - End iteration
+ * rt_memory_usage - Get the memory usage
+ * rt_num_entries - Get the number of key-value pairs
+ *
+ * rt_create() creates an empty radix tree in the given memory context
+ * and creates memory contexts for all kinds of radix tree nodes under it.
+ *
+ * rt_iterate_next() returns key-value pairs in the ascending
+ * order of the key.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/lib/radixtree.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "miscadmin.h"
+#include "port/pg_bitutils.h"
+#include "port/pg_lfind.h"
+#include "utils/memutils.h"
+#include "lib/radixtree.h"
+#include "lib/stringinfo.h"
+
+/* The number of bits encoded in one tree level */
+#define RT_NODE_SPAN BITS_PER_BYTE
+
+/* The number of maximum slots in the node */
+#define RT_NODE_MAX_SLOTS (1 << RT_NODE_SPAN)
+
+/*
+ * Return the number of bytes required for a bitmap covering nslots slots,
+ * used by nodes whose slots are indexed by array lookup.
+ */
+#define RT_NODE_NSLOTS_BITS(nslots) ((nslots) / (sizeof(uint8) * BITS_PER_BYTE))
+
+/* Mask for extracting a chunk from the key */
+#define RT_CHUNK_MASK ((1 << RT_NODE_SPAN) - 1)
+
+/* Maximum shift the radix tree uses */
+#define RT_MAX_SHIFT key_get_shift(UINT64_MAX)
+
+/* Tree level the radix tree uses */
+#define RT_MAX_LEVEL ((sizeof(uint64) * BITS_PER_BYTE) / RT_NODE_SPAN)
+
+/* Invalid index used in node-128 */
+#define RT_NODE_128_INVALID_IDX 0xFF
+
+/* Get a chunk from the key */
+#define RT_GET_KEY_CHUNK(key, shift) ((uint8) (((key) >> (shift)) & RT_CHUNK_MASK))
+
+/*
+ * Mapping from the value to the bit in is-set bitmap in the node-256.
+ */
+#define RT_NODE_BITMAP_BYTE(v) ((v) / BITS_PER_BYTE)
+#define RT_NODE_BITMAP_BIT(v) (UINT64CONST(1) << ((v) % RT_NODE_SPAN))
+
+/* Enum used by rt_node_search() */
+typedef enum
+{
+ RT_ACTION_FIND = 0, /* find the key-value */
+ RT_ACTION_DELETE, /* delete the key-value */
+} rt_action;
+
+/* Base type for all nodes types */
+typedef struct rt_node
+{
+ /* The number of children and the node kind */
+ uint16 info;
+
+ /*
+ * Shift indicates which part of the key space is represented by this
+ * node. That is, the key is shifted by 'shift' and the lowest
+ * RT_NODE_SPAN bits are then represented in chunk.
+ */
+ uint8 shift;
+ uint8 chunk;
+} rt_node;
+
+/*
+ * Flags and masks for 'info'.
+ *
+ * The lowest 9 bits of 'info' represent the number of children in the node, and
+ * the next 2 bits are node kind.
+ */
+#define RT_NODE_INFO_COUNT_BITS 9
+#define RT_NODE_INFO_KIND_BITS 2
+#define RT_NODE_INFO_COUNT_MASK ((1 << RT_NODE_INFO_COUNT_BITS) - 1)
+#define RT_NODE_INFO_KIND_MASK ((1 << RT_NODE_INFO_KIND_BITS) - 1)
+
+/*
+ * Supported radix tree node kinds.
+ *
+ * XXX: These are currently not well chosen. To reduce memory fragmentation,
+ * a smaller class should optimally fit neatly into the next larger class
+ * (except perhaps at the lowest end). Right now it's
+ * 40/40 -> 296/286 -> 1288/1304 -> 2056/2088 bytes for inner nodes and
+ * leaf nodes, respectively, leading to a large amount of allocator padding
+ * with aset.c. Hence the use of slab.
+ *
+ * XXX: need to have node-1 until there is no path compression optimization?
+ *
+ * XXX: need to explain why we choose these node types based on benchmark
+ * results etc.
+ */
+#define RT_NODE_KIND_4 0x00
+#define RT_NODE_KIND_32 0x01
+#define RT_NODE_KIND_128 0x02
+#define RT_NODE_KIND_256 0x03
+#define RT_NODE_KIND_COUNT 4
+
+/* Macros to access the count and the kind in 'info' */
+#define NODE_GET_COUNT(n) (((rt_node *) (n))->info & RT_NODE_INFO_COUNT_MASK)
+#define NODE_GET_KIND(n) \
+ (((((rt_node* ) (n))->info) >> RT_NODE_INFO_COUNT_BITS) & RT_NODE_INFO_KIND_MASK)
+#define NODE_INCREMENT_COUNT(n) \
+ do { \
+ ((rt_node *) (n))->info++; \
+ Assert(NODE_GET_COUNT(n) <= rt_node_kind_info[NODE_GET_KIND(n)].fanout); \
+ } while (0)
+#define NODE_DECREMENT_COUNT(n) \
+ do { \
+ ((rt_node *) (n))->info--; \
+ Assert(NODE_GET_COUNT(n) >= 0); \
+ } while (0)
+#define NODE_SET_COUNT(n, count) \
+ do { \
+ ((rt_node *) (n))->info &= ~RT_NODE_INFO_COUNT_MASK; \
+ ((rt_node *) (n))->info |= (count); \
+ } while (0)
+#define NODE_SET_KIND(n, kind) \
+ do { \
+ ((rt_node *) (n))->info &= ~(RT_NODE_INFO_KIND_MASK << RT_NODE_INFO_COUNT_BITS); \
+ ((rt_node *) (n))->info |= ((kind) << RT_NODE_INFO_COUNT_BITS); \
+ } while (0)
+#define NODE_IS_LEAF(n) (((rt_node *) (n))->shift == 0)
+#define NODE_IS_EMPTY(n) (NODE_GET_COUNT(((rt_node *) (n))) == 0)
+#define NODE_HAS_FREE_SLOT(n) \
+ (NODE_GET_COUNT(n) < rt_node_kind_info[NODE_GET_KIND(n)].fanout)
+
+/* Base type of each node kinds for leaf and inner nodes */
+typedef struct rt_node_base_4
+{
+ rt_node n;
+
+ /* 4 children, for key chunks */
+ uint8 chunks[4];
+} rt_node_base_4;
+
+typedef struct rt_node_base32
+{
+ rt_node n;
+
+ /* 32 children, for key chunks */
+ uint8 chunks[32];
+} rt_node_base_32;
+
+/*
+ * node-128 uses a slot_idxs array, an array of RT_NODE_MAX_SLOTS length, typically
+ * 256, to store indexes into a second array that contains up to 128 values (or
+ * child pointers in inner nodes).
+ */
+typedef struct rt_node_base128
+{
+ rt_node n;
+
+ /* Index into the children/values array for each chunk */
+ uint8 slot_idxs[RT_NODE_MAX_SLOTS];
+} rt_node_base_128;
+
+typedef struct rt_node_base256
+{
+ rt_node n;
+} rt_node_base_256;
+
+/*
+ * Inner and leaf nodes.
+ *
+ * Leaf nodes are separate from inner node size classes for two main reasons:
+ *
+ * 1) the value type might be different than something fitting into a pointer
+ * width type
+ * 2) Need to represent non-existing values in a key-type independent way.
+ *
+ * 1) is clearly worth being concerned about, but it's not clear 2) is as
+ * good. It might be better to just indicate non-existing entries the same way
+ * in inner nodes.
+ */
+typedef struct rt_node_inner_4
+{
+ rt_node_base_4 base;
+
+ /* 4 children, for key chunks */
+ rt_node *children[4];
+} rt_node_inner_4;
+
+typedef struct rt_node_leaf_4
+{
+ rt_node_base_4 base;
+
+ /* 4 values, for key chunks */
+ uint64 values[4];
+} rt_node_leaf_4;
+
+typedef struct rt_node_inner_32
+{
+ rt_node_base_32 base;
+
+ /* 32 children, for key chunks */
+ rt_node *children[32];
+} rt_node_inner_32;
+
+typedef struct rt_node_leaf_32
+{
+ rt_node_base_32 base;
+
+ /* 32 values, for key chunks */
+ uint64 values[32];
+} rt_node_leaf_32;
+
+typedef struct rt_node_inner_128
+{
+ rt_node_base_128 base;
+
+ /* Slots for 128 children */
+ rt_node *children[128];
+} rt_node_inner_128;
+
+typedef struct rt_node_leaf_128
+{
+ rt_node_base_128 base;
+
+ /* isset is a bitmap to track which slot is in use */
+ uint8 isset[RT_NODE_NSLOTS_BITS(128)];
+
+ /* Slots for 128 values */
+ uint64 values[128];
+} rt_node_leaf_128;
+
+/*
+ * node-256 is the largest node type. This node has RT_NODE_MAX_SLOTS length array
+ * for directly storing values (or child pointers in inner nodes).
+ */
+typedef struct rt_node_inner_256
+{
+ rt_node_base_256 base;
+
+ /* Slots for 256 children */
+ rt_node *children[RT_NODE_MAX_SLOTS];
+} rt_node_inner_256;
+
+typedef struct rt_node_leaf_256
+{
+ rt_node_base_256 base;
+
+ /* isset is a bitmap to track which slot is in use */
+ uint8 isset[RT_NODE_NSLOTS_BITS(RT_NODE_MAX_SLOTS)];
+
+ /* Slots for 256 values */
+ uint64 values[RT_NODE_MAX_SLOTS];
+} rt_node_leaf_256;
+
+/* Information for each node kind */
+typedef struct rt_node_kind_info_elem
+{
+ const char *name;
+ int fanout;
+
+ /* slab chunk size */
+ Size inner_size;
+ Size leaf_size;
+
+ /* slab block size */
+ Size inner_blocksize;
+ Size leaf_blocksize;
+} rt_node_kind_info_elem;
+
+/*
+ * Calculate the slab blocksize so that we can allocate at least 32 chunks
+ * from the block.
+ */
+#define NODE_SLAB_BLOCK_SIZE(size) \
+ Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * size, (size) * 32)
+static rt_node_kind_info_elem rt_node_kind_info[RT_NODE_KIND_COUNT] = {
+
+ [RT_NODE_KIND_4] = {
+ .name = "radix tree node 4",
+ .fanout = 4,
+ .inner_size = sizeof(rt_node_inner_4),
+ .leaf_size = sizeof(rt_node_leaf_4),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_4)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_4)),
+ },
+ [RT_NODE_KIND_32] = {
+ .name = "radix tree node 32",
+ .fanout = 32,
+ .inner_size = sizeof(rt_node_inner_32),
+ .leaf_size = sizeof(rt_node_leaf_32),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_32)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_32)),
+ },
+ [RT_NODE_KIND_128] = {
+ .name = "radix tree node 128",
+ .fanout = 128,
+ .inner_size = sizeof(rt_node_inner_128),
+ .leaf_size = sizeof(rt_node_leaf_128),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_128)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_128)),
+ },
+ [RT_NODE_KIND_256] = {
+ .name = "radix tree node 256",
+ .fanout = 256,
+ .inner_size = sizeof(rt_node_inner_256),
+ .leaf_size = sizeof(rt_node_leaf_256),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_256)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_256)),
+ },
+};
+
+/*
+ * Iteration support.
+ *
+ * Iterating over the radix tree returns each pair of key and value in the
+ * ascending order of the key. To support this, we iterate over nodes at each level.
+ *
+ * rt_node_iter struct is used to track the iteration within a node.
+ *
+ * rt_iter is the struct for iteration of the radix tree, and uses rt_node_iter
+ * in order to track the iteration of each level. During the iteration, we also
+ * construct the key whenever updating the node iteration information, e.g., when
+ * advancing the current index within the node or when moving to the next node
+ * at the same level.
+ */
+typedef struct rt_node_iter
+{
+ rt_node *node; /* current node being iterated */
+ int current_idx; /* current position. -1 for initial value */
+} rt_node_iter;
+
+struct rt_iter
+{
+ radix_tree *tree;
+
+ /* Track the iteration on nodes of each level */
+ rt_node_iter stack[RT_MAX_LEVEL];
+ int stack_len;
+
+ /* The key is being constructed during the iteration */
+ uint64 key;
+};
+
+/* A radix tree with nodes */
+struct radix_tree
+{
+ MemoryContext context;
+
+ rt_node *root;
+ uint64 max_val;
+ uint64 num_keys;
+
+ MemoryContextData *inner_slabs[RT_NODE_KIND_COUNT];
+ MemoryContextData *leaf_slabs[RT_NODE_KIND_COUNT];
+
+ /* statistics */
+#ifdef RT_DEBUG
+ int32 cnt[RT_NODE_KIND_COUNT];
+#endif
+};
+
+static void rt_new_root(radix_tree *tree, uint64 key);
+static rt_node *rt_alloc_node(radix_tree *tree, int kind, uint8 shift, uint8 chunk,
+ bool inner);
+static void rt_free_node(radix_tree *tree, rt_node *node);
+static void rt_extend(radix_tree *tree, uint64 key);
+static inline bool rt_node_search_inner(rt_node *node, uint64 key, rt_action action,
+ rt_node **child_p);
+static inline bool rt_node_search_leaf(rt_node *node, uint64 key, rt_action action,
+ uint64 *value_p);
+static rt_node *rt_node_add_new_child(radix_tree *tree, rt_node *parent,
+ rt_node *node, uint64 key);
+static bool rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node,
+ uint64 key, rt_node *child);
+static bool rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
+ uint64 key, uint64 value);
+static inline rt_node *rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter);
+static inline bool rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
+ uint64 *value_p);
+static void rt_update_iter_stack(rt_iter *iter, int from);
+static void rt_update_node_iter(rt_iter *iter, rt_node_iter *node_iter,
+ rt_node *node);
+static inline void rt_iter_update_key(rt_iter *iter, uint8 chunk, uint8 shift);
+
+/* verification (available only with assertion) */
+static void rt_verify_node(rt_node *node);
+
+/*
+ * Return the index of the first chunk in the node that equals 'chunk'.
+ * Return -1 if there is no such element.
+ */
+static inline int
+node_4_search_eq(rt_node_base_4 *node, uint8 chunk)
+{
+ int idx = -1;
+
+ for (int i = 0; i < NODE_GET_COUNT(node); i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ idx = i;
+ break;
+ }
+ }
+
+ return idx;
+}
+
+/*
+ * Return the index of the first chunk in the given node that is greater
+ * than or equal to 'chunk'. Return -1 if there is no such element.
+ */
+static inline int
+node_4_search_ge(rt_node_base_4 * node, uint8 chunk)
+{
+ int idx = -1;
+
+ for (int i = 0; i < NODE_GET_COUNT(node); i++)
+ {
+ if (node->chunks[i] >= chunk)
+ {
+ idx = i;
+ break;
+ }
+ }
+
+ return idx;
+}
+
+/*
+ * Return the index of the first chunk in the node that equals 'chunk'.
+ * Return -1 if there is no such element.
+ */
+static inline int
+node_32_search_eq(rt_node_base_32 *node, uint8 chunk)
+{
+ int count = NODE_GET_COUNT(node);
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ uint32 bitfield;
+ int index_simd = -1;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index = -1;
+ for (int i = 0; i < count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ index = i;
+ break;
+ }
+ }
+#endif
+
+#ifndef USE_NO_SIMD
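+ /*
+ * Compare 'chunk' against all 32 stored chunks with two 16-byte vector
+ * comparisons, gather the per-byte results into a 32-bit bitmap, mask off
+ * slots beyond 'count', and return the position of the lowest set bit.
+ */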
+ spread_chunk = vector8_broadcast(chunk);
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ cmp1 = vector8_eq(spread_chunk, haystack1);
+ cmp2 = vector8_eq(spread_chunk, haystack2);
+ /* XXX: should not have to use vector8_highbit_mask */
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+/*
+ * Return the index of the first chunk in the given node that is greater
+ * than or equal to 'chunk'. Return -1 if there is no such element.
+ */
+static inline int
+node_32_search_ge(rt_node_base_32 *node, uint8 chunk)
+{
+ int count = NODE_GET_COUNT(node);
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ Vector8 min1;
+ Vector8 min2;
+ uint32 bitfield;
+ int index_simd = -1;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index = -1;
+ for (int i = 0; i < count; i++)
+ {
+ if (node->chunks[i] >= chunk)
+ {
+ index = i;
+ break;
+ }
+ }
+#endif
+
+#ifndef USE_NO_SIMD
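+ /*
+ * The vector API has no unsigned "greater than or equal" comparison, so
+ * compute the per-byte minimum of 'chunk' and the stored chunks: a stored
+ * chunk is >= 'chunk' exactly when that minimum equals 'chunk'.
+ */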
+ spread_chunk = vector8_broadcast(chunk);
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ min1 = vector8_min(spread_chunk, haystack1);
+ min2 = vector8_min(spread_chunk, haystack2);
+ cmp1 = vector8_eq(spread_chunk, min1);
+ cmp2 = vector8_eq(spread_chunk, min2);
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+/*
+ * Functions to manipulate both chunks array and children/values array.
+ * These are used for node-4 and node-32.
+ */
+
+/* Shift the elements right at 'idx' */
+static inline void
+chunk_children_array_shift(uint8 *chunks, rt_node **children, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+ memmove(&(children[idx + 1]), &(children[idx]), sizeof(rt_node *) * (count - idx));
+}
+
+static inline void
+chunk_values_array_shift(uint8 *chunks, uint64 *values, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+ memmove(&(values[idx + 1]), &(values[idx]), sizeof(uint64) * (count - idx));
+}
+
+/* Delete the element at 'idx' */
+static inline void
+chunk_children_array_delete(uint8 *chunks, rt_node **children, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(children[idx]), &(children[idx + 1]), sizeof(rt_node *) * (count - idx - 1));
+}
+
+static inline void
+chunk_values_array_delete(uint8 *chunks, uint64 *values, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(values[idx]), &(values[idx + 1]), sizeof(uint64) * (count - idx - 1));
+}
+
+/* Copy both chunks and children/values arrays */
+static inline void
+chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
+ uint8 *dst_chunks, rt_node **dst_children, int count)
+{
+ memcpy(dst_chunks, src_chunks, sizeof(uint8) * count);
+ memcpy(dst_children, src_children, sizeof(rt_node *) * count);
+}
+
+static inline void
+chunk_values_array_copy(uint8 *src_chunks, uint64 *src_values,
+ uint8 *dst_chunks, uint64 *dst_values, int count)
+{
+ memcpy(dst_chunks, src_chunks, sizeof(uint8) * count);
+ memcpy(dst_values, src_values, sizeof(uint64) * count);
+}
+
+/* Functions to manipulate inner and leaf node-128 */
+
+/* Does the given chunk in the node have a value? */
+static inline bool
+node_128_is_chunk_used(rt_node_base_128 *node, uint8 chunk)
+{
+ return node->slot_idxs[chunk] != RT_NODE_128_INVALID_IDX;
+}
+
+/* Is the slot in the node used? */
+static inline bool
+node_inner_128_is_slot_used(rt_node_inner_128 *node, uint8 slot)
+{
+ Assert(!NODE_IS_LEAF(node));
+ return (node->children[slot] != NULL);
+}
+
+static inline bool
+node_leaf_128_is_slot_used(rt_node_leaf_128 *node, uint8 slot)
+{
+ Assert(NODE_IS_LEAF(node));
+ return ((node->isset[RT_NODE_BITMAP_BYTE(slot)] & RT_NODE_BITMAP_BIT(slot)) != 0);
+}
+
+static inline rt_node *
+node_inner_128_get_child(rt_node_inner_128 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ return node->children[node->base.slot_idxs[chunk]];
+}
+
+static inline uint64
+node_leaf_128_get_value(rt_node_leaf_128 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ Assert(((rt_node_base_128 *) node)->slot_idxs[chunk] != RT_NODE_128_INVALID_IDX);
+ return node->values[node->base.slot_idxs[chunk]];
+}
+
+/* Delete the chunk in the node */
+static void
+node_inner_128_delete(rt_node_inner_128 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->base.slot_idxs[chunk] = RT_NODE_128_INVALID_IDX;
+}
+
+/* Delete the chunk in the node */
+static void
+node_leaf_128_delete(rt_node_leaf_128 *node, uint8 chunk)
+{
+ int slotpos = node->base.slot_idxs[chunk];
+
+ Assert(NODE_IS_LEAF(node));
+ node->isset[RT_NODE_BITMAP_BYTE(slotpos)] &= ~(RT_NODE_BITMAP_BIT(slotpos));
+ node->base.slot_idxs[chunk] = RT_NODE_128_INVALID_IDX;
+}
+
+static int
+node_inner_128_find_unused_slot(rt_node_inner_128 *node, uint8 chunk)
+{
+ int slotpos = 0;
+
+ Assert(!NODE_IS_LEAF(node));
+ while (node_inner_128_is_slot_used(node, slotpos))
+ slotpos++;
+
+ return slotpos;
+}
+
+/* Return an unused slot in node-128 */
+static int
+node_leaf_128_find_unused_slot(rt_node_leaf_128 *node, uint8 chunk)
+{
+ int slotpos;
+
+ Assert(NODE_IS_LEAF(node));
+
+ /*
+ * Find an unused slot. We iterate over the isset bitmap per byte then
+ * check each bit.
+ */
+ for (slotpos = 0; slotpos < RT_NODE_NSLOTS_BITS(128); slotpos++)
+ {
+ if (node->isset[slotpos] < 0xFF)
+ break;
+ }
+ Assert(slotpos < RT_NODE_NSLOTS_BITS(128));
+
+ slotpos *= BITS_PER_BYTE;
+ while (node_leaf_128_is_slot_used(node, slotpos))
+ slotpos++;
+
+ return slotpos;
+}
+
+static inline void
+node_inner_128_insert(rt_node_inner_128 *node, uint8 chunk, rt_node *child)
+{
+ int slotpos;
+
+ Assert(!NODE_IS_LEAF(node));
+
+ /* find unused slot */
+ slotpos = node_inner_128_find_unused_slot(node, chunk);
+
+ node->base.slot_idxs[chunk] = slotpos;
+ node->children[slotpos] = child;
+}
+
+/* Set the slot at the corresponding chunk */
+static inline void
+node_leaf_128_insert(rt_node_leaf_128 *node, uint8 chunk, uint64 value)
+{
+ int slotpos;
+
+ Assert(NODE_IS_LEAF(node));
+
+ /* find unused slot */
+ slotpos = node_leaf_128_find_unused_slot(node, chunk);
+
+ node->base.slot_idxs[chunk] = slotpos;
+ node->isset[RT_NODE_BITMAP_BYTE(slotpos)] |= RT_NODE_BITMAP_BIT(slotpos);
+ node->values[slotpos] = value;
+}
+
+/* Update the child corresponding to 'chunk' to 'child' */
+static inline void
+node_inner_128_update(rt_node_inner_128 *node, uint8 chunk, rt_node *child)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->children[node->base.slot_idxs[chunk]] = child;
+}
+
+/* Update the value corresponding to 'chunk' to 'value' */
+static inline void
+node_leaf_128_update(rt_node_leaf_128 *node, uint8 chunk, uint64 value)
+{
+ Assert(NODE_IS_LEAF(node));
+ node->values[node->base.slot_idxs[chunk]] = value;
+}
+
+/* Functions to manipulate inner and leaf node-256 */
+
+/* Return true if the slot corresponding to the given chunk is in use */
+static inline bool
+node_inner_256_is_chunk_used(rt_node_inner_256 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ return (node->children[chunk] != NULL);
+}
+
+static inline bool
+node_leaf_256_is_chunk_used(rt_node_leaf_256 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ return (node->isset[RT_NODE_BITMAP_BYTE(chunk)] & RT_NODE_BITMAP_BIT(chunk)) != 0;
+}
+
+static inline rt_node *
+node_inner_256_get_child(rt_node_inner_256 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ Assert(node_inner_256_is_chunk_used(node, chunk));
+ return node->children[chunk];
+}
+
+static inline uint64
+node_leaf_256_get_value(rt_node_leaf_256 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ Assert(node_leaf_256_is_chunk_used(node, chunk));
+ return node->values[chunk];
+}
+
+/* Set the child in the node-256 */
+static inline void
+node_inner_256_set(rt_node_inner_256 *node, uint8 chunk, rt_node *child)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->children[chunk] = child;
+}
+
+/* Set the value in the node-256 */
+static inline void
+node_leaf_256_set(rt_node_leaf_256 *node, uint8 chunk, uint64 value)
+{
+ Assert(NODE_IS_LEAF(node));
+ node->isset[RT_NODE_BITMAP_BYTE(chunk)] |= RT_NODE_BITMAP_BIT(chunk);
+ node->values[chunk] = value;
+}
+
+/* Clear the slot at the given chunk position */
+static inline void
+node_inner_256_delete(rt_node_inner_256 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->children[chunk] = NULL;
+}
+
+static inline void
+node_leaf_256_delete(rt_node_leaf_256 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ node->isset[RT_NODE_BITMAP_BYTE(chunk)] &= ~(RT_NODE_BITMAP_BIT(chunk));
+}
+
+/*
+ * Return the shift needed to store the given key.
+ */
+static inline int
+key_get_shift(uint64 key)
+{
+ return (key == 0)
+ ? 0
+ : (pg_leftmost_one_pos64(key) / RT_NODE_SPAN) * RT_NODE_SPAN;
+}
+
+/*
+ * Return the max value stored in a node with the given shift.
+ */
+static uint64
+shift_get_max_val(int shift)
+{
+ if (shift == RT_MAX_SHIFT)
+ return UINT64_MAX;
+
+ return (UINT64CONST(1) << (shift + RT_NODE_SPAN)) - 1;
+}
+
+/*
+ * Create a new node as the root. Subordinate nodes will be created during
+ * the insertion.
+ */
+static void
+rt_new_root(radix_tree *tree, uint64 key)
+{
+ int shift = key_get_shift(key);
+ rt_node *node;
+
+ node = (rt_node *) rt_alloc_node(tree, RT_NODE_KIND_4, shift, 0,
+ shift > 0);
+ tree->max_val = shift_get_max_val(shift);
+ tree->root = node;
+}
+
+/*
+ * Allocate a new node with the given node kind.
+ */
+static rt_node *
+rt_alloc_node(radix_tree *tree, int kind, uint8 shift, uint8 chunk, bool inner)
+{
+ rt_node *newnode;
+
+ if (inner)
+ newnode = (rt_node *) MemoryContextAllocZero(tree->inner_slabs[kind],
+ rt_node_kind_info[kind].inner_size);
+ else
+ newnode = (rt_node *) MemoryContextAllocZero(tree->leaf_slabs[kind],
+ rt_node_kind_info[kind].leaf_size);
+
+ NODE_SET_KIND(newnode, kind);
+ newnode->shift = shift;
+ newnode->chunk = chunk;
+
+ /* Initialize slot_idxs to invalid values */
+ if (kind == RT_NODE_KIND_128)
+ {
+ rt_node_base_128 *n128 = (rt_node_base_128 *) newnode;
+
+ memset(n128->slot_idxs, RT_NODE_128_INVALID_IDX, sizeof(n128->slot_idxs));
+ }
+
+#ifdef RT_DEBUG
+ /* update the statistics */
+ tree->cnt[kind]++;
+#endif
+
+ return newnode;
+}
+
+static rt_node *
+rt_copy_node(radix_tree *tree, rt_node *node, int new_kind)
+{
+ rt_node *newnode;
+
+ newnode = rt_alloc_node(tree, new_kind, node->shift, node->chunk,
+ node->shift > 0);
+ NODE_SET_COUNT(newnode, NODE_GET_COUNT(node));
+
+ return newnode;
+}
+
+/* Free the given node */
+static void
+rt_free_node(radix_tree *tree, rt_node *node)
+{
+ /* If we're deleting the root node, make the tree empty */
+ if (tree->root == node)
+ tree->root = NULL;
+
+#ifdef RT_DEBUG
+ /* update the statistics */
+ tree->cnt[NODE_GET_KIND(node)]--;
+ Assert(tree->cnt[NODE_GET_KIND(node)] >= 0);
+#endif
+
+ pfree(node);
+}
+
+/*
+ * Replace old_child with new_child, and free the old one.
+ */
+static void
+rt_replace_node(radix_tree *tree, rt_node *parent, rt_node *old_child,
+ rt_node *new_child, uint64 key)
+{
+ Assert(old_child->chunk == new_child->chunk);
+ Assert(old_child->shift == new_child->shift);
+
+ if (parent == old_child)
+ {
+ /* Replace the root node with the new large node */
+ tree->root = new_child;
+ }
+ else
+ {
+ bool replaced PG_USED_FOR_ASSERTS_ONLY;
+
+ replaced = rt_node_insert_inner(tree, NULL, parent, key, new_child);
+ Assert(replaced);
+ }
+
+ rt_free_node(tree, old_child);
+}
+
+/*
+ * The radix tree doesn't have sufficient height. Extend the radix tree so it can
+ * store the key.
+ */
+static void
+rt_extend(radix_tree *tree, uint64 key)
+{
+ int target_shift;
+ int shift = tree->root->shift + RT_NODE_SPAN;
+
+ target_shift = key_get_shift(key);
+
+ /* Grow tree from 'shift' to 'target_shift' */
+ while (shift <= target_shift)
+ {
+ rt_node_inner_4 *node;
+
+ node = (rt_node_inner_4 *) rt_alloc_node(tree, RT_NODE_KIND_4,
+ shift, 0, true);
+ NODE_SET_COUNT(node, 1);
+ node->base.chunks[0] = 0;
+ node->children[0] = tree->root;
+
+ tree->root->chunk = 0;
+ tree->root = (rt_node *) node;
+
+ shift += RT_NODE_SPAN;
+ }
+
+ tree->max_val = shift_get_max_val(target_shift);
+}
+
+/*
+ * Search for the child pointer corresponding to 'key' in the given node, and
+ * do the specified 'action'.
+ *
+ * Return true if the key is found, otherwise return false. On success, the child
+ * pointer is returned in *child_p.
+ */
+static inline bool
+rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **child_p)
+{
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool found = false;
+ rt_node *child = NULL;
+
+ switch (NODE_GET_KIND(node))
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+ int idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
+
+ if (idx < 0)
+ break;
+
+ found = true;
+
+ if (action == RT_ACTION_FIND)
+ child = n4->children[idx];
+ else /* RT_ACTION_DELETE */
+ chunk_children_array_delete(n4->base.chunks, n4->children,
+ NODE_GET_COUNT(n4), idx);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+ int idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
+
+ if (idx < 0)
+ break;
+
+ found = true;
+ if (action == RT_ACTION_FIND)
+ child = n32->children[idx];
+ else /* RT_ACTION_DELETE */
+ chunk_children_array_delete(n32->base.chunks, n32->children,
+ NODE_GET_COUNT(n32), idx);
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ rt_node_inner_128 *n128 = (rt_node_inner_128 *) node;
+
+ if (!node_128_is_chunk_used((rt_node_base_128 *) n128, chunk))
+ break;
+
+ found = true;
+
+ if (action == RT_ACTION_FIND)
+ child = node_inner_128_get_child(n128, chunk);
+ else /* RT_ACTION_DELETE */
+ node_inner_128_delete(n128, chunk);
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+
+ if (!node_inner_256_is_chunk_used(n256, chunk))
+ break;
+
+ found = true;
+ if (action == RT_ACTION_FIND)
+ child = node_inner_256_get_child(n256, chunk);
+ else /* RT_ACTION_DELETE */
+ node_inner_256_delete(n256, chunk);
+
+ break;
+ }
+ }
+
+ /* update statistics */
+ if (action == RT_ACTION_DELETE && found)
+ NODE_DECREMENT_COUNT(node);
+
+ if (found && child_p)
+ *child_p = child;
+
+ return found;
+}
+
+/*
+ * Search for the value corresponding to 'key' in the given node, and do the
+ * specified 'action'.
+ *
+ * Return true if the key is found, otherwise return false. On success, the value
+ * is returned in *value_p.
+ */
+static inline bool
+rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p)
+{
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool found = false;
+ uint64 value = 0;
+
+ switch (NODE_GET_KIND(node))
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+ int idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
+
+ if (idx < 0)
+ break;
+
+ found = true;
+
+ if (action == RT_ACTION_FIND)
+ value = n4->values[idx];
+ else /* RT_ACTION_DELETE */
+ chunk_values_array_delete(n4->base.chunks, (uint64 *) n4->values,
+ NODE_GET_COUNT(n4), idx);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+ int idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
+
+ if (idx < 0)
+ break;
+
+ found = true;
+ if (action == RT_ACTION_FIND)
+ value = n32->values[idx];
+ else /* RT_ACTION_DELETE */
+ chunk_values_array_delete(n32->base.chunks, (uint64 *) n32->values,
+ NODE_GET_COUNT(n32), idx);
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ rt_node_leaf_128 *n128 = (rt_node_leaf_128 *) node;
+
+ if (!node_128_is_chunk_used((rt_node_base_128 *) n128, chunk))
+ break;
+
+ found = true;
+
+ if (action == RT_ACTION_FIND)
+ value = node_leaf_128_get_value(n128, chunk);
+ else /* RT_ACTION_DELETE */
+ node_leaf_128_delete(n128, chunk);
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+
+ if (!node_leaf_256_is_chunk_used(n256, chunk))
+ break;
+
+ found = true;
+ if (action == RT_ACTION_FIND)
+ value = node_leaf_256_get_value(n256, chunk);
+ else /* RT_ACTION_DELETE */
+ node_leaf_256_delete(n256, chunk);
+
+ break;
+ }
+ }
+
+ /* update statistics */
+ if (action == RT_ACTION_DELETE && found)
+ NODE_DECREMENT_COUNT(node);
+
+ if (found && value_p)
+ *value_p = value;
+
+ return found;
+}
+
+/* Insert a new child to 'node' */
+static rt_node *
+rt_node_add_new_child(radix_tree *tree, rt_node *parent, rt_node *node, uint64 key)
+{
+ uint8 newshift = node->shift - RT_NODE_SPAN;
+ rt_node *newchild;
+
+ Assert(!NODE_IS_LEAF(node));
+
+ newchild = rt_alloc_node(tree, RT_NODE_KIND_4, newshift,
+ RT_GET_KEY_CHUNK(key, node->shift),
+ newshift > 0);
+
+ rt_node_insert_inner(tree, parent, node, key, newchild);
+
+ return (rt_node *) newchild;
+}
+
+/* Insert the child to the inner node */
+static bool
+rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 key,
+ rt_node *child)
+{
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool chunk_exists = false;
+
+ Assert(!NODE_IS_LEAF(node));
+
+ switch (NODE_GET_KIND(node))
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+ rt_node_inner_32 *new32;
+ int idx;
+
+ idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n4->children[idx] = child;
+ break;
+ }
+
+ if (likely(NODE_HAS_FREE_SLOT(n4)))
+ {
+ int insertpos = node_4_search_ge((rt_node_base_4 *) n4, chunk);
+ uint16 count = NODE_GET_COUNT(n4);
+
+ if (insertpos < 0)
+ insertpos = count; /* insert to the tail */
+
+ /* shift chunks and children */
+ if (count != 0 && insertpos < count)
+ chunk_children_array_shift(n4->base.chunks, n4->children,
+ count, insertpos);
+
+ n4->base.chunks[insertpos] = chunk;
+ n4->children[insertpos] = child;
+ break;
+ }
+
+ /* grow node from 4 to 32 */
+ new32 = (rt_node_inner_32 *) rt_copy_node(tree, (rt_node *) n4,
+ RT_NODE_KIND_32);
+ chunk_children_array_copy(n4->base.chunks, n4->children,
+ new32->base.chunks, new32->children,
+ NODE_GET_COUNT(n4));
+
+ rt_replace_node(tree, parent, (rt_node *) n4, (rt_node *) new32,
+ key);
+ node = (rt_node *) new32;
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_32:
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+ rt_node_inner_128 *new128;
+
+ int idx;
+
+ idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n32->children[idx] = child;
+ break;
+ }
+
+ if (likely(NODE_HAS_FREE_SLOT(n32)))
+ {
+ int insertpos = node_32_search_ge((rt_node_base_32 *) n32, chunk);
+ int16 count = NODE_GET_COUNT(n32);
+
+ if (insertpos < 0)
+ insertpos = count; /* insert to the tail */
+
+ if (count != 0 && insertpos < count)
+ chunk_children_array_shift(n32->base.chunks, n32->children,
+ count, insertpos);
+
+ n32->base.chunks[insertpos] = chunk;
+ n32->children[insertpos] = child;
+ break;
+ }
+
+ /* grow node from 32 to 128 */
+ new128 = (rt_node_inner_128 *) rt_copy_node(tree, (rt_node *) n32,
+ RT_NODE_KIND_128);
+ for (int i = 0; i < NODE_GET_COUNT(n32); i++)
+ node_inner_128_insert(new128, n32->base.chunks[i], n32->children[i]);
+
+ rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new128,
+ key);
+ node = (rt_node *) new128;
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_128:
+ {
+ rt_node_inner_128 *n128 = (rt_node_inner_128 *) node;
+ rt_node_inner_256 *new256;
+ int cnt = 0;
+
+ if (node_128_is_chunk_used((rt_node_base_128 *) n128, chunk))
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ node_inner_128_update(n128, chunk, child);
+ break;
+ }
+
+ if (likely(NODE_HAS_FREE_SLOT(n128)))
+ {
+ node_inner_128_insert(n128, chunk, child);
+ break;
+ }
+
+ /* grow node from 128 to 256 */
+ new256 = (rt_node_inner_256 *) rt_copy_node(tree, (rt_node *) n128,
+ RT_NODE_KIND_256);
+ for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < NODE_GET_COUNT(n128); i++)
+ {
+ if (!node_128_is_chunk_used((rt_node_base_128 *) n128, i))
+ continue;
+
+ node_inner_256_set(new256, i, node_inner_128_get_child(n128, i));
+ cnt++;
+ }
+
+ rt_replace_node(tree, parent, (rt_node *) n128, (rt_node *) new256,
+ key);
+ node = (rt_node *) new256;
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_256:
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+
+ chunk_exists = node_inner_256_is_chunk_used(n256, chunk);
+ Assert(chunk_exists || NODE_HAS_FREE_SLOT(n256));
+
+ node_inner_256_set(n256, chunk, child);
+ break;
+ }
+ }
+
+ /* Update statistics */
+ if (!chunk_exists)
+ NODE_INCREMENT_COUNT(node);
+
+ /*
+ * Done. Finally, verify that the chunk and child were inserted or replaced
+ * properly in the node.
+ */
+ rt_verify_node(node);
+
+ return chunk_exists;
+}
+
+/* Insert the value to the leaf node */
+static bool
+rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
+ uint64 key, uint64 value)
+{
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool chunk_exists = false;
+
+ Assert(NODE_IS_LEAF(node));
+
+ switch (NODE_GET_KIND(node))
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+ rt_node_leaf_32 *new32;
+ int idx;
+
+ idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n4->values[idx] = value;
+ break;
+ }
+
+ if (likely(NODE_HAS_FREE_SLOT(n4)))
+ {
+ int insertpos = node_4_search_ge((rt_node_base_4 *) n4, chunk);
+ int count = NODE_GET_COUNT(n4);
+
+ if (insertpos < 0)
+ insertpos = count; /* insert to the tail */
+
+ /* shift chunks and values */
+ if (count != 0 && insertpos < count)
+ chunk_values_array_shift(n4->base.chunks, n4->values,
+ count, insertpos);
+
+ n4->base.chunks[insertpos] = chunk;
+ n4->values[insertpos] = value;
+ break;
+ }
+
+ /* grow node from 4 to 32 */
+ new32 = (rt_node_leaf_32 *) rt_copy_node(tree, (rt_node *) n4,
+ RT_NODE_KIND_32);
+ chunk_values_array_copy(n4->base.chunks, n4->values,
+ new32->base.chunks, new32->values,
+ NODE_GET_COUNT(n4));
+
+ rt_replace_node(tree, parent, (rt_node *) n4, (rt_node *) new32,
+ key);
+ node = (rt_node *) new32;
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_32:
+ {
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+ rt_node_leaf_128 *new128;
+ int idx;
+
+ idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n32->values[idx] = value;
+ break;
+ }
+
+ if (likely(NODE_HAS_FREE_SLOT(n32)))
+ {
+ int insertpos = node_32_search_ge((rt_node_base_32 *) n32, chunk);
+ int count = NODE_GET_COUNT(n32);
+
+ if (insertpos < 0)
+ insertpos = count; /* insert to the tail */
+
+ if (count != 0 && insertpos < count)
+ chunk_values_array_shift(n32->base.chunks, n32->values,
+ count, insertpos);
+
+ n32->base.chunks[insertpos] = chunk;
+ n32->values[insertpos] = value;
+ break;
+ }
+
+ /* grow node from 32 to 128 */
+ new128 = (rt_node_leaf_128 *) rt_copy_node(tree, (rt_node *) n32,
+ RT_NODE_KIND_128);
+ for (int i = 0; i < NODE_GET_COUNT(n32); i++)
+ node_leaf_128_insert(new128, n32->base.chunks[i], n32->values[i]);
+
+ rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new128,
+ key);
+ node = (rt_node *) new128;
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_128:
+ {
+ rt_node_leaf_128 *n128 = (rt_node_leaf_128 *) node;
+ rt_node_leaf_256 *new256;
+ int cnt = 0;
+
+ if (node_128_is_chunk_used((rt_node_base_128 *) n128, chunk))
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ node_leaf_128_update(n128, chunk, value);
+ break;
+ }
+
+ if (likely(NODE_HAS_FREE_SLOT(n128)))
+ {
+ node_leaf_128_insert(n128, chunk, value);
+ break;
+ }
+
+ /* grow node from 128 to 256 */
+ new256 = (rt_node_leaf_256 *) rt_copy_node(tree, (rt_node *) n128,
+ RT_NODE_KIND_256);
+ for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < NODE_GET_COUNT(n128); i++)
+ {
+ if (!node_128_is_chunk_used((rt_node_base_128 *) n128, i))
+ continue;
+
+ node_leaf_256_set(new256, i, node_leaf_128_get_value(n128, i));
+ cnt++;
+ }
+
+ rt_replace_node(tree, parent, (rt_node *) n128, (rt_node *) new256,
+ key);
+ node = (rt_node *) new256;
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_256:
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+
+ chunk_exists = node_leaf_256_is_chunk_used(n256, chunk);
+ Assert(chunk_exists || NODE_HAS_FREE_SLOT(n256));
+
+ node_leaf_256_set(n256, chunk, value);
+ break;
+ }
+ }
+
+ /* Update statistics */
+ if (!chunk_exists)
+ NODE_INCREMENT_COUNT(node);
+
+ /*
+ * Done. Finally, verify that the chunk and value were inserted or replaced
+ * properly in the node.
+ */
+ rt_verify_node(node);
+
+ return chunk_exists;
+}
+
+/*
+ * Create the radix tree in the given memory context and return it.
+ */
+radix_tree *
+rt_create(MemoryContext ctx)
+{
+ radix_tree *tree;
+ MemoryContext old_ctx;
+
+ old_ctx = MemoryContextSwitchTo(ctx);
+
+ tree = palloc(sizeof(radix_tree));
+ tree->context = ctx;
+ tree->root = NULL;
+ tree->max_val = 0;
+ tree->num_keys = 0;
+
+ /* Create the slab allocator for each size class */
+ for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ {
+ tree->inner_slabs[i] = SlabContextCreate(ctx,
+ rt_node_kind_info[i].name,
+ rt_node_kind_info[i].inner_blocksize,
+ rt_node_kind_info[i].inner_size);
+ tree->leaf_slabs[i] = SlabContextCreate(ctx,
+ rt_node_kind_info[i].name,
+ rt_node_kind_info[i].leaf_blocksize,
+ rt_node_kind_info[i].leaf_size);
+#ifdef RT_DEBUG
+ tree->cnt[i] = 0;
+#endif
+ }
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return tree;
+}
+
+/*
+ * Free the given radix tree.
+ */
+void
+rt_free(radix_tree *tree)
+{
+ for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ {
+ MemoryContextDelete(tree->inner_slabs[i]);
+ MemoryContextDelete(tree->leaf_slabs[i]);
+ }
+
+ pfree(tree);
+}
+
+/*
+ * Set key to value. If the entry already exists, we update its value to 'value'
+ * and return true. Returns false if entry doesn't yet exist.
+ */
+bool
+rt_set(radix_tree *tree, uint64 key, uint64 value)
+{
+ int shift;
+ bool updated;
+ rt_node *node;
+ rt_node *parent = tree->root;
+
+ /* Empty tree, create the root */
+ if (!tree->root)
+ rt_new_root(tree, key);
+
+ /* Extend the tree if necessary */
+ if (key > tree->max_val)
+ rt_extend(tree, key);
+
+ Assert(tree->root);
+
+ shift = tree->root->shift;
+ node = tree->root;
+
+ /* Descend the tree until a leaf node */
+ while (shift >= 0)
+ {
+ rt_node *child;
+
+ if (NODE_IS_LEAF(node))
+ break;
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ child = rt_node_add_new_child(tree, parent, node, key);
+
+ Assert(child);
+
+ parent = node;
+ node = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ /* arrived at a leaf */
+ Assert(NODE_IS_LEAF(node));
+
+ updated = rt_node_insert_leaf(tree, parent, node, key, value);
+
+ /* Update the statistics */
+ if (!updated)
+ tree->num_keys++;
+
+ return updated;
+}
+
+/*
+ * Search for the given key in the radix tree. Return true if the key is found,
+ * otherwise return false. On success, we set the value in *value_p, so it must
+ * not be NULL.
+ */
+bool
+rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
+{
+ rt_node *node;
+ int shift;
+
+ Assert(value_p != NULL);
+
+ if (!tree->root || key > tree->max_val)
+ return false;
+
+ node = tree->root;
+ shift = tree->root->shift;
+
+ /* Descend the tree until a leaf node */
+ while (shift >= 0)
+ {
+ rt_node *child;
+
+ if (NODE_IS_LEAF(node))
+ break;
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ return false;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ /* We reached a leaf node, so search the corresponding slot */
+ Assert(NODE_IS_LEAF(node));
+ if (!rt_node_search_leaf(node, key, RT_ACTION_FIND, value_p))
+ return false;
+
+ return true;
+}
+
+/*
+ * Delete the given key from the radix tree. Return true if the key is found (and
+ * deleted), otherwise do nothing and return false.
+ */
+bool
+rt_delete(radix_tree *tree, uint64 key)
+{
+ rt_node *node;
+ int shift;
+ rt_node *stack[RT_MAX_LEVEL] = {0};
+ int level;
+
+ if (!tree->root || key > tree->max_val)
+ return false;
+
+ /*
+ * Descend the tree to search the key while building a stack of nodes
+ * we visited.
+ */
+ node = tree->root;
+ shift = tree->root->shift;
+ level = 0;
+ while (shift >= 0)
+ {
+ rt_node *child;
+
+ /* Push the current node to the stack */
+ stack[level] = node;
+
+ if (NODE_IS_LEAF(node))
+ break;
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ return false;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ level++;
+ }
+
+ Assert(NODE_IS_LEAF(node));
+
+ /* there is no key to delete */
+ if (!rt_node_search_leaf(node, key, RT_ACTION_FIND, NULL))
+ return false;
+
+ /* Update the statistics */
+ tree->num_keys--;
+
+ /*
+ * Delete the key from the leaf node and recursively delete the key in
+ * inner nodes if necessary.
+ */
+ Assert(NODE_IS_LEAF(stack[level]));
+ while (level >= 0)
+ {
+ rt_node *node = stack[level--];
+
+ if (NODE_IS_LEAF(node))
+ rt_node_search_leaf(node, key, RT_ACTION_DELETE, NULL);
+ else
+ rt_node_search_inner(node, key, RT_ACTION_DELETE, NULL);
+
+ /* If the node didn't become empty, we stop deleting the key */
+ if (!NODE_IS_EMPTY(node))
+ break;
+
+ /* The node became empty */
+ rt_free_node(tree, node);
+ }
+
+ /*
+ * If we eventually deleted the root node while recursively deleting empty
+ * nodes, we make the tree empty.
+ */
+ if (level == 0)
+ {
+ tree->root = NULL;
+ tree->max_val = 0;
+ }
+
+ return true;
+}
+
+/* Create and return the iterator for the given radix tree */
+rt_iter *
+rt_begin_iterate(radix_tree *tree)
+{
+ MemoryContext old_ctx;
+ rt_iter *iter;
+ int top_level;
+
+ old_ctx = MemoryContextSwitchTo(tree->context);
+
+ iter = (rt_iter *) palloc0(sizeof(rt_iter));
+ iter->tree = tree;
+
+ /* empty tree */
+ if (!iter->tree->root)
+ return iter;
+
+ top_level = iter->tree->root->shift / RT_NODE_SPAN;
+
+ iter->stack_len = top_level;
+ iter->stack[top_level].node = iter->tree->root;
+ iter->stack[top_level].current_idx = -1;
+
+ /*
+ * Descend to the leftmost leaf node from the root. The key is
+ * constructed while descending to the leaf.
+ */
+ rt_update_iter_stack(iter, top_level);
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return iter;
+}
+
+/*
+ * Update the stack of radix tree nodes while descending to the leaf from
+ * the 'from' level.
+ */
+static void
+rt_update_iter_stack(rt_iter *iter, int from)
+{
+ rt_node *node = iter->stack[from].node;
+ int level = from;
+
+ for (;;)
+ {
+ rt_node_iter *node_iter = &(iter->stack[level--]);
+
+ /* Set the node to this level */
+ rt_update_node_iter(iter, node_iter, node);
+
+ /* Finish if we reached the leaf node */
+ if (NODE_IS_LEAF(node))
+ break;
+
+ /* Advance to the next slot in the node */
+ node = rt_node_inner_iterate_next(iter, node_iter);
+
+ /*
+ * Since we always get the first slot in the node, we must have found
+ * the slot.
+ */
+ Assert(node);
+ }
+}
+
+/*
+ * If there is a next key, return true and set *key_p and *value_p.
+ * Otherwise, return false.
+ */
+bool
+rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p)
+{
+ /* Empty tree */
+ if (!iter->tree->root)
+ return false;
+
+ for (;;)
+ {
+ rt_node_iter *node_iter;
+ rt_node *child = NULL;
+ uint64 value;
+ int level;
+ bool found;
+
+ found = rt_node_leaf_iterate_next(iter, &(iter->stack[0]), &value);
+
+ if (found)
+ {
+ *key_p = iter->key;
+ *value_p = value;
+ return true;
+ }
+
+ /*
+ * Advance the inner nodes level by level, starting from the level-1
+ * inner node, until we find a node that has a next slot.
+ */
+ for (level = 1; level <= iter->stack_len; level++)
+ {
+ child = rt_node_inner_iterate_next(iter, &(iter->stack[level]));
+
+ if (child)
+ break;
+ }
+
+ /* We could not find any new key-value pair, so the iteration is finished */
+ if (!child)
+ return false;
+
+ /*
+ * We have advanced the slot in more than one node, including both the
+ * leaf node and inner nodes. So update the stack by descending to the
+ * leftmost leaf node from this level.
+ */
+ node_iter = &(iter->stack[level - 1]);
+ rt_update_node_iter(iter, node_iter, child);
+ rt_update_iter_stack(iter, level - 1);
+ }
+
+ return false;
+}
+
+void
+rt_end_iterate(rt_iter *iter)
+{
+ pfree(iter);
+}
+
+static inline void
+rt_iter_update_key(rt_iter *iter, uint8 chunk, uint8 shift)
+{
+ iter->key &= ~(((uint64) RT_CHUNK_MASK) << shift);
+ iter->key |= (((uint64) chunk) << shift);
+}
+
+/*
+ * Advance the slot in the inner node. Return the child if one exists,
+ * otherwise NULL.
+ */
+static inline rt_node *
+rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
+{
+ rt_node *child = NULL;
+ bool found = false;
+ uint8 key_chunk;
+
+ switch (NODE_GET_KIND(node_iter->node))
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= NODE_GET_COUNT(n4))
+ break;
+
+ child = n4->children[node_iter->current_idx];
+ key_chunk = n4->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= NODE_GET_COUNT(n32))
+ break;
+
+ child = n32->children[node_iter->current_idx];
+ key_chunk = n32->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ rt_node_inner_128 *n128 = (rt_node_inner_128 *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < 256; i++)
+ {
+ if (node_128_is_chunk_used((rt_node_base_128 *) n128, i))
+ break;
+ }
+
+ if (i >= 256)
+ break;
+
+ node_iter->current_idx = i;
+ child = node_inner_128_get_child(n128, i);
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node_iter->node;
+ int i;
+ for (i = node_iter->current_idx + 1; i < 256; i++)
+ {
+ if (node_inner_256_is_chunk_used(n256, i))
+ break;
+ }
+
+ if (i >= 256)
+ break;
+
+ node_iter->current_idx = i;
+ child = node_inner_256_get_child(n256, i);
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ }
+
+ if (found)
+ rt_iter_update_key(iter, key_chunk, node_iter->node->shift);
+
+ return child;
+}
+
+/*
+ * Advance the slot in the leaf node. On success, return true and set the
+ * value to *value_p; otherwise return false.
+ */
+static inline bool
+rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
+ uint64 *value_p)
+{
+ rt_node *node = node_iter->node;
+ bool found = false;
+ uint64 value;
+ uint8 key_chunk;
+
+ switch (NODE_GET_KIND(node))
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= NODE_GET_COUNT(n4))
+ break;
+
+ value = n4->values[node_iter->current_idx];
+ key_chunk = n4->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= NODE_GET_COUNT(n32))
+ break;
+
+ value = n32->values[node_iter->current_idx];
+ key_chunk = n32->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ rt_node_leaf_128 *n128 = (rt_node_leaf_128 *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < 256; i++)
+ {
+ if (node_128_is_chunk_used((rt_node_base_128 *) n128, i))
+ break;
+ }
+
+ if (i >= 256)
+ break;
+
+ node_iter->current_idx = i;
+ value = node_leaf_128_get_value(n128, i);
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node_iter->node;
+ int i;
+ for (i = node_iter->current_idx + 1; i < 256; i++)
+ {
+ if (node_leaf_256_is_chunk_used(n256, i))
+ break;
+ }
+
+ if (i >= 256)
+ break;
+
+ node_iter->current_idx = i;
+ value = node_leaf_256_get_value(n256, i);
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ }
+
+ if (found)
+ {
+ rt_iter_update_key(iter, key_chunk, node_iter->node->shift);
+ *value_p = value;
+ }
+
+ return found;
+}
+
+/*
+ * Set the node to node_iter so we can begin iterating over the node. Also,
+ * update the part of the key with the chunk of the given node.
+ */
+static void
+rt_update_node_iter(rt_iter *iter, rt_node_iter *node_iter,
+ rt_node *node)
+{
+ node_iter->node = node;
+ node_iter->current_idx = -1;
+
+ rt_iter_update_key(iter, node->chunk, node->shift + RT_NODE_SPAN);
+}
+
+/*
+ * Return the number of keys in the radix tree.
+ */
+uint64
+rt_num_entries(radix_tree *tree)
+{
+ return tree->num_keys;
+}
+
+/*
+ * Return the amount of memory used by the radix tree.
+ */
+uint64
+rt_memory_usage(radix_tree *tree)
+{
+ Size total = 0;
+
+ for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ {
+ total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
+ total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
+ }
+
+ return total;
+}
+
+/*
+ * Verify the radix tree node.
+ */
+static void
+rt_verify_node(rt_node *node)
+{
+#ifdef USE_ASSERT_CHECKING
+ Assert(NODE_GET_COUNT(node) >= 0);
+
+ switch (NODE_GET_KIND(node))
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_base_4 *n4 = (rt_node_base_4 *) node;
+
+ for (int i = 1; i < NODE_GET_COUNT(n4); i++)
+ Assert(n4->chunks[i - 1] < n4->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_base_32 *n32 = (rt_node_base_32 *) node;
+
+ for (int i = 1; i < NODE_GET_COUNT(n32); i++)
+ Assert(n32->chunks[i - 1] < n32->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ rt_node_base_128 *n128 = (rt_node_base_128 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!node_128_is_chunk_used(n128, i))
+ continue;
+
+ /* Check if the corresponding slot is used */
+ if (NODE_IS_LEAF(node))
+ Assert(node_leaf_128_is_slot_used((rt_node_leaf_128 *) node,
+ n128->slot_idxs[i]));
+ else
+ Assert(node_inner_128_is_slot_used((rt_node_inner_128 *) node,
+ n128->slot_idxs[i]));
+
+ cnt++;
+ }
+
+ Assert(NODE_GET_COUNT(n128) == cnt);
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RT_NODE_NSLOTS_BITS(RT_NODE_MAX_SLOTS); i++)
+ cnt += pg_popcount32(n256->isset[i]);
+
+ /* Check if the number of used chunks matches */
+ Assert(NODE_GET_COUNT(n256) == cnt);
+
+ break;
+ }
+ }
+ }
+#endif
+}
+
+/***************** DEBUG FUNCTIONS *****************/
+#ifdef RT_DEBUG
+void
+rt_stats(radix_tree *tree)
+{
+ ereport(LOG, (errmsg("num_keys = %lu, height = %u, n4 = %u, n32 = %u, n128 = %u, n256 = %u",
+ tree->num_keys,
+ tree->root->shift / RT_NODE_SPAN,
+ tree->cnt[0],
+ tree->cnt[1],
+ tree->cnt[2],
+ tree->cnt[3])));
+}
+
+static void
+rt_dump_node(rt_node *node, int level, bool recurse)
+{
+ char space[128] = {0};
+
+ fprintf(stderr, "[%s] kind %d, count %u, shift %u, chunk 0x%X:\n",
+ NODE_IS_LEAF(node) ? "LEAF" : "INNR",
+ (NODE_GET_KIND(node) == RT_NODE_KIND_4) ? 4 :
+ (NODE_GET_KIND(node) == RT_NODE_KIND_32) ? 32 :
+ (NODE_GET_KIND(node) == RT_NODE_KIND_128) ? 128 : 256,
+ NODE_GET_COUNT(node), node->shift, node->chunk);
+
+ if (level > 0)
+ sprintf(space, "%*c", level * 4, ' ');
+
+ switch (NODE_GET_KIND(node))
+ {
+ case RT_NODE_KIND_4:
+ {
+ for (int i = 0; i < NODE_GET_COUNT(node); i++)
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+
+ fprintf(stderr, "%schunk 0x%X value 0x%lX\n",
+ space, n4->base.chunks[i], n4->values[i]);
+ }
+ else
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, n4->base.chunks[i]);
+
+ if (recurse)
+ rt_dump_node(n4->children[i], level + 1, recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ for (int i = 0; i < NODE_GET_COUNT(node); i++)
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+
+ fprintf(stderr, "%schunk 0x%X value 0x%lX\n",
+ space, n32->base.chunks[i], n32->values[i]);
+ }
+ else
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, n32->base.chunks[i]);
+
+ if (recurse)
+ {
+ rt_dump_node(n32->children[i], level + 1, recurse);
+ }
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ rt_node_base_128 *b128 = (rt_node_base_128 *) node;
+
+ fprintf(stderr, "slot_idxs ");
+ for (int i = 0; i < 256; i++)
+ {
+ if (!node_128_is_chunk_used(b128, i))
+ continue;
+
+ fprintf(stderr, " [%d]=%d, ", i, b128->slot_idxs[i]);
+ }
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_128 *n = (rt_node_leaf_128 *) node;
+
+ fprintf(stderr, ", isset-bitmap:");
+ for (int i = 0; i < 16; i++)
+ {
+ fprintf(stderr, "%X ", (uint8) n->isset[i]);
+ }
+ fprintf(stderr, "\n");
+ }
+
+ for (int i = 0; i < 256; i++)
+ {
+ if (!node_128_is_chunk_used(b128, i))
+ continue;
+
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_128 *n128 = (rt_node_leaf_128 *) b128;
+
+ fprintf(stderr, "%schunk 0x%X value 0x%lX\n",
+ space, i, node_leaf_128_get_value(n128, i));
+ }
+ else
+ {
+ rt_node_inner_128 *n128 = (rt_node_inner_128 *) b128;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, i);
+
+ if (recurse)
+ rt_dump_node(node_inner_128_get_child(n128, i),
+ level + 1, recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ for (int i = 0; i < 256; i++)
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+
+ if (!node_leaf_256_is_chunk_used(n256, i))
+ continue;
+
+ fprintf(stderr, "%schunk 0x%X value 0x%lX\n",
+ space, i, node_leaf_256_get_value(n256, i));
+ }
+ else
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+
+ if (!node_inner_256_is_chunk_used(n256, i))
+ continue;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, i);
+
+ if (recurse)
+ rt_dump_node(node_inner_256_get_child(n256, i), level + 1,
+ recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ }
+}
+
+void
+rt_dump_search(radix_tree *tree, uint64 key)
+{
+ rt_node *node;
+ int shift;
+ int level = 0;
+
+ elog(NOTICE, "-----------------------------------------------------------");
+ elog(NOTICE, "max_val = %lu (0x%lX)", tree->max_val, tree->max_val);
+
+ if (!tree->root)
+ {
+ elog(NOTICE, "tree is empty");
+ return;
+ }
+
+ if (key > tree->max_val)
+ {
+ elog(NOTICE, "key %lu (0x%lX) is larger than max val",
+ key, key);
+ return;
+ }
+
+ node = tree->root;
+ shift = tree->root->shift;
+ while (shift >= 0)
+ {
+ rt_node *child;
+
+ rt_dump_node(node, level, false);
+
+ if (NODE_IS_LEAF(node))
+ {
+ uint64 dummy;
+
+ /* We reached a leaf node, find the corresponding slot */
+ rt_node_search_leaf(node, key, RT_ACTION_FIND, &dummy);
+
+ break;
+ }
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ break;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ level++;
+ }
+}
+
+void
+rt_dump(radix_tree *tree)
+{
+ for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ fprintf(stderr, "%s\tinner_size%lu\tinner_blocksize %lu\tleaf_size %lu\tleaf_blocksize %lu\n",
+ rt_node_kind_info[i].name,
+ rt_node_kind_info[i].inner_size,
+ rt_node_kind_info[i].inner_blocksize,
+ rt_node_kind_info[i].leaf_size,
+ rt_node_kind_info[i].leaf_blocksize);
+ fprintf(stderr, "max_val = %lu\n", tree->max_val);
+
+ if (!tree->root)
+ {
+ fprintf(stderr, "empty tree\n");
+ return;
+ }
+
+ rt_dump_node(tree->root, 0, true);
+}
+#endif
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
new file mode 100644
index 0000000000..d5d7668617
--- /dev/null
+++ b/src/include/lib/radixtree.h
@@ -0,0 +1,42 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.h
+ * Interface for radix tree.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/lib/radixtree.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef RADIXTREE_H
+#define RADIXTREE_H
+
+#include "postgres.h"
+
+#define RT_DEBUG 1
+
+typedef struct radix_tree radix_tree;
+typedef struct rt_iter rt_iter;
+
+extern radix_tree *rt_create(MemoryContext ctx);
+extern void rt_free(radix_tree *tree);
+extern bool rt_search(radix_tree *tree, uint64 key, uint64 *val_p);
+extern bool rt_set(radix_tree *tree, uint64 key, uint64 val);
+extern rt_iter *rt_begin_iterate(radix_tree *tree);
+
+extern bool rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p);
+extern void rt_end_iterate(rt_iter *iter);
+extern bool rt_delete(radix_tree *tree, uint64 key);
+
+extern uint64 rt_memory_usage(radix_tree *tree);
+extern uint64 rt_num_entries(radix_tree *tree);
+
+#ifdef RT_DEBUG
+extern void rt_dump(radix_tree *tree);
+extern void rt_dump_search(radix_tree *tree, uint64 key);
+extern void rt_stats(radix_tree *tree);
+#endif
+
+#endif /* RADIXTREE_H */
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 7b3f292965..e587cabe13 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -26,6 +26,7 @@ SUBDIRS = \
test_parser \
test_pg_dump \
test_predtest \
+ test_radixtree \
test_rbtree \
test_regex \
test_rls_hooks \
diff --git a/src/test/modules/test_radixtree/.gitignore b/src/test/modules/test_radixtree/.gitignore
new file mode 100644
index 0000000000..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/test_radixtree/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_radixtree/Makefile b/src/test/modules/test_radixtree/Makefile
new file mode 100644
index 0000000000..da06b93da3
--- /dev/null
+++ b/src/test/modules/test_radixtree/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_radixtree/Makefile
+
+MODULE_big = test_radixtree
+OBJS = \
+ $(WIN32RES) \
+ test_radixtree.o
+PGFILEDESC = "test_radixtree - test code for src/backend/lib/radixtree.c"
+
+EXTENSION = test_radixtree
+DATA = test_radixtree--1.0.sql
+
+REGRESS = test_radixtree
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_radixtree
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_radixtree/README b/src/test/modules/test_radixtree/README
new file mode 100644
index 0000000000..a8b271869a
--- /dev/null
+++ b/src/test/modules/test_radixtree/README
@@ -0,0 +1,7 @@
+test_radixtree contains unit tests for testing the radix tree implementation
+in src/backend/lib/radixtree.c.
+
+The tests verify the correctness of the implementation, but they can also be
+used as a micro-benchmark. If you set the 'rt_test_stats' flag in
+test_radixtree.c, the tests will print extra information about execution time
+and memory usage.
diff --git a/src/test/modules/test_radixtree/expected/test_radixtree.out b/src/test/modules/test_radixtree/expected/test_radixtree.out
new file mode 100644
index 0000000000..cc6970c87c
--- /dev/null
+++ b/src/test/modules/test_radixtree/expected/test_radixtree.out
@@ -0,0 +1,28 @@
+CREATE EXTENSION test_radixtree;
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
+NOTICE: testing radix tree node types with shift "0"
+NOTICE: testing radix tree node types with shift "8"
+NOTICE: testing radix tree node types with shift "16"
+NOTICE: testing radix tree node types with shift "24"
+NOTICE: testing radix tree node types with shift "32"
+NOTICE: testing radix tree node types with shift "40"
+NOTICE: testing radix tree node types with shift "48"
+NOTICE: testing radix tree node types with shift "56"
+NOTICE: testing radix tree with pattern "all ones"
+NOTICE: testing radix tree with pattern "alternating bits"
+NOTICE: testing radix tree with pattern "clusters of ten"
+NOTICE: testing radix tree with pattern "clusters of hundred"
+NOTICE: testing radix tree with pattern "one-every-64k"
+NOTICE: testing radix tree with pattern "sparse"
+NOTICE: testing radix tree with pattern "single values, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^60"
+ test_radixtree
+----------------
+
+(1 row)
+
diff --git a/src/test/modules/test_radixtree/sql/test_radixtree.sql b/src/test/modules/test_radixtree/sql/test_radixtree.sql
new file mode 100644
index 0000000000..41ece5e9f5
--- /dev/null
+++ b/src/test/modules/test_radixtree/sql/test_radixtree.sql
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_radixtree;
+
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
diff --git a/src/test/modules/test_radixtree/test_radixtree--1.0.sql b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
new file mode 100644
index 0000000000..074a5a7ea7
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_radixtree/test_radixtree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_radixtree" to load this file. \quit
+
+CREATE FUNCTION test_radixtree()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
new file mode 100644
index 0000000000..a4aa80a99c
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -0,0 +1,504 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_radixtree.c
+ * Test radixtree set data structure.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/test/modules/test_radixtree/test_radixtree.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "lib/radixtree.h"
+#include "miscadmin.h"
+#include "nodes/bitmapset.h"
+#include "storage/block.h"
+#include "storage/itemptr.h"
+#include "utils/memutils.h"
+#include "utils/timestamp.h"
+
+#define UINT64_HEX_FORMAT "%" INT64_MODIFIER "X"
+
+/*
+ * If you enable this, the "pattern" tests will print information about
+ * how long populating, probing, and iterating the test set takes, and
+ * how much memory the test set consumed. That can be used as
+ * micro-benchmark of various operations and input patterns (you might
+ * want to increase the number of values used in each of the test, if
+ * you do that, to reduce noise).
+ *
+ * The information is printed to the server's stderr, mostly because
+ * that's where MemoryContextStats() output goes.
+ */
+static const bool rt_test_stats = false;
+
+/* The maximum number of entries each node type can have */
+static int rt_node_max_entries[] = {
+ 4, /* RT_NODE_KIND_4 */
+ 16, /* RT_NODE_KIND_16 */
+ 32, /* RT_NODE_KIND_32 */
+ 128, /* RT_NODE_KIND_128 */
+ 256 /* RT_NODE_KIND_256 */
+};
+
+/*
+ * A struct to define a pattern of integers, for use with the test_pattern()
+ * function.
+ */
+typedef struct
+{
+ char *test_name; /* short name of the test, for humans */
+ char *pattern_str; /* a bit pattern */
+ uint64 spacing; /* pattern repeats at this interval */
+ uint64 num_values; /* number of integers to set in total */
+} test_spec;
+
+/* Test patterns borrowed from test_integerset.c */
+static const test_spec test_specs[] = {
+ {
+ "all ones", "1111111111",
+ 10, 1000000
+ },
+ {
+ "alternating bits", "0101010101",
+ 10, 1000000
+ },
+ {
+ "clusters of ten", "1111111111",
+ 10000, 1000000
+ },
+ {
+ "clusters of hundred",
+ "1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111",
+ 10000, 10000000
+ },
+ {
+ "one-every-64k", "1",
+ 65536, 1000000
+ },
+ {
+ "sparse", "100000000000000000000000000000001",
+ 10000000, 1000000
+ },
+ {
+ "single values, distance > 2^32", "1",
+ UINT64CONST(10000000000), 100000
+ },
+ {
+ "clusters, distance > 2^32", "10101010",
+ UINT64CONST(10000000000), 1000000
+ },
+ {
+ "clusters, distance > 2^60", "10101010",
+ UINT64CONST(2000000000000000000),
+ 23 /* can't be much higher than this, or we
+ * overflow uint64 */
+ }
+};
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(test_radixtree);
+
+static void
+test_empty(void)
+{
+ radix_tree *radixtree;
+ uint64 dummy;
+
+ radixtree = rt_create(CurrentMemoryContext);
+
+ if (rt_search(radixtree, 0, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, 1, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, PG_UINT64_MAX, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_num_entries(radixtree) != 0)
+ elog(ERROR, "rt_num_entries on empty tree return non-zero");
+
+ rt_free(radixtree);
+}
+
+/*
+ * Check if keys from start to end with the shift exist in the tree.
+ */
+static void
+check_search_on_node(radix_tree *radixtree, uint8 shift, int start, int end)
+{
+ for (int i = start; i < end; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ uint64 val;
+
+ if (!rt_search(radixtree, key, &val))
+ elog(ERROR, "key 0x" UINT64_HEX_FORMAT " is not found on node-%d",
+ key, end);
+ if (val != key)
+ elog(ERROR, "rt_search with key 0x" UINT64_HEX_FORMAT " returns 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ key, val, key);
+ }
+}
+
+static void
+test_node_types_insert(radix_tree *radixtree, uint8 shift)
+{
+ uint64 num_entries;
+
+ for (int i = 0; i < 256; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_set(radixtree, key, key);
+
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " found", key);
+
+ for (int j = 0; j < lengthof(rt_node_max_entries); j++)
+ {
+ /*
+ * After filling all slots in each node type, check if the values are
+ * stored properly.
+ */
+ if (i == (rt_node_max_entries[j] - 1))
+ {
+ check_search_on_node(radixtree, shift,
+ (j == 0) ? 0 : rt_node_max_entries[j - 1],
+ rt_node_max_entries[j]);
+ break;
+ }
+ }
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ if (num_entries != 256)
+ elog(ERROR,
+ "rt_num_entries returned" UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(256));
+}
+
+static void
+test_node_types_delete(radix_tree *radixtree, uint8 shift)
+{
+ uint64 num_entries;
+
+ for (int i = 0; i < 256; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_delete(radixtree, key);
+
+ if (!found)
+ elog(ERROR, "inserted key 0x" UINT64_HEX_FORMAT " is not found", key);
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ /* The tree must be empty */
+ if (num_entries != 0)
+ elog(ERROR,
+ "rt_num_entries returned" UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(256));
+}
+
+/*
+ * Test for inserting and deleting key-value pairs to each node type at the given shift
+ * level.
+ */
+static void
+test_node_types(uint8 shift)
+{
+ radix_tree *radixtree;
+
+ elog(NOTICE, "testing radix tree node types with shift \"%d\"", shift);
+
+ radixtree = rt_create(CurrentMemoryContext);
+
+ /*
+ * Insert and search entries for every node type at the 'shift' level,
+ * then delete all entries to make it empty, and insert and search
+ * entries again.
+ */
+ test_node_types_insert(radixtree, shift);
+ test_node_types_delete(radixtree, shift);
+ test_node_types_insert(radixtree, shift);
+
+ rt_free(radixtree);
+}
+
+/*
+ * Test with a repeating pattern, defined by the 'spec'.
+ */
+static void
+test_pattern(const test_spec *spec)
+{
+ radix_tree *radixtree;
+ rt_iter *iter;
+ MemoryContext radixtree_ctx;
+ TimestampTz starttime;
+ TimestampTz endtime;
+ uint64 n;
+ uint64 last_int;
+ uint64 ndeleted;
+ uint64 nbefore;
+ uint64 nafter;
+ int patternlen;
+ uint64 *pattern_values;
+ uint64 pattern_num_values;
+
+ elog(NOTICE, "testing radix tree with pattern \"%s\"", spec->test_name);
+ if (rt_test_stats)
+ fprintf(stderr, "-----\ntesting radix tree with pattern \"%s\"\n", spec->test_name);
+
+ /* Pre-process the pattern, creating an array of integers from it. */
+ patternlen = strlen(spec->pattern_str);
+ pattern_values = palloc(patternlen * sizeof(uint64));
+ pattern_num_values = 0;
+ for (int i = 0; i < patternlen; i++)
+ {
+ if (spec->pattern_str[i] == '1')
+ pattern_values[pattern_num_values++] = i;
+ }
+
+ /*
+ * Allocate the radix tree.
+ *
+ * Allocate it in a separate memory context, so that we can print its
+ * memory usage easily.
+ */
+ radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
+ "radixtree test",
+ ALLOCSET_SMALL_SIZES);
+ MemoryContextSetIdentifier(radixtree_ctx, spec->test_name);
+ radixtree = rt_create(radixtree_ctx);
+
+ /*
+ * Add values to the set.
+ */
+ starttime = GetCurrentTimestamp();
+
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ uint64 x = 0;
+
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ bool found;
+
+ x = last_int + pattern_values[i];
+
+ found = rt_set(radixtree, x, x);
+
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " found", x);
+
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+
+ endtime = GetCurrentTimestamp();
+
+ if (rt_test_stats)
+ fprintf(stderr, "added " UINT64_FORMAT " values in %d ms\n",
+ spec->num_values, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Print stats on the amount of memory used.
+ *
+ * We print the usage reported by rt_memory_usage(), as well as the
+ * stats from the memory context. They should be in the same ballpark,
+ * but it's hard to automate testing that, so if you're making changes to
+ * the implementation, just observe that manually.
+ */
+ if (rt_test_stats)
+ {
+ uint64 mem_usage;
+
+ /*
+ * Also print memory usage as reported by rt_memory_usage(). It
+ * should be in the same ballpark as the usage reported by
+ * MemoryContextStats().
+ */
+ mem_usage = rt_memory_usage(radixtree);
+ fprintf(stderr, "rt_memory_usage() reported " UINT64_FORMAT " (%0.2f bytes / integer)\n",
+ mem_usage, (double) mem_usage / spec->num_values);
+
+ MemoryContextStats(radixtree_ctx);
+ }
+
+ /* Check that rt_num_entries works */
+ n = rt_num_entries(radixtree);
+ if (n != spec->num_values)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT, n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_search()
+ */
+ starttime = GetCurrentTimestamp();
+
+ for (n = 0; n < 100000; n++)
+ {
+ bool found;
+ bool expected;
+ uint64 x;
+ uint64 v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Do we expect this value to be present in the set? */
+ if (x >= last_int)
+ expected = false;
+ else
+ {
+ uint64 idx = x % spec->spacing;
+
+ if (idx >= patternlen)
+ expected = false;
+ else if (spec->pattern_str[idx] == '1')
+ expected = true;
+ else
+ expected = false;
+ }
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (found != expected)
+ elog(ERROR, "mismatch at 0x" UINT64_HEX_FORMAT ": %d vs %d", x, found, expected);
+ if (found && (v != x))
+ elog(ERROR, "found 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ v, x);
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "probed " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Test iterator
+ */
+ starttime = GetCurrentTimestamp();
+
+ iter = rt_begin_iterate(radixtree);
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ uint64 expected = last_int + pattern_values[i];
+ uint64 x;
+ uint64 val;
+
+ if (!rt_iterate_next(iter, &x, &val))
+ break;
+
+ if (x != expected)
+ elog(ERROR,
+ "iterate returned wrong key; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d",
+ x, expected, i);
+ if (val != expected)
+ elog(ERROR,
+ "iterate returned wrong value; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d", x, expected, i);
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "iterated " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ if (n < spec->num_values)
+ elog(ERROR, "iterator stopped short after " UINT64_FORMAT " entries, expected " UINT64_FORMAT, n, spec->num_values);
+ if (n > spec->num_values)
+ elog(ERROR, "iterator returned " UINT64_FORMAT " entries, " UINT64_FORMAT " was expected", n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_delete()
+ */
+ starttime = GetCurrentTimestamp();
+
+ nbefore = rt_num_entries(radixtree);
+ ndeleted = 0;
+ for (n = 0; n < 100000; n++)
+ {
+ bool found;
+ uint64 x;
+ uint64 v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (!found)
+ continue;
+
+ /* If the key is found, delete it and check again */
+ if (!rt_delete(radixtree, x))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_search(radixtree, x, &v))
+ elog(ERROR, "found deleted key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_delete(radixtree, x))
+ elog(ERROR, "deleted already-deleted key 0x" UINT64_HEX_FORMAT, x);
+
+ ndeleted++;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "deleted " UINT64_FORMAT " values in %d ms\n",
+ ndeleted, (int) (endtime - starttime) / 1000);
+
+ nafter = rt_num_entries(radixtree);
+
+ /* Check that rt_num_entries works */
+ if ((nbefore - ndeleted) != nafter)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT "after " UINT64_FORMAT " deletion",
+ nafter, (nbefore - ndeleted), ndeleted);
+
+ MemoryContextDelete(radixtree_ctx);
+}
+
+Datum
+test_radixtree(PG_FUNCTION_ARGS)
+{
+ test_empty();
+
+ for (int shift = 0; shift <= (64 - 8); shift += 8)
+ test_node_types(shift);
+
+ /* Test different test patterns, with lots of entries */
+ for (int i = 0; i < lengthof(test_specs); i++)
+ test_pattern(&test_specs[i]);
+
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_radixtree/test_radixtree.control b/src/test/modules/test_radixtree/test_radixtree.control
new file mode 100644
index 0000000000..e53f2a3e0c
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.control
@@ -0,0 +1,4 @@
+comment = 'Test code for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/test_radixtree'
+relocatable = true
--
2.31.1
Attachment: 0003-tool-for-measuring-radix-tree-performance.patch (application/x-patch)
From 726959296d734784292a46e5a01c95a276820db0 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 16 Sep 2022 11:57:03 +0900
Subject: [POC PATCH 3/3] tool for measuring radix tree performance
---
contrib/bench_radix_tree/Makefile | 21 +
.../bench_radix_tree--1.0.sql | 56 +++
contrib/bench_radix_tree/bench_radix_tree.c | 447 ++++++++++++++++++
.../bench_radix_tree/bench_radix_tree.control | 6 +
contrib/bench_radix_tree/expected/bench.out | 13 +
contrib/bench_radix_tree/sql/bench.sql | 16 +
6 files changed, 559 insertions(+)
create mode 100644 contrib/bench_radix_tree/Makefile
create mode 100644 contrib/bench_radix_tree/bench_radix_tree--1.0.sql
create mode 100644 contrib/bench_radix_tree/bench_radix_tree.c
create mode 100644 contrib/bench_radix_tree/bench_radix_tree.control
create mode 100644 contrib/bench_radix_tree/expected/bench.out
create mode 100644 contrib/bench_radix_tree/sql/bench.sql
diff --git a/contrib/bench_radix_tree/Makefile b/contrib/bench_radix_tree/Makefile
new file mode 100644
index 0000000000..952bb0ceae
--- /dev/null
+++ b/contrib/bench_radix_tree/Makefile
@@ -0,0 +1,21 @@
+# contrib/bench_radix_tree/Makefile
+
+MODULE_big = bench_radix_tree
+OBJS = \
+ bench_radix_tree.o
+
+EXTENSION = bench_radix_tree
+DATA = bench_radix_tree--1.0.sql
+
+REGRESS = bench_fixed_height
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/bench_radix_tree
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
new file mode 100644
index 0000000000..0874201d7e
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
@@ -0,0 +1,56 @@
+/* contrib/bench_radix_tree/bench_radix_tree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION bench_radix_tree" to load this file. \quit
+
+create function bench_shuffle_search(
+minblk int4,
+maxblk int4,
+random_block bool DEFAULT false,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT array_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT array_load_ms int8,
+OUT rt_search_ms int8,
+OUT array_search_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_seq_search(
+minblk int4,
+maxblk int4,
+random_block bool DEFAULT false,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT array_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT array_load_ms int8,
+OUT rt_search_ms int8,
+OUT array_search_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_load_random_int(
+cnt int8,
+OUT mem_allocated int8,
+OUT load_ms int8)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_fixed_height_search(
+fanout int4,
+OUT fanout int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT rt_search_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c
new file mode 100644
index 0000000000..673f96c860
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -0,0 +1,447 @@
+/*-------------------------------------------------------------------------
+ *
+ * bench_radix_tree.c
+ *
+ * Copyright (c) 2016-2022, PostgreSQL Global Development Group
+ *
+ * contrib/bench_radix_tree/bench_radix_tree.c
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "funcapi.h"
+#include "lib/radixtree.h"
+#include <math.h>
+#include "miscadmin.h"
+#include "utils/timestamp.h"
+
+PG_MODULE_MAGIC;
+
+/* run benchmark also for binary-search case? */
+/* #define MEASURE_BINARY_SEARCH 1 */
+
+#define TIDS_PER_BLOCK_FOR_LOAD 30
+#define TIDS_PER_BLOCK_FOR_LOOKUP 50
+
+PG_FUNCTION_INFO_V1(bench_seq_search);
+PG_FUNCTION_INFO_V1(bench_shuffle_search);
+PG_FUNCTION_INFO_V1(bench_load_random_int);
+PG_FUNCTION_INFO_V1(bench_fixed_height_search);
+
+static uint64
+tid_to_key_off(ItemPointer tid, uint32 *off)
+{
+ uint64 upper;
+ uint32 shift = pg_ceil_log2_32(MaxHeapTuplesPerPage);
+ int64 tid_i;
+
+ Assert(ItemPointerGetOffsetNumber(tid) < MaxHeapTuplesPerPage);
+
+ tid_i = ItemPointerGetOffsetNumber(tid);
+ tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << shift;
+
+ /* log(sizeof(uint64) * BITS_PER_BYTE, 2) = log(64, 2) = 6 */
+ *off = tid_i & ((1 << 6) - 1);
+ upper = tid_i >> 6;
+ Assert(*off < (sizeof(uint64) * BITS_PER_BYTE));
+
+ Assert(*off < 64);
+
+ return upper;
+}
+
+static int
+shuffle_randrange(pg_prng_state *state, int lower, int upper)
+{
+ return (int) floor(pg_prng_double(state) * ((upper-lower)+0.999999)) + lower;
+}
+
+/* Naive Fisher-Yates implementation */
+static void
+shuffle_itemptrs(ItemPointer itemptr, uint64 nitems)
+{
+ /* reproducibility */
+ pg_prng_state state;
+
+ pg_prng_seed(&state, 0);
+
+ for (int i = 0; i < nitems - 1; i++)
+ {
+ int j = shuffle_randrange(&state, i, nitems - 1);
+ ItemPointerData t = itemptr[j];
+
+ itemptr[j] = itemptr[i];
+ itemptr[i] = t;
+ }
+}
+
+static ItemPointer
+generate_tids(BlockNumber minblk, BlockNumber maxblk, int ntids_per_blk, uint64 *ntids_p,
+ bool random_block)
+{
+ ItemPointer tids;
+ uint64 maxitems;
+ uint64 ntids = 0;
+ pg_prng_state state;
+
+ maxitems = (maxblk - minblk + 1) * ntids_per_blk;
+ tids = MemoryContextAllocHuge(TopTransactionContext,
+ sizeof(ItemPointerData) * maxitems);
+
+ if (random_block)
+ pg_prng_seed(&state, 0x9E3779B185EBCA87);
+
+ for (BlockNumber blk = minblk; blk < maxblk; blk++)
+ {
+ if (random_block && !pg_prng_bool(&state))
+ continue;
+
+ for (OffsetNumber off = FirstOffsetNumber;
+ off <= ntids_per_blk; off++)
+ {
+ CHECK_FOR_INTERRUPTS();
+
+ ItemPointerSetBlockNumber(&(tids[ntids]), blk);
+ ItemPointerSetOffsetNumber(&(tids[ntids]), off);
+
+ ntids++;
+ }
+ }
+
+ *ntids_p = ntids;
+ return tids;
+}
+
+#ifdef MEASURE_BINARY_SEARCH
+static int
+vac_cmp_itemptr(const void *left, const void *right)
+{
+ BlockNumber lblk,
+ rblk;
+ OffsetNumber loff,
+ roff;
+
+ lblk = ItemPointerGetBlockNumber((ItemPointer) left);
+ rblk = ItemPointerGetBlockNumber((ItemPointer) right);
+
+ if (lblk < rblk)
+ return -1;
+ if (lblk > rblk)
+ return 1;
+
+ loff = ItemPointerGetOffsetNumber((ItemPointer) left);
+ roff = ItemPointerGetOffsetNumber((ItemPointer) right);
+
+ if (loff < roff)
+ return -1;
+ if (loff > roff)
+ return 1;
+
+ return 0;
+}
+#endif
+
+static Datum
+bench_search(FunctionCallInfo fcinfo, bool shuffle)
+{
+ BlockNumber minblk = PG_GETARG_INT32(0);
+ BlockNumber maxblk = PG_GETARG_INT32(1);
+ bool random_block = PG_GETARG_BOOL(2);
+ radix_tree *rt = NULL;
+ uint64 ntids;
+ uint64 key;
+ uint64 last_key = PG_UINT64_MAX;
+ uint64 val = 0;
+ ItemPointer tids;
+ TupleDesc tupdesc;
+ TimestampTz start_time, end_time;
+ long secs;
+ int usecs;
+ int64 rt_load_ms, rt_search_ms;
+ Datum values[7];
+ bool nulls[7];
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ tids = generate_tids(minblk, maxblk, TIDS_PER_BLOCK_FOR_LOAD, &ntids, random_block);
+
+ /* measure the load time of the radix tree */
+ rt = rt_create(CurrentMemoryContext);
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ uint32 off;
+
+ CHECK_FOR_INTERRUPTS();
+
+ key = tid_to_key_off(tid, &off);
+
+ if (last_key != PG_UINT64_MAX && last_key != key)
+ {
+ rt_set(rt, last_key, val);
+ val = 0;
+ }
+
+ last_key = key;
+ val |= (uint64) 1 << off;
+ }
+ if (last_key != PG_UINT64_MAX)
+ rt_set(rt, last_key, val);
+
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_load_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ if (shuffle)
+ shuffle_itemptrs(tids, ntids);
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+ /* measure the search time of the radix tree */
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ uint64 key, val;
+ uint32 off;
+ volatile bool ret; /* prevent calling rt_search from being optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ key = tid_to_key_off(tid, &off);
+
+ ret = rt_search(rt, key, &val);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_search_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int64GetDatum(rt_num_entries(rt));
+ values[1] = Int64GetDatum(rt_memory_usage(rt));
+ values[2] = Int64GetDatum(sizeof(ItemPointerData) * ntids);
+ values[3] = Int64GetDatum(rt_load_ms);
+ nulls[4] = true; /* ar_load_ms */
+ values[5] = Int64GetDatum(rt_search_ms);
+ nulls[6] = true; /* ar_search_ms */
+
+#ifdef MEASURE_BINARY_SEARCH
+ {
+ ItemPointer itemptrs = NULL;
+
+ int64 ar_load_ms, ar_search_ms;
+
+ /* measure the load time of the array */
+ itemptrs = MemoryContextAllocHuge(CurrentMemoryContext,
+ sizeof(ItemPointerData) * ntids);
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointerSetBlockNumber(&(itemptrs[i]),
+ ItemPointerGetBlockNumber(&(tids[i])));
+ ItemPointerSetOffsetNumber(&(itemptrs[i]),
+ ItemPointerGetOffsetNumber(&(tids[i])));
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ ar_load_ms = secs * 1000 + usecs / 1000;
+
+ /* next, measure the search time of the array */
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ volatile bool ret; /* prevent calling bsearch from being optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ ret = bsearch((void *) tid,
+ (void *) itemptrs,
+ ntids,
+ sizeof(ItemPointerData),
+ vac_cmp_itemptr);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ ar_search_ms = secs * 1000 + usecs / 1000;
+
+ /* set the result */
+ nulls[4] = false;
+ values[4] = Int64GetDatum(ar_load_ms);
+ nulls[6] = false;
+ values[6] = Int64GetDatum(ar_search_ms);
+ }
+#endif
+
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_seq_search(PG_FUNCTION_ARGS)
+{
+ return bench_search(fcinfo, false);
+}
+
+Datum
+bench_shuffle_search(PG_FUNCTION_ARGS)
+{
+ return bench_search(fcinfo, true);
+}
+
+Datum
+bench_load_random_int(PG_FUNCTION_ARGS)
+{
+ uint64 cnt = (uint64) PG_GETARG_INT64(0);
+ radix_tree *rt;
+ pg_prng_state state;
+ TupleDesc tupdesc;
+ TimestampTz start_time, end_time;
+ long secs;
+ int usecs;
+ int64 load_time_ms;
+ Datum values[2];
+ bool nulls[2];
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ pg_prng_seed(&state, 0);
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ uint64 key = pg_prng_uint64(&state);
+
+ rt_set(rt, key, key);
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ load_time_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int64GetDatum(rt_memory_usage(rt));
+ values[1] = Int64GetDatum(load_time_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_fixed_height_search(PG_FUNCTION_ARGS)
+{
+ int fanout = PG_GETARG_INT32(0);
+ radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time, end_time;
+ long secs;
+ int usecs;
+ int64 rt_load_ms, rt_search_ms;
+ Datum values[5];
+ bool nulls[5];
+
+ /* test boundary between vector and iteration */
+ const int n_keys = 5 * 16 * 16 * 16 * 16;
+ uint64 r, h, i, j, k;
+ int key_id;
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+
+ key_id = 0;
+ /* lower nodes have limited fanout, the top is only limited by bits-per-byte */
+ for (r=1;;r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ for (i = 1; i <= fanout; i++)
+ {
+ for (j = 1; j <= fanout; j++)
+ {
+ for (k = 1; k <= fanout; k++)
+ {
+ uint64 key;
+ key = (r<<32) | (h<<24) | (i<<16) | (j<<8) | (k);
+
+ CHECK_FOR_INTERRUPTS();
+
+ key_id++;
+ if (key_id > n_keys)
+ goto finish_set;
+
+ rt_set(rt, key, key_id);
+ }
+ }
+ }
+ }
+ }
+finish_set:
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_load_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ /* measure the search time of the radix tree */
+ start_time = GetCurrentTimestamp();
+
+ key_id = 0;
+ for (r=1;;r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ for (i = 1; i <= fanout; i++)
+ {
+ for (j = 1; j <= fanout; j++)
+ {
+ for (k = 1; k <= fanout; k++)
+ {
+ uint64 key, val;
+ key = (r<<32) | (h<<24) | (i<<16) | (j<<8) | (k);
+
+ CHECK_FOR_INTERRUPTS();
+
+ key_id++;
+ if (key_id > n_keys)
+ goto finish_search;
+
+ rt_search(rt, key, &val);
+ }
+ }
+ }
+ }
+ }
+finish_search:
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_search_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int32GetDatum(fanout);
+ values[1] = Int64GetDatum(rt_num_entries(rt));
+ values[2] = Int64GetDatum(rt_memory_usage(rt));
+ values[3] = Int64GetDatum(rt_load_ms);
+ values[4] = Int64GetDatum(rt_search_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
diff --git a/contrib/bench_radix_tree/bench_radix_tree.control b/contrib/bench_radix_tree/bench_radix_tree.control
new file mode 100644
index 0000000000..1d988e6c9a
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.control
@@ -0,0 +1,6 @@
+# bench_radix_tree extension
+comment = 'benchmark suite for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/bench_radix_tree'
+relocatable = true
+trusted = true
diff --git a/contrib/bench_radix_tree/expected/bench.out b/contrib/bench_radix_tree/expected/bench.out
new file mode 100644
index 0000000000..60c303892e
--- /dev/null
+++ b/contrib/bench_radix_tree/expected/bench.out
@@ -0,0 +1,13 @@
+create extension bench_radix_tree;
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
diff --git a/contrib/bench_radix_tree/sql/bench.sql b/contrib/bench_radix_tree/sql/bench.sql
new file mode 100644
index 0000000000..a46018c9d4
--- /dev/null
+++ b/contrib/bench_radix_tree/sql/bench.sql
@@ -0,0 +1,16 @@
+create extension bench_radix_tree;
+
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
--
2.31.1
On Mon, Oct 24, 2022 at 12:54 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
I've attached updated PoC patches for discussion and cfbot. From the
previous version, I mainly changed the following things:

* Separate treatment of inner and leaf nodes
Overall, this looks much better!
* Pack both the node kind and node count to an uint16 value.
For this, I did mention a bitfield earlier as something we "could" do, but
it wasn't clear we should. After looking again at the node types, I must
not have thought through this at all. Storing one byte instead of four for
the full enum is a good step, but saving one more byte usually doesn't buy
anything because of padding, with a few exceptions like this example:
node4: 4 + 4 + 4*8 = 40
node4: 5 + 4+(7) + 4*8 = 48 bytes
Even there, I'd rather not spend the extra cycles to access the members.
And with my idea of decoupling size classes from kind, the variable-sized
kinds will require another byte to store "capacity". Then, even if the kind
gets encoded in a pointer tag, we'll still have 5 bytes in the base type.
So I think we should assume 5 bytes from the start. (Might be 6 temporarily
if I work on size decoupling first).
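
For illustration, the padding plays out roughly like this (the field names and layout below are made up for the example, not the patch's actual definitions):

typedef struct rt_node rt_node;	/* only the pointer size matters here */

/* ~40 bytes: 4 bytes of metadata + 4 chunk bytes + 4 * 8-byte pointers */
typedef struct node4_packed
{
	uint16		kind_and_count;	/* kind and count packed into one uint16 */
	uint8		shift;
	uint8		chunk;
	uint8		chunks[4];
	rt_node    *children[4];	/* starts at offset 8, no padding needed */
} node4_packed;

/* ~48 bytes: one more metadata byte forces 7 bytes of padding */
typedef struct node4_unpacked
{
	uint8		kind;
	uint8		count;
	uint8		shift;
	uint8		chunk;
	uint8		capacity;		/* e.g. a size-class byte */
	uint8		chunks[4];		/* 9 bytes so far; padded up to 16 */
	rt_node    *children[4];
} node4_unpacked;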
(Side note, if you have occasion to use bitfields again in the future, C99
has syntactic support for them, so no need to write your own
shifting/masking code).
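
For instance, a sketch of what that could look like (purely illustrative; note that a uint16 bit-field is technically implementation-defined, so the exact base type is a judgment call):

typedef struct node_meta
{
	uint16		kind:3,			/* node kind */
				count:13;		/* number of entries */
} node_meta;

/* meta.kind and meta.count can then be read and assigned directly,
 * and the compiler emits the shift/mask code for us. */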
I've not done SIMD part seriously yet. But overall the performance
seems good so far. If we agree with the current approach, I think we
can proceed with the verification of decoupling node sizes from node
kind. And I'll investigate DSA support.
Sounds good. I have some additional comments about v7, and after these are
addressed, we can proceed independently with the above two items. Seeing
the DSA work will also inform me how invasive pointer tagging will be.
There will still be some performance tuning and cosmetic work, but it's
getting closer.
-------------------------
0001:
+#ifndef USE_NO_SIMD
+#include "port/pg_bitutils.h"
+#endif
Leftover from an earlier version?
+static inline int vector8_find(const Vector8 v, const uint8 c);
+static inline int vector8_find_ge(const Vector8 v, const uint8 c);
Leftovers, causing compiler warnings. (Also see new variable shadow warning)
+#else /* USE_NO_SIMD */
+ Vector8 r = 0;
+ uint8 *rp = (uint8 *) &r;
+
+ for (Size i = 0; i < sizeof(Vector8); i++)
+ rp[i] = Min(((const uint8 *) &v1)[i], ((const uint8 *) &v2)[i]);
+
+ return r;
+#endif
As I mentioned a couple versions ago, this style is really awkward, and
potential non-SIMD callers will be better off writing their own byte-wise
loop rather than using this API. Especially since the "min" function exists
only as a workaround for lack of unsigned comparison in (at least) SSE2.
There is one existing function in this file with that idiom for non-assert
code (for completeness), but even there, inputs of current interest to us
use the uint64 algorithm.
0002:
+ /* XXX: should not to use vector8_highbit_mask */
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
Hmm?
+/*
+ * Return index of the first element in chunks in the given node that is greater
+ * than or equal to 'key'. Return -1 if there is no such element.
+ */
+static inline int
+node_32_search_ge(rt_node_base_32 *node, uint8 chunk)
The caller must now have logic for inserting at the end:
+ int insertpos = node_32_search_ge((rt_node_base_32 *) n32, chunk);
+ int16 count = NODE_GET_COUNT(n32);
+
+ if (insertpos < 0)
+ insertpos = count; /* insert to the tail */
It would be a bit more clear if node_*_search_ge() always returns the
position we need (see the prototype for example). In fact, these functions
are probably better named node*_get_insertpos().
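
A sketch of that shape (the linear scan here is only for illustration; the real function would of course keep whatever search strategy it already has):

/*
 * Return the index of the first chunk >= 'chunk', or the current count if
 * there is no such element, i.e. always the position to insert at.
 */
static inline int
node_32_get_insertpos(rt_node_base_32 *node, uint8 chunk)
{
	int			count = NODE_GET_COUNT(node);

	for (int i = 0; i < count; i++)
	{
		if (node->chunks[i] >= chunk)
			return i;
	}

	return count;				/* insert at the tail */
}

Then the caller can use the return value directly, without the "insertpos < 0" special case.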
+ if (likely(NODE_HAS_FREE_SLOT(n128)))
+ {
+ node_inner_128_insert(n128, chunk, child);
+ break;
+ }
+
+ /* grow node from 128 to 256 */
We want all the node-growing code to be pushed down to the bottom so that
all branches of the hot path are close together. This provides better
locality for the CPU frontend. Looking at the assembly, the above doesn't
have the desired effect, so we need to write like this (also see prototype):
if (unlikely( ! has-free-slot))
grow-node;
else
{
...;
break;
}
/* FALLTHROUGH */
+ /* Descend the tree until a leaf node */
+ while (shift >= 0)
+ {
+ rt_node *child;
+
+ if (NODE_IS_LEAF(node))
+ break;
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ child = rt_node_add_new_child(tree, parent, node, key);
+
+ Assert(child);
+
+ parent = node;
+ node = child;
+ shift -= RT_NODE_SPAN;
+ }
Note that if we have to call rt_node_add_new_child(), each successive loop
iteration must search it and find nothing there (the prototype had a
separate function to handle this). Maybe it's not that critical yet, but
something to keep in mind as we proceed. Maybe a comment about it to remind
us.
+ /* there is no key to delete */
+ if (!rt_node_search_leaf(node, key, RT_ACTION_FIND, NULL))
+ return false;
+
+ /* Update the statistics */
+ tree->num_keys--;
+
+ /*
+ * Delete the key from the leaf node and recursively delete the key in
+ * inner nodes if necessary.
+ */
+ Assert(NODE_IS_LEAF(stack[level]));
+ while (level >= 0)
+ {
+ rt_node *node = stack[level--];
+
+ if (NODE_IS_LEAF(node))
+ rt_node_search_leaf(node, key, RT_ACTION_DELETE, NULL);
+ else
+ rt_node_search_inner(node, key, RT_ACTION_DELETE, NULL);
+
+ /* If the node didn't become empty, we stop deleting the key */
+ if (!NODE_IS_EMPTY(node))
+ break;
+
+ /* The node became empty */
+ rt_free_node(tree, node);
+ }
Here we call rt_node_search_leaf() twice -- once to check for existence,
and once to delete. All three search calls are inlined, so this wastes
space. Let's try to delete the leaf, return if not found, otherwise handle
the leaf bookkeeping and loop over the inner nodes. This might require
some duplication of code.
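
Something along these lines, perhaps (just a sketch; it assumes the RT_ACTION_DELETE search functions are changed to report whether the key was actually present):

	/* Delete the key from the leaf node; if it wasn't there, we're done */
	Assert(NODE_IS_LEAF(stack[level]));
	if (!rt_node_search_leaf(stack[level], key, RT_ACTION_DELETE, NULL))
		return false;

	/* Update the statistics */
	tree->num_keys--;

	/* Free the leaf if it became empty, then unwind the inner nodes */
	while (level >= 0 && NODE_IS_EMPTY(stack[level]))
	{
		rt_free_node(tree, stack[level]);

		if (--level >= 0)
			rt_node_search_inner(stack[level], key, RT_ACTION_DELETE, NULL);
	}

	/* If even the root was freed, the tree is now empty */
	if (level < 0)
	{
		tree->root = NULL;
		tree->max_val = 0;
	}

	return true;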
+ndoe_inner_128_update(rt_node_inner_128 *node, uint8 chunk, rt_node *child)
Spelling
+static inline void
+chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
+ uint8 *dst_chunks, rt_node **dst_children, int count)
+{
+ memcpy(dst_chunks, src_chunks, sizeof(uint8) * count);
+ memcpy(dst_children, src_children, sizeof(rt_node *) * count);
+}
gcc generates better code with something like this (but not hard-coded) at
the top:
if (count > 4)
pg_unreachable();
This would have to change when we implement shrinking of nodes, but might
still be useful.
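
Putting those together, i.e. something like this (with the constant hard-coded only for illustration, as noted above):

static inline void
chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
						  uint8 *dst_chunks, rt_node **dst_children, int count)
{
	/* promise the compiler the copy is tiny so it can open-code the memcpy */
	if (count > 4)
		pg_unreachable();

	memcpy(dst_chunks, src_chunks, sizeof(uint8) * count);
	memcpy(dst_children, src_children, sizeof(rt_node *) * count);
}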
+ if (!rt_node_search_leaf(node, key, RT_ACTION_FIND, value_p))
+ return false;
+
+ return true;
Maybe just "return rt_node_search_leaf(...)" ?
--
John Naylor
EDB: http://www.enterprisedb.com
On Wed, Oct 26, 2022 at 8:06 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Mon, Oct 24, 2022 at 12:54 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
I've attached updated PoC patches for discussion and cfbot. From the
previous version, I mainly changed the following things:
Thank you for the comments!
* Separate treatment of inner and leaf nodes
Overall, this looks much better!
* Pack both the node kind and node count to an uint16 value.
For this, I did mention a bitfield earlier as something we "could" do, but it wasn't clear we should. After looking again at the node types, I must not have thought through this at all. Storing one byte instead of four for the full enum is a good step, but saving one more byte usually doesn't buy anything because of padding, with a few exceptions like this example:
node4: 4 + 4 + 4*8 = 40
node4: 5 + 4+(7) + 4*8 = 48 bytes

Even there, I'd rather not spend the extra cycles to access the members. And with my idea of decoupling size classes from kind, the variable-sized kinds will require another byte to store "capacity". Then, even if the kind gets encoded in a pointer tag, we'll still have 5 bytes in the base type. So I think we should assume 5 bytes from the start. (Might be 6 temporarily if I work on size decoupling first).
True. I'm going to start with 6 bytes and will consider reducing it to
5 bytes. Encoding the kind in a pointer tag could be tricky given DSA
support so currently I'm thinking to pack the node kind and node
capacity classes to uint8.
(Side note, if you have occasion to use bitfields again in the future, C99 has syntactic support for them, so no need to write your own shifting/masking code).
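For illustration only -- the field names here are made up, not from the patch -- the syntax looks like:

    typedef struct example_node_header
    {
        unsigned int    kind:3;     /* node type */
        unsigned int    count:13;   /* number of children */
    } example_node_header;

The compiler then generates the masking and shifting for accesses to those members.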
Thanks!
I've not done SIMD part seriously yet. But overall the performance
seems good so far. If we agree with the current approach, I think we
can proceed with the verification of decoupling node sizes from node
kind. And I'll investigate DSA support.
Sounds good. I have some additional comments about v7, and after these are addressed, we can proceed independently with the above two items. Seeing the DSA work will also inform me how invasive pointer tagging will be. There will still be some performance tuning and cosmetic work, but it's getting closer.
I've made some progress on investigating DSA support. I've written
draft patch for that and regression tests passed. I'll share it as a
separate patch for discussion with v8 radix tree patch.
While implementing DSA support, I realized that we may not need to use
pointer tagging to distinguish between backend-local address or
dsa_pointer. In order to get a backend-local address from dsa_pointer,
we need to pass dsa_area like:
node = dsa_get_address(tree->dsa, node_dp);
As shown above, the dsa area used by the shared radix tree is stored
in radix_tree struct, so we can know whether the radix tree is shared
or not by checking (tree->dsa == NULL). That is, if it's shared we use
a pointer to radix tree node as dsa_pointer, and if not we use a
pointer as a backend-local pointer. We don't need to encode something
in a pointer.
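So every access to a child can go through a small helper that checks tree->dsa, along these lines (a sketch of what the draft patch does):

    /* Resolve a child pointer to a backend-local address */
    static inline rt_node *
    node_ptr_get_local(radix_tree *tree, rt_node_ptr nodep)
    {
        if (tree->dsa != NULL)
            return (rt_node *) dsa_get_address(tree->dsa, (dsa_pointer) nodep);
        else
            return (rt_node *) nodep;
    }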
-------------------------
0001:
+#ifndef USE_NO_SIMD
+#include "port/pg_bitutils.h"
+#endif
Leftover from an earlier version?
+static inline int vector8_find(const Vector8 v, const uint8 c);
+static inline int vector8_find_ge(const Vector8 v, const uint8 c);
Leftovers, causing compiler warnings. (Also see new variable shadow warning)
Will fix.
+#else                          /* USE_NO_SIMD */
+    Vector8     r = 0;
+    uint8      *rp = (uint8 *) &r;
+
+    for (Size i = 0; i < sizeof(Vector8); i++)
+        rp[i] = Min(((const uint8 *) &v1)[i], ((const uint8 *) &v2)[i]);
+
+    return r;
+#endif
As I mentioned a couple versions ago, this style is really awkward, and potential non-SIMD callers will be better off writing their own byte-wise loop rather than using this API. Especially since the "min" function exists only as a workaround for lack of unsigned comparison in (at least) SSE2. There is one existing function in this file with that idiom for non-assert code (for completeness), but even there, inputs of current interest to us use the uint64 algorithm.
Agreed. Will remove non-SIMD code.
0002:
+    /* XXX: should not to use vector8_highbit_mask */
+    bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
Hmm?
It's my outdated memo, will remove.
+/*
+ * Return index of the first element in chunks in the given node that is greater
+ * than or equal to 'key'. Return -1 if there is no such element.
+ */
+static inline int
+node_32_search_ge(rt_node_base_32 *node, uint8 chunk)
The caller must now have logic for inserting at the end:
+    int         insertpos = node_32_search_ge((rt_node_base_32 *) n32, chunk);
+    int16       count = NODE_GET_COUNT(n32);
+
+    if (insertpos < 0)
+        insertpos = count;      /* insert to the tail */
It would be a bit more clear if node_*_search_ge() always returns the position we need (see the prototype for example). In fact, these functions are probably better named node*_get_insertpos().
Agreed.
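It could then look something like this (a simplified, non-SIMD sketch; the field names are approximate):

    /* Return the position at which 'chunk' should be inserted */
    static inline int
    node_32_get_insertpos(rt_node_base_32 *node, uint8 chunk)
    {
        int     count = node->n.count;

        for (int i = 0; i < count; i++)
        {
            if (node->chunks[i] >= chunk)
                return i;
        }

        return count;           /* no larger element, insert at the tail */
    }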
+    if (likely(NODE_HAS_FREE_SLOT(n128)))
+    {
+        node_inner_128_insert(n128, chunk, child);
+        break;
+    }
+
+    /* grow node from 128 to 256 */
We want all the node-growing code to be pushed down to the bottom so that all branches of the hot path are close together. This provides better locality for the CPU frontend. Looking at the assembly, the above doesn't have the desired effect, so we need to write like this (also see prototype):
if (unlikely( ! has-free-slot))
grow-node;
else
{
...;
break;
}
/* FALLTHROUGH */
Good point. Will change.
+    /* Descend the tree until a leaf node */
+    while (shift >= 0)
+    {
+        rt_node    *child;
+
+        if (NODE_IS_LEAF(node))
+            break;
+
+        if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+            child = rt_node_add_new_child(tree, parent, node, key);
+
+        Assert(child);
+
+        parent = node;
+        node = child;
+        shift -= RT_NODE_SPAN;
+    }
Note that if we have to call rt_node_add_new_child(), each successive loop iteration must search it and find nothing there (the prototype had a separate function to handle this). Maybe it's not that critical yet, but something to keep in mind as we proceed. Maybe a comment about it to remind us.
Agreed. Currently rt_extend() is used to add upper nodes but probably
we need another function to add lower nodes for this case.
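Something like the following could work for the non-shared case (a rough, uncompiled sketch; the function name is provisional):

    /* Create the chain of new nodes down to the leaf without searching them */
    static void
    rt_set_extend(radix_tree *tree, uint64 key, uint64 value,
                  rt_node *parent, rt_node *node)
    {
        int     shift = node->shift;

        while (shift >= RT_NODE_SPAN)
        {
            rt_node    *newchild;
            int         newshift = shift - RT_NODE_SPAN;

            newchild = rt_alloc_node(tree, RT_NODE_KIND_4, newshift,
                                     RT_GET_KEY_CHUNK(key, node->shift),
                                     newshift > 0);
            rt_node_insert_inner(tree, parent, node, key, newchild);

            parent = node;
            node = newchild;
            shift -= RT_NODE_SPAN;
        }

        rt_node_insert_leaf(tree, parent, node, key, value);
        tree->num_keys++;
    }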
+    /* there is no key to delete */
+    if (!rt_node_search_leaf(node, key, RT_ACTION_FIND, NULL))
+        return false;
+
+    /* Update the statistics */
+    tree->num_keys--;
+
+    /*
+     * Delete the key from the leaf node and recursively delete the key in
+     * inner nodes if necessary.
+     */
+    Assert(NODE_IS_LEAF(stack[level]));
+    while (level >= 0)
+    {
+        rt_node    *node = stack[level--];
+
+        if (NODE_IS_LEAF(node))
+            rt_node_search_leaf(node, key, RT_ACTION_DELETE, NULL);
+        else
+            rt_node_search_inner(node, key, RT_ACTION_DELETE, NULL);
+
+        /* If the node didn't become empty, we stop deleting the key */
+        if (!NODE_IS_EMPTY(node))
+            break;
+
+        /* The node became empty */
+        rt_free_node(tree, node);
+    }
Here we call rt_node_search_leaf() twice -- once to check for existence, and once to delete. All three search calls are inlined, so this wastes space. Let's try to delete the leaf, return if not found, otherwise handle the leaf bookkeeping and loop over the inner nodes. This might require some duplication of code.
Agreed.
+ndoe_inner_128_update(rt_node_inner_128 *node, uint8 chunk, rt_node *child)
Spelling
Will fix.
+static inline void
+chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
+                          uint8 *dst_chunks, rt_node **dst_children, int count)
+{
+    memcpy(dst_chunks, src_chunks, sizeof(uint8) * count);
+    memcpy(dst_children, src_children, sizeof(rt_node *) * count);
+}
gcc generates better code with something like this (but not hard-coded) at the top:
if (count > 4)
pg_unreachable();
Agreed.
This would have to change when we implement shrinking of nodes, but might still be useful.
+    if (!rt_node_search_leaf(node, key, RT_ACTION_FIND, value_p))
+        return false;
+
+    return true;
Maybe just "return rt_node_search_leaf(...)" ?
Agreed.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Thu, Oct 27, 2022 at 9:11 AM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
True. I'm going to start with 6 bytes and will consider reducing it to
5 bytes.
Okay, let's plan on 6 for now, so we have the worst-case sizes up front. As
discussed, I will attempt the size class decoupling after v8 and see how it
goes.
Encoding the kind in a pointer tag could be tricky given DSA
If it turns out to be unworkable, that's life. If it's just tricky, that
can certainly be put off for future work. I hope to at least test it out
with local memory.
support so currently I'm thinking to pack the node kind and node
capacity classes to uint8.
That won't work, if we need 128 for capacity, leaving no bits left. I want
the capacity to be a number we can directly compare with the count (we
won't ever need to store 256 because that node will never grow). Also,
further to my last message, we need to access the kind quickly, without
more cycles.
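To illustrate what I mean (the layout and field names here are only an example, not a proposal for the exact struct):

    typedef struct rt_node
    {
        uint16      count;      /* number of children */
        uint8       kind;       /* read directly, no decoding needed */
        uint8       fanout;     /* capacity; compare directly against count */
        uint8       shift;
        uint8       chunk;
    } rt_node;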
I've made some progress on investigating DSA support. I've written
draft patch for that and regression tests passed. I'll share it as a
separate patch for discussion with v8 radix tree patch.
Great!
While implementing DSA support, I realized that we may not need to use
pointer tagging to distinguish between backend-local address or
dsa_pointer. In order to get a backend-local address from dsa_pointer,
we need to pass dsa_area like:
I was not clear -- when I see how much code changes to accommodate DSA
pointers, I imagine I will pretty much know the places that would be
affected by tagging the pointer with the node kind.
Speaking of tests, there is currently no Meson support, but tests pass
because this library is not used anywhere in the backend yet, and
apparently the CI Meson builds don't know to run the regression test? That
will need to be done too. However, it's okay to keep the benchmarking
module in autoconf, since it won't be committed.
+static inline void
+chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
+                          uint8 *dst_chunks, rt_node **dst_children, int count)
+{
+    memcpy(dst_chunks, src_chunks, sizeof(uint8) * count);
+    memcpy(dst_children, src_children, sizeof(rt_node *) * count);
+}
gcc generates better code with something like this (but not hard-coded)
at the top:
if (count > 4)
pg_unreachable();
Actually it just now occurred to me there's a bigger issue here: *We* know
this code can only get here iff count==4, so why doesn't the compiler know
that? I believe it boils down to
static rt_node_kind_info_elem rt_node_kind_info[RT_NODE_KIND_COUNT] = {
In the assembly, I see it checks if there is room in the node by doing a
runtime lookup in this array, which is not constant. This might not be
important just yet, because I want to base the check on the proposed node
capacity instead, but I mention it as a reminder to us to make sure we take
all opportunities for the compiler to propagate constants.
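In other words, if the bound were a compile-time constant rather than a table lookup, the hint would actually help (hypothetical constant name, just to show the shape):

    #define RT_FANOUT_4     4   /* compile-time constant, not a table lookup */

        if (count > RT_FANOUT_4)
            pg_unreachable();   /* the compiler now knows count <= 4 here */

        memcpy(dst_chunks, src_chunks, sizeof(uint8) * count);
        memcpy(dst_children, src_children, sizeof(rt_node *) * count);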
--
John Naylor
EDB: http://www.enterprisedb.com
On Thu, Oct 27, 2022 at 12:21 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Thu, Oct 27, 2022 at 9:11 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
True. I'm going to start with 6 bytes and will consider reducing it to
5 bytes.
Okay, let's plan on 6 for now, so we have the worst-case sizes up front. As discussed, I will attempt the size class decoupling after v8 and see how it goes.
Encoding the kind in a pointer tag could be tricky given DSA
If it turns out to be unworkable, that's life. If it's just tricky, that can certainly be put off for future work. I hope to at least test it out with local memory.
support so currently I'm thinking to pack the node kind and node
capacity classes to uint8.
That won't work, if we need 128 for capacity, leaving no bits left. I want the capacity to be a number we can directly compare with the count (we won't ever need to store 256 because that node will never grow). Also, further to my last message, we need to access the kind quickly, without more cycles.
Understood.
I've made some progress on investigating DSA support. I've written
draft patch for that and regression tests passed. I'll share it as a
separate patch for discussion with v8 radix tree patch.
Great!
While implementing DSA support, I realized that we may not need to use
pointer tagging to distinguish between backend-local address or
dsa_pointer. In order to get a backend-local address from dsa_pointer,
we need to pass dsa_area like:
I was not clear -- when I see how much code changes to accommodate DSA pointers, I imagine I will pretty much know the places that would be affected by tagging the pointer with the node kind.
Speaking of tests, there is currently no Meson support, but tests pass because this library is not used anywhere in the backend yet, and apparently the CI Meson builds don't know to run the regression test? That will need to be done too. However, it's okay to keep the benchmarking module in autoconf, since it won't be committed.
Updated to support Meson.
+static inline void
+chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
+                          uint8 *dst_chunks, rt_node **dst_children, int count)
+{
+    memcpy(dst_chunks, src_chunks, sizeof(uint8) * count);
+    memcpy(dst_children, src_children, sizeof(rt_node *) * count);
+}
gcc generates better code with something like this (but not hard-coded) at the top:
if (count > 4)
pg_unreachable();
Actually it just now occurred to me there's a bigger issue here: *We* know this code can only get here iff count==4, so why doesn't the compiler know that? I believe it boils down to
static rt_node_kind_info_elem rt_node_kind_info[RT_NODE_KIND_COUNT] = {
In the assembly, I see it checks if there is room in the node by doing a runtime lookup in this array, which is not constant. This might not be important just yet, because I want to base the check on the proposed node capacity instead, but I mention it as a reminder to us to make sure we take all opportunities for the compiler to propagate constants.
I've attached the v8 patches. The 0001, 0002, and 0003 patches incorporate
the comments I got so far. The 0004 patch is a PoC patch for DSA support.
In the 0004 patch, the basic idea is to use rt_node_ptr in all inner nodes
to point to their children, and we use rt_node_ptr as either rt_node * or
dsa_pointer depending on whether the radix tree is shared or not (i.e.,
by checking radix_tree->dsa == NULL). Regarding the performance, I've
added another boolean argument to bench_seq/shuffle_search(),
specifying whether to use the shared radix tree or not. Here are
benchmark results in my environment:
select * from bench_seq_search(0, 1* 1000 * 1000, false, false);
  nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
 1000000 |          9871240 |           180000000 |         67 |               |          241 |
(1 row)

select * from bench_seq_search(0, 1* 1000 * 1000, false, true);
  nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
 1000000 |         14680064 |           180000000 |         81 |               |          483 |
(1 row)

select * from bench_seq_search(0, 2* 1000 * 1000, true, false);
 nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
 999654 |         19680872 |           179937720 |         74 |               |          235 |
(1 row)

select * from bench_seq_search(0, 2* 1000 * 1000, true, true);
 nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
 999654 |         23068672 |           179937720 |         86 |               |          445 |
(1 row)

select * from bench_shuffle_search(0, 1* 1000 * 1000, false, false);
  nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
 1000000 |          9871240 |           180000000 |         67 |               |          640 |
(1 row)

select * from bench_shuffle_search(0, 1* 1000 * 1000, false, true);
  nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
 1000000 |         14680064 |           180000000 |         81 |               |         1002 |
(1 row)

select * from bench_shuffle_search(0, 2* 1000 * 1000, true, false);
 nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
 999654 |         19680872 |           179937720 |         74 |               |          697 |
(1 row)

select * from bench_shuffle_search(0, 2* 1000 * 1000, true, true);
 nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
 999654 |         23068672 |           179937720 |         86 |               |         1030 |
(1 row)
In non-shared radix tree cases (the fourth argument is false), I don't
see a visible performance degradation. On the other hand, in shared
radix tree cases (the fourth argument is true), I see visible overheads
because of dsa_get_address().
Please note that the current shared radix tree implementation doesn't
support any locking, so it cannot be read while written by someone.
Also, only one process can iterate over the shared radix tree. When it
comes to parallel vacuum, these are not restrictions, as the leader
process writes the radix tree while scanning the heap, and the radix tree
is read by multiple processes while vacuuming indexes. And only the
leader process can do heap vacuum by iterating the key-value pairs in
the radix tree. If we want to use it for other cases too, we would
need to support locking, RCU or something.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachments:
v8-0001-introduce-vector8_min-and-vector8_highbit_mask.patch
From 8a240268c8135a871f80b8d465e0335745f2cedd Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 24 Oct 2022 14:07:09 +0900
Subject: [PATCH v8 1/4] introduce vector8_min and vector8_highbit_mask
---
src/include/port/simd.h | 47 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 47 insertions(+)
diff --git a/src/include/port/simd.h b/src/include/port/simd.h
index 61ae4ecf60..0b288c422a 100644
--- a/src/include/port/simd.h
+++ b/src/include/port/simd.h
@@ -77,6 +77,7 @@ static inline bool vector8_has(const Vector8 v, const uint8 c);
static inline bool vector8_has_zero(const Vector8 v);
static inline bool vector8_has_le(const Vector8 v, const uint8 c);
static inline bool vector8_is_highbit_set(const Vector8 v);
+static inline uint32 vector8_highbit_mask(const Vector8 v);
#ifndef USE_NO_SIMD
static inline bool vector32_is_highbit_set(const Vector32 v);
#endif
@@ -96,6 +97,7 @@ static inline Vector8 vector8_ssub(const Vector8 v1, const Vector8 v2);
*/
#ifndef USE_NO_SIMD
static inline Vector8 vector8_eq(const Vector8 v1, const Vector8 v2);
+static inline Vector8 vector8_min(const Vector8 v1, const Vector8 v2);
static inline Vector32 vector32_eq(const Vector32 v1, const Vector32 v2);
#endif
@@ -277,6 +279,36 @@ vector8_is_highbit_set(const Vector8 v)
#endif
}
+/*
+ * Return the bitmak of the high-bit of each element.
+ */
+static inline uint32
+vector8_highbit_mask(const Vector8 v)
+{
+#ifdef USE_SSE2
+ return (uint32) _mm_movemask_epi8(v);
+#elif defined(USE_NEON)
+ static const uint8 mask[16] = {
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ };
+
+ uint8x16_t masked = vandq_u8(vld1q_u8(mask), (uint8x16_t) vshrq_n_s8(v, 7));
+ uint8x16_t maskedhi = vextq_u8(masked, masked, 8);
+
+ return (uint32) vaddvq_u16((uint16x8_t) vzip1q_u8(masked, maskedhi));
+#else
+ uint32 mask = 0;
+
+ for (Size i = 0; i < sizeof(Vector8); i++)
+ mask |= (((const uint8 *) &v)[i] >> 7) << i;
+
+ return mask;
+#endif
+}
+
/*
* Exactly like vector8_is_highbit_set except for the input type, so it
* looks at each byte separately.
@@ -372,4 +404,19 @@ vector32_eq(const Vector32 v1, const Vector32 v2)
}
#endif /* ! USE_NO_SIMD */
+/*
+ * Compare the given vectors and return the vector of minimum elements.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_min(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+ return _mm_min_epu8(v1, v2);
+#elif defined(USE_NEON)
+ return vminq_u8(v1, v2);
+#endif
+}
+#endif /* ! USE_NO_SIMD */
+
#endif /* SIMD_H */
--
2.31.1
v8-0004-PoC-DSA-support-for-radix-tree.patch
From eac9256167afc948166144820e0d884c9e89f8cc Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Thu, 27 Oct 2022 14:02:00 +0900
Subject: [PATCH v8 4/4] PoC: DSA support for radix tree.
---
.../bench_radix_tree--1.0.sql | 2 +
contrib/bench_radix_tree/bench_radix_tree.c | 12 +-
src/backend/lib/radixtree.c | 683 ++++++++++++------
src/backend/utils/mmgr/dsa.c | 12 +
src/include/lib/radixtree.h | 6 +-
src/include/utils/dsa.h | 1 +
.../expected/test_radixtree.out | 17 +
.../modules/test_radixtree/test_radixtree.c | 98 ++-
8 files changed, 558 insertions(+), 273 deletions(-)
diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
index 0874201d7e..cf294c01d6 100644
--- a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
+++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
@@ -7,6 +7,7 @@ create function bench_shuffle_search(
minblk int4,
maxblk int4,
random_block bool DEFAULT false,
+shared bool DEFAULT false,
OUT nkeys int8,
OUT rt_mem_allocated int8,
OUT array_mem_allocated int8,
@@ -23,6 +24,7 @@ create function bench_seq_search(
minblk int4,
maxblk int4,
random_block bool DEFAULT false,
+shared bool DEFAULT false,
OUT nkeys int8,
OUT rt_mem_allocated int8,
OUT array_mem_allocated int8,
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c
index 7abb237e96..be3f7ed811 100644
--- a/contrib/bench_radix_tree/bench_radix_tree.c
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -15,6 +15,7 @@
#include "lib/radixtree.h"
#include <math.h>
#include "miscadmin.h"
+#include "storage/lwlock.h"
#include "utils/timestamp.h"
PG_MODULE_MAGIC;
@@ -149,7 +150,9 @@ bench_search(FunctionCallInfo fcinfo, bool shuffle)
BlockNumber minblk = PG_GETARG_INT32(0);
BlockNumber maxblk = PG_GETARG_INT32(1);
bool random_block = PG_GETARG_BOOL(2);
+ bool shared = PG_GETARG_BOOL(3);
radix_tree *rt = NULL;
+ dsa_area *dsa = NULL;
uint64 ntids;
uint64 key;
uint64 last_key = PG_UINT64_MAX;
@@ -171,8 +174,11 @@ bench_search(FunctionCallInfo fcinfo, bool shuffle)
tids = generate_tids(minblk, maxblk, TIDS_PER_BLOCK_FOR_LOAD, &ntids, random_block);
+ if (shared)
+ dsa = dsa_create(LWLockNewTrancheId());
+
/* measure the load time of the radix tree */
- rt = rt_create(CurrentMemoryContext);
+ rt = rt_create(CurrentMemoryContext, dsa);
start_time = GetCurrentTimestamp();
for (int i = 0; i < ntids; i++)
{
@@ -323,7 +329,7 @@ bench_load_random_int(PG_FUNCTION_ARGS)
elog(ERROR, "return type must be a row type");
pg_prng_seed(&state, 0);
- rt = rt_create(CurrentMemoryContext);
+ rt = rt_create(CurrentMemoryContext, NULL);
start_time = GetCurrentTimestamp();
for (uint64 i = 0; i < cnt; i++)
@@ -375,7 +381,7 @@ bench_fixed_height_search(PG_FUNCTION_ARGS)
if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
elog(ERROR, "return type must be a row type");
- rt = rt_create(CurrentMemoryContext);
+ rt = rt_create(CurrentMemoryContext, NULL);
start_time = GetCurrentTimestamp();
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
index b239b3c615..3b06f22af5 100644
--- a/src/backend/lib/radixtree.c
+++ b/src/backend/lib/radixtree.c
@@ -22,6 +22,15 @@
* choose it to avoid an additional pointer traversal. It is the reason this code
* currently does not support variable-length keys.
*
+ * If DSA space is specified when rt_create(), the radix tree is created in the
+ * DSA space so that multiple processes can access to it simultaneously. The process
+ * who created the shared radix tree need to tell both DSA area specified when
+ * calling to rt_create() and dsa_pointer of the radix tree, fetched by
+ * rt_get_dsa_pointer(), other processes so that they can attach by rt_attach().
+ *
+ * XXX: shared radix tree is still PoC state as it doesn't have any locking support.
+ * Also, it supports only single-process iteration.
+ *
* XXX: Most functions in this file have two variants for inner nodes and leaf
* nodes, therefore there are duplication codes. While this sometimes makes the
* code maintenance tricky, this reduces branch prediction misses when judging
@@ -59,12 +68,13 @@
#include "postgres.h"
+#include "lib/radixtree.h"
+#include "lib/stringinfo.h"
#include "miscadmin.h"
#include "port/pg_bitutils.h"
#include "port/pg_lfind.h"
+#include "utils/dsa.h"
#include "utils/memutils.h"
-#include "lib/radixtree.h"
-#include "lib/stringinfo.h"
/* The number of bits encoded in one tree level */
#define RT_NODE_SPAN BITS_PER_BYTE
@@ -152,6 +162,17 @@ typedef struct rt_node
#define NODE_HAS_FREE_SLOT(n) \
(((rt_node *) (n))->count < rt_node_kind_info[((rt_node *) (n))->kind].fanout)
+/*
+ * rt_node_ptr is used as a pointer for rt_node. It can be either a local address
+ * in non-shared radix tree case (RadixTreeIsShared() is true) or a dsa_pointer in
+ * shared radix tree case. The inner nodes of the radix tree need to use rt_node_ptr
+ * to store the child rt_node pointer instead of C-pointers. A rt_node_ptr can be
+ * converted to a local address of rt_node by using node_ptr_get_local().
+ */
+typedef uintptr_t rt_node_ptr;
+#define InvalidRTNodePointer ((rt_node_ptr) 0)
+#define RTNodePtrIsValid(x) ((x) != InvalidRTNodePointer)
+
/* Base type of each node kinds for leaf and inner nodes */
typedef struct rt_node_base_4
{
@@ -205,7 +226,7 @@ typedef struct rt_node_inner_4
rt_node_base_4 base;
/* 4 children, for key chunks */
- rt_node *children[4];
+ rt_node_ptr children[4];
} rt_node_inner_4;
typedef struct rt_node_leaf_4
@@ -221,7 +242,7 @@ typedef struct rt_node_inner_32
rt_node_base_32 base;
/* 32 children, for key chunks */
- rt_node *children[32];
+ rt_node_ptr children[32];
} rt_node_inner_32;
typedef struct rt_node_leaf_32
@@ -237,7 +258,7 @@ typedef struct rt_node_inner_128
rt_node_base_128 base;
/* Slots for 128 children */
- rt_node *children[128];
+ rt_node_ptr children[128];
} rt_node_inner_128;
typedef struct rt_node_leaf_128
@@ -260,7 +281,7 @@ typedef struct rt_node_inner_256
rt_node_base_256 base;
/* Slots for 256 children */
- rt_node *children[RT_NODE_MAX_SLOTS];
+ rt_node_ptr children[RT_NODE_MAX_SLOTS];
} rt_node_inner_256;
typedef struct rt_node_leaf_256
@@ -344,6 +365,11 @@ static rt_node_kind_info_elem rt_node_kind_info[RT_NODE_KIND_COUNT] = {
* construct the key whenever updating the node iteration information, e.g., when
* advancing the current index within the node or when moving to the next node
* at the same level.
+ *
+ * XXX: Currently we allow only one process to do iteration. Therefore, rt_node_iter
+ * has the local pointers to nodes, rather than rt_node_ptr.
+ * We need either a safeguard to disallow other processes to begin the iteration
+ * while one process is doing or to allow multiple processes to do the iteration.
*/
typedef struct rt_node_iter
{
@@ -363,37 +389,56 @@ struct rt_iter
uint64 key;
};
-/* A radix tree with nodes */
-struct radix_tree
+/* Control information for an radix tree */
+typedef struct radix_tree_control
{
- MemoryContext context;
+ rt_node_ptr root;
- rt_node *root;
+ /* XXX: use pg_atomic_uint64 instead */
uint64 max_val;
uint64 num_keys;
- MemoryContextData *inner_slabs[RT_NODE_KIND_COUNT];
- MemoryContextData *leaf_slabs[RT_NODE_KIND_COUNT];
-
/* statistics */
#ifdef RT_DEBUG
int32 cnt[RT_NODE_KIND_COUNT];
#endif
+} radix_tree_control;
+
+/* A radix tree with nodes */
+struct radix_tree
+{
+ MemoryContext context;
+
+ /* pointing to either local memory or DSA */
+ radix_tree_control *ctl;
+
+ /* used only when the radix tree is shared */
+ dsa_area *dsa;
+ dsa_pointer ctl_dp;
+
+ /* used only when the radix tree is private */
+ MemoryContextData *inner_slabs[RT_NODE_KIND_COUNT];
+ MemoryContextData *leaf_slabs[RT_NODE_KIND_COUNT];
};
+#define RadixTreeIsShared(rt) ((rt)->dsa != NULL)
static void rt_new_root(radix_tree *tree, uint64 key);
-static rt_node *rt_alloc_node(radix_tree *tree, int kind, uint8 shift, uint8 chunk,
- bool inner);
-static void rt_free_node(radix_tree *tree, rt_node *node);
+static rt_node_ptr rt_alloc_node(radix_tree *tree, int kind, uint8 shift, uint8 chunk,
+ bool inner);
+static rt_node_ptr rt_copy_node(radix_tree *tree, rt_node *node, int new_kind);
+static void rt_free_node(radix_tree *tree, rt_node_ptr nodep);
+static void rt_replace_node(radix_tree *tree, rt_node *parent, rt_node_ptr oldp,
+ rt_node_ptr newp, uint64 key);
static void rt_extend(radix_tree *tree, uint64 key);
static inline bool rt_node_search_inner(rt_node *node, uint64 key, rt_action action,
- rt_node **child_p);
+ rt_node_ptr *childp_p);
static inline bool rt_node_search_leaf(rt_node *node, uint64 key, rt_action action,
uint64 *value_p);
-static bool rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node,
- uint64 key, rt_node *child);
-static bool rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
- uint64 key, uint64 value);
+static bool rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node_ptr nodep,
+ rt_node *node, uint64 key, rt_node_ptr childp);
+static bool rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node_ptr nodep,
+ rt_node *node, uint64 key, uint64 value);
+static inline void rt_node_update_inner(rt_node *node, uint64 key, rt_node_ptr newchildp);
static inline rt_node *rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter);
static inline bool rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
uint64 *value_p);
@@ -403,6 +448,15 @@ static inline void rt_iter_update_key(rt_iter *iter, uint8 chunk, uint8 shift);
/* verification (available only with assertion) */
static void rt_verify_node(rt_node *node);
+/* Get the local address of nodep */
+static inline rt_node *
+node_ptr_get_local(radix_tree *tree, rt_node_ptr nodep)
+{
+ return RadixTreeIsShared(tree)
+ ? (rt_node *) dsa_get_address(tree->dsa, (dsa_pointer) nodep)
+ : (rt_node *) nodep;
+}
+
/*
* Return index of the first element in 'base' that equals 'key'. Return -1
* if there is no such element.
@@ -550,10 +604,10 @@ node_32_get_insertpos(rt_node_base_32 *node, uint8 chunk)
/* Shift the elements right at 'idx' by one */
static inline void
-chunk_children_array_shift(uint8 *chunks, rt_node **children, int count, int idx)
+chunk_children_array_shift(uint8 *chunks, rt_node_ptr *children, int count, int idx)
{
memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
- memmove(&(children[idx + 1]), &(children[idx]), sizeof(rt_node *) * (count - idx));
+ memmove(&(children[idx + 1]), &(children[idx]), sizeof(rt_node_ptr) * (count - idx));
}
static inline void
@@ -565,7 +619,7 @@ chunk_values_array_shift(uint8 *chunks, uint64 *values, int count, int idx)
/* Delete the element at 'idx' */
static inline void
-chunk_children_array_delete(uint8 *chunks, rt_node **children, int count, int idx)
+chunk_children_array_delete(uint8 *chunks, rt_node_ptr *children, int count, int idx)
{
memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
memmove(&(children[idx]), &(children[idx + 1]), sizeof(rt_node *) * (count - idx - 1));
@@ -580,15 +634,15 @@ chunk_values_array_delete(uint8 *chunks, uint64 *values, int count, int idx)
/* Copy both chunks and children/values arrays */
static inline void
-chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
- uint8 *dst_chunks, rt_node **dst_children, int count)
+chunk_children_array_copy(uint8 *src_chunks, rt_node_ptr *src_children,
+ uint8 *dst_chunks, rt_node_ptr *dst_children, int count)
{
/* For better code generation */
if (count > rt_node_kind_info[RT_NODE_KIND_4].fanout)
pg_unreachable();
memcpy(dst_chunks, src_chunks, sizeof(uint8) * count);
- memcpy(dst_children, src_children, sizeof(rt_node *) * count);
+ memcpy(dst_children, src_children, sizeof(rt_node_ptr) * count);
}
static inline void
@@ -617,7 +671,7 @@ static inline bool
node_inner_128_is_slot_used(rt_node_inner_128 *node, uint8 slot)
{
Assert(!NODE_IS_LEAF(node));
- return (node->children[slot] != NULL);
+ return RTNodePtrIsValid(node->children[slot]);
}
static inline bool
@@ -627,7 +681,7 @@ node_leaf_128_is_slot_used(rt_node_leaf_128 *node, uint8 slot)
return ((node->isset[RT_NODE_BITMAP_BYTE(slot)] & RT_NODE_BITMAP_BIT(slot)) != 0);
}
-static inline rt_node *
+static inline rt_node_ptr
node_inner_128_get_child(rt_node_inner_128 *node, uint8 chunk)
{
Assert(!NODE_IS_LEAF(node));
@@ -695,7 +749,7 @@ node_leaf_128_find_unused_slot(rt_node_leaf_128 *node, uint8 chunk)
}
static inline void
-node_inner_128_insert(rt_node_inner_128 *node, uint8 chunk, rt_node *child)
+node_inner_128_insert(rt_node_inner_128 *node, uint8 chunk, rt_node_ptr child)
{
int slotpos;
@@ -726,10 +780,10 @@ node_leaf_128_insert(rt_node_leaf_128 *node, uint8 chunk, uint64 value)
/* Update the child corresponding to 'chunk' to 'child' */
static inline void
-node_inner_128_update(rt_node_inner_128 *node, uint8 chunk, rt_node *child)
+node_inner_128_update(rt_node_inner_128 *node, uint8 chunk, rt_node_ptr childp)
{
Assert(!NODE_IS_LEAF(node));
- node->children[node->base.slot_idxs[chunk]] = child;
+ node->children[node->base.slot_idxs[chunk]] = childp;
}
static inline void
@@ -746,7 +800,7 @@ static inline bool
node_inner_256_is_chunk_used(rt_node_inner_256 *node, uint8 chunk)
{
Assert(!NODE_IS_LEAF(node));
- return (node->children[chunk] != NULL);
+ return RTNodePtrIsValid(node->children[chunk]);
}
static inline bool
@@ -756,7 +810,7 @@ node_leaf_256_is_chunk_used(rt_node_leaf_256 *node, uint8 chunk)
return (node->isset[RT_NODE_BITMAP_BYTE(chunk)] & RT_NODE_BITMAP_BIT(chunk)) != 0;
}
-static inline rt_node *
+static inline rt_node_ptr
node_inner_256_get_child(rt_node_inner_256 *node, uint8 chunk)
{
Assert(!NODE_IS_LEAF(node));
@@ -774,7 +828,7 @@ node_leaf_256_get_value(rt_node_leaf_256 *node, uint8 chunk)
/* Set the child in the node-256 */
static inline void
-node_inner_256_set(rt_node_inner_256 *node, uint8 chunk, rt_node *child)
+node_inner_256_set(rt_node_inner_256 *node, uint8 chunk, rt_node_ptr child)
{
Assert(!NODE_IS_LEAF(node));
node->children[chunk] = child;
@@ -794,7 +848,7 @@ static inline void
node_inner_256_delete(rt_node_inner_256 *node, uint8 chunk)
{
Assert(!NODE_IS_LEAF(node));
- node->children[chunk] = NULL;
+ node->children[chunk] = InvalidRTNodePointer;
}
static inline void
@@ -835,28 +889,45 @@ static void
rt_new_root(radix_tree *tree, uint64 key)
{
int shift = key_get_shift(key);
- rt_node *node;
+ rt_node_ptr nodep;
- node = (rt_node *) rt_alloc_node(tree, RT_NODE_KIND_4, shift, 0,
- shift > 0);
- tree->max_val = shift_get_max_val(shift);
- tree->root = node;
+ nodep = rt_alloc_node(tree, RT_NODE_KIND_4, shift, 0, shift > 0);
+ tree->ctl->max_val = shift_get_max_val(shift);
+ tree->ctl->root = nodep;
}
/*
* Allocate a new node with the given node kind.
*/
-static rt_node *
+static rt_node_ptr
rt_alloc_node(radix_tree *tree, int kind, uint8 shift, uint8 chunk, bool inner)
{
rt_node *newnode;
+ rt_node_ptr newnodep;
+
+ if (tree->dsa != NULL)
+ {
+ dsa_pointer dp;
+
+ if (inner)
+ dp = dsa_allocate0(tree->dsa, rt_node_kind_info[kind].inner_size);
+ else
+ dp = dsa_allocate0(tree->dsa, rt_node_kind_info[kind].leaf_size);
- if (inner)
- newnode = (rt_node *) MemoryContextAllocZero(tree->inner_slabs[kind],
- rt_node_kind_info[kind].inner_size);
+ newnodep = (rt_node_ptr) dp;
+ newnode = (rt_node *) dsa_get_address(tree->dsa, newnodep);
+ }
else
- newnode = (rt_node *) MemoryContextAllocZero(tree->leaf_slabs[kind],
- rt_node_kind_info[kind].leaf_size);
+ {
+ if (inner)
+ newnode = (rt_node *) MemoryContextAllocZero(tree->inner_slabs[kind],
+ rt_node_kind_info[kind].inner_size);
+ else
+ newnode = (rt_node *) MemoryContextAllocZero(tree->leaf_slabs[kind],
+ rt_node_kind_info[kind].leaf_size);
+
+ newnodep = (rt_node_ptr) newnode;
+ }
newnode->kind = kind;
newnode->shift = shift;
@@ -872,69 +943,81 @@ rt_alloc_node(radix_tree *tree, int kind, uint8 shift, uint8 chunk, bool inner)
#ifdef RT_DEBUG
/* update the statistics */
- tree->cnt[kind]++;
+ tree->ctl->cnt[kind]++;
#endif
- return newnode;
+ return newnodep;
}
/*
* Create a new node with 'new_kind' and the same shift, chunk, and
* count of 'node'.
*/
-static rt_node *
+static rt_node_ptr
rt_copy_node(radix_tree *tree, rt_node *node, int new_kind)
{
rt_node *newnode;
+ rt_node_ptr newnodep;
- newnode = rt_alloc_node(tree, new_kind, node->shift, node->chunk,
- node->shift > 0);
+ newnodep = rt_alloc_node(tree, new_kind, node->shift, node->chunk,
+ node->shift > 0);
+ newnode = node_ptr_get_local(tree, newnodep);
newnode->count = node->count;
- return newnode;
+ return newnodep;
}
/* Free the given node */
static void
-rt_free_node(radix_tree *tree, rt_node *node)
+rt_free_node(radix_tree *tree, rt_node_ptr nodep)
{
/* If we're deleting the root node, make the tree empty */
- if (tree->root == node)
- tree->root = NULL;
+ if (tree->ctl->root == nodep)
+ tree->ctl->root = InvalidRTNodePointer;
#ifdef RT_DEBUG
- /* update the statistics */
- tree->cnt[node->kind]--;
- Assert(tree->cnt[node->kind] >= 0);
+ {
+ rt_node *node = node_ptr_get_local(tree, nodep);
+
+ /* update the statistics */
+ tree->ctl->cnt[node->kind]--;
+ Assert(tree->ctl->cnt[node->kind] >= 0);
+ }
#endif
- pfree(node);
+ if (RadixTreeIsShared(tree))
+ dsa_free(tree->dsa, (dsa_pointer) nodep);
+ else
+ pfree((rt_node *) nodep);
}
/*
* Replace old_child with new_child, and free the old one.
*/
static void
-rt_replace_node(radix_tree *tree, rt_node *parent, rt_node *old_child,
- rt_node *new_child, uint64 key)
+rt_replace_node(radix_tree *tree, rt_node *parent, rt_node_ptr oldp,
+ rt_node_ptr newp, uint64 key)
{
- Assert(old_child->chunk == new_child->chunk);
- Assert(old_child->shift == new_child->shift);
+ rt_node *old = node_ptr_get_local(tree, oldp);
- if (parent == old_child)
+#ifdef USE_ASSERT_CHECKING
{
- /* Replace the root node with the new large node */
- tree->root = new_child;
+ rt_node *new = node_ptr_get_local(tree, newp);
+
+ Assert(old->chunk == new->chunk);
+ Assert(old->shift == new->shift);
}
- else
- {
- bool replaced PG_USED_FOR_ASSERTS_ONLY;
+#endif
- replaced = rt_node_insert_inner(tree, NULL, parent, key, new_child);
- Assert(replaced);
+ if (parent == old)
+ {
+ /* Replace the root node with the new large node */
+ tree->ctl->root = newp;
}
+ else
+ rt_node_update_inner(parent, key, newp);
- rt_free_node(tree, old_child);
+ rt_free_node(tree, oldp);
}
/*
@@ -945,7 +1028,8 @@ static void
rt_extend(radix_tree *tree, uint64 key)
{
int target_shift;
- int shift = tree->root->shift + RT_NODE_SPAN;
+ rt_node *root = node_ptr_get_local(tree, tree->ctl->root);
+ int shift = root->shift + RT_NODE_SPAN;
target_shift = key_get_shift(key);
@@ -953,20 +1037,77 @@ rt_extend(radix_tree *tree, uint64 key)
while (shift <= target_shift)
{
rt_node_inner_4 *node;
+ rt_node_ptr nodep;
- node = (rt_node_inner_4 *) rt_alloc_node(tree, RT_NODE_KIND_4,
- shift, 0, true);
+ /* create the new root */
+ nodep = rt_alloc_node(tree, RT_NODE_KIND_4, shift, 0, true);
+ node = (rt_node_inner_4 *) node_ptr_get_local(tree, nodep);
node->base.n.count = 1;
node->base.chunks[0] = 0;
- node->children[0] = tree->root;
+ node->children[0] = tree->ctl->root;
- tree->root->chunk = 0;
- tree->root = (rt_node *) node;
+ /* Update the root */
+ root->chunk = 0;
+ tree->ctl->root = nodep;
+ root = (rt_node *) node;
shift += RT_NODE_SPAN;
}
- tree->max_val = shift_get_max_val(target_shift);
+ tree->ctl->max_val = shift_get_max_val(target_shift);
+}
+
+/* XXX: can be merged to rt_node_search_inner with RT_ACTION_UPDATE? */
+static inline void
+rt_node_update_inner(rt_node *node, uint64 key, rt_node_ptr newchildp)
+{
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+ int idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
+
+ if (idx < -1)
+ break;
+
+ n4->children[idx] = newchildp;
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+ int idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
+
+ if (idx < -1)
+ break;
+
+ n32->children[idx] = newchildp;
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ rt_node_inner_128 *n128 = (rt_node_inner_128 *) node;
+
+ if (!node_128_is_chunk_used((rt_node_base_128 *) n128, chunk))
+ break;
+
+ node_inner_128_update(n128, chunk, newchildp);
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+
+ if (!node_inner_256_is_chunk_used(n256, chunk))
+ break;
+
+ node_inner_256_set(n256, chunk, newchildp);
+ break;
+ }
+ }
}
/*
@@ -975,27 +1116,31 @@ rt_extend(radix_tree *tree, uint64 key)
*/
static inline void
rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node *parent,
- rt_node *node)
+ rt_node_ptr nodep, rt_node *node)
{
int shift = node->shift;
+ Assert(node_ptr_get_local(tree, nodep) == node);
+
while (shift >= RT_NODE_SPAN)
{
- rt_node *newchild;
+ rt_node_ptr newchildp;
int newshift = shift - RT_NODE_SPAN;
- newchild = rt_alloc_node(tree, RT_NODE_KIND_4, newshift,
- RT_GET_KEY_CHUNK(key, node->shift),
- newshift > 0);
- rt_node_insert_inner(tree, parent, node, key, newchild);
+ newchildp = rt_alloc_node(tree, RT_NODE_KIND_4, newshift,
+ RT_GET_KEY_CHUNK(key, node->shift),
+ newshift > 0);
+
+ rt_node_insert_inner(tree, parent, nodep, node, key, newchildp);
parent = node;
- node = newchild;
+ node = node_ptr_get_local(tree, newchildp);
+ nodep = newchildp;
shift -= RT_NODE_SPAN;
}
- rt_node_insert_leaf(tree, parent, node, key, value);
- tree->num_keys++;
+ rt_node_insert_leaf(tree, parent, nodep, node, key, value);
+ tree->ctl->num_keys++;
}
/*
@@ -1006,11 +1151,11 @@ rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node *parent,
* pointer is set to child_p.
*/
static inline bool
-rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **child_p)
+rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node_ptr *childp_p)
{
uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
bool found = false;
- rt_node *child = NULL;
+ rt_node_ptr childp = InvalidRTNodePointer;
switch (node->kind)
{
@@ -1025,7 +1170,7 @@ rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **chil
found = true;
if (action == RT_ACTION_FIND)
- child = n4->children[idx];
+ childp = n4->children[idx];
else /* RT_ACTION_DELETE */
chunk_children_array_delete(n4->base.chunks, n4->children,
n4->base.n.count, idx);
@@ -1041,8 +1186,9 @@ rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **chil
break;
found = true;
+
if (action == RT_ACTION_FIND)
- child = n32->children[idx];
+ childp = n32->children[idx];
else /* RT_ACTION_DELETE */
chunk_children_array_delete(n32->base.chunks, n32->children,
n32->base.n.count, idx);
@@ -1058,7 +1204,7 @@ rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **chil
found = true;
if (action == RT_ACTION_FIND)
- child = node_inner_128_get_child(n128, chunk);
+ childp = node_inner_128_get_child(n128, chunk);
else /* RT_ACTION_DELETE */
node_inner_128_delete(n128, chunk);
@@ -1073,7 +1219,7 @@ rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **chil
found = true;
if (action == RT_ACTION_FIND)
- child = node_inner_256_get_child(n256, chunk);
+ childp = node_inner_256_get_child(n256, chunk);
else /* RT_ACTION_DELETE */
node_inner_256_delete(n256, chunk);
@@ -1085,8 +1231,8 @@ rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **chil
if (action == RT_ACTION_DELETE && found)
node->count--;
- if (found && child_p)
- *child_p = child;
+ if (found && childp_p)
+ *childp_p = childp;
return found;
}
@@ -1186,8 +1332,8 @@ rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p
/* Insert the child to the inner node */
static bool
-rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 key,
- rt_node *child)
+rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node_ptr nodep, rt_node *node,
+ uint64 key, rt_node_ptr childp)
{
uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
bool chunk_exists = false;
@@ -1206,23 +1352,24 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
{
/* found the existing chunk */
chunk_exists = true;
- n4->children[idx] = child;
+ n4->children[idx] = childp;
break;
}
if (unlikely(!NODE_HAS_FREE_SLOT(n4)))
{
rt_node_inner_32 *new32;
+ rt_node_ptr new32p;
/* grow node from 4 to 32 */
- new32 = (rt_node_inner_32 *) rt_copy_node(tree, (rt_node *) n4,
- RT_NODE_KIND_32);
+ new32p = rt_copy_node(tree, (rt_node *) n4, RT_NODE_KIND_32);
+ new32 = (rt_node_inner_32 *) node_ptr_get_local(tree, new32p);
+
chunk_children_array_copy(n4->base.chunks, n4->children,
new32->base.chunks, new32->children,
n4->base.n.count);
- rt_replace_node(tree, parent, (rt_node *) n4, (rt_node *) new32,
- key);
+ rt_replace_node(tree, parent, nodep, new32p, key);
node = (rt_node *) new32;
}
else
@@ -1236,7 +1383,7 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
count, insertpos);
n4->base.chunks[insertpos] = chunk;
- n4->children[insertpos] = child;
+ n4->children[insertpos] = childp;
break;
}
}
@@ -1251,22 +1398,23 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
{
/* found the existing chunk */
chunk_exists = true;
- n32->children[idx] = child;
+ n32->children[idx] = childp;
break;
}
if (unlikely(!NODE_HAS_FREE_SLOT(n32)))
{
rt_node_inner_128 *new128;
+ rt_node_ptr new128p;
/* grow node from 32 to 128 */
- new128 = (rt_node_inner_128 *) rt_copy_node(tree, (rt_node *) n32,
- RT_NODE_KIND_128);
+ new128p = rt_copy_node(tree, (rt_node *) n32, RT_NODE_KIND_128);
+ new128 = (rt_node_inner_128 *) node_ptr_get_local(tree, new128p);
+
for (int i = 0; i < n32->base.n.count; i++)
node_inner_128_insert(new128, n32->base.chunks[i], n32->children[i]);
- rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new128,
- key);
+ rt_replace_node(tree, parent, nodep, new128p, key);
node = (rt_node *) new128;
}
else
@@ -1279,7 +1427,7 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
count, insertpos);
n32->base.chunks[insertpos] = chunk;
- n32->children[insertpos] = child;
+ n32->children[insertpos] = childp;
break;
}
}
@@ -1293,17 +1441,19 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
{
/* found the existing chunk */
chunk_exists = true;
- node_inner_128_update(n128, chunk, child);
+ node_inner_128_update(n128, chunk, childp);
break;
}
if (unlikely(!NODE_HAS_FREE_SLOT(n128)))
{
rt_node_inner_256 *new256;
+ rt_node_ptr new256p;
/* grow node from 128 to 256 */
- new256 = (rt_node_inner_256 *) rt_copy_node(tree, (rt_node *) n128,
- RT_NODE_KIND_256);
+ new256p = rt_copy_node(tree, (rt_node *) n128, RT_NODE_KIND_256);
+ new256 = (rt_node_inner_256 *) node_ptr_get_local(tree, new256p);
+
for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n128->base.n.count; i++)
{
if (!node_128_is_chunk_used((rt_node_base_128 *) n128, i))
@@ -1313,13 +1463,12 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
cnt++;
}
- rt_replace_node(tree, parent, (rt_node *) n128, (rt_node *) new256,
- key);
+ rt_replace_node(tree, parent, nodep, new256p, key);
node = (rt_node *) new256;
}
else
{
- node_inner_128_insert(n128, chunk, child);
+ node_inner_128_insert(n128, chunk, childp);
break;
}
}
@@ -1331,7 +1480,7 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
chunk_exists = node_inner_256_is_chunk_used(n256, chunk);
Assert(chunk_exists || NODE_HAS_FREE_SLOT(n256));
- node_inner_256_set(n256, chunk, child);
+ node_inner_256_set(n256, chunk, childp);
break;
}
}
@@ -1351,7 +1500,7 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
/* Insert the value to the leaf node */
static bool
-rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
+rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node_ptr nodep, rt_node *node,
uint64 key, uint64 value)
{
uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
@@ -1378,16 +1527,16 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
if (unlikely(!NODE_HAS_FREE_SLOT(n4)))
{
rt_node_leaf_32 *new32;
+ rt_node_ptr new32p;
/* grow node from 4 to 32 */
- new32 = (rt_node_leaf_32 *) rt_copy_node(tree, (rt_node *) n4,
- RT_NODE_KIND_32);
+ new32p = rt_copy_node(tree, (rt_node *) n4, RT_NODE_KIND_32);
+ new32 = (rt_node_leaf_32 *) node_ptr_get_local(tree, new32p);
chunk_values_array_copy(n4->base.chunks, n4->values,
new32->base.chunks, new32->values,
n4->base.n.count);
- rt_replace_node(tree, parent, (rt_node *) n4, (rt_node *) new32,
- key);
+ rt_replace_node(tree, parent, nodep, new32p, key);
node = (rt_node *) new32;
}
else
@@ -1423,15 +1572,16 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
if (unlikely(!NODE_HAS_FREE_SLOT(n32)))
{
rt_node_leaf_128 *new128;
+ rt_node_ptr new128p;
/* grow node from 32 to 128 */
- new128 = (rt_node_leaf_128 *) rt_copy_node(tree, (rt_node *) n32,
- RT_NODE_KIND_128);
+ new128p = rt_copy_node(tree, (rt_node *) n32, RT_NODE_KIND_128);
+ new128 = (rt_node_leaf_128 *) node_ptr_get_local(tree, new128p);
+
for (int i = 0; i < n32->base.n.count; i++)
node_leaf_128_insert(new128, n32->base.chunks[i], n32->values[i]);
- rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new128,
- key);
+ rt_replace_node(tree, parent, nodep, new128p, key);
node = (rt_node *) new128;
}
else
@@ -1465,10 +1615,12 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
if (unlikely(!NODE_HAS_FREE_SLOT(n128)))
{
rt_node_leaf_256 *new256;
+ rt_node_ptr new256p;
/* grow node from 128 to 256 */
- new256 = (rt_node_leaf_256 *) rt_copy_node(tree, (rt_node *) n128,
- RT_NODE_KIND_256);
+ new256p = rt_copy_node(tree, (rt_node *) n128, RT_NODE_KIND_256);
+ new256 = (rt_node_leaf_256 *) node_ptr_get_local(tree, new256p);
+
for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n128->base.n.count; i++)
{
if (!node_128_is_chunk_used((rt_node_base_128 *) n128, i))
@@ -1478,8 +1630,7 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
cnt++;
}
- rt_replace_node(tree, parent, (rt_node *) n128, (rt_node *) new256,
- key);
+ rt_replace_node(tree, parent, nodep, new256p, key);
node = (rt_node *) new256;
}
else
@@ -1518,33 +1669,46 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
* Create the radix tree in the given memory context and return it.
*/
radix_tree *
-rt_create(MemoryContext ctx)
+rt_create(MemoryContext ctx, dsa_area *dsa)
{
radix_tree *tree;
MemoryContext old_ctx;
old_ctx = MemoryContextSwitchTo(ctx);
- tree = palloc(sizeof(radix_tree));
+ tree = (radix_tree *) palloc0(sizeof(radix_tree));
tree->context = ctx;
- tree->root = NULL;
- tree->max_val = 0;
- tree->num_keys = 0;
+
+ if (dsa != NULL)
+ {
+ tree->dsa = dsa;
+ tree->ctl_dp = dsa_allocate0(dsa, sizeof(radix_tree_control));
+ tree->ctl = (radix_tree_control *) dsa_get_address(dsa, tree->ctl_dp);
+ }
+ else
+ {
+ tree->ctl_dp = InvalidDsaPointer;
+ tree->ctl = (radix_tree_control *) palloc0(sizeof(radix_tree_control));
+ }
+
+ tree->ctl->root = InvalidRTNodePointer;
+ tree->ctl->max_val = 0;
+ tree->ctl->num_keys = 0;
/* Create the slab allocator for each size class */
- for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ if (dsa == NULL)
{
- tree->inner_slabs[i] = SlabContextCreate(ctx,
- rt_node_kind_info[i].name,
- rt_node_kind_info[i].inner_blocksize,
- rt_node_kind_info[i].inner_size);
- tree->leaf_slabs[i] = SlabContextCreate(ctx,
- rt_node_kind_info[i].name,
- rt_node_kind_info[i].leaf_blocksize,
- rt_node_kind_info[i].leaf_size);
-#ifdef RT_DEBUG
- tree->cnt[i] = 0;
-#endif
+ for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ {
+ tree->inner_slabs[i] = SlabContextCreate(ctx,
+ rt_node_kind_info[i].name,
+ rt_node_kind_info[i].inner_blocksize,
+ rt_node_kind_info[i].inner_size);
+ tree->leaf_slabs[i] = SlabContextCreate(ctx,
+ rt_node_kind_info[i].name,
+ rt_node_kind_info[i].leaf_blocksize,
+ rt_node_kind_info[i].leaf_size);
+ }
}
MemoryContextSwitchTo(old_ctx);
@@ -1552,16 +1716,48 @@ rt_create(MemoryContext ctx)
return tree;
}
+dsa_pointer
+rt_get_dsa_pointer(radix_tree *tree)
+{
+ return tree->ctl_dp;
+}
+
+radix_tree *
+rt_attach(dsa_area *dsa, dsa_pointer dp)
+{
+ radix_tree *tree;
+
+ /* XXX: memory context support */
+ tree = (radix_tree *) palloc0(sizeof(radix_tree));
+
+ tree->ctl_dp = dp;
+ tree->ctl = (radix_tree_control *) dsa_get_address(dsa, dp);
+
+ /* XXX: do we need to set a callback on exit to detach dsa? */
+
+ return tree;
+}
+
/*
* Free the given radix tree.
*/
void
rt_free(radix_tree *tree)
{
- for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ if (RadixTreeIsShared(tree))
+ {
+ dsa_free(tree->dsa, tree->ctl_dp);
+ dsa_detach(tree->dsa);
+ }
+ else
{
- MemoryContextDelete(tree->inner_slabs[i]);
- MemoryContextDelete(tree->leaf_slabs[i]);
+ pfree(tree->ctl);
+
+ for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ {
+ MemoryContextDelete(tree->inner_slabs[i]);
+ MemoryContextDelete(tree->leaf_slabs[i]);
+ }
}
pfree(tree);
@@ -1576,48 +1772,48 @@ rt_set(radix_tree *tree, uint64 key, uint64 value)
{
int shift;
bool updated;
+ rt_node *parent;
rt_node *node;
- rt_node *parent = tree->root;
+ rt_node_ptr nodep;
/* Empty tree, create the root */
- if (!tree->root)
+ if (!RTNodePtrIsValid(tree->ctl->root))
rt_new_root(tree, key);
/* Extend the tree if necessary */
- if (key > tree->max_val)
+ if (key > tree->ctl->max_val)
rt_extend(tree, key);
- Assert(tree->root);
-
- shift = tree->root->shift;
- node = tree->root;
+ parent = node_ptr_get_local(tree, tree->ctl->root);
+ nodep = tree->ctl->root;
+ shift = parent->shift;
/* Descend the tree until a leaf node */
while (shift >= 0)
{
- rt_node *child;
+ rt_node_ptr childp;
+
+ node = node_ptr_get_local(tree, nodep);
if (NODE_IS_LEAF(node))
break;
- if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &childp))
{
- rt_set_extend(tree, key, value, parent, node);
+ rt_set_extend(tree, key, value, parent, nodep, node);
return false;
}
- Assert(child);
-
parent = node;
- node = child;
+ nodep = childp;
shift -= RT_NODE_SPAN;
}
- updated = rt_node_insert_leaf(tree, parent, node, key, value);
+ updated = rt_node_insert_leaf(tree, parent, nodep, node, key, value);
/* Update the statistics */
if (!updated)
- tree->num_keys++;
+ tree->ctl->num_keys++;
return updated;
}
@@ -1635,24 +1831,24 @@ rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
Assert(value_p != NULL);
- if (!tree->root || key > tree->max_val)
+ if (!RTNodePtrIsValid(tree->ctl->root) || key > tree->ctl->max_val)
return false;
- node = tree->root;
- shift = tree->root->shift;
+ node = node_ptr_get_local(tree, tree->ctl->root);
+ shift = node->shift;
/* Descend the tree until a leaf node */
while (shift >= 0)
{
- rt_node *child;
+ rt_node_ptr childp;
if (NODE_IS_LEAF(node))
break;
- if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &childp))
return false;
- node = child;
+ node = node_ptr_get_local(tree, childp);
shift -= RT_NODE_SPAN;
}
@@ -1667,37 +1863,40 @@ bool
rt_delete(radix_tree *tree, uint64 key)
{
rt_node *node;
- rt_node *stack[RT_MAX_LEVEL] = {0};
+ rt_node_ptr nodep;
+ rt_node_ptr stack[RT_MAX_LEVEL] = {0};
int shift;
int level;
bool deleted;
- if (!tree->root || key > tree->max_val)
+ if (!RTNodePtrIsValid(tree->ctl->root) || key > tree->ctl->max_val)
return false;
/*
* Descend the tree to search the key while building a stack of nodes we
* visited.
*/
- node = tree->root;
- shift = tree->root->shift;
+ nodep = tree->ctl->root;
+ node = node_ptr_get_local(tree, nodep);
+ shift = node->shift;
level = -1;
while (shift > 0)
{
- rt_node *child;
+ rt_node_ptr childp;
/* Push the current node to the stack */
- stack[++level] = node;
+ stack[++level] = nodep;
+ node = node_ptr_get_local(tree, nodep);
- if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &childp))
return false;
- node = child;
+ nodep = childp;
shift -= RT_NODE_SPAN;
}
/* Delete the key from the leaf node if exists */
- Assert(NODE_IS_LEAF(node));
+ node = node_ptr_get_local(tree, nodep);
deleted = rt_node_search_leaf(node, key, RT_ACTION_DELETE, NULL);
if (!deleted)
@@ -1707,7 +1906,7 @@ rt_delete(radix_tree *tree, uint64 key)
}
/* Found the key to delete. Update the statistics */
- tree->num_keys--;
+ tree->ctl->num_keys--;
/*
* Return if the leaf node still has keys and we don't need to delete the
@@ -1717,12 +1916,13 @@ rt_delete(radix_tree *tree, uint64 key)
return true;
/* Free the empty leaf node */
- rt_free_node(tree, node);
+ rt_free_node(tree, nodep);
/* Delete the key in inner nodes recursively */
while (level >= 0)
{
- node = stack[level--];
+ nodep = stack[level--];
+ node = node_ptr_get_local(tree, nodep);
deleted = rt_node_search_inner(node, key, RT_ACTION_DELETE, NULL);
Assert(deleted);
@@ -1732,7 +1932,7 @@ rt_delete(radix_tree *tree, uint64 key)
break;
/* The node became empty */
- rt_free_node(tree, node);
+ rt_free_node(tree, nodep);
}
/*
@@ -1741,8 +1941,8 @@ rt_delete(radix_tree *tree, uint64 key)
*/
if (level == 0)
{
- tree->root = NULL;
- tree->max_val = 0;
+ tree->ctl->root = InvalidRTNodePointer;
+ tree->ctl->max_val = 0;
}
return true;
@@ -1753,6 +1953,7 @@ rt_iter *
rt_begin_iterate(radix_tree *tree)
{
MemoryContext old_ctx;
+ rt_node *root;
rt_iter *iter;
int top_level;
@@ -1765,14 +1966,15 @@ rt_begin_iterate(radix_tree *tree)
if (!iter->tree)
return iter;
- top_level = iter->tree->root->shift / RT_NODE_SPAN;
+ root = node_ptr_get_local(tree, tree->ctl->root);
+ top_level = root->shift / RT_NODE_SPAN;
iter->stack_len = top_level;
/*
* Descend to the left most leaf node from the root. The key is being
* constructed while descending to the leaf.
*/
- rt_update_iter_stack(iter, iter->tree->root, top_level);
+ rt_update_iter_stack(iter, root, top_level);
MemoryContextSwitchTo(old_ctx);
@@ -1792,7 +1994,6 @@ rt_update_iter_stack(rt_iter *iter, rt_node *from_node, int from)
{
rt_node_iter *node_iter = &(iter->stack[level--]);
- /* Set the node to this level */
node_iter->node = node;
node_iter->current_idx = -1;
@@ -1828,7 +2029,6 @@ rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p)
/* Advance the leaf node iterator to get next key-value pair */
found = rt_node_leaf_iterate_next(iter, &(iter->stack[0]), &value);
-
if (found)
{
*key_p = iter->key;
@@ -1898,7 +2098,7 @@ rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
if (node_iter->current_idx >= n4->base.n.count)
break;
- child = n4->children[node_iter->current_idx];
+ child = node_ptr_get_local(iter->tree, n4->children[node_iter->current_idx]);
key_chunk = n4->base.chunks[node_iter->current_idx];
found = true;
break;
@@ -1911,7 +2111,7 @@ rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
if (node_iter->current_idx >= n32->base.n.count)
break;
- child = n32->children[node_iter->current_idx];
+ child = node_ptr_get_local(iter->tree, n32->children[node_iter->current_idx]);
key_chunk = n32->base.chunks[node_iter->current_idx];
found = true;
break;
@@ -1931,7 +2131,7 @@ rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
break;
node_iter->current_idx = i;
- child = node_inner_128_get_child(n128, i);
+ child = node_ptr_get_local(iter->tree, node_inner_128_get_child(n128, i));
key_chunk = i;
found = true;
break;
@@ -1951,7 +2151,7 @@ rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
break;
node_iter->current_idx = i;
- child = node_inner_256_get_child(n256, i);
+ child = node_ptr_get_local(iter->tree, node_inner_256_get_child(n256, i));
key_chunk = i;
found = true;
break;
@@ -2062,7 +2262,7 @@ rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
uint64
rt_num_entries(radix_tree *tree)
{
- return tree->num_keys;
+ return tree->ctl->num_keys;
}
/*
@@ -2071,12 +2271,17 @@ rt_num_entries(radix_tree *tree)
uint64
rt_memory_usage(radix_tree *tree)
{
- Size total = sizeof(radix_tree);
+ Size total = sizeof(radix_tree) + sizeof(radix_tree_control);
- for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ if (RadixTreeIsShared(tree))
+ total = dsa_get_total_size(tree->dsa);
+ else
{
- total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
- total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
+ for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ {
+ total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
+ total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
+ }
}
return total;
@@ -2161,17 +2366,18 @@ void
rt_stats(radix_tree *tree)
{
ereport(LOG, (errmsg("num_keys = %lu, height = %u, n4 = %u, n32 = %u, n128 = %u, n256 = %u",
- tree->num_keys,
- tree->root->shift / RT_NODE_SPAN,
- tree->cnt[0],
- tree->cnt[1],
- tree->cnt[2],
- tree->cnt[3])));
+ tree->ctl->num_keys,
+ node_ptr_get_local(tree, tree->ctl->root)->shift / RT_NODE_SPAN,
+ tree->ctl->cnt[0],
+ tree->ctl->cnt[1],
+ tree->ctl->cnt[2],
+ tree->ctl->cnt[3])));
}
static void
-rt_dump_node(rt_node *node, int level, bool recurse)
+rt_dump_node(radix_tree *tree, rt_node_ptr nodep, int level, bool recurse)
{
+ rt_node *node = node_ptr_get_local(tree, nodep);
char space[128] = {0};
fprintf(stderr, "[%s] kind %d, count %u, shift %u, chunk 0x%X:\n",
@@ -2205,7 +2411,7 @@ rt_dump_node(rt_node *node, int level, bool recurse)
space, n4->base.chunks[i]);
if (recurse)
- rt_dump_node(n4->children[i], level + 1, recurse);
+ rt_dump_node(tree, n4->children[i], level + 1, recurse);
else
fprintf(stderr, "\n");
}
@@ -2232,7 +2438,7 @@ rt_dump_node(rt_node *node, int level, bool recurse)
if (recurse)
{
- rt_dump_node(n32->children[i], level + 1, recurse);
+ rt_dump_node(tree, n32->children[i], level + 1, recurse);
}
else
fprintf(stderr, "\n");
@@ -2284,7 +2490,7 @@ rt_dump_node(rt_node *node, int level, bool recurse)
space, i);
if (recurse)
- rt_dump_node(node_inner_128_get_child(n128, i),
+ rt_dump_node(tree, node_inner_128_get_child(n128, i),
level + 1, recurse);
else
fprintf(stderr, "\n");
@@ -2317,8 +2523,8 @@ rt_dump_node(rt_node *node, int level, bool recurse)
space, i);
if (recurse)
- rt_dump_node(node_inner_256_get_child(n256, i), level + 1,
- recurse);
+ rt_dump_node(tree, node_inner_256_get_child(n256, i),
+ level + 1, recurse);
else
fprintf(stderr, "\n");
}
@@ -2328,6 +2534,28 @@ rt_dump_node(rt_node *node, int level, bool recurse)
}
}
+void
+rt_dump(radix_tree *tree)
+{
+ for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ fprintf(stderr, "%s\tinner_size%lu\tinner_blocksize %lu\tleaf_size %lu\tleaf_blocksize %lu\n",
+ rt_node_kind_info[i].name,
+ rt_node_kind_info[i].inner_size,
+ rt_node_kind_info[i].inner_blocksize,
+ rt_node_kind_info[i].leaf_size,
+ rt_node_kind_info[i].leaf_blocksize);
+ fprintf(stderr, "max_val = %lu\n", tree->ctl->max_val);
+
+ if (!tree->ctl->root)
+ {
+ fprintf(stderr, "empty tree\n");
+ return;
+ }
+
+ rt_dump_node(tree, tree->ctl->root, 0, true);
+}
+
+#ifdef unused
void
rt_dump_search(radix_tree *tree, uint64 key)
{
@@ -2336,23 +2564,23 @@ rt_dump_search(radix_tree *tree, uint64 key)
int level = 0;
elog(NOTICE, "-----------------------------------------------------------");
- elog(NOTICE, "max_val = %lu (0x%lX)", tree->max_val, tree->max_val);
+ elog(NOTICE, "max_val = %lu (0x%lX)", tree->ctl->max_val, tree->ctl->max_val);
- if (!tree->root)
+ if (!tree->ctl->root)
{
elog(NOTICE, "tree is empty");
return;
}
- if (key > tree->max_val)
+ if (key > tree->ctl->max_val)
{
elog(NOTICE, "key %lu (0x%lX) is larger than max val",
key, key);
return;
}
- node = tree->root;
- shift = tree->root->shift;
+ node = tree->ctl->root;
+ shift = tree->ctl->root->shift;
while (shift >= 0)
{
rt_node *child;
@@ -2377,25 +2605,6 @@ rt_dump_search(radix_tree *tree, uint64 key)
level++;
}
}
+#endif
-void
-rt_dump(radix_tree *tree)
-{
- for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
- fprintf(stderr, "%s\tinner_size%lu\tinner_blocksize %lu\tleaf_size %lu\tleaf_blocksize %lu\n",
- rt_node_kind_info[i].name,
- rt_node_kind_info[i].inner_size,
- rt_node_kind_info[i].inner_blocksize,
- rt_node_kind_info[i].leaf_size,
- rt_node_kind_info[i].leaf_blocksize);
- fprintf(stderr, "max_val = %lu\n", tree->max_val);
-
- if (!tree->root)
- {
- fprintf(stderr, "empty tree\n");
- return;
- }
-
- rt_dump_node(tree->root, 0, true);
-}
#endif
diff --git a/src/backend/utils/mmgr/dsa.c b/src/backend/utils/mmgr/dsa.c
index 82376fde2d..ad169882af 100644
--- a/src/backend/utils/mmgr/dsa.c
+++ b/src/backend/utils/mmgr/dsa.c
@@ -1024,6 +1024,18 @@ dsa_set_size_limit(dsa_area *area, size_t limit)
LWLockRelease(DSA_AREA_LOCK(area));
}
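+/*
+ * Return the total size of the segments currently backing the given area.
+ */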
+size_t
+dsa_get_total_size(dsa_area *area)
+{
+ size_t size;
+
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_SHARED);
+ size = area->control->total_segment_size;
+ LWLockRelease(DSA_AREA_LOCK(area));
+
+ return size;
+}
+
/*
* Aggressively free all spare memory in the hope of returning DSM segments to
* the operating system.
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index d5d7668617..d9d8355c21 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -14,18 +14,22 @@
#define RADIXTREE_H
#include "postgres.h"
+#include "utils/dsa.h"
#define RT_DEBUG 1
typedef struct radix_tree radix_tree;
typedef struct rt_iter rt_iter;
-extern radix_tree *rt_create(MemoryContext ctx);
+extern radix_tree *rt_create(MemoryContext ctx, dsa_area *dsa);
extern void rt_free(radix_tree *tree);
extern bool rt_search(radix_tree *tree, uint64 key, uint64 *val_p);
extern bool rt_set(radix_tree *tree, uint64 key, uint64 val);
extern rt_iter *rt_begin_iterate(radix_tree *tree);
+extern dsa_pointer rt_get_dsa_pointer(radix_tree *tree);
+extern radix_tree *rt_attach(dsa_area *dsa, dsa_pointer dp);
+
extern bool rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p);
extern void rt_end_iterate(rt_iter *iter);
extern bool rt_delete(radix_tree *tree, uint64 key);
diff --git a/src/include/utils/dsa.h b/src/include/utils/dsa.h
index 405606fe2f..dad06adecc 100644
--- a/src/include/utils/dsa.h
+++ b/src/include/utils/dsa.h
@@ -117,6 +117,7 @@ extern dsa_handle dsa_get_handle(dsa_area *area);
extern dsa_pointer dsa_allocate_extended(dsa_area *area, size_t size, int flags);
extern void dsa_free(dsa_area *area, dsa_pointer dp);
extern void *dsa_get_address(dsa_area *area, dsa_pointer dp);
+extern size_t dsa_get_total_size(dsa_area *area);
extern void dsa_trim(dsa_area *area);
extern void dsa_dump(dsa_area *area);
diff --git a/src/test/modules/test_radixtree/expected/test_radixtree.out b/src/test/modules/test_radixtree/expected/test_radixtree.out
index cc6970c87c..a0ff1e1c77 100644
--- a/src/test/modules/test_radixtree/expected/test_radixtree.out
+++ b/src/test/modules/test_radixtree/expected/test_radixtree.out
@@ -5,21 +5,38 @@ CREATE EXTENSION test_radixtree;
--
SELECT test_radixtree();
NOTICE: testing radix tree node types with shift "0"
+NOTICE: testing radix tree node types with shift "0"
+NOTICE: testing radix tree node types with shift "8"
NOTICE: testing radix tree node types with shift "8"
NOTICE: testing radix tree node types with shift "16"
+NOTICE: testing radix tree node types with shift "16"
NOTICE: testing radix tree node types with shift "24"
+NOTICE: testing radix tree node types with shift "24"
+NOTICE: testing radix tree node types with shift "32"
NOTICE: testing radix tree node types with shift "32"
NOTICE: testing radix tree node types with shift "40"
+NOTICE: testing radix tree node types with shift "40"
+NOTICE: testing radix tree node types with shift "48"
NOTICE: testing radix tree node types with shift "48"
NOTICE: testing radix tree node types with shift "56"
+NOTICE: testing radix tree node types with shift "56"
+NOTICE: testing radix tree with pattern "all ones"
NOTICE: testing radix tree with pattern "all ones"
NOTICE: testing radix tree with pattern "alternating bits"
+NOTICE: testing radix tree with pattern "alternating bits"
+NOTICE: testing radix tree with pattern "clusters of ten"
NOTICE: testing radix tree with pattern "clusters of ten"
NOTICE: testing radix tree with pattern "clusters of hundred"
+NOTICE: testing radix tree with pattern "clusters of hundred"
+NOTICE: testing radix tree with pattern "one-every-64k"
NOTICE: testing radix tree with pattern "one-every-64k"
NOTICE: testing radix tree with pattern "sparse"
+NOTICE: testing radix tree with pattern "sparse"
+NOTICE: testing radix tree with pattern "single values, distance > 2^32"
NOTICE: testing radix tree with pattern "single values, distance > 2^32"
NOTICE: testing radix tree with pattern "clusters, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^60"
NOTICE: testing radix tree with pattern "clusters, distance > 2^60"
test_radixtree
----------------
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
index cb3596755d..a08495834e 100644
--- a/src/test/modules/test_radixtree/test_radixtree.c
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -19,6 +19,7 @@
#include "nodes/bitmapset.h"
#include "storage/block.h"
#include "storage/itemptr.h"
+#include "storage/lwlock.h"
#include "utils/memutils.h"
#include "utils/timestamp.h"
@@ -111,7 +112,7 @@ test_empty(void)
radix_tree *radixtree;
uint64 dummy;
- radixtree = rt_create(CurrentMemoryContext);
+ radixtree = rt_create(CurrentMemoryContext, NULL);
if (rt_search(radixtree, 0, &dummy))
elog(ERROR, "rt_search on empty tree returned true");
@@ -217,14 +218,10 @@ test_node_types_delete(radix_tree *radixtree, uint8 shift)
* level.
*/
static void
-test_node_types(uint8 shift)
+do_test_node_types(radix_tree *radixtree, uint8 shift)
{
- radix_tree *radixtree;
-
elog(NOTICE, "testing radix tree node types with shift \"%d\"", shift);
- radixtree = rt_create(CurrentMemoryContext);
-
/*
* Insert and search entries for every node type at the 'shift' level,
* then delete all entries to make it empty, and insert and search entries
@@ -233,19 +230,38 @@ test_node_types(uint8 shift)
test_node_types_insert(radixtree, shift);
test_node_types_delete(radixtree, shift);
test_node_types_insert(radixtree, shift);
+}
- rt_free(radixtree);
+static void
+test_node_types(void)
+{
+ int tranche_id = LWLockNewTrancheId();
+
+ for (int shift = 0; shift <= (64 - 8); shift += 8)
+ {
+ radix_tree *tree;
+ dsa_area *dsa;
+
+ /* Test the local radix tree */
+ tree = rt_create(CurrentMemoryContext, NULL);
+ do_test_node_types(tree, shift);
+ rt_free(tree);
+
+ /* Test the shared radix tree */
+ dsa = dsa_create(tranche_id);
+ tree = rt_create(CurrentMemoryContext, dsa);
+ do_test_node_types(tree, shift);
+ rt_free(tree);
+ }
}
/*
* Test with a repeating pattern, defined by the 'spec'.
*/
static void
-test_pattern(const test_spec * spec)
+do_test_pattern(radix_tree *radixtree, const test_spec * spec)
{
- radix_tree *radixtree;
rt_iter *iter;
- MemoryContext radixtree_ctx;
TimestampTz starttime;
TimestampTz endtime;
uint64 n;
@@ -271,18 +287,6 @@ test_pattern(const test_spec * spec)
pattern_values[pattern_num_values++] = i;
}
- /*
- * Allocate the radix tree.
- *
- * Allocate it in a separate memory context, so that we can print its
- * memory usage easily.
- */
- radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
- "radixtree test",
- ALLOCSET_SMALL_SIZES);
- MemoryContextSetIdentifier(radixtree_ctx, spec->test_name);
- radixtree = rt_create(radixtree_ctx);
-
/*
* Add values to the set.
*/
@@ -336,8 +340,6 @@ test_pattern(const test_spec * spec)
mem_usage = rt_memory_usage(radixtree);
fprintf(stderr, "rt_memory_usage() reported " UINT64_FORMAT " (%0.2f bytes / integer)\n",
mem_usage, (double) mem_usage / spec->num_values);
-
- MemoryContextStats(radixtree_ctx);
}
/* Check that rt_num_entries works */
@@ -484,21 +486,53 @@ test_pattern(const test_spec * spec)
if ((nbefore - ndeleted) != nafter)
elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT "after " UINT64_FORMAT " deletion",
nafter, (nbefore - ndeleted), ndeleted);
+}
+
+static void
+test_patterns(void)
+{
+ int tranche_id = LWLockNewTrancheId();
+
+ /* Test different test patterns, with lots of entries */
+ for (int i = 0; i < lengthof(test_specs); i++)
+ {
+ radix_tree *tree;
+ MemoryContext radixtree_ctx;
+ dsa_area *dsa;
+ const test_spec *spec = &test_specs[i];
- MemoryContextDelete(radixtree_ctx);
+ /*
+ * Allocate the radix tree.
+ *
+ * Allocate it in a separate memory context, so that we can print its
+ * memory usage easily.
+ */
+ radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
+ "radixtree test",
+ ALLOCSET_SMALL_SIZES);
+ MemoryContextSetIdentifier(radixtree_ctx, spec->test_name);
+
+ /* Test the local radix tree */
+ tree = rt_create(radixtree_ctx, NULL);
+ do_test_pattern(tree, spec);
+ rt_free(tree);
+ MemoryContextReset(radixtree_ctx);
+
+ /* Test the shared radix tree */
+ dsa = dsa_create(tranche_id);
+ tree = rt_create(radixtree_ctx, dsa);
+ do_test_pattern(tree, spec);
+ rt_free(tree);
+ MemoryContextDelete(radixtree_ctx);
+ }
}
Datum
test_radixtree(PG_FUNCTION_ARGS)
{
test_empty();
-
- for (int shift = 0; shift <= (64 - 8); shift += 8)
- test_node_types(shift);
-
- /* Test different test patterns, with lots of entries */
- for (int i = 0; i < lengthof(test_specs); i++)
- test_pattern(&test_specs[i]);
+ test_node_types();
+ test_patterns();
PG_RETURN_VOID();
}
--
2.31.1
v8-0002-Add-radix-implementation.patchapplication/octet-stream; name=v8-0002-Add-radix-implementation.patchDownload
From 45a5a064b71dc6f58d333984a7a571cc3cd80e63 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 14 Sep 2022 12:38:51 +0000
Subject: [PATCH v8 2/4] Add radix implementation.
---
src/backend/lib/Makefile | 1 +
src/backend/lib/meson.build | 1 +
src/backend/lib/radixtree.c | 2401 +++++++++++++++++
src/include/lib/radixtree.h | 42 +
src/test/modules/Makefile | 1 +
src/test/modules/meson.build | 1 +
src/test/modules/test_radixtree/.gitignore | 4 +
src/test/modules/test_radixtree/Makefile | 23 +
src/test/modules/test_radixtree/README | 7 +
.../expected/test_radixtree.out | 28 +
src/test/modules/test_radixtree/meson.build | 34 +
.../test_radixtree/sql/test_radixtree.sql | 7 +
.../test_radixtree/test_radixtree--1.0.sql | 8 +
.../modules/test_radixtree/test_radixtree.c | 504 ++++
.../test_radixtree/test_radixtree.control | 4 +
15 files changed, 3066 insertions(+)
create mode 100644 src/backend/lib/radixtree.c
create mode 100644 src/include/lib/radixtree.h
create mode 100644 src/test/modules/test_radixtree/.gitignore
create mode 100644 src/test/modules/test_radixtree/Makefile
create mode 100644 src/test/modules/test_radixtree/README
create mode 100644 src/test/modules/test_radixtree/expected/test_radixtree.out
create mode 100644 src/test/modules/test_radixtree/meson.build
create mode 100644 src/test/modules/test_radixtree/sql/test_radixtree.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree--1.0.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree.c
create mode 100644 src/test/modules/test_radixtree/test_radixtree.control
diff --git a/src/backend/lib/Makefile b/src/backend/lib/Makefile
index 9dad31398a..4c1db794b6 100644
--- a/src/backend/lib/Makefile
+++ b/src/backend/lib/Makefile
@@ -22,6 +22,7 @@ OBJS = \
integerset.o \
knapsack.o \
pairingheap.o \
+ radixtree.o \
rbtree.o \
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/lib/meson.build b/src/backend/lib/meson.build
index 48da1bddce..4303d306cd 100644
--- a/src/backend/lib/meson.build
+++ b/src/backend/lib/meson.build
@@ -9,4 +9,5 @@ backend_sources += files(
'knapsack.c',
'pairingheap.c',
'rbtree.c',
+ 'radixtree.c',
)
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
new file mode 100644
index 0000000000..b239b3c615
--- /dev/null
+++ b/src/backend/lib/radixtree.c
@@ -0,0 +1,2401 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.c
+ * Implementation for adaptive radix tree.
+ *
+ * This module employs the idea from the paper "The Adaptive Radix Tree: ARTful
+ * Indexing for Main-Memory Databases" by Viktor Leis, Alfons Kemper, and Thomas
+ * Neumann, 2013. The radix tree uses adaptive node sizes, a small number of node
+ * types, each with a different number of elements. Depending on the number of
+ * children, the appropriate node type is used.
+ *
+ * There are some differences from the proposed implementation. For instance,
+ * there is no support for path compression and lazy path expansion. The radix
+ * tree supports only fixed-length keys, so we don't expect the tree to become
+ * very tall.
+ *
+ * Both the key and the value are 64-bit unsigned integers. The inner nodes and
+ * the leaf nodes have slightly different structures: inner nodes, with
+ * shift > 0, store pointers to their child nodes as values, whereas leaf nodes,
+ * with shift == 0, store the 64-bit unsigned integers specified by the user as
+ * values. The paper refers to this technique as "Multi-value leaves". We
+ * choose it to avoid an additional pointer traversal. This is also the reason
+ * this code currently does not support variable-length keys.
+ *
+ * XXX: Most functions in this file have two variants for inner nodes and leaf
+ * nodes, so there is some code duplication. While this sometimes makes code
+ * maintenance tricky, it reduces branch prediction misses when judging
+ * whether a node is an inner node or a leaf node.
+ *
+ * XXX: radix tree nodes are never shrunk.
+ *
+ * Interface
+ * ---------
+ *
+ * rt_create - Create a new, empty radix tree
+ * rt_free - Free the radix tree
+ * rt_search - Search a key-value pair
+ * rt_set - Set a key-value pair
+ * rt_delete - Delete a key-value pair
+ * rt_begin_iterate - Begin iterating through all key-value pairs
+ * rt_iterate_next - Return next key-value pair, if any
+ * rt_end_iterate - End iteration
+ * rt_memory_usage - Get the memory usage
+ * rt_num_entries - Get the number of key-value pairs
+ *
+ * rt_create() creates an empty radix tree in the given memory context, along
+ * with child memory contexts for each kind of radix tree node.
+ *
+ * rt_iterate_next() is guaranteed to return key-value pairs in the ascending
+ * order of the key.
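+ *
+ * A minimal usage sketch (illustrative only, using the functions declared in
+ * lib/radixtree.h):
+ *
+ *   radix_tree *tree = rt_create(CurrentMemoryContext);
+ *   uint64 val;
+ *
+ *   rt_set(tree, 42, 9000);           -- map key 42 to value 9000
+ *   if (rt_search(tree, 42, &val))    -- val is set to 9000
+ *       ...
+ *   rt_free(tree);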
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/lib/radixtree.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "miscadmin.h"
+#include "port/pg_bitutils.h"
+#include "port/pg_lfind.h"
+#include "utils/memutils.h"
+#include "lib/radixtree.h"
+#include "lib/stringinfo.h"
+
+/* The number of bits encoded in one tree level */
+#define RT_NODE_SPAN BITS_PER_BYTE
+
+/* The number of maximum slots in the node */
+#define RT_NODE_MAX_SLOTS (1 << RT_NODE_SPAN)
+
+/*
+ * Return the size in bytes of the is-set bitmap covering nslots slots, used
+ * by nodes whose slots are indexed by array lookup.
+ */
+#define RT_NODE_NSLOTS_BITS(nslots) ((nslots) / (sizeof(uint8) * BITS_PER_BYTE))
+
+/* Mask for extracting a chunk from the key */
+#define RT_CHUNK_MASK ((1 << RT_NODE_SPAN) - 1)
+
+/* Maximum shift the radix tree uses */
+#define RT_MAX_SHIFT key_get_shift(UINT64_MAX)
+
+/* Maximum number of levels the radix tree can have */
+#define RT_MAX_LEVEL ((sizeof(uint64) * BITS_PER_BYTE) / RT_NODE_SPAN)
+
+/* Invalid index used in node-128 */
+#define RT_NODE_128_INVALID_IDX 0xFF
+
+/* Get a chunk from the key */
+#define RT_GET_KEY_CHUNK(key, shift) ((uint8) (((key) >> (shift)) & RT_CHUNK_MASK))
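+
+/*
+ * For example, with RT_NODE_SPAN = 8 the key 0x0102030405060708 decomposes
+ * into the chunks 0x01 (shift 56) through 0x08 (shift 0);
+ * RT_GET_KEY_CHUNK(0x0102030405060708, 8) yields 0x07.
+ */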
+
+/*
+ * Mapping from a slot number to the corresponding byte and bit in the is-set bitmap.
+ */
+#define RT_NODE_BITMAP_BYTE(v) ((v) / BITS_PER_BYTE)
+#define RT_NODE_BITMAP_BIT(v) (UINT64CONST(1) << ((v) % RT_NODE_SPAN))
+
+/* Enum used by rt_node_search_inner() and rt_node_search_leaf() */
+typedef enum
+{
+ RT_ACTION_FIND = 0, /* find the key-value */
+ RT_ACTION_DELETE, /* delete the key-value */
+} rt_action;
+
+/*
+ * Supported radix tree node kinds.
+ *
+ * XXX: These are currently not well chosen. To reduce memory fragmentation
+ * a smaller class should optimally fit neatly into the next larger class
+ * (except perhaps at the lowest end). Right now it's
+ * 40/40 -> 296/286 -> 1288/1304 -> 2056/2088 bytes for inner nodes and
+ * leaf nodes, respectively, leading to a large amount of allocator padding
+ * with aset.c. Hence the use of slab.
+ *
+ * XXX: do we need a node-1 kind as long as there is no path compression optimization?
+ *
+ * XXX: need to explain why we choose these node types based on benchmark
+ * results etc.
+ */
+#define RT_NODE_KIND_4 0x00
+#define RT_NODE_KIND_32 0x01
+#define RT_NODE_KIND_128 0x02
+#define RT_NODE_KIND_256 0x03
+#define RT_NODE_KIND_COUNT 4
+
+/* Common type for all node types */
+typedef struct rt_node
+{
+ /*
+ * Number of children. We use uint16 to be able to indicate 256 children
+ * at the fanout of 8.
+ */
+ uint16 count;
+
+ /*
+ * Shift indicates which part of the key space is represented by this
+ * node. That is, the key is shifted by 'shift' and the lowest
+ * RT_NODE_SPAN bits are then represented in chunk.
+ */
+ uint8 shift;
+ uint8 chunk;
+
+ /* Size kind of the node */
+ uint8 kind;
+} rt_node;
+#define NODE_IS_LEAF(n) (((rt_node *) (n))->shift == 0)
+#define NODE_IS_EMPTY(n) (((rt_node *) (n))->count == 0)
+#define NODE_HAS_FREE_SLOT(n) \
+ (((rt_node *) (n))->count < rt_node_kind_info[((rt_node *) (n))->kind].fanout)
+
+/* Base types of each node kind, shared by leaf and inner nodes */
+typedef struct rt_node_base_4
+{
+ rt_node n;
+
+ /* key chunks for up to 4 entries */
+ uint8 chunks[4];
+} rt_node_base_4;
+
+typedef struct rt_node_base32
+{
+ rt_node n;
+
+ /* key chunks for up to 32 entries */
+ uint8 chunks[32];
+} rt_node_base_32;
+
+/*
+ * node-128 uses a slot_idxs array, an array of length RT_NODE_MAX_SLOTS, typically
+ * 256, to store indexes into a second array that contains up to 128 values (or
+ * child pointers in inner nodes).
+ */
+typedef struct rt_node_base128
+{
+ rt_node n;
+
+ /* The slot index for each chunk; RT_NODE_128_INVALID_IDX if unused */
+ uint8 slot_idxs[RT_NODE_MAX_SLOTS];
+} rt_node_base_128;
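+
+/*
+ * For illustration: after inserting the chunk 0x2A into a node-128,
+ * slot_idxs[0x2A] holds the slot position assigned to that chunk, and the
+ * value (or child pointer) lives at that position in the values (or children)
+ * array. Chunks whose slot_idxs entry is RT_NODE_128_INVALID_IDX are unused.
+ */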
+
+typedef struct rt_node_base256
+{
+ rt_node n;
+} rt_node_base_256;
+
+/*
+ * Inner and leaf nodes.
+ *
+ * These are separate from inner node size classes for two main reasons:
+ *
+ * 1) the value type might be different than something fitting into a pointer
+ * width type
+ * 2) Need to represent non-existing values in a key-type independent way.
+ *
+ * 1) is clearly worth being concerned about, but it's not clear 2) is as
+ * good. It might be better to just indicate non-existing entries the same way
+ * in inner nodes.
+ */
+typedef struct rt_node_inner_4
+{
+ rt_node_base_4 base;
+
+ /* 4 children, for key chunks */
+ rt_node *children[4];
+} rt_node_inner_4;
+
+typedef struct rt_node_leaf_4
+{
+ rt_node_base_4 base;
+
+ /* 4 values, for key chunks */
+ uint64 values[4];
+} rt_node_leaf_4;
+
+typedef struct rt_node_inner_32
+{
+ rt_node_base_32 base;
+
+ /* 32 children, for key chunks */
+ rt_node *children[32];
+} rt_node_inner_32;
+
+typedef struct rt_node_leaf_32
+{
+ rt_node_base_32 base;
+
+ /* 32 values, for key chunks */
+ uint64 values[32];
+} rt_node_leaf_32;
+
+typedef struct rt_node_inner_128
+{
+ rt_node_base_128 base;
+
+ /* Slots for 128 children */
+ rt_node *children[128];
+} rt_node_inner_128;
+
+typedef struct rt_node_leaf_128
+{
+ rt_node_base_128 base;
+
+ /* isset is a bitmap to track which slot is in use */
+ uint8 isset[RT_NODE_NSLOTS_BITS(128)];
+
+ /* Slots for 128 values */
+ uint64 values[128];
+} rt_node_leaf_128;
+
+/*
+ * node-256 is the largest node type. This node has an array of length
+ * RT_NODE_MAX_SLOTS for directly storing values (or child pointers in inner nodes).
+ */
+typedef struct rt_node_inner_256
+{
+ rt_node_base_256 base;
+
+ /* Slots for 256 children */
+ rt_node *children[RT_NODE_MAX_SLOTS];
+} rt_node_inner_256;
+
+typedef struct rt_node_leaf_256
+{
+ rt_node_base_256 base;
+
+ /* isset is a bitmap to track which slot is in use */
+ uint8 isset[RT_NODE_NSLOTS_BITS(RT_NODE_MAX_SLOTS)];
+
+ /* Slots for 256 values */
+ uint64 values[RT_NODE_MAX_SLOTS];
+} rt_node_leaf_256;
+
+/* Information for each size kind */
+typedef struct rt_node_kind_info_elem
+{
+ const char *name;
+ int fanout;
+
+ /* slab chunk size */
+ Size inner_size;
+ Size leaf_size;
+
+ /* slab block size */
+ Size inner_blocksize;
+ Size leaf_blocksize;
+} rt_node_kind_info_elem;
+
+/*
+ * Calculate the slab blocksize so that we can allocate at least 32 chunks
+ * from the block.
+ */
+#define NODE_SLAB_BLOCK_SIZE(size) \
+ Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * size, (size) * 32)
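+
+/*
+ * For example, assuming the default 8kB slab block size: a 40-byte node-4
+ * gets Max(8160, 1280) = 8160-byte blocks, while a 2088-byte leaf node-256
+ * gets Max(6264, 66816) = 66816-byte blocks, so at least 32 nodes always fit
+ * into one block.
+ */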
+static rt_node_kind_info_elem rt_node_kind_info[RT_NODE_KIND_COUNT] = {
+
+ [RT_NODE_KIND_4] = {
+ .name = "radix tree node 4",
+ .fanout = 4,
+ .inner_size = sizeof(rt_node_inner_4),
+ .leaf_size = sizeof(rt_node_leaf_4),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_4)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_4)),
+ },
+ [RT_NODE_KIND_32] = {
+ .name = "radix tree node 32",
+ .fanout = 32,
+ .inner_size = sizeof(rt_node_inner_32),
+ .leaf_size = sizeof(rt_node_leaf_32),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_32)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_32)),
+ },
+ [RT_NODE_KIND_128] = {
+ .name = "radix tree node 128",
+ .fanout = 128,
+ .inner_size = sizeof(rt_node_inner_128),
+ .leaf_size = sizeof(rt_node_leaf_128),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_128)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_128)),
+ },
+ [RT_NODE_KIND_256] = {
+ .name = "radix tree node 256",
+ .fanout = 256,
+ .inner_size = sizeof(rt_node_inner_256),
+ .leaf_size = sizeof(rt_node_leaf_256),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_256)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_256)),
+ },
+};
+
+/*
+ * Iteration support.
+ *
+ * Iterating the radix tree returns each pair of key and value in the ascending
+ * order of the key. To support this, we iterate over the nodes at each level.
+ *
+ * rt_node_iter struct is used to track the iteration within a node.
+ *
+ * rt_iter is the struct for iteration of the radix tree, and uses rt_node_iter
+ * in order to track the iteration of each level. During the iteration, we also
+ * construct the key whenever updating the node iteration information, e.g., when
+ * advancing the current index within the node or when moving to the next node
+ * at the same level.
+ */
+typedef struct rt_node_iter
+{
+ rt_node *node; /* current node being iterated */
+ int current_idx; /* current position. -1 for initial value */
+} rt_node_iter;
+
+struct rt_iter
+{
+ radix_tree *tree;
+
+ /* Track the iteration on nodes of each level */
+ rt_node_iter stack[RT_MAX_LEVEL];
+ int stack_len;
+
+ /* The key is being constructed during the iteration */
+ uint64 key;
+};
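+
+/*
+ * A typical iteration loop looks like this (illustrative sketch):
+ *
+ *   rt_iter *iter = rt_begin_iterate(tree);
+ *   uint64 key, value;
+ *
+ *   while (rt_iterate_next(iter, &key, &value))
+ *       ... pairs are returned in ascending key order ...
+ *   rt_end_iterate(iter);
+ */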
+
+/* A radix tree with nodes */
+struct radix_tree
+{
+ MemoryContext context;
+
+ rt_node *root;
+ uint64 max_val;
+ uint64 num_keys;
+
+ MemoryContextData *inner_slabs[RT_NODE_KIND_COUNT];
+ MemoryContextData *leaf_slabs[RT_NODE_KIND_COUNT];
+
+ /* statistics */
+#ifdef RT_DEBUG
+ int32 cnt[RT_NODE_KIND_COUNT];
+#endif
+};
+
+static void rt_new_root(radix_tree *tree, uint64 key);
+static rt_node *rt_alloc_node(radix_tree *tree, int kind, uint8 shift, uint8 chunk,
+ bool inner);
+static void rt_free_node(radix_tree *tree, rt_node *node);
+static void rt_extend(radix_tree *tree, uint64 key);
+static inline bool rt_node_search_inner(rt_node *node, uint64 key, rt_action action,
+ rt_node **child_p);
+static inline bool rt_node_search_leaf(rt_node *node, uint64 key, rt_action action,
+ uint64 *value_p);
+static bool rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node,
+ uint64 key, rt_node *child);
+static bool rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
+ uint64 key, uint64 value);
+static inline rt_node *rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter);
+static inline bool rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
+ uint64 *value_p);
+static void rt_update_iter_stack(rt_iter *iter, rt_node *from_node, int from);
+static inline void rt_iter_update_key(rt_iter *iter, uint8 chunk, uint8 shift);
+
+/* verification (available only with assertion) */
+static void rt_verify_node(rt_node *node);
+
+/*
+ * Return the index of the first chunk in the node that equals 'chunk'. Return -1
+ * if there is no such element.
+ */
+static inline int
+node_4_search_eq(rt_node_base_4 *node, uint8 chunk)
+{
+ int idx = -1;
+
+ for (int i = 0; i < node->n.count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ idx = i;
+ break;
+ }
+ }
+
+ return idx;
+}
+
+/*
+ * Return index of the chunk to insert into chunks in the given node.
+ */
+static inline int
+node_4_get_insertpos(rt_node_base_4 *node, uint8 chunk)
+{
+ int idx;
+
+ for (idx = 0; idx < node->n.count; idx++)
+ {
+ if (node->chunks[idx] >= chunk)
+ break;
+ }
+
+ return idx;
+}
+
+/*
+ * Return the index of the first chunk in the node that equals 'chunk'. Return -1
+ * if there is no such element.
+ */
+static inline int
+node_32_search_eq(rt_node_base_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ uint32 bitfield;
+ int index_simd = -1;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index = -1;
+
+ for (int i = 0; i < count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ index = i;
+ break;
+ }
+ }
+#endif
+
+#ifndef USE_NO_SIMD
+ spread_chunk = vector8_broadcast(chunk);
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ cmp1 = vector8_eq(spread_chunk, haystack1);
+ cmp2 = vector8_eq(spread_chunk, haystack2);
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+/*
+ * Return index of the chunk to insert into chunks in the given node.
+ */
+static inline int
+node_32_get_insertpos(rt_node_base_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ Vector8 min1;
+ Vector8 min2;
+ uint32 bitfield;
+ int index_simd;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index;
+
+ for (index = 0; index < count; index++)
+ {
+ if (node->chunks[index] >= chunk)
+ break;
+ }
+#endif
+
+#ifndef USE_NO_SIMD
+ spread_chunk = vector8_broadcast(chunk);
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ min1 = vector8_min(spread_chunk, haystack1);
+ min2 = vector8_min(spread_chunk, haystack2);
+ cmp1 = vector8_eq(spread_chunk, min1);
+ cmp2 = vector8_eq(spread_chunk, min2);
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+ else
+ index_simd = count;
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+/*
+ * Functions to manipulate both chunks array and children/values array.
+ * These are used for node-4 and node-32.
+ */
+
+/* Shift the elements right at 'idx' by one */
+static inline void
+chunk_children_array_shift(uint8 *chunks, rt_node **children, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+ memmove(&(children[idx + 1]), &(children[idx]), sizeof(rt_node *) * (count - idx));
+}
+
+static inline void
+chunk_values_array_shift(uint8 *chunks, uint64 *values, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+ memmove(&(values[idx + 1]), &(values[idx]), sizeof(uint64 *) * (count - idx));
+}
+
+/* Delete the element at 'idx' */
+static inline void
+chunk_children_array_delete(uint8 *chunks, rt_node **children, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(children[idx]), &(children[idx + 1]), sizeof(rt_node *) * (count - idx - 1));
+}
+
+static inline void
+chunk_values_array_delete(uint8 *chunks, uint64 *values, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(values[idx]), &(values[idx + 1]), sizeof(uint64) * (count - idx - 1));
+}
+
+/* Copy both chunks and children/values arrays */
+static inline void
+chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
+ uint8 *dst_chunks, rt_node **dst_children, int count)
+{
+ /* For better code generation */
+ if (count > rt_node_kind_info[RT_NODE_KIND_4].fanout)
+ pg_unreachable();
+
+ memcpy(dst_chunks, src_chunks, sizeof(uint8) * count);
+ memcpy(dst_children, src_children, sizeof(rt_node *) * count);
+}
+
+static inline void
+chunk_values_array_copy(uint8 *src_chunks, uint64 *src_values,
+ uint8 *dst_chunks, uint64 *dst_values, int count)
+{
+ /* For better code generation */
+ if (count > rt_node_kind_info[RT_NODE_KIND_4].fanout)
+ pg_unreachable();
+
+ memcpy(dst_chunks, src_chunks, sizeof(uint8) * count);
+ memcpy(dst_values, src_values, sizeof(uint64) * count);
+}
+
+/* Functions to manipulate inner and leaf node-128 */
+
+/* Does the given chunk in the node have a value? */
+static inline bool
+node_128_is_chunk_used(rt_node_base_128 *node, uint8 chunk)
+{
+ return node->slot_idxs[chunk] != RT_NODE_128_INVALID_IDX;
+}
+
+/* Is the slot in the node used? */
+static inline bool
+node_inner_128_is_slot_used(rt_node_inner_128 *node, uint8 slot)
+{
+ Assert(!NODE_IS_LEAF(node));
+ return (node->children[slot] != NULL);
+}
+
+static inline bool
+node_leaf_128_is_slot_used(rt_node_leaf_128 *node, uint8 slot)
+{
+ Assert(NODE_IS_LEAF(node));
+ return ((node->isset[RT_NODE_BITMAP_BYTE(slot)] & RT_NODE_BITMAP_BIT(slot)) != 0);
+}
+
+static inline rt_node *
+node_inner_128_get_child(rt_node_inner_128 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ return node->children[node->base.slot_idxs[chunk]];
+}
+
+static inline uint64
+node_leaf_128_get_value(rt_node_leaf_128 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ Assert(((rt_node_base_128 *) node)->slot_idxs[chunk] != RT_NODE_128_INVALID_IDX);
+ return node->values[node->base.slot_idxs[chunk]];
+}
+
+static void
+node_inner_128_delete(rt_node_inner_128 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->base.slot_idxs[chunk] = RT_NODE_128_INVALID_IDX;
+}
+
+static void
+node_leaf_128_delete(rt_node_leaf_128 *node, uint8 chunk)
+{
+ int slotpos = node->base.slot_idxs[chunk];
+
+ Assert(NODE_IS_LEAF(node));
+ node->isset[RT_NODE_BITMAP_BYTE(slotpos)] &= ~(RT_NODE_BITMAP_BIT(slotpos));
+ node->base.slot_idxs[chunk] = RT_NODE_128_INVALID_IDX;
+}
+
+/* Return an unused slot in node-128 */
+static int
+node_inner_128_find_unused_slot(rt_node_inner_128 *node, uint8 chunk)
+{
+ int slotpos = 0;
+
+ Assert(!NODE_IS_LEAF(node));
+ while (node_inner_128_is_slot_used(node, slotpos))
+ slotpos++;
+
+ return slotpos;
+}
+
+static int
+node_leaf_128_find_unused_slot(rt_node_leaf_128 *node, uint8 chunk)
+{
+ int slotpos;
+
+ Assert(NODE_IS_LEAF(node));
+
+ /* We iterate over the isset bitmap per byte then check each bit */
+ for (slotpos = 0; slotpos < RT_NODE_NSLOTS_BITS(128); slotpos++)
+ {
+ if (node->isset[slotpos] < 0xFF)
+ break;
+ }
+ Assert(slotpos < RT_NODE_NSLOTS_BITS(128));
+
+ slotpos *= BITS_PER_BYTE;
+ while (node_leaf_128_is_slot_used(node, slotpos))
+ slotpos++;
+
+ return slotpos;
+}
+
+static inline void
+node_inner_128_insert(rt_node_inner_128 *node, uint8 chunk, rt_node *child)
+{
+ int slotpos;
+
+ Assert(!NODE_IS_LEAF(node));
+
+ /* find unused slot */
+ slotpos = node_inner_128_find_unused_slot(node, chunk);
+
+ node->base.slot_idxs[chunk] = slotpos;
+ node->children[slotpos] = child;
+}
+
+/* Set the slot at the corresponding chunk */
+static inline void
+node_leaf_128_insert(rt_node_leaf_128 *node, uint8 chunk, uint64 value)
+{
+ int slotpos;
+
+ Assert(NODE_IS_LEAF(node));
+
+ /* find unused slot */
+ slotpos = node_leaf_128_find_unused_slot(node, chunk);
+
+ node->base.slot_idxs[chunk] = slotpos;
+ node->isset[RT_NODE_BITMAP_BYTE(slotpos)] |= RT_NODE_BITMAP_BIT(slotpos);
+ node->values[slotpos] = value;
+}
+
+/* Update the child corresponding to 'chunk' to 'child' */
+static inline void
+node_inner_128_update(rt_node_inner_128 *node, uint8 chunk, rt_node *child)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->children[node->base.slot_idxs[chunk]] = child;
+}
+
+static inline void
+node_leaf_128_update(rt_node_leaf_128 *node, uint8 chunk, uint64 value)
+{
+ Assert(NODE_IS_LEAF(node));
+ node->values[node->base.slot_idxs[chunk]] = value;
+}
+
+/* Functions to manipulate inner and leaf node-256 */
+
+/* Return true if the slot corresponding to the given chunk is in use */
+static inline bool
+node_inner_256_is_chunk_used(rt_node_inner_256 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ return (node->children[chunk] != NULL);
+}
+
+static inline bool
+node_leaf_256_is_chunk_used(rt_node_leaf_256 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ return (node->isset[RT_NODE_BITMAP_BYTE(chunk)] & RT_NODE_BITMAP_BIT(chunk)) != 0;
+}
+
+static inline rt_node *
+node_inner_256_get_child(rt_node_inner_256 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ Assert(node_inner_256_is_chunk_used(node, chunk));
+ return node->children[chunk];
+}
+
+static inline uint64
+node_leaf_256_get_value(rt_node_leaf_256 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ Assert(node_leaf_256_is_chunk_used(node, chunk));
+ return node->values[chunk];
+}
+
+/* Set the child in the node-256 */
+static inline void
+node_inner_256_set(rt_node_inner_256 *node, uint8 chunk, rt_node *child)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->children[chunk] = child;
+}
+
+/* Set the value in the node-256 */
+static inline void
+node_leaf_256_set(rt_node_leaf_256 *node, uint8 chunk, uint64 value)
+{
+ Assert(NODE_IS_LEAF(node));
+ node->isset[RT_NODE_BITMAP_BYTE(chunk)] |= RT_NODE_BITMAP_BIT(chunk);
+ node->values[chunk] = value;
+}
+
+/* Delete the slot at the given chunk position */
+static inline void
+node_inner_256_delete(rt_node_inner_256 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->children[chunk] = NULL;
+}
+
+static inline void
+node_leaf_256_delete(rt_node_leaf_256 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ node->isset[RT_NODE_BITMAP_BYTE(chunk)] &= ~(RT_NODE_BITMAP_BIT(chunk));
+}
+
+/*
+ * Return the shift needed to store the given key.
+ */
+static inline int
+key_get_shift(uint64 key)
+{
+ return (key == 0)
+ ? 0
+ : (pg_leftmost_one_pos64(key) / RT_NODE_SPAN) * RT_NODE_SPAN;
+}
+
+/*
+ * Return the max value stored in a node with the given shift.
+ */
+static uint64
+shift_get_max_val(int shift)
+{
+ if (shift == RT_MAX_SHIFT)
+ return UINT64_MAX;
+
+ return (UINT64CONST(1) << (shift + RT_NODE_SPAN)) - 1;
+}
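+
+/*
+ * For example, key_get_shift(0x10000) returns 16 (the leftmost set bit is bit
+ * 16, leaving two 8-bit levels below the root), and shift_get_max_val(16)
+ * returns 2^24 - 1, the largest key such a tree can store.
+ */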
+
+/*
+ * Create a new node as the root. Subordinate nodes will be created during
+ * the insertion.
+ */
+static void
+rt_new_root(radix_tree *tree, uint64 key)
+{
+ int shift = key_get_shift(key);
+ rt_node *node;
+
+ node = (rt_node *) rt_alloc_node(tree, RT_NODE_KIND_4, shift, 0,
+ shift > 0);
+ tree->max_val = shift_get_max_val(shift);
+ tree->root = node;
+}
+
+/*
+ * Allocate a new node with the given node kind.
+ */
+static rt_node *
+rt_alloc_node(radix_tree *tree, int kind, uint8 shift, uint8 chunk, bool inner)
+{
+ rt_node *newnode;
+
+ if (inner)
+ newnode = (rt_node *) MemoryContextAllocZero(tree->inner_slabs[kind],
+ rt_node_kind_info[kind].inner_size);
+ else
+ newnode = (rt_node *) MemoryContextAllocZero(tree->leaf_slabs[kind],
+ rt_node_kind_info[kind].leaf_size);
+
+ newnode->kind = kind;
+ newnode->shift = shift;
+ newnode->chunk = chunk;
+
+ /* Initialize slot_idxs to invalid values */
+ if (kind == RT_NODE_KIND_128)
+ {
+ rt_node_base_128 *n128 = (rt_node_base_128 *) newnode;
+
+ memset(n128->slot_idxs, RT_NODE_128_INVALID_IDX, sizeof(n128->slot_idxs));
+ }
+
+#ifdef RT_DEBUG
+ /* update the statistics */
+ tree->cnt[kind]++;
+#endif
+
+ return newnode;
+}
+
+/*
+ * Create a new node with 'new_kind' and the same shift, chunk, and
+ * count as 'node'.
+ */
+static rt_node *
+rt_copy_node(radix_tree *tree, rt_node *node, int new_kind)
+{
+ rt_node *newnode;
+
+ newnode = rt_alloc_node(tree, new_kind, node->shift, node->chunk,
+ node->shift > 0);
+ newnode->count = node->count;
+
+ return newnode;
+}
+
+/* Free the given node */
+static void
+rt_free_node(radix_tree *tree, rt_node *node)
+{
+ /* If we're deleting the root node, make the tree empty */
+ if (tree->root == node)
+ tree->root = NULL;
+
+#ifdef RT_DEBUG
+ /* update the statistics */
+ tree->cnt[node->kind]--;
+ Assert(tree->cnt[node->kind] >= 0);
+#endif
+
+ pfree(node);
+}
+
+/*
+ * Replace old_child with new_child, and free the old one.
+ */
+static void
+rt_replace_node(radix_tree *tree, rt_node *parent, rt_node *old_child,
+ rt_node *new_child, uint64 key)
+{
+ Assert(old_child->chunk == new_child->chunk);
+ Assert(old_child->shift == new_child->shift);
+
+ if (parent == old_child)
+ {
+ /* Replace the root node with the new large node */
+ tree->root = new_child;
+ }
+ else
+ {
+ bool replaced PG_USED_FOR_ASSERTS_ONLY;
+
+ replaced = rt_node_insert_inner(tree, NULL, parent, key, new_child);
+ Assert(replaced);
+ }
+
+ rt_free_node(tree, old_child);
+}
+
+/*
+ * The radix tree doesn't have sufficient height. Extend the radix tree so it can
+ * store the key.
+ */
+static void
+rt_extend(radix_tree *tree, uint64 key)
+{
+ int target_shift;
+ int shift = tree->root->shift + RT_NODE_SPAN;
+
+ target_shift = key_get_shift(key);
+
+ /* Grow tree from 'shift' to 'target_shift' */
+ while (shift <= target_shift)
+ {
+ rt_node_inner_4 *node;
+
+ node = (rt_node_inner_4 *) rt_alloc_node(tree, RT_NODE_KIND_4,
+ shift, 0, true);
+ node->base.n.count = 1;
+ node->base.chunks[0] = 0;
+ node->children[0] = tree->root;
+
+ tree->root->chunk = 0;
+ tree->root = (rt_node *) node;
+
+ shift += RT_NODE_SPAN;
+ }
+
+ tree->max_val = shift_get_max_val(target_shift);
+}
+
+/*
+ * The radix tree doesn't yet have inner and leaf nodes for the given key.
+ * Insert inner and leaf nodes from 'node' down to the bottom.
+ */
+static inline void
+rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node *parent,
+ rt_node *node)
+{
+ int shift = node->shift;
+
+ while (shift >= RT_NODE_SPAN)
+ {
+ rt_node *newchild;
+ int newshift = shift - RT_NODE_SPAN;
+
+ newchild = rt_alloc_node(tree, RT_NODE_KIND_4, newshift,
+ RT_GET_KEY_CHUNK(key, node->shift),
+ newshift > 0);
+ rt_node_insert_inner(tree, parent, node, key, newchild);
+
+ parent = node;
+ node = newchild;
+ shift -= RT_NODE_SPAN;
+ }
+
+ rt_node_insert_leaf(tree, parent, node, key, value);
+ tree->num_keys++;
+}
+
+/*
+ * Search for the child pointer corresponding to 'key' in the given node, and
+ * do the specified 'action'.
+ *
+ * Return true if the key is found, otherwise return false. On success, the child
+ * pointer is returned in *child_p.
+ */
+static inline bool
+rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **child_p)
+{
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool found = false;
+ rt_node *child = NULL;
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+ int idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
+
+ if (idx < 0)
+ break;
+
+ found = true;
+
+ if (action == RT_ACTION_FIND)
+ child = n4->children[idx];
+ else /* RT_ACTION_DELETE */
+ chunk_children_array_delete(n4->base.chunks, n4->children,
+ n4->base.n.count, idx);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+ int idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
+
+ if (idx < 0)
+ break;
+
+ found = true;
+ if (action == RT_ACTION_FIND)
+ child = n32->children[idx];
+ else /* RT_ACTION_DELETE */
+ chunk_children_array_delete(n32->base.chunks, n32->children,
+ n32->base.n.count, idx);
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ rt_node_inner_128 *n128 = (rt_node_inner_128 *) node;
+
+ if (!node_128_is_chunk_used((rt_node_base_128 *) n128, chunk))
+ break;
+
+ found = true;
+
+ if (action == RT_ACTION_FIND)
+ child = node_inner_128_get_child(n128, chunk);
+ else /* RT_ACTION_DELETE */
+ node_inner_128_delete(n128, chunk);
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+
+ if (!node_inner_256_is_chunk_used(n256, chunk))
+ break;
+
+ found = true;
+ if (action == RT_ACTION_FIND)
+ child = node_inner_256_get_child(n256, chunk);
+ else /* RT_ACTION_DELETE */
+ node_inner_256_delete(n256, chunk);
+
+ break;
+ }
+ }
+
+ /* update statistics */
+ if (action == RT_ACTION_DELETE && found)
+ node->count--;
+
+ if (found && child_p)
+ *child_p = child;
+
+ return found;
+}
+
+/*
+ * Search for the value corresponding to 'key' in the given node, and do the
+ * specified 'action'.
+ *
+ * Return true if the key is found, otherwise return false. On success, the
+ * value is returned in *value_p.
+ */
+static inline bool
+rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p)
+{
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool found = false;
+ uint64 value = 0;
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+ int idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
+
+ if (idx < 0)
+ break;
+
+ found = true;
+
+ if (action == RT_ACTION_FIND)
+ value = n4->values[idx];
+ else /* RT_ACTION_DELETE */
+ chunk_values_array_delete(n4->base.chunks, (uint64 *) n4->values,
+ n4->base.n.count, idx);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+ int idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
+
+ if (idx < 0)
+ break;
+
+ found = true;
+ if (action == RT_ACTION_FIND)
+ value = n32->values[idx];
+ else /* RT_ACTION_DELETE */
+ chunk_values_array_delete(n32->base.chunks, (uint64 *) n32->values,
+ n32->base.n.count, idx);
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ rt_node_leaf_128 *n128 = (rt_node_leaf_128 *) node;
+
+ if (!node_128_is_chunk_used((rt_node_base_128 *) n128, chunk))
+ break;
+
+ found = true;
+
+ if (action == RT_ACTION_FIND)
+ value = node_leaf_128_get_value(n128, chunk);
+ else /* RT_ACTION_DELETE */
+ node_leaf_128_delete(n128, chunk);
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+
+ if (!node_leaf_256_is_chunk_used(n256, chunk))
+ break;
+
+ found = true;
+ if (action == RT_ACTION_FIND)
+ value = node_leaf_256_get_value(n256, chunk);
+ else /* RT_ACTION_DELETE */
+ node_leaf_256_delete(n256, chunk);
+
+ break;
+ }
+ }
+
+ /* update statistics */
+ if (action == RT_ACTION_DELETE && found)
+ node->count--;
+
+ if (found && value_p)
+ *value_p = value;
+
+ return found;
+}
+
+/* Insert the child to the inner node */
+static bool
+rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 key,
+ rt_node *child)
+{
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool chunk_exists = false;
+
+ Assert(!NODE_IS_LEAF(node));
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+ int idx;
+
+ idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n4->children[idx] = child;
+ break;
+ }
+
+ if (unlikely(!NODE_HAS_FREE_SLOT(n4)))
+ {
+ rt_node_inner_32 *new32;
+
+ /* grow node from 4 to 32 */
+ new32 = (rt_node_inner_32 *) rt_copy_node(tree, (rt_node *) n4,
+ RT_NODE_KIND_32);
+ chunk_children_array_copy(n4->base.chunks, n4->children,
+ new32->base.chunks, new32->children,
+ n4->base.n.count);
+
+ rt_replace_node(tree, parent, (rt_node *) n4, (rt_node *) new32,
+ key);
+ node = (rt_node *) new32;
+ }
+ else
+ {
+ int insertpos = node_4_get_insertpos((rt_node_base_4 *) n4, chunk);
+ uint16 count = n4->base.n.count;
+
+ /* shift chunks and children */
+ if (count != 0 && insertpos < count)
+ chunk_children_array_shift(n4->base.chunks, n4->children,
+ count, insertpos);
+
+ n4->base.chunks[insertpos] = chunk;
+ n4->children[insertpos] = child;
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_32:
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+ int idx;
+
+ idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n32->children[idx] = child;
+ break;
+ }
+
+ if (unlikely(!NODE_HAS_FREE_SLOT(n32)))
+ {
+ rt_node_inner_128 *new128;
+
+ /* grow node from 32 to 128 */
+ new128 = (rt_node_inner_128 *) rt_copy_node(tree, (rt_node *) n32,
+ RT_NODE_KIND_128);
+ for (int i = 0; i < n32->base.n.count; i++)
+ node_inner_128_insert(new128, n32->base.chunks[i], n32->children[i]);
+
+ rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new128,
+ key);
+ node = (rt_node *) new128;
+ }
+ else
+ {
+ int insertpos = node_32_get_insertpos((rt_node_base_32 *) n32, chunk);
+ int16 count = n32->base.n.count;
+
+ if (count != 0 && insertpos < count)
+ chunk_children_array_shift(n32->base.chunks, n32->children,
+ count, insertpos);
+
+ n32->base.chunks[insertpos] = chunk;
+ n32->children[insertpos] = child;
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_128:
+ {
+ rt_node_inner_128 *n128 = (rt_node_inner_128 *) node;
+ int cnt = 0;
+
+ if (node_128_is_chunk_used((rt_node_base_128 *) n128, chunk))
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ node_inner_128_update(n128, chunk, child);
+ break;
+ }
+
+ if (unlikely(!NODE_HAS_FREE_SLOT(n128)))
+ {
+ rt_node_inner_256 *new256;
+
+ /* grow node from 128 to 256 */
+ new256 = (rt_node_inner_256 *) rt_copy_node(tree, (rt_node *) n128,
+ RT_NODE_KIND_256);
+ for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n128->base.n.count; i++)
+ {
+ if (!node_128_is_chunk_used((rt_node_base_128 *) n128, i))
+ continue;
+
+ node_inner_256_set(new256, i, node_inner_128_get_child(n128, i));
+ cnt++;
+ }
+
+ rt_replace_node(tree, parent, (rt_node *) n128, (rt_node *) new256,
+ key);
+ node = (rt_node *) new256;
+ }
+ else
+ {
+ node_inner_128_insert(n128, chunk, child);
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_256:
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+
+ chunk_exists = node_inner_256_is_chunk_used(n256, chunk);
+ Assert(chunk_exists || NODE_HAS_FREE_SLOT(n256));
+
+ node_inner_256_set(n256, chunk, child);
+ break;
+ }
+ }
+
+ /* Update statistics */
+ if (!chunk_exists)
+ node->count++;
+
+ /*
+ * Done. Finally, verify that the chunk and child are inserted or replaced
+ * properly in the node.
+ */
+ rt_verify_node(node);
+
+ return chunk_exists;
+}
+
+/* Insert the value to the leaf node */
+static bool
+rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
+ uint64 key, uint64 value)
+{
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool chunk_exists = false;
+
+ Assert(NODE_IS_LEAF(node));
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+ int idx;
+
+ idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n4->values[idx] = value;
+ break;
+ }
+
+ if (unlikely(!NODE_HAS_FREE_SLOT(n4)))
+ {
+ rt_node_leaf_32 *new32;
+
+ /* grow node from 4 to 32 */
+ new32 = (rt_node_leaf_32 *) rt_copy_node(tree, (rt_node *) n4,
+ RT_NODE_KIND_32);
+ chunk_values_array_copy(n4->base.chunks, n4->values,
+ new32->base.chunks, new32->values,
+ n4->base.n.count);
+
+ rt_replace_node(tree, parent, (rt_node *) n4, (rt_node *) new32,
+ key);
+ node = (rt_node *) new32;
+ }
+ else
+ {
+ int insertpos = node_4_get_insertpos((rt_node_base_4 *) n4, chunk);
+ int count = n4->base.n.count;
+
+ /* shift chunks and values */
+ if (count != 0 && insertpos < count)
+ chunk_values_array_shift(n4->base.chunks, n4->values,
+ count, insertpos);
+
+ n4->base.chunks[insertpos] = chunk;
+ n4->values[insertpos] = value;
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_32:
+ {
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+ int idx;
+
+ idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n32->values[idx] = value;
+ break;
+ }
+
+ if (unlikely(!NODE_HAS_FREE_SLOT(n32)))
+ {
+ rt_node_leaf_128 *new128;
+
+ /* grow node from 32 to 128 */
+ new128 = (rt_node_leaf_128 *) rt_copy_node(tree, (rt_node *) n32,
+ RT_NODE_KIND_128);
+ for (int i = 0; i < n32->base.n.count; i++)
+ node_leaf_128_insert(new128, n32->base.chunks[i], n32->values[i]);
+
+ rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new128,
+ key);
+ node = (rt_node *) new128;
+ }
+ else
+ {
+ int insertpos = node_32_get_insertpos((rt_node_base_32 *) n32, chunk);
+ int count = n32->base.n.count;
+
+ if (count != 0 && insertpos < count)
+ chunk_values_array_shift(n32->base.chunks, n32->values,
+ count, insertpos);
+
+ n32->base.chunks[insertpos] = chunk;
+ n32->values[insertpos] = value;
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_128:
+ {
+ rt_node_leaf_128 *n128 = (rt_node_leaf_128 *) node;
+ int cnt = 0;
+
+ if (node_128_is_chunk_used((rt_node_base_128 *) n128, chunk))
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ node_leaf_128_update(n128, chunk, value);
+ break;
+ }
+
+ if (unlikely(!NODE_HAS_FREE_SLOT(n128)))
+ {
+ rt_node_leaf_256 *new256;
+
+ /* grow node from 128 to 256 */
+ new256 = (rt_node_leaf_256 *) rt_copy_node(tree, (rt_node *) n128,
+ RT_NODE_KIND_256);
+ for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n128->base.n.count; i++)
+ {
+ if (!node_128_is_chunk_used((rt_node_base_128 *) n128, i))
+ continue;
+
+ node_leaf_256_set(new256, i, node_leaf_128_get_value(n128, i));
+ cnt++;
+ }
+
+ rt_replace_node(tree, parent, (rt_node *) n128, (rt_node *) new256,
+ key);
+ node = (rt_node *) new256;
+ }
+ else
+ {
+ node_leaf_128_insert(n128, chunk, value);
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_256:
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+
+ chunk_exists = node_leaf_256_is_chunk_used(n256, chunk);
+ Assert(chunk_exists || NODE_HAS_FREE_SLOT(n256));
+
+ node_leaf_256_set(n256, chunk, value);
+ break;
+ }
+ }
+
+ /* Update statistics */
+ if (!chunk_exists)
+ node->count++;
+
+ /*
+ * Done. Finally, verify that the chunk and value are inserted or replaced
+ * properly in the node.
+ */
+ rt_verify_node(node);
+
+ return chunk_exists;
+}
+
+/*
+ * Create the radix tree in the given memory context and return it.
+ */
+radix_tree *
+rt_create(MemoryContext ctx)
+{
+ radix_tree *tree;
+ MemoryContext old_ctx;
+
+ old_ctx = MemoryContextSwitchTo(ctx);
+
+ tree = palloc(sizeof(radix_tree));
+ tree->context = ctx;
+ tree->root = NULL;
+ tree->max_val = 0;
+ tree->num_keys = 0;
+
+ /* Create the slab allocator for each size class */
+ for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ {
+ tree->inner_slabs[i] = SlabContextCreate(ctx,
+ rt_node_kind_info[i].name,
+ rt_node_kind_info[i].inner_blocksize,
+ rt_node_kind_info[i].inner_size);
+ tree->leaf_slabs[i] = SlabContextCreate(ctx,
+ rt_node_kind_info[i].name,
+ rt_node_kind_info[i].leaf_blocksize,
+ rt_node_kind_info[i].leaf_size);
+#ifdef RT_DEBUG
+ tree->cnt[i] = 0;
+#endif
+ }
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return tree;
+}
+
+/*
+ * Free the given radix tree.
+ */
+void
+rt_free(radix_tree *tree)
+{
+ for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ {
+ MemoryContextDelete(tree->inner_slabs[i]);
+ MemoryContextDelete(tree->leaf_slabs[i]);
+ }
+
+ pfree(tree);
+}
+
+/*
+ * Set key to value. If the entry already exists, we update its value to 'value'
+ * and return true. Returns false if entry doesn't yet exist.
+ */
+bool
+rt_set(radix_tree *tree, uint64 key, uint64 value)
+{
+ int shift;
+ bool updated;
+ rt_node *node;
+ rt_node *parent = tree->root;
+
+ /* Empty tree, create the root */
+ if (!tree->root)
+ rt_new_root(tree, key);
+
+ /* Extend the tree if necessary */
+ if (key > tree->max_val)
+ rt_extend(tree, key);
+
+ Assert(tree->root);
+
+ shift = tree->root->shift;
+ node = tree->root;
+
+ /* Descend the tree until a leaf node */
+ while (shift >= 0)
+ {
+ rt_node *child;
+
+ if (NODE_IS_LEAF(node))
+ break;
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ {
+ rt_set_extend(tree, key, value, parent, node);
+ return false;
+ }
+
+ Assert(child);
+
+ parent = node;
+ node = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ updated = rt_node_insert_leaf(tree, parent, node, key, value);
+
+ /* Update the statistics */
+ if (!updated)
+ tree->num_keys++;
+
+ return updated;
+}
+
+/*
+ * Search the given key in the radix tree. Return true if the key is found,
+ * otherwise return false. On success, the value is set to *value_p, which
+ * therefore must not be NULL.
+ */
+bool
+rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
+{
+ rt_node *node;
+ int shift;
+
+ Assert(value_p != NULL);
+
+ if (!tree->root || key > tree->max_val)
+ return false;
+
+ node = tree->root;
+ shift = tree->root->shift;
+
+ /* Descend the tree until a leaf node */
+ while (shift >= 0)
+ {
+ rt_node *child;
+
+ if (NODE_IS_LEAF(node))
+ break;
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ return false;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ return rt_node_search_leaf(node, key, RT_ACTION_FIND, value_p);
+}
+
+/*
+ * Delete the given key from the radix tree. Return true if the key is found (and
+ * deleted), otherwise do nothing and return false.
+ */
+bool
+rt_delete(radix_tree *tree, uint64 key)
+{
+ rt_node *node;
+ rt_node *stack[RT_MAX_LEVEL] = {0};
+ int shift;
+ int level;
+ bool deleted;
+
+ if (!tree->root || key > tree->max_val)
+ return false;
+
+ /*
+ * Descend the tree to search the key while building a stack of nodes we
+ * visited.
+ */
+ node = tree->root;
+ shift = tree->root->shift;
+ level = -1;
+ while (shift > 0)
+ {
+ rt_node *child;
+
+ /* Push the current node to the stack */
+ stack[++level] = node;
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ return false;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ /* Delete the key from the leaf node if exists */
+ Assert(NODE_IS_LEAF(node));
+ deleted = rt_node_search_leaf(node, key, RT_ACTION_DELETE, NULL);
+
+ if (!deleted)
+ {
+ /* no key is found in the leaf node */
+ return false;
+ }
+
+ /* Found the key to delete. Update the statistics */
+ tree->num_keys--;
+
+ /*
+ * Return if the leaf node still has keys and we don't need to delete the
+ * node.
+ */
+ if (!NODE_IS_EMPTY(node))
+ return true;
+
+ /* Free the empty leaf node */
+ rt_free_node(tree, node);
+
+ /* Delete the key in inner nodes recursively */
+ while (level >= 0)
+ {
+ node = stack[level--];
+
+ deleted = rt_node_search_inner(node, key, RT_ACTION_DELETE, NULL);
+ Assert(deleted);
+
+ /* If the node didn't become empty, we stop deleting the key */
+ if (!NODE_IS_EMPTY(node))
+ break;
+
+ /* The node became empty */
+ rt_free_node(tree, node);
+ }
+
+ /*
+ * If we eventually deleted the root node while recursively deleting empty
+ * nodes, we make the tree empty.
+ */
+ if (level == 0)
+ {
+ tree->root = NULL;
+ tree->max_val = 0;
+ }
+
+ return true;
+}
+
+/* Create and return the iterator for the given radix tree */
+rt_iter *
+rt_begin_iterate(radix_tree *tree)
+{
+ MemoryContext old_ctx;
+ rt_iter *iter;
+ int top_level;
+
+ old_ctx = MemoryContextSwitchTo(tree->context);
+
+ iter = (rt_iter *) palloc0(sizeof(rt_iter));
+ iter->tree = tree;
+
+ /* empty tree */
+ if (!iter->tree)
+ return iter;
+
+ top_level = iter->tree->root->shift / RT_NODE_SPAN;
+ iter->stack_len = top_level;
+
+ /*
+ * Descend to the leftmost leaf node from the root. The key is
+ * constructed while descending to the leaf.
+ */
+ rt_update_iter_stack(iter, iter->tree->root, top_level);
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return iter;
+}
+
+/*
+ * Update each node_iter for inner nodes in the iterator node stack.
+ */
+static void
+rt_update_iter_stack(rt_iter *iter, rt_node *from_node, int from)
+{
+ int level = from;
+ rt_node *node = from_node;
+
+ for (;;)
+ {
+ rt_node_iter *node_iter = &(iter->stack[level--]);
+
+ /* Set the node to this level */
+ node_iter->node = node;
+ node_iter->current_idx = -1;
+
+ /* We don't advance the leaf node iterator here */
+ if (NODE_IS_LEAF(node))
+ return;
+
+ /* Advance to the next slot in the inner node */
+ node = rt_node_inner_iterate_next(iter, node_iter);
+
+ /* We must find the first child in the node */
+ Assert(node);
+ }
+}
+
+/*
+ * Return true and set *key_p and *value_p if there is a next key. Otherwise,
+ * return false.
+ */
+bool
+rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p)
+{
+ /* Empty tree */
+ if (!iter->tree)
+ return false;
+
+ for (;;)
+ {
+ rt_node *child = NULL;
+ uint64 value;
+ int level;
+ bool found;
+
+ /* Advance the leaf node iterator to get next key-value pair */
+ found = rt_node_leaf_iterate_next(iter, &(iter->stack[0]), &value);
+
+ if (found)
+ {
+ *key_p = iter->key;
+ *value_p = value;
+ return true;
+ }
+
+ /*
+ * We've visited all values in the leaf node, so advance inner node
+ * iterators from level 1 until we find the next child node.
+ */
+ for (level = 1; level <= iter->stack_len; level++)
+ {
+ child = rt_node_inner_iterate_next(iter, &(iter->stack[level]));
+
+ if (child)
+ break;
+ }
+
+ /* the iteration finished */
+ if (!child)
+ return false;
+
+ /*
+ * Set the node to the node iterator and update the iterator stack
+ * from this node.
+ */
+ rt_update_iter_stack(iter, child, level - 1);
+
+ /* Node iterators are updated, so try again from the leaf */
+ }
+
+ return false;
+}
+
+void
+rt_end_iterate(rt_iter *iter)
+{
+ pfree(iter);
+}
+
+static inline void
+rt_iter_update_key(rt_iter *iter, uint8 chunk, uint8 shift)
+{
+ iter->key &= ~(((uint64) RT_CHUNK_MASK) << shift);
+ iter->key |= (((uint64) chunk) << shift);
+}
+
+/*
+ * Advance the slot in the inner node. Return the child if one exists,
+ * otherwise NULL.
+ */
+static inline rt_node *
+rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
+{
+ rt_node *child = NULL;
+ bool found = false;
+ uint8 key_chunk;
+
+ switch (node_iter->node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n4->base.n.count)
+ break;
+
+ child = n4->children[node_iter->current_idx];
+ key_chunk = n4->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n32->base.n.count)
+ break;
+
+ child = n32->children[node_iter->current_idx];
+ key_chunk = n32->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ rt_node_inner_128 *n128 = (rt_node_inner_128 *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (node_128_is_chunk_used((rt_node_base_128 *) n128, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+ child = node_inner_128_get_child(n128, i);
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (node_inner_256_is_chunk_used(n256, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+ child = node_inner_256_get_child(n256, i);
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ }
+
+ if (found)
+ rt_iter_update_key(iter, key_chunk, node_iter->node->shift);
+
+ return child;
+}
+
+/*
+ * Advance the slot in the leaf node. On success, return true and set the
+ * value to *value_p, otherwise return false.
+ */
+static inline bool
+rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
+ uint64 *value_p)
+{
+ rt_node *node = node_iter->node;
+ bool found = false;
+ uint64 value;
+ uint8 key_chunk;
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n4->base.n.count)
+ break;
+
+ value = n4->values[node_iter->current_idx];
+ key_chunk = n4->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n32->base.n.count)
+ break;
+
+ value = n32->values[node_iter->current_idx];
+ key_chunk = n32->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ rt_node_leaf_128 *n128 = (rt_node_leaf_128 *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (node_128_is_chunk_used((rt_node_base_128 *) n128, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+ value = node_leaf_128_get_value(n128, i);
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (node_leaf_256_is_chunk_used(n256, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+ value = node_leaf_256_get_value(n256, i);
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ }
+
+ if (found)
+ {
+ rt_iter_update_key(iter, key_chunk, node_iter->node->shift);
+ *value_p = value;
+ }
+
+ return found;
+}
+
+/*
+ * Return the number of keys in the radix tree.
+ */
+uint64
+rt_num_entries(radix_tree *tree)
+{
+ return tree->num_keys;
+}
+
+/*
+ * Return the amount of memory used by the radix tree.
+ */
+uint64
+rt_memory_usage(radix_tree *tree)
+{
+ Size total = sizeof(radix_tree);
+
+ for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ {
+ total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
+ total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
+ }
+
+ return total;
+}
+
+/*
+ * Verify the radix tree node.
+ */
+static void
+rt_verify_node(rt_node *node)
+{
+#ifdef USE_ASSERT_CHECKING
+ Assert(node->count >= 0);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_base_4 *n4 = (rt_node_base_4 *) node;
+
+ for (int i = 1; i < n4->n.count; i++)
+ Assert(n4->chunks[i - 1] < n4->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_base_32 *n32 = (rt_node_base_32 *) node;
+
+ for (int i = 1; i < n32->n.count; i++)
+ Assert(n32->chunks[i - 1] < n32->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ rt_node_base_128 *n128 = (rt_node_base_128 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!node_128_is_chunk_used(n128, i))
+ continue;
+
+ /* Check if the corresponding slot is used */
+ if (NODE_IS_LEAF(node))
+ Assert(node_leaf_128_is_slot_used((rt_node_leaf_128 *) node,
+ n128->slot_idxs[i]));
+ else
+ Assert(node_inner_128_is_slot_used((rt_node_inner_128 *) node,
+ n128->slot_idxs[i]));
+
+ cnt++;
+ }
+
+ Assert(n128->n.count == cnt);
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RT_NODE_NSLOTS_BITS(RT_NODE_MAX_SLOTS); i++)
+ cnt += pg_popcount32(n256->isset[i]);
+
+ /* Check if the number of used chunks matches */
+ Assert(n256->base.n.count == cnt);
+
+ break;
+ }
+ }
+ }
+#endif
+}
+
+/***************** DEBUG FUNCTIONS *****************/
+#ifdef RT_DEBUG
+void
+rt_stats(radix_tree *tree)
+{
+ ereport(LOG, (errmsg("num_keys = %lu, height = %u, n4 = %u, n32 = %u, n128 = %u, n256 = %u",
+ tree->num_keys,
+ tree->root->shift / RT_NODE_SPAN,
+ tree->cnt[0],
+ tree->cnt[1],
+ tree->cnt[2],
+ tree->cnt[3])));
+}
+
+static void
+rt_dump_node(rt_node *node, int level, bool recurse)
+{
+ char space[128] = {0};
+
+ fprintf(stderr, "[%s] kind %d, count %u, shift %u, chunk 0x%X:\n",
+ NODE_IS_LEAF(node) ? "LEAF" : "INNR",
+ (node->kind == RT_NODE_KIND_4) ? 4 :
+ (node->kind == RT_NODE_KIND_32) ? 32 :
+ (node->kind == RT_NODE_KIND_128) ? 128 : 256,
+ node->count, node->shift, node->chunk);
+
+ if (level > 0)
+ sprintf(space, "%*c", level * 4, ' ');
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+
+ fprintf(stderr, "%schunk 0x%X value 0x%lX\n",
+ space, n4->base.chunks[i], n4->values[i]);
+ }
+ else
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, n4->base.chunks[i]);
+
+ if (recurse)
+ rt_dump_node(n4->children[i], level + 1, recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+
+ fprintf(stderr, "%schunk 0x%X value 0x%lX\n",
+ space, n32->base.chunks[i], n32->values[i]);
+ }
+ else
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, n32->base.chunks[i]);
+
+ if (recurse)
+ {
+ rt_dump_node(n32->children[i], level + 1, recurse);
+ }
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ rt_node_base_128 *b128 = (rt_node_base_128 *) node;
+
+ fprintf(stderr, "slot_idxs ");
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!node_128_is_chunk_used(b128, i))
+ continue;
+
+ fprintf(stderr, " [%d]=%d, ", i, b128->slot_idxs[i]);
+ }
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_128 *n = (rt_node_leaf_128 *) node;
+
+ fprintf(stderr, ", isset-bitmap:");
+ for (int i = 0; i < 16; i++)
+ {
+ fprintf(stderr, "%X ", (uint8) n->isset[i]);
+ }
+ fprintf(stderr, "\n");
+ }
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!node_128_is_chunk_used(b128, i))
+ continue;
+
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_128 *n128 = (rt_node_leaf_128 *) b128;
+
+ fprintf(stderr, "%schunk 0x%X value 0x%lX\n",
+ space, i, node_leaf_128_get_value(n128, i));
+ }
+ else
+ {
+ rt_node_inner_128 *n128 = (rt_node_inner_128 *) b128;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, i);
+
+ if (recurse)
+ rt_dump_node(node_inner_128_get_child(n128, i),
+ level + 1, recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+
+ if (!node_leaf_256_is_chunk_used(n256, i))
+ continue;
+
+ fprintf(stderr, "%schunk 0x%X value 0x%lX\n",
+ space, i, node_leaf_256_get_value(n256, i));
+ }
+ else
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+
+ if (!node_inner_256_is_chunk_used(n256, i))
+ continue;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, i);
+
+ if (recurse)
+ rt_dump_node(node_inner_256_get_child(n256, i), level + 1,
+ recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ }
+}
+
+void
+rt_dump_search(radix_tree *tree, uint64 key)
+{
+ rt_node *node;
+ int shift;
+ int level = 0;
+
+ elog(NOTICE, "-----------------------------------------------------------");
+ elog(NOTICE, "max_val = %lu (0x%lX)", tree->max_val, tree->max_val);
+
+ if (!tree->root)
+ {
+ elog(NOTICE, "tree is empty");
+ return;
+ }
+
+ if (key > tree->max_val)
+ {
+ elog(NOTICE, "key %lu (0x%lX) is larger than max val",
+ key, key);
+ return;
+ }
+
+ node = tree->root;
+ shift = tree->root->shift;
+ while (shift >= 0)
+ {
+ rt_node *child;
+
+ rt_dump_node(node, level, false);
+
+ if (NODE_IS_LEAF(node))
+ {
+ uint64 dummy;
+
+ /* We reached at a leaf node, find the corresponding slot */
+ rt_node_search_leaf(node, key, RT_ACTION_FIND, &dummy);
+
+ break;
+ }
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ break;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ level++;
+ }
+}
+
+void
+rt_dump(radix_tree *tree)
+{
+ for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ fprintf(stderr, "%s\tinner_size %lu\tinner_blocksize %lu\tleaf_size %lu\tleaf_blocksize %lu\n",
+ rt_node_kind_info[i].name,
+ rt_node_kind_info[i].inner_size,
+ rt_node_kind_info[i].inner_blocksize,
+ rt_node_kind_info[i].leaf_size,
+ rt_node_kind_info[i].leaf_blocksize);
+ fprintf(stderr, "max_val = %lu\n", tree->max_val);
+
+ if (!tree->root)
+ {
+ fprintf(stderr, "empty tree\n");
+ return;
+ }
+
+ rt_dump_node(tree->root, 0, true);
+}
+#endif
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
new file mode 100644
index 0000000000..d5d7668617
--- /dev/null
+++ b/src/include/lib/radixtree.h
@@ -0,0 +1,42 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.h
+ * Interface for radix tree.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/lib/radixtree.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef RADIXTREE_H
+#define RADIXTREE_H
+
+#include "postgres.h"
+
+#define RT_DEBUG 1
+
+typedef struct radix_tree radix_tree;
+typedef struct rt_iter rt_iter;
+
+extern radix_tree *rt_create(MemoryContext ctx);
+extern void rt_free(radix_tree *tree);
+extern bool rt_search(radix_tree *tree, uint64 key, uint64 *val_p);
+extern bool rt_set(radix_tree *tree, uint64 key, uint64 val);
+extern rt_iter *rt_begin_iterate(radix_tree *tree);
+
+extern bool rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p);
+extern void rt_end_iterate(rt_iter *iter);
+extern bool rt_delete(radix_tree *tree, uint64 key);
+
+extern uint64 rt_memory_usage(radix_tree *tree);
+extern uint64 rt_num_entries(radix_tree *tree);
+
+#ifdef RT_DEBUG
+extern void rt_dump(radix_tree *tree);
+extern void rt_dump_search(radix_tree *tree, uint64 key);
+extern void rt_stats(radix_tree *tree);
+#endif
+
+#endif /* RADIXTREE_H */
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 7b3f292965..e587cabe13 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -26,6 +26,7 @@ SUBDIRS = \
test_parser \
test_pg_dump \
test_predtest \
+ test_radixtree \
test_rbtree \
test_regex \
test_rls_hooks \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index c2e5f5ffd5..c86f6bdcb0 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -20,6 +20,7 @@ subdir('test_oat_hooks')
subdir('test_parser')
subdir('test_pg_dump')
subdir('test_predtest')
+subdir('test_radixtree')
subdir('test_rbtree')
subdir('test_regex')
subdir('test_rls_hooks')
diff --git a/src/test/modules/test_radixtree/.gitignore b/src/test/modules/test_radixtree/.gitignore
new file mode 100644
index 0000000000..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/test_radixtree/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_radixtree/Makefile b/src/test/modules/test_radixtree/Makefile
new file mode 100644
index 0000000000..da06b93da3
--- /dev/null
+++ b/src/test/modules/test_radixtree/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_radixtree/Makefile
+
+MODULE_big = test_radixtree
+OBJS = \
+ $(WIN32RES) \
+ test_radixtree.o
+PGFILEDESC = "test_radixtree - test code for src/backend/lib/radixtree.c"
+
+EXTENSION = test_radixtree
+DATA = test_radixtree--1.0.sql
+
+REGRESS = test_radixtree
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_radixtree
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_radixtree/README b/src/test/modules/test_radixtree/README
new file mode 100644
index 0000000000..a8b271869a
--- /dev/null
+++ b/src/test/modules/test_radixtree/README
@@ -0,0 +1,7 @@
+test_radixtree contains unit tests for testing the radix tree implementation
+in src/backend/lib/radixtree.c.
+
+The tests verify the correctness of the implementation, but they can also be
+used as a micro-benchmark. If you set the 'rt_test_stats' flag in
+test_radixtree.c, the tests will print extra information about execution time
+and memory usage.
diff --git a/src/test/modules/test_radixtree/expected/test_radixtree.out b/src/test/modules/test_radixtree/expected/test_radixtree.out
new file mode 100644
index 0000000000..cc6970c87c
--- /dev/null
+++ b/src/test/modules/test_radixtree/expected/test_radixtree.out
@@ -0,0 +1,28 @@
+CREATE EXTENSION test_radixtree;
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
+NOTICE: testing radix tree node types with shift "0"
+NOTICE: testing radix tree node types with shift "8"
+NOTICE: testing radix tree node types with shift "16"
+NOTICE: testing radix tree node types with shift "24"
+NOTICE: testing radix tree node types with shift "32"
+NOTICE: testing radix tree node types with shift "40"
+NOTICE: testing radix tree node types with shift "48"
+NOTICE: testing radix tree node types with shift "56"
+NOTICE: testing radix tree with pattern "all ones"
+NOTICE: testing radix tree with pattern "alternating bits"
+NOTICE: testing radix tree with pattern "clusters of ten"
+NOTICE: testing radix tree with pattern "clusters of hundred"
+NOTICE: testing radix tree with pattern "one-every-64k"
+NOTICE: testing radix tree with pattern "sparse"
+NOTICE: testing radix tree with pattern "single values, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^60"
+ test_radixtree
+----------------
+
+(1 row)
+
diff --git a/src/test/modules/test_radixtree/meson.build b/src/test/modules/test_radixtree/meson.build
new file mode 100644
index 0000000000..f96bf159d6
--- /dev/null
+++ b/src/test/modules/test_radixtree/meson.build
@@ -0,0 +1,34 @@
+# FIXME: prevent install during main install, but not during test :/
+
+test_radixtree_sources = files(
+ 'test_radixtree.c',
+)
+
+if host_system == 'windows'
+ test_radixtree_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'test_radixtree',
+ '--FILEDESC', 'test_radixtree - test code for src/backend/lib/radixtree.c',])
+endif
+
+test_radixtree = shared_module('test_radixtree',
+ test_radixtree_sources,
+ kwargs: pg_mod_args,
+)
+testprep_targets += test_radixtree
+
+install_data(
+ 'test_radixtree.control',
+ 'test_radixtree--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'test_radixtree',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'test_radixtree',
+ ],
+ },
+}
diff --git a/src/test/modules/test_radixtree/sql/test_radixtree.sql b/src/test/modules/test_radixtree/sql/test_radixtree.sql
new file mode 100644
index 0000000000..41ece5e9f5
--- /dev/null
+++ b/src/test/modules/test_radixtree/sql/test_radixtree.sql
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_radixtree;
+
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
diff --git a/src/test/modules/test_radixtree/test_radixtree--1.0.sql b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
new file mode 100644
index 0000000000..074a5a7ea7
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_radixtree/test_radixtree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_radixtree" to load this file. \quit
+
+CREATE FUNCTION test_radixtree()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
new file mode 100644
index 0000000000..cb3596755d
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -0,0 +1,504 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_radixtree.c
+ * Test radixtree set data structure.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/test/modules/test_radixtree/test_radixtree.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "lib/radixtree.h"
+#include "miscadmin.h"
+#include "nodes/bitmapset.h"
+#include "storage/block.h"
+#include "storage/itemptr.h"
+#include "utils/memutils.h"
+#include "utils/timestamp.h"
+
+#define UINT64_HEX_FORMAT "%" INT64_MODIFIER "X"
+
+/*
+ * If you enable this, the "pattern" tests will print information about
+ * how long populating, probing, and iterating the test set takes, and
+ * how much memory the test set consumed. That can be used as
+ * micro-benchmark of various operations and input patterns (you might
+ * want to increase the number of values used in each of the test, if
+ * you do that, to reduce noise).
+ *
+ * The information is printed to the server's stderr, mostly because
+ * that's where MemoryContextStats() output goes.
+ */
+static const bool rt_test_stats = false;
+
+/* The maximum number of entries each node type can have */
+static int rt_node_max_entries[] = {
+ 4, /* RT_NODE_KIND_4 */
+ 16, /* RT_NODE_KIND_16 */
+ 32, /* RT_NODE_KIND_32 */
+ 128, /* RT_NODE_KIND_128 */
+ 256 /* RT_NODE_KIND_256 */
+};
+
+/*
+ * A struct to define a pattern of integers, for use with the test_pattern()
+ * function.
+ */
+typedef struct
+{
+ char *test_name; /* short name of the test, for humans */
+ char *pattern_str; /* a bit pattern */
+ uint64 spacing; /* pattern repeats at this interval */
+ uint64 num_values; /* number of integers to set in total */
+} test_spec;
+
+/* Test patterns borrowed from test_integerset.c */
+static const test_spec test_specs[] = {
+ {
+ "all ones", "1111111111",
+ 10, 1000000
+ },
+ {
+ "alternating bits", "0101010101",
+ 10, 1000000
+ },
+ {
+ "clusters of ten", "1111111111",
+ 10000, 1000000
+ },
+ {
+ "clusters of hundred",
+ "1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111",
+ 10000, 10000000
+ },
+ {
+ "one-every-64k", "1",
+ 65536, 1000000
+ },
+ {
+ "sparse", "100000000000000000000000000000001",
+ 10000000, 1000000
+ },
+ {
+ "single values, distance > 2^32", "1",
+ UINT64CONST(10000000000), 100000
+ },
+ {
+ "clusters, distance > 2^32", "10101010",
+ UINT64CONST(10000000000), 1000000
+ },
+ {
+ "clusters, distance > 2^60", "10101010",
+ UINT64CONST(2000000000000000000),
+ 23 /* can't be much higher than this, or we
+ * overflow uint64 */
+ }
+};
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(test_radixtree);
+
+static void
+test_empty(void)
+{
+ radix_tree *radixtree;
+ uint64 dummy;
+
+ radixtree = rt_create(CurrentMemoryContext);
+
+ if (rt_search(radixtree, 0, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, 1, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, PG_UINT64_MAX, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_num_entries(radixtree) != 0)
+ elog(ERROR, "rt_num_entries on empty tree returned non-zero");
+
+ rt_free(radixtree);
+}
+
+/*
+ * Check if keys from start to end with the shift exist in the tree.
+ */
+static void
+check_search_on_node(radix_tree *radixtree, uint8 shift, int start, int end)
+{
+ for (int i = start; i < end; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ uint64 val;
+
+ if (!rt_search(radixtree, key, &val))
+ elog(ERROR, "key 0x" UINT64_HEX_FORMAT " is not found on node-%d",
+ key, end);
+ if (val != key)
+ elog(ERROR, "rt_search with key 0x" UINT64_HEX_FORMAT " returns 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ key, val, key);
+ }
+}
+
+static void
+test_node_types_insert(radix_tree *radixtree, uint8 shift)
+{
+ uint64 num_entries;
+
+ for (int i = 0; i < 256; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_set(radixtree, key, key);
+
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " found", key);
+
+ for (int j = 0; j < lengthof(rt_node_max_entries); j++)
+ {
+ /*
+ * After filling all slots in each node type, check if the values
+ * are stored properly.
+ */
+ if (i == (rt_node_max_entries[j] - 1))
+ {
+ check_search_on_node(radixtree, shift,
+ (j == 0) ? 0 : rt_node_max_entries[j - 1],
+ rt_node_max_entries[j]);
+ break;
+ }
+ }
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ if (num_entries != 256)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(256));
+}
+
+static void
+test_node_types_delete(radix_tree *radixtree, uint8 shift)
+{
+ uint64 num_entries;
+
+ for (int i = 0; i < 256; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_delete(radixtree, key);
+
+ if (!found)
+ elog(ERROR, "inserted key 0x" UINT64_HEX_FORMAT " is not found", key);
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ /* The tree must be empty */
+ if (num_entries != 0)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(256));
+}
+
+/*
+ * Test inserting and deleting key-value pairs into each node type at the
+ * given shift level.
+ */
+static void
+test_node_types(uint8 shift)
+{
+ radix_tree *radixtree;
+
+ elog(NOTICE, "testing radix tree node types with shift \"%d\"", shift);
+
+ radixtree = rt_create(CurrentMemoryContext);
+
+ /*
+ * Insert and search entries for every node type at the 'shift' level,
+ * then delete all entries to make it empty, and insert and search entries
+ * again.
+ */
+ test_node_types_insert(radixtree, shift);
+ test_node_types_delete(radixtree, shift);
+ test_node_types_insert(radixtree, shift);
+
+ rt_free(radixtree);
+}
+
+/*
+ * Test with a repeating pattern, defined by the 'spec'.
+ */
+static void
+test_pattern(const test_spec * spec)
+{
+ radix_tree *radixtree;
+ rt_iter *iter;
+ MemoryContext radixtree_ctx;
+ TimestampTz starttime;
+ TimestampTz endtime;
+ uint64 n;
+ uint64 last_int;
+ uint64 ndeleted;
+ uint64 nbefore;
+ uint64 nafter;
+ int patternlen;
+ uint64 *pattern_values;
+ uint64 pattern_num_values;
+
+ elog(NOTICE, "testing radix tree with pattern \"%s\"", spec->test_name);
+ if (rt_test_stats)
+ fprintf(stderr, "-----\ntesting radix tree with pattern \"%s\"\n", spec->test_name);
+
+ /* Pre-process the pattern, creating an array of integers from it. */
+ patternlen = strlen(spec->pattern_str);
+ pattern_values = palloc(patternlen * sizeof(uint64));
+ pattern_num_values = 0;
+ for (int i = 0; i < patternlen; i++)
+ {
+ if (spec->pattern_str[i] == '1')
+ pattern_values[pattern_num_values++] = i;
+ }
+
+ /*
+ * Allocate the radix tree.
+ *
+ * Allocate it in a separate memory context, so that we can print its
+ * memory usage easily.
+ */
+ radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
+ "radixtree test",
+ ALLOCSET_SMALL_SIZES);
+ MemoryContextSetIdentifier(radixtree_ctx, spec->test_name);
+ radixtree = rt_create(radixtree_ctx);
+
+ /*
+ * Add values to the set.
+ */
+ starttime = GetCurrentTimestamp();
+
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ uint64 x = 0;
+
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ bool found;
+
+ x = last_int + pattern_values[i];
+
+ found = rt_set(radixtree, x, x);
+
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " found", x);
+
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+
+ endtime = GetCurrentTimestamp();
+
+ if (rt_test_stats)
+ fprintf(stderr, "added " UINT64_FORMAT " values in %d ms\n",
+ spec->num_values, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Print stats on the amount of memory used.
+ *
+ * We print the usage reported by rt_memory_usage(), as well as the stats
+ * from the memory context. They should be in the same ballpark, but it's
+ * hard to automate testing that, so if you're making changes to the
+ * implementation, just observe that manually.
+ */
+ if (rt_test_stats)
+ {
+ uint64 mem_usage;
+
+ /*
+ * Also print memory usage as reported by rt_memory_usage(). It
+ * should be in the same ballpark as the usage reported by
+ * MemoryContextStats().
+ */
+ mem_usage = rt_memory_usage(radixtree);
+ fprintf(stderr, "rt_memory_usage() reported " UINT64_FORMAT " (%0.2f bytes / integer)\n",
+ mem_usage, (double) mem_usage / spec->num_values);
+
+ MemoryContextStats(radixtree_ctx);
+ }
+
+ /* Check that rt_num_entries works */
+ n = rt_num_entries(radixtree);
+ if (n != spec->num_values)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT, n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_search()
+ */
+ starttime = GetCurrentTimestamp();
+
+ for (n = 0; n < 100000; n++)
+ {
+ bool found;
+ bool expected;
+ uint64 x;
+ uint64 v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Do we expect this value to be present in the set? */
+ if (x >= last_int)
+ expected = false;
+ else
+ {
+ uint64 idx = x % spec->spacing;
+
+ if (idx >= patternlen)
+ expected = false;
+ else if (spec->pattern_str[idx] == '1')
+ expected = true;
+ else
+ expected = false;
+ }
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (found != expected)
+ elog(ERROR, "mismatch at 0x" UINT64_HEX_FORMAT ": %d vs %d", x, found, expected);
+ if (found && (v != x))
+ elog(ERROR, "found 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ v, x);
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "probed " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Test iterator
+ */
+ starttime = GetCurrentTimestamp();
+
+ iter = rt_begin_iterate(radixtree);
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ uint64 expected = last_int + pattern_values[i];
+ uint64 x;
+ uint64 val;
+
+ if (!rt_iterate_next(iter, &x, &val))
+ break;
+
+ if (x != expected)
+ elog(ERROR,
+ "iterate returned wrong key; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d",
+ x, expected, i);
+ if (val != expected)
+ elog(ERROR,
+ "iterate returned wrong value; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d", x, expected, i);
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "iterated " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ if (n < spec->num_values)
+ elog(ERROR, "iterator stopped short after " UINT64_FORMAT " entries, expected " UINT64_FORMAT, n, spec->num_values);
+ if (n > spec->num_values)
+ elog(ERROR, "iterator returned " UINT64_FORMAT " entries, " UINT64_FORMAT " was expected", n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_delete()
+ */
+ starttime = GetCurrentTimestamp();
+
+ nbefore = rt_num_entries(radixtree);
+ ndeleted = 0;
+ for (n = 0; n < 100000; n++)
+ {
+ bool found;
+ uint64 x;
+ uint64 v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (!found)
+ continue;
+
+ /* If the key is found, delete it and check again */
+ if (!rt_delete(radixtree, x))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_search(radixtree, x, &v))
+ elog(ERROR, "found deleted key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_delete(radixtree, x))
+ elog(ERROR, "deleted already-deleted key 0x" UINT64_HEX_FORMAT, x);
+
+ ndeleted++;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "deleted " UINT64_FORMAT " values in %d ms\n",
+ ndeleted, (int) (endtime - starttime) / 1000);
+
+ nafter = rt_num_entries(radixtree);
+
+ /* Check that rt_num_entries works */
+ if ((nbefore - ndeleted) != nafter)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT "after " UINT64_FORMAT " deletion",
+ nafter, (nbefore - ndeleted), ndeleted);
+
+ MemoryContextDelete(radixtree_ctx);
+}
+
+Datum
+test_radixtree(PG_FUNCTION_ARGS)
+{
+ test_empty();
+
+ for (int shift = 0; shift <= (64 - 8); shift += 8)
+ test_node_types(shift);
+
+ /* Test different test patterns, with lots of entries */
+ for (int i = 0; i < lengthof(test_specs); i++)
+ test_pattern(&test_specs[i]);
+
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_radixtree/test_radixtree.control b/src/test/modules/test_radixtree/test_radixtree.control
new file mode 100644
index 0000000000..e53f2a3e0c
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.control
@@ -0,0 +1,4 @@
+comment = 'Test code for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/test_radixtree'
+relocatable = true
--
2.31.1
Attachment: v8-0003-tool-for-measuring-radix-tree-performance.patch
From 799d4d6500bec90171c0d9ee81f55af480583323 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 16 Sep 2022 11:57:03 +0900
Subject: [PATCH v8 3/4] tool for measuring radix tree performance
---
contrib/bench_radix_tree/Makefile | 21 +
.../bench_radix_tree--1.0.sql | 56 +++
contrib/bench_radix_tree/bench_radix_tree.c | 466 ++++++++++++++++++
.../bench_radix_tree/bench_radix_tree.control | 6 +
contrib/bench_radix_tree/expected/bench.out | 13 +
contrib/bench_radix_tree/sql/bench.sql | 16 +
6 files changed, 578 insertions(+)
create mode 100644 contrib/bench_radix_tree/Makefile
create mode 100644 contrib/bench_radix_tree/bench_radix_tree--1.0.sql
create mode 100644 contrib/bench_radix_tree/bench_radix_tree.c
create mode 100644 contrib/bench_radix_tree/bench_radix_tree.control
create mode 100644 contrib/bench_radix_tree/expected/bench.out
create mode 100644 contrib/bench_radix_tree/sql/bench.sql
diff --git a/contrib/bench_radix_tree/Makefile b/contrib/bench_radix_tree/Makefile
new file mode 100644
index 0000000000..952bb0ceae
--- /dev/null
+++ b/contrib/bench_radix_tree/Makefile
@@ -0,0 +1,21 @@
+# contrib/bench_radix_tree/Makefile
+
+MODULE_big = bench_radix_tree
+OBJS = \
+ bench_radix_tree.o
+
+EXTENSION = bench_radix_tree
+DATA = bench_radix_tree--1.0.sql
+
+REGRESS = bench_fixed_height
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/bench_radix_tree
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
new file mode 100644
index 0000000000..0874201d7e
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
@@ -0,0 +1,56 @@
+/* contrib/bench_radix_tree/bench_radix_tree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION bench_radix_tree" to load this file. \quit
+
+create function bench_shuffle_search(
+minblk int4,
+maxblk int4,
+random_block bool DEFAULT false,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT array_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT array_load_ms int8,
+OUT rt_search_ms int8,
+OUT array_search_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_seq_search(
+minblk int4,
+maxblk int4,
+random_block bool DEFAULT false,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT array_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT array_load_ms int8,
+OUT rt_search_ms int8,
+OUT array_search_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_load_random_int(
+cnt int8,
+OUT mem_allocated int8,
+OUT load_ms int8)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_fixed_height_search(
+fanout int4,
+OUT fanout int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT rt_search_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c
new file mode 100644
index 0000000000..7abb237e96
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -0,0 +1,466 @@
+/*-------------------------------------------------------------------------
+ *
+ * bench_radix_tree.c
+ *
+ * Copyright (c) 2016-2022, PostgreSQL Global Development Group
+ *
+ * contrib/bench_radix_tree/bench_radix_tree.c
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "funcapi.h"
+#include "lib/radixtree.h"
+#include <math.h>
+#include "miscadmin.h"
+#include "utils/timestamp.h"
+
+PG_MODULE_MAGIC;
+
+/* run benchmark also for binary-search case? */
+/* #define MEASURE_BINARY_SEARCH 1 */
+
+#define TIDS_PER_BLOCK_FOR_LOAD 30
+#define TIDS_PER_BLOCK_FOR_LOOKUP 50
+
+PG_FUNCTION_INFO_V1(bench_seq_search);
+PG_FUNCTION_INFO_V1(bench_shuffle_search);
+PG_FUNCTION_INFO_V1(bench_load_random_int);
+PG_FUNCTION_INFO_V1(bench_fixed_height_search);
+
+static uint64
+tid_to_key_off(ItemPointer tid, uint32 *off)
+{
+ uint64 upper;
+ uint32 shift = pg_ceil_log2_32(MaxHeapTuplesPerPage);
+ int64 tid_i;
+
+ Assert(ItemPointerGetOffsetNumber(tid) < MaxHeapTuplesPerPage);
+
+ tid_i = ItemPointerGetOffsetNumber(tid);
+ tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << shift;
+
+ /* log(sizeof(uint64) * BITS_PER_BYTE, 2) = log(64, 2) = 6 */
+ *off = tid_i & ((1 << 6) - 1);
+ upper = tid_i >> 6;
+ Assert(*off < (sizeof(uint64) * BITS_PER_BYTE));
+
+ Assert(*off < 64);
+
+ return upper;
+}
+
+static int
+shuffle_randrange(pg_prng_state *state, int lower, int upper)
+{
+ return (int) floor(pg_prng_double(state) * ((upper - lower) + 0.999999)) + lower;
+}
+
+/* Naive Fisher-Yates implementation */
+static void
+shuffle_itemptrs(ItemPointer itemptr, uint64 nitems)
+{
+ /* reproducibility */
+ pg_prng_state state;
+
+ pg_prng_seed(&state, 0);
+
+ for (int i = 0; i < nitems - 1; i++)
+ {
+ int j = shuffle_randrange(&state, i, nitems - 1);
+ ItemPointerData t = itemptr[j];
+
+ itemptr[j] = itemptr[i];
+ itemptr[i] = t;
+ }
+}
+
+static ItemPointer
+generate_tids(BlockNumber minblk, BlockNumber maxblk, int ntids_per_blk, uint64 *ntids_p,
+ bool random_block)
+{
+ ItemPointer tids;
+ uint64 maxitems;
+ uint64 ntids = 0;
+ pg_prng_state state;
+
+ maxitems = (maxblk - minblk + 1) * ntids_per_blk;
+ tids = MemoryContextAllocHuge(TopTransactionContext,
+ sizeof(ItemPointerData) * maxitems);
+
+ if (random_block)
+ pg_prng_seed(&state, 0x9E3779B185EBCA87);
+
+ for (BlockNumber blk = minblk; blk < maxblk; blk++)
+ {
+ if (random_block && !pg_prng_bool(&state))
+ continue;
+
+ for (OffsetNumber off = FirstOffsetNumber;
+ off <= ntids_per_blk; off++)
+ {
+ CHECK_FOR_INTERRUPTS();
+
+ ItemPointerSetBlockNumber(&(tids[ntids]), blk);
+ ItemPointerSetOffsetNumber(&(tids[ntids]), off);
+
+ ntids++;
+ }
+ }
+
+ *ntids_p = ntids;
+ return tids;
+}
+
+#ifdef MEASURE_BINARY_SEARCH
+static int
+vac_cmp_itemptr(const void *left, const void *right)
+{
+ BlockNumber lblk,
+ rblk;
+ OffsetNumber loff,
+ roff;
+
+ lblk = ItemPointerGetBlockNumber((ItemPointer) left);
+ rblk = ItemPointerGetBlockNumber((ItemPointer) right);
+
+ if (lblk < rblk)
+ return -1;
+ if (lblk > rblk)
+ return 1;
+
+ loff = ItemPointerGetOffsetNumber((ItemPointer) left);
+ roff = ItemPointerGetOffsetNumber((ItemPointer) right);
+
+ if (loff < roff)
+ return -1;
+ if (loff > roff)
+ return 1;
+
+ return 0;
+}
+#endif
+
+static Datum
+bench_search(FunctionCallInfo fcinfo, bool shuffle)
+{
+ BlockNumber minblk = PG_GETARG_INT32(0);
+ BlockNumber maxblk = PG_GETARG_INT32(1);
+ bool random_block = PG_GETARG_BOOL(2);
+ radix_tree *rt = NULL;
+ uint64 ntids;
+ uint64 key;
+ uint64 last_key = PG_UINT64_MAX;
+ uint64 val = 0;
+ ItemPointer tids;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_load_ms,
+ rt_search_ms;
+ Datum values[7];
+ bool nulls[7];
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ tids = generate_tids(minblk, maxblk, TIDS_PER_BLOCK_FOR_LOAD, &ntids, random_block);
+
+ /* measure the load time of the radix tree */
+ rt = rt_create(CurrentMemoryContext);
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ uint32 off;
+
+ CHECK_FOR_INTERRUPTS();
+
+ key = tid_to_key_off(tid, &off);
+
+ if (last_key != PG_UINT64_MAX && last_key != key)
+ {
+ rt_set(rt, last_key, val);
+ val = 0;
+ }
+
+ last_key = key;
+ val |= (uint64) 1 << off;
+ }
+ if (last_key != PG_UINT64_MAX)
+ rt_set(rt, last_key, val);
+
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_load_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ if (shuffle)
+ shuffle_itemptrs(tids, ntids);
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+ /* measure the search time of the radix tree */
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ uint32 off;
+ volatile bool ret; /* prevent calling rt_search from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ key = tid_to_key_off(tid, &off);
+
+ ret = rt_search(rt, key, &val);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_search_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int64GetDatum(rt_num_entries(rt));
+ values[1] = Int64GetDatum(rt_memory_usage(rt));
+ values[2] = Int64GetDatum(sizeof(ItemPointerData) * ntids);
+ values[3] = Int64GetDatum(rt_load_ms);
+ nulls[4] = true; /* ar_load_ms */
+ values[5] = Int64GetDatum(rt_search_ms);
+ nulls[6] = true; /* ar_search_ms */
+
+#ifdef MEASURE_BINARY_SEARCH
+ {
+ ItemPointer itemptrs = NULL;
+
+ int64 ar_load_ms,
+ ar_search_ms;
+
+ /* measure the load time of the array */
+ itemptrs = MemoryContextAllocHuge(CurrentMemoryContext,
+ sizeof(ItemPointerData) * ntids);
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointerSetBlockNumber(&(itemptrs[i]),
+ ItemPointerGetBlockNumber(&(tids[i])));
+ ItemPointerSetOffsetNumber(&(itemptrs[i]),
+ ItemPointerGetOffsetNumber(&(tids[i])));
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ ar_load_ms = secs * 1000 + usecs / 1000;
+
+ /* next, measure the search time of the array */
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ volatile bool ret; /* prevent calling bsearch from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ ret = bsearch((void *) tid,
+ (void *) itemptrs,
+ ntids,
+ sizeof(ItemPointerData),
+ vac_cmp_itemptr);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ ar_search_ms = secs * 1000 + usecs / 1000;
+
+ /* set the result */
+ nulls[4] = false;
+ values[4] = Int64GetDatum(ar_load_ms);
+ nulls[6] = false;
+ values[6] = Int64GetDatum(ar_search_ms);
+ }
+#endif
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_seq_search(PG_FUNCTION_ARGS)
+{
+ return bench_search(fcinfo, false);
+}
+
+Datum
+bench_shuffle_search(PG_FUNCTION_ARGS)
+{
+ return bench_search(fcinfo, true);
+}
+
+Datum
+bench_load_random_int(PG_FUNCTION_ARGS)
+{
+ uint64 cnt = (uint64) PG_GETARG_INT64(0);
+ radix_tree *rt;
+ pg_prng_state state;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 load_time_ms;
+ Datum values[2];
+ bool nulls[2];
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ pg_prng_seed(&state, 0);
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ uint64 key = pg_prng_uint64(&state);
+
+ rt_set(rt, key, key);
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ load_time_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int64GetDatum(rt_memory_usage(rt));
+ values[1] = Int64GetDatum(load_time_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_fixed_height_search(PG_FUNCTION_ARGS)
+{
+ int fanout = PG_GETARG_INT32(0);
+ radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_load_ms,
+ rt_search_ms;
+ Datum values[5];
+ bool nulls[5];
+
+ /* test boundary between vector and iteration */
+ const int n_keys = 5 * 16 * 16 * 16 * 16;
+ uint64 r,
+ h,
+ i,
+ j,
+ k;
+ int key_id;
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+
+ key_id = 0;
+
+ /*
+ * lower nodes have limited fanout, the top is only limited by
+ * bits-per-byte
+ */
+ for (r = 1;; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ for (i = 1; i <= fanout; i++)
+ {
+ for (j = 1; j <= fanout; j++)
+ {
+ for (k = 1; k <= fanout; k++)
+ {
+ uint64 key;
+
+ key = (r << 32) | (h << 24) | (i << 16) | (j << 8) | (k);
+
+ CHECK_FOR_INTERRUPTS();
+
+ key_id++;
+ if (key_id > n_keys)
+ goto finish_set;
+
+ rt_set(rt, key, key_id);
+ }
+ }
+ }
+ }
+ }
+finish_set:
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_load_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ /* measure the search time of the radix tree */
+ start_time = GetCurrentTimestamp();
+
+ key_id = 0;
+ for (r = 1;; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ for (i = 1; i <= fanout; i++)
+ {
+ for (j = 1; j <= fanout; j++)
+ {
+ for (k = 1; k <= fanout; k++)
+ {
+ uint64 key,
+ val;
+
+ key = (r << 32) | (h << 24) | (i << 16) | (j << 8) | (k);
+
+ CHECK_FOR_INTERRUPTS();
+
+ key_id++;
+ if (key_id > n_keys)
+ goto finish_search;
+
+ rt_search(rt, key, &val);
+ }
+ }
+ }
+ }
+ }
+finish_search:
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_search_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int32GetDatum(fanout);
+ values[1] = Int64GetDatum(rt_num_entries(rt));
+ values[2] = Int64GetDatum(rt_memory_usage(rt));
+ values[3] = Int64GetDatum(rt_load_ms);
+ values[4] = Int64GetDatum(rt_search_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
diff --git a/contrib/bench_radix_tree/bench_radix_tree.control b/contrib/bench_radix_tree/bench_radix_tree.control
new file mode 100644
index 0000000000..1d988e6c9a
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.control
@@ -0,0 +1,6 @@
+# bench_radix_tree extension
+comment = 'benchmark suite for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/bench_radix_tree'
+relocatable = true
+trusted = true
diff --git a/contrib/bench_radix_tree/expected/bench.out b/contrib/bench_radix_tree/expected/bench.out
new file mode 100644
index 0000000000..60c303892e
--- /dev/null
+++ b/contrib/bench_radix_tree/expected/bench.out
@@ -0,0 +1,13 @@
+create extension bench_radix_tree;
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
diff --git a/contrib/bench_radix_tree/sql/bench.sql b/contrib/bench_radix_tree/sql/bench.sql
new file mode 100644
index 0000000000..a46018c9d4
--- /dev/null
+++ b/contrib/bench_radix_tree/sql/bench.sql
@@ -0,0 +1,16 @@
+create extension bench_radix_tree;
+
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
--
2.31.1
On Mon, Oct 31, 2022 at 12:47 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
I've attached v8 patches. 0001, 0002, and 0003 patches incorporated
the comments I got so far. 0004 patch is a DSA support patch for PoC.
Thanks for the new patchset. This is not a full review, but I have some
comments:
0001 and 0002 look okay on a quick scan -- I will use this as a base for
further work that we discussed. However, before I do so I'd like to request
another revision regarding the following:
In 0004 patch, the basic idea is to use rt_node_ptr in all inner nodes
to point its children, and we use rt_node_ptr as either rt_node* or
dsa_pointer depending on whether the radix tree is shared or not (ie,
by checking radix_tree->dsa == NULL).
0004: Looks like a good start, but this patch has a large number of changes
like these, making it hard to read:
- if (found && child_p)
- *child_p = child;
+ if (found && childp_p)
+ *childp_p = childp;
...
rt_node_inner_32 *new32;
+ rt_node_ptr new32p;
/* grow node from 4 to 32 */
- new32 = (rt_node_inner_32 *) rt_copy_node(tree, (rt_node *) n4,
- RT_NODE_KIND_32);
+ new32p = rt_copy_node(tree, (rt_node *) n4, RT_NODE_KIND_32);
+ new32 = (rt_node_inner_32 *) node_ptr_get_local(tree, new32p);
It's difficult to keep in my head what all the variables refer to. I
thought a bit about how to split this patch up to make this easier to read.
Here's what I came up with:
typedef struct rt_node_ptr
{
uintptr_t encoded;
rt_node * decoded;
}
Note that there is nothing about "dsa or local". That's deliberate. That
way, we can use the "encoded" field for a tagged pointer as well, as I hope
we can do (at least for local pointers) in the future. So an intermediate
patch would have "static inline void" functions node_ptr_encode() and
node_ptr_decode(), which would only copy from one member to another. I
suspect that: 1. The actual DSA changes will be *much* smaller and easier
to reason about. 2. Experimenting with tagged pointers will be easier.
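For illustration, a minimal sketch of those two helpers could look like the following (the bodies are only the member-copying step described above, not the eventual DSA-aware or tagged version):

static inline void
node_ptr_encode(rt_node_ptr *ptr)
{
	/* for local memory the encoded form is just the raw address */
	ptr->encoded = (uintptr_t) ptr->decoded;
}

static inline void
node_ptr_decode(rt_node_ptr *ptr)
{
	/* a later patch could translate a dsa_pointer or strip tag bits here */
	ptr->decoded = (rt_node *) ptr->encoded;
}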
Also, quick question: 0004 has a new function rt_node_update_inner() -- is
that necessary because of DSA?, or does this ideally belong in 0002? What's
the reason for it?
Regarding the performance, I've
added another boolean argument to bench_seq/shuffle_search(),
specifying whether to use the shared radix tree or not. Here are
benchmark results in my environment,
[...]
In non-shared radix tree cases (the fourth argument is false), I don't
see a visible performance degradation. On the other hand, in shared
radix tree cases (the fourth argument is true), I see visible overhead
because of dsa_get_address().
Thanks, this is useful.
Please note that the current shared radix tree implementation doesn't
support any locking, so it cannot be read while written by someone.
I think at the very least we need a global lock to enforce this.
Also, only one process can iterate over the shared radix tree. When it
comes to parallel vacuum, these don't become restriction as the leader
process writes the radix tree while scanning heap and the radix tree
is read by multiple processes while vacuuming indexes. And only the
leader process can do heap vacuum by iterating the key-value pairs in
the radix tree. If we want to use it for other cases too, we would
need to support locking, RCU or something.
A useful exercise here is to think about what we'd need to do parallel heap
pruning. We don't need to go that far for v16 of course, but what's the
simplest thing we can do to make that possible? Other use cases can change
to more sophisticated schemes if need be.
--
John Naylor
EDB: http://www.enterprisedb.com
On Thu, Nov 3, 2022 at 1:59 PM John Naylor <john.naylor@enterprisedb.com> wrote:
On Mon, Oct 31, 2022 at 12:47 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
I've attached v8 patches. 0001, 0002, and 0003 patches incorporated
the comments I got so far. 0004 patch is a DSA support patch for PoC.
Thanks for the new patchset. This is not a full review, but I have some comments:
0001 and 0002 look okay on a quick scan -- I will use this as a base for further work that we discussed. However, before I do so I'd like to request another revision regarding the following:
In 0004 patch, the basic idea is to use rt_node_ptr in all inner nodes
to point its children, and we use rt_node_ptr as either rt_node* or
dsa_pointer depending on whether the radix tree is shared or not (ie,
by checking radix_tree->dsa == NULL).
Thank you for the comments!
0004: Looks like a good start, but this patch has a large number of changes like these, making it hard to read:
- if (found && child_p)
- *child_p = child;
+ if (found && childp_p)
+ *childp_p = childp;
...
rt_node_inner_32 *new32;
+ rt_node_ptr new32p;
/* grow node from 4 to 32 */
- new32 = (rt_node_inner_32 *) rt_copy_node(tree, (rt_node *) n4,
- RT_NODE_KIND_32);
+ new32p = rt_copy_node(tree, (rt_node *) n4, RT_NODE_KIND_32);
+ new32 = (rt_node_inner_32 *) node_ptr_get_local(tree, new32p);
It's difficult to keep in my head what all the variables refer to. I
thought a bit about how to split this patch up to make this easier to read.
Here's what I came up with:
typedef struct rt_node_ptr
{
uintptr_t encoded;
rt_node * decoded;
}
Note that there is nothing about "dsa or local". That's deliberate. That way, we can use the "encoded" field for a tagged pointer as well, as I hope we can do (at least for local pointers) in the future. So an intermediate patch would have "static inline void" functions node_ptr_encode() and node_ptr_decode(), which would only copy from one member to another. I suspect that: 1. The actual DSA changes will be *much* smaller and easier to reason about. 2. Experimenting with tagged pointers will be easier.
Good idea. Will try in the next version patch.
Also, quick question: 0004 has a new function rt_node_update_inner() -- is that necessary because of DSA?, or does this ideally belong in 0002? What's the reason for it?
Oh, this was needed at one point when I was initially writing the DSA
support, but thinking about it again now, I think we can remove it and
use rt_node_insert_inner() with parent = NULL instead.
Regarding the performance, I've
added another boolean argument to bench_seq/shuffle_search(),
specifying whether to use the shared radix tree or not. Here are
benchmark results in my environment, [...]
In non-shared radix tree cases (the fourth argument is false), I don't
see a visible performance degradation. On the other hand, in shared
radix tree cases (the fourth argument is true), I see visible overhead
because of dsa_get_address().
Thanks, this is useful.
Please note that the current shared radix tree implementation doesn't
support any locking, so it cannot be read while written by someone.
I think at the very least we need a global lock to enforce this.
Also, only one process can iterate over the shared radix tree. When it
comes to parallel vacuum, these don't become restriction as the leader
process writes the radix tree while scanning heap and the radix tree
is read by multiple processes while vacuuming indexes. And only the
leader process can do heap vacuum by iterating the key-value pairs in
the radix tree. If we want to use it for other cases too, we would
need to support locking, RCU or something.
A useful exercise here is to think about what we'd need to do parallel heap pruning. We don't need to go that far for v16 of course, but what's the simplest thing we can do to make that possible? Other use cases can change to more sophisticated schemes if need be.
For parallel heap pruning, multiple workers will insert key-value
pairs to the radix tree concurrently. The simplest solution would be a
single lock to protect writes but the performance will not be good.
Another solution would be to divide the table into multiple
ranges so that keys derived from TIDs do not conflict with each
other, and have parallel workers process one or more ranges. That way,
parallel vacuum workers can build *sub-trees* and the leader process
can merge them. In use cases of lazy vacuum, since the write phase and
read phase are separated the readers don't need to worry about
concurrent updates.
I've attached a draft patch for lazy vacuum integration that can be
applied on top of v8 patches. The patch adds a new module called
TIDStore, an efficient storage for TID backed by radix tree. Lazy
vacuum and parallel vacuum use it instead of a TID array. The patch
also introduces rt_detach() that was missed in 0002 patch. It's a very
rough patch but I hope it helps in considering lazy vacuum
integration, radix tree APIs, and shared radix tree functionality.
There are some TODOs:
* We need to reset the TIDStore and therefore reset the radix tree. It
can easily be done by using MemoryContextReset() in non-shared radix
tree cases, but in shared case, we need either to free all radix tree
nodes recursively or introduce a way to release all allocated DSA
memory.
* We need to limit the size of TIDStore (mainly radix_tree) in
maintenance_work_mem.
* We need to change the counter-based information in
pg_stat_progress_vacuum such as max_dead_tuples and num_dead_tuples.
I think it would be better to show the maximum number of bytes available
for collecting TIDs and the current usage instead.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachments:
v8-0005-PoC-lazy-vacuum-integration.patch (application/octet-stream)
From 315483e86611f485136efc6a6f141dd0caf3691c Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 4 Nov 2022 14:14:42 +0900
Subject: [PATCH v8 5/5] PoC: lazy vacuum integration.
The patch includes:
* Introducing a new module called TIDStore
* Lazy vacuum and parallel vacuum integration.
TODOs:
* radix tree needs to have the reset functionality.
* should not allow TIDStore to grow beyond the memory limit.
* change the progress statistics of pg_stat_progress_vacuum.
---
src/backend/access/common/Makefile | 1 +
src/backend/access/common/meson.build | 1 +
src/backend/access/common/tidstore.c | 273 ++++++++++++++++++++++++++
src/backend/access/heap/vacuumlazy.c | 160 +++++----------
src/backend/commands/vacuum.c | 45 +----
src/backend/commands/vacuumparallel.c | 59 +++---
src/backend/lib/radixtree.c | 9 +
src/backend/storage/lmgr/lwlock.c | 2 +
src/include/access/tidstore.h | 55 ++++++
src/include/commands/vacuum.h | 24 +--
src/include/lib/radixtree.h | 1 +
src/include/storage/lwlock.h | 1 +
12 files changed, 436 insertions(+), 195 deletions(-)
create mode 100644 src/backend/access/common/tidstore.c
create mode 100644 src/include/access/tidstore.h
diff --git a/src/backend/access/common/Makefile b/src/backend/access/common/Makefile
index b9aff0ccfd..67b8cc6108 100644
--- a/src/backend/access/common/Makefile
+++ b/src/backend/access/common/Makefile
@@ -27,6 +27,7 @@ OBJS = \
syncscan.o \
toast_compression.o \
toast_internals.o \
+ tidstore.o \
tupconvert.o \
tupdesc.o
diff --git a/src/backend/access/common/meson.build b/src/backend/access/common/meson.build
index 857beaa32d..76265974b1 100644
--- a/src/backend/access/common/meson.build
+++ b/src/backend/access/common/meson.build
@@ -13,6 +13,7 @@ backend_sources += files(
'syncscan.c',
'toast_compression.c',
'toast_internals.c',
+ 'tidstore.c',
'tupconvert.c',
'tupdesc.c',
)
diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
new file mode 100644
index 0000000000..8793c87fab
--- /dev/null
+++ b/src/backend/access/common/tidstore.c
@@ -0,0 +1,273 @@
+/*-------------------------------------------------------------------------
+ *
+ * tidstore.c
+ * TID (ItemPointer) storage implementation.
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/common/tidstore.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/tidstore.h"
+#include "lib/radixtree.h"
+#include "utils/dsa.h"
+#include "utils/memutils.h"
+
+/* XXX: should be configurable for non-heap AMs */
+#define TIDSTORE_OFFSET_NBITS 11 /* pg_ceil_log2_32(MaxHeapTuplesPerPage) */
+
+#define TIDSTORE_VALUE_NBITS 6 /* log(sizeof(uint64) * BITS_PER_BYTE, 2) */
+
+/* Get block number from the key */
+#define KEY_GET_BLKNO(key) \
+ ((BlockNumber) ((key) >> (TIDSTORE_OFFSET_NBITS - TIDSTORE_VALUE_NBITS)))
+
+struct TIDStore
+{
+ /* main storage for TID */
+ radix_tree *tree;
+
+ /* # of tids in TIDStore */
+ int num_tids;
+
+ /* DSA area and handle for shared TIDStore */
+ dsa_pointer handle;
+ dsa_area *dsa;
+};
+
+static void tidstore_iter_collect_tids(TIDStoreIter *iter, uint64 key, uint64 val);
+static inline uint64 tid_to_key_off(ItemPointer tid, uint32 *off);
+
+TIDStore *
+tidstore_create(dsa_area *dsa)
+{
+ TIDStore *ts;
+
+ ts = palloc0(sizeof(TIDStore));
+
+ ts->tree = rt_create(CurrentMemoryContext, dsa);
+ ts->dsa = dsa;
+
+ if (dsa != NULL)
+ ts->handle = rt_get_dsa_pointer(ts->tree);
+
+ return ts;
+}
+
+/* Attach the shared TIDStore */
+TIDStore *
+tidstore_attach(dsa_area *dsa, dsa_pointer handle)
+{
+ TIDStore *ts;
+
+ Assert(dsa != NULL);
+ Assert(DsaPointerIsValid(handle));
+
+ ts = palloc0(sizeof(TIDStore));
+
+ ts->tree = rt_attach(dsa, handle);
+
+ return ts;
+}
+
+void
+tidstore_detach(TIDStore *ts)
+{
+ rt_detach(ts->tree);
+}
+
+void
+tidstore_free(TIDStore *ts)
+{
+ rt_free(ts->tree);
+ pfree(ts);
+}
+
+void
+tidstore_reset(TIDStore *ts)
+{
+ if (ts->dsa != NULL)
+ {
+ /* XXX: reset shared radix tree */
+ Assert(false);
+ }
+ else
+ {
+ ts->num_tids = 0;
+
+ rt_free(ts->tree);
+ ts->tree = rt_create(CurrentMemoryContext, NULL);
+ }
+}
+
+/* Add TIDs to TIDStore */
+void
+tidstore_add_tids(TIDStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets)
+{
+ uint64 last_key = PG_UINT64_MAX;
+ uint64 key;
+ uint64 val = 0;
+ ItemPointerData tid;
+
+ ItemPointerSetBlockNumber(&tid, blkno);
+
+ for (int i = 0; i < num_offsets; i++)
+ {
+ uint32 off;
+
+ ItemPointerSetOffsetNumber(&tid, offsets[i]);
+
+ key = tid_to_key_off(&tid, &off);
+
+ if (last_key != PG_UINT64_MAX && last_key != key)
+ {
+ rt_set(ts->tree, last_key, val);
+ val = 0;
+ }
+
+ last_key = key;
+ val |= UINT64CONST(1) << off;
+ ts->num_tids++;
+ }
+
+ if (last_key != PG_UINT64_MAX)
+ {
+ rt_set(ts->tree, last_key, val);
+ val = 0;
+ }
+}
+
+/* Return true if the given TID is present in TIDStore */
+bool
+tidstore_lookup_tid(TIDStore *ts, ItemPointer tid)
+{
+ uint64 key;
+ uint64 val;
+ uint32 off;
+ bool found;
+
+ key = tid_to_key_off(tid, &off);
+
+ found = rt_search(ts->tree, key, &val);
+
+ if (!found)
+ return false;
+
+ return (val & (UINT64CONST(1) << off)) != 0;
+}
+
+TIDStoreIter *
+tidstore_begin_iterate(TIDStore *ts)
+{
+ TIDStoreIter *iter;
+
+ iter = palloc0(sizeof(TIDStoreIter));
+ iter->ts = ts;
+ iter->tree_iter = rt_begin_iterate(ts->tree);
+ iter->blkno = InvalidBlockNumber;
+
+ return iter;
+}
+
+bool
+tidstore_iterate_next(TIDStoreIter *iter)
+{
+ uint64 key;
+ uint64 val;
+
+ if (iter->finished)
+ return false;
+
+ if (BlockNumberIsValid(iter->blkno))
+ {
+ iter->num_offsets = 0;
+ tidstore_iter_collect_tids(iter, iter->next_key, iter->next_val);
+ }
+
+ while (rt_iterate_next(iter->tree_iter, &key, &val))
+ {
+ BlockNumber blkno;
+
+ blkno = KEY_GET_BLKNO(key);
+
+ if (BlockNumberIsValid(iter->blkno) && iter->blkno != blkno)
+ {
+ /*
+ * Remember the key-value pair for the next block for the
+ * next iteration.
+ */
+ iter->next_key = key;
+ iter->next_val = val;
+ return true;
+ }
+
+ /* Collect tids extracted from the key-value pair */
+ tidstore_iter_collect_tids(iter, key, val);
+ }
+
+ iter->finished = true;
+ return true;
+}
+
+uint64
+tidstore_num_tids(TIDStore *ts)
+{
+ return ts->num_tids;
+}
+
+uint64
+tidstore_memory_usage(TIDStore *ts)
+{
+ return (uint64) sizeof(TIDStore) + rt_memory_usage(ts->tree);
+}
+
+tidstore_handle
+tidstore_get_handle(TIDStore *ts)
+{
+ return rt_get_dsa_pointer(ts->tree);
+}
+
+/* Extract TIDs from key-value pair */
+static void
+tidstore_iter_collect_tids(TIDStoreIter *iter, uint64 key, uint64 val)
+{
+ for (int i = 0; i < sizeof(uint64) * BITS_PER_BYTE; i++)
+ {
+ uint64 tid_i;
+ OffsetNumber off;
+
+ if ((val & (UINT64CONST(1) << i)) == 0)
+ continue;
+
+ tid_i = key << TIDSTORE_VALUE_NBITS;
+ tid_i |= i;
+
+ off = tid_i & ((UINT64CONST(1) << TIDSTORE_OFFSET_NBITS) - 1);
+ iter->offsets[iter->num_offsets++] = off;
+ }
+
+ iter->blkno = KEY_GET_BLKNO(key);
+}
+
+/* Encode a TID to key and val */
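+/*
+ * Worked example, using the macros above: with TIDSTORE_OFFSET_NBITS = 11
+ * and TIDSTORE_VALUE_NBITS = 6, the TID (block 10, offset 3) gives
+ * tid_i = 3 | (10 << 11) = 20483, which is split into the radix tree key
+ * 20483 >> 6 = 320 and the bit position *off = 20483 & 63 = 3 within the
+ * 64-bit value. KEY_GET_BLKNO(320) recovers the block: 320 >> (11 - 6) = 10.
+ */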
+static inline uint64
+tid_to_key_off(ItemPointer tid, uint32 *off)
+{
+ uint64 upper;
+ uint64 tid_i;
+
+ tid_i = ItemPointerGetOffsetNumber(tid);
+ tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << TIDSTORE_OFFSET_NBITS;
+
+ *off = tid_i & ((1 << TIDSTORE_VALUE_NBITS) - 1);
+ upper = tid_i >> TIDSTORE_VALUE_NBITS;
+ Assert(*off < (sizeof(uint64) * BITS_PER_BYTE));
+
+ return upper;
+}
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index dfbe37472f..5b013bc3a8 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -40,6 +40,7 @@
#include "access/heapam_xlog.h"
#include "access/htup_details.h"
#include "access/multixact.h"
+#include "access/tidstore.h"
#include "access/transam.h"
#include "access/visibilitymap.h"
#include "access/xact.h"
@@ -144,6 +145,8 @@ typedef struct LVRelState
Relation *indrels;
int nindexes;
+ int max_bytes;
+
/* Aggressive VACUUM? (must set relfrozenxid >= FreezeLimit) */
bool aggressive;
/* Use visibility map to skip? (disabled by DISABLE_PAGE_SKIPPING) */
@@ -194,7 +197,7 @@ typedef struct LVRelState
* lazy_vacuum_heap_rel, which marks the same LP_DEAD line pointers as
* LP_UNUSED during second heap pass.
*/
- VacDeadItems *dead_items; /* TIDs whose index tuples we'll delete */
+ TIDStore *dead_items; /* TIDs whose index tuples we'll delete */
BlockNumber rel_pages; /* total number of pages */
BlockNumber scanned_pages; /* # pages examined (not skipped via VM) */
BlockNumber removed_pages; /* # pages removed by relation truncation */
@@ -265,8 +268,9 @@ static bool lazy_scan_noprune(LVRelState *vacrel, Buffer buf,
static void lazy_vacuum(LVRelState *vacrel);
static bool lazy_vacuum_all_indexes(LVRelState *vacrel);
static void lazy_vacuum_heap_rel(LVRelState *vacrel);
-static int lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
- Buffer buffer, int index, Buffer *vmbuffer);
+static void lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
+ OffsetNumber *offsets, int num_offsets,
+ Buffer buffer, Buffer *vmbuffer);
static bool lazy_check_wraparound_failsafe(LVRelState *vacrel);
static void lazy_cleanup_all_indexes(LVRelState *vacrel);
static IndexBulkDeleteResult *lazy_vacuum_one_index(Relation indrel,
@@ -397,6 +401,9 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
vacrel->indname = NULL;
vacrel->phase = VACUUM_ERRCB_PHASE_UNKNOWN;
vacrel->verbose = verbose;
+ vacrel->max_bytes = IsAutoVacuumWorkerProcess() &&
+ autovacuum_work_mem != -1 ?
+ autovacuum_work_mem * 1024L : maintenance_work_mem * 1024L;
errcallback.callback = vacuum_error_callback;
errcallback.arg = vacrel;
errcallback.previous = error_context_stack;
@@ -858,7 +865,7 @@ lazy_scan_heap(LVRelState *vacrel)
next_unskippable_block,
next_failsafe_block = 0,
next_fsm_block_to_vacuum = 0;
- VacDeadItems *dead_items = vacrel->dead_items;
+ TIDStore *dead_items = vacrel->dead_items;
Buffer vmbuffer = InvalidBuffer;
bool next_unskippable_allvis,
skipping_current_range;
@@ -872,7 +879,7 @@ lazy_scan_heap(LVRelState *vacrel)
/* Report that we're scanning the heap, advertising total # of blocks */
initprog_val[0] = PROGRESS_VACUUM_PHASE_SCAN_HEAP;
initprog_val[1] = rel_pages;
- initprog_val[2] = dead_items->max_items;
+ initprog_val[2] = vacrel->max_bytes; /* XXX: should use # of tids */
pgstat_progress_update_multi_param(3, initprog_index, initprog_val);
/* Set up an initial range of skippable blocks using the visibility map */
@@ -942,8 +949,8 @@ lazy_scan_heap(LVRelState *vacrel)
* dead_items TIDs, pause and do a cycle of vacuuming before we tackle
* this page.
*/
- Assert(dead_items->max_items >= MaxHeapTuplesPerPage);
- if (dead_items->max_items - dead_items->num_items < MaxHeapTuplesPerPage)
+ /* XXX: should not allow tidstore to grow beyond max_bytes */
+ if (tidstore_memory_usage(vacrel->dead_items) > vacrel->max_bytes)
{
/*
* Before beginning index vacuuming, we release any pin we may
@@ -1075,11 +1082,17 @@ lazy_scan_heap(LVRelState *vacrel)
if (prunestate.has_lpdead_items)
{
Size freespace;
+ TIDStoreIter *iter;
- lazy_vacuum_heap_page(vacrel, blkno, buf, 0, &vmbuffer);
+ iter = tidstore_begin_iterate(vacrel->dead_items);
+ tidstore_iterate_next(iter);
+ lazy_vacuum_heap_page(vacrel, blkno, iter->offsets, iter->num_offsets,
+ buf, &vmbuffer);
+ Assert(!tidstore_iterate_next(iter));
+ pfree(iter);
/* Forget the LP_DEAD items that we just vacuumed */
- dead_items->num_items = 0;
+ tidstore_reset(dead_items);
/*
* Periodically perform FSM vacuuming to make newly-freed
@@ -1116,7 +1129,7 @@ lazy_scan_heap(LVRelState *vacrel)
* with prunestate-driven visibility map and FSM steps (just like
* the two-pass strategy).
*/
- Assert(dead_items->num_items == 0);
+ Assert(tidstore_num_tids(dead_items) == 0);
}
/*
@@ -1269,7 +1282,7 @@ lazy_scan_heap(LVRelState *vacrel)
* Do index vacuuming (call each index's ambulkdelete routine), then do
* related heap vacuuming
*/
- if (dead_items->num_items > 0)
+ if (tidstore_num_tids(dead_items) > 0)
lazy_vacuum(vacrel);
/*
@@ -1903,25 +1916,16 @@ retry:
*/
if (lpdead_items > 0)
{
- VacDeadItems *dead_items = vacrel->dead_items;
- ItemPointerData tmp;
+ TIDStore *dead_items = vacrel->dead_items;
Assert(!prunestate->all_visible);
Assert(prunestate->has_lpdead_items);
vacrel->lpdead_item_pages++;
+ tidstore_add_tids(dead_items, blkno, deadoffsets, lpdead_items);
- ItemPointerSetBlockNumber(&tmp, blkno);
-
- for (int i = 0; i < lpdead_items; i++)
- {
- ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
- dead_items->items[dead_items->num_items++] = tmp;
- }
-
- Assert(dead_items->num_items <= dead_items->max_items);
pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
- dead_items->num_items);
+ tidstore_num_tids(dead_items));
}
/* Finally, add page-local counts to whole-VACUUM counts */
@@ -2128,8 +2132,7 @@ lazy_scan_noprune(LVRelState *vacrel,
}
else
{
- VacDeadItems *dead_items = vacrel->dead_items;
- ItemPointerData tmp;
+ TIDStore *dead_items = vacrel->dead_items;
/*
* Page has LP_DEAD items, and so any references/TIDs that remain in
@@ -2138,17 +2141,10 @@ lazy_scan_noprune(LVRelState *vacrel,
*/
vacrel->lpdead_item_pages++;
- ItemPointerSetBlockNumber(&tmp, blkno);
-
- for (int i = 0; i < lpdead_items; i++)
- {
- ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
- dead_items->items[dead_items->num_items++] = tmp;
- }
+ tidstore_add_tids(dead_items, blkno, deadoffsets, lpdead_items);
- Assert(dead_items->num_items <= dead_items->max_items);
pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
- dead_items->num_items);
+ tidstore_num_tids(dead_items));
vacrel->lpdead_items += lpdead_items;
@@ -2197,7 +2193,7 @@ lazy_vacuum(LVRelState *vacrel)
if (!vacrel->do_index_vacuuming)
{
Assert(!vacrel->do_index_cleanup);
- vacrel->dead_items->num_items = 0;
+ tidstore_reset(vacrel->dead_items);
return;
}
@@ -2226,7 +2222,7 @@ lazy_vacuum(LVRelState *vacrel)
BlockNumber threshold;
Assert(vacrel->num_index_scans == 0);
- Assert(vacrel->lpdead_items == vacrel->dead_items->num_items);
+ Assert(vacrel->lpdead_items == tidstore_num_tids(vacrel->dead_items));
Assert(vacrel->do_index_vacuuming);
Assert(vacrel->do_index_cleanup);
@@ -2253,8 +2249,8 @@ lazy_vacuum(LVRelState *vacrel)
* cases then this may need to be reconsidered.
*/
threshold = (double) vacrel->rel_pages * BYPASS_THRESHOLD_PAGES;
- bypass = (vacrel->lpdead_item_pages < threshold &&
- vacrel->lpdead_items < MAXDEADITEMS(32L * 1024L * 1024L));
+ bypass = (vacrel->lpdead_item_pages < threshold) &&
+ tidstore_memory_usage(vacrel->dead_items) < (32L * 1024L * 1024L);
}
if (bypass)
@@ -2299,7 +2295,7 @@ lazy_vacuum(LVRelState *vacrel)
* Forget the LP_DEAD items that we just vacuumed (or just decided to not
* vacuum)
*/
- vacrel->dead_items->num_items = 0;
+ /* tidstore_reset(vacrel->dead_items); */
}
/*
@@ -2371,7 +2367,7 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
* place).
*/
Assert(vacrel->num_index_scans > 0 ||
- vacrel->dead_items->num_items == vacrel->lpdead_items);
+ tidstore_num_tids(vacrel->dead_items) == vacrel->lpdead_items);
Assert(allindexes || vacrel->failsafe_active);
/*
@@ -2408,10 +2404,10 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
static void
lazy_vacuum_heap_rel(LVRelState *vacrel)
{
- int index;
BlockNumber vacuumed_pages;
Buffer vmbuffer = InvalidBuffer;
LVSavedErrInfo saved_err_info;
+ TIDStoreIter *iter;
Assert(vacrel->do_index_vacuuming);
Assert(vacrel->do_index_cleanup);
@@ -2428,8 +2424,8 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
vacuumed_pages = 0;
- index = 0;
- while (index < vacrel->dead_items->num_items)
+ iter = tidstore_begin_iterate(vacrel->dead_items);
+ while (tidstore_iterate_next(iter))
{
BlockNumber tblk;
Buffer buf;
@@ -2438,12 +2434,13 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
vacuum_delay_point();
- tblk = ItemPointerGetBlockNumber(&vacrel->dead_items->items[index]);
+ tblk = iter->blkno;
vacrel->blkno = tblk;
buf = ReadBufferExtended(vacrel->rel, MAIN_FORKNUM, tblk, RBM_NORMAL,
vacrel->bstrategy);
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
- index = lazy_vacuum_heap_page(vacrel, tblk, buf, index, &vmbuffer);
+ lazy_vacuum_heap_page(vacrel, tblk, iter->offsets, iter->num_offsets,
+ buf, &vmbuffer);
/* Now that we've vacuumed the page, record its available space */
page = BufferGetPage(buf);
@@ -2467,9 +2464,8 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
* We set all LP_DEAD items from the first heap pass to LP_UNUSED during
* the second heap pass. No more, no less.
*/
- Assert(index > 0);
Assert(vacrel->num_index_scans > 1 ||
- (index == vacrel->lpdead_items &&
+ (tidstore_num_tids(vacrel->dead_items) == vacrel->lpdead_items &&
vacuumed_pages == vacrel->lpdead_item_pages));
ereport(DEBUG2,
@@ -2491,11 +2487,10 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
* LP_DEAD item on the page. The return value is the first index immediately
* after all LP_DEAD items for the same page in the array.
*/
-static int
-lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
- int index, Buffer *vmbuffer)
+static void
+lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets, Buffer buffer, Buffer *vmbuffer)
{
- VacDeadItems *dead_items = vacrel->dead_items;
Page page = BufferGetPage(buffer);
OffsetNumber unused[MaxHeapTuplesPerPage];
int uncnt = 0;
@@ -2514,16 +2509,11 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
START_CRIT_SECTION();
- for (; index < dead_items->num_items; index++)
+ for (int i = 0; i < num_offsets; i++)
{
- BlockNumber tblk;
- OffsetNumber toff;
ItemId itemid;
+ OffsetNumber toff = offsets[i];
- tblk = ItemPointerGetBlockNumber(&dead_items->items[index]);
- if (tblk != blkno)
- break; /* past end of tuples for this block */
- toff = ItemPointerGetOffsetNumber(&dead_items->items[index]);
itemid = PageGetItemId(page, toff);
Assert(ItemIdIsDead(itemid) && !ItemIdHasStorage(itemid));
@@ -2603,7 +2593,6 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
/* Revert to the previous phase information for error traceback */
restore_vacuum_error_info(vacrel, &saved_err_info);
- return index;
}
/*
@@ -3105,46 +3094,6 @@ count_nondeletable_pages(LVRelState *vacrel, bool *lock_waiter_detected)
return vacrel->nonempty_pages;
}
-/*
- * Returns the number of dead TIDs that VACUUM should allocate space to
- * store, given a heap rel of size vacrel->rel_pages, and given current
- * maintenance_work_mem setting (or current autovacuum_work_mem setting,
- * when applicable).
- *
- * See the comments at the head of this file for rationale.
- */
-static int
-dead_items_max_items(LVRelState *vacrel)
-{
- int64 max_items;
- int vac_work_mem = IsAutoVacuumWorkerProcess() &&
- autovacuum_work_mem != -1 ?
- autovacuum_work_mem : maintenance_work_mem;
-
- if (vacrel->nindexes > 0)
- {
- BlockNumber rel_pages = vacrel->rel_pages;
-
- max_items = MAXDEADITEMS(vac_work_mem * 1024L);
- max_items = Min(max_items, INT_MAX);
- max_items = Min(max_items, MAXDEADITEMS(MaxAllocSize));
-
- /* curious coding here to ensure the multiplication can't overflow */
- if ((BlockNumber) (max_items / MaxHeapTuplesPerPage) > rel_pages)
- max_items = rel_pages * MaxHeapTuplesPerPage;
-
- /* stay sane if small maintenance_work_mem */
- max_items = Max(max_items, MaxHeapTuplesPerPage);
- }
- else
- {
- /* One-pass case only stores a single heap page's TIDs at a time */
- max_items = MaxHeapTuplesPerPage;
- }
-
- return (int) max_items;
-}
-
/*
* Allocate dead_items (either using palloc, or in dynamic shared memory).
* Sets dead_items in vacrel for caller.
@@ -3155,12 +3104,6 @@ dead_items_max_items(LVRelState *vacrel)
static void
dead_items_alloc(LVRelState *vacrel, int nworkers)
{
- VacDeadItems *dead_items;
- int max_items;
-
- max_items = dead_items_max_items(vacrel);
- Assert(max_items >= MaxHeapTuplesPerPage);
-
/*
* Initialize state for a parallel vacuum. As of now, only one worker can
* be used for an index, so we invoke parallelism only if there are at
@@ -3186,7 +3129,6 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
else
vacrel->pvs = parallel_vacuum_init(vacrel->rel, vacrel->indrels,
vacrel->nindexes, nworkers,
- max_items,
vacrel->verbose ? INFO : DEBUG2,
vacrel->bstrategy);
@@ -3199,11 +3141,7 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
}
/* Serial VACUUM case */
- dead_items = (VacDeadItems *) palloc(vac_max_items_to_alloc_size(max_items));
- dead_items->max_items = max_items;
- dead_items->num_items = 0;
-
- vacrel->dead_items = dead_items;
+ vacrel->dead_items = tidstore_create(NULL);
}
/*
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 7ccde07de9..03ce9c3b6e 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -2295,16 +2295,16 @@ get_vacoptval_from_boolean(DefElem *def)
*/
IndexBulkDeleteResult *
vac_bulkdel_one_index(IndexVacuumInfo *ivinfo, IndexBulkDeleteResult *istat,
- VacDeadItems *dead_items)
+ TIDStore *dead_items)
{
/* Do bulk deletion */
istat = index_bulk_delete(ivinfo, istat, vac_tid_reaped,
(void *) dead_items);
ereport(ivinfo->message_level,
- (errmsg("scanned index \"%s\" to remove %d row versions",
+ (errmsg("scanned index \"%s\" to remove " UINT64_FORMAT " row versions",
RelationGetRelationName(ivinfo->index),
- dead_items->num_items)));
+ tidstore_num_tids(dead_items))));
return istat;
}
@@ -2335,18 +2335,6 @@ vac_cleanup_one_index(IndexVacuumInfo *ivinfo, IndexBulkDeleteResult *istat)
return istat;
}
-/*
- * Returns the total required space for VACUUM's dead_items array given a
- * max_items value.
- */
-Size
-vac_max_items_to_alloc_size(int max_items)
-{
- Assert(max_items <= MAXDEADITEMS(MaxAllocSize));
-
- return offsetof(VacDeadItems, items) + sizeof(ItemPointerData) * max_items;
-}
-
/*
* vac_tid_reaped() -- is a particular tid deletable?
*
@@ -2357,32 +2345,9 @@ vac_max_items_to_alloc_size(int max_items)
static bool
vac_tid_reaped(ItemPointer itemptr, void *state)
{
- VacDeadItems *dead_items = (VacDeadItems *) state;
- int64 litem,
- ritem,
- item;
- ItemPointer res;
-
- litem = itemptr_encode(&dead_items->items[0]);
- ritem = itemptr_encode(&dead_items->items[dead_items->num_items - 1]);
- item = itemptr_encode(itemptr);
-
- /*
- * Doing a simple bound check before bsearch() is useful to avoid the
- * extra cost of bsearch(), especially if dead items on the heap are
- * concentrated in a certain range. Since this function is called for
- * every index tuple, it pays to be really fast.
- */
- if (item < litem || item > ritem)
- return false;
-
- res = (ItemPointer) bsearch((void *) itemptr,
- (void *) dead_items->items,
- dead_items->num_items,
- sizeof(ItemPointerData),
- vac_cmp_itemptr);
+ TIDStore *dead_items = (TIDStore *) state;
- return (res != NULL);
+ return tidstore_lookup_tid(dead_items, itemptr);
}
/*
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index f26d796e52..641c98d80b 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -44,7 +44,7 @@
* use small integers.
*/
#define PARALLEL_VACUUM_KEY_SHARED 1
-#define PARALLEL_VACUUM_KEY_DEAD_ITEMS 2
+#define PARALLEL_VACUUM_KEY_DSA 2
#define PARALLEL_VACUUM_KEY_QUERY_TEXT 3
#define PARALLEL_VACUUM_KEY_BUFFER_USAGE 4
#define PARALLEL_VACUUM_KEY_WAL_USAGE 5
@@ -103,6 +103,9 @@ typedef struct PVShared
/* Counter for vacuuming and cleanup */
pg_atomic_uint32 idx;
+
+ /* Handle of the shared TIDStore */
+ tidstore_handle dead_items_handle;
} PVShared;
/* Status used during parallel index vacuum or cleanup */
@@ -166,7 +169,7 @@ struct ParallelVacuumState
PVIndStats *indstats;
/* Shared dead items space among parallel vacuum workers */
- VacDeadItems *dead_items;
+ TIDStore *dead_items;
/* Points to buffer usage area in DSM */
BufferUsage *buffer_usage;
@@ -222,20 +225,22 @@ static void parallel_vacuum_error_callback(void *arg);
*/
ParallelVacuumState *
parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
- int nrequested_workers, int max_items,
- int elevel, BufferAccessStrategy bstrategy)
+ int nrequested_workers, int elevel,
+ BufferAccessStrategy bstrategy)
{
ParallelVacuumState *pvs;
ParallelContext *pcxt;
PVShared *shared;
- VacDeadItems *dead_items;
+ TIDStore *dead_items;
PVIndStats *indstats;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
+ void *area_space;
+ dsa_area *dead_items_dsa;
bool *will_parallel_vacuum;
Size est_indstats_len;
Size est_shared_len;
- Size est_dead_items_len;
+ Size dsa_minsize = dsa_minimum_size();
int nindexes_mwm = 0;
int parallel_workers = 0;
int querylen;
@@ -283,9 +288,8 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_estimate_chunk(&pcxt->estimator, est_shared_len);
shm_toc_estimate_keys(&pcxt->estimator, 1);
- /* Estimate size for dead_items -- PARALLEL_VACUUM_KEY_DEAD_ITEMS */
- est_dead_items_len = vac_max_items_to_alloc_size(max_items);
- shm_toc_estimate_chunk(&pcxt->estimator, est_dead_items_len);
+ /* Estimate size for dead tuple DSA -- PARALLEL_VACUUM_KEY_DSA */
+ shm_toc_estimate_chunk(&pcxt->estimator, dsa_minsize);
shm_toc_estimate_keys(&pcxt->estimator, 1);
/*
@@ -351,6 +355,15 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_INDEX_STATS, indstats);
pvs->indstats = indstats;
+ /* Prepare DSA space for dead items */
+ area_space = shm_toc_allocate(pcxt->toc, dsa_minsize);
+ shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_DSA, area_space);
+ dead_items_dsa = dsa_create_in_place(area_space, dsa_minsize,
+ LWTRANCHE_PARALLEL_VACUUM_DSA,
+ pcxt->seg);
+ dead_items = tidstore_create(dead_items_dsa);
+ pvs->dead_items = dead_items;
+
/* Prepare shared information */
shared = (PVShared *) shm_toc_allocate(pcxt->toc, est_shared_len);
MemSet(shared, 0, est_shared_len);
@@ -360,6 +373,7 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
(nindexes_mwm > 0) ?
maintenance_work_mem / Min(parallel_workers, nindexes_mwm) :
maintenance_work_mem;
+ shared->dead_items_handle = tidstore_get_handle(dead_items);
pg_atomic_init_u32(&(shared->cost_balance), 0);
pg_atomic_init_u32(&(shared->active_nworkers), 0);
@@ -368,15 +382,6 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_SHARED, shared);
pvs->shared = shared;
- /* Prepare the dead_items space */
- dead_items = (VacDeadItems *) shm_toc_allocate(pcxt->toc,
- est_dead_items_len);
- dead_items->max_items = max_items;
- dead_items->num_items = 0;
- MemSet(dead_items->items, 0, sizeof(ItemPointerData) * max_items);
- shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_DEAD_ITEMS, dead_items);
- pvs->dead_items = dead_items;
-
/*
* Allocate space for each worker's BufferUsage and WalUsage; no need to
* initialize
@@ -434,6 +439,8 @@ parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats)
istats[i] = NULL;
}
+ tidstore_free(pvs->dead_items);
+
DestroyParallelContext(pvs->pcxt);
ExitParallelMode();
@@ -442,7 +449,7 @@ parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats)
}
/* Returns the dead items space */
-VacDeadItems *
+TIDStore *
parallel_vacuum_get_dead_items(ParallelVacuumState *pvs)
{
return pvs->dead_items;
@@ -940,7 +947,9 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
Relation *indrels;
PVIndStats *indstats;
PVShared *shared;
- VacDeadItems *dead_items;
+ TIDStore *dead_items;
+ void *area_space;
+ dsa_area *dead_items_area;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
int nindexes;
@@ -984,10 +993,10 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
PARALLEL_VACUUM_KEY_INDEX_STATS,
false);
- /* Set dead_items space */
- dead_items = (VacDeadItems *) shm_toc_lookup(toc,
- PARALLEL_VACUUM_KEY_DEAD_ITEMS,
- false);
+ /* Set dead items */
+ area_space = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_DSA, false);
+ dead_items_area = dsa_attach_in_place(area_space, seg);
+ dead_items = tidstore_attach(dead_items_area, shared->dead_items_handle);
/* Set cost-based vacuum delay */
VacuumCostActive = (VacuumCostDelay > 0);
@@ -1033,6 +1042,8 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
&wal_usage[ParallelWorkerNumber]);
+ tidstore_detach(pvs.dead_items);
+
/* Pop the error context stack */
error_context_stack = errcallback.previous;
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
index 3b06f22af5..a428046d71 100644
--- a/src/backend/lib/radixtree.c
+++ b/src/backend/lib/radixtree.c
@@ -1731,6 +1731,7 @@ rt_attach(dsa_area *dsa, dsa_pointer dp)
tree = (radix_tree *) palloc0(sizeof(radix_tree));
tree->ctl_dp = dp;
+ tree->dsa = dsa;
tree->ctl = (radix_tree_control *) dsa_get_address(dsa, dp);
/* XXX: do we need to set a callback on exit to detach dsa? */
@@ -1738,6 +1739,14 @@ rt_attach(dsa_area *dsa, dsa_pointer dp)
return tree;
}
+void
+rt_detach(radix_tree *tree)
+{
+ Assert(RadixTreeIsShared(tree));
+ dsa_detach(tree->dsa);
+ pfree(tree);
+}
+
/*
* Free the given radix tree.
*/
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 0fc0cf6ebb..f94608f45a 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -183,6 +183,8 @@ static const char *const BuiltinTrancheNames[] = {
"PgStatsHash",
/* LWTRANCHE_PGSTATS_DATA: */
"PgStatsData",
+ /* LWTRANCHE_PARALLEL_VACUUM_DSA: */
+ "ParallelVacuumDSA",
};
StaticAssertDecl(lengthof(BuiltinTrancheNames) ==
diff --git a/src/include/access/tidstore.h b/src/include/access/tidstore.h
new file mode 100644
index 0000000000..40b8021f9b
--- /dev/null
+++ b/src/include/access/tidstore.h
@@ -0,0 +1,55 @@
+/*-------------------------------------------------------------------------
+ *
+ * tidstore.h
+ * TID storage.
+ *
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/tidstore.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef TIDSTORE_H
+#define TIDSTORE_H
+
+#include "lib/radixtree.h"
+#include "storage/itemptr.h"
+
+typedef dsa_pointer tidstore_handle;
+
+typedef struct TIDStore TIDStore;
+
+typedef struct TIDStoreIter
+{
+ TIDStore *ts;
+
+ rt_iter *tree_iter;
+
+ bool finished;
+
+ uint64 next_key;
+ uint64 next_val;
+
+ BlockNumber blkno;
+ OffsetNumber offsets[MaxOffsetNumber]; /* XXX: usually don't use up */
+ int num_offsets;
+} TIDStoreIter;
+
+extern TIDStore *tidstore_create(dsa_area *dsa);
+extern TIDStore *tidstore_attach(dsa_area *dsa, dsa_pointer handle);
+extern void tidstore_detach(TIDStore *ts);
+extern void tidstore_free(TIDStore *ts);
+extern void tidstore_reset(TIDStore *ts);
+extern void tidstore_add_tids(TIDStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets);
+extern bool tidstore_lookup_tid(TIDStore *ts, ItemPointer tid);
+extern TIDStoreIter * tidstore_begin_iterate(TIDStore *ts);
+extern bool tidstore_iterate_next(TIDStoreIter *iter);
+extern uint64 tidstore_num_tids(TIDStore *ts);
+extern uint64 tidstore_memory_usage(TIDStore *ts);
+extern tidstore_handle tidstore_get_handle(TIDStore *ts);
+
+#endif /* TIDSTORE_H */
+
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 5d816ba7f4..d221528f16 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -17,6 +17,7 @@
#include "access/htup.h"
#include "access/genam.h"
#include "access/parallel.h"
+#include "access/tidstore.h"
#include "catalog/pg_class.h"
#include "catalog/pg_statistic.h"
#include "catalog/pg_type.h"
@@ -235,21 +236,6 @@ typedef struct VacuumParams
int nworkers;
} VacuumParams;
-/*
- * VacDeadItems stores TIDs whose index tuples are deleted by index vacuuming.
- */
-typedef struct VacDeadItems
-{
- int max_items; /* # slots allocated in array */
- int num_items; /* current # of entries */
-
- /* Sorted array of TIDs to delete from indexes */
- ItemPointerData items[FLEXIBLE_ARRAY_MEMBER];
-} VacDeadItems;
-
-#define MAXDEADITEMS(avail_mem) \
- (((avail_mem) - offsetof(VacDeadItems, items)) / sizeof(ItemPointerData))
-
/* GUC parameters */
extern PGDLLIMPORT int default_statistics_target; /* PGDLLIMPORT for PostGIS */
extern PGDLLIMPORT int vacuum_freeze_min_age;
@@ -306,18 +292,16 @@ extern Relation vacuum_open_relation(Oid relid, RangeVar *relation,
LOCKMODE lmode);
extern IndexBulkDeleteResult *vac_bulkdel_one_index(IndexVacuumInfo *ivinfo,
IndexBulkDeleteResult *istat,
- VacDeadItems *dead_items);
+ TIDStore *dead_items);
extern IndexBulkDeleteResult *vac_cleanup_one_index(IndexVacuumInfo *ivinfo,
IndexBulkDeleteResult *istat);
-extern Size vac_max_items_to_alloc_size(int max_items);
/* in commands/vacuumparallel.c */
extern ParallelVacuumState *parallel_vacuum_init(Relation rel, Relation *indrels,
int nindexes, int nrequested_workers,
- int max_items, int elevel,
- BufferAccessStrategy bstrategy);
+ int elevel, BufferAccessStrategy bstrategy);
extern void parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats);
-extern VacDeadItems *parallel_vacuum_get_dead_items(ParallelVacuumState *pvs);
+extern TIDStore *parallel_vacuum_get_dead_items(ParallelVacuumState *pvs);
extern void parallel_vacuum_bulkdel_all_indexes(ParallelVacuumState *pvs,
long num_table_tuples,
int num_index_scans);
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index d9d8355c21..e3f90adebd 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -29,6 +29,7 @@ extern rt_iter *rt_begin_iterate(radix_tree *tree);
extern dsa_pointer rt_get_dsa_pointer(radix_tree *tree);
extern radix_tree *rt_attach(dsa_area *dsa, dsa_pointer dp);
+extern void rt_detach(radix_tree *tree);
extern bool rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p);
extern void rt_end_iterate(rt_iter *iter);
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index ca4eca76f4..0999e4fc10 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -193,6 +193,7 @@ typedef enum BuiltinTrancheIds
LWTRANCHE_PGSTATS_DSA,
LWTRANCHE_PGSTATS_HASH,
LWTRANCHE_PGSTATS_DATA,
+ LWTRANCHE_PARALLEL_VACUUM_DSA,
LWTRANCHE_FIRST_USER_DEFINED
} BuiltinTrancheIds;
--
2.31.1
On Fri, Nov 4, 2022 at 10:25 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
For parallel heap pruning, multiple workers will insert key-value
pairs to the radix tree concurrently. The simplest solution would be a
single lock to protect writes but the performance will not be good.
Another solution would be that we can divide the tables into multiple
ranges so that keys derived from TIDs are not conflicted with each
other and have parallel workers process one or more ranges. That way,
parallel vacuum workers can build *sub-trees* and the leader process
can merge them. In use cases of lazy vacuum, since the write phase and
read phase are separated the readers don't need to worry about
concurrent updates.
It's a good idea to use ranges for a different reason -- readahead. See
commit 56788d2156fc3, which aimed to improve readahead for sequential
scans. It might work to use that as a model: Each worker prunes a range of
64 pages, keeping the dead tids in a local array. At the end of the range:
lock the tid store, enter the tids into the store, unlock, free the local
array, and get the next range from the leader. It's possible contention
won't be too bad, and I suspect using small local arrays as-we-go would be
faster and use less memory than merging multiple sub-trees at the end.
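To sketch the shape of that worker loop (a rough illustration only -- PrunedPage, get_next_block_range(), prune_one_page(), and dead_items_lock are hypothetical placeholders here; only tidstore_add_tids() comes from the draft TIDStore patch):

typedef struct PrunedPage		/* hypothetical local bookkeeping */
{
	BlockNumber	blkno;
	int			ndead;
	OffsetNumber offsets[MaxHeapTuplesPerPage];
} PrunedPage;

while (get_next_block_range(leader, &start_blk, &end_blk))	/* e.g. 64 pages */
{
	PrunedPage *pages = palloc(sizeof(PrunedPage) * (end_blk - start_blk));
	int			npages = 0;

	/* prune the whole range, remembering dead TIDs only locally */
	for (BlockNumber blk = start_blk; blk < end_blk; blk++)
	{
		PrunedPage *pp = &pages[npages];

		pp->ndead = prune_one_page(rel, blk, pp->offsets);
		if (pp->ndead > 0)
		{
			pp->blkno = blk;
			npages++;
		}
	}

	/* at the end of the range, enter everything into the shared store */
	LWLockAcquire(&shared->dead_items_lock, LW_EXCLUSIVE);
	for (int i = 0; i < npages; i++)
		tidstore_add_tids(dead_items, pages[i].blkno,
						  pages[i].offsets, pages[i].ndead);
	LWLockRelease(&shared->dead_items_lock);

	pfree(pages);
}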
I've attached a draft patch for lazy vacuum integration that can be
applied on top of v8 patches. The patch adds a new module called
TIDStore, an efficient storage for TID backed by radix tree. Lazy
vacuum and parallel vacuum use it instead of a TID array. The patch
also introduces rt_detach() that was missed in 0002 patch. It's a very
rough patch but I hope it helps in considering lazy vacuum
integration, radix tree APIs, and shared radix tree functionality.
It does help, good to see this.
--
John Naylor
EDB: http://www.enterprisedb.com
On Sat, Nov 5, 2022 at 6:23 PM John Naylor <john.naylor@enterprisedb.com> wrote:
On Fri, Nov 4, 2022 at 10:25 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
For parallel heap pruning, multiple workers will insert key-value
pairs to the radix tree concurrently. The simplest solution would be a
single lock to protect writes but the performance will not be good.
Another solution would be that we can divide the tables into multiple
ranges so that keys derived from TIDs are not conflicted with each
other and have parallel workers process one or more ranges. That way,
parallel vacuum workers can build *sub-trees* and the leader process
can merge them. In use cases of lazy vacuum, since the write phase and
read phase are separated the readers don't need to worry about
concurrent updates.
It's a good idea to use ranges for a different reason -- readahead. See commit 56788d2156fc3, which aimed to improve readahead for sequential scans. It might work to use that as a model: Each worker prunes a range of 64 pages, keeping the dead tids in a local array. At the end of the range: lock the tid store, enter the tids into the store, unlock, free the local array, and get the next range from the leader. It's possible contention won't be too bad, and I suspect using small local arrays as-we-go would be faster and use less memory than merging multiple sub-trees at the end.
Seems like a promising idea. I think it might work well even in the current
parallel vacuum (i.e., single writer). I mean, I think we can have a
single lwlock for shared cases in the first version. If the overhead
of acquiring the lwlock per insertion of key-value is not negligible,
we might want to try this idea.
Apart from that, I'm going to incorporate the comments on 0004 patch
and try a pointer tagging.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Fri, Nov 4, 2022 at 8:25 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
For parallel heap pruning, multiple workers will insert key-value
pairs to the radix tree concurrently. The simplest solution would be a
single lock to protect writes but the performance will not be good.
Another solution would be that we can divide the tables into multiple
ranges so that keys derived from TIDs are not conflicted with each
other and have parallel workers process one or more ranges. That way,
parallel vacuum workers can build *sub-trees* and the leader process
can merge them. In use cases of lazy vacuum, since the write phase and
read phase are separated the readers don't need to worry about
concurrent updates.
I think that the VM snapshot concept can eventually be used to
implement parallel heap pruning. Since every page that will become a
scanned_pages is known right from the start with VM snapshots, it will
be relatively straightforward to partition these pages into distinct
ranges with an equal number of pages, one per worker planned. The VM
snapshot structure can also be used for I/O prefetching, which will be
more important with parallel heap pruning (and with aio).
Working off of an immutable structure that describes which pages to
process right from the start is naturally easy to work with, in
general. We can "reorder work" flexibly (i.e. process individual
scanned_pages in any order that is convenient). Another example is
"changing our mind" about advancing relfrozenxid when it turns out
that we maybe should have decided to do that at the start of VACUUM
[1]. The VM snapshot approach may not turn out to
be a very useful idea, but it is at least an interesting and thought
provoking concept.
[1] /messages/by-id/CAH2-WzkQ86yf==mgAF=cQ0qeLRWKX3htLw9Qo+qx3zbwJJkPiQ@mail.gmail.com
--
Peter Geoghegan
On Tue, Nov 8, 2022 at 11:14 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Sat, Nov 5, 2022 at 6:23 PM John Naylor <john.naylor@enterprisedb.com> wrote:
On Fri, Nov 4, 2022 at 10:25 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
For parallel heap pruning, multiple workers will insert key-value
pairs to the radix tree concurrently. The simplest solution would be a
single lock to protect writes but the performance will not be good.
Another solution would be that we can divide the tables into multiple
ranges so that keys derived from TIDs are not conflicted with each
other and have parallel workers process one or more ranges. That way,
parallel vacuum workers can build *sub-trees* and the leader process
can merge them. In use cases of lazy vacuum, since the write phase and
read phase are separated the readers don't need to worry about
concurrent updates.
It's a good idea to use ranges for a different reason -- readahead. See commit 56788d2156fc3, which aimed to improve readahead for sequential scans. It might work to use that as a model: Each worker prunes a range of 64 pages, keeping the dead tids in a local array. At the end of the range: lock the tid store, enter the tids into the store, unlock, free the local array, and get the next range from the leader. It's possible contention won't be too bad, and I suspect using small local arrays as-we-go would be faster and use less memory than merging multiple sub-trees at the end.
Seems a promising idea. I think it might work well even in the current
parallel vacuum (ie., single writer). I mean, I think we can have a
single lwlock for shared cases in the first version. If the overhead
of acquiring the lwlock per insertion of key-value is not negligible,
we might want to try this idea.
Apart from that, I'm going to incorporate the comments on 0004 patch
and try a pointer tagging.
I'd like to share some progress on this work.
The 0004 patch is a new patch supporting pointer tagging of the node
kind. Also, it introduces the rt_node_ptr we discussed, so that internal
functions use it rather than having two arguments for encoded and
decoded pointers. With this intermediate patch, the DSA support patch
became more readable and understandable. We could probably make it
even smaller if we move the change of separating the control object
from radix_tree into the main patch (0002). The patch still needs to be
polished but I'd like to check if this idea is worthwhile. If we agree
on this direction, this patch will be merged into the main radix tree
implementation patch.
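As a rough sketch of the tagging idea (not the actual 0004 code; the macro and helper names below are made up, and it assumes node allocations are at least 8-byte aligned so the low bits of the address are free):

#define RT_PTR_TAG_MASK	((uintptr_t) 0x07)

static inline rt_node_ptr
node_ptr_tag(rt_node *node, uint8 kind)
{
	rt_node_ptr ptr;

	ptr.decoded = node;
	ptr.encoded = ((uintptr_t) node) | kind;
	return ptr;
}

static inline uint8
node_ptr_kind(rt_node_ptr ptr)
{
	/* the kind is available without dereferencing the node */
	return (uint8) (ptr.encoded & RT_PTR_TAG_MASK);
}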
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachments:
v9-0003-tool-for-measuring-radix-tree-performance.patch (application/octet-stream)
From b5950f71e476f3621e46ec7da0f8f9f7a452a685 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 16 Sep 2022 11:57:03 +0900
Subject: [PATCH v9 3/6] tool for measuring radix tree performance
---
contrib/bench_radix_tree/Makefile | 21 +
.../bench_radix_tree--1.0.sql | 56 +++
contrib/bench_radix_tree/bench_radix_tree.c | 466 ++++++++++++++++++
.../bench_radix_tree/bench_radix_tree.control | 6 +
contrib/bench_radix_tree/expected/bench.out | 13 +
contrib/bench_radix_tree/sql/bench.sql | 16 +
6 files changed, 578 insertions(+)
create mode 100644 contrib/bench_radix_tree/Makefile
create mode 100644 contrib/bench_radix_tree/bench_radix_tree--1.0.sql
create mode 100644 contrib/bench_radix_tree/bench_radix_tree.c
create mode 100644 contrib/bench_radix_tree/bench_radix_tree.control
create mode 100644 contrib/bench_radix_tree/expected/bench.out
create mode 100644 contrib/bench_radix_tree/sql/bench.sql
diff --git a/contrib/bench_radix_tree/Makefile b/contrib/bench_radix_tree/Makefile
new file mode 100644
index 0000000000..952bb0ceae
--- /dev/null
+++ b/contrib/bench_radix_tree/Makefile
@@ -0,0 +1,21 @@
+# contrib/bench_radix_tree/Makefile
+
+MODULE_big = bench_radix_tree
+OBJS = \
+ bench_radix_tree.o
+
+EXTENSION = bench_radix_tree
+DATA = bench_radix_tree--1.0.sql
+
+REGRESS = bench
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/bench_radix_tree
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
new file mode 100644
index 0000000000..0874201d7e
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
@@ -0,0 +1,56 @@
+/* contrib/bench_radix_tree/bench_radix_tree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION bench_radix_tree" to load this file. \quit
+
+create function bench_shuffle_search(
+minblk int4,
+maxblk int4,
+random_block bool DEFAULT false,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT array_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT array_load_ms int8,
+OUT rt_search_ms int8,
+OUT array_serach_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_seq_search(
+minblk int4,
+maxblk int4,
+random_block bool DEFAULT false,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT array_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT array_load_ms int8,
+OUT rt_search_ms int8,
+OUT array_serach_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_load_random_int(
+cnt int8,
+OUT mem_allocated int8,
+OUT load_ms int8)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_fixed_height_search(
+fanout int4,
+OUT fanout int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT rt_search_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c
new file mode 100644
index 0000000000..7abb237e96
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -0,0 +1,466 @@
+/*-------------------------------------------------------------------------
+ *
+ * bench_radix_tree.c
+ *
+ * Copyright (c) 2016-2022, PostgreSQL Global Development Group
+ *
+ * contrib/bench_radix_tree/bench_radix_tree.c
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "funcapi.h"
+#include "lib/radixtree.h"
+#include <math.h>
+#include "miscadmin.h"
+#include "utils/timestamp.h"
+
+PG_MODULE_MAGIC;
+
+/* run benchmark also for binary-search case? */
+/* #define MEASURE_BINARY_SEARCH 1 */
+
+#define TIDS_PER_BLOCK_FOR_LOAD 30
+#define TIDS_PER_BLOCK_FOR_LOOKUP 50
+
+PG_FUNCTION_INFO_V1(bench_seq_search);
+PG_FUNCTION_INFO_V1(bench_shuffle_search);
+PG_FUNCTION_INFO_V1(bench_load_random_int);
+PG_FUNCTION_INFO_V1(bench_fixed_height_search);
+
+static uint64
+tid_to_key_off(ItemPointer tid, uint32 *off)
+{
+ uint64 upper;
+ uint32 shift = pg_ceil_log2_32(MaxHeapTuplesPerPage);
+ int64 tid_i;
+
+ Assert(ItemPointerGetOffsetNumber(tid) < MaxHeapTuplesPerPage);
+
+ tid_i = ItemPointerGetOffsetNumber(tid);
+ tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << shift;
+
+ /* log(sizeof(uint64) * BITS_PER_BYTE, 2) = log(64, 2) = 6 */
+ *off = tid_i & ((1 << 6) - 1);
+ upper = tid_i >> 6;
+ Assert(*off < (sizeof(uint64) * BITS_PER_BYTE));
+
+ Assert(*off < 64);
+
+ return upper;
+}
+
+static int
+shuffle_randrange(pg_prng_state *state, int lower, int upper)
+{
+ return (int) floor(pg_prng_double(state) * ((upper - lower) + 0.999999)) + lower;
+}
+
+/* Naive Fisher-Yates implementation */
+static void
+shuffle_itemptrs(ItemPointer itemptr, uint64 nitems)
+{
+ /* seed with a constant for reproducibility */
+ pg_prng_state state;
+
+ pg_prng_seed(&state, 0);
+
+ for (int i = 0; i < nitems - 1; i++)
+ {
+ int j = shuffle_randrange(&state, i, nitems - 1);
+ ItemPointerData t = itemptr[j];
+
+ itemptr[j] = itemptr[i];
+ itemptr[i] = t;
+ }
+}
+
+static ItemPointer
+generate_tids(BlockNumber minblk, BlockNumber maxblk, int ntids_per_blk, uint64 *ntids_p,
+ bool random_block)
+{
+ ItemPointer tids;
+ uint64 maxitems;
+ uint64 ntids = 0;
+ pg_prng_state state;
+
+ maxitems = (maxblk - minblk + 1) * ntids_per_blk;
+ tids = MemoryContextAllocHuge(TopTransactionContext,
+ sizeof(ItemPointerData) * maxitems);
+
+ if (random_block)
+ pg_prng_seed(&state, 0x9E3779B185EBCA87);
+
+ for (BlockNumber blk = minblk; blk < maxblk; blk++)
+ {
+ if (random_block && !pg_prng_bool(&state))
+ continue;
+
+ for (OffsetNumber off = FirstOffsetNumber;
+ off <= ntids_per_blk; off++)
+ {
+ CHECK_FOR_INTERRUPTS();
+
+ ItemPointerSetBlockNumber(&(tids[ntids]), blk);
+ ItemPointerSetOffsetNumber(&(tids[ntids]), off);
+
+ ntids++;
+ }
+ }
+
+ *ntids_p = ntids;
+ return tids;
+}
+
+#ifdef MEASURE_BINARY_SEARCH
+static int
+vac_cmp_itemptr(const void *left, const void *right)
+{
+ BlockNumber lblk,
+ rblk;
+ OffsetNumber loff,
+ roff;
+
+ lblk = ItemPointerGetBlockNumber((ItemPointer) left);
+ rblk = ItemPointerGetBlockNumber((ItemPointer) right);
+
+ if (lblk < rblk)
+ return -1;
+ if (lblk > rblk)
+ return 1;
+
+ loff = ItemPointerGetOffsetNumber((ItemPointer) left);
+ roff = ItemPointerGetOffsetNumber((ItemPointer) right);
+
+ if (loff < roff)
+ return -1;
+ if (loff > roff)
+ return 1;
+
+ return 0;
+}
+#endif
+
+static Datum
+bench_search(FunctionCallInfo fcinfo, bool shuffle)
+{
+ BlockNumber minblk = PG_GETARG_INT32(0);
+ BlockNumber maxblk = PG_GETARG_INT32(1);
+ bool random_block = PG_GETARG_BOOL(2);
+ radix_tree *rt = NULL;
+ uint64 ntids;
+ uint64 key;
+ uint64 last_key = PG_UINT64_MAX;
+ uint64 val = 0;
+ ItemPointer tids;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_load_ms,
+ rt_search_ms;
+ Datum values[7];
+ bool nulls[7];
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ tids = generate_tids(minblk, maxblk, TIDS_PER_BLOCK_FOR_LOAD, &ntids, random_block);
+
+ /* measure the load time of the radix tree */
+ rt = rt_create(CurrentMemoryContext);
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ uint32 off;
+
+ CHECK_FOR_INTERRUPTS();
+
+ key = tid_to_key_off(tid, &off);
+
+ if (last_key != PG_UINT64_MAX && last_key != key)
+ {
+ rt_set(rt, last_key, val);
+ val = 0;
+ }
+
+ last_key = key;
+ val |= (uint64) 1 << off;
+ }
+ if (last_key != PG_UINT64_MAX)
+ rt_set(rt, last_key, val);
+
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_load_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ if (shuffle)
+ shuffle_itemptrs(tids, ntids);
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+ /* measure the search time of the radix tree */
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ uint32 off;
+ volatile bool ret; /* prevent calling rt_search from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ key = tid_to_key_off(tid, &off);
+
+ ret = rt_search(rt, key, &val);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_search_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int64GetDatum(rt_num_entries(rt));
+ values[1] = Int64GetDatum(rt_memory_usage(rt));
+ values[2] = Int64GetDatum(sizeof(ItemPointerData) * ntids);
+ values[3] = Int64GetDatum(rt_load_ms);
+ nulls[4] = true; /* ar_load_ms */
+ values[5] = Int64GetDatum(rt_search_ms);
+ nulls[6] = true; /* ar_search_ms */
+
+#ifdef MEASURE_BINARY_SEARCH
+ {
+ ItemPointer itemptrs = NULL;
+
+ int64 ar_load_ms,
+ ar_search_ms;
+
+ /* measure the load time of the array */
+ itemptrs = MemoryContextAllocHuge(CurrentMemoryContext,
+ sizeof(ItemPointerData) * ntids);
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointerSetBlockNumber(&(itemptrs[i]),
+ ItemPointerGetBlockNumber(&(tids[i])));
+ ItemPointerSetOffsetNumber(&(itemptrs[i]),
+ ItemPointerGetOffsetNumber(&(tids[i])));
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ ar_load_ms = secs * 1000 + usecs / 1000;
+
+ /* next, measure the search time of the array */
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ volatile bool ret; /* prevent calling bsearch from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ ret = bsearch((void *) tid,
+ (void *) itemptrs,
+ ntids,
+ sizeof(ItemPointerData),
+ vac_cmp_itemptr);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ ar_search_ms = secs * 1000 + usecs / 1000;
+
+ /* set the result */
+ nulls[4] = false;
+ values[4] = Int64GetDatum(ar_load_ms);
+ nulls[6] = false;
+ values[6] = Int64GetDatum(ar_search_ms);
+ }
+#endif
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_seq_search(PG_FUNCTION_ARGS)
+{
+ return bench_search(fcinfo, false);
+}
+
+Datum
+bench_shuffle_search(PG_FUNCTION_ARGS)
+{
+ return bench_search(fcinfo, true);
+}
+
+Datum
+bench_load_random_int(PG_FUNCTION_ARGS)
+{
+ uint64 cnt = (uint64) PG_GETARG_INT64(0);
+ radix_tree *rt;
+ pg_prng_state state;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 load_time_ms;
+ Datum values[2];
+ bool nulls[2];
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ pg_prng_seed(&state, 0);
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ uint64 key = pg_prng_uint64(&state);
+
+ rt_set(rt, key, key);
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ load_time_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int64GetDatum(rt_memory_usage(rt));
+ values[1] = Int64GetDatum(load_time_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_fixed_height_search(PG_FUNCTION_ARGS)
+{
+ int fanout = PG_GETARG_INT32(0);
+ radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_load_ms,
+ rt_search_ms;
+ Datum values[5];
+ bool nulls[5];
+
+ /* test boundary between vector and iteration */
+ const int n_keys = 5 * 16 * 16 * 16 * 16;
+ uint64 r,
+ h,
+ i,
+ j,
+ k;
+ int key_id;
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+
+ key_id = 0;
+
+ /*
+ * lower nodes have limited fanout, the top is only limited by
+ * bits-per-byte
+ */
+ for (r = 1;; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ for (i = 1; i <= fanout; i++)
+ {
+ for (j = 1; j <= fanout; j++)
+ {
+ for (k = 1; k <= fanout; k++)
+ {
+ uint64 key;
+
+ key = (r << 32) | (h << 24) | (i << 16) | (j << 8) | (k);
+
+ CHECK_FOR_INTERRUPTS();
+
+ key_id++;
+ if (key_id > n_keys)
+ goto finish_set;
+
+ rt_set(rt, key, key_id);
+ }
+ }
+ }
+ }
+ }
+finish_set:
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_load_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ /* measure the search time of the radix tree */
+ start_time = GetCurrentTimestamp();
+
+ key_id = 0;
+ for (r = 1;; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ for (i = 1; i <= fanout; i++)
+ {
+ for (j = 1; j <= fanout; j++)
+ {
+ for (k = 1; k <= fanout; k++)
+ {
+ uint64 key,
+ val;
+
+ key = (r << 32) | (h << 24) | (i << 16) | (j << 8) | (k);
+
+ CHECK_FOR_INTERRUPTS();
+
+ key_id++;
+ if (key_id > n_keys)
+ goto finish_search;
+
+ rt_search(rt, key, &val);
+ }
+ }
+ }
+ }
+ }
+finish_search:
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_search_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int32GetDatum(fanout);
+ values[1] = Int64GetDatum(rt_num_entries(rt));
+ values[2] = Int64GetDatum(rt_memory_usage(rt));
+ values[3] = Int64GetDatum(rt_load_ms);
+ values[4] = Int64GetDatum(rt_search_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
diff --git a/contrib/bench_radix_tree/bench_radix_tree.control b/contrib/bench_radix_tree/bench_radix_tree.control
new file mode 100644
index 0000000000..1d988e6c9a
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.control
@@ -0,0 +1,6 @@
+# bench_radix_tree extension
+comment = 'benchmark suite for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/bench_radix_tree'
+relocatable = true
+trusted = true
diff --git a/contrib/bench_radix_tree/expected/bench.out b/contrib/bench_radix_tree/expected/bench.out
new file mode 100644
index 0000000000..60c303892e
--- /dev/null
+++ b/contrib/bench_radix_tree/expected/bench.out
@@ -0,0 +1,13 @@
+create extension bench_radix_tree;
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
diff --git a/contrib/bench_radix_tree/sql/bench.sql b/contrib/bench_radix_tree/sql/bench.sql
new file mode 100644
index 0000000000..a46018c9d4
--- /dev/null
+++ b/contrib/bench_radix_tree/sql/bench.sql
@@ -0,0 +1,16 @@
+create extension bench_radix_tree;
+
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
--
2.31.1
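
For reference, the key encoding used by tid_to_key_off() in the benchmark above
can be illustrated with a small standalone sketch. The constants and helper name
below are assumptions for illustration only: with 8kB heap pages
MaxHeapTuplesPerPage is 291, so the offset number needs 9 bits; the low 6 bits of
the combined value select a bit within the uint64 stored in the tree, and the
remaining upper bits become the radix tree key.

#include <stdint.h>
#include <stdio.h>

#define OFFSET_BITS 9   /* ceil(log2(291)), assumed for 8kB pages */

/*
 * Pack (block, offset) into a radix tree key plus a bit position, mirroring
 * tid_to_key_off(): the offset goes in the low bits, the block number above
 * it; the low 6 bits index a bit within the 64-bit value, the rest is the key.
 */
static uint64_t
tid_to_key_off_sketch(uint32_t block, uint16_t offset, uint32_t *bit)
{
    uint64_t full = ((uint64_t) block << OFFSET_BITS) | offset;

    *bit = full & ((1 << 6) - 1);
    return full >> 6;
}

int
main(void)
{
    uint32_t bit;
    uint64_t key = tid_to_key_off_sketch(1000, 17, &bit);

    printf("key = %llu, bit = %u\n", (unsigned long long) key, bit);
    return 0;
}

One TID thus maps to one bit, and up to 64 TIDs whose combined (block, offset)
values share the same upper bits share a single key/value pair in the tree,
which keeps the representation compact for pages with many dead tuples.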
Attachment: v9-0004-PoC-tag-the-node-kind-to-rt_pointer.patch (application/octet-stream)
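
The core idea of this patch is to tag the node kind into the low bits of the
node pointer so the kind no longer needs its own field in rt_node. A minimal
standalone sketch of the encode/decode follows; the helper names are made up
for the sketch, and it assumes node allocations are at least 4-byte aligned so
the low 2 bits of an address are always zero.

#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef uintptr_t tagged_ptr;

#define KIND_MASK 0x03

static inline tagged_ptr
tag_encode(void *p, uint8_t kind)
{
    /* alignment guarantees the low 2 bits of the raw address are zero */
    assert(((uintptr_t) p & KIND_MASK) == 0);
    return (uintptr_t) p | (kind & KIND_MASK);
}

static inline void *
tag_decode(tagged_ptr t)
{
    return (void *) (t & ~(uintptr_t) KIND_MASK);
}

static inline uint8_t
tag_kind(tagged_ptr t)
{
    return (uint8_t) (t & KIND_MASK);
}

int
main(void)
{
    int        *node = malloc(sizeof(int));
    tagged_ptr  t = tag_encode(node, 2);    /* e.g. RT_NODE_KIND_128 */

    assert(tag_decode(t) == node && tag_kind(t) == 2);
    free(node);
    return 0;
}

Presumably this also paves the way toward keeping the tree in DSA memory for
parallel vacuum, where child links cannot be stored as raw C pointers.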
From 624fd0577546a746f0538b98ab7456adc4ca1bd5 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 14 Nov 2022 11:44:17 +0900
Subject: [PATCH v9 4/6] PoC: tag the node kind to rt_pointer.
---
src/backend/lib/radixtree.c | 660 ++++++++++++++++++++----------------
1 file changed, 375 insertions(+), 285 deletions(-)
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
index bd58b2bfad..c25d455d2a 100644
--- a/src/backend/lib/radixtree.c
+++ b/src/backend/lib/radixtree.c
@@ -126,6 +126,23 @@ typedef enum
#define RT_NODE_KIND_128 0x02
#define RT_NODE_KIND_256 0x03
#define RT_NODE_KIND_COUNT 4
+#define RT_POINTER_KIND_MASK 0x03
+
+/*
+ * rt_pointer is a tagged pointer for rt_node. It is encoded from a
+ * C-pointer (i.e., a local memory address) and the node kind. The node
+ * kind uses the lower 2 bits, which are always 0 in a local memory address.
+ * We can encode and decode the pointer using the rt_pointer_encode() and
+ * rt_pointer_decode() functions, respectively.
+ *
+ * The inner nodes of the radix tree need to store rt_pointer rather than a
+ * C-pointer for the above reason.
+ */
+typedef uintptr_t rt_pointer;
+#define InvalidRTPointer ((rt_pointer) 0)
+#define RTPointerIsValid(x) (((rt_pointer) (x)) != InvalidRTPointer)
+#define RTPointerTagKind(x, k) ((rt_pointer) (x) | ((k) & RT_POINTER_KIND_MASK))
+#define RTPointerUnTagKind(x) ((rt_pointer) (x) & ~RT_POINTER_KIND_MASK)
/* Common type for all nodes types */
typedef struct rt_node
@@ -144,13 +161,12 @@ typedef struct rt_node
uint8 shift;
uint8 chunk;
- /* Size kind of the node */
- uint8 kind;
+ /*
+ * The node kind is tagged into the rt_pointer, see the comments of
+ * rt_pointer for details.
+ */
} rt_node;
-#define NODE_IS_LEAF(n) (((rt_node *) (n))->shift == 0)
-#define NODE_IS_EMPTY(n) (((rt_node *) (n))->count == 0)
-#define NODE_HAS_FREE_SLOT(n) \
- (((rt_node *) (n))->count < rt_node_kind_info[((rt_node *) (n))->kind].fanout)
+#define RT_NODE_IS_LEAF(n) (((rt_node *) (n))->shift == 0)
/* Base type of each node kinds for leaf and inner nodes */
typedef struct rt_node_base_4
@@ -205,7 +221,7 @@ typedef struct rt_node_inner_4
rt_node_base_4 base;
/* 4 children, for key chunks */
- rt_node *children[4];
+ rt_pointer children[4];
} rt_node_inner_4;
typedef struct rt_node_leaf_4
@@ -221,7 +237,7 @@ typedef struct rt_node_inner_32
rt_node_base_32 base;
/* 32 children, for key chunks */
- rt_node *children[32];
+ rt_pointer children[32];
} rt_node_inner_32;
typedef struct rt_node_leaf_32
@@ -237,7 +253,7 @@ typedef struct rt_node_inner_128
rt_node_base_128 base;
/* Slots for 128 children */
- rt_node *children[128];
+ rt_pointer children[128];
} rt_node_inner_128;
typedef struct rt_node_leaf_128
@@ -260,7 +276,7 @@ typedef struct rt_node_inner_256
rt_node_base_256 base;
/* Slots for 256 children */
- rt_node *children[RT_NODE_MAX_SLOTS];
+ rt_pointer children[RT_NODE_MAX_SLOTS];
} rt_node_inner_256;
typedef struct rt_node_leaf_256
@@ -274,6 +290,30 @@ typedef struct rt_node_leaf_256
uint64 values[RT_NODE_MAX_SLOTS];
} rt_node_leaf_256;
+/*
+ * rt_node_ptr is a useful data structure representing a pointer to an rt_node.
+ */
+typedef struct rt_node_ptr
+{
+ rt_pointer encoded;
+ rt_node *decoded;
+} rt_node_ptr;
+#define InvalidRTNodePtr \
+ (rt_node_ptr) {.encoded = InvalidRTPointer, .decoded = NULL }
+#define RTNodePtrIsValid(n) \
+ (!rt_node_ptr_eq((rt_node_ptr *) &(n), &(InvalidRTNodePtr)))
+
+/* Macros for rt_node_ptr to access the fields of rt_node */
+#define NODE_RAW(n) (((rt_node_ptr) (n)).decoded)
+#define NODE_IS_LEAF(n) (NODE_RAW(n)->shift == 0)
+#define NODE_IS_EMPTY(n) (NODE_COUNT(n) == 0)
+#define NODE_KIND(n) ((uint8) (((rt_node_ptr) (n)).encoded & RT_POINTER_KIND_MASK))
+#define NODE_COUNT(n) (NODE_RAW(n)->count)
+#define NODE_SHIFT(n) (NODE_RAW(n)->shift)
+#define NODE_CHUNK(n) (NODE_RAW(n)->chunk)
+#define NODE_HAS_FREE_SLOT(n) \
+ (NODE_COUNT(n) < rt_node_kind_info[NODE_KIND(n)].fanout)
+
/* Information of each size kinds */
typedef struct rt_node_kind_info_elem
{
@@ -347,7 +387,7 @@ static rt_node_kind_info_elem rt_node_kind_info[RT_NODE_KIND_COUNT] = {
*/
typedef struct rt_node_iter
{
- rt_node *node; /* current node being iterated */
+ rt_node_ptr node; /* current node being iterated */
int current_idx; /* current position. -1 for initial value */
} rt_node_iter;
@@ -368,7 +408,7 @@ struct radix_tree
{
MemoryContext context;
- rt_node *root;
+ rt_pointer root;
uint64 max_val;
uint64 num_keys;
@@ -382,26 +422,56 @@ struct radix_tree
};
static void rt_new_root(radix_tree *tree, uint64 key);
-static rt_node *rt_alloc_node(radix_tree *tree, int kind, uint8 shift, uint8 chunk,
- bool inner);
-static void rt_free_node(radix_tree *tree, rt_node *node);
+static rt_node_ptr rt_alloc_node(radix_tree *tree, int kind, uint8 shift, uint8 chunk,
+ bool inner);
+static void rt_free_node(radix_tree *tree, rt_node_ptr node);
static void rt_extend(radix_tree *tree, uint64 key);
-static inline bool rt_node_search_inner(rt_node *node, uint64 key, rt_action action,
- rt_node **child_p);
-static inline bool rt_node_search_leaf(rt_node *node, uint64 key, rt_action action,
+static inline bool rt_node_search_inner(rt_node_ptr node_ptr, uint64 key, rt_action action,
+ rt_pointer *child_p);
+static inline bool rt_node_search_leaf(rt_node_ptr node_ptr, uint64 key, rt_action action,
uint64 *value_p);
-static bool rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node,
- uint64 key, rt_node *child);
-static bool rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
+static bool rt_node_insert_inner(radix_tree *tree, rt_node_ptr parent, rt_node_ptr node,
+ uint64 key, rt_node_ptr child);
+static bool rt_node_insert_leaf(radix_tree *tree, rt_node_ptr parent, rt_node_ptr node,
uint64 key, uint64 value);
-static inline rt_node *rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter);
+static inline bool rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
+ rt_node_ptr *child_p);
static inline bool rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
uint64 *value_p);
-static void rt_update_iter_stack(rt_iter *iter, rt_node *from_node, int from);
+static void rt_update_iter_stack(rt_iter *iter, rt_node_ptr from_node, int from);
static inline void rt_iter_update_key(rt_iter *iter, uint8 chunk, uint8 shift);
/* verification (available only with assertion) */
-static void rt_verify_node(rt_node *node);
+static void rt_verify_node(rt_node_ptr node);
+
+/* Decode and encode function of rt_pointer */
+static inline rt_node *
+rt_pointer_decode(rt_pointer encoded)
+{
+ return (rt_node *) RTPointerUnTagKind(encoded);
+}
+
+static inline rt_pointer
+rt_pointer_encode(rt_node *decoded, uint8 kind)
+{
+ return (rt_pointer) RTPointerTagKind(decoded, kind);
+}
+
+/* Return a rt_pointer created from the given encoded pointer */
+static inline rt_node_ptr
+rt_node_ptr_encoded(rt_pointer encoded)
+{
+ return (rt_node_ptr) {
+ .encoded = encoded,
+ .decoded = rt_pointer_decode(encoded)
+ };
+}
+
+static inline bool
+rt_node_ptr_eq(rt_node_ptr *a, rt_node_ptr *b)
+{
+ return (a->decoded == b->decoded) && (a->encoded == b->encoded);
+}
/*
* Return index of the first element in 'base' that equals 'key'. Return -1
@@ -550,10 +620,10 @@ node_32_get_insertpos(rt_node_base_32 *node, uint8 chunk)
/* Shift the elements right at 'idx' by one */
static inline void
-chunk_children_array_shift(uint8 *chunks, rt_node **children, int count, int idx)
+chunk_children_array_shift(uint8 *chunks, rt_pointer *children, int count, int idx)
{
memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
- memmove(&(children[idx + 1]), &(children[idx]), sizeof(rt_node *) * (count - idx));
+ memmove(&(children[idx + 1]), &(children[idx]), sizeof(rt_pointer) * (count - idx));
}
static inline void
@@ -565,10 +635,10 @@ chunk_values_array_shift(uint8 *chunks, uint64 *values, int count, int idx)
/* Delete the element at 'idx' */
static inline void
-chunk_children_array_delete(uint8 *chunks, rt_node **children, int count, int idx)
+chunk_children_array_delete(uint8 *chunks, rt_pointer *children, int count, int idx)
{
memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
- memmove(&(children[idx]), &(children[idx + 1]), sizeof(rt_node *) * (count - idx - 1));
+ memmove(&(children[idx]), &(children[idx + 1]), sizeof(rt_pointer) * (count - idx - 1));
}
static inline void
@@ -580,15 +650,15 @@ chunk_values_array_delete(uint8 *chunks, uint64 *values, int count, int idx)
/* Copy both chunks and children/values arrays */
static inline void
-chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
- uint8 *dst_chunks, rt_node **dst_children, int count)
+chunk_children_array_copy(uint8 *src_chunks, rt_pointer *src_children,
+ uint8 *dst_chunks, rt_pointer *dst_children, int count)
{
/* For better code generation */
if (count > rt_node_kind_info[RT_NODE_KIND_4].fanout)
pg_unreachable();
memcpy(dst_chunks, src_chunks, sizeof(uint8) * count);
- memcpy(dst_children, src_children, sizeof(rt_node *) * count);
+ memcpy(dst_children, src_children, sizeof(rt_pointer) * count);
}
static inline void
@@ -616,28 +686,28 @@ node_128_is_chunk_used(rt_node_base_128 *node, uint8 chunk)
static inline bool
node_inner_128_is_slot_used(rt_node_inner_128 *node, uint8 slot)
{
- Assert(!NODE_IS_LEAF(node));
- return (node->children[slot] != NULL);
+ Assert(!RT_NODE_IS_LEAF(node));
+ return RTPointerIsValid(node->children[slot]);
}
static inline bool
node_leaf_128_is_slot_used(rt_node_leaf_128 *node, uint8 slot)
{
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
return ((node->isset[RT_NODE_BITMAP_BYTE(slot)] & RT_NODE_BITMAP_BIT(slot)) != 0);
}
-static inline rt_node *
+static inline rt_pointer
node_inner_128_get_child(rt_node_inner_128 *node, uint8 chunk)
{
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
return node->children[node->base.slot_idxs[chunk]];
}
static inline uint64
node_leaf_128_get_value(rt_node_leaf_128 *node, uint8 chunk)
{
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
Assert(((rt_node_base_128 *) node)->slot_idxs[chunk] != RT_NODE_128_INVALID_IDX);
return node->values[node->base.slot_idxs[chunk]];
}
@@ -645,7 +715,7 @@ node_leaf_128_get_value(rt_node_leaf_128 *node, uint8 chunk)
static void
node_inner_128_delete(rt_node_inner_128 *node, uint8 chunk)
{
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
node->base.slot_idxs[chunk] = RT_NODE_128_INVALID_IDX;
}
@@ -654,7 +724,7 @@ node_leaf_128_delete(rt_node_leaf_128 *node, uint8 chunk)
{
int slotpos = node->base.slot_idxs[chunk];
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
node->isset[RT_NODE_BITMAP_BYTE(slotpos)] &= ~(RT_NODE_BITMAP_BIT(slotpos));
node->base.slot_idxs[chunk] = RT_NODE_128_INVALID_IDX;
}
@@ -665,7 +735,7 @@ node_inner_128_find_unused_slot(rt_node_inner_128 *node, uint8 chunk)
{
int slotpos = 0;
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
while (node_inner_128_is_slot_used(node, slotpos))
slotpos++;
@@ -677,7 +747,7 @@ node_leaf_128_find_unused_slot(rt_node_leaf_128 *node, uint8 chunk)
{
int slotpos;
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
/* We iterate over the isset bitmap per byte then check each bit */
for (slotpos = 0; slotpos < RT_NODE_NSLOTS_BITS(128); slotpos++)
@@ -695,11 +765,11 @@ node_leaf_128_find_unused_slot(rt_node_leaf_128 *node, uint8 chunk)
}
static inline void
-node_inner_128_insert(rt_node_inner_128 *node, uint8 chunk, rt_node *child)
+node_inner_128_insert(rt_node_inner_128 *node, uint8 chunk, rt_pointer child)
{
int slotpos;
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
/* find unused slot */
slotpos = node_inner_128_find_unused_slot(node, chunk);
@@ -714,7 +784,7 @@ node_leaf_128_insert(rt_node_leaf_128 *node, uint8 chunk, uint64 value)
{
int slotpos;
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
/* find unused slot */
slotpos = node_leaf_128_find_unused_slot(node, chunk);
@@ -726,16 +796,16 @@ node_leaf_128_insert(rt_node_leaf_128 *node, uint8 chunk, uint64 value)
/* Update the child corresponding to 'chunk' to 'child' */
static inline void
-node_inner_128_update(rt_node_inner_128 *node, uint8 chunk, rt_node *child)
+node_inner_128_update(rt_node_inner_128 *node, uint8 chunk, rt_pointer child)
{
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
node->children[node->base.slot_idxs[chunk]] = child;
}
static inline void
node_leaf_128_update(rt_node_leaf_128 *node, uint8 chunk, uint64 value)
{
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
node->values[node->base.slot_idxs[chunk]] = value;
}
@@ -745,21 +815,21 @@ node_leaf_128_update(rt_node_leaf_128 *node, uint8 chunk, uint64 value)
static inline bool
node_inner_256_is_chunk_used(rt_node_inner_256 *node, uint8 chunk)
{
- Assert(!NODE_IS_LEAF(node));
- return (node->children[chunk] != NULL);
+ Assert(!RT_NODE_IS_LEAF(node));
+ return RTPointerIsValid(node->children[chunk]);
}
static inline bool
node_leaf_256_is_chunk_used(rt_node_leaf_256 *node, uint8 chunk)
{
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
return (node->isset[RT_NODE_BITMAP_BYTE(chunk)] & RT_NODE_BITMAP_BIT(chunk)) != 0;
}
-static inline rt_node *
+static inline rt_pointer
node_inner_256_get_child(rt_node_inner_256 *node, uint8 chunk)
{
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
Assert(node_inner_256_is_chunk_used(node, chunk));
return node->children[chunk];
}
@@ -767,16 +837,16 @@ node_inner_256_get_child(rt_node_inner_256 *node, uint8 chunk)
static inline uint64
node_leaf_256_get_value(rt_node_leaf_256 *node, uint8 chunk)
{
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
Assert(node_leaf_256_is_chunk_used(node, chunk));
return node->values[chunk];
}
/* Set the child in the node-256 */
static inline void
-node_inner_256_set(rt_node_inner_256 *node, uint8 chunk, rt_node *child)
+node_inner_256_set(rt_node_inner_256 *node, uint8 chunk, rt_pointer child)
{
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
node->children[chunk] = child;
}
@@ -784,7 +854,7 @@ node_inner_256_set(rt_node_inner_256 *node, uint8 chunk, rt_node *child)
static inline void
node_leaf_256_set(rt_node_leaf_256 *node, uint8 chunk, uint64 value)
{
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
node->isset[RT_NODE_BITMAP_BYTE(chunk)] |= RT_NODE_BITMAP_BIT(chunk);
node->values[chunk] = value;
}
@@ -793,14 +863,14 @@ node_leaf_256_set(rt_node_leaf_256 *node, uint8 chunk, uint64 value)
static inline void
node_inner_256_delete(rt_node_inner_256 *node, uint8 chunk)
{
- Assert(!NODE_IS_LEAF(node));
- node->children[chunk] = NULL;
+ Assert(!RT_NODE_IS_LEAF(node));
+ node->children[chunk] = InvalidRTPointer;
}
static inline void
node_leaf_256_delete(rt_node_leaf_256 *node, uint8 chunk)
{
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
node->isset[RT_NODE_BITMAP_BYTE(chunk)] &= ~(RT_NODE_BITMAP_BIT(chunk));
}
@@ -835,37 +905,36 @@ static void
rt_new_root(radix_tree *tree, uint64 key)
{
int shift = key_get_shift(key);
- rt_node *node;
+ rt_node_ptr node;
- node = (rt_node *) rt_alloc_node(tree, RT_NODE_KIND_4, shift, 0,
- shift > 0);
+ node = rt_alloc_node(tree, RT_NODE_KIND_4, shift, 0, shift > 0);
tree->max_val = shift_get_max_val(shift);
- tree->root = node;
+ tree->root = node.encoded;
}
/*
* Allocate a new node with the given node kind.
*/
-static rt_node *
+static rt_node_ptr
rt_alloc_node(radix_tree *tree, int kind, uint8 shift, uint8 chunk, bool inner)
{
- rt_node *newnode;
+ rt_node_ptr newnode;
if (inner)
- newnode = (rt_node *) MemoryContextAllocZero(tree->inner_slabs[kind],
- rt_node_kind_info[kind].inner_size);
+ newnode.decoded = (rt_node *) MemoryContextAllocZero(tree->inner_slabs[kind],
+ rt_node_kind_info[kind].inner_size);
else
- newnode = (rt_node *) MemoryContextAllocZero(tree->leaf_slabs[kind],
- rt_node_kind_info[kind].leaf_size);
+ newnode.decoded = (rt_node *) MemoryContextAllocZero(tree->leaf_slabs[kind],
+ rt_node_kind_info[kind].leaf_size);
- newnode->kind = kind;
- newnode->shift = shift;
- newnode->chunk = chunk;
+ newnode.encoded = rt_pointer_encode(newnode.decoded, kind);
+ NODE_SHIFT(newnode) = shift;
+ NODE_CHUNK(newnode) = chunk;
/* Initialize slot_idxs to invalid values */
if (kind == RT_NODE_KIND_128)
{
- rt_node_base_128 *n128 = (rt_node_base_128 *) newnode;
+ rt_node_base_128 *n128 = (rt_node_base_128 *) newnode.decoded;
memset(n128->slot_idxs, RT_NODE_128_INVALID_IDX, sizeof(n128->slot_idxs));
}
@@ -882,55 +951,56 @@ rt_alloc_node(radix_tree *tree, int kind, uint8 shift, uint8 chunk, bool inner)
* Create a new node with 'new_kind' and the same shift, chunk, and
* count of 'node'.
*/
-static rt_node *
-rt_copy_node(radix_tree *tree, rt_node *node, int new_kind)
+static rt_node_ptr
+rt_copy_node(radix_tree *tree, rt_node_ptr node, int new_kind)
{
- rt_node *newnode;
+ rt_node_ptr newnode;
+ rt_node *n = node.decoded;
- newnode = rt_alloc_node(tree, new_kind, node->shift, node->chunk,
- node->shift > 0);
- newnode->count = node->count;
+ newnode = rt_alloc_node(tree, new_kind, n->shift, n->chunk, n->shift > 0);
+ NODE_COUNT(newnode) = NODE_COUNT(node);
return newnode;
}
/* Free the given node */
static void
-rt_free_node(radix_tree *tree, rt_node *node)
+rt_free_node(radix_tree *tree, rt_node_ptr node)
{
/* If we're deleting the root node, make the tree empty */
- if (tree->root == node)
- tree->root = NULL;
+ if (tree->root == node.encoded)
+ tree->root = InvalidRTPointer;
#ifdef RT_DEBUG
/* update the statistics */
- tree->cnt[node->kind]--;
- Assert(tree->cnt[node->kind] >= 0);
+ tree->cnt[NODE_KIND(node)]--;
+ Assert(tree->cnt[NODE_KIND(node)] >= 0);
#endif
- pfree(node);
+ pfree(node.decoded);
}
/*
* Replace old_child with new_child, and free the old one.
*/
static void
-rt_replace_node(radix_tree *tree, rt_node *parent, rt_node *old_child,
- rt_node *new_child, uint64 key)
+rt_replace_node(radix_tree *tree, rt_node_ptr parent, rt_node_ptr old_child,
+ rt_node_ptr new_child, uint64 key)
{
- Assert(old_child->chunk == new_child->chunk);
- Assert(old_child->shift == new_child->shift);
+ Assert(NODE_CHUNK(old_child) == NODE_CHUNK(new_child));
+ Assert(NODE_SHIFT(old_child) == NODE_SHIFT(new_child));
- if (parent == old_child)
+ if (rt_node_ptr_eq(&parent, &old_child))
{
/* Replace the root node with the new large node */
- tree->root = new_child;
+ tree->root = new_child.encoded;
}
else
{
bool replaced PG_USED_FOR_ASSERTS_ONLY;
- replaced = rt_node_insert_inner(tree, NULL, parent, key, new_child);
+ replaced = rt_node_insert_inner(tree, InvalidRTNodePtr, parent, key,
+ new_child);
Assert(replaced);
}
@@ -945,23 +1015,26 @@ static void
rt_extend(radix_tree *tree, uint64 key)
{
int target_shift;
- int shift = tree->root->shift + RT_NODE_SPAN;
+ rt_node *root = rt_pointer_decode(tree->root);
+ int shift = root->shift + RT_NODE_SPAN;
target_shift = key_get_shift(key);
/* Grow tree from 'shift' to 'target_shift' */
while (shift <= target_shift)
{
- rt_node_inner_4 *node;
+ rt_node_ptr node;
+ rt_node_inner_4 *n4;
+
+ node = rt_alloc_node(tree, RT_NODE_KIND_4, shift, 0, true);
+ n4 = (rt_node_inner_4 *) node.decoded;
- node = (rt_node_inner_4 *) rt_alloc_node(tree, RT_NODE_KIND_4,
- shift, 0, true);
- node->base.n.count = 1;
- node->base.chunks[0] = 0;
- node->children[0] = tree->root;
+ n4->base.n.count = 1;
+ n4->base.chunks[0] = 0;
+ n4->children[0] = tree->root;
- tree->root->chunk = 0;
- tree->root = (rt_node *) node;
+ root->chunk = 0;
+ tree->root = node.encoded;
shift += RT_NODE_SPAN;
}
@@ -974,18 +1047,18 @@ rt_extend(radix_tree *tree, uint64 key)
* Insert inner and leaf nodes from 'node' to bottom.
*/
static inline void
-rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node *parent,
- rt_node *node)
+rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node_ptr parent,
+ rt_node_ptr node)
{
- int shift = node->shift;
+ int shift = NODE_SHIFT(node);
while (shift >= RT_NODE_SPAN)
{
- rt_node *newchild;
+ rt_node_ptr newchild;
int newshift = shift - RT_NODE_SPAN;
newchild = rt_alloc_node(tree, RT_NODE_KIND_4, newshift,
- RT_GET_KEY_CHUNK(key, node->shift),
+ RT_GET_KEY_CHUNK(key, NODE_SHIFT(node)),
newshift > 0);
rt_node_insert_inner(tree, parent, node, key, newchild);
@@ -1006,17 +1079,18 @@ rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node *parent,
* pointer is set to child_p.
*/
static inline bool
-rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **child_p)
+rt_node_search_inner(rt_node_ptr node, uint64 key, rt_action action,
+ rt_pointer *child_p)
{
- uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ uint8 chunk = RT_GET_KEY_CHUNK(key, NODE_SHIFT(node));
bool found = false;
- rt_node *child = NULL;
+ rt_pointer child;
- switch (node->kind)
+ switch (NODE_KIND(node))
{
case RT_NODE_KIND_4:
{
- rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node.decoded;
int idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
if (idx < 0)
@@ -1034,7 +1108,7 @@ rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **chil
}
case RT_NODE_KIND_32:
{
- rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node.decoded;
int idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
if (idx < 0)
@@ -1050,7 +1124,7 @@ rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **chil
}
case RT_NODE_KIND_128:
{
- rt_node_inner_128 *n128 = (rt_node_inner_128 *) node;
+ rt_node_inner_128 *n128 = (rt_node_inner_128 *) node.decoded;
if (!node_128_is_chunk_used((rt_node_base_128 *) n128, chunk))
break;
@@ -1066,7 +1140,7 @@ rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **chil
}
case RT_NODE_KIND_256:
{
- rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node.decoded;
if (!node_inner_256_is_chunk_used(n256, chunk))
break;
@@ -1083,7 +1157,7 @@ rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **chil
/* update statistics */
if (action == RT_ACTION_DELETE && found)
- node->count--;
+ NODE_COUNT(node)--;
if (found && child_p)
*child_p = child;
@@ -1099,17 +1173,17 @@ rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **chil
* to the value is set to value_p.
*/
static inline bool
-rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p)
+rt_node_search_leaf(rt_node_ptr node, uint64 key, rt_action action, uint64 *value_p)
{
- uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ uint8 chunk = RT_GET_KEY_CHUNK(key, NODE_SHIFT(node));
bool found = false;
uint64 value = 0;
- switch (node->kind)
+ switch (NODE_KIND(node))
{
case RT_NODE_KIND_4:
{
- rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node.decoded;
int idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
if (idx < 0)
@@ -1127,7 +1201,7 @@ rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p
}
case RT_NODE_KIND_32:
{
- rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node.decoded;
int idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
if (idx < 0)
@@ -1143,7 +1217,7 @@ rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p
}
case RT_NODE_KIND_128:
{
- rt_node_leaf_128 *n128 = (rt_node_leaf_128 *) node;
+ rt_node_leaf_128 *n128 = (rt_node_leaf_128 *) node.decoded;
if (!node_128_is_chunk_used((rt_node_base_128 *) n128, chunk))
break;
@@ -1159,7 +1233,7 @@ rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p
}
case RT_NODE_KIND_256:
{
- rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node.decoded;
if (!node_leaf_256_is_chunk_used(n256, chunk))
break;
@@ -1176,7 +1250,7 @@ rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p
/* update statistics */
if (action == RT_ACTION_DELETE && found)
- node->count--;
+ NODE_COUNT(node)--;
if (found && value_p)
*value_p = value;
@@ -1186,19 +1260,19 @@ rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p
/* Insert the child to the inner node */
static bool
-rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 key,
- rt_node *child)
+rt_node_insert_inner(radix_tree *tree, rt_node_ptr parent, rt_node_ptr node,
+ uint64 key, rt_node_ptr child)
{
- uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ uint8 chunk = RT_GET_KEY_CHUNK(key, NODE_SHIFT(node));
bool chunk_exists = false;
Assert(!NODE_IS_LEAF(node));
- switch (node->kind)
+ switch (NODE_KIND(node))
{
case RT_NODE_KIND_4:
{
- rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node.decoded;
int idx;
idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
@@ -1206,25 +1280,26 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
{
/* found the existing chunk */
chunk_exists = true;
- n4->children[idx] = child;
+ n4->children[idx] = child.encoded;
break;
}
- if (unlikely(!NODE_HAS_FREE_SLOT(n4)))
+ if (unlikely(!NODE_HAS_FREE_SLOT(node)))
{
+ rt_node_ptr new;
rt_node_inner_32 *new32;
/* grow node from 4 to 32 */
- new32 = (rt_node_inner_32 *) rt_copy_node(tree, (rt_node *) n4,
- RT_NODE_KIND_32);
+ new = rt_copy_node(tree, node, RT_NODE_KIND_32);
+ new32 = (rt_node_inner_32 *) new.decoded;
+
chunk_children_array_copy(n4->base.chunks, n4->children,
new32->base.chunks, new32->children,
n4->base.n.count);
- Assert(parent != NULL);
- rt_replace_node(tree, parent, (rt_node *) n4, (rt_node *) new32,
- key);
- node = (rt_node *) new32;
+ Assert(RTNodePtrIsValid(parent));
+ rt_replace_node(tree, parent, node, new, key);
+ node = new;
}
else
{
@@ -1237,14 +1312,14 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
count, insertpos);
n4->base.chunks[insertpos] = chunk;
- n4->children[insertpos] = child;
+ n4->children[insertpos] = child.encoded;
break;
}
}
/* FALLTHROUGH */
case RT_NODE_KIND_32:
{
- rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node.decoded;
int idx;
idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
@@ -1252,24 +1327,25 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
{
/* found the existing chunk */
chunk_exists = true;
- n32->children[idx] = child;
+ n32->children[idx] = child.encoded;
break;
}
- if (unlikely(!NODE_HAS_FREE_SLOT(n32)))
+ if (unlikely(!NODE_HAS_FREE_SLOT(node)))
{
+ rt_node_ptr new;
rt_node_inner_128 *new128;
/* grow node from 32 to 128 */
- new128 = (rt_node_inner_128 *) rt_copy_node(tree, (rt_node *) n32,
- RT_NODE_KIND_128);
+ new = rt_copy_node(tree, node, RT_NODE_KIND_128);
+ new128 = (rt_node_inner_128 *) new.decoded;
+
for (int i = 0; i < n32->base.n.count; i++)
node_inner_128_insert(new128, n32->base.chunks[i], n32->children[i]);
- Assert(parent != NULL);
- rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new128,
- key);
- node = (rt_node *) new128;
+ Assert(RTNodePtrIsValid(parent));
+ rt_replace_node(tree, parent, node, new, key);
+ node = new;
}
else
{
@@ -1281,31 +1357,33 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
count, insertpos);
n32->base.chunks[insertpos] = chunk;
- n32->children[insertpos] = child;
+ n32->children[insertpos] = child.encoded;
break;
}
}
/* FALLTHROUGH */
case RT_NODE_KIND_128:
{
- rt_node_inner_128 *n128 = (rt_node_inner_128 *) node;
+ rt_node_inner_128 *n128 = (rt_node_inner_128 *) node.decoded;
int cnt = 0;
if (node_128_is_chunk_used((rt_node_base_128 *) n128, chunk))
{
/* found the existing chunk */
chunk_exists = true;
- node_inner_128_update(n128, chunk, child);
+ node_inner_128_update(n128, chunk, child.encoded);
break;
}
- if (unlikely(!NODE_HAS_FREE_SLOT(n128)))
+ if (unlikely(!NODE_HAS_FREE_SLOT(node)))
{
+ rt_node_ptr new;
rt_node_inner_256 *new256;
/* grow node from 128 to 256 */
- new256 = (rt_node_inner_256 *) rt_copy_node(tree, (rt_node *) n128,
- RT_NODE_KIND_256);
+ new = rt_copy_node(tree, node, RT_NODE_KIND_256);
+ new256 = (rt_node_inner_256 *) new.decoded;
+
for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n128->base.n.count; i++)
{
if (!node_128_is_chunk_used((rt_node_base_128 *) n128, i))
@@ -1315,33 +1393,32 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
cnt++;
}
- Assert(parent != NULL);
- rt_replace_node(tree, parent, (rt_node *) n128, (rt_node *) new256,
- key);
- node = (rt_node *) new256;
+ Assert(RTNodePtrIsValid(parent));
+ rt_replace_node(tree, parent, node, new, key);
+ node = new;
}
else
{
- node_inner_128_insert(n128, chunk, child);
+ node_inner_128_insert(n128, chunk, child.encoded);
break;
}
}
/* FALLTHROUGH */
case RT_NODE_KIND_256:
{
- rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node.decoded;
chunk_exists = node_inner_256_is_chunk_used(n256, chunk);
- Assert(chunk_exists || NODE_HAS_FREE_SLOT(n256));
+ Assert(chunk_exists || NODE_HAS_FREE_SLOT(node));
- node_inner_256_set(n256, chunk, child);
+ node_inner_256_set(n256, chunk, child.encoded);
break;
}
}
/* Update statistics */
if (!chunk_exists)
- node->count++;
+ NODE_COUNT(node)++;
/*
* Done. Finally, verify the chunk and value is inserted or replaced
@@ -1354,19 +1431,19 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
/* Insert the value to the leaf node */
static bool
-rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
+rt_node_insert_leaf(radix_tree *tree, rt_node_ptr parent, rt_node_ptr node,
uint64 key, uint64 value)
{
- uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ uint8 chunk = RT_GET_KEY_CHUNK(key, NODE_SHIFT(node));
bool chunk_exists = false;
Assert(NODE_IS_LEAF(node));
- switch (node->kind)
+ switch (NODE_KIND(node))
{
case RT_NODE_KIND_4:
{
- rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node.decoded;
int idx;
idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
@@ -1378,21 +1455,22 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
break;
}
- if (unlikely(!NODE_HAS_FREE_SLOT(n4)))
+ if (unlikely(!NODE_HAS_FREE_SLOT(node)))
{
+ rt_node_ptr new;
rt_node_leaf_32 *new32;
/* grow node from 4 to 32 */
- new32 = (rt_node_leaf_32 *) rt_copy_node(tree, (rt_node *) n4,
- RT_NODE_KIND_32);
+ new = rt_copy_node(tree, node, RT_NODE_KIND_32);
+ new32 = (rt_node_leaf_32 *) new.decoded;
+
chunk_values_array_copy(n4->base.chunks, n4->values,
new32->base.chunks, new32->values,
n4->base.n.count);
- Assert(parent != NULL);
- rt_replace_node(tree, parent, (rt_node *) n4, (rt_node *) new32,
- key);
- node = (rt_node *) new32;
+ Assert(RTNodePtrIsValid(parent));
+ rt_replace_node(tree, parent, node, new, key);
+ node = new;
}
else
{
@@ -1412,7 +1490,7 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
/* FALLTHROUGH */
case RT_NODE_KIND_32:
{
- rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node.decoded;
int idx;
idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
@@ -1424,20 +1502,21 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
break;
}
- if (unlikely(!NODE_HAS_FREE_SLOT(n32)))
+ if (unlikely(!NODE_HAS_FREE_SLOT(node)))
{
+ rt_node_ptr new;
rt_node_leaf_128 *new128;
/* grow node from 32 to 128 */
- new128 = (rt_node_leaf_128 *) rt_copy_node(tree, (rt_node *) n32,
- RT_NODE_KIND_128);
+ new = rt_copy_node(tree, node, RT_NODE_KIND_128);
+ new128 = (rt_node_leaf_128 *) new.decoded;
+
for (int i = 0; i < n32->base.n.count; i++)
node_leaf_128_insert(new128, n32->base.chunks[i], n32->values[i]);
- Assert(parent != NULL);
- rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new128,
- key);
- node = (rt_node *) new128;
+ Assert(RTNodePtrIsValid(parent));
+ rt_replace_node(tree, parent, node, new, key);
+ node = new;
}
else
{
@@ -1456,7 +1535,7 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
/* FALLTHROUGH */
case RT_NODE_KIND_128:
{
- rt_node_leaf_128 *n128 = (rt_node_leaf_128 *) node;
+ rt_node_leaf_128 *n128 = (rt_node_leaf_128 *) node.decoded;
int cnt = 0;
if (node_128_is_chunk_used((rt_node_base_128 *) n128, chunk))
@@ -1467,13 +1546,15 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
break;
}
- if (unlikely(!NODE_HAS_FREE_SLOT(n128)))
+ if (unlikely(!NODE_HAS_FREE_SLOT(node)))
{
+ rt_node_ptr new;
rt_node_leaf_256 *new256;
/* grow node from 128 to 256 */
- new256 = (rt_node_leaf_256 *) rt_copy_node(tree, (rt_node *) n128,
- RT_NODE_KIND_256);
+ new = rt_copy_node(tree, node, RT_NODE_KIND_256);
+ new256 = (rt_node_leaf_256 *) new.decoded;
+
for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n128->base.n.count; i++)
{
if (!node_128_is_chunk_used((rt_node_base_128 *) n128, i))
@@ -1483,10 +1564,9 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
cnt++;
}
- Assert(parent != NULL);
- rt_replace_node(tree, parent, (rt_node *) n128, (rt_node *) new256,
- key);
- node = (rt_node *) new256;
+ Assert(RTNodePtrIsValid(parent));
+ rt_replace_node(tree, parent, node, new, key);
+ node = new;
}
else
{
@@ -1497,10 +1577,10 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
/* FALLTHROUGH */
case RT_NODE_KIND_256:
{
- rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node.decoded;
chunk_exists = node_leaf_256_is_chunk_used(n256, chunk);
- Assert(chunk_exists || NODE_HAS_FREE_SLOT(n256));
+ Assert(chunk_exists || NODE_HAS_FREE_SLOT(node));
node_leaf_256_set(n256, chunk, value);
break;
@@ -1509,7 +1589,7 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
/* Update statistics */
if (!chunk_exists)
- node->count++;
+ NODE_COUNT(node)++;
/*
* Done. Finally, verify the chunk and value is inserted or replaced
@@ -1533,7 +1613,7 @@ rt_create(MemoryContext ctx)
tree = palloc(sizeof(radix_tree));
tree->context = ctx;
- tree->root = NULL;
+ tree->root = InvalidRTPointer;
tree->max_val = 0;
tree->num_keys = 0;
@@ -1582,26 +1662,24 @@ rt_set(radix_tree *tree, uint64 key, uint64 value)
{
int shift;
bool updated;
- rt_node *node;
- rt_node *parent = tree->root;
+ rt_node_ptr node;
+ rt_node_ptr parent;
/* Empty tree, create the root */
- if (!tree->root)
+ if (!RTPointerIsValid(tree->root))
rt_new_root(tree, key);
/* Extend the tree if necessary */
if (key > tree->max_val)
rt_extend(tree, key);
- Assert(tree->root);
-
- shift = tree->root->shift;
- node = tree->root;
-
/* Descend the tree until a leaf node */
+ parent = rt_node_ptr_encoded(tree->root);
+ node = rt_node_ptr_encoded(tree->root);
+ shift = NODE_SHIFT(node);
while (shift >= 0)
{
- rt_node *child;
+ rt_pointer child;
if (NODE_IS_LEAF(node))
break;
@@ -1613,7 +1691,7 @@ rt_set(radix_tree *tree, uint64 key, uint64 value)
}
parent = node;
- node = child;
+ node = rt_node_ptr_encoded(child);
shift -= RT_NODE_SPAN;
}
@@ -1634,21 +1712,21 @@ rt_set(radix_tree *tree, uint64 key, uint64 value)
bool
rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
{
- rt_node *node;
+ rt_node_ptr node;
int shift;
Assert(value_p != NULL);
- if (!tree->root || key > tree->max_val)
+ if (!RTPointerIsValid(tree->root) || key > tree->max_val)
return false;
- node = tree->root;
- shift = tree->root->shift;
+ node = rt_node_ptr_encoded(tree->root);
+ shift = NODE_SHIFT(node);
/* Descend the tree until a leaf node */
while (shift >= 0)
{
- rt_node *child;
+ rt_pointer child;
if (NODE_IS_LEAF(node))
break;
@@ -1656,7 +1734,7 @@ rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
return false;
- node = child;
+ node = rt_node_ptr_encoded(child);
shift -= RT_NODE_SPAN;
}
@@ -1670,8 +1748,8 @@ rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
bool
rt_delete(radix_tree *tree, uint64 key)
{
- rt_node *node;
- rt_node *stack[RT_MAX_LEVEL] = {0};
+ rt_node_ptr node;
+ rt_node_ptr stack[RT_MAX_LEVEL] = {0};
int shift;
int level;
bool deleted;
@@ -1683,12 +1761,12 @@ rt_delete(radix_tree *tree, uint64 key)
* Descend the tree to search the key while building a stack of nodes we
* visited.
*/
- node = tree->root;
- shift = tree->root->shift;
+ node = rt_node_ptr_encoded(tree->root);
+ shift = NODE_SHIFT(node);
level = -1;
while (shift > 0)
{
- rt_node *child;
+ rt_pointer child;
/* Push the current node to the stack */
stack[++level] = node;
@@ -1696,7 +1774,7 @@ rt_delete(radix_tree *tree, uint64 key)
if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
return false;
- node = child;
+ node = rt_node_ptr_encoded(child);
shift -= RT_NODE_SPAN;
}
@@ -1745,7 +1823,7 @@ rt_delete(radix_tree *tree, uint64 key)
*/
if (level == 0)
{
- tree->root = NULL;
+ tree->root = InvalidRTPointer;
tree->max_val = 0;
}
@@ -1757,6 +1835,7 @@ rt_iter *
rt_begin_iterate(radix_tree *tree)
{
MemoryContext old_ctx;
+ rt_node_ptr root;
rt_iter *iter;
int top_level;
@@ -1766,17 +1845,18 @@ rt_begin_iterate(radix_tree *tree)
iter->tree = tree;
/* empty tree */
- if (!iter->tree)
+ if (!RTPointerIsValid(iter->tree))
return iter;
- top_level = iter->tree->root->shift / RT_NODE_SPAN;
+ root = rt_node_ptr_encoded(iter->tree->root);
+ top_level = NODE_SHIFT(root) / RT_NODE_SPAN;
iter->stack_len = top_level;
/*
* Descend to the left most leaf node from the root. The key is being
* constructed while descending to the leaf.
*/
- rt_update_iter_stack(iter, iter->tree->root, top_level);
+ rt_update_iter_stack(iter, root, top_level);
MemoryContextSwitchTo(old_ctx);
@@ -1787,14 +1867,15 @@ rt_begin_iterate(radix_tree *tree)
* Update each node_iter for inner nodes in the iterator node stack.
*/
static void
-rt_update_iter_stack(rt_iter *iter, rt_node *from_node, int from)
+rt_update_iter_stack(rt_iter *iter, rt_node_ptr from_node, int from)
{
int level = from;
- rt_node *node = from_node;
+ rt_node_ptr node = from_node;
for (;;)
{
rt_node_iter *node_iter = &(iter->stack[level--]);
+ bool found PG_USED_FOR_ASSERTS_ONLY;
node_iter->node = node;
node_iter->current_idx = -1;
@@ -1804,10 +1885,10 @@ rt_update_iter_stack(rt_iter *iter, rt_node *from_node, int from)
return;
/* Advance to the next slot in the inner node */
- node = rt_node_inner_iterate_next(iter, node_iter);
+ found = rt_node_inner_iterate_next(iter, node_iter, &node);
/* We must find the first children in the node */
- Assert(node);
+ Assert(found);
}
}
@@ -1824,7 +1905,7 @@ rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p)
for (;;)
{
- rt_node *child = NULL;
+ rt_node_ptr child = InvalidRTNodePtr;
uint64 value;
int level;
bool found;
@@ -1845,14 +1926,12 @@ rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p)
*/
for (level = 1; level <= iter->stack_len; level++)
{
- child = rt_node_inner_iterate_next(iter, &(iter->stack[level]));
-
- if (child)
+ if (rt_node_inner_iterate_next(iter, &(iter->stack[level]), &child))
break;
}
/* the iteration finished */
- if (!child)
+ if (!RTNodePtrIsValid(child))
return false;
/*
@@ -1884,18 +1963,19 @@ rt_iter_update_key(rt_iter *iter, uint8 chunk, uint8 shift)
* Advance the slot in the inner node. Return the child if exists, otherwise
* null.
*/
-static inline rt_node *
-rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
+static inline bool
+rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter, rt_node_ptr *child_p)
{
- rt_node *child = NULL;
+ rt_node_ptr node = node_iter->node;
+ rt_pointer child;
bool found = false;
uint8 key_chunk;
- switch (node_iter->node->kind)
+ switch (NODE_KIND(node))
{
case RT_NODE_KIND_4:
{
- rt_node_inner_4 *n4 = (rt_node_inner_4 *) node_iter->node;
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node.decoded;
node_iter->current_idx++;
if (node_iter->current_idx >= n4->base.n.count)
@@ -1908,7 +1988,7 @@ rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
}
case RT_NODE_KIND_32:
{
- rt_node_inner_32 *n32 = (rt_node_inner_32 *) node_iter->node;
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node.decoded;
node_iter->current_idx++;
if (node_iter->current_idx >= n32->base.n.count)
@@ -1921,7 +2001,7 @@ rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
}
case RT_NODE_KIND_128:
{
- rt_node_inner_128 *n128 = (rt_node_inner_128 *) node_iter->node;
+ rt_node_inner_128 *n128 = (rt_node_inner_128 *) node.decoded;
int i;
for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
@@ -1941,7 +2021,7 @@ rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
}
case RT_NODE_KIND_256:
{
- rt_node_inner_256 *n256 = (rt_node_inner_256 *) node_iter->node;
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node.decoded;
int i;
for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
@@ -1962,9 +2042,12 @@ rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
}
if (found)
- rt_iter_update_key(iter, key_chunk, node_iter->node->shift);
+ {
+ rt_iter_update_key(iter, key_chunk, NODE_SHIFT(node));
+ *child_p = rt_node_ptr_encoded(child);
+ }
- return child;
+ return found;
}
/*
@@ -1972,19 +2055,18 @@ rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
* is set to value_p, otherwise return false.
*/
static inline bool
-rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
- uint64 *value_p)
+rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter, uint64 *value_p)
{
- rt_node *node = node_iter->node;
+ rt_node_ptr node = node_iter->node;
bool found = false;
uint64 value;
uint8 key_chunk;
- switch (node->kind)
+ switch (NODE_KIND(node))
{
case RT_NODE_KIND_4:
{
- rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node_iter->node;
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node.decoded;
node_iter->current_idx++;
if (node_iter->current_idx >= n4->base.n.count)
@@ -1997,7 +2079,7 @@ rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
}
case RT_NODE_KIND_32:
{
- rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node_iter->node;
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node.decoded;
node_iter->current_idx++;
if (node_iter->current_idx >= n32->base.n.count)
@@ -2010,7 +2092,7 @@ rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
}
case RT_NODE_KIND_128:
{
- rt_node_leaf_128 *n128 = (rt_node_leaf_128 *) node_iter->node;
+ rt_node_leaf_128 *n128 = (rt_node_leaf_128 *) node.decoded;
int i;
for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
@@ -2030,7 +2112,7 @@ rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
}
case RT_NODE_KIND_256:
{
- rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node_iter->node;
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node.decoded;
int i;
for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
@@ -2052,7 +2134,7 @@ rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
if (found)
{
- rt_iter_update_key(iter, key_chunk, node_iter->node->shift);
+ rt_iter_update_key(iter, key_chunk, NODE_SHIFT(node));
*value_p = value;
}
@@ -2089,16 +2171,16 @@ rt_memory_usage(radix_tree *tree)
* Verify the radix tree node.
*/
static void
-rt_verify_node(rt_node *node)
+rt_verify_node(rt_node_ptr node)
{
#ifdef USE_ASSERT_CHECKING
- Assert(node->count >= 0);
+ Assert(NODE_COUNT(node) >= 0);
- switch (node->kind)
+ switch (NODE_KIND(node))
{
case RT_NODE_KIND_4:
{
- rt_node_base_4 *n4 = (rt_node_base_4 *) node;
+ rt_node_base_4 *n4 = (rt_node_base_4 *) node.decoded;
for (int i = 1; i < n4->n.count; i++)
Assert(n4->chunks[i - 1] < n4->chunks[i]);
@@ -2107,7 +2189,7 @@ rt_verify_node(rt_node *node)
}
case RT_NODE_KIND_32:
{
- rt_node_base_32 *n32 = (rt_node_base_32 *) node;
+ rt_node_base_32 *n32 = (rt_node_base_32 *) node.decoded;
for (int i = 1; i < n32->n.count; i++)
Assert(n32->chunks[i - 1] < n32->chunks[i]);
@@ -2116,7 +2198,7 @@ rt_verify_node(rt_node *node)
}
case RT_NODE_KIND_128:
{
- rt_node_base_128 *n128 = (rt_node_base_128 *) node;
+ rt_node_base_128 *n128 = (rt_node_base_128 *) node.decoded;
int cnt = 0;
for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
@@ -2126,10 +2208,10 @@ rt_verify_node(rt_node *node)
/* Check if the corresponding slot is used */
if (NODE_IS_LEAF(node))
- Assert(node_leaf_128_is_slot_used((rt_node_leaf_128 *) node,
+ Assert(node_leaf_128_is_slot_used((rt_node_leaf_128 *) n128,
n128->slot_idxs[i]));
else
- Assert(node_inner_128_is_slot_used((rt_node_inner_128 *) node,
+ Assert(node_inner_128_is_slot_used((rt_node_inner_128 *) n128,
n128->slot_idxs[i]));
cnt++;
@@ -2142,7 +2224,7 @@ rt_verify_node(rt_node *node)
{
if (NODE_IS_LEAF(node))
{
- rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node.decoded;
int cnt = 0;
for (int i = 0; i < RT_NODE_NSLOTS_BITS(RT_NODE_MAX_SLOTS); i++)
@@ -2163,9 +2245,11 @@ rt_verify_node(rt_node *node)
void
rt_stats(radix_tree *tree)
{
+ rt_node_ptr root = rt_node_ptr_encoded(tree->root);
+
ereport(LOG, (errmsg("num_keys = %lu, height = %u, n4 = %u, n32 = %u, n128 = %u, n256 = %u",
tree->num_keys,
- tree->root->shift / RT_NODE_SPAN,
+ NODE_SHIFT(root) / RT_NODE_SPAN,
tree->cnt[0],
tree->cnt[1],
tree->cnt[2],
@@ -2173,42 +2257,44 @@ rt_stats(radix_tree *tree)
}
static void
-rt_dump_node(rt_node *node, int level, bool recurse)
+rt_dump_node(rt_node_ptr node, int level, bool recurse)
{
+ rt_node *n = node.decoded;
char space[128] = {0};
fprintf(stderr, "[%s] kind %d, count %u, shift %u, chunk 0x%X:\n",
NODE_IS_LEAF(node) ? "LEAF" : "INNR",
- (node->kind == RT_NODE_KIND_4) ? 4 :
- (node->kind == RT_NODE_KIND_32) ? 32 :
- (node->kind == RT_NODE_KIND_128) ? 128 : 256,
- node->count, node->shift, node->chunk);
+ (NODE_KIND(node) == RT_NODE_KIND_4) ? 4 :
+ (NODE_KIND(node) == RT_NODE_KIND_32) ? 32 :
+ (NODE_KIND(node) == RT_NODE_KIND_128) ? 128 : 256,
+ n->count, n->shift, n->chunk);
if (level > 0)
sprintf(space, "%*c", level * 4, ' ');
- switch (node->kind)
+ switch (NODE_KIND(node))
{
case RT_NODE_KIND_4:
{
- for (int i = 0; i < node->count; i++)
+ for (int i = 0; i < NODE_COUNT(node); i++)
{
if (NODE_IS_LEAF(node))
{
- rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node.decoded;
fprintf(stderr, "%schunk 0x%X value 0x%lX\n",
space, n4->base.chunks[i], n4->values[i]);
}
else
{
- rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node.decoded;
fprintf(stderr, "%schunk 0x%X ->",
space, n4->base.chunks[i]);
if (recurse)
- rt_dump_node(n4->children[i], level + 1, recurse);
+ rt_dump_node(rt_node_ptr_encoded(n4->children[i]),
+ level + 1, recurse);
else
fprintf(stderr, "\n");
}
@@ -2217,25 +2303,26 @@ rt_dump_node(rt_node *node, int level, bool recurse)
}
case RT_NODE_KIND_32:
{
- for (int i = 0; i < node->count; i++)
+ for (int i = 0; i < NODE_COUNT(node); i++)
{
if (NODE_IS_LEAF(node))
{
- rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node.decoded;
fprintf(stderr, "%schunk 0x%X value 0x%lX\n",
space, n32->base.chunks[i], n32->values[i]);
}
else
{
- rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node.decoded;
fprintf(stderr, "%schunk 0x%X ->",
space, n32->base.chunks[i]);
if (recurse)
{
- rt_dump_node(n32->children[i], level + 1, recurse);
+ rt_dump_node(rt_node_ptr_encoded(n32->children[i]),
+ level + 1, recurse);
}
else
fprintf(stderr, "\n");
@@ -2245,7 +2332,7 @@ rt_dump_node(rt_node *node, int level, bool recurse)
}
case RT_NODE_KIND_128:
{
- rt_node_base_128 *b128 = (rt_node_base_128 *) node;
+ rt_node_base_128 *b128 = (rt_node_base_128 *) node.decoded;
fprintf(stderr, "slot_idxs ");
for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
@@ -2257,7 +2344,7 @@ rt_dump_node(rt_node *node, int level, bool recurse)
}
if (NODE_IS_LEAF(node))
{
- rt_node_leaf_128 *n = (rt_node_leaf_128 *) node;
+ rt_node_leaf_128 *n = (rt_node_leaf_128 *) node.decoded;
fprintf(stderr, ", isset-bitmap:");
for (int i = 0; i < 16; i++)
@@ -2287,7 +2374,7 @@ rt_dump_node(rt_node *node, int level, bool recurse)
space, i);
if (recurse)
- rt_dump_node(node_inner_128_get_child(n128, i),
+ rt_dump_node(rt_node_ptr_encoded(node_inner_128_get_child(n128, i)),
level + 1, recurse);
else
fprintf(stderr, "\n");
@@ -2301,7 +2388,7 @@ rt_dump_node(rt_node *node, int level, bool recurse)
{
if (NODE_IS_LEAF(node))
{
- rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node.decoded;
if (!node_leaf_256_is_chunk_used(n256, i))
continue;
@@ -2311,7 +2398,7 @@ rt_dump_node(rt_node *node, int level, bool recurse)
}
else
{
- rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node.decoded;
if (!node_inner_256_is_chunk_used(n256, i))
continue;
@@ -2320,8 +2407,8 @@ rt_dump_node(rt_node *node, int level, bool recurse)
space, i);
if (recurse)
- rt_dump_node(node_inner_256_get_child(n256, i), level + 1,
- recurse);
+ rt_dump_node(rt_node_ptr_encoded(node_inner_256_get_child(n256, i)),
+ level + 1, recurse);
else
fprintf(stderr, "\n");
}
@@ -2334,14 +2421,14 @@ rt_dump_node(rt_node *node, int level, bool recurse)
void
rt_dump_search(radix_tree *tree, uint64 key)
{
- rt_node *node;
+ rt_node_ptr node;
int shift;
int level = 0;
elog(NOTICE, "-----------------------------------------------------------");
elog(NOTICE, "max_val = %lu (0x%lX)", tree->max_val, tree->max_val);
- if (!tree->root)
+ if (!RTPointerIsValid(tree->root))
{
elog(NOTICE, "tree is empty");
return;
@@ -2354,11 +2441,11 @@ rt_dump_search(radix_tree *tree, uint64 key)
return;
}
- node = tree->root;
- shift = tree->root->shift;
+ node = rt_node_ptr_encoded(tree->root);
+ shift = NODE_SHIFT(node);
while (shift >= 0)
{
- rt_node *child;
+ rt_pointer child;
rt_dump_node(node, level, false);
@@ -2375,7 +2462,7 @@ rt_dump_search(radix_tree *tree, uint64 key)
if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
break;
- node = child;
+ node = rt_node_ptr_encoded(child);
shift -= RT_NODE_SPAN;
level++;
}
@@ -2384,6 +2471,8 @@ rt_dump_search(radix_tree *tree, uint64 key)
void
rt_dump(radix_tree *tree)
{
+ rt_node_ptr root;
+
for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
fprintf(stderr, "%s\tinner_size%lu\tinner_blocksize %lu\tleaf_size %lu\tleaf_blocksize %lu\n",
rt_node_kind_info[i].name,
@@ -2393,12 +2482,13 @@ rt_dump(radix_tree *tree)
rt_node_kind_info[i].leaf_blocksize);
fprintf(stderr, "max_val = %lu\n", tree->max_val);
- if (!tree->root)
+ if (!RTPointerIsValid(tree->root))
{
fprintf(stderr, "empty tree\n");
return;
}
- rt_dump_node(tree->root, 0, true);
+ root = rt_node_ptr_encoded(tree->root);
+ rt_dump_node(root, 0, true);
}
#endif
--
2.31.1
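To make the encoded/decoded pointer scheme that the patch above switches to easier to follow, here is a minimal, self-contained sketch (not the patch code itself): an rt_pointer keeps the node kind in the low bits of the address, and rt_node_ptr pairs that tagged value with the plain local pointer so the caller does not need to re-decode it on every access. The mask width and the kind value used below are illustrative assumptions, not the exact constants from the patch.

#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

typedef uintptr_t rt_pointer;		/* tagged value, as stored in parent nodes */

#define KIND_MASK	((uintptr_t) 0x3)	/* assumed: node kind lives in the low bits */

typedef struct rt_node
{
	int			count;
} rt_node;

typedef struct rt_node_ptr
{
	rt_pointer	encoded;		/* tagged pointer, what the tree stores */
	rt_node	   *decoded;		/* plain local address, for direct access */
} rt_node_ptr;

static rt_pointer
encode(rt_node *node, uint8_t kind)
{
	/* relies on allocations being aligned so the low bits are free */
	assert(((uintptr_t) node & KIND_MASK) == 0);
	return (rt_pointer) node | kind;
}

static rt_node_ptr
decode(rt_pointer encoded)
{
	rt_node_ptr p;

	p.encoded = encoded;
	p.decoded = (rt_node *) (encoded & ~KIND_MASK);
	return p;
}

int
main(void)
{
	rt_node	   *raw = malloc(sizeof(rt_node));	/* malloc alignment keeps low bits zero */
	rt_node_ptr node;

	raw->count = 42;
	node = decode(encode(raw, 2));	/* "2" stands in for an RT_NODE_KIND_* value */
	printf("kind = %d, count = %d\n",
		   (int) (node.encoded & KIND_MASK), node.decoded->count);
	free(raw);
	return 0;
}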
Attachment: v9-0005-PoC-DSA-support-for-radix-tree.patch (application/octet-stream)
From a304e99926444dda3861722c53d9cbd86e61fec0 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Thu, 27 Oct 2022 14:02:00 +0900
Subject: [PATCH v9 5/6] PoC: DSA support for radix tree.
---
.../bench_radix_tree--1.0.sql | 2 +
contrib/bench_radix_tree/bench_radix_tree.c | 12 +-
src/backend/lib/radixtree.c | 484 +++++++++++++-----
src/backend/utils/mmgr/dsa.c | 12 +
src/include/lib/radixtree.h | 8 +-
src/include/utils/dsa.h | 1 +
.../expected/test_radixtree.out | 17 +
.../modules/test_radixtree/test_radixtree.c | 100 ++--
8 files changed, 482 insertions(+), 154 deletions(-)
diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
index 0874201d7e..cf294c01d6 100644
--- a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
+++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
@@ -7,6 +7,7 @@ create function bench_shuffle_search(
minblk int4,
maxblk int4,
random_block bool DEFAULT false,
+shared bool DEFAULT false,
OUT nkeys int8,
OUT rt_mem_allocated int8,
OUT array_mem_allocated int8,
@@ -23,6 +24,7 @@ create function bench_seq_search(
minblk int4,
maxblk int4,
random_block bool DEFAULT false,
+shared bool DEFAULT false,
OUT nkeys int8,
OUT rt_mem_allocated int8,
OUT array_mem_allocated int8,
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c
index 7abb237e96..be3f7ed811 100644
--- a/contrib/bench_radix_tree/bench_radix_tree.c
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -15,6 +15,7 @@
#include "lib/radixtree.h"
#include <math.h>
#include "miscadmin.h"
+#include "storage/lwlock.h"
#include "utils/timestamp.h"
PG_MODULE_MAGIC;
@@ -149,7 +150,9 @@ bench_search(FunctionCallInfo fcinfo, bool shuffle)
BlockNumber minblk = PG_GETARG_INT32(0);
BlockNumber maxblk = PG_GETARG_INT32(1);
bool random_block = PG_GETARG_BOOL(2);
+ bool shared = PG_GETARG_BOOL(3);
radix_tree *rt = NULL;
+ dsa_area *dsa = NULL;
uint64 ntids;
uint64 key;
uint64 last_key = PG_UINT64_MAX;
@@ -171,8 +174,11 @@ bench_search(FunctionCallInfo fcinfo, bool shuffle)
tids = generate_tids(minblk, maxblk, TIDS_PER_BLOCK_FOR_LOAD, &ntids, random_block);
+ if (shared)
+ dsa = dsa_create(LWLockNewTrancheId());
+
/* measure the load time of the radix tree */
- rt = rt_create(CurrentMemoryContext);
+ rt = rt_create(CurrentMemoryContext, dsa);
start_time = GetCurrentTimestamp();
for (int i = 0; i < ntids; i++)
{
@@ -323,7 +329,7 @@ bench_load_random_int(PG_FUNCTION_ARGS)
elog(ERROR, "return type must be a row type");
pg_prng_seed(&state, 0);
- rt = rt_create(CurrentMemoryContext);
+ rt = rt_create(CurrentMemoryContext, NULL);
start_time = GetCurrentTimestamp();
for (uint64 i = 0; i < cnt; i++)
@@ -375,7 +381,7 @@ bench_fixed_height_search(PG_FUNCTION_ARGS)
if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
elog(ERROR, "return type must be a row type");
- rt = rt_create(CurrentMemoryContext);
+ rt = rt_create(CurrentMemoryContext, NULL);
start_time = GetCurrentTimestamp();
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
index c25d455d2a..fb35463b66 100644
--- a/src/backend/lib/radixtree.c
+++ b/src/backend/lib/radixtree.c
@@ -22,6 +22,15 @@
* choose it to avoid an additional pointer traversal. It is the reason this code
* currently does not support variable-length keys.
*
+ * If a DSA area is specified when calling rt_create(), the radix tree is created
+ * in that DSA area so that multiple processes can access it simultaneously. The
+ * process that created the shared radix tree needs to pass both the DSA area
+ * specified at rt_create() time and the dsa_pointer of the radix tree, fetched by
+ * rt_get_handle(), to other processes so that they can attach via rt_attach().
+ *
+ * XXX: the shared radix tree is still in a PoC state as it doesn't have any locking
+ * support. Also, it supports only single-process iteration.
+ *
* XXX: Most functions in this file have two variants for inner nodes and leaf
* nodes, therefore there are duplication codes. While this sometimes makes the
* code maintenance tricky, this reduces branch prediction misses when judging
@@ -34,6 +43,9 @@
*
* rt_create - Create a new, empty radix tree
* rt_free - Free the radix tree
+ * rt_attach - Attach to the radix tree
+ * rt_detach - Detach from the radix tree
+ * rt_get_handle - Return the handle of the radix tree
* rt_search - Search a key-value pair
* rt_set - Set a key-value pair
* rt_delete - Delete a key-value pair
@@ -64,6 +76,7 @@
#include "miscadmin.h"
#include "port/pg_bitutils.h"
#include "port/pg_lfind.h"
+#include "utils/dsa.h"
#include "utils/memutils.h"
/* The number of bits encoded in one tree level */
@@ -384,6 +397,11 @@ static rt_node_kind_info_elem rt_node_kind_info[RT_NODE_KIND_COUNT] = {
* construct the key whenever updating the node iteration information, e.g., when
* advancing the current index within the node or when moving to the next node
* at the same level.
+ *
+ * XXX: Currently only one process is allowed to iterate. Therefore, rt_node_iter
+ * holds local pointers to nodes rather than rt_node_ptr.
+ * We need either a safeguard that prevents other processes from starting an
+ * iteration while one is in progress, or support for concurrent iteration.
*/
typedef struct rt_node_iter
{
@@ -403,23 +421,43 @@ struct rt_iter
uint64 key;
};
-/* A radix tree with nodes */
-struct radix_tree
+/* A magic value used to identify our radix tree */
+#define RADIXTREE_MAGIC 0x54A48167
+
+/* Control information for a radix tree */
+typedef struct radix_tree_control
{
- MemoryContext context;
+ rt_handle handle;
+ uint32 magic;
+ /* Root node */
rt_pointer root;
- uint64 max_val;
- uint64 num_keys;
- MemoryContextData *inner_slabs[RT_NODE_KIND_COUNT];
- MemoryContextData *leaf_slabs[RT_NODE_KIND_COUNT];
+ pg_atomic_uint64 max_val;
+ pg_atomic_uint64 num_keys;
/* statistics */
#ifdef RT_DEBUG
int32 cnt[RT_NODE_KIND_COUNT];
#endif
+} radix_tree_control;
+
+/* A radix tree with nodes */
+struct radix_tree
+{
+ MemoryContext context;
+
+ /* control object in either backend-local memory or DSA */
+ radix_tree_control *ctl;
+
+ /* used only when the radix tree is shared */
+ dsa_area *area;
+
+ /* used only when the radix tree is private */
+ MemoryContextData *inner_slabs[RT_NODE_KIND_COUNT];
+ MemoryContextData *leaf_slabs[RT_NODE_KIND_COUNT];
};
+#define RadixTreeIsShared(rt) ((rt)->area != NULL)
static void rt_new_root(radix_tree *tree, uint64 key);
static rt_node_ptr rt_alloc_node(radix_tree *tree, int kind, uint8 shift, uint8 chunk,
@@ -446,24 +484,31 @@ static void rt_verify_node(rt_node_ptr node);
/* Decode and encode function of rt_pointer */
static inline rt_node *
-rt_pointer_decode(rt_pointer encoded)
+rt_pointer_decode(radix_tree *tree, rt_pointer encoded)
{
- return (rt_node *) RTPointerUnTagKind(encoded);
+ encoded = RTPointerUnTagKind(encoded);
+
+ if (RadixTreeIsShared(tree))
+ return (rt_node *) dsa_get_address(tree->area, encoded);
+ else
+ return (rt_node *) encoded;
}
static inline rt_pointer
-rt_pointer_encode(rt_node *decoded, uint8 kind)
+rt_pointer_encode(rt_pointer decoded, uint8 kind)
{
+ Assert((decoded & RT_POINTER_KIND_MASK) == 0);
+
return (rt_pointer) RTPointerTagKind(decoded, kind);
}
/* Return a rt_pointer created from the given encoded pointer */
static inline rt_node_ptr
-rt_node_ptr_encoded(rt_pointer encoded)
+rt_node_ptr_encoded(radix_tree *tree, rt_pointer encoded)
{
return (rt_node_ptr) {
.encoded = encoded,
- .decoded = rt_pointer_decode(encoded)
+ .decoded = rt_pointer_decode(tree, encoded)
};
}
@@ -908,8 +953,8 @@ rt_new_root(radix_tree *tree, uint64 key)
rt_node_ptr node;
node = rt_alloc_node(tree, RT_NODE_KIND_4, shift, 0, shift > 0);
- tree->max_val = shift_get_max_val(shift);
- tree->root = node.encoded;
+ pg_atomic_write_u64(&tree->ctl->max_val, shift_get_max_val(shift));
+ tree->ctl->root = node.encoded;
}
/*
@@ -918,16 +963,35 @@ rt_new_root(radix_tree *tree, uint64 key)
static rt_node_ptr
rt_alloc_node(radix_tree *tree, int kind, uint8 shift, uint8 chunk, bool inner)
{
- rt_node_ptr newnode;
+ rt_node_ptr newnode;
+
+ if (tree->area != NULL)
+ {
+ dsa_pointer dp;
+
+ if (inner)
+ dp = dsa_allocate0(tree->area, rt_node_kind_info[kind].inner_size);
+ else
+ dp = dsa_allocate0(tree->area, rt_node_kind_info[kind].leaf_size);
- if (inner)
- newnode.decoded = (rt_node *) MemoryContextAllocZero(tree->inner_slabs[kind],
- rt_node_kind_info[kind].inner_size);
+ newnode.encoded = rt_pointer_encode((rt_pointer) dp, kind);
+ newnode.decoded = (rt_node *) dsa_get_address(tree->area, dp);
+ }
else
- newnode.decoded = (rt_node *) MemoryContextAllocZero(tree->leaf_slabs[kind],
- rt_node_kind_info[kind].leaf_size);
+ {
+ rt_node *new;
+
+ if (inner)
+ new = (rt_node *) MemoryContextAllocZero(tree->inner_slabs[kind],
+ rt_node_kind_info[kind].inner_size);
+ else
+ new = (rt_node *) MemoryContextAllocZero(tree->leaf_slabs[kind],
+ rt_node_kind_info[kind].leaf_size);
+
+ newnode.encoded = rt_pointer_encode((rt_pointer) new, kind);
+ newnode.decoded = new;
+ }
- newnode.encoded = rt_pointer_encode(newnode.decoded, kind);
NODE_SHIFT(newnode) = shift;
NODE_CHUNK(newnode) = chunk;
@@ -941,7 +1005,7 @@ rt_alloc_node(radix_tree *tree, int kind, uint8 shift, uint8 chunk, bool inner)
#ifdef RT_DEBUG
/* update the statistics */
- tree->cnt[kind]++;
+ tree->ctl->cnt[kind]++;
#endif
return newnode;
@@ -968,16 +1032,19 @@ static void
rt_free_node(radix_tree *tree, rt_node_ptr node)
{
/* If we're deleting the root node, make the tree empty */
- if (tree->root == node.encoded)
- tree->root = InvalidRTPointer;
+ if (tree->ctl->root == node.encoded)
+ tree->ctl->root = InvalidRTPointer;
#ifdef RT_DEBUG
/* update the statistics */
- tree->cnt[NODE_KIND(node)]--;
- Assert(tree->cnt[NODE_KIND(node)] >= 0);
+ tree->ctl->cnt[NODE_KIND(node)]--;
+ Assert(tree->ctl->cnt[NODE_KIND(node)] >= 0);
#endif
- pfree(node.decoded);
+ if (RadixTreeIsShared(tree))
+ dsa_free(tree->area, (dsa_pointer) RTPointerUnTagKind(node.encoded));
+ else
+ pfree(node.decoded);
}
/*
@@ -993,7 +1060,7 @@ rt_replace_node(radix_tree *tree, rt_node_ptr parent, rt_node_ptr old_child,
if (rt_node_ptr_eq(&parent, &old_child))
{
/* Replace the root node with the new large node */
- tree->root = new_child.encoded;
+ tree->ctl->root = new_child.encoded;
}
else
{
@@ -1015,7 +1082,7 @@ static void
rt_extend(radix_tree *tree, uint64 key)
{
int target_shift;
- rt_node *root = rt_pointer_decode(tree->root);
+ rt_node *root = rt_pointer_decode(tree, tree->ctl->root);
int shift = root->shift + RT_NODE_SPAN;
target_shift = key_get_shift(key);
@@ -1031,15 +1098,15 @@ rt_extend(radix_tree *tree, uint64 key)
n4->base.n.count = 1;
n4->base.chunks[0] = 0;
- n4->children[0] = tree->root;
+ n4->children[0] = tree->ctl->root;
root->chunk = 0;
- tree->root = node.encoded;
+ tree->ctl->root = node.encoded;
shift += RT_NODE_SPAN;
}
- tree->max_val = shift_get_max_val(target_shift);
+ pg_atomic_write_u64(&tree->ctl->max_val, shift_get_max_val(target_shift));
}
/*
@@ -1068,7 +1135,7 @@ rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node_ptr parent,
}
rt_node_insert_leaf(tree, parent, node, key, value);
- tree->num_keys++;
+ pg_atomic_add_fetch_u64(&tree->ctl->num_keys, 1);
}
/*
@@ -1079,8 +1146,7 @@ rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node_ptr parent,
* pointer is set to child_p.
*/
static inline bool
-rt_node_search_inner(rt_node_ptr node, uint64 key, rt_action action,
- rt_pointer *child_p)
+rt_node_search_inner(rt_node_ptr node, uint64 key, rt_action action, rt_pointer *child_p)
{
uint8 chunk = RT_GET_KEY_CHUNK(key, NODE_SHIFT(node));
bool found = false;
@@ -1115,6 +1181,7 @@ rt_node_search_inner(rt_node_ptr node, uint64 key, rt_action action,
break;
found = true;
+
if (action == RT_ACTION_FIND)
child = n32->children[idx];
else /* RT_ACTION_DELETE */
@@ -1604,33 +1671,50 @@ rt_node_insert_leaf(radix_tree *tree, rt_node_ptr parent, rt_node_ptr node,
* Create the radix tree in the given memory context and return it.
*/
radix_tree *
-rt_create(MemoryContext ctx)
+rt_create(MemoryContext ctx, dsa_area *area)
{
radix_tree *tree;
MemoryContext old_ctx;
old_ctx = MemoryContextSwitchTo(ctx);
- tree = palloc(sizeof(radix_tree));
+ tree = (radix_tree *) palloc0(sizeof(radix_tree));
tree->context = ctx;
- tree->root = InvalidRTPointer;
- tree->max_val = 0;
- tree->num_keys = 0;
+
+ if (area != NULL)
+ {
+ dsa_pointer dp;
+
+ tree->area = area;
+ dp = dsa_allocate0(area, sizeof(radix_tree_control));
+ tree->ctl = (radix_tree_control *) dsa_get_address(area, dp);
+ tree->ctl->handle = (rt_handle) dp;
+ }
+ else
+ {
+ tree->ctl = (radix_tree_control *) palloc0(sizeof(radix_tree_control));
+ tree->ctl->handle = InvalidDsaPointer;
+ }
+
+ tree->ctl->magic = RADIXTREE_MAGIC;
+ tree->ctl->root = InvalidRTPointer;
+ pg_atomic_init_u64(&tree->ctl->max_val, 0);
+ pg_atomic_init_u64(&tree->ctl->num_keys, 0);
/* Create the slab allocator for each size class */
- for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ if (area == NULL)
{
- tree->inner_slabs[i] = SlabContextCreate(ctx,
- rt_node_kind_info[i].name,
- rt_node_kind_info[i].inner_blocksize,
- rt_node_kind_info[i].inner_size);
- tree->leaf_slabs[i] = SlabContextCreate(ctx,
- rt_node_kind_info[i].name,
- rt_node_kind_info[i].leaf_blocksize,
- rt_node_kind_info[i].leaf_size);
-#ifdef RT_DEBUG
- tree->cnt[i] = 0;
-#endif
+ for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ {
+ tree->inner_slabs[i] = SlabContextCreate(ctx,
+ rt_node_kind_info[i].name,
+ rt_node_kind_info[i].inner_blocksize,
+ rt_node_kind_info[i].inner_size);
+ tree->leaf_slabs[i] = SlabContextCreate(ctx,
+ rt_node_kind_info[i].name,
+ rt_node_kind_info[i].leaf_blocksize,
+ rt_node_kind_info[i].leaf_size);
+ }
}
MemoryContextSwitchTo(old_ctx);
@@ -1638,16 +1722,159 @@ rt_create(MemoryContext ctx)
return tree;
}
+/*
+ * Get a handle that can be used by other processes to attach to this radix
+ * tree.
+ */
+dsa_pointer
+rt_get_handle(radix_tree *tree)
+{
+ Assert(RadixTreeIsShared(tree));
+ Assert(tree->ctl->magic == RADIXTREE_MAGIC);
+
+ return tree->ctl->handle;
+}
+
+/*
+ * Attach to an existing radix tree using a handle. The returned object is
+ * allocated in backend-local memory using the CurrentMemoryContext.
+ */
+radix_tree *
+rt_attach(dsa_area *area, rt_handle handle)
+{
+ radix_tree *tree;
+ dsa_pointer control;
+
+ /* Allocate the backend-local object representing the radix tree */
+ tree = (radix_tree *) palloc0(sizeof(radix_tree));
+
+ /* Find the control object in shared memory */
+ control = handle;
+
+ /* Set up the local radix tree */
+ tree->area = area;
+ tree->ctl = (radix_tree_control *) dsa_get_address(area, control);
+ Assert(tree->ctl->magic == RADIXTREE_MAGIC);
+
+ return tree;
+}
+
+/*
+ * Detach from a radix tree. This frees backend-local resources associated
+ * with the radix tree, but the radix tree will continue to exist until
+ * it is explicitly freed.
+ */
+void
+rt_detach(radix_tree *tree)
+{
+ Assert(RadixTreeIsShared(tree));
+ Assert(tree->ctl->magic == RADIXTREE_MAGIC);
+
+ pfree(tree);
+}
+
+/*
+ * Recursively free all nodes allocated to the dsa area.
+ */
+static void
+rt_free_recurse(radix_tree *tree, rt_pointer ptr)
+{
+ rt_node_ptr node = rt_node_ptr_encoded(tree, ptr);
+
+ Assert(RadixTreeIsShared(tree));
+
+ /* The leaf node doesn't have child pointers, so free it */
+ if (NODE_IS_LEAF(node))
+ {
+ dsa_free(tree->area, RTPointerUnTagKind(node.encoded));
+ return;
+ }
+
+ switch (NODE_KIND(node))
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node.decoded;
+
+ /* Free all children recursively */
+ for (int i = 0; i < NODE_COUNT(node); i++)
+ rt_free_recurse(tree, n4->children[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node.decoded;
+
+ /* Free all children recursively */
+ for (int i = 0; i < NODE_COUNT(node); i++)
+ rt_free_recurse(tree, n32->children[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ rt_node_inner_128 *n128 = (rt_node_inner_128 *) node.decoded;
+
+ /* Free all children recursively */
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!node_128_is_chunk_used((rt_node_base_128 *) n128, i))
+ continue;
+
+ rt_free_recurse(tree, node_inner_128_get_child(n128, i));
+ }
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node.decoded;
+
+ /* Free all children recursively */
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!node_inner_256_is_chunk_used(n256, i))
+ continue;
+
+ rt_free_recurse(tree, node_inner_256_get_child(n256, i));
+ }
+ break;
+ }
+ }
+
+ /* Free the inner node itself */
+ dsa_free(tree->area, RTPointerUnTagKind(node.encoded));
+}
+
/*
* Free the given radix tree.
*/
void
rt_free(radix_tree *tree)
{
- for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ Assert(!RadixTreeIsShared(tree) || tree->ctl->magic == RADIXTREE_MAGIC);
+
+ if (RadixTreeIsShared(tree))
{
- MemoryContextDelete(tree->inner_slabs[i]);
- MemoryContextDelete(tree->leaf_slabs[i]);
+ /* Free all memory used for radix tree nodes */
+ rt_free_recurse(tree, tree->ctl->root);
+
+ /*
+ * Vandalize the control block to help catch programming errors where
+ * other backends access the memory formerly occupied by this radix tree.
+ */
+ tree->ctl->magic = 0;
+ dsa_free(tree->area, tree->ctl->handle);
+ }
+ else
+ {
+ /* Free all memory used for radix tree nodes */
+ for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ {
+ MemoryContextDelete(tree->inner_slabs[i]);
+ MemoryContextDelete(tree->leaf_slabs[i]);
+ }
+ pfree(tree->ctl);
}
pfree(tree);
@@ -1665,17 +1892,19 @@ rt_set(radix_tree *tree, uint64 key, uint64 value)
rt_node_ptr node;
rt_node_ptr parent;
+ Assert(!RadixTreeIsShared(tree) || tree->ctl->magic == RADIXTREE_MAGIC);
+
/* Empty tree, create the root */
- if (!RTPointerIsValid(tree->root))
+ if (!RTPointerIsValid(tree->ctl->root))
rt_new_root(tree, key);
/* Extend the tree if necessary */
- if (key > tree->max_val)
+ if (key > pg_atomic_read_u64(&tree->ctl->max_val))
rt_extend(tree, key);
/* Descend the tree until a leaf node */
- parent = rt_node_ptr_encoded(tree->root);
- node = rt_node_ptr_encoded(tree->root);
+ parent = rt_node_ptr_encoded(tree, tree->ctl->root);
+ node = rt_node_ptr_encoded(tree, tree->ctl->root);
shift = NODE_SHIFT(node);
while (shift >= 0)
{
@@ -1691,7 +1920,7 @@ rt_set(radix_tree *tree, uint64 key, uint64 value)
}
parent = node;
- node = rt_node_ptr_encoded(child);
+ node = rt_node_ptr_encoded(tree, child);
shift -= RT_NODE_SPAN;
}
@@ -1699,7 +1928,7 @@ rt_set(radix_tree *tree, uint64 key, uint64 value)
/* Update the statistics */
if (!updated)
- tree->num_keys++;
+ pg_atomic_add_fetch_u64(&tree->ctl->num_keys, 1);
return updated;
}
@@ -1715,12 +1944,14 @@ rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
rt_node_ptr node;
int shift;
+ Assert(!RadixTreeIsShared(tree) || tree->ctl->magic == RADIXTREE_MAGIC);
Assert(value_p != NULL);
- if (!RTPointerIsValid(tree->root) || key > tree->max_val)
+ if (!RTPointerIsValid(tree->ctl->root) ||
+ key > pg_atomic_read_u64(&tree->ctl->max_val))
return false;
- node = rt_node_ptr_encoded(tree->root);
+ node = rt_node_ptr_encoded(tree, tree->ctl->root);
shift = NODE_SHIFT(node);
/* Descend the tree until a leaf node */
@@ -1734,7 +1965,7 @@ rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
return false;
- node = rt_node_ptr_encoded(child);
+ node = rt_node_ptr_encoded(tree, child);
shift -= RT_NODE_SPAN;
}
@@ -1754,14 +1985,17 @@ rt_delete(radix_tree *tree, uint64 key)
int level;
bool deleted;
- if (!tree->root || key > tree->max_val)
+ Assert(!RadixTreeIsShared(tree) || tree->ctl->magic == RADIXTREE_MAGIC);
+
+ if (!RTPointerIsValid(tree->ctl->root) ||
+ key > pg_atomic_read_u64(&tree->ctl->max_val))
return false;
/*
* Descend the tree to search the key while building a stack of nodes we
* visited.
*/
- node = rt_node_ptr_encoded(tree->root);
+ node = rt_node_ptr_encoded(tree, tree->ctl->root);
shift = NODE_SHIFT(node);
level = -1;
while (shift > 0)
@@ -1774,7 +2008,7 @@ rt_delete(radix_tree *tree, uint64 key)
if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
return false;
- node = rt_node_ptr_encoded(child);
+ node = rt_node_ptr_encoded(tree, child);
shift -= RT_NODE_SPAN;
}
@@ -1789,7 +2023,7 @@ rt_delete(radix_tree *tree, uint64 key)
}
/* Found the key to delete. Update the statistics */
- tree->num_keys--;
+ pg_atomic_sub_fetch_u64(&tree->ctl->num_keys, 1);
/*
* Return if the leaf node still has keys and we don't need to delete the
@@ -1823,8 +2057,8 @@ rt_delete(radix_tree *tree, uint64 key)
*/
if (level == 0)
{
- tree->root = InvalidRTPointer;
- tree->max_val = 0;
+ tree->ctl->root = InvalidRTPointer;
+ pg_atomic_write_u64(&tree->ctl->max_val, 0);
}
return true;
@@ -1839,6 +2073,8 @@ rt_begin_iterate(radix_tree *tree)
rt_iter *iter;
int top_level;
+ Assert(!RadixTreeIsShared(tree) || tree->ctl->magic == RADIXTREE_MAGIC);
+
old_ctx = MemoryContextSwitchTo(tree->context);
iter = (rt_iter *) palloc0(sizeof(rt_iter));
@@ -1848,7 +2084,7 @@ rt_begin_iterate(radix_tree *tree)
if (!RTPointerIsValid(iter->tree))
return iter;
- root = rt_node_ptr_encoded(iter->tree->root);
+ root = rt_node_ptr_encoded(tree, iter->tree->ctl->root);
top_level = NODE_SHIFT(root) / RT_NODE_SPAN;
iter->stack_len = top_level;
@@ -1899,6 +2135,8 @@ rt_update_iter_stack(rt_iter *iter, rt_node_ptr from_node, int from)
bool
rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p)
{
+ Assert(!RadixTreeIsShared(iter->tree) || iter->tree->ctl->magic == RADIXTREE_MAGIC);
+
/* Empty tree */
if (!iter->tree)
return false;
@@ -2044,7 +2282,7 @@ rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter, rt_node_ptr *
if (found)
{
rt_iter_update_key(iter, key_chunk, NODE_SHIFT(node));
- *child_p = rt_node_ptr_encoded(child);
+ *child_p = rt_node_ptr_encoded(iter->tree, child);
}
return found;
@@ -2147,7 +2385,7 @@ rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter, uint64 *value_
uint64
rt_num_entries(radix_tree *tree)
{
- return tree->num_keys;
+ return pg_atomic_read_u64(&tree->ctl->num_keys);
}
/*
@@ -2156,12 +2394,19 @@ rt_num_entries(radix_tree *tree)
uint64
rt_memory_usage(radix_tree *tree)
{
- Size total = sizeof(radix_tree);
+ Size total = sizeof(radix_tree) + sizeof(radix_tree_control);
- for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ Assert(!RadixTreeIsShared(tree) || tree->ctl->magic == RADIXTREE_MAGIC);
+
+ if (RadixTreeIsShared(tree))
+ total = dsa_get_total_size(tree->area);
+ else
{
- total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
- total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
+ for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ {
+ total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
+ total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
+ }
}
return total;
@@ -2245,19 +2490,19 @@ rt_verify_node(rt_node_ptr node)
void
rt_stats(radix_tree *tree)
{
- rt_node_ptr root = rt_node_ptr_encoded(tree->root);
+ rt_node_ptr root = rt_node_ptr_encoded(tree, tree->ctl->root);
ereport(LOG, (errmsg("num_keys = %lu, height = %u, n4 = %u, n32 = %u, n128 = %u, n256 = %u",
- tree->num_keys,
+ pg_atomic_read_u64(&tree->ctl->num_keys),
NODE_SHIFT(root) / RT_NODE_SPAN,
- tree->cnt[0],
- tree->cnt[1],
- tree->cnt[2],
- tree->cnt[3])));
+ tree->ctl->cnt[0],
+ tree->ctl->cnt[1],
+ tree->ctl->cnt[2],
+ tree->ctl->cnt[3])));
}
static void
-rt_dump_node(rt_node_ptr node, int level, bool recurse)
+rt_dump_node(radix_tree *tree, rt_node_ptr node, int level, bool recurse)
{
rt_node *n = node.decoded;
char space[128] = {0};
@@ -2293,7 +2538,7 @@ rt_dump_node(rt_node_ptr node, int level, bool recurse)
space, n4->base.chunks[i]);
if (recurse)
- rt_dump_node(rt_node_ptr_encoded(n4->children[i]),
+ rt_dump_node(tree, rt_node_ptr_encoded(tree, n4->children[i]),
level + 1, recurse);
else
fprintf(stderr, "\n");
@@ -2321,7 +2566,7 @@ rt_dump_node(rt_node_ptr node, int level, bool recurse)
if (recurse)
{
- rt_dump_node(rt_node_ptr_encoded(n32->children[i]),
+ rt_dump_node(tree, rt_node_ptr_encoded(tree, n32->children[i]),
level + 1, recurse);
}
else
@@ -2374,7 +2619,9 @@ rt_dump_node(rt_node_ptr node, int level, bool recurse)
space, i);
if (recurse)
- rt_dump_node(rt_node_ptr_encoded(node_inner_128_get_child(n128, i)),
+ rt_dump_node(tree,
+ rt_node_ptr_encoded(tree,
+ node_inner_128_get_child(n128, i)),
level + 1, recurse);
else
fprintf(stderr, "\n");
@@ -2407,7 +2654,9 @@ rt_dump_node(rt_node_ptr node, int level, bool recurse)
space, i);
if (recurse)
- rt_dump_node(rt_node_ptr_encoded(node_inner_256_get_child(n256, i)),
+ rt_dump_node(tree,
+ rt_node_ptr_encoded(tree,
+ node_inner_256_get_child(n256, i)),
level + 1, recurse);
else
fprintf(stderr, "\n");
@@ -2418,6 +2667,27 @@ rt_dump_node(rt_node_ptr node, int level, bool recurse)
}
}
+void
+rt_dump(radix_tree *tree)
+{
+ for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ fprintf(stderr, "%s\tinner_size%lu\tinner_blocksize %lu\tleaf_size %lu\tleaf_blocksize %lu\n",
+ rt_node_kind_info[i].name,
+ rt_node_kind_info[i].inner_size,
+ rt_node_kind_info[i].inner_blocksize,
+ rt_node_kind_info[i].leaf_size,
+ rt_node_kind_info[i].leaf_blocksize);
+ fprintf(stderr, "max_val = %lu\n", pg_atomic_read_u64(&tree->ctl->max_val));
+
+ if (!RTPointerIsValid(tree->ctl->root))
+ {
+ fprintf(stderr, "empty tree\n");
+ return;
+ }
+
+ rt_dump_node(tree, rt_node_ptr_encoded(tree, tree->ctl->root), 0, true);
+}
+
void
rt_dump_search(radix_tree *tree, uint64 key)
{
@@ -2426,28 +2696,30 @@ rt_dump_search(radix_tree *tree, uint64 key)
int level = 0;
elog(NOTICE, "-----------------------------------------------------------");
- elog(NOTICE, "max_val = %lu (0x%lX)", tree->max_val, tree->max_val);
+ elog(NOTICE, "max_val = %lu (0x%lX)",
+ pg_atomic_read_u64(&tree->ctl->max_val),
+ pg_atomic_read_u64(&tree->ctl->max_val));
- if (!RTPointerIsValid(tree->root))
+ if (!RTPointerIsValid(tree->ctl->root))
{
elog(NOTICE, "tree is empty");
return;
}
- if (key > tree->max_val)
+ if (key > pg_atomic_read_u64(&tree->ctl->max_val))
{
elog(NOTICE, "key %lu (0x%lX) is larger than max val",
key, key);
return;
}
- node = rt_node_ptr_encoded(tree->root);
+ node = rt_node_ptr_encoded(tree, tree->ctl->root);
shift = NODE_SHIFT(node);
while (shift >= 0)
{
rt_pointer child;
- rt_dump_node(node, level, false);
+ rt_dump_node(tree, node, level, false);
if (NODE_IS_LEAF(node))
{
@@ -2462,33 +2734,9 @@ rt_dump_search(radix_tree *tree, uint64 key)
if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
break;
- node = rt_node_ptr_encoded(child);
+ node = rt_node_ptr_encoded(tree, child);
shift -= RT_NODE_SPAN;
level++;
}
}
-
-void
-rt_dump(radix_tree *tree)
-{
- rt_node_ptr root;
-
- for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
- fprintf(stderr, "%s\tinner_size%lu\tinner_blocksize %lu\tleaf_size %lu\tleaf_blocksize %lu\n",
- rt_node_kind_info[i].name,
- rt_node_kind_info[i].inner_size,
- rt_node_kind_info[i].inner_blocksize,
- rt_node_kind_info[i].leaf_size,
- rt_node_kind_info[i].leaf_blocksize);
- fprintf(stderr, "max_val = %lu\n", tree->max_val);
-
- if (!RTPointerIsValid(tree->root))
- {
- fprintf(stderr, "empty tree\n");
- return;
- }
-
- root = rt_node_ptr_encoded(tree->root);
- rt_dump_node(root, 0, true);
-}
#endif
diff --git a/src/backend/utils/mmgr/dsa.c b/src/backend/utils/mmgr/dsa.c
index 82376fde2d..ad169882af 100644
--- a/src/backend/utils/mmgr/dsa.c
+++ b/src/backend/utils/mmgr/dsa.c
@@ -1024,6 +1024,18 @@ dsa_set_size_limit(dsa_area *area, size_t limit)
LWLockRelease(DSA_AREA_LOCK(area));
}
+size_t
+dsa_get_total_size(dsa_area *area)
+{
+ size_t size;
+
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_SHARED);
+ size = area->control->total_segment_size;
+ LWLockRelease(DSA_AREA_LOCK(area));
+
+ return size;
+}
+
/*
* Aggressively free all spare memory in the hope of returning DSM segments to
* the operating system.
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index d5d7668617..68a11df970 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -14,18 +14,24 @@
#define RADIXTREE_H
#include "postgres.h"
+#include "utils/dsa.h"
#define RT_DEBUG 1
typedef struct radix_tree radix_tree;
typedef struct rt_iter rt_iter;
+typedef dsa_pointer rt_handle;
-extern radix_tree *rt_create(MemoryContext ctx);
+extern radix_tree *rt_create(MemoryContext ctx, dsa_area *dsa);
extern void rt_free(radix_tree *tree);
extern bool rt_search(radix_tree *tree, uint64 key, uint64 *val_p);
extern bool rt_set(radix_tree *tree, uint64 key, uint64 val);
extern rt_iter *rt_begin_iterate(radix_tree *tree);
+extern rt_handle rt_get_handle(radix_tree *tree);
+extern radix_tree *rt_attach(dsa_area *dsa, dsa_pointer dp);
+extern void rt_detach(radix_tree *tree);
+
extern bool rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p);
extern void rt_end_iterate(rt_iter *iter);
extern bool rt_delete(radix_tree *tree, uint64 key);
diff --git a/src/include/utils/dsa.h b/src/include/utils/dsa.h
index 405606fe2f..dad06adecc 100644
--- a/src/include/utils/dsa.h
+++ b/src/include/utils/dsa.h
@@ -117,6 +117,7 @@ extern dsa_handle dsa_get_handle(dsa_area *area);
extern dsa_pointer dsa_allocate_extended(dsa_area *area, size_t size, int flags);
extern void dsa_free(dsa_area *area, dsa_pointer dp);
extern void *dsa_get_address(dsa_area *area, dsa_pointer dp);
+extern size_t dsa_get_total_size(dsa_area *area);
extern void dsa_trim(dsa_area *area);
extern void dsa_dump(dsa_area *area);
diff --git a/src/test/modules/test_radixtree/expected/test_radixtree.out b/src/test/modules/test_radixtree/expected/test_radixtree.out
index cc6970c87c..a0ff1e1c77 100644
--- a/src/test/modules/test_radixtree/expected/test_radixtree.out
+++ b/src/test/modules/test_radixtree/expected/test_radixtree.out
@@ -5,21 +5,38 @@ CREATE EXTENSION test_radixtree;
--
SELECT test_radixtree();
NOTICE: testing radix tree node types with shift "0"
+NOTICE: testing radix tree node types with shift "0"
+NOTICE: testing radix tree node types with shift "8"
NOTICE: testing radix tree node types with shift "8"
NOTICE: testing radix tree node types with shift "16"
+NOTICE: testing radix tree node types with shift "16"
NOTICE: testing radix tree node types with shift "24"
+NOTICE: testing radix tree node types with shift "24"
+NOTICE: testing radix tree node types with shift "32"
NOTICE: testing radix tree node types with shift "32"
NOTICE: testing radix tree node types with shift "40"
+NOTICE: testing radix tree node types with shift "40"
+NOTICE: testing radix tree node types with shift "48"
NOTICE: testing radix tree node types with shift "48"
NOTICE: testing radix tree node types with shift "56"
+NOTICE: testing radix tree node types with shift "56"
+NOTICE: testing radix tree with pattern "all ones"
NOTICE: testing radix tree with pattern "all ones"
NOTICE: testing radix tree with pattern "alternating bits"
+NOTICE: testing radix tree with pattern "alternating bits"
+NOTICE: testing radix tree with pattern "clusters of ten"
NOTICE: testing radix tree with pattern "clusters of ten"
NOTICE: testing radix tree with pattern "clusters of hundred"
+NOTICE: testing radix tree with pattern "clusters of hundred"
+NOTICE: testing radix tree with pattern "one-every-64k"
NOTICE: testing radix tree with pattern "one-every-64k"
NOTICE: testing radix tree with pattern "sparse"
+NOTICE: testing radix tree with pattern "sparse"
+NOTICE: testing radix tree with pattern "single values, distance > 2^32"
NOTICE: testing radix tree with pattern "single values, distance > 2^32"
NOTICE: testing radix tree with pattern "clusters, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^60"
NOTICE: testing radix tree with pattern "clusters, distance > 2^60"
test_radixtree
----------------
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
index cb3596755d..a948cba4ec 100644
--- a/src/test/modules/test_radixtree/test_radixtree.c
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -19,6 +19,7 @@
#include "nodes/bitmapset.h"
#include "storage/block.h"
#include "storage/itemptr.h"
+#include "storage/lwlock.h"
#include "utils/memutils.h"
#include "utils/timestamp.h"
@@ -111,7 +112,7 @@ test_empty(void)
radix_tree *radixtree;
uint64 dummy;
- radixtree = rt_create(CurrentMemoryContext);
+ radixtree = rt_create(CurrentMemoryContext, NULL);
if (rt_search(radixtree, 0, &dummy))
elog(ERROR, "rt_search on empty tree returned true");
@@ -217,14 +218,10 @@ test_node_types_delete(radix_tree *radixtree, uint8 shift)
* level.
*/
static void
-test_node_types(uint8 shift)
+do_test_node_types(radix_tree *radixtree, uint8 shift)
{
- radix_tree *radixtree;
-
elog(NOTICE, "testing radix tree node types with shift \"%d\"", shift);
- radixtree = rt_create(CurrentMemoryContext);
-
/*
* Insert and search entries for every node type at the 'shift' level,
* then delete all entries to make it empty, and insert and search entries
@@ -233,19 +230,39 @@ test_node_types(uint8 shift)
test_node_types_insert(radixtree, shift);
test_node_types_delete(radixtree, shift);
test_node_types_insert(radixtree, shift);
+}
- rt_free(radixtree);
+static void
+test_node_types(void)
+{
+ int tranche_id = LWLockNewTrancheId();
+
+ for (int shift = 0; shift <= (64 - 8); shift += 8)
+ {
+ radix_tree *tree;
+ dsa_area *dsa;
+
+ /* Test the local radix tree */
+ tree = rt_create(CurrentMemoryContext, NULL);
+ do_test_node_types(tree, shift);
+ rt_free(tree);
+
+ /* Test the shared radix tree */
+ dsa = dsa_create(tranche_id);
+ tree = rt_create(CurrentMemoryContext, dsa);
+ do_test_node_types(tree, shift);
+ rt_free(tree);
+ dsa_detach(dsa);
+ }
}
/*
* Test with a repeating pattern, defined by the 'spec'.
*/
static void
-test_pattern(const test_spec * spec)
+do_test_pattern(radix_tree *radixtree, const test_spec * spec)
{
- radix_tree *radixtree;
rt_iter *iter;
- MemoryContext radixtree_ctx;
TimestampTz starttime;
TimestampTz endtime;
uint64 n;
@@ -271,18 +288,6 @@ test_pattern(const test_spec * spec)
pattern_values[pattern_num_values++] = i;
}
- /*
- * Allocate the radix tree.
- *
- * Allocate it in a separate memory context, so that we can print its
- * memory usage easily.
- */
- radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
- "radixtree test",
- ALLOCSET_SMALL_SIZES);
- MemoryContextSetIdentifier(radixtree_ctx, spec->test_name);
- radixtree = rt_create(radixtree_ctx);
-
/*
* Add values to the set.
*/
@@ -336,8 +341,6 @@ test_pattern(const test_spec * spec)
mem_usage = rt_memory_usage(radixtree);
fprintf(stderr, "rt_memory_usage() reported " UINT64_FORMAT " (%0.2f bytes / integer)\n",
mem_usage, (double) mem_usage / spec->num_values);
-
- MemoryContextStats(radixtree_ctx);
}
/* Check that rt_num_entries works */
@@ -484,21 +487,54 @@ test_pattern(const test_spec * spec)
if ((nbefore - ndeleted) != nafter)
elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT "after " UINT64_FORMAT " deletion",
nafter, (nbefore - ndeleted), ndeleted);
+}
+
+static void
+test_patterns(void)
+{
+ int tranche_id = LWLockNewTrancheId();
+
+ /* Test different test patterns, with lots of entries */
+ for (int i = 0; i < lengthof(test_specs); i++)
+ {
+ radix_tree *tree;
+ MemoryContext radixtree_ctx;
+ dsa_area *dsa;
+ const test_spec *spec = &test_specs[i];
- MemoryContextDelete(radixtree_ctx);
+ /*
+ * Allocate the radix tree.
+ *
+ * Allocate it in a separate memory context, so that we can print its
+ * memory usage easily.
+ */
+ radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
+ "radixtree test",
+ ALLOCSET_SMALL_SIZES);
+ MemoryContextSetIdentifier(radixtree_ctx, spec->test_name);
+
+ /* Test the local radix tree */
+ tree = rt_create(radixtree_ctx, NULL);
+ do_test_pattern(tree, spec);
+ rt_free(tree);
+ MemoryContextReset(radixtree_ctx);
+
+ /* Test the shared radix tree */
+ dsa = dsa_create(tranche_id);
+ tree = rt_create(radixtree_ctx, dsa);
+ do_test_pattern(tree, spec);
+ rt_free(tree);
+ dsa_detach(dsa);
+ MemoryContextDelete(radixtree_ctx);
+ }
}
Datum
test_radixtree(PG_FUNCTION_ARGS)
{
test_empty();
-
- for (int shift = 0; shift <= (64 - 8); shift += 8)
- test_node_types(shift);
-
- /* Test different test patterns, with lots of entries */
- for (int i = 0; i < lengthof(test_specs); i++)
- test_pattern(&test_specs[i]);
+ test_node_types();
+ test_patterns();
PG_RETURN_VOID();
}
--
2.31.1
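For clarity, here is a rough sketch of how the shared radix tree in the above patch is meant to be used, following the API added to radixtree.h (rt_create() with a dsa_area, rt_get_handle(), rt_attach(), rt_detach(), rt_free()). It is only an illustration: it runs the "creator" and "attacher" steps in one function, omits locking (which the patch does not provide yet) and error handling, and assumes a backend environment with the patch applied.

#include "postgres.h"

#include "lib/radixtree.h"
#include "storage/lwlock.h"
#include "utils/dsa.h"
#include "utils/memutils.h"

static void
shared_radix_tree_example(void)
{
	dsa_area   *area = dsa_create(LWLockNewTrancheId());
	radix_tree *tree = rt_create(CurrentMemoryContext, area);
	rt_handle	handle = rt_get_handle(tree);	/* to be passed to other backends */
	radix_tree *attached;
	uint64		val;

	rt_set(tree, 42, 100);

	/* In another backend: attach using the same DSA area and the handle */
	attached = rt_attach(area, handle);
	if (rt_search(attached, 42, &val))
		elog(LOG, "found value " UINT64_FORMAT, val);
	rt_detach(attached);

	/* The creating backend eventually frees the whole shared tree */
	rt_free(tree);
	dsa_detach(area);
}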
Attachment: v9-0006-PoC-lazy-vacuum-integration.patch (application/octet-stream)
From 2cbeff1f0c195eefc1daa2400361007e112e7aac Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 4 Nov 2022 14:14:42 +0900
Subject: [PATCH v9 6/6] PoC: lazy vacuum integration.
The patch includes:
* Introducing a new module called TIDStore
* Lazy vacuum and parallel vacuum integration.
TODOs:
* radix tree needs to have the reset functionality.
* should not allow TIDStore to grow beyond the memory limit.
* change the progress statistics of pg_stat_progress_vacuum.
---
src/backend/access/common/Makefile | 1 +
src/backend/access/common/meson.build | 1 +
src/backend/access/common/tidstore.c | 280 ++++++++++++++++++++++++++
src/backend/access/heap/vacuumlazy.c | 160 +++++----------
src/backend/commands/vacuum.c | 76 +------
src/backend/commands/vacuumparallel.c | 60 +++---
src/backend/storage/lmgr/lwlock.c | 2 +
src/include/access/tidstore.h | 55 +++++
src/include/commands/vacuum.h | 24 +--
src/include/storage/lwlock.h | 1 +
10 files changed, 434 insertions(+), 226 deletions(-)
create mode 100644 src/backend/access/common/tidstore.c
create mode 100644 src/include/access/tidstore.h
diff --git a/src/backend/access/common/Makefile b/src/backend/access/common/Makefile
index b9aff0ccfd..67b8cc6108 100644
--- a/src/backend/access/common/Makefile
+++ b/src/backend/access/common/Makefile
@@ -27,6 +27,7 @@ OBJS = \
syncscan.o \
toast_compression.o \
toast_internals.o \
+ tidstore.o \
tupconvert.o \
tupdesc.o
diff --git a/src/backend/access/common/meson.build b/src/backend/access/common/meson.build
index 857beaa32d..76265974b1 100644
--- a/src/backend/access/common/meson.build
+++ b/src/backend/access/common/meson.build
@@ -13,6 +13,7 @@ backend_sources += files(
'syncscan.c',
'toast_compression.c',
'toast_internals.c',
+ 'tidstore.c',
'tupconvert.c',
'tupdesc.c',
)
diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
new file mode 100644
index 0000000000..50ec800fd6
--- /dev/null
+++ b/src/backend/access/common/tidstore.c
@@ -0,0 +1,280 @@
+/*-------------------------------------------------------------------------
+ *
+ * tidstore.c
+ * TID (ItemPointer) storage implementation.
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/common/tidstore.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/tidstore.h"
+#include "lib/radixtree.h"
+#include "utils/dsa.h"
+#include "utils/memutils.h"
+
+/* XXX: should be configurable for non-heap AMs */
+#define TIDSTORE_OFFSET_NBITS 11 /* pg_ceil_log2_32(MaxHeapTuplesPerPage) */
+
+#define TIDSTORE_VALUE_NBITS 6 /* log(sizeof(uint64) * BITS_PER_BYTE, 2) */
+
+/* Get block number from the key */
+#define KEY_GET_BLKNO(key) \
+ ((BlockNumber) ((key) >> (TIDSTORE_OFFSET_NBITS - TIDSTORE_VALUE_NBITS)))
+
+struct TIDStore
+{
+ /* main storage for TID */
+ radix_tree *tree;
+
+ /* # of tids in TIDStore */
+ int num_tids;
+
+ /* DSA area and handle for shared TIDStore */
+ rt_handle handle;
+ dsa_area *area;
+};
+
+static void tidstore_iter_collect_tids(TIDStoreIter *iter, uint64 key, uint64 val);
+static inline uint64 tid_to_key_off(ItemPointer tid, uint32 *off);
+
+/*
+ * Create a TIDStore. The returned object is allocated in backend-local memory.
+ * The radix tree used for storage is allocated in the DSA area if 'area' is non-NULL.
+ */
+TIDStore *
+tidstore_create(dsa_area *area)
+{
+ TIDStore *ts;
+
+ ts = palloc0(sizeof(TIDStore));
+
+ ts->tree = rt_create(CurrentMemoryContext, area);
+ ts->area = area;
+
+ if (area != NULL)
+ ts->handle = rt_get_handle(ts->tree);
+
+ return ts;
+}
+
+/* Attach to the shared TIDStore using a handle */
+TIDStore *
+tidstore_attach(dsa_area *area, rt_handle handle)
+{
+ TIDStore *ts;
+
+ Assert(area != NULL);
+ Assert(DsaPointerIsValid(handle));
+
+ ts = palloc0(sizeof(TIDStore));
+ ts->tree = rt_attach(area, handle);
+
+ return ts;
+}
+
+/*
+ * Detach from a TIDStore. This detaches from the radix tree and frees the
+ * backend-local resources.
+ */
+void
+tidstore_detach(TIDStore *ts)
+{
+ rt_detach(ts->tree);
+ pfree(ts);
+}
+
+void
+tidstore_free(TIDStore *ts)
+{
+ rt_free(ts->tree);
+ pfree(ts);
+}
+
+void
+tidstore_reset(TIDStore *ts)
+{
+ dsa_area *area = ts->area;
+
+ /* Reset the statistics */
+ ts->num_tids = 0;
+
+ /* Recreate radix tree storage */
+ rt_free(ts->tree);
+ ts->tree = rt_create(CurrentMemoryContext, area);
+}
+
+/* Add TIDs to TIDStore */
+void
+tidstore_add_tids(TIDStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets)
+{
+ uint64 last_key = PG_UINT64_MAX;
+ uint64 key;
+ uint64 val = 0;
+ ItemPointerData tid;
+
+ ItemPointerSetBlockNumber(&tid, blkno);
+
+ for (int i = 0; i < num_offsets; i++)
+ {
+ uint32 off;
+
+ ItemPointerSetOffsetNumber(&tid, offsets[i]);
+
+ key = tid_to_key_off(&tid, &off);
+
+ if (last_key != PG_UINT64_MAX && last_key != key)
+ {
+ rt_set(ts->tree, last_key, val);
+ val = 0;
+ }
+
+ last_key = key;
+ val |= UINT64CONST(1) << off;
+ ts->num_tids++;
+ }
+
+ if (last_key != PG_UINT64_MAX)
+ {
+ rt_set(ts->tree, last_key, val);
+ val = 0;
+ }
+}
+
+/* Return true if the given TID is present in TIDStore */
+bool
+tidstore_lookup_tid(TIDStore *ts, ItemPointer tid)
+{
+ uint64 key;
+ uint64 val;
+ uint32 off;
+ bool found;
+
+ key = tid_to_key_off(tid, &off);
+
+ found = rt_search(ts->tree, key, &val);
+
+ if (!found)
+ return false;
+
+ return (val & (UINT64CONST(1) << off)) != 0;
+}
+
+TIDStoreIter *
+tidstore_begin_iterate(TIDStore *ts)
+{
+ TIDStoreIter *iter;
+
+ iter = palloc0(sizeof(TIDStoreIter));
+ iter->ts = ts;
+ iter->tree_iter = rt_begin_iterate(ts->tree);
+ iter->blkno = InvalidBlockNumber;
+
+ return iter;
+}
+
+bool
+tidstore_iterate_next(TIDStoreIter *iter)
+{
+ uint64 key;
+ uint64 val;
+
+ if (iter->finished)
+ return false;
+
+ if (BlockNumberIsValid(iter->blkno))
+ {
+ iter->num_offsets = 0;
+ tidstore_iter_collect_tids(iter, iter->next_key, iter->next_val);
+ }
+
+ while (rt_iterate_next(iter->tree_iter, &key, &val))
+ {
+ BlockNumber blkno;
+
+ blkno = KEY_GET_BLKNO(key);
+
+ if (BlockNumberIsValid(iter->blkno) && iter->blkno != blkno)
+ {
+ /*
+ * Remember the key-value pair for the next block for the
+ * next iteration.
+ */
+ iter->next_key = key;
+ iter->next_val = val;
+ return true;
+ }
+
+ /* Collect tids extracted from the key-value pair */
+ tidstore_iter_collect_tids(iter, key, val);
+ }
+
+ iter->finished = true;
+ return true;
+}
+
+uint64
+tidstore_num_tids(TIDStore *ts)
+{
+ return ts->num_tids;
+}
+
+uint64
+tidstore_memory_usage(TIDStore *ts)
+{
+ return (uint64) sizeof(TIDStore) + rt_memory_usage(ts->tree);
+}
+
+/*
+ * Get a handle that can be used by other processes to attach to this TIDStore
+ */
+tidstore_handle
+tidstore_get_handle(TIDStore *ts)
+{
+ return rt_get_handle(ts->tree);
+}
+
+/* Extract TIDs from key-value pair */
+static void
+tidstore_iter_collect_tids(TIDStoreIter *iter, uint64 key, uint64 val)
+{
+ for (int i = 0; i < sizeof(uint64) * BITS_PER_BYTE; i++)
+ {
+ uint64 tid_i;
+ OffsetNumber off;
+
+ if ((val & (UINT64CONST(1) << i)) == 0)
+ continue;
+
+ tid_i = key << TIDSTORE_VALUE_NBITS;
+ tid_i |= i;
+
+ off = tid_i & ((UINT64CONST(1) << TIDSTORE_OFFSET_NBITS) - 1);
+ iter->offsets[iter->num_offsets++] = off;
+ }
+
+ iter->blkno = KEY_GET_BLKNO(key);
+}
+
+/* Encode a TID to key and val */
+static inline uint64
+tid_to_key_off(ItemPointer tid, uint32 *off)
+{
+ uint64 upper;
+ uint64 tid_i;
+
+ tid_i = ItemPointerGetOffsetNumber(tid);
+ tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << TIDSTORE_OFFSET_NBITS;
+
+ *off = tid_i & ((1 << TIDSTORE_VALUE_NBITS) - 1);
+ upper = tid_i >> TIDSTORE_VALUE_NBITS;
+ Assert(*off < (sizeof(uint64) * BITS_PER_BYTE));
+
+ return upper;
+}
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index dfbe37472f..5b013bc3a8 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -40,6 +40,7 @@
#include "access/heapam_xlog.h"
#include "access/htup_details.h"
#include "access/multixact.h"
+#include "access/tidstore.h"
#include "access/transam.h"
#include "access/visibilitymap.h"
#include "access/xact.h"
@@ -144,6 +145,8 @@ typedef struct LVRelState
Relation *indrels;
int nindexes;
+ int max_bytes;
+
/* Aggressive VACUUM? (must set relfrozenxid >= FreezeLimit) */
bool aggressive;
/* Use visibility map to skip? (disabled by DISABLE_PAGE_SKIPPING) */
@@ -194,7 +197,7 @@ typedef struct LVRelState
* lazy_vacuum_heap_rel, which marks the same LP_DEAD line pointers as
* LP_UNUSED during second heap pass.
*/
- VacDeadItems *dead_items; /* TIDs whose index tuples we'll delete */
+ TIDStore *dead_items; /* TIDs whose index tuples we'll delete */
BlockNumber rel_pages; /* total number of pages */
BlockNumber scanned_pages; /* # pages examined (not skipped via VM) */
BlockNumber removed_pages; /* # pages removed by relation truncation */
@@ -265,8 +268,9 @@ static bool lazy_scan_noprune(LVRelState *vacrel, Buffer buf,
static void lazy_vacuum(LVRelState *vacrel);
static bool lazy_vacuum_all_indexes(LVRelState *vacrel);
static void lazy_vacuum_heap_rel(LVRelState *vacrel);
-static int lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
- Buffer buffer, int index, Buffer *vmbuffer);
+static void lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
+ OffsetNumber *offsets, int num_offsets,
+ Buffer buffer, Buffer *vmbuffer);
static bool lazy_check_wraparound_failsafe(LVRelState *vacrel);
static void lazy_cleanup_all_indexes(LVRelState *vacrel);
static IndexBulkDeleteResult *lazy_vacuum_one_index(Relation indrel,
@@ -397,6 +401,9 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
vacrel->indname = NULL;
vacrel->phase = VACUUM_ERRCB_PHASE_UNKNOWN;
vacrel->verbose = verbose;
+ vacrel->max_bytes = IsAutoVacuumWorkerProcess() &&
+ autovacuum_work_mem != -1 ?
+ autovacuum_work_mem * 1024L : maintenance_work_mem * 1024L;
errcallback.callback = vacuum_error_callback;
errcallback.arg = vacrel;
errcallback.previous = error_context_stack;
@@ -858,7 +865,7 @@ lazy_scan_heap(LVRelState *vacrel)
next_unskippable_block,
next_failsafe_block = 0,
next_fsm_block_to_vacuum = 0;
- VacDeadItems *dead_items = vacrel->dead_items;
+ TIDStore *dead_items = vacrel->dead_items;
Buffer vmbuffer = InvalidBuffer;
bool next_unskippable_allvis,
skipping_current_range;
@@ -872,7 +879,7 @@ lazy_scan_heap(LVRelState *vacrel)
/* Report that we're scanning the heap, advertising total # of blocks */
initprog_val[0] = PROGRESS_VACUUM_PHASE_SCAN_HEAP;
initprog_val[1] = rel_pages;
- initprog_val[2] = dead_items->max_items;
+ initprog_val[2] = vacrel->max_bytes; /* XXX: should use # of tids */
pgstat_progress_update_multi_param(3, initprog_index, initprog_val);
/* Set up an initial range of skippable blocks using the visibility map */
@@ -942,8 +949,8 @@ lazy_scan_heap(LVRelState *vacrel)
* dead_items TIDs, pause and do a cycle of vacuuming before we tackle
* this page.
*/
- Assert(dead_items->max_items >= MaxHeapTuplesPerPage);
- if (dead_items->max_items - dead_items->num_items < MaxHeapTuplesPerPage)
+ /* XXX: should not allow tidstore to grow beyond max_bytes */
+ if (tidstore_memory_usage(vacrel->dead_items) > vacrel->max_bytes)
{
/*
* Before beginning index vacuuming, we release any pin we may
@@ -1075,11 +1082,17 @@ lazy_scan_heap(LVRelState *vacrel)
if (prunestate.has_lpdead_items)
{
Size freespace;
+ TIDStoreIter *iter;
- lazy_vacuum_heap_page(vacrel, blkno, buf, 0, &vmbuffer);
+ iter = tidstore_begin_iterate(vacrel->dead_items);
+ tidstore_iterate_next(iter);
+ lazy_vacuum_heap_page(vacrel, blkno, iter->offsets, iter->num_offsets,
+ buf, &vmbuffer);
+ Assert(!tidstore_iterate_next(iter));
+ pfree(iter);
/* Forget the LP_DEAD items that we just vacuumed */
- dead_items->num_items = 0;
+ tidstore_reset(dead_items);
/*
* Periodically perform FSM vacuuming to make newly-freed
@@ -1116,7 +1129,7 @@ lazy_scan_heap(LVRelState *vacrel)
* with prunestate-driven visibility map and FSM steps (just like
* the two-pass strategy).
*/
- Assert(dead_items->num_items == 0);
+ Assert(tidstore_num_tids(dead_items) == 0);
}
/*
@@ -1269,7 +1282,7 @@ lazy_scan_heap(LVRelState *vacrel)
* Do index vacuuming (call each index's ambulkdelete routine), then do
* related heap vacuuming
*/
- if (dead_items->num_items > 0)
+ if (tidstore_num_tids(dead_items) > 0)
lazy_vacuum(vacrel);
/*
@@ -1903,25 +1916,16 @@ retry:
*/
if (lpdead_items > 0)
{
- VacDeadItems *dead_items = vacrel->dead_items;
- ItemPointerData tmp;
+ TIDStore *dead_items = vacrel->dead_items;
Assert(!prunestate->all_visible);
Assert(prunestate->has_lpdead_items);
vacrel->lpdead_item_pages++;
+ tidstore_add_tids(dead_items, blkno, deadoffsets, lpdead_items);
- ItemPointerSetBlockNumber(&tmp, blkno);
-
- for (int i = 0; i < lpdead_items; i++)
- {
- ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
- dead_items->items[dead_items->num_items++] = tmp;
- }
-
- Assert(dead_items->num_items <= dead_items->max_items);
pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
- dead_items->num_items);
+ tidstore_num_tids(dead_items));
}
/* Finally, add page-local counts to whole-VACUUM counts */
@@ -2128,8 +2132,7 @@ lazy_scan_noprune(LVRelState *vacrel,
}
else
{
- VacDeadItems *dead_items = vacrel->dead_items;
- ItemPointerData tmp;
+ TIDStore *dead_items = vacrel->dead_items;
/*
* Page has LP_DEAD items, and so any references/TIDs that remain in
@@ -2138,17 +2141,10 @@ lazy_scan_noprune(LVRelState *vacrel,
*/
vacrel->lpdead_item_pages++;
- ItemPointerSetBlockNumber(&tmp, blkno);
-
- for (int i = 0; i < lpdead_items; i++)
- {
- ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
- dead_items->items[dead_items->num_items++] = tmp;
- }
+ tidstore_add_tids(dead_items, blkno, deadoffsets, lpdead_items);
- Assert(dead_items->num_items <= dead_items->max_items);
pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
- dead_items->num_items);
+ tidstore_num_tids(dead_items));
vacrel->lpdead_items += lpdead_items;
@@ -2197,7 +2193,7 @@ lazy_vacuum(LVRelState *vacrel)
if (!vacrel->do_index_vacuuming)
{
Assert(!vacrel->do_index_cleanup);
- vacrel->dead_items->num_items = 0;
+ tidstore_reset(vacrel->dead_items);
return;
}
@@ -2226,7 +2222,7 @@ lazy_vacuum(LVRelState *vacrel)
BlockNumber threshold;
Assert(vacrel->num_index_scans == 0);
- Assert(vacrel->lpdead_items == vacrel->dead_items->num_items);
+ Assert(vacrel->lpdead_items == tidstore_num_tids(vacrel->dead_items));
Assert(vacrel->do_index_vacuuming);
Assert(vacrel->do_index_cleanup);
@@ -2253,8 +2249,8 @@ lazy_vacuum(LVRelState *vacrel)
* cases then this may need to be reconsidered.
*/
threshold = (double) vacrel->rel_pages * BYPASS_THRESHOLD_PAGES;
- bypass = (vacrel->lpdead_item_pages < threshold &&
- vacrel->lpdead_items < MAXDEADITEMS(32L * 1024L * 1024L));
+ bypass = (vacrel->lpdead_item_pages < threshold) &&
+ tidstore_memory_usage(vacrel->dead_items) < (32L * 1024L * 1024L);
}
if (bypass)
@@ -2299,7 +2295,7 @@ lazy_vacuum(LVRelState *vacrel)
* Forget the LP_DEAD items that we just vacuumed (or just decided to not
* vacuum)
*/
- vacrel->dead_items->num_items = 0;
+ /* tidstore_reset(vacrel->dead_items); */
}
/*
@@ -2371,7 +2367,7 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
* place).
*/
Assert(vacrel->num_index_scans > 0 ||
- vacrel->dead_items->num_items == vacrel->lpdead_items);
+ tidstore_num_tids(vacrel->dead_items) == vacrel->lpdead_items);
Assert(allindexes || vacrel->failsafe_active);
/*
@@ -2408,10 +2404,10 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
static void
lazy_vacuum_heap_rel(LVRelState *vacrel)
{
- int index;
BlockNumber vacuumed_pages;
Buffer vmbuffer = InvalidBuffer;
LVSavedErrInfo saved_err_info;
+ TIDStoreIter *iter;
Assert(vacrel->do_index_vacuuming);
Assert(vacrel->do_index_cleanup);
@@ -2428,8 +2424,8 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
vacuumed_pages = 0;
- index = 0;
- while (index < vacrel->dead_items->num_items)
+ iter = tidstore_begin_iterate(vacrel->dead_items);
+ while (tidstore_iterate_next(iter))
{
BlockNumber tblk;
Buffer buf;
@@ -2438,12 +2434,13 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
vacuum_delay_point();
- tblk = ItemPointerGetBlockNumber(&vacrel->dead_items->items[index]);
+ tblk = iter->blkno;
vacrel->blkno = tblk;
buf = ReadBufferExtended(vacrel->rel, MAIN_FORKNUM, tblk, RBM_NORMAL,
vacrel->bstrategy);
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
- index = lazy_vacuum_heap_page(vacrel, tblk, buf, index, &vmbuffer);
+ lazy_vacuum_heap_page(vacrel, tblk, iter->offsets, iter->num_offsets,
+ buf, &vmbuffer);
/* Now that we've vacuumed the page, record its available space */
page = BufferGetPage(buf);
@@ -2467,9 +2464,8 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
* We set all LP_DEAD items from the first heap pass to LP_UNUSED during
* the second heap pass. No more, no less.
*/
- Assert(index > 0);
Assert(vacrel->num_index_scans > 1 ||
- (index == vacrel->lpdead_items &&
+ (tidstore_num_tids(vacrel->dead_items) == vacrel->lpdead_items &&
vacuumed_pages == vacrel->lpdead_item_pages));
ereport(DEBUG2,
@@ -2491,11 +2487,10 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
* LP_DEAD item on the page. The return value is the first index immediately
* after all LP_DEAD items for the same page in the array.
*/
-static int
-lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
- int index, Buffer *vmbuffer)
+static void
+lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets, Buffer buffer, Buffer *vmbuffer)
{
- VacDeadItems *dead_items = vacrel->dead_items;
Page page = BufferGetPage(buffer);
OffsetNumber unused[MaxHeapTuplesPerPage];
int uncnt = 0;
@@ -2514,16 +2509,11 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
START_CRIT_SECTION();
- for (; index < dead_items->num_items; index++)
+ for (int i = 0; i < num_offsets; i++)
{
- BlockNumber tblk;
- OffsetNumber toff;
ItemId itemid;
+ OffsetNumber toff = offsets[i];
- tblk = ItemPointerGetBlockNumber(&dead_items->items[index]);
- if (tblk != blkno)
- break; /* past end of tuples for this block */
- toff = ItemPointerGetOffsetNumber(&dead_items->items[index]);
itemid = PageGetItemId(page, toff);
Assert(ItemIdIsDead(itemid) && !ItemIdHasStorage(itemid));
@@ -2603,7 +2593,6 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
/* Revert to the previous phase information for error traceback */
restore_vacuum_error_info(vacrel, &saved_err_info);
- return index;
}
/*
@@ -3105,46 +3094,6 @@ count_nondeletable_pages(LVRelState *vacrel, bool *lock_waiter_detected)
return vacrel->nonempty_pages;
}
-/*
- * Returns the number of dead TIDs that VACUUM should allocate space to
- * store, given a heap rel of size vacrel->rel_pages, and given current
- * maintenance_work_mem setting (or current autovacuum_work_mem setting,
- * when applicable).
- *
- * See the comments at the head of this file for rationale.
- */
-static int
-dead_items_max_items(LVRelState *vacrel)
-{
- int64 max_items;
- int vac_work_mem = IsAutoVacuumWorkerProcess() &&
- autovacuum_work_mem != -1 ?
- autovacuum_work_mem : maintenance_work_mem;
-
- if (vacrel->nindexes > 0)
- {
- BlockNumber rel_pages = vacrel->rel_pages;
-
- max_items = MAXDEADITEMS(vac_work_mem * 1024L);
- max_items = Min(max_items, INT_MAX);
- max_items = Min(max_items, MAXDEADITEMS(MaxAllocSize));
-
- /* curious coding here to ensure the multiplication can't overflow */
- if ((BlockNumber) (max_items / MaxHeapTuplesPerPage) > rel_pages)
- max_items = rel_pages * MaxHeapTuplesPerPage;
-
- /* stay sane if small maintenance_work_mem */
- max_items = Max(max_items, MaxHeapTuplesPerPage);
- }
- else
- {
- /* One-pass case only stores a single heap page's TIDs at a time */
- max_items = MaxHeapTuplesPerPage;
- }
-
- return (int) max_items;
-}
-
/*
* Allocate dead_items (either using palloc, or in dynamic shared memory).
* Sets dead_items in vacrel for caller.
@@ -3155,12 +3104,6 @@ dead_items_max_items(LVRelState *vacrel)
static void
dead_items_alloc(LVRelState *vacrel, int nworkers)
{
- VacDeadItems *dead_items;
- int max_items;
-
- max_items = dead_items_max_items(vacrel);
- Assert(max_items >= MaxHeapTuplesPerPage);
-
/*
* Initialize state for a parallel vacuum. As of now, only one worker can
* be used for an index, so we invoke parallelism only if there are at
@@ -3186,7 +3129,6 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
else
vacrel->pvs = parallel_vacuum_init(vacrel->rel, vacrel->indrels,
vacrel->nindexes, nworkers,
- max_items,
vacrel->verbose ? INFO : DEBUG2,
vacrel->bstrategy);
@@ -3199,11 +3141,7 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
}
/* Serial VACUUM case */
- dead_items = (VacDeadItems *) palloc(vac_max_items_to_alloc_size(max_items));
- dead_items->max_items = max_items;
- dead_items->num_items = 0;
-
- vacrel->dead_items = dead_items;
+ vacrel->dead_items = tidstore_create(NULL);
}
/*
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 3c8ea21475..effb72cdd6 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -95,7 +95,6 @@ static bool vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params);
static double compute_parallel_delay(void);
static VacOptValue get_vacoptval_from_boolean(DefElem *def);
static bool vac_tid_reaped(ItemPointer itemptr, void *state);
-static int vac_cmp_itemptr(const void *left, const void *right);
/*
* Primary entry point for manual VACUUM and ANALYZE commands
@@ -2295,16 +2294,16 @@ get_vacoptval_from_boolean(DefElem *def)
*/
IndexBulkDeleteResult *
vac_bulkdel_one_index(IndexVacuumInfo *ivinfo, IndexBulkDeleteResult *istat,
- VacDeadItems *dead_items)
+ TIDStore *dead_items)
{
/* Do bulk deletion */
istat = index_bulk_delete(ivinfo, istat, vac_tid_reaped,
(void *) dead_items);
ereport(ivinfo->message_level,
- (errmsg("scanned index \"%s\" to remove %d row versions",
+ (errmsg("scanned index \"%s\" to remove " UINT64_FORMAT " row versions",
RelationGetRelationName(ivinfo->index),
- dead_items->num_items)));
+ tidstore_num_tids(dead_items))));
return istat;
}
@@ -2335,18 +2334,6 @@ vac_cleanup_one_index(IndexVacuumInfo *ivinfo, IndexBulkDeleteResult *istat)
return istat;
}
-/*
- * Returns the total required space for VACUUM's dead_items array given a
- * max_items value.
- */
-Size
-vac_max_items_to_alloc_size(int max_items)
-{
- Assert(max_items <= MAXDEADITEMS(MaxAllocSize));
-
- return offsetof(VacDeadItems, items) + sizeof(ItemPointerData) * max_items;
-}
-
/*
* vac_tid_reaped() -- is a particular tid deletable?
*
@@ -2357,60 +2344,7 @@ vac_max_items_to_alloc_size(int max_items)
static bool
vac_tid_reaped(ItemPointer itemptr, void *state)
{
- VacDeadItems *dead_items = (VacDeadItems *) state;
- int64 litem,
- ritem,
- item;
- ItemPointer res;
-
- litem = itemptr_encode(&dead_items->items[0]);
- ritem = itemptr_encode(&dead_items->items[dead_items->num_items - 1]);
- item = itemptr_encode(itemptr);
-
- /*
- * Doing a simple bound check before bsearch() is useful to avoid the
- * extra cost of bsearch(), especially if dead items on the heap are
- * concentrated in a certain range. Since this function is called for
- * every index tuple, it pays to be really fast.
- */
- if (item < litem || item > ritem)
- return false;
-
- res = (ItemPointer) bsearch((void *) itemptr,
- (void *) dead_items->items,
- dead_items->num_items,
- sizeof(ItemPointerData),
- vac_cmp_itemptr);
-
- return (res != NULL);
-}
-
-/*
- * Comparator routines for use with qsort() and bsearch().
- */
-static int
-vac_cmp_itemptr(const void *left, const void *right)
-{
- BlockNumber lblk,
- rblk;
- OffsetNumber loff,
- roff;
-
- lblk = ItemPointerGetBlockNumber((ItemPointer) left);
- rblk = ItemPointerGetBlockNumber((ItemPointer) right);
-
- if (lblk < rblk)
- return -1;
- if (lblk > rblk)
- return 1;
-
- loff = ItemPointerGetOffsetNumber((ItemPointer) left);
- roff = ItemPointerGetOffsetNumber((ItemPointer) right);
-
- if (loff < roff)
- return -1;
- if (loff > roff)
- return 1;
+ TIDStore *dead_items = (TIDStore *) state;
- return 0;
+ return tidstore_lookup_tid(dead_items, itemptr);
}
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index f26d796e52..08892c2196 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -44,7 +44,7 @@
* use small integers.
*/
#define PARALLEL_VACUUM_KEY_SHARED 1
-#define PARALLEL_VACUUM_KEY_DEAD_ITEMS 2
+#define PARALLEL_VACUUM_KEY_DSA 2
#define PARALLEL_VACUUM_KEY_QUERY_TEXT 3
#define PARALLEL_VACUUM_KEY_BUFFER_USAGE 4
#define PARALLEL_VACUUM_KEY_WAL_USAGE 5
@@ -103,6 +103,9 @@ typedef struct PVShared
/* Counter for vacuuming and cleanup */
pg_atomic_uint32 idx;
+
+ /* Handle of the shared TIDStore */
+ tidstore_handle dead_items_handle;
} PVShared;
/* Status used during parallel index vacuum or cleanup */
@@ -166,7 +169,7 @@ struct ParallelVacuumState
PVIndStats *indstats;
/* Shared dead items space among parallel vacuum workers */
- VacDeadItems *dead_items;
+ TIDStore *dead_items;
/* Points to buffer usage area in DSM */
BufferUsage *buffer_usage;
@@ -222,20 +225,22 @@ static void parallel_vacuum_error_callback(void *arg);
*/
ParallelVacuumState *
parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
- int nrequested_workers, int max_items,
- int elevel, BufferAccessStrategy bstrategy)
+ int nrequested_workers, int elevel,
+ BufferAccessStrategy bstrategy)
{
ParallelVacuumState *pvs;
ParallelContext *pcxt;
PVShared *shared;
- VacDeadItems *dead_items;
+ TIDStore *dead_items;
PVIndStats *indstats;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
+ void *area_space;
+ dsa_area *dead_items_dsa;
bool *will_parallel_vacuum;
Size est_indstats_len;
Size est_shared_len;
- Size est_dead_items_len;
+ Size dsa_minsize = dsa_minimum_size();
int nindexes_mwm = 0;
int parallel_workers = 0;
int querylen;
@@ -283,9 +288,8 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_estimate_chunk(&pcxt->estimator, est_shared_len);
shm_toc_estimate_keys(&pcxt->estimator, 1);
- /* Estimate size for dead_items -- PARALLEL_VACUUM_KEY_DEAD_ITEMS */
- est_dead_items_len = vac_max_items_to_alloc_size(max_items);
- shm_toc_estimate_chunk(&pcxt->estimator, est_dead_items_len);
+ /* Estimate size for dead tuple DSA -- PARALLEL_VACUUM_KEY_DSA */
+ shm_toc_estimate_chunk(&pcxt->estimator, dsa_minsize);
shm_toc_estimate_keys(&pcxt->estimator, 1);
/*
@@ -351,6 +355,15 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_INDEX_STATS, indstats);
pvs->indstats = indstats;
+ /* Prepare DSA space for dead items */
+ area_space = shm_toc_allocate(pcxt->toc, dsa_minsize);
+ shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_DSA, area_space);
+ dead_items_dsa = dsa_create_in_place(area_space, dsa_minsize,
+ LWTRANCHE_PARALLEL_VACUUM_DSA,
+ pcxt->seg);
+ dead_items = tidstore_create(dead_items_dsa);
+ pvs->dead_items = dead_items;
+
/* Prepare shared information */
shared = (PVShared *) shm_toc_allocate(pcxt->toc, est_shared_len);
MemSet(shared, 0, est_shared_len);
@@ -360,6 +373,7 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
(nindexes_mwm > 0) ?
maintenance_work_mem / Min(parallel_workers, nindexes_mwm) :
maintenance_work_mem;
+ shared->dead_items_handle = tidstore_get_handle(dead_items);
pg_atomic_init_u32(&(shared->cost_balance), 0);
pg_atomic_init_u32(&(shared->active_nworkers), 0);
@@ -368,15 +382,6 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_SHARED, shared);
pvs->shared = shared;
- /* Prepare the dead_items space */
- dead_items = (VacDeadItems *) shm_toc_allocate(pcxt->toc,
- est_dead_items_len);
- dead_items->max_items = max_items;
- dead_items->num_items = 0;
- MemSet(dead_items->items, 0, sizeof(ItemPointerData) * max_items);
- shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_DEAD_ITEMS, dead_items);
- pvs->dead_items = dead_items;
-
/*
* Allocate space for each worker's BufferUsage and WalUsage; no need to
* initialize
@@ -434,6 +439,8 @@ parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats)
istats[i] = NULL;
}
+ tidstore_free(pvs->dead_items);
+
DestroyParallelContext(pvs->pcxt);
ExitParallelMode();
@@ -442,7 +449,7 @@ parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats)
}
/* Returns the dead items space */
-VacDeadItems *
+TIDStore *
parallel_vacuum_get_dead_items(ParallelVacuumState *pvs)
{
return pvs->dead_items;
@@ -940,7 +947,9 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
Relation *indrels;
PVIndStats *indstats;
PVShared *shared;
- VacDeadItems *dead_items;
+ TIDStore *dead_items;
+ void *area_space;
+ dsa_area *dead_items_area;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
int nindexes;
@@ -984,10 +993,10 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
PARALLEL_VACUUM_KEY_INDEX_STATS,
false);
- /* Set dead_items space */
- dead_items = (VacDeadItems *) shm_toc_lookup(toc,
- PARALLEL_VACUUM_KEY_DEAD_ITEMS,
- false);
+ /* Set dead items */
+ area_space = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_DSA, false);
+ dead_items_area = dsa_attach_in_place(area_space, seg);
+ dead_items = tidstore_attach(dead_items_area, shared->dead_items_handle);
/* Set cost-based vacuum delay */
VacuumCostActive = (VacuumCostDelay > 0);
@@ -1033,6 +1042,9 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
&wal_usage[ParallelWorkerNumber]);
+ tidstore_detach(pvs.dead_items);
+ dsa_detach(dead_items_area);
+
/* Pop the error context stack */
error_context_stack = errcallback.previous;
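To make the DSA plumbing above easier to follow, the leader/worker handshake around the shared TIDStore boils down to the following condensed sketch (shm_toc size estimation, error handling, and the surrounding setup are omitted; variable names follow the patch):

    /* leader, in parallel_vacuum_init() */
    area_space = shm_toc_allocate(pcxt->toc, dsa_minimum_size());
    shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_DSA, area_space);
    dead_items_dsa = dsa_create_in_place(area_space, dsa_minimum_size(),
                                         LWTRANCHE_PARALLEL_VACUUM_DSA, pcxt->seg);
    dead_items = tidstore_create(dead_items_dsa);
    shared->dead_items_handle = tidstore_get_handle(dead_items);

    /* worker, in parallel_vacuum_main() */
    area_space = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_DSA, false);
    dead_items_area = dsa_attach_in_place(area_space, seg);
    dead_items = tidstore_attach(dead_items_area, shared->dead_items_handle);
    /* ... index bulk-deletion probes dead_items via tidstore_lookup_tid() ... */
    tidstore_detach(dead_items);
    dsa_detach(dead_items_area);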
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 532cd67f4e..d49a052b14 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -183,6 +183,8 @@ static const char *const BuiltinTrancheNames[] = {
"PgStatsHash",
/* LWTRANCHE_PGSTATS_DATA: */
"PgStatsData",
+ /* LWTRANCHE_PARALLEL_VACUUM_DSA: */
+ "ParallelVacuumDSA",
};
StaticAssertDecl(lengthof(BuiltinTrancheNames) ==
diff --git a/src/include/access/tidstore.h b/src/include/access/tidstore.h
new file mode 100644
index 0000000000..40b8021f9b
--- /dev/null
+++ b/src/include/access/tidstore.h
@@ -0,0 +1,55 @@
+/*-------------------------------------------------------------------------
+ *
+ * tidstore.h
+ * TID storage.
+ *
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/tidstore.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef TIDSTORE_H
+#define TIDSTORE_H
+
+#include "lib/radixtree.h"
+#include "storage/itemptr.h"
+
+typedef dsa_pointer tidstore_handle;
+
+typedef struct TIDStore TIDStore;
+
+typedef struct TIDStoreIter
+{
+ TIDStore *ts;
+
+ rt_iter *tree_iter;
+
+ bool finished;
+
+ uint64 next_key;
+ uint64 next_val;
+
+ BlockNumber blkno;
+ OffsetNumber offsets[MaxOffsetNumber]; /* XXX: usually not fully used */
+ int num_offsets;
+} TIDStoreIter;
+
+extern TIDStore *tidstore_create(dsa_area *dsa);
+extern TIDStore *tidstore_attach(dsa_area *dsa, dsa_pointer handle);
+extern void tidstore_detach(TIDStore *ts);
+extern void tidstore_free(TIDStore *ts);
+extern void tidstore_reset(TIDStore *ts);
+extern void tidstore_add_tids(TIDStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets);
+extern bool tidstore_lookup_tid(TIDStore *ts, ItemPointer tid);
+extern TIDStoreIter * tidstore_begin_iterate(TIDStore *ts);
+extern bool tidstore_iterate_next(TIDStoreIter *iter);
+extern uint64 tidstore_num_tids(TIDStore *ts);
+extern uint64 tidstore_memory_usage(TIDStore *ts);
+extern tidstore_handle tidstore_get_handle(TIDStore *ts);
+
+#endif /* TIDSTORE_H */
+
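To make the API above easier to review, here is a minimal usage sketch of a backend-local TIDStore (passing NULL for the dsa_area, as the serial vacuum path does); the block number and offsets below are made up for illustration:

    TIDStore     *ts = tidstore_create(NULL);
    OffsetNumber  offsets[] = {1, 2, 5};
    ItemPointerData tid;
    TIDStoreIter *iter;

    /* record dead item offsets for heap block 10 */
    tidstore_add_tids(ts, (BlockNumber) 10, offsets, lengthof(offsets));

    /* existence check, as vac_tid_reaped() now does */
    ItemPointerSet(&tid, 10, 2);
    if (tidstore_lookup_tid(ts, &tid))
        elog(NOTICE, "(10,2) is recorded as dead");

    /* iterate block by block, as lazy_vacuum_heap_rel() does */
    iter = tidstore_begin_iterate(ts);
    while (tidstore_iterate_next(iter))
    {
        /* use iter->blkno and iter->offsets[0 .. iter->num_offsets - 1] */
    }
    pfree(iter);

    tidstore_reset(ts);    /* forget all stored TIDs but keep the store */
    tidstore_free(ts);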
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 5d816ba7f4..d221528f16 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -17,6 +17,7 @@
#include "access/htup.h"
#include "access/genam.h"
#include "access/parallel.h"
+#include "access/tidstore.h"
#include "catalog/pg_class.h"
#include "catalog/pg_statistic.h"
#include "catalog/pg_type.h"
@@ -235,21 +236,6 @@ typedef struct VacuumParams
int nworkers;
} VacuumParams;
-/*
- * VacDeadItems stores TIDs whose index tuples are deleted by index vacuuming.
- */
-typedef struct VacDeadItems
-{
- int max_items; /* # slots allocated in array */
- int num_items; /* current # of entries */
-
- /* Sorted array of TIDs to delete from indexes */
- ItemPointerData items[FLEXIBLE_ARRAY_MEMBER];
-} VacDeadItems;
-
-#define MAXDEADITEMS(avail_mem) \
- (((avail_mem) - offsetof(VacDeadItems, items)) / sizeof(ItemPointerData))
-
/* GUC parameters */
extern PGDLLIMPORT int default_statistics_target; /* PGDLLIMPORT for PostGIS */
extern PGDLLIMPORT int vacuum_freeze_min_age;
@@ -306,18 +292,16 @@ extern Relation vacuum_open_relation(Oid relid, RangeVar *relation,
LOCKMODE lmode);
extern IndexBulkDeleteResult *vac_bulkdel_one_index(IndexVacuumInfo *ivinfo,
IndexBulkDeleteResult *istat,
- VacDeadItems *dead_items);
+ TIDStore *dead_items);
extern IndexBulkDeleteResult *vac_cleanup_one_index(IndexVacuumInfo *ivinfo,
IndexBulkDeleteResult *istat);
-extern Size vac_max_items_to_alloc_size(int max_items);
/* in commands/vacuumparallel.c */
extern ParallelVacuumState *parallel_vacuum_init(Relation rel, Relation *indrels,
int nindexes, int nrequested_workers,
- int max_items, int elevel,
- BufferAccessStrategy bstrategy);
+ int elevel, BufferAccessStrategy bstrategy);
extern void parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats);
-extern VacDeadItems *parallel_vacuum_get_dead_items(ParallelVacuumState *pvs);
+extern TIDStore *parallel_vacuum_get_dead_items(ParallelVacuumState *pvs);
extern void parallel_vacuum_bulkdel_all_indexes(ParallelVacuumState *pvs,
long num_table_tuples,
int num_index_scans);
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index ca4eca76f4..0999e4fc10 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -193,6 +193,7 @@ typedef enum BuiltinTrancheIds
LWTRANCHE_PGSTATS_DSA,
LWTRANCHE_PGSTATS_HASH,
LWTRANCHE_PGSTATS_DATA,
+ LWTRANCHE_PARALLEL_VACUUM_DSA,
LWTRANCHE_FIRST_USER_DEFINED
} BuiltinTrancheIds;
--
2.31.1
Attachment: v9-0002-Add-radix-implementation.patch
From ac437b4d40cd0e61258fb411e659ddd87de08a1e Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 14 Sep 2022 12:38:51 +0000
Subject: [PATCH v9 2/6] Add radix implementation.
---
src/backend/lib/Makefile | 1 +
src/backend/lib/meson.build | 1 +
src/backend/lib/radixtree.c | 2404 +++++++++++++++++
src/include/lib/radixtree.h | 42 +
src/test/modules/Makefile | 1 +
src/test/modules/meson.build | 1 +
src/test/modules/test_radixtree/.gitignore | 4 +
src/test/modules/test_radixtree/Makefile | 23 +
src/test/modules/test_radixtree/README | 7 +
.../expected/test_radixtree.out | 28 +
src/test/modules/test_radixtree/meson.build | 34 +
.../test_radixtree/sql/test_radixtree.sql | 7 +
.../test_radixtree/test_radixtree--1.0.sql | 8 +
.../modules/test_radixtree/test_radixtree.c | 504 ++++
.../test_radixtree/test_radixtree.control | 4 +
15 files changed, 3069 insertions(+)
create mode 100644 src/backend/lib/radixtree.c
create mode 100644 src/include/lib/radixtree.h
create mode 100644 src/test/modules/test_radixtree/.gitignore
create mode 100644 src/test/modules/test_radixtree/Makefile
create mode 100644 src/test/modules/test_radixtree/README
create mode 100644 src/test/modules/test_radixtree/expected/test_radixtree.out
create mode 100644 src/test/modules/test_radixtree/meson.build
create mode 100644 src/test/modules/test_radixtree/sql/test_radixtree.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree--1.0.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree.c
create mode 100644 src/test/modules/test_radixtree/test_radixtree.control
diff --git a/src/backend/lib/Makefile b/src/backend/lib/Makefile
index 9dad31398a..4c1db794b6 100644
--- a/src/backend/lib/Makefile
+++ b/src/backend/lib/Makefile
@@ -22,6 +22,7 @@ OBJS = \
integerset.o \
knapsack.o \
pairingheap.o \
+ radixtree.o \
rbtree.o \
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/lib/meson.build b/src/backend/lib/meson.build
index 48da1bddce..4303d306cd 100644
--- a/src/backend/lib/meson.build
+++ b/src/backend/lib/meson.build
@@ -9,4 +9,5 @@ backend_sources += files(
'knapsack.c',
'pairingheap.c',
'rbtree.c',
+ 'radixtree.c',
)
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
new file mode 100644
index 0000000000..bd58b2bfad
--- /dev/null
+++ b/src/backend/lib/radixtree.c
@@ -0,0 +1,2404 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.c
+ * Implementation for adaptive radix tree.
+ *
+ * This module employs the idea from the paper "The Adaptive Radix Tree: ARTful
+ * Indexing for Main-Memory Databases" by Viktor Leis, Alfons Kemper, and Thomas
+ * Neumann, 2013. The radix tree uses adaptive node sizes: a small number of node
+ * types, each with a different number of elements. Depending on the number of
+ * children, the appropriate node type is used.
+ *
+ * There are some differences from the proposed implementation. For instance,
+ * there is no support for path compression or lazy path expansion. The radix
+ * tree supports only fixed-length keys, so we don't expect the tree to become
+ * very high.
+ *
+ * Both the key and the value are 64-bit unsigned integers. The inner nodes and
+ * the leaf nodes have slightly different structures: inner nodes (shift > 0)
+ * store pointers to their child nodes as values, whereas leaf nodes
+ * (shift == 0) store the 64-bit unsigned integers specified by the user as
+ * values. The paper refers to this technique as "Multi-value leaves". We
+ * choose it to avoid an additional pointer traversal. It is also the reason this
+ * code currently does not support variable-length keys.
+ *
+ * XXX: Most functions in this file have two variants, one for inner nodes and
+ * one for leaf nodes, so there is some code duplication. While this sometimes
+ * makes code maintenance tricky, it reduces branch prediction misses when
+ * judging whether a node is an inner node or a leaf node.
+ *
+ * XXX: radix tree nodes are never shrunk.
+ *
+ * Interface
+ * ---------
+ *
+ * rt_create - Create a new, empty radix tree
+ * rt_free - Free the radix tree
+ * rt_search - Search a key-value pair
+ * rt_set - Set a key-value pair
+ * rt_delete - Delete a key-value pair
+ * rt_begin_iterate - Begin iterating through all key-value pairs
+ * rt_iterate_next - Return next key-value pair, if any
+ * rt_end_iter - End iteration
+ * rt_memory_usage - Get the memory usage
+ * rt_num_entries - Get the number of key-value pairs
+ *
+ * rt_create() creates an empty radix tree in the given memory context, along
+ * with memory contexts for each kind of radix tree node under it.
+ *
+ * rt_iterate_next() returns key-value pairs in ascending order of the key.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/lib/radixtree.c
+ *
+ *-------------------------------------------------------------------------
+ */
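/*
 * A minimal usage sketch of the interface summarized above. The prototypes
 * are declared in radixtree.h; their exact signatures are assumed here rather
 * than quoted:
 *
 *    radix_tree *tree = rt_create(CurrentMemoryContext);
 *    uint64      value;
 *
 *    rt_set(tree, UINT64CONST(42), UINT64CONST(4200));
 *    if (rt_search(tree, UINT64CONST(42), &value))
 *        Assert(value == 4200);
 *    rt_delete(tree, UINT64CONST(42));
 *    rt_free(tree);
 */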
+
+#include "postgres.h"
+
+#include "lib/radixtree.h"
+#include "lib/stringinfo.h"
+#include "miscadmin.h"
+#include "port/pg_bitutils.h"
+#include "port/pg_lfind.h"
+#include "utils/memutils.h"
+
+/* The number of bits encoded in one tree level */
+#define RT_NODE_SPAN BITS_PER_BYTE
+
+/* The maximum number of slots in a node */
+#define RT_NODE_MAX_SLOTS (1 << RT_NODE_SPAN)
+
+/*
+ * Return the number of bytes in the is-set bitmap needed to track nslots
+ * slots, used by node kinds that keep such a bitmap.
+ */
+#define RT_NODE_NSLOTS_BITS(nslots) ((nslots) / (sizeof(uint8) * BITS_PER_BYTE))
+
+/* Mask for extracting a chunk from the key */
+#define RT_CHUNK_MASK ((1 << RT_NODE_SPAN) - 1)
+
+/* Maximum shift the radix tree uses */
+#define RT_MAX_SHIFT key_get_shift(UINT64_MAX)
+
+/* Tree level the radix tree uses */
+#define RT_MAX_LEVEL ((sizeof(uint64) * BITS_PER_BYTE) / RT_NODE_SPAN)
+
+/* Invalid index used in node-128 */
+#define RT_NODE_128_INVALID_IDX 0xFF
+
+/* Get a chunk from the key */
+#define RT_GET_KEY_CHUNK(key, shift) ((uint8) (((key) >> (shift)) & RT_CHUNK_MASK))
+
+/*
+ * Map a slot (or chunk) number to its byte and bit position in the is-set bitmap.
+ */
+#define RT_NODE_BITMAP_BYTE(v) ((v) / BITS_PER_BYTE)
+#define RT_NODE_BITMAP_BIT(v) (UINT64CONST(1) << ((v) % RT_NODE_SPAN))
+
+/* Enum used by rt_node_search_inner() and rt_node_search_leaf() */
+typedef enum
+{
+ RT_ACTION_FIND = 0, /* find the key-value */
+ RT_ACTION_DELETE, /* delete the key-value */
+} rt_action;
+
+/*
+ * Supported radix tree node kinds.
+ *
+ * XXX: These are currently not well chosen. To reduce memory fragmentation,
+ * a smaller class should optimally fit neatly into the next larger class
+ * (except perhaps at the lowest end). Right now it's
+ * 40/40 -> 296/286 -> 1288/1304 -> 2056/2088 bytes for inner nodes and
+ * leaf nodes, respectively, leading to a large amount of allocator padding
+ * with aset.c. Hence the use of slab.
+ *
+ * XXX: do we need node-1 as long as there is no path compression optimization?
+ *
+ * XXX: need to explain why we choose these node types based on benchmark
+ * results etc.
+ */
+#define RT_NODE_KIND_4 0x00
+#define RT_NODE_KIND_32 0x01
+#define RT_NODE_KIND_128 0x02
+#define RT_NODE_KIND_256 0x03
+#define RT_NODE_KIND_COUNT 4
+
+/* Common header for all node kinds */
+typedef struct rt_node
+{
+ /*
+ * Number of children. We use uint16 to be able to represent up to 256
+ * children, the maximum fanout with an 8-bit span.
+ */
+ uint16 count;
+
+ /*
+ * Shift indicates which part of the key space is represented by this
+ * node. That is, the key is shifted by 'shift' and the lowest
+ * RT_NODE_SPAN bits are then represented in chunk.
+ */
+ uint8 shift;
+ uint8 chunk;
+
+ /* Size kind of the node */
+ uint8 kind;
+} rt_node;
+#define NODE_IS_LEAF(n) (((rt_node *) (n))->shift == 0)
+#define NODE_IS_EMPTY(n) (((rt_node *) (n))->count == 0)
+#define NODE_HAS_FREE_SLOT(n) \
+ (((rt_node *) (n))->count < rt_node_kind_info[((rt_node *) (n))->kind].fanout)
+
+/* Base types for each node kind, shared by inner and leaf nodes */
+typedef struct rt_node_base_4
+{
+ rt_node n;
+
+ /* 4 children, for key chunks */
+ uint8 chunks[4];
+} rt_node_base_4;
+
+typedef struct rt_node_base32
+{
+ rt_node n;
+
+ /* 32 children, for key chunks */
+ uint8 chunks[32];
+} rt_node_base_32;
+
+/*
+ * node-128 uses the slot_idxs array, with RT_NODE_MAX_SLOTS (typically 256)
+ * entries, to store indexes into a second array that contains up to 128 values
+ * (or child pointers in inner nodes).
+ */
+typedef struct rt_node_base128
+{
+ rt_node n;
+
+ /* Index into the slots array for each possible chunk */
+ uint8 slot_idxs[RT_NODE_MAX_SLOTS];
+} rt_node_base_128;
+
+typedef struct rt_node_base256
+{
+ rt_node n;
+} rt_node_base_256;
+
+/*
+ * Inner and leaf nodes.
+ *
+ * There are separate from inner node size classes for two main reasons:
+ *
+ * 1) the value type might be different than something fitting into a pointer
+ * width type
+ * 2) Need to represent non-existing values in a key-type independent way.
+ *
+ * 1) is clearly worth being concerned about, but it's not clear 2) is as
+ * good. It might be better to just indicate non-existing entries the same way
+ * in inner nodes.
+ */
+typedef struct rt_node_inner_4
+{
+ rt_node_base_4 base;
+
+ /* 4 children, for key chunks */
+ rt_node *children[4];
+} rt_node_inner_4;
+
+typedef struct rt_node_leaf_4
+{
+ rt_node_base_4 base;
+
+ /* 4 values, for key chunks */
+ uint64 values[4];
+} rt_node_leaf_4;
+
+typedef struct rt_node_inner_32
+{
+ rt_node_base_32 base;
+
+ /* 32 children, for key chunks */
+ rt_node *children[32];
+} rt_node_inner_32;
+
+typedef struct rt_node_leaf_32
+{
+ rt_node_base_32 base;
+
+ /* 32 values, for key chunks */
+ uint64 values[32];
+} rt_node_leaf_32;
+
+typedef struct rt_node_inner_128
+{
+ rt_node_base_128 base;
+
+ /* Slots for 128 children */
+ rt_node *children[128];
+} rt_node_inner_128;
+
+typedef struct rt_node_leaf_128
+{
+ rt_node_base_128 base;
+
+ /* isset is a bitmap to track which slot is in use */
+ uint8 isset[RT_NODE_NSLOTS_BITS(128)];
+
+ /* Slots for 128 values */
+ uint64 values[128];
+} rt_node_leaf_128;
+
+/*
+ * node-256 is the largest node type. This node has an array of
+ * RT_NODE_MAX_SLOTS length for directly storing values (or child pointers in
+ * inner nodes).
+ */
+typedef struct rt_node_inner_256
+{
+ rt_node_base_256 base;
+
+ /* Slots for 256 children */
+ rt_node *children[RT_NODE_MAX_SLOTS];
+} rt_node_inner_256;
+
+typedef struct rt_node_leaf_256
+{
+ rt_node_base_256 base;
+
+ /* isset is a bitmap to track which slot is in use */
+ uint8 isset[RT_NODE_NSLOTS_BITS(RT_NODE_MAX_SLOTS)];
+
+ /* Slots for 256 values */
+ uint64 values[RT_NODE_MAX_SLOTS];
+} rt_node_leaf_256;
+
+/* Information about each node kind */
+typedef struct rt_node_kind_info_elem
+{
+ const char *name;
+ int fanout;
+
+ /* slab chunk size */
+ Size inner_size;
+ Size leaf_size;
+
+ /* slab block size */
+ Size inner_blocksize;
+ Size leaf_blocksize;
+} rt_node_kind_info_elem;
+
+/*
+ * Calculate the slab blocksize so that we can allocate at least 32 chunks
+ * from the block.
+ */
+#define NODE_SLAB_BLOCK_SIZE(size) \
+ Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * (size), (size) * 32)
+static rt_node_kind_info_elem rt_node_kind_info[RT_NODE_KIND_COUNT] = {
+
+ [RT_NODE_KIND_4] = {
+ .name = "radix tree node 4",
+ .fanout = 4,
+ .inner_size = sizeof(rt_node_inner_4),
+ .leaf_size = sizeof(rt_node_leaf_4),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_4)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_4)),
+ },
+ [RT_NODE_KIND_32] = {
+ .name = "radix tree node 32",
+ .fanout = 32,
+ .inner_size = sizeof(rt_node_inner_32),
+ .leaf_size = sizeof(rt_node_leaf_32),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_32)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_32)),
+ },
+ [RT_NODE_KIND_128] = {
+ .name = "radix tree node 128",
+ .fanout = 128,
+ .inner_size = sizeof(rt_node_inner_128),
+ .leaf_size = sizeof(rt_node_leaf_128),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_128)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_128)),
+ },
+ [RT_NODE_KIND_256] = {
+ .name = "radix tree node 256",
+ .fanout = 256,
+ .inner_size = sizeof(rt_node_inner_256),
+ .leaf_size = sizeof(rt_node_leaf_256),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_256)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_256)),
+ },
+};
+
+/*
+ * Iteration support.
+ *
+ * Iterating over the radix tree returns each key-value pair in ascending
+ * order of the key. To support this, we iterate over the nodes at each level.
+ *
+ * rt_node_iter struct is used to track the iteration within a node.
+ *
+ * rt_iter is the struct for iteration of the radix tree, and uses rt_node_iter
+ * in order to track the iteration of each level. During the iteration, we also
+ * construct the key whenever updating the node iteration information, e.g., when
+ * advancing the current index within the node or when moving to the next node
+ * at the same level.
+ */
+typedef struct rt_node_iter
+{
+ rt_node *node; /* current node being iterated */
+ int current_idx; /* current position. -1 for initial value */
+} rt_node_iter;
+
+struct rt_iter
+{
+ radix_tree *tree;
+
+ /* Track the iteration on nodes of each level */
+ rt_node_iter stack[RT_MAX_LEVEL];
+ int stack_len;
+
+ /* The key is being constructed during the iteration */
+ uint64 key;
+};
+
+/* A radix tree with nodes */
+struct radix_tree
+{
+ MemoryContext context;
+
+ rt_node *root;
+ uint64 max_val;
+ uint64 num_keys;
+
+ MemoryContextData *inner_slabs[RT_NODE_KIND_COUNT];
+ MemoryContextData *leaf_slabs[RT_NODE_KIND_COUNT];
+
+ /* statistics */
+#ifdef RT_DEBUG
+ int32 cnt[RT_NODE_KIND_COUNT];
+#endif
+};
+
+static void rt_new_root(radix_tree *tree, uint64 key);
+static rt_node *rt_alloc_node(radix_tree *tree, int kind, uint8 shift, uint8 chunk,
+ bool inner);
+static void rt_free_node(radix_tree *tree, rt_node *node);
+static void rt_extend(radix_tree *tree, uint64 key);
+static inline bool rt_node_search_inner(rt_node *node, uint64 key, rt_action action,
+ rt_node **child_p);
+static inline bool rt_node_search_leaf(rt_node *node, uint64 key, rt_action action,
+ uint64 *value_p);
+static bool rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node,
+ uint64 key, rt_node *child);
+static bool rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
+ uint64 key, uint64 value);
+static inline rt_node *rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter);
+static inline bool rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
+ uint64 *value_p);
+static void rt_update_iter_stack(rt_iter *iter, rt_node *from_node, int from);
+static inline void rt_iter_update_key(rt_iter *iter, uint8 chunk, uint8 shift);
+
+/* verification (available only with assertion) */
+static void rt_verify_node(rt_node *node);
+
+/*
+ * Return the index of the first element in 'node' whose chunk equals 'chunk'.
+ * Return -1 if there is no such element.
+ */
+static inline int
+node_4_search_eq(rt_node_base_4 *node, uint8 chunk)
+{
+ int idx = -1;
+
+ for (int i = 0; i < node->n.count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ idx = i;
+ break;
+ }
+ }
+
+ return idx;
+}
+
+/*
+ * Return the index at which 'chunk' should be inserted into the chunks array of the given node.
+ */
+static inline int
+node_4_get_insertpos(rt_node_base_4 *node, uint8 chunk)
+{
+ int idx;
+
+ for (idx = 0; idx < node->n.count; idx++)
+ {
+ if (node->chunks[idx] >= chunk)
+ break;
+ }
+
+ return idx;
+}
+
+/*
+ * Return the index of the first element in 'node' whose chunk equals 'chunk'.
+ * Return -1 if there is no such element.
+ */
+static inline int
+node_32_search_eq(rt_node_base_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ uint32 bitfield;
+ int index_simd = -1;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index = -1;
+
+ for (int i = 0; i < count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ index = i;
+ break;
+ }
+ }
+#endif
+
+#ifndef USE_NO_SIMD
+ spread_chunk = vector8_broadcast(chunk);
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ cmp1 = vector8_eq(spread_chunk, haystack1);
+ cmp2 = vector8_eq(spread_chunk, haystack2);
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+/*
+ * Return the index at which 'chunk' should be inserted into the chunks array of the given node.
+ */
+static inline int
+node_32_get_insertpos(rt_node_base_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ Vector8 min1;
+ Vector8 min2;
+ uint32 bitfield;
+ int index_simd;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index;
+
+ for (index = 0; index < count; index++)
+ {
+ if (node->chunks[index] >= chunk)
+ break;
+ }
+#endif
+
+#ifndef USE_NO_SIMD
+ spread_chunk = vector8_broadcast(chunk);
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ min1 = vector8_min(spread_chunk, haystack1);
+ min2 = vector8_min(spread_chunk, haystack2);
+ cmp1 = vector8_eq(spread_chunk, min1);
+ cmp2 = vector8_eq(spread_chunk, min2);
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+ else
+ index_simd = count;
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+/*
+ * Functions to manipulate both chunks array and children/values array.
+ * These are used for node-4 and node-32.
+ */
+
+/* Shift the elements right at 'idx' by one */
+static inline void
+chunk_children_array_shift(uint8 *chunks, rt_node **children, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+ memmove(&(children[idx + 1]), &(children[idx]), sizeof(rt_node *) * (count - idx));
+}
+
+static inline void
+chunk_values_array_shift(uint8 *chunks, uint64 *values, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+ memmove(&(values[idx + 1]), &(values[idx]), sizeof(uint64) * (count - idx));
+}
+
+/* Delete the element at 'idx' */
+static inline void
+chunk_children_array_delete(uint8 *chunks, rt_node **children, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(children[idx]), &(children[idx + 1]), sizeof(rt_node *) * (count - idx - 1));
+}
+
+static inline void
+chunk_values_array_delete(uint8 *chunks, uint64 *values, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(values[idx]), &(values[idx + 1]), sizeof(uint64) * (count - idx - 1));
+}
+
+/* Copy both chunks and children/values arrays */
+static inline void
+chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
+ uint8 *dst_chunks, rt_node **dst_children, int count)
+{
+ /* For better code generation */
+ if (count > rt_node_kind_info[RT_NODE_KIND_4].fanout)
+ pg_unreachable();
+
+ memcpy(dst_chunks, src_chunks, sizeof(uint8) * count);
+ memcpy(dst_children, src_children, sizeof(rt_node *) * count);
+}
+
+static inline void
+chunk_values_array_copy(uint8 *src_chunks, uint64 *src_values,
+ uint8 *dst_chunks, uint64 *dst_values, int count)
+{
+ /* For better code generation */
+ if (count > rt_node_kind_info[RT_NODE_KIND_4].fanout)
+ pg_unreachable();
+
+ memcpy(dst_chunks, src_chunks, sizeof(uint8) * count);
+ memcpy(dst_values, src_values, sizeof(uint64) * count);
+}
+
+/* Functions to manipulate inner and leaf node-128 */
+
+/* Does the given chunk in the node have a value? */
+static inline bool
+node_128_is_chunk_used(rt_node_base_128 *node, uint8 chunk)
+{
+ return node->slot_idxs[chunk] != RT_NODE_128_INVALID_IDX;
+}
+
+/* Is the slot in the node used? */
+static inline bool
+node_inner_128_is_slot_used(rt_node_inner_128 *node, uint8 slot)
+{
+ Assert(!NODE_IS_LEAF(node));
+ return (node->children[slot] != NULL);
+}
+
+static inline bool
+node_leaf_128_is_slot_used(rt_node_leaf_128 *node, uint8 slot)
+{
+ Assert(NODE_IS_LEAF(node));
+ return ((node->isset[RT_NODE_BITMAP_BYTE(slot)] & RT_NODE_BITMAP_BIT(slot)) != 0);
+}
+
+static inline rt_node *
+node_inner_128_get_child(rt_node_inner_128 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ return node->children[node->base.slot_idxs[chunk]];
+}
+
+static inline uint64
+node_leaf_128_get_value(rt_node_leaf_128 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ Assert(((rt_node_base_128 *) node)->slot_idxs[chunk] != RT_NODE_128_INVALID_IDX);
+ return node->values[node->base.slot_idxs[chunk]];
+}
+
+static void
+node_inner_128_delete(rt_node_inner_128 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->base.slot_idxs[chunk] = RT_NODE_128_INVALID_IDX;
+}
+
+static void
+node_leaf_128_delete(rt_node_leaf_128 *node, uint8 chunk)
+{
+ int slotpos = node->base.slot_idxs[chunk];
+
+ Assert(NODE_IS_LEAF(node));
+ node->isset[RT_NODE_BITMAP_BYTE(slotpos)] &= ~(RT_NODE_BITMAP_BIT(slotpos));
+ node->base.slot_idxs[chunk] = RT_NODE_128_INVALID_IDX;
+}
+
+/* Return an unused slot in node-128 */
+static int
+node_inner_128_find_unused_slot(rt_node_inner_128 *node, uint8 chunk)
+{
+ int slotpos = 0;
+
+ Assert(!NODE_IS_LEAF(node));
+ while (node_inner_128_is_slot_used(node, slotpos))
+ slotpos++;
+
+ return slotpos;
+}
+
+static int
+node_leaf_128_find_unused_slot(rt_node_leaf_128 *node, uint8 chunk)
+{
+ int slotpos;
+
+ Assert(NODE_IS_LEAF(node));
+
+ /* We iterate over the isset bitmap per byte then check each bit */
+ for (slotpos = 0; slotpos < RT_NODE_NSLOTS_BITS(128); slotpos++)
+ {
+ if (node->isset[slotpos] < 0xFF)
+ break;
+ }
+ Assert(slotpos < RT_NODE_NSLOTS_BITS(128));
+
+ slotpos *= BITS_PER_BYTE;
+ while (node_leaf_128_is_slot_used(node, slotpos))
+ slotpos++;
+
+ return slotpos;
+}
+
+static inline void
+node_inner_128_insert(rt_node_inner_128 *node, uint8 chunk, rt_node *child)
+{
+ int slotpos;
+
+ Assert(!NODE_IS_LEAF(node));
+
+ /* find unused slot */
+ slotpos = node_inner_128_find_unused_slot(node, chunk);
+
+ node->base.slot_idxs[chunk] = slotpos;
+ node->children[slotpos] = child;
+}
+
+/* Set the slot at the corresponding chunk */
+static inline void
+node_leaf_128_insert(rt_node_leaf_128 *node, uint8 chunk, uint64 value)
+{
+ int slotpos;
+
+ Assert(NODE_IS_LEAF(node));
+
+ /* find unused slot */
+ slotpos = node_leaf_128_find_unused_slot(node, chunk);
+
+ node->base.slot_idxs[chunk] = slotpos;
+ node->isset[RT_NODE_BITMAP_BYTE(slotpos)] |= RT_NODE_BITMAP_BIT(slotpos);
+ node->values[slotpos] = value;
+}
+
+/* Update the child corresponding to 'chunk' to 'child' */
+static inline void
+node_inner_128_update(rt_node_inner_128 *node, uint8 chunk, rt_node *child)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->children[node->base.slot_idxs[chunk]] = child;
+}
+
+static inline void
+node_leaf_128_update(rt_node_leaf_128 *node, uint8 chunk, uint64 value)
+{
+ Assert(NODE_IS_LEAF(node));
+ node->values[node->base.slot_idxs[chunk]] = value;
+}
+
+/* Functions to manipulate inner and leaf node-256 */
+
+/* Return true if the slot corresponding to the given chunk is in use */
+static inline bool
+node_inner_256_is_chunk_used(rt_node_inner_256 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ return (node->children[chunk] != NULL);
+}
+
+static inline bool
+node_leaf_256_is_chunk_used(rt_node_leaf_256 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ return (node->isset[RT_NODE_BITMAP_BYTE(chunk)] & RT_NODE_BITMAP_BIT(chunk)) != 0;
+}
+
+static inline rt_node *
+node_inner_256_get_child(rt_node_inner_256 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ Assert(node_inner_256_is_chunk_used(node, chunk));
+ return node->children[chunk];
+}
+
+static inline uint64
+node_leaf_256_get_value(rt_node_leaf_256 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ Assert(node_leaf_256_is_chunk_used(node, chunk));
+ return node->values[chunk];
+}
+
+/* Set the child in the node-256 */
+static inline void
+node_inner_256_set(rt_node_inner_256 *node, uint8 chunk, rt_node *child)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->children[chunk] = child;
+}
+
+/* Set the value in the node-256 */
+static inline void
+node_leaf_256_set(rt_node_leaf_256 *node, uint8 chunk, uint64 value)
+{
+ Assert(NODE_IS_LEAF(node));
+ node->isset[RT_NODE_BITMAP_BYTE(chunk)] |= RT_NODE_BITMAP_BIT(chunk);
+ node->values[chunk] = value;
+}
+
+/* Delete the slot at the given chunk position */
+static inline void
+node_inner_256_delete(rt_node_inner_256 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->children[chunk] = NULL;
+}
+
+static inline void
+node_leaf_256_delete(rt_node_leaf_256 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ node->isset[RT_NODE_BITMAP_BYTE(chunk)] &= ~(RT_NODE_BITMAP_BIT(chunk));
+}
+
+/*
+ * Return the shift needed to store the given key.
+ */
+static inline int
+key_get_shift(uint64 key)
+{
+ return (key == 0)
+ ? 0
+ : (pg_leftmost_one_pos64(key) / RT_NODE_SPAN) * RT_NODE_SPAN;
+}
+
+/*
+ * Return the maximum key value that can be stored under a node with the given shift.
+ */
+static uint64
+shift_get_max_val(int shift)
+{
+ if (shift == RT_MAX_SHIFT)
+ return UINT64_MAX;
+
+ return (UINT64CONST(1) << (shift + RT_NODE_SPAN)) - 1;
+}
+
+/*
+ * Create a new node as the root. Subordinate nodes will be created during
+ * the insertion.
+ */
+static void
+rt_new_root(radix_tree *tree, uint64 key)
+{
+ int shift = key_get_shift(key);
+ rt_node *node;
+
+ node = (rt_node *) rt_alloc_node(tree, RT_NODE_KIND_4, shift, 0,
+ shift > 0);
+ tree->max_val = shift_get_max_val(shift);
+ tree->root = node;
+}
+
+/*
+ * Allocate a new node with the given node kind.
+ */
+static rt_node *
+rt_alloc_node(radix_tree *tree, int kind, uint8 shift, uint8 chunk, bool inner)
+{
+ rt_node *newnode;
+
+ if (inner)
+ newnode = (rt_node *) MemoryContextAllocZero(tree->inner_slabs[kind],
+ rt_node_kind_info[kind].inner_size);
+ else
+ newnode = (rt_node *) MemoryContextAllocZero(tree->leaf_slabs[kind],
+ rt_node_kind_info[kind].leaf_size);
+
+ newnode->kind = kind;
+ newnode->shift = shift;
+ newnode->chunk = chunk;
+
+ /* Initialize slot_idxs to invalid values */
+ if (kind == RT_NODE_KIND_128)
+ {
+ rt_node_base_128 *n128 = (rt_node_base_128 *) newnode;
+
+ memset(n128->slot_idxs, RT_NODE_128_INVALID_IDX, sizeof(n128->slot_idxs));
+ }
+
+#ifdef RT_DEBUG
+ /* update the statistics */
+ tree->cnt[kind]++;
+#endif
+
+ return newnode;
+}
+
+/*
+ * Create a new node of 'new_kind' with the same shift, chunk, and
+ * count as 'node'.
+ */
+static rt_node *
+rt_copy_node(radix_tree *tree, rt_node *node, int new_kind)
+{
+ rt_node *newnode;
+
+ newnode = rt_alloc_node(tree, new_kind, node->shift, node->chunk,
+ node->shift > 0);
+ newnode->count = node->count;
+
+ return newnode;
+}
+
+/* Free the given node */
+static void
+rt_free_node(radix_tree *tree, rt_node *node)
+{
+ /* If we're deleting the root node, make the tree empty */
+ if (tree->root == node)
+ tree->root = NULL;
+
+#ifdef RT_DEBUG
+ /* update the statistics */
+ tree->cnt[node->kind]--;
+ Assert(tree->cnt[node->kind] >= 0);
+#endif
+
+ pfree(node);
+}
+
+/*
+ * Replace old_child with new_child, and free the old one.
+ */
+static void
+rt_replace_node(radix_tree *tree, rt_node *parent, rt_node *old_child,
+ rt_node *new_child, uint64 key)
+{
+ Assert(old_child->chunk == new_child->chunk);
+ Assert(old_child->shift == new_child->shift);
+
+ if (parent == old_child)
+ {
+ /* Replace the root node with the new large node */
+ tree->root = new_child;
+ }
+ else
+ {
+ bool replaced PG_USED_FOR_ASSERTS_ONLY;
+
+ replaced = rt_node_insert_inner(tree, NULL, parent, key, new_child);
+ Assert(replaced);
+ }
+
+ rt_free_node(tree, old_child);
+}
+
+/*
+ * The radix tree doesn't have sufficient height. Extend the radix tree so it can
+ * store the key.
+ */
+static void
+rt_extend(radix_tree *tree, uint64 key)
+{
+ int target_shift;
+ int shift = tree->root->shift + RT_NODE_SPAN;
+
+ target_shift = key_get_shift(key);
+
+ /* Grow tree from 'shift' to 'target_shift' */
+ while (shift <= target_shift)
+ {
+ rt_node_inner_4 *node;
+
+ node = (rt_node_inner_4 *) rt_alloc_node(tree, RT_NODE_KIND_4,
+ shift, 0, true);
+ node->base.n.count = 1;
+ node->base.chunks[0] = 0;
+ node->children[0] = tree->root;
+
+ tree->root->chunk = 0;
+ tree->root = (rt_node *) node;
+
+ shift += RT_NODE_SPAN;
+ }
+
+ tree->max_val = shift_get_max_val(target_shift);
+}
+
+/*
+ * The radix tree doesn't have inner and leaf nodes for the given key-value pair.
+ * Insert inner and leaf nodes from 'node' down to the bottom.
+ */
+static inline void
+rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node *parent,
+ rt_node *node)
+{
+ int shift = node->shift;
+
+ while (shift >= RT_NODE_SPAN)
+ {
+ rt_node *newchild;
+ int newshift = shift - RT_NODE_SPAN;
+
+ newchild = rt_alloc_node(tree, RT_NODE_KIND_4, newshift,
+ RT_GET_KEY_CHUNK(key, node->shift),
+ newshift > 0);
+ rt_node_insert_inner(tree, parent, node, key, newchild);
+
+ parent = node;
+ node = newchild;
+ shift -= RT_NODE_SPAN;
+ }
+
+ rt_node_insert_leaf(tree, parent, node, key, value);
+ tree->num_keys++;
+}
+
+/*
+ * Search for the child pointer corresponding to 'key' in the given node, and
+ * do the specified 'action'.
+ *
+ * Return true if the key is found, otherwise return false. On success, the
+ * child pointer is returned in *child_p.
+ */
+static inline bool
+rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **child_p)
+{
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool found = false;
+ rt_node *child = NULL;
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+ int idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
+
+ if (idx < 0)
+ break;
+
+ found = true;
+
+ if (action == RT_ACTION_FIND)
+ child = n4->children[idx];
+ else /* RT_ACTION_DELETE */
+ chunk_children_array_delete(n4->base.chunks, n4->children,
+ n4->base.n.count, idx);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+ int idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
+
+ if (idx < 0)
+ break;
+
+ found = true;
+ if (action == RT_ACTION_FIND)
+ child = n32->children[idx];
+ else /* RT_ACTION_DELETE */
+ chunk_children_array_delete(n32->base.chunks, n32->children,
+ n32->base.n.count, idx);
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ rt_node_inner_128 *n128 = (rt_node_inner_128 *) node;
+
+ if (!node_128_is_chunk_used((rt_node_base_128 *) n128, chunk))
+ break;
+
+ found = true;
+
+ if (action == RT_ACTION_FIND)
+ child = node_inner_128_get_child(n128, chunk);
+ else /* RT_ACTION_DELETE */
+ node_inner_128_delete(n128, chunk);
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+
+ if (!node_inner_256_is_chunk_used(n256, chunk))
+ break;
+
+ found = true;
+ if (action == RT_ACTION_FIND)
+ child = node_inner_256_get_child(n256, chunk);
+ else /* RT_ACTION_DELETE */
+ node_inner_256_delete(n256, chunk);
+
+ break;
+ }
+ }
+
+ /* update statistics */
+ if (action == RT_ACTION_DELETE && found)
+ node->count--;
+
+ if (found && child_p)
+ *child_p = child;
+
+ return found;
+}
+
+/*
+ * Search for the value corresponding to 'key' in the given node, and do the
+ * specified 'action'.
+ *
+ * Return true if the key is found, otherwise return false. On success, the
+ * value is returned in *value_p.
+ */
+static inline bool
+rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p)
+{
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool found = false;
+ uint64 value = 0;
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+ int idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
+
+ if (idx < 0)
+ break;
+
+ found = true;
+
+ if (action == RT_ACTION_FIND)
+ value = n4->values[idx];
+ else /* RT_ACTION_DELETE */
+ chunk_values_array_delete(n4->base.chunks, (uint64 *) n4->values,
+ n4->base.n.count, idx);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+ int idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
+
+ if (idx < 0)
+ break;
+
+ found = true;
+ if (action == RT_ACTION_FIND)
+ value = n32->values[idx];
+ else /* RT_ACTION_DELETE */
+ chunk_values_array_delete(n32->base.chunks, (uint64 *) n32->values,
+ n32->base.n.count, idx);
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ rt_node_leaf_128 *n128 = (rt_node_leaf_128 *) node;
+
+ if (!node_128_is_chunk_used((rt_node_base_128 *) n128, chunk))
+ break;
+
+ found = true;
+
+ if (action == RT_ACTION_FIND)
+ value = node_leaf_128_get_value(n128, chunk);
+ else /* RT_ACTION_DELETE */
+ node_leaf_128_delete(n128, chunk);
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+
+ if (!node_leaf_256_is_chunk_used(n256, chunk))
+ break;
+
+ found = true;
+ if (action == RT_ACTION_FIND)
+ value = node_leaf_256_get_value(n256, chunk);
+ else /* RT_ACTION_DELETE */
+ node_leaf_256_delete(n256, chunk);
+
+ break;
+ }
+ }
+
+ /* update statistics */
+ if (action == RT_ACTION_DELETE && found)
+ node->count--;
+
+ if (found && value_p)
+ *value_p = value;
+
+ return found;
+}
+
+/* Insert the child to the inner node */
+static bool
+rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 key,
+ rt_node *child)
+{
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool chunk_exists = false;
+
+ Assert(!NODE_IS_LEAF(node));
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+ int idx;
+
+ idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n4->children[idx] = child;
+ break;
+ }
+
+ if (unlikely(!NODE_HAS_FREE_SLOT(n4)))
+ {
+ rt_node_inner_32 *new32;
+
+ /* grow node from 4 to 32 */
+ new32 = (rt_node_inner_32 *) rt_copy_node(tree, (rt_node *) n4,
+ RT_NODE_KIND_32);
+ chunk_children_array_copy(n4->base.chunks, n4->children,
+ new32->base.chunks, new32->children,
+ n4->base.n.count);
+
+ Assert(parent != NULL);
+ rt_replace_node(tree, parent, (rt_node *) n4, (rt_node *) new32,
+ key);
+ node = (rt_node *) new32;
+ }
+ else
+ {
+ int insertpos = node_4_get_insertpos((rt_node_base_4 *) n4, chunk);
+ uint16 count = n4->base.n.count;
+
+ /* shift chunks and children */
+ if (count != 0 && insertpos < count)
+ chunk_children_array_shift(n4->base.chunks, n4->children,
+ count, insertpos);
+
+ n4->base.chunks[insertpos] = chunk;
+ n4->children[insertpos] = child;
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_32:
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+ int idx;
+
+ idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n32->children[idx] = child;
+ break;
+ }
+
+ if (unlikely(!NODE_HAS_FREE_SLOT(n32)))
+ {
+ rt_node_inner_128 *new128;
+
+ /* grow node from 32 to 128 */
+ new128 = (rt_node_inner_128 *) rt_copy_node(tree, (rt_node *) n32,
+ RT_NODE_KIND_128);
+ for (int i = 0; i < n32->base.n.count; i++)
+ node_inner_128_insert(new128, n32->base.chunks[i], n32->children[i]);
+
+ Assert(parent != NULL);
+ rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new128,
+ key);
+ node = (rt_node *) new128;
+ }
+ else
+ {
+ int insertpos = node_32_get_insertpos((rt_node_base_32 *) n32, chunk);
+ int16 count = n32->base.n.count;
+
+ if (count != 0 && insertpos < count)
+ chunk_children_array_shift(n32->base.chunks, n32->children,
+ count, insertpos);
+
+ n32->base.chunks[insertpos] = chunk;
+ n32->children[insertpos] = child;
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_128:
+ {
+ rt_node_inner_128 *n128 = (rt_node_inner_128 *) node;
+ int cnt = 0;
+
+ if (node_128_is_chunk_used((rt_node_base_128 *) n128, chunk))
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ node_inner_128_update(n128, chunk, child);
+ break;
+ }
+
+ if (unlikely(!NODE_HAS_FREE_SLOT(n128)))
+ {
+ rt_node_inner_256 *new256;
+
+ /* grow node from 128 to 256 */
+ new256 = (rt_node_inner_256 *) rt_copy_node(tree, (rt_node *) n128,
+ RT_NODE_KIND_256);
+ for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n128->base.n.count; i++)
+ {
+ if (!node_128_is_chunk_used((rt_node_base_128 *) n128, i))
+ continue;
+
+ node_inner_256_set(new256, i, node_inner_128_get_child(n128, i));
+ cnt++;
+ }
+
+ Assert(parent != NULL);
+ rt_replace_node(tree, parent, (rt_node *) n128, (rt_node *) new256,
+ key);
+ node = (rt_node *) new256;
+ }
+ else
+ {
+ node_inner_128_insert(n128, chunk, child);
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_256:
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+
+ chunk_exists = node_inner_256_is_chunk_used(n256, chunk);
+ Assert(chunk_exists || NODE_HAS_FREE_SLOT(n256));
+
+ node_inner_256_set(n256, chunk, child);
+ break;
+ }
+ }
+
+ /* Update statistics */
+ if (!chunk_exists)
+ node->count++;
+
+ /*
+ * Done. Finally, verify that the chunk and child pointer are inserted or
+ * replaced properly in the node.
+ */
+ rt_verify_node(node);
+
+ return chunk_exists;
+}
+
+/* Insert the value to the leaf node */
+static bool
+rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
+ uint64 key, uint64 value)
+{
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool chunk_exists = false;
+
+ Assert(NODE_IS_LEAF(node));
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+ int idx;
+
+ idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n4->values[idx] = value;
+ break;
+ }
+
+ if (unlikely(!NODE_HAS_FREE_SLOT(n4)))
+ {
+ rt_node_leaf_32 *new32;
+
+ /* grow node from 4 to 32 */
+ new32 = (rt_node_leaf_32 *) rt_copy_node(tree, (rt_node *) n4,
+ RT_NODE_KIND_32);
+ chunk_values_array_copy(n4->base.chunks, n4->values,
+ new32->base.chunks, new32->values,
+ n4->base.n.count);
+
+ Assert(parent != NULL);
+ rt_replace_node(tree, parent, (rt_node *) n4, (rt_node *) new32,
+ key);
+ node = (rt_node *) new32;
+ }
+ else
+ {
+ int insertpos = node_4_get_insertpos((rt_node_base_4 *) n4, chunk);
+ int count = n4->base.n.count;
+
+ /* shift chunks and values */
+ if (count != 0 && insertpos < count)
+ chunk_values_array_shift(n4->base.chunks, n4->values,
+ count, insertpos);
+
+ n4->base.chunks[insertpos] = chunk;
+ n4->values[insertpos] = value;
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_32:
+ {
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+ int idx;
+
+ idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n32->values[idx] = value;
+ break;
+ }
+
+ if (unlikely(!NODE_HAS_FREE_SLOT(n32)))
+ {
+ rt_node_leaf_128 *new128;
+
+ /* grow node from 32 to 128 */
+ new128 = (rt_node_leaf_128 *) rt_copy_node(tree, (rt_node *) n32,
+ RT_NODE_KIND_128);
+ for (int i = 0; i < n32->base.n.count; i++)
+ node_leaf_128_insert(new128, n32->base.chunks[i], n32->values[i]);
+
+ Assert(parent != NULL);
+ rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new128,
+ key);
+ node = (rt_node *) new128;
+ }
+ else
+ {
+ int insertpos = node_32_get_insertpos((rt_node_base_32 *) n32, chunk);
+ int count = n32->base.n.count;
+
+ if (count != 0 && insertpos < count)
+ chunk_values_array_shift(n32->base.chunks, n32->values,
+ count, insertpos);
+
+ n32->base.chunks[insertpos] = chunk;
+ n32->values[insertpos] = value;
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_128:
+ {
+ rt_node_leaf_128 *n128 = (rt_node_leaf_128 *) node;
+ int cnt = 0;
+
+ if (node_128_is_chunk_used((rt_node_base_128 *) n128, chunk))
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ node_leaf_128_update(n128, chunk, value);
+ break;
+ }
+
+ if (unlikely(!NODE_HAS_FREE_SLOT(n128)))
+ {
+ rt_node_leaf_256 *new256;
+
+ /* grow node from 128 to 256 */
+ new256 = (rt_node_leaf_256 *) rt_copy_node(tree, (rt_node *) n128,
+ RT_NODE_KIND_256);
+ for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n128->base.n.count; i++)
+ {
+ if (!node_128_is_chunk_used((rt_node_base_128 *) n128, i))
+ continue;
+
+ node_leaf_256_set(new256, i, node_leaf_128_get_value(n128, i));
+ cnt++;
+ }
+
+ Assert(parent != NULL);
+ rt_replace_node(tree, parent, (rt_node *) n128, (rt_node *) new256,
+ key);
+ node = (rt_node *) new256;
+ }
+ else
+ {
+ node_leaf_128_insert(n128, chunk, value);
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_256:
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+
+ chunk_exists = node_leaf_256_is_chunk_used(n256, chunk);
+ Assert(chunk_exists || NODE_HAS_FREE_SLOT(n256));
+
+ node_leaf_256_set(n256, chunk, value);
+ break;
+ }
+ }
+
+ /* Update statistics */
+ if (!chunk_exists)
+ node->count++;
+
+ /*
+ * Done. Finally, verify that the chunk and value are inserted or replaced
+ * properly in the node.
+ */
+ rt_verify_node(node);
+
+ return chunk_exists;
+}
+
+/*
+ * Create the radix tree in the given memory context and return it.
+ */
+radix_tree *
+rt_create(MemoryContext ctx)
+{
+ radix_tree *tree;
+ MemoryContext old_ctx;
+
+ old_ctx = MemoryContextSwitchTo(ctx);
+
+ tree = palloc(sizeof(radix_tree));
+ tree->context = ctx;
+ tree->root = NULL;
+ tree->max_val = 0;
+ tree->num_keys = 0;
+
+ /* Create the slab allocator for each size class */
+ for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ {
+ tree->inner_slabs[i] = SlabContextCreate(ctx,
+ rt_node_kind_info[i].name,
+ rt_node_kind_info[i].inner_blocksize,
+ rt_node_kind_info[i].inner_size);
+ tree->leaf_slabs[i] = SlabContextCreate(ctx,
+ rt_node_kind_info[i].name,
+ rt_node_kind_info[i].leaf_blocksize,
+ rt_node_kind_info[i].leaf_size);
+#ifdef RT_DEBUG
+ tree->cnt[i] = 0;
+#endif
+ }
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return tree;
+}
+
+/*
+ * Free the given radix tree.
+ */
+void
+rt_free(radix_tree *tree)
+{
+ for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ {
+ MemoryContextDelete(tree->inner_slabs[i]);
+ MemoryContextDelete(tree->leaf_slabs[i]);
+ }
+
+ pfree(tree);
+}
+
+/*
+ * Set key to value. If the entry already exists, we update its value to 'value'
+ * and return true. Returns false if entry doesn't yet exist.
+ */
+bool
+rt_set(radix_tree *tree, uint64 key, uint64 value)
+{
+ int shift;
+ bool updated;
+ rt_node *node;
+ rt_node *parent = tree->root;
+
+ /* Empty tree, create the root */
+ if (!tree->root)
+ rt_new_root(tree, key);
+
+ /* Extend the tree if necessary */
+ if (key > tree->max_val)
+ rt_extend(tree, key);
+
+ Assert(tree->root);
+
+ shift = tree->root->shift;
+ node = tree->root;
+
+ /* Descend the tree until a leaf node */
+ while (shift >= 0)
+ {
+ rt_node *child;
+
+ if (NODE_IS_LEAF(node))
+ break;
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ {
+ rt_set_extend(tree, key, value, parent, node);
+ return false;
+ }
+
+ parent = node;
+ node = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ updated = rt_node_insert_leaf(tree, parent, node, key, value);
+
+ /* Update the statistics */
+ if (!updated)
+ tree->num_keys++;
+
+ return updated;
+}
+
+/*
+ * Search the given key in the radix tree. Return true if the key is found,
+ * otherwise return false. On success, the value is set to *value_p, which
+ * therefore must not be NULL.
+ */
+bool
+rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
+{
+ rt_node *node;
+ int shift;
+
+ Assert(value_p != NULL);
+
+ if (!tree->root || key > tree->max_val)
+ return false;
+
+ node = tree->root;
+ shift = tree->root->shift;
+
+ /* Descend the tree until a leaf node */
+ while (shift >= 0)
+ {
+ rt_node *child;
+
+ if (NODE_IS_LEAF(node))
+ break;
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ return false;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ return rt_node_search_leaf(node, key, RT_ACTION_FIND, value_p);
+}
+
+/*
+ * Delete the given key from the radix tree. Return true if the key is found (and
+ * deleted), otherwise do nothing and return false.
+ */
+bool
+rt_delete(radix_tree *tree, uint64 key)
+{
+ rt_node *node;
+ rt_node *stack[RT_MAX_LEVEL] = {0};
+ int shift;
+ int level;
+ bool deleted;
+
+ if (!tree->root || key > tree->max_val)
+ return false;
+
+ /*
+ * Descend the tree to search the key while building a stack of nodes we
+ * visited.
+ */
+ node = tree->root;
+ shift = tree->root->shift;
+ level = -1;
+ while (shift > 0)
+ {
+ rt_node *child;
+
+ /* Push the current node to the stack */
+ stack[++level] = node;
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ return false;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ /* Delete the key from the leaf node if exists */
+ Assert(NODE_IS_LEAF(node));
+ deleted = rt_node_search_leaf(node, key, RT_ACTION_DELETE, NULL);
+
+ if (!deleted)
+ {
+ /* no key is found in the leaf node */
+ return false;
+ }
+
+ /* Found the key to delete. Update the statistics */
+ tree->num_keys--;
+
+ /*
+ * Return if the leaf node still has keys and we don't need to delete the
+ * node.
+ */
+ if (!NODE_IS_EMPTY(node))
+ return true;
+
+ /* Free the empty leaf node */
+ rt_free_node(tree, node);
+
+ /* Delete the key in inner nodes recursively */
+ while (level >= 0)
+ {
+ node = stack[level--];
+
+ deleted = rt_node_search_inner(node, key, RT_ACTION_DELETE, NULL);
+ Assert(deleted);
+
+ /* If the node didn't become empty, we stop deleting the key */
+ if (!NODE_IS_EMPTY(node))
+ break;
+
+ /* The node became empty */
+ rt_free_node(tree, node);
+ }
+
+ /*
+ * If we eventually deleted the root node while recursively deleting empty
+ * nodes, we make the tree empty.
+ */
+ if (level == 0)
+ {
+ tree->root = NULL;
+ tree->max_val = 0;
+ }
+
+ return true;
+}
+
+/* Create and return the iterator for the given radix tree */
+rt_iter *
+rt_begin_iterate(radix_tree *tree)
+{
+ MemoryContext old_ctx;
+ rt_iter *iter;
+ int top_level;
+
+ old_ctx = MemoryContextSwitchTo(tree->context);
+
+ iter = (rt_iter *) palloc0(sizeof(rt_iter));
+ iter->tree = tree;
+
+ /* empty tree */
+ if (!iter->tree)
+ return iter;
+
+ top_level = iter->tree->root->shift / RT_NODE_SPAN;
+ iter->stack_len = top_level;
+
+ /*
+ * Descend to the leftmost leaf node from the root. The key is being
+ * constructed while descending to the leaf.
+ */
+ rt_update_iter_stack(iter, iter->tree->root, top_level);
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return iter;
+}
+
+/*
+ * Update each node_iter for inner nodes in the iterator node stack.
+ */
+static void
+rt_update_iter_stack(rt_iter *iter, rt_node *from_node, int from)
+{
+ int level = from;
+ rt_node *node = from_node;
+
+ for (;;)
+ {
+ rt_node_iter *node_iter = &(iter->stack[level--]);
+
+ node_iter->node = node;
+ node_iter->current_idx = -1;
+
+ /* We don't advance the leaf node iterator here */
+ if (NODE_IS_LEAF(node))
+ return;
+
+ /* Advance to the next slot in the inner node */
+ node = rt_node_inner_iterate_next(iter, node_iter);
+
+ /* We must find the first child in the node */
+ Assert(node);
+ }
+}
+
+/*
+ * Return true, setting *key_p and *value_p, if there is a next key. Otherwise,
+ * return false.
+ */
+bool
+rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p)
+{
+ /* Empty tree */
+ if (!iter->tree)
+ return false;
+
+ for (;;)
+ {
+ rt_node *child = NULL;
+ uint64 value;
+ int level;
+ bool found;
+
+ /* Advance the leaf node iterator to get next key-value pair */
+ found = rt_node_leaf_iterate_next(iter, &(iter->stack[0]), &value);
+
+ if (found)
+ {
+ *key_p = iter->key;
+ *value_p = value;
+ return true;
+ }
+
+ /*
+ * We've visited all values in the leaf node, so advance inner node
+ * iterators from the level=1 until we find the next child node.
+ */
+ for (level = 1; level <= iter->stack_len; level++)
+ {
+ child = rt_node_inner_iterate_next(iter, &(iter->stack[level]));
+
+ if (child)
+ break;
+ }
+
+ /* the iteration finished */
+ if (!child)
+ return false;
+
+ /*
+ * Set the node to the node iterator and update the iterator stack
+ * from this node.
+ */
+ rt_update_iter_stack(iter, child, level - 1);
+
+ /* Node iterators are updated, so try again from the leaf */
+ }
+
+ return false;
+}
+
+void
+rt_end_iterate(rt_iter *iter)
+{
+ pfree(iter);
+}
+
+static inline void
+rt_iter_update_key(rt_iter *iter, uint8 chunk, uint8 shift)
+{
+ iter->key &= ~(((uint64) RT_CHUNK_MASK) << shift);
+ iter->key |= (((uint64) chunk) << shift);
+}
+
+/*
+ * Advance the slot in the inner node. Return the child if one exists,
+ * otherwise NULL.
+ */
+static inline rt_node *
+rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
+{
+ rt_node *child = NULL;
+ bool found = false;
+ uint8 key_chunk;
+
+ switch (node_iter->node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n4->base.n.count)
+ break;
+
+ child = n4->children[node_iter->current_idx];
+ key_chunk = n4->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n32->base.n.count)
+ break;
+
+ child = n32->children[node_iter->current_idx];
+ key_chunk = n32->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ rt_node_inner_128 *n128 = (rt_node_inner_128 *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (node_128_is_chunk_used((rt_node_base_128 *) n128, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+ child = node_inner_128_get_child(n128, i);
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (node_inner_256_is_chunk_used(n256, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+ child = node_inner_256_get_child(n256, i);
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ }
+
+ if (found)
+ rt_iter_update_key(iter, key_chunk, node_iter->node->shift);
+
+ return child;
+}
+
+/*
+ * Advance the slot in the leaf node. On success, return true and the value
+ * is set to value_p, otherwise return false.
+ */
+static inline bool
+rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
+ uint64 *value_p)
+{
+ rt_node *node = node_iter->node;
+ bool found = false;
+ uint64 value;
+ uint8 key_chunk;
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n4->base.n.count)
+ break;
+
+ value = n4->values[node_iter->current_idx];
+ key_chunk = n4->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n32->base.n.count)
+ break;
+
+ value = n32->values[node_iter->current_idx];
+ key_chunk = n32->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ rt_node_leaf_128 *n128 = (rt_node_leaf_128 *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (node_128_is_chunk_used((rt_node_base_128 *) n128, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+ value = node_leaf_128_get_value(n128, i);
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (node_leaf_256_is_chunk_used(n256, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+ value = node_leaf_256_get_value(n256, i);
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ }
+
+ if (found)
+ {
+ rt_iter_update_key(iter, key_chunk, node_iter->node->shift);
+ *value_p = value;
+ }
+
+ return found;
+}
+
+/*
+ * Return the number of keys in the radix tree.
+ */
+uint64
+rt_num_entries(radix_tree *tree)
+{
+ return tree->num_keys;
+}
+
+/*
+ * Return the amount of memory used by the radix tree.
+ */
+uint64
+rt_memory_usage(radix_tree *tree)
+{
+ Size total = sizeof(radix_tree);
+
+ for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ {
+ total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
+ total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
+ }
+
+ return total;
+}
+
+/*
+ * Verify the radix tree node.
+ */
+static void
+rt_verify_node(rt_node *node)
+{
+#ifdef USE_ASSERT_CHECKING
+ Assert(node->count >= 0);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_base_4 *n4 = (rt_node_base_4 *) node;
+
+ for (int i = 1; i < n4->n.count; i++)
+ Assert(n4->chunks[i - 1] < n4->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_base_32 *n32 = (rt_node_base_32 *) node;
+
+ for (int i = 1; i < n32->n.count; i++)
+ Assert(n32->chunks[i - 1] < n32->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ rt_node_base_128 *n128 = (rt_node_base_128 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!node_128_is_chunk_used(n128, i))
+ continue;
+
+ /* Check if the corresponding slot is used */
+ if (NODE_IS_LEAF(node))
+ Assert(node_leaf_128_is_slot_used((rt_node_leaf_128 *) node,
+ n128->slot_idxs[i]));
+ else
+ Assert(node_inner_128_is_slot_used((rt_node_inner_128 *) node,
+ n128->slot_idxs[i]));
+
+ cnt++;
+ }
+
+ Assert(n128->n.count == cnt);
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RT_NODE_NSLOTS_BITS(RT_NODE_MAX_SLOTS); i++)
+ cnt += pg_popcount32(n256->isset[i]);
+
+ /* Check that the number of used chunks matches the count */
+ Assert(n256->base.n.count == cnt);
+
+ break;
+ }
+ }
+ }
+#endif
+}
+
+/***************** DEBUG FUNCTIONS *****************/
+#ifdef RT_DEBUG
+void
+rt_stats(radix_tree *tree)
+{
+ ereport(LOG, (errmsg("num_keys = %lu, height = %u, n4 = %u, n32 = %u, n128 = %u, n256 = %u",
+ tree->num_keys,
+ tree->root->shift / RT_NODE_SPAN,
+ tree->cnt[0],
+ tree->cnt[1],
+ tree->cnt[2],
+ tree->cnt[3])));
+}
+
+static void
+rt_dump_node(rt_node *node, int level, bool recurse)
+{
+ char space[128] = {0};
+
+ fprintf(stderr, "[%s] kind %d, count %u, shift %u, chunk 0x%X:\n",
+ NODE_IS_LEAF(node) ? "LEAF" : "INNR",
+ (node->kind == RT_NODE_KIND_4) ? 4 :
+ (node->kind == RT_NODE_KIND_32) ? 32 :
+ (node->kind == RT_NODE_KIND_128) ? 128 : 256,
+ node->count, node->shift, node->chunk);
+
+ if (level > 0)
+ sprintf(space, "%*c", level * 4, ' ');
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+
+ fprintf(stderr, "%schunk 0x%X value 0x%lX\n",
+ space, n4->base.chunks[i], n4->values[i]);
+ }
+ else
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, n4->base.chunks[i]);
+
+ if (recurse)
+ rt_dump_node(n4->children[i], level + 1, recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+
+ fprintf(stderr, "%schunk 0x%X value 0x%lX\n",
+ space, n32->base.chunks[i], n32->values[i]);
+ }
+ else
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, n32->base.chunks[i]);
+
+ if (recurse)
+ {
+ rt_dump_node(n32->children[i], level + 1, recurse);
+ }
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ rt_node_base_128 *b128 = (rt_node_base_128 *) node;
+
+ fprintf(stderr, "slot_idxs ");
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!node_128_is_chunk_used(b128, i))
+ continue;
+
+ fprintf(stderr, " [%d]=%d, ", i, b128->slot_idxs[i]);
+ }
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_128 *n = (rt_node_leaf_128 *) node;
+
+ fprintf(stderr, ", isset-bitmap:");
+ for (int i = 0; i < 16; i++)
+ {
+ fprintf(stderr, "%X ", (uint8) n->isset[i]);
+ }
+ fprintf(stderr, "\n");
+ }
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!node_128_is_chunk_used(b128, i))
+ continue;
+
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_128 *n128 = (rt_node_leaf_128 *) b128;
+
+ fprintf(stderr, "%schunk 0x%X value 0x%lX\n",
+ space, i, node_leaf_128_get_value(n128, i));
+ }
+ else
+ {
+ rt_node_inner_128 *n128 = (rt_node_inner_128 *) b128;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, i);
+
+ if (recurse)
+ rt_dump_node(node_inner_128_get_child(n128, i),
+ level + 1, recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+
+ if (!node_leaf_256_is_chunk_used(n256, i))
+ continue;
+
+ fprintf(stderr, "%schunk 0x%X value 0x%lX\n",
+ space, i, node_leaf_256_get_value(n256, i));
+ }
+ else
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+
+ if (!node_inner_256_is_chunk_used(n256, i))
+ continue;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, i);
+
+ if (recurse)
+ rt_dump_node(node_inner_256_get_child(n256, i), level + 1,
+ recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ }
+}
+
+void
+rt_dump_search(radix_tree *tree, uint64 key)
+{
+ rt_node *node;
+ int shift;
+ int level = 0;
+
+ elog(NOTICE, "-----------------------------------------------------------");
+ elog(NOTICE, "max_val = %lu (0x%lX)", tree->max_val, tree->max_val);
+
+ if (!tree->root)
+ {
+ elog(NOTICE, "tree is empty");
+ return;
+ }
+
+ if (key > tree->max_val)
+ {
+ elog(NOTICE, "key %lu (0x%lX) is larger than max val",
+ key, key);
+ return;
+ }
+
+ node = tree->root;
+ shift = tree->root->shift;
+ while (shift >= 0)
+ {
+ rt_node *child;
+
+ rt_dump_node(node, level, false);
+
+ if (NODE_IS_LEAF(node))
+ {
+ uint64 dummy;
+
+ /* We reached a leaf node, find the corresponding slot */
+ rt_node_search_leaf(node, key, RT_ACTION_FIND, &dummy);
+
+ break;
+ }
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ break;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ level++;
+ }
+}
+
+void
+rt_dump(radix_tree *tree)
+{
+ for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ fprintf(stderr, "%s\tinner_size%lu\tinner_blocksize %lu\tleaf_size %lu\tleaf_blocksize %lu\n",
+ rt_node_kind_info[i].name,
+ rt_node_kind_info[i].inner_size,
+ rt_node_kind_info[i].inner_blocksize,
+ rt_node_kind_info[i].leaf_size,
+ rt_node_kind_info[i].leaf_blocksize);
+ fprintf(stderr, "max_val = %lu\n", tree->max_val);
+
+ if (!tree->root)
+ {
+ fprintf(stderr, "empty tree\n");
+ return;
+ }
+
+ rt_dump_node(tree->root, 0, true);
+}
+#endif
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
new file mode 100644
index 0000000000..d5d7668617
--- /dev/null
+++ b/src/include/lib/radixtree.h
@@ -0,0 +1,42 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.h
+ * Interface for radix tree.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/lib/radixtree.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef RADIXTREE_H
+#define RADIXTREE_H
+
+#include "postgres.h"
+
+#define RT_DEBUG 1
+
+typedef struct radix_tree radix_tree;
+typedef struct rt_iter rt_iter;
+
+extern radix_tree *rt_create(MemoryContext ctx);
+extern void rt_free(radix_tree *tree);
+extern bool rt_search(radix_tree *tree, uint64 key, uint64 *val_p);
+extern bool rt_set(radix_tree *tree, uint64 key, uint64 val);
+extern rt_iter *rt_begin_iterate(radix_tree *tree);
+
+extern bool rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p);
+extern void rt_end_iterate(rt_iter *iter);
+extern bool rt_delete(radix_tree *tree, uint64 key);
+
+extern uint64 rt_memory_usage(radix_tree *tree);
+extern uint64 rt_num_entries(radix_tree *tree);
+
+#ifdef RT_DEBUG
+extern void rt_dump(radix_tree *tree);
+extern void rt_dump_search(radix_tree *tree, uint64 key);
+extern void rt_stats(radix_tree *tree);
+#endif
+
+#endif /* RADIXTREE_H */
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 7b3f292965..e587cabe13 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -26,6 +26,7 @@ SUBDIRS = \
test_parser \
test_pg_dump \
test_predtest \
+ test_radixtree \
test_rbtree \
test_regex \
test_rls_hooks \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index c2e5f5ffd5..c86f6bdcb0 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -20,6 +20,7 @@ subdir('test_oat_hooks')
subdir('test_parser')
subdir('test_pg_dump')
subdir('test_predtest')
+subdir('test_radixtree')
subdir('test_rbtree')
subdir('test_regex')
subdir('test_rls_hooks')
diff --git a/src/test/modules/test_radixtree/.gitignore b/src/test/modules/test_radixtree/.gitignore
new file mode 100644
index 0000000000..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/test_radixtree/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_radixtree/Makefile b/src/test/modules/test_radixtree/Makefile
new file mode 100644
index 0000000000..da06b93da3
--- /dev/null
+++ b/src/test/modules/test_radixtree/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_radixtree/Makefile
+
+MODULE_big = test_radixtree
+OBJS = \
+ $(WIN32RES) \
+ test_radixtree.o
+PGFILEDESC = "test_radixtree - test code for src/backend/lib/radixtree.c"
+
+EXTENSION = test_radixtree
+DATA = test_radixtree--1.0.sql
+
+REGRESS = test_radixtree
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_radixtree
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_radixtree/README b/src/test/modules/test_radixtree/README
new file mode 100644
index 0000000000..a8b271869a
--- /dev/null
+++ b/src/test/modules/test_radixtree/README
@@ -0,0 +1,7 @@
+test_radixtree contains unit tests for testing the radix tree implementation
+in src/backend/lib/radixtree.c.
+
+The tests verify the correctness of the implementation, but they can also be
+used as a micro-benchmark. If you set the 'rt_test_stats' flag in
+test_radixtree.c, the tests will print extra information about execution time
+and memory usage.
diff --git a/src/test/modules/test_radixtree/expected/test_radixtree.out b/src/test/modules/test_radixtree/expected/test_radixtree.out
new file mode 100644
index 0000000000..cc6970c87c
--- /dev/null
+++ b/src/test/modules/test_radixtree/expected/test_radixtree.out
@@ -0,0 +1,28 @@
+CREATE EXTENSION test_radixtree;
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
+NOTICE: testing radix tree node types with shift "0"
+NOTICE: testing radix tree node types with shift "8"
+NOTICE: testing radix tree node types with shift "16"
+NOTICE: testing radix tree node types with shift "24"
+NOTICE: testing radix tree node types with shift "32"
+NOTICE: testing radix tree node types with shift "40"
+NOTICE: testing radix tree node types with shift "48"
+NOTICE: testing radix tree node types with shift "56"
+NOTICE: testing radix tree with pattern "all ones"
+NOTICE: testing radix tree with pattern "alternating bits"
+NOTICE: testing radix tree with pattern "clusters of ten"
+NOTICE: testing radix tree with pattern "clusters of hundred"
+NOTICE: testing radix tree with pattern "one-every-64k"
+NOTICE: testing radix tree with pattern "sparse"
+NOTICE: testing radix tree with pattern "single values, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^60"
+ test_radixtree
+----------------
+
+(1 row)
+
diff --git a/src/test/modules/test_radixtree/meson.build b/src/test/modules/test_radixtree/meson.build
new file mode 100644
index 0000000000..f96bf159d6
--- /dev/null
+++ b/src/test/modules/test_radixtree/meson.build
@@ -0,0 +1,34 @@
+# FIXME: prevent install during main install, but not during test :/
+
+test_radixtree_sources = files(
+ 'test_radixtree.c',
+)
+
+if host_system == 'windows'
+ test_radixtree_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'test_radixtree',
+ '--FILEDESC', 'test_radixtree - test code for src/backend/lib/radixtree.c',])
+endif
+
+test_radixtree = shared_module('test_radixtree',
+ test_radixtree_sources,
+ kwargs: pg_mod_args,
+)
+testprep_targets += test_radixtree
+
+install_data(
+ 'test_radixtree.control',
+ 'test_radixtree--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'test_radixtree',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'test_radixtree',
+ ],
+ },
+}
diff --git a/src/test/modules/test_radixtree/sql/test_radixtree.sql b/src/test/modules/test_radixtree/sql/test_radixtree.sql
new file mode 100644
index 0000000000..41ece5e9f5
--- /dev/null
+++ b/src/test/modules/test_radixtree/sql/test_radixtree.sql
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_radixtree;
+
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
diff --git a/src/test/modules/test_radixtree/test_radixtree--1.0.sql b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
new file mode 100644
index 0000000000..074a5a7ea7
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_radixtree/test_radixtree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_radixtree" to load this file. \quit
+
+CREATE FUNCTION test_radixtree()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
new file mode 100644
index 0000000000..cb3596755d
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -0,0 +1,504 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_radixtree.c
+ * Test radixtree set data structure.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/test/modules/test_radixtree/test_radixtree.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "lib/radixtree.h"
+#include "miscadmin.h"
+#include "nodes/bitmapset.h"
+#include "storage/block.h"
+#include "storage/itemptr.h"
+#include "utils/memutils.h"
+#include "utils/timestamp.h"
+
+#define UINT64_HEX_FORMAT "%" INT64_MODIFIER "X"
+
+/*
+ * If you enable this, the "pattern" tests will print information about
+ * how long populating, probing, and iterating the test set takes, and
+ * how much memory the test set consumed. That can be used as a
+ * micro-benchmark of various operations and input patterns (if you do
+ * that, you might want to increase the number of values used in each
+ * of the tests to reduce noise).
+ *
+ * The information is printed to the server's stderr, mostly because
+ * that's where MemoryContextStats() output goes.
+ */
+static const bool rt_test_stats = false;
+
+/* The maximum number of entries each node type can have */
+static int rt_node_max_entries[] = {
+ 4, /* RT_NODE_KIND_4 */
+ 16, /* RT_NODE_KIND_16 */
+ 32, /* RT_NODE_KIND_32 */
+ 128, /* RT_NODE_KIND_128 */
+ 256 /* RT_NODE_KIND_256 */
+};
+
+/*
+ * A struct to define a pattern of integers, for use with the test_pattern()
+ * function.
+ */
+typedef struct
+{
+ char *test_name; /* short name of the test, for humans */
+ char *pattern_str; /* a bit pattern */
+ uint64 spacing; /* pattern repeats at this interval */
+ uint64 num_values; /* number of integers to set in total */
+} test_spec;
+
+/* Test patterns borrowed from test_integerset.c */
+static const test_spec test_specs[] = {
+ {
+ "all ones", "1111111111",
+ 10, 1000000
+ },
+ {
+ "alternating bits", "0101010101",
+ 10, 1000000
+ },
+ {
+ "clusters of ten", "1111111111",
+ 10000, 1000000
+ },
+ {
+ "clusters of hundred",
+ "1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111",
+ 10000, 10000000
+ },
+ {
+ "one-every-64k", "1",
+ 65536, 1000000
+ },
+ {
+ "sparse", "100000000000000000000000000000001",
+ 10000000, 1000000
+ },
+ {
+ "single values, distance > 2^32", "1",
+ UINT64CONST(10000000000), 100000
+ },
+ {
+ "clusters, distance > 2^32", "10101010",
+ UINT64CONST(10000000000), 1000000
+ },
+ {
+ "clusters, distance > 2^60", "10101010",
+ UINT64CONST(2000000000000000000),
+ 23 /* can't be much higher than this, or we
+ * overflow uint64 */
+ }
+};
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(test_radixtree);
+
+static void
+test_empty(void)
+{
+ radix_tree *radixtree;
+ uint64 dummy;
+
+ radixtree = rt_create(CurrentMemoryContext);
+
+ if (rt_search(radixtree, 0, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, 1, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, PG_UINT64_MAX, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_num_entries(radixtree) != 0)
+ elog(ERROR, "rt_num_entries on empty tree return non-zero");
+
+ rt_free(radixtree);
+}
+
+/*
+ * Check if keys from start to end with the shift exist in the tree.
+ */
+static void
+check_search_on_node(radix_tree *radixtree, uint8 shift, int start, int end)
+{
+ for (int i = start; i < end; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ uint64 val;
+
+ if (!rt_search(radixtree, key, &val))
+ elog(ERROR, "key 0x" UINT64_HEX_FORMAT " is not found on node-%d",
+ key, end);
+ if (val != key)
+ elog(ERROR, "rt_search with key 0x" UINT64_HEX_FORMAT " returns 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ key, val, key);
+ }
+}
+
+static void
+test_node_types_insert(radix_tree *radixtree, uint8 shift)
+{
+ uint64 num_entries;
+
+ for (int i = 0; i < 256; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_set(radixtree, key, key);
+
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " found", key);
+
+ for (int j = 0; j < lengthof(rt_node_max_entries); j++)
+ {
+ /*
+ * After filling all slots in each node type, check if the values
+ * are stored properly.
+ */
+ if (i == (rt_node_max_entries[j] - 1))
+ {
+ check_search_on_node(radixtree, shift,
+ (j == 0) ? 0 : rt_node_max_entries[j - 1],
+ rt_node_max_entries[j]);
+ break;
+ }
+ }
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ if (num_entries != 256)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(256));
+}
+
+static void
+test_node_types_delete(radix_tree *radixtree, uint8 shift)
+{
+ uint64 num_entries;
+
+ for (int i = 0; i < 256; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_delete(radixtree, key);
+
+ if (!found)
+ elog(ERROR, "inserted key 0x" UINT64_HEX_FORMAT " is not found", key);
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ /* The tree must be empty */
+ if (num_entries != 0)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(256));
+}
+
+/*
+ * Test for inserting and deleting key-value pairs to each node type at the given shift
+ * level.
+ */
+static void
+test_node_types(uint8 shift)
+{
+ radix_tree *radixtree;
+
+ elog(NOTICE, "testing radix tree node types with shift \"%d\"", shift);
+
+ radixtree = rt_create(CurrentMemoryContext);
+
+ /*
+ * Insert and search entries for every node type at the 'shift' level,
+ * then delete all entries to make it empty, and insert and search entries
+ * again.
+ */
+ test_node_types_insert(radixtree, shift);
+ test_node_types_delete(radixtree, shift);
+ test_node_types_insert(radixtree, shift);
+
+ rt_free(radixtree);
+}
+
+/*
+ * Test with a repeating pattern, defined by the 'spec'.
+ */
+static void
+test_pattern(const test_spec * spec)
+{
+ radix_tree *radixtree;
+ rt_iter *iter;
+ MemoryContext radixtree_ctx;
+ TimestampTz starttime;
+ TimestampTz endtime;
+ uint64 n;
+ uint64 last_int;
+ uint64 ndeleted;
+ uint64 nbefore;
+ uint64 nafter;
+ int patternlen;
+ uint64 *pattern_values;
+ uint64 pattern_num_values;
+
+ elog(NOTICE, "testing radix tree with pattern \"%s\"", spec->test_name);
+ if (rt_test_stats)
+ fprintf(stderr, "-----\ntesting radix tree with pattern \"%s\"\n", spec->test_name);
+
+ /* Pre-process the pattern, creating an array of integers from it. */
+ patternlen = strlen(spec->pattern_str);
+ pattern_values = palloc(patternlen * sizeof(uint64));
+ pattern_num_values = 0;
+ for (int i = 0; i < patternlen; i++)
+ {
+ if (spec->pattern_str[i] == '1')
+ pattern_values[pattern_num_values++] = i;
+ }
+
+ /*
+ * Allocate the radix tree.
+ *
+ * Allocate it in a separate memory context, so that we can print its
+ * memory usage easily.
+ */
+ radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
+ "radixtree test",
+ ALLOCSET_SMALL_SIZES);
+ MemoryContextSetIdentifier(radixtree_ctx, spec->test_name);
+ radixtree = rt_create(radixtree_ctx);
+
+ /*
+ * Add values to the set.
+ */
+ starttime = GetCurrentTimestamp();
+
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ uint64 x = 0;
+
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ bool found;
+
+ x = last_int + pattern_values[i];
+
+ found = rt_set(radixtree, x, x);
+
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " found", x);
+
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+
+ endtime = GetCurrentTimestamp();
+
+ if (rt_test_stats)
+ fprintf(stderr, "added " UINT64_FORMAT " values in %d ms\n",
+ spec->num_values, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Print stats on the amount of memory used.
+ *
+ * We print the usage reported by rt_memory_usage(), as well as the stats
+ * from the memory context. They should be in the same ballpark, but it's
+ * hard to automate testing that, so if you're making changes to the
+ * implementation, just observe that manually.
+ */
+ if (rt_test_stats)
+ {
+ uint64 mem_usage;
+
+ /*
+ * Also print memory usage as reported by rt_memory_usage(). It
+ * should be in the same ballpark as the usage reported by
+ * MemoryContextStats().
+ */
+ mem_usage = rt_memory_usage(radixtree);
+ fprintf(stderr, "rt_memory_usage() reported " UINT64_FORMAT " (%0.2f bytes / integer)\n",
+ mem_usage, (double) mem_usage / spec->num_values);
+
+ MemoryContextStats(radixtree_ctx);
+ }
+
+ /* Check that rt_num_entries works */
+ n = rt_num_entries(radixtree);
+ if (n != spec->num_values)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT, n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_search()
+ */
+ starttime = GetCurrentTimestamp();
+
+ for (n = 0; n < 100000; n++)
+ {
+ bool found;
+ bool expected;
+ uint64 x;
+ uint64 v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Do we expect this value to be present in the set? */
+ if (x >= last_int)
+ expected = false;
+ else
+ {
+ uint64 idx = x % spec->spacing;
+
+ if (idx >= patternlen)
+ expected = false;
+ else if (spec->pattern_str[idx] == '1')
+ expected = true;
+ else
+ expected = false;
+ }
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (found != expected)
+ elog(ERROR, "mismatch at 0x" UINT64_HEX_FORMAT ": %d vs %d", x, found, expected);
+ if (found && (v != x))
+ elog(ERROR, "found 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ v, x);
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "probed " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Test iterator
+ */
+ starttime = GetCurrentTimestamp();
+
+ iter = rt_begin_iterate(radixtree);
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ uint64 expected = last_int + pattern_values[i];
+ uint64 x;
+ uint64 val;
+
+ if (!rt_iterate_next(iter, &x, &val))
+ break;
+
+ if (x != expected)
+ elog(ERROR,
+ "iterate returned wrong key; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d",
+ x, expected, i);
+ if (val != expected)
+ elog(ERROR,
+ "iterate returned wrong value; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d", x, expected, i);
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "iterated " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ if (n < spec->num_values)
+ elog(ERROR, "iterator stopped short after " UINT64_FORMAT " entries, expected " UINT64_FORMAT, n, spec->num_values);
+ if (n > spec->num_values)
+ elog(ERROR, "iterator returned " UINT64_FORMAT " entries, " UINT64_FORMAT " was expected", n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_delete()
+ */
+ starttime = GetCurrentTimestamp();
+
+ nbefore = rt_num_entries(radixtree);
+ ndeleted = 0;
+ for (n = 0; n < 100000; n++)
+ {
+ bool found;
+ uint64 x;
+ uint64 v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (!found)
+ continue;
+
+ /* If the key is found, delete it and check again */
+ if (!rt_delete(radixtree, x))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_search(radixtree, x, &v))
+ elog(ERROR, "found deleted key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_delete(radixtree, x))
+ elog(ERROR, "deleted already-deleted key 0x" UINT64_HEX_FORMAT, x);
+
+ ndeleted++;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "deleted " UINT64_FORMAT " values in %d ms\n",
+ ndeleted, (int) (endtime - starttime) / 1000);
+
+ nafter = rt_num_entries(radixtree);
+
+ /* Check that rt_num_entries works */
+ if ((nbefore - ndeleted) != nafter)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT "after " UINT64_FORMAT " deletion",
+ nafter, (nbefore - ndeleted), ndeleted);
+
+ MemoryContextDelete(radixtree_ctx);
+}
+
+Datum
+test_radixtree(PG_FUNCTION_ARGS)
+{
+ test_empty();
+
+ for (int shift = 0; shift <= (64 - 8); shift += 8)
+ test_node_types(shift);
+
+ /* Test different test patterns, with lots of entries */
+ for (int i = 0; i < lengthof(test_specs); i++)
+ test_pattern(&test_specs[i]);
+
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_radixtree/test_radixtree.control b/src/test/modules/test_radixtree/test_radixtree.control
new file mode 100644
index 0000000000..e53f2a3e0c
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.control
@@ -0,0 +1,4 @@
+comment = 'Test code for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/test_radixtree'
+relocatable = true
--
2.31.1
Attachments:
v9-0001-introduce-vector8_min-and-vector8_highbit_mask.patch
From c8918d78d679fabe40a2855ba4d9ea0d1dbb5445 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 24 Oct 2022 14:07:09 +0900
Subject: [PATCH v9 1/6] introduce vector8_min and vector8_highbit_mask
---
src/include/port/simd.h | 47 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 47 insertions(+)
diff --git a/src/include/port/simd.h b/src/include/port/simd.h
index 61ae4ecf60..0b288c422a 100644
--- a/src/include/port/simd.h
+++ b/src/include/port/simd.h
@@ -77,6 +77,7 @@ static inline bool vector8_has(const Vector8 v, const uint8 c);
static inline bool vector8_has_zero(const Vector8 v);
static inline bool vector8_has_le(const Vector8 v, const uint8 c);
static inline bool vector8_is_highbit_set(const Vector8 v);
+static inline uint32 vector8_highbit_mask(const Vector8 v);
#ifndef USE_NO_SIMD
static inline bool vector32_is_highbit_set(const Vector32 v);
#endif
@@ -96,6 +97,7 @@ static inline Vector8 vector8_ssub(const Vector8 v1, const Vector8 v2);
*/
#ifndef USE_NO_SIMD
static inline Vector8 vector8_eq(const Vector8 v1, const Vector8 v2);
+static inline Vector8 vector8_min(const Vector8 v1, const Vector8 v2);
static inline Vector32 vector32_eq(const Vector32 v1, const Vector32 v2);
#endif
@@ -277,6 +279,36 @@ vector8_is_highbit_set(const Vector8 v)
#endif
}
+/*
+ * Return the bitmask of the high bit of each element.
+ */
+static inline uint32
+vector8_highbit_mask(const Vector8 v)
+{
+#ifdef USE_SSE2
+ return (uint32) _mm_movemask_epi8(v);
+#elif defined(USE_NEON)
+ static const uint8 mask[16] = {
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ };
+
+ uint8x16_t masked = vandq_u8(vld1q_u8(mask), (uint8x16_t) vshrq_n_s8(v, 7));
+ uint8x16_t maskedhi = vextq_u8(masked, masked, 8);
+
+ return (uint32) vaddvq_u16((uint16x8_t) vzip1q_u8(masked, maskedhi));
+#else
+ uint32 mask = 0;
+
+ for (Size i = 0; i < sizeof(Vector8); i++)
+ mask |= (((const uint8 *) &v)[i] >> 7) << i;
+
+ return mask;
+#endif
+}
+
/*
* Exactly like vector8_is_highbit_set except for the input type, so it
* looks at each byte separately.
@@ -372,4 +404,19 @@ vector32_eq(const Vector32 v1, const Vector32 v2)
}
#endif /* ! USE_NO_SIMD */
+/*
+ * Compare the given vectors and return the vector of minimum elements.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_min(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+ return _mm_min_epu8(v1, v2);
+#elif defined(USE_NEON)
+ return vminq_u8(v1, v2);
+#endif
+}
+#endif /* ! USE_NO_SIMD */
+
#endif /* SIMD_H */
--
2.31.1
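
For reference, here is a hypothetical sketch of how vector8_eq() and the new
vector8_highbit_mask() can be combined to search a small chunk array, which is
the kind of use these helpers are aimed at. This is not code from the patch
set; it assumes SIMD support is available, that 'count' fits within a single
vector register, and that chunks[] is padded out to sizeof(Vector8) bytes:

#include "port/pg_bitutils.h"
#include "port/simd.h"

/*
 * Hypothetical example: return the index of key_chunk among the first
 * 'count' entries of chunks[], or -1 if it is not present. A node-32
 * search would run this over two 16-byte vectors on SSE2/NEON.
 */
static inline int
chunk_search_eq_simd(const uint8 *chunks, int count, uint8 key_chunk)
{
	Vector8		spread = vector8_broadcast(key_chunk);
	Vector8		haystack;
	uint32		bitfield;

	Assert(count <= sizeof(Vector8));

	vector8_load(&haystack, chunks);

	/* one bit per byte position that compared equal */
	bitfield = vector8_highbit_mask(vector8_eq(haystack, spread));

	/* ignore any matches beyond the valid entries */
	bitfield &= (((uint32) 1) << count) - 1;

	return bitfield ? pg_rightmost_one_pos32(bitfield) : -1;
}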
On Mon, Nov 14, 2022 at 3:44 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
0004 patch is a new patch supporting a pointer tagging of the node
kind. Also, it introduces rt_node_ptr we discussed so that internal
functions use it rather than having two arguments for encoded and
decoded pointers. With this intermediate patch, the DSA support patch
became more readable and understandable. Probably we can make it
smaller further if we move the change of separating the control object
from radix_tree to the main patch (0002). The patch still needs to be
polished but I'd like to check if this idea is worthwhile. If we agree
on this direction, this patch will be merged into the main radix tree
implementation patch.
Thanks for the new patch set. I've taken a very brief look at 0004 and I
think the broad outlines are okay. As you say it needs polish, but before
going further, I'd like to do some experiments of my own as I mentioned
earlier:
- See how much performance we actually gain from tagging the node kind.
- Try additional size classes while keeping the node kinds to only four.
- Optimize node128 insert.
- Try templating out the differences between local and shared memory. With
local memory, the node-pointer struct would be a union, for example.
Templating would also reduce branches and re-simplify some internal APIs,
but it's likely that would also make the TID store and/or vacuum more
complex, because at least some external functions would be duplicated.
I'll set the patch to "waiting on author", but in this case the author is
me.
--
John Naylor
EDB: http://www.enterprisedb.com
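
For reference, the pointer-tagging idea under discussion can be sketched
roughly like this, under the assumption that nodes are allocated with at least
8-byte alignment and that the kind constants fit in the low three bits of a
pointer. The names below are illustrative only and are not the rt_node_ptr
from the 0004 patch:

/* Illustrative only; not the 0004 patch's actual representation. */
typedef uintptr_t rt_tagged_ptr;

#define RT_PTR_KIND_MASK	((uintptr_t) 0x7)

static inline rt_tagged_ptr
rt_tag_node(rt_node *node, uint8 kind)
{
	/* slab-allocated nodes are assumed to be at least 8-byte aligned */
	Assert(((uintptr_t) node & RT_PTR_KIND_MASK) == 0);
	return (uintptr_t) node | kind;
}

static inline uint8
rt_tagged_kind(rt_tagged_ptr ptr)
{
	return (uint8) (ptr & RT_PTR_KIND_MASK);
}

static inline rt_node *
rt_tagged_node(rt_tagged_ptr ptr)
{
	return (rt_node *) (ptr & ~RT_PTR_KIND_MASK);
}

The attraction is that the switch on the node kind during descent can be
driven by the parent's child pointer alone, before the child's header has been
loaded from memory; how much that actually buys is what the first experiment
above is meant to measure.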
On Mon, Nov 14, 2022 at 10:00 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Mon, Nov 14, 2022 at 3:44 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
0004 patch is a new patch supporting a pointer tagging of the node
kind. Also, it introduces rt_node_ptr we discussed so that internal
functions use it rather than having two arguments for encoded and
decoded pointers. With this intermediate patch, the DSA support patch
became more readable and understandable. Probably we can make it
smaller further if we move the change of separating the control object
from radix_tree to the main patch (0002). The patch still needs to be
polished but I'd like to check if this idea is worthwhile. If we agree
on this direction, this patch will be merged into the main radix tree
implementation patch.
Thanks for the new patch set. I've taken a very brief look at 0004 and I think the broad outlines are okay. As you say it needs polish, but before going further, I'd like to do some experiments of my own as I mentioned earlier:
- See how much performance we actually gain from tagging the node kind.
- Try additional size classes while keeping the node kinds to only four.
- Optimize node128 insert.
- Try templating out the differences between local and shared memory. With local memory, the node-pointer struct would be a union, for example. Templating would also reduce branches and re-simplify some internal APIs, but it's likely that would also make the TID store and/or vacuum more complex, because at least some external functions would be duplicated.
Thanks! Please let me know if there is something I can help with.
In the meanwhile, I'd like to make some progress on the vacuum
integration and improving the test coverages.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Tue, Nov 15, 2022 at 11:59 AM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
Thanks! Please let me know if there is something I can help with.
I didn't get very far because the tests fail on 0004 in rt_verify_node:
TRAP: failed Assert("n4->chunks[i - 1] < n4->chunks[i]"), File:
"../src/backend/lib/radixtree.c", Line: 2186, PID: 18242
--
John Naylor
EDB: http://www.enterprisedb.com
On Wed, Nov 16, 2022 at 11:46 AM John Naylor <john.naylor@enterprisedb.com>
wrote:
On Tue, Nov 15, 2022 at 11:59 AM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
Thanks! Please let me know if there is something I can help with.
I didn't get very far because the tests fail on 0004 in rt_verify_node:
TRAP: failed Assert("n4->chunks[i - 1] < n4->chunks[i]"), File:
"../src/backend/lib/radixtree.c", Line: 2186, PID: 18242
Actually I do want to offer some general advice. Upthread I recommended a
purely refactoring patch that added the node-pointer struct but did nothing
else, so that the DSA changes would be smaller. 0004 attempted pointer
tagging in the same commit, which makes it no longer a purely refactoring
patch, so that 1) makes it harder to tell what part caused the bug and 2)
obscures what is necessary for DSA pointers and what was additionally
necessary for pointer tagging. Shared memory support is a prerequisite for
a shippable feature, but pointer tagging is (hopefully) a performance
optimization. Let's keep them separate.
--
John Naylor
EDB: http://www.enterprisedb.com
On Wed, Nov 16, 2022 at 1:46 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Tue, Nov 15, 2022 at 11:59 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Thanks! Please let me know if there is something I can help with.
I didn't get very far because the tests fail on 0004 in rt_verify_node:
TRAP: failed Assert("n4->chunks[i - 1] < n4->chunks[i]"), File: "../src/backend/lib/radixtree.c", Line: 2186, PID: 18242
Which tests do you use to get this assertion failure? I've confirmed
there is a bug in 0005 patch but without it, "make check-world"
passed.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Wed, Nov 16, 2022 at 2:17 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Wed, Nov 16, 2022 at 11:46 AM John Naylor <john.naylor@enterprisedb.com> wrote:
On Tue, Nov 15, 2022 at 11:59 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Thanks! Please let me know if there is something I can help with.
I didn't get very far because the tests fail on 0004 in rt_verify_node:
TRAP: failed Assert("n4->chunks[i - 1] < n4->chunks[i]"), File: "../src/backend/lib/radixtree.c", Line: 2186, PID: 18242
Actually I do want to offer some general advice. Upthread I recommended a purely refactoring patch that added the node-pointer struct but did nothing else, so that the DSA changes would be smaller. 0004 attempted pointer tagging in the same commit, which makes it no longer a purely refactoring patch, so that 1) makes it harder to tell what part caused the bug and 2) obscures what is necessary for DSA pointers and what was additionally necessary for pointer tagging. Shared memory support is a prerequisite for a shippable feature, but pointer tagging is (hopefully) a performance optimization. Let's keep them separate.
Totally agreed. I'll separate them in the next version patch. Thank
you for your advice.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Wed, Nov 16, 2022 at 12:33 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
On Wed, Nov 16, 2022 at 1:46 PM John Naylor
<john.naylor@enterprisedb.com> wrote:On Tue, Nov 15, 2022 at 11:59 AM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
Thanks! Please let me know if there is something I can help with.
I didn't get very far because the tests fail on 0004 in rt_verify_node:
TRAP: failed Assert("n4->chunks[i - 1] < n4->chunks[i]"), File:
"../src/backend/lib/radixtree.c", Line: 2186, PID: 18242
Which tests do you use to get this assertion failure? I've confirmed
there is a bug in 0005 patch but without it, "make check-world"
passed.
Hmm, I started over and rebuilt and it didn't reproduce. Not sure what
happened, sorry for the noise.
I'm attaching a test I wrote to stress test branch prediction in search,
and while trying it out I found two possible issues.
It's based on the random int load test, but tests search speed. Run like
this:
select * from bench_search_random_nodes(10 * 1000 * 1000)
It also takes some care to include all the different node kinds,
restricting the possible keys by AND-ing with a filter. Here's a simple
demo:
filter = ((uint64)1<<40)-1;
LOG: num_keys = 9999967, height = 4, n4 = 17513814, n32 = 6320, n128 =
62663, n256 = 3130
Just using random integers leads to >99% using the smallest node. I wanted
to get close to having the same number of each, but that's difficult while
still using random inputs. I ended up using
filter = (((uint64) 0x7F<<32) | (0x07<<24) | (0xFF<<16) | 0xFF)
which gives
LOG: num_keys = 9291812, height = 4, n4 = 262144, n32 = 79603, n128 =
182670, n256 = 1024
Which seems okay for the task. One puzzling thing I found while trying
various filters is that sometimes the reported tree height would change.
For example:
filter = (((uint64) 1<<32) | (0xFF<<24));
LOG: num_keys = 9999944, height = 7, n4 = 47515559, n32 = 6209, n128 =
62632, n256 = 3161
1) Any idea why the tree height would be reported as 7 here? I didn't
expect that.
2) It seems that 0004 actually causes a significant slowdown in this test
(as in the attached, using the second filter above and with turboboost
disabled):
v9 0003: 2062 2051 2050
v9 0004: 2346 2316 2321
That means my idea for the pointer struct might have some problems, at
least as currently implemented. Maybe in the course of separating out and
polishing that piece, an inefficiency will fall out. Or, it might be
another reason to template local and shared separately. Not sure yet. I
also haven't tried to adjust this test for the shared memory case.
--
John Naylor
EDB: http://www.enterprisedb.com
Attachments:
add-random-node-search-test.patch.txt
diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
index 0874201d7e..e0205b364e 100644
--- a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
+++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
@@ -43,6 +43,14 @@ returns record
as 'MODULE_PATHNAME'
LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+create function bench_search_random_nodes(
+cnt int8,
+OUT mem_allocated int8,
+OUT search_ms int8)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
create function bench_fixed_height_search(
fanout int4,
OUT fanout int4,
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c
index 7abb237e96..a43fc61c2d 100644
--- a/contrib/bench_radix_tree/bench_radix_tree.c
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -29,6 +29,7 @@ PG_FUNCTION_INFO_V1(bench_seq_search);
PG_FUNCTION_INFO_V1(bench_shuffle_search);
PG_FUNCTION_INFO_V1(bench_load_random_int);
PG_FUNCTION_INFO_V1(bench_fixed_height_search);
+PG_FUNCTION_INFO_V1(bench_search_random_nodes);
static uint64
tid_to_key_off(ItemPointer tid, uint32 *off)
@@ -347,6 +348,77 @@ bench_load_random_int(PG_FUNCTION_ARGS)
PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
}
+/* copy of splitmix64() */
+static uint64
+hash64(uint64 x)
+{
+ x ^= x >> 30;
+ x *= UINT64CONST(0xbf58476d1ce4e5b9);
+ x ^= x >> 27;
+ x *= UINT64CONST(0x94d049bb133111eb);
+ x ^= x >> 31;
+ return x;
+}
+
+/* attempts to have a relatively even population of node kinds */
+Datum
+bench_search_random_nodes(PG_FUNCTION_ARGS)
+{
+ uint64 cnt = (uint64) PG_GETARG_INT64(0);
+ radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 search_time_ms;
+ Datum values[2] = {0};
+ bool nulls[2] = {0};
+ /* from trial and error */
+ const uint64 filter = (((uint64) 0x7F<<32) | (0x07<<24) | (0xFF<<16) | 0xFF);
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ rt = rt_create(CurrentMemoryContext);
+
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ const uint64 hash = hash64(i);
+ const uint64 key = hash & filter;
+
+ rt_set(rt, key, key);
+ }
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ const uint64 hash = hash64(i);
+ const uint64 key = hash & filter;
+ uint64 val;
+ volatile bool ret; /* prevent calling rt_search from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ ret = rt_search(rt, key, &val);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ search_time_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ values[0] = Int64GetDatum(rt_memory_usage(rt));
+ values[1] = Int64GetDatum(search_time_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
Datum
bench_fixed_height_search(PG_FUNCTION_ARGS)
{
On Wed, Nov 16, 2022 at 4:39 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Wed, Nov 16, 2022 at 12:33 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Wed, Nov 16, 2022 at 1:46 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Tue, Nov 15, 2022 at 11:59 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Thanks! Please let me know if there is something I can help with.
I didn't get very far because the tests fail on 0004 in rt_verify_node:
TRAP: failed Assert("n4->chunks[i - 1] < n4->chunks[i]"), File: "../src/backend/lib/radixtree.c", Line: 2186, PID: 18242
Which tests do you use to get this assertion failure? I've confirmed
there is a bug in 0005 patch but without it, "make check-world"
passed.
Hmm, I started over and rebuilt and it didn't reproduce. Not sure what happened, sorry for the noise.
Good to know. No problem.
I'm attaching a test I wrote to stress test branch prediction in search, and while trying it out I found two possible issues.
Thank you for testing!
It's based on the random int load test, but tests search speed. Run like this:
select * from bench_search_random_nodes(10 * 1000 * 1000)
It also takes some care to include all the different node kinds, restricting the possible keys by AND-ing with a filter. Here's a simple demo:
filter = ((uint64)1<<40)-1;
LOG: num_keys = 9999967, height = 4, n4 = 17513814, n32 = 6320, n128 = 62663, n256 = 3130
Just using random integers leads to >99% using the smallest node. I wanted to get close to having the same number of each, but that's difficult while still using random inputs. I ended up using
filter = (((uint64) 0x7F<<32) | (0x07<<24) | (0xFF<<16) | 0xFF)
which gives
LOG: num_keys = 9291812, height = 4, n4 = 262144, n32 = 79603, n128 = 182670, n256 = 1024
Which seems okay for the task. One puzzling thing I found while trying various filters is that sometimes the reported tree height would change. For example:
filter = (((uint64) 1<<32) | (0xFF<<24));
LOG: num_keys = 9999944, height = 7, n4 = 47515559, n32 = 6209, n128 = 62632, n256 = 3161
1) Any idea why the tree height would be reported as 7 here? I didn't expect that.
In my environment, (0xFF<<24) is 0xFFFFFFFFFF000000, not 0xFF000000.
It seems the filter should be (((uint64) 1<<32) | ((uint64)
0xFF<<24)).
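To illustrate, here is a tiny standalone example (just an illustration I'm adding, not part of any patch; it assumes the usual 32-bit int):

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
    /* 0xFF is a signed int, so 0xFF << 24 sets the sign bit; converting
     * that negative int to a 64-bit unsigned type sign-extends the upper
     * 32 bits, which is why the tree ends up much taller than expected. */
    uint64_t    bad = (((uint64_t) 1 << 32) | (0xFF << 24));
    uint64_t    good = (((uint64_t) 1 << 32) | ((uint64_t) 0xFF << 24));

    printf("bad  = %016llx\n", (unsigned long long) bad);  /* ffffffffff000000 */
    printf("good = %016llx\n", (unsigned long long) good); /* 00000001ff000000 */
    return 0;
}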
2) It seems that 0004 actually causes a significant slowdown in this test (as in the attached, using the second filter above and with turboboost disabled):
v9 0003: 2062 2051 2050
v9 0004: 2346 2316 2321
That means my idea for the pointer struct might have some problems, at least as currently implemented. Maybe in the course of separating out and polishing that piece, an inefficiency will fall out. Or, it might be another reason to template local and shared separately. Not sure yet. I also haven't tried to adjust this test for the shared memory case.
I'll also run the test on my environment and do the investigation tomorrow.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Wed, Sep 28, 2022 at 1:18 PM I wrote:
Along those lines, one thing I've been thinking about is the number of
size classes. There is a tradeoff between memory efficiency and number of
branches when searching/inserting. My current thinking is there is too much
coupling between size class and data type. Each size class currently uses a
different data type and a different algorithm to search and set it, which
in turn requires another branch. We've found that a larger number of size
classes leads to poor branch prediction [1] and (I imagine) code density.
I'm thinking we can use "flexible array members" for the values/pointers,
and keep the rest of the control data in the struct the same. That way, we
never have more than 4 actual "kinds" to code and branch on. As a bonus,
when migrating a node to a larger size class of the same kind, we can
simply repalloc() to the next size.
While the most important challenge right now is how to best represent and
organize the shared memory case, I wanted to get the above idea working and
out of the way, to be saved for a future time. I've attached a rough
implementation (applies on top of v9 0003) that splits node32 into 2 size
classes. They both share the exact same base data type and hence the same
search/set code, so the number of "kind"s is still four, but here there are
five "size classes", so a new case in the "unlikely" node-growing path. The
smaller instance of node32 is a "node15", because that's currently 160
bytes, corresponding to one of the DSA size classes. This idea can be
applied to any other node except the max size, as we see fit. (Adding a
singleton size class would bring it back in line with the prototype, at
least as far as memory consumption.)
One issue with this patch: The "fanout" member is a uint8, so it can't hold
256 for the largest node kind. That's not an issue in practice, since we
never need to grow it, and we only compare that value with the count in an
Assert(), so I just set it to zero. That does break an invariant, so it's
not great. We could use 2 bytes to be strictly correct in all cases, but
that limits what we can do with the smallest node kind.
In the course of working on this, I encountered a pain point. Since it's
impossible to repalloc in slab, we have to do alloc/copy/free ourselves.
That's fine, but the current coding makes too many assumptions about the
use cases: rt_alloc_node and rt_copy_node are too entangled with each other
and do too much work unrelated to what the names imply. I seem to remember
an earlier version had something like rt_node_copy_common that did
only...copying. That was much easier to reason about. In 0002 I resorted to
doing my own allocation to show what I really want to do, because the new
use case doesn't need zeroing and setting values. It only needs
to...allocate (and increase the stats counter if built that way).
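For reference, the narrow helper I have in mind would be roughly the following sketch (hypothetical name and shape, not taken from any posted patch):

static inline void
rt_node_copy_common(rt_node *dst, const rt_node *src)
{
    /* copy only the header fields that carry over to the new node;
     * allocation, kind/fanout, and the stats counter stay with the caller */
    dst->count = src->count;
    dst->shift = src->shift;
    dst->chunk = src->chunk;
}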
Future optimization work while I'm thinking of it: rt_alloc_node should be
always-inlined and the memset done separately (i.e. not *AllocZero). That
way the compiler should be able to generate more efficient zeroing code for
smaller nodes. I'll test the numbers on this sometime in the future.
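Roughly like this sketch (the name is made up and the real function takes more parameters; this is only meant to show the shape of the idea):

static pg_attribute_always_inline rt_node *
rt_alloc_node_raw(radix_tree *tree, rt_size_class size_class, bool inner)
{
    MemoryContext slab = inner ? tree->inner_slabs[size_class]
                               : tree->leaf_slabs[size_class];
    Size        size = inner ? rt_size_class_info[size_class].inner_size
                             : rt_size_class_info[size_class].leaf_size;

    /* no zeroing here -- the caller memsets with a compile-time-constant
     * size, so the compiler can specialize it per node size */
    return (rt_node *) MemoryContextAlloc(slab, size);
}

The call site would then do the memset itself with a sizeof() expression, which is a compile-time constant for each node struct.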
--
John Naylor
EDB: http://www.enterprisedb.com
Attachments:
v901-0002-Make-node32-variable-sized.patch.txt
From 6fcc970ae7e31f44fa6b6aface983cadb023cc50 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Thu, 17 Nov 2022 16:10:44 +0700
Subject: [PATCH v901 2/2] Make node32 variable sized
Add a size class for 15 elements, which corresponds to 160 bytes,
an allocation size used by DSA. When a 16th element is to be
inserted, allocate a larger area and memcpy the entire old node
to it.
NB: Zeroing the new area is only necessary if it's for an
inner node128, since insert logic must check for null child
pointers.
This technique allows us to limit the node kinds to 4, which
1. limits the number of cases in switch statements
2. allows a possible future optimization to encode the node kind
in a pointer tag
---
src/backend/lib/radixtree.c | 141 +++++++++++++++++++++++++++---------
1 file changed, 108 insertions(+), 33 deletions(-)
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
index bef1a438ab..f368e750d5 100644
--- a/src/backend/lib/radixtree.c
+++ b/src/backend/lib/radixtree.c
@@ -130,6 +130,7 @@ typedef enum
typedef enum rt_size_class
{
RT_CLASS_4_FULL = 0,
+ RT_CLASS_32_PARTIAL,
RT_CLASS_32_FULL,
RT_CLASS_128_FULL,
RT_CLASS_256
@@ -147,6 +148,8 @@ typedef struct rt_node
uint16 count;
/* Max number of children. We can use uint8 because we never need to store 256 */
+ /* WIP: if we don't have a variable sized node4, this should instead be in the base
+ types as needed, since saving every byte is crucial for the smallest node kind */
uint8 fanout;
/*
@@ -166,6 +169,8 @@ typedef struct rt_node
((node)->base.n.count < (node)->base.n.fanout)
/* Base type of each node kinds for leaf and inner nodes */
+/* The base types must be able to accommodate the largest size
+class for variable-sized node kinds */
typedef struct rt_node_base_4
{
rt_node n;
@@ -217,40 +222,40 @@ typedef struct rt_node_inner_4
{
rt_node_base_4 base;
- /* 4 children, for key chunks */
- rt_node *children[4];
+ /* number of children depends on size class */
+ rt_node *children[FLEXIBLE_ARRAY_MEMBER];
} rt_node_inner_4;
typedef struct rt_node_leaf_4
{
rt_node_base_4 base;
- /* 4 values, for key chunks */
- uint64 values[4];
+ /* number of values depends on size class */
+ uint64 values[FLEXIBLE_ARRAY_MEMBER];
} rt_node_leaf_4;
typedef struct rt_node_inner_32
{
rt_node_base_32 base;
- /* 32 children, for key chunks */
- rt_node *children[32];
+ /* number of children depends on size class */
+ rt_node *children[FLEXIBLE_ARRAY_MEMBER];
} rt_node_inner_32;
typedef struct rt_node_leaf_32
{
rt_node_base_32 base;
- /* 32 values, for key chunks */
- uint64 values[32];
+ /* number of values depends on size class */
+ uint64 values[FLEXIBLE_ARRAY_MEMBER];
} rt_node_leaf_32;
typedef struct rt_node_inner_128
{
rt_node_base_128 base;
- /* Slots for 128 children */
- rt_node *children[128];
+ /* number of children depends on size class */
+ rt_node *children[FLEXIBLE_ARRAY_MEMBER];
} rt_node_inner_128;
typedef struct rt_node_leaf_128
@@ -260,8 +265,8 @@ typedef struct rt_node_leaf_128
/* isset is a bitmap to track which slot is in use */
uint8 isset[RT_NODE_NSLOTS_BITS(128)];
- /* Slots for 128 values */
- uint64 values[128];
+ /* number of values depends on size class */
+ uint64 values[FLEXIBLE_ARRAY_MEMBER];
} rt_node_leaf_128;
/*
@@ -307,32 +312,40 @@ typedef struct rt_size_class_elem
* from the block.
*/
#define NODE_SLAB_BLOCK_SIZE(size) \
- Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * size, (size) * 32)
+ Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * (size), (size) * 32)
static rt_size_class_elem rt_size_class_info[RT_SIZE_CLASS_COUNT] = {
[RT_CLASS_4_FULL] = {
.name = "radix tree node 4",
.fanout = 4,
- .inner_size = sizeof(rt_node_inner_4),
- .leaf_size = sizeof(rt_node_leaf_4),
- .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_4)),
- .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_4)),
+ .inner_size = sizeof(rt_node_inner_4) + 4 * sizeof(rt_node *),
+ .leaf_size = sizeof(rt_node_leaf_4) + 4 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_4) + 4 * sizeof(rt_node *)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_4) + 4 * sizeof(uint64)),
+ },
+ [RT_CLASS_32_PARTIAL] = {
+ .name = "radix tree node 15",
+ .fanout = 15,
+ .inner_size = sizeof(rt_node_inner_32) + 15 * sizeof(rt_node *),
+ .leaf_size = sizeof(rt_node_leaf_32) + 15 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_32) + 15 * sizeof(rt_node *)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_32) + 15 * sizeof(uint64)),
},
[RT_CLASS_32_FULL] = {
.name = "radix tree node 32",
.fanout = 32,
- .inner_size = sizeof(rt_node_inner_32),
- .leaf_size = sizeof(rt_node_leaf_32),
- .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_32)),
- .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_32)),
+ .inner_size = sizeof(rt_node_inner_32) + 32 * sizeof(rt_node *),
+ .leaf_size = sizeof(rt_node_leaf_32) + 32 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_32) + 32 * sizeof(rt_node *)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_32) + 32 * sizeof(uint64)),
},
[RT_CLASS_128_FULL] = {
.name = "radix tree node 128",
.fanout = 128,
- .inner_size = sizeof(rt_node_inner_128),
- .leaf_size = sizeof(rt_node_leaf_128),
- .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_128)),
- .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_128)),
+ .inner_size = sizeof(rt_node_inner_128) + 128 * sizeof(rt_node *),
+ .leaf_size = sizeof(rt_node_leaf_128) + 128 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_128) + 128 * sizeof(rt_node *)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_128) + 128 * sizeof(uint64)),
},
[RT_CLASS_256] = {
.name = "radix tree node 256",
@@ -922,7 +935,6 @@ rt_free_node(radix_tree *tree, rt_node *node)
#ifdef RT_DEBUG
/* update the statistics */
- // FIXME
for (i = 0; i < RT_SIZE_CLASS_COUNT; i++)
{
if (node->fanout == rt_size_class_info[i].fanout)
@@ -1240,7 +1252,7 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
/* grow node from 4 to 32 */
new32 = (rt_node_inner_32 *) rt_copy_node(tree, (rt_node *) n4,
- RT_NODE_KIND_32, RT_CLASS_32_FULL);
+ RT_NODE_KIND_32, RT_CLASS_32_PARTIAL);
chunk_children_array_copy(n4->base.chunks, n4->children,
new32->base.chunks, new32->children,
n4->base.n.count);
@@ -1282,6 +1294,37 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
if (unlikely(!NODE_HAS_FREE_SLOT(n32)))
{
+ Assert(parent != NULL);
+
+ if (n32->base.n.fanout == rt_size_class_info[RT_CLASS_32_PARTIAL].fanout)
+ {
+ /* use the same node kind, but expand to the next size class */
+
+ /* no need to zero the new memory */
+ rt_node_inner_32 *new32 =
+ (rt_node_inner_32 *) MemoryContextAlloc(tree->inner_slabs[RT_CLASS_32_FULL],
+ rt_size_class_info[RT_CLASS_32_FULL].inner_size);
+
+// FIXME the API for rt_alloc_node and rt_node_copy are too entangled
+#ifdef RT_DEBUG
+ /* update the statistics */
+ tree->cnt[RT_CLASS_32_FULL]++;
+#endif
+ /* copy the entire old node -- the new node is only different in having
+ additional slots so we only have to change the fanout */
+ memcpy(new32, n32, rt_size_class_info[RT_CLASS_32_PARTIAL].inner_size);
+ new32->base.n.fanout = rt_size_class_info[RT_CLASS_32_FULL].fanout;
+
+ rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new32,
+ key);
+
+ /* must update both pointers here */
+ node = (rt_node *) new32;
+ n32 = new32;
+ goto retry_insert_inner_32;
+ }
+ else
+ {
rt_node_inner_128 *new128;
/* grow node from 32 to 128 */
@@ -1290,13 +1333,14 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
for (int i = 0; i < n32->base.n.count; i++)
node_inner_128_insert(new128, n32->base.chunks[i], n32->children[i]);
- Assert(parent != NULL);
rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new128,
key);
node = (rt_node *) new128;
+ }
}
else
{
+retry_insert_inner_32:
int insertpos = node_32_get_insertpos((rt_node_base_32 *) n32, chunk);
int16 count = n32->base.n.count;
@@ -1409,12 +1453,10 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
/* grow node from 4 to 32 */
new32 = (rt_node_leaf_32 *) rt_copy_node(tree, (rt_node *) n4,
- RT_NODE_KIND_32, RT_CLASS_32_FULL);
+ RT_NODE_KIND_32, RT_CLASS_32_PARTIAL);
chunk_values_array_copy(n4->base.chunks, n4->values,
new32->base.chunks, new32->values,
n4->base.n.count);
-
- Assert(parent != NULL);
rt_replace_node(tree, parent, (rt_node *) n4, (rt_node *) new32,
key);
node = (rt_node *) new32;
@@ -1451,6 +1493,37 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
if (unlikely(!NODE_HAS_FREE_SLOT(n32)))
{
+ Assert(parent != NULL);
+
+ if (n32->base.n.fanout == rt_size_class_info[RT_CLASS_32_PARTIAL].fanout)
+ {
+ /* use the same node kind, but expand to the next size class */
+
+ /* no need to zero the new memory */
+ rt_node_leaf_32 *new32 =
+ (rt_node_leaf_32 *) MemoryContextAlloc(tree->leaf_slabs[RT_CLASS_32_FULL],
+ rt_size_class_info[RT_CLASS_32_FULL].leaf_size);
+
+// FIXME the API for rt_alloc_node and rt_node_copy are too entangled
+#ifdef RT_DEBUG
+ /* update the statistics */
+ tree->cnt[RT_CLASS_32_FULL]++;
+#endif
+ /* copy the entire old node -- the new node is only different in having
+ additional slots so we only have to change the fanout */
+ memcpy(new32, n32, rt_size_class_info[RT_CLASS_32_PARTIAL].leaf_size);
+ new32->base.n.fanout = rt_size_class_info[RT_CLASS_32_FULL].fanout;
+
+ rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new32,
+ key);
+
+ /* must update both pointers here */
+ node = (rt_node *) new32;
+ n32 = new32;
+ goto retry_insert_leaf_32;
+ }
+ else
+ {
rt_node_leaf_128 *new128;
/* grow node from 32 to 128 */
@@ -1459,13 +1532,14 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
for (int i = 0; i < n32->base.n.count; i++)
node_leaf_128_insert(new128, n32->base.chunks[i], n32->values[i]);
- Assert(parent != NULL);
rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new128,
key);
node = (rt_node *) new128;
+ }
}
else
{
+retry_insert_leaf_32:
int insertpos = node_32_get_insertpos((rt_node_base_32 *) n32, chunk);
int count = n32->base.n.count;
@@ -2189,10 +2263,11 @@ rt_verify_node(rt_node *node)
void
rt_stats(radix_tree *tree)
{
- ereport(LOG, (errmsg("num_keys = %lu, height = %u, n4 = %u, n32 = %u, n128 = %u, n256 = %u",
+ ereport(NOTICE, (errmsg("num_keys = %lu, height = %u, n4 = %u, n15 = %u, n32 = %u, n128 = %u, n256 = %u",
tree->num_keys,
tree->root->shift / RT_NODE_SPAN,
tree->cnt[RT_CLASS_4_FULL],
+ tree->cnt[RT_CLASS_32_PARTIAL],
tree->cnt[RT_CLASS_32_FULL],
tree->cnt[RT_CLASS_128_FULL],
tree->cnt[RT_CLASS_256])));
--
2.38.1
v901-0001-Preparatory-refactoring-for-decoupling-kind-fro.patch.txt
From 15e16df13912d265c3b1eda858456de6fe595c33 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Thu, 17 Nov 2022 12:10:31 +0700
Subject: [PATCH v901 1/2] Preparatory refactoring for decoupling kind from
size class
Rename the current kind info array to refer to size classes, but
keep all the contents the same.
Add a fanout member to all nodes which stores the max capacity of
the node. This is currently set with the same hardcoded value as
in the kind info array.
In passing, remove outdated reference to node16 in the regression
test.
---
src/backend/lib/radixtree.c | 147 +++++++++++-------
.../modules/test_radixtree/test_radixtree.c | 1 -
2 files changed, 87 insertions(+), 61 deletions(-)
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
index bd58b2bfad..bef1a438ab 100644
--- a/src/backend/lib/radixtree.c
+++ b/src/backend/lib/radixtree.c
@@ -127,6 +127,16 @@ typedef enum
#define RT_NODE_KIND_256 0x03
#define RT_NODE_KIND_COUNT 4
+typedef enum rt_size_class
+{
+ RT_CLASS_4_FULL = 0,
+ RT_CLASS_32_FULL,
+ RT_CLASS_128_FULL,
+ RT_CLASS_256
+
+#define RT_SIZE_CLASS_COUNT (RT_CLASS_256 + 1)
+} rt_size_class;
+
/* Common type for all nodes types */
typedef struct rt_node
{
@@ -136,6 +146,9 @@ typedef struct rt_node
*/
uint16 count;
+ /* Max number of children. We can use uint8 because we never need to store 256 */
+ uint8 fanout;
+
/*
* Shift indicates which part of the key space is represented by this
* node. That is, the key is shifted by 'shift' and the lowest
@@ -144,13 +157,13 @@ typedef struct rt_node
uint8 shift;
uint8 chunk;
- /* Size kind of the node */
+ /* Node kind, one per search/set algorithm */
uint8 kind;
} rt_node;
#define NODE_IS_LEAF(n) (((rt_node *) (n))->shift == 0)
#define NODE_IS_EMPTY(n) (((rt_node *) (n))->count == 0)
-#define NODE_HAS_FREE_SLOT(n) \
- (((rt_node *) (n))->count < rt_node_kind_info[((rt_node *) (n))->kind].fanout)
+#define NODE_HAS_FREE_SLOT(node) \
+ ((node)->base.n.count < (node)->base.n.fanout)
/* Base type of each node kinds for leaf and inner nodes */
typedef struct rt_node_base_4
@@ -190,7 +203,7 @@ typedef struct rt_node_base256
/*
* Inner and leaf nodes.
*
- * There are separate from inner node size classes for two main reasons:
+ * These are separate for two main reasons:
*
* 1) the value type might be different than something fitting into a pointer
* width type
@@ -274,8 +287,8 @@ typedef struct rt_node_leaf_256
uint64 values[RT_NODE_MAX_SLOTS];
} rt_node_leaf_256;
-/* Information of each size kinds */
-typedef struct rt_node_kind_info_elem
+/* Information for each size class */
+typedef struct rt_size_class_elem
{
const char *name;
int fanout;
@@ -287,7 +300,7 @@ typedef struct rt_node_kind_info_elem
/* slab block size */
Size inner_blocksize;
Size leaf_blocksize;
-} rt_node_kind_info_elem;
+} rt_size_class_elem;
/*
* Calculate the slab blocksize so that we can allocate at least 32 chunks
@@ -295,9 +308,9 @@ typedef struct rt_node_kind_info_elem
*/
#define NODE_SLAB_BLOCK_SIZE(size) \
Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * size, (size) * 32)
-static rt_node_kind_info_elem rt_node_kind_info[RT_NODE_KIND_COUNT] = {
+static rt_size_class_elem rt_size_class_info[RT_SIZE_CLASS_COUNT] = {
- [RT_NODE_KIND_4] = {
+ [RT_CLASS_4_FULL] = {
.name = "radix tree node 4",
.fanout = 4,
.inner_size = sizeof(rt_node_inner_4),
@@ -305,7 +318,7 @@ static rt_node_kind_info_elem rt_node_kind_info[RT_NODE_KIND_COUNT] = {
.inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_4)),
.leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_4)),
},
- [RT_NODE_KIND_32] = {
+ [RT_CLASS_32_FULL] = {
.name = "radix tree node 32",
.fanout = 32,
.inner_size = sizeof(rt_node_inner_32),
@@ -313,7 +326,7 @@ static rt_node_kind_info_elem rt_node_kind_info[RT_NODE_KIND_COUNT] = {
.inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_32)),
.leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_32)),
},
- [RT_NODE_KIND_128] = {
+ [RT_CLASS_128_FULL] = {
.name = "radix tree node 128",
.fanout = 128,
.inner_size = sizeof(rt_node_inner_128),
@@ -321,9 +334,11 @@ static rt_node_kind_info_elem rt_node_kind_info[RT_NODE_KIND_COUNT] = {
.inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_128)),
.leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_128)),
},
- [RT_NODE_KIND_256] = {
+ [RT_CLASS_256] = {
.name = "radix tree node 256",
- .fanout = 256,
+ /* technically it's 256, but we can't store that in a uint8,
+ and this is the max size class so it will never grow */
+ .fanout = 0,
.inner_size = sizeof(rt_node_inner_256),
.leaf_size = sizeof(rt_node_leaf_256),
.inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_256)),
@@ -372,17 +387,17 @@ struct radix_tree
uint64 max_val;
uint64 num_keys;
- MemoryContextData *inner_slabs[RT_NODE_KIND_COUNT];
- MemoryContextData *leaf_slabs[RT_NODE_KIND_COUNT];
+ MemoryContextData *inner_slabs[RT_SIZE_CLASS_COUNT];
+ MemoryContextData *leaf_slabs[RT_SIZE_CLASS_COUNT];
/* statistics */
#ifdef RT_DEBUG
- int32 cnt[RT_NODE_KIND_COUNT];
+ int32 cnt[RT_SIZE_CLASS_COUNT];
#endif
};
static void rt_new_root(radix_tree *tree, uint64 key);
-static rt_node *rt_alloc_node(radix_tree *tree, int kind, uint8 shift, uint8 chunk,
+static rt_node *rt_alloc_node(radix_tree *tree, int kind, rt_size_class size_class, uint8 shift, uint8 chunk,
bool inner);
static void rt_free_node(radix_tree *tree, rt_node *node);
static void rt_extend(radix_tree *tree, uint64 key);
@@ -584,7 +599,7 @@ chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
uint8 *dst_chunks, rt_node **dst_children, int count)
{
/* For better code generation */
- if (count > rt_node_kind_info[RT_NODE_KIND_4].fanout)
+ if (count > rt_size_class_info[RT_CLASS_4_FULL].fanout)
pg_unreachable();
memcpy(dst_chunks, src_chunks, sizeof(uint8) * count);
@@ -596,7 +611,7 @@ chunk_values_array_copy(uint8 *src_chunks, uint64 *src_values,
uint8 *dst_chunks, uint64 *dst_values, int count)
{
/* For better code generation */
- if (count > rt_node_kind_info[RT_NODE_KIND_4].fanout)
+ if (count > rt_size_class_info[RT_CLASS_4_FULL].fanout)
pg_unreachable();
memcpy(dst_chunks, src_chunks, sizeof(uint8) * count);
@@ -837,7 +852,7 @@ rt_new_root(radix_tree *tree, uint64 key)
int shift = key_get_shift(key);
rt_node *node;
- node = (rt_node *) rt_alloc_node(tree, RT_NODE_KIND_4, shift, 0,
+ node = (rt_node *) rt_alloc_node(tree, RT_NODE_KIND_4, RT_CLASS_4_FULL, shift, 0,
shift > 0);
tree->max_val = shift_get_max_val(shift);
tree->root = node;
@@ -847,18 +862,19 @@ rt_new_root(radix_tree *tree, uint64 key)
* Allocate a new node with the given node kind.
*/
static rt_node *
-rt_alloc_node(radix_tree *tree, int kind, uint8 shift, uint8 chunk, bool inner)
+rt_alloc_node(radix_tree *tree, int kind, rt_size_class size_class, uint8 shift, uint8 chunk, bool inner)
{
rt_node *newnode;
if (inner)
- newnode = (rt_node *) MemoryContextAllocZero(tree->inner_slabs[kind],
- rt_node_kind_info[kind].inner_size);
+ newnode = (rt_node *) MemoryContextAllocZero(tree->inner_slabs[size_class],
+ rt_size_class_info[size_class].inner_size);
else
- newnode = (rt_node *) MemoryContextAllocZero(tree->leaf_slabs[kind],
- rt_node_kind_info[kind].leaf_size);
+ newnode = (rt_node *) MemoryContextAllocZero(tree->leaf_slabs[size_class],
+ rt_size_class_info[size_class].leaf_size);
newnode->kind = kind;
+ newnode->fanout = rt_size_class_info[size_class].fanout;
newnode->shift = shift;
newnode->chunk = chunk;
@@ -872,7 +888,7 @@ rt_alloc_node(radix_tree *tree, int kind, uint8 shift, uint8 chunk, bool inner)
#ifdef RT_DEBUG
/* update the statistics */
- tree->cnt[kind]++;
+ tree->cnt[size_class]++;
#endif
return newnode;
@@ -883,11 +899,11 @@ rt_alloc_node(radix_tree *tree, int kind, uint8 shift, uint8 chunk, bool inner)
* count of 'node'.
*/
static rt_node *
-rt_copy_node(radix_tree *tree, rt_node *node, int new_kind)
+rt_copy_node(radix_tree *tree, rt_node *node, int new_kind, rt_size_class new_size_class)
{
rt_node *newnode;
- newnode = rt_alloc_node(tree, new_kind, node->shift, node->chunk,
+ newnode = rt_alloc_node(tree, new_kind, new_size_class, node->shift, node->chunk,
node->shift > 0);
newnode->count = node->count;
@@ -898,14 +914,22 @@ rt_copy_node(radix_tree *tree, rt_node *node, int new_kind)
static void
rt_free_node(radix_tree *tree, rt_node *node)
{
+ int i;
+
/* If we're deleting the root node, make the tree empty */
if (tree->root == node)
tree->root = NULL;
#ifdef RT_DEBUG
/* update the statistics */
- tree->cnt[node->kind]--;
- Assert(tree->cnt[node->kind] >= 0);
+ // FIXME
+ for (i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ if (node->fanout == rt_size_class_info[i].fanout)
+ break;
+ }
+ tree->cnt[i]--;
+ Assert(tree->cnt[i] >= 0);
#endif
pfree(node);
@@ -954,7 +978,7 @@ rt_extend(radix_tree *tree, uint64 key)
{
rt_node_inner_4 *node;
- node = (rt_node_inner_4 *) rt_alloc_node(tree, RT_NODE_KIND_4,
+ node = (rt_node_inner_4 *) rt_alloc_node(tree, RT_NODE_KIND_4, RT_CLASS_4_FULL,
shift, 0, true);
node->base.n.count = 1;
node->base.chunks[0] = 0;
@@ -984,7 +1008,7 @@ rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node *parent,
rt_node *newchild;
int newshift = shift - RT_NODE_SPAN;
- newchild = rt_alloc_node(tree, RT_NODE_KIND_4, newshift,
+ newchild = rt_alloc_node(tree, RT_NODE_KIND_4, RT_CLASS_4_FULL, newshift,
RT_GET_KEY_CHUNK(key, node->shift),
newshift > 0);
rt_node_insert_inner(tree, parent, node, key, newchild);
@@ -1216,7 +1240,7 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
/* grow node from 4 to 32 */
new32 = (rt_node_inner_32 *) rt_copy_node(tree, (rt_node *) n4,
- RT_NODE_KIND_32);
+ RT_NODE_KIND_32, RT_CLASS_32_FULL);
chunk_children_array_copy(n4->base.chunks, n4->children,
new32->base.chunks, new32->children,
n4->base.n.count);
@@ -1262,7 +1286,7 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
/* grow node from 32 to 128 */
new128 = (rt_node_inner_128 *) rt_copy_node(tree, (rt_node *) n32,
- RT_NODE_KIND_128);
+ RT_NODE_KIND_128, RT_CLASS_128_FULL);
for (int i = 0; i < n32->base.n.count; i++)
node_inner_128_insert(new128, n32->base.chunks[i], n32->children[i]);
@@ -1305,7 +1329,7 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
/* grow node from 128 to 256 */
new256 = (rt_node_inner_256 *) rt_copy_node(tree, (rt_node *) n128,
- RT_NODE_KIND_256);
+ RT_NODE_KIND_256, RT_CLASS_256);
for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n128->base.n.count; i++)
{
if (!node_128_is_chunk_used((rt_node_base_128 *) n128, i))
@@ -1332,7 +1356,8 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
chunk_exists = node_inner_256_is_chunk_used(n256, chunk);
- Assert(chunk_exists || NODE_HAS_FREE_SLOT(n256));
+ Assert(n256->base.n.fanout == 0);
+ Assert(chunk_exists || ((rt_node *) n256)->count < RT_NODE_MAX_SLOTS);
node_inner_256_set(n256, chunk, child);
break;
@@ -1384,7 +1409,7 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
/* grow node from 4 to 32 */
new32 = (rt_node_leaf_32 *) rt_copy_node(tree, (rt_node *) n4,
- RT_NODE_KIND_32);
+ RT_NODE_KIND_32, RT_CLASS_32_FULL);
chunk_values_array_copy(n4->base.chunks, n4->values,
new32->base.chunks, new32->values,
n4->base.n.count);
@@ -1430,7 +1455,7 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
/* grow node from 32 to 128 */
new128 = (rt_node_leaf_128 *) rt_copy_node(tree, (rt_node *) n32,
- RT_NODE_KIND_128);
+ RT_NODE_KIND_128, RT_CLASS_128_FULL);
for (int i = 0; i < n32->base.n.count; i++)
node_leaf_128_insert(new128, n32->base.chunks[i], n32->values[i]);
@@ -1473,7 +1498,7 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
/* grow node from 128 to 256 */
new256 = (rt_node_leaf_256 *) rt_copy_node(tree, (rt_node *) n128,
- RT_NODE_KIND_256);
+ RT_NODE_KIND_256, RT_CLASS_256);
for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n128->base.n.count; i++)
{
if (!node_128_is_chunk_used((rt_node_base_128 *) n128, i))
@@ -1500,7 +1525,8 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
chunk_exists = node_leaf_256_is_chunk_used(n256, chunk);
- Assert(chunk_exists || NODE_HAS_FREE_SLOT(n256));
+ Assert(((rt_node *) n256)->fanout == 0);
+ Assert(chunk_exists || ((rt_node *) n256)->count < 256);
node_leaf_256_set(n256, chunk, value);
break;
@@ -1538,16 +1564,16 @@ rt_create(MemoryContext ctx)
tree->num_keys = 0;
/* Create the slab allocator for each size class */
- for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
{
tree->inner_slabs[i] = SlabContextCreate(ctx,
- rt_node_kind_info[i].name,
- rt_node_kind_info[i].inner_blocksize,
- rt_node_kind_info[i].inner_size);
+ rt_size_class_info[i].name,
+ rt_size_class_info[i].inner_blocksize,
+ rt_size_class_info[i].inner_size);
tree->leaf_slabs[i] = SlabContextCreate(ctx,
- rt_node_kind_info[i].name,
- rt_node_kind_info[i].leaf_blocksize,
- rt_node_kind_info[i].leaf_size);
+ rt_size_class_info[i].name,
+ rt_size_class_info[i].leaf_blocksize,
+ rt_size_class_info[i].leaf_size);
#ifdef RT_DEBUG
tree->cnt[i] = 0;
#endif
@@ -1564,7 +1590,7 @@ rt_create(MemoryContext ctx)
void
rt_free(radix_tree *tree)
{
- for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
{
MemoryContextDelete(tree->inner_slabs[i]);
MemoryContextDelete(tree->leaf_slabs[i]);
@@ -2076,7 +2102,7 @@ rt_memory_usage(radix_tree *tree)
{
Size total = sizeof(radix_tree);
- for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
{
total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
@@ -2166,10 +2192,10 @@ rt_stats(radix_tree *tree)
ereport(LOG, (errmsg("num_keys = %lu, height = %u, n4 = %u, n32 = %u, n128 = %u, n256 = %u",
tree->num_keys,
tree->root->shift / RT_NODE_SPAN,
- tree->cnt[0],
- tree->cnt[1],
- tree->cnt[2],
- tree->cnt[3])));
+ tree->cnt[RT_CLASS_4_FULL],
+ tree->cnt[RT_CLASS_32_FULL],
+ tree->cnt[RT_CLASS_128_FULL],
+ tree->cnt[RT_CLASS_256])));
}
static void
@@ -2177,11 +2203,12 @@ rt_dump_node(rt_node *node, int level, bool recurse)
{
char space[128] = {0};
- fprintf(stderr, "[%s] kind %d, count %u, shift %u, chunk 0x%X:\n",
+ fprintf(stderr, "[%s] kind %d, fanout %d, count %u, shift %u, chunk 0x%X:\n",
NODE_IS_LEAF(node) ? "LEAF" : "INNR",
(node->kind == RT_NODE_KIND_4) ? 4 :
(node->kind == RT_NODE_KIND_32) ? 32 :
(node->kind == RT_NODE_KIND_128) ? 128 : 256,
+ node->fanout == 0 ? 256 : node->fanout,
node->count, node->shift, node->chunk);
if (level > 0)
@@ -2384,13 +2411,13 @@ rt_dump_search(radix_tree *tree, uint64 key)
void
rt_dump(radix_tree *tree)
{
- for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
fprintf(stderr, "%s\tinner_size%lu\tinner_blocksize %lu\tleaf_size %lu\tleaf_blocksize %lu\n",
- rt_node_kind_info[i].name,
- rt_node_kind_info[i].inner_size,
- rt_node_kind_info[i].inner_blocksize,
- rt_node_kind_info[i].leaf_size,
- rt_node_kind_info[i].leaf_blocksize);
+ rt_size_class_info[i].name,
+ rt_size_class_info[i].inner_size,
+ rt_size_class_info[i].inner_blocksize,
+ rt_size_class_info[i].leaf_size,
+ rt_size_class_info[i].leaf_blocksize);
fprintf(stderr, "max_val = %lu\n", tree->max_val);
if (!tree->root)
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
index cb3596755d..de1cd6cd70 100644
--- a/src/test/modules/test_radixtree/test_radixtree.c
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -40,7 +40,6 @@ static const bool rt_test_stats = false;
/* The maximum number of entries each node type can have */
static int rt_node_max_entries[] = {
4, /* RT_NODE_KIND_4 */
- 16, /* RT_NODE_KIND_16 */
32, /* RT_NODE_KIND_32 */
128, /* RT_NODE_KIND_128 */
256 /* RT_NODE_KIND_256 */
--
2.38.1
On Thu, Nov 17, 2022 at 12:24 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Wed, Nov 16, 2022 at 4:39 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Wed, Nov 16, 2022 at 12:33 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Wed, Nov 16, 2022 at 1:46 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Tue, Nov 15, 2022 at 11:59 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Thanks! Please let me know if there is something I can help with.
I didn't get very far because the tests fail on 0004 in rt_verify_node:
TRAP: failed Assert("n4->chunks[i - 1] < n4->chunks[i]"), File: "../src/backend/lib/radixtree.c", Line: 2186, PID: 18242
Which tests do you use to get this assertion failure? I've confirmed
there is a bug in 0005 patch but without it, "make check-world"
passed.
Hmm, I started over and rebuilt and it didn't reproduce. Not sure what happened, sorry for the noise.
Good to know. No problem.
I'm attaching a test I wrote to stress test branch prediction in search, and while trying it out I found two possible issues.
Thank you for testing!
It's based on the random int load test, but tests search speed. Run like this:
select * from bench_search_random_nodes(10 * 1000 * 1000)
It also takes some care to include all the different node kinds, restricting the possible keys by AND-ing with a filter. Here's a simple demo:
filter = ((uint64)1<<40)-1;
LOG: num_keys = 9999967, height = 4, n4 = 17513814, n32 = 6320, n128 = 62663, n256 = 3130
Just using random integers leads to >99% using the smallest node. I wanted to get close to having the same number of each, but that's difficult while still using random inputs. I ended up using
filter = (((uint64) 0x7F<<32) | (0x07<<24) | (0xFF<<16) | 0xFF)
which gives
LOG: num_keys = 9291812, height = 4, n4 = 262144, n32 = 79603, n128 = 182670, n256 = 1024
Which seems okay for the task. One puzzling thing I found while trying various filters is that sometimes the reported tree height would change. For example:
filter = (((uint64) 1<<32) | (0xFF<<24));
LOG: num_keys = 9999944, height = 7, n4 = 47515559, n32 = 6209, n128 = 62632, n256 = 3161
1) Any idea why the tree height would be reported as 7 here? I didn't expect that.
In my environment, (0xFF<<24) is 0xFFFFFFFFFF000000, not 0xFF000000.
It seems the filter should be (((uint64) 1<<32) | ((uint64)
0xFF<<24)).
2) It seems that 0004 actually causes a significant slowdown in this test (as in the attached, using the second filter above and with turboboost disabled):
v9 0003: 2062 2051 2050
v9 0004: 2346 2316 2321
That means my idea for the pointer struct might have some problems, at least as currently implemented. Maybe in the course of separating out and polishing that piece, an inefficiency will fall out. Or, it might be another reason to template local and shared separately. Not sure yet. I also haven't tried to adjust this test for the shared memory case.
I'll also run the test on my environment and do the investigation tomorrow.
FYI I've not tested the patch you shared today, but here are the
benchmark results I got with the v9 patch in my environment (I used
the second filter). I split the 0004 patch into two patches: a pure
refactoring patch to introduce rt_node_ptr, and a patch to do
pointer tagging.
v9 0003 patch : 1113 1114 1114
introduce rt_node_ptr: 1127 1128 1128
pointer tagging : 1085 1087 1086 (equivalent to 0004 patch)
In my environment, rt_node_ptr seemed to add some overhead, but
pointer tagging had performance benefits. I'm not sure why the
results differ from yours. The radix tree stats show the same
numbers as in your tests.
=# select * from bench_search_random_nodes(10 * 1000 * 1000);
2022-11-18 22:18:21.608 JST [3913544] LOG: num_keys = 9291812, height = 4, n4 = 262144, n32 = 79603, n128 = 182670, n256 = 1024
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Fri, Nov 18, 2022 at 8:20 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
FYI I've not tested the patch you shared today, but here are the
benchmark results I got with the v9 patch in my environment (I used
the second filter). I split the 0004 patch into two patches: a pure
refactoring patch to introduce rt_node_ptr, and a patch to do
pointer tagging.
v9 0003 patch : 1113 1114 1114
introduce rt_node_ptr: 1127 1128 1128
pointer tagging : 1085 1087 1086 (equivalent to 0004 patch)
In my environment, rt_node_ptr seemed to add some overhead, but
pointer tagging had performance benefits. I'm not sure why the
results differ from yours. The radix tree stats show the same
numbers as in your tests.
There is less than 2% difference from the median set of results, so it's
hard to distinguish from noise. I did a fresh rebuild and retested with the
same results: about a 15% slowdown in v9 0004. That's strange.
On Wed, Nov 16, 2022 at 10:24 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
filter = (((uint64) 1<<32) | (0xFF<<24));
LOG: num_keys = 9999944, height = 7, n4 = 47515559, n32 = 6209, n128 =
62632, n256 = 3161
1) Any idea why the tree height would be reported as 7 here? I didn't
expect that.
In my environment, (0xFF<<24) is 0xFFFFFFFFFF000000, not 0xFF000000.
It seems the filter should be (((uint64) 1<<32) | ((uint64)
0xFF<<24)).
Ugh, sign extension, brain fade on my part. Thanks, I'm glad there was a
straightforward explanation.
--
John Naylor
EDB: http://www.enterprisedb.com
On Fri, Nov 18, 2022 at 8:20 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
On Thu, Nov 17, 2022 at 12:24 AM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
On Wed, Nov 16, 2022 at 4:39 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
That means my idea for the pointer struct might have some problems,
at least as currently implemented. Maybe in the course of separating out
and polishing that piece, an inefficiency will fall out. Or, it might be
another reason to template local and shared separately. Not sure yet. I
also haven't tried to adjust this test for the shared memory case.
Digging a bit deeper, I see a flaw in my benchmark: Even though the total
distribution of node kinds is decently even, the pattern that the benchmark
sees is not terribly random:
3,343,352      branch-misses:u           #    0.85% of all branches
393,204,959    branches:u
Recall a previous benchmark [1] where the leaf node was about half node16
and half node32. Randomizing the leaf node between the two caused branch
misses to go from 1% to 2%, causing a noticeable slowdown. Maybe in this
new benchmark, each level has a skewed distribution of nodes, giving a
smart branch predictor something to work with. We will need a way to
efficiently generate keys that lead to a relatively unpredictable
distribution of node kinds, as seen by a searcher. Especially in the leaves
(or just above the leaves), since those are less likely to be cached.
I'll also run the test on my environment and do the investigation
tomorrow.
FYI I've not tested the patch you shared today, but here are the
benchmark results I got with the v9 patch in my environment (I used
the second filter). I split the 0004 patch into two patches: a pure
refactoring patch to introduce rt_node_ptr, and a patch to do
pointer tagging.
Would you be able to share the refactoring patch? And a fix for the failing
tests? I'm thinking I want to try the templating approach fairly soon.
[1]: /messages/by-id/CAFBsxsFEVckVzsBsfgGzGR4Yz=Jp=UxOtjYvTjOz6fOoLXtOig@mail.gmail.com
--
John Naylor
EDB: http://www.enterprisedb.com
On Fri, Nov 18, 2022 at 2:48 PM I wrote:
One issue with this patch: The "fanout" member is a uint8, so it can't
hold 256 for the largest node kind. That's not an issue in practice, since
we never need to grow it, and we only compare that value with the count in
an Assert(), so I just set it to zero. That does break an invariant, so
it's not great. We could use 2 bytes to be strictly correct in all cases,
but that limits what we can do with the smallest node kind.
Thinking about this part, there's an easy resolution -- use a different
macro for fixed- and variable-sized node kinds to determine if there is a
free slot.
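Something along these lines, with invented names, just to show the shape:

/* variable-sized kinds consult their fanout member as now */
#define VAR_NODE_HAS_FREE_SLOT(node) \
    ((node)->base.n.count < (node)->base.n.fanout)

/* fixed-size kinds compare against a compile-time capacity, so node256
 * no longer needs the fanout = 0 workaround */
#define FIXED_NODE_HAS_FREE_SLOT(node, capacity) \
    ((node)->base.n.count < (capacity))

The node256 cases would then use e.g. FIXED_NODE_HAS_FREE_SLOT(n256, RT_NODE_MAX_SLOTS), and the fanout member could be left out of the fixed-size kinds entirely.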
Also, I wanted to share some results of adjusting the boundary between the
two smallest node kinds. In the hackish attached patch, I modified the
fixed height search benchmark to search a small (within L1 cache) tree
thousands of times. For the first set I modified node4's maximum fanout and
filled it up. For the second, I set node4's fanout to 1, which causes 2+ to
spill to node32 (actually the partially-filled node15 size class
as demoed earlier).
node4:
NOTICE: num_keys = 16, height = 3, n4 = 15, n15 = 0, n32 = 0, n128 = 0,
n256 = 0
fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+-------+------------------+------------+--------------
2 | 16 | 16520 | 0 | 3
NOTICE: num_keys = 81, height = 3, n4 = 40, n15 = 0, n32 = 0, n128 = 0,
n256 = 0
fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+-------+------------------+------------+--------------
3 | 81 | 16456 | 0 | 17
NOTICE: num_keys = 256, height = 3, n4 = 85, n15 = 0, n32 = 0, n128 = 0,
n256 = 0
fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+-------+------------------+------------+--------------
4 | 256 | 16456 | 0 | 89
NOTICE: num_keys = 625, height = 3, n4 = 156, n15 = 0, n32 = 0, n128 = 0,
n256 = 0
fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+-------+------------------+------------+--------------
5 | 625 | 16488 | 0 | 327
node32:
NOTICE: num_keys = 16, height = 3, n4 = 0, n15 = 15, n32 = 0, n128 = 0,
n256 = 0
fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+-------+------------------+------------+--------------
2 | 16 | 16488 | 0 | 5
(1 row)
NOTICE: num_keys = 81, height = 3, n4 = 0, n15 = 40, n32 = 0, n128 = 0,
n256 = 0
fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+-------+------------------+------------+--------------
3 | 81 | 16520 | 0 | 28
NOTICE: num_keys = 256, height = 3, n4 = 0, n15 = 85, n32 = 0, n128 = 0,
n256 = 0
fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+-------+------------------+------------+--------------
4 | 256 | 16408 | 0 | 79
NOTICE: num_keys = 625, height = 3, n4 = 0, n15 = 156, n32 = 0, n128 = 0,
n256 = 0
fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+-------+------------------+------------+--------------
5 | 625 | 24616 | 0 | 199
In this test, node32 seems slightly faster than node4 with 4 elements, at
the cost of more memory.
Assuming the smallest node is fixed size (i.e. fanout/capacity member not
part of the common set, so only part of variable-sized nodes), 3 has a nice
property: no wasted padding space:
node4: 5 + 4+(7) + 4*8 = 48 bytes
node3: 5 + 3 + 3*8 = 32
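To make that concrete, a hypothetical node3 leaf might look like this (flattening the common header into the struct for illustration):

typedef struct rt_node_leaf_3
{
    uint16      count;          /* common header: 5 data bytes ... */
    uint8       shift;
    uint8       chunk;
    uint8       kind;
    uint8       chunks[3];      /* ... plus 3 chunks fill exactly 8 bytes */
    uint64      values[3];      /* 24 bytes; 32 bytes total, no padding */
} rt_node_leaf_3;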
--
John Naylor
EDB: http://www.enterprisedb.com
On Mon, Nov 21, 2022 at 3:43 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Fri, Nov 18, 2022 at 8:20 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Thu, Nov 17, 2022 at 12:24 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Wed, Nov 16, 2022 at 4:39 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
That means my idea for the pointer struct might have some problems, at least as currently implemented. Maybe in the course of separating out and polishing that piece, an inefficiency will fall out. Or, it might be another reason to template local and shared separately. Not sure yet. I also haven't tried to adjust this test for the shared memory case.
Digging a bit deeper, I see a flaw in my benchmark: Even though the total distribution of node kinds is decently even, the pattern that the benchmark sees is not terribly random:
3,343,352 branch-misses:u # 0.85% of all branches
393,204,959 branches:u
Recall a previous benchmark [1] where the leaf node was about half node16 and half node32. Randomizing the leaf node between the two caused branch misses to go from 1% to 2%, causing a noticeable slowdown. Maybe in this new benchmark, each level has a skewed distribution of nodes, giving a smart branch predictor something to work with. We will need a way to efficiently generate keys that lead to a relatively unpredictable distribution of node kinds, as seen by a searcher. Especially in the leaves (or just above the leaves), since those are less likely to be cached.
I'll also run the test on my environment and do the investigation tomorrow.
FYI I've not tested the patch you shared today, but here are the
benchmark results I got with the v9 patch in my environment (I used
the second filter). I split the 0004 patch into two patches: a pure
refactoring patch to introduce rt_node_ptr, and a patch to do
pointer tagging.
Would you be able to share the refactoring patch? And a fix for the failing tests? I'm thinking I want to try the templating approach fairly soon.
Sure. I've attached the v10 patches. 0004 is the pure refactoring
patch and 0005 patch introduces the pointer tagging.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachments:
v10-0003-tool-for-measuring-radix-tree-performance.patch
From 5cd4f1f8435d5367e09b8044c08e153ae05f2f19 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 16 Sep 2022 11:57:03 +0900
Subject: [PATCH v10 3/7] tool for measuring radix tree performance
---
contrib/bench_radix_tree/Makefile | 21 +
.../bench_radix_tree--1.0.sql | 64 +++
contrib/bench_radix_tree/bench_radix_tree.c | 541 ++++++++++++++++++
.../bench_radix_tree/bench_radix_tree.control | 6 +
contrib/bench_radix_tree/expected/bench.out | 13 +
contrib/bench_radix_tree/sql/bench.sql | 16 +
6 files changed, 661 insertions(+)
create mode 100644 contrib/bench_radix_tree/Makefile
create mode 100644 contrib/bench_radix_tree/bench_radix_tree--1.0.sql
create mode 100644 contrib/bench_radix_tree/bench_radix_tree.c
create mode 100644 contrib/bench_radix_tree/bench_radix_tree.control
create mode 100644 contrib/bench_radix_tree/expected/bench.out
create mode 100644 contrib/bench_radix_tree/sql/bench.sql
diff --git a/contrib/bench_radix_tree/Makefile b/contrib/bench_radix_tree/Makefile
new file mode 100644
index 0000000000..952bb0ceae
--- /dev/null
+++ b/contrib/bench_radix_tree/Makefile
@@ -0,0 +1,21 @@
+# contrib/bench_radix_tree/Makefile
+
+MODULE_big = bench_radix_tree
+OBJS = \
+ bench_radix_tree.o
+
+EXTENSION = bench_radix_tree
+DATA = bench_radix_tree--1.0.sql
+
+REGRESS = bench_fixed_height
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/bench_radix_tree
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
new file mode 100644
index 0000000000..e0205b364e
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
@@ -0,0 +1,64 @@
+/* contrib/bench_radix_tree/bench_radix_tree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION bench_radix_tree" to load this file. \quit
+
+create function bench_shuffle_search(
+minblk int4,
+maxblk int4,
+random_block bool DEFAULT false,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT array_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT array_load_ms int8,
+OUT rt_search_ms int8,
+OUT array_serach_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_seq_search(
+minblk int4,
+maxblk int4,
+random_block bool DEFAULT false,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT array_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT array_load_ms int8,
+OUT rt_search_ms int8,
+OUT array_serach_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_load_random_int(
+cnt int8,
+OUT mem_allocated int8,
+OUT load_ms int8)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_search_random_nodes(
+cnt int8,
+OUT mem_allocated int8,
+OUT search_ms int8)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_fixed_height_search(
+fanout int4,
+OUT fanout int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT rt_search_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c
new file mode 100644
index 0000000000..70ca989118
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -0,0 +1,541 @@
+/*-------------------------------------------------------------------------
+ *
+ * bench_radix_tree.c
+ *
+ * Copyright (c) 2016-2022, PostgreSQL Global Development Group
+ *
+ * contrib/bench_radix_tree/bench_radix_tree.c
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "funcapi.h"
+#include "lib/radixtree.h"
+#include <math.h>
+#include "miscadmin.h"
+#include "utils/timestamp.h"
+
+PG_MODULE_MAGIC;
+
+/* run benchmark also for binary-search case? */
+/* #define MEASURE_BINARY_SEARCH 1 */
+
+#define TIDS_PER_BLOCK_FOR_LOAD 30
+#define TIDS_PER_BLOCK_FOR_LOOKUP 50
+
+PG_FUNCTION_INFO_V1(bench_seq_search);
+PG_FUNCTION_INFO_V1(bench_shuffle_search);
+PG_FUNCTION_INFO_V1(bench_load_random_int);
+PG_FUNCTION_INFO_V1(bench_fixed_height_search);
+PG_FUNCTION_INFO_V1(bench_search_random_nodes);
+
+static uint64
+tid_to_key_off(ItemPointer tid, uint32 *off)
+{
+ uint64 upper;
+ uint32 shift = pg_ceil_log2_32(MaxHeapTuplesPerPage);
+ int64 tid_i;
+
+ Assert(ItemPointerGetOffsetNumber(tid) < MaxHeapTuplesPerPage);
+
+ tid_i = ItemPointerGetOffsetNumber(tid);
+ tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << shift;
+
+ /* log(sizeof(uint64) * BITS_PER_BYTE, 2) = log(64, 2) = 6 */
+ *off = tid_i & ((1 << 6) - 1);
+ upper = tid_i >> 6;
+ Assert(*off < (sizeof(uint64) * BITS_PER_BYTE));
+
+ Assert(*off < 64);
+
+ return upper;
+}
+
+static int
+shuffle_randrange(pg_prng_state *state, int lower, int upper)
+{
+ return (int) floor(pg_prng_double(state) * ((upper - lower) + 0.999999)) + lower;
+}
+
+/* Naive Fisher-Yates implementation */
+static void
+shuffle_itemptrs(ItemPointer itemptr, uint64 nitems)
+{
+ /* reproducibility */
+ pg_prng_state state;
+
+ pg_prng_seed(&state, 0);
+
+ for (int i = 0; i < nitems - 1; i++)
+ {
+ int j = shuffle_randrange(&state, i, nitems - 1);
+ ItemPointerData t = itemptr[j];
+
+ itemptr[j] = itemptr[i];
+ itemptr[i] = t;
+ }
+}
+
+static ItemPointer
+generate_tids(BlockNumber minblk, BlockNumber maxblk, int ntids_per_blk, uint64 *ntids_p,
+ bool random_block)
+{
+ ItemPointer tids;
+ uint64 maxitems;
+ uint64 ntids = 0;
+ pg_prng_state state;
+
+ maxitems = (maxblk - minblk + 1) * ntids_per_blk;
+ tids = MemoryContextAllocHuge(TopTransactionContext,
+ sizeof(ItemPointerData) * maxitems);
+
+ if (random_block)
+ pg_prng_seed(&state, 0x9E3779B185EBCA87);
+
+ for (BlockNumber blk = minblk; blk < maxblk; blk++)
+ {
+ if (random_block && !pg_prng_bool(&state))
+ continue;
+
+ for (OffsetNumber off = FirstOffsetNumber;
+ off <= ntids_per_blk; off++)
+ {
+ CHECK_FOR_INTERRUPTS();
+
+ ItemPointerSetBlockNumber(&(tids[ntids]), blk);
+ ItemPointerSetOffsetNumber(&(tids[ntids]), off);
+
+ ntids++;
+ }
+ }
+
+ *ntids_p = ntids;
+ return tids;
+}
+
+#ifdef MEASURE_BINARY_SEARCH
+static int
+vac_cmp_itemptr(const void *left, const void *right)
+{
+ BlockNumber lblk,
+ rblk;
+ OffsetNumber loff,
+ roff;
+
+ lblk = ItemPointerGetBlockNumber((ItemPointer) left);
+ rblk = ItemPointerGetBlockNumber((ItemPointer) right);
+
+ if (lblk < rblk)
+ return -1;
+ if (lblk > rblk)
+ return 1;
+
+ loff = ItemPointerGetOffsetNumber((ItemPointer) left);
+ roff = ItemPointerGetOffsetNumber((ItemPointer) right);
+
+ if (loff < roff)
+ return -1;
+ if (loff > roff)
+ return 1;
+
+ return 0;
+}
+#endif
+
+static Datum
+bench_search(FunctionCallInfo fcinfo, bool shuffle)
+{
+ BlockNumber minblk = PG_GETARG_INT32(0);
+ BlockNumber maxblk = PG_GETARG_INT32(1);
+ bool random_block = PG_GETARG_BOOL(2);
+ radix_tree *rt = NULL;
+ uint64 ntids;
+ uint64 key;
+ uint64 last_key = PG_UINT64_MAX;
+ uint64 val = 0;
+ ItemPointer tids;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_load_ms,
+ rt_search_ms;
+ Datum values[7];
+ bool nulls[7];
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ tids = generate_tids(minblk, maxblk, TIDS_PER_BLOCK_FOR_LOAD, &ntids, random_block);
+
+ /* measure the load time of the radix tree */
+ rt = rt_create(CurrentMemoryContext);
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ uint32 off;
+
+ CHECK_FOR_INTERRUPTS();
+
+ key = tid_to_key_off(tid, &off);
+
+ if (last_key != PG_UINT64_MAX && last_key != key)
+ {
+ rt_set(rt, last_key, val);
+ val = 0;
+ }
+
+ last_key = key;
+ val |= (uint64) 1 << off;
+ }
+ if (last_key != PG_UINT64_MAX)
+ rt_set(rt, last_key, val);
+
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_load_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ if (shuffle)
+ shuffle_itemptrs(tids, ntids);
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+ /* measure the search time of the radix tree */
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ uint32 off;
+ volatile bool ret; /* prevent calling rt_search from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ key = tid_to_key_off(tid, &off);
+
+ ret = rt_search(rt, key, &val);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_search_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int64GetDatum(rt_num_entries(rt));
+ values[1] = Int64GetDatum(rt_memory_usage(rt));
+ values[2] = Int64GetDatum(sizeof(ItemPointerData) * ntids);
+ values[3] = Int64GetDatum(rt_load_ms);
+ nulls[4] = true; /* ar_load_ms */
+ values[5] = Int64GetDatum(rt_search_ms);
+ nulls[6] = true; /* ar_search_ms */
+
+#ifdef MEASURE_BINARY_SEARCH
+ {
+ ItemPointer itemptrs = NULL;
+
+ int64 ar_load_ms,
+ ar_search_ms;
+
+ /* measure the load time of the array */
+ itemptrs = MemoryContextAllocHuge(CurrentMemoryContext,
+ sizeof(ItemPointerData) * ntids);
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointerSetBlockNumber(&(itemptrs[i]),
+ ItemPointerGetBlockNumber(&(tids[i])));
+ ItemPointerSetOffsetNumber(&(itemptrs[i]),
+ ItemPointerGetOffsetNumber(&(tids[i])));
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ ar_load_ms = secs * 1000 + usecs / 1000;
+
+ /* next, measure the search time of the array */
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ volatile bool ret; /* prevent calling bsearch from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ ret = bsearch((void *) tid,
+ (void *) itemptrs,
+ ntids,
+ sizeof(ItemPointerData),
+ vac_cmp_itemptr);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ ar_search_ms = secs * 1000 + usecs / 1000;
+
+ /* set the result */
+ nulls[4] = false;
+ values[4] = Int64GetDatum(ar_load_ms);
+ nulls[6] = false;
+ values[6] = Int64GetDatum(ar_search_ms);
+ }
+#endif
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_seq_search(PG_FUNCTION_ARGS)
+{
+ return bench_search(fcinfo, false);
+}
+
+Datum
+bench_shuffle_search(PG_FUNCTION_ARGS)
+{
+ return bench_search(fcinfo, true);
+}
+
+Datum
+bench_load_random_int(PG_FUNCTION_ARGS)
+{
+ uint64 cnt = (uint64) PG_GETARG_INT64(0);
+ radix_tree *rt;
+ pg_prng_state state;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 load_time_ms;
+ Datum values[2];
+ bool nulls[2];
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ pg_prng_seed(&state, 0);
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ uint64 key = pg_prng_uint64(&state);
+
+ rt_set(rt, key, key);
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ load_time_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int64GetDatum(rt_memory_usage(rt));
+ values[1] = Int64GetDatum(load_time_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+/* copy of splitmix64() */
+static uint64
+hash64(uint64 x)
+{
+ x ^= x >> 30;
+ x *= UINT64CONST(0xbf58476d1ce4e5b9);
+ x ^= x >> 27;
+ x *= UINT64CONST(0x94d049bb133111eb);
+ x ^= x >> 31;
+ return x;
+}
+
+/* attempts to have a relatively even population of node kinds */
+Datum
+bench_search_random_nodes(PG_FUNCTION_ARGS)
+{
+ uint64 cnt = (uint64) PG_GETARG_INT64(0);
+ radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 search_time_ms;
+ Datum values[2] = {0};
+ bool nulls[2] = {0};
+ /* from trial and error */
+ const uint64 filter = (((uint64) 0x7F<<32) | (0x07<<24) | (0xFF<<16) | 0xFF);
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ rt = rt_create(CurrentMemoryContext);
+
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ const uint64 hash = hash64(i);
+ const uint64 key = hash & filter;
+
+ rt_set(rt, key, key);
+ }
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ const uint64 hash = hash64(i);
+ const uint64 key = hash & filter;
+ uint64 val;
+ volatile bool ret; /* prevent calling rt_search from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ ret = rt_search(rt, key, &val);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ search_time_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ values[0] = Int64GetDatum(rt_memory_usage(rt));
+ values[1] = Int64GetDatum(search_time_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_fixed_height_search(PG_FUNCTION_ARGS)
+{
+ int fanout = PG_GETARG_INT32(0);
+ radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_load_ms,
+ rt_search_ms;
+ Datum values[5];
+ bool nulls[5];
+
+ /* test boundary between vector and iteration */
+ const int n_keys = 5 * 16 * 16 * 16 * 16;
+ uint64 r,
+ h,
+ i,
+ j,
+ k;
+ int key_id;
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+
+ key_id = 0;
+
+ /*
+ * lower nodes have limited fanout, the top is only limited by
+ * bits-per-byte
+ */
+ for (r = 1;; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ for (i = 1; i <= fanout; i++)
+ {
+ for (j = 1; j <= fanout; j++)
+ {
+ for (k = 1; k <= fanout; k++)
+ {
+ uint64 key;
+
+ key = (r << 32) | (h << 24) | (i << 16) | (j << 8) | (k);
+
+ CHECK_FOR_INTERRUPTS();
+
+ key_id++;
+ if (key_id > n_keys)
+ goto finish_set;
+
+ rt_set(rt, key, key_id);
+ }
+ }
+ }
+ }
+ }
+finish_set:
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_load_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ /* measure the search time of the radix tree */
+ start_time = GetCurrentTimestamp();
+
+ key_id = 0;
+ for (r = 1;; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ for (i = 1; i <= fanout; i++)
+ {
+ for (j = 1; j <= fanout; j++)
+ {
+ for (k = 1; k <= fanout; k++)
+ {
+ uint64 key,
+ val;
+
+ key = (r << 32) | (h << 24) | (i << 16) | (j << 8) | (k);
+
+ CHECK_FOR_INTERRUPTS();
+
+ key_id++;
+ if (key_id > n_keys)
+ goto finish_search;
+
+ rt_search(rt, key, &val);
+ }
+ }
+ }
+ }
+ }
+finish_search:
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_search_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int32GetDatum(fanout);
+ values[1] = Int64GetDatum(rt_num_entries(rt));
+ values[2] = Int64GetDatum(rt_memory_usage(rt));
+ values[3] = Int64GetDatum(rt_load_ms);
+ values[4] = Int64GetDatum(rt_search_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
diff --git a/contrib/bench_radix_tree/bench_radix_tree.control b/contrib/bench_radix_tree/bench_radix_tree.control
new file mode 100644
index 0000000000..1d988e6c9a
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.control
@@ -0,0 +1,6 @@
+# bench_radix_tree extension
+comment = 'benchmark suite for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/bench_radix_tree'
+relocatable = true
+trusted = true
diff --git a/contrib/bench_radix_tree/expected/bench.out b/contrib/bench_radix_tree/expected/bench.out
new file mode 100644
index 0000000000..60c303892e
--- /dev/null
+++ b/contrib/bench_radix_tree/expected/bench.out
@@ -0,0 +1,13 @@
+create extension bench_radix_tree;
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
diff --git a/contrib/bench_radix_tree/sql/bench.sql b/contrib/bench_radix_tree/sql/bench.sql
new file mode 100644
index 0000000000..a46018c9d4
--- /dev/null
+++ b/contrib/bench_radix_tree/sql/bench.sql
@@ -0,0 +1,16 @@
+create extension bench_radix_tree;
+
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
--
2.31.1
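To illustrate the load loop of bench_search() above: tid_to_key_off() maps consecutive TIDs to the same radix-tree key, so their bit positions are OR'ed into a single 64-bit value and flushed with one rt_set() call whenever the key changes. Below is a minimal standalone sketch of that accumulate-and-flush pattern; it is not part of the patch, and the 11-bit offset width and the set_cb() stand-in for rt_set() are assumptions made only for illustration.

#include <stdint.h>
#include <stdio.h>

#define OFFSET_BITS 11   /* assumed width; anything >= bits for MaxHeapTuplesPerPage works */

/* stand-in for rt_set(rt, key, val) */
static void
set_cb(uint64_t key, uint64_t val)
{
    printf("key %llu -> bitmap 0x%llx\n",
           (unsigned long long) key, (unsigned long long) val);
}

/*
 * Mimic the load loop of bench_search(): encode each (block, offset) pair
 * into a 64-bit integer, use its upper bits as the tree key and its low
 * 6 bits as a position in a 64-bit value bitmap, and flush the bitmap
 * whenever the key changes.
 */
static void
load_tids(const uint32_t *blocks, const uint16_t *offsets, int ntids)
{
    uint64_t last_key = UINT64_MAX;
    uint64_t val = 0;

    for (int i = 0; i < ntids; i++)
    {
        uint64_t tid_i = ((uint64_t) blocks[i] << OFFSET_BITS) | offsets[i];
        uint64_t key = tid_i >> 6;
        uint32_t off = tid_i & ((1 << 6) - 1);

        if (last_key != UINT64_MAX && last_key != key)
        {
            set_cb(last_key, val);   /* flush the previous key's bitmap */
            val = 0;
        }
        last_key = key;
        val |= UINT64_C(1) << off;
    }
    if (last_key != UINT64_MAX)
        set_cb(last_key, val);
}

int
main(void)
{
    uint32_t blocks[]  = {0, 0, 0, 1};
    uint16_t offsets[] = {1, 2, 3, 1};

    load_tids(blocks, offsets, 4);
    return 0;
}

With 4 TIDs spanning two keys, this prints two lines, mirroring how one rt_set() covers up to 64 consecutive TID slots in the benchmark.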
Attachment: v10-0004-Use-rt_node_ptr-to-reference-radix-tree-nodes.patch (application/octet-stream)
From 082277fda9061c8651b3cc4d2e70b763d508bb1a Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 14 Nov 2022 11:44:17 +0900
Subject: [PATCH v10 4/7] Use rt_node_ptr to reference radix tree nodes.
---
src/backend/lib/radixtree.c | 652 ++++++++++++++++++++----------------
1 file changed, 369 insertions(+), 283 deletions(-)
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
index 6159b73b75..67f4dc646e 100644
--- a/src/backend/lib/radixtree.c
+++ b/src/backend/lib/radixtree.c
@@ -126,6 +126,21 @@ typedef enum
#define RT_NODE_KIND_128 0x02
#define RT_NODE_KIND_256 0x03
#define RT_NODE_KIND_COUNT 4
+#define RT_POINTER_KIND_MASK 0x03
+
+/*
+ * rt_pointer is a tagged pointer to an rt_node. It is encoded from a
+ * C pointer (i.e., a local memory address) and the node kind. The node
+ * kind is stored in the lower 2 bits, which are always 0 in a local
+ * memory address. The pointer can be encoded and decoded with the
+ * rt_pointer_encode() and rt_pointer_decode() functions, respectively.
+ *
+ * For this reason, the inner nodes of the radix tree store rt_pointer
+ * rather than C pointers.
+ */
+typedef uintptr_t rt_pointer;
+#define InvalidRTPointer ((rt_pointer) 0)
+#define RTPointerIsValid(x) (((rt_pointer) (x)) != InvalidRTPointer)
/* Common type for all nodes types */
typedef struct rt_node
@@ -147,10 +162,7 @@ typedef struct rt_node
/* Size kind of the node */
uint8 kind;
} rt_node;
-#define NODE_IS_LEAF(n) (((rt_node *) (n))->shift == 0)
-#define NODE_IS_EMPTY(n) (((rt_node *) (n))->count == 0)
-#define NODE_HAS_FREE_SLOT(n) \
- (((rt_node *) (n))->count < rt_node_kind_info[((rt_node *) (n))->kind].fanout)
+#define RT_NODE_IS_LEAF(n) (((rt_node *) (n))->shift == 0)
/* Base type of each node kinds for leaf and inner nodes */
typedef struct rt_node_base_4
@@ -205,7 +217,7 @@ typedef struct rt_node_inner_4
rt_node_base_4 base;
/* 4 children, for key chunks */
- rt_node *children[4];
+ rt_pointer children[4];
} rt_node_inner_4;
typedef struct rt_node_leaf_4
@@ -221,7 +233,7 @@ typedef struct rt_node_inner_32
rt_node_base_32 base;
/* 32 children, for key chunks */
- rt_node *children[32];
+ rt_pointer children[32];
} rt_node_inner_32;
typedef struct rt_node_leaf_32
@@ -237,7 +249,7 @@ typedef struct rt_node_inner_128
rt_node_base_128 base;
/* Slots for 128 children */
- rt_node *children[128];
+ rt_pointer children[128];
} rt_node_inner_128;
typedef struct rt_node_leaf_128
@@ -260,7 +272,7 @@ typedef struct rt_node_inner_256
rt_node_base_256 base;
/* Slots for 256 children */
- rt_node *children[RT_NODE_MAX_SLOTS];
+ rt_pointer children[RT_NODE_MAX_SLOTS];
} rt_node_inner_256;
typedef struct rt_node_leaf_256
@@ -274,6 +286,30 @@ typedef struct rt_node_leaf_256
uint64 values[RT_NODE_MAX_SLOTS];
} rt_node_leaf_256;
+/*
+ * rt_node_ptr holds both the encoded (rt_pointer) and decoded (rt_node *) forms of a pointer to an rt_node.
+ */
+typedef struct rt_node_ptr
+{
+ rt_pointer encoded;
+ rt_node *decoded;
+} rt_node_ptr;
+#define InvalidRTNodePtr \
+ (rt_node_ptr) {.encoded = InvalidRTPointer, .decoded = NULL }
+#define RTNodePtrIsValid(n) \
+ (!rt_node_ptr_eq((rt_node_ptr *) &(n), &(InvalidRTNodePtr)))
+
+/* Macros for rt_node_ptr to access the fields of rt_node */
+#define NODE_RAW(n) (((rt_node_ptr) (n)).decoded)
+#define NODE_IS_LEAF(n) (NODE_RAW(n)->shift == 0)
+#define NODE_IS_EMPTY(n) (NODE_COUNT(n) == 0)
+#define NODE_KIND(n) (NODE_RAW(n)->kind)
+#define NODE_COUNT(n) (NODE_RAW(n)->count)
+#define NODE_SHIFT(n) (NODE_RAW(n)->shift)
+#define NODE_CHUNK(n) (NODE_RAW(n)->chunk)
+#define NODE_HAS_FREE_SLOT(n) \
+ (NODE_COUNT(n) < rt_node_kind_info[NODE_KIND(n)].fanout)
+
/* Information of each size kinds */
typedef struct rt_node_kind_info_elem
{
@@ -347,7 +383,7 @@ static rt_node_kind_info_elem rt_node_kind_info[RT_NODE_KIND_COUNT] = {
*/
typedef struct rt_node_iter
{
- rt_node *node; /* current node being iterated */
+ rt_node_ptr node; /* current node being iterated */
int current_idx; /* current position. -1 for initial value */
} rt_node_iter;
@@ -368,7 +404,7 @@ struct radix_tree
{
MemoryContext context;
- rt_node *root;
+ rt_pointer root;
uint64 max_val;
uint64 num_keys;
@@ -382,26 +418,56 @@ struct radix_tree
};
static void rt_new_root(radix_tree *tree, uint64 key);
-static rt_node *rt_alloc_node(radix_tree *tree, int kind, uint8 shift, uint8 chunk,
- bool inner);
-static void rt_free_node(radix_tree *tree, rt_node *node);
+static rt_node_ptr rt_alloc_node(radix_tree *tree, int kind, uint8 shift, uint8 chunk,
+ bool inner);
+static void rt_free_node(radix_tree *tree, rt_node_ptr node);
static void rt_extend(radix_tree *tree, uint64 key);
-static inline bool rt_node_search_inner(rt_node *node, uint64 key, rt_action action,
- rt_node **child_p);
-static inline bool rt_node_search_leaf(rt_node *node, uint64 key, rt_action action,
+static inline bool rt_node_search_inner(rt_node_ptr node_ptr, uint64 key, rt_action action,
+ rt_pointer *child_p);
+static inline bool rt_node_search_leaf(rt_node_ptr node_ptr, uint64 key, rt_action action,
uint64 *value_p);
-static bool rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node,
- uint64 key, rt_node *child);
-static bool rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
+static bool rt_node_insert_inner(radix_tree *tree, rt_node_ptr parent, rt_node_ptr node,
+ uint64 key, rt_node_ptr child);
+static bool rt_node_insert_leaf(radix_tree *tree, rt_node_ptr parent, rt_node_ptr node,
uint64 key, uint64 value);
-static inline rt_node *rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter);
+static inline bool rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
+ rt_node_ptr *child_p);
static inline bool rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
uint64 *value_p);
-static void rt_update_iter_stack(rt_iter *iter, rt_node *from_node, int from);
+static void rt_update_iter_stack(rt_iter *iter, rt_node_ptr from_node, int from);
static inline void rt_iter_update_key(rt_iter *iter, uint8 chunk, uint8 shift);
/* verification (available only with assertion) */
-static void rt_verify_node(rt_node *node);
+static void rt_verify_node(rt_node_ptr node);
+
+/* Decode and encode function of rt_pointer */
+static inline rt_node *
+rt_pointer_decode(rt_pointer encoded)
+{
+ return (rt_node *) encoded;
+}
+
+static inline rt_pointer
+rt_pointer_encode(rt_node *decoded)
+{
+ return (rt_pointer) decoded;
+}
+
+/* Return an rt_node_ptr created from the given encoded pointer */
+static inline rt_node_ptr
+rt_node_ptr_encoded(rt_pointer encoded)
+{
+ return (rt_node_ptr) {
+ .encoded = encoded,
+ .decoded = rt_pointer_decode(encoded)
+ };
+}
+
+static inline bool
+rt_node_ptr_eq(rt_node_ptr *a, rt_node_ptr *b)
+{
+ return (a->decoded == b->decoded) && (a->encoded == b->encoded);
+}
/*
* Return index of the first element in 'base' that equals 'key'. Return -1
@@ -550,10 +616,10 @@ node_32_get_insertpos(rt_node_base_32 *node, uint8 chunk)
/* Shift the elements right at 'idx' by one */
static inline void
-chunk_children_array_shift(uint8 *chunks, rt_node **children, int count, int idx)
+chunk_children_array_shift(uint8 *chunks, rt_pointer *children, int count, int idx)
{
memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
- memmove(&(children[idx + 1]), &(children[idx]), sizeof(rt_node *) * (count - idx));
+ memmove(&(children[idx + 1]), &(children[idx]), sizeof(rt_pointer) * (count - idx));
}
static inline void
@@ -565,10 +631,10 @@ chunk_values_array_shift(uint8 *chunks, uint64 *values, int count, int idx)
/* Delete the element at 'idx' */
static inline void
-chunk_children_array_delete(uint8 *chunks, rt_node **children, int count, int idx)
+chunk_children_array_delete(uint8 *chunks, rt_pointer *children, int count, int idx)
{
memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
- memmove(&(children[idx]), &(children[idx + 1]), sizeof(rt_node *) * (count - idx - 1));
+ memmove(&(children[idx]), &(children[idx + 1]), sizeof(rt_pointer) * (count - idx - 1));
}
static inline void
@@ -580,15 +646,15 @@ chunk_values_array_delete(uint8 *chunks, uint64 *values, int count, int idx)
/* Copy both chunks and children/values arrays */
static inline void
-chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
- uint8 *dst_chunks, rt_node **dst_children, int count)
+chunk_children_array_copy(uint8 *src_chunks, rt_pointer *src_children,
+ uint8 *dst_chunks, rt_pointer *dst_children, int count)
{
/* For better code generation */
if (count > rt_node_kind_info[RT_NODE_KIND_4].fanout)
pg_unreachable();
memcpy(dst_chunks, src_chunks, sizeof(uint8) * count);
- memcpy(dst_children, src_children, sizeof(rt_node *) * count);
+ memcpy(dst_children, src_children, sizeof(rt_pointer) * count);
}
static inline void
@@ -616,28 +682,28 @@ node_128_is_chunk_used(rt_node_base_128 *node, uint8 chunk)
static inline bool
node_inner_128_is_slot_used(rt_node_inner_128 *node, uint8 slot)
{
- Assert(!NODE_IS_LEAF(node));
- return (node->children[slot] != NULL);
+ Assert(!RT_NODE_IS_LEAF(node));
+ return RTPointerIsValid(node->children[slot]);
}
static inline bool
node_leaf_128_is_slot_used(rt_node_leaf_128 *node, uint8 slot)
{
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
return ((node->isset[RT_NODE_BITMAP_BYTE(slot)] & RT_NODE_BITMAP_BIT(slot)) != 0);
}
-static inline rt_node *
+static inline rt_pointer
node_inner_128_get_child(rt_node_inner_128 *node, uint8 chunk)
{
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
return node->children[node->base.slot_idxs[chunk]];
}
static inline uint64
node_leaf_128_get_value(rt_node_leaf_128 *node, uint8 chunk)
{
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
Assert(((rt_node_base_128 *) node)->slot_idxs[chunk] != RT_NODE_128_INVALID_IDX);
return node->values[node->base.slot_idxs[chunk]];
}
@@ -645,7 +711,7 @@ node_leaf_128_get_value(rt_node_leaf_128 *node, uint8 chunk)
static void
node_inner_128_delete(rt_node_inner_128 *node, uint8 chunk)
{
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
node->base.slot_idxs[chunk] = RT_NODE_128_INVALID_IDX;
}
@@ -654,7 +720,7 @@ node_leaf_128_delete(rt_node_leaf_128 *node, uint8 chunk)
{
int slotpos = node->base.slot_idxs[chunk];
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
node->isset[RT_NODE_BITMAP_BYTE(slotpos)] &= ~(RT_NODE_BITMAP_BIT(slotpos));
node->base.slot_idxs[chunk] = RT_NODE_128_INVALID_IDX;
}
@@ -665,7 +731,7 @@ node_inner_128_find_unused_slot(rt_node_inner_128 *node, uint8 chunk)
{
int slotpos = 0;
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
while (node_inner_128_is_slot_used(node, slotpos))
slotpos++;
@@ -677,7 +743,7 @@ node_leaf_128_find_unused_slot(rt_node_leaf_128 *node, uint8 chunk)
{
int slotpos;
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
/* We iterate over the isset bitmap per byte then check each bit */
for (slotpos = 0; slotpos < RT_NODE_NSLOTS_BITS(128); slotpos++)
@@ -695,11 +761,11 @@ node_leaf_128_find_unused_slot(rt_node_leaf_128 *node, uint8 chunk)
}
static inline void
-node_inner_128_insert(rt_node_inner_128 *node, uint8 chunk, rt_node *child)
+node_inner_128_insert(rt_node_inner_128 *node, uint8 chunk, rt_pointer child)
{
int slotpos;
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
/* find unused slot */
slotpos = node_inner_128_find_unused_slot(node, chunk);
@@ -714,7 +780,7 @@ node_leaf_128_insert(rt_node_leaf_128 *node, uint8 chunk, uint64 value)
{
int slotpos;
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
/* find unused slot */
slotpos = node_leaf_128_find_unused_slot(node, chunk);
@@ -726,16 +792,16 @@ node_leaf_128_insert(rt_node_leaf_128 *node, uint8 chunk, uint64 value)
/* Update the child corresponding to 'chunk' to 'child' */
static inline void
-node_inner_128_update(rt_node_inner_128 *node, uint8 chunk, rt_node *child)
+node_inner_128_update(rt_node_inner_128 *node, uint8 chunk, rt_pointer child)
{
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
node->children[node->base.slot_idxs[chunk]] = child;
}
static inline void
node_leaf_128_update(rt_node_leaf_128 *node, uint8 chunk, uint64 value)
{
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
node->values[node->base.slot_idxs[chunk]] = value;
}
@@ -745,21 +811,21 @@ node_leaf_128_update(rt_node_leaf_128 *node, uint8 chunk, uint64 value)
static inline bool
node_inner_256_is_chunk_used(rt_node_inner_256 *node, uint8 chunk)
{
- Assert(!NODE_IS_LEAF(node));
- return (node->children[chunk] != NULL);
+ Assert(!RT_NODE_IS_LEAF(node));
+ return RTPointerIsValid(node->children[chunk]);
}
static inline bool
node_leaf_256_is_chunk_used(rt_node_leaf_256 *node, uint8 chunk)
{
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
return (node->isset[RT_NODE_BITMAP_BYTE(chunk)] & RT_NODE_BITMAP_BIT(chunk)) != 0;
}
-static inline rt_node *
+static inline rt_pointer
node_inner_256_get_child(rt_node_inner_256 *node, uint8 chunk)
{
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
Assert(node_inner_256_is_chunk_used(node, chunk));
return node->children[chunk];
}
@@ -767,16 +833,16 @@ node_inner_256_get_child(rt_node_inner_256 *node, uint8 chunk)
static inline uint64
node_leaf_256_get_value(rt_node_leaf_256 *node, uint8 chunk)
{
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
Assert(node_leaf_256_is_chunk_used(node, chunk));
return node->values[chunk];
}
/* Set the child in the node-256 */
static inline void
-node_inner_256_set(rt_node_inner_256 *node, uint8 chunk, rt_node *child)
+node_inner_256_set(rt_node_inner_256 *node, uint8 chunk, rt_pointer child)
{
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
node->children[chunk] = child;
}
@@ -784,7 +850,7 @@ node_inner_256_set(rt_node_inner_256 *node, uint8 chunk, rt_node *child)
static inline void
node_leaf_256_set(rt_node_leaf_256 *node, uint8 chunk, uint64 value)
{
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
node->isset[RT_NODE_BITMAP_BYTE(chunk)] |= RT_NODE_BITMAP_BIT(chunk);
node->values[chunk] = value;
}
@@ -793,14 +859,14 @@ node_leaf_256_set(rt_node_leaf_256 *node, uint8 chunk, uint64 value)
static inline void
node_inner_256_delete(rt_node_inner_256 *node, uint8 chunk)
{
- Assert(!NODE_IS_LEAF(node));
- node->children[chunk] = NULL;
+ Assert(!RT_NODE_IS_LEAF(node));
+ node->children[chunk] = InvalidRTPointer;
}
static inline void
node_leaf_256_delete(rt_node_leaf_256 *node, uint8 chunk)
{
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
node->isset[RT_NODE_BITMAP_BYTE(chunk)] &= ~(RT_NODE_BITMAP_BIT(chunk));
}
@@ -835,37 +901,37 @@ static void
rt_new_root(radix_tree *tree, uint64 key)
{
int shift = key_get_shift(key);
- rt_node *node;
+ rt_node_ptr node;
- node = (rt_node *) rt_alloc_node(tree, RT_NODE_KIND_4, shift, 0,
- shift > 0);
+ node = rt_alloc_node(tree, RT_NODE_KIND_4, shift, 0, shift > 0);
tree->max_val = shift_get_max_val(shift);
- tree->root = node;
+ tree->root = node.encoded;
}
/*
* Allocate a new node with the given node kind.
*/
-static rt_node *
+static rt_node_ptr
rt_alloc_node(radix_tree *tree, int kind, uint8 shift, uint8 chunk, bool inner)
{
- rt_node *newnode;
+ rt_node_ptr newnode;
if (inner)
- newnode = (rt_node *) MemoryContextAllocZero(tree->inner_slabs[kind],
- rt_node_kind_info[kind].inner_size);
+ newnode.decoded = (rt_node *) MemoryContextAllocZero(tree->inner_slabs[kind],
+ rt_node_kind_info[kind].inner_size);
else
- newnode = (rt_node *) MemoryContextAllocZero(tree->leaf_slabs[kind],
- rt_node_kind_info[kind].leaf_size);
+ newnode.decoded = (rt_node *) MemoryContextAllocZero(tree->leaf_slabs[kind],
+ rt_node_kind_info[kind].leaf_size);
- newnode->kind = kind;
- newnode->shift = shift;
- newnode->chunk = chunk;
+ newnode.encoded = rt_pointer_encode(newnode.decoded);
+ NODE_KIND(newnode) = kind;
+ NODE_SHIFT(newnode) = shift;
+ NODE_CHUNK(newnode) = chunk;
/* Initialize slot_idxs to invalid values */
if (kind == RT_NODE_KIND_128)
{
- rt_node_base_128 *n128 = (rt_node_base_128 *) newnode;
+ rt_node_base_128 *n128 = (rt_node_base_128 *) newnode.decoded;
memset(n128->slot_idxs, RT_NODE_128_INVALID_IDX, sizeof(n128->slot_idxs));
}
@@ -882,55 +948,56 @@ rt_alloc_node(radix_tree *tree, int kind, uint8 shift, uint8 chunk, bool inner)
* Create a new node with 'new_kind' and the same shift, chunk, and
* count of 'node'.
*/
-static rt_node *
-rt_copy_node(radix_tree *tree, rt_node *node, int new_kind)
+static rt_node_ptr
+rt_copy_node(radix_tree *tree, rt_node_ptr node, int new_kind)
{
- rt_node *newnode;
+ rt_node_ptr newnode;
+ rt_node *n = node.decoded;
- newnode = rt_alloc_node(tree, new_kind, node->shift, node->chunk,
- node->shift > 0);
- newnode->count = node->count;
+ newnode = rt_alloc_node(tree, new_kind, n->shift, n->chunk, n->shift > 0);
+ NODE_COUNT(newnode) = NODE_COUNT(node);
return newnode;
}
/* Free the given node */
static void
-rt_free_node(radix_tree *tree, rt_node *node)
+rt_free_node(radix_tree *tree, rt_node_ptr node)
{
/* If we're deleting the root node, make the tree empty */
- if (tree->root == node)
- tree->root = NULL;
+ if (tree->root == node.encoded)
+ tree->root = InvalidRTPointer;
#ifdef RT_DEBUG
/* update the statistics */
- tree->cnt[node->kind]--;
- Assert(tree->cnt[node->kind] >= 0);
+ tree->cnt[NODE_KIND(node)]--;
+ Assert(tree->cnt[NODE_KIND(node)] >= 0);
#endif
- pfree(node);
+ pfree(node.decoded);
}
/*
* Replace old_child with new_child, and free the old one.
*/
static void
-rt_replace_node(radix_tree *tree, rt_node *parent, rt_node *old_child,
- rt_node *new_child, uint64 key)
+rt_replace_node(radix_tree *tree, rt_node_ptr parent, rt_node_ptr old_child,
+ rt_node_ptr new_child, uint64 key)
{
- Assert(old_child->chunk == new_child->chunk);
- Assert(old_child->shift == new_child->shift);
+ Assert(NODE_CHUNK(old_child) == NODE_CHUNK(new_child));
+ Assert(NODE_SHIFT(old_child) == NODE_SHIFT(new_child));
- if (parent == old_child)
+ if (rt_node_ptr_eq(&parent, &old_child))
{
/* Replace the root node with the new large node */
- tree->root = new_child;
+ tree->root = new_child.encoded;
}
else
{
bool replaced PG_USED_FOR_ASSERTS_ONLY;
- replaced = rt_node_insert_inner(tree, NULL, parent, key, new_child);
+ replaced = rt_node_insert_inner(tree, InvalidRTNodePtr, parent, key,
+ new_child);
Assert(replaced);
}
@@ -945,23 +1012,26 @@ static void
rt_extend(radix_tree *tree, uint64 key)
{
int target_shift;
- int shift = tree->root->shift + RT_NODE_SPAN;
+ rt_node *root = rt_pointer_decode(tree->root);
+ int shift = root->shift + RT_NODE_SPAN;
target_shift = key_get_shift(key);
/* Grow tree from 'shift' to 'target_shift' */
while (shift <= target_shift)
{
- rt_node_inner_4 *node;
+ rt_node_ptr node;
+ rt_node_inner_4 *n4;
- node = (rt_node_inner_4 *) rt_alloc_node(tree, RT_NODE_KIND_4,
- shift, 0, true);
- node->base.n.count = 1;
- node->base.chunks[0] = 0;
- node->children[0] = tree->root;
+ node = rt_alloc_node(tree, RT_NODE_KIND_4, shift, 0, true);
+ n4 = (rt_node_inner_4 *) node.decoded;
- tree->root->chunk = 0;
- tree->root = (rt_node *) node;
+ n4->base.n.count = 1;
+ n4->base.chunks[0] = 0;
+ n4->children[0] = tree->root;
+
+ root->chunk = 0;
+ tree->root = node.encoded;
shift += RT_NODE_SPAN;
}
@@ -974,18 +1044,18 @@ rt_extend(radix_tree *tree, uint64 key)
* Insert inner and leaf nodes from 'node' to bottom.
*/
static inline void
-rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node *parent,
- rt_node *node)
+rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node_ptr parent,
+ rt_node_ptr node)
{
- int shift = node->shift;
+ int shift = NODE_SHIFT(node);
while (shift >= RT_NODE_SPAN)
{
- rt_node *newchild;
+ rt_node_ptr newchild;
int newshift = shift - RT_NODE_SPAN;
newchild = rt_alloc_node(tree, RT_NODE_KIND_4, newshift,
- RT_GET_KEY_CHUNK(key, node->shift),
+ RT_GET_KEY_CHUNK(key, NODE_SHIFT(node)),
newshift > 0);
rt_node_insert_inner(tree, parent, node, key, newchild);
@@ -1006,17 +1076,18 @@ rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node *parent,
* pointer is set to child_p.
*/
static inline bool
-rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **child_p)
+rt_node_search_inner(rt_node_ptr node, uint64 key, rt_action action,
+ rt_pointer *child_p)
{
- uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ uint8 chunk = RT_GET_KEY_CHUNK(key, NODE_SHIFT(node));
bool found = false;
- rt_node *child = NULL;
+ rt_pointer child;
- switch (node->kind)
+ switch (NODE_KIND(node))
{
case RT_NODE_KIND_4:
{
- rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node.decoded;
int idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
if (idx < 0)
@@ -1034,7 +1105,7 @@ rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **chil
}
case RT_NODE_KIND_32:
{
- rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node.decoded;
int idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
if (idx < 0)
@@ -1050,7 +1121,7 @@ rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **chil
}
case RT_NODE_KIND_128:
{
- rt_node_inner_128 *n128 = (rt_node_inner_128 *) node;
+ rt_node_inner_128 *n128 = (rt_node_inner_128 *) node.decoded;
if (!node_128_is_chunk_used((rt_node_base_128 *) n128, chunk))
break;
@@ -1066,7 +1137,7 @@ rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **chil
}
case RT_NODE_KIND_256:
{
- rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node.decoded;
if (!node_inner_256_is_chunk_used(n256, chunk))
break;
@@ -1083,7 +1154,7 @@ rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **chil
/* update statistics */
if (action == RT_ACTION_DELETE && found)
- node->count--;
+ NODE_COUNT(node)--;
if (found && child_p)
*child_p = child;
@@ -1099,17 +1170,17 @@ rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **chil
* to the value is set to value_p.
*/
static inline bool
-rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p)
+rt_node_search_leaf(rt_node_ptr node, uint64 key, rt_action action, uint64 *value_p)
{
- uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ uint8 chunk = RT_GET_KEY_CHUNK(key, NODE_SHIFT(node));
bool found = false;
uint64 value = 0;
- switch (node->kind)
+ switch (NODE_KIND(node))
{
case RT_NODE_KIND_4:
{
- rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node.decoded;
int idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
if (idx < 0)
@@ -1127,7 +1198,7 @@ rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p
}
case RT_NODE_KIND_32:
{
- rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node.decoded;
int idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
if (idx < 0)
@@ -1143,7 +1214,7 @@ rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p
}
case RT_NODE_KIND_128:
{
- rt_node_leaf_128 *n128 = (rt_node_leaf_128 *) node;
+ rt_node_leaf_128 *n128 = (rt_node_leaf_128 *) node.decoded;
if (!node_128_is_chunk_used((rt_node_base_128 *) n128, chunk))
break;
@@ -1159,7 +1230,7 @@ rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p
}
case RT_NODE_KIND_256:
{
- rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node.decoded;
if (!node_leaf_256_is_chunk_used(n256, chunk))
break;
@@ -1176,7 +1247,7 @@ rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p
/* update statistics */
if (action == RT_ACTION_DELETE && found)
- node->count--;
+ NODE_COUNT(node)--;
if (found && value_p)
*value_p = value;
@@ -1186,19 +1257,19 @@ rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p
/* Insert the child to the inner node */
static bool
-rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 key,
- rt_node *child)
+rt_node_insert_inner(radix_tree *tree, rt_node_ptr parent, rt_node_ptr node,
+ uint64 key, rt_node_ptr child)
{
- uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ uint8 chunk = RT_GET_KEY_CHUNK(key, NODE_SHIFT(node));
bool chunk_exists = false;
Assert(!NODE_IS_LEAF(node));
- switch (node->kind)
+ switch (NODE_KIND(node))
{
case RT_NODE_KIND_4:
{
- rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node.decoded;
int idx;
idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
@@ -1206,25 +1277,26 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
{
/* found the existing chunk */
chunk_exists = true;
- n4->children[idx] = child;
+ n4->children[idx] = child.encoded;
break;
}
- if (unlikely(!NODE_HAS_FREE_SLOT(n4)))
+ if (unlikely(!NODE_HAS_FREE_SLOT(node)))
{
+ rt_node_ptr new;
rt_node_inner_32 *new32;
/* grow node from 4 to 32 */
- new32 = (rt_node_inner_32 *) rt_copy_node(tree, (rt_node *) n4,
- RT_NODE_KIND_32);
+ new = rt_copy_node(tree, node, RT_NODE_KIND_32);
+ new32 = (rt_node_inner_32 *) new.decoded;
+
chunk_children_array_copy(n4->base.chunks, n4->children,
new32->base.chunks, new32->children,
n4->base.n.count);
- Assert(parent != NULL);
- rt_replace_node(tree, parent, (rt_node *) n4, (rt_node *) new32,
- key);
- node = (rt_node *) new32;
+ Assert(RTNodePtrIsValid(parent));
+ rt_replace_node(tree, parent, node, new, key);
+ node = new;
}
else
{
@@ -1237,14 +1309,14 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
count, insertpos);
n4->base.chunks[insertpos] = chunk;
- n4->children[insertpos] = child;
+ n4->children[insertpos] = child.encoded;
break;
}
}
/* FALLTHROUGH */
case RT_NODE_KIND_32:
{
- rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node.decoded;
int idx;
idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
@@ -1252,24 +1324,25 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
{
/* found the existing chunk */
chunk_exists = true;
- n32->children[idx] = child;
+ n32->children[idx] = child.encoded;
break;
}
- if (unlikely(!NODE_HAS_FREE_SLOT(n32)))
+ if (unlikely(!NODE_HAS_FREE_SLOT(node)))
{
+ rt_node_ptr new;
rt_node_inner_128 *new128;
/* grow node from 32 to 128 */
- new128 = (rt_node_inner_128 *) rt_copy_node(tree, (rt_node *) n32,
- RT_NODE_KIND_128);
+ new = rt_copy_node(tree, node, RT_NODE_KIND_128);
+ new128 = (rt_node_inner_128 *) new.decoded;
+
for (int i = 0; i < n32->base.n.count; i++)
node_inner_128_insert(new128, n32->base.chunks[i], n32->children[i]);
- Assert(parent != NULL);
- rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new128,
- key);
- node = (rt_node *) new128;
+ Assert(RTNodePtrIsValid(parent));
+ rt_replace_node(tree, parent, node, new, key);
+ node = new;
}
else
{
@@ -1281,31 +1354,33 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
count, insertpos);
n32->base.chunks[insertpos] = chunk;
- n32->children[insertpos] = child;
+ n32->children[insertpos] = child.encoded;
break;
}
}
/* FALLTHROUGH */
case RT_NODE_KIND_128:
{
- rt_node_inner_128 *n128 = (rt_node_inner_128 *) node;
+ rt_node_inner_128 *n128 = (rt_node_inner_128 *) node.decoded;
int cnt = 0;
if (node_128_is_chunk_used((rt_node_base_128 *) n128, chunk))
{
/* found the existing chunk */
chunk_exists = true;
- node_inner_128_update(n128, chunk, child);
+ node_inner_128_update(n128, chunk, child.encoded);
break;
}
- if (unlikely(!NODE_HAS_FREE_SLOT(n128)))
+ if (unlikely(!NODE_HAS_FREE_SLOT(node)))
{
+ rt_node_ptr new;
rt_node_inner_256 *new256;
/* grow node from 128 to 256 */
- new256 = (rt_node_inner_256 *) rt_copy_node(tree, (rt_node *) n128,
- RT_NODE_KIND_256);
+ new = rt_copy_node(tree, node, RT_NODE_KIND_256);
+ new256 = (rt_node_inner_256 *) new.decoded;
+
for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n128->base.n.count; i++)
{
if (!node_128_is_chunk_used((rt_node_base_128 *) n128, i))
@@ -1315,33 +1390,32 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
cnt++;
}
- Assert(parent != NULL);
- rt_replace_node(tree, parent, (rt_node *) n128, (rt_node *) new256,
- key);
- node = (rt_node *) new256;
+ Assert(RTNodePtrIsValid(parent));
+ rt_replace_node(tree, parent, node, new, key);
+ node = new;
}
else
{
- node_inner_128_insert(n128, chunk, child);
+ node_inner_128_insert(n128, chunk, child.encoded);
break;
}
}
/* FALLTHROUGH */
case RT_NODE_KIND_256:
{
- rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node.decoded;
chunk_exists = node_inner_256_is_chunk_used(n256, chunk);
- Assert(chunk_exists || NODE_HAS_FREE_SLOT(n256));
+ Assert(chunk_exists || NODE_HAS_FREE_SLOT(node));
- node_inner_256_set(n256, chunk, child);
+ node_inner_256_set(n256, chunk, child.encoded);
break;
}
}
/* Update statistics */
if (!chunk_exists)
- node->count++;
+ NODE_COUNT(node)++;
/*
* Done. Finally, verify the chunk and value is inserted or replaced
@@ -1354,19 +1428,19 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
/* Insert the value to the leaf node */
static bool
-rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
+rt_node_insert_leaf(radix_tree *tree, rt_node_ptr parent, rt_node_ptr node,
uint64 key, uint64 value)
{
- uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ uint8 chunk = RT_GET_KEY_CHUNK(key, NODE_SHIFT(node));
bool chunk_exists = false;
Assert(NODE_IS_LEAF(node));
- switch (node->kind)
+ switch (NODE_KIND(node))
{
case RT_NODE_KIND_4:
{
- rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node.decoded;
int idx;
idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
@@ -1378,21 +1452,22 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
break;
}
- if (unlikely(!NODE_HAS_FREE_SLOT(n4)))
+ if (unlikely(!NODE_HAS_FREE_SLOT(node)))
{
+ rt_node_ptr new;
rt_node_leaf_32 *new32;
/* grow node from 4 to 32 */
- new32 = (rt_node_leaf_32 *) rt_copy_node(tree, (rt_node *) n4,
- RT_NODE_KIND_32);
+ new = rt_copy_node(tree, node, RT_NODE_KIND_32);
+ new32 = (rt_node_leaf_32 *) new.decoded;
+
chunk_values_array_copy(n4->base.chunks, n4->values,
new32->base.chunks, new32->values,
n4->base.n.count);
- Assert(parent != NULL);
- rt_replace_node(tree, parent, (rt_node *) n4, (rt_node *) new32,
- key);
- node = (rt_node *) new32;
+ Assert(RTNodePtrIsValid(parent));
+ rt_replace_node(tree, parent, node, new, key);
+ node = new;
}
else
{
@@ -1412,7 +1487,7 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
/* FALLTHROUGH */
case RT_NODE_KIND_32:
{
- rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node.decoded;
int idx;
idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
@@ -1424,20 +1499,21 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
break;
}
- if (unlikely(!NODE_HAS_FREE_SLOT(n32)))
+ if (unlikely(!NODE_HAS_FREE_SLOT(node)))
{
+ rt_node_ptr new;
rt_node_leaf_128 *new128;
/* grow node from 32 to 128 */
- new128 = (rt_node_leaf_128 *) rt_copy_node(tree, (rt_node *) n32,
- RT_NODE_KIND_128);
+ new = rt_copy_node(tree, node, RT_NODE_KIND_128);
+ new128 = (rt_node_leaf_128 *) new.decoded;
+
for (int i = 0; i < n32->base.n.count; i++)
node_leaf_128_insert(new128, n32->base.chunks[i], n32->values[i]);
- Assert(parent != NULL);
- rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new128,
- key);
- node = (rt_node *) new128;
+ Assert(RTNodePtrIsValid(parent));
+ rt_replace_node(tree, parent, node, new, key);
+ node = new;
}
else
{
@@ -1456,7 +1532,7 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
/* FALLTHROUGH */
case RT_NODE_KIND_128:
{
- rt_node_leaf_128 *n128 = (rt_node_leaf_128 *) node;
+ rt_node_leaf_128 *n128 = (rt_node_leaf_128 *) node.decoded;
int cnt = 0;
if (node_128_is_chunk_used((rt_node_base_128 *) n128, chunk))
@@ -1467,13 +1543,15 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
break;
}
- if (unlikely(!NODE_HAS_FREE_SLOT(n128)))
+ if (unlikely(!NODE_HAS_FREE_SLOT(node)))
{
+ rt_node_ptr new;
rt_node_leaf_256 *new256;
/* grow node from 128 to 256 */
- new256 = (rt_node_leaf_256 *) rt_copy_node(tree, (rt_node *) n128,
- RT_NODE_KIND_256);
+ new = rt_copy_node(tree, node, RT_NODE_KIND_256);
+ new256 = (rt_node_leaf_256 *) new.decoded;
+
for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n128->base.n.count; i++)
{
if (!node_128_is_chunk_used((rt_node_base_128 *) n128, i))
@@ -1483,10 +1561,9 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
cnt++;
}
- Assert(parent != NULL);
- rt_replace_node(tree, parent, (rt_node *) n128, (rt_node *) new256,
- key);
- node = (rt_node *) new256;
+ Assert(RTNodePtrIsValid(parent));
+ rt_replace_node(tree, parent, node, new, key);
+ node = new;
}
else
{
@@ -1497,10 +1574,10 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
/* FALLTHROUGH */
case RT_NODE_KIND_256:
{
- rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node.decoded;
chunk_exists = node_leaf_256_is_chunk_used(n256, chunk);
- Assert(chunk_exists || NODE_HAS_FREE_SLOT(n256));
+ Assert(chunk_exists || NODE_HAS_FREE_SLOT(node));
node_leaf_256_set(n256, chunk, value);
break;
@@ -1509,7 +1586,7 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
/* Update statistics */
if (!chunk_exists)
- node->count++;
+ NODE_COUNT(node)++;
/*
* Done. Finally, verify the chunk and value is inserted or replaced
@@ -1533,7 +1610,7 @@ rt_create(MemoryContext ctx)
tree = palloc(sizeof(radix_tree));
tree->context = ctx;
- tree->root = NULL;
+ tree->root = InvalidRTPointer;
tree->max_val = 0;
tree->num_keys = 0;
@@ -1582,26 +1659,23 @@ rt_set(radix_tree *tree, uint64 key, uint64 value)
{
int shift;
bool updated;
- rt_node *node;
- rt_node *parent;
+ rt_node_ptr node;
+ rt_node_ptr parent;
/* Empty tree, create the root */
- if (!tree->root)
+ if (!RTPointerIsValid(tree->root))
rt_new_root(tree, key);
/* Extend the tree if necessary */
if (key > tree->max_val)
rt_extend(tree, key);
- Assert(tree->root);
-
- shift = tree->root->shift;
- node = parent = tree->root;
-
/* Descend the tree until a leaf node */
+ node = parent = rt_node_ptr_encoded(tree->root);
+ shift = NODE_SHIFT(node);
while (shift >= 0)
{
- rt_node *child;
+ rt_pointer child;
if (NODE_IS_LEAF(node))
break;
@@ -1613,7 +1687,7 @@ rt_set(radix_tree *tree, uint64 key, uint64 value)
}
parent = node;
- node = child;
+ node = rt_node_ptr_encoded(child);
shift -= RT_NODE_SPAN;
}
@@ -1634,21 +1708,21 @@ rt_set(radix_tree *tree, uint64 key, uint64 value)
bool
rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
{
- rt_node *node;
+ rt_node_ptr node;
int shift;
Assert(value_p != NULL);
- if (!tree->root || key > tree->max_val)
+ if (!RTPointerIsValid(tree->root) || key > tree->max_val)
return false;
- node = tree->root;
- shift = tree->root->shift;
+ node = rt_node_ptr_encoded(tree->root);
+ shift = NODE_SHIFT(node);
/* Descend the tree until a leaf node */
while (shift >= 0)
{
- rt_node *child;
+ rt_pointer child;
if (NODE_IS_LEAF(node))
break;
@@ -1656,7 +1730,7 @@ rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
return false;
- node = child;
+ node = rt_node_ptr_encoded(child);
shift -= RT_NODE_SPAN;
}
@@ -1670,8 +1744,8 @@ rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
bool
rt_delete(radix_tree *tree, uint64 key)
{
- rt_node *node;
- rt_node *stack[RT_MAX_LEVEL] = {0};
+ rt_node_ptr node;
+ rt_node_ptr stack[RT_MAX_LEVEL] = {0};
int shift;
int level;
bool deleted;
@@ -1683,12 +1757,12 @@ rt_delete(radix_tree *tree, uint64 key)
* Descend the tree to search the key while building a stack of nodes we
* visited.
*/
- node = tree->root;
- shift = tree->root->shift;
+ node = rt_node_ptr_encoded(tree->root);
+ shift = NODE_SHIFT(node);
level = -1;
while (shift > 0)
{
- rt_node *child;
+ rt_pointer child;
/* Push the current node to the stack */
stack[++level] = node;
@@ -1696,7 +1770,7 @@ rt_delete(radix_tree *tree, uint64 key)
if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
return false;
- node = child;
+ node = rt_node_ptr_encoded(child);
shift -= RT_NODE_SPAN;
}
@@ -1745,7 +1819,7 @@ rt_delete(radix_tree *tree, uint64 key)
*/
if (level == 0)
{
- tree->root = NULL;
+ tree->root = InvalidRTPointer;
tree->max_val = 0;
}
@@ -1757,6 +1831,7 @@ rt_iter *
rt_begin_iterate(radix_tree *tree)
{
MemoryContext old_ctx;
+ rt_node_ptr root;
rt_iter *iter;
int top_level;
@@ -1766,17 +1841,18 @@ rt_begin_iterate(radix_tree *tree)
iter->tree = tree;
/* empty tree */
- if (!iter->tree)
+ if (!RTPointerIsValid(iter->tree))
return iter;
- top_level = iter->tree->root->shift / RT_NODE_SPAN;
+ root = rt_node_ptr_encoded(iter->tree->root);
+ top_level = NODE_SHIFT(root) / RT_NODE_SPAN;
iter->stack_len = top_level;
/*
* Descend to the left most leaf node from the root. The key is being
* constructed while descending to the leaf.
*/
- rt_update_iter_stack(iter, iter->tree->root, top_level);
+ rt_update_iter_stack(iter, root, top_level);
MemoryContextSwitchTo(old_ctx);
@@ -1787,14 +1863,15 @@ rt_begin_iterate(radix_tree *tree)
* Update each node_iter for inner nodes in the iterator node stack.
*/
static void
-rt_update_iter_stack(rt_iter *iter, rt_node *from_node, int from)
+rt_update_iter_stack(rt_iter *iter, rt_node_ptr from_node, int from)
{
int level = from;
- rt_node *node = from_node;
+ rt_node_ptr node = from_node;
for (;;)
{
rt_node_iter *node_iter = &(iter->stack[level--]);
+ bool found PG_USED_FOR_ASSERTS_ONLY;
node_iter->node = node;
node_iter->current_idx = -1;
@@ -1804,10 +1881,10 @@ rt_update_iter_stack(rt_iter *iter, rt_node *from_node, int from)
return;
/* Advance to the next slot in the inner node */
- node = rt_node_inner_iterate_next(iter, node_iter);
+ found = rt_node_inner_iterate_next(iter, node_iter, &node);
/* We must find the first children in the node */
- Assert(node);
+ Assert(found);
}
}
@@ -1824,7 +1901,7 @@ rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p)
for (;;)
{
- rt_node *child = NULL;
+ rt_node_ptr child = InvalidRTNodePtr;
uint64 value;
int level;
bool found;
@@ -1845,14 +1922,12 @@ rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p)
*/
for (level = 1; level <= iter->stack_len; level++)
{
- child = rt_node_inner_iterate_next(iter, &(iter->stack[level]));
-
- if (child)
+ if (rt_node_inner_iterate_next(iter, &(iter->stack[level]), &child))
break;
}
/* the iteration finished */
- if (!child)
+ if (!RTNodePtrIsValid(child))
return false;
/*
@@ -1884,18 +1959,19 @@ rt_iter_update_key(rt_iter *iter, uint8 chunk, uint8 shift)
* Advance the slot in the inner node. Return the child if exists, otherwise
* null.
*/
-static inline rt_node *
-rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
+static inline bool
+rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter, rt_node_ptr *child_p)
{
- rt_node *child = NULL;
+ rt_node_ptr node = node_iter->node;
+ rt_pointer child;
bool found = false;
uint8 key_chunk;
- switch (node_iter->node->kind)
+ switch (NODE_KIND(node))
{
case RT_NODE_KIND_4:
{
- rt_node_inner_4 *n4 = (rt_node_inner_4 *) node_iter->node;
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node.decoded;
node_iter->current_idx++;
if (node_iter->current_idx >= n4->base.n.count)
@@ -1908,7 +1984,7 @@ rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
}
case RT_NODE_KIND_32:
{
- rt_node_inner_32 *n32 = (rt_node_inner_32 *) node_iter->node;
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node.decoded;
node_iter->current_idx++;
if (node_iter->current_idx >= n32->base.n.count)
@@ -1921,7 +1997,7 @@ rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
}
case RT_NODE_KIND_128:
{
- rt_node_inner_128 *n128 = (rt_node_inner_128 *) node_iter->node;
+ rt_node_inner_128 *n128 = (rt_node_inner_128 *) node.decoded;
int i;
for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
@@ -1941,7 +2017,7 @@ rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
}
case RT_NODE_KIND_256:
{
- rt_node_inner_256 *n256 = (rt_node_inner_256 *) node_iter->node;
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node.decoded;
int i;
for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
@@ -1962,9 +2038,12 @@ rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
}
if (found)
- rt_iter_update_key(iter, key_chunk, node_iter->node->shift);
+ {
+ rt_iter_update_key(iter, key_chunk, NODE_SHIFT(node));
+ *child_p = rt_node_ptr_encoded(child);
+ }
- return child;
+ return found;
}
/*
@@ -1972,19 +2051,18 @@ rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
* is set to value_p, otherwise return false.
*/
static inline bool
-rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
- uint64 *value_p)
+rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter, uint64 *value_p)
{
- rt_node *node = node_iter->node;
+ rt_node_ptr node = node_iter->node;
bool found = false;
uint64 value;
uint8 key_chunk;
- switch (node->kind)
+ switch (NODE_KIND(node))
{
case RT_NODE_KIND_4:
{
- rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node_iter->node;
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node.decoded;
node_iter->current_idx++;
if (node_iter->current_idx >= n4->base.n.count)
@@ -1997,7 +2075,7 @@ rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
}
case RT_NODE_KIND_32:
{
- rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node_iter->node;
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node.decoded;
node_iter->current_idx++;
if (node_iter->current_idx >= n32->base.n.count)
@@ -2010,7 +2088,7 @@ rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
}
case RT_NODE_KIND_128:
{
- rt_node_leaf_128 *n128 = (rt_node_leaf_128 *) node_iter->node;
+ rt_node_leaf_128 *n128 = (rt_node_leaf_128 *) node.decoded;
int i;
for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
@@ -2030,7 +2108,7 @@ rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
}
case RT_NODE_KIND_256:
{
- rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node_iter->node;
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node.decoded;
int i;
for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
@@ -2052,7 +2130,7 @@ rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
if (found)
{
- rt_iter_update_key(iter, key_chunk, node_iter->node->shift);
+ rt_iter_update_key(iter, key_chunk, NODE_SHIFT(node));
*value_p = value;
}
@@ -2089,16 +2167,16 @@ rt_memory_usage(radix_tree *tree)
* Verify the radix tree node.
*/
static void
-rt_verify_node(rt_node *node)
+rt_verify_node(rt_node_ptr node)
{
#ifdef USE_ASSERT_CHECKING
- Assert(node->count >= 0);
+ Assert(NODE_COUNT(node) >= 0);
- switch (node->kind)
+ switch (NODE_KIND(node))
{
case RT_NODE_KIND_4:
{
- rt_node_base_4 *n4 = (rt_node_base_4 *) node;
+ rt_node_base_4 *n4 = (rt_node_base_4 *) node.decoded;
for (int i = 1; i < n4->n.count; i++)
Assert(n4->chunks[i - 1] < n4->chunks[i]);
@@ -2107,7 +2185,7 @@ rt_verify_node(rt_node *node)
}
case RT_NODE_KIND_32:
{
- rt_node_base_32 *n32 = (rt_node_base_32 *) node;
+ rt_node_base_32 *n32 = (rt_node_base_32 *) node.decoded;
for (int i = 1; i < n32->n.count; i++)
Assert(n32->chunks[i - 1] < n32->chunks[i]);
@@ -2116,7 +2194,7 @@ rt_verify_node(rt_node *node)
}
case RT_NODE_KIND_128:
{
- rt_node_base_128 *n128 = (rt_node_base_128 *) node;
+ rt_node_base_128 *n128 = (rt_node_base_128 *) node.decoded;
int cnt = 0;
for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
@@ -2126,10 +2204,10 @@ rt_verify_node(rt_node *node)
/* Check if the corresponding slot is used */
if (NODE_IS_LEAF(node))
- Assert(node_leaf_128_is_slot_used((rt_node_leaf_128 *) node,
+ Assert(node_leaf_128_is_slot_used((rt_node_leaf_128 *) n128,
n128->slot_idxs[i]));
else
- Assert(node_inner_128_is_slot_used((rt_node_inner_128 *) node,
+ Assert(node_inner_128_is_slot_used((rt_node_inner_128 *) n128,
n128->slot_idxs[i]));
cnt++;
@@ -2142,7 +2220,7 @@ rt_verify_node(rt_node *node)
{
if (NODE_IS_LEAF(node))
{
- rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node.decoded;
int cnt = 0;
for (int i = 0; i < RT_NODE_NSLOTS_BITS(RT_NODE_MAX_SLOTS); i++)
@@ -2163,9 +2241,11 @@ rt_verify_node(rt_node *node)
void
rt_stats(radix_tree *tree)
{
+ rt_node_ptr root = rt_node_ptr_encoded(tree->root);
+
ereport(LOG, (errmsg("num_keys = %lu, height = %u, n4 = %u, n32 = %u, n128 = %u, n256 = %u",
tree->num_keys,
- tree->root->shift / RT_NODE_SPAN,
+ NODE_SHIFT(root) / RT_NODE_SPAN,
tree->cnt[0],
tree->cnt[1],
tree->cnt[2],
@@ -2173,42 +2253,44 @@ rt_stats(radix_tree *tree)
}
static void
-rt_dump_node(rt_node *node, int level, bool recurse)
+rt_dump_node(rt_node_ptr node, int level, bool recurse)
{
+ rt_node *n = node.decoded;
char space[128] = {0};
fprintf(stderr, "[%s] kind %d, count %u, shift %u, chunk 0x%X:\n",
NODE_IS_LEAF(node) ? "LEAF" : "INNR",
- (node->kind == RT_NODE_KIND_4) ? 4 :
- (node->kind == RT_NODE_KIND_32) ? 32 :
- (node->kind == RT_NODE_KIND_128) ? 128 : 256,
- node->count, node->shift, node->chunk);
+ (NODE_KIND(node) == RT_NODE_KIND_4) ? 4 :
+ (NODE_KIND(node) == RT_NODE_KIND_32) ? 32 :
+ (NODE_KIND(node) == RT_NODE_KIND_128) ? 128 : 256,
+ n->count, n->shift, n->chunk);
if (level > 0)
sprintf(space, "%*c", level * 4, ' ');
- switch (node->kind)
+ switch (NODE_KIND(node))
{
case RT_NODE_KIND_4:
{
- for (int i = 0; i < node->count; i++)
+ for (int i = 0; i < NODE_COUNT(node); i++)
{
if (NODE_IS_LEAF(node))
{
- rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node.decoded;
fprintf(stderr, "%schunk 0x%X value 0x%lX\n",
space, n4->base.chunks[i], n4->values[i]);
}
else
{
- rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node.decoded;
fprintf(stderr, "%schunk 0x%X ->",
space, n4->base.chunks[i]);
if (recurse)
- rt_dump_node(n4->children[i], level + 1, recurse);
+ rt_dump_node(rt_node_ptr_encoded(n4->children[i]),
+ level + 1, recurse);
else
fprintf(stderr, "\n");
}
@@ -2217,25 +2299,26 @@ rt_dump_node(rt_node *node, int level, bool recurse)
}
case RT_NODE_KIND_32:
{
- for (int i = 0; i < node->count; i++)
+ for (int i = 0; i < NODE_COUNT(node); i++)
{
if (NODE_IS_LEAF(node))
{
- rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node.decoded;
fprintf(stderr, "%schunk 0x%X value 0x%lX\n",
space, n32->base.chunks[i], n32->values[i]);
}
else
{
- rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node.decoded;
fprintf(stderr, "%schunk 0x%X ->",
space, n32->base.chunks[i]);
if (recurse)
{
- rt_dump_node(n32->children[i], level + 1, recurse);
+ rt_dump_node(rt_node_ptr_encoded(n32->children[i]),
+ level + 1, recurse);
}
else
fprintf(stderr, "\n");
@@ -2245,7 +2328,7 @@ rt_dump_node(rt_node *node, int level, bool recurse)
}
case RT_NODE_KIND_128:
{
- rt_node_base_128 *b128 = (rt_node_base_128 *) node;
+ rt_node_base_128 *b128 = (rt_node_base_128 *) node.decoded;
fprintf(stderr, "slot_idxs ");
for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
@@ -2257,7 +2340,7 @@ rt_dump_node(rt_node *node, int level, bool recurse)
}
if (NODE_IS_LEAF(node))
{
- rt_node_leaf_128 *n = (rt_node_leaf_128 *) node;
+ rt_node_leaf_128 *n = (rt_node_leaf_128 *) node.decoded;
fprintf(stderr, ", isset-bitmap:");
for (int i = 0; i < 16; i++)
@@ -2287,7 +2370,7 @@ rt_dump_node(rt_node *node, int level, bool recurse)
space, i);
if (recurse)
- rt_dump_node(node_inner_128_get_child(n128, i),
+ rt_dump_node(rt_node_ptr_encoded(node_inner_128_get_child(n128, i)),
level + 1, recurse);
else
fprintf(stderr, "\n");
@@ -2301,7 +2384,7 @@ rt_dump_node(rt_node *node, int level, bool recurse)
{
if (NODE_IS_LEAF(node))
{
- rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node.decoded;
if (!node_leaf_256_is_chunk_used(n256, i))
continue;
@@ -2311,7 +2394,7 @@ rt_dump_node(rt_node *node, int level, bool recurse)
}
else
{
- rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node.decoded;
if (!node_inner_256_is_chunk_used(n256, i))
continue;
@@ -2320,8 +2403,8 @@ rt_dump_node(rt_node *node, int level, bool recurse)
space, i);
if (recurse)
- rt_dump_node(node_inner_256_get_child(n256, i), level + 1,
- recurse);
+ rt_dump_node(rt_node_ptr_encoded(node_inner_256_get_child(n256, i)),
+ level + 1, recurse);
else
fprintf(stderr, "\n");
}
@@ -2334,14 +2417,14 @@ rt_dump_node(rt_node *node, int level, bool recurse)
void
rt_dump_search(radix_tree *tree, uint64 key)
{
- rt_node *node;
+ rt_node_ptr node;
int shift;
int level = 0;
elog(NOTICE, "-----------------------------------------------------------");
elog(NOTICE, "max_val = %lu (0x%lX)", tree->max_val, tree->max_val);
- if (!tree->root)
+ if (!RTPointerIsValid(tree->root))
{
elog(NOTICE, "tree is empty");
return;
@@ -2354,11 +2437,11 @@ rt_dump_search(radix_tree *tree, uint64 key)
return;
}
- node = tree->root;
- shift = tree->root->shift;
+ node = rt_node_ptr_encoded(tree->root);
+ shift = NODE_SHIFT(node);
while (shift >= 0)
{
- rt_node *child;
+ rt_pointer child;
rt_dump_node(node, level, false);
@@ -2375,7 +2458,7 @@ rt_dump_search(radix_tree *tree, uint64 key)
if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
break;
- node = child;
+ node = rt_node_ptr_encoded(child);
shift -= RT_NODE_SPAN;
level++;
}
@@ -2384,6 +2467,8 @@ rt_dump_search(radix_tree *tree, uint64 key)
void
rt_dump(radix_tree *tree)
{
+ rt_node_ptr root;
+
for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
fprintf(stderr, "%s\tinner_size%lu\tinner_blocksize %lu\tleaf_size %lu\tleaf_blocksize %lu\n",
rt_node_kind_info[i].name,
@@ -2393,12 +2478,13 @@ rt_dump(radix_tree *tree)
rt_node_kind_info[i].leaf_blocksize);
fprintf(stderr, "max_val = %lu\n", tree->max_val);
- if (!tree->root)
+ if (!RTPointerIsValid(tree->root))
{
fprintf(stderr, "empty tree\n");
return;
}
- rt_dump_node(tree->root, 0, true);
+ root = rt_node_ptr_encoded(tree->root);
+ rt_dump_node(root, 0, true);
}
#endif
--
2.31.1
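To make the rt_node_ptr changes above easier to review: the idea is that every node reference now travels as a small handle holding both the encoded rt_pointer (what is stored inside the tree) and the decoded local address (what the code actually dereferences). A minimal standalone sketch of that pattern, with purely illustrative names and not code from the patch, looks like this:

/* Illustrative sketch only -- not code from the patch. */
#include <stdint.h>

typedef uintptr_t demo_pointer;		/* value stored inside the tree */

typedef struct demo_node
{
	uint8_t		shift;
	uint8_t		count;
} demo_node;

/* Handle pairing the stored representation with a usable local address */
typedef struct demo_node_handle
{
	demo_pointer encoded;
	demo_node  *decoded;
} demo_node_handle;

static inline demo_node_handle
demo_handle_from_encoded(demo_pointer encoded)
{
	demo_node_handle h;

	/* In a purely local tree the two representations coincide. */
	h.encoded = encoded;
	h.decoded = (demo_node *) encoded;
	return h;
}

Keeping both forms in one struct is what lets the later patches swap the decode step for a dsa_get_address() call without touching most callers.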
Attachment: v10-0005-PoC-tag-the-node-kind-to-rt_pointer.patch (application/octet-stream)
From 180ee3a0691bd1c7986f41dfec51673891e5cc06 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Thu, 17 Nov 2022 11:16:06 +0900
Subject: [PATCH v10 5/7] PoC: tag the node kind to rt_pointer.
---
src/backend/lib/radixtree.c | 19 +++++++++++--------
1 file changed, 11 insertions(+), 8 deletions(-)
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
index 67f4dc646e..08d580a899 100644
--- a/src/backend/lib/radixtree.c
+++ b/src/backend/lib/radixtree.c
@@ -141,6 +141,8 @@ typedef enum
typedef uintptr_t rt_pointer;
#define InvalidRTPointer ((rt_pointer) 0)
#define RTPointerIsValid(x) (((rt_pointer) (x)) != InvalidRTPointer)
+#define RTPointerTagKind(x, k) ((rt_pointer) (x) | ((k) & RT_POINTER_KIND_MASK))
+#define RTPointerUnTagKind(x) ((rt_pointer) (x) & ~RT_POINTER_KIND_MASK)
/* Common type for all nodes types */
typedef struct rt_node
@@ -159,8 +161,10 @@ typedef struct rt_node
uint8 shift;
uint8 chunk;
- /* Size kind of the node */
- uint8 kind;
+ /*
+ * The node kind is tagged into the rt_pointer, see the comments of
+ * rt_pointer for details.
+ */
} rt_node;
#define RT_NODE_IS_LEAF(n) (((rt_node *) (n))->shift == 0)
@@ -303,7 +307,7 @@ typedef struct rt_node_ptr
#define NODE_RAW(n) (((rt_node_ptr) (n)).decoded)
#define NODE_IS_LEAF(n) (NODE_RAW(n)->shift == 0)
#define NODE_IS_EMPTY(n) (NODE_COUNT(n) == 0)
-#define NODE_KIND(n) (NODE_RAW(n)->kind)
+#define NODE_KIND(n) ((uint8) (((rt_node_ptr) (n)).encoded & RT_POINTER_KIND_MASK))
#define NODE_COUNT(n) (NODE_RAW(n)->count)
#define NODE_SHIFT(n) (NODE_RAW(n)->shift)
#define NODE_CHUNK(n) (NODE_RAW(n)->chunk)
@@ -444,13 +448,13 @@ static void rt_verify_node(rt_node_ptr node);
static inline rt_node *
rt_pointer_decode(rt_pointer encoded)
{
- return (rt_node *) encoded;
+ return (rt_node *) RTPointerUnTagKind(encoded);
}
static inline rt_pointer
-rt_pointer_encode(rt_node *decoded)
+rt_pointer_encode(rt_node *decoded, uint8 kind)
{
- return (rt_pointer) decoded;
+ return (rt_pointer) RTPointerTagKind(decoded, kind);
}
/* Return a rt_pointer created from the given encoded pointer */
@@ -923,8 +927,7 @@ rt_alloc_node(radix_tree *tree, int kind, uint8 shift, uint8 chunk, bool inner)
newnode.decoded = (rt_node *) MemoryContextAllocZero(tree->leaf_slabs[kind],
rt_node_kind_info[kind].leaf_size);
- newnode.encoded = rt_pointer_encode(newnode.decoded);
- NODE_KIND(newnode) = kind;
+ newnode.encoded = rt_pointer_encode(newnode.decoded, kind);
NODE_SHIFT(newnode) = shift;
NODE_CHUNK(newnode) = chunk;
--
2.31.1
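The 0005 patch relies on node allocations being aligned well enough that the node kind can live in the otherwise-unused low bits of rt_pointer. A tiny self-contained sketch of that tagging scheme follows; the names and the 3-bit mask are my illustration (the patch's actual RT_POINTER_KIND_MASK is defined elsewhere):

/* Illustrative only; assumes allocations are at least 8-byte aligned,
 * leaving the 3 low bits of a pointer free for a small tag. */
#include <assert.h>
#include <stdint.h>

#define DEMO_KIND_MASK ((uintptr_t) 0x7)

static inline uintptr_t
demo_tag_kind(void *node, uint8_t kind)
{
	assert(((uintptr_t) node & DEMO_KIND_MASK) == 0);
	return (uintptr_t) node | (kind & DEMO_KIND_MASK);
}

static inline void *
demo_untag_kind(uintptr_t tagged)
{
	return (void *) (tagged & ~DEMO_KIND_MASK);
}

static inline uint8_t
demo_get_kind(uintptr_t tagged)
{
	return (uint8_t) (tagged & DEMO_KIND_MASK);
}

The benefit is that NODE_KIND() can be answered from the pointer itself without dereferencing the node, which also removes the need for the 'kind' byte in the node header, as the hunk above shows.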
Attachment: v10-0007-PoC-lazy-vacuum-integration.patch (application/octet-stream)
From 2e6cc9188b06ec7ed548fe556bc1402bf1b88976 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 4 Nov 2022 14:14:42 +0900
Subject: [PATCH v10 7/7] PoC: lazy vacuum integration.
The patch includes:
* Introducing a new module called TIDStore
* Lazy vacuum and parallel vacuum integration.
TODOs:
* radix tree needs to have the reset functionality.
* should not allow TIDStore to grow beyond the memory limit.
* change the progress statistics of pg_stat_progress_vacuum.
---
src/backend/access/common/Makefile | 1 +
src/backend/access/common/meson.build | 1 +
src/backend/access/common/tidstore.c | 280 ++++++++++++++++++++++++++
src/backend/access/heap/vacuumlazy.c | 160 +++++----------
src/backend/commands/vacuum.c | 76 +------
src/backend/commands/vacuumparallel.c | 63 +++---
src/backend/storage/lmgr/lwlock.c | 2 +
src/include/access/tidstore.h | 55 +++++
src/include/commands/vacuum.h | 24 +--
src/include/storage/lwlock.h | 1 +
10 files changed, 437 insertions(+), 226 deletions(-)
create mode 100644 src/backend/access/common/tidstore.c
create mode 100644 src/include/access/tidstore.h
diff --git a/src/backend/access/common/Makefile b/src/backend/access/common/Makefile
index b9aff0ccfd..67b8cc6108 100644
--- a/src/backend/access/common/Makefile
+++ b/src/backend/access/common/Makefile
@@ -27,6 +27,7 @@ OBJS = \
syncscan.o \
toast_compression.o \
toast_internals.o \
+ tidstore.o \
tupconvert.o \
tupdesc.o
diff --git a/src/backend/access/common/meson.build b/src/backend/access/common/meson.build
index 857beaa32d..76265974b1 100644
--- a/src/backend/access/common/meson.build
+++ b/src/backend/access/common/meson.build
@@ -13,6 +13,7 @@ backend_sources += files(
'syncscan.c',
'toast_compression.c',
'toast_internals.c',
+ 'tidstore.c',
'tupconvert.c',
'tupdesc.c',
)
diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
new file mode 100644
index 0000000000..50ec800fd6
--- /dev/null
+++ b/src/backend/access/common/tidstore.c
@@ -0,0 +1,280 @@
+/*-------------------------------------------------------------------------
+ *
+ * tidstore.c
+ * TID (ItemPointer) storage implementation.
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/common/tidstore.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/tidstore.h"
+#include "lib/radixtree.h"
+#include "utils/dsa.h"
+#include "utils/memutils.h"
+
+/* XXX: should be configurable for non-heap AMs */
+#define TIDSTORE_OFFSET_NBITS 11 /* pg_ceil_log2_32(MaxHeapTuplesPerPage) */
+
+#define TIDSTORE_VALUE_NBITS 6 /* log(sizeof(uint64) * BITS_PER_BYTE, 2) */
+
+/* Get block number from the key */
+#define KEY_GET_BLKNO(key) \
+ ((BlockNumber) ((key) >> (TIDSTORE_OFFSET_NBITS - TIDSTORE_VALUE_NBITS)))
+
+struct TIDStore
+{
+ /* main storage for TID */
+ radix_tree *tree;
+
+ /* # of tids in TIDStore */
+ int num_tids;
+
+ /* DSA area and handle for shared TIDStore */
+ rt_handle handle;
+ dsa_area *area;
+};
+
+static void tidstore_iter_collect_tids(TIDStoreIter *iter, uint64 key, uint64 val);
+static inline uint64 tid_to_key_off(ItemPointer tid, uint32 *off);
+
+/*
+ * Create a TIDStore. The returned object is allocated in backend-local memory.
+ * The radix tree for storage is allocated in the DSA area if 'area' is non-NULL.
+ */
+TIDStore *
+tidstore_create(dsa_area *area)
+{
+ TIDStore *ts;
+
+ ts = palloc0(sizeof(TIDStore));
+
+ ts->tree = rt_create(CurrentMemoryContext, area);
+ ts->area = area;
+
+ if (area != NULL)
+ ts->handle = rt_get_handle(ts->tree);
+
+ return ts;
+}
+
+/* Attach to the shared TIDStore using a handle */
+TIDStore *
+tidstore_attach(dsa_area *area, rt_handle handle)
+{
+ TIDStore *ts;
+
+ Assert(area != NULL);
+ Assert(DsaPointerIsValid(handle));
+
+ ts = palloc0(sizeof(TIDStore));
+ ts->tree = rt_attach(area, handle);
+
+ return ts;
+}
+
+/*
+ * Detach from a TIDStore. This detaches from the radix tree and frees the
+ * backend-local resources.
+ */
+void
+tidstore_detach(TIDStore *ts)
+{
+ rt_detach(ts->tree);
+ pfree(ts);
+}
+
+void
+tidstore_free(TIDStore *ts)
+{
+ rt_free(ts->tree);
+ pfree(ts);
+}
+
+void
+tidstore_reset(TIDStore *ts)
+{
+ dsa_area *area = ts->area;
+
+ /* Reset the statistics */
+ ts->num_tids = 0;
+
+ /* Recreate radix tree storage */
+ rt_free(ts->tree);
+ ts->tree = rt_create(CurrentMemoryContext, area);
+}
+
+/* Add TIDs to TIDStore */
+void
+tidstore_add_tids(TIDStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets)
+{
+ uint64 last_key = PG_UINT64_MAX;
+ uint64 key;
+ uint64 val = 0;
+ ItemPointerData tid;
+
+ ItemPointerSetBlockNumber(&tid, blkno);
+
+ for (int i = 0; i < num_offsets; i++)
+ {
+ uint32 off;
+
+ ItemPointerSetOffsetNumber(&tid, offsets[i]);
+
+ key = tid_to_key_off(&tid, &off);
+
+ if (last_key != PG_UINT64_MAX && last_key != key)
+ {
+ rt_set(ts->tree, last_key, val);
+ val = 0;
+ }
+
+ last_key = key;
+ val |= UINT64CONST(1) << off;
+ ts->num_tids++;
+ }
+
+ if (last_key != PG_UINT64_MAX)
+ {
+ rt_set(ts->tree, last_key, val);
+ val = 0;
+ }
+}
+
+/* Return true if the given TID is present in TIDStore */
+bool
+tidstore_lookup_tid(TIDStore *ts, ItemPointer tid)
+{
+ uint64 key;
+ uint64 val;
+ uint32 off;
+ bool found;
+
+ key = tid_to_key_off(tid, &off);
+
+ found = rt_search(ts->tree, key, &val);
+
+ if (!found)
+ return false;
+
+ return (val & (UINT64CONST(1) << off)) != 0;
+}
+
+TIDStoreIter *
+tidstore_begin_iterate(TIDStore *ts)
+{
+ TIDStoreIter *iter;
+
+ iter = palloc0(sizeof(TIDStoreIter));
+ iter->ts = ts;
+ iter->tree_iter = rt_begin_iterate(ts->tree);
+ iter->blkno = InvalidBlockNumber;
+
+ return iter;
+}
+
+bool
+tidstore_iterate_next(TIDStoreIter *iter)
+{
+ uint64 key;
+ uint64 val;
+
+ if (iter->finished)
+ return false;
+
+ if (BlockNumberIsValid(iter->blkno))
+ {
+ iter->num_offsets = 0;
+ tidstore_iter_collect_tids(iter, iter->next_key, iter->next_val);
+ }
+
+ while (rt_iterate_next(iter->tree_iter, &key, &val))
+ {
+ BlockNumber blkno;
+
+ blkno = KEY_GET_BLKNO(key);
+
+ if (BlockNumberIsValid(iter->blkno) && iter->blkno != blkno)
+ {
+ /*
+ * Remember the key-value pair for the next block for the
+ * next iteration.
+ */
+ iter->next_key = key;
+ iter->next_val = val;
+ return true;
+ }
+
+ /* Collect tids extracted from the key-value pair */
+ tidstore_iter_collect_tids(iter, key, val);
+ }
+
+ iter->finished = true;
+ return true;
+}
+
+uint64
+tidstore_num_tids(TIDStore *ts)
+{
+ return ts->num_tids;
+}
+
+uint64
+tidstore_memory_usage(TIDStore *ts)
+{
+ return (uint64) sizeof(TIDStore) + rt_memory_usage(ts->tree);
+}
+
+/*
+ * Get a handle that can be used by other processes to attach to this TIDStore
+ */
+tidstore_handle
+tidstore_get_handle(TIDStore *ts)
+{
+ return rt_get_handle(ts->tree);
+}
+
+/* Extract TIDs from key-value pair */
+static void
+tidstore_iter_collect_tids(TIDStoreIter *iter, uint64 key, uint64 val)
+{
+ for (int i = 0; i < sizeof(uint64) * BITS_PER_BYTE; i++)
+ {
+ uint64 tid_i;
+ OffsetNumber off;
+
+ if ((val & (UINT64CONST(1) << i)) == 0)
+ continue;
+
+ tid_i = key << TIDSTORE_VALUE_NBITS;
+ tid_i |= i;
+
+ off = tid_i & ((UINT64CONST(1) << TIDSTORE_OFFSET_NBITS) - 1);
+ iter->offsets[iter->num_offsets++] = off;
+ }
+
+ iter->blkno = KEY_GET_BLKNO(key);
+}
+
+/* Encode a TID into a radix tree key and the bit position within its value */
+static inline uint64
+tid_to_key_off(ItemPointer tid, uint32 *off)
+{
+ uint64 upper;
+ uint64 tid_i;
+
+ tid_i = ItemPointerGetOffsetNumber(tid);
+ tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << TIDSTORE_OFFSET_NBITS;
+
+ *off = tid_i & ((1 << TIDSTORE_VALUE_NBITS) - 1);
+ upper = tid_i >> TIDSTORE_VALUE_NBITS;
+ Assert(*off < (sizeof(uint64) * BITS_PER_BYTE));
+
+ return upper;
+}
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 834ab83a0e..cda405dd99 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -40,6 +40,7 @@
#include "access/heapam_xlog.h"
#include "access/htup_details.h"
#include "access/multixact.h"
+#include "access/tidstore.h"
#include "access/transam.h"
#include "access/visibilitymap.h"
#include "access/xact.h"
@@ -144,6 +145,8 @@ typedef struct LVRelState
Relation *indrels;
int nindexes;
+ int max_bytes;
+
/* Aggressive VACUUM? (must set relfrozenxid >= FreezeLimit) */
bool aggressive;
/* Use visibility map to skip? (disabled by DISABLE_PAGE_SKIPPING) */
@@ -194,7 +197,7 @@ typedef struct LVRelState
* lazy_vacuum_heap_rel, which marks the same LP_DEAD line pointers as
* LP_UNUSED during second heap pass.
*/
- VacDeadItems *dead_items; /* TIDs whose index tuples we'll delete */
+ TIDStore *dead_items; /* TIDs whose index tuples we'll delete */
BlockNumber rel_pages; /* total number of pages */
BlockNumber scanned_pages; /* # pages examined (not skipped via VM) */
BlockNumber removed_pages; /* # pages removed by relation truncation */
@@ -265,8 +268,9 @@ static bool lazy_scan_noprune(LVRelState *vacrel, Buffer buf,
static void lazy_vacuum(LVRelState *vacrel);
static bool lazy_vacuum_all_indexes(LVRelState *vacrel);
static void lazy_vacuum_heap_rel(LVRelState *vacrel);
-static int lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
- Buffer buffer, int index, Buffer *vmbuffer);
+static void lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
+ OffsetNumber *offsets, int num_offsets,
+ Buffer buffer, Buffer *vmbuffer);
static bool lazy_check_wraparound_failsafe(LVRelState *vacrel);
static void lazy_cleanup_all_indexes(LVRelState *vacrel);
static IndexBulkDeleteResult *lazy_vacuum_one_index(Relation indrel,
@@ -397,6 +401,9 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
vacrel->indname = NULL;
vacrel->phase = VACUUM_ERRCB_PHASE_UNKNOWN;
vacrel->verbose = verbose;
+ vacrel->max_bytes = IsAutoVacuumWorkerProcess() &&
+ autovacuum_work_mem != -1 ?
+ autovacuum_work_mem * 1024L : maintenance_work_mem * 1024L;
errcallback.callback = vacuum_error_callback;
errcallback.arg = vacrel;
errcallback.previous = error_context_stack;
@@ -858,7 +865,7 @@ lazy_scan_heap(LVRelState *vacrel)
next_unskippable_block,
next_failsafe_block = 0,
next_fsm_block_to_vacuum = 0;
- VacDeadItems *dead_items = vacrel->dead_items;
+ TIDStore *dead_items = vacrel->dead_items;
Buffer vmbuffer = InvalidBuffer;
bool next_unskippable_allvis,
skipping_current_range;
@@ -872,7 +879,7 @@ lazy_scan_heap(LVRelState *vacrel)
/* Report that we're scanning the heap, advertising total # of blocks */
initprog_val[0] = PROGRESS_VACUUM_PHASE_SCAN_HEAP;
initprog_val[1] = rel_pages;
- initprog_val[2] = dead_items->max_items;
+ initprog_val[2] = vacrel->max_bytes; /* XXX: should use # of tids */
pgstat_progress_update_multi_param(3, initprog_index, initprog_val);
/* Set up an initial range of skippable blocks using the visibility map */
@@ -942,8 +949,8 @@ lazy_scan_heap(LVRelState *vacrel)
* dead_items TIDs, pause and do a cycle of vacuuming before we tackle
* this page.
*/
- Assert(dead_items->max_items >= MaxHeapTuplesPerPage);
- if (dead_items->max_items - dead_items->num_items < MaxHeapTuplesPerPage)
+ /* XXX: should not allow tidstore to grow beyond max_bytes */
+ if (tidstore_memory_usage(vacrel->dead_items) > vacrel->max_bytes)
{
/*
* Before beginning index vacuuming, we release any pin we may
@@ -1075,11 +1082,17 @@ lazy_scan_heap(LVRelState *vacrel)
if (prunestate.has_lpdead_items)
{
Size freespace;
+ TIDStoreIter *iter;
- lazy_vacuum_heap_page(vacrel, blkno, buf, 0, &vmbuffer);
+ iter = tidstore_begin_iterate(vacrel->dead_items);
+ tidstore_iterate_next(iter);
+ lazy_vacuum_heap_page(vacrel, blkno, iter->offsets, iter->num_offsets,
+ buf, &vmbuffer);
+ Assert(!tidstore_iterate_next(iter));
+ pfree(iter);
/* Forget the LP_DEAD items that we just vacuumed */
- dead_items->num_items = 0;
+ tidstore_reset(dead_items);
/*
* Periodically perform FSM vacuuming to make newly-freed
@@ -1116,7 +1129,7 @@ lazy_scan_heap(LVRelState *vacrel)
* with prunestate-driven visibility map and FSM steps (just like
* the two-pass strategy).
*/
- Assert(dead_items->num_items == 0);
+ Assert(tidstore_num_tids(dead_items) == 0);
}
/*
@@ -1269,7 +1282,7 @@ lazy_scan_heap(LVRelState *vacrel)
* Do index vacuuming (call each index's ambulkdelete routine), then do
* related heap vacuuming
*/
- if (dead_items->num_items > 0)
+ if (tidstore_num_tids(dead_items) > 0)
lazy_vacuum(vacrel);
/*
@@ -1868,25 +1881,16 @@ retry:
*/
if (lpdead_items > 0)
{
- VacDeadItems *dead_items = vacrel->dead_items;
- ItemPointerData tmp;
+ TIDStore *dead_items = vacrel->dead_items;
Assert(!prunestate->all_visible);
Assert(prunestate->has_lpdead_items);
vacrel->lpdead_item_pages++;
+ tidstore_add_tids(dead_items, blkno, deadoffsets, lpdead_items);
- ItemPointerSetBlockNumber(&tmp, blkno);
-
- for (int i = 0; i < lpdead_items; i++)
- {
- ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
- dead_items->items[dead_items->num_items++] = tmp;
- }
-
- Assert(dead_items->num_items <= dead_items->max_items);
pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
- dead_items->num_items);
+ tidstore_num_tids(dead_items));
}
/* Finally, add page-local counts to whole-VACUUM counts */
@@ -2093,8 +2097,7 @@ lazy_scan_noprune(LVRelState *vacrel,
}
else
{
- VacDeadItems *dead_items = vacrel->dead_items;
- ItemPointerData tmp;
+ TIDStore *dead_items = vacrel->dead_items;
/*
* Page has LP_DEAD items, and so any references/TIDs that remain in
@@ -2103,17 +2106,10 @@ lazy_scan_noprune(LVRelState *vacrel,
*/
vacrel->lpdead_item_pages++;
- ItemPointerSetBlockNumber(&tmp, blkno);
-
- for (int i = 0; i < lpdead_items; i++)
- {
- ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
- dead_items->items[dead_items->num_items++] = tmp;
- }
+ tidstore_add_tids(dead_items, blkno, deadoffsets, lpdead_items);
- Assert(dead_items->num_items <= dead_items->max_items);
pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
- dead_items->num_items);
+ tidstore_num_tids(dead_items));
vacrel->lpdead_items += lpdead_items;
@@ -2162,7 +2158,7 @@ lazy_vacuum(LVRelState *vacrel)
if (!vacrel->do_index_vacuuming)
{
Assert(!vacrel->do_index_cleanup);
- vacrel->dead_items->num_items = 0;
+ tidstore_reset(vacrel->dead_items);
return;
}
@@ -2191,7 +2187,7 @@ lazy_vacuum(LVRelState *vacrel)
BlockNumber threshold;
Assert(vacrel->num_index_scans == 0);
- Assert(vacrel->lpdead_items == vacrel->dead_items->num_items);
+ Assert(vacrel->lpdead_items == tidstore_num_tids(vacrel->dead_items));
Assert(vacrel->do_index_vacuuming);
Assert(vacrel->do_index_cleanup);
@@ -2218,8 +2214,8 @@ lazy_vacuum(LVRelState *vacrel)
* cases then this may need to be reconsidered.
*/
threshold = (double) vacrel->rel_pages * BYPASS_THRESHOLD_PAGES;
- bypass = (vacrel->lpdead_item_pages < threshold &&
- vacrel->lpdead_items < MAXDEADITEMS(32L * 1024L * 1024L));
+ bypass = (vacrel->lpdead_item_pages < threshold) &&
+ tidstore_memory_usage(vacrel->dead_items) < (32L * 1024L * 1024L);
}
if (bypass)
@@ -2264,7 +2260,7 @@ lazy_vacuum(LVRelState *vacrel)
* Forget the LP_DEAD items that we just vacuumed (or just decided to not
* vacuum)
*/
- vacrel->dead_items->num_items = 0;
+ /* tidstore_reset(vacrel->dead_items); */
}
/*
@@ -2336,7 +2332,7 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
* place).
*/
Assert(vacrel->num_index_scans > 0 ||
- vacrel->dead_items->num_items == vacrel->lpdead_items);
+ tidstore_num_tids(vacrel->dead_items) == vacrel->lpdead_items);
Assert(allindexes || vacrel->failsafe_active);
/*
@@ -2373,10 +2369,10 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
static void
lazy_vacuum_heap_rel(LVRelState *vacrel)
{
- int index;
BlockNumber vacuumed_pages;
Buffer vmbuffer = InvalidBuffer;
LVSavedErrInfo saved_err_info;
+ TIDStoreIter *iter;
Assert(vacrel->do_index_vacuuming);
Assert(vacrel->do_index_cleanup);
@@ -2393,8 +2389,8 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
vacuumed_pages = 0;
- index = 0;
- while (index < vacrel->dead_items->num_items)
+ iter = tidstore_begin_iterate(vacrel->dead_items);
+ while (tidstore_iterate_next(iter))
{
BlockNumber tblk;
Buffer buf;
@@ -2403,12 +2399,13 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
vacuum_delay_point();
- tblk = ItemPointerGetBlockNumber(&vacrel->dead_items->items[index]);
+ tblk = iter->blkno;
vacrel->blkno = tblk;
buf = ReadBufferExtended(vacrel->rel, MAIN_FORKNUM, tblk, RBM_NORMAL,
vacrel->bstrategy);
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
- index = lazy_vacuum_heap_page(vacrel, tblk, buf, index, &vmbuffer);
+ lazy_vacuum_heap_page(vacrel, tblk, iter->offsets, iter->num_offsets,
+ buf, &vmbuffer);
/* Now that we've vacuumed the page, record its available space */
page = BufferGetPage(buf);
@@ -2432,9 +2429,8 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
* We set all LP_DEAD items from the first heap pass to LP_UNUSED during
* the second heap pass. No more, no less.
*/
- Assert(index > 0);
Assert(vacrel->num_index_scans > 1 ||
- (index == vacrel->lpdead_items &&
+ (tidstore_num_tids(vacrel->dead_items) == vacrel->lpdead_items &&
vacuumed_pages == vacrel->lpdead_item_pages));
ereport(DEBUG2,
@@ -2456,11 +2452,10 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
* LP_DEAD item on the page. The return value is the first index immediately
* after all LP_DEAD items for the same page in the array.
*/
-static int
-lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
- int index, Buffer *vmbuffer)
+static void
+lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets, Buffer buffer, Buffer *vmbuffer)
{
- VacDeadItems *dead_items = vacrel->dead_items;
Page page = BufferGetPage(buffer);
OffsetNumber unused[MaxHeapTuplesPerPage];
int uncnt = 0;
@@ -2479,16 +2474,11 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
START_CRIT_SECTION();
- for (; index < dead_items->num_items; index++)
+ for (int i = 0; i < num_offsets; i++)
{
- BlockNumber tblk;
- OffsetNumber toff;
ItemId itemid;
+ OffsetNumber toff = offsets[i];
- tblk = ItemPointerGetBlockNumber(&dead_items->items[index]);
- if (tblk != blkno)
- break; /* past end of tuples for this block */
- toff = ItemPointerGetOffsetNumber(&dead_items->items[index]);
itemid = PageGetItemId(page, toff);
Assert(ItemIdIsDead(itemid) && !ItemIdHasStorage(itemid));
@@ -2568,7 +2558,6 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
/* Revert to the previous phase information for error traceback */
restore_vacuum_error_info(vacrel, &saved_err_info);
- return index;
}
/*
@@ -3070,46 +3059,6 @@ count_nondeletable_pages(LVRelState *vacrel, bool *lock_waiter_detected)
return vacrel->nonempty_pages;
}
-/*
- * Returns the number of dead TIDs that VACUUM should allocate space to
- * store, given a heap rel of size vacrel->rel_pages, and given current
- * maintenance_work_mem setting (or current autovacuum_work_mem setting,
- * when applicable).
- *
- * See the comments at the head of this file for rationale.
- */
-static int
-dead_items_max_items(LVRelState *vacrel)
-{
- int64 max_items;
- int vac_work_mem = IsAutoVacuumWorkerProcess() &&
- autovacuum_work_mem != -1 ?
- autovacuum_work_mem : maintenance_work_mem;
-
- if (vacrel->nindexes > 0)
- {
- BlockNumber rel_pages = vacrel->rel_pages;
-
- max_items = MAXDEADITEMS(vac_work_mem * 1024L);
- max_items = Min(max_items, INT_MAX);
- max_items = Min(max_items, MAXDEADITEMS(MaxAllocSize));
-
- /* curious coding here to ensure the multiplication can't overflow */
- if ((BlockNumber) (max_items / MaxHeapTuplesPerPage) > rel_pages)
- max_items = rel_pages * MaxHeapTuplesPerPage;
-
- /* stay sane if small maintenance_work_mem */
- max_items = Max(max_items, MaxHeapTuplesPerPage);
- }
- else
- {
- /* One-pass case only stores a single heap page's TIDs at a time */
- max_items = MaxHeapTuplesPerPage;
- }
-
- return (int) max_items;
-}
-
/*
* Allocate dead_items (either using palloc, or in dynamic shared memory).
* Sets dead_items in vacrel for caller.
@@ -3120,12 +3069,6 @@ dead_items_max_items(LVRelState *vacrel)
static void
dead_items_alloc(LVRelState *vacrel, int nworkers)
{
- VacDeadItems *dead_items;
- int max_items;
-
- max_items = dead_items_max_items(vacrel);
- Assert(max_items >= MaxHeapTuplesPerPage);
-
/*
* Initialize state for a parallel vacuum. As of now, only one worker can
* be used for an index, so we invoke parallelism only if there are at
@@ -3151,7 +3094,6 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
else
vacrel->pvs = parallel_vacuum_init(vacrel->rel, vacrel->indrels,
vacrel->nindexes, nworkers,
- max_items,
vacrel->verbose ? INFO : DEBUG2,
vacrel->bstrategy);
@@ -3164,11 +3106,7 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
}
/* Serial VACUUM case */
- dead_items = (VacDeadItems *) palloc(vac_max_items_to_alloc_size(max_items));
- dead_items->max_items = max_items;
- dead_items->num_items = 0;
-
- vacrel->dead_items = dead_items;
+ vacrel->dead_items = tidstore_create(NULL);
}
/*
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 3c8ea21475..effb72cdd6 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -95,7 +95,6 @@ static bool vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params);
static double compute_parallel_delay(void);
static VacOptValue get_vacoptval_from_boolean(DefElem *def);
static bool vac_tid_reaped(ItemPointer itemptr, void *state);
-static int vac_cmp_itemptr(const void *left, const void *right);
/*
* Primary entry point for manual VACUUM and ANALYZE commands
@@ -2295,16 +2294,16 @@ get_vacoptval_from_boolean(DefElem *def)
*/
IndexBulkDeleteResult *
vac_bulkdel_one_index(IndexVacuumInfo *ivinfo, IndexBulkDeleteResult *istat,
- VacDeadItems *dead_items)
+ TIDStore *dead_items)
{
/* Do bulk deletion */
istat = index_bulk_delete(ivinfo, istat, vac_tid_reaped,
(void *) dead_items);
ereport(ivinfo->message_level,
- (errmsg("scanned index \"%s\" to remove %d row versions",
+ (errmsg("scanned index \"%s\" to remove " UINT64_FORMAT " row versions",
RelationGetRelationName(ivinfo->index),
- dead_items->num_items)));
+ tidstore_num_tids(dead_items))));
return istat;
}
@@ -2335,18 +2334,6 @@ vac_cleanup_one_index(IndexVacuumInfo *ivinfo, IndexBulkDeleteResult *istat)
return istat;
}
-/*
- * Returns the total required space for VACUUM's dead_items array given a
- * max_items value.
- */
-Size
-vac_max_items_to_alloc_size(int max_items)
-{
- Assert(max_items <= MAXDEADITEMS(MaxAllocSize));
-
- return offsetof(VacDeadItems, items) + sizeof(ItemPointerData) * max_items;
-}
-
/*
* vac_tid_reaped() -- is a particular tid deletable?
*
@@ -2357,60 +2344,7 @@ vac_max_items_to_alloc_size(int max_items)
static bool
vac_tid_reaped(ItemPointer itemptr, void *state)
{
- VacDeadItems *dead_items = (VacDeadItems *) state;
- int64 litem,
- ritem,
- item;
- ItemPointer res;
-
- litem = itemptr_encode(&dead_items->items[0]);
- ritem = itemptr_encode(&dead_items->items[dead_items->num_items - 1]);
- item = itemptr_encode(itemptr);
-
- /*
- * Doing a simple bound check before bsearch() is useful to avoid the
- * extra cost of bsearch(), especially if dead items on the heap are
- * concentrated in a certain range. Since this function is called for
- * every index tuple, it pays to be really fast.
- */
- if (item < litem || item > ritem)
- return false;
-
- res = (ItemPointer) bsearch((void *) itemptr,
- (void *) dead_items->items,
- dead_items->num_items,
- sizeof(ItemPointerData),
- vac_cmp_itemptr);
-
- return (res != NULL);
-}
-
-/*
- * Comparator routines for use with qsort() and bsearch().
- */
-static int
-vac_cmp_itemptr(const void *left, const void *right)
-{
- BlockNumber lblk,
- rblk;
- OffsetNumber loff,
- roff;
-
- lblk = ItemPointerGetBlockNumber((ItemPointer) left);
- rblk = ItemPointerGetBlockNumber((ItemPointer) right);
-
- if (lblk < rblk)
- return -1;
- if (lblk > rblk)
- return 1;
-
- loff = ItemPointerGetOffsetNumber((ItemPointer) left);
- roff = ItemPointerGetOffsetNumber((ItemPointer) right);
-
- if (loff < roff)
- return -1;
- if (loff > roff)
- return 1;
+ TIDStore *dead_items = (TIDStore *) state;
- return 0;
+ return tidstore_lookup_tid(dead_items, itemptr);
}
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index f26d796e52..070503f662 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -44,7 +44,7 @@
* use small integers.
*/
#define PARALLEL_VACUUM_KEY_SHARED 1
-#define PARALLEL_VACUUM_KEY_DEAD_ITEMS 2
+#define PARALLEL_VACUUM_KEY_DSA 2
#define PARALLEL_VACUUM_KEY_QUERY_TEXT 3
#define PARALLEL_VACUUM_KEY_BUFFER_USAGE 4
#define PARALLEL_VACUUM_KEY_WAL_USAGE 5
@@ -103,6 +103,9 @@ typedef struct PVShared
/* Counter for vacuuming and cleanup */
pg_atomic_uint32 idx;
+
+ /* Handle of the shared TIDStore */
+ tidstore_handle dead_items_handle;
} PVShared;
/* Status used during parallel index vacuum or cleanup */
@@ -166,7 +169,8 @@ struct ParallelVacuumState
PVIndStats *indstats;
/* Shared dead items space among parallel vacuum workers */
- VacDeadItems *dead_items;
+ TIDStore *dead_items;
+ dsa_area *dead_items_area;
/* Points to buffer usage area in DSM */
BufferUsage *buffer_usage;
@@ -222,20 +226,22 @@ static void parallel_vacuum_error_callback(void *arg);
*/
ParallelVacuumState *
parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
- int nrequested_workers, int max_items,
- int elevel, BufferAccessStrategy bstrategy)
+ int nrequested_workers, int elevel,
+ BufferAccessStrategy bstrategy)
{
ParallelVacuumState *pvs;
ParallelContext *pcxt;
PVShared *shared;
- VacDeadItems *dead_items;
+ TIDStore *dead_items;
PVIndStats *indstats;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
+ void *area_space;
+ dsa_area *dead_items_dsa;
bool *will_parallel_vacuum;
Size est_indstats_len;
Size est_shared_len;
- Size est_dead_items_len;
+ Size dsa_minsize = dsa_minimum_size();
int nindexes_mwm = 0;
int parallel_workers = 0;
int querylen;
@@ -283,9 +289,8 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_estimate_chunk(&pcxt->estimator, est_shared_len);
shm_toc_estimate_keys(&pcxt->estimator, 1);
- /* Estimate size for dead_items -- PARALLEL_VACUUM_KEY_DEAD_ITEMS */
- est_dead_items_len = vac_max_items_to_alloc_size(max_items);
- shm_toc_estimate_chunk(&pcxt->estimator, est_dead_items_len);
+ /* Estimate size for dead tuple DSA -- PARALLEL_VACUUM_KEY_DSA */
+ shm_toc_estimate_chunk(&pcxt->estimator, dsa_minsize);
shm_toc_estimate_keys(&pcxt->estimator, 1);
/*
@@ -351,6 +356,16 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_INDEX_STATS, indstats);
pvs->indstats = indstats;
+ /* Prepare DSA space for dead items */
+ area_space = shm_toc_allocate(pcxt->toc, dsa_minsize);
+ shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_DSA, area_space);
+ dead_items_dsa = dsa_create_in_place(area_space, dsa_minsize,
+ LWTRANCHE_PARALLEL_VACUUM_DSA,
+ pcxt->seg);
+ dead_items = tidstore_create(dead_items_dsa);
+ pvs->dead_items = dead_items;
+ pvs->dead_items_area = dead_items_dsa;
+
/* Prepare shared information */
shared = (PVShared *) shm_toc_allocate(pcxt->toc, est_shared_len);
MemSet(shared, 0, est_shared_len);
@@ -360,6 +375,7 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
(nindexes_mwm > 0) ?
maintenance_work_mem / Min(parallel_workers, nindexes_mwm) :
maintenance_work_mem;
+ shared->dead_items_handle = tidstore_get_handle(dead_items);
pg_atomic_init_u32(&(shared->cost_balance), 0);
pg_atomic_init_u32(&(shared->active_nworkers), 0);
@@ -368,15 +384,6 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_SHARED, shared);
pvs->shared = shared;
- /* Prepare the dead_items space */
- dead_items = (VacDeadItems *) shm_toc_allocate(pcxt->toc,
- est_dead_items_len);
- dead_items->max_items = max_items;
- dead_items->num_items = 0;
- MemSet(dead_items->items, 0, sizeof(ItemPointerData) * max_items);
- shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_DEAD_ITEMS, dead_items);
- pvs->dead_items = dead_items;
-
/*
* Allocate space for each worker's BufferUsage and WalUsage; no need to
* initialize
@@ -434,6 +441,9 @@ parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats)
istats[i] = NULL;
}
+ tidstore_free(pvs->dead_items);
+ dsa_detach(pvs->dead_items_area);
+
DestroyParallelContext(pvs->pcxt);
ExitParallelMode();
@@ -442,7 +452,7 @@ parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats)
}
/* Returns the dead items space */
-VacDeadItems *
+TIDStore *
parallel_vacuum_get_dead_items(ParallelVacuumState *pvs)
{
return pvs->dead_items;
@@ -940,7 +950,9 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
Relation *indrels;
PVIndStats *indstats;
PVShared *shared;
- VacDeadItems *dead_items;
+ TIDStore *dead_items;
+ void *area_space;
+ dsa_area *dead_items_area;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
int nindexes;
@@ -984,10 +996,10 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
PARALLEL_VACUUM_KEY_INDEX_STATS,
false);
- /* Set dead_items space */
- dead_items = (VacDeadItems *) shm_toc_lookup(toc,
- PARALLEL_VACUUM_KEY_DEAD_ITEMS,
- false);
+ /* Set dead items */
+ area_space = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_DSA, false);
+ dead_items_area = dsa_attach_in_place(area_space, seg);
+ dead_items = tidstore_attach(dead_items_area, shared->dead_items_handle);
/* Set cost-based vacuum delay */
VacuumCostActive = (VacuumCostDelay > 0);
@@ -1033,6 +1045,9 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
&wal_usage[ParallelWorkerNumber]);
+ tidstore_detach(pvs.dead_items);
+ dsa_detach(dead_items_area);
+
/* Pop the error context stack */
error_context_stack = errcallback.previous;
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index a5ad36ca78..2fb30fe2e7 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -183,6 +183,8 @@ static const char *const BuiltinTrancheNames[] = {
"PgStatsHash",
/* LWTRANCHE_PGSTATS_DATA: */
"PgStatsData",
+ /* LWTRANCHE_PARALLEL_VACUUM_DSA: */
+ "ParallelVacuumDSA",
};
StaticAssertDecl(lengthof(BuiltinTrancheNames) ==
diff --git a/src/include/access/tidstore.h b/src/include/access/tidstore.h
new file mode 100644
index 0000000000..40b8021f9b
--- /dev/null
+++ b/src/include/access/tidstore.h
@@ -0,0 +1,55 @@
+/*-------------------------------------------------------------------------
+ *
+ * tidstore.h
+ * TID storage.
+ *
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/tidstore.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef TIDSTORE_H
+#define TIDSTORE_H
+
+#include "lib/radixtree.h"
+#include "storage/itemptr.h"
+
+typedef dsa_pointer tidstore_handle;
+
+typedef struct TIDStore TIDStore;
+
+typedef struct TIDStoreIter
+{
+ TIDStore *ts;
+
+ rt_iter *tree_iter;
+
+ bool finished;
+
+ uint64 next_key;
+ uint64 next_val;
+
+ BlockNumber blkno;
+ OffsetNumber offsets[MaxOffsetNumber]; /* XXX: usually far larger than needed */
+ int num_offsets;
+} TIDStoreIter;
+
+extern TIDStore *tidstore_create(dsa_area *dsa);
+extern TIDStore *tidstore_attach(dsa_area *dsa, dsa_pointer handle);
+extern void tidstore_detach(TIDStore *ts);
+extern void tidstore_free(TIDStore *ts);
+extern void tidstore_reset(TIDStore *ts);
+extern void tidstore_add_tids(TIDStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets);
+extern bool tidstore_lookup_tid(TIDStore *ts, ItemPointer tid);
+extern TIDStoreIter * tidstore_begin_iterate(TIDStore *ts);
+extern bool tidstore_iterate_next(TIDStoreIter *iter);
+extern uint64 tidstore_num_tids(TIDStore *ts);
+extern uint64 tidstore_memory_usage(TIDStore *ts);
+extern tidstore_handle tidstore_get_handle(TIDStore *ts);
+
+#endif /* TIDSTORE_H */
+
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 5d816ba7f4..d221528f16 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -17,6 +17,7 @@
#include "access/htup.h"
#include "access/genam.h"
#include "access/parallel.h"
+#include "access/tidstore.h"
#include "catalog/pg_class.h"
#include "catalog/pg_statistic.h"
#include "catalog/pg_type.h"
@@ -235,21 +236,6 @@ typedef struct VacuumParams
int nworkers;
} VacuumParams;
-/*
- * VacDeadItems stores TIDs whose index tuples are deleted by index vacuuming.
- */
-typedef struct VacDeadItems
-{
- int max_items; /* # slots allocated in array */
- int num_items; /* current # of entries */
-
- /* Sorted array of TIDs to delete from indexes */
- ItemPointerData items[FLEXIBLE_ARRAY_MEMBER];
-} VacDeadItems;
-
-#define MAXDEADITEMS(avail_mem) \
- (((avail_mem) - offsetof(VacDeadItems, items)) / sizeof(ItemPointerData))
-
/* GUC parameters */
extern PGDLLIMPORT int default_statistics_target; /* PGDLLIMPORT for PostGIS */
extern PGDLLIMPORT int vacuum_freeze_min_age;
@@ -306,18 +292,16 @@ extern Relation vacuum_open_relation(Oid relid, RangeVar *relation,
LOCKMODE lmode);
extern IndexBulkDeleteResult *vac_bulkdel_one_index(IndexVacuumInfo *ivinfo,
IndexBulkDeleteResult *istat,
- VacDeadItems *dead_items);
+ TIDStore *dead_items);
extern IndexBulkDeleteResult *vac_cleanup_one_index(IndexVacuumInfo *ivinfo,
IndexBulkDeleteResult *istat);
-extern Size vac_max_items_to_alloc_size(int max_items);
/* in commands/vacuumparallel.c */
extern ParallelVacuumState *parallel_vacuum_init(Relation rel, Relation *indrels,
int nindexes, int nrequested_workers,
- int max_items, int elevel,
- BufferAccessStrategy bstrategy);
+ int elevel, BufferAccessStrategy bstrategy);
extern void parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats);
-extern VacDeadItems *parallel_vacuum_get_dead_items(ParallelVacuumState *pvs);
+extern TIDStore *parallel_vacuum_get_dead_items(ParallelVacuumState *pvs);
extern void parallel_vacuum_bulkdel_all_indexes(ParallelVacuumState *pvs,
long num_table_tuples,
int num_index_scans);
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index a494cb598f..88e35254d1 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -201,6 +201,7 @@ typedef enum BuiltinTrancheIds
LWTRANCHE_PGSTATS_DSA,
LWTRANCHE_PGSTATS_HASH,
LWTRANCHE_PGSTATS_DATA,
+ LWTRANCHE_PARALLEL_VACUUM_DSA,
LWTRANCHE_FIRST_USER_DEFINED
} BuiltinTrancheIds;
--
2.31.1
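As an aid for reviewing tidstore.c: each TID is split into a radix tree key and a bit position inside the 64-bit value, using the same constants as the patch (11 bits for the offset number, with the low 6 bits selecting a bit in the value). The following standalone sketch is illustrative only, not patch code, and just shows the arithmetic:

#include <stdint.h>
#include <stdio.h>

#define DEMO_OFFSET_NBITS 11	/* bits reserved for the heap offset number */
#define DEMO_VALUE_NBITS  6		/* bits selecting a position in a 64-bit value */

/* Split (block, offset) into a radix tree key and a bit position. */
static uint64_t
demo_tid_to_key_off(uint32_t blkno, uint16_t offnum, uint32_t *bitpos)
{
	uint64_t	tid_i = ((uint64_t) blkno << DEMO_OFFSET_NBITS) | offnum;

	*bitpos = (uint32_t) (tid_i & ((1 << DEMO_VALUE_NBITS) - 1));
	return tid_i >> DEMO_VALUE_NBITS;
}

int
main(void)
{
	uint32_t	bitpos;
	uint64_t	key = demo_tid_to_key_off(10, 3, &bitpos);

	/* For block 10, offset 3 this prints key = 320, bit = 3. */
	printf("key = %llu, bit = %u\n", (unsigned long long) key, bitpos);
	return 0;
}

Because all dead offsets of one heap block land in a handful of adjacent keys, tidstore_add_tids() can batch them into a single rt_set() per key, and tidstore_lookup_tid() needs only one rt_search() plus a bit test.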
Attachment: v10-0006-PoC-DSA-support-for-radix-tree.patch (application/octet-stream)
From b85513ab0f8654df36aa913f4b29b626e652943f Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Thu, 27 Oct 2022 14:02:00 +0900
Subject: [PATCH v10 6/7] PoC: DSA support for radix tree.
---
.../bench_radix_tree--1.0.sql | 2 +
contrib/bench_radix_tree/bench_radix_tree.c | 12 +-
src/backend/lib/radixtree.c | 483 +++++++++++++-----
src/backend/utils/mmgr/dsa.c | 12 +
src/include/lib/radixtree.h | 8 +-
src/include/utils/dsa.h | 1 +
.../expected/test_radixtree.out | 17 +
.../modules/test_radixtree/test_radixtree.c | 100 ++--
8 files changed, 482 insertions(+), 153 deletions(-)
diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
index e0205b364e..b5f731f329 100644
--- a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
+++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
@@ -7,6 +7,7 @@ create function bench_shuffle_search(
minblk int4,
maxblk int4,
random_block bool DEFAULT false,
+shared bool DEFAULT false,
OUT nkeys int8,
OUT rt_mem_allocated int8,
OUT array_mem_allocated int8,
@@ -23,6 +24,7 @@ create function bench_seq_search(
minblk int4,
maxblk int4,
random_block bool DEFAULT false,
+shared bool DEFAULT false,
OUT nkeys int8,
OUT rt_mem_allocated int8,
OUT array_mem_allocated int8,
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c
index 70ca989118..225a1b3bb1 100644
--- a/contrib/bench_radix_tree/bench_radix_tree.c
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -15,6 +15,7 @@
#include "lib/radixtree.h"
#include <math.h>
#include "miscadmin.h"
+#include "storage/lwlock.h"
#include "utils/timestamp.h"
PG_MODULE_MAGIC;
@@ -150,7 +151,9 @@ bench_search(FunctionCallInfo fcinfo, bool shuffle)
BlockNumber minblk = PG_GETARG_INT32(0);
BlockNumber maxblk = PG_GETARG_INT32(1);
bool random_block = PG_GETARG_BOOL(2);
+ bool shared = PG_GETARG_BOOL(3);
radix_tree *rt = NULL;
+ dsa_area *dsa = NULL;
uint64 ntids;
uint64 key;
uint64 last_key = PG_UINT64_MAX;
@@ -172,8 +175,11 @@ bench_search(FunctionCallInfo fcinfo, bool shuffle)
tids = generate_tids(minblk, maxblk, TIDS_PER_BLOCK_FOR_LOAD, &ntids, random_block);
+ if (shared)
+ dsa = dsa_create(LWLockNewTrancheId());
+
/* measure the load time of the radix tree */
- rt = rt_create(CurrentMemoryContext);
+ rt = rt_create(CurrentMemoryContext, dsa);
start_time = GetCurrentTimestamp();
for (int i = 0; i < ntids; i++)
{
@@ -324,7 +330,7 @@ bench_load_random_int(PG_FUNCTION_ARGS)
elog(ERROR, "return type must be a row type");
pg_prng_seed(&state, 0);
- rt = rt_create(CurrentMemoryContext);
+ rt = rt_create(CurrentMemoryContext, NULL);
start_time = GetCurrentTimestamp();
for (uint64 i = 0; i < cnt; i++)
@@ -450,7 +456,7 @@ bench_fixed_height_search(PG_FUNCTION_ARGS)
if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
elog(ERROR, "return type must be a row type");
- rt = rt_create(CurrentMemoryContext);
+ rt = rt_create(CurrentMemoryContext, NULL);
start_time = GetCurrentTimestamp();
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
index 08d580a899..1f2bb95e24 100644
--- a/src/backend/lib/radixtree.c
+++ b/src/backend/lib/radixtree.c
@@ -22,6 +22,15 @@
* choose it to avoid an additional pointer traversal. It is the reason this code
* currently does not support variable-length keys.
*
+ * If a DSA area is specified when calling rt_create(), the radix tree is created
+ * in that DSA area so that multiple processes can access it simultaneously. The
+ * process that created the shared radix tree needs to pass both the DSA area
+ * given to rt_create() and the handle of the radix tree, fetched by
+ * rt_get_handle(), to other processes so that they can attach via rt_attach().
+ *
+ * XXX: the shared radix tree is still in a PoC state as it doesn't have any
+ * locking support. Also, it supports only single-process iteration.
+ *
* XXX: Most functions in this file have two variants for inner nodes and leaf
* nodes, therefore there are duplication codes. While this sometimes makes the
* code maintenance tricky, this reduces branch prediction misses when judging
@@ -34,6 +43,9 @@
*
* rt_create - Create a new, empty radix tree
* rt_free - Free the radix tree
+ * rt_attach - Attach to the radix tree
+ * rt_detach - Detach from the radix tree
+ * rt_get_handle - Return the handle of the radix tree
* rt_search - Search a key-value pair
* rt_set - Set a key-value pair
* rt_delete - Delete a key-value pair
@@ -64,6 +76,7 @@
#include "miscadmin.h"
#include "port/pg_bitutils.h"
#include "port/pg_lfind.h"
+#include "utils/dsa.h"
#include "utils/memutils.h"
/* The number of bits encoded in one tree level */
@@ -384,6 +397,11 @@ static rt_node_kind_info_elem rt_node_kind_info[RT_NODE_KIND_COUNT] = {
* construct the key whenever updating the node iteration information, e.g., when
* advancing the current index within the node or when moving to the next node
* at the same level.
+ *
+ * XXX: Currently we allow only one process to iterate at a time. Therefore, rt_node_iter
+ * holds local pointers to nodes, rather than rt_node_ptr.
+ * We need either a safeguard that disallows other processes from beginning an iteration
+ * while one is in progress, or support for multiple processes iterating concurrently.
*/
typedef struct rt_node_iter
{
@@ -403,23 +421,43 @@ struct rt_iter
uint64 key;
};
-/* A radix tree with nodes */
-struct radix_tree
+/* A magic value used to identify our radix tree */
+#define RADIXTREE_MAGIC 0x54A48167
+
+/* Control information for a radix tree */
+typedef struct radix_tree_control
{
- MemoryContext context;
+ rt_handle handle;
+ uint32 magic;
+ /* Root node */
rt_pointer root;
- uint64 max_val;
- uint64 num_keys;
- MemoryContextData *inner_slabs[RT_NODE_KIND_COUNT];
- MemoryContextData *leaf_slabs[RT_NODE_KIND_COUNT];
+ pg_atomic_uint64 max_val;
+ pg_atomic_uint64 num_keys;
/* statistics */
#ifdef RT_DEBUG
int32 cnt[RT_NODE_KIND_COUNT];
#endif
+} radix_tree_control;
+
+/* A radix tree with nodes */
+struct radix_tree
+{
+ MemoryContext context;
+
+ /* control object in either backend-local memory or DSA */
+ radix_tree_control *ctl;
+
+ /* used only when the radix tree is shared */
+ dsa_area *area;
+
+ /* used only when the radix tree is private */
+ MemoryContextData *inner_slabs[RT_NODE_KIND_COUNT];
+ MemoryContextData *leaf_slabs[RT_NODE_KIND_COUNT];
};
+#define RadixTreeIsShared(rt) ((rt)->area != NULL)
static void rt_new_root(radix_tree *tree, uint64 key);
static rt_node_ptr rt_alloc_node(radix_tree *tree, int kind, uint8 shift, uint8 chunk,
@@ -446,24 +484,31 @@ static void rt_verify_node(rt_node_ptr node);
/* Decode and encode function of rt_pointer */
static inline rt_node *
-rt_pointer_decode(rt_pointer encoded)
+rt_pointer_decode(radix_tree *tree, rt_pointer encoded)
{
- return (rt_node *) RTPointerUnTagKind(encoded);
+ encoded = RTPointerUnTagKind(encoded);
+
+ if (RadixTreeIsShared(tree))
+ return (rt_node *) dsa_get_address(tree->area, encoded);
+ else
+ return (rt_node *) encoded;
}
static inline rt_pointer
-rt_pointer_encode(rt_node *decoded, uint8 kind)
+rt_pointer_encode(rt_pointer decoded, uint8 kind)
{
+ Assert((decoded & RT_POINTER_KIND_MASK) == 0);
+
return (rt_pointer) RTPointerTagKind(decoded, kind);
}
/* Return a rt_pointer created from the given encoded pointer */
static inline rt_node_ptr
-rt_node_ptr_encoded(rt_pointer encoded)
+rt_node_ptr_encoded(radix_tree *tree, rt_pointer encoded)
{
return (rt_node_ptr) {
.encoded = encoded,
- .decoded = rt_pointer_decode(encoded)
+ .decoded = rt_pointer_decode(tree, encoded)
};
}
@@ -908,8 +953,8 @@ rt_new_root(radix_tree *tree, uint64 key)
rt_node_ptr node;
node = rt_alloc_node(tree, RT_NODE_KIND_4, shift, 0, shift > 0);
- tree->max_val = shift_get_max_val(shift);
- tree->root = node.encoded;
+ pg_atomic_write_u64(&tree->ctl->max_val, shift_get_max_val(shift));
+ tree->ctl->root = node.encoded;
}
/*
@@ -918,16 +963,35 @@ rt_new_root(radix_tree *tree, uint64 key)
static rt_node_ptr
rt_alloc_node(radix_tree *tree, int kind, uint8 shift, uint8 chunk, bool inner)
{
- rt_node_ptr newnode;
+ rt_node_ptr newnode;
+
+ if (tree->area != NULL)
+ {
+ dsa_pointer dp;
+
+ if (inner)
+ dp = dsa_allocate0(tree->area, rt_node_kind_info[kind].inner_size);
+ else
+ dp = dsa_allocate0(tree->area, rt_node_kind_info[kind].leaf_size);
- if (inner)
- newnode.decoded = (rt_node *) MemoryContextAllocZero(tree->inner_slabs[kind],
- rt_node_kind_info[kind].inner_size);
+ newnode.encoded = rt_pointer_encode((rt_pointer) dp, kind);
+ newnode.decoded = (rt_node *) dsa_get_address(tree->area, dp);
+ }
else
- newnode.decoded = (rt_node *) MemoryContextAllocZero(tree->leaf_slabs[kind],
- rt_node_kind_info[kind].leaf_size);
+ {
+ rt_node *new;
+
+ if (inner)
+ new = (rt_node *) MemoryContextAllocZero(tree->inner_slabs[kind],
+ rt_node_kind_info[kind].inner_size);
+ else
+ new = (rt_node *) MemoryContextAllocZero(tree->leaf_slabs[kind],
+ rt_node_kind_info[kind].leaf_size);
+
+ newnode.encoded = rt_pointer_encode((rt_pointer) new, kind);
+ newnode.decoded = new;
+ }
- newnode.encoded = rt_pointer_encode(newnode.decoded, kind);
NODE_SHIFT(newnode) = shift;
NODE_CHUNK(newnode) = chunk;
@@ -941,7 +1005,7 @@ rt_alloc_node(radix_tree *tree, int kind, uint8 shift, uint8 chunk, bool inner)
#ifdef RT_DEBUG
/* update the statistics */
- tree->cnt[kind]++;
+ tree->ctl->cnt[kind]++;
#endif
return newnode;
@@ -968,16 +1032,19 @@ static void
rt_free_node(radix_tree *tree, rt_node_ptr node)
{
/* If we're deleting the root node, make the tree empty */
- if (tree->root == node.encoded)
- tree->root = InvalidRTPointer;
+ if (tree->ctl->root == node.encoded)
+ tree->ctl->root = InvalidRTPointer;
#ifdef RT_DEBUG
/* update the statistics */
- tree->cnt[NODE_KIND(node)]--;
- Assert(tree->cnt[NODE_KIND(node)] >= 0);
+ tree->ctl->cnt[NODE_KIND(node)]--;
+ Assert(tree->ctl->cnt[NODE_KIND(node)] >= 0);
#endif
- pfree(node.decoded);
+ if (RadixTreeIsShared(tree))
+ dsa_free(tree->area, (dsa_pointer) RTPointerUnTagKind(node.encoded));
+ else
+ pfree(node.decoded);
}
/*
@@ -993,7 +1060,7 @@ rt_replace_node(radix_tree *tree, rt_node_ptr parent, rt_node_ptr old_child,
if (rt_node_ptr_eq(&parent, &old_child))
{
/* Replace the root node with the new large node */
- tree->root = new_child.encoded;
+ tree->ctl->root = new_child.encoded;
}
else
{
@@ -1015,7 +1082,7 @@ static void
rt_extend(radix_tree *tree, uint64 key)
{
int target_shift;
- rt_node *root = rt_pointer_decode(tree->root);
+ rt_node *root = rt_pointer_decode(tree, tree->ctl->root);
int shift = root->shift + RT_NODE_SPAN;
target_shift = key_get_shift(key);
@@ -1031,15 +1098,15 @@ rt_extend(radix_tree *tree, uint64 key)
n4->base.n.count = 1;
n4->base.chunks[0] = 0;
- n4->children[0] = tree->root;
+ n4->children[0] = tree->ctl->root;
root->chunk = 0;
- tree->root = node.encoded;
+ tree->ctl->root = node.encoded;
shift += RT_NODE_SPAN;
}
- tree->max_val = shift_get_max_val(target_shift);
+ pg_atomic_write_u64(&tree->ctl->max_val, shift_get_max_val(target_shift));
}
/*
@@ -1068,7 +1135,7 @@ rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node_ptr parent,
}
rt_node_insert_leaf(tree, parent, node, key, value);
- tree->num_keys++;
+ pg_atomic_add_fetch_u64(&tree->ctl->num_keys, 1);
}
/*
@@ -1079,8 +1146,7 @@ rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node_ptr parent,
* pointer is set to child_p.
*/
static inline bool
-rt_node_search_inner(rt_node_ptr node, uint64 key, rt_action action,
- rt_pointer *child_p)
+rt_node_search_inner(rt_node_ptr node, uint64 key, rt_action action, rt_pointer *child_p)
{
uint8 chunk = RT_GET_KEY_CHUNK(key, NODE_SHIFT(node));
bool found = false;
@@ -1115,6 +1181,7 @@ rt_node_search_inner(rt_node_ptr node, uint64 key, rt_action action,
break;
found = true;
+
if (action == RT_ACTION_FIND)
child = n32->children[idx];
else /* RT_ACTION_DELETE */
@@ -1604,33 +1671,50 @@ rt_node_insert_leaf(radix_tree *tree, rt_node_ptr parent, rt_node_ptr node,
* Create the radix tree in the given memory context and return it.
*/
radix_tree *
-rt_create(MemoryContext ctx)
+rt_create(MemoryContext ctx, dsa_area *area)
{
radix_tree *tree;
MemoryContext old_ctx;
old_ctx = MemoryContextSwitchTo(ctx);
- tree = palloc(sizeof(radix_tree));
+ tree = (radix_tree *) palloc0(sizeof(radix_tree));
tree->context = ctx;
- tree->root = InvalidRTPointer;
- tree->max_val = 0;
- tree->num_keys = 0;
+
+ if (area != NULL)
+ {
+ dsa_pointer dp;
+
+ tree->area = area;
+ dp = dsa_allocate0(area, sizeof(radix_tree_control));
+ tree->ctl = (radix_tree_control *) dsa_get_address(area, dp);
+ tree->ctl->handle = (rt_handle) dp;
+ }
+ else
+ {
+ tree->ctl = (radix_tree_control *) palloc0(sizeof(radix_tree_control));
+ tree->ctl->handle = InvalidDsaPointer;
+ }
+
+ tree->ctl->magic = RADIXTREE_MAGIC;
+ tree->ctl->root = InvalidRTPointer;
+ pg_atomic_init_u64(&tree->ctl->max_val, 0);
+ pg_atomic_init_u64(&tree->ctl->num_keys, 0);
/* Create the slab allocator for each size class */
- for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ if (area == NULL)
{
- tree->inner_slabs[i] = SlabContextCreate(ctx,
- rt_node_kind_info[i].name,
- rt_node_kind_info[i].inner_blocksize,
- rt_node_kind_info[i].inner_size);
- tree->leaf_slabs[i] = SlabContextCreate(ctx,
- rt_node_kind_info[i].name,
- rt_node_kind_info[i].leaf_blocksize,
- rt_node_kind_info[i].leaf_size);
-#ifdef RT_DEBUG
- tree->cnt[i] = 0;
-#endif
+ for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ {
+ tree->inner_slabs[i] = SlabContextCreate(ctx,
+ rt_node_kind_info[i].name,
+ rt_node_kind_info[i].inner_blocksize,
+ rt_node_kind_info[i].inner_size);
+ tree->leaf_slabs[i] = SlabContextCreate(ctx,
+ rt_node_kind_info[i].name,
+ rt_node_kind_info[i].leaf_blocksize,
+ rt_node_kind_info[i].leaf_size);
+ }
}
MemoryContextSwitchTo(old_ctx);
@@ -1638,16 +1722,160 @@ rt_create(MemoryContext ctx)
return tree;
}
+/*
+ * Get a handle that can be used by other processes to attach to this radix
+ * tree.
+ */
+dsa_pointer
+rt_get_handle(radix_tree *tree)
+{
+ Assert(RadixTreeIsShared(tree));
+ Assert(tree->ctl->magic == RADIXTREE_MAGIC);
+
+ return tree->ctl->handle;
+}
+
+/*
+ * Attach to an existing radix tree using a handle. The returned object is
+ * allocated in backend-local memory using the CurrentMemoryContext.
+ */
+radix_tree *
+rt_attach(dsa_area *area, rt_handle handle)
+{
+ radix_tree *tree;
+ dsa_pointer control;
+
+ /* Allocate the backend-local object representing the radix tree */
+ tree = (radix_tree *) palloc0(sizeof(radix_tree));
+
+ /* Find the control object in shared memory */
+ control = handle;
+
+ /* Set up the local radix tree */
+ tree->area = area;
+ tree->ctl = (radix_tree_control *) dsa_get_address(area, control);
+ Assert(tree->ctl->magic == RADIXTREE_MAGIC);
+
+ return tree;
+}
+
+/*
+ * Detach from a radix tree. This frees backend-local resources associated
+ * with the radix tree, but the radix tree will continue to exist until
+ * it is explicitly freed.
+ */
+void
+rt_detach(radix_tree *tree)
+{
+ Assert(RadixTreeIsShared(tree));
+ Assert(tree->ctl->magic == RADIXTREE_MAGIC);
+
+ pfree(tree);
+}
+
+/*
+ * Recursively free all nodes allocated in the DSA area.
+ */
+static void
+rt_free_recurse(radix_tree *tree, rt_pointer ptr)
+{
+ rt_node_ptr node = rt_node_ptr_encoded(tree, ptr);
+
+ Assert(RadixTreeIsShared(tree));
+
+ /* The leaf node doesn't have child pointers, so free it */
+ if (NODE_IS_LEAF(node))
+ {
+ dsa_free(tree->area, RTPointerUnTagKind(node.encoded));
+ return;
+ }
+
+ switch (NODE_KIND(node))
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node.decoded;
+
+ /* Free all children recursively */
+ for (int i = 0; i < NODE_COUNT(node); i++)
+ rt_free_recurse(tree, n4->children[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node.decoded;
+
+ /* Free all children recursively */
+ for (int i = 0; i < NODE_COUNT(node); i++)
+ rt_free_recurse(tree, n32->children[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ rt_node_inner_128 *n128 = (rt_node_inner_128 *) node.decoded;
+
+ /* Free all children recursively */
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!node_128_is_chunk_used((rt_node_base_128 *) n128, i))
+ continue;
+
+ rt_free_recurse(tree, node_inner_128_get_child(n128, i));
+ }
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node.decoded;
+
+ /* Free all children recursively */
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!node_inner_256_is_chunk_used(n256, i))
+ continue;
+
+ rt_free_recurse(tree, node_inner_256_get_child(n256, i));
+ }
+ break;
+ }
+ }
+
+ /* Free the inner node itself */
+ dsa_free(tree->area, RTPointerUnTagKind(node.encoded));
+}
+
/*
* Free the given radix tree.
*/
void
rt_free(radix_tree *tree)
{
- for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ Assert(!RadixTreeIsShared(tree) || tree->ctl->magic == RADIXTREE_MAGIC);
+
+ if (RadixTreeIsShared(tree))
{
- MemoryContextDelete(tree->inner_slabs[i]);
- MemoryContextDelete(tree->leaf_slabs[i]);
+ /* Free all memory used for radix tree nodes */
+ if (RTPointerIsValid(tree->ctl->root))
+ rt_free_recurse(tree, tree->ctl->root);
+
+ /*
+ * Vandalize the control block to help catch programming errors where
+ * other backends access the memory formerly occupied by this radix tree.
+ */
+ tree->ctl->magic = 0;
+ dsa_free(tree->area, tree->ctl->handle);
+ }
+ else
+ {
+ /* Free all memory used for radix tree nodes */
+ for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ {
+ MemoryContextDelete(tree->inner_slabs[i]);
+ MemoryContextDelete(tree->leaf_slabs[i]);
+ }
+ pfree(tree->ctl);
}
pfree(tree);
@@ -1665,16 +1893,18 @@ rt_set(radix_tree *tree, uint64 key, uint64 value)
rt_node_ptr node;
rt_node_ptr parent;
+ Assert(!RadixTreeIsShared(tree) || tree->ctl->magic == RADIXTREE_MAGIC);
+
/* Empty tree, create the root */
- if (!RTPointerIsValid(tree->root))
+ if (!RTPointerIsValid(tree->ctl->root))
rt_new_root(tree, key);
/* Extend the tree if necessary */
- if (key > tree->max_val)
+ if (key > pg_atomic_read_u64(&tree->ctl->max_val))
rt_extend(tree, key);
/* Descend the tree until a leaf node */
- node = parent = rt_node_ptr_encoded(tree->root);
+ node = parent = rt_node_ptr_encoded(tree, tree->ctl->root);
shift = NODE_SHIFT(node);
while (shift >= 0)
{
@@ -1690,7 +1920,7 @@ rt_set(radix_tree *tree, uint64 key, uint64 value)
}
parent = node;
- node = rt_node_ptr_encoded(child);
+ node = rt_node_ptr_encoded(tree, child);
shift -= RT_NODE_SPAN;
}
@@ -1698,7 +1928,7 @@ rt_set(radix_tree *tree, uint64 key, uint64 value)
/* Update the statistics */
if (!updated)
- tree->num_keys++;
+ pg_atomic_add_fetch_u64(&tree->ctl->num_keys, 1);
return updated;
}
@@ -1714,12 +1944,14 @@ rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
rt_node_ptr node;
int shift;
+ Assert(!RadixTreeIsShared(tree) || tree->ctl->magic == RADIXTREE_MAGIC);
Assert(value_p != NULL);
- if (!RTPointerIsValid(tree->root) || key > tree->max_val)
+ if (!RTPointerIsValid(tree->ctl->root) ||
+ key > pg_atomic_read_u64(&tree->ctl->max_val))
return false;
- node = rt_node_ptr_encoded(tree->root);
+ node = rt_node_ptr_encoded(tree, tree->ctl->root);
shift = NODE_SHIFT(node);
/* Descend the tree until a leaf node */
@@ -1733,7 +1965,7 @@ rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
return false;
- node = rt_node_ptr_encoded(child);
+ node = rt_node_ptr_encoded(tree, child);
shift -= RT_NODE_SPAN;
}
@@ -1753,14 +1985,17 @@ rt_delete(radix_tree *tree, uint64 key)
int level;
bool deleted;
- if (!tree->root || key > tree->max_val)
+ Assert(!RadixTreeIsShared(tree) || tree->ctl->magic == RADIXTREE_MAGIC);
+
+ if (!RTPointerIsValid(tree->ctl->root) ||
+ key > pg_atomic_read_u64(&tree->ctl->max_val))
return false;
/*
* Descend the tree to search the key while building a stack of nodes we
* visited.
*/
- node = rt_node_ptr_encoded(tree->root);
+ node = rt_node_ptr_encoded(tree, tree->ctl->root);
shift = NODE_SHIFT(node);
level = -1;
while (shift > 0)
@@ -1773,7 +2008,7 @@ rt_delete(radix_tree *tree, uint64 key)
if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
return false;
- node = rt_node_ptr_encoded(child);
+ node = rt_node_ptr_encoded(tree, child);
shift -= RT_NODE_SPAN;
}
@@ -1788,7 +2023,7 @@ rt_delete(radix_tree *tree, uint64 key)
}
/* Found the key to delete. Update the statistics */
- tree->num_keys--;
+ pg_atomic_sub_fetch_u64(&tree->ctl->num_keys, 1);
/*
* Return if the leaf node still has keys and we don't need to delete the
@@ -1822,8 +2057,8 @@ rt_delete(radix_tree *tree, uint64 key)
*/
if (level == 0)
{
- tree->root = InvalidRTPointer;
- tree->max_val = 0;
+ tree->ctl->root = InvalidRTPointer;
+ pg_atomic_write_u64(&tree->ctl->max_val, 0);
}
return true;
@@ -1838,6 +2073,8 @@ rt_begin_iterate(radix_tree *tree)
rt_iter *iter;
int top_level;
+ Assert(!RadixTreeIsShared(tree) || tree->ctl->magic == RADIXTREE_MAGIC);
+
old_ctx = MemoryContextSwitchTo(tree->context);
iter = (rt_iter *) palloc0(sizeof(rt_iter));
@@ -1847,7 +2084,7 @@ rt_begin_iterate(radix_tree *tree)
if (!RTPointerIsValid(iter->tree))
return iter;
- root = rt_node_ptr_encoded(iter->tree->root);
+ root = rt_node_ptr_encoded(tree, iter->tree->ctl->root);
top_level = NODE_SHIFT(root) / RT_NODE_SPAN;
iter->stack_len = top_level;
@@ -1898,6 +2135,8 @@ rt_update_iter_stack(rt_iter *iter, rt_node_ptr from_node, int from)
bool
rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p)
{
+ Assert(!RadixTreeIsShared(iter->tree) || iter->tree->ctl->magic == RADIXTREE_MAGIC);
+
/* Empty tree */
if (!iter->tree)
return false;
@@ -2043,7 +2282,7 @@ rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter, rt_node_ptr *
if (found)
{
rt_iter_update_key(iter, key_chunk, NODE_SHIFT(node));
- *child_p = rt_node_ptr_encoded(child);
+ *child_p = rt_node_ptr_encoded(iter->tree, child);
}
return found;
@@ -2146,7 +2385,7 @@ rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter, uint64 *value_
uint64
rt_num_entries(radix_tree *tree)
{
- return tree->num_keys;
+ return pg_atomic_read_u64(&tree->ctl->num_keys);
}
/*
@@ -2155,12 +2394,19 @@ rt_num_entries(radix_tree *tree)
uint64
rt_memory_usage(radix_tree *tree)
{
- Size total = sizeof(radix_tree);
+ Size total = sizeof(radix_tree) + sizeof(radix_tree_control);
- for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ Assert(!RadixTreeIsShared(tree) || tree->ctl->magic == RADIXTREE_MAGIC);
+
+ if (RadixTreeIsShared(tree))
+ total = dsa_get_total_size(tree->area);
+ else
{
- total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
- total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
+ for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ {
+ total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
+ total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
+ }
}
return total;
@@ -2244,19 +2490,19 @@ rt_verify_node(rt_node_ptr node)
void
rt_stats(radix_tree *tree)
{
- rt_node_ptr root = rt_node_ptr_encoded(tree->root);
+ rt_node_ptr root = rt_node_ptr_encoded(tree, tree->ctl->root);
ereport(LOG, (errmsg("num_keys = %lu, height = %u, n4 = %u, n32 = %u, n128 = %u, n256 = %u",
- tree->num_keys,
+ pg_atomic_read_u64(&tree->ctl->num_keys),
NODE_SHIFT(root) / RT_NODE_SPAN,
- tree->cnt[0],
- tree->cnt[1],
- tree->cnt[2],
- tree->cnt[3])));
+ tree->ctl->cnt[0],
+ tree->ctl->cnt[1],
+ tree->ctl->cnt[2],
+ tree->ctl->cnt[3])));
}
static void
-rt_dump_node(rt_node_ptr node, int level, bool recurse)
+rt_dump_node(radix_tree *tree, rt_node_ptr node, int level, bool recurse)
{
rt_node *n = node.decoded;
char space[128] = {0};
@@ -2292,7 +2538,7 @@ rt_dump_node(rt_node_ptr node, int level, bool recurse)
space, n4->base.chunks[i]);
if (recurse)
- rt_dump_node(rt_node_ptr_encoded(n4->children[i]),
+ rt_dump_node(tree, rt_node_ptr_encoded(tree, n4->children[i]),
level + 1, recurse);
else
fprintf(stderr, "\n");
@@ -2320,7 +2566,7 @@ rt_dump_node(rt_node_ptr node, int level, bool recurse)
if (recurse)
{
- rt_dump_node(rt_node_ptr_encoded(n32->children[i]),
+ rt_dump_node(tree, rt_node_ptr_encoded(tree, n32->children[i]),
level + 1, recurse);
}
else
@@ -2373,7 +2619,9 @@ rt_dump_node(rt_node_ptr node, int level, bool recurse)
space, i);
if (recurse)
- rt_dump_node(rt_node_ptr_encoded(node_inner_128_get_child(n128, i)),
+ rt_dump_node(tree,
+ rt_node_ptr_encoded(tree,
+ node_inner_128_get_child(n128, i)),
level + 1, recurse);
else
fprintf(stderr, "\n");
@@ -2406,7 +2654,9 @@ rt_dump_node(rt_node_ptr node, int level, bool recurse)
space, i);
if (recurse)
- rt_dump_node(rt_node_ptr_encoded(node_inner_256_get_child(n256, i)),
+ rt_dump_node(tree,
+ rt_node_ptr_encoded(tree,
+ node_inner_256_get_child(n256, i)),
level + 1, recurse);
else
fprintf(stderr, "\n");
@@ -2417,6 +2667,27 @@ rt_dump_node(rt_node_ptr node, int level, bool recurse)
}
}
+void
+rt_dump(radix_tree *tree)
+{
+ for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ fprintf(stderr, "%s\tinner_size%lu\tinner_blocksize %lu\tleaf_size %lu\tleaf_blocksize %lu\n",
+ rt_node_kind_info[i].name,
+ rt_node_kind_info[i].inner_size,
+ rt_node_kind_info[i].inner_blocksize,
+ rt_node_kind_info[i].leaf_size,
+ rt_node_kind_info[i].leaf_blocksize);
+ fprintf(stderr, "max_val = %lu\n", pg_atomic_read_u64(&tree->ctl->max_val));
+
+ if (!tree->ctl->root)
+ {
+ fprintf(stderr, "empty tree\n");
+ return;
+ }
+
+ rt_dump_node(tree, rt_node_ptr_encoded(tree, tree->ctl->root), 0, true);
+}
+
void
rt_dump_search(radix_tree *tree, uint64 key)
{
@@ -2425,28 +2696,30 @@ rt_dump_search(radix_tree *tree, uint64 key)
int level = 0;
elog(NOTICE, "-----------------------------------------------------------");
- elog(NOTICE, "max_val = %lu (0x%lX)", tree->max_val, tree->max_val);
+ elog(NOTICE, "max_val = %lu (0x%lX)",
+ pg_atomic_read_u64(&tree->ctl->max_val),
+ pg_atomic_read_u64(&tree->ctl->max_val));
- if (!RTPointerIsValid(tree->root))
+ if (!RTPointerIsValid(tree->ctl->root))
{
elog(NOTICE, "tree is empty");
return;
}
- if (key > tree->max_val)
+ if (key > pg_atomic_read_u64(&tree->ctl->max_val))
{
elog(NOTICE, "key %lu (0x%lX) is larger than max val",
key, key);
return;
}
- node = rt_node_ptr_encoded(tree->root);
+ node = rt_node_ptr_encoded(tree, tree->ctl->root);
shift = NODE_SHIFT(node);
while (shift >= 0)
{
rt_pointer child;
- rt_dump_node(node, level, false);
+ rt_dump_node(tree, node, level, false);
if (NODE_IS_LEAF(node))
{
@@ -2461,33 +2734,9 @@ rt_dump_search(radix_tree *tree, uint64 key)
if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
break;
- node = rt_node_ptr_encoded(child);
+ node = rt_node_ptr_encoded(tree, child);
shift -= RT_NODE_SPAN;
level++;
}
}
-
-void
-rt_dump(radix_tree *tree)
-{
- rt_node_ptr root;
-
- for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
- fprintf(stderr, "%s\tinner_size%lu\tinner_blocksize %lu\tleaf_size %lu\tleaf_blocksize %lu\n",
- rt_node_kind_info[i].name,
- rt_node_kind_info[i].inner_size,
- rt_node_kind_info[i].inner_blocksize,
- rt_node_kind_info[i].leaf_size,
- rt_node_kind_info[i].leaf_blocksize);
- fprintf(stderr, "max_val = %lu\n", tree->max_val);
-
- if (!RTPointerIsValid(tree->root))
- {
- fprintf(stderr, "empty tree\n");
- return;
- }
-
- root = rt_node_ptr_encoded(tree->root);
- rt_dump_node(root, 0, true);
-}
#endif
diff --git a/src/backend/utils/mmgr/dsa.c b/src/backend/utils/mmgr/dsa.c
index 82376fde2d..ad169882af 100644
--- a/src/backend/utils/mmgr/dsa.c
+++ b/src/backend/utils/mmgr/dsa.c
@@ -1024,6 +1024,18 @@ dsa_set_size_limit(dsa_area *area, size_t limit)
LWLockRelease(DSA_AREA_LOCK(area));
}
+size_t
+dsa_get_total_size(dsa_area *area)
+{
+ size_t size;
+
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_SHARED);
+ size = area->control->total_segment_size;
+ LWLockRelease(DSA_AREA_LOCK(area));
+
+ return size;
+}
+
/*
* Aggressively free all spare memory in the hope of returning DSM segments to
* the operating system.
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index d5d7668617..68a11df970 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -14,18 +14,24 @@
#define RADIXTREE_H
#include "postgres.h"
+#include "utils/dsa.h"
#define RT_DEBUG 1
typedef struct radix_tree radix_tree;
typedef struct rt_iter rt_iter;
+typedef dsa_pointer rt_handle;
-extern radix_tree *rt_create(MemoryContext ctx);
+extern radix_tree *rt_create(MemoryContext ctx, dsa_area *dsa);
extern void rt_free(radix_tree *tree);
extern bool rt_search(radix_tree *tree, uint64 key, uint64 *val_p);
extern bool rt_set(radix_tree *tree, uint64 key, uint64 val);
extern rt_iter *rt_begin_iterate(radix_tree *tree);
+extern rt_handle rt_get_handle(radix_tree *tree);
+extern radix_tree *rt_attach(dsa_area *dsa, dsa_pointer dp);
+extern void rt_detach(radix_tree *tree);
+
extern bool rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p);
extern void rt_end_iterate(rt_iter *iter);
extern bool rt_delete(radix_tree *tree, uint64 key);
diff --git a/src/include/utils/dsa.h b/src/include/utils/dsa.h
index 405606fe2f..dad06adecc 100644
--- a/src/include/utils/dsa.h
+++ b/src/include/utils/dsa.h
@@ -117,6 +117,7 @@ extern dsa_handle dsa_get_handle(dsa_area *area);
extern dsa_pointer dsa_allocate_extended(dsa_area *area, size_t size, int flags);
extern void dsa_free(dsa_area *area, dsa_pointer dp);
extern void *dsa_get_address(dsa_area *area, dsa_pointer dp);
+extern size_t dsa_get_total_size(dsa_area *area);
extern void dsa_trim(dsa_area *area);
extern void dsa_dump(dsa_area *area);
diff --git a/src/test/modules/test_radixtree/expected/test_radixtree.out b/src/test/modules/test_radixtree/expected/test_radixtree.out
index cc6970c87c..a0ff1e1c77 100644
--- a/src/test/modules/test_radixtree/expected/test_radixtree.out
+++ b/src/test/modules/test_radixtree/expected/test_radixtree.out
@@ -5,21 +5,38 @@ CREATE EXTENSION test_radixtree;
--
SELECT test_radixtree();
NOTICE: testing radix tree node types with shift "0"
+NOTICE: testing radix tree node types with shift "0"
+NOTICE: testing radix tree node types with shift "8"
NOTICE: testing radix tree node types with shift "8"
NOTICE: testing radix tree node types with shift "16"
+NOTICE: testing radix tree node types with shift "16"
NOTICE: testing radix tree node types with shift "24"
+NOTICE: testing radix tree node types with shift "24"
+NOTICE: testing radix tree node types with shift "32"
NOTICE: testing radix tree node types with shift "32"
NOTICE: testing radix tree node types with shift "40"
+NOTICE: testing radix tree node types with shift "40"
+NOTICE: testing radix tree node types with shift "48"
NOTICE: testing radix tree node types with shift "48"
NOTICE: testing radix tree node types with shift "56"
+NOTICE: testing radix tree node types with shift "56"
+NOTICE: testing radix tree with pattern "all ones"
NOTICE: testing radix tree with pattern "all ones"
NOTICE: testing radix tree with pattern "alternating bits"
+NOTICE: testing radix tree with pattern "alternating bits"
+NOTICE: testing radix tree with pattern "clusters of ten"
NOTICE: testing radix tree with pattern "clusters of ten"
NOTICE: testing radix tree with pattern "clusters of hundred"
+NOTICE: testing radix tree with pattern "clusters of hundred"
+NOTICE: testing radix tree with pattern "one-every-64k"
NOTICE: testing radix tree with pattern "one-every-64k"
NOTICE: testing radix tree with pattern "sparse"
+NOTICE: testing radix tree with pattern "sparse"
+NOTICE: testing radix tree with pattern "single values, distance > 2^32"
NOTICE: testing radix tree with pattern "single values, distance > 2^32"
NOTICE: testing radix tree with pattern "clusters, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^60"
NOTICE: testing radix tree with pattern "clusters, distance > 2^60"
test_radixtree
----------------
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
index cb3596755d..a948cba4ec 100644
--- a/src/test/modules/test_radixtree/test_radixtree.c
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -19,6 +19,7 @@
#include "nodes/bitmapset.h"
#include "storage/block.h"
#include "storage/itemptr.h"
+#include "storage/lwlock.h"
#include "utils/memutils.h"
#include "utils/timestamp.h"
@@ -111,7 +112,7 @@ test_empty(void)
radix_tree *radixtree;
uint64 dummy;
- radixtree = rt_create(CurrentMemoryContext);
+ radixtree = rt_create(CurrentMemoryContext, NULL);
if (rt_search(radixtree, 0, &dummy))
elog(ERROR, "rt_search on empty tree returned true");
@@ -217,14 +218,10 @@ test_node_types_delete(radix_tree *radixtree, uint8 shift)
* level.
*/
static void
-test_node_types(uint8 shift)
+do_test_node_types(radix_tree *radixtree, uint8 shift)
{
- radix_tree *radixtree;
-
elog(NOTICE, "testing radix tree node types with shift \"%d\"", shift);
- radixtree = rt_create(CurrentMemoryContext);
-
/*
* Insert and search entries for every node type at the 'shift' level,
* then delete all entries to make it empty, and insert and search entries
@@ -233,19 +230,39 @@ test_node_types(uint8 shift)
test_node_types_insert(radixtree, shift);
test_node_types_delete(radixtree, shift);
test_node_types_insert(radixtree, shift);
+}
- rt_free(radixtree);
+static void
+test_node_types(void)
+{
+ int tranche_id = LWLockNewTrancheId();
+
+ for (int shift = 0; shift <= (64 - 8); shift += 8)
+ {
+ radix_tree *tree;
+ dsa_area *dsa;
+
+ /* Test the local radix tree */
+ tree = rt_create(CurrentMemoryContext, NULL);
+ do_test_node_types(tree, shift);
+ rt_free(tree);
+
+ /* Test the shared radix tree */
+ dsa = dsa_create(tranche_id);
+ tree = rt_create(CurrentMemoryContext, dsa);
+ do_test_node_types(tree, shift);
+ rt_free(tree);
+ dsa_detach(dsa);
+ }
}
/*
* Test with a repeating pattern, defined by the 'spec'.
*/
static void
-test_pattern(const test_spec * spec)
+do_test_pattern(radix_tree *radixtree, const test_spec * spec)
{
- radix_tree *radixtree;
rt_iter *iter;
- MemoryContext radixtree_ctx;
TimestampTz starttime;
TimestampTz endtime;
uint64 n;
@@ -271,18 +288,6 @@ test_pattern(const test_spec * spec)
pattern_values[pattern_num_values++] = i;
}
- /*
- * Allocate the radix tree.
- *
- * Allocate it in a separate memory context, so that we can print its
- * memory usage easily.
- */
- radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
- "radixtree test",
- ALLOCSET_SMALL_SIZES);
- MemoryContextSetIdentifier(radixtree_ctx, spec->test_name);
- radixtree = rt_create(radixtree_ctx);
-
/*
* Add values to the set.
*/
@@ -336,8 +341,6 @@ test_pattern(const test_spec * spec)
mem_usage = rt_memory_usage(radixtree);
fprintf(stderr, "rt_memory_usage() reported " UINT64_FORMAT " (%0.2f bytes / integer)\n",
mem_usage, (double) mem_usage / spec->num_values);
-
- MemoryContextStats(radixtree_ctx);
}
/* Check that rt_num_entries works */
@@ -484,21 +487,54 @@ test_pattern(const test_spec * spec)
if ((nbefore - ndeleted) != nafter)
elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT "after " UINT64_FORMAT " deletion",
nafter, (nbefore - ndeleted), ndeleted);
+}
+
+static void
+test_patterns(void)
+{
+ int tranche_id = LWLockNewTrancheId();
+
+ /* Test different test patterns, with lots of entries */
+ for (int i = 0; i < lengthof(test_specs); i++)
+ {
+ radix_tree *tree;
+ MemoryContext radixtree_ctx;
+ dsa_area *dsa;
+ const test_spec *spec = &test_specs[i];
- MemoryContextDelete(radixtree_ctx);
+ /*
+ * Allocate the radix tree.
+ *
+ * Allocate it in a separate memory context, so that we can print its
+ * memory usage easily.
+ */
+ radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
+ "radixtree test",
+ ALLOCSET_SMALL_SIZES);
+ MemoryContextSetIdentifier(radixtree_ctx, spec->test_name);
+
+ /* Test the local radix tree */
+ tree = rt_create(radixtree_ctx, NULL);
+ do_test_pattern(tree, spec);
+ rt_free(tree);
+ MemoryContextReset(radixtree_ctx);
+
+ /* Test the shared radix tree */
+ dsa = dsa_create(tranche_id);
+ tree = rt_create(radixtree_ctx, dsa);
+ do_test_pattern(tree, spec);
+ rt_free(tree);
+ dsa_detach(dsa);
+ MemoryContextDelete(radixtree_ctx);
+ }
}
Datum
test_radixtree(PG_FUNCTION_ARGS)
{
test_empty();
-
- for (int shift = 0; shift <= (64 - 8); shift += 8)
- test_node_types(shift);
-
- /* Test different test patterns, with lots of entries */
- for (int i = 0; i < lengthof(test_specs); i++)
- test_pattern(&test_specs[i]);
+ test_node_types();
+ test_patterns();
PG_RETURN_VOID();
}
--
2.31.1
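To illustrate the shared-memory API added by the patch above, here is a minimal usage sketch (not part of the patch). tranche_id, key, and value are placeholders, error handling is omitted, and an attaching backend would normally obtain its dsa_area pointer via dsa_attach():

    /* Backend A: create the shared radix tree and publish its handle */
    dsa_area   *dsa = dsa_create(tranche_id);
    radix_tree *tree = rt_create(CurrentMemoryContext, dsa);
    rt_handle   handle = rt_get_handle(tree);  /* pass to other backends */

    rt_set(tree, key, value);

    /* Backend B: attach to the existing tree using the handle */
    radix_tree *atree = rt_attach(dsa, handle);
    uint64      val;

    if (rt_search(atree, key, &val))
    {
        /* found the value stored for key */
    }

    rt_detach(atree);  /* frees only the backend-local object */

    /* Backend A: free all nodes and the control object in the DSA area */
    rt_free(tree);
    dsa_detach(dsa);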
v10-0002-Add-radix-implementation.patch
From f6cd9570460e9ae2a53e670c94bdee0c69b883b2 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 14 Sep 2022 12:38:51 +0000
Subject: [PATCH v10 2/7] Add radix implementation.
---
src/backend/lib/Makefile | 1 +
src/backend/lib/meson.build | 1 +
src/backend/lib/radixtree.c | 2404 +++++++++++++++++
src/include/lib/radixtree.h | 42 +
src/test/modules/Makefile | 1 +
src/test/modules/meson.build | 1 +
src/test/modules/test_radixtree/.gitignore | 4 +
src/test/modules/test_radixtree/Makefile | 23 +
src/test/modules/test_radixtree/README | 7 +
.../expected/test_radixtree.out | 28 +
src/test/modules/test_radixtree/meson.build | 34 +
.../test_radixtree/sql/test_radixtree.sql | 7 +
.../test_radixtree/test_radixtree--1.0.sql | 8 +
.../modules/test_radixtree/test_radixtree.c | 504 ++++
.../test_radixtree/test_radixtree.control | 4 +
15 files changed, 3069 insertions(+)
create mode 100644 src/backend/lib/radixtree.c
create mode 100644 src/include/lib/radixtree.h
create mode 100644 src/test/modules/test_radixtree/.gitignore
create mode 100644 src/test/modules/test_radixtree/Makefile
create mode 100644 src/test/modules/test_radixtree/README
create mode 100644 src/test/modules/test_radixtree/expected/test_radixtree.out
create mode 100644 src/test/modules/test_radixtree/meson.build
create mode 100644 src/test/modules/test_radixtree/sql/test_radixtree.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree--1.0.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree.c
create mode 100644 src/test/modules/test_radixtree/test_radixtree.control
diff --git a/src/backend/lib/Makefile b/src/backend/lib/Makefile
index 9dad31398a..4c1db794b6 100644
--- a/src/backend/lib/Makefile
+++ b/src/backend/lib/Makefile
@@ -22,6 +22,7 @@ OBJS = \
integerset.o \
knapsack.o \
pairingheap.o \
+ radixtree.o \
rbtree.o \
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/lib/meson.build b/src/backend/lib/meson.build
index 48da1bddce..4303d306cd 100644
--- a/src/backend/lib/meson.build
+++ b/src/backend/lib/meson.build
@@ -9,4 +9,5 @@ backend_sources += files(
'knapsack.c',
'pairingheap.c',
'rbtree.c',
+ 'radixtree.c',
)
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
new file mode 100644
index 0000000000..6159b73b75
--- /dev/null
+++ b/src/backend/lib/radixtree.c
@@ -0,0 +1,2404 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.c
+ * Implementation for adaptive radix tree.
+ *
+ * This module employs the idea from the paper "The Adaptive Radix Tree: ARTful
+ * Indexing for Main-Memory Databases" by Viktor Leis, Alfons Kemper, and Thomas
+ * Neumann, 2013. The radix tree uses adaptive node sizes, a small number of node
+ * types, each with a different number of elements. Depending on the number of
+ * children, the appropriate node type is used.
+ *
+ * There are some differences from the proposed implementation. For instance,
+ * there is no support for path compression and lazy path expansion. The radix
+ * tree supports a fixed key length, so we don't expect the tree to become
+ * very high.
+ *
+ * Both the key and the value are 64-bit unsigned integers. The inner nodes and
+ * the leaf nodes have slightly different structures: inner nodes, with
+ * shift > 0, store pointers to their child nodes as values, whereas leaf nodes,
+ * with shift == 0, store the 64-bit unsigned integer specified by the user as
+ * the value. The paper refers to this technique as "Multi-value leaves". We
+ * choose it to avoid an additional pointer traversal. It is the reason this code
+ * currently does not support variable-length keys.
+ *
+ * XXX: Most functions in this file have two variants for inner nodes and leaf
+ * nodes, so there is some duplicated code. While this sometimes makes code
+ * maintenance tricky, it reduces branch prediction misses when judging
+ * whether the node is an inner node or a leaf node.
+ *
+ * XXX: radix tree nodes are never shrunk.
+ *
+ * Interface
+ * ---------
+ *
+ * rt_create - Create a new, empty radix tree
+ * rt_free - Free the radix tree
+ * rt_search - Search a key-value pair
+ * rt_set - Set a key-value pair
+ * rt_delete - Delete a key-value pair
+ * rt_begin_iterate - Begin iterating through all key-value pairs
+ * rt_iterate_next - Return next key-value pair, if any
+ * rt_end_iterate - End iteration
+ * rt_memory_usage - Get the memory usage
+ * rt_num_entries - Get the number of key-value pairs
+ *
+ * rt_create() creates an empty radix tree in the given memory context
+ * along with memory contexts for each kind of radix tree node under it.
+ *
+ * rt_iterate_next() returns key-value pairs in ascending order of the key.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/lib/radixtree.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "lib/radixtree.h"
+#include "lib/stringinfo.h"
+#include "miscadmin.h"
+#include "port/pg_bitutils.h"
+#include "port/pg_lfind.h"
+#include "utils/memutils.h"
+
+/* The number of bits encoded in one tree level */
+#define RT_NODE_SPAN BITS_PER_BYTE
+
+/* The maximum number of slots in a node */
+#define RT_NODE_MAX_SLOTS (1 << RT_NODE_SPAN)
+
+/*
+ * Return the number of bytes needed for a bitmap covering nslots slots.
+ * Used for the is-set bitmaps in node-128 and node-256 leaves.
+ */
+#define RT_NODE_NSLOTS_BITS(nslots) ((nslots) / (sizeof(uint8) * BITS_PER_BYTE))
+
+/* Mask for extracting a chunk from the key */
+#define RT_CHUNK_MASK ((1 << RT_NODE_SPAN) - 1)
+
+/* Maximum shift the radix tree uses */
+#define RT_MAX_SHIFT key_get_shift(UINT64_MAX)
+
+/* The maximum number of levels in the radix tree */
+#define RT_MAX_LEVEL ((sizeof(uint64) * BITS_PER_BYTE) / RT_NODE_SPAN)
+
+/* Invalid index used in node-128 */
+#define RT_NODE_128_INVALID_IDX 0xFF
+
+/* Get a chunk from the key */
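+/* (e.g., RT_GET_KEY_CHUNK(0x0102030405060708, 8) is 0x07) */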
+#define RT_GET_KEY_CHUNK(key, shift) ((uint8) (((key) >> (shift)) & RT_CHUNK_MASK))
+
+/*
+ * Mapping from a slot number to the byte and bit in the is-set bitmap.
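+ * For example, slot 130 maps to isset byte 16, bit 2.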
+ */
+#define RT_NODE_BITMAP_BYTE(v) ((v) / BITS_PER_BYTE)
+#define RT_NODE_BITMAP_BIT(v) (UINT64CONST(1) << ((v) % RT_NODE_SPAN))
+
+/* Enum used by rt_node_search_inner() and rt_node_search_leaf() */
+typedef enum
+{
+ RT_ACTION_FIND = 0, /* find the key-value */
+ RT_ACTION_DELETE, /* delete the key-value */
+} rt_action;
+
+/*
+ * Supported radix tree node kinds.
+ *
+ * XXX: These are currently not well chosen. To reduce memory fragmentation
+ * smaller classes should optimally fit neatly into the next larger class
+ * (except perhaps at the lowest end). Right now it's
+ * 40/40 -> 296/286 -> 1288/1304 -> 2056/2088 bytes for inner nodes and
+ * leaf nodes, respectively, leading to a large amount of allocator padding
+ * with aset.c. Hence the use of slab.
+ *
+ * XXX: do we need a node-1 as long as there is no path compression optimization?
+ *
+ * XXX: need to explain why we choose these node types based on benchmark
+ * results etc.
+ */
+#define RT_NODE_KIND_4 0x00
+#define RT_NODE_KIND_32 0x01
+#define RT_NODE_KIND_128 0x02
+#define RT_NODE_KIND_256 0x03
+#define RT_NODE_KIND_COUNT 4
+
+/* Common type for all node types */
+typedef struct rt_node
+{
+ /*
+ * Number of children. We use uint16 to be able to indicate 256 children
+ * at the fanout of 8.
+ */
+ uint16 count;
+
+ /*
+ * Shift indicates which part of the key space is represented by this
+ * node. That is, the key is shifted by 'shift' and the lowest
+ * RT_NODE_SPAN bits are then represented in chunk.
+ */
+ uint8 shift;
+ uint8 chunk;
+
+ /* Size kind of the node */
+ uint8 kind;
+} rt_node;
+#define NODE_IS_LEAF(n) (((rt_node *) (n))->shift == 0)
+#define NODE_IS_EMPTY(n) (((rt_node *) (n))->count == 0)
+#define NODE_HAS_FREE_SLOT(n) \
+ (((rt_node *) (n))->count < rt_node_kind_info[((rt_node *) (n))->kind].fanout)
+
+/* Base types for each node kind, shared by leaf and inner nodes */
+typedef struct rt_node_base_4
+{
+ rt_node n;
+
+ /* 4 children, for key chunks */
+ uint8 chunks[4];
+} rt_node_base_4;
+
+typedef struct rt_node_base32
+{
+ rt_node n;
+
+ /* 32 key chunks */
+ uint8 chunks[32];
+} rt_node_base_32;
+
+/*
+ * node-128 uses a slot_idxs array, an array of RT_NODE_MAX_SLOTS (256) entries,
+ * to store indexes into a second array that contains up to 128 values (or
+ * child pointers in inner nodes).
+ */
+typedef struct rt_node_base128
+{
+ rt_node n;
+
+ /* Map from key chunk to slot index; RT_NODE_128_INVALID_IDX if unused */
+ uint8 slot_idxs[RT_NODE_MAX_SLOTS];
+} rt_node_base_128;
+
+typedef struct rt_node_base256
+{
+ rt_node n;
+} rt_node_base_256;
+
+/*
+ * Inner and leaf nodes.
+ *
+ * Leaf nodes have size classes separate from the inner nodes for two main reasons:
+ *
+ * 1) the value type might be different from something fitting into a pointer-
+ * width type
+ * 2) we need to represent non-existing values in a key-type-independent way.
+ *
+ * 1) is clearly worth being concerned about, but it's not clear 2) is as
+ * important. It might be better to just indicate non-existing entries the
+ * same way as in inner nodes.
+ */
+typedef struct rt_node_inner_4
+{
+ rt_node_base_4 base;
+
+ /* 4 children, for key chunks */
+ rt_node *children[4];
+} rt_node_inner_4;
+
+typedef struct rt_node_leaf_4
+{
+ rt_node_base_4 base;
+
+ /* 4 values, for key chunks */
+ uint64 values[4];
+} rt_node_leaf_4;
+
+typedef struct rt_node_inner_32
+{
+ rt_node_base_32 base;
+
+ /* 32 children, for key chunks */
+ rt_node *children[32];
+} rt_node_inner_32;
+
+typedef struct rt_node_leaf_32
+{
+ rt_node_base_32 base;
+
+ /* 32 values, for key chunks */
+ uint64 values[32];
+} rt_node_leaf_32;
+
+typedef struct rt_node_inner_128
+{
+ rt_node_base_128 base;
+
+ /* Slots for 128 children */
+ rt_node *children[128];
+} rt_node_inner_128;
+
+typedef struct rt_node_leaf_128
+{
+ rt_node_base_128 base;
+
+ /* isset is a bitmap to track which slot is in use */
+ uint8 isset[RT_NODE_NSLOTS_BITS(128)];
+
+ /* Slots for 128 values */
+ uint64 values[128];
+} rt_node_leaf_128;
+
+/*
+ * node-256 is the largest node type. This node has RT_NODE_MAX_SLOTS length array
+ * for directly storing values (or child pointers in inner nodes).
+ */
+typedef struct rt_node_inner_256
+{
+ rt_node_base_256 base;
+
+ /* Slots for 256 children */
+ rt_node *children[RT_NODE_MAX_SLOTS];
+} rt_node_inner_256;
+
+typedef struct rt_node_leaf_256
+{
+ rt_node_base_256 base;
+
+ /* isset is a bitmap to track which slot is in use */
+ uint8 isset[RT_NODE_NSLOTS_BITS(RT_NODE_MAX_SLOTS)];
+
+ /* Slots for 256 values */
+ uint64 values[RT_NODE_MAX_SLOTS];
+} rt_node_leaf_256;
+
+/* Information for each node kind */
+typedef struct rt_node_kind_info_elem
+{
+ const char *name;
+ int fanout;
+
+ /* slab chunk size */
+ Size inner_size;
+ Size leaf_size;
+
+ /* slab block size */
+ Size inner_blocksize;
+ Size leaf_blocksize;
+} rt_node_kind_info_elem;
+
+/*
+ * Calculate the slab blocksize so that we can allocate at least 32 chunks
+ * from the block.
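+ * For example, with the default 8kB slab block and the 296-byte inner
+ * node-32, this evaluates to Max(27 * 296, 32 * 296) = 9472 bytes.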
+ */
+#define NODE_SLAB_BLOCK_SIZE(size) \
+ Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * size, (size) * 32)
+static rt_node_kind_info_elem rt_node_kind_info[RT_NODE_KIND_COUNT] = {
+
+ [RT_NODE_KIND_4] = {
+ .name = "radix tree node 4",
+ .fanout = 4,
+ .inner_size = sizeof(rt_node_inner_4),
+ .leaf_size = sizeof(rt_node_leaf_4),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_4)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_4)),
+ },
+ [RT_NODE_KIND_32] = {
+ .name = "radix tree node 32",
+ .fanout = 32,
+ .inner_size = sizeof(rt_node_inner_32),
+ .leaf_size = sizeof(rt_node_leaf_32),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_32)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_32)),
+ },
+ [RT_NODE_KIND_128] = {
+ .name = "radix tree node 128",
+ .fanout = 128,
+ .inner_size = sizeof(rt_node_inner_128),
+ .leaf_size = sizeof(rt_node_leaf_128),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_128)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_128)),
+ },
+ [RT_NODE_KIND_256] = {
+ .name = "radix tree node 256",
+ .fanout = 256,
+ .inner_size = sizeof(rt_node_inner_256),
+ .leaf_size = sizeof(rt_node_leaf_256),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_256)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_256)),
+ },
+};
+
+/*
+ * Iteration support.
+ *
+ * Iterating the radix tree returns each pair of key and value in the ascending
+ * order of the key. To support this, we iterate over nodes at each level.
+ *
+ * rt_node_iter struct is used to track the iteration within a node.
+ *
+ * rt_iter is the struct for iteration of the radix tree, and uses rt_node_iter
+ * in order to track the iteration of each level. During the iteration, we also
+ * construct the key whenever updating the node iteration information, e.g., when
+ * advancing the current index within the node or when moving to the next node
+ * at the same level.
+ */
+typedef struct rt_node_iter
+{
+ rt_node *node; /* current node being iterated */
+ int current_idx; /* current position. -1 for initial value */
+} rt_node_iter;
+
+struct rt_iter
+{
+ radix_tree *tree;
+
+ /* Track the iteration on nodes of each level */
+ rt_node_iter stack[RT_MAX_LEVEL];
+ int stack_len;
+
+ /* The key is being constructed during the iteration */
+ uint64 key;
+};
+
+/* A radix tree with nodes */
+struct radix_tree
+{
+ MemoryContext context;
+
+ rt_node *root;
+ uint64 max_val;
+ uint64 num_keys;
+
+ MemoryContextData *inner_slabs[RT_NODE_KIND_COUNT];
+ MemoryContextData *leaf_slabs[RT_NODE_KIND_COUNT];
+
+ /* statistics */
+#ifdef RT_DEBUG
+ int32 cnt[RT_NODE_KIND_COUNT];
+#endif
+};
+
+static void rt_new_root(radix_tree *tree, uint64 key);
+static rt_node *rt_alloc_node(radix_tree *tree, int kind, uint8 shift, uint8 chunk,
+ bool inner);
+static void rt_free_node(radix_tree *tree, rt_node *node);
+static void rt_extend(radix_tree *tree, uint64 key);
+static inline bool rt_node_search_inner(rt_node *node, uint64 key, rt_action action,
+ rt_node **child_p);
+static inline bool rt_node_search_leaf(rt_node *node, uint64 key, rt_action action,
+ uint64 *value_p);
+static bool rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node,
+ uint64 key, rt_node *child);
+static bool rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
+ uint64 key, uint64 value);
+static inline rt_node *rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter);
+static inline bool rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
+ uint64 *value_p);
+static void rt_update_iter_stack(rt_iter *iter, rt_node *from_node, int from);
+static inline void rt_iter_update_key(rt_iter *iter, uint8 chunk, uint8 shift);
+
+/* verification (available only with assertion) */
+static void rt_verify_node(rt_node *node);
+
+/*
+ * Return the index of the first chunk in the node that equals 'chunk'. Return -1
+ * if there is no such element.
+ */
+static inline int
+node_4_search_eq(rt_node_base_4 *node, uint8 chunk)
+{
+ int idx = -1;
+
+ for (int i = 0; i < node->n.count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ idx = i;
+ break;
+ }
+ }
+
+ return idx;
+}
+
+/*
+ * Return the index at which 'chunk' should be inserted into the node's chunk array.
+ */
+static inline int
+node_4_get_insertpos(rt_node_base_4 *node, uint8 chunk)
+{
+ int idx;
+
+ for (idx = 0; idx < node->n.count; idx++)
+ {
+ if (node->chunks[idx] >= chunk)
+ break;
+ }
+
+ return idx;
+}
+
+/*
+ * Return the index of the first chunk in the node that equals 'chunk'. Return -1
+ * if there is no such element.
+ */
+static inline int
+node_32_search_eq(rt_node_base_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ uint32 bitfield;
+ int index_simd = -1;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index = -1;
+
+ for (int i = 0; i < count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ index = i;
+ break;
+ }
+ }
+#endif
+
+#ifndef USE_NO_SIMD
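+ /*
+ * Compare 'chunk' against all 32 stored chunks at once: two vector
+ * equality comparisons produce a bitfield with one bit per slot, which
+ * is masked down to the slots actually in use before picking the first
+ * match.
+ */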
+ spread_chunk = vector8_broadcast(chunk);
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ cmp1 = vector8_eq(spread_chunk, haystack1);
+ cmp2 = vector8_eq(spread_chunk, haystack2);
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+/*
+ * Return the index at which 'chunk' should be inserted into the node's chunk array.
+ */
+static inline int
+node_32_get_insertpos(rt_node_base_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ Vector8 min1;
+ Vector8 min2;
+ uint32 bitfield;
+ int index_simd;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index;
+
+ for (index = 0; index < count; index++)
+ {
+ if (node->chunks[index] >= chunk)
+ break;
+ }
+#endif
+
+#ifndef USE_NO_SIMD
+ spread_chunk = vector8_broadcast(chunk);
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ min1 = vector8_min(spread_chunk, haystack1);
+ min2 = vector8_min(spread_chunk, haystack2);
+ cmp1 = vector8_eq(spread_chunk, min1);
+ cmp2 = vector8_eq(spread_chunk, min2);
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+ else
+ index_simd = count;
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+/*
+ * Functions to manipulate both chunks array and children/values array.
+ * These are used for node-4 and node-32.
+ */
+
+/* Shift the elements right at 'idx' by one */
+static inline void
+chunk_children_array_shift(uint8 *chunks, rt_node **children, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+ memmove(&(children[idx + 1]), &(children[idx]), sizeof(rt_node *) * (count - idx));
+}
+
+static inline void
+chunk_values_array_shift(uint8 *chunks, uint64 *values, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+ memmove(&(values[idx + 1]), &(values[idx]), sizeof(uint64 *) * (count - idx));
+}
+
+/* Delete the element at 'idx' */
+static inline void
+chunk_children_array_delete(uint8 *chunks, rt_node **children, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(children[idx]), &(children[idx + 1]), sizeof(rt_node *) * (count - idx - 1));
+}
+
+static inline void
+chunk_values_array_delete(uint8 *chunks, uint64 *values, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(values[idx]), &(values[idx + 1]), sizeof(uint64) * (count - idx - 1));
+}
+
+/* Copy both chunks and children/values arrays */
+static inline void
+chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
+ uint8 *dst_chunks, rt_node **dst_children, int count)
+{
+ /* For better code generation */
+ if (count > rt_node_kind_info[RT_NODE_KIND_4].fanout)
+ pg_unreachable();
+
+ memcpy(dst_chunks, src_chunks, sizeof(uint8) * count);
+ memcpy(dst_children, src_children, sizeof(rt_node *) * count);
+}
+
+static inline void
+chunk_values_array_copy(uint8 *src_chunks, uint64 *src_values,
+ uint8 *dst_chunks, uint64 *dst_values, int count)
+{
+ /* For better code generation */
+ if (count > rt_node_kind_info[RT_NODE_KIND_4].fanout)
+ pg_unreachable();
+
+ memcpy(dst_chunks, src_chunks, sizeof(uint8) * count);
+ memcpy(dst_values, src_values, sizeof(uint64) * count);
+}
+
+/* Functions to manipulate inner and leaf node-128 */
+
+/* Does the given chunk in the node have a value? */
+static inline bool
+node_128_is_chunk_used(rt_node_base_128 *node, uint8 chunk)
+{
+ return node->slot_idxs[chunk] != RT_NODE_128_INVALID_IDX;
+}
+
+/* Is the slot in the node used? */
+static inline bool
+node_inner_128_is_slot_used(rt_node_inner_128 *node, uint8 slot)
+{
+ Assert(!NODE_IS_LEAF(node));
+ return (node->children[slot] != NULL);
+}
+
+static inline bool
+node_leaf_128_is_slot_used(rt_node_leaf_128 *node, uint8 slot)
+{
+ Assert(NODE_IS_LEAF(node));
+ return ((node->isset[RT_NODE_BITMAP_BYTE(slot)] & RT_NODE_BITMAP_BIT(slot)) != 0);
+}
+
+static inline rt_node *
+node_inner_128_get_child(rt_node_inner_128 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ return node->children[node->base.slot_idxs[chunk]];
+}
+
+static inline uint64
+node_leaf_128_get_value(rt_node_leaf_128 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ Assert(((rt_node_base_128 *) node)->slot_idxs[chunk] != RT_NODE_128_INVALID_IDX);
+ return node->values[node->base.slot_idxs[chunk]];
+}
+
+static void
+node_inner_128_delete(rt_node_inner_128 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->base.slot_idxs[chunk] = RT_NODE_128_INVALID_IDX;
+}
+
+static void
+node_leaf_128_delete(rt_node_leaf_128 *node, uint8 chunk)
+{
+ int slotpos = node->base.slot_idxs[chunk];
+
+ Assert(NODE_IS_LEAF(node));
+ node->isset[RT_NODE_BITMAP_BYTE(slotpos)] &= ~(RT_NODE_BITMAP_BIT(slotpos));
+ node->base.slot_idxs[chunk] = RT_NODE_128_INVALID_IDX;
+}
+
+/* Return an unused slot in node-128 */
+static int
+node_inner_128_find_unused_slot(rt_node_inner_128 *node, uint8 chunk)
+{
+ int slotpos = 0;
+
+ Assert(!NODE_IS_LEAF(node));
+ while (node_inner_128_is_slot_used(node, slotpos))
+ slotpos++;
+
+ return slotpos;
+}
+
+static int
+node_leaf_128_find_unused_slot(rt_node_leaf_128 *node, uint8 chunk)
+{
+ int slotpos;
+
+ Assert(NODE_IS_LEAF(node));
+
+ /* We iterate over the isset bitmap per byte then check each bit */
+ for (slotpos = 0; slotpos < RT_NODE_NSLOTS_BITS(128); slotpos++)
+ {
+ if (node->isset[slotpos] < 0xFF)
+ break;
+ }
+ Assert(slotpos < RT_NODE_NSLOTS_BITS(128));
+
+ slotpos *= BITS_PER_BYTE;
+ while (node_leaf_128_is_slot_used(node, slotpos))
+ slotpos++;
+
+ return slotpos;
+}
+
+static inline void
+node_inner_128_insert(rt_node_inner_128 *node, uint8 chunk, rt_node *child)
+{
+ int slotpos;
+
+ Assert(!NODE_IS_LEAF(node));
+
+ /* find unused slot */
+ slotpos = node_inner_128_find_unused_slot(node, chunk);
+
+ node->base.slot_idxs[chunk] = slotpos;
+ node->children[slotpos] = child;
+}
+
+/* Set the slot at the corresponding chunk */
+static inline void
+node_leaf_128_insert(rt_node_leaf_128 *node, uint8 chunk, uint64 value)
+{
+ int slotpos;
+
+ Assert(NODE_IS_LEAF(node));
+
+ /* find unused slot */
+ slotpos = node_leaf_128_find_unused_slot(node, chunk);
+
+ node->base.slot_idxs[chunk] = slotpos;
+ node->isset[RT_NODE_BITMAP_BYTE(slotpos)] |= RT_NODE_BITMAP_BIT(slotpos);
+ node->values[slotpos] = value;
+}
+
+/* Update the child corresponding to 'chunk' to 'child' */
+static inline void
+node_inner_128_update(rt_node_inner_128 *node, uint8 chunk, rt_node *child)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->children[node->base.slot_idxs[chunk]] = child;
+}
+
+static inline void
+node_leaf_128_update(rt_node_leaf_128 *node, uint8 chunk, uint64 value)
+{
+ Assert(NODE_IS_LEAF(node));
+ node->values[node->base.slot_idxs[chunk]] = value;
+}
+
+/* Functions to manipulate inner and leaf node-256 */
+
+/* Return true if the slot corresponding to the given chunk is in use */
+static inline bool
+node_inner_256_is_chunk_used(rt_node_inner_256 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ return (node->children[chunk] != NULL);
+}
+
+static inline bool
+node_leaf_256_is_chunk_used(rt_node_leaf_256 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ return (node->isset[RT_NODE_BITMAP_BYTE(chunk)] & RT_NODE_BITMAP_BIT(chunk)) != 0;
+}
+
+static inline rt_node *
+node_inner_256_get_child(rt_node_inner_256 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ Assert(node_inner_256_is_chunk_used(node, chunk));
+ return node->children[chunk];
+}
+
+static inline uint64
+node_leaf_256_get_value(rt_node_leaf_256 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ Assert(node_leaf_256_is_chunk_used(node, chunk));
+ return node->values[chunk];
+}
+
+/* Set the child in the node-256 */
+static inline void
+node_inner_256_set(rt_node_inner_256 *node, uint8 chunk, rt_node *child)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->children[chunk] = child;
+}
+
+/* Set the value in the node-256 */
+static inline void
+node_leaf_256_set(rt_node_leaf_256 *node, uint8 chunk, uint64 value)
+{
+ Assert(NODE_IS_LEAF(node));
+ node->isset[RT_NODE_BITMAP_BYTE(chunk)] |= RT_NODE_BITMAP_BIT(chunk);
+ node->values[chunk] = value;
+}
+
+/* Clear the slot at the given chunk position */
+static inline void
+node_inner_256_delete(rt_node_inner_256 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->children[chunk] = NULL;
+}
+
+static inline void
+node_leaf_256_delete(rt_node_leaf_256 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ node->isset[RT_NODE_BITMAP_BYTE(chunk)] &= ~(RT_NODE_BITMAP_BIT(chunk));
+}
+
+/*
+ * Return the shift needed to store the given key.
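+ * For example, key_get_shift(0xFF) is 0 and key_get_shift(0x100) is 8.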
+ */
+static inline int
+key_get_shift(uint64 key)
+{
+ return (key == 0)
+ ? 0
+ : (pg_leftmost_one_pos64(key) / RT_NODE_SPAN) * RT_NODE_SPAN;
+}
+
+/*
+ * Return the maximum key value that can be stored in a tree whose root
+ * node has the given shift.
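+ * For example, shift_get_max_val(8) is 0xFFFF.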
+ */
+static uint64
+shift_get_max_val(int shift)
+{
+ if (shift == RT_MAX_SHIFT)
+ return UINT64_MAX;
+
+ return (UINT64CONST(1) << (shift + RT_NODE_SPAN)) - 1;
+}
+
+/*
+ * Create a new node as the root. Subordinate nodes will be created during
+ * the insertion.
+ */
+static void
+rt_new_root(radix_tree *tree, uint64 key)
+{
+ int shift = key_get_shift(key);
+ rt_node *node;
+
+ node = (rt_node *) rt_alloc_node(tree, RT_NODE_KIND_4, shift, 0,
+ shift > 0);
+ tree->max_val = shift_get_max_val(shift);
+ tree->root = node;
+}
+
+/*
+ * Allocate a new node with the given node kind.
+ */
+static rt_node *
+rt_alloc_node(radix_tree *tree, int kind, uint8 shift, uint8 chunk, bool inner)
+{
+ rt_node *newnode;
+
+ if (inner)
+ newnode = (rt_node *) MemoryContextAllocZero(tree->inner_slabs[kind],
+ rt_node_kind_info[kind].inner_size);
+ else
+ newnode = (rt_node *) MemoryContextAllocZero(tree->leaf_slabs[kind],
+ rt_node_kind_info[kind].leaf_size);
+
+ newnode->kind = kind;
+ newnode->shift = shift;
+ newnode->chunk = chunk;
+
+ /* Initialize slot_idxs to invalid values */
+ if (kind == RT_NODE_KIND_128)
+ {
+ rt_node_base_128 *n128 = (rt_node_base_128 *) newnode;
+
+ memset(n128->slot_idxs, RT_NODE_128_INVALID_IDX, sizeof(n128->slot_idxs));
+ }
+
+#ifdef RT_DEBUG
+ /* update the statistics */
+ tree->cnt[kind]++;
+#endif
+
+ return newnode;
+}
+
+/*
+ * Create a new node with 'new_kind' and the same shift, chunk, and
+ * count of 'node'.
+ */
+static rt_node *
+rt_copy_node(radix_tree *tree, rt_node *node, int new_kind)
+{
+ rt_node *newnode;
+
+ newnode = rt_alloc_node(tree, new_kind, node->shift, node->chunk,
+ node->shift > 0);
+ newnode->count = node->count;
+
+ return newnode;
+}
+
+/* Free the given node */
+static void
+rt_free_node(radix_tree *tree, rt_node *node)
+{
+ /* If we're deleting the root node, make the tree empty */
+ if (tree->root == node)
+ tree->root = NULL;
+
+#ifdef RT_DEBUG
+ /* update the statistics */
+ tree->cnt[node->kind]--;
+ Assert(tree->cnt[node->kind] >= 0);
+#endif
+
+ pfree(node);
+}
+
+/*
+ * Replace old_child with new_child, and free the old one.
+ */
+static void
+rt_replace_node(radix_tree *tree, rt_node *parent, rt_node *old_child,
+ rt_node *new_child, uint64 key)
+{
+ Assert(old_child->chunk == new_child->chunk);
+ Assert(old_child->shift == new_child->shift);
+
+ if (parent == old_child)
+ {
+ /* Replace the root node with the new large node */
+ tree->root = new_child;
+ }
+ else
+ {
+ bool replaced PG_USED_FOR_ASSERTS_ONLY;
+
+ replaced = rt_node_insert_inner(tree, NULL, parent, key, new_child);
+ Assert(replaced);
+ }
+
+ rt_free_node(tree, old_child);
+}
+
+/*
+ * The radix tree doesn't have sufficient height. Extend the radix tree so it can
+ * store the key.
+ */
+static void
+rt_extend(radix_tree *tree, uint64 key)
+{
+ int target_shift;
+ int shift = tree->root->shift + RT_NODE_SPAN;
+
+ target_shift = key_get_shift(key);
+
+ /* Grow tree from 'shift' to 'target_shift' */
+ while (shift <= target_shift)
+ {
+ rt_node_inner_4 *node;
+
+ node = (rt_node_inner_4 *) rt_alloc_node(tree, RT_NODE_KIND_4,
+ shift, 0, true);
+ node->base.n.count = 1;
+ node->base.chunks[0] = 0;
+ node->children[0] = tree->root;
+
+ tree->root->chunk = 0;
+ tree->root = (rt_node *) node;
+
+ shift += RT_NODE_SPAN;
+ }
+
+ tree->max_val = shift_get_max_val(target_shift);
+}
+
+/*
+ * The radix tree doesn't have inner and leaf nodes for the given key-value pair.
+ * Insert inner and leaf nodes from 'node' down to the bottom.
+ */
+static inline void
+rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node *parent,
+ rt_node *node)
+{
+ int shift = node->shift;
+
+ while (shift >= RT_NODE_SPAN)
+ {
+ rt_node *newchild;
+ int newshift = shift - RT_NODE_SPAN;
+
+ newchild = rt_alloc_node(tree, RT_NODE_KIND_4, newshift,
+ RT_GET_KEY_CHUNK(key, node->shift),
+ newshift > 0);
+ rt_node_insert_inner(tree, parent, node, key, newchild);
+
+ parent = node;
+ node = newchild;
+ shift -= RT_NODE_SPAN;
+ }
+
+ rt_node_insert_leaf(tree, parent, node, key, value);
+ tree->num_keys++;
+}
+
+/*
+ * Search for the child pointer corresponding to 'key' in the given node, and
+ * do the specified 'action'.
+ *
+ * Return true if the key is found, otherwise return false. On success, the child
+ * pointer is returned in '*child_p'.
+ */
+static inline bool
+rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **child_p)
+{
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool found = false;
+ rt_node *child = NULL;
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+ int idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
+
+ if (idx < 0)
+ break;
+
+ found = true;
+
+ if (action == RT_ACTION_FIND)
+ child = n4->children[idx];
+ else /* RT_ACTION_DELETE */
+ chunk_children_array_delete(n4->base.chunks, n4->children,
+ n4->base.n.count, idx);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+ int idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
+
+ if (idx < 0)
+ break;
+
+ found = true;
+ if (action == RT_ACTION_FIND)
+ child = n32->children[idx];
+ else /* RT_ACTION_DELETE */
+ chunk_children_array_delete(n32->base.chunks, n32->children,
+ n32->base.n.count, idx);
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ rt_node_inner_128 *n128 = (rt_node_inner_128 *) node;
+
+ if (!node_128_is_chunk_used((rt_node_base_128 *) n128, chunk))
+ break;
+
+ found = true;
+
+ if (action == RT_ACTION_FIND)
+ child = node_inner_128_get_child(n128, chunk);
+ else /* RT_ACTION_DELETE */
+ node_inner_128_delete(n128, chunk);
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+
+ if (!node_inner_256_is_chunk_used(n256, chunk))
+ break;
+
+ found = true;
+ if (action == RT_ACTION_FIND)
+ child = node_inner_256_get_child(n256, chunk);
+ else /* RT_ACTION_DELETE */
+ node_inner_256_delete(n256, chunk);
+
+ break;
+ }
+ }
+
+ /* update statistics */
+ if (action == RT_ACTION_DELETE && found)
+ node->count--;
+
+ if (found && child_p)
+ *child_p = child;
+
+ return found;
+}
+
+/*
+ * Search for the value corresponding to 'key' in the given node, and do the
+ * specified 'action'.
+ *
+ * Return true if the key is found, otherwise return false. On success, the value
+ * is returned in '*value_p'.
+ */
+static inline bool
+rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p)
+{
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool found = false;
+ uint64 value = 0;
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+ int idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
+
+ if (idx < 0)
+ break;
+
+ found = true;
+
+ if (action == RT_ACTION_FIND)
+ value = n4->values[idx];
+ else /* RT_ACTION_DELETE */
+ chunk_values_array_delete(n4->base.chunks, (uint64 *) n4->values,
+ n4->base.n.count, idx);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+ int idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
+
+ if (idx < 0)
+ break;
+
+ found = true;
+ if (action == RT_ACTION_FIND)
+ value = n32->values[idx];
+ else /* RT_ACTION_DELETE */
+ chunk_values_array_delete(n32->base.chunks, (uint64 *) n32->values,
+ n32->base.n.count, idx);
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ rt_node_leaf_128 *n128 = (rt_node_leaf_128 *) node;
+
+ if (!node_128_is_chunk_used((rt_node_base_128 *) n128, chunk))
+ break;
+
+ found = true;
+
+ if (action == RT_ACTION_FIND)
+ value = node_leaf_128_get_value(n128, chunk);
+ else /* RT_ACTION_DELETE */
+ node_leaf_128_delete(n128, chunk);
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+
+ if (!node_leaf_256_is_chunk_used(n256, chunk))
+ break;
+
+ found = true;
+ if (action == RT_ACTION_FIND)
+ value = node_leaf_256_get_value(n256, chunk);
+ else /* RT_ACTION_DELETE */
+ node_leaf_256_delete(n256, chunk);
+
+ break;
+ }
+ }
+
+ /* update statistics */
+ if (action == RT_ACTION_DELETE && found)
+ node->count--;
+
+ if (found && value_p)
+ *value_p = value;
+
+ return found;
+}
+
+/* Insert the child to the inner node */
+static bool
+rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 key,
+ rt_node *child)
+{
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool chunk_exists = false;
+
+ Assert(!NODE_IS_LEAF(node));
+
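+ /*
+ * Each case below inserts into the node if it has a free slot. If the
+ * node is full, it is grown into the next larger kind, replaced in the
+ * parent, and control falls through to the next case to perform the
+ * actual insertion into the new node.
+ */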
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+ int idx;
+
+ idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n4->children[idx] = child;
+ break;
+ }
+
+ if (unlikely(!NODE_HAS_FREE_SLOT(n4)))
+ {
+ rt_node_inner_32 *new32;
+
+ /* grow node from 4 to 32 */
+ new32 = (rt_node_inner_32 *) rt_copy_node(tree, (rt_node *) n4,
+ RT_NODE_KIND_32);
+ chunk_children_array_copy(n4->base.chunks, n4->children,
+ new32->base.chunks, new32->children,
+ n4->base.n.count);
+
+ Assert(parent != NULL);
+ rt_replace_node(tree, parent, (rt_node *) n4, (rt_node *) new32,
+ key);
+ node = (rt_node *) new32;
+ }
+ else
+ {
+ int insertpos = node_4_get_insertpos((rt_node_base_4 *) n4, chunk);
+ uint16 count = n4->base.n.count;
+
+ /* shift chunks and children */
+ if (count != 0 && insertpos < count)
+ chunk_children_array_shift(n4->base.chunks, n4->children,
+ count, insertpos);
+
+ n4->base.chunks[insertpos] = chunk;
+ n4->children[insertpos] = child;
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_32:
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+ int idx;
+
+ idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n32->children[idx] = child;
+ break;
+ }
+
+ if (unlikely(!NODE_HAS_FREE_SLOT(n32)))
+ {
+ rt_node_inner_128 *new128;
+
+ /* grow node from 32 to 128 */
+ new128 = (rt_node_inner_128 *) rt_copy_node(tree, (rt_node *) n32,
+ RT_NODE_KIND_128);
+ for (int i = 0; i < n32->base.n.count; i++)
+ node_inner_128_insert(new128, n32->base.chunks[i], n32->children[i]);
+
+ Assert(parent != NULL);
+ rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new128,
+ key);
+ node = (rt_node *) new128;
+ }
+ else
+ {
+ int insertpos = node_32_get_insertpos((rt_node_base_32 *) n32, chunk);
+ int16 count = n32->base.n.count;
+
+ if (count != 0 && insertpos < count)
+ chunk_children_array_shift(n32->base.chunks, n32->children,
+ count, insertpos);
+
+ n32->base.chunks[insertpos] = chunk;
+ n32->children[insertpos] = child;
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_128:
+ {
+ rt_node_inner_128 *n128 = (rt_node_inner_128 *) node;
+ int cnt = 0;
+
+ if (node_128_is_chunk_used((rt_node_base_128 *) n128, chunk))
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ node_inner_128_update(n128, chunk, child);
+ break;
+ }
+
+ if (unlikely(!NODE_HAS_FREE_SLOT(n128)))
+ {
+ rt_node_inner_256 *new256;
+
+ /* grow node from 128 to 256 */
+ new256 = (rt_node_inner_256 *) rt_copy_node(tree, (rt_node *) n128,
+ RT_NODE_KIND_256);
+ for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n128->base.n.count; i++)
+ {
+ if (!node_128_is_chunk_used((rt_node_base_128 *) n128, i))
+ continue;
+
+ node_inner_256_set(new256, i, node_inner_128_get_child(n128, i));
+ cnt++;
+ }
+
+ Assert(parent != NULL);
+ rt_replace_node(tree, parent, (rt_node *) n128, (rt_node *) new256,
+ key);
+ node = (rt_node *) new256;
+ }
+ else
+ {
+ node_inner_128_insert(n128, chunk, child);
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_256:
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+
+ chunk_exists = node_inner_256_is_chunk_used(n256, chunk);
+ Assert(chunk_exists || NODE_HAS_FREE_SLOT(n256));
+
+ node_inner_256_set(n256, chunk, child);
+ break;
+ }
+ }
+
+ /* Update statistics */
+ if (!chunk_exists)
+ node->count++;
+
+ /*
+ * Done. Finally, verify that the chunk and child have been inserted or
+ * replaced properly in the node.
+ */
+ rt_verify_node(node);
+
+ return chunk_exists;
+}
+
+/* Insert the value into the leaf node */
+static bool
+rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
+ uint64 key, uint64 value)
+{
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool chunk_exists = false;
+
+ Assert(NODE_IS_LEAF(node));
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+ int idx;
+
+ idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n4->values[idx] = value;
+ break;
+ }
+
+ if (unlikely(!NODE_HAS_FREE_SLOT(n4)))
+ {
+ rt_node_leaf_32 *new32;
+
+ /* grow node from 4 to 32 */
+ new32 = (rt_node_leaf_32 *) rt_copy_node(tree, (rt_node *) n4,
+ RT_NODE_KIND_32);
+ chunk_values_array_copy(n4->base.chunks, n4->values,
+ new32->base.chunks, new32->values,
+ n4->base.n.count);
+
+ Assert(parent != NULL);
+ rt_replace_node(tree, parent, (rt_node *) n4, (rt_node *) new32,
+ key);
+ node = (rt_node *) new32;
+ }
+ else
+ {
+ int insertpos = node_4_get_insertpos((rt_node_base_4 *) n4, chunk);
+ int count = n4->base.n.count;
+
+ /* shift chunks and values */
+ if (count != 0 && insertpos < count)
+ chunk_values_array_shift(n4->base.chunks, n4->values,
+ count, insertpos);
+
+ n4->base.chunks[insertpos] = chunk;
+ n4->values[insertpos] = value;
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_32:
+ {
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+ int idx;
+
+ idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n32->values[idx] = value;
+ break;
+ }
+
+ if (unlikely(!NODE_HAS_FREE_SLOT(n32)))
+ {
+ rt_node_leaf_128 *new128;
+
+ /* grow node from 32 to 128 */
+ new128 = (rt_node_leaf_128 *) rt_copy_node(tree, (rt_node *) n32,
+ RT_NODE_KIND_128);
+ for (int i = 0; i < n32->base.n.count; i++)
+ node_leaf_128_insert(new128, n32->base.chunks[i], n32->values[i]);
+
+ Assert(parent != NULL);
+ rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new128,
+ key);
+ node = (rt_node *) new128;
+ }
+ else
+ {
+ int insertpos = node_32_get_insertpos((rt_node_base_32 *) n32, chunk);
+ int count = n32->base.n.count;
+
+ if (count != 0 && insertpos < count)
+ chunk_values_array_shift(n32->base.chunks, n32->values,
+ count, insertpos);
+
+ n32->base.chunks[insertpos] = chunk;
+ n32->values[insertpos] = value;
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_128:
+ {
+ rt_node_leaf_128 *n128 = (rt_node_leaf_128 *) node;
+ int cnt = 0;
+
+ if (node_128_is_chunk_used((rt_node_base_128 *) n128, chunk))
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ node_leaf_128_update(n128, chunk, value);
+ break;
+ }
+
+ if (unlikely(!NODE_HAS_FREE_SLOT(n128)))
+ {
+ rt_node_leaf_256 *new256;
+
+ /* grow node from 128 to 256 */
+ new256 = (rt_node_leaf_256 *) rt_copy_node(tree, (rt_node *) n128,
+ RT_NODE_KIND_256);
+ for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n128->base.n.count; i++)
+ {
+ if (!node_128_is_chunk_used((rt_node_base_128 *) n128, i))
+ continue;
+
+ node_leaf_256_set(new256, i, node_leaf_128_get_value(n128, i));
+ cnt++;
+ }
+
+ Assert(parent != NULL);
+ rt_replace_node(tree, parent, (rt_node *) n128, (rt_node *) new256,
+ key);
+ node = (rt_node *) new256;
+ }
+ else
+ {
+ node_leaf_128_insert(n128, chunk, value);
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_256:
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+
+ chunk_exists = node_leaf_256_is_chunk_used(n256, chunk);
+ Assert(chunk_exists || NODE_HAS_FREE_SLOT(n256));
+
+ node_leaf_256_set(n256, chunk, value);
+ break;
+ }
+ }
+
+ /* Update statistics */
+ if (!chunk_exists)
+ node->count++;
+
+ /*
+ * Done. Finally, verify that the chunk and value have been inserted or
+ * replaced properly in the node.
+ */
+ rt_verify_node(node);
+
+ return chunk_exists;
+}
+
+/*
+ * Create the radix tree in the given memory context and return it.
+ */
+radix_tree *
+rt_create(MemoryContext ctx)
+{
+ radix_tree *tree;
+ MemoryContext old_ctx;
+
+ old_ctx = MemoryContextSwitchTo(ctx);
+
+ tree = palloc(sizeof(radix_tree));
+ tree->context = ctx;
+ tree->root = NULL;
+ tree->max_val = 0;
+ tree->num_keys = 0;
+
+ /* Create the slab allocator for each size class */
+ for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ {
+ tree->inner_slabs[i] = SlabContextCreate(ctx,
+ rt_node_kind_info[i].name,
+ rt_node_kind_info[i].inner_blocksize,
+ rt_node_kind_info[i].inner_size);
+ tree->leaf_slabs[i] = SlabContextCreate(ctx,
+ rt_node_kind_info[i].name,
+ rt_node_kind_info[i].leaf_blocksize,
+ rt_node_kind_info[i].leaf_size);
+#ifdef RT_DEBUG
+ tree->cnt[i] = 0;
+#endif
+ }
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return tree;
+}
+
+/*
+ * Free the given radix tree.
+ */
+void
+rt_free(radix_tree *tree)
+{
+ for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ {
+ MemoryContextDelete(tree->inner_slabs[i]);
+ MemoryContextDelete(tree->leaf_slabs[i]);
+ }
+
+ pfree(tree);
+}
+
+/*
+ * Set key to value. If the entry already exists, update its value to 'value'
+ * and return true; otherwise insert a new entry and return false.
+ */
+bool
+rt_set(radix_tree *tree, uint64 key, uint64 value)
+{
+ int shift;
+ bool updated;
+ rt_node *node;
+ rt_node *parent;
+
+ /* Empty tree, create the root */
+ if (!tree->root)
+ rt_new_root(tree, key);
+
+ /* Extend the tree if necessary */
+ if (key > tree->max_val)
+ rt_extend(tree, key);
+
+ Assert(tree->root);
+
+ shift = tree->root->shift;
+ node = parent = tree->root;
+
+ /* Descend the tree until a leaf node */
+ while (shift >= 0)
+ {
+ rt_node *child;
+
+ if (NODE_IS_LEAF(node))
+ break;
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ {
+ rt_set_extend(tree, key, value, parent, node);
+ return false;
+ }
+
+ parent = node;
+ node = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ updated = rt_node_insert_leaf(tree, parent, node, key, value);
+
+ /* Update the statistics */
+ if (!updated)
+ tree->num_keys++;
+
+ return updated;
+}
+
+/*
+ * Search for the given key in the radix tree. Return true if the key is found,
+ * otherwise return false. On success, the value is set to *value_p, which
+ * therefore must not be NULL.
+ */
+bool
+rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
+{
+ rt_node *node;
+ int shift;
+
+ Assert(value_p != NULL);
+
+ if (!tree->root || key > tree->max_val)
+ return false;
+
+ node = tree->root;
+ shift = tree->root->shift;
+
+ /* Descend the tree until a leaf node */
+ while (shift >= 0)
+ {
+ rt_node *child;
+
+ if (NODE_IS_LEAF(node))
+ break;
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ return false;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ return rt_node_search_leaf(node, key, RT_ACTION_FIND, value_p);
+}
+
+/*
+ * Delete the given key from the radix tree. Return true if the key is found (and
+ * deleted), otherwise do nothing and return false.
+ */
+bool
+rt_delete(radix_tree *tree, uint64 key)
+{
+ rt_node *node;
+ rt_node *stack[RT_MAX_LEVEL] = {0};
+ int shift;
+ int level;
+ bool deleted;
+
+ if (!tree->root || key > tree->max_val)
+ return false;
+
+ /*
+ * Descend the tree to search the key while building a stack of nodes we
+ * visited.
+ */
+ node = tree->root;
+ shift = tree->root->shift;
+ level = -1;
+ while (shift > 0)
+ {
+ rt_node *child;
+
+ /* Push the current node to the stack */
+ stack[++level] = node;
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ return false;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ /* Delete the key from the leaf node if exists */
+ Assert(NODE_IS_LEAF(node));
+ deleted = rt_node_search_leaf(node, key, RT_ACTION_DELETE, NULL);
+
+ if (!deleted)
+ {
+ /* no key is found in the leaf node */
+ return false;
+ }
+
+ /* Found the key to delete. Update the statistics */
+ tree->num_keys--;
+
+ /*
+ * Return if the leaf node still has keys and we don't need to delete the
+ * node.
+ */
+ if (!NODE_IS_EMPTY(node))
+ return true;
+
+ /* Free the empty leaf node */
+ rt_free_node(tree, node);
+
+ /* Delete the key in inner nodes recursively */
+ while (level >= 0)
+ {
+ node = stack[level--];
+
+ deleted = rt_node_search_inner(node, key, RT_ACTION_DELETE, NULL);
+ Assert(deleted);
+
+ /* If the node didn't become empty, we stop deleting the key */
+ if (!NODE_IS_EMPTY(node))
+ break;
+
+ /* The node became empty */
+ rt_free_node(tree, node);
+ }
+
+ /*
+ * If we eventually deleted the root node while recursively deleting empty
+ * nodes, we make the tree empty.
+ */
+ if (level == 0)
+ {
+ tree->root = NULL;
+ tree->max_val = 0;
+ }
+
+ return true;
+}
+
+/* Create and return the iterator for the given radix tree */
+rt_iter *
+rt_begin_iterate(radix_tree *tree)
+{
+ MemoryContext old_ctx;
+ rt_iter *iter;
+ int top_level;
+
+ old_ctx = MemoryContextSwitchTo(tree->context);
+
+ iter = (rt_iter *) palloc0(sizeof(rt_iter));
+ iter->tree = tree;
+
+ /* empty tree */
+ if (!iter->tree)
+ return iter;
+
+ top_level = iter->tree->root->shift / RT_NODE_SPAN;
+ iter->stack_len = top_level;
+
+ /*
+ * Descend to the leftmost leaf node from the root. The key is constructed
+ * while descending to the leaf.
+ */
+ rt_update_iter_stack(iter, iter->tree->root, top_level);
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return iter;
+}
+
+/*
+ * Update each node_iter for inner nodes in the iterator node stack.
+ */
+static void
+rt_update_iter_stack(rt_iter *iter, rt_node *from_node, int from)
+{
+ int level = from;
+ rt_node *node = from_node;
+
+ for (;;)
+ {
+ rt_node_iter *node_iter = &(iter->stack[level--]);
+
+ node_iter->node = node;
+ node_iter->current_idx = -1;
+
+ /* We don't advance the leaf node iterator here */
+ if (NODE_IS_LEAF(node))
+ return;
+
+ /* Advance to the next slot in the inner node */
+ node = rt_node_inner_iterate_next(iter, node_iter);
+
+ /* We must find the first child in the node */
+ Assert(node);
+ }
+}
+
+/*
+ * Return true, setting *key_p and *value_p, if there is a next key. Otherwise,
+ * return false.
+ */
+bool
+rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p)
+{
+ /* Empty tree */
+ if (!iter->tree)
+ return false;
+
+ for (;;)
+ {
+ rt_node *child = NULL;
+ uint64 value;
+ int level;
+ bool found;
+
+ /* Advance the leaf node iterator to get next key-value pair */
+ found = rt_node_leaf_iterate_next(iter, &(iter->stack[0]), &value);
+
+ if (found)
+ {
+ *key_p = iter->key;
+ *value_p = value;
+ return true;
+ }
+
+ /*
+ * We've visited all values in the leaf node, so advance inner node
+ * iterators from the level=1 until we find the next child node.
+ */
+ for (level = 1; level <= iter->stack_len; level++)
+ {
+ child = rt_node_inner_iterate_next(iter, &(iter->stack[level]));
+
+ if (child)
+ break;
+ }
+
+ /* the iteration finished */
+ if (!child)
+ return false;
+
+ /*
+ * Set the node to the node iterator and update the iterator stack
+ * from this node.
+ */
+ rt_update_iter_stack(iter, child, level - 1);
+
+ /* Node iterators are updated, so try again from the leaf */
+ }
+
+ return false;
+}
+
+void
+rt_end_iterate(rt_iter *iter)
+{
+ pfree(iter);
+}
+
+static inline void
+rt_iter_update_key(rt_iter *iter, uint8 chunk, uint8 shift)
+{
+ iter->key &= ~(((uint64) RT_CHUNK_MASK) << shift);
+ iter->key |= (((uint64) chunk) << shift);
+}
+
+/*
+ * Advance to the next used slot in the inner node. Return the child if one
+ * exists, otherwise NULL.
+ */
+static inline rt_node *
+rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
+{
+ rt_node *child = NULL;
+ bool found = false;
+ uint8 key_chunk;
+
+ switch (node_iter->node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n4->base.n.count)
+ break;
+
+ child = n4->children[node_iter->current_idx];
+ key_chunk = n4->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n32->base.n.count)
+ break;
+
+ child = n32->children[node_iter->current_idx];
+ key_chunk = n32->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ rt_node_inner_128 *n128 = (rt_node_inner_128 *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (node_128_is_chunk_used((rt_node_base_128 *) n128, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+ child = node_inner_128_get_child(n128, i);
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (node_inner_256_is_chunk_used(n256, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+ child = node_inner_256_get_child(n256, i);
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ }
+
+ if (found)
+ rt_iter_update_key(iter, key_chunk, node_iter->node->shift);
+
+ return child;
+}
+
+/*
+ * Advance to the next used slot in the leaf node. On success, return true
+ * and set the value to *value_p; otherwise return false.
+ */
+static inline bool
+rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
+ uint64 *value_p)
+{
+ rt_node *node = node_iter->node;
+ bool found = false;
+ uint64 value;
+ uint8 key_chunk;
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n4->base.n.count)
+ break;
+
+ value = n4->values[node_iter->current_idx];
+ key_chunk = n4->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n32->base.n.count)
+ break;
+
+ value = n32->values[node_iter->current_idx];
+ key_chunk = n32->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ rt_node_leaf_128 *n128 = (rt_node_leaf_128 *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (node_128_is_chunk_used((rt_node_base_128 *) n128, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+ value = node_leaf_128_get_value(n128, i);
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (node_leaf_256_is_chunk_used(n256, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+ value = node_leaf_256_get_value(n256, i);
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ }
+
+ if (found)
+ {
+ rt_iter_update_key(iter, key_chunk, node_iter->node->shift);
+ *value_p = value;
+ }
+
+ return found;
+}
+
+/*
+ * Return the number of keys in the radix tree.
+ */
+uint64
+rt_num_entries(radix_tree *tree)
+{
+ return tree->num_keys;
+}
+
+/*
+ * Return the amount of memory used by the radix tree.
+ */
+uint64
+rt_memory_usage(radix_tree *tree)
+{
+ Size total = sizeof(radix_tree);
+
+ for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ {
+ total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
+ total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
+ }
+
+ return total;
+}
+
+/*
+ * Verify the radix tree node.
+ */
+static void
+rt_verify_node(rt_node *node)
+{
+#ifdef USE_ASSERT_CHECKING
+ Assert(node->count >= 0);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_base_4 *n4 = (rt_node_base_4 *) node;
+
+ for (int i = 1; i < n4->n.count; i++)
+ Assert(n4->chunks[i - 1] < n4->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_base_32 *n32 = (rt_node_base_32 *) node;
+
+ for (int i = 1; i < n32->n.count; i++)
+ Assert(n32->chunks[i - 1] < n32->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ rt_node_base_128 *n128 = (rt_node_base_128 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!node_128_is_chunk_used(n128, i))
+ continue;
+
+ /* Check if the corresponding slot is used */
+ if (NODE_IS_LEAF(node))
+ Assert(node_leaf_128_is_slot_used((rt_node_leaf_128 *) node,
+ n128->slot_idxs[i]));
+ else
+ Assert(node_inner_128_is_slot_used((rt_node_inner_128 *) node,
+ n128->slot_idxs[i]));
+
+ cnt++;
+ }
+
+ Assert(n128->n.count == cnt);
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RT_NODE_NSLOTS_BITS(RT_NODE_MAX_SLOTS); i++)
+ cnt += pg_popcount32(n256->isset[i]);
+
+ /* Check that the number of used chunks matches */
+ Assert(n256->base.n.count == cnt);
+
+ break;
+ }
+ }
+ }
+#endif
+}
+
+/***************** DEBUG FUNCTIONS *****************/
+#ifdef RT_DEBUG
+void
+rt_stats(radix_tree *tree)
+{
+ ereport(LOG, (errmsg("num_keys = %lu, height = %u, n4 = %u, n32 = %u, n128 = %u, n256 = %u",
+ tree->num_keys,
+ tree->root->shift / RT_NODE_SPAN,
+ tree->cnt[0],
+ tree->cnt[1],
+ tree->cnt[2],
+ tree->cnt[3])));
+}
+
+static void
+rt_dump_node(rt_node *node, int level, bool recurse)
+{
+ char space[128] = {0};
+
+ fprintf(stderr, "[%s] kind %d, count %u, shift %u, chunk 0x%X:\n",
+ NODE_IS_LEAF(node) ? "LEAF" : "INNR",
+ (node->kind == RT_NODE_KIND_4) ? 4 :
+ (node->kind == RT_NODE_KIND_32) ? 32 :
+ (node->kind == RT_NODE_KIND_128) ? 128 : 256,
+ node->count, node->shift, node->chunk);
+
+ if (level > 0)
+ sprintf(space, "%*c", level * 4, ' ');
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+
+ fprintf(stderr, "%schunk 0x%X value 0x%lX\n",
+ space, n4->base.chunks[i], n4->values[i]);
+ }
+ else
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, n4->base.chunks[i]);
+
+ if (recurse)
+ rt_dump_node(n4->children[i], level + 1, recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+
+ fprintf(stderr, "%schunk 0x%X value 0x%lX\n",
+ space, n32->base.chunks[i], n32->values[i]);
+ }
+ else
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, n32->base.chunks[i]);
+
+ if (recurse)
+ {
+ rt_dump_node(n32->children[i], level + 1, recurse);
+ }
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ rt_node_base_128 *b128 = (rt_node_base_128 *) node;
+
+ fprintf(stderr, "slot_idxs ");
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!node_128_is_chunk_used(b128, i))
+ continue;
+
+ fprintf(stderr, " [%d]=%d, ", i, b128->slot_idxs[i]);
+ }
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_128 *n = (rt_node_leaf_128 *) node;
+
+ fprintf(stderr, ", isset-bitmap:");
+ for (int i = 0; i < 16; i++)
+ {
+ fprintf(stderr, "%X ", (uint8) n->isset[i]);
+ }
+ fprintf(stderr, "\n");
+ }
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!node_128_is_chunk_used(b128, i))
+ continue;
+
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_128 *n128 = (rt_node_leaf_128 *) b128;
+
+ fprintf(stderr, "%schunk 0x%X value 0x%lX\n",
+ space, i, node_leaf_128_get_value(n128, i));
+ }
+ else
+ {
+ rt_node_inner_128 *n128 = (rt_node_inner_128 *) b128;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, i);
+
+ if (recurse)
+ rt_dump_node(node_inner_128_get_child(n128, i),
+ level + 1, recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+
+ if (!node_leaf_256_is_chunk_used(n256, i))
+ continue;
+
+ fprintf(stderr, "%schunk 0x%X value 0x%lX\n",
+ space, i, node_leaf_256_get_value(n256, i));
+ }
+ else
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+
+ if (!node_inner_256_is_chunk_used(n256, i))
+ continue;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, i);
+
+ if (recurse)
+ rt_dump_node(node_inner_256_get_child(n256, i), level + 1,
+ recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ }
+}
+
+void
+rt_dump_search(radix_tree *tree, uint64 key)
+{
+ rt_node *node;
+ int shift;
+ int level = 0;
+
+ elog(NOTICE, "-----------------------------------------------------------");
+ elog(NOTICE, "max_val = %lu (0x%lX)", tree->max_val, tree->max_val);
+
+ if (!tree->root)
+ {
+ elog(NOTICE, "tree is empty");
+ return;
+ }
+
+ if (key > tree->max_val)
+ {
+ elog(NOTICE, "key %lu (0x%lX) is larger than max val",
+ key, key);
+ return;
+ }
+
+ node = tree->root;
+ shift = tree->root->shift;
+ while (shift >= 0)
+ {
+ rt_node *child;
+
+ rt_dump_node(node, level, false);
+
+ if (NODE_IS_LEAF(node))
+ {
+ uint64 dummy;
+
+ /* We reached at a leaf node, find the corresponding slot */
+ rt_node_search_leaf(node, key, RT_ACTION_FIND, &dummy);
+
+ break;
+ }
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ break;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ level++;
+ }
+}
+
+void
+rt_dump(radix_tree *tree)
+{
+ for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ fprintf(stderr, "%s\tinner_size%lu\tinner_blocksize %lu\tleaf_size %lu\tleaf_blocksize %lu\n",
+ rt_node_kind_info[i].name,
+ rt_node_kind_info[i].inner_size,
+ rt_node_kind_info[i].inner_blocksize,
+ rt_node_kind_info[i].leaf_size,
+ rt_node_kind_info[i].leaf_blocksize);
+ fprintf(stderr, "max_val = %lu\n", tree->max_val);
+
+ if (!tree->root)
+ {
+ fprintf(stderr, "empty tree\n");
+ return;
+ }
+
+ rt_dump_node(tree->root, 0, true);
+}
+#endif
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
new file mode 100644
index 0000000000..d5d7668617
--- /dev/null
+++ b/src/include/lib/radixtree.h
@@ -0,0 +1,42 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.h
+ * Interface for radix tree.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/lib/radixtree.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef RADIXTREE_H
+#define RADIXTREE_H
+
+#include "postgres.h"
+
+#define RT_DEBUG 1
+
+typedef struct radix_tree radix_tree;
+typedef struct rt_iter rt_iter;
+
+extern radix_tree *rt_create(MemoryContext ctx);
+extern void rt_free(radix_tree *tree);
+extern bool rt_search(radix_tree *tree, uint64 key, uint64 *val_p);
+extern bool rt_set(radix_tree *tree, uint64 key, uint64 val);
+extern rt_iter *rt_begin_iterate(radix_tree *tree);
+
+extern bool rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p);
+extern void rt_end_iterate(rt_iter *iter);
+extern bool rt_delete(radix_tree *tree, uint64 key);
+
+extern uint64 rt_memory_usage(radix_tree *tree);
+extern uint64 rt_num_entries(radix_tree *tree);
+
+#ifdef RT_DEBUG
+extern void rt_dump(radix_tree *tree);
+extern void rt_dump_search(radix_tree *tree, uint64 key);
+extern void rt_stats(radix_tree *tree);
+#endif
+
+#endif /* RADIXTREE_H */
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 96addded81..11d0ec5b07 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -27,6 +27,7 @@ SUBDIRS = \
test_parser \
test_pg_dump \
test_predtest \
+ test_radixtree \
test_rbtree \
test_regex \
test_rls_hooks \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 1d26544854..568823b221 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -21,6 +21,7 @@ subdir('test_oat_hooks')
subdir('test_parser')
subdir('test_pg_dump')
subdir('test_predtest')
+subdir('test_radixtree')
subdir('test_rbtree')
subdir('test_regex')
subdir('test_rls_hooks')
diff --git a/src/test/modules/test_radixtree/.gitignore b/src/test/modules/test_radixtree/.gitignore
new file mode 100644
index 0000000000..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/test_radixtree/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_radixtree/Makefile b/src/test/modules/test_radixtree/Makefile
new file mode 100644
index 0000000000..da06b93da3
--- /dev/null
+++ b/src/test/modules/test_radixtree/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_radixtree/Makefile
+
+MODULE_big = test_radixtree
+OBJS = \
+ $(WIN32RES) \
+ test_radixtree.o
+PGFILEDESC = "test_radixtree - test code for src/backend/lib/radixtree.c"
+
+EXTENSION = test_radixtree
+DATA = test_radixtree--1.0.sql
+
+REGRESS = test_radixtree
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_radixtree
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_radixtree/README b/src/test/modules/test_radixtree/README
new file mode 100644
index 0000000000..a8b271869a
--- /dev/null
+++ b/src/test/modules/test_radixtree/README
@@ -0,0 +1,7 @@
+test_radixtree contains unit tests for testing the radix tree implementation
+in src/backend/lib/radixtree.c.
+
+The tests verify the correctness of the implementation, but they can also be
+used as a micro-benchmark. If you set the 'rt_test_stats' flag in
+test_radixtree.c, the tests will print extra information about execution time
+and memory usage.
diff --git a/src/test/modules/test_radixtree/expected/test_radixtree.out b/src/test/modules/test_radixtree/expected/test_radixtree.out
new file mode 100644
index 0000000000..cc6970c87c
--- /dev/null
+++ b/src/test/modules/test_radixtree/expected/test_radixtree.out
@@ -0,0 +1,28 @@
+CREATE EXTENSION test_radixtree;
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
+NOTICE: testing radix tree node types with shift "0"
+NOTICE: testing radix tree node types with shift "8"
+NOTICE: testing radix tree node types with shift "16"
+NOTICE: testing radix tree node types with shift "24"
+NOTICE: testing radix tree node types with shift "32"
+NOTICE: testing radix tree node types with shift "40"
+NOTICE: testing radix tree node types with shift "48"
+NOTICE: testing radix tree node types with shift "56"
+NOTICE: testing radix tree with pattern "all ones"
+NOTICE: testing radix tree with pattern "alternating bits"
+NOTICE: testing radix tree with pattern "clusters of ten"
+NOTICE: testing radix tree with pattern "clusters of hundred"
+NOTICE: testing radix tree with pattern "one-every-64k"
+NOTICE: testing radix tree with pattern "sparse"
+NOTICE: testing radix tree with pattern "single values, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^60"
+ test_radixtree
+----------------
+
+(1 row)
+
diff --git a/src/test/modules/test_radixtree/meson.build b/src/test/modules/test_radixtree/meson.build
new file mode 100644
index 0000000000..f96bf159d6
--- /dev/null
+++ b/src/test/modules/test_radixtree/meson.build
@@ -0,0 +1,34 @@
+# FIXME: prevent install during main install, but not during test :/
+
+test_radixtree_sources = files(
+ 'test_radixtree.c',
+)
+
+if host_system == 'windows'
+ test_radixtree_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'test_radixtree',
+ '--FILEDESC', 'test_radixtree - test code for src/backend/lib/radixtree.c',])
+endif
+
+test_radixtree = shared_module('test_radixtree',
+ test_radixtree_sources,
+ kwargs: pg_mod_args,
+)
+testprep_targets += test_radixtree
+
+install_data(
+ 'test_radixtree.control',
+ 'test_radixtree--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'test_radixtree',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'test_radixtree',
+ ],
+ },
+}
diff --git a/src/test/modules/test_radixtree/sql/test_radixtree.sql b/src/test/modules/test_radixtree/sql/test_radixtree.sql
new file mode 100644
index 0000000000..41ece5e9f5
--- /dev/null
+++ b/src/test/modules/test_radixtree/sql/test_radixtree.sql
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_radixtree;
+
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
diff --git a/src/test/modules/test_radixtree/test_radixtree--1.0.sql b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
new file mode 100644
index 0000000000..074a5a7ea7
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_radixtree/test_radixtree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_radixtree" to load this file. \quit
+
+CREATE FUNCTION test_radixtree()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
new file mode 100644
index 0000000000..cb3596755d
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -0,0 +1,504 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_radixtree.c
+ * Test radixtree set data structure.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/test/modules/test_radixtree/test_radixtree.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "lib/radixtree.h"
+#include "miscadmin.h"
+#include "nodes/bitmapset.h"
+#include "storage/block.h"
+#include "storage/itemptr.h"
+#include "utils/memutils.h"
+#include "utils/timestamp.h"
+
+#define UINT64_HEX_FORMAT "%" INT64_MODIFIER "X"
+
+/*
+ * If you enable this, the "pattern" tests will print information about
+ * how long populating, probing, and iterating the test set takes, and
+ * how much memory the test set consumed. That can be used as
+ * micro-benchmark of various operations and input patterns (you might
+ * want to increase the number of values used in each of the test, if
+ * you do that, to reduce noise).
+ *
+ * The information is printed to the server's stderr, mostly because
+ * that's where MemoryContextStats() output goes.
+ */
+static const bool rt_test_stats = false;
+
+/* The maximum number of entries each node type can have */
+static int rt_node_max_entries[] = {
+ 4, /* RT_NODE_KIND_4 */
+ 16, /* RT_NODE_KIND_16 */
+ 32, /* RT_NODE_KIND_32 */
+ 128, /* RT_NODE_KIND_128 */
+ 256 /* RT_NODE_KIND_256 */
+};
+
+/*
+ * A struct to define a pattern of integers, for use with the test_pattern()
+ * function.
+ */
+typedef struct
+{
+ char *test_name; /* short name of the test, for humans */
+ char *pattern_str; /* a bit pattern */
+ uint64 spacing; /* pattern repeats at this interval */
+ uint64 num_values; /* number of integers to set in total */
+} test_spec;
+
+/* Test patterns borrowed from test_integerset.c */
+static const test_spec test_specs[] = {
+ {
+ "all ones", "1111111111",
+ 10, 1000000
+ },
+ {
+ "alternating bits", "0101010101",
+ 10, 1000000
+ },
+ {
+ "clusters of ten", "1111111111",
+ 10000, 1000000
+ },
+ {
+ "clusters of hundred",
+ "1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111",
+ 10000, 10000000
+ },
+ {
+ "one-every-64k", "1",
+ 65536, 1000000
+ },
+ {
+ "sparse", "100000000000000000000000000000001",
+ 10000000, 1000000
+ },
+ {
+ "single values, distance > 2^32", "1",
+ UINT64CONST(10000000000), 100000
+ },
+ {
+ "clusters, distance > 2^32", "10101010",
+ UINT64CONST(10000000000), 1000000
+ },
+ {
+ "clusters, distance > 2^60", "10101010",
+ UINT64CONST(2000000000000000000),
+ 23 /* can't be much higher than this, or we
+ * overflow uint64 */
+ }
+};
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(test_radixtree);
+
+static void
+test_empty(void)
+{
+ radix_tree *radixtree;
+ uint64 dummy;
+
+ radixtree = rt_create(CurrentMemoryContext);
+
+ if (rt_search(radixtree, 0, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, 1, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, PG_UINT64_MAX, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_num_entries(radixtree) != 0)
+ elog(ERROR, "rt_num_entries on empty tree return non-zero");
+
+ rt_free(radixtree);
+}
+
+/*
+ * Check if keys from start to end with the shift exist in the tree.
+ */
+static void
+check_search_on_node(radix_tree *radixtree, uint8 shift, int start, int end)
+{
+ for (int i = start; i < end; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ uint64 val;
+
+ if (!rt_search(radixtree, key, &val))
+ elog(ERROR, "key 0x" UINT64_HEX_FORMAT " is not found on node-%d",
+ key, end);
+ if (val != key)
+ elog(ERROR, "rt_search with key 0x" UINT64_HEX_FORMAT " returns 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ key, val, key);
+ }
+}
+
+static void
+test_node_types_insert(radix_tree *radixtree, uint8 shift)
+{
+ uint64 num_entries;
+
+ for (int i = 0; i < 256; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_set(radixtree, key, key);
+
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " found", key);
+
+ for (int j = 0; j < lengthof(rt_node_max_entries); j++)
+ {
+ /*
+ * After filling all slots in each node type, check if the values
+ * are stored properly.
+ */
+ if (i == (rt_node_max_entries[j] - 1))
+ {
+ check_search_on_node(radixtree, shift,
+ (j == 0) ? 0 : rt_node_max_entries[j - 1],
+ rt_node_max_entries[j]);
+ break;
+ }
+ }
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ if (num_entries != 256)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(256));
+}
+
+static void
+test_node_types_delete(radix_tree *radixtree, uint8 shift)
+{
+ uint64 num_entries;
+
+ for (int i = 0; i < 256; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_delete(radixtree, key);
+
+ if (!found)
+ elog(ERROR, "inserted key 0x" UINT64_HEX_FORMAT " is not found", key);
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ /* The tree must be empty */
+ if (num_entries != 0)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(256));
+}
+
+/*
+ * Test for inserting and deleting key-value pairs to each node type at the given shift
+ * level.
+ */
+static void
+test_node_types(uint8 shift)
+{
+ radix_tree *radixtree;
+
+ elog(NOTICE, "testing radix tree node types with shift \"%d\"", shift);
+
+ radixtree = rt_create(CurrentMemoryContext);
+
+ /*
+ * Insert and search entries for every node type at the 'shift' level,
+ * then delete all entries to make it empty, and insert and search entries
+ * again.
+ */
+ test_node_types_insert(radixtree, shift);
+ test_node_types_delete(radixtree, shift);
+ test_node_types_insert(radixtree, shift);
+
+ rt_free(radixtree);
+}
+
+/*
+ * Test with a repeating pattern, defined by the 'spec'.
+ */
+static void
+test_pattern(const test_spec * spec)
+{
+ radix_tree *radixtree;
+ rt_iter *iter;
+ MemoryContext radixtree_ctx;
+ TimestampTz starttime;
+ TimestampTz endtime;
+ uint64 n;
+ uint64 last_int;
+ uint64 ndeleted;
+ uint64 nbefore;
+ uint64 nafter;
+ int patternlen;
+ uint64 *pattern_values;
+ uint64 pattern_num_values;
+
+ elog(NOTICE, "testing radix tree with pattern \"%s\"", spec->test_name);
+ if (rt_test_stats)
+ fprintf(stderr, "-----\ntesting radix tree with pattern \"%s\"\n", spec->test_name);
+
+ /* Pre-process the pattern, creating an array of integers from it. */
+ patternlen = strlen(spec->pattern_str);
+ pattern_values = palloc(patternlen * sizeof(uint64));
+ pattern_num_values = 0;
+ for (int i = 0; i < patternlen; i++)
+ {
+ if (spec->pattern_str[i] == '1')
+ pattern_values[pattern_num_values++] = i;
+ }
+
+ /*
+ * Allocate the radix tree.
+ *
+ * Allocate it in a separate memory context, so that we can print its
+ * memory usage easily.
+ */
+ radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
+ "radixtree test",
+ ALLOCSET_SMALL_SIZES);
+ MemoryContextSetIdentifier(radixtree_ctx, spec->test_name);
+ radixtree = rt_create(radixtree_ctx);
+
+ /*
+ * Add values to the set.
+ */
+ starttime = GetCurrentTimestamp();
+
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ uint64 x = 0;
+
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ bool found;
+
+ x = last_int + pattern_values[i];
+
+ found = rt_set(radixtree, x, x);
+
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " found", x);
+
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+
+ endtime = GetCurrentTimestamp();
+
+ if (rt_test_stats)
+ fprintf(stderr, "added " UINT64_FORMAT " values in %d ms\n",
+ spec->num_values, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Print stats on the amount of memory used.
+ *
+ * We print the usage reported by rt_memory_usage(), as well as the stats
+ * from the memory context. They should be in the same ballpark, but it's
+ * hard to automate testing that, so if you're making changes to the
+ * implementation, just observe that manually.
+ */
+ if (rt_test_stats)
+ {
+ uint64 mem_usage;
+
+ /*
+ * Also print memory usage as reported by rt_memory_usage(). It
+ * should be in the same ballpark as the usage reported by
+ * MemoryContextStats().
+ */
+ mem_usage = rt_memory_usage(radixtree);
+ fprintf(stderr, "rt_memory_usage() reported " UINT64_FORMAT " (%0.2f bytes / integer)\n",
+ mem_usage, (double) mem_usage / spec->num_values);
+
+ MemoryContextStats(radixtree_ctx);
+ }
+
+ /* Check that rt_num_entries works */
+ n = rt_num_entries(radixtree);
+ if (n != spec->num_values)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT, n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_search()
+ */
+ starttime = GetCurrentTimestamp();
+
+ for (n = 0; n < 100000; n++)
+ {
+ bool found;
+ bool expected;
+ uint64 x;
+ uint64 v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Do we expect this value to be present in the set? */
+ if (x >= last_int)
+ expected = false;
+ else
+ {
+ uint64 idx = x % spec->spacing;
+
+ if (idx >= patternlen)
+ expected = false;
+ else if (spec->pattern_str[idx] == '1')
+ expected = true;
+ else
+ expected = false;
+ }
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (found != expected)
+ elog(ERROR, "mismatch at 0x" UINT64_HEX_FORMAT ": %d vs %d", x, found, expected);
+ if (found && (v != x))
+ elog(ERROR, "found 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ v, x);
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "probed " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Test iterator
+ */
+ starttime = GetCurrentTimestamp();
+
+ iter = rt_begin_iterate(radixtree);
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ uint64 expected = last_int + pattern_values[i];
+ uint64 x;
+ uint64 val;
+
+ if (!rt_iterate_next(iter, &x, &val))
+ break;
+
+ if (x != expected)
+ elog(ERROR,
+ "iterate returned wrong key; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d",
+ x, expected, i);
+ if (val != expected)
+ elog(ERROR,
+ "iterate returned wrong value; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d", x, expected, i);
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "iterated " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ if (n < spec->num_values)
+ elog(ERROR, "iterator stopped short after " UINT64_FORMAT " entries, expected " UINT64_FORMAT, n, spec->num_values);
+ if (n > spec->num_values)
+ elog(ERROR, "iterator returned " UINT64_FORMAT " entries, " UINT64_FORMAT " was expected", n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_delete()
+ */
+ starttime = GetCurrentTimestamp();
+
+ nbefore = rt_num_entries(radixtree);
+ ndeleted = 0;
+ for (n = 0; n < 100000; n++)
+ {
+ bool found;
+ uint64 x;
+ uint64 v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (!found)
+ continue;
+
+ /* If the key is found, delete it and check again */
+ if (!rt_delete(radixtree, x))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_search(radixtree, x, &v))
+ elog(ERROR, "found deleted key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_delete(radixtree, x))
+ elog(ERROR, "deleted already-deleted key 0x" UINT64_HEX_FORMAT, x);
+
+ ndeleted++;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "deleted " UINT64_FORMAT " values in %d ms\n",
+ ndeleted, (int) (endtime - starttime) / 1000);
+
+ nafter = rt_num_entries(radixtree);
+
+ /* Check that rt_num_entries works */
+ if ((nbefore - ndeleted) != nafter)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT "after " UINT64_FORMAT " deletion",
+ nafter, (nbefore - ndeleted), ndeleted);
+
+ MemoryContextDelete(radixtree_ctx);
+}
+
+Datum
+test_radixtree(PG_FUNCTION_ARGS)
+{
+ test_empty();
+
+ for (int shift = 0; shift <= (64 - 8); shift += 8)
+ test_node_types(shift);
+
+ /* Test different test patterns, with lots of entries */
+ for (int i = 0; i < lengthof(test_specs); i++)
+ test_pattern(&test_specs[i]);
+
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_radixtree/test_radixtree.control b/src/test/modules/test_radixtree/test_radixtree.control
new file mode 100644
index 0000000000..e53f2a3e0c
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.control
@@ -0,0 +1,4 @@
+comment = 'Test code for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/test_radixtree'
+relocatable = true
--
2.31.1
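As a reading aid for the radixtree.h interface added above, here is a minimal
usage sketch. It is not part of the patch; it only uses the functions declared
in the header and assumes a caller running in a suitable memory context
(CurrentMemoryContext here):

    radix_tree *tree = rt_create(CurrentMemoryContext);
    rt_iter    *iter;
    uint64      key;
    uint64      val;

    /* rt_set() returns false for a brand-new key, true when it overwrites */
    if (rt_set(tree, 42, 4200))
        elog(ERROR, "key 42 unexpectedly existed");

    if (rt_search(tree, 42, &val))
        elog(NOTICE, "found value " UINT64_FORMAT, val);

    /* walk all key-value pairs */
    iter = rt_begin_iterate(tree);
    while (rt_iterate_next(iter, &key, &val))
        elog(NOTICE, UINT64_FORMAT " -> " UINT64_FORMAT, key, val);
    rt_end_iterate(iter);

    rt_delete(tree, 42);
    rt_free(tree);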
Attachment: v10-0001-introduce-vector8_min-and-vector8_highbit_mask.patch
From 9fd128f027302de19075942180b749ebd184007b Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 24 Oct 2022 14:07:09 +0900
Subject: [PATCH v10 1/7] introduce vector8_min and vector8_highbit_mask
---
src/include/port/simd.h | 47 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 47 insertions(+)
diff --git a/src/include/port/simd.h b/src/include/port/simd.h
index 61ae4ecf60..0b288c422a 100644
--- a/src/include/port/simd.h
+++ b/src/include/port/simd.h
@@ -77,6 +77,7 @@ static inline bool vector8_has(const Vector8 v, const uint8 c);
static inline bool vector8_has_zero(const Vector8 v);
static inline bool vector8_has_le(const Vector8 v, const uint8 c);
static inline bool vector8_is_highbit_set(const Vector8 v);
+static inline uint32 vector8_highbit_mask(const Vector8 v);
#ifndef USE_NO_SIMD
static inline bool vector32_is_highbit_set(const Vector32 v);
#endif
@@ -96,6 +97,7 @@ static inline Vector8 vector8_ssub(const Vector8 v1, const Vector8 v2);
*/
#ifndef USE_NO_SIMD
static inline Vector8 vector8_eq(const Vector8 v1, const Vector8 v2);
+static inline Vector8 vector8_min(const Vector8 v1, const Vector8 v2);
static inline Vector32 vector32_eq(const Vector32 v1, const Vector32 v2);
#endif
@@ -277,6 +279,36 @@ vector8_is_highbit_set(const Vector8 v)
#endif
}
+/*
+ * Return the bitmask of the high bit of each element.
+ */
+static inline uint32
+vector8_highbit_mask(const Vector8 v)
+{
+#ifdef USE_SSE2
+ return (uint32) _mm_movemask_epi8(v);
+#elif defined(USE_NEON)
+ static const uint8 mask[16] = {
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ };
+
+ uint8x16_t masked = vandq_u8(vld1q_u8(mask), (uint8x16_t) vshrq_n_s8(v, 7));
+ uint8x16_t maskedhi = vextq_u8(masked, masked, 8);
+
+ return (uint32) vaddvq_u16((uint16x8_t) vzip1q_u8(masked, maskedhi));
+#else
+ uint32 mask = 0;
+
+ for (Size i = 0; i < sizeof(Vector8); i++)
+ mask |= (((const uint8 *) &v)[i] >> 7) << i;
+
+ return mask;
+#endif
+}
+
/*
* Exactly like vector8_is_highbit_set except for the input type, so it
* looks at each byte separately.
@@ -372,4 +404,19 @@ vector32_eq(const Vector32 v1, const Vector32 v2)
}
#endif /* ! USE_NO_SIMD */
+/*
+ * Compare the given vectors and return the vector of minimum elements.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_min(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+ return _mm_min_epu8(v1, v2);
+#elif defined(USE_NEON)
+ return vminq_u8(v1, v2);
+#endif
+}
+#endif /* ! USE_NO_SIMD */
+
#endif /* SIMD_H */
--
2.31.1
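For context on how these helpers can be put to work in the node search paths
discussed in this thread, here is a rough sketch (not taken from any attached
patch; the function name and the padded-array assumption are mine) of finding
a chunk's position with vector8_eq() plus the new vector8_highbit_mask(). It
requires a SIMD build, since vector8_eq() is not provided under USE_NO_SIMD,
and it assumes the chunk array is padded out to a multiple of sizeof(Vector8):

    /* returns the index of 'chunk' among the first 'count' slots, or -1 */
    static inline int
    example_chunk_search_eq(const uint8 *chunks, int count, uint8 chunk)
    {
        Vector8     spread = vector8_broadcast(chunk);

        for (int i = 0; i < count; i += sizeof(Vector8))
        {
            Vector8     haystack;
            uint32      bitfield;

            vector8_load(&haystack, &chunks[i]);
            bitfield = vector8_highbit_mask(vector8_eq(spread, haystack));

            /* in the final, partial vector, ignore matches beyond 'count' */
            if (count - i < (int) sizeof(Vector8))
                bitfield &= ((uint32) 1 << (count - i)) - 1;
            if (bitfield)
                return i + pg_rightmost_one_pos32(bitfield);
        }

        return -1;
    }

(pg_rightmost_one_pos32() is from port/pg_bitutils.h.)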
On Mon, Nov 21, 2022 at 4:20 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Fri, Nov 18, 2022 at 2:48 PM I wrote:
One issue with this patch: The "fanout" member is a uint8, so it can't hold 256 for the largest node kind. That's not an issue in practice, since we never need to grow it, and we only compare that value with the count in an Assert(), so I just set it to zero. That does break an invariant, so it's not great. We could use 2 bytes to be strictly correct in all cases, but that limits what we can do with the smallest node kind.
Thinking about this part, there's an easy resolution -- use a different macro for fixed- and variable-sized node kinds to determine if there is a free slot.
Also, I wanted to share some results of adjusting the boundary between the two smallest node kinds. In the hackish attached patch, I modified the fixed height search benchmark to search a small (within L1 cache) tree thousands of times. For the first set I modified node4's maximum fanout and filled it up. For the second, I set node4's fanout to 1, which causes 2+ to spill to node32 (actually the partially-filled node15 size class as demoed earlier).
node4:
NOTICE: num_keys = 16, height = 3, n4 = 15, n15 = 0, n32 = 0, n128 = 0, n256 = 0
fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+-------+------------------+------------+--------------
2 | 16 | 16520 | 0 | 3

NOTICE: num_keys = 81, height = 3, n4 = 40, n15 = 0, n32 = 0, n128 = 0, n256 = 0
fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+-------+------------------+------------+--------------
3 | 81 | 16456 | 0 | 17

NOTICE: num_keys = 256, height = 3, n4 = 85, n15 = 0, n32 = 0, n128 = 0, n256 = 0
fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+-------+------------------+------------+--------------
4 | 256 | 16456 | 0 | 89

NOTICE: num_keys = 625, height = 3, n4 = 156, n15 = 0, n32 = 0, n128 = 0, n256 = 0
fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+-------+------------------+------------+--------------
5 | 625 | 16488 | 0 | 327

node32:
NOTICE: num_keys = 16, height = 3, n4 = 0, n15 = 15, n32 = 0, n128 = 0, n256 = 0
fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+-------+------------------+------------+--------------
2 | 16 | 16488 | 0 | 5
(1 row)

NOTICE: num_keys = 81, height = 3, n4 = 0, n15 = 40, n32 = 0, n128 = 0, n256 = 0
fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+-------+------------------+------------+--------------
3 | 81 | 16520 | 0 | 28

NOTICE: num_keys = 256, height = 3, n4 = 0, n15 = 85, n32 = 0, n128 = 0, n256 = 0
fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+-------+------------------+------------+--------------
4 | 256 | 16408 | 0 | 79

NOTICE: num_keys = 625, height = 3, n4 = 0, n15 = 156, n32 = 0, n128 = 0, n256 = 0
fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+-------+------------------+------------+--------------
5 | 625 | 24616 | 0 | 199

In this test, node32 seems slightly faster than node4 with 4 elements, at the cost of more memory.
Assuming the smallest node is fixed size (i.e. fanout/capacity member not part of the common set, so only part of variable-sized nodes), 3 has a nice property: no wasted padding space:
node4: 5 + 4+(7) + 4*8 = 48 bytes
node3: 5 + 3 + 3*8 = 32
IIUC if we store the fanout member only in variable-sized nodes,
rt_node has only count, shift, and chunk, so 4 bytes in total. If so,
the size of node3 (ie. fixed-sized node) is (4 + 3 + (1) + 3*8)? The
size doesn't change but there is 1 byte padding space.
Also, even if we have the node3 a variable-sized node, size class 1
for node3 could be a good choice since it also doesn't need padding
space and could be a good alternative to path compression.
node3 : 5 + 3 + 3*8 = 32 bytes
size class 1 : 5 + 3 + 1*8 = 16 bytes
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
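To make the size arithmetic in the last two messages concrete, here is an
illustrative layout for a fixed-size node3 (the field names and the exact
composition of the 5-byte header are assumptions for illustration, not
definitions taken from the patches):

    typedef struct rt_node_inner_3
    {
        /* assumed 5-byte common header: count, shift, chunk, fanout */
        uint16      count;
        uint8       shift;
        uint8       chunk;
        uint8       fanout;

        uint8       chunks[3];          /* 5 + 3 = 8 bytes, so ... */
        struct rt_node *children[3];    /* ... the pointers start aligned */
    } rt_node_inner_3;                  /* 32 bytes on a 64-bit platform */

With 4 chunk slots instead of 3, the compiler would need 7 bytes of padding
before children[], which is where the 48-byte figure for node4 comes from.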
On Mon, Nov 21, 2022 at 3:43 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
On Mon, Nov 21, 2022 at 4:20 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
Assuming the smallest node is fixed size (i.e. fanout/capacity member
not part of the common set, so only part of variable-sized nodes), 3 has a
nice property: no wasted padding space:
node4: 5 + 4+(7) + 4*8 = 48 bytes
node3: 5 + 3 + 3*8 = 32

IIUC if we store the fanout member only in variable-sized nodes,
rt_node has only count, shift, and chunk, so 4 bytes in total. If so,
the size of node3 (ie. fixed-sized node) is (4 + 3 + (1) + 3*8)? The
size doesn't change but there is 1 byte padding space.
I forgot to mention I'm assuming no pointer-tagging for this exercise.
You've demonstrated it can be done in a small amount of code, and I hope we
can demonstrate a speedup in search. Just in case there is some issue with
portability, valgrind, or some other obstacle, I'm being pessimistic in my
calculations.
Also, even if we have the node3 a variable-sized node, size class 1
for node3 could be a good choice since it also doesn't need padding
space and could be a good alternative to path compression.
node3 : 5 + 3 + 3*8 = 32 bytes
size class 1 : 5 + 3 + 1*8 = 16 bytes
Precisely! I have that scenario in my notes as well -- it's quite
compelling.
--
John Naylor
EDB: http://www.enterprisedb.com
On 2022-11-21 17:06:56 +0900, Masahiko Sawada wrote:
Sure. I've attached the v10 patches. 0004 is the pure refactoring
patch and 0005 patch introduces the pointer tagging.
This failed on cfbot, with so many crashes that the VM ran out of disk for
core dumps. It happened during testing with 32-bit builds, so there's probably
something broken around that.
https://cirrus-ci.com/task/4635135954386944
A failure is e.g. at: https://api.cirrus-ci.com/v1/artifact/task/4635135954386944/testrun/build-32/testrun/adminpack/regress/log/initdb.log
performing post-bootstrap initialization ... ../src/backend/lib/radixtree.c:1696:21: runtime error: member access within misaligned address 0x590faf74 for type 'struct radix_tree_control', which requires 8 byte alignment
0x590faf74: note: pointer points here
90 11 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
^
==55813==Using libbacktrace symbolizer.
#0 0x56dcc274 in rt_create ../src/backend/lib/radixtree.c:1696
#1 0x56953d1b in tidstore_create ../src/backend/access/common/tidstore.c:57
#2 0x56a1ca4f in dead_items_alloc ../src/backend/access/heap/vacuumlazy.c:3109
#3 0x56a2219f in heap_vacuum_rel ../src/backend/access/heap/vacuumlazy.c:539
#4 0x56cb77ed in table_relation_vacuum ../src/include/access/tableam.h:1681
#5 0x56cb77ed in vacuum_rel ../src/backend/commands/vacuum.c:2062
#6 0x56cb9a16 in vacuum ../src/backend/commands/vacuum.c:472
#7 0x56cba904 in ExecVacuum ../src/backend/commands/vacuum.c:272
#8 0x5711b6d0 in standard_ProcessUtility ../src/backend/tcop/utility.c:866
#9 0x5711bdeb in ProcessUtility ../src/backend/tcop/utility.c:530
#10 0x5711759f in PortalRunUtility ../src/backend/tcop/pquery.c:1158
#11 0x57117cb8 in PortalRunMulti ../src/backend/tcop/pquery.c:1315
#12 0x571183d2 in PortalRun ../src/backend/tcop/pquery.c:791
#13 0x57111049 in exec_simple_query ../src/backend/tcop/postgres.c:1238
#14 0x57113f9c in PostgresMain ../src/backend/tcop/postgres.c:4551
#15 0x5711463d in PostgresSingleUserMain ../src/backend/tcop/postgres.c:4028
#16 0x56df4672 in main ../src/backend/main/main.c:197
#17 0xf6ad8e45 in __libc_start_main (/lib/i386-linux-gnu/libc.so.6+0x1ae45)
#18 0x5691d0f0 in _start (/tmp/cirrus-ci-build/build-32/tmp_install/usr/local/pgsql/bin/postgres+0x3040f0)
Aborted (core dumped)
child process exited with exit code 134
initdb: data directory "/tmp/cirrus-ci-build/build-32/testrun/adminpack/regress/tmp_check/data" not removed at user's request
On Mon, Nov 21, 2022 at 6:30 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Mon, Nov 21, 2022 at 3:43 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Mon, Nov 21, 2022 at 4:20 PM John Naylor
<john.naylor@enterprisedb.com> wrote:

Assuming the smallest node is fixed size (i.e. fanout/capacity member not part of the common set, so only part of variable-sized nodes), 3 has a nice property: no wasted padding space:
node4: 5 + 4+(7) + 4*8 = 48 bytes
node3: 5 + 3 + 3*8 = 32

IIUC if we store the fanout member only in variable-sized nodes,
rt_node has only count, shift, and chunk, so 4 bytes in total. If so,
the size of node3 (ie. fixed-sized node) is (4 + 3 + (1) + 3*8)? The
size doesn't change but there is 1 byte padding space.

I forgot to mention I'm assuming no pointer-tagging for this exercise. You've demonstrated it can be done in a small amount of code, and I hope we can demonstrate a speedup in search. Just in case there is some issue with portability, valgrind, or some other obstacle, I'm being pessimistic in my calculations.
Also, even if we have the node3 a variable-sized node, size class 1
for node3 could be a good choice since it also doesn't need padding
space and could be a good alternative to path compression.
node3 : 5 + 3 + 3*8 = 32 bytes
size class 1 : 5 + 3 + 1*8 = 16 bytes

Precisely! I have that scenario in my notes as well -- it's quite compelling.
So it seems that there are two candidates for the rt_node structure: (1)
all nodes except for node256 are variable-size nodes and use pointer
tagging, and (2) node32 and node128 are variable-sized nodes and do
not use pointer tagging (the fanout member is part of only these two
nodes). rt_node can be 5 bytes in both cases. But before going that
far, I started to verify the idea of variable-size nodes by using a
6-byte rt_node. We can adjust the node kinds and node classes later.
In this verification, all nodes except for node256 are variable-sized
nodes, and the sizes are:
radix tree node 1 : 6 + 4 + (6) + 1*8 = 24 bytes
radix tree node 4 : 6 + 4 + (6) + 4*8 = 48
radix tree node 15 : 6 + 32 + (2) + 15*8 = 160
radix tree node 32 : 6 + 32 + (2) + 32*8 = 296
radix tree node 61 : inner 6 + 256 + (2) + 61*8 = 752, leaf 6 + 256 + (2) + 16 + 61*8 = 768
radix tree node 128 : inner 6 + 256 + (2) + 128*8 = 1288, leaf 6 + 256 + (2) + 16 + 128*8 = 1304
radix tree node 256 : inner 6 + (2) + 256*8 = 2056, leaf 6 + (2) + 32 + 256*8 = 2088
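To make the arithmetic above concrete, a smaller size class simply
reuses the struct of its kind and is allocated short; only the trailing
children/values array shrinks, so search and insert code keyed on the
kind works unchanged. An illustrative sketch on top of the
rt_node_inner_32 definition from the attached radixtree.c (the macro
names here are hypothetical, not taken from the patches):

/*
 * "radix tree node 15" = the node-32 kind with room for only 15 children:
 * 6 (header) + 32 (chunks) + 2 (padding) + 15 * 8 (children) = 160 bytes.
 */
#define RT_CLASS_32_PARTIAL_FANOUT  15

#define RT_CLASS_32_PARTIAL_INNER_SIZE \
    (offsetof(rt_node_inner_32, children) + \
     RT_CLASS_32_PARTIAL_FANOUT * sizeof(rt_node *))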
I did some performance tests against two radix trees: a radix tree
supporting only fixed-size nodes (i.e. applying patches up to 0003),
and a radix tree supporting variable-size nodes (i.e. applying all the
attached patches). Also, I changed the bench_search_random_nodes()
function so that the filter can be specified via a function argument.
Here are the results:
* Query
select * from bench_seq_search(0, 1*1000*1000, false)
* Fixed-size
NOTICE: num_keys = 1000000, height = 2, n4 = 0, n32 = 31251, n128 = 1, n256 = 122
  nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
 1000000 |          9871216 |                     |         67 |               |          212 |
(1 row)
* Variable-size
NOTICE: num_keys = 1000000, height = 2, n1 = 0, n4 = 0, n15 = 0, n32 = 31251, n61 = 0, n128 = 1, n256 = 122
  nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
 1000000 |          9871280 |                     |         74 |               |          212 |
(1 row)
---
* Query
select * from bench_seq_search(0, 2*1000*1000, true)
NOTICE: num_keys = 999654, height = 2, n4 = 1, n32 = 62499, n128 = 1, n256 = 245
* Fixed-size
 nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
 999654 |         19680848 |                     |         74 |               |          201 |
(1 row)
* Variable-size
NOTICE: num_keys = 999654, height = 2, n1 = 0, n4 = 1, n15 = 26951, n32 = 35548, n61 = 1, n128 = 0, n256 = 245
 nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
 999654 |         16009040 |                     |         85 |               |          201 |
(1 row)
---
* Query
select * from bench_search_random_nodes(10 * 1000 * 1000, '0x7F07FF00FF')
* Fixed-size
NOTICE: num_keys = 9291812, height = 4, n4 = 262144, n32 = 79603, n128 = 182670, n256 = 1024
mem_allocated | search_ms
---------------+-----------
343001456 | 1151
(1 row)
* Variable-size
NOTICE: num_keys = 9291812, height = 4, n1 = 262144, n4 = 0, n15 = 138, n32 = 79465, n61 = 182665, n128 = 5, n256 = 1024
mem_allocated | search_ms
---------------+-----------
230504328 | 1077
(1 row)
---
* Query
select * from bench_search_random_nodes(10 * 1000 * 1000, '0xFFFF0000003F')
* Fixed-size
NOTICE: num_keys = 3807650, height = 5, n4 = 196608, n32 = 0, n128 = 65536, n256 = 257
mem_allocated | search_ms
---------------+-----------
99911920 | 632
(1 row)
* Variable-size
NOTICE: num_keys = 3807650, height = 5, n1 = 196608, n4 = 0, n15 = 0, n32 = 0, n61 = 61747, n128 = 3789, n256 = 257
mem_allocated | search_ms
---------------+-----------
64045688 | 554
(1 row)
Overall, the idea of variable-sized nodes looks good: a smaller memory
footprint without losing search performance. I'm going to check the
load performance as well.
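For reference, growing within a kind is then only a matter of a larger
allocation and a copy, since the kind (and therefore every search/insert
code path) stays the same. A rough sketch, not taken from the attached
patches (rt_grow_node_class is a hypothetical name):

static rt_node *
rt_grow_node_class(radix_tree *tree, rt_node *node,
                   rt_size_class old_class, rt_size_class new_class)
{
    bool        inner = !NODE_IS_LEAF(node);
    Size        old_size = inner ? rt_size_class_info[old_class].inner_size
                                 : rt_size_class_info[old_class].leaf_size;
    rt_node    *newnode = rt_alloc_node(tree, new_class, inner);

    /* same kind, so the used prefix of the layout is identical */
    memcpy(newnode, node, old_size);
    newnode->fanout = rt_size_class_info[new_class].fanout;

    /* the caller replaces the pointer in the parent and frees the old node */
    return newnode;
}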
I've attached the patches I used for the verification. I don't include
patches for pointer tagging, DSA support, and vacuum integration since
I'm investigating the issue on cfbot that Andres reported. Also, I've
modified tests to improve the test coverage.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachments:
v11-0004-Preparatory-refactoring-for-decoupling-kind-from.patch
From 9b8d423d8a1969b698dcd07bbfd1e309e86bddd2 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Thu, 17 Nov 2022 12:10:31 +0700
Subject: [PATCH v11 4/6] Preparatory refactoring for decoupling kind from size
class
Rename the current kind info array to refer to size classes, but
keep all the contents the same.
Add a fanout member to all nodes which stores the max capacity of
the node. This is currently set with the same hardcoded value as
in the kind info array.
In passing, remove outdated reference to node16 in the regression
test.
---
src/backend/lib/radixtree.c | 196 +++++++++++++++++++++---------------
1 file changed, 117 insertions(+), 79 deletions(-)
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
index cc1a629fed..b71545e031 100644
--- a/src/backend/lib/radixtree.c
+++ b/src/backend/lib/radixtree.c
@@ -131,6 +131,16 @@ typedef enum
#define RT_NODE_KIND_256 0x03
#define RT_NODE_KIND_COUNT 4
+typedef enum rt_size_class
+{
+ RT_CLASS_4_FULL = 0,
+ RT_CLASS_32_FULL,
+ RT_CLASS_128_FULL,
+ RT_CLASS_256
+
+#define RT_SIZE_CLASS_COUNT (RT_CLASS_256 + 1)
+} rt_size_class;
+
/* Common type for all nodes types */
typedef struct rt_node
{
@@ -140,6 +150,9 @@ typedef struct rt_node
*/
uint16 count;
+ /* Max number of children. We can use uint8 because we never need to store 256 */
+ uint8 fanout;
+
/*
* Shift indicates which part of the key space is represented by this
* node. That is, the key is shifted by 'shift' and the lowest
@@ -148,13 +161,13 @@ typedef struct rt_node
uint8 shift;
uint8 chunk;
- /* Size kind of the node */
+ /* Node kind, one per search/set algorithm */
uint8 kind;
} rt_node;
#define NODE_IS_LEAF(n) (((rt_node *) (n))->shift == 0)
#define NODE_IS_EMPTY(n) (((rt_node *) (n))->count == 0)
-#define NODE_HAS_FREE_SLOT(n) \
- (((rt_node *) (n))->count < rt_node_kind_info[((rt_node *) (n))->kind].fanout)
+#define NODE_HAS_FREE_SLOT(node) \
+ ((node)->base.n.count < (node)->base.n.fanout)
/* Base type of each node kinds for leaf and inner nodes */
typedef struct rt_node_base_4
@@ -194,7 +207,7 @@ typedef struct rt_node_base256
/*
* Inner and leaf nodes.
*
- * There are separate from inner node size classes for two main reasons:
+ * These are separate for two main reasons:
*
* 1) the value type might be different than something fitting into a pointer
* width type
@@ -278,8 +291,8 @@ typedef struct rt_node_leaf_256
uint64 values[RT_NODE_MAX_SLOTS];
} rt_node_leaf_256;
-/* Information of each size kinds */
-typedef struct rt_node_kind_info_elem
+/* Information for each size class */
+typedef struct rt_size_class_elem
{
const char *name;
int fanout;
@@ -291,7 +304,7 @@ typedef struct rt_node_kind_info_elem
/* slab block size */
Size inner_blocksize;
Size leaf_blocksize;
-} rt_node_kind_info_elem;
+} rt_size_class_elem;
/*
* Calculate the slab blocksize so that we can allocate at least 32 chunks
@@ -299,9 +312,9 @@ typedef struct rt_node_kind_info_elem
*/
#define NODE_SLAB_BLOCK_SIZE(size) \
Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * size, (size) * 32)
-static rt_node_kind_info_elem rt_node_kind_info[RT_NODE_KIND_COUNT] = {
+static rt_size_class_elem rt_size_class_info[RT_SIZE_CLASS_COUNT] = {
- [RT_NODE_KIND_4] = {
+ [RT_CLASS_4_FULL] = {
.name = "radix tree node 4",
.fanout = 4,
.inner_size = sizeof(rt_node_inner_4),
@@ -309,7 +322,7 @@ static rt_node_kind_info_elem rt_node_kind_info[RT_NODE_KIND_COUNT] = {
.inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_4)),
.leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_4)),
},
- [RT_NODE_KIND_32] = {
+ [RT_CLASS_32_FULL] = {
.name = "radix tree node 32",
.fanout = 32,
.inner_size = sizeof(rt_node_inner_32),
@@ -317,7 +330,7 @@ static rt_node_kind_info_elem rt_node_kind_info[RT_NODE_KIND_COUNT] = {
.inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_32)),
.leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_32)),
},
- [RT_NODE_KIND_128] = {
+ [RT_CLASS_128_FULL] = {
.name = "radix tree node 128",
.fanout = 128,
.inner_size = sizeof(rt_node_inner_128),
@@ -325,9 +338,11 @@ static rt_node_kind_info_elem rt_node_kind_info[RT_NODE_KIND_COUNT] = {
.inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_128)),
.leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_128)),
},
- [RT_NODE_KIND_256] = {
+ [RT_CLASS_256] = {
.name = "radix tree node 256",
- .fanout = 256,
+ /* technically it's 256, but we can't store that in a uint8,
+ and this is the max size class so it will never grow */
+ .fanout = 0,
.inner_size = sizeof(rt_node_inner_256),
.leaf_size = sizeof(rt_node_leaf_256),
.inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_256)),
@@ -335,6 +350,14 @@ static rt_node_kind_info_elem rt_node_kind_info[RT_NODE_KIND_COUNT] = {
},
};
+/* Map from the node kind to its minimum size class */
+static rt_size_class kind_min_size_class[RT_NODE_KIND_COUNT] = {
+ [RT_NODE_KIND_4] = RT_CLASS_4_FULL,
+ [RT_NODE_KIND_32] = RT_CLASS_32_FULL,
+ [RT_NODE_KIND_128] = RT_CLASS_128_FULL,
+ [RT_NODE_KIND_256] = RT_CLASS_256,
+};
+
/*
* Iteration support.
*
@@ -376,21 +399,21 @@ struct radix_tree
uint64 max_val;
uint64 num_keys;
- MemoryContextData *inner_slabs[RT_NODE_KIND_COUNT];
- MemoryContextData *leaf_slabs[RT_NODE_KIND_COUNT];
+ MemoryContextData *inner_slabs[RT_SIZE_CLASS_COUNT];
+ MemoryContextData *leaf_slabs[RT_SIZE_CLASS_COUNT];
/* statistics */
#ifdef RT_DEBUG
- int32 cnt[RT_NODE_KIND_COUNT];
+ int32 cnt[RT_SIZE_CLASS_COUNT];
#endif
};
static void rt_new_root(radix_tree *tree, uint64 key);
-static rt_node * rt_alloc_init_node(radix_tree *tree, uint8 kind, uint8 shift,
- uint8 chunk, bool inner);
-static inline void rt_init_node(rt_node *node, uint8 kind, uint8 shift, uint8 chunk,
- bool inner);
-static rt_node *rt_alloc_node(radix_tree *tree, int kind, bool inner);
+static rt_node * rt_alloc_init_node(radix_tree *tree, uint8 kind, rt_size_class size_class,
+ uint8 shift, uint8 chunk, bool inner);
+static inline void rt_init_node(rt_node *node, uint8 kind, rt_size_class size_class, uint8 shift,
+ uint8 chunk, bool inner);
+static rt_node *rt_alloc_node(radix_tree *tree, rt_size_class size_class, bool inner);
static void rt_free_node(radix_tree *tree, rt_node *node);
static void rt_extend(radix_tree *tree, uint64 key);
static inline bool rt_node_search_inner(rt_node *node, uint64 key, rt_action action,
@@ -591,7 +614,7 @@ chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
uint8 *dst_chunks, rt_node **dst_children, int count)
{
/* For better code generation */
- if (count > rt_node_kind_info[RT_NODE_KIND_4].fanout)
+ if (count > rt_size_class_info[RT_CLASS_4_FULL].fanout)
pg_unreachable();
memcpy(dst_chunks, src_chunks, sizeof(uint8) * count);
@@ -603,7 +626,7 @@ chunk_values_array_copy(uint8 *src_chunks, uint64 *src_values,
uint8 *dst_chunks, uint64 *dst_values, int count)
{
/* For better code generation */
- if (count > rt_node_kind_info[RT_NODE_KIND_4].fanout)
+ if (count > rt_size_class_info[RT_CLASS_4_FULL].fanout)
pg_unreachable();
memcpy(dst_chunks, src_chunks, sizeof(uint8) * count);
@@ -844,20 +867,21 @@ rt_new_root(radix_tree *tree, uint64 key)
int shift = key_get_shift(key);
rt_node *node;
- node = (rt_node *) rt_alloc_init_node(tree, RT_NODE_KIND_4, shift, 0,
- shift > 0);
+ node = (rt_node *) rt_alloc_init_node(tree, RT_NODE_KIND_4, RT_CLASS_4_FULL,
+ shift, 0, shift > 0);
tree->max_val = shift_get_max_val(shift);
tree->root = node;
}
/* Return a new and initialized node */
static rt_node *
-rt_alloc_init_node(radix_tree *tree, uint8 kind, uint8 shift, uint8 chunk, bool inner)
+rt_alloc_init_node(radix_tree *tree, uint8 kind, rt_size_class size_class, uint8 shift,
+ uint8 chunk, bool inner)
{
rt_node *newnode;
- newnode = rt_alloc_node(tree, kind, inner);
- rt_init_node(newnode, kind, shift, chunk, inner);
+ newnode = rt_alloc_node(tree, size_class, inner);
+ rt_init_node(newnode, kind, size_class, shift, chunk, inner);
return newnode;
}
@@ -866,20 +890,20 @@ rt_alloc_init_node(radix_tree *tree, uint8 kind, uint8 shift, uint8 chunk, bool
* Allocate a new node with the given node kind.
*/
static rt_node *
-rt_alloc_node(radix_tree *tree, int kind, bool inner)
+rt_alloc_node(radix_tree *tree, rt_size_class size_class, bool inner)
{
rt_node *newnode;
if (inner)
- newnode = (rt_node *) MemoryContextAllocZero(tree->inner_slabs[kind],
- rt_node_kind_info[kind].inner_size);
+ newnode = (rt_node *) MemoryContextAllocZero(tree->inner_slabs[size_class],
+ rt_size_class_info[size_class].inner_size);
else
- newnode = (rt_node *) MemoryContextAllocZero(tree->leaf_slabs[kind],
- rt_node_kind_info[kind].leaf_size);
+ newnode = (rt_node *) MemoryContextAllocZero(tree->leaf_slabs[size_class],
+ rt_size_class_info[size_class].leaf_size);
#ifdef RT_DEBUG
/* update the statistics */
- tree->cnt[kind]++;
+ tree->cnt[size_class]++;
#endif
return newnode;
@@ -887,14 +911,16 @@ rt_alloc_node(radix_tree *tree, int kind, bool inner)
/* Initialize the node contents */
static inline void
-rt_init_node(rt_node *node, uint8 kind, uint8 shift, uint8 chunk, bool inner)
+rt_init_node(rt_node *node, uint8 kind, rt_size_class size_class, uint8 shift, uint8 chunk,
+ bool inner)
{
if (inner)
- MemSet(node, 0, rt_node_kind_info[kind].inner_size);
+ MemSet(node, 0, rt_size_class_info[size_class].inner_size);
else
- MemSet(node, 0, rt_node_kind_info[kind].leaf_size);
+ MemSet(node, 0, rt_size_class_info[size_class].leaf_size);
node->kind = kind;
+ node->fanout = rt_size_class_info[size_class].fanout;
node->shift = shift;
node->chunk = chunk;
node->count = 0;
@@ -912,13 +938,13 @@ rt_init_node(rt_node *node, uint8 kind, uint8 shift, uint8 chunk, bool inner)
* Create a new node with 'new_kind' and the same shift, chunk, and
* count of 'node'.
*/
-static rt_node *
-rt_grow_node(radix_tree *tree, rt_node *node, int new_kind)
+static rt_node*
+rt_grow_node_kind(radix_tree *tree, rt_node *node, uint8 new_kind)
{
- rt_node *newnode;
+ rt_node *newnode;
- newnode = rt_alloc_init_node(tree, new_kind, node->shift, node->chunk,
- node->shift > 0);
+ newnode = rt_alloc_init_node(tree, new_kind, kind_min_size_class[new_kind],
+ node->shift, node->chunk, !NODE_IS_LEAF(node));
newnode->count = node->count;
return newnode;
@@ -928,6 +954,8 @@ rt_grow_node(radix_tree *tree, rt_node *node, int new_kind)
static void
rt_free_node(radix_tree *tree, rt_node *node)
{
+ int i;
+
/* If we're deleting the root node, make the tree empty */
if (tree->root == node)
{
@@ -937,8 +965,14 @@ rt_free_node(radix_tree *tree, rt_node *node)
#ifdef RT_DEBUG
/* update the statistics */
- tree->cnt[node->kind]--;
- Assert(tree->cnt[node->kind] >= 0);
+ // FIXME
+ for (i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ if (node->fanout == rt_size_class_info[i].fanout)
+ break;
+ }
+ tree->cnt[i]--;
+ Assert(tree->cnt[i] >= 0);
#endif
pfree(node);
@@ -987,7 +1021,7 @@ rt_extend(radix_tree *tree, uint64 key)
{
rt_node_inner_4 *node;
- node = (rt_node_inner_4 *) rt_alloc_init_node(tree, RT_NODE_KIND_4,
+ node = (rt_node_inner_4 *) rt_alloc_init_node(tree, RT_NODE_KIND_4, RT_CLASS_4_FULL,
shift, 0, true);
node->base.n.count = 1;
node->base.chunks[0] = 0;
@@ -1017,7 +1051,7 @@ rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node *parent,
rt_node *newchild;
int newshift = shift - RT_NODE_SPAN;
- newchild = rt_alloc_init_node(tree, RT_NODE_KIND_4, newshift,
+ newchild = rt_alloc_init_node(tree, RT_NODE_KIND_4, RT_CLASS_4_FULL, newshift,
RT_GET_KEY_CHUNK(key, node->shift),
newshift > 0);
rt_node_insert_inner(tree, parent, node, key, newchild);
@@ -1248,8 +1282,8 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
rt_node_inner_32 *new32;
/* grow node from 4 to 32 */
- new32 = (rt_node_inner_32 *) rt_grow_node(tree, (rt_node *) n4,
- RT_NODE_KIND_32);
+ new32 = (rt_node_inner_32 *) rt_grow_node_kind(tree, (rt_node *) n4,
+ RT_NODE_KIND_32);
chunk_children_array_copy(n4->base.chunks, n4->children,
new32->base.chunks, new32->children,
n4->base.n.count);
@@ -1294,8 +1328,8 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
rt_node_inner_128 *new128;
/* grow node from 32 to 128 */
- new128 = (rt_node_inner_128 *) rt_grow_node(tree, (rt_node *) n32,
- RT_NODE_KIND_128);
+ new128 = (rt_node_inner_128 *) rt_grow_node_kind(tree, (rt_node *) n32,
+ RT_NODE_KIND_128);
for (int i = 0; i < n32->base.n.count; i++)
node_inner_128_insert(new128, n32->base.chunks[i], n32->children[i]);
@@ -1337,8 +1371,8 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
rt_node_inner_256 *new256;
/* grow node from 128 to 256 */
- new256 = (rt_node_inner_256 *) rt_grow_node(tree, (rt_node *) n128,
- RT_NODE_KIND_256);
+ new256 = (rt_node_inner_256 *) rt_grow_node_kind(tree, (rt_node *) n128,
+ RT_NODE_KIND_256);
for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n128->base.n.count; i++)
{
if (!node_128_is_chunk_used((rt_node_base_128 *) n128, i))
@@ -1365,7 +1399,8 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
chunk_exists = node_inner_256_is_chunk_used(n256, chunk);
- Assert(chunk_exists || NODE_HAS_FREE_SLOT(n256));
+ Assert(n256->base.n.fanout == 0);
+ Assert(chunk_exists || ((rt_node *) n256)->count < RT_NODE_MAX_SLOTS);
node_inner_256_set(n256, chunk, child);
break;
@@ -1416,8 +1451,8 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
rt_node_leaf_32 *new32;
/* grow node from 4 to 32 */
- new32 = (rt_node_leaf_32 *) rt_grow_node(tree, (rt_node *) n4,
- RT_NODE_KIND_32);
+ new32 = (rt_node_leaf_32 *) rt_grow_node_kind(tree, (rt_node *) n4,
+ RT_NODE_KIND_32);
chunk_values_array_copy(n4->base.chunks, n4->values,
new32->base.chunks, new32->values,
n4->base.n.count);
@@ -1462,8 +1497,8 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
rt_node_leaf_128 *new128;
/* grow node from 32 to 128 */
- new128 = (rt_node_leaf_128 *) rt_grow_node(tree, (rt_node *) n32,
- RT_NODE_KIND_128);
+ new128 = (rt_node_leaf_128 *) rt_grow_node_kind(tree, (rt_node *) n32,
+ RT_NODE_KIND_128);
for (int i = 0; i < n32->base.n.count; i++)
node_leaf_128_insert(new128, n32->base.chunks[i], n32->values[i]);
@@ -1505,7 +1540,7 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
rt_node_leaf_256 *new256;
/* grow node from 128 to 256 */
- new256 = (rt_node_leaf_256 *) rt_grow_node(tree, (rt_node *) n128,
+ new256 = (rt_node_leaf_256 *) rt_grow_node_kind(tree, (rt_node *) n128,
RT_NODE_KIND_256);
for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n128->base.n.count; i++)
{
@@ -1533,7 +1568,8 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
chunk_exists = node_leaf_256_is_chunk_used(n256, chunk);
- Assert(chunk_exists || NODE_HAS_FREE_SLOT(n256));
+ Assert(((rt_node *) n256)->fanout == 0);
+ Assert(chunk_exists || ((rt_node *) n256)->count < 256);
node_leaf_256_set(n256, chunk, value);
break;
@@ -1571,16 +1607,16 @@ rt_create(MemoryContext ctx)
tree->num_keys = 0;
/* Create the slab allocator for each size class */
- for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
{
tree->inner_slabs[i] = SlabContextCreate(ctx,
- rt_node_kind_info[i].name,
- rt_node_kind_info[i].inner_blocksize,
- rt_node_kind_info[i].inner_size);
+ rt_size_class_info[i].name,
+ rt_size_class_info[i].inner_blocksize,
+ rt_size_class_info[i].inner_size);
tree->leaf_slabs[i] = SlabContextCreate(ctx,
- rt_node_kind_info[i].name,
- rt_node_kind_info[i].leaf_blocksize,
- rt_node_kind_info[i].leaf_size);
+ rt_size_class_info[i].name,
+ rt_size_class_info[i].leaf_blocksize,
+ rt_size_class_info[i].leaf_size);
#ifdef RT_DEBUG
tree->cnt[i] = 0;
#endif
@@ -1597,7 +1633,7 @@ rt_create(MemoryContext ctx)
void
rt_free(radix_tree *tree)
{
- for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
{
MemoryContextDelete(tree->inner_slabs[i]);
MemoryContextDelete(tree->leaf_slabs[i]);
@@ -2099,7 +2135,7 @@ rt_memory_usage(radix_tree *tree)
{
Size total = sizeof(radix_tree);
- for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
{
total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
@@ -2189,10 +2225,10 @@ rt_stats(radix_tree *tree)
ereport(LOG, (errmsg("num_keys = %lu, height = %u, n4 = %u, n32 = %u, n128 = %u, n256 = %u",
tree->num_keys,
tree->root->shift / RT_NODE_SPAN,
- tree->cnt[0],
- tree->cnt[1],
- tree->cnt[2],
- tree->cnt[3])));
+ tree->cnt[RT_CLASS_4_FULL],
+ tree->cnt[RT_CLASS_32_FULL],
+ tree->cnt[RT_CLASS_128_FULL],
+ tree->cnt[RT_CLASS_256])));
}
static void
@@ -2200,11 +2236,12 @@ rt_dump_node(rt_node *node, int level, bool recurse)
{
char space[128] = {0};
- fprintf(stderr, "[%s] kind %d, count %u, shift %u, chunk 0x%X:\n",
+ fprintf(stderr, "[%s] kind %d, fanout %d, count %u, shift %u, chunk 0x%X:\n",
NODE_IS_LEAF(node) ? "LEAF" : "INNR",
(node->kind == RT_NODE_KIND_4) ? 4 :
(node->kind == RT_NODE_KIND_32) ? 32 :
(node->kind == RT_NODE_KIND_128) ? 128 : 256,
+ node->fanout == 0 ? 256 : node->fanout,
node->count, node->shift, node->chunk);
if (level > 0)
@@ -2408,13 +2445,14 @@ rt_dump_search(radix_tree *tree, uint64 key)
void
rt_dump(radix_tree *tree)
{
- for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
fprintf(stderr, "%s\tinner_size %zu\tinner_blocksize %zu\tleaf_size %zu\tleaf_blocksize %zu\n",
- rt_node_kind_info[i].name,
- rt_node_kind_info[i].inner_size,
- rt_node_kind_info[i].inner_blocksize,
- rt_node_kind_info[i].leaf_size,
- rt_node_kind_info[i].leaf_blocksize);
+ rt_size_class_info[i].name,
+ rt_size_class_info[i].inner_size,
+ rt_size_class_info[i].inner_blocksize,
+ rt_size_class_info[i].leaf_size,
+ rt_size_class_info[i].leaf_blocksize);
fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->max_val);
if (!tree->root)
--
2.31.1
v11-0003-tool-for-measuring-radix-tree-performance.patch
From 496f70836c2828ebca4cc025e933ae7355807292 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 16 Sep 2022 11:57:03 +0900
Subject: [PATCH v11 3/6] tool for measuring radix tree performance
---
contrib/bench_radix_tree/Makefile | 21 +
.../bench_radix_tree--1.0.sql | 65 ++
contrib/bench_radix_tree/bench_radix_tree.c | 554 ++++++++++++++++++
.../bench_radix_tree/bench_radix_tree.control | 6 +
contrib/bench_radix_tree/expected/bench.out | 13 +
contrib/bench_radix_tree/sql/bench.sql | 16 +
6 files changed, 675 insertions(+)
create mode 100644 contrib/bench_radix_tree/Makefile
create mode 100644 contrib/bench_radix_tree/bench_radix_tree--1.0.sql
create mode 100644 contrib/bench_radix_tree/bench_radix_tree.c
create mode 100644 contrib/bench_radix_tree/bench_radix_tree.control
create mode 100644 contrib/bench_radix_tree/expected/bench.out
create mode 100644 contrib/bench_radix_tree/sql/bench.sql
diff --git a/contrib/bench_radix_tree/Makefile b/contrib/bench_radix_tree/Makefile
new file mode 100644
index 0000000000..952bb0ceae
--- /dev/null
+++ b/contrib/bench_radix_tree/Makefile
@@ -0,0 +1,21 @@
+# contrib/bench_radix_tree/Makefile
+
+MODULE_big = bench_radix_tree
+OBJS = \
+ bench_radix_tree.o
+
+EXTENSION = bench_radix_tree
+DATA = bench_radix_tree--1.0.sql
+
+REGRESS = bench_fixed_height
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/bench_radix_tree
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
new file mode 100644
index 0000000000..67ba568531
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
@@ -0,0 +1,65 @@
+/* contrib/bench_radix_tree/bench_radix_tree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION bench_radix_tree" to load this file. \quit
+
+create function bench_shuffle_search(
+minblk int4,
+maxblk int4,
+random_block bool DEFAULT false,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT array_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT array_load_ms int8,
+OUT rt_search_ms int8,
+OUT array_serach_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_seq_search(
+minblk int4,
+maxblk int4,
+random_block bool DEFAULT false,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT array_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT array_load_ms int8,
+OUT rt_search_ms int8,
+OUT array_serach_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_load_random_int(
+cnt int8,
+OUT mem_allocated int8,
+OUT load_ms int8)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_search_random_nodes(
+cnt int8,
+filter_str text DEFAULT NULL,
+OUT mem_allocated int8,
+OUT search_ms int8)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C VOLATILE PARALLEL UNSAFE;
+
+create function bench_fixed_height_search(
+fanout int4,
+OUT fanout int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT rt_search_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c
new file mode 100644
index 0000000000..e69be48448
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -0,0 +1,554 @@
+/*-------------------------------------------------------------------------
+ *
+ * bench_radix_tree.c
+ *
+ * Copyright (c) 2016-2022, PostgreSQL Global Development Group
+ *
+ * contrib/bench_radix_tree/bench_radix_tree.c
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "funcapi.h"
+#include "lib/radixtree.h"
+#include <math.h>
+#include "miscadmin.h"
+#include "utils/builtins.h"
+#include "utils/timestamp.h"
+
+PG_MODULE_MAGIC;
+
+/* run benchmark also for binary-search case? */
+/* #define MEASURE_BINARY_SEARCH 1 */
+
+#define TIDS_PER_BLOCK_FOR_LOAD 30
+#define TIDS_PER_BLOCK_FOR_LOOKUP 50
+
+PG_FUNCTION_INFO_V1(bench_seq_search);
+PG_FUNCTION_INFO_V1(bench_shuffle_search);
+PG_FUNCTION_INFO_V1(bench_load_random_int);
+PG_FUNCTION_INFO_V1(bench_fixed_height_search);
+PG_FUNCTION_INFO_V1(bench_search_random_nodes);
+
+static uint64
+tid_to_key_off(ItemPointer tid, uint32 *off)
+{
+ uint64 upper;
+ uint32 shift = pg_ceil_log2_32(MaxHeapTuplesPerPage);
+ int64 tid_i;
+
+ Assert(ItemPointerGetOffsetNumber(tid) < MaxHeapTuplesPerPage);
+
+ tid_i = ItemPointerGetOffsetNumber(tid);
+ tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << shift;
+
+ /* log(sizeof(uint64) * BITS_PER_BYTE, 2) = log(64, 2) = 6 */
+ *off = tid_i & ((1 << 6) - 1);
+ upper = tid_i >> 6;
+ Assert(*off < (sizeof(uint64) * BITS_PER_BYTE));
+
+ Assert(*off < 64);
+
+ return upper;
+}
+
+static int
+shuffle_randrange(pg_prng_state *state, int lower, int upper)
+{
+ return (int) floor(pg_prng_double(state) * ((upper - lower) + 0.999999)) + lower;
+}
+
+/* Naive Fisher-Yates implementation*/
+static void
+shuffle_itemptrs(ItemPointer itemptr, uint64 nitems)
+{
+ /* for reproducibility */
+ pg_prng_state state;
+
+ pg_prng_seed(&state, 0);
+
+ for (int i = 0; i < nitems - 1; i++)
+ {
+ int j = shuffle_randrange(&state, i, nitems - 1);
+ ItemPointerData t = itemptr[j];
+
+ itemptr[j] = itemptr[i];
+ itemptr[i] = t;
+ }
+}
+
+static ItemPointer
+generate_tids(BlockNumber minblk, BlockNumber maxblk, int ntids_per_blk, uint64 *ntids_p,
+ bool random_block)
+{
+ ItemPointer tids;
+ uint64 maxitems;
+ uint64 ntids = 0;
+ pg_prng_state state;
+
+ maxitems = (maxblk - minblk + 1) * ntids_per_blk;
+ tids = MemoryContextAllocHuge(TopTransactionContext,
+ sizeof(ItemPointerData) * maxitems);
+
+ if (random_block)
+ pg_prng_seed(&state, 0x9E3779B185EBCA87);
+
+ for (BlockNumber blk = minblk; blk < maxblk; blk++)
+ {
+ if (random_block && !pg_prng_bool(&state))
+ continue;
+
+ for (OffsetNumber off = FirstOffsetNumber;
+ off <= ntids_per_blk; off++)
+ {
+ CHECK_FOR_INTERRUPTS();
+
+ ItemPointerSetBlockNumber(&(tids[ntids]), blk);
+ ItemPointerSetOffsetNumber(&(tids[ntids]), off);
+
+ ntids++;
+ }
+ }
+
+ *ntids_p = ntids;
+ return tids;
+}
+
+#ifdef MEASURE_BINARY_SEARCH
+static int
+vac_cmp_itemptr(const void *left, const void *right)
+{
+ BlockNumber lblk,
+ rblk;
+ OffsetNumber loff,
+ roff;
+
+ lblk = ItemPointerGetBlockNumber((ItemPointer) left);
+ rblk = ItemPointerGetBlockNumber((ItemPointer) right);
+
+ if (lblk < rblk)
+ return -1;
+ if (lblk > rblk)
+ return 1;
+
+ loff = ItemPointerGetOffsetNumber((ItemPointer) left);
+ roff = ItemPointerGetOffsetNumber((ItemPointer) right);
+
+ if (loff < roff)
+ return -1;
+ if (loff > roff)
+ return 1;
+
+ return 0;
+}
+#endif
+
+static Datum
+bench_search(FunctionCallInfo fcinfo, bool shuffle)
+{
+ BlockNumber minblk = PG_GETARG_INT32(0);
+ BlockNumber maxblk = PG_GETARG_INT32(1);
+ bool random_block = PG_GETARG_BOOL(2);
+ radix_tree *rt = NULL;
+ uint64 ntids;
+ uint64 key;
+ uint64 last_key = PG_UINT64_MAX;
+ uint64 val = 0;
+ ItemPointer tids;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_load_ms,
+ rt_search_ms;
+ Datum values[7];
+ bool nulls[7];
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ tids = generate_tids(minblk, maxblk, TIDS_PER_BLOCK_FOR_LOAD, &ntids, random_block);
+
+ /* measure the load time of the radix tree */
+ rt = rt_create(CurrentMemoryContext);
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ uint32 off;
+
+ CHECK_FOR_INTERRUPTS();
+
+ key = tid_to_key_off(tid, &off);
+
+ if (last_key != PG_UINT64_MAX && last_key != key)
+ {
+ rt_set(rt, last_key, val);
+ val = 0;
+ }
+
+ last_key = key;
+ val |= (uint64) 1 << off;
+ }
+ if (last_key != PG_UINT64_MAX)
+ rt_set(rt, last_key, val);
+
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_load_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ if (shuffle)
+ shuffle_itemptrs(tids, ntids);
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+ /* measure the search time of the radix tree */
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ uint32 off;
+ volatile bool ret; /* prevent calling rt_search from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ key = tid_to_key_off(tid, &off);
+
+ ret = rt_search(rt, key, &val);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_search_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int64GetDatum(rt_num_entries(rt));
+ values[1] = Int64GetDatum(rt_memory_usage(rt));
+ nulls[2] = true;
+ values[3] = Int64GetDatum(rt_load_ms);
+ nulls[4] = true; /* ar_load_ms */
+ values[5] = Int64GetDatum(rt_search_ms);
+ nulls[6] = true; /* ar_search_ms */
+
+#ifdef MEASURE_BINARY_SEARCH
+ {
+ ItemPointer itemptrs = NULL;
+
+ int64 ar_load_ms,
+ ar_search_ms;
+
+ /* measure the load time of the array */
+ itemptrs = MemoryContextAllocHuge(CurrentMemoryContext,
+ sizeof(ItemPointerData) * ntids);
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointerSetBlockNumber(&(itemptrs[i]),
+ ItemPointerGetBlockNumber(&(tids[i])));
+ ItemPointerSetOffsetNumber(&(itemptrs[i]),
+ ItemPointerGetOffsetNumber(&(tids[i])));
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ ar_load_ms = secs * 1000 + usecs / 1000;
+
+ /* next, measure the search time of the array */
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ volatile bool ret; /* prevent calling bsearch from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ ret = bsearch((void *) tid,
+ (void *) itemptrs,
+ ntids,
+ sizeof(ItemPointerData),
+ vac_cmp_itemptr);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ ar_search_ms = secs * 1000 + usecs / 1000;
+
+ /* set the result */
+ values[2] = Int64GetDatum(sizeof(ItemPointerData) * ntids);
+ nulls[2] = false;
+ nulls[4] = false;
+ values[4] = Int64GetDatum(ar_load_ms);
+ nulls[6] = false;
+ values[6] = Int64GetDatum(ar_search_ms);
+ }
+#endif
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_seq_search(PG_FUNCTION_ARGS)
+{
+ return bench_search(fcinfo, false);
+}
+
+Datum
+bench_shuffle_search(PG_FUNCTION_ARGS)
+{
+ return bench_search(fcinfo, true);
+}
+
+Datum
+bench_load_random_int(PG_FUNCTION_ARGS)
+{
+ uint64 cnt = (uint64) PG_GETARG_INT64(0);
+ radix_tree *rt;
+ pg_prng_state state;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 load_time_ms;
+ Datum values[2];
+ bool nulls[2];
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ pg_prng_seed(&state, 0);
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ uint64 key = pg_prng_uint64(&state);
+
+ rt_set(rt, key, key);
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ load_time_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int64GetDatum(rt_memory_usage(rt));
+ values[1] = Int64GetDatum(load_time_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+/* copy of splitmix64() */
+static uint64
+hash64(uint64 x)
+{
+ x ^= x >> 30;
+ x *= UINT64CONST(0xbf58476d1ce4e5b9);
+ x ^= x >> 27;
+ x *= UINT64CONST(0x94d049bb133111eb);
+ x ^= x >> 31;
+ return x;
+}
+
+/* attempts to have a relatively even population of node kinds */
+Datum
+bench_search_random_nodes(PG_FUNCTION_ARGS)
+{
+ uint64 cnt = (uint64) PG_GETARG_INT64(0);
+ radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 search_time_ms;
+ Datum values[2] = {0};
+ bool nulls[2] = {0};
+ /* from trial and error */
+ uint64 filter = (((uint64) 0x7F<<32) | (0x07<<24) | (0xFF<<16) | 0xFF);
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ if (!PG_ARGISNULL(1))
+ {
+ char *filter_str = text_to_cstring(PG_GETARG_TEXT_P(1));
+
+ if (sscanf(filter_str, "0x%lX", &filter) == 0)
+ elog(ERROR, "invalid filter string %s", filter_str);
+ }
+
+ elog(NOTICE, "bench with filter 0x%lX", filter);
+
+ rt = rt_create(CurrentMemoryContext);
+
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ const uint64 hash = hash64(i);
+ const uint64 key = hash & filter;
+
+ rt_set(rt, key, key);
+ }
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ const uint64 hash = hash64(i);
+ const uint64 key = hash & filter;
+ uint64 val;
+ volatile bool ret; /* prevent calling rt_search from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ ret = rt_search(rt, key, &val);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ search_time_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ values[0] = Int64GetDatum(rt_memory_usage(rt));
+ values[1] = Int64GetDatum(search_time_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_fixed_height_search(PG_FUNCTION_ARGS)
+{
+ int fanout = PG_GETARG_INT32(0);
+ radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_load_ms,
+ rt_search_ms;
+ Datum values[5];
+ bool nulls[5];
+
+ /* test boundary between vector and iteration */
+ const int n_keys = 5 * 16 * 16 * 16 * 16;
+ uint64 r,
+ h,
+ i,
+ j,
+ k;
+ int key_id;
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+
+ key_id = 0;
+
+ /*
+ * lower nodes have limited fanout, the top is only limited by
+ * bits-per-byte
+ */
+ for (r = 1;; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ for (i = 1; i <= fanout; i++)
+ {
+ for (j = 1; j <= fanout; j++)
+ {
+ for (k = 1; k <= fanout; k++)
+ {
+ uint64 key;
+
+ key = (r << 32) | (h << 24) | (i << 16) | (j << 8) | (k);
+
+ CHECK_FOR_INTERRUPTS();
+
+ key_id++;
+ if (key_id > n_keys)
+ goto finish_set;
+
+ rt_set(rt, key, key_id);
+ }
+ }
+ }
+ }
+ }
+finish_set:
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_load_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ /* measure the search time of the radix tree */
+ start_time = GetCurrentTimestamp();
+
+ key_id = 0;
+ for (r = 1;; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ for (i = 1; i <= fanout; i++)
+ {
+ for (j = 1; j <= fanout; j++)
+ {
+ for (k = 1; k <= fanout; k++)
+ {
+ uint64 key,
+ val;
+
+ key = (r << 32) | (h << 24) | (i << 16) | (j << 8) | (k);
+
+ CHECK_FOR_INTERRUPTS();
+
+ key_id++;
+ if (key_id > n_keys)
+ goto finish_search;
+
+ rt_search(rt, key, &val);
+ }
+ }
+ }
+ }
+ }
+finish_search:
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_search_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int32GetDatum(fanout);
+ values[1] = Int64GetDatum(rt_num_entries(rt));
+ values[2] = Int64GetDatum(rt_memory_usage(rt));
+ values[3] = Int64GetDatum(rt_load_ms);
+ values[4] = Int64GetDatum(rt_search_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
diff --git a/contrib/bench_radix_tree/bench_radix_tree.control b/contrib/bench_radix_tree/bench_radix_tree.control
new file mode 100644
index 0000000000..1d988e6c9a
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.control
@@ -0,0 +1,6 @@
+# bench_radix_tree extension
+comment = 'benchmark suite for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/bench_radix_tree'
+relocatable = true
+trusted = true
diff --git a/contrib/bench_radix_tree/expected/bench.out b/contrib/bench_radix_tree/expected/bench.out
new file mode 100644
index 0000000000..60c303892e
--- /dev/null
+++ b/contrib/bench_radix_tree/expected/bench.out
@@ -0,0 +1,13 @@
+create extension bench_radix_tree;
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
diff --git a/contrib/bench_radix_tree/sql/bench.sql b/contrib/bench_radix_tree/sql/bench.sql
new file mode 100644
index 0000000000..a46018c9d4
--- /dev/null
+++ b/contrib/bench_radix_tree/sql/bench.sql
@@ -0,0 +1,16 @@
+create extension bench_radix_tree;
+
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
--
2.31.1
v11-0001-introduce-vector8_min-and-vector8_highbit_mask.patch
From 8d2df83bfaf7ec598292fe1e29446b5d02c278a3 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 24 Oct 2022 14:07:09 +0900
Subject: [PATCH v11 1/6] introduce vector8_min and vector8_highbit_mask
---
src/include/port/simd.h | 47 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 47 insertions(+)
diff --git a/src/include/port/simd.h b/src/include/port/simd.h
index 61ae4ecf60..0b288c422a 100644
--- a/src/include/port/simd.h
+++ b/src/include/port/simd.h
@@ -77,6 +77,7 @@ static inline bool vector8_has(const Vector8 v, const uint8 c);
static inline bool vector8_has_zero(const Vector8 v);
static inline bool vector8_has_le(const Vector8 v, const uint8 c);
static inline bool vector8_is_highbit_set(const Vector8 v);
+static inline uint32 vector8_highbit_mask(const Vector8 v);
#ifndef USE_NO_SIMD
static inline bool vector32_is_highbit_set(const Vector32 v);
#endif
@@ -96,6 +97,7 @@ static inline Vector8 vector8_ssub(const Vector8 v1, const Vector8 v2);
*/
#ifndef USE_NO_SIMD
static inline Vector8 vector8_eq(const Vector8 v1, const Vector8 v2);
+static inline Vector8 vector8_min(const Vector8 v1, const Vector8 v2);
static inline Vector32 vector32_eq(const Vector32 v1, const Vector32 v2);
#endif
@@ -277,6 +279,36 @@ vector8_is_highbit_set(const Vector8 v)
#endif
}
+/*
+ * Return the bitmask of the high bit of each element.
+ */
+static inline uint32
+vector8_highbit_mask(const Vector8 v)
+{
+#ifdef USE_SSE2
+ return (uint32) _mm_movemask_epi8(v);
+#elif defined(USE_NEON)
+ static const uint8 mask[16] = {
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ };
+
+ uint8x16_t masked = vandq_u8(vld1q_u8(mask), (uint8x16_t) vshrq_n_s8(v, 7));
+ uint8x16_t maskedhi = vextq_u8(masked, masked, 8);
+
+ return (uint32) vaddvq_u16((uint16x8_t) vzip1q_u8(masked, maskedhi));
+#else
+ uint32 mask = 0;
+
+ for (Size i = 0; i < sizeof(Vector8); i++)
+ mask |= (((const uint8 *) &v)[i] >> 7) << i;
+
+ return mask;
+#endif
+}
+
/*
* Exactly like vector8_is_highbit_set except for the input type, so it
* looks at each byte separately.
@@ -372,4 +404,19 @@ vector32_eq(const Vector32 v1, const Vector32 v2)
}
#endif /* ! USE_NO_SIMD */
+/*
+ * Compare the given vectors and return the vector of minimum elements.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_min(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+ return _mm_min_epu8(v1, v2);
+#elif defined(USE_NEON)
+ return vminq_u8(v1, v2);
+#endif
+}
+#endif /* ! USE_NO_SIMD */
+
#endif /* SIMD_H */
--
2.31.1
v11-0002-Add-radix-implementation.patch
From f1c3bad56571261cc85c6bce596e652a5c028448 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 14 Sep 2022 12:38:51 +0000
Subject: [PATCH v11 2/6] Add radix implementation.
---
src/backend/lib/Makefile | 1 +
src/backend/lib/meson.build | 1 +
src/backend/lib/radixtree.c | 2428 +++++++++++++++++
src/include/lib/radixtree.h | 42 +
src/test/modules/Makefile | 1 +
src/test/modules/meson.build | 1 +
src/test/modules/test_radixtree/.gitignore | 4 +
src/test/modules/test_radixtree/Makefile | 23 +
src/test/modules/test_radixtree/README | 7 +
.../expected/test_radixtree.out | 32 +
src/test/modules/test_radixtree/meson.build | 34 +
.../test_radixtree/sql/test_radixtree.sql | 7 +
.../test_radixtree/test_radixtree--1.0.sql | 8 +
.../modules/test_radixtree/test_radixtree.c | 582 ++++
.../test_radixtree/test_radixtree.control | 4 +
15 files changed, 3175 insertions(+)
create mode 100644 src/backend/lib/radixtree.c
create mode 100644 src/include/lib/radixtree.h
create mode 100644 src/test/modules/test_radixtree/.gitignore
create mode 100644 src/test/modules/test_radixtree/Makefile
create mode 100644 src/test/modules/test_radixtree/README
create mode 100644 src/test/modules/test_radixtree/expected/test_radixtree.out
create mode 100644 src/test/modules/test_radixtree/meson.build
create mode 100644 src/test/modules/test_radixtree/sql/test_radixtree.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree--1.0.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree.c
create mode 100644 src/test/modules/test_radixtree/test_radixtree.control
diff --git a/src/backend/lib/Makefile b/src/backend/lib/Makefile
index 9dad31398a..4c1db794b6 100644
--- a/src/backend/lib/Makefile
+++ b/src/backend/lib/Makefile
@@ -22,6 +22,7 @@ OBJS = \
integerset.o \
knapsack.o \
pairingheap.o \
+ radixtree.o \
rbtree.o \
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/lib/meson.build b/src/backend/lib/meson.build
index 48da1bddce..4303d306cd 100644
--- a/src/backend/lib/meson.build
+++ b/src/backend/lib/meson.build
@@ -9,4 +9,5 @@ backend_sources += files(
'knapsack.c',
'pairingheap.c',
'rbtree.c',
+ 'radixtree.c',
)
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
new file mode 100644
index 0000000000..cc1a629fed
--- /dev/null
+++ b/src/backend/lib/radixtree.c
@@ -0,0 +1,2428 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.c
+ * Implementation for adaptive radix tree.
+ *
+ * This module employs the idea from the paper "The Adaptive Radix Tree: ARTful
+ * Indexing for Main-Memory Databases" by Viktor Leis, Alfons Kemper, and Thomas
+ * Neumann, 2013. The radix tree uses adaptive node sizes, a small number of node
+ * types, each with a different number of elements. Depending on the number of
+ * children, the appropriate node type is used.
+ *
+ * There are some differences from the proposed implementation. For instance,
+ * there is no support for path compression or lazy path expansion. The radix
+ * tree supports a fixed length of key, so we don't expect the tree level
+ * to be high.
+ *
+ * Both the key and the value are 64-bit unsigned integer. The inner nodes and
+ * the leaf nodes have slightly different structure: for inner tree nodes,
+ * shift > 0, store the pointer to its child node as the value. The leaf nodes,
+ * shift == 0, have the 64-bit unsigned integer that is specified by the user as
+ * the value. The paper refers to this technique as "Multi-value leaves". We
+ * choose it to avoid an additional pointer traversal. It is the reason this code
+ * currently does not support variable-length keys.
+ *
+ * XXX: Most functions in this file have two variants for inner nodes and leaf
+ * nodes, therefore there is some code duplication. While this sometimes makes
+ * code maintenance tricky, it reduces branch prediction misses when judging
+ * whether the node is an inner node or a leaf node.
+ *
+ * XXX: radix tree nodes are never shrunk.
+ *
+ * Interface
+ * ---------
+ *
+ * rt_create - Create a new, empty radix tree
+ * rt_free - Free the radix tree
+ * rt_search - Search a key-value pair
+ * rt_set - Set a key-value pair
+ * rt_delete - Delete a key-value pair
+ * rt_begin_iterate - Begin iterating through all key-value pairs
+ * rt_iterate_next - Return next key-value pair, if any
+ * rt_end_iter - End iteration
+ * rt_memory_usage - Get the memory usage
+ * rt_num_entries - Get the number of key-value pairs
+ *
+ * rt_create() creates an empty radix tree in the given memory context
+ * and memory contexts for all kinds of radix tree node under the memory context.
+ *
+ * rt_iterate_next() ensures returning key-value pairs in the ascending
+ * order of the key.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/lib/radixtree.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "lib/radixtree.h"
+#include "lib/stringinfo.h"
+#include "miscadmin.h"
+#include "port/pg_bitutils.h"
+#include "port/pg_lfind.h"
+#include "utils/memutils.h"
+
+#ifdef RT_DEBUG
+#define UINT64_FORMAT_HEX "%" INT64_MODIFIER "X"
+#endif
+
+/* The number of bits encoded in one tree level */
+#define RT_NODE_SPAN BITS_PER_BYTE
+
+/* The number of maximum slots in the node */
+#define RT_NODE_MAX_SLOTS (1 << RT_NODE_SPAN)
+
+/*
+ * Return the number of bytes needed for a bitmap covering nslots slots,
+ * used by nodes indexed by array lookup.
+ */
+#define RT_NODE_NSLOTS_BITS(nslots) ((nslots) / (sizeof(uint8) * BITS_PER_BYTE))
+
+/* Mask for extracting a chunk from the key */
+#define RT_CHUNK_MASK ((1 << RT_NODE_SPAN) - 1)
+
+/* Maximum shift the radix tree uses */
+#define RT_MAX_SHIFT key_get_shift(UINT64_MAX)
+
+/* Tree level the radix tree uses */
+#define RT_MAX_LEVEL ((sizeof(uint64) * BITS_PER_BYTE) / RT_NODE_SPAN)
+
+/* Invalid index used in node-128 */
+#define RT_NODE_128_INVALID_IDX 0xFF
+
+/* Get a chunk from the key */
+#define RT_GET_KEY_CHUNK(key, shift) ((uint8) (((key) >> (shift)) & RT_CHUNK_MASK))
+
+/*
+ * Mapping from the value to the bit in is-set bitmap in the node-256.
+ */
+#define RT_NODE_BITMAP_BYTE(v) ((v) / BITS_PER_BYTE)
+#define RT_NODE_BITMAP_BIT(v) (UINT64CONST(1) << ((v) % RT_NODE_SPAN))
+
+/* Enum used by rt_node_search() */
+typedef enum
+{
+ RT_ACTION_FIND = 0, /* find the key-value */
+ RT_ACTION_DELETE, /* delete the key-value */
+} rt_action;
+
+/*
+ * Supported radix tree node kinds.
+ *
+ * XXX: These are currently not well chosen. To reduce memory fragmentation,
+ * a smaller class should ideally fit neatly into the next larger class
+ * (except perhaps at the lowest end). Right now it's
+ * 40/40 -> 296/286 -> 1288/1304 -> 2056/2088 bytes for inner nodes and
+ * leaf nodes, respectively, leading to a large amount of allocator padding
+ * with aset.c. Hence the use of slab.
+ *
+ * XXX: need to have node-1 until there is no path compression optimization?
+ *
+ * XXX: need to explain why we choose these node types based on benchmark
+ * results etc.
+ */
+#define RT_NODE_KIND_4 0x00
+#define RT_NODE_KIND_32 0x01
+#define RT_NODE_KIND_128 0x02
+#define RT_NODE_KIND_256 0x03
+#define RT_NODE_KIND_COUNT 4
+
+/* Common type for all nodes types */
+typedef struct rt_node
+{
+ /*
+ * Number of children. We use uint16 to be able to indicate 256 children
+ * at the fanout of 8.
+ */
+ uint16 count;
+
+ /*
+ * Shift indicates which part of the key space is represented by this
+ * node. That is, the key is shifted by 'shift' and the lowest
+ * RT_NODE_SPAN bits are then represented in chunk.
+ */
+ uint8 shift;
+ uint8 chunk;
+
+ /* Size kind of the node */
+ uint8 kind;
+} rt_node;
+#define NODE_IS_LEAF(n) (((rt_node *) (n))->shift == 0)
+#define NODE_IS_EMPTY(n) (((rt_node *) (n))->count == 0)
+#define NODE_HAS_FREE_SLOT(n) \
+ (((rt_node *) (n))->count < rt_node_kind_info[((rt_node *) (n))->kind].fanout)
+
+/* Base type of each node kinds for leaf and inner nodes */
+typedef struct rt_node_base_4
+{
+ rt_node n;
+
+ /* 4 children, for key chunks */
+ uint8 chunks[4];
+} rt_node_base_4;
+
+typedef struct rt_node_base32
+{
+ rt_node n;
+
+ /* 32 children, for key chunks */
+ uint8 chunks[32];
+} rt_node_base_32;
+
+/*
+ * node-128 uses slot_idx array, an array of RT_NODE_MAX_SLOTS length, typically
+ * 256, to store indexes into a second array that contains up to 128 values (or
+ * child pointers in inner nodes).
+ */
+typedef struct rt_node_base128
+{
+ rt_node n;
+
+ /* The index of slots for each fanout */
+ uint8 slot_idxs[RT_NODE_MAX_SLOTS];
+} rt_node_base_128;
+
+typedef struct rt_node_base256
+{
+ rt_node n;
+} rt_node_base_256;
+
+/*
+ * Inner and leaf nodes.
+ *
+ * There are separate from inner node size classes for two main reasons:
+ *
+ * 1) the value type might be different than something fitting into a pointer
+ * width type
+ * 2) Need to represent non-existing values in a key-type independent way.
+ *
+ * 1) is clearly worth being concerned about, but it's not clear 2) is as
+ * good. It might be better to just indicate non-existing entries the same way
+ * in inner nodes.
+ */
+typedef struct rt_node_inner_4
+{
+ rt_node_base_4 base;
+
+ /* 4 children, for key chunks */
+ rt_node *children[4];
+} rt_node_inner_4;
+
+typedef struct rt_node_leaf_4
+{
+ rt_node_base_4 base;
+
+ /* 4 values, for key chunks */
+ uint64 values[4];
+} rt_node_leaf_4;
+
+typedef struct rt_node_inner_32
+{
+ rt_node_base_32 base;
+
+ /* 32 children, for key chunks */
+ rt_node *children[32];
+} rt_node_inner_32;
+
+typedef struct rt_node_leaf_32
+{
+ rt_node_base_32 base;
+
+ /* 32 values, for key chunks */
+ uint64 values[32];
+} rt_node_leaf_32;
+
+typedef struct rt_node_inner_128
+{
+ rt_node_base_128 base;
+
+ /* Slots for 128 children */
+ rt_node *children[128];
+} rt_node_inner_128;
+
+typedef struct rt_node_leaf_128
+{
+ rt_node_base_128 base;
+
+ /* isset is a bitmap to track which slot is in use */
+ uint8 isset[RT_NODE_NSLOTS_BITS(128)];
+
+ /* Slots for 128 values */
+ uint64 values[128];
+} rt_node_leaf_128;
+
+/*
+ * node-256 is the largest node type. This node has RT_NODE_MAX_SLOTS length array
+ * for directly storing values (or child pointers in inner nodes).
+ */
+typedef struct rt_node_inner_256
+{
+ rt_node_base_256 base;
+
+ /* Slots for 256 children */
+ rt_node *children[RT_NODE_MAX_SLOTS];
+} rt_node_inner_256;
+
+typedef struct rt_node_leaf_256
+{
+ rt_node_base_256 base;
+
+ /* isset is a bitmap to track which slot is in use */
+ uint8 isset[RT_NODE_NSLOTS_BITS(RT_NODE_MAX_SLOTS)];
+
+ /* Slots for 256 values */
+ uint64 values[RT_NODE_MAX_SLOTS];
+} rt_node_leaf_256;
+
+/* Information of each size kinds */
+typedef struct rt_node_kind_info_elem
+{
+ const char *name;
+ int fanout;
+
+ /* slab chunk size */
+ Size inner_size;
+ Size leaf_size;
+
+ /* slab block size */
+ Size inner_blocksize;
+ Size leaf_blocksize;
+} rt_node_kind_info_elem;
+
+/*
+ * Calculate the slab blocksize so that we can allocate at least 32 chunks
+ * from the block.
+ */
+#define NODE_SLAB_BLOCK_SIZE(size) \
+ Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * size, (size) * 32)
+static rt_node_kind_info_elem rt_node_kind_info[RT_NODE_KIND_COUNT] = {
+
+ [RT_NODE_KIND_4] = {
+ .name = "radix tree node 4",
+ .fanout = 4,
+ .inner_size = sizeof(rt_node_inner_4),
+ .leaf_size = sizeof(rt_node_leaf_4),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_4)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_4)),
+ },
+ [RT_NODE_KIND_32] = {
+ .name = "radix tree node 32",
+ .fanout = 32,
+ .inner_size = sizeof(rt_node_inner_32),
+ .leaf_size = sizeof(rt_node_leaf_32),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_32)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_32)),
+ },
+ [RT_NODE_KIND_128] = {
+ .name = "radix tree node 128",
+ .fanout = 128,
+ .inner_size = sizeof(rt_node_inner_128),
+ .leaf_size = sizeof(rt_node_leaf_128),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_128)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_128)),
+ },
+ [RT_NODE_KIND_256] = {
+ .name = "radix tree node 256",
+ .fanout = 256,
+ .inner_size = sizeof(rt_node_inner_256),
+ .leaf_size = sizeof(rt_node_leaf_256),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_256)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_256)),
+ },
+};
+
+/*
+ * Iteration support.
+ *
+ * Iterating the radix tree returns each pair of key and value in the ascending
+ * order of the key. To support this, we iterate over nodes at each level.
+ *
+ * rt_node_iter struct is used to track the iteration within a node.
+ *
+ * rt_iter is the struct for iteration of the radix tree, and uses rt_node_iter
+ * in order to track the iteration of each level. During the iteration, we also
+ * construct the key whenever updating the node iteration information, e.g., when
+ * advancing the current index within the node or when moving to the next node
+ * at the same level.
+ */
+typedef struct rt_node_iter
+{
+ rt_node *node; /* current node being iterated */
+ int current_idx; /* current position. -1 for initial value */
+} rt_node_iter;
+
+struct rt_iter
+{
+ radix_tree *tree;
+
+ /* Track the iteration on nodes of each level */
+ rt_node_iter stack[RT_MAX_LEVEL];
+ int stack_len;
+
+ /* The key is being constructed during the iteration */
+ uint64 key;
+};
+
+/* A radix tree with nodes */
+struct radix_tree
+{
+ MemoryContext context;
+
+ rt_node *root;
+ uint64 max_val;
+ uint64 num_keys;
+
+ MemoryContextData *inner_slabs[RT_NODE_KIND_COUNT];
+ MemoryContextData *leaf_slabs[RT_NODE_KIND_COUNT];
+
+ /* statistics */
+#ifdef RT_DEBUG
+ int32 cnt[RT_NODE_KIND_COUNT];
+#endif
+};
+
+static void rt_new_root(radix_tree *tree, uint64 key);
+static rt_node * rt_alloc_init_node(radix_tree *tree, uint8 kind, uint8 shift,
+ uint8 chunk, bool inner);
+static inline void rt_init_node(rt_node *node, uint8 kind, uint8 shift, uint8 chunk,
+ bool inner);
+static rt_node *rt_alloc_node(radix_tree *tree, int kind, bool inner);
+static void rt_free_node(radix_tree *tree, rt_node *node);
+static void rt_extend(radix_tree *tree, uint64 key);
+static inline bool rt_node_search_inner(rt_node *node, uint64 key, rt_action action,
+ rt_node **child_p);
+static inline bool rt_node_search_leaf(rt_node *node, uint64 key, rt_action action,
+ uint64 *value_p);
+static bool rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node,
+ uint64 key, rt_node *child);
+static bool rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
+ uint64 key, uint64 value);
+static inline rt_node *rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter);
+static inline bool rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
+ uint64 *value_p);
+static void rt_update_iter_stack(rt_iter *iter, rt_node *from_node, int from);
+static inline void rt_iter_update_key(rt_iter *iter, uint8 chunk, uint8 shift);
+
+/* verification (available only with assertion) */
+static void rt_verify_node(rt_node *node);
+
+/*
+ * Return the index of the first chunk in the node that equals 'chunk'.
+ * Return -1 if there is no such chunk.
+ */
+static inline int
+node_4_search_eq(rt_node_base_4 *node, uint8 chunk)
+{
+ int idx = -1;
+
+ for (int i = 0; i < node->n.count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ idx = i;
+ break;
+ }
+ }
+
+ return idx;
+}
+
+/*
+ * Return the index at which 'chunk' should be inserted into the node's chunk array.
+ */
+static inline int
+node_4_get_insertpos(rt_node_base_4 *node, uint8 chunk)
+{
+ int idx;
+
+ for (idx = 0; idx < node->n.count; idx++)
+ {
+ if (node->chunks[idx] >= chunk)
+ break;
+ }
+
+ return idx;
+}
+
+/*
+ * Return the index of the first chunk in the node that equals 'chunk'.
+ * Return -1 if there is no such chunk.
+ */
+static inline int
+node_32_search_eq(rt_node_base_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ uint32 bitfield;
+ int index_simd = -1;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index = -1;
+
+ for (int i = 0; i < count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ index = i;
+ break;
+ }
+ }
+#endif
+
+#ifndef USE_NO_SIMD
+ spread_chunk = vector8_broadcast(chunk);
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
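+
+	/*
+	 * Compare 'chunk' against all 32 chunk slots at once; each equal byte
+	 * lane sets one bit in 'bitfield' below, and lanes beyond 'count' are
+	 * masked off.
+	 */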
+ cmp1 = vector8_eq(spread_chunk, haystack1);
+ cmp2 = vector8_eq(spread_chunk, haystack2);
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+/*
+ * Return the index at which 'chunk' should be inserted into the node's chunk array.
+ */
+static inline int
+node_32_get_insertpos(rt_node_base_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ Vector8 min1;
+ Vector8 min2;
+ uint32 bitfield;
+ int index_simd;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index;
+
+ for (index = 0; index < count; index++)
+ {
+ if (node->chunks[index] >= chunk)
+ break;
+ }
+#endif
+
+#ifndef USE_NO_SIMD
+ spread_chunk = vector8_broadcast(chunk);
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
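+
+	/*
+	 * vector8_min() equals 'chunk' exactly in the lanes whose stored chunk
+	 * is >= 'chunk', so the rightmost (lowest) set bit in 'bitfield' below
+	 * is the first position whose chunk is >= the one being inserted, i.e.
+	 * the insertion point; if no bit is set, the new chunk goes at the end.
+	 */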
+ min1 = vector8_min(spread_chunk, haystack1);
+ min2 = vector8_min(spread_chunk, haystack2);
+ cmp1 = vector8_eq(spread_chunk, min1);
+ cmp2 = vector8_eq(spread_chunk, min2);
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+ else
+ index_simd = count;
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+/*
+ * Functions to manipulate both chunks array and children/values array.
+ * These are used for node-4 and node-32.
+ */
+
+/* Shift the elements right at 'idx' by one */
+static inline void
+chunk_children_array_shift(uint8 *chunks, rt_node **children, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+ memmove(&(children[idx + 1]), &(children[idx]), sizeof(rt_node *) * (count - idx));
+}
+
+static inline void
+chunk_values_array_shift(uint8 *chunks, uint64 *values, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+	memmove(&(values[idx + 1]), &(values[idx]), sizeof(uint64) * (count - idx));
+}
+
+/* Delete the element at 'idx' */
+static inline void
+chunk_children_array_delete(uint8 *chunks, rt_node **children, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(children[idx]), &(children[idx + 1]), sizeof(rt_node *) * (count - idx - 1));
+}
+
+static inline void
+chunk_values_array_delete(uint8 *chunks, uint64 *values, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(values[idx]), &(values[idx + 1]), sizeof(uint64) * (count - idx - 1));
+}
+
+/* Copy both chunks and children/values arrays */
+static inline void
+chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
+ uint8 *dst_chunks, rt_node **dst_children, int count)
+{
+ /* For better code generation */
+ if (count > rt_node_kind_info[RT_NODE_KIND_4].fanout)
+ pg_unreachable();
+
+ memcpy(dst_chunks, src_chunks, sizeof(uint8) * count);
+ memcpy(dst_children, src_children, sizeof(rt_node *) * count);
+}
+
+static inline void
+chunk_values_array_copy(uint8 *src_chunks, uint64 *src_values,
+ uint8 *dst_chunks, uint64 *dst_values, int count)
+{
+ /* For better code generation */
+ if (count > rt_node_kind_info[RT_NODE_KIND_4].fanout)
+ pg_unreachable();
+
+ memcpy(dst_chunks, src_chunks, sizeof(uint8) * count);
+ memcpy(dst_values, src_values, sizeof(uint64) * count);
+}
+
+/* Functions to manipulate inner and leaf node-128 */
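+
+/*
+ * In node-128, slot_idxs[] maps a key chunk (0-255) to an index into the
+ * children/values array, with RT_NODE_128_INVALID_IDX marking chunks that
+ * have no entry; the leaf variant additionally tracks used slots in the
+ * 'isset' bitmap.
+ */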
+
+/* Does the given chunk in the node have a value? */
+static inline bool
+node_128_is_chunk_used(rt_node_base_128 *node, uint8 chunk)
+{
+ return node->slot_idxs[chunk] != RT_NODE_128_INVALID_IDX;
+}
+
+/* Is the slot in the node used? */
+static inline bool
+node_inner_128_is_slot_used(rt_node_inner_128 *node, uint8 slot)
+{
+ Assert(!NODE_IS_LEAF(node));
+ return (node->children[slot] != NULL);
+}
+
+static inline bool
+node_leaf_128_is_slot_used(rt_node_leaf_128 *node, uint8 slot)
+{
+ Assert(NODE_IS_LEAF(node));
+ return ((node->isset[RT_NODE_BITMAP_BYTE(slot)] & RT_NODE_BITMAP_BIT(slot)) != 0);
+}
+
+static inline rt_node *
+node_inner_128_get_child(rt_node_inner_128 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ return node->children[node->base.slot_idxs[chunk]];
+}
+
+static inline uint64
+node_leaf_128_get_value(rt_node_leaf_128 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ Assert(((rt_node_base_128 *) node)->slot_idxs[chunk] != RT_NODE_128_INVALID_IDX);
+ return node->values[node->base.slot_idxs[chunk]];
+}
+
+static void
+node_inner_128_delete(rt_node_inner_128 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->base.slot_idxs[chunk] = RT_NODE_128_INVALID_IDX;
+}
+
+static void
+node_leaf_128_delete(rt_node_leaf_128 *node, uint8 chunk)
+{
+ int slotpos = node->base.slot_idxs[chunk];
+
+ Assert(NODE_IS_LEAF(node));
+ node->isset[RT_NODE_BITMAP_BYTE(slotpos)] &= ~(RT_NODE_BITMAP_BIT(slotpos));
+ node->base.slot_idxs[chunk] = RT_NODE_128_INVALID_IDX;
+}
+
+/* Return an unused slot in node-128 */
+static int
+node_inner_128_find_unused_slot(rt_node_inner_128 *node, uint8 chunk)
+{
+ int slotpos = 0;
+
+ Assert(!NODE_IS_LEAF(node));
+ while (node_inner_128_is_slot_used(node, slotpos))
+ slotpos++;
+
+ return slotpos;
+}
+
+static int
+node_leaf_128_find_unused_slot(rt_node_leaf_128 *node, uint8 chunk)
+{
+ int slotpos;
+
+ Assert(NODE_IS_LEAF(node));
+
+ /* We iterate over the isset bitmap per byte then check each bit */
+ for (slotpos = 0; slotpos < RT_NODE_NSLOTS_BITS(128); slotpos++)
+ {
+ if (node->isset[slotpos] < 0xFF)
+ break;
+ }
+ Assert(slotpos < RT_NODE_NSLOTS_BITS(128));
+
+ slotpos *= BITS_PER_BYTE;
+ while (node_leaf_128_is_slot_used(node, slotpos))
+ slotpos++;
+
+ return slotpos;
+}
+
+static inline void
+node_inner_128_insert(rt_node_inner_128 *node, uint8 chunk, rt_node *child)
+{
+ int slotpos;
+
+ Assert(!NODE_IS_LEAF(node));
+
+ /* find unused slot */
+ slotpos = node_inner_128_find_unused_slot(node, chunk);
+
+ node->base.slot_idxs[chunk] = slotpos;
+ node->children[slotpos] = child;
+}
+
+/* Set the slot at the corresponding chunk */
+static inline void
+node_leaf_128_insert(rt_node_leaf_128 *node, uint8 chunk, uint64 value)
+{
+ int slotpos;
+
+ Assert(NODE_IS_LEAF(node));
+
+ /* find unused slot */
+ slotpos = node_leaf_128_find_unused_slot(node, chunk);
+
+ node->base.slot_idxs[chunk] = slotpos;
+ node->isset[RT_NODE_BITMAP_BYTE(slotpos)] |= RT_NODE_BITMAP_BIT(slotpos);
+ node->values[slotpos] = value;
+}
+
+/* Update the child corresponding to 'chunk' to 'child' */
+static inline void
+node_inner_128_update(rt_node_inner_128 *node, uint8 chunk, rt_node *child)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->children[node->base.slot_idxs[chunk]] = child;
+}
+
+static inline void
+node_leaf_128_update(rt_node_leaf_128 *node, uint8 chunk, uint64 value)
+{
+ Assert(NODE_IS_LEAF(node));
+ node->values[node->base.slot_idxs[chunk]] = value;
+}
+
+/* Functions to manipulate inner and leaf node-256 */
+
+/* Return true if the slot corresponding to the given chunk is in use */
+static inline bool
+node_inner_256_is_chunk_used(rt_node_inner_256 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ return (node->children[chunk] != NULL);
+}
+
+static inline bool
+node_leaf_256_is_chunk_used(rt_node_leaf_256 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ return (node->isset[RT_NODE_BITMAP_BYTE(chunk)] & RT_NODE_BITMAP_BIT(chunk)) != 0;
+}
+
+static inline rt_node *
+node_inner_256_get_child(rt_node_inner_256 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ Assert(node_inner_256_is_chunk_used(node, chunk));
+ return node->children[chunk];
+}
+
+static inline uint64
+node_leaf_256_get_value(rt_node_leaf_256 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ Assert(node_leaf_256_is_chunk_used(node, chunk));
+ return node->values[chunk];
+}
+
+/* Set the child in the node-256 */
+static inline void
+node_inner_256_set(rt_node_inner_256 *node, uint8 chunk, rt_node *child)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->children[chunk] = child;
+}
+
+/* Set the value in the node-256 */
+static inline void
+node_leaf_256_set(rt_node_leaf_256 *node, uint8 chunk, uint64 value)
+{
+ Assert(NODE_IS_LEAF(node));
+ node->isset[RT_NODE_BITMAP_BYTE(chunk)] |= RT_NODE_BITMAP_BIT(chunk);
+ node->values[chunk] = value;
+}
+
+/* Clear the slot at the given chunk position */
+static inline void
+node_inner_256_delete(rt_node_inner_256 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->children[chunk] = NULL;
+}
+
+static inline void
+node_leaf_256_delete(rt_node_leaf_256 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ node->isset[RT_NODE_BITMAP_BYTE(chunk)] &= ~(RT_NODE_BITMAP_BIT(chunk));
+}
+
+/*
+ * Return the shift needed for a node that can store the given key.
+ */
+static inline int
+key_get_shift(uint64 key)
+{
+ return (key == 0)
+ ? 0
+ : (pg_leftmost_one_pos64(key) / RT_NODE_SPAN) * RT_NODE_SPAN;
+}
+
+/*
+ * Return the maximum key that can be stored under a node with the given shift.
+ */
+static uint64
+shift_get_max_val(int shift)
+{
+ if (shift == RT_MAX_SHIFT)
+ return UINT64_MAX;
+
+ return (UINT64CONST(1) << (shift + RT_NODE_SPAN)) - 1;
+}
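+
+/*
+ * For example, assuming RT_NODE_SPAN is 8: key 0x10000 has its highest set
+ * bit at position 16, so key_get_shift() returns 16, and a root node with
+ * shift 16 can cover keys up to shift_get_max_val(16) = 0xFFFFFF.
+ */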
+
+/*
+ * Create a new node as the root. Subordinate nodes will be created during
+ * the insertion.
+ */
+static void
+rt_new_root(radix_tree *tree, uint64 key)
+{
+ int shift = key_get_shift(key);
+ rt_node *node;
+
+ node = (rt_node *) rt_alloc_init_node(tree, RT_NODE_KIND_4, shift, 0,
+ shift > 0);
+ tree->max_val = shift_get_max_val(shift);
+ tree->root = node;
+}
+
+/* Return a new and initialized node */
+static rt_node *
+rt_alloc_init_node(radix_tree *tree, uint8 kind, uint8 shift, uint8 chunk, bool inner)
+{
+ rt_node *newnode;
+
+ newnode = rt_alloc_node(tree, kind, inner);
+ rt_init_node(newnode, kind, shift, chunk, inner);
+
+ return newnode;
+}
+
+/*
+ * Allocate a new node with the given node kind.
+ */
+static rt_node *
+rt_alloc_node(radix_tree *tree, int kind, bool inner)
+{
+ rt_node *newnode;
+
+ if (inner)
+ newnode = (rt_node *) MemoryContextAllocZero(tree->inner_slabs[kind],
+ rt_node_kind_info[kind].inner_size);
+ else
+ newnode = (rt_node *) MemoryContextAllocZero(tree->leaf_slabs[kind],
+ rt_node_kind_info[kind].leaf_size);
+
+#ifdef RT_DEBUG
+ /* update the statistics */
+ tree->cnt[kind]++;
+#endif
+
+ return newnode;
+}
+
+/* Initialize the node contents */
+static inline void
+rt_init_node(rt_node *node, uint8 kind, uint8 shift, uint8 chunk, bool inner)
+{
+ if (inner)
+ MemSet(node, 0, rt_node_kind_info[kind].inner_size);
+ else
+ MemSet(node, 0, rt_node_kind_info[kind].leaf_size);
+
+ node->kind = kind;
+ node->shift = shift;
+ node->chunk = chunk;
+ node->count = 0;
+
+ /* Initialize slot_idxs to invalid values */
+ if (kind == RT_NODE_KIND_128)
+ {
+ rt_node_base_128 *n128 = (rt_node_base_128 *) node;
+
+ memset(n128->slot_idxs, RT_NODE_128_INVALID_IDX, sizeof(n128->slot_idxs));
+ }
+}
+
+/*
+ * Create a new node with 'new_kind' and the same shift, chunk, and
+ * count as 'node'.
+ */
+static rt_node *
+rt_grow_node(radix_tree *tree, rt_node *node, int new_kind)
+{
+ rt_node *newnode;
+
+ newnode = rt_alloc_init_node(tree, new_kind, node->shift, node->chunk,
+ node->shift > 0);
+ newnode->count = node->count;
+
+ return newnode;
+}
+
+/* Free the given node */
+static void
+rt_free_node(radix_tree *tree, rt_node *node)
+{
+ /* If we're deleting the root node, make the tree empty */
+ if (tree->root == node)
+ {
+ tree->root = NULL;
+ tree->max_val = 0;
+ }
+
+#ifdef RT_DEBUG
+ /* update the statistics */
+ tree->cnt[node->kind]--;
+ Assert(tree->cnt[node->kind] >= 0);
+#endif
+
+ pfree(node);
+}
+
+/*
+ * Replace old_child with new_child, and free the old one.
+ */
+static void
+rt_replace_node(radix_tree *tree, rt_node *parent, rt_node *old_child,
+ rt_node *new_child, uint64 key)
+{
+ Assert(old_child->chunk == new_child->chunk);
+ Assert(old_child->shift == new_child->shift);
+
+ if (parent == old_child)
+ {
+ /* Replace the root node with the new large node */
+ tree->root = new_child;
+ }
+ else
+ {
+ bool replaced PG_USED_FOR_ASSERTS_ONLY;
+
+ replaced = rt_node_insert_inner(tree, NULL, parent, key, new_child);
+ Assert(replaced);
+ }
+
+ rt_free_node(tree, old_child);
+}
+
+/*
+ * The radix tree doesn't have sufficient height to store the key. Extend
+ * the radix tree upwards so it can store it.
+ */
+static void
+rt_extend(radix_tree *tree, uint64 key)
+{
+ int target_shift;
+ int shift = tree->root->shift + RT_NODE_SPAN;
+
+ target_shift = key_get_shift(key);
+
+ /* Grow tree from 'shift' to 'target_shift' */
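+	/*
+	 * For example, assuming RT_NODE_SPAN is 8, growing a root at shift 8 to
+	 * hold key 0x100000000 (target_shift 32) pushes new node-4 inner nodes
+	 * on top of the old root at shifts 16, 24, and 32.
+	 */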
+ while (shift <= target_shift)
+ {
+ rt_node_inner_4 *node;
+
+ node = (rt_node_inner_4 *) rt_alloc_init_node(tree, RT_NODE_KIND_4,
+ shift, 0, true);
+ node->base.n.count = 1;
+ node->base.chunks[0] = 0;
+ node->children[0] = tree->root;
+
+ tree->root->chunk = 0;
+ tree->root = (rt_node *) node;
+
+ shift += RT_NODE_SPAN;
+ }
+
+ tree->max_val = shift_get_max_val(target_shift);
+}
+
+/*
+ * The radix tree doesn't have the inner and leaf nodes needed for the given
+ * key-value pair. Create them, descending from 'node' to the bottom level.
+ */
+static inline void
+rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node *parent,
+ rt_node *node)
+{
+ int shift = node->shift;
+
+ while (shift >= RT_NODE_SPAN)
+ {
+ rt_node *newchild;
+ int newshift = shift - RT_NODE_SPAN;
+
+ newchild = rt_alloc_init_node(tree, RT_NODE_KIND_4, newshift,
+ RT_GET_KEY_CHUNK(key, node->shift),
+ newshift > 0);
+ rt_node_insert_inner(tree, parent, node, key, newchild);
+
+ parent = node;
+ node = newchild;
+ shift -= RT_NODE_SPAN;
+ }
+
+ rt_node_insert_leaf(tree, parent, node, key, value);
+ tree->num_keys++;
+}
+
+/*
+ * Search for the child pointer corresponding to 'key' in the given node, and
+ * do the specified 'action'.
+ *
+ * Return true if the key is found, otherwise return false. On success, the
+ * child pointer is returned in *child_p.
+ */
+static inline bool
+rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **child_p)
+{
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool found = false;
+ rt_node *child = NULL;
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+ int idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
+
+ if (idx < 0)
+ break;
+
+ found = true;
+
+ if (action == RT_ACTION_FIND)
+ child = n4->children[idx];
+ else /* RT_ACTION_DELETE */
+ chunk_children_array_delete(n4->base.chunks, n4->children,
+ n4->base.n.count, idx);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+ int idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
+
+ if (idx < 0)
+ break;
+
+ found = true;
+ if (action == RT_ACTION_FIND)
+ child = n32->children[idx];
+ else /* RT_ACTION_DELETE */
+ chunk_children_array_delete(n32->base.chunks, n32->children,
+ n32->base.n.count, idx);
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ rt_node_inner_128 *n128 = (rt_node_inner_128 *) node;
+
+ if (!node_128_is_chunk_used((rt_node_base_128 *) n128, chunk))
+ break;
+
+ found = true;
+
+ if (action == RT_ACTION_FIND)
+ child = node_inner_128_get_child(n128, chunk);
+ else /* RT_ACTION_DELETE */
+ node_inner_128_delete(n128, chunk);
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+
+ if (!node_inner_256_is_chunk_used(n256, chunk))
+ break;
+
+ found = true;
+ if (action == RT_ACTION_FIND)
+ child = node_inner_256_get_child(n256, chunk);
+ else /* RT_ACTION_DELETE */
+ node_inner_256_delete(n256, chunk);
+
+ break;
+ }
+ }
+
+ /* update statistics */
+ if (action == RT_ACTION_DELETE && found)
+ node->count--;
+
+ if (found && child_p)
+ *child_p = child;
+
+ return found;
+}
+
+/*
+ * Search for the value corresponding to 'key' in the given node, and do the
+ * specified 'action'.
+ *
+ * Return true if the key is found, otherwise return false. On success, the
+ * value is returned in *value_p.
+ */
+static inline bool
+rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p)
+{
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool found = false;
+ uint64 value = 0;
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+ int idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
+
+ if (idx < 0)
+ break;
+
+ found = true;
+
+ if (action == RT_ACTION_FIND)
+ value = n4->values[idx];
+ else /* RT_ACTION_DELETE */
+ chunk_values_array_delete(n4->base.chunks, (uint64 *) n4->values,
+ n4->base.n.count, idx);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+ int idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
+
+ if (idx < 0)
+ break;
+
+ found = true;
+ if (action == RT_ACTION_FIND)
+ value = n32->values[idx];
+ else /* RT_ACTION_DELETE */
+ chunk_values_array_delete(n32->base.chunks, (uint64 *) n32->values,
+ n32->base.n.count, idx);
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ rt_node_leaf_128 *n128 = (rt_node_leaf_128 *) node;
+
+ if (!node_128_is_chunk_used((rt_node_base_128 *) n128, chunk))
+ break;
+
+ found = true;
+
+ if (action == RT_ACTION_FIND)
+ value = node_leaf_128_get_value(n128, chunk);
+ else /* RT_ACTION_DELETE */
+ node_leaf_128_delete(n128, chunk);
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+
+ if (!node_leaf_256_is_chunk_used(n256, chunk))
+ break;
+
+ found = true;
+ if (action == RT_ACTION_FIND)
+ value = node_leaf_256_get_value(n256, chunk);
+ else /* RT_ACTION_DELETE */
+ node_leaf_256_delete(n256, chunk);
+
+ break;
+ }
+ }
+
+ /* update statistics */
+ if (action == RT_ACTION_DELETE && found)
+ node->count--;
+
+ if (found && value_p)
+ *value_p = value;
+
+ return found;
+}
+
+/* Insert the child to the inner node */
+static bool
+rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 key,
+ rt_node *child)
+{
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool chunk_exists = false;
+
+ Assert(!NODE_IS_LEAF(node));
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+ int idx;
+
+ idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n4->children[idx] = child;
+ break;
+ }
+
+ if (unlikely(!NODE_HAS_FREE_SLOT(n4)))
+ {
+ rt_node_inner_32 *new32;
+
+ /* grow node from 4 to 32 */
+ new32 = (rt_node_inner_32 *) rt_grow_node(tree, (rt_node *) n4,
+ RT_NODE_KIND_32);
+ chunk_children_array_copy(n4->base.chunks, n4->children,
+ new32->base.chunks, new32->children,
+ n4->base.n.count);
+
+ Assert(parent != NULL);
+ rt_replace_node(tree, parent, (rt_node *) n4, (rt_node *) new32,
+ key);
+ node = (rt_node *) new32;
+ }
+ else
+ {
+ int insertpos = node_4_get_insertpos((rt_node_base_4 *) n4, chunk);
+ uint16 count = n4->base.n.count;
+
+ /* shift chunks and children */
+ if (count != 0 && insertpos < count)
+ chunk_children_array_shift(n4->base.chunks, n4->children,
+ count, insertpos);
+
+ n4->base.chunks[insertpos] = chunk;
+ n4->children[insertpos] = child;
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_32:
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+ int idx;
+
+ idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n32->children[idx] = child;
+ break;
+ }
+
+ if (unlikely(!NODE_HAS_FREE_SLOT(n32)))
+ {
+ rt_node_inner_128 *new128;
+
+ /* grow node from 32 to 128 */
+ new128 = (rt_node_inner_128 *) rt_grow_node(tree, (rt_node *) n32,
+ RT_NODE_KIND_128);
+ for (int i = 0; i < n32->base.n.count; i++)
+ node_inner_128_insert(new128, n32->base.chunks[i], n32->children[i]);
+
+ Assert(parent != NULL);
+ rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new128,
+ key);
+ node = (rt_node *) new128;
+ }
+ else
+ {
+ int insertpos = node_32_get_insertpos((rt_node_base_32 *) n32, chunk);
+ int16 count = n32->base.n.count;
+
+ if (count != 0 && insertpos < count)
+ chunk_children_array_shift(n32->base.chunks, n32->children,
+ count, insertpos);
+
+ n32->base.chunks[insertpos] = chunk;
+ n32->children[insertpos] = child;
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_128:
+ {
+ rt_node_inner_128 *n128 = (rt_node_inner_128 *) node;
+ int cnt = 0;
+
+ if (node_128_is_chunk_used((rt_node_base_128 *) n128, chunk))
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ node_inner_128_update(n128, chunk, child);
+ break;
+ }
+
+ if (unlikely(!NODE_HAS_FREE_SLOT(n128)))
+ {
+ rt_node_inner_256 *new256;
+
+ /* grow node from 128 to 256 */
+ new256 = (rt_node_inner_256 *) rt_grow_node(tree, (rt_node *) n128,
+ RT_NODE_KIND_256);
+ for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n128->base.n.count; i++)
+ {
+ if (!node_128_is_chunk_used((rt_node_base_128 *) n128, i))
+ continue;
+
+ node_inner_256_set(new256, i, node_inner_128_get_child(n128, i));
+ cnt++;
+ }
+
+ Assert(parent != NULL);
+ rt_replace_node(tree, parent, (rt_node *) n128, (rt_node *) new256,
+ key);
+ node = (rt_node *) new256;
+ }
+ else
+ {
+ node_inner_128_insert(n128, chunk, child);
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_256:
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+
+ chunk_exists = node_inner_256_is_chunk_used(n256, chunk);
+ Assert(chunk_exists || NODE_HAS_FREE_SLOT(n256));
+
+ node_inner_256_set(n256, chunk, child);
+ break;
+ }
+ }
+
+ /* Update statistics */
+ if (!chunk_exists)
+ node->count++;
+
+ /*
+	 * Done. Finally, verify that the chunk and child were inserted or
+	 * replaced properly in the node.
+ */
+ rt_verify_node(node);
+
+ return chunk_exists;
+}
+
+/* Insert the value to the leaf node */
+static bool
+rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
+ uint64 key, uint64 value)
+{
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool chunk_exists = false;
+
+ Assert(NODE_IS_LEAF(node));
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+ int idx;
+
+ idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n4->values[idx] = value;
+ break;
+ }
+
+ if (unlikely(!NODE_HAS_FREE_SLOT(n4)))
+ {
+ rt_node_leaf_32 *new32;
+
+ /* grow node from 4 to 32 */
+ new32 = (rt_node_leaf_32 *) rt_grow_node(tree, (rt_node *) n4,
+ RT_NODE_KIND_32);
+ chunk_values_array_copy(n4->base.chunks, n4->values,
+ new32->base.chunks, new32->values,
+ n4->base.n.count);
+
+ Assert(parent != NULL);
+ rt_replace_node(tree, parent, (rt_node *) n4, (rt_node *) new32,
+ key);
+ node = (rt_node *) new32;
+ }
+ else
+ {
+ int insertpos = node_4_get_insertpos((rt_node_base_4 *) n4, chunk);
+ int count = n4->base.n.count;
+
+ /* shift chunks and values */
+ if (count != 0 && insertpos < count)
+ chunk_values_array_shift(n4->base.chunks, n4->values,
+ count, insertpos);
+
+ n4->base.chunks[insertpos] = chunk;
+ n4->values[insertpos] = value;
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_32:
+ {
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+ int idx;
+
+ idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n32->values[idx] = value;
+ break;
+ }
+
+ if (unlikely(!NODE_HAS_FREE_SLOT(n32)))
+ {
+ rt_node_leaf_128 *new128;
+
+ /* grow node from 32 to 128 */
+ new128 = (rt_node_leaf_128 *) rt_grow_node(tree, (rt_node *) n32,
+ RT_NODE_KIND_128);
+ for (int i = 0; i < n32->base.n.count; i++)
+ node_leaf_128_insert(new128, n32->base.chunks[i], n32->values[i]);
+
+ Assert(parent != NULL);
+ rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new128,
+ key);
+ node = (rt_node *) new128;
+ }
+ else
+ {
+ int insertpos = node_32_get_insertpos((rt_node_base_32 *) n32, chunk);
+ int count = n32->base.n.count;
+
+ if (count != 0 && insertpos < count)
+ chunk_values_array_shift(n32->base.chunks, n32->values,
+ count, insertpos);
+
+ n32->base.chunks[insertpos] = chunk;
+ n32->values[insertpos] = value;
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_128:
+ {
+ rt_node_leaf_128 *n128 = (rt_node_leaf_128 *) node;
+ int cnt = 0;
+
+ if (node_128_is_chunk_used((rt_node_base_128 *) n128, chunk))
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ node_leaf_128_update(n128, chunk, value);
+ break;
+ }
+
+ if (unlikely(!NODE_HAS_FREE_SLOT(n128)))
+ {
+ rt_node_leaf_256 *new256;
+
+ /* grow node from 128 to 256 */
+ new256 = (rt_node_leaf_256 *) rt_grow_node(tree, (rt_node *) n128,
+ RT_NODE_KIND_256);
+ for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n128->base.n.count; i++)
+ {
+ if (!node_128_is_chunk_used((rt_node_base_128 *) n128, i))
+ continue;
+
+ node_leaf_256_set(new256, i, node_leaf_128_get_value(n128, i));
+ cnt++;
+ }
+
+ Assert(parent != NULL);
+ rt_replace_node(tree, parent, (rt_node *) n128, (rt_node *) new256,
+ key);
+ node = (rt_node *) new256;
+ }
+ else
+ {
+ node_leaf_128_insert(n128, chunk, value);
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_256:
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+
+ chunk_exists = node_leaf_256_is_chunk_used(n256, chunk);
+ Assert(chunk_exists || NODE_HAS_FREE_SLOT(n256));
+
+ node_leaf_256_set(n256, chunk, value);
+ break;
+ }
+ }
+
+ /* Update statistics */
+ if (!chunk_exists)
+ node->count++;
+
+ /*
+	 * Done. Finally, verify that the chunk and value were inserted or
+	 * replaced properly in the node.
+ */
+ rt_verify_node(node);
+
+ return chunk_exists;
+}
+
+/*
+ * Create the radix tree in the given memory context and return it.
+ */
+radix_tree *
+rt_create(MemoryContext ctx)
+{
+ radix_tree *tree;
+ MemoryContext old_ctx;
+
+ old_ctx = MemoryContextSwitchTo(ctx);
+
+ tree = palloc(sizeof(radix_tree));
+ tree->context = ctx;
+ tree->root = NULL;
+ tree->max_val = 0;
+ tree->num_keys = 0;
+
+ /* Create the slab allocator for each size class */
+ for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ {
+ tree->inner_slabs[i] = SlabContextCreate(ctx,
+ rt_node_kind_info[i].name,
+ rt_node_kind_info[i].inner_blocksize,
+ rt_node_kind_info[i].inner_size);
+ tree->leaf_slabs[i] = SlabContextCreate(ctx,
+ rt_node_kind_info[i].name,
+ rt_node_kind_info[i].leaf_blocksize,
+ rt_node_kind_info[i].leaf_size);
+#ifdef RT_DEBUG
+ tree->cnt[i] = 0;
+#endif
+ }
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return tree;
+}
+
+/*
+ * Free the given radix tree.
+ */
+void
+rt_free(radix_tree *tree)
+{
+ for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ {
+ MemoryContextDelete(tree->inner_slabs[i]);
+ MemoryContextDelete(tree->leaf_slabs[i]);
+ }
+
+ pfree(tree);
+}
+
+/*
+ * Set 'key' to 'value'. If the entry already exists, update its value to
+ * 'value' and return true; otherwise insert the new entry and return false.
+ */
+bool
+rt_set(radix_tree *tree, uint64 key, uint64 value)
+{
+ int shift;
+ bool updated;
+ rt_node *node;
+ rt_node *parent;
+
+ /* Empty tree, create the root */
+ if (!tree->root)
+ rt_new_root(tree, key);
+
+ /* Extend the tree if necessary */
+ if (key > tree->max_val)
+ rt_extend(tree, key);
+
+ Assert(tree->root);
+
+ shift = tree->root->shift;
+ node = parent = tree->root;
+
+ /* Descend the tree until a leaf node */
+ while (shift >= 0)
+ {
+ rt_node *child;
+
+ if (NODE_IS_LEAF(node))
+ break;
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ {
+ rt_set_extend(tree, key, value, parent, node);
+ return false;
+ }
+
+ parent = node;
+ node = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ updated = rt_node_insert_leaf(tree, parent, node, key, value);
+
+ /* Update the statistics */
+ if (!updated)
+ tree->num_keys++;
+
+ return updated;
+}
+
+/*
+ * Search for the given key in the radix tree. Return true if the key is found,
+ * otherwise return false. On success, the value is stored in *value_p, which
+ * therefore must not be NULL.
+ */
+bool
+rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
+{
+ rt_node *node;
+ int shift;
+
+ Assert(value_p != NULL);
+
+ if (!tree->root || key > tree->max_val)
+ return false;
+
+ node = tree->root;
+ shift = tree->root->shift;
+
+ /* Descend the tree until a leaf node */
+ while (shift >= 0)
+ {
+ rt_node *child;
+
+ if (NODE_IS_LEAF(node))
+ break;
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ return false;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ return rt_node_search_leaf(node, key, RT_ACTION_FIND, value_p);
+}
+
+/*
+ * Delete the given key from the radix tree. Return true if the key is found (and
+ * deleted), otherwise do nothing and return false.
+ */
+bool
+rt_delete(radix_tree *tree, uint64 key)
+{
+ rt_node *node;
+ rt_node *stack[RT_MAX_LEVEL] = {0};
+ int shift;
+ int level;
+ bool deleted;
+
+ if (!tree->root || key > tree->max_val)
+ return false;
+
+ /*
+ * Descend the tree to search the key while building a stack of nodes we
+ * visited.
+ */
+ node = tree->root;
+ shift = tree->root->shift;
+ level = -1;
+ while (shift > 0)
+ {
+ rt_node *child;
+
+ /* Push the current node to the stack */
+ stack[++level] = node;
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ return false;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+	/* Delete the key from the leaf node if it exists */
+ Assert(NODE_IS_LEAF(node));
+ deleted = rt_node_search_leaf(node, key, RT_ACTION_DELETE, NULL);
+
+ if (!deleted)
+ {
+ /* no key is found in the leaf node */
+ return false;
+ }
+
+ /* Found the key to delete. Update the statistics */
+ tree->num_keys--;
+
+ /*
+ * Return if the leaf node still has keys and we don't need to delete the
+ * node.
+ */
+ if (!NODE_IS_EMPTY(node))
+ return true;
+
+ /* Free the empty leaf node */
+ rt_free_node(tree, node);
+
+	/* Delete the key from the inner nodes, walking back up the stack */
+ while (level >= 0)
+ {
+ node = stack[level--];
+
+ deleted = rt_node_search_inner(node, key, RT_ACTION_DELETE, NULL);
+ Assert(deleted);
+
+ /* If the node didn't become empty, we stop deleting the key */
+ if (!NODE_IS_EMPTY(node))
+ break;
+
+ /* The node became empty */
+ rt_free_node(tree, node);
+ }
+
+ return true;
+}
+
+/* Create and return the iterator for the given radix tree */
+rt_iter *
+rt_begin_iterate(radix_tree *tree)
+{
+ MemoryContext old_ctx;
+ rt_iter *iter;
+ int top_level;
+
+ old_ctx = MemoryContextSwitchTo(tree->context);
+
+ iter = (rt_iter *) palloc0(sizeof(rt_iter));
+ iter->tree = tree;
+
+ /* empty tree */
+ if (!iter->tree->root)
+ return iter;
+
+ top_level = iter->tree->root->shift / RT_NODE_SPAN;
+ iter->stack_len = top_level;
+
+ /*
+	 * Descend from the root to the leftmost leaf node. The key is
+	 * constructed while descending to the leaf.
+ */
+ rt_update_iter_stack(iter, iter->tree->root, top_level);
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return iter;
+}
+
+/*
+ * Update each node_iter for inner nodes in the iterator node stack.
+ */
+static void
+rt_update_iter_stack(rt_iter *iter, rt_node *from_node, int from)
+{
+ int level = from;
+ rt_node *node = from_node;
+
+ for (;;)
+ {
+ rt_node_iter *node_iter = &(iter->stack[level--]);
+
+ node_iter->node = node;
+ node_iter->current_idx = -1;
+
+ /* We don't advance the leaf node iterator here */
+ if (NODE_IS_LEAF(node))
+ return;
+
+ /* Advance to the next slot in the inner node */
+ node = rt_node_inner_iterate_next(iter, node_iter);
+
+		/* We must find the first child in the node */
+ Assert(node);
+ }
+}
+
+/*
+ * Return true and set *key_p and *value_p if there is a next key. Otherwise,
+ * return false.
+ */
+bool
+rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p)
+{
+ /* Empty tree */
+ if (!iter->tree->root)
+ return false;
+
+ for (;;)
+ {
+ rt_node *child = NULL;
+ uint64 value;
+ int level;
+ bool found;
+
+		/* Advance the leaf node iterator to get the next key-value pair */
+ found = rt_node_leaf_iterate_next(iter, &(iter->stack[0]), &value);
+
+ if (found)
+ {
+ *key_p = iter->key;
+ *value_p = value;
+ return true;
+ }
+
+ /*
+		 * We've visited all values in the leaf node, so advance the inner-node
+		 * iterators, starting from level 1, until we find the next child node.
+ */
+ for (level = 1; level <= iter->stack_len; level++)
+ {
+ child = rt_node_inner_iterate_next(iter, &(iter->stack[level]));
+
+ if (child)
+ break;
+ }
+
+ /* the iteration finished */
+ if (!child)
+ return false;
+
+ /*
+		 * Found the next child node. Update the iterator stack from this
+		 * node down to the leaf level.
+ */
+ rt_update_iter_stack(iter, child, level - 1);
+
+ /* Node iterators are updated, so try again from the leaf */
+ }
+
+ return false;
+}
+
+void
+rt_end_iterate(rt_iter *iter)
+{
+ pfree(iter);
+}
+
+static inline void
+rt_iter_update_key(rt_iter *iter, uint8 chunk, uint8 shift)
+{
+ iter->key &= ~(((uint64) RT_CHUNK_MASK) << shift);
+ iter->key |= (((uint64) chunk) << shift);
+}
+
+/*
+ * Advance the slot in the inner node. Return the child if one exists,
+ * otherwise NULL.
+ */
+static inline rt_node *
+rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
+{
+ rt_node *child = NULL;
+ bool found = false;
+ uint8 key_chunk;
+
+ switch (node_iter->node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n4->base.n.count)
+ break;
+
+ child = n4->children[node_iter->current_idx];
+ key_chunk = n4->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n32->base.n.count)
+ break;
+
+ child = n32->children[node_iter->current_idx];
+ key_chunk = n32->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ rt_node_inner_128 *n128 = (rt_node_inner_128 *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (node_128_is_chunk_used((rt_node_base_128 *) n128, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+ child = node_inner_128_get_child(n128, i);
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (node_inner_256_is_chunk_used(n256, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+ child = node_inner_256_get_child(n256, i);
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ }
+
+ if (found)
+ rt_iter_update_key(iter, key_chunk, node_iter->node->shift);
+
+ return child;
+}
+
+/*
+ * Advance the slot in the leaf node. On success, return true and store the
+ * value in *value_p; otherwise return false.
+ */
+static inline bool
+rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
+ uint64 *value_p)
+{
+ rt_node *node = node_iter->node;
+ bool found = false;
+ uint64 value;
+ uint8 key_chunk;
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n4->base.n.count)
+ break;
+
+ value = n4->values[node_iter->current_idx];
+ key_chunk = n4->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n32->base.n.count)
+ break;
+
+ value = n32->values[node_iter->current_idx];
+ key_chunk = n32->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ rt_node_leaf_128 *n128 = (rt_node_leaf_128 *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (node_128_is_chunk_used((rt_node_base_128 *) n128, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+ value = node_leaf_128_get_value(n128, i);
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (node_leaf_256_is_chunk_used(n256, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+ value = node_leaf_256_get_value(n256, i);
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ }
+
+ if (found)
+ {
+ rt_iter_update_key(iter, key_chunk, node_iter->node->shift);
+ *value_p = value;
+ }
+
+ return found;
+}
+
+/*
+ * Return the number of keys in the radix tree.
+ */
+uint64
+rt_num_entries(radix_tree *tree)
+{
+ return tree->num_keys;
+}
+
+/*
+ * Return the amount of memory used by the radix tree.
+ */
+uint64
+rt_memory_usage(radix_tree *tree)
+{
+ Size total = sizeof(radix_tree);
+
+ for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ {
+ total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
+ total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
+ }
+
+ return total;
+}
+
+/*
+ * Verify the radix tree node.
+ */
+static void
+rt_verify_node(rt_node *node)
+{
+#ifdef USE_ASSERT_CHECKING
+ Assert(node->count >= 0);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_base_4 *n4 = (rt_node_base_4 *) node;
+
+ for (int i = 1; i < n4->n.count; i++)
+ Assert(n4->chunks[i - 1] < n4->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_base_32 *n32 = (rt_node_base_32 *) node;
+
+ for (int i = 1; i < n32->n.count; i++)
+ Assert(n32->chunks[i - 1] < n32->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ rt_node_base_128 *n128 = (rt_node_base_128 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!node_128_is_chunk_used(n128, i))
+ continue;
+
+ /* Check if the corresponding slot is used */
+ if (NODE_IS_LEAF(node))
+ Assert(node_leaf_128_is_slot_used((rt_node_leaf_128 *) node,
+ n128->slot_idxs[i]));
+ else
+ Assert(node_inner_128_is_slot_used((rt_node_inner_128 *) node,
+ n128->slot_idxs[i]));
+
+ cnt++;
+ }
+
+ Assert(n128->n.count == cnt);
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RT_NODE_NSLOTS_BITS(RT_NODE_MAX_SLOTS); i++)
+ cnt += pg_popcount32(n256->isset[i]);
+
+				/* Check if the number of used chunks matches the count */
+ Assert(n256->base.n.count == cnt);
+
+ break;
+ }
+ }
+ }
+#endif
+}
+
+/***************** DEBUG FUNCTIONS *****************/
+#ifdef RT_DEBUG
+void
+rt_stats(radix_tree *tree)
+{
+ ereport(LOG, (errmsg("num_keys = %lu, height = %u, n4 = %u, n32 = %u, n128 = %u, n256 = %u",
+ tree->num_keys,
+ tree->root->shift / RT_NODE_SPAN,
+ tree->cnt[0],
+ tree->cnt[1],
+ tree->cnt[2],
+ tree->cnt[3])));
+}
+
+static void
+rt_dump_node(rt_node *node, int level, bool recurse)
+{
+ char space[128] = {0};
+
+ fprintf(stderr, "[%s] kind %d, count %u, shift %u, chunk 0x%X:\n",
+ NODE_IS_LEAF(node) ? "LEAF" : "INNR",
+ (node->kind == RT_NODE_KIND_4) ? 4 :
+ (node->kind == RT_NODE_KIND_32) ? 32 :
+ (node->kind == RT_NODE_KIND_128) ? 128 : 256,
+ node->count, node->shift, node->chunk);
+
+ if (level > 0)
+ sprintf(space, "%*c", level * 4, ' ');
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, n4->base.chunks[i], n4->values[i]);
+ }
+ else
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, n4->base.chunks[i]);
+
+ if (recurse)
+ rt_dump_node(n4->children[i], level + 1, recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, n32->base.chunks[i], n32->values[i]);
+ }
+ else
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, n32->base.chunks[i]);
+
+ if (recurse)
+ {
+ rt_dump_node(n32->children[i], level + 1, recurse);
+ }
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ rt_node_base_128 *b128 = (rt_node_base_128 *) node;
+
+ fprintf(stderr, "slot_idxs ");
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!node_128_is_chunk_used(b128, i))
+ continue;
+
+ fprintf(stderr, " [%d]=%d, ", i, b128->slot_idxs[i]);
+ }
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_128 *n = (rt_node_leaf_128 *) node;
+
+ fprintf(stderr, ", isset-bitmap:");
+ for (int i = 0; i < 16; i++)
+ {
+ fprintf(stderr, "%X ", (uint8) n->isset[i]);
+ }
+ fprintf(stderr, "\n");
+ }
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!node_128_is_chunk_used(b128, i))
+ continue;
+
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_128 *n128 = (rt_node_leaf_128 *) b128;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, i, node_leaf_128_get_value(n128, i));
+ }
+ else
+ {
+ rt_node_inner_128 *n128 = (rt_node_inner_128 *) b128;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, i);
+
+ if (recurse)
+ rt_dump_node(node_inner_128_get_child(n128, i),
+ level + 1, recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+
+ if (!node_leaf_256_is_chunk_used(n256, i))
+ continue;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, i, node_leaf_256_get_value(n256, i));
+ }
+ else
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+
+ if (!node_inner_256_is_chunk_used(n256, i))
+ continue;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, i);
+
+ if (recurse)
+ rt_dump_node(node_inner_256_get_child(n256, i), level + 1,
+ recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ }
+}
+
+void
+rt_dump_search(radix_tree *tree, uint64 key)
+{
+ rt_node *node;
+ int shift;
+ int level = 0;
+
+ elog(NOTICE, "-----------------------------------------------------------");
+ elog(NOTICE, "max_val = " UINT64_FORMAT "(0x" UINT64_FORMAT_HEX ")",
+ tree->max_val, tree->max_val);
+
+ if (!tree->root)
+ {
+ elog(NOTICE, "tree is empty");
+ return;
+ }
+
+ if (key > tree->max_val)
+ {
+ elog(NOTICE, "key " UINT64_FORMAT "(0x" UINT64_FORMAT_HEX ") is larger than max val",
+ key, key);
+ return;
+ }
+
+ node = tree->root;
+ shift = tree->root->shift;
+ while (shift >= 0)
+ {
+ rt_node *child;
+
+ rt_dump_node(node, level, false);
+
+ if (NODE_IS_LEAF(node))
+ {
+ uint64 dummy;
+
+			/* We reached a leaf node; find the corresponding slot */
+ rt_node_search_leaf(node, key, RT_ACTION_FIND, &dummy);
+
+ break;
+ }
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ break;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ level++;
+ }
+}
+
+void
+rt_dump(radix_tree *tree)
+{
+ for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ fprintf(stderr, "%s\tinner_size %zu\tinner_blocksize %zu\tleaf_size %zu\tleaf_blocksize %zu\n",
+ rt_node_kind_info[i].name,
+ rt_node_kind_info[i].inner_size,
+ rt_node_kind_info[i].inner_blocksize,
+ rt_node_kind_info[i].leaf_size,
+ rt_node_kind_info[i].leaf_blocksize);
+ fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->max_val);
+
+ if (!tree->root)
+ {
+ fprintf(stderr, "empty tree\n");
+ return;
+ }
+
+ rt_dump_node(tree->root, 0, true);
+}
+#endif
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
new file mode 100644
index 0000000000..d5d7668617
--- /dev/null
+++ b/src/include/lib/radixtree.h
@@ -0,0 +1,42 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.h
+ * Interface for radix tree.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/lib/radixtree.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef RADIXTREE_H
+#define RADIXTREE_H
+
+#include "postgres.h"
+
+#define RT_DEBUG 1
+
+typedef struct radix_tree radix_tree;
+typedef struct rt_iter rt_iter;
+
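+/*
+ * Example usage (sketch):
+ *
+ *		radix_tree *rt = rt_create(CurrentMemoryContext);
+ *		uint64		value;
+ *
+ *		rt_set(rt, 42, 4242);
+ *		if (rt_search(rt, 42, &value))
+ *			Assert(value == 4242);
+ *		rt_delete(rt, 42);
+ *		rt_free(rt);
+ */
+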
+extern radix_tree *rt_create(MemoryContext ctx);
+extern void rt_free(radix_tree *tree);
+extern bool rt_search(radix_tree *tree, uint64 key, uint64 *val_p);
+extern bool rt_set(radix_tree *tree, uint64 key, uint64 val);
+extern rt_iter *rt_begin_iterate(radix_tree *tree);
+
+extern bool rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p);
+extern void rt_end_iterate(rt_iter *iter);
+extern bool rt_delete(radix_tree *tree, uint64 key);
+
+extern uint64 rt_memory_usage(radix_tree *tree);
+extern uint64 rt_num_entries(radix_tree *tree);
+
+#ifdef RT_DEBUG
+extern void rt_dump(radix_tree *tree);
+extern void rt_dump_search(radix_tree *tree, uint64 key);
+extern void rt_stats(radix_tree *tree);
+#endif
+
+#endif /* RADIXTREE_H */
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 96addded81..11d0ec5b07 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -27,6 +27,7 @@ SUBDIRS = \
test_parser \
test_pg_dump \
test_predtest \
+ test_radixtree \
test_rbtree \
test_regex \
test_rls_hooks \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 1d26544854..568823b221 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -21,6 +21,7 @@ subdir('test_oat_hooks')
subdir('test_parser')
subdir('test_pg_dump')
subdir('test_predtest')
+subdir('test_radixtree')
subdir('test_rbtree')
subdir('test_regex')
subdir('test_rls_hooks')
diff --git a/src/test/modules/test_radixtree/.gitignore b/src/test/modules/test_radixtree/.gitignore
new file mode 100644
index 0000000000..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/test_radixtree/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_radixtree/Makefile b/src/test/modules/test_radixtree/Makefile
new file mode 100644
index 0000000000..da06b93da3
--- /dev/null
+++ b/src/test/modules/test_radixtree/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_radixtree/Makefile
+
+MODULE_big = test_radixtree
+OBJS = \
+ $(WIN32RES) \
+ test_radixtree.o
+PGFILEDESC = "test_radixtree - test code for src/backend/lib/radixtree.c"
+
+EXTENSION = test_radixtree
+DATA = test_radixtree--1.0.sql
+
+REGRESS = test_radixtree
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_radixtree
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_radixtree/README b/src/test/modules/test_radixtree/README
new file mode 100644
index 0000000000..a8b271869a
--- /dev/null
+++ b/src/test/modules/test_radixtree/README
@@ -0,0 +1,7 @@
+test_radixtree contains unit tests for testing the radix tree implementation
+in src/backend/lib/radixtree.c.
+
+The tests verify the correctness of the implementation, but they can also be
+used as a micro-benchmark. If you set the 'rt_test_stats' flag in
+test_radixtree.c, the tests will print extra information about execution time
+and memory usage.
diff --git a/src/test/modules/test_radixtree/expected/test_radixtree.out b/src/test/modules/test_radixtree/expected/test_radixtree.out
new file mode 100644
index 0000000000..5242538cec
--- /dev/null
+++ b/src/test/modules/test_radixtree/expected/test_radixtree.out
@@ -0,0 +1,32 @@
+CREATE EXTENSION test_radixtree;
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
+NOTICE: testing basic operations with node 4
+NOTICE: testing basic operations with node 32
+NOTICE: testing basic operations with node 128
+NOTICE: testing basic operations with node 256
+NOTICE: testing radix tree node types with shift "0"
+NOTICE: testing radix tree node types with shift "8"
+NOTICE: testing radix tree node types with shift "16"
+NOTICE: testing radix tree node types with shift "24"
+NOTICE: testing radix tree node types with shift "32"
+NOTICE: testing radix tree node types with shift "40"
+NOTICE: testing radix tree node types with shift "48"
+NOTICE: testing radix tree node types with shift "56"
+NOTICE: testing radix tree with pattern "all ones"
+NOTICE: testing radix tree with pattern "alternating bits"
+NOTICE: testing radix tree with pattern "clusters of ten"
+NOTICE: testing radix tree with pattern "clusters of hundred"
+NOTICE: testing radix tree with pattern "one-every-64k"
+NOTICE: testing radix tree with pattern "sparse"
+NOTICE: testing radix tree with pattern "single values, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^60"
+ test_radixtree
+----------------
+
+(1 row)
+
diff --git a/src/test/modules/test_radixtree/meson.build b/src/test/modules/test_radixtree/meson.build
new file mode 100644
index 0000000000..f96bf159d6
--- /dev/null
+++ b/src/test/modules/test_radixtree/meson.build
@@ -0,0 +1,34 @@
+# FIXME: prevent install during main install, but not during test :/
+
+test_radixtree_sources = files(
+ 'test_radixtree.c',
+)
+
+if host_system == 'windows'
+ test_radixtree_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'test_radixtree',
+ '--FILEDESC', 'test_radixtree - test code for src/backend/lib/radixtree.c',])
+endif
+
+test_radixtree = shared_module('test_radixtree',
+ test_radixtree_sources,
+ kwargs: pg_mod_args,
+)
+testprep_targets += test_radixtree
+
+install_data(
+ 'test_radixtree.control',
+ 'test_radixtree--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'test_radixtree',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'test_radixtree',
+ ],
+ },
+}
diff --git a/src/test/modules/test_radixtree/sql/test_radixtree.sql b/src/test/modules/test_radixtree/sql/test_radixtree.sql
new file mode 100644
index 0000000000..41ece5e9f5
--- /dev/null
+++ b/src/test/modules/test_radixtree/sql/test_radixtree.sql
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_radixtree;
+
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
diff --git a/src/test/modules/test_radixtree/test_radixtree--1.0.sql b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
new file mode 100644
index 0000000000..074a5a7ea7
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_radixtree/test_radixtree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_radixtree" to load this file. \quit
+
+CREATE FUNCTION test_radixtree()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
new file mode 100644
index 0000000000..4198d7e976
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -0,0 +1,582 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_radixtree.c
+ * Test radixtree set data structure.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/test/modules/test_radixtree/test_radixtree.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "lib/radixtree.h"
+#include "miscadmin.h"
+#include "nodes/bitmapset.h"
+#include "storage/block.h"
+#include "storage/itemptr.h"
+#include "utils/memutils.h"
+#include "utils/timestamp.h"
+
+#define UINT64_HEX_FORMAT "%" INT64_MODIFIER "X"
+
+/*
+ * If you enable this, the "pattern" tests will print information about
+ * how long populating, probing, and iterating the test set takes, and
+ * how much memory the test set consumed. That can be used as a
+ * micro-benchmark of various operations and input patterns (you might
+ * want to increase the number of values used in each of the tests, if
+ * you do that, to reduce noise).
+ *
+ * The information is printed to the server's stderr, mostly because
+ * that's where MemoryContextStats() output goes.
+ */
+static const bool rt_test_stats = false;
+
+/*
+ * A struct to define a pattern of integers, for use with the test_pattern()
+ * function.
+ */
+typedef struct
+{
+ char *test_name; /* short name of the test, for humans */
+ char *pattern_str; /* a bit pattern */
+ uint64 spacing; /* pattern repeats at this interval */
+ uint64 num_values; /* number of integers to set in total */
+} test_spec;
+
+/* Test patterns borrowed from test_integerset.c */
+static const test_spec test_specs[] = {
+ {
+ "all ones", "1111111111",
+ 10, 1000000
+ },
+ {
+ "alternating bits", "0101010101",
+ 10, 1000000
+ },
+ {
+ "clusters of ten", "1111111111",
+ 10000, 1000000
+ },
+ {
+ "clusters of hundred",
+ "1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111",
+ 10000, 1000000
+ },
+ {
+ "one-every-64k", "1",
+ 65536, 1000000
+ },
+ {
+ "sparse", "100000000000000000000000000000001",
+ 10000000, 1000000
+ },
+ {
+ "single values, distance > 2^32", "1",
+ UINT64CONST(10000000000), 100000
+ },
+ {
+ "clusters, distance > 2^32", "10101010",
+ UINT64CONST(10000000000), 1000000
+ },
+ {
+ "clusters, distance > 2^60", "10101010",
+ UINT64CONST(2000000000000000000),
+ 23 /* can't be much higher than this, or we
+ * overflow uint64 */
+ }
+};
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(test_radixtree);
+
+static void
+test_empty(void)
+{
+ radix_tree *radixtree;
+ rt_iter *iter;
+ uint64 dummy;
+ uint64 key;
+ uint64 val;
+
+ radixtree = rt_create(CurrentMemoryContext);
+
+ if (rt_search(radixtree, 0, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, 1, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, PG_UINT64_MAX, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_delete(radixtree, 0))
+ elog(ERROR, "rt_delete on empty tree returned true");
+
+ if (rt_num_entries(radixtree) != 0)
+ elog(ERROR, "rt_num_entries on empty tree return non-zero");
+
+ iter = rt_begin_iterate(radixtree);
+
+ if (rt_iterate_next(iter, &key, &val))
+ elog(ERROR, "rt_itereate_next on empty tree returned true");
+
+ rt_end_iterate(iter);
+
+ rt_free(radixtree);
+}
+
+static void
+test_basic(int children)
+{
+ radix_tree *radixtree;
+
+ elog(NOTICE, "testing basic operations with node %d", children);
+
+ radixtree = rt_create(CurrentMemoryContext);
+
+ /* insert key-value pairs like 1, 32, 2, 31, 3, 30 ... */
+ for (int i = 0; i < children / 2; i++)
+ {
+ uint64 x;
+
+ x = i + 1;
+ if (rt_set(radixtree, x, x))
+ elog(ERROR, "new inserted key 0x" UINT64_HEX_FORMAT "is found", x);
+ x = children - i;
+ if (rt_set(radixtree, x, x))
+ elog(ERROR, "new inserted key 0x" UINT64_HEX_FORMAT "is found", x);
+ }
+
+ /* update these keys */
+ for (int i = 0; i < children / 2; i++)
+ {
+ uint64 x;
+
+ x = i + 1;
+ if (!rt_set(radixtree, x, x + 1))
+ elog(ERROR, "could not update key 0x" UINT64_HEX_FORMAT, x);
+ x = children - i;
+ if (!rt_set(radixtree, x, x + 1))
+ elog(ERROR, "could not update key 0x" UINT64_HEX_FORMAT, x);
+ }
+
+ /* delete these keys */
+ for (int i = 0; i < children / 2; i++)
+ {
+ uint64 x;
+
+ x = i + 1;
+ if (!rt_delete(radixtree, x))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, x);
+ x = children - i;
+ if (!rt_delete(radixtree, x))
+ elog(ERROR, "could not update key 0x" UINT64_HEX_FORMAT, x);
+ }
+
+ rt_free(radixtree);
+}
+
+/*
+ * Check if keys from start to end with the shift exist in the tree.
+ */
+static void
+check_search_on_node(radix_tree *radixtree, uint8 shift, int start, int end,
+ int incr)
+{
+ for (int i = start; i < end; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ uint64 val;
+
+ if (!rt_search(radixtree, key, &val))
+ elog(ERROR, "key 0x" UINT64_HEX_FORMAT " is not found on node-%d",
+ key, end);
+ if (val != key)
+ elog(ERROR, "rt_search with key 0x" UINT64_HEX_FORMAT " returns 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ key, val, key);
+ }
+}
+
+static void
+test_node_types_insert(radix_tree *radixtree, uint8 shift, bool insert_asc)
+{
+ uint64 num_entries;
+ int ninserted = 0;
+ int start = insert_asc ? 0 : 256;
+ int incr = insert_asc ? 1 : -1;
+ int end = insert_asc ? 256 : 0;
+ int node_kind_idx = 1;
+ static int rt_node_max_entries[] = {
+ 0,
+ 4, /* RT_NODE_KIND_4 */
+ 32, /* RT_NODE_KIND_32 */
+ 128, /* RT_NODE_KIND_128 */
+ 256 /* RT_NODE_KIND_256 */
+ };
+
+ for (int i = start; i != end; i += incr)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_set(radixtree, key, key);
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " is found", key);
+
+ /*
+ * After filling all slots in each node type, check if the values
+ * are stored properly.
+ */
+ if (ninserted == rt_node_max_entries[node_kind_idx] - 1)
+ {
+ int check_start = insert_asc
+ ? rt_node_max_entries[node_kind_idx - 1]
+ : rt_node_max_entries[node_kind_idx];
+ int check_end = insert_asc
+ ? rt_node_max_entries[node_kind_idx]
+ : rt_node_max_entries[node_kind_idx - 1];
+
+ check_search_on_node(radixtree, shift, check_start, check_end, incr);
+ node_kind_idx++;
+ }
+
+ ninserted++;
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ if (num_entries != 256)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(256));
+}
+
+static void
+test_node_types_delete(radix_tree *radixtree, uint8 shift)
+{
+ uint64 num_entries;
+
+ for (int i = 0; i < 256; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_delete(radixtree, key);
+
+ if (!found)
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, key);
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ /* The tree must be empty */
+ if (num_entries != 0)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(0));
+}
+
+/*
+ * Test for inserting and deleting key-value pairs to each node type at the given shift
+ * level.
+ */
+static void
+test_node_types(uint8 shift)
+{
+ radix_tree *radixtree;
+
+ elog(NOTICE, "testing radix tree node types with shift \"%d\"", shift);
+
+ radixtree = rt_create(CurrentMemoryContext);
+
+ /*
+ * Insert and search entries for every node type at the 'shift' level,
+ * then delete all entries to make it empty, and insert and search entries
+ * again.
+ */
+ test_node_types_insert(radixtree, shift, true);
+ test_node_types_delete(radixtree, shift);
+ test_node_types_insert(radixtree, shift, false);
+
+ rt_free(radixtree);
+}
+
+/*
+ * Test with a repeating pattern, defined by the 'spec'.
+ */
+static void
+test_pattern(const test_spec * spec)
+{
+ radix_tree *radixtree;
+ rt_iter *iter;
+ MemoryContext radixtree_ctx;
+ TimestampTz starttime;
+ TimestampTz endtime;
+ uint64 n;
+ uint64 last_int;
+ uint64 ndeleted;
+ uint64 nbefore;
+ uint64 nafter;
+ int patternlen;
+ uint64 *pattern_values;
+ uint64 pattern_num_values;
+
+ elog(NOTICE, "testing radix tree with pattern \"%s\"", spec->test_name);
+ if (rt_test_stats)
+ fprintf(stderr, "-----\ntesting radix tree with pattern \"%s\"\n", spec->test_name);
+
+ /* Pre-process the pattern, creating an array of integers from it. */
+ patternlen = strlen(spec->pattern_str);
+ pattern_values = palloc(patternlen * sizeof(uint64));
+ pattern_num_values = 0;
+ for (int i = 0; i < patternlen; i++)
+ {
+ if (spec->pattern_str[i] == '1')
+ pattern_values[pattern_num_values++] = i;
+ }
+
+ /*
+ * Allocate the radix tree.
+ *
+ * Allocate it in a separate memory context, so that we can print its
+ * memory usage easily.
+ */
+ radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
+ "radixtree test",
+ ALLOCSET_SMALL_SIZES);
+ MemoryContextSetIdentifier(radixtree_ctx, spec->test_name);
+ radixtree = rt_create(radixtree_ctx);
+
+ /*
+ * Add values to the set.
+ */
+ starttime = GetCurrentTimestamp();
+
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ uint64 x = 0;
+
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ bool found;
+
+ x = last_int + pattern_values[i];
+
+ found = rt_set(radixtree, x, x);
+
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " found", x);
+
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+
+ endtime = GetCurrentTimestamp();
+
+ if (rt_test_stats)
+ fprintf(stderr, "added " UINT64_FORMAT " values in %d ms\n",
+ spec->num_values, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Print stats on the amount of memory used.
+ *
+ * We print the usage reported by rt_memory_usage(), as well as the stats
+ * from the memory context. They should be in the same ballpark, but it's
+ * hard to automate testing that, so if you're making changes to the
+ * implementation, just observe that manually.
+ */
+ if (rt_test_stats)
+ {
+ uint64 mem_usage;
+
+ /*
+ * Also print memory usage as reported by rt_memory_usage(). It
+ * should be in the same ballpark as the usage reported by
+ * MemoryContextStats().
+ */
+ mem_usage = rt_memory_usage(radixtree);
+ fprintf(stderr, "rt_memory_usage() reported " UINT64_FORMAT " (%0.2f bytes / integer)\n",
+ mem_usage, (double) mem_usage / spec->num_values);
+
+ MemoryContextStats(radixtree_ctx);
+ }
+
+ /* Check that rt_num_entries works */
+ n = rt_num_entries(radixtree);
+ if (n != spec->num_values)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT, n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_search()
+ */
+ starttime = GetCurrentTimestamp();
+
+ for (n = 0; n < 100000; n++)
+ {
+ bool found;
+ bool expected;
+ uint64 x;
+ uint64 v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Do we expect this value to be present in the set? */
+ if (x >= last_int)
+ expected = false;
+ else
+ {
+ uint64 idx = x % spec->spacing;
+
+ if (idx >= patternlen)
+ expected = false;
+ else if (spec->pattern_str[idx] == '1')
+ expected = true;
+ else
+ expected = false;
+ }
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (found != expected)
+ elog(ERROR, "mismatch at 0x" UINT64_HEX_FORMAT ": %d vs %d", x, found, expected);
+ if (found && (v != x))
+ elog(ERROR, "found 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ v, x);
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "probed " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Test iterator
+ */
+ starttime = GetCurrentTimestamp();
+
+ iter = rt_begin_iterate(radixtree);
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ uint64 expected = last_int + pattern_values[i];
+ uint64 x;
+ uint64 val;
+
+ if (!rt_iterate_next(iter, &x, &val))
+ break;
+
+ if (x != expected)
+ elog(ERROR,
+ "iterate returned wrong key; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d",
+ x, expected, i);
+ if (val != expected)
+ elog(ERROR,
+ "iterate returned wrong value; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d", x, expected, i);
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "iterated " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ rt_end_iterate(iter);
+
+ if (n < spec->num_values)
+ elog(ERROR, "iterator stopped short after " UINT64_FORMAT " entries, expected " UINT64_FORMAT, n, spec->num_values);
+ if (n > spec->num_values)
+ elog(ERROR, "iterator returned " UINT64_FORMAT " entries, " UINT64_FORMAT " was expected", n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_delete()
+ */
+ starttime = GetCurrentTimestamp();
+
+ nbefore = rt_num_entries(radixtree);
+ ndeleted = 0;
+ for (n = 0; n < 1; n++)
+ {
+ bool found;
+ uint64 x;
+ uint64 v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (!found)
+ continue;
+
+ /* If the key is found, delete it and check again */
+ if (!rt_delete(radixtree, x))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_search(radixtree, x, &v))
+ elog(ERROR, "found deleted key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_delete(radixtree, x))
+ elog(ERROR, "deleted already-deleted key 0x" UINT64_HEX_FORMAT, x);
+
+ ndeleted++;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "deleted " UINT64_FORMAT " values in %d ms\n",
+ ndeleted, (int) (endtime - starttime) / 1000);
+
+ nafter = rt_num_entries(radixtree);
+
+ /* Check that rt_num_entries works */
+ if ((nbefore - ndeleted) != nafter)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT "after " UINT64_FORMAT " deletion",
+ nafter, (nbefore - ndeleted), ndeleted);
+
+ MemoryContextDelete(radixtree_ctx);
+}
+
+Datum
+test_radixtree(PG_FUNCTION_ARGS)
+{
+ test_empty();
+
+ test_basic(4);
+ test_basic(32);
+ test_basic(128);
+ test_basic(256);
+
+ for (int shift = 0; shift <= (64 - 8); shift += 8)
+ test_node_types(shift);
+
+ /* Test different test patterns, with lots of entries */
+ for (int i = 0; i < lengthof(test_specs); i++)
+ test_pattern(&test_specs[i]);
+
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_radixtree/test_radixtree.control b/src/test/modules/test_radixtree/test_radixtree.control
new file mode 100644
index 0000000000..e53f2a3e0c
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.control
@@ -0,0 +1,4 @@
+comment = 'Test code for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/test_radixtree'
+relocatable = true
--
2.31.1
Attachment: v11-0005-Make-all-node-kinds-variable-sized.patch (application/x-patch)
From 5bab5b1c57233ceecaa46cb155e7b0f1e9e7d2b5 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Thu, 24 Nov 2022 12:02:22 +0900
Subject: [PATCH v11 5/6] Make all node kinds variable sized
Add one additional size class for each of the node kinds 4, 32, and 128,
with fanouts 1, 15, and 61, respectively. The inner/leaf node sizes with
the new size classes are 24/24, 160/160, and 752/768 bytes, respectively.
For example, in size class 15, when a 16th element is to be inserted,
allocate a larger area and memcpy the entire old node to it.
This technique allows us to limit the node kinds to 4, which
1. limits the number of cases in switch statements
2. allows a possible future optimization to encode the node kind
in a pointer tag
---
src/backend/lib/radixtree.c | 470 +++++++++++++++++++++++++-----------
1 file changed, 329 insertions(+), 141 deletions(-)
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
index b71545e031..f10abd8add 100644
--- a/src/backend/lib/radixtree.c
+++ b/src/backend/lib/radixtree.c
@@ -133,8 +133,11 @@ typedef enum
typedef enum rt_size_class
{
- RT_CLASS_4_FULL = 0,
+ RT_CLASS_4_PARTIAL = 0,
+ RT_CLASS_4_FULL,
+ RT_CLASS_32_PARTIAL,
RT_CLASS_32_FULL,
+ RT_CLASS_128_PARTIAL,
RT_CLASS_128_FULL,
RT_CLASS_256
@@ -151,6 +154,8 @@ typedef struct rt_node
uint16 count;
/* Max number of children. We can use uint8 because we never need to store 256 */
+ /* WIP: if we don't have a variable sized node4, this should instead be in the base
+ types as needed, since saving every byte is crucial for the smallest node kind */
uint8 fanout;
/*
@@ -168,8 +173,12 @@ typedef struct rt_node
#define NODE_IS_EMPTY(n) (((rt_node *) (n))->count == 0)
#define NODE_HAS_FREE_SLOT(node) \
((node)->base.n.count < (node)->base.n.fanout)
+#define NODE_NEEDS_TO_GROW_CLASS(node, class) \
+ (((node)->base.n.count) == (rt_size_class_info[(class)].fanout))
/* Base type of each node kinds for leaf and inner nodes */
+/* The base types must be able to accommodate the largest size
+class for variable-sized node kinds */
typedef struct rt_node_base_4
{
rt_node n;
@@ -221,40 +230,40 @@ typedef struct rt_node_inner_4
{
rt_node_base_4 base;
- /* 4 children, for key chunks */
- rt_node *children[4];
+ /* number of children depends on size class */
+ rt_node *children[FLEXIBLE_ARRAY_MEMBER];
} rt_node_inner_4;
typedef struct rt_node_leaf_4
{
rt_node_base_4 base;
- /* 4 values, for key chunks */
- uint64 values[4];
+ /* number of values depends on size class */
+ uint64 values[FLEXIBLE_ARRAY_MEMBER];
} rt_node_leaf_4;
typedef struct rt_node_inner_32
{
rt_node_base_32 base;
- /* 32 children, for key chunks */
- rt_node *children[32];
+ /* number of children depends on size class */
+ rt_node *children[FLEXIBLE_ARRAY_MEMBER];
} rt_node_inner_32;
typedef struct rt_node_leaf_32
{
rt_node_base_32 base;
- /* 32 values, for key chunks */
- uint64 values[32];
+ /* number of values depends on size class */
+ uint64 values[FLEXIBLE_ARRAY_MEMBER];
} rt_node_leaf_32;
typedef struct rt_node_inner_128
{
rt_node_base_128 base;
- /* Slots for 128 children */
- rt_node *children[128];
+ /* number of children depends on size class */
+ rt_node *children[FLEXIBLE_ARRAY_MEMBER];
} rt_node_inner_128;
typedef struct rt_node_leaf_128
@@ -264,8 +273,8 @@ typedef struct rt_node_leaf_128
/* isset is a bitmap to track which slot is in use */
uint8 isset[RT_NODE_NSLOTS_BITS(128)];
- /* Slots for 128 values */
- uint64 values[128];
+ /* number of values depends on size class */
+ uint64 values[FLEXIBLE_ARRAY_MEMBER];
} rt_node_leaf_128;
/*
@@ -311,32 +320,55 @@ typedef struct rt_size_class_elem
* from the block.
*/
#define NODE_SLAB_BLOCK_SIZE(size) \
- Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * size, (size) * 32)
+ Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * (size), (size) * 32)
static rt_size_class_elem rt_size_class_info[RT_SIZE_CLASS_COUNT] = {
-
+ [RT_CLASS_4_PARTIAL] = {
+ .name = "radix tree node 1",
+ .fanout = 1,
+ .inner_size = sizeof(rt_node_inner_4) + 1 * sizeof(rt_node *),
+ .leaf_size = sizeof(rt_node_leaf_4) + 1 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_4) + 1 * sizeof(rt_node *)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_4) + 1 * sizeof(uint64)),
+ },
[RT_CLASS_4_FULL] = {
.name = "radix tree node 4",
.fanout = 4,
- .inner_size = sizeof(rt_node_inner_4),
- .leaf_size = sizeof(rt_node_leaf_4),
- .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_4)),
- .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_4)),
+ .inner_size = sizeof(rt_node_inner_4) + 4 * sizeof(rt_node *),
+ .leaf_size = sizeof(rt_node_leaf_4) + 4 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_4) + 4 * sizeof(rt_node *)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_4) + 4 * sizeof(uint64)),
+ },
+ [RT_CLASS_32_PARTIAL] = {
+ .name = "radix tree node 15",
+ .fanout = 15,
+ .inner_size = sizeof(rt_node_inner_32) + 15 * sizeof(rt_node *),
+ .leaf_size = sizeof(rt_node_leaf_32) + 15 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_32) + 15 * sizeof(rt_node *)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_32) + 15 * sizeof(uint64)),
},
[RT_CLASS_32_FULL] = {
.name = "radix tree node 32",
.fanout = 32,
- .inner_size = sizeof(rt_node_inner_32),
- .leaf_size = sizeof(rt_node_leaf_32),
- .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_32)),
- .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_32)),
+ .inner_size = sizeof(rt_node_inner_32) + 32 * sizeof(rt_node *),
+ .leaf_size = sizeof(rt_node_leaf_32) + 32 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_32) + 32 * sizeof(rt_node *)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_32) + 32 * sizeof(uint64)),
+ },
+ [RT_CLASS_128_PARTIAL] = {
+ .name = "radix tree node 61",
+ .fanout = 61,
+ .inner_size = sizeof(rt_node_inner_128) + 61 * sizeof(rt_node *),
+ .leaf_size = sizeof(rt_node_leaf_128) + 61 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_128) + 61 * sizeof(rt_node *)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_128) + 61 * sizeof(uint64)),
},
[RT_CLASS_128_FULL] = {
.name = "radix tree node 128",
.fanout = 128,
- .inner_size = sizeof(rt_node_inner_128),
- .leaf_size = sizeof(rt_node_leaf_128),
- .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_128)),
- .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_128)),
+ .inner_size = sizeof(rt_node_inner_128) + 128 * sizeof(rt_node *),
+ .leaf_size = sizeof(rt_node_leaf_128) + 128 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_128) + 128 * sizeof(rt_node *)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_128) + 128 * sizeof(uint64)),
},
[RT_CLASS_256] = {
.name = "radix tree node 256",
@@ -352,9 +384,9 @@ static rt_size_class_elem rt_size_class_info[RT_SIZE_CLASS_COUNT] = {
/* Map from the node kind to its minimum size class */
static rt_size_class kind_min_size_class[RT_NODE_KIND_COUNT] = {
- [RT_NODE_KIND_4] = RT_CLASS_4_FULL,
- [RT_NODE_KIND_32] = RT_CLASS_32_FULL,
- [RT_NODE_KIND_128] = RT_CLASS_128_FULL,
+ [RT_NODE_KIND_4] = RT_CLASS_4_PARTIAL,
+ [RT_NODE_KIND_32] = RT_CLASS_32_PARTIAL,
+ [RT_NODE_KIND_128] = RT_CLASS_128_PARTIAL,
[RT_NODE_KIND_256] = RT_CLASS_256,
};
@@ -867,7 +899,7 @@ rt_new_root(radix_tree *tree, uint64 key)
int shift = key_get_shift(key);
rt_node *node;
- node = (rt_node *) rt_alloc_init_node(tree, RT_NODE_KIND_4, RT_CLASS_4_FULL,
+ node = (rt_node *) rt_alloc_init_node(tree, RT_NODE_KIND_4, RT_CLASS_4_PARTIAL,
shift, 0, shift > 0);
tree->max_val = shift_get_max_val(shift);
tree->root = node;
@@ -965,7 +997,6 @@ rt_free_node(radix_tree *tree, rt_node *node)
#ifdef RT_DEBUG
/* update the statistics */
- // FIXME
for (i = 0; i < RT_SIZE_CLASS_COUNT; i++)
{
if (node->fanout == rt_size_class_info[i].fanout)
@@ -1021,7 +1052,7 @@ rt_extend(radix_tree *tree, uint64 key)
{
rt_node_inner_4 *node;
- node = (rt_node_inner_4 *) rt_alloc_init_node(tree, RT_NODE_KIND_4, RT_CLASS_4_FULL,
+ node = (rt_node_inner_4 *) rt_alloc_init_node(tree, RT_NODE_KIND_4, RT_CLASS_4_PARTIAL,
shift, 0, true);
node->base.n.count = 1;
node->base.chunks[0] = 0;
@@ -1051,7 +1082,7 @@ rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node *parent,
rt_node *newchild;
int newshift = shift - RT_NODE_SPAN;
- newchild = rt_alloc_init_node(tree, RT_NODE_KIND_4, RT_CLASS_4_FULL, newshift,
+ newchild = rt_alloc_init_node(tree, RT_NODE_KIND_4, RT_CLASS_4_PARTIAL, newshift,
RT_GET_KEY_CHUNK(key, node->shift),
newshift > 0);
rt_node_insert_inner(tree, parent, node, key, newchild);
@@ -1279,33 +1310,63 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
if (unlikely(!NODE_HAS_FREE_SLOT(n4)))
{
- rt_node_inner_32 *new32;
+ Assert(parent != NULL);
- /* grow node from 4 to 32 */
- new32 = (rt_node_inner_32 *) rt_grow_node_kind(tree, (rt_node *) n4,
- RT_NODE_KIND_32);
- chunk_children_array_copy(n4->base.chunks, n4->children,
- new32->base.chunks, new32->children,
- n4->base.n.count);
+ if (NODE_NEEDS_TO_GROW_CLASS(n4, RT_CLASS_4_PARTIAL))
+ {
+ rt_node_inner_4 *new4;
- Assert(parent != NULL);
- rt_replace_node(tree, parent, (rt_node *) n4, (rt_node *) new32,
- key);
- node = (rt_node *) new32;
+ /*
+ * Use the same node kind, but expand to the next size class. We
+ * copy the entire old node -- the new node is only different in
+ * having additional slots so we only have to change the fanout.
+ */
+ new4 = (rt_node_inner_4 *) rt_alloc_node(tree, RT_CLASS_4_FULL, true);
+ memcpy(new4, n4, rt_size_class_info[RT_CLASS_4_PARTIAL].inner_size);
+ new4->base.n.fanout = rt_size_class_info[RT_CLASS_4_FULL].fanout;
+
+ rt_replace_node(tree, parent, (rt_node *) n4, (rt_node *) new4,
+ key);
+
+ /* must update both pointers here */
+ node = (rt_node *) new4;
+ n4 = new4;
+
+ goto retry_insert_inner_4;
+ }
+ else
+ {
+ rt_node_inner_32 *new32;
+
+ /* grow node from 4 to 32 */
+ new32 = (rt_node_inner_32 *) rt_grow_node_kind(tree, (rt_node *) n4,
+ RT_NODE_KIND_32);
+ chunk_children_array_copy(n4->base.chunks, n4->children,
+ new32->base.chunks, new32->children,
+ n4->base.n.count);
+
+ Assert(parent != NULL);
+ rt_replace_node(tree, parent, (rt_node *) n4, (rt_node *) new32,
+ key);
+ node = (rt_node *) new32;
+ }
}
else
{
- int insertpos = node_4_get_insertpos((rt_node_base_4 *) n4, chunk);
- uint16 count = n4->base.n.count;
+ retry_insert_inner_4:
+ {
+ int insertpos = node_4_get_insertpos((rt_node_base_4 *) n4, chunk);
+ uint16 count = n4->base.n.count;
- /* shift chunks and children */
- if (count != 0 && insertpos < count)
- chunk_children_array_shift(n4->base.chunks, n4->children,
- count, insertpos);
+ /* shift chunks and children */
+ if (count != 0 && insertpos < count)
+ chunk_children_array_shift(n4->base.chunks, n4->children,
+ count, insertpos);
- n4->base.chunks[insertpos] = chunk;
- n4->children[insertpos] = child;
- break;
+ n4->base.chunks[insertpos] = chunk;
+ n4->children[insertpos] = child;
+ break;
+ }
}
}
/* FALLTHROUGH */
@@ -1325,31 +1386,56 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
if (unlikely(!NODE_HAS_FREE_SLOT(n32)))
{
- rt_node_inner_128 *new128;
+ Assert(parent != NULL);
+
+ if (NODE_NEEDS_TO_GROW_CLASS(n32, RT_CLASS_32_PARTIAL))
+ {
+ /* use the same node kind, but expand to the next size class */
+ rt_node_inner_32 *new32;
- /* grow node from 32 to 128 */
- new128 = (rt_node_inner_128 *) rt_grow_node_kind(tree, (rt_node *) n32,
- RT_NODE_KIND_128);
- for (int i = 0; i < n32->base.n.count; i++)
- node_inner_128_insert(new128, n32->base.chunks[i], n32->children[i]);
+ new32 = (rt_node_inner_32 *) rt_alloc_node(tree, RT_CLASS_32_FULL, true);
+ memcpy(new32, n32, rt_size_class_info[RT_CLASS_32_PARTIAL].inner_size);
+ new32->base.n.fanout = rt_size_class_info[RT_CLASS_32_FULL].fanout;
- Assert(parent != NULL);
- rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new128,
- key);
- node = (rt_node *) new128;
+ rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new32,
+ key);
+
+ /* must update both pointers here */
+ node = (rt_node *) new32;
+ n32 = new32;
+
+ goto retry_insert_inner_32;
+ }
+ else
+ {
+ rt_node_inner_128 *new128;
+
+ /* grow node from 32 to 128 */
+ new128 = (rt_node_inner_128 *) rt_grow_node_kind(tree, (rt_node *) n32,
+ RT_NODE_KIND_128);
+ for (int i = 0; i < n32->base.n.count; i++)
+ node_inner_128_insert(new128, n32->base.chunks[i], n32->children[i]);
+
+ rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new128,
+ key);
+ node = (rt_node *) new128;
+ }
}
else
{
- int insertpos = node_32_get_insertpos((rt_node_base_32 *) n32, chunk);
- int16 count = n32->base.n.count;
+retry_insert_inner_32:
+ {
+ int insertpos = node_32_get_insertpos((rt_node_base_32 *) n32, chunk);
+ int16 count = n32->base.n.count;
- if (count != 0 && insertpos < count)
- chunk_children_array_shift(n32->base.chunks, n32->children,
- count, insertpos);
+ if (count != 0 && insertpos < count)
+ chunk_children_array_shift(n32->base.chunks, n32->children,
+ count, insertpos);
- n32->base.chunks[insertpos] = chunk;
- n32->children[insertpos] = child;
- break;
+ n32->base.chunks[insertpos] = chunk;
+ n32->children[insertpos] = child;
+ break;
+ }
}
}
/* FALLTHROUGH */
@@ -1368,29 +1454,54 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
if (unlikely(!NODE_HAS_FREE_SLOT(n128)))
{
- rt_node_inner_256 *new256;
+ Assert(parent != NULL);
- /* grow node from 128 to 256 */
- new256 = (rt_node_inner_256 *) rt_grow_node_kind(tree, (rt_node *) n128,
- RT_NODE_KIND_256);
- for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n128->base.n.count; i++)
+ if (NODE_NEEDS_TO_GROW_CLASS(n128, RT_CLASS_128_PARTIAL))
{
- if (!node_128_is_chunk_used((rt_node_base_128 *) n128, i))
- continue;
+ /* use the same node kind, but expand to the next size class */
+ rt_node_inner_128 *new128;
- node_inner_256_set(new256, i, node_inner_128_get_child(n128, i));
- cnt++;
+ new128 = (rt_node_inner_128 *) rt_alloc_node(tree, RT_CLASS_128_FULL, true);
+ memcpy(new128, n128, rt_size_class_info[RT_CLASS_128_PARTIAL].inner_size);
+ new128->base.n.fanout = rt_size_class_info[RT_CLASS_128_FULL].fanout;
+
+ rt_replace_node(tree, parent, (rt_node *) n128, (rt_node *) new128,
+ key);
+
+ /* must update both pointers here */
+ node = (rt_node *) new128;
+ n128 = new128;
+
+ goto retry_insert_inner_128;
}
+ else
+ {
+ rt_node_inner_256 *new256;
- Assert(parent != NULL);
- rt_replace_node(tree, parent, (rt_node *) n128, (rt_node *) new256,
- key);
- node = (rt_node *) new256;
+ /* grow node from 128 to 256 */
+ new256 = (rt_node_inner_256 *) rt_grow_node_kind(tree, (rt_node *) n128,
+ RT_NODE_KIND_256);
+ for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n128->base.n.count; i++)
+ {
+ if (!node_128_is_chunk_used((rt_node_base_128 *) n128, i))
+ continue;
+
+ node_inner_256_set(new256, i, node_inner_128_get_child(n128, i));
+ cnt++;
+ }
+
+ rt_replace_node(tree, parent, (rt_node *) n128, (rt_node *) new256,
+ key);
+ node = (rt_node *) new256;
+ }
}
else
{
- node_inner_128_insert(n128, chunk, child);
- break;
+ retry_insert_inner_128:
+ {
+ node_inner_128_insert(n128, chunk, child);
+ break;
+ }
}
}
/* FALLTHROUGH */
@@ -1448,33 +1559,57 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
if (unlikely(!NODE_HAS_FREE_SLOT(n4)))
{
- rt_node_leaf_32 *new32;
+ Assert(parent != NULL);
- /* grow node from 4 to 32 */
- new32 = (rt_node_leaf_32 *) rt_grow_node_kind(tree, (rt_node *) n4,
- RT_NODE_KIND_32);
- chunk_values_array_copy(n4->base.chunks, n4->values,
- new32->base.chunks, new32->values,
- n4->base.n.count);
+ if (NODE_NEEDS_TO_GROW_CLASS(n4, RT_CLASS_4_PARTIAL))
+ {
+ /* use the same node kind, but expand to the next size class */
+ rt_node_leaf_4 *new4;
- Assert(parent != NULL);
- rt_replace_node(tree, parent, (rt_node *) n4, (rt_node *) new32,
- key);
- node = (rt_node *) new32;
+ new4 = (rt_node_leaf_4 *) rt_alloc_node(tree, RT_CLASS_4_FULL, false);
+ memcpy(new4, n4, rt_size_class_info[RT_CLASS_4_PARTIAL].leaf_size);
+ new4->base.n.fanout = rt_size_class_info[RT_CLASS_4_FULL].fanout;
+
+ rt_replace_node(tree, parent, (rt_node *) n4, (rt_node *) new4,
+ key);
+
+ /* must update both pointers here */
+ node = (rt_node *) new4;
+ n4 = new4;
+
+ goto retry_insert_leaf_4;
+ }
+ else
+ {
+ rt_node_leaf_32 *new32;
+
+ /* grow node from 4 to 32 */
+ new32 = (rt_node_leaf_32 *) rt_grow_node_kind(tree, (rt_node *) n4,
+ RT_NODE_KIND_32);
+ chunk_values_array_copy(n4->base.chunks, n4->values,
+ new32->base.chunks, new32->values,
+ n4->base.n.count);
+ rt_replace_node(tree, parent, (rt_node *) n4, (rt_node *) new32,
+ key);
+ node = (rt_node *) new32;
+ }
}
else
{
- int insertpos = node_4_get_insertpos((rt_node_base_4 *) n4, chunk);
- int count = n4->base.n.count;
+ retry_insert_leaf_4:
+ {
+ int insertpos = node_4_get_insertpos((rt_node_base_4 *) n4, chunk);
+ int count = n4->base.n.count;
- /* shift chunks and values */
- if (count != 0 && insertpos < count)
- chunk_values_array_shift(n4->base.chunks, n4->values,
- count, insertpos);
+ /* shift chunks and values */
+ if (count != 0 && insertpos < count)
+ chunk_values_array_shift(n4->base.chunks, n4->values,
+ count, insertpos);
- n4->base.chunks[insertpos] = chunk;
- n4->values[insertpos] = value;
- break;
+ n4->base.chunks[insertpos] = chunk;
+ n4->values[insertpos] = value;
+ break;
+ }
}
}
/* FALLTHROUGH */
@@ -1494,31 +1629,56 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
if (unlikely(!NODE_HAS_FREE_SLOT(n32)))
{
- rt_node_leaf_128 *new128;
+ Assert(parent != NULL);
+
+ if (NODE_NEEDS_TO_GROW_CLASS(n32, RT_CLASS_32_PARTIAL))
+ {
+ /* use the same node kind, but expand to the next size class */
+ rt_node_leaf_32 *new32;
- /* grow node from 32 to 128 */
- new128 = (rt_node_leaf_128 *) rt_grow_node_kind(tree, (rt_node *) n32,
- RT_NODE_KIND_128);
- for (int i = 0; i < n32->base.n.count; i++)
- node_leaf_128_insert(new128, n32->base.chunks[i], n32->values[i]);
+ new32 = (rt_node_leaf_32 *) rt_alloc_node(tree, RT_CLASS_32_FULL, false);
+ memcpy(new32, n32, rt_size_class_info[RT_CLASS_32_PARTIAL].leaf_size);
+ new32->base.n.fanout = rt_size_class_info[RT_CLASS_32_FULL].fanout;
- Assert(parent != NULL);
- rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new128,
- key);
- node = (rt_node *) new128;
+ rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new32,
+ key);
+
+ /* must update both pointers here */
+ node = (rt_node *) new32;
+ n32 = new32;
+
+ goto retry_insert_leaf_32;
+ }
+ else
+ {
+ rt_node_leaf_128 *new128;
+
+ /* grow node from 32 to 128 */
+ new128 = (rt_node_leaf_128 *) rt_grow_node_kind(tree, (rt_node *) n32,
+ RT_NODE_KIND_128);
+ for (int i = 0; i < n32->base.n.count; i++)
+ node_leaf_128_insert(new128, n32->base.chunks[i], n32->values[i]);
+
+ rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new128,
+ key);
+ node = (rt_node *) new128;
+ }
}
else
{
- int insertpos = node_32_get_insertpos((rt_node_base_32 *) n32, chunk);
- int count = n32->base.n.count;
+ retry_insert_leaf_32:
+ {
+ int insertpos = node_32_get_insertpos((rt_node_base_32 *) n32, chunk);
+ int count = n32->base.n.count;
- if (count != 0 && insertpos < count)
- chunk_values_array_shift(n32->base.chunks, n32->values,
- count, insertpos);
+ if (count != 0 && insertpos < count)
+ chunk_values_array_shift(n32->base.chunks, n32->values,
+ count, insertpos);
- n32->base.chunks[insertpos] = chunk;
- n32->values[insertpos] = value;
- break;
+ n32->base.chunks[insertpos] = chunk;
+ n32->values[insertpos] = value;
+ break;
+ }
}
}
/* FALLTHROUGH */
@@ -1537,29 +1697,54 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
if (unlikely(!NODE_HAS_FREE_SLOT(n128)))
{
- rt_node_leaf_256 *new256;
+ Assert(parent != NULL);
- /* grow node from 128 to 256 */
- new256 = (rt_node_leaf_256 *) rt_grow_node_kind(tree, (rt_node *) n128,
- RT_NODE_KIND_256);
- for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n128->base.n.count; i++)
+ if (NODE_NEEDS_TO_GROW_CLASS(n128, RT_CLASS_128_PARTIAL))
{
- if (!node_128_is_chunk_used((rt_node_base_128 *) n128, i))
- continue;
+ /* use the same node kind, but expand to the next size class */
+ rt_node_leaf_128 *new128;
+
+ new128 = (rt_node_leaf_128 *) rt_alloc_node(tree, RT_CLASS_128_FULL, false);
+ memcpy(new128, n128, rt_size_class_info[RT_CLASS_128_PARTIAL].leaf_size);
+ new128->base.n.fanout = rt_size_class_info[RT_CLASS_128_FULL].fanout;
+
+ rt_replace_node(tree, parent, (rt_node *) n128, (rt_node *) new128,
+ key);
+
+ /* must update both pointers here */
+ node = (rt_node *) new128;
+ n128 = new128;
- node_leaf_256_set(new256, i, node_leaf_128_get_value(n128, i));
- cnt++;
+ goto retry_insert_leaf_128;
}
+ else
+ {
+ rt_node_leaf_256 *new256;
- Assert(parent != NULL);
- rt_replace_node(tree, parent, (rt_node *) n128, (rt_node *) new256,
- key);
- node = (rt_node *) new256;
+ /* grow node from 128 to 256 */
+ new256 = (rt_node_leaf_256 *) rt_grow_node_kind(tree, (rt_node *) n128,
+ RT_NODE_KIND_256);
+ for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n128->base.n.count; i++)
+ {
+ if (!node_128_is_chunk_used((rt_node_base_128 *) n128, i))
+ continue;
+
+ node_leaf_256_set(new256, i, node_leaf_128_get_value(n128, i));
+ cnt++;
+ }
+
+ rt_replace_node(tree, parent, (rt_node *) n128, (rt_node *) new256,
+ key);
+ node = (rt_node *) new256;
+ }
}
else
{
- node_leaf_128_insert(n128, chunk, value);
- break;
+ retry_insert_leaf_128:
+ {
+ node_leaf_128_insert(n128, chunk, value);
+ break;
+ }
}
}
/* FALLTHROUGH */
@@ -2222,11 +2407,14 @@ rt_verify_node(rt_node *node)
void
rt_stats(radix_tree *tree)
{
- ereport(LOG, (errmsg("num_keys = %lu, height = %u, n4 = %u, n32 = %u, n128 = %u, n256 = %u",
+ ereport(NOTICE, (errmsg("num_keys = " UINT64_FORMAT ", height = %d, n1 = %u, n4 = %u, n15 = %u, n32 = %u, n61 = %u, n128 = %u, n256 = %u",
tree->num_keys,
tree->root->shift / RT_NODE_SPAN,
+ tree->cnt[RT_CLASS_4_PARTIAL],
tree->cnt[RT_CLASS_4_FULL],
+ tree->cnt[RT_CLASS_32_PARTIAL],
tree->cnt[RT_CLASS_32_FULL],
+ tree->cnt[RT_CLASS_128_PARTIAL],
tree->cnt[RT_CLASS_128_FULL],
tree->cnt[RT_CLASS_256])));
}
--
2.31.1
On Thu, Nov 24, 2022 at 9:54 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
So it seems that there are two candidates for the rt_node structure: (1)
all nodes except for node256 are variable-size nodes and use pointer
tagging, and (2) only node32 and node128 are variable-sized nodes and do
not use pointer tagging (the fanout field is part of only these two node
kinds). rt_node can be 5 bytes in both cases. But before going to this
step, I started to verify the idea of variable-size nodes by using a
6-byte rt_node. We can adjust the node kinds and node classes later.
First, I'm glad you picked up the size class concept and expanded it. (I
have some comments about some internal APIs below.)
Let's leave the pointer tagging piece out until the main functionality is
committed. We have all the prerequisites in place, except for a benchmark
random enough to demonstrate benefit. I'm still not quite satisfied with
how the shared memory coding looked, and that is the only sticky problem we
still have, IMO. The rest is "just work".
That said, (1) and (2) above are still relevant -- variable sizing any
given node is optional, and we can refine as needed.
Overall, the idea of variable-sized nodes is good: smaller size
without losing search performance.
Good.
I'm going to check the load
performance as well.
Part of that is this, which gets called a lot more now, when node1 expands:
+ if (inner)
+ newnode = (rt_node *) MemoryContextAllocZero(tree->inner_slabs[kind],
+ rt_node_kind_info[kind].inner_size);
+ else
+ newnode = (rt_node *) MemoryContextAllocZero(tree->leaf_slabs[kind],
+ rt_node_kind_info[kind].leaf_size);
Since memset for expanding size class is now handled separately, these can
use the non-zeroing versions. When compiling MemoryContextAllocZero, the
compiler has no idea how big the size is, so it assumes the worst and
optimizes for large sizes. On x86-64, that means using "rep stos",
which calls microcode found in the CPU's ROM. This is slow for small sizes.
The "init" function should be always inline with const parameters where
possible. That way, memset can compile to a single instruction for the
smallest node kind. (More on alloc/init below)
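
To sketch what I have in mind (names, struct fields, and the per-class slab
indexing here are assumptions based on v11, not a finished API): the
allocation step uses the non-zeroing allocator, and an always-inline init
helper takes the node size as a constant so the compiler can emit a short,
fixed-size memset:

/* Sketch only: allocate without zeroing; assumes per-size-class slab contexts */
static rt_node *
rt_alloc_node(radix_tree *tree, rt_size_class size_class, bool inner)
{
    if (inner)
        return (rt_node *) MemoryContextAlloc(tree->inner_slabs[size_class],
                                              rt_size_class_info[size_class].inner_size);
    else
        return (rt_node *) MemoryContextAlloc(tree->leaf_slabs[size_class],
                                              rt_size_class_info[size_class].leaf_size);
}

/* Sketch only: node_size must be a compile-time constant at each call site */
static pg_attribute_always_inline void
rt_init_node(rt_node *node, uint8 kind, uint8 fanout, Size node_size)
{
    memset(node, 0, node_size);
    node->kind = kind;      /* assumed field names */
    node->fanout = fanout;
}

With that, the grow-within-kind path can keep doing what it does now --
memcpy the old node and bump the fanout -- without paying for a redundant
zeroing first, and init for the smallest node kind becomes a handful of
stores.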
Note, there is a wrinkle: as currently written, inner_node128 searches the
child pointers for NULL when inserting, so when expanding from partial to
full size class, the new node must be zeroed. (Worth fixing in the short
term; I thought of this while writing the proof-of-concept for size
classes, but didn't mention it.) Medium term, rather than special-casing
this, I actually want to rewrite the inner-node128 to be more similar to
the leaf, with an "isset" array, but accessed and tested differently. I
guarantee it's *really* slow now to load (maybe somewhat true even for
leaves), but I'll leave the details for later. Regarding node128 leaf, note
that it's slightly larger than a DSA size class, and we can trim it to fit:
node61: 6 + 256 + (2) + 16 + 61*8 = 768
node125: 6 + 256 + (2) + 16 + 125*8 = 1280
I've attached the patches I used for the verification. I don't include
patches for pointer tagging, DSA support, and vacuum integration since
I'm investigating the issue on cfbot that Andres reported. Also, I've
modified tests to improve the test coverage.
Sounds good. For v12, I think size classes have proven themselves, so v11's
0002/4/5 can be squashed. Plus, some additional comments:
+/* Return a new and initialized node */
+static rt_node *
+rt_alloc_init_node(radix_tree *tree, uint8 kind, uint8 shift, uint8 chunk,
bool inner)
+{
+ rt_node *newnode;
+
+ newnode = rt_alloc_node(tree, kind, inner);
+ rt_init_node(newnode, kind, shift, chunk, inner);
+
+ return newnode;
+}
I don't see the point of a function that just calls two functions.
+/*
+ * Create a new node with 'new_kind' and the same shift, chunk, and
+ * count of 'node'.
+ */
+static rt_node *
+rt_grow_node(radix_tree *tree, rt_node *node, int new_kind)
+{
+ rt_node *newnode;
+
+ newnode = rt_alloc_init_node(tree, new_kind, node->shift, node->chunk,
+ node->shift > 0);
+ newnode->count = node->count;
+
+ return newnode;
+}
This, in turn, just calls a function that does _almost_ everything, and
additionally must set one member. This function should really be alloc-node
+ init-node + copy-common, where copy-common is like in the prototype:
+ newnode->node_shift = oldnode->node_shift;
+ newnode->node_chunk = oldnode->node_chunk;
+ newnode->count = oldnode->count;
And init-node should really be just memset + set kind + set initial fanout.
It has no business touching "shift" and "chunk". The callers rt_new_root,
rt_set_extend, and rt_extend set some values of their own anyway, so let
them set those, too -- it might even improve readability.
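
Concretely, something like this (a sketch; the field names follow v11, so
treat them as assumptions):

/* Copy the header fields shared by all node kinds */
static inline void
rt_copy_node_common(rt_node *newnode, rt_node *oldnode)
{
    newnode->shift = oldnode->shift;
    newnode->chunk = oldnode->chunk;
    newnode->count = oldnode->count;
}

Then rt_grow_node() reduces to alloc-node + init-node + rt_copy_node_common(),
and rt_new_root, rt_set_extend, and rt_extend call alloc-node + init-node and
fill in shift and chunk themselves.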
- if (n32->base.n.fanout ==
rt_size_class_info[RT_CLASS_32_PARTIAL].fanout)
+ if (NODE_NEEDS_TO_GROW_CLASS(n32, RT_CLASS_32_PARTIAL))
This macro doesn't really improve readability -- it obscures what is being
tested, and the name implies the "else" branch means "node doesn't need to
grow class", which is false. If we want to simplify expressions in this
block, I think it'd be more effective to improve the lines that follow:
+ memcpy(new32, n32, rt_size_class_info[RT_CLASS_32_PARTIAL].inner_size);
+ new32->base.n.fanout = rt_size_class_info[RT_CLASS_32_FULL].fanout;
Maybe we can have const variables old_size and new_fanout to break out the
array lookup? While I'm thinking of it, these arrays should be const so the
compiler can avoid runtime lookups. Speaking of...
+/* Copy both chunks and children/values arrays */
+static inline void
+chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
+ uint8 *dst_chunks, rt_node **dst_children, int count)
+{
+ /* For better code generation */
+ if (count > rt_node_kind_info[RT_NODE_KIND_4].fanout)
+ pg_unreachable();
+
+ memcpy(dst_chunks, src_chunks, sizeof(uint8) * count);
+ memcpy(dst_children, src_children, sizeof(rt_node *) * count);
+}
When I looked at this earlier, I somehow didn't go far enough -- why are we
passing the runtime count in the first place? This function can only be
called if count == rt_size_class_info[RT_CLASS_4_FULL].fanout. The last
parameter to memcpy should evaluate to a compile-time constant, right? Even
when we add node shrinking in the future, the constant should be correct,
IIUC?
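
In other words (a sketch, assuming the only caller is the node4-to-node32
growth path and that rt_size_class_info is declared const):

/* Copy chunks and children out of a full node4; count is known at compile time */
static inline void
chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
                          uint8 *dst_chunks, rt_node **dst_children)
{
    const int count = rt_size_class_info[RT_CLASS_4_FULL].fanout;

    /* with a const lookup table, "count" should fold to the constant 4 */
    memcpy(dst_chunks, src_chunks, sizeof(uint8) * count);
    memcpy(dst_children, src_children, sizeof(rt_node *) * count);
}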
- .fanout = 256,
+ /* technically it's 256, but we can't store that in a uint8,
+ and this is the max size class so it will never grow */
+ .fanout = 0,
- Assert(chunk_exists || NODE_HAS_FREE_SLOT(n256));
+ Assert(((rt_node *) n256)->fanout == 0);
+ Assert(chunk_exists || ((rt_node *) n256)->count < 256);
These hacks were my work, but I think we can improve that by having two
versions of NODE_HAS_FREE_SLOT -- one for fixed- and one for variable-sized
nodes. For that to work, in "init-node" we'd need a branch to set fanout to
zero for node256. That should be fine -- it already has to branch for
memset'ing node128's indexes to 0xFF.
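
Something along these lines, say (macro names are placeholders):

/* For fixed-size kinds (node256): fanout is stored as 0, so test the true limit */
#define FIXED_NODE_HAS_FREE_SLOT(node) \
    ((node)->base.n.count < RT_NODE_MAX_SLOTS)

/* For variable-size kinds: fanout reflects the current size class */
#define VAR_NODE_HAS_FREE_SLOT(node) \
    ((node)->base.n.count < (node)->base.n.fanout)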
--
John Naylor
EDB: http://www.enterprisedb.com
On Thu, Nov 24, 2022 at 9:54 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
[v11]
There is one more thing that just now occurred to me: expanding the use
of size classes makes rebasing and reworking the shared memory piece
more work than it should be. That's important because there are still some
open questions about the design around shared memory. To keep unnecessary
churn to a minimum, perhaps we should limit size class expansion to just
one (or 5 total size classes) for the near future?
--
John Naylor
EDB: http://www.enterprisedb.com
While creating a benchmark for inserting into node128-inner, I found a bug.
If a caller deletes from a node128, the slot index is set to invalid, but
the child pointer is still valid. Do that a few times, and every child
pointer is valid, even if no slot index points to it. When the next
inserter comes along, something surprising happens. This function:
/* Return an unused slot in node-128 */
static int
node_inner_128_find_unused_slot(rt_node_inner_128 *node, uint8 chunk)
{
int slotpos = 0;
Assert(!NODE_IS_LEAF(node));
while (node_inner_128_is_slot_used(node, slotpos))
slotpos++;
return slotpos;
}
...passes an integer to this function, whose parameter is a uint8:
/* Is the slot in the node used? */
static inline bool
node_inner_128_is_slot_used(rt_node_inner_128 *node, uint8 slot)
{
Assert(!NODE_IS_LEAF(node));
return (node->children[slot] != NULL);
}
...so instead of growing the node unnecessarily or segfaulting, it enters
an infinite loop doing this:
add eax, 1
movzx ecx, al
cmp QWORD PTR [rbx+264+rcx*8], 0
jne .L147
The fix is easy enough -- set the child pointer to null upon deletion, but
I'm somewhat astonished that the regression tests didn't hit this. I do
still intend to replace this code with something faster, but before I do so
the tests should probably exercise the deletion paths more. Since VACUUM
--
John Naylor
EDB: http://www.enterprisedb.com
The fix is easy enough -- set the child pointer to null upon deletion,
but I'm somewhat astonished that the regression tests didn't hit this. I do
still intend to replace this code with something faster, but before I do so
the tests should probably exercise the deletion paths more. Since VACUUM
Oops. I meant to finish with "Since VACUUM doesn't perform deletion we
didn't have an opportunity to detect this during that operation."
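
For reference, the minimal fix would look something like this (a sketch
against the v11 shape of node_inner_128_delete, with v11 field names):

static void
node_inner_128_delete(rt_node_inner_128 *node, uint8 chunk)
{
    int     slotpos = node->base.slot_idxs[chunk];

    Assert(!NODE_IS_LEAF(node));

    /* clear the child pointer too, so the slot no longer looks "used" */
    node->children[slotpos] = NULL;
    node->base.slot_idxs[chunk] = RT_NODE_128_INVALID_IDX;
}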
--
John Naylor
EDB: http://www.enterprisedb.com
There are a few things up in the air, so I'm coming back to this list to
summarize and add a recent update:
On Mon, Nov 14, 2022 at 7:59 PM John Naylor <john.naylor@enterprisedb.com>
wrote:
- See how much performance we actually gain from tagging the node kind.
Needs a benchmark that has enough branch mispredicts and L2/3 misses to
show a benefit. Otherwise either neutral or worse in its current form,
depending on compiler(?). Put off for later.
- Try additional size classes while keeping the node kinds to only four.
This is relatively simple and effective. If only one additional size class
(total 5) is coded as a placeholder, I imagine it will be easier to rebase
shared memory logic than using this technique everywhere possible.
- Optimize node128 insert.
I've attached a rough start at this. The basic idea is borrowed from our
bitmapset nodes, so we can iterate over and operate on word-sized (32- or
64-bit) types at a time, rather than bytes. To make this easier, I've moved
some of the lower-level macros and types from bitmapset.h/.c to
pg_bitutils.h. That's probably going to need a separate email thread to
resolve the coding style clash this causes, so that can be put off for
later. This is not meant to be included in the next patchset. For
demonstration purposes, I get these results with a function that repeatedly
deletes the last value from a mostly-full node128 leaf and re-inserts it:
select * from bench_node128_load(120);
v11
NOTICE: num_keys = 14400, height = 1, n1 = 0, n4 = 0, n15 = 0, n32 = 0,
n61 = 0, n128 = 121, n256 = 0
fanout | nkeys | rt_mem_allocated | rt_sparseload_ms
--------+-------+------------------+------------------
120 | 14400 | 208304 | 56
v11 + 0006 addendum
NOTICE: num_keys = 14400, height = 1, n1 = 0, n4 = 0, n15 = 0, n32 = 0,
n61 = 0, n128 = 121, n256 = 0
fanout | nkeys | rt_mem_allocated | rt_sparseload_ms
--------+-------+------------------+------------------
120 | 14400 | 208816 | 34
I didn't test inner nodes, but I imagine the difference is bigger. This
bitmap style should also be used for the node256-leaf isset array simply to
be consistent and avoid needing single-use macros, but that has not been
done yet. It won't make a difference for performance because there is no
iteration there.
- Try templating out the differences between local and shared memory.
I hope to start this sometime after the crashes on 32-bit are resolved.
--
John Naylor
EDB: http://www.enterprisedb.com
Attachments:
v11-0006-addendum-bitmapword-node128.patch.txt (text/plain)
diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
index 67ba568531..2fd689aa91 100644
--- a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
+++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
@@ -63,3 +63,14 @@ OUT rt_search_ms int8
returns record
as 'MODULE_PATHNAME'
LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_node128_load(
+fanout int4,
+OUT fanout int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT rt_sparseload_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c
index e69be48448..b035b3a747 100644
--- a/contrib/bench_radix_tree/bench_radix_tree.c
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -31,6 +31,7 @@ PG_FUNCTION_INFO_V1(bench_shuffle_search);
PG_FUNCTION_INFO_V1(bench_load_random_int);
PG_FUNCTION_INFO_V1(bench_fixed_height_search);
PG_FUNCTION_INFO_V1(bench_search_random_nodes);
+PG_FUNCTION_INFO_V1(bench_node128_load);
static uint64
tid_to_key_off(ItemPointer tid, uint32 *off)
@@ -552,3 +553,85 @@ finish_search:
rt_free(rt);
PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
}
+
+Datum
+bench_node128_load(PG_FUNCTION_ARGS)
+{
+ int fanout = PG_GETARG_INT32(0);
+ radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_sparseload_ms;
+ Datum values[5];
+ bool nulls[5];
+
+ uint64 r,
+ h;
+ int key_id;
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ rt = rt_create(CurrentMemoryContext);
+
+ key_id = 0;
+
+ for (r = 1; r <= fanout; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (h);
+
+ key_id++;
+ rt_set(rt, key, key_id);
+ }
+ }
+
+ rt_stats(rt);
+
+ /* measure sparse deletion and re-loading */
+ start_time = GetCurrentTimestamp();
+
+ for (int t = 0; t<10000; t++)
+ {
+ /* delete one key in each leaf */
+ for (r = 1; r <= fanout; r++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (fanout);
+
+ rt_delete(rt, key);
+ }
+
+ /* add them all back */
+ key_id = 0;
+ for (r = 1; r <= fanout; r++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (fanout);
+
+ key_id++;
+ rt_set(rt, key, key_id);
+ }
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_sparseload_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int32GetDatum(fanout);
+ values[1] = Int64GetDatum(rt_num_entries(rt));
+ values[2] = Int64GetDatum(rt_memory_usage(rt));
+ values[3] = Int64GetDatum(rt_sparseload_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
index f10abd8add..9cfed1624f 100644
--- a/src/backend/lib/radixtree.c
+++ b/src/backend/lib/radixtree.c
@@ -262,6 +262,9 @@ typedef struct rt_node_inner_128
{
rt_node_base_128 base;
+ /* isset is a bitmap to track which slot is in use */
+ bitmapword isset[WORDNUM(128)];
+
/* number of children depends on size class */
rt_node *children[FLEXIBLE_ARRAY_MEMBER];
} rt_node_inner_128;
@@ -271,7 +274,7 @@ typedef struct rt_node_leaf_128
rt_node_base_128 base;
/* isset is a bitmap to track which slot is in use */
- uint8 isset[RT_NODE_NSLOTS_BITS(128)];
+ bitmapword isset[WORDNUM(128)];
/* number of values depends on size class */
uint64 values[FLEXIBLE_ARRAY_MEMBER];
@@ -679,14 +682,14 @@ static inline bool
node_inner_128_is_slot_used(rt_node_inner_128 *node, uint8 slot)
{
Assert(!NODE_IS_LEAF(node));
- return (node->children[slot] != NULL);
+ return (node->isset[WORDNUM(slot)] & ((bitmapword) 1 << BITNUM(slot))) != 0;
}
static inline bool
node_leaf_128_is_slot_used(rt_node_leaf_128 *node, uint8 slot)
{
Assert(NODE_IS_LEAF(node));
- return ((node->isset[RT_NODE_BITMAP_BYTE(slot)] & RT_NODE_BITMAP_BIT(slot)) != 0);
+ return (node->isset[WORDNUM(slot)] & ((bitmapword) 1 << BITNUM(slot))) != 0;
}
static inline rt_node *
@@ -707,7 +710,10 @@ node_leaf_128_get_value(rt_node_leaf_128 *node, uint8 chunk)
static void
node_inner_128_delete(rt_node_inner_128 *node, uint8 chunk)
{
+ int slotpos = node->base.slot_idxs[chunk];
+
Assert(!NODE_IS_LEAF(node));
+ node->isset[WORDNUM(slotpos)] &= ~((bitmapword) 1 << BITNUM(slotpos));
node->base.slot_idxs[chunk] = RT_NODE_128_INVALID_IDX;
}
@@ -717,41 +723,32 @@ node_leaf_128_delete(rt_node_leaf_128 *node, uint8 chunk)
int slotpos = node->base.slot_idxs[chunk];
Assert(NODE_IS_LEAF(node));
- node->isset[RT_NODE_BITMAP_BYTE(slotpos)] &= ~(RT_NODE_BITMAP_BIT(slotpos));
+ node->isset[WORDNUM(slotpos)] &= ~((bitmapword) 1 << BITNUM(slotpos));
node->base.slot_idxs[chunk] = RT_NODE_128_INVALID_IDX;
}
/* Return an unused slot in node-128 */
static int
-node_inner_128_find_unused_slot(rt_node_inner_128 *node, uint8 chunk)
-{
- int slotpos = 0;
-
- Assert(!NODE_IS_LEAF(node));
- while (node_inner_128_is_slot_used(node, slotpos))
- slotpos++;
-
- return slotpos;
-}
-
-static int
-node_leaf_128_find_unused_slot(rt_node_leaf_128 *node, uint8 chunk)
+node128_find_unused_slot(bitmapword *isset)
{
int slotpos;
+ int idx;
+ bitmapword inverse;
- Assert(NODE_IS_LEAF(node));
-
- /* We iterate over the isset bitmap per byte then check each bit */
- for (slotpos = 0; slotpos < RT_NODE_NSLOTS_BITS(128); slotpos++)
+ /* get the first word with at least one bit not set */
+ for (idx = 0; idx < WORDNUM(128); idx++)
{
- if (node->isset[slotpos] < 0xFF)
+ if (isset[idx] < ~((bitmapword) 0))
break;
}
- Assert(slotpos < RT_NODE_NSLOTS_BITS(128));
- slotpos *= BITS_PER_BYTE;
- while (node_leaf_128_is_slot_used(node, slotpos))
- slotpos++;
+ /* To get the first unset bit in X, get the first set bit in ~X */
+ inverse = ~(isset[idx]);
+ slotpos = idx * BITS_PER_BITMAPWORD;
+ slotpos += bmw_rightmost_one_pos(inverse);
+
+ /* mark the slot used */
+ isset[idx] |= RIGHTMOST_ONE(inverse);
return slotpos;
}
@@ -763,8 +760,7 @@ node_inner_128_insert(rt_node_inner_128 *node, uint8 chunk, rt_node *child)
Assert(!NODE_IS_LEAF(node));
- /* find unused slot */
- slotpos = node_inner_128_find_unused_slot(node, chunk);
+ slotpos = node128_find_unused_slot(node->isset);
node->base.slot_idxs[chunk] = slotpos;
node->children[slotpos] = child;
@@ -778,11 +774,9 @@ node_leaf_128_insert(rt_node_leaf_128 *node, uint8 chunk, uint64 value)
Assert(NODE_IS_LEAF(node));
- /* find unused slot */
- slotpos = node_leaf_128_find_unused_slot(node, chunk);
+ slotpos = node128_find_unused_slot(node->isset);
node->base.slot_idxs[chunk] = slotpos;
- node->isset[RT_NODE_BITMAP_BYTE(slotpos)] |= RT_NODE_BITMAP_BIT(slotpos);
node->values[slotpos] = value;
}
@@ -2508,9 +2502,9 @@ rt_dump_node(rt_node *node, int level, bool recurse)
rt_node_leaf_128 *n = (rt_node_leaf_128 *) node;
fprintf(stderr, ", isset-bitmap:");
- for (int i = 0; i < 16; i++)
+ for (int i = 0; i < WORDNUM(128); i++)
{
- fprintf(stderr, "%X ", (uint8) n->isset[i]);
+ fprintf(stderr, "%lX ", n->isset[i]);
}
fprintf(stderr, "\n");
}
diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index b7b274aeff..3fe0fd88ce 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -23,49 +23,11 @@
#include "common/hashfn.h"
#include "nodes/bitmapset.h"
#include "nodes/pg_list.h"
-#include "port/pg_bitutils.h"
-#define WORDNUM(x) ((x) / BITS_PER_BITMAPWORD)
-#define BITNUM(x) ((x) % BITS_PER_BITMAPWORD)
-
#define BITMAPSET_SIZE(nwords) \
(offsetof(Bitmapset, words) + (nwords) * sizeof(bitmapword))
-/*----------
- * This is a well-known cute trick for isolating the rightmost one-bit
- * in a word. It assumes two's complement arithmetic. Consider any
- * nonzero value, and focus attention on the rightmost one. The value is
- * then something like
- * xxxxxx10000
- * where x's are unspecified bits. The two's complement negative is formed
- * by inverting all the bits and adding one. Inversion gives
- * yyyyyy01111
- * where each y is the inverse of the corresponding x. Incrementing gives
- * yyyyyy10000
- * and then ANDing with the original value gives
- * 00000010000
- * This works for all cases except original value = zero, where of course
- * we get zero.
- *----------
- */
-#define RIGHTMOST_ONE(x) ((signedbitmapword) (x) & -((signedbitmapword) (x)))
-
-#define HAS_MULTIPLE_ONES(x) ((bitmapword) RIGHTMOST_ONE(x) != (x))
-
-/* Select appropriate bit-twiddling functions for bitmap word size */
-#if BITS_PER_BITMAPWORD == 32
-#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos32(w)
-#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos32(w)
-#define bmw_popcount(w) pg_popcount32(w)
-#elif BITS_PER_BITMAPWORD == 64
-#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos64(w)
-#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos64(w)
-#define bmw_popcount(w) pg_popcount64(w)
-#else
-#error "invalid BITS_PER_BITMAPWORD"
-#endif
-
/*
* bms_copy - make a palloc'd copy of a bitmapset
diff --git a/src/include/nodes/bitmapset.h b/src/include/nodes/bitmapset.h
index 2792281658..06fa21ccaa 100644
--- a/src/include/nodes/bitmapset.h
+++ b/src/include/nodes/bitmapset.h
@@ -21,33 +21,13 @@
#define BITMAPSET_H
#include "nodes/nodes.h"
+#include "port/pg_bitutils.h"
/*
* Forward decl to save including pg_list.h
*/
struct List;
-/*
- * Data representation
- *
- * Larger bitmap word sizes generally give better performance, so long as
- * they're not wider than the processor can handle efficiently. We use
- * 64-bit words if pointers are that large, else 32-bit words.
- */
-#if SIZEOF_VOID_P >= 8
-
-#define BITS_PER_BITMAPWORD 64
-typedef uint64 bitmapword; /* must be an unsigned type */
-typedef int64 signedbitmapword; /* must be the matching signed type */
-
-#else
-
-#define BITS_PER_BITMAPWORD 32
-typedef uint32 bitmapword; /* must be an unsigned type */
-typedef int32 signedbitmapword; /* must be the matching signed type */
-
-#endif
-
typedef struct Bitmapset
{
pg_node_attr(custom_copy_equal, special_read_write)
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 814e0b2dba..ad5aa2c5cf 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -17,6 +17,51 @@ extern PGDLLIMPORT const uint8 pg_leftmost_one_pos[256];
extern PGDLLIMPORT const uint8 pg_rightmost_one_pos[256];
extern PGDLLIMPORT const uint8 pg_number_of_ones[256];
+/*
+ * Platform-specific types
+ *
+ * Larger bitmap word sizes generally give better performance, so long as
+ * they're not wider than the processor can handle efficiently. We use
+ * 64-bit words if pointers are that large, else 32-bit words.
+ */
+#if SIZEOF_VOID_P >= 8
+
+#define BITS_PER_BITMAPWORD 64
+typedef uint64 bitmapword; /* must be an unsigned type */
+typedef int64 signedbitmapword; /* must be the matching signed type */
+
+#else
+
+#define BITS_PER_BITMAPWORD 32
+typedef uint32 bitmapword; /* must be an unsigned type */
+typedef int32 signedbitmapword; /* must be the matching signed type */
+
+#endif
+
+#define WORDNUM(x) ((x) / BITS_PER_BITMAPWORD)
+#define BITNUM(x) ((x) % BITS_PER_BITMAPWORD)
+
+/*----------
+ * This is a well-known cute trick for isolating the rightmost one-bit
+ * in a word. It assumes two's complement arithmetic. Consider any
+ * nonzero value, and focus attention on the rightmost one. The value is
+ * then something like
+ * xxxxxx10000
+ * where x's are unspecified bits. The two's complement negative is formed
+ * by inverting all the bits and adding one. Inversion gives
+ * yyyyyy01111
+ * where each y is the inverse of the corresponding x. Incrementing gives
+ * yyyyyy10000
+ * and then ANDing with the original value gives
+ * 00000010000
+ * This works for all cases except original value = zero, where of course
+ * we get zero.
+ *----------
+ */
+#define RIGHTMOST_ONE(x) ((signedbitmapword) (x) & -((signedbitmapword) (x)))
+
+#define HAS_MULTIPLE_ONES(x) ((bitmapword) RIGHTMOST_ONE(x) != (x))
+
/*
* pg_leftmost_one_pos32
* Returns the position of the most significant set bit in "word",
@@ -291,4 +336,17 @@ pg_rotate_left32(uint32 word, int n)
#define pg_prevpower2_size_t pg_prevpower2_64
#endif
+/* variants of some functions for bitmap word size */
+#if BITS_PER_BITMAPWORD == 32
+#define bmw_leftmost_one_pos pg_leftmost_one_pos32
+#define bmw_rightmost_one_pos pg_rightmost_one_pos32
+#define bmw_popcount pg_popcount32
+#elif BITS_PER_BITMAPWORD == 64
+#define bmw_leftmost_one_pos pg_leftmost_one_pos64
+#define bmw_rightmost_one_pos pg_rightmost_one_pos64
+#define bmw_popcount pg_popcount64
+#else
+#error "invalid BITS_PER_BITMAPWORD"
+#endif
+
#endif /* PG_BITUTILS_H */
On Fri, Nov 25, 2022 at 5:00 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Thu, Nov 24, 2022 at 9:54 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
So it seems that there are two candidates of rt_node structure: (1)
all nodes except for node256 are variable-size nodes and use pointer
tagging, and (2) node32 and node128 are variable-sized nodes and do
not use pointer tagging (fanout is in part of only these two nodes).
rt_node can be 5 bytes in both cases. But before going to this step, I
started to verify the idea of variable-size nodes by using 6-bytes
rt_node. We can adjust the node kinds and node classes later.First, I'm glad you picked up the size class concept and expanded it. (I have some comments about some internal APIs below.)
Let's leave the pointer tagging piece out until the main functionality is committed. We have all the prerequisites in place, except for a benchmark random enough to demonstrate benefit. I'm still not quite satisfied with how the shared memory coding looked, and that is the only sticky problem we still have, IMO. The rest is "just work".
That said, (1) and (2) above are still relevant -- variable sizing any given node is optional, and we can refine as needed.
Overall, the idea of variable-sized nodes is good, smaller size
without losing search performance.

Good.
I'm going to check the load
performance as well.

Part of that is this, which gets called a lot more now, when node1 expands:

+	if (inner)
+		newnode = (rt_node *) MemoryContextAllocZero(tree->inner_slabs[kind],
+													 rt_node_kind_info[kind].inner_size);
+	else
+		newnode = (rt_node *) MemoryContextAllocZero(tree->leaf_slabs[kind],
+													 rt_node_kind_info[kind].leaf_size);

Since memset for expanding size class is now handled separately, these can use the non-zeroing versions. When compiling MemoryContextAllocZero, the compiler has no idea how big the size is, so it assumes the worst and optimizes for large sizes. On x86-64, that means using "rep stos", which calls microcode found in the CPU's ROM. This is slow for small sizes. The "init" function should be always inline with const parameters where possible. That way, memset can compile to a single instruction for the smallest node kind. (More on alloc/init below)
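A rough sketch of that direction (the rt_* names follow the thread's conventions, but the exact signatures here are assumptions, not the actual v12 code):

/*
 * Sketch only: allocate without zeroing, then zero inside an always-inline
 * init whose size argument is a compile-time constant at each call site,
 * so the compiler can emit a few stores instead of calling the generic
 * large-size memset path.
 */
static pg_attribute_always_inline void
rt_init_node(rt_node *node, uint8 kind, Size node_size)
{
	memset(node, 0, node_size);	/* node_size = sizeof(<concrete node struct>) */
	node->kind = kind;
}

static rt_node *
rt_alloc_node(radix_tree *tree, int kind, bool inner)
{
	if (inner)
		return (rt_node *) MemoryContextAlloc(tree->inner_slabs[kind],
											  rt_node_kind_info[kind].inner_size);
	else
		return (rt_node *) MemoryContextAlloc(tree->leaf_slabs[kind],
											  rt_node_kind_info[kind].leaf_size);
}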
Right. I forgot to update it.
Note, there is a wrinkle: As currently written inner_node128 searches the child pointers for NULL when inserting, so when expanding from partial to full size class, the new node must be zeroed (Worth fixing in the short term. I thought of this while writing the proof-of-concept for size classes, but didn't mention it.) Medium term, rather than special-casing this, I actually want to rewrite the inner-node128 to be more similar to the leaf, with an "isset" array, but accessed and tested differently. I guarantee it's *really* slow now to load (maybe somewhat true even for leaves), but I'll leave the details for later.
Agreed, I start with zeroing out the node when expanding from partial
to full size.
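For instance, the grow path could do something like the following (a sketch only; the RT_CLASS_128_* names and the inner_size/fanout fields are assumed, in the spirit of the thread's size-class array):

	/* grow an inner node128 from the partial to the full size class */
	memcpy(new128, n128, rt_size_class_info[RT_CLASS_128_PARTIAL].inner_size);

	/*
	 * Zero the newly exposed tail: inner-node128 insertion scans the
	 * children array for a NULL entry, so the extra slots must not
	 * contain stale bytes.
	 */
	memset((char *) new128 + rt_size_class_info[RT_CLASS_128_PARTIAL].inner_size,
		   0,
		   rt_size_class_info[RT_CLASS_128_FULL].inner_size -
		   rt_size_class_info[RT_CLASS_128_PARTIAL].inner_size);

	new128->base.n.fanout = rt_size_class_info[RT_CLASS_128_FULL].fanout;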
Regarding node128 leaf, note that it's slightly larger than a DSA size class, and we can trim it to fit:
node61: 6 + 256+(2) +16 + 61*8 = 768
node125: 6 + 256+(2) +16 + 125*8 = 1280
Agreed, changed.
I've attached the patches I used for the verification. I don't include
patches for pointer tagging, DSA support, and vacuum integration since
I'm investigating the issue on cfbot that Andres reported. Also, I've
modified tests to improve the test coverage.

Sounds good. For v12, I think size classes have proven themselves, so v11's 0002/4/5 can be squashed. Plus, some additional comments:
+/* Return a new and initialized node */
+static rt_node *
+rt_alloc_init_node(radix_tree *tree, uint8 kind, uint8 shift, uint8 chunk, bool inner)
+{
+	rt_node    *newnode;
+
+	newnode = rt_alloc_node(tree, kind, inner);
+	rt_init_node(newnode, kind, shift, chunk, inner);
+
+	return newnode;
+}

I don't see the point of a function that just calls two functions.
Removed.
+/*
+ * Create a new node with 'new_kind' and the same shift, chunk, and
+ * count of 'node'.
+ */
+static rt_node *
+rt_grow_node(radix_tree *tree, rt_node *node, int new_kind)
+{
+	rt_node    *newnode;
+
+	newnode = rt_alloc_init_node(tree, new_kind, node->shift, node->chunk,
+								 node->shift > 0);
+	newnode->count = node->count;
+
+	return newnode;
+}

This, in turn, just calls a function that does _almost_ everything, and additionally must set one member. This function should really be alloc-node + init-node + copy-common, where copy-common is like in the prototype:

+	newnode->node_shift = oldnode->node_shift;
+	newnode->node_chunk = oldnode->node_chunk;
+	newnode->count = oldnode->count;

And init-node should really be just memset + set kind + set initial fanout. It has no business touching "shift" and "chunk". The callers rt_new_root, rt_set_extend, and rt_extend set some values of their own anyway, so let them set those, too -- it might even improve readability.
- if (n32->base.n.fanout == rt_size_class_info[RT_CLASS_32_PARTIAL].fanout)
+ if (NODE_NEEDS_TO_GROW_CLASS(n32, RT_CLASS_32_PARTIAL))
Agreed.
This macro doesn't really improve readability -- it obscures what is being tested, and the name implies the "else" branch means "node doesn't need to grow class", which is false. If we want to simplify expressions in this block, I think it'd be more effective to improve the lines that follow:
+	memcpy(new32, n32, rt_size_class_info[RT_CLASS_32_PARTIAL].inner_size);
+	new32->base.n.fanout = rt_size_class_info[RT_CLASS_32_FULL].fanout;

Maybe we can have const variables old_size and new_fanout to break out the array lookup? While I'm thinking of it, these arrays should be const so the compiler can avoid runtime lookups. Speaking of...
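For example (a sketch, assuming rt_size_class_info has been made const so these lookups fold to constants):

	const Size	old_size = rt_size_class_info[RT_CLASS_32_PARTIAL].inner_size;
	const uint8	new_fanout = rt_size_class_info[RT_CLASS_32_FULL].fanout;

	memcpy(new32, n32, old_size);
	new32->base.n.fanout = new_fanout;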
+/* Copy both chunks and children/values arrays */
+static inline void
+chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
+						  uint8 *dst_chunks, rt_node **dst_children, int count)
+{
+	/* For better code generation */
+	if (count > rt_node_kind_info[RT_NODE_KIND_4].fanout)
+		pg_unreachable();
+
+	memcpy(dst_chunks, src_chunks, sizeof(uint8) * count);
+	memcpy(dst_children, src_children, sizeof(rt_node *) * count);
+}

When I looked at this earlier, I somehow didn't go far enough -- why are we passing the runtime count in the first place? This function can only be called if count == rt_size_class_info[RT_CLASS_4_FULL].fanout. The last parameter to memcpy should evaluate to a compile-time constant, right? Even when we add node shrinking in the future, the constant should be correct, IIUC?
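Something like the following, perhaps (a sketch; it relies on rt_size_class_info being const so the fanout lookup becomes a compile-time constant):

/* Copy both chunks and children arrays; count is implicitly the full node4 fanout */
static inline void
chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
						  uint8 *dst_chunks, rt_node **dst_children)
{
	const int	count = rt_size_class_info[RT_CLASS_4_FULL].fanout;

	memcpy(dst_chunks, src_chunks, sizeof(uint8) * count);
	memcpy(dst_children, src_children, sizeof(rt_node *) * count);
}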
Right. We don't need to pass count to these functions.
-	.fanout = 256,
+	/* technically it's 256, but we can't store that in a uint8,
+	   and this is the max size class so it will never grow */
+	.fanout = 0,

-	Assert(chunk_exists || NODE_HAS_FREE_SLOT(n256));
+	Assert(((rt_node *) n256)->fanout == 0);
+	Assert(chunk_exists || ((rt_node *) n256)->count < 256);

These hacks were my work, but I think we can improve that by having two versions of NODE_HAS_FREE_SLOT -- one for fixed- and one for variable-sized nodes. For that to work, in "init-node" we'd need a branch to set fanout to zero for node256. That should be fine -- it already has to branch for memset'ing node128's indexes to 0xFF.
Since the node has fanout regardless of fixed-sized and
variable-sized, only node256 is the special case where the fanout in
the node doesn't match the actual fanout of the node. I think if we
want to have two versions of NODE_HAS_FREE_SLOT, we can have one for
node256 and one for other classes. Thoughts? In your idea, for
NODE_HAS_FREE_SLOT for fixed-sized nodes, you meant like the
following?
#define FIXED_NODE_HAS_FREE_SLOT(node, class)
(node->base.n.count < rt_size_class_info[class].fanout)
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Fri, Nov 25, 2022 at 6:47 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Thu, Nov 24, 2022 at 9:54 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
[v11]
There is one more thing that just now occurred to me: In expanding the use of size classes, that makes rebasing and reworking the shared memory piece more work than it should be. That's important because there are still some open questions about the design around shared memory. To keep unnecessary churn to a minimum, perhaps we should limit size class expansion to just one (or 5 total size classes) for the near future?
Makes sense. We can add size classes once we have a good design and
implementation around shared memory.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Tue, Nov 29, 2022 at 1:36 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
While creating a benchmark for inserting into node128-inner, I found a bug. If a caller deletes from a node128, the slot index is set to invalid, but the child pointer is still valid. Do that a few times, and every child pointer is valid, even if no slot index points to it. When the next inserter comes along, something surprising happens. This function:
/* Return an unused slot in node-128 */
static int
node_inner_128_find_unused_slot(rt_node_inner_128 *node, uint8 chunk)
{
int slotpos = 0;

Assert(!NODE_IS_LEAF(node));
while (node_inner_128_is_slot_used(node, slotpos))
slotpos++;

return slotpos;
}

...passes an integer to this function, whose parameter is a uint8:
/* Is the slot in the node used? */
static inline bool
node_inner_128_is_slot_used(rt_node_inner_128 *node, uint8 slot)
{
Assert(!NODE_IS_LEAF(node));
return (node->children[slot] != NULL);
}

...so instead of growing the node unnecessarily or segfaulting, it enters an infinite loop doing this:
add eax, 1
movzx ecx, al
cmp QWORD PTR [rbx+264+rcx*8], 0
jne .L147

The fix is easy enough -- set the child pointer to null upon deletion,
Good catch!
but I'm somewhat astonished that the regression tests didn't hit this. I do still intend to replace this code with something faster, but before I do so the tests should probably exercise the deletion paths more. Since VACUUM
Indeed, there are some tests for deletion but all of them delete all
keys in the node so we end up deleting the node. I've added tests of
repeating deletion and insertion as well as additional assertions.
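The new tests are roughly of this shape (a sketch; rt_set/rt_delete/rt_search are the radix tree API names used elsewhere in the thread, and "keys"/"radixtree" are hypothetical test-module variables):

	/*
	 * Repeatedly delete and re-insert the same keys so that freed node128
	 * slots (and their child pointers) must be found and reused.
	 */
	for (int iter = 0; iter < 1000; iter++)
	{
		uint64		key = keys[iter % nkeys];
		uint64		val;

		rt_delete(radixtree, key);
		Assert(!rt_search(radixtree, key, &val));

		rt_set(radixtree, key, key);
		Assert(rt_search(radixtree, key, &val) && val == key);
	}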
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Wed, Nov 23, 2022 at 2:10 AM Andres Freund <andres@anarazel.de> wrote:
On 2022-11-21 17:06:56 +0900, Masahiko Sawada wrote:
Sure. I've attached the v10 patches. 0004 is the pure refactoring
patch and the 0005 patch introduces the pointer tagging.

This failed on cfbot, with so many crashes that the VM ran out of disk for
core dumps. During testing with 32bit, so there's probably something broken
around that.

https://cirrus-ci.com/task/4635135954386944
A failure is e.g. at: https://api.cirrus-ci.com/v1/artifact/task/4635135954386944/testrun/build-32/testrun/adminpack/regress/log/initdb.log
performing post-bootstrap initialization ... ../src/backend/lib/radixtree.c:1696:21: runtime error: member access within misaligned address 0x590faf74 for type 'struct radix_tree_control', which requires 8 byte alignment
0x590faf74: note: pointer points here
90 11 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
^
radix_tree_control struct has two pg_atomic_uint64 variables, and the
assertion check in pg_atomic_init_u64() failed:
static inline void
pg_atomic_init_u64(volatile pg_atomic_uint64 *ptr, uint64 val)
{
/*
* Can't necessarily enforce alignment - and don't need it - when using
* the spinlock based fallback implementation. Therefore only assert when
* not using it.
*/
#ifndef PG_HAVE_ATOMIC_U64_SIMULATION
AssertPointerAlignment(ptr, 8);
#endif
pg_atomic_init_u64_impl(ptr, val);
}
I've investigated this issue and have a question about using atomic
variables on palloc'ed memory. In non-parallel vacuum cases,
radix_tree_control is allocated via aset.c. IIUC in 32-bit machines,
the memory allocated by aset.c is 4-bytes aligned so these atomic
variables are not always 8-bytes aligned. Is there any way to enforce
8-bytes aligned memory allocations in 32-bit machines?
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Wed, Nov 30, 2022 at 11:09 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
I've investigated this issue and have a question about using atomic
variables on palloc'ed memory. In non-parallel vacuum cases,
radix_tree_control is allocated via aset.c. IIUC in 32-bit machines,
the memory allocated by aset.c is 4-bytes aligned so these atomic
variables are not always 8-bytes aligned. Is there any way to enforce
8-bytes aligned memory allocations in 32-bit machines?
The bigger question in my mind is: Why is there an atomic variable in
backend-local memory?
--
John Naylor
EDB: http://www.enterprisedb.com
On Wed, Nov 30, 2022 at 2:28 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
On Fri, Nov 25, 2022 at 5:00 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
These hacks were my work, but I think we can improve that by having two
versions of NODE_HAS_FREE_SLOT -- one for fixed- and one for variable-sized
nodes. For that to work, in "init-node" we'd need a branch to set fanout to
zero for node256. That should be fine -- it already has to branch for
memset'ing node128's indexes to 0xFF.
Since the node has fanout regardless of fixed-sized and
variable-sized
As currently coded, yes. But that's not strictly necessary, I think.
, only node256 is the special case where the fanout in
the node doesn't match the actual fanout of the node. I think if we
want to have two versions of NODE_HAS_FREE_SLOT, we can have one for
node256 and one for other classes. Thoughts? In your idea, for
NODE_HAS_FREE_SLOT for fixed-sized nodes, you meant like the
following?

#define FIXED_NODE_HAS_FREE_SLOT(node, class)
(node->base.n.count < rt_size_class_info[class].fanout)
Right, and the other one could be VAR_NODE_...
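In code, the two variants might look roughly like this (a sketch; the macro names and the exact node256 handling are still open at this point in the thread):

#define FIXED_NODE_HAS_FREE_SLOT(node, class) \
	((node)->base.n.count < rt_size_class_info[class].fanout)

#define VAR_NODE_HAS_FREE_SLOT(node) \
	((node)->base.n.count < (node)->base.n.fanout)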
--
John Naylor
EDB: http://www.enterprisedb.com
On Thu, Dec 1, 2022 at 4:00 PM John Naylor <john.naylor@enterprisedb.com> wrote:
On Wed, Nov 30, 2022 at 11:09 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
I've investigated this issue and have a question about using atomic
variables on palloc'ed memory. In non-parallel vacuum cases,
radix_tree_control is allocated via aset.c. IIUC in 32-bit machines,
the memory allocated by aset.c is 4-bytes aligned so these atomic
variables are not always 8-bytes aligned. Is there any way to enforce
8-bytes aligned memory allocations in 32-bit machines?

The bigger question in my mind is: Why is there an atomic variable in backend-local memory?
Because I use the same radix_tree and radix_tree_control structs for
non-parallel and parallel vacuum. Therefore, radix_tree_control is
allocated in DSM for parallel-vacuum cases or in backend-local memory
for non-parallel vacuum cases.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Thu, Dec 1, 2022 at 3:03 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
On Thu, Dec 1, 2022 at 4:00 PM John Naylor <john.naylor@enterprisedb.com>
wrote:
The bigger question in my mind is: Why is there an atomic variable in
backend-local memory?
Because I use the same radix_tree and radix_tree_control structs for
non-parallel and parallel vacuum. Therefore, radix_tree_control is
allocated in DSM for parallel-vacuum cases or in backend-local memory
for non-parallel vacuum cases.
Ok, that could be yet another reason to compile local- and shared-memory
functionality separately, but now I'm wondering why there are atomic
variables at all, since there isn't yet any locking support.
--
John Naylor
EDB: http://www.enterprisedb.com
On Wed, Nov 30, 2022 at 2:51 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
There are a few things up in the air, so I'm coming back to this list to summarize and add a recent update:
On Mon, Nov 14, 2022 at 7:59 PM John Naylor <john.naylor@enterprisedb.com> wrote:
- See how much performance we actually gain from tagging the node kind.
Needs a benchmark that has enough branch mispredicts and L2/3 misses to show a benefit. Otherwise either neutral or worse in its current form, depending on compiler(?). Put off for later.
- Try additional size classes while keeping the node kinds to only four.
This is relatively simple and effective. If only one additional size class (total 5) is coded as a placeholder, I imagine it will be easier to rebase shared memory logic than using this technique everywhere possible.
- Optimize node128 insert.
I've attached a rough start at this. The basic idea is borrowed from our bitmapset nodes, so we can iterate over and operate on word-sized (32- or 64-bit) types at a time, rather than bytes.
Thanks! I think this is a good idea.
To make this easier, I've moved some of the lower-level macros and types from bitmapset.h/.c to pg_bitutils.h. That's probably going to need a separate email thread to resolve the coding style clash this causes, so that can be put off for later.
Agreed. Since tidbitmap.c also has WORDNUM(x) and BITNUM(x), it could
use them as well once we move them out of bitmapset.h.
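Putting the word-at-a-time idea together, the unused-slot search in the addendum (excerpted in the diff quoted earlier in the thread) boils down to something like this, using the WORDNUM, RIGHTMOST_ONE, and bmw_rightmost_one_pos helpers that the patch moves into pg_bitutils.h:

static int
node128_find_unused_slot(bitmapword *isset)
{
	int			idx;
	bitmapword	inverse;
	int			slotpos;

	/* skip over words whose bits are all set */
	for (idx = 0; idx < WORDNUM(128); idx++)
	{
		if (isset[idx] < ~((bitmapword) 0))
			break;
	}

	/* the first unset bit in X is the first set bit in ~X */
	inverse = ~(isset[idx]);
	slotpos = idx * BITS_PER_BITMAPWORD;
	slotpos += bmw_rightmost_one_pos(inverse);

	/* mark the slot used */
	isset[idx] |= RIGHTMOST_ONE(inverse);

	return slotpos;
}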
This is not meant to be included in the next patchset. For demonstration purposes, I get these results with a function that repeatedly deletes the last value from a mostly-full node128 leaf and re-inserts it:
select * from bench_node128_load(120);
v11
NOTICE: num_keys = 14400, height = 1, n1 = 0, n4 = 0, n15 = 0, n32 = 0, n61 = 0, n128 = 121, n256 = 0
fanout | nkeys | rt_mem_allocated | rt_sparseload_ms
--------+-------+------------------+------------------
120 | 14400 | 208304 | 56

v11 + 0006 addendum
NOTICE: num_keys = 14400, height = 1, n1 = 0, n4 = 0, n15 = 0, n32 = 0, n61 = 0, n128 = 121, n256 = 0
fanout | nkeys | rt_mem_allocated | rt_sparseload_ms
--------+-------+------------------+------------------
120 | 14400 | 208816 | 34

I didn't test inner nodes, but I imagine the difference is bigger. This bitmap style should also be used for the node256-leaf isset array simply to be consistent and avoid needing single-use macros, but that has not been done yet. It won't make a difference for performance because there is no iteration there.
After updating the patch set according to recent comments, I've also
done the same test in my environment and got similarly good results.
w/o 0006 addendum patch
NOTICE: num_keys = 14400, height = 1, n4 = 0, n15 = 0, n32 = 0, n125
= 121, n256 = 0
fanout | nkeys | rt_mem_allocated | rt_sparseload_ms
--------+-------+------------------+------------------
120 | 14400 | 204424 | 29
(1 row)
w/ 0006 addendum patch
NOTICE: num_keys = 14400, height = 1, n4 = 0, n15 = 0, n32 = 0, n125
= 121, n256 = 0
fanout | nkeys | rt_mem_allocated | rt_sparseload_ms
--------+-------+------------------+------------------
120 | 14400 | 204936 | 18
(1 row)
- Try templating out the differences between local and shared memory.
I hope to start this sometime after the crashes on 32-bit are resolved.
I've attached updated patches that incorporate all the comments I got so
far, as well as fixes for compiler warnings. I included your bitmapword
patch as 0004 for benchmarking. Also, I reverted the change around
pg_atomic_u64: since we don't support any locking, as you mentioned, and
a single lwlock would protect the radix tree, we don't need to use
pg_atomic_u64 just for max_val and num_keys.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachments:
v12-0007-PoC-lazy-vacuum-integration.patch (application/octet-stream)
From e6bce249a60d60ce6ed5eeaf021b5993e7568415 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 4 Nov 2022 14:14:42 +0900
Subject: [PATCH v12 7/7] PoC: lazy vacuum integration.
The patch includes:
* Introducing a new module called TIDStore
* Lazy vacuum and parallel vacuum integration.
TODOs:
* radix tree needs to have the reset functionality.
* should not allow TIDStore to grow beyond the memory limit.
* change the progress statistics of pg_stat_progress_vacuum.
---
src/backend/access/common/Makefile | 1 +
src/backend/access/common/meson.build | 1 +
src/backend/access/common/tidstore.c | 448 ++++++++++++++++++++++++++
src/backend/access/heap/vacuumlazy.c | 164 +++-------
src/backend/commands/vacuum.c | 76 +----
src/backend/commands/vacuumparallel.c | 63 ++--
src/backend/storage/lmgr/lwlock.c | 2 +
src/include/access/tidstore.h | 60 ++++
src/include/commands/vacuum.h | 24 +-
src/include/storage/lwlock.h | 1 +
10 files changed, 612 insertions(+), 228 deletions(-)
create mode 100644 src/backend/access/common/tidstore.c
create mode 100644 src/include/access/tidstore.h
diff --git a/src/backend/access/common/Makefile b/src/backend/access/common/Makefile
index b9aff0ccfd..67b8cc6108 100644
--- a/src/backend/access/common/Makefile
+++ b/src/backend/access/common/Makefile
@@ -27,6 +27,7 @@ OBJS = \
syncscan.o \
toast_compression.o \
toast_internals.o \
+ tidstore.o \
tupconvert.o \
tupdesc.o
diff --git a/src/backend/access/common/meson.build b/src/backend/access/common/meson.build
index 857beaa32d..76265974b1 100644
--- a/src/backend/access/common/meson.build
+++ b/src/backend/access/common/meson.build
@@ -13,6 +13,7 @@ backend_sources += files(
'syncscan.c',
'toast_compression.c',
'toast_internals.c',
+ 'tidstore.c',
'tupconvert.c',
'tupdesc.c',
)
diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
new file mode 100644
index 0000000000..c3cf771f7d
--- /dev/null
+++ b/src/backend/access/common/tidstore.c
@@ -0,0 +1,448 @@
+/*-------------------------------------------------------------------------
+ *
+ * tidstore.c
+ * TID (ItemPointer) storage implementation.
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/common/tidstore.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/tidstore.h"
+#include "lib/radixtree.h"
+#include "utils/dsa.h"
+#include "utils/memutils.h"
+#include "miscadmin.h"
+
+#define XXX_DEBUG_TID_STORE 1
+
+/* XXX: should be configurable for non-heap AMs */
+#define TIDSTORE_OFFSET_NBITS 11 /* pg_ceil_log2_32(MaxHeapTuplesPerPage) */
+
+#define TIDSTORE_VALUE_NBITS 6 /* log(sizeof(uint64) * BITS_PER_BYTE, 2) */
+
+/* Get block number from the key */
+#define KEY_GET_BLKNO(key) \
+ ((BlockNumber) ((key) >> (TIDSTORE_OFFSET_NBITS - TIDSTORE_VALUE_NBITS)))
+
+struct TIDStore
+{
+ /* main storage for TID */
+ radix_tree *tree;
+
+ /* # of tids in TIDStore */
+ int num_tids;
+
+ /* DSA area and handle for shared TIDStore */
+ rt_handle handle;
+ dsa_area *area;
+
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+ ItemPointer itemptrs;
+ uint64 nitems;
+#endif
+};
+
+static void tidstore_iter_collect_tids(TIDStoreIter *iter, uint64 key, uint64 val);
+static inline uint64 tid_to_key_off(ItemPointer tid, uint32 *off);
+
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+/*
+ * Comparator routines for use with qsort() and bsearch().
+ */
+static int
+vac_cmp_itemptr(const void *left, const void *right)
+{
+ BlockNumber lblk,
+ rblk;
+ OffsetNumber loff,
+ roff;
+
+ lblk = ItemPointerGetBlockNumber((ItemPointer) left);
+ rblk = ItemPointerGetBlockNumber((ItemPointer) right);
+
+ if (lblk < rblk)
+ return -1;
+ if (lblk > rblk)
+ return 1;
+
+ loff = ItemPointerGetOffsetNumber((ItemPointer) left);
+ roff = ItemPointerGetOffsetNumber((ItemPointer) right);
+
+ if (loff < roff)
+ return -1;
+ if (loff > roff)
+ return 1;
+
+ return 0;
+}
+
+static void
+verify_iter_tids(TIDStoreIter *iter)
+{
+ uint64 index = iter->prev_index;
+
+ if (iter->ts->itemptrs == NULL)
+ return;
+
+ Assert(index <= iter->ts->nitems);
+
+ for (int i = 0; i < iter->num_offsets; i++)
+ {
+ ItemPointerData tid;
+
+ ItemPointerSetBlockNumber(&tid, iter->blkno);
+ ItemPointerSetOffsetNumber(&tid, iter->offsets[i]);
+
+ Assert(ItemPointerEquals(&iter->ts->itemptrs[index++], &tid));
+ }
+
+ iter->prev_index = iter->itemptrs_index;
+}
+
+static void
+dump_itemptrs(TIDStore *ts)
+{
+ StringInfoData buf;
+
+ if (ts->itemptrs == NULL)
+ return;
+
+ initStringInfo(&buf);
+ for (int i = 0; i < ts->nitems; i++)
+ {
+ appendStringInfo(&buf, "(%d,%d) ",
+ ItemPointerGetBlockNumber(&(ts->itemptrs[i])),
+ ItemPointerGetOffsetNumber(&(ts->itemptrs[i])));
+ }
+ elog(WARNING, "--- dump (" UINT64_FORMAT " items) ---", ts->nitems);
+ elog(WARNING, "%s\n", buf.data);
+}
+
+#endif
+
+/*
+ * Create a TIDStore. The returned object is allocated in backend-local memory.
+ * The radix tree for storage is allocated in the DSA area if 'area' is non-NULL.
+ */
+TIDStore *
+tidstore_create(dsa_area *area)
+{
+ TIDStore *ts;
+
+ ts = palloc0(sizeof(TIDStore));
+
+ ts->tree = rt_create(CurrentMemoryContext, area);
+ ts->area = area;
+
+ if (area != NULL)
+ ts->handle = rt_get_handle(ts->tree);
+
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+#define MAXDEADITEMS(avail_mem) \
+ (avail_mem / sizeof(ItemPointerData))
+
+ if (area == NULL)
+ {
+ ts->itemptrs = (ItemPointer) palloc0(sizeof(ItemPointerData) *
+ MAXDEADITEMS(maintenance_work_mem * 1024));
+ ts->nitems = 0;
+ }
+#endif
+
+ return ts;
+}
+
+/* Attach to the shared TIDStore using a handle */
+TIDStore *
+tidstore_attach(dsa_area *area, rt_handle handle)
+{
+ TIDStore *ts;
+
+ Assert(area != NULL);
+ Assert(DsaPointerIsValid(handle));
+
+ ts = palloc0(sizeof(TIDStore));
+ ts->tree = rt_attach(area, handle);
+
+ return ts;
+}
+
+/*
+ * Detach from a TIDStore. This detaches from radix tree and frees the
+ * backend-local resources.
+ */
+void
+tidstore_detach(TIDStore *ts)
+{
+ rt_detach(ts->tree);
+ pfree(ts);
+}
+
+void
+tidstore_free(TIDStore *ts)
+{
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+ if (ts->itemptrs)
+ pfree(ts->itemptrs);
+#endif
+
+ rt_free(ts->tree);
+ pfree(ts);
+}
+
+void
+tidstore_reset(TIDStore *ts)
+{
+ dsa_area *area = ts->area;
+
+ /* Reset the statistics */
+ ts->num_tids = 0;
+
+ /* Recreate radix tree storage */
+ rt_free(ts->tree);
+ ts->tree = rt_create(CurrentMemoryContext, area);
+
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+ ts->nitems = 0;
+#endif
+}
+
+/* Add TIDs to TIDStore */
+void
+tidstore_add_tids(TIDStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets)
+{
+ uint64 last_key = PG_UINT64_MAX;
+ uint64 key;
+ uint64 val = 0;
+ ItemPointerData tid;
+
+ ItemPointerSetBlockNumber(&tid, blkno);
+
+ for (int i = 0; i < num_offsets; i++)
+ {
+ uint32 off;
+
+ ItemPointerSetOffsetNumber(&tid, offsets[i]);
+
+ key = tid_to_key_off(&tid, &off);
+
+ if (last_key != PG_UINT64_MAX && last_key != key)
+ {
+ rt_set(ts->tree, last_key, val);
+ val = 0;
+ }
+
+ last_key = key;
+ val |= UINT64CONST(1) << off;
+ ts->num_tids++;
+
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+ if (ts->itemptrs)
+ {
+ ItemPointerSetBlockNumber(&(ts->itemptrs[ts->nitems]), blkno);
+ ItemPointerSetOffsetNumber(&(ts->itemptrs[ts->nitems]), offsets[i]);
+ ts->nitems++;
+ }
+#endif
+ }
+
+ if (last_key != PG_UINT64_MAX)
+ {
+ rt_set(ts->tree, last_key, val);
+ val = 0;
+ }
+
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+ if (ts->itemptrs)
+ Assert(ts->nitems == ts->num_tids);
+#endif
+}
+
+/* Return true if the given TID is present in TIDStore */
+bool
+tidstore_lookup_tid(TIDStore *ts, ItemPointer tid)
+{
+ uint64 key;
+ uint64 val;
+ uint32 off;
+ bool found;
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+ bool found_assert;
+#endif
+
+ key = tid_to_key_off(tid, &off);
+
+ found = rt_search(ts->tree, key, &val);
+
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+ if (ts->itemptrs)
+ found_assert = bsearch((void *) tid,
+ (void *) ts->itemptrs,
+ ts->nitems,
+ sizeof(ItemPointerData),
+ vac_cmp_itemptr) != NULL;
+#endif
+
+ if (!found)
+ {
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+ if (ts->itemptrs)
+ Assert(!found_assert);
+#endif
+ return false;
+ }
+
+ found = (val & (UINT64CONST(1) << off)) != 0;
+
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+
+ if (ts->itemptrs && found != found_assert)
+ {
+ elog(WARNING, "tid (%d,%d)\n",
+ ItemPointerGetBlockNumber(tid),
+ ItemPointerGetOffsetNumber(tid));
+ dump_itemptrs(ts);
+ }
+
+ if (ts->itemptrs)
+ Assert(found == found_assert);
+
+#endif
+ return found;
+}
+
+TIDStoreIter *
+tidstore_begin_iterate(TIDStore *ts)
+{
+ TIDStoreIter *iter;
+
+ iter = palloc0(sizeof(TIDStoreIter));
+ iter->ts = ts;
+ iter->tree_iter = rt_begin_iterate(ts->tree);
+ iter->blkno = InvalidBlockNumber;
+
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+ iter->itemptrs_index = 0;
+#endif
+
+ return iter;
+}
+
+bool
+tidstore_iterate_next(TIDStoreIter *iter)
+{
+ uint64 key;
+ uint64 val;
+
+ if (iter->finished)
+ return false;
+
+ if (BlockNumberIsValid(iter->blkno))
+ {
+ iter->num_offsets = 0;
+ tidstore_iter_collect_tids(iter, iter->next_key, iter->next_val);
+ }
+
+ while (rt_iterate_next(iter->tree_iter, &key, &val))
+ {
+ BlockNumber blkno;
+
+ blkno = KEY_GET_BLKNO(key);
+
+ if (BlockNumberIsValid(iter->blkno) && iter->blkno != blkno)
+ {
+ /*
+ * Remember the key-value pair for the next block for the
+ * next iteration.
+ */
+ iter->next_key = key;
+ iter->next_val = val;
+
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+ verify_iter_tids(iter);
+#endif
+ return true;
+ }
+
+ /* Collect tids extracted from the key-value pair */
+ tidstore_iter_collect_tids(iter, key, val);
+ }
+
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+ verify_iter_tids(iter);
+#endif
+
+ iter->finished = true;
+ return true;
+}
+
+uint64
+tidstore_num_tids(TIDStore *ts)
+{
+ return ts->num_tids;
+}
+
+uint64
+tidstore_memory_usage(TIDStore *ts)
+{
+ return (uint64) sizeof(TIDStore) + rt_memory_usage(ts->tree);
+}
+
+/*
+ * Get a handle that can be used by other processes to attach to this TIDStore
+ */
+tidstore_handle
+tidstore_get_handle(TIDStore *ts)
+{
+ return rt_get_handle(ts->tree);
+}
+
+/* Extract TIDs from key-value pair */
+static void
+tidstore_iter_collect_tids(TIDStoreIter *iter, uint64 key, uint64 val)
+{
+ for (int i = 0; i < sizeof(uint64) * BITS_PER_BYTE; i++)
+ {
+ uint64 tid_i;
+ OffsetNumber off;
+
+ if ((val & (UINT64CONST(1) << i)) == 0)
+ continue;
+
+ tid_i = key << TIDSTORE_VALUE_NBITS;
+ tid_i |= i;
+
+ off = tid_i & ((UINT64CONST(1) << TIDSTORE_OFFSET_NBITS) - 1);
+ iter->offsets[iter->num_offsets++] = off;
+
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+ iter->itemptrs_index++;
+#endif
+ }
+
+ iter->blkno = KEY_GET_BLKNO(key);
+}
+
+/* Encode a TID to key and val */
+static inline uint64
+tid_to_key_off(ItemPointer tid, uint32 *off)
+{
+ uint64 upper;
+ uint64 tid_i;
+
+ tid_i = ItemPointerGetOffsetNumber(tid);
+ tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << TIDSTORE_OFFSET_NBITS;
+
+ *off = tid_i & ((1 << TIDSTORE_VALUE_NBITS) - 1);
+ upper = tid_i >> TIDSTORE_VALUE_NBITS;
+ Assert(*off < (sizeof(uint64) * BITS_PER_BYTE));
+
+ return upper;
+}
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index d59711b7ec..75dead6c14 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -40,6 +40,7 @@
#include "access/heapam_xlog.h"
#include "access/htup_details.h"
#include "access/multixact.h"
+#include "access/tidstore.h"
#include "access/transam.h"
#include "access/visibilitymap.h"
#include "access/xact.h"
@@ -144,6 +145,8 @@ typedef struct LVRelState
Relation *indrels;
int nindexes;
+ int max_bytes;
+
/* Aggressive VACUUM? (must set relfrozenxid >= FreezeLimit) */
bool aggressive;
/* Use visibility map to skip? (disabled by DISABLE_PAGE_SKIPPING) */
@@ -194,7 +197,7 @@ typedef struct LVRelState
* lazy_vacuum_heap_rel, which marks the same LP_DEAD line pointers as
* LP_UNUSED during second heap pass.
*/
- VacDeadItems *dead_items; /* TIDs whose index tuples we'll delete */
+ TIDStore *dead_items; /* TIDs whose index tuples we'll delete */
BlockNumber rel_pages; /* total number of pages */
BlockNumber scanned_pages; /* # pages examined (not skipped via VM) */
BlockNumber removed_pages; /* # pages removed by relation truncation */
@@ -265,8 +268,9 @@ static bool lazy_scan_noprune(LVRelState *vacrel, Buffer buf,
static void lazy_vacuum(LVRelState *vacrel);
static bool lazy_vacuum_all_indexes(LVRelState *vacrel);
static void lazy_vacuum_heap_rel(LVRelState *vacrel);
-static int lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
- Buffer buffer, int index, Buffer *vmbuffer);
+static void lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
+ OffsetNumber *offsets, int num_offsets,
+ Buffer buffer, Buffer *vmbuffer);
static bool lazy_check_wraparound_failsafe(LVRelState *vacrel);
static void lazy_cleanup_all_indexes(LVRelState *vacrel);
static IndexBulkDeleteResult *lazy_vacuum_one_index(Relation indrel,
@@ -392,6 +396,9 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
vacrel->indname = NULL;
vacrel->phase = VACUUM_ERRCB_PHASE_UNKNOWN;
vacrel->verbose = verbose;
+ vacrel->max_bytes = IsAutoVacuumWorkerProcess() &&
+ autovacuum_work_mem != -1 ?
+ autovacuum_work_mem * 1024L : maintenance_work_mem * 1024L;
errcallback.callback = vacuum_error_callback;
errcallback.arg = vacrel;
errcallback.previous = error_context_stack;
@@ -853,7 +860,7 @@ lazy_scan_heap(LVRelState *vacrel)
next_unskippable_block,
next_failsafe_block = 0,
next_fsm_block_to_vacuum = 0;
- VacDeadItems *dead_items = vacrel->dead_items;
+ TIDStore *dead_items = vacrel->dead_items;
Buffer vmbuffer = InvalidBuffer;
bool next_unskippable_allvis,
skipping_current_range;
@@ -867,7 +874,7 @@ lazy_scan_heap(LVRelState *vacrel)
/* Report that we're scanning the heap, advertising total # of blocks */
initprog_val[0] = PROGRESS_VACUUM_PHASE_SCAN_HEAP;
initprog_val[1] = rel_pages;
- initprog_val[2] = dead_items->max_items;
+ initprog_val[2] = vacrel->max_bytes; /* XXX: should use # of tids */
pgstat_progress_update_multi_param(3, initprog_index, initprog_val);
/* Set up an initial range of skippable blocks using the visibility map */
@@ -937,8 +944,8 @@ lazy_scan_heap(LVRelState *vacrel)
* dead_items TIDs, pause and do a cycle of vacuuming before we tackle
* this page.
*/
- Assert(dead_items->max_items >= MaxHeapTuplesPerPage);
- if (dead_items->max_items - dead_items->num_items < MaxHeapTuplesPerPage)
+ /* XXX: should not allow tidstore to grow beyond max_bytes */
+ if (tidstore_memory_usage(vacrel->dead_items) > vacrel->max_bytes)
{
/*
* Before beginning index vacuuming, we release any pin we may
@@ -1070,11 +1077,17 @@ lazy_scan_heap(LVRelState *vacrel)
if (prunestate.has_lpdead_items)
{
Size freespace;
+ TIDStoreIter *iter;
- lazy_vacuum_heap_page(vacrel, blkno, buf, 0, &vmbuffer);
+ iter = tidstore_begin_iterate(vacrel->dead_items);
+ tidstore_iterate_next(iter);
+ lazy_vacuum_heap_page(vacrel, blkno, iter->offsets, iter->num_offsets,
+ buf, &vmbuffer);
+ Assert(!tidstore_iterate_next(iter));
+ pfree(iter);
/* Forget the LP_DEAD items that we just vacuumed */
- dead_items->num_items = 0;
+ tidstore_reset(dead_items);
/*
* Periodically perform FSM vacuuming to make newly-freed
@@ -1111,7 +1124,7 @@ lazy_scan_heap(LVRelState *vacrel)
* with prunestate-driven visibility map and FSM steps (just like
* the two-pass strategy).
*/
- Assert(dead_items->num_items == 0);
+ Assert(tidstore_num_tids(dead_items) == 0);
}
/*
@@ -1264,7 +1277,7 @@ lazy_scan_heap(LVRelState *vacrel)
* Do index vacuuming (call each index's ambulkdelete routine), then do
* related heap vacuuming
*/
- if (dead_items->num_items > 0)
+ if (tidstore_num_tids(dead_items) > 0)
lazy_vacuum(vacrel);
/*
@@ -1863,25 +1876,16 @@ retry:
*/
if (lpdead_items > 0)
{
- VacDeadItems *dead_items = vacrel->dead_items;
- ItemPointerData tmp;
+ TIDStore *dead_items = vacrel->dead_items;
Assert(!prunestate->all_visible);
Assert(prunestate->has_lpdead_items);
vacrel->lpdead_item_pages++;
+ tidstore_add_tids(dead_items, blkno, deadoffsets, lpdead_items);
- ItemPointerSetBlockNumber(&tmp, blkno);
-
- for (int i = 0; i < lpdead_items; i++)
- {
- ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
- dead_items->items[dead_items->num_items++] = tmp;
- }
-
- Assert(dead_items->num_items <= dead_items->max_items);
pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
- dead_items->num_items);
+ tidstore_num_tids(dead_items));
}
/* Finally, add page-local counts to whole-VACUUM counts */
@@ -2088,8 +2092,7 @@ lazy_scan_noprune(LVRelState *vacrel,
}
else
{
- VacDeadItems *dead_items = vacrel->dead_items;
- ItemPointerData tmp;
+ TIDStore *dead_items = vacrel->dead_items;
/*
* Page has LP_DEAD items, and so any references/TIDs that remain in
@@ -2098,17 +2101,10 @@ lazy_scan_noprune(LVRelState *vacrel,
*/
vacrel->lpdead_item_pages++;
- ItemPointerSetBlockNumber(&tmp, blkno);
-
- for (int i = 0; i < lpdead_items; i++)
- {
- ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
- dead_items->items[dead_items->num_items++] = tmp;
- }
+ tidstore_add_tids(dead_items, blkno, deadoffsets, lpdead_items);
- Assert(dead_items->num_items <= dead_items->max_items);
pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
- dead_items->num_items);
+ tidstore_num_tids(dead_items));
vacrel->lpdead_items += lpdead_items;
@@ -2157,7 +2153,7 @@ lazy_vacuum(LVRelState *vacrel)
if (!vacrel->do_index_vacuuming)
{
Assert(!vacrel->do_index_cleanup);
- vacrel->dead_items->num_items = 0;
+ tidstore_reset(vacrel->dead_items);
return;
}
@@ -2186,7 +2182,7 @@ lazy_vacuum(LVRelState *vacrel)
BlockNumber threshold;
Assert(vacrel->num_index_scans == 0);
- Assert(vacrel->lpdead_items == vacrel->dead_items->num_items);
+ Assert(vacrel->lpdead_items == tidstore_num_tids(vacrel->dead_items));
Assert(vacrel->do_index_vacuuming);
Assert(vacrel->do_index_cleanup);
@@ -2213,8 +2209,8 @@ lazy_vacuum(LVRelState *vacrel)
* cases then this may need to be reconsidered.
*/
threshold = (double) vacrel->rel_pages * BYPASS_THRESHOLD_PAGES;
- bypass = (vacrel->lpdead_item_pages < threshold &&
- vacrel->lpdead_items < MAXDEADITEMS(32L * 1024L * 1024L));
+ bypass = (vacrel->lpdead_item_pages < threshold) &&
+ tidstore_memory_usage(vacrel->dead_items) < (32L * 1024L * 1024L);
}
if (bypass)
@@ -2259,7 +2255,7 @@ lazy_vacuum(LVRelState *vacrel)
* Forget the LP_DEAD items that we just vacuumed (or just decided to not
* vacuum)
*/
- vacrel->dead_items->num_items = 0;
+ /* tidstore_reset(vacrel->dead_items); */
}
/*
@@ -2331,7 +2327,7 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
* place).
*/
Assert(vacrel->num_index_scans > 0 ||
- vacrel->dead_items->num_items == vacrel->lpdead_items);
+ tidstore_num_tids(vacrel->dead_items) == vacrel->lpdead_items);
Assert(allindexes || vacrel->failsafe_active);
/*
@@ -2368,10 +2364,10 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
static void
lazy_vacuum_heap_rel(LVRelState *vacrel)
{
- int index;
BlockNumber vacuumed_pages;
Buffer vmbuffer = InvalidBuffer;
LVSavedErrInfo saved_err_info;
+ TIDStoreIter *iter;
Assert(vacrel->do_index_vacuuming);
Assert(vacrel->do_index_cleanup);
@@ -2388,8 +2384,8 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
vacuumed_pages = 0;
- index = 0;
- while (index < vacrel->dead_items->num_items)
+ iter = tidstore_begin_iterate(vacrel->dead_items);
+ while (tidstore_iterate_next(iter))
{
BlockNumber tblk;
Buffer buf;
@@ -2398,12 +2394,13 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
vacuum_delay_point();
- tblk = ItemPointerGetBlockNumber(&vacrel->dead_items->items[index]);
+ tblk = iter->blkno;
vacrel->blkno = tblk;
buf = ReadBufferExtended(vacrel->rel, MAIN_FORKNUM, tblk, RBM_NORMAL,
vacrel->bstrategy);
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
- index = lazy_vacuum_heap_page(vacrel, tblk, buf, index, &vmbuffer);
+ lazy_vacuum_heap_page(vacrel, tblk, iter->offsets, iter->num_offsets,
+ buf, &vmbuffer);
/* Now that we've vacuumed the page, record its available space */
page = BufferGetPage(buf);
@@ -2427,14 +2424,13 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
* We set all LP_DEAD items from the first heap pass to LP_UNUSED during
* the second heap pass. No more, no less.
*/
- Assert(index > 0);
Assert(vacrel->num_index_scans > 1 ||
- (index == vacrel->lpdead_items &&
+ (tidstore_num_tids(vacrel->dead_items) == vacrel->lpdead_items &&
vacuumed_pages == vacrel->lpdead_item_pages));
ereport(DEBUG2,
- (errmsg("table \"%s\": removed %lld dead item identifiers in %u pages",
- vacrel->relname, (long long) index, vacuumed_pages)));
+ (errmsg("table \"%s\": removed " UINT64_FORMAT "dead item identifiers in %u pages",
+ vacrel->relname, tidstore_num_tids(vacrel->dead_items), vacuumed_pages)));
/* Revert to the previous phase information for error traceback */
restore_vacuum_error_info(vacrel, &saved_err_info);
@@ -2451,11 +2447,10 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
* LP_DEAD item on the page. The return value is the first index immediately
* after all LP_DEAD items for the same page in the array.
*/
-static int
-lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
- int index, Buffer *vmbuffer)
+static void
+lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets, Buffer buffer, Buffer *vmbuffer)
{
- VacDeadItems *dead_items = vacrel->dead_items;
Page page = BufferGetPage(buffer);
OffsetNumber unused[MaxHeapTuplesPerPage];
int uncnt = 0;
@@ -2474,16 +2469,11 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
START_CRIT_SECTION();
- for (; index < dead_items->num_items; index++)
+ for (int i = 0; i < num_offsets; i++)
{
- BlockNumber tblk;
- OffsetNumber toff;
ItemId itemid;
+ OffsetNumber toff = offsets[i];
- tblk = ItemPointerGetBlockNumber(&dead_items->items[index]);
- if (tblk != blkno)
- break; /* past end of tuples for this block */
- toff = ItemPointerGetOffsetNumber(&dead_items->items[index]);
itemid = PageGetItemId(page, toff);
Assert(ItemIdIsDead(itemid) && !ItemIdHasStorage(itemid));
@@ -2563,7 +2553,6 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
/* Revert to the previous phase information for error traceback */
restore_vacuum_error_info(vacrel, &saved_err_info);
- return index;
}
/*
@@ -3065,46 +3054,6 @@ count_nondeletable_pages(LVRelState *vacrel, bool *lock_waiter_detected)
return vacrel->nonempty_pages;
}
-/*
- * Returns the number of dead TIDs that VACUUM should allocate space to
- * store, given a heap rel of size vacrel->rel_pages, and given current
- * maintenance_work_mem setting (or current autovacuum_work_mem setting,
- * when applicable).
- *
- * See the comments at the head of this file for rationale.
- */
-static int
-dead_items_max_items(LVRelState *vacrel)
-{
- int64 max_items;
- int vac_work_mem = IsAutoVacuumWorkerProcess() &&
- autovacuum_work_mem != -1 ?
- autovacuum_work_mem : maintenance_work_mem;
-
- if (vacrel->nindexes > 0)
- {
- BlockNumber rel_pages = vacrel->rel_pages;
-
- max_items = MAXDEADITEMS(vac_work_mem * 1024L);
- max_items = Min(max_items, INT_MAX);
- max_items = Min(max_items, MAXDEADITEMS(MaxAllocSize));
-
- /* curious coding here to ensure the multiplication can't overflow */
- if ((BlockNumber) (max_items / MaxHeapTuplesPerPage) > rel_pages)
- max_items = rel_pages * MaxHeapTuplesPerPage;
-
- /* stay sane if small maintenance_work_mem */
- max_items = Max(max_items, MaxHeapTuplesPerPage);
- }
- else
- {
- /* One-pass case only stores a single heap page's TIDs at a time */
- max_items = MaxHeapTuplesPerPage;
- }
-
- return (int) max_items;
-}
-
/*
* Allocate dead_items (either using palloc, or in dynamic shared memory).
* Sets dead_items in vacrel for caller.
@@ -3115,12 +3064,6 @@ dead_items_max_items(LVRelState *vacrel)
static void
dead_items_alloc(LVRelState *vacrel, int nworkers)
{
- VacDeadItems *dead_items;
- int max_items;
-
- max_items = dead_items_max_items(vacrel);
- Assert(max_items >= MaxHeapTuplesPerPage);
-
/*
* Initialize state for a parallel vacuum. As of now, only one worker can
* be used for an index, so we invoke parallelism only if there are at
@@ -3146,7 +3089,6 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
else
vacrel->pvs = parallel_vacuum_init(vacrel->rel, vacrel->indrels,
vacrel->nindexes, nworkers,
- max_items,
vacrel->verbose ? INFO : DEBUG2,
vacrel->bstrategy);
@@ -3159,11 +3101,7 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
}
/* Serial VACUUM case */
- dead_items = (VacDeadItems *) palloc(vac_max_items_to_alloc_size(max_items));
- dead_items->max_items = max_items;
- dead_items->num_items = 0;
-
- vacrel->dead_items = dead_items;
+ vacrel->dead_items = tidstore_create(NULL);
}
/*
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index a6d5ed1f6b..62db8b0101 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -95,7 +95,6 @@ static bool vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params);
static double compute_parallel_delay(void);
static VacOptValue get_vacoptval_from_boolean(DefElem *def);
static bool vac_tid_reaped(ItemPointer itemptr, void *state);
-static int vac_cmp_itemptr(const void *left, const void *right);
/*
* Primary entry point for manual VACUUM and ANALYZE commands
@@ -2283,16 +2282,16 @@ get_vacoptval_from_boolean(DefElem *def)
*/
IndexBulkDeleteResult *
vac_bulkdel_one_index(IndexVacuumInfo *ivinfo, IndexBulkDeleteResult *istat,
- VacDeadItems *dead_items)
+ TIDStore *dead_items)
{
/* Do bulk deletion */
istat = index_bulk_delete(ivinfo, istat, vac_tid_reaped,
(void *) dead_items);
ereport(ivinfo->message_level,
- (errmsg("scanned index \"%s\" to remove %d row versions",
+ (errmsg("scanned index \"%s\" to remove " UINT64_FORMAT " row versions",
RelationGetRelationName(ivinfo->index),
- dead_items->num_items)));
+ tidstore_num_tids(dead_items))));
return istat;
}
@@ -2323,18 +2322,6 @@ vac_cleanup_one_index(IndexVacuumInfo *ivinfo, IndexBulkDeleteResult *istat)
return istat;
}
-/*
- * Returns the total required space for VACUUM's dead_items array given a
- * max_items value.
- */
-Size
-vac_max_items_to_alloc_size(int max_items)
-{
- Assert(max_items <= MAXDEADITEMS(MaxAllocSize));
-
- return offsetof(VacDeadItems, items) + sizeof(ItemPointerData) * max_items;
-}
-
/*
* vac_tid_reaped() -- is a particular tid deletable?
*
@@ -2345,60 +2332,7 @@ vac_max_items_to_alloc_size(int max_items)
static bool
vac_tid_reaped(ItemPointer itemptr, void *state)
{
- VacDeadItems *dead_items = (VacDeadItems *) state;
- int64 litem,
- ritem,
- item;
- ItemPointer res;
-
- litem = itemptr_encode(&dead_items->items[0]);
- ritem = itemptr_encode(&dead_items->items[dead_items->num_items - 1]);
- item = itemptr_encode(itemptr);
-
- /*
- * Doing a simple bound check before bsearch() is useful to avoid the
- * extra cost of bsearch(), especially if dead items on the heap are
- * concentrated in a certain range. Since this function is called for
- * every index tuple, it pays to be really fast.
- */
- if (item < litem || item > ritem)
- return false;
-
- res = (ItemPointer) bsearch((void *) itemptr,
- (void *) dead_items->items,
- dead_items->num_items,
- sizeof(ItemPointerData),
- vac_cmp_itemptr);
-
- return (res != NULL);
-}
-
-/*
- * Comparator routines for use with qsort() and bsearch().
- */
-static int
-vac_cmp_itemptr(const void *left, const void *right)
-{
- BlockNumber lblk,
- rblk;
- OffsetNumber loff,
- roff;
-
- lblk = ItemPointerGetBlockNumber((ItemPointer) left);
- rblk = ItemPointerGetBlockNumber((ItemPointer) right);
-
- if (lblk < rblk)
- return -1;
- if (lblk > rblk)
- return 1;
-
- loff = ItemPointerGetOffsetNumber((ItemPointer) left);
- roff = ItemPointerGetOffsetNumber((ItemPointer) right);
-
- if (loff < roff)
- return -1;
- if (loff > roff)
- return 1;
+ TIDStore *dead_items = (TIDStore *) state;
- return 0;
+ return tidstore_lookup_tid(dead_items, itemptr);
}
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index f26d796e52..742039b3a6 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -44,7 +44,7 @@
* use small integers.
*/
#define PARALLEL_VACUUM_KEY_SHARED 1
-#define PARALLEL_VACUUM_KEY_DEAD_ITEMS 2
+#define PARALLEL_VACUUM_KEY_DSA 2
#define PARALLEL_VACUUM_KEY_QUERY_TEXT 3
#define PARALLEL_VACUUM_KEY_BUFFER_USAGE 4
#define PARALLEL_VACUUM_KEY_WAL_USAGE 5
@@ -103,6 +103,9 @@ typedef struct PVShared
/* Counter for vacuuming and cleanup */
pg_atomic_uint32 idx;
+
+ /* Handle of the shared TIDStore */
+ tidstore_handle dead_items_handle;
} PVShared;
/* Status used during parallel index vacuum or cleanup */
@@ -166,7 +169,8 @@ struct ParallelVacuumState
PVIndStats *indstats;
/* Shared dead items space among parallel vacuum workers */
- VacDeadItems *dead_items;
+ TIDStore *dead_items;
+ dsa_area *dead_items_area;
/* Points to buffer usage area in DSM */
BufferUsage *buffer_usage;
@@ -222,20 +226,22 @@ static void parallel_vacuum_error_callback(void *arg);
*/
ParallelVacuumState *
parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
- int nrequested_workers, int max_items,
- int elevel, BufferAccessStrategy bstrategy)
+ int nrequested_workers, int elevel,
+ BufferAccessStrategy bstrategy)
{
ParallelVacuumState *pvs;
ParallelContext *pcxt;
PVShared *shared;
- VacDeadItems *dead_items;
+ TIDStore *dead_items;
PVIndStats *indstats;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
+ void *area_space;
+ dsa_area *dead_items_dsa;
bool *will_parallel_vacuum;
Size est_indstats_len;
Size est_shared_len;
- Size est_dead_items_len;
+ Size dsa_minsize = dsa_minimum_size();
int nindexes_mwm = 0;
int parallel_workers = 0;
int querylen;
@@ -283,9 +289,8 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_estimate_chunk(&pcxt->estimator, est_shared_len);
shm_toc_estimate_keys(&pcxt->estimator, 1);
- /* Estimate size for dead_items -- PARALLEL_VACUUM_KEY_DEAD_ITEMS */
- est_dead_items_len = vac_max_items_to_alloc_size(max_items);
- shm_toc_estimate_chunk(&pcxt->estimator, est_dead_items_len);
+ /* Estimate size for dead tuple DSA -- PARALLEL_VACUUM_KEY_DSA */
+ shm_toc_estimate_chunk(&pcxt->estimator, dsa_minsize);
shm_toc_estimate_keys(&pcxt->estimator, 1);
/*
@@ -351,6 +356,16 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_INDEX_STATS, indstats);
pvs->indstats = indstats;
+ /* Prepare DSA space for dead items */
+ area_space = shm_toc_allocate(pcxt->toc, dsa_minsize);
+ shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_DSA, area_space);
+ dead_items_dsa = dsa_create_in_place(area_space, dsa_minsize,
+ LWTRANCHE_PARALLEL_VACUUM_DSA,
+ pcxt->seg);
+ dead_items = tidstore_create(dead_items_dsa);
+ pvs->dead_items = dead_items;
+ pvs->dead_items_area = dead_items_dsa;
+
/* Prepare shared information */
shared = (PVShared *) shm_toc_allocate(pcxt->toc, est_shared_len);
MemSet(shared, 0, est_shared_len);
@@ -360,6 +375,7 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
(nindexes_mwm > 0) ?
maintenance_work_mem / Min(parallel_workers, nindexes_mwm) :
maintenance_work_mem;
+ shared->dead_items_handle = tidstore_get_handle(dead_items);
pg_atomic_init_u32(&(shared->cost_balance), 0);
pg_atomic_init_u32(&(shared->active_nworkers), 0);
@@ -368,15 +384,6 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_SHARED, shared);
pvs->shared = shared;
- /* Prepare the dead_items space */
- dead_items = (VacDeadItems *) shm_toc_allocate(pcxt->toc,
- est_dead_items_len);
- dead_items->max_items = max_items;
- dead_items->num_items = 0;
- MemSet(dead_items->items, 0, sizeof(ItemPointerData) * max_items);
- shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_DEAD_ITEMS, dead_items);
- pvs->dead_items = dead_items;
-
/*
* Allocate space for each worker's BufferUsage and WalUsage; no need to
* initialize
@@ -434,6 +441,9 @@ parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats)
istats[i] = NULL;
}
+ tidstore_free(pvs->dead_items);
+ dsa_detach(pvs->dead_items_area);
+
DestroyParallelContext(pvs->pcxt);
ExitParallelMode();
@@ -442,7 +452,7 @@ parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats)
}
/* Returns the dead items space */
-VacDeadItems *
+TIDStore *
parallel_vacuum_get_dead_items(ParallelVacuumState *pvs)
{
return pvs->dead_items;
@@ -940,7 +950,9 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
Relation *indrels;
PVIndStats *indstats;
PVShared *shared;
- VacDeadItems *dead_items;
+ TIDStore *dead_items;
+ void *area_space;
+ dsa_area *dead_items_area;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
int nindexes;
@@ -984,10 +996,10 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
PARALLEL_VACUUM_KEY_INDEX_STATS,
false);
- /* Set dead_items space */
- dead_items = (VacDeadItems *) shm_toc_lookup(toc,
- PARALLEL_VACUUM_KEY_DEAD_ITEMS,
- false);
+ /* Set dead items */
+ area_space = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_DSA, false);
+ dead_items_area = dsa_attach_in_place(area_space, seg);
+ dead_items = tidstore_attach(dead_items_area, shared->dead_items_handle);
/* Set cost-based vacuum delay */
VacuumCostActive = (VacuumCostDelay > 0);
@@ -1033,6 +1045,9 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
&wal_usage[ParallelWorkerNumber]);
+ tidstore_detach(pvs.dead_items);
+ dsa_detach(dead_items_area);
+
/* Pop the error context stack */
error_context_stack = errcallback.previous;
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index a5ad36ca78..2fb30fe2e7 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -183,6 +183,8 @@ static const char *const BuiltinTrancheNames[] = {
"PgStatsHash",
/* LWTRANCHE_PGSTATS_DATA: */
"PgStatsData",
+ /* LWTRANCHE_PARALLEL_VACUUM_DSA: */
+ "ParallelVacuumDSA",
};
StaticAssertDecl(lengthof(BuiltinTrancheNames) ==
diff --git a/src/include/access/tidstore.h b/src/include/access/tidstore.h
new file mode 100644
index 0000000000..f4ccf1dbc5
--- /dev/null
+++ b/src/include/access/tidstore.h
@@ -0,0 +1,60 @@
+/*-------------------------------------------------------------------------
+ *
+ * tidstore.h
+ * TID storage.
+ *
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/tidstore.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef TIDSTORE_H
+#define TIDSTORE_H
+
+#include "lib/radixtree.h"
+#include "storage/itemptr.h"
+
+typedef dsa_pointer tidstore_handle;
+
+typedef struct TIDStore TIDStore;
+
+typedef struct TIDStoreIter
+{
+ TIDStore *ts;
+
+ rt_iter *tree_iter;
+
+ bool finished;
+
+ uint64 next_key;
+ uint64 next_val;
+
+ BlockNumber blkno;
+ OffsetNumber offsets[MaxOffsetNumber]; /* XXX: usually don't use up */
+ int num_offsets;
+
+#ifdef USE_ASSERT_CHECKING
+ uint64 itemptrs_index;
+ int prev_index;
+#endif
+} TIDStoreIter;
+
+extern TIDStore *tidstore_create(dsa_area *dsa);
+extern TIDStore *tidstore_attach(dsa_area *dsa, dsa_pointer handle);
+extern void tidstore_detach(TIDStore *ts);
+extern void tidstore_free(TIDStore *ts);
+extern void tidstore_reset(TIDStore *ts);
+extern void tidstore_add_tids(TIDStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets);
+extern bool tidstore_lookup_tid(TIDStore *ts, ItemPointer tid);
+extern TIDStoreIter * tidstore_begin_iterate(TIDStore *ts);
+extern bool tidstore_iterate_next(TIDStoreIter *iter);
+extern uint64 tidstore_num_tids(TIDStore *ts);
+extern uint64 tidstore_memory_usage(TIDStore *ts);
+extern tidstore_handle tidstore_get_handle(TIDStore *ts);
+
+#endif /* TIDSTORE_H */
+
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 4e4bc26a8b..c15e6d7a66 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -17,6 +17,7 @@
#include "access/htup.h"
#include "access/genam.h"
#include "access/parallel.h"
+#include "access/tidstore.h"
#include "catalog/pg_class.h"
#include "catalog/pg_statistic.h"
#include "catalog/pg_type.h"
@@ -235,21 +236,6 @@ typedef struct VacuumParams
int nworkers;
} VacuumParams;
-/*
- * VacDeadItems stores TIDs whose index tuples are deleted by index vacuuming.
- */
-typedef struct VacDeadItems
-{
- int max_items; /* # slots allocated in array */
- int num_items; /* current # of entries */
-
- /* Sorted array of TIDs to delete from indexes */
- ItemPointerData items[FLEXIBLE_ARRAY_MEMBER];
-} VacDeadItems;
-
-#define MAXDEADITEMS(avail_mem) \
- (((avail_mem) - offsetof(VacDeadItems, items)) / sizeof(ItemPointerData))
-
/* GUC parameters */
extern PGDLLIMPORT int default_statistics_target; /* PGDLLIMPORT for PostGIS */
extern PGDLLIMPORT int vacuum_freeze_min_age;
@@ -302,18 +288,16 @@ extern Relation vacuum_open_relation(Oid relid, RangeVar *relation,
LOCKMODE lmode);
extern IndexBulkDeleteResult *vac_bulkdel_one_index(IndexVacuumInfo *ivinfo,
IndexBulkDeleteResult *istat,
- VacDeadItems *dead_items);
+ TIDStore *dead_items);
extern IndexBulkDeleteResult *vac_cleanup_one_index(IndexVacuumInfo *ivinfo,
IndexBulkDeleteResult *istat);
-extern Size vac_max_items_to_alloc_size(int max_items);
/* in commands/vacuumparallel.c */
extern ParallelVacuumState *parallel_vacuum_init(Relation rel, Relation *indrels,
int nindexes, int nrequested_workers,
- int max_items, int elevel,
- BufferAccessStrategy bstrategy);
+ int elevel, BufferAccessStrategy bstrategy);
extern void parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats);
-extern VacDeadItems *parallel_vacuum_get_dead_items(ParallelVacuumState *pvs);
+extern TIDStore *parallel_vacuum_get_dead_items(ParallelVacuumState *pvs);
extern void parallel_vacuum_bulkdel_all_indexes(ParallelVacuumState *pvs,
long num_table_tuples,
int num_index_scans);
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index a494cb598f..88e35254d1 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -201,6 +201,7 @@ typedef enum BuiltinTrancheIds
LWTRANCHE_PGSTATS_DSA,
LWTRANCHE_PGSTATS_HASH,
LWTRANCHE_PGSTATS_DATA,
+ LWTRANCHE_PARALLEL_VACUUM_DSA,
LWTRANCHE_FIRST_USER_DEFINED
} BuiltinTrancheIds;
--
2.31.1
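To give a feel for how the new API is meant to be driven, here is a minimal, hypothetical caller sketch based on the declarations in tidstore.h above; it is not part of the patch. The per-block iteration contract (tidstore_iterate_next() filling the iterator's blkno, offsets and num_offsets for one block per call) is inferred from the iterator struct rather than spelled out in the header, so treat that as an assumption.

/*
 * Hypothetical caller sketch, not part of the patch: remember the dead item
 * offsets of one heap page, probe the store as lazy_tid_reaped() would, then
 * walk the stored TIDs block by block.
 */
#include "postgres.h"
#include "access/tidstore.h"
#include "utils/dsa.h"

static void
tidstore_usage_sketch(dsa_area *area, BlockNumber blkno,
					  OffsetNumber *dead_offsets, int ndead)
{
	TIDStore   *ts = tidstore_create(area);
	TIDStoreIter *iter;
	ItemPointerData tid;

	/* record all dead item offsets found on this heap page */
	tidstore_add_tids(ts, blkno, dead_offsets, ndead);

	/* existence check, one per index tuple during index vacuuming */
	ItemPointerSet(&tid, blkno, dead_offsets[0]);
	Assert(tidstore_lookup_tid(ts, &tid));

	/* iterate block by block, e.g. for the heap vacuum phase */
	iter = tidstore_begin_iterate(ts);
	while (tidstore_iterate_next(iter))
		elog(DEBUG1, "block %u has %d dead offsets",
			 iter->blkno, iter->num_offsets);

	elog(DEBUG1, "stored " UINT64_FORMAT " TIDs in " UINT64_FORMAT " bytes",
		 tidstore_num_tids(ts), tidstore_memory_usage(ts));

	tidstore_free(ts);
}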
Attachment: v12-0005-Use-rt_node_ptr-to-reference-radix-tree-nodes.patch (application/octet-stream)
From f9bc757064a1dcbcfb98f9df2a497b510252c0d2 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 14 Nov 2022 11:44:17 +0900
Subject: [PATCH v12 5/7] Use rt_node_ptr to reference radix tree nodes.
---
src/backend/lib/radixtree.c | 688 +++++++++++++++++++++---------------
1 file changed, 398 insertions(+), 290 deletions(-)
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
index 673cc5e46b..a97d86ae2b 100644
--- a/src/backend/lib/radixtree.c
+++ b/src/backend/lib/radixtree.c
@@ -145,6 +145,19 @@ typedef enum rt_size_class
#define RT_SIZE_CLASS_COUNT (RT_CLASS_256 + 1)
} rt_size_class;
+/*
+ * rt_pointer is a pointer compatible with a pointer to local memory and a
+ * pointer for DSA area (i.e. dsa_pointer). Since the radix tree node can be
+ * allocated in backend local memory as well as DSA area, we cannot use a
+ * C-pointer to rt_node (i.e. backend local memory address) for child pointers
+ * in inner nodes. Inner nodes need to use rt_pointer instead. We can get
+ * the backend local memory address of a node from a rt_pointer by using
+ * rt_pointer_decode().
+*/
+typedef uintptr_t rt_pointer;
+#define InvalidRTPointer ((rt_pointer) 0)
+#define RTPointerIsValid(x) (((rt_pointer) (x)) != InvalidRTPointer)
+
/* Common type for all nodes types */
typedef struct rt_node
{
@@ -170,8 +183,7 @@ typedef struct rt_node
/* Node kind, one per search/set algorithm */
uint8 kind;
} rt_node;
-#define NODE_IS_LEAF(n) (((rt_node *) (n))->shift == 0)
-#define NODE_IS_EMPTY(n) (((rt_node *) (n))->count == 0)
+#define RT_NODE_IS_LEAF(n) (((rt_node *) (n))->shift == 0)
#define VAR_NODE_HAS_FREE_SLOT(node) \
((node)->base.n.count < (node)->base.n.fanout)
#define FIXED_NODE_HAS_FREE_SLOT(node, class) \
@@ -235,7 +247,7 @@ typedef struct rt_node_inner_4
rt_node_base_4 base;
/* number of children depends on size class */
- rt_node *children[FLEXIBLE_ARRAY_MEMBER];
+ rt_pointer children[FLEXIBLE_ARRAY_MEMBER];
} rt_node_inner_4;
typedef struct rt_node_leaf_4
@@ -251,7 +263,7 @@ typedef struct rt_node_inner_32
rt_node_base_32 base;
/* number of children depends on size class */
- rt_node *children[FLEXIBLE_ARRAY_MEMBER];
+ rt_pointer children[FLEXIBLE_ARRAY_MEMBER];
} rt_node_inner_32;
typedef struct rt_node_leaf_32
@@ -267,7 +279,7 @@ typedef struct rt_node_inner_125
rt_node_base_125 base;
/* number of children depends on size class */
- rt_node *children[FLEXIBLE_ARRAY_MEMBER];
+ rt_pointer children[FLEXIBLE_ARRAY_MEMBER];
} rt_node_inner_125;
typedef struct rt_node_leaf_125
@@ -287,7 +299,7 @@ typedef struct rt_node_inner_256
rt_node_base_256 base;
/* Slots for 256 children */
- rt_node *children[RT_NODE_MAX_SLOTS];
+ rt_pointer children[RT_NODE_MAX_SLOTS];
} rt_node_inner_256;
typedef struct rt_node_leaf_256
@@ -301,6 +313,29 @@ typedef struct rt_node_leaf_256
uint64 values[RT_NODE_MAX_SLOTS];
} rt_node_leaf_256;
+/* rt_node_ptr is a data structure representing a pointer for a rt_node */
+typedef struct rt_node_ptr
+{
+ rt_pointer encoded;
+ rt_node *decoded;
+} rt_node_ptr;
+#define InvalidRTNodePtr \
+ (rt_node_ptr) {.encoded = InvalidRTPointer, .decoded = NULL}
+#define RTNodePtrIsValid(n) \
+ (!rt_node_ptr_eq((rt_node_ptr *) &(n), &(InvalidRTNodePtr)))
+
+/* Macros for rt_node_ptr to access the fields of rt_node */
+#define NODE_RAW(n) (n.decoded)
+#define NODE_IS_LEAF(n) (NODE_RAW(n)->shift == 0)
+#define NODE_IS_EMPTY(n) (NODE_COUNT(n) == 0)
+#define NODE_KIND(n) (NODE_RAW(n)->kind)
+#define NODE_COUNT(n) (NODE_RAW(n)->count)
+#define NODE_SHIFT(n) (NODE_RAW(n)->shift)
+#define NODE_CHUNK(n) (NODE_RAW(n)->chunk)
+#define NODE_FANOUT(n) (NODE_RAW(n)->fanout)
+#define NODE_HAS_FREE_SLOT(n) \
+ (NODE_COUNT(n) < rt_node_kind_info[NODE_KIND(n)].fanout)
+
/* Information for each size class */
typedef struct rt_size_class_elem
{
@@ -389,7 +424,7 @@ static rt_size_class kind_min_size_class[RT_NODE_KIND_COUNT] = {
*/
typedef struct rt_node_iter
{
- rt_node *node; /* current node being iterated */
+ rt_node_ptr node; /* current node being iterated */
int current_idx; /* current position. -1 for initial value */
} rt_node_iter;
@@ -410,7 +445,7 @@ struct radix_tree
{
MemoryContext context;
- rt_node *root;
+ rt_pointer root;
uint64 max_val;
uint64 num_keys;
@@ -424,27 +459,58 @@ struct radix_tree
};
static void rt_new_root(radix_tree *tree, uint64 key);
-static rt_node *rt_alloc_node(radix_tree *tree, rt_size_class size_class, bool inner);
-static inline void rt_init_node(rt_node *node, uint8 kind, rt_size_class size_class,
+
+static rt_node_ptr rt_alloc_node(radix_tree *tree, rt_size_class size_class, bool inner);
+static inline void rt_init_node(rt_node_ptr node, uint8 kind, rt_size_class size_class,
bool inner);
-static void rt_free_node(radix_tree *tree, rt_node *node);
+static void rt_free_node(radix_tree *tree, rt_node_ptr node);
static void rt_extend(radix_tree *tree, uint64 key);
-static inline bool rt_node_search_inner(rt_node *node, uint64 key, rt_action action,
- rt_node **child_p);
-static inline bool rt_node_search_leaf(rt_node *node, uint64 key, rt_action action,
+static inline bool rt_node_search_inner(rt_node_ptr node_ptr, uint64 key, rt_action action,
+ rt_pointer *child_p);
+static inline bool rt_node_search_leaf(rt_node_ptr node_ptr, uint64 key, rt_action action,
uint64 *value_p);
-static bool rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node,
- uint64 key, rt_node *child);
-static bool rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
+static bool rt_node_insert_inner(radix_tree *tree, rt_node_ptr parent, rt_node_ptr node,
+ uint64 key, rt_node_ptr child);
+static bool rt_node_insert_leaf(radix_tree *tree, rt_node_ptr parent, rt_node_ptr node,
uint64 key, uint64 value);
-static inline rt_node *rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter);
+static inline bool rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
+ rt_node_ptr *child_p);
static inline bool rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
uint64 *value_p);
-static void rt_update_iter_stack(rt_iter *iter, rt_node *from_node, int from);
+static void rt_update_iter_stack(rt_iter *iter, rt_node_ptr from_node, int from);
static inline void rt_iter_update_key(rt_iter *iter, uint8 chunk, uint8 shift);
/* verification (available only with assertion) */
-static void rt_verify_node(rt_node *node);
+static void rt_verify_node(rt_node_ptr node);
+
+/* Decode and encode functions of rt_pointer */
+static inline rt_node *
+rt_pointer_decode(rt_pointer encoded)
+{
+ return (rt_node *) encoded;
+}
+
+static inline rt_pointer
+rt_pointer_encode(rt_node *decoded)
+{
+ return (rt_pointer) decoded;
+}
+
+/* Return a rt_node_ptr created from the given encoded pointer */
+static inline rt_node_ptr
+rt_node_ptr_encoded(rt_pointer encoded)
+{
+ return (rt_node_ptr) {
+ .encoded = encoded,
+ .decoded = rt_pointer_decode(encoded),
+ };
+}
+
+static inline bool
+rt_node_ptr_eq(rt_node_ptr *a, rt_node_ptr *b)
+{
+ return (a->decoded == b->decoded) && (a->encoded == b->encoded);
+}
/*
* Return index of the first element in 'base' that equals 'key'. Return -1
@@ -593,10 +659,10 @@ node_32_get_insertpos(rt_node_base_32 *node, uint8 chunk)
/* Shift the elements right at 'idx' by one */
static inline void
-chunk_children_array_shift(uint8 *chunks, rt_node **children, int count, int idx)
+chunk_children_array_shift(uint8 *chunks, rt_pointer *children, int count, int idx)
{
memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
- memmove(&(children[idx + 1]), &(children[idx]), sizeof(rt_node *) * (count - idx));
+ memmove(&(children[idx + 1]), &(children[idx]), sizeof(rt_pointer) * (count - idx));
}
static inline void
@@ -608,10 +674,10 @@ chunk_values_array_shift(uint8 *chunks, uint64 *values, int count, int idx)
/* Delete the element at 'idx' */
static inline void
-chunk_children_array_delete(uint8 *chunks, rt_node **children, int count, int idx)
+chunk_children_array_delete(uint8 *chunks, rt_pointer *children, int count, int idx)
{
memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
- memmove(&(children[idx]), &(children[idx + 1]), sizeof(rt_node *) * (count - idx - 1));
+ memmove(&(children[idx]), &(children[idx + 1]), sizeof(rt_pointer) * (count - idx - 1));
}
static inline void
@@ -623,12 +689,12 @@ chunk_values_array_delete(uint8 *chunks, uint64 *values, int count, int idx)
/* Copy both chunks and children/values arrays */
static inline void
-chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
- uint8 *dst_chunks, rt_node **dst_children)
+chunk_children_array_copy(uint8 *src_chunks, rt_pointer *src_children,
+ uint8 *dst_chunks, rt_pointer *dst_children)
{
const int fanout = rt_size_class_info[RT_CLASS_4_FULL].fanout;
const Size chunk_size = sizeof(uint8) * fanout;
- const Size children_size = sizeof(rt_node *) * fanout;
+ const Size children_size = sizeof(rt_pointer) * fanout;
memcpy(dst_chunks, src_chunks, chunk_size);
memcpy(dst_children, src_children, children_size);
@@ -660,7 +726,7 @@ node_125_is_chunk_used(rt_node_base_125 *node, uint8 chunk)
static inline bool
node_inner_125_is_slot_used(rt_node_inner_125 *node, uint8 slot)
{
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
Assert(slot < node->base.n.fanout);
return (node->base.isset[WORDNUM(slot)] & ((bitmapword) 1 << BITNUM(slot))) != 0;
}
@@ -668,23 +734,23 @@ node_inner_125_is_slot_used(rt_node_inner_125 *node, uint8 slot)
static inline bool
node_leaf_125_is_slot_used(rt_node_leaf_125 *node, uint8 slot)
{
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
Assert(slot < node->base.n.fanout);
return (node->base.isset[WORDNUM(slot)] & ((bitmapword) 1 << BITNUM(slot))) != 0;
}
#endif
-static inline rt_node *
+static inline rt_pointer
node_inner_125_get_child(rt_node_inner_125 *node, uint8 chunk)
{
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
return node->children[node->base.slot_idxs[chunk]];
}
static inline uint64
node_leaf_125_get_value(rt_node_leaf_125 *node, uint8 chunk)
{
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
Assert(((rt_node_base_125 *) node)->slot_idxs[chunk] != RT_NODE_125_INVALID_IDX);
return node->values[node->base.slot_idxs[chunk]];
}
@@ -694,9 +760,9 @@ node_inner_125_delete(rt_node_inner_125 *node, uint8 chunk)
{
int slotpos = node->base.slot_idxs[chunk];
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
node->base.isset[WORDNUM(slotpos)] &= ~((bitmapword) 1 << BITNUM(slotpos));
- node->children[node->base.slot_idxs[chunk]] = NULL;
+ node->children[node->base.slot_idxs[chunk]] = InvalidRTPointer;
node->base.slot_idxs[chunk] = RT_NODE_125_INVALID_IDX;
}
@@ -705,7 +771,7 @@ node_leaf_125_delete(rt_node_leaf_125 *node, uint8 chunk)
{
int slotpos = node->base.slot_idxs[chunk];
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
node->base.isset[WORDNUM(slotpos)] &= ~((bitmapword) 1 << BITNUM(slotpos));
node->base.slot_idxs[chunk] = RT_NODE_125_INVALID_IDX;
}
@@ -737,11 +803,11 @@ node_125_find_unused_slot(bitmapword *isset)
}
static inline void
-node_inner_125_insert(rt_node_inner_125 *node, uint8 chunk, rt_node *child)
+node_inner_125_insert(rt_node_inner_125 *node, uint8 chunk, rt_pointer child)
{
int slotpos;
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
slotpos = node_125_find_unused_slot(node->base.isset);
Assert(slotpos < node->base.n.fanout);
@@ -756,7 +822,7 @@ node_leaf_125_insert(rt_node_leaf_125 *node, uint8 chunk, uint64 value)
{
int slotpos;
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
slotpos = node_125_find_unused_slot(node->base.isset);
Assert(slotpos < node->base.n.fanout);
@@ -767,16 +833,16 @@ node_leaf_125_insert(rt_node_leaf_125 *node, uint8 chunk, uint64 value)
/* Update the child corresponding to 'chunk' to 'child' */
static inline void
-node_inner_125_update(rt_node_inner_125 *node, uint8 chunk, rt_node *child)
+node_inner_125_update(rt_node_inner_125 *node, uint8 chunk, rt_pointer child)
{
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
node->children[node->base.slot_idxs[chunk]] = child;
}
static inline void
node_leaf_125_update(rt_node_leaf_125 *node, uint8 chunk, uint64 value)
{
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
node->values[node->base.slot_idxs[chunk]] = value;
}
@@ -786,21 +852,21 @@ node_leaf_125_update(rt_node_leaf_125 *node, uint8 chunk, uint64 value)
static inline bool
node_inner_256_is_chunk_used(rt_node_inner_256 *node, uint8 chunk)
{
- Assert(!NODE_IS_LEAF(node));
- return (node->children[chunk] != NULL);
+ Assert(!RT_NODE_IS_LEAF(node));
+ return RTPointerIsValid(node->children[chunk]);
}
static inline bool
node_leaf_256_is_chunk_used(rt_node_leaf_256 *node, uint8 chunk)
{
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
return (node->isset[RT_NODE_BITMAP_BYTE(chunk)] & RT_NODE_BITMAP_BIT(chunk)) != 0;
}
-static inline rt_node *
+static inline rt_pointer
node_inner_256_get_child(rt_node_inner_256 *node, uint8 chunk)
{
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
Assert(node_inner_256_is_chunk_used(node, chunk));
return node->children[chunk];
}
@@ -808,16 +874,16 @@ node_inner_256_get_child(rt_node_inner_256 *node, uint8 chunk)
static inline uint64
node_leaf_256_get_value(rt_node_leaf_256 *node, uint8 chunk)
{
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
Assert(node_leaf_256_is_chunk_used(node, chunk));
return node->values[chunk];
}
/* Set the child in the node-256 */
static inline void
-node_inner_256_set(rt_node_inner_256 *node, uint8 chunk, rt_node *child)
+node_inner_256_set(rt_node_inner_256 *node, uint8 chunk, rt_pointer child)
{
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
node->children[chunk] = child;
}
@@ -825,7 +891,7 @@ node_inner_256_set(rt_node_inner_256 *node, uint8 chunk, rt_node *child)
static inline void
node_leaf_256_set(rt_node_leaf_256 *node, uint8 chunk, uint64 value)
{
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
node->isset[RT_NODE_BITMAP_BYTE(chunk)] |= RT_NODE_BITMAP_BIT(chunk);
node->values[chunk] = value;
}
@@ -834,14 +900,14 @@ node_leaf_256_set(rt_node_leaf_256 *node, uint8 chunk, uint64 value)
static inline void
node_inner_256_delete(rt_node_inner_256 *node, uint8 chunk)
{
- Assert(!NODE_IS_LEAF(node));
- node->children[chunk] = NULL;
+ Assert(!RT_NODE_IS_LEAF(node));
+ node->children[chunk] = InvalidRTPointer;
}
static inline void
node_leaf_256_delete(rt_node_leaf_256 *node, uint8 chunk)
{
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
node->isset[RT_NODE_BITMAP_BYTE(chunk)] &= ~(RT_NODE_BITMAP_BIT(chunk));
}
@@ -877,29 +943,32 @@ rt_new_root(radix_tree *tree, uint64 key)
{
int shift = key_get_shift(key);
bool inner = shift > 0;
- rt_node *newnode;
+ rt_node_ptr newnode;
newnode = rt_alloc_node(tree, RT_CLASS_4_FULL, inner);
rt_init_node(newnode, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
- newnode->shift = shift;
+ NODE_SHIFT(newnode) = shift;
+
tree->max_val = shift_get_max_val(shift);
- tree->root = newnode;
+ tree->root = newnode.encoded;
}
/*
* Allocate a new node with the given node kind.
*/
-static rt_node *
+static rt_node_ptr
rt_alloc_node(radix_tree *tree, rt_size_class size_class, bool inner)
{
- rt_node *newnode;
+ rt_node_ptr newnode;
if (inner)
- newnode = (rt_node *) MemoryContextAlloc(tree->inner_slabs[size_class],
- rt_size_class_info[size_class].inner_size);
+ newnode.decoded = (rt_node *) MemoryContextAlloc(tree->inner_slabs[size_class],
+ rt_size_class_info[size_class].inner_size);
else
- newnode = (rt_node *) MemoryContextAlloc(tree->leaf_slabs[size_class],
- rt_size_class_info[size_class].leaf_size);
+ newnode.decoded = (rt_node *) MemoryContextAlloc(tree->leaf_slabs[size_class],
+ rt_size_class_info[size_class].leaf_size);
+
+ newnode.encoded = rt_pointer_encode(newnode.decoded);
#ifdef RT_DEBUG
/* update the statistics */
@@ -911,20 +980,20 @@ rt_alloc_node(radix_tree *tree, rt_size_class size_class, bool inner)
/* Initialize the node contents */
static inline void
-rt_init_node(rt_node *node, uint8 kind, rt_size_class size_class, bool inner)
+rt_init_node(rt_node_ptr node, uint8 kind, rt_size_class size_class, bool inner)
{
if (inner)
- MemSet(node, 0, rt_size_class_info[size_class].inner_size);
+ MemSet(node.decoded, 0, rt_size_class_info[size_class].inner_size);
else
- MemSet(node, 0, rt_size_class_info[size_class].leaf_size);
+ MemSet(node.decoded, 0, rt_size_class_info[size_class].leaf_size);
- node->kind = kind;
- node->fanout = rt_size_class_info[size_class].fanout;
+ NODE_KIND(node) = kind;
+ NODE_FANOUT(node) = rt_size_class_info[size_class].fanout;
/* Initialize slot_idxs to invalid values */
if (kind == RT_NODE_KIND_125)
{
- rt_node_base_125 *n125 = (rt_node_base_125 *) node;
+ rt_node_base_125 *n125 = (rt_node_base_125 *) node.decoded;
memset(n125->slot_idxs, RT_NODE_125_INVALID_IDX, sizeof(n125->slot_idxs));
}
@@ -934,25 +1003,25 @@ rt_init_node(rt_node *node, uint8 kind, rt_size_class size_class, bool inner)
* and this is the max size class to it will never grow.
*/
if (kind == RT_NODE_KIND_256)
- node->fanout = 0;
+ NODE_FANOUT(node) = 0;
}
static inline void
-rt_copy_node(rt_node *newnode, rt_node *oldnode)
+rt_copy_node(rt_node_ptr newnode, rt_node_ptr oldnode)
{
- newnode->shift = oldnode->shift;
- newnode->chunk = oldnode->chunk;
- newnode->count = oldnode->count;
+ NODE_SHIFT(newnode) = NODE_SHIFT(oldnode);
+ NODE_CHUNK(newnode) = NODE_CHUNK(oldnode);
+ NODE_COUNT(newnode) = NODE_COUNT(oldnode);
}
/*
* Create a new node with 'new_kind' and the same shift, chunk, and
* count of 'node'.
*/
-static rt_node*
-rt_grow_node_kind(radix_tree *tree, rt_node *node, uint8 new_kind)
+static rt_node_ptr
+rt_grow_node_kind(radix_tree *tree, rt_node_ptr node, uint8 new_kind)
{
- rt_node *newnode;
+ rt_node_ptr newnode;
bool inner = !NODE_IS_LEAF(node);
newnode = rt_alloc_node(tree, kind_min_size_class[new_kind], inner);
@@ -964,12 +1033,12 @@ rt_grow_node_kind(radix_tree *tree, rt_node *node, uint8 new_kind)
/* Free the given node */
static void
-rt_free_node(radix_tree *tree, rt_node *node)
+rt_free_node(radix_tree *tree, rt_node_ptr node)
{
/* If we're deleting the root node, make the tree empty */
- if (tree->root == node)
+ if (tree->root == node.encoded)
{
- tree->root = NULL;
+ tree->root = InvalidRTPointer;
tree->max_val = 0;
}
@@ -980,7 +1049,7 @@ rt_free_node(radix_tree *tree, rt_node *node)
/* update the statistics */
for (i = 0; i < RT_SIZE_CLASS_COUNT; i++)
{
- if (node->fanout == rt_size_class_info[i].fanout)
+ if (NODE_FANOUT(node) == rt_size_class_info[i].fanout)
break;
}
@@ -993,29 +1062,30 @@ rt_free_node(radix_tree *tree, rt_node *node)
}
#endif
- pfree(node);
+ pfree(node.decoded);
}
/*
* Replace old_child with new_child, and free the old one.
*/
static void
-rt_replace_node(radix_tree *tree, rt_node *parent, rt_node *old_child,
- rt_node *new_child, uint64 key)
+rt_replace_node(radix_tree *tree, rt_node_ptr parent, rt_node_ptr old_child,
+ rt_node_ptr new_child, uint64 key)
{
- Assert(old_child->chunk == new_child->chunk);
- Assert(old_child->shift == new_child->shift);
+ Assert(NODE_CHUNK(old_child) == NODE_CHUNK(new_child));
+ Assert(NODE_SHIFT(old_child) == NODE_SHIFT(new_child));
- if (parent == old_child)
+ if (rt_node_ptr_eq(&parent, &old_child))
{
/* Replace the root node with the new large node */
- tree->root = new_child;
+ tree->root = new_child.encoded;
}
else
{
bool replaced PG_USED_FOR_ASSERTS_ONLY;
- replaced = rt_node_insert_inner(tree, NULL, parent, key, new_child);
+ replaced = rt_node_insert_inner(tree, InvalidRTNodePtr, parent, key,
+ new_child);
Assert(replaced);
}
@@ -1030,24 +1100,28 @@ static void
rt_extend(radix_tree *tree, uint64 key)
{
int target_shift;
- int shift = tree->root->shift + RT_NODE_SPAN;
+ rt_node *root = rt_pointer_decode(tree->root);
+ int shift = root->shift + RT_NODE_SPAN;
target_shift = key_get_shift(key);
/* Grow tree from 'shift' to 'target_shift' */
while (shift <= target_shift)
{
- rt_node_inner_4 *node;
+ rt_node_ptr node;
+ rt_node_inner_4 *n4;
+
+ node = rt_alloc_node(tree, RT_CLASS_4_FULL, true);
+ rt_init_node(node, RT_NODE_KIND_4, RT_CLASS_4_FULL, true);
- node = (rt_node_inner_4 *) rt_alloc_node(tree, RT_CLASS_4_FULL, true);
- rt_init_node((rt_node *) node, RT_NODE_KIND_4, RT_CLASS_4_FULL, true);
- node->base.n.shift = shift;
- node->base.n.count = 1;
- node->base.chunks[0] = 0;
- node->children[0] = tree->root;
+ n4 = (rt_node_inner_4 *) node.decoded;
+ n4->base.n.shift = shift;
+ n4->base.n.count = 1;
+ n4->base.chunks[0] = 0;
+ n4->children[0] = tree->root;
- tree->root->chunk = 0;
- tree->root = (rt_node *) node;
+ root->chunk = 0;
+ tree->root = node.encoded;
shift += RT_NODE_SPAN;
}
@@ -1060,21 +1134,22 @@ rt_extend(radix_tree *tree, uint64 key)
* Insert inner and leaf nodes from 'node' to bottom.
*/
static inline void
-rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node *parent,
- rt_node *node)
+rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node_ptr parent,
+ rt_node_ptr node)
{
- int shift = node->shift;
+ int shift = NODE_SHIFT(node);
while (shift >= RT_NODE_SPAN)
{
- rt_node *newchild;
+ rt_node_ptr newchild;
int newshift = shift - RT_NODE_SPAN;
bool inner = newshift > 0;
newchild = rt_alloc_node(tree, RT_CLASS_4_FULL, inner);
rt_init_node(newchild, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
- newchild->shift = newshift;
- newchild->chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ NODE_SHIFT(newchild) = newshift;
+ NODE_CHUNK(newchild) = RT_GET_KEY_CHUNK(key, NODE_SHIFT(node));
+
rt_node_insert_inner(tree, parent, node, key, newchild);
parent = node;
@@ -1094,17 +1169,18 @@ rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node *parent,
* pointer is set to child_p.
*/
static inline bool
-rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **child_p)
+rt_node_search_inner(rt_node_ptr node, uint64 key, rt_action action,
+ rt_pointer *child_p)
{
- uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ uint8 chunk = RT_GET_KEY_CHUNK(key, NODE_SHIFT(node));
bool found = false;
- rt_node *child = NULL;
+ rt_pointer child;
- switch (node->kind)
+ switch (NODE_KIND(node))
{
case RT_NODE_KIND_4:
{
- rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node.decoded;
int idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
if (idx < 0)
@@ -1122,7 +1198,7 @@ rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **chil
}
case RT_NODE_KIND_32:
{
- rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node.decoded;
int idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
if (idx < 0)
@@ -1138,7 +1214,7 @@ rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **chil
}
case RT_NODE_KIND_125:
{
- rt_node_inner_125 *n125 = (rt_node_inner_125 *) node;
+ rt_node_inner_125 *n125 = (rt_node_inner_125 *) node.decoded;
if (!node_125_is_chunk_used((rt_node_base_125 *) n125, chunk))
break;
@@ -1154,7 +1230,7 @@ rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **chil
}
case RT_NODE_KIND_256:
{
- rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node.decoded;
if (!node_inner_256_is_chunk_used(n256, chunk))
break;
@@ -1171,7 +1247,7 @@ rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **chil
/* update statistics */
if (action == RT_ACTION_DELETE && found)
- node->count--;
+ NODE_COUNT(node)--;
if (found && child_p)
*child_p = child;
@@ -1187,17 +1263,17 @@ rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **chil
* to the value is set to value_p.
*/
static inline bool
-rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p)
+rt_node_search_leaf(rt_node_ptr node, uint64 key, rt_action action, uint64 *value_p)
{
- uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ uint8 chunk = RT_GET_KEY_CHUNK(key, NODE_SHIFT(node));
bool found = false;
uint64 value = 0;
- switch (node->kind)
+ switch (NODE_KIND(node))
{
case RT_NODE_KIND_4:
{
- rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node.decoded;
int idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
if (idx < 0)
@@ -1215,7 +1291,7 @@ rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p
}
case RT_NODE_KIND_32:
{
- rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node.decoded;
int idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
if (idx < 0)
@@ -1231,7 +1307,7 @@ rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p
}
case RT_NODE_KIND_125:
{
- rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) node;
+ rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) node.decoded;
if (!node_125_is_chunk_used((rt_node_base_125 *) n125, chunk))
break;
@@ -1247,7 +1323,7 @@ rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p
}
case RT_NODE_KIND_256:
{
- rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node.decoded;
if (!node_leaf_256_is_chunk_used(n256, chunk))
break;
@@ -1264,7 +1340,7 @@ rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p
/* update statistics */
if (action == RT_ACTION_DELETE && found)
- node->count--;
+ NODE_COUNT(node)--;
if (found && value_p)
*value_p = value;
@@ -1274,19 +1350,19 @@ rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p
/* Insert the child to the inner node */
static bool
-rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 key,
- rt_node *child)
+rt_node_insert_inner(radix_tree *tree, rt_node_ptr parent, rt_node_ptr node,
+ uint64 key, rt_node_ptr child)
{
- uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ uint8 chunk = RT_GET_KEY_CHUNK(key, NODE_SHIFT(node));
bool chunk_exists = false;
Assert(!NODE_IS_LEAF(node));
- switch (node->kind)
+ switch (NODE_KIND(node))
{
case RT_NODE_KIND_4:
{
- rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node.decoded;
int idx;
idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
@@ -1294,25 +1370,27 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
{
/* found the existing chunk */
chunk_exists = true;
- n4->children[idx] = child;
+ n4->children[idx] = child.encoded;
break;
}
if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n4)))
{
+ rt_node_ptr new;
rt_node_inner_32 *new32;
- Assert(parent != NULL);
+
+ Assert(RTNodePtrIsValid(parent));
/* grow node from 4 to 32 */
- new32 = (rt_node_inner_32 *) rt_grow_node_kind(tree, (rt_node *) n4,
- RT_NODE_KIND_32);
+ new = rt_grow_node_kind(tree, node, RT_NODE_KIND_32);
+ new32 = (rt_node_inner_32 *) new.decoded;
+
chunk_children_array_copy(n4->base.chunks, n4->children,
new32->base.chunks, new32->children);
- Assert(parent != NULL);
- rt_replace_node(tree, parent, (rt_node *) n4, (rt_node *) new32,
- key);
- node = (rt_node *) new32;
+ Assert(RTNodePtrIsValid(parent));
+ rt_replace_node(tree, parent, node, new, key);
+ node = new;
}
else
{
@@ -1325,14 +1403,14 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
count, insertpos);
n4->base.chunks[insertpos] = chunk;
- n4->children[insertpos] = child;
+ n4->children[insertpos] = child.encoded;
break;
}
}
/* FALLTHROUGH */
case RT_NODE_KIND_32:
{
- rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node.decoded;
int idx;
idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
@@ -1340,45 +1418,52 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
{
/* found the existing chunk */
chunk_exists = true;
- n32->children[idx] = child;
+ n32->children[idx] = child.encoded;
break;
}
if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)))
{
- Assert(parent != NULL);
+ Assert(RTNodePtrIsValid(parent));
if (n32->base.n.count == rt_size_class_info[RT_CLASS_32_PARTIAL].fanout)
{
/* use the same node kind, but expand to the next size class */
const Size size = rt_size_class_info[RT_CLASS_32_PARTIAL].inner_size;
const int fanout = rt_size_class_info[RT_CLASS_32_FULL].fanout;
+ rt_node_ptr new;
rt_node_inner_32 *new32;
- new32 = (rt_node_inner_32 *) rt_alloc_node(tree, RT_CLASS_32_FULL, true);
+ new = rt_alloc_node(tree, RT_CLASS_32_FULL, true);
+ new32 = (rt_node_inner_32 *) new.decoded;
memcpy(new32, n32, size);
new32->base.n.fanout = fanout;
- rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new32, key);
+ rt_replace_node(tree, parent, node, new, key);
- /* must update both pointers here */
- node = (rt_node *) new32;
+ /*
+ * Must update both pointers here since we update n32 and
+ * verify node.
+ */
+ node = new;
n32 = new32;
goto retry_insert_inner_32;
}
else
{
+ rt_node_ptr new;
rt_node_inner_125 *new125;
/* grow node from 32 to 125 */
- new125 = (rt_node_inner_125 *) rt_grow_node_kind(tree, (rt_node *) n32,
- RT_NODE_KIND_125);
+ new = rt_grow_node_kind(tree, node, RT_NODE_KIND_125);
+ new125 = (rt_node_inner_125 *) new.decoded;
+
for (int i = 0; i < n32->base.n.count; i++)
node_inner_125_insert(new125, n32->base.chunks[i], n32->children[i]);
- rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new125, key);
- node = (rt_node *) new125;
+ rt_replace_node(tree, parent, node, new, key);
+ node = new;
}
}
else
@@ -1393,7 +1478,7 @@ retry_insert_inner_32:
count, insertpos);
n32->base.chunks[insertpos] = chunk;
- n32->children[insertpos] = child;
+ n32->children[insertpos] = child.encoded;
break;
}
}
@@ -1401,25 +1486,28 @@ retry_insert_inner_32:
/* FALLTHROUGH */
case RT_NODE_KIND_125:
{
- rt_node_inner_125 *n125 = (rt_node_inner_125 *) node;
+ rt_node_inner_125 *n125 = (rt_node_inner_125 *) node.decoded;
int cnt = 0;
if (node_125_is_chunk_used((rt_node_base_125 *) n125, chunk))
{
/* found the existing chunk */
chunk_exists = true;
- node_inner_125_update(n125, chunk, child);
+ node_inner_125_update(n125, chunk, child.encoded);
break;
}
if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n125)))
{
+ rt_node_ptr new;
rt_node_inner_256 *new256;
- Assert(parent != NULL);
+
+ Assert(RTNodePtrIsValid(parent));
/* grow node from 125 to 256 */
- new256 = (rt_node_inner_256 *) rt_grow_node_kind(tree, (rt_node *) n125,
- RT_NODE_KIND_256);
+ new = rt_grow_node_kind(tree, node, RT_NODE_KIND_256);
+ new256 = (rt_node_inner_256 *) new.decoded;
+
for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n125->base.n.count; i++)
{
if (!node_125_is_chunk_used((rt_node_base_125 *) n125, i))
@@ -1429,32 +1517,31 @@ retry_insert_inner_32:
cnt++;
}
- rt_replace_node(tree, parent, (rt_node *) n125, (rt_node *) new256,
- key);
- node = (rt_node *) new256;
+ rt_replace_node(tree, parent, node, new, key);
+ node = new;
}
else
{
- node_inner_125_insert(n125, chunk, child);
+ node_inner_125_insert(n125, chunk, child.encoded);
break;
}
}
/* FALLTHROUGH */
case RT_NODE_KIND_256:
{
- rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node.decoded;
chunk_exists = node_inner_256_is_chunk_used(n256, chunk);
Assert(chunk_exists || FIXED_NODE_HAS_FREE_SLOT(n256, RT_CLASS_256));
- node_inner_256_set(n256, chunk, child);
+ node_inner_256_set(n256, chunk, child.encoded);
break;
}
}
/* Update statistics */
if (!chunk_exists)
- node->count++;
+ NODE_COUNT(node)++;
/*
* Done. Finally, verify the chunk and value is inserted or replaced
@@ -1467,19 +1554,19 @@ retry_insert_inner_32:
/* Insert the value to the leaf node */
static bool
-rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
+rt_node_insert_leaf(radix_tree *tree, rt_node_ptr parent, rt_node_ptr node,
uint64 key, uint64 value)
{
- uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ uint8 chunk = RT_GET_KEY_CHUNK(key, NODE_SHIFT(node));
bool chunk_exists = false;
Assert(NODE_IS_LEAF(node));
- switch (node->kind)
+ switch (NODE_KIND(node))
{
case RT_NODE_KIND_4:
{
- rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node.decoded;
int idx;
idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
@@ -1493,16 +1580,18 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n4)))
{
+ rt_node_ptr new;
rt_node_leaf_32 *new32;
- Assert(parent != NULL);
+
+ Assert(RTNodePtrIsValid(parent));
/* grow node from 4 to 32 */
- new32 = (rt_node_leaf_32 *) rt_grow_node_kind(tree, (rt_node *) n4,
- RT_NODE_KIND_32);
+ new = rt_grow_node_kind(tree, node, RT_NODE_KIND_32);
+ new32 = (rt_node_leaf_32 *) new.decoded;
chunk_values_array_copy(n4->base.chunks, n4->values,
new32->base.chunks, new32->values);
- rt_replace_node(tree, parent, (rt_node *) n4, (rt_node *) new32, key);
- node = (rt_node *) new32;
+ rt_replace_node(tree, parent, node, new, key);
+ node = new;
}
else
{
@@ -1522,7 +1611,7 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
/* FALLTHROUGH */
case RT_NODE_KIND_32:
{
- rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node.decoded;
int idx;
idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
@@ -1536,45 +1625,51 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)))
{
- Assert(parent != NULL);
+ Assert(RTNodePtrIsValid(parent));
if (n32->base.n.count == rt_size_class_info[RT_CLASS_32_PARTIAL].fanout)
{
/* use the same node kind, but expand to the next size class */
const Size size = rt_size_class_info[RT_CLASS_32_PARTIAL].leaf_size;
const int fanout = rt_size_class_info[RT_CLASS_32_FULL].fanout;
+ rt_node_ptr new;
rt_node_leaf_32 *new32;
- new32 = (rt_node_leaf_32 *) rt_alloc_node(tree, RT_CLASS_32_FULL, false);
+ new = rt_alloc_node(tree, RT_CLASS_32_FULL, false);
+ new32 = (rt_node_leaf_32 *) new.decoded;
memcpy(new32, n32, size);
new32->base.n.fanout = fanout;
- rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new32, key);
+ rt_replace_node(tree, parent, node, new, key);
- /* must update both pointers here */
- node = (rt_node *) new32;
+ /*
+ * Must update both pointers here since we update n32 and
+ * verify node.
+ */
+ node = new;
n32 = new32;
goto retry_insert_leaf_32;
}
else
{
+ rt_node_ptr new;
rt_node_leaf_125 *new125;
/* grow node from 32 to 125 */
- new125 = (rt_node_leaf_125 *) rt_grow_node_kind(tree, (rt_node *) n32,
- RT_NODE_KIND_125);
+ new = rt_grow_node_kind(tree, node, RT_NODE_KIND_125);
+ new125 = (rt_node_leaf_125 *) new.decoded;
+
for (int i = 0; i < n32->base.n.count; i++)
node_leaf_125_insert(new125, n32->base.chunks[i], n32->values[i]);
- rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new125,
- key);
- node = (rt_node *) new125;
+ rt_replace_node(tree, parent, node, new, key);
+ node = new;
}
}
else
{
- retry_insert_leaf_32:
+retry_insert_leaf_32:
{
int insertpos = node_32_get_insertpos((rt_node_base_32 *) n32, chunk);
int count = n32->base.n.count;
@@ -1592,7 +1687,7 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
/* FALLTHROUGH */
case RT_NODE_KIND_125:
{
- rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) node;
+ rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) node.decoded;
int cnt = 0;
if (node_125_is_chunk_used((rt_node_base_125 *) n125, chunk))
@@ -1605,12 +1700,14 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n125)))
{
+ rt_node_ptr new;
rt_node_leaf_256 *new256;
- Assert(parent != NULL);
+
+ Assert(RTNodePtrIsValid(parent));
/* grow node from 125 to 256 */
- new256 = (rt_node_leaf_256 *) rt_grow_node_kind(tree, (rt_node *) n125,
- RT_NODE_KIND_256);
+ new = rt_grow_node_kind(tree, node, RT_NODE_KIND_256);
+ new256 = (rt_node_leaf_256 *) new.decoded;
for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n125->base.n.count; i++)
{
if (!node_125_is_chunk_used((rt_node_base_125 *) n125, i))
@@ -1620,9 +1717,8 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
cnt++;
}
- rt_replace_node(tree, parent, (rt_node *) n125, (rt_node *) new256,
- key);
- node = (rt_node *) new256;
+ rt_replace_node(tree, parent, node, new, key);
+ node = new;
}
else
{
@@ -1633,7 +1729,7 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
/* FALLTHROUGH */
case RT_NODE_KIND_256:
{
- rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node.decoded;
chunk_exists = node_leaf_256_is_chunk_used(n256, chunk);
Assert(chunk_exists || FIXED_NODE_HAS_FREE_SLOT(n256, RT_CLASS_256));
@@ -1645,7 +1741,7 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
/* Update statistics */
if (!chunk_exists)
- node->count++;
+ NODE_COUNT(node)++;
/*
* Done. Finally, verify the chunk and value is inserted or replaced
@@ -1669,7 +1765,7 @@ rt_create(MemoryContext ctx)
tree = palloc(sizeof(radix_tree));
tree->context = ctx;
- tree->root = NULL;
+ tree->root = InvalidRTPointer;
tree->max_val = 0;
tree->num_keys = 0;
@@ -1718,26 +1814,23 @@ rt_set(radix_tree *tree, uint64 key, uint64 value)
{
int shift;
bool updated;
- rt_node *node;
- rt_node *parent;
+ rt_node_ptr node;
+ rt_node_ptr parent;
/* Empty tree, create the root */
- if (!tree->root)
+ if (!RTPointerIsValid(tree->root))
rt_new_root(tree, key);
/* Extend the tree if necessary */
if (key > tree->max_val)
rt_extend(tree, key);
- Assert(tree->root);
-
- shift = tree->root->shift;
- node = parent = tree->root;
-
/* Descend the tree until a leaf node */
+ node = parent = rt_node_ptr_encoded(tree->root);
+ shift = NODE_SHIFT(node);
while (shift >= 0)
{
- rt_node *child;
+ rt_pointer child;
if (NODE_IS_LEAF(node))
break;
@@ -1749,7 +1842,7 @@ rt_set(radix_tree *tree, uint64 key, uint64 value)
}
parent = node;
- node = child;
+ node = rt_node_ptr_encoded(child);
shift -= RT_NODE_SPAN;
}
@@ -1770,21 +1863,21 @@ rt_set(radix_tree *tree, uint64 key, uint64 value)
bool
rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
{
- rt_node *node;
+ rt_node_ptr node;
int shift;
Assert(value_p != NULL);
- if (!tree->root || key > tree->max_val)
+ if (!RTPointerIsValid(tree->root) || key > tree->max_val)
return false;
- node = tree->root;
- shift = tree->root->shift;
+ node = rt_node_ptr_encoded(tree->root);
+ shift = NODE_SHIFT(node);
/* Descend the tree until a leaf node */
while (shift >= 0)
{
- rt_node *child;
+ rt_pointer child;
if (NODE_IS_LEAF(node))
break;
@@ -1792,7 +1885,7 @@ rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
return false;
- node = child;
+ node = rt_node_ptr_encoded(child);
shift -= RT_NODE_SPAN;
}
@@ -1806,8 +1899,8 @@ rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
bool
rt_delete(radix_tree *tree, uint64 key)
{
- rt_node *node;
- rt_node *stack[RT_MAX_LEVEL] = {0};
+ rt_node_ptr node;
+ rt_node_ptr stack[RT_MAX_LEVEL] = {0};
int shift;
int level;
bool deleted;
@@ -1819,12 +1912,12 @@ rt_delete(radix_tree *tree, uint64 key)
* Descend the tree to search the key while building a stack of nodes we
* visited.
*/
- node = tree->root;
- shift = tree->root->shift;
+ node = rt_node_ptr_encoded(tree->root);
+ shift = NODE_SHIFT(node);
level = -1;
while (shift > 0)
{
- rt_node *child;
+ rt_pointer child;
/* Push the current node to the stack */
stack[++level] = node;
@@ -1832,7 +1925,7 @@ rt_delete(radix_tree *tree, uint64 key)
if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
return false;
- node = child;
+ node = rt_node_ptr_encoded(child);
shift -= RT_NODE_SPAN;
}
@@ -1883,6 +1976,7 @@ rt_iter *
rt_begin_iterate(radix_tree *tree)
{
MemoryContext old_ctx;
+ rt_node_ptr root;
rt_iter *iter;
int top_level;
@@ -1892,17 +1986,18 @@ rt_begin_iterate(radix_tree *tree)
iter->tree = tree;
/* empty tree */
- if (!iter->tree->root)
+ if (!RTPointerIsValid(iter->tree) || !RTPointerIsValid(iter->tree->root))
return iter;
- top_level = iter->tree->root->shift / RT_NODE_SPAN;
+ root = rt_node_ptr_encoded(iter->tree->root);
+ top_level = NODE_SHIFT(root) / RT_NODE_SPAN;
iter->stack_len = top_level;
/*
* Descend to the left most leaf node from the root. The key is being
* constructed while descending to the leaf.
*/
- rt_update_iter_stack(iter, iter->tree->root, top_level);
+ rt_update_iter_stack(iter, root, top_level);
MemoryContextSwitchTo(old_ctx);
@@ -1913,14 +2008,15 @@ rt_begin_iterate(radix_tree *tree)
* Update each node_iter for inner nodes in the iterator node stack.
*/
static void
-rt_update_iter_stack(rt_iter *iter, rt_node *from_node, int from)
+rt_update_iter_stack(rt_iter *iter, rt_node_ptr from_node, int from)
{
int level = from;
- rt_node *node = from_node;
+ rt_node_ptr node = from_node;
for (;;)
{
rt_node_iter *node_iter = &(iter->stack[level--]);
+ bool found PG_USED_FOR_ASSERTS_ONLY;
node_iter->node = node;
node_iter->current_idx = -1;
@@ -1930,10 +2026,10 @@ rt_update_iter_stack(rt_iter *iter, rt_node *from_node, int from)
return;
/* Advance to the next slot in the inner node */
- node = rt_node_inner_iterate_next(iter, node_iter);
+ found = rt_node_inner_iterate_next(iter, node_iter, &node);
/* We must find the first children in the node */
- Assert(node);
+ Assert(found);
}
}
@@ -1950,7 +2046,7 @@ rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p)
for (;;)
{
- rt_node *child = NULL;
+ rt_node_ptr child = InvalidRTNodePtr;
uint64 value;
int level;
bool found;
@@ -1971,14 +2067,12 @@ rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p)
*/
for (level = 1; level <= iter->stack_len; level++)
{
- child = rt_node_inner_iterate_next(iter, &(iter->stack[level]));
-
- if (child)
+ if (rt_node_inner_iterate_next(iter, &(iter->stack[level]), &child))
break;
}
/* the iteration finished */
- if (!child)
+ if (!RTNodePtrIsValid(child))
return false;
/*
@@ -2010,18 +2104,19 @@ rt_iter_update_key(rt_iter *iter, uint8 chunk, uint8 shift)
* Advance the slot in the inner node. Return the child if exists, otherwise
* null.
*/
-static inline rt_node *
-rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
+static inline bool
+rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter, rt_node_ptr *child_p)
{
- rt_node *child = NULL;
+ rt_node_ptr node = node_iter->node;
+ rt_pointer child;
bool found = false;
uint8 key_chunk;
- switch (node_iter->node->kind)
+ switch (NODE_KIND(node))
{
case RT_NODE_KIND_4:
{
- rt_node_inner_4 *n4 = (rt_node_inner_4 *) node_iter->node;
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node.decoded;
node_iter->current_idx++;
if (node_iter->current_idx >= n4->base.n.count)
@@ -2034,7 +2129,7 @@ rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
}
case RT_NODE_KIND_32:
{
- rt_node_inner_32 *n32 = (rt_node_inner_32 *) node_iter->node;
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node.decoded;
node_iter->current_idx++;
if (node_iter->current_idx >= n32->base.n.count)
@@ -2047,7 +2142,7 @@ rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
}
case RT_NODE_KIND_125:
{
- rt_node_inner_125 *n125 = (rt_node_inner_125 *) node_iter->node;
+ rt_node_inner_125 *n125 = (rt_node_inner_125 *) node.decoded;
int i;
for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
@@ -2067,7 +2162,7 @@ rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
}
case RT_NODE_KIND_256:
{
- rt_node_inner_256 *n256 = (rt_node_inner_256 *) node_iter->node;
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node.decoded;
int i;
for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
@@ -2088,9 +2183,12 @@ rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
}
if (found)
- rt_iter_update_key(iter, key_chunk, node_iter->node->shift);
+ {
+ rt_iter_update_key(iter, key_chunk, NODE_SHIFT(node));
+ *child_p = rt_node_ptr_encoded(child);
+ }
- return child;
+ return found;
}
/*
@@ -2098,19 +2196,18 @@ rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
* is set to value_p, otherwise return false.
*/
static inline bool
-rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
- uint64 *value_p)
+rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter, uint64 *value_p)
{
- rt_node *node = node_iter->node;
+ rt_node_ptr node = node_iter->node;
bool found = false;
uint64 value;
uint8 key_chunk;
- switch (node->kind)
+ switch (NODE_KIND(node))
{
case RT_NODE_KIND_4:
{
- rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node_iter->node;
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node.decoded;
node_iter->current_idx++;
if (node_iter->current_idx >= n4->base.n.count)
@@ -2123,7 +2220,7 @@ rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
}
case RT_NODE_KIND_32:
{
- rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node_iter->node;
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node.decoded;
node_iter->current_idx++;
if (node_iter->current_idx >= n32->base.n.count)
@@ -2136,7 +2233,7 @@ rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
}
case RT_NODE_KIND_125:
{
- rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) node_iter->node;
+ rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) node.decoded;
int i;
for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
@@ -2156,7 +2253,7 @@ rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
}
case RT_NODE_KIND_256:
{
- rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node_iter->node;
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node.decoded;
int i;
for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
@@ -2178,7 +2275,7 @@ rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
if (found)
{
- rt_iter_update_key(iter, key_chunk, node_iter->node->shift);
+ rt_iter_update_key(iter, key_chunk, NODE_SHIFT(node));
*value_p = value;
}
@@ -2215,16 +2312,16 @@ rt_memory_usage(radix_tree *tree)
* Verify the radix tree node.
*/
static void
-rt_verify_node(rt_node *node)
+rt_verify_node(rt_node_ptr node)
{
#ifdef USE_ASSERT_CHECKING
- Assert(node->count >= 0);
+ Assert(NODE_COUNT(node) >= 0);
- switch (node->kind)
+ switch (NODE_KIND(node))
{
case RT_NODE_KIND_4:
{
- rt_node_base_4 *n4 = (rt_node_base_4 *) node;
+ rt_node_base_4 *n4 = (rt_node_base_4 *) node.decoded;
for (int i = 1; i < n4->n.count; i++)
Assert(n4->chunks[i - 1] < n4->chunks[i]);
@@ -2233,7 +2330,7 @@ rt_verify_node(rt_node *node)
}
case RT_NODE_KIND_32:
{
- rt_node_base_32 *n32 = (rt_node_base_32 *) node;
+ rt_node_base_32 *n32 = (rt_node_base_32 *) node.decoded;
for (int i = 1; i < n32->n.count; i++)
Assert(n32->chunks[i - 1] < n32->chunks[i]);
@@ -2242,7 +2339,7 @@ rt_verify_node(rt_node *node)
}
case RT_NODE_KIND_125:
{
- rt_node_base_125 *n125 = (rt_node_base_125 *) node;
+ rt_node_base_125 *n125 = (rt_node_base_125 *) node.decoded;
int cnt = 0;
for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
@@ -2252,10 +2349,10 @@ rt_verify_node(rt_node *node)
/* Check if the corresponding slot is used */
if (NODE_IS_LEAF(node))
- Assert(node_leaf_125_is_slot_used((rt_node_leaf_125 *) node,
+ Assert(node_leaf_125_is_slot_used((rt_node_leaf_125 *) n125,
n125->slot_idxs[i]));
else
- Assert(node_inner_125_is_slot_used((rt_node_inner_125 *) node,
+ Assert(node_inner_125_is_slot_used((rt_node_inner_125 *) n125,
n125->slot_idxs[i]));
cnt++;
@@ -2268,7 +2365,7 @@ rt_verify_node(rt_node *node)
{
if (NODE_IS_LEAF(node))
{
- rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node.decoded;
int cnt = 0;
for (int i = 0; i < RT_NODE_NSLOTS_BITS(RT_NODE_MAX_SLOTS); i++)
@@ -2289,54 +2386,62 @@ rt_verify_node(rt_node *node)
void
rt_stats(radix_tree *tree)
{
+ rt_node *root = rt_pointer_decode(tree->root);
+
+ if (root == NULL)
+ return;
+
ereport(NOTICE, (errmsg("num_keys = " UINT64_FORMAT ", height = %d, n4 = %u, n15 = %u, n32 = %u, n125 = %u, n256 = %u",
- tree->num_keys,
- tree->root->shift / RT_NODE_SPAN,
- tree->cnt[RT_CLASS_4_FULL],
- tree->cnt[RT_CLASS_32_PARTIAL],
- tree->cnt[RT_CLASS_32_FULL],
- tree->cnt[RT_CLASS_125_FULL],
- tree->cnt[RT_CLASS_256])));
+ tree->num_keys,
+ root->shift / RT_NODE_SPAN,
+ tree->cnt[RT_CLASS_4_FULL],
+ tree->cnt[RT_CLASS_32_PARTIAL],
+ tree->cnt[RT_CLASS_32_FULL],
+ tree->cnt[RT_CLASS_125_FULL],
+ tree->cnt[RT_CLASS_256])));
}
static void
-rt_dump_node(rt_node *node, int level, bool recurse)
+rt_dump_node(rt_node_ptr node, int level, bool recurse)
{
- char space[125] = {0};
+ rt_node *n = node.decoded;
+ char space[128] = {0};
fprintf(stderr, "[%s] kind %d, fanout %d, count %u, shift %u, chunk 0x%X:\n",
NODE_IS_LEAF(node) ? "LEAF" : "INNR",
- (node->kind == RT_NODE_KIND_4) ? 4 :
- (node->kind == RT_NODE_KIND_32) ? 32 :
- (node->kind == RT_NODE_KIND_125) ? 125 : 256,
- node->fanout == 0 ? 256 : node->fanout,
- node->count, node->shift, node->chunk);
+
+ (n->kind == RT_NODE_KIND_4) ? 4 :
+ (n->kind == RT_NODE_KIND_32) ? 32 :
+ (n->kind == RT_NODE_KIND_125) ? 125 : 256,
+ n->fanout == 0 ? 256 : n->fanout,
+ n->count, n->shift, n->chunk);
if (level > 0)
sprintf(space, "%*c", level * 4, ' ');
- switch (node->kind)
+ switch (NODE_KIND(node))
{
case RT_NODE_KIND_4:
{
- for (int i = 0; i < node->count; i++)
+ for (int i = 0; i < NODE_COUNT(node); i++)
{
if (NODE_IS_LEAF(node))
{
- rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node.decoded;
fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
space, n4->base.chunks[i], n4->values[i]);
}
else
{
- rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node.decoded;
fprintf(stderr, "%schunk 0x%X ->",
space, n4->base.chunks[i]);
if (recurse)
- rt_dump_node(n4->children[i], level + 1, recurse);
+ rt_dump_node(rt_node_ptr_encoded(n4->children[i]),
+ level + 1, recurse);
else
fprintf(stderr, "\n");
}
@@ -2345,25 +2450,26 @@ rt_dump_node(rt_node *node, int level, bool recurse)
}
case RT_NODE_KIND_32:
{
- for (int i = 0; i < node->count; i++)
+ for (int i = 0; i < NODE_KIND(node); i++)
{
if (NODE_IS_LEAF(node))
{
- rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node.decoded;
fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
space, n32->base.chunks[i], n32->values[i]);
}
else
{
- rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node.decoded;
fprintf(stderr, "%schunk 0x%X ->",
space, n32->base.chunks[i]);
if (recurse)
{
- rt_dump_node(n32->children[i], level + 1, recurse);
+ rt_dump_node(rt_node_ptr_encoded(n32->children[i]),
+ level + 1, recurse);
}
else
fprintf(stderr, "\n");
@@ -2373,7 +2479,7 @@ rt_dump_node(rt_node *node, int level, bool recurse)
}
case RT_NODE_KIND_125:
{
- rt_node_base_125 *b125 = (rt_node_base_125 *) node;
+ rt_node_base_125 *b125 = (rt_node_base_125 *) node.decoded;
fprintf(stderr, "slot_idxs ");
for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
@@ -2385,7 +2491,7 @@ rt_dump_node(rt_node *node, int level, bool recurse)
}
if (NODE_IS_LEAF(node))
{
- rt_node_leaf_125 *n = (rt_node_leaf_125 *) node;
+ rt_node_leaf_125 *n = (rt_node_leaf_125 *) node.decoded;
fprintf(stderr, ", isset-bitmap:");
for (int i = 0; i < WORDNUM(128); i++)
@@ -2415,7 +2521,7 @@ rt_dump_node(rt_node *node, int level, bool recurse)
space, i);
if (recurse)
- rt_dump_node(node_inner_125_get_child(n125, i),
+ rt_dump_node(rt_node_ptr_encoded(node_inner_125_get_child(n125, i)),
level + 1, recurse);
else
fprintf(stderr, "\n");
@@ -2429,7 +2535,7 @@ rt_dump_node(rt_node *node, int level, bool recurse)
{
if (NODE_IS_LEAF(node))
{
- rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node.decoded;
if (!node_leaf_256_is_chunk_used(n256, i))
continue;
@@ -2439,7 +2545,7 @@ rt_dump_node(rt_node *node, int level, bool recurse)
}
else
{
- rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node.decoded;
if (!node_inner_256_is_chunk_used(n256, i))
continue;
@@ -2448,8 +2554,8 @@ rt_dump_node(rt_node *node, int level, bool recurse)
space, i);
if (recurse)
- rt_dump_node(node_inner_256_get_child(n256, i), level + 1,
- recurse);
+ rt_dump_node(rt_node_ptr_encoded(node_inner_256_get_child(n256, i)),
+ level + 1, recurse);
else
fprintf(stderr, "\n");
}
@@ -2462,7 +2568,7 @@ rt_dump_node(rt_node *node, int level, bool recurse)
void
rt_dump_search(radix_tree *tree, uint64 key)
{
- rt_node *node;
+ rt_node_ptr node;
int shift;
int level = 0;
@@ -2470,7 +2576,7 @@ rt_dump_search(radix_tree *tree, uint64 key)
elog(NOTICE, "max_val = " UINT64_FORMAT "(0x" UINT64_FORMAT_HEX ")",
tree->max_val, tree->max_val);
- if (!tree->root)
+ if (!RTPointerIsValid(tree->root))
{
elog(NOTICE, "tree is empty");
return;
@@ -2483,11 +2589,11 @@ rt_dump_search(radix_tree *tree, uint64 key)
return;
}
- node = tree->root;
- shift = tree->root->shift;
+ node = rt_node_ptr_encoded(tree->root);
+ shift = NODE_SHIFT(node);
while (shift >= 0)
{
- rt_node *child;
+ rt_pointer child;
rt_dump_node(node, level, false);
@@ -2504,7 +2610,7 @@ rt_dump_search(radix_tree *tree, uint64 key)
if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
break;
- node = child;
+ node = rt_node_ptr_encoded(child);
shift -= RT_NODE_SPAN;
level++;
}
@@ -2513,6 +2619,7 @@ rt_dump_search(radix_tree *tree, uint64 key)
void
rt_dump(radix_tree *tree)
{
+ rt_node_ptr root;
for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
fprintf(stderr, "%s\tinner_size %zu\tinner_blocksize %zu\tleaf_size %zu\tleaf_blocksize %zu\n",
@@ -2523,12 +2630,13 @@ rt_dump(radix_tree *tree)
rt_size_class_info[i].leaf_blocksize);
fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->max_val);
- if (!tree->root)
+ if (!RTPointerIsValid(tree->root))
{
fprintf(stderr, "empty tree\n");
return;
}
- rt_dump_node(tree->root, 0, true);
+ root = rt_node_ptr_encoded(tree->root);
+ rt_dump_node(root, 0, true);
}
#endif
--
2.31.1
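
A side note on the refactoring in the patch above: rt_node_ptr simply bundles a node's stable "encoded" pointer (the form stored inside parent nodes) with the "decoded" address that the current process dereferences. The following standalone sketch uses made-up example_* names to illustrate the idea; in the local-memory case decoding is just a cast, and the DSA patch later in this series swaps in dsa_get_address() for the shared case.

/*
 * Illustrative sketch only (names are made up, not from the patch): the idea
 * behind rt_node_ptr is to carry both representations of a node pointer.
 */
#include <stdint.h>
#include <stdio.h>

typedef uintptr_t example_pointer;      /* stands in for rt_pointer */

typedef struct example_node
{
    int         count;                  /* stands in for an rt_node field */
} example_node;

typedef struct example_node_ptr
{
    example_pointer encoded;            /* form stored inside parent nodes */
    example_node *decoded;              /* address this process dereferences */
} example_node_ptr;

static example_node_ptr
example_ptr_from_encoded(example_pointer encoded)
{
    /* local-memory case: decoding is just a cast */
    return (example_node_ptr) {.encoded = encoded,
                               .decoded = (example_node *) encoded};
}

int
main(void)
{
    example_node n = {.count = 3};
    example_node_ptr p = example_ptr_from_encoded((example_pointer) &n);

    printf("count = %d\n", p.decoded->count);
    return 0;
}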
Attachment: v12-0003-tool-for-measuring-radix-tree-performance.patch (application/octet-stream)
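
The benchmark below packs each TID into a 64-bit radix tree key plus a bit position within the uint64 value stored for that key (see tid_to_key_off() in the patch). A rough standalone illustration of that packing, assuming 8kB pages where pg_ceil_log2_32(MaxHeapTuplesPerPage) is 9; the block and offset numbers are arbitrary:

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
    uint64_t    block = 1000;           /* hypothetical block number */
    uint64_t    offset = 17;            /* hypothetical offset number */
    uint64_t    tid_i;
    uint64_t    key;
    uint32_t    bit;

    /* offset in the low 9 bits, block number above it */
    tid_i = offset | (block << 9);

    /* low 6 bits select a bit within the uint64 value stored for 'key' */
    bit = tid_i & ((1 << 6) - 1);
    key = tid_i >> 6;

    /* prints: key = 8000, bit = 17 */
    printf("key = %llu, bit = %u\n", (unsigned long long) key, bit);
    return 0;
}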
From 1e244ff8963101b8a74fb3db01fae19f15d620a3 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 16 Sep 2022 11:57:03 +0900
Subject: [PATCH v12 3/7] tool for measuring radix tree performance
---
contrib/bench_radix_tree/Makefile | 21 +
.../bench_radix_tree--1.0.sql | 76 +++
contrib/bench_radix_tree/bench_radix_tree.c | 635 ++++++++++++++++++
.../bench_radix_tree/bench_radix_tree.control | 6 +
contrib/bench_radix_tree/expected/bench.out | 13 +
contrib/bench_radix_tree/sql/bench.sql | 16 +
6 files changed, 767 insertions(+)
create mode 100644 contrib/bench_radix_tree/Makefile
create mode 100644 contrib/bench_radix_tree/bench_radix_tree--1.0.sql
create mode 100644 contrib/bench_radix_tree/bench_radix_tree.c
create mode 100644 contrib/bench_radix_tree/bench_radix_tree.control
create mode 100644 contrib/bench_radix_tree/expected/bench.out
create mode 100644 contrib/bench_radix_tree/sql/bench.sql
diff --git a/contrib/bench_radix_tree/Makefile b/contrib/bench_radix_tree/Makefile
new file mode 100644
index 0000000000..952bb0ceae
--- /dev/null
+++ b/contrib/bench_radix_tree/Makefile
@@ -0,0 +1,21 @@
+# contrib/bench_radix_tree/Makefile
+
+MODULE_big = bench_radix_tree
+OBJS = \
+ bench_radix_tree.o
+
+EXTENSION = bench_radix_tree
+DATA = bench_radix_tree--1.0.sql
+
+REGRESS = bench_fixed_height
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/bench_radix_tree
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
new file mode 100644
index 0000000000..83529805fc
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
@@ -0,0 +1,76 @@
+/* contrib/bench_radix_tree/bench_radix_tree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION bench_radix_tree" to load this file. \quit
+
+create function bench_shuffle_search(
+minblk int4,
+maxblk int4,
+random_block bool DEFAULT false,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT array_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT array_load_ms int8,
+OUT rt_search_ms int8,
+OUT array_serach_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_seq_search(
+minblk int4,
+maxblk int4,
+random_block bool DEFAULT false,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT array_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT array_load_ms int8,
+OUT rt_search_ms int8,
+OUT array_serach_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_load_random_int(
+cnt int8,
+OUT mem_allocated int8,
+OUT load_ms int8)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_search_random_nodes(
+cnt int8,
+filter_str text DEFAULT NULL,
+OUT mem_allocated int8,
+OUT search_ms int8)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_fixed_height_search(
+fanout int4,
+OUT fanout int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT rt_search_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_node128_load(
+fanout int4,
+OUT fanout int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT rt_sparseload_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c
new file mode 100644
index 0000000000..a0693695e6
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -0,0 +1,635 @@
+/*-------------------------------------------------------------------------
+ *
+ * bench_radix_tree.c
+ *
+ * Copyright (c) 2016-2022, PostgreSQL Global Development Group
+ *
+ * contrib/bench_radix_tree/bench_radix_tree.c
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "funcapi.h"
+#include "lib/radixtree.h"
+#include <math.h>
+#include "miscadmin.h"
+#include "storage/lwlock.h"
+#include "utils/builtins.h"
+#include "utils/timestamp.h"
+
+PG_MODULE_MAGIC;
+
+/* run benchmark also for binary-search case? */
+/* #define MEASURE_BINARY_SEARCH 1 */
+
+#define TIDS_PER_BLOCK_FOR_LOAD 30
+#define TIDS_PER_BLOCK_FOR_LOOKUP 50
+
+PG_FUNCTION_INFO_V1(bench_seq_search);
+PG_FUNCTION_INFO_V1(bench_shuffle_search);
+PG_FUNCTION_INFO_V1(bench_load_random_int);
+PG_FUNCTION_INFO_V1(bench_fixed_height_search);
+PG_FUNCTION_INFO_V1(bench_search_random_nodes);
+PG_FUNCTION_INFO_V1(bench_node128_load);
+
+static uint64
+tid_to_key_off(ItemPointer tid, uint32 *off)
+{
+ uint64 upper;
+ uint32 shift = pg_ceil_log2_32(MaxHeapTuplesPerPage);
+ int64 tid_i;
+
+ Assert(ItemPointerGetOffsetNumber(tid) < MaxHeapTuplesPerPage);
+
+ tid_i = ItemPointerGetOffsetNumber(tid);
+ tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << shift;
+
+ /* log(sizeof(uint64) * BITS_PER_BYTE, 2) = log(64, 2) = 6 */
+ *off = tid_i & ((1 << 6) - 1);
+ upper = tid_i >> 6;
+ Assert(*off < (sizeof(uint64) * BITS_PER_BYTE));
+
+ Assert(*off < 64);
+
+ return upper;
+}
+
+static int
+shuffle_randrange(pg_prng_state *state, int lower, int upper)
+{
+ return (int) floor(pg_prng_double(state) * ((upper - lower) + 0.999999)) + lower;
+}
+
+/* Naive Fisher-Yates implementation */
+static void
+shuffle_itemptrs(ItemPointer itemptr, uint64 nitems)
+{
+ /* reproducibility */
+ pg_prng_state state;
+
+ pg_prng_seed(&state, 0);
+
+ for (int i = 0; i < nitems - 1; i++)
+ {
+ int j = shuffle_randrange(&state, i, nitems - 1);
+ ItemPointerData t = itemptr[j];
+
+ itemptr[j] = itemptr[i];
+ itemptr[i] = t;
+ }
+}
+
+static ItemPointer
+generate_tids(BlockNumber minblk, BlockNumber maxblk, int ntids_per_blk, uint64 *ntids_p,
+ bool random_block)
+{
+ ItemPointer tids;
+ uint64 maxitems;
+ uint64 ntids = 0;
+ pg_prng_state state;
+
+ maxitems = (maxblk - minblk + 1) * ntids_per_blk;
+ tids = MemoryContextAllocHuge(TopTransactionContext,
+ sizeof(ItemPointerData) * maxitems);
+
+ if (random_block)
+ pg_prng_seed(&state, 0x9E3779B185EBCA87);
+
+ for (BlockNumber blk = minblk; blk < maxblk; blk++)
+ {
+ if (random_block && !pg_prng_bool(&state))
+ continue;
+
+ for (OffsetNumber off = FirstOffsetNumber;
+ off <= ntids_per_blk; off++)
+ {
+ CHECK_FOR_INTERRUPTS();
+
+ ItemPointerSetBlockNumber(&(tids[ntids]), blk);
+ ItemPointerSetOffsetNumber(&(tids[ntids]), off);
+
+ ntids++;
+ }
+ }
+
+ *ntids_p = ntids;
+ return tids;
+}
+
+#ifdef MEASURE_BINARY_SEARCH
+static int
+vac_cmp_itemptr(const void *left, const void *right)
+{
+ BlockNumber lblk,
+ rblk;
+ OffsetNumber loff,
+ roff;
+
+ lblk = ItemPointerGetBlockNumber((ItemPointer) left);
+ rblk = ItemPointerGetBlockNumber((ItemPointer) right);
+
+ if (lblk < rblk)
+ return -1;
+ if (lblk > rblk)
+ return 1;
+
+ loff = ItemPointerGetOffsetNumber((ItemPointer) left);
+ roff = ItemPointerGetOffsetNumber((ItemPointer) right);
+
+ if (loff < roff)
+ return -1;
+ if (loff > roff)
+ return 1;
+
+ return 0;
+}
+#endif
+
+static Datum
+bench_search(FunctionCallInfo fcinfo, bool shuffle)
+{
+ BlockNumber minblk = PG_GETARG_INT32(0);
+ BlockNumber maxblk = PG_GETARG_INT32(1);
+ bool random_block = PG_GETARG_BOOL(2);
+ radix_tree *rt = NULL;
+ uint64 ntids;
+ uint64 key;
+ uint64 last_key = PG_UINT64_MAX;
+ uint64 val = 0;
+ ItemPointer tids;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_load_ms,
+ rt_search_ms;
+ Datum values[7];
+ bool nulls[7];
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ tids = generate_tids(minblk, maxblk, TIDS_PER_BLOCK_FOR_LOAD, &ntids, random_block);
+
+ /* measure the load time of the radix tree */
+ rt = rt_create(CurrentMemoryContext);
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ uint32 off;
+
+ CHECK_FOR_INTERRUPTS();
+
+ key = tid_to_key_off(tid, &off);
+
+ if (last_key != PG_UINT64_MAX && last_key != key)
+ {
+ rt_set(rt, last_key, val);
+ val = 0;
+ }
+
+ last_key = key;
+ val |= (uint64) 1 << off;
+ }
+ if (last_key != PG_UINT64_MAX)
+ rt_set(rt, last_key, val);
+
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_load_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ if (shuffle)
+ shuffle_itemptrs(tids, ntids);
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+ /* measure the search time of the radix tree */
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ uint32 off;
+ volatile bool ret; /* prevent calling rt_search from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ key = tid_to_key_off(tid, &off);
+
+ ret = rt_search(rt, key, &val);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_search_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int64GetDatum(rt_num_entries(rt));
+ values[1] = Int64GetDatum(rt_memory_usage(rt));
+ values[2] = Int64GetDatum(sizeof(ItemPointerData) * ntids);
+ values[3] = Int64GetDatum(rt_load_ms);
+ nulls[4] = true; /* ar_load_ms */
+ values[5] = Int64GetDatum(rt_search_ms);
+ nulls[6] = true; /* ar_search_ms */
+
+#ifdef MEASURE_BINARY_SEARCH
+ {
+ ItemPointer itemptrs = NULL;
+
+ int64 ar_load_ms,
+ ar_search_ms;
+
+ /* measure the load time of the array */
+ itemptrs = MemoryContextAllocHuge(CurrentMemoryContext,
+ sizeof(ItemPointerData) * ntids);
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointerSetBlockNumber(&(itemptrs[i]),
+ ItemPointerGetBlockNumber(&(tids[i])));
+ ItemPointerSetOffsetNumber(&(itemptrs[i]),
+ ItemPointerGetOffsetNumber(&(tids[i])));
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ ar_load_ms = secs * 1000 + usecs / 1000;
+
+ /* next, measure the search time of the array */
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ volatile bool ret; /* prevent calling bsearch from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ ret = bsearch((void *) tid,
+ (void *) itemptrs,
+ ntids,
+ sizeof(ItemPointerData),
+ vac_cmp_itemptr);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ ar_search_ms = secs * 1000 + usecs / 1000;
+
+ /* set the result */
+ nulls[4] = false;
+ values[4] = Int64GetDatum(ar_load_ms);
+ nulls[6] = false;
+ values[6] = Int64GetDatum(ar_search_ms);
+ }
+#endif
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_seq_search(PG_FUNCTION_ARGS)
+{
+ return bench_search(fcinfo, false);
+}
+
+Datum
+bench_shuffle_search(PG_FUNCTION_ARGS)
+{
+ return bench_search(fcinfo, true);
+}
+
+Datum
+bench_load_random_int(PG_FUNCTION_ARGS)
+{
+ uint64 cnt = (uint64) PG_GETARG_INT64(0);
+ radix_tree *rt;
+ pg_prng_state state;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 load_time_ms;
+ Datum values[2];
+ bool nulls[2];
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ pg_prng_seed(&state, 0);
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ uint64 key = pg_prng_uint64(&state);
+
+ rt_set(rt, key, key);
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ load_time_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int64GetDatum(rt_memory_usage(rt));
+ values[1] = Int64GetDatum(load_time_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+/* copy of splitmix64() */
+static uint64
+hash64(uint64 x)
+{
+ x ^= x >> 30;
+ x *= UINT64CONST(0xbf58476d1ce4e5b9);
+ x ^= x >> 27;
+ x *= UINT64CONST(0x94d049bb133111eb);
+ x ^= x >> 31;
+ return x;
+}
+
+/* attempts to have a relatively even population of node kinds */
+Datum
+bench_search_random_nodes(PG_FUNCTION_ARGS)
+{
+ uint64 cnt = (uint64) PG_GETARG_INT64(0);
+ radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 search_time_ms;
+ Datum values[2] = {0};
+ bool nulls[2] = {0};
+ /* from trial and error */
+ uint64 filter = (((uint64) 0x7F<<32) | (0x07<<24) | (0xFF<<16) | 0xFF);
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ if (!PG_ARGISNULL(1))
+ {
+ char *filter_str = text_to_cstring(PG_GETARG_TEXT_P(1));
+
+ if (sscanf(filter_str, "0x%lX", &filter) == 0)
+ elog(ERROR, "invalid filter string %s", filter_str);
+ }
+ elog(NOTICE, "bench with filter 0x%lX", filter);
+
+ rt = rt_create(CurrentMemoryContext);
+
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ const uint64 hash = hash64(i);
+ const uint64 key = hash & filter;
+
+ rt_set(rt, key, key);
+ }
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ const uint64 hash = hash64(i);
+ const uint64 key = hash & filter;
+ uint64 val;
+ volatile bool ret; /* prevent calling rt_search from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ ret = rt_search(rt, key, &val);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ search_time_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ values[0] = Int64GetDatum(rt_memory_usage(rt));
+ values[1] = Int64GetDatum(search_time_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_fixed_height_search(PG_FUNCTION_ARGS)
+{
+ int fanout = PG_GETARG_INT32(0);
+ radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_load_ms,
+ rt_search_ms;
+ Datum values[5];
+ bool nulls[5];
+
+ /* test boundary between vector and iteration */
+ const int n_keys = 5 * 16 * 16 * 16 * 16;
+ uint64 r,
+ h,
+ i,
+ j,
+ k;
+ int key_id;
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+
+ key_id = 0;
+
+ /*
+ * lower nodes have limited fanout, the top is only limited by
+ * bits-per-byte
+ */
+ for (r = 1;; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ for (i = 1; i <= fanout; i++)
+ {
+ for (j = 1; j <= fanout; j++)
+ {
+ for (k = 1; k <= fanout; k++)
+ {
+ uint64 key;
+
+ key = (r << 32) | (h << 24) | (i << 16) | (j << 8) | (k);
+
+ CHECK_FOR_INTERRUPTS();
+
+ key_id++;
+ if (key_id > n_keys)
+ goto finish_set;
+
+ rt_set(rt, key, key_id);
+ }
+ }
+ }
+ }
+ }
+finish_set:
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_load_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ /* measure the search time of the radix tree */
+ start_time = GetCurrentTimestamp();
+
+ key_id = 0;
+ for (r = 1;; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ for (i = 1; i <= fanout; i++)
+ {
+ for (j = 1; j <= fanout; j++)
+ {
+ for (k = 1; k <= fanout; k++)
+ {
+ uint64 key,
+ val;
+
+ key = (r << 32) | (h << 24) | (i << 16) | (j << 8) | (k);
+
+ CHECK_FOR_INTERRUPTS();
+
+ key_id++;
+ if (key_id > n_keys)
+ goto finish_search;
+
+ rt_search(rt, key, &val);
+ }
+ }
+ }
+ }
+ }
+finish_search:
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_search_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int32GetDatum(fanout);
+ values[1] = Int64GetDatum(rt_num_entries(rt));
+ values[2] = Int64GetDatum(rt_memory_usage(rt));
+ values[3] = Int64GetDatum(rt_load_ms);
+ values[4] = Int64GetDatum(rt_search_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_node128_load(PG_FUNCTION_ARGS)
+{
+ int fanout = PG_GETARG_INT32(0);
+ radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_sparseload_ms;
+ Datum values[5];
+ bool nulls[5];
+
+ uint64 r,
+ h;
+ int key_id;
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ rt = rt_create(CurrentMemoryContext);
+
+ key_id = 0;
+
+ for (r = 1; r <= fanout; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (h);
+
+ key_id++;
+ rt_set(rt, key, key_id);
+ }
+ }
+
+ rt_stats(rt);
+
+ /* measure sparse deletion and re-loading */
+ start_time = GetCurrentTimestamp();
+
+ for (int t = 0; t<10000; t++)
+ {
+ /* delete one key in each leaf */
+ for (r = 1; r <= fanout; r++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (fanout);
+
+ rt_delete(rt, key);
+ }
+
+ /* add them all back */
+ key_id = 0;
+ for (r = 1; r <= fanout; r++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (fanout);
+
+ key_id++;
+ rt_set(rt, key, key_id);
+ }
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_sparseload_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int32GetDatum(fanout);
+ values[1] = Int64GetDatum(rt_num_entries(rt));
+ values[2] = Int64GetDatum(rt_memory_usage(rt));
+ values[3] = Int64GetDatum(rt_sparseload_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
diff --git a/contrib/bench_radix_tree/bench_radix_tree.control b/contrib/bench_radix_tree/bench_radix_tree.control
new file mode 100644
index 0000000000..1d988e6c9a
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.control
@@ -0,0 +1,6 @@
+# bench_radix_tree extension
+comment = 'benchmark suite for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/bench_radix_tree'
+relocatable = true
+trusted = true
diff --git a/contrib/bench_radix_tree/expected/bench.out b/contrib/bench_radix_tree/expected/bench.out
new file mode 100644
index 0000000000..60c303892e
--- /dev/null
+++ b/contrib/bench_radix_tree/expected/bench.out
@@ -0,0 +1,13 @@
+create extension bench_radix_tree;
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
diff --git a/contrib/bench_radix_tree/sql/bench.sql b/contrib/bench_radix_tree/sql/bench.sql
new file mode 100644
index 0000000000..a46018c9d4
--- /dev/null
+++ b/contrib/bench_radix_tree/sql/bench.sql
@@ -0,0 +1,16 @@
+create extension bench_radix_tree;
+
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
--
2.31.1
Attachment: v12-0006-PoC-DSA-support-for-radix-tree.patch (application/octet-stream)
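
Before the patch itself, here is a sketch (not part of the patch) of how the shared mode is meant to be used, based on the API the patch adds. For brevity it runs both the creator and the attacher in one backend, as the test module does; a real attacher would attach to the DSA area first (dsa_attach()) and receive the rt_handle through shared memory.

#include "postgres.h"

#include "lib/radixtree.h"
#include "storage/lwlock.h"
#include "utils/dsa.h"

/* illustrative only; error handling and locking omitted */
static void
shared_radix_tree_example(void)
{
    dsa_area   *area;
    radix_tree *rt;
    radix_tree *rt2;
    rt_handle   handle;
    uint64      val;

    /* creator: build the tree in a DSA area and publish its handle */
    area = dsa_create(LWLockNewTrancheId());
    rt = rt_create(CurrentMemoryContext, area);
    rt_set(rt, 42, 100);
    handle = rt_get_handle(rt);

    /* attacher: with the same area and handle, look keys up directly */
    rt2 = rt_attach(area, handle);
    if (rt_search(rt2, 42, &val))
        Assert(val == 100);
    rt_detach(rt2);

    /* the creator eventually frees the tree and all nodes in the area */
    rt_free(rt);
}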
From 07daf71cbc20e445c6897e4e7790c85c5d59637d Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Thu, 27 Oct 2022 14:02:00 +0900
Subject: [PATCH v12 6/7] PoC: DSA support for radix tree.
---
.../bench_radix_tree--1.0.sql | 2 +
contrib/bench_radix_tree/bench_radix_tree.c | 16 +-
src/backend/lib/radixtree.c | 437 ++++++++++++++----
src/backend/utils/mmgr/dsa.c | 12 +
src/include/lib/radixtree.h | 8 +-
src/include/utils/dsa.h | 1 +
.../expected/test_radixtree.out | 25 +
.../modules/test_radixtree/test_radixtree.c | 147 ++++--
8 files changed, 502 insertions(+), 146 deletions(-)
diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
index 83529805fc..d9216d715c 100644
--- a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
+++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
@@ -7,6 +7,7 @@ create function bench_shuffle_search(
minblk int4,
maxblk int4,
random_block bool DEFAULT false,
+shared bool DEFAULT false,
OUT nkeys int8,
OUT rt_mem_allocated int8,
OUT array_mem_allocated int8,
@@ -23,6 +24,7 @@ create function bench_seq_search(
minblk int4,
maxblk int4,
random_block bool DEFAULT false,
+shared bool DEFAULT false,
OUT nkeys int8,
OUT rt_mem_allocated int8,
OUT array_mem_allocated int8,
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c
index a0693695e6..1a26722495 100644
--- a/contrib/bench_radix_tree/bench_radix_tree.c
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -154,6 +154,8 @@ bench_search(FunctionCallInfo fcinfo, bool shuffle)
BlockNumber maxblk = PG_GETARG_INT32(1);
bool random_block = PG_GETARG_BOOL(2);
radix_tree *rt = NULL;
+ bool shared = PG_GETARG_BOOL(3);
+ dsa_area *dsa = NULL;
uint64 ntids;
uint64 key;
uint64 last_key = PG_UINT64_MAX;
@@ -176,7 +178,11 @@ bench_search(FunctionCallInfo fcinfo, bool shuffle)
tids = generate_tids(minblk, maxblk, TIDS_PER_BLOCK_FOR_LOAD, &ntids, random_block);
/* measure the load time of the radix tree */
- rt = rt_create(CurrentMemoryContext);
+ if (shared)
+ dsa = dsa_create(LWLockNewTrancheId());
+ rt = rt_create(CurrentMemoryContext, dsa);
+
+ /* measure the load time of the radix tree */
start_time = GetCurrentTimestamp();
for (int i = 0; i < ntids; i++)
{
@@ -327,7 +333,7 @@ bench_load_random_int(PG_FUNCTION_ARGS)
elog(ERROR, "return type must be a row type");
pg_prng_seed(&state, 0);
- rt = rt_create(CurrentMemoryContext);
+ rt = rt_create(CurrentMemoryContext, NULL);
start_time = GetCurrentTimestamp();
for (uint64 i = 0; i < cnt; i++)
@@ -393,7 +399,7 @@ bench_search_random_nodes(PG_FUNCTION_ARGS)
}
elog(NOTICE, "bench with filter 0x%lX", filter);
- rt = rt_create(CurrentMemoryContext);
+ rt = rt_create(CurrentMemoryContext, NULL);
for (uint64 i = 0; i < cnt; i++)
{
@@ -462,7 +468,7 @@ bench_fixed_height_search(PG_FUNCTION_ARGS)
if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
elog(ERROR, "return type must be a row type");
- rt = rt_create(CurrentMemoryContext);
+ rt = rt_create(CurrentMemoryContext, NULL);
start_time = GetCurrentTimestamp();
@@ -574,7 +580,7 @@ bench_node128_load(PG_FUNCTION_ARGS)
if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
elog(ERROR, "return type must be a row type");
- rt = rt_create(CurrentMemoryContext);
+ rt = rt_create(CurrentMemoryContext, NULL);
key_id = 0;
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
index a97d86ae2b..58e947f9df 100644
--- a/src/backend/lib/radixtree.c
+++ b/src/backend/lib/radixtree.c
@@ -22,6 +22,15 @@
* choose it to avoid an additional pointer traversal. It is the reason this code
* currently does not support variable-length keys.
*
+ * If a DSA area is specified for rt_create(), the radix tree is created in the
+ * DSA area so that multiple processes can access it simultaneously. The process
+ * that created the shared radix tree needs to pass both the DSA area specified
+ * when calling rt_create() and the dsa_pointer of the radix tree, obtained via
+ * rt_get_handle(), to other processes so that they can attach via rt_attach().
+ *
+ * XXX: the shared radix tree is still at the PoC stage as it doesn't have any
+ * locking support. Also, only one process at a time can iterate over it.
+ *
* XXX: Most functions in this file have two variants for inner nodes and leaf
* nodes, therefore there are duplication codes. While this sometimes makes the
* code maintenance tricky, this reduces branch prediction misses when judging
@@ -34,6 +43,9 @@
*
* rt_create - Create a new, empty radix tree
* rt_free - Free the radix tree
+ * rt_attach - Attach to the radix tree
+ * rt_detach - Detach from the radix tree
+ * rt_get_handle - Return the handle of the radix tree
* rt_search - Search a key-value pair
* rt_set - Set a key-value pair
* rt_delete - Delete a key-value pair
@@ -64,6 +76,7 @@
#include "miscadmin.h"
#include "port/pg_bitutils.h"
#include "port/pg_lfind.h"
+#include "utils/dsa.h"
#include "utils/memutils.h"
#ifdef RT_DEBUG
@@ -421,6 +434,10 @@ static rt_size_class kind_min_size_class[RT_NODE_KIND_COUNT] = {
* construct the key whenever updating the node iteration information, e.g., when
* advancing the current index within the node or when moving to the next node
* at the same level.
+ *
+ * XXX: We need either a safeguard that prevents other processes from starting
+ * an iteration while one process is iterating, or support for multiple
+ * processes iterating concurrently.
*/
typedef struct rt_node_iter
{
@@ -440,23 +457,43 @@ struct rt_iter
uint64 key;
};
-/* A radix tree with nodes */
-struct radix_tree
+/* A magic value used to identify our radix tree */
+#define RADIXTREE_MAGIC 0x54A48167
+
+/* Control information for a radix tree */
+typedef struct radix_tree_control
{
- MemoryContext context;
+ rt_handle handle;
+ uint32 magic;
+ /* Root node */
rt_pointer root;
+
uint64 max_val;
uint64 num_keys;
- MemoryContextData *inner_slabs[RT_SIZE_CLASS_COUNT];
- MemoryContextData *leaf_slabs[RT_SIZE_CLASS_COUNT];
-
/* statistics */
#ifdef RT_DEBUG
int32 cnt[RT_SIZE_CLASS_COUNT];
#endif
+} radix_tree_control;
+
+/* A radix tree with nodes */
+struct radix_tree
+{
+ MemoryContext context;
+
+ /* control object in either backend-local memory or DSA */
+ radix_tree_control *ctl;
+
+ /* used only when the radix tree is shared */
+ dsa_area *area;
+
+ /* used only when the radix tree is private */
+ MemoryContextData *inner_slabs[RT_SIZE_CLASS_COUNT];
+ MemoryContextData *leaf_slabs[RT_SIZE_CLASS_COUNT];
};
+#define RadixTreeIsShared(rt) ((rt)->area != NULL)
static void rt_new_root(radix_tree *tree, uint64 key);
@@ -485,9 +522,12 @@ static void rt_verify_node(rt_node_ptr node);
/* Decode and encode functions of rt_pointer */
static inline rt_node *
-rt_pointer_decode(rt_pointer encoded)
+rt_pointer_decode(radix_tree *tree, rt_pointer encoded)
{
- return (rt_node *) encoded;
+ if (RadixTreeIsShared(tree))
+ return (rt_node *) dsa_get_address(tree->area, encoded);
+ else
+ return (rt_node *) encoded;
}
static inline rt_pointer
@@ -498,11 +538,11 @@ rt_pointer_encode(rt_node *decoded)
/* Return a rt_node_ptr created from the given encoded pointer */
static inline rt_node_ptr
-rt_node_ptr_encoded(rt_pointer encoded)
+rt_node_ptr_encoded(radix_tree *tree, rt_pointer encoded)
{
return (rt_node_ptr) {
.encoded = encoded,
- .decoded = rt_pointer_decode(encoded),
+ .decoded = rt_pointer_decode(tree, encoded)
};
}
@@ -949,8 +989,8 @@ rt_new_root(radix_tree *tree, uint64 key)
rt_init_node(newnode, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
NODE_SHIFT(newnode) = shift;
- tree->max_val = shift_get_max_val(shift);
- tree->root = newnode.encoded;
+ tree->ctl->max_val = shift_get_max_val(shift);
+ tree->ctl->root = newnode.encoded;
}
/*
@@ -959,20 +999,35 @@ rt_new_root(radix_tree *tree, uint64 key)
static rt_node_ptr
rt_alloc_node(radix_tree *tree, rt_size_class size_class, bool inner)
{
- rt_node_ptr newnode;
+ rt_node_ptr newnode;
- if (inner)
- newnode.decoded = (rt_node *) MemoryContextAlloc(tree->inner_slabs[size_class],
- rt_size_class_info[size_class].inner_size);
- else
- newnode.decoded = (rt_node *) MemoryContextAlloc(tree->leaf_slabs[size_class],
- rt_size_class_info[size_class].leaf_size);
+ if (tree->area != NULL)
+ {
+ dsa_pointer dp;
- newnode.encoded = rt_pointer_encode(newnode.decoded);
+ if (inner)
+ dp = dsa_allocate(tree->area, rt_size_class_info[size_class].inner_size);
+ else
+ dp = dsa_allocate(tree->area, rt_size_class_info[size_class].leaf_size);
+
+ newnode.encoded = (rt_pointer) dp;
+ newnode.decoded = rt_pointer_decode(tree, newnode.encoded);
+ }
+ else
+ {
+ if (inner)
+ newnode.decoded = (rt_node *) MemoryContextAlloc(tree->inner_slabs[size_class],
+ rt_size_class_info[size_class].inner_size);
+ else
+ newnode.decoded = (rt_node *) MemoryContextAlloc(tree->leaf_slabs[size_class],
+ rt_size_class_info[size_class].leaf_size);
+
+ newnode.encoded = rt_pointer_encode(newnode.decoded);
+ }
#ifdef RT_DEBUG
/* update the statistics */
- tree->cnt[size_class]++;
+ tree->ctl->cnt[size_class]++;
#endif
return newnode;
@@ -1036,10 +1091,10 @@ static void
rt_free_node(radix_tree *tree, rt_node_ptr node)
{
/* If we're deleting the root node, make the tree empty */
- if (tree->root == node.encoded)
+ if (tree->ctl->root == node.encoded)
{
- tree->root = InvalidRTPointer;
- tree->max_val = 0;
+ tree->ctl->root = InvalidRTPointer;
+ tree->ctl->max_val = 0;
}
#ifdef RT_DEBUG
@@ -1057,12 +1112,15 @@ rt_free_node(radix_tree *tree, rt_node_ptr node)
if (i == RT_SIZE_CLASS_COUNT)
i = RT_CLASS_256;
- tree->cnt[i]--;
- Assert(tree->cnt[i] >= 0);
+ tree->ctl->cnt[i]--;
+ Assert(tree->ctl->cnt[i] >= 0);
}
#endif
- pfree(node.decoded);
+ if (RadixTreeIsShared(tree))
+ dsa_free(tree->area, (dsa_pointer) node.encoded);
+ else
+ pfree(node.decoded);
}
/*
@@ -1078,7 +1136,7 @@ rt_replace_node(radix_tree *tree, rt_node_ptr parent, rt_node_ptr old_child,
if (rt_node_ptr_eq(&parent, &old_child))
{
/* Replace the root node with the new large node */
- tree->root = new_child.encoded;
+ tree->ctl->root = new_child.encoded;
}
else
{
@@ -1100,7 +1158,7 @@ static void
rt_extend(radix_tree *tree, uint64 key)
{
int target_shift;
- rt_node *root = rt_pointer_decode(tree->root);
+ rt_node *root = rt_pointer_decode(tree, tree->ctl->root);
int shift = root->shift + RT_NODE_SPAN;
target_shift = key_get_shift(key);
@@ -1118,15 +1176,15 @@ rt_extend(radix_tree *tree, uint64 key)
n4->base.n.shift = shift;
n4->base.n.count = 1;
n4->base.chunks[0] = 0;
- n4->children[0] = tree->root;
+ n4->children[0] = tree->ctl->root;
root->chunk = 0;
- tree->root = node.encoded;
+ tree->ctl->root = node.encoded;
shift += RT_NODE_SPAN;
}
- tree->max_val = shift_get_max_val(target_shift);
+ tree->ctl->max_val = shift_get_max_val(target_shift);
}
/*
@@ -1158,7 +1216,7 @@ rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node_ptr parent,
}
rt_node_insert_leaf(tree, parent, node, key, value);
- tree->num_keys++;
+ tree->ctl->num_keys++;
}
/*
@@ -1169,12 +1227,11 @@ rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node_ptr parent,
* pointer is set to child_p.
*/
static inline bool
-rt_node_search_inner(rt_node_ptr node, uint64 key, rt_action action,
- rt_pointer *child_p)
+rt_node_search_inner(rt_node_ptr node, uint64 key, rt_action action, rt_pointer *child_p)
{
uint8 chunk = RT_GET_KEY_CHUNK(key, NODE_SHIFT(node));
bool found = false;
- rt_pointer child;
+ rt_pointer child = InvalidRTPointer;
switch (NODE_KIND(node))
{
@@ -1205,6 +1262,7 @@ rt_node_search_inner(rt_node_ptr node, uint64 key, rt_action action,
break;
found = true;
+
if (action == RT_ACTION_FIND)
child = n32->children[idx];
else /* RT_ACTION_DELETE */
@@ -1756,33 +1814,51 @@ retry_insert_leaf_32:
* Create the radix tree in the given memory context and return it.
*/
radix_tree *
-rt_create(MemoryContext ctx)
+rt_create(MemoryContext ctx, dsa_area *area)
{
radix_tree *tree;
MemoryContext old_ctx;
old_ctx = MemoryContextSwitchTo(ctx);
- tree = palloc(sizeof(radix_tree));
+ tree = (radix_tree *) palloc0(sizeof(radix_tree));
tree->context = ctx;
- tree->root = InvalidRTPointer;
- tree->max_val = 0;
- tree->num_keys = 0;
+
+ if (area != NULL)
+ {
+ dsa_pointer dp;
+
+ tree->area = area;
+ dp = dsa_allocate0(area, sizeof(radix_tree_control));
+ tree->ctl = (radix_tree_control *) dsa_get_address(area, dp);
+ tree->ctl->handle = (rt_handle) dp;
+ }
+ else
+ {
+ tree->ctl = (radix_tree_control *) palloc0(sizeof(radix_tree_control));
+ tree->ctl->handle = InvalidDsaPointer;
+ }
+
+ tree->ctl->magic = RADIXTREE_MAGIC;
+ tree->ctl->root = InvalidRTPointer;
/* Create the slab allocator for each size class */
- for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ if (area == NULL)
{
- tree->inner_slabs[i] = SlabContextCreate(ctx,
- rt_size_class_info[i].name,
- rt_size_class_info[i].inner_blocksize,
- rt_size_class_info[i].inner_size);
- tree->leaf_slabs[i] = SlabContextCreate(ctx,
- rt_size_class_info[i].name,
- rt_size_class_info[i].leaf_blocksize,
- rt_size_class_info[i].leaf_size);
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ tree->inner_slabs[i] = SlabContextCreate(ctx,
+ rt_size_class_info[i].name,
+ rt_size_class_info[i].inner_blocksize,
+ rt_size_class_info[i].inner_size);
+ tree->leaf_slabs[i] = SlabContextCreate(ctx,
+ rt_size_class_info[i].name,
+ rt_size_class_info[i].leaf_blocksize,
+ rt_size_class_info[i].leaf_size);
#ifdef RT_DEBUG
- tree->cnt[i] = 0;
+ tree->ctl->cnt[i] = 0;
#endif
+ }
}
MemoryContextSwitchTo(old_ctx);
@@ -1790,16 +1866,163 @@ rt_create(MemoryContext ctx)
return tree;
}
+/*
+ * Get a handle that can be used by other processes to attach to this radix
+ * tree.
+ */
+dsa_pointer
+rt_get_handle(radix_tree *tree)
+{
+ Assert(RadixTreeIsShared(tree));
+ Assert(tree->ctl->magic == RADIXTREE_MAGIC);
+
+ return tree->ctl->handle;
+}
+
+/*
+ * Attach to an existing radix tree using a handle. The returned object is
+ * allocated in backend-local memory using the CurrentMemoryContext.
+ */
+radix_tree *
+rt_attach(dsa_area *area, rt_handle handle)
+{
+ radix_tree *tree;
+ dsa_pointer control;
+
+ /* Allocate the backend-local object representing the radix tree */
+ tree = (radix_tree *) palloc0(sizeof(radix_tree));
+
+ /* Find the control object in shared memory */
+ control = handle;
+
+ /* Set up the local radix tree */
+ tree->area = area;
+ tree->ctl = (radix_tree_control *) dsa_get_address(area, control);
+ Assert(tree->ctl->magic == RADIXTREE_MAGIC);
+
+ return tree;
+}
+
+/*
+ * Detach from a radix tree. This frees backend-local resources associated
+ * with the radix tree, but the radix tree will continue to exist until
+ * it is explicitly freed.
+ */
+void
+rt_detach(radix_tree *tree)
+{
+ Assert(RadixTreeIsShared(tree));
+ Assert(tree->ctl->magic == RADIXTREE_MAGIC);
+
+ pfree(tree);
+}
+
+/*
+ * Recursively free all nodes allocated in the DSA area.
+ */
+static void
+rt_free_recurse(radix_tree *tree, rt_pointer ptr)
+{
+ rt_node_ptr node = rt_node_ptr_encoded(tree, ptr);
+
+ Assert(RadixTreeIsShared(tree));
+
+ check_stack_depth();
+ CHECK_FOR_INTERRUPTS();
+
+ /* The leaf node doesn't have child pointers, so free it */
+ if (NODE_IS_LEAF(node))
+ {
+ dsa_free(tree->area, (dsa_pointer) node.encoded);
+ return;
+ }
+
+ switch (NODE_KIND(node))
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node.decoded;
+
+ /* Free all children recursively */
+ for (int i = 0; i < NODE_COUNT(node); i++)
+ rt_free_recurse(tree, n4->children[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node.decoded;
+
+ /* Free all children recursively */
+ for (int i = 0; i < NODE_COUNT(node); i++)
+ rt_free_recurse(tree, n32->children[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ rt_node_inner_125 *n125 = (rt_node_inner_125 *) node.decoded;
+
+ /* Free all children recursively */
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!node_125_is_chunk_used((rt_node_base_125 *) n125, i))
+ continue;
+
+ rt_free_recurse(tree, node_inner_125_get_child(n125, i));
+ }
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node.decoded;
+
+ /* Free all children recursively */
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!node_inner_256_is_chunk_used(n256, i))
+ continue;
+
+ rt_free_recurse(tree, node_inner_256_get_child(n256, i));
+ }
+ break;
+ }
+ }
+
+ /* Free the inner node itself */
+ dsa_free(tree->area, node.encoded);
+}
+
/*
* Free the given radix tree.
*/
void
rt_free(radix_tree *tree)
{
- for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ Assert(!RadixTreeIsShared(tree) || tree->ctl->magic == RADIXTREE_MAGIC);
+
+ if (RadixTreeIsShared(tree))
{
- MemoryContextDelete(tree->inner_slabs[i]);
- MemoryContextDelete(tree->leaf_slabs[i]);
+ /* Free all memory used for radix tree nodes */
+ if (RTPointerIsValid(tree->ctl->root))
+ rt_free_recurse(tree, tree->ctl->root);
+
+ /*
+ * Vandalize the control block to help catch programming errors where
+ * other backends access the memory formerly occupied by this radix tree.
+ */
+ tree->ctl->magic = 0;
+ dsa_free(tree->area, tree->ctl->handle);
+ }
+ else
+ {
+ /* Free all memory used for radix tree nodes */
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ MemoryContextDelete(tree->inner_slabs[i]);
+ MemoryContextDelete(tree->leaf_slabs[i]);
+ }
+ pfree(tree->ctl);
}
pfree(tree);
@@ -1817,16 +2040,18 @@ rt_set(radix_tree *tree, uint64 key, uint64 value)
rt_node_ptr node;
rt_node_ptr parent;
+ Assert(!RadixTreeIsShared(tree) || tree->ctl->magic == RADIXTREE_MAGIC);
+
/* Empty tree, create the root */
- if (!RTPointerIsValid(tree->root))
+ if (!RTPointerIsValid(tree->ctl->root))
rt_new_root(tree, key);
/* Extend the tree if necessary */
- if (key > tree->max_val)
+ if (key > tree->ctl->max_val)
rt_extend(tree, key);
/* Descend the tree until a leaf node */
- node = parent = rt_node_ptr_encoded(tree->root);
+ node = parent = rt_node_ptr_encoded(tree, tree->ctl->root);
shift = NODE_SHIFT(node);
while (shift >= 0)
{
@@ -1842,7 +2067,7 @@ rt_set(radix_tree *tree, uint64 key, uint64 value)
}
parent = node;
- node = rt_node_ptr_encoded(child);
+ node = rt_node_ptr_encoded(tree, child);
shift -= RT_NODE_SPAN;
}
@@ -1850,7 +2075,7 @@ rt_set(radix_tree *tree, uint64 key, uint64 value)
/* Update the statistics */
if (!updated)
- tree->num_keys++;
+ tree->ctl->num_keys++;
return updated;
}
@@ -1866,12 +2091,13 @@ rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
rt_node_ptr node;
int shift;
+ Assert(!RadixTreeIsShared(tree) || tree->ctl->magic == RADIXTREE_MAGIC);
Assert(value_p != NULL);
- if (!RTPointerIsValid(tree->root) || key > tree->max_val)
+ if (!RTPointerIsValid(tree->ctl->root) || key > tree->ctl->max_val)
return false;
- node = rt_node_ptr_encoded(tree->root);
+ node = rt_node_ptr_encoded(tree, tree->ctl->root);
shift = NODE_SHIFT(node);
/* Descend the tree until a leaf node */
@@ -1885,7 +2111,7 @@ rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
return false;
- node = rt_node_ptr_encoded(child);
+ node = rt_node_ptr_encoded(tree, child);
shift -= RT_NODE_SPAN;
}
@@ -1905,14 +2131,16 @@ rt_delete(radix_tree *tree, uint64 key)
int level;
bool deleted;
- if (!tree->root || key > tree->max_val)
+ Assert(!RadixTreeIsShared(tree) || tree->ctl->magic == RADIXTREE_MAGIC);
+
+ if (!RTPointerIsValid(tree->ctl->root) || key > tree->ctl->max_val)
return false;
/*
* Descend the tree to search the key while building a stack of nodes we
* visited.
*/
- node = rt_node_ptr_encoded(tree->root);
+ node = rt_node_ptr_encoded(tree, tree->ctl->root);
shift = NODE_SHIFT(node);
level = -1;
while (shift > 0)
@@ -1925,7 +2153,7 @@ rt_delete(radix_tree *tree, uint64 key)
if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
return false;
- node = rt_node_ptr_encoded(child);
+ node = rt_node_ptr_encoded(tree, child);
shift -= RT_NODE_SPAN;
}
@@ -1940,7 +2168,7 @@ rt_delete(radix_tree *tree, uint64 key)
}
/* Found the key to delete. Update the statistics */
- tree->num_keys--;
+ tree->ctl->num_keys--;
/*
* Return if the leaf node still has keys and we don't need to delete the
@@ -1980,16 +2208,18 @@ rt_begin_iterate(radix_tree *tree)
rt_iter *iter;
int top_level;
+ Assert(!RadixTreeIsShared(tree) || tree->ctl->magic == RADIXTREE_MAGIC);
+
old_ctx = MemoryContextSwitchTo(tree->context);
iter = (rt_iter *) palloc0(sizeof(rt_iter));
iter->tree = tree;
/* empty tree */
- if (!RTPointerIsValid(iter->tree) || !RTPointerIsValid(iter->tree->root))
+ if (!RTPointerIsValid(iter->tree) || !RTPointerIsValid(iter->tree->ctl->root))
return iter;
- root = rt_node_ptr_encoded(iter->tree->root);
+ root = rt_node_ptr_encoded(tree, iter->tree->ctl->root);
top_level = NODE_SHIFT(root) / RT_NODE_SPAN;
iter->stack_len = top_level;
@@ -2040,8 +2270,10 @@ rt_update_iter_stack(rt_iter *iter, rt_node_ptr from_node, int from)
bool
rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p)
{
+ Assert(!RadixTreeIsShared(iter->tree) || iter->tree->ctl->magic == RADIXTREE_MAGIC);
+
/* Empty tree */
- if (!iter->tree->root)
+ if (!iter->tree->ctl->root)
return false;
for (;;)
@@ -2185,7 +2417,7 @@ rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter, rt_node_ptr *
if (found)
{
rt_iter_update_key(iter, key_chunk, NODE_SHIFT(node));
- *child_p = rt_node_ptr_encoded(child);
+ *child_p = rt_node_ptr_encoded(iter->tree, child);
}
return found;
@@ -2288,7 +2520,7 @@ rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter, uint64 *value_
uint64
rt_num_entries(radix_tree *tree)
{
- return tree->num_keys;
+ return tree->ctl->num_keys;
}
/*
@@ -2297,12 +2529,19 @@ rt_num_entries(radix_tree *tree)
uint64
rt_memory_usage(radix_tree *tree)
{
- Size total = sizeof(radix_tree);
+ Size total = sizeof(radix_tree) + sizeof(radix_tree_control);
- for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ Assert(!RadixTreeIsShared(tree) || tree->ctl->magic == RADIXTREE_MAGIC);
+
+ if (RadixTreeIsShared(tree))
+ total = dsa_get_total_size(tree->area);
+ else
{
- total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
- total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
+ total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
+ }
}
return total;
@@ -2386,23 +2625,23 @@ rt_verify_node(rt_node_ptr node)
void
rt_stats(radix_tree *tree)
{
- rt_node *root = rt_pointer_decode(tree->root);
+ rt_node *root = rt_pointer_decode(tree, tree->ctl->root);
if (root == NULL)
return;
ereport(NOTICE, (errmsg("num_keys = " UINT64_FORMAT ", height = %d, n4 = %u, n15 = %u, n32 = %u, n125 = %u, n256 = %u",
- tree->num_keys,
+ tree->ctl->num_keys,
root->shift / RT_NODE_SPAN,
- tree->cnt[RT_CLASS_4_FULL],
- tree->cnt[RT_CLASS_32_PARTIAL],
- tree->cnt[RT_CLASS_32_FULL],
- tree->cnt[RT_CLASS_125_FULL],
- tree->cnt[RT_CLASS_256])));
+ tree->ctl->cnt[RT_CLASS_4_FULL],
+ tree->ctl->cnt[RT_CLASS_32_PARTIAL],
+ tree->ctl->cnt[RT_CLASS_32_FULL],
+ tree->ctl->cnt[RT_CLASS_125_FULL],
+ tree->ctl->cnt[RT_CLASS_256])));
}
static void
-rt_dump_node(rt_node_ptr node, int level, bool recurse)
+rt_dump_node(radix_tree *tree, rt_node_ptr node, int level, bool recurse)
{
rt_node *n = node.decoded;
char space[128] = {0};
@@ -2440,7 +2679,7 @@ rt_dump_node(rt_node_ptr node, int level, bool recurse)
space, n4->base.chunks[i]);
if (recurse)
- rt_dump_node(rt_node_ptr_encoded(n4->children[i]),
+ rt_dump_node(tree, rt_node_ptr_encoded(tree, n4->children[i]),
level + 1, recurse);
else
fprintf(stderr, "\n");
@@ -2468,7 +2707,7 @@ rt_dump_node(rt_node_ptr node, int level, bool recurse)
if (recurse)
{
- rt_dump_node(rt_node_ptr_encoded(n32->children[i]),
+ rt_dump_node(tree, rt_node_ptr_encoded(tree, n32->children[i]),
level + 1, recurse);
}
else
@@ -2521,7 +2760,9 @@ rt_dump_node(rt_node_ptr node, int level, bool recurse)
space, i);
if (recurse)
- rt_dump_node(rt_node_ptr_encoded(node_inner_125_get_child(n125, i)),
+ rt_dump_node(tree,
+ rt_node_ptr_encoded(tree,
+ node_inner_125_get_child(n125, i)),
level + 1, recurse);
else
fprintf(stderr, "\n");
@@ -2554,7 +2795,9 @@ rt_dump_node(rt_node_ptr node, int level, bool recurse)
space, i);
if (recurse)
- rt_dump_node(rt_node_ptr_encoded(node_inner_256_get_child(n256, i)),
+ rt_dump_node(tree,
+ rt_node_ptr_encoded(tree,
+ node_inner_256_get_child(n256, i)),
level + 1, recurse);
else
fprintf(stderr, "\n");
@@ -2574,28 +2817,28 @@ rt_dump_search(radix_tree *tree, uint64 key)
elog(NOTICE, "-----------------------------------------------------------");
elog(NOTICE, "max_val = " UINT64_FORMAT "(0x" UINT64_FORMAT_HEX ")",
- tree->max_val, tree->max_val);
+ tree->ctl->max_val, tree->ctl->max_val);
- if (!RTPointerIsValid(tree->root))
+ if (!RTPointerIsValid(tree->ctl->root))
{
elog(NOTICE, "tree is empty");
return;
}
- if (key > tree->max_val)
+ if (key > tree->ctl->max_val)
{
elog(NOTICE, "key " UINT64_FORMAT "(0x" UINT64_FORMAT_HEX ") is larger than max val",
key, key);
return;
}
- node = rt_node_ptr_encoded(tree->root);
+ node = rt_node_ptr_encoded(tree, tree->ctl->root);
shift = NODE_SHIFT(node);
while (shift >= 0)
{
rt_pointer child;
- rt_dump_node(node, level, false);
+ rt_dump_node(tree, node, level, false);
if (NODE_IS_LEAF(node))
{
@@ -2610,7 +2853,7 @@ rt_dump_search(radix_tree *tree, uint64 key)
if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
break;
- node = rt_node_ptr_encoded(child);
+ node = rt_node_ptr_encoded(tree, child);
shift -= RT_NODE_SPAN;
level++;
}
@@ -2628,15 +2871,15 @@ rt_dump(radix_tree *tree)
rt_size_class_info[i].inner_blocksize,
rt_size_class_info[i].leaf_size,
rt_size_class_info[i].leaf_blocksize);
- fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->max_val);
+ fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->ctl->max_val);
- if (!RTPointerIsValid(tree->root))
+ if (!RTPointerIsValid(tree->ctl->root))
{
fprintf(stderr, "empty tree\n");
return;
}
- root = rt_node_ptr_encoded(tree->root);
- rt_dump_node(root, 0, true);
+ root = rt_node_ptr_encoded(tree, tree->ctl->root);
+ rt_dump_node(tree, root, 0, true);
}
#endif
diff --git a/src/backend/utils/mmgr/dsa.c b/src/backend/utils/mmgr/dsa.c
index 82376fde2d..ad169882af 100644
--- a/src/backend/utils/mmgr/dsa.c
+++ b/src/backend/utils/mmgr/dsa.c
@@ -1024,6 +1024,18 @@ dsa_set_size_limit(dsa_area *area, size_t limit)
LWLockRelease(DSA_AREA_LOCK(area));
}
+size_t
+dsa_get_total_size(dsa_area *area)
+{
+ size_t size;
+
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_SHARED);
+ size = area->control->total_segment_size;
+ LWLockRelease(DSA_AREA_LOCK(area));
+
+ return size;
+}
+
/*
* Aggressively free all spare memory in the hope of returning DSM segments to
* the operating system.
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index d5d7668617..68a11df970 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -14,18 +14,24 @@
#define RADIXTREE_H
#include "postgres.h"
+#include "utils/dsa.h"
#define RT_DEBUG 1
typedef struct radix_tree radix_tree;
typedef struct rt_iter rt_iter;
+typedef dsa_pointer rt_handle;
-extern radix_tree *rt_create(MemoryContext ctx);
+extern radix_tree *rt_create(MemoryContext ctx, dsa_area *dsa);
extern void rt_free(radix_tree *tree);
extern bool rt_search(radix_tree *tree, uint64 key, uint64 *val_p);
extern bool rt_set(radix_tree *tree, uint64 key, uint64 val);
extern rt_iter *rt_begin_iterate(radix_tree *tree);
+extern rt_handle rt_get_handle(radix_tree *tree);
+extern radix_tree *rt_attach(dsa_area *dsa, dsa_pointer dp);
+extern void rt_detach(radix_tree *tree);
+
extern bool rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p);
extern void rt_end_iterate(rt_iter *iter);
extern bool rt_delete(radix_tree *tree, uint64 key);
diff --git a/src/include/utils/dsa.h b/src/include/utils/dsa.h
index 405606fe2f..dad06adecc 100644
--- a/src/include/utils/dsa.h
+++ b/src/include/utils/dsa.h
@@ -117,6 +117,7 @@ extern dsa_handle dsa_get_handle(dsa_area *area);
extern dsa_pointer dsa_allocate_extended(dsa_area *area, size_t size, int flags);
extern void dsa_free(dsa_area *area, dsa_pointer dp);
extern void *dsa_get_address(dsa_area *area, dsa_pointer dp);
+extern size_t dsa_get_total_size(dsa_area *area);
extern void dsa_trim(dsa_area *area);
extern void dsa_dump(dsa_area *area);
diff --git a/src/test/modules/test_radixtree/expected/test_radixtree.out b/src/test/modules/test_radixtree/expected/test_radixtree.out
index ce645cb8b5..a217e0d312 100644
--- a/src/test/modules/test_radixtree/expected/test_radixtree.out
+++ b/src/test/modules/test_radixtree/expected/test_radixtree.out
@@ -6,28 +6,53 @@ CREATE EXTENSION test_radixtree;
SELECT test_radixtree();
NOTICE: testing basic operations with leaf node 4
NOTICE: testing basic operations with inner node 4
+NOTICE: testing basic operations with leaf node 4
+NOTICE: testing basic operations with inner node 4
+NOTICE: testing basic operations with leaf node 32
+NOTICE: testing basic operations with inner node 32
NOTICE: testing basic operations with leaf node 32
NOTICE: testing basic operations with inner node 32
NOTICE: testing basic operations with leaf node 125
NOTICE: testing basic operations with inner node 125
+NOTICE: testing basic operations with leaf node 125
+NOTICE: testing basic operations with inner node 125
NOTICE: testing basic operations with leaf node 256
NOTICE: testing basic operations with inner node 256
+NOTICE: testing basic operations with leaf node 256
+NOTICE: testing basic operations with inner node 256
+NOTICE: testing radix tree node types with shift "0"
NOTICE: testing radix tree node types with shift "0"
NOTICE: testing radix tree node types with shift "8"
+NOTICE: testing radix tree node types with shift "8"
NOTICE: testing radix tree node types with shift "16"
+NOTICE: testing radix tree node types with shift "16"
+NOTICE: testing radix tree node types with shift "24"
NOTICE: testing radix tree node types with shift "24"
NOTICE: testing radix tree node types with shift "32"
+NOTICE: testing radix tree node types with shift "32"
NOTICE: testing radix tree node types with shift "40"
+NOTICE: testing radix tree node types with shift "40"
+NOTICE: testing radix tree node types with shift "48"
NOTICE: testing radix tree node types with shift "48"
NOTICE: testing radix tree node types with shift "56"
+NOTICE: testing radix tree node types with shift "56"
NOTICE: testing radix tree with pattern "all ones"
+NOTICE: testing radix tree with pattern "all ones"
+NOTICE: testing radix tree with pattern "alternating bits"
NOTICE: testing radix tree with pattern "alternating bits"
NOTICE: testing radix tree with pattern "clusters of ten"
+NOTICE: testing radix tree with pattern "clusters of ten"
NOTICE: testing radix tree with pattern "clusters of hundred"
+NOTICE: testing radix tree with pattern "clusters of hundred"
+NOTICE: testing radix tree with pattern "one-every-64k"
NOTICE: testing radix tree with pattern "one-every-64k"
NOTICE: testing radix tree with pattern "sparse"
+NOTICE: testing radix tree with pattern "sparse"
NOTICE: testing radix tree with pattern "single values, distance > 2^32"
+NOTICE: testing radix tree with pattern "single values, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^32"
NOTICE: testing radix tree with pattern "clusters, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^60"
NOTICE: testing radix tree with pattern "clusters, distance > 2^60"
test_radixtree
----------------
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
index ea993e63df..fe1e168ec4 100644
--- a/src/test/modules/test_radixtree/test_radixtree.c
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -19,6 +19,7 @@
#include "nodes/bitmapset.h"
#include "storage/block.h"
#include "storage/itemptr.h"
+#include "storage/lwlock.h"
#include "utils/memutils.h"
#include "utils/timestamp.h"
@@ -99,6 +100,8 @@ static const test_spec test_specs[] = {
}
};
+static int lwlock_tranche_id;
+
PG_MODULE_MAGIC;
PG_FUNCTION_INFO_V1(test_radixtree);
@@ -112,7 +115,7 @@ test_empty(void)
uint64 key;
uint64 val;
- radixtree = rt_create(CurrentMemoryContext);
+ radixtree = rt_create(CurrentMemoryContext, NULL);
if (rt_search(radixtree, 0, &dummy))
elog(ERROR, "rt_search on empty tree returned true");
@@ -140,17 +143,14 @@ test_empty(void)
}
static void
-test_basic(int children, bool test_inner)
+do_test_basic(radix_tree *radixtree, int children, bool test_inner)
{
- radix_tree *radixtree;
uint64 *keys;
int shift = test_inner ? 8 : 0;
elog(NOTICE, "testing basic operations with %s node %d",
test_inner ? "inner" : "leaf", children);
- radixtree = rt_create(CurrentMemoryContext);
-
/* prepare keys in order like 1, 32, 2, 31, 2, ... */
keys = palloc(sizeof(uint64) * children);
for (int i = 0; i < children; i++)
@@ -165,7 +165,7 @@ test_basic(int children, bool test_inner)
for (int i = 0; i < children; i++)
{
if (rt_set(radixtree, keys[i], keys[i]))
- elog(ERROR, "new inserted key 0x" UINT64_HEX_FORMAT " is found ", keys[i]);
+ elog(ERROR, "new inserted key 0x" UINT64_HEX_FORMAT " is found %d", keys[i], i);
}
/* update keys */
@@ -185,7 +185,38 @@ test_basic(int children, bool test_inner)
}
pfree(keys);
- rt_free(radixtree);
+}
+
+static void
+test_basic()
+{
+ for (int i = 1; i < lengthof(rt_node_kind_fanouts); i++)
+ {
+ radix_tree *tree;
+ dsa_area *area;
+
+ /* Test the local radix tree */
+ tree = rt_create(CurrentMemoryContext, NULL);
+ do_test_basic(tree, rt_node_kind_fanouts[i], false);
+ rt_free(tree);
+
+ tree = rt_create(CurrentMemoryContext, NULL);
+ do_test_basic(tree, rt_node_kind_fanouts[i], true);
+ rt_free(tree);
+
+ /* Test the shared radix tree */
+ area = dsa_create(lwlock_tranche_id);
+ tree = rt_create(CurrentMemoryContext, area);
+ do_test_basic(tree, rt_node_kind_fanouts[i], false);
+ rt_free(tree);
+ dsa_detach(area);
+
+ area = dsa_create(lwlock_tranche_id);
+ tree = rt_create(CurrentMemoryContext, area);
+ do_test_basic(tree, rt_node_kind_fanouts[i], true);
+ rt_free(tree);
+ dsa_detach(area);
+ }
}
/*
@@ -286,14 +317,10 @@ test_node_types_delete(radix_tree *radixtree, uint8 shift)
* level.
*/
static void
-test_node_types(uint8 shift)
+do_test_node_types(radix_tree *radixtree, uint8 shift)
{
- radix_tree *radixtree;
-
elog(NOTICE, "testing radix tree node types with shift \"%d\"", shift);
- radixtree = rt_create(CurrentMemoryContext);
-
/*
* Insert and search entries for every node type at the 'shift' level,
* then delete all entries to make it empty, and insert and search entries
@@ -302,19 +329,37 @@ test_node_types(uint8 shift)
test_node_types_insert(radixtree, shift, true);
test_node_types_delete(radixtree, shift);
test_node_types_insert(radixtree, shift, false);
+}
- rt_free(radixtree);
+static void
+test_node_types(void)
+{
+ for (int shift = 0; shift <= (64 - 8); shift += 8)
+ {
+ radix_tree *tree;
+ dsa_area *area;
+
+ /* Test the local radix tree */
+ tree = rt_create(CurrentMemoryContext, NULL);
+ do_test_node_types(tree, shift);
+ rt_free(tree);
+
+ /* Test the shared radix tree */
+ area = dsa_create(lwlock_tranche_id);
+ tree = rt_create(CurrentMemoryContext, area);
+ do_test_node_types(tree, shift);
+ rt_free(tree);
+ dsa_detach(area);
+ }
}
/*
* Test with a repeating pattern, defined by the 'spec'.
*/
static void
-test_pattern(const test_spec * spec)
+do_test_pattern(radix_tree *radixtree, const test_spec * spec)
{
- radix_tree *radixtree;
rt_iter *iter;
- MemoryContext radixtree_ctx;
TimestampTz starttime;
TimestampTz endtime;
uint64 n;
@@ -340,18 +385,6 @@ test_pattern(const test_spec * spec)
pattern_values[pattern_num_values++] = i;
}
- /*
- * Allocate the radix tree.
- *
- * Allocate it in a separate memory context, so that we can print its
- * memory usage easily.
- */
- radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
- "radixtree test",
- ALLOCSET_SMALL_SIZES);
- MemoryContextSetIdentifier(radixtree_ctx, spec->test_name);
- radixtree = rt_create(radixtree_ctx);
-
/*
* Add values to the set.
*/
@@ -405,8 +438,6 @@ test_pattern(const test_spec * spec)
mem_usage = rt_memory_usage(radixtree);
fprintf(stderr, "rt_memory_usage() reported " UINT64_FORMAT " (%0.2f bytes / integer)\n",
mem_usage, (double) mem_usage / spec->num_values);
-
- MemoryContextStats(radixtree_ctx);
}
/* Check that rt_num_entries works */
@@ -555,27 +586,57 @@ test_pattern(const test_spec * spec)
if ((nbefore - ndeleted) != nafter)
elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT "after " UINT64_FORMAT " deletion",
nafter, (nbefore - ndeleted), ndeleted);
+}
+
+static void
+test_patterns(void)
+{
+ /* Test different test patterns, with lots of entries */
+ for (int i = 0; i < lengthof(test_specs); i++)
+ {
+ radix_tree *tree;
+ MemoryContext radixtree_ctx;
+ dsa_area *area;
+ const test_spec *spec = &test_specs[i];
- MemoryContextDelete(radixtree_ctx);
+ /*
+ * Allocate the radix tree.
+ *
+ * Allocate it in a separate memory context, so that we can print its
+ * memory usage easily.
+ */
+ radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
+ "radixtree test",
+ ALLOCSET_SMALL_SIZES);
+ MemoryContextSetIdentifier(radixtree_ctx, spec->test_name);
+
+ /* Test the local radix tree */
+ tree = rt_create(radixtree_ctx, NULL);
+ do_test_pattern(tree, spec);
+ rt_free(tree);
+ MemoryContextReset(radixtree_ctx);
+
+ /* Test the shared radix tree */
+ area = dsa_create(lwlock_tranche_id);
+ tree = rt_create(radixtree_ctx, area);
+ do_test_pattern(tree, spec);
+ rt_free(tree);
+ dsa_detach(area);
+ MemoryContextDelete(radixtree_ctx);
+ }
}
Datum
test_radixtree(PG_FUNCTION_ARGS)
{
- test_empty();
+ /* get a new lwlock tranche id for all tests for shared radix tree */
+ lwlock_tranche_id = LWLockNewTrancheId();
- for (int i = 1; i < lengthof(rt_node_kind_fanouts); i++)
- {
- test_basic(rt_node_kind_fanouts[i], false);
- test_basic(rt_node_kind_fanouts[i], true);
- }
-
- for (int shift = 0; shift <= (64 - 8); shift += 8)
- test_node_types(shift);
+ test_empty();
+ test_basic();
- /* Test different test patterns, with lots of entries */
- for (int i = 0; i < lengthof(test_specs); i++)
- test_pattern(&test_specs[i]);
+ test_node_types();
+ test_patterns();
PG_RETURN_VOID();
}
--
2.31.1
Attachment: v12-0004-Use-bitmapword-for-node-125.patch (application/octet-stream)
From 25cc8623d65d333e68cb43a792ba3055bf89b7c9 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 2 Dec 2022 15:27:06 +0900
Subject: [PATCH v12 4/7] Use bitmapword for node-125
---
src/backend/lib/radixtree.c | 71 +++++++++++++++-------------------
src/backend/nodes/bitmapset.c | 38 ------------------
src/include/nodes/bitmapset.h | 22 +----------
src/include/port/pg_bitutils.h | 58 +++++++++++++++++++++++++++
4 files changed, 91 insertions(+), 98 deletions(-)
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
index e7f61fd943..673cc5e46b 100644
--- a/src/backend/lib/radixtree.c
+++ b/src/backend/lib/radixtree.c
@@ -207,6 +207,9 @@ typedef struct rt_node_base125
/* The index of slots for each fanout */
uint8 slot_idxs[RT_NODE_MAX_SLOTS];
+
+ /* isset is a bitmap to track which slot is in use */
+ bitmapword isset[WORDNUM(128)];
} rt_node_base_125;
typedef struct rt_node_base256
@@ -271,9 +274,6 @@ typedef struct rt_node_leaf_125
{
rt_node_base_125 base;
- /* isset is a bitmap to track which slot is in use */
- uint8 isset[RT_NODE_NSLOTS_BITS(128)];
-
/* number of values depends on size class */
uint64 values[FLEXIBLE_ARRAY_MEMBER];
} rt_node_leaf_125;
@@ -655,13 +655,14 @@ node_125_is_chunk_used(rt_node_base_125 *node, uint8 chunk)
return node->slot_idxs[chunk] != RT_NODE_125_INVALID_IDX;
}
+#ifdef USE_ASSERT_CHECKING
/* Is the slot in the node used? */
static inline bool
node_inner_125_is_slot_used(rt_node_inner_125 *node, uint8 slot)
{
Assert(!NODE_IS_LEAF(node));
Assert(slot < node->base.n.fanout);
- return (node->children[slot] != NULL);
+ return (node->base.isset[WORDNUM(slot)] & ((bitmapword) 1 << BITNUM(slot))) != 0;
}
static inline bool
@@ -669,8 +670,9 @@ node_leaf_125_is_slot_used(rt_node_leaf_125 *node, uint8 slot)
{
Assert(NODE_IS_LEAF(node));
Assert(slot < node->base.n.fanout);
- return ((node->isset[RT_NODE_BITMAP_BYTE(slot)] & RT_NODE_BITMAP_BIT(slot)) != 0);
+ return (node->base.isset[WORDNUM(slot)] & ((bitmapword) 1 << BITNUM(slot))) != 0;
}
+#endif
static inline rt_node *
node_inner_125_get_child(rt_node_inner_125 *node, uint8 chunk)
@@ -690,7 +692,10 @@ node_leaf_125_get_value(rt_node_leaf_125 *node, uint8 chunk)
static void
node_inner_125_delete(rt_node_inner_125 *node, uint8 chunk)
{
+ int slotpos = node->base.slot_idxs[chunk];
+
Assert(!NODE_IS_LEAF(node));
+ node->base.isset[WORDNUM(slotpos)] &= ~((bitmapword) 1 << BITNUM(slotpos));
node->children[node->base.slot_idxs[chunk]] = NULL;
node->base.slot_idxs[chunk] = RT_NODE_125_INVALID_IDX;
}
@@ -701,44 +706,35 @@ node_leaf_125_delete(rt_node_leaf_125 *node, uint8 chunk)
int slotpos = node->base.slot_idxs[chunk];
Assert(NODE_IS_LEAF(node));
- node->isset[RT_NODE_BITMAP_BYTE(slotpos)] &= ~(RT_NODE_BITMAP_BIT(slotpos));
+ node->base.isset[WORDNUM(slotpos)] &= ~((bitmapword) 1 << BITNUM(slotpos));
node->base.slot_idxs[chunk] = RT_NODE_125_INVALID_IDX;
}
/* Return an unused slot in node-125 */
static int
-node_inner_125_find_unused_slot(rt_node_inner_125 *node, uint8 chunk)
-{
- int slotpos = 0;
-
- Assert(!NODE_IS_LEAF(node));
- while (node_inner_125_is_slot_used(node, slotpos))
- slotpos++;
-
- return slotpos;
-}
-
-static int
-node_leaf_125_find_unused_slot(rt_node_leaf_125 *node, uint8 chunk)
-{
- int slotpos;
-
- Assert(NODE_IS_LEAF(node));
+node_125_find_unused_slot(bitmapword *isset)
+ {
+ int slotpos;
+ int idx;
+ bitmapword inverse;
- /* We iterate over the isset bitmap per byte then check each bit */
- for (slotpos = 0; slotpos < RT_NODE_NSLOTS_BITS(128); slotpos++)
+ /* get the first word with at least one bit not set */
+ for (idx = 0; idx < WORDNUM(128); idx++)
{
- if (node->isset[slotpos] < 0xFF)
- break;
+ if (isset[idx] < ~((bitmapword) 0))
+ break;
}
- Assert(slotpos < RT_NODE_NSLOTS_BITS(128));
- slotpos *= BITS_PER_BYTE;
- while (node_leaf_125_is_slot_used(node, slotpos))
- slotpos++;
+ /* To get the first unset bit in X, get the first set bit in ~X */
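+ /*
+ * Illustration: if isset[0] == 0x7 (slots 0-2 in use), then ~isset[0] has
+ * its rightmost one-bit at position 3, so slot 3 is returned and marked as
+ * used below.
+ */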
+ inverse = ~(isset[idx]);
+ slotpos = idx * BITS_PER_BITMAPWORD;
+ slotpos += bmw_rightmost_one_pos(inverse);
+
+ /* mark the slot used */
+ isset[idx] |= RIGHTMOST_ONE(inverse);
return slotpos;
-}
+ }
static inline void
node_inner_125_insert(rt_node_inner_125 *node, uint8 chunk, rt_node *child)
@@ -747,8 +743,7 @@ node_inner_125_insert(rt_node_inner_125 *node, uint8 chunk, rt_node *child)
Assert(!NODE_IS_LEAF(node));
- /* find unused slot */
- slotpos = node_inner_125_find_unused_slot(node, chunk);
+ slotpos = node_125_find_unused_slot(node->base.isset);
Assert(slotpos < node->base.n.fanout);
node->base.slot_idxs[chunk] = slotpos;
@@ -763,12 +758,10 @@ node_leaf_125_insert(rt_node_leaf_125 *node, uint8 chunk, uint64 value)
Assert(NODE_IS_LEAF(node));
- /* find unused slot */
- slotpos = node_leaf_125_find_unused_slot(node, chunk);
+ slotpos = node_125_find_unused_slot(node->base.isset);
Assert(slotpos < node->base.n.fanout);
node->base.slot_idxs[chunk] = slotpos;
- node->isset[RT_NODE_BITMAP_BYTE(slotpos)] |= RT_NODE_BITMAP_BIT(slotpos);
node->values[slotpos] = value;
}
@@ -2395,9 +2388,9 @@ rt_dump_node(rt_node *node, int level, bool recurse)
rt_node_leaf_125 *n = (rt_node_leaf_125 *) node;
fprintf(stderr, ", isset-bitmap:");
- for (int i = 0; i < 16; i++)
+ for (int i = 0; i < WORDNUM(128); i++)
{
- fprintf(stderr, "%X ", (uint8) n->isset[i]);
+ fprintf(stderr, UINT64_FORMAT_HEX " ", n->base.isset[i]);
}
fprintf(stderr, "\n");
}
diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index b7b274aeff..3fe0fd88ce 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -23,49 +23,11 @@
#include "common/hashfn.h"
#include "nodes/bitmapset.h"
#include "nodes/pg_list.h"
-#include "port/pg_bitutils.h"
-#define WORDNUM(x) ((x) / BITS_PER_BITMAPWORD)
-#define BITNUM(x) ((x) % BITS_PER_BITMAPWORD)
-
#define BITMAPSET_SIZE(nwords) \
(offsetof(Bitmapset, words) + (nwords) * sizeof(bitmapword))
-/*----------
- * This is a well-known cute trick for isolating the rightmost one-bit
- * in a word. It assumes two's complement arithmetic. Consider any
- * nonzero value, and focus attention on the rightmost one. The value is
- * then something like
- * xxxxxx10000
- * where x's are unspecified bits. The two's complement negative is formed
- * by inverting all the bits and adding one. Inversion gives
- * yyyyyy01111
- * where each y is the inverse of the corresponding x. Incrementing gives
- * yyyyyy10000
- * and then ANDing with the original value gives
- * 00000010000
- * This works for all cases except original value = zero, where of course
- * we get zero.
- *----------
- */
-#define RIGHTMOST_ONE(x) ((signedbitmapword) (x) & -((signedbitmapword) (x)))
-
-#define HAS_MULTIPLE_ONES(x) ((bitmapword) RIGHTMOST_ONE(x) != (x))
-
-/* Select appropriate bit-twiddling functions for bitmap word size */
-#if BITS_PER_BITMAPWORD == 32
-#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos32(w)
-#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos32(w)
-#define bmw_popcount(w) pg_popcount32(w)
-#elif BITS_PER_BITMAPWORD == 64
-#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos64(w)
-#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos64(w)
-#define bmw_popcount(w) pg_popcount64(w)
-#else
-#error "invalid BITS_PER_BITMAPWORD"
-#endif
-
/*
* bms_copy - make a palloc'd copy of a bitmapset
diff --git a/src/include/nodes/bitmapset.h b/src/include/nodes/bitmapset.h
index 2792281658..06fa21ccaa 100644
--- a/src/include/nodes/bitmapset.h
+++ b/src/include/nodes/bitmapset.h
@@ -21,33 +21,13 @@
#define BITMAPSET_H
#include "nodes/nodes.h"
+#include "port/pg_bitutils.h"
/*
* Forward decl to save including pg_list.h
*/
struct List;
-/*
- * Data representation
- *
- * Larger bitmap word sizes generally give better performance, so long as
- * they're not wider than the processor can handle efficiently. We use
- * 64-bit words if pointers are that large, else 32-bit words.
- */
-#if SIZEOF_VOID_P >= 8
-
-#define BITS_PER_BITMAPWORD 64
-typedef uint64 bitmapword; /* must be an unsigned type */
-typedef int64 signedbitmapword; /* must be the matching signed type */
-
-#else
-
-#define BITS_PER_BITMAPWORD 32
-typedef uint32 bitmapword; /* must be an unsigned type */
-typedef int32 signedbitmapword; /* must be the matching signed type */
-
-#endif
-
typedef struct Bitmapset
{
pg_node_attr(custom_copy_equal, special_read_write)
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 814e0b2dba..ad5aa2c5cf 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -17,6 +17,51 @@ extern PGDLLIMPORT const uint8 pg_leftmost_one_pos[256];
extern PGDLLIMPORT const uint8 pg_rightmost_one_pos[256];
extern PGDLLIMPORT const uint8 pg_number_of_ones[256];
+/*
+ * Platform-specific types
+ *
+ * Larger bitmap word sizes generally give better performance, so long as
+ * they're not wider than the processor can handle efficiently. We use
+ * 64-bit words if pointers are that large, else 32-bit words.
+ */
+#if SIZEOF_VOID_P >= 8
+
+#define BITS_PER_BITMAPWORD 64
+typedef uint64 bitmapword; /* must be an unsigned type */
+typedef int64 signedbitmapword; /* must be the matching signed type */
+
+#else
+
+#define BITS_PER_BITMAPWORD 32
+typedef uint32 bitmapword; /* must be an unsigned type */
+typedef int32 signedbitmapword; /* must be the matching signed type */
+
+#endif
+
+#define WORDNUM(x) ((x) / BITS_PER_BITMAPWORD)
+#define BITNUM(x) ((x) % BITS_PER_BITMAPWORD)
+
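+/*
+ * Illustration: with 64-bit bitmapwords, bit number 70 lives in word
+ * WORDNUM(70) == 1, at bit position BITNUM(70) == 6 within that word.
+ */
+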
+/*----------
+ * This is a well-known cute trick for isolating the rightmost one-bit
+ * in a word. It assumes two's complement arithmetic. Consider any
+ * nonzero value, and focus attention on the rightmost one. The value is
+ * then something like
+ * xxxxxx10000
+ * where x's are unspecified bits. The two's complement negative is formed
+ * by inverting all the bits and adding one. Inversion gives
+ * yyyyyy01111
+ * where each y is the inverse of the corresponding x. Incrementing gives
+ * yyyyyy10000
+ * and then ANDing with the original value gives
+ * 00000010000
+ * This works for all cases except original value = zero, where of course
+ * we get zero.
+ *----------
+ */
+#define RIGHTMOST_ONE(x) ((signedbitmapword) (x) & -((signedbitmapword) (x)))
+
+#define HAS_MULTIPLE_ONES(x) ((bitmapword) RIGHTMOST_ONE(x) != (x))
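+
+/*
+ * Illustration: for x = 0b0110100, -x is ...1001100 in two's complement, so
+ * RIGHTMOST_ONE(x) == 0b0000100, keeping only the rightmost one-bit.
+ * Accordingly, HAS_MULTIPLE_ONES(0b0110100) is true and
+ * HAS_MULTIPLE_ONES(0b0000100) is false.
+ */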
+
/*
* pg_leftmost_one_pos32
* Returns the position of the most significant set bit in "word",
@@ -291,4 +336,17 @@ pg_rotate_left32(uint32 word, int n)
#define pg_prevpower2_size_t pg_prevpower2_64
#endif
+/* variants of some functions for bitmap word size */
+#if BITS_PER_BITMAPWORD == 32
+#define bmw_leftmost_one_pos pg_leftmost_one_pos32
+#define bmw_rightmost_one_pos pg_rightmost_one_pos32
+#define bmw_popcount pg_popcount32
+#elif BITS_PER_BITMAPWORD == 64
+#define bmw_leftmost_one_pos pg_leftmost_one_pos64
+#define bmw_rightmost_one_pos pg_rightmost_one_pos64
+#define bmw_popcount pg_popcount64
+#else
+#error "invalid BITS_PER_BITMAPWORD"
+#endif
+
#endif /* PG_BITUTILS_H */
--
2.31.1
Attachment: v12-0002-Add-radix-implementation.patch (application/octet-stream)
From 68401b497992d33ef5758b5ddb75244d550240d5 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 14 Sep 2022 12:38:51 +0000
Subject: [PATCH v12 2/7] Add radix implementation.
---
src/backend/lib/Makefile | 1 +
src/backend/lib/meson.build | 1 +
src/backend/lib/radixtree.c | 2541 +++++++++++++++++
src/include/lib/radixtree.h | 42 +
src/test/modules/Makefile | 1 +
src/test/modules/meson.build | 1 +
src/test/modules/test_radixtree/.gitignore | 4 +
src/test/modules/test_radixtree/Makefile | 23 +
src/test/modules/test_radixtree/README | 7 +
.../expected/test_radixtree.out | 36 +
src/test/modules/test_radixtree/meson.build | 34 +
.../test_radixtree/sql/test_radixtree.sql | 7 +
.../test_radixtree/test_radixtree--1.0.sql | 8 +
.../modules/test_radixtree/test_radixtree.c | 581 ++++
.../test_radixtree/test_radixtree.control | 4 +
15 files changed, 3291 insertions(+)
create mode 100644 src/backend/lib/radixtree.c
create mode 100644 src/include/lib/radixtree.h
create mode 100644 src/test/modules/test_radixtree/.gitignore
create mode 100644 src/test/modules/test_radixtree/Makefile
create mode 100644 src/test/modules/test_radixtree/README
create mode 100644 src/test/modules/test_radixtree/expected/test_radixtree.out
create mode 100644 src/test/modules/test_radixtree/meson.build
create mode 100644 src/test/modules/test_radixtree/sql/test_radixtree.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree--1.0.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree.c
create mode 100644 src/test/modules/test_radixtree/test_radixtree.control
diff --git a/src/backend/lib/Makefile b/src/backend/lib/Makefile
index 9dad31398a..4c1db794b6 100644
--- a/src/backend/lib/Makefile
+++ b/src/backend/lib/Makefile
@@ -22,6 +22,7 @@ OBJS = \
integerset.o \
knapsack.o \
pairingheap.o \
+ radixtree.o \
rbtree.o \
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/lib/meson.build b/src/backend/lib/meson.build
index 48da1bddce..4303d306cd 100644
--- a/src/backend/lib/meson.build
+++ b/src/backend/lib/meson.build
@@ -9,4 +9,5 @@ backend_sources += files(
'knapsack.c',
'pairingheap.c',
'rbtree.c',
+ 'radixtree.c',
)
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
new file mode 100644
index 0000000000..e7f61fd943
--- /dev/null
+++ b/src/backend/lib/radixtree.c
@@ -0,0 +1,2541 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.c
+ * Implementation for adaptive radix tree.
+ *
+ * This module employs the idea from the paper "The Adaptive Radix Tree: ARTful
+ * Indexing for Main-Memory Databases" by Viktor Leis, Alfons Kemper, and Thomas
+ * Neumann, 2013. The radix tree uses adaptive node sizes: a small number of node
+ * types, each with a different number of elements. Depending on the number of
+ * children, the appropriate node type is used.
+ *
+ * There are some differences from the proposed implementation. For instance,
+ * there is no support for path compression or lazy path expansion. The radix
+ * tree supports only fixed-length keys, so we don't expect the tree to become
+ * very high.
+ *
+ * Both the key and the value are 64-bit unsigned integers. The inner nodes and
+ * the leaf nodes have slightly different structures: inner nodes (shift > 0)
+ * store pointers to their child nodes as values, whereas leaf nodes
+ * (shift == 0) store the 64-bit unsigned integer specified by the user as the
+ * value. The paper refers to this technique as "Multi-value leaves". We choose
+ * it to avoid an additional pointer traversal. It is the reason this code
+ * currently does not support variable-length keys.
+ *
+ * XXX: Most functions in this file have two variants, one for inner nodes and
+ * one for leaf nodes, so there is some code duplication. While this sometimes
+ * makes code maintenance tricky, it reduces branch prediction misses when
+ * judging whether the node is an inner node or a leaf node.
+ *
+ * XXX: radix tree nodes are never shrunk.
+ *
+ * Interface
+ * ---------
+ *
+ * rt_create - Create a new, empty radix tree
+ * rt_free - Free the radix tree
+ * rt_search - Search a key-value pair
+ * rt_set - Set a key-value pair
+ * rt_delete - Delete a key-value pair
+ * rt_begin_iterate - Begin iterating through all key-value pairs
+ * rt_iterate_next - Return next key-value pair, if any
+ * rt_end_iter - End iteration
+ * rt_memory_usage - Get the memory usage
+ * rt_num_entries - Get the number of key-value pairs
+ *
+ * rt_create() creates an empty radix tree in the given memory context, along
+ * with child memory contexts for each kind of radix tree node.
+ *
+ * rt_iterate_next() returns key-value pairs in ascending order of the key.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/lib/radixtree.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "lib/radixtree.h"
+#include "lib/stringinfo.h"
+#include "miscadmin.h"
+#include "port/pg_bitutils.h"
+#include "port/pg_lfind.h"
+#include "utils/memutils.h"
+
+#ifdef RT_DEBUG
+#define UINT64_FORMAT_HEX "%" INT64_MODIFIER "X"
+#endif
+
+/* The number of bits encoded in one tree level */
+#define RT_NODE_SPAN BITS_PER_BYTE
+
+/* The number of maximum slots in the node */
+#define RT_NODE_MAX_SLOTS (1 << RT_NODE_SPAN)
+
+/*
+ * Return the number of bytes required for a bitmap covering nslots slots,
+ * used by nodes that track slot usage with an is-set bitmap.
+ */
+#define RT_NODE_NSLOTS_BITS(nslots) ((nslots) / (sizeof(uint8) * BITS_PER_BYTE))
+
+/* Mask for extracting a chunk from the key */
+#define RT_CHUNK_MASK ((1 << RT_NODE_SPAN) - 1)
+
+/* Maximum shift the radix tree uses */
+#define RT_MAX_SHIFT key_get_shift(UINT64_MAX)
+
+/* Tree level the radix tree uses */
+#define RT_MAX_LEVEL ((sizeof(uint64) * BITS_PER_BYTE) / RT_NODE_SPAN)
+
+/* Invalid index used in node-125 */
+#define RT_NODE_125_INVALID_IDX 0xFF
+
+/* Get a chunk from the key */
+#define RT_GET_KEY_CHUNK(key, shift) ((uint8) (((key) >> (shift)) & RT_CHUNK_MASK))
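+
+/*
+ * Illustration: with RT_NODE_SPAN == 8, the key 0x0807060504030201 decomposes
+ * into the chunks 0x08, 0x07, ..., 0x01 from the topmost level down; for
+ * example, RT_GET_KEY_CHUNK(0x0807060504030201, 24) == 0x04.
+ */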
+
+/*
+ * Mapping from a slot (or chunk) number to the byte and bit in the is-set
+ * bitmap, used by node-125 and node-256.
+ */
+#define RT_NODE_BITMAP_BYTE(v) ((v) / BITS_PER_BYTE)
+#define RT_NODE_BITMAP_BIT(v) (UINT64CONST(1) << ((v) % RT_NODE_SPAN))
+
+/* Enum used by the node search functions (rt_node_search_inner/leaf) */
+typedef enum
+{
+ RT_ACTION_FIND = 0, /* find the key-value */
+ RT_ACTION_DELETE, /* delete the key-value */
+} rt_action;
+
+/*
+ * Supported radix tree node kinds and size classes.
+ *
+ * There are 4 node kinds and each node kind has one or two size classes,
+ * partial and full. The size classes within the same node kind share the same
+ * node structure but have a different fanout, which is stored in 'fanout' of
+ * rt_node. For example in size class 15, when a 16th element is to be
+ * inserted, we allocate a larger area and memcpy the entire old node to it.
+ *
+ * This technique allows us to limit the node kinds to 4, which limits the
+ * number of cases in switch statements. It also allows a possible future
+ * optimization to encode the node kind in a pointer tag.
+ *
+ * These size classes have been chosen carefully so that they minimize the
+ * allocator padding in both the inner and leaf nodes on DSA.
+ */
+#define RT_NODE_KIND_4 0x00
+#define RT_NODE_KIND_32 0x01
+#define RT_NODE_KIND_125 0x02
+#define RT_NODE_KIND_256 0x03
+#define RT_NODE_KIND_COUNT 4
+
+typedef enum rt_size_class
+{
+ RT_CLASS_4_FULL = 0,
+ RT_CLASS_32_PARTIAL,
+ RT_CLASS_32_FULL,
+ RT_CLASS_125_FULL,
+ RT_CLASS_256
+
+#define RT_SIZE_CLASS_COUNT (RT_CLASS_256 + 1)
+} rt_size_class;
+
+/* Common type for all nodes types */
+typedef struct rt_node
+{
+ /*
+ * Number of children. We use uint16 to be able to indicate 256 children
+ * at the fanout of 8.
+ */
+ uint16 count;
+
+ /* Max number of children. We can use uint8 because we never need to store 256 */
+ /* WIP: if we don't have a variable sized node4, this should instead be in the base
+ types as needed, since saving every byte is crucial for the smallest node kind */
+ uint8 fanout;
+
+ /*
+ * Shift indicates which part of the key space is represented by this
+ * node. That is, the key is shifted by 'shift' and the lowest
+ * RT_NODE_SPAN bits are then represented in chunk.
+ */
+ uint8 shift;
+ uint8 chunk;
+
+ /* Node kind, one per search/set algorithm */
+ uint8 kind;
+} rt_node;
+#define NODE_IS_LEAF(n) (((rt_node *) (n))->shift == 0)
+#define NODE_IS_EMPTY(n) (((rt_node *) (n))->count == 0)
+#define VAR_NODE_HAS_FREE_SLOT(node) \
+ ((node)->base.n.count < (node)->base.n.fanout)
+#define FIXED_NODE_HAS_FREE_SLOT(node, class) \
+ ((node)->base.n.count < rt_size_class_info[class].fanout)
+
+/* Base types of each node kind, shared by leaf and inner nodes */
+/* The base types must be able to accommodate the largest size
+class for variable-sized node kinds */
+typedef struct rt_node_base_4
+{
+ rt_node n;
+
+ /* 4 children, for key chunks */
+ uint8 chunks[4];
+} rt_node_base_4;
+
+typedef struct rt_node_base32
+{
+ rt_node n;
+
+ /* 32 children, for key chunks */
+ uint8 chunks[32];
+} rt_node_base_32;
+
+/*
+ * node-125 uses a slot_idxs array of RT_NODE_MAX_SLOTS (256) entries to store
+ * indexes into a second array that contains up to 125 values (or child
+ * pointers in inner nodes).
+ */
+typedef struct rt_node_base125
+{
+ rt_node n;
+
+ /* The index of slots for each fanout */
+ uint8 slot_idxs[RT_NODE_MAX_SLOTS];
+} rt_node_base_125;
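+
+/*
+ * Illustration: if chunk 0x20 is stored in slot 3 (slot_idxs[0x20] == 3), its
+ * value lives in values[3] of a leaf node (or children[3] of an inner node);
+ * unused chunks keep RT_NODE_125_INVALID_IDX.
+ */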
+
+typedef struct rt_node_base256
+{
+ rt_node n;
+} rt_node_base_256;
+
+/*
+ * Inner and leaf nodes.
+ *
+ * These are separate for two main reasons:
+ *
+ * 1) the value type might be different than something fitting into a pointer
+ * width type
+ * 2) Need to represent non-existing values in a key-type independent way.
+ *
+ * 1) is clearly worth being concerned about, but it's not clear 2) is as
+ * good. It might be better to just indicate non-existing entries the same way
+ * in inner nodes.
+ */
+typedef struct rt_node_inner_4
+{
+ rt_node_base_4 base;
+
+ /* number of children depends on size class */
+ rt_node *children[FLEXIBLE_ARRAY_MEMBER];
+} rt_node_inner_4;
+
+typedef struct rt_node_leaf_4
+{
+ rt_node_base_4 base;
+
+ /* number of values depends on size class */
+ uint64 values[FLEXIBLE_ARRAY_MEMBER];
+} rt_node_leaf_4;
+
+typedef struct rt_node_inner_32
+{
+ rt_node_base_32 base;
+
+ /* number of children depends on size class */
+ rt_node *children[FLEXIBLE_ARRAY_MEMBER];
+} rt_node_inner_32;
+
+typedef struct rt_node_leaf_32
+{
+ rt_node_base_32 base;
+
+ /* number of values depends on size class */
+ uint64 values[FLEXIBLE_ARRAY_MEMBER];
+} rt_node_leaf_32;
+
+typedef struct rt_node_inner_125
+{
+ rt_node_base_125 base;
+
+ /* number of children depends on size class */
+ rt_node *children[FLEXIBLE_ARRAY_MEMBER];
+} rt_node_inner_125;
+
+typedef struct rt_node_leaf_125
+{
+ rt_node_base_125 base;
+
+ /* isset is a bitmap to track which slot is in use */
+ uint8 isset[RT_NODE_NSLOTS_BITS(128)];
+
+ /* number of values depends on size class */
+ uint64 values[FLEXIBLE_ARRAY_MEMBER];
+} rt_node_leaf_125;
+
+/*
+ * node-256 is the largest node type. This node has an array of RT_NODE_MAX_SLOTS entries
+ * for directly storing values (or child pointers in inner nodes).
+ */
+typedef struct rt_node_inner_256
+{
+ rt_node_base_256 base;
+
+ /* Slots for 256 children */
+ rt_node *children[RT_NODE_MAX_SLOTS];
+} rt_node_inner_256;
+
+typedef struct rt_node_leaf_256
+{
+ rt_node_base_256 base;
+
+ /* isset is a bitmap to track which slot is in use */
+ uint8 isset[RT_NODE_NSLOTS_BITS(RT_NODE_MAX_SLOTS)];
+
+ /* Slots for 256 values */
+ uint64 values[RT_NODE_MAX_SLOTS];
+} rt_node_leaf_256;
+
+/* Information for each size class */
+typedef struct rt_size_class_elem
+{
+ const char *name;
+ int fanout;
+
+ /* slab chunk size */
+ Size inner_size;
+ Size leaf_size;
+
+ /* slab block size */
+ Size inner_blocksize;
+ Size leaf_blocksize;
+} rt_size_class_elem;
+
+/*
+ * Calculate the slab blocksize so that we can allocate at least 32 chunks
+ * from the block.
+ */
+#define NODE_SLAB_BLOCK_SIZE(size) \
+ Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * (size), (size) * 32)
+static rt_size_class_elem rt_size_class_info[RT_SIZE_CLASS_COUNT] = {
+ [RT_CLASS_4_FULL] = {
+ .name = "radix tree node 4",
+ .fanout = 4,
+ .inner_size = sizeof(rt_node_inner_4) + 4 * sizeof(rt_node *),
+ .leaf_size = sizeof(rt_node_leaf_4) + 4 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_4) + 4 * sizeof(rt_node *)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_4) + 4 * sizeof(uint64)),
+ },
+ [RT_CLASS_32_PARTIAL] = {
+ .name = "radix tree node 15",
+ .fanout = 15,
+ .inner_size = sizeof(rt_node_inner_32) + 15 * sizeof(rt_node *),
+ .leaf_size = sizeof(rt_node_leaf_32) + 15 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_32) + 15 * sizeof(rt_node *)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_32) + 15 * sizeof(uint64)),
+ },
+ [RT_CLASS_32_FULL] = {
+ .name = "radix tree node 32",
+ .fanout = 32,
+ .inner_size = sizeof(rt_node_inner_32) + 32 * sizeof(rt_node *),
+ .leaf_size = sizeof(rt_node_leaf_32) + 32 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_32) + 32 * sizeof(rt_node *)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_32) + 32 * sizeof(uint64)),
+ },
+ [RT_CLASS_125_FULL] = {
+ .name = "radix tree node 125",
+ .fanout = 125,
+ .inner_size = sizeof(rt_node_inner_125) + 125 * sizeof(rt_node *),
+ .leaf_size = sizeof(rt_node_leaf_125) + 125 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_125) + 125 * sizeof(rt_node *)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_125) + 125 * sizeof(uint64)),
+ },
+ [RT_CLASS_256] = {
+ .name = "radix tree node 256",
+ .fanout = 256,
+ .inner_size = sizeof(rt_node_inner_256),
+ .leaf_size = sizeof(rt_node_leaf_256),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_256)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_256)),
+ },
+};
+
+/* Map from the node kind to its minimum size class */
+static rt_size_class kind_min_size_class[RT_NODE_KIND_COUNT] = {
+ [RT_NODE_KIND_4] = RT_CLASS_4_FULL,
+ [RT_NODE_KIND_32] = RT_CLASS_32_PARTIAL,
+ [RT_NODE_KIND_125] = RT_CLASS_125_FULL,
+ [RT_NODE_KIND_256] = RT_CLASS_256,
+};
+
+/*
+ * Iteration support.
+ *
+ * Iterating the radix tree returns each pair of key and value in the ascending
+ * order of the key. To support this, we iterate over the nodes at each level.
+ *
+ * rt_node_iter struct is used to track the iteration within a node.
+ *
+ * rt_iter is the struct for iteration of the radix tree, and uses rt_node_iter
+ * in order to track the iteration of each level. During the iteration, we also
+ * construct the key whenever updating the node iteration information, e.g., when
+ * advancing the current index within the node or when moving to the next node
+ * at the same level.
+ */
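+
+/*
+ * Illustration: if the iterator is at chunk 0x12 at shift 8 and chunk 0x34 at
+ * shift 0, the key under construction is 0x1234; each level contributes its
+ * 8-bit chunk at its shift.
+ */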
+typedef struct rt_node_iter
+{
+ rt_node *node; /* current node being iterated */
+ int current_idx; /* current position. -1 for initial value */
+} rt_node_iter;
+
+struct rt_iter
+{
+ radix_tree *tree;
+
+ /* Track the iteration on nodes of each level */
+ rt_node_iter stack[RT_MAX_LEVEL];
+ int stack_len;
+
+ /* The key is being constructed during the iteration */
+ uint64 key;
+};
+
+/* A radix tree with nodes */
+struct radix_tree
+{
+ MemoryContext context;
+
+ rt_node *root;
+ uint64 max_val;
+ uint64 num_keys;
+
+ MemoryContextData *inner_slabs[RT_SIZE_CLASS_COUNT];
+ MemoryContextData *leaf_slabs[RT_SIZE_CLASS_COUNT];
+
+ /* statistics */
+#ifdef RT_DEBUG
+ int32 cnt[RT_SIZE_CLASS_COUNT];
+#endif
+};
+
+static void rt_new_root(radix_tree *tree, uint64 key);
+static rt_node *rt_alloc_node(radix_tree *tree, rt_size_class size_class, bool inner);
+static inline void rt_init_node(rt_node *node, uint8 kind, rt_size_class size_class,
+ bool inner);
+static void rt_free_node(radix_tree *tree, rt_node *node);
+static void rt_extend(radix_tree *tree, uint64 key);
+static inline bool rt_node_search_inner(rt_node *node, uint64 key, rt_action action,
+ rt_node **child_p);
+static inline bool rt_node_search_leaf(rt_node *node, uint64 key, rt_action action,
+ uint64 *value_p);
+static bool rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node,
+ uint64 key, rt_node *child);
+static bool rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
+ uint64 key, uint64 value);
+static inline rt_node *rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter);
+static inline bool rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
+ uint64 *value_p);
+static void rt_update_iter_stack(rt_iter *iter, rt_node *from_node, int from);
+static inline void rt_iter_update_key(rt_iter *iter, uint8 chunk, uint8 shift);
+
+/* verification (available only with assertion) */
+static void rt_verify_node(rt_node *node);
+
+/*
+ * Return the index of the first element in the node's chunk array that equals
+ * 'chunk'. Return -1 if there is no such element.
+ */
+static inline int
+node_4_search_eq(rt_node_base_4 *node, uint8 chunk)
+{
+ int idx = -1;
+
+ for (int i = 0; i < node->n.count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ idx = i;
+ break;
+ }
+ }
+
+ return idx;
+}
+
+/*
+ * Return the index at which 'chunk' should be inserted into the node's chunk array.
+ */
+static inline int
+node_4_get_insertpos(rt_node_base_4 *node, uint8 chunk)
+{
+ int idx;
+
+ for (idx = 0; idx < node->n.count; idx++)
+ {
+ if (node->chunks[idx] >= chunk)
+ break;
+ }
+
+ return idx;
+}
+
+/*
+ * Return the index of the first element in the node's chunk array that equals
+ * 'chunk'. Return -1 if there is no such element.
+ */
+static inline int
+node_32_search_eq(rt_node_base_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ uint32 bitfield;
+ int index_simd = -1;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index = -1;
+
+ for (int i = 0; i < count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ index = i;
+ break;
+ }
+ }
+#endif
+
+#ifndef USE_NO_SIMD
+ spread_chunk = vector8_broadcast(chunk);
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ cmp1 = vector8_eq(spread_chunk, haystack1);
+ cmp2 = vector8_eq(spread_chunk, haystack2);
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+/*
+ * Return the index at which 'chunk' should be inserted into the node's chunk array.
+ */
+static inline int
+node_32_get_insertpos(rt_node_base_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ Vector8 min1;
+ Vector8 min2;
+ uint32 bitfield;
+ int index_simd;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index;
+
+ for (index = 0; index < count; index++)
+ {
+ if (node->chunks[index] >= chunk)
+ break;
+ }
+#endif
+
+#ifndef USE_NO_SIMD
+ spread_chunk = vector8_broadcast(chunk);
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ min1 = vector8_min(spread_chunk, haystack1);
+ min2 = vector8_min(spread_chunk, haystack2);
+ cmp1 = vector8_eq(spread_chunk, min1);
+ cmp2 = vector8_eq(spread_chunk, min2);
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+ else
+ index_simd = count;
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+/*
+ * Functions to manipulate both chunks array and children/values array.
+ * These are used for node-4 and node-32.
+ */
+
+/* Shift the elements right at 'idx' by one */
+static inline void
+chunk_children_array_shift(uint8 *chunks, rt_node **children, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+ memmove(&(children[idx + 1]), &(children[idx]), sizeof(rt_node *) * (count - idx));
+}
+
+static inline void
+chunk_values_array_shift(uint8 *chunks, uint64 *values, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+ memmove(&(values[idx + 1]), &(values[idx]), sizeof(uint64 *) * (count - idx));
+}
+
+/* Delete the element at 'idx' */
+static inline void
+chunk_children_array_delete(uint8 *chunks, rt_node **children, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(children[idx]), &(children[idx + 1]), sizeof(rt_node *) * (count - idx - 1));
+}
+
+static inline void
+chunk_values_array_delete(uint8 *chunks, uint64 *values, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(values[idx]), &(values[idx + 1]), sizeof(uint64) * (count - idx - 1));
+}
+
+/* Copy both chunks and children/values arrays */
+static inline void
+chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
+ uint8 *dst_chunks, rt_node **dst_children)
+{
+ const int fanout = rt_size_class_info[RT_CLASS_4_FULL].fanout;
+ const Size chunk_size = sizeof(uint8) * fanout;
+ const Size children_size = sizeof(rt_node *) * fanout;
+
+ memcpy(dst_chunks, src_chunks, chunk_size);
+ memcpy(dst_children, src_children, children_size);
+}
+
+static inline void
+chunk_values_array_copy(uint8 *src_chunks, uint64 *src_values,
+ uint8 *dst_chunks, uint64 *dst_values)
+{
+ const int fanout = rt_size_class_info[RT_CLASS_4_FULL].fanout;
+ const Size chunk_size = sizeof(uint8) * fanout;
+ const Size values_size = sizeof(uint64) * fanout;
+
+ memcpy(dst_chunks, src_chunks, chunk_size);
+ memcpy(dst_values, src_values, values_size);
+}
+
+/* Functions to manipulate inner and leaf node-125 */
+
+/* Does the given chunk in the node have a value? */
+static inline bool
+node_125_is_chunk_used(rt_node_base_125 *node, uint8 chunk)
+{
+ return node->slot_idxs[chunk] != RT_NODE_125_INVALID_IDX;
+}
+
+/* Is the slot in the node used? */
+static inline bool
+node_inner_125_is_slot_used(rt_node_inner_125 *node, uint8 slot)
+{
+ Assert(!NODE_IS_LEAF(node));
+ Assert(slot < node->base.n.fanout);
+ return (node->children[slot] != NULL);
+}
+
+static inline bool
+node_leaf_125_is_slot_used(rt_node_leaf_125 *node, uint8 slot)
+{
+ Assert(NODE_IS_LEAF(node));
+ Assert(slot < node->base.n.fanout);
+ return ((node->isset[RT_NODE_BITMAP_BYTE(slot)] & RT_NODE_BITMAP_BIT(slot)) != 0);
+}
+
+static inline rt_node *
+node_inner_125_get_child(rt_node_inner_125 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ return node->children[node->base.slot_idxs[chunk]];
+}
+
+static inline uint64
+node_leaf_125_get_value(rt_node_leaf_125 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ Assert(((rt_node_base_125 *) node)->slot_idxs[chunk] != RT_NODE_125_INVALID_IDX);
+ return node->values[node->base.slot_idxs[chunk]];
+}
+
+static void
+node_inner_125_delete(rt_node_inner_125 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->children[node->base.slot_idxs[chunk]] = NULL;
+ node->base.slot_idxs[chunk] = RT_NODE_125_INVALID_IDX;
+}
+
+static void
+node_leaf_125_delete(rt_node_leaf_125 *node, uint8 chunk)
+{
+ int slotpos = node->base.slot_idxs[chunk];
+
+ Assert(NODE_IS_LEAF(node));
+ node->isset[RT_NODE_BITMAP_BYTE(slotpos)] &= ~(RT_NODE_BITMAP_BIT(slotpos));
+ node->base.slot_idxs[chunk] = RT_NODE_125_INVALID_IDX;
+}
+
+/* Return an unused slot in node-125 */
+static int
+node_inner_125_find_unused_slot(rt_node_inner_125 *node, uint8 chunk)
+{
+ int slotpos = 0;
+
+ Assert(!NODE_IS_LEAF(node));
+ while (node_inner_125_is_slot_used(node, slotpos))
+ slotpos++;
+
+ return slotpos;
+}
+
+static int
+node_leaf_125_find_unused_slot(rt_node_leaf_125 *node, uint8 chunk)
+{
+ int slotpos;
+
+ Assert(NODE_IS_LEAF(node));
+
+ /* We iterate over the isset bitmap per byte then check each bit */
+ for (slotpos = 0; slotpos < RT_NODE_NSLOTS_BITS(128); slotpos++)
+ {
+ if (node->isset[slotpos] < 0xFF)
+ break;
+ }
+ Assert(slotpos < RT_NODE_NSLOTS_BITS(128));
+
+ slotpos *= BITS_PER_BYTE;
+ while (node_leaf_125_is_slot_used(node, slotpos))
+ slotpos++;
+
+ return slotpos;
+}
+
+static inline void
+node_inner_125_insert(rt_node_inner_125 *node, uint8 chunk, rt_node *child)
+{
+ int slotpos;
+
+ Assert(!NODE_IS_LEAF(node));
+
+ /* find unused slot */
+ slotpos = node_inner_125_find_unused_slot(node, chunk);
+ Assert(slotpos < node->base.n.fanout);
+
+ node->base.slot_idxs[chunk] = slotpos;
+ node->children[slotpos] = child;
+}
+
+/* Set the slot at the corresponding chunk */
+static inline void
+node_leaf_125_insert(rt_node_leaf_125 *node, uint8 chunk, uint64 value)
+{
+ int slotpos;
+
+ Assert(NODE_IS_LEAF(node));
+
+ /* find unused slot */
+ slotpos = node_leaf_125_find_unused_slot(node, chunk);
+ Assert(slotpos < node->base.n.fanout);
+
+ node->base.slot_idxs[chunk] = slotpos;
+ node->isset[RT_NODE_BITMAP_BYTE(slotpos)] |= RT_NODE_BITMAP_BIT(slotpos);
+ node->values[slotpos] = value;
+}
+
+/* Update the child corresponding to 'chunk' to 'child' */
+static inline void
+node_inner_125_update(rt_node_inner_125 *node, uint8 chunk, rt_node *child)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->children[node->base.slot_idxs[chunk]] = child;
+}
+
+static inline void
+node_leaf_125_update(rt_node_leaf_125 *node, uint8 chunk, uint64 value)
+{
+ Assert(NODE_IS_LEAF(node));
+ node->values[node->base.slot_idxs[chunk]] = value;
+}
+
+/* Functions to manipulate inner and leaf node-256 */
+
+/* Return true if the slot corresponding to the given chunk is in use */
+static inline bool
+node_inner_256_is_chunk_used(rt_node_inner_256 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ return (node->children[chunk] != NULL);
+}
+
+static inline bool
+node_leaf_256_is_chunk_used(rt_node_leaf_256 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ return (node->isset[RT_NODE_BITMAP_BYTE(chunk)] & RT_NODE_BITMAP_BIT(chunk)) != 0;
+}
+
+static inline rt_node *
+node_inner_256_get_child(rt_node_inner_256 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ Assert(node_inner_256_is_chunk_used(node, chunk));
+ return node->children[chunk];
+}
+
+static inline uint64
+node_leaf_256_get_value(rt_node_leaf_256 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ Assert(node_leaf_256_is_chunk_used(node, chunk));
+ return node->values[chunk];
+}
+
+/* Set the child in the node-256 */
+static inline void
+node_inner_256_set(rt_node_inner_256 *node, uint8 chunk, rt_node *child)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->children[chunk] = child;
+}
+
+/* Set the value in the node-256 */
+static inline void
+node_leaf_256_set(rt_node_leaf_256 *node, uint8 chunk, uint64 value)
+{
+ Assert(NODE_IS_LEAF(node));
+ node->isset[RT_NODE_BITMAP_BYTE(chunk)] |= RT_NODE_BITMAP_BIT(chunk);
+ node->values[chunk] = value;
+}
+
+/* Clear the slot at the given chunk position */
+static inline void
+node_inner_256_delete(rt_node_inner_256 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->children[chunk] = NULL;
+}
+
+static inline void
+node_leaf_256_delete(rt_node_leaf_256 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ node->isset[RT_NODE_BITMAP_BYTE(chunk)] &= ~(RT_NODE_BITMAP_BIT(chunk));
+}
+
+/*
+ * Return the shift needed to store the given key.
+ */
+static inline int
+key_get_shift(uint64 key)
+{
+ return (key == 0)
+ ? 0
+ : (pg_leftmost_one_pos64(key) / RT_NODE_SPAN) * RT_NODE_SPAN;
+}
+
+/*
+ * Return the max value stored in a node with the given shift.
+ */
+static uint64
+shift_get_max_val(int shift)
+{
+ if (shift == RT_MAX_SHIFT)
+ return UINT64_MAX;
+
+ return (UINT64CONST(1) << (shift + RT_NODE_SPAN)) - 1;
+}
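+
+/*
+ * Illustration: for key 0x1234 the leftmost set bit is bit 12, so
+ * key_get_shift() returns 8, and shift_get_max_val(8) == 0xFFFF, i.e. a root
+ * node at shift 8 can address keys up to 0xFFFF.
+ */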
+
+/*
+ * Create a new node as the root. Subordinate nodes will be created during
+ * the insertion.
+ */
+static void
+rt_new_root(radix_tree *tree, uint64 key)
+{
+ int shift = key_get_shift(key);
+ bool inner = shift > 0;
+ rt_node *newnode;
+
+ newnode = rt_alloc_node(tree, RT_CLASS_4_FULL, inner);
+ rt_init_node(newnode, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
+ newnode->shift = shift;
+ tree->max_val = shift_get_max_val(shift);
+ tree->root = newnode;
+}
+
+/*
+ * Allocate a new node with the given node kind.
+ */
+static rt_node *
+rt_alloc_node(radix_tree *tree, rt_size_class size_class, bool inner)
+{
+ rt_node *newnode;
+
+ if (inner)
+ newnode = (rt_node *) MemoryContextAlloc(tree->inner_slabs[size_class],
+ rt_size_class_info[size_class].inner_size);
+ else
+ newnode = (rt_node *) MemoryContextAlloc(tree->leaf_slabs[size_class],
+ rt_size_class_info[size_class].leaf_size);
+
+#ifdef RT_DEBUG
+ /* update the statistics */
+ tree->cnt[size_class]++;
+#endif
+
+ return newnode;
+}
+
+/* Initialize the node contents */
+static inline void
+rt_init_node(rt_node *node, uint8 kind, rt_size_class size_class, bool inner)
+{
+ if (inner)
+ MemSet(node, 0, rt_size_class_info[size_class].inner_size);
+ else
+ MemSet(node, 0, rt_size_class_info[size_class].leaf_size);
+
+ node->kind = kind;
+ node->fanout = rt_size_class_info[size_class].fanout;
+
+ /* Initialize slot_idxs to invalid values */
+ if (kind == RT_NODE_KIND_125)
+ {
+ rt_node_base_125 *n125 = (rt_node_base_125 *) node;
+
+ memset(n125->slot_idxs, RT_NODE_125_INVALID_IDX, sizeof(n125->slot_idxs));
+ }
+
+ /*
+ * Technically it's 256, but we cannot store that in a uint8,
+ * and since this is the max size class it will never grow.
+ */
+ if (kind == RT_NODE_KIND_256)
+ node->fanout = 0;
+}
+
+static inline void
+rt_copy_node(rt_node *newnode, rt_node *oldnode)
+{
+ newnode->shift = oldnode->shift;
+ newnode->chunk = oldnode->chunk;
+ newnode->count = oldnode->count;
+}
+
+/*
+ * Create a new node with 'new_kind' and the same shift, chunk, and
+ * count of 'node'.
+ */
+static rt_node*
+rt_grow_node_kind(radix_tree *tree, rt_node *node, uint8 new_kind)
+{
+ rt_node *newnode;
+ bool inner = !NODE_IS_LEAF(node);
+
+ newnode = rt_alloc_node(tree, kind_min_size_class[new_kind], inner);
+ rt_init_node(newnode, new_kind, kind_min_size_class[new_kind], inner);
+ rt_copy_node(newnode, node);
+
+ return newnode;
+}
+
+/* Free the given node */
+static void
+rt_free_node(radix_tree *tree, rt_node *node)
+{
+ /* If we're deleting the root node, make the tree empty */
+ if (tree->root == node)
+ {
+ tree->root = NULL;
+ tree->max_val = 0;
+ }
+
+#ifdef RT_DEBUG
+ {
+ int i;
+
+ /* update the statistics */
+ for (i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ if (node->fanout == rt_size_class_info[i].fanout)
+ break;
+ }
+
+ /* fanout of node256 is intentionally 0 */
+ if (i == RT_SIZE_CLASS_COUNT)
+ i = RT_CLASS_256;
+
+ tree->cnt[i]--;
+ Assert(tree->cnt[i] >= 0);
+ }
+#endif
+
+ pfree(node);
+}
+
+/*
+ * Replace old_child with new_child, and free the old one.
+ */
+static void
+rt_replace_node(radix_tree *tree, rt_node *parent, rt_node *old_child,
+ rt_node *new_child, uint64 key)
+{
+ Assert(old_child->chunk == new_child->chunk);
+ Assert(old_child->shift == new_child->shift);
+
+ if (parent == old_child)
+ {
+ /* Replace the root node with the new large node */
+ tree->root = new_child;
+ }
+ else
+ {
+ bool replaced PG_USED_FOR_ASSERTS_ONLY;
+
+ replaced = rt_node_insert_inner(tree, NULL, parent, key, new_child);
+ Assert(replaced);
+ }
+
+ rt_free_node(tree, old_child);
+}
+
+/*
+ * The radix tree doesn't have sufficient height. Extend the radix tree so it can
+ * store the key.
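+ *
+ * Illustration: if the current root is at shift 8 (covering keys up to
+ * 0xFFFF) and key 0x1000000 arrives, two new node-4 inner nodes are pushed on
+ * top at shifts 16 and 24, and max_val becomes 0xFFFFFFFF.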
+ */
+static void
+rt_extend(radix_tree *tree, uint64 key)
+{
+ int target_shift;
+ int shift = tree->root->shift + RT_NODE_SPAN;
+
+ target_shift = key_get_shift(key);
+
+ /* Grow tree from 'shift' to 'target_shift' */
+ while (shift <= target_shift)
+ {
+ rt_node_inner_4 *node;
+
+ node = (rt_node_inner_4 *) rt_alloc_node(tree, RT_CLASS_4_FULL, true);
+ rt_init_node((rt_node *) node, RT_NODE_KIND_4, RT_CLASS_4_FULL, true);
+ node->base.n.shift = shift;
+ node->base.n.count = 1;
+ node->base.chunks[0] = 0;
+ node->children[0] = tree->root;
+
+ tree->root->chunk = 0;
+ tree->root = (rt_node *) node;
+
+ shift += RT_NODE_SPAN;
+ }
+
+ tree->max_val = shift_get_max_val(target_shift);
+}
+
+/*
+ * The radix tree doesn't have inner and leaf nodes for the given key-value pair.
+ * Insert inner and leaf nodes from 'node' to bottom.
+ */
+static inline void
+rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node *parent,
+ rt_node *node)
+{
+ int shift = node->shift;
+
+ while (shift >= RT_NODE_SPAN)
+ {
+ rt_node *newchild;
+ int newshift = shift - RT_NODE_SPAN;
+ bool inner = newshift > 0;
+
+ newchild = rt_alloc_node(tree, RT_CLASS_4_FULL, inner);
+ rt_init_node(newchild, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
+ newchild->shift = newshift;
+ newchild->chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ rt_node_insert_inner(tree, parent, node, key, newchild);
+
+ parent = node;
+ node = newchild;
+ shift -= RT_NODE_SPAN;
+ }
+
+ rt_node_insert_leaf(tree, parent, node, key, value);
+ tree->num_keys++;
+}
+
+/*
+ * Search for the child pointer corresponding to 'key' in the given node, and
+ * do the specified 'action'.
+ *
+ * Return true if the key is found, otherwise return false. On success, the child
+ * pointer is set to child_p.
+ */
+static inline bool
+rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **child_p)
+{
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool found = false;
+ rt_node *child = NULL;
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+ int idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
+
+ if (idx < 0)
+ break;
+
+ found = true;
+
+ if (action == RT_ACTION_FIND)
+ child = n4->children[idx];
+ else /* RT_ACTION_DELETE */
+ chunk_children_array_delete(n4->base.chunks, n4->children,
+ n4->base.n.count, idx);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+ int idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
+
+ if (idx < 0)
+ break;
+
+ found = true;
+ if (action == RT_ACTION_FIND)
+ child = n32->children[idx];
+ else /* RT_ACTION_DELETE */
+ chunk_children_array_delete(n32->base.chunks, n32->children,
+ n32->base.n.count, idx);
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ rt_node_inner_125 *n125 = (rt_node_inner_125 *) node;
+
+ if (!node_125_is_chunk_used((rt_node_base_125 *) n125, chunk))
+ break;
+
+ found = true;
+
+ if (action == RT_ACTION_FIND)
+ child = node_inner_125_get_child(n125, chunk);
+ else /* RT_ACTION_DELETE */
+ node_inner_125_delete(n125, chunk);
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+
+ if (!node_inner_256_is_chunk_used(n256, chunk))
+ break;
+
+ found = true;
+ if (action == RT_ACTION_FIND)
+ child = node_inner_256_get_child(n256, chunk);
+ else /* RT_ACTION_DELETE */
+ node_inner_256_delete(n256, chunk);
+
+ break;
+ }
+ }
+
+ /* update statistics */
+ if (action == RT_ACTION_DELETE && found)
+ node->count--;
+
+ if (found && child_p)
+ *child_p = child;
+
+ return found;
+}
+
+/*
+ * Search for the value corresponding to 'key' in the given node, and do the
+ * specified 'action'.
+ *
+ * Return true if the key is found, otherwise return false. On success, the pointer
+ * to the value is set to value_p.
+ */
+static inline bool
+rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p)
+{
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool found = false;
+ uint64 value = 0;
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+ int idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
+
+ if (idx < 0)
+ break;
+
+ found = true;
+
+ if (action == RT_ACTION_FIND)
+ value = n4->values[idx];
+ else /* RT_ACTION_DELETE */
+ chunk_values_array_delete(n4->base.chunks, (uint64 *) n4->values,
+ n4->base.n.count, idx);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+ int idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
+
+ if (idx < 0)
+ break;
+
+ found = true;
+ if (action == RT_ACTION_FIND)
+ value = n32->values[idx];
+ else /* RT_ACTION_DELETE */
+ chunk_values_array_delete(n32->base.chunks, (uint64 *) n32->values,
+ n32->base.n.count, idx);
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) node;
+
+ if (!node_125_is_chunk_used((rt_node_base_125 *) n125, chunk))
+ break;
+
+ found = true;
+
+ if (action == RT_ACTION_FIND)
+ value = node_leaf_125_get_value(n125, chunk);
+ else /* RT_ACTION_DELETE */
+ node_leaf_125_delete(n125, chunk);
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+
+ if (!node_leaf_256_is_chunk_used(n256, chunk))
+ break;
+
+ found = true;
+ if (action == RT_ACTION_FIND)
+ value = node_leaf_256_get_value(n256, chunk);
+ else /* RT_ACTION_DELETE */
+ node_leaf_256_delete(n256, chunk);
+
+ break;
+ }
+ }
+
+ /* update statistics */
+ if (action == RT_ACTION_DELETE && found)
+ node->count--;
+
+ if (found && value_p)
+ *value_p = value;
+
+ return found;
+}
+
+/* Insert the child to the inner node */
+static bool
+rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 key,
+ rt_node *child)
+{
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool chunk_exists = false;
+
+ Assert(!NODE_IS_LEAF(node));
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+ int idx;
+
+ idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n4->children[idx] = child;
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n4)))
+ {
+ rt_node_inner_32 *new32;
+ Assert(parent != NULL);
+
+ /* grow node from 4 to 32 */
+ new32 = (rt_node_inner_32 *) rt_grow_node_kind(tree, (rt_node *) n4,
+ RT_NODE_KIND_32);
+ chunk_children_array_copy(n4->base.chunks, n4->children,
+ new32->base.chunks, new32->children);
+
+ Assert(parent != NULL);
+ rt_replace_node(tree, parent, (rt_node *) n4, (rt_node *) new32,
+ key);
+ node = (rt_node *) new32;
+ }
+ else
+ {
+ int insertpos = node_4_get_insertpos((rt_node_base_4 *) n4, chunk);
+ uint16 count = n4->base.n.count;
+
+ /* shift chunks and children */
+ if (count != 0 && insertpos < count)
+ chunk_children_array_shift(n4->base.chunks, n4->children,
+ count, insertpos);
+
+ n4->base.chunks[insertpos] = chunk;
+ n4->children[insertpos] = child;
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_32:
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+ int idx;
+
+ idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n32->children[idx] = child;
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)))
+ {
+ Assert(parent != NULL);
+
+ if (n32->base.n.count == rt_size_class_info[RT_CLASS_32_PARTIAL].fanout)
+ {
+ /* use the same node kind, but expand to the next size class */
+ const Size size = rt_size_class_info[RT_CLASS_32_PARTIAL].inner_size;
+ const int fanout = rt_size_class_info[RT_CLASS_32_FULL].fanout;
+ rt_node_inner_32 *new32;
+
+ new32 = (rt_node_inner_32 *) rt_alloc_node(tree, RT_CLASS_32_FULL, true);
+ memcpy(new32, n32, size);
+ new32->base.n.fanout = fanout;
+
+ rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new32, key);
+
+ /* must update both pointers here */
+ node = (rt_node *) new32;
+ n32 = new32;
+
+ goto retry_insert_inner_32;
+ }
+ else
+ {
+ rt_node_inner_125 *new125;
+
+ /* grow node from 32 to 125 */
+ new125 = (rt_node_inner_125 *) rt_grow_node_kind(tree, (rt_node *) n32,
+ RT_NODE_KIND_125);
+ for (int i = 0; i < n32->base.n.count; i++)
+ node_inner_125_insert(new125, n32->base.chunks[i], n32->children[i]);
+
+ rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new125, key);
+ node = (rt_node *) new125;
+ }
+ }
+ else
+ {
+retry_insert_inner_32:
+ {
+ int insertpos = node_32_get_insertpos((rt_node_base_32 *) n32, chunk);
+ int16 count = n32->base.n.count;
+
+ if (count != 0 && insertpos < count)
+ chunk_children_array_shift(n32->base.chunks, n32->children,
+ count, insertpos);
+
+ n32->base.chunks[insertpos] = chunk;
+ n32->children[insertpos] = child;
+ break;
+ }
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_125:
+ {
+ rt_node_inner_125 *n125 = (rt_node_inner_125 *) node;
+ int cnt = 0;
+
+ if (node_125_is_chunk_used((rt_node_base_125 *) n125, chunk))
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ node_inner_125_update(n125, chunk, child);
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n125)))
+ {
+ rt_node_inner_256 *new256;
+ Assert(parent != NULL);
+
+ /* grow node from 125 to 256 */
+ new256 = (rt_node_inner_256 *) rt_grow_node_kind(tree, (rt_node *) n125,
+ RT_NODE_KIND_256);
+ for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n125->base.n.count; i++)
+ {
+ if (!node_125_is_chunk_used((rt_node_base_125 *) n125, i))
+ continue;
+
+ node_inner_256_set(new256, i, node_inner_125_get_child(n125, i));
+ cnt++;
+ }
+
+ rt_replace_node(tree, parent, (rt_node *) n125, (rt_node *) new256,
+ key);
+ node = (rt_node *) new256;
+ }
+ else
+ {
+ node_inner_125_insert(n125, chunk, child);
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_256:
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+
+ chunk_exists = node_inner_256_is_chunk_used(n256, chunk);
+ Assert(chunk_exists || FIXED_NODE_HAS_FREE_SLOT(n256, RT_CLASS_256));
+
+ node_inner_256_set(n256, chunk, child);
+ break;
+ }
+ }
+
+ /* Update statistics */
+ if (!chunk_exists)
+ node->count++;
+
+ /*
+ * Done. Finally, verify that the chunk and child are inserted or
+ * replaced properly in the node.
+ */
+ rt_verify_node(node);
+
+ return chunk_exists;
+}
+
+/* Insert the value to the leaf node */
+static bool
+rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
+ uint64 key, uint64 value)
+{
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool chunk_exists = false;
+
+ Assert(NODE_IS_LEAF(node));
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+ int idx;
+
+ idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n4->values[idx] = value;
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n4)))
+ {
+ rt_node_leaf_32 *new32;
+ Assert(parent != NULL);
+
+ /* grow node from 4 to 32 */
+ new32 = (rt_node_leaf_32 *) rt_grow_node_kind(tree, (rt_node *) n4,
+ RT_NODE_KIND_32);
+ chunk_values_array_copy(n4->base.chunks, n4->values,
+ new32->base.chunks, new32->values);
+ rt_replace_node(tree, parent, (rt_node *) n4, (rt_node *) new32, key);
+ node = (rt_node *) new32;
+ }
+ else
+ {
+ int insertpos = node_4_get_insertpos((rt_node_base_4 *) n4, chunk);
+ int count = n4->base.n.count;
+
+ /* shift chunks and values */
+ if (count != 0 && insertpos < count)
+ chunk_values_array_shift(n4->base.chunks, n4->values,
+ count, insertpos);
+
+ n4->base.chunks[insertpos] = chunk;
+ n4->values[insertpos] = value;
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_32:
+ {
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+ int idx;
+
+ idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n32->values[idx] = value;
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)))
+ {
+ Assert(parent != NULL);
+
+ if (n32->base.n.count == rt_size_class_info[RT_CLASS_32_PARTIAL].fanout)
+ {
+ /* use the same node kind, but expand to the next size class */
+ const Size size = rt_size_class_info[RT_CLASS_32_PARTIAL].leaf_size;
+ const int fanout = rt_size_class_info[RT_CLASS_32_FULL].fanout;
+ rt_node_leaf_32 *new32;
+
+ new32 = (rt_node_leaf_32 *) rt_alloc_node(tree, RT_CLASS_32_FULL, false);
+ memcpy(new32, n32, size);
+ new32->base.n.fanout = fanout;
+
+ rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new32, key);
+
+ /* must update both pointers here */
+ node = (rt_node *) new32;
+ n32 = new32;
+
+ goto retry_insert_leaf_32;
+ }
+ else
+ {
+ rt_node_leaf_125 *new125;
+
+ /* grow node from 32 to 125 */
+ new125 = (rt_node_leaf_125 *) rt_grow_node_kind(tree, (rt_node *) n32,
+ RT_NODE_KIND_125);
+ for (int i = 0; i < n32->base.n.count; i++)
+ node_leaf_125_insert(new125, n32->base.chunks[i], n32->values[i]);
+
+ rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new125,
+ key);
+ node = (rt_node *) new125;
+ }
+ }
+ else
+ {
+ retry_insert_leaf_32:
+ {
+ int insertpos = node_32_get_insertpos((rt_node_base_32 *) n32, chunk);
+ int count = n32->base.n.count;
+
+ if (count != 0 && insertpos < count)
+ chunk_values_array_shift(n32->base.chunks, n32->values,
+ count, insertpos);
+
+ n32->base.chunks[insertpos] = chunk;
+ n32->values[insertpos] = value;
+ break;
+ }
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_125:
+ {
+ rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) node;
+ int cnt = 0;
+
+ if (node_125_is_chunk_used((rt_node_base_125 *) n125, chunk))
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ node_leaf_125_update(n125, chunk, value);
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n125)))
+ {
+ rt_node_leaf_256 *new256;
+ Assert(parent != NULL);
+
+ /* grow node from 125 to 256 */
+ new256 = (rt_node_leaf_256 *) rt_grow_node_kind(tree, (rt_node *) n125,
+ RT_NODE_KIND_256);
+ for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n125->base.n.count; i++)
+ {
+ if (!node_125_is_chunk_used((rt_node_base_125 *) n125, i))
+ continue;
+
+ node_leaf_256_set(new256, i, node_leaf_125_get_value(n125, i));
+ cnt++;
+ }
+
+ rt_replace_node(tree, parent, (rt_node *) n125, (rt_node *) new256,
+ key);
+ node = (rt_node *) new256;
+ }
+ else
+ {
+ node_leaf_125_insert(n125, chunk, value);
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_256:
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+
+ chunk_exists = node_leaf_256_is_chunk_used(n256, chunk);
+ Assert(chunk_exists || FIXED_NODE_HAS_FREE_SLOT(n256, RT_CLASS_256));
+
+ node_leaf_256_set(n256, chunk, value);
+ break;
+ }
+ }
+
+ /* Update statistics */
+ if (!chunk_exists)
+ node->count++;
+
+ /*
+ * Done. Finally, verify that the chunk and value are inserted or
+ * replaced properly in the node.
+ */
+ rt_verify_node(node);
+
+ return chunk_exists;
+}
+
+/*
+ * Create the radix tree in the given memory context and return it.
+ */
+radix_tree *
+rt_create(MemoryContext ctx)
+{
+ radix_tree *tree;
+ MemoryContext old_ctx;
+
+ old_ctx = MemoryContextSwitchTo(ctx);
+
+ tree = palloc(sizeof(radix_tree));
+ tree->context = ctx;
+ tree->root = NULL;
+ tree->max_val = 0;
+ tree->num_keys = 0;
+
+ /* Create the slab allocator for each size class */
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ tree->inner_slabs[i] = SlabContextCreate(ctx,
+ rt_size_class_info[i].name,
+ rt_size_class_info[i].inner_blocksize,
+ rt_size_class_info[i].inner_size);
+ tree->leaf_slabs[i] = SlabContextCreate(ctx,
+ rt_size_class_info[i].name,
+ rt_size_class_info[i].leaf_blocksize,
+ rt_size_class_info[i].leaf_size);
+#ifdef RT_DEBUG
+ tree->cnt[i] = 0;
+#endif
+ }
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return tree;
+}
+
+/*
+ * Free the given radix tree.
+ */
+void
+rt_free(radix_tree *tree)
+{
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ MemoryContextDelete(tree->inner_slabs[i]);
+ MemoryContextDelete(tree->leaf_slabs[i]);
+ }
+
+ pfree(tree);
+}
+
+/*
+ * Set key to value. If the entry already exists, update its value to 'value'
+ * and return true; if it does not yet exist, insert a new entry and return false.
+ */
+bool
+rt_set(radix_tree *tree, uint64 key, uint64 value)
+{
+ int shift;
+ bool updated;
+ rt_node *node;
+ rt_node *parent;
+
+ /* Empty tree, create the root */
+ if (!tree->root)
+ rt_new_root(tree, key);
+
+ /* Extend the tree if necessary */
+ if (key > tree->max_val)
+ rt_extend(tree, key);
+
+ Assert(tree->root);
+
+ shift = tree->root->shift;
+ node = parent = tree->root;
+
+ /* Descend the tree until a leaf node */
+ while (shift >= 0)
+ {
+ rt_node *child;
+
+ if (NODE_IS_LEAF(node))
+ break;
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ {
+ rt_set_extend(tree, key, value, parent, node);
+ return false;
+ }
+
+ parent = node;
+ node = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ updated = rt_node_insert_leaf(tree, parent, node, key, value);
+
+ /* Update the statistics */
+ if (!updated)
+ tree->num_keys++;
+
+ return updated;
+}
+
+/*
+ * Search for the given key in the radix tree. Return true if the key is found,
+ * otherwise return false. On success, the value is set to *value_p, which
+ * therefore must not be NULL.
+ */
+bool
+rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
+{
+ rt_node *node;
+ int shift;
+
+ Assert(value_p != NULL);
+
+ if (!tree->root || key > tree->max_val)
+ return false;
+
+ node = tree->root;
+ shift = tree->root->shift;
+
+ /* Descend the tree until a leaf node */
+ while (shift >= 0)
+ {
+ rt_node *child;
+
+ if (NODE_IS_LEAF(node))
+ break;
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ return false;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ return rt_node_search_leaf(node, key, RT_ACTION_FIND, value_p);
+}
+
+/*
+ * Delete the given key from the radix tree. Return true if the key is found (and
+ * deleted), otherwise do nothing and return false.
+ */
+bool
+rt_delete(radix_tree *tree, uint64 key)
+{
+ rt_node *node;
+ rt_node *stack[RT_MAX_LEVEL] = {0};
+ int shift;
+ int level;
+ bool deleted;
+
+ if (!tree->root || key > tree->max_val)
+ return false;
+
+ /*
+ * Descend the tree to search the key while building a stack of nodes we
+ * visited.
+ */
+ node = tree->root;
+ shift = tree->root->shift;
+ level = -1;
+ while (shift > 0)
+ {
+ rt_node *child;
+
+ /* Push the current node to the stack */
+ stack[++level] = node;
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ return false;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ /* Delete the key from the leaf node if it exists */
+ Assert(NODE_IS_LEAF(node));
+ deleted = rt_node_search_leaf(node, key, RT_ACTION_DELETE, NULL);
+
+ if (!deleted)
+ {
+ /* no key is found in the leaf node */
+ return false;
+ }
+
+ /* Found the key to delete. Update the statistics */
+ tree->num_keys--;
+
+ /*
+ * Return if the leaf node still has keys and we don't need to delete the
+ * node.
+ */
+ if (!NODE_IS_EMPTY(node))
+ return true;
+
+ /* Free the empty leaf node */
+ rt_free_node(tree, node);
+
+ /* Delete the key from the inner nodes, walking back up the stack */
+ while (level >= 0)
+ {
+ node = stack[level--];
+
+ deleted = rt_node_search_inner(node, key, RT_ACTION_DELETE, NULL);
+ Assert(deleted);
+
+ /* If the node didn't become empty, we stop deleting the key */
+ if (!NODE_IS_EMPTY(node))
+ break;
+
+ /* The node became empty */
+ rt_free_node(tree, node);
+ }
+
+ return true;
+}
+
+/* Create and return the iterator for the given radix tree */
+rt_iter *
+rt_begin_iterate(radix_tree *tree)
+{
+ MemoryContext old_ctx;
+ rt_iter *iter;
+ int top_level;
+
+ old_ctx = MemoryContextSwitchTo(tree->context);
+
+ iter = (rt_iter *) palloc0(sizeof(rt_iter));
+ iter->tree = tree;
+
+ /* empty tree */
+ if (!iter->tree->root)
+ return iter;
+
+ top_level = iter->tree->root->shift / RT_NODE_SPAN;
+ iter->stack_len = top_level;
+
+ /*
+ * Descend to the leftmost leaf node from the root. The key is
+ * constructed while descending to the leaf.
+ */
+ rt_update_iter_stack(iter, iter->tree->root, top_level);
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return iter;
+}
+
+/*
+ * Update each node_iter for inner nodes in the iterator node stack.
+ */
+static void
+rt_update_iter_stack(rt_iter *iter, rt_node *from_node, int from)
+{
+ int level = from;
+ rt_node *node = from_node;
+
+ for (;;)
+ {
+ rt_node_iter *node_iter = &(iter->stack[level--]);
+
+ node_iter->node = node;
+ node_iter->current_idx = -1;
+
+ /* We don't advance the leaf node iterator here */
+ if (NODE_IS_LEAF(node))
+ return;
+
+ /* Advance to the next slot in the inner node */
+ node = rt_node_inner_iterate_next(iter, node_iter);
+
+ /* We must find the first child in the node */
+ Assert(node);
+ }
+}
+
+/*
+ * Return true, setting *key_p and *value_p, if there is a next key. Otherwise,
+ * return false.
+ */
+bool
+rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p)
+{
+ /* Empty tree */
+ if (!iter->tree->root)
+ return false;
+
+ for (;;)
+ {
+ rt_node *child = NULL;
+ uint64 value;
+ int level;
+ bool found;
+
+ /* Advance the leaf node iterator to get next key-value pair */
+ found = rt_node_leaf_iterate_next(iter, &(iter->stack[0]), &value);
+
+ if (found)
+ {
+ *key_p = iter->key;
+ *value_p = value;
+ return true;
+ }
+
+ /*
+ * We've visited all values in the leaf node, so advance inner node
+ * iterators from level 1 until we find the next child node.
+ */
+ for (level = 1; level <= iter->stack_len; level++)
+ {
+ child = rt_node_inner_iterate_next(iter, &(iter->stack[level]));
+
+ if (child)
+ break;
+ }
+
+ /* the iteration finished */
+ if (!child)
+ return false;
+
+ /*
+ * Set the node to the node iterator and update the iterator stack
+ * from this node.
+ */
+ rt_update_iter_stack(iter, child, level - 1);
+
+ /* Node iterators are updated, so try again from the leaf */
+ }
+
+ return false;
+}
+
+void
+rt_end_iterate(rt_iter *iter)
+{
+ pfree(iter);
+}
+
+static inline void
+rt_iter_update_key(rt_iter *iter, uint8 chunk, uint8 shift)
+{
+ iter->key &= ~(((uint64) RT_CHUNK_MASK) << shift);
+ iter->key |= (((uint64) chunk) << shift);
+}
+
+/*
+ * Advance the slot in the inner node. Return the child if it exists,
+ * otherwise NULL.
+ */
+static inline rt_node *
+rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
+{
+ rt_node *child = NULL;
+ bool found = false;
+ uint8 key_chunk;
+
+ switch (node_iter->node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n4->base.n.count)
+ break;
+
+ child = n4->children[node_iter->current_idx];
+ key_chunk = n4->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n32->base.n.count)
+ break;
+
+ child = n32->children[node_iter->current_idx];
+ key_chunk = n32->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ rt_node_inner_125 *n125 = (rt_node_inner_125 *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (node_125_is_chunk_used((rt_node_base_125 *) n125, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+ child = node_inner_125_get_child(n125, i);
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (node_inner_256_is_chunk_used(n256, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+ child = node_inner_256_get_child(n256, i);
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ }
+
+ if (found)
+ rt_iter_update_key(iter, key_chunk, node_iter->node->shift);
+
+ return child;
+}
+
+/*
+ * Advance the slot in the leaf node. On success, return true and the value
+ * is set to value_p, otherwise return false.
+ */
+static inline bool
+rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
+ uint64 *value_p)
+{
+ rt_node *node = node_iter->node;
+ bool found = false;
+ uint64 value;
+ uint8 key_chunk;
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n4->base.n.count)
+ break;
+
+ value = n4->values[node_iter->current_idx];
+ key_chunk = n4->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n32->base.n.count)
+ break;
+
+ value = n32->values[node_iter->current_idx];
+ key_chunk = n32->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (node_125_is_chunk_used((rt_node_base_125 *) n125, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+ value = node_leaf_125_get_value(n125, i);
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (node_leaf_256_is_chunk_used(n256, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+ value = node_leaf_256_get_value(n256, i);
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ }
+
+ if (found)
+ {
+ rt_iter_update_key(iter, key_chunk, node_iter->node->shift);
+ *value_p = value;
+ }
+
+ return found;
+}
+
+/*
+ * Return the number of keys in the radix tree.
+ */
+uint64
+rt_num_entries(radix_tree *tree)
+{
+ return tree->num_keys;
+}
+
+/*
+ * Return the amount of memory used by the radix tree.
+ */
+uint64
+rt_memory_usage(radix_tree *tree)
+{
+ Size total = sizeof(radix_tree);
+
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
+ total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
+ }
+
+ return total;
+}
+
+/*
+ * Verify the radix tree node.
+ */
+static void
+rt_verify_node(rt_node *node)
+{
+#ifdef USE_ASSERT_CHECKING
+ Assert(node->count >= 0);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_base_4 *n4 = (rt_node_base_4 *) node;
+
+ for (int i = 1; i < n4->n.count; i++)
+ Assert(n4->chunks[i - 1] < n4->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_base_32 *n32 = (rt_node_base_32 *) node;
+
+ for (int i = 1; i < n32->n.count; i++)
+ Assert(n32->chunks[i - 1] < n32->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ rt_node_base_125 *n125 = (rt_node_base_125 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!node_125_is_chunk_used(n125, i))
+ continue;
+
+ /* Check if the corresponding slot is used */
+ if (NODE_IS_LEAF(node))
+ Assert(node_leaf_125_is_slot_used((rt_node_leaf_125 *) node,
+ n125->slot_idxs[i]));
+ else
+ Assert(node_inner_125_is_slot_used((rt_node_inner_125 *) node,
+ n125->slot_idxs[i]));
+
+ cnt++;
+ }
+
+ Assert(n125->n.count == cnt);
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RT_NODE_NSLOTS_BITS(RT_NODE_MAX_SLOTS); i++)
+ cnt += pg_popcount32(n256->isset[i]);
+
+ /* Check that the number of used chunks matches */
+ Assert(n256->base.n.count == cnt);
+
+ break;
+ }
+ }
+ }
+#endif
+}
+
+/***************** DEBUG FUNCTIONS *****************/
+#ifdef RT_DEBUG
+void
+rt_stats(radix_tree *tree)
+{
+ ereport(NOTICE, (errmsg("num_keys = " UINT64_FORMAT ", height = %d, n4 = %u, n15 = %u, n32 = %u, n125 = %u, n256 = %u",
+ tree->num_keys,
+ tree->root->shift / RT_NODE_SPAN,
+ tree->cnt[RT_CLASS_4_FULL],
+ tree->cnt[RT_CLASS_32_PARTIAL],
+ tree->cnt[RT_CLASS_32_FULL],
+ tree->cnt[RT_CLASS_125_FULL],
+ tree->cnt[RT_CLASS_256])));
+}
+
+static void
+rt_dump_node(rt_node *node, int level, bool recurse)
+{
+ char space[125] = {0};
+
+ fprintf(stderr, "[%s] kind %d, fanout %d, count %u, shift %u, chunk 0x%X:\n",
+ NODE_IS_LEAF(node) ? "LEAF" : "INNR",
+ (node->kind == RT_NODE_KIND_4) ? 4 :
+ (node->kind == RT_NODE_KIND_32) ? 32 :
+ (node->kind == RT_NODE_KIND_125) ? 125 : 256,
+ node->fanout == 0 ? 256 : node->fanout,
+ node->count, node->shift, node->chunk);
+
+ if (level > 0)
+ sprintf(space, "%*c", level * 4, ' ');
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, n4->base.chunks[i], n4->values[i]);
+ }
+ else
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, n4->base.chunks[i]);
+
+ if (recurse)
+ rt_dump_node(n4->children[i], level + 1, recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, n32->base.chunks[i], n32->values[i]);
+ }
+ else
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, n32->base.chunks[i]);
+
+ if (recurse)
+ {
+ rt_dump_node(n32->children[i], level + 1, recurse);
+ }
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ rt_node_base_125 *b125 = (rt_node_base_125 *) node;
+
+ fprintf(stderr, "slot_idxs ");
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!node_125_is_chunk_used(b125, i))
+ continue;
+
+ fprintf(stderr, " [%d]=%d, ", i, b125->slot_idxs[i]);
+ }
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_125 *n = (rt_node_leaf_125 *) node;
+
+ fprintf(stderr, ", isset-bitmap:");
+ for (int i = 0; i < 16; i++)
+ {
+ fprintf(stderr, "%X ", (uint8) n->isset[i]);
+ }
+ fprintf(stderr, "\n");
+ }
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!node_125_is_chunk_used(b125, i))
+ continue;
+
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) b125;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, i, node_leaf_125_get_value(n125, i));
+ }
+ else
+ {
+ rt_node_inner_125 *n125 = (rt_node_inner_125 *) b125;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, i);
+
+ if (recurse)
+ rt_dump_node(node_inner_125_get_child(n125, i),
+ level + 1, recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+
+ if (!node_leaf_256_is_chunk_used(n256, i))
+ continue;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, i, node_leaf_256_get_value(n256, i));
+ }
+ else
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+
+ if (!node_inner_256_is_chunk_used(n256, i))
+ continue;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, i);
+
+ if (recurse)
+ rt_dump_node(node_inner_256_get_child(n256, i), level + 1,
+ recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ }
+}
+
+void
+rt_dump_search(radix_tree *tree, uint64 key)
+{
+ rt_node *node;
+ int shift;
+ int level = 0;
+
+ elog(NOTICE, "-----------------------------------------------------------");
+ elog(NOTICE, "max_val = " UINT64_FORMAT "(0x" UINT64_FORMAT_HEX ")",
+ tree->max_val, tree->max_val);
+
+ if (!tree->root)
+ {
+ elog(NOTICE, "tree is empty");
+ return;
+ }
+
+ if (key > tree->max_val)
+ {
+ elog(NOTICE, "key " UINT64_FORMAT "(0x" UINT64_FORMAT_HEX ") is larger than max val",
+ key, key);
+ return;
+ }
+
+ node = tree->root;
+ shift = tree->root->shift;
+ while (shift >= 0)
+ {
+ rt_node *child;
+
+ rt_dump_node(node, level, false);
+
+ if (NODE_IS_LEAF(node))
+ {
+ uint64 dummy;
+
+ /* We reached a leaf node; find the corresponding slot */
+ rt_node_search_leaf(node, key, RT_ACTION_FIND, &dummy);
+
+ break;
+ }
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ break;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ level++;
+ }
+}
+
+void
+rt_dump(radix_tree *tree)
+{
+
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ fprintf(stderr, "%s\tinner_size %zu\tinner_blocksize %zu\tleaf_size %zu\tleaf_blocksize %zu\n",
+ rt_size_class_info[i].name,
+ rt_size_class_info[i].inner_size,
+ rt_size_class_info[i].inner_blocksize,
+ rt_size_class_info[i].leaf_size,
+ rt_size_class_info[i].leaf_blocksize);
+ fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->max_val);
+
+ if (!tree->root)
+ {
+ fprintf(stderr, "empty tree\n");
+ return;
+ }
+
+ rt_dump_node(tree->root, 0, true);
+}
+#endif
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
new file mode 100644
index 0000000000..d5d7668617
--- /dev/null
+++ b/src/include/lib/radixtree.h
@@ -0,0 +1,42 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.h
+ * Interface for radix tree.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/lib/radixtree.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef RADIXTREE_H
+#define RADIXTREE_H
+
+#include "postgres.h"
+
+#define RT_DEBUG 1
+
+typedef struct radix_tree radix_tree;
+typedef struct rt_iter rt_iter;
+
+extern radix_tree *rt_create(MemoryContext ctx);
+extern void rt_free(radix_tree *tree);
+extern bool rt_search(radix_tree *tree, uint64 key, uint64 *val_p);
+extern bool rt_set(radix_tree *tree, uint64 key, uint64 val);
+extern rt_iter *rt_begin_iterate(radix_tree *tree);
+
+extern bool rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p);
+extern void rt_end_iterate(rt_iter *iter);
+extern bool rt_delete(radix_tree *tree, uint64 key);
+
+extern uint64 rt_memory_usage(radix_tree *tree);
+extern uint64 rt_num_entries(radix_tree *tree);
+
+#ifdef RT_DEBUG
+extern void rt_dump(radix_tree *tree);
+extern void rt_dump_search(radix_tree *tree, uint64 key);
+extern void rt_stats(radix_tree *tree);
+#endif
+
+#endif /* RADIXTREE_H */
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 96addded81..11d0ec5b07 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -27,6 +27,7 @@ SUBDIRS = \
test_parser \
test_pg_dump \
test_predtest \
+ test_radixtree \
test_rbtree \
test_regex \
test_rls_hooks \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 1d26544854..568823b221 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -21,6 +21,7 @@ subdir('test_oat_hooks')
subdir('test_parser')
subdir('test_pg_dump')
subdir('test_predtest')
+subdir('test_radixtree')
subdir('test_rbtree')
subdir('test_regex')
subdir('test_rls_hooks')
diff --git a/src/test/modules/test_radixtree/.gitignore b/src/test/modules/test_radixtree/.gitignore
new file mode 100644
index 0000000000..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/test_radixtree/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_radixtree/Makefile b/src/test/modules/test_radixtree/Makefile
new file mode 100644
index 0000000000..da06b93da3
--- /dev/null
+++ b/src/test/modules/test_radixtree/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_radixtree/Makefile
+
+MODULE_big = test_radixtree
+OBJS = \
+ $(WIN32RES) \
+ test_radixtree.o
+PGFILEDESC = "test_radixtree - test code for src/backend/lib/radixtree.c"
+
+EXTENSION = test_radixtree
+DATA = test_radixtree--1.0.sql
+
+REGRESS = test_radixtree
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_radixtree
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_radixtree/README b/src/test/modules/test_radixtree/README
new file mode 100644
index 0000000000..a8b271869a
--- /dev/null
+++ b/src/test/modules/test_radixtree/README
@@ -0,0 +1,7 @@
+test_radixtree contains unit tests for testing the radix tree implementation
+in src/backend/lib/radixtree.c.
+
+The tests verify the correctness of the implementation, but they can also be
+used as a micro-benchmark. If you set the 'rt_test_stats' flag in
+test_radixtree.c, the tests will print extra information about execution time
+and memory usage.
diff --git a/src/test/modules/test_radixtree/expected/test_radixtree.out b/src/test/modules/test_radixtree/expected/test_radixtree.out
new file mode 100644
index 0000000000..ce645cb8b5
--- /dev/null
+++ b/src/test/modules/test_radixtree/expected/test_radixtree.out
@@ -0,0 +1,36 @@
+CREATE EXTENSION test_radixtree;
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
+NOTICE: testing basic operations with leaf node 4
+NOTICE: testing basic operations with inner node 4
+NOTICE: testing basic operations with leaf node 32
+NOTICE: testing basic operations with inner node 32
+NOTICE: testing basic operations with leaf node 125
+NOTICE: testing basic operations with inner node 125
+NOTICE: testing basic operations with leaf node 256
+NOTICE: testing basic operations with inner node 256
+NOTICE: testing radix tree node types with shift "0"
+NOTICE: testing radix tree node types with shift "8"
+NOTICE: testing radix tree node types with shift "16"
+NOTICE: testing radix tree node types with shift "24"
+NOTICE: testing radix tree node types with shift "32"
+NOTICE: testing radix tree node types with shift "40"
+NOTICE: testing radix tree node types with shift "48"
+NOTICE: testing radix tree node types with shift "56"
+NOTICE: testing radix tree with pattern "all ones"
+NOTICE: testing radix tree with pattern "alternating bits"
+NOTICE: testing radix tree with pattern "clusters of ten"
+NOTICE: testing radix tree with pattern "clusters of hundred"
+NOTICE: testing radix tree with pattern "one-every-64k"
+NOTICE: testing radix tree with pattern "sparse"
+NOTICE: testing radix tree with pattern "single values, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^60"
+ test_radixtree
+----------------
+
+(1 row)
+
diff --git a/src/test/modules/test_radixtree/meson.build b/src/test/modules/test_radixtree/meson.build
new file mode 100644
index 0000000000..f96bf159d6
--- /dev/null
+++ b/src/test/modules/test_radixtree/meson.build
@@ -0,0 +1,34 @@
+# FIXME: prevent install during main install, but not during test :/
+
+test_radixtree_sources = files(
+ 'test_radixtree.c',
+)
+
+if host_system == 'windows'
+ test_radixtree_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'test_radixtree',
+ '--FILEDESC', 'test_radixtree - test code for src/backend/lib/radixtree.c',])
+endif
+
+test_radixtree = shared_module('test_radixtree',
+ test_radixtree_sources,
+ kwargs: pg_mod_args,
+)
+testprep_targets += test_radixtree
+
+install_data(
+ 'test_radixtree.control',
+ 'test_radixtree--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'test_radixtree',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'test_radixtree',
+ ],
+ },
+}
diff --git a/src/test/modules/test_radixtree/sql/test_radixtree.sql b/src/test/modules/test_radixtree/sql/test_radixtree.sql
new file mode 100644
index 0000000000..41ece5e9f5
--- /dev/null
+++ b/src/test/modules/test_radixtree/sql/test_radixtree.sql
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_radixtree;
+
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
diff --git a/src/test/modules/test_radixtree/test_radixtree--1.0.sql b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
new file mode 100644
index 0000000000..074a5a7ea7
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_radixtree/test_radixtree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_radixtree" to load this file. \quit
+
+CREATE FUNCTION test_radixtree()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
new file mode 100644
index 0000000000..ea993e63df
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -0,0 +1,581 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_radixtree.c
+ * Test the radix tree data structure.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/test/modules/test_radixtree/test_radixtree.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "lib/radixtree.h"
+#include "miscadmin.h"
+#include "nodes/bitmapset.h"
+#include "storage/block.h"
+#include "storage/itemptr.h"
+#include "utils/memutils.h"
+#include "utils/timestamp.h"
+
+#define UINT64_HEX_FORMAT "%" INT64_MODIFIER "X"
+
+/*
+ * If you enable this, the "pattern" tests will print information about
+ * how long populating, probing, and iterating the test set takes, and
+ * how much memory the test set consumed. That can be used as
+ * micro-benchmark of various operations and input patterns (you might
+ * want to increase the number of values used in each of the test, if
+ * you do that, to reduce noise).
+ *
+ * The information is printed to the server's stderr, mostly because
+ * that's where MemoryContextStats() output goes.
+ */
+static const bool rt_test_stats = false;
+
+static int rt_node_kind_fanouts[] = {
+ 0,
+ 4, /* RT_NODE_KIND_4 */
+ 32, /* RT_NODE_KIND_32 */
+ 125, /* RT_NODE_KIND_125 */
+ 256 /* RT_NODE_KIND_256 */
+};
+/*
+ * A struct to define a pattern of integers, for use with the test_pattern()
+ * function.
+ */
+typedef struct
+{
+ char *test_name; /* short name of the test, for humans */
+ char *pattern_str; /* a bit pattern */
+ uint64 spacing; /* pattern repeats at this interval */
+ uint64 num_values; /* number of integers to set in total */
+} test_spec;
+
+/* Test patterns borrowed from test_integerset.c */
+static const test_spec test_specs[] = {
+ {
+ "all ones", "1111111111",
+ 10, 1000000
+ },
+ {
+ "alternating bits", "0101010101",
+ 10, 1000000
+ },
+ {
+ "clusters of ten", "1111111111",
+ 10000, 1000000
+ },
+ {
+ "clusters of hundred",
+ "1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111",
+ 10000, 1000000
+ },
+ {
+ "one-every-64k", "1",
+ 65536, 1000000
+ },
+ {
+ "sparse", "100000000000000000000000000000001",
+ 10000000, 1000000
+ },
+ {
+ "single values, distance > 2^32", "1",
+ UINT64CONST(10000000000), 100000
+ },
+ {
+ "clusters, distance > 2^32", "10101010",
+ UINT64CONST(10000000000), 1000000
+ },
+ {
+ "clusters, distance > 2^60", "10101010",
+ UINT64CONST(2000000000000000000),
+ 23 /* can't be much higher than this, or we
+ * overflow uint64 */
+ }
+};
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(test_radixtree);
+
+static void
+test_empty(void)
+{
+ radix_tree *radixtree;
+ rt_iter *iter;
+ uint64 dummy;
+ uint64 key;
+ uint64 val;
+
+ radixtree = rt_create(CurrentMemoryContext);
+
+ if (rt_search(radixtree, 0, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, 1, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, PG_UINT64_MAX, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_delete(radixtree, 0))
+ elog(ERROR, "rt_delete on empty tree returned true");
+
+ if (rt_num_entries(radixtree) != 0)
+ elog(ERROR, "rt_num_entries on empty tree return non-zero");
+
+ iter = rt_begin_iterate(radixtree);
+
+ if (rt_iterate_next(iter, &key, &val))
+ elog(ERROR, "rt_itereate_next on empty tree returned true");
+
+ rt_end_iterate(iter);
+
+ rt_free(radixtree);
+}
+
+static void
+test_basic(int children, bool test_inner)
+{
+ radix_tree *radixtree;
+ uint64 *keys;
+ int shift = test_inner ? 8 : 0;
+
+ elog(NOTICE, "testing basic operations with %s node %d",
+ test_inner ? "inner" : "leaf", children);
+
+ radixtree = rt_create(CurrentMemoryContext);
+
+ /* prepare keys in interleaved order like 1, 32, 2, 31, 3, ... */
+ keys = palloc(sizeof(uint64) * children);
+ for (int i = 0; i < children; i++)
+ {
+ if (i % 2 == 0)
+ keys[i] = (uint64) ((i / 2) + 1) << shift;
+ else
+ keys[i] = (uint64) (children - (i / 2)) << shift;
+ }
+
+ /* insert keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (rt_set(radixtree, keys[i], keys[i]))
+ elog(ERROR, "new inserted key 0x" UINT64_HEX_FORMAT " is found ", keys[i]);
+ }
+
+ /* update keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (!rt_set(radixtree, keys[i], keys[i] + 1))
+ elog(ERROR, "could not update key 0x" UINT64_HEX_FORMAT, keys[i]);
+ }
+
+ /* repeat deleting and inserting keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (!rt_delete(radixtree, keys[i]))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, keys[i]);
+ if (rt_set(radixtree, keys[i], keys[i]))
+ elog(ERROR, "new inserted key 0x" UINT64_HEX_FORMAT " is found ", keys[i]);
+ }
+
+ pfree(keys);
+ rt_free(radixtree);
+}
+
+/*
+ * Check if keys from start to end with the shift exist in the tree.
+ */
+static void
+check_search_on_node(radix_tree *radixtree, uint8 shift, int start, int end,
+ int incr)
+{
+ for (int i = start; i < end; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ uint64 val;
+
+ if (!rt_search(radixtree, key, &val))
+ elog(ERROR, "key 0x" UINT64_HEX_FORMAT " is not found on node-%d",
+ key, end);
+ if (val != key)
+ elog(ERROR, "rt_search with key 0x" UINT64_HEX_FORMAT " returns 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ key, val, key);
+ }
+}
+
+static void
+test_node_types_insert(radix_tree *radixtree, uint8 shift, bool insert_asc)
+{
+ uint64 num_entries;
+ int ninserted = 0;
+ int start = insert_asc ? 0 : 256;
+ int incr = insert_asc ? 1 : -1;
+ int end = insert_asc ? 256 : 0;
+ int node_kind_idx = 1;
+
+ for (int i = start; i != end; i += incr)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_set(radixtree, key, key);
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " is found", key);
+
+ /*
+ * After filling all slots in each node type, check if the values
+ * are stored properly.
+ */
+ if (ninserted == rt_node_kind_fanouts[node_kind_idx] - 1)
+ {
+ int check_start = insert_asc
+ ? rt_node_kind_fanouts[node_kind_idx - 1]
+ : rt_node_kind_fanouts[node_kind_idx];
+ int check_end = insert_asc
+ ? rt_node_kind_fanouts[node_kind_idx]
+ : rt_node_kind_fanouts[node_kind_idx - 1];
+
+ check_search_on_node(radixtree, shift, check_start, check_end, incr);
+ node_kind_idx++;
+ }
+
+ ninserted++;
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ if (num_entries != 256)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(256));
+}
+
+static void
+test_node_types_delete(radix_tree *radixtree, uint8 shift)
+{
+ uint64 num_entries;
+
+ for (int i = 0; i < 256; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_delete(radixtree, key);
+
+ if (!found)
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, key);
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ /* The tree must be empty */
+ if (num_entries != 0)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(0));
+}
+
+/*
+ * Test inserting and deleting key-value pairs for each node type at the given
+ * shift level.
+ */
+static void
+test_node_types(uint8 shift)
+{
+ radix_tree *radixtree;
+
+ elog(NOTICE, "testing radix tree node types with shift \"%d\"", shift);
+
+ radixtree = rt_create(CurrentMemoryContext);
+
+ /*
+ * Insert and search entries for every node type at the 'shift' level,
+ * then delete all entries to make it empty, and insert and search entries
+ * again.
+ */
+ test_node_types_insert(radixtree, shift, true);
+ test_node_types_delete(radixtree, shift);
+ test_node_types_insert(radixtree, shift, false);
+
+ rt_free(radixtree);
+}
+
+/*
+ * Test with a repeating pattern, defined by the 'spec'.
+ */
+static void
+test_pattern(const test_spec * spec)
+{
+ radix_tree *radixtree;
+ rt_iter *iter;
+ MemoryContext radixtree_ctx;
+ TimestampTz starttime;
+ TimestampTz endtime;
+ uint64 n;
+ uint64 last_int;
+ uint64 ndeleted;
+ uint64 nbefore;
+ uint64 nafter;
+ int patternlen;
+ uint64 *pattern_values;
+ uint64 pattern_num_values;
+
+ elog(NOTICE, "testing radix tree with pattern \"%s\"", spec->test_name);
+ if (rt_test_stats)
+ fprintf(stderr, "-----\ntesting radix tree with pattern \"%s\"\n", spec->test_name);
+
+ /* Pre-process the pattern, creating an array of integers from it. */
+ patternlen = strlen(spec->pattern_str);
+ pattern_values = palloc(patternlen * sizeof(uint64));
+ pattern_num_values = 0;
+ for (int i = 0; i < patternlen; i++)
+ {
+ if (spec->pattern_str[i] == '1')
+ pattern_values[pattern_num_values++] = i;
+ }
+
+ /*
+ * Allocate the radix tree.
+ *
+ * Allocate it in a separate memory context, so that we can print its
+ * memory usage easily.
+ */
+ radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
+ "radixtree test",
+ ALLOCSET_SMALL_SIZES);
+ MemoryContextSetIdentifier(radixtree_ctx, spec->test_name);
+ radixtree = rt_create(radixtree_ctx);
+
+ /*
+ * Add values to the set.
+ */
+ starttime = GetCurrentTimestamp();
+
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ uint64 x = 0;
+
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ bool found;
+
+ x = last_int + pattern_values[i];
+
+ found = rt_set(radixtree, x, x);
+
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " found", x);
+
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+
+ endtime = GetCurrentTimestamp();
+
+ if (rt_test_stats)
+ fprintf(stderr, "added " UINT64_FORMAT " values in %d ms\n",
+ spec->num_values, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Print stats on the amount of memory used.
+ *
+ * We print the usage reported by rt_memory_usage(), as well as the stats
+ * from the memory context. They should be in the same ballpark, but it's
+ * hard to automate testing that, so if you're making changes to the
+ * implementation, just observe that manually.
+ */
+ if (rt_test_stats)
+ {
+ uint64 mem_usage;
+
+ /*
+ * Also print memory usage as reported by rt_memory_usage(). It
+ * should be in the same ballpark as the usage reported by
+ * MemoryContextStats().
+ */
+ mem_usage = rt_memory_usage(radixtree);
+ fprintf(stderr, "rt_memory_usage() reported " UINT64_FORMAT " (%0.2f bytes / integer)\n",
+ mem_usage, (double) mem_usage / spec->num_values);
+
+ MemoryContextStats(radixtree_ctx);
+ }
+
+ /* Check that rt_num_entries works */
+ n = rt_num_entries(radixtree);
+ if (n != spec->num_values)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT, n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_search()
+ */
+ starttime = GetCurrentTimestamp();
+
+ for (n = 0; n < 100000; n++)
+ {
+ bool found;
+ bool expected;
+ uint64 x;
+ uint64 v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Do we expect this value to be present in the set? */
+ if (x >= last_int)
+ expected = false;
+ else
+ {
+ uint64 idx = x % spec->spacing;
+
+ if (idx >= patternlen)
+ expected = false;
+ else if (spec->pattern_str[idx] == '1')
+ expected = true;
+ else
+ expected = false;
+ }
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (found != expected)
+ elog(ERROR, "mismatch at 0x" UINT64_HEX_FORMAT ": %d vs %d", x, found, expected);
+ if (found && (v != x))
+ elog(ERROR, "found 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ v, x);
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "probed " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Test iterator
+ */
+ starttime = GetCurrentTimestamp();
+
+ iter = rt_begin_iterate(radixtree);
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ uint64 expected = last_int + pattern_values[i];
+ uint64 x;
+ uint64 val;
+
+ if (!rt_iterate_next(iter, &x, &val))
+ break;
+
+ if (x != expected)
+ elog(ERROR,
+ "iterate returned wrong key; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d",
+ x, expected, i);
+ if (val != expected)
+ elog(ERROR,
+ "iterate returned wrong value; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d", x, expected, i);
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "iterated " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ rt_end_iterate(iter);
+
+ if (n < spec->num_values)
+ elog(ERROR, "iterator stopped short after " UINT64_FORMAT " entries, expected " UINT64_FORMAT, n, spec->num_values);
+ if (n > spec->num_values)
+ elog(ERROR, "iterator returned " UINT64_FORMAT " entries, " UINT64_FORMAT " was expected", n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_delete()
+ */
+ starttime = GetCurrentTimestamp();
+
+ nbefore = rt_num_entries(radixtree);
+ ndeleted = 0;
+ for (n = 0; n < 1; n++)
+ {
+ bool found;
+ uint64 x;
+ uint64 v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (!found)
+ continue;
+
+ /* If the key is found, delete it and check again */
+ if (!rt_delete(radixtree, x))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_search(radixtree, x, &v))
+ elog(ERROR, "found deleted key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_delete(radixtree, x))
+ elog(ERROR, "deleted already-deleted key 0x" UINT64_HEX_FORMAT, x);
+
+ ndeleted++;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "deleted " UINT64_FORMAT " values in %d ms\n",
+ ndeleted, (int) (endtime - starttime) / 1000);
+
+ nafter = rt_num_entries(radixtree);
+
+ /* Check that rt_num_entries works */
+ if ((nbefore - ndeleted) != nafter)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT "after " UINT64_FORMAT " deletion",
+ nafter, (nbefore - ndeleted), ndeleted);
+
+ MemoryContextDelete(radixtree_ctx);
+}
+
+Datum
+test_radixtree(PG_FUNCTION_ARGS)
+{
+ test_empty();
+
+ for (int i = 1; i < lengthof(rt_node_kind_fanouts); i++)
+ {
+ test_basic(rt_node_kind_fanouts[i], false);
+ test_basic(rt_node_kind_fanouts[i], true);
+ }
+
+ for (int shift = 0; shift <= (64 - 8); shift += 8)
+ test_node_types(shift);
+
+ /* Test different test patterns, with lots of entries */
+ for (int i = 0; i < lengthof(test_specs); i++)
+ test_pattern(&test_specs[i]);
+
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_radixtree/test_radixtree.control b/src/test/modules/test_radixtree/test_radixtree.control
new file mode 100644
index 0000000000..e53f2a3e0c
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.control
@@ -0,0 +1,4 @@
+comment = 'Test code for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/test_radixtree'
+relocatable = true
--
2.31.1
Attachment: v12-0001-introduce-vector8_min-and-vector8_highbit_mask.patch
From 4468f93f23b2900392b1510b8e572ca6e14a9dbd Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 24 Oct 2022 14:07:09 +0900
Subject: [PATCH v12 1/7] introduce vector8_min and vector8_highbit_mask
---
src/include/port/simd.h | 47 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 47 insertions(+)
diff --git a/src/include/port/simd.h b/src/include/port/simd.h
index 61ae4ecf60..0b288c422a 100644
--- a/src/include/port/simd.h
+++ b/src/include/port/simd.h
@@ -77,6 +77,7 @@ static inline bool vector8_has(const Vector8 v, const uint8 c);
static inline bool vector8_has_zero(const Vector8 v);
static inline bool vector8_has_le(const Vector8 v, const uint8 c);
static inline bool vector8_is_highbit_set(const Vector8 v);
+static inline uint32 vector8_highbit_mask(const Vector8 v);
#ifndef USE_NO_SIMD
static inline bool vector32_is_highbit_set(const Vector32 v);
#endif
@@ -96,6 +97,7 @@ static inline Vector8 vector8_ssub(const Vector8 v1, const Vector8 v2);
*/
#ifndef USE_NO_SIMD
static inline Vector8 vector8_eq(const Vector8 v1, const Vector8 v2);
+static inline Vector8 vector8_min(const Vector8 v1, const Vector8 v2);
static inline Vector32 vector32_eq(const Vector32 v1, const Vector32 v2);
#endif
@@ -277,6 +279,36 @@ vector8_is_highbit_set(const Vector8 v)
#endif
}
+/*
+ * Return the bitmask of the high bit of each element.
+ */
+static inline uint32
+vector8_highbit_mask(const Vector8 v)
+{
+#ifdef USE_SSE2
+ return (uint32) _mm_movemask_epi8(v);
+#elif defined(USE_NEON)
+ static const uint8 mask[16] = {
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ };
+
+ uint8x16_t masked = vandq_u8(vld1q_u8(mask), (uint8x16_t) vshrq_n_s8(v, 7));
+ uint8x16_t maskedhi = vextq_u8(masked, masked, 8);
+
+ return (uint32) vaddvq_u16((uint16x8_t) vzip1q_u8(masked, maskedhi));
+#else
+ uint32 mask = 0;
+
+ for (Size i = 0; i < sizeof(Vector8); i++)
+ mask |= (((const uint8 *) &v)[i] >> 7) << i;
+
+ return mask;
+#endif
+}
+
/*
* Exactly like vector8_is_highbit_set except for the input type, so it
* looks at each byte separately.
@@ -372,4 +404,19 @@ vector32_eq(const Vector32 v1, const Vector32 v2)
}
#endif /* ! USE_NO_SIMD */
+/*
+ * Compare the given vectors and return the vector of minimum elements.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_min(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+ return _mm_min_epu8(v1, v2);
+#elif defined(USE_NEON)
+ return vminq_u8(v1, v2);
+#endif
+}
+#endif /* ! USE_NO_SIMD */
+
#endif /* SIMD_H */
--
2.31.1
On Fri, Dec 2, 2022 at 11:42 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Mon, Nov 14, 2022 at 7:59 PM John Naylor <john.naylor@enterprisedb.com> wrote:
> > - Optimize node128 insert.
> >
> > I've attached a rough start at this. The basic idea is borrowed from
> > our bitmapset nodes, so we can iterate over and operate on word-sized
> > (32- or 64-bit) types at a time, rather than bytes.
>
> Thanks! I think this is a good idea.
>
> > To make this easier, I've moved some of the lower-level macros and
> > types from bitmapset.h/.c to pg_bitutils.h. That's probably going to
> > need a separate email thread to resolve the coding style clash this
> > causes, so that can be put off for later.
I started a separate thread [1], and 0002 comes from feedback on that.
There is a FIXME about using WORDNUM and BITNUM, at least with that
spelling. I'm putting that off to ease rebasing the rest as v13 -- getting
some CI testing with 0002 seems like a good idea. There are no other
changes yet. Next, I will take a look at templating local vs. shared
memory. I might try basing that on the styles of both v12 and v8, and see
which one works best with templating.
[1]: /messages/by-id/CAFBsxsFW2JjTo58jtDB+3sZhxMx3t-3evew8=Acr+GGhC+kFaA@mail.gmail.com
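To make the word-at-a-time idea concrete, here is a minimal sketch of how a
free slot in a node-125 "isset" bitmap can be located one bitmapword at a time
rather than byte by byte. This is only an illustration, not code from the
attached patches: node_125_find_free_slot is a made-up name, and it assumes the
WORDNUM/BITNUM macros and bmw_rightmost_one_pos() that appear in the patches
below, with the caller having already checked that the node has a free slot.

static int
node_125_find_free_slot(bitmapword *isset, int nwords)
{
    for (int i = 0; i < nwords; i++)
    {
        bitmapword  inverse = ~isset[i];

        /* any zero bit in this word is a free slot */
        if (inverse != 0)
        {
            int     slot = i * BITS_PER_BITMAPWORD + bmw_rightmost_one_pos(inverse);

            /* mark the slot as used and hand it back */
            isset[WORDNUM(slot)] |= ((bitmapword) 1) << BITNUM(slot);
            return slot;
        }
    }

    /* should not happen if the caller checked for a free slot */
    return -1;
}

With 128 isset bits and 64-bit bitmapwords, this inspects at most two words
instead of scanning up to 128 slots individually.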
--
John Naylor
EDB: http://www.enterprisedb.com
Attachments:
Attachment: v13-0002-Move-some-bitmap-logic-out-of-bitmapset.c.patch
From 1dc766a6a33ba379c27c15677b7ec2c02384ba8e Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Tue, 6 Dec 2022 13:39:41 +0700
Subject: [PATCH v13 2/8] Move some bitmap logic out of bitmapset.c
Add pg_rightmost_one32/64 functions. This functionality was previously
private to bitmapset.c as the RIGHTMOST_ONE macro. It has practical
use in other contexts, so move to pg_bitutils.h.
Also move the logic for selecting appropriate pg_bitutils functions
based on word size to bitmapset.h for wider visibility and add
appropriate selection for bmw_rightmost_one().
Since the previous macro relied on casting to signedbitmapword,
and the new functions do not, remove that typedef.
Design input and review by Tom Lane
Discussion: https://www.postgresql.org/message-id/CAFBsxsFW2JjTo58jtDB%2B3sZhxMx3t-3evew8%3DAcr%2BGGhC%2BkFaA%40mail.gmail.com
---
src/backend/nodes/bitmapset.c | 36 ++------------------------------
src/include/nodes/bitmapset.h | 16 ++++++++++++--
src/include/port/pg_bitutils.h | 31 +++++++++++++++++++++++++++
src/tools/pgindent/typedefs.list | 1 -
4 files changed, 47 insertions(+), 37 deletions(-)
diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index b7b274aeff..4384ff591d 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -32,39 +32,7 @@
#define BITMAPSET_SIZE(nwords) \
(offsetof(Bitmapset, words) + (nwords) * sizeof(bitmapword))
-/*----------
- * This is a well-known cute trick for isolating the rightmost one-bit
- * in a word. It assumes two's complement arithmetic. Consider any
- * nonzero value, and focus attention on the rightmost one. The value is
- * then something like
- * xxxxxx10000
- * where x's are unspecified bits. The two's complement negative is formed
- * by inverting all the bits and adding one. Inversion gives
- * yyyyyy01111
- * where each y is the inverse of the corresponding x. Incrementing gives
- * yyyyyy10000
- * and then ANDing with the original value gives
- * 00000010000
- * This works for all cases except original value = zero, where of course
- * we get zero.
- *----------
- */
-#define RIGHTMOST_ONE(x) ((signedbitmapword) (x) & -((signedbitmapword) (x)))
-
-#define HAS_MULTIPLE_ONES(x) ((bitmapword) RIGHTMOST_ONE(x) != (x))
-
-/* Select appropriate bit-twiddling functions for bitmap word size */
-#if BITS_PER_BITMAPWORD == 32
-#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos32(w)
-#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos32(w)
-#define bmw_popcount(w) pg_popcount32(w)
-#elif BITS_PER_BITMAPWORD == 64
-#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos64(w)
-#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos64(w)
-#define bmw_popcount(w) pg_popcount64(w)
-#else
-#error "invalid BITS_PER_BITMAPWORD"
-#endif
+#define HAS_MULTIPLE_ONES(x) (bmw_rightmost_one(x) != (x))
/*
@@ -1013,7 +981,7 @@ bms_first_member(Bitmapset *a)
{
int result;
- w = RIGHTMOST_ONE(w);
+ w = bmw_rightmost_one(w);
a->words[wordnum] &= ~w;
result = wordnum * BITS_PER_BITMAPWORD;
diff --git a/src/include/nodes/bitmapset.h b/src/include/nodes/bitmapset.h
index 2792281658..fdc504596b 100644
--- a/src/include/nodes/bitmapset.h
+++ b/src/include/nodes/bitmapset.h
@@ -38,13 +38,11 @@ struct List;
#define BITS_PER_BITMAPWORD 64
typedef uint64 bitmapword; /* must be an unsigned type */
-typedef int64 signedbitmapword; /* must be the matching signed type */
#else
#define BITS_PER_BITMAPWORD 32
typedef uint32 bitmapword; /* must be an unsigned type */
-typedef int32 signedbitmapword; /* must be the matching signed type */
#endif
@@ -75,6 +73,20 @@ typedef enum
BMS_MULTIPLE /* >1 member */
} BMS_Membership;
+/* Select appropriate bit-twiddling functions for bitmap word size */
+#if BITS_PER_BITMAPWORD == 32
+#define bmw_rightmost_one(w) pg_rightmost_one32(w)
+#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos32(w)
+#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos32(w)
+#define bmw_popcount(w) pg_popcount32(w)
+#elif BITS_PER_BITMAPWORD == 64
+#define bmw_rightmost_one(w) pg_rightmost_one64(w)
+#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos64(w)
+#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos64(w)
+#define bmw_popcount(w) pg_popcount64(w)
+#else
+#error "invalid BITS_PER_BITMAPWORD"
+#endif
/*
* function prototypes in nodes/bitmapset.c
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 814e0b2dba..f95b6afd86 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -17,6 +17,37 @@ extern PGDLLIMPORT const uint8 pg_leftmost_one_pos[256];
extern PGDLLIMPORT const uint8 pg_rightmost_one_pos[256];
extern PGDLLIMPORT const uint8 pg_number_of_ones[256];
+/*----------
+ * This is a well-known cute trick for isolating the rightmost one-bit
+ * in a word. It assumes two's complement arithmetic. Consider any
+ * nonzero value, and focus attention on the rightmost one. The value is
+ * then something like
+ * xxxxxx10000
+ * where x's are unspecified bits. The two's complement negative is formed
+ * by inverting all the bits and adding one. Inversion gives
+ * yyyyyy01111
+ * where each y is the inverse of the corresponding x. Incrementing gives
+ * yyyyyy10000
+ * and then ANDing with the original value gives
+ * 00000010000
+ * This works for all cases except original value = zero, where of course
+ * we get zero.
+ *----------
+ */
+static inline uint32
+pg_rightmost_one32(uint32 word)
+{
+ int32 result = (int32) word & -((int32) word);
+ return (uint32) result;
+}
+
+static inline uint64
+pg_rightmost_one64(uint64 word)
+{
+ int64 result = (int64) word & -((int64) word);
+ return (uint64) result;
+}
+
/*
* pg_leftmost_one_pos32
* Returns the position of the most significant set bit in "word",
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 58daeca831..68df6ddc0b 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3651,7 +3651,6 @@ shmem_request_hook_type
shmem_startup_hook_type
sig_atomic_t
sigjmp_buf
-signedbitmapword
sigset_t
size_t
slist_head
--
2.38.1
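
(Not part of the attached patches; the helper name below is made up purely for illustration.) For reviewers who want to see the bit trick in isolation: the patch above moves the "x & -x" idea into pg_rightmost_one32/64, and HAS_MULTIPLE_ONES() then just compares that result with the original word. A minimal standalone sketch:

#include <stdint.h>
#include <stdio.h>

/*
 * Same idea as the patched pg_rightmost_one64(): AND the word with its
 * two's-complement negative to keep only the lowest set bit. Writing it
 * as "~word + 1" avoids the signed casts but computes the same value.
 */
static inline uint64_t
rightmost_one64(uint64_t word)
{
	return word & (~word + 1);
}

int
main(void)
{
	uint64_t	w = 0x58;		/* binary 1011000 */

	/* prints 0x8: only the lowest set bit survives */
	printf("0x%llx\n", (unsigned long long) rightmost_one64(w));

	/* the HAS_MULTIPLE_ONES() test: prints 1, since 0x8 != 0x58 */
	printf("%d\n", rightmost_one64(w) != w);

	return 0;
}
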
Attachment: v13-0004-Use-bitmapword-for-node-125.patch (text/x-patch; charset=US-ASCII)
From bacc9b9ced17faeb868a5e5684c5016ffcc68ff6 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Tue, 6 Dec 2022 15:22:26 +0700
Subject: [PATCH v13 4/8] Use bitmapword for node-125
TODO: Rename macros copied from bitmapset.c
---
src/backend/lib/radixtree.c | 70 ++++++++++++++++++-------------------
1 file changed, 34 insertions(+), 36 deletions(-)
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
index e7f61fd943..abd0450727 100644
--- a/src/backend/lib/radixtree.c
+++ b/src/backend/lib/radixtree.c
@@ -62,6 +62,7 @@
#include "lib/radixtree.h"
#include "lib/stringinfo.h"
#include "miscadmin.h"
+#include "nodes/bitmapset.h"
#include "port/pg_bitutils.h"
#include "port/pg_lfind.h"
#include "utils/memutils.h"
@@ -103,6 +104,10 @@
#define RT_NODE_BITMAP_BYTE(v) ((v) / BITS_PER_BYTE)
#define RT_NODE_BITMAP_BIT(v) (UINT64CONST(1) << ((v) % RT_NODE_SPAN))
+/* FIXME rename */
+#define WORDNUM(x) ((x) / BITS_PER_BITMAPWORD)
+#define BITNUM(x) ((x) % BITS_PER_BITMAPWORD)
+
/* Enum used rt_node_search() */
typedef enum
{
@@ -207,6 +212,9 @@ typedef struct rt_node_base125
/* The index of slots for each fanout */
uint8 slot_idxs[RT_NODE_MAX_SLOTS];
+
+ /* isset is a bitmap to track which slot is in use */
+ bitmapword isset[WORDNUM(128)];
} rt_node_base_125;
typedef struct rt_node_base256
@@ -271,9 +279,6 @@ typedef struct rt_node_leaf_125
{
rt_node_base_125 base;
- /* isset is a bitmap to track which slot is in use */
- uint8 isset[RT_NODE_NSLOTS_BITS(128)];
-
/* number of values depends on size class */
uint64 values[FLEXIBLE_ARRAY_MEMBER];
} rt_node_leaf_125;
@@ -655,13 +660,14 @@ node_125_is_chunk_used(rt_node_base_125 *node, uint8 chunk)
return node->slot_idxs[chunk] != RT_NODE_125_INVALID_IDX;
}
+#ifdef USE_ASSERT_CHECKING
/* Is the slot in the node used? */
static inline bool
node_inner_125_is_slot_used(rt_node_inner_125 *node, uint8 slot)
{
Assert(!NODE_IS_LEAF(node));
Assert(slot < node->base.n.fanout);
- return (node->children[slot] != NULL);
+ return (node->base.isset[WORDNUM(slot)] & ((bitmapword) 1 << BITNUM(slot))) != 0;
}
static inline bool
@@ -669,8 +675,9 @@ node_leaf_125_is_slot_used(rt_node_leaf_125 *node, uint8 slot)
{
Assert(NODE_IS_LEAF(node));
Assert(slot < node->base.n.fanout);
- return ((node->isset[RT_NODE_BITMAP_BYTE(slot)] & RT_NODE_BITMAP_BIT(slot)) != 0);
+ return (node->base.isset[WORDNUM(slot)] & ((bitmapword) 1 << BITNUM(slot))) != 0;
}
+#endif
static inline rt_node *
node_inner_125_get_child(rt_node_inner_125 *node, uint8 chunk)
@@ -690,7 +697,10 @@ node_leaf_125_get_value(rt_node_leaf_125 *node, uint8 chunk)
static void
node_inner_125_delete(rt_node_inner_125 *node, uint8 chunk)
{
+ int slotpos = node->base.slot_idxs[chunk];
+
Assert(!NODE_IS_LEAF(node));
+ node->base.isset[WORDNUM(slotpos)] &= ~((bitmapword) 1 << BITNUM(slotpos));
node->children[node->base.slot_idxs[chunk]] = NULL;
node->base.slot_idxs[chunk] = RT_NODE_125_INVALID_IDX;
}
@@ -701,44 +711,35 @@ node_leaf_125_delete(rt_node_leaf_125 *node, uint8 chunk)
int slotpos = node->base.slot_idxs[chunk];
Assert(NODE_IS_LEAF(node));
- node->isset[RT_NODE_BITMAP_BYTE(slotpos)] &= ~(RT_NODE_BITMAP_BIT(slotpos));
+ node->base.isset[WORDNUM(slotpos)] &= ~((bitmapword) 1 << BITNUM(slotpos));
node->base.slot_idxs[chunk] = RT_NODE_125_INVALID_IDX;
}
/* Return an unused slot in node-125 */
static int
-node_inner_125_find_unused_slot(rt_node_inner_125 *node, uint8 chunk)
-{
- int slotpos = 0;
-
- Assert(!NODE_IS_LEAF(node));
- while (node_inner_125_is_slot_used(node, slotpos))
- slotpos++;
-
- return slotpos;
-}
-
-static int
-node_leaf_125_find_unused_slot(rt_node_leaf_125 *node, uint8 chunk)
+node_125_find_unused_slot(bitmapword *isset)
{
int slotpos;
+ int idx;
+ bitmapword inverse;
- Assert(NODE_IS_LEAF(node));
-
- /* We iterate over the isset bitmap per byte then check each bit */
- for (slotpos = 0; slotpos < RT_NODE_NSLOTS_BITS(128); slotpos++)
+ /* get the first word with at least one bit not set */
+ for (idx = 0; idx < WORDNUM(128); idx++)
{
- if (node->isset[slotpos] < 0xFF)
+ if (isset[idx] < ~((bitmapword) 0))
break;
}
- Assert(slotpos < RT_NODE_NSLOTS_BITS(128));
- slotpos *= BITS_PER_BYTE;
- while (node_leaf_125_is_slot_used(node, slotpos))
- slotpos++;
+ /* To get the first unset bit in X, get the first set bit in ~X */
+ inverse = ~(isset[idx]);
+ slotpos = idx * BITS_PER_BITMAPWORD;
+ slotpos += bmw_rightmost_one_pos(inverse);
+
+ /* mark the slot used */
+ isset[idx] |= bmw_rightmost_one(inverse);
return slotpos;
-}
+ }
static inline void
node_inner_125_insert(rt_node_inner_125 *node, uint8 chunk, rt_node *child)
@@ -747,8 +748,7 @@ node_inner_125_insert(rt_node_inner_125 *node, uint8 chunk, rt_node *child)
Assert(!NODE_IS_LEAF(node));
- /* find unused slot */
- slotpos = node_inner_125_find_unused_slot(node, chunk);
+ slotpos = node_125_find_unused_slot(node->base.isset);
Assert(slotpos < node->base.n.fanout);
node->base.slot_idxs[chunk] = slotpos;
@@ -763,12 +763,10 @@ node_leaf_125_insert(rt_node_leaf_125 *node, uint8 chunk, uint64 value)
Assert(NODE_IS_LEAF(node));
- /* find unused slot */
- slotpos = node_leaf_125_find_unused_slot(node, chunk);
+ slotpos = node_125_find_unused_slot(node->base.isset);
Assert(slotpos < node->base.n.fanout);
node->base.slot_idxs[chunk] = slotpos;
- node->isset[RT_NODE_BITMAP_BYTE(slotpos)] |= RT_NODE_BITMAP_BIT(slotpos);
node->values[slotpos] = value;
}
@@ -2395,9 +2393,9 @@ rt_dump_node(rt_node *node, int level, bool recurse)
rt_node_leaf_125 *n = (rt_node_leaf_125 *) node;
fprintf(stderr, ", isset-bitmap:");
- for (int i = 0; i < 16; i++)
+ for (int i = 0; i < WORDNUM(128); i++)
{
- fprintf(stderr, "%X ", (uint8) n->isset[i]);
+ fprintf(stderr, UINT64_FORMAT_HEX " ", n->base.isset[i]);
}
fprintf(stderr, "\n");
}
--
2.38.1
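
(Again not part of the patch; names and the 128-slot layout below are assumed just for the example.) The rewritten node_125_find_unused_slot() above does a word-at-a-time search: find the first isset word that is not all ones, take the rightmost set bit of the complemented word as the first free slot, then mark it used with the same "x & -x" trick. A standalone approximation:

#include <stdint.h>
#include <stdio.h>

#define NWORDS 2				/* 128 slots / 64 bits per word */

/* Position of the lowest set bit; word must be nonzero. */
static int
rightmost_one_pos64(uint64_t word)
{
	int			pos = 0;

	while ((word & 1) == 0)
	{
		word >>= 1;
		pos++;
	}
	return pos;
}

/*
 * Return the first free slot and mark it used. Like the patched code,
 * this assumes at least one free slot exists.
 */
static int
find_unused_slot(uint64_t *isset)
{
	int			idx;
	uint64_t	inverse;
	int			slotpos;

	/* get the first word with at least one bit not set */
	for (idx = 0; idx < NWORDS; idx++)
	{
		if (isset[idx] != ~UINT64_C(0))
			break;
	}

	/* the first unset bit in X is the first set bit in ~X */
	inverse = ~isset[idx];
	slotpos = idx * 64 + rightmost_one_pos64(inverse);

	/* mark the slot used */
	isset[idx] |= inverse & (~inverse + 1);

	return slotpos;
}

int
main(void)
{
	uint64_t	isset[NWORDS] = {~UINT64_C(0), UINT64_C(0x7)};	/* slots 0..66 used */

	printf("%d\n", find_unused_slot(isset));	/* 67 */
	printf("%d\n", find_unused_slot(isset));	/* 68 */
	return 0;
}
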
Attachment: v13-0003-Add-radix-implementation.patch (text/x-patch; charset=US-ASCII)
From 377cc13755e9129e672e72deaccc2f8d36fe8fa5 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 14 Sep 2022 12:38:51 +0000
Subject: [PATCH v13 3/8] Add radix implementation.
---
src/backend/lib/Makefile | 1 +
src/backend/lib/meson.build | 1 +
src/backend/lib/radixtree.c | 2541 +++++++++++++++++
src/include/lib/radixtree.h | 42 +
src/test/modules/Makefile | 1 +
src/test/modules/meson.build | 1 +
src/test/modules/test_radixtree/.gitignore | 4 +
src/test/modules/test_radixtree/Makefile | 23 +
src/test/modules/test_radixtree/README | 7 +
.../expected/test_radixtree.out | 36 +
src/test/modules/test_radixtree/meson.build | 34 +
.../test_radixtree/sql/test_radixtree.sql | 7 +
.../test_radixtree/test_radixtree--1.0.sql | 8 +
.../modules/test_radixtree/test_radixtree.c | 581 ++++
.../test_radixtree/test_radixtree.control | 4 +
15 files changed, 3291 insertions(+)
create mode 100644 src/backend/lib/radixtree.c
create mode 100644 src/include/lib/radixtree.h
create mode 100644 src/test/modules/test_radixtree/.gitignore
create mode 100644 src/test/modules/test_radixtree/Makefile
create mode 100644 src/test/modules/test_radixtree/README
create mode 100644 src/test/modules/test_radixtree/expected/test_radixtree.out
create mode 100644 src/test/modules/test_radixtree/meson.build
create mode 100644 src/test/modules/test_radixtree/sql/test_radixtree.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree--1.0.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree.c
create mode 100644 src/test/modules/test_radixtree/test_radixtree.control
diff --git a/src/backend/lib/Makefile b/src/backend/lib/Makefile
index 9dad31398a..4c1db794b6 100644
--- a/src/backend/lib/Makefile
+++ b/src/backend/lib/Makefile
@@ -22,6 +22,7 @@ OBJS = \
integerset.o \
knapsack.o \
pairingheap.o \
+ radixtree.o \
rbtree.o \
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/lib/meson.build b/src/backend/lib/meson.build
index 48da1bddce..4303d306cd 100644
--- a/src/backend/lib/meson.build
+++ b/src/backend/lib/meson.build
@@ -9,4 +9,5 @@ backend_sources += files(
'knapsack.c',
'pairingheap.c',
'rbtree.c',
+ 'radixtree.c',
)
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
new file mode 100644
index 0000000000..e7f61fd943
--- /dev/null
+++ b/src/backend/lib/radixtree.c
@@ -0,0 +1,2541 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.c
+ * Implementation for adaptive radix tree.
+ *
+ * This module employs the idea from the paper "The Adaptive Radix Tree: ARTful
+ * Indexing for Main-Memory Databases" by Viktor Leis, Alfons Kemper, and Thomas
+ * Neumann, 2013. The radix tree uses adaptive node sizes, a small number of node
+ * types, each with a different numbers of elements. Depending on the number of
+ * children, the appropriate node type is used.
+ *
+ * There are some differences from the proposed implementation. For instance,
+ * there is no support for path compression or lazy path expansion. The radix
+ * tree supports only a fixed key length, so we don't expect the tree to
+ * become very high.
+ *
+ * Both the key and the value are 64-bit unsigned integers. The inner nodes and
+ * the leaf nodes have slightly different structures: inner nodes (shift > 0)
+ * store the pointer to the child node as the value, while leaf nodes
+ * (shift == 0) store the 64-bit unsigned integer that is specified by the user as
+ * the value. The paper refers to this technique as "Multi-value leaves". We
+ * choose it to avoid an additional pointer traversal. It is the reason this code
+ * currently does not support variable-length keys.
+ *
+ * XXX: Most functions in this file have two variants for inner nodes and leaf
+ * nodes, so there is some duplicated code. While this sometimes makes
+ * code maintenance tricky, it reduces branch prediction misses when judging
+ * whether the node is an inner node or a leaf node.
+ *
+ * XXX: radix tree nodes are never shrunk.
+ *
+ * Interface
+ * ---------
+ *
+ * rt_create - Create a new, empty radix tree
+ * rt_free - Free the radix tree
+ * rt_search - Search a key-value pair
+ * rt_set - Set a key-value pair
+ * rt_delete - Delete a key-value pair
+ * rt_begin_iterate - Begin iterating through all key-value pairs
+ * rt_iterate_next - Return next key-value pair, if any
+ * rt_end_iter - End iteration
+ * rt_memory_usage - Get the memory usage
+ * rt_num_entries - Get the number of key-value pairs
+ *
+ * rt_create() creates an empty radix tree in the given memory context
+ * and memory contexts for all kinds of radix tree node under the memory context.
+ *
+ * rt_iterate_next() ensures returning key-value pairs in the ascending
+ * order of the key.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/lib/radixtree.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "lib/radixtree.h"
+#include "lib/stringinfo.h"
+#include "miscadmin.h"
+#include "port/pg_bitutils.h"
+#include "port/pg_lfind.h"
+#include "utils/memutils.h"
+
+#ifdef RT_DEBUG
+#define UINT64_FORMAT_HEX "%" INT64_MODIFIER "X"
+#endif
+
+/* The number of bits encoded in one tree level */
+#define RT_NODE_SPAN BITS_PER_BYTE
+
+/* The number of maximum slots in the node */
+#define RT_NODE_MAX_SLOTS (1 << RT_NODE_SPAN)
+
+/*
+ * Return the number of bytes required for a bitmap of nslots slots, used
+ * by nodes whose slots are indexed by array lookup.
+ */
+#define RT_NODE_NSLOTS_BITS(nslots) ((nslots) / (sizeof(uint8) * BITS_PER_BYTE))
+
+/* Mask for extracting a chunk from the key */
+#define RT_CHUNK_MASK ((1 << RT_NODE_SPAN) - 1)
+
+/* Maximum shift the radix tree uses */
+#define RT_MAX_SHIFT key_get_shift(UINT64_MAX)
+
+/* Tree level the radix tree uses */
+#define RT_MAX_LEVEL ((sizeof(uint64) * BITS_PER_BYTE) / RT_NODE_SPAN)
+
+/* Invalid index used in node-125 */
+#define RT_NODE_125_INVALID_IDX 0xFF
+
+/* Get a chunk from the key */
+#define RT_GET_KEY_CHUNK(key, shift) ((uint8) (((key) >> (shift)) & RT_CHUNK_MASK))
+
+/*
+ * Mapping from the value to the bit in is-set bitmap in the node-256.
+ */
+#define RT_NODE_BITMAP_BYTE(v) ((v) / BITS_PER_BYTE)
+#define RT_NODE_BITMAP_BIT(v) (UINT64CONST(1) << ((v) % RT_NODE_SPAN))
+
+/* Enum used rt_node_search() */
+typedef enum
+{
+ RT_ACTION_FIND = 0, /* find the key-value */
+ RT_ACTION_DELETE, /* delete the key-value */
+} rt_action;
+
+/*
+ * Supported radix tree node kinds and size classes.
+ *
+ * There are 4 node kinds and each node kind has one or two size classes,
+ * partial and full. The size classes within the same node kind share the same
+ * node structure but have a different fanout, which is stored
+ * in 'fanout' of rt_node. For example in size class 15, when a 16th element
+ * is to be inserted, we allocate a larger area and memcpy the entire old
+ * node to it.
+ *
+ * This technique allows us to limit the node kinds to 4, which limits the
+ * number of cases in switch statements. It also allows a possible future
+ * optimization to encode the node kind in a pointer tag.
+ *
+ * These size classes have been chosen carefully so that they minimize the
+ * allocator padding in both the inner and leaf nodes on DSA.
+ */
+#define RT_NODE_KIND_4 0x00
+#define RT_NODE_KIND_32 0x01
+#define RT_NODE_KIND_125 0x02
+#define RT_NODE_KIND_256 0x03
+#define RT_NODE_KIND_COUNT 4
+
+typedef enum rt_size_class
+{
+ RT_CLASS_4_FULL = 0,
+ RT_CLASS_32_PARTIAL,
+ RT_CLASS_32_FULL,
+ RT_CLASS_125_FULL,
+ RT_CLASS_256
+
+#define RT_SIZE_CLASS_COUNT (RT_CLASS_256 + 1)
+} rt_size_class;
+
+/* Common type for all nodes types */
+typedef struct rt_node
+{
+ /*
+ * Number of children. We use uint16 to be able to indicate 256 children
+ * at the fanout of 8.
+ */
+ uint16 count;
+
+ /* Max number of children. We can use uint8 because we never need to store 256 */
+ /* WIP: if we don't have a variable sized node4, this should instead be in the base
+ types as needed, since saving every byte is crucial for the smallest node kind */
+ uint8 fanout;
+
+ /*
+ * Shift indicates which part of the key space is represented by this
+ * node. That is, the key is shifted by 'shift' and the lowest
+ * RT_NODE_SPAN bits are then represented in chunk.
+ */
+ uint8 shift;
+ uint8 chunk;
+
+ /* Node kind, one per search/set algorithm */
+ uint8 kind;
+} rt_node;
+#define NODE_IS_LEAF(n) (((rt_node *) (n))->shift == 0)
+#define NODE_IS_EMPTY(n) (((rt_node *) (n))->count == 0)
+#define VAR_NODE_HAS_FREE_SLOT(node) \
+ ((node)->base.n.count < (node)->base.n.fanout)
+#define FIXED_NODE_HAS_FREE_SLOT(node, class) \
+ ((node)->base.n.count < rt_size_class_info[class].fanout)
+
+/* Base types of each node kind for leaf and inner nodes */
+/* The base types must be able to accommodate the largest size
+   class for variable-sized node kinds */
+typedef struct rt_node_base_4
+{
+ rt_node n;
+
+ /* 4 children, for key chunks */
+ uint8 chunks[4];
+} rt_node_base_4;
+
+typedef struct rt_node_base32
+{
+ rt_node n;
+
+ /* 32 children, for key chunks */
+ uint8 chunks[32];
+} rt_node_base_32;
+
+/*
+ * node-125 uses a slot_idxs array of RT_NODE_MAX_SLOTS (typically 256) entries
+ * to store indexes into a second array that contains up to 125 values (or
+ * child pointers in inner nodes).
+ */
+typedef struct rt_node_base125
+{
+ rt_node n;
+
+ /* The index of slots for each fanout */
+ uint8 slot_idxs[RT_NODE_MAX_SLOTS];
+} rt_node_base_125;
+
+typedef struct rt_node_base256
+{
+ rt_node n;
+} rt_node_base_256;
+
+/*
+ * Inner and leaf nodes.
+ *
+ * These are separate for two main reasons:
+ *
+ * 1) the value type might be different from something fitting into a
+ * pointer-width type
+ * 2) Need to represent non-existing values in a key-type independent way.
+ *
+ * 1) is clearly worth being concerned about, but it's not clear 2) is as
+ * good. It might be better to just indicate non-existing entries the same way
+ * in inner nodes.
+ */
+typedef struct rt_node_inner_4
+{
+ rt_node_base_4 base;
+
+ /* number of children depends on size class */
+ rt_node *children[FLEXIBLE_ARRAY_MEMBER];
+} rt_node_inner_4;
+
+typedef struct rt_node_leaf_4
+{
+ rt_node_base_4 base;
+
+ /* number of values depends on size class */
+ uint64 values[FLEXIBLE_ARRAY_MEMBER];
+} rt_node_leaf_4;
+
+typedef struct rt_node_inner_32
+{
+ rt_node_base_32 base;
+
+ /* number of children depends on size class */
+ rt_node *children[FLEXIBLE_ARRAY_MEMBER];
+} rt_node_inner_32;
+
+typedef struct rt_node_leaf_32
+{
+ rt_node_base_32 base;
+
+ /* number of values depends on size class */
+ uint64 values[FLEXIBLE_ARRAY_MEMBER];
+} rt_node_leaf_32;
+
+typedef struct rt_node_inner_125
+{
+ rt_node_base_125 base;
+
+ /* number of children depends on size class */
+ rt_node *children[FLEXIBLE_ARRAY_MEMBER];
+} rt_node_inner_125;
+
+typedef struct rt_node_leaf_125
+{
+ rt_node_base_125 base;
+
+ /* isset is a bitmap to track which slot is in use */
+ uint8 isset[RT_NODE_NSLOTS_BITS(128)];
+
+ /* number of values depends on size class */
+ uint64 values[FLEXIBLE_ARRAY_MEMBER];
+} rt_node_leaf_125;
+
+/*
+ * node-256 is the largest node type. This node has an array of length RT_NODE_MAX_SLOTS
+ * for directly storing values (or child pointers in inner nodes).
+ */
+typedef struct rt_node_inner_256
+{
+ rt_node_base_256 base;
+
+ /* Slots for 256 children */
+ rt_node *children[RT_NODE_MAX_SLOTS];
+} rt_node_inner_256;
+
+typedef struct rt_node_leaf_256
+{
+ rt_node_base_256 base;
+
+ /* isset is a bitmap to track which slot is in use */
+ uint8 isset[RT_NODE_NSLOTS_BITS(RT_NODE_MAX_SLOTS)];
+
+ /* Slots for 256 values */
+ uint64 values[RT_NODE_MAX_SLOTS];
+} rt_node_leaf_256;
+
+/* Information for each size class */
+typedef struct rt_size_class_elem
+{
+ const char *name;
+ int fanout;
+
+ /* slab chunk size */
+ Size inner_size;
+ Size leaf_size;
+
+ /* slab block size */
+ Size inner_blocksize;
+ Size leaf_blocksize;
+} rt_size_class_elem;
+
+/*
+ * Calculate the slab blocksize so that we can allocate at least 32 chunks
+ * from the block.
+ */
+#define NODE_SLAB_BLOCK_SIZE(size) \
+ Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * (size), (size) * 32)
+static rt_size_class_elem rt_size_class_info[RT_SIZE_CLASS_COUNT] = {
+ [RT_CLASS_4_FULL] = {
+ .name = "radix tree node 4",
+ .fanout = 4,
+ .inner_size = sizeof(rt_node_inner_4) + 4 * sizeof(rt_node *),
+ .leaf_size = sizeof(rt_node_leaf_4) + 4 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_4) + 4 * sizeof(rt_node *)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_4) + 4 * sizeof(uint64)),
+ },
+ [RT_CLASS_32_PARTIAL] = {
+ .name = "radix tree node 15",
+ .fanout = 15,
+ .inner_size = sizeof(rt_node_inner_32) + 15 * sizeof(rt_node *),
+ .leaf_size = sizeof(rt_node_leaf_32) + 15 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_32) + 15 * sizeof(rt_node *)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_32) + 15 * sizeof(uint64)),
+ },
+ [RT_CLASS_32_FULL] = {
+ .name = "radix tree node 32",
+ .fanout = 32,
+ .inner_size = sizeof(rt_node_inner_32) + 32 * sizeof(rt_node *),
+ .leaf_size = sizeof(rt_node_leaf_32) + 32 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_32) + 32 * sizeof(rt_node *)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_32) + 32 * sizeof(uint64)),
+ },
+ [RT_CLASS_125_FULL] = {
+ .name = "radix tree node 125",
+ .fanout = 125,
+ .inner_size = sizeof(rt_node_inner_125) + 125 * sizeof(rt_node *),
+ .leaf_size = sizeof(rt_node_leaf_125) + 125 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_125) + 125 * sizeof(rt_node *)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_125) + 125 * sizeof(uint64)),
+ },
+ [RT_CLASS_256] = {
+ .name = "radix tree node 256",
+ .fanout = 256,
+ .inner_size = sizeof(rt_node_inner_256),
+ .leaf_size = sizeof(rt_node_leaf_256),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_256)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_256)),
+ },
+};
+
+/* Map from the node kind to its minimum size class */
+static rt_size_class kind_min_size_class[RT_NODE_KIND_COUNT] = {
+ [RT_NODE_KIND_4] = RT_CLASS_4_FULL,
+ [RT_NODE_KIND_32] = RT_CLASS_32_PARTIAL,
+ [RT_NODE_KIND_125] = RT_CLASS_125_FULL,
+ [RT_NODE_KIND_256] = RT_CLASS_256,
+};
+
+/*
+ * Iteration support.
+ *
+ * Iterating over the radix tree returns each pair of key and value in
+ * ascending key order. To support this, we iterate over nodes at each level.
+ *
+ * rt_node_iter struct is used to track the iteration within a node.
+ *
+ * rt_iter is the struct for iteration of the radix tree, and uses rt_node_iter
+ * in order to track the iteration of each level. During the iteration, we also
+ * construct the key whenever updating the node iteration information, e.g., when
+ * advancing the current index within the node or when moving to the next node
+ * at the same level.
+ */
+typedef struct rt_node_iter
+{
+ rt_node *node; /* current node being iterated */
+ int current_idx; /* current position. -1 for initial value */
+} rt_node_iter;
+
+struct rt_iter
+{
+ radix_tree *tree;
+
+ /* Track the iteration on nodes of each level */
+ rt_node_iter stack[RT_MAX_LEVEL];
+ int stack_len;
+
+ /* The key is being constructed during the iteration */
+ uint64 key;
+};
+
+/* A radix tree with nodes */
+struct radix_tree
+{
+ MemoryContext context;
+
+ rt_node *root;
+ uint64 max_val;
+ uint64 num_keys;
+
+ MemoryContextData *inner_slabs[RT_SIZE_CLASS_COUNT];
+ MemoryContextData *leaf_slabs[RT_SIZE_CLASS_COUNT];
+
+ /* statistics */
+#ifdef RT_DEBUG
+ int32 cnt[RT_SIZE_CLASS_COUNT];
+#endif
+};
+
+static void rt_new_root(radix_tree *tree, uint64 key);
+static rt_node *rt_alloc_node(radix_tree *tree, rt_size_class size_class, bool inner);
+static inline void rt_init_node(rt_node *node, uint8 kind, rt_size_class size_class,
+ bool inner);
+static void rt_free_node(radix_tree *tree, rt_node *node);
+static void rt_extend(radix_tree *tree, uint64 key);
+static inline bool rt_node_search_inner(rt_node *node, uint64 key, rt_action action,
+ rt_node **child_p);
+static inline bool rt_node_search_leaf(rt_node *node, uint64 key, rt_action action,
+ uint64 *value_p);
+static bool rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node,
+ uint64 key, rt_node *child);
+static bool rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
+ uint64 key, uint64 value);
+static inline rt_node *rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter);
+static inline bool rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
+ uint64 *value_p);
+static void rt_update_iter_stack(rt_iter *iter, rt_node *from_node, int from);
+static inline void rt_iter_update_key(rt_iter *iter, uint8 chunk, uint8 shift);
+
+/* verification (available only with assertion) */
+static void rt_verify_node(rt_node *node);
+
+/*
+ * Return index of the first element in 'base' that equals 'key'. Return -1
+ * if there is no such element.
+ */
+static inline int
+node_4_search_eq(rt_node_base_4 *node, uint8 chunk)
+{
+ int idx = -1;
+
+ for (int i = 0; i < node->n.count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ idx = i;
+ break;
+ }
+ }
+
+ return idx;
+}
+
+/*
+ * Return index of the chunk to insert into chunks in the given node.
+ */
+static inline int
+node_4_get_insertpos(rt_node_base_4 *node, uint8 chunk)
+{
+ int idx;
+
+ for (idx = 0; idx < node->n.count; idx++)
+ {
+ if (node->chunks[idx] >= chunk)
+ break;
+ }
+
+ return idx;
+}
+
+/*
+ * Return index of the first element in 'base' that equals 'key'. Return -1
+ * if there is no such element.
+ */
+static inline int
+node_32_search_eq(rt_node_base_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ uint32 bitfield;
+ int index_simd = -1;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index = -1;
+
+ for (int i = 0; i < count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ index = i;
+ break;
+ }
+ }
+#endif
+
+#ifndef USE_NO_SIMD
+ spread_chunk = vector8_broadcast(chunk);
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ cmp1 = vector8_eq(spread_chunk, haystack1);
+ cmp2 = vector8_eq(spread_chunk, haystack2);
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+/*
+ * Return index of the chunk to insert into chunks in the given node.
+ */
+static inline int
+node_32_get_insertpos(rt_node_base_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ Vector8 min1;
+ Vector8 min2;
+ uint32 bitfield;
+ int index_simd;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index;
+
+ for (index = 0; index < count; index++)
+ {
+ if (node->chunks[index] >= chunk)
+ break;
+ }
+#endif
+
+#ifndef USE_NO_SIMD
+ spread_chunk = vector8_broadcast(chunk);
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ min1 = vector8_min(spread_chunk, haystack1);
+ min2 = vector8_min(spread_chunk, haystack2);
+ cmp1 = vector8_eq(spread_chunk, min1);
+ cmp2 = vector8_eq(spread_chunk, min2);
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+ else
+ index_simd = count;
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+/*
+ * Functions to manipulate both chunks array and children/values array.
+ * These are used for node-4 and node-32.
+ */
+
+/* Shift the elements right at 'idx' by one */
+static inline void
+chunk_children_array_shift(uint8 *chunks, rt_node **children, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+ memmove(&(children[idx + 1]), &(children[idx]), sizeof(rt_node *) * (count - idx));
+}
+
+static inline void
+chunk_values_array_shift(uint8 *chunks, uint64 *values, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+ memmove(&(values[idx + 1]), &(values[idx]), sizeof(uint64 *) * (count - idx));
+}
+
+/* Delete the element at 'idx' */
+static inline void
+chunk_children_array_delete(uint8 *chunks, rt_node **children, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(children[idx]), &(children[idx + 1]), sizeof(rt_node *) * (count - idx - 1));
+}
+
+static inline void
+chunk_values_array_delete(uint8 *chunks, uint64 *values, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(values[idx]), &(values[idx + 1]), sizeof(uint64) * (count - idx - 1));
+}
+
+/* Copy both chunks and children/values arrays */
+static inline void
+chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
+ uint8 *dst_chunks, rt_node **dst_children)
+{
+ const int fanout = rt_size_class_info[RT_CLASS_4_FULL].fanout;
+ const Size chunk_size = sizeof(uint8) * fanout;
+ const Size children_size = sizeof(rt_node *) * fanout;
+
+ memcpy(dst_chunks, src_chunks, chunk_size);
+ memcpy(dst_children, src_children, children_size);
+}
+
+static inline void
+chunk_values_array_copy(uint8 *src_chunks, uint64 *src_values,
+ uint8 *dst_chunks, uint64 *dst_values)
+{
+ const int fanout = rt_size_class_info[RT_CLASS_4_FULL].fanout;
+ const Size chunk_size = sizeof(uint8) * fanout;
+ const Size values_size = sizeof(uint64) * fanout;
+
+ memcpy(dst_chunks, src_chunks, chunk_size);
+ memcpy(dst_values, src_values, values_size);
+}
+
+/* Functions to manipulate inner and leaf node-125 */
+
+/* Does the given chunk in the node have a value? */
+static inline bool
+node_125_is_chunk_used(rt_node_base_125 *node, uint8 chunk)
+{
+ return node->slot_idxs[chunk] != RT_NODE_125_INVALID_IDX;
+}
+
+/* Is the slot in the node used? */
+static inline bool
+node_inner_125_is_slot_used(rt_node_inner_125 *node, uint8 slot)
+{
+ Assert(!NODE_IS_LEAF(node));
+ Assert(slot < node->base.n.fanout);
+ return (node->children[slot] != NULL);
+}
+
+static inline bool
+node_leaf_125_is_slot_used(rt_node_leaf_125 *node, uint8 slot)
+{
+ Assert(NODE_IS_LEAF(node));
+ Assert(slot < node->base.n.fanout);
+ return ((node->isset[RT_NODE_BITMAP_BYTE(slot)] & RT_NODE_BITMAP_BIT(slot)) != 0);
+}
+
+static inline rt_node *
+node_inner_125_get_child(rt_node_inner_125 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ return node->children[node->base.slot_idxs[chunk]];
+}
+
+static inline uint64
+node_leaf_125_get_value(rt_node_leaf_125 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ Assert(((rt_node_base_125 *) node)->slot_idxs[chunk] != RT_NODE_125_INVALID_IDX);
+ return node->values[node->base.slot_idxs[chunk]];
+}
+
+static void
+node_inner_125_delete(rt_node_inner_125 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->children[node->base.slot_idxs[chunk]] = NULL;
+ node->base.slot_idxs[chunk] = RT_NODE_125_INVALID_IDX;
+}
+
+static void
+node_leaf_125_delete(rt_node_leaf_125 *node, uint8 chunk)
+{
+ int slotpos = node->base.slot_idxs[chunk];
+
+ Assert(NODE_IS_LEAF(node));
+ node->isset[RT_NODE_BITMAP_BYTE(slotpos)] &= ~(RT_NODE_BITMAP_BIT(slotpos));
+ node->base.slot_idxs[chunk] = RT_NODE_125_INVALID_IDX;
+}
+
+/* Return an unused slot in node-125 */
+static int
+node_inner_125_find_unused_slot(rt_node_inner_125 *node, uint8 chunk)
+{
+ int slotpos = 0;
+
+ Assert(!NODE_IS_LEAF(node));
+ while (node_inner_125_is_slot_used(node, slotpos))
+ slotpos++;
+
+ return slotpos;
+}
+
+static int
+node_leaf_125_find_unused_slot(rt_node_leaf_125 *node, uint8 chunk)
+{
+ int slotpos;
+
+ Assert(NODE_IS_LEAF(node));
+
+ /* We iterate over the isset bitmap per byte then check each bit */
+ for (slotpos = 0; slotpos < RT_NODE_NSLOTS_BITS(128); slotpos++)
+ {
+ if (node->isset[slotpos] < 0xFF)
+ break;
+ }
+ Assert(slotpos < RT_NODE_NSLOTS_BITS(128));
+
+ slotpos *= BITS_PER_BYTE;
+ while (node_leaf_125_is_slot_used(node, slotpos))
+ slotpos++;
+
+ return slotpos;
+}
+
+static inline void
+node_inner_125_insert(rt_node_inner_125 *node, uint8 chunk, rt_node *child)
+{
+ int slotpos;
+
+ Assert(!NODE_IS_LEAF(node));
+
+ /* find unused slot */
+ slotpos = node_inner_125_find_unused_slot(node, chunk);
+ Assert(slotpos < node->base.n.fanout);
+
+ node->base.slot_idxs[chunk] = slotpos;
+ node->children[slotpos] = child;
+}
+
+/* Set the slot at the corresponding chunk */
+static inline void
+node_leaf_125_insert(rt_node_leaf_125 *node, uint8 chunk, uint64 value)
+{
+ int slotpos;
+
+ Assert(NODE_IS_LEAF(node));
+
+ /* find unused slot */
+ slotpos = node_leaf_125_find_unused_slot(node, chunk);
+ Assert(slotpos < node->base.n.fanout);
+
+ node->base.slot_idxs[chunk] = slotpos;
+ node->isset[RT_NODE_BITMAP_BYTE(slotpos)] |= RT_NODE_BITMAP_BIT(slotpos);
+ node->values[slotpos] = value;
+}
+
+/* Update the child corresponding to 'chunk' to 'child' */
+static inline void
+node_inner_125_update(rt_node_inner_125 *node, uint8 chunk, rt_node *child)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->children[node->base.slot_idxs[chunk]] = child;
+}
+
+static inline void
+node_leaf_125_update(rt_node_leaf_125 *node, uint8 chunk, uint64 value)
+{
+ Assert(NODE_IS_LEAF(node));
+ node->values[node->base.slot_idxs[chunk]] = value;
+}
+
+/* Functions to manipulate inner and leaf node-256 */
+
+/* Return true if the slot corresponding to the given chunk is in use */
+static inline bool
+node_inner_256_is_chunk_used(rt_node_inner_256 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ return (node->children[chunk] != NULL);
+}
+
+static inline bool
+node_leaf_256_is_chunk_used(rt_node_leaf_256 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ return (node->isset[RT_NODE_BITMAP_BYTE(chunk)] & RT_NODE_BITMAP_BIT(chunk)) != 0;
+}
+
+static inline rt_node *
+node_inner_256_get_child(rt_node_inner_256 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ Assert(node_inner_256_is_chunk_used(node, chunk));
+ return node->children[chunk];
+}
+
+static inline uint64
+node_leaf_256_get_value(rt_node_leaf_256 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ Assert(node_leaf_256_is_chunk_used(node, chunk));
+ return node->values[chunk];
+}
+
+/* Set the child in the node-256 */
+static inline void
+node_inner_256_set(rt_node_inner_256 *node, uint8 chunk, rt_node *child)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->children[chunk] = child;
+}
+
+/* Set the value in the node-256 */
+static inline void
+node_leaf_256_set(rt_node_leaf_256 *node, uint8 chunk, uint64 value)
+{
+ Assert(NODE_IS_LEAF(node));
+ node->isset[RT_NODE_BITMAP_BYTE(chunk)] |= RT_NODE_BITMAP_BIT(chunk);
+ node->values[chunk] = value;
+}
+
+/* Clear the slot at the given chunk position */
+static inline void
+node_inner_256_delete(rt_node_inner_256 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->children[chunk] = NULL;
+}
+
+static inline void
+node_leaf_256_delete(rt_node_leaf_256 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ node->isset[RT_NODE_BITMAP_BYTE(chunk)] &= ~(RT_NODE_BITMAP_BIT(chunk));
+}
+
+/*
+ * Return the shift needed to store the given key.
+ */
+static inline int
+key_get_shift(uint64 key)
+{
+ return (key == 0)
+ ? 0
+ : (pg_leftmost_one_pos64(key) / RT_NODE_SPAN) * RT_NODE_SPAN;
+}
+
+/*
+ * Return the max value stored in a node with the given shift.
+ */
+static uint64
+shift_get_max_val(int shift)
+{
+ if (shift == RT_MAX_SHIFT)
+ return UINT64_MAX;
+
+ return (UINT64CONST(1) << (shift + RT_NODE_SPAN)) - 1;
+}
+
+/*
+ * Create a new node as the root. Subordinate nodes will be created during
+ * the insertion.
+ */
+static void
+rt_new_root(radix_tree *tree, uint64 key)
+{
+ int shift = key_get_shift(key);
+ bool inner = shift > 0;
+ rt_node *newnode;
+
+ newnode = rt_alloc_node(tree, RT_CLASS_4_FULL, inner);
+ rt_init_node(newnode, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
+ newnode->shift = shift;
+ tree->max_val = shift_get_max_val(shift);
+ tree->root = newnode;
+}
+
+/*
+ * Allocate a new node with the given node kind.
+ */
+static rt_node *
+rt_alloc_node(radix_tree *tree, rt_size_class size_class, bool inner)
+{
+ rt_node *newnode;
+
+ if (inner)
+ newnode = (rt_node *) MemoryContextAlloc(tree->inner_slabs[size_class],
+ rt_size_class_info[size_class].inner_size);
+ else
+ newnode = (rt_node *) MemoryContextAlloc(tree->leaf_slabs[size_class],
+ rt_size_class_info[size_class].leaf_size);
+
+#ifdef RT_DEBUG
+ /* update the statistics */
+ tree->cnt[size_class]++;
+#endif
+
+ return newnode;
+}
+
+/* Initialize the node contents */
+static inline void
+rt_init_node(rt_node *node, uint8 kind, rt_size_class size_class, bool inner)
+{
+ if (inner)
+ MemSet(node, 0, rt_size_class_info[size_class].inner_size);
+ else
+ MemSet(node, 0, rt_size_class_info[size_class].leaf_size);
+
+ node->kind = kind;
+ node->fanout = rt_size_class_info[size_class].fanout;
+
+ /* Initialize slot_idxs to invalid values */
+ if (kind == RT_NODE_KIND_125)
+ {
+ rt_node_base_125 *n125 = (rt_node_base_125 *) node;
+
+ memset(n125->slot_idxs, RT_NODE_125_INVALID_IDX, sizeof(n125->slot_idxs));
+ }
+
+ /*
+ * Technically it's 256, but we cannot store that in a uint8,
+ * and this is the max size class so it will never grow.
+ */
+ if (kind == RT_NODE_KIND_256)
+ node->fanout = 0;
+}
+
+static inline void
+rt_copy_node(rt_node *newnode, rt_node *oldnode)
+{
+ newnode->shift = oldnode->shift;
+ newnode->chunk = oldnode->chunk;
+ newnode->count = oldnode->count;
+}
+
+/*
+ * Create a new node with 'new_kind' and the same shift, chunk, and
+ * count of 'node'.
+ */
+static rt_node*
+rt_grow_node_kind(radix_tree *tree, rt_node *node, uint8 new_kind)
+{
+ rt_node *newnode;
+ bool inner = !NODE_IS_LEAF(node);
+
+ newnode = rt_alloc_node(tree, kind_min_size_class[new_kind], inner);
+ rt_init_node(newnode, new_kind, kind_min_size_class[new_kind], inner);
+ rt_copy_node(newnode, node);
+
+ return newnode;
+}
+
+/* Free the given node */
+static void
+rt_free_node(radix_tree *tree, rt_node *node)
+{
+ /* If we're deleting the root node, make the tree empty */
+ if (tree->root == node)
+ {
+ tree->root = NULL;
+ tree->max_val = 0;
+ }
+
+#ifdef RT_DEBUG
+ {
+ int i;
+
+ /* update the statistics */
+ for (i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ if (node->fanout == rt_size_class_info[i].fanout)
+ break;
+ }
+
+ /* fanout of node256 is intentionally 0 */
+ if (i == RT_SIZE_CLASS_COUNT)
+ i = RT_CLASS_256;
+
+ tree->cnt[i]--;
+ Assert(tree->cnt[i] >= 0);
+ }
+#endif
+
+ pfree(node);
+}
+
+/*
+ * Replace old_child with new_child, and free the old one.
+ */
+static void
+rt_replace_node(radix_tree *tree, rt_node *parent, rt_node *old_child,
+ rt_node *new_child, uint64 key)
+{
+ Assert(old_child->chunk == new_child->chunk);
+ Assert(old_child->shift == new_child->shift);
+
+ if (parent == old_child)
+ {
+ /* Replace the root node with the new large node */
+ tree->root = new_child;
+ }
+ else
+ {
+ bool replaced PG_USED_FOR_ASSERTS_ONLY;
+
+ replaced = rt_node_insert_inner(tree, NULL, parent, key, new_child);
+ Assert(replaced);
+ }
+
+ rt_free_node(tree, old_child);
+}
+
+/*
+ * The radix tree doesn't have sufficient height. Extend the radix tree so it can
+ * store the key.
+ */
+static void
+rt_extend(radix_tree *tree, uint64 key)
+{
+ int target_shift;
+ int shift = tree->root->shift + RT_NODE_SPAN;
+
+ target_shift = key_get_shift(key);
+
+ /* Grow tree from 'shift' to 'target_shift' */
+ while (shift <= target_shift)
+ {
+ rt_node_inner_4 *node;
+
+ node = (rt_node_inner_4 *) rt_alloc_node(tree, RT_CLASS_4_FULL, true);
+ rt_init_node((rt_node *) node, RT_NODE_KIND_4, RT_CLASS_4_FULL, true);
+ node->base.n.shift = shift;
+ node->base.n.count = 1;
+ node->base.chunks[0] = 0;
+ node->children[0] = tree->root;
+
+ tree->root->chunk = 0;
+ tree->root = (rt_node *) node;
+
+ shift += RT_NODE_SPAN;
+ }
+
+ tree->max_val = shift_get_max_val(target_shift);
+}
+
+/*
+ * The radix tree doesn't have inner and leaf nodes for the given key-value pair.
+ * Insert inner and leaf nodes from 'node' down to the bottom.
+ */
+static inline void
+rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node *parent,
+ rt_node *node)
+{
+ int shift = node->shift;
+
+ while (shift >= RT_NODE_SPAN)
+ {
+ rt_node *newchild;
+ int newshift = shift - RT_NODE_SPAN;
+ bool inner = newshift > 0;
+
+ newchild = rt_alloc_node(tree, RT_CLASS_4_FULL, inner);
+ rt_init_node(newchild, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
+ newchild->shift = newshift;
+ newchild->chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ rt_node_insert_inner(tree, parent, node, key, newchild);
+
+ parent = node;
+ node = newchild;
+ shift -= RT_NODE_SPAN;
+ }
+
+ rt_node_insert_leaf(tree, parent, node, key, value);
+ tree->num_keys++;
+}
+
+/*
+ * Search for the child pointer corresponding to 'key' in the given node, and
+ * do the specified 'action'.
+ *
+ * Return true if the key is found, otherwise return false. On success, the child
+ * pointer is set to child_p.
+ */
+static inline bool
+rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **child_p)
+{
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool found = false;
+ rt_node *child = NULL;
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+ int idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
+
+ if (idx < 0)
+ break;
+
+ found = true;
+
+ if (action == RT_ACTION_FIND)
+ child = n4->children[idx];
+ else /* RT_ACTION_DELETE */
+ chunk_children_array_delete(n4->base.chunks, n4->children,
+ n4->base.n.count, idx);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+ int idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
+
+ if (idx < 0)
+ break;
+
+ found = true;
+ if (action == RT_ACTION_FIND)
+ child = n32->children[idx];
+ else /* RT_ACTION_DELETE */
+ chunk_children_array_delete(n32->base.chunks, n32->children,
+ n32->base.n.count, idx);
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ rt_node_inner_125 *n125 = (rt_node_inner_125 *) node;
+
+ if (!node_125_is_chunk_used((rt_node_base_125 *) n125, chunk))
+ break;
+
+ found = true;
+
+ if (action == RT_ACTION_FIND)
+ child = node_inner_125_get_child(n125, chunk);
+ else /* RT_ACTION_DELETE */
+ node_inner_125_delete(n125, chunk);
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+
+ if (!node_inner_256_is_chunk_used(n256, chunk))
+ break;
+
+ found = true;
+ if (action == RT_ACTION_FIND)
+ child = node_inner_256_get_child(n256, chunk);
+ else /* RT_ACTION_DELETE */
+ node_inner_256_delete(n256, chunk);
+
+ break;
+ }
+ }
+
+ /* update statistics */
+ if (action == RT_ACTION_DELETE && found)
+ node->count--;
+
+ if (found && child_p)
+ *child_p = child;
+
+ return found;
+}
+
+/*
+ * Search for the value corresponding to 'key' in the given node, and do the
+ * specified 'action'.
+ *
+ * Return true if the key is found, otherwise return false. On success, the pointer
+ * to the value is set to value_p.
+ */
+static inline bool
+rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p)
+{
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool found = false;
+ uint64 value = 0;
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+ int idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
+
+ if (idx < 0)
+ break;
+
+ found = true;
+
+ if (action == RT_ACTION_FIND)
+ value = n4->values[idx];
+ else /* RT_ACTION_DELETE */
+ chunk_values_array_delete(n4->base.chunks, (uint64 *) n4->values,
+ n4->base.n.count, idx);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+ int idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
+
+ if (idx < 0)
+ break;
+
+ found = true;
+ if (action == RT_ACTION_FIND)
+ value = n32->values[idx];
+ else /* RT_ACTION_DELETE */
+ chunk_values_array_delete(n32->base.chunks, (uint64 *) n32->values,
+ n32->base.n.count, idx);
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) node;
+
+ if (!node_125_is_chunk_used((rt_node_base_125 *) n125, chunk))
+ break;
+
+ found = true;
+
+ if (action == RT_ACTION_FIND)
+ value = node_leaf_125_get_value(n125, chunk);
+ else /* RT_ACTION_DELETE */
+ node_leaf_125_delete(n125, chunk);
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+
+ if (!node_leaf_256_is_chunk_used(n256, chunk))
+ break;
+
+ found = true;
+ if (action == RT_ACTION_FIND)
+ value = node_leaf_256_get_value(n256, chunk);
+ else /* RT_ACTION_DELETE */
+ node_leaf_256_delete(n256, chunk);
+
+ break;
+ }
+ }
+
+ /* update statistics */
+ if (action == RT_ACTION_DELETE && found)
+ node->count--;
+
+ if (found && value_p)
+ *value_p = value;
+
+ return found;
+}
+
+/* Insert the child to the inner node */
+static bool
+rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 key,
+ rt_node *child)
+{
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool chunk_exists = false;
+
+ Assert(!NODE_IS_LEAF(node));
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+ int idx;
+
+ idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n4->children[idx] = child;
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n4)))
+ {
+ rt_node_inner_32 *new32;
+ Assert(parent != NULL);
+
+ /* grow node from 4 to 32 */
+ new32 = (rt_node_inner_32 *) rt_grow_node_kind(tree, (rt_node *) n4,
+ RT_NODE_KIND_32);
+ chunk_children_array_copy(n4->base.chunks, n4->children,
+ new32->base.chunks, new32->children);
+
+ Assert(parent != NULL);
+ rt_replace_node(tree, parent, (rt_node *) n4, (rt_node *) new32,
+ key);
+ node = (rt_node *) new32;
+ }
+ else
+ {
+ int insertpos = node_4_get_insertpos((rt_node_base_4 *) n4, chunk);
+ uint16 count = n4->base.n.count;
+
+ /* shift chunks and children */
+ if (count != 0 && insertpos < count)
+ chunk_children_array_shift(n4->base.chunks, n4->children,
+ count, insertpos);
+
+ n4->base.chunks[insertpos] = chunk;
+ n4->children[insertpos] = child;
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_32:
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+ int idx;
+
+ idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n32->children[idx] = child;
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)))
+ {
+ Assert(parent != NULL);
+
+ if (n32->base.n.count == rt_size_class_info[RT_CLASS_32_PARTIAL].fanout)
+ {
+ /* use the same node kind, but expand to the next size class */
+ const Size size = rt_size_class_info[RT_CLASS_32_PARTIAL].inner_size;
+ const int fanout = rt_size_class_info[RT_CLASS_32_FULL].fanout;
+ rt_node_inner_32 *new32;
+
+ new32 = (rt_node_inner_32 *) rt_alloc_node(tree, RT_CLASS_32_FULL, true);
+ memcpy(new32, n32, size);
+ new32->base.n.fanout = fanout;
+
+ rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new32, key);
+
+ /* must update both pointers here */
+ node = (rt_node *) new32;
+ n32 = new32;
+
+ goto retry_insert_inner_32;
+ }
+ else
+ {
+ rt_node_inner_125 *new125;
+
+ /* grow node from 32 to 125 */
+ new125 = (rt_node_inner_125 *) rt_grow_node_kind(tree, (rt_node *) n32,
+ RT_NODE_KIND_125);
+ for (int i = 0; i < n32->base.n.count; i++)
+ node_inner_125_insert(new125, n32->base.chunks[i], n32->children[i]);
+
+ rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new125, key);
+ node = (rt_node *) new125;
+ }
+ }
+ else
+ {
+retry_insert_inner_32:
+ {
+ int insertpos = node_32_get_insertpos((rt_node_base_32 *) n32, chunk);
+ int16 count = n32->base.n.count;
+
+ if (count != 0 && insertpos < count)
+ chunk_children_array_shift(n32->base.chunks, n32->children,
+ count, insertpos);
+
+ n32->base.chunks[insertpos] = chunk;
+ n32->children[insertpos] = child;
+ break;
+ }
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_125:
+ {
+ rt_node_inner_125 *n125 = (rt_node_inner_125 *) node;
+ int cnt = 0;
+
+ if (node_125_is_chunk_used((rt_node_base_125 *) n125, chunk))
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ node_inner_125_update(n125, chunk, child);
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n125)))
+ {
+ rt_node_inner_256 *new256;
+ Assert(parent != NULL);
+
+ /* grow node from 125 to 256 */
+ new256 = (rt_node_inner_256 *) rt_grow_node_kind(tree, (rt_node *) n125,
+ RT_NODE_KIND_256);
+ for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n125->base.n.count; i++)
+ {
+ if (!node_125_is_chunk_used((rt_node_base_125 *) n125, i))
+ continue;
+
+ node_inner_256_set(new256, i, node_inner_125_get_child(n125, i));
+ cnt++;
+ }
+
+ rt_replace_node(tree, parent, (rt_node *) n125, (rt_node *) new256,
+ key);
+ node = (rt_node *) new256;
+ }
+ else
+ {
+ node_inner_125_insert(n125, chunk, child);
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_256:
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+
+ chunk_exists = node_inner_256_is_chunk_used(n256, chunk);
+ Assert(chunk_exists || FIXED_NODE_HAS_FREE_SLOT(n256, RT_CLASS_256));
+
+ node_inner_256_set(n256, chunk, child);
+ break;
+ }
+ }
+
+ /* Update statistics */
+ if (!chunk_exists)
+ node->count++;
+
+ /*
+ * Done. Finally, verify the chunk and value is inserted or replaced
+ * properly in the node.
+ */
+ rt_verify_node(node);
+
+ return chunk_exists;
+}
+
+/* Insert the value to the leaf node */
+static bool
+rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
+ uint64 key, uint64 value)
+{
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool chunk_exists = false;
+
+ Assert(NODE_IS_LEAF(node));
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+ int idx;
+
+ idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n4->values[idx] = value;
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n4)))
+ {
+ rt_node_leaf_32 *new32;
+ Assert(parent != NULL);
+
+ /* grow node from 4 to 32 */
+ new32 = (rt_node_leaf_32 *) rt_grow_node_kind(tree, (rt_node *) n4,
+ RT_NODE_KIND_32);
+ chunk_values_array_copy(n4->base.chunks, n4->values,
+ new32->base.chunks, new32->values);
+ rt_replace_node(tree, parent, (rt_node *) n4, (rt_node *) new32, key);
+ node = (rt_node *) new32;
+ }
+ else
+ {
+ int insertpos = node_4_get_insertpos((rt_node_base_4 *) n4, chunk);
+ int count = n4->base.n.count;
+
+ /* shift chunks and values */
+ if (count != 0 && insertpos < count)
+ chunk_values_array_shift(n4->base.chunks, n4->values,
+ count, insertpos);
+
+ n4->base.chunks[insertpos] = chunk;
+ n4->values[insertpos] = value;
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_32:
+ {
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+ int idx;
+
+ idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n32->values[idx] = value;
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)))
+ {
+ Assert(parent != NULL);
+
+ if (n32->base.n.count == rt_size_class_info[RT_CLASS_32_PARTIAL].fanout)
+ {
+ /* use the same node kind, but expand to the next size class */
+ const Size size = rt_size_class_info[RT_CLASS_32_PARTIAL].leaf_size;
+ const int fanout = rt_size_class_info[RT_CLASS_32_FULL].fanout;
+ rt_node_leaf_32 *new32;
+
+ new32 = (rt_node_leaf_32 *) rt_alloc_node(tree, RT_CLASS_32_FULL, false);
+ memcpy(new32, n32, size);
+ new32->base.n.fanout = fanout;
+
+ rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new32, key);
+
+ /* must update both pointers here */
+ node = (rt_node *) new32;
+ n32 = new32;
+
+ goto retry_insert_leaf_32;
+ }
+ else
+ {
+ rt_node_leaf_125 *new125;
+
+ /* grow node from 32 to 125 */
+ new125 = (rt_node_leaf_125 *) rt_grow_node_kind(tree, (rt_node *) n32,
+ RT_NODE_KIND_125);
+ for (int i = 0; i < n32->base.n.count; i++)
+ node_leaf_125_insert(new125, n32->base.chunks[i], n32->values[i]);
+
+ rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new125,
+ key);
+ node = (rt_node *) new125;
+ }
+ }
+ else
+ {
+ retry_insert_leaf_32:
+ {
+ int insertpos = node_32_get_insertpos((rt_node_base_32 *) n32, chunk);
+ int count = n32->base.n.count;
+
+ if (count != 0 && insertpos < count)
+ chunk_values_array_shift(n32->base.chunks, n32->values,
+ count, insertpos);
+
+ n32->base.chunks[insertpos] = chunk;
+ n32->values[insertpos] = value;
+ break;
+ }
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_125:
+ {
+ rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) node;
+ int cnt = 0;
+
+ if (node_125_is_chunk_used((rt_node_base_125 *) n125, chunk))
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ node_leaf_125_update(n125, chunk, value);
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n125)))
+ {
+ rt_node_leaf_256 *new256;
+ Assert(parent != NULL);
+
+ /* grow node from 125 to 256 */
+ new256 = (rt_node_leaf_256 *) rt_grow_node_kind(tree, (rt_node *) n125,
+ RT_NODE_KIND_256);
+ for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n125->base.n.count; i++)
+ {
+ if (!node_125_is_chunk_used((rt_node_base_125 *) n125, i))
+ continue;
+
+ node_leaf_256_set(new256, i, node_leaf_125_get_value(n125, i));
+ cnt++;
+ }
+
+ rt_replace_node(tree, parent, (rt_node *) n125, (rt_node *) new256,
+ key);
+ node = (rt_node *) new256;
+ }
+ else
+ {
+ node_leaf_125_insert(n125, chunk, value);
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_256:
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+
+ chunk_exists = node_leaf_256_is_chunk_used(n256, chunk);
+ Assert(chunk_exists || FIXED_NODE_HAS_FREE_SLOT(n256, RT_CLASS_256));
+
+ node_leaf_256_set(n256, chunk, value);
+ break;
+ }
+ }
+
+ /* Update statistics */
+ if (!chunk_exists)
+ node->count++;
+
+ /*
+ * Done. Finally, verify the chunk and value is inserted or replaced
+ * properly in the node.
+ */
+ rt_verify_node(node);
+
+ return chunk_exists;
+}
+
+/*
+ * Create the radix tree in the given memory context and return it.
+ */
+radix_tree *
+rt_create(MemoryContext ctx)
+{
+ radix_tree *tree;
+ MemoryContext old_ctx;
+
+ old_ctx = MemoryContextSwitchTo(ctx);
+
+ tree = palloc(sizeof(radix_tree));
+ tree->context = ctx;
+ tree->root = NULL;
+ tree->max_val = 0;
+ tree->num_keys = 0;
+
+ /* Create the slab allocator for each size class */
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ tree->inner_slabs[i] = SlabContextCreate(ctx,
+ rt_size_class_info[i].name,
+ rt_size_class_info[i].inner_blocksize,
+ rt_size_class_info[i].inner_size);
+ tree->leaf_slabs[i] = SlabContextCreate(ctx,
+ rt_size_class_info[i].name,
+ rt_size_class_info[i].leaf_blocksize,
+ rt_size_class_info[i].leaf_size);
+#ifdef RT_DEBUG
+ tree->cnt[i] = 0;
+#endif
+ }
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return tree;
+}
+
+/*
+ * Free the given radix tree.
+ */
+void
+rt_free(radix_tree *tree)
+{
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ MemoryContextDelete(tree->inner_slabs[i]);
+ MemoryContextDelete(tree->leaf_slabs[i]);
+ }
+
+ pfree(tree);
+}
+
+/*
+ * Set key to value. If the entry already exists, we update its value to 'value'
+ * and return true. Returns false if entry doesn't yet exist.
+ */
+bool
+rt_set(radix_tree *tree, uint64 key, uint64 value)
+{
+ int shift;
+ bool updated;
+ rt_node *node;
+ rt_node *parent;
+
+ /* Empty tree, create the root */
+ if (!tree->root)
+ rt_new_root(tree, key);
+
+ /* Extend the tree if necessary */
+ if (key > tree->max_val)
+ rt_extend(tree, key);
+
+ Assert(tree->root);
+
+ shift = tree->root->shift;
+ node = parent = tree->root;
+
+ /* Descend the tree until a leaf node */
+ while (shift >= 0)
+ {
+ rt_node *child;
+
+ if (NODE_IS_LEAF(node))
+ break;
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ {
+ rt_set_extend(tree, key, value, parent, node);
+ return false;
+ }
+
+ parent = node;
+ node = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ updated = rt_node_insert_leaf(tree, parent, node, key, value);
+
+ /* Update the statistics */
+ if (!updated)
+ tree->num_keys++;
+
+ return updated;
+}
+
+/*
+ * Search for the given key in the radix tree. Return true if the key is found,
+ * otherwise return false. On success, we set the value to *value_p, so it must
+ * not be NULL.
+ */
+bool
+rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
+{
+ rt_node *node;
+ int shift;
+
+ Assert(value_p != NULL);
+
+ if (!tree->root || key > tree->max_val)
+ return false;
+
+ node = tree->root;
+ shift = tree->root->shift;
+
+ /* Descend the tree until a leaf node */
+ while (shift >= 0)
+ {
+ rt_node *child;
+
+ if (NODE_IS_LEAF(node))
+ break;
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ return false;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ return rt_node_search_leaf(node, key, RT_ACTION_FIND, value_p);
+}
+
+/*
+ * Delete the given key from the radix tree. Return true if the key is found (and
+ * deleted), otherwise do nothing and return false.
+ */
+bool
+rt_delete(radix_tree *tree, uint64 key)
+{
+ rt_node *node;
+ rt_node *stack[RT_MAX_LEVEL] = {0};
+ int shift;
+ int level;
+ bool deleted;
+
+ if (!tree->root || key > tree->max_val)
+ return false;
+
+ /*
+ * Descend the tree to search the key while building a stack of nodes we
+ * visited.
+ */
+ node = tree->root;
+ shift = tree->root->shift;
+ level = -1;
+ while (shift > 0)
+ {
+ rt_node *child;
+
+ /* Push the current node to the stack */
+ stack[++level] = node;
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ return false;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+	/* Delete the key from the leaf node if it exists */
+ Assert(NODE_IS_LEAF(node));
+ deleted = rt_node_search_leaf(node, key, RT_ACTION_DELETE, NULL);
+
+ if (!deleted)
+ {
+		/* the key was not found in the leaf node */
+ return false;
+ }
+
+ /* Found the key to delete. Update the statistics */
+ tree->num_keys--;
+
+ /*
+ * Return if the leaf node still has keys and we don't need to delete the
+ * node.
+ */
+ if (!NODE_IS_EMPTY(node))
+ return true;
+
+ /* Free the empty leaf node */
+ rt_free_node(tree, node);
+
+	/* Delete the key from the inner nodes, walking back up the stack */
+ while (level >= 0)
+ {
+ node = stack[level--];
+
+ deleted = rt_node_search_inner(node, key, RT_ACTION_DELETE, NULL);
+ Assert(deleted);
+
+ /* If the node didn't become empty, we stop deleting the key */
+ if (!NODE_IS_EMPTY(node))
+ break;
+
+ /* The node became empty */
+ rt_free_node(tree, node);
+ }
+
+ return true;
+}
+
+/* Create and return the iterator for the given radix tree */
+rt_iter *
+rt_begin_iterate(radix_tree *tree)
+{
+ MemoryContext old_ctx;
+ rt_iter *iter;
+ int top_level;
+
+ old_ctx = MemoryContextSwitchTo(tree->context);
+
+ iter = (rt_iter *) palloc0(sizeof(rt_iter));
+ iter->tree = tree;
+
+ /* empty tree */
+ if (!iter->tree->root)
+ return iter;
+
+ top_level = iter->tree->root->shift / RT_NODE_SPAN;
+ iter->stack_len = top_level;
+
+ /*
+	 * Descend to the leftmost leaf node from the root. The key is
+	 * constructed while descending to the leaf.
+ */
+ rt_update_iter_stack(iter, iter->tree->root, top_level);
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return iter;
+}
+
+/*
+ * Update the node_iter for each node in the iterator's node stack, descending from 'from_node'.
+ */
+static void
+rt_update_iter_stack(rt_iter *iter, rt_node *from_node, int from)
+{
+ int level = from;
+ rt_node *node = from_node;
+
+ for (;;)
+ {
+ rt_node_iter *node_iter = &(iter->stack[level--]);
+
+ node_iter->node = node;
+ node_iter->current_idx = -1;
+
+ /* We don't advance the leaf node iterator here */
+ if (NODE_IS_LEAF(node))
+ return;
+
+ /* Advance to the next slot in the inner node */
+ node = rt_node_inner_iterate_next(iter, node_iter);
+
+		/* We must find the first child in the node */
+ Assert(node);
+ }
+}
+
+/*
+ * If there is a next key, set *key_p and *value_p and return true. Otherwise,
+ * return false.
+ */
+bool
+rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p)
+{
+ /* Empty tree */
+ if (!iter->tree->root)
+ return false;
+
+ for (;;)
+ {
+ rt_node *child = NULL;
+ uint64 value;
+ int level;
+ bool found;
+
+ /* Advance the leaf node iterator to get next key-value pair */
+ found = rt_node_leaf_iterate_next(iter, &(iter->stack[0]), &value);
+
+ if (found)
+ {
+ *key_p = iter->key;
+ *value_p = value;
+ return true;
+ }
+
+ /*
+		 * We've visited all values in the leaf node, so advance the inner node
+		 * iterators, starting from level 1, until we find the next child node.
+ */
+ for (level = 1; level <= iter->stack_len; level++)
+ {
+ child = rt_node_inner_iterate_next(iter, &(iter->stack[level]));
+
+ if (child)
+ break;
+ }
+
+ /* the iteration finished */
+ if (!child)
+ return false;
+
+ /*
+		 * Found the next child node, so update the iterator stack from this
+		 * node down to the leaf level.
+ */
+ rt_update_iter_stack(iter, child, level - 1);
+
+ /* Node iterators are updated, so try again from the leaf */
+ }
+
+ return false;
+}
+
+void
+rt_end_iterate(rt_iter *iter)
+{
+ pfree(iter);
+}
+
+static inline void
+rt_iter_update_key(rt_iter *iter, uint8 chunk, uint8 shift)
+{
+ iter->key &= ~(((uint64) RT_CHUNK_MASK) << shift);
+ iter->key |= (((uint64) chunk) << shift);
+}
+
+/*
+ * Advance to the next slot in the inner node. Return the child if one exists,
+ * otherwise NULL.
+ */
+static inline rt_node *
+rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
+{
+ rt_node *child = NULL;
+ bool found = false;
+ uint8 key_chunk;
+
+ switch (node_iter->node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n4->base.n.count)
+ break;
+
+ child = n4->children[node_iter->current_idx];
+ key_chunk = n4->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n32->base.n.count)
+ break;
+
+ child = n32->children[node_iter->current_idx];
+ key_chunk = n32->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ rt_node_inner_125 *n125 = (rt_node_inner_125 *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (node_125_is_chunk_used((rt_node_base_125 *) n125, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+ child = node_inner_125_get_child(n125, i);
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (node_inner_256_is_chunk_used(n256, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+ child = node_inner_256_get_child(n256, i);
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ }
+
+ if (found)
+ rt_iter_update_key(iter, key_chunk, node_iter->node->shift);
+
+ return child;
+}
+
+/*
+ * Advance to the next slot in the leaf node. On success, return true and set
+ * the value to *value_p; otherwise return false.
+ */
+static inline bool
+rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
+ uint64 *value_p)
+{
+ rt_node *node = node_iter->node;
+ bool found = false;
+ uint64 value;
+ uint8 key_chunk;
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n4->base.n.count)
+ break;
+
+ value = n4->values[node_iter->current_idx];
+ key_chunk = n4->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n32->base.n.count)
+ break;
+
+ value = n32->values[node_iter->current_idx];
+ key_chunk = n32->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (node_125_is_chunk_used((rt_node_base_125 *) n125, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+ value = node_leaf_125_get_value(n125, i);
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (node_leaf_256_is_chunk_used(n256, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+ value = node_leaf_256_get_value(n256, i);
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ }
+
+ if (found)
+ {
+ rt_iter_update_key(iter, key_chunk, node_iter->node->shift);
+ *value_p = value;
+ }
+
+ return found;
+}
+
+/*
+ * Return the number of keys in the radix tree.
+ */
+uint64
+rt_num_entries(radix_tree *tree)
+{
+ return tree->num_keys;
+}
+
+/*
+ * Return the amount of memory used by the radix tree.
+ */
+uint64
+rt_memory_usage(radix_tree *tree)
+{
+ Size total = sizeof(radix_tree);
+
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
+ total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
+ }
+
+ return total;
+}
+
+/*
+ * Verify the radix tree node.
+ */
+static void
+rt_verify_node(rt_node *node)
+{
+#ifdef USE_ASSERT_CHECKING
+ Assert(node->count >= 0);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_base_4 *n4 = (rt_node_base_4 *) node;
+
+ for (int i = 1; i < n4->n.count; i++)
+ Assert(n4->chunks[i - 1] < n4->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_base_32 *n32 = (rt_node_base_32 *) node;
+
+ for (int i = 1; i < n32->n.count; i++)
+ Assert(n32->chunks[i - 1] < n32->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ rt_node_base_125 *n125 = (rt_node_base_125 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!node_125_is_chunk_used(n125, i))
+ continue;
+
+ /* Check if the corresponding slot is used */
+ if (NODE_IS_LEAF(node))
+ Assert(node_leaf_125_is_slot_used((rt_node_leaf_125 *) node,
+ n125->slot_idxs[i]));
+ else
+ Assert(node_inner_125_is_slot_used((rt_node_inner_125 *) node,
+ n125->slot_idxs[i]));
+
+ cnt++;
+ }
+
+ Assert(n125->n.count == cnt);
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RT_NODE_NSLOTS_BITS(RT_NODE_MAX_SLOTS); i++)
+ cnt += pg_popcount32(n256->isset[i]);
+
+				/* Check that the number of used chunks matches the count */
+ Assert(n256->base.n.count == cnt);
+
+ break;
+ }
+ }
+ }
+#endif
+}
+
+/***************** DEBUG FUNCTIONS *****************/
+#ifdef RT_DEBUG
+void
+rt_stats(radix_tree *tree)
+{
+ ereport(NOTICE, (errmsg("num_keys = " UINT64_FORMAT ", height = %d, n4 = %u, n15 = %u, n32 = %u, n125 = %u, n256 = %u",
+ tree->num_keys,
+ tree->root->shift / RT_NODE_SPAN,
+ tree->cnt[RT_CLASS_4_FULL],
+ tree->cnt[RT_CLASS_32_PARTIAL],
+ tree->cnt[RT_CLASS_32_FULL],
+ tree->cnt[RT_CLASS_125_FULL],
+ tree->cnt[RT_CLASS_256])));
+}
+
+static void
+rt_dump_node(rt_node *node, int level, bool recurse)
+{
+ char space[125] = {0};
+
+ fprintf(stderr, "[%s] kind %d, fanout %d, count %u, shift %u, chunk 0x%X:\n",
+ NODE_IS_LEAF(node) ? "LEAF" : "INNR",
+ (node->kind == RT_NODE_KIND_4) ? 4 :
+ (node->kind == RT_NODE_KIND_32) ? 32 :
+ (node->kind == RT_NODE_KIND_125) ? 125 : 256,
+ node->fanout == 0 ? 256 : node->fanout,
+ node->count, node->shift, node->chunk);
+
+ if (level > 0)
+ sprintf(space, "%*c", level * 4, ' ');
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, n4->base.chunks[i], n4->values[i]);
+ }
+ else
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, n4->base.chunks[i]);
+
+ if (recurse)
+ rt_dump_node(n4->children[i], level + 1, recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, n32->base.chunks[i], n32->values[i]);
+ }
+ else
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, n32->base.chunks[i]);
+
+ if (recurse)
+ {
+ rt_dump_node(n32->children[i], level + 1, recurse);
+ }
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ rt_node_base_125 *b125 = (rt_node_base_125 *) node;
+
+ fprintf(stderr, "slot_idxs ");
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!node_125_is_chunk_used(b125, i))
+ continue;
+
+ fprintf(stderr, " [%d]=%d, ", i, b125->slot_idxs[i]);
+ }
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_125 *n = (rt_node_leaf_125 *) node;
+
+ fprintf(stderr, ", isset-bitmap:");
+ for (int i = 0; i < 16; i++)
+ {
+ fprintf(stderr, "%X ", (uint8) n->isset[i]);
+ }
+ fprintf(stderr, "\n");
+ }
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!node_125_is_chunk_used(b125, i))
+ continue;
+
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) b125;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, i, node_leaf_125_get_value(n125, i));
+ }
+ else
+ {
+ rt_node_inner_125 *n125 = (rt_node_inner_125 *) b125;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, i);
+
+ if (recurse)
+ rt_dump_node(node_inner_125_get_child(n125, i),
+ level + 1, recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+
+ if (!node_leaf_256_is_chunk_used(n256, i))
+ continue;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, i, node_leaf_256_get_value(n256, i));
+ }
+ else
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+
+ if (!node_inner_256_is_chunk_used(n256, i))
+ continue;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, i);
+
+ if (recurse)
+ rt_dump_node(node_inner_256_get_child(n256, i), level + 1,
+ recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ }
+}
+
+void
+rt_dump_search(radix_tree *tree, uint64 key)
+{
+ rt_node *node;
+ int shift;
+ int level = 0;
+
+ elog(NOTICE, "-----------------------------------------------------------");
+ elog(NOTICE, "max_val = " UINT64_FORMAT "(0x" UINT64_FORMAT_HEX ")",
+ tree->max_val, tree->max_val);
+
+ if (!tree->root)
+ {
+ elog(NOTICE, "tree is empty");
+ return;
+ }
+
+ if (key > tree->max_val)
+ {
+ elog(NOTICE, "key " UINT64_FORMAT "(0x" UINT64_FORMAT_HEX ") is larger than max val",
+ key, key);
+ return;
+ }
+
+ node = tree->root;
+ shift = tree->root->shift;
+ while (shift >= 0)
+ {
+ rt_node *child;
+
+ rt_dump_node(node, level, false);
+
+ if (NODE_IS_LEAF(node))
+ {
+ uint64 dummy;
+
+			/* We reached a leaf node, find the corresponding slot */
+ rt_node_search_leaf(node, key, RT_ACTION_FIND, &dummy);
+
+ break;
+ }
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ break;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ level++;
+ }
+}
+
+void
+rt_dump(radix_tree *tree)
+{
+
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ fprintf(stderr, "%s\tinner_size %zu\tinner_blocksize %zu\tleaf_size %zu\tleaf_blocksize %zu\n",
+ rt_size_class_info[i].name,
+ rt_size_class_info[i].inner_size,
+ rt_size_class_info[i].inner_blocksize,
+ rt_size_class_info[i].leaf_size,
+ rt_size_class_info[i].leaf_blocksize);
+ fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->max_val);
+
+ if (!tree->root)
+ {
+ fprintf(stderr, "empty tree\n");
+ return;
+ }
+
+ rt_dump_node(tree->root, 0, true);
+}
+#endif
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
new file mode 100644
index 0000000000..d5d7668617
--- /dev/null
+++ b/src/include/lib/radixtree.h
@@ -0,0 +1,42 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.h
+ * Interface for radix tree.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/lib/radixtree.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef RADIXTREE_H
+#define RADIXTREE_H
+
+#include "postgres.h"
+
+#define RT_DEBUG 1
+
+typedef struct radix_tree radix_tree;
+typedef struct rt_iter rt_iter;
+
+extern radix_tree *rt_create(MemoryContext ctx);
+extern void rt_free(radix_tree *tree);
+extern bool rt_search(radix_tree *tree, uint64 key, uint64 *val_p);
+extern bool rt_set(radix_tree *tree, uint64 key, uint64 val);
+extern rt_iter *rt_begin_iterate(radix_tree *tree);
+
+extern bool rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p);
+extern void rt_end_iterate(rt_iter *iter);
+extern bool rt_delete(radix_tree *tree, uint64 key);
+
+extern uint64 rt_memory_usage(radix_tree *tree);
+extern uint64 rt_num_entries(radix_tree *tree);
+
+#ifdef RT_DEBUG
+extern void rt_dump(radix_tree *tree);
+extern void rt_dump_search(radix_tree *tree, uint64 key);
+extern void rt_stats(radix_tree *tree);
+#endif
+
+#endif /* RADIXTREE_H */
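For reviewers who want to see the interface in isolation, here is a minimal usage sketch of the API declared above. It is not part of the patch: the function name radix_tree_example is made up, and it assumes a caller running in an ordinary backend memory context, much like the test module further down.

#include "postgres.h"
#include "lib/radixtree.h"

static void
radix_tree_example(void)
{
	radix_tree *tree = rt_create(CurrentMemoryContext);
	rt_iter    *iter;
	uint64		key;
	uint64		value;

	/* insert or update; rt_set() returns true iff the key already existed */
	(void) rt_set(tree, 10, 123);

	/* point lookup */
	if (rt_search(tree, 10, &value))
		elog(NOTICE, "found value " UINT64_FORMAT, value);

	/* iterate over all key-value pairs, in ascending key order */
	iter = rt_begin_iterate(tree);
	while (rt_iterate_next(iter, &key, &value))
		elog(NOTICE, "key " UINT64_FORMAT " -> " UINT64_FORMAT, key, value);
	rt_end_iterate(iter);

	(void) rt_delete(tree, 10);
	rt_free(tree);
}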
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 96addded81..11d0ec5b07 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -27,6 +27,7 @@ SUBDIRS = \
test_parser \
test_pg_dump \
test_predtest \
+ test_radixtree \
test_rbtree \
test_regex \
test_rls_hooks \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 1d26544854..568823b221 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -21,6 +21,7 @@ subdir('test_oat_hooks')
subdir('test_parser')
subdir('test_pg_dump')
subdir('test_predtest')
+subdir('test_radixtree')
subdir('test_rbtree')
subdir('test_regex')
subdir('test_rls_hooks')
diff --git a/src/test/modules/test_radixtree/.gitignore b/src/test/modules/test_radixtree/.gitignore
new file mode 100644
index 0000000000..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/test_radixtree/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_radixtree/Makefile b/src/test/modules/test_radixtree/Makefile
new file mode 100644
index 0000000000..da06b93da3
--- /dev/null
+++ b/src/test/modules/test_radixtree/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_radixtree/Makefile
+
+MODULE_big = test_radixtree
+OBJS = \
+ $(WIN32RES) \
+ test_radixtree.o
+PGFILEDESC = "test_radixtree - test code for src/backend/lib/radixtree.c"
+
+EXTENSION = test_radixtree
+DATA = test_radixtree--1.0.sql
+
+REGRESS = test_radixtree
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_radixtree
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_radixtree/README b/src/test/modules/test_radixtree/README
new file mode 100644
index 0000000000..a8b271869a
--- /dev/null
+++ b/src/test/modules/test_radixtree/README
@@ -0,0 +1,7 @@
+test_radixtree contains unit tests for the radix tree implementation in
+src/backend/lib/radixtree.c.
+
+The tests verify the correctness of the implementation, but they can also be
+used as a micro-benchmark. If you set the 'rt_test_stats' flag in
+test_radixtree.c, the tests will print extra information about execution time
+and memory usage.
diff --git a/src/test/modules/test_radixtree/expected/test_radixtree.out b/src/test/modules/test_radixtree/expected/test_radixtree.out
new file mode 100644
index 0000000000..ce645cb8b5
--- /dev/null
+++ b/src/test/modules/test_radixtree/expected/test_radixtree.out
@@ -0,0 +1,36 @@
+CREATE EXTENSION test_radixtree;
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
+NOTICE: testing basic operations with leaf node 4
+NOTICE: testing basic operations with inner node 4
+NOTICE: testing basic operations with leaf node 32
+NOTICE: testing basic operations with inner node 32
+NOTICE: testing basic operations with leaf node 125
+NOTICE: testing basic operations with inner node 125
+NOTICE: testing basic operations with leaf node 256
+NOTICE: testing basic operations with inner node 256
+NOTICE: testing radix tree node types with shift "0"
+NOTICE: testing radix tree node types with shift "8"
+NOTICE: testing radix tree node types with shift "16"
+NOTICE: testing radix tree node types with shift "24"
+NOTICE: testing radix tree node types with shift "32"
+NOTICE: testing radix tree node types with shift "40"
+NOTICE: testing radix tree node types with shift "48"
+NOTICE: testing radix tree node types with shift "56"
+NOTICE: testing radix tree with pattern "all ones"
+NOTICE: testing radix tree with pattern "alternating bits"
+NOTICE: testing radix tree with pattern "clusters of ten"
+NOTICE: testing radix tree with pattern "clusters of hundred"
+NOTICE: testing radix tree with pattern "one-every-64k"
+NOTICE: testing radix tree with pattern "sparse"
+NOTICE: testing radix tree with pattern "single values, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^60"
+ test_radixtree
+----------------
+
+(1 row)
+
diff --git a/src/test/modules/test_radixtree/meson.build b/src/test/modules/test_radixtree/meson.build
new file mode 100644
index 0000000000..f96bf159d6
--- /dev/null
+++ b/src/test/modules/test_radixtree/meson.build
@@ -0,0 +1,34 @@
+# FIXME: prevent install during main install, but not during test :/
+
+test_radixtree_sources = files(
+ 'test_radixtree.c',
+)
+
+if host_system == 'windows'
+ test_radixtree_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'test_radixtree',
+ '--FILEDESC', 'test_radixtree - test code for src/backend/lib/radixtree.c',])
+endif
+
+test_radixtree = shared_module('test_radixtree',
+ test_radixtree_sources,
+ kwargs: pg_mod_args,
+)
+testprep_targets += test_radixtree
+
+install_data(
+ 'test_radixtree.control',
+ 'test_radixtree--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'test_radixtree',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'test_radixtree',
+ ],
+ },
+}
diff --git a/src/test/modules/test_radixtree/sql/test_radixtree.sql b/src/test/modules/test_radixtree/sql/test_radixtree.sql
new file mode 100644
index 0000000000..41ece5e9f5
--- /dev/null
+++ b/src/test/modules/test_radixtree/sql/test_radixtree.sql
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_radixtree;
+
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
diff --git a/src/test/modules/test_radixtree/test_radixtree--1.0.sql b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
new file mode 100644
index 0000000000..074a5a7ea7
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_radixtree/test_radixtree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_radixtree" to load this file. \quit
+
+CREATE FUNCTION test_radixtree()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
new file mode 100644
index 0000000000..ea993e63df
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -0,0 +1,581 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_radixtree.c
+ * Test radixtree set data structure.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/test/modules/test_radixtree/test_radixtree.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "lib/radixtree.h"
+#include "miscadmin.h"
+#include "nodes/bitmapset.h"
+#include "storage/block.h"
+#include "storage/itemptr.h"
+#include "utils/memutils.h"
+#include "utils/timestamp.h"
+
+#define UINT64_HEX_FORMAT "%" INT64_MODIFIER "X"
+
+/*
+ * If you enable this, the "pattern" tests will print information about
+ * how long populating, probing, and iterating the test set takes, and
+ * how much memory the test set consumed. That can be used as a
+ * micro-benchmark of various operations and input patterns (you might
+ * want to increase the number of values used in each of the tests, if
+ * you do that, to reduce noise).
+ *
+ * The information is printed to the server's stderr, mostly because
+ * that's where MemoryContextStats() output goes.
+ */
+static const bool rt_test_stats = false;
+
+static int rt_node_kind_fanouts[] = {
+ 0,
+ 4, /* RT_NODE_KIND_4 */
+ 32, /* RT_NODE_KIND_32 */
+ 125, /* RT_NODE_KIND_125 */
+ 256 /* RT_NODE_KIND_256 */
+};
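+
+/*
+ * (The leading 0 entry is a sentinel so that rt_node_kind_fanouts[idx - 1]
+ * gives the fanout of the previous node kind when the array is walked in
+ * test_node_types_insert().)
+ */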
+/*
+ * A struct to define a pattern of integers, for use with the test_pattern()
+ * function.
+ */
+typedef struct
+{
+ char *test_name; /* short name of the test, for humans */
+ char *pattern_str; /* a bit pattern */
+ uint64 spacing; /* pattern repeats at this interval */
+ uint64 num_values; /* number of integers to set in total */
+} test_spec;
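+
+/*
+ * As an example of how the specs below are expanded (matching the logic in
+ * test_pattern()): with pattern_str = "0101010101" and spacing = 10, the '1'
+ * characters sit at offsets 1, 3, 5, 7 and 9, so the keys set are
+ * 1, 3, 5, 7, 9, then 11, 13, 15, 17, 19, and so on, until num_values keys
+ * have been inserted.
+ */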
+
+/* Test patterns borrowed from test_integerset.c */
+static const test_spec test_specs[] = {
+ {
+ "all ones", "1111111111",
+ 10, 1000000
+ },
+ {
+ "alternating bits", "0101010101",
+ 10, 1000000
+ },
+ {
+ "clusters of ten", "1111111111",
+ 10000, 1000000
+ },
+ {
+ "clusters of hundred",
+ "1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111",
+ 10000, 1000000
+ },
+ {
+ "one-every-64k", "1",
+ 65536, 1000000
+ },
+ {
+ "sparse", "100000000000000000000000000000001",
+ 10000000, 1000000
+ },
+ {
+ "single values, distance > 2^32", "1",
+ UINT64CONST(10000000000), 100000
+ },
+ {
+ "clusters, distance > 2^32", "10101010",
+ UINT64CONST(10000000000), 1000000
+ },
+ {
+ "clusters, distance > 2^60", "10101010",
+ UINT64CONST(2000000000000000000),
+ 23 /* can't be much higher than this, or we
+ * overflow uint64 */
+ }
+};
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(test_radixtree);
+
+static void
+test_empty(void)
+{
+ radix_tree *radixtree;
+ rt_iter *iter;
+ uint64 dummy;
+ uint64 key;
+ uint64 val;
+
+ radixtree = rt_create(CurrentMemoryContext);
+
+ if (rt_search(radixtree, 0, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, 1, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, PG_UINT64_MAX, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_delete(radixtree, 0))
+ elog(ERROR, "rt_delete on empty tree returned true");
+
+ if (rt_num_entries(radixtree) != 0)
+ elog(ERROR, "rt_num_entries on empty tree return non-zero");
+
+ iter = rt_begin_iterate(radixtree);
+
+ if (rt_iterate_next(iter, &key, &val))
+ elog(ERROR, "rt_itereate_next on empty tree returned true");
+
+ rt_end_iterate(iter);
+
+ rt_free(radixtree);
+}
+
+static void
+test_basic(int children, bool test_inner)
+{
+ radix_tree *radixtree;
+ uint64 *keys;
+ int shift = test_inner ? 8 : 0;
+
+ elog(NOTICE, "testing basic operations with %s node %d",
+ test_inner ? "inner" : "leaf", children);
+
+ radixtree = rt_create(CurrentMemoryContext);
+
+	/* prepare keys in an order like 1, 32, 2, 31, 3, ... */
+ keys = palloc(sizeof(uint64) * children);
+ for (int i = 0; i < children; i++)
+ {
+ if (i % 2 == 0)
+ keys[i] = (uint64) ((i / 2) + 1) << shift;
+ else
+ keys[i] = (uint64) (children - (i / 2)) << shift;
+ }
+
+ /* insert keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (rt_set(radixtree, keys[i], keys[i]))
+ elog(ERROR, "new inserted key 0x" UINT64_HEX_FORMAT " is found ", keys[i]);
+ }
+
+ /* update keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (!rt_set(radixtree, keys[i], keys[i] + 1))
+ elog(ERROR, "could not update key 0x" UINT64_HEX_FORMAT, keys[i]);
+ }
+
+ /* repeat deleting and inserting keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (!rt_delete(radixtree, keys[i]))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, keys[i]);
+ if (rt_set(radixtree, keys[i], keys[i]))
+ elog(ERROR, "new inserted key 0x" UINT64_HEX_FORMAT " is found ", keys[i]);
+ }
+
+ pfree(keys);
+ rt_free(radixtree);
+}
+
+/*
+ * Check that the keys from 'start' to 'end' (shifted by 'shift') exist in the tree.
+ */
+static void
+check_search_on_node(radix_tree *radixtree, uint8 shift, int start, int end,
+ int incr)
+{
+ for (int i = start; i < end; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ uint64 val;
+
+ if (!rt_search(radixtree, key, &val))
+ elog(ERROR, "key 0x" UINT64_HEX_FORMAT " is not found on node-%d",
+ key, end);
+ if (val != key)
+ elog(ERROR, "rt_search with key 0x" UINT64_HEX_FORMAT " returns 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ key, val, key);
+ }
+}
+
+static void
+test_node_types_insert(radix_tree *radixtree, uint8 shift, bool insert_asc)
+{
+ uint64 num_entries;
+ int ninserted = 0;
+ int start = insert_asc ? 0 : 256;
+ int incr = insert_asc ? 1 : -1;
+ int end = insert_asc ? 256 : 0;
+ int node_kind_idx = 1;
+
+ for (int i = start; i != end; i += incr)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_set(radixtree, key, key);
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " is found", key);
+
+ /*
+ * After filling all slots in each node type, check if the values
+ * are stored properly.
+ */
+ if (ninserted == rt_node_kind_fanouts[node_kind_idx] - 1)
+ {
+ int check_start = insert_asc
+ ? rt_node_kind_fanouts[node_kind_idx - 1]
+ : rt_node_kind_fanouts[node_kind_idx];
+ int check_end = insert_asc
+ ? rt_node_kind_fanouts[node_kind_idx]
+ : rt_node_kind_fanouts[node_kind_idx - 1];
+
+ check_search_on_node(radixtree, shift, check_start, check_end, incr);
+ node_kind_idx++;
+ }
+
+ ninserted++;
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ if (num_entries != 256)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(256));
+}
+
+static void
+test_node_types_delete(radix_tree *radixtree, uint8 shift)
+{
+ uint64 num_entries;
+
+ for (int i = 0; i < 256; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_delete(radixtree, key);
+
+ if (!found)
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, key);
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ /* The tree must be empty */
+ if (num_entries != 0)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+			 num_entries, UINT64CONST(0));
+}
+
+/*
+ * Test inserting and deleting key-value pairs for each node type at the
+ * given shift level.
+ */
+static void
+test_node_types(uint8 shift)
+{
+ radix_tree *radixtree;
+
+ elog(NOTICE, "testing radix tree node types with shift \"%d\"", shift);
+
+ radixtree = rt_create(CurrentMemoryContext);
+
+ /*
+ * Insert and search entries for every node type at the 'shift' level,
+ * then delete all entries to make it empty, and insert and search entries
+ * again.
+ */
+ test_node_types_insert(radixtree, shift, true);
+ test_node_types_delete(radixtree, shift);
+ test_node_types_insert(radixtree, shift, false);
+
+ rt_free(radixtree);
+}
+
+/*
+ * Test with a repeating pattern, defined by the 'spec'.
+ */
+static void
+test_pattern(const test_spec * spec)
+{
+ radix_tree *radixtree;
+ rt_iter *iter;
+ MemoryContext radixtree_ctx;
+ TimestampTz starttime;
+ TimestampTz endtime;
+ uint64 n;
+ uint64 last_int;
+ uint64 ndeleted;
+ uint64 nbefore;
+ uint64 nafter;
+ int patternlen;
+ uint64 *pattern_values;
+ uint64 pattern_num_values;
+
+ elog(NOTICE, "testing radix tree with pattern \"%s\"", spec->test_name);
+ if (rt_test_stats)
+ fprintf(stderr, "-----\ntesting radix tree with pattern \"%s\"\n", spec->test_name);
+
+ /* Pre-process the pattern, creating an array of integers from it. */
+ patternlen = strlen(spec->pattern_str);
+ pattern_values = palloc(patternlen * sizeof(uint64));
+ pattern_num_values = 0;
+ for (int i = 0; i < patternlen; i++)
+ {
+ if (spec->pattern_str[i] == '1')
+ pattern_values[pattern_num_values++] = i;
+ }
+
+ /*
+ * Allocate the radix tree.
+ *
+ * Allocate it in a separate memory context, so that we can print its
+ * memory usage easily.
+ */
+ radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
+ "radixtree test",
+ ALLOCSET_SMALL_SIZES);
+ MemoryContextSetIdentifier(radixtree_ctx, spec->test_name);
+ radixtree = rt_create(radixtree_ctx);
+
+ /*
+ * Add values to the set.
+ */
+ starttime = GetCurrentTimestamp();
+
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ uint64 x = 0;
+
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ bool found;
+
+ x = last_int + pattern_values[i];
+
+ found = rt_set(radixtree, x, x);
+
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " found", x);
+
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+
+ endtime = GetCurrentTimestamp();
+
+ if (rt_test_stats)
+ fprintf(stderr, "added " UINT64_FORMAT " values in %d ms\n",
+ spec->num_values, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Print stats on the amount of memory used.
+ *
+ * We print the usage reported by rt_memory_usage(), as well as the stats
+ * from the memory context. They should be in the same ballpark, but it's
+ * hard to automate testing that, so if you're making changes to the
+ * implementation, just observe that manually.
+ */
+ if (rt_test_stats)
+ {
+ uint64 mem_usage;
+
+ /*
+ * Also print memory usage as reported by rt_memory_usage(). It
+ * should be in the same ballpark as the usage reported by
+ * MemoryContextStats().
+ */
+ mem_usage = rt_memory_usage(radixtree);
+ fprintf(stderr, "rt_memory_usage() reported " UINT64_FORMAT " (%0.2f bytes / integer)\n",
+ mem_usage, (double) mem_usage / spec->num_values);
+
+ MemoryContextStats(radixtree_ctx);
+ }
+
+ /* Check that rt_num_entries works */
+ n = rt_num_entries(radixtree);
+ if (n != spec->num_values)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT, n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_search()
+ */
+ starttime = GetCurrentTimestamp();
+
+ for (n = 0; n < 100000; n++)
+ {
+ bool found;
+ bool expected;
+ uint64 x;
+ uint64 v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Do we expect this value to be present in the set? */
+ if (x >= last_int)
+ expected = false;
+ else
+ {
+ uint64 idx = x % spec->spacing;
+
+ if (idx >= patternlen)
+ expected = false;
+ else if (spec->pattern_str[idx] == '1')
+ expected = true;
+ else
+ expected = false;
+ }
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (found != expected)
+ elog(ERROR, "mismatch at 0x" UINT64_HEX_FORMAT ": %d vs %d", x, found, expected);
+ if (found && (v != x))
+ elog(ERROR, "found 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ v, x);
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "probed " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Test iterator
+ */
+ starttime = GetCurrentTimestamp();
+
+ iter = rt_begin_iterate(radixtree);
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ uint64 expected = last_int + pattern_values[i];
+ uint64 x;
+ uint64 val;
+
+ if (!rt_iterate_next(iter, &x, &val))
+ break;
+
+ if (x != expected)
+ elog(ERROR,
+ "iterate returned wrong key; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d",
+ x, expected, i);
+ if (val != expected)
+ elog(ERROR,
+ "iterate returned wrong value; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d", x, expected, i);
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "iterated " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ rt_end_iterate(iter);
+
+ if (n < spec->num_values)
+ elog(ERROR, "iterator stopped short after " UINT64_FORMAT " entries, expected " UINT64_FORMAT, n, spec->num_values);
+ if (n > spec->num_values)
+ elog(ERROR, "iterator returned " UINT64_FORMAT " entries, " UINT64_FORMAT " was expected", n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_delete()
+ */
+ starttime = GetCurrentTimestamp();
+
+ nbefore = rt_num_entries(radixtree);
+ ndeleted = 0;
+ for (n = 0; n < 1; n++)
+ {
+ bool found;
+ uint64 x;
+ uint64 v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (!found)
+ continue;
+
+ /* If the key is found, delete it and check again */
+ if (!rt_delete(radixtree, x))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_search(radixtree, x, &v))
+ elog(ERROR, "found deleted key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_delete(radixtree, x))
+ elog(ERROR, "deleted already-deleted key 0x" UINT64_HEX_FORMAT, x);
+
+ ndeleted++;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "deleted " UINT64_FORMAT " values in %d ms\n",
+ ndeleted, (int) (endtime - starttime) / 1000);
+
+ nafter = rt_num_entries(radixtree);
+
+ /* Check that rt_num_entries works */
+ if ((nbefore - ndeleted) != nafter)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT "after " UINT64_FORMAT " deletion",
+ nafter, (nbefore - ndeleted), ndeleted);
+
+ MemoryContextDelete(radixtree_ctx);
+}
+
+Datum
+test_radixtree(PG_FUNCTION_ARGS)
+{
+ test_empty();
+
+ for (int i = 1; i < lengthof(rt_node_kind_fanouts); i++)
+ {
+ test_basic(rt_node_kind_fanouts[i], false);
+ test_basic(rt_node_kind_fanouts[i], true);
+ }
+
+ for (int shift = 0; shift <= (64 - 8); shift += 8)
+ test_node_types(shift);
+
+ /* Test different test patterns, with lots of entries */
+ for (int i = 0; i < lengthof(test_specs); i++)
+ test_pattern(&test_specs[i]);
+
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_radixtree/test_radixtree.control b/src/test/modules/test_radixtree/test_radixtree.control
new file mode 100644
index 0000000000..e53f2a3e0c
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.control
@@ -0,0 +1,4 @@
+comment = 'Test code for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/test_radixtree'
+relocatable = true
--
2.38.1
v13-0001-introduce-vector8_min-and-vector8_highbit_mask.patch
From 3b3d8b87123413bfc04ece39bdfbfdd784b3a02c Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 24 Oct 2022 14:07:09 +0900
Subject: [PATCH v13 1/8] introduce vector8_min and vector8_highbit_mask
---
src/include/port/simd.h | 47 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 47 insertions(+)
diff --git a/src/include/port/simd.h b/src/include/port/simd.h
index 61ae4ecf60..0b288c422a 100644
--- a/src/include/port/simd.h
+++ b/src/include/port/simd.h
@@ -77,6 +77,7 @@ static inline bool vector8_has(const Vector8 v, const uint8 c);
static inline bool vector8_has_zero(const Vector8 v);
static inline bool vector8_has_le(const Vector8 v, const uint8 c);
static inline bool vector8_is_highbit_set(const Vector8 v);
+static inline uint32 vector8_highbit_mask(const Vector8 v);
#ifndef USE_NO_SIMD
static inline bool vector32_is_highbit_set(const Vector32 v);
#endif
@@ -96,6 +97,7 @@ static inline Vector8 vector8_ssub(const Vector8 v1, const Vector8 v2);
*/
#ifndef USE_NO_SIMD
static inline Vector8 vector8_eq(const Vector8 v1, const Vector8 v2);
+static inline Vector8 vector8_min(const Vector8 v1, const Vector8 v2);
static inline Vector32 vector32_eq(const Vector32 v1, const Vector32 v2);
#endif
@@ -277,6 +279,36 @@ vector8_is_highbit_set(const Vector8 v)
#endif
}
+/*
+ * Return the bitmask of the high bit of each element.
+ */
+static inline uint32
+vector8_highbit_mask(const Vector8 v)
+{
+#ifdef USE_SSE2
+ return (uint32) _mm_movemask_epi8(v);
+#elif defined(USE_NEON)
+ static const uint8 mask[16] = {
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ };
+
+ uint8x16_t masked = vandq_u8(vld1q_u8(mask), (uint8x16_t) vshrq_n_s8(v, 7));
+ uint8x16_t maskedhi = vextq_u8(masked, masked, 8);
+
+ return (uint32) vaddvq_u16((uint16x8_t) vzip1q_u8(masked, maskedhi));
+#else
+ uint32 mask = 0;
+
+ for (Size i = 0; i < sizeof(Vector8); i++)
+ mask |= (((const uint8 *) &v)[i] >> 7) << i;
+
+ return mask;
+#endif
+}
+
/*
* Exactly like vector8_is_highbit_set except for the input type, so it
* looks at each byte separately.
@@ -372,4 +404,19 @@ vector32_eq(const Vector32 v1, const Vector32 v2)
}
#endif /* ! USE_NO_SIMD */
+/*
+ * Compare the given vectors and return a vector of the per-element minimums.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_min(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+ return _mm_min_epu8(v1, v2);
+#elif defined(USE_NEON)
+ return vminq_u8(v1, v2);
+#endif
+}
+#endif /* ! USE_NO_SIMD */
+
#endif /* SIMD_H */
--
2.38.1
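To illustrate how vector8_highbit_mask() and the existing simd.h helpers are meant to be combined, here is a hypothetical sketch, not taken from any of the attached patches: the helper name chunk_array_search_eq is made up, the chunk array is assumed to be padded to a multiple of sizeof(Vector8) (as the fixed-size node layouts are), and it relies on vector8_load(), vector8_broadcast() and vector8_eq() from simd.h plus pg_rightmost_one_pos32() from pg_bitutils.h.

#include "postgres.h"
#include "port/pg_bitutils.h"
#include "port/simd.h"

#ifndef USE_NO_SIMD
/*
 * Return the index of the first of the 'count' valid bytes in 'chunks' that
 * equals 'chunk', or -1 if there is none. 'chunks' must be padded to a
 * multiple of sizeof(Vector8) so that whole vectors can be loaded safely.
 */
static inline int
chunk_array_search_eq(const uint8 *chunks, uint8 chunk, int count)
{
	Vector8		spread_chunk = vector8_broadcast(chunk);

	for (int i = 0; i < count; i += sizeof(Vector8))
	{
		Vector8		haystack;
		uint32		bitfield;

		vector8_load(&haystack, &chunks[i]);
		bitfield = vector8_highbit_mask(vector8_eq(haystack, spread_chunk));

		if (bitfield)
		{
			int			index = i + pg_rightmost_one_pos32(bitfield);

			/* a match beyond 'count' is just uninitialized padding */
			return (index < count) ? index : -1;
		}
	}

	return -1;
}
#endif							/* ! USE_NO_SIMD */

With USE_NO_SIMD a node search would fall back to a plain byte-by-byte loop, since vector8_eq() is only defined for the SIMD builds.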
v13-0005-tool-for-measuring-radix-tree-performance.patch
From 3c4009682a186e3826803db8fb859cde527c6e76 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 16 Sep 2022 11:57:03 +0900
Subject: [PATCH v13 5/8] tool for measuring radix tree performance
---
contrib/bench_radix_tree/Makefile | 21 +
.../bench_radix_tree--1.0.sql | 76 +++
contrib/bench_radix_tree/bench_radix_tree.c | 635 ++++++++++++++++++
.../bench_radix_tree/bench_radix_tree.control | 6 +
contrib/bench_radix_tree/expected/bench.out | 13 +
contrib/bench_radix_tree/sql/bench.sql | 16 +
6 files changed, 767 insertions(+)
create mode 100644 contrib/bench_radix_tree/Makefile
create mode 100644 contrib/bench_radix_tree/bench_radix_tree--1.0.sql
create mode 100644 contrib/bench_radix_tree/bench_radix_tree.c
create mode 100644 contrib/bench_radix_tree/bench_radix_tree.control
create mode 100644 contrib/bench_radix_tree/expected/bench.out
create mode 100644 contrib/bench_radix_tree/sql/bench.sql
diff --git a/contrib/bench_radix_tree/Makefile b/contrib/bench_radix_tree/Makefile
new file mode 100644
index 0000000000..952bb0ceae
--- /dev/null
+++ b/contrib/bench_radix_tree/Makefile
@@ -0,0 +1,21 @@
+# contrib/bench_radix_tree/Makefile
+
+MODULE_big = bench_radix_tree
+OBJS = \
+ bench_radix_tree.o
+
+EXTENSION = bench_radix_tree
+DATA = bench_radix_tree--1.0.sql
+
+REGRESS = bench_fixed_height
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/bench_radix_tree
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
new file mode 100644
index 0000000000..83529805fc
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
@@ -0,0 +1,76 @@
+/* contrib/bench_radix_tree/bench_radix_tree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION bench_radix_tree" to load this file. \quit
+
+create function bench_shuffle_search(
+minblk int4,
+maxblk int4,
+random_block bool DEFAULT false,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT array_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT array_load_ms int8,
+OUT rt_search_ms int8,
+OUT array_search_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_seq_search(
+minblk int4,
+maxblk int4,
+random_block bool DEFAULT false,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT array_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT array_load_ms int8,
+OUT rt_search_ms int8,
+OUT array_search_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_load_random_int(
+cnt int8,
+OUT mem_allocated int8,
+OUT load_ms int8)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_search_random_nodes(
+cnt int8,
+filter_str text DEFAULT NULL,
+OUT mem_allocated int8,
+OUT search_ms int8)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_fixed_height_search(
+fanout int4,
+OUT fanout int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT rt_search_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_node128_load(
+fanout int4,
+OUT fanout int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT rt_sparseload_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c
new file mode 100644
index 0000000000..a0693695e6
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -0,0 +1,635 @@
+/*-------------------------------------------------------------------------
+ *
+ * bench_radix_tree.c
+ *
+ * Copyright (c) 2016-2022, PostgreSQL Global Development Group
+ *
+ * contrib/bench_radix_tree/bench_radix_tree.c
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "funcapi.h"
+#include "lib/radixtree.h"
+#include <math.h>
+#include "miscadmin.h"
+#include "storage/lwlock.h"
+#include "utils/builtins.h"
+#include "utils/timestamp.h"
+
+PG_MODULE_MAGIC;
+
+/* run benchmark also for binary-search case? */
+/* #define MEASURE_BINARY_SEARCH 1 */
+
+#define TIDS_PER_BLOCK_FOR_LOAD 30
+#define TIDS_PER_BLOCK_FOR_LOOKUP 50
+
+PG_FUNCTION_INFO_V1(bench_seq_search);
+PG_FUNCTION_INFO_V1(bench_shuffle_search);
+PG_FUNCTION_INFO_V1(bench_load_random_int);
+PG_FUNCTION_INFO_V1(bench_fixed_height_search);
+PG_FUNCTION_INFO_V1(bench_search_random_nodes);
+PG_FUNCTION_INFO_V1(bench_node128_load);
+
+static uint64
+tid_to_key_off(ItemPointer tid, uint32 *off)
+{
+ uint64 upper;
+ uint32 shift = pg_ceil_log2_32(MaxHeapTuplesPerPage);
+ int64 tid_i;
+
+ Assert(ItemPointerGetOffsetNumber(tid) < MaxHeapTuplesPerPage);
+
+ tid_i = ItemPointerGetOffsetNumber(tid);
+ tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << shift;
+
+ /* log(sizeof(uint64) * BITS_PER_BYTE, 2) = log(64, 2) = 6 */
+ *off = tid_i & ((1 << 6) - 1);
+ upper = tid_i >> 6;
+ Assert(*off < (sizeof(uint64) * BITS_PER_BYTE));
+
+ Assert(*off < 64);
+
+ return upper;
+}
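+
+/*
+ * Worked example of the encoding above (assuming 8kB pages, so
+ * MaxHeapTuplesPerPage = 291 and shift = 9): the TID (block 10, offset 3)
+ * becomes tid_i = (10 << 9) | 3 = 5123, which splits into
+ * key = 5123 >> 6 = 80 and *off = 5123 & 63 = 3, i.e. bit 3 of the 64-bit
+ * value stored under key 80.
+ */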
+
+static int
+shuffle_randrange(pg_prng_state *state, int lower, int upper)
+{
+ return (int) floor(pg_prng_double(state) * ((upper - lower) + 0.999999)) + lower;
+}
+
+/* Naive Fisher-Yates implementation */
+static void
+shuffle_itemptrs(ItemPointer itemptr, uint64 nitems)
+{
+	/* reproducibility */
+ pg_prng_state state;
+
+ pg_prng_seed(&state, 0);
+
+ for (int i = 0; i < nitems - 1; i++)
+ {
+ int j = shuffle_randrange(&state, i, nitems - 1);
+ ItemPointerData t = itemptr[j];
+
+ itemptr[j] = itemptr[i];
+ itemptr[i] = t;
+ }
+}
+
+static ItemPointer
+generate_tids(BlockNumber minblk, BlockNumber maxblk, int ntids_per_blk, uint64 *ntids_p,
+ bool random_block)
+{
+ ItemPointer tids;
+ uint64 maxitems;
+ uint64 ntids = 0;
+ pg_prng_state state;
+
+ maxitems = (maxblk - minblk + 1) * ntids_per_blk;
+ tids = MemoryContextAllocHuge(TopTransactionContext,
+ sizeof(ItemPointerData) * maxitems);
+
+ if (random_block)
+ pg_prng_seed(&state, 0x9E3779B185EBCA87);
+
+ for (BlockNumber blk = minblk; blk < maxblk; blk++)
+ {
+ if (random_block && !pg_prng_bool(&state))
+ continue;
+
+ for (OffsetNumber off = FirstOffsetNumber;
+ off <= ntids_per_blk; off++)
+ {
+ CHECK_FOR_INTERRUPTS();
+
+ ItemPointerSetBlockNumber(&(tids[ntids]), blk);
+ ItemPointerSetOffsetNumber(&(tids[ntids]), off);
+
+ ntids++;
+ }
+ }
+
+ *ntids_p = ntids;
+ return tids;
+}
+
+#ifdef MEASURE_BINARY_SEARCH
+static int
+vac_cmp_itemptr(const void *left, const void *right)
+{
+ BlockNumber lblk,
+ rblk;
+ OffsetNumber loff,
+ roff;
+
+ lblk = ItemPointerGetBlockNumber((ItemPointer) left);
+ rblk = ItemPointerGetBlockNumber((ItemPointer) right);
+
+ if (lblk < rblk)
+ return -1;
+ if (lblk > rblk)
+ return 1;
+
+ loff = ItemPointerGetOffsetNumber((ItemPointer) left);
+ roff = ItemPointerGetOffsetNumber((ItemPointer) right);
+
+ if (loff < roff)
+ return -1;
+ if (loff > roff)
+ return 1;
+
+ return 0;
+}
+#endif
+
+static Datum
+bench_search(FunctionCallInfo fcinfo, bool shuffle)
+{
+ BlockNumber minblk = PG_GETARG_INT32(0);
+ BlockNumber maxblk = PG_GETARG_INT32(1);
+ bool random_block = PG_GETARG_BOOL(2);
+ radix_tree *rt = NULL;
+ uint64 ntids;
+ uint64 key;
+ uint64 last_key = PG_UINT64_MAX;
+ uint64 val = 0;
+ ItemPointer tids;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_load_ms,
+ rt_search_ms;
+ Datum values[7];
+ bool nulls[7];
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ tids = generate_tids(minblk, maxblk, TIDS_PER_BLOCK_FOR_LOAD, &ntids, random_block);
+
+ /* measure the load time of the radix tree */
+ rt = rt_create(CurrentMemoryContext);
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ uint32 off;
+
+ CHECK_FOR_INTERRUPTS();
+
+ key = tid_to_key_off(tid, &off);
+
+ if (last_key != PG_UINT64_MAX && last_key != key)
+ {
+ rt_set(rt, last_key, val);
+ val = 0;
+ }
+
+ last_key = key;
+ val |= (uint64) 1 << off;
+ }
+ if (last_key != PG_UINT64_MAX)
+ rt_set(rt, last_key, val);
+
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_load_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ if (shuffle)
+ shuffle_itemptrs(tids, ntids);
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+	/* measure the search time of the radix tree */
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ uint32 off;
+ volatile bool ret; /* prevent calling rt_search from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ key = tid_to_key_off(tid, &off);
+
+ ret = rt_search(rt, key, &val);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_search_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int64GetDatum(rt_num_entries(rt));
+ values[1] = Int64GetDatum(rt_memory_usage(rt));
+ values[2] = Int64GetDatum(sizeof(ItemPointerData) * ntids);
+ values[3] = Int64GetDatum(rt_load_ms);
+ nulls[4] = true; /* ar_load_ms */
+ values[5] = Int64GetDatum(rt_search_ms);
+ nulls[6] = true; /* ar_search_ms */
+
+#ifdef MEASURE_BINARY_SEARCH
+ {
+ ItemPointer itemptrs = NULL;
+
+ int64 ar_load_ms,
+ ar_search_ms;
+
+ /* measure the load time of the array */
+ itemptrs = MemoryContextAllocHuge(CurrentMemoryContext,
+ sizeof(ItemPointerData) * ntids);
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointerSetBlockNumber(&(itemptrs[i]),
+ ItemPointerGetBlockNumber(&(tids[i])));
+ ItemPointerSetOffsetNumber(&(itemptrs[i]),
+ ItemPointerGetOffsetNumber(&(tids[i])));
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ ar_load_ms = secs * 1000 + usecs / 1000;
+
+		/* next, measure the search time of the array */
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ volatile bool ret; /* prevent calling bsearch from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ ret = bsearch((void *) tid,
+ (void *) itemptrs,
+ ntids,
+ sizeof(ItemPointerData),
+ vac_cmp_itemptr);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ ar_search_ms = secs * 1000 + usecs / 1000;
+
+ /* set the result */
+ nulls[4] = false;
+ values[4] = Int64GetDatum(ar_load_ms);
+ nulls[6] = false;
+ values[6] = Int64GetDatum(ar_search_ms);
+ }
+#endif
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_seq_search(PG_FUNCTION_ARGS)
+{
+ return bench_search(fcinfo, false);
+}
+
+Datum
+bench_shuffle_search(PG_FUNCTION_ARGS)
+{
+ return bench_search(fcinfo, true);
+}
+
+Datum
+bench_load_random_int(PG_FUNCTION_ARGS)
+{
+ uint64 cnt = (uint64) PG_GETARG_INT64(0);
+ radix_tree *rt;
+ pg_prng_state state;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 load_time_ms;
+ Datum values[2];
+ bool nulls[2];
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ pg_prng_seed(&state, 0);
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ uint64 key = pg_prng_uint64(&state);
+
+ rt_set(rt, key, key);
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ load_time_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int64GetDatum(rt_memory_usage(rt));
+ values[1] = Int64GetDatum(load_time_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+/* copy of splitmix64() */
+static uint64
+hash64(uint64 x)
+{
+ x ^= x >> 30;
+ x *= UINT64CONST(0xbf58476d1ce4e5b9);
+ x ^= x >> 27;
+ x *= UINT64CONST(0x94d049bb133111eb);
+ x ^= x >> 31;
+ return x;
+}
+
+/* attempts to have a relatively even population of node kinds */
+Datum
+bench_search_random_nodes(PG_FUNCTION_ARGS)
+{
+ uint64 cnt = (uint64) PG_GETARG_INT64(0);
+ radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 search_time_ms;
+ Datum values[2] = {0};
+ bool nulls[2] = {0};
+ /* from trial and error */
+ uint64 filter = (((uint64) 0x7F<<32) | (0x07<<24) | (0xFF<<16) | 0xFF);
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ if (!PG_ARGISNULL(1))
+ {
+ char *filter_str = text_to_cstring(PG_GETARG_TEXT_P(1));
+
+ if (sscanf(filter_str, "0x%lX", &filter) == 0)
+ elog(ERROR, "invalid filter string %s", filter_str);
+ }
+ elog(NOTICE, "bench with filter 0x%lX", filter);
+
+ rt = rt_create(CurrentMemoryContext);
+
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ const uint64 hash = hash64(i);
+ const uint64 key = hash & filter;
+
+ rt_set(rt, key, key);
+ }
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ const uint64 hash = hash64(i);
+ const uint64 key = hash & filter;
+ uint64 val;
+ volatile bool ret; /* prevent calling rt_search from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ ret = rt_search(rt, key, &val);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ search_time_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ values[0] = Int64GetDatum(rt_memory_usage(rt));
+ values[1] = Int64GetDatum(search_time_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_fixed_height_search(PG_FUNCTION_ARGS)
+{
+ int fanout = PG_GETARG_INT32(0);
+ radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_load_ms,
+ rt_search_ms;
+ Datum values[5];
+ bool nulls[5];
+
+ /* test boundary between vector and iteration */
+ const int n_keys = 5 * 16 * 16 * 16 * 16;
+ uint64 r,
+ h,
+ i,
+ j,
+ k;
+ int key_id;
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+
+ key_id = 0;
+
+ /*
+ * lower nodes have limited fanout, the top is only limited by
+ * bits-per-byte
+ */
+ for (r = 1;; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ for (i = 1; i <= fanout; i++)
+ {
+ for (j = 1; j <= fanout; j++)
+ {
+ for (k = 1; k <= fanout; k++)
+ {
+ uint64 key;
+
+ key = (r << 32) | (h << 24) | (i << 16) | (j << 8) | (k);
+
+ CHECK_FOR_INTERRUPTS();
+
+ key_id++;
+ if (key_id > n_keys)
+ goto finish_set;
+
+ rt_set(rt, key, key_id);
+ }
+ }
+ }
+ }
+ }
+finish_set:
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_load_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+	/* measure the search time of the radix tree */
+ start_time = GetCurrentTimestamp();
+
+ key_id = 0;
+ for (r = 1;; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ for (i = 1; i <= fanout; i++)
+ {
+ for (j = 1; j <= fanout; j++)
+ {
+ for (k = 1; k <= fanout; k++)
+ {
+ uint64 key,
+ val;
+
+ key = (r << 32) | (h << 24) | (i << 16) | (j << 8) | (k);
+
+ CHECK_FOR_INTERRUPTS();
+
+ key_id++;
+ if (key_id > n_keys)
+ goto finish_search;
+
+ rt_search(rt, key, &val);
+ }
+ }
+ }
+ }
+ }
+finish_search:
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_search_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int32GetDatum(fanout);
+ values[1] = Int64GetDatum(rt_num_entries(rt));
+ values[2] = Int64GetDatum(rt_memory_usage(rt));
+ values[3] = Int64GetDatum(rt_load_ms);
+ values[4] = Int64GetDatum(rt_search_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_node128_load(PG_FUNCTION_ARGS)
+{
+ int fanout = PG_GETARG_INT32(0);
+ radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_sparseload_ms;
+ Datum values[5];
+ bool nulls[5];
+
+ uint64 r,
+ h;
+ int key_id;
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ rt = rt_create(CurrentMemoryContext);
+
+ key_id = 0;
+
+ for (r = 1; r <= fanout; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (h);
+
+ key_id++;
+ rt_set(rt, key, key_id);
+ }
+ }
+
+ rt_stats(rt);
+
+ /* measure sparse deletion and re-loading */
+ start_time = GetCurrentTimestamp();
+
+ for (int t = 0; t<10000; t++)
+ {
+ /* delete one key in each leaf */
+ for (r = 1; r <= fanout; r++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (fanout);
+
+ rt_delete(rt, key);
+ }
+
+ /* add them all back */
+ key_id = 0;
+ for (r = 1; r <= fanout; r++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (fanout);
+
+ key_id++;
+ rt_set(rt, key, key_id);
+ }
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_sparseload_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int32GetDatum(fanout);
+ values[1] = Int64GetDatum(rt_num_entries(rt));
+ values[2] = Int64GetDatum(rt_memory_usage(rt));
+ values[3] = Int64GetDatum(rt_sparseload_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
diff --git a/contrib/bench_radix_tree/bench_radix_tree.control b/contrib/bench_radix_tree/bench_radix_tree.control
new file mode 100644
index 0000000000..1d988e6c9a
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.control
@@ -0,0 +1,6 @@
+# bench_radix_tree extension
+comment = 'benchmark suite for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/bench_radix_tree'
+relocatable = true
+trusted = true
diff --git a/contrib/bench_radix_tree/expected/bench.out b/contrib/bench_radix_tree/expected/bench.out
new file mode 100644
index 0000000000..60c303892e
--- /dev/null
+++ b/contrib/bench_radix_tree/expected/bench.out
@@ -0,0 +1,13 @@
+create extension bench_radix_tree;
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
diff --git a/contrib/bench_radix_tree/sql/bench.sql b/contrib/bench_radix_tree/sql/bench.sql
new file mode 100644
index 0000000000..a46018c9d4
--- /dev/null
+++ b/contrib/bench_radix_tree/sql/bench.sql
@@ -0,0 +1,16 @@
+create extension bench_radix_tree;
+
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
--
2.38.1
Attachment: v13-0008-PoC-lazy-vacuum-integration.patch (text/x-patch)
From 5f20ef14890f10cfd4290fa212440ea8a10dd318 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 4 Nov 2022 14:14:42 +0900
Subject: [PATCH v13 8/8] PoC: lazy vacuum integration.
The patch includes:
* Introducing a new module called TIDStore
* Lazy vacuum and parallel vacuum integration.
TODOs:
* radix tree needs to have the reset functionality.
* should not allow TIDStore to grow beyond the memory limit.
* change the progress statistics of pg_stat_progress_vacuum.
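
For illustration, the TIDStore API introduced here is expected to be used
roughly as follows (a sketch only; blkno, deadoffsets, lpdead_items, itemptr
and iter stand for the caller's variables, and error handling and memory
accounting are omitted):

    TIDStore   *dead_items = tidstore_create(NULL);   /* NULL = backend-local */

    /* first heap pass: remember the LP_DEAD offsets found on one block */
    tidstore_add_tids(dead_items, blkno, deadoffsets, lpdead_items);

    /* index vacuum: is this index tuple's heap TID dead? */
    if (tidstore_lookup_tid(dead_items, &itemptr))
        /* delete the index tuple */ ;

    /* second heap pass: visit dead TIDs block by block */
    iter = tidstore_begin_iterate(dead_items);
    while (tidstore_iterate_next(iter))
        /* set iter->offsets[0 .. iter->num_offsets - 1] on iter->blkno to LP_UNUSED */ ;

    tidstore_free(dead_items);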
---
src/backend/access/common/Makefile | 1 +
src/backend/access/common/meson.build | 1 +
src/backend/access/common/tidstore.c | 448 ++++++++++++++++++++++++++
src/backend/access/heap/vacuumlazy.c | 164 +++-------
src/backend/commands/vacuum.c | 76 +----
src/backend/commands/vacuumparallel.c | 63 ++--
src/backend/storage/lmgr/lwlock.c | 2 +
src/include/access/tidstore.h | 60 ++++
src/include/commands/vacuum.h | 24 +-
src/include/storage/lwlock.h | 1 +
10 files changed, 612 insertions(+), 228 deletions(-)
create mode 100644 src/backend/access/common/tidstore.c
create mode 100644 src/include/access/tidstore.h
diff --git a/src/backend/access/common/Makefile b/src/backend/access/common/Makefile
index b9aff0ccfd..67b8cc6108 100644
--- a/src/backend/access/common/Makefile
+++ b/src/backend/access/common/Makefile
@@ -27,6 +27,7 @@ OBJS = \
syncscan.o \
toast_compression.o \
toast_internals.o \
+ tidstore.o \
tupconvert.o \
tupdesc.o
diff --git a/src/backend/access/common/meson.build b/src/backend/access/common/meson.build
index 857beaa32d..76265974b1 100644
--- a/src/backend/access/common/meson.build
+++ b/src/backend/access/common/meson.build
@@ -13,6 +13,7 @@ backend_sources += files(
'syncscan.c',
'toast_compression.c',
'toast_internals.c',
+ 'tidstore.c',
'tupconvert.c',
'tupdesc.c',
)
diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
new file mode 100644
index 0000000000..c3cf771f7d
--- /dev/null
+++ b/src/backend/access/common/tidstore.c
@@ -0,0 +1,448 @@
+/*-------------------------------------------------------------------------
+ *
+ * tidstore.c
+ * TID (ItemPointer) storage implementation.
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/common/tidstore.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/tidstore.h"
+#include "lib/radixtree.h"
+#include "utils/dsa.h"
+#include "utils/memutils.h"
+#include "miscadmin.h"
+
+#define XXX_DEBUG_TID_STORE 1
+
+/* XXX: should be configurable for non-heap AMs */
+#define TIDSTORE_OFFSET_NBITS 11 /* pg_ceil_log2_32(MaxHeapTuplesPerPage) */
+
+#define TIDSTORE_VALUE_NBITS 6 /* log(sizeof(uint64) * BITS_PER_BYTE, 2) */
+
+/* Get block number from the key */
+#define KEY_GET_BLKNO(key) \
+ ((BlockNumber) ((key) >> (TIDSTORE_OFFSET_NBITS - TIDSTORE_VALUE_NBITS)))
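+/*
+ * For illustration: with the 11-bit offset encoding above, the TID
+ * (blkno = 10, off = 3) is first encoded as tid_i = (10 << 11) | 3 = 20483.
+ * The low 6 bits give the bit position in the stored 64-bit value
+ * (20483 & 63 = 3) and the rest becomes the key (20483 >> 6 = 320);
+ * KEY_GET_BLKNO(320) = 320 >> 5 = 10 recovers the block number. Each heap
+ * block therefore maps to at most 2^(11 - 6) = 32 keys, each holding a
+ * 64-bit bitmap of offsets.
+ */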
+
+struct TIDStore
+{
+ /* main storage for TID */
+ radix_tree *tree;
+
+ /* # of tids in TIDStore */
+ int num_tids;
+
+ /* DSA area and handle for shared TIDStore */
+ rt_handle handle;
+ dsa_area *area;
+
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+ ItemPointer itemptrs;
+ uint64 nitems;
+#endif
+};
+
+static void tidstore_iter_collect_tids(TIDStoreIter *iter, uint64 key, uint64 val);
+static inline uint64 tid_to_key_off(ItemPointer tid, uint32 *off);
+
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+/*
+ * Comparator routines for use with qsort() and bsearch().
+ */
+static int
+vac_cmp_itemptr(const void *left, const void *right)
+{
+ BlockNumber lblk,
+ rblk;
+ OffsetNumber loff,
+ roff;
+
+ lblk = ItemPointerGetBlockNumber((ItemPointer) left);
+ rblk = ItemPointerGetBlockNumber((ItemPointer) right);
+
+ if (lblk < rblk)
+ return -1;
+ if (lblk > rblk)
+ return 1;
+
+ loff = ItemPointerGetOffsetNumber((ItemPointer) left);
+ roff = ItemPointerGetOffsetNumber((ItemPointer) right);
+
+ if (loff < roff)
+ return -1;
+ if (loff > roff)
+ return 1;
+
+ return 0;
+}
+
+static void
+verify_iter_tids(TIDStoreIter *iter)
+{
+ uint64 index = iter->prev_index;
+
+ if (iter->ts->itemptrs == NULL)
+ return;
+
+ Assert(index <= iter->ts->nitems);
+
+ for (int i = 0; i < iter->num_offsets; i++)
+ {
+ ItemPointerData tid;
+
+ ItemPointerSetBlockNumber(&tid, iter->blkno);
+ ItemPointerSetOffsetNumber(&tid, iter->offsets[i]);
+
+ Assert(ItemPointerEquals(&iter->ts->itemptrs[index++], &tid));
+ }
+
+ iter->prev_index = iter->itemptrs_index;
+}
+
+static void
+dump_itemptrs(TIDStore *ts)
+{
+ StringInfoData buf;
+
+ if (ts->itemptrs == NULL)
+ return;
+
+ initStringInfo(&buf);
+ for (int i = 0; i < ts->nitems; i++)
+ {
+ appendStringInfo(&buf, "(%d,%d) ",
+ ItemPointerGetBlockNumber(&(ts->itemptrs[i])),
+ ItemPointerGetOffsetNumber(&(ts->itemptrs[i])));
+ }
+ elog(WARNING, "--- dump (" UINT64_FORMAT " items) ---", ts->nitems);
+ elog(WARNING, "%s\n", buf.data);
+}
+
+#endif
+
+/*
+ * Create a TIDStore. The returned object is allocated in backend-local memory.
+ * The radix tree for storage is allocated in the DSA area if 'area' is non-NULL.
+ */
+TIDStore *
+tidstore_create(dsa_area *area)
+{
+ TIDStore *ts;
+
+ ts = palloc0(sizeof(TIDStore));
+
+ ts->tree = rt_create(CurrentMemoryContext, area);
+ ts->area = area;
+
+ if (area != NULL)
+ ts->handle = rt_get_handle(ts->tree);
+
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+#define MAXDEADITEMS(avail_mem) \
+ (avail_mem / sizeof(ItemPointerData))
+
+ if (area == NULL)
+ {
+ ts->itemptrs = (ItemPointer) palloc0(sizeof(ItemPointerData) *
+										 MAXDEADITEMS(maintenance_work_mem * 1024L));
+ ts->nitems = 0;
+ }
+#endif
+
+ return ts;
+}
+
+/* Attach to the shared TIDStore using a handle */
+TIDStore *
+tidstore_attach(dsa_area *area, rt_handle handle)
+{
+ TIDStore *ts;
+
+ Assert(area != NULL);
+ Assert(DsaPointerIsValid(handle));
+
+ ts = palloc0(sizeof(TIDStore));
+ ts->tree = rt_attach(area, handle);
+
+ return ts;
+}
+
+/*
+ * Detach from a TIDStore. This detaches from the radix tree and frees the
+ * backend-local resources.
+ */
+void
+tidstore_detach(TIDStore *ts)
+{
+ rt_detach(ts->tree);
+ pfree(ts);
+}
+
+void
+tidstore_free(TIDStore *ts)
+{
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+ if (ts->itemptrs)
+ pfree(ts->itemptrs);
+#endif
+
+ rt_free(ts->tree);
+ pfree(ts);
+}
+
+void
+tidstore_reset(TIDStore *ts)
+{
+ dsa_area *area = ts->area;
+
+ /* Reset the statistics */
+ ts->num_tids = 0;
+
+ /* Recreate radix tree storage */
+ rt_free(ts->tree);
+ ts->tree = rt_create(CurrentMemoryContext, area);
+
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+ ts->nitems = 0;
+#endif
+}
+
+/* Add TIDs to TIDStore */
+void
+tidstore_add_tids(TIDStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets)
+{
+ uint64 last_key = PG_UINT64_MAX;
+ uint64 key;
+ uint64 val = 0;
+ ItemPointerData tid;
+
+ ItemPointerSetBlockNumber(&tid, blkno);
+
+ for (int i = 0; i < num_offsets; i++)
+ {
+ uint32 off;
+
+ ItemPointerSetOffsetNumber(&tid, offsets[i]);
+
+ key = tid_to_key_off(&tid, &off);
+
+ if (last_key != PG_UINT64_MAX && last_key != key)
+ {
+ rt_set(ts->tree, last_key, val);
+ val = 0;
+ }
+
+ last_key = key;
+ val |= UINT64CONST(1) << off;
+ ts->num_tids++;
+
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+ if (ts->itemptrs)
+ {
+ ItemPointerSetBlockNumber(&(ts->itemptrs[ts->nitems]), blkno);
+ ItemPointerSetOffsetNumber(&(ts->itemptrs[ts->nitems]), offsets[i]);
+ ts->nitems++;
+ }
+#endif
+ }
+
+ if (last_key != PG_UINT64_MAX)
+ {
+ rt_set(ts->tree, last_key, val);
+ val = 0;
+ }
+
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+ if (ts->itemptrs)
+ Assert(ts->nitems == ts->num_tids);
+#endif
+}
+
+/* Return true if the given TID is present in TIDStore */
+bool
+tidstore_lookup_tid(TIDStore *ts, ItemPointer tid)
+{
+ uint64 key;
+ uint64 val;
+ uint32 off;
+ bool found;
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+ bool found_assert;
+#endif
+
+ key = tid_to_key_off(tid, &off);
+
+ found = rt_search(ts->tree, key, &val);
+
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+ if (ts->itemptrs)
+ found_assert = bsearch((void *) tid,
+ (void *) ts->itemptrs,
+ ts->nitems,
+ sizeof(ItemPointerData),
+ vac_cmp_itemptr) != NULL;
+#endif
+
+ if (!found)
+ {
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+ if (ts->itemptrs)
+ Assert(!found_assert);
+#endif
+ return false;
+ }
+
+ found = (val & (UINT64CONST(1) << off)) != 0;
+
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+
+ if (ts->itemptrs && found != found_assert)
+ {
+ elog(WARNING, "tid (%d,%d)\n",
+ ItemPointerGetBlockNumber(tid),
+ ItemPointerGetOffsetNumber(tid));
+ dump_itemptrs(ts);
+ }
+
+ if (ts->itemptrs)
+ Assert(found == found_assert);
+
+#endif
+ return found;
+}
+
+TIDStoreIter *
+tidstore_begin_iterate(TIDStore *ts)
+{
+ TIDStoreIter *iter;
+
+ iter = palloc0(sizeof(TIDStoreIter));
+ iter->ts = ts;
+ iter->tree_iter = rt_begin_iterate(ts->tree);
+ iter->blkno = InvalidBlockNumber;
+
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+ iter->itemptrs_index = 0;
+#endif
+
+ return iter;
+}
+
+bool
+tidstore_iterate_next(TIDStoreIter *iter)
+{
+ uint64 key;
+ uint64 val;
+
+ if (iter->finished)
+ return false;
+
+ if (BlockNumberIsValid(iter->blkno))
+ {
+ iter->num_offsets = 0;
+ tidstore_iter_collect_tids(iter, iter->next_key, iter->next_val);
+ }
+
+ while (rt_iterate_next(iter->tree_iter, &key, &val))
+ {
+ BlockNumber blkno;
+
+ blkno = KEY_GET_BLKNO(key);
+
+ if (BlockNumberIsValid(iter->blkno) && iter->blkno != blkno)
+ {
+ /*
+ * Remember the key-value pair for the next block for the
+ * next iteration.
+ */
+ iter->next_key = key;
+ iter->next_val = val;
+
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+ verify_iter_tids(iter);
+#endif
+ return true;
+ }
+
+ /* Collect tids extracted from the key-value pair */
+ tidstore_iter_collect_tids(iter, key, val);
+ }
+
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+ verify_iter_tids(iter);
+#endif
+
+ iter->finished = true;
+ return true;
+}
+
+uint64
+tidstore_num_tids(TIDStore *ts)
+{
+ return ts->num_tids;
+}
+
+uint64
+tidstore_memory_usage(TIDStore *ts)
+{
+ return (uint64) sizeof(TIDStore) + rt_memory_usage(ts->tree);
+}
+
+/*
+ * Get a handle that can be used by other processes to attach to this TIDStore
+ */
+tidstore_handle
+tidstore_get_handle(TIDStore *ts)
+{
+ return rt_get_handle(ts->tree);
+}
+
+/* Extract TIDs from key-value pair */
+static void
+tidstore_iter_collect_tids(TIDStoreIter *iter, uint64 key, uint64 val)
+{
+ for (int i = 0; i < sizeof(uint64) * BITS_PER_BYTE; i++)
+ {
+ uint64 tid_i;
+ OffsetNumber off;
+
+ if ((val & (UINT64CONST(1) << i)) == 0)
+ continue;
+
+ tid_i = key << TIDSTORE_VALUE_NBITS;
+ tid_i |= i;
+
+ off = tid_i & ((UINT64CONST(1) << TIDSTORE_OFFSET_NBITS) - 1);
+ iter->offsets[iter->num_offsets++] = off;
+
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+ iter->itemptrs_index++;
+#endif
+ }
+
+ iter->blkno = KEY_GET_BLKNO(key);
+}
+
+/* Encode a TID to key and val */
+static inline uint64
+tid_to_key_off(ItemPointer tid, uint32 *off)
+{
+ uint64 upper;
+ uint64 tid_i;
+
+ tid_i = ItemPointerGetOffsetNumber(tid);
+ tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << TIDSTORE_OFFSET_NBITS;
+
+ *off = tid_i & ((1 << TIDSTORE_VALUE_NBITS) - 1);
+ upper = tid_i >> TIDSTORE_VALUE_NBITS;
+ Assert(*off < (sizeof(uint64) * BITS_PER_BYTE));
+
+ return upper;
+}
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index d59711b7ec..75dead6c14 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -40,6 +40,7 @@
#include "access/heapam_xlog.h"
#include "access/htup_details.h"
#include "access/multixact.h"
+#include "access/tidstore.h"
#include "access/transam.h"
#include "access/visibilitymap.h"
#include "access/xact.h"
@@ -144,6 +145,8 @@ typedef struct LVRelState
Relation *indrels;
int nindexes;
+	int64		max_bytes;
+
/* Aggressive VACUUM? (must set relfrozenxid >= FreezeLimit) */
bool aggressive;
/* Use visibility map to skip? (disabled by DISABLE_PAGE_SKIPPING) */
@@ -194,7 +197,7 @@ typedef struct LVRelState
* lazy_vacuum_heap_rel, which marks the same LP_DEAD line pointers as
* LP_UNUSED during second heap pass.
*/
- VacDeadItems *dead_items; /* TIDs whose index tuples we'll delete */
+ TIDStore *dead_items; /* TIDs whose index tuples we'll delete */
BlockNumber rel_pages; /* total number of pages */
BlockNumber scanned_pages; /* # pages examined (not skipped via VM) */
BlockNumber removed_pages; /* # pages removed by relation truncation */
@@ -265,8 +268,9 @@ static bool lazy_scan_noprune(LVRelState *vacrel, Buffer buf,
static void lazy_vacuum(LVRelState *vacrel);
static bool lazy_vacuum_all_indexes(LVRelState *vacrel);
static void lazy_vacuum_heap_rel(LVRelState *vacrel);
-static int lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
- Buffer buffer, int index, Buffer *vmbuffer);
+static void lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
+ OffsetNumber *offsets, int num_offsets,
+ Buffer buffer, Buffer *vmbuffer);
static bool lazy_check_wraparound_failsafe(LVRelState *vacrel);
static void lazy_cleanup_all_indexes(LVRelState *vacrel);
static IndexBulkDeleteResult *lazy_vacuum_one_index(Relation indrel,
@@ -392,6 +396,9 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
vacrel->indname = NULL;
vacrel->phase = VACUUM_ERRCB_PHASE_UNKNOWN;
vacrel->verbose = verbose;
+ vacrel->max_bytes = IsAutoVacuumWorkerProcess() &&
+ autovacuum_work_mem != -1 ?
+ autovacuum_work_mem * 1024L : maintenance_work_mem * 1024L;
errcallback.callback = vacuum_error_callback;
errcallback.arg = vacrel;
errcallback.previous = error_context_stack;
@@ -853,7 +860,7 @@ lazy_scan_heap(LVRelState *vacrel)
next_unskippable_block,
next_failsafe_block = 0,
next_fsm_block_to_vacuum = 0;
- VacDeadItems *dead_items = vacrel->dead_items;
+ TIDStore *dead_items = vacrel->dead_items;
Buffer vmbuffer = InvalidBuffer;
bool next_unskippable_allvis,
skipping_current_range;
@@ -867,7 +874,7 @@ lazy_scan_heap(LVRelState *vacrel)
/* Report that we're scanning the heap, advertising total # of blocks */
initprog_val[0] = PROGRESS_VACUUM_PHASE_SCAN_HEAP;
initprog_val[1] = rel_pages;
- initprog_val[2] = dead_items->max_items;
+ initprog_val[2] = vacrel->max_bytes; /* XXX: should use # of tids */
pgstat_progress_update_multi_param(3, initprog_index, initprog_val);
/* Set up an initial range of skippable blocks using the visibility map */
@@ -937,8 +944,8 @@ lazy_scan_heap(LVRelState *vacrel)
* dead_items TIDs, pause and do a cycle of vacuuming before we tackle
* this page.
*/
- Assert(dead_items->max_items >= MaxHeapTuplesPerPage);
- if (dead_items->max_items - dead_items->num_items < MaxHeapTuplesPerPage)
+ /* XXX: should not allow tidstore to grow beyond max_bytes */
+ if (tidstore_memory_usage(vacrel->dead_items) > vacrel->max_bytes)
{
/*
* Before beginning index vacuuming, we release any pin we may
@@ -1070,11 +1077,17 @@ lazy_scan_heap(LVRelState *vacrel)
if (prunestate.has_lpdead_items)
{
Size freespace;
+ TIDStoreIter *iter;
- lazy_vacuum_heap_page(vacrel, blkno, buf, 0, &vmbuffer);
+ iter = tidstore_begin_iterate(vacrel->dead_items);
+ tidstore_iterate_next(iter);
+ lazy_vacuum_heap_page(vacrel, blkno, iter->offsets, iter->num_offsets,
+ buf, &vmbuffer);
+ Assert(!tidstore_iterate_next(iter));
+ pfree(iter);
/* Forget the LP_DEAD items that we just vacuumed */
- dead_items->num_items = 0;
+ tidstore_reset(dead_items);
/*
* Periodically perform FSM vacuuming to make newly-freed
@@ -1111,7 +1124,7 @@ lazy_scan_heap(LVRelState *vacrel)
* with prunestate-driven visibility map and FSM steps (just like
* the two-pass strategy).
*/
- Assert(dead_items->num_items == 0);
+ Assert(tidstore_num_tids(dead_items) == 0);
}
/*
@@ -1264,7 +1277,7 @@ lazy_scan_heap(LVRelState *vacrel)
* Do index vacuuming (call each index's ambulkdelete routine), then do
* related heap vacuuming
*/
- if (dead_items->num_items > 0)
+ if (tidstore_num_tids(dead_items) > 0)
lazy_vacuum(vacrel);
/*
@@ -1863,25 +1876,16 @@ retry:
*/
if (lpdead_items > 0)
{
- VacDeadItems *dead_items = vacrel->dead_items;
- ItemPointerData tmp;
+ TIDStore *dead_items = vacrel->dead_items;
Assert(!prunestate->all_visible);
Assert(prunestate->has_lpdead_items);
vacrel->lpdead_item_pages++;
+ tidstore_add_tids(dead_items, blkno, deadoffsets, lpdead_items);
- ItemPointerSetBlockNumber(&tmp, blkno);
-
- for (int i = 0; i < lpdead_items; i++)
- {
- ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
- dead_items->items[dead_items->num_items++] = tmp;
- }
-
- Assert(dead_items->num_items <= dead_items->max_items);
pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
- dead_items->num_items);
+ tidstore_num_tids(dead_items));
}
/* Finally, add page-local counts to whole-VACUUM counts */
@@ -2088,8 +2092,7 @@ lazy_scan_noprune(LVRelState *vacrel,
}
else
{
- VacDeadItems *dead_items = vacrel->dead_items;
- ItemPointerData tmp;
+ TIDStore *dead_items = vacrel->dead_items;
/*
* Page has LP_DEAD items, and so any references/TIDs that remain in
@@ -2098,17 +2101,10 @@ lazy_scan_noprune(LVRelState *vacrel,
*/
vacrel->lpdead_item_pages++;
- ItemPointerSetBlockNumber(&tmp, blkno);
-
- for (int i = 0; i < lpdead_items; i++)
- {
- ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
- dead_items->items[dead_items->num_items++] = tmp;
- }
+ tidstore_add_tids(dead_items, blkno, deadoffsets, lpdead_items);
- Assert(dead_items->num_items <= dead_items->max_items);
pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
- dead_items->num_items);
+ tidstore_num_tids(dead_items));
vacrel->lpdead_items += lpdead_items;
@@ -2157,7 +2153,7 @@ lazy_vacuum(LVRelState *vacrel)
if (!vacrel->do_index_vacuuming)
{
Assert(!vacrel->do_index_cleanup);
- vacrel->dead_items->num_items = 0;
+ tidstore_reset(vacrel->dead_items);
return;
}
@@ -2186,7 +2182,7 @@ lazy_vacuum(LVRelState *vacrel)
BlockNumber threshold;
Assert(vacrel->num_index_scans == 0);
- Assert(vacrel->lpdead_items == vacrel->dead_items->num_items);
+ Assert(vacrel->lpdead_items == tidstore_num_tids(vacrel->dead_items));
Assert(vacrel->do_index_vacuuming);
Assert(vacrel->do_index_cleanup);
@@ -2213,8 +2209,8 @@ lazy_vacuum(LVRelState *vacrel)
* cases then this may need to be reconsidered.
*/
threshold = (double) vacrel->rel_pages * BYPASS_THRESHOLD_PAGES;
- bypass = (vacrel->lpdead_item_pages < threshold &&
- vacrel->lpdead_items < MAXDEADITEMS(32L * 1024L * 1024L));
+ bypass = (vacrel->lpdead_item_pages < threshold) &&
+ tidstore_memory_usage(vacrel->dead_items) < (32L * 1024L * 1024L);
}
if (bypass)
@@ -2259,7 +2255,7 @@ lazy_vacuum(LVRelState *vacrel)
* Forget the LP_DEAD items that we just vacuumed (or just decided to not
* vacuum)
*/
- vacrel->dead_items->num_items = 0;
+ /* tidstore_reset(vacrel->dead_items); */
}
/*
@@ -2331,7 +2327,7 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
* place).
*/
Assert(vacrel->num_index_scans > 0 ||
- vacrel->dead_items->num_items == vacrel->lpdead_items);
+ tidstore_num_tids(vacrel->dead_items) == vacrel->lpdead_items);
Assert(allindexes || vacrel->failsafe_active);
/*
@@ -2368,10 +2364,10 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
static void
lazy_vacuum_heap_rel(LVRelState *vacrel)
{
- int index;
BlockNumber vacuumed_pages;
Buffer vmbuffer = InvalidBuffer;
LVSavedErrInfo saved_err_info;
+ TIDStoreIter *iter;
Assert(vacrel->do_index_vacuuming);
Assert(vacrel->do_index_cleanup);
@@ -2388,8 +2384,8 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
vacuumed_pages = 0;
- index = 0;
- while (index < vacrel->dead_items->num_items)
+ iter = tidstore_begin_iterate(vacrel->dead_items);
+ while (tidstore_iterate_next(iter))
{
BlockNumber tblk;
Buffer buf;
@@ -2398,12 +2394,13 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
vacuum_delay_point();
- tblk = ItemPointerGetBlockNumber(&vacrel->dead_items->items[index]);
+ tblk = iter->blkno;
vacrel->blkno = tblk;
buf = ReadBufferExtended(vacrel->rel, MAIN_FORKNUM, tblk, RBM_NORMAL,
vacrel->bstrategy);
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
- index = lazy_vacuum_heap_page(vacrel, tblk, buf, index, &vmbuffer);
+ lazy_vacuum_heap_page(vacrel, tblk, iter->offsets, iter->num_offsets,
+ buf, &vmbuffer);
/* Now that we've vacuumed the page, record its available space */
page = BufferGetPage(buf);
@@ -2427,14 +2424,13 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
* We set all LP_DEAD items from the first heap pass to LP_UNUSED during
* the second heap pass. No more, no less.
*/
- Assert(index > 0);
Assert(vacrel->num_index_scans > 1 ||
- (index == vacrel->lpdead_items &&
+ (tidstore_num_tids(vacrel->dead_items) == vacrel->lpdead_items &&
vacuumed_pages == vacrel->lpdead_item_pages));
ereport(DEBUG2,
- (errmsg("table \"%s\": removed %lld dead item identifiers in %u pages",
- vacrel->relname, (long long) index, vacuumed_pages)));
+			(errmsg("table \"%s\": removed " UINT64_FORMAT " dead item identifiers in %u pages",
+ vacrel->relname, tidstore_num_tids(vacrel->dead_items), vacuumed_pages)));
/* Revert to the previous phase information for error traceback */
restore_vacuum_error_info(vacrel, &saved_err_info);
@@ -2451,11 +2447,10 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
* LP_DEAD item on the page. The return value is the first index immediately
* after all LP_DEAD items for the same page in the array.
*/
-static int
-lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
- int index, Buffer *vmbuffer)
+static void
+lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets, Buffer buffer, Buffer *vmbuffer)
{
- VacDeadItems *dead_items = vacrel->dead_items;
Page page = BufferGetPage(buffer);
OffsetNumber unused[MaxHeapTuplesPerPage];
int uncnt = 0;
@@ -2474,16 +2469,11 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
START_CRIT_SECTION();
- for (; index < dead_items->num_items; index++)
+ for (int i = 0; i < num_offsets; i++)
{
- BlockNumber tblk;
- OffsetNumber toff;
ItemId itemid;
+ OffsetNumber toff = offsets[i];
- tblk = ItemPointerGetBlockNumber(&dead_items->items[index]);
- if (tblk != blkno)
- break; /* past end of tuples for this block */
- toff = ItemPointerGetOffsetNumber(&dead_items->items[index]);
itemid = PageGetItemId(page, toff);
Assert(ItemIdIsDead(itemid) && !ItemIdHasStorage(itemid));
@@ -2563,7 +2553,6 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
/* Revert to the previous phase information for error traceback */
restore_vacuum_error_info(vacrel, &saved_err_info);
- return index;
}
/*
@@ -3065,46 +3054,6 @@ count_nondeletable_pages(LVRelState *vacrel, bool *lock_waiter_detected)
return vacrel->nonempty_pages;
}
-/*
- * Returns the number of dead TIDs that VACUUM should allocate space to
- * store, given a heap rel of size vacrel->rel_pages, and given current
- * maintenance_work_mem setting (or current autovacuum_work_mem setting,
- * when applicable).
- *
- * See the comments at the head of this file for rationale.
- */
-static int
-dead_items_max_items(LVRelState *vacrel)
-{
- int64 max_items;
- int vac_work_mem = IsAutoVacuumWorkerProcess() &&
- autovacuum_work_mem != -1 ?
- autovacuum_work_mem : maintenance_work_mem;
-
- if (vacrel->nindexes > 0)
- {
- BlockNumber rel_pages = vacrel->rel_pages;
-
- max_items = MAXDEADITEMS(vac_work_mem * 1024L);
- max_items = Min(max_items, INT_MAX);
- max_items = Min(max_items, MAXDEADITEMS(MaxAllocSize));
-
- /* curious coding here to ensure the multiplication can't overflow */
- if ((BlockNumber) (max_items / MaxHeapTuplesPerPage) > rel_pages)
- max_items = rel_pages * MaxHeapTuplesPerPage;
-
- /* stay sane if small maintenance_work_mem */
- max_items = Max(max_items, MaxHeapTuplesPerPage);
- }
- else
- {
- /* One-pass case only stores a single heap page's TIDs at a time */
- max_items = MaxHeapTuplesPerPage;
- }
-
- return (int) max_items;
-}
-
/*
* Allocate dead_items (either using palloc, or in dynamic shared memory).
* Sets dead_items in vacrel for caller.
@@ -3115,12 +3064,6 @@ dead_items_max_items(LVRelState *vacrel)
static void
dead_items_alloc(LVRelState *vacrel, int nworkers)
{
- VacDeadItems *dead_items;
- int max_items;
-
- max_items = dead_items_max_items(vacrel);
- Assert(max_items >= MaxHeapTuplesPerPage);
-
/*
* Initialize state for a parallel vacuum. As of now, only one worker can
* be used for an index, so we invoke parallelism only if there are at
@@ -3146,7 +3089,6 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
else
vacrel->pvs = parallel_vacuum_init(vacrel->rel, vacrel->indrels,
vacrel->nindexes, nworkers,
- max_items,
vacrel->verbose ? INFO : DEBUG2,
vacrel->bstrategy);
@@ -3159,11 +3101,7 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
}
/* Serial VACUUM case */
- dead_items = (VacDeadItems *) palloc(vac_max_items_to_alloc_size(max_items));
- dead_items->max_items = max_items;
- dead_items->num_items = 0;
-
- vacrel->dead_items = dead_items;
+ vacrel->dead_items = tidstore_create(NULL);
}
/*
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index a6d5ed1f6b..62db8b0101 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -95,7 +95,6 @@ static bool vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params);
static double compute_parallel_delay(void);
static VacOptValue get_vacoptval_from_boolean(DefElem *def);
static bool vac_tid_reaped(ItemPointer itemptr, void *state);
-static int vac_cmp_itemptr(const void *left, const void *right);
/*
* Primary entry point for manual VACUUM and ANALYZE commands
@@ -2283,16 +2282,16 @@ get_vacoptval_from_boolean(DefElem *def)
*/
IndexBulkDeleteResult *
vac_bulkdel_one_index(IndexVacuumInfo *ivinfo, IndexBulkDeleteResult *istat,
- VacDeadItems *dead_items)
+ TIDStore *dead_items)
{
/* Do bulk deletion */
istat = index_bulk_delete(ivinfo, istat, vac_tid_reaped,
(void *) dead_items);
ereport(ivinfo->message_level,
- (errmsg("scanned index \"%s\" to remove %d row versions",
+ (errmsg("scanned index \"%s\" to remove " UINT64_FORMAT " row versions",
RelationGetRelationName(ivinfo->index),
- dead_items->num_items)));
+ tidstore_num_tids(dead_items))));
return istat;
}
@@ -2323,18 +2322,6 @@ vac_cleanup_one_index(IndexVacuumInfo *ivinfo, IndexBulkDeleteResult *istat)
return istat;
}
-/*
- * Returns the total required space for VACUUM's dead_items array given a
- * max_items value.
- */
-Size
-vac_max_items_to_alloc_size(int max_items)
-{
- Assert(max_items <= MAXDEADITEMS(MaxAllocSize));
-
- return offsetof(VacDeadItems, items) + sizeof(ItemPointerData) * max_items;
-}
-
/*
* vac_tid_reaped() -- is a particular tid deletable?
*
@@ -2345,60 +2332,7 @@ vac_max_items_to_alloc_size(int max_items)
static bool
vac_tid_reaped(ItemPointer itemptr, void *state)
{
- VacDeadItems *dead_items = (VacDeadItems *) state;
- int64 litem,
- ritem,
- item;
- ItemPointer res;
-
- litem = itemptr_encode(&dead_items->items[0]);
- ritem = itemptr_encode(&dead_items->items[dead_items->num_items - 1]);
- item = itemptr_encode(itemptr);
-
- /*
- * Doing a simple bound check before bsearch() is useful to avoid the
- * extra cost of bsearch(), especially if dead items on the heap are
- * concentrated in a certain range. Since this function is called for
- * every index tuple, it pays to be really fast.
- */
- if (item < litem || item > ritem)
- return false;
-
- res = (ItemPointer) bsearch((void *) itemptr,
- (void *) dead_items->items,
- dead_items->num_items,
- sizeof(ItemPointerData),
- vac_cmp_itemptr);
-
- return (res != NULL);
-}
-
-/*
- * Comparator routines for use with qsort() and bsearch().
- */
-static int
-vac_cmp_itemptr(const void *left, const void *right)
-{
- BlockNumber lblk,
- rblk;
- OffsetNumber loff,
- roff;
-
- lblk = ItemPointerGetBlockNumber((ItemPointer) left);
- rblk = ItemPointerGetBlockNumber((ItemPointer) right);
-
- if (lblk < rblk)
- return -1;
- if (lblk > rblk)
- return 1;
-
- loff = ItemPointerGetOffsetNumber((ItemPointer) left);
- roff = ItemPointerGetOffsetNumber((ItemPointer) right);
-
- if (loff < roff)
- return -1;
- if (loff > roff)
- return 1;
+ TIDStore *dead_items = (TIDStore *) state;
- return 0;
+ return tidstore_lookup_tid(dead_items, itemptr);
}
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index f26d796e52..742039b3a6 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -44,7 +44,7 @@
* use small integers.
*/
#define PARALLEL_VACUUM_KEY_SHARED 1
-#define PARALLEL_VACUUM_KEY_DEAD_ITEMS 2
+#define PARALLEL_VACUUM_KEY_DSA 2
#define PARALLEL_VACUUM_KEY_QUERY_TEXT 3
#define PARALLEL_VACUUM_KEY_BUFFER_USAGE 4
#define PARALLEL_VACUUM_KEY_WAL_USAGE 5
@@ -103,6 +103,9 @@ typedef struct PVShared
/* Counter for vacuuming and cleanup */
pg_atomic_uint32 idx;
+
+ /* Handle of the shared TIDStore */
+ tidstore_handle dead_items_handle;
} PVShared;
/* Status used during parallel index vacuum or cleanup */
@@ -166,7 +169,8 @@ struct ParallelVacuumState
PVIndStats *indstats;
/* Shared dead items space among parallel vacuum workers */
- VacDeadItems *dead_items;
+ TIDStore *dead_items;
+ dsa_area *dead_items_area;
/* Points to buffer usage area in DSM */
BufferUsage *buffer_usage;
@@ -222,20 +226,22 @@ static void parallel_vacuum_error_callback(void *arg);
*/
ParallelVacuumState *
parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
- int nrequested_workers, int max_items,
- int elevel, BufferAccessStrategy bstrategy)
+ int nrequested_workers, int elevel,
+ BufferAccessStrategy bstrategy)
{
ParallelVacuumState *pvs;
ParallelContext *pcxt;
PVShared *shared;
- VacDeadItems *dead_items;
+ TIDStore *dead_items;
PVIndStats *indstats;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
+ void *area_space;
+ dsa_area *dead_items_dsa;
bool *will_parallel_vacuum;
Size est_indstats_len;
Size est_shared_len;
- Size est_dead_items_len;
+ Size dsa_minsize = dsa_minimum_size();
int nindexes_mwm = 0;
int parallel_workers = 0;
int querylen;
@@ -283,9 +289,8 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_estimate_chunk(&pcxt->estimator, est_shared_len);
shm_toc_estimate_keys(&pcxt->estimator, 1);
- /* Estimate size for dead_items -- PARALLEL_VACUUM_KEY_DEAD_ITEMS */
- est_dead_items_len = vac_max_items_to_alloc_size(max_items);
- shm_toc_estimate_chunk(&pcxt->estimator, est_dead_items_len);
+ /* Estimate size for dead tuple DSA -- PARALLEL_VACUUM_KEY_DSA */
+ shm_toc_estimate_chunk(&pcxt->estimator, dsa_minsize);
shm_toc_estimate_keys(&pcxt->estimator, 1);
/*
@@ -351,6 +356,16 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_INDEX_STATS, indstats);
pvs->indstats = indstats;
+ /* Prepare DSA space for dead items */
+ area_space = shm_toc_allocate(pcxt->toc, dsa_minsize);
+ shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_DSA, area_space);
+ dead_items_dsa = dsa_create_in_place(area_space, dsa_minsize,
+ LWTRANCHE_PARALLEL_VACUUM_DSA,
+ pcxt->seg);
+ dead_items = tidstore_create(dead_items_dsa);
+ pvs->dead_items = dead_items;
+ pvs->dead_items_area = dead_items_dsa;
+
/* Prepare shared information */
shared = (PVShared *) shm_toc_allocate(pcxt->toc, est_shared_len);
MemSet(shared, 0, est_shared_len);
@@ -360,6 +375,7 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
(nindexes_mwm > 0) ?
maintenance_work_mem / Min(parallel_workers, nindexes_mwm) :
maintenance_work_mem;
+ shared->dead_items_handle = tidstore_get_handle(dead_items);
pg_atomic_init_u32(&(shared->cost_balance), 0);
pg_atomic_init_u32(&(shared->active_nworkers), 0);
@@ -368,15 +384,6 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_SHARED, shared);
pvs->shared = shared;
- /* Prepare the dead_items space */
- dead_items = (VacDeadItems *) shm_toc_allocate(pcxt->toc,
- est_dead_items_len);
- dead_items->max_items = max_items;
- dead_items->num_items = 0;
- MemSet(dead_items->items, 0, sizeof(ItemPointerData) * max_items);
- shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_DEAD_ITEMS, dead_items);
- pvs->dead_items = dead_items;
-
/*
* Allocate space for each worker's BufferUsage and WalUsage; no need to
* initialize
@@ -434,6 +441,9 @@ parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats)
istats[i] = NULL;
}
+ tidstore_free(pvs->dead_items);
+ dsa_detach(pvs->dead_items_area);
+
DestroyParallelContext(pvs->pcxt);
ExitParallelMode();
@@ -442,7 +452,7 @@ parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats)
}
/* Returns the dead items space */
-VacDeadItems *
+TIDStore *
parallel_vacuum_get_dead_items(ParallelVacuumState *pvs)
{
return pvs->dead_items;
@@ -940,7 +950,9 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
Relation *indrels;
PVIndStats *indstats;
PVShared *shared;
- VacDeadItems *dead_items;
+ TIDStore *dead_items;
+ void *area_space;
+ dsa_area *dead_items_area;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
int nindexes;
@@ -984,10 +996,10 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
PARALLEL_VACUUM_KEY_INDEX_STATS,
false);
- /* Set dead_items space */
- dead_items = (VacDeadItems *) shm_toc_lookup(toc,
- PARALLEL_VACUUM_KEY_DEAD_ITEMS,
- false);
+ /* Set dead items */
+ area_space = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_DSA, false);
+ dead_items_area = dsa_attach_in_place(area_space, seg);
+ dead_items = tidstore_attach(dead_items_area, shared->dead_items_handle);
/* Set cost-based vacuum delay */
VacuumCostActive = (VacuumCostDelay > 0);
@@ -1033,6 +1045,9 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
&wal_usage[ParallelWorkerNumber]);
+ tidstore_detach(pvs.dead_items);
+ dsa_detach(dead_items_area);
+
/* Pop the error context stack */
error_context_stack = errcallback.previous;
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index a5ad36ca78..2fb30fe2e7 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -183,6 +183,8 @@ static const char *const BuiltinTrancheNames[] = {
"PgStatsHash",
/* LWTRANCHE_PGSTATS_DATA: */
"PgStatsData",
+ /* LWTRANCHE_PARALLEL_VACUUM_DSA: */
+ "ParallelVacuumDSA",
};
StaticAssertDecl(lengthof(BuiltinTrancheNames) ==
diff --git a/src/include/access/tidstore.h b/src/include/access/tidstore.h
new file mode 100644
index 0000000000..f4ccf1dbc5
--- /dev/null
+++ b/src/include/access/tidstore.h
@@ -0,0 +1,60 @@
+/*-------------------------------------------------------------------------
+ *
+ * tidstore.h
+ * TID storage.
+ *
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/tidstore.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef TIDSTORE_H
+#define TIDSTORE_H
+
+#include "lib/radixtree.h"
+#include "storage/itemptr.h"
+
+typedef dsa_pointer tidstore_handle;
+
+typedef struct TIDStore TIDStore;
+
+typedef struct TIDStoreIter
+{
+ TIDStore *ts;
+
+ rt_iter *tree_iter;
+
+ bool finished;
+
+ uint64 next_key;
+ uint64 next_val;
+
+ BlockNumber blkno;
+	OffsetNumber offsets[MaxOffsetNumber];	/* XXX: usually only partially used */
+ int num_offsets;
+
+#ifdef USE_ASSERT_CHECKING
+ uint64 itemptrs_index;
+ int prev_index;
+#endif
+} TIDStoreIter;
+
+extern TIDStore *tidstore_create(dsa_area *dsa);
+extern TIDStore *tidstore_attach(dsa_area *dsa, dsa_pointer handle);
+extern void tidstore_detach(TIDStore *ts);
+extern void tidstore_free(TIDStore *ts);
+extern void tidstore_reset(TIDStore *ts);
+extern void tidstore_add_tids(TIDStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets);
+extern bool tidstore_lookup_tid(TIDStore *ts, ItemPointer tid);
+extern TIDStoreIter * tidstore_begin_iterate(TIDStore *ts);
+extern bool tidstore_iterate_next(TIDStoreIter *iter);
+extern uint64 tidstore_num_tids(TIDStore *ts);
+extern uint64 tidstore_memory_usage(TIDStore *ts);
+extern tidstore_handle tidstore_get_handle(TIDStore *ts);
+
+#endif /* TIDSTORE_H */
+
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 4e4bc26a8b..c15e6d7a66 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -17,6 +17,7 @@
#include "access/htup.h"
#include "access/genam.h"
#include "access/parallel.h"
+#include "access/tidstore.h"
#include "catalog/pg_class.h"
#include "catalog/pg_statistic.h"
#include "catalog/pg_type.h"
@@ -235,21 +236,6 @@ typedef struct VacuumParams
int nworkers;
} VacuumParams;
-/*
- * VacDeadItems stores TIDs whose index tuples are deleted by index vacuuming.
- */
-typedef struct VacDeadItems
-{
- int max_items; /* # slots allocated in array */
- int num_items; /* current # of entries */
-
- /* Sorted array of TIDs to delete from indexes */
- ItemPointerData items[FLEXIBLE_ARRAY_MEMBER];
-} VacDeadItems;
-
-#define MAXDEADITEMS(avail_mem) \
- (((avail_mem) - offsetof(VacDeadItems, items)) / sizeof(ItemPointerData))
-
/* GUC parameters */
extern PGDLLIMPORT int default_statistics_target; /* PGDLLIMPORT for PostGIS */
extern PGDLLIMPORT int vacuum_freeze_min_age;
@@ -302,18 +288,16 @@ extern Relation vacuum_open_relation(Oid relid, RangeVar *relation,
LOCKMODE lmode);
extern IndexBulkDeleteResult *vac_bulkdel_one_index(IndexVacuumInfo *ivinfo,
IndexBulkDeleteResult *istat,
- VacDeadItems *dead_items);
+ TIDStore *dead_items);
extern IndexBulkDeleteResult *vac_cleanup_one_index(IndexVacuumInfo *ivinfo,
IndexBulkDeleteResult *istat);
-extern Size vac_max_items_to_alloc_size(int max_items);
/* in commands/vacuumparallel.c */
extern ParallelVacuumState *parallel_vacuum_init(Relation rel, Relation *indrels,
int nindexes, int nrequested_workers,
- int max_items, int elevel,
- BufferAccessStrategy bstrategy);
+ int elevel, BufferAccessStrategy bstrategy);
extern void parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats);
-extern VacDeadItems *parallel_vacuum_get_dead_items(ParallelVacuumState *pvs);
+extern TIDStore *parallel_vacuum_get_dead_items(ParallelVacuumState *pvs);
extern void parallel_vacuum_bulkdel_all_indexes(ParallelVacuumState *pvs,
long num_table_tuples,
int num_index_scans);
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index a494cb598f..88e35254d1 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -201,6 +201,7 @@ typedef enum BuiltinTrancheIds
LWTRANCHE_PGSTATS_DSA,
LWTRANCHE_PGSTATS_HASH,
LWTRANCHE_PGSTATS_DATA,
+ LWTRANCHE_PARALLEL_VACUUM_DSA,
LWTRANCHE_FIRST_USER_DEFINED
} BuiltinTrancheIds;
--
2.38.1
Attachment: v13-0007-PoC-DSA-support-for-radix-tree.patch (text/x-patch)
From f413f05673b9f85a62ef16f2b0c51614362f62ec Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Tue, 6 Dec 2022 16:42:55 +0700
Subject: [PATCH v13 7/8] PoC: DSA support for radix tree
---
.../bench_radix_tree--1.0.sql | 2 +
contrib/bench_radix_tree/bench_radix_tree.c | 16 +-
src/backend/lib/radixtree.c | 437 ++++++++++++++----
src/backend/utils/mmgr/dsa.c | 12 +
src/include/lib/radixtree.h | 8 +-
src/include/utils/dsa.h | 1 +
.../expected/test_radixtree.out | 25 +
.../modules/test_radixtree/test_radixtree.c | 147 ++++--
8 files changed, 502 insertions(+), 146 deletions(-)
diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
index 83529805fc..d9216d715c 100644
--- a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
+++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
@@ -7,6 +7,7 @@ create function bench_shuffle_search(
minblk int4,
maxblk int4,
random_block bool DEFAULT false,
+shared bool DEFAULT false,
OUT nkeys int8,
OUT rt_mem_allocated int8,
OUT array_mem_allocated int8,
@@ -23,6 +24,7 @@ create function bench_seq_search(
minblk int4,
maxblk int4,
random_block bool DEFAULT false,
+shared bool DEFAULT false,
OUT nkeys int8,
OUT rt_mem_allocated int8,
OUT array_mem_allocated int8,
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c
index a0693695e6..1a26722495 100644
--- a/contrib/bench_radix_tree/bench_radix_tree.c
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -154,6 +154,8 @@ bench_search(FunctionCallInfo fcinfo, bool shuffle)
BlockNumber maxblk = PG_GETARG_INT32(1);
bool random_block = PG_GETARG_BOOL(2);
radix_tree *rt = NULL;
+ bool shared = PG_GETARG_BOOL(3);
+ dsa_area *dsa = NULL;
uint64 ntids;
uint64 key;
uint64 last_key = PG_UINT64_MAX;
@@ -176,7 +178,11 @@ bench_search(FunctionCallInfo fcinfo, bool shuffle)
tids = generate_tids(minblk, maxblk, TIDS_PER_BLOCK_FOR_LOAD, &ntids, random_block);
/* measure the load time of the radix tree */
- rt = rt_create(CurrentMemoryContext);
+ if (shared)
+ dsa = dsa_create(LWLockNewTrancheId());
+ rt = rt_create(CurrentMemoryContext, dsa);
+
+ /* measure the load time of the radix tree */
start_time = GetCurrentTimestamp();
for (int i = 0; i < ntids; i++)
{
@@ -327,7 +333,7 @@ bench_load_random_int(PG_FUNCTION_ARGS)
elog(ERROR, "return type must be a row type");
pg_prng_seed(&state, 0);
- rt = rt_create(CurrentMemoryContext);
+ rt = rt_create(CurrentMemoryContext, NULL);
start_time = GetCurrentTimestamp();
for (uint64 i = 0; i < cnt; i++)
@@ -393,7 +399,7 @@ bench_search_random_nodes(PG_FUNCTION_ARGS)
}
elog(NOTICE, "bench with filter 0x%lX", filter);
- rt = rt_create(CurrentMemoryContext);
+ rt = rt_create(CurrentMemoryContext, NULL);
for (uint64 i = 0; i < cnt; i++)
{
@@ -462,7 +468,7 @@ bench_fixed_height_search(PG_FUNCTION_ARGS)
if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
elog(ERROR, "return type must be a row type");
- rt = rt_create(CurrentMemoryContext);
+ rt = rt_create(CurrentMemoryContext, NULL);
start_time = GetCurrentTimestamp();
@@ -574,7 +580,7 @@ bench_node128_load(PG_FUNCTION_ARGS)
if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
elog(ERROR, "return type must be a row type");
- rt = rt_create(CurrentMemoryContext);
+ rt = rt_create(CurrentMemoryContext, NULL);
key_id = 0;
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
index bff37a2c35..b890c38b1a 100644
--- a/src/backend/lib/radixtree.c
+++ b/src/backend/lib/radixtree.c
@@ -22,6 +22,15 @@
* choose it to avoid an additional pointer traversal. It is the reason this code
* currently does not support variable-length keys.
*
+ * If a DSA area is specified for rt_create(), the radix tree is created in that
+ * DSA area so that multiple processes can access it simultaneously. The process
+ * that created the shared radix tree needs to pass both the DSA area given to
+ * rt_create() and the tree's handle, fetched by rt_get_handle(), to other
+ * processes so that they can attach to it with rt_attach().
+ *
+ * XXX: the shared radix tree is still in a PoC state as it doesn't have any
+ * locking support. Also, only one process at a time can iterate over it.
+ *
* XXX: Most functions in this file have two variants for inner nodes and leaf
* nodes, therefore there are duplication codes. While this sometimes makes the
* code maintenance tricky, this reduces branch prediction misses when judging
@@ -34,6 +43,9 @@
*
* rt_create - Create a new, empty radix tree
* rt_free - Free the radix tree
+ * rt_attach - Attach to the radix tree
+ * rt_detach - Detach from the radix tree
+ * rt_get_handle - Return the handle of the radix tree
* rt_search - Search a key-value pair
* rt_set - Set a key-value pair
* rt_delete - Delete a key-value pair
@@ -65,6 +77,7 @@
#include "nodes/bitmapset.h"
#include "port/pg_bitutils.h"
#include "port/pg_lfind.h"
+#include "utils/dsa.h"
#include "utils/memutils.h"
#ifdef RT_DEBUG
@@ -426,6 +439,10 @@ static rt_size_class kind_min_size_class[RT_NODE_KIND_COUNT] = {
* construct the key whenever updating the node iteration information, e.g., when
* advancing the current index within the node or when moving to the next node
* at the same level.
+ *
+ * XXX: We need either a safeguard that disallows other processes from beginning
+ * the iteration while one process is doing so, or support for multiple
+ * processes iterating concurrently.
*/
typedef struct rt_node_iter
{
@@ -445,23 +462,43 @@ struct rt_iter
uint64 key;
};
-/* A radix tree with nodes */
-struct radix_tree
+/* A magic value used to identify our radix tree */
+#define RADIXTREE_MAGIC 0x54A48167
+
+/* Control information for a radix tree */
+typedef struct radix_tree_control
{
- MemoryContext context;
+ rt_handle handle;
+ uint32 magic;
+ /* Root node */
rt_pointer root;
+
uint64 max_val;
uint64 num_keys;
- MemoryContextData *inner_slabs[RT_SIZE_CLASS_COUNT];
- MemoryContextData *leaf_slabs[RT_SIZE_CLASS_COUNT];
-
/* statistics */
#ifdef RT_DEBUG
int32 cnt[RT_SIZE_CLASS_COUNT];
#endif
+} radix_tree_control;
+
+/* A radix tree with nodes */
+struct radix_tree
+{
+ MemoryContext context;
+
+ /* control object in either backend-local memory or DSA */
+ radix_tree_control *ctl;
+
+ /* used only when the radix tree is shared */
+ dsa_area *area;
+
+ /* used only when the radix tree is private */
+ MemoryContextData *inner_slabs[RT_SIZE_CLASS_COUNT];
+ MemoryContextData *leaf_slabs[RT_SIZE_CLASS_COUNT];
};
+#define RadixTreeIsShared(rt) ((rt)->area != NULL)
static void rt_new_root(radix_tree *tree, uint64 key);
@@ -490,9 +527,12 @@ static void rt_verify_node(rt_node_ptr node);
/* Decode and encode functions of rt_pointer */
static inline rt_node *
-rt_pointer_decode(rt_pointer encoded)
+rt_pointer_decode(radix_tree *tree, rt_pointer encoded)
{
- return (rt_node *) encoded;
+ if (RadixTreeIsShared(tree))
+ return (rt_node *) dsa_get_address(tree->area, encoded);
+ else
+ return (rt_node *) encoded;
}
static inline rt_pointer
@@ -503,11 +543,11 @@ rt_pointer_encode(rt_node *decoded)
/* Return a rt_node_ptr created from the given encoded pointer */
static inline rt_node_ptr
-rt_node_ptr_encoded(rt_pointer encoded)
+rt_node_ptr_encoded(radix_tree *tree, rt_pointer encoded)
{
return (rt_node_ptr) {
.encoded = encoded,
- .decoded = rt_pointer_decode(encoded),
+ .decoded = rt_pointer_decode(tree, encoded)
};
}
@@ -954,8 +994,8 @@ rt_new_root(radix_tree *tree, uint64 key)
rt_init_node(newnode, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
NODE_SHIFT(newnode) = shift;
- tree->max_val = shift_get_max_val(shift);
- tree->root = newnode.encoded;
+ tree->ctl->max_val = shift_get_max_val(shift);
+ tree->ctl->root = newnode.encoded;
}
/*
@@ -964,20 +1004,35 @@ rt_new_root(radix_tree *tree, uint64 key)
static rt_node_ptr
rt_alloc_node(radix_tree *tree, rt_size_class size_class, bool inner)
{
- rt_node_ptr newnode;
+ rt_node_ptr newnode;
- if (inner)
- newnode.decoded = (rt_node *) MemoryContextAlloc(tree->inner_slabs[size_class],
- rt_size_class_info[size_class].inner_size);
- else
- newnode.decoded = (rt_node *) MemoryContextAlloc(tree->leaf_slabs[size_class],
- rt_size_class_info[size_class].leaf_size);
+ if (tree->area != NULL)
+ {
+ dsa_pointer dp;
- newnode.encoded = rt_pointer_encode(newnode.decoded);
+ if (inner)
+ dp = dsa_allocate(tree->area, rt_size_class_info[size_class].inner_size);
+ else
+ dp = dsa_allocate(tree->area, rt_size_class_info[size_class].leaf_size);
+
+ newnode.encoded = (rt_pointer) dp;
+ newnode.decoded = rt_pointer_decode(tree, newnode.encoded);
+ }
+ else
+ {
+ if (inner)
+ newnode.decoded = (rt_node *) MemoryContextAlloc(tree->inner_slabs[size_class],
+ rt_size_class_info[size_class].inner_size);
+ else
+ newnode.decoded = (rt_node *) MemoryContextAlloc(tree->leaf_slabs[size_class],
+ rt_size_class_info[size_class].leaf_size);
+
+ newnode.encoded = rt_pointer_encode(newnode.decoded);
+ }
#ifdef RT_DEBUG
/* update the statistics */
- tree->cnt[size_class]++;
+ tree->ctl->cnt[size_class]++;
#endif
return newnode;
@@ -1041,10 +1096,10 @@ static void
rt_free_node(radix_tree *tree, rt_node_ptr node)
{
/* If we're deleting the root node, make the tree empty */
- if (tree->root == node.encoded)
+ if (tree->ctl->root == node.encoded)
{
- tree->root = InvalidRTPointer;
- tree->max_val = 0;
+ tree->ctl->root = InvalidRTPointer;
+ tree->ctl->max_val = 0;
}
#ifdef RT_DEBUG
@@ -1062,12 +1117,15 @@ rt_free_node(radix_tree *tree, rt_node_ptr node)
if (i == RT_SIZE_CLASS_COUNT)
i = RT_CLASS_256;
- tree->cnt[i]--;
- Assert(tree->cnt[i] >= 0);
+ tree->ctl->cnt[i]--;
+ Assert(tree->ctl->cnt[i] >= 0);
}
#endif
- pfree(node.decoded);
+ if (RadixTreeIsShared(tree))
+ dsa_free(tree->area, (dsa_pointer) node.encoded);
+ else
+ pfree(node.decoded);
}
/*
@@ -1083,7 +1141,7 @@ rt_replace_node(radix_tree *tree, rt_node_ptr parent, rt_node_ptr old_child,
if (rt_node_ptr_eq(&parent, &old_child))
{
/* Replace the root node with the new large node */
- tree->root = new_child.encoded;
+ tree->ctl->root = new_child.encoded;
}
else
{
@@ -1105,7 +1163,7 @@ static void
rt_extend(radix_tree *tree, uint64 key)
{
int target_shift;
- rt_node *root = rt_pointer_decode(tree->root);
+ rt_node *root = rt_pointer_decode(tree, tree->ctl->root);
int shift = root->shift + RT_NODE_SPAN;
target_shift = key_get_shift(key);
@@ -1123,15 +1181,15 @@ rt_extend(radix_tree *tree, uint64 key)
n4->base.n.shift = shift;
n4->base.n.count = 1;
n4->base.chunks[0] = 0;
- n4->children[0] = tree->root;
+ n4->children[0] = tree->ctl->root;
root->chunk = 0;
- tree->root = node.encoded;
+ tree->ctl->root = node.encoded;
shift += RT_NODE_SPAN;
}
- tree->max_val = shift_get_max_val(target_shift);
+ tree->ctl->max_val = shift_get_max_val(target_shift);
}
/*
@@ -1163,7 +1221,7 @@ rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node_ptr parent,
}
rt_node_insert_leaf(tree, parent, node, key, value);
- tree->num_keys++;
+ tree->ctl->num_keys++;
}
/*
@@ -1174,12 +1232,11 @@ rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node_ptr parent,
* pointer is set to child_p.
*/
static inline bool
-rt_node_search_inner(rt_node_ptr node, uint64 key, rt_action action,
- rt_pointer *child_p)
+rt_node_search_inner(rt_node_ptr node, uint64 key, rt_action action, rt_pointer *child_p)
{
uint8 chunk = RT_GET_KEY_CHUNK(key, NODE_SHIFT(node));
bool found = false;
- rt_pointer child;
+ rt_pointer child = InvalidRTPointer;
switch (NODE_KIND(node))
{
@@ -1210,6 +1267,7 @@ rt_node_search_inner(rt_node_ptr node, uint64 key, rt_action action,
break;
found = true;
+
if (action == RT_ACTION_FIND)
child = n32->children[idx];
else /* RT_ACTION_DELETE */
@@ -1761,33 +1819,51 @@ retry_insert_leaf_32:
* Create the radix tree in the given memory context and return it.
*/
radix_tree *
-rt_create(MemoryContext ctx)
+rt_create(MemoryContext ctx, dsa_area *area)
{
radix_tree *tree;
MemoryContext old_ctx;
old_ctx = MemoryContextSwitchTo(ctx);
- tree = palloc(sizeof(radix_tree));
+ tree = (radix_tree *) palloc0(sizeof(radix_tree));
tree->context = ctx;
- tree->root = InvalidRTPointer;
- tree->max_val = 0;
- tree->num_keys = 0;
+
+ if (area != NULL)
+ {
+ dsa_pointer dp;
+
+ tree->area = area;
+ dp = dsa_allocate0(area, sizeof(radix_tree_control));
+ tree->ctl = (radix_tree_control *) dsa_get_address(area, dp);
+ tree->ctl->handle = (rt_handle) dp;
+ }
+ else
+ {
+ tree->ctl = (radix_tree_control *) palloc0(sizeof(radix_tree_control));
+ tree->ctl->handle = InvalidDsaPointer;
+ }
+
+ tree->ctl->magic = RADIXTREE_MAGIC;
+ tree->ctl->root = InvalidRTPointer;
/* Create the slab allocator for each size class */
- for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ if (area == NULL)
{
- tree->inner_slabs[i] = SlabContextCreate(ctx,
- rt_size_class_info[i].name,
- rt_size_class_info[i].inner_blocksize,
- rt_size_class_info[i].inner_size);
- tree->leaf_slabs[i] = SlabContextCreate(ctx,
- rt_size_class_info[i].name,
- rt_size_class_info[i].leaf_blocksize,
- rt_size_class_info[i].leaf_size);
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ tree->inner_slabs[i] = SlabContextCreate(ctx,
+ rt_size_class_info[i].name,
+ rt_size_class_info[i].inner_blocksize,
+ rt_size_class_info[i].inner_size);
+ tree->leaf_slabs[i] = SlabContextCreate(ctx,
+ rt_size_class_info[i].name,
+ rt_size_class_info[i].leaf_blocksize,
+ rt_size_class_info[i].leaf_size);
#ifdef RT_DEBUG
- tree->cnt[i] = 0;
+ tree->ctl->cnt[i] = 0;
#endif
+ }
}
MemoryContextSwitchTo(old_ctx);
@@ -1795,16 +1871,163 @@ rt_create(MemoryContext ctx)
return tree;
}
+/*
+ * Get a handle that can be used by other processes to attach to this radix
+ * tree.
+ */
+dsa_pointer
+rt_get_handle(radix_tree *tree)
+{
+ Assert(RadixTreeIsShared(tree));
+ Assert(tree->ctl->magic == RADIXTREE_MAGIC);
+
+ return tree->ctl->handle;
+}
+
+/*
+ * Attach to an existing radix tree using a handle. The returned object is
+ * allocated in backend-local memory using the CurrentMemoryContext.
+ */
+radix_tree *
+rt_attach(dsa_area *area, rt_handle handle)
+{
+ radix_tree *tree;
+ dsa_pointer control;
+
+ /* Allocate the backend-local object representing the radix tree */
+ tree = (radix_tree *) palloc0(sizeof(radix_tree));
+
+	/* Find the control object in shared memory */
+ control = handle;
+
+ /* Set up the local radix tree */
+ tree->area = area;
+ tree->ctl = (radix_tree_control *) dsa_get_address(area, control);
+ Assert(tree->ctl->magic == RADIXTREE_MAGIC);
+
+ return tree;
+}
+
+/*
+ * Detach from a radix tree. This frees backend-local resources associated
+ * with the radix tree, but the radix tree will continue to exist until
+ * it is explicitly freed.
+ */
+void
+rt_detach(radix_tree *tree)
+{
+ Assert(RadixTreeIsShared(tree));
+ Assert(tree->ctl->magic == RADIXTREE_MAGIC);
+
+ pfree(tree);
+}
+
+/*
+ * Recursively free all nodes allocated in the DSA area.
+ */
+static void
+rt_free_recurse(radix_tree *tree, rt_pointer ptr)
+{
+ rt_node_ptr node = rt_node_ptr_encoded(tree, ptr);
+
+ Assert(RadixTreeIsShared(tree));
+
+ check_stack_depth();
+ CHECK_FOR_INTERRUPTS();
+
+ /* The leaf node doesn't have child pointers, so free it */
+ if (NODE_IS_LEAF(node))
+ {
+ dsa_free(tree->area, (dsa_pointer) node.encoded);
+ return;
+ }
+
+ switch (NODE_KIND(node))
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node.decoded;
+
+ /* Free all children recursively */
+ for (int i = 0; i < NODE_COUNT(node); i++)
+ rt_free_recurse(tree, n4->children[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node.decoded;
+
+ /* Free all children recursively */
+ for (int i = 0; i < NODE_COUNT(node); i++)
+ rt_free_recurse(tree, n32->children[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ rt_node_inner_125 *n125 = (rt_node_inner_125 *) node.decoded;
+
+ /* Free all children recursively */
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!node_125_is_chunk_used((rt_node_base_125 *) n125, i))
+ continue;
+
+ rt_free_recurse(tree, node_inner_125_get_child(n125, i));
+ }
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node.decoded;
+
+ /* Free all children recursively */
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!node_inner_256_is_chunk_used(n256, i))
+ continue;
+
+ rt_free_recurse(tree, node_inner_256_get_child(n256, i));
+ }
+ break;
+ }
+ }
+
+ /* Free the inner node itself */
+ dsa_free(tree->area, node.encoded);
+}
+
/*
* Free the given radix tree.
*/
void
rt_free(radix_tree *tree)
{
- for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ Assert(!RadixTreeIsShared(tree) || tree->ctl->magic == RADIXTREE_MAGIC);
+
+ if (RadixTreeIsShared(tree))
{
- MemoryContextDelete(tree->inner_slabs[i]);
- MemoryContextDelete(tree->leaf_slabs[i]);
+ /* Free all memory used for radix tree nodes */
+ if (RTPointerIsValid(tree->ctl->root))
+ rt_free_recurse(tree, tree->ctl->root);
+
+ /*
+	 * Vandalize the control block to help catch programming errors where
+ * other backends access the memory formerly occupied by this radix tree.
+ */
+ tree->ctl->magic = 0;
+ dsa_free(tree->area, tree->ctl->handle);
+ }
+ else
+ {
+ /* Free all memory used for radix tree nodes */
+	for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ MemoryContextDelete(tree->inner_slabs[i]);
+ MemoryContextDelete(tree->leaf_slabs[i]);
+ }
+ pfree(tree->ctl);
}
pfree(tree);
@@ -1822,16 +2045,18 @@ rt_set(radix_tree *tree, uint64 key, uint64 value)
rt_node_ptr node;
rt_node_ptr parent;
+ Assert(!RadixTreeIsShared(tree) || tree->ctl->magic == RADIXTREE_MAGIC);
+
/* Empty tree, create the root */
- if (!RTPointerIsValid(tree->root))
+ if (!RTPointerIsValid(tree->ctl->root))
rt_new_root(tree, key);
/* Extend the tree if necessary */
- if (key > tree->max_val)
+ if (key > tree->ctl->max_val)
rt_extend(tree, key);
/* Descend the tree until a leaf node */
- node = parent = rt_node_ptr_encoded(tree->root);
+ node = parent = rt_node_ptr_encoded(tree, tree->ctl->root);
shift = NODE_SHIFT(node);
while (shift >= 0)
{
@@ -1847,7 +2072,7 @@ rt_set(radix_tree *tree, uint64 key, uint64 value)
}
parent = node;
- node = rt_node_ptr_encoded(child);
+ node = rt_node_ptr_encoded(tree, child);
shift -= RT_NODE_SPAN;
}
@@ -1855,7 +2080,7 @@ rt_set(radix_tree *tree, uint64 key, uint64 value)
/* Update the statistics */
if (!updated)
- tree->num_keys++;
+ tree->ctl->num_keys++;
return updated;
}
@@ -1871,12 +2096,13 @@ rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
rt_node_ptr node;
int shift;
+ Assert(!RadixTreeIsShared(tree) || tree->ctl->magic == RADIXTREE_MAGIC);
Assert(value_p != NULL);
- if (!RTPointerIsValid(tree->root) || key > tree->max_val)
+ if (!RTPointerIsValid(tree->ctl->root) || key > tree->ctl->max_val)
return false;
- node = rt_node_ptr_encoded(tree->root);
+ node = rt_node_ptr_encoded(tree, tree->ctl->root);
shift = NODE_SHIFT(node);
/* Descend the tree until a leaf node */
@@ -1890,7 +2116,7 @@ rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
return false;
- node = rt_node_ptr_encoded(child);
+ node = rt_node_ptr_encoded(tree, child);
shift -= RT_NODE_SPAN;
}
@@ -1910,14 +2136,16 @@ rt_delete(radix_tree *tree, uint64 key)
int level;
bool deleted;
- if (!tree->root || key > tree->max_val)
+ Assert(!RadixTreeIsShared(tree) || tree->ctl->magic == RADIXTREE_MAGIC);
+
+ if (!RTPointerIsValid(tree->ctl->root) || key > tree->ctl->max_val)
return false;
/*
* Descend the tree to search the key while building a stack of nodes we
* visited.
*/
- node = rt_node_ptr_encoded(tree->root);
+ node = rt_node_ptr_encoded(tree, tree->ctl->root);
shift = NODE_SHIFT(node);
level = -1;
while (shift > 0)
@@ -1930,7 +2158,7 @@ rt_delete(radix_tree *tree, uint64 key)
if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
return false;
- node = rt_node_ptr_encoded(child);
+ node = rt_node_ptr_encoded(tree, child);
shift -= RT_NODE_SPAN;
}
@@ -1945,7 +2173,7 @@ rt_delete(radix_tree *tree, uint64 key)
}
/* Found the key to delete. Update the statistics */
- tree->num_keys--;
+ tree->ctl->num_keys--;
/*
* Return if the leaf node still has keys and we don't need to delete the
@@ -1985,16 +2213,18 @@ rt_begin_iterate(radix_tree *tree)
rt_iter *iter;
int top_level;
+ Assert(!RadixTreeIsShared(tree) || tree->ctl->magic == RADIXTREE_MAGIC);
+
old_ctx = MemoryContextSwitchTo(tree->context);
iter = (rt_iter *) palloc0(sizeof(rt_iter));
iter->tree = tree;
/* empty tree */
- if (!RTPointerIsValid(iter->tree) || !RTPointerIsValid(iter->tree->root))
+ if (!RTPointerIsValid(iter->tree) || !RTPointerIsValid(iter->tree->ctl->root))
return iter;
- root = rt_node_ptr_encoded(iter->tree->root);
+ root = rt_node_ptr_encoded(tree, iter->tree->ctl->root);
top_level = NODE_SHIFT(root) / RT_NODE_SPAN;
iter->stack_len = top_level;
@@ -2045,8 +2275,10 @@ rt_update_iter_stack(rt_iter *iter, rt_node_ptr from_node, int from)
bool
rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p)
{
+ Assert(!RadixTreeIsShared(iter->tree) || iter->tree->ctl->magic == RADIXTREE_MAGIC);
+
/* Empty tree */
- if (!iter->tree->root)
+ if (!iter->tree->ctl->root)
return false;
for (;;)
@@ -2190,7 +2422,7 @@ rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter, rt_node_ptr *
if (found)
{
rt_iter_update_key(iter, key_chunk, NODE_SHIFT(node));
- *child_p = rt_node_ptr_encoded(child);
+ *child_p = rt_node_ptr_encoded(iter->tree, child);
}
return found;
@@ -2293,7 +2525,7 @@ rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter, uint64 *value_
uint64
rt_num_entries(radix_tree *tree)
{
- return tree->num_keys;
+ return tree->ctl->num_keys;
}
/*
@@ -2302,12 +2534,19 @@ rt_num_entries(radix_tree *tree)
uint64
rt_memory_usage(radix_tree *tree)
{
- Size total = sizeof(radix_tree);
+ Size total = sizeof(radix_tree) + sizeof(radix_tree_control);
- for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ Assert(!RadixTreeIsShared(tree) || tree->ctl->magic == RADIXTREE_MAGIC);
+
+ if (RadixTreeIsShared(tree))
+ total = dsa_get_total_size(tree->area);
+ else
{
- total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
- total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
+	for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
+ total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
+ }
}
return total;
@@ -2391,23 +2630,23 @@ rt_verify_node(rt_node_ptr node)
void
rt_stats(radix_tree *tree)
{
- rt_node *root = rt_pointer_decode(tree->root);
+ rt_node *root = rt_pointer_decode(tree, tree->ctl->root);
if (root == NULL)
return;
ereport(NOTICE, (errmsg("num_keys = " UINT64_FORMAT ", height = %d, n4 = %u, n15 = %u, n32 = %u, n125 = %u, n256 = %u",
- tree->num_keys,
+ tree->ctl->num_keys,
root->shift / RT_NODE_SPAN,
- tree->cnt[RT_CLASS_4_FULL],
- tree->cnt[RT_CLASS_32_PARTIAL],
- tree->cnt[RT_CLASS_32_FULL],
- tree->cnt[RT_CLASS_125_FULL],
- tree->cnt[RT_CLASS_256])));
+ tree->ctl->cnt[RT_CLASS_4_FULL],
+ tree->ctl->cnt[RT_CLASS_32_PARTIAL],
+ tree->ctl->cnt[RT_CLASS_32_FULL],
+ tree->ctl->cnt[RT_CLASS_125_FULL],
+ tree->ctl->cnt[RT_CLASS_256])));
}
static void
-rt_dump_node(rt_node_ptr node, int level, bool recurse)
+rt_dump_node(radix_tree *tree, rt_node_ptr node, int level, bool recurse)
{
rt_node *n = node.decoded;
char space[128] = {0};
@@ -2445,7 +2684,7 @@ rt_dump_node(rt_node_ptr node, int level, bool recurse)
space, n4->base.chunks[i]);
if (recurse)
- rt_dump_node(rt_node_ptr_encoded(n4->children[i]),
+ rt_dump_node(tree, rt_node_ptr_encoded(tree, n4->children[i]),
level + 1, recurse);
else
fprintf(stderr, "\n");
@@ -2473,7 +2712,7 @@ rt_dump_node(rt_node_ptr node, int level, bool recurse)
if (recurse)
{
- rt_dump_node(rt_node_ptr_encoded(n32->children[i]),
+ rt_dump_node(tree, rt_node_ptr_encoded(tree, n32->children[i]),
level + 1, recurse);
}
else
@@ -2526,7 +2765,9 @@ rt_dump_node(rt_node_ptr node, int level, bool recurse)
space, i);
if (recurse)
- rt_dump_node(rt_node_ptr_encoded(node_inner_125_get_child(n125, i)),
+ rt_dump_node(tree,
+ rt_node_ptr_encoded(tree,
+ node_inner_125_get_child(n125, i)),
level + 1, recurse);
else
fprintf(stderr, "\n");
@@ -2559,7 +2800,9 @@ rt_dump_node(rt_node_ptr node, int level, bool recurse)
space, i);
if (recurse)
- rt_dump_node(rt_node_ptr_encoded(node_inner_256_get_child(n256, i)),
+ rt_dump_node(tree,
+ rt_node_ptr_encoded(tree,
+ node_inner_256_get_child(n256, i)),
level + 1, recurse);
else
fprintf(stderr, "\n");
@@ -2579,28 +2822,28 @@ rt_dump_search(radix_tree *tree, uint64 key)
elog(NOTICE, "-----------------------------------------------------------");
elog(NOTICE, "max_val = " UINT64_FORMAT "(0x" UINT64_FORMAT_HEX ")",
- tree->max_val, tree->max_val);
+ tree->ctl->max_val, tree->ctl->max_val);
- if (!RTPointerIsValid(tree->root))
+ if (!RTPointerIsValid(tree->ctl->root))
{
elog(NOTICE, "tree is empty");
return;
}
- if (key > tree->max_val)
+ if (key > tree->ctl->max_val)
{
elog(NOTICE, "key " UINT64_FORMAT "(0x" UINT64_FORMAT_HEX ") is larger than max val",
key, key);
return;
}
- node = rt_node_ptr_encoded(tree->root);
+ node = rt_node_ptr_encoded(tree, tree->ctl->root);
shift = NODE_SHIFT(node);
while (shift >= 0)
{
rt_pointer child;
- rt_dump_node(node, level, false);
+ rt_dump_node(tree, node, level, false);
if (NODE_IS_LEAF(node))
{
@@ -2615,7 +2858,7 @@ rt_dump_search(radix_tree *tree, uint64 key)
if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
break;
- node = rt_node_ptr_encoded(child);
+ node = rt_node_ptr_encoded(tree, child);
shift -= RT_NODE_SPAN;
level++;
}
@@ -2633,15 +2876,15 @@ rt_dump(radix_tree *tree)
rt_size_class_info[i].inner_blocksize,
rt_size_class_info[i].leaf_size,
rt_size_class_info[i].leaf_blocksize);
- fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->max_val);
+ fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->ctl->max_val);
- if (!RTPointerIsValid(tree->root))
+ if (!RTPointerIsValid(tree->ctl->root))
{
fprintf(stderr, "empty tree\n");
return;
}
- root = rt_node_ptr_encoded(tree->root);
- rt_dump_node(root, 0, true);
+ root = rt_node_ptr_encoded(tree, tree->ctl->root);
+ rt_dump_node(tree, root, 0, true);
}
#endif
diff --git a/src/backend/utils/mmgr/dsa.c b/src/backend/utils/mmgr/dsa.c
index 82376fde2d..ad169882af 100644
--- a/src/backend/utils/mmgr/dsa.c
+++ b/src/backend/utils/mmgr/dsa.c
@@ -1024,6 +1024,18 @@ dsa_set_size_limit(dsa_area *area, size_t limit)
LWLockRelease(DSA_AREA_LOCK(area));
}
+size_t
+dsa_get_total_size(dsa_area *area)
+{
+ size_t size;
+
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_SHARED);
+ size = area->control->total_segment_size;
+ LWLockRelease(DSA_AREA_LOCK(area));
+
+ return size;
+}
+
/*
* Aggressively free all spare memory in the hope of returning DSM segments to
* the operating system.
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index d5d7668617..68a11df970 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -14,18 +14,24 @@
#define RADIXTREE_H
#include "postgres.h"
+#include "utils/dsa.h"
#define RT_DEBUG 1
typedef struct radix_tree radix_tree;
typedef struct rt_iter rt_iter;
+typedef dsa_pointer rt_handle;
-extern radix_tree *rt_create(MemoryContext ctx);
+extern radix_tree *rt_create(MemoryContext ctx, dsa_area *dsa);
extern void rt_free(radix_tree *tree);
extern bool rt_search(radix_tree *tree, uint64 key, uint64 *val_p);
extern bool rt_set(radix_tree *tree, uint64 key, uint64 val);
extern rt_iter *rt_begin_iterate(radix_tree *tree);
+extern rt_handle rt_get_handle(radix_tree *tree);
+extern radix_tree *rt_attach(dsa_area *dsa, dsa_pointer dp);
+extern void rt_detach(radix_tree *tree);
+
extern bool rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p);
extern void rt_end_iterate(rt_iter *iter);
extern bool rt_delete(radix_tree *tree, uint64 key);
diff --git a/src/include/utils/dsa.h b/src/include/utils/dsa.h
index 405606fe2f..dad06adecc 100644
--- a/src/include/utils/dsa.h
+++ b/src/include/utils/dsa.h
@@ -117,6 +117,7 @@ extern dsa_handle dsa_get_handle(dsa_area *area);
extern dsa_pointer dsa_allocate_extended(dsa_area *area, size_t size, int flags);
extern void dsa_free(dsa_area *area, dsa_pointer dp);
extern void *dsa_get_address(dsa_area *area, dsa_pointer dp);
+extern size_t dsa_get_total_size(dsa_area *area);
extern void dsa_trim(dsa_area *area);
extern void dsa_dump(dsa_area *area);
diff --git a/src/test/modules/test_radixtree/expected/test_radixtree.out b/src/test/modules/test_radixtree/expected/test_radixtree.out
index ce645cb8b5..a217e0d312 100644
--- a/src/test/modules/test_radixtree/expected/test_radixtree.out
+++ b/src/test/modules/test_radixtree/expected/test_radixtree.out
@@ -6,28 +6,53 @@ CREATE EXTENSION test_radixtree;
SELECT test_radixtree();
NOTICE: testing basic operations with leaf node 4
NOTICE: testing basic operations with inner node 4
+NOTICE: testing basic operations with leaf node 4
+NOTICE: testing basic operations with inner node 4
+NOTICE: testing basic operations with leaf node 32
+NOTICE: testing basic operations with inner node 32
NOTICE: testing basic operations with leaf node 32
NOTICE: testing basic operations with inner node 32
NOTICE: testing basic operations with leaf node 125
NOTICE: testing basic operations with inner node 125
+NOTICE: testing basic operations with leaf node 125
+NOTICE: testing basic operations with inner node 125
NOTICE: testing basic operations with leaf node 256
NOTICE: testing basic operations with inner node 256
+NOTICE: testing basic operations with leaf node 256
+NOTICE: testing basic operations with inner node 256
+NOTICE: testing radix tree node types with shift "0"
NOTICE: testing radix tree node types with shift "0"
NOTICE: testing radix tree node types with shift "8"
+NOTICE: testing radix tree node types with shift "8"
NOTICE: testing radix tree node types with shift "16"
+NOTICE: testing radix tree node types with shift "16"
+NOTICE: testing radix tree node types with shift "24"
NOTICE: testing radix tree node types with shift "24"
NOTICE: testing radix tree node types with shift "32"
+NOTICE: testing radix tree node types with shift "32"
NOTICE: testing radix tree node types with shift "40"
+NOTICE: testing radix tree node types with shift "40"
+NOTICE: testing radix tree node types with shift "48"
NOTICE: testing radix tree node types with shift "48"
NOTICE: testing radix tree node types with shift "56"
+NOTICE: testing radix tree node types with shift "56"
NOTICE: testing radix tree with pattern "all ones"
+NOTICE: testing radix tree with pattern "all ones"
+NOTICE: testing radix tree with pattern "alternating bits"
NOTICE: testing radix tree with pattern "alternating bits"
NOTICE: testing radix tree with pattern "clusters of ten"
+NOTICE: testing radix tree with pattern "clusters of ten"
NOTICE: testing radix tree with pattern "clusters of hundred"
+NOTICE: testing radix tree with pattern "clusters of hundred"
+NOTICE: testing radix tree with pattern "one-every-64k"
NOTICE: testing radix tree with pattern "one-every-64k"
NOTICE: testing radix tree with pattern "sparse"
+NOTICE: testing radix tree with pattern "sparse"
NOTICE: testing radix tree with pattern "single values, distance > 2^32"
+NOTICE: testing radix tree with pattern "single values, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^32"
NOTICE: testing radix tree with pattern "clusters, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^60"
NOTICE: testing radix tree with pattern "clusters, distance > 2^60"
test_radixtree
----------------
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
index ea993e63df..fe1e168ec4 100644
--- a/src/test/modules/test_radixtree/test_radixtree.c
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -19,6 +19,7 @@
#include "nodes/bitmapset.h"
#include "storage/block.h"
#include "storage/itemptr.h"
+#include "storage/lwlock.h"
#include "utils/memutils.h"
#include "utils/timestamp.h"
@@ -99,6 +100,8 @@ static const test_spec test_specs[] = {
}
};
+static int lwlock_tranche_id;
+
PG_MODULE_MAGIC;
PG_FUNCTION_INFO_V1(test_radixtree);
@@ -112,7 +115,7 @@ test_empty(void)
uint64 key;
uint64 val;
- radixtree = rt_create(CurrentMemoryContext);
+ radixtree = rt_create(CurrentMemoryContext, NULL);
if (rt_search(radixtree, 0, &dummy))
elog(ERROR, "rt_search on empty tree returned true");
@@ -140,17 +143,14 @@ test_empty(void)
}
static void
-test_basic(int children, bool test_inner)
+do_test_basic(radix_tree *radixtree, int children, bool test_inner)
{
- radix_tree *radixtree;
uint64 *keys;
int shift = test_inner ? 8 : 0;
elog(NOTICE, "testing basic operations with %s node %d",
test_inner ? "inner" : "leaf", children);
- radixtree = rt_create(CurrentMemoryContext);
-
/* prepare keys in order like 1, 32, 2, 31, 2, ... */
keys = palloc(sizeof(uint64) * children);
for (int i = 0; i < children; i++)
@@ -165,7 +165,7 @@ test_basic(int children, bool test_inner)
for (int i = 0; i < children; i++)
{
if (rt_set(radixtree, keys[i], keys[i]))
- elog(ERROR, "new inserted key 0x" UINT64_HEX_FORMAT " is found ", keys[i]);
+ elog(ERROR, "new inserted key 0x" UINT64_HEX_FORMAT " is found %d", keys[i], i);
}
/* update keys */
@@ -185,7 +185,38 @@ test_basic(int children, bool test_inner)
}
pfree(keys);
- rt_free(radixtree);
+}
+
+static void
+test_basic(void)
+{
+ for (int i = 1; i < lengthof(rt_node_kind_fanouts); i++)
+ {
+ radix_tree *tree;
+ dsa_area *area;
+
+ /* Test the local radix tree */
+ tree = rt_create(CurrentMemoryContext, NULL);
+ do_test_basic(tree, rt_node_kind_fanouts[i], false);
+ rt_free(tree);
+
+ tree = rt_create(CurrentMemoryContext, NULL);
+ do_test_basic(tree, rt_node_kind_fanouts[i], true);
+ rt_free(tree);
+
+ /* Test the shared radix tree */
+ area = dsa_create(lwlock_tranche_id);
+ tree = rt_create(CurrentMemoryContext, area);
+ do_test_basic(tree, rt_node_kind_fanouts[i], false);
+ rt_free(tree);
+ dsa_detach(area);
+
+ area = dsa_create(lwlock_tranche_id);
+ tree = rt_create(CurrentMemoryContext, area);
+ do_test_basic(tree, rt_node_kind_fanouts[i], true);
+ rt_free(tree);
+ dsa_detach(area);
+ }
}
/*
@@ -286,14 +317,10 @@ test_node_types_delete(radix_tree *radixtree, uint8 shift)
* level.
*/
static void
-test_node_types(uint8 shift)
+do_test_node_types(radix_tree *radixtree, uint8 shift)
{
- radix_tree *radixtree;
-
elog(NOTICE, "testing radix tree node types with shift \"%d\"", shift);
- radixtree = rt_create(CurrentMemoryContext);
-
/*
* Insert and search entries for every node type at the 'shift' level,
* then delete all entries to make it empty, and insert and search entries
@@ -302,19 +329,37 @@ test_node_types(uint8 shift)
test_node_types_insert(radixtree, shift, true);
test_node_types_delete(radixtree, shift);
test_node_types_insert(radixtree, shift, false);
+}
- rt_free(radixtree);
+static void
+test_node_types(void)
+{
+ for (int shift = 0; shift <= (64 - 8); shift += 8)
+ {
+ radix_tree *tree;
+ dsa_area *area;
+
+ /* Test the local radix tree */
+ tree = rt_create(CurrentMemoryContext, NULL);
+ do_test_node_types(tree, shift);
+ rt_free(tree);
+
+ /* Test the shared radix tree */
+ area = dsa_create(lwlock_tranche_id);
+ tree = rt_create(CurrentMemoryContext, area);
+ do_test_node_types(tree, shift);
+ rt_free(tree);
+ dsa_detach(area);
+ }
}
/*
* Test with a repeating pattern, defined by the 'spec'.
*/
static void
-test_pattern(const test_spec * spec)
+do_test_pattern(radix_tree *radixtree, const test_spec * spec)
{
- radix_tree *radixtree;
rt_iter *iter;
- MemoryContext radixtree_ctx;
TimestampTz starttime;
TimestampTz endtime;
uint64 n;
@@ -340,18 +385,6 @@ test_pattern(const test_spec * spec)
pattern_values[pattern_num_values++] = i;
}
- /*
- * Allocate the radix tree.
- *
- * Allocate it in a separate memory context, so that we can print its
- * memory usage easily.
- */
- radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
- "radixtree test",
- ALLOCSET_SMALL_SIZES);
- MemoryContextSetIdentifier(radixtree_ctx, spec->test_name);
- radixtree = rt_create(radixtree_ctx);
-
/*
* Add values to the set.
*/
@@ -405,8 +438,6 @@ test_pattern(const test_spec * spec)
mem_usage = rt_memory_usage(radixtree);
fprintf(stderr, "rt_memory_usage() reported " UINT64_FORMAT " (%0.2f bytes / integer)\n",
mem_usage, (double) mem_usage / spec->num_values);
-
- MemoryContextStats(radixtree_ctx);
}
/* Check that rt_num_entries works */
@@ -555,27 +586,57 @@ test_pattern(const test_spec * spec)
if ((nbefore - ndeleted) != nafter)
elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT "after " UINT64_FORMAT " deletion",
nafter, (nbefore - ndeleted), ndeleted);
+}
+
+static void
+test_patterns(void)
+{
+ /* Test different test patterns, with lots of entries */
+ for (int i = 0; i < lengthof(test_specs); i++)
+ {
+ radix_tree *tree;
+ MemoryContext radixtree_ctx;
+ dsa_area *area;
+ const test_spec *spec = &test_specs[i];
- MemoryContextDelete(radixtree_ctx);
+ /*
+ * Allocate the radix tree.
+ *
+ * Allocate it in a separate memory context, so that we can print its
+ * memory usage easily.
+ */
+ radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
+ "radixtree test",
+ ALLOCSET_SMALL_SIZES);
+ MemoryContextSetIdentifier(radixtree_ctx, spec->test_name);
+
+ /* Test the local radix tree */
+ tree = rt_create(radixtree_ctx, NULL);
+ do_test_pattern(tree, spec);
+ rt_free(tree);
+ MemoryContextReset(radixtree_ctx);
+
+ /* Test the shared radix tree */
+ area = dsa_create(lwlock_tranche_id);
+ tree = rt_create(radixtree_ctx, area);
+ do_test_pattern(tree, spec);
+ rt_free(tree);
+ dsa_detach(area);
+ MemoryContextDelete(radixtree_ctx);
+ }
}
Datum
test_radixtree(PG_FUNCTION_ARGS)
{
- test_empty();
+ /* Get a new LWLock tranche ID for the shared radix tree tests */
+ lwlock_tranche_id = LWLockNewTrancheId();
- for (int i = 1; i < lengthof(rt_node_kind_fanouts); i++)
- {
- test_basic(rt_node_kind_fanouts[i], false);
- test_basic(rt_node_kind_fanouts[i], true);
- }
-
- for (int shift = 0; shift <= (64 - 8); shift += 8)
- test_node_types(shift);
+ test_empty();
+ test_basic();
- /* Test different test patterns, with lots of entries */
- for (int i = 0; i < lengthof(test_specs); i++)
- test_pattern(&test_specs[i]);
+ test_node_types();
+ test_patterns();
PG_RETURN_VOID();
}
--
2.38.1
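
For illustration, here is a minimal usage sketch of the shared-memory API added
above (this snippet is not part of the attached patches): a leader backend
creates the radix tree in a DSA area and publishes its handle, and a worker
attaches to the same tree through that handle. It assumes the usual
radixtree.h/dsa.h includes and a valid LWLock tranche id, as in
test_radixtree.c; the "worker_area" variable is hypothetical and stands for
the worker's own mapping of the same DSA area.

    uint64      key = 42;
    uint64      value = 0xFF;
    uint64      val;

    /* Leader: create a shared radix tree in a DSA area and get its handle */
    dsa_area   *area = dsa_create(LWLockNewTrancheId());
    radix_tree *tree = rt_create(CurrentMemoryContext, area);
    rt_handle   handle = rt_get_handle(tree);   /* ship this to workers */

    rt_set(tree, key, value);

    /*
     * Worker: attach to the same tree through the handle.  "worker_area" is
     * hypothetical here; it would normally be obtained with dsa_attach() on
     * the same DSA area.
     */
    radix_tree *wtree = rt_attach(worker_area, handle);

    if (!rt_search(wtree, key, &val))
        elog(ERROR, "key not found in shared radix tree");

    rt_detach(wtree);           /* frees only the backend-local object */

    /* Leader, when done: free the shared tree and detach from the area */
    rt_free(tree);
    dsa_detach(area);
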
Attachment: v13-0006-Use-rt_node_ptr-to-reference-radix-tree-nodes.patch (text/x-patch)
From 4dceebdffb8a03e8863d640d25c2d197ef8c16b7 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 14 Nov 2022 11:44:17 +0900
Subject: [PATCH v13 6/8] Use rt_node_ptr to reference radix tree nodes.
---
src/backend/lib/radixtree.c | 688 +++++++++++++++++++++---------------
1 file changed, 398 insertions(+), 290 deletions(-)
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
index abd0450727..bff37a2c35 100644
--- a/src/backend/lib/radixtree.c
+++ b/src/backend/lib/radixtree.c
@@ -150,6 +150,19 @@ typedef enum rt_size_class
#define RT_SIZE_CLASS_COUNT (RT_CLASS_256 + 1)
} rt_size_class;
+/*
+ * rt_pointer is a pointer type that can hold either a pointer to local
+ * memory or a pointer into a DSA area (i.e. dsa_pointer). Since radix tree
+ * nodes can be allocated in backend-local memory as well as in a DSA area,
+ * inner nodes cannot store plain C pointers to rt_node (i.e. backend-local
+ * addresses) as child pointers; they must use rt_pointer instead. The
+ * backend-local address of a node can be obtained from an rt_pointer with
+ * rt_pointer_decode().
+ */
+typedef uintptr_t rt_pointer;
+#define InvalidRTPointer ((rt_pointer) 0)
+#define RTPointerIsValid(x) (((rt_pointer) (x)) != InvalidRTPointer)
+
/* Common type for all nodes types */
typedef struct rt_node
{
@@ -175,8 +188,7 @@ typedef struct rt_node
/* Node kind, one per search/set algorithm */
uint8 kind;
} rt_node;
-#define NODE_IS_LEAF(n) (((rt_node *) (n))->shift == 0)
-#define NODE_IS_EMPTY(n) (((rt_node *) (n))->count == 0)
+#define RT_NODE_IS_LEAF(n) (((rt_node *) (n))->shift == 0)
#define VAR_NODE_HAS_FREE_SLOT(node) \
((node)->base.n.count < (node)->base.n.fanout)
#define FIXED_NODE_HAS_FREE_SLOT(node, class) \
@@ -240,7 +252,7 @@ typedef struct rt_node_inner_4
rt_node_base_4 base;
/* number of children depends on size class */
- rt_node *children[FLEXIBLE_ARRAY_MEMBER];
+ rt_pointer children[FLEXIBLE_ARRAY_MEMBER];
} rt_node_inner_4;
typedef struct rt_node_leaf_4
@@ -256,7 +268,7 @@ typedef struct rt_node_inner_32
rt_node_base_32 base;
/* number of children depends on size class */
- rt_node *children[FLEXIBLE_ARRAY_MEMBER];
+ rt_pointer children[FLEXIBLE_ARRAY_MEMBER];
} rt_node_inner_32;
typedef struct rt_node_leaf_32
@@ -272,7 +284,7 @@ typedef struct rt_node_inner_125
rt_node_base_125 base;
/* number of children depends on size class */
- rt_node *children[FLEXIBLE_ARRAY_MEMBER];
+ rt_pointer children[FLEXIBLE_ARRAY_MEMBER];
} rt_node_inner_125;
typedef struct rt_node_leaf_125
@@ -292,7 +304,7 @@ typedef struct rt_node_inner_256
rt_node_base_256 base;
/* Slots for 256 children */
- rt_node *children[RT_NODE_MAX_SLOTS];
+ rt_pointer children[RT_NODE_MAX_SLOTS];
} rt_node_inner_256;
typedef struct rt_node_leaf_256
@@ -306,6 +318,29 @@ typedef struct rt_node_leaf_256
uint64 values[RT_NODE_MAX_SLOTS];
} rt_node_leaf_256;
+/* rt_node_ptr is a data structure representing a pointer to an rt_node */
+typedef struct rt_node_ptr
+{
+ rt_pointer encoded;
+ rt_node *decoded;
+} rt_node_ptr;
+#define InvalidRTNodePtr \
+ (rt_node_ptr) {.encoded = InvalidRTPointer, .decoded = NULL}
+#define RTNodePtrIsValid(n) \
+ (!rt_node_ptr_eq((rt_node_ptr *) &(n), &(InvalidRTNodePtr)))
+
+/* Macros for rt_node_ptr to access the fields of rt_node */
+#define NODE_RAW(n) (n.decoded)
+#define NODE_IS_LEAF(n) (NODE_RAW(n)->shift == 0)
+#define NODE_IS_EMPTY(n) (NODE_COUNT(n) == 0)
+#define NODE_KIND(n) (NODE_RAW(n)->kind)
+#define NODE_COUNT(n) (NODE_RAW(n)->count)
+#define NODE_SHIFT(n) (NODE_RAW(n)->shift)
+#define NODE_CHUNK(n) (NODE_RAW(n)->chunk)
+#define NODE_FANOUT(n) (NODE_RAW(n)->fanout)
+#define NODE_HAS_FREE_SLOT(n) \
+ (NODE_COUNT(n) < rt_node_kind_info[NODE_KIND(n)].fanout)
+
/* Information for each size class */
typedef struct rt_size_class_elem
{
@@ -394,7 +429,7 @@ static rt_size_class kind_min_size_class[RT_NODE_KIND_COUNT] = {
*/
typedef struct rt_node_iter
{
- rt_node *node; /* current node being iterated */
+ rt_node_ptr node; /* current node being iterated */
int current_idx; /* current position. -1 for initial value */
} rt_node_iter;
@@ -415,7 +450,7 @@ struct radix_tree
{
MemoryContext context;
- rt_node *root;
+ rt_pointer root;
uint64 max_val;
uint64 num_keys;
@@ -429,27 +464,58 @@ struct radix_tree
};
static void rt_new_root(radix_tree *tree, uint64 key);
-static rt_node *rt_alloc_node(radix_tree *tree, rt_size_class size_class, bool inner);
-static inline void rt_init_node(rt_node *node, uint8 kind, rt_size_class size_class,
+
+static rt_node_ptr rt_alloc_node(radix_tree *tree, rt_size_class size_class, bool inner);
+static inline void rt_init_node(rt_node_ptr node, uint8 kind, rt_size_class size_class,
bool inner);
-static void rt_free_node(radix_tree *tree, rt_node *node);
+static void rt_free_node(radix_tree *tree, rt_node_ptr node);
static void rt_extend(radix_tree *tree, uint64 key);
-static inline bool rt_node_search_inner(rt_node *node, uint64 key, rt_action action,
- rt_node **child_p);
-static inline bool rt_node_search_leaf(rt_node *node, uint64 key, rt_action action,
+static inline bool rt_node_search_inner(rt_node_ptr node_ptr, uint64 key, rt_action action,
+ rt_pointer *child_p);
+static inline bool rt_node_search_leaf(rt_node_ptr node_ptr, uint64 key, rt_action action,
uint64 *value_p);
-static bool rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node,
- uint64 key, rt_node *child);
-static bool rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
+static bool rt_node_insert_inner(radix_tree *tree, rt_node_ptr parent, rt_node_ptr node,
+ uint64 key, rt_node_ptr child);
+static bool rt_node_insert_leaf(radix_tree *tree, rt_node_ptr parent, rt_node_ptr node,
uint64 key, uint64 value);
-static inline rt_node *rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter);
+static inline bool rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
+ rt_node_ptr *child_p);
static inline bool rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
uint64 *value_p);
-static void rt_update_iter_stack(rt_iter *iter, rt_node *from_node, int from);
+static void rt_update_iter_stack(rt_iter *iter, rt_node_ptr from_node, int from);
static inline void rt_iter_update_key(rt_iter *iter, uint8 chunk, uint8 shift);
/* verification (available only with assertion) */
-static void rt_verify_node(rt_node *node);
+static void rt_verify_node(rt_node_ptr node);
+
+/* Decode and encode functions of rt_pointer */
+static inline rt_node *
+rt_pointer_decode(rt_pointer encoded)
+{
+ return (rt_node *) encoded;
+}
+
+static inline rt_pointer
+rt_pointer_encode(rt_node *decoded)
+{
+ return (rt_pointer) decoded;
+}
+
+/* Return a rt_node_ptr created from the given encoded pointer */
+static inline rt_node_ptr
+rt_node_ptr_encoded(rt_pointer encoded)
+{
+ return (rt_node_ptr) {
+ .encoded = encoded,
+ .decoded = rt_pointer_decode(encoded),
+ };
+}
+
+static inline bool
+rt_node_ptr_eq(rt_node_ptr *a, rt_node_ptr *b)
+{
+ return (a->decoded == b->decoded) && (a->encoded == b->encoded);
+}
/*
* Return index of the first element in 'base' that equals 'key'. Return -1
@@ -598,10 +664,10 @@ node_32_get_insertpos(rt_node_base_32 *node, uint8 chunk)
/* Shift the elements right at 'idx' by one */
static inline void
-chunk_children_array_shift(uint8 *chunks, rt_node **children, int count, int idx)
+chunk_children_array_shift(uint8 *chunks, rt_pointer *children, int count, int idx)
{
memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
- memmove(&(children[idx + 1]), &(children[idx]), sizeof(rt_node *) * (count - idx));
+ memmove(&(children[idx + 1]), &(children[idx]), sizeof(rt_pointer) * (count - idx));
}
static inline void
@@ -613,10 +679,10 @@ chunk_values_array_shift(uint8 *chunks, uint64 *values, int count, int idx)
/* Delete the element at 'idx' */
static inline void
-chunk_children_array_delete(uint8 *chunks, rt_node **children, int count, int idx)
+chunk_children_array_delete(uint8 *chunks, rt_pointer *children, int count, int idx)
{
memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
- memmove(&(children[idx]), &(children[idx + 1]), sizeof(rt_node *) * (count - idx - 1));
+ memmove(&(children[idx]), &(children[idx + 1]), sizeof(rt_pointer) * (count - idx - 1));
}
static inline void
@@ -628,12 +694,12 @@ chunk_values_array_delete(uint8 *chunks, uint64 *values, int count, int idx)
/* Copy both chunks and children/values arrays */
static inline void
-chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
- uint8 *dst_chunks, rt_node **dst_children)
+chunk_children_array_copy(uint8 *src_chunks, rt_pointer *src_children,
+ uint8 *dst_chunks, rt_pointer *dst_children)
{
const int fanout = rt_size_class_info[RT_CLASS_4_FULL].fanout;
const Size chunk_size = sizeof(uint8) * fanout;
- const Size children_size = sizeof(rt_node *) * fanout;
+ const Size children_size = sizeof(rt_pointer) * fanout;
memcpy(dst_chunks, src_chunks, chunk_size);
memcpy(dst_children, src_children, children_size);
@@ -665,7 +731,7 @@ node_125_is_chunk_used(rt_node_base_125 *node, uint8 chunk)
static inline bool
node_inner_125_is_slot_used(rt_node_inner_125 *node, uint8 slot)
{
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
Assert(slot < node->base.n.fanout);
return (node->base.isset[WORDNUM(slot)] & ((bitmapword) 1 << BITNUM(slot))) != 0;
}
@@ -673,23 +739,23 @@ node_inner_125_is_slot_used(rt_node_inner_125 *node, uint8 slot)
static inline bool
node_leaf_125_is_slot_used(rt_node_leaf_125 *node, uint8 slot)
{
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
Assert(slot < node->base.n.fanout);
return (node->base.isset[WORDNUM(slot)] & ((bitmapword) 1 << BITNUM(slot))) != 0;
}
#endif
-static inline rt_node *
+static inline rt_pointer
node_inner_125_get_child(rt_node_inner_125 *node, uint8 chunk)
{
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
return node->children[node->base.slot_idxs[chunk]];
}
static inline uint64
node_leaf_125_get_value(rt_node_leaf_125 *node, uint8 chunk)
{
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
Assert(((rt_node_base_125 *) node)->slot_idxs[chunk] != RT_NODE_125_INVALID_IDX);
return node->values[node->base.slot_idxs[chunk]];
}
@@ -699,9 +765,9 @@ node_inner_125_delete(rt_node_inner_125 *node, uint8 chunk)
{
int slotpos = node->base.slot_idxs[chunk];
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
node->base.isset[WORDNUM(slotpos)] &= ~((bitmapword) 1 << BITNUM(slotpos));
- node->children[node->base.slot_idxs[chunk]] = NULL;
+ node->children[node->base.slot_idxs[chunk]] = InvalidRTPointer;
node->base.slot_idxs[chunk] = RT_NODE_125_INVALID_IDX;
}
@@ -710,7 +776,7 @@ node_leaf_125_delete(rt_node_leaf_125 *node, uint8 chunk)
{
int slotpos = node->base.slot_idxs[chunk];
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
node->base.isset[WORDNUM(slotpos)] &= ~((bitmapword) 1 << BITNUM(slotpos));
node->base.slot_idxs[chunk] = RT_NODE_125_INVALID_IDX;
}
@@ -742,11 +808,11 @@ node_125_find_unused_slot(bitmapword *isset)
}
static inline void
-node_inner_125_insert(rt_node_inner_125 *node, uint8 chunk, rt_node *child)
+node_inner_125_insert(rt_node_inner_125 *node, uint8 chunk, rt_pointer child)
{
int slotpos;
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
slotpos = node_125_find_unused_slot(node->base.isset);
Assert(slotpos < node->base.n.fanout);
@@ -761,7 +827,7 @@ node_leaf_125_insert(rt_node_leaf_125 *node, uint8 chunk, uint64 value)
{
int slotpos;
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
slotpos = node_125_find_unused_slot(node->base.isset);
Assert(slotpos < node->base.n.fanout);
@@ -772,16 +838,16 @@ node_leaf_125_insert(rt_node_leaf_125 *node, uint8 chunk, uint64 value)
/* Update the child corresponding to 'chunk' to 'child' */
static inline void
-node_inner_125_update(rt_node_inner_125 *node, uint8 chunk, rt_node *child)
+node_inner_125_update(rt_node_inner_125 *node, uint8 chunk, rt_pointer child)
{
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
node->children[node->base.slot_idxs[chunk]] = child;
}
static inline void
node_leaf_125_update(rt_node_leaf_125 *node, uint8 chunk, uint64 value)
{
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
node->values[node->base.slot_idxs[chunk]] = value;
}
@@ -791,21 +857,21 @@ node_leaf_125_update(rt_node_leaf_125 *node, uint8 chunk, uint64 value)
static inline bool
node_inner_256_is_chunk_used(rt_node_inner_256 *node, uint8 chunk)
{
- Assert(!NODE_IS_LEAF(node));
- return (node->children[chunk] != NULL);
+ Assert(!RT_NODE_IS_LEAF(node));
+ return RTPointerIsValid(node->children[chunk]);
}
static inline bool
node_leaf_256_is_chunk_used(rt_node_leaf_256 *node, uint8 chunk)
{
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
return (node->isset[RT_NODE_BITMAP_BYTE(chunk)] & RT_NODE_BITMAP_BIT(chunk)) != 0;
}
-static inline rt_node *
+static inline rt_pointer
node_inner_256_get_child(rt_node_inner_256 *node, uint8 chunk)
{
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
Assert(node_inner_256_is_chunk_used(node, chunk));
return node->children[chunk];
}
@@ -813,16 +879,16 @@ node_inner_256_get_child(rt_node_inner_256 *node, uint8 chunk)
static inline uint64
node_leaf_256_get_value(rt_node_leaf_256 *node, uint8 chunk)
{
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
Assert(node_leaf_256_is_chunk_used(node, chunk));
return node->values[chunk];
}
/* Set the child in the node-256 */
static inline void
-node_inner_256_set(rt_node_inner_256 *node, uint8 chunk, rt_node *child)
+node_inner_256_set(rt_node_inner_256 *node, uint8 chunk, rt_pointer child)
{
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
node->children[chunk] = child;
}
@@ -830,7 +896,7 @@ node_inner_256_set(rt_node_inner_256 *node, uint8 chunk, rt_node *child)
static inline void
node_leaf_256_set(rt_node_leaf_256 *node, uint8 chunk, uint64 value)
{
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
node->isset[RT_NODE_BITMAP_BYTE(chunk)] |= RT_NODE_BITMAP_BIT(chunk);
node->values[chunk] = value;
}
@@ -839,14 +905,14 @@ node_leaf_256_set(rt_node_leaf_256 *node, uint8 chunk, uint64 value)
static inline void
node_inner_256_delete(rt_node_inner_256 *node, uint8 chunk)
{
- Assert(!NODE_IS_LEAF(node));
- node->children[chunk] = NULL;
+ Assert(!RT_NODE_IS_LEAF(node));
+ node->children[chunk] = InvalidRTPointer;
}
static inline void
node_leaf_256_delete(rt_node_leaf_256 *node, uint8 chunk)
{
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
node->isset[RT_NODE_BITMAP_BYTE(chunk)] &= ~(RT_NODE_BITMAP_BIT(chunk));
}
@@ -882,29 +948,32 @@ rt_new_root(radix_tree *tree, uint64 key)
{
int shift = key_get_shift(key);
bool inner = shift > 0;
- rt_node *newnode;
+ rt_node_ptr newnode;
newnode = rt_alloc_node(tree, RT_CLASS_4_FULL, inner);
rt_init_node(newnode, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
- newnode->shift = shift;
+ NODE_SHIFT(newnode) = shift;
+
tree->max_val = shift_get_max_val(shift);
- tree->root = newnode;
+ tree->root = newnode.encoded;
}
/*
* Allocate a new node with the given node kind.
*/
-static rt_node *
+static rt_node_ptr
rt_alloc_node(radix_tree *tree, rt_size_class size_class, bool inner)
{
- rt_node *newnode;
+ rt_node_ptr newnode;
if (inner)
- newnode = (rt_node *) MemoryContextAlloc(tree->inner_slabs[size_class],
- rt_size_class_info[size_class].inner_size);
+ newnode.decoded = (rt_node *) MemoryContextAlloc(tree->inner_slabs[size_class],
+ rt_size_class_info[size_class].inner_size);
else
- newnode = (rt_node *) MemoryContextAlloc(tree->leaf_slabs[size_class],
- rt_size_class_info[size_class].leaf_size);
+ newnode.decoded = (rt_node *) MemoryContextAlloc(tree->leaf_slabs[size_class],
+ rt_size_class_info[size_class].leaf_size);
+
+ newnode.encoded = rt_pointer_encode(newnode.decoded);
#ifdef RT_DEBUG
/* update the statistics */
@@ -916,20 +985,20 @@ rt_alloc_node(radix_tree *tree, rt_size_class size_class, bool inner)
/* Initialize the node contents */
static inline void
-rt_init_node(rt_node *node, uint8 kind, rt_size_class size_class, bool inner)
+rt_init_node(rt_node_ptr node, uint8 kind, rt_size_class size_class, bool inner)
{
if (inner)
- MemSet(node, 0, rt_size_class_info[size_class].inner_size);
+ MemSet(node.decoded, 0, rt_size_class_info[size_class].inner_size);
else
- MemSet(node, 0, rt_size_class_info[size_class].leaf_size);
+ MemSet(node.decoded, 0, rt_size_class_info[size_class].leaf_size);
- node->kind = kind;
- node->fanout = rt_size_class_info[size_class].fanout;
+ NODE_KIND(node) = kind;
+ NODE_FANOUT(node) = rt_size_class_info[size_class].fanout;
/* Initialize slot_idxs to invalid values */
if (kind == RT_NODE_KIND_125)
{
- rt_node_base_125 *n125 = (rt_node_base_125 *) node;
+ rt_node_base_125 *n125 = (rt_node_base_125 *) node.decoded;
memset(n125->slot_idxs, RT_NODE_125_INVALID_IDX, sizeof(n125->slot_idxs));
}
@@ -939,25 +1008,25 @@ rt_init_node(rt_node *node, uint8 kind, rt_size_class size_class, bool inner)
* and this is the max size class to it will never grow.
*/
if (kind == RT_NODE_KIND_256)
- node->fanout = 0;
+ NODE_FANOUT(node) = 0;
}
static inline void
-rt_copy_node(rt_node *newnode, rt_node *oldnode)
+rt_copy_node(rt_node_ptr newnode, rt_node_ptr oldnode)
{
- newnode->shift = oldnode->shift;
- newnode->chunk = oldnode->chunk;
- newnode->count = oldnode->count;
+ NODE_SHIFT(newnode) = NODE_SHIFT(oldnode);
+ NODE_CHUNK(newnode) = NODE_CHUNK(oldnode);
+ NODE_COUNT(newnode) = NODE_COUNT(oldnode);
}
/*
* Create a new node with 'new_kind' and the same shift, chunk, and
* count of 'node'.
*/
-static rt_node*
-rt_grow_node_kind(radix_tree *tree, rt_node *node, uint8 new_kind)
+static rt_node_ptr
+rt_grow_node_kind(radix_tree *tree, rt_node_ptr node, uint8 new_kind)
{
- rt_node *newnode;
+ rt_node_ptr newnode;
bool inner = !NODE_IS_LEAF(node);
newnode = rt_alloc_node(tree, kind_min_size_class[new_kind], inner);
@@ -969,12 +1038,12 @@ rt_grow_node_kind(radix_tree *tree, rt_node *node, uint8 new_kind)
/* Free the given node */
static void
-rt_free_node(radix_tree *tree, rt_node *node)
+rt_free_node(radix_tree *tree, rt_node_ptr node)
{
/* If we're deleting the root node, make the tree empty */
- if (tree->root == node)
+ if (tree->root == node.encoded)
{
- tree->root = NULL;
+ tree->root = InvalidRTPointer;
tree->max_val = 0;
}
@@ -985,7 +1054,7 @@ rt_free_node(radix_tree *tree, rt_node *node)
/* update the statistics */
for (i = 0; i < RT_SIZE_CLASS_COUNT; i++)
{
- if (node->fanout == rt_size_class_info[i].fanout)
+ if (NODE_FANOUT(node) == rt_size_class_info[i].fanout)
break;
}
@@ -998,29 +1067,30 @@ rt_free_node(radix_tree *tree, rt_node *node)
}
#endif
- pfree(node);
+ pfree(node.decoded);
}
/*
* Replace old_child with new_child, and free the old one.
*/
static void
-rt_replace_node(radix_tree *tree, rt_node *parent, rt_node *old_child,
- rt_node *new_child, uint64 key)
+rt_replace_node(radix_tree *tree, rt_node_ptr parent, rt_node_ptr old_child,
+ rt_node_ptr new_child, uint64 key)
{
- Assert(old_child->chunk == new_child->chunk);
- Assert(old_child->shift == new_child->shift);
+ Assert(NODE_CHUNK(old_child) == NODE_CHUNK(new_child));
+ Assert(NODE_SHIFT(old_child) == NODE_SHIFT(new_child));
- if (parent == old_child)
+ if (rt_node_ptr_eq(&parent, &old_child))
{
/* Replace the root node with the new large node */
- tree->root = new_child;
+ tree->root = new_child.encoded;
}
else
{
bool replaced PG_USED_FOR_ASSERTS_ONLY;
- replaced = rt_node_insert_inner(tree, NULL, parent, key, new_child);
+ replaced = rt_node_insert_inner(tree, InvalidRTNodePtr, parent, key,
+ new_child);
Assert(replaced);
}
@@ -1035,24 +1105,28 @@ static void
rt_extend(radix_tree *tree, uint64 key)
{
int target_shift;
- int shift = tree->root->shift + RT_NODE_SPAN;
+ rt_node *root = rt_pointer_decode(tree->root);
+ int shift = root->shift + RT_NODE_SPAN;
target_shift = key_get_shift(key);
/* Grow tree from 'shift' to 'target_shift' */
while (shift <= target_shift)
{
- rt_node_inner_4 *node;
+ rt_node_ptr node;
+ rt_node_inner_4 *n4;
+
+ node = rt_alloc_node(tree, RT_CLASS_4_FULL, true);
+ rt_init_node(node, RT_NODE_KIND_4, RT_CLASS_4_FULL, true);
- node = (rt_node_inner_4 *) rt_alloc_node(tree, RT_CLASS_4_FULL, true);
- rt_init_node((rt_node *) node, RT_NODE_KIND_4, RT_CLASS_4_FULL, true);
- node->base.n.shift = shift;
- node->base.n.count = 1;
- node->base.chunks[0] = 0;
- node->children[0] = tree->root;
+ n4 = (rt_node_inner_4 *) node.decoded;
+ n4->base.n.shift = shift;
+ n4->base.n.count = 1;
+ n4->base.chunks[0] = 0;
+ n4->children[0] = tree->root;
- tree->root->chunk = 0;
- tree->root = (rt_node *) node;
+ root->chunk = 0;
+ tree->root = node.encoded;
shift += RT_NODE_SPAN;
}
@@ -1065,21 +1139,22 @@ rt_extend(radix_tree *tree, uint64 key)
* Insert inner and leaf nodes from 'node' to bottom.
*/
static inline void
-rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node *parent,
- rt_node *node)
+rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node_ptr parent,
+ rt_node_ptr node)
{
- int shift = node->shift;
+ int shift = NODE_SHIFT(node);
while (shift >= RT_NODE_SPAN)
{
- rt_node *newchild;
+ rt_node_ptr newchild;
int newshift = shift - RT_NODE_SPAN;
bool inner = newshift > 0;
newchild = rt_alloc_node(tree, RT_CLASS_4_FULL, inner);
rt_init_node(newchild, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
- newchild->shift = newshift;
- newchild->chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ NODE_SHIFT(newchild) = newshift;
+ NODE_CHUNK(newchild) = RT_GET_KEY_CHUNK(key, NODE_SHIFT(node));
+
rt_node_insert_inner(tree, parent, node, key, newchild);
parent = node;
@@ -1099,17 +1174,18 @@ rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node *parent,
* pointer is set to child_p.
*/
static inline bool
-rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **child_p)
+rt_node_search_inner(rt_node_ptr node, uint64 key, rt_action action,
+ rt_pointer *child_p)
{
- uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ uint8 chunk = RT_GET_KEY_CHUNK(key, NODE_SHIFT(node));
bool found = false;
- rt_node *child = NULL;
+ rt_pointer child;
- switch (node->kind)
+ switch (NODE_KIND(node))
{
case RT_NODE_KIND_4:
{
- rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node.decoded;
int idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
if (idx < 0)
@@ -1127,7 +1203,7 @@ rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **chil
}
case RT_NODE_KIND_32:
{
- rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node.decoded;
int idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
if (idx < 0)
@@ -1143,7 +1219,7 @@ rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **chil
}
case RT_NODE_KIND_125:
{
- rt_node_inner_125 *n125 = (rt_node_inner_125 *) node;
+ rt_node_inner_125 *n125 = (rt_node_inner_125 *) node.decoded;
if (!node_125_is_chunk_used((rt_node_base_125 *) n125, chunk))
break;
@@ -1159,7 +1235,7 @@ rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **chil
}
case RT_NODE_KIND_256:
{
- rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node.decoded;
if (!node_inner_256_is_chunk_used(n256, chunk))
break;
@@ -1176,7 +1252,7 @@ rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **chil
/* update statistics */
if (action == RT_ACTION_DELETE && found)
- node->count--;
+ NODE_COUNT(node)--;
if (found && child_p)
*child_p = child;
@@ -1192,17 +1268,17 @@ rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **chil
* to the value is set to value_p.
*/
static inline bool
-rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p)
+rt_node_search_leaf(rt_node_ptr node, uint64 key, rt_action action, uint64 *value_p)
{
- uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ uint8 chunk = RT_GET_KEY_CHUNK(key, NODE_SHIFT(node));
bool found = false;
uint64 value = 0;
- switch (node->kind)
+ switch (NODE_KIND(node))
{
case RT_NODE_KIND_4:
{
- rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node.decoded;
int idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
if (idx < 0)
@@ -1220,7 +1296,7 @@ rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p
}
case RT_NODE_KIND_32:
{
- rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node.decoded;
int idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
if (idx < 0)
@@ -1236,7 +1312,7 @@ rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p
}
case RT_NODE_KIND_125:
{
- rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) node;
+ rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) node.decoded;
if (!node_125_is_chunk_used((rt_node_base_125 *) n125, chunk))
break;
@@ -1252,7 +1328,7 @@ rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p
}
case RT_NODE_KIND_256:
{
- rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node.decoded;
if (!node_leaf_256_is_chunk_used(n256, chunk))
break;
@@ -1269,7 +1345,7 @@ rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p
/* update statistics */
if (action == RT_ACTION_DELETE && found)
- node->count--;
+ NODE_COUNT(node)--;
if (found && value_p)
*value_p = value;
@@ -1279,19 +1355,19 @@ rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p
/* Insert the child to the inner node */
static bool
-rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 key,
- rt_node *child)
+rt_node_insert_inner(radix_tree *tree, rt_node_ptr parent, rt_node_ptr node,
+ uint64 key, rt_node_ptr child)
{
- uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ uint8 chunk = RT_GET_KEY_CHUNK(key, NODE_SHIFT(node));
bool chunk_exists = false;
Assert(!NODE_IS_LEAF(node));
- switch (node->kind)
+ switch (NODE_KIND(node))
{
case RT_NODE_KIND_4:
{
- rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node.decoded;
int idx;
idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
@@ -1299,25 +1375,27 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
{
/* found the existing chunk */
chunk_exists = true;
- n4->children[idx] = child;
+ n4->children[idx] = child.encoded;
break;
}
if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n4)))
{
+ rt_node_ptr new;
rt_node_inner_32 *new32;
- Assert(parent != NULL);
+
+ Assert(RTNodePtrIsValid(parent));
/* grow node from 4 to 32 */
- new32 = (rt_node_inner_32 *) rt_grow_node_kind(tree, (rt_node *) n4,
- RT_NODE_KIND_32);
+ new = rt_grow_node_kind(tree, node, RT_NODE_KIND_32);
+ new32 = (rt_node_inner_32 *) new.decoded;
+
chunk_children_array_copy(n4->base.chunks, n4->children,
new32->base.chunks, new32->children);
- Assert(parent != NULL);
- rt_replace_node(tree, parent, (rt_node *) n4, (rt_node *) new32,
- key);
- node = (rt_node *) new32;
+ Assert(RTNodePtrIsValid(parent));
+ rt_replace_node(tree, parent, node, new, key);
+ node = new;
}
else
{
@@ -1330,14 +1408,14 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
count, insertpos);
n4->base.chunks[insertpos] = chunk;
- n4->children[insertpos] = child;
+ n4->children[insertpos] = child.encoded;
break;
}
}
/* FALLTHROUGH */
case RT_NODE_KIND_32:
{
- rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node.decoded;
int idx;
idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
@@ -1345,45 +1423,52 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
{
/* found the existing chunk */
chunk_exists = true;
- n32->children[idx] = child;
+ n32->children[idx] = child.encoded;
break;
}
if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)))
{
- Assert(parent != NULL);
+ Assert(RTNodePtrIsValid(parent));
if (n32->base.n.count == rt_size_class_info[RT_CLASS_32_PARTIAL].fanout)
{
/* use the same node kind, but expand to the next size class */
const Size size = rt_size_class_info[RT_CLASS_32_PARTIAL].inner_size;
const int fanout = rt_size_class_info[RT_CLASS_32_FULL].fanout;
+ rt_node_ptr new;
rt_node_inner_32 *new32;
- new32 = (rt_node_inner_32 *) rt_alloc_node(tree, RT_CLASS_32_FULL, true);
+ new = rt_alloc_node(tree, RT_CLASS_32_FULL, true);
+ new32 = (rt_node_inner_32 *) new.decoded;
memcpy(new32, n32, size);
new32->base.n.fanout = fanout;
- rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new32, key);
+ rt_replace_node(tree, parent, node, new, key);
- /* must update both pointers here */
- node = (rt_node *) new32;
+ /*
+ * Must update both pointers here since we update n32 and
+ * verify node.
+ */
+ node = new;
n32 = new32;
goto retry_insert_inner_32;
}
else
{
+ rt_node_ptr new;
rt_node_inner_125 *new125;
/* grow node from 32 to 125 */
- new125 = (rt_node_inner_125 *) rt_grow_node_kind(tree, (rt_node *) n32,
- RT_NODE_KIND_125);
+ new = rt_grow_node_kind(tree, node, RT_NODE_KIND_125);
+ new125 = (rt_node_inner_125 *) new.decoded;
+
for (int i = 0; i < n32->base.n.count; i++)
node_inner_125_insert(new125, n32->base.chunks[i], n32->children[i]);
- rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new125, key);
- node = (rt_node *) new125;
+ rt_replace_node(tree, parent, node, new, key);
+ node = new;
}
}
else
@@ -1398,7 +1483,7 @@ retry_insert_inner_32:
count, insertpos);
n32->base.chunks[insertpos] = chunk;
- n32->children[insertpos] = child;
+ n32->children[insertpos] = child.encoded;
break;
}
}
@@ -1406,25 +1491,28 @@ retry_insert_inner_32:
/* FALLTHROUGH */
case RT_NODE_KIND_125:
{
- rt_node_inner_125 *n125 = (rt_node_inner_125 *) node;
+ rt_node_inner_125 *n125 = (rt_node_inner_125 *) node.decoded;
int cnt = 0;
if (node_125_is_chunk_used((rt_node_base_125 *) n125, chunk))
{
/* found the existing chunk */
chunk_exists = true;
- node_inner_125_update(n125, chunk, child);
+ node_inner_125_update(n125, chunk, child.encoded);
break;
}
if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n125)))
{
+ rt_node_ptr new;
rt_node_inner_256 *new256;
- Assert(parent != NULL);
+
+ Assert(RTNodePtrIsValid(parent));
/* grow node from 125 to 256 */
- new256 = (rt_node_inner_256 *) rt_grow_node_kind(tree, (rt_node *) n125,
- RT_NODE_KIND_256);
+ new = rt_grow_node_kind(tree, node, RT_NODE_KIND_256);
+ new256 = (rt_node_inner_256 *) new.decoded;
+
for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n125->base.n.count; i++)
{
if (!node_125_is_chunk_used((rt_node_base_125 *) n125, i))
@@ -1434,32 +1522,31 @@ retry_insert_inner_32:
cnt++;
}
- rt_replace_node(tree, parent, (rt_node *) n125, (rt_node *) new256,
- key);
- node = (rt_node *) new256;
+ rt_replace_node(tree, parent, node, new, key);
+ node = new;
}
else
{
- node_inner_125_insert(n125, chunk, child);
+ node_inner_125_insert(n125, chunk, child.encoded);
break;
}
}
/* FALLTHROUGH */
case RT_NODE_KIND_256:
{
- rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node.decoded;
chunk_exists = node_inner_256_is_chunk_used(n256, chunk);
Assert(chunk_exists || FIXED_NODE_HAS_FREE_SLOT(n256, RT_CLASS_256));
- node_inner_256_set(n256, chunk, child);
+ node_inner_256_set(n256, chunk, child.encoded);
break;
}
}
/* Update statistics */
if (!chunk_exists)
- node->count++;
+ NODE_COUNT(node)++;
/*
* Done. Finally, verify the chunk and value is inserted or replaced
@@ -1472,19 +1559,19 @@ retry_insert_inner_32:
/* Insert the value to the leaf node */
static bool
-rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
+rt_node_insert_leaf(radix_tree *tree, rt_node_ptr parent, rt_node_ptr node,
uint64 key, uint64 value)
{
- uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ uint8 chunk = RT_GET_KEY_CHUNK(key, NODE_SHIFT(node));
bool chunk_exists = false;
Assert(NODE_IS_LEAF(node));
- switch (node->kind)
+ switch (NODE_KIND(node))
{
case RT_NODE_KIND_4:
{
- rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node.decoded;
int idx;
idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
@@ -1498,16 +1585,18 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n4)))
{
+ rt_node_ptr new;
rt_node_leaf_32 *new32;
- Assert(parent != NULL);
+
+ Assert(RTNodePtrIsValid(parent));
/* grow node from 4 to 32 */
- new32 = (rt_node_leaf_32 *) rt_grow_node_kind(tree, (rt_node *) n4,
- RT_NODE_KIND_32);
+ new = rt_grow_node_kind(tree, node, RT_NODE_KIND_32);
+ new32 = (rt_node_leaf_32 *) new.decoded;
chunk_values_array_copy(n4->base.chunks, n4->values,
new32->base.chunks, new32->values);
- rt_replace_node(tree, parent, (rt_node *) n4, (rt_node *) new32, key);
- node = (rt_node *) new32;
+ rt_replace_node(tree, parent, node, new, key);
+ node = new;
}
else
{
@@ -1527,7 +1616,7 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
/* FALLTHROUGH */
case RT_NODE_KIND_32:
{
- rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node.decoded;
int idx;
idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
@@ -1541,45 +1630,51 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)))
{
- Assert(parent != NULL);
+ Assert(RTNodePtrIsValid(parent));
if (n32->base.n.count == rt_size_class_info[RT_CLASS_32_PARTIAL].fanout)
{
/* use the same node kind, but expand to the next size class */
const Size size = rt_size_class_info[RT_CLASS_32_PARTIAL].leaf_size;
const int fanout = rt_size_class_info[RT_CLASS_32_FULL].fanout;
+ rt_node_ptr new;
rt_node_leaf_32 *new32;
- new32 = (rt_node_leaf_32 *) rt_alloc_node(tree, RT_CLASS_32_FULL, false);
+ new = rt_alloc_node(tree, RT_CLASS_32_FULL, false);
+ new32 = (rt_node_leaf_32 *) new.decoded;
memcpy(new32, n32, size);
new32->base.n.fanout = fanout;
- rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new32, key);
+ rt_replace_node(tree, parent, node, new, key);
- /* must update both pointers here */
- node = (rt_node *) new32;
+ /*
+ * Must update both pointers here since we update n32 and
+ * verify node.
+ */
+ node = new;
n32 = new32;
goto retry_insert_leaf_32;
}
else
{
+ rt_node_ptr new;
rt_node_leaf_125 *new125;
/* grow node from 32 to 125 */
- new125 = (rt_node_leaf_125 *) rt_grow_node_kind(tree, (rt_node *) n32,
- RT_NODE_KIND_125);
+ new = rt_grow_node_kind(tree, node, RT_NODE_KIND_125);
+ new125 = (rt_node_leaf_125 *) new.decoded;
+
for (int i = 0; i < n32->base.n.count; i++)
node_leaf_125_insert(new125, n32->base.chunks[i], n32->values[i]);
- rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new125,
- key);
- node = (rt_node *) new125;
+ rt_replace_node(tree, parent, node, new, key);
+ node = new;
}
}
else
{
- retry_insert_leaf_32:
+retry_insert_leaf_32:
{
int insertpos = node_32_get_insertpos((rt_node_base_32 *) n32, chunk);
int count = n32->base.n.count;
@@ -1597,7 +1692,7 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
/* FALLTHROUGH */
case RT_NODE_KIND_125:
{
- rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) node;
+ rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) node.decoded;
int cnt = 0;
if (node_125_is_chunk_used((rt_node_base_125 *) n125, chunk))
@@ -1610,12 +1705,14 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n125)))
{
+ rt_node_ptr new;
rt_node_leaf_256 *new256;
- Assert(parent != NULL);
+
+ Assert(RTNodePtrIsValid(parent));
/* grow node from 125 to 256 */
- new256 = (rt_node_leaf_256 *) rt_grow_node_kind(tree, (rt_node *) n125,
- RT_NODE_KIND_256);
+ new = rt_grow_node_kind(tree, node, RT_NODE_KIND_256);
+ new256 = (rt_node_leaf_256 *) new.decoded;
for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n125->base.n.count; i++)
{
if (!node_125_is_chunk_used((rt_node_base_125 *) n125, i))
@@ -1625,9 +1722,8 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
cnt++;
}
- rt_replace_node(tree, parent, (rt_node *) n125, (rt_node *) new256,
- key);
- node = (rt_node *) new256;
+ rt_replace_node(tree, parent, node, new, key);
+ node = new;
}
else
{
@@ -1638,7 +1734,7 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
/* FALLTHROUGH */
case RT_NODE_KIND_256:
{
- rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node.decoded;
chunk_exists = node_leaf_256_is_chunk_used(n256, chunk);
Assert(chunk_exists || FIXED_NODE_HAS_FREE_SLOT(n256, RT_CLASS_256));
@@ -1650,7 +1746,7 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
/* Update statistics */
if (!chunk_exists)
- node->count++;
+ NODE_COUNT(node)++;
/*
* Done. Finally, verify the chunk and value is inserted or replaced
@@ -1674,7 +1770,7 @@ rt_create(MemoryContext ctx)
tree = palloc(sizeof(radix_tree));
tree->context = ctx;
- tree->root = NULL;
+ tree->root = InvalidRTPointer;
tree->max_val = 0;
tree->num_keys = 0;
@@ -1723,26 +1819,23 @@ rt_set(radix_tree *tree, uint64 key, uint64 value)
{
int shift;
bool updated;
- rt_node *node;
- rt_node *parent;
+ rt_node_ptr node;
+ rt_node_ptr parent;
/* Empty tree, create the root */
- if (!tree->root)
+ if (!RTPointerIsValid(tree->root))
rt_new_root(tree, key);
/* Extend the tree if necessary */
if (key > tree->max_val)
rt_extend(tree, key);
- Assert(tree->root);
-
- shift = tree->root->shift;
- node = parent = tree->root;
-
/* Descend the tree until a leaf node */
+ node = parent = rt_node_ptr_encoded(tree->root);
+ shift = NODE_SHIFT(node);
while (shift >= 0)
{
- rt_node *child;
+ rt_pointer child;
if (NODE_IS_LEAF(node))
break;
@@ -1754,7 +1847,7 @@ rt_set(radix_tree *tree, uint64 key, uint64 value)
}
parent = node;
- node = child;
+ node = rt_node_ptr_encoded(child);
shift -= RT_NODE_SPAN;
}
@@ -1775,21 +1868,21 @@ rt_set(radix_tree *tree, uint64 key, uint64 value)
bool
rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
{
- rt_node *node;
+ rt_node_ptr node;
int shift;
Assert(value_p != NULL);
- if (!tree->root || key > tree->max_val)
+ if (!RTPointerIsValid(tree->root) || key > tree->max_val)
return false;
- node = tree->root;
- shift = tree->root->shift;
+ node = rt_node_ptr_encoded(tree->root);
+ shift = NODE_SHIFT(node);
/* Descend the tree until a leaf node */
while (shift >= 0)
{
- rt_node *child;
+ rt_pointer child;
if (NODE_IS_LEAF(node))
break;
@@ -1797,7 +1890,7 @@ rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
return false;
- node = child;
+ node = rt_node_ptr_encoded(child);
shift -= RT_NODE_SPAN;
}
@@ -1811,8 +1904,8 @@ rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
bool
rt_delete(radix_tree *tree, uint64 key)
{
- rt_node *node;
- rt_node *stack[RT_MAX_LEVEL] = {0};
+ rt_node_ptr node;
+ rt_node_ptr stack[RT_MAX_LEVEL] = {0};
int shift;
int level;
bool deleted;
@@ -1824,12 +1917,12 @@ rt_delete(radix_tree *tree, uint64 key)
* Descend the tree to search the key while building a stack of nodes we
* visited.
*/
- node = tree->root;
- shift = tree->root->shift;
+ node = rt_node_ptr_encoded(tree->root);
+ shift = NODE_SHIFT(node);
level = -1;
while (shift > 0)
{
- rt_node *child;
+ rt_pointer child;
/* Push the current node to the stack */
stack[++level] = node;
@@ -1837,7 +1930,7 @@ rt_delete(radix_tree *tree, uint64 key)
if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
return false;
- node = child;
+ node = rt_node_ptr_encoded(child);
shift -= RT_NODE_SPAN;
}
@@ -1888,6 +1981,7 @@ rt_iter *
rt_begin_iterate(radix_tree *tree)
{
MemoryContext old_ctx;
+ rt_node_ptr root;
rt_iter *iter;
int top_level;
@@ -1897,17 +1991,18 @@ rt_begin_iterate(radix_tree *tree)
iter->tree = tree;
/* empty tree */
- if (!iter->tree->root)
+ if (!RTPointerIsValid(iter->tree) || !RTPointerIsValid(iter->tree->root))
return iter;
- top_level = iter->tree->root->shift / RT_NODE_SPAN;
+ root = rt_node_ptr_encoded(iter->tree->root);
+ top_level = NODE_SHIFT(root) / RT_NODE_SPAN;
iter->stack_len = top_level;
/*
* Descend to the left most leaf node from the root. The key is being
* constructed while descending to the leaf.
*/
- rt_update_iter_stack(iter, iter->tree->root, top_level);
+ rt_update_iter_stack(iter, root, top_level);
MemoryContextSwitchTo(old_ctx);
@@ -1918,14 +2013,15 @@ rt_begin_iterate(radix_tree *tree)
* Update each node_iter for inner nodes in the iterator node stack.
*/
static void
-rt_update_iter_stack(rt_iter *iter, rt_node *from_node, int from)
+rt_update_iter_stack(rt_iter *iter, rt_node_ptr from_node, int from)
{
int level = from;
- rt_node *node = from_node;
+ rt_node_ptr node = from_node;
for (;;)
{
rt_node_iter *node_iter = &(iter->stack[level--]);
+ bool found PG_USED_FOR_ASSERTS_ONLY;
node_iter->node = node;
node_iter->current_idx = -1;
@@ -1935,10 +2031,10 @@ rt_update_iter_stack(rt_iter *iter, rt_node *from_node, int from)
return;
/* Advance to the next slot in the inner node */
- node = rt_node_inner_iterate_next(iter, node_iter);
+ found = rt_node_inner_iterate_next(iter, node_iter, &node);
/* We must find the first children in the node */
- Assert(node);
+ Assert(found);
}
}
@@ -1955,7 +2051,7 @@ rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p)
for (;;)
{
- rt_node *child = NULL;
+ rt_node_ptr child = InvalidRTNodePtr;
uint64 value;
int level;
bool found;
@@ -1976,14 +2072,12 @@ rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p)
*/
for (level = 1; level <= iter->stack_len; level++)
{
- child = rt_node_inner_iterate_next(iter, &(iter->stack[level]));
-
- if (child)
+ if (rt_node_inner_iterate_next(iter, &(iter->stack[level]), &child))
break;
}
/* the iteration finished */
- if (!child)
+ if (!RTNodePtrIsValid(child))
return false;
/*
@@ -2015,18 +2109,19 @@ rt_iter_update_key(rt_iter *iter, uint8 chunk, uint8 shift)
* Advance the slot in the inner node. Return the child if exists, otherwise
* null.
*/
-static inline rt_node *
-rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
+static inline bool
+rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter, rt_node_ptr *child_p)
{
- rt_node *child = NULL;
+ rt_node_ptr node = node_iter->node;
+ rt_pointer child;
bool found = false;
uint8 key_chunk;
- switch (node_iter->node->kind)
+ switch (NODE_KIND(node))
{
case RT_NODE_KIND_4:
{
- rt_node_inner_4 *n4 = (rt_node_inner_4 *) node_iter->node;
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node.decoded;
node_iter->current_idx++;
if (node_iter->current_idx >= n4->base.n.count)
@@ -2039,7 +2134,7 @@ rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
}
case RT_NODE_KIND_32:
{
- rt_node_inner_32 *n32 = (rt_node_inner_32 *) node_iter->node;
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node.decoded;
node_iter->current_idx++;
if (node_iter->current_idx >= n32->base.n.count)
@@ -2052,7 +2147,7 @@ rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
}
case RT_NODE_KIND_125:
{
- rt_node_inner_125 *n125 = (rt_node_inner_125 *) node_iter->node;
+ rt_node_inner_125 *n125 = (rt_node_inner_125 *) node.decoded;
int i;
for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
@@ -2072,7 +2167,7 @@ rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
}
case RT_NODE_KIND_256:
{
- rt_node_inner_256 *n256 = (rt_node_inner_256 *) node_iter->node;
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node.decoded;
int i;
for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
@@ -2093,9 +2188,12 @@ rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
}
if (found)
- rt_iter_update_key(iter, key_chunk, node_iter->node->shift);
+ {
+ rt_iter_update_key(iter, key_chunk, NODE_SHIFT(node));
+ *child_p = rt_node_ptr_encoded(child);
+ }
- return child;
+ return found;
}
/*
@@ -2103,19 +2201,18 @@ rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
* is set to value_p, otherwise return false.
*/
static inline bool
-rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
- uint64 *value_p)
+rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter, uint64 *value_p)
{
- rt_node *node = node_iter->node;
+ rt_node_ptr node = node_iter->node;
bool found = false;
uint64 value;
uint8 key_chunk;
- switch (node->kind)
+ switch (NODE_KIND(node))
{
case RT_NODE_KIND_4:
{
- rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node_iter->node;
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node.decoded;
node_iter->current_idx++;
if (node_iter->current_idx >= n4->base.n.count)
@@ -2128,7 +2225,7 @@ rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
}
case RT_NODE_KIND_32:
{
- rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node_iter->node;
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node.decoded;
node_iter->current_idx++;
if (node_iter->current_idx >= n32->base.n.count)
@@ -2141,7 +2238,7 @@ rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
}
case RT_NODE_KIND_125:
{
- rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) node_iter->node;
+ rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) node.decoded;
int i;
for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
@@ -2161,7 +2258,7 @@ rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
}
case RT_NODE_KIND_256:
{
- rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node_iter->node;
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node.decoded;
int i;
for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
@@ -2183,7 +2280,7 @@ rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
if (found)
{
- rt_iter_update_key(iter, key_chunk, node_iter->node->shift);
+ rt_iter_update_key(iter, key_chunk, NODE_SHIFT(node));
*value_p = value;
}
@@ -2220,16 +2317,16 @@ rt_memory_usage(radix_tree *tree)
* Verify the radix tree node.
*/
static void
-rt_verify_node(rt_node *node)
+rt_verify_node(rt_node_ptr node)
{
#ifdef USE_ASSERT_CHECKING
- Assert(node->count >= 0);
+ Assert(NODE_COUNT(node) >= 0);
- switch (node->kind)
+ switch (NODE_KIND(node))
{
case RT_NODE_KIND_4:
{
- rt_node_base_4 *n4 = (rt_node_base_4 *) node;
+ rt_node_base_4 *n4 = (rt_node_base_4 *) node.decoded;
for (int i = 1; i < n4->n.count; i++)
Assert(n4->chunks[i - 1] < n4->chunks[i]);
@@ -2238,7 +2335,7 @@ rt_verify_node(rt_node *node)
}
case RT_NODE_KIND_32:
{
- rt_node_base_32 *n32 = (rt_node_base_32 *) node;
+ rt_node_base_32 *n32 = (rt_node_base_32 *) node.decoded;
for (int i = 1; i < n32->n.count; i++)
Assert(n32->chunks[i - 1] < n32->chunks[i]);
@@ -2247,7 +2344,7 @@ rt_verify_node(rt_node *node)
}
case RT_NODE_KIND_125:
{
- rt_node_base_125 *n125 = (rt_node_base_125 *) node;
+ rt_node_base_125 *n125 = (rt_node_base_125 *) node.decoded;
int cnt = 0;
for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
@@ -2257,10 +2354,10 @@ rt_verify_node(rt_node *node)
/* Check if the corresponding slot is used */
if (NODE_IS_LEAF(node))
- Assert(node_leaf_125_is_slot_used((rt_node_leaf_125 *) node,
+ Assert(node_leaf_125_is_slot_used((rt_node_leaf_125 *) n125,
n125->slot_idxs[i]));
else
- Assert(node_inner_125_is_slot_used((rt_node_inner_125 *) node,
+ Assert(node_inner_125_is_slot_used((rt_node_inner_125 *) n125,
n125->slot_idxs[i]));
cnt++;
@@ -2273,7 +2370,7 @@ rt_verify_node(rt_node *node)
{
if (NODE_IS_LEAF(node))
{
- rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node.decoded;
int cnt = 0;
for (int i = 0; i < RT_NODE_NSLOTS_BITS(RT_NODE_MAX_SLOTS); i++)
@@ -2294,54 +2391,62 @@ rt_verify_node(rt_node *node)
void
rt_stats(radix_tree *tree)
{
+ rt_node *root = rt_pointer_decode(tree->root);
+
+ if (root == NULL)
+ return;
+
ereport(NOTICE, (errmsg("num_keys = " UINT64_FORMAT ", height = %d, n4 = %u, n15 = %u, n32 = %u, n125 = %u, n256 = %u",
- tree->num_keys,
- tree->root->shift / RT_NODE_SPAN,
- tree->cnt[RT_CLASS_4_FULL],
- tree->cnt[RT_CLASS_32_PARTIAL],
- tree->cnt[RT_CLASS_32_FULL],
- tree->cnt[RT_CLASS_125_FULL],
- tree->cnt[RT_CLASS_256])));
+ tree->num_keys,
+ root->shift / RT_NODE_SPAN,
+ tree->cnt[RT_CLASS_4_FULL],
+ tree->cnt[RT_CLASS_32_PARTIAL],
+ tree->cnt[RT_CLASS_32_FULL],
+ tree->cnt[RT_CLASS_125_FULL],
+ tree->cnt[RT_CLASS_256])));
}
static void
-rt_dump_node(rt_node *node, int level, bool recurse)
+rt_dump_node(rt_node_ptr node, int level, bool recurse)
{
- char space[125] = {0};
+ rt_node *n = node.decoded;
+ char space[128] = {0};
fprintf(stderr, "[%s] kind %d, fanout %d, count %u, shift %u, chunk 0x%X:\n",
NODE_IS_LEAF(node) ? "LEAF" : "INNR",
- (node->kind == RT_NODE_KIND_4) ? 4 :
- (node->kind == RT_NODE_KIND_32) ? 32 :
- (node->kind == RT_NODE_KIND_125) ? 125 : 256,
- node->fanout == 0 ? 256 : node->fanout,
- node->count, node->shift, node->chunk);
+
+ (n->kind == RT_NODE_KIND_4) ? 4 :
+ (n->kind == RT_NODE_KIND_32) ? 32 :
+ (n->kind == RT_NODE_KIND_125) ? 125 : 256,
+ n->fanout == 0 ? 256 : n->fanout,
+ n->count, n->shift, n->chunk);
if (level > 0)
sprintf(space, "%*c", level * 4, ' ');
- switch (node->kind)
+ switch (NODE_KIND(node))
{
case RT_NODE_KIND_4:
{
- for (int i = 0; i < node->count; i++)
+ for (int i = 0; i < NODE_COUNT(node); i++)
{
if (NODE_IS_LEAF(node))
{
- rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node.decoded;
fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
space, n4->base.chunks[i], n4->values[i]);
}
else
{
- rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node.decoded;
fprintf(stderr, "%schunk 0x%X ->",
space, n4->base.chunks[i]);
if (recurse)
- rt_dump_node(n4->children[i], level + 1, recurse);
+ rt_dump_node(rt_node_ptr_encoded(n4->children[i]),
+ level + 1, recurse);
else
fprintf(stderr, "\n");
}
@@ -2350,25 +2455,26 @@ rt_dump_node(rt_node *node, int level, bool recurse)
}
case RT_NODE_KIND_32:
{
- for (int i = 0; i < node->count; i++)
+ for (int i = 0; i < NODE_COUNT(node); i++)
{
if (NODE_IS_LEAF(node))
{
- rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node.decoded;
fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
space, n32->base.chunks[i], n32->values[i]);
}
else
{
- rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node.decoded;
fprintf(stderr, "%schunk 0x%X ->",
space, n32->base.chunks[i]);
if (recurse)
{
- rt_dump_node(n32->children[i], level + 1, recurse);
+ rt_dump_node(rt_node_ptr_encoded(n32->children[i]),
+ level + 1, recurse);
}
else
fprintf(stderr, "\n");
@@ -2378,7 +2484,7 @@ rt_dump_node(rt_node *node, int level, bool recurse)
}
case RT_NODE_KIND_125:
{
- rt_node_base_125 *b125 = (rt_node_base_125 *) node;
+ rt_node_base_125 *b125 = (rt_node_base_125 *) node.decoded;
fprintf(stderr, "slot_idxs ");
for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
@@ -2390,7 +2496,7 @@ rt_dump_node(rt_node *node, int level, bool recurse)
}
if (NODE_IS_LEAF(node))
{
- rt_node_leaf_125 *n = (rt_node_leaf_125 *) node;
+ rt_node_leaf_125 *n = (rt_node_leaf_125 *) node.decoded;
fprintf(stderr, ", isset-bitmap:");
for (int i = 0; i < WORDNUM(128); i++)
@@ -2420,7 +2526,7 @@ rt_dump_node(rt_node *node, int level, bool recurse)
space, i);
if (recurse)
- rt_dump_node(node_inner_125_get_child(n125, i),
+ rt_dump_node(rt_node_ptr_encoded(node_inner_125_get_child(n125, i)),
level + 1, recurse);
else
fprintf(stderr, "\n");
@@ -2434,7 +2540,7 @@ rt_dump_node(rt_node *node, int level, bool recurse)
{
if (NODE_IS_LEAF(node))
{
- rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node.decoded;
if (!node_leaf_256_is_chunk_used(n256, i))
continue;
@@ -2444,7 +2550,7 @@ rt_dump_node(rt_node *node, int level, bool recurse)
}
else
{
- rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node.decoded;
if (!node_inner_256_is_chunk_used(n256, i))
continue;
@@ -2453,8 +2559,8 @@ rt_dump_node(rt_node *node, int level, bool recurse)
space, i);
if (recurse)
- rt_dump_node(node_inner_256_get_child(n256, i), level + 1,
- recurse);
+ rt_dump_node(rt_node_ptr_encoded(node_inner_256_get_child(n256, i)),
+ level + 1, recurse);
else
fprintf(stderr, "\n");
}
@@ -2467,7 +2573,7 @@ rt_dump_node(rt_node *node, int level, bool recurse)
void
rt_dump_search(radix_tree *tree, uint64 key)
{
- rt_node *node;
+ rt_node_ptr node;
int shift;
int level = 0;
@@ -2475,7 +2581,7 @@ rt_dump_search(radix_tree *tree, uint64 key)
elog(NOTICE, "max_val = " UINT64_FORMAT "(0x" UINT64_FORMAT_HEX ")",
tree->max_val, tree->max_val);
- if (!tree->root)
+ if (!RTPointerIsValid(tree->root))
{
elog(NOTICE, "tree is empty");
return;
@@ -2488,11 +2594,11 @@ rt_dump_search(radix_tree *tree, uint64 key)
return;
}
- node = tree->root;
- shift = tree->root->shift;
+ node = rt_node_ptr_encoded(tree->root);
+ shift = NODE_SHIFT(node);
while (shift >= 0)
{
- rt_node *child;
+ rt_pointer child;
rt_dump_node(node, level, false);
@@ -2509,7 +2615,7 @@ rt_dump_search(radix_tree *tree, uint64 key)
if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
break;
- node = child;
+ node = rt_node_ptr_encoded(child);
shift -= RT_NODE_SPAN;
level++;
}
@@ -2518,6 +2624,7 @@ rt_dump_search(radix_tree *tree, uint64 key)
void
rt_dump(radix_tree *tree)
{
+ rt_node_ptr root;
for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
fprintf(stderr, "%s\tinner_size %zu\tinner_blocksize %zu\tleaf_size %zu\tleaf_blocksize %zu\n",
@@ -2528,12 +2635,13 @@ rt_dump(radix_tree *tree)
rt_size_class_info[i].leaf_blocksize);
fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->max_val);
- if (!tree->root)
+ if (!RTPointerIsValid(tree->root))
{
fprintf(stderr, "empty tree\n");
return;
}
- rt_dump_node(tree->root, 0, true);
+ root = rt_node_ptr_encoded(tree->root);
+ rt_dump_node(root, 0, true);
}
#endif
--
2.38.1
On Tue, Dec 6, 2022 at 7:32 PM John Naylor <john.naylor@enterprisedb.com> wrote:
On Fri, Dec 2, 2022 at 11:42 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Mon, Nov 14, 2022 at 7:59 PM John Naylor <john.naylor@enterprisedb.com> wrote:
- Optimize node128 insert.
I've attached a rough start at this. The basic idea is borrowed from our bitmapset nodes, so we can iterate over and operate on word-sized (32- or 64-bit) types at a time, rather than bytes.
Thanks! I think this is a good idea.
To make this easier, I've moved some of the lower-level macros and types from bitmapset.h/.c to pg_bitutils.h. That's probably going to need a separate email thread to resolve the coding style clash this causes, so that can be put off for later.
I started a separate thread [1], and 0002 comes from feedback on that. There is a FIXME about using WORDNUM and BITNUM, at least with that spelling. I'm putting that off to ease rebasing the rest as v13 -- getting some CI testing with 0002 seems like a good idea. There are no other changes yet. Next, I will take a look at templating local vs. shared memory. I might try basing that on the styles of both v12 and v8, and see which one works best with templating.
Thank you so much!
In the meanwhile, I've been working on vacuum integration. There are
two things I'd like to discuss some time:
The first is the minimum of maintenance_work_mem, 1 MB. Since the
initial DSA segment size is 1MB (DSA_INITIAL_SEGMENT_SIZE), parallel
vacuum with radix tree cannot work with the minimum
maintenance_work_mem. We will need to increase it to 4MB or so. Maybe
we can start a new thread for that.
The second is how to limit the size of the radix tree to
maintenance_work_mem. I think that it's tricky to estimate the maximum
number of keys in the radix tree that fit in maintenance_work_mem. The
radix tree size varies depending on the key distribution. The next
idea I considered was how to limit the size when inserting a key. In
order to strictly limit the radix tree size, we would probably have to
change rt_set() so that it bails out and returns false if the radix
tree size is about to exceed the memory limit when we allocate a new
node or grow a node kind/class. Ideally, I'd like to control the size
outside of the radix tree (e.g., in TIDStore), since such a check could
introduce overhead to rt_set(), but we probably need to add that logic
inside the radix tree.
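To illustrate the strict approach, a minimal sketch could look like the
following (this is not part of the attached patches; the helper name and
the way the limit is passed are made up here):

/*
 * Hypothetical sketch: refuse to allocate a new node once the tree would
 * exceed mem_limit, so that rt_set() can bail out and return false.
 */
static bool
rt_alloc_would_exceed_limit(radix_tree *tree, int size_class, bool inner,
							Size mem_limit)
{
	Size	node_size = inner ? rt_size_class_info[size_class].inner_size
							  : rt_size_class_info[size_class].leaf_size;

	return rt_memory_usage(tree) + node_size > mem_limit;
}

rt_set() would have to check this before every rt_alloc_node() and
rt_grow_node_kind() call and then unwind cleanly, which is what makes
this approach invasive.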
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Fri, Dec 9, 2022 at 8:20 AM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
In the meanwhile, I've been working on vacuum integration. There are
two things I'd like to discuss some time:
The first is the minimum of maintenance_work_mem, 1 MB. Since the
initial DSA segment size is 1MB (DSA_INITIAL_SEGMENT_SIZE), parallel
vacuum with radix tree cannot work with the minimum
maintenance_work_mem. It will need to increase it to 4MB or so. Maybe
we can start a new thread for that.
I don't think that'd be very controversial, but I'm also not sure why we'd
need 4MB -- can you explain in more detail what exactly we'd need so that
the feature would work? (The minimum doesn't have to work *well* IIUC, just
do some useful work and not fail).
The second is how to limit the size of the radix tree to
maintenance_work_mem. I think that it's tricky to estimate the maximum
number of keys in the radix tree that fit in maintenance_work_mem. The
radix tree size varies depending on the key distribution. The next
idea I considered was how to limit the size when inserting a key. In
order to strictly limit the radix tree size, probably we have to
change the rt_set so that it breaks off and returns false if the radix
tree size is about to exceed the memory limit when we allocate a new
node or grow a node kind/class.
That seems complex, fragile, and wrong scope.
Ideally, I'd like to control the size
outside of radix tree (e.g. TIDStore) since it could introduce
overhead to rt_set() but probably we need to add such logic in radix
tree.
Does the TIDStore have the ability to ask the DSA (or slab context) to see
how big it is? If a new segment has been allocated that brings us to the
limit, we can stop when we discover that fact. In the local case with slab
blocks, it won't be on nice neat boundaries, but we could check if we're
within the largest block size (~64kB) of overflow.
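For the local case, a minimal sketch of that check (the names here are
invented, and I'm assuming MemoryContextMemAllocated() is good enough for
this purpose):

/* stop once we're within one maximum slab block (~64kB) of the limit */
#define MAX_LOCAL_BLOCK_SIZE	(64 * 1024)

static bool
local_store_is_full(MemoryContext rt_context, Size limit_bytes)
{
	return MemoryContextMemAllocated(rt_context, true) >
		limit_bytes - MAX_LOCAL_BLOCK_SIZE;
}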
Remember when we discussed how we might approach parallel pruning? I
envisioned a local array of a few dozen kilobytes to reduce contention on
the tidstore. We could use such an array even for a single worker (always
doing the same thing is simpler anyway). When the array fills up enough so
that the next heap page *could* overflow it: Stop, insert into the store,
and check the store's memory usage before continuing.
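Roughly like this, per worker (all names are hypothetical, just to show
the shape of it):

/* hypothetical sketch of the local TID buffer in front of the tidstore */
#define LOCAL_TID_BUF_SIZE	4096	/* a few dozen kB of ItemPointerData */

ItemPointerData buf[LOCAL_TID_BUF_SIZE];
int			nbuf = 0;

/* ... while pruning each heap page ... */
if (nbuf + MaxHeapTuplesPerPage > LOCAL_TID_BUF_SIZE)
{
	tidstore_add_tids(store, buf, nbuf);		/* assumed TIDStore API */
	nbuf = 0;

	if (tidstore_memory_usage(store) > limit)	/* assumed API */
		break;		/* suspend the heap scan; do index/heap vacuum */
}
/* ... append this page's dead TIDs to buf[] ... */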
--
John Naylor
EDB: http://www.enterprisedb.com
On Fri, Dec 9, 2022 at 5:53 PM John Naylor <john.naylor@enterprisedb.com> wrote:
On Fri, Dec 9, 2022 at 8:20 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
In the meanwhile, I've been working on vacuum integration. There are
two things I'd like to discuss some time:
The first is the minimum of maintenance_work_mem, 1 MB. Since the
initial DSA segment size is 1MB (DSA_INITIAL_SEGMENT_SIZE), parallel
vacuum with radix tree cannot work with the minimum
maintenance_work_mem. It will need to increase it to 4MB or so. Maybe
we can start a new thread for that.
I don't think that'd be very controversial, but I'm also not sure why we'd need 4MB -- can you explain in more detail what exactly we'd need so that the feature would work? (The minimum doesn't have to work *well* IIUC, just do some useful work and not fail).
The minimum requirement is 2MB. In the PoC patch, TIDStore checks how big
the radix tree is using dsa_get_total_size(). If the size returned by
dsa_get_total_size() (plus some memory used for TIDStore meta information)
exceeds maintenance_work_mem, lazy vacuum starts to do index vacuum
and heap vacuum. However, when allocating DSA memory for
radix_tree_control at creation, we allocate a 1MB
(DSA_INITIAL_SEGMENT_SIZE) DSM segment and take the memory required for
radix_tree_control from it, so dsa_get_total_size() returns 1MB even if
no TIDs have been collected yet.
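For reference, the check in the PoC is essentially the following
(simplified; the struct and field names may still change):

/* simplified from the PoC: decide whether to trigger index/heap vacuum */
static bool
tidstore_exceeds_limit(TidStore *ts)
{
	Size	total = dsa_get_total_size(ts->area) + ts->meta_size;

	return total > (Size) maintenance_work_mem * 1024L;
}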
The second is how to limit the size of the radix tree to
maintenance_work_mem. I think that it's tricky to estimate the maximum
number of keys in the radix tree that fit in maintenance_work_mem. The
radix tree size varies depending on the key distribution. The next
idea I considered was how to limit the size when inserting a key. In
order to strictly limit the radix tree size, probably we have to
change the rt_set so that it breaks off and returns false if the radix
tree size is about to exceed the memory limit when we allocate a new
node or grow a node kind/class.
That seems complex, fragile, and wrong scope.
Ideally, I'd like to control the size
outside of radix tree (e.g. TIDStore) since it could introduce
overhead to rt_set() but probably we need to add such logic in radix
tree.Does the TIDStore have the ability to ask the DSA (or slab context) to see how big it is?
Yes, TIDStore can check it using dsa_get_total_size().
If a new segment has been allocated that brings us to the limit, we can stop when we discover that fact. In the local case with slab blocks, it won't be on nice neat boundaries, but we could check if we're within the largest block size (~64kB) of overflow.
Remember when we discussed how we might approach parallel pruning? I envisioned a local array of a few dozen kilobytes to reduce contention on the tidstore. We could use such an array even for a single worker (always doing the same thing is simpler anyway). When the array fills up enough so that the next heap page *could* overflow it: Stop, insert into the store, and check the store's memory usage before continuing.
Right, I think it's not a problem in the slab case. In the DSA case, the new
segment size follows a geometric series that approximately doubles the
total storage each time we create a new segment. This behavior comes
from the fact that the underlying DSM system isn't designed for large
numbers of segments.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Fri, Dec 9, 2022 at 8:33 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
On Fri, Dec 9, 2022 at 5:53 PM John Naylor <john.naylor@enterprisedb.com>
wrote:
I don't think that'd be very controversial, but I'm also not sure why
we'd need 4MB -- can you explain in more detail what exactly we'd need so
that the feature would work? (The minimum doesn't have to work *well* IIUC,
just do some useful work and not fail).
The minimum requirement is 2MB. In PoC patch, TIDStore checks how big
the radix tree is using dsa_get_total_size(). If the size returned by
dsa_get_total_size() (+ some memory used by TIDStore meta information)
exceeds maintenance_work_mem, lazy vacuum starts to do index vacuum
and heap vacuum. However, when allocating DSA memory for
radix_tree_control at creation, we allocate 1MB
(DSA_INITIAL_SEGMENT_SIZE) DSM memory and use memory required for
radix_tree_control from it. dsa_get_total_size() returns 1MB even if
there is no TID collected.
2MB makes sense.
If the metadata is small, it seems counterproductive to count it towards
the total. We want the decision to be driven by blocks allocated. I have an
idea on that below.
Remember when we discussed how we might approach parallel pruning? I
envisioned a local array of a few dozen kilobytes to reduce contention on
the tidstore. We could use such an array even for a single worker (always
doing the same thing is simpler anyway). When the array fills up enough so
that the next heap page *could* overflow it: Stop, insert into the store,
and check the store's memory usage before continuing.
Right, I think it's no problem in slab cases. In DSA cases, the new
segment size follows a geometric series that approximately doubles the
total storage each time we create a new segment. This behavior comes
from the fact that the underlying DSM system isn't designed for large
numbers of segments.
And taking a look, the size of a new segment can get quite large. It seems
we could test if the total DSA area allocated is greater than half of
maintenance_work_mem. If that parameter is a power of two (common) and
=8MB, then the area will contain just under a power of two the last time
it passes the test. The next segment will bring it to about 3/4 full, like
this:
maintenance work mem = 256MB, so stop if we go over 128MB:
2*(1+2+4+8+16+32) = 126MB -> keep going
126MB + 64 = 190MB -> stop
That would be a simple way to be conservative with the memory limit. The
unfortunate aspect is that the last segment would be mostly wasted, but
it's paradise compared to the pessimistically-sized single array we have
now (even with Peter G.'s VM snapshot informing the allocation size, I
imagine).
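In code the test would be nothing more than this sketch (assuming the
tree's DSA area is at hand, and recalling that maintenance_work_mem is in
kilobytes):

/* sketch: conservative stop condition for the shared (DSA) case */
if (dsa_get_total_size(area) > (Size) maintenance_work_mem * 1024L / 2)
	stop_collecting_dead_tids();	/* hypothetical */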
And as for minimum possible maintenance work mem, I think this would work
with 2MB, if the community is okay with technically going over the limit by
a few bytes of overhead if a buildfarm animal is set to that value. I imagine
it would never go over the limit for realistic (and even most unrealistic)
values. Even with a VM snapshot page in memory and small local arrays of
TIDs, I think with this scheme we'll be well under the limit.
After this feature is complete, I think we should consider a follow-on
patch to get rid of vacuum_work_mem, since it would no longer be needed.
--
John Naylor
EDB: http://www.enterprisedb.com
On Mon, Dec 12, 2022 at 7:14 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Fri, Dec 9, 2022 at 8:33 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Fri, Dec 9, 2022 at 5:53 PM John Naylor <john.naylor@enterprisedb.com> wrote:
I don't think that'd be very controversial, but I'm also not sure why we'd need 4MB -- can you explain in more detail what exactly we'd need so that the feature would work? (The minimum doesn't have to work *well* IIUC, just do some useful work and not fail).
The minimum requirement is 2MB. In PoC patch, TIDStore checks how big
the radix tree is using dsa_get_total_size(). If the size returned by
dsa_get_total_size() (+ some memory used by TIDStore meta information)
exceeds maintenance_work_mem, lazy vacuum starts to do index vacuum
and heap vacuum. However, when allocating DSA memory for
radix_tree_control at creation, we allocate 1MB
(DSA_INITIAL_SEGMENT_SIZE) DSM memory and use memory required for
radix_tree_control from it. dsa_get_total_size() returns 1MB even if
there is no TID collected.
2MB makes sense.
If the metadata is small, it seems counterproductive to count it towards the total. We want the decision to be driven by blocks allocated. I have an idea on that below.
Remember when we discussed how we might approach parallel pruning? I envisioned a local array of a few dozen kilobytes to reduce contention on the tidstore. We could use such an array even for a single worker (always doing the same thing is simpler anyway). When the array fills up enough so that the next heap page *could* overflow it: Stop, insert into the store, and check the store's memory usage before continuing.
Right, I think it's no problem in slab cases. In DSA cases, the new
segment size follows a geometric series that approximately doubles the
total storage each time we create a new segment. This behavior comes
from the fact that the underlying DSM system isn't designed for large
numbers of segments.
And taking a look, the size of a new segment can get quite large. It seems we could test if the total DSA area allocated is greater than half of maintenance_work_mem. If that parameter is a power of two (common) and >=8MB, then the area will contain just under a power of two the last time it passes the test. The next segment will bring it to about 3/4 full, like this:
maintenance work mem = 256MB, so stop if we go over 128MB:
2*(1+2+4+8+16+32) = 126MB -> keep going
126MB + 64 = 190MB -> stop
That would be a simple way to be conservative with the memory limit. The unfortunate aspect is that the last segment would be mostly wasted, but it's paradise compared to the pessimistically-sized single array we have now (even with Peter G.'s VM snapshot informing the allocation size, I imagine).
Right. In this case, even if we allocate 64MB, we will use only 2088
bytes at maximum. So I think the memory space used for vacuum is
practically limited to half.
And as for minimum possible maintenance work mem, I think this would work with 2MB, if the community is okay with technically going over the limit by a few bytes of overhead if a buildfarm animal set to that value. I imagine it would never go over the limit for realistic (and even most unrealistic) values. Even with a VM snapshot page in memory and small local arrays of TIDs, I think with this scheme we'll be well under the limit.
Looking at other code using DSA such as tidbitmap.c and nodeHash.c, it
seems that they look only at memory that is actually dsa_allocate'd.
To be exact, we estimate the number of hash buckets based on work_mem
(and hash_mem_multiplier) and use it as the upper limit. So I've
confirmed that the result of dsa_get_total_size() could exceed the
limit. I'm not sure it's a known and legitimate usage. If we can
follow such usage, we can probably track how much dsa_allocate'd
memory is used in the radix tree. Templating whether or not to count
the memory usage might help avoid the overhead.
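Something like the following rough sketch (the field and wrapper names are
made up here; in the shared case the counter would live in
radix_tree_control):

/* rough sketch only: account for dsa_allocate'd bytes ourselves */
static dsa_pointer
rt_dsa_alloc(radix_tree *tree, Size size)
{
	dsa_pointer p = dsa_allocate(tree->dsa, size);

	tree->mem_used += size;		/* rt_memory_usage() would report this */
	return p;
}

static void
rt_dsa_free(radix_tree *tree, dsa_pointer p, Size size)
{
	dsa_free(tree->dsa, p);
	tree->mem_used -= size;
}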
After this feature is complete, I think we should consider a follow-on patch to get rid of vacuum_work_mem, since it would no longer be needed.
I think you meant autovacuum_work_mem. Yes, I also think we can get rid of it.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Tue, Dec 13, 2022 at 1:04 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Mon, Dec 12, 2022 at 7:14 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Fri, Dec 9, 2022 at 8:33 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Fri, Dec 9, 2022 at 5:53 PM John Naylor <john.naylor@enterprisedb.com> wrote:
I don't think that'd be very controversial, but I'm also not sure why we'd need 4MB -- can you explain in more detail what exactly we'd need so that the feature would work? (The minimum doesn't have to work *well* IIUC, just do some useful work and not fail).
The minimum requirement is 2MB. In PoC patch, TIDStore checks how big
the radix tree is using dsa_get_total_size(). If the size returned by
dsa_get_total_size() (+ some memory used by TIDStore meta information)
exceeds maintenance_work_mem, lazy vacuum starts to do index vacuum
and heap vacuum. However, when allocating DSA memory for
radix_tree_control at creation, we allocate 1MB
(DSA_INITIAL_SEGMENT_SIZE) DSM memory and use memory required for
radix_tree_control from it. dsa_get_total_size() returns 1MB even if
there is no TID collected.
2MB makes sense.
If the metadata is small, it seems counterproductive to count it towards the total. We want the decision to be driven by blocks allocated. I have an idea on that below.
Remember when we discussed how we might approach parallel pruning? I envisioned a local array of a few dozen kilobytes to reduce contention on the tidstore. We could use such an array even for a single worker (always doing the same thing is simpler anyway). When the array fills up enough so that the next heap page *could* overflow it: Stop, insert into the store, and check the store's memory usage before continuing.
Right, I think it's no problem in slab cases. In DSA cases, the new
segment size follows a geometric series that approximately doubles the
total storage each time we create a new segment. This behavior comes
from the fact that the underlying DSM system isn't designed for large
numbers of segments.
And taking a look, the size of a new segment can get quite large. It seems we could test if the total DSA area allocated is greater than half of maintenance_work_mem. If that parameter is a power of two (common) and >=8MB, then the area will contain just under a power of two the last time it passes the test. The next segment will bring it to about 3/4 full, like this:
maintenance work mem = 256MB, so stop if we go over 128MB:
2*(1+2+4+8+16+32) = 126MB -> keep going
126MB + 64 = 190MB -> stop
That would be a simple way to be conservative with the memory limit. The unfortunate aspect is that the last segment would be mostly wasted, but it's paradise compared to the pessimistically-sized single array we have now (even with Peter G.'s VM snapshot informing the allocation size, I imagine).
Right. In this case, even if we allocate 64MB, we will use only 2088
bytes at maximum. So I think the memory space used for vacuum is
practically limited to half.
And as for minimum possible maintenance work mem, I think this would work with 2MB, if the community is okay with technically going over the limit by a few bytes of overhead if a buildfarm animal set to that value. I imagine it would never go over the limit for realistic (and even most unrealistic) values. Even with a VM snapshot page in memory and small local arrays of TIDs, I think with this scheme we'll be well under the limit.
Looking at other code using DSA such as tidbitmap.c and nodeHash.c, it
seems that they look only at memory that is actually dsa_allocate'd.
To be exact, we estimate the number of hash buckets based on work_mem
(and hash_mem_multiplier) and use it as the upper limit. So I've
confirmed that the result of dsa_get_total_size() could exceed the
limit. I'm not sure it's a known and legitimate usage. If we can
follow such usage, we can probably track how much dsa_allocate'd
memory is used in the radix tree.
I've experimented with this idea. The newly added 0008 patch changes
the radix tree so that it counts the memory usage for both local and
shared cases. As shown below, there is an overhead for that:
w/o 0008 patch
=# select * from bench_load_random_int(1000000)
NOTICE: num_keys = 1000000, height = 7, n4 = 4970924, n15 = 38277,
n32 = 27205, n125 = 0, n256 = 257
mem_allocated | load_ms
---------------+---------
298453544 | 282
(1 row)
w/ 0008 patch
=# select * from bench_load_random_int(1000000)
NOTICE: num_keys = 1000000, height = 7, n4 = 4970924, n15 = 38277,
n32 = 27205, n125 = 0, n256 = 257
mem_allocated | load_ms
---------------+---------
293603184 | 297
(1 row)
Although it adds some overhead, I think this idea is straightforward
and the most practical for users. And it seems to be consistent with
other components using DSA. We can improve this part in the future for
better memory control, for example, by introducing slab-like DSA
memory management.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachments:
v14-0005-tool-for-measuring-radix-tree-performance.patch (application/octet-stream)
From 75af1182c7107486db3846e616625e456d640e3c Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 16 Sep 2022 11:57:03 +0900
Subject: [PATCH v14 5/9] tool for measuring radix tree performance
---
contrib/bench_radix_tree/Makefile | 21 +
.../bench_radix_tree--1.0.sql | 76 +++
contrib/bench_radix_tree/bench_radix_tree.c | 635 ++++++++++++++++++
.../bench_radix_tree/bench_radix_tree.control | 6 +
contrib/bench_radix_tree/expected/bench.out | 13 +
contrib/bench_radix_tree/sql/bench.sql | 16 +
6 files changed, 767 insertions(+)
create mode 100644 contrib/bench_radix_tree/Makefile
create mode 100644 contrib/bench_radix_tree/bench_radix_tree--1.0.sql
create mode 100644 contrib/bench_radix_tree/bench_radix_tree.c
create mode 100644 contrib/bench_radix_tree/bench_radix_tree.control
create mode 100644 contrib/bench_radix_tree/expected/bench.out
create mode 100644 contrib/bench_radix_tree/sql/bench.sql
diff --git a/contrib/bench_radix_tree/Makefile b/contrib/bench_radix_tree/Makefile
new file mode 100644
index 0000000000..952bb0ceae
--- /dev/null
+++ b/contrib/bench_radix_tree/Makefile
@@ -0,0 +1,21 @@
+# contrib/bench_radix_tree/Makefile
+
+MODULE_big = bench_radix_tree
+OBJS = \
+ bench_radix_tree.o
+
+EXTENSION = bench_radix_tree
+DATA = bench_radix_tree--1.0.sql
+
+REGRESS = bench_fixed_height
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/bench_radix_tree
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
new file mode 100644
index 0000000000..83529805fc
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
@@ -0,0 +1,76 @@
+/* contrib/bench_radix_tree/bench_radix_tree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION bench_radix_tree" to load this file. \quit
+
+create function bench_shuffle_search(
+minblk int4,
+maxblk int4,
+random_block bool DEFAULT false,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT array_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT array_load_ms int8,
+OUT rt_search_ms int8,
+OUT array_serach_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_seq_search(
+minblk int4,
+maxblk int4,
+random_block bool DEFAULT false,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT array_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT array_load_ms int8,
+OUT rt_search_ms int8,
+OUT array_serach_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_load_random_int(
+cnt int8,
+OUT mem_allocated int8,
+OUT load_ms int8)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_search_random_nodes(
+cnt int8,
+filter_str text DEFAULT NULL,
+OUT mem_allocated int8,
+OUT search_ms int8)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_fixed_height_search(
+fanout int4,
+OUT fanout int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT rt_search_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_node128_load(
+fanout int4,
+OUT fanout int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT rt_sparseload_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c
new file mode 100644
index 0000000000..a0693695e6
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -0,0 +1,635 @@
+/*-------------------------------------------------------------------------
+ *
+ * bench_radix_tree.c
+ *
+ * Copyright (c) 2016-2022, PostgreSQL Global Development Group
+ *
+ * contrib/bench_radix_tree/bench_radix_tree.c
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "funcapi.h"
+#include "lib/radixtree.h"
+#include <math.h>
+#include "miscadmin.h"
+#include "storage/lwlock.h"
+#include "utils/builtins.h"
+#include "utils/timestamp.h"
+
+PG_MODULE_MAGIC;
+
+/* run benchmark also for binary-search case? */
+/* #define MEASURE_BINARY_SEARCH 1 */
+
+#define TIDS_PER_BLOCK_FOR_LOAD 30
+#define TIDS_PER_BLOCK_FOR_LOOKUP 50
+
+PG_FUNCTION_INFO_V1(bench_seq_search);
+PG_FUNCTION_INFO_V1(bench_shuffle_search);
+PG_FUNCTION_INFO_V1(bench_load_random_int);
+PG_FUNCTION_INFO_V1(bench_fixed_height_search);
+PG_FUNCTION_INFO_V1(bench_search_random_nodes);
+PG_FUNCTION_INFO_V1(bench_node128_load);
+
+static uint64
+tid_to_key_off(ItemPointer tid, uint32 *off)
+{
+ uint64 upper;
+ uint32 shift = pg_ceil_log2_32(MaxHeapTuplesPerPage);
+ int64 tid_i;
+
+ Assert(ItemPointerGetOffsetNumber(tid) < MaxHeapTuplesPerPage);
+
+ tid_i = ItemPointerGetOffsetNumber(tid);
+ tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << shift;
+
+ /* log(sizeof(uint64) * BITS_PER_BYTE, 2) = log(64, 2) = 6 */
+ *off = tid_i & ((1 << 6) - 1);
+ upper = tid_i >> 6;
+ Assert(*off < (sizeof(uint64) * BITS_PER_BYTE));
+
+ Assert(*off < 64);
+
+ return upper;
+}
+
+static int
+shuffle_randrange(pg_prng_state *state, int lower, int upper)
+{
+ return (int) floor(pg_prng_double(state) * ((upper - lower) + 0.999999)) + lower;
+}
+
+/* Naive Fisher-Yates implementation */
+static void
+shuffle_itemptrs(ItemPointer itemptr, uint64 nitems)
+{
+ /* reproducibility */
+ pg_prng_state state;
+
+ pg_prng_seed(&state, 0);
+
+ for (int i = 0; i < nitems - 1; i++)
+ {
+ int j = shuffle_randrange(&state, i, nitems - 1);
+ ItemPointerData t = itemptr[j];
+
+ itemptr[j] = itemptr[i];
+ itemptr[i] = t;
+ }
+}
+
+static ItemPointer
+generate_tids(BlockNumber minblk, BlockNumber maxblk, int ntids_per_blk, uint64 *ntids_p,
+ bool random_block)
+{
+ ItemPointer tids;
+ uint64 maxitems;
+ uint64 ntids = 0;
+ pg_prng_state state;
+
+ maxitems = (maxblk - minblk + 1) * ntids_per_blk;
+ tids = MemoryContextAllocHuge(TopTransactionContext,
+ sizeof(ItemPointerData) * maxitems);
+
+ if (random_block)
+ pg_prng_seed(&state, 0x9E3779B185EBCA87);
+
+ for (BlockNumber blk = minblk; blk < maxblk; blk++)
+ {
+ if (random_block && !pg_prng_bool(&state))
+ continue;
+
+ for (OffsetNumber off = FirstOffsetNumber;
+ off <= ntids_per_blk; off++)
+ {
+ CHECK_FOR_INTERRUPTS();
+
+ ItemPointerSetBlockNumber(&(tids[ntids]), blk);
+ ItemPointerSetOffsetNumber(&(tids[ntids]), off);
+
+ ntids++;
+ }
+ }
+
+ *ntids_p = ntids;
+ return tids;
+}
+
+#ifdef MEASURE_BINARY_SEARCH
+static int
+vac_cmp_itemptr(const void *left, const void *right)
+{
+ BlockNumber lblk,
+ rblk;
+ OffsetNumber loff,
+ roff;
+
+ lblk = ItemPointerGetBlockNumber((ItemPointer) left);
+ rblk = ItemPointerGetBlockNumber((ItemPointer) right);
+
+ if (lblk < rblk)
+ return -1;
+ if (lblk > rblk)
+ return 1;
+
+ loff = ItemPointerGetOffsetNumber((ItemPointer) left);
+ roff = ItemPointerGetOffsetNumber((ItemPointer) right);
+
+ if (loff < roff)
+ return -1;
+ if (loff > roff)
+ return 1;
+
+ return 0;
+}
+#endif
+
+static Datum
+bench_search(FunctionCallInfo fcinfo, bool shuffle)
+{
+ BlockNumber minblk = PG_GETARG_INT32(0);
+ BlockNumber maxblk = PG_GETARG_INT32(1);
+ bool random_block = PG_GETARG_BOOL(2);
+ radix_tree *rt = NULL;
+ uint64 ntids;
+ uint64 key;
+ uint64 last_key = PG_UINT64_MAX;
+ uint64 val = 0;
+ ItemPointer tids;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_load_ms,
+ rt_search_ms;
+ Datum values[7];
+ bool nulls[7];
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ tids = generate_tids(minblk, maxblk, TIDS_PER_BLOCK_FOR_LOAD, &ntids, random_block);
+
+ /* measure the load time of the radix tree */
+ rt = rt_create(CurrentMemoryContext);
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ uint32 off;
+
+ CHECK_FOR_INTERRUPTS();
+
+ key = tid_to_key_off(tid, &off);
+
+ if (last_key != PG_UINT64_MAX && last_key != key)
+ {
+ rt_set(rt, last_key, val);
+ val = 0;
+ }
+
+ last_key = key;
+ val |= (uint64) 1 << off;
+ }
+ if (last_key != PG_UINT64_MAX)
+ rt_set(rt, last_key, val);
+
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_load_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ if (shuffle)
+ shuffle_itemptrs(tids, ntids);
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+ /* measure the search time of the radix tree */
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ uint32 off;
+ volatile bool ret; /* prevent calling rt_search from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ key = tid_to_key_off(tid, &off);
+
+ ret = rt_search(rt, key, &val);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_search_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int64GetDatum(rt_num_entries(rt));
+ values[1] = Int64GetDatum(rt_memory_usage(rt));
+ values[2] = Int64GetDatum(sizeof(ItemPointerData) * ntids);
+ values[3] = Int64GetDatum(rt_load_ms);
+ nulls[4] = true; /* ar_load_ms */
+ values[5] = Int64GetDatum(rt_search_ms);
+ nulls[6] = true; /* ar_search_ms */
+
+#ifdef MEASURE_BINARY_SEARCH
+ {
+ ItemPointer itemptrs = NULL;
+
+ int64 ar_load_ms,
+ ar_search_ms;
+
+ /* measure the load time of the array */
+ itemptrs = MemoryContextAllocHuge(CurrentMemoryContext,
+ sizeof(ItemPointerData) * ntids);
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointerSetBlockNumber(&(itemptrs[i]),
+ ItemPointerGetBlockNumber(&(tids[i])));
+ ItemPointerSetOffsetNumber(&(itemptrs[i]),
+ ItemPointerGetOffsetNumber(&(tids[i])));
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ ar_load_ms = secs * 1000 + usecs / 1000;
+
+ /* next, measure the search time of the array */
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ volatile bool ret; /* prevent calling bsearch from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ ret = bsearch((void *) tid,
+ (void *) itemptrs,
+ ntids,
+ sizeof(ItemPointerData),
+ vac_cmp_itemptr);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ ar_search_ms = secs * 1000 + usecs / 1000;
+
+ /* set the result */
+ nulls[4] = false;
+ values[4] = Int64GetDatum(ar_load_ms);
+ nulls[6] = false;
+ values[6] = Int64GetDatum(ar_search_ms);
+ }
+#endif
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_seq_search(PG_FUNCTION_ARGS)
+{
+ return bench_search(fcinfo, false);
+}
+
+Datum
+bench_shuffle_search(PG_FUNCTION_ARGS)
+{
+ return bench_search(fcinfo, true);
+}
+
+Datum
+bench_load_random_int(PG_FUNCTION_ARGS)
+{
+ uint64 cnt = (uint64) PG_GETARG_INT64(0);
+ radix_tree *rt;
+ pg_prng_state state;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 load_time_ms;
+ Datum values[2];
+ bool nulls[2];
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ pg_prng_seed(&state, 0);
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ uint64 key = pg_prng_uint64(&state);
+
+ rt_set(rt, key, key);
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ load_time_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int64GetDatum(rt_memory_usage(rt));
+ values[1] = Int64GetDatum(load_time_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+/* copy of splitmix64() */
+static uint64
+hash64(uint64 x)
+{
+ x ^= x >> 30;
+ x *= UINT64CONST(0xbf58476d1ce4e5b9);
+ x ^= x >> 27;
+ x *= UINT64CONST(0x94d049bb133111eb);
+ x ^= x >> 31;
+ return x;
+}
+
+/* attempts to have a relatively even population of node kinds */
+Datum
+bench_search_random_nodes(PG_FUNCTION_ARGS)
+{
+ uint64 cnt = (uint64) PG_GETARG_INT64(0);
+ radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 search_time_ms;
+ Datum values[2] = {0};
+ bool nulls[2] = {0};
+ /* from trial and error */
+ uint64 filter = (((uint64) 0x7F<<32) | (0x07<<24) | (0xFF<<16) | 0xFF);
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ if (!PG_ARGISNULL(1))
+ {
+ char *filter_str = text_to_cstring(PG_GETARG_TEXT_P(1));
+
+ if (sscanf(filter_str, "0x%lX", &filter) == 0)
+ elog(ERROR, "invalid filter string %s", filter_str);
+ }
+ elog(NOTICE, "bench with filter 0x%lX", filter);
+
+ rt = rt_create(CurrentMemoryContext);
+
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ const uint64 hash = hash64(i);
+ const uint64 key = hash & filter;
+
+ rt_set(rt, key, key);
+ }
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ const uint64 hash = hash64(i);
+ const uint64 key = hash & filter;
+ uint64 val;
+ volatile bool ret; /* prevent calling rt_search from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ ret = rt_search(rt, key, &val);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ search_time_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ values[0] = Int64GetDatum(rt_memory_usage(rt));
+ values[1] = Int64GetDatum(search_time_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_fixed_height_search(PG_FUNCTION_ARGS)
+{
+ int fanout = PG_GETARG_INT32(0);
+ radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_load_ms,
+ rt_search_ms;
+ Datum values[5];
+ bool nulls[5];
+
+ /* test boundary between vector and iteration */
+ const int n_keys = 5 * 16 * 16 * 16 * 16;
+ uint64 r,
+ h,
+ i,
+ j,
+ k;
+ int key_id;
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+
+ key_id = 0;
+
+ /*
+ * lower nodes have limited fanout, the top is only limited by
+ * bits-per-byte
+ */
+ for (r = 1;; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ for (i = 1; i <= fanout; i++)
+ {
+ for (j = 1; j <= fanout; j++)
+ {
+ for (k = 1; k <= fanout; k++)
+ {
+ uint64 key;
+
+ key = (r << 32) | (h << 24) | (i << 16) | (j << 8) | (k);
+
+ CHECK_FOR_INTERRUPTS();
+
+ key_id++;
+ if (key_id > n_keys)
+ goto finish_set;
+
+ rt_set(rt, key, key_id);
+ }
+ }
+ }
+ }
+ }
+finish_set:
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_load_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ /* measure the search time of the radix tree */
+ start_time = GetCurrentTimestamp();
+
+ key_id = 0;
+ for (r = 1;; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ for (i = 1; i <= fanout; i++)
+ {
+ for (j = 1; j <= fanout; j++)
+ {
+ for (k = 1; k <= fanout; k++)
+ {
+ uint64 key,
+ val;
+
+ key = (r << 32) | (h << 24) | (i << 16) | (j << 8) | (k);
+
+ CHECK_FOR_INTERRUPTS();
+
+ key_id++;
+ if (key_id > n_keys)
+ goto finish_search;
+
+ rt_search(rt, key, &val);
+ }
+ }
+ }
+ }
+ }
+finish_search:
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_search_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int32GetDatum(fanout);
+ values[1] = Int64GetDatum(rt_num_entries(rt));
+ values[2] = Int64GetDatum(rt_memory_usage(rt));
+ values[3] = Int64GetDatum(rt_load_ms);
+ values[4] = Int64GetDatum(rt_search_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_node128_load(PG_FUNCTION_ARGS)
+{
+ int fanout = PG_GETARG_INT32(0);
+ radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_sparseload_ms;
+ Datum values[5];
+ bool nulls[5];
+
+ uint64 r,
+ h;
+ int key_id;
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ rt = rt_create(CurrentMemoryContext);
+
+ key_id = 0;
+
+ for (r = 1; r <= fanout; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (h);
+
+ key_id++;
+ rt_set(rt, key, key_id);
+ }
+ }
+
+ rt_stats(rt);
+
+ /* measure sparse deletion and re-loading */
+ start_time = GetCurrentTimestamp();
+
+ for (int t = 0; t<10000; t++)
+ {
+ /* delete one key in each leaf */
+ for (r = 1; r <= fanout; r++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (fanout);
+
+ rt_delete(rt, key);
+ }
+
+ /* add them all back */
+ key_id = 0;
+ for (r = 1; r <= fanout; r++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (fanout);
+
+ key_id++;
+ rt_set(rt, key, key_id);
+ }
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_sparseload_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int32GetDatum(fanout);
+ values[1] = Int64GetDatum(rt_num_entries(rt));
+ values[2] = Int64GetDatum(rt_memory_usage(rt));
+ values[3] = Int64GetDatum(rt_sparseload_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
diff --git a/contrib/bench_radix_tree/bench_radix_tree.control b/contrib/bench_radix_tree/bench_radix_tree.control
new file mode 100644
index 0000000000..1d988e6c9a
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.control
@@ -0,0 +1,6 @@
+# bench_radix_tree extension
+comment = 'benchmark suite for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/bench_radix_tree'
+relocatable = true
+trusted = true
diff --git a/contrib/bench_radix_tree/expected/bench.out b/contrib/bench_radix_tree/expected/bench.out
new file mode 100644
index 0000000000..60c303892e
--- /dev/null
+++ b/contrib/bench_radix_tree/expected/bench.out
@@ -0,0 +1,13 @@
+create extension bench_radix_tree;
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
diff --git a/contrib/bench_radix_tree/sql/bench.sql b/contrib/bench_radix_tree/sql/bench.sql
new file mode 100644
index 0000000000..a46018c9d4
--- /dev/null
+++ b/contrib/bench_radix_tree/sql/bench.sql
@@ -0,0 +1,16 @@
+create extension bench_radix_tree;
+
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
--
2.31.1
Attachment: v14-0008-PoC-calculate-memory-usage-in-radix-tree.patch (application/octet-stream)
From 8ec7c3f15da739c1a8d78c1eec1e1f45cbe8ba21 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 19 Dec 2022 14:41:43 +0900
Subject: [PATCH v14 8/9] PoC: calculate memory usage in radix tree.
---
src/backend/lib/radixtree.c | 137 +++++++++++++++++++++++------------
src/backend/utils/mmgr/dsa.c | 42 +++++++++++
src/include/utils/dsa.h | 1 +
3 files changed, 135 insertions(+), 45 deletions(-)
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
index 455071cbab..4ad55a0b7c 100644
--- a/src/backend/lib/radixtree.c
+++ b/src/backend/lib/radixtree.c
@@ -360,14 +360,24 @@ typedef struct rt_size_class_elem
const char *name;
int fanout;
- /* slab chunk size */
+ /* node size */
Size inner_size;
Size leaf_size;
/* slab block size */
- Size inner_blocksize;
- Size leaf_blocksize;
+ Size slab_inner_blocksize;
+ Size slab_leaf_blocksize;
+
+ /*
+ * We can get how much memory is actually allocated for a radix tree node
+ * with GetMemoryChunkSpace() in the local case. However, DSA has no such
+ * facility, so for the shared case we precompute the sizes that DSA
+ * allocates for each node class and use them for memory accounting.
+ */
+ Size dsa_inner_size;
+ Size dsa_leaf_size;
} rt_size_class_elem;
+static bool rt_size_class_dsa_info_initialized = false;
/*
* Calculate the slab blocksize so that we can allocate at least 32 chunks
@@ -381,40 +391,40 @@ static rt_size_class_elem rt_size_class_info[RT_SIZE_CLASS_COUNT] = {
.fanout = 4,
.inner_size = sizeof(rt_node_inner_4) + 4 * sizeof(rt_node *),
.leaf_size = sizeof(rt_node_leaf_4) + 4 * sizeof(uint64),
- .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_4) + 4 * sizeof(rt_node *)),
- .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_4) + 4 * sizeof(uint64)),
+ .slab_inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_4) + 4 * sizeof(rt_node *)),
+ .slab_leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_4) + 4 * sizeof(uint64)),
},
[RT_CLASS_32_PARTIAL] = {
.name = "radix tree node 15",
.fanout = 15,
.inner_size = sizeof(rt_node_inner_32) + 15 * sizeof(rt_node *),
.leaf_size = sizeof(rt_node_leaf_32) + 15 * sizeof(uint64),
- .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_32) + 15 * sizeof(rt_node *)),
- .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_32) + 15 * sizeof(uint64)),
+ .slab_inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_32) + 15 * sizeof(rt_node *)),
+ .slab_leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_32) + 15 * sizeof(uint64)),
},
[RT_CLASS_32_FULL] = {
.name = "radix tree node 32",
.fanout = 32,
.inner_size = sizeof(rt_node_inner_32) + 32 * sizeof(rt_node *),
.leaf_size = sizeof(rt_node_leaf_32) + 32 * sizeof(uint64),
- .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_32) + 32 * sizeof(rt_node *)),
- .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_32) + 32 * sizeof(uint64)),
+ .slab_inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_32) + 32 * sizeof(rt_node *)),
+ .slab_leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_32) + 32 * sizeof(uint64)),
},
[RT_CLASS_125_FULL] = {
.name = "radix tree node 125",
.fanout = 125,
.inner_size = sizeof(rt_node_inner_125) + 125 * sizeof(rt_node *),
.leaf_size = sizeof(rt_node_leaf_125) + 125 * sizeof(uint64),
- .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_125) + 125 * sizeof(rt_node *)),
- .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_125) + 125 * sizeof(uint64)),
+ .slab_inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_125) + 125 * sizeof(rt_node *)),
+ .slab_leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_125) + 125 * sizeof(uint64)),
},
[RT_CLASS_256] = {
.name = "radix tree node 256",
.fanout = 256,
.inner_size = sizeof(rt_node_inner_256),
.leaf_size = sizeof(rt_node_leaf_256),
- .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_256)),
- .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_256)),
+ .slab_inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_256)),
+ .slab_leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_256)),
},
};
@@ -477,6 +487,12 @@ typedef struct radix_tree_control
uint64 max_val;
uint64 num_keys;
+ /*
+ * Track the amount of memory used. The callers can ask for it
+ * with rt_memory_usage().
+ */
+ uint64 mem_used;
+
/* statistics */
#ifdef RT_DEBUG
int32 cnt[RT_SIZE_CLASS_COUNT];
@@ -1005,15 +1021,22 @@ static rt_node_ptr
rt_alloc_node(radix_tree *tree, rt_size_class size_class, bool inner)
{
rt_node_ptr newnode;
+ Size size;
if (RadixTreeIsShared(tree))
{
dsa_pointer dp;
if (inner)
+ {
dp = dsa_allocate(tree->area, rt_size_class_info[size_class].inner_size);
+ size = rt_size_class_info[size_class].dsa_inner_size;
+ }
else
+ {
dp = dsa_allocate(tree->area, rt_size_class_info[size_class].leaf_size);
+ size = rt_size_class_info[size_class].dsa_leaf_size;
+ }
newnode.encoded = (rt_pointer) dp;
newnode.decoded = rt_pointer_decode(tree, newnode.encoded);
@@ -1028,8 +1051,12 @@ rt_alloc_node(radix_tree *tree, rt_size_class size_class, bool inner)
rt_size_class_info[size_class].leaf_size);
newnode.encoded = rt_pointer_encode(newnode.decoded);
+ size = GetMemoryChunkSpace(newnode.decoded);
}
+ /* update memory usage */
+ tree->ctl->mem_used += size;
+
#ifdef RT_DEBUG
/* update the statistics */
tree->ctl->cnt[size_class]++;
@@ -1095,6 +1122,15 @@ rt_grow_node_kind(radix_tree *tree, rt_node_ptr node, uint8 new_kind)
static void
rt_free_node(radix_tree *tree, rt_node_ptr node)
{
+ int size;
+ static const int fanout_node_class[RT_NODE_MAX_SLOTS] =
+ {
+ [4] = RT_CLASS_4_FULL,
+ [15] = RT_CLASS_32_PARTIAL,
+ [32] = RT_CLASS_32_FULL,
+ [125] = RT_CLASS_125_FULL,
+ };
+
/* If we're deleting the root node, make the tree empty */
if (tree->ctl->root == node.encoded)
{
@@ -1104,28 +1140,38 @@ rt_free_node(radix_tree *tree, rt_node_ptr node)
#ifdef RT_DEBUG
{
- int i;
+ int size_class = (NODE_FANOUT(node) == 0)
+ ? RT_CLASS_256
+ : fanout_node_class[NODE_FANOUT(node)];
/* update the statistics */
- for (i = 0; i < RT_SIZE_CLASS_COUNT; i++)
- {
- if (NODE_FANOUT(node) == rt_size_class_info[i].fanout)
- break;
- }
-
- /* fanout of node256 is intentionally 0 */
- if (i == RT_SIZE_CLASS_COUNT)
- i = RT_CLASS_256;
-
- tree->ctl->cnt[i]--;
- Assert(tree->ctl->cnt[i] >= 0);
+ tree->ctl->cnt[size_class]--;
+ Assert(tree->ctl->cnt[size_class] >= 0);
}
#endif
if (RadixTreeIsShared(tree))
+ {
+ int size_class = (NODE_FANOUT(node) == 0)
+ ? RT_CLASS_256
+ : fanout_node_class[NODE_FANOUT(node)];
+
+ if (!NODE_IS_LEAF(node))
+ size = rt_size_class_info[size_class].dsa_inner_size;
+ else
+ size = rt_size_class_info[size_class].dsa_leaf_size;
+
dsa_free(tree->area, (dsa_pointer) node.encoded);
+ }
else
+ {
+ size = GetMemoryChunkSpace(node.decoded);
pfree(node.decoded);
+ }
+
+ /* update memory usage */
+ tree->ctl->mem_used -= size;
+ Assert(tree->ctl->mem_used > 0);
}
/*
@@ -1837,15 +1883,18 @@ rt_create(MemoryContext ctx, dsa_area *area)
dp = dsa_allocate0(area, sizeof(radix_tree_control));
tree->ctl = (radix_tree_control *) dsa_get_address(area, dp);
tree->ctl->handle = (rt_handle) dp;
+ tree->ctl->mem_used += dsa_get_size_class(sizeof(radix_tree_control));
}
else
{
tree->ctl = (radix_tree_control *) palloc0(sizeof(radix_tree_control));
tree->ctl->handle = InvalidDsaPointer;
+ tree->ctl->mem_used += GetMemoryChunkSpace(tree->ctl);
}
tree->ctl->magic = RADIXTREE_MAGIC;
tree->ctl->root = InvalidRTPointer;
+ tree->ctl->mem_used += GetMemoryChunkSpace(tree);
/* Create the slab allocator for each size class */
if (area == NULL)
@@ -1854,17 +1903,29 @@ rt_create(MemoryContext ctx, dsa_area *area)
{
tree->inner_slabs[i] = SlabContextCreate(ctx,
rt_size_class_info[i].name,
- rt_size_class_info[i].inner_blocksize,
+ rt_size_class_info[i].slab_inner_blocksize,
rt_size_class_info[i].inner_size);
tree->leaf_slabs[i] = SlabContextCreate(ctx,
rt_size_class_info[i].name,
- rt_size_class_info[i].leaf_blocksize,
+ rt_size_class_info[i].slab_leaf_blocksize,
rt_size_class_info[i].leaf_size);
#ifdef RT_DEBUG
tree->ctl->cnt[i] = 0;
#endif
}
}
+ else if (!rt_size_class_dsa_info_initialized)
+ {
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ rt_size_class_info[i].dsa_inner_size =
+ dsa_get_size_class(rt_size_class_info[i].inner_size);
+ rt_size_class_info[i].dsa_leaf_size =
+ dsa_get_size_class(rt_size_class_info[i].leaf_size);
+ }
+
+ rt_size_class_dsa_info_initialized = true;
+ }
MemoryContextSwitchTo(old_ctx);
@@ -2534,22 +2595,8 @@ rt_num_entries(radix_tree *tree)
uint64
rt_memory_usage(radix_tree *tree)
{
- Size total = sizeof(radix_tree) + sizeof(radix_tree_control);
-
Assert(!RadixTreeIsShared(tree) || tree->ctl->magic == RADIXTREE_MAGIC);
-
- if (RadixTreeIsShared(tree))
- total = dsa_get_total_size(tree->area);
- else
- {
- for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
- {
- total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
- total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
- }
- }
-
- return total;
+ return tree->ctl->mem_used;
}
/*
@@ -2873,9 +2920,9 @@ rt_dump(radix_tree *tree)
fprintf(stderr, "%s\tinner_size %zu\tinner_blocksize %zu\tleaf_size %zu\tleaf_blocksize %zu\n",
rt_size_class_info[i].name,
rt_size_class_info[i].inner_size,
- rt_size_class_info[i].inner_blocksize,
+ rt_size_class_info[i].slab_inner_blocksize,
rt_size_class_info[i].leaf_size,
- rt_size_class_info[i].leaf_blocksize);
+ rt_size_class_info[i].slab_leaf_blocksize);
fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->ctl->max_val);
if (!RTPointerIsValid(tree->ctl->root))
diff --git a/src/backend/utils/mmgr/dsa.c b/src/backend/utils/mmgr/dsa.c
index ad169882af..e77aea10e2 100644
--- a/src/backend/utils/mmgr/dsa.c
+++ b/src/backend/utils/mmgr/dsa.c
@@ -1208,6 +1208,48 @@ dsa_minimum_size(void)
return pages * FPM_PAGE_SIZE;
}
+size_t
+dsa_get_size_class(size_t size)
+{
+ uint16 size_class;
+
+ if (size > dsa_size_classes[lengthof(dsa_size_classes) - 1])
+ return size;
+ else if (size < lengthof(dsa_size_class_map) * DSA_SIZE_CLASS_MAP_QUANTUM)
+ {
+ int mapidx;
+
+ /* For smaller sizes we have a lookup table... */
+ mapidx = ((size + DSA_SIZE_CLASS_MAP_QUANTUM - 1) /
+ DSA_SIZE_CLASS_MAP_QUANTUM) - 1;
+ size_class = dsa_size_class_map[mapidx];
+ }
+ else
+ {
+ uint16 min;
+ uint16 max;
+
+ /* ... and for the rest we search by binary chop. */
+ min = dsa_size_class_map[lengthof(dsa_size_class_map) - 1];
+ max = lengthof(dsa_size_classes) - 1;
+
+ while (min < max)
+ {
+ uint16 mid = (min + max) / 2;
+ uint16 class_size = dsa_size_classes[mid];
+
+ if (class_size < size)
+ min = mid + 1;
+ else
+ max = mid;
+ }
+
+ size_class = min;
+ }
+
+ return dsa_size_classes[size_class];
+}
+
/*
* Workhorse function for dsa_create and dsa_create_in_place.
*/
diff --git a/src/include/utils/dsa.h b/src/include/utils/dsa.h
index dad06adecc..a17c4eb88c 100644
--- a/src/include/utils/dsa.h
+++ b/src/include/utils/dsa.h
@@ -118,6 +118,7 @@ extern dsa_pointer dsa_allocate_extended(dsa_area *area, size_t size, int flags)
extern void dsa_free(dsa_area *area, dsa_pointer dp);
extern void *dsa_get_address(dsa_area *area, dsa_pointer dp);
extern size_t dsa_get_total_size(dsa_area *area);
+extern size_t dsa_get_size_class(size_t size);
extern void dsa_trim(dsa_area *area);
extern void dsa_dump(dsa_area *area);
--
2.31.1
Attachment: v14-0006-Use-rt_node_ptr-to-reference-radix-tree-nodes.patch (application/octet-stream)
From 7e5fd8a19adb0305f77618231364eacaa2e0a59a Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 14 Nov 2022 11:44:17 +0900
Subject: [PATCH v14 6/9] Use rt_node_ptr to reference radix tree nodes.
---
src/backend/lib/radixtree.c | 688 +++++++++++++++++++++---------------
1 file changed, 398 insertions(+), 290 deletions(-)
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
index abd0450727..bff37a2c35 100644
--- a/src/backend/lib/radixtree.c
+++ b/src/backend/lib/radixtree.c
@@ -150,6 +150,19 @@ typedef enum rt_size_class
#define RT_SIZE_CLASS_COUNT (RT_CLASS_256 + 1)
} rt_size_class;
+/*
+ * rt_pointer is a pointer compatible with a pointer to local memory and a
+ * pointer for DSA area (i.e. dsa_pointer). Since the radix tree node can be
+ * allocated in backend local memory as well as DSA area, we cannot use a
+ * C-pointer to rt_node (i.e. backend local memory address) for child pointers
+ * in inner nodes. Inner nodes need to use rt_pointer instead. We can get
+ * the backend local memory address of a node from a rt_pointer by using
+ * rt_pointer_decode().
+*/
+typedef uintptr_t rt_pointer;
+#define InvalidRTPointer ((rt_pointer) 0)
+#define RTPointerIsValid(x) (((rt_pointer) (x)) != InvalidRTPointer)
+
/* Common type for all nodes types */
typedef struct rt_node
{
@@ -175,8 +188,7 @@ typedef struct rt_node
/* Node kind, one per search/set algorithm */
uint8 kind;
} rt_node;
-#define NODE_IS_LEAF(n) (((rt_node *) (n))->shift == 0)
-#define NODE_IS_EMPTY(n) (((rt_node *) (n))->count == 0)
+#define RT_NODE_IS_LEAF(n) (((rt_node *) (n))->shift == 0)
#define VAR_NODE_HAS_FREE_SLOT(node) \
((node)->base.n.count < (node)->base.n.fanout)
#define FIXED_NODE_HAS_FREE_SLOT(node, class) \
@@ -240,7 +252,7 @@ typedef struct rt_node_inner_4
rt_node_base_4 base;
/* number of children depends on size class */
- rt_node *children[FLEXIBLE_ARRAY_MEMBER];
+ rt_pointer children[FLEXIBLE_ARRAY_MEMBER];
} rt_node_inner_4;
typedef struct rt_node_leaf_4
@@ -256,7 +268,7 @@ typedef struct rt_node_inner_32
rt_node_base_32 base;
/* number of children depends on size class */
- rt_node *children[FLEXIBLE_ARRAY_MEMBER];
+ rt_pointer children[FLEXIBLE_ARRAY_MEMBER];
} rt_node_inner_32;
typedef struct rt_node_leaf_32
@@ -272,7 +284,7 @@ typedef struct rt_node_inner_125
rt_node_base_125 base;
/* number of children depends on size class */
- rt_node *children[FLEXIBLE_ARRAY_MEMBER];
+ rt_pointer children[FLEXIBLE_ARRAY_MEMBER];
} rt_node_inner_125;
typedef struct rt_node_leaf_125
@@ -292,7 +304,7 @@ typedef struct rt_node_inner_256
rt_node_base_256 base;
/* Slots for 256 children */
- rt_node *children[RT_NODE_MAX_SLOTS];
+ rt_pointer children[RT_NODE_MAX_SLOTS];
} rt_node_inner_256;
typedef struct rt_node_leaf_256
@@ -306,6 +318,29 @@ typedef struct rt_node_leaf_256
uint64 values[RT_NODE_MAX_SLOTS];
} rt_node_leaf_256;
+/* rt_node_ptr is a data structure representing a pointer for a rt_node */
+typedef struct rt_node_ptr
+{
+ rt_pointer encoded;
+ rt_node *decoded;
+} rt_node_ptr;
+#define InvalidRTNodePtr \
+ (rt_node_ptr) {.encoded = InvalidRTPointer, .decoded = NULL}
+#define RTNodePtrIsValid(n) \
+ (!rt_node_ptr_eq((rt_node_ptr *) &(n), &(InvalidRTNodePtr)))
+
+/* Macros for rt_node_ptr to access the fields of rt_node */
+#define NODE_RAW(n) (n.decoded)
+#define NODE_IS_LEAF(n) (NODE_RAW(n)->shift == 0)
+#define NODE_IS_EMPTY(n) (NODE_COUNT(n) == 0)
+#define NODE_KIND(n) (NODE_RAW(n)->kind)
+#define NODE_COUNT(n) (NODE_RAW(n)->count)
+#define NODE_SHIFT(n) (NODE_RAW(n)->shift)
+#define NODE_CHUNK(n) (NODE_RAW(n)->chunk)
+#define NODE_FANOUT(n) (NODE_RAW(n)->fanout)
+#define NODE_HAS_FREE_SLOT(n) \
+ (NODE_COUNT(n) < rt_node_kind_info[NODE_KIND(n)].fanout)
+
/* Information for each size class */
typedef struct rt_size_class_elem
{
@@ -394,7 +429,7 @@ static rt_size_class kind_min_size_class[RT_NODE_KIND_COUNT] = {
*/
typedef struct rt_node_iter
{
- rt_node *node; /* current node being iterated */
+ rt_node_ptr node; /* current node being iterated */
int current_idx; /* current position. -1 for initial value */
} rt_node_iter;
@@ -415,7 +450,7 @@ struct radix_tree
{
MemoryContext context;
- rt_node *root;
+ rt_pointer root;
uint64 max_val;
uint64 num_keys;
@@ -429,27 +464,58 @@ struct radix_tree
};
static void rt_new_root(radix_tree *tree, uint64 key);
-static rt_node *rt_alloc_node(radix_tree *tree, rt_size_class size_class, bool inner);
-static inline void rt_init_node(rt_node *node, uint8 kind, rt_size_class size_class,
+
+static rt_node_ptr rt_alloc_node(radix_tree *tree, rt_size_class size_class, bool inner);
+static inline void rt_init_node(rt_node_ptr node, uint8 kind, rt_size_class size_class,
bool inner);
-static void rt_free_node(radix_tree *tree, rt_node *node);
+static void rt_free_node(radix_tree *tree, rt_node_ptr node);
static void rt_extend(radix_tree *tree, uint64 key);
-static inline bool rt_node_search_inner(rt_node *node, uint64 key, rt_action action,
- rt_node **child_p);
-static inline bool rt_node_search_leaf(rt_node *node, uint64 key, rt_action action,
+static inline bool rt_node_search_inner(rt_node_ptr node_ptr, uint64 key, rt_action action,
+ rt_pointer *child_p);
+static inline bool rt_node_search_leaf(rt_node_ptr node_ptr, uint64 key, rt_action action,
uint64 *value_p);
-static bool rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node,
- uint64 key, rt_node *child);
-static bool rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
+static bool rt_node_insert_inner(radix_tree *tree, rt_node_ptr parent, rt_node_ptr node,
+ uint64 key, rt_node_ptr child);
+static bool rt_node_insert_leaf(radix_tree *tree, rt_node_ptr parent, rt_node_ptr node,
uint64 key, uint64 value);
-static inline rt_node *rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter);
+static inline bool rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
+ rt_node_ptr *child_p);
static inline bool rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
uint64 *value_p);
-static void rt_update_iter_stack(rt_iter *iter, rt_node *from_node, int from);
+static void rt_update_iter_stack(rt_iter *iter, rt_node_ptr from_node, int from);
static inline void rt_iter_update_key(rt_iter *iter, uint8 chunk, uint8 shift);
/* verification (available only with assertion) */
-static void rt_verify_node(rt_node *node);
+static void rt_verify_node(rt_node_ptr node);
+
+/* Decode and encode functions of rt_pointer */
+static inline rt_node *
+rt_pointer_decode(rt_pointer encoded)
+{
+ return (rt_node *) encoded;
+}
+
+static inline rt_pointer
+rt_pointer_encode(rt_node *decoded)
+{
+ return (rt_pointer) decoded;
+}
+
+/* Return a rt_node_ptr created from the given encoded pointer */
+static inline rt_node_ptr
+rt_node_ptr_encoded(rt_pointer encoded)
+{
+ return (rt_node_ptr) {
+ .encoded = encoded,
+ .decoded = rt_pointer_decode(encoded),
+ };
+}
+
+static inline bool
+rt_node_ptr_eq(rt_node_ptr *a, rt_node_ptr *b)
+{
+ return (a->decoded == b->decoded) && (a->encoded == b->encoded);
+}
/*
* Return index of the first element in 'base' that equals 'key'. Return -1
@@ -598,10 +664,10 @@ node_32_get_insertpos(rt_node_base_32 *node, uint8 chunk)
/* Shift the elements right at 'idx' by one */
static inline void
-chunk_children_array_shift(uint8 *chunks, rt_node **children, int count, int idx)
+chunk_children_array_shift(uint8 *chunks, rt_pointer *children, int count, int idx)
{
memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
- memmove(&(children[idx + 1]), &(children[idx]), sizeof(rt_node *) * (count - idx));
+ memmove(&(children[idx + 1]), &(children[idx]), sizeof(rt_pointer) * (count - idx));
}
static inline void
@@ -613,10 +679,10 @@ chunk_values_array_shift(uint8 *chunks, uint64 *values, int count, int idx)
/* Delete the element at 'idx' */
static inline void
-chunk_children_array_delete(uint8 *chunks, rt_node **children, int count, int idx)
+chunk_children_array_delete(uint8 *chunks, rt_pointer *children, int count, int idx)
{
memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
- memmove(&(children[idx]), &(children[idx + 1]), sizeof(rt_node *) * (count - idx - 1));
+ memmove(&(children[idx]), &(children[idx + 1]), sizeof(rt_pointer) * (count - idx - 1));
}
static inline void
@@ -628,12 +694,12 @@ chunk_values_array_delete(uint8 *chunks, uint64 *values, int count, int idx)
/* Copy both chunks and children/values arrays */
static inline void
-chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
- uint8 *dst_chunks, rt_node **dst_children)
+chunk_children_array_copy(uint8 *src_chunks, rt_pointer *src_children,
+ uint8 *dst_chunks, rt_pointer *dst_children)
{
const int fanout = rt_size_class_info[RT_CLASS_4_FULL].fanout;
const Size chunk_size = sizeof(uint8) * fanout;
- const Size children_size = sizeof(rt_node *) * fanout;
+ const Size children_size = sizeof(rt_pointer) * fanout;
memcpy(dst_chunks, src_chunks, chunk_size);
memcpy(dst_children, src_children, children_size);
@@ -665,7 +731,7 @@ node_125_is_chunk_used(rt_node_base_125 *node, uint8 chunk)
static inline bool
node_inner_125_is_slot_used(rt_node_inner_125 *node, uint8 slot)
{
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
Assert(slot < node->base.n.fanout);
return (node->base.isset[WORDNUM(slot)] & ((bitmapword) 1 << BITNUM(slot))) != 0;
}
@@ -673,23 +739,23 @@ node_inner_125_is_slot_used(rt_node_inner_125 *node, uint8 slot)
static inline bool
node_leaf_125_is_slot_used(rt_node_leaf_125 *node, uint8 slot)
{
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
Assert(slot < node->base.n.fanout);
return (node->base.isset[WORDNUM(slot)] & ((bitmapword) 1 << BITNUM(slot))) != 0;
}
#endif
-static inline rt_node *
+static inline rt_pointer
node_inner_125_get_child(rt_node_inner_125 *node, uint8 chunk)
{
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
return node->children[node->base.slot_idxs[chunk]];
}
static inline uint64
node_leaf_125_get_value(rt_node_leaf_125 *node, uint8 chunk)
{
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
Assert(((rt_node_base_125 *) node)->slot_idxs[chunk] != RT_NODE_125_INVALID_IDX);
return node->values[node->base.slot_idxs[chunk]];
}
@@ -699,9 +765,9 @@ node_inner_125_delete(rt_node_inner_125 *node, uint8 chunk)
{
int slotpos = node->base.slot_idxs[chunk];
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
node->base.isset[WORDNUM(slotpos)] &= ~((bitmapword) 1 << BITNUM(slotpos));
- node->children[node->base.slot_idxs[chunk]] = NULL;
+ node->children[node->base.slot_idxs[chunk]] = InvalidRTPointer;
node->base.slot_idxs[chunk] = RT_NODE_125_INVALID_IDX;
}
@@ -710,7 +776,7 @@ node_leaf_125_delete(rt_node_leaf_125 *node, uint8 chunk)
{
int slotpos = node->base.slot_idxs[chunk];
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
node->base.isset[WORDNUM(slotpos)] &= ~((bitmapword) 1 << BITNUM(slotpos));
node->base.slot_idxs[chunk] = RT_NODE_125_INVALID_IDX;
}
@@ -742,11 +808,11 @@ node_125_find_unused_slot(bitmapword *isset)
}
static inline void
-node_inner_125_insert(rt_node_inner_125 *node, uint8 chunk, rt_node *child)
+node_inner_125_insert(rt_node_inner_125 *node, uint8 chunk, rt_pointer child)
{
int slotpos;
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
slotpos = node_125_find_unused_slot(node->base.isset);
Assert(slotpos < node->base.n.fanout);
@@ -761,7 +827,7 @@ node_leaf_125_insert(rt_node_leaf_125 *node, uint8 chunk, uint64 value)
{
int slotpos;
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
slotpos = node_125_find_unused_slot(node->base.isset);
Assert(slotpos < node->base.n.fanout);
@@ -772,16 +838,16 @@ node_leaf_125_insert(rt_node_leaf_125 *node, uint8 chunk, uint64 value)
/* Update the child corresponding to 'chunk' to 'child' */
static inline void
-node_inner_125_update(rt_node_inner_125 *node, uint8 chunk, rt_node *child)
+node_inner_125_update(rt_node_inner_125 *node, uint8 chunk, rt_pointer child)
{
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
node->children[node->base.slot_idxs[chunk]] = child;
}
static inline void
node_leaf_125_update(rt_node_leaf_125 *node, uint8 chunk, uint64 value)
{
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
node->values[node->base.slot_idxs[chunk]] = value;
}
@@ -791,21 +857,21 @@ node_leaf_125_update(rt_node_leaf_125 *node, uint8 chunk, uint64 value)
static inline bool
node_inner_256_is_chunk_used(rt_node_inner_256 *node, uint8 chunk)
{
- Assert(!NODE_IS_LEAF(node));
- return (node->children[chunk] != NULL);
+ Assert(!RT_NODE_IS_LEAF(node));
+ return RTPointerIsValid(node->children[chunk]);
}
static inline bool
node_leaf_256_is_chunk_used(rt_node_leaf_256 *node, uint8 chunk)
{
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
return (node->isset[RT_NODE_BITMAP_BYTE(chunk)] & RT_NODE_BITMAP_BIT(chunk)) != 0;
}
-static inline rt_node *
+static inline rt_pointer
node_inner_256_get_child(rt_node_inner_256 *node, uint8 chunk)
{
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
Assert(node_inner_256_is_chunk_used(node, chunk));
return node->children[chunk];
}
@@ -813,16 +879,16 @@ node_inner_256_get_child(rt_node_inner_256 *node, uint8 chunk)
static inline uint64
node_leaf_256_get_value(rt_node_leaf_256 *node, uint8 chunk)
{
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
Assert(node_leaf_256_is_chunk_used(node, chunk));
return node->values[chunk];
}
/* Set the child in the node-256 */
static inline void
-node_inner_256_set(rt_node_inner_256 *node, uint8 chunk, rt_node *child)
+node_inner_256_set(rt_node_inner_256 *node, uint8 chunk, rt_pointer child)
{
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
node->children[chunk] = child;
}
@@ -830,7 +896,7 @@ node_inner_256_set(rt_node_inner_256 *node, uint8 chunk, rt_node *child)
static inline void
node_leaf_256_set(rt_node_leaf_256 *node, uint8 chunk, uint64 value)
{
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
node->isset[RT_NODE_BITMAP_BYTE(chunk)] |= RT_NODE_BITMAP_BIT(chunk);
node->values[chunk] = value;
}
@@ -839,14 +905,14 @@ node_leaf_256_set(rt_node_leaf_256 *node, uint8 chunk, uint64 value)
static inline void
node_inner_256_delete(rt_node_inner_256 *node, uint8 chunk)
{
- Assert(!NODE_IS_LEAF(node));
- node->children[chunk] = NULL;
+ Assert(!RT_NODE_IS_LEAF(node));
+ node->children[chunk] = InvalidRTPointer;
}
static inline void
node_leaf_256_delete(rt_node_leaf_256 *node, uint8 chunk)
{
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
node->isset[RT_NODE_BITMAP_BYTE(chunk)] &= ~(RT_NODE_BITMAP_BIT(chunk));
}
@@ -882,29 +948,32 @@ rt_new_root(radix_tree *tree, uint64 key)
{
int shift = key_get_shift(key);
bool inner = shift > 0;
- rt_node *newnode;
+ rt_node_ptr newnode;
newnode = rt_alloc_node(tree, RT_CLASS_4_FULL, inner);
rt_init_node(newnode, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
- newnode->shift = shift;
+ NODE_SHIFT(newnode) = shift;
+
tree->max_val = shift_get_max_val(shift);
- tree->root = newnode;
+ tree->root = newnode.encoded;
}
/*
* Allocate a new node with the given node kind.
*/
-static rt_node *
+static rt_node_ptr
rt_alloc_node(radix_tree *tree, rt_size_class size_class, bool inner)
{
- rt_node *newnode;
+ rt_node_ptr newnode;
if (inner)
- newnode = (rt_node *) MemoryContextAlloc(tree->inner_slabs[size_class],
- rt_size_class_info[size_class].inner_size);
+ newnode.decoded = (rt_node *) MemoryContextAlloc(tree->inner_slabs[size_class],
+ rt_size_class_info[size_class].inner_size);
else
- newnode = (rt_node *) MemoryContextAlloc(tree->leaf_slabs[size_class],
- rt_size_class_info[size_class].leaf_size);
+ newnode.decoded = (rt_node *) MemoryContextAlloc(tree->leaf_slabs[size_class],
+ rt_size_class_info[size_class].leaf_size);
+
+ newnode.encoded = rt_pointer_encode(newnode.decoded);
#ifdef RT_DEBUG
/* update the statistics */
@@ -916,20 +985,20 @@ rt_alloc_node(radix_tree *tree, rt_size_class size_class, bool inner)
/* Initialize the node contents */
static inline void
-rt_init_node(rt_node *node, uint8 kind, rt_size_class size_class, bool inner)
+rt_init_node(rt_node_ptr node, uint8 kind, rt_size_class size_class, bool inner)
{
if (inner)
- MemSet(node, 0, rt_size_class_info[size_class].inner_size);
+ MemSet(node.decoded, 0, rt_size_class_info[size_class].inner_size);
else
- MemSet(node, 0, rt_size_class_info[size_class].leaf_size);
+ MemSet(node.decoded, 0, rt_size_class_info[size_class].leaf_size);
- node->kind = kind;
- node->fanout = rt_size_class_info[size_class].fanout;
+ NODE_KIND(node) = kind;
+ NODE_FANOUT(node) = rt_size_class_info[size_class].fanout;
/* Initialize slot_idxs to invalid values */
if (kind == RT_NODE_KIND_125)
{
- rt_node_base_125 *n125 = (rt_node_base_125 *) node;
+ rt_node_base_125 *n125 = (rt_node_base_125 *) node.decoded;
memset(n125->slot_idxs, RT_NODE_125_INVALID_IDX, sizeof(n125->slot_idxs));
}
@@ -939,25 +1008,25 @@ rt_init_node(rt_node *node, uint8 kind, rt_size_class size_class, bool inner)
* and this is the max size class to it will never grow.
*/
if (kind == RT_NODE_KIND_256)
- node->fanout = 0;
+ NODE_FANOUT(node) = 0;
}
static inline void
-rt_copy_node(rt_node *newnode, rt_node *oldnode)
+rt_copy_node(rt_node_ptr newnode, rt_node_ptr oldnode)
{
- newnode->shift = oldnode->shift;
- newnode->chunk = oldnode->chunk;
- newnode->count = oldnode->count;
+ NODE_SHIFT(newnode) = NODE_SHIFT(oldnode);
+ NODE_CHUNK(newnode) = NODE_CHUNK(oldnode);
+ NODE_COUNT(newnode) = NODE_COUNT(oldnode);
}
/*
* Create a new node with 'new_kind' and the same shift, chunk, and
* count of 'node'.
*/
-static rt_node*
-rt_grow_node_kind(radix_tree *tree, rt_node *node, uint8 new_kind)
+static rt_node_ptr
+rt_grow_node_kind(radix_tree *tree, rt_node_ptr node, uint8 new_kind)
{
- rt_node *newnode;
+ rt_node_ptr newnode;
bool inner = !NODE_IS_LEAF(node);
newnode = rt_alloc_node(tree, kind_min_size_class[new_kind], inner);
@@ -969,12 +1038,12 @@ rt_grow_node_kind(radix_tree *tree, rt_node *node, uint8 new_kind)
/* Free the given node */
static void
-rt_free_node(radix_tree *tree, rt_node *node)
+rt_free_node(radix_tree *tree, rt_node_ptr node)
{
/* If we're deleting the root node, make the tree empty */
- if (tree->root == node)
+ if (tree->root == node.encoded)
{
- tree->root = NULL;
+ tree->root = InvalidRTPointer;
tree->max_val = 0;
}
@@ -985,7 +1054,7 @@ rt_free_node(radix_tree *tree, rt_node *node)
/* update the statistics */
for (i = 0; i < RT_SIZE_CLASS_COUNT; i++)
{
- if (node->fanout == rt_size_class_info[i].fanout)
+ if (NODE_FANOUT(node) == rt_size_class_info[i].fanout)
break;
}
@@ -998,29 +1067,30 @@ rt_free_node(radix_tree *tree, rt_node *node)
}
#endif
- pfree(node);
+ pfree(node.decoded);
}
/*
* Replace old_child with new_child, and free the old one.
*/
static void
-rt_replace_node(radix_tree *tree, rt_node *parent, rt_node *old_child,
- rt_node *new_child, uint64 key)
+rt_replace_node(radix_tree *tree, rt_node_ptr parent, rt_node_ptr old_child,
+ rt_node_ptr new_child, uint64 key)
{
- Assert(old_child->chunk == new_child->chunk);
- Assert(old_child->shift == new_child->shift);
+ Assert(NODE_CHUNK(old_child) == NODE_CHUNK(new_child));
+ Assert(NODE_SHIFT(old_child) == NODE_SHIFT(new_child));
- if (parent == old_child)
+ if (rt_node_ptr_eq(&parent, &old_child))
{
/* Replace the root node with the new large node */
- tree->root = new_child;
+ tree->root = new_child.encoded;
}
else
{
bool replaced PG_USED_FOR_ASSERTS_ONLY;
- replaced = rt_node_insert_inner(tree, NULL, parent, key, new_child);
+ replaced = rt_node_insert_inner(tree, InvalidRTNodePtr, parent, key,
+ new_child);
Assert(replaced);
}
@@ -1035,24 +1105,28 @@ static void
rt_extend(radix_tree *tree, uint64 key)
{
int target_shift;
- int shift = tree->root->shift + RT_NODE_SPAN;
+ rt_node *root = rt_pointer_decode(tree->root);
+ int shift = root->shift + RT_NODE_SPAN;
target_shift = key_get_shift(key);
/* Grow tree from 'shift' to 'target_shift' */
while (shift <= target_shift)
{
- rt_node_inner_4 *node;
+ rt_node_ptr node;
+ rt_node_inner_4 *n4;
+
+ node = rt_alloc_node(tree, RT_CLASS_4_FULL, true);
+ rt_init_node(node, RT_NODE_KIND_4, RT_CLASS_4_FULL, true);
- node = (rt_node_inner_4 *) rt_alloc_node(tree, RT_CLASS_4_FULL, true);
- rt_init_node((rt_node *) node, RT_NODE_KIND_4, RT_CLASS_4_FULL, true);
- node->base.n.shift = shift;
- node->base.n.count = 1;
- node->base.chunks[0] = 0;
- node->children[0] = tree->root;
+ n4 = (rt_node_inner_4 *) node.decoded;
+ n4->base.n.shift = shift;
+ n4->base.n.count = 1;
+ n4->base.chunks[0] = 0;
+ n4->children[0] = tree->root;
- tree->root->chunk = 0;
- tree->root = (rt_node *) node;
+ root->chunk = 0;
+ tree->root = node.encoded;
shift += RT_NODE_SPAN;
}
@@ -1065,21 +1139,22 @@ rt_extend(radix_tree *tree, uint64 key)
* Insert inner and leaf nodes from 'node' to bottom.
*/
static inline void
-rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node *parent,
- rt_node *node)
+rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node_ptr parent,
+ rt_node_ptr node)
{
- int shift = node->shift;
+ int shift = NODE_SHIFT(node);
while (shift >= RT_NODE_SPAN)
{
- rt_node *newchild;
+ rt_node_ptr newchild;
int newshift = shift - RT_NODE_SPAN;
bool inner = newshift > 0;
newchild = rt_alloc_node(tree, RT_CLASS_4_FULL, inner);
rt_init_node(newchild, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
- newchild->shift = newshift;
- newchild->chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ NODE_SHIFT(newchild) = newshift;
+ NODE_CHUNK(newchild) = RT_GET_KEY_CHUNK(key, NODE_SHIFT(node));
+
rt_node_insert_inner(tree, parent, node, key, newchild);
parent = node;
@@ -1099,17 +1174,18 @@ rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node *parent,
* pointer is set to child_p.
*/
static inline bool
-rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **child_p)
+rt_node_search_inner(rt_node_ptr node, uint64 key, rt_action action,
+ rt_pointer *child_p)
{
- uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ uint8 chunk = RT_GET_KEY_CHUNK(key, NODE_SHIFT(node));
bool found = false;
- rt_node *child = NULL;
+ rt_pointer child;
- switch (node->kind)
+ switch (NODE_KIND(node))
{
case RT_NODE_KIND_4:
{
- rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node.decoded;
int idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
if (idx < 0)
@@ -1127,7 +1203,7 @@ rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **chil
}
case RT_NODE_KIND_32:
{
- rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node.decoded;
int idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
if (idx < 0)
@@ -1143,7 +1219,7 @@ rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **chil
}
case RT_NODE_KIND_125:
{
- rt_node_inner_125 *n125 = (rt_node_inner_125 *) node;
+ rt_node_inner_125 *n125 = (rt_node_inner_125 *) node.decoded;
if (!node_125_is_chunk_used((rt_node_base_125 *) n125, chunk))
break;
@@ -1159,7 +1235,7 @@ rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **chil
}
case RT_NODE_KIND_256:
{
- rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node.decoded;
if (!node_inner_256_is_chunk_used(n256, chunk))
break;
@@ -1176,7 +1252,7 @@ rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **chil
/* update statistics */
if (action == RT_ACTION_DELETE && found)
- node->count--;
+ NODE_COUNT(node)--;
if (found && child_p)
*child_p = child;
@@ -1192,17 +1268,17 @@ rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **chil
* to the value is set to value_p.
*/
static inline bool
-rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p)
+rt_node_search_leaf(rt_node_ptr node, uint64 key, rt_action action, uint64 *value_p)
{
- uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ uint8 chunk = RT_GET_KEY_CHUNK(key, NODE_SHIFT(node));
bool found = false;
uint64 value = 0;
- switch (node->kind)
+ switch (NODE_KIND(node))
{
case RT_NODE_KIND_4:
{
- rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node.decoded;
int idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
if (idx < 0)
@@ -1220,7 +1296,7 @@ rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p
}
case RT_NODE_KIND_32:
{
- rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node.decoded;
int idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
if (idx < 0)
@@ -1236,7 +1312,7 @@ rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p
}
case RT_NODE_KIND_125:
{
- rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) node;
+ rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) node.decoded;
if (!node_125_is_chunk_used((rt_node_base_125 *) n125, chunk))
break;
@@ -1252,7 +1328,7 @@ rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p
}
case RT_NODE_KIND_256:
{
- rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node.decoded;
if (!node_leaf_256_is_chunk_used(n256, chunk))
break;
@@ -1269,7 +1345,7 @@ rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p
/* update statistics */
if (action == RT_ACTION_DELETE && found)
- node->count--;
+ NODE_COUNT(node)--;
if (found && value_p)
*value_p = value;
@@ -1279,19 +1355,19 @@ rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p
/* Insert the child to the inner node */
static bool
-rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 key,
- rt_node *child)
+rt_node_insert_inner(radix_tree *tree, rt_node_ptr parent, rt_node_ptr node,
+ uint64 key, rt_node_ptr child)
{
- uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ uint8 chunk = RT_GET_KEY_CHUNK(key, NODE_SHIFT(node));
bool chunk_exists = false;
Assert(!NODE_IS_LEAF(node));
- switch (node->kind)
+ switch (NODE_KIND(node))
{
case RT_NODE_KIND_4:
{
- rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node.decoded;
int idx;
idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
@@ -1299,25 +1375,27 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
{
/* found the existing chunk */
chunk_exists = true;
- n4->children[idx] = child;
+ n4->children[idx] = child.encoded;
break;
}
if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n4)))
{
+ rt_node_ptr new;
rt_node_inner_32 *new32;
- Assert(parent != NULL);
+
+ Assert(RTNodePtrIsValid(parent));
/* grow node from 4 to 32 */
- new32 = (rt_node_inner_32 *) rt_grow_node_kind(tree, (rt_node *) n4,
- RT_NODE_KIND_32);
+ new = rt_grow_node_kind(tree, node, RT_NODE_KIND_32);
+ new32 = (rt_node_inner_32 *) new.decoded;
+
chunk_children_array_copy(n4->base.chunks, n4->children,
new32->base.chunks, new32->children);
- Assert(parent != NULL);
- rt_replace_node(tree, parent, (rt_node *) n4, (rt_node *) new32,
- key);
- node = (rt_node *) new32;
+ Assert(RTNodePtrIsValid(parent));
+ rt_replace_node(tree, parent, node, new, key);
+ node = new;
}
else
{
@@ -1330,14 +1408,14 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
count, insertpos);
n4->base.chunks[insertpos] = chunk;
- n4->children[insertpos] = child;
+ n4->children[insertpos] = child.encoded;
break;
}
}
/* FALLTHROUGH */
case RT_NODE_KIND_32:
{
- rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node.decoded;
int idx;
idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
@@ -1345,45 +1423,52 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
{
/* found the existing chunk */
chunk_exists = true;
- n32->children[idx] = child;
+ n32->children[idx] = child.encoded;
break;
}
if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)))
{
- Assert(parent != NULL);
+ Assert(RTNodePtrIsValid(parent));
if (n32->base.n.count == rt_size_class_info[RT_CLASS_32_PARTIAL].fanout)
{
/* use the same node kind, but expand to the next size class */
const Size size = rt_size_class_info[RT_CLASS_32_PARTIAL].inner_size;
const int fanout = rt_size_class_info[RT_CLASS_32_FULL].fanout;
+ rt_node_ptr new;
rt_node_inner_32 *new32;
- new32 = (rt_node_inner_32 *) rt_alloc_node(tree, RT_CLASS_32_FULL, true);
+ new = rt_alloc_node(tree, RT_CLASS_32_FULL, true);
+ new32 = (rt_node_inner_32 *) new.decoded;
memcpy(new32, n32, size);
new32->base.n.fanout = fanout;
- rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new32, key);
+ rt_replace_node(tree, parent, node, new, key);
- /* must update both pointers here */
- node = (rt_node *) new32;
+ /*
+ * Must update both pointers here since we update n32 and
+ * verify node.
+ */
+ node = new;
n32 = new32;
goto retry_insert_inner_32;
}
else
{
+ rt_node_ptr new;
rt_node_inner_125 *new125;
/* grow node from 32 to 125 */
- new125 = (rt_node_inner_125 *) rt_grow_node_kind(tree, (rt_node *) n32,
- RT_NODE_KIND_125);
+ new = rt_grow_node_kind(tree, node, RT_NODE_KIND_125);
+ new125 = (rt_node_inner_125 *) new.decoded;
+
for (int i = 0; i < n32->base.n.count; i++)
node_inner_125_insert(new125, n32->base.chunks[i], n32->children[i]);
- rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new125, key);
- node = (rt_node *) new125;
+ rt_replace_node(tree, parent, node, new, key);
+ node = new;
}
}
else
@@ -1398,7 +1483,7 @@ retry_insert_inner_32:
count, insertpos);
n32->base.chunks[insertpos] = chunk;
- n32->children[insertpos] = child;
+ n32->children[insertpos] = child.encoded;
break;
}
}
@@ -1406,25 +1491,28 @@ retry_insert_inner_32:
/* FALLTHROUGH */
case RT_NODE_KIND_125:
{
- rt_node_inner_125 *n125 = (rt_node_inner_125 *) node;
+ rt_node_inner_125 *n125 = (rt_node_inner_125 *) node.decoded;
int cnt = 0;
if (node_125_is_chunk_used((rt_node_base_125 *) n125, chunk))
{
/* found the existing chunk */
chunk_exists = true;
- node_inner_125_update(n125, chunk, child);
+ node_inner_125_update(n125, chunk, child.encoded);
break;
}
if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n125)))
{
+ rt_node_ptr new;
rt_node_inner_256 *new256;
- Assert(parent != NULL);
+
+ Assert(RTNodePtrIsValid(parent));
/* grow node from 125 to 256 */
- new256 = (rt_node_inner_256 *) rt_grow_node_kind(tree, (rt_node *) n125,
- RT_NODE_KIND_256);
+ new = rt_grow_node_kind(tree, node, RT_NODE_KIND_256);
+ new256 = (rt_node_inner_256 *) new.decoded;
+
for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n125->base.n.count; i++)
{
if (!node_125_is_chunk_used((rt_node_base_125 *) n125, i))
@@ -1434,32 +1522,31 @@ retry_insert_inner_32:
cnt++;
}
- rt_replace_node(tree, parent, (rt_node *) n125, (rt_node *) new256,
- key);
- node = (rt_node *) new256;
+ rt_replace_node(tree, parent, node, new, key);
+ node = new;
}
else
{
- node_inner_125_insert(n125, chunk, child);
+ node_inner_125_insert(n125, chunk, child.encoded);
break;
}
}
/* FALLTHROUGH */
case RT_NODE_KIND_256:
{
- rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node.decoded;
chunk_exists = node_inner_256_is_chunk_used(n256, chunk);
Assert(chunk_exists || FIXED_NODE_HAS_FREE_SLOT(n256, RT_CLASS_256));
- node_inner_256_set(n256, chunk, child);
+ node_inner_256_set(n256, chunk, child.encoded);
break;
}
}
/* Update statistics */
if (!chunk_exists)
- node->count++;
+ NODE_COUNT(node)++;
/*
* Done. Finally, verify the chunk and value is inserted or replaced
@@ -1472,19 +1559,19 @@ retry_insert_inner_32:
/* Insert the value to the leaf node */
static bool
-rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
+rt_node_insert_leaf(radix_tree *tree, rt_node_ptr parent, rt_node_ptr node,
uint64 key, uint64 value)
{
- uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ uint8 chunk = RT_GET_KEY_CHUNK(key, NODE_SHIFT(node));
bool chunk_exists = false;
Assert(NODE_IS_LEAF(node));
- switch (node->kind)
+ switch (NODE_KIND(node))
{
case RT_NODE_KIND_4:
{
- rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node.decoded;
int idx;
idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
@@ -1498,16 +1585,18 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n4)))
{
+ rt_node_ptr new;
rt_node_leaf_32 *new32;
- Assert(parent != NULL);
+
+ Assert(RTNodePtrIsValid(parent));
/* grow node from 4 to 32 */
- new32 = (rt_node_leaf_32 *) rt_grow_node_kind(tree, (rt_node *) n4,
- RT_NODE_KIND_32);
+ new = rt_grow_node_kind(tree, node, RT_NODE_KIND_32);
+ new32 = (rt_node_leaf_32 *) new.decoded;
chunk_values_array_copy(n4->base.chunks, n4->values,
new32->base.chunks, new32->values);
- rt_replace_node(tree, parent, (rt_node *) n4, (rt_node *) new32, key);
- node = (rt_node *) new32;
+ rt_replace_node(tree, parent, node, new, key);
+ node = new;
}
else
{
@@ -1527,7 +1616,7 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
/* FALLTHROUGH */
case RT_NODE_KIND_32:
{
- rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node.decoded;
int idx;
idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
@@ -1541,45 +1630,51 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)))
{
- Assert(parent != NULL);
+ Assert(RTNodePtrIsValid(parent));
if (n32->base.n.count == rt_size_class_info[RT_CLASS_32_PARTIAL].fanout)
{
/* use the same node kind, but expand to the next size class */
const Size size = rt_size_class_info[RT_CLASS_32_PARTIAL].leaf_size;
const int fanout = rt_size_class_info[RT_CLASS_32_FULL].fanout;
+ rt_node_ptr new;
rt_node_leaf_32 *new32;
- new32 = (rt_node_leaf_32 *) rt_alloc_node(tree, RT_CLASS_32_FULL, false);
+ new = rt_alloc_node(tree, RT_CLASS_32_FULL, false);
+ new32 = (rt_node_leaf_32 *) new.decoded;
memcpy(new32, n32, size);
new32->base.n.fanout = fanout;
- rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new32, key);
+ rt_replace_node(tree, parent, node, new, key);
- /* must update both pointers here */
- node = (rt_node *) new32;
+ /*
+ * Must update both pointers here since we update n32 and
+ * verify node.
+ */
+ node = new;
n32 = new32;
goto retry_insert_leaf_32;
}
else
{
+ rt_node_ptr new;
rt_node_leaf_125 *new125;
/* grow node from 32 to 125 */
- new125 = (rt_node_leaf_125 *) rt_grow_node_kind(tree, (rt_node *) n32,
- RT_NODE_KIND_125);
+ new = rt_grow_node_kind(tree, node, RT_NODE_KIND_125);
+ new125 = (rt_node_leaf_125 *) new.decoded;
+
for (int i = 0; i < n32->base.n.count; i++)
node_leaf_125_insert(new125, n32->base.chunks[i], n32->values[i]);
- rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new125,
- key);
- node = (rt_node *) new125;
+ rt_replace_node(tree, parent, node, new, key);
+ node = new;
}
}
else
{
- retry_insert_leaf_32:
+retry_insert_leaf_32:
{
int insertpos = node_32_get_insertpos((rt_node_base_32 *) n32, chunk);
int count = n32->base.n.count;
@@ -1597,7 +1692,7 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
/* FALLTHROUGH */
case RT_NODE_KIND_125:
{
- rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) node;
+ rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) node.decoded;
int cnt = 0;
if (node_125_is_chunk_used((rt_node_base_125 *) n125, chunk))
@@ -1610,12 +1705,14 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n125)))
{
+ rt_node_ptr new;
rt_node_leaf_256 *new256;
- Assert(parent != NULL);
+
+ Assert(RTNodePtrIsValid(parent));
/* grow node from 125 to 256 */
- new256 = (rt_node_leaf_256 *) rt_grow_node_kind(tree, (rt_node *) n125,
- RT_NODE_KIND_256);
+ new = rt_grow_node_kind(tree, node, RT_NODE_KIND_256);
+ new256 = (rt_node_leaf_256 *) new.decoded;
for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n125->base.n.count; i++)
{
if (!node_125_is_chunk_used((rt_node_base_125 *) n125, i))
@@ -1625,9 +1722,8 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
cnt++;
}
- rt_replace_node(tree, parent, (rt_node *) n125, (rt_node *) new256,
- key);
- node = (rt_node *) new256;
+ rt_replace_node(tree, parent, node, new, key);
+ node = new;
}
else
{
@@ -1638,7 +1734,7 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
/* FALLTHROUGH */
case RT_NODE_KIND_256:
{
- rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node.decoded;
chunk_exists = node_leaf_256_is_chunk_used(n256, chunk);
Assert(chunk_exists || FIXED_NODE_HAS_FREE_SLOT(n256, RT_CLASS_256));
@@ -1650,7 +1746,7 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
/* Update statistics */
if (!chunk_exists)
- node->count++;
+ NODE_COUNT(node)++;
/*
* Done. Finally, verify the chunk and value is inserted or replaced
@@ -1674,7 +1770,7 @@ rt_create(MemoryContext ctx)
tree = palloc(sizeof(radix_tree));
tree->context = ctx;
- tree->root = NULL;
+ tree->root = InvalidRTPointer;
tree->max_val = 0;
tree->num_keys = 0;
@@ -1723,26 +1819,23 @@ rt_set(radix_tree *tree, uint64 key, uint64 value)
{
int shift;
bool updated;
- rt_node *node;
- rt_node *parent;
+ rt_node_ptr node;
+ rt_node_ptr parent;
/* Empty tree, create the root */
- if (!tree->root)
+ if (!RTPointerIsValid(tree->root))
rt_new_root(tree, key);
/* Extend the tree if necessary */
if (key > tree->max_val)
rt_extend(tree, key);
- Assert(tree->root);
-
- shift = tree->root->shift;
- node = parent = tree->root;
-
/* Descend the tree until a leaf node */
+ node = parent = rt_node_ptr_encoded(tree->root);
+ shift = NODE_SHIFT(node);
while (shift >= 0)
{
- rt_node *child;
+ rt_pointer child;
if (NODE_IS_LEAF(node))
break;
@@ -1754,7 +1847,7 @@ rt_set(radix_tree *tree, uint64 key, uint64 value)
}
parent = node;
- node = child;
+ node = rt_node_ptr_encoded(child);
shift -= RT_NODE_SPAN;
}
@@ -1775,21 +1868,21 @@ rt_set(radix_tree *tree, uint64 key, uint64 value)
bool
rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
{
- rt_node *node;
+ rt_node_ptr node;
int shift;
Assert(value_p != NULL);
- if (!tree->root || key > tree->max_val)
+ if (!RTPointerIsValid(tree->root) || key > tree->max_val)
return false;
- node = tree->root;
- shift = tree->root->shift;
+ node = rt_node_ptr_encoded(tree->root);
+ shift = NODE_SHIFT(node);
/* Descend the tree until a leaf node */
while (shift >= 0)
{
- rt_node *child;
+ rt_pointer child;
if (NODE_IS_LEAF(node))
break;
@@ -1797,7 +1890,7 @@ rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
return false;
- node = child;
+ node = rt_node_ptr_encoded(child);
shift -= RT_NODE_SPAN;
}
@@ -1811,8 +1904,8 @@ rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
bool
rt_delete(radix_tree *tree, uint64 key)
{
- rt_node *node;
- rt_node *stack[RT_MAX_LEVEL] = {0};
+ rt_node_ptr node;
+ rt_node_ptr stack[RT_MAX_LEVEL] = {0};
int shift;
int level;
bool deleted;
@@ -1824,12 +1917,12 @@ rt_delete(radix_tree *tree, uint64 key)
* Descend the tree to search the key while building a stack of nodes we
* visited.
*/
- node = tree->root;
- shift = tree->root->shift;
+ node = rt_node_ptr_encoded(tree->root);
+ shift = NODE_SHIFT(node);
level = -1;
while (shift > 0)
{
- rt_node *child;
+ rt_pointer child;
/* Push the current node to the stack */
stack[++level] = node;
@@ -1837,7 +1930,7 @@ rt_delete(radix_tree *tree, uint64 key)
if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
return false;
- node = child;
+ node = rt_node_ptr_encoded(child);
shift -= RT_NODE_SPAN;
}
@@ -1888,6 +1981,7 @@ rt_iter *
rt_begin_iterate(radix_tree *tree)
{
MemoryContext old_ctx;
+ rt_node_ptr root;
rt_iter *iter;
int top_level;
@@ -1897,17 +1991,18 @@ rt_begin_iterate(radix_tree *tree)
iter->tree = tree;
/* empty tree */
- if (!iter->tree->root)
+ if (!RTPointerIsValid(iter->tree) || !RTPointerIsValid(iter->tree->root))
return iter;
- top_level = iter->tree->root->shift / RT_NODE_SPAN;
+ root = rt_node_ptr_encoded(iter->tree->root);
+ top_level = NODE_SHIFT(root) / RT_NODE_SPAN;
iter->stack_len = top_level;
/*
* Descend to the left most leaf node from the root. The key is being
* constructed while descending to the leaf.
*/
- rt_update_iter_stack(iter, iter->tree->root, top_level);
+ rt_update_iter_stack(iter, root, top_level);
MemoryContextSwitchTo(old_ctx);
@@ -1918,14 +2013,15 @@ rt_begin_iterate(radix_tree *tree)
* Update each node_iter for inner nodes in the iterator node stack.
*/
static void
-rt_update_iter_stack(rt_iter *iter, rt_node *from_node, int from)
+rt_update_iter_stack(rt_iter *iter, rt_node_ptr from_node, int from)
{
int level = from;
- rt_node *node = from_node;
+ rt_node_ptr node = from_node;
for (;;)
{
rt_node_iter *node_iter = &(iter->stack[level--]);
+ bool found PG_USED_FOR_ASSERTS_ONLY;
node_iter->node = node;
node_iter->current_idx = -1;
@@ -1935,10 +2031,10 @@ rt_update_iter_stack(rt_iter *iter, rt_node *from_node, int from)
return;
/* Advance to the next slot in the inner node */
- node = rt_node_inner_iterate_next(iter, node_iter);
+ found = rt_node_inner_iterate_next(iter, node_iter, &node);
/* We must find the first children in the node */
- Assert(node);
+ Assert(found);
}
}
@@ -1955,7 +2051,7 @@ rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p)
for (;;)
{
- rt_node *child = NULL;
+ rt_node_ptr child = InvalidRTNodePtr;
uint64 value;
int level;
bool found;
@@ -1976,14 +2072,12 @@ rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p)
*/
for (level = 1; level <= iter->stack_len; level++)
{
- child = rt_node_inner_iterate_next(iter, &(iter->stack[level]));
-
- if (child)
+ if (rt_node_inner_iterate_next(iter, &(iter->stack[level]), &child))
break;
}
/* the iteration finished */
- if (!child)
+ if (!RTNodePtrIsValid(child))
return false;
/*
@@ -2015,18 +2109,19 @@ rt_iter_update_key(rt_iter *iter, uint8 chunk, uint8 shift)
* Advance the slot in the inner node. Return the child if exists, otherwise
* null.
*/
-static inline rt_node *
-rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
+static inline bool
+rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter, rt_node_ptr *child_p)
{
- rt_node *child = NULL;
+ rt_node_ptr node = node_iter->node;
+ rt_pointer child;
bool found = false;
uint8 key_chunk;
- switch (node_iter->node->kind)
+ switch (NODE_KIND(node))
{
case RT_NODE_KIND_4:
{
- rt_node_inner_4 *n4 = (rt_node_inner_4 *) node_iter->node;
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node.decoded;
node_iter->current_idx++;
if (node_iter->current_idx >= n4->base.n.count)
@@ -2039,7 +2134,7 @@ rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
}
case RT_NODE_KIND_32:
{
- rt_node_inner_32 *n32 = (rt_node_inner_32 *) node_iter->node;
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node.decoded;
node_iter->current_idx++;
if (node_iter->current_idx >= n32->base.n.count)
@@ -2052,7 +2147,7 @@ rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
}
case RT_NODE_KIND_125:
{
- rt_node_inner_125 *n125 = (rt_node_inner_125 *) node_iter->node;
+ rt_node_inner_125 *n125 = (rt_node_inner_125 *) node.decoded;
int i;
for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
@@ -2072,7 +2167,7 @@ rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
}
case RT_NODE_KIND_256:
{
- rt_node_inner_256 *n256 = (rt_node_inner_256 *) node_iter->node;
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node.decoded;
int i;
for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
@@ -2093,9 +2188,12 @@ rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
}
if (found)
- rt_iter_update_key(iter, key_chunk, node_iter->node->shift);
+ {
+ rt_iter_update_key(iter, key_chunk, NODE_SHIFT(node));
+ *child_p = rt_node_ptr_encoded(child);
+ }
- return child;
+ return found;
}
/*
@@ -2103,19 +2201,18 @@ rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
* is set to value_p, otherwise return false.
*/
static inline bool
-rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
- uint64 *value_p)
+rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter, uint64 *value_p)
{
- rt_node *node = node_iter->node;
+ rt_node_ptr node = node_iter->node;
bool found = false;
uint64 value;
uint8 key_chunk;
- switch (node->kind)
+ switch (NODE_KIND(node))
{
case RT_NODE_KIND_4:
{
- rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node_iter->node;
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node.decoded;
node_iter->current_idx++;
if (node_iter->current_idx >= n4->base.n.count)
@@ -2128,7 +2225,7 @@ rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
}
case RT_NODE_KIND_32:
{
- rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node_iter->node;
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node.decoded;
node_iter->current_idx++;
if (node_iter->current_idx >= n32->base.n.count)
@@ -2141,7 +2238,7 @@ rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
}
case RT_NODE_KIND_125:
{
- rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) node_iter->node;
+ rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) node.decoded;
int i;
for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
@@ -2161,7 +2258,7 @@ rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
}
case RT_NODE_KIND_256:
{
- rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node_iter->node;
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node.decoded;
int i;
for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
@@ -2183,7 +2280,7 @@ rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
if (found)
{
- rt_iter_update_key(iter, key_chunk, node_iter->node->shift);
+ rt_iter_update_key(iter, key_chunk, NODE_SHIFT(node));
*value_p = value;
}
@@ -2220,16 +2317,16 @@ rt_memory_usage(radix_tree *tree)
* Verify the radix tree node.
*/
static void
-rt_verify_node(rt_node *node)
+rt_verify_node(rt_node_ptr node)
{
#ifdef USE_ASSERT_CHECKING
- Assert(node->count >= 0);
+ Assert(NODE_COUNT(node) >= 0);
- switch (node->kind)
+ switch (NODE_KIND(node))
{
case RT_NODE_KIND_4:
{
- rt_node_base_4 *n4 = (rt_node_base_4 *) node;
+ rt_node_base_4 *n4 = (rt_node_base_4 *) node.decoded;
for (int i = 1; i < n4->n.count; i++)
Assert(n4->chunks[i - 1] < n4->chunks[i]);
@@ -2238,7 +2335,7 @@ rt_verify_node(rt_node *node)
}
case RT_NODE_KIND_32:
{
- rt_node_base_32 *n32 = (rt_node_base_32 *) node;
+ rt_node_base_32 *n32 = (rt_node_base_32 *) node.decoded;
for (int i = 1; i < n32->n.count; i++)
Assert(n32->chunks[i - 1] < n32->chunks[i]);
@@ -2247,7 +2344,7 @@ rt_verify_node(rt_node *node)
}
case RT_NODE_KIND_125:
{
- rt_node_base_125 *n125 = (rt_node_base_125 *) node;
+ rt_node_base_125 *n125 = (rt_node_base_125 *) node.decoded;
int cnt = 0;
for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
@@ -2257,10 +2354,10 @@ rt_verify_node(rt_node *node)
/* Check if the corresponding slot is used */
if (NODE_IS_LEAF(node))
- Assert(node_leaf_125_is_slot_used((rt_node_leaf_125 *) node,
+ Assert(node_leaf_125_is_slot_used((rt_node_leaf_125 *) n125,
n125->slot_idxs[i]));
else
- Assert(node_inner_125_is_slot_used((rt_node_inner_125 *) node,
+ Assert(node_inner_125_is_slot_used((rt_node_inner_125 *) n125,
n125->slot_idxs[i]));
cnt++;
@@ -2273,7 +2370,7 @@ rt_verify_node(rt_node *node)
{
if (NODE_IS_LEAF(node))
{
- rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node.decoded;
int cnt = 0;
for (int i = 0; i < RT_NODE_NSLOTS_BITS(RT_NODE_MAX_SLOTS); i++)
@@ -2294,54 +2391,62 @@ rt_verify_node(rt_node *node)
void
rt_stats(radix_tree *tree)
{
+ rt_node *root = rt_pointer_decode(tree->root);
+
+ if (root == NULL)
+ return;
+
ereport(NOTICE, (errmsg("num_keys = " UINT64_FORMAT ", height = %d, n4 = %u, n15 = %u, n32 = %u, n125 = %u, n256 = %u",
- tree->num_keys,
- tree->root->shift / RT_NODE_SPAN,
- tree->cnt[RT_CLASS_4_FULL],
- tree->cnt[RT_CLASS_32_PARTIAL],
- tree->cnt[RT_CLASS_32_FULL],
- tree->cnt[RT_CLASS_125_FULL],
- tree->cnt[RT_CLASS_256])));
+ tree->num_keys,
+ root->shift / RT_NODE_SPAN,
+ tree->cnt[RT_CLASS_4_FULL],
+ tree->cnt[RT_CLASS_32_PARTIAL],
+ tree->cnt[RT_CLASS_32_FULL],
+ tree->cnt[RT_CLASS_125_FULL],
+ tree->cnt[RT_CLASS_256])));
}
static void
-rt_dump_node(rt_node *node, int level, bool recurse)
+rt_dump_node(rt_node_ptr node, int level, bool recurse)
{
- char space[125] = {0};
+ rt_node *n = node.decoded;
+ char space[128] = {0};
fprintf(stderr, "[%s] kind %d, fanout %d, count %u, shift %u, chunk 0x%X:\n",
NODE_IS_LEAF(node) ? "LEAF" : "INNR",
- (node->kind == RT_NODE_KIND_4) ? 4 :
- (node->kind == RT_NODE_KIND_32) ? 32 :
- (node->kind == RT_NODE_KIND_125) ? 125 : 256,
- node->fanout == 0 ? 256 : node->fanout,
- node->count, node->shift, node->chunk);
+
+ (n->kind == RT_NODE_KIND_4) ? 4 :
+ (n->kind == RT_NODE_KIND_32) ? 32 :
+ (n->kind == RT_NODE_KIND_125) ? 125 : 256,
+ n->fanout == 0 ? 256 : n->fanout,
+ n->count, n->shift, n->chunk);
if (level > 0)
sprintf(space, "%*c", level * 4, ' ');
- switch (node->kind)
+ switch (NODE_KIND(node))
{
case RT_NODE_KIND_4:
{
- for (int i = 0; i < node->count; i++)
+ for (int i = 0; i < NODE_COUNT(node); i++)
{
if (NODE_IS_LEAF(node))
{
- rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node.decoded;
fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
space, n4->base.chunks[i], n4->values[i]);
}
else
{
- rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node.decoded;
fprintf(stderr, "%schunk 0x%X ->",
space, n4->base.chunks[i]);
if (recurse)
- rt_dump_node(n4->children[i], level + 1, recurse);
+ rt_dump_node(rt_node_ptr_encoded(n4->children[i]),
+ level + 1, recurse);
else
fprintf(stderr, "\n");
}
@@ -2350,25 +2455,26 @@ rt_dump_node(rt_node *node, int level, bool recurse)
}
case RT_NODE_KIND_32:
{
- for (int i = 0; i < node->count; i++)
+ for (int i = 0; i < NODE_COUNT(node); i++)
{
if (NODE_IS_LEAF(node))
{
- rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node.decoded;
fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
space, n32->base.chunks[i], n32->values[i]);
}
else
{
- rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node.decoded;
fprintf(stderr, "%schunk 0x%X ->",
space, n32->base.chunks[i]);
if (recurse)
{
- rt_dump_node(n32->children[i], level + 1, recurse);
+ rt_dump_node(rt_node_ptr_encoded(n32->children[i]),
+ level + 1, recurse);
}
else
fprintf(stderr, "\n");
@@ -2378,7 +2484,7 @@ rt_dump_node(rt_node *node, int level, bool recurse)
}
case RT_NODE_KIND_125:
{
- rt_node_base_125 *b125 = (rt_node_base_125 *) node;
+ rt_node_base_125 *b125 = (rt_node_base_125 *) node.decoded;
fprintf(stderr, "slot_idxs ");
for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
@@ -2390,7 +2496,7 @@ rt_dump_node(rt_node *node, int level, bool recurse)
}
if (NODE_IS_LEAF(node))
{
- rt_node_leaf_125 *n = (rt_node_leaf_125 *) node;
+ rt_node_leaf_125 *n = (rt_node_leaf_125 *) node.decoded;
fprintf(stderr, ", isset-bitmap:");
for (int i = 0; i < WORDNUM(128); i++)
@@ -2420,7 +2526,7 @@ rt_dump_node(rt_node *node, int level, bool recurse)
space, i);
if (recurse)
- rt_dump_node(node_inner_125_get_child(n125, i),
+ rt_dump_node(rt_node_ptr_encoded(node_inner_125_get_child(n125, i)),
level + 1, recurse);
else
fprintf(stderr, "\n");
@@ -2434,7 +2540,7 @@ rt_dump_node(rt_node *node, int level, bool recurse)
{
if (NODE_IS_LEAF(node))
{
- rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node.decoded;
if (!node_leaf_256_is_chunk_used(n256, i))
continue;
@@ -2444,7 +2550,7 @@ rt_dump_node(rt_node *node, int level, bool recurse)
}
else
{
- rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node.decoded;
if (!node_inner_256_is_chunk_used(n256, i))
continue;
@@ -2453,8 +2559,8 @@ rt_dump_node(rt_node *node, int level, bool recurse)
space, i);
if (recurse)
- rt_dump_node(node_inner_256_get_child(n256, i), level + 1,
- recurse);
+ rt_dump_node(rt_node_ptr_encoded(node_inner_256_get_child(n256, i)),
+ level + 1, recurse);
else
fprintf(stderr, "\n");
}
@@ -2467,7 +2573,7 @@ rt_dump_node(rt_node *node, int level, bool recurse)
void
rt_dump_search(radix_tree *tree, uint64 key)
{
- rt_node *node;
+ rt_node_ptr node;
int shift;
int level = 0;
@@ -2475,7 +2581,7 @@ rt_dump_search(radix_tree *tree, uint64 key)
elog(NOTICE, "max_val = " UINT64_FORMAT "(0x" UINT64_FORMAT_HEX ")",
tree->max_val, tree->max_val);
- if (!tree->root)
+ if (!RTPointerIsValid(tree->root))
{
elog(NOTICE, "tree is empty");
return;
@@ -2488,11 +2594,11 @@ rt_dump_search(radix_tree *tree, uint64 key)
return;
}
- node = tree->root;
- shift = tree->root->shift;
+ node = rt_node_ptr_encoded(tree->root);
+ shift = NODE_SHIFT(node);
while (shift >= 0)
{
- rt_node *child;
+ rt_pointer child;
rt_dump_node(node, level, false);
@@ -2509,7 +2615,7 @@ rt_dump_search(radix_tree *tree, uint64 key)
if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
break;
- node = child;
+ node = rt_node_ptr_encoded(child);
shift -= RT_NODE_SPAN;
level++;
}
@@ -2518,6 +2624,7 @@ rt_dump_search(radix_tree *tree, uint64 key)
void
rt_dump(radix_tree *tree)
{
+ rt_node_ptr root;
for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
fprintf(stderr, "%s\tinner_size %zu\tinner_blocksize %zu\tleaf_size %zu\tleaf_blocksize %zu\n",
@@ -2528,12 +2635,13 @@ rt_dump(radix_tree *tree)
rt_size_class_info[i].leaf_blocksize);
fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->max_val);
- if (!tree->root)
+ if (!RTPointerIsValid(tree->root))
{
fprintf(stderr, "empty tree\n");
return;
}
- rt_dump_node(tree->root, 0, true);
+ root = rt_node_ptr_encoded(tree->root);
+ rt_dump_node(root, 0, true);
}
#endif
--
2.31.1
Attachment: v14-0009-PoC-lazy-vacuum-integration.patch (application/octet-stream)
From 2431edf71e7e22248af46588f554c47cd169cec7 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 4 Nov 2022 14:14:42 +0900
Subject: [PATCH v14 9/9] PoC: lazy vacuum integration.
The patch includes:
* Introducing a new module, TIDStore, to store TIDs in a radix tree.
* Integrating TIDStore with lazy (parallel) vacuum.
---
src/backend/access/common/Makefile | 1 +
src/backend/access/common/meson.build | 1 +
src/backend/access/common/tidstore.c | 531 ++++++++++++++++++++++++++
src/backend/access/heap/vacuumlazy.c | 170 +++------
src/backend/catalog/system_views.sql | 2 +-
src/backend/commands/vacuum.c | 76 +---
src/backend/commands/vacuumparallel.c | 64 ++--
src/backend/storage/lmgr/lwlock.c | 2 +
src/backend/utils/misc/guc_tables.c | 2 +-
src/include/access/tidstore.h | 49 +++
src/include/commands/progress.h | 4 +-
src/include/commands/vacuum.h | 25 +-
src/include/storage/lwlock.h | 1 +
src/test/regress/expected/rules.out | 4 +-
14 files changed, 696 insertions(+), 236 deletions(-)
create mode 100644 src/backend/access/common/tidstore.c
create mode 100644 src/include/access/tidstore.h
diff --git a/src/backend/access/common/Makefile b/src/backend/access/common/Makefile
index b9aff0ccfd..67b8cc6108 100644
--- a/src/backend/access/common/Makefile
+++ b/src/backend/access/common/Makefile
@@ -27,6 +27,7 @@ OBJS = \
syncscan.o \
toast_compression.o \
toast_internals.o \
+ tidstore.o \
tupconvert.o \
tupdesc.o
diff --git a/src/backend/access/common/meson.build b/src/backend/access/common/meson.build
index 857beaa32d..76265974b1 100644
--- a/src/backend/access/common/meson.build
+++ b/src/backend/access/common/meson.build
@@ -13,6 +13,7 @@ backend_sources += files(
'syncscan.c',
'toast_compression.c',
'toast_internals.c',
+ 'tidstore.c',
'tupconvert.c',
'tupdesc.c',
)
diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
new file mode 100644
index 0000000000..770c4ab5bf
--- /dev/null
+++ b/src/backend/access/common/tidstore.c
@@ -0,0 +1,531 @@
+/*-------------------------------------------------------------------------
+ *
+ * tidstore.c
+ * TID (ItemPointer) storage implementation.
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/common/tidstore.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "access/tidstore.h"
+#include "lib/radixtree.h"
+#include "port/pg_bitutils.h"
+#include "utils/dsa.h"
+#include "utils/memutils.h"
+#include "miscadmin.h"
+
+/* XXX only testing purpose during development, will be removed */
+#define XXX_DEBUG_TID_STORE 1
+
+/*
+ * For encoding purposes, item pointers are represented as a pair of 64-bit
+ * key and 64-bit value. We construct a 64-bit unsigned integer that combines
+ * the block number and the offset number. The lowest 11 bits represent the
+ * offset number, and the next 32 bits are the block number. That is, only 43
+ * bits are used:
+ *
+ * XXXXXXXX XXXYYYYY YYYYYYYY YYYYYYYY YYYYYYYY YYYuuuu
+ *
+ * X = bits used for offset number
+ * Y = bits used for block number
+ * u = unused bit
+ *
+ * 11 bits are enough for the offset number, because MaxHeapTuplesPerPage < 2^11
+ * on all supported block sizes (TIDSTORE_OFFSET_NBITS). We are frugal with
+ * the bits, because smaller keys help keep the radix tree shallow.
+ *
+ * XXX: If we want to support other table AMs that want to use the full range
+ * of possible offset numbers, we'll need to change this.
+ *
+ * The 64-bit value is a bitmap representation of the lowest 6 bits, and
+ * the remaining 37 bits are used as the key:
+ *
+ * value = bitmap representation of XXXXXX
+ * key = XXXXXYYY YYYYYYYY YYYYYYYY YYYYYYYY YYYYYuu
+ */
+#define TIDSTORE_OFFSET_NBITS 11
+#define TIDSTORE_VALUE_NBITS 6 /* log(sizeof(uint64) * BITS_PER_BYTE, 2) */
+
+/* Get block number from the key */
+#define KEY_GET_BLKNO(key) \
+ ((BlockNumber) ((key) >> (TIDSTORE_OFFSET_NBITS - TIDSTORE_VALUE_NBITS)))
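+
+/*
+ * Worked example (editorial illustration, not part of the patch): for the
+ * TID (block 10, offset 70), tid_to_key_off() at the bottom of this file
+ * computes
+ *
+ *   tid_i = 70 | (10 << 11)         = 20550
+ *   off   = tid_i & ((1 << 6) - 1)  = 6    -> bit 6 is set in the value
+ *   key   = tid_i >> 6              = 321
+ *
+ * and KEY_GET_BLKNO(321) = 321 >> 5 = 10 recovers the block number.
+ */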
+
+struct TIDStore
+{
+ /* main storage for TID */
+ radix_tree *tree;
+
+ /* # of tids in TIDStore */
+ int num_tids;
+
+ /* maximum bytes TIDStore can consume */
+ uint64 max_bytes;
+
+ /* DSA area and handle for shared TIDStore */
+ rt_handle handle;
+ dsa_area *area;
+
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+ uint64 max_items;
+ ItemPointer itemptrs;
+ uint64 nitems;
+#endif
+};
+
+/* Iterator for TIDStore */
+typedef struct TIDStoreIter
+{
+ TIDStore *ts;
+
+ /* iterator of radix tree */
+ rt_iter *tree_iter;
+
+ bool finished;
+
+ /* save for the next iteration */
+ uint64 next_key;
+ uint64 next_val;
+
+ /* output for the caller */
+ TIDStoreIterResult result;
+
+#ifdef USE_ASSERT_CHECKING
+ uint64 itemptrs_index;
+ int prev_index;
+#endif
+} TIDStoreIter;
+
+static void tidstore_iter_extract_tids(TIDStoreIter *iter, uint64 key, uint64 val);
+static inline uint64 tid_to_key_off(ItemPointer tid, uint32 *off);
+
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+/*
+ * Comparator routines for use with qsort() and bsearch().
+ */
+static int
+vac_cmp_itemptr(const void *left, const void *right)
+{
+ BlockNumber lblk,
+ rblk;
+ OffsetNumber loff,
+ roff;
+
+ lblk = ItemPointerGetBlockNumber((ItemPointer) left);
+ rblk = ItemPointerGetBlockNumber((ItemPointer) right);
+
+ if (lblk < rblk)
+ return -1;
+ if (lblk > rblk)
+ return 1;
+
+ loff = ItemPointerGetOffsetNumber((ItemPointer) left);
+ roff = ItemPointerGetOffsetNumber((ItemPointer) right);
+
+ if (loff < roff)
+ return -1;
+ if (loff > roff)
+ return 1;
+
+ return 0;
+}
+
+static void
+verify_iter_tids(TIDStoreIter *iter)
+{
+ uint64 index = iter->prev_index;
+ TIDStoreIterResult *result = &(iter->result);
+
+ if (iter->ts->itemptrs == NULL)
+ return;
+
+ Assert(index <= iter->ts->nitems);
+
+ for (int i = 0; i < result->num_offsets; i++)
+ {
+ ItemPointerData tid;
+
+ ItemPointerSetBlockNumber(&tid, result->blkno);
+ ItemPointerSetOffsetNumber(&tid, result->offsets[i]);
+
+ Assert(ItemPointerEquals(&iter->ts->itemptrs[index++], &tid));
+ }
+
+ iter->prev_index = iter->itemptrs_index;
+}
+
+static void
+dump_itemptrs(TIDStore *ts)
+{
+ StringInfoData buf;
+
+ if (ts->itemptrs == NULL)
+ return;
+
+ initStringInfo(&buf);
+ for (int i = 0; i < ts->nitems; i++)
+ {
+ appendStringInfo(&buf, "(%d,%d) ",
+ ItemPointerGetBlockNumber(&(ts->itemptrs[i])),
+ ItemPointerGetOffsetNumber(&(ts->itemptrs[i])));
+ }
+ elog(WARNING, "--- dump (" UINT64_FORMAT " items) ---", ts->nitems);
+ elog(WARNING, "%s\n", buf.data);
+}
+
+#endif
+
+/*
+ * Create a TIDStore. The returned object is allocated in backend-local memory.
+ * The radix tree for storage is allocated in the DSA area if 'area' is non-NULL.
+ */
+TIDStore *
+tidstore_create(uint64 max_bytes, dsa_area *area)
+{
+ TIDStore *ts;
+
+ ts = palloc0(sizeof(TIDStore));
+
+ ts->tree = rt_create(CurrentMemoryContext, area);
+ ts->area = area;
+ ts->max_bytes = max_bytes;
+
+ if (area != NULL)
+ ts->handle = rt_get_handle(ts->tree);
+
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+#define MAXDEADITEMS(avail_mem) \
+ (avail_mem / sizeof(ItemPointerData))
+
+ if (area == NULL)
+ {
+ ts->max_items = MAXDEADITEMS(maintenance_work_mem * 1024);
+ ts->itemptrs = (ItemPointer) palloc0(sizeof(ItemPointerData) * ts->max_items);
+ ts->nitems = 0;
+ }
+#endif
+
+ return ts;
+}
+
+/* Attach to the shared TIDStore using a handle */
+TIDStore *
+tidstore_attach(dsa_area *area, rt_handle handle)
+{
+ TIDStore *ts;
+
+ Assert(area != NULL);
+ Assert(DsaPointerIsValid(handle));
+
+ ts = palloc0(sizeof(TIDStore));
+ ts->tree = rt_attach(area, handle);
+
+ return ts;
+}
+
+/*
+ * Detach from a TIDStore. This detaches from radix tree and frees the
+ * backend-local resources.
+ */
+void
+tidstore_detach(TIDStore *ts)
+{
+ rt_detach(ts->tree);
+ pfree(ts);
+}
+
+void
+tidstore_free(TIDStore *ts)
+{
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+ if (ts->itemptrs)
+ pfree(ts->itemptrs);
+#endif
+
+ rt_free(ts->tree);
+ pfree(ts);
+}
+
+void
+tidstore_reset(TIDStore *ts)
+{
+ dsa_area *area = ts->area;
+
+ /* Reset the statistics */
+ ts->num_tids = 0;
+
+ /* Free the radix tree */
+ rt_free(ts->tree);
+
+ if (ts->area)
+ dsa_trim(area);
+
+ ts->tree = rt_create(CurrentMemoryContext, area);
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+ ts->nitems = 0;
+#endif
+}
+
+/* Add TIDs to TIDStore */
+void
+tidstore_add_tids(TIDStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets)
+{
+ uint64 last_key = PG_UINT64_MAX;
+ uint64 key;
+ uint64 val = 0;
+ ItemPointerData tid;
+
+ ItemPointerSetBlockNumber(&tid, blkno);
+
+ for (int i = 0; i < num_offsets; i++)
+ {
+ uint32 off;
+
+ ItemPointerSetOffsetNumber(&tid, offsets[i]);
+
+ key = tid_to_key_off(&tid, &off);
+
+ if (last_key != PG_UINT64_MAX && last_key != key)
+ {
+ rt_set(ts->tree, last_key, val);
+ val = 0;
+ }
+
+ last_key = key;
+ val |= UINT64CONST(1) << off;
+ ts->num_tids++;
+
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+ if (ts->itemptrs)
+ {
+ if (ts->nitems >= ts->max_items)
+ {
+ ts->max_items *= 2;
+ ts->itemptrs = repalloc(ts->itemptrs, sizeof(ItemPointerData) * ts->max_items);
+ }
+
+ Assert(ts->nitems < ts->max_items);
+ ItemPointerSetBlockNumber(&(ts->itemptrs[ts->nitems]), blkno);
+ ItemPointerSetOffsetNumber(&(ts->itemptrs[ts->nitems]), offsets[i]);
+ ts->nitems++;
+ }
+#endif
+ }
+
+ if (last_key != PG_UINT64_MAX)
+ {
+ rt_set(ts->tree, last_key, val);
+ val = 0;
+ }
+
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+ if (ts->itemptrs)
+ Assert(ts->nitems == ts->num_tids);
+#endif
+}
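+
+/*
+ * Example (editorial sketch, not from the patch): with the encoding above,
+ * a call such as
+ *
+ *   OffsetNumber offs[] = {1, 63, 64};
+ *   tidstore_add_tids(ts, 10, offs, 3);
+ *
+ * issues only two rt_set() calls: key 320 with bits 1 and 63 set, and
+ * key 321 with bit 0 set, since offsets of the same block that fall into
+ * the same 64-offset range share a single key/value pair.
+ */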
+
+/* Return true if the given TID is present in TIDStore */
+bool
+tidstore_lookup_tid(TIDStore *ts, ItemPointer tid)
+{
+ uint64 key;
+ uint64 val;
+ uint32 off;
+ bool found;
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+ bool found_assert;
+#endif
+
+ key = tid_to_key_off(tid, &off);
+
+ found = rt_search(ts->tree, key, &val);
+
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+ if (ts->itemptrs)
+ found_assert = bsearch((void *) tid,
+ (void *) ts->itemptrs,
+ ts->nitems,
+ sizeof(ItemPointerData),
+ vac_cmp_itemptr) != NULL;
+#endif
+
+ if (!found)
+ {
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+ if (ts->itemptrs)
+ Assert(!found_assert);
+#endif
+ return false;
+ }
+
+ found = (val & (UINT64CONST(1) << off)) != 0;
+
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+
+ if (ts->itemptrs && found != found_assert)
+ {
+ elog(WARNING, "tid (%d,%d)\n",
+ ItemPointerGetBlockNumber(tid),
+ ItemPointerGetOffsetNumber(tid));
+ dump_itemptrs(ts);
+ }
+
+ if (ts->itemptrs)
+ Assert(found == found_assert);
+
+#endif
+ return found;
+}
+
+TIDStoreIter *
+tidstore_begin_iterate(TIDStore *ts)
+{
+ TIDStoreIter *iter;
+
+ iter = palloc0(sizeof(TIDStoreIter));
+ iter->ts = ts;
+ iter->tree_iter = rt_begin_iterate(ts->tree);
+ iter->result.blkno = InvalidBlockNumber;
+
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+ iter->itemptrs_index = 0;
+#endif
+
+ return iter;
+}
+
+TIDStoreIterResult *
+tidstore_iterate_next(TIDStoreIter *iter)
+{
+ uint64 key;
+ uint64 val;
+ TIDStoreIterResult *result = &(iter->result);
+
+ if (iter->finished)
+ return NULL;
+
+ if (BlockNumberIsValid(result->blkno))
+ {
+ result->num_offsets = 0;
+ tidstore_iter_extract_tids(iter, iter->next_key, iter->next_val);
+ }
+
+ while (rt_iterate_next(iter->tree_iter, &key, &val))
+ {
+ BlockNumber blkno;
+
+ blkno = KEY_GET_BLKNO(key);
+
+ if (BlockNumberIsValid(result->blkno) && result->blkno != blkno)
+ {
+ /*
+ * Remember the key-value pair for the next block for the
+ * next iteration.
+ */
+ iter->next_key = key;
+ iter->next_val = val;
+
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+ verify_iter_tids(iter);
+#endif
+ return result;
+ }
+
+ /* Collect tids extracted from the key-value pair */
+ tidstore_iter_extract_tids(iter, key, val);
+ }
+
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+ verify_iter_tids(iter);
+#endif
+
+ iter->finished = true;
+ return result;
+}
+
+uint64
+tidstore_num_tids(TIDStore *ts)
+{
+ return ts->num_tids;
+}
+
+bool
+tidstore_is_full(TIDStore *ts)
+{
+ return ((sizeof(TIDStore) + rt_memory_usage(ts->tree)) > ts->max_bytes);
+}
+
+uint64
+tidstore_max_memory(TIDStore *ts)
+{
+ return ts->max_bytes;
+}
+
+uint64
+tidstore_memory_usage(TIDStore *ts)
+{
+ return (uint64) sizeof(TIDStore) + rt_memory_usage(ts->tree);
+}
+
+/*
+ * Get a handle that can be used by other processes to attach to this TIDStore
+ */
+tidstore_handle
+tidstore_get_handle(TIDStore *ts)
+{
+ return rt_get_handle(ts->tree);
+}
+
+/* Extract TIDs from key-value pair */
+static void
+tidstore_iter_extract_tids(TIDStoreIter *iter, uint64 key, uint64 val)
+{
+ TIDStoreIterResult *result = (&iter->result);
+
+ for (int i = 0; i < sizeof(uint64) * BITS_PER_BYTE; i++)
+ {
+ uint64 tid_i;
+ OffsetNumber off;
+
+ if ((val & (UINT64CONST(1) << i)) == 0)
+ continue;
+
+ tid_i = key << TIDSTORE_VALUE_NBITS;
+ tid_i |= i;
+
+ off = tid_i & ((UINT64CONST(1) << TIDSTORE_OFFSET_NBITS) - 1);
+ result->offsets[result->num_offsets++] = off;
+
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+ iter->itemptrs_index++;
+#endif
+ }
+
+ result->blkno = KEY_GET_BLKNO(key);
+}
+
+/*
+ * Encode a TID to key and val.
+ */
+static inline uint64
+tid_to_key_off(ItemPointer tid, uint32 *off)
+{
+ uint64 upper;
+ uint64 tid_i;
+
+ tid_i = ItemPointerGetOffsetNumber(tid);
+ tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << TIDSTORE_OFFSET_NBITS;
+
+ *off = tid_i & ((1 << TIDSTORE_VALUE_NBITS) - 1);
+ upper = tid_i >> TIDSTORE_VALUE_NBITS;
+ Assert(*off < (sizeof(uint64) * BITS_PER_BYTE));
+
+ return upper;
+}
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index d59711b7ec..24c1dc7099 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -40,6 +40,7 @@
#include "access/heapam_xlog.h"
#include "access/htup_details.h"
#include "access/multixact.h"
+#include "access/tidstore.h"
#include "access/transam.h"
#include "access/visibilitymap.h"
#include "access/xact.h"
@@ -194,7 +195,7 @@ typedef struct LVRelState
* lazy_vacuum_heap_rel, which marks the same LP_DEAD line pointers as
* LP_UNUSED during second heap pass.
*/
- VacDeadItems *dead_items; /* TIDs whose index tuples we'll delete */
+ TIDStore *dead_items; /* TIDs whose index tuples we'll delete */
BlockNumber rel_pages; /* total number of pages */
BlockNumber scanned_pages; /* # pages examined (not skipped via VM) */
BlockNumber removed_pages; /* # pages removed by relation truncation */
@@ -265,8 +266,9 @@ static bool lazy_scan_noprune(LVRelState *vacrel, Buffer buf,
static void lazy_vacuum(LVRelState *vacrel);
static bool lazy_vacuum_all_indexes(LVRelState *vacrel);
static void lazy_vacuum_heap_rel(LVRelState *vacrel);
-static int lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
- Buffer buffer, int index, Buffer *vmbuffer);
+static void lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
+ OffsetNumber *offsets, int num_offsets,
+ Buffer buffer, Buffer *vmbuffer);
static bool lazy_check_wraparound_failsafe(LVRelState *vacrel);
static void lazy_cleanup_all_indexes(LVRelState *vacrel);
static IndexBulkDeleteResult *lazy_vacuum_one_index(Relation indrel,
@@ -853,21 +855,21 @@ lazy_scan_heap(LVRelState *vacrel)
next_unskippable_block,
next_failsafe_block = 0,
next_fsm_block_to_vacuum = 0;
- VacDeadItems *dead_items = vacrel->dead_items;
+ TIDStore *dead_items = vacrel->dead_items;
Buffer vmbuffer = InvalidBuffer;
bool next_unskippable_allvis,
skipping_current_range;
const int initprog_index[] = {
PROGRESS_VACUUM_PHASE,
PROGRESS_VACUUM_TOTAL_HEAP_BLKS,
- PROGRESS_VACUUM_MAX_DEAD_TUPLES
+ PROGRESS_VACUUM_MAX_DEAD_TUPLE_BYTES
};
int64 initprog_val[3];
/* Report that we're scanning the heap, advertising total # of blocks */
initprog_val[0] = PROGRESS_VACUUM_PHASE_SCAN_HEAP;
initprog_val[1] = rel_pages;
- initprog_val[2] = dead_items->max_items;
+ initprog_val[2] = tidstore_max_memory(vacrel->dead_items);
pgstat_progress_update_multi_param(3, initprog_index, initprog_val);
/* Set up an initial range of skippable blocks using the visibility map */
@@ -937,8 +939,8 @@ lazy_scan_heap(LVRelState *vacrel)
* dead_items TIDs, pause and do a cycle of vacuuming before we tackle
* this page.
*/
- Assert(dead_items->max_items >= MaxHeapTuplesPerPage);
- if (dead_items->max_items - dead_items->num_items < MaxHeapTuplesPerPage)
+ /* XXX: should not allow tidstore to grow beyond max_bytes */
+ if (tidstore_is_full(vacrel->dead_items))
{
/*
* Before beginning index vacuuming, we release any pin we may
@@ -1070,11 +1072,18 @@ lazy_scan_heap(LVRelState *vacrel)
if (prunestate.has_lpdead_items)
{
Size freespace;
+ TIDStoreIter *iter;
+ TIDStoreIterResult *result;
- lazy_vacuum_heap_page(vacrel, blkno, buf, 0, &vmbuffer);
+ iter = tidstore_begin_iterate(vacrel->dead_items);
+ result = tidstore_iterate_next(iter);
+ lazy_vacuum_heap_page(vacrel, blkno, result->offsets, result->num_offsets,
+ buf, &vmbuffer);
+ Assert(!tidstore_iterate_next(iter));
+ pfree(iter);
/* Forget the LP_DEAD items that we just vacuumed */
- dead_items->num_items = 0;
+ tidstore_reset(dead_items);
/*
* Periodically perform FSM vacuuming to make newly-freed
@@ -1111,7 +1120,7 @@ lazy_scan_heap(LVRelState *vacrel)
* with prunestate-driven visibility map and FSM steps (just like
* the two-pass strategy).
*/
- Assert(dead_items->num_items == 0);
+ Assert(tidstore_num_tids(dead_items) == 0);
}
/*
@@ -1264,7 +1273,7 @@ lazy_scan_heap(LVRelState *vacrel)
* Do index vacuuming (call each index's ambulkdelete routine), then do
* related heap vacuuming
*/
- if (dead_items->num_items > 0)
+ if (tidstore_num_tids(dead_items) > 0)
lazy_vacuum(vacrel);
/*
@@ -1863,25 +1872,16 @@ retry:
*/
if (lpdead_items > 0)
{
- VacDeadItems *dead_items = vacrel->dead_items;
- ItemPointerData tmp;
+ TIDStore *dead_items = vacrel->dead_items;
Assert(!prunestate->all_visible);
Assert(prunestate->has_lpdead_items);
vacrel->lpdead_item_pages++;
+ tidstore_add_tids(dead_items, blkno, deadoffsets, lpdead_items);
- ItemPointerSetBlockNumber(&tmp, blkno);
-
- for (int i = 0; i < lpdead_items; i++)
- {
- ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
- dead_items->items[dead_items->num_items++] = tmp;
- }
-
- Assert(dead_items->num_items <= dead_items->max_items);
- pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
- dead_items->num_items);
+ pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
+ tidstore_memory_usage(dead_items));
}
/* Finally, add page-local counts to whole-VACUUM counts */
@@ -2088,8 +2088,7 @@ lazy_scan_noprune(LVRelState *vacrel,
}
else
{
- VacDeadItems *dead_items = vacrel->dead_items;
- ItemPointerData tmp;
+ TIDStore *dead_items = vacrel->dead_items;
/*
* Page has LP_DEAD items, and so any references/TIDs that remain in
@@ -2098,17 +2097,10 @@ lazy_scan_noprune(LVRelState *vacrel,
*/
vacrel->lpdead_item_pages++;
- ItemPointerSetBlockNumber(&tmp, blkno);
+ tidstore_add_tids(dead_items, blkno, deadoffsets, lpdead_items);
- for (int i = 0; i < lpdead_items; i++)
- {
- ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
- dead_items->items[dead_items->num_items++] = tmp;
- }
-
- Assert(dead_items->num_items <= dead_items->max_items);
- pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
- dead_items->num_items);
+ pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
+ tidstore_memory_usage(dead_items));
vacrel->lpdead_items += lpdead_items;
@@ -2157,7 +2149,7 @@ lazy_vacuum(LVRelState *vacrel)
if (!vacrel->do_index_vacuuming)
{
Assert(!vacrel->do_index_cleanup);
- vacrel->dead_items->num_items = 0;
+ tidstore_reset(vacrel->dead_items);
return;
}
@@ -2186,7 +2178,7 @@ lazy_vacuum(LVRelState *vacrel)
BlockNumber threshold;
Assert(vacrel->num_index_scans == 0);
- Assert(vacrel->lpdead_items == vacrel->dead_items->num_items);
+ Assert(vacrel->lpdead_items == tidstore_num_tids(vacrel->dead_items));
Assert(vacrel->do_index_vacuuming);
Assert(vacrel->do_index_cleanup);
@@ -2213,8 +2205,8 @@ lazy_vacuum(LVRelState *vacrel)
* cases then this may need to be reconsidered.
*/
threshold = (double) vacrel->rel_pages * BYPASS_THRESHOLD_PAGES;
- bypass = (vacrel->lpdead_item_pages < threshold &&
- vacrel->lpdead_items < MAXDEADITEMS(32L * 1024L * 1024L));
+ bypass = (vacrel->lpdead_item_pages < threshold) &&
+ tidstore_memory_usage(vacrel->dead_items) < (32L * 1024L * 1024L);
}
if (bypass)
@@ -2259,7 +2251,7 @@ lazy_vacuum(LVRelState *vacrel)
* Forget the LP_DEAD items that we just vacuumed (or just decided to not
* vacuum)
*/
- vacrel->dead_items->num_items = 0;
+ tidstore_reset(vacrel->dead_items);
}
/*
@@ -2331,7 +2323,7 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
* place).
*/
Assert(vacrel->num_index_scans > 0 ||
- vacrel->dead_items->num_items == vacrel->lpdead_items);
+ tidstore_num_tids(vacrel->dead_items) == vacrel->lpdead_items);
Assert(allindexes || vacrel->failsafe_active);
/*
@@ -2368,10 +2360,11 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
static void
lazy_vacuum_heap_rel(LVRelState *vacrel)
{
- int index;
BlockNumber vacuumed_pages;
Buffer vmbuffer = InvalidBuffer;
LVSavedErrInfo saved_err_info;
+ TIDStoreIter *iter;
+ TIDStoreIterResult *result;
Assert(vacrel->do_index_vacuuming);
Assert(vacrel->do_index_cleanup);
@@ -2388,8 +2381,8 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
vacuumed_pages = 0;
- index = 0;
- while (index < vacrel->dead_items->num_items)
+ iter = tidstore_begin_iterate(vacrel->dead_items);
+ while ((result = tidstore_iterate_next(iter)) != NULL)
{
BlockNumber tblk;
Buffer buf;
@@ -2398,12 +2391,13 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
vacuum_delay_point();
- tblk = ItemPointerGetBlockNumber(&vacrel->dead_items->items[index]);
+ tblk = result->blkno;
vacrel->blkno = tblk;
buf = ReadBufferExtended(vacrel->rel, MAIN_FORKNUM, tblk, RBM_NORMAL,
vacrel->bstrategy);
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
- index = lazy_vacuum_heap_page(vacrel, tblk, buf, index, &vmbuffer);
+ lazy_vacuum_heap_page(vacrel, tblk, result->offsets, result->num_offsets,
+ buf, &vmbuffer);
/* Now that we've vacuumed the page, record its available space */
page = BufferGetPage(buf);
@@ -2427,14 +2421,13 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
* We set all LP_DEAD items from the first heap pass to LP_UNUSED during
* the second heap pass. No more, no less.
*/
- Assert(index > 0);
Assert(vacrel->num_index_scans > 1 ||
- (index == vacrel->lpdead_items &&
+ (tidstore_num_tids(vacrel->dead_items) == vacrel->lpdead_items &&
vacuumed_pages == vacrel->lpdead_item_pages));
ereport(DEBUG2,
- (errmsg("table \"%s\": removed %lld dead item identifiers in %u pages",
- vacrel->relname, (long long) index, vacuumed_pages)));
+ (errmsg("table \"%s\": removed " UINT64_FORMAT "dead item identifiers in %u pages",
+ vacrel->relname, tidstore_num_tids(vacrel->dead_items), vacuumed_pages)));
/* Revert to the previous phase information for error traceback */
restore_vacuum_error_info(vacrel, &saved_err_info);
@@ -2451,11 +2444,10 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
* LP_DEAD item on the page. The return value is the first index immediately
* after all LP_DEAD items for the same page in the array.
*/
-static int
-lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
- int index, Buffer *vmbuffer)
+static void
+lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets, Buffer buffer, Buffer *vmbuffer)
{
- VacDeadItems *dead_items = vacrel->dead_items;
Page page = BufferGetPage(buffer);
OffsetNumber unused[MaxHeapTuplesPerPage];
int uncnt = 0;
@@ -2474,16 +2466,11 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
START_CRIT_SECTION();
- for (; index < dead_items->num_items; index++)
+ for (int i = 0; i < num_offsets; i++)
{
- BlockNumber tblk;
- OffsetNumber toff;
ItemId itemid;
+ OffsetNumber toff = offsets[i];
- tblk = ItemPointerGetBlockNumber(&dead_items->items[index]);
- if (tblk != blkno)
- break; /* past end of tuples for this block */
- toff = ItemPointerGetOffsetNumber(&dead_items->items[index]);
itemid = PageGetItemId(page, toff);
Assert(ItemIdIsDead(itemid) && !ItemIdHasStorage(itemid));
@@ -2563,7 +2550,6 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
/* Revert to the previous phase information for error traceback */
restore_vacuum_error_info(vacrel, &saved_err_info);
- return index;
}
/*
@@ -3065,46 +3051,6 @@ count_nondeletable_pages(LVRelState *vacrel, bool *lock_waiter_detected)
return vacrel->nonempty_pages;
}
-/*
- * Returns the number of dead TIDs that VACUUM should allocate space to
- * store, given a heap rel of size vacrel->rel_pages, and given current
- * maintenance_work_mem setting (or current autovacuum_work_mem setting,
- * when applicable).
- *
- * See the comments at the head of this file for rationale.
- */
-static int
-dead_items_max_items(LVRelState *vacrel)
-{
- int64 max_items;
- int vac_work_mem = IsAutoVacuumWorkerProcess() &&
- autovacuum_work_mem != -1 ?
- autovacuum_work_mem : maintenance_work_mem;
-
- if (vacrel->nindexes > 0)
- {
- BlockNumber rel_pages = vacrel->rel_pages;
-
- max_items = MAXDEADITEMS(vac_work_mem * 1024L);
- max_items = Min(max_items, INT_MAX);
- max_items = Min(max_items, MAXDEADITEMS(MaxAllocSize));
-
- /* curious coding here to ensure the multiplication can't overflow */
- if ((BlockNumber) (max_items / MaxHeapTuplesPerPage) > rel_pages)
- max_items = rel_pages * MaxHeapTuplesPerPage;
-
- /* stay sane if small maintenance_work_mem */
- max_items = Max(max_items, MaxHeapTuplesPerPage);
- }
- else
- {
- /* One-pass case only stores a single heap page's TIDs at a time */
- max_items = MaxHeapTuplesPerPage;
- }
-
- return (int) max_items;
-}
-
/*
* Allocate dead_items (either using palloc, or in dynamic shared memory).
* Sets dead_items in vacrel for caller.
@@ -3115,11 +3061,9 @@ dead_items_max_items(LVRelState *vacrel)
static void
dead_items_alloc(LVRelState *vacrel, int nworkers)
{
- VacDeadItems *dead_items;
- int max_items;
-
- max_items = dead_items_max_items(vacrel);
- Assert(max_items >= MaxHeapTuplesPerPage);
+ int vac_work_mem = IsAutoVacuumWorkerProcess() &&
+ autovacuum_work_mem != -1 ?
+ autovacuum_work_mem * 1024L : maintenance_work_mem * 1024L;
/*
* Initialize state for a parallel vacuum. As of now, only one worker can
@@ -3146,7 +3090,7 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
else
vacrel->pvs = parallel_vacuum_init(vacrel->rel, vacrel->indrels,
vacrel->nindexes, nworkers,
- max_items,
+ vac_work_mem,
vacrel->verbose ? INFO : DEBUG2,
vacrel->bstrategy);
@@ -3159,11 +3103,7 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
}
/* Serial VACUUM case */
- dead_items = (VacDeadItems *) palloc(vac_max_items_to_alloc_size(max_items));
- dead_items->max_items = max_items;
- dead_items->num_items = 0;
-
- vacrel->dead_items = dead_items;
+ vacrel->dead_items = tidstore_create(vac_work_mem, NULL);
}
/*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 2d8104b090..bc42144f08 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1165,7 +1165,7 @@ CREATE VIEW pg_stat_progress_vacuum AS
END AS phase,
S.param2 AS heap_blks_total, S.param3 AS heap_blks_scanned,
S.param4 AS heap_blks_vacuumed, S.param5 AS index_vacuum_count,
- S.param6 AS max_dead_tuples, S.param7 AS num_dead_tuples
+ S.param6 AS max_dead_tuple_bytes, S.param7 AS dead_tuple_bytes
FROM pg_stat_get_progress_info('VACUUM') AS S
LEFT JOIN pg_database D ON S.datid = D.oid;
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 293b84bbca..7f5776fbf8 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -95,7 +95,6 @@ static bool vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params);
static double compute_parallel_delay(void);
static VacOptValue get_vacoptval_from_boolean(DefElem *def);
static bool vac_tid_reaped(ItemPointer itemptr, void *state);
-static int vac_cmp_itemptr(const void *left, const void *right);
/*
* Primary entry point for manual VACUUM and ANALYZE commands
@@ -2276,16 +2275,16 @@ get_vacoptval_from_boolean(DefElem *def)
*/
IndexBulkDeleteResult *
vac_bulkdel_one_index(IndexVacuumInfo *ivinfo, IndexBulkDeleteResult *istat,
- VacDeadItems *dead_items)
+ TIDStore *dead_items)
{
/* Do bulk deletion */
istat = index_bulk_delete(ivinfo, istat, vac_tid_reaped,
(void *) dead_items);
ereport(ivinfo->message_level,
- (errmsg("scanned index \"%s\" to remove %d row versions",
+ (errmsg("scanned index \"%s\" to remove " UINT64_FORMAT " row versions",
RelationGetRelationName(ivinfo->index),
- dead_items->num_items)));
+ tidstore_num_tids(dead_items))));
return istat;
}
@@ -2316,18 +2315,6 @@ vac_cleanup_one_index(IndexVacuumInfo *ivinfo, IndexBulkDeleteResult *istat)
return istat;
}
-/*
- * Returns the total required space for VACUUM's dead_items array given a
- * max_items value.
- */
-Size
-vac_max_items_to_alloc_size(int max_items)
-{
- Assert(max_items <= MAXDEADITEMS(MaxAllocSize));
-
- return offsetof(VacDeadItems, items) + sizeof(ItemPointerData) * max_items;
-}
-
/*
* vac_tid_reaped() -- is a particular tid deletable?
*
@@ -2338,60 +2325,7 @@ vac_max_items_to_alloc_size(int max_items)
static bool
vac_tid_reaped(ItemPointer itemptr, void *state)
{
- VacDeadItems *dead_items = (VacDeadItems *) state;
- int64 litem,
- ritem,
- item;
- ItemPointer res;
-
- litem = itemptr_encode(&dead_items->items[0]);
- ritem = itemptr_encode(&dead_items->items[dead_items->num_items - 1]);
- item = itemptr_encode(itemptr);
-
- /*
- * Doing a simple bound check before bsearch() is useful to avoid the
- * extra cost of bsearch(), especially if dead items on the heap are
- * concentrated in a certain range. Since this function is called for
- * every index tuple, it pays to be really fast.
- */
- if (item < litem || item > ritem)
- return false;
-
- res = (ItemPointer) bsearch((void *) itemptr,
- (void *) dead_items->items,
- dead_items->num_items,
- sizeof(ItemPointerData),
- vac_cmp_itemptr);
-
- return (res != NULL);
-}
-
-/*
- * Comparator routines for use with qsort() and bsearch().
- */
-static int
-vac_cmp_itemptr(const void *left, const void *right)
-{
- BlockNumber lblk,
- rblk;
- OffsetNumber loff,
- roff;
-
- lblk = ItemPointerGetBlockNumber((ItemPointer) left);
- rblk = ItemPointerGetBlockNumber((ItemPointer) right);
-
- if (lblk < rblk)
- return -1;
- if (lblk > rblk)
- return 1;
-
- loff = ItemPointerGetOffsetNumber((ItemPointer) left);
- roff = ItemPointerGetOffsetNumber((ItemPointer) right);
-
- if (loff < roff)
- return -1;
- if (loff > roff)
- return 1;
+ TIDStore *dead_items = (TIDStore *) state;
- return 0;
+ return tidstore_lookup_tid(dead_items, itemptr);
}
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index f26d796e52..429607d5fa 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -44,7 +44,7 @@
* use small integers.
*/
#define PARALLEL_VACUUM_KEY_SHARED 1
-#define PARALLEL_VACUUM_KEY_DEAD_ITEMS 2
+#define PARALLEL_VACUUM_KEY_DSA 2
#define PARALLEL_VACUUM_KEY_QUERY_TEXT 3
#define PARALLEL_VACUUM_KEY_BUFFER_USAGE 4
#define PARALLEL_VACUUM_KEY_WAL_USAGE 5
@@ -103,6 +103,9 @@ typedef struct PVShared
/* Counter for vacuuming and cleanup */
pg_atomic_uint32 idx;
+
+ /* Handle of the shared TIDStore */
+ tidstore_handle dead_items_handle;
} PVShared;
/* Status used during parallel index vacuum or cleanup */
@@ -166,7 +169,8 @@ struct ParallelVacuumState
PVIndStats *indstats;
/* Shared dead items space among parallel vacuum workers */
- VacDeadItems *dead_items;
+ TIDStore *dead_items;
+ dsa_area *dead_items_area;
/* Points to buffer usage area in DSM */
BufferUsage *buffer_usage;
@@ -222,20 +226,23 @@ static void parallel_vacuum_error_callback(void *arg);
*/
ParallelVacuumState *
parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
- int nrequested_workers, int max_items,
- int elevel, BufferAccessStrategy bstrategy)
+ int nrequested_workers, int vac_work_mem,
+ int elevel,
+ BufferAccessStrategy bstrategy)
{
ParallelVacuumState *pvs;
ParallelContext *pcxt;
PVShared *shared;
- VacDeadItems *dead_items;
+ TIDStore *dead_items;
PVIndStats *indstats;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
+ void *area_space;
+ dsa_area *dead_items_dsa;
bool *will_parallel_vacuum;
Size est_indstats_len;
Size est_shared_len;
- Size est_dead_items_len;
+ Size dsa_minsize = dsa_minimum_size();
int nindexes_mwm = 0;
int parallel_workers = 0;
int querylen;
@@ -283,9 +290,8 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_estimate_chunk(&pcxt->estimator, est_shared_len);
shm_toc_estimate_keys(&pcxt->estimator, 1);
- /* Estimate size for dead_items -- PARALLEL_VACUUM_KEY_DEAD_ITEMS */
- est_dead_items_len = vac_max_items_to_alloc_size(max_items);
- shm_toc_estimate_chunk(&pcxt->estimator, est_dead_items_len);
+ /* Estimate size for dead tuple DSA -- PARALLEL_VACUUM_KEY_DSA */
+ shm_toc_estimate_chunk(&pcxt->estimator, dsa_minsize);
shm_toc_estimate_keys(&pcxt->estimator, 1);
/*
@@ -351,6 +357,16 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_INDEX_STATS, indstats);
pvs->indstats = indstats;
+ /* Prepare DSA space for dead items */
+ area_space = shm_toc_allocate(pcxt->toc, dsa_minsize);
+ shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_DSA, area_space);
+ dead_items_dsa = dsa_create_in_place(area_space, dsa_minsize,
+ LWTRANCHE_PARALLEL_VACUUM_DSA,
+ pcxt->seg);
+ dead_items = tidstore_create(vac_work_mem, dead_items_dsa);
+ pvs->dead_items = dead_items;
+ pvs->dead_items_area = dead_items_dsa;
+
/* Prepare shared information */
shared = (PVShared *) shm_toc_allocate(pcxt->toc, est_shared_len);
MemSet(shared, 0, est_shared_len);
@@ -360,6 +376,7 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
(nindexes_mwm > 0) ?
maintenance_work_mem / Min(parallel_workers, nindexes_mwm) :
maintenance_work_mem;
+ shared->dead_items_handle = tidstore_get_handle(dead_items);
pg_atomic_init_u32(&(shared->cost_balance), 0);
pg_atomic_init_u32(&(shared->active_nworkers), 0);
@@ -368,15 +385,6 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_SHARED, shared);
pvs->shared = shared;
- /* Prepare the dead_items space */
- dead_items = (VacDeadItems *) shm_toc_allocate(pcxt->toc,
- est_dead_items_len);
- dead_items->max_items = max_items;
- dead_items->num_items = 0;
- MemSet(dead_items->items, 0, sizeof(ItemPointerData) * max_items);
- shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_DEAD_ITEMS, dead_items);
- pvs->dead_items = dead_items;
-
/*
* Allocate space for each worker's BufferUsage and WalUsage; no need to
* initialize
@@ -434,6 +442,9 @@ parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats)
istats[i] = NULL;
}
+ tidstore_free(pvs->dead_items);
+ dsa_detach(pvs->dead_items_area);
+
DestroyParallelContext(pvs->pcxt);
ExitParallelMode();
@@ -442,7 +453,7 @@ parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats)
}
/* Returns the dead items space */
-VacDeadItems *
+TIDStore *
parallel_vacuum_get_dead_items(ParallelVacuumState *pvs)
{
return pvs->dead_items;
@@ -940,7 +951,9 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
Relation *indrels;
PVIndStats *indstats;
PVShared *shared;
- VacDeadItems *dead_items;
+ TIDStore *dead_items;
+ void *area_space;
+ dsa_area *dead_items_area;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
int nindexes;
@@ -984,10 +997,10 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
PARALLEL_VACUUM_KEY_INDEX_STATS,
false);
- /* Set dead_items space */
- dead_items = (VacDeadItems *) shm_toc_lookup(toc,
- PARALLEL_VACUUM_KEY_DEAD_ITEMS,
- false);
+ /* Set dead items */
+ area_space = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_DSA, false);
+ dead_items_area = dsa_attach_in_place(area_space, seg);
+ dead_items = tidstore_attach(dead_items_area, shared->dead_items_handle);
/* Set cost-based vacuum delay */
VacuumCostActive = (VacuumCostDelay > 0);
@@ -1033,6 +1046,9 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
&wal_usage[ParallelWorkerNumber]);
+ tidstore_detach(pvs.dead_items);
+ dsa_detach(dead_items_area);
+
/* Pop the error context stack */
error_context_stack = errcallback.previous;
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 528b2e9643..ea8cf6283b 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -186,6 +186,8 @@ static const char *const BuiltinTrancheNames[] = {
"PgStatsHash",
/* LWTRANCHE_PGSTATS_DATA: */
"PgStatsData",
+ /* LWTRANCHE_PARALLEL_VACUUM_DSA: */
+ "ParallelVacuumDSA",
};
StaticAssertDecl(lengthof(BuiltinTrancheNames) ==
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 1bf14eec66..5d9808977e 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2280,7 +2280,7 @@ struct config_int ConfigureNamesInt[] =
GUC_UNIT_KB
},
&maintenance_work_mem,
- 65536, 1024, MAX_KILOBYTES,
+ 65536, 2048, MAX_KILOBYTES,
NULL, NULL, NULL
},
diff --git a/src/include/access/tidstore.h b/src/include/access/tidstore.h
new file mode 100644
index 0000000000..4a7ab3f5a8
--- /dev/null
+++ b/src/include/access/tidstore.h
@@ -0,0 +1,49 @@
+/*-------------------------------------------------------------------------
+ *
+ * tidstore.h
+ * TID storage.
+ *
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/tidstore.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef TIDSTORE_H
+#define TIDSTORE_H
+
+#include "lib/radixtree.h"
+#include "storage/itemptr.h"
+
+typedef dsa_pointer tidstore_handle;
+
+typedef struct TIDStore TIDStore;
+typedef struct TIDStoreIter TIDStoreIter;
+
+typedef struct TIDStoreIterResult
+{
+ BlockNumber blkno;
+ OffsetNumber offsets[MaxOffsetNumber]; /* XXX: usually don't use up */
+ int num_offsets;
+} TIDStoreIterResult;
+
+extern TIDStore *tidstore_create(uint64 max_bytes, dsa_area *dsa);
+extern TIDStore *tidstore_attach(dsa_area *dsa, dsa_pointer handle);
+extern void tidstore_detach(TIDStore *ts);
+extern void tidstore_free(TIDStore *ts);
+extern void tidstore_reset(TIDStore *ts);
+extern void tidstore_add_tids(TIDStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets);
+extern bool tidstore_lookup_tid(TIDStore *ts, ItemPointer tid);
+extern TIDStoreIter * tidstore_begin_iterate(TIDStore *ts);
+extern TIDStoreIterResult *tidstore_iterate_next(TIDStoreIter *iter);
+extern uint64 tidstore_num_tids(TIDStore *ts);
+extern bool tidstore_is_full(TIDStore *ts);
+extern uint64 tidstore_max_memory(TIDStore *ts);
+extern uint64 tidstore_memory_usage(TIDStore *ts);
+extern tidstore_handle tidstore_get_handle(TIDStore *ts);
+
+#endif /* TIDSTORE_H */
+
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index a28938caf4..75d540d315 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -23,8 +23,8 @@
#define PROGRESS_VACUUM_HEAP_BLKS_SCANNED 2
#define PROGRESS_VACUUM_HEAP_BLKS_VACUUMED 3
#define PROGRESS_VACUUM_NUM_INDEX_VACUUMS 4
-#define PROGRESS_VACUUM_MAX_DEAD_TUPLES 5
-#define PROGRESS_VACUUM_NUM_DEAD_TUPLES 6
+#define PROGRESS_VACUUM_MAX_DEAD_TUPLE_BYTES 5
+#define PROGRESS_VACUUM_DEAD_TUPLE_BYTES 6
/* Phases of vacuum (as advertised via PROGRESS_VACUUM_PHASE) */
#define PROGRESS_VACUUM_PHASE_SCAN_HEAP 1
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 4e4bc26a8b..afe61c21fd 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -17,6 +17,7 @@
#include "access/htup.h"
#include "access/genam.h"
#include "access/parallel.h"
+#include "access/tidstore.h"
#include "catalog/pg_class.h"
#include "catalog/pg_statistic.h"
#include "catalog/pg_type.h"
@@ -235,21 +236,6 @@ typedef struct VacuumParams
int nworkers;
} VacuumParams;
-/*
- * VacDeadItems stores TIDs whose index tuples are deleted by index vacuuming.
- */
-typedef struct VacDeadItems
-{
- int max_items; /* # slots allocated in array */
- int num_items; /* current # of entries */
-
- /* Sorted array of TIDs to delete from indexes */
- ItemPointerData items[FLEXIBLE_ARRAY_MEMBER];
-} VacDeadItems;
-
-#define MAXDEADITEMS(avail_mem) \
- (((avail_mem) - offsetof(VacDeadItems, items)) / sizeof(ItemPointerData))
-
/* GUC parameters */
extern PGDLLIMPORT int default_statistics_target; /* PGDLLIMPORT for PostGIS */
extern PGDLLIMPORT int vacuum_freeze_min_age;
@@ -302,18 +288,17 @@ extern Relation vacuum_open_relation(Oid relid, RangeVar *relation,
LOCKMODE lmode);
extern IndexBulkDeleteResult *vac_bulkdel_one_index(IndexVacuumInfo *ivinfo,
IndexBulkDeleteResult *istat,
- VacDeadItems *dead_items);
+ TIDStore *dead_items);
extern IndexBulkDeleteResult *vac_cleanup_one_index(IndexVacuumInfo *ivinfo,
IndexBulkDeleteResult *istat);
-extern Size vac_max_items_to_alloc_size(int max_items);
/* in commands/vacuumparallel.c */
extern ParallelVacuumState *parallel_vacuum_init(Relation rel, Relation *indrels,
int nindexes, int nrequested_workers,
- int max_items, int elevel,
- BufferAccessStrategy bstrategy);
+ int vac_work_mem,
+ int elevel, BufferAccessStrategy bstrategy);
extern void parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats);
-extern VacDeadItems *parallel_vacuum_get_dead_items(ParallelVacuumState *pvs);
+extern TIDStore *parallel_vacuum_get_dead_items(ParallelVacuumState *pvs);
extern void parallel_vacuum_bulkdel_all_indexes(ParallelVacuumState *pvs,
long num_table_tuples,
int num_index_scans);
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index dd818e16ab..f1e0bcede5 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -204,6 +204,7 @@ typedef enum BuiltinTrancheIds
LWTRANCHE_PGSTATS_DSA,
LWTRANCHE_PGSTATS_HASH,
LWTRANCHE_PGSTATS_DATA,
+ LWTRANCHE_PARALLEL_VACUUM_DSA,
LWTRANCHE_FIRST_USER_DEFINED
} BuiltinTrancheIds;
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index fb9f936d43..0c49354f04 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2020,8 +2020,8 @@ pg_stat_progress_vacuum| SELECT s.pid,
s.param3 AS heap_blks_scanned,
s.param4 AS heap_blks_vacuumed,
s.param5 AS index_vacuum_count,
- s.param6 AS max_dead_tuples,
- s.param7 AS num_dead_tuples
+ s.param6 AS max_dead_tuple_bytes,
+ s.param7 AS dead_tuple_bytes
FROM (pg_stat_get_progress_info('VACUUM'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
LEFT JOIN pg_database d ON ((s.datid = d.oid)));
pg_stat_recovery_prefetch| SELECT s.stats_reset,
--
2.31.1
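
For reference, here is a minimal usage sketch of the tidstore.h interface added above, assuming only the declarations in that header (the block number, offsets, and memory budget are made up for illustration; error handling and the surrounding vacuum logic are omitted):

/* Sketch only: not code from the patch. */
TIDStore   *dead_items;
OffsetNumber offsets[2] = {1, 3};
ItemPointerData tid;

/* NULL dsa_area makes a backend-local store; pass a dsa_area to share it with workers */
dead_items = tidstore_create((uint64) maintenance_work_mem * 1024, NULL);

/* heap scan phase: remember dead item offsets per block */
tidstore_add_tids(dead_items, (BlockNumber) 42, offsets, 2);

/* index vacuum phase: existence check for each index tuple's heap TID */
ItemPointerSet(&tid, 42, 1);
if (tidstore_lookup_tid(dead_items, &tid))
{
    /* this index tuple points to a dead heap tuple and can be deleted */
}

tidstore_free(dead_items);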
Attachment: v14-0007-PoC-DSA-support-for-radix-tree.patch (application/octet-stream)
From d575b8f8215494d9ac82b256b260acd921de1928 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Tue, 6 Dec 2022 16:42:55 +0700
Subject: [PATCH v14 7/9] PoC: DSA support for radix tree
---
.../bench_radix_tree--1.0.sql | 2 +
contrib/bench_radix_tree/bench_radix_tree.c | 16 +-
src/backend/lib/radixtree.c | 437 ++++++++++++++----
src/backend/utils/mmgr/dsa.c | 12 +
src/include/lib/radixtree.h | 8 +-
src/include/utils/dsa.h | 1 +
.../expected/test_radixtree.out | 25 +
.../modules/test_radixtree/test_radixtree.c | 147 ++++--
8 files changed, 502 insertions(+), 146 deletions(-)
diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
index 83529805fc..d9216d715c 100644
--- a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
+++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
@@ -7,6 +7,7 @@ create function bench_shuffle_search(
minblk int4,
maxblk int4,
random_block bool DEFAULT false,
+shared bool DEFAULT false,
OUT nkeys int8,
OUT rt_mem_allocated int8,
OUT array_mem_allocated int8,
@@ -23,6 +24,7 @@ create function bench_seq_search(
minblk int4,
maxblk int4,
random_block bool DEFAULT false,
+shared bool DEFAULT false,
OUT nkeys int8,
OUT rt_mem_allocated int8,
OUT array_mem_allocated int8,
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c
index a0693695e6..1a26722495 100644
--- a/contrib/bench_radix_tree/bench_radix_tree.c
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -154,6 +154,8 @@ bench_search(FunctionCallInfo fcinfo, bool shuffle)
BlockNumber maxblk = PG_GETARG_INT32(1);
bool random_block = PG_GETARG_BOOL(2);
radix_tree *rt = NULL;
+ bool shared = PG_GETARG_BOOL(3);
+ dsa_area *dsa = NULL;
uint64 ntids;
uint64 key;
uint64 last_key = PG_UINT64_MAX;
@@ -176,7 +178,11 @@ bench_search(FunctionCallInfo fcinfo, bool shuffle)
tids = generate_tids(minblk, maxblk, TIDS_PER_BLOCK_FOR_LOAD, &ntids, random_block);
/* measure the load time of the radix tree */
- rt = rt_create(CurrentMemoryContext);
+ if (shared)
+ dsa = dsa_create(LWLockNewTrancheId());
+ rt = rt_create(CurrentMemoryContext, dsa);
+
+ /* measure the load time of the radix tree */
start_time = GetCurrentTimestamp();
for (int i = 0; i < ntids; i++)
{
@@ -327,7 +333,7 @@ bench_load_random_int(PG_FUNCTION_ARGS)
elog(ERROR, "return type must be a row type");
pg_prng_seed(&state, 0);
- rt = rt_create(CurrentMemoryContext);
+ rt = rt_create(CurrentMemoryContext, NULL);
start_time = GetCurrentTimestamp();
for (uint64 i = 0; i < cnt; i++)
@@ -393,7 +399,7 @@ bench_search_random_nodes(PG_FUNCTION_ARGS)
}
elog(NOTICE, "bench with filter 0x%lX", filter);
- rt = rt_create(CurrentMemoryContext);
+ rt = rt_create(CurrentMemoryContext, NULL);
for (uint64 i = 0; i < cnt; i++)
{
@@ -462,7 +468,7 @@ bench_fixed_height_search(PG_FUNCTION_ARGS)
if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
elog(ERROR, "return type must be a row type");
- rt = rt_create(CurrentMemoryContext);
+ rt = rt_create(CurrentMemoryContext, NULL);
start_time = GetCurrentTimestamp();
@@ -574,7 +580,7 @@ bench_node128_load(PG_FUNCTION_ARGS)
if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
elog(ERROR, "return type must be a row type");
- rt = rt_create(CurrentMemoryContext);
+ rt = rt_create(CurrentMemoryContext, NULL);
key_id = 0;
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
index bff37a2c35..455071cbab 100644
--- a/src/backend/lib/radixtree.c
+++ b/src/backend/lib/radixtree.c
@@ -22,6 +22,15 @@
* choose it to avoid an additional pointer traversal. It is the reason this code
* currently does not support variable-length keys.
*
+ * If a DSA area is specified for rt_create(), the radix tree is created in the
+ * DSA area so that multiple processes can access it simultaneously. The process
+ * that creates the shared radix tree needs to pass both the DSA area specified
+ * when calling rt_create() and the dsa_pointer of the radix tree, fetched by
+ * rt_get_handle(), to other processes so that they can attach via rt_attach().
+ *
+ * XXX: the shared radix tree is still at the PoC stage as it doesn't have any
+ * locking support. Also, only one process at a time can iterate over it.
+ *
* XXX: Most functions in this file have two variants for inner nodes and leaf
* nodes, therefore there is duplicated code. While this sometimes makes the
* code maintenance tricky, this reduces branch prediction misses when judging
@@ -34,6 +43,9 @@
*
* rt_create - Create a new, empty radix tree
* rt_free - Free the radix tree
+ * rt_attach - Attach to the radix tree
+ * rt_detach - Detach from the radix tree
+ * rt_get_handle - Return the handle of the radix tree
* rt_search - Search a key-value pair
* rt_set - Set a key-value pair
* rt_delete - Delete a key-value pair
@@ -65,6 +77,7 @@
#include "nodes/bitmapset.h"
#include "port/pg_bitutils.h"
#include "port/pg_lfind.h"
+#include "utils/dsa.h"
#include "utils/memutils.h"
#ifdef RT_DEBUG
@@ -426,6 +439,10 @@ static rt_size_class kind_min_size_class[RT_NODE_KIND_COUNT] = {
* construct the key whenever updating the node iteration information, e.g., when
* advancing the current index within the node or when moving to the next node
* at the same level.
+ *
+ * XXX: We need either a safeguard that disallows other processes from beginning
+ * an iteration while one is in progress, or support for multiple processes
+ * iterating concurrently.
*/
typedef struct rt_node_iter
{
@@ -445,23 +462,43 @@ struct rt_iter
uint64 key;
};
-/* A radix tree with nodes */
-struct radix_tree
+/* A magic value used to identify our radix tree */
+#define RADIXTREE_MAGIC 0x54A48167
+
+/* Control information for a radix tree */
+typedef struct radix_tree_control
{
- MemoryContext context;
+ rt_handle handle;
+ uint32 magic;
+ /* Root node */
rt_pointer root;
+
uint64 max_val;
uint64 num_keys;
- MemoryContextData *inner_slabs[RT_SIZE_CLASS_COUNT];
- MemoryContextData *leaf_slabs[RT_SIZE_CLASS_COUNT];
-
/* statistics */
#ifdef RT_DEBUG
int32 cnt[RT_SIZE_CLASS_COUNT];
#endif
+} radix_tree_control;
+
+/* A radix tree with nodes */
+struct radix_tree
+{
+ MemoryContext context;
+
+ /* control object in either backend-local memory or DSA */
+ radix_tree_control *ctl;
+
+ /* used only when the radix tree is shared */
+ dsa_area *area;
+
+ /* used only when the radix tree is private */
+ MemoryContextData *inner_slabs[RT_SIZE_CLASS_COUNT];
+ MemoryContextData *leaf_slabs[RT_SIZE_CLASS_COUNT];
};
+#define RadixTreeIsShared(rt) ((rt)->area != NULL)
static void rt_new_root(radix_tree *tree, uint64 key);
@@ -490,9 +527,12 @@ static void rt_verify_node(rt_node_ptr node);
/* Decode and encode functions of rt_pointer */
static inline rt_node *
-rt_pointer_decode(rt_pointer encoded)
+rt_pointer_decode(radix_tree *tree, rt_pointer encoded)
{
- return (rt_node *) encoded;
+ if (RadixTreeIsShared(tree))
+ return (rt_node *) dsa_get_address(tree->area, encoded);
+ else
+ return (rt_node *) encoded;
}
static inline rt_pointer
@@ -503,11 +543,11 @@ rt_pointer_encode(rt_node *decoded)
/* Return a rt_node_ptr created from the given encoded pointer */
static inline rt_node_ptr
-rt_node_ptr_encoded(rt_pointer encoded)
+rt_node_ptr_encoded(radix_tree *tree, rt_pointer encoded)
{
return (rt_node_ptr) {
.encoded = encoded,
- .decoded = rt_pointer_decode(encoded),
+ .decoded = rt_pointer_decode(tree, encoded)
};
}
@@ -954,8 +994,8 @@ rt_new_root(radix_tree *tree, uint64 key)
rt_init_node(newnode, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
NODE_SHIFT(newnode) = shift;
- tree->max_val = shift_get_max_val(shift);
- tree->root = newnode.encoded;
+ tree->ctl->max_val = shift_get_max_val(shift);
+ tree->ctl->root = newnode.encoded;
}
/*
@@ -964,20 +1004,35 @@ rt_new_root(radix_tree *tree, uint64 key)
static rt_node_ptr
rt_alloc_node(radix_tree *tree, rt_size_class size_class, bool inner)
{
- rt_node_ptr newnode;
+ rt_node_ptr newnode;
- if (inner)
- newnode.decoded = (rt_node *) MemoryContextAlloc(tree->inner_slabs[size_class],
- rt_size_class_info[size_class].inner_size);
- else
- newnode.decoded = (rt_node *) MemoryContextAlloc(tree->leaf_slabs[size_class],
- rt_size_class_info[size_class].leaf_size);
+ if (RadixTreeIsShared(tree))
+ {
+ dsa_pointer dp;
- newnode.encoded = rt_pointer_encode(newnode.decoded);
+ if (inner)
+ dp = dsa_allocate(tree->area, rt_size_class_info[size_class].inner_size);
+ else
+ dp = dsa_allocate(tree->area, rt_size_class_info[size_class].leaf_size);
+
+ newnode.encoded = (rt_pointer) dp;
+ newnode.decoded = rt_pointer_decode(tree, newnode.encoded);
+ }
+ else
+ {
+ if (inner)
+ newnode.decoded = (rt_node *) MemoryContextAlloc(tree->inner_slabs[size_class],
+ rt_size_class_info[size_class].inner_size);
+ else
+ newnode.decoded = (rt_node *) MemoryContextAlloc(tree->leaf_slabs[size_class],
+ rt_size_class_info[size_class].leaf_size);
+
+ newnode.encoded = rt_pointer_encode(newnode.decoded);
+ }
#ifdef RT_DEBUG
/* update the statistics */
- tree->cnt[size_class]++;
+ tree->ctl->cnt[size_class]++;
#endif
return newnode;
@@ -1041,10 +1096,10 @@ static void
rt_free_node(radix_tree *tree, rt_node_ptr node)
{
/* If we're deleting the root node, make the tree empty */
- if (tree->root == node.encoded)
+ if (tree->ctl->root == node.encoded)
{
- tree->root = InvalidRTPointer;
- tree->max_val = 0;
+ tree->ctl->root = InvalidRTPointer;
+ tree->ctl->max_val = 0;
}
#ifdef RT_DEBUG
@@ -1062,12 +1117,15 @@ rt_free_node(radix_tree *tree, rt_node_ptr node)
if (i == RT_SIZE_CLASS_COUNT)
i = RT_CLASS_256;
- tree->cnt[i]--;
- Assert(tree->cnt[i] >= 0);
+ tree->ctl->cnt[i]--;
+ Assert(tree->ctl->cnt[i] >= 0);
}
#endif
- pfree(node.decoded);
+ if (RadixTreeIsShared(tree))
+ dsa_free(tree->area, (dsa_pointer) node.encoded);
+ else
+ pfree(node.decoded);
}
/*
@@ -1083,7 +1141,7 @@ rt_replace_node(radix_tree *tree, rt_node_ptr parent, rt_node_ptr old_child,
if (rt_node_ptr_eq(&parent, &old_child))
{
/* Replace the root node with the new large node */
- tree->root = new_child.encoded;
+ tree->ctl->root = new_child.encoded;
}
else
{
@@ -1105,7 +1163,7 @@ static void
rt_extend(radix_tree *tree, uint64 key)
{
int target_shift;
- rt_node *root = rt_pointer_decode(tree->root);
+ rt_node *root = rt_pointer_decode(tree, tree->ctl->root);
int shift = root->shift + RT_NODE_SPAN;
target_shift = key_get_shift(key);
@@ -1123,15 +1181,15 @@ rt_extend(radix_tree *tree, uint64 key)
n4->base.n.shift = shift;
n4->base.n.count = 1;
n4->base.chunks[0] = 0;
- n4->children[0] = tree->root;
+ n4->children[0] = tree->ctl->root;
root->chunk = 0;
- tree->root = node.encoded;
+ tree->ctl->root = node.encoded;
shift += RT_NODE_SPAN;
}
- tree->max_val = shift_get_max_val(target_shift);
+ tree->ctl->max_val = shift_get_max_val(target_shift);
}
/*
@@ -1163,7 +1221,7 @@ rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node_ptr parent,
}
rt_node_insert_leaf(tree, parent, node, key, value);
- tree->num_keys++;
+ tree->ctl->num_keys++;
}
/*
@@ -1174,12 +1232,11 @@ rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node_ptr parent,
* pointer is set to child_p.
*/
static inline bool
-rt_node_search_inner(rt_node_ptr node, uint64 key, rt_action action,
- rt_pointer *child_p)
+rt_node_search_inner(rt_node_ptr node, uint64 key, rt_action action, rt_pointer *child_p)
{
uint8 chunk = RT_GET_KEY_CHUNK(key, NODE_SHIFT(node));
bool found = false;
- rt_pointer child;
+ rt_pointer child = InvalidRTPointer;
switch (NODE_KIND(node))
{
@@ -1210,6 +1267,7 @@ rt_node_search_inner(rt_node_ptr node, uint64 key, rt_action action,
break;
found = true;
+
if (action == RT_ACTION_FIND)
child = n32->children[idx];
else /* RT_ACTION_DELETE */
@@ -1761,33 +1819,51 @@ retry_insert_leaf_32:
* Create the radix tree in the given memory context and return it.
*/
radix_tree *
-rt_create(MemoryContext ctx)
+rt_create(MemoryContext ctx, dsa_area *area)
{
radix_tree *tree;
MemoryContext old_ctx;
old_ctx = MemoryContextSwitchTo(ctx);
- tree = palloc(sizeof(radix_tree));
+ tree = (radix_tree *) palloc0(sizeof(radix_tree));
tree->context = ctx;
- tree->root = InvalidRTPointer;
- tree->max_val = 0;
- tree->num_keys = 0;
+
+ if (area != NULL)
+ {
+ dsa_pointer dp;
+
+ tree->area = area;
+ dp = dsa_allocate0(area, sizeof(radix_tree_control));
+ tree->ctl = (radix_tree_control *) dsa_get_address(area, dp);
+ tree->ctl->handle = (rt_handle) dp;
+ }
+ else
+ {
+ tree->ctl = (radix_tree_control *) palloc0(sizeof(radix_tree_control));
+ tree->ctl->handle = InvalidDsaPointer;
+ }
+
+ tree->ctl->magic = RADIXTREE_MAGIC;
+ tree->ctl->root = InvalidRTPointer;
/* Create the slab allocator for each size class */
- for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ if (area == NULL)
{
- tree->inner_slabs[i] = SlabContextCreate(ctx,
- rt_size_class_info[i].name,
- rt_size_class_info[i].inner_blocksize,
- rt_size_class_info[i].inner_size);
- tree->leaf_slabs[i] = SlabContextCreate(ctx,
- rt_size_class_info[i].name,
- rt_size_class_info[i].leaf_blocksize,
- rt_size_class_info[i].leaf_size);
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ tree->inner_slabs[i] = SlabContextCreate(ctx,
+ rt_size_class_info[i].name,
+ rt_size_class_info[i].inner_blocksize,
+ rt_size_class_info[i].inner_size);
+ tree->leaf_slabs[i] = SlabContextCreate(ctx,
+ rt_size_class_info[i].name,
+ rt_size_class_info[i].leaf_blocksize,
+ rt_size_class_info[i].leaf_size);
#ifdef RT_DEBUG
- tree->cnt[i] = 0;
+ tree->ctl->cnt[i] = 0;
#endif
+ }
}
MemoryContextSwitchTo(old_ctx);
@@ -1795,16 +1871,163 @@ rt_create(MemoryContext ctx)
return tree;
}
+/*
+ * Get a handle that can be used by other processes to attach to this radix
+ * tree.
+ */
+dsa_pointer
+rt_get_handle(radix_tree *tree)
+{
+ Assert(RadixTreeIsShared(tree));
+ Assert(tree->ctl->magic == RADIXTREE_MAGIC);
+
+ return tree->ctl->handle;
+}
+
+/*
+ * Attach to an existing radix tree using a handle. The returned object is
+ * allocated in backend-local memory using the CurrentMemoryContext.
+ */
+radix_tree *
+rt_attach(dsa_area *area, rt_handle handle)
+{
+ radix_tree *tree;
+ dsa_pointer control;
+
+ /* Allocate the backend-local object representing the radix tree */
+ tree = (radix_tree *) palloc0(sizeof(radix_tree));
+
+ /* Find the control object in shared memory */
+ control = handle;
+
+ /* Set up the local radix tree */
+ tree->area = area;
+ tree->ctl = (radix_tree_control *) dsa_get_address(area, control);
+ Assert(tree->ctl->magic == RADIXTREE_MAGIC);
+
+ return tree;
+}
+
+/*
+ * Detach from a radix tree. This frees backend-local resources associated
+ * with the radix tree, but the radix tree will continue to exist until
+ * it is explicitly freed.
+ */
+void
+rt_detach(radix_tree *tree)
+{
+ Assert(RadixTreeIsShared(tree));
+ Assert(tree->ctl->magic == RADIXTREE_MAGIC);
+
+ pfree(tree);
+}
+
+/*
+ * Recursively free all nodes allocated in the DSA area.
+ */
+static void
+rt_free_recurse(radix_tree *tree, rt_pointer ptr)
+{
+ rt_node_ptr node = rt_node_ptr_encoded(tree, ptr);
+
+ Assert(RadixTreeIsShared(tree));
+
+ check_stack_depth();
+ CHECK_FOR_INTERRUPTS();
+
+ /* The leaf node doesn't have child pointers, so free it */
+ if (NODE_IS_LEAF(node))
+ {
+ dsa_free(tree->area, (dsa_pointer) node.encoded);
+ return;
+ }
+
+ switch (NODE_KIND(node))
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node.decoded;
+
+ /* Free all children recursively */
+ for (int i = 0; i < NODE_COUNT(node); i++)
+ rt_free_recurse(tree, n4->children[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node.decoded;
+
+ /* Free all children recursively */
+ for (int i = 0; i < NODE_COUNT(node); i++)
+ rt_free_recurse(tree, n32->children[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ rt_node_inner_125 *n125 = (rt_node_inner_125 *) node.decoded;
+
+ /* Free all children recursively */
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!node_125_is_chunk_used((rt_node_base_125 *) n125, i))
+ continue;
+
+ rt_free_recurse(tree, node_inner_125_get_child(n125, i));
+ }
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node.decoded;
+
+ /* Free all children recursively */
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!node_inner_256_is_chunk_used(n256, i))
+ continue;
+
+ rt_free_recurse(tree, node_inner_256_get_child(n256, i));
+ }
+ break;
+ }
+ }
+
+ /* Free the inner node itself */
+ dsa_free(tree->area, node.encoded);
+}
+
/*
* Free the given radix tree.
*/
void
rt_free(radix_tree *tree)
{
- for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ Assert(!RadixTreeIsShared(tree) || tree->ctl->magic == RADIXTREE_MAGIC);
+
+ if (RadixTreeIsShared(tree))
{
- MemoryContextDelete(tree->inner_slabs[i]);
- MemoryContextDelete(tree->leaf_slabs[i]);
+ /* Free all memory used for radix tree nodes */
+ if (RTPointerIsValid(tree->ctl->root))
+ rt_free_recurse(tree, tree->ctl->root);
+
+ /*
+ * Vandalize the control block to help catch programming errors where
+ * other backends access the memory formerly occupied by this radix tree.
+ */
+ tree->ctl->magic = 0;
+ dsa_free(tree->area, tree->ctl->handle);
+ }
+ else
+ {
+ /* Free all memory used for radix tree nodes */
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ MemoryContextDelete(tree->inner_slabs[i]);
+ MemoryContextDelete(tree->leaf_slabs[i]);
+ }
+ pfree(tree->ctl);
}
pfree(tree);
@@ -1822,16 +2045,18 @@ rt_set(radix_tree *tree, uint64 key, uint64 value)
rt_node_ptr node;
rt_node_ptr parent;
+ Assert(!RadixTreeIsShared(tree) || tree->ctl->magic == RADIXTREE_MAGIC);
+
/* Empty tree, create the root */
- if (!RTPointerIsValid(tree->root))
+ if (!RTPointerIsValid(tree->ctl->root))
rt_new_root(tree, key);
/* Extend the tree if necessary */
- if (key > tree->max_val)
+ if (key > tree->ctl->max_val)
rt_extend(tree, key);
/* Descend the tree until a leaf node */
- node = parent = rt_node_ptr_encoded(tree->root);
+ node = parent = rt_node_ptr_encoded(tree, tree->ctl->root);
shift = NODE_SHIFT(node);
while (shift >= 0)
{
@@ -1847,7 +2072,7 @@ rt_set(radix_tree *tree, uint64 key, uint64 value)
}
parent = node;
- node = rt_node_ptr_encoded(child);
+ node = rt_node_ptr_encoded(tree, child);
shift -= RT_NODE_SPAN;
}
@@ -1855,7 +2080,7 @@ rt_set(radix_tree *tree, uint64 key, uint64 value)
/* Update the statistics */
if (!updated)
- tree->num_keys++;
+ tree->ctl->num_keys++;
return updated;
}
@@ -1871,12 +2096,13 @@ rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
rt_node_ptr node;
int shift;
+ Assert(!RadixTreeIsShared(tree) || tree->ctl->magic == RADIXTREE_MAGIC);
Assert(value_p != NULL);
- if (!RTPointerIsValid(tree->root) || key > tree->max_val)
+ if (!RTPointerIsValid(tree->ctl->root) || key > tree->ctl->max_val)
return false;
- node = rt_node_ptr_encoded(tree->root);
+ node = rt_node_ptr_encoded(tree, tree->ctl->root);
shift = NODE_SHIFT(node);
/* Descend the tree until a leaf node */
@@ -1890,7 +2116,7 @@ rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
return false;
- node = rt_node_ptr_encoded(child);
+ node = rt_node_ptr_encoded(tree, child);
shift -= RT_NODE_SPAN;
}
@@ -1910,14 +2136,16 @@ rt_delete(radix_tree *tree, uint64 key)
int level;
bool deleted;
- if (!tree->root || key > tree->max_val)
+ Assert(!RadixTreeIsShared(tree) || tree->ctl->magic == RADIXTREE_MAGIC);
+
+ if (!RTPointerIsValid(tree->ctl->root) || key > tree->ctl->max_val)
return false;
/*
* Descend the tree to search the key while building a stack of nodes we
* visited.
*/
- node = rt_node_ptr_encoded(tree->root);
+ node = rt_node_ptr_encoded(tree, tree->ctl->root);
shift = NODE_SHIFT(node);
level = -1;
while (shift > 0)
@@ -1930,7 +2158,7 @@ rt_delete(radix_tree *tree, uint64 key)
if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
return false;
- node = rt_node_ptr_encoded(child);
+ node = rt_node_ptr_encoded(tree, child);
shift -= RT_NODE_SPAN;
}
@@ -1945,7 +2173,7 @@ rt_delete(radix_tree *tree, uint64 key)
}
/* Found the key to delete. Update the statistics */
- tree->num_keys--;
+ tree->ctl->num_keys--;
/*
* Return if the leaf node still has keys and we don't need to delete the
@@ -1985,16 +2213,18 @@ rt_begin_iterate(radix_tree *tree)
rt_iter *iter;
int top_level;
+ Assert(!RadixTreeIsShared(tree) || tree->ctl->magic == RADIXTREE_MAGIC);
+
old_ctx = MemoryContextSwitchTo(tree->context);
iter = (rt_iter *) palloc0(sizeof(rt_iter));
iter->tree = tree;
/* empty tree */
- if (!RTPointerIsValid(iter->tree) || !RTPointerIsValid(iter->tree->root))
+ if (!RTPointerIsValid(iter->tree) || !RTPointerIsValid(iter->tree->ctl->root))
return iter;
- root = rt_node_ptr_encoded(iter->tree->root);
+ root = rt_node_ptr_encoded(tree, iter->tree->ctl->root);
top_level = NODE_SHIFT(root) / RT_NODE_SPAN;
iter->stack_len = top_level;
@@ -2045,8 +2275,10 @@ rt_update_iter_stack(rt_iter *iter, rt_node_ptr from_node, int from)
bool
rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p)
{
+ Assert(!RadixTreeIsShared(iter->tree) || iter->tree->ctl->magic == RADIXTREE_MAGIC);
+
/* Empty tree */
- if (!iter->tree->root)
+ if (!iter->tree->ctl->root)
return false;
for (;;)
@@ -2190,7 +2422,7 @@ rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter, rt_node_ptr *
if (found)
{
rt_iter_update_key(iter, key_chunk, NODE_SHIFT(node));
- *child_p = rt_node_ptr_encoded(child);
+ *child_p = rt_node_ptr_encoded(iter->tree, child);
}
return found;
@@ -2293,7 +2525,7 @@ rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter, uint64 *value_
uint64
rt_num_entries(radix_tree *tree)
{
- return tree->num_keys;
+ return tree->ctl->num_keys;
}
/*
@@ -2302,12 +2534,19 @@ rt_num_entries(radix_tree *tree)
uint64
rt_memory_usage(radix_tree *tree)
{
- Size total = sizeof(radix_tree);
+ Size total = sizeof(radix_tree) + sizeof(radix_tree_control);
- for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ Assert(!RadixTreeIsShared(tree) || tree->ctl->magic == RADIXTREE_MAGIC);
+
+ if (RadixTreeIsShared(tree))
+ total = dsa_get_total_size(tree->area);
+ else
{
- total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
- total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
+ total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
+ }
}
return total;
@@ -2391,23 +2630,23 @@ rt_verify_node(rt_node_ptr node)
void
rt_stats(radix_tree *tree)
{
- rt_node *root = rt_pointer_decode(tree->root);
+ rt_node *root = rt_pointer_decode(tree, tree->ctl->root);
if (root == NULL)
return;
ereport(NOTICE, (errmsg("num_keys = " UINT64_FORMAT ", height = %d, n4 = %u, n15 = %u, n32 = %u, n125 = %u, n256 = %u",
- tree->num_keys,
+ tree->ctl->num_keys,
root->shift / RT_NODE_SPAN,
- tree->cnt[RT_CLASS_4_FULL],
- tree->cnt[RT_CLASS_32_PARTIAL],
- tree->cnt[RT_CLASS_32_FULL],
- tree->cnt[RT_CLASS_125_FULL],
- tree->cnt[RT_CLASS_256])));
+ tree->ctl->cnt[RT_CLASS_4_FULL],
+ tree->ctl->cnt[RT_CLASS_32_PARTIAL],
+ tree->ctl->cnt[RT_CLASS_32_FULL],
+ tree->ctl->cnt[RT_CLASS_125_FULL],
+ tree->ctl->cnt[RT_CLASS_256])));
}
static void
-rt_dump_node(rt_node_ptr node, int level, bool recurse)
+rt_dump_node(radix_tree *tree, rt_node_ptr node, int level, bool recurse)
{
rt_node *n = node.decoded;
char space[128] = {0};
@@ -2445,7 +2684,7 @@ rt_dump_node(rt_node_ptr node, int level, bool recurse)
space, n4->base.chunks[i]);
if (recurse)
- rt_dump_node(rt_node_ptr_encoded(n4->children[i]),
+ rt_dump_node(tree, rt_node_ptr_encoded(tree, n4->children[i]),
level + 1, recurse);
else
fprintf(stderr, "\n");
@@ -2473,7 +2712,7 @@ rt_dump_node(rt_node_ptr node, int level, bool recurse)
if (recurse)
{
- rt_dump_node(rt_node_ptr_encoded(n32->children[i]),
+ rt_dump_node(tree, rt_node_ptr_encoded(tree, n32->children[i]),
level + 1, recurse);
}
else
@@ -2526,7 +2765,9 @@ rt_dump_node(rt_node_ptr node, int level, bool recurse)
space, i);
if (recurse)
- rt_dump_node(rt_node_ptr_encoded(node_inner_125_get_child(n125, i)),
+ rt_dump_node(tree,
+ rt_node_ptr_encoded(tree,
+ node_inner_125_get_child(n125, i)),
level + 1, recurse);
else
fprintf(stderr, "\n");
@@ -2559,7 +2800,9 @@ rt_dump_node(rt_node_ptr node, int level, bool recurse)
space, i);
if (recurse)
- rt_dump_node(rt_node_ptr_encoded(node_inner_256_get_child(n256, i)),
+ rt_dump_node(tree,
+ rt_node_ptr_encoded(tree,
+ node_inner_256_get_child(n256, i)),
level + 1, recurse);
else
fprintf(stderr, "\n");
@@ -2579,28 +2822,28 @@ rt_dump_search(radix_tree *tree, uint64 key)
elog(NOTICE, "-----------------------------------------------------------");
elog(NOTICE, "max_val = " UINT64_FORMAT "(0x" UINT64_FORMAT_HEX ")",
- tree->max_val, tree->max_val);
+ tree->ctl->max_val, tree->ctl->max_val);
- if (!RTPointerIsValid(tree->root))
+ if (!RTPointerIsValid(tree->ctl->root))
{
elog(NOTICE, "tree is empty");
return;
}
- if (key > tree->max_val)
+ if (key > tree->ctl->max_val)
{
elog(NOTICE, "key " UINT64_FORMAT "(0x" UINT64_FORMAT_HEX ") is larger than max val",
key, key);
return;
}
- node = rt_node_ptr_encoded(tree->root);
+ node = rt_node_ptr_encoded(tree, tree->ctl->root);
shift = NODE_SHIFT(node);
while (shift >= 0)
{
rt_pointer child;
- rt_dump_node(node, level, false);
+ rt_dump_node(tree, node, level, false);
if (NODE_IS_LEAF(node))
{
@@ -2615,7 +2858,7 @@ rt_dump_search(radix_tree *tree, uint64 key)
if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
break;
- node = rt_node_ptr_encoded(child);
+ node = rt_node_ptr_encoded(tree, child);
shift -= RT_NODE_SPAN;
level++;
}
@@ -2633,15 +2876,15 @@ rt_dump(radix_tree *tree)
rt_size_class_info[i].inner_blocksize,
rt_size_class_info[i].leaf_size,
rt_size_class_info[i].leaf_blocksize);
- fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->max_val);
+ fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->ctl->max_val);
- if (!RTPointerIsValid(tree->root))
+ if (!RTPointerIsValid(tree->ctl->root))
{
fprintf(stderr, "empty tree\n");
return;
}
- root = rt_node_ptr_encoded(tree->root);
- rt_dump_node(root, 0, true);
+ root = rt_node_ptr_encoded(tree, tree->ctl->root);
+ rt_dump_node(tree, root, 0, true);
}
#endif
diff --git a/src/backend/utils/mmgr/dsa.c b/src/backend/utils/mmgr/dsa.c
index 82376fde2d..ad169882af 100644
--- a/src/backend/utils/mmgr/dsa.c
+++ b/src/backend/utils/mmgr/dsa.c
@@ -1024,6 +1024,18 @@ dsa_set_size_limit(dsa_area *area, size_t limit)
LWLockRelease(DSA_AREA_LOCK(area));
}
+size_t
+dsa_get_total_size(dsa_area *area)
+{
+ size_t size;
+
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_SHARED);
+ size = area->control->total_segment_size;
+ LWLockRelease(DSA_AREA_LOCK(area));
+
+ return size;
+}
+
/*
* Aggressively free all spare memory in the hope of returning DSM segments to
* the operating system.
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index d5d7668617..68a11df970 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -14,18 +14,24 @@
#define RADIXTREE_H
#include "postgres.h"
+#include "utils/dsa.h"
#define RT_DEBUG 1
typedef struct radix_tree radix_tree;
typedef struct rt_iter rt_iter;
+typedef dsa_pointer rt_handle;
-extern radix_tree *rt_create(MemoryContext ctx);
+extern radix_tree *rt_create(MemoryContext ctx, dsa_area *dsa);
extern void rt_free(radix_tree *tree);
extern bool rt_search(radix_tree *tree, uint64 key, uint64 *val_p);
extern bool rt_set(radix_tree *tree, uint64 key, uint64 val);
extern rt_iter *rt_begin_iterate(radix_tree *tree);
+extern rt_handle rt_get_handle(radix_tree *tree);
+extern radix_tree *rt_attach(dsa_area *dsa, dsa_pointer dp);
+extern void rt_detach(radix_tree *tree);
+
extern bool rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p);
extern void rt_end_iterate(rt_iter *iter);
extern bool rt_delete(radix_tree *tree, uint64 key);
diff --git a/src/include/utils/dsa.h b/src/include/utils/dsa.h
index 405606fe2f..dad06adecc 100644
--- a/src/include/utils/dsa.h
+++ b/src/include/utils/dsa.h
@@ -117,6 +117,7 @@ extern dsa_handle dsa_get_handle(dsa_area *area);
extern dsa_pointer dsa_allocate_extended(dsa_area *area, size_t size, int flags);
extern void dsa_free(dsa_area *area, dsa_pointer dp);
extern void *dsa_get_address(dsa_area *area, dsa_pointer dp);
+extern size_t dsa_get_total_size(dsa_area *area);
extern void dsa_trim(dsa_area *area);
extern void dsa_dump(dsa_area *area);
diff --git a/src/test/modules/test_radixtree/expected/test_radixtree.out b/src/test/modules/test_radixtree/expected/test_radixtree.out
index ce645cb8b5..a217e0d312 100644
--- a/src/test/modules/test_radixtree/expected/test_radixtree.out
+++ b/src/test/modules/test_radixtree/expected/test_radixtree.out
@@ -6,28 +6,53 @@ CREATE EXTENSION test_radixtree;
SELECT test_radixtree();
NOTICE: testing basic operations with leaf node 4
NOTICE: testing basic operations with inner node 4
+NOTICE: testing basic operations with leaf node 4
+NOTICE: testing basic operations with inner node 4
+NOTICE: testing basic operations with leaf node 32
+NOTICE: testing basic operations with inner node 32
NOTICE: testing basic operations with leaf node 32
NOTICE: testing basic operations with inner node 32
NOTICE: testing basic operations with leaf node 125
NOTICE: testing basic operations with inner node 125
+NOTICE: testing basic operations with leaf node 125
+NOTICE: testing basic operations with inner node 125
NOTICE: testing basic operations with leaf node 256
NOTICE: testing basic operations with inner node 256
+NOTICE: testing basic operations with leaf node 256
+NOTICE: testing basic operations with inner node 256
+NOTICE: testing radix tree node types with shift "0"
NOTICE: testing radix tree node types with shift "0"
NOTICE: testing radix tree node types with shift "8"
+NOTICE: testing radix tree node types with shift "8"
NOTICE: testing radix tree node types with shift "16"
+NOTICE: testing radix tree node types with shift "16"
+NOTICE: testing radix tree node types with shift "24"
NOTICE: testing radix tree node types with shift "24"
NOTICE: testing radix tree node types with shift "32"
+NOTICE: testing radix tree node types with shift "32"
NOTICE: testing radix tree node types with shift "40"
+NOTICE: testing radix tree node types with shift "40"
+NOTICE: testing radix tree node types with shift "48"
NOTICE: testing radix tree node types with shift "48"
NOTICE: testing radix tree node types with shift "56"
+NOTICE: testing radix tree node types with shift "56"
NOTICE: testing radix tree with pattern "all ones"
+NOTICE: testing radix tree with pattern "all ones"
+NOTICE: testing radix tree with pattern "alternating bits"
NOTICE: testing radix tree with pattern "alternating bits"
NOTICE: testing radix tree with pattern "clusters of ten"
+NOTICE: testing radix tree with pattern "clusters of ten"
NOTICE: testing radix tree with pattern "clusters of hundred"
+NOTICE: testing radix tree with pattern "clusters of hundred"
+NOTICE: testing radix tree with pattern "one-every-64k"
NOTICE: testing radix tree with pattern "one-every-64k"
NOTICE: testing radix tree with pattern "sparse"
+NOTICE: testing radix tree with pattern "sparse"
NOTICE: testing radix tree with pattern "single values, distance > 2^32"
+NOTICE: testing radix tree with pattern "single values, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^32"
NOTICE: testing radix tree with pattern "clusters, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^60"
NOTICE: testing radix tree with pattern "clusters, distance > 2^60"
test_radixtree
----------------
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
index ea993e63df..fe1e168ec4 100644
--- a/src/test/modules/test_radixtree/test_radixtree.c
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -19,6 +19,7 @@
#include "nodes/bitmapset.h"
#include "storage/block.h"
#include "storage/itemptr.h"
+#include "storage/lwlock.h"
#include "utils/memutils.h"
#include "utils/timestamp.h"
@@ -99,6 +100,8 @@ static const test_spec test_specs[] = {
}
};
+static int lwlock_tranche_id;
+
PG_MODULE_MAGIC;
PG_FUNCTION_INFO_V1(test_radixtree);
@@ -112,7 +115,7 @@ test_empty(void)
uint64 key;
uint64 val;
- radixtree = rt_create(CurrentMemoryContext);
+ radixtree = rt_create(CurrentMemoryContext, NULL);
if (rt_search(radixtree, 0, &dummy))
elog(ERROR, "rt_search on empty tree returned true");
@@ -140,17 +143,14 @@ test_empty(void)
}
static void
-test_basic(int children, bool test_inner)
+do_test_basic(radix_tree *radixtree, int children, bool test_inner)
{
- radix_tree *radixtree;
uint64 *keys;
int shift = test_inner ? 8 : 0;
elog(NOTICE, "testing basic operations with %s node %d",
test_inner ? "inner" : "leaf", children);
- radixtree = rt_create(CurrentMemoryContext);
-
/* prepare keys in order like 1, 32, 2, 31, 2, ... */
keys = palloc(sizeof(uint64) * children);
for (int i = 0; i < children; i++)
@@ -165,7 +165,7 @@ test_basic(int children, bool test_inner)
for (int i = 0; i < children; i++)
{
if (rt_set(radixtree, keys[i], keys[i]))
- elog(ERROR, "new inserted key 0x" UINT64_HEX_FORMAT " is found ", keys[i]);
+ elog(ERROR, "new inserted key 0x" UINT64_HEX_FORMAT " is found %d", keys[i], i);
}
/* update keys */
@@ -185,7 +185,38 @@ test_basic(int children, bool test_inner)
}
pfree(keys);
- rt_free(radixtree);
+}
+
+static void
+test_basic()
+{
+ for (int i = 1; i < lengthof(rt_node_kind_fanouts); i++)
+ {
+ radix_tree *tree;
+ dsa_area *area;
+
+ /* Test the local radix tree */
+ tree = rt_create(CurrentMemoryContext, NULL);
+ do_test_basic(tree, rt_node_kind_fanouts[i], false);
+ rt_free(tree);
+
+ tree = rt_create(CurrentMemoryContext, NULL);
+ do_test_basic(tree, rt_node_kind_fanouts[i], true);
+ rt_free(tree);
+
+ /* Test the shared radix tree */
+ area = dsa_create(lwlock_tranche_id);
+ tree = rt_create(CurrentMemoryContext, area);
+ do_test_basic(tree, rt_node_kind_fanouts[i], false);
+ rt_free(tree);
+ dsa_detach(area);
+
+ area = dsa_create(lwlock_tranche_id);
+ tree = rt_create(CurrentMemoryContext, area);
+ do_test_basic(tree, rt_node_kind_fanouts[i], true);
+ rt_free(tree);
+ dsa_detach(area);
+ }
}
/*
@@ -286,14 +317,10 @@ test_node_types_delete(radix_tree *radixtree, uint8 shift)
* level.
*/
static void
-test_node_types(uint8 shift)
+do_test_node_types(radix_tree *radixtree, uint8 shift)
{
- radix_tree *radixtree;
-
elog(NOTICE, "testing radix tree node types with shift \"%d\"", shift);
- radixtree = rt_create(CurrentMemoryContext);
-
/*
* Insert and search entries for every node type at the 'shift' level,
* then delete all entries to make it empty, and insert and search entries
@@ -302,19 +329,37 @@ test_node_types(uint8 shift)
test_node_types_insert(radixtree, shift, true);
test_node_types_delete(radixtree, shift);
test_node_types_insert(radixtree, shift, false);
+}
- rt_free(radixtree);
+static void
+test_node_types(void)
+{
+ for (int shift = 0; shift <= (64 - 8); shift += 8)
+ {
+ radix_tree *tree;
+ dsa_area *area;
+
+ /* Test the local radix tree */
+ tree = rt_create(CurrentMemoryContext, NULL);
+ do_test_node_types(tree, shift);
+ rt_free(tree);
+
+ /* Test the shared radix tree */
+ area = dsa_create(lwlock_tranche_id);
+ tree = rt_create(CurrentMemoryContext, area);
+ do_test_node_types(tree, shift);
+ rt_free(tree);
+ dsa_detach(area);
+ }
}
/*
* Test with a repeating pattern, defined by the 'spec'.
*/
static void
-test_pattern(const test_spec * spec)
+do_test_pattern(radix_tree *radixtree, const test_spec * spec)
{
- radix_tree *radixtree;
rt_iter *iter;
- MemoryContext radixtree_ctx;
TimestampTz starttime;
TimestampTz endtime;
uint64 n;
@@ -340,18 +385,6 @@ test_pattern(const test_spec * spec)
pattern_values[pattern_num_values++] = i;
}
- /*
- * Allocate the radix tree.
- *
- * Allocate it in a separate memory context, so that we can print its
- * memory usage easily.
- */
- radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
- "radixtree test",
- ALLOCSET_SMALL_SIZES);
- MemoryContextSetIdentifier(radixtree_ctx, spec->test_name);
- radixtree = rt_create(radixtree_ctx);
-
/*
* Add values to the set.
*/
@@ -405,8 +438,6 @@ test_pattern(const test_spec * spec)
mem_usage = rt_memory_usage(radixtree);
fprintf(stderr, "rt_memory_usage() reported " UINT64_FORMAT " (%0.2f bytes / integer)\n",
mem_usage, (double) mem_usage / spec->num_values);
-
- MemoryContextStats(radixtree_ctx);
}
/* Check that rt_num_entries works */
@@ -555,27 +586,57 @@ test_pattern(const test_spec * spec)
if ((nbefore - ndeleted) != nafter)
elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT "after " UINT64_FORMAT " deletion",
nafter, (nbefore - ndeleted), ndeleted);
+}
+
+static void
+test_patterns(void)
+{
+ /* Test different test patterns, with lots of entries */
+ for (int i = 0; i < lengthof(test_specs); i++)
+ {
+ radix_tree *tree;
+ MemoryContext radixtree_ctx;
+ dsa_area *area;
+ const test_spec *spec = &test_specs[i];
- MemoryContextDelete(radixtree_ctx);
+ /*
+ * Allocate the radix tree.
+ *
+ * Allocate it in a separate memory context, so that we can print its
+ * memory usage easily.
+ */
+ radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
+ "radixtree test",
+ ALLOCSET_SMALL_SIZES);
+ MemoryContextSetIdentifier(radixtree_ctx, spec->test_name);
+
+ /* Test the local radix tree */
+ tree = rt_create(radixtree_ctx, NULL);
+ do_test_pattern(tree, spec);
+ rt_free(tree);
+ MemoryContextReset(radixtree_ctx);
+
+ /* Test the shared radix tree */
+ area = dsa_create(lwlock_tranche_id);
+ tree = rt_create(radixtree_ctx, area);
+ do_test_pattern(tree, spec);
+ rt_free(tree);
+ dsa_detach(area);
+ MemoryContextDelete(radixtree_ctx);
+ }
}
Datum
test_radixtree(PG_FUNCTION_ARGS)
{
- test_empty();
+ /* get a new lwlock tranche id for all tests for shared radix tree */
+ lwlock_tranche_id = LWLockNewTrancheId();
- for (int i = 1; i < lengthof(rt_node_kind_fanouts); i++)
- {
- test_basic(rt_node_kind_fanouts[i], false);
- test_basic(rt_node_kind_fanouts[i], true);
- }
-
- for (int shift = 0; shift <= (64 - 8); shift += 8)
- test_node_types(shift);
+ test_empty();
+ test_basic();
- /* Test different test patterns, with lots of entries */
- for (int i = 0; i < lengthof(test_specs); i++)
- test_pattern(&test_specs[i]);
+ test_node_types();
+ test_patterns();
PG_RETURN_VOID();
}
--
2.31.1
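
As a reading aid for 0007, the create/attach workflow described in the new radixtree.c header comment could look roughly like the following. This is only a sketch against the patch's API, with an arbitrary key and value; per the XXX above there is no locking yet, so truly concurrent access is not safe at this stage.

/* Leader */
dsa_area   *area = dsa_create(LWLockNewTrancheId());
radix_tree *rt = rt_create(CurrentMemoryContext, area);
rt_handle   handle = rt_get_handle(rt);     /* pass this plus the DSA handle to workers */

rt_set(rt, UINT64CONST(42), UINT64CONST(100));

/* Worker, after attaching to the same dsa_area */
radix_tree *worker_rt = rt_attach(area, handle);
uint64      value;

if (rt_search(worker_rt, UINT64CONST(42), &value))
    Assert(value == 100);

rt_detach(worker_rt);

/* Leader, once all workers have detached */
rt_free(rt);
dsa_detach(area);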
Attachment: v14-0004-Use-bitmapword-for-node-125.patch (application/octet-stream)
From 066eada2c94025a273fa0e49763c6817fcc1906a Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Tue, 6 Dec 2022 15:22:26 +0700
Subject: [PATCH v14 4/9] Use bitmapword for node-125
TODO: Rename macros copied from bitmapset.c
---
src/backend/lib/radixtree.c | 70 ++++++++++++++++++-------------------
1 file changed, 34 insertions(+), 36 deletions(-)
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
index e7f61fd943..abd0450727 100644
--- a/src/backend/lib/radixtree.c
+++ b/src/backend/lib/radixtree.c
@@ -62,6 +62,7 @@
#include "lib/radixtree.h"
#include "lib/stringinfo.h"
#include "miscadmin.h"
+#include "nodes/bitmapset.h"
#include "port/pg_bitutils.h"
#include "port/pg_lfind.h"
#include "utils/memutils.h"
@@ -103,6 +104,10 @@
#define RT_NODE_BITMAP_BYTE(v) ((v) / BITS_PER_BYTE)
#define RT_NODE_BITMAP_BIT(v) (UINT64CONST(1) << ((v) % RT_NODE_SPAN))
+/* FIXME rename */
+#define WORDNUM(x) ((x) / BITS_PER_BITMAPWORD)
+#define BITNUM(x) ((x) % BITS_PER_BITMAPWORD)
+
/* Enum used rt_node_search() */
typedef enum
{
@@ -207,6 +212,9 @@ typedef struct rt_node_base125
/* The index of slots for each fanout */
uint8 slot_idxs[RT_NODE_MAX_SLOTS];
+
+ /* isset is a bitmap to track which slot is in use */
+ bitmapword isset[WORDNUM(128)];
} rt_node_base_125;
typedef struct rt_node_base256
@@ -271,9 +279,6 @@ typedef struct rt_node_leaf_125
{
rt_node_base_125 base;
- /* isset is a bitmap to track which slot is in use */
- uint8 isset[RT_NODE_NSLOTS_BITS(128)];
-
/* number of values depends on size class */
uint64 values[FLEXIBLE_ARRAY_MEMBER];
} rt_node_leaf_125;
@@ -655,13 +660,14 @@ node_125_is_chunk_used(rt_node_base_125 *node, uint8 chunk)
return node->slot_idxs[chunk] != RT_NODE_125_INVALID_IDX;
}
+#ifdef USE_ASSERT_CHECKING
/* Is the slot in the node used? */
static inline bool
node_inner_125_is_slot_used(rt_node_inner_125 *node, uint8 slot)
{
Assert(!NODE_IS_LEAF(node));
Assert(slot < node->base.n.fanout);
- return (node->children[slot] != NULL);
+ return (node->base.isset[WORDNUM(slot)] & ((bitmapword) 1 << BITNUM(slot))) != 0;
}
static inline bool
@@ -669,8 +675,9 @@ node_leaf_125_is_slot_used(rt_node_leaf_125 *node, uint8 slot)
{
Assert(NODE_IS_LEAF(node));
Assert(slot < node->base.n.fanout);
- return ((node->isset[RT_NODE_BITMAP_BYTE(slot)] & RT_NODE_BITMAP_BIT(slot)) != 0);
+ return (node->base.isset[WORDNUM(slot)] & ((bitmapword) 1 << BITNUM(slot))) != 0;
}
+#endif
static inline rt_node *
node_inner_125_get_child(rt_node_inner_125 *node, uint8 chunk)
@@ -690,7 +697,10 @@ node_leaf_125_get_value(rt_node_leaf_125 *node, uint8 chunk)
static void
node_inner_125_delete(rt_node_inner_125 *node, uint8 chunk)
{
+ int slotpos = node->base.slot_idxs[chunk];
+
Assert(!NODE_IS_LEAF(node));
+ node->base.isset[WORDNUM(slotpos)] &= ~((bitmapword) 1 << BITNUM(slotpos));
node->children[node->base.slot_idxs[chunk]] = NULL;
node->base.slot_idxs[chunk] = RT_NODE_125_INVALID_IDX;
}
@@ -701,44 +711,35 @@ node_leaf_125_delete(rt_node_leaf_125 *node, uint8 chunk)
int slotpos = node->base.slot_idxs[chunk];
Assert(NODE_IS_LEAF(node));
- node->isset[RT_NODE_BITMAP_BYTE(slotpos)] &= ~(RT_NODE_BITMAP_BIT(slotpos));
+ node->base.isset[WORDNUM(slotpos)] &= ~((bitmapword) 1 << BITNUM(slotpos));
node->base.slot_idxs[chunk] = RT_NODE_125_INVALID_IDX;
}
/* Return an unused slot in node-125 */
static int
-node_inner_125_find_unused_slot(rt_node_inner_125 *node, uint8 chunk)
-{
- int slotpos = 0;
-
- Assert(!NODE_IS_LEAF(node));
- while (node_inner_125_is_slot_used(node, slotpos))
- slotpos++;
-
- return slotpos;
-}
-
-static int
-node_leaf_125_find_unused_slot(rt_node_leaf_125 *node, uint8 chunk)
+node_125_find_unused_slot(bitmapword *isset)
{
int slotpos;
+ int idx;
+ bitmapword inverse;
- Assert(NODE_IS_LEAF(node));
-
- /* We iterate over the isset bitmap per byte then check each bit */
- for (slotpos = 0; slotpos < RT_NODE_NSLOTS_BITS(128); slotpos++)
+ /* get the first word with at least one bit not set */
+ for (idx = 0; idx < WORDNUM(128); idx++)
{
- if (node->isset[slotpos] < 0xFF)
+ if (isset[idx] < ~((bitmapword) 0))
break;
}
- Assert(slotpos < RT_NODE_NSLOTS_BITS(128));
- slotpos *= BITS_PER_BYTE;
- while (node_leaf_125_is_slot_used(node, slotpos))
- slotpos++;
+ /* To get the first unset bit in X, get the first set bit in ~X */
+ inverse = ~(isset[idx]);
+ slotpos = idx * BITS_PER_BITMAPWORD;
+ slotpos += bmw_rightmost_one_pos(inverse);
+
+ /* mark the slot used */
+ isset[idx] |= bmw_rightmost_one(inverse);
return slotpos;
-}
+ }
static inline void
node_inner_125_insert(rt_node_inner_125 *node, uint8 chunk, rt_node *child)
@@ -747,8 +748,7 @@ node_inner_125_insert(rt_node_inner_125 *node, uint8 chunk, rt_node *child)
Assert(!NODE_IS_LEAF(node));
- /* find unused slot */
- slotpos = node_inner_125_find_unused_slot(node, chunk);
+ slotpos = node_125_find_unused_slot(node->base.isset);
Assert(slotpos < node->base.n.fanout);
node->base.slot_idxs[chunk] = slotpos;
@@ -763,12 +763,10 @@ node_leaf_125_insert(rt_node_leaf_125 *node, uint8 chunk, uint64 value)
Assert(NODE_IS_LEAF(node));
- /* find unused slot */
- slotpos = node_leaf_125_find_unused_slot(node, chunk);
+ slotpos = node_125_find_unused_slot(node->base.isset);
Assert(slotpos < node->base.n.fanout);
node->base.slot_idxs[chunk] = slotpos;
- node->isset[RT_NODE_BITMAP_BYTE(slotpos)] |= RT_NODE_BITMAP_BIT(slotpos);
node->values[slotpos] = value;
}
@@ -2395,9 +2393,9 @@ rt_dump_node(rt_node *node, int level, bool recurse)
rt_node_leaf_125 *n = (rt_node_leaf_125 *) node;
fprintf(stderr, ", isset-bitmap:");
- for (int i = 0; i < 16; i++)
+ for (int i = 0; i < WORDNUM(128); i++)
{
- fprintf(stderr, "%X ", (uint8) n->isset[i]);
+ fprintf(stderr, UINT64_FORMAT_HEX " ", n->base.isset[i]);
}
fprintf(stderr, "\n");
}
--
2.31.1
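
The reworked slot search in 0004 boils down to a common bitmap idiom: scan for the first word that is not all ones, invert it, and the rightmost one bit of the inverted word marks the first free slot. A standalone illustration of that idiom, using plain uint64 words and the pg_rightmost_one64() helper added by 0002 below (a sketch, not the patch code):

/* Find and claim the lowest clear bit in a small bitmap; returns -1 if full. */
static int
find_and_claim_slot(uint64 *isset, int nwords)
{
    for (int idx = 0; idx < nwords; idx++)
    {
        if (isset[idx] != ~UINT64CONST(0))
        {
            uint64      inverse = ~isset[idx];
            int         slot = idx * 64 + pg_rightmost_one_pos64(inverse);

            /* the rightmost one of the inverse is the lowest zero of isset[idx] */
            isset[idx] |= pg_rightmost_one64(inverse);
            return slot;
        }
    }
    return -1;
}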
Attachment: v14-0002-Move-some-bitmap-logic-out-of-bitmapset.c.patch (application/octet-stream)
From caf11ea2ca608edac00443b6ab7590688385b0d4 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Tue, 6 Dec 2022 13:39:41 +0700
Subject: [PATCH v14 2/9] Move some bitmap logic out of bitmapset.c
Add pg_rightmost_one32/64 functions. This functionality was previously
private to bitmapset.c as the RIGHTMOST_ONE macro. It has practical
use in other contexts, so move to pg_bitutils.h.
Also move the logic for selecting appropriate pg_bitutils functions
based on word size to bitmapset.h for wider visibility and add
appropriate selection for bmw_rightmost_one().
Since the previous macro relied on casting to signedbitmapword,
and the new functions do not, remove that typedef.
Design input and review by Tom Lane
Discussion: https://www.postgresql.org/message-id/CAFBsxsFW2JjTo58jtDB%2B3sZhxMx3t-3evew8%3DAcr%2BGGhC%2BkFaA%40mail.gmail.com
---
src/backend/nodes/bitmapset.c | 36 ++------------------------------
src/include/nodes/bitmapset.h | 16 ++++++++++++--
src/include/port/pg_bitutils.h | 31 +++++++++++++++++++++++++++
src/tools/pgindent/typedefs.list | 1 -
4 files changed, 47 insertions(+), 37 deletions(-)
diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index b7b274aeff..4384ff591d 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -32,39 +32,7 @@
#define BITMAPSET_SIZE(nwords) \
(offsetof(Bitmapset, words) + (nwords) * sizeof(bitmapword))
-/*----------
- * This is a well-known cute trick for isolating the rightmost one-bit
- * in a word. It assumes two's complement arithmetic. Consider any
- * nonzero value, and focus attention on the rightmost one. The value is
- * then something like
- * xxxxxx10000
- * where x's are unspecified bits. The two's complement negative is formed
- * by inverting all the bits and adding one. Inversion gives
- * yyyyyy01111
- * where each y is the inverse of the corresponding x. Incrementing gives
- * yyyyyy10000
- * and then ANDing with the original value gives
- * 00000010000
- * This works for all cases except original value = zero, where of course
- * we get zero.
- *----------
- */
-#define RIGHTMOST_ONE(x) ((signedbitmapword) (x) & -((signedbitmapword) (x)))
-
-#define HAS_MULTIPLE_ONES(x) ((bitmapword) RIGHTMOST_ONE(x) != (x))
-
-/* Select appropriate bit-twiddling functions for bitmap word size */
-#if BITS_PER_BITMAPWORD == 32
-#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos32(w)
-#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos32(w)
-#define bmw_popcount(w) pg_popcount32(w)
-#elif BITS_PER_BITMAPWORD == 64
-#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos64(w)
-#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos64(w)
-#define bmw_popcount(w) pg_popcount64(w)
-#else
-#error "invalid BITS_PER_BITMAPWORD"
-#endif
+#define HAS_MULTIPLE_ONES(x) (bmw_rightmost_one(x) != (x))
/*
@@ -1013,7 +981,7 @@ bms_first_member(Bitmapset *a)
{
int result;
- w = RIGHTMOST_ONE(w);
+ w = bmw_rightmost_one(w);
a->words[wordnum] &= ~w;
result = wordnum * BITS_PER_BITMAPWORD;
diff --git a/src/include/nodes/bitmapset.h b/src/include/nodes/bitmapset.h
index 2792281658..fdc504596b 100644
--- a/src/include/nodes/bitmapset.h
+++ b/src/include/nodes/bitmapset.h
@@ -38,13 +38,11 @@ struct List;
#define BITS_PER_BITMAPWORD 64
typedef uint64 bitmapword; /* must be an unsigned type */
-typedef int64 signedbitmapword; /* must be the matching signed type */
#else
#define BITS_PER_BITMAPWORD 32
typedef uint32 bitmapword; /* must be an unsigned type */
-typedef int32 signedbitmapword; /* must be the matching signed type */
#endif
@@ -75,6 +73,20 @@ typedef enum
BMS_MULTIPLE /* >1 member */
} BMS_Membership;
+/* Select appropriate bit-twiddling functions for bitmap word size */
+#if BITS_PER_BITMAPWORD == 32
+#define bmw_rightmost_one(w) pg_rightmost_one32(w)
+#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos32(w)
+#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos32(w)
+#define bmw_popcount(w) pg_popcount32(w)
+#elif BITS_PER_BITMAPWORD == 64
+#define bmw_rightmost_one(w) pg_rightmost_one64(w)
+#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos64(w)
+#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos64(w)
+#define bmw_popcount(w) pg_popcount64(w)
+#else
+#error "invalid BITS_PER_BITMAPWORD"
+#endif
/*
* function prototypes in nodes/bitmapset.c
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 814e0b2dba..f95b6afd86 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -17,6 +17,37 @@ extern PGDLLIMPORT const uint8 pg_leftmost_one_pos[256];
extern PGDLLIMPORT const uint8 pg_rightmost_one_pos[256];
extern PGDLLIMPORT const uint8 pg_number_of_ones[256];
+/*----------
+ * This is a well-known cute trick for isolating the rightmost one-bit
+ * in a word. It assumes two's complement arithmetic. Consider any
+ * nonzero value, and focus attention on the rightmost one. The value is
+ * then something like
+ * xxxxxx10000
+ * where x's are unspecified bits. The two's complement negative is formed
+ * by inverting all the bits and adding one. Inversion gives
+ * yyyyyy01111
+ * where each y is the inverse of the corresponding x. Incrementing gives
+ * yyyyyy10000
+ * and then ANDing with the original value gives
+ * 00000010000
+ * This works for all cases except original value = zero, where of course
+ * we get zero.
+ *----------
+ */
+static inline uint32
+pg_rightmost_one32(uint32 word)
+{
+ int32 result = (int32) word & -((int32) word);
+ return (uint32) result;
+}
+
+static inline uint64
+pg_rightmost_one64(uint64 word)
+{
+ int64 result = (int64) word & -((int64) word);
+ return (uint64) result;
+}
+
/*
* pg_leftmost_one_pos32
* Returns the position of the most significant set bit in "word",
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 60c71d05fe..8305f09f2c 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3654,7 +3654,6 @@ shmem_request_hook_type
shmem_startup_hook_type
sig_atomic_t
sigjmp_buf
-signedbitmapword
sigset_t
size_t
slist_head
--
2.31.1
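
A quick worked example of the two's-complement identity that the moved comment describes and that pg_rightmost_one32/64 implement (values chosen arbitrarily):

uint32 x = 0x68;                    /* 0b01101000 */
uint32 neg = (uint32) (-(int32) x); /* 0xFFFFFF98: bits inverted, plus one */
uint32 lowest = x & neg;            /* 0x08: only the rightmost one bit survives */

Assert(lowest == pg_rightmost_one32(x));
Assert(pg_rightmost_one32(0) == 0); /* zero input stays zero */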
Attachment: v14-0001-introduce-vector8_min-and-vector8_highbit_mask.patch (application/octet-stream)
From ceaf56be51d2c686a795e1ab1ab40f701ed21d62 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 24 Oct 2022 14:07:09 +0900
Subject: [PATCH v14 1/9] introduce vector8_min and vector8_highbit_mask
---
src/include/port/simd.h | 47 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 47 insertions(+)
diff --git a/src/include/port/simd.h b/src/include/port/simd.h
index 61ae4ecf60..0b288c422a 100644
--- a/src/include/port/simd.h
+++ b/src/include/port/simd.h
@@ -77,6 +77,7 @@ static inline bool vector8_has(const Vector8 v, const uint8 c);
static inline bool vector8_has_zero(const Vector8 v);
static inline bool vector8_has_le(const Vector8 v, const uint8 c);
static inline bool vector8_is_highbit_set(const Vector8 v);
+static inline uint32 vector8_highbit_mask(const Vector8 v);
#ifndef USE_NO_SIMD
static inline bool vector32_is_highbit_set(const Vector32 v);
#endif
@@ -96,6 +97,7 @@ static inline Vector8 vector8_ssub(const Vector8 v1, const Vector8 v2);
*/
#ifndef USE_NO_SIMD
static inline Vector8 vector8_eq(const Vector8 v1, const Vector8 v2);
+static inline Vector8 vector8_min(const Vector8 v1, const Vector8 v2);
static inline Vector32 vector32_eq(const Vector32 v1, const Vector32 v2);
#endif
@@ -277,6 +279,36 @@ vector8_is_highbit_set(const Vector8 v)
#endif
}
+/*
+ * Return the bitmask of the high bit of each element.
+ */
+static inline uint32
+vector8_highbit_mask(const Vector8 v)
+{
+#ifdef USE_SSE2
+ return (uint32) _mm_movemask_epi8(v);
+#elif defined(USE_NEON)
+ static const uint8 mask[16] = {
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ };
+
+ uint8x16_t masked = vandq_u8(vld1q_u8(mask), (uint8x16_t) vshrq_n_s8(v, 7));
+ uint8x16_t maskedhi = vextq_u8(masked, masked, 8);
+
+ return (uint32) vaddvq_u16((uint16x8_t) vzip1q_u8(masked, maskedhi));
+#else
+ uint32 mask = 0;
+
+ for (Size i = 0; i < sizeof(Vector8); i++)
+ mask |= (((const uint8 *) &v)[i] >> 7) << i;
+
+ return mask;
+#endif
+}
+
/*
* Exactly like vector8_is_highbit_set except for the input type, so it
* looks at each byte separately.
@@ -372,4 +404,19 @@ vector32_eq(const Vector32 v1, const Vector32 v2)
}
#endif /* ! USE_NO_SIMD */
+/*
+ * Compare the given vectors and return the vector of minimum elements.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_min(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+ return _mm_min_epu8(v1, v2);
+#elif defined(USE_NEON)
+ return vminq_u8(v1, v2);
+#endif
+}
+#endif /* ! USE_NO_SIMD */
+
#endif /* SIMD_H */
--
2.31.1
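
To show the intended use of 0001's vector8_highbit_mask() in later node-search code, here is a hedged sketch of the usual "find the matching byte" pattern (the chunks array and searched value are hypothetical, and vector8_eq() is only available in SIMD builds, so the real code needs a scalar fallback as well):

uint8       chunks[16] = {0};   /* e.g., a node's chunk array (hypothetical) */
uint8       searched = 0x2a;
Vector8     haystack;
Vector8     cmp;
uint32      bitfield;

vector8_load(&haystack, chunks);
cmp = vector8_eq(haystack, vector8_broadcast(searched));

/* one result bit per byte: bit i is set iff chunks[i] == searched */
bitfield = vector8_highbit_mask(cmp);

if (bitfield)
{
    int     index = pg_rightmost_one_pos32(bitfield);   /* first matching position */

    /* use index ... */
}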
Attachment: v14-0003-Add-radix-implementation.patch (application/octet-stream)
From 6ba6c9979b2bd4fb5ef3c61d7a6edac1737e8509 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 14 Sep 2022 12:38:51 +0000
Subject: [PATCH v14 3/9] Add radix implementation.
---
src/backend/lib/Makefile | 1 +
src/backend/lib/meson.build | 1 +
src/backend/lib/radixtree.c | 2541 +++++++++++++++++
src/include/lib/radixtree.h | 42 +
src/test/modules/Makefile | 1 +
src/test/modules/meson.build | 1 +
src/test/modules/test_radixtree/.gitignore | 4 +
src/test/modules/test_radixtree/Makefile | 23 +
src/test/modules/test_radixtree/README | 7 +
.../expected/test_radixtree.out | 36 +
src/test/modules/test_radixtree/meson.build | 34 +
.../test_radixtree/sql/test_radixtree.sql | 7 +
.../test_radixtree/test_radixtree--1.0.sql | 8 +
.../modules/test_radixtree/test_radixtree.c | 581 ++++
.../test_radixtree/test_radixtree.control | 4 +
15 files changed, 3291 insertions(+)
create mode 100644 src/backend/lib/radixtree.c
create mode 100644 src/include/lib/radixtree.h
create mode 100644 src/test/modules/test_radixtree/.gitignore
create mode 100644 src/test/modules/test_radixtree/Makefile
create mode 100644 src/test/modules/test_radixtree/README
create mode 100644 src/test/modules/test_radixtree/expected/test_radixtree.out
create mode 100644 src/test/modules/test_radixtree/meson.build
create mode 100644 src/test/modules/test_radixtree/sql/test_radixtree.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree--1.0.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree.c
create mode 100644 src/test/modules/test_radixtree/test_radixtree.control
diff --git a/src/backend/lib/Makefile b/src/backend/lib/Makefile
index 9dad31398a..4c1db794b6 100644
--- a/src/backend/lib/Makefile
+++ b/src/backend/lib/Makefile
@@ -22,6 +22,7 @@ OBJS = \
integerset.o \
knapsack.o \
pairingheap.o \
+ radixtree.o \
rbtree.o \
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/lib/meson.build b/src/backend/lib/meson.build
index 48da1bddce..4303d306cd 100644
--- a/src/backend/lib/meson.build
+++ b/src/backend/lib/meson.build
@@ -9,4 +9,5 @@ backend_sources += files(
'knapsack.c',
'pairingheap.c',
'rbtree.c',
+ 'radixtree.c',
)
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
new file mode 100644
index 0000000000..e7f61fd943
--- /dev/null
+++ b/src/backend/lib/radixtree.c
@@ -0,0 +1,2541 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.c
+ * Implementation for adaptive radix tree.
+ *
+ * This module employs the idea from the paper "The Adaptive Radix Tree: ARTful
+ * Indexing for Main-Memory Databases" by Viktor Leis, Alfons Kemper, and Thomas
+ * Neumann, 2013. The radix tree uses adaptive node sizes, a small number of node
+ * types, each with a different numbers of elements. Depending on the number of
+ * children, the appropriate node type is used.
+ *
+ * There are some differences from the proposed implementation. For instance,
+ * there is no support for path compression or lazy path expansion. The radix
+ * tree supports only fixed-length keys, so we don't expect the tree to become
+ * very high.
+ *
+ * Both the key and the value are 64-bit unsigned integers. The inner nodes and
+ * the leaf nodes have slightly different structures: inner nodes, with
+ * shift > 0, store pointers to their child nodes as values, whereas leaf nodes,
+ * with shift == 0, store the 64-bit unsigned integer specified by the user as
+ * the value. The paper refers to this technique as "Multi-value leaves". We
+ * choose it to avoid an additional pointer traversal. It is also the reason this
+ * code currently does not support variable-length keys.
+ *
+ * XXX: Most functions in this file have two variants, one for inner nodes and
+ * one for leaf nodes, so there is some code duplication. While this sometimes
+ * makes code maintenance tricky, it reduces branch prediction misses when
+ * judging whether the node is an inner node or a leaf node.
+ *
+ * XXX: radix tree nodes are never shrunk.
+ *
+ * Interface
+ * ---------
+ *
+ * rt_create - Create a new, empty radix tree
+ * rt_free - Free the radix tree
+ * rt_search - Search a key-value pair
+ * rt_set - Set a key-value pair
+ * rt_delete - Delete a key-value pair
+ * rt_begin_iterate - Begin iterating through all key-value pairs
+ * rt_iterate_next - Return next key-value pair, if any
+ * rt_end_iter - End iteration
+ * rt_memory_usage - Get the memory usage
+ * rt_num_entries - Get the number of key-value pairs
+ *
+ * rt_create() creates an empty radix tree in the given memory context, along
+ * with child memory contexts for each kind of radix tree node.
+ *
+ * rt_iterate_next() returns key-value pairs in ascending order of the key.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/lib/radixtree.c
+ *
+ *-------------------------------------------------------------------------
+ */
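+
+/*
+ * A minimal usage sketch of the interface above (illustrative only):
+ *
+ *   radix_tree *tree = rt_create(CurrentMemoryContext);
+ *   uint64      key = 42;
+ *   uint64      value;
+ *
+ *   rt_set(tree, key, key * 10);
+ *   if (rt_search(tree, key, &value))
+ *       Assert(value == 420);
+ *   rt_free(tree);
+ */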
+
+#include "postgres.h"
+
+#include "lib/radixtree.h"
+#include "lib/stringinfo.h"
+#include "miscadmin.h"
+#include "port/pg_bitutils.h"
+#include "port/pg_lfind.h"
+#include "utils/memutils.h"
+
+#ifdef RT_DEBUG
+#define UINT64_FORMAT_HEX "%" INT64_MODIFIER "X"
+#endif
+
+/* The number of bits encoded in one tree level */
+#define RT_NODE_SPAN BITS_PER_BYTE
+
+/* The maximum number of slots in a node */
+#define RT_NODE_MAX_SLOTS (1 << RT_NODE_SPAN)
+
+/*
+ * Return the number of bytes needed for a bitmap covering nslots slots; used
+ * to size the is-set bitmaps of nodes indexed by array lookup.
+ */
+#define RT_NODE_NSLOTS_BITS(nslots) ((nslots) / (sizeof(uint8) * BITS_PER_BYTE))
+
+/* Mask for extracting a chunk from the key */
+#define RT_CHUNK_MASK ((1 << RT_NODE_SPAN) - 1)
+
+/* Maximum shift the radix tree uses */
+#define RT_MAX_SHIFT key_get_shift(UINT64_MAX)
+
+/* The maximum number of tree levels the radix tree uses */
+#define RT_MAX_LEVEL ((sizeof(uint64) * BITS_PER_BYTE) / RT_NODE_SPAN)
+
+/* Invalid index used in node-125 */
+#define RT_NODE_125_INVALID_IDX 0xFF
+
+/* Get a chunk from the key */
+#define RT_GET_KEY_CHUNK(key, shift) ((uint8) (((key) >> (shift)) & RT_CHUNK_MASK))
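+
+/*
+ * For example (illustrative): with RT_NODE_SPAN = 8, the key
+ * 0x0102030405060708 decomposes into the chunks 0x01, 0x02, ..., 0x08 as the
+ * shift goes from 56 down to 0, one chunk per tree level.
+ */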
+
+/*
+ * Mapping from the value to the bit in is-set bitmap in the node-256.
+ */
+#define RT_NODE_BITMAP_BYTE(v) ((v) / BITS_PER_BYTE)
+#define RT_NODE_BITMAP_BIT(v) (UINT64CONST(1) << ((v) % RT_NODE_SPAN))
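+
+/* For example (illustrative): slot 10 maps to byte 1, bit (1 << 2) of the bitmap. */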
+
+/* Enum used by rt_node_search_inner() and rt_node_search_leaf() */
+typedef enum
+{
+ RT_ACTION_FIND = 0, /* find the key-value */
+ RT_ACTION_DELETE, /* delete the key-value */
+} rt_action;
+
+/*
+ * Supported radix tree node kinds and size classes.
+ *
+ * There are 4 node kinds, and each node kind has one or two size classes,
+ * partial and full. The size classes within the same node kind share the same
+ * node structure but have a different fanout, which is stored in 'fanout' of
+ * rt_node. For example, with size class 15, when a 16th element is to be
+ * inserted, we allocate a larger area and memcpy the entire old node to it.
+ *
+ * This technique allows us to limit the node kinds to 4, which limits the
+ * number of cases in switch statements. It also allows a possible future
+ * optimization to encode the node kind in a pointer tag.
+ *
+ * These size classes have been chosen carefully so that they minimize the
+ * allocator padding in both the inner and leaf nodes on DSA.
+ */
+#define RT_NODE_KIND_4 0x00
+#define RT_NODE_KIND_32 0x01
+#define RT_NODE_KIND_125 0x02
+#define RT_NODE_KIND_256 0x03
+#define RT_NODE_KIND_COUNT 4
+
+typedef enum rt_size_class
+{
+ RT_CLASS_4_FULL = 0,
+ RT_CLASS_32_PARTIAL,
+ RT_CLASS_32_FULL,
+ RT_CLASS_125_FULL,
+ RT_CLASS_256
+
+#define RT_SIZE_CLASS_COUNT (RT_CLASS_256 + 1)
+} rt_size_class;
+
+/* Common type for all node types */
+typedef struct rt_node
+{
+ /*
+ * Number of children. We use uint16 to be able to indicate up to 256
+ * children with a node span of 8 bits.
+ */
+ uint16 count;
+
+ /* Max number of children. We can use uint8 because we never need to store 256 */
+ /* WIP: if we don't have a variable sized node4, this should instead be in the base
+ types as needed, since saving every byte is crucial for the smallest node kind */
+ uint8 fanout;
+
+ /*
+ * Shift indicates which part of the key space is represented by this
+ * node. That is, the key is shifted by 'shift' and the lowest
+ * RT_NODE_SPAN bits are then represented in chunk.
+ */
+ uint8 shift;
+ uint8 chunk;
+
+ /* Node kind, one per search/set algorithm */
+ uint8 kind;
+} rt_node;
+#define NODE_IS_LEAF(n) (((rt_node *) (n))->shift == 0)
+#define NODE_IS_EMPTY(n) (((rt_node *) (n))->count == 0)
+#define VAR_NODE_HAS_FREE_SLOT(node) \
+ ((node)->base.n.count < (node)->base.n.fanout)
+#define FIXED_NODE_HAS_FREE_SLOT(node, class) \
+ ((node)->base.n.count < rt_size_class_info[class].fanout)
+
+/* Base types of each node kind, for both leaf and inner nodes */
+/* The base types must be able to accommodate the largest size
+class for variable-sized node kinds */
+typedef struct rt_node_base_4
+{
+ rt_node n;
+
+ /* 4 children, for key chunks */
+ uint8 chunks[4];
+} rt_node_base_4;
+
+typedef struct rt_node_base32
+{
+ rt_node n;
+
+ /* 32 children, for key chunks */
+ uint8 chunks[32];
+} rt_node_base_32;
+
+/*
+ * node-125 uses slot_idx array, an array of RT_NODE_MAX_SLOTS length, typically
+ * 256, to store indexes into a second array that contains up to 125 values (or
+ * child pointers in inner nodes).
+ */
+typedef struct rt_node_base125
+{
+ rt_node n;
+
+ /* The index into the slot array for each possible chunk */
+ uint8 slot_idxs[RT_NODE_MAX_SLOTS];
+} rt_node_base_125;
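+
+/*
+ * Lookup in a node-125 is thus a two-step indirection (sketch, matching
+ * node_leaf_125_get_value() below):
+ *
+ *   slot = node->base.slot_idxs[chunk];
+ *   if (slot != RT_NODE_125_INVALID_IDX)
+ *       value = node->values[slot];   (children[slot] in an inner node)
+ */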
+
+typedef struct rt_node_base256
+{
+ rt_node n;
+} rt_node_base_256;
+
+/*
+ * Inner and leaf nodes.
+ *
+ * These are separate for two main reasons:
+ *
+ * 1) the value type might be different from something fitting into a pointer
+ * width type
+ * 2) we need to represent non-existing values in a key-type independent way.
+ *
+ * 1) is clearly worth being concerned about, but it's not clear 2) is as
+ * good. It might be better to just indicate non-existing entries the same way
+ * in inner nodes.
+ */
+typedef struct rt_node_inner_4
+{
+ rt_node_base_4 base;
+
+ /* number of children depends on size class */
+ rt_node *children[FLEXIBLE_ARRAY_MEMBER];
+} rt_node_inner_4;
+
+typedef struct rt_node_leaf_4
+{
+ rt_node_base_4 base;
+
+ /* number of values depends on size class */
+ uint64 values[FLEXIBLE_ARRAY_MEMBER];
+} rt_node_leaf_4;
+
+typedef struct rt_node_inner_32
+{
+ rt_node_base_32 base;
+
+ /* number of children depends on size class */
+ rt_node *children[FLEXIBLE_ARRAY_MEMBER];
+} rt_node_inner_32;
+
+typedef struct rt_node_leaf_32
+{
+ rt_node_base_32 base;
+
+ /* number of values depends on size class */
+ uint64 values[FLEXIBLE_ARRAY_MEMBER];
+} rt_node_leaf_32;
+
+typedef struct rt_node_inner_125
+{
+ rt_node_base_125 base;
+
+ /* number of children depends on size class */
+ rt_node *children[FLEXIBLE_ARRAY_MEMBER];
+} rt_node_inner_125;
+
+typedef struct rt_node_leaf_125
+{
+ rt_node_base_125 base;
+
+ /* isset is a bitmap to track which slot is in use */
+ uint8 isset[RT_NODE_NSLOTS_BITS(128)];
+
+ /* number of values depends on size class */
+ uint64 values[FLEXIBLE_ARRAY_MEMBER];
+} rt_node_leaf_125;
+
+/*
+ * node-256 is the largest node type. This node has an array of length
+ * RT_NODE_MAX_SLOTS for directly storing values (or child pointers in inner nodes).
+ */
+typedef struct rt_node_inner_256
+{
+ rt_node_base_256 base;
+
+ /* Slots for 256 children */
+ rt_node *children[RT_NODE_MAX_SLOTS];
+} rt_node_inner_256;
+
+typedef struct rt_node_leaf_256
+{
+ rt_node_base_256 base;
+
+ /* isset is a bitmap to track which slot is in use */
+ uint8 isset[RT_NODE_NSLOTS_BITS(RT_NODE_MAX_SLOTS)];
+
+ /* Slots for 256 values */
+ uint64 values[RT_NODE_MAX_SLOTS];
+} rt_node_leaf_256;
+
+/* Information for each size class */
+typedef struct rt_size_class_elem
+{
+ const char *name;
+ int fanout;
+
+ /* slab chunk size */
+ Size inner_size;
+ Size leaf_size;
+
+ /* slab block size */
+ Size inner_blocksize;
+ Size leaf_blocksize;
+} rt_size_class_elem;
+
+/*
+ * Calculate the slab blocksize so that we can allocate at least 32 chunks
+ * from the block.
+ */
+#define NODE_SLAB_BLOCK_SIZE(size) \
+ Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * (size), (size) * 32)
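+
+/*
+ * For example (assuming SLAB_DEFAULT_BLOCK_SIZE is 8kB): a 40-byte chunk size
+ * gives NODE_SLAB_BLOCK_SIZE(40) = Max((8192 / 40) * 40, 40 * 32) = 8160, i.e.
+ * the largest multiple of the chunk size that still fits in the default block.
+ */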
+static rt_size_class_elem rt_size_class_info[RT_SIZE_CLASS_COUNT] = {
+ [RT_CLASS_4_FULL] = {
+ .name = "radix tree node 4",
+ .fanout = 4,
+ .inner_size = sizeof(rt_node_inner_4) + 4 * sizeof(rt_node *),
+ .leaf_size = sizeof(rt_node_leaf_4) + 4 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_4) + 4 * sizeof(rt_node *)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_4) + 4 * sizeof(uint64)),
+ },
+ [RT_CLASS_32_PARTIAL] = {
+ .name = "radix tree node 15",
+ .fanout = 15,
+ .inner_size = sizeof(rt_node_inner_32) + 15 * sizeof(rt_node *),
+ .leaf_size = sizeof(rt_node_leaf_32) + 15 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_32) + 15 * sizeof(rt_node *)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_32) + 15 * sizeof(uint64)),
+ },
+ [RT_CLASS_32_FULL] = {
+ .name = "radix tree node 32",
+ .fanout = 32,
+ .inner_size = sizeof(rt_node_inner_32) + 32 * sizeof(rt_node *),
+ .leaf_size = sizeof(rt_node_leaf_32) + 32 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_32) + 32 * sizeof(rt_node *)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_32) + 32 * sizeof(uint64)),
+ },
+ [RT_CLASS_125_FULL] = {
+ .name = "radix tree node 125",
+ .fanout = 125,
+ .inner_size = sizeof(rt_node_inner_125) + 125 * sizeof(rt_node *),
+ .leaf_size = sizeof(rt_node_leaf_125) + 125 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_125) + 125 * sizeof(rt_node *)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_125) + 125 * sizeof(uint64)),
+ },
+ [RT_CLASS_256] = {
+ .name = "radix tree node 256",
+ .fanout = 256,
+ .inner_size = sizeof(rt_node_inner_256),
+ .leaf_size = sizeof(rt_node_leaf_256),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_256)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_256)),
+ },
+};
+
+/* Map from the node kind to its minimum size class */
+static rt_size_class kind_min_size_class[RT_NODE_KIND_COUNT] = {
+ [RT_NODE_KIND_4] = RT_CLASS_4_FULL,
+ [RT_NODE_KIND_32] = RT_CLASS_32_PARTIAL,
+ [RT_NODE_KIND_125] = RT_CLASS_125_FULL,
+ [RT_NODE_KIND_256] = RT_CLASS_256,
+};
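+
+/*
+ * A node therefore grows along the path (sketch):
+ *   kind 4 (fanout 4) -> kind 32 (fanout 15, then 32) -> kind 125 -> kind 256,
+ * allocating the next size class and copying the old node's contents each time.
+ */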
+
+/*
+ * Iteration support.
+ *
+ * Iterating over the radix tree returns each key-value pair in ascending
+ * order of the key. To support this, we iterate over the nodes of each level.
+ *
+ * rt_node_iter struct is used to track the iteration within a node.
+ *
+ * rt_iter is the struct for iteration of the radix tree, and uses rt_node_iter
+ * in order to track the iteration of each level. During the iteration, we also
+ * construct the key whenever updating the node iteration information, e.g., when
+ * advancing the current index within the node or when moving to the next node
+ * at the same level.
+ */
+typedef struct rt_node_iter
+{
+ rt_node *node; /* current node being iterated */
+ int current_idx; /* current position. -1 for initial value */
+} rt_node_iter;
+
+struct rt_iter
+{
+ radix_tree *tree;
+
+ /* Track the iteration on nodes of each level */
+ rt_node_iter stack[RT_MAX_LEVEL];
+ int stack_len;
+
+ /* The key being constructed during the iteration */
+ uint64 key;
+};
+
+/* A radix tree with nodes */
+struct radix_tree
+{
+ MemoryContext context;
+
+ rt_node *root;
+ uint64 max_val;
+ uint64 num_keys;
+
+ MemoryContextData *inner_slabs[RT_SIZE_CLASS_COUNT];
+ MemoryContextData *leaf_slabs[RT_SIZE_CLASS_COUNT];
+
+ /* statistics */
+#ifdef RT_DEBUG
+ int32 cnt[RT_SIZE_CLASS_COUNT];
+#endif
+};
+
+static void rt_new_root(radix_tree *tree, uint64 key);
+static rt_node *rt_alloc_node(radix_tree *tree, rt_size_class size_class, bool inner);
+static inline void rt_init_node(rt_node *node, uint8 kind, rt_size_class size_class,
+ bool inner);
+static void rt_free_node(radix_tree *tree, rt_node *node);
+static void rt_extend(radix_tree *tree, uint64 key);
+static inline bool rt_node_search_inner(rt_node *node, uint64 key, rt_action action,
+ rt_node **child_p);
+static inline bool rt_node_search_leaf(rt_node *node, uint64 key, rt_action action,
+ uint64 *value_p);
+static bool rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node,
+ uint64 key, rt_node *child);
+static bool rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
+ uint64 key, uint64 value);
+static inline rt_node *rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter);
+static inline bool rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
+ uint64 *value_p);
+static void rt_update_iter_stack(rt_iter *iter, rt_node *from_node, int from);
+static inline void rt_iter_update_key(rt_iter *iter, uint8 chunk, uint8 shift);
+
+/* verification (available only with assertion) */
+static void rt_verify_node(rt_node *node);
+
+/*
+ * Return the index of the first element in the node's chunk array that equals
+ * 'chunk'. Return -1 if there is no such element.
+ */
+static inline int
+node_4_search_eq(rt_node_base_4 *node, uint8 chunk)
+{
+ int idx = -1;
+
+ for (int i = 0; i < node->n.count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ idx = i;
+ break;
+ }
+ }
+
+ return idx;
+}
+
+/*
+ * Return the index at which 'chunk' should be inserted into the chunk array of
+ * the given node.
+ */
+static inline int
+node_4_get_insertpos(rt_node_base_4 *node, uint8 chunk)
+{
+ int idx;
+
+ for (idx = 0; idx < node->n.count; idx++)
+ {
+ if (node->chunks[idx] >= chunk)
+ break;
+ }
+
+ return idx;
+}
+
+/*
+ * Return the index of the first element in the node's chunk array that equals
+ * 'chunk'. Return -1 if there is no such element.
+ */
+static inline int
+node_32_search_eq(rt_node_base_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ uint32 bitfield;
+ int index_simd = -1;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index = -1;
+
+ for (int i = 0; i < count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ index = i;
+ break;
+ }
+ }
+#endif
+
+#ifndef USE_NO_SIMD
+ spread_chunk = vector8_broadcast(chunk);
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ cmp1 = vector8_eq(spread_chunk, haystack1);
+ cmp2 = vector8_eq(spread_chunk, haystack2);
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
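+
+/*
+ * Worked example (illustrative): searching chunk 5 in chunks {1, 3, 5, 9} with
+ * count = 4: vector8_eq() yields 0xFF only in byte 2, so the combined bitfield
+ * is 0b0100 after masking with (1 << count) - 1, and pg_rightmost_one_pos32()
+ * returns index 2.
+ */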
+
+/*
+ * Return the index at which 'chunk' should be inserted into the chunk array of
+ * the given node.
+ */
+static inline int
+node_32_get_insertpos(rt_node_base_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ Vector8 min1;
+ Vector8 min2;
+ uint32 bitfield;
+ int index_simd;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index;
+
+ for (index = 0; index < count; index++)
+ {
+ if (node->chunks[index] >= chunk)
+ break;
+ }
+#endif
+
+#ifndef USE_NO_SIMD
+ spread_chunk = vector8_broadcast(chunk);
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ min1 = vector8_min(spread_chunk, haystack1);
+ min2 = vector8_min(spread_chunk, haystack2);
+ cmp1 = vector8_eq(spread_chunk, min1);
+ cmp2 = vector8_eq(spread_chunk, min2);
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+ else
+ index_simd = count;
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
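+
+/*
+ * Worked example (illustrative): to find the insert position of chunk 5 in
+ * chunks {1, 3, 9, 11} with count = 4, vector8_min() equals the spread chunk
+ * only where the existing chunk is >= 5, so vector8_eq(spread, min) matches
+ * bytes 2 and 3; the rightmost set bit is at index 2, which is where 5 must be
+ * inserted to keep the array sorted.
+ */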
+
+/*
+ * Functions to manipulate both chunks array and children/values array.
+ * These are used for node-4 and node-32.
+ */
+
+/* Shift the elements right at 'idx' by one */
+static inline void
+chunk_children_array_shift(uint8 *chunks, rt_node **children, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+ memmove(&(children[idx + 1]), &(children[idx]), sizeof(rt_node *) * (count - idx));
+}
+
+static inline void
+chunk_values_array_shift(uint8 *chunks, uint64 *values, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+ memmove(&(values[idx + 1]), &(values[idx]), sizeof(uint64) * (count - idx));
+}
+
+/* Delete the element at 'idx' */
+static inline void
+chunk_children_array_delete(uint8 *chunks, rt_node **children, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(children[idx]), &(children[idx + 1]), sizeof(rt_node *) * (count - idx - 1));
+}
+
+static inline void
+chunk_values_array_delete(uint8 *chunks, uint64 *values, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(values[idx]), &(values[idx + 1]), sizeof(uint64) * (count - idx - 1));
+}
+
+/* Copy both chunks and children/values arrays */
+static inline void
+chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
+ uint8 *dst_chunks, rt_node **dst_children)
+{
+ const int fanout = rt_size_class_info[RT_CLASS_4_FULL].fanout;
+ const Size chunk_size = sizeof(uint8) * fanout;
+ const Size children_size = sizeof(rt_node *) * fanout;
+
+ memcpy(dst_chunks, src_chunks, chunk_size);
+ memcpy(dst_children, src_children, children_size);
+}
+
+static inline void
+chunk_values_array_copy(uint8 *src_chunks, uint64 *src_values,
+ uint8 *dst_chunks, uint64 *dst_values)
+{
+ const int fanout = rt_size_class_info[RT_CLASS_4_FULL].fanout;
+ const Size chunk_size = sizeof(uint8) * fanout;
+ const Size values_size = sizeof(uint64) * fanout;
+
+ memcpy(dst_chunks, src_chunks, chunk_size);
+ memcpy(dst_values, src_values, values_size);
+}
+
+/* Functions to manipulate inner and leaf node-125 */
+
+/* Does the given chunk in the node have a value? */
+static inline bool
+node_125_is_chunk_used(rt_node_base_125 *node, uint8 chunk)
+{
+ return node->slot_idxs[chunk] != RT_NODE_125_INVALID_IDX;
+}
+
+/* Is the slot in the node used? */
+static inline bool
+node_inner_125_is_slot_used(rt_node_inner_125 *node, uint8 slot)
+{
+ Assert(!NODE_IS_LEAF(node));
+ Assert(slot < node->base.n.fanout);
+ return (node->children[slot] != NULL);
+}
+
+static inline bool
+node_leaf_125_is_slot_used(rt_node_leaf_125 *node, uint8 slot)
+{
+ Assert(NODE_IS_LEAF(node));
+ Assert(slot < node->base.n.fanout);
+ return ((node->isset[RT_NODE_BITMAP_BYTE(slot)] & RT_NODE_BITMAP_BIT(slot)) != 0);
+}
+
+static inline rt_node *
+node_inner_125_get_child(rt_node_inner_125 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ return node->children[node->base.slot_idxs[chunk]];
+}
+
+static inline uint64
+node_leaf_125_get_value(rt_node_leaf_125 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ Assert(((rt_node_base_125 *) node)->slot_idxs[chunk] != RT_NODE_125_INVALID_IDX);
+ return node->values[node->base.slot_idxs[chunk]];
+}
+
+static void
+node_inner_125_delete(rt_node_inner_125 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->children[node->base.slot_idxs[chunk]] = NULL;
+ node->base.slot_idxs[chunk] = RT_NODE_125_INVALID_IDX;
+}
+
+static void
+node_leaf_125_delete(rt_node_leaf_125 *node, uint8 chunk)
+{
+ int slotpos = node->base.slot_idxs[chunk];
+
+ Assert(NODE_IS_LEAF(node));
+ node->isset[RT_NODE_BITMAP_BYTE(slotpos)] &= ~(RT_NODE_BITMAP_BIT(slotpos));
+ node->base.slot_idxs[chunk] = RT_NODE_125_INVALID_IDX;
+}
+
+/* Return an unused slot in node-125 */
+static int
+node_inner_125_find_unused_slot(rt_node_inner_125 *node, uint8 chunk)
+{
+ int slotpos = 0;
+
+ Assert(!NODE_IS_LEAF(node));
+ while (node_inner_125_is_slot_used(node, slotpos))
+ slotpos++;
+
+ return slotpos;
+}
+
+static int
+node_leaf_125_find_unused_slot(rt_node_leaf_125 *node, uint8 chunk)
+{
+ int slotpos;
+
+ Assert(NODE_IS_LEAF(node));
+
+ /* We iterate over the isset bitmap per byte then check each bit */
+ for (slotpos = 0; slotpos < RT_NODE_NSLOTS_BITS(128); slotpos++)
+ {
+ if (node->isset[slotpos] < 0xFF)
+ break;
+ }
+ Assert(slotpos < RT_NODE_NSLOTS_BITS(128));
+
+ slotpos *= BITS_PER_BYTE;
+ while (node_leaf_125_is_slot_used(node, slotpos))
+ slotpos++;
+
+ return slotpos;
+}
+
+static inline void
+node_inner_125_insert(rt_node_inner_125 *node, uint8 chunk, rt_node *child)
+{
+ int slotpos;
+
+ Assert(!NODE_IS_LEAF(node));
+
+ /* find unused slot */
+ slotpos = node_inner_125_find_unused_slot(node, chunk);
+ Assert(slotpos < node->base.n.fanout);
+
+ node->base.slot_idxs[chunk] = slotpos;
+ node->children[slotpos] = child;
+}
+
+/* Set the slot at the corresponding chunk */
+static inline void
+node_leaf_125_insert(rt_node_leaf_125 *node, uint8 chunk, uint64 value)
+{
+ int slotpos;
+
+ Assert(NODE_IS_LEAF(node));
+
+ /* find unused slot */
+ slotpos = node_leaf_125_find_unused_slot(node, chunk);
+ Assert(slotpos < node->base.n.fanout);
+
+ node->base.slot_idxs[chunk] = slotpos;
+ node->isset[RT_NODE_BITMAP_BYTE(slotpos)] |= RT_NODE_BITMAP_BIT(slotpos);
+ node->values[slotpos] = value;
+}
+
+/* Update the child corresponding to 'chunk' to 'child' */
+static inline void
+node_inner_125_update(rt_node_inner_125 *node, uint8 chunk, rt_node *child)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->children[node->base.slot_idxs[chunk]] = child;
+}
+
+static inline void
+node_leaf_125_update(rt_node_leaf_125 *node, uint8 chunk, uint64 value)
+{
+ Assert(NODE_IS_LEAF(node));
+ node->values[node->base.slot_idxs[chunk]] = value;
+}
+
+/* Functions to manipulate inner and leaf node-256 */
+
+/* Return true if the slot corresponding to the given chunk is in use */
+static inline bool
+node_inner_256_is_chunk_used(rt_node_inner_256 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ return (node->children[chunk] != NULL);
+}
+
+static inline bool
+node_leaf_256_is_chunk_used(rt_node_leaf_256 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ return (node->isset[RT_NODE_BITMAP_BYTE(chunk)] & RT_NODE_BITMAP_BIT(chunk)) != 0;
+}
+
+static inline rt_node *
+node_inner_256_get_child(rt_node_inner_256 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ Assert(node_inner_256_is_chunk_used(node, chunk));
+ return node->children[chunk];
+}
+
+static inline uint64
+node_leaf_256_get_value(rt_node_leaf_256 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ Assert(node_leaf_256_is_chunk_used(node, chunk));
+ return node->values[chunk];
+}
+
+/* Set the child in the node-256 */
+static inline void
+node_inner_256_set(rt_node_inner_256 *node, uint8 chunk, rt_node *child)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->children[chunk] = child;
+}
+
+/* Set the value in the node-256 */
+static inline void
+node_leaf_256_set(rt_node_leaf_256 *node, uint8 chunk, uint64 value)
+{
+ Assert(NODE_IS_LEAF(node));
+ node->isset[RT_NODE_BITMAP_BYTE(chunk)] |= RT_NODE_BITMAP_BIT(chunk);
+ node->values[chunk] = value;
+}
+
+/* Delete the child at the given chunk position */
+static inline void
+node_inner_256_delete(rt_node_inner_256 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->children[chunk] = NULL;
+}
+
+static inline void
+node_leaf_256_delete(rt_node_leaf_256 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ node->isset[RT_NODE_BITMAP_BYTE(chunk)] &= ~(RT_NODE_BITMAP_BIT(chunk));
+}
+
+/*
+ * Return the shift needed to store the given key.
+ */
+static inline int
+key_get_shift(uint64 key)
+{
+ return (key == 0)
+ ? 0
+ : (pg_leftmost_one_pos64(key) / RT_NODE_SPAN) * RT_NODE_SPAN;
+}
+
+/*
+ * Return the max value stored in a node with the given shift.
+ */
+static uint64
+shift_get_max_val(int shift)
+{
+ if (shift == RT_MAX_SHIFT)
+ return UINT64_MAX;
+
+ return (UINT64CONST(1) << (shift + RT_NODE_SPAN)) - 1;
+}
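+
+/*
+ * For example (illustrative): keys 0..0xFF need shift 0, keys up to 0xFFFF
+ * need shift 8, and so on; shift_get_max_val(8) accordingly returns 0xFFFF,
+ * the largest key a tree whose root has shift 8 can store.
+ */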
+
+/*
+ * Create a new node as the root. Subordinate nodes will be created during
+ * the insertion.
+ */
+static void
+rt_new_root(radix_tree *tree, uint64 key)
+{
+ int shift = key_get_shift(key);
+ bool inner = shift > 0;
+ rt_node *newnode;
+
+ newnode = rt_alloc_node(tree, RT_CLASS_4_FULL, inner);
+ rt_init_node(newnode, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
+ newnode->shift = shift;
+ tree->max_val = shift_get_max_val(shift);
+ tree->root = newnode;
+}
+
+/*
+ * Allocate a new node with the given size class.
+ */
+static rt_node *
+rt_alloc_node(radix_tree *tree, rt_size_class size_class, bool inner)
+{
+ rt_node *newnode;
+
+ if (inner)
+ newnode = (rt_node *) MemoryContextAlloc(tree->inner_slabs[size_class],
+ rt_size_class_info[size_class].inner_size);
+ else
+ newnode = (rt_node *) MemoryContextAlloc(tree->leaf_slabs[size_class],
+ rt_size_class_info[size_class].leaf_size);
+
+#ifdef RT_DEBUG
+ /* update the statistics */
+ tree->cnt[size_class]++;
+#endif
+
+ return newnode;
+}
+
+/* Initialize the node contents */
+static inline void
+rt_init_node(rt_node *node, uint8 kind, rt_size_class size_class, bool inner)
+{
+ if (inner)
+ MemSet(node, 0, rt_size_class_info[size_class].inner_size);
+ else
+ MemSet(node, 0, rt_size_class_info[size_class].leaf_size);
+
+ node->kind = kind;
+ node->fanout = rt_size_class_info[size_class].fanout;
+
+ /* Initialize slot_idxs to invalid values */
+ if (kind == RT_NODE_KIND_125)
+ {
+ rt_node_base_125 *n125 = (rt_node_base_125 *) node;
+
+ memset(n125->slot_idxs, RT_NODE_125_INVALID_IDX, sizeof(n125->slot_idxs));
+ }
+
+ /*
+ * Technically it's 256, but we cannot store that in a uint8,
+ * and this is the max size class so it will never grow.
+ */
+ if (kind == RT_NODE_KIND_256)
+ node->fanout = 0;
+}
+
+static inline void
+rt_copy_node(rt_node *newnode, rt_node *oldnode)
+{
+ newnode->shift = oldnode->shift;
+ newnode->chunk = oldnode->chunk;
+ newnode->count = oldnode->count;
+}
+
+/*
+ * Create a new node with 'new_kind' and the same shift, chunk, and
+ * count of 'node'.
+ */
+static rt_node*
+rt_grow_node_kind(radix_tree *tree, rt_node *node, uint8 new_kind)
+{
+ rt_node *newnode;
+ bool inner = !NODE_IS_LEAF(node);
+
+ newnode = rt_alloc_node(tree, kind_min_size_class[new_kind], inner);
+ rt_init_node(newnode, new_kind, kind_min_size_class[new_kind], inner);
+ rt_copy_node(newnode, node);
+
+ return newnode;
+}
+
+/* Free the given node */
+static void
+rt_free_node(radix_tree *tree, rt_node *node)
+{
+ /* If we're deleting the root node, make the tree empty */
+ if (tree->root == node)
+ {
+ tree->root = NULL;
+ tree->max_val = 0;
+ }
+
+#ifdef RT_DEBUG
+ {
+ int i;
+
+ /* update the statistics */
+ for (i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ if (node->fanout == rt_size_class_info[i].fanout)
+ break;
+ }
+
+ /* fanout of node256 is intentionally 0 */
+ if (i == RT_SIZE_CLASS_COUNT)
+ i = RT_CLASS_256;
+
+ tree->cnt[i]--;
+ Assert(tree->cnt[i] >= 0);
+ }
+#endif
+
+ pfree(node);
+}
+
+/*
+ * Replace old_child with new_child, and free the old one.
+ */
+static void
+rt_replace_node(radix_tree *tree, rt_node *parent, rt_node *old_child,
+ rt_node *new_child, uint64 key)
+{
+ Assert(old_child->chunk == new_child->chunk);
+ Assert(old_child->shift == new_child->shift);
+
+ if (parent == old_child)
+ {
+ /* Replace the root node with the new large node */
+ tree->root = new_child;
+ }
+ else
+ {
+ bool replaced PG_USED_FOR_ASSERTS_ONLY;
+
+ replaced = rt_node_insert_inner(tree, NULL, parent, key, new_child);
+ Assert(replaced);
+ }
+
+ rt_free_node(tree, old_child);
+}
+
+/*
+ * The radix tree doesn't have sufficient height. Extend the radix tree so it can
+ * store the key.
+ */
+static void
+rt_extend(radix_tree *tree, uint64 key)
+{
+ int target_shift;
+ int shift = tree->root->shift + RT_NODE_SPAN;
+
+ target_shift = key_get_shift(key);
+
+ /* Grow tree from 'shift' to 'target_shift' */
+ while (shift <= target_shift)
+ {
+ rt_node_inner_4 *node;
+
+ node = (rt_node_inner_4 *) rt_alloc_node(tree, RT_CLASS_4_FULL, true);
+ rt_init_node((rt_node *) node, RT_NODE_KIND_4, RT_CLASS_4_FULL, true);
+ node->base.n.shift = shift;
+ node->base.n.count = 1;
+ node->base.chunks[0] = 0;
+ node->children[0] = tree->root;
+
+ tree->root->chunk = 0;
+ tree->root = (rt_node *) node;
+
+ shift += RT_NODE_SPAN;
+ }
+
+ tree->max_val = shift_get_max_val(target_shift);
+}
+
+/*
+ * The radix tree doesn't have inner and leaf nodes for the given key-value pair.
+ * Insert inner and leaf nodes from 'node' down to the bottom.
+ */
+static inline void
+rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node *parent,
+ rt_node *node)
+{
+ int shift = node->shift;
+
+ while (shift >= RT_NODE_SPAN)
+ {
+ rt_node *newchild;
+ int newshift = shift - RT_NODE_SPAN;
+ bool inner = newshift > 0;
+
+ newchild = rt_alloc_node(tree, RT_CLASS_4_FULL, inner);
+ rt_init_node(newchild, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
+ newchild->shift = newshift;
+ newchild->chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ rt_node_insert_inner(tree, parent, node, key, newchild);
+
+ parent = node;
+ node = newchild;
+ shift -= RT_NODE_SPAN;
+ }
+
+ rt_node_insert_leaf(tree, parent, node, key, value);
+ tree->num_keys++;
+}
+
+/*
+ * Search for the child pointer corresponding to 'key' in the given node, and
+ * do the specified 'action'.
+ *
+ * Return true if the key is found, otherwise return false. On success, the child
+ * pointer is set to child_p.
+ */
+static inline bool
+rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **child_p)
+{
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool found = false;
+ rt_node *child = NULL;
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+ int idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
+
+ if (idx < 0)
+ break;
+
+ found = true;
+
+ if (action == RT_ACTION_FIND)
+ child = n4->children[idx];
+ else /* RT_ACTION_DELETE */
+ chunk_children_array_delete(n4->base.chunks, n4->children,
+ n4->base.n.count, idx);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+ int idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
+
+ if (idx < 0)
+ break;
+
+ found = true;
+ if (action == RT_ACTION_FIND)
+ child = n32->children[idx];
+ else /* RT_ACTION_DELETE */
+ chunk_children_array_delete(n32->base.chunks, n32->children,
+ n32->base.n.count, idx);
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ rt_node_inner_125 *n125 = (rt_node_inner_125 *) node;
+
+ if (!node_125_is_chunk_used((rt_node_base_125 *) n125, chunk))
+ break;
+
+ found = true;
+
+ if (action == RT_ACTION_FIND)
+ child = node_inner_125_get_child(n125, chunk);
+ else /* RT_ACTION_DELETE */
+ node_inner_125_delete(n125, chunk);
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+
+ if (!node_inner_256_is_chunk_used(n256, chunk))
+ break;
+
+ found = true;
+ if (action == RT_ACTION_FIND)
+ child = node_inner_256_get_child(n256, chunk);
+ else /* RT_ACTION_DELETE */
+ node_inner_256_delete(n256, chunk);
+
+ break;
+ }
+ }
+
+ /* update statistics */
+ if (action == RT_ACTION_DELETE && found)
+ node->count--;
+
+ if (found && child_p)
+ *child_p = child;
+
+ return found;
+}
+
+/*
+ * Search for the value corresponding to 'key' in the given node, and do the
+ * specified 'action'.
+ *
+ * Return true if the key is found, otherwise return false. On success, the pointer
+ * to the value is set to value_p.
+ */
+static inline bool
+rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p)
+{
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool found = false;
+ uint64 value = 0;
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+ int idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
+
+ if (idx < 0)
+ break;
+
+ found = true;
+
+ if (action == RT_ACTION_FIND)
+ value = n4->values[idx];
+ else /* RT_ACTION_DELETE */
+ chunk_values_array_delete(n4->base.chunks, (uint64 *) n4->values,
+ n4->base.n.count, idx);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+ int idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
+
+ if (idx < 0)
+ break;
+
+ found = true;
+ if (action == RT_ACTION_FIND)
+ value = n32->values[idx];
+ else /* RT_ACTION_DELETE */
+ chunk_values_array_delete(n32->base.chunks, (uint64 *) n32->values,
+ n32->base.n.count, idx);
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) node;
+
+ if (!node_125_is_chunk_used((rt_node_base_125 *) n125, chunk))
+ break;
+
+ found = true;
+
+ if (action == RT_ACTION_FIND)
+ value = node_leaf_125_get_value(n125, chunk);
+ else /* RT_ACTION_DELETE */
+ node_leaf_125_delete(n125, chunk);
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+
+ if (!node_leaf_256_is_chunk_used(n256, chunk))
+ break;
+
+ found = true;
+ if (action == RT_ACTION_FIND)
+ value = node_leaf_256_get_value(n256, chunk);
+ else /* RT_ACTION_DELETE */
+ node_leaf_256_delete(n256, chunk);
+
+ break;
+ }
+ }
+
+ /* update statistics */
+ if (action == RT_ACTION_DELETE && found)
+ node->count--;
+
+ if (found && value_p)
+ *value_p = value;
+
+ return found;
+}
+
+/* Insert the child to the inner node */
+static bool
+rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 key,
+ rt_node *child)
+{
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool chunk_exists = false;
+
+ Assert(!NODE_IS_LEAF(node));
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+ int idx;
+
+ idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n4->children[idx] = child;
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n4)))
+ {
+ rt_node_inner_32 *new32;
+ Assert(parent != NULL);
+
+ /* grow node from 4 to 32 */
+ new32 = (rt_node_inner_32 *) rt_grow_node_kind(tree, (rt_node *) n4,
+ RT_NODE_KIND_32);
+ chunk_children_array_copy(n4->base.chunks, n4->children,
+ new32->base.chunks, new32->children);
+
+ Assert(parent != NULL);
+ rt_replace_node(tree, parent, (rt_node *) n4, (rt_node *) new32,
+ key);
+ node = (rt_node *) new32;
+ }
+ else
+ {
+ int insertpos = node_4_get_insertpos((rt_node_base_4 *) n4, chunk);
+ uint16 count = n4->base.n.count;
+
+ /* shift chunks and children */
+ if (count != 0 && insertpos < count)
+ chunk_children_array_shift(n4->base.chunks, n4->children,
+ count, insertpos);
+
+ n4->base.chunks[insertpos] = chunk;
+ n4->children[insertpos] = child;
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_32:
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+ int idx;
+
+ idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n32->children[idx] = child;
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)))
+ {
+ Assert(parent != NULL);
+
+ if (n32->base.n.count == rt_size_class_info[RT_CLASS_32_PARTIAL].fanout)
+ {
+ /* use the same node kind, but expand to the next size class */
+ const Size size = rt_size_class_info[RT_CLASS_32_PARTIAL].inner_size;
+ const int fanout = rt_size_class_info[RT_CLASS_32_FULL].fanout;
+ rt_node_inner_32 *new32;
+
+ new32 = (rt_node_inner_32 *) rt_alloc_node(tree, RT_CLASS_32_FULL, true);
+ memcpy(new32, n32, size);
+ new32->base.n.fanout = fanout;
+
+ rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new32, key);
+
+ /* must update both pointers here */
+ node = (rt_node *) new32;
+ n32 = new32;
+
+ goto retry_insert_inner_32;
+ }
+ else
+ {
+ rt_node_inner_125 *new125;
+
+ /* grow node from 32 to 125 */
+ new125 = (rt_node_inner_125 *) rt_grow_node_kind(tree, (rt_node *) n32,
+ RT_NODE_KIND_125);
+ for (int i = 0; i < n32->base.n.count; i++)
+ node_inner_125_insert(new125, n32->base.chunks[i], n32->children[i]);
+
+ rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new125, key);
+ node = (rt_node *) new125;
+ }
+ }
+ else
+ {
+retry_insert_inner_32:
+ {
+ int insertpos = node_32_get_insertpos((rt_node_base_32 *) n32, chunk);
+ int16 count = n32->base.n.count;
+
+ if (count != 0 && insertpos < count)
+ chunk_children_array_shift(n32->base.chunks, n32->children,
+ count, insertpos);
+
+ n32->base.chunks[insertpos] = chunk;
+ n32->children[insertpos] = child;
+ break;
+ }
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_125:
+ {
+ rt_node_inner_125 *n125 = (rt_node_inner_125 *) node;
+ int cnt = 0;
+
+ if (node_125_is_chunk_used((rt_node_base_125 *) n125, chunk))
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ node_inner_125_update(n125, chunk, child);
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n125)))
+ {
+ rt_node_inner_256 *new256;
+ Assert(parent != NULL);
+
+ /* grow node from 125 to 256 */
+ new256 = (rt_node_inner_256 *) rt_grow_node_kind(tree, (rt_node *) n125,
+ RT_NODE_KIND_256);
+ for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n125->base.n.count; i++)
+ {
+ if (!node_125_is_chunk_used((rt_node_base_125 *) n125, i))
+ continue;
+
+ node_inner_256_set(new256, i, node_inner_125_get_child(n125, i));
+ cnt++;
+ }
+
+ rt_replace_node(tree, parent, (rt_node *) n125, (rt_node *) new256,
+ key);
+ node = (rt_node *) new256;
+ }
+ else
+ {
+ node_inner_125_insert(n125, chunk, child);
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_256:
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+
+ chunk_exists = node_inner_256_is_chunk_used(n256, chunk);
+ Assert(chunk_exists || FIXED_NODE_HAS_FREE_SLOT(n256, RT_CLASS_256));
+
+ node_inner_256_set(n256, chunk, child);
+ break;
+ }
+ }
+
+ /* Update statistics */
+ if (!chunk_exists)
+ node->count++;
+
+ /*
+ * Done. Finally, verify that the chunk and value are inserted or replaced
+ * properly in the node.
+ */
+ rt_verify_node(node);
+
+ return chunk_exists;
+}
+
+/* Insert the value to the leaf node */
+static bool
+rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
+ uint64 key, uint64 value)
+{
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool chunk_exists = false;
+
+ Assert(NODE_IS_LEAF(node));
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+ int idx;
+
+ idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n4->values[idx] = value;
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n4)))
+ {
+ rt_node_leaf_32 *new32;
+ Assert(parent != NULL);
+
+ /* grow node from 4 to 32 */
+ new32 = (rt_node_leaf_32 *) rt_grow_node_kind(tree, (rt_node *) n4,
+ RT_NODE_KIND_32);
+ chunk_values_array_copy(n4->base.chunks, n4->values,
+ new32->base.chunks, new32->values);
+ rt_replace_node(tree, parent, (rt_node *) n4, (rt_node *) new32, key);
+ node = (rt_node *) new32;
+ }
+ else
+ {
+ int insertpos = node_4_get_insertpos((rt_node_base_4 *) n4, chunk);
+ int count = n4->base.n.count;
+
+ /* shift chunks and values */
+ if (count != 0 && insertpos < count)
+ chunk_values_array_shift(n4->base.chunks, n4->values,
+ count, insertpos);
+
+ n4->base.chunks[insertpos] = chunk;
+ n4->values[insertpos] = value;
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_32:
+ {
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+ int idx;
+
+ idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n32->values[idx] = value;
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)))
+ {
+ Assert(parent != NULL);
+
+ if (n32->base.n.count == rt_size_class_info[RT_CLASS_32_PARTIAL].fanout)
+ {
+ /* use the same node kind, but expand to the next size class */
+ const Size size = rt_size_class_info[RT_CLASS_32_PARTIAL].leaf_size;
+ const int fanout = rt_size_class_info[RT_CLASS_32_FULL].fanout;
+ rt_node_leaf_32 *new32;
+
+ new32 = (rt_node_leaf_32 *) rt_alloc_node(tree, RT_CLASS_32_FULL, false);
+ memcpy(new32, n32, size);
+ new32->base.n.fanout = fanout;
+
+ rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new32, key);
+
+ /* must update both pointers here */
+ node = (rt_node *) new32;
+ n32 = new32;
+
+ goto retry_insert_leaf_32;
+ }
+ else
+ {
+ rt_node_leaf_125 *new125;
+
+ /* grow node from 32 to 125 */
+ new125 = (rt_node_leaf_125 *) rt_grow_node_kind(tree, (rt_node *) n32,
+ RT_NODE_KIND_125);
+ for (int i = 0; i < n32->base.n.count; i++)
+ node_leaf_125_insert(new125, n32->base.chunks[i], n32->values[i]);
+
+ rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new125,
+ key);
+ node = (rt_node *) new125;
+ }
+ }
+ else
+ {
+ retry_insert_leaf_32:
+ {
+ int insertpos = node_32_get_insertpos((rt_node_base_32 *) n32, chunk);
+ int count = n32->base.n.count;
+
+ if (count != 0 && insertpos < count)
+ chunk_values_array_shift(n32->base.chunks, n32->values,
+ count, insertpos);
+
+ n32->base.chunks[insertpos] = chunk;
+ n32->values[insertpos] = value;
+ break;
+ }
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_125:
+ {
+ rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) node;
+ int cnt = 0;
+
+ if (node_125_is_chunk_used((rt_node_base_125 *) n125, chunk))
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ node_leaf_125_update(n125, chunk, value);
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n125)))
+ {
+ rt_node_leaf_256 *new256;
+ Assert(parent != NULL);
+
+ /* grow node from 125 to 256 */
+ new256 = (rt_node_leaf_256 *) rt_grow_node_kind(tree, (rt_node *) n125,
+ RT_NODE_KIND_256);
+ for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n125->base.n.count; i++)
+ {
+ if (!node_125_is_chunk_used((rt_node_base_125 *) n125, i))
+ continue;
+
+ node_leaf_256_set(new256, i, node_leaf_125_get_value(n125, i));
+ cnt++;
+ }
+
+ rt_replace_node(tree, parent, (rt_node *) n125, (rt_node *) new256,
+ key);
+ node = (rt_node *) new256;
+ }
+ else
+ {
+ node_leaf_125_insert(n125, chunk, value);
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_256:
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+
+ chunk_exists = node_leaf_256_is_chunk_used(n256, chunk);
+ Assert(chunk_exists || FIXED_NODE_HAS_FREE_SLOT(n256, RT_CLASS_256));
+
+ node_leaf_256_set(n256, chunk, value);
+ break;
+ }
+ }
+
+ /* Update statistics */
+ if (!chunk_exists)
+ node->count++;
+
+ /*
+ * Done. Finally, verify that the chunk and value are inserted or replaced
+ * properly in the node.
+ */
+ rt_verify_node(node);
+
+ return chunk_exists;
+}
+
+/*
+ * Create the radix tree in the given memory context and return it.
+ */
+radix_tree *
+rt_create(MemoryContext ctx)
+{
+ radix_tree *tree;
+ MemoryContext old_ctx;
+
+ old_ctx = MemoryContextSwitchTo(ctx);
+
+ tree = palloc(sizeof(radix_tree));
+ tree->context = ctx;
+ tree->root = NULL;
+ tree->max_val = 0;
+ tree->num_keys = 0;
+
+ /* Create the slab allocator for each size class */
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ tree->inner_slabs[i] = SlabContextCreate(ctx,
+ rt_size_class_info[i].name,
+ rt_size_class_info[i].inner_blocksize,
+ rt_size_class_info[i].inner_size);
+ tree->leaf_slabs[i] = SlabContextCreate(ctx,
+ rt_size_class_info[i].name,
+ rt_size_class_info[i].leaf_blocksize,
+ rt_size_class_info[i].leaf_size);
+#ifdef RT_DEBUG
+ tree->cnt[i] = 0;
+#endif
+ }
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return tree;
+}
+
+/*
+ * Free the given radix tree.
+ */
+void
+rt_free(radix_tree *tree)
+{
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ MemoryContextDelete(tree->inner_slabs[i]);
+ MemoryContextDelete(tree->leaf_slabs[i]);
+ }
+
+ pfree(tree);
+}
+
+/*
+ * Set key to value. If the entry already exists, we update its value to 'value'
+ * and return true. Return false if the entry doesn't yet exist.
+ */
+bool
+rt_set(radix_tree *tree, uint64 key, uint64 value)
+{
+ int shift;
+ bool updated;
+ rt_node *node;
+ rt_node *parent;
+
+ /* Empty tree, create the root */
+ if (!tree->root)
+ rt_new_root(tree, key);
+
+ /* Extend the tree if necessary */
+ if (key > tree->max_val)
+ rt_extend(tree, key);
+
+ Assert(tree->root);
+
+ shift = tree->root->shift;
+ node = parent = tree->root;
+
+ /* Descend the tree until a leaf node */
+ while (shift >= 0)
+ {
+ rt_node *child;
+
+ if (NODE_IS_LEAF(node))
+ break;
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ {
+ rt_set_extend(tree, key, value, parent, node);
+ return false;
+ }
+
+ parent = node;
+ node = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ updated = rt_node_insert_leaf(tree, parent, node, key, value);
+
+ /* Update the statistics */
+ if (!updated)
+ tree->num_keys++;
+
+ return updated;
+}
+
+/*
+ * Search for the given key in the radix tree. Return true if the key is found,
+ * otherwise return false. On success, we set the value to *value_p, so it must
+ * not be NULL.
+ */
+bool
+rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
+{
+ rt_node *node;
+ int shift;
+
+ Assert(value_p != NULL);
+
+ if (!tree->root || key > tree->max_val)
+ return false;
+
+ node = tree->root;
+ shift = tree->root->shift;
+
+ /* Descend the tree until a leaf node */
+ while (shift >= 0)
+ {
+ rt_node *child;
+
+ if (NODE_IS_LEAF(node))
+ break;
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ return false;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ return rt_node_search_leaf(node, key, RT_ACTION_FIND, value_p);
+}
+
+/*
+ * Delete the given key from the radix tree. Return true if the key is found (and
+ * deleted), otherwise do nothing and return false.
+ */
+bool
+rt_delete(radix_tree *tree, uint64 key)
+{
+ rt_node *node;
+ rt_node *stack[RT_MAX_LEVEL] = {0};
+ int shift;
+ int level;
+ bool deleted;
+
+ if (!tree->root || key > tree->max_val)
+ return false;
+
+ /*
+ * Descend the tree to search the key while building a stack of nodes we
+ * visited.
+ */
+ node = tree->root;
+ shift = tree->root->shift;
+ level = -1;
+ while (shift > 0)
+ {
+ rt_node *child;
+
+ /* Push the current node to the stack */
+ stack[++level] = node;
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ return false;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ /* Delete the key from the leaf node if exists */
+ Assert(NODE_IS_LEAF(node));
+ deleted = rt_node_search_leaf(node, key, RT_ACTION_DELETE, NULL);
+
+ if (!deleted)
+ {
+ /* no key is found in the leaf node */
+ return false;
+ }
+
+ /* Found the key to delete. Update the statistics */
+ tree->num_keys--;
+
+ /*
+ * Return if the leaf node still has keys and we don't need to delete the
+ * node.
+ */
+ if (!NODE_IS_EMPTY(node))
+ return true;
+
+ /* Free the empty leaf node */
+ rt_free_node(tree, node);
+
+ /* Delete the key in inner nodes recursively */
+ while (level >= 0)
+ {
+ node = stack[level--];
+
+ deleted = rt_node_search_inner(node, key, RT_ACTION_DELETE, NULL);
+ Assert(deleted);
+
+ /* If the node didn't become empty, we stop deleting the key */
+ if (!NODE_IS_EMPTY(node))
+ break;
+
+ /* The node became empty */
+ rt_free_node(tree, node);
+ }
+
+ return true;
+}
+
+/* Create and return the iterator for the given radix tree */
+rt_iter *
+rt_begin_iterate(radix_tree *tree)
+{
+ MemoryContext old_ctx;
+ rt_iter *iter;
+ int top_level;
+
+ old_ctx = MemoryContextSwitchTo(tree->context);
+
+ iter = (rt_iter *) palloc0(sizeof(rt_iter));
+ iter->tree = tree;
+
+ /* empty tree */
+ if (!iter->tree->root)
+ return iter;
+
+ top_level = iter->tree->root->shift / RT_NODE_SPAN;
+ iter->stack_len = top_level;
+
+ /*
+ * Descend to the leftmost leaf node from the root. The key is being
+ * constructed while descending to the leaf.
+ */
+ rt_update_iter_stack(iter, iter->tree->root, top_level);
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return iter;
+}
+
+/*
+ * Update each node_iter for inner nodes in the iterator node stack.
+ */
+static void
+rt_update_iter_stack(rt_iter *iter, rt_node *from_node, int from)
+{
+ int level = from;
+ rt_node *node = from_node;
+
+ for (;;)
+ {
+ rt_node_iter *node_iter = &(iter->stack[level--]);
+
+ node_iter->node = node;
+ node_iter->current_idx = -1;
+
+ /* We don't advance the leaf node iterator here */
+ if (NODE_IS_LEAF(node))
+ return;
+
+ /* Advance to the next slot in the inner node */
+ node = rt_node_inner_iterate_next(iter, node_iter);
+
+ /* We must find the first child in the node */
+ Assert(node);
+ }
+}
+
+/*
+ * Return true and set key_p and value_p if there is a next key. Otherwise,
+ * return false.
+ */
+bool
+rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p)
+{
+ /* Empty tree */
+ if (!iter->tree->root)
+ return false;
+
+ for (;;)
+ {
+ rt_node *child = NULL;
+ uint64 value;
+ int level;
+ bool found;
+
+ /* Advance the leaf node iterator to get next key-value pair */
+ found = rt_node_leaf_iterate_next(iter, &(iter->stack[0]), &value);
+
+ if (found)
+ {
+ *key_p = iter->key;
+ *value_p = value;
+ return true;
+ }
+
+ /*
+ * We've visited all values in the leaf node, so advance inner node
+ * iterators from level 1 upward until we find the next child node.
+ */
+ for (level = 1; level <= iter->stack_len; level++)
+ {
+ child = rt_node_inner_iterate_next(iter, &(iter->stack[level]));
+
+ if (child)
+ break;
+ }
+
+ /* the iteration finished */
+ if (!child)
+ return false;
+
+ /*
+ * Set the node to the node iterator and update the iterator stack
+ * from this node.
+ */
+ rt_update_iter_stack(iter, child, level - 1);
+
+ /* Node iterators are updated, so try again from the leaf */
+ }
+
+ return false;
+}
+
+void
+rt_end_iterate(rt_iter *iter)
+{
+ pfree(iter);
+}
+
+static inline void
+rt_iter_update_key(rt_iter *iter, uint8 chunk, uint8 shift)
+{
+ iter->key &= ~(((uint64) RT_CHUNK_MASK) << shift);
+ iter->key |= (((uint64) chunk) << shift);
+}
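+
+/*
+ * For example (illustrative): while descending during iteration, visiting
+ * chunk 0x12 at shift 8 and then chunk 0x34 at shift 0 leaves iter->key with
+ * 0x1234 in its low 16 bits, so the full key is available when the leaf slot
+ * is reached.
+ */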
+
+/*
+ * Advance the slot in the inner node. Return the child if it exists, otherwise
+ * NULL.
+ */
+static inline rt_node *
+rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
+{
+ rt_node *child = NULL;
+ bool found = false;
+ uint8 key_chunk;
+
+ switch (node_iter->node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n4->base.n.count)
+ break;
+
+ child = n4->children[node_iter->current_idx];
+ key_chunk = n4->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n32->base.n.count)
+ break;
+
+ child = n32->children[node_iter->current_idx];
+ key_chunk = n32->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ rt_node_inner_125 *n125 = (rt_node_inner_125 *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (node_125_is_chunk_used((rt_node_base_125 *) n125, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+ child = node_inner_125_get_child(n125, i);
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (node_inner_256_is_chunk_used(n256, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+ child = node_inner_256_get_child(n256, i);
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ }
+
+ if (found)
+ rt_iter_update_key(iter, key_chunk, node_iter->node->shift);
+
+ return child;
+}
+
+/*
+ * Advance the slot in the leaf node. On success, return true and the value
+ * is set to value_p, otherwise return false.
+ */
+static inline bool
+rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
+ uint64 *value_p)
+{
+ rt_node *node = node_iter->node;
+ bool found = false;
+ uint64 value;
+ uint8 key_chunk;
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n4->base.n.count)
+ break;
+
+ value = n4->values[node_iter->current_idx];
+ key_chunk = n4->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n32->base.n.count)
+ break;
+
+ value = n32->values[node_iter->current_idx];
+ key_chunk = n32->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (node_125_is_chunk_used((rt_node_base_125 *) n125, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+ value = node_leaf_125_get_value(n125, i);
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (node_leaf_256_is_chunk_used(n256, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+ value = node_leaf_256_get_value(n256, i);
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ }
+
+ if (found)
+ {
+ rt_iter_update_key(iter, key_chunk, node_iter->node->shift);
+ *value_p = value;
+ }
+
+ return found;
+}
+
+/*
+ * Return the number of keys in the radix tree.
+ */
+uint64
+rt_num_entries(radix_tree *tree)
+{
+ return tree->num_keys;
+}
+
+/*
+ * Return the amount of memory used by the radix tree.
+ */
+uint64
+rt_memory_usage(radix_tree *tree)
+{
+ Size total = sizeof(radix_tree);
+
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
+ total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
+ }
+
+ return total;
+}
+
+/*
+ * Verify the radix tree node.
+ */
+static void
+rt_verify_node(rt_node *node)
+{
+#ifdef USE_ASSERT_CHECKING
+ Assert(node->count >= 0);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_base_4 *n4 = (rt_node_base_4 *) node;
+
+ for (int i = 1; i < n4->n.count; i++)
+ Assert(n4->chunks[i - 1] < n4->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_base_32 *n32 = (rt_node_base_32 *) node;
+
+ for (int i = 1; i < n32->n.count; i++)
+ Assert(n32->chunks[i - 1] < n32->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ rt_node_base_125 *n125 = (rt_node_base_125 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!node_125_is_chunk_used(n125, i))
+ continue;
+
+ /* Check if the corresponding slot is used */
+ if (NODE_IS_LEAF(node))
+ Assert(node_leaf_125_is_slot_used((rt_node_leaf_125 *) node,
+ n125->slot_idxs[i]));
+ else
+ Assert(node_inner_125_is_slot_used((rt_node_inner_125 *) node,
+ n125->slot_idxs[i]));
+
+ cnt++;
+ }
+
+ Assert(n125->n.count == cnt);
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RT_NODE_NSLOTS_BITS(RT_NODE_MAX_SLOTS); i++)
+ cnt += pg_popcount32(n256->isset[i]);
+
+ /* Check if the number of used chunk matches */
+ Assert(n256->base.n.count == cnt);
+
+ break;
+ }
+ }
+ }
+#endif
+}
+
+/***************** DEBUG FUNCTIONS *****************/
+#ifdef RT_DEBUG
+void
+rt_stats(radix_tree *tree)
+{
+ ereport(NOTICE, (errmsg("num_keys = " UINT64_FORMAT ", height = %d, n4 = %u, n15 = %u, n32 = %u, n125 = %u, n256 = %u",
+ tree->num_keys,
+ tree->root->shift / RT_NODE_SPAN,
+ tree->cnt[RT_CLASS_4_FULL],
+ tree->cnt[RT_CLASS_32_PARTIAL],
+ tree->cnt[RT_CLASS_32_FULL],
+ tree->cnt[RT_CLASS_125_FULL],
+ tree->cnt[RT_CLASS_256])));
+}
+
+static void
+rt_dump_node(rt_node *node, int level, bool recurse)
+{
+ char space[125] = {0};
+
+ fprintf(stderr, "[%s] kind %d, fanout %d, count %u, shift %u, chunk 0x%X:\n",
+ NODE_IS_LEAF(node) ? "LEAF" : "INNR",
+ (node->kind == RT_NODE_KIND_4) ? 4 :
+ (node->kind == RT_NODE_KIND_32) ? 32 :
+ (node->kind == RT_NODE_KIND_125) ? 125 : 256,
+ node->fanout == 0 ? 256 : node->fanout,
+ node->count, node->shift, node->chunk);
+
+ if (level > 0)
+ sprintf(space, "%*c", level * 4, ' ');
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, n4->base.chunks[i], n4->values[i]);
+ }
+ else
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, n4->base.chunks[i]);
+
+ if (recurse)
+ rt_dump_node(n4->children[i], level + 1, recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, n32->base.chunks[i], n32->values[i]);
+ }
+ else
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, n32->base.chunks[i]);
+
+ if (recurse)
+ {
+ rt_dump_node(n32->children[i], level + 1, recurse);
+ }
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ rt_node_base_125 *b125 = (rt_node_base_125 *) node;
+
+ fprintf(stderr, "slot_idxs ");
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!node_125_is_chunk_used(b125, i))
+ continue;
+
+ fprintf(stderr, " [%d]=%d, ", i, b125->slot_idxs[i]);
+ }
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_125 *n = (rt_node_leaf_125 *) node;
+
+ fprintf(stderr, ", isset-bitmap:");
+ for (int i = 0; i < 16; i++)
+ {
+ fprintf(stderr, "%X ", (uint8) n->isset[i]);
+ }
+ fprintf(stderr, "\n");
+ }
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!node_125_is_chunk_used(b125, i))
+ continue;
+
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) b125;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, i, node_leaf_125_get_value(n125, i));
+ }
+ else
+ {
+ rt_node_inner_125 *n125 = (rt_node_inner_125 *) b125;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, i);
+
+ if (recurse)
+ rt_dump_node(node_inner_125_get_child(n125, i),
+ level + 1, recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+
+ if (!node_leaf_256_is_chunk_used(n256, i))
+ continue;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, i, node_leaf_256_get_value(n256, i));
+ }
+ else
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+
+ if (!node_inner_256_is_chunk_used(n256, i))
+ continue;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, i);
+
+ if (recurse)
+ rt_dump_node(node_inner_256_get_child(n256, i), level + 1,
+ recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ }
+}
+
+void
+rt_dump_search(radix_tree *tree, uint64 key)
+{
+ rt_node *node;
+ int shift;
+ int level = 0;
+
+ elog(NOTICE, "-----------------------------------------------------------");
+ elog(NOTICE, "max_val = " UINT64_FORMAT "(0x" UINT64_FORMAT_HEX ")",
+ tree->max_val, tree->max_val);
+
+ if (!tree->root)
+ {
+ elog(NOTICE, "tree is empty");
+ return;
+ }
+
+ if (key > tree->max_val)
+ {
+ elog(NOTICE, "key " UINT64_FORMAT "(0x" UINT64_FORMAT_HEX ") is larger than max val",
+ key, key);
+ return;
+ }
+
+ node = tree->root;
+ shift = tree->root->shift;
+ while (shift >= 0)
+ {
+ rt_node *child;
+
+ rt_dump_node(node, level, false);
+
+ if (NODE_IS_LEAF(node))
+ {
+ uint64 dummy;
+
+ /* We reached a leaf node; find the corresponding slot */
+ rt_node_search_leaf(node, key, RT_ACTION_FIND, &dummy);
+
+ break;
+ }
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ break;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ level++;
+ }
+}
+
+void
+rt_dump(radix_tree *tree)
+{
+
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ fprintf(stderr, "%s\tinner_size %zu\tinner_blocksize %zu\tleaf_size %zu\tleaf_blocksize %zu\n",
+ rt_size_class_info[i].name,
+ rt_size_class_info[i].inner_size,
+ rt_size_class_info[i].inner_blocksize,
+ rt_size_class_info[i].leaf_size,
+ rt_size_class_info[i].leaf_blocksize);
+ fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->max_val);
+
+ if (!tree->root)
+ {
+ fprintf(stderr, "empty tree\n");
+ return;
+ }
+
+ rt_dump_node(tree->root, 0, true);
+}
+#endif
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
new file mode 100644
index 0000000000..d5d7668617
--- /dev/null
+++ b/src/include/lib/radixtree.h
@@ -0,0 +1,42 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.h
+ * Interface for radix tree.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/lib/radixtree.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef RADIXTREE_H
+#define RADIXTREE_H
+
+#include "postgres.h"
+
+#define RT_DEBUG 1
+
+typedef struct radix_tree radix_tree;
+typedef struct rt_iter rt_iter;
+
+extern radix_tree *rt_create(MemoryContext ctx);
+extern void rt_free(radix_tree *tree);
+extern bool rt_search(radix_tree *tree, uint64 key, uint64 *val_p);
+extern bool rt_set(radix_tree *tree, uint64 key, uint64 val);
+extern rt_iter *rt_begin_iterate(radix_tree *tree);
+
+extern bool rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p);
+extern void rt_end_iterate(rt_iter *iter);
+extern bool rt_delete(radix_tree *tree, uint64 key);
+
+extern uint64 rt_memory_usage(radix_tree *tree);
+extern uint64 rt_num_entries(radix_tree *tree);
+
+#ifdef RT_DEBUG
+extern void rt_dump(radix_tree *tree);
+extern void rt_dump_search(radix_tree *tree, uint64 key);
+extern void rt_stats(radix_tree *tree);
+#endif
+
+#endif /* RADIXTREE_H */
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index c629cbe383..9659eb85d7 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -28,6 +28,7 @@ SUBDIRS = \
test_pg_db_role_setting \
test_pg_dump \
test_predtest \
+ test_radixtree \
test_rbtree \
test_regex \
test_rls_hooks \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 911a768a29..fd101e3bf4 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -22,6 +22,7 @@ subdir('test_parser')
subdir('test_pg_db_role_setting')
subdir('test_pg_dump')
subdir('test_predtest')
+subdir('test_radixtree')
subdir('test_rbtree')
subdir('test_regex')
subdir('test_rls_hooks')
diff --git a/src/test/modules/test_radixtree/.gitignore b/src/test/modules/test_radixtree/.gitignore
new file mode 100644
index 0000000000..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/test_radixtree/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_radixtree/Makefile b/src/test/modules/test_radixtree/Makefile
new file mode 100644
index 0000000000..da06b93da3
--- /dev/null
+++ b/src/test/modules/test_radixtree/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_radixtree/Makefile
+
+MODULE_big = test_radixtree
+OBJS = \
+ $(WIN32RES) \
+ test_radixtree.o
+PGFILEDESC = "test_radixtree - test code for src/backend/lib/radixtree.c"
+
+EXTENSION = test_radixtree
+DATA = test_radixtree--1.0.sql
+
+REGRESS = test_radixtree
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_radixtree
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_radixtree/README b/src/test/modules/test_radixtree/README
new file mode 100644
index 0000000000..a8b271869a
--- /dev/null
+++ b/src/test/modules/test_radixtree/README
@@ -0,0 +1,7 @@
+test_radixtree contains unit tests for testing the radix tree implementation
+in src/backend/lib/radixtree.c.
+
+The tests verify the correctness of the implementation, but they can also be
+used as a micro-benchmark. If you set the 'rt_test_stats' flag in
+test_radixtree.c, the tests will print extra information about execution time
+and memory usage.
diff --git a/src/test/modules/test_radixtree/expected/test_radixtree.out b/src/test/modules/test_radixtree/expected/test_radixtree.out
new file mode 100644
index 0000000000..ce645cb8b5
--- /dev/null
+++ b/src/test/modules/test_radixtree/expected/test_radixtree.out
@@ -0,0 +1,36 @@
+CREATE EXTENSION test_radixtree;
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
+NOTICE: testing basic operations with leaf node 4
+NOTICE: testing basic operations with inner node 4
+NOTICE: testing basic operations with leaf node 32
+NOTICE: testing basic operations with inner node 32
+NOTICE: testing basic operations with leaf node 125
+NOTICE: testing basic operations with inner node 125
+NOTICE: testing basic operations with leaf node 256
+NOTICE: testing basic operations with inner node 256
+NOTICE: testing radix tree node types with shift "0"
+NOTICE: testing radix tree node types with shift "8"
+NOTICE: testing radix tree node types with shift "16"
+NOTICE: testing radix tree node types with shift "24"
+NOTICE: testing radix tree node types with shift "32"
+NOTICE: testing radix tree node types with shift "40"
+NOTICE: testing radix tree node types with shift "48"
+NOTICE: testing radix tree node types with shift "56"
+NOTICE: testing radix tree with pattern "all ones"
+NOTICE: testing radix tree with pattern "alternating bits"
+NOTICE: testing radix tree with pattern "clusters of ten"
+NOTICE: testing radix tree with pattern "clusters of hundred"
+NOTICE: testing radix tree with pattern "one-every-64k"
+NOTICE: testing radix tree with pattern "sparse"
+NOTICE: testing radix tree with pattern "single values, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^60"
+ test_radixtree
+----------------
+
+(1 row)
+
diff --git a/src/test/modules/test_radixtree/meson.build b/src/test/modules/test_radixtree/meson.build
new file mode 100644
index 0000000000..f96bf159d6
--- /dev/null
+++ b/src/test/modules/test_radixtree/meson.build
@@ -0,0 +1,34 @@
+# FIXME: prevent install during main install, but not during test :/
+
+test_radixtree_sources = files(
+ 'test_radixtree.c',
+)
+
+if host_system == 'windows'
+ test_radixtree_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'test_radixtree',
+ '--FILEDESC', 'test_radixtree - test code for src/backend/lib/radixtree.c',])
+endif
+
+test_radixtree = shared_module('test_radixtree',
+ test_radixtree_sources,
+ kwargs: pg_mod_args,
+)
+testprep_targets += test_radixtree
+
+install_data(
+ 'test_radixtree.control',
+ 'test_radixtree--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'test_radixtree',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'test_radixtree',
+ ],
+ },
+}
diff --git a/src/test/modules/test_radixtree/sql/test_radixtree.sql b/src/test/modules/test_radixtree/sql/test_radixtree.sql
new file mode 100644
index 0000000000..41ece5e9f5
--- /dev/null
+++ b/src/test/modules/test_radixtree/sql/test_radixtree.sql
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_radixtree;
+
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
diff --git a/src/test/modules/test_radixtree/test_radixtree--1.0.sql b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
new file mode 100644
index 0000000000..074a5a7ea7
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_radixtree/test_radixtree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_radixtree" to load this file. \quit
+
+CREATE FUNCTION test_radixtree()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
new file mode 100644
index 0000000000..ea993e63df
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -0,0 +1,581 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_radixtree.c
+ * Test radixtree set data structure.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/test/modules/test_radixtree/test_radixtree.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "lib/radixtree.h"
+#include "miscadmin.h"
+#include "nodes/bitmapset.h"
+#include "storage/block.h"
+#include "storage/itemptr.h"
+#include "utils/memutils.h"
+#include "utils/timestamp.h"
+
+#define UINT64_HEX_FORMAT "%" INT64_MODIFIER "X"
+
+/*
+ * If you enable this, the "pattern" tests will print information about
+ * how long populating, probing, and iterating the test set takes, and
+ * how much memory the test set consumed. That can be used as
+ * micro-benchmark of various operations and input patterns (you might
+ * want to increase the number of values used in each of the test, if
+ * you do that, to reduce noise).
+ *
+ * The information is printed to the server's stderr, mostly because
+ * that's where MemoryContextStats() output goes.
+ */
+static const bool rt_test_stats = false;
+
+static int rt_node_kind_fanouts[] = {
+ 0,
+ 4, /* RT_NODE_KIND_4 */
+ 32, /* RT_NODE_KIND_32 */
+ 125, /* RT_NODE_KIND_125 */
+ 256 /* RT_NODE_KIND_256 */
+};
+/*
+ * A struct to define a pattern of integers, for use with the test_pattern()
+ * function.
+ */
+typedef struct
+{
+ char *test_name; /* short name of the test, for humans */
+ char *pattern_str; /* a bit pattern */
+ uint64 spacing; /* pattern repeats at this interval */
+ uint64 num_values; /* number of integers to set in total */
+} test_spec;
+
+/* Test patterns borrowed from test_integerset.c */
+static const test_spec test_specs[] = {
+ {
+ "all ones", "1111111111",
+ 10, 1000000
+ },
+ {
+ "alternating bits", "0101010101",
+ 10, 1000000
+ },
+ {
+ "clusters of ten", "1111111111",
+ 10000, 1000000
+ },
+ {
+ "clusters of hundred",
+ "1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111",
+ 10000, 1000000
+ },
+ {
+ "one-every-64k", "1",
+ 65536, 1000000
+ },
+ {
+ "sparse", "100000000000000000000000000000001",
+ 10000000, 1000000
+ },
+ {
+ "single values, distance > 2^32", "1",
+ UINT64CONST(10000000000), 100000
+ },
+ {
+ "clusters, distance > 2^32", "10101010",
+ UINT64CONST(10000000000), 1000000
+ },
+ {
+ "clusters, distance > 2^60", "10101010",
+ UINT64CONST(2000000000000000000),
+ 23 /* can't be much higher than this, or we
+ * overflow uint64 */
+ }
+};
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(test_radixtree);
+
+static void
+test_empty(void)
+{
+ radix_tree *radixtree;
+ rt_iter *iter;
+ uint64 dummy;
+ uint64 key;
+ uint64 val;
+
+ radixtree = rt_create(CurrentMemoryContext);
+
+ if (rt_search(radixtree, 0, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, 1, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, PG_UINT64_MAX, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_delete(radixtree, 0))
+ elog(ERROR, "rt_delete on empty tree returned true");
+
+ if (rt_num_entries(radixtree) != 0)
+ elog(ERROR, "rt_num_entries on empty tree return non-zero");
+
+ iter = rt_begin_iterate(radixtree);
+
+ if (rt_iterate_next(iter, &key, &val))
+ elog(ERROR, "rt_itereate_next on empty tree returned true");
+
+ rt_end_iterate(iter);
+
+ rt_free(radixtree);
+}
+
+static void
+test_basic(int children, bool test_inner)
+{
+ radix_tree *radixtree;
+ uint64 *keys;
+ int shift = test_inner ? 8 : 0;
+
+ elog(NOTICE, "testing basic operations with %s node %d",
+ test_inner ? "inner" : "leaf", children);
+
+ radixtree = rt_create(CurrentMemoryContext);
+
+ /* prepare keys in interleaved order like 1, 32, 2, 31, 3, 30, ... */
+ keys = palloc(sizeof(uint64) * children);
+ for (int i = 0; i < children; i++)
+ {
+ if (i % 2 == 0)
+ keys[i] = (uint64) ((i / 2) + 1) << shift;
+ else
+ keys[i] = (uint64) (children - (i / 2)) << shift;
+ }
+
+ /* insert keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (rt_set(radixtree, keys[i], keys[i]))
+ elog(ERROR, "new inserted key 0x" UINT64_HEX_FORMAT " is found ", keys[i]);
+ }
+
+ /* update keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (!rt_set(radixtree, keys[i], keys[i] + 1))
+ elog(ERROR, "could not update key 0x" UINT64_HEX_FORMAT, keys[i]);
+ }
+
+ /* repeat deleting and inserting keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (!rt_delete(radixtree, keys[i]))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, keys[i]);
+ if (rt_set(radixtree, keys[i], keys[i]))
+ elog(ERROR, "new inserted key 0x" UINT64_HEX_FORMAT " is found ", keys[i]);
+ }
+
+ pfree(keys);
+ rt_free(radixtree);
+}
+
+/*
+ * Check if keys from start to end with the shift exist in the tree.
+ */
+static void
+check_search_on_node(radix_tree *radixtree, uint8 shift, int start, int end,
+ int incr)
+{
+ for (int i = start; i < end; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ uint64 val;
+
+ if (!rt_search(radixtree, key, &val))
+ elog(ERROR, "key 0x" UINT64_HEX_FORMAT " is not found on node-%d",
+ key, end);
+ if (val != key)
+ elog(ERROR, "rt_search with key 0x" UINT64_HEX_FORMAT " returns 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ key, val, key);
+ }
+}
+
+static void
+test_node_types_insert(radix_tree *radixtree, uint8 shift, bool insert_asc)
+{
+ uint64 num_entries;
+ int ninserted = 0;
+ int start = insert_asc ? 0 : 256;
+ int incr = insert_asc ? 1 : -1;
+ int end = insert_asc ? 256 : 0;
+ int node_kind_idx = 1;
+
+ for (int i = start; i != end; i += incr)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_set(radixtree, key, key);
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " is found", key);
+
+ /*
+ * After filling all slots in each node type, check if the values
+ * are stored properly.
+ */
+ if (ninserted == rt_node_kind_fanouts[node_kind_idx] - 1)
+ {
+ int check_start = insert_asc
+ ? rt_node_kind_fanouts[node_kind_idx - 1]
+ : rt_node_kind_fanouts[node_kind_idx];
+ int check_end = insert_asc
+ ? rt_node_kind_fanouts[node_kind_idx]
+ : rt_node_kind_fanouts[node_kind_idx - 1];
+
+ check_search_on_node(radixtree, shift, check_start, check_end, incr);
+ node_kind_idx++;
+ }
+
+ ninserted++;
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ if (num_entries != 256)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(256));
+}
+
+static void
+test_node_types_delete(radix_tree *radixtree, uint8 shift)
+{
+ uint64 num_entries;
+
+ for (int i = 0; i < 256; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_delete(radixtree, key);
+
+ if (!found)
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, key);
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ /* The tree must be empty */
+ if (num_entries != 0)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(0));
+}
+
+/*
+ * Test for inserting and deleting key-value pairs to each node type at the given shift
+ * level.
+ */
+static void
+test_node_types(uint8 shift)
+{
+ radix_tree *radixtree;
+
+ elog(NOTICE, "testing radix tree node types with shift \"%d\"", shift);
+
+ radixtree = rt_create(CurrentMemoryContext);
+
+ /*
+ * Insert and search entries for every node type at the 'shift' level,
+ * then delete all entries to make it empty, and insert and search entries
+ * again.
+ */
+ test_node_types_insert(radixtree, shift, true);
+ test_node_types_delete(radixtree, shift);
+ test_node_types_insert(radixtree, shift, false);
+
+ rt_free(radixtree);
+}
+
+/*
+ * Test with a repeating pattern, defined by the 'spec'.
+ */
+static void
+test_pattern(const test_spec * spec)
+{
+ radix_tree *radixtree;
+ rt_iter *iter;
+ MemoryContext radixtree_ctx;
+ TimestampTz starttime;
+ TimestampTz endtime;
+ uint64 n;
+ uint64 last_int;
+ uint64 ndeleted;
+ uint64 nbefore;
+ uint64 nafter;
+ int patternlen;
+ uint64 *pattern_values;
+ uint64 pattern_num_values;
+
+ elog(NOTICE, "testing radix tree with pattern \"%s\"", spec->test_name);
+ if (rt_test_stats)
+ fprintf(stderr, "-----\ntesting radix tree with pattern \"%s\"\n", spec->test_name);
+
+ /* Pre-process the pattern, creating an array of integers from it. */
+ patternlen = strlen(spec->pattern_str);
+ pattern_values = palloc(patternlen * sizeof(uint64));
+ pattern_num_values = 0;
+ for (int i = 0; i < patternlen; i++)
+ {
+ if (spec->pattern_str[i] == '1')
+ pattern_values[pattern_num_values++] = i;
+ }
+
+ /*
+ * Allocate the radix tree.
+ *
+ * Allocate it in a separate memory context, so that we can print its
+ * memory usage easily.
+ */
+ radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
+ "radixtree test",
+ ALLOCSET_SMALL_SIZES);
+ MemoryContextSetIdentifier(radixtree_ctx, spec->test_name);
+ radixtree = rt_create(radixtree_ctx);
+
+ /*
+ * Add values to the set.
+ */
+ starttime = GetCurrentTimestamp();
+
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ uint64 x = 0;
+
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ bool found;
+
+ x = last_int + pattern_values[i];
+
+ found = rt_set(radixtree, x, x);
+
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " found", x);
+
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+
+ endtime = GetCurrentTimestamp();
+
+ if (rt_test_stats)
+ fprintf(stderr, "added " UINT64_FORMAT " values in %d ms\n",
+ spec->num_values, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Print stats on the amount of memory used.
+ *
+ * We print the usage reported by rt_memory_usage(), as well as the stats
+ * from the memory context. They should be in the same ballpark, but it's
+ * hard to automate testing that, so if you're making changes to the
+ * implementation, just observe that manually.
+ */
+ if (rt_test_stats)
+ {
+ uint64 mem_usage;
+
+ /*
+ * Also print memory usage as reported by rt_memory_usage(). It
+ * should be in the same ballpark as the usage reported by
+ * MemoryContextStats().
+ */
+ mem_usage = rt_memory_usage(radixtree);
+ fprintf(stderr, "rt_memory_usage() reported " UINT64_FORMAT " (%0.2f bytes / integer)\n",
+ mem_usage, (double) mem_usage / spec->num_values);
+
+ MemoryContextStats(radixtree_ctx);
+ }
+
+ /* Check that rt_num_entries works */
+ n = rt_num_entries(radixtree);
+ if (n != spec->num_values)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT, n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_search()
+ */
+ starttime = GetCurrentTimestamp();
+
+ for (n = 0; n < 100000; n++)
+ {
+ bool found;
+ bool expected;
+ uint64 x;
+ uint64 v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Do we expect this value to be present in the set? */
+ if (x >= last_int)
+ expected = false;
+ else
+ {
+ uint64 idx = x % spec->spacing;
+
+ if (idx >= patternlen)
+ expected = false;
+ else if (spec->pattern_str[idx] == '1')
+ expected = true;
+ else
+ expected = false;
+ }
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (found != expected)
+ elog(ERROR, "mismatch at 0x" UINT64_HEX_FORMAT ": %d vs %d", x, found, expected);
+ if (found && (v != x))
+ elog(ERROR, "found 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ v, x);
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "probed " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Test iterator
+ */
+ starttime = GetCurrentTimestamp();
+
+ iter = rt_begin_iterate(radixtree);
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ uint64 expected = last_int + pattern_values[i];
+ uint64 x;
+ uint64 val;
+
+ if (!rt_iterate_next(iter, &x, &val))
+ break;
+
+ if (x != expected)
+ elog(ERROR,
+ "iterate returned wrong key; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d",
+ x, expected, i);
+ if (val != expected)
+ elog(ERROR,
+ "iterate returned wrong value; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d", x, expected, i);
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "iterated " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ rt_end_iterate(iter);
+
+ if (n < spec->num_values)
+ elog(ERROR, "iterator stopped short after " UINT64_FORMAT " entries, expected " UINT64_FORMAT, n, spec->num_values);
+ if (n > spec->num_values)
+ elog(ERROR, "iterator returned " UINT64_FORMAT " entries, " UINT64_FORMAT " was expected", n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_delete()
+ */
+ starttime = GetCurrentTimestamp();
+
+ nbefore = rt_num_entries(radixtree);
+ ndeleted = 0;
+ for (n = 0; n < 1; n++)
+ {
+ bool found;
+ uint64 x;
+ uint64 v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (!found)
+ continue;
+
+ /* If the key is found, delete it and check again */
+ if (!rt_delete(radixtree, x))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_search(radixtree, x, &v))
+ elog(ERROR, "found deleted key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_delete(radixtree, x))
+ elog(ERROR, "deleted already-deleted key 0x" UINT64_HEX_FORMAT, x);
+
+ ndeleted++;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "deleted " UINT64_FORMAT " values in %d ms\n",
+ ndeleted, (int) (endtime - starttime) / 1000);
+
+ nafter = rt_num_entries(radixtree);
+
+ /* Check that rt_num_entries works */
+ if ((nbefore - ndeleted) != nafter)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT "after " UINT64_FORMAT " deletion",
+ nafter, (nbefore - ndeleted), ndeleted);
+
+ MemoryContextDelete(radixtree_ctx);
+}
+
+Datum
+test_radixtree(PG_FUNCTION_ARGS)
+{
+ test_empty();
+
+ for (int i = 1; i < lengthof(rt_node_kind_fanouts); i++)
+ {
+ test_basic(rt_node_kind_fanouts[i], false);
+ test_basic(rt_node_kind_fanouts[i], true);
+ }
+
+ for (int shift = 0; shift <= (64 - 8); shift += 8)
+ test_node_types(shift);
+
+ /* Test different test patterns, with lots of entries */
+ for (int i = 0; i < lengthof(test_specs); i++)
+ test_pattern(&test_specs[i]);
+
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_radixtree/test_radixtree.control b/src/test/modules/test_radixtree/test_radixtree.control
new file mode 100644
index 0000000000..e53f2a3e0c
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.control
@@ -0,0 +1,4 @@
+comment = 'Test code for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/test_radixtree'
+relocatable = true
--
2.31.1
On Mon, Dec 19, 2022 at 4:13 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Tue, Dec 13, 2022 at 1:04 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Mon, Dec 12, 2022 at 7:14 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Fri, Dec 9, 2022 at 8:33 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Fri, Dec 9, 2022 at 5:53 PM John Naylor <john.naylor@enterprisedb.com> wrote:
I don't think that'd be very controversial, but I'm also not sure why we'd need 4MB -- can you explain in more detail what exactly we'd need so that the feature would work? (The minimum doesn't have to work *well* IIUC, just do some useful work and not fail).
The minimum requirement is 2MB. In PoC patch, TIDStore checks how big
the radix tree is using dsa_get_total_size(). If the size returned by
dsa_get_total_size() (+ some memory used by TIDStore meta information)
exceeds maintenance_work_mem, lazy vacuum starts to do index vacuum
and heap vacuum. However, when allocating DSA memory for
radix_tree_control at creation, we allocate 1MB
(DSA_INITIAL_SEGMENT_SIZE) DSM memory and use memory required for
radix_tree_control from it. dsa_get_total_size() returns 1MB even if
there is no TID collected.

2MB makes sense.
If the metadata is small, it seems counterproductive to count it towards the total. We want the decision to be driven by blocks allocated. I have an idea on that below.
Remember when we discussed how we might approach parallel pruning? I envisioned a local array of a few dozen kilobytes to reduce contention on the tidstore. We could use such an array even for a single worker (always doing the same thing is simpler anyway). When the array fills up enough so that the next heap page *could* overflow it: Stop, insert into the store, and check the store's memory usage before continuing.
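For illustration, a minimal sketch of that local-buffer idea might look like the code below. The TidStore type and the tidstore_add_tids()/tidstore_memory_usage() helpers are assumed names for this sketch only, not the patch's actual API:

#define LOCAL_DEAD_TID_BUF	8192	/* roughly 48kB worth of TIDs */

static ItemPointerData local_dead_tids[LOCAL_DEAD_TID_BUF];
static int	num_local_dead_tids = 0;

/*
 * Buffer one dead TID locally.  Returns true when the caller should pause
 * the heap scan and run index/heap vacuuming because the shared store has
 * grown past 'limit' bytes.
 */
static bool
remember_dead_tid(TidStore *store, ItemPointer tid, Size limit)
{
	local_dead_tids[num_local_dead_tids++] = *tid;

	/* flush before the next heap page could overflow the local array */
	if (num_local_dead_tids + MaxHeapTuplesPerPage > LOCAL_DEAD_TID_BUF)
	{
		tidstore_add_tids(store, local_dead_tids, num_local_dead_tids);
		num_local_dead_tids = 0;

		return tidstore_memory_usage(store) > limit;
	}

	return false;
}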
Right, I think it's no problem in slab cases. In DSA cases, the new
segment size follows a geometric series that approximately doubles the
total storage each time we create a new segment. This behavior comes
from the fact that the underlying DSM system isn't designed for large
numbers of segments.

And taking a look, the size of a new segment can get quite large. It seems we could test if the total DSA area allocated is greater than half of maintenance_work_mem. If that parameter is a power of two (common) and >=8MB, then the area will contain just under a power of two the last time it passes the test. The next segment will bring it to about 3/4 full, like this:
maintenance work mem = 256MB, so stop if we go over 128MB:
2*(1+2+4+8+16+32) = 126MB -> keep going
126MB + 64MB = 190MB -> stop

That would be a simple way to be conservative with the memory limit. The unfortunate aspect is that the last segment would be mostly wasted, but it's paradise compared to the pessimistically-sized single array we have now (even with Peter G.'s VM snapshot informing the allocation size, I imagine).
Right. In this case, even if we allocate 64MB, we will use only 2088
bytes at maximum. So I think the memory space used for vacuum is
practically limited to half.

And as for the minimum possible maintenance_work_mem, I think this would work with 2MB, if the community is okay with technically going over the limit by a few bytes of overhead if a buildfarm animal is set to that value. I imagine it would never go over the limit for realistic (and even most unrealistic) values. Even with a VM snapshot page in memory and small local arrays of TIDs, I think with this scheme we'll be well under the limit.
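To illustrate, that check could be as simple as the sketch below (illustrative only; dead_tid_store_is_full() is an assumed name, and maintenance_work_mem is in kilobytes):

/*
 * Stop collecting dead TIDs once the DSA-backed TID store has allocated
 * more than half of maintenance_work_mem, per the reasoning above.
 */
static bool
dead_tid_store_is_full(dsa_area *area)
{
	Size	limit = (Size) maintenance_work_mem * 1024 / 2;

	return dsa_get_total_size(area) > limit;
}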
Looking at other code using DSA such as tidbitmap.c and nodeHash.c, it
seems that they look at only memory that are actually dsa_allocate'd.
To be exact, we estimate the number of hash buckets based on work_mem
(and hash_mem_multiplier) and use it as the upper limit. So I've
confirmed that the result of dsa_get_total_size() could exceed the
limit. I'm not sure it's a known and legitimate usage. If we can
follow such usage, we can probably track how much dsa_allocate'd
memory is used in the radix tree.

I've experimented with this idea. The newly added 0008 patch changes
the radix tree so that it counts the memory usage for both local and
shared cases.
I've attached updated version patches to make cfbot happy.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachments:
v15-0008-PoC-calculate-memory-usage-in-radix-tree.patch (application/octet-stream)
From 8ec7c3f15da739c1a8d78c1eec1e1f45cbe8ba21 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 19 Dec 2022 14:41:43 +0900
Subject: [PATCH v15 8/9] PoC: calculate memory usage in radix tree.
---
src/backend/lib/radixtree.c | 137 +++++++++++++++++++++++------------
src/backend/utils/mmgr/dsa.c | 42 +++++++++++
src/include/utils/dsa.h | 1 +
3 files changed, 135 insertions(+), 45 deletions(-)
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
index 455071cbab..4ad55a0b7c 100644
--- a/src/backend/lib/radixtree.c
+++ b/src/backend/lib/radixtree.c
@@ -360,14 +360,24 @@ typedef struct rt_size_class_elem
const char *name;
int fanout;
- /* slab chunk size */
+ /* node size */
Size inner_size;
Size leaf_size;
/* slab block size */
- Size inner_blocksize;
- Size leaf_blocksize;
+ Size slab_inner_blocksize;
+ Size slab_leaf_blocksize;
+
+ /*
+ * For a local radix tree we can get how much memory is allocated for a
+ * node using GetMemoryChunkSpace(). However, DSA has no such facility,
+ * so for the shared case we precompute the sizes that DSA actually
+ * allocates for each node class and use them for the memory calculation.
+ */
+ Size dsa_inner_size;
+ Size dsa_leaf_size;
} rt_size_class_elem;
+static bool rt_size_class_dsa_info_initialized = false;
/*
* Calculate the slab blocksize so that we can allocate at least 32 chunks
@@ -381,40 +391,40 @@ static rt_size_class_elem rt_size_class_info[RT_SIZE_CLASS_COUNT] = {
.fanout = 4,
.inner_size = sizeof(rt_node_inner_4) + 4 * sizeof(rt_node *),
.leaf_size = sizeof(rt_node_leaf_4) + 4 * sizeof(uint64),
- .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_4) + 4 * sizeof(rt_node *)),
- .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_4) + 4 * sizeof(uint64)),
+ .slab_inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_4) + 4 * sizeof(rt_node *)),
+ .slab_leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_4) + 4 * sizeof(uint64)),
},
[RT_CLASS_32_PARTIAL] = {
.name = "radix tree node 15",
.fanout = 15,
.inner_size = sizeof(rt_node_inner_32) + 15 * sizeof(rt_node *),
.leaf_size = sizeof(rt_node_leaf_32) + 15 * sizeof(uint64),
- .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_32) + 15 * sizeof(rt_node *)),
- .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_32) + 15 * sizeof(uint64)),
+ .slab_inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_32) + 15 * sizeof(rt_node *)),
+ .slab_leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_32) + 15 * sizeof(uint64)),
},
[RT_CLASS_32_FULL] = {
.name = "radix tree node 32",
.fanout = 32,
.inner_size = sizeof(rt_node_inner_32) + 32 * sizeof(rt_node *),
.leaf_size = sizeof(rt_node_leaf_32) + 32 * sizeof(uint64),
- .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_32) + 32 * sizeof(rt_node *)),
- .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_32) + 32 * sizeof(uint64)),
+ .slab_inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_32) + 32 * sizeof(rt_node *)),
+ .slab_leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_32) + 32 * sizeof(uint64)),
},
[RT_CLASS_125_FULL] = {
.name = "radix tree node 125",
.fanout = 125,
.inner_size = sizeof(rt_node_inner_125) + 125 * sizeof(rt_node *),
.leaf_size = sizeof(rt_node_leaf_125) + 125 * sizeof(uint64),
- .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_125) + 125 * sizeof(rt_node *)),
- .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_125) + 125 * sizeof(uint64)),
+ .slab_inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_125) + 125 * sizeof(rt_node *)),
+ .slab_leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_125) + 125 * sizeof(uint64)),
},
[RT_CLASS_256] = {
.name = "radix tree node 256",
.fanout = 256,
.inner_size = sizeof(rt_node_inner_256),
.leaf_size = sizeof(rt_node_leaf_256),
- .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_256)),
- .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_256)),
+ .slab_inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_256)),
+ .slab_leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_256)),
},
};
@@ -477,6 +487,12 @@ typedef struct radix_tree_control
uint64 max_val;
uint64 num_keys;
+ /*
+ * Track the amount of memory used. The callers can ask for it
+ * with rt_memory_usage().
+ */
+ uint64 mem_used;
+
/* statistics */
#ifdef RT_DEBUG
int32 cnt[RT_SIZE_CLASS_COUNT];
@@ -1005,15 +1021,22 @@ static rt_node_ptr
rt_alloc_node(radix_tree *tree, rt_size_class size_class, bool inner)
{
rt_node_ptr newnode;
+ Size size;
if (RadixTreeIsShared(tree))
{
dsa_pointer dp;
if (inner)
+ {
dp = dsa_allocate(tree->area, rt_size_class_info[size_class].inner_size);
+ size = rt_size_class_info[size_class].dsa_inner_size;
+ }
else
+ {
dp = dsa_allocate(tree->area, rt_size_class_info[size_class].leaf_size);
+ size = rt_size_class_info[size_class].dsa_leaf_size;
+ }
newnode.encoded = (rt_pointer) dp;
newnode.decoded = rt_pointer_decode(tree, newnode.encoded);
@@ -1028,8 +1051,12 @@ rt_alloc_node(radix_tree *tree, rt_size_class size_class, bool inner)
rt_size_class_info[size_class].leaf_size);
newnode.encoded = rt_pointer_encode(newnode.decoded);
+ size = GetMemoryChunkSpace(newnode.decoded);
}
+ /* update memory usage */
+ tree->ctl->mem_used += size;
+
#ifdef RT_DEBUG
/* update the statistics */
tree->ctl->cnt[size_class]++;
@@ -1095,6 +1122,15 @@ rt_grow_node_kind(radix_tree *tree, rt_node_ptr node, uint8 new_kind)
static void
rt_free_node(radix_tree *tree, rt_node_ptr node)
{
+ int size;
+ static const int fanout_node_class[RT_NODE_MAX_SLOTS] =
+ {
+ [4] = RT_CLASS_4_FULL,
+ [15] = RT_CLASS_32_PARTIAL,
+ [32] = RT_CLASS_32_FULL,
+ [125] = RT_CLASS_125_FULL,
+ };
+
/* If we're deleting the root node, make the tree empty */
if (tree->ctl->root == node.encoded)
{
@@ -1104,28 +1140,38 @@ rt_free_node(radix_tree *tree, rt_node_ptr node)
#ifdef RT_DEBUG
{
- int i;
+ int size_class = (NODE_FANOUT(node) == 0)
+ ? RT_CLASS_256
+ : fanout_node_class[NODE_FANOUT(node)];
/* update the statistics */
- for (i = 0; i < RT_SIZE_CLASS_COUNT; i++)
- {
- if (NODE_FANOUT(node) == rt_size_class_info[i].fanout)
- break;
- }
-
- /* fanout of node256 is intentionally 0 */
- if (i == RT_SIZE_CLASS_COUNT)
- i = RT_CLASS_256;
-
- tree->ctl->cnt[i]--;
- Assert(tree->ctl->cnt[i] >= 0);
+ tree->ctl->cnt[size_class]--;
+ Assert(tree->ctl->cnt[size_class] >= 0);
}
#endif
if (RadixTreeIsShared(tree))
+ {
+ int size_class = (NODE_FANOUT(node) == 0)
+ ? RT_CLASS_256
+ : fanout_node_class[NODE_FANOUT(node)];
+
+ if (!NODE_IS_LEAF(node))
+ size = rt_size_class_info[size_class].dsa_inner_size;
+ else
+ size = rt_size_class_info[size_class].dsa_leaf_size;
+
dsa_free(tree->area, (dsa_pointer) node.encoded);
+ }
else
+ {
+ size = GetMemoryChunkSpace(node.decoded);
pfree(node.decoded);
+ }
+
+ /* update memory usage */
+ tree->ctl->mem_used -= size;
+ Assert(tree->ctl->mem_used > 0);
}
/*
@@ -1837,15 +1883,18 @@ rt_create(MemoryContext ctx, dsa_area *area)
dp = dsa_allocate0(area, sizeof(radix_tree_control));
tree->ctl = (radix_tree_control *) dsa_get_address(area, dp);
tree->ctl->handle = (rt_handle) dp;
+ tree->ctl->mem_used += dsa_get_size_class(sizeof(radix_tree_control));
}
else
{
tree->ctl = (radix_tree_control *) palloc0(sizeof(radix_tree_control));
tree->ctl->handle = InvalidDsaPointer;
+ tree->ctl->mem_used += GetMemoryChunkSpace(tree->ctl);
}
tree->ctl->magic = RADIXTREE_MAGIC;
tree->ctl->root = InvalidRTPointer;
+ tree->ctl->mem_used += GetMemoryChunkSpace(tree);
/* Create the slab allocator for each size class */
if (area == NULL)
@@ -1854,17 +1903,29 @@ rt_create(MemoryContext ctx, dsa_area *area)
{
tree->inner_slabs[i] = SlabContextCreate(ctx,
rt_size_class_info[i].name,
- rt_size_class_info[i].inner_blocksize,
+ rt_size_class_info[i].slab_inner_blocksize,
rt_size_class_info[i].inner_size);
tree->leaf_slabs[i] = SlabContextCreate(ctx,
rt_size_class_info[i].name,
- rt_size_class_info[i].leaf_blocksize,
+ rt_size_class_info[i].slab_leaf_blocksize,
rt_size_class_info[i].leaf_size);
#ifdef RT_DEBUG
tree->ctl->cnt[i] = 0;
#endif
}
}
+ else if (!rt_size_class_dsa_info_initialized)
+ {
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ rt_size_class_info[i].dsa_inner_size =
+ dsa_get_size_class(rt_size_class_info[i].inner_size);
+ rt_size_class_info[i].dsa_leaf_size =
+ dsa_get_size_class(rt_size_class_info[i].leaf_size);
+ }
+
+ rt_size_class_dsa_info_initialized = true;
+ }
MemoryContextSwitchTo(old_ctx);
@@ -2534,22 +2595,8 @@ rt_num_entries(radix_tree *tree)
uint64
rt_memory_usage(radix_tree *tree)
{
- Size total = sizeof(radix_tree) + sizeof(radix_tree_control);
-
Assert(!RadixTreeIsShared(tree) || tree->ctl->magic == RADIXTREE_MAGIC);
-
- if (RadixTreeIsShared(tree))
- total = dsa_get_total_size(tree->area);
- else
- {
- for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
- {
- total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
- total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
- }
- }
-
- return total;
+ return tree->ctl->mem_used;
}
/*
@@ -2873,9 +2920,9 @@ rt_dump(radix_tree *tree)
fprintf(stderr, "%s\tinner_size %zu\tinner_blocksize %zu\tleaf_size %zu\tleaf_blocksize %zu\n",
rt_size_class_info[i].name,
rt_size_class_info[i].inner_size,
- rt_size_class_info[i].inner_blocksize,
+ rt_size_class_info[i].slab_inner_blocksize,
rt_size_class_info[i].leaf_size,
- rt_size_class_info[i].leaf_blocksize);
+ rt_size_class_info[i].slab_leaf_blocksize);
fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->ctl->max_val);
if (!RTPointerIsValid(tree->ctl->root))
diff --git a/src/backend/utils/mmgr/dsa.c b/src/backend/utils/mmgr/dsa.c
index ad169882af..e77aea10e2 100644
--- a/src/backend/utils/mmgr/dsa.c
+++ b/src/backend/utils/mmgr/dsa.c
@@ -1208,6 +1208,48 @@ dsa_minimum_size(void)
return pages * FPM_PAGE_SIZE;
}
+size_t
+dsa_get_size_class(size_t size)
+{
+ uint16 size_class;
+
+ if (size > dsa_size_classes[lengthof(dsa_size_classes) - 1])
+ return size;
+ else if (size < lengthof(dsa_size_class_map) * DSA_SIZE_CLASS_MAP_QUANTUM)
+ {
+ int mapidx;
+
+ /* For smaller sizes we have a lookup table... */
+ mapidx = ((size + DSA_SIZE_CLASS_MAP_QUANTUM - 1) /
+ DSA_SIZE_CLASS_MAP_QUANTUM) - 1;
+ size_class = dsa_size_class_map[mapidx];
+ }
+ else
+ {
+ uint16 min;
+ uint16 max;
+
+ /* ... and for the rest we search by binary chop. */
+ min = dsa_size_class_map[lengthof(dsa_size_class_map) - 1];
+ max = lengthof(dsa_size_classes) - 1;
+
+ while (min < max)
+ {
+ uint16 mid = (min + max) / 2;
+ uint16 class_size = dsa_size_classes[mid];
+
+ if (class_size < size)
+ min = mid + 1;
+ else
+ max = mid;
+ }
+
+ size_class = min;
+ }
+
+ return dsa_size_classes[size_class];
+}
+
/*
* Workhorse function for dsa_create and dsa_create_in_place.
*/
diff --git a/src/include/utils/dsa.h b/src/include/utils/dsa.h
index dad06adecc..a17c4eb88c 100644
--- a/src/include/utils/dsa.h
+++ b/src/include/utils/dsa.h
@@ -118,6 +118,7 @@ extern dsa_pointer dsa_allocate_extended(dsa_area *area, size_t size, int flags)
extern void dsa_free(dsa_area *area, dsa_pointer dp);
extern void *dsa_get_address(dsa_area *area, dsa_pointer dp);
extern size_t dsa_get_total_size(dsa_area *area);
+extern size_t dsa_get_size_class(size_t size);
extern void dsa_trim(dsa_area *area);
extern void dsa_dump(dsa_area *area);
--
2.31.1
v15-0005-tool-for-measuring-radix-tree-performance.patch (application/octet-stream)
From 75af1182c7107486db3846e616625e456d640e3c Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 16 Sep 2022 11:57:03 +0900
Subject: [PATCH v15 5/9] tool for measuring radix tree performance
---
contrib/bench_radix_tree/Makefile | 21 +
.../bench_radix_tree--1.0.sql | 76 +++
contrib/bench_radix_tree/bench_radix_tree.c | 635 ++++++++++++++++++
.../bench_radix_tree/bench_radix_tree.control | 6 +
contrib/bench_radix_tree/expected/bench.out | 13 +
contrib/bench_radix_tree/sql/bench.sql | 16 +
6 files changed, 767 insertions(+)
create mode 100644 contrib/bench_radix_tree/Makefile
create mode 100644 contrib/bench_radix_tree/bench_radix_tree--1.0.sql
create mode 100644 contrib/bench_radix_tree/bench_radix_tree.c
create mode 100644 contrib/bench_radix_tree/bench_radix_tree.control
create mode 100644 contrib/bench_radix_tree/expected/bench.out
create mode 100644 contrib/bench_radix_tree/sql/bench.sql
diff --git a/contrib/bench_radix_tree/Makefile b/contrib/bench_radix_tree/Makefile
new file mode 100644
index 0000000000..952bb0ceae
--- /dev/null
+++ b/contrib/bench_radix_tree/Makefile
@@ -0,0 +1,21 @@
+# contrib/bench_radix_tree/Makefile
+
+MODULE_big = bench_radix_tree
+OBJS = \
+ bench_radix_tree.o
+
+EXTENSION = bench_radix_tree
+DATA = bench_radix_tree--1.0.sql
+
+REGRESS = bench_fixed_height
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/bench_radix_tree
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
new file mode 100644
index 0000000000..83529805fc
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
@@ -0,0 +1,76 @@
+/* contrib/bench_radix_tree/bench_radix_tree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION bench_radix_tree" to load this file. \quit
+
+create function bench_shuffle_search(
+minblk int4,
+maxblk int4,
+random_block bool DEFAULT false,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT array_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT array_load_ms int8,
+OUT rt_search_ms int8,
+OUT array_serach_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_seq_search(
+minblk int4,
+maxblk int4,
+random_block bool DEFAULT false,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT array_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT array_load_ms int8,
+OUT rt_search_ms int8,
+OUT array_serach_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_load_random_int(
+cnt int8,
+OUT mem_allocated int8,
+OUT load_ms int8)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_search_random_nodes(
+cnt int8,
+filter_str text DEFAULT NULL,
+OUT mem_allocated int8,
+OUT search_ms int8)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_fixed_height_search(
+fanout int4,
+OUT fanout int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT rt_search_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_node128_load(
+fanout int4,
+OUT fanout int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT rt_sparseload_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c
new file mode 100644
index 0000000000..a0693695e6
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -0,0 +1,635 @@
+/*-------------------------------------------------------------------------
+ *
+ * bench_radix_tree.c
+ *
+ * Copyright (c) 2016-2022, PostgreSQL Global Development Group
+ *
+ * contrib/bench_radix_tree/bench_radix_tree.c
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "funcapi.h"
+#include "lib/radixtree.h"
+#include <math.h>
+#include "miscadmin.h"
+#include "storage/lwlock.h"
+#include "utils/builtins.h"
+#include "utils/timestamp.h"
+
+PG_MODULE_MAGIC;
+
+/* run benchmark also for binary-search case? */
+/* #define MEASURE_BINARY_SEARCH 1 */
+
+#define TIDS_PER_BLOCK_FOR_LOAD 30
+#define TIDS_PER_BLOCK_FOR_LOOKUP 50
+
+PG_FUNCTION_INFO_V1(bench_seq_search);
+PG_FUNCTION_INFO_V1(bench_shuffle_search);
+PG_FUNCTION_INFO_V1(bench_load_random_int);
+PG_FUNCTION_INFO_V1(bench_fixed_height_search);
+PG_FUNCTION_INFO_V1(bench_search_random_nodes);
+PG_FUNCTION_INFO_V1(bench_node128_load);
+
+static uint64
+tid_to_key_off(ItemPointer tid, uint32 *off)
+{
+ uint64 upper;
+ uint32 shift = pg_ceil_log2_32(MaxHeapTuplesPerPage);
+ int64 tid_i;
+
+ Assert(ItemPointerGetOffsetNumber(tid) < MaxHeapTuplesPerPage);
+
+ tid_i = ItemPointerGetOffsetNumber(tid);
+ tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << shift;
+
+ /* log(sizeof(uint64) * BITS_PER_BYTE, 2) = log(64, 2) = 6 */
+ *off = tid_i & ((1 << 6) - 1);
+ upper = tid_i >> 6;
+ Assert(*off < (sizeof(uint64) * BITS_PER_BYTE));
+
+ Assert(*off < 64);
+
+ return upper;
+}
+
+static int
+shuffle_randrange(pg_prng_state *state, int lower, int upper)
+{
+ return (int) floor(pg_prng_double(state) * ((upper - lower) + 0.999999)) + lower;
+}
+
+/* Naive Fisher-Yates shuffle implementation */
+static void
+shuffle_itemptrs(ItemPointer itemptr, uint64 nitems)
+{
+ /* for reproducibility */
+ pg_prng_state state;
+
+ pg_prng_seed(&state, 0);
+
+ for (int i = 0; i < nitems - 1; i++)
+ {
+ int j = shuffle_randrange(&state, i, nitems - 1);
+ ItemPointerData t = itemptr[j];
+
+ itemptr[j] = itemptr[i];
+ itemptr[i] = t;
+ }
+}
+
+static ItemPointer
+generate_tids(BlockNumber minblk, BlockNumber maxblk, int ntids_per_blk, uint64 *ntids_p,
+ bool random_block)
+{
+ ItemPointer tids;
+ uint64 maxitems;
+ uint64 ntids = 0;
+ pg_prng_state state;
+
+ maxitems = (maxblk - minblk + 1) * ntids_per_blk;
+ tids = MemoryContextAllocHuge(TopTransactionContext,
+ sizeof(ItemPointerData) * maxitems);
+
+ if (random_block)
+ pg_prng_seed(&state, 0x9E3779B185EBCA87);
+
+ for (BlockNumber blk = minblk; blk < maxblk; blk++)
+ {
+ if (random_block && !pg_prng_bool(&state))
+ continue;
+
+ for (OffsetNumber off = FirstOffsetNumber;
+ off <= ntids_per_blk; off++)
+ {
+ CHECK_FOR_INTERRUPTS();
+
+ ItemPointerSetBlockNumber(&(tids[ntids]), blk);
+ ItemPointerSetOffsetNumber(&(tids[ntids]), off);
+
+ ntids++;
+ }
+ }
+
+ *ntids_p = ntids;
+ return tids;
+}
+
+#ifdef MEASURE_BINARY_SEARCH
+static int
+vac_cmp_itemptr(const void *left, const void *right)
+{
+ BlockNumber lblk,
+ rblk;
+ OffsetNumber loff,
+ roff;
+
+ lblk = ItemPointerGetBlockNumber((ItemPointer) left);
+ rblk = ItemPointerGetBlockNumber((ItemPointer) right);
+
+ if (lblk < rblk)
+ return -1;
+ if (lblk > rblk)
+ return 1;
+
+ loff = ItemPointerGetOffsetNumber((ItemPointer) left);
+ roff = ItemPointerGetOffsetNumber((ItemPointer) right);
+
+ if (loff < roff)
+ return -1;
+ if (loff > roff)
+ return 1;
+
+ return 0;
+}
+#endif
+
+static Datum
+bench_search(FunctionCallInfo fcinfo, bool shuffle)
+{
+ BlockNumber minblk = PG_GETARG_INT32(0);
+ BlockNumber maxblk = PG_GETARG_INT32(1);
+ bool random_block = PG_GETARG_BOOL(2);
+ radix_tree *rt = NULL;
+ uint64 ntids;
+ uint64 key;
+ uint64 last_key = PG_UINT64_MAX;
+ uint64 val = 0;
+ ItemPointer tids;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_load_ms,
+ rt_search_ms;
+ Datum values[7];
+ bool nulls[7];
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ tids = generate_tids(minblk, maxblk, TIDS_PER_BLOCK_FOR_LOAD, &ntids, random_block);
+
+ /* measure the load time of the radix tree */
+ rt = rt_create(CurrentMemoryContext);
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ uint32 off;
+
+ CHECK_FOR_INTERRUPTS();
+
+ key = tid_to_key_off(tid, &off);
+
+ if (last_key != PG_UINT64_MAX && last_key != key)
+ {
+ rt_set(rt, last_key, val);
+ val = 0;
+ }
+
+ last_key = key;
+ val |= (uint64) 1 << off;
+ }
+ if (last_key != PG_UINT64_MAX)
+ rt_set(rt, last_key, val);
+
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_load_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ if (shuffle)
+ shuffle_itemptrs(tids, ntids);
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+ /* measure the search time of the radix tree */
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ uint32 off;
+ volatile bool ret; /* prevent calling rt_search from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ key = tid_to_key_off(tid, &off);
+
+ ret = rt_search(rt, key, &val);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_search_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int64GetDatum(rt_num_entries(rt));
+ values[1] = Int64GetDatum(rt_memory_usage(rt));
+ values[2] = Int64GetDatum(sizeof(ItemPointerData) * ntids);
+ values[3] = Int64GetDatum(rt_load_ms);
+ nulls[4] = true; /* ar_load_ms */
+ values[5] = Int64GetDatum(rt_search_ms);
+ nulls[6] = true; /* ar_search_ms */
+
+#ifdef MEASURE_BINARY_SEARCH
+ {
+ ItemPointer itemptrs = NULL;
+
+ int64 ar_load_ms,
+ ar_search_ms;
+
+ /* measure the load time of the array */
+ itemptrs = MemoryContextAllocHuge(CurrentMemoryContext,
+ sizeof(ItemPointerData) * ntids);
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointerSetBlockNumber(&(itemptrs[i]),
+ ItemPointerGetBlockNumber(&(tids[i])));
+ ItemPointerSetOffsetNumber(&(itemptrs[i]),
+ ItemPointerGetOffsetNumber(&(tids[i])));
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ ar_load_ms = secs * 1000 + usecs / 1000;
+
+ /* next, measure the search time of the array */
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ volatile bool ret; /* prevent calling bsearch from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ ret = bsearch((void *) tid,
+ (void *) itemptrs,
+ ntids,
+ sizeof(ItemPointerData),
+ vac_cmp_itemptr);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ ar_search_ms = secs * 1000 + usecs / 1000;
+
+ /* set the result */
+ nulls[4] = false;
+ values[4] = Int64GetDatum(ar_load_ms);
+ nulls[6] = false;
+ values[6] = Int64GetDatum(ar_search_ms);
+ }
+#endif
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_seq_search(PG_FUNCTION_ARGS)
+{
+ return bench_search(fcinfo, false);
+}
+
+Datum
+bench_shuffle_search(PG_FUNCTION_ARGS)
+{
+ return bench_search(fcinfo, true);
+}
+
+Datum
+bench_load_random_int(PG_FUNCTION_ARGS)
+{
+ uint64 cnt = (uint64) PG_GETARG_INT64(0);
+ radix_tree *rt;
+ pg_prng_state state;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 load_time_ms;
+ Datum values[2];
+ bool nulls[2];
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ pg_prng_seed(&state, 0);
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ uint64 key = pg_prng_uint64(&state);
+
+ rt_set(rt, key, key);
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ load_time_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int64GetDatum(rt_memory_usage(rt));
+ values[1] = Int64GetDatum(load_time_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+/* copy of splitmix64() */
+static uint64
+hash64(uint64 x)
+{
+ x ^= x >> 30;
+ x *= UINT64CONST(0xbf58476d1ce4e5b9);
+ x ^= x >> 27;
+ x *= UINT64CONST(0x94d049bb133111eb);
+ x ^= x >> 31;
+ return x;
+}
+
+/* attempts to have a relatively even population of node kinds */
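+/*
+ * (A sketch of how the default filter below is assumed to achieve that: the
+ * mask restricts each byte of the hashed key separately, so a byte masked
+ * with 0x00 or 0x07 allows only a few distinct chunk values at that level,
+ * favoring the small node kinds, while 0x7F and 0xFF allow up to 128 and 256
+ * chunks, favoring the larger ones.)
+ */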
+Datum
+bench_search_random_nodes(PG_FUNCTION_ARGS)
+{
+ uint64 cnt = (uint64) PG_GETARG_INT64(0);
+ radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 search_time_ms;
+ Datum values[2] = {0};
+ bool nulls[2] = {0};
+ /* from trial and error */
+ uint64 filter = (((uint64) 0x7F<<32) | (0x07<<24) | (0xFF<<16) | 0xFF);
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ if (!PG_ARGISNULL(1))
+ {
+ char *filter_str = text_to_cstring(PG_GETARG_TEXT_P(1));
+
+ if (sscanf(filter_str, "0x%lX", &filter) == 0)
+ elog(ERROR, "invalid filter string %s", filter_str);
+ }
+ elog(NOTICE, "bench with filter 0x%lX", filter);
+
+ rt = rt_create(CurrentMemoryContext);
+
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ const uint64 hash = hash64(i);
+ const uint64 key = hash & filter;
+
+ rt_set(rt, key, key);
+ }
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ const uint64 hash = hash64(i);
+ const uint64 key = hash & filter;
+ uint64 val;
+ volatile bool ret; /* prevent calling rt_search from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ ret = rt_search(rt, key, &val);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ search_time_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ values[0] = Int64GetDatum(rt_memory_usage(rt));
+ values[1] = Int64GetDatum(search_time_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_fixed_height_search(PG_FUNCTION_ARGS)
+{
+ int fanout = PG_GETARG_INT32(0);
+ radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_load_ms,
+ rt_search_ms;
+ Datum values[5];
+ bool nulls[5];
+
+ /* test boundary between vector and iteration */
+ const int n_keys = 5 * 16 * 16 * 16 * 16;
+ uint64 r,
+ h,
+ i,
+ j,
+ k;
+ int key_id;
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+
+ key_id = 0;
+
+ /*
+ * lower nodes have limited fanout, the top is only limited by
+ * bits-per-byte
+ */
+ for (r = 1;; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ for (i = 1; i <= fanout; i++)
+ {
+ for (j = 1; j <= fanout; j++)
+ {
+ for (k = 1; k <= fanout; k++)
+ {
+ uint64 key;
+
+ key = (r << 32) | (h << 24) | (i << 16) | (j << 8) | (k);
+
+ CHECK_FOR_INTERRUPTS();
+
+ key_id++;
+ if (key_id > n_keys)
+ goto finish_set;
+
+ rt_set(rt, key, key_id);
+ }
+ }
+ }
+ }
+ }
+finish_set:
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_load_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ /* measure the search time of the radix tree */
+ start_time = GetCurrentTimestamp();
+
+ key_id = 0;
+ for (r = 1;; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ for (i = 1; i <= fanout; i++)
+ {
+ for (j = 1; j <= fanout; j++)
+ {
+ for (k = 1; k <= fanout; k++)
+ {
+ uint64 key,
+ val;
+
+ key = (r << 32) | (h << 24) | (i << 16) | (j << 8) | (k);
+
+ CHECK_FOR_INTERRUPTS();
+
+ key_id++;
+ if (key_id > n_keys)
+ goto finish_search;
+
+ rt_search(rt, key, &val);
+ }
+ }
+ }
+ }
+ }
+finish_search:
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_search_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int32GetDatum(fanout);
+ values[1] = Int64GetDatum(rt_num_entries(rt));
+ values[2] = Int64GetDatum(rt_memory_usage(rt));
+ values[3] = Int64GetDatum(rt_load_ms);
+ values[4] = Int64GetDatum(rt_search_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_node128_load(PG_FUNCTION_ARGS)
+{
+ int fanout = PG_GETARG_INT32(0);
+ radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_sparseload_ms;
+ Datum values[5];
+ bool nulls[5];
+
+ uint64 r,
+ h;
+ int key_id;
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ rt = rt_create(CurrentMemoryContext);
+
+ key_id = 0;
+
+ for (r = 1; r <= fanout; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (h);
+
+ key_id++;
+ rt_set(rt, key, key_id);
+ }
+ }
+
+ rt_stats(rt);
+
+ /* measure sparse deletion and re-loading */
+ start_time = GetCurrentTimestamp();
+
+ for (int t = 0; t < 10000; t++)
+ {
+ /* delete one key in each leaf */
+ for (r = 1; r <= fanout; r++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (fanout);
+
+ rt_delete(rt, key);
+ }
+
+ /* add them all back */
+ key_id = 0;
+ for (r = 1; r <= fanout; r++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (fanout);
+
+ key_id++;
+ rt_set(rt, key, key_id);
+ }
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_sparseload_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int32GetDatum(fanout);
+ values[1] = Int64GetDatum(rt_num_entries(rt));
+ values[2] = Int64GetDatum(rt_memory_usage(rt));
+ values[3] = Int64GetDatum(rt_sparseload_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
diff --git a/contrib/bench_radix_tree/bench_radix_tree.control b/contrib/bench_radix_tree/bench_radix_tree.control
new file mode 100644
index 0000000000..1d988e6c9a
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.control
@@ -0,0 +1,6 @@
+# bench_radix_tree extension
+comment = 'benchmark suite for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/bench_radix_tree'
+relocatable = true
+trusted = true
diff --git a/contrib/bench_radix_tree/expected/bench.out b/contrib/bench_radix_tree/expected/bench.out
new file mode 100644
index 0000000000..60c303892e
--- /dev/null
+++ b/contrib/bench_radix_tree/expected/bench.out
@@ -0,0 +1,13 @@
+create extension bench_radix_tree;
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
diff --git a/contrib/bench_radix_tree/sql/bench.sql b/contrib/bench_radix_tree/sql/bench.sql
new file mode 100644
index 0000000000..a46018c9d4
--- /dev/null
+++ b/contrib/bench_radix_tree/sql/bench.sql
@@ -0,0 +1,16 @@
+create extension bench_radix_tree;
+
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
--
2.31.1
Attachment: v15-0007-PoC-DSA-support-for-radix-tree.patch (application/octet-stream)
From d575b8f8215494d9ac82b256b260acd921de1928 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Tue, 6 Dec 2022 16:42:55 +0700
Subject: [PATCH v15 7/9] PoC: DSA support for radix tree
---
.../bench_radix_tree--1.0.sql | 2 +
contrib/bench_radix_tree/bench_radix_tree.c | 16 +-
src/backend/lib/radixtree.c | 437 ++++++++++++++----
src/backend/utils/mmgr/dsa.c | 12 +
src/include/lib/radixtree.h | 8 +-
src/include/utils/dsa.h | 1 +
.../expected/test_radixtree.out | 25 +
.../modules/test_radixtree/test_radixtree.c | 147 ++++--
8 files changed, 502 insertions(+), 146 deletions(-)
diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
index 83529805fc..d9216d715c 100644
--- a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
+++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
@@ -7,6 +7,7 @@ create function bench_shuffle_search(
minblk int4,
maxblk int4,
random_block bool DEFAULT false,
+shared bool DEFAULT false,
OUT nkeys int8,
OUT rt_mem_allocated int8,
OUT array_mem_allocated int8,
@@ -23,6 +24,7 @@ create function bench_seq_search(
minblk int4,
maxblk int4,
random_block bool DEFAULT false,
+shared bool DEFAULT false,
OUT nkeys int8,
OUT rt_mem_allocated int8,
OUT array_mem_allocated int8,
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c
index a0693695e6..1a26722495 100644
--- a/contrib/bench_radix_tree/bench_radix_tree.c
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -154,6 +154,8 @@ bench_search(FunctionCallInfo fcinfo, bool shuffle)
BlockNumber maxblk = PG_GETARG_INT32(1);
bool random_block = PG_GETARG_BOOL(2);
radix_tree *rt = NULL;
+ bool shared = PG_GETARG_BOOL(3);
+ dsa_area *dsa = NULL;
uint64 ntids;
uint64 key;
uint64 last_key = PG_UINT64_MAX;
@@ -176,7 +178,11 @@ bench_search(FunctionCallInfo fcinfo, bool shuffle)
tids = generate_tids(minblk, maxblk, TIDS_PER_BLOCK_FOR_LOAD, &ntids, random_block);
/* measure the load time of the radix tree */
- rt = rt_create(CurrentMemoryContext);
+ if (shared)
+ dsa = dsa_create(LWLockNewTrancheId());
+ rt = rt_create(CurrentMemoryContext, dsa);
+
+ /* measure the load time of the radix tree */
start_time = GetCurrentTimestamp();
for (int i = 0; i < ntids; i++)
{
@@ -327,7 +333,7 @@ bench_load_random_int(PG_FUNCTION_ARGS)
elog(ERROR, "return type must be a row type");
pg_prng_seed(&state, 0);
- rt = rt_create(CurrentMemoryContext);
+ rt = rt_create(CurrentMemoryContext, NULL);
start_time = GetCurrentTimestamp();
for (uint64 i = 0; i < cnt; i++)
@@ -393,7 +399,7 @@ bench_search_random_nodes(PG_FUNCTION_ARGS)
}
elog(NOTICE, "bench with filter 0x%lX", filter);
- rt = rt_create(CurrentMemoryContext);
+ rt = rt_create(CurrentMemoryContext, NULL);
for (uint64 i = 0; i < cnt; i++)
{
@@ -462,7 +468,7 @@ bench_fixed_height_search(PG_FUNCTION_ARGS)
if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
elog(ERROR, "return type must be a row type");
- rt = rt_create(CurrentMemoryContext);
+ rt = rt_create(CurrentMemoryContext, NULL);
start_time = GetCurrentTimestamp();
@@ -574,7 +580,7 @@ bench_node128_load(PG_FUNCTION_ARGS)
if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
elog(ERROR, "return type must be a row type");
- rt = rt_create(CurrentMemoryContext);
+ rt = rt_create(CurrentMemoryContext, NULL);
key_id = 0;
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
index bff37a2c35..455071cbab 100644
--- a/src/backend/lib/radixtree.c
+++ b/src/backend/lib/radixtree.c
@@ -22,6 +22,15 @@
* choose it to avoid an additional pointer traversal. It is the reason this code
* currently does not support variable-length keys.
*
+ * If a DSA area is specified for rt_create(), the radix tree is created in the
+ * DSA area so that multiple processes can access it simultaneously. The process
+ * that created the shared radix tree needs to pass both the DSA area specified
+ * when calling rt_create() and the dsa_pointer of the radix tree, fetched by
+ * rt_get_handle(), to other processes so that they can attach with rt_attach().
+ *
+ * XXX: the shared radix tree is still in a PoC state as it doesn't have any
+ * locking support. Also, only one process at a time can iterate over it.
+ *
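+ * A minimal usage sketch (an illustration assuming no concurrent access; see
+ * the locking XXX above):
+ *
+ *   creator:  rt = rt_create(ctx, area);
+ *             handle = rt_get_handle(rt);    <- pass this to other backends
+ *   backend:  rt = rt_attach(area, handle);
+ *             rt_set(rt, key, value); rt_search(rt, key, &value);
+ *             rt_detach(rt);
+ *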
* XXX: Most functions in this file have two variants for inner nodes and leaf
* nodes, therefore there are duplication codes. While this sometimes makes the
* code maintenance tricky, this reduces branch prediction misses when judging
@@ -34,6 +43,9 @@
*
* rt_create - Create a new, empty radix tree
* rt_free - Free the radix tree
+ * rt_attach - Attach to the radix tree
+ * rt_detach - Detach from the radix tree
+ * rt_get_handle - Return the handle of the radix tree
* rt_search - Search a key-value pair
* rt_set - Set a key-value pair
* rt_delete - Delete a key-value pair
@@ -65,6 +77,7 @@
#include "nodes/bitmapset.h"
#include "port/pg_bitutils.h"
#include "port/pg_lfind.h"
+#include "utils/dsa.h"
#include "utils/memutils.h"
#ifdef RT_DEBUG
@@ -426,6 +439,10 @@ static rt_size_class kind_min_size_class[RT_NODE_KIND_COUNT] = {
* construct the key whenever updating the node iteration information, e.g., when
* advancing the current index within the node or when moving to the next node
* at the same level.
+ *
+ * XXX: We need either a safeguard to prevent other processes from beginning
+ * an iteration while one process is iterating, or support for multiple
+ * processes iterating concurrently.
*/
typedef struct rt_node_iter
{
@@ -445,23 +462,43 @@ struct rt_iter
uint64 key;
};
-/* A radix tree with nodes */
-struct radix_tree
+/* A magic value used to identify our radix tree */
+#define RADIXTREE_MAGIC 0x54A48167
+
+/* Control information for an radix tree */
+typedef struct radix_tree_control
{
- MemoryContext context;
+ rt_handle handle;
+ uint32 magic;
+ /* Root node */
rt_pointer root;
+
uint64 max_val;
uint64 num_keys;
- MemoryContextData *inner_slabs[RT_SIZE_CLASS_COUNT];
- MemoryContextData *leaf_slabs[RT_SIZE_CLASS_COUNT];
-
/* statistics */
#ifdef RT_DEBUG
int32 cnt[RT_SIZE_CLASS_COUNT];
#endif
+} radix_tree_control;
+
+/* A radix tree with nodes */
+struct radix_tree
+{
+ MemoryContext context;
+
+ /* control object in either backend-local memory or DSA */
+ radix_tree_control *ctl;
+
+ /* used only when the radix tree is shared */
+ dsa_area *area;
+
+ /* used only when the radix tree is private */
+ MemoryContextData *inner_slabs[RT_SIZE_CLASS_COUNT];
+ MemoryContextData *leaf_slabs[RT_SIZE_CLASS_COUNT];
};
+#define RadixTreeIsShared(rt) ((rt)->area != NULL)
static void rt_new_root(radix_tree *tree, uint64 key);
@@ -490,9 +527,12 @@ static void rt_verify_node(rt_node_ptr node);
/* Decode and encode functions of rt_pointer */
static inline rt_node *
-rt_pointer_decode(rt_pointer encoded)
+rt_pointer_decode(radix_tree *tree, rt_pointer encoded)
{
- return (rt_node *) encoded;
+ if (RadixTreeIsShared(tree))
+ return (rt_node *) dsa_get_address(tree->area, encoded);
+ else
+ return (rt_node *) encoded;
}
static inline rt_pointer
@@ -503,11 +543,11 @@ rt_pointer_encode(rt_node *decoded)
/* Return a rt_node_ptr created from the given encoded pointer */
static inline rt_node_ptr
-rt_node_ptr_encoded(rt_pointer encoded)
+rt_node_ptr_encoded(radix_tree *tree, rt_pointer encoded)
{
return (rt_node_ptr) {
.encoded = encoded,
- .decoded = rt_pointer_decode(encoded),
+ .decoded = rt_pointer_decode(tree, encoded)
};
}
@@ -954,8 +994,8 @@ rt_new_root(radix_tree *tree, uint64 key)
rt_init_node(newnode, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
NODE_SHIFT(newnode) = shift;
- tree->max_val = shift_get_max_val(shift);
- tree->root = newnode.encoded;
+ tree->ctl->max_val = shift_get_max_val(shift);
+ tree->ctl->root = newnode.encoded;
}
/*
@@ -964,20 +1004,35 @@ rt_new_root(radix_tree *tree, uint64 key)
static rt_node_ptr
rt_alloc_node(radix_tree *tree, rt_size_class size_class, bool inner)
{
- rt_node_ptr newnode;
+ rt_node_ptr newnode;
- if (inner)
- newnode.decoded = (rt_node *) MemoryContextAlloc(tree->inner_slabs[size_class],
- rt_size_class_info[size_class].inner_size);
- else
- newnode.decoded = (rt_node *) MemoryContextAlloc(tree->leaf_slabs[size_class],
- rt_size_class_info[size_class].leaf_size);
+ if (RadixTreeIsShared(tree))
+ {
+ dsa_pointer dp;
- newnode.encoded = rt_pointer_encode(newnode.decoded);
+ if (inner)
+ dp = dsa_allocate(tree->area, rt_size_class_info[size_class].inner_size);
+ else
+ dp = dsa_allocate(tree->area, rt_size_class_info[size_class].leaf_size);
+
+ newnode.encoded = (rt_pointer) dp;
+ newnode.decoded = rt_pointer_decode(tree, newnode.encoded);
+ }
+ else
+ {
+ if (inner)
+ newnode.decoded = (rt_node *) MemoryContextAlloc(tree->inner_slabs[size_class],
+ rt_size_class_info[size_class].inner_size);
+ else
+ newnode.decoded = (rt_node *) MemoryContextAlloc(tree->leaf_slabs[size_class],
+ rt_size_class_info[size_class].leaf_size);
+
+ newnode.encoded = rt_pointer_encode(newnode.decoded);
+ }
#ifdef RT_DEBUG
/* update the statistics */
- tree->cnt[size_class]++;
+ tree->ctl->cnt[size_class]++;
#endif
return newnode;
@@ -1041,10 +1096,10 @@ static void
rt_free_node(radix_tree *tree, rt_node_ptr node)
{
/* If we're deleting the root node, make the tree empty */
- if (tree->root == node.encoded)
+ if (tree->ctl->root == node.encoded)
{
- tree->root = InvalidRTPointer;
- tree->max_val = 0;
+ tree->ctl->root = InvalidRTPointer;
+ tree->ctl->max_val = 0;
}
#ifdef RT_DEBUG
@@ -1062,12 +1117,15 @@ rt_free_node(radix_tree *tree, rt_node_ptr node)
if (i == RT_SIZE_CLASS_COUNT)
i = RT_CLASS_256;
- tree->cnt[i]--;
- Assert(tree->cnt[i] >= 0);
+ tree->ctl->cnt[i]--;
+ Assert(tree->ctl->cnt[i] >= 0);
}
#endif
- pfree(node.decoded);
+ if (RadixTreeIsShared(tree))
+ dsa_free(tree->area, (dsa_pointer) node.encoded);
+ else
+ pfree(node.decoded);
}
/*
@@ -1083,7 +1141,7 @@ rt_replace_node(radix_tree *tree, rt_node_ptr parent, rt_node_ptr old_child,
if (rt_node_ptr_eq(&parent, &old_child))
{
/* Replace the root node with the new large node */
- tree->root = new_child.encoded;
+ tree->ctl->root = new_child.encoded;
}
else
{
@@ -1105,7 +1163,7 @@ static void
rt_extend(radix_tree *tree, uint64 key)
{
int target_shift;
- rt_node *root = rt_pointer_decode(tree->root);
+ rt_node *root = rt_pointer_decode(tree, tree->ctl->root);
int shift = root->shift + RT_NODE_SPAN;
target_shift = key_get_shift(key);
@@ -1123,15 +1181,15 @@ rt_extend(radix_tree *tree, uint64 key)
n4->base.n.shift = shift;
n4->base.n.count = 1;
n4->base.chunks[0] = 0;
- n4->children[0] = tree->root;
+ n4->children[0] = tree->ctl->root;
root->chunk = 0;
- tree->root = node.encoded;
+ tree->ctl->root = node.encoded;
shift += RT_NODE_SPAN;
}
- tree->max_val = shift_get_max_val(target_shift);
+ tree->ctl->max_val = shift_get_max_val(target_shift);
}
/*
@@ -1163,7 +1221,7 @@ rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node_ptr parent,
}
rt_node_insert_leaf(tree, parent, node, key, value);
- tree->num_keys++;
+ tree->ctl->num_keys++;
}
/*
@@ -1174,12 +1232,11 @@ rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node_ptr parent,
* pointer is set to child_p.
*/
static inline bool
-rt_node_search_inner(rt_node_ptr node, uint64 key, rt_action action,
- rt_pointer *child_p)
+rt_node_search_inner(rt_node_ptr node, uint64 key, rt_action action, rt_pointer *child_p)
{
uint8 chunk = RT_GET_KEY_CHUNK(key, NODE_SHIFT(node));
bool found = false;
- rt_pointer child;
+ rt_pointer child = InvalidRTPointer;
switch (NODE_KIND(node))
{
@@ -1210,6 +1267,7 @@ rt_node_search_inner(rt_node_ptr node, uint64 key, rt_action action,
break;
found = true;
+
if (action == RT_ACTION_FIND)
child = n32->children[idx];
else /* RT_ACTION_DELETE */
@@ -1761,33 +1819,51 @@ retry_insert_leaf_32:
* Create the radix tree in the given memory context and return it.
*/
radix_tree *
-rt_create(MemoryContext ctx)
+rt_create(MemoryContext ctx, dsa_area *area)
{
radix_tree *tree;
MemoryContext old_ctx;
old_ctx = MemoryContextSwitchTo(ctx);
- tree = palloc(sizeof(radix_tree));
+ tree = (radix_tree *) palloc0(sizeof(radix_tree));
tree->context = ctx;
- tree->root = InvalidRTPointer;
- tree->max_val = 0;
- tree->num_keys = 0;
+
+ if (area != NULL)
+ {
+ dsa_pointer dp;
+
+ tree->area = area;
+ dp = dsa_allocate0(area, sizeof(radix_tree_control));
+ tree->ctl = (radix_tree_control *) dsa_get_address(area, dp);
+ tree->ctl->handle = (rt_handle) dp;
+ }
+ else
+ {
+ tree->ctl = (radix_tree_control *) palloc0(sizeof(radix_tree_control));
+ tree->ctl->handle = InvalidDsaPointer;
+ }
+
+ tree->ctl->magic = RADIXTREE_MAGIC;
+ tree->ctl->root = InvalidRTPointer;
/* Create the slab allocator for each size class */
- for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ if (area == NULL)
{
- tree->inner_slabs[i] = SlabContextCreate(ctx,
- rt_size_class_info[i].name,
- rt_size_class_info[i].inner_blocksize,
- rt_size_class_info[i].inner_size);
- tree->leaf_slabs[i] = SlabContextCreate(ctx,
- rt_size_class_info[i].name,
- rt_size_class_info[i].leaf_blocksize,
- rt_size_class_info[i].leaf_size);
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ tree->inner_slabs[i] = SlabContextCreate(ctx,
+ rt_size_class_info[i].name,
+ rt_size_class_info[i].inner_blocksize,
+ rt_size_class_info[i].inner_size);
+ tree->leaf_slabs[i] = SlabContextCreate(ctx,
+ rt_size_class_info[i].name,
+ rt_size_class_info[i].leaf_blocksize,
+ rt_size_class_info[i].leaf_size);
#ifdef RT_DEBUG
- tree->cnt[i] = 0;
+ tree->ctl->cnt[i] = 0;
#endif
+ }
}
MemoryContextSwitchTo(old_ctx);
@@ -1795,16 +1871,163 @@ rt_create(MemoryContext ctx)
return tree;
}
+/*
+ * Get a handle that can be used by other processes to attach to this radix
+ * tree.
+ */
+dsa_pointer
+rt_get_handle(radix_tree *tree)
+{
+ Assert(RadixTreeIsShared(tree));
+ Assert(tree->ctl->magic == RADIXTREE_MAGIC);
+
+ return tree->ctl->handle;
+}
+
+/*
+ * Attach to an existing radix tree using a handle. The returned object is
+ * allocated in backend-local memory using the CurrentMemoryContext.
+ */
+radix_tree *
+rt_attach(dsa_area *area, rt_handle handle)
+{
+ radix_tree *tree;
+ dsa_pointer control;
+
+ /* Allocate the backend-local object representing the radix tree */
+ tree = (radix_tree *) palloc0(sizeof(radix_tree));
+
+ /* Find the control object in shared memory */
+ control = handle;
+
+ /* Set up the local radix tree */
+ tree->area = area;
+ tree->ctl = (radix_tree_control *) dsa_get_address(area, control);
+ Assert(tree->ctl->magic == RADIXTREE_MAGIC);
+
+ return tree;
+}
+
+/*
+ * Detach from a radix tree. This frees backend-local resources associated
+ * with the radix tree, but the radix tree will continue to exist until
+ * it is explicitly freed.
+ */
+void
+rt_detach(radix_tree *tree)
+{
+ Assert(RadixTreeIsShared(tree));
+ Assert(tree->ctl->magic == RADIXTREE_MAGIC);
+
+ pfree(tree);
+}
+
+/*
+ * Recursively free all nodes allocated in the DSA area.
+ */
+static void
+rt_free_recurse(radix_tree *tree, rt_pointer ptr)
+{
+ rt_node_ptr node = rt_node_ptr_encoded(tree, ptr);
+
+ Assert(RadixTreeIsShared(tree));
+
+ check_stack_depth();
+ CHECK_FOR_INTERRUPTS();
+
+ /* The leaf node doesn't have child pointers, so free it */
+ if (NODE_IS_LEAF(node))
+ {
+ dsa_free(tree->area, (dsa_pointer) node.encoded);
+ return;
+ }
+
+ switch (NODE_KIND(node))
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node.decoded;
+
+ /* Free all children recursively */
+ for (int i = 0; i < NODE_COUNT(node); i++)
+ rt_free_recurse(tree, n4->children[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node.decoded;
+
+ /* Free all children recursively */
+ for (int i = 0; i < NODE_COUNT(node); i++)
+ rt_free_recurse(tree, n32->children[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ rt_node_inner_125 *n125 = (rt_node_inner_125 *) node.decoded;
+
+ /* Free all children recursively */
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!node_125_is_chunk_used((rt_node_base_125 *) n125, i))
+ continue;
+
+ rt_free_recurse(tree, node_inner_125_get_child(n125, i));
+ }
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node.decoded;
+
+ /* Free all children recursively */
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!node_inner_256_is_chunk_used(n256, i))
+ continue;
+
+ rt_free_recurse(tree, node_inner_256_get_child(n256, i));
+ }
+ break;
+ }
+ }
+
+ /* Free the inner node itself */
+ dsa_free(tree->area, node.encoded);
+}
+
/*
* Free the given radix tree.
*/
void
rt_free(radix_tree *tree)
{
- for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ Assert(!RadixTreeIsShared(tree) || tree->ctl->magic == RADIXTREE_MAGIC);
+
+ if (RadixTreeIsShared(tree))
{
- MemoryContextDelete(tree->inner_slabs[i]);
- MemoryContextDelete(tree->leaf_slabs[i]);
+ /* Free all memory used for radix tree nodes */
+ if (RTPointerIsValid(tree->ctl->root))
+ rt_free_recurse(tree, tree->ctl->root);
+
+ /*
+ * Vandalize the control block to help catch programming errors where
+ * other backends access the memory formerly occupied by this radix tree.
+ */
+ tree->ctl->magic = 0;
+ dsa_free(tree->area, tree->ctl->handle);
+ }
+ else
+ {
+ /* Free all memory used for radix tree nodes */
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ MemoryContextDelete(tree->inner_slabs[i]);
+ MemoryContextDelete(tree->leaf_slabs[i]);
+ }
+ pfree(tree->ctl);
}
pfree(tree);
@@ -1822,16 +2045,18 @@ rt_set(radix_tree *tree, uint64 key, uint64 value)
rt_node_ptr node;
rt_node_ptr parent;
+ Assert(!RadixTreeIsShared(tree) || tree->ctl->magic == RADIXTREE_MAGIC);
+
/* Empty tree, create the root */
- if (!RTPointerIsValid(tree->root))
+ if (!RTPointerIsValid(tree->ctl->root))
rt_new_root(tree, key);
/* Extend the tree if necessary */
- if (key > tree->max_val)
+ if (key > tree->ctl->max_val)
rt_extend(tree, key);
/* Descend the tree until a leaf node */
- node = parent = rt_node_ptr_encoded(tree->root);
+ node = parent = rt_node_ptr_encoded(tree, tree->ctl->root);
shift = NODE_SHIFT(node);
while (shift >= 0)
{
@@ -1847,7 +2072,7 @@ rt_set(radix_tree *tree, uint64 key, uint64 value)
}
parent = node;
- node = rt_node_ptr_encoded(child);
+ node = rt_node_ptr_encoded(tree, child);
shift -= RT_NODE_SPAN;
}
@@ -1855,7 +2080,7 @@ rt_set(radix_tree *tree, uint64 key, uint64 value)
/* Update the statistics */
if (!updated)
- tree->num_keys++;
+ tree->ctl->num_keys++;
return updated;
}
@@ -1871,12 +2096,13 @@ rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
rt_node_ptr node;
int shift;
+ Assert(!RadixTreeIsShared(tree) || tree->ctl->magic == RADIXTREE_MAGIC);
Assert(value_p != NULL);
- if (!RTPointerIsValid(tree->root) || key > tree->max_val)
+ if (!RTPointerIsValid(tree->ctl->root) || key > tree->ctl->max_val)
return false;
- node = rt_node_ptr_encoded(tree->root);
+ node = rt_node_ptr_encoded(tree, tree->ctl->root);
shift = NODE_SHIFT(node);
/* Descend the tree until a leaf node */
@@ -1890,7 +2116,7 @@ rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
return false;
- node = rt_node_ptr_encoded(child);
+ node = rt_node_ptr_encoded(tree, child);
shift -= RT_NODE_SPAN;
}
@@ -1910,14 +2136,16 @@ rt_delete(radix_tree *tree, uint64 key)
int level;
bool deleted;
- if (!tree->root || key > tree->max_val)
+ Assert(!RadixTreeIsShared(tree) || tree->ctl->magic == RADIXTREE_MAGIC);
+
+ if (!RTPointerIsValid(tree->ctl->root) || key > tree->ctl->max_val)
return false;
/*
* Descend the tree to search the key while building a stack of nodes we
* visited.
*/
- node = rt_node_ptr_encoded(tree->root);
+ node = rt_node_ptr_encoded(tree, tree->ctl->root);
shift = NODE_SHIFT(node);
level = -1;
while (shift > 0)
@@ -1930,7 +2158,7 @@ rt_delete(radix_tree *tree, uint64 key)
if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
return false;
- node = rt_node_ptr_encoded(child);
+ node = rt_node_ptr_encoded(tree, child);
shift -= RT_NODE_SPAN;
}
@@ -1945,7 +2173,7 @@ rt_delete(radix_tree *tree, uint64 key)
}
/* Found the key to delete. Update the statistics */
- tree->num_keys--;
+ tree->ctl->num_keys--;
/*
* Return if the leaf node still has keys and we don't need to delete the
@@ -1985,16 +2213,18 @@ rt_begin_iterate(radix_tree *tree)
rt_iter *iter;
int top_level;
+ Assert(!RadixTreeIsShared(tree) || tree->ctl->magic == RADIXTREE_MAGIC);
+
old_ctx = MemoryContextSwitchTo(tree->context);
iter = (rt_iter *) palloc0(sizeof(rt_iter));
iter->tree = tree;
/* empty tree */
- if (!RTPointerIsValid(iter->tree) || !RTPointerIsValid(iter->tree->root))
+ if (!RTPointerIsValid(iter->tree) || !RTPointerIsValid(iter->tree->ctl->root))
return iter;
- root = rt_node_ptr_encoded(iter->tree->root);
+ root = rt_node_ptr_encoded(tree, iter->tree->ctl->root);
top_level = NODE_SHIFT(root) / RT_NODE_SPAN;
iter->stack_len = top_level;
@@ -2045,8 +2275,10 @@ rt_update_iter_stack(rt_iter *iter, rt_node_ptr from_node, int from)
bool
rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p)
{
+ Assert(!RadixTreeIsShared(iter->tree) || iter->tree->ctl->magic == RADIXTREE_MAGIC);
+
/* Empty tree */
- if (!iter->tree->root)
+ if (!iter->tree->ctl->root)
return false;
for (;;)
@@ -2190,7 +2422,7 @@ rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter, rt_node_ptr *
if (found)
{
rt_iter_update_key(iter, key_chunk, NODE_SHIFT(node));
- *child_p = rt_node_ptr_encoded(child);
+ *child_p = rt_node_ptr_encoded(iter->tree, child);
}
return found;
@@ -2293,7 +2525,7 @@ rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter, uint64 *value_
uint64
rt_num_entries(radix_tree *tree)
{
- return tree->num_keys;
+ return tree->ctl->num_keys;
}
/*
@@ -2302,12 +2534,19 @@ rt_num_entries(radix_tree *tree)
uint64
rt_memory_usage(radix_tree *tree)
{
- Size total = sizeof(radix_tree);
+ Size total = sizeof(radix_tree) + sizeof(radix_tree_control);
- for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ Assert(!RadixTreeIsShared(tree) || tree->ctl->magic == RADIXTREE_MAGIC);
+
+ if (RadixTreeIsShared(tree))
+ total = dsa_get_total_size(tree->area);
+ else
{
- total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
- total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
+ total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
+ }
}
return total;
@@ -2391,23 +2630,23 @@ rt_verify_node(rt_node_ptr node)
void
rt_stats(radix_tree *tree)
{
- rt_node *root = rt_pointer_decode(tree->root);
+ rt_node *root = rt_pointer_decode(tree, tree->ctl->root);
if (root == NULL)
return;
ereport(NOTICE, (errmsg("num_keys = " UINT64_FORMAT ", height = %d, n4 = %u, n15 = %u, n32 = %u, n125 = %u, n256 = %u",
- tree->num_keys,
+ tree->ctl->num_keys,
root->shift / RT_NODE_SPAN,
- tree->cnt[RT_CLASS_4_FULL],
- tree->cnt[RT_CLASS_32_PARTIAL],
- tree->cnt[RT_CLASS_32_FULL],
- tree->cnt[RT_CLASS_125_FULL],
- tree->cnt[RT_CLASS_256])));
+ tree->ctl->cnt[RT_CLASS_4_FULL],
+ tree->ctl->cnt[RT_CLASS_32_PARTIAL],
+ tree->ctl->cnt[RT_CLASS_32_FULL],
+ tree->ctl->cnt[RT_CLASS_125_FULL],
+ tree->ctl->cnt[RT_CLASS_256])));
}
static void
-rt_dump_node(rt_node_ptr node, int level, bool recurse)
+rt_dump_node(radix_tree *tree, rt_node_ptr node, int level, bool recurse)
{
rt_node *n = node.decoded;
char space[128] = {0};
@@ -2445,7 +2684,7 @@ rt_dump_node(rt_node_ptr node, int level, bool recurse)
space, n4->base.chunks[i]);
if (recurse)
- rt_dump_node(rt_node_ptr_encoded(n4->children[i]),
+ rt_dump_node(tree, rt_node_ptr_encoded(tree, n4->children[i]),
level + 1, recurse);
else
fprintf(stderr, "\n");
@@ -2473,7 +2712,7 @@ rt_dump_node(rt_node_ptr node, int level, bool recurse)
if (recurse)
{
- rt_dump_node(rt_node_ptr_encoded(n32->children[i]),
+ rt_dump_node(tree, rt_node_ptr_encoded(tree, n32->children[i]),
level + 1, recurse);
}
else
@@ -2526,7 +2765,9 @@ rt_dump_node(rt_node_ptr node, int level, bool recurse)
space, i);
if (recurse)
- rt_dump_node(rt_node_ptr_encoded(node_inner_125_get_child(n125, i)),
+ rt_dump_node(tree,
+ rt_node_ptr_encoded(tree,
+ node_inner_125_get_child(n125, i)),
level + 1, recurse);
else
fprintf(stderr, "\n");
@@ -2559,7 +2800,9 @@ rt_dump_node(rt_node_ptr node, int level, bool recurse)
space, i);
if (recurse)
- rt_dump_node(rt_node_ptr_encoded(node_inner_256_get_child(n256, i)),
+ rt_dump_node(tree,
+ rt_node_ptr_encoded(tree,
+ node_inner_256_get_child(n256, i)),
level + 1, recurse);
else
fprintf(stderr, "\n");
@@ -2579,28 +2822,28 @@ rt_dump_search(radix_tree *tree, uint64 key)
elog(NOTICE, "-----------------------------------------------------------");
elog(NOTICE, "max_val = " UINT64_FORMAT "(0x" UINT64_FORMAT_HEX ")",
- tree->max_val, tree->max_val);
+ tree->ctl->max_val, tree->ctl->max_val);
- if (!RTPointerIsValid(tree->root))
+ if (!RTPointerIsValid(tree->ctl->root))
{
elog(NOTICE, "tree is empty");
return;
}
- if (key > tree->max_val)
+ if (key > tree->ctl->max_val)
{
elog(NOTICE, "key " UINT64_FORMAT "(0x" UINT64_FORMAT_HEX ") is larger than max val",
key, key);
return;
}
- node = rt_node_ptr_encoded(tree->root);
+ node = rt_node_ptr_encoded(tree, tree->ctl->root);
shift = NODE_SHIFT(node);
while (shift >= 0)
{
rt_pointer child;
- rt_dump_node(node, level, false);
+ rt_dump_node(tree, node, level, false);
if (NODE_IS_LEAF(node))
{
@@ -2615,7 +2858,7 @@ rt_dump_search(radix_tree *tree, uint64 key)
if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
break;
- node = rt_node_ptr_encoded(child);
+ node = rt_node_ptr_encoded(tree, child);
shift -= RT_NODE_SPAN;
level++;
}
@@ -2633,15 +2876,15 @@ rt_dump(radix_tree *tree)
rt_size_class_info[i].inner_blocksize,
rt_size_class_info[i].leaf_size,
rt_size_class_info[i].leaf_blocksize);
- fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->max_val);
+ fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->ctl->max_val);
- if (!RTPointerIsValid(tree->root))
+ if (!RTPointerIsValid(tree->ctl->root))
{
fprintf(stderr, "empty tree\n");
return;
}
- root = rt_node_ptr_encoded(tree->root);
- rt_dump_node(root, 0, true);
+ root = rt_node_ptr_encoded(tree, tree->ctl->root);
+ rt_dump_node(tree, root, 0, true);
}
#endif
diff --git a/src/backend/utils/mmgr/dsa.c b/src/backend/utils/mmgr/dsa.c
index 82376fde2d..ad169882af 100644
--- a/src/backend/utils/mmgr/dsa.c
+++ b/src/backend/utils/mmgr/dsa.c
@@ -1024,6 +1024,18 @@ dsa_set_size_limit(dsa_area *area, size_t limit)
LWLockRelease(DSA_AREA_LOCK(area));
}
+size_t
+dsa_get_total_size(dsa_area *area)
+{
+ size_t size;
+
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_SHARED);
+ size = area->control->total_segment_size;
+ LWLockRelease(DSA_AREA_LOCK(area));
+
+ return size;
+}
+
/*
* Aggressively free all spare memory in the hope of returning DSM segments to
* the operating system.
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index d5d7668617..68a11df970 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -14,18 +14,24 @@
#define RADIXTREE_H
#include "postgres.h"
+#include "utils/dsa.h"
#define RT_DEBUG 1
typedef struct radix_tree radix_tree;
typedef struct rt_iter rt_iter;
+typedef dsa_pointer rt_handle;
-extern radix_tree *rt_create(MemoryContext ctx);
+extern radix_tree *rt_create(MemoryContext ctx, dsa_area *dsa);
extern void rt_free(radix_tree *tree);
extern bool rt_search(radix_tree *tree, uint64 key, uint64 *val_p);
extern bool rt_set(radix_tree *tree, uint64 key, uint64 val);
extern rt_iter *rt_begin_iterate(radix_tree *tree);
+extern rt_handle rt_get_handle(radix_tree *tree);
+extern radix_tree *rt_attach(dsa_area *dsa, dsa_pointer dp);
+extern void rt_detach(radix_tree *tree);
+
extern bool rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p);
extern void rt_end_iterate(rt_iter *iter);
extern bool rt_delete(radix_tree *tree, uint64 key);
diff --git a/src/include/utils/dsa.h b/src/include/utils/dsa.h
index 405606fe2f..dad06adecc 100644
--- a/src/include/utils/dsa.h
+++ b/src/include/utils/dsa.h
@@ -117,6 +117,7 @@ extern dsa_handle dsa_get_handle(dsa_area *area);
extern dsa_pointer dsa_allocate_extended(dsa_area *area, size_t size, int flags);
extern void dsa_free(dsa_area *area, dsa_pointer dp);
extern void *dsa_get_address(dsa_area *area, dsa_pointer dp);
+extern size_t dsa_get_total_size(dsa_area *area);
extern void dsa_trim(dsa_area *area);
extern void dsa_dump(dsa_area *area);
diff --git a/src/test/modules/test_radixtree/expected/test_radixtree.out b/src/test/modules/test_radixtree/expected/test_radixtree.out
index ce645cb8b5..a217e0d312 100644
--- a/src/test/modules/test_radixtree/expected/test_radixtree.out
+++ b/src/test/modules/test_radixtree/expected/test_radixtree.out
@@ -6,28 +6,53 @@ CREATE EXTENSION test_radixtree;
SELECT test_radixtree();
NOTICE: testing basic operations with leaf node 4
NOTICE: testing basic operations with inner node 4
+NOTICE: testing basic operations with leaf node 4
+NOTICE: testing basic operations with inner node 4
+NOTICE: testing basic operations with leaf node 32
+NOTICE: testing basic operations with inner node 32
NOTICE: testing basic operations with leaf node 32
NOTICE: testing basic operations with inner node 32
NOTICE: testing basic operations with leaf node 125
NOTICE: testing basic operations with inner node 125
+NOTICE: testing basic operations with leaf node 125
+NOTICE: testing basic operations with inner node 125
NOTICE: testing basic operations with leaf node 256
NOTICE: testing basic operations with inner node 256
+NOTICE: testing basic operations with leaf node 256
+NOTICE: testing basic operations with inner node 256
+NOTICE: testing radix tree node types with shift "0"
NOTICE: testing radix tree node types with shift "0"
NOTICE: testing radix tree node types with shift "8"
+NOTICE: testing radix tree node types with shift "8"
NOTICE: testing radix tree node types with shift "16"
+NOTICE: testing radix tree node types with shift "16"
+NOTICE: testing radix tree node types with shift "24"
NOTICE: testing radix tree node types with shift "24"
NOTICE: testing radix tree node types with shift "32"
+NOTICE: testing radix tree node types with shift "32"
NOTICE: testing radix tree node types with shift "40"
+NOTICE: testing radix tree node types with shift "40"
+NOTICE: testing radix tree node types with shift "48"
NOTICE: testing radix tree node types with shift "48"
NOTICE: testing radix tree node types with shift "56"
+NOTICE: testing radix tree node types with shift "56"
NOTICE: testing radix tree with pattern "all ones"
+NOTICE: testing radix tree with pattern "all ones"
+NOTICE: testing radix tree with pattern "alternating bits"
NOTICE: testing radix tree with pattern "alternating bits"
NOTICE: testing radix tree with pattern "clusters of ten"
+NOTICE: testing radix tree with pattern "clusters of ten"
NOTICE: testing radix tree with pattern "clusters of hundred"
+NOTICE: testing radix tree with pattern "clusters of hundred"
+NOTICE: testing radix tree with pattern "one-every-64k"
NOTICE: testing radix tree with pattern "one-every-64k"
NOTICE: testing radix tree with pattern "sparse"
+NOTICE: testing radix tree with pattern "sparse"
NOTICE: testing radix tree with pattern "single values, distance > 2^32"
+NOTICE: testing radix tree with pattern "single values, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^32"
NOTICE: testing radix tree with pattern "clusters, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^60"
NOTICE: testing radix tree with pattern "clusters, distance > 2^60"
test_radixtree
----------------
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
index ea993e63df..fe1e168ec4 100644
--- a/src/test/modules/test_radixtree/test_radixtree.c
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -19,6 +19,7 @@
#include "nodes/bitmapset.h"
#include "storage/block.h"
#include "storage/itemptr.h"
+#include "storage/lwlock.h"
#include "utils/memutils.h"
#include "utils/timestamp.h"
@@ -99,6 +100,8 @@ static const test_spec test_specs[] = {
}
};
+static int lwlock_tranche_id;
+
PG_MODULE_MAGIC;
PG_FUNCTION_INFO_V1(test_radixtree);
@@ -112,7 +115,7 @@ test_empty(void)
uint64 key;
uint64 val;
- radixtree = rt_create(CurrentMemoryContext);
+ radixtree = rt_create(CurrentMemoryContext, NULL);
if (rt_search(radixtree, 0, &dummy))
elog(ERROR, "rt_search on empty tree returned true");
@@ -140,17 +143,14 @@ test_empty(void)
}
static void
-test_basic(int children, bool test_inner)
+do_test_basic(radix_tree *radixtree, int children, bool test_inner)
{
- radix_tree *radixtree;
uint64 *keys;
int shift = test_inner ? 8 : 0;
elog(NOTICE, "testing basic operations with %s node %d",
test_inner ? "inner" : "leaf", children);
- radixtree = rt_create(CurrentMemoryContext);
-
/* prepare keys in order like 1, 32, 2, 31, 2, ... */
keys = palloc(sizeof(uint64) * children);
for (int i = 0; i < children; i++)
@@ -165,7 +165,7 @@ test_basic(int children, bool test_inner)
for (int i = 0; i < children; i++)
{
if (rt_set(radixtree, keys[i], keys[i]))
- elog(ERROR, "new inserted key 0x" UINT64_HEX_FORMAT " is found ", keys[i]);
+ elog(ERROR, "new inserted key 0x" UINT64_HEX_FORMAT " is found %d", keys[i], i);
}
/* update keys */
@@ -185,7 +185,38 @@ test_basic(int children, bool test_inner)
}
pfree(keys);
- rt_free(radixtree);
+}
+
+static void
+test_basic()
+{
+ for (int i = 1; i < lengthof(rt_node_kind_fanouts); i++)
+ {
+ radix_tree *tree;
+ dsa_area *area;
+
+ /* Test the local radix tree */
+ tree = rt_create(CurrentMemoryContext, NULL);
+ do_test_basic(tree, rt_node_kind_fanouts[i], false);
+ rt_free(tree);
+
+ tree = rt_create(CurrentMemoryContext, NULL);
+ do_test_basic(tree, rt_node_kind_fanouts[i], true);
+ rt_free(tree);
+
+ /* Test the shared radix tree */
+ area = dsa_create(lwlock_tranche_id);
+ tree = rt_create(CurrentMemoryContext, area);
+ do_test_basic(tree, rt_node_kind_fanouts[i], false);
+ rt_free(tree);
+ dsa_detach(area);
+
+ area = dsa_create(lwlock_tranche_id);
+ tree = rt_create(CurrentMemoryContext, area);
+ do_test_basic(tree, rt_node_kind_fanouts[i], true);
+ rt_free(tree);
+ dsa_detach(area);
+ }
}
/*
@@ -286,14 +317,10 @@ test_node_types_delete(radix_tree *radixtree, uint8 shift)
* level.
*/
static void
-test_node_types(uint8 shift)
+do_test_node_types(radix_tree *radixtree, uint8 shift)
{
- radix_tree *radixtree;
-
elog(NOTICE, "testing radix tree node types with shift \"%d\"", shift);
- radixtree = rt_create(CurrentMemoryContext);
-
/*
* Insert and search entries for every node type at the 'shift' level,
* then delete all entries to make it empty, and insert and search entries
@@ -302,19 +329,37 @@ test_node_types(uint8 shift)
test_node_types_insert(radixtree, shift, true);
test_node_types_delete(radixtree, shift);
test_node_types_insert(radixtree, shift, false);
+}
- rt_free(radixtree);
+static void
+test_node_types(void)
+{
+ for (int shift = 0; shift <= (64 - 8); shift += 8)
+ {
+ radix_tree *tree;
+ dsa_area *area;
+
+ /* Test the local radix tree */
+ tree = rt_create(CurrentMemoryContext, NULL);
+ do_test_node_types(tree, shift);
+ rt_free(tree);
+
+ /* Test the shared radix tree */
+ area = dsa_create(lwlock_tranche_id);
+ tree = rt_create(CurrentMemoryContext, area);
+ do_test_node_types(tree, shift);
+ rt_free(tree);
+ dsa_detach(area);
+ }
}
/*
* Test with a repeating pattern, defined by the 'spec'.
*/
static void
-test_pattern(const test_spec * spec)
+do_test_pattern(radix_tree *radixtree, const test_spec * spec)
{
- radix_tree *radixtree;
rt_iter *iter;
- MemoryContext radixtree_ctx;
TimestampTz starttime;
TimestampTz endtime;
uint64 n;
@@ -340,18 +385,6 @@ test_pattern(const test_spec * spec)
pattern_values[pattern_num_values++] = i;
}
- /*
- * Allocate the radix tree.
- *
- * Allocate it in a separate memory context, so that we can print its
- * memory usage easily.
- */
- radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
- "radixtree test",
- ALLOCSET_SMALL_SIZES);
- MemoryContextSetIdentifier(radixtree_ctx, spec->test_name);
- radixtree = rt_create(radixtree_ctx);
-
/*
* Add values to the set.
*/
@@ -405,8 +438,6 @@ test_pattern(const test_spec * spec)
mem_usage = rt_memory_usage(radixtree);
fprintf(stderr, "rt_memory_usage() reported " UINT64_FORMAT " (%0.2f bytes / integer)\n",
mem_usage, (double) mem_usage / spec->num_values);
-
- MemoryContextStats(radixtree_ctx);
}
/* Check that rt_num_entries works */
@@ -555,27 +586,57 @@ test_pattern(const test_spec * spec)
if ((nbefore - ndeleted) != nafter)
elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT "after " UINT64_FORMAT " deletion",
nafter, (nbefore - ndeleted), ndeleted);
+}
+
+static void
+test_patterns(void)
+{
+ /* Test different test patterns, with lots of entries */
+ for (int i = 0; i < lengthof(test_specs); i++)
+ {
+ radix_tree *tree;
+ MemoryContext radixtree_ctx;
+ dsa_area *area;
+ const test_spec *spec = &test_specs[i];
- MemoryContextDelete(radixtree_ctx);
+ /*
+ * Allocate the radix tree.
+ *
+ * Allocate it in a separate memory context, so that we can print its
+ * memory usage easily.
+ */
+ radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
+ "radixtree test",
+ ALLOCSET_SMALL_SIZES);
+ MemoryContextSetIdentifier(radixtree_ctx, spec->test_name);
+
+ /* Test the local radix tree */
+ tree = rt_create(radixtree_ctx, NULL);
+ do_test_pattern(tree, spec);
+ rt_free(tree);
+ MemoryContextReset(radixtree_ctx);
+
+ /* Test the shared radix tree */
+ area = dsa_create(lwlock_tranche_id);
+ tree = rt_create(radixtree_ctx, area);
+ do_test_pattern(tree, spec);
+ rt_free(tree);
+ dsa_detach(area);
+ MemoryContextDelete(radixtree_ctx);
+ }
}
Datum
test_radixtree(PG_FUNCTION_ARGS)
{
- test_empty();
+ /* get a new lwlock tranche id for all tests for shared radix tree */
+ lwlock_tranche_id = LWLockNewTrancheId();
- for (int i = 1; i < lengthof(rt_node_kind_fanouts); i++)
- {
- test_basic(rt_node_kind_fanouts[i], false);
- test_basic(rt_node_kind_fanouts[i], true);
- }
-
- for (int shift = 0; shift <= (64 - 8); shift += 8)
- test_node_types(shift);
+ test_empty();
+ test_basic();
- /* Test different test patterns, with lots of entries */
- for (int i = 0; i < lengthof(test_specs); i++)
- test_pattern(&test_specs[i]);
+ test_node_types();
+ test_patterns();
PG_RETURN_VOID();
}
--
2.31.1
Attachment: v15-0009-PoC-lazy-vacuum-integration.patch (application/octet-stream)
From 1ce76ec8644e7ce8ca1eb021c7e327f1afc11070 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 4 Nov 2022 14:14:42 +0900
Subject: [PATCH v15 9/9] PoC: lazy vacuum integration.
The patch includes:
* Introducing a new module, TIDStore, to store TID in radix tree.
* Integrating TIDStore with Lazy (parallel) vacuum.
---
src/backend/access/common/Makefile | 1 +
src/backend/access/common/meson.build | 1 +
src/backend/access/common/tidstore.c | 555 ++++++++++++++++++++++++++
src/backend/access/heap/vacuumlazy.c | 171 +++-----
src/backend/catalog/system_views.sql | 2 +-
src/backend/commands/vacuum.c | 76 +---
src/backend/commands/vacuumparallel.c | 64 +--
src/backend/storage/lmgr/lwlock.c | 2 +
src/include/access/tidstore.h | 50 +++
src/include/commands/progress.h | 4 +-
src/include/commands/vacuum.h | 25 +-
src/include/storage/lwlock.h | 1 +
src/test/regress/expected/rules.out | 4 +-
13 files changed, 721 insertions(+), 235 deletions(-)
create mode 100644 src/backend/access/common/tidstore.c
create mode 100644 src/include/access/tidstore.h
diff --git a/src/backend/access/common/Makefile b/src/backend/access/common/Makefile
index b9aff0ccfd..67b8cc6108 100644
--- a/src/backend/access/common/Makefile
+++ b/src/backend/access/common/Makefile
@@ -27,6 +27,7 @@ OBJS = \
syncscan.o \
toast_compression.o \
toast_internals.o \
+ tidstore.o \
tupconvert.o \
tupdesc.o
diff --git a/src/backend/access/common/meson.build b/src/backend/access/common/meson.build
index 857beaa32d..76265974b1 100644
--- a/src/backend/access/common/meson.build
+++ b/src/backend/access/common/meson.build
@@ -13,6 +13,7 @@ backend_sources += files(
'syncscan.c',
'toast_compression.c',
'toast_internals.c',
+ 'tidstore.c',
'tupconvert.c',
'tupdesc.c',
)
diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
new file mode 100644
index 0000000000..7e6fc4eeca
--- /dev/null
+++ b/src/backend/access/common/tidstore.c
@@ -0,0 +1,555 @@
+/*-------------------------------------------------------------------------
+ *
+ * tidstore.c
+ * TID (ItemPointer) storage implementation.
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/common/tidstore.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "access/tidstore.h"
+#include "lib/radixtree.h"
+#include "port/pg_bitutils.h"
+#include "utils/dsa.h"
+#include "utils/memutils.h"
+#include "miscadmin.h"
+
+/* XXX only testing purpose during development, will be removed */
+#define XXX_DEBUG_TID_STORE 1
+
+/*
+ * For encoding purposes, item pointers are represented as a pair of 64-bit
+ * key and 64-bit value. We construct a 64-bit unsigned integer that combines
+ * the block number and the offset number. The lowest 11 bits represent the
+ * offset number, and the next 32 bits are the block number. That is, only 43
+ * bits are used:
+ *
+ * XXXXXXXX XXXYYYYY YYYYYYYY YYYYYYYY YYYYYYYY YYYuuuu
+ *
+ * X = bits used for offset number
+ * Y = bits used for block number
+ * u = unused bit
+ *
+ * 11 bits are enough for the offset number, because MaxHeapTuplesPerPage < 2^11
+ * on all supported block sizes (TIDSTORE_OFFSET_NBITS). We are frugal with
+ * the bits, because smaller keys could help keep the radix tree shallow.
+ *
+ * XXX: If we want to support other table AMs that want to use the full range
+ * of possible offset numbers, we'll need to change this.
+ *
+ * The 64-bit value is the bitmap representation of the lowest 6 bits, and
+ * the remaining 37 bits are used as the key:
+ *
+ * value = bitmap representation of XXXXXX
+ * key = XXXXXYYY YYYYYYYY YYYYYYYY YYYYYYYY YYYYYuu
+ *
+ * The maximum height of the radix tree is 5. The most memory-consuming case
+ * while adding TIDs is allocating the largest node in a new slab block,
+ * about 70kB. Therefore we deduct 70kB from the maximum memory.
+ */
+#define TIDSTORE_OFFSET_NBITS 11
+#define TIDSTORE_VALUE_NBITS 6 /* log(sizeof(uint64) * BITS_PER_BYTE, 2) */
+#define TIDSTORE_MEMORY_DEDUCT (1024 * 70)
+
+/* Get block number from the key */
+#define KEY_GET_BLKNO(key) \
+ ((BlockNumber) ((key) >> (TIDSTORE_OFFSET_NBITS - TIDSTORE_VALUE_NBITS)))
+
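+/*
+ * A worked example of the encoding above (illustration only): for the TID
+ * (blkno = 7, offset = 70), the combined integer is (7 << 11) | 70 = 14406.
+ * The lowest 6 bits (14406 % 64 = 6) select the bit to set in the 64-bit
+ * value, the key is 14406 >> 6 = 225, and KEY_GET_BLKNO(225) = 225 >> 5
+ * recovers the block number 7.
+ */
+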
+struct TIDStore
+{
+ /* main storage for TID */
+ radix_tree *tree;
+
+ /* # of tids in TIDStore */
+ int num_tids;
+
+ /* maximum bytes TIDStore can consume */
+ uint64 max_bytes;
+
+ /* DSA area and handle for shared TIDStore */
+ rt_handle handle;
+ dsa_area *area;
+
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+ uint64 max_items;
+ ItemPointer itemptrs;
+ uint64 nitems;
+#endif
+};
+
+/* Iterator for TIDStore */
+typedef struct TIDStoreIter
+{
+ TIDStore *ts;
+
+ /* iterator of radix tree */
+ rt_iter *tree_iter;
+
+ /* have we returned all tids? */
+ bool finished;
+
+ /* save for the next iteration */
+ uint64 next_key;
+ uint64 next_val;
+
+ /* output for the caller */
+ TIDStoreIterResult result;
+
+#ifdef USE_ASSERT_CHECKING
+ uint64 itemptrs_index;
+ int prev_index;
+#endif
+} TIDStoreIter;
+
+static void tidstore_iter_extract_tids(TIDStoreIter *iter, uint64 key, uint64 val);
+static inline uint64 tid_to_key_off(ItemPointer tid, uint32 *off);
+
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+/*
+ * Comparator routines for use with qsort() and bsearch().
+ */
+static int
+vac_cmp_itemptr(const void *left, const void *right)
+{
+ BlockNumber lblk,
+ rblk;
+ OffsetNumber loff,
+ roff;
+
+ lblk = ItemPointerGetBlockNumber((ItemPointer) left);
+ rblk = ItemPointerGetBlockNumber((ItemPointer) right);
+
+ if (lblk < rblk)
+ return -1;
+ if (lblk > rblk)
+ return 1;
+
+ loff = ItemPointerGetOffsetNumber((ItemPointer) left);
+ roff = ItemPointerGetOffsetNumber((ItemPointer) right);
+
+ if (loff < roff)
+ return -1;
+ if (loff > roff)
+ return 1;
+
+ return 0;
+}
+
+static void
+verify_iter_tids(TIDStoreIter *iter)
+{
+ uint64 index = iter->prev_index;
+ TIDStoreIterResult *result = &(iter->result);
+
+ if (iter->ts->itemptrs == NULL)
+ return;
+
+ Assert(index <= iter->ts->nitems);
+
+ for (int i = 0; i < result->num_offsets; i++)
+ {
+ ItemPointerData tid;
+
+ ItemPointerSetBlockNumber(&tid, result->blkno);
+ ItemPointerSetOffsetNumber(&tid, result->offsets[i]);
+
+ Assert(ItemPointerEquals(&iter->ts->itemptrs[index++], &tid));
+ }
+
+ iter->prev_index = iter->itemptrs_index;
+}
+
+static void
+dump_itemptrs(TIDStore *ts)
+{
+ StringInfoData buf;
+
+ if (ts->itemptrs == NULL)
+ return;
+
+ initStringInfo(&buf);
+ for (int i = 0; i < ts->nitems; i++)
+ {
+ appendStringInfo(&buf, "(%d,%d) ",
+ ItemPointerGetBlockNumber(&(ts->itemptrs[i])),
+ ItemPointerGetOffsetNumber(&(ts->itemptrs[i])));
+ }
+ elog(WARNING, "--- dump (" UINT64_FORMAT " items) ---", ts->nitems);
+ elog(WARNING, "%s\n", buf.data);
+}
+
+#endif
+
+/*
+ * Create a TIDStore. The returned object is allocated in backend-local memory.
+ * The radix tree for storage is allocated in the DSA area if 'area' is non-NULL.
+ */
+TIDStore *
+tidstore_create(uint64 max_bytes, dsa_area *area)
+{
+ TIDStore *ts;
+
+ ts = palloc0(sizeof(TIDStore));
+
+ ts->tree = rt_create(CurrentMemoryContext, area);
+ ts->area = area;
+ ts->max_bytes = max_bytes - TIDSTORE_MEMORY_DEDUCT;
+
+ if (area != NULL)
+ ts->handle = rt_get_handle(ts->tree);
+
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+#define MAXDEADITEMS(avail_mem) \
+ (avail_mem / sizeof(ItemPointerData))
+
+ if (area == NULL)
+ {
+ ts->max_items = MAXDEADITEMS(maintenance_work_mem * 1024);
+ ts->itemptrs = (ItemPointer) palloc0(sizeof(ItemPointerData) * ts->max_items);
+ ts->nitems = 0;
+ }
+#endif
+
+ return ts;
+}
+
+/* Attach to the shared TIDStore using a handle */
+TIDStore *
+tidstore_attach(dsa_area *area, rt_handle handle)
+{
+ TIDStore *ts;
+
+ Assert(area != NULL);
+ Assert(DsaPointerIsValid(handle));
+
+ ts = palloc0(sizeof(TIDStore));
+ ts->tree = rt_attach(area, handle);
+
+ return ts;
+}
+
+/*
+ * Detach from a TIDStore. This detaches from the radix tree and frees the
+ * backend-local resources.
+ */
+void
+tidstore_detach(TIDStore *ts)
+{
+ rt_detach(ts->tree);
+ pfree(ts);
+}
+
+void
+tidstore_free(TIDStore *ts)
+{
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+ if (ts->itemptrs)
+ pfree(ts->itemptrs);
+#endif
+
+ rt_free(ts->tree);
+ pfree(ts);
+}
+
+/* Remove all collected TIDs without freeing the TIDStore itself */
+void
+tidstore_reset(TIDStore *ts)
+{
+ dsa_area *area = ts->area;
+
+ /* Recreate the radix tree */
+ rt_free(ts->tree);
+
+ /* Return allocated DSM segments to the operating system */
+ if (ts->area)
+ dsa_trim(area);
+
+ ts->tree = rt_create(CurrentMemoryContext, area);
+
+ /* Reset the statistics */
+ ts->num_tids = 0;
+
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+ ts->nitems = 0;
+#endif
+}
+
+/* Add TIDs to TIDStore */
+void
+tidstore_add_tids(TIDStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets)
+{
+ uint64 last_key = PG_UINT64_MAX;
+ uint64 key;
+ uint64 val = 0;
+ ItemPointerData tid;
+
+ ItemPointerSetBlockNumber(&tid, blkno);
+
+ for (int i = 0; i < num_offsets; i++)
+ {
+ uint32 off;
+
+ ItemPointerSetOffsetNumber(&tid, offsets[i]);
+
+ key = tid_to_key_off(&tid, &off);
+
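+ /* Flush the accumulated bitmap when the key changes */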
+ if (last_key != PG_UINT64_MAX && last_key != key)
+ {
+ rt_set(ts->tree, last_key, val);
+ val = 0;
+ }
+
+ last_key = key;
+ val |= UINT64CONST(1) << off;
+ ts->num_tids++;
+
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+ if (ts->itemptrs)
+ {
+ if (ts->nitems >= ts->max_items)
+ {
+ ts->max_items *= 2;
+ ts->itemptrs = repalloc(ts->itemptrs, sizeof(ItemPointerData) * ts->max_items);
+ }
+
+ Assert(ts->nitems < ts->max_items);
+ ItemPointerSetBlockNumber(&(ts->itemptrs[ts->nitems]), blkno);
+ ItemPointerSetOffsetNumber(&(ts->itemptrs[ts->nitems]), offsets[i]);
+ ts->nitems++;
+ }
+#endif
+ }
+
+ if (last_key != PG_UINT64_MAX)
+ {
+ rt_set(ts->tree, last_key, val);
+ val = 0;
+ }
+
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+ if (ts->itemptrs)
+ Assert(ts->nitems == ts->num_tids);
+#endif
+}
+
+/* Return true if the given TID is present in the TIDStore */
+bool
+tidstore_lookup_tid(TIDStore *ts, ItemPointer tid)
+{
+ uint64 key;
+ uint64 val;
+ uint32 off;
+ bool found;
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+ bool found_assert;
+#endif
+
+ key = tid_to_key_off(tid, &off);
+
+ found = rt_search(ts->tree, key, &val);
+
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+ if (ts->itemptrs)
+ found_assert = bsearch((void *) tid,
+ (void *) ts->itemptrs,
+ ts->nitems,
+ sizeof(ItemPointerData),
+ vac_cmp_itemptr) != NULL;
+#endif
+
+ if (!found)
+ {
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+ if (ts->itemptrs)
+ Assert(!found_assert);
+#endif
+ return false;
+ }
+
+ found = (val & (UINT64CONST(1) << off)) != 0;
+
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+
+ if (ts->itemptrs && found != found_assert)
+ {
+ elog(WARNING, "tid (%d,%d)\n",
+ ItemPointerGetBlockNumber(tid),
+ ItemPointerGetOffsetNumber(tid));
+ dump_itemptrs(ts);
+ }
+
+ if (ts->itemptrs)
+ Assert(found == found_assert);
+
+#endif
+ return found;
+}
+
+/*
+ * Prepare to iterate through a TIDStore. Return the TIDStoreIter allocated
+ * in the caller's memory context.
+ */
+TIDStoreIter *
+tidstore_begin_iterate(TIDStore *ts)
+{
+ TIDStoreIter *iter;
+
+ iter = palloc0(sizeof(TIDStoreIter));
+ iter->ts = ts;
+ iter->tree_iter = rt_begin_iterate(ts->tree);
+ iter->result.blkno = InvalidBlockNumber;
+
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+ iter->itemptrs_index = 0;
+#endif
+
+ return iter;
+}
+
+/*
+ * Scan the TIDStore and return a TIDStoreIterResult representing the TIDs
+ * of one page. Offset numbers in the result are sorted. NULL is returned
+ * when there are no more pages to iterate over.
+ */
+TIDStoreIterResult *
+tidstore_iterate_next(TIDStoreIter *iter)
+{
+ uint64 key;
+ uint64 val;
+ TIDStoreIterResult *result = &(iter->result);
+
+ if (iter->finished)
+ return NULL;
+
+ if (BlockNumberIsValid(result->blkno))
+ {
+ result->num_offsets = 0;
+ tidstore_iter_extract_tids(iter, iter->next_key, iter->next_val);
+ }
+
+ while (rt_iterate_next(iter->tree_iter, &key, &val))
+ {
+ BlockNumber blkno;
+
+ blkno = KEY_GET_BLKNO(key);
+
+ if (BlockNumberIsValid(result->blkno) && result->blkno != blkno)
+ {
+ /*
+ * Remember the key-value pair belonging to the next block so
+ * that it can be processed in the next iteration.
+ */
+ iter->next_key = key;
+ iter->next_val = val;
+
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+ verify_iter_tids(iter);
+#endif
+ return result;
+ }
+
+ /* Collect tids extracted from the key-value pair */
+ tidstore_iter_extract_tids(iter, key, val);
+ }
+
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+ verify_iter_tids(iter);
+#endif
+
+ iter->finished = true;
+ return result;
+}
+
+/* Finish an iteration over TIDStore */
+void
+tidstore_end_iterate(TIDStoreIter *iter)
+{
+ pfree(iter);
+}
+
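+/* Return the number of TIDs stored in the TIDStore */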
+uint64
+tidstore_num_tids(TIDStore *ts)
+{
+ return ts->num_tids;
+}
+
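+/* Return true if the TIDStore's memory usage has exceeded its limit */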
+bool
+tidstore_is_full(TIDStore *ts)
+{
+ return (tidstore_memory_usage(ts) > ts->max_bytes);
+}
+
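+/* Return the maximum amount of memory the TIDStore is allowed to use */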
+uint64
+tidstore_max_memory(TIDStore *ts)
+{
+ return ts->max_bytes;
+}
+
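+/* Return the current memory usage of the TIDStore */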
+uint64
+tidstore_memory_usage(TIDStore *ts)
+{
+ return (uint64) sizeof(TIDStore) + rt_memory_usage(ts->tree);
+}
+
+/*
+ * Get a handle that can be used by other processes to attach to this TIDStore
+ */
+tidstore_handle
+tidstore_get_handle(TIDStore *ts)
+{
+ return rt_get_handle(ts->tree);
+}
+
+/* Extract TIDs from a key-value pair */
+static void
+tidstore_iter_extract_tids(TIDStoreIter *iter, uint64 key, uint64 val)
+{
+ TIDStoreIterResult *result = (&iter->result);
+
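+ /* Decode each bit set in the value back into an offset number */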
+ for (int i = 0; i < sizeof(uint64) * BITS_PER_BYTE; i++)
+ {
+ uint64 tid_i;
+ OffsetNumber off;
+
+ if ((val & (UINT64CONST(1) << i)) == 0)
+ continue;
+
+ tid_i = key << TIDSTORE_VALUE_NBITS;
+ tid_i |= i;
+
+ off = tid_i & ((UINT64CONST(1) << TIDSTORE_OFFSET_NBITS) - 1);
+ result->offsets[result->num_offsets++] = off;
+
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+ iter->itemptrs_index++;
+#endif
+ }
+
+ result->blkno = KEY_GET_BLKNO(key);
+}
+
+/*
+ * Encode a TID into a radix tree key. The bit position within the 64-bit
+ * value word is returned in *off.
+ */
+static inline uint64
+tid_to_key_off(ItemPointer tid, uint32 *off)
+{
+ uint64 upper;
+ uint64 tid_i;
+
+ tid_i = ItemPointerGetOffsetNumber(tid);
+ tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << TIDSTORE_OFFSET_NBITS;
+
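+ /*
+ * The lowest TIDSTORE_VALUE_NBITS bits become the bit position within the
+ * value; the remaining upper bits become the radix tree key.
+ */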
+ *off = tid_i & ((1 << TIDSTORE_VALUE_NBITS) - 1);
+ upper = tid_i >> TIDSTORE_VALUE_NBITS;
+ Assert(*off < (sizeof(uint64) * BITS_PER_BYTE));
+
+ return upper;
+}
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index d59711b7ec..40082a6db0 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -40,6 +40,7 @@
#include "access/heapam_xlog.h"
#include "access/htup_details.h"
#include "access/multixact.h"
+#include "access/tidstore.h"
#include "access/transam.h"
#include "access/visibilitymap.h"
#include "access/xact.h"
@@ -194,7 +195,7 @@ typedef struct LVRelState
* lazy_vacuum_heap_rel, which marks the same LP_DEAD line pointers as
* LP_UNUSED during second heap pass.
*/
- VacDeadItems *dead_items; /* TIDs whose index tuples we'll delete */
+ TIDStore *dead_items; /* TIDs whose index tuples we'll delete */
BlockNumber rel_pages; /* total number of pages */
BlockNumber scanned_pages; /* # pages examined (not skipped via VM) */
BlockNumber removed_pages; /* # pages removed by relation truncation */
@@ -265,8 +266,9 @@ static bool lazy_scan_noprune(LVRelState *vacrel, Buffer buf,
static void lazy_vacuum(LVRelState *vacrel);
static bool lazy_vacuum_all_indexes(LVRelState *vacrel);
static void lazy_vacuum_heap_rel(LVRelState *vacrel);
-static int lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
- Buffer buffer, int index, Buffer *vmbuffer);
+static void lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
+ OffsetNumber *offsets, int num_offsets,
+ Buffer buffer, Buffer *vmbuffer);
static bool lazy_check_wraparound_failsafe(LVRelState *vacrel);
static void lazy_cleanup_all_indexes(LVRelState *vacrel);
static IndexBulkDeleteResult *lazy_vacuum_one_index(Relation indrel,
@@ -853,21 +855,21 @@ lazy_scan_heap(LVRelState *vacrel)
next_unskippable_block,
next_failsafe_block = 0,
next_fsm_block_to_vacuum = 0;
- VacDeadItems *dead_items = vacrel->dead_items;
+ TIDStore *dead_items = vacrel->dead_items;
Buffer vmbuffer = InvalidBuffer;
bool next_unskippable_allvis,
skipping_current_range;
const int initprog_index[] = {
PROGRESS_VACUUM_PHASE,
PROGRESS_VACUUM_TOTAL_HEAP_BLKS,
- PROGRESS_VACUUM_MAX_DEAD_TUPLES
+ PROGRESS_VACUUM_MAX_DEAD_TUPLE_BYTES
};
int64 initprog_val[3];
/* Report that we're scanning the heap, advertising total # of blocks */
initprog_val[0] = PROGRESS_VACUUM_PHASE_SCAN_HEAP;
initprog_val[1] = rel_pages;
- initprog_val[2] = dead_items->max_items;
+ initprog_val[2] = tidstore_max_memory(vacrel->dead_items);
pgstat_progress_update_multi_param(3, initprog_index, initprog_val);
/* Set up an initial range of skippable blocks using the visibility map */
@@ -937,8 +939,8 @@ lazy_scan_heap(LVRelState *vacrel)
* dead_items TIDs, pause and do a cycle of vacuuming before we tackle
* this page.
*/
- Assert(dead_items->max_items >= MaxHeapTuplesPerPage);
- if (dead_items->max_items - dead_items->num_items < MaxHeapTuplesPerPage)
+ /* XXX: should not allow tidstore to grow beyond max_bytes */
+ if (tidstore_is_full(vacrel->dead_items))
{
/*
* Before beginning index vacuuming, we release any pin we may
@@ -1070,11 +1072,18 @@ lazy_scan_heap(LVRelState *vacrel)
if (prunestate.has_lpdead_items)
{
Size freespace;
+ TIDStoreIter *iter;
+ TIDStoreIterResult *result;
- lazy_vacuum_heap_page(vacrel, blkno, buf, 0, &vmbuffer);
+ iter = tidstore_begin_iterate(vacrel->dead_items);
+ result = tidstore_iterate_next(iter);
+ lazy_vacuum_heap_page(vacrel, blkno, result->offsets, result->num_offsets,
+ buf, &vmbuffer);
+ Assert(!tidstore_iterate_next(iter));
+ tidstore_end_iterate(iter);
/* Forget the LP_DEAD items that we just vacuumed */
- dead_items->num_items = 0;
+ tidstore_reset(dead_items);
/*
* Periodically perform FSM vacuuming to make newly-freed
@@ -1111,7 +1120,7 @@ lazy_scan_heap(LVRelState *vacrel)
* with prunestate-driven visibility map and FSM steps (just like
* the two-pass strategy).
*/
- Assert(dead_items->num_items == 0);
+ Assert(tidstore_num_tids(dead_items) == 0);
}
/*
@@ -1264,7 +1273,7 @@ lazy_scan_heap(LVRelState *vacrel)
* Do index vacuuming (call each index's ambulkdelete routine), then do
* related heap vacuuming
*/
- if (dead_items->num_items > 0)
+ if (tidstore_num_tids(dead_items) > 0)
lazy_vacuum(vacrel);
/*
@@ -1863,25 +1872,16 @@ retry:
*/
if (lpdead_items > 0)
{
- VacDeadItems *dead_items = vacrel->dead_items;
- ItemPointerData tmp;
+ TIDStore *dead_items = vacrel->dead_items;
Assert(!prunestate->all_visible);
Assert(prunestate->has_lpdead_items);
vacrel->lpdead_item_pages++;
+ tidstore_add_tids(dead_items, blkno, deadoffsets, lpdead_items);
- ItemPointerSetBlockNumber(&tmp, blkno);
-
- for (int i = 0; i < lpdead_items; i++)
- {
- ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
- dead_items->items[dead_items->num_items++] = tmp;
- }
-
- Assert(dead_items->num_items <= dead_items->max_items);
- pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
- dead_items->num_items);
+ pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
+ tidstore_memory_usage(dead_items));
}
/* Finally, add page-local counts to whole-VACUUM counts */
@@ -2088,8 +2088,7 @@ lazy_scan_noprune(LVRelState *vacrel,
}
else
{
- VacDeadItems *dead_items = vacrel->dead_items;
- ItemPointerData tmp;
+ TIDStore *dead_items = vacrel->dead_items;
/*
* Page has LP_DEAD items, and so any references/TIDs that remain in
@@ -2098,17 +2097,10 @@ lazy_scan_noprune(LVRelState *vacrel,
*/
vacrel->lpdead_item_pages++;
- ItemPointerSetBlockNumber(&tmp, blkno);
+ tidstore_add_tids(dead_items, blkno, deadoffsets, lpdead_items);
- for (int i = 0; i < lpdead_items; i++)
- {
- ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
- dead_items->items[dead_items->num_items++] = tmp;
- }
-
- Assert(dead_items->num_items <= dead_items->max_items);
- pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
- dead_items->num_items);
+ pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
+ tidstore_memory_usage(dead_items));
vacrel->lpdead_items += lpdead_items;
@@ -2157,7 +2149,7 @@ lazy_vacuum(LVRelState *vacrel)
if (!vacrel->do_index_vacuuming)
{
Assert(!vacrel->do_index_cleanup);
- vacrel->dead_items->num_items = 0;
+ tidstore_reset(vacrel->dead_items);
return;
}
@@ -2186,7 +2178,7 @@ lazy_vacuum(LVRelState *vacrel)
BlockNumber threshold;
Assert(vacrel->num_index_scans == 0);
- Assert(vacrel->lpdead_items == vacrel->dead_items->num_items);
+ Assert(vacrel->lpdead_items == tidstore_num_tids(vacrel->dead_items));
Assert(vacrel->do_index_vacuuming);
Assert(vacrel->do_index_cleanup);
@@ -2213,8 +2205,8 @@ lazy_vacuum(LVRelState *vacrel)
* cases then this may need to be reconsidered.
*/
threshold = (double) vacrel->rel_pages * BYPASS_THRESHOLD_PAGES;
- bypass = (vacrel->lpdead_item_pages < threshold &&
- vacrel->lpdead_items < MAXDEADITEMS(32L * 1024L * 1024L));
+ bypass = (vacrel->lpdead_item_pages < threshold) &&
+ tidstore_memory_usage(vacrel->dead_items) < (32L * 1024L * 1024L);
}
if (bypass)
@@ -2259,7 +2251,7 @@ lazy_vacuum(LVRelState *vacrel)
* Forget the LP_DEAD items that we just vacuumed (or just decided to not
* vacuum)
*/
- vacrel->dead_items->num_items = 0;
+ tidstore_reset(vacrel->dead_items);
}
/*
@@ -2331,7 +2323,7 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
* place).
*/
Assert(vacrel->num_index_scans > 0 ||
- vacrel->dead_items->num_items == vacrel->lpdead_items);
+ tidstore_num_tids(vacrel->dead_items) == vacrel->lpdead_items);
Assert(allindexes || vacrel->failsafe_active);
/*
@@ -2368,10 +2360,11 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
static void
lazy_vacuum_heap_rel(LVRelState *vacrel)
{
- int index;
BlockNumber vacuumed_pages;
Buffer vmbuffer = InvalidBuffer;
LVSavedErrInfo saved_err_info;
+ TIDStoreIter *iter;
+ TIDStoreIterResult *result;
Assert(vacrel->do_index_vacuuming);
Assert(vacrel->do_index_cleanup);
@@ -2388,8 +2381,8 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
vacuumed_pages = 0;
- index = 0;
- while (index < vacrel->dead_items->num_items)
+ iter = tidstore_begin_iterate(vacrel->dead_items);
+ while ((result = tidstore_iterate_next(iter)) != NULL)
{
BlockNumber tblk;
Buffer buf;
@@ -2398,12 +2391,13 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
vacuum_delay_point();
- tblk = ItemPointerGetBlockNumber(&vacrel->dead_items->items[index]);
+ tblk = result->blkno;
vacrel->blkno = tblk;
buf = ReadBufferExtended(vacrel->rel, MAIN_FORKNUM, tblk, RBM_NORMAL,
vacrel->bstrategy);
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
- index = lazy_vacuum_heap_page(vacrel, tblk, buf, index, &vmbuffer);
+ lazy_vacuum_heap_page(vacrel, tblk, result->offsets, result->num_offsets,
+ buf, &vmbuffer);
/* Now that we've vacuumed the page, record its available space */
page = BufferGetPage(buf);
@@ -2413,6 +2407,7 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
RecordPageWithFreeSpace(vacrel->rel, tblk, freespace);
vacuumed_pages++;
}
+ tidstore_end_iterate(iter);
/* Clear the block number information */
vacrel->blkno = InvalidBlockNumber;
@@ -2427,14 +2422,13 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
* We set all LP_DEAD items from the first heap pass to LP_UNUSED during
* the second heap pass. No more, no less.
*/
- Assert(index > 0);
Assert(vacrel->num_index_scans > 1 ||
- (index == vacrel->lpdead_items &&
+ (tidstore_num_tids(vacrel->dead_items) == vacrel->lpdead_items &&
vacuumed_pages == vacrel->lpdead_item_pages));
ereport(DEBUG2,
- (errmsg("table \"%s\": removed %lld dead item identifiers in %u pages",
- vacrel->relname, (long long) index, vacuumed_pages)));
+ (errmsg("table \"%s\": removed " UINT64_FORMAT " dead item identifiers in %u pages",
+ vacrel->relname, tidstore_num_tids(vacrel->dead_items), vacuumed_pages)));
/* Revert to the previous phase information for error traceback */
restore_vacuum_error_info(vacrel, &saved_err_info);
@@ -2451,11 +2445,10 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
* LP_DEAD item on the page. The return value is the first index immediately
* after all LP_DEAD items for the same page in the array.
*/
-static int
-lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
- int index, Buffer *vmbuffer)
+static void
+lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets, Buffer buffer, Buffer *vmbuffer)
{
- VacDeadItems *dead_items = vacrel->dead_items;
Page page = BufferGetPage(buffer);
OffsetNumber unused[MaxHeapTuplesPerPage];
int uncnt = 0;
@@ -2474,16 +2467,11 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
START_CRIT_SECTION();
- for (; index < dead_items->num_items; index++)
+ for (int i = 0; i < num_offsets; i++)
{
- BlockNumber tblk;
- OffsetNumber toff;
ItemId itemid;
+ OffsetNumber toff = offsets[i];
- tblk = ItemPointerGetBlockNumber(&dead_items->items[index]);
- if (tblk != blkno)
- break; /* past end of tuples for this block */
- toff = ItemPointerGetOffsetNumber(&dead_items->items[index]);
itemid = PageGetItemId(page, toff);
Assert(ItemIdIsDead(itemid) && !ItemIdHasStorage(itemid));
@@ -2563,7 +2551,6 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
/* Revert to the previous phase information for error traceback */
restore_vacuum_error_info(vacrel, &saved_err_info);
- return index;
}
/*
@@ -3065,46 +3052,6 @@ count_nondeletable_pages(LVRelState *vacrel, bool *lock_waiter_detected)
return vacrel->nonempty_pages;
}
-/*
- * Returns the number of dead TIDs that VACUUM should allocate space to
- * store, given a heap rel of size vacrel->rel_pages, and given current
- * maintenance_work_mem setting (or current autovacuum_work_mem setting,
- * when applicable).
- *
- * See the comments at the head of this file for rationale.
- */
-static int
-dead_items_max_items(LVRelState *vacrel)
-{
- int64 max_items;
- int vac_work_mem = IsAutoVacuumWorkerProcess() &&
- autovacuum_work_mem != -1 ?
- autovacuum_work_mem : maintenance_work_mem;
-
- if (vacrel->nindexes > 0)
- {
- BlockNumber rel_pages = vacrel->rel_pages;
-
- max_items = MAXDEADITEMS(vac_work_mem * 1024L);
- max_items = Min(max_items, INT_MAX);
- max_items = Min(max_items, MAXDEADITEMS(MaxAllocSize));
-
- /* curious coding here to ensure the multiplication can't overflow */
- if ((BlockNumber) (max_items / MaxHeapTuplesPerPage) > rel_pages)
- max_items = rel_pages * MaxHeapTuplesPerPage;
-
- /* stay sane if small maintenance_work_mem */
- max_items = Max(max_items, MaxHeapTuplesPerPage);
- }
- else
- {
- /* One-pass case only stores a single heap page's TIDs at a time */
- max_items = MaxHeapTuplesPerPage;
- }
-
- return (int) max_items;
-}
-
/*
* Allocate dead_items (either using palloc, or in dynamic shared memory).
* Sets dead_items in vacrel for caller.
@@ -3115,11 +3062,9 @@ dead_items_max_items(LVRelState *vacrel)
static void
dead_items_alloc(LVRelState *vacrel, int nworkers)
{
- VacDeadItems *dead_items;
- int max_items;
-
- max_items = dead_items_max_items(vacrel);
- Assert(max_items >= MaxHeapTuplesPerPage);
+ int vac_work_mem = IsAutoVacuumWorkerProcess() &&
+ autovacuum_work_mem != -1 ?
+ autovacuum_work_mem * 1024L : maintenance_work_mem * 1024L;
/*
* Initialize state for a parallel vacuum. As of now, only one worker can
@@ -3146,7 +3091,7 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
else
vacrel->pvs = parallel_vacuum_init(vacrel->rel, vacrel->indrels,
vacrel->nindexes, nworkers,
- max_items,
+ vac_work_mem,
vacrel->verbose ? INFO : DEBUG2,
vacrel->bstrategy);
@@ -3159,11 +3104,7 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
}
/* Serial VACUUM case */
- dead_items = (VacDeadItems *) palloc(vac_max_items_to_alloc_size(max_items));
- dead_items->max_items = max_items;
- dead_items->num_items = 0;
-
- vacrel->dead_items = dead_items;
+ vacrel->dead_items = tidstore_create(vac_work_mem, NULL);
}
/*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 2d8104b090..bc42144f08 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1165,7 +1165,7 @@ CREATE VIEW pg_stat_progress_vacuum AS
END AS phase,
S.param2 AS heap_blks_total, S.param3 AS heap_blks_scanned,
S.param4 AS heap_blks_vacuumed, S.param5 AS index_vacuum_count,
- S.param6 AS max_dead_tuples, S.param7 AS num_dead_tuples
+ S.param6 AS max_dead_tuple_bytes, S.param7 AS dead_tuple_bytes
FROM pg_stat_get_progress_info('VACUUM') AS S
LEFT JOIN pg_database D ON S.datid = D.oid;
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 293b84bbca..7f5776fbf8 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -95,7 +95,6 @@ static bool vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params);
static double compute_parallel_delay(void);
static VacOptValue get_vacoptval_from_boolean(DefElem *def);
static bool vac_tid_reaped(ItemPointer itemptr, void *state);
-static int vac_cmp_itemptr(const void *left, const void *right);
/*
* Primary entry point for manual VACUUM and ANALYZE commands
@@ -2276,16 +2275,16 @@ get_vacoptval_from_boolean(DefElem *def)
*/
IndexBulkDeleteResult *
vac_bulkdel_one_index(IndexVacuumInfo *ivinfo, IndexBulkDeleteResult *istat,
- VacDeadItems *dead_items)
+ TIDStore *dead_items)
{
/* Do bulk deletion */
istat = index_bulk_delete(ivinfo, istat, vac_tid_reaped,
(void *) dead_items);
ereport(ivinfo->message_level,
- (errmsg("scanned index \"%s\" to remove %d row versions",
+ (errmsg("scanned index \"%s\" to remove " UINT64_FORMAT " row versions",
RelationGetRelationName(ivinfo->index),
- dead_items->num_items)));
+ tidstore_num_tids(dead_items))));
return istat;
}
@@ -2316,18 +2315,6 @@ vac_cleanup_one_index(IndexVacuumInfo *ivinfo, IndexBulkDeleteResult *istat)
return istat;
}
-/*
- * Returns the total required space for VACUUM's dead_items array given a
- * max_items value.
- */
-Size
-vac_max_items_to_alloc_size(int max_items)
-{
- Assert(max_items <= MAXDEADITEMS(MaxAllocSize));
-
- return offsetof(VacDeadItems, items) + sizeof(ItemPointerData) * max_items;
-}
-
/*
* vac_tid_reaped() -- is a particular tid deletable?
*
@@ -2338,60 +2325,7 @@ vac_max_items_to_alloc_size(int max_items)
static bool
vac_tid_reaped(ItemPointer itemptr, void *state)
{
- VacDeadItems *dead_items = (VacDeadItems *) state;
- int64 litem,
- ritem,
- item;
- ItemPointer res;
-
- litem = itemptr_encode(&dead_items->items[0]);
- ritem = itemptr_encode(&dead_items->items[dead_items->num_items - 1]);
- item = itemptr_encode(itemptr);
-
- /*
- * Doing a simple bound check before bsearch() is useful to avoid the
- * extra cost of bsearch(), especially if dead items on the heap are
- * concentrated in a certain range. Since this function is called for
- * every index tuple, it pays to be really fast.
- */
- if (item < litem || item > ritem)
- return false;
-
- res = (ItemPointer) bsearch((void *) itemptr,
- (void *) dead_items->items,
- dead_items->num_items,
- sizeof(ItemPointerData),
- vac_cmp_itemptr);
-
- return (res != NULL);
-}
-
-/*
- * Comparator routines for use with qsort() and bsearch().
- */
-static int
-vac_cmp_itemptr(const void *left, const void *right)
-{
- BlockNumber lblk,
- rblk;
- OffsetNumber loff,
- roff;
-
- lblk = ItemPointerGetBlockNumber((ItemPointer) left);
- rblk = ItemPointerGetBlockNumber((ItemPointer) right);
-
- if (lblk < rblk)
- return -1;
- if (lblk > rblk)
- return 1;
-
- loff = ItemPointerGetOffsetNumber((ItemPointer) left);
- roff = ItemPointerGetOffsetNumber((ItemPointer) right);
-
- if (loff < roff)
- return -1;
- if (loff > roff)
- return 1;
+ TIDStore *dead_items = (TIDStore *) state;
- return 0;
+ return tidstore_lookup_tid(dead_items, itemptr);
}
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index f26d796e52..429607d5fa 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -44,7 +44,7 @@
* use small integers.
*/
#define PARALLEL_VACUUM_KEY_SHARED 1
-#define PARALLEL_VACUUM_KEY_DEAD_ITEMS 2
+#define PARALLEL_VACUUM_KEY_DSA 2
#define PARALLEL_VACUUM_KEY_QUERY_TEXT 3
#define PARALLEL_VACUUM_KEY_BUFFER_USAGE 4
#define PARALLEL_VACUUM_KEY_WAL_USAGE 5
@@ -103,6 +103,9 @@ typedef struct PVShared
/* Counter for vacuuming and cleanup */
pg_atomic_uint32 idx;
+
+ /* Handle of the shared TIDStore */
+ tidstore_handle dead_items_handle;
} PVShared;
/* Status used during parallel index vacuum or cleanup */
@@ -166,7 +169,8 @@ struct ParallelVacuumState
PVIndStats *indstats;
/* Shared dead items space among parallel vacuum workers */
- VacDeadItems *dead_items;
+ TIDStore *dead_items;
+ dsa_area *dead_items_area;
/* Points to buffer usage area in DSM */
BufferUsage *buffer_usage;
@@ -222,20 +226,23 @@ static void parallel_vacuum_error_callback(void *arg);
*/
ParallelVacuumState *
parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
- int nrequested_workers, int max_items,
- int elevel, BufferAccessStrategy bstrategy)
+ int nrequested_workers, int vac_work_mem,
+ int elevel,
+ BufferAccessStrategy bstrategy)
{
ParallelVacuumState *pvs;
ParallelContext *pcxt;
PVShared *shared;
- VacDeadItems *dead_items;
+ TIDStore *dead_items;
PVIndStats *indstats;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
+ void *area_space;
+ dsa_area *dead_items_dsa;
bool *will_parallel_vacuum;
Size est_indstats_len;
Size est_shared_len;
- Size est_dead_items_len;
+ Size dsa_minsize = dsa_minimum_size();
int nindexes_mwm = 0;
int parallel_workers = 0;
int querylen;
@@ -283,9 +290,8 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_estimate_chunk(&pcxt->estimator, est_shared_len);
shm_toc_estimate_keys(&pcxt->estimator, 1);
- /* Estimate size for dead_items -- PARALLEL_VACUUM_KEY_DEAD_ITEMS */
- est_dead_items_len = vac_max_items_to_alloc_size(max_items);
- shm_toc_estimate_chunk(&pcxt->estimator, est_dead_items_len);
+ /* Estimate size for dead tuple DSA -- PARALLEL_VACUUM_KEY_DSA */
+ shm_toc_estimate_chunk(&pcxt->estimator, dsa_minsize);
shm_toc_estimate_keys(&pcxt->estimator, 1);
/*
@@ -351,6 +357,16 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_INDEX_STATS, indstats);
pvs->indstats = indstats;
+ /* Prepare DSA space for dead items */
+ area_space = shm_toc_allocate(pcxt->toc, dsa_minsize);
+ shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_DSA, area_space);
+ dead_items_dsa = dsa_create_in_place(area_space, dsa_minsize,
+ LWTRANCHE_PARALLEL_VACUUM_DSA,
+ pcxt->seg);
+ dead_items = tidstore_create(vac_work_mem, dead_items_dsa);
+ pvs->dead_items = dead_items;
+ pvs->dead_items_area = dead_items_dsa;
+
/* Prepare shared information */
shared = (PVShared *) shm_toc_allocate(pcxt->toc, est_shared_len);
MemSet(shared, 0, est_shared_len);
@@ -360,6 +376,7 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
(nindexes_mwm > 0) ?
maintenance_work_mem / Min(parallel_workers, nindexes_mwm) :
maintenance_work_mem;
+ shared->dead_items_handle = tidstore_get_handle(dead_items);
pg_atomic_init_u32(&(shared->cost_balance), 0);
pg_atomic_init_u32(&(shared->active_nworkers), 0);
@@ -368,15 +385,6 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_SHARED, shared);
pvs->shared = shared;
- /* Prepare the dead_items space */
- dead_items = (VacDeadItems *) shm_toc_allocate(pcxt->toc,
- est_dead_items_len);
- dead_items->max_items = max_items;
- dead_items->num_items = 0;
- MemSet(dead_items->items, 0, sizeof(ItemPointerData) * max_items);
- shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_DEAD_ITEMS, dead_items);
- pvs->dead_items = dead_items;
-
/*
* Allocate space for each worker's BufferUsage and WalUsage; no need to
* initialize
@@ -434,6 +442,9 @@ parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats)
istats[i] = NULL;
}
+ tidstore_free(pvs->dead_items);
+ dsa_detach(pvs->dead_items_area);
+
DestroyParallelContext(pvs->pcxt);
ExitParallelMode();
@@ -442,7 +453,7 @@ parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats)
}
/* Returns the dead items space */
-VacDeadItems *
+TIDStore *
parallel_vacuum_get_dead_items(ParallelVacuumState *pvs)
{
return pvs->dead_items;
@@ -940,7 +951,9 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
Relation *indrels;
PVIndStats *indstats;
PVShared *shared;
- VacDeadItems *dead_items;
+ TIDStore *dead_items;
+ void *area_space;
+ dsa_area *dead_items_area;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
int nindexes;
@@ -984,10 +997,10 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
PARALLEL_VACUUM_KEY_INDEX_STATS,
false);
- /* Set dead_items space */
- dead_items = (VacDeadItems *) shm_toc_lookup(toc,
- PARALLEL_VACUUM_KEY_DEAD_ITEMS,
- false);
+ /* Set dead items */
+ area_space = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_DSA, false);
+ dead_items_area = dsa_attach_in_place(area_space, seg);
+ dead_items = tidstore_attach(dead_items_area, shared->dead_items_handle);
/* Set cost-based vacuum delay */
VacuumCostActive = (VacuumCostDelay > 0);
@@ -1033,6 +1046,9 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
&wal_usage[ParallelWorkerNumber]);
+ tidstore_detach(pvs.dead_items);
+ dsa_detach(dead_items_area);
+
/* Pop the error context stack */
error_context_stack = errcallback.previous;
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 528b2e9643..ea8cf6283b 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -186,6 +186,8 @@ static const char *const BuiltinTrancheNames[] = {
"PgStatsHash",
/* LWTRANCHE_PGSTATS_DATA: */
"PgStatsData",
+ /* LWTRANCHE_PARALLEL_VACUUM_DSA: */
+ "ParallelVacuumDSA",
};
StaticAssertDecl(lengthof(BuiltinTrancheNames) ==
diff --git a/src/include/access/tidstore.h b/src/include/access/tidstore.h
new file mode 100644
index 0000000000..3afc7612ae
--- /dev/null
+++ b/src/include/access/tidstore.h
@@ -0,0 +1,50 @@
+/*-------------------------------------------------------------------------
+ *
+ * tidstore.h
+ * TID storage.
+ *
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/tidstore.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef TIDSTORE_H
+#define TIDSTORE_H
+
+#include "lib/radixtree.h"
+#include "storage/itemptr.h"
+
+typedef dsa_pointer tidstore_handle;
+
+typedef struct TIDStore TIDStore;
+typedef struct TIDStoreIter TIDStoreIter;
+
+typedef struct TIDStoreIterResult
+{
+ BlockNumber blkno;
+ OffsetNumber offsets[MaxOffsetNumber]; /* XXX: usually only partially used */
+ int num_offsets;
+} TIDStoreIterResult;
+
+extern TIDStore *tidstore_create(uint64 max_bytes, dsa_area *dsa);
+extern TIDStore *tidstore_attach(dsa_area *dsa, dsa_pointer handle);
+extern void tidstore_detach(TIDStore *ts);
+extern void tidstore_free(TIDStore *ts);
+extern void tidstore_reset(TIDStore *ts);
+extern void tidstore_add_tids(TIDStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets);
+extern bool tidstore_lookup_tid(TIDStore *ts, ItemPointer tid);
+extern TIDStoreIter *tidstore_begin_iterate(TIDStore *ts);
+extern TIDStoreIterResult *tidstore_iterate_next(TIDStoreIter *iter);
+extern void tidstore_end_iterate(TIDStoreIter *iter);
+extern uint64 tidstore_num_tids(TIDStore *ts);
+extern bool tidstore_is_full(TIDStore *ts);
+extern uint64 tidstore_max_memory(TIDStore *ts);
+extern uint64 tidstore_memory_usage(TIDStore *ts);
+extern tidstore_handle tidstore_get_handle(TIDStore *ts);
+
+#endif /* TIDSTORE_H */
+
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index a28938caf4..75d540d315 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -23,8 +23,8 @@
#define PROGRESS_VACUUM_HEAP_BLKS_SCANNED 2
#define PROGRESS_VACUUM_HEAP_BLKS_VACUUMED 3
#define PROGRESS_VACUUM_NUM_INDEX_VACUUMS 4
-#define PROGRESS_VACUUM_MAX_DEAD_TUPLES 5
-#define PROGRESS_VACUUM_NUM_DEAD_TUPLES 6
+#define PROGRESS_VACUUM_MAX_DEAD_TUPLE_BYTES 5
+#define PROGRESS_VACUUM_DEAD_TUPLE_BYTES 6
/* Phases of vacuum (as advertised via PROGRESS_VACUUM_PHASE) */
#define PROGRESS_VACUUM_PHASE_SCAN_HEAP 1
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 4e4bc26a8b..afe61c21fd 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -17,6 +17,7 @@
#include "access/htup.h"
#include "access/genam.h"
#include "access/parallel.h"
+#include "access/tidstore.h"
#include "catalog/pg_class.h"
#include "catalog/pg_statistic.h"
#include "catalog/pg_type.h"
@@ -235,21 +236,6 @@ typedef struct VacuumParams
int nworkers;
} VacuumParams;
-/*
- * VacDeadItems stores TIDs whose index tuples are deleted by index vacuuming.
- */
-typedef struct VacDeadItems
-{
- int max_items; /* # slots allocated in array */
- int num_items; /* current # of entries */
-
- /* Sorted array of TIDs to delete from indexes */
- ItemPointerData items[FLEXIBLE_ARRAY_MEMBER];
-} VacDeadItems;
-
-#define MAXDEADITEMS(avail_mem) \
- (((avail_mem) - offsetof(VacDeadItems, items)) / sizeof(ItemPointerData))
-
/* GUC parameters */
extern PGDLLIMPORT int default_statistics_target; /* PGDLLIMPORT for PostGIS */
extern PGDLLIMPORT int vacuum_freeze_min_age;
@@ -302,18 +288,17 @@ extern Relation vacuum_open_relation(Oid relid, RangeVar *relation,
LOCKMODE lmode);
extern IndexBulkDeleteResult *vac_bulkdel_one_index(IndexVacuumInfo *ivinfo,
IndexBulkDeleteResult *istat,
- VacDeadItems *dead_items);
+ TIDStore *dead_items);
extern IndexBulkDeleteResult *vac_cleanup_one_index(IndexVacuumInfo *ivinfo,
IndexBulkDeleteResult *istat);
-extern Size vac_max_items_to_alloc_size(int max_items);
/* in commands/vacuumparallel.c */
extern ParallelVacuumState *parallel_vacuum_init(Relation rel, Relation *indrels,
int nindexes, int nrequested_workers,
- int max_items, int elevel,
- BufferAccessStrategy bstrategy);
+ int vac_work_mem,
+ int elevel, BufferAccessStrategy bstrategy);
extern void parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats);
-extern VacDeadItems *parallel_vacuum_get_dead_items(ParallelVacuumState *pvs);
+extern TIDStore *parallel_vacuum_get_dead_items(ParallelVacuumState *pvs);
extern void parallel_vacuum_bulkdel_all_indexes(ParallelVacuumState *pvs,
long num_table_tuples,
int num_index_scans);
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index dd818e16ab..f1e0bcede5 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -204,6 +204,7 @@ typedef enum BuiltinTrancheIds
LWTRANCHE_PGSTATS_DSA,
LWTRANCHE_PGSTATS_HASH,
LWTRANCHE_PGSTATS_DATA,
+ LWTRANCHE_PARALLEL_VACUUM_DSA,
LWTRANCHE_FIRST_USER_DEFINED
} BuiltinTrancheIds;
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index fb9f936d43..0c49354f04 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2020,8 +2020,8 @@ pg_stat_progress_vacuum| SELECT s.pid,
s.param3 AS heap_blks_scanned,
s.param4 AS heap_blks_vacuumed,
s.param5 AS index_vacuum_count,
- s.param6 AS max_dead_tuples,
- s.param7 AS num_dead_tuples
+ s.param6 AS max_dead_tuple_bytes,
+ s.param7 AS dead_tuple_bytes
FROM (pg_stat_get_progress_info('VACUUM'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
LEFT JOIN pg_database d ON ((s.datid = d.oid)));
pg_stat_recovery_prefetch| SELECT s.stats_reset,
--
2.31.1
Attachment: v15-0006-Use-rt_node_ptr-to-reference-radix-tree-nodes.patch (application/octet-stream)
From 7e5fd8a19adb0305f77618231364eacaa2e0a59a Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 14 Nov 2022 11:44:17 +0900
Subject: [PATCH v15 6/9] Use rt_node_ptr to reference radix tree nodes.
---
src/backend/lib/radixtree.c | 688 +++++++++++++++++++++---------------
1 file changed, 398 insertions(+), 290 deletions(-)
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
index abd0450727..bff37a2c35 100644
--- a/src/backend/lib/radixtree.c
+++ b/src/backend/lib/radixtree.c
@@ -150,6 +150,19 @@ typedef enum rt_size_class
#define RT_SIZE_CLASS_COUNT (RT_CLASS_256 + 1)
} rt_size_class;
+/*
+ * rt_pointer is a pointer representation that works both for nodes in backend
+ * local memory and for nodes in a DSA area (i.e. dsa_pointer). Since radix
+ * tree nodes can be allocated in backend local memory as well as in a DSA
+ * area, inner nodes cannot store plain C pointers to rt_node (i.e. backend
+ * local memory addresses) as child pointers; they store rt_pointer values
+ * instead. The backend local memory address of a node can be obtained from
+ * an rt_pointer with rt_pointer_decode().
+ */
+typedef uintptr_t rt_pointer;
+#define InvalidRTPointer ((rt_pointer) 0)
+#define RTPointerIsValid(x) (((rt_pointer) (x)) != InvalidRTPointer)
+
/* Common type for all nodes types */
typedef struct rt_node
{
@@ -175,8 +188,7 @@ typedef struct rt_node
/* Node kind, one per search/set algorithm */
uint8 kind;
} rt_node;
-#define NODE_IS_LEAF(n) (((rt_node *) (n))->shift == 0)
-#define NODE_IS_EMPTY(n) (((rt_node *) (n))->count == 0)
+#define RT_NODE_IS_LEAF(n) (((rt_node *) (n))->shift == 0)
#define VAR_NODE_HAS_FREE_SLOT(node) \
((node)->base.n.count < (node)->base.n.fanout)
#define FIXED_NODE_HAS_FREE_SLOT(node, class) \
@@ -240,7 +252,7 @@ typedef struct rt_node_inner_4
rt_node_base_4 base;
/* number of children depends on size class */
- rt_node *children[FLEXIBLE_ARRAY_MEMBER];
+ rt_pointer children[FLEXIBLE_ARRAY_MEMBER];
} rt_node_inner_4;
typedef struct rt_node_leaf_4
@@ -256,7 +268,7 @@ typedef struct rt_node_inner_32
rt_node_base_32 base;
/* number of children depends on size class */
- rt_node *children[FLEXIBLE_ARRAY_MEMBER];
+ rt_pointer children[FLEXIBLE_ARRAY_MEMBER];
} rt_node_inner_32;
typedef struct rt_node_leaf_32
@@ -272,7 +284,7 @@ typedef struct rt_node_inner_125
rt_node_base_125 base;
/* number of children depends on size class */
- rt_node *children[FLEXIBLE_ARRAY_MEMBER];
+ rt_pointer children[FLEXIBLE_ARRAY_MEMBER];
} rt_node_inner_125;
typedef struct rt_node_leaf_125
@@ -292,7 +304,7 @@ typedef struct rt_node_inner_256
rt_node_base_256 base;
/* Slots for 256 children */
- rt_node *children[RT_NODE_MAX_SLOTS];
+ rt_pointer children[RT_NODE_MAX_SLOTS];
} rt_node_inner_256;
typedef struct rt_node_leaf_256
@@ -306,6 +318,29 @@ typedef struct rt_node_leaf_256
uint64 values[RT_NODE_MAX_SLOTS];
} rt_node_leaf_256;
+/* rt_node_ptr is a data structure representing a pointer to an rt_node */
+typedef struct rt_node_ptr
+{
+ rt_pointer encoded;
+ rt_node *decoded;
+} rt_node_ptr;
+#define InvalidRTNodePtr \
+ (rt_node_ptr) {.encoded = InvalidRTPointer, .decoded = NULL}
+#define RTNodePtrIsValid(n) \
+ (!rt_node_ptr_eq((rt_node_ptr *) &(n), &(InvalidRTNodePtr)))
+
+/* Macros for rt_node_ptr to access the fields of rt_node */
+#define NODE_RAW(n) (n.decoded)
+#define NODE_IS_LEAF(n) (NODE_RAW(n)->shift == 0)
+#define NODE_IS_EMPTY(n) (NODE_COUNT(n) == 0)
+#define NODE_KIND(n) (NODE_RAW(n)->kind)
+#define NODE_COUNT(n) (NODE_RAW(n)->count)
+#define NODE_SHIFT(n) (NODE_RAW(n)->shift)
+#define NODE_CHUNK(n) (NODE_RAW(n)->chunk)
+#define NODE_FANOUT(n) (NODE_RAW(n)->fanout)
+#define NODE_HAS_FREE_SLOT(n) \
+ (NODE_COUNT(n) < rt_node_kind_info[NODE_KIND(n)].fanout)
+
/* Information for each size class */
typedef struct rt_size_class_elem
{
@@ -394,7 +429,7 @@ static rt_size_class kind_min_size_class[RT_NODE_KIND_COUNT] = {
*/
typedef struct rt_node_iter
{
- rt_node *node; /* current node being iterated */
+ rt_node_ptr node; /* current node being iterated */
int current_idx; /* current position. -1 for initial value */
} rt_node_iter;
@@ -415,7 +450,7 @@ struct radix_tree
{
MemoryContext context;
- rt_node *root;
+ rt_pointer root;
uint64 max_val;
uint64 num_keys;
@@ -429,27 +464,58 @@ struct radix_tree
};
static void rt_new_root(radix_tree *tree, uint64 key);
-static rt_node *rt_alloc_node(radix_tree *tree, rt_size_class size_class, bool inner);
-static inline void rt_init_node(rt_node *node, uint8 kind, rt_size_class size_class,
+
+static rt_node_ptr rt_alloc_node(radix_tree *tree, rt_size_class size_class, bool inner);
+static inline void rt_init_node(rt_node_ptr node, uint8 kind, rt_size_class size_class,
bool inner);
-static void rt_free_node(radix_tree *tree, rt_node *node);
+static void rt_free_node(radix_tree *tree, rt_node_ptr node);
static void rt_extend(radix_tree *tree, uint64 key);
-static inline bool rt_node_search_inner(rt_node *node, uint64 key, rt_action action,
- rt_node **child_p);
-static inline bool rt_node_search_leaf(rt_node *node, uint64 key, rt_action action,
+static inline bool rt_node_search_inner(rt_node_ptr node_ptr, uint64 key, rt_action action,
+ rt_pointer *child_p);
+static inline bool rt_node_search_leaf(rt_node_ptr node_ptr, uint64 key, rt_action action,
uint64 *value_p);
-static bool rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node,
- uint64 key, rt_node *child);
-static bool rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
+static bool rt_node_insert_inner(radix_tree *tree, rt_node_ptr parent, rt_node_ptr node,
+ uint64 key, rt_node_ptr child);
+static bool rt_node_insert_leaf(radix_tree *tree, rt_node_ptr parent, rt_node_ptr node,
uint64 key, uint64 value);
-static inline rt_node *rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter);
+static inline bool rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
+ rt_node_ptr *child_p);
static inline bool rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
uint64 *value_p);
-static void rt_update_iter_stack(rt_iter *iter, rt_node *from_node, int from);
+static void rt_update_iter_stack(rt_iter *iter, rt_node_ptr from_node, int from);
static inline void rt_iter_update_key(rt_iter *iter, uint8 chunk, uint8 shift);
/* verification (available only with assertion) */
-static void rt_verify_node(rt_node *node);
+static void rt_verify_node(rt_node_ptr node);
+
+/* Decode and encode functions of rt_pointer */
+static inline rt_node *
+rt_pointer_decode(rt_pointer encoded)
+{
+ return (rt_node *) encoded;
+}
+
+static inline rt_pointer
+rt_pointer_encode(rt_node *decoded)
+{
+ return (rt_pointer) decoded;
+}
+
+/* Return a rt_node_ptr created from the given encoded pointer */
+static inline rt_node_ptr
+rt_node_ptr_encoded(rt_pointer encoded)
+{
+ return (rt_node_ptr) {
+ .encoded = encoded,
+ .decoded = rt_pointer_decode(encoded),
+ };
+}
+
+static inline bool
+rt_node_ptr_eq(rt_node_ptr *a, rt_node_ptr *b)
+{
+ return (a->decoded == b->decoded) && (a->encoded == b->encoded);
+}
/*
* Return index of the first element in 'base' that equals 'key'. Return -1
@@ -598,10 +664,10 @@ node_32_get_insertpos(rt_node_base_32 *node, uint8 chunk)
/* Shift the elements right at 'idx' by one */
static inline void
-chunk_children_array_shift(uint8 *chunks, rt_node **children, int count, int idx)
+chunk_children_array_shift(uint8 *chunks, rt_pointer *children, int count, int idx)
{
memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
- memmove(&(children[idx + 1]), &(children[idx]), sizeof(rt_node *) * (count - idx));
+ memmove(&(children[idx + 1]), &(children[idx]), sizeof(rt_pointer) * (count - idx));
}
static inline void
@@ -613,10 +679,10 @@ chunk_values_array_shift(uint8 *chunks, uint64 *values, int count, int idx)
/* Delete the element at 'idx' */
static inline void
-chunk_children_array_delete(uint8 *chunks, rt_node **children, int count, int idx)
+chunk_children_array_delete(uint8 *chunks, rt_pointer *children, int count, int idx)
{
memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
- memmove(&(children[idx]), &(children[idx + 1]), sizeof(rt_node *) * (count - idx - 1));
+ memmove(&(children[idx]), &(children[idx + 1]), sizeof(rt_pointer) * (count - idx - 1));
}
static inline void
@@ -628,12 +694,12 @@ chunk_values_array_delete(uint8 *chunks, uint64 *values, int count, int idx)
/* Copy both chunks and children/values arrays */
static inline void
-chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
- uint8 *dst_chunks, rt_node **dst_children)
+chunk_children_array_copy(uint8 *src_chunks, rt_pointer *src_children,
+ uint8 *dst_chunks, rt_pointer *dst_children)
{
const int fanout = rt_size_class_info[RT_CLASS_4_FULL].fanout;
const Size chunk_size = sizeof(uint8) * fanout;
- const Size children_size = sizeof(rt_node *) * fanout;
+ const Size children_size = sizeof(rt_pointer) * fanout;
memcpy(dst_chunks, src_chunks, chunk_size);
memcpy(dst_children, src_children, children_size);
@@ -665,7 +731,7 @@ node_125_is_chunk_used(rt_node_base_125 *node, uint8 chunk)
static inline bool
node_inner_125_is_slot_used(rt_node_inner_125 *node, uint8 slot)
{
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
Assert(slot < node->base.n.fanout);
return (node->base.isset[WORDNUM(slot)] & ((bitmapword) 1 << BITNUM(slot))) != 0;
}
@@ -673,23 +739,23 @@ node_inner_125_is_slot_used(rt_node_inner_125 *node, uint8 slot)
static inline bool
node_leaf_125_is_slot_used(rt_node_leaf_125 *node, uint8 slot)
{
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
Assert(slot < node->base.n.fanout);
return (node->base.isset[WORDNUM(slot)] & ((bitmapword) 1 << BITNUM(slot))) != 0;
}
#endif
-static inline rt_node *
+static inline rt_pointer
node_inner_125_get_child(rt_node_inner_125 *node, uint8 chunk)
{
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
return node->children[node->base.slot_idxs[chunk]];
}
static inline uint64
node_leaf_125_get_value(rt_node_leaf_125 *node, uint8 chunk)
{
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
Assert(((rt_node_base_125 *) node)->slot_idxs[chunk] != RT_NODE_125_INVALID_IDX);
return node->values[node->base.slot_idxs[chunk]];
}
@@ -699,9 +765,9 @@ node_inner_125_delete(rt_node_inner_125 *node, uint8 chunk)
{
int slotpos = node->base.slot_idxs[chunk];
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
node->base.isset[WORDNUM(slotpos)] &= ~((bitmapword) 1 << BITNUM(slotpos));
- node->children[node->base.slot_idxs[chunk]] = NULL;
+ node->children[node->base.slot_idxs[chunk]] = InvalidRTPointer;
node->base.slot_idxs[chunk] = RT_NODE_125_INVALID_IDX;
}
@@ -710,7 +776,7 @@ node_leaf_125_delete(rt_node_leaf_125 *node, uint8 chunk)
{
int slotpos = node->base.slot_idxs[chunk];
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
node->base.isset[WORDNUM(slotpos)] &= ~((bitmapword) 1 << BITNUM(slotpos));
node->base.slot_idxs[chunk] = RT_NODE_125_INVALID_IDX;
}
@@ -742,11 +808,11 @@ node_125_find_unused_slot(bitmapword *isset)
}
static inline void
-node_inner_125_insert(rt_node_inner_125 *node, uint8 chunk, rt_node *child)
+node_inner_125_insert(rt_node_inner_125 *node, uint8 chunk, rt_pointer child)
{
int slotpos;
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
slotpos = node_125_find_unused_slot(node->base.isset);
Assert(slotpos < node->base.n.fanout);
@@ -761,7 +827,7 @@ node_leaf_125_insert(rt_node_leaf_125 *node, uint8 chunk, uint64 value)
{
int slotpos;
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
slotpos = node_125_find_unused_slot(node->base.isset);
Assert(slotpos < node->base.n.fanout);
@@ -772,16 +838,16 @@ node_leaf_125_insert(rt_node_leaf_125 *node, uint8 chunk, uint64 value)
/* Update the child corresponding to 'chunk' to 'child' */
static inline void
-node_inner_125_update(rt_node_inner_125 *node, uint8 chunk, rt_node *child)
+node_inner_125_update(rt_node_inner_125 *node, uint8 chunk, rt_pointer child)
{
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
node->children[node->base.slot_idxs[chunk]] = child;
}
static inline void
node_leaf_125_update(rt_node_leaf_125 *node, uint8 chunk, uint64 value)
{
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
node->values[node->base.slot_idxs[chunk]] = value;
}
@@ -791,21 +857,21 @@ node_leaf_125_update(rt_node_leaf_125 *node, uint8 chunk, uint64 value)
static inline bool
node_inner_256_is_chunk_used(rt_node_inner_256 *node, uint8 chunk)
{
- Assert(!NODE_IS_LEAF(node));
- return (node->children[chunk] != NULL);
+ Assert(!RT_NODE_IS_LEAF(node));
+ return RTPointerIsValid(node->children[chunk]);
}
static inline bool
node_leaf_256_is_chunk_used(rt_node_leaf_256 *node, uint8 chunk)
{
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
return (node->isset[RT_NODE_BITMAP_BYTE(chunk)] & RT_NODE_BITMAP_BIT(chunk)) != 0;
}
-static inline rt_node *
+static inline rt_pointer
node_inner_256_get_child(rt_node_inner_256 *node, uint8 chunk)
{
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
Assert(node_inner_256_is_chunk_used(node, chunk));
return node->children[chunk];
}
@@ -813,16 +879,16 @@ node_inner_256_get_child(rt_node_inner_256 *node, uint8 chunk)
static inline uint64
node_leaf_256_get_value(rt_node_leaf_256 *node, uint8 chunk)
{
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
Assert(node_leaf_256_is_chunk_used(node, chunk));
return node->values[chunk];
}
/* Set the child in the node-256 */
static inline void
-node_inner_256_set(rt_node_inner_256 *node, uint8 chunk, rt_node *child)
+node_inner_256_set(rt_node_inner_256 *node, uint8 chunk, rt_pointer child)
{
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
node->children[chunk] = child;
}
@@ -830,7 +896,7 @@ node_inner_256_set(rt_node_inner_256 *node, uint8 chunk, rt_node *child)
static inline void
node_leaf_256_set(rt_node_leaf_256 *node, uint8 chunk, uint64 value)
{
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
node->isset[RT_NODE_BITMAP_BYTE(chunk)] |= RT_NODE_BITMAP_BIT(chunk);
node->values[chunk] = value;
}
@@ -839,14 +905,14 @@ node_leaf_256_set(rt_node_leaf_256 *node, uint8 chunk, uint64 value)
static inline void
node_inner_256_delete(rt_node_inner_256 *node, uint8 chunk)
{
- Assert(!NODE_IS_LEAF(node));
- node->children[chunk] = NULL;
+ Assert(!RT_NODE_IS_LEAF(node));
+ node->children[chunk] = InvalidRTPointer;
}
static inline void
node_leaf_256_delete(rt_node_leaf_256 *node, uint8 chunk)
{
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
node->isset[RT_NODE_BITMAP_BYTE(chunk)] &= ~(RT_NODE_BITMAP_BIT(chunk));
}
@@ -882,29 +948,32 @@ rt_new_root(radix_tree *tree, uint64 key)
{
int shift = key_get_shift(key);
bool inner = shift > 0;
- rt_node *newnode;
+ rt_node_ptr newnode;
newnode = rt_alloc_node(tree, RT_CLASS_4_FULL, inner);
rt_init_node(newnode, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
- newnode->shift = shift;
+ NODE_SHIFT(newnode) = shift;
+
tree->max_val = shift_get_max_val(shift);
- tree->root = newnode;
+ tree->root = newnode.encoded;
}
/*
* Allocate a new node with the given node kind.
*/
-static rt_node *
+static rt_node_ptr
rt_alloc_node(radix_tree *tree, rt_size_class size_class, bool inner)
{
- rt_node *newnode;
+ rt_node_ptr newnode;
if (inner)
- newnode = (rt_node *) MemoryContextAlloc(tree->inner_slabs[size_class],
- rt_size_class_info[size_class].inner_size);
+ newnode.decoded = (rt_node *) MemoryContextAlloc(tree->inner_slabs[size_class],
+ rt_size_class_info[size_class].inner_size);
else
- newnode = (rt_node *) MemoryContextAlloc(tree->leaf_slabs[size_class],
- rt_size_class_info[size_class].leaf_size);
+ newnode.decoded = (rt_node *) MemoryContextAlloc(tree->leaf_slabs[size_class],
+ rt_size_class_info[size_class].leaf_size);
+
+ newnode.encoded = rt_pointer_encode(newnode.decoded);
#ifdef RT_DEBUG
/* update the statistics */
@@ -916,20 +985,20 @@ rt_alloc_node(radix_tree *tree, rt_size_class size_class, bool inner)
/* Initialize the node contents */
static inline void
-rt_init_node(rt_node *node, uint8 kind, rt_size_class size_class, bool inner)
+rt_init_node(rt_node_ptr node, uint8 kind, rt_size_class size_class, bool inner)
{
if (inner)
- MemSet(node, 0, rt_size_class_info[size_class].inner_size);
+ MemSet(node.decoded, 0, rt_size_class_info[size_class].inner_size);
else
- MemSet(node, 0, rt_size_class_info[size_class].leaf_size);
+ MemSet(node.decoded, 0, rt_size_class_info[size_class].leaf_size);
- node->kind = kind;
- node->fanout = rt_size_class_info[size_class].fanout;
+ NODE_KIND(node) = kind;
+ NODE_FANOUT(node) = rt_size_class_info[size_class].fanout;
/* Initialize slot_idxs to invalid values */
if (kind == RT_NODE_KIND_125)
{
- rt_node_base_125 *n125 = (rt_node_base_125 *) node;
+ rt_node_base_125 *n125 = (rt_node_base_125 *) node.decoded;
memset(n125->slot_idxs, RT_NODE_125_INVALID_IDX, sizeof(n125->slot_idxs));
}
@@ -939,25 +1008,25 @@ rt_init_node(rt_node *node, uint8 kind, rt_size_class size_class, bool inner)
* and this is the max size class to it will never grow.
*/
if (kind == RT_NODE_KIND_256)
- node->fanout = 0;
+ NODE_FANOUT(node) = 0;
}
static inline void
-rt_copy_node(rt_node *newnode, rt_node *oldnode)
+rt_copy_node(rt_node_ptr newnode, rt_node_ptr oldnode)
{
- newnode->shift = oldnode->shift;
- newnode->chunk = oldnode->chunk;
- newnode->count = oldnode->count;
+ NODE_SHIFT(newnode) = NODE_SHIFT(oldnode);
+ NODE_CHUNK(newnode) = NODE_CHUNK(oldnode);
+ NODE_COUNT(newnode) = NODE_COUNT(oldnode);
}
/*
* Create a new node with 'new_kind' and the same shift, chunk, and
* count of 'node'.
*/
-static rt_node*
-rt_grow_node_kind(radix_tree *tree, rt_node *node, uint8 new_kind)
+static rt_node_ptr
+rt_grow_node_kind(radix_tree *tree, rt_node_ptr node, uint8 new_kind)
{
- rt_node *newnode;
+ rt_node_ptr newnode;
bool inner = !NODE_IS_LEAF(node);
newnode = rt_alloc_node(tree, kind_min_size_class[new_kind], inner);
@@ -969,12 +1038,12 @@ rt_grow_node_kind(radix_tree *tree, rt_node *node, uint8 new_kind)
/* Free the given node */
static void
-rt_free_node(radix_tree *tree, rt_node *node)
+rt_free_node(radix_tree *tree, rt_node_ptr node)
{
/* If we're deleting the root node, make the tree empty */
- if (tree->root == node)
+ if (tree->root == node.encoded)
{
- tree->root = NULL;
+ tree->root = InvalidRTPointer;
tree->max_val = 0;
}
@@ -985,7 +1054,7 @@ rt_free_node(radix_tree *tree, rt_node *node)
/* update the statistics */
for (i = 0; i < RT_SIZE_CLASS_COUNT; i++)
{
- if (node->fanout == rt_size_class_info[i].fanout)
+ if (NODE_FANOUT(node) == rt_size_class_info[i].fanout)
break;
}
@@ -998,29 +1067,30 @@ rt_free_node(radix_tree *tree, rt_node *node)
}
#endif
- pfree(node);
+ pfree(node.decoded);
}
/*
* Replace old_child with new_child, and free the old one.
*/
static void
-rt_replace_node(radix_tree *tree, rt_node *parent, rt_node *old_child,
- rt_node *new_child, uint64 key)
+rt_replace_node(radix_tree *tree, rt_node_ptr parent, rt_node_ptr old_child,
+ rt_node_ptr new_child, uint64 key)
{
- Assert(old_child->chunk == new_child->chunk);
- Assert(old_child->shift == new_child->shift);
+ Assert(NODE_CHUNK(old_child) == NODE_CHUNK(new_child));
+ Assert(NODE_SHIFT(old_child) == NODE_SHIFT(new_child));
- if (parent == old_child)
+ if (rt_node_ptr_eq(&parent, &old_child))
{
/* Replace the root node with the new large node */
- tree->root = new_child;
+ tree->root = new_child.encoded;
}
else
{
bool replaced PG_USED_FOR_ASSERTS_ONLY;
- replaced = rt_node_insert_inner(tree, NULL, parent, key, new_child);
+ replaced = rt_node_insert_inner(tree, InvalidRTNodePtr, parent, key,
+ new_child);
Assert(replaced);
}
@@ -1035,24 +1105,28 @@ static void
rt_extend(radix_tree *tree, uint64 key)
{
int target_shift;
- int shift = tree->root->shift + RT_NODE_SPAN;
+ rt_node *root = rt_pointer_decode(tree->root);
+ int shift = root->shift + RT_NODE_SPAN;
target_shift = key_get_shift(key);
/* Grow tree from 'shift' to 'target_shift' */
while (shift <= target_shift)
{
- rt_node_inner_4 *node;
+ rt_node_ptr node;
+ rt_node_inner_4 *n4;
+
+ node = rt_alloc_node(tree, RT_CLASS_4_FULL, true);
+ rt_init_node(node, RT_NODE_KIND_4, RT_CLASS_4_FULL, true);
- node = (rt_node_inner_4 *) rt_alloc_node(tree, RT_CLASS_4_FULL, true);
- rt_init_node((rt_node *) node, RT_NODE_KIND_4, RT_CLASS_4_FULL, true);
- node->base.n.shift = shift;
- node->base.n.count = 1;
- node->base.chunks[0] = 0;
- node->children[0] = tree->root;
+ n4 = (rt_node_inner_4 *) node.decoded;
+ n4->base.n.shift = shift;
+ n4->base.n.count = 1;
+ n4->base.chunks[0] = 0;
+ n4->children[0] = tree->root;
- tree->root->chunk = 0;
- tree->root = (rt_node *) node;
+ root->chunk = 0;
+ tree->root = node.encoded;
shift += RT_NODE_SPAN;
}
@@ -1065,21 +1139,22 @@ rt_extend(radix_tree *tree, uint64 key)
* Insert inner and leaf nodes from 'node' to bottom.
*/
static inline void
-rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node *parent,
- rt_node *node)
+rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node_ptr parent,
+ rt_node_ptr node)
{
- int shift = node->shift;
+ int shift = NODE_SHIFT(node);
while (shift >= RT_NODE_SPAN)
{
- rt_node *newchild;
+ rt_node_ptr newchild;
int newshift = shift - RT_NODE_SPAN;
bool inner = newshift > 0;
newchild = rt_alloc_node(tree, RT_CLASS_4_FULL, inner);
rt_init_node(newchild, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
- newchild->shift = newshift;
- newchild->chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ NODE_SHIFT(newchild) = newshift;
+ NODE_CHUNK(newchild) = RT_GET_KEY_CHUNK(key, NODE_SHIFT(node));
+
rt_node_insert_inner(tree, parent, node, key, newchild);
parent = node;
@@ -1099,17 +1174,18 @@ rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node *parent,
* pointer is set to child_p.
*/
static inline bool
-rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **child_p)
+rt_node_search_inner(rt_node_ptr node, uint64 key, rt_action action,
+ rt_pointer *child_p)
{
- uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ uint8 chunk = RT_GET_KEY_CHUNK(key, NODE_SHIFT(node));
bool found = false;
- rt_node *child = NULL;
+ rt_pointer child;
- switch (node->kind)
+ switch (NODE_KIND(node))
{
case RT_NODE_KIND_4:
{
- rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node.decoded;
int idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
if (idx < 0)
@@ -1127,7 +1203,7 @@ rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **chil
}
case RT_NODE_KIND_32:
{
- rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node.decoded;
int idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
if (idx < 0)
@@ -1143,7 +1219,7 @@ rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **chil
}
case RT_NODE_KIND_125:
{
- rt_node_inner_125 *n125 = (rt_node_inner_125 *) node;
+ rt_node_inner_125 *n125 = (rt_node_inner_125 *) node.decoded;
if (!node_125_is_chunk_used((rt_node_base_125 *) n125, chunk))
break;
@@ -1159,7 +1235,7 @@ rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **chil
}
case RT_NODE_KIND_256:
{
- rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node.decoded;
if (!node_inner_256_is_chunk_used(n256, chunk))
break;
@@ -1176,7 +1252,7 @@ rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **chil
/* update statistics */
if (action == RT_ACTION_DELETE && found)
- node->count--;
+ NODE_COUNT(node)--;
if (found && child_p)
*child_p = child;
@@ -1192,17 +1268,17 @@ rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **chil
* to the value is set to value_p.
*/
static inline bool
-rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p)
+rt_node_search_leaf(rt_node_ptr node, uint64 key, rt_action action, uint64 *value_p)
{
- uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ uint8 chunk = RT_GET_KEY_CHUNK(key, NODE_SHIFT(node));
bool found = false;
uint64 value = 0;
- switch (node->kind)
+ switch (NODE_KIND(node))
{
case RT_NODE_KIND_4:
{
- rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node.decoded;
int idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
if (idx < 0)
@@ -1220,7 +1296,7 @@ rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p
}
case RT_NODE_KIND_32:
{
- rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node.decoded;
int idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
if (idx < 0)
@@ -1236,7 +1312,7 @@ rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p
}
case RT_NODE_KIND_125:
{
- rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) node;
+ rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) node.decoded;
if (!node_125_is_chunk_used((rt_node_base_125 *) n125, chunk))
break;
@@ -1252,7 +1328,7 @@ rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p
}
case RT_NODE_KIND_256:
{
- rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node.decoded;
if (!node_leaf_256_is_chunk_used(n256, chunk))
break;
@@ -1269,7 +1345,7 @@ rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p
/* update statistics */
if (action == RT_ACTION_DELETE && found)
- node->count--;
+ NODE_COUNT(node)--;
if (found && value_p)
*value_p = value;
@@ -1279,19 +1355,19 @@ rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p
/* Insert the child to the inner node */
static bool
-rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 key,
- rt_node *child)
+rt_node_insert_inner(radix_tree *tree, rt_node_ptr parent, rt_node_ptr node,
+ uint64 key, rt_node_ptr child)
{
- uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ uint8 chunk = RT_GET_KEY_CHUNK(key, NODE_SHIFT(node));
bool chunk_exists = false;
Assert(!NODE_IS_LEAF(node));
- switch (node->kind)
+ switch (NODE_KIND(node))
{
case RT_NODE_KIND_4:
{
- rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node.decoded;
int idx;
idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
@@ -1299,25 +1375,27 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
{
/* found the existing chunk */
chunk_exists = true;
- n4->children[idx] = child;
+ n4->children[idx] = child.encoded;
break;
}
if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n4)))
{
+ rt_node_ptr new;
rt_node_inner_32 *new32;
- Assert(parent != NULL);
+
+ Assert(RTNodePtrIsValid(parent));
/* grow node from 4 to 32 */
- new32 = (rt_node_inner_32 *) rt_grow_node_kind(tree, (rt_node *) n4,
- RT_NODE_KIND_32);
+ new = rt_grow_node_kind(tree, node, RT_NODE_KIND_32);
+ new32 = (rt_node_inner_32 *) new.decoded;
+
chunk_children_array_copy(n4->base.chunks, n4->children,
new32->base.chunks, new32->children);
- Assert(parent != NULL);
- rt_replace_node(tree, parent, (rt_node *) n4, (rt_node *) new32,
- key);
- node = (rt_node *) new32;
+ Assert(RTNodePtrIsValid(parent));
+ rt_replace_node(tree, parent, node, new, key);
+ node = new;
}
else
{
@@ -1330,14 +1408,14 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
count, insertpos);
n4->base.chunks[insertpos] = chunk;
- n4->children[insertpos] = child;
+ n4->children[insertpos] = child.encoded;
break;
}
}
/* FALLTHROUGH */
case RT_NODE_KIND_32:
{
- rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node.decoded;
int idx;
idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
@@ -1345,45 +1423,52 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
{
/* found the existing chunk */
chunk_exists = true;
- n32->children[idx] = child;
+ n32->children[idx] = child.encoded;
break;
}
if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)))
{
- Assert(parent != NULL);
+ Assert(RTNodePtrIsValid(parent));
if (n32->base.n.count == rt_size_class_info[RT_CLASS_32_PARTIAL].fanout)
{
/* use the same node kind, but expand to the next size class */
const Size size = rt_size_class_info[RT_CLASS_32_PARTIAL].inner_size;
const int fanout = rt_size_class_info[RT_CLASS_32_FULL].fanout;
+ rt_node_ptr new;
rt_node_inner_32 *new32;
- new32 = (rt_node_inner_32 *) rt_alloc_node(tree, RT_CLASS_32_FULL, true);
+ new = rt_alloc_node(tree, RT_CLASS_32_FULL, true);
+ new32 = (rt_node_inner_32 *) new.decoded;
memcpy(new32, n32, size);
new32->base.n.fanout = fanout;
- rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new32, key);
+ rt_replace_node(tree, parent, node, new, key);
- /* must update both pointers here */
- node = (rt_node *) new32;
+ /*
+ * Must update both pointers here since we update n32 and
+ * verify node.
+ */
+ node = new;
n32 = new32;
goto retry_insert_inner_32;
}
else
{
+ rt_node_ptr new;
rt_node_inner_125 *new125;
/* grow node from 32 to 125 */
- new125 = (rt_node_inner_125 *) rt_grow_node_kind(tree, (rt_node *) n32,
- RT_NODE_KIND_125);
+ new = rt_grow_node_kind(tree, node, RT_NODE_KIND_125);
+ new125 = (rt_node_inner_125 *) new.decoded;
+
for (int i = 0; i < n32->base.n.count; i++)
node_inner_125_insert(new125, n32->base.chunks[i], n32->children[i]);
- rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new125, key);
- node = (rt_node *) new125;
+ rt_replace_node(tree, parent, node, new, key);
+ node = new;
}
}
else
@@ -1398,7 +1483,7 @@ retry_insert_inner_32:
count, insertpos);
n32->base.chunks[insertpos] = chunk;
- n32->children[insertpos] = child;
+ n32->children[insertpos] = child.encoded;
break;
}
}
@@ -1406,25 +1491,28 @@ retry_insert_inner_32:
/* FALLTHROUGH */
case RT_NODE_KIND_125:
{
- rt_node_inner_125 *n125 = (rt_node_inner_125 *) node;
+ rt_node_inner_125 *n125 = (rt_node_inner_125 *) node.decoded;
int cnt = 0;
if (node_125_is_chunk_used((rt_node_base_125 *) n125, chunk))
{
/* found the existing chunk */
chunk_exists = true;
- node_inner_125_update(n125, chunk, child);
+ node_inner_125_update(n125, chunk, child.encoded);
break;
}
if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n125)))
{
+ rt_node_ptr new;
rt_node_inner_256 *new256;
- Assert(parent != NULL);
+
+ Assert(RTNodePtrIsValid(parent));
/* grow node from 125 to 256 */
- new256 = (rt_node_inner_256 *) rt_grow_node_kind(tree, (rt_node *) n125,
- RT_NODE_KIND_256);
+ new = rt_grow_node_kind(tree, node, RT_NODE_KIND_256);
+ new256 = (rt_node_inner_256 *) new.decoded;
+
for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n125->base.n.count; i++)
{
if (!node_125_is_chunk_used((rt_node_base_125 *) n125, i))
@@ -1434,32 +1522,31 @@ retry_insert_inner_32:
cnt++;
}
- rt_replace_node(tree, parent, (rt_node *) n125, (rt_node *) new256,
- key);
- node = (rt_node *) new256;
+ rt_replace_node(tree, parent, node, new, key);
+ node = new;
}
else
{
- node_inner_125_insert(n125, chunk, child);
+ node_inner_125_insert(n125, chunk, child.encoded);
break;
}
}
/* FALLTHROUGH */
case RT_NODE_KIND_256:
{
- rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node.decoded;
chunk_exists = node_inner_256_is_chunk_used(n256, chunk);
Assert(chunk_exists || FIXED_NODE_HAS_FREE_SLOT(n256, RT_CLASS_256));
- node_inner_256_set(n256, chunk, child);
+ node_inner_256_set(n256, chunk, child.encoded);
break;
}
}
/* Update statistics */
if (!chunk_exists)
- node->count++;
+ NODE_COUNT(node)++;
/*
* Done. Finally, verify the chunk and value is inserted or replaced
@@ -1472,19 +1559,19 @@ retry_insert_inner_32:
/* Insert the value to the leaf node */
static bool
-rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
+rt_node_insert_leaf(radix_tree *tree, rt_node_ptr parent, rt_node_ptr node,
uint64 key, uint64 value)
{
- uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ uint8 chunk = RT_GET_KEY_CHUNK(key, NODE_SHIFT(node));
bool chunk_exists = false;
Assert(NODE_IS_LEAF(node));
- switch (node->kind)
+ switch (NODE_KIND(node))
{
case RT_NODE_KIND_4:
{
- rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node.decoded;
int idx;
idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
@@ -1498,16 +1585,18 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n4)))
{
+ rt_node_ptr new;
rt_node_leaf_32 *new32;
- Assert(parent != NULL);
+
+ Assert(RTNodePtrIsValid(parent));
/* grow node from 4 to 32 */
- new32 = (rt_node_leaf_32 *) rt_grow_node_kind(tree, (rt_node *) n4,
- RT_NODE_KIND_32);
+ new = rt_grow_node_kind(tree, node, RT_NODE_KIND_32);
+ new32 = (rt_node_leaf_32 *) new.decoded;
chunk_values_array_copy(n4->base.chunks, n4->values,
new32->base.chunks, new32->values);
- rt_replace_node(tree, parent, (rt_node *) n4, (rt_node *) new32, key);
- node = (rt_node *) new32;
+ rt_replace_node(tree, parent, node, new, key);
+ node = new;
}
else
{
@@ -1527,7 +1616,7 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
/* FALLTHROUGH */
case RT_NODE_KIND_32:
{
- rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node.decoded;
int idx;
idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
@@ -1541,45 +1630,51 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)))
{
- Assert(parent != NULL);
+ Assert(RTNodePtrIsValid(parent));
if (n32->base.n.count == rt_size_class_info[RT_CLASS_32_PARTIAL].fanout)
{
/* use the same node kind, but expand to the next size class */
const Size size = rt_size_class_info[RT_CLASS_32_PARTIAL].leaf_size;
const int fanout = rt_size_class_info[RT_CLASS_32_FULL].fanout;
+ rt_node_ptr new;
rt_node_leaf_32 *new32;
- new32 = (rt_node_leaf_32 *) rt_alloc_node(tree, RT_CLASS_32_FULL, false);
+ new = rt_alloc_node(tree, RT_CLASS_32_FULL, false);
+ new32 = (rt_node_leaf_32 *) new.decoded;
memcpy(new32, n32, size);
new32->base.n.fanout = fanout;
- rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new32, key);
+ rt_replace_node(tree, parent, node, new, key);
- /* must update both pointers here */
- node = (rt_node *) new32;
+ /*
+ * Must update both pointers here since we update n32 and
+ * verify node.
+ */
+ node = new;
n32 = new32;
goto retry_insert_leaf_32;
}
else
{
+ rt_node_ptr new;
rt_node_leaf_125 *new125;
/* grow node from 32 to 125 */
- new125 = (rt_node_leaf_125 *) rt_grow_node_kind(tree, (rt_node *) n32,
- RT_NODE_KIND_125);
+ new = rt_grow_node_kind(tree, node, RT_NODE_KIND_125);
+ new125 = (rt_node_leaf_125 *) new.decoded;
+
for (int i = 0; i < n32->base.n.count; i++)
node_leaf_125_insert(new125, n32->base.chunks[i], n32->values[i]);
- rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new125,
- key);
- node = (rt_node *) new125;
+ rt_replace_node(tree, parent, node, new, key);
+ node = new;
}
}
else
{
- retry_insert_leaf_32:
+retry_insert_leaf_32:
{
int insertpos = node_32_get_insertpos((rt_node_base_32 *) n32, chunk);
int count = n32->base.n.count;
@@ -1597,7 +1692,7 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
/* FALLTHROUGH */
case RT_NODE_KIND_125:
{
- rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) node;
+ rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) node.decoded;
int cnt = 0;
if (node_125_is_chunk_used((rt_node_base_125 *) n125, chunk))
@@ -1610,12 +1705,14 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n125)))
{
+ rt_node_ptr new;
rt_node_leaf_256 *new256;
- Assert(parent != NULL);
+
+ Assert(RTNodePtrIsValid(parent));
/* grow node from 125 to 256 */
- new256 = (rt_node_leaf_256 *) rt_grow_node_kind(tree, (rt_node *) n125,
- RT_NODE_KIND_256);
+ new = rt_grow_node_kind(tree, node, RT_NODE_KIND_256);
+ new256 = (rt_node_leaf_256 *) new.decoded;
for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n125->base.n.count; i++)
{
if (!node_125_is_chunk_used((rt_node_base_125 *) n125, i))
@@ -1625,9 +1722,8 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
cnt++;
}
- rt_replace_node(tree, parent, (rt_node *) n125, (rt_node *) new256,
- key);
- node = (rt_node *) new256;
+ rt_replace_node(tree, parent, node, new, key);
+ node = new;
}
else
{
@@ -1638,7 +1734,7 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
/* FALLTHROUGH */
case RT_NODE_KIND_256:
{
- rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node.decoded;
chunk_exists = node_leaf_256_is_chunk_used(n256, chunk);
Assert(chunk_exists || FIXED_NODE_HAS_FREE_SLOT(n256, RT_CLASS_256));
@@ -1650,7 +1746,7 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
/* Update statistics */
if (!chunk_exists)
- node->count++;
+ NODE_COUNT(node)++;
/*
* Done. Finally, verify the chunk and value is inserted or replaced
@@ -1674,7 +1770,7 @@ rt_create(MemoryContext ctx)
tree = palloc(sizeof(radix_tree));
tree->context = ctx;
- tree->root = NULL;
+ tree->root = InvalidRTPointer;
tree->max_val = 0;
tree->num_keys = 0;
@@ -1723,26 +1819,23 @@ rt_set(radix_tree *tree, uint64 key, uint64 value)
{
int shift;
bool updated;
- rt_node *node;
- rt_node *parent;
+ rt_node_ptr node;
+ rt_node_ptr parent;
/* Empty tree, create the root */
- if (!tree->root)
+ if (!RTPointerIsValid(tree->root))
rt_new_root(tree, key);
/* Extend the tree if necessary */
if (key > tree->max_val)
rt_extend(tree, key);
- Assert(tree->root);
-
- shift = tree->root->shift;
- node = parent = tree->root;
-
/* Descend the tree until a leaf node */
+ node = parent = rt_node_ptr_encoded(tree->root);
+ shift = NODE_SHIFT(node);
while (shift >= 0)
{
- rt_node *child;
+ rt_pointer child;
if (NODE_IS_LEAF(node))
break;
@@ -1754,7 +1847,7 @@ rt_set(radix_tree *tree, uint64 key, uint64 value)
}
parent = node;
- node = child;
+ node = rt_node_ptr_encoded(child);
shift -= RT_NODE_SPAN;
}
@@ -1775,21 +1868,21 @@ rt_set(radix_tree *tree, uint64 key, uint64 value)
bool
rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
{
- rt_node *node;
+ rt_node_ptr node;
int shift;
Assert(value_p != NULL);
- if (!tree->root || key > tree->max_val)
+ if (!RTPointerIsValid(tree->root) || key > tree->max_val)
return false;
- node = tree->root;
- shift = tree->root->shift;
+ node = rt_node_ptr_encoded(tree->root);
+ shift = NODE_SHIFT(node);
/* Descend the tree until a leaf node */
while (shift >= 0)
{
- rt_node *child;
+ rt_pointer child;
if (NODE_IS_LEAF(node))
break;
@@ -1797,7 +1890,7 @@ rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
return false;
- node = child;
+ node = rt_node_ptr_encoded(child);
shift -= RT_NODE_SPAN;
}
@@ -1811,8 +1904,8 @@ rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
bool
rt_delete(radix_tree *tree, uint64 key)
{
- rt_node *node;
- rt_node *stack[RT_MAX_LEVEL] = {0};
+ rt_node_ptr node;
+ rt_node_ptr stack[RT_MAX_LEVEL] = {0};
int shift;
int level;
bool deleted;
@@ -1824,12 +1917,12 @@ rt_delete(radix_tree *tree, uint64 key)
* Descend the tree to search the key while building a stack of nodes we
* visited.
*/
- node = tree->root;
- shift = tree->root->shift;
+ node = rt_node_ptr_encoded(tree->root);
+ shift = NODE_SHIFT(node);
level = -1;
while (shift > 0)
{
- rt_node *child;
+ rt_pointer child;
/* Push the current node to the stack */
stack[++level] = node;
@@ -1837,7 +1930,7 @@ rt_delete(radix_tree *tree, uint64 key)
if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
return false;
- node = child;
+ node = rt_node_ptr_encoded(child);
shift -= RT_NODE_SPAN;
}
@@ -1888,6 +1981,7 @@ rt_iter *
rt_begin_iterate(radix_tree *tree)
{
MemoryContext old_ctx;
+ rt_node_ptr root;
rt_iter *iter;
int top_level;
@@ -1897,17 +1991,18 @@ rt_begin_iterate(radix_tree *tree)
iter->tree = tree;
/* empty tree */
- if (!iter->tree->root)
+ if (!RTPointerIsValid(iter->tree) || !RTPointerIsValid(iter->tree->root))
return iter;
- top_level = iter->tree->root->shift / RT_NODE_SPAN;
+ root = rt_node_ptr_encoded(iter->tree->root);
+ top_level = NODE_SHIFT(root) / RT_NODE_SPAN;
iter->stack_len = top_level;
/*
* Descend to the left most leaf node from the root. The key is being
* constructed while descending to the leaf.
*/
- rt_update_iter_stack(iter, iter->tree->root, top_level);
+ rt_update_iter_stack(iter, root, top_level);
MemoryContextSwitchTo(old_ctx);
@@ -1918,14 +2013,15 @@ rt_begin_iterate(radix_tree *tree)
* Update each node_iter for inner nodes in the iterator node stack.
*/
static void
-rt_update_iter_stack(rt_iter *iter, rt_node *from_node, int from)
+rt_update_iter_stack(rt_iter *iter, rt_node_ptr from_node, int from)
{
int level = from;
- rt_node *node = from_node;
+ rt_node_ptr node = from_node;
for (;;)
{
rt_node_iter *node_iter = &(iter->stack[level--]);
+ bool found PG_USED_FOR_ASSERTS_ONLY;
node_iter->node = node;
node_iter->current_idx = -1;
@@ -1935,10 +2031,10 @@ rt_update_iter_stack(rt_iter *iter, rt_node *from_node, int from)
return;
/* Advance to the next slot in the inner node */
- node = rt_node_inner_iterate_next(iter, node_iter);
+ found = rt_node_inner_iterate_next(iter, node_iter, &node);
/* We must find the first children in the node */
- Assert(node);
+ Assert(found);
}
}
@@ -1955,7 +2051,7 @@ rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p)
for (;;)
{
- rt_node *child = NULL;
+ rt_node_ptr child = InvalidRTNodePtr;
uint64 value;
int level;
bool found;
@@ -1976,14 +2072,12 @@ rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p)
*/
for (level = 1; level <= iter->stack_len; level++)
{
- child = rt_node_inner_iterate_next(iter, &(iter->stack[level]));
-
- if (child)
+ if (rt_node_inner_iterate_next(iter, &(iter->stack[level]), &child))
break;
}
/* the iteration finished */
- if (!child)
+ if (!RTNodePtrIsValid(child))
return false;
/*
@@ -2015,18 +2109,19 @@ rt_iter_update_key(rt_iter *iter, uint8 chunk, uint8 shift)
* Advance the slot in the inner node. Return the child if exists, otherwise
* null.
*/
-static inline rt_node *
-rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
+static inline bool
+rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter, rt_node_ptr *child_p)
{
- rt_node *child = NULL;
+ rt_node_ptr node = node_iter->node;
+ rt_pointer child;
bool found = false;
uint8 key_chunk;
- switch (node_iter->node->kind)
+ switch (NODE_KIND(node))
{
case RT_NODE_KIND_4:
{
- rt_node_inner_4 *n4 = (rt_node_inner_4 *) node_iter->node;
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node.decoded;
node_iter->current_idx++;
if (node_iter->current_idx >= n4->base.n.count)
@@ -2039,7 +2134,7 @@ rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
}
case RT_NODE_KIND_32:
{
- rt_node_inner_32 *n32 = (rt_node_inner_32 *) node_iter->node;
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node.decoded;
node_iter->current_idx++;
if (node_iter->current_idx >= n32->base.n.count)
@@ -2052,7 +2147,7 @@ rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
}
case RT_NODE_KIND_125:
{
- rt_node_inner_125 *n125 = (rt_node_inner_125 *) node_iter->node;
+ rt_node_inner_125 *n125 = (rt_node_inner_125 *) node.decoded;
int i;
for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
@@ -2072,7 +2167,7 @@ rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
}
case RT_NODE_KIND_256:
{
- rt_node_inner_256 *n256 = (rt_node_inner_256 *) node_iter->node;
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node.decoded;
int i;
for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
@@ -2093,9 +2188,12 @@ rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
}
if (found)
- rt_iter_update_key(iter, key_chunk, node_iter->node->shift);
+ {
+ rt_iter_update_key(iter, key_chunk, NODE_SHIFT(node));
+ *child_p = rt_node_ptr_encoded(child);
+ }
- return child;
+ return found;
}
/*
@@ -2103,19 +2201,18 @@ rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
* is set to value_p, otherwise return false.
*/
static inline bool
-rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
- uint64 *value_p)
+rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter, uint64 *value_p)
{
- rt_node *node = node_iter->node;
+ rt_node_ptr node = node_iter->node;
bool found = false;
uint64 value;
uint8 key_chunk;
- switch (node->kind)
+ switch (NODE_KIND(node))
{
case RT_NODE_KIND_4:
{
- rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node_iter->node;
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node.decoded;
node_iter->current_idx++;
if (node_iter->current_idx >= n4->base.n.count)
@@ -2128,7 +2225,7 @@ rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
}
case RT_NODE_KIND_32:
{
- rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node_iter->node;
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node.decoded;
node_iter->current_idx++;
if (node_iter->current_idx >= n32->base.n.count)
@@ -2141,7 +2238,7 @@ rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
}
case RT_NODE_KIND_125:
{
- rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) node_iter->node;
+ rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) node.decoded;
int i;
for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
@@ -2161,7 +2258,7 @@ rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
}
case RT_NODE_KIND_256:
{
- rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node_iter->node;
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node.decoded;
int i;
for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
@@ -2183,7 +2280,7 @@ rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
if (found)
{
- rt_iter_update_key(iter, key_chunk, node_iter->node->shift);
+ rt_iter_update_key(iter, key_chunk, NODE_SHIFT(node));
*value_p = value;
}
@@ -2220,16 +2317,16 @@ rt_memory_usage(radix_tree *tree)
* Verify the radix tree node.
*/
static void
-rt_verify_node(rt_node *node)
+rt_verify_node(rt_node_ptr node)
{
#ifdef USE_ASSERT_CHECKING
- Assert(node->count >= 0);
+ Assert(NODE_COUNT(node) >= 0);
- switch (node->kind)
+ switch (NODE_KIND(node))
{
case RT_NODE_KIND_4:
{
- rt_node_base_4 *n4 = (rt_node_base_4 *) node;
+ rt_node_base_4 *n4 = (rt_node_base_4 *) node.decoded;
for (int i = 1; i < n4->n.count; i++)
Assert(n4->chunks[i - 1] < n4->chunks[i]);
@@ -2238,7 +2335,7 @@ rt_verify_node(rt_node *node)
}
case RT_NODE_KIND_32:
{
- rt_node_base_32 *n32 = (rt_node_base_32 *) node;
+ rt_node_base_32 *n32 = (rt_node_base_32 *) node.decoded;
for (int i = 1; i < n32->n.count; i++)
Assert(n32->chunks[i - 1] < n32->chunks[i]);
@@ -2247,7 +2344,7 @@ rt_verify_node(rt_node *node)
}
case RT_NODE_KIND_125:
{
- rt_node_base_125 *n125 = (rt_node_base_125 *) node;
+ rt_node_base_125 *n125 = (rt_node_base_125 *) node.decoded;
int cnt = 0;
for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
@@ -2257,10 +2354,10 @@ rt_verify_node(rt_node *node)
/* Check if the corresponding slot is used */
if (NODE_IS_LEAF(node))
- Assert(node_leaf_125_is_slot_used((rt_node_leaf_125 *) node,
+ Assert(node_leaf_125_is_slot_used((rt_node_leaf_125 *) n125,
n125->slot_idxs[i]));
else
- Assert(node_inner_125_is_slot_used((rt_node_inner_125 *) node,
+ Assert(node_inner_125_is_slot_used((rt_node_inner_125 *) n125,
n125->slot_idxs[i]));
cnt++;
@@ -2273,7 +2370,7 @@ rt_verify_node(rt_node *node)
{
if (NODE_IS_LEAF(node))
{
- rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node.decoded;
int cnt = 0;
for (int i = 0; i < RT_NODE_NSLOTS_BITS(RT_NODE_MAX_SLOTS); i++)
@@ -2294,54 +2391,62 @@ rt_verify_node(rt_node *node)
void
rt_stats(radix_tree *tree)
{
+ rt_node *root = rt_pointer_decode(tree->root);
+
+ if (root == NULL)
+ return;
+
ereport(NOTICE, (errmsg("num_keys = " UINT64_FORMAT ", height = %d, n4 = %u, n15 = %u, n32 = %u, n125 = %u, n256 = %u",
- tree->num_keys,
- tree->root->shift / RT_NODE_SPAN,
- tree->cnt[RT_CLASS_4_FULL],
- tree->cnt[RT_CLASS_32_PARTIAL],
- tree->cnt[RT_CLASS_32_FULL],
- tree->cnt[RT_CLASS_125_FULL],
- tree->cnt[RT_CLASS_256])));
+ tree->num_keys,
+ root->shift / RT_NODE_SPAN,
+ tree->cnt[RT_CLASS_4_FULL],
+ tree->cnt[RT_CLASS_32_PARTIAL],
+ tree->cnt[RT_CLASS_32_FULL],
+ tree->cnt[RT_CLASS_125_FULL],
+ tree->cnt[RT_CLASS_256])));
}
static void
-rt_dump_node(rt_node *node, int level, bool recurse)
+rt_dump_node(rt_node_ptr node, int level, bool recurse)
{
- char space[125] = {0};
+ rt_node *n = node.decoded;
+ char space[128] = {0};
fprintf(stderr, "[%s] kind %d, fanout %d, count %u, shift %u, chunk 0x%X:\n",
NODE_IS_LEAF(node) ? "LEAF" : "INNR",
- (node->kind == RT_NODE_KIND_4) ? 4 :
- (node->kind == RT_NODE_KIND_32) ? 32 :
- (node->kind == RT_NODE_KIND_125) ? 125 : 256,
- node->fanout == 0 ? 256 : node->fanout,
- node->count, node->shift, node->chunk);
+ (n->kind == RT_NODE_KIND_4) ? 4 :
+ (n->kind == RT_NODE_KIND_32) ? 32 :
+ (n->kind == RT_NODE_KIND_125) ? 125 : 256,
+ n->fanout == 0 ? 256 : n->fanout,
+ n->count, n->shift, n->chunk);
if (level > 0)
sprintf(space, "%*c", level * 4, ' ');
- switch (node->kind)
+ switch (NODE_KIND(node))
{
case RT_NODE_KIND_4:
{
- for (int i = 0; i < node->count; i++)
+ for (int i = 0; i < NODE_COUNT(node); i++)
{
if (NODE_IS_LEAF(node))
{
- rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node.decoded;
fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
space, n4->base.chunks[i], n4->values[i]);
}
else
{
- rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node.decoded;
fprintf(stderr, "%schunk 0x%X ->",
space, n4->base.chunks[i]);
if (recurse)
- rt_dump_node(n4->children[i], level + 1, recurse);
+ rt_dump_node(rt_node_ptr_encoded(n4->children[i]),
+ level + 1, recurse);
else
fprintf(stderr, "\n");
}
@@ -2350,25 +2455,26 @@ rt_dump_node(rt_node *node, int level, bool recurse)
}
case RT_NODE_KIND_32:
{
- for (int i = 0; i < node->count; i++)
+ for (int i = 0; i < NODE_COUNT(node); i++)
{
if (NODE_IS_LEAF(node))
{
- rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node.decoded;
fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
space, n32->base.chunks[i], n32->values[i]);
}
else
{
- rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node.decoded;
fprintf(stderr, "%schunk 0x%X ->",
space, n32->base.chunks[i]);
if (recurse)
{
- rt_dump_node(n32->children[i], level + 1, recurse);
+ rt_dump_node(rt_node_ptr_encoded(n32->children[i]),
+ level + 1, recurse);
}
else
fprintf(stderr, "\n");
@@ -2378,7 +2484,7 @@ rt_dump_node(rt_node *node, int level, bool recurse)
}
case RT_NODE_KIND_125:
{
- rt_node_base_125 *b125 = (rt_node_base_125 *) node;
+ rt_node_base_125 *b125 = (rt_node_base_125 *) node.decoded;
fprintf(stderr, "slot_idxs ");
for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
@@ -2390,7 +2496,7 @@ rt_dump_node(rt_node *node, int level, bool recurse)
}
if (NODE_IS_LEAF(node))
{
- rt_node_leaf_125 *n = (rt_node_leaf_125 *) node;
+ rt_node_leaf_125 *n = (rt_node_leaf_125 *) node.decoded;
fprintf(stderr, ", isset-bitmap:");
for (int i = 0; i < WORDNUM(128); i++)
@@ -2420,7 +2526,7 @@ rt_dump_node(rt_node *node, int level, bool recurse)
space, i);
if (recurse)
- rt_dump_node(node_inner_125_get_child(n125, i),
+ rt_dump_node(rt_node_ptr_encoded(node_inner_125_get_child(n125, i)),
level + 1, recurse);
else
fprintf(stderr, "\n");
@@ -2434,7 +2540,7 @@ rt_dump_node(rt_node *node, int level, bool recurse)
{
if (NODE_IS_LEAF(node))
{
- rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node.decoded;
if (!node_leaf_256_is_chunk_used(n256, i))
continue;
@@ -2444,7 +2550,7 @@ rt_dump_node(rt_node *node, int level, bool recurse)
}
else
{
- rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node.decoded;
if (!node_inner_256_is_chunk_used(n256, i))
continue;
@@ -2453,8 +2559,8 @@ rt_dump_node(rt_node *node, int level, bool recurse)
space, i);
if (recurse)
- rt_dump_node(node_inner_256_get_child(n256, i), level + 1,
- recurse);
+ rt_dump_node(rt_node_ptr_encoded(node_inner_256_get_child(n256, i)),
+ level + 1, recurse);
else
fprintf(stderr, "\n");
}
@@ -2467,7 +2573,7 @@ rt_dump_node(rt_node *node, int level, bool recurse)
void
rt_dump_search(radix_tree *tree, uint64 key)
{
- rt_node *node;
+ rt_node_ptr node;
int shift;
int level = 0;
@@ -2475,7 +2581,7 @@ rt_dump_search(radix_tree *tree, uint64 key)
elog(NOTICE, "max_val = " UINT64_FORMAT "(0x" UINT64_FORMAT_HEX ")",
tree->max_val, tree->max_val);
- if (!tree->root)
+ if (!RTPointerIsValid(tree->root))
{
elog(NOTICE, "tree is empty");
return;
@@ -2488,11 +2594,11 @@ rt_dump_search(radix_tree *tree, uint64 key)
return;
}
- node = tree->root;
- shift = tree->root->shift;
+ node = rt_node_ptr_encoded(tree->root);
+ shift = NODE_SHIFT(node);
while (shift >= 0)
{
- rt_node *child;
+ rt_pointer child;
rt_dump_node(node, level, false);
@@ -2509,7 +2615,7 @@ rt_dump_search(radix_tree *tree, uint64 key)
if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
break;
- node = child;
+ node = rt_node_ptr_encoded(child);
shift -= RT_NODE_SPAN;
level++;
}
@@ -2518,6 +2624,7 @@ rt_dump_search(radix_tree *tree, uint64 key)
void
rt_dump(radix_tree *tree)
{
+ rt_node_ptr root;
for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
fprintf(stderr, "%s\tinner_size %zu\tinner_blocksize %zu\tleaf_size %zu\tleaf_blocksize %zu\n",
@@ -2528,12 +2635,13 @@ rt_dump(radix_tree *tree)
rt_size_class_info[i].leaf_blocksize);
fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->max_val);
- if (!tree->root)
+ if (!RTPointerIsValid(tree->root))
{
fprintf(stderr, "empty tree\n");
return;
}
- rt_dump_node(tree->root, 0, true);
+ root = rt_node_ptr_encoded(tree->root);
+ rt_dump_node(root, 0, true);
}
#endif
--
2.31.1
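To make the conversion above easier to follow: the patch replaces bare rt_node pointers with an rt_node_ptr that carries both an encoded rt_pointer (the form stored in parent slots and in tree->root) and the locally decoded address, presumably so the encoded form need not stay a raw C pointer. The definitions of rt_node_ptr and the NODE_* accessors are not part of this excerpt; the following is only a sketch of the shape the call sites appear to assume (local-memory case, names taken from the diff), not the patch's actual code:

/* sketch only -- reconstructed from usage in the hunks above */
typedef uintptr_t rt_pointer;       /* encoded form, stored in nodes and tree->root */

#define InvalidRTPointer        ((rt_pointer) 0)
#define RTPointerIsValid(x)     ((x) != InvalidRTPointer)

typedef struct rt_node_ptr
{
    rt_pointer  encoded;            /* value to store into parent slots */
    rt_node    *decoded;            /* locally dereferenceable address */
} rt_node_ptr;

#define InvalidRTNodePtr \
    ((rt_node_ptr) {.encoded = InvalidRTPointer, .decoded = NULL})
#define RTNodePtrIsValid(p)     (RTPointerIsValid((p).encoded))

/* header field accessors used throughout the hunks above */
#define NODE_KIND(p)        ((p).decoded->kind)
#define NODE_COUNT(p)       ((p).decoded->count)
#define NODE_SHIFT(p)       ((p).decoded->shift)
#define NODE_CHUNK(p)       ((p).decoded->chunk)
#define NODE_FANOUT(p)      ((p).decoded->fanout)

/* in backend-local memory the encoding is just a cast */
static inline rt_node *
rt_pointer_decode(rt_pointer encoded)
{
    return (rt_node *) encoded;
}

static inline rt_node_ptr
rt_node_ptr_encoded(rt_pointer encoded)
{
    return (rt_node_ptr) {.encoded = encoded,
                          .decoded = rt_pointer_decode(encoded)};
}

With that shape in mind, the mechanical changes in the diff (node.decoded casts, node.encoded stored into children[], rt_node_ptr_encoded() when following a child) read naturally.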
Attachment: v15-0004-Use-bitmapword-for-node-125.patch (application/octet-stream)
From 066eada2c94025a273fa0e49763c6817fcc1906a Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Tue, 6 Dec 2022 15:22:26 +0700
Subject: [PATCH v15 4/9] Use bitmapword for node-125
TODO: Rename macros copied from bitmapset.c
---
src/backend/lib/radixtree.c | 70 ++++++++++++++++++-------------------
1 file changed, 34 insertions(+), 36 deletions(-)
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
index e7f61fd943..abd0450727 100644
--- a/src/backend/lib/radixtree.c
+++ b/src/backend/lib/radixtree.c
@@ -62,6 +62,7 @@
#include "lib/radixtree.h"
#include "lib/stringinfo.h"
#include "miscadmin.h"
+#include "nodes/bitmapset.h"
#include "port/pg_bitutils.h"
#include "port/pg_lfind.h"
#include "utils/memutils.h"
@@ -103,6 +104,10 @@
#define RT_NODE_BITMAP_BYTE(v) ((v) / BITS_PER_BYTE)
#define RT_NODE_BITMAP_BIT(v) (UINT64CONST(1) << ((v) % RT_NODE_SPAN))
+/* FIXME rename */
+#define WORDNUM(x) ((x) / BITS_PER_BITMAPWORD)
+#define BITNUM(x) ((x) % BITS_PER_BITMAPWORD)
+
/* Enum used by rt_node_search() */
typedef enum
{
@@ -207,6 +212,9 @@ typedef struct rt_node_base125
/* The index of slots for each fanout */
uint8 slot_idxs[RT_NODE_MAX_SLOTS];
+
+ /* isset is a bitmap to track which slot is in use */
+ bitmapword isset[WORDNUM(128)];
} rt_node_base_125;
typedef struct rt_node_base256
@@ -271,9 +279,6 @@ typedef struct rt_node_leaf_125
{
rt_node_base_125 base;
- /* isset is a bitmap to track which slot is in use */
- uint8 isset[RT_NODE_NSLOTS_BITS(128)];
-
/* number of values depends on size class */
uint64 values[FLEXIBLE_ARRAY_MEMBER];
} rt_node_leaf_125;
@@ -655,13 +660,14 @@ node_125_is_chunk_used(rt_node_base_125 *node, uint8 chunk)
return node->slot_idxs[chunk] != RT_NODE_125_INVALID_IDX;
}
+#ifdef USE_ASSERT_CHECKING
/* Is the slot in the node used? */
static inline bool
node_inner_125_is_slot_used(rt_node_inner_125 *node, uint8 slot)
{
Assert(!NODE_IS_LEAF(node));
Assert(slot < node->base.n.fanout);
- return (node->children[slot] != NULL);
+ return (node->base.isset[WORDNUM(slot)] & ((bitmapword) 1 << BITNUM(slot))) != 0;
}
static inline bool
@@ -669,8 +675,9 @@ node_leaf_125_is_slot_used(rt_node_leaf_125 *node, uint8 slot)
{
Assert(NODE_IS_LEAF(node));
Assert(slot < node->base.n.fanout);
- return ((node->isset[RT_NODE_BITMAP_BYTE(slot)] & RT_NODE_BITMAP_BIT(slot)) != 0);
+ return (node->base.isset[WORDNUM(slot)] & ((bitmapword) 1 << BITNUM(slot))) != 0;
}
+#endif
static inline rt_node *
node_inner_125_get_child(rt_node_inner_125 *node, uint8 chunk)
@@ -690,7 +697,10 @@ node_leaf_125_get_value(rt_node_leaf_125 *node, uint8 chunk)
static void
node_inner_125_delete(rt_node_inner_125 *node, uint8 chunk)
{
+ int slotpos = node->base.slot_idxs[chunk];
+
Assert(!NODE_IS_LEAF(node));
+ node->base.isset[WORDNUM(slotpos)] &= ~((bitmapword) 1 << BITNUM(slotpos));
node->children[node->base.slot_idxs[chunk]] = NULL;
node->base.slot_idxs[chunk] = RT_NODE_125_INVALID_IDX;
}
@@ -701,44 +711,35 @@ node_leaf_125_delete(rt_node_leaf_125 *node, uint8 chunk)
int slotpos = node->base.slot_idxs[chunk];
Assert(NODE_IS_LEAF(node));
- node->isset[RT_NODE_BITMAP_BYTE(slotpos)] &= ~(RT_NODE_BITMAP_BIT(slotpos));
+ node->base.isset[WORDNUM(slotpos)] &= ~((bitmapword) 1 << BITNUM(slotpos));
node->base.slot_idxs[chunk] = RT_NODE_125_INVALID_IDX;
}
/* Return an unused slot in node-125 */
static int
-node_inner_125_find_unused_slot(rt_node_inner_125 *node, uint8 chunk)
-{
- int slotpos = 0;
-
- Assert(!NODE_IS_LEAF(node));
- while (node_inner_125_is_slot_used(node, slotpos))
- slotpos++;
-
- return slotpos;
-}
-
-static int
-node_leaf_125_find_unused_slot(rt_node_leaf_125 *node, uint8 chunk)
+node_125_find_unused_slot(bitmapword *isset)
{
int slotpos;
+ int idx;
+ bitmapword inverse;
- Assert(NODE_IS_LEAF(node));
-
- /* We iterate over the isset bitmap per byte then check each bit */
- for (slotpos = 0; slotpos < RT_NODE_NSLOTS_BITS(128); slotpos++)
+ /* get the first word with at least one bit not set */
+ for (idx = 0; idx < WORDNUM(128); idx++)
{
- if (node->isset[slotpos] < 0xFF)
+ if (isset[idx] < ~((bitmapword) 0))
break;
}
- Assert(slotpos < RT_NODE_NSLOTS_BITS(128));
- slotpos *= BITS_PER_BYTE;
- while (node_leaf_125_is_slot_used(node, slotpos))
- slotpos++;
+ /* To get the first unset bit in X, get the first set bit in ~X */
+ inverse = ~(isset[idx]);
+ slotpos = idx * BITS_PER_BITMAPWORD;
+ slotpos += bmw_rightmost_one_pos(inverse);
+
+ /* mark the slot used */
+ isset[idx] |= bmw_rightmost_one(inverse);
return slotpos;
-}
+}
static inline void
node_inner_125_insert(rt_node_inner_125 *node, uint8 chunk, rt_node *child)
@@ -747,8 +748,7 @@ node_inner_125_insert(rt_node_inner_125 *node, uint8 chunk, rt_node *child)
Assert(!NODE_IS_LEAF(node));
- /* find unused slot */
- slotpos = node_inner_125_find_unused_slot(node, chunk);
+ slotpos = node_125_find_unused_slot(node->base.isset);
Assert(slotpos < node->base.n.fanout);
node->base.slot_idxs[chunk] = slotpos;
@@ -763,12 +763,10 @@ node_leaf_125_insert(rt_node_leaf_125 *node, uint8 chunk, uint64 value)
Assert(NODE_IS_LEAF(node));
- /* find unused slot */
- slotpos = node_leaf_125_find_unused_slot(node, chunk);
+ slotpos = node_125_find_unused_slot(node->base.isset);
Assert(slotpos < node->base.n.fanout);
node->base.slot_idxs[chunk] = slotpos;
- node->isset[RT_NODE_BITMAP_BYTE(slotpos)] |= RT_NODE_BITMAP_BIT(slotpos);
node->values[slotpos] = value;
}
@@ -2395,9 +2393,9 @@ rt_dump_node(rt_node *node, int level, bool recurse)
rt_node_leaf_125 *n = (rt_node_leaf_125 *) node;
fprintf(stderr, ", isset-bitmap:");
- for (int i = 0; i < 16; i++)
+ for (int i = 0; i < WORDNUM(128); i++)
{
- fprintf(stderr, "%X ", (uint8) n->isset[i]);
+ fprintf(stderr, UINT64_FORMAT_HEX " ", n->base.isset[i]);
}
fprintf(stderr, "\n");
}
--
2.31.1
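The rewritten node_125_find_unused_slot() above depends on bmw_rightmost_one_pos() and bmw_rightmost_one(), which are not shown in this patch. Assuming they are the usual lowest-set-bit helpers, the slot search boils down to the standalone sketch below (plain uint64 words and a GCC builtin stand in for bitmapword and the bmw_* macros; as in the patch, the caller must already have checked that a free slot exists):

#include <stdint.h>

/* find and mark the lowest clear bit across an array of bitmap words */
static int
find_unused_slot(uint64_t *isset, int nwords)
{
    int         idx;
    uint64_t    inverse;

    /* find the first word that still has at least one clear bit */
    for (idx = 0; idx < nwords; idx++)
    {
        if (isset[idx] != UINT64_MAX)
            break;
    }

    /* the lowest set bit of ~X is the lowest clear bit of X */
    inverse = ~isset[idx];

    /* mark the slot used, as the patch version does before returning */
    isset[idx] |= inverse & (~inverse + 1);

    return idx * 64 + (int) __builtin_ctzll(inverse);
}

For a 125-slot node the bitmap is WORDNUM(128) words, so on 64-bit builds the loop runs at most twice.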
Attachment: v15-0001-introduce-vector8_min-and-vector8_highbit_mask.patch (application/octet-stream)
From ceaf56be51d2c686a795e1ab1ab40f701ed21d62 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 24 Oct 2022 14:07:09 +0900
Subject: [PATCH v15 1/9] introduce vector8_min and vector8_highbit_mask
---
src/include/port/simd.h | 47 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 47 insertions(+)
diff --git a/src/include/port/simd.h b/src/include/port/simd.h
index 61ae4ecf60..0b288c422a 100644
--- a/src/include/port/simd.h
+++ b/src/include/port/simd.h
@@ -77,6 +77,7 @@ static inline bool vector8_has(const Vector8 v, const uint8 c);
static inline bool vector8_has_zero(const Vector8 v);
static inline bool vector8_has_le(const Vector8 v, const uint8 c);
static inline bool vector8_is_highbit_set(const Vector8 v);
+static inline uint32 vector8_highbit_mask(const Vector8 v);
#ifndef USE_NO_SIMD
static inline bool vector32_is_highbit_set(const Vector32 v);
#endif
@@ -96,6 +97,7 @@ static inline Vector8 vector8_ssub(const Vector8 v1, const Vector8 v2);
*/
#ifndef USE_NO_SIMD
static inline Vector8 vector8_eq(const Vector8 v1, const Vector8 v2);
+static inline Vector8 vector8_min(const Vector8 v1, const Vector8 v2);
static inline Vector32 vector32_eq(const Vector32 v1, const Vector32 v2);
#endif
@@ -277,6 +279,36 @@ vector8_is_highbit_set(const Vector8 v)
#endif
}
+/*
+ * Return the bitmask of the high bit of each element.
+ */
+static inline uint32
+vector8_highbit_mask(const Vector8 v)
+{
+#ifdef USE_SSE2
+ return (uint32) _mm_movemask_epi8(v);
+#elif defined(USE_NEON)
+ static const uint8 mask[16] = {
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ };
+
+ uint8x16_t masked = vandq_u8(vld1q_u8(mask), (uint8x16_t) vshrq_n_s8(v, 7));
+ uint8x16_t maskedhi = vextq_u8(masked, masked, 8);
+
+ return (uint32) vaddvq_u16((uint16x8_t) vzip1q_u8(masked, maskedhi));
+#else
+ uint32 mask = 0;
+
+ for (Size i = 0; i < sizeof(Vector8); i++)
+ mask |= (((const uint8 *) &v)[i] >> 7) << i;
+
+ return mask;
+#endif
+}
+
/*
* Exactly like vector8_is_highbit_set except for the input type, so it
* looks at each byte separately.
@@ -372,4 +404,19 @@ vector32_eq(const Vector32 v1, const Vector32 v2)
}
#endif /* ! USE_NO_SIMD */
+/*
+ * Compare the given vectors and return the vector of minimum elements.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_min(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+ return _mm_min_epu8(v1, v2);
+#elif defined(USE_NEON)
+ return vminq_u8(v1, v2);
+#endif
+}
+#endif /* ! USE_NO_SIMD */
+
#endif /* SIMD_H */
--
2.31.1
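vector8_highbit_mask() is what lets node_32_search_eq() (added in the 0003 patch below; its SIMD branch is cut off at the end of this excerpt) turn a byte-wise comparison into an index. A minimal sketch of that idiom follows; it ignores the USE_NO_SIMD fallback, assumes a 16-byte Vector8, and is not the patch's actual code. A real caller must additionally reject matches at positions beyond the node's count, since trailing bytes are not valid chunks.

#include "postgres.h"
#include "port/pg_bitutils.h"
#include "port/simd.h"

static inline int
first_match_index(const uint8 *chunks, uint8 chunk)
{
    Vector8     haystack;
    Vector8     cmp;
    uint32      bitfield;

    vector8_load(&haystack, chunks);

    /* bytes equal to 'chunk' become 0xFF, i.e. their high bits are set */
    cmp = vector8_eq(haystack, vector8_broadcast(chunk));

    /* one bit per byte; bit i is set iff chunks[i] matched */
    bitfield = vector8_highbit_mask(cmp);

    if (bitfield == 0)
        return -1;

    /* position of the first matching byte */
    return pg_rightmost_one_pos32(bitfield);
}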
Attachment: v15-0003-Add-radix-implementation.patch (application/octet-stream)
From 6ba6c9979b2bd4fb5ef3c61d7a6edac1737e8509 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 14 Sep 2022 12:38:51 +0000
Subject: [PATCH v15 3/9] Add radix implementation.
---
src/backend/lib/Makefile | 1 +
src/backend/lib/meson.build | 1 +
src/backend/lib/radixtree.c | 2541 +++++++++++++++++
src/include/lib/radixtree.h | 42 +
src/test/modules/Makefile | 1 +
src/test/modules/meson.build | 1 +
src/test/modules/test_radixtree/.gitignore | 4 +
src/test/modules/test_radixtree/Makefile | 23 +
src/test/modules/test_radixtree/README | 7 +
.../expected/test_radixtree.out | 36 +
src/test/modules/test_radixtree/meson.build | 34 +
.../test_radixtree/sql/test_radixtree.sql | 7 +
.../test_radixtree/test_radixtree--1.0.sql | 8 +
.../modules/test_radixtree/test_radixtree.c | 581 ++++
.../test_radixtree/test_radixtree.control | 4 +
15 files changed, 3291 insertions(+)
create mode 100644 src/backend/lib/radixtree.c
create mode 100644 src/include/lib/radixtree.h
create mode 100644 src/test/modules/test_radixtree/.gitignore
create mode 100644 src/test/modules/test_radixtree/Makefile
create mode 100644 src/test/modules/test_radixtree/README
create mode 100644 src/test/modules/test_radixtree/expected/test_radixtree.out
create mode 100644 src/test/modules/test_radixtree/meson.build
create mode 100644 src/test/modules/test_radixtree/sql/test_radixtree.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree--1.0.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree.c
create mode 100644 src/test/modules/test_radixtree/test_radixtree.control
diff --git a/src/backend/lib/Makefile b/src/backend/lib/Makefile
index 9dad31398a..4c1db794b6 100644
--- a/src/backend/lib/Makefile
+++ b/src/backend/lib/Makefile
@@ -22,6 +22,7 @@ OBJS = \
integerset.o \
knapsack.o \
pairingheap.o \
+ radixtree.o \
rbtree.o \
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/lib/meson.build b/src/backend/lib/meson.build
index 48da1bddce..4303d306cd 100644
--- a/src/backend/lib/meson.build
+++ b/src/backend/lib/meson.build
@@ -9,4 +9,5 @@ backend_sources += files(
'knapsack.c',
'pairingheap.c',
'rbtree.c',
+ 'radixtree.c',
)
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
new file mode 100644
index 0000000000..e7f61fd943
--- /dev/null
+++ b/src/backend/lib/radixtree.c
@@ -0,0 +1,2541 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.c
+ * Implementation for adaptive radix tree.
+ *
+ * This module employs the idea from the paper "The Adaptive Radix Tree: ARTful
+ * Indexing for Main-Memory Databases" by Viktor Leis, Alfons Kemper, and Thomas
+ * Neumann, 2013. The radix tree uses adaptive node sizes, a small number of node
+ * types, each with a different numbers of elements. Depending on the number of
+ * children, the appropriate node type is used.
+ *
+ * There are some differences from the proposed implementation. For instance,
+ * there is no support for path compression or lazy path expansion. The radix
+ * tree supports a fixed key length, so we don't expect the tree level to
+ * be high.
+ *
+ * Both the key and the value are 64-bit unsigned integers. The inner nodes and
+ * the leaf nodes have slightly different structures: inner tree nodes
+ * (shift > 0) store the pointer to the child node as the value, while leaf nodes
+ * (shift == 0) store the 64-bit unsigned integer that is specified by the user as
+ * the value. The paper refers to this technique as "Multi-value leaves". We
+ * choose it to avoid an additional pointer traversal. It is the reason this code
+ * currently does not support variable-length keys.
+ *
+ * XXX: Most functions in this file have two variants for inner nodes and leaf
+ * nodes, therefore there is duplicated code. While this sometimes makes
+ * code maintenance tricky, it reduces branch prediction misses when judging
+ * whether the node is an inner node or a leaf node.
+ *
+ * XXX: radix tree nodes are never shrunk.
+ *
+ * Interface
+ * ---------
+ *
+ * rt_create - Create a new, empty radix tree
+ * rt_free - Free the radix tree
+ * rt_search - Search a key-value pair
+ * rt_set - Set a key-value pair
+ * rt_delete - Delete a key-value pair
+ * rt_begin_iterate - Begin iterating through all key-value pairs
+ * rt_iterate_next - Return next key-value pair, if any
+ * rt_end_iter - End iteration
+ * rt_memory_usage - Get the memory usage
+ * rt_num_entries - Get the number of key-value pairs
+ *
+ * rt_create() creates an empty radix tree in the given memory context
+ * and creates memory contexts for each kind of radix tree node under it.
+ *
+ * rt_iterate_next() returns key-value pairs in the ascending
+ * order of the key.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/lib/radixtree.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "lib/radixtree.h"
+#include "lib/stringinfo.h"
+#include "miscadmin.h"
+#include "port/pg_bitutils.h"
+#include "port/pg_lfind.h"
+#include "utils/memutils.h"
+
+#ifdef RT_DEBUG
+#define UINT64_FORMAT_HEX "%" INT64_MODIFIER "X"
+#endif
+
+/* The number of bits encoded in one tree level */
+#define RT_NODE_SPAN BITS_PER_BYTE
+
+/* The number of maximum slots in the node */
+#define RT_NODE_MAX_SLOTS (1 << RT_NODE_SPAN)
+
+/*
+ * Return the number of bits required to represent nslots slots, used
+ * nodes indexed by array lookup.
+ */
+#define RT_NODE_NSLOTS_BITS(nslots) ((nslots) / (sizeof(uint8) * BITS_PER_BYTE))
+
+/* Mask for extracting a chunk from the key */
+#define RT_CHUNK_MASK ((1 << RT_NODE_SPAN) - 1)
+
+/* Maximum shift the radix tree uses */
+#define RT_MAX_SHIFT key_get_shift(UINT64_MAX)
+
+/* Tree level the radix tree uses */
+#define RT_MAX_LEVEL ((sizeof(uint64) * BITS_PER_BYTE) / RT_NODE_SPAN)
+
+/* Invalid index used in node-125 */
+#define RT_NODE_125_INVALID_IDX 0xFF
+
+/* Get a chunk from the key */
+#define RT_GET_KEY_CHUNK(key, shift) ((uint8) (((key) >> (shift)) & RT_CHUNK_MASK))
+
+/*
+ * Mapping from the value to the bit in is-set bitmap in the node-256.
+ */
+#define RT_NODE_BITMAP_BYTE(v) ((v) / BITS_PER_BYTE)
+#define RT_NODE_BITMAP_BIT(v) (UINT64CONST(1) << ((v) % RT_NODE_SPAN))
+
+/* Enum used by rt_node_search() */
+typedef enum
+{
+ RT_ACTION_FIND = 0, /* find the key-value */
+ RT_ACTION_DELETE, /* delete the key-value */
+} rt_action;
+
+/*
+ * Supported radix tree node kinds and size classes.
+ *
+ * There are 4 node kinds and each node kind have one or two size classes,
+ * partial and full. The size classes in the same node kind have the same
+ * node structure but have the different number of fanout that is stored
+ * in 'fanout' of rt_node. For example in size class 15, when a 16th element
+ * is to be inserted, we allocate a larger area and memcpy the entire old
+ * node to it.
+ *
+ * This technique allows us to limit the node kinds to 4, which limits the
+ * number of cases in switch statements. It also allows a possible future
+ * optimization to encode the node kind in a pointer tag.
+ *
+ * These size classes have been chosen carefully so that they minimize the
+ * allocator padding in both the inner and leaf nodes on DSA.
+ */
+#define RT_NODE_KIND_4 0x00
+#define RT_NODE_KIND_32 0x01
+#define RT_NODE_KIND_125 0x02
+#define RT_NODE_KIND_256 0x03
+#define RT_NODE_KIND_COUNT 4
+
+typedef enum rt_size_class
+{
+ RT_CLASS_4_FULL = 0,
+ RT_CLASS_32_PARTIAL,
+ RT_CLASS_32_FULL,
+ RT_CLASS_125_FULL,
+ RT_CLASS_256
+
+#define RT_SIZE_CLASS_COUNT (RT_CLASS_256 + 1)
+} rt_size_class;
+
+/* Common type for all nodes types */
+typedef struct rt_node
+{
+ /*
+ * Number of children. We use uint16 to be able to indicate 256 children
+ * at the fanout of 8.
+ */
+ uint16 count;
+
+ /* Max number of children. We can use uint8 because we never need to store 256 */
+ /* WIP: if we don't have a variable sized node4, this should instead be in the base
+ types as needed, since saving every byte is crucial for the smallest node kind */
+ uint8 fanout;
+
+ /*
+ * Shift indicates which part of the key space is represented by this
+ * node. That is, the key is shifted by 'shift' and the lowest
+ * RT_NODE_SPAN bits are then represented in chunk.
+ */
+ uint8 shift;
+ uint8 chunk;
+
+ /* Node kind, one per search/set algorithm */
+ uint8 kind;
+} rt_node;
+#define NODE_IS_LEAF(n) (((rt_node *) (n))->shift == 0)
+#define NODE_IS_EMPTY(n) (((rt_node *) (n))->count == 0)
+#define VAR_NODE_HAS_FREE_SLOT(node) \
+ ((node)->base.n.count < (node)->base.n.fanout)
+#define FIXED_NODE_HAS_FREE_SLOT(node, class) \
+ ((node)->base.n.count < rt_size_class_info[class].fanout)
+
+/* Base type of each node kind for leaf and inner nodes */
+/* The base types must be able to accommodate the largest size
+class for variable-sized node kinds */
+typedef struct rt_node_base_4
+{
+ rt_node n;
+
+ /* 4 children, for key chunks */
+ uint8 chunks[4];
+} rt_node_base_4;
+
+typedef struct rt_node_base32
+{
+ rt_node n;
+
+ /* 32 children, for key chunks */
+ uint8 chunks[32];
+} rt_node_base_32;
+
+/*
+ * node-125 uses the slot_idxs array, an array of RT_NODE_MAX_SLOTS length, typically
+ * 256, to store indexes into a second array that contains up to 125 values (or
+ * child pointers in inner nodes).
+ */
+typedef struct rt_node_base125
+{
+ rt_node n;
+
+ /* The index of slots for each fanout */
+ uint8 slot_idxs[RT_NODE_MAX_SLOTS];
+} rt_node_base_125;
+
+typedef struct rt_node_base256
+{
+ rt_node n;
+} rt_node_base_256;
+
+/*
+ * Inner and leaf nodes.
+ *
+ * These are separate for two main reasons:
+ *
+ * 1) the value type might be different than something fitting into a pointer
+ * width type
+ * 2) Need to represent non-existing values in a key-type independent way.
+ *
+ * 1) is clearly worth being concerned about, but it's not clear 2) is as
+ * good. It might be better to just indicate non-existing entries the same way
+ * in inner nodes.
+ */
+typedef struct rt_node_inner_4
+{
+ rt_node_base_4 base;
+
+ /* number of children depends on size class */
+ rt_node *children[FLEXIBLE_ARRAY_MEMBER];
+} rt_node_inner_4;
+
+typedef struct rt_node_leaf_4
+{
+ rt_node_base_4 base;
+
+ /* number of values depends on size class */
+ uint64 values[FLEXIBLE_ARRAY_MEMBER];
+} rt_node_leaf_4;
+
+typedef struct rt_node_inner_32
+{
+ rt_node_base_32 base;
+
+ /* number of children depends on size class */
+ rt_node *children[FLEXIBLE_ARRAY_MEMBER];
+} rt_node_inner_32;
+
+typedef struct rt_node_leaf_32
+{
+ rt_node_base_32 base;
+
+ /* number of values depends on size class */
+ uint64 values[FLEXIBLE_ARRAY_MEMBER];
+} rt_node_leaf_32;
+
+typedef struct rt_node_inner_125
+{
+ rt_node_base_125 base;
+
+ /* number of children depends on size class */
+ rt_node *children[FLEXIBLE_ARRAY_MEMBER];
+} rt_node_inner_125;
+
+typedef struct rt_node_leaf_125
+{
+ rt_node_base_125 base;
+
+ /* isset is a bitmap to track which slot is in use */
+ uint8 isset[RT_NODE_NSLOTS_BITS(128)];
+
+ /* number of values depends on size class */
+ uint64 values[FLEXIBLE_ARRAY_MEMBER];
+} rt_node_leaf_125;
+
+/*
+ * node-256 is the largest node type. This node has an array of RT_NODE_MAX_SLOTS length
+ * for directly storing values (or child pointers in inner nodes).
+ */
+typedef struct rt_node_inner_256
+{
+ rt_node_base_256 base;
+
+ /* Slots for 256 children */
+ rt_node *children[RT_NODE_MAX_SLOTS];
+} rt_node_inner_256;
+
+typedef struct rt_node_leaf_256
+{
+ rt_node_base_256 base;
+
+ /* isset is a bitmap to track which slot is in use */
+ uint8 isset[RT_NODE_NSLOTS_BITS(RT_NODE_MAX_SLOTS)];
+
+ /* Slots for 256 values */
+ uint64 values[RT_NODE_MAX_SLOTS];
+} rt_node_leaf_256;
+
+/* Information for each size class */
+typedef struct rt_size_class_elem
+{
+ const char *name;
+ int fanout;
+
+ /* slab chunk size */
+ Size inner_size;
+ Size leaf_size;
+
+ /* slab block size */
+ Size inner_blocksize;
+ Size leaf_blocksize;
+} rt_size_class_elem;
+
+/*
+ * Calculate the slab blocksize so that we can allocate at least 32 chunks
+ * from the block.
+ */
+#define NODE_SLAB_BLOCK_SIZE(size) \
+ Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * (size), (size) * 32)
+static rt_size_class_elem rt_size_class_info[RT_SIZE_CLASS_COUNT] = {
+ [RT_CLASS_4_FULL] = {
+ .name = "radix tree node 4",
+ .fanout = 4,
+ .inner_size = sizeof(rt_node_inner_4) + 4 * sizeof(rt_node *),
+ .leaf_size = sizeof(rt_node_leaf_4) + 4 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_4) + 4 * sizeof(rt_node *)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_4) + 4 * sizeof(uint64)),
+ },
+ [RT_CLASS_32_PARTIAL] = {
+ .name = "radix tree node 15",
+ .fanout = 15,
+ .inner_size = sizeof(rt_node_inner_32) + 15 * sizeof(rt_node *),
+ .leaf_size = sizeof(rt_node_leaf_32) + 15 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_32) + 15 * sizeof(rt_node *)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_32) + 15 * sizeof(uint64)),
+ },
+ [RT_CLASS_32_FULL] = {
+ .name = "radix tree node 32",
+ .fanout = 32,
+ .inner_size = sizeof(rt_node_inner_32) + 32 * sizeof(rt_node *),
+ .leaf_size = sizeof(rt_node_leaf_32) + 32 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_32) + 32 * sizeof(rt_node *)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_32) + 32 * sizeof(uint64)),
+ },
+ [RT_CLASS_125_FULL] = {
+ .name = "radix tree node 125",
+ .fanout = 125,
+ .inner_size = sizeof(rt_node_inner_125) + 125 * sizeof(rt_node *),
+ .leaf_size = sizeof(rt_node_leaf_125) + 125 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_125) + 125 * sizeof(rt_node *)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_125) + 125 * sizeof(uint64)),
+ },
+ [RT_CLASS_256] = {
+ .name = "radix tree node 256",
+ .fanout = 256,
+ .inner_size = sizeof(rt_node_inner_256),
+ .leaf_size = sizeof(rt_node_leaf_256),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_256)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_256)),
+ },
+};
+
+/* Map from the node kind to its minimum size class */
+static rt_size_class kind_min_size_class[RT_NODE_KIND_COUNT] = {
+ [RT_NODE_KIND_4] = RT_CLASS_4_FULL,
+ [RT_NODE_KIND_32] = RT_CLASS_32_PARTIAL,
+ [RT_NODE_KIND_125] = RT_CLASS_125_FULL,
+ [RT_NODE_KIND_256] = RT_CLASS_256,
+};
+
+/*
+ * Iteration support.
+ *
+ * Iterating the radix tree returns each pair of key and value in the ascending
+ * order of the key. To support this, the we iterate nodes of each level.
+ *
+ * rt_node_iter struct is used to track the iteration within a node.
+ *
+ * rt_iter is the struct for iteration of the radix tree, and uses rt_node_iter
+ * in order to track the iteration of each level. During the iteration, we also
+ * construct the key whenever updating the node iteration information, e.g., when
+ * advancing the current index within the node or when moving to the next node
+ * at the same level.
+ */
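+/*
+ * Typical iteration usage (sketch):
+ *
+ *     rt_iter    *iter = rt_begin_iterate(tree);
+ *     uint64      key;
+ *     uint64      value;
+ *
+ *     while (rt_iterate_next(iter, &key, &value))
+ *         ... use key and value ...
+ *     rt_end_iterate(iter);
+ */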
+typedef struct rt_node_iter
+{
+ rt_node *node; /* current node being iterated */
+ int current_idx; /* current position. -1 for initial value */
+} rt_node_iter;
+
+struct rt_iter
+{
+ radix_tree *tree;
+
+ /* Track the iteration on nodes of each level */
+ rt_node_iter stack[RT_MAX_LEVEL];
+ int stack_len;
+
+ /* The key is being constructed during the iteration */
+ uint64 key;
+};
+
+/* A radix tree with nodes */
+struct radix_tree
+{
+ MemoryContext context;
+
+ rt_node *root;
+ uint64 max_val;
+ uint64 num_keys;
+
+ MemoryContextData *inner_slabs[RT_SIZE_CLASS_COUNT];
+ MemoryContextData *leaf_slabs[RT_SIZE_CLASS_COUNT];
+
+ /* statistics */
+#ifdef RT_DEBUG
+ int32 cnt[RT_SIZE_CLASS_COUNT];
+#endif
+};
+
+static void rt_new_root(radix_tree *tree, uint64 key);
+static rt_node *rt_alloc_node(radix_tree *tree, rt_size_class size_class, bool inner);
+static inline void rt_init_node(rt_node *node, uint8 kind, rt_size_class size_class,
+ bool inner);
+static void rt_free_node(radix_tree *tree, rt_node *node);
+static void rt_extend(radix_tree *tree, uint64 key);
+static inline bool rt_node_search_inner(rt_node *node, uint64 key, rt_action action,
+ rt_node **child_p);
+static inline bool rt_node_search_leaf(rt_node *node, uint64 key, rt_action action,
+ uint64 *value_p);
+static bool rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node,
+ uint64 key, rt_node *child);
+static bool rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
+ uint64 key, uint64 value);
+static inline rt_node *rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter);
+static inline bool rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
+ uint64 *value_p);
+static void rt_update_iter_stack(rt_iter *iter, rt_node *from_node, int from);
+static inline void rt_iter_update_key(rt_iter *iter, uint8 chunk, uint8 shift);
+
+/* verification (available only with assertion) */
+static void rt_verify_node(rt_node *node);
+
+/*
+ * Return the index of the first element in the node's chunk array that equals
+ * 'chunk'. Return -1 if there is no such element.
+ */
+static inline int
+node_4_search_eq(rt_node_base_4 *node, uint8 chunk)
+{
+ int idx = -1;
+
+ for (int i = 0; i < node->n.count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ idx = i;
+ break;
+ }
+ }
+
+ return idx;
+}
+
+/*
+ * Return the index at which 'chunk' should be inserted into the node's chunk array.
+ */
+static inline int
+node_4_get_insertpos(rt_node_base_4 *node, uint8 chunk)
+{
+ int idx;
+
+ for (idx = 0; idx < node->n.count; idx++)
+ {
+ if (node->chunks[idx] >= chunk)
+ break;
+ }
+
+ return idx;
+}
+
+/*
+ * Return the index of the first element in the node's chunk array that equals
+ * 'chunk'. Return -1 if there is no such element.
+ */
+static inline int
+node_32_search_eq(rt_node_base_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ uint32 bitfield;
+ int index_simd = -1;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index = -1;
+
+ for (int i = 0; i < count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ index = i;
+ break;
+ }
+ }
+#endif
+
+#ifndef USE_NO_SIMD
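+ /*
+ * Broadcast the search chunk into a vector and compare it against the
+ * node's chunk array loaded as two vectors.  The resulting bitfield is
+ * masked by 'count' so that any stale data beyond the last valid chunk
+ * is ignored.
+ */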
+ spread_chunk = vector8_broadcast(chunk);
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ cmp1 = vector8_eq(spread_chunk, haystack1);
+ cmp2 = vector8_eq(spread_chunk, haystack2);
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+/*
+ * Return the index at which 'chunk' should be inserted into the node's chunk array.
+ */
+static inline int
+node_32_get_insertpos(rt_node_base_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ Vector8 min1;
+ Vector8 min2;
+ uint32 bitfield;
+ int index_simd;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index;
+
+ for (index = 0; index < count; index++)
+ {
+ if (node->chunks[index] >= chunk)
+ break;
+ }
+#endif
+
+#ifndef USE_NO_SIMD
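+ /*
+ * To find the insertion point, take the element-wise minimum of the
+ * search chunk and the stored chunks; positions where the minimum equals
+ * the search chunk are those whose stored chunk is >= the search chunk.
+ * The lowest such position (after masking by 'count') is the insertion
+ * point, or 'count' if there is none.
+ */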
+ spread_chunk = vector8_broadcast(chunk);
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ min1 = vector8_min(spread_chunk, haystack1);
+ min2 = vector8_min(spread_chunk, haystack2);
+ cmp1 = vector8_eq(spread_chunk, min1);
+ cmp2 = vector8_eq(spread_chunk, min2);
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+ else
+ index_simd = count;
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+/*
+ * Functions to manipulate both chunks array and children/values array.
+ * These are used for node-4 and node-32.
+ */
+
+/* Shift the elements starting at 'idx' to the right by one */
+static inline void
+chunk_children_array_shift(uint8 *chunks, rt_node **children, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+ memmove(&(children[idx + 1]), &(children[idx]), sizeof(rt_node *) * (count - idx));
+}
+
+static inline void
+chunk_values_array_shift(uint8 *chunks, uint64 *values, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+ memmove(&(values[idx + 1]), &(values[idx]), sizeof(uint64) * (count - idx));
+}
+
+/* Delete the element at 'idx' */
+static inline void
+chunk_children_array_delete(uint8 *chunks, rt_node **children, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(children[idx]), &(children[idx + 1]), sizeof(rt_node *) * (count - idx - 1));
+}
+
+static inline void
+chunk_values_array_delete(uint8 *chunks, uint64 *values, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(values[idx]), &(values[idx + 1]), sizeof(uint64) * (count - idx - 1));
+}
+
+/* Copy both chunks and children/values arrays */
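+/*
+ * These are currently used only when growing a node-4 into a node-32, hence
+ * the hard-coded node-4 fanout below.
+ */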
+static inline void
+chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
+ uint8 *dst_chunks, rt_node **dst_children)
+{
+ const int fanout = rt_size_class_info[RT_CLASS_4_FULL].fanout;
+ const Size chunk_size = sizeof(uint8) * fanout;
+ const Size children_size = sizeof(rt_node *) * fanout;
+
+ memcpy(dst_chunks, src_chunks, chunk_size);
+ memcpy(dst_children, src_children, children_size);
+}
+
+static inline void
+chunk_values_array_copy(uint8 *src_chunks, uint64 *src_values,
+ uint8 *dst_chunks, uint64 *dst_values)
+{
+ const int fanout = rt_size_class_info[RT_CLASS_4_FULL].fanout;
+ const Size chunk_size = sizeof(uint8) * fanout;
+ const Size values_size = sizeof(uint64) * fanout;
+
+ memcpy(dst_chunks, src_chunks, chunk_size);
+ memcpy(dst_values, src_values, values_size);
+}
+
+/* Functions to manipulate inner and leaf node-125 */
+
+/* Does the given chunk in the node have a value? */
+static inline bool
+node_125_is_chunk_used(rt_node_base_125 *node, uint8 chunk)
+{
+ return node->slot_idxs[chunk] != RT_NODE_125_INVALID_IDX;
+}
+
+/* Is the slot in the node used? */
+static inline bool
+node_inner_125_is_slot_used(rt_node_inner_125 *node, uint8 slot)
+{
+ Assert(!NODE_IS_LEAF(node));
+ Assert(slot < node->base.n.fanout);
+ return (node->children[slot] != NULL);
+}
+
+static inline bool
+node_leaf_125_is_slot_used(rt_node_leaf_125 *node, uint8 slot)
+{
+ Assert(NODE_IS_LEAF(node));
+ Assert(slot < node->base.n.fanout);
+ return ((node->isset[RT_NODE_BITMAP_BYTE(slot)] & RT_NODE_BITMAP_BIT(slot)) != 0);
+}
+
+static inline rt_node *
+node_inner_125_get_child(rt_node_inner_125 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ return node->children[node->base.slot_idxs[chunk]];
+}
+
+static inline uint64
+node_leaf_125_get_value(rt_node_leaf_125 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ Assert(((rt_node_base_125 *) node)->slot_idxs[chunk] != RT_NODE_125_INVALID_IDX);
+ return node->values[node->base.slot_idxs[chunk]];
+}
+
+static void
+node_inner_125_delete(rt_node_inner_125 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->children[node->base.slot_idxs[chunk]] = NULL;
+ node->base.slot_idxs[chunk] = RT_NODE_125_INVALID_IDX;
+}
+
+static void
+node_leaf_125_delete(rt_node_leaf_125 *node, uint8 chunk)
+{
+ int slotpos = node->base.slot_idxs[chunk];
+
+ Assert(NODE_IS_LEAF(node));
+ node->isset[RT_NODE_BITMAP_BYTE(slotpos)] &= ~(RT_NODE_BITMAP_BIT(slotpos));
+ node->base.slot_idxs[chunk] = RT_NODE_125_INVALID_IDX;
+}
+
+/* Return an unused slot in node-125 */
+static int
+node_inner_125_find_unused_slot(rt_node_inner_125 *node, uint8 chunk)
+{
+ int slotpos = 0;
+
+ Assert(!NODE_IS_LEAF(node));
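+ /*
+ * Unlike the leaf variant, there is no isset bitmap here: a NULL child
+ * pointer marks an unused slot, so simply scan for the first one.
+ */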
+ while (node_inner_125_is_slot_used(node, slotpos))
+ slotpos++;
+
+ return slotpos;
+}
+
+static int
+node_leaf_125_find_unused_slot(rt_node_leaf_125 *node, uint8 chunk)
+{
+ int slotpos;
+
+ Assert(NODE_IS_LEAF(node));
+
+ /* We iterate over the isset bitmap per byte then check each bit */
+ for (slotpos = 0; slotpos < RT_NODE_NSLOTS_BITS(128); slotpos++)
+ {
+ if (node->isset[slotpos] < 0xFF)
+ break;
+ }
+ Assert(slotpos < RT_NODE_NSLOTS_BITS(128));
+
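+ /* Found a byte with a free bit; locate the first unused bit within it */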
+ slotpos *= BITS_PER_BYTE;
+ while (node_leaf_125_is_slot_used(node, slotpos))
+ slotpos++;
+
+ return slotpos;
+}
+
+static inline void
+node_inner_125_insert(rt_node_inner_125 *node, uint8 chunk, rt_node *child)
+{
+ int slotpos;
+
+ Assert(!NODE_IS_LEAF(node));
+
+ /* find unused slot */
+ slotpos = node_inner_125_find_unused_slot(node, chunk);
+ Assert(slotpos < node->base.n.fanout);
+
+ node->base.slot_idxs[chunk] = slotpos;
+ node->children[slotpos] = child;
+}
+
+/* Set the slot at the corresponding chunk */
+static inline void
+node_leaf_125_insert(rt_node_leaf_125 *node, uint8 chunk, uint64 value)
+{
+ int slotpos;
+
+ Assert(NODE_IS_LEAF(node));
+
+ /* find unused slot */
+ slotpos = node_leaf_125_find_unused_slot(node, chunk);
+ Assert(slotpos < node->base.n.fanout);
+
+ node->base.slot_idxs[chunk] = slotpos;
+ node->isset[RT_NODE_BITMAP_BYTE(slotpos)] |= RT_NODE_BITMAP_BIT(slotpos);
+ node->values[slotpos] = value;
+}
+
+/* Update the child corresponding to 'chunk' to 'child' */
+static inline void
+node_inner_125_update(rt_node_inner_125 *node, uint8 chunk, rt_node *child)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->children[node->base.slot_idxs[chunk]] = child;
+}
+
+static inline void
+node_leaf_125_update(rt_node_leaf_125 *node, uint8 chunk, uint64 value)
+{
+ Assert(NODE_IS_LEAF(node));
+ node->values[node->base.slot_idxs[chunk]] = value;
+}
+
+/* Functions to manipulate inner and leaf node-256 */
+
+/* Return true if the slot corresponding to the given chunk is in use */
+static inline bool
+node_inner_256_is_chunk_used(rt_node_inner_256 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ return (node->children[chunk] != NULL);
+}
+
+static inline bool
+node_leaf_256_is_chunk_used(rt_node_leaf_256 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ return (node->isset[RT_NODE_BITMAP_BYTE(chunk)] & RT_NODE_BITMAP_BIT(chunk)) != 0;
+}
+
+static inline rt_node *
+node_inner_256_get_child(rt_node_inner_256 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ Assert(node_inner_256_is_chunk_used(node, chunk));
+ return node->children[chunk];
+}
+
+static inline uint64
+node_leaf_256_get_value(rt_node_leaf_256 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ Assert(node_leaf_256_is_chunk_used(node, chunk));
+ return node->values[chunk];
+}
+
+/* Set the child in the node-256 */
+static inline void
+node_inner_256_set(rt_node_inner_256 *node, uint8 chunk, rt_node *child)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->children[chunk] = child;
+}
+
+/* Set the value in the node-256 */
+static inline void
+node_leaf_256_set(rt_node_leaf_256 *node, uint8 chunk, uint64 value)
+{
+ Assert(NODE_IS_LEAF(node));
+ node->isset[RT_NODE_BITMAP_BYTE(chunk)] |= RT_NODE_BITMAP_BIT(chunk);
+ node->values[chunk] = value;
+}
+
+/* Clear the slot at the given chunk position */
+static inline void
+node_inner_256_delete(rt_node_inner_256 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->children[chunk] = NULL;
+}
+
+static inline void
+node_leaf_256_delete(rt_node_leaf_256 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ node->isset[RT_NODE_BITMAP_BYTE(chunk)] &= ~(RT_NODE_BITMAP_BIT(chunk));
+}
+
+/*
+ * Return the shift needed to store the given key.
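+ * For example, with the 8-bit span used here, the shift is 0 for keys up to
+ * 0xFF, 8 for keys up to 0xFFFF, and so on.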
+ */
+static inline int
+key_get_shift(uint64 key)
+{
+ return (key == 0)
+ ? 0
+ : (pg_leftmost_one_pos64(key) / RT_NODE_SPAN) * RT_NODE_SPAN;
+}
+
+/*
+ * Return the maximum key value that can be stored in a tree whose root node
+ * has the given shift.
+ */
+static uint64
+shift_get_max_val(int shift)
+{
+ if (shift == RT_MAX_SHIFT)
+ return UINT64_MAX;
+
+ return (UINT64CONST(1) << (shift + RT_NODE_SPAN)) - 1;
+}
+
+/*
+ * Create a new node as the root. Subordinate nodes will be created during
+ * the insertion.
+ */
+static void
+rt_new_root(radix_tree *tree, uint64 key)
+{
+ int shift = key_get_shift(key);
+ bool inner = shift > 0;
+ rt_node *newnode;
+
+ newnode = rt_alloc_node(tree, RT_CLASS_4_FULL, inner);
+ rt_init_node(newnode, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
+ newnode->shift = shift;
+ tree->max_val = shift_get_max_val(shift);
+ tree->root = newnode;
+}
+
+/*
+ * Allocate a new node with the given size class.
+ */
+static rt_node *
+rt_alloc_node(radix_tree *tree, rt_size_class size_class, bool inner)
+{
+ rt_node *newnode;
+
+ if (inner)
+ newnode = (rt_node *) MemoryContextAlloc(tree->inner_slabs[size_class],
+ rt_size_class_info[size_class].inner_size);
+ else
+ newnode = (rt_node *) MemoryContextAlloc(tree->leaf_slabs[size_class],
+ rt_size_class_info[size_class].leaf_size);
+
+#ifdef RT_DEBUG
+ /* update the statistics */
+ tree->cnt[size_class]++;
+#endif
+
+ return newnode;
+}
+
+/* Initialize the node contents */
+static inline void
+rt_init_node(rt_node *node, uint8 kind, rt_size_class size_class, bool inner)
+{
+ if (inner)
+ MemSet(node, 0, rt_size_class_info[size_class].inner_size);
+ else
+ MemSet(node, 0, rt_size_class_info[size_class].leaf_size);
+
+ node->kind = kind;
+ node->fanout = rt_size_class_info[size_class].fanout;
+
+ /* Initialize slot_idxs to invalid values */
+ if (kind == RT_NODE_KIND_125)
+ {
+ rt_node_base_125 *n125 = (rt_node_base_125 *) node;
+
+ memset(n125->slot_idxs, RT_NODE_125_INVALID_IDX, sizeof(n125->slot_idxs));
+ }
+
+ /*
+ * Technically the fanout is 256, but we cannot store that in a uint8,
+ * and since this is the largest size class the node will never need to grow.
+ */
+ if (kind == RT_NODE_KIND_256)
+ node->fanout = 0;
+}
+
+static inline void
+rt_copy_node(rt_node *newnode, rt_node *oldnode)
+{
+ newnode->shift = oldnode->shift;
+ newnode->chunk = oldnode->chunk;
+ newnode->count = oldnode->count;
+}
+
+/*
+ * Create a new node with 'new_kind' and the same shift, chunk, and
+ * count as 'node'.
+ */
+static rt_node *
+rt_grow_node_kind(radix_tree *tree, rt_node *node, uint8 new_kind)
+{
+ rt_node *newnode;
+ bool inner = !NODE_IS_LEAF(node);
+
+ newnode = rt_alloc_node(tree, kind_min_size_class[new_kind], inner);
+ rt_init_node(newnode, new_kind, kind_min_size_class[new_kind], inner);
+ rt_copy_node(newnode, node);
+
+ return newnode;
+}
+
+/* Free the given node */
+static void
+rt_free_node(radix_tree *tree, rt_node *node)
+{
+ /* If we're deleting the root node, make the tree empty */
+ if (tree->root == node)
+ {
+ tree->root = NULL;
+ tree->max_val = 0;
+ }
+
+#ifdef RT_DEBUG
+ {
+ int i;
+
+ /* update the statistics */
+ for (i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ if (node->fanout == rt_size_class_info[i].fanout)
+ break;
+ }
+
+ /* fanout of node256 is intentionally 0 */
+ if (i == RT_SIZE_CLASS_COUNT)
+ i = RT_CLASS_256;
+
+ tree->cnt[i]--;
+ Assert(tree->cnt[i] >= 0);
+ }
+#endif
+
+ pfree(node);
+}
+
+/*
+ * Replace old_child with new_child, and free the old one.
+ */
+static void
+rt_replace_node(radix_tree *tree, rt_node *parent, rt_node *old_child,
+ rt_node *new_child, uint64 key)
+{
+ Assert(old_child->chunk == new_child->chunk);
+ Assert(old_child->shift == new_child->shift);
+
+ if (parent == old_child)
+ {
+ /* Replace the root node with the new large node */
+ tree->root = new_child;
+ }
+ else
+ {
+ bool replaced PG_USED_FOR_ASSERTS_ONLY;
+
+ replaced = rt_node_insert_inner(tree, NULL, parent, key, new_child);
+ Assert(replaced);
+ }
+
+ rt_free_node(tree, old_child);
+}
+
+/*
+ * The radix tree doesn't have sufficient height. Extend the radix tree so it can
+ * store the key.
+ */
+static void
+rt_extend(radix_tree *tree, uint64 key)
+{
+ int target_shift;
+ int shift = tree->root->shift + RT_NODE_SPAN;
+
+ target_shift = key_get_shift(key);
+
+ /* Grow tree from 'shift' to 'target_shift' */
+ while (shift <= target_shift)
+ {
+ rt_node_inner_4 *node;
+
+ node = (rt_node_inner_4 *) rt_alloc_node(tree, RT_CLASS_4_FULL, true);
+ rt_init_node((rt_node *) node, RT_NODE_KIND_4, RT_CLASS_4_FULL, true);
+ node->base.n.shift = shift;
+ node->base.n.count = 1;
+ node->base.chunks[0] = 0;
+ node->children[0] = tree->root;
+
+ tree->root->chunk = 0;
+ tree->root = (rt_node *) node;
+
+ shift += RT_NODE_SPAN;
+ }
+
+ tree->max_val = shift_get_max_val(target_shift);
+}
+
+/*
+ * The radix tree is missing the inner and leaf nodes needed for the given
+ * key-value pair. Create inner nodes from 'node' down to the bottom, then
+ * insert the value into the new leaf.
+ */
+static inline void
+rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node *parent,
+ rt_node *node)
+{
+ int shift = node->shift;
+
+ while (shift >= RT_NODE_SPAN)
+ {
+ rt_node *newchild;
+ int newshift = shift - RT_NODE_SPAN;
+ bool inner = newshift > 0;
+
+ newchild = rt_alloc_node(tree, RT_CLASS_4_FULL, inner);
+ rt_init_node(newchild, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
+ newchild->shift = newshift;
+ newchild->chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ rt_node_insert_inner(tree, parent, node, key, newchild);
+
+ parent = node;
+ node = newchild;
+ shift -= RT_NODE_SPAN;
+ }
+
+ rt_node_insert_leaf(tree, parent, node, key, value);
+ tree->num_keys++;
+}
+
+/*
+ * Search for the child pointer corresponding to 'key' in the given node, and
+ * do the specified 'action'.
+ *
+ * Return true if the key is found, otherwise return false. On success, the
+ * child pointer is stored in *child_p.
+ */
+static inline bool
+rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **child_p)
+{
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool found = false;
+ rt_node *child = NULL;
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+ int idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
+
+ if (idx < 0)
+ break;
+
+ found = true;
+
+ if (action == RT_ACTION_FIND)
+ child = n4->children[idx];
+ else /* RT_ACTION_DELETE */
+ chunk_children_array_delete(n4->base.chunks, n4->children,
+ n4->base.n.count, idx);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+ int idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
+
+ if (idx < 0)
+ break;
+
+ found = true;
+ if (action == RT_ACTION_FIND)
+ child = n32->children[idx];
+ else /* RT_ACTION_DELETE */
+ chunk_children_array_delete(n32->base.chunks, n32->children,
+ n32->base.n.count, idx);
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ rt_node_inner_125 *n125 = (rt_node_inner_125 *) node;
+
+ if (!node_125_is_chunk_used((rt_node_base_125 *) n125, chunk))
+ break;
+
+ found = true;
+
+ if (action == RT_ACTION_FIND)
+ child = node_inner_125_get_child(n125, chunk);
+ else /* RT_ACTION_DELETE */
+ node_inner_125_delete(n125, chunk);
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+
+ if (!node_inner_256_is_chunk_used(n256, chunk))
+ break;
+
+ found = true;
+ if (action == RT_ACTION_FIND)
+ child = node_inner_256_get_child(n256, chunk);
+ else /* RT_ACTION_DELETE */
+ node_inner_256_delete(n256, chunk);
+
+ break;
+ }
+ }
+
+ /* update statistics */
+ if (action == RT_ACTION_DELETE && found)
+ node->count--;
+
+ if (found && child_p)
+ *child_p = child;
+
+ return found;
+}
+
+/*
+ * Search for the value corresponding to 'key' in the given node, and do the
+ * specified 'action'.
+ *
+ * Return true if the key is found, otherwise return false. On success, the
+ * value is stored in *value_p.
+ */
+static inline bool
+rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p)
+{
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool found = false;
+ uint64 value = 0;
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+ int idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
+
+ if (idx < 0)
+ break;
+
+ found = true;
+
+ if (action == RT_ACTION_FIND)
+ value = n4->values[idx];
+ else /* RT_ACTION_DELETE */
+ chunk_values_array_delete(n4->base.chunks, (uint64 *) n4->values,
+ n4->base.n.count, idx);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+ int idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
+
+ if (idx < 0)
+ break;
+
+ found = true;
+ if (action == RT_ACTION_FIND)
+ value = n32->values[idx];
+ else /* RT_ACTION_DELETE */
+ chunk_values_array_delete(n32->base.chunks, (uint64 *) n32->values,
+ n32->base.n.count, idx);
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) node;
+
+ if (!node_125_is_chunk_used((rt_node_base_125 *) n125, chunk))
+ break;
+
+ found = true;
+
+ if (action == RT_ACTION_FIND)
+ value = node_leaf_125_get_value(n125, chunk);
+ else /* RT_ACTION_DELETE */
+ node_leaf_125_delete(n125, chunk);
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+
+ if (!node_leaf_256_is_chunk_used(n256, chunk))
+ break;
+
+ found = true;
+ if (action == RT_ACTION_FIND)
+ value = node_leaf_256_get_value(n256, chunk);
+ else /* RT_ACTION_DELETE */
+ node_leaf_256_delete(n256, chunk);
+
+ break;
+ }
+ }
+
+ /* update statistics */
+ if (action == RT_ACTION_DELETE && found)
+ node->count--;
+
+ if (found && value_p)
+ *value_p = value;
+
+ return found;
+}
+
+/* Insert the child into the inner node */
+static bool
+rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 key,
+ rt_node *child)
+{
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool chunk_exists = false;
+
+ Assert(!NODE_IS_LEAF(node));
+
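+ /*
+ * Note on the switch below: when a node becomes full, it is grown to the
+ * next larger kind and control falls through to that kind's case to
+ * perform the insertion on the new node.
+ */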
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+ int idx;
+
+ idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n4->children[idx] = child;
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n4)))
+ {
+ rt_node_inner_32 *new32;
+ Assert(parent != NULL);
+
+ /* grow node from 4 to 32 */
+ new32 = (rt_node_inner_32 *) rt_grow_node_kind(tree, (rt_node *) n4,
+ RT_NODE_KIND_32);
+ chunk_children_array_copy(n4->base.chunks, n4->children,
+ new32->base.chunks, new32->children);
+
+ Assert(parent != NULL);
+ rt_replace_node(tree, parent, (rt_node *) n4, (rt_node *) new32,
+ key);
+ node = (rt_node *) new32;
+ }
+ else
+ {
+ int insertpos = node_4_get_insertpos((rt_node_base_4 *) n4, chunk);
+ uint16 count = n4->base.n.count;
+
+ /* shift chunks and children */
+ if (count != 0 && insertpos < count)
+ chunk_children_array_shift(n4->base.chunks, n4->children,
+ count, insertpos);
+
+ n4->base.chunks[insertpos] = chunk;
+ n4->children[insertpos] = child;
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_32:
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+ int idx;
+
+ idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n32->children[idx] = child;
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)))
+ {
+ Assert(parent != NULL);
+
+ if (n32->base.n.count == rt_size_class_info[RT_CLASS_32_PARTIAL].fanout)
+ {
+ /* use the same node kind, but expand to the next size class */
+ const Size size = rt_size_class_info[RT_CLASS_32_PARTIAL].inner_size;
+ const int fanout = rt_size_class_info[RT_CLASS_32_FULL].fanout;
+ rt_node_inner_32 *new32;
+
+ new32 = (rt_node_inner_32 *) rt_alloc_node(tree, RT_CLASS_32_FULL, true);
+ memcpy(new32, n32, size);
+ new32->base.n.fanout = fanout;
+
+ rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new32, key);
+
+ /* must update both pointers here */
+ node = (rt_node *) new32;
+ n32 = new32;
+
+ goto retry_insert_inner_32;
+ }
+ else
+ {
+ rt_node_inner_125 *new125;
+
+ /* grow node from 32 to 125 */
+ new125 = (rt_node_inner_125 *) rt_grow_node_kind(tree, (rt_node *) n32,
+ RT_NODE_KIND_125);
+ for (int i = 0; i < n32->base.n.count; i++)
+ node_inner_125_insert(new125, n32->base.chunks[i], n32->children[i]);
+
+ rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new125, key);
+ node = (rt_node *) new125;
+ }
+ }
+ else
+ {
+retry_insert_inner_32:
+ {
+ int insertpos = node_32_get_insertpos((rt_node_base_32 *) n32, chunk);
+ int16 count = n32->base.n.count;
+
+ if (count != 0 && insertpos < count)
+ chunk_children_array_shift(n32->base.chunks, n32->children,
+ count, insertpos);
+
+ n32->base.chunks[insertpos] = chunk;
+ n32->children[insertpos] = child;
+ break;
+ }
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_125:
+ {
+ rt_node_inner_125 *n125 = (rt_node_inner_125 *) node;
+ int cnt = 0;
+
+ if (node_125_is_chunk_used((rt_node_base_125 *) n125, chunk))
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ node_inner_125_update(n125, chunk, child);
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n125)))
+ {
+ rt_node_inner_256 *new256;
+ Assert(parent != NULL);
+
+ /* grow node from 125 to 256 */
+ new256 = (rt_node_inner_256 *) rt_grow_node_kind(tree, (rt_node *) n125,
+ RT_NODE_KIND_256);
+ for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n125->base.n.count; i++)
+ {
+ if (!node_125_is_chunk_used((rt_node_base_125 *) n125, i))
+ continue;
+
+ node_inner_256_set(new256, i, node_inner_125_get_child(n125, i));
+ cnt++;
+ }
+
+ rt_replace_node(tree, parent, (rt_node *) n125, (rt_node *) new256,
+ key);
+ node = (rt_node *) new256;
+ }
+ else
+ {
+ node_inner_125_insert(n125, chunk, child);
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_256:
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+
+ chunk_exists = node_inner_256_is_chunk_used(n256, chunk);
+ Assert(chunk_exists || FIXED_NODE_HAS_FREE_SLOT(n256, RT_CLASS_256));
+
+ node_inner_256_set(n256, chunk, child);
+ break;
+ }
+ }
+
+ /* Update statistics */
+ if (!chunk_exists)
+ node->count++;
+
+ /*
+ * Done. Finally, verify that the chunk and child were inserted or
+ * replaced properly in the node.
+ */
+ rt_verify_node(node);
+
+ return chunk_exists;
+}
+
+/* Insert the value into the leaf node */
+static bool
+rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
+ uint64 key, uint64 value)
+{
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool chunk_exists = false;
+
+ Assert(NODE_IS_LEAF(node));
+
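+ /* As in rt_node_insert_inner, full nodes are grown and fall through to the next case */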
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+ int idx;
+
+ idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n4->values[idx] = value;
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n4)))
+ {
+ rt_node_leaf_32 *new32;
+ Assert(parent != NULL);
+
+ /* grow node from 4 to 32 */
+ new32 = (rt_node_leaf_32 *) rt_grow_node_kind(tree, (rt_node *) n4,
+ RT_NODE_KIND_32);
+ chunk_values_array_copy(n4->base.chunks, n4->values,
+ new32->base.chunks, new32->values);
+ rt_replace_node(tree, parent, (rt_node *) n4, (rt_node *) new32, key);
+ node = (rt_node *) new32;
+ }
+ else
+ {
+ int insertpos = node_4_get_insertpos((rt_node_base_4 *) n4, chunk);
+ int count = n4->base.n.count;
+
+ /* shift chunks and values */
+ if (count != 0 && insertpos < count)
+ chunk_values_array_shift(n4->base.chunks, n4->values,
+ count, insertpos);
+
+ n4->base.chunks[insertpos] = chunk;
+ n4->values[insertpos] = value;
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_32:
+ {
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+ int idx;
+
+ idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n32->values[idx] = value;
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)))
+ {
+ Assert(parent != NULL);
+
+ if (n32->base.n.count == rt_size_class_info[RT_CLASS_32_PARTIAL].fanout)
+ {
+ /* use the same node kind, but expand to the next size class */
+ const Size size = rt_size_class_info[RT_CLASS_32_PARTIAL].leaf_size;
+ const int fanout = rt_size_class_info[RT_CLASS_32_FULL].fanout;
+ rt_node_leaf_32 *new32;
+
+ new32 = (rt_node_leaf_32 *) rt_alloc_node(tree, RT_CLASS_32_FULL, false);
+ memcpy(new32, n32, size);
+ new32->base.n.fanout = fanout;
+
+ rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new32, key);
+
+ /* must update both pointers here */
+ node = (rt_node *) new32;
+ n32 = new32;
+
+ goto retry_insert_leaf_32;
+ }
+ else
+ {
+ rt_node_leaf_125 *new125;
+
+ /* grow node from 32 to 125 */
+ new125 = (rt_node_leaf_125 *) rt_grow_node_kind(tree, (rt_node *) n32,
+ RT_NODE_KIND_125);
+ for (int i = 0; i < n32->base.n.count; i++)
+ node_leaf_125_insert(new125, n32->base.chunks[i], n32->values[i]);
+
+ rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new125,
+ key);
+ node = (rt_node *) new125;
+ }
+ }
+ else
+ {
+ retry_insert_leaf_32:
+ {
+ int insertpos = node_32_get_insertpos((rt_node_base_32 *) n32, chunk);
+ int count = n32->base.n.count;
+
+ if (count != 0 && insertpos < count)
+ chunk_values_array_shift(n32->base.chunks, n32->values,
+ count, insertpos);
+
+ n32->base.chunks[insertpos] = chunk;
+ n32->values[insertpos] = value;
+ break;
+ }
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_125:
+ {
+ rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) node;
+ int cnt = 0;
+
+ if (node_125_is_chunk_used((rt_node_base_125 *) n125, chunk))
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ node_leaf_125_update(n125, chunk, value);
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n125)))
+ {
+ rt_node_leaf_256 *new256;
+ Assert(parent != NULL);
+
+ /* grow node from 125 to 256 */
+ new256 = (rt_node_leaf_256 *) rt_grow_node_kind(tree, (rt_node *) n125,
+ RT_NODE_KIND_256);
+ for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n125->base.n.count; i++)
+ {
+ if (!node_125_is_chunk_used((rt_node_base_125 *) n125, i))
+ continue;
+
+ node_leaf_256_set(new256, i, node_leaf_125_get_value(n125, i));
+ cnt++;
+ }
+
+ rt_replace_node(tree, parent, (rt_node *) n125, (rt_node *) new256,
+ key);
+ node = (rt_node *) new256;
+ }
+ else
+ {
+ node_leaf_125_insert(n125, chunk, value);
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_256:
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+
+ chunk_exists = node_leaf_256_is_chunk_used(n256, chunk);
+ Assert(chunk_exists || FIXED_NODE_HAS_FREE_SLOT(n256, RT_CLASS_256));
+
+ node_leaf_256_set(n256, chunk, value);
+ break;
+ }
+ }
+
+ /* Update statistics */
+ if (!chunk_exists)
+ node->count++;
+
+ /*
+ * Done. Finally, verify that the chunk and value were inserted or
+ * replaced properly in the node.
+ */
+ rt_verify_node(node);
+
+ return chunk_exists;
+}
+
+/*
+ * Create the radix tree in the given memory context and return it.
+ */
+radix_tree *
+rt_create(MemoryContext ctx)
+{
+ radix_tree *tree;
+ MemoryContext old_ctx;
+
+ old_ctx = MemoryContextSwitchTo(ctx);
+
+ tree = palloc(sizeof(radix_tree));
+ tree->context = ctx;
+ tree->root = NULL;
+ tree->max_val = 0;
+ tree->num_keys = 0;
+
+ /* Create the slab allocator for each size class */
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ tree->inner_slabs[i] = SlabContextCreate(ctx,
+ rt_size_class_info[i].name,
+ rt_size_class_info[i].inner_blocksize,
+ rt_size_class_info[i].inner_size);
+ tree->leaf_slabs[i] = SlabContextCreate(ctx,
+ rt_size_class_info[i].name,
+ rt_size_class_info[i].leaf_blocksize,
+ rt_size_class_info[i].leaf_size);
+#ifdef RT_DEBUG
+ tree->cnt[i] = 0;
+#endif
+ }
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return tree;
+}
+
+/*
+ * Free the given radix tree.
+ */
+void
+rt_free(radix_tree *tree)
+{
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ MemoryContextDelete(tree->inner_slabs[i]);
+ MemoryContextDelete(tree->leaf_slabs[i]);
+ }
+
+ pfree(tree);
+}
+
+/*
+ * Set 'key' to 'value'. If the entry already exists, update its value to
+ * 'value' and return true; otherwise insert a new entry and return false.
+ */
+bool
+rt_set(radix_tree *tree, uint64 key, uint64 value)
+{
+ int shift;
+ bool updated;
+ rt_node *node;
+ rt_node *parent;
+
+ /* Empty tree, create the root */
+ if (!tree->root)
+ rt_new_root(tree, key);
+
+ /* Extend the tree if necessary */
+ if (key > tree->max_val)
+ rt_extend(tree, key);
+
+ Assert(tree->root);
+
+ shift = tree->root->shift;
+ node = parent = tree->root;
+
+ /* Descend the tree until a leaf node */
+ while (shift >= 0)
+ {
+ rt_node *child;
+
+ if (NODE_IS_LEAF(node))
+ break;
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ {
+ rt_set_extend(tree, key, value, parent, node);
+ return false;
+ }
+
+ parent = node;
+ node = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ updated = rt_node_insert_leaf(tree, parent, node, key, value);
+
+ /* Update the statistics */
+ if (!updated)
+ tree->num_keys++;
+
+ return updated;
+}
+
+/*
+ * Search for the given key in the radix tree. Return true if the key is found,
+ * otherwise return false. On success, the value is stored in *value_p, so
+ * value_p must not be NULL.
+ */
+bool
+rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
+{
+ rt_node *node;
+ int shift;
+
+ Assert(value_p != NULL);
+
+ if (!tree->root || key > tree->max_val)
+ return false;
+
+ node = tree->root;
+ shift = tree->root->shift;
+
+ /* Descend the tree until a leaf node */
+ while (shift >= 0)
+ {
+ rt_node *child;
+
+ if (NODE_IS_LEAF(node))
+ break;
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ return false;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ return rt_node_search_leaf(node, key, RT_ACTION_FIND, value_p);
+}
+
+/*
+ * Delete the given key from the radix tree. Return true if the key is found (and
+ * deleted), otherwise do nothing and return false.
+ */
+bool
+rt_delete(radix_tree *tree, uint64 key)
+{
+ rt_node *node;
+ rt_node *stack[RT_MAX_LEVEL] = {0};
+ int shift;
+ int level;
+ bool deleted;
+
+ if (!tree->root || key > tree->max_val)
+ return false;
+
+ /*
+ * Descend the tree to search the key while building a stack of nodes we
+ * visited.
+ */
+ node = tree->root;
+ shift = tree->root->shift;
+ level = -1;
+ while (shift > 0)
+ {
+ rt_node *child;
+
+ /* Push the current node to the stack */
+ stack[++level] = node;
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ return false;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ /* Delete the key from the leaf node if exists */
+ Assert(NODE_IS_LEAF(node));
+ deleted = rt_node_search_leaf(node, key, RT_ACTION_DELETE, NULL);
+
+ if (!deleted)
+ {
+ /* no key is found in the leaf node */
+ return false;
+ }
+
+ /* Found the key to delete. Update the statistics */
+ tree->num_keys--;
+
+ /*
+ * Return if the leaf node still has keys and we don't need to delete the
+ * node.
+ */
+ if (!NODE_IS_EMPTY(node))
+ return true;
+
+ /* Free the empty leaf node */
+ rt_free_node(tree, node);
+
+ /* Delete the key from the inner nodes, walking back up the stack */
+ while (level >= 0)
+ {
+ node = stack[level--];
+
+ deleted = rt_node_search_inner(node, key, RT_ACTION_DELETE, NULL);
+ Assert(deleted);
+
+ /* If the node didn't become empty, we stop deleting the key */
+ if (!NODE_IS_EMPTY(node))
+ break;
+
+ /* The node became empty */
+ rt_free_node(tree, node);
+ }
+
+ return true;
+}
+
+/* Create and return the iterator for the given radix tree */
+rt_iter *
+rt_begin_iterate(radix_tree *tree)
+{
+ MemoryContext old_ctx;
+ rt_iter *iter;
+ int top_level;
+
+ old_ctx = MemoryContextSwitchTo(tree->context);
+
+ iter = (rt_iter *) palloc0(sizeof(rt_iter));
+ iter->tree = tree;
+
+ /* empty tree */
+ if (!iter->tree->root)
+ return iter;
+
+ top_level = iter->tree->root->shift / RT_NODE_SPAN;
+ iter->stack_len = top_level;
+
+ /*
+ * Descend from the root to the leftmost leaf node. The key is
+ * constructed while descending to the leaf.
+ */
+ rt_update_iter_stack(iter, iter->tree->root, top_level);
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return iter;
+}
+
+/*
+ * Update each node_iter for inner nodes in the iterator node stack.
+ */
+static void
+rt_update_iter_stack(rt_iter *iter, rt_node *from_node, int from)
+{
+ int level = from;
+ rt_node *node = from_node;
+
+ for (;;)
+ {
+ rt_node_iter *node_iter = &(iter->stack[level--]);
+
+ node_iter->node = node;
+ node_iter->current_idx = -1;
+
+ /* We don't advance the leaf node iterator here */
+ if (NODE_IS_LEAF(node))
+ return;
+
+ /* Advance to the next slot in the inner node */
+ node = rt_node_inner_iterate_next(iter, node_iter);
+
+ /* We must find the first child in the node */
+ Assert(node);
+ }
+}
+
+/*
+ * If there is a next key, set *key_p and *value_p and return true. Otherwise,
+ * return false.
+ */
+bool
+rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p)
+{
+ /* Empty tree */
+ if (!iter->tree->root)
+ return false;
+
+ for (;;)
+ {
+ rt_node *child = NULL;
+ uint64 value;
+ int level;
+ bool found;
+
+ /* Advance the leaf node iterator to get next key-value pair */
+ found = rt_node_leaf_iterate_next(iter, &(iter->stack[0]), &value);
+
+ if (found)
+ {
+ *key_p = iter->key;
+ *value_p = value;
+ return true;
+ }
+
+ /*
+ * We've visited all values in the leaf node, so advance inner node
+ * iterators from the level=1 until we find the next child node.
+ */
+ for (level = 1; level <= iter->stack_len; level++)
+ {
+ child = rt_node_inner_iterate_next(iter, &(iter->stack[level]));
+
+ if (child)
+ break;
+ }
+
+ /* the iteration finished */
+ if (!child)
+ return false;
+
+ /*
+ * Found the next child node. Update the iterator stack from this node
+ * downwards.
+ */
+ rt_update_iter_stack(iter, child, level - 1);
+
+ /* Node iterators are updated, so try again from the leaf */
+ }
+
+ return false;
+}
+
+void
+rt_end_iterate(rt_iter *iter)
+{
+ pfree(iter);
+}
+
+static inline void
+rt_iter_update_key(rt_iter *iter, uint8 chunk, uint8 shift)
+{
+ iter->key &= ~(((uint64) RT_CHUNK_MASK) << shift);
+ iter->key |= (((uint64) chunk) << shift);
+}
+
+/*
+ * Advance to the next used slot in the inner node. Return the child if one
+ * exists, otherwise NULL.
+ */
+static inline rt_node *
+rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
+{
+ rt_node *child = NULL;
+ bool found = false;
+ uint8 key_chunk;
+
+ switch (node_iter->node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n4->base.n.count)
+ break;
+
+ child = n4->children[node_iter->current_idx];
+ key_chunk = n4->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n32->base.n.count)
+ break;
+
+ child = n32->children[node_iter->current_idx];
+ key_chunk = n32->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ rt_node_inner_125 *n125 = (rt_node_inner_125 *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (node_125_is_chunk_used((rt_node_base_125 *) n125, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+ child = node_inner_125_get_child(n125, i);
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (node_inner_256_is_chunk_used(n256, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+ child = node_inner_256_get_child(n256, i);
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ }
+
+ if (found)
+ rt_iter_update_key(iter, key_chunk, node_iter->node->shift);
+
+ return child;
+}
+
+/*
+ * Advance to the next used slot in the leaf node. On success, return true
+ * and store the value in *value_p; otherwise return false.
+ */
+static inline bool
+rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
+ uint64 *value_p)
+{
+ rt_node *node = node_iter->node;
+ bool found = false;
+ uint64 value;
+ uint8 key_chunk;
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n4->base.n.count)
+ break;
+
+ value = n4->values[node_iter->current_idx];
+ key_chunk = n4->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n32->base.n.count)
+ break;
+
+ value = n32->values[node_iter->current_idx];
+ key_chunk = n32->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (node_125_is_chunk_used((rt_node_base_125 *) n125, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+ value = node_leaf_125_get_value(n125, i);
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (node_leaf_256_is_chunk_used(n256, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+ value = node_leaf_256_get_value(n256, i);
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ }
+
+ if (found)
+ {
+ rt_iter_update_key(iter, key_chunk, node_iter->node->shift);
+ *value_p = value;
+ }
+
+ return found;
+}
+
+/*
+ * Return the number of keys in the radix tree.
+ */
+uint64
+rt_num_entries(radix_tree *tree)
+{
+ return tree->num_keys;
+}
+
+/*
+ * Return the amount of memory used by the radix tree.
+ */
+uint64
+rt_memory_usage(radix_tree *tree)
+{
+ Size total = sizeof(radix_tree);
+
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
+ total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
+ }
+
+ return total;
+}
+
+/*
+ * Verify the radix tree node.
+ */
+static void
+rt_verify_node(rt_node *node)
+{
+#ifdef USE_ASSERT_CHECKING
+ Assert(node->count >= 0);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_base_4 *n4 = (rt_node_base_4 *) node;
+
+ for (int i = 1; i < n4->n.count; i++)
+ Assert(n4->chunks[i - 1] < n4->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_base_32 *n32 = (rt_node_base_32 *) node;
+
+ for (int i = 1; i < n32->n.count; i++)
+ Assert(n32->chunks[i - 1] < n32->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ rt_node_base_125 *n125 = (rt_node_base_125 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!node_125_is_chunk_used(n125, i))
+ continue;
+
+ /* Check if the corresponding slot is used */
+ if (NODE_IS_LEAF(node))
+ Assert(node_leaf_125_is_slot_used((rt_node_leaf_125 *) node,
+ n125->slot_idxs[i]));
+ else
+ Assert(node_inner_125_is_slot_used((rt_node_inner_125 *) node,
+ n125->slot_idxs[i]));
+
+ cnt++;
+ }
+
+ Assert(n125->n.count == cnt);
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RT_NODE_NSLOTS_BITS(RT_NODE_MAX_SLOTS); i++)
+ cnt += pg_popcount32(n256->isset[i]);
+
+ /* Check that the number of used chunks matches the count */
+ Assert(n256->base.n.count == cnt);
+
+ break;
+ }
+ }
+ }
+#endif
+}
+
+/***************** DEBUG FUNCTIONS *****************/
+#ifdef RT_DEBUG
+void
+rt_stats(radix_tree *tree)
+{
+ ereport(NOTICE, (errmsg("num_keys = " UINT64_FORMAT ", height = %d, n4 = %u, n15 = %u, n32 = %u, n125 = %u, n256 = %u",
+ tree->num_keys,
+ tree->root->shift / RT_NODE_SPAN,
+ tree->cnt[RT_CLASS_4_FULL],
+ tree->cnt[RT_CLASS_32_PARTIAL],
+ tree->cnt[RT_CLASS_32_FULL],
+ tree->cnt[RT_CLASS_125_FULL],
+ tree->cnt[RT_CLASS_256])));
+}
+
+static void
+rt_dump_node(rt_node *node, int level, bool recurse)
+{
+ char space[125] = {0};
+
+ fprintf(stderr, "[%s] kind %d, fanout %d, count %u, shift %u, chunk 0x%X:\n",
+ NODE_IS_LEAF(node) ? "LEAF" : "INNR",
+ (node->kind == RT_NODE_KIND_4) ? 4 :
+ (node->kind == RT_NODE_KIND_32) ? 32 :
+ (node->kind == RT_NODE_KIND_125) ? 125 : 256,
+ node->fanout == 0 ? 256 : node->fanout,
+ node->count, node->shift, node->chunk);
+
+ if (level > 0)
+ sprintf(space, "%*c", level * 4, ' ');
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, n4->base.chunks[i], n4->values[i]);
+ }
+ else
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, n4->base.chunks[i]);
+
+ if (recurse)
+ rt_dump_node(n4->children[i], level + 1, recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, n32->base.chunks[i], n32->values[i]);
+ }
+ else
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, n32->base.chunks[i]);
+
+ if (recurse)
+ {
+ rt_dump_node(n32->children[i], level + 1, recurse);
+ }
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ rt_node_base_125 *b125 = (rt_node_base_125 *) node;
+
+ fprintf(stderr, "slot_idxs ");
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!node_125_is_chunk_used(b125, i))
+ continue;
+
+ fprintf(stderr, " [%d]=%d, ", i, b125->slot_idxs[i]);
+ }
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_125 *n = (rt_node_leaf_125 *) node;
+
+ fprintf(stderr, ", isset-bitmap:");
+ for (int i = 0; i < 16; i++)
+ {
+ fprintf(stderr, "%X ", (uint8) n->isset[i]);
+ }
+ fprintf(stderr, "\n");
+ }
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!node_125_is_chunk_used(b125, i))
+ continue;
+
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) b125;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, i, node_leaf_125_get_value(n125, i));
+ }
+ else
+ {
+ rt_node_inner_125 *n125 = (rt_node_inner_125 *) b125;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, i);
+
+ if (recurse)
+ rt_dump_node(node_inner_125_get_child(n125, i),
+ level + 1, recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+
+ if (!node_leaf_256_is_chunk_used(n256, i))
+ continue;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, i, node_leaf_256_get_value(n256, i));
+ }
+ else
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+
+ if (!node_inner_256_is_chunk_used(n256, i))
+ continue;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, i);
+
+ if (recurse)
+ rt_dump_node(node_inner_256_get_child(n256, i), level + 1,
+ recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ }
+}
+
+void
+rt_dump_search(radix_tree *tree, uint64 key)
+{
+ rt_node *node;
+ int shift;
+ int level = 0;
+
+ elog(NOTICE, "-----------------------------------------------------------");
+ elog(NOTICE, "max_val = " UINT64_FORMAT "(0x" UINT64_FORMAT_HEX ")",
+ tree->max_val, tree->max_val);
+
+ if (!tree->root)
+ {
+ elog(NOTICE, "tree is empty");
+ return;
+ }
+
+ if (key > tree->max_val)
+ {
+ elog(NOTICE, "key " UINT64_FORMAT "(0x" UINT64_FORMAT_HEX ") is larger than max val",
+ key, key);
+ return;
+ }
+
+ node = tree->root;
+ shift = tree->root->shift;
+ while (shift >= 0)
+ {
+ rt_node *child;
+
+ rt_dump_node(node, level, false);
+
+ if (NODE_IS_LEAF(node))
+ {
+ uint64 dummy;
+
+ /* We reached a leaf node; find the corresponding slot */
+ rt_node_search_leaf(node, key, RT_ACTION_FIND, &dummy);
+
+ break;
+ }
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ break;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ level++;
+ }
+}
+
+void
+rt_dump(radix_tree *tree)
+{
+
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ fprintf(stderr, "%s\tinner_size %zu\tinner_blocksize %zu\tleaf_size %zu\tleaf_blocksize %zu\n",
+ rt_size_class_info[i].name,
+ rt_size_class_info[i].inner_size,
+ rt_size_class_info[i].inner_blocksize,
+ rt_size_class_info[i].leaf_size,
+ rt_size_class_info[i].leaf_blocksize);
+ fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->max_val);
+
+ if (!tree->root)
+ {
+ fprintf(stderr, "empty tree\n");
+ return;
+ }
+
+ rt_dump_node(tree->root, 0, true);
+}
+#endif
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
new file mode 100644
index 0000000000..d5d7668617
--- /dev/null
+++ b/src/include/lib/radixtree.h
@@ -0,0 +1,42 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.h
+ * Interface for radix tree.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/lib/radixtree.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef RADIXTREE_H
+#define RADIXTREE_H
+
+#include "postgres.h"
+
+#define RT_DEBUG 1
+
+typedef struct radix_tree radix_tree;
+typedef struct rt_iter rt_iter;
+
+extern radix_tree *rt_create(MemoryContext ctx);
+extern void rt_free(radix_tree *tree);
+extern bool rt_search(radix_tree *tree, uint64 key, uint64 *val_p);
+extern bool rt_set(radix_tree *tree, uint64 key, uint64 val);
+extern rt_iter *rt_begin_iterate(radix_tree *tree);
+
+extern bool rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p);
+extern void rt_end_iterate(rt_iter *iter);
+extern bool rt_delete(radix_tree *tree, uint64 key);
+
+extern uint64 rt_memory_usage(radix_tree *tree);
+extern uint64 rt_num_entries(radix_tree *tree);
+
+#ifdef RT_DEBUG
+extern void rt_dump(radix_tree *tree);
+extern void rt_dump_search(radix_tree *tree, uint64 key);
+extern void rt_stats(radix_tree *tree);
+#endif
+
+#endif /* RADIXTREE_H */
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index c629cbe383..9659eb85d7 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -28,6 +28,7 @@ SUBDIRS = \
test_pg_db_role_setting \
test_pg_dump \
test_predtest \
+ test_radixtree \
test_rbtree \
test_regex \
test_rls_hooks \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 911a768a29..fd101e3bf4 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -22,6 +22,7 @@ subdir('test_parser')
subdir('test_pg_db_role_setting')
subdir('test_pg_dump')
subdir('test_predtest')
+subdir('test_radixtree')
subdir('test_rbtree')
subdir('test_regex')
subdir('test_rls_hooks')
diff --git a/src/test/modules/test_radixtree/.gitignore b/src/test/modules/test_radixtree/.gitignore
new file mode 100644
index 0000000000..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/test_radixtree/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_radixtree/Makefile b/src/test/modules/test_radixtree/Makefile
new file mode 100644
index 0000000000..da06b93da3
--- /dev/null
+++ b/src/test/modules/test_radixtree/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_radixtree/Makefile
+
+MODULE_big = test_radixtree
+OBJS = \
+ $(WIN32RES) \
+ test_radixtree.o
+PGFILEDESC = "test_radixtree - test code for src/backend/lib/radixtree.c"
+
+EXTENSION = test_radixtree
+DATA = test_radixtree--1.0.sql
+
+REGRESS = test_radixtree
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_radixtree
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_radixtree/README b/src/test/modules/test_radixtree/README
new file mode 100644
index 0000000000..a8b271869a
--- /dev/null
+++ b/src/test/modules/test_radixtree/README
@@ -0,0 +1,7 @@
+test_radixtree contains unit tests for testing the radix tree implementation
+in src/backend/lib/radixtree.c.
+
+The tests verify the correctness of the implementation, but they can also be
+used as a micro-benchmark. If you set the 'rt_test_stats' flag in
+test_radixtree.c, the tests will print extra information about execution time
+and memory usage.
diff --git a/src/test/modules/test_radixtree/expected/test_radixtree.out b/src/test/modules/test_radixtree/expected/test_radixtree.out
new file mode 100644
index 0000000000..ce645cb8b5
--- /dev/null
+++ b/src/test/modules/test_radixtree/expected/test_radixtree.out
@@ -0,0 +1,36 @@
+CREATE EXTENSION test_radixtree;
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
+NOTICE: testing basic operations with leaf node 4
+NOTICE: testing basic operations with inner node 4
+NOTICE: testing basic operations with leaf node 32
+NOTICE: testing basic operations with inner node 32
+NOTICE: testing basic operations with leaf node 125
+NOTICE: testing basic operations with inner node 125
+NOTICE: testing basic operations with leaf node 256
+NOTICE: testing basic operations with inner node 256
+NOTICE: testing radix tree node types with shift "0"
+NOTICE: testing radix tree node types with shift "8"
+NOTICE: testing radix tree node types with shift "16"
+NOTICE: testing radix tree node types with shift "24"
+NOTICE: testing radix tree node types with shift "32"
+NOTICE: testing radix tree node types with shift "40"
+NOTICE: testing radix tree node types with shift "48"
+NOTICE: testing radix tree node types with shift "56"
+NOTICE: testing radix tree with pattern "all ones"
+NOTICE: testing radix tree with pattern "alternating bits"
+NOTICE: testing radix tree with pattern "clusters of ten"
+NOTICE: testing radix tree with pattern "clusters of hundred"
+NOTICE: testing radix tree with pattern "one-every-64k"
+NOTICE: testing radix tree with pattern "sparse"
+NOTICE: testing radix tree with pattern "single values, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^60"
+ test_radixtree
+----------------
+
+(1 row)
+
diff --git a/src/test/modules/test_radixtree/meson.build b/src/test/modules/test_radixtree/meson.build
new file mode 100644
index 0000000000..f96bf159d6
--- /dev/null
+++ b/src/test/modules/test_radixtree/meson.build
@@ -0,0 +1,34 @@
+# FIXME: prevent install during main install, but not during test :/
+
+test_radixtree_sources = files(
+ 'test_radixtree.c',
+)
+
+if host_system == 'windows'
+ test_radixtree_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'test_radixtree',
+ '--FILEDESC', 'test_radixtree - test code for src/backend/lib/radixtree.c',])
+endif
+
+test_radixtree = shared_module('test_radixtree',
+ test_radixtree_sources,
+ kwargs: pg_mod_args,
+)
+testprep_targets += test_radixtree
+
+install_data(
+ 'test_radixtree.control',
+ 'test_radixtree--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'test_radixtree',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'test_radixtree',
+ ],
+ },
+}
diff --git a/src/test/modules/test_radixtree/sql/test_radixtree.sql b/src/test/modules/test_radixtree/sql/test_radixtree.sql
new file mode 100644
index 0000000000..41ece5e9f5
--- /dev/null
+++ b/src/test/modules/test_radixtree/sql/test_radixtree.sql
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_radixtree;
+
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
diff --git a/src/test/modules/test_radixtree/test_radixtree--1.0.sql b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
new file mode 100644
index 0000000000..074a5a7ea7
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_radixtree/test_radixtree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_radixtree" to load this file. \quit
+
+CREATE FUNCTION test_radixtree()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
new file mode 100644
index 0000000000..ea993e63df
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -0,0 +1,581 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_radixtree.c
+ * Test radixtree set data structure.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/test/modules/test_radixtree/test_radixtree.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "lib/radixtree.h"
+#include "miscadmin.h"
+#include "nodes/bitmapset.h"
+#include "storage/block.h"
+#include "storage/itemptr.h"
+#include "utils/memutils.h"
+#include "utils/timestamp.h"
+
+#define UINT64_HEX_FORMAT "%" INT64_MODIFIER "X"
+
+/*
+ * If you enable this, the "pattern" tests will print information about
+ * how long populating, probing, and iterating the test set takes, and
+ * how much memory the test set consumed. That can be used as
+ * a micro-benchmark of various operations and input patterns (you might
+ * want to increase the number of values used in each of the tests, if
+ * you do that, to reduce noise).
+ *
+ * The information is printed to the server's stderr, mostly because
+ * that's where MemoryContextStats() output goes.
+ */
+static const bool rt_test_stats = false;
+
+static int rt_node_kind_fanouts[] = {
+ 0,
+ 4, /* RT_NODE_KIND_4 */
+ 32, /* RT_NODE_KIND_32 */
+ 125, /* RT_NODE_KIND_125 */
+ 256 /* RT_NODE_KIND_256 */
+};
+/*
+ * A struct to define a pattern of integers, for use with the test_pattern()
+ * function.
+ */
+typedef struct
+{
+ char *test_name; /* short name of the test, for humans */
+ char *pattern_str; /* a bit pattern */
+ uint64 spacing; /* pattern repeats at this interval */
+ uint64 num_values; /* number of integers to set in total */
+} test_spec;
+
+/* Test patterns borrowed from test_integerset.c */
+static const test_spec test_specs[] = {
+ {
+ "all ones", "1111111111",
+ 10, 1000000
+ },
+ {
+ "alternating bits", "0101010101",
+ 10, 1000000
+ },
+ {
+ "clusters of ten", "1111111111",
+ 10000, 1000000
+ },
+ {
+ "clusters of hundred",
+ "1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111",
+ 10000, 1000000
+ },
+ {
+ "one-every-64k", "1",
+ 65536, 1000000
+ },
+ {
+ "sparse", "100000000000000000000000000000001",
+ 10000000, 1000000
+ },
+ {
+ "single values, distance > 2^32", "1",
+ UINT64CONST(10000000000), 100000
+ },
+ {
+ "clusters, distance > 2^32", "10101010",
+ UINT64CONST(10000000000), 1000000
+ },
+ {
+ "clusters, distance > 2^60", "10101010",
+ UINT64CONST(2000000000000000000),
+ 23 /* can't be much higher than this, or we
+ * overflow uint64 */
+ }
+};
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(test_radixtree);
+
+static void
+test_empty(void)
+{
+ radix_tree *radixtree;
+ rt_iter *iter;
+ uint64 dummy;
+ uint64 key;
+ uint64 val;
+
+ radixtree = rt_create(CurrentMemoryContext);
+
+ if (rt_search(radixtree, 0, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, 1, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, PG_UINT64_MAX, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_delete(radixtree, 0))
+ elog(ERROR, "rt_delete on empty tree returned true");
+
+ if (rt_num_entries(radixtree) != 0)
+ elog(ERROR, "rt_num_entries on empty tree returned non-zero");
+
+ iter = rt_begin_iterate(radixtree);
+
+ if (rt_iterate_next(iter, &key, &val))
+ elog(ERROR, "rt_iterate_next on empty tree returned true");
+
+ rt_end_iterate(iter);
+
+ rt_free(radixtree);
+}
+
+static void
+test_basic(int children, bool test_inner)
+{
+ radix_tree *radixtree;
+ uint64 *keys;
+ int shift = test_inner ? 8 : 0;
+
+ elog(NOTICE, "testing basic operations with %s node %d",
+ test_inner ? "inner" : "leaf", children);
+
+ radixtree = rt_create(CurrentMemoryContext);
+
+ /* prepare keys in interleaved order like 1, 32, 2, 31, 3, ... */
+ keys = palloc(sizeof(uint64) * children);
+ for (int i = 0; i < children; i++)
+ {
+ if (i % 2 == 0)
+ keys[i] = (uint64) ((i / 2) + 1) << shift;
+ else
+ keys[i] = (uint64) (children - (i / 2)) << shift;
+ }
+
+ /* insert keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (rt_set(radixtree, keys[i], keys[i]))
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " is found", keys[i]);
+ }
+
+ /* update keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (!rt_set(radixtree, keys[i], keys[i] + 1))
+ elog(ERROR, "could not update key 0x" UINT64_HEX_FORMAT, keys[i]);
+ }
+
+ /* repeat deleting and inserting keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (!rt_delete(radixtree, keys[i]))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, keys[i]);
+ if (rt_set(radixtree, keys[i], keys[i]))
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " is found", keys[i]);
+ }
+
+ pfree(keys);
+ rt_free(radixtree);
+}
+
+/*
+ * Check if keys from start to end with the shift exist in the tree.
+ */
+static void
+check_search_on_node(radix_tree *radixtree, uint8 shift, int start, int end,
+ int incr)
+{
+ for (int i = start; i < end; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ uint64 val;
+
+ if (!rt_search(radixtree, key, &val))
+ elog(ERROR, "key 0x" UINT64_HEX_FORMAT " is not found on node-%d",
+ key, end);
+ if (val != key)
+ elog(ERROR, "rt_search with key 0x" UINT64_HEX_FORMAT " returns 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ key, val, key);
+ }
+}
+
+static void
+test_node_types_insert(radix_tree *radixtree, uint8 shift, bool insert_asc)
+{
+ uint64 num_entries;
+ int ninserted = 0;
+ int start = insert_asc ? 0 : 256;
+ int incr = insert_asc ? 1 : -1;
+ int end = insert_asc ? 256 : 0;
+ int node_kind_idx = 1;
+
+ for (int i = start; i != end; i += incr)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_set(radixtree, key, key);
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " is found", key);
+
+ /*
+ * After filling all slots in each node type, check if the values
+ * are stored properly.
+ */
+ if (ninserted == rt_node_kind_fanouts[node_kind_idx] - 1)
+ {
+ int check_start = insert_asc
+ ? rt_node_kind_fanouts[node_kind_idx - 1]
+ : rt_node_kind_fanouts[node_kind_idx];
+ int check_end = insert_asc
+ ? rt_node_kind_fanouts[node_kind_idx]
+ : rt_node_kind_fanouts[node_kind_idx - 1];
+
+ check_search_on_node(radixtree, shift, check_start, check_end, incr);
+ node_kind_idx++;
+ }
+
+ ninserted++;
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ if (num_entries != 256)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(256));
+}
+
+static void
+test_node_types_delete(radix_tree *radixtree, uint8 shift)
+{
+ uint64 num_entries;
+
+ for (int i = 0; i < 256; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_delete(radixtree, key);
+
+ if (!found)
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, key);
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ /* The tree must be empty */
+ if (num_entries != 0)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(0));
+}
+
+/*
+ * Test for inserting and deleting key-value pairs to each node type at the given shift
+ * level.
+ */
+static void
+test_node_types(uint8 shift)
+{
+ radix_tree *radixtree;
+
+ elog(NOTICE, "testing radix tree node types with shift \"%d\"", shift);
+
+ radixtree = rt_create(CurrentMemoryContext);
+
+ /*
+ * Insert and search entries for every node type at the 'shift' level,
+ * then delete all entries to make it empty, and insert and search entries
+ * again.
+ */
+ test_node_types_insert(radixtree, shift, true);
+ test_node_types_delete(radixtree, shift);
+ test_node_types_insert(radixtree, shift, false);
+
+ rt_free(radixtree);
+}
+
+/*
+ * Test with a repeating pattern, defined by the 'spec'.
+ */
+static void
+test_pattern(const test_spec * spec)
+{
+ radix_tree *radixtree;
+ rt_iter *iter;
+ MemoryContext radixtree_ctx;
+ TimestampTz starttime;
+ TimestampTz endtime;
+ uint64 n;
+ uint64 last_int;
+ uint64 ndeleted;
+ uint64 nbefore;
+ uint64 nafter;
+ int patternlen;
+ uint64 *pattern_values;
+ uint64 pattern_num_values;
+
+ elog(NOTICE, "testing radix tree with pattern \"%s\"", spec->test_name);
+ if (rt_test_stats)
+ fprintf(stderr, "-----\ntesting radix tree with pattern \"%s\"\n", spec->test_name);
+
+ /* Pre-process the pattern, creating an array of integers from it. */
+ patternlen = strlen(spec->pattern_str);
+ pattern_values = palloc(patternlen * sizeof(uint64));
+ pattern_num_values = 0;
+ for (int i = 0; i < patternlen; i++)
+ {
+ if (spec->pattern_str[i] == '1')
+ pattern_values[pattern_num_values++] = i;
+ }
+
+ /*
+ * Allocate the radix tree.
+ *
+ * Allocate it in a separate memory context, so that we can print its
+ * memory usage easily.
+ */
+ radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
+ "radixtree test",
+ ALLOCSET_SMALL_SIZES);
+ MemoryContextSetIdentifier(radixtree_ctx, spec->test_name);
+ radixtree = rt_create(radixtree_ctx);
+
+ /*
+ * Add values to the set.
+ */
+ starttime = GetCurrentTimestamp();
+
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ uint64 x = 0;
+
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ bool found;
+
+ x = last_int + pattern_values[i];
+
+ found = rt_set(radixtree, x, x);
+
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " found", x);
+
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+
+ endtime = GetCurrentTimestamp();
+
+ if (rt_test_stats)
+ fprintf(stderr, "added " UINT64_FORMAT " values in %d ms\n",
+ spec->num_values, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Print stats on the amount of memory used.
+ *
+ * We print the usage reported by rt_memory_usage(), as well as the stats
+ * from the memory context. They should be in the same ballpark, but it's
+ * hard to automate testing that, so if you're making changes to the
+ * implementation, just observe that manually.
+ */
+ if (rt_test_stats)
+ {
+ uint64 mem_usage;
+
+ /*
+ * Also print memory usage as reported by rt_memory_usage(). It
+ * should be in the same ballpark as the usage reported by
+ * MemoryContextStats().
+ */
+ mem_usage = rt_memory_usage(radixtree);
+ fprintf(stderr, "rt_memory_usage() reported " UINT64_FORMAT " (%0.2f bytes / integer)\n",
+ mem_usage, (double) mem_usage / spec->num_values);
+
+ MemoryContextStats(radixtree_ctx);
+ }
+
+ /* Check that rt_num_entries works */
+ n = rt_num_entries(radixtree);
+ if (n != spec->num_values)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT, n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_search()
+ */
+ starttime = GetCurrentTimestamp();
+
+ for (n = 0; n < 100000; n++)
+ {
+ bool found;
+ bool expected;
+ uint64 x;
+ uint64 v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Do we expect this value to be present in the set? */
+ if (x >= last_int)
+ expected = false;
+ else
+ {
+ uint64 idx = x % spec->spacing;
+
+ if (idx >= patternlen)
+ expected = false;
+ else if (spec->pattern_str[idx] == '1')
+ expected = true;
+ else
+ expected = false;
+ }
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (found != expected)
+ elog(ERROR, "mismatch at 0x" UINT64_HEX_FORMAT ": %d vs %d", x, found, expected);
+ if (found && (v != x))
+ elog(ERROR, "found 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ v, x);
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "probed " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Test iterator
+ */
+ starttime = GetCurrentTimestamp();
+
+ iter = rt_begin_iterate(radixtree);
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ uint64 expected = last_int + pattern_values[i];
+ uint64 x;
+ uint64 val;
+
+ if (!rt_iterate_next(iter, &x, &val))
+ break;
+
+ if (x != expected)
+ elog(ERROR,
+ "iterate returned wrong key; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d",
+ x, expected, i);
+ if (val != expected)
+ elog(ERROR,
+ "iterate returned wrong value; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d", x, expected, i);
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "iterated " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ rt_end_iterate(iter);
+
+ if (n < spec->num_values)
+ elog(ERROR, "iterator stopped short after " UINT64_FORMAT " entries, expected " UINT64_FORMAT, n, spec->num_values);
+ if (n > spec->num_values)
+ elog(ERROR, "iterator returned " UINT64_FORMAT " entries, " UINT64_FORMAT " was expected", n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_delete()
+ */
+ starttime = GetCurrentTimestamp();
+
+ nbefore = rt_num_entries(radixtree);
+ ndeleted = 0;
+ for (n = 0; n < 1; n++)
+ {
+ bool found;
+ uint64 x;
+ uint64 v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (!found)
+ continue;
+
+ /* If the key is found, delete it and check again */
+ if (!rt_delete(radixtree, x))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_search(radixtree, x, &v))
+ elog(ERROR, "found deleted key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_delete(radixtree, x))
+ elog(ERROR, "deleted already-deleted key 0x" UINT64_HEX_FORMAT, x);
+
+ ndeleted++;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "deleted " UINT64_FORMAT " values in %d ms\n",
+ ndeleted, (int) (endtime - starttime) / 1000);
+
+ nafter = rt_num_entries(radixtree);
+
+ /* Check that rt_num_entries works */
+ if ((nbefore - ndeleted) != nafter)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT " after " UINT64_FORMAT " deletions",
+ nafter, (nbefore - ndeleted), ndeleted);
+
+ MemoryContextDelete(radixtree_ctx);
+}
+
+Datum
+test_radixtree(PG_FUNCTION_ARGS)
+{
+ test_empty();
+
+ for (int i = 1; i < lengthof(rt_node_kind_fanouts); i++)
+ {
+ test_basic(rt_node_kind_fanouts[i], false);
+ test_basic(rt_node_kind_fanouts[i], true);
+ }
+
+ for (int shift = 0; shift <= (64 - 8); shift += 8)
+ test_node_types(shift);
+
+ /* Test different test patterns, with lots of entries */
+ for (int i = 0; i < lengthof(test_specs); i++)
+ test_pattern(&test_specs[i]);
+
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_radixtree/test_radixtree.control b/src/test/modules/test_radixtree/test_radixtree.control
new file mode 100644
index 0000000000..e53f2a3e0c
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.control
@@ -0,0 +1,4 @@
+comment = 'Test code for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/test_radixtree'
+relocatable = true
--
2.31.1
v15-0002-Move-some-bitmap-logic-out-of-bitmapset.c.patch
From caf11ea2ca608edac00443b6ab7590688385b0d4 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Tue, 6 Dec 2022 13:39:41 +0700
Subject: [PATCH v15 2/9] Move some bitmap logic out of bitmapset.c
Add pg_rightmost_one32/64 functions. This functionality was previously
private to bitmapset.c as the RIGHTMOST_ONE macro. It has practical
use in other contexts, so move to pg_bitutils.h.
Also move the logic for selecting appropriate pg_bitutils functions
based on word size to bitmapset.h for wider visibility and add
appropriate selection for bmw_rightmost_one().
Since the previous macro relied on casting to signedbitmapword,
and the new functions do not, remove that typedef.
Design input and review by Tom Lane
Discussion: https://www.postgresql.org/message-id/CAFBsxsFW2JjTo58jtDB%2B3sZhxMx3t-3evew8%3DAcr%2BGGhC%2BkFaA%40mail.gmail.com
---
src/backend/nodes/bitmapset.c | 36 ++------------------------------
src/include/nodes/bitmapset.h | 16 ++++++++++++--
src/include/port/pg_bitutils.h | 31 +++++++++++++++++++++++++++
src/tools/pgindent/typedefs.list | 1 -
4 files changed, 47 insertions(+), 37 deletions(-)
diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index b7b274aeff..4384ff591d 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -32,39 +32,7 @@
#define BITMAPSET_SIZE(nwords) \
(offsetof(Bitmapset, words) + (nwords) * sizeof(bitmapword))
-/*----------
- * This is a well-known cute trick for isolating the rightmost one-bit
- * in a word. It assumes two's complement arithmetic. Consider any
- * nonzero value, and focus attention on the rightmost one. The value is
- * then something like
- * xxxxxx10000
- * where x's are unspecified bits. The two's complement negative is formed
- * by inverting all the bits and adding one. Inversion gives
- * yyyyyy01111
- * where each y is the inverse of the corresponding x. Incrementing gives
- * yyyyyy10000
- * and then ANDing with the original value gives
- * 00000010000
- * This works for all cases except original value = zero, where of course
- * we get zero.
- *----------
- */
-#define RIGHTMOST_ONE(x) ((signedbitmapword) (x) & -((signedbitmapword) (x)))
-
-#define HAS_MULTIPLE_ONES(x) ((bitmapword) RIGHTMOST_ONE(x) != (x))
-
-/* Select appropriate bit-twiddling functions for bitmap word size */
-#if BITS_PER_BITMAPWORD == 32
-#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos32(w)
-#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos32(w)
-#define bmw_popcount(w) pg_popcount32(w)
-#elif BITS_PER_BITMAPWORD == 64
-#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos64(w)
-#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos64(w)
-#define bmw_popcount(w) pg_popcount64(w)
-#else
-#error "invalid BITS_PER_BITMAPWORD"
-#endif
+#define HAS_MULTIPLE_ONES(x) (bmw_rightmost_one(x) != (x))
/*
@@ -1013,7 +981,7 @@ bms_first_member(Bitmapset *a)
{
int result;
- w = RIGHTMOST_ONE(w);
+ w = bmw_rightmost_one(w);
a->words[wordnum] &= ~w;
result = wordnum * BITS_PER_BITMAPWORD;
diff --git a/src/include/nodes/bitmapset.h b/src/include/nodes/bitmapset.h
index 2792281658..fdc504596b 100644
--- a/src/include/nodes/bitmapset.h
+++ b/src/include/nodes/bitmapset.h
@@ -38,13 +38,11 @@ struct List;
#define BITS_PER_BITMAPWORD 64
typedef uint64 bitmapword; /* must be an unsigned type */
-typedef int64 signedbitmapword; /* must be the matching signed type */
#else
#define BITS_PER_BITMAPWORD 32
typedef uint32 bitmapword; /* must be an unsigned type */
-typedef int32 signedbitmapword; /* must be the matching signed type */
#endif
@@ -75,6 +73,20 @@ typedef enum
BMS_MULTIPLE /* >1 member */
} BMS_Membership;
+/* Select appropriate bit-twiddling functions for bitmap word size */
+#if BITS_PER_BITMAPWORD == 32
+#define bmw_rightmost_one(w) pg_rightmost_one32(w)
+#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos32(w)
+#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos32(w)
+#define bmw_popcount(w) pg_popcount32(w)
+#elif BITS_PER_BITMAPWORD == 64
+#define bmw_rightmost_one(w) pg_rightmost_one64(w)
+#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos64(w)
+#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos64(w)
+#define bmw_popcount(w) pg_popcount64(w)
+#else
+#error "invalid BITS_PER_BITMAPWORD"
+#endif
/*
* function prototypes in nodes/bitmapset.c
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 814e0b2dba..f95b6afd86 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -17,6 +17,37 @@ extern PGDLLIMPORT const uint8 pg_leftmost_one_pos[256];
extern PGDLLIMPORT const uint8 pg_rightmost_one_pos[256];
extern PGDLLIMPORT const uint8 pg_number_of_ones[256];
+/*----------
+ * This is a well-known cute trick for isolating the rightmost one-bit
+ * in a word. It assumes two's complement arithmetic. Consider any
+ * nonzero value, and focus attention on the rightmost one. The value is
+ * then something like
+ * xxxxxx10000
+ * where x's are unspecified bits. The two's complement negative is formed
+ * by inverting all the bits and adding one. Inversion gives
+ * yyyyyy01111
+ * where each y is the inverse of the corresponding x. Incrementing gives
+ * yyyyyy10000
+ * and then ANDing with the original value gives
+ * 00000010000
+ * This works for all cases except original value = zero, where of course
+ * we get zero.
+ *----------
+ */
+static inline uint32
+pg_rightmost_one32(uint32 word)
+{
+ int32 result = (int32) word & -((int32) word);
+ return (uint32) result;
+}
+
+static inline uint64
+pg_rightmost_one64(uint64 word)
+{
+ int64 result = (int64) word & -((int64) word);
+ return (uint64) result;
+}
+
/*
* pg_leftmost_one_pos32
* Returns the position of the most significant set bit in "word",
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 60c71d05fe..8305f09f2c 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3654,7 +3654,6 @@ shmem_request_hook_type
shmem_startup_hook_type
sig_atomic_t
sigjmp_buf
-signedbitmapword
sigset_t
size_t
slist_head
--
2.31.1
On Mon, Dec 19, 2022 at 2:14 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
On Tue, Dec 13, 2022 at 1:04 AM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
Looking at other code using DSA such as tidbitmap.c and nodeHash.c, it
seems that they look at only memory that are actually dsa_allocate'd.
To be exact, we estimate the number of hash buckets based on work_mem
(and hash_mem_multiplier) and use it as the upper limit. So I've
confirmed that the result of dsa_get_total_size() could exceed the
limit. I'm not sure it's a known and legitimate usage. If we can
follow such usage, we can probably track how much dsa_allocate'd
memory is used in the radix tree.
I've experimented with this idea. The newly added 0008 patch changes
the radix tree so that it counts the memory usage for both local and
shared cases. As shown below, there is an overhead for that:
w/o 0008 patch
298453544 | 282
w/ 0008 patch
293603184 | 297
This adds about as much overhead as the improvement I measured in the v4
slab allocator patch. That's not acceptable, and is exactly what Andres
warned about in
/messages/by-id/20220704211822.kfxtzpcdmslzm2dy@awork3.anarazel.de
I'm guessing the hash join case can afford to be precise about memory
because it must spill to disk when exceeding workmem. We don't have that
design constraint.
--
John Naylor
EDB: http://www.enterprisedb.com
On Tue, Dec 20, 2022 at 3:09 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Mon, Dec 19, 2022 at 2:14 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Tue, Dec 13, 2022 at 1:04 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Looking at other code using DSA such as tidbitmap.c and nodeHash.c, it
seems that they look at only memory that are actually dsa_allocate'd.
To be exact, we estimate the number of hash buckets based on work_mem
(and hash_mem_multiplier) and use it as the upper limit. So I've
confirmed that the result of dsa_get_total_size() could exceed the
limit. I'm not sure it's a known and legitimate usage. If we can
follow such usage, we can probably track how much dsa_allocate'd
memory is used in the radix tree.
I've experimented with this idea. The newly added 0008 patch changes
the radix tree so that it counts the memory usage for both local and
shared cases. As shown below, there is an overhead for that:
w/o 0008 patch
298453544 | 282
w/ 0008 patch
293603184 | 297
This adds about as much overhead as the improvement I measured in the v4 slab allocator patch.
Oh, yes, that's bad.
/messages/by-id/20220704211822.kfxtzpcdmslzm2dy@awork3.anarazel.de
I'm guessing the hash join case can afford to be precise about memory because it must spill to disk when exceeding workmem. We don't have that design constraint.
You mean that the memory used by the radix tree should be limited not
by the amount of memory actually used, but by the amount of memory
allocated? In other words, it checks by MemoryContextMemAllocated() in
the local cases and by dsa_get_total_size() in the shared case.
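To make that concrete, something like this minimal sketch is what I have
in mind (the function name and parameters are made up for illustration;
only MemoryContextMemAllocated() and dsa_get_total_size() are existing
APIs):

#include "postgres.h"
#include "utils/dsa.h"
#include "utils/memutils.h"

/*
 * Illustrative sketch only, not part of the patch set: report how much
 * memory has been allocated (not necessarily used) for the dead-TID
 * store.  "context" is the tree's local memory context and "area" is
 * its DSA area, or NULL when the tree is not shared.
 */
static Size
dead_items_mem_allocated(MemoryContext context, dsa_area *area)
{
	if (area != NULL)
		return dsa_get_total_size(area);	/* shared case */

	/* local case: include child contexts such as per-node-kind slabs */
	return MemoryContextMemAllocated(context, true);
}

Whether checking allocated rather than used memory is acceptable is
exactly the question above.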
The idea of using up to half of maintenance_work_mem might be a good
idea compared to the current flat-array solution. But since it only
uses half, I'm concerned that there will be users who double their
maintenance_work_mem. When it is improved, the user needs to restore
maintenance_work_mem again.
A better solution would be to have a slab-like DSA, where we allocate
dynamic shared memory by adding fixed-length large segments. However, a
downside is that as the segment size grows, we would need to increase
maintenance_work_mem as well. Also, this patch set is already getting
bigger and more complicated, so I don't think it's a good idea to add
more.
If we limit the memory usage by checking the amount of memory actually
used, we can use SlabStats() in the local case. Since DSA doesn't have
such functionality right now, we would need to add it, or we could
track it in the radix tree only in the shared case.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Wed, Dec 21, 2022 at 3:09 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
On Tue, Dec 20, 2022 at 3:09 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
/messages/by-id/20220704211822.kfxtzpcdmslzm2dy@awork3.anarazel.de
I'm guessing the hash join case can afford to be precise about memory
because it must spill to disk when exceeding workmem. We don't have that
design constraint.
You mean that the memory used by the radix tree should be limited not
by the amount of memory actually used, but by the amount of memory
allocated? In other words, it checks by MemoryContextMemAllocated() in
the local cases and by dsa_get_total_size() in the shared case.
I mean, if this patch set uses 10x less memory than v15 (not always, but
easy to find cases where it does), and if it's also expensive to track
memory use precisely, then we don't have an incentive to track memory
precisely. Even if we did, we don't want to assume that every future caller
of radix tree is willing to incur that cost.
The idea of using up to half of maintenance_work_mem might be a good
idea compared to the current flat-array solution. But since it only
uses half, I'm concerned that there will be users who double their
maintenance_work_mem. When it is improved, the user needs to restore
maintenance_work_mem again.
I find it useful to step back and look at the usage patterns:
Autovacuum: Limiting the memory allocated by vacuum is important, since
there are multiple workers and they can run at any time (possibly most of
the time). This case will not use parallel index vacuum, so will use slab,
where the quick estimation of memory taken by the context is not terribly
far off, so we can afford to be more optimistic here.
Manual vacuum: The default configuration assumes we want to finish as soon
as possible (vacuum_cost_delay is zero). Parallel index vacuum can be used.
My experience leads me to believe users are willing to use a lot of memory
to make manual vacuum finish as quickly as possible, and are disappointed
to learn that even if maintenance work mem is 10GB, vacuum can only use 1GB.
So I don't believe anyone will have to double maintenance work mem after
upgrading (even with pessimistic accounting) because we'll be both
- much more efficient with memory on average
- free from the 1GB cap
That said, it's possible 50% is too pessimistic -- a 75% threshold will
bring us very close to powers of two for example:
2*(1+2+4+8+16+32+64+128) + 256 = 766MB (74.8% of 1GB) -> keep going
766 + 256 = 1022MB -> stop
I'm not sure whether that calculation could cause us to go over the
limit, or how common that would be.
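For anyone who wants to play with the numbers, here is a throwaway
standalone program (not part of the patch set) that models the growth
pattern assumed above, i.e. two segments at each size starting at 1MB
before the size doubles, stopping once the running total crosses a given
fraction of the limit:

#include <stdio.h>

/*
 * Throwaway simulation: model DSA-like segment growth where two segments
 * of each size are allocated, starting at 1MB, before the size doubles.
 * Stop once the running total crosses "threshold" of "limit_mb" and
 * return the final total in MB.
 */
static int
simulate_limit(int limit_mb, double threshold)
{
	int			total_mb = 0;
	int			segsize_mb = 1;
	int			nsame = 0;

	while (total_mb < limit_mb * threshold)
	{
		total_mb += segsize_mb;
		if (++nsame == 2)
		{
			segsize_mb *= 2;	/* assume segments at most double in size */
			nsame = 0;
		}
	}
	return total_mb;
}

int
main(void)
{
	/* 1GB limit, 75% threshold: stops at 1022MB, just under the limit */
	printf("limit 1024MB -> stops at %dMB\n", simulate_limit(1024, 0.75));
	return 0;
}

Plugging in limits that are not powers of two is an easy way to check
how far past the limit this could land.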
--
John Naylor
EDB: http://www.enterprisedb.com
On Thu, Dec 22, 2022 at 7:24 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Wed, Dec 21, 2022 at 3:09 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Tue, Dec 20, 2022 at 3:09 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
/messages/by-id/20220704211822.kfxtzpcdmslzm2dy@awork3.anarazel.de
I'm guessing the hash join case can afford to be precise about memory because it must spill to disk when exceeding workmem. We don't have that design constraint.
You mean that the memory used by the radix tree should be limited not
by the amount of memory actually used, but by the amount of memory
allocated? In other words, it checks by MemoryContextMemAllocated() in
the local cases and by dsa_get_total_size() in the shared case.
I mean, if this patch set uses 10x less memory than v15 (not always, but easy to find cases where it does), and if it's also expensive to track memory use precisely, then we don't have an incentive to track memory precisely. Even if we did, we don't want to assume that every future caller of radix tree is willing to incur that cost.
Understood.
The idea of using up to half of maintenance_work_mem might be a good
idea compared to the current flat-array solution. But since it only
uses half, I'm concerned that there will be users who double their
maintenance_work_mem. When it is improved, the user needs to restore
maintenance_work_mem again.
I find it useful to step back and look at the usage patterns:
Autovacuum: Limiting the memory allocated by vacuum is important, since there are multiple workers and they can run at any time (possibly most of the time). This case will not use parallel index vacuum, so will use slab, where the quick estimation of memory taken by the context is not terribly far off, so we can afford to be more optimistic here.
Manual vacuum: The default configuration assumes we want to finish as soon as possible (vacuum_cost_delay is zero). Parallel index vacuum can be used. My experience leads me to believe users are willing to use a lot of memory to make manual vacuum finish as quickly as possible, and are disappointed to learn that even if maintenance work mem is 10GB, vacuum can only use 1GB.
Agreed.
So I don't believe anyone will have to double maintenance work mem after upgrading (even with pessimistic accounting) because we'll be both
- much more efficient with memory on average
- free from the 1GB cap
Make sense.
That said, it's possible 50% is too pessimistic -- a 75% threshold will bring us very close to powers of two for example:
2*(1+2+4+8+16+32+64+128) + 256 = 766MB (74.8% of 1GB) -> keep going
766 + 256 = 1022MB -> stop
I'm not sure whether that calculation could cause us to go over the limit, or how common that would be.
If the value is a power of 2, it seems to work perfectly fine. But for
example if it's 700MB, the total memory exceeds the limit:
2*(1+2+4+8+16+32+64+128) = 510MB (72.8% of 700MB) -> keep going
510 + 256 = 766MB -> stop, but it exceeds the limit.
In a bigger case, if it's 11000MB,
2*(1+2+...+2048) = 8190MB (74.4%)
8190 + 4096 = 12286MB
That being said, I don't think these are common cases, so the 75%
threshold seems to work fine in most cases.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Thu, Dec 22, 2022 at 10:00 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
If the value is a power of 2, it seems to work perfectly fine. But for
example if it's 700MB, the total memory exceeds the limit:
2*(1+2+4+8+16+32+64+128) = 510MB (72.8% of 700MB) -> keep going
510 + 256 = 766MB -> stop, but it exceeds the limit.
In a bigger case, if it's 11000MB,
2*(1+2+...+2048) = 8190MB (74.4%)
8190 + 4096 = 12286MB
That being said, I don't think these are common cases, so the 75%
threshold seems to work fine in most cases.
Thinking some more, I agree this doesn't have large practical risk, but
thinking from the point of view of the community, being loose with memory
limits by up to 10% is not a good precedent.
Perhaps we can be clever and use 75% when the limit is a power of two and
50% otherwise. I'm skeptical of trying to be clever, and I just thought of
an additional concern: we're assuming how new DSA segments grow in size,
which could possibly change. Given how allocators are
typically coded, though, it seems safe to assume that they'll at most
double in size.
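For completeness, the "clever" rule would amount to something like this
(purely illustrative, not proposing actual code; it's just the usual
power-of-two bit test):

#include <stdint.h>

/*
 * Purely illustrative: choose a more aggressive budget threshold when
 * the limit (in bytes) is a power of two, since doubling segment sizes
 * then line up with it exactly; otherwise fall back to 50%.
 */
static inline double
budget_threshold(uint64_t limit)
{
	return (limit != 0 && (limit & (limit - 1)) == 0) ? 0.75 : 0.50;
}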
--
John Naylor
EDB: http://www.enterprisedb.com
I wrote:
- Try templating out the differences between local and shared memory.
Here is a brief progress report before Christmas vacation.
I thought the best way to approach this was to go "inside out", that is,
start with the modest goal of reducing duplicated code for v16.
0001-0005 are copies from v13.
0006 whacks around the rt_node_insert_inner function to reduce the "surface
area" as far as symbols and casts. This includes replacing the goto with an
extra "unlikely" branch.
0007 removes the STRICT pragma for one of our benchmark functions that
crept in somewhere -- it should use the default and not just return NULL
instantly.
0008 further whacks around the node-growing code in rt_node_insert_inner to
remove casts. When growing the size class within the same kind, we have no
need for a "new32" (etc) variable. Also, to keep from getting confused
about what an assert build verifies at the end, add a "newnode" variable
and assign it to "node" as soon as possible.
0009 uses the bitmap logic from 0004 for node256 also. There is no
performance reason for this, because there is no iteration needed, but it's
good for simplicity and consistency.
0010 and 0011 template a common implementation for both leaf and inner
nodes for searching and inserting.
0012: While at it, I couldn't resist using this technique to separate out
delete from search, which makes sense and might give a small performance
boost (at least on less capable hardware). I haven't got to the iteration
functions, but they should be straightforward.
There is more that could be done here, but I didn't want to get too ahead
of myself. For example, it's possible that struct members "children" and
"values" are names that don't need to be distinguished. Making them the
same would reduce code like
+#ifdef RT_NODE_LEVEL_LEAF
+ n32->values[insertpos] = value;
+#else
+ n32->children[insertpos] = child;
+#endif
...but there could be downsides and I don't want to distract from the goal
of dealing with shared memory.
The tests pass, but it's not impossible that there is a new bug somewhere.
--
John Naylor
EDB: http://www.enterprisedb.com
Attachments:
v16-0002-Move-some-bitmap-logic-out-of-bitmapset.c.patch
From 9661e7c32198fb77f3218cac7c444490d92f380f Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Tue, 6 Dec 2022 13:39:41 +0700
Subject: [PATCH v16 02/12] Move some bitmap logic out of bitmapset.c
Add pg_rightmost_one32/64 functions. This functionality was previously
private to bitmapset.c as the RIGHTMOST_ONE macro. It has practical
use in other contexts, so move to pg_bitutils.h.
Also move the logic for selecting appropriate pg_bitutils functions
based on word size to bitmapset.h for wider visibility and add
appropriate selection for bmw_rightmost_one().
Since the previous macro relied on casting to signedbitmapword,
and the new functions do not, remove that typedef.
Design input and review by Tom Lane
Discussion: https://www.postgresql.org/message-id/CAFBsxsFW2JjTo58jtDB%2B3sZhxMx3t-3evew8%3DAcr%2BGGhC%2BkFaA%40mail.gmail.com
---
src/backend/nodes/bitmapset.c | 36 ++------------------------------
src/include/nodes/bitmapset.h | 16 ++++++++++++--
src/include/port/pg_bitutils.h | 31 +++++++++++++++++++++++++++
src/tools/pgindent/typedefs.list | 1 -
4 files changed, 47 insertions(+), 37 deletions(-)
diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index b7b274aeff..4384ff591d 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -32,39 +32,7 @@
#define BITMAPSET_SIZE(nwords) \
(offsetof(Bitmapset, words) + (nwords) * sizeof(bitmapword))
-/*----------
- * This is a well-known cute trick for isolating the rightmost one-bit
- * in a word. It assumes two's complement arithmetic. Consider any
- * nonzero value, and focus attention on the rightmost one. The value is
- * then something like
- * xxxxxx10000
- * where x's are unspecified bits. The two's complement negative is formed
- * by inverting all the bits and adding one. Inversion gives
- * yyyyyy01111
- * where each y is the inverse of the corresponding x. Incrementing gives
- * yyyyyy10000
- * and then ANDing with the original value gives
- * 00000010000
- * This works for all cases except original value = zero, where of course
- * we get zero.
- *----------
- */
-#define RIGHTMOST_ONE(x) ((signedbitmapword) (x) & -((signedbitmapword) (x)))
-
-#define HAS_MULTIPLE_ONES(x) ((bitmapword) RIGHTMOST_ONE(x) != (x))
-
-/* Select appropriate bit-twiddling functions for bitmap word size */
-#if BITS_PER_BITMAPWORD == 32
-#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos32(w)
-#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos32(w)
-#define bmw_popcount(w) pg_popcount32(w)
-#elif BITS_PER_BITMAPWORD == 64
-#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos64(w)
-#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos64(w)
-#define bmw_popcount(w) pg_popcount64(w)
-#else
-#error "invalid BITS_PER_BITMAPWORD"
-#endif
+#define HAS_MULTIPLE_ONES(x) (bmw_rightmost_one(x) != (x))
/*
@@ -1013,7 +981,7 @@ bms_first_member(Bitmapset *a)
{
int result;
- w = RIGHTMOST_ONE(w);
+ w = bmw_rightmost_one(w);
a->words[wordnum] &= ~w;
result = wordnum * BITS_PER_BITMAPWORD;
diff --git a/src/include/nodes/bitmapset.h b/src/include/nodes/bitmapset.h
index 2792281658..fdc504596b 100644
--- a/src/include/nodes/bitmapset.h
+++ b/src/include/nodes/bitmapset.h
@@ -38,13 +38,11 @@ struct List;
#define BITS_PER_BITMAPWORD 64
typedef uint64 bitmapword; /* must be an unsigned type */
-typedef int64 signedbitmapword; /* must be the matching signed type */
#else
#define BITS_PER_BITMAPWORD 32
typedef uint32 bitmapword; /* must be an unsigned type */
-typedef int32 signedbitmapword; /* must be the matching signed type */
#endif
@@ -75,6 +73,20 @@ typedef enum
BMS_MULTIPLE /* >1 member */
} BMS_Membership;
+/* Select appropriate bit-twiddling functions for bitmap word size */
+#if BITS_PER_BITMAPWORD == 32
+#define bmw_rightmost_one(w) pg_rightmost_one32(w)
+#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos32(w)
+#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos32(w)
+#define bmw_popcount(w) pg_popcount32(w)
+#elif BITS_PER_BITMAPWORD == 64
+#define bmw_rightmost_one(w) pg_rightmost_one64(w)
+#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos64(w)
+#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos64(w)
+#define bmw_popcount(w) pg_popcount64(w)
+#else
+#error "invalid BITS_PER_BITMAPWORD"
+#endif
/*
* function prototypes in nodes/bitmapset.c
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 814e0b2dba..f95b6afd86 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -17,6 +17,37 @@ extern PGDLLIMPORT const uint8 pg_leftmost_one_pos[256];
extern PGDLLIMPORT const uint8 pg_rightmost_one_pos[256];
extern PGDLLIMPORT const uint8 pg_number_of_ones[256];
+/*----------
+ * This is a well-known cute trick for isolating the rightmost one-bit
+ * in a word. It assumes two's complement arithmetic. Consider any
+ * nonzero value, and focus attention on the rightmost one. The value is
+ * then something like
+ * xxxxxx10000
+ * where x's are unspecified bits. The two's complement negative is formed
+ * by inverting all the bits and adding one. Inversion gives
+ * yyyyyy01111
+ * where each y is the inverse of the corresponding x. Incrementing gives
+ * yyyyyy10000
+ * and then ANDing with the original value gives
+ * 00000010000
+ * This works for all cases except original value = zero, where of course
+ * we get zero.
+ *----------
+ */
+static inline uint32
+pg_rightmost_one32(uint32 word)
+{
+ int32 result = (int32) word & -((int32) word);
+ return (uint32) result;
+}
+
+static inline uint64
+pg_rightmost_one64(uint64 word)
+{
+ int64 result = (int64) word & -((int64) word);
+ return (uint64) result;
+}
+
/*
* pg_leftmost_one_pos32
* Returns the position of the most significant set bit in "word",
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 60c71d05fe..8305f09f2c 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3654,7 +3654,6 @@ shmem_request_hook_type
shmem_startup_hook_type
sig_atomic_t
sigjmp_buf
-signedbitmapword
sigset_t
size_t
slist_head
--
2.38.1
v16-0001-introduce-vector8_min-and-vector8_highbit_mask.patch
From f817851b80e4ec3fef4e5d9f32cc505c4d7f13f7 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 24 Oct 2022 14:07:09 +0900
Subject: [PATCH v16 01/12] introduce vector8_min and vector8_highbit_mask
---
src/include/port/simd.h | 47 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 47 insertions(+)
diff --git a/src/include/port/simd.h b/src/include/port/simd.h
index 61ae4ecf60..0b288c422a 100644
--- a/src/include/port/simd.h
+++ b/src/include/port/simd.h
@@ -77,6 +77,7 @@ static inline bool vector8_has(const Vector8 v, const uint8 c);
static inline bool vector8_has_zero(const Vector8 v);
static inline bool vector8_has_le(const Vector8 v, const uint8 c);
static inline bool vector8_is_highbit_set(const Vector8 v);
+static inline uint32 vector8_highbit_mask(const Vector8 v);
#ifndef USE_NO_SIMD
static inline bool vector32_is_highbit_set(const Vector32 v);
#endif
@@ -96,6 +97,7 @@ static inline Vector8 vector8_ssub(const Vector8 v1, const Vector8 v2);
*/
#ifndef USE_NO_SIMD
static inline Vector8 vector8_eq(const Vector8 v1, const Vector8 v2);
+static inline Vector8 vector8_min(const Vector8 v1, const Vector8 v2);
static inline Vector32 vector32_eq(const Vector32 v1, const Vector32 v2);
#endif
@@ -277,6 +279,36 @@ vector8_is_highbit_set(const Vector8 v)
#endif
}
+/*
+ * Return the bitmask of the high bit of each element.
+ */
+static inline uint32
+vector8_highbit_mask(const Vector8 v)
+{
+#ifdef USE_SSE2
+ return (uint32) _mm_movemask_epi8(v);
+#elif defined(USE_NEON)
+ static const uint8 mask[16] = {
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ };
+
+ uint8x16_t masked = vandq_u8(vld1q_u8(mask), (uint8x16_t) vshrq_n_s8(v, 7));
+ uint8x16_t maskedhi = vextq_u8(masked, masked, 8);
+
+ return (uint32) vaddvq_u16((uint16x8_t) vzip1q_u8(masked, maskedhi));
+#else
+ uint32 mask = 0;
+
+ for (Size i = 0; i < sizeof(Vector8); i++)
+ mask |= (((const uint8 *) &v)[i] >> 7) << i;
+
+ return mask;
+#endif
+}
+
/*
* Exactly like vector8_is_highbit_set except for the input type, so it
* looks at each byte separately.
@@ -372,4 +404,19 @@ vector32_eq(const Vector32 v1, const Vector32 v2)
}
#endif /* ! USE_NO_SIMD */
+/*
+ * Compare the given vectors and return the vector of minimum elements.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_min(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+ return _mm_min_epu8(v1, v2);
+#elif defined(USE_NEON)
+ return vminq_u8(v1, v2);
+#endif
+}
+#endif /* ! USE_NO_SIMD */
+
#endif /* SIMD_H */
--
2.38.1
v16-0003-Add-radix-implementation.patch
From 21751137cc807d4a9473f74ea287c8191dea5093 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 14 Sep 2022 12:38:51 +0000
Subject: [PATCH v16 03/12] Add radix implementation.
---
src/backend/lib/Makefile | 1 +
src/backend/lib/meson.build | 1 +
src/backend/lib/radixtree.c | 2541 +++++++++++++++++
src/include/lib/radixtree.h | 42 +
src/test/modules/Makefile | 1 +
src/test/modules/meson.build | 1 +
src/test/modules/test_radixtree/.gitignore | 4 +
src/test/modules/test_radixtree/Makefile | 23 +
src/test/modules/test_radixtree/README | 7 +
.../expected/test_radixtree.out | 36 +
src/test/modules/test_radixtree/meson.build | 34 +
.../test_radixtree/sql/test_radixtree.sql | 7 +
.../test_radixtree/test_radixtree--1.0.sql | 8 +
.../modules/test_radixtree/test_radixtree.c | 581 ++++
.../test_radixtree/test_radixtree.control | 4 +
15 files changed, 3291 insertions(+)
create mode 100644 src/backend/lib/radixtree.c
create mode 100644 src/include/lib/radixtree.h
create mode 100644 src/test/modules/test_radixtree/.gitignore
create mode 100644 src/test/modules/test_radixtree/Makefile
create mode 100644 src/test/modules/test_radixtree/README
create mode 100644 src/test/modules/test_radixtree/expected/test_radixtree.out
create mode 100644 src/test/modules/test_radixtree/meson.build
create mode 100644 src/test/modules/test_radixtree/sql/test_radixtree.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree--1.0.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree.c
create mode 100644 src/test/modules/test_radixtree/test_radixtree.control
diff --git a/src/backend/lib/Makefile b/src/backend/lib/Makefile
index 9dad31398a..4c1db794b6 100644
--- a/src/backend/lib/Makefile
+++ b/src/backend/lib/Makefile
@@ -22,6 +22,7 @@ OBJS = \
integerset.o \
knapsack.o \
pairingheap.o \
+ radixtree.o \
rbtree.o \
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/lib/meson.build b/src/backend/lib/meson.build
index 0edddffacf..8193da105a 100644
--- a/src/backend/lib/meson.build
+++ b/src/backend/lib/meson.build
@@ -11,4 +11,5 @@ backend_sources += files(
'knapsack.c',
'pairingheap.c',
'rbtree.c',
+ 'radixtree.c',
)
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
new file mode 100644
index 0000000000..e7f61fd943
--- /dev/null
+++ b/src/backend/lib/radixtree.c
@@ -0,0 +1,2541 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.c
+ * Implementation for adaptive radix tree.
+ *
+ * This module employs the idea from the paper "The Adaptive Radix Tree: ARTful
+ * Indexing for Main-Memory Databases" by Viktor Leis, Alfons Kemper, and Thomas
+ * Neumann, 2013. The radix tree uses adaptive node sizes, a small number of node
+ * types, each with a different number of elements. Depending on the number of
+ * children, the appropriate node type is used.
+ *
+ * There are some differences from the proposed implementation. For instance,
+ * there is no support for path compression or lazy path expansion. The radix
+ * tree supports only fixed-length keys, so we do not expect the tree to
+ * become very deep.
+ *
+ * Both the key and the value are 64-bit unsigned integer. The inner nodes and
+ * the leaf nodes have slightly different structure: for inner tree nodes,
+ * shift > 0, store the pointer to its child node as the value. The leaf nodes,
+ * shift == 0, have the 64-bit unsigned integer that is specified by the user as
+ * the value. The paper refers to this technique as "Multi-value leaves". We
+ * choose it to avoid an additional pointer traversal. It is the reason this code
+ * currently does not support variable-length keys.
+ *
+ * XXX: Most functions in this file have two variants, one for inner nodes and
+ * one for leaf nodes, so there is some code duplication. While this sometimes
+ * makes code maintenance tricky, it reduces branch prediction misses when
+ * deciding whether a node is an inner node or a leaf node.
+ *
+ * XXX: radix tree nodes are never shrunk.
+ *
+ * Interface
+ * ---------
+ *
+ * rt_create - Create a new, empty radix tree
+ * rt_free - Free the radix tree
+ * rt_search - Search a key-value pair
+ * rt_set - Set a key-value pair
+ * rt_delete - Delete a key-value pair
+ * rt_begin_iterate - Begin iterating through all key-value pairs
+ * rt_iterate_next - Return next key-value pair, if any
+ * rt_end_iterate - End iteration
+ * rt_memory_usage - Get the memory usage
+ * rt_num_entries - Get the number of key-value pairs
+ *
+ * rt_create() creates an empty radix tree in the given memory context
+ * and memory contexts for all kinds of radix tree node under the memory context.
+ *
+ * rt_iterate_next() returns key-value pairs in ascending order of the
+ * key.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/lib/radixtree.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "lib/radixtree.h"
+#include "lib/stringinfo.h"
+#include "miscadmin.h"
+#include "port/pg_bitutils.h"
+#include "port/pg_lfind.h"
+#include "utils/memutils.h"
+
+#ifdef RT_DEBUG
+#define UINT64_FORMAT_HEX "%" INT64_MODIFIER "X"
+#endif
+
+/* The number of bits encoded in one tree level */
+#define RT_NODE_SPAN BITS_PER_BYTE
+
+/* The number of maximum slots in the node */
+#define RT_NODE_MAX_SLOTS (1 << RT_NODE_SPAN)
+
+/*
+ * Return the number of bytes of bitmap space needed to cover nslots slots,
+ * used by nodes whose slots are indexed by array lookup.
+ */
+#define RT_NODE_NSLOTS_BITS(nslots) ((nslots) / (sizeof(uint8) * BITS_PER_BYTE))
+
+/* Mask for extracting a chunk from the key */
+#define RT_CHUNK_MASK ((1 << RT_NODE_SPAN) - 1)
+
+/* Maximum shift the radix tree uses */
+#define RT_MAX_SHIFT key_get_shift(UINT64_MAX)
+
+/* Tree level the radix tree uses */
+#define RT_MAX_LEVEL ((sizeof(uint64) * BITS_PER_BYTE) / RT_NODE_SPAN)
+
+/* Invalid index used in node-125 */
+#define RT_NODE_125_INVALID_IDX 0xFF
+
+/* Get a chunk from the key */
+#define RT_GET_KEY_CHUNK(key, shift) ((uint8) (((key) >> (shift)) & RT_CHUNK_MASK))
+
+/*
+ * Mapping from the value to the bit in is-set bitmap in the node-256.
+ */
+#define RT_NODE_BITMAP_BYTE(v) ((v) / BITS_PER_BYTE)
+#define RT_NODE_BITMAP_BIT(v) (UINT64CONST(1) << ((v) % RT_NODE_SPAN))
+
+/* Enum used by rt_node_search() */
+typedef enum
+{
+ RT_ACTION_FIND = 0, /* find the key-value */
+ RT_ACTION_DELETE, /* delete the key-value */
+} rt_action;
+
+/*
+ * Supported radix tree node kinds and size classes.
+ *
+ * There are 4 node kinds, and each node kind has one or two size classes,
+ * partial and full. Size classes within the same node kind share the same
+ * node structure but have a different fanout, which is stored in 'fanout'
+ * of rt_node. For example in size class 15, when a 16th element is to be
+ * inserted, we allocate a larger area and memcpy the entire old node to
+ * it.
+ *
+ * This technique allows us to limit the node kinds to 4, which limits the
+ * number of cases in switch statements. It also allows a possible future
+ * optimization to encode the node kind in a pointer tag.
+ *
+ * These size classes have been chosen carefully so that they minimize the
+ * allocator padding in both the inner and leaf nodes when allocated on
+ * DSA.
+ */
+#define RT_NODE_KIND_4 0x00
+#define RT_NODE_KIND_32 0x01
+#define RT_NODE_KIND_125 0x02
+#define RT_NODE_KIND_256 0x03
+#define RT_NODE_KIND_COUNT 4
+
+typedef enum rt_size_class
+{
+ RT_CLASS_4_FULL = 0,
+ RT_CLASS_32_PARTIAL,
+ RT_CLASS_32_FULL,
+ RT_CLASS_125_FULL,
+ RT_CLASS_256
+
+#define RT_SIZE_CLASS_COUNT (RT_CLASS_256 + 1)
+} rt_size_class;
+
+/* Common type for all nodes types */
+typedef struct rt_node
+{
+ /*
+ * Number of children. We use uint16 to be able to indicate 256 children
+ * at the fanout of 8.
+ */
+ uint16 count;
+
+ /* Max number of children. We can use uint8 because we never need to store 256 */
+ /* WIP: if we don't have a variable sized node4, this should instead be in the base
+ types as needed, since saving every byte is crucial for the smallest node kind */
+ uint8 fanout;
+
+ /*
+ * Shift indicates which part of the key space is represented by this
+ * node. That is, the key is shifted by 'shift' and the lowest
+ * RT_NODE_SPAN bits are then represented in chunk.
+ */
+ uint8 shift;
+ uint8 chunk;
+
+ /* Node kind, one per search/set algorithm */
+ uint8 kind;
+} rt_node;
+#define NODE_IS_LEAF(n) (((rt_node *) (n))->shift == 0)
+#define NODE_IS_EMPTY(n) (((rt_node *) (n))->count == 0)
+#define VAR_NODE_HAS_FREE_SLOT(node) \
+ ((node)->base.n.count < (node)->base.n.fanout)
+#define FIXED_NODE_HAS_FREE_SLOT(node, class) \
+ ((node)->base.n.count < rt_size_class_info[class].fanout)
+
+/* Base type of each node kinds for leaf and inner nodes */
+/* The base types must be a be able to accommodate the largest size
+class for variable-sized node kinds*/
+typedef struct rt_node_base_4
+{
+ rt_node n;
+
+ /* 4 children, for key chunks */
+ uint8 chunks[4];
+} rt_node_base_4;
+
+typedef struct rt_node_base32
+{
+ rt_node n;
+
+ /* 32 children, for key chunks */
+ uint8 chunks[32];
+} rt_node_base_32;
+
+/*
+ * node-125 uses slot_idx array, an array of RT_NODE_MAX_SLOTS length, typically
+ * 256, to store indexes into a second array that contains up to 125 values (or
+ * child pointers in inner nodes).
+ */
+typedef struct rt_node_base125
+{
+ rt_node n;
+
+ /* The index of slots for each fanout */
+ uint8 slot_idxs[RT_NODE_MAX_SLOTS];
+} rt_node_base_125;
+
+typedef struct rt_node_base256
+{
+ rt_node n;
+} rt_node_base_256;
+
+/*
+ * Inner and leaf nodes.
+ *
+ * They are separate for two main reasons:
+ *
+ * 1) the value type might be different than something fitting into a pointer
+ * width type
+ * 2) Need to represent non-existing values in a key-type independent way.
+ *
+ * 1) is clearly worth being concerned about, but it's not clear 2) is as
+ * good. It might be better to just indicate non-existing entries the same way
+ * in inner nodes.
+ */
+typedef struct rt_node_inner_4
+{
+ rt_node_base_4 base;
+
+ /* number of children depends on size class */
+ rt_node *children[FLEXIBLE_ARRAY_MEMBER];
+} rt_node_inner_4;
+
+typedef struct rt_node_leaf_4
+{
+ rt_node_base_4 base;
+
+ /* number of values depends on size class */
+ uint64 values[FLEXIBLE_ARRAY_MEMBER];
+} rt_node_leaf_4;
+
+typedef struct rt_node_inner_32
+{
+ rt_node_base_32 base;
+
+ /* number of children depends on size class */
+ rt_node *children[FLEXIBLE_ARRAY_MEMBER];
+} rt_node_inner_32;
+
+typedef struct rt_node_leaf_32
+{
+ rt_node_base_32 base;
+
+ /* number of values depends on size class */
+ uint64 values[FLEXIBLE_ARRAY_MEMBER];
+} rt_node_leaf_32;
+
+typedef struct rt_node_inner_125
+{
+ rt_node_base_125 base;
+
+ /* number of children depends on size class */
+ rt_node *children[FLEXIBLE_ARRAY_MEMBER];
+} rt_node_inner_125;
+
+typedef struct rt_node_leaf_125
+{
+ rt_node_base_125 base;
+
+ /* isset is a bitmap to track which slot is in use */
+ uint8 isset[RT_NODE_NSLOTS_BITS(128)];
+
+ /* number of values depends on size class */
+ uint64 values[FLEXIBLE_ARRAY_MEMBER];
+} rt_node_leaf_125;
+
+/*
+ * node-256 is the largest node type. This node has RT_NODE_MAX_SLOTS length array
+ * for directly storing values (or child pointers in inner nodes).
+ */
+typedef struct rt_node_inner_256
+{
+ rt_node_base_256 base;
+
+ /* Slots for 256 children */
+ rt_node *children[RT_NODE_MAX_SLOTS];
+} rt_node_inner_256;
+
+typedef struct rt_node_leaf_256
+{
+ rt_node_base_256 base;
+
+ /* isset is a bitmap to track which slot is in use */
+ uint8 isset[RT_NODE_NSLOTS_BITS(RT_NODE_MAX_SLOTS)];
+
+ /* Slots for 256 values */
+ uint64 values[RT_NODE_MAX_SLOTS];
+} rt_node_leaf_256;
+
+/* Information for each size class */
+typedef struct rt_size_class_elem
+{
+ const char *name;
+ int fanout;
+
+ /* slab chunk size */
+ Size inner_size;
+ Size leaf_size;
+
+ /* slab block size */
+ Size inner_blocksize;
+ Size leaf_blocksize;
+} rt_size_class_elem;
+
+/*
+ * Calculate the slab blocksize so that we can allocate at least 32 chunks
+ * from the block.
+ */
+#define NODE_SLAB_BLOCK_SIZE(size) \
+ Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * (size), (size) * 32)
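+
+/*
+ * For example, assuming SLAB_DEFAULT_BLOCK_SIZE is 8kB: a 40-byte chunk gives
+ * Max((8192 / 40) * 40, 40 * 32) = Max(8160, 1280) = 8160, i.e. the largest
+ * multiple of the chunk size that fits in the default block, while a
+ * hypothetical 2kB chunk gives Max(8192, 65536) = 64kB so that at least 32
+ * chunks still fit in one block.
+ */
+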
+static rt_size_class_elem rt_size_class_info[RT_SIZE_CLASS_COUNT] = {
+ [RT_CLASS_4_FULL] = {
+ .name = "radix tree node 4",
+ .fanout = 4,
+ .inner_size = sizeof(rt_node_inner_4) + 4 * sizeof(rt_node *),
+ .leaf_size = sizeof(rt_node_leaf_4) + 4 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_4) + 4 * sizeof(rt_node *)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_4) + 4 * sizeof(uint64)),
+ },
+ [RT_CLASS_32_PARTIAL] = {
+ .name = "radix tree node 15",
+ .fanout = 15,
+ .inner_size = sizeof(rt_node_inner_32) + 15 * sizeof(rt_node *),
+ .leaf_size = sizeof(rt_node_leaf_32) + 15 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_32) + 15 * sizeof(rt_node *)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_32) + 15 * sizeof(uint64)),
+ },
+ [RT_CLASS_32_FULL] = {
+ .name = "radix tree node 32",
+ .fanout = 32,
+ .inner_size = sizeof(rt_node_inner_32) + 32 * sizeof(rt_node *),
+ .leaf_size = sizeof(rt_node_leaf_32) + 32 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_32) + 32 * sizeof(rt_node *)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_32) + 32 * sizeof(uint64)),
+ },
+ [RT_CLASS_125_FULL] = {
+ .name = "radix tree node 125",
+ .fanout = 125,
+ .inner_size = sizeof(rt_node_inner_125) + 125 * sizeof(rt_node *),
+ .leaf_size = sizeof(rt_node_leaf_125) + 125 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_125) + 125 * sizeof(rt_node *)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_125) + 125 * sizeof(uint64)),
+ },
+ [RT_CLASS_256] = {
+ .name = "radix tree node 256",
+ .fanout = 256,
+ .inner_size = sizeof(rt_node_inner_256),
+ .leaf_size = sizeof(rt_node_leaf_256),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_256)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_256)),
+ },
+};
+
+/* Map from the node kind to its minimum size class */
+static rt_size_class kind_min_size_class[RT_NODE_KIND_COUNT] = {
+ [RT_NODE_KIND_4] = RT_CLASS_4_FULL,
+ [RT_NODE_KIND_32] = RT_CLASS_32_PARTIAL,
+ [RT_NODE_KIND_125] = RT_CLASS_125_FULL,
+ [RT_NODE_KIND_256] = RT_CLASS_256,
+};
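+
+/*
+ * For example, a new node of kind RT_NODE_KIND_32 starts out in the
+ * RT_CLASS_32_PARTIAL size class (fanout 15); when that fills up it is
+ * reallocated as RT_CLASS_32_FULL (fanout 32) while keeping the same kind,
+ * and only when the full class is also exhausted does it grow to the next
+ * kind (node-125).
+ */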
+
+/*
+ * Iteration support.
+ *
+ * Iterating the radix tree returns each pair of key and value in ascending
+ * order of the key. To support this, we iterate over the nodes of each level.
+ *
+ * rt_node_iter struct is used to track the iteration within a node.
+ *
+ * rt_iter is the struct for iteration of the radix tree, and uses rt_node_iter
+ * in order to track the iteration of each level. During the iteration, we also
+ * construct the key whenever updating the node iteration information, e.g., when
+ * advancing the current index within the node or when moving to the next node
+ * at the same level.
+ */
+typedef struct rt_node_iter
+{
+ rt_node *node; /* current node being iterated */
+ int current_idx; /* current position. -1 for initial value */
+} rt_node_iter;
+
+struct rt_iter
+{
+ radix_tree *tree;
+
+ /* Track the iteration on nodes of each level */
+ rt_node_iter stack[RT_MAX_LEVEL];
+ int stack_len;
+
+ /* The key is being constructed during the iteration */
+ uint64 key;
+};
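+
+/*
+ * Typical iteration usage (a minimal sketch):
+ *
+ *   rt_iter *iter = rt_begin_iterate(tree);
+ *   uint64   key;
+ *   uint64   value;
+ *
+ *   while (rt_iterate_next(iter, &key, &value))
+ *       ... process the pair; keys are returned in ascending order ...
+ *
+ *   rt_end_iterate(iter);
+ */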
+
+/* The radix tree itself */
+struct radix_tree
+{
+ MemoryContext context;
+
+ rt_node *root;
+ uint64 max_val;
+ uint64 num_keys;
+
+ MemoryContextData *inner_slabs[RT_SIZE_CLASS_COUNT];
+ MemoryContextData *leaf_slabs[RT_SIZE_CLASS_COUNT];
+
+ /* statistics */
+#ifdef RT_DEBUG
+ int32 cnt[RT_SIZE_CLASS_COUNT];
+#endif
+};
+
+static void rt_new_root(radix_tree *tree, uint64 key);
+static rt_node *rt_alloc_node(radix_tree *tree, rt_size_class size_class, bool inner);
+static inline void rt_init_node(rt_node *node, uint8 kind, rt_size_class size_class,
+ bool inner);
+static void rt_free_node(radix_tree *tree, rt_node *node);
+static void rt_extend(radix_tree *tree, uint64 key);
+static inline bool rt_node_search_inner(rt_node *node, uint64 key, rt_action action,
+ rt_node **child_p);
+static inline bool rt_node_search_leaf(rt_node *node, uint64 key, rt_action action,
+ uint64 *value_p);
+static bool rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node,
+ uint64 key, rt_node *child);
+static bool rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
+ uint64 key, uint64 value);
+static inline rt_node *rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter);
+static inline bool rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
+ uint64 *value_p);
+static void rt_update_iter_stack(rt_iter *iter, rt_node *from_node, int from);
+static inline void rt_iter_update_key(rt_iter *iter, uint8 chunk, uint8 shift);
+
+/* verification (available only with assertion) */
+static void rt_verify_node(rt_node *node);
+
+/*
+ * Return the index of the first element in the node's chunk array that equals
+ * 'chunk'. Return -1 if there is no such element.
+ */
+static inline int
+node_4_search_eq(rt_node_base_4 *node, uint8 chunk)
+{
+ int idx = -1;
+
+ for (int i = 0; i < node->n.count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ idx = i;
+ break;
+ }
+ }
+
+ return idx;
+}
+
+/*
+ * Return the index at which 'chunk' should be inserted into the chunks array
+ * of the given node.
+ */
+static inline int
+node_4_get_insertpos(rt_node_base_4 *node, uint8 chunk)
+{
+ int idx;
+
+ for (idx = 0; idx < node->n.count; idx++)
+ {
+ if (node->chunks[idx] >= chunk)
+ break;
+ }
+
+ return idx;
+}
+
+/*
+ * Return the index of the first element in the node's chunk array that equals
+ * 'chunk'. Return -1 if there is no such element.
+ */
+static inline int
+node_32_search_eq(rt_node_base_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ uint32 bitfield;
+ int index_simd = -1;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index = -1;
+
+ for (int i = 0; i < count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ index = i;
+ break;
+ }
+ }
+#endif
+
+#ifndef USE_NO_SIMD
+ spread_chunk = vector8_broadcast(chunk);
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ cmp1 = vector8_eq(spread_chunk, haystack1);
+ cmp2 = vector8_eq(spread_chunk, haystack2);
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+/*
+ * Return the index at which 'chunk' should be inserted into the chunks array
+ * of the given node.
+ */
+static inline int
+node_32_get_insertpos(rt_node_base_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ Vector8 min1;
+ Vector8 min2;
+ uint32 bitfield;
+ int index_simd;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index;
+
+ for (index = 0; index < count; index++)
+ {
+ if (node->chunks[index] >= chunk)
+ break;
+ }
+#endif
+
+#ifndef USE_NO_SIMD
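+ /*
+ * The SIMD path below finds the first position whose chunk is >= 'chunk':
+ * taking the element-wise minimum with the broadcast search value, the
+ * positions where the minimum equals the search value are exactly those
+ * where chunks[i] >= chunk. In assertion-enabled builds the scalar loop
+ * above double-checks the result.
+ */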
+ spread_chunk = vector8_broadcast(chunk);
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ min1 = vector8_min(spread_chunk, haystack1);
+ min2 = vector8_min(spread_chunk, haystack2);
+ cmp1 = vector8_eq(spread_chunk, min1);
+ cmp2 = vector8_eq(spread_chunk, min2);
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+ else
+ index_simd = count;
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+/*
+ * Functions to manipulate both chunks array and children/values array.
+ * These are used for node-4 and node-32.
+ */
+
+/* Shift the elements at and after 'idx' one position to the right */
+static inline void
+chunk_children_array_shift(uint8 *chunks, rt_node **children, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+ memmove(&(children[idx + 1]), &(children[idx]), sizeof(rt_node *) * (count - idx));
+}
+
+static inline void
+chunk_values_array_shift(uint8 *chunks, uint64 *values, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+ memmove(&(values[idx + 1]), &(values[idx]), sizeof(uint64) * (count - idx));
+}
+
+/* Delete the element at 'idx' */
+static inline void
+chunk_children_array_delete(uint8 *chunks, rt_node **children, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(children[idx]), &(children[idx + 1]), sizeof(rt_node *) * (count - idx - 1));
+}
+
+static inline void
+chunk_values_array_delete(uint8 *chunks, uint64 *values, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(values[idx]), &(values[idx + 1]), sizeof(uint64) * (count - idx - 1));
+}
+
+/* Copy both chunks and children/values arrays */
+static inline void
+chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
+ uint8 *dst_chunks, rt_node **dst_children)
+{
+ const int fanout = rt_size_class_info[RT_CLASS_4_FULL].fanout;
+ const Size chunk_size = sizeof(uint8) * fanout;
+ const Size children_size = sizeof(rt_node *) * fanout;
+
+ memcpy(dst_chunks, src_chunks, chunk_size);
+ memcpy(dst_children, src_children, children_size);
+}
+
+static inline void
+chunk_values_array_copy(uint8 *src_chunks, uint64 *src_values,
+ uint8 *dst_chunks, uint64 *dst_values)
+{
+ const int fanout = rt_size_class_info[RT_CLASS_4_FULL].fanout;
+ const Size chunk_size = sizeof(uint8) * fanout;
+ const Size values_size = sizeof(uint64) * fanout;
+
+ memcpy(dst_chunks, src_chunks, chunk_size);
+ memcpy(dst_values, src_values, values_size);
+}
+
+/* Functions to manipulate inner and leaf node-125 */
+
+/* Does the given chunk in the node have a value (or child)? */
+static inline bool
+node_125_is_chunk_used(rt_node_base_125 *node, uint8 chunk)
+{
+ return node->slot_idxs[chunk] != RT_NODE_125_INVALID_IDX;
+}
+
+/* Is the slot in the node used? */
+static inline bool
+node_inner_125_is_slot_used(rt_node_inner_125 *node, uint8 slot)
+{
+ Assert(!NODE_IS_LEAF(node));
+ Assert(slot < node->base.n.fanout);
+ return (node->children[slot] != NULL);
+}
+
+static inline bool
+node_leaf_125_is_slot_used(rt_node_leaf_125 *node, uint8 slot)
+{
+ Assert(NODE_IS_LEAF(node));
+ Assert(slot < node->base.n.fanout);
+ return ((node->isset[RT_NODE_BITMAP_BYTE(slot)] & RT_NODE_BITMAP_BIT(slot)) != 0);
+}
+
+static inline rt_node *
+node_inner_125_get_child(rt_node_inner_125 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ return node->children[node->base.slot_idxs[chunk]];
+}
+
+static inline uint64
+node_leaf_125_get_value(rt_node_leaf_125 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ Assert(((rt_node_base_125 *) node)->slot_idxs[chunk] != RT_NODE_125_INVALID_IDX);
+ return node->values[node->base.slot_idxs[chunk]];
+}
+
+static void
+node_inner_125_delete(rt_node_inner_125 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->children[node->base.slot_idxs[chunk]] = NULL;
+ node->base.slot_idxs[chunk] = RT_NODE_125_INVALID_IDX;
+}
+
+static void
+node_leaf_125_delete(rt_node_leaf_125 *node, uint8 chunk)
+{
+ int slotpos = node->base.slot_idxs[chunk];
+
+ Assert(NODE_IS_LEAF(node));
+ node->isset[RT_NODE_BITMAP_BYTE(slotpos)] &= ~(RT_NODE_BITMAP_BIT(slotpos));
+ node->base.slot_idxs[chunk] = RT_NODE_125_INVALID_IDX;
+}
+
+/* Return an unused slot in node-125 */
+static int
+node_inner_125_find_unused_slot(rt_node_inner_125 *node, uint8 chunk)
+{
+ int slotpos = 0;
+
+ Assert(!NODE_IS_LEAF(node));
+ while (node_inner_125_is_slot_used(node, slotpos))
+ slotpos++;
+
+ return slotpos;
+}
+
+static int
+node_leaf_125_find_unused_slot(rt_node_leaf_125 *node, uint8 chunk)
+{
+ int slotpos;
+
+ Assert(NODE_IS_LEAF(node));
+
+ /* We iterate over the isset bitmap per byte then check each bit */
+ for (slotpos = 0; slotpos < RT_NODE_NSLOTS_BITS(128); slotpos++)
+ {
+ if (node->isset[slotpos] < 0xFF)
+ break;
+ }
+ Assert(slotpos < RT_NODE_NSLOTS_BITS(128));
+
+ slotpos *= BITS_PER_BYTE;
+ while (node_leaf_125_is_slot_used(node, slotpos))
+ slotpos++;
+
+ return slotpos;
+}
+
+static inline void
+node_inner_125_insert(rt_node_inner_125 *node, uint8 chunk, rt_node *child)
+{
+ int slotpos;
+
+ Assert(!NODE_IS_LEAF(node));
+
+ /* find unused slot */
+ slotpos = node_inner_125_find_unused_slot(node, chunk);
+ Assert(slotpos < node->base.n.fanout);
+
+ node->base.slot_idxs[chunk] = slotpos;
+ node->children[slotpos] = child;
+}
+
+/* Set the slot at the corresponding chunk */
+static inline void
+node_leaf_125_insert(rt_node_leaf_125 *node, uint8 chunk, uint64 value)
+{
+ int slotpos;
+
+ Assert(NODE_IS_LEAF(node));
+
+ /* find unused slot */
+ slotpos = node_leaf_125_find_unused_slot(node, chunk);
+ Assert(slotpos < node->base.n.fanout);
+
+ node->base.slot_idxs[chunk] = slotpos;
+ node->isset[RT_NODE_BITMAP_BYTE(slotpos)] |= RT_NODE_BITMAP_BIT(slotpos);
+ node->values[slotpos] = value;
+}
+
+/* Update the child corresponding to 'chunk' to 'child' */
+static inline void
+node_inner_125_update(rt_node_inner_125 *node, uint8 chunk, rt_node *child)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->children[node->base.slot_idxs[chunk]] = child;
+}
+
+static inline void
+node_leaf_125_update(rt_node_leaf_125 *node, uint8 chunk, uint64 value)
+{
+ Assert(NODE_IS_LEAF(node));
+ node->values[node->base.slot_idxs[chunk]] = value;
+}
+
+/* Functions to manipulate inner and leaf node-256 */
+
+/* Return true if the slot corresponding to the given chunk is in use */
+static inline bool
+node_inner_256_is_chunk_used(rt_node_inner_256 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ return (node->children[chunk] != NULL);
+}
+
+static inline bool
+node_leaf_256_is_chunk_used(rt_node_leaf_256 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ return (node->isset[RT_NODE_BITMAP_BYTE(chunk)] & RT_NODE_BITMAP_BIT(chunk)) != 0;
+}
+
+static inline rt_node *
+node_inner_256_get_child(rt_node_inner_256 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ Assert(node_inner_256_is_chunk_used(node, chunk));
+ return node->children[chunk];
+}
+
+static inline uint64
+node_leaf_256_get_value(rt_node_leaf_256 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ Assert(node_leaf_256_is_chunk_used(node, chunk));
+ return node->values[chunk];
+}
+
+/* Set the child in the node-256 */
+static inline void
+node_inner_256_set(rt_node_inner_256 *node, uint8 chunk, rt_node *child)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->children[chunk] = child;
+}
+
+/* Set the value in the node-256 */
+static inline void
+node_leaf_256_set(rt_node_leaf_256 *node, uint8 chunk, uint64 value)
+{
+ Assert(NODE_IS_LEAF(node));
+ node->isset[RT_NODE_BITMAP_BYTE(chunk)] |= RT_NODE_BITMAP_BIT(chunk);
+ node->values[chunk] = value;
+}
+
+/* Clear the slot at the given chunk position */
+static inline void
+node_inner_256_delete(rt_node_inner_256 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->children[chunk] = NULL;
+}
+
+static inline void
+node_leaf_256_delete(rt_node_leaf_256 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ node->isset[RT_NODE_BITMAP_BYTE(chunk)] &= ~(RT_NODE_BITMAP_BIT(chunk));
+}
+
+/*
+ * Return the shift needed for a node to be able to store the given key.
+ */
+static inline int
+key_get_shift(uint64 key)
+{
+ return (key == 0)
+ ? 0
+ : (pg_leftmost_one_pos64(key) / RT_NODE_SPAN) * RT_NODE_SPAN;
+}
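+
+/*
+ * For example, assuming RT_NODE_SPAN is 8: keys up to 0xFF yield shift 0,
+ * key 0x1234 (leftmost one bit at position 12) yields (12 / 8) * 8 = 8, and
+ * key 0x100000000 yields 32.
+ */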
+
+/*
+ * Return the maximum value that can be stored in a tree whose root node has
+ * the given shift.
+ */
+static uint64
+shift_get_max_val(int shift)
+{
+ if (shift == RT_MAX_SHIFT)
+ return UINT64_MAX;
+
+ return (UINT64CONST(1) << (shift + RT_NODE_SPAN)) - 1;
+}
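+
+/*
+ * For example, assuming RT_NODE_SPAN is 8: shift 0 gives 0xFF, shift 8 gives
+ * 0xFFFF, and RT_MAX_SHIFT gives UINT64_MAX.
+ */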
+
+/*
+ * Create a new node as the root. Subordinate nodes will be created during
+ * the insertion.
+ */
+static void
+rt_new_root(radix_tree *tree, uint64 key)
+{
+ int shift = key_get_shift(key);
+ bool inner = shift > 0;
+ rt_node *newnode;
+
+ newnode = rt_alloc_node(tree, RT_CLASS_4_FULL, inner);
+ rt_init_node(newnode, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
+ newnode->shift = shift;
+ tree->max_val = shift_get_max_val(shift);
+ tree->root = newnode;
+}
+
+/*
+ * Allocate a new node of the given size class.
+ */
+static rt_node *
+rt_alloc_node(radix_tree *tree, rt_size_class size_class, bool inner)
+{
+ rt_node *newnode;
+
+ if (inner)
+ newnode = (rt_node *) MemoryContextAlloc(tree->inner_slabs[size_class],
+ rt_size_class_info[size_class].inner_size);
+ else
+ newnode = (rt_node *) MemoryContextAlloc(tree->leaf_slabs[size_class],
+ rt_size_class_info[size_class].leaf_size);
+
+#ifdef RT_DEBUG
+ /* update the statistics */
+ tree->cnt[size_class]++;
+#endif
+
+ return newnode;
+}
+
+/* Initialize the node contents */
+static inline void
+rt_init_node(rt_node *node, uint8 kind, rt_size_class size_class, bool inner)
+{
+ if (inner)
+ MemSet(node, 0, rt_size_class_info[size_class].inner_size);
+ else
+ MemSet(node, 0, rt_size_class_info[size_class].leaf_size);
+
+ node->kind = kind;
+ node->fanout = rt_size_class_info[size_class].fanout;
+
+ /* Initialize slot_idxs to invalid values */
+ if (kind == RT_NODE_KIND_125)
+ {
+ rt_node_base_125 *n125 = (rt_node_base_125 *) node;
+
+ memset(n125->slot_idxs, RT_NODE_125_INVALID_IDX, sizeof(n125->slot_idxs));
+ }
+
+ /*
+ * Technically it's 256, but we cannot store that in a uint8,
+ * and this is the max size class so it will never grow.
+ */
+ if (kind == RT_NODE_KIND_256)
+ node->fanout = 0;
+}
+
+static inline void
+rt_copy_node(rt_node *newnode, rt_node *oldnode)
+{
+ newnode->shift = oldnode->shift;
+ newnode->chunk = oldnode->chunk;
+ newnode->count = oldnode->count;
+}
+
+/*
+ * Create a new node with 'new_kind' and the same shift, chunk, and
+ * count as 'node'.
+ */
+static rt_node*
+rt_grow_node_kind(radix_tree *tree, rt_node *node, uint8 new_kind)
+{
+ rt_node *newnode;
+ bool inner = !NODE_IS_LEAF(node);
+
+ newnode = rt_alloc_node(tree, kind_min_size_class[new_kind], inner);
+ rt_init_node(newnode, new_kind, kind_min_size_class[new_kind], inner);
+ rt_copy_node(newnode, node);
+
+ return newnode;
+}
+
+/* Free the given node */
+static void
+rt_free_node(radix_tree *tree, rt_node *node)
+{
+ /* If we're deleting the root node, make the tree empty */
+ if (tree->root == node)
+ {
+ tree->root = NULL;
+ tree->max_val = 0;
+ }
+
+#ifdef RT_DEBUG
+ {
+ int i;
+
+ /* update the statistics */
+ for (i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ if (node->fanout == rt_size_class_info[i].fanout)
+ break;
+ }
+
+ /* fanout of node256 is intentionally 0 */
+ if (i == RT_SIZE_CLASS_COUNT)
+ i = RT_CLASS_256;
+
+ tree->cnt[i]--;
+ Assert(tree->cnt[i] >= 0);
+ }
+#endif
+
+ pfree(node);
+}
+
+/*
+ * Replace old_child with new_child, and free the old one.
+ */
+static void
+rt_replace_node(radix_tree *tree, rt_node *parent, rt_node *old_child,
+ rt_node *new_child, uint64 key)
+{
+ Assert(old_child->chunk == new_child->chunk);
+ Assert(old_child->shift == new_child->shift);
+
+ if (parent == old_child)
+ {
+ /* Replace the root node with the new large node */
+ tree->root = new_child;
+ }
+ else
+ {
+ bool replaced PG_USED_FOR_ASSERTS_ONLY;
+
+ replaced = rt_node_insert_inner(tree, NULL, parent, key, new_child);
+ Assert(replaced);
+ }
+
+ rt_free_node(tree, old_child);
+}
+
+/*
+ * The radix tree doesn't have sufficient height. Extend the radix tree so it
+ * can store the key.
+ */
+static void
+rt_extend(radix_tree *tree, uint64 key)
+{
+ int target_shift;
+ int shift = tree->root->shift + RT_NODE_SPAN;
+
+ target_shift = key_get_shift(key);
+
+ /* Grow tree from 'shift' to 'target_shift' */
+ while (shift <= target_shift)
+ {
+ rt_node_inner_4 *node;
+
+ node = (rt_node_inner_4 *) rt_alloc_node(tree, RT_CLASS_4_FULL, true);
+ rt_init_node((rt_node *) node, RT_NODE_KIND_4, RT_CLASS_4_FULL, true);
+ node->base.n.shift = shift;
+ node->base.n.count = 1;
+ node->base.chunks[0] = 0;
+ node->children[0] = tree->root;
+
+ tree->root->chunk = 0;
+ tree->root = (rt_node *) node;
+
+ shift += RT_NODE_SPAN;
+ }
+
+ tree->max_val = shift_get_max_val(target_shift);
+}
+
+/*
+ * The radix tree doesn't have the inner and leaf nodes needed for the given
+ * key-value pair. Insert inner and leaf nodes from 'node' down to the bottom.
+ */
+static inline void
+rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node *parent,
+ rt_node *node)
+{
+ int shift = node->shift;
+
+ while (shift >= RT_NODE_SPAN)
+ {
+ rt_node *newchild;
+ int newshift = shift - RT_NODE_SPAN;
+ bool inner = newshift > 0;
+
+ newchild = rt_alloc_node(tree, RT_CLASS_4_FULL, inner);
+ rt_init_node(newchild, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
+ newchild->shift = newshift;
+ newchild->chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ rt_node_insert_inner(tree, parent, node, key, newchild);
+
+ parent = node;
+ node = newchild;
+ shift -= RT_NODE_SPAN;
+ }
+
+ rt_node_insert_leaf(tree, parent, node, key, value);
+ tree->num_keys++;
+}
+
+/*
+ * Search for the child pointer corresponding to 'key' in the given node, and
+ * do the specified 'action'.
+ *
+ * Return true if the key is found, otherwise return false. On success, the child
+ * pointer is set to child_p.
+ */
+static inline bool
+rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **child_p)
+{
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool found = false;
+ rt_node *child = NULL;
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+ int idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
+
+ if (idx < 0)
+ break;
+
+ found = true;
+
+ if (action == RT_ACTION_FIND)
+ child = n4->children[idx];
+ else /* RT_ACTION_DELETE */
+ chunk_children_array_delete(n4->base.chunks, n4->children,
+ n4->base.n.count, idx);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+ int idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
+
+ if (idx < 0)
+ break;
+
+ found = true;
+ if (action == RT_ACTION_FIND)
+ child = n32->children[idx];
+ else /* RT_ACTION_DELETE */
+ chunk_children_array_delete(n32->base.chunks, n32->children,
+ n32->base.n.count, idx);
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ rt_node_inner_125 *n125 = (rt_node_inner_125 *) node;
+
+ if (!node_125_is_chunk_used((rt_node_base_125 *) n125, chunk))
+ break;
+
+ found = true;
+
+ if (action == RT_ACTION_FIND)
+ child = node_inner_125_get_child(n125, chunk);
+ else /* RT_ACTION_DELETE */
+ node_inner_125_delete(n125, chunk);
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+
+ if (!node_inner_256_is_chunk_used(n256, chunk))
+ break;
+
+ found = true;
+ if (action == RT_ACTION_FIND)
+ child = node_inner_256_get_child(n256, chunk);
+ else /* RT_ACTION_DELETE */
+ node_inner_256_delete(n256, chunk);
+
+ break;
+ }
+ }
+
+ /* update statistics */
+ if (action == RT_ACTION_DELETE && found)
+ node->count--;
+
+ if (found && child_p)
+ *child_p = child;
+
+ return found;
+}
+
+/*
+ * Search for the value corresponding to 'key' in the given node, and do the
+ * specified 'action'.
+ *
+ * Return true if the key is found, otherwise return false. On success, the pointer
+ * to the value is set to value_p.
+ */
+static inline bool
+rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p)
+{
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool found = false;
+ uint64 value = 0;
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+ int idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
+
+ if (idx < 0)
+ break;
+
+ found = true;
+
+ if (action == RT_ACTION_FIND)
+ value = n4->values[idx];
+ else /* RT_ACTION_DELETE */
+ chunk_values_array_delete(n4->base.chunks, (uint64 *) n4->values,
+ n4->base.n.count, idx);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+ int idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
+
+ if (idx < 0)
+ break;
+
+ found = true;
+ if (action == RT_ACTION_FIND)
+ value = n32->values[idx];
+ else /* RT_ACTION_DELETE */
+ chunk_values_array_delete(n32->base.chunks, (uint64 *) n32->values,
+ n32->base.n.count, idx);
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) node;
+
+ if (!node_125_is_chunk_used((rt_node_base_125 *) n125, chunk))
+ break;
+
+ found = true;
+
+ if (action == RT_ACTION_FIND)
+ value = node_leaf_125_get_value(n125, chunk);
+ else /* RT_ACTION_DELETE */
+ node_leaf_125_delete(n125, chunk);
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+
+ if (!node_leaf_256_is_chunk_used(n256, chunk))
+ break;
+
+ found = true;
+ if (action == RT_ACTION_FIND)
+ value = node_leaf_256_get_value(n256, chunk);
+ else /* RT_ACTION_DELETE */
+ node_leaf_256_delete(n256, chunk);
+
+ break;
+ }
+ }
+
+ /* update statistics */
+ if (action == RT_ACTION_DELETE && found)
+ node->count--;
+
+ if (found && value_p)
+ *value_p = value;
+
+ return found;
+}
+
+/* Insert the child to the inner node */
+static bool
+rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 key,
+ rt_node *child)
+{
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool chunk_exists = false;
+
+ Assert(!NODE_IS_LEAF(node));
+
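+ /*
+ * Note on control flow: when a node is full, it is grown either to a
+ * larger size class of the same kind (retrying via goto) or to the next
+ * node kind, in which case we fall through to the next case to redo the
+ * insertion on the new, larger node.
+ */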
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+ int idx;
+
+ idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n4->children[idx] = child;
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n4)))
+ {
+ rt_node_inner_32 *new32;
+ Assert(parent != NULL);
+
+ /* grow node from 4 to 32 */
+ new32 = (rt_node_inner_32 *) rt_grow_node_kind(tree, (rt_node *) n4,
+ RT_NODE_KIND_32);
+ chunk_children_array_copy(n4->base.chunks, n4->children,
+ new32->base.chunks, new32->children);
+
+ Assert(parent != NULL);
+ rt_replace_node(tree, parent, (rt_node *) n4, (rt_node *) new32,
+ key);
+ node = (rt_node *) new32;
+ }
+ else
+ {
+ int insertpos = node_4_get_insertpos((rt_node_base_4 *) n4, chunk);
+ uint16 count = n4->base.n.count;
+
+ /* shift chunks and children */
+ if (count != 0 && insertpos < count)
+ chunk_children_array_shift(n4->base.chunks, n4->children,
+ count, insertpos);
+
+ n4->base.chunks[insertpos] = chunk;
+ n4->children[insertpos] = child;
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_32:
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+ int idx;
+
+ idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n32->children[idx] = child;
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)))
+ {
+ Assert(parent != NULL);
+
+ if (n32->base.n.count == rt_size_class_info[RT_CLASS_32_PARTIAL].fanout)
+ {
+ /* use the same node kind, but expand to the next size class */
+ const Size size = rt_size_class_info[RT_CLASS_32_PARTIAL].inner_size;
+ const int fanout = rt_size_class_info[RT_CLASS_32_FULL].fanout;
+ rt_node_inner_32 *new32;
+
+ new32 = (rt_node_inner_32 *) rt_alloc_node(tree, RT_CLASS_32_FULL, true);
+ memcpy(new32, n32, size);
+ new32->base.n.fanout = fanout;
+
+ rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new32, key);
+
+ /* must update both pointers here */
+ node = (rt_node *) new32;
+ n32 = new32;
+
+ goto retry_insert_inner_32;
+ }
+ else
+ {
+ rt_node_inner_125 *new125;
+
+ /* grow node from 32 to 125 */
+ new125 = (rt_node_inner_125 *) rt_grow_node_kind(tree, (rt_node *) n32,
+ RT_NODE_KIND_125);
+ for (int i = 0; i < n32->base.n.count; i++)
+ node_inner_125_insert(new125, n32->base.chunks[i], n32->children[i]);
+
+ rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new125, key);
+ node = (rt_node *) new125;
+ }
+ }
+ else
+ {
+retry_insert_inner_32:
+ {
+ int insertpos = node_32_get_insertpos((rt_node_base_32 *) n32, chunk);
+ int16 count = n32->base.n.count;
+
+ if (count != 0 && insertpos < count)
+ chunk_children_array_shift(n32->base.chunks, n32->children,
+ count, insertpos);
+
+ n32->base.chunks[insertpos] = chunk;
+ n32->children[insertpos] = child;
+ break;
+ }
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_125:
+ {
+ rt_node_inner_125 *n125 = (rt_node_inner_125 *) node;
+ int cnt = 0;
+
+ if (node_125_is_chunk_used((rt_node_base_125 *) n125, chunk))
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ node_inner_125_update(n125, chunk, child);
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n125)))
+ {
+ rt_node_inner_256 *new256;
+ Assert(parent != NULL);
+
+ /* grow node from 125 to 256 */
+ new256 = (rt_node_inner_256 *) rt_grow_node_kind(tree, (rt_node *) n125,
+ RT_NODE_KIND_256);
+ for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n125->base.n.count; i++)
+ {
+ if (!node_125_is_chunk_used((rt_node_base_125 *) n125, i))
+ continue;
+
+ node_inner_256_set(new256, i, node_inner_125_get_child(n125, i));
+ cnt++;
+ }
+
+ rt_replace_node(tree, parent, (rt_node *) n125, (rt_node *) new256,
+ key);
+ node = (rt_node *) new256;
+ }
+ else
+ {
+ node_inner_125_insert(n125, chunk, child);
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_256:
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+
+ chunk_exists = node_inner_256_is_chunk_used(n256, chunk);
+ Assert(chunk_exists || FIXED_NODE_HAS_FREE_SLOT(n256, RT_CLASS_256));
+
+ node_inner_256_set(n256, chunk, child);
+ break;
+ }
+ }
+
+ /* Update statistics */
+ if (!chunk_exists)
+ node->count++;
+
+ /*
+ * Done. Finally, verify the chunk and value are inserted or replaced
+ * properly in the node.
+ */
+ rt_verify_node(node);
+
+ return chunk_exists;
+}
+
+/* Insert the value to the leaf node */
+static bool
+rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
+ uint64 key, uint64 value)
+{
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool chunk_exists = false;
+
+ Assert(NODE_IS_LEAF(node));
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+ int idx;
+
+ idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n4->values[idx] = value;
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n4)))
+ {
+ rt_node_leaf_32 *new32;
+ Assert(parent != NULL);
+
+ /* grow node from 4 to 32 */
+ new32 = (rt_node_leaf_32 *) rt_grow_node_kind(tree, (rt_node *) n4,
+ RT_NODE_KIND_32);
+ chunk_values_array_copy(n4->base.chunks, n4->values,
+ new32->base.chunks, new32->values);
+ rt_replace_node(tree, parent, (rt_node *) n4, (rt_node *) new32, key);
+ node = (rt_node *) new32;
+ }
+ else
+ {
+ int insertpos = node_4_get_insertpos((rt_node_base_4 *) n4, chunk);
+ int count = n4->base.n.count;
+
+ /* shift chunks and values */
+ if (count != 0 && insertpos < count)
+ chunk_values_array_shift(n4->base.chunks, n4->values,
+ count, insertpos);
+
+ n4->base.chunks[insertpos] = chunk;
+ n4->values[insertpos] = value;
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_32:
+ {
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+ int idx;
+
+ idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n32->values[idx] = value;
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)))
+ {
+ Assert(parent != NULL);
+
+ if (n32->base.n.count == rt_size_class_info[RT_CLASS_32_PARTIAL].fanout)
+ {
+ /* use the same node kind, but expand to the next size class */
+ const Size size = rt_size_class_info[RT_CLASS_32_PARTIAL].leaf_size;
+ const int fanout = rt_size_class_info[RT_CLASS_32_FULL].fanout;
+ rt_node_leaf_32 *new32;
+
+ new32 = (rt_node_leaf_32 *) rt_alloc_node(tree, RT_CLASS_32_FULL, false);
+ memcpy(new32, n32, size);
+ new32->base.n.fanout = fanout;
+
+ rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new32, key);
+
+ /* must update both pointers here */
+ node = (rt_node *) new32;
+ n32 = new32;
+
+ goto retry_insert_leaf_32;
+ }
+ else
+ {
+ rt_node_leaf_125 *new125;
+
+ /* grow node from 32 to 125 */
+ new125 = (rt_node_leaf_125 *) rt_grow_node_kind(tree, (rt_node *) n32,
+ RT_NODE_KIND_125);
+ for (int i = 0; i < n32->base.n.count; i++)
+ node_leaf_125_insert(new125, n32->base.chunks[i], n32->values[i]);
+
+ rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new125,
+ key);
+ node = (rt_node *) new125;
+ }
+ }
+ else
+ {
+ retry_insert_leaf_32:
+ {
+ int insertpos = node_32_get_insertpos((rt_node_base_32 *) n32, chunk);
+ int count = n32->base.n.count;
+
+ if (count != 0 && insertpos < count)
+ chunk_values_array_shift(n32->base.chunks, n32->values,
+ count, insertpos);
+
+ n32->base.chunks[insertpos] = chunk;
+ n32->values[insertpos] = value;
+ break;
+ }
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_125:
+ {
+ rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) node;
+ int cnt = 0;
+
+ if (node_125_is_chunk_used((rt_node_base_125 *) n125, chunk))
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ node_leaf_125_update(n125, chunk, value);
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n125)))
+ {
+ rt_node_leaf_256 *new256;
+ Assert(parent != NULL);
+
+ /* grow node from 125 to 256 */
+ new256 = (rt_node_leaf_256 *) rt_grow_node_kind(tree, (rt_node *) n125,
+ RT_NODE_KIND_256);
+ for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n125->base.n.count; i++)
+ {
+ if (!node_125_is_chunk_used((rt_node_base_125 *) n125, i))
+ continue;
+
+ node_leaf_256_set(new256, i, node_leaf_125_get_value(n125, i));
+ cnt++;
+ }
+
+ rt_replace_node(tree, parent, (rt_node *) n125, (rt_node *) new256,
+ key);
+ node = (rt_node *) new256;
+ }
+ else
+ {
+ node_leaf_125_insert(n125, chunk, value);
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_256:
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+
+ chunk_exists = node_leaf_256_is_chunk_used(n256, chunk);
+ Assert(chunk_exists || FIXED_NODE_HAS_FREE_SLOT(n256, RT_CLASS_256));
+
+ node_leaf_256_set(n256, chunk, value);
+ break;
+ }
+ }
+
+ /* Update statistics */
+ if (!chunk_exists)
+ node->count++;
+
+ /*
+ * Done. Finally, verify the chunk and value are inserted or replaced
+ * properly in the node.
+ */
+ rt_verify_node(node);
+
+ return chunk_exists;
+}
+
+/*
+ * Create the radix tree in the given memory context and return it.
+ */
+radix_tree *
+rt_create(MemoryContext ctx)
+{
+ radix_tree *tree;
+ MemoryContext old_ctx;
+
+ old_ctx = MemoryContextSwitchTo(ctx);
+
+ tree = palloc(sizeof(radix_tree));
+ tree->context = ctx;
+ tree->root = NULL;
+ tree->max_val = 0;
+ tree->num_keys = 0;
+
+ /* Create the slab allocator for each size class */
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ tree->inner_slabs[i] = SlabContextCreate(ctx,
+ rt_size_class_info[i].name,
+ rt_size_class_info[i].inner_blocksize,
+ rt_size_class_info[i].inner_size);
+ tree->leaf_slabs[i] = SlabContextCreate(ctx,
+ rt_size_class_info[i].name,
+ rt_size_class_info[i].leaf_blocksize,
+ rt_size_class_info[i].leaf_size);
+#ifdef RT_DEBUG
+ tree->cnt[i] = 0;
+#endif
+ }
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return tree;
+}
+
+/*
+ * Free the given radix tree.
+ */
+void
+rt_free(radix_tree *tree)
+{
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ MemoryContextDelete(tree->inner_slabs[i]);
+ MemoryContextDelete(tree->leaf_slabs[i]);
+ }
+
+ pfree(tree);
+}
+
+/*
+ * Set key to value. If the entry already exists, update its value to 'value'
+ * and return true. Return false if the entry didn't exist yet.
+ */
+bool
+rt_set(radix_tree *tree, uint64 key, uint64 value)
+{
+ int shift;
+ bool updated;
+ rt_node *node;
+ rt_node *parent;
+
+ /* Empty tree, create the root */
+ if (!tree->root)
+ rt_new_root(tree, key);
+
+ /* Extend the tree if necessary */
+ if (key > tree->max_val)
+ rt_extend(tree, key);
+
+ Assert(tree->root);
+
+ shift = tree->root->shift;
+ node = parent = tree->root;
+
+ /* Descend the tree until a leaf node */
+ while (shift >= 0)
+ {
+ rt_node *child;
+
+ if (NODE_IS_LEAF(node))
+ break;
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ {
+ rt_set_extend(tree, key, value, parent, node);
+ return false;
+ }
+
+ parent = node;
+ node = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ updated = rt_node_insert_leaf(tree, parent, node, key, value);
+
+ /* Update the statistics */
+ if (!updated)
+ tree->num_keys++;
+
+ return updated;
+}
+
+/*
+ * Search for the given key in the radix tree. Return true if the key is found,
+ * otherwise return false. On success, the value is set to *value_p, which
+ * therefore must not be NULL.
+ */
+bool
+rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
+{
+ rt_node *node;
+ int shift;
+
+ Assert(value_p != NULL);
+
+ if (!tree->root || key > tree->max_val)
+ return false;
+
+ node = tree->root;
+ shift = tree->root->shift;
+
+ /* Descend the tree until a leaf node */
+ while (shift >= 0)
+ {
+ rt_node *child;
+
+ if (NODE_IS_LEAF(node))
+ break;
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ return false;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ return rt_node_search_leaf(node, key, RT_ACTION_FIND, value_p);
+}
+
+/*
+ * Delete the given key from the radix tree. Return true if the key is found (and
+ * deleted), otherwise do nothing and return false.
+ */
+bool
+rt_delete(radix_tree *tree, uint64 key)
+{
+ rt_node *node;
+ rt_node *stack[RT_MAX_LEVEL] = {0};
+ int shift;
+ int level;
+ bool deleted;
+
+ if (!tree->root || key > tree->max_val)
+ return false;
+
+ /*
+ * Descend the tree to search the key while building a stack of nodes we
+ * visited.
+ */
+ node = tree->root;
+ shift = tree->root->shift;
+ level = -1;
+ while (shift > 0)
+ {
+ rt_node *child;
+
+ /* Push the current node to the stack */
+ stack[++level] = node;
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ return false;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ /* Delete the key from the leaf node if exists */
+ Assert(NODE_IS_LEAF(node));
+ deleted = rt_node_search_leaf(node, key, RT_ACTION_DELETE, NULL);
+
+ if (!deleted)
+ {
+ /* no key is found in the leaf node */
+ return false;
+ }
+
+ /* Found the key to delete. Update the statistics */
+ tree->num_keys--;
+
+ /*
+ * Return if the leaf node still has keys and we don't need to delete the
+ * node.
+ */
+ if (!NODE_IS_EMPTY(node))
+ return true;
+
+ /* Free the empty leaf node */
+ rt_free_node(tree, node);
+
+ /* Delete the key in inner nodes recursively */
+ while (level >= 0)
+ {
+ node = stack[level--];
+
+ deleted = rt_node_search_inner(node, key, RT_ACTION_DELETE, NULL);
+ Assert(deleted);
+
+ /* If the node didn't become empty, we stop deleting the key */
+ if (!NODE_IS_EMPTY(node))
+ break;
+
+ /* The node became empty */
+ rt_free_node(tree, node);
+ }
+
+ return true;
+}
+
+/* Create and return the iterator for the given radix tree */
+rt_iter *
+rt_begin_iterate(radix_tree *tree)
+{
+ MemoryContext old_ctx;
+ rt_iter *iter;
+ int top_level;
+
+ old_ctx = MemoryContextSwitchTo(tree->context);
+
+ iter = (rt_iter *) palloc0(sizeof(rt_iter));
+ iter->tree = tree;
+
+ /* empty tree */
+ if (!iter->tree->root)
+ return iter;
+
+ top_level = iter->tree->root->shift / RT_NODE_SPAN;
+ iter->stack_len = top_level;
+
+ /*
+ * Descend to the leftmost leaf node from the root. The key is constructed
+ * while descending to the leaf.
+ */
+ rt_update_iter_stack(iter, iter->tree->root, top_level);
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return iter;
+}
+
+/*
+ * Update each node_iter for inner nodes in the iterator node stack.
+ */
+static void
+rt_update_iter_stack(rt_iter *iter, rt_node *from_node, int from)
+{
+ int level = from;
+ rt_node *node = from_node;
+
+ for (;;)
+ {
+ rt_node_iter *node_iter = &(iter->stack[level--]);
+
+ node_iter->node = node;
+ node_iter->current_idx = -1;
+
+ /* We don't advance the leaf node iterator here */
+ if (NODE_IS_LEAF(node))
+ return;
+
+ /* Advance to the next slot in the inner node */
+ node = rt_node_inner_iterate_next(iter, node_iter);
+
+ /* We must find the first child in the node */
+ Assert(node);
+ }
+}
+
+/*
+ * Return true, setting *key_p and *value_p, if there is a next key.
+ * Otherwise, return false.
+ */
+bool
+rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p)
+{
+ /* Empty tree */
+ if (!iter->tree->root)
+ return false;
+
+ for (;;)
+ {
+ rt_node *child = NULL;
+ uint64 value;
+ int level;
+ bool found;
+
+ /* Advance the leaf node iterator to get next key-value pair */
+ found = rt_node_leaf_iterate_next(iter, &(iter->stack[0]), &value);
+
+ if (found)
+ {
+ *key_p = iter->key;
+ *value_p = value;
+ return true;
+ }
+
+ /*
+ * We've visited all values in the leaf node, so advance inner node
+ * iterators from level 1 until we find the next child node.
+ */
+ for (level = 1; level <= iter->stack_len; level++)
+ {
+ child = rt_node_inner_iterate_next(iter, &(iter->stack[level]));
+
+ if (child)
+ break;
+ }
+
+ /* the iteration finished */
+ if (!child)
+ return false;
+
+ /*
+ * Found the next child node. Set it in the node iterator and update the
+ * iterator stack from this node down to the leaf.
+ */
+ rt_update_iter_stack(iter, child, level - 1);
+
+ /* Node iterators are updated, so try again from the leaf */
+ }
+
+ return false;
+}
+
+void
+rt_end_iterate(rt_iter *iter)
+{
+ pfree(iter);
+}
+
+static inline void
+rt_iter_update_key(rt_iter *iter, uint8 chunk, uint8 shift)
+{
+ iter->key &= ~(((uint64) RT_CHUNK_MASK) << shift);
+ iter->key |= (((uint64) chunk) << shift);
+}
+
+/*
+ * Advance the slot in the inner node. Return the child if one exists,
+ * otherwise NULL.
+ */
+static inline rt_node *
+rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
+{
+ rt_node *child = NULL;
+ bool found = false;
+ uint8 key_chunk;
+
+ switch (node_iter->node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n4->base.n.count)
+ break;
+
+ child = n4->children[node_iter->current_idx];
+ key_chunk = n4->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n32->base.n.count)
+ break;
+
+ child = n32->children[node_iter->current_idx];
+ key_chunk = n32->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ rt_node_inner_125 *n125 = (rt_node_inner_125 *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (node_125_is_chunk_used((rt_node_base_125 *) n125, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+ child = node_inner_125_get_child(n125, i);
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (node_inner_256_is_chunk_used(n256, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+ child = node_inner_256_get_child(n256, i);
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ }
+
+ if (found)
+ rt_iter_update_key(iter, key_chunk, node_iter->node->shift);
+
+ return child;
+}
+
+/*
+ * Advance the slot in the leaf node. On success, return true and the value
+ * is set to *value_p, otherwise return false.
+ */
+static inline bool
+rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
+ uint64 *value_p)
+{
+ rt_node *node = node_iter->node;
+ bool found = false;
+ uint64 value;
+ uint8 key_chunk;
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n4->base.n.count)
+ break;
+
+ value = n4->values[node_iter->current_idx];
+ key_chunk = n4->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n32->base.n.count)
+ break;
+
+ value = n32->values[node_iter->current_idx];
+ key_chunk = n32->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (node_125_is_chunk_used((rt_node_base_125 *) n125, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+ value = node_leaf_125_get_value(n125, i);
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (node_leaf_256_is_chunk_used(n256, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+ value = node_leaf_256_get_value(n256, i);
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ }
+
+ if (found)
+ {
+ rt_iter_update_key(iter, key_chunk, node_iter->node->shift);
+ *value_p = value;
+ }
+
+ return found;
+}
+
+/*
+ * Return the number of keys in the radix tree.
+ */
+uint64
+rt_num_entries(radix_tree *tree)
+{
+ return tree->num_keys;
+}
+
+/*
+ * Return the amount of memory used by the radix tree.
+ */
+uint64
+rt_memory_usage(radix_tree *tree)
+{
+ Size total = sizeof(radix_tree);
+
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
+ total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
+ }
+
+ return total;
+}
+
+/*
+ * Verify the invariants of the given radix tree node (assertion builds only).
+ */
+static void
+rt_verify_node(rt_node *node)
+{
+#ifdef USE_ASSERT_CHECKING
+ Assert(node->count >= 0);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_base_4 *n4 = (rt_node_base_4 *) node;
+
+ for (int i = 1; i < n4->n.count; i++)
+ Assert(n4->chunks[i - 1] < n4->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_base_32 *n32 = (rt_node_base_32 *) node;
+
+ for (int i = 1; i < n32->n.count; i++)
+ Assert(n32->chunks[i - 1] < n32->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ rt_node_base_125 *n125 = (rt_node_base_125 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!node_125_is_chunk_used(n125, i))
+ continue;
+
+ /* Check if the corresponding slot is used */
+ if (NODE_IS_LEAF(node))
+ Assert(node_leaf_125_is_slot_used((rt_node_leaf_125 *) node,
+ n125->slot_idxs[i]));
+ else
+ Assert(node_inner_125_is_slot_used((rt_node_inner_125 *) node,
+ n125->slot_idxs[i]));
+
+ cnt++;
+ }
+
+ Assert(n125->n.count == cnt);
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RT_NODE_NSLOTS_BITS(RT_NODE_MAX_SLOTS); i++)
+ cnt += pg_popcount32(n256->isset[i]);
+
+ /* Check that the number of used chunks matches the count */
+ Assert(n256->base.n.count == cnt);
+
+ break;
+ }
+ }
+ }
+#endif
+}
+
+/***************** DEBUG FUNCTIONS *****************/
+#ifdef RT_DEBUG
+void
+rt_stats(radix_tree *tree)
+{
+ ereport(NOTICE, (errmsg("num_keys = " UINT64_FORMAT ", height = %d, n4 = %u, n15 = %u, n32 = %u, n125 = %u, n256 = %u",
+ tree->num_keys,
+ tree->root->shift / RT_NODE_SPAN,
+ tree->cnt[RT_CLASS_4_FULL],
+ tree->cnt[RT_CLASS_32_PARTIAL],
+ tree->cnt[RT_CLASS_32_FULL],
+ tree->cnt[RT_CLASS_125_FULL],
+ tree->cnt[RT_CLASS_256])));
+}
+
+static void
+rt_dump_node(rt_node *node, int level, bool recurse)
+{
+ char space[125] = {0};
+
+ fprintf(stderr, "[%s] kind %d, fanout %d, count %u, shift %u, chunk 0x%X:\n",
+ NODE_IS_LEAF(node) ? "LEAF" : "INNR",
+ (node->kind == RT_NODE_KIND_4) ? 4 :
+ (node->kind == RT_NODE_KIND_32) ? 32 :
+ (node->kind == RT_NODE_KIND_125) ? 125 : 256,
+ node->fanout == 0 ? 256 : node->fanout,
+ node->count, node->shift, node->chunk);
+
+ if (level > 0)
+ sprintf(space, "%*c", level * 4, ' ');
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, n4->base.chunks[i], n4->values[i]);
+ }
+ else
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, n4->base.chunks[i]);
+
+ if (recurse)
+ rt_dump_node(n4->children[i], level + 1, recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, n32->base.chunks[i], n32->values[i]);
+ }
+ else
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, n32->base.chunks[i]);
+
+ if (recurse)
+ {
+ rt_dump_node(n32->children[i], level + 1, recurse);
+ }
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ rt_node_base_125 *b125 = (rt_node_base_125 *) node;
+
+ fprintf(stderr, "slot_idxs ");
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!node_125_is_chunk_used(b125, i))
+ continue;
+
+ fprintf(stderr, " [%d]=%d, ", i, b125->slot_idxs[i]);
+ }
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_125 *n = (rt_node_leaf_125 *) node;
+
+ fprintf(stderr, ", isset-bitmap:");
+ for (int i = 0; i < 16; i++)
+ {
+ fprintf(stderr, "%X ", (uint8) n->isset[i]);
+ }
+ fprintf(stderr, "\n");
+ }
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!node_125_is_chunk_used(b125, i))
+ continue;
+
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) b125;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, i, node_leaf_125_get_value(n125, i));
+ }
+ else
+ {
+ rt_node_inner_125 *n125 = (rt_node_inner_125 *) b125;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, i);
+
+ if (recurse)
+ rt_dump_node(node_inner_125_get_child(n125, i),
+ level + 1, recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+
+ if (!node_leaf_256_is_chunk_used(n256, i))
+ continue;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, i, node_leaf_256_get_value(n256, i));
+ }
+ else
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+
+ if (!node_inner_256_is_chunk_used(n256, i))
+ continue;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, i);
+
+ if (recurse)
+ rt_dump_node(node_inner_256_get_child(n256, i), level + 1,
+ recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ }
+}
+
+void
+rt_dump_search(radix_tree *tree, uint64 key)
+{
+ rt_node *node;
+ int shift;
+ int level = 0;
+
+ elog(NOTICE, "-----------------------------------------------------------");
+ elog(NOTICE, "max_val = " UINT64_FORMAT "(0x" UINT64_FORMAT_HEX ")",
+ tree->max_val, tree->max_val);
+
+ if (!tree->root)
+ {
+ elog(NOTICE, "tree is empty");
+ return;
+ }
+
+ if (key > tree->max_val)
+ {
+ elog(NOTICE, "key " UINT64_FORMAT "(0x" UINT64_FORMAT_HEX ") is larger than max val",
+ key, key);
+ return;
+ }
+
+ node = tree->root;
+ shift = tree->root->shift;
+ while (shift >= 0)
+ {
+ rt_node *child;
+
+ rt_dump_node(node, level, false);
+
+ if (NODE_IS_LEAF(node))
+ {
+ uint64 dummy;
+
+ /* We reached at a leaf node, find the corresponding slot */
+ rt_node_search_leaf(node, key, RT_ACTION_FIND, &dummy);
+
+ break;
+ }
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ break;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ level++;
+ }
+}
+
+void
+rt_dump(radix_tree *tree)
+{
+
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ fprintf(stderr, "%s\tinner_size %zu\tinner_blocksize %zu\tleaf_size %zu\tleaf_blocksize %zu\n",
+ rt_size_class_info[i].name,
+ rt_size_class_info[i].inner_size,
+ rt_size_class_info[i].inner_blocksize,
+ rt_size_class_info[i].leaf_size,
+ rt_size_class_info[i].leaf_blocksize);
+ fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->max_val);
+
+ if (!tree->root)
+ {
+ fprintf(stderr, "empty tree\n");
+ return;
+ }
+
+ rt_dump_node(tree->root, 0, true);
+}
+#endif
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
new file mode 100644
index 0000000000..d5d7668617
--- /dev/null
+++ b/src/include/lib/radixtree.h
@@ -0,0 +1,42 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.h
+ * Interface for radix tree.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/lib/radixtree.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef RADIXTREE_H
+#define RADIXTREE_H
+
+#include "postgres.h"
+
+#define RT_DEBUG 1
+
+typedef struct radix_tree radix_tree;
+typedef struct rt_iter rt_iter;
+
+extern radix_tree *rt_create(MemoryContext ctx);
+extern void rt_free(radix_tree *tree);
+extern bool rt_search(radix_tree *tree, uint64 key, uint64 *val_p);
+extern bool rt_set(radix_tree *tree, uint64 key, uint64 val);
+extern rt_iter *rt_begin_iterate(radix_tree *tree);
+
+extern bool rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p);
+extern void rt_end_iterate(rt_iter *iter);
+extern bool rt_delete(radix_tree *tree, uint64 key);
+
+extern uint64 rt_memory_usage(radix_tree *tree);
+extern uint64 rt_num_entries(radix_tree *tree);
+
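+/*
+ * Basic usage (a minimal sketch):
+ *
+ *   radix_tree *tree = rt_create(CurrentMemoryContext);
+ *   uint64      value;
+ *
+ *   rt_set(tree, key, val);
+ *   if (rt_search(tree, key, &value))
+ *       ... use value ...
+ *   rt_free(tree);
+ */
+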
+#ifdef RT_DEBUG
+extern void rt_dump(radix_tree *tree);
+extern void rt_dump_search(radix_tree *tree, uint64 key);
+extern void rt_stats(radix_tree *tree);
+#endif
+
+#endif /* RADIXTREE_H */
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index c629cbe383..9659eb85d7 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -28,6 +28,7 @@ SUBDIRS = \
test_pg_db_role_setting \
test_pg_dump \
test_predtest \
+ test_radixtree \
test_rbtree \
test_regex \
test_rls_hooks \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index eefc0b2063..2458ca64cc 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -24,6 +24,7 @@ subdir('test_parser')
subdir('test_pg_db_role_setting')
subdir('test_pg_dump')
subdir('test_predtest')
+subdir('test_radixtree')
subdir('test_rbtree')
subdir('test_regex')
subdir('test_rls_hooks')
diff --git a/src/test/modules/test_radixtree/.gitignore b/src/test/modules/test_radixtree/.gitignore
new file mode 100644
index 0000000000..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/test_radixtree/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_radixtree/Makefile b/src/test/modules/test_radixtree/Makefile
new file mode 100644
index 0000000000..da06b93da3
--- /dev/null
+++ b/src/test/modules/test_radixtree/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_radixtree/Makefile
+
+MODULE_big = test_radixtree
+OBJS = \
+ $(WIN32RES) \
+ test_radixtree.o
+PGFILEDESC = "test_radixtree - test code for src/backend/lib/radixtree.c"
+
+EXTENSION = test_radixtree
+DATA = test_radixtree--1.0.sql
+
+REGRESS = test_radixtree
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_radixtree
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_radixtree/README b/src/test/modules/test_radixtree/README
new file mode 100644
index 0000000000..a8b271869a
--- /dev/null
+++ b/src/test/modules/test_radixtree/README
@@ -0,0 +1,7 @@
+test_radixtree contains unit tests for the radix tree implementation
+in src/backend/lib/radixtree.c.
+
+The tests verify the correctness of the implementation, but they can also be
+used as a micro-benchmark. If you set the 'rt_test_stats' flag in
+test_radixtree.c, the tests will print extra information about execution time
+and memory usage.
diff --git a/src/test/modules/test_radixtree/expected/test_radixtree.out b/src/test/modules/test_radixtree/expected/test_radixtree.out
new file mode 100644
index 0000000000..ce645cb8b5
--- /dev/null
+++ b/src/test/modules/test_radixtree/expected/test_radixtree.out
@@ -0,0 +1,36 @@
+CREATE EXTENSION test_radixtree;
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
+NOTICE: testing basic operations with leaf node 4
+NOTICE: testing basic operations with inner node 4
+NOTICE: testing basic operations with leaf node 32
+NOTICE: testing basic operations with inner node 32
+NOTICE: testing basic operations with leaf node 125
+NOTICE: testing basic operations with inner node 125
+NOTICE: testing basic operations with leaf node 256
+NOTICE: testing basic operations with inner node 256
+NOTICE: testing radix tree node types with shift "0"
+NOTICE: testing radix tree node types with shift "8"
+NOTICE: testing radix tree node types with shift "16"
+NOTICE: testing radix tree node types with shift "24"
+NOTICE: testing radix tree node types with shift "32"
+NOTICE: testing radix tree node types with shift "40"
+NOTICE: testing radix tree node types with shift "48"
+NOTICE: testing radix tree node types with shift "56"
+NOTICE: testing radix tree with pattern "all ones"
+NOTICE: testing radix tree with pattern "alternating bits"
+NOTICE: testing radix tree with pattern "clusters of ten"
+NOTICE: testing radix tree with pattern "clusters of hundred"
+NOTICE: testing radix tree with pattern "one-every-64k"
+NOTICE: testing radix tree with pattern "sparse"
+NOTICE: testing radix tree with pattern "single values, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^60"
+ test_radixtree
+----------------
+
+(1 row)
+
diff --git a/src/test/modules/test_radixtree/meson.build b/src/test/modules/test_radixtree/meson.build
new file mode 100644
index 0000000000..f96bf159d6
--- /dev/null
+++ b/src/test/modules/test_radixtree/meson.build
@@ -0,0 +1,34 @@
+# FIXME: prevent install during main install, but not during test :/
+
+test_radixtree_sources = files(
+ 'test_radixtree.c',
+)
+
+if host_system == 'windows'
+ test_radixtree_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'test_radixtree',
+ '--FILEDESC', 'test_radixtree - test code for src/backend/lib/radixtree.c',])
+endif
+
+test_radixtree = shared_module('test_radixtree',
+ test_radixtree_sources,
+ kwargs: pg_mod_args,
+)
+testprep_targets += test_radixtree
+
+install_data(
+ 'test_radixtree.control',
+ 'test_radixtree--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'test_radixtree',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'test_radixtree',
+ ],
+ },
+}
diff --git a/src/test/modules/test_radixtree/sql/test_radixtree.sql b/src/test/modules/test_radixtree/sql/test_radixtree.sql
new file mode 100644
index 0000000000..41ece5e9f5
--- /dev/null
+++ b/src/test/modules/test_radixtree/sql/test_radixtree.sql
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_radixtree;
+
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
diff --git a/src/test/modules/test_radixtree/test_radixtree--1.0.sql b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
new file mode 100644
index 0000000000..074a5a7ea7
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_radixtree/test_radixtree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_radixtree" to load this file. \quit
+
+CREATE FUNCTION test_radixtree()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
new file mode 100644
index 0000000000..ea993e63df
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -0,0 +1,581 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_radixtree.c
+ * Test radixtree set data structure.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/test/modules/test_radixtree/test_radixtree.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "lib/radixtree.h"
+#include "miscadmin.h"
+#include "nodes/bitmapset.h"
+#include "storage/block.h"
+#include "storage/itemptr.h"
+#include "utils/memutils.h"
+#include "utils/timestamp.h"
+
+#define UINT64_HEX_FORMAT "%" INT64_MODIFIER "X"
+
+/*
+ * If you enable this, the "pattern" tests will print information about
+ * how long populating, probing, and iterating the test set takes, and
+ * how much memory the test set consumed. That can be used as
+ * micro-benchmark of various operations and input patterns (you might
+ * want to increase the number of values used in each of the test, if
+ * you do that, to reduce noise).
+ *
+ * The information is printed to the server's stderr, mostly because
+ * that's where MemoryContextStats() output goes.
+ */
+static const bool rt_test_stats = false;
+
+static int rt_node_kind_fanouts[] = {
+ 0,
+ 4, /* RT_NODE_KIND_4 */
+ 32, /* RT_NODE_KIND_32 */
+ 125, /* RT_NODE_KIND_125 */
+ 256 /* RT_NODE_KIND_256 */
+};
+/*
+ * A struct to define a pattern of integers, for use with the test_pattern()
+ * function.
+ */
+typedef struct
+{
+ char *test_name; /* short name of the test, for humans */
+ char *pattern_str; /* a bit pattern */
+ uint64 spacing; /* pattern repeats at this interval */
+ uint64 num_values; /* number of integers to set in total */
+} test_spec;
+
+/* Test patterns borrowed from test_integerset.c */
+static const test_spec test_specs[] = {
+ {
+ "all ones", "1111111111",
+ 10, 1000000
+ },
+ {
+ "alternating bits", "0101010101",
+ 10, 1000000
+ },
+ {
+ "clusters of ten", "1111111111",
+ 10000, 1000000
+ },
+ {
+ "clusters of hundred",
+ "1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111",
+ 10000, 1000000
+ },
+ {
+ "one-every-64k", "1",
+ 65536, 1000000
+ },
+ {
+ "sparse", "100000000000000000000000000000001",
+ 10000000, 1000000
+ },
+ {
+ "single values, distance > 2^32", "1",
+ UINT64CONST(10000000000), 100000
+ },
+ {
+ "clusters, distance > 2^32", "10101010",
+ UINT64CONST(10000000000), 1000000
+ },
+ {
+ "clusters, distance > 2^60", "10101010",
+ UINT64CONST(2000000000000000000),
+ 23 /* can't be much higher than this, or we
+ * overflow uint64 */
+ }
+};
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(test_radixtree);
+
+static void
+test_empty(void)
+{
+ radix_tree *radixtree;
+ rt_iter *iter;
+ uint64 dummy;
+ uint64 key;
+ uint64 val;
+
+ radixtree = rt_create(CurrentMemoryContext);
+
+ if (rt_search(radixtree, 0, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, 1, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, PG_UINT64_MAX, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_delete(radixtree, 0))
+ elog(ERROR, "rt_delete on empty tree returned true");
+
+ if (rt_num_entries(radixtree) != 0)
+ elog(ERROR, "rt_num_entries on empty tree return non-zero");
+
+ iter = rt_begin_iterate(radixtree);
+
+ if (rt_iterate_next(iter, &key, &val))
+ elog(ERROR, "rt_itereate_next on empty tree returned true");
+
+ rt_end_iterate(iter);
+
+ rt_free(radixtree);
+}
+
+static void
+test_basic(int children, bool test_inner)
+{
+ radix_tree *radixtree;
+ uint64 *keys;
+ int shift = test_inner ? 8 : 0;
+
+ elog(NOTICE, "testing basic operations with %s node %d",
+ test_inner ? "inner" : "leaf", children);
+
+ radixtree = rt_create(CurrentMemoryContext);
+
+ /* prepare keys in order like 1, 32, 2, 31, 3, ... */
+ keys = palloc(sizeof(uint64) * children);
+ for (int i = 0; i < children; i++)
+ {
+ if (i % 2 == 0)
+ keys[i] = (uint64) ((i / 2) + 1) << shift;
+ else
+ keys[i] = (uint64) (children - (i / 2)) << shift;
+ }
+
+ /* insert keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (rt_set(radixtree, keys[i], keys[i]))
+ elog(ERROR, "new inserted key 0x" UINT64_HEX_FORMAT " is found ", keys[i]);
+ }
+
+ /* update keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (!rt_set(radixtree, keys[i], keys[i] + 1))
+ elog(ERROR, "could not update key 0x" UINT64_HEX_FORMAT, keys[i]);
+ }
+
+ /* repeat deleting and inserting keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (!rt_delete(radixtree, keys[i]))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, keys[i]);
+ if (rt_set(radixtree, keys[i], keys[i]))
+ elog(ERROR, "new inserted key 0x" UINT64_HEX_FORMAT " is found ", keys[i]);
+ }
+
+ pfree(keys);
+ rt_free(radixtree);
+}
+
+/*
+ * Check if keys from start to end with the shift exist in the tree.
+ */
+static void
+check_search_on_node(radix_tree *radixtree, uint8 shift, int start, int end,
+ int incr)
+{
+ for (int i = start; i < end; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ uint64 val;
+
+ if (!rt_search(radixtree, key, &val))
+ elog(ERROR, "key 0x" UINT64_HEX_FORMAT " is not found on node-%d",
+ key, end);
+ if (val != key)
+ elog(ERROR, "rt_search with key 0x" UINT64_HEX_FORMAT " returns 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ key, val, key);
+ }
+}
+
+static void
+test_node_types_insert(radix_tree *radixtree, uint8 shift, bool insert_asc)
+{
+ uint64 num_entries;
+ int ninserted = 0;
+ int start = insert_asc ? 0 : 256;
+ int incr = insert_asc ? 1 : -1;
+ int end = insert_asc ? 256 : 0;
+ int node_kind_idx = 1;
+
+ for (int i = start; i != end; i += incr)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_set(radixtree, key, key);
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " is found", key);
+
+ /*
+ * After filling all slots in each node type, check if the values
+ * are stored properly.
+ */
+ if (ninserted == rt_node_kind_fanouts[node_kind_idx] - 1)
+ {
+ int check_start = insert_asc
+ ? rt_node_kind_fanouts[node_kind_idx - 1]
+ : rt_node_kind_fanouts[node_kind_idx];
+ int check_end = insert_asc
+ ? rt_node_kind_fanouts[node_kind_idx]
+ : rt_node_kind_fanouts[node_kind_idx - 1];
+
+ check_search_on_node(radixtree, shift, check_start, check_end, incr);
+ node_kind_idx++;
+ }
+
+ ninserted++;
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ if (num_entries != 256)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(256));
+}
+
+static void
+test_node_types_delete(radix_tree *radixtree, uint8 shift)
+{
+ uint64 num_entries;
+
+ for (int i = 0; i < 256; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_delete(radixtree, key);
+
+ if (!found)
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, key);
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ /* The tree must be empty */
+ if (num_entries != 0)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(0));
+}
+
+/*
+ * Test for inserting and deleting key-value pairs to each node type at the given shift
+ * level.
+ */
+static void
+test_node_types(uint8 shift)
+{
+ radix_tree *radixtree;
+
+ elog(NOTICE, "testing radix tree node types with shift \"%d\"", shift);
+
+ radixtree = rt_create(CurrentMemoryContext);
+
+ /*
+ * Insert and search entries for every node type at the 'shift' level,
+ * then delete all entries to make it empty, and insert and search entries
+ * again.
+ */
+ test_node_types_insert(radixtree, shift, true);
+ test_node_types_delete(radixtree, shift);
+ test_node_types_insert(radixtree, shift, false);
+
+ rt_free(radixtree);
+}
+
+/*
+ * Test with a repeating pattern, defined by the 'spec'.
+ */
+static void
+test_pattern(const test_spec * spec)
+{
+ radix_tree *radixtree;
+ rt_iter *iter;
+ MemoryContext radixtree_ctx;
+ TimestampTz starttime;
+ TimestampTz endtime;
+ uint64 n;
+ uint64 last_int;
+ uint64 ndeleted;
+ uint64 nbefore;
+ uint64 nafter;
+ int patternlen;
+ uint64 *pattern_values;
+ uint64 pattern_num_values;
+
+ elog(NOTICE, "testing radix tree with pattern \"%s\"", spec->test_name);
+ if (rt_test_stats)
+ fprintf(stderr, "-----\ntesting radix tree with pattern \"%s\"\n", spec->test_name);
+
+ /* Pre-process the pattern, creating an array of integers from it. */
+ patternlen = strlen(spec->pattern_str);
+ pattern_values = palloc(patternlen * sizeof(uint64));
+ pattern_num_values = 0;
+ for (int i = 0; i < patternlen; i++)
+ {
+ if (spec->pattern_str[i] == '1')
+ pattern_values[pattern_num_values++] = i;
+ }
+
+ /*
+ * Allocate the radix tree.
+ *
+ * Allocate it in a separate memory context, so that we can print its
+ * memory usage easily.
+ */
+ radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
+ "radixtree test",
+ ALLOCSET_SMALL_SIZES);
+ MemoryContextSetIdentifier(radixtree_ctx, spec->test_name);
+ radixtree = rt_create(radixtree_ctx);
+
+ /*
+ * Add values to the set.
+ */
+ starttime = GetCurrentTimestamp();
+
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ uint64 x = 0;
+
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ bool found;
+
+ x = last_int + pattern_values[i];
+
+ found = rt_set(radixtree, x, x);
+
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " found", x);
+
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+
+ endtime = GetCurrentTimestamp();
+
+ if (rt_test_stats)
+ fprintf(stderr, "added " UINT64_FORMAT " values in %d ms\n",
+ spec->num_values, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Print stats on the amount of memory used.
+ *
+ * We print the usage reported by rt_memory_usage(), as well as the stats
+ * from the memory context. They should be in the same ballpark, but it's
+ * hard to automate testing that, so if you're making changes to the
+ * implementation, just observe that manually.
+ */
+ if (rt_test_stats)
+ {
+ uint64 mem_usage;
+
+ /*
+ * Also print memory usage as reported by rt_memory_usage(). It
+ * should be in the same ballpark as the usage reported by
+ * MemoryContextStats().
+ */
+ mem_usage = rt_memory_usage(radixtree);
+ fprintf(stderr, "rt_memory_usage() reported " UINT64_FORMAT " (%0.2f bytes / integer)\n",
+ mem_usage, (double) mem_usage / spec->num_values);
+
+ MemoryContextStats(radixtree_ctx);
+ }
+
+ /* Check that rt_num_entries works */
+ n = rt_num_entries(radixtree);
+ if (n != spec->num_values)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT, n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_search()
+ */
+ starttime = GetCurrentTimestamp();
+
+ for (n = 0; n < 100000; n++)
+ {
+ bool found;
+ bool expected;
+ uint64 x;
+ uint64 v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Do we expect this value to be present in the set? */
+ if (x >= last_int)
+ expected = false;
+ else
+ {
+ uint64 idx = x % spec->spacing;
+
+ if (idx >= patternlen)
+ expected = false;
+ else if (spec->pattern_str[idx] == '1')
+ expected = true;
+ else
+ expected = false;
+ }
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (found != expected)
+ elog(ERROR, "mismatch at 0x" UINT64_HEX_FORMAT ": %d vs %d", x, found, expected);
+ if (found && (v != x))
+ elog(ERROR, "found 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ v, x);
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "probed " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Test iterator
+ */
+ starttime = GetCurrentTimestamp();
+
+ iter = rt_begin_iterate(radixtree);
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ uint64 expected = last_int + pattern_values[i];
+ uint64 x;
+ uint64 val;
+
+ if (!rt_iterate_next(iter, &x, &val))
+ break;
+
+ if (x != expected)
+ elog(ERROR,
+ "iterate returned wrong key; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d",
+ x, expected, i);
+ if (val != expected)
+ elog(ERROR,
+ "iterate returned wrong value; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d", x, expected, i);
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "iterated " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ rt_end_iterate(iter);
+
+ if (n < spec->num_values)
+ elog(ERROR, "iterator stopped short after " UINT64_FORMAT " entries, expected " UINT64_FORMAT, n, spec->num_values);
+ if (n > spec->num_values)
+ elog(ERROR, "iterator returned " UINT64_FORMAT " entries, " UINT64_FORMAT " was expected", n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_delete()
+ */
+ starttime = GetCurrentTimestamp();
+
+ nbefore = rt_num_entries(radixtree);
+ ndeleted = 0;
+ for (n = 0; n < 1; n++)
+ {
+ bool found;
+ uint64 x;
+ uint64 v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (!found)
+ continue;
+
+ /* If the key is found, delete it and check again */
+ if (!rt_delete(radixtree, x))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_search(radixtree, x, &v))
+ elog(ERROR, "found deleted key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_delete(radixtree, x))
+ elog(ERROR, "deleted already-deleted key 0x" UINT64_HEX_FORMAT, x);
+
+ ndeleted++;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "deleted " UINT64_FORMAT " values in %d ms\n",
+ ndeleted, (int) (endtime - starttime) / 1000);
+
+ nafter = rt_num_entries(radixtree);
+
+ /* Check that rt_num_entries works */
+ if ((nbefore - ndeleted) != nafter)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT "after " UINT64_FORMAT " deletion",
+ nafter, (nbefore - ndeleted), ndeleted);
+
+ MemoryContextDelete(radixtree_ctx);
+}
+
+Datum
+test_radixtree(PG_FUNCTION_ARGS)
+{
+ test_empty();
+
+ for (int i = 1; i < lengthof(rt_node_kind_fanouts); i++)
+ {
+ test_basic(rt_node_kind_fanouts[i], false);
+ test_basic(rt_node_kind_fanouts[i], true);
+ }
+
+ for (int shift = 0; shift <= (64 - 8); shift += 8)
+ test_node_types(shift);
+
+ /* Test different test patterns, with lots of entries */
+ for (int i = 0; i < lengthof(test_specs); i++)
+ test_pattern(&test_specs[i]);
+
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_radixtree/test_radixtree.control b/src/test/modules/test_radixtree/test_radixtree.control
new file mode 100644
index 0000000000..e53f2a3e0c
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.control
@@ -0,0 +1,4 @@
+comment = 'Test code for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/test_radixtree'
+relocatable = true
--
2.38.1
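
(A side note for readers skimming the test module above: test_node_types() exercises every node kind at each level of the tree by shifting an 8-bit chunk into successive byte positions of the 64-bit key. Below is a minimal standalone sketch, not part of the patch, of the key decomposition it relies on; RT_NODE_SPAN, RT_CHUNK_MASK and RT_GET_KEY_CHUNK mirror the definitions in radixtree.c.)

#include <stdint.h>
#include <stdio.h>

/* Mirrors radixtree.c: each tree level consumes 8 bits of the 64-bit key. */
#define RT_NODE_SPAN	8
#define RT_CHUNK_MASK	((1 << RT_NODE_SPAN) - 1)
#define RT_GET_KEY_CHUNK(key, shift) ((uint8_t) (((key) >> (shift)) & RT_CHUNK_MASK))

int
main(void)
{
	uint64_t	key = UINT64_C(0x0123456789ABCDEF);

	/* Walk from the most significant chunk down to shift 0, as a lookup does. */
	for (int shift = 64 - RT_NODE_SPAN; shift >= 0; shift -= RT_NODE_SPAN)
		printf("shift %2d -> chunk 0x%02X\n", shift, RT_GET_KEY_CHUNK(key, shift));

	return 0;
}

Compiled on its own, this prints one chunk per level (0x01 down to 0xEF), which is the sequence of slots a lookup follows from the root down to a leaf.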
v16-0004-Use-bitmapword-for-node-125.patch
From deab00e6a99e42a8a96ac808dc0858d452bfd0e5 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Tue, 6 Dec 2022 15:22:26 +0700
Subject: [PATCH v16 04/12] Use bitmapword for node-125
---
src/backend/lib/radixtree.c | 70 ++++++++++++++++++-------------------
1 file changed, 34 insertions(+), 36 deletions(-)
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
index e7f61fd943..abd0450727 100644
--- a/src/backend/lib/radixtree.c
+++ b/src/backend/lib/radixtree.c
@@ -62,6 +62,7 @@
#include "lib/radixtree.h"
#include "lib/stringinfo.h"
#include "miscadmin.h"
+#include "nodes/bitmapset.h"
#include "port/pg_bitutils.h"
#include "port/pg_lfind.h"
#include "utils/memutils.h"
@@ -103,6 +104,10 @@
#define RT_NODE_BITMAP_BYTE(v) ((v) / BITS_PER_BYTE)
#define RT_NODE_BITMAP_BIT(v) (UINT64CONST(1) << ((v) % RT_NODE_SPAN))
+/* FIXME rename */
+#define WORDNUM(x) ((x) / BITS_PER_BITMAPWORD)
+#define BITNUM(x) ((x) % BITS_PER_BITMAPWORD)
+
/* Enum used rt_node_search() */
typedef enum
{
@@ -207,6 +212,9 @@ typedef struct rt_node_base125
/* The index of slots for each fanout */
uint8 slot_idxs[RT_NODE_MAX_SLOTS];
+
+ /* isset is a bitmap to track which slot is in use */
+ bitmapword isset[WORDNUM(128)];
} rt_node_base_125;
typedef struct rt_node_base256
@@ -271,9 +279,6 @@ typedef struct rt_node_leaf_125
{
rt_node_base_125 base;
- /* isset is a bitmap to track which slot is in use */
- uint8 isset[RT_NODE_NSLOTS_BITS(128)];
-
/* number of values depends on size class */
uint64 values[FLEXIBLE_ARRAY_MEMBER];
} rt_node_leaf_125;
@@ -655,13 +660,14 @@ node_125_is_chunk_used(rt_node_base_125 *node, uint8 chunk)
return node->slot_idxs[chunk] != RT_NODE_125_INVALID_IDX;
}
+#ifdef USE_ASSERT_CHECKING
/* Is the slot in the node used? */
static inline bool
node_inner_125_is_slot_used(rt_node_inner_125 *node, uint8 slot)
{
Assert(!NODE_IS_LEAF(node));
Assert(slot < node->base.n.fanout);
- return (node->children[slot] != NULL);
+ return (node->base.isset[WORDNUM(slot)] & ((bitmapword) 1 << BITNUM(slot))) != 0;
}
static inline bool
@@ -669,8 +675,9 @@ node_leaf_125_is_slot_used(rt_node_leaf_125 *node, uint8 slot)
{
Assert(NODE_IS_LEAF(node));
Assert(slot < node->base.n.fanout);
- return ((node->isset[RT_NODE_BITMAP_BYTE(slot)] & RT_NODE_BITMAP_BIT(slot)) != 0);
+ return (node->base.isset[WORDNUM(slot)] & ((bitmapword) 1 << BITNUM(slot))) != 0;
}
+#endif
static inline rt_node *
node_inner_125_get_child(rt_node_inner_125 *node, uint8 chunk)
@@ -690,7 +697,10 @@ node_leaf_125_get_value(rt_node_leaf_125 *node, uint8 chunk)
static void
node_inner_125_delete(rt_node_inner_125 *node, uint8 chunk)
{
+ int slotpos = node->base.slot_idxs[chunk];
+
Assert(!NODE_IS_LEAF(node));
+ node->base.isset[WORDNUM(slotpos)] &= ~((bitmapword) 1 << BITNUM(slotpos));
node->children[node->base.slot_idxs[chunk]] = NULL;
node->base.slot_idxs[chunk] = RT_NODE_125_INVALID_IDX;
}
@@ -701,44 +711,35 @@ node_leaf_125_delete(rt_node_leaf_125 *node, uint8 chunk)
int slotpos = node->base.slot_idxs[chunk];
Assert(NODE_IS_LEAF(node));
- node->isset[RT_NODE_BITMAP_BYTE(slotpos)] &= ~(RT_NODE_BITMAP_BIT(slotpos));
+ node->base.isset[WORDNUM(slotpos)] &= ~((bitmapword) 1 << BITNUM(slotpos));
node->base.slot_idxs[chunk] = RT_NODE_125_INVALID_IDX;
}
/* Return an unused slot in node-125 */
static int
-node_inner_125_find_unused_slot(rt_node_inner_125 *node, uint8 chunk)
-{
- int slotpos = 0;
-
- Assert(!NODE_IS_LEAF(node));
- while (node_inner_125_is_slot_used(node, slotpos))
- slotpos++;
-
- return slotpos;
-}
-
-static int
-node_leaf_125_find_unused_slot(rt_node_leaf_125 *node, uint8 chunk)
+node_125_find_unused_slot(bitmapword *isset)
{
int slotpos;
+ int idx;
+ bitmapword inverse;
- Assert(NODE_IS_LEAF(node));
-
- /* We iterate over the isset bitmap per byte then check each bit */
- for (slotpos = 0; slotpos < RT_NODE_NSLOTS_BITS(128); slotpos++)
+ /* get the first word with at least one bit not set */
+ for (idx = 0; idx < WORDNUM(128); idx++)
{
- if (node->isset[slotpos] < 0xFF)
+ if (isset[idx] < ~((bitmapword) 0))
break;
}
- Assert(slotpos < RT_NODE_NSLOTS_BITS(128));
- slotpos *= BITS_PER_BYTE;
- while (node_leaf_125_is_slot_used(node, slotpos))
- slotpos++;
+ /* To get the first unset bit in X, get the first set bit in ~X */
+ inverse = ~(isset[idx]);
+ slotpos = idx * BITS_PER_BITMAPWORD;
+ slotpos += bmw_rightmost_one_pos(inverse);
+
+ /* mark the slot used */
+ isset[idx] |= bmw_rightmost_one(inverse);
return slotpos;
-}
+ }
static inline void
node_inner_125_insert(rt_node_inner_125 *node, uint8 chunk, rt_node *child)
@@ -747,8 +748,7 @@ node_inner_125_insert(rt_node_inner_125 *node, uint8 chunk, rt_node *child)
Assert(!NODE_IS_LEAF(node));
- /* find unused slot */
- slotpos = node_inner_125_find_unused_slot(node, chunk);
+ slotpos = node_125_find_unused_slot(node->base.isset);
Assert(slotpos < node->base.n.fanout);
node->base.slot_idxs[chunk] = slotpos;
@@ -763,12 +763,10 @@ node_leaf_125_insert(rt_node_leaf_125 *node, uint8 chunk, uint64 value)
Assert(NODE_IS_LEAF(node));
- /* find unused slot */
- slotpos = node_leaf_125_find_unused_slot(node, chunk);
+ slotpos = node_125_find_unused_slot(node->base.isset);
Assert(slotpos < node->base.n.fanout);
node->base.slot_idxs[chunk] = slotpos;
- node->isset[RT_NODE_BITMAP_BYTE(slotpos)] |= RT_NODE_BITMAP_BIT(slotpos);
node->values[slotpos] = value;
}
@@ -2395,9 +2393,9 @@ rt_dump_node(rt_node *node, int level, bool recurse)
rt_node_leaf_125 *n = (rt_node_leaf_125 *) node;
fprintf(stderr, ", isset-bitmap:");
- for (int i = 0; i < 16; i++)
+ for (int i = 0; i < WORDNUM(128); i++)
{
- fprintf(stderr, "%X ", (uint8) n->isset[i]);
+ fprintf(stderr, UINT64_FORMAT_HEX " ", n->base.isset[i]);
}
fprintf(stderr, "\n");
}
--
2.38.1
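
(A note on the slot allocation in the 0004 patch above: node_125_find_unused_slot() scans the isset bitmap one bitmapword at a time and, in the first word that is not full, picks the lowest clear bit by inverting the word and isolating its lowest set bit. The following is a standalone sketch of the same trick in portable C; it is illustrative only, using plain 64-bit words and a loop where the patch uses bitmapword and the pg_bitutils helpers.)

#include <stdint.h>
#include <stdio.h>

/* Find and claim the lowest clear bit across an array of 64-bit words. */
static int
find_unused_slot(uint64_t *isset, int nwords)
{
	for (int idx = 0; idx < nwords; idx++)
	{
		uint64_t	inverse;
		uint64_t	lowest;
		int			bit = 0;

		if (isset[idx] == UINT64_MAX)
			continue;			/* this word is full */

		/* The lowest clear bit of X is the lowest set bit of ~X. */
		inverse = ~isset[idx];
		lowest = inverse & (~inverse + 1);	/* isolate the lowest set bit */

		while (((lowest >> bit) & 1) == 0)
			bit++;

		isset[idx] |= lowest;	/* mark the slot used */
		return idx * 64 + bit;
	}

	return -1;					/* no free slot */
}

int
main(void)
{
	uint64_t	isset[2] = {UINT64_MAX, 0x07};

	/* word 0 is full; the lowest clear bit in word 1 is bit 3, so slot 67 */
	printf("next free slot: %d\n", find_unused_slot(isset, 2));
	return 0;
}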
v16-0005-tool-for-measuring-radix-tree-performance.patch
From 24859a28b554695d3c5f5e4b41b65375f666c765 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 16 Sep 2022 11:57:03 +0900
Subject: [PATCH v16 05/12] tool for measuring radix tree performance
---
contrib/bench_radix_tree/Makefile | 21 +
.../bench_radix_tree--1.0.sql | 76 +++
contrib/bench_radix_tree/bench_radix_tree.c | 635 ++++++++++++++++++
.../bench_radix_tree/bench_radix_tree.control | 6 +
contrib/bench_radix_tree/expected/bench.out | 13 +
contrib/bench_radix_tree/sql/bench.sql | 16 +
6 files changed, 767 insertions(+)
create mode 100644 contrib/bench_radix_tree/Makefile
create mode 100644 contrib/bench_radix_tree/bench_radix_tree--1.0.sql
create mode 100644 contrib/bench_radix_tree/bench_radix_tree.c
create mode 100644 contrib/bench_radix_tree/bench_radix_tree.control
create mode 100644 contrib/bench_radix_tree/expected/bench.out
create mode 100644 contrib/bench_radix_tree/sql/bench.sql
diff --git a/contrib/bench_radix_tree/Makefile b/contrib/bench_radix_tree/Makefile
new file mode 100644
index 0000000000..952bb0ceae
--- /dev/null
+++ b/contrib/bench_radix_tree/Makefile
@@ -0,0 +1,21 @@
+# contrib/bench_radix_tree/Makefile
+
+MODULE_big = bench_radix_tree
+OBJS = \
+ bench_radix_tree.o
+
+EXTENSION = bench_radix_tree
+DATA = bench_radix_tree--1.0.sql
+
+REGRESS = bench_fixed_height
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/bench_radix_tree
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
new file mode 100644
index 0000000000..83529805fc
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
@@ -0,0 +1,76 @@
+/* contrib/bench_radix_tree/bench_radix_tree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION bench_radix_tree" to load this file. \quit
+
+create function bench_shuffle_search(
+minblk int4,
+maxblk int4,
+random_block bool DEFAULT false,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT array_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT array_load_ms int8,
+OUT rt_search_ms int8,
+OUT array_search_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_seq_search(
+minblk int4,
+maxblk int4,
+random_block bool DEFAULT false,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT array_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT array_load_ms int8,
+OUT rt_search_ms int8,
+OUT array_search_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_load_random_int(
+cnt int8,
+OUT mem_allocated int8,
+OUT load_ms int8)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_search_random_nodes(
+cnt int8,
+filter_str text DEFAULT NULL,
+OUT mem_allocated int8,
+OUT search_ms int8)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_fixed_height_search(
+fanout int4,
+OUT fanout int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT rt_search_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_node128_load(
+fanout int4,
+OUT fanout int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT rt_sparseload_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c
new file mode 100644
index 0000000000..a0693695e6
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -0,0 +1,635 @@
+/*-------------------------------------------------------------------------
+ *
+ * bench_radix_tree.c
+ *
+ * Copyright (c) 2016-2022, PostgreSQL Global Development Group
+ *
+ * contrib/bench_radix_tree/bench_radix_tree.c
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "funcapi.h"
+#include "lib/radixtree.h"
+#include <math.h>
+#include "miscadmin.h"
+#include "storage/lwlock.h"
+#include "utils/builtins.h"
+#include "utils/timestamp.h"
+
+PG_MODULE_MAGIC;
+
+/* run benchmark also for binary-search case? */
+/* #define MEASURE_BINARY_SEARCH 1 */
+
+#define TIDS_PER_BLOCK_FOR_LOAD 30
+#define TIDS_PER_BLOCK_FOR_LOOKUP 50
+
+PG_FUNCTION_INFO_V1(bench_seq_search);
+PG_FUNCTION_INFO_V1(bench_shuffle_search);
+PG_FUNCTION_INFO_V1(bench_load_random_int);
+PG_FUNCTION_INFO_V1(bench_fixed_height_search);
+PG_FUNCTION_INFO_V1(bench_search_random_nodes);
+PG_FUNCTION_INFO_V1(bench_node128_load);
+
+static uint64
+tid_to_key_off(ItemPointer tid, uint32 *off)
+{
+ uint64 upper;
+ uint32 shift = pg_ceil_log2_32(MaxHeapTuplesPerPage);
+ int64 tid_i;
+
+ Assert(ItemPointerGetOffsetNumber(tid) < MaxHeapTuplesPerPage);
+
+ tid_i = ItemPointerGetOffsetNumber(tid);
+ tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << shift;
+
+ /* log(sizeof(uint64) * BITS_PER_BYTE, 2) = log(64, 2) = 6 */
+ *off = tid_i & ((1 << 6) - 1);
+ upper = tid_i >> 6;
+ Assert(*off < (sizeof(uint64) * BITS_PER_BYTE));
+
+ Assert(*off < 64);
+
+ return upper;
+}
+
+static int
+shuffle_randrange(pg_prng_state *state, int lower, int upper)
+{
+ return (int) floor(pg_prng_double(state) * ((upper - lower) + 0.999999)) + lower;
+}
+
+/* Naive Fisher-Yates implementation */
+static void
+shuffle_itemptrs(ItemPointer itemptr, uint64 nitems)
+{
+ /* reproducibility */
+ pg_prng_state state;
+
+ pg_prng_seed(&state, 0);
+
+ for (int i = 0; i < nitems - 1; i++)
+ {
+ int j = shuffle_randrange(&state, i, nitems - 1);
+ ItemPointerData t = itemptr[j];
+
+ itemptr[j] = itemptr[i];
+ itemptr[i] = t;
+ }
+}
+
+static ItemPointer
+generate_tids(BlockNumber minblk, BlockNumber maxblk, int ntids_per_blk, uint64 *ntids_p,
+ bool random_block)
+{
+ ItemPointer tids;
+ uint64 maxitems;
+ uint64 ntids = 0;
+ pg_prng_state state;
+
+ maxitems = (maxblk - minblk + 1) * ntids_per_blk;
+ tids = MemoryContextAllocHuge(TopTransactionContext,
+ sizeof(ItemPointerData) * maxitems);
+
+ if (random_block)
+ pg_prng_seed(&state, 0x9E3779B185EBCA87);
+
+ for (BlockNumber blk = minblk; blk < maxblk; blk++)
+ {
+ if (random_block && !pg_prng_bool(&state))
+ continue;
+
+ for (OffsetNumber off = FirstOffsetNumber;
+ off <= ntids_per_blk; off++)
+ {
+ CHECK_FOR_INTERRUPTS();
+
+ ItemPointerSetBlockNumber(&(tids[ntids]), blk);
+ ItemPointerSetOffsetNumber(&(tids[ntids]), off);
+
+ ntids++;
+ }
+ }
+
+ *ntids_p = ntids;
+ return tids;
+}
+
+#ifdef MEASURE_BINARY_SEARCH
+static int
+vac_cmp_itemptr(const void *left, const void *right)
+{
+ BlockNumber lblk,
+ rblk;
+ OffsetNumber loff,
+ roff;
+
+ lblk = ItemPointerGetBlockNumber((ItemPointer) left);
+ rblk = ItemPointerGetBlockNumber((ItemPointer) right);
+
+ if (lblk < rblk)
+ return -1;
+ if (lblk > rblk)
+ return 1;
+
+ loff = ItemPointerGetOffsetNumber((ItemPointer) left);
+ roff = ItemPointerGetOffsetNumber((ItemPointer) right);
+
+ if (loff < roff)
+ return -1;
+ if (loff > roff)
+ return 1;
+
+ return 0;
+}
+#endif
+
+static Datum
+bench_search(FunctionCallInfo fcinfo, bool shuffle)
+{
+ BlockNumber minblk = PG_GETARG_INT32(0);
+ BlockNumber maxblk = PG_GETARG_INT32(1);
+ bool random_block = PG_GETARG_BOOL(2);
+ radix_tree *rt = NULL;
+ uint64 ntids;
+ uint64 key;
+ uint64 last_key = PG_UINT64_MAX;
+ uint64 val = 0;
+ ItemPointer tids;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_load_ms,
+ rt_search_ms;
+ Datum values[7];
+ bool nulls[7];
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ tids = generate_tids(minblk, maxblk, TIDS_PER_BLOCK_FOR_LOAD, &ntids, random_block);
+
+ /* measure the load time of the radix tree */
+ rt = rt_create(CurrentMemoryContext);
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ uint32 off;
+
+ CHECK_FOR_INTERRUPTS();
+
+ key = tid_to_key_off(tid, &off);
+
+ if (last_key != PG_UINT64_MAX && last_key != key)
+ {
+ rt_set(rt, last_key, val);
+ val = 0;
+ }
+
+ last_key = key;
+ val |= (uint64) 1 << off;
+ }
+ if (last_key != PG_UINT64_MAX)
+ rt_set(rt, last_key, val);
+
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_load_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ if (shuffle)
+ shuffle_itemptrs(tids, ntids);
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+ /* measure the search time of the radix tree */
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ uint32 off;
+ volatile bool ret; /* prevent calling rt_search from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ key = tid_to_key_off(tid, &off);
+
+ ret = rt_search(rt, key, &val);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_search_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int64GetDatum(rt_num_entries(rt));
+ values[1] = Int64GetDatum(rt_memory_usage(rt));
+ values[2] = Int64GetDatum(sizeof(ItemPointerData) * ntids);
+ values[3] = Int64GetDatum(rt_load_ms);
+ nulls[4] = true; /* ar_load_ms */
+ values[5] = Int64GetDatum(rt_search_ms);
+ nulls[6] = true; /* ar_search_ms */
+
+#ifdef MEASURE_BINARY_SEARCH
+ {
+ ItemPointer itemptrs = NULL;
+
+ int64 ar_load_ms,
+ ar_search_ms;
+
+ /* measure the load time of the array */
+ itemptrs = MemoryContextAllocHuge(CurrentMemoryContext,
+ sizeof(ItemPointerData) * ntids);
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointerSetBlockNumber(&(itemptrs[i]),
+ ItemPointerGetBlockNumber(&(tids[i])));
+ ItemPointerSetOffsetNumber(&(itemptrs[i]),
+ ItemPointerGetOffsetNumber(&(tids[i])));
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ ar_load_ms = secs * 1000 + usecs / 1000;
+
+ /* next, measure the search time of the array */
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ volatile bool ret; /* prevent calling bsearch from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ ret = bsearch((void *) tid,
+ (void *) itemptrs,
+ ntids,
+ sizeof(ItemPointerData),
+ vac_cmp_itemptr);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ ar_search_ms = secs * 1000 + usecs / 1000;
+
+ /* set the result */
+ nulls[4] = false;
+ values[4] = Int64GetDatum(ar_load_ms);
+ nulls[6] = false;
+ values[6] = Int64GetDatum(ar_search_ms);
+ }
+#endif
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_seq_search(PG_FUNCTION_ARGS)
+{
+ return bench_search(fcinfo, false);
+}
+
+Datum
+bench_shuffle_search(PG_FUNCTION_ARGS)
+{
+ return bench_search(fcinfo, true);
+}
+
+Datum
+bench_load_random_int(PG_FUNCTION_ARGS)
+{
+ uint64 cnt = (uint64) PG_GETARG_INT64(0);
+ radix_tree *rt;
+ pg_prng_state state;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 load_time_ms;
+ Datum values[2];
+ bool nulls[2];
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ pg_prng_seed(&state, 0);
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ uint64 key = pg_prng_uint64(&state);
+
+ rt_set(rt, key, key);
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ load_time_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int64GetDatum(rt_memory_usage(rt));
+ values[1] = Int64GetDatum(load_time_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+/* copy of splitmix64() */
+static uint64
+hash64(uint64 x)
+{
+ x ^= x >> 30;
+ x *= UINT64CONST(0xbf58476d1ce4e5b9);
+ x ^= x >> 27;
+ x *= UINT64CONST(0x94d049bb133111eb);
+ x ^= x >> 31;
+ return x;
+}
+
+/* attempts to have a relatively even population of node kinds */
+Datum
+bench_search_random_nodes(PG_FUNCTION_ARGS)
+{
+ uint64 cnt = (uint64) PG_GETARG_INT64(0);
+ radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 search_time_ms;
+ Datum values[2] = {0};
+ bool nulls[2] = {0};
+ /* from trial and error */
+ uint64 filter = (((uint64) 0x7F<<32) | (0x07<<24) | (0xFF<<16) | 0xFF);
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ if (!PG_ARGISNULL(1))
+ {
+ char *filter_str = text_to_cstring(PG_GETARG_TEXT_P(1));
+
+ if (sscanf(filter_str, "0x%lX", &filter) == 0)
+ elog(ERROR, "invalid filter string %s", filter_str);
+ }
+ elog(NOTICE, "bench with filter 0x%lX", filter);
+
+ rt = rt_create(CurrentMemoryContext);
+
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ const uint64 hash = hash64(i);
+ const uint64 key = hash & filter;
+
+ rt_set(rt, key, key);
+ }
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ const uint64 hash = hash64(i);
+ const uint64 key = hash & filter;
+ uint64 val;
+ volatile bool ret; /* prevent calling rt_search from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ ret = rt_search(rt, key, &val);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ search_time_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ values[0] = Int64GetDatum(rt_memory_usage(rt));
+ values[1] = Int64GetDatum(search_time_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_fixed_height_search(PG_FUNCTION_ARGS)
+{
+ int fanout = PG_GETARG_INT32(0);
+ radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_load_ms,
+ rt_search_ms;
+ Datum values[5];
+ bool nulls[5];
+
+ /* test boundary between vector and iteration */
+ const int n_keys = 5 * 16 * 16 * 16 * 16;
+ uint64 r,
+ h,
+ i,
+ j,
+ k;
+ int key_id;
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+
+ key_id = 0;
+
+ /*
+ * lower nodes have limited fanout, the top is only limited by
+ * bits-per-byte
+ */
+ for (r = 1;; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ for (i = 1; i <= fanout; i++)
+ {
+ for (j = 1; j <= fanout; j++)
+ {
+ for (k = 1; k <= fanout; k++)
+ {
+ uint64 key;
+
+ key = (r << 32) | (h << 24) | (i << 16) | (j << 8) | (k);
+
+ CHECK_FOR_INTERRUPTS();
+
+ key_id++;
+ if (key_id > n_keys)
+ goto finish_set;
+
+ rt_set(rt, key, key_id);
+ }
+ }
+ }
+ }
+ }
+finish_set:
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_load_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ /* measure the search time of the radix tree */
+ start_time = GetCurrentTimestamp();
+
+ key_id = 0;
+ for (r = 1;; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ for (i = 1; i <= fanout; i++)
+ {
+ for (j = 1; j <= fanout; j++)
+ {
+ for (k = 1; k <= fanout; k++)
+ {
+ uint64 key,
+ val;
+
+ key = (r << 32) | (h << 24) | (i << 16) | (j << 8) | (k);
+
+ CHECK_FOR_INTERRUPTS();
+
+ key_id++;
+ if (key_id > n_keys)
+ goto finish_search;
+
+ rt_search(rt, key, &val);
+ }
+ }
+ }
+ }
+ }
+finish_search:
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_search_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int32GetDatum(fanout);
+ values[1] = Int64GetDatum(rt_num_entries(rt));
+ values[2] = Int64GetDatum(rt_memory_usage(rt));
+ values[3] = Int64GetDatum(rt_load_ms);
+ values[4] = Int64GetDatum(rt_search_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_node128_load(PG_FUNCTION_ARGS)
+{
+ int fanout = PG_GETARG_INT32(0);
+ radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_sparseload_ms;
+ Datum values[5];
+ bool nulls[5];
+
+ uint64 r,
+ h;
+ int key_id;
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ rt = rt_create(CurrentMemoryContext);
+
+ key_id = 0;
+
+ for (r = 1; r <= fanout; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (h);
+
+ key_id++;
+ rt_set(rt, key, key_id);
+ }
+ }
+
+ rt_stats(rt);
+
+ /* measure sparse deletion and re-loading */
+ start_time = GetCurrentTimestamp();
+
+ for (int t = 0; t<10000; t++)
+ {
+ /* delete one key in each leaf */
+ for (r = 1; r <= fanout; r++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (fanout);
+
+ rt_delete(rt, key);
+ }
+
+ /* add them all back */
+ key_id = 0;
+ for (r = 1; r <= fanout; r++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (fanout);
+
+ key_id++;
+ rt_set(rt, key, key_id);
+ }
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_sparseload_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int32GetDatum(fanout);
+ values[1] = Int64GetDatum(rt_num_entries(rt));
+ values[2] = Int64GetDatum(rt_memory_usage(rt));
+ values[3] = Int64GetDatum(rt_sparseload_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
diff --git a/contrib/bench_radix_tree/bench_radix_tree.control b/contrib/bench_radix_tree/bench_radix_tree.control
new file mode 100644
index 0000000000..1d988e6c9a
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.control
@@ -0,0 +1,6 @@
+# bench_radix_tree extension
+comment = 'benchmark suite for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/bench_radix_tree'
+relocatable = true
+trusted = true
diff --git a/contrib/bench_radix_tree/expected/bench.out b/contrib/bench_radix_tree/expected/bench.out
new file mode 100644
index 0000000000..60c303892e
--- /dev/null
+++ b/contrib/bench_radix_tree/expected/bench.out
@@ -0,0 +1,13 @@
+create extension bench_radix_tree;
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
diff --git a/contrib/bench_radix_tree/sql/bench.sql b/contrib/bench_radix_tree/sql/bench.sql
new file mode 100644
index 0000000000..a46018c9d4
--- /dev/null
+++ b/contrib/bench_radix_tree/sql/bench.sql
@@ -0,0 +1,16 @@
+create extension bench_radix_tree;
+
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
--
2.38.1
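
(For anyone reproducing the benchmark above: tid_to_key_off() in bench_radix_tree.c packs the offset number into the low bits of a 64-bit integer, below the block number, and then splits that integer into a radix tree key (the upper bits) and a bit position within the 64-bit value (the low 6 bits), so up to 64 TIDs with nearby offsets share a single tree entry. A rough standalone sketch follows; the 9 offset bits are an assumption matching ceil(log2(MaxHeapTuplesPerPage)) for the default 8kB block size, whereas the real code derives the shift at runtime.)

#include <stdint.h>
#include <stdio.h>

/*
 * Sketch of the benchmark's TID encoding: offset number in the low bits,
 * block number above it; the low 6 bits of the packed value select a bit
 * within the 64-bit value, the rest becomes the radix tree key.
 */
static uint64_t
encode_tid(uint32_t block, uint16_t offset, uint32_t *bit)
{
	const uint32_t offset_bits = 9;		/* assumption: 8kB heap pages */
	uint64_t	packed;

	packed = ((uint64_t) block << offset_bits) | offset;
	*bit = (uint32_t) (packed & ((1 << 6) - 1));	/* bit within the value */
	return packed >> 6;								/* radix tree key */
}

int
main(void)
{
	uint32_t	bit;
	uint64_t	key = encode_tid(1000, 7, &bit);

	printf("key = %llu, bit = %u\n", (unsigned long long) key, bit);
	return 0;
}

Loading then works as in bench_search(): while consecutive TIDs map to the same key, OR their bits into one value and call rt_set() once per key.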
v16-0009-Use-bitmap-operations-for-isset-arrays-rather-th.patch
From e1085a0420f719f7e4ce8a904794ab8e484b75a9 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Mon, 19 Dec 2022 16:16:12 +0700
Subject: [PATCH v16 09/12] Use bitmap operations for isset arrays rather than
byte operations
It's simpler to do the same thing everywhere, even for node256
where iteration performance doesn't matter as much because we
always can insert directly.
Also rename WORDNUM and BITNUM to avoid clashing with bitmapset.c.
---
src/backend/lib/radixtree.c | 64 +++++++++++++++++++++----------------
1 file changed, 36 insertions(+), 28 deletions(-)
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
index ddf7b002fc..7899e844fb 100644
--- a/src/backend/lib/radixtree.c
+++ b/src/backend/lib/radixtree.c
@@ -77,12 +77,6 @@
/* The number of maximum slots in the node */
#define RT_NODE_MAX_SLOTS (1 << RT_NODE_SPAN)
-/*
- * Return the number of bits required to represent nslots slots, used
- * nodes indexed by array lookup.
- */
-#define RT_NODE_NSLOTS_BITS(nslots) ((nslots) / (sizeof(uint8) * BITS_PER_BYTE))
-
/* Mask for extracting a chunk from the key */
#define RT_CHUNK_MASK ((1 << RT_NODE_SPAN) - 1)
@@ -98,15 +92,9 @@
/* Get a chunk from the key */
#define RT_GET_KEY_CHUNK(key, shift) ((uint8) (((key) >> (shift)) & RT_CHUNK_MASK))
-/*
- * Mapping from the value to the bit in is-set bitmap in the node-256.
- */
-#define RT_NODE_BITMAP_BYTE(v) ((v) / BITS_PER_BYTE)
-#define RT_NODE_BITMAP_BIT(v) (UINT64CONST(1) << ((v) % RT_NODE_SPAN))
-
-/* FIXME rename */
-#define WORDNUM(x) ((x) / BITS_PER_BITMAPWORD)
-#define BITNUM(x) ((x) % BITS_PER_BITMAPWORD)
+/* For accessing bitmaps */
+#define BM_IDX(x) ((x) / BITS_PER_BITMAPWORD)
+#define BM_BIT(x) ((x) % BITS_PER_BITMAPWORD)
/* Enum used rt_node_search() */
typedef enum
@@ -214,7 +202,7 @@ typedef struct rt_node_base125
uint8 slot_idxs[RT_NODE_MAX_SLOTS];
/* isset is a bitmap to track which slot is in use */
- bitmapword isset[WORDNUM(128)];
+ bitmapword isset[BM_IDX(128)];
} rt_node_base_125;
typedef struct rt_node_base256
@@ -300,7 +288,7 @@ typedef struct rt_node_leaf_256
rt_node_base_256 base;
/* isset is a bitmap to track which slot is in use */
- uint8 isset[RT_NODE_NSLOTS_BITS(RT_NODE_MAX_SLOTS)];
+ bitmapword isset[BM_IDX(RT_NODE_MAX_SLOTS)];
/* Slots for 256 values */
uint64 values[RT_NODE_MAX_SLOTS];
@@ -665,17 +653,23 @@ node_125_is_chunk_used(rt_node_base_125 *node, uint8 chunk)
static inline bool
node_inner_125_is_slot_used(rt_node_inner_125 *node, uint8 slot)
{
+ int idx = BM_IDX(slot);
+ int bitnum = BM_BIT(slot);
+
Assert(!NODE_IS_LEAF(node));
Assert(slot < node->base.n.fanout);
- return (node->base.isset[WORDNUM(slot)] & ((bitmapword) 1 << BITNUM(slot))) != 0;
+ return (node->base.isset[idx] & ((bitmapword) 1 << bitnum)) != 0;
}
static inline bool
node_leaf_125_is_slot_used(rt_node_leaf_125 *node, uint8 slot)
{
+ int idx = BM_IDX(slot);
+ int bitnum = BM_BIT(slot);
+
Assert(NODE_IS_LEAF(node));
Assert(slot < node->base.n.fanout);
- return (node->base.isset[WORDNUM(slot)] & ((bitmapword) 1 << BITNUM(slot))) != 0;
+ return (node->base.isset[idx] & ((bitmapword) 1 << bitnum)) != 0;
}
#endif
@@ -698,9 +692,12 @@ static void
node_inner_125_delete(rt_node_inner_125 *node, uint8 chunk)
{
int slotpos = node->base.slot_idxs[chunk];
+ int idx = BM_IDX(slotpos);
+ int bitnum = BM_BIT(slotpos);
Assert(!NODE_IS_LEAF(node));
- node->base.isset[WORDNUM(slotpos)] &= ~((bitmapword) 1 << BITNUM(slotpos));
+
+ node->base.isset[idx] &= ~((bitmapword) 1 << bitnum);
node->children[node->base.slot_idxs[chunk]] = NULL;
node->base.slot_idxs[chunk] = RT_NODE_125_INVALID_IDX;
}
@@ -709,9 +706,11 @@ static void
node_leaf_125_delete(rt_node_leaf_125 *node, uint8 chunk)
{
int slotpos = node->base.slot_idxs[chunk];
+ int idx = BM_IDX(slotpos);
+ int bitnum = BM_BIT(slotpos);
Assert(NODE_IS_LEAF(node));
- node->base.isset[WORDNUM(slotpos)] &= ~((bitmapword) 1 << BITNUM(slotpos));
+ node->base.isset[idx] &= ~((bitmapword) 1 << bitnum);
node->base.slot_idxs[chunk] = RT_NODE_125_INVALID_IDX;
}
@@ -724,7 +723,7 @@ node_125_find_unused_slot(bitmapword *isset)
bitmapword inverse;
/* get the first word with at least one bit not set */
- for (idx = 0; idx < WORDNUM(128); idx++)
+ for (idx = 0; idx < BM_IDX(128); idx++)
{
if (isset[idx] < ~((bitmapword) 0))
break;
@@ -798,8 +797,11 @@ node_inner_256_is_chunk_used(rt_node_inner_256 *node, uint8 chunk)
static inline bool
node_leaf_256_is_chunk_used(rt_node_leaf_256 *node, uint8 chunk)
{
+ int idx = BM_IDX(chunk);
+ int bitnum = BM_BIT(chunk);
+
Assert(NODE_IS_LEAF(node));
- return (node->isset[RT_NODE_BITMAP_BYTE(chunk)] & RT_NODE_BITMAP_BIT(chunk)) != 0;
+ return (node->isset[idx] & ((bitmapword) 1 << bitnum)) != 0;
}
static inline rt_node *
@@ -830,8 +832,11 @@ node_inner_256_set(rt_node_inner_256 *node, uint8 chunk, rt_node *child)
static inline void
node_leaf_256_set(rt_node_leaf_256 *node, uint8 chunk, uint64 value)
{
+ int idx = BM_IDX(chunk);
+ int bitnum = BM_BIT(chunk);
+
Assert(NODE_IS_LEAF(node));
- node->isset[RT_NODE_BITMAP_BYTE(chunk)] |= RT_NODE_BITMAP_BIT(chunk);
+ node->isset[idx] |= ((bitmapword) 1 << bitnum);
node->values[chunk] = value;
}
@@ -846,8 +851,11 @@ node_inner_256_delete(rt_node_inner_256 *node, uint8 chunk)
static inline void
node_leaf_256_delete(rt_node_leaf_256 *node, uint8 chunk)
{
+ int idx = BM_IDX(chunk);
+ int bitnum = BM_BIT(chunk);
+
Assert(NODE_IS_LEAF(node));
- node->isset[RT_NODE_BITMAP_BYTE(chunk)] &= ~(RT_NODE_BITMAP_BIT(chunk));
+ node->isset[idx] &= ~((bitmapword) 1 << bitnum);
}
/*
@@ -2269,8 +2277,8 @@ rt_verify_node(rt_node *node)
rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
int cnt = 0;
- for (int i = 0; i < RT_NODE_NSLOTS_BITS(RT_NODE_MAX_SLOTS); i++)
- cnt += pg_popcount32(n256->isset[i]);
+ for (int i = 0; i < BM_IDX(RT_NODE_MAX_SLOTS); i++)
+ cnt += bmw_popcount(n256->isset[i]);
/* Check if the number of used chunk matches */
Assert(n256->base.n.count == cnt);
@@ -2386,7 +2394,7 @@ rt_dump_node(rt_node *node, int level, bool recurse)
rt_node_leaf_125 *n = (rt_node_leaf_125 *) node;
fprintf(stderr, ", isset-bitmap:");
- for (int i = 0; i < WORDNUM(128); i++)
+ for (int i = 0; i < BM_IDX(128); i++)
{
fprintf(stderr, UINT64_FORMAT_HEX " ", n->base.isset[i]);
}
--
2.38.1
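
(To make the renamed BM_IDX/BM_BIT macros in the 0009 patch above concrete: BM_IDX selects the bitmapword and BM_BIT the bit within it, and counting used slots then reduces to a per-word popcount, which is what rt_verify_node() does for the node-256 leaf. A tiny standalone illustration in plain C; 64-bit words are assumed here, whereas bitmapword may be 32 bits on some platforms and the real code uses bmw_popcount().)

#include <stdint.h>
#include <stdio.h>

#define BITS_PER_WORD	64
#define BM_IDX(x)		((x) / BITS_PER_WORD)
#define BM_BIT(x)		((x) % BITS_PER_WORD)

static uint64_t isset[256 / BITS_PER_WORD];		/* 4 words cover 256 slots */

static void
set_slot(int slot)
{
	isset[BM_IDX(slot)] |= (uint64_t) 1 << BM_BIT(slot);
}

static int
count_slots(void)
{
	int			cnt = 0;

	for (int i = 0; i < 256 / BITS_PER_WORD; i++)
		for (uint64_t w = isset[i]; w != 0; w &= w - 1)	/* clear lowest set bit */
			cnt++;

	return cnt;
}

int
main(void)
{
	set_slot(0);
	set_slot(63);
	set_slot(64);
	set_slot(255);
	printf("used slots: %d\n", count_slots());	/* prints 4 */
	return 0;
}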
v16-0008-Use-newnode-variable-to-reduce-unnecessary-casti.patch
From 7c2652509034b569eb4fc49faaf4dd7a61bfa8fd Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Mon, 19 Dec 2022 15:08:15 +0700
Subject: [PATCH v16 08/12] Use newnode variable to reduce unnecessary casting
---
src/backend/lib/radixtree.c | 46 +++++++++++++++++--------------------
1 file changed, 21 insertions(+), 25 deletions(-)
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
index 7c993e096b..ddf7b002fc 100644
--- a/src/backend/lib/radixtree.c
+++ b/src/backend/lib/radixtree.c
@@ -1284,6 +1284,7 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
{
uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
bool chunk_exists = false;
+ rt_node *newnode = NULL;
Assert(!NODE_IS_LEAF(node));
@@ -1306,18 +1307,16 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n4)))
{
rt_node_inner_32 *new32;
- Assert(parent != NULL);
/* grow node from 4 to 32 */
- new32 = (rt_node_inner_32 *) rt_grow_node_kind(tree, (rt_node *) n4,
- RT_NODE_KIND_32);
+ newnode = rt_grow_node_kind(tree, node, RT_NODE_KIND_32);
+ new32 = (rt_node_inner_32 *) newnode;
chunk_children_array_copy(n4->base.chunks, n4->children,
new32->base.chunks, new32->children);
Assert(parent != NULL);
- rt_replace_node(tree, parent, (rt_node *) n4, (rt_node *) new32,
- key);
- node = (rt_node *) new32;
+ rt_replace_node(tree, parent, node, newnode, key);
+ node = newnode;
}
else
{
@@ -1354,19 +1353,17 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)) &&
n32->base.n.count == minclass.fanout)
{
- /* use the same node kind, but expand to the next size class */
- rt_node_inner_32 *new32;
-
- new32 = (rt_node_inner_32 *) rt_alloc_node(tree, RT_CLASS_32_FULL, true);
- memcpy(new32, n32, minclass.inner_size);
- new32->base.n.fanout = maxclass.fanout;
+ /* grow to the next size class of this kind */
+ newnode = rt_alloc_node(tree, RT_CLASS_32_FULL, true);
+ memcpy(newnode, node, minclass.inner_size);
+ newnode->fanout = maxclass.fanout;
Assert(parent != NULL);
- rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new32, key);
+ rt_replace_node(tree, parent, node, newnode, key);
+ node = newnode;
- /* must update both pointers here */
- node = (rt_node *) new32;
- n32 = new32;
+ /* also update pointer for this kind */
+ n32 = (rt_node_inner_32 *) newnode;
}
if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)))
@@ -1374,14 +1371,14 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
rt_node_inner_125 *new125;
/* grow node from 32 to 125 */
- new125 = (rt_node_inner_125 *) rt_grow_node_kind(tree, (rt_node *) n32,
- RT_NODE_KIND_125);
+ newnode = rt_grow_node_kind(tree, node, RT_NODE_KIND_125);
+ new125 = (rt_node_inner_125 *) newnode;
for (int i = 0; i < n32->base.n.count; i++)
node_inner_125_insert(new125, n32->base.chunks[i], n32->children[i]);
Assert(parent != NULL);
- rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new125, key);
- node = (rt_node *) new125;
+ rt_replace_node(tree, parent, node, newnode, key);
+ node = newnode;
}
else
{
@@ -1420,8 +1417,8 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
Assert(parent != NULL);
/* grow node from 125 to 256 */
- new256 = (rt_node_inner_256 *) rt_grow_node_kind(tree, (rt_node *) n125,
- RT_NODE_KIND_256);
+ newnode = rt_grow_node_kind(tree, node, RT_NODE_KIND_256);
+ new256 = (rt_node_inner_256 *) newnode;
for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n125->base.n.count; i++)
{
if (!node_125_is_chunk_used(&n125->base, i))
@@ -1431,9 +1428,8 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
cnt++;
}
- rt_replace_node(tree, parent, (rt_node *) n125, (rt_node *) new256,
- key);
- node = (rt_node *) new256;
+ rt_replace_node(tree, parent, node, newnode, key);
+ node = newnode;
}
else
{
--
2.38.1
Attachment: v16-0010-Template-out-node-insert-functions.patch (text/x-patch)
From 48892a7f66892aeb3346622fd7b26e20811154d8 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Fri, 23 Dec 2022 14:33:49 +0700
Subject: [PATCH v16 10/12] Template out node insert functions
---
src/backend/lib/radixtree.c | 369 +-----------------------
src/include/lib/radixtree_insert_impl.h | 257 +++++++++++++++++
2 files changed, 263 insertions(+), 363 deletions(-)
create mode 100644 src/include/lib/radixtree_insert_impl.h
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
index 7899e844fb..79d12b27d2 100644
--- a/src/backend/lib/radixtree.c
+++ b/src/backend/lib/radixtree.c
@@ -1290,185 +1290,9 @@ static bool
rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 key,
rt_node *child)
{
- uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
- bool chunk_exists = false;
- rt_node *newnode = NULL;
-
- Assert(!NODE_IS_LEAF(node));
-
- switch (node->kind)
- {
- case RT_NODE_KIND_4:
- {
- rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
- int idx;
-
- idx = node_4_search_eq(&n4->base, chunk);
- if (idx != -1)
- {
- /* found the existing chunk */
- chunk_exists = true;
- n4->children[idx] = child;
- break;
- }
-
- if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n4)))
- {
- rt_node_inner_32 *new32;
-
- /* grow node from 4 to 32 */
- newnode = rt_grow_node_kind(tree, node, RT_NODE_KIND_32);
- new32 = (rt_node_inner_32 *) newnode;
- chunk_children_array_copy(n4->base.chunks, n4->children,
- new32->base.chunks, new32->children);
-
- Assert(parent != NULL);
- rt_replace_node(tree, parent, node, newnode, key);
- node = newnode;
- }
- else
- {
- int insertpos = node_4_get_insertpos(&n4->base, chunk);
- uint16 count = n4->base.n.count;
-
- /* shift chunks and children */
- if (count != 0 && insertpos < count)
- chunk_children_array_shift(n4->base.chunks, n4->children,
- count, insertpos);
-
- n4->base.chunks[insertpos] = chunk;
- n4->children[insertpos] = child;
- break;
- }
- }
- /* FALLTHROUGH */
- case RT_NODE_KIND_32:
- {
- const rt_size_class_elem minclass = rt_size_class_info[RT_CLASS_32_PARTIAL];
- const rt_size_class_elem maxclass = rt_size_class_info[RT_CLASS_32_FULL];
- rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
- int idx;
-
- idx = node_32_search_eq(&n32->base, chunk);
- if (idx != -1)
- {
- /* found the existing chunk */
- chunk_exists = true;
- n32->children[idx] = child;
- break;
- }
-
- if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)) &&
- n32->base.n.count == minclass.fanout)
- {
- /* grow to the next size class of this kind */
- newnode = rt_alloc_node(tree, RT_CLASS_32_FULL, true);
- memcpy(newnode, node, minclass.inner_size);
- newnode->fanout = maxclass.fanout;
-
- Assert(parent != NULL);
- rt_replace_node(tree, parent, node, newnode, key);
- node = newnode;
-
- /* also update pointer for this kind */
- n32 = (rt_node_inner_32 *) newnode;
- }
-
- if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)))
- {
- rt_node_inner_125 *new125;
-
- /* grow node from 32 to 125 */
- newnode = rt_grow_node_kind(tree, node, RT_NODE_KIND_125);
- new125 = (rt_node_inner_125 *) newnode;
- for (int i = 0; i < n32->base.n.count; i++)
- node_inner_125_insert(new125, n32->base.chunks[i], n32->children[i]);
-
- Assert(parent != NULL);
- rt_replace_node(tree, parent, node, newnode, key);
- node = newnode;
- }
- else
- {
- int insertpos = node_32_get_insertpos(&n32->base, chunk);
- int16 count = n32->base.n.count;
-
- if (insertpos < count)
- {
- Assert(count > 0);
- chunk_children_array_shift(n32->base.chunks, n32->children,
- count, insertpos);
- }
-
- n32->base.chunks[insertpos] = chunk;
- n32->children[insertpos] = child;
- break;
- }
- }
- /* FALLTHROUGH */
- case RT_NODE_KIND_125:
- {
- rt_node_inner_125 *n125 = (rt_node_inner_125 *) node;
- int cnt = 0;
-
- if (node_125_is_chunk_used(&n125->base, chunk))
- {
- /* found the existing chunk */
- chunk_exists = true;
- node_inner_125_update(n125, chunk, child);
- break;
- }
-
- if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n125)))
- {
- rt_node_inner_256 *new256;
- Assert(parent != NULL);
-
- /* grow node from 125 to 256 */
- newnode = rt_grow_node_kind(tree, node, RT_NODE_KIND_256);
- new256 = (rt_node_inner_256 *) newnode;
- for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n125->base.n.count; i++)
- {
- if (!node_125_is_chunk_used(&n125->base, i))
- continue;
-
- node_inner_256_set(new256, i, node_inner_125_get_child(n125, i));
- cnt++;
- }
-
- rt_replace_node(tree, parent, node, newnode, key);
- node = newnode;
- }
- else
- {
- node_inner_125_insert(n125, chunk, child);
- break;
- }
- }
- /* FALLTHROUGH */
- case RT_NODE_KIND_256:
- {
- rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
-
- chunk_exists = node_inner_256_is_chunk_used(n256, chunk);
- Assert(chunk_exists || FIXED_NODE_HAS_FREE_SLOT(n256, RT_CLASS_256));
-
- node_inner_256_set(n256, chunk, child);
- break;
- }
- }
-
- /* Update statistics */
- if (!chunk_exists)
- node->count++;
-
- /*
- * Done. Finally, verify the chunk and value is inserted or replaced
- * properly in the node.
- */
- rt_verify_node(node);
-
- return chunk_exists;
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_insert_impl.h"
+#undef RT_NODE_LEVEL_INNER
}
/* Insert the value to the leaf node */
@@ -1476,190 +1300,9 @@ static bool
rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
uint64 key, uint64 value)
{
- uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
- bool chunk_exists = false;
-
- Assert(NODE_IS_LEAF(node));
-
- switch (node->kind)
- {
- case RT_NODE_KIND_4:
- {
- rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
- int idx;
-
- idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
- if (idx != -1)
- {
- /* found the existing chunk */
- chunk_exists = true;
- n4->values[idx] = value;
- break;
- }
-
- if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n4)))
- {
- rt_node_leaf_32 *new32;
- Assert(parent != NULL);
-
- /* grow node from 4 to 32 */
- new32 = (rt_node_leaf_32 *) rt_grow_node_kind(tree, (rt_node *) n4,
- RT_NODE_KIND_32);
- chunk_values_array_copy(n4->base.chunks, n4->values,
- new32->base.chunks, new32->values);
- rt_replace_node(tree, parent, (rt_node *) n4, (rt_node *) new32, key);
- node = (rt_node *) new32;
- }
- else
- {
- int insertpos = node_4_get_insertpos((rt_node_base_4 *) n4, chunk);
- int count = n4->base.n.count;
-
- /* shift chunks and values */
- if (count != 0 && insertpos < count)
- chunk_values_array_shift(n4->base.chunks, n4->values,
- count, insertpos);
-
- n4->base.chunks[insertpos] = chunk;
- n4->values[insertpos] = value;
- break;
- }
- }
- /* FALLTHROUGH */
- case RT_NODE_KIND_32:
- {
- rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
- int idx;
-
- idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
- if (idx != -1)
- {
- /* found the existing chunk */
- chunk_exists = true;
- n32->values[idx] = value;
- break;
- }
-
- if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)))
- {
- Assert(parent != NULL);
-
- if (n32->base.n.count == rt_size_class_info[RT_CLASS_32_PARTIAL].fanout)
- {
- /* use the same node kind, but expand to the next size class */
- const Size size = rt_size_class_info[RT_CLASS_32_PARTIAL].leaf_size;
- const int fanout = rt_size_class_info[RT_CLASS_32_FULL].fanout;
- rt_node_leaf_32 *new32;
-
- new32 = (rt_node_leaf_32 *) rt_alloc_node(tree, RT_CLASS_32_FULL, false);
- memcpy(new32, n32, size);
- new32->base.n.fanout = fanout;
-
- rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new32, key);
-
- /* must update both pointers here */
- node = (rt_node *) new32;
- n32 = new32;
-
- goto retry_insert_leaf_32;
- }
- else
- {
- rt_node_leaf_125 *new125;
-
- /* grow node from 32 to 125 */
- new125 = (rt_node_leaf_125 *) rt_grow_node_kind(tree, (rt_node *) n32,
- RT_NODE_KIND_125);
- for (int i = 0; i < n32->base.n.count; i++)
- node_leaf_125_insert(new125, n32->base.chunks[i], n32->values[i]);
-
- rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new125,
- key);
- node = (rt_node *) new125;
- }
- }
- else
- {
- retry_insert_leaf_32:
- {
- int insertpos = node_32_get_insertpos((rt_node_base_32 *) n32, chunk);
- int count = n32->base.n.count;
-
- if (count != 0 && insertpos < count)
- chunk_values_array_shift(n32->base.chunks, n32->values,
- count, insertpos);
-
- n32->base.chunks[insertpos] = chunk;
- n32->values[insertpos] = value;
- break;
- }
- }
- }
- /* FALLTHROUGH */
- case RT_NODE_KIND_125:
- {
- rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) node;
- int cnt = 0;
-
- if (node_125_is_chunk_used((rt_node_base_125 *) n125, chunk))
- {
- /* found the existing chunk */
- chunk_exists = true;
- node_leaf_125_update(n125, chunk, value);
- break;
- }
-
- if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n125)))
- {
- rt_node_leaf_256 *new256;
- Assert(parent != NULL);
-
- /* grow node from 125 to 256 */
- new256 = (rt_node_leaf_256 *) rt_grow_node_kind(tree, (rt_node *) n125,
- RT_NODE_KIND_256);
- for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n125->base.n.count; i++)
- {
- if (!node_125_is_chunk_used((rt_node_base_125 *) n125, i))
- continue;
-
- node_leaf_256_set(new256, i, node_leaf_125_get_value(n125, i));
- cnt++;
- }
-
- rt_replace_node(tree, parent, (rt_node *) n125, (rt_node *) new256,
- key);
- node = (rt_node *) new256;
- }
- else
- {
- node_leaf_125_insert(n125, chunk, value);
- break;
- }
- }
- /* FALLTHROUGH */
- case RT_NODE_KIND_256:
- {
- rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
-
- chunk_exists = node_leaf_256_is_chunk_used(n256, chunk);
- Assert(chunk_exists || FIXED_NODE_HAS_FREE_SLOT(n256, RT_CLASS_256));
-
- node_leaf_256_set(n256, chunk, value);
- break;
- }
- }
-
- /* Update statistics */
- if (!chunk_exists)
- node->count++;
-
- /*
- * Done. Finally, verify the chunk and value is inserted or replaced
- * properly in the node.
- */
- rt_verify_node(node);
-
- return chunk_exists;
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_insert_impl.h"
+#undef RT_NODE_LEVEL_LEAF
}
/*
diff --git a/src/include/lib/radixtree_insert_impl.h b/src/include/lib/radixtree_insert_impl.h
new file mode 100644
index 0000000000..8e02c83fc7
--- /dev/null
+++ b/src/include/lib/radixtree_insert_impl.h
@@ -0,0 +1,257 @@
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE4_TYPE rt_node_inner_4
+#define RT_NODE32_TYPE rt_node_inner_32
+#define RT_NODE125_TYPE rt_node_inner_125
+#define RT_NODE256_TYPE rt_node_inner_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE4_TYPE rt_node_leaf_4
+#define RT_NODE32_TYPE rt_node_leaf_32
+#define RT_NODE125_TYPE rt_node_leaf_125
+#define RT_NODE256_TYPE rt_node_leaf_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool chunk_exists = false;
+ rt_node *newnode = NULL;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ Assert(NODE_IS_LEAF(node));
+#else
+ Assert(!NODE_IS_LEAF(node));
+#endif
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ RT_NODE4_TYPE *n4 = (RT_NODE4_TYPE *) node;
+ int idx;
+
+ idx = node_4_search_eq(&n4->base, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+#ifdef RT_NODE_LEVEL_LEAF
+ n4->values[idx] = value;
+#else
+ n4->children[idx] = child;
+#endif
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n4)))
+ {
+ RT_NODE32_TYPE *new32;
+
+ /* grow node from 4 to 32 */
+ newnode = rt_grow_node_kind(tree, node, RT_NODE_KIND_32);
+ new32 = (RT_NODE32_TYPE *) newnode;
+#ifdef RT_NODE_LEVEL_LEAF
+ chunk_values_array_copy(n4->base.chunks, n4->values,
+ new32->base.chunks, new32->values);
+#else
+ chunk_children_array_copy(n4->base.chunks, n4->children,
+ new32->base.chunks, new32->children);
+#endif
+ Assert(parent != NULL);
+ rt_replace_node(tree, parent, node, newnode, key);
+ node = newnode;
+ }
+ else
+ {
+ int insertpos = node_4_get_insertpos(&n4->base, chunk);
+ int count = n4->base.n.count;
+
+ /* shift chunks and children */
+ if (insertpos < count)
+ {
+ Assert(count > 0);
+#ifdef RT_NODE_LEVEL_LEAF
+ chunk_values_array_shift(n4->base.chunks, n4->values,
+ count, insertpos);
+#else
+ chunk_children_array_shift(n4->base.chunks, n4->children,
+ count, insertpos);
+#endif
+ }
+
+ n4->base.chunks[insertpos] = chunk;
+#ifdef RT_NODE_LEVEL_LEAF
+ n4->values[insertpos] = value;
+#else
+ n4->children[insertpos] = child;
+#endif
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_32:
+ {
+ const rt_size_class_elem minclass = rt_size_class_info[RT_CLASS_32_PARTIAL];
+ const rt_size_class_elem maxclass = rt_size_class_info[RT_CLASS_32_FULL];
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
+ int idx;
+
+ idx = node_32_search_eq(&n32->base, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+#ifdef RT_NODE_LEVEL_LEAF
+ n32->values[idx] = value;
+#else
+ n32->children[idx] = child;
+#endif
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)) &&
+ n32->base.n.count == minclass.fanout)
+ {
+ /* grow to the next size class of this kind */
+ newnode = rt_alloc_node(tree, RT_CLASS_32_FULL, true);
+ memcpy(newnode, node, minclass.inner_size);
+ newnode->fanout = maxclass.fanout;
+
+ Assert(parent != NULL);
+ rt_replace_node(tree, parent, node, newnode, key);
+ node = newnode;
+
+ /* also update pointer for this kind */
+ n32 = (RT_NODE32_TYPE *) newnode;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)))
+ {
+ RT_NODE125_TYPE *new125;
+
+ /* grow node from 32 to 125 */
+ newnode = rt_grow_node_kind(tree, node, RT_NODE_KIND_125);
+ new125 = (RT_NODE125_TYPE *) newnode;
+ for (int i = 0; i < n32->base.n.count; i++)
+#ifdef RT_NODE_LEVEL_LEAF
+ node_leaf_125_insert(new125, n32->base.chunks[i], n32->values[i]);
+#else
+ node_inner_125_insert(new125, n32->base.chunks[i], n32->children[i]);
+#endif
+ Assert(parent != NULL);
+ rt_replace_node(tree, parent, node, newnode, key);
+ node = newnode;
+ }
+ else
+ {
+ int insertpos = node_32_get_insertpos(&n32->base, chunk);
+ int count = n32->base.n.count;
+
+ if (insertpos < count)
+ {
+ Assert(count > 0);
+#ifdef RT_NODE_LEVEL_LEAF
+ chunk_values_array_shift(n32->base.chunks, n32->values,
+ count, insertpos);
+#else
+ chunk_children_array_shift(n32->base.chunks, n32->children,
+ count, insertpos);
+#endif
+ }
+
+ n32->base.chunks[insertpos] = chunk;
+#ifdef RT_NODE_LEVEL_LEAF
+ n32->values[insertpos] = value;
+#else
+ n32->children[insertpos] = child;
+#endif
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+ int cnt = 0;
+
+ if (node_125_is_chunk_used(&n125->base, chunk))
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+#ifdef RT_NODE_LEVEL_LEAF
+ node_leaf_125_update(n125, chunk, value);
+#else
+ node_inner_125_update(n125, chunk, child);
+#endif
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n125)))
+ {
+ RT_NODE256_TYPE *new256;
+
+ /* grow node from 125 to 256 */
+ newnode = rt_grow_node_kind(tree, node, RT_NODE_KIND_256);
+ new256 = (RT_NODE256_TYPE *) newnode;
+ for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n125->base.n.count; i++)
+ {
+ if (!node_125_is_chunk_used(&n125->base, i))
+ continue;
+#ifdef RT_NODE_LEVEL_LEAF
+ node_leaf_256_set(new256, i, node_leaf_125_get_value(n125, i));
+#else
+ node_inner_256_set(new256, i, node_inner_125_get_child(n125, i));
+#endif
+ cnt++;
+ }
+
+ Assert(parent != NULL);
+ rt_replace_node(tree, parent, node, newnode, key);
+ node = newnode;
+ }
+ else
+ {
+#ifdef RT_NODE_LEVEL_LEAF
+ node_leaf_125_insert(n125, chunk, value);
+#else
+ node_inner_125_insert(n125, chunk, child);
+#endif
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ chunk_exists = node_leaf_256_is_chunk_used(n256, chunk);
+#else
+ chunk_exists = node_inner_256_is_chunk_used(n256, chunk);
+#endif
+ Assert(chunk_exists || FIXED_NODE_HAS_FREE_SLOT(n256, RT_CLASS_256));
+
+#ifdef RT_NODE_LEVEL_LEAF
+ node_leaf_256_set(n256, chunk, value);
+#else
+ node_inner_256_set(n256, chunk, child);
+#endif
+ break;
+ }
+ }
+
+ /* Update statistics */
+ if (!chunk_exists)
+ node->count++;
+
+ /*
+ * Done. Finally, verify the chunk and value is inserted or replaced
+ * properly in the node.
+ */
+ rt_verify_node(node);
+
+ return chunk_exists;
+
+#undef RT_NODE4_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
--
2.38.1
Attachment: v16-0007-Remove-STRICT-from-bench_search_random_nodes.patch (text/x-patch)
From 42467662e039a9de6a0323d16857e9f17c5e140e Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Mon, 12 Dec 2022 10:39:48 +0700
Subject: [PATCH v16 07/12] Remove STRICT from bench_search_random_nodes
---
contrib/bench_radix_tree/bench_radix_tree--1.0.sql | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
index 83529805fc..2fd689aa91 100644
--- a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
+++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
@@ -50,7 +50,7 @@ OUT mem_allocated int8,
OUT search_ms int8)
returns record
as 'MODULE_PATHNAME'
-LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+LANGUAGE C VOLATILE PARALLEL UNSAFE;
create function bench_fixed_height_search(
fanout int4,
--
2.38.1
Attachment: v16-0011-Template-out-node-search-functions.patch (text/x-patch)
From a9982146efaa2c1b7139bc804ee33c0062be605a Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Fri, 23 Dec 2022 15:31:49 +0700
Subject: [PATCH v16 11/12] Template out node search functions
---
src/backend/lib/radixtree.c | 168 +-----------------------
src/include/lib/radixtree_search_impl.h | 151 +++++++++++++++++++++
2 files changed, 157 insertions(+), 162 deletions(-)
create mode 100644 src/include/lib/radixtree_search_impl.h
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
index 79d12b27d2..99450c96c8 100644
--- a/src/backend/lib/radixtree.c
+++ b/src/backend/lib/radixtree.c
@@ -1109,87 +1109,9 @@ rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node *parent,
static inline bool
rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **child_p)
{
- uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
- bool found = false;
- rt_node *child = NULL;
-
- switch (node->kind)
- {
- case RT_NODE_KIND_4:
- {
- rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
- int idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
-
- if (idx < 0)
- break;
-
- found = true;
-
- if (action == RT_ACTION_FIND)
- child = n4->children[idx];
- else /* RT_ACTION_DELETE */
- chunk_children_array_delete(n4->base.chunks, n4->children,
- n4->base.n.count, idx);
-
- break;
- }
- case RT_NODE_KIND_32:
- {
- rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
- int idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
-
- if (idx < 0)
- break;
-
- found = true;
- if (action == RT_ACTION_FIND)
- child = n32->children[idx];
- else /* RT_ACTION_DELETE */
- chunk_children_array_delete(n32->base.chunks, n32->children,
- n32->base.n.count, idx);
- break;
- }
- case RT_NODE_KIND_125:
- {
- rt_node_inner_125 *n125 = (rt_node_inner_125 *) node;
-
- if (!node_125_is_chunk_used((rt_node_base_125 *) n125, chunk))
- break;
-
- found = true;
-
- if (action == RT_ACTION_FIND)
- child = node_inner_125_get_child(n125, chunk);
- else /* RT_ACTION_DELETE */
- node_inner_125_delete(n125, chunk);
-
- break;
- }
- case RT_NODE_KIND_256:
- {
- rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
-
- if (!node_inner_256_is_chunk_used(n256, chunk))
- break;
-
- found = true;
- if (action == RT_ACTION_FIND)
- child = node_inner_256_get_child(n256, chunk);
- else /* RT_ACTION_DELETE */
- node_inner_256_delete(n256, chunk);
-
- break;
- }
- }
-
- /* update statistics */
- if (action == RT_ACTION_DELETE && found)
- node->count--;
-
- if (found && child_p)
- *child_p = child;
-
- return found;
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_INNER
}
/*
@@ -1202,87 +1124,9 @@ rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **chil
static inline bool
rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p)
{
- uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
- bool found = false;
- uint64 value = 0;
-
- switch (node->kind)
- {
- case RT_NODE_KIND_4:
- {
- rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
- int idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
-
- if (idx < 0)
- break;
-
- found = true;
-
- if (action == RT_ACTION_FIND)
- value = n4->values[idx];
- else /* RT_ACTION_DELETE */
- chunk_values_array_delete(n4->base.chunks, (uint64 *) n4->values,
- n4->base.n.count, idx);
-
- break;
- }
- case RT_NODE_KIND_32:
- {
- rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
- int idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
-
- if (idx < 0)
- break;
-
- found = true;
- if (action == RT_ACTION_FIND)
- value = n32->values[idx];
- else /* RT_ACTION_DELETE */
- chunk_values_array_delete(n32->base.chunks, (uint64 *) n32->values,
- n32->base.n.count, idx);
- break;
- }
- case RT_NODE_KIND_125:
- {
- rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) node;
-
- if (!node_125_is_chunk_used((rt_node_base_125 *) n125, chunk))
- break;
-
- found = true;
-
- if (action == RT_ACTION_FIND)
- value = node_leaf_125_get_value(n125, chunk);
- else /* RT_ACTION_DELETE */
- node_leaf_125_delete(n125, chunk);
-
- break;
- }
- case RT_NODE_KIND_256:
- {
- rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
-
- if (!node_leaf_256_is_chunk_used(n256, chunk))
- break;
-
- found = true;
- if (action == RT_ACTION_FIND)
- value = node_leaf_256_get_value(n256, chunk);
- else /* RT_ACTION_DELETE */
- node_leaf_256_delete(n256, chunk);
-
- break;
- }
- }
-
- /* update statistics */
- if (action == RT_ACTION_DELETE && found)
- node->count--;
-
- if (found && value_p)
- *value_p = value;
-
- return found;
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_LEAF
}
/* Insert the child to the inner node */
diff --git a/src/include/lib/radixtree_search_impl.h b/src/include/lib/radixtree_search_impl.h
new file mode 100644
index 0000000000..0173d9cb2f
--- /dev/null
+++ b/src/include/lib/radixtree_search_impl.h
@@ -0,0 +1,151 @@
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE4_TYPE rt_node_inner_4
+#define RT_NODE32_TYPE rt_node_inner_32
+#define RT_NODE125_TYPE rt_node_inner_125
+#define RT_NODE256_TYPE rt_node_inner_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE4_TYPE rt_node_leaf_4
+#define RT_NODE32_TYPE rt_node_leaf_32
+#define RT_NODE125_TYPE rt_node_leaf_125
+#define RT_NODE256_TYPE rt_node_leaf_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool found = false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ uint64 value = 0;
+#else
+ rt_node *child = NULL;
+#endif
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ RT_NODE4_TYPE *n4 = (RT_NODE4_TYPE *) node;
+ int idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
+
+ if (idx < 0)
+ break;
+
+ found = true;
+
+ if (action == RT_ACTION_FIND)
+#ifdef RT_NODE_LEVEL_LEAF
+ value = n4->values[idx];
+#else
+ child = n4->children[idx];
+#endif
+ else /* RT_ACTION_DELETE */
+#ifdef RT_NODE_LEVEL_LEAF
+ chunk_values_array_delete(n4->base.chunks, (uint64 *) n4->values,
+ n4->base.n.count, idx);
+#else
+ chunk_children_array_delete(n4->base.chunks, n4->children,
+ n4->base.n.count, idx);
+#endif
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
+ int idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
+
+ if (idx < 0)
+ break;
+
+ found = true;
+
+ if (action == RT_ACTION_FIND)
+#ifdef RT_NODE_LEVEL_LEAF
+ value = n32->values[idx];
+#else
+ child = n32->children[idx];
+#endif
+ else /* RT_ACTION_DELETE */
+#ifdef RT_NODE_LEVEL_LEAF
+ chunk_values_array_delete(n32->base.chunks, (uint64 *) n32->values,
+ n32->base.n.count, idx);
+#else
+ chunk_children_array_delete(n32->base.chunks, n32->children,
+ n32->base.n.count, idx);
+#endif
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+
+ if (!node_125_is_chunk_used((rt_node_base_125 *) n125, chunk))
+ break;
+
+ found = true;
+
+ if (action == RT_ACTION_FIND)
+#ifdef RT_NODE_LEVEL_LEAF
+ value = node_leaf_125_get_value(n125, chunk);
+#else
+ child = node_inner_125_get_child(n125, chunk);
+#endif
+ else /* RT_ACTION_DELETE */
+#ifdef RT_NODE_LEVEL_LEAF
+ node_leaf_125_delete(n125, chunk);
+#else
+ node_inner_125_delete(n125, chunk);
+#endif
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ if (!node_leaf_256_is_chunk_used(n256, chunk))
+#else
+ if (!node_inner_256_is_chunk_used(n256, chunk))
+#endif
+ break;
+
+ found = true;
+
+ if (action == RT_ACTION_FIND)
+#ifdef RT_NODE_LEVEL_LEAF
+ value = node_leaf_256_get_value(n256, chunk);
+#else
+ child = node_inner_256_get_child(n256, chunk);
+#endif
+ else /* RT_ACTION_DELETE */
+#ifdef RT_NODE_LEVEL_LEAF
+ node_leaf_256_delete(n256, chunk);
+#else
+ node_inner_256_delete(n256, chunk);
+#endif
+
+ break;
+ }
+ }
+
+ if (found)
+ {
+ /* update statistics */
+ if (action == RT_ACTION_DELETE)
+ node->count--;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ if (value_p)
+ *value_p = value;
+#else
+ if (child_p)
+ *child_p = child;
+#endif
+ }
+
+ return found;
+
+#undef RT_NODE4_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
--
2.38.1
Attachment: v16-0012-Separate-find-and-delete-actions-into-separate-f.patch (text/x-patch)
From 3637d74416d565e3ca6faff2ad6b6a25b2c50689 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Fri, 23 Dec 2022 17:41:05 +0700
Subject: [PATCH v16 12/12] Separate find and delete actions into separate
functions
This makes hot paths smaller and less branchy.
---
src/backend/lib/radixtree.c | 73 ++++++++++++++++---------
src/include/lib/radixtree_search_impl.h | 68 ++++++++++++-----------
2 files changed, 83 insertions(+), 58 deletions(-)
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
index 99450c96c8..c934bff693 100644
--- a/src/backend/lib/radixtree.c
+++ b/src/backend/lib/radixtree.c
@@ -96,13 +96,6 @@
#define BM_IDX(x) ((x) / BITS_PER_BITMAPWORD)
#define BM_BIT(x) ((x) % BITS_PER_BITMAPWORD)
-/* Enum used rt_node_search() */
-typedef enum
-{
- RT_ACTION_FIND = 0, /* find the key-value */
- RT_ACTION_DELETE, /* delete the key-value */
-} rt_action;
-
/*
* Supported radix tree node kinds and size classes.
*
@@ -422,10 +415,8 @@ static inline void rt_init_node(rt_node *node, uint8 kind, rt_size_class size_cl
bool inner);
static void rt_free_node(radix_tree *tree, rt_node *node);
static void rt_extend(radix_tree *tree, uint64 key);
-static inline bool rt_node_search_inner(rt_node *node, uint64 key, rt_action action,
- rt_node **child_p);
-static inline bool rt_node_search_leaf(rt_node *node, uint64 key, rt_action action,
- uint64 *value_p);
+static inline bool rt_node_search_inner(rt_node *node, uint64 key, rt_node **child_p);
+static inline bool rt_node_search_leaf(rt_node *node, uint64 key, uint64 *value_p);
static bool rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node,
uint64 key, rt_node *child);
static bool rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
@@ -1100,33 +1091,65 @@ rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node *parent,
}
/*
- * Search for the child pointer corresponding to 'key' in the given node, and
- * do the specified 'action'.
+ * Search for the child pointer corresponding to 'key' in the given node.
*
* Return true if the key is found, otherwise return false. On success, the child
* pointer is set to child_p.
*/
static inline bool
-rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **child_p)
+rt_node_search_inner(rt_node *node, uint64 key, rt_node **child_p)
{
+#define RT_ACTION_FIND
#define RT_NODE_LEVEL_INNER
#include "lib/radixtree_search_impl.h"
#undef RT_NODE_LEVEL_INNER
+#undef RT_ACTION_FIND
}
/*
- * Search for the value corresponding to 'key' in the given node, and do the
- * specified 'action'.
+ * Search for the value corresponding to 'key' in the given node.
*
* Return true if the key is found, otherwise return false. On success, the pointer
* to the value is set to value_p.
*/
static inline bool
-rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p)
+rt_node_search_leaf(rt_node *node, uint64 key, uint64 *value_p)
+{
+#define RT_ACTION_FIND
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+#undef RT_ACTION_FIND
+}
+
+/*
+ * Search for the child pointer corresponding to 'key' in the given node.
+ *
+ * Delete the node and return true if the key is found, otherwise return false.
+ */
+static inline bool
+rt_node_delete_inner(rt_node *node, uint64 key)
+{
+#define RT_ACTION_DELETE
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_INNER
+#undef RT_ACTION_DELETE
+}
+
+/*
+ * Search for the value corresponding to 'key' in the given node.
+ *
+ * Delete the node and return true if the key is found, otherwise return false.
+ */
+static inline bool
+rt_node_delete_leaf(rt_node *node, uint64 key)
{
+#define RT_ACTION_DELETE
#define RT_NODE_LEVEL_LEAF
#include "lib/radixtree_search_impl.h"
#undef RT_NODE_LEVEL_LEAF
+#undef RT_ACTION_DELETE
}
/* Insert the child to the inner node */
@@ -1235,7 +1258,7 @@ rt_set(radix_tree *tree, uint64 key, uint64 value)
if (NODE_IS_LEAF(node))
break;
- if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ if (!rt_node_search_inner(node, key, &child))
{
rt_set_extend(tree, key, value, parent, node);
return false;
@@ -1282,14 +1305,14 @@ rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
if (NODE_IS_LEAF(node))
break;
- if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ if (!rt_node_search_inner(node, key, &child))
return false;
node = child;
shift -= RT_NODE_SPAN;
}
- return rt_node_search_leaf(node, key, RT_ACTION_FIND, value_p);
+ return rt_node_search_leaf(node, key, value_p);
}
/*
@@ -1322,7 +1345,7 @@ rt_delete(radix_tree *tree, uint64 key)
/* Push the current node to the stack */
stack[++level] = node;
- if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ if (!rt_node_search_inner(node, key, &child))
return false;
node = child;
@@ -1331,7 +1354,7 @@ rt_delete(radix_tree *tree, uint64 key)
/* Delete the key from the leaf node if exists */
Assert(NODE_IS_LEAF(node));
- deleted = rt_node_search_leaf(node, key, RT_ACTION_DELETE, NULL);
+ deleted = rt_node_delete_leaf(node, key);
if (!deleted)
{
@@ -1357,7 +1380,7 @@ rt_delete(radix_tree *tree, uint64 key)
{
node = stack[level--];
- deleted = rt_node_search_inner(node, key, RT_ACTION_DELETE, NULL);
+ deleted = rt_node_delete_inner(node, key);
Assert(deleted);
/* If the node didn't become empty, we stop deleting the key */
@@ -1989,12 +2012,12 @@ rt_dump_search(radix_tree *tree, uint64 key)
uint64 dummy;
/* We reached at a leaf node, find the corresponding slot */
- rt_node_search_leaf(node, key, RT_ACTION_FIND, &dummy);
+ rt_node_search_leaf(node, key, &dummy);
break;
}
- if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ if (!rt_node_search_inner(node, key, &child))
break;
node = child;
diff --git a/src/include/lib/radixtree_search_impl.h b/src/include/lib/radixtree_search_impl.h
index 0173d9cb2f..28c02da2bf 100644
--- a/src/include/lib/radixtree_search_impl.h
+++ b/src/include/lib/radixtree_search_impl.h
@@ -10,16 +10,21 @@
#define RT_NODE256_TYPE rt_node_leaf_256
#else
#error node level must be either inner or leaf
+#endif
+
+#if !defined(RT_ACTION_FIND) && !defined(RT_ACTION_DELETE)
+#error search action must be either find or delete
#endif
uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
- bool found = false;
+#if defined(RT_ACTION_FIND)
#ifdef RT_NODE_LEVEL_LEAF
uint64 value = 0;
#else
rt_node *child = NULL;
#endif
+#endif /* RT_ACTION_FIND */
switch (node->kind)
{
@@ -29,17 +34,15 @@
int idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
if (idx < 0)
- break;
-
- found = true;
+ return false;
- if (action == RT_ACTION_FIND)
+#if defined(RT_ACTION_FIND)
#ifdef RT_NODE_LEVEL_LEAF
value = n4->values[idx];
#else
child = n4->children[idx];
#endif
- else /* RT_ACTION_DELETE */
+#elif defined (RT_ACTION_DELETE)
#ifdef RT_NODE_LEVEL_LEAF
chunk_values_array_delete(n4->base.chunks, (uint64 *) n4->values,
n4->base.n.count, idx);
@@ -47,6 +50,8 @@
chunk_children_array_delete(n4->base.chunks, n4->children,
n4->base.n.count, idx);
#endif
+#endif /* RT_ACTION_FIND */
+
break;
}
case RT_NODE_KIND_32:
@@ -55,17 +60,15 @@
int idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
if (idx < 0)
- break;
-
- found = true;
+ return false;
- if (action == RT_ACTION_FIND)
+#if defined(RT_ACTION_FIND)
#ifdef RT_NODE_LEVEL_LEAF
value = n32->values[idx];
#else
child = n32->children[idx];
#endif
- else /* RT_ACTION_DELETE */
+#elif defined (RT_ACTION_DELETE)
#ifdef RT_NODE_LEVEL_LEAF
chunk_values_array_delete(n32->base.chunks, (uint64 *) n32->values,
n32->base.n.count, idx);
@@ -73,6 +76,8 @@
chunk_children_array_delete(n32->base.chunks, n32->children,
n32->base.n.count, idx);
#endif
+#endif /* RT_ACTION_FIND */
+
break;
}
case RT_NODE_KIND_125:
@@ -80,22 +85,22 @@
RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
if (!node_125_is_chunk_used((rt_node_base_125 *) n125, chunk))
- break;
+ return false;
- found = true;
-
- if (action == RT_ACTION_FIND)
+#if defined(RT_ACTION_FIND)
#ifdef RT_NODE_LEVEL_LEAF
value = node_leaf_125_get_value(n125, chunk);
#else
child = node_inner_125_get_child(n125, chunk);
#endif
- else /* RT_ACTION_DELETE */
+#elif defined (RT_ACTION_DELETE)
#ifdef RT_NODE_LEVEL_LEAF
node_leaf_125_delete(n125, chunk);
#else
node_inner_125_delete(n125, chunk);
#endif
+#endif /* RT_ACTION_FIND */
+
break;
}
case RT_NODE_KIND_256:
@@ -107,43 +112,40 @@
#else
if (!node_inner_256_is_chunk_used(n256, chunk))
#endif
- break;
-
- found = true;
+ return false;
- if (action == RT_ACTION_FIND)
+#if defined(RT_ACTION_FIND)
#ifdef RT_NODE_LEVEL_LEAF
value = node_leaf_256_get_value(n256, chunk);
#else
child = node_inner_256_get_child(n256, chunk);
#endif
- else /* RT_ACTION_DELETE */
+#elif defined (RT_ACTION_DELETE)
#ifdef RT_NODE_LEVEL_LEAF
node_leaf_256_delete(n256, chunk);
#else
node_inner_256_delete(n256, chunk);
#endif
+#endif /* RT_ACTION_FIND */
break;
}
}
- if (found)
- {
- /* update statistics */
- if (action == RT_ACTION_DELETE)
- node->count--;
-
+#if defined(RT_ACTION_FIND)
#ifdef RT_NODE_LEVEL_LEAF
- if (value_p)
- *value_p = value;
+ Assert(value_p != NULL);
+ *value_p = value;
#else
- if (child_p)
- *child_p = child;
+ Assert(child_p != NULL);
+ *child_p = child;
#endif
- }
+#elif defined (RT_ACTION_DELETE)
+ /* update statistics */
+ node->count--;
+#endif /* RT_ACTION_FIND */
- return found;
+ return true;
#undef RT_NODE4_TYPE
#undef RT_NODE32_TYPE
--
2.38.1
Attachment: v16-0006-Preparatory-refactoring-to-simplify-templating.patch (text/x-patch)
From 4573327ba7fa1179389af4383c04053251a8bf73 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Sun, 11 Dec 2022 16:38:08 +0700
Subject: [PATCH v16 06/12] Preparatory refactoring to simplify templating
*Remove gotos and shorten const lookups in node_insert_inner()
*Turn condition into an assert
*Don't cast to base -- use membership
---
src/backend/lib/radixtree.c | 87 ++++++++++++++++++-------------------
1 file changed, 42 insertions(+), 45 deletions(-)
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
index abd0450727..7c993e096b 100644
--- a/src/backend/lib/radixtree.c
+++ b/src/backend/lib/radixtree.c
@@ -1294,7 +1294,7 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
int idx;
- idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
+ idx = node_4_search_eq(&n4->base, chunk);
if (idx != -1)
{
/* found the existing chunk */
@@ -1321,7 +1321,7 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
}
else
{
- int insertpos = node_4_get_insertpos((rt_node_base_4 *) n4, chunk);
+ int insertpos = node_4_get_insertpos(&n4->base, chunk);
uint16 count = n4->base.n.count;
/* shift chunks and children */
@@ -1337,10 +1337,12 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
/* FALLTHROUGH */
case RT_NODE_KIND_32:
{
+ const rt_size_class_elem minclass = rt_size_class_info[RT_CLASS_32_PARTIAL];
+ const rt_size_class_elem maxclass = rt_size_class_info[RT_CLASS_32_FULL];
rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
int idx;
- idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
+ idx = node_32_search_eq(&n32->base, chunk);
if (idx != -1)
{
/* found the existing chunk */
@@ -1349,58 +1351,53 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
break;
}
- if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)))
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)) &&
+ n32->base.n.count == minclass.fanout)
{
- Assert(parent != NULL);
+ /* use the same node kind, but expand to the next size class */
+ rt_node_inner_32 *new32;
- if (n32->base.n.count == rt_size_class_info[RT_CLASS_32_PARTIAL].fanout)
- {
- /* use the same node kind, but expand to the next size class */
- const Size size = rt_size_class_info[RT_CLASS_32_PARTIAL].inner_size;
- const int fanout = rt_size_class_info[RT_CLASS_32_FULL].fanout;
- rt_node_inner_32 *new32;
+ new32 = (rt_node_inner_32 *) rt_alloc_node(tree, RT_CLASS_32_FULL, true);
+ memcpy(new32, n32, minclass.inner_size);
+ new32->base.n.fanout = maxclass.fanout;
- new32 = (rt_node_inner_32 *) rt_alloc_node(tree, RT_CLASS_32_FULL, true);
- memcpy(new32, n32, size);
- new32->base.n.fanout = fanout;
-
- rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new32, key);
+ Assert(parent != NULL);
+ rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new32, key);
- /* must update both pointers here */
- node = (rt_node *) new32;
- n32 = new32;
+ /* must update both pointers here */
+ node = (rt_node *) new32;
+ n32 = new32;
+ }
- goto retry_insert_inner_32;
- }
- else
- {
- rt_node_inner_125 *new125;
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)))
+ {
+ rt_node_inner_125 *new125;
- /* grow node from 32 to 125 */
- new125 = (rt_node_inner_125 *) rt_grow_node_kind(tree, (rt_node *) n32,
- RT_NODE_KIND_125);
- for (int i = 0; i < n32->base.n.count; i++)
- node_inner_125_insert(new125, n32->base.chunks[i], n32->children[i]);
+ /* grow node from 32 to 125 */
+ new125 = (rt_node_inner_125 *) rt_grow_node_kind(tree, (rt_node *) n32,
+ RT_NODE_KIND_125);
+ for (int i = 0; i < n32->base.n.count; i++)
+ node_inner_125_insert(new125, n32->base.chunks[i], n32->children[i]);
- rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new125, key);
- node = (rt_node *) new125;
- }
+ Assert(parent != NULL);
+ rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new125, key);
+ node = (rt_node *) new125;
}
else
{
-retry_insert_inner_32:
- {
- int insertpos = node_32_get_insertpos((rt_node_base_32 *) n32, chunk);
- int16 count = n32->base.n.count;
+ int insertpos = node_32_get_insertpos(&n32->base, chunk);
+ int16 count = n32->base.n.count;
- if (count != 0 && insertpos < count)
- chunk_children_array_shift(n32->base.chunks, n32->children,
- count, insertpos);
-
- n32->base.chunks[insertpos] = chunk;
- n32->children[insertpos] = child;
- break;
+ if (insertpos < count)
+ {
+ Assert(count > 0);
+ chunk_children_array_shift(n32->base.chunks, n32->children,
+ count, insertpos);
}
+
+ n32->base.chunks[insertpos] = chunk;
+ n32->children[insertpos] = child;
+ break;
}
}
/* FALLTHROUGH */
@@ -1409,7 +1406,7 @@ retry_insert_inner_32:
rt_node_inner_125 *n125 = (rt_node_inner_125 *) node;
int cnt = 0;
- if (node_125_is_chunk_used((rt_node_base_125 *) n125, chunk))
+ if (node_125_is_chunk_used(&n125->base, chunk))
{
/* found the existing chunk */
chunk_exists = true;
@@ -1427,7 +1424,7 @@ retry_insert_inner_32:
RT_NODE_KIND_256);
for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n125->base.n.count; i++)
{
- if (!node_125_is_chunk_used((rt_node_base_125 *) n125, i))
+ if (!node_125_is_chunk_used(&n125->base, i))
continue;
node_inner_256_set(new256, i, node_inner_125_get_child(n125, i));
--
2.38.1
On Fri, Dec 23, 2022 at 8:47 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
I wrote:
- Try templating out the differences between local and shared memory.
Here is a brief progress report before Christmas vacation.
Thanks!
I thought the best way to approach this was to go "inside out", that is, start with the modest goal of reducing duplicated code for v16.
0001-0005 are copies from v13.
0006 whacks around the rt_node_insert_inner function to reduce the "surface area" as far as symbols and casts. This includes replacing the goto with an extra "unlikely" branch.
0007 removes the STRICT pragma for one of our benchmark functions that crept in somewhere -- it should use the default and not just return NULL instantly.
0008 further whacks around the node-growing code in rt_node_insert_inner to remove casts. When growing the size class within the same kind, we have no need for a "new32" (etc) variable. Also, to keep from getting confused about what an assert build verifies at the end, add a "newnode" variable and assign it to "node" as soon as possible.
0009 uses the bitmap logic from 0004 for node256 also. There is no performance reason for this, because there is no iteration needed, but it's good for simplicity and consistency.
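For concreteness, with that change the node256 membership test is just the
word-and-bit arithmetic of the BM_IDX/BM_BIT macros. A minimal standalone
sketch, with a hypothetical helper name and a plain 64-bit word standing in
for bitmapword:

    #include <stdbool.h>
    #include <stdint.h>

    /* is the bit for 'chunk' (0..255) set in a 256-bit isset bitmap? */
    static inline bool
    node_256_chunk_is_set(const uint64_t isset[4], uint8_t chunk)
    {
        return (isset[chunk / 64] & ((uint64_t) 1 << (chunk % 64))) != 0;
    }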
These 4 patches make sense to me. We can merge them into 0002 patch
and I'll do similar changes for functions for leaf nodes as well.
0010 and 0011 template a common implementation for both leaf and inner nodes for searching and inserting.
0012: While at it, I couldn't resist using this technique to separate out delete from search, which makes sense and might give a small performance boost (at least on less capable hardware). I haven't got to the iteration functions, but they should be straightforward.
Cool!
There is more that could be done here, but I didn't want to get too far ahead of myself. For example, it's possible that the struct members "children" and "values" are names that don't need to be distinguished. Making them the same would reduce code like

+#ifdef RT_NODE_LEVEL_LEAF
+			n32->values[insertpos] = value;
+#else
+			n32->children[insertpos] = child;
+#endif

...but there could be downsides, and I don't want to distract from the goal of dealing with shared memory.
With these patches, some functions in radixtree.c include the header
files, radixtree_xxx_impl.h, that contain the function bodies. What do you
think about how we can expand this template method to deal with DSA
memory? I imagined that we include, say, radixtree_template.h with some
macros set to use the radix tree, like we do for simplehash.h, and
radixtree_template.h further includes the xxx_impl.h files for some internal
functions.
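For example, the caller side could then look roughly like a simplehash.h
invocation does today. This is only a sketch: the RT_* macro names and
radixtree_template.h itself don't exist yet and are assumptions:

    /* hypothetical template invocation, modeled on simplehash.h */
    #define RT_PREFIX shared_rt
    #define RT_SCOPE static inline
    #define RT_VALUE_TYPE uint64
    #define RT_SHMEM            /* allocate nodes in a DSA instead of a local context */
    #define RT_DECLARE
    #define RT_DEFINE
    #include "lib/radixtree_template.h"

This would emit shared_rt_create(), shared_rt_set(), shared_rt_search() and
so on, and a second invocation without RT_SHMEM would emit the local-memory
variants.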
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Tue, Dec 27, 2022 at 12:14 AM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
On Fri, Dec 23, 2022 at 8:47 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
These 4 patches make sense to me. We can merge them into 0002 patch
Okay, then I'll squash them when I post my next patch.
and I'll do similar changes for functions for leaf nodes as well.
I assume you meant something else? -- some of the differences between inner
and leaf are already abstracted away.
In any case, some things are still half-baked, so please wait until my next
patch before doing work on these files.
Also, CI found a bug on 32-bit -- I know what I missed and will fix next
week.
0010 and 0011 template a common implementation for both leaf and inner
nodes for searching and inserting.
0012: While at it, I couldn't resist using this technique to separate
out delete from search, which makes sense and might give a small
performance boost (at least on less capable hardware). I haven't got to the
iteration functions, but they should be straightforward.
Two things came to mind since I posted this, which I'll make clear next
patch:
- A good compiler will get rid of branches when inlining, so maybe no
difference in code generation, but it still looks nicer this way.
- Delete should really use its own template, because it only _accidentally_
looks like search because we don't yet shrink nodes.
What do you
think about how we can expand this template method to deal with DSA
memory? I imagined that we load say radixtree_template.h with some
macros to use the radix tree like we do for simplehash.h. And
radixtree_template.h further loads xxx_impl.h files for some internal
functions.
Right, I was thinking the same. I wanted to start small and look for
opportunities to shrink the code footprint.
--
John Naylor
EDB: http://www.enterprisedb.com
On Tue, Dec 27, 2022 at 2:24 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Tue, Dec 27, 2022 at 12:14 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Fri, Dec 23, 2022 at 8:47 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
These 4 patches make sense to me. We can merge them into 0002 patch
Okay, then I'll squash them when I post my next patch.
and I'll do similar changes for functions for leaf nodes as well.
I assume you meant something else? -- some of the differences between inner and leaf are already abstracted away.
Right. If we template these routines I don't need that.
In any case, some things are still half-baked, so please wait until my next patch before doing work on these files.
Also, CI found a bug on 32-bit -- I know what I missed and will fix next week.
Thanks!
0010 and 0011 template a common implementation for both leaf and inner nodes for searching and inserting.
0012: While at it, I couldn't resist using this technique to separate out delete from search, which makes sense and might give a small performance boost (at least on less capable hardware). I haven't got to the iteration functions, but they should be straightforward.
Two things came to mind since I posted this, which I'll make clear next patch:
- A good compiler will get rid of branches when inlining, so maybe no difference in code generation, but it still looks nicer this way.
- Delete should really use its own template, because it only _accidentally_ looks like search because we don't yet shrink nodes.
Okay.
What do you
think about how we can expand this template method to deal with DSA
memory? I imagined that we load say radixtree_template.h with some
macros to use the radix tree like we do for simplehash.h. And
radixtree_template.h further loads xxx_impl.h files for some internal
functions.
Right, I was thinking the same. I wanted to start small and look for opportunities to shrink the code footprint.
Thank you for your confirmation!
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
[working on templating]
In the end, I decided to base my effort on v8, and not v12 (based on one of
my less-well-thought-out ideas). The latter was a good experiment, but it
did not lead to an increase in readability as I had hoped. The attached v17
is still rough, but it's in good enough shape to evaluate a mostly-complete
templating implementation.
Part of what I didn't like about v8 was distinctions like "node" vs
"nodep", which hinder readability. I've used "allocnode" for some cases
where it makes sense, which is translated to "newnode" for the local
pointer. Some places I just gave up and used "nodep" for parameters like in
v8, just to get it done. We can revisit naming later.
Not done yet:
- get_handle() is not implemented
- rt_attach is defined but unused
- grow_node_kind() was hackishly removed, but could be turned into a macro
(or a function that writes to 2 pointers; see the sketch after this list)
- node_update_inner() is back, now that we can share a template with
"search". Seems easier to read, and I suspect this is easier for the
compiler.
- the value type should really be a template macro, but is still hard-coded
to uint64
- I think it's okay if the key is hard coded for PG16: If some use case
needs more than uint64, we could consider "single-value leaves" with varlen
keys as a template option.
- benchmark tests not updated
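As a sketch of the grow_node_kind() idea mentioned above (the macro name and
arguments are assumptions, not what v17 actually does; rt_grow_node_kind()
stands in for whatever allocate-and-copy routine the real code uses):

    /*
     * Grow 'node' into 'new_kind', writing both the generic pointer and the
     * kind-specific pointer in one step so callers don't need casts.
     */
    #define RT_GROW_NODE_KIND(tree, node, new_kind, newnode_p, typed_p, typed_type) \
        do { \
            *(newnode_p) = rt_grow_node_kind((tree), (node), (new_kind)); \
            *(typed_p) = (typed_type *) *(newnode_p); \
        } while (0)

    /* hypothetical call site inside the insert template:
     * RT_GROW_NODE_KIND(tree, node, RT_NODE_KIND_32, &newnode, &new32, RT_NODE32_TYPE);
     */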
v13-0007 had some changes to the regression tests, but I haven't included
those. The tests from v13-0003 do pass, both locally and shared. I quickly
hacked together switching between the shared and local tests by hand (which
needs a recompile), but it would be good for maintainability if the tests
could run once each with local and shared memory while using the same
"expected" test output.
Also, I didn't look to see if there were any changes in v14/15 that didn't
have to do with precise memory accounting.
At this point, Masahiko, I'd appreciate your feedback on whether this is an
improvement at all (or at least a good base for improvement), especially
for integrating with the TID store. I think there are some advantages to
the template approach. One possible disadvantage is needing separate
functions for local memory and for shared memory.
If we go this route, I do think the TID store should invoke the template as
static functions. I'm not quite comfortable with a global function that may
not fit well with future use cases.
One review point I'll mention: Somehow I didn't notice there is no use for
the "chunk" field in the rt_node type -- it's only set to zero and copied
when growing. What is the purpose? Removing it would allow the
smallest node to take up only 32 bytes with a fanout of 3, by eliminating
padding.
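To illustrate the arithmetic with a flattened layout (field names, order,
and sizes here are assumptions for the sketch, not the actual radixtree.c
definitions):

    #include <stdint.h>

    typedef struct sketch_node_3
    {
        /* header fields, 5 bytes once 'chunk' is gone */
        uint16_t    count;
        uint8_t     shift;
        uint8_t     kind;
        uint8_t     fanout;
        /* three chunk bytes fill out the first 8-byte word */
        uint8_t     chunks[3];
        /* the pointer array starts on an 8-byte boundary with no padding */
        void       *children[3];
    } sketch_node_3;

    /* 8 + 24 = 32 bytes where pointers are 8 bytes */
    #if UINTPTR_MAX == 0xffffffffffffffff
    _Static_assert(sizeof(sketch_node_3) == 32, "fanout-3 node fits in 32 bytes");
    #endif

With the extra 'chunk' byte in the header, chunks[3] would end at offset 9,
the pointer array would be padded out to offset 16, and the same node would
be 40 bytes.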
Also, v17-0005 has an optimization/simplification for growing into node125
(my version needs an assertion or fallback, but works well now), found by
another reading of Andres' prototype. There is a lot of good engineering
there; we should try to preserve it.
--
John Naylor
EDB: http://www.enterprisedb.com
Attachments:
Attachment: v17-0005-Template-out-inner-and-leaf-nodes.patch (text/x-patch)
From b1de5cbacf06dd975cc2138a498c5d9897e14df7 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Fri, 23 Dec 2022 14:33:49 +0700
Subject: [PATCH v17 5/9] Template out inner and leaf nodes
Use a template for each insert, iteration, search, and
delete functions.
To optimize growing into node125, don't search for a
slot each time -- just copy into the first 32
slots and set the slot index at the same time.
Also set all the isset bits with a single store.
Remove node_*_125_update/insert/delete functions and
node_125_find_unused_slot, since they are now unused.
---
src/backend/lib/radixtree.c | 863 ++----------------------
src/include/lib/radixtree_delete_impl.h | 100 +++
src/include/lib/radixtree_insert_impl.h | 293 ++++++++
src/include/lib/radixtree_iter_impl.h | 129 ++++
src/include/lib/radixtree_search_impl.h | 102 +++
src/tools/pginclude/cpluspluscheck | 6 +
src/tools/pginclude/headerscheck | 6 +
7 files changed, 694 insertions(+), 805 deletions(-)
create mode 100644 src/include/lib/radixtree_delete_impl.h
create mode 100644 src/include/lib/radixtree_insert_impl.h
create mode 100644 src/include/lib/radixtree_iter_impl.h
create mode 100644 src/include/lib/radixtree_search_impl.h
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
index 5203127f76..80cde09aaf 100644
--- a/src/backend/lib/radixtree.c
+++ b/src/backend/lib/radixtree.c
@@ -96,13 +96,6 @@
#define BM_IDX(x) ((x) / BITS_PER_BITMAPWORD)
#define BM_BIT(x) ((x) % BITS_PER_BITMAPWORD)
-/* Enum used rt_node_search() */
-typedef enum
-{
- RT_ACTION_FIND = 0, /* find the key-value */
- RT_ACTION_DELETE, /* delete the key-value */
-} rt_action;
-
/*
* Supported radix tree node kinds and size classes.
*
@@ -422,10 +415,8 @@ static inline void rt_init_node(rt_node *node, uint8 kind, rt_size_class size_cl
bool inner);
static void rt_free_node(radix_tree *tree, rt_node *node);
static void rt_extend(radix_tree *tree, uint64 key);
-static inline bool rt_node_search_inner(rt_node *node, uint64 key, rt_action action,
- rt_node **child_p);
-static inline bool rt_node_search_leaf(rt_node *node, uint64 key, rt_action action,
- uint64 *value_p);
+static inline bool rt_node_search_inner(rt_node *node, uint64 key, rt_node **child_p);
+static inline bool rt_node_search_leaf(rt_node *node, uint64 key, uint64 *value_p);
static bool rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node,
uint64 key, rt_node *child);
static bool rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
@@ -663,102 +654,6 @@ node_leaf_125_get_value(rt_node_leaf_125 *node, uint8 chunk)
return node->values[node->base.slot_idxs[chunk]];
}
-static void
-node_inner_125_delete(rt_node_inner_125 *node, uint8 chunk)
-{
- int slotpos = node->base.slot_idxs[chunk];
- int idx = BM_IDX(slotpos);
- int bitnum = BM_BIT(slotpos);
-
- Assert(!NODE_IS_LEAF(node));
-
- node->base.isset[idx] &= ~((bitmapword) 1 << bitnum);
- node->children[node->base.slot_idxs[chunk]] = NULL;
- node->base.slot_idxs[chunk] = RT_NODE_125_INVALID_IDX;
-}
-
-static void
-node_leaf_125_delete(rt_node_leaf_125 *node, uint8 chunk)
-{
- int slotpos = node->base.slot_idxs[chunk];
- int idx = BM_IDX(slotpos);
- int bitnum = BM_BIT(slotpos);
-
- Assert(NODE_IS_LEAF(node));
- node->base.isset[idx] &= ~((bitmapword) 1 << bitnum);
- node->base.slot_idxs[chunk] = RT_NODE_125_INVALID_IDX;
-}
-
-/* Return an unused slot in node-125 */
-static int
-node_125_find_unused_slot(bitmapword *isset)
-{
- int slotpos;
- int idx;
- bitmapword inverse;
-
- /* get the first word with at least one bit not set */
- for (idx = 0; idx < BM_IDX(128); idx++)
- {
- if (isset[idx] < ~((bitmapword) 0))
- break;
- }
-
- /* To get the first unset bit in X, get the first set bit in ~X */
- inverse = ~(isset[idx]);
- slotpos = idx * BITS_PER_BITMAPWORD;
- slotpos += bmw_rightmost_one_pos(inverse);
-
- /* mark the slot used */
- isset[idx] |= bmw_rightmost_one(inverse);
-
- return slotpos;
- }
-
-static inline void
-node_inner_125_insert(rt_node_inner_125 *node, uint8 chunk, rt_node *child)
-{
- int slotpos;
-
- Assert(!NODE_IS_LEAF(node));
-
- slotpos = node_125_find_unused_slot(node->base.isset);
- Assert(slotpos < node->base.n.fanout);
-
- node->base.slot_idxs[chunk] = slotpos;
- node->children[slotpos] = child;
-}
-
-/* Set the slot at the corresponding chunk */
-static inline void
-node_leaf_125_insert(rt_node_leaf_125 *node, uint8 chunk, uint64 value)
-{
- int slotpos;
-
- Assert(NODE_IS_LEAF(node));
-
- slotpos = node_125_find_unused_slot(node->base.isset);
- Assert(slotpos < node->base.n.fanout);
-
- node->base.slot_idxs[chunk] = slotpos;
- node->values[slotpos] = value;
-}
-
-/* Update the child corresponding to 'chunk' to 'child' */
-static inline void
-node_inner_125_update(rt_node_inner_125 *node, uint8 chunk, rt_node *child)
-{
- Assert(!NODE_IS_LEAF(node));
- node->children[node->base.slot_idxs[chunk]] = child;
-}
-
-static inline void
-node_leaf_125_update(rt_node_leaf_125 *node, uint8 chunk, uint64 value)
-{
- Assert(NODE_IS_LEAF(node));
- node->values[node->base.slot_idxs[chunk]] = value;
-}
-
/* Functions to manipulate inner and leaf node-256 */
/* Return true if the slot corresponding to the given chunk is in use */
@@ -1075,189 +970,57 @@ rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node *parent,
}
/*
- * Search for the child pointer corresponding to 'key' in the given node, and
- * do the specified 'action'.
+ * Search for the child pointer corresponding to 'key' in the given node.
*
* Return true if the key is found, otherwise return false. On success, the child
* pointer is set to child_p.
*/
static inline bool
-rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **child_p)
+rt_node_search_inner(rt_node *node, uint64 key, rt_node **child_p)
{
- uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
- bool found = false;
- rt_node *child = NULL;
-
- switch (node->kind)
- {
- case RT_NODE_KIND_4:
- {
- rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
- int idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
-
- if (idx < 0)
- break;
-
- found = true;
-
- if (action == RT_ACTION_FIND)
- child = n4->children[idx];
- else /* RT_ACTION_DELETE */
- chunk_children_array_delete(n4->base.chunks, n4->children,
- n4->base.n.count, idx);
-
- break;
- }
- case RT_NODE_KIND_32:
- {
- rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
- int idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
-
- if (idx < 0)
- break;
-
- found = true;
- if (action == RT_ACTION_FIND)
- child = n32->children[idx];
- else /* RT_ACTION_DELETE */
- chunk_children_array_delete(n32->base.chunks, n32->children,
- n32->base.n.count, idx);
- break;
- }
- case RT_NODE_KIND_125:
- {
- rt_node_inner_125 *n125 = (rt_node_inner_125 *) node;
-
- if (!node_125_is_chunk_used((rt_node_base_125 *) n125, chunk))
- break;
-
- found = true;
-
- if (action == RT_ACTION_FIND)
- child = node_inner_125_get_child(n125, chunk);
- else /* RT_ACTION_DELETE */
- node_inner_125_delete(n125, chunk);
-
- break;
- }
- case RT_NODE_KIND_256:
- {
- rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
-
- if (!node_inner_256_is_chunk_used(n256, chunk))
- break;
-
- found = true;
- if (action == RT_ACTION_FIND)
- child = node_inner_256_get_child(n256, chunk);
- else /* RT_ACTION_DELETE */
- node_inner_256_delete(n256, chunk);
-
- break;
- }
- }
-
- /* update statistics */
- if (action == RT_ACTION_DELETE && found)
- node->count--;
-
- if (found && child_p)
- *child_p = child;
-
- return found;
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_INNER
}
/*
- * Search for the value corresponding to 'key' in the given node, and do the
- * specified 'action'.
+ * Search for the value corresponding to 'key' in the given node.
*
* Return true if the key is found, otherwise return false. On success, the pointer
* to the value is set to value_p.
*/
static inline bool
-rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p)
+rt_node_search_leaf(rt_node *node, uint64 key, uint64 *value_p)
{
- uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
- bool found = false;
- uint64 value = 0;
-
- switch (node->kind)
- {
- case RT_NODE_KIND_4:
- {
- rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
- int idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
-
- if (idx < 0)
- break;
-
- found = true;
-
- if (action == RT_ACTION_FIND)
- value = n4->values[idx];
- else /* RT_ACTION_DELETE */
- chunk_values_array_delete(n4->base.chunks, (uint64 *) n4->values,
- n4->base.n.count, idx);
-
- break;
- }
- case RT_NODE_KIND_32:
- {
- rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
- int idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
-
- if (idx < 0)
- break;
-
- found = true;
- if (action == RT_ACTION_FIND)
- value = n32->values[idx];
- else /* RT_ACTION_DELETE */
- chunk_values_array_delete(n32->base.chunks, (uint64 *) n32->values,
- n32->base.n.count, idx);
- break;
- }
- case RT_NODE_KIND_125:
- {
- rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) node;
-
- if (!node_125_is_chunk_used((rt_node_base_125 *) n125, chunk))
- break;
-
- found = true;
-
- if (action == RT_ACTION_FIND)
- value = node_leaf_125_get_value(n125, chunk);
- else /* RT_ACTION_DELETE */
- node_leaf_125_delete(n125, chunk);
-
- break;
- }
- case RT_NODE_KIND_256:
- {
- rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
-
- if (!node_leaf_256_is_chunk_used(n256, chunk))
- break;
-
- found = true;
- if (action == RT_ACTION_FIND)
- value = node_leaf_256_get_value(n256, chunk);
- else /* RT_ACTION_DELETE */
- node_leaf_256_delete(n256, chunk);
-
- break;
- }
- }
-
- /* update statistics */
- if (action == RT_ACTION_DELETE && found)
- node->count--;
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
- if (found && value_p)
- *value_p = value;
+/*
+ * Search for the child pointer corresponding to 'key' in the given inner node.
+ *
+ * Delete the child and return true if the key is found, otherwise return false.
+ */
+static inline bool
+rt_node_delete_inner(rt_node *node, uint64 key)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_delete_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
- return found;
+/*
+ * Search for the value corresponding to 'key' in the given leaf node.
+ *
+ * Delete the value and return true if the key is found, otherwise return false.
+ */
+static inline bool
+rt_node_delete_leaf(rt_node *node, uint64 key)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_delete_impl.h"
+#undef RT_NODE_LEVEL_LEAF
}
/* Insert the child to the inner node */
@@ -1265,185 +1028,9 @@ static bool
rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 key,
rt_node *child)
{
- uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
- bool chunk_exists = false;
- rt_node *newnode = NULL;
-
- Assert(!NODE_IS_LEAF(node));
-
- switch (node->kind)
- {
- case RT_NODE_KIND_4:
- {
- rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
- int idx;
-
- idx = node_4_search_eq(&n4->base, chunk);
- if (idx != -1)
- {
- /* found the existing chunk */
- chunk_exists = true;
- n4->children[idx] = child;
- break;
- }
-
- if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n4)))
- {
- rt_node_inner_32 *new32;
-
- /* grow node from 4 to 32 */
- newnode = rt_grow_node_kind(tree, node, RT_NODE_KIND_32);
- new32 = (rt_node_inner_32 *) newnode;
- chunk_children_array_copy(n4->base.chunks, n4->children,
- new32->base.chunks, new32->children);
-
- Assert(parent != NULL);
- rt_replace_node(tree, parent, node, newnode, key);
- node = newnode;
- }
- else
- {
- int insertpos = node_4_get_insertpos(&n4->base, chunk);
- uint16 count = n4->base.n.count;
-
- /* shift chunks and children */
- if (count != 0 && insertpos < count)
- chunk_children_array_shift(n4->base.chunks, n4->children,
- count, insertpos);
-
- n4->base.chunks[insertpos] = chunk;
- n4->children[insertpos] = child;
- break;
- }
- }
- /* FALLTHROUGH */
- case RT_NODE_KIND_32:
- {
- const rt_size_class_elem minclass = rt_size_class_info[RT_CLASS_32_PARTIAL];
- const rt_size_class_elem maxclass = rt_size_class_info[RT_CLASS_32_FULL];
- rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
- int idx;
-
- idx = node_32_search_eq(&n32->base, chunk);
- if (idx != -1)
- {
- /* found the existing chunk */
- chunk_exists = true;
- n32->children[idx] = child;
- break;
- }
-
- if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)) &&
- n32->base.n.count == minclass.fanout)
- {
- /* grow to the next size class of this kind */
- newnode = rt_alloc_node(tree, RT_CLASS_32_FULL, true);
- memcpy(newnode, node, minclass.inner_size);
- newnode->fanout = maxclass.fanout;
-
- Assert(parent != NULL);
- rt_replace_node(tree, parent, node, newnode, key);
- node = newnode;
-
- /* also update pointer for this kind */
- n32 = (rt_node_inner_32 *) newnode;
- }
-
- if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)))
- {
- rt_node_inner_125 *new125;
-
- /* grow node from 32 to 125 */
- newnode = rt_grow_node_kind(tree, node, RT_NODE_KIND_125);
- new125 = (rt_node_inner_125 *) newnode;
- for (int i = 0; i < n32->base.n.count; i++)
- node_inner_125_insert(new125, n32->base.chunks[i], n32->children[i]);
-
- Assert(parent != NULL);
- rt_replace_node(tree, parent, node, newnode, key);
- node = newnode;
- }
- else
- {
- int insertpos = node_32_get_insertpos(&n32->base, chunk);
- int16 count = n32->base.n.count;
-
- if (insertpos < count)
- {
- Assert(count > 0);
- chunk_children_array_shift(n32->base.chunks, n32->children,
- count, insertpos);
- }
-
- n32->base.chunks[insertpos] = chunk;
- n32->children[insertpos] = child;
- break;
- }
- }
- /* FALLTHROUGH */
- case RT_NODE_KIND_125:
- {
- rt_node_inner_125 *n125 = (rt_node_inner_125 *) node;
- int cnt = 0;
-
- if (node_125_is_chunk_used(&n125->base, chunk))
- {
- /* found the existing chunk */
- chunk_exists = true;
- node_inner_125_update(n125, chunk, child);
- break;
- }
-
- if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n125)))
- {
- rt_node_inner_256 *new256;
- Assert(parent != NULL);
-
- /* grow node from 125 to 256 */
- newnode = rt_grow_node_kind(tree, node, RT_NODE_KIND_256);
- new256 = (rt_node_inner_256 *) newnode;
- for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n125->base.n.count; i++)
- {
- if (!node_125_is_chunk_used(&n125->base, i))
- continue;
-
- node_inner_256_set(new256, i, node_inner_125_get_child(n125, i));
- cnt++;
- }
-
- rt_replace_node(tree, parent, node, newnode, key);
- node = newnode;
- }
- else
- {
- node_inner_125_insert(n125, chunk, child);
- break;
- }
- }
- /* FALLTHROUGH */
- case RT_NODE_KIND_256:
- {
- rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
-
- chunk_exists = node_inner_256_is_chunk_used(n256, chunk);
- Assert(chunk_exists || FIXED_NODE_HAS_FREE_SLOT(n256, RT_CLASS_256));
-
- node_inner_256_set(n256, chunk, child);
- break;
- }
- }
-
- /* Update statistics */
- if (!chunk_exists)
- node->count++;
-
- /*
- * Done. Finally, verify the chunk and value is inserted or replaced
- * properly in the node.
- */
- rt_verify_node(node);
-
- return chunk_exists;
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_insert_impl.h"
+#undef RT_NODE_LEVEL_INNER
}
/* Insert the value to the leaf node */
@@ -1451,190 +1038,9 @@ static bool
rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
uint64 key, uint64 value)
{
- uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
- bool chunk_exists = false;
-
- Assert(NODE_IS_LEAF(node));
-
- switch (node->kind)
- {
- case RT_NODE_KIND_4:
- {
- rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
- int idx;
-
- idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
- if (idx != -1)
- {
- /* found the existing chunk */
- chunk_exists = true;
- n4->values[idx] = value;
- break;
- }
-
- if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n4)))
- {
- rt_node_leaf_32 *new32;
- Assert(parent != NULL);
-
- /* grow node from 4 to 32 */
- new32 = (rt_node_leaf_32 *) rt_grow_node_kind(tree, (rt_node *) n4,
- RT_NODE_KIND_32);
- chunk_values_array_copy(n4->base.chunks, n4->values,
- new32->base.chunks, new32->values);
- rt_replace_node(tree, parent, (rt_node *) n4, (rt_node *) new32, key);
- node = (rt_node *) new32;
- }
- else
- {
- int insertpos = node_4_get_insertpos((rt_node_base_4 *) n4, chunk);
- int count = n4->base.n.count;
-
- /* shift chunks and values */
- if (count != 0 && insertpos < count)
- chunk_values_array_shift(n4->base.chunks, n4->values,
- count, insertpos);
-
- n4->base.chunks[insertpos] = chunk;
- n4->values[insertpos] = value;
- break;
- }
- }
- /* FALLTHROUGH */
- case RT_NODE_KIND_32:
- {
- rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
- int idx;
-
- idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
- if (idx != -1)
- {
- /* found the existing chunk */
- chunk_exists = true;
- n32->values[idx] = value;
- break;
- }
-
- if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)))
- {
- Assert(parent != NULL);
-
- if (n32->base.n.count == rt_size_class_info[RT_CLASS_32_PARTIAL].fanout)
- {
- /* use the same node kind, but expand to the next size class */
- const Size size = rt_size_class_info[RT_CLASS_32_PARTIAL].leaf_size;
- const int fanout = rt_size_class_info[RT_CLASS_32_FULL].fanout;
- rt_node_leaf_32 *new32;
-
- new32 = (rt_node_leaf_32 *) rt_alloc_node(tree, RT_CLASS_32_FULL, false);
- memcpy(new32, n32, size);
- new32->base.n.fanout = fanout;
-
- rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new32, key);
-
- /* must update both pointers here */
- node = (rt_node *) new32;
- n32 = new32;
-
- goto retry_insert_leaf_32;
- }
- else
- {
- rt_node_leaf_125 *new125;
-
- /* grow node from 32 to 125 */
- new125 = (rt_node_leaf_125 *) rt_grow_node_kind(tree, (rt_node *) n32,
- RT_NODE_KIND_125);
- for (int i = 0; i < n32->base.n.count; i++)
- node_leaf_125_insert(new125, n32->base.chunks[i], n32->values[i]);
-
- rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new125,
- key);
- node = (rt_node *) new125;
- }
- }
- else
- {
- retry_insert_leaf_32:
- {
- int insertpos = node_32_get_insertpos((rt_node_base_32 *) n32, chunk);
- int count = n32->base.n.count;
-
- if (count != 0 && insertpos < count)
- chunk_values_array_shift(n32->base.chunks, n32->values,
- count, insertpos);
-
- n32->base.chunks[insertpos] = chunk;
- n32->values[insertpos] = value;
- break;
- }
- }
- }
- /* FALLTHROUGH */
- case RT_NODE_KIND_125:
- {
- rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) node;
- int cnt = 0;
-
- if (node_125_is_chunk_used((rt_node_base_125 *) n125, chunk))
- {
- /* found the existing chunk */
- chunk_exists = true;
- node_leaf_125_update(n125, chunk, value);
- break;
- }
-
- if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n125)))
- {
- rt_node_leaf_256 *new256;
- Assert(parent != NULL);
-
- /* grow node from 125 to 256 */
- new256 = (rt_node_leaf_256 *) rt_grow_node_kind(tree, (rt_node *) n125,
- RT_NODE_KIND_256);
- for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n125->base.n.count; i++)
- {
- if (!node_125_is_chunk_used((rt_node_base_125 *) n125, i))
- continue;
-
- node_leaf_256_set(new256, i, node_leaf_125_get_value(n125, i));
- cnt++;
- }
-
- rt_replace_node(tree, parent, (rt_node *) n125, (rt_node *) new256,
- key);
- node = (rt_node *) new256;
- }
- else
- {
- node_leaf_125_insert(n125, chunk, value);
- break;
- }
- }
- /* FALLTHROUGH */
- case RT_NODE_KIND_256:
- {
- rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
-
- chunk_exists = node_leaf_256_is_chunk_used(n256, chunk);
- Assert(chunk_exists || FIXED_NODE_HAS_FREE_SLOT(n256, RT_CLASS_256));
-
- node_leaf_256_set(n256, chunk, value);
- break;
- }
- }
-
- /* Update statistics */
- if (!chunk_exists)
- node->count++;
-
- /*
- * Done. Finally, verify the chunk and value is inserted or replaced
- * properly in the node.
- */
- rt_verify_node(node);
-
- return chunk_exists;
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_insert_impl.h"
+#undef RT_NODE_LEVEL_LEAF
}
/*
@@ -1723,7 +1129,7 @@ rt_set(radix_tree *tree, uint64 key, uint64 value)
if (NODE_IS_LEAF(node))
break;
- if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ if (!rt_node_search_inner(node, key, &child))
{
rt_set_extend(tree, key, value, parent, node);
return false;
@@ -1770,14 +1176,14 @@ rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
if (NODE_IS_LEAF(node))
break;
- if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ if (!rt_node_search_inner(node, key, &child))
return false;
node = child;
shift -= RT_NODE_SPAN;
}
- return rt_node_search_leaf(node, key, RT_ACTION_FIND, value_p);
+ return rt_node_search_leaf(node, key, value_p);
}
/*
@@ -1810,7 +1216,7 @@ rt_delete(radix_tree *tree, uint64 key)
/* Push the current node to the stack */
stack[++level] = node;
- if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ if (!rt_node_search_inner(node, key, &child))
return false;
node = child;
@@ -1819,7 +1225,7 @@ rt_delete(radix_tree *tree, uint64 key)
/* Delete the key from the leaf node if exists */
Assert(NODE_IS_LEAF(node));
- deleted = rt_node_search_leaf(node, key, RT_ACTION_DELETE, NULL);
+ deleted = rt_node_delete_leaf(node, key);
if (!deleted)
{
@@ -1845,7 +1251,7 @@ rt_delete(radix_tree *tree, uint64 key)
{
node = stack[level--];
- deleted = rt_node_search_inner(node, key, RT_ACTION_DELETE, NULL);
+ deleted = rt_node_delete_inner(node, key);
Assert(deleted);
/* If the node didn't become empty, we stop deleting the key */
@@ -1994,84 +1400,9 @@ rt_iter_update_key(rt_iter *iter, uint8 chunk, uint8 shift)
static inline rt_node *
rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
{
- rt_node *child = NULL;
- bool found = false;
- uint8 key_chunk;
-
- switch (node_iter->node->kind)
- {
- case RT_NODE_KIND_4:
- {
- rt_node_inner_4 *n4 = (rt_node_inner_4 *) node_iter->node;
-
- node_iter->current_idx++;
- if (node_iter->current_idx >= n4->base.n.count)
- break;
-
- child = n4->children[node_iter->current_idx];
- key_chunk = n4->base.chunks[node_iter->current_idx];
- found = true;
- break;
- }
- case RT_NODE_KIND_32:
- {
- rt_node_inner_32 *n32 = (rt_node_inner_32 *) node_iter->node;
-
- node_iter->current_idx++;
- if (node_iter->current_idx >= n32->base.n.count)
- break;
-
- child = n32->children[node_iter->current_idx];
- key_chunk = n32->base.chunks[node_iter->current_idx];
- found = true;
- break;
- }
- case RT_NODE_KIND_125:
- {
- rt_node_inner_125 *n125 = (rt_node_inner_125 *) node_iter->node;
- int i;
-
- for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
- {
- if (node_125_is_chunk_used((rt_node_base_125 *) n125, i))
- break;
- }
-
- if (i >= RT_NODE_MAX_SLOTS)
- break;
-
- node_iter->current_idx = i;
- child = node_inner_125_get_child(n125, i);
- key_chunk = i;
- found = true;
- break;
- }
- case RT_NODE_KIND_256:
- {
- rt_node_inner_256 *n256 = (rt_node_inner_256 *) node_iter->node;
- int i;
-
- for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
- {
- if (node_inner_256_is_chunk_used(n256, i))
- break;
- }
-
- if (i >= RT_NODE_MAX_SLOTS)
- break;
-
- node_iter->current_idx = i;
- child = node_inner_256_get_child(n256, i);
- key_chunk = i;
- found = true;
- break;
- }
- }
-
- if (found)
- rt_iter_update_key(iter, key_chunk, node_iter->node->shift);
-
- return child;
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_iter_impl.h"
+#undef RT_NODE_LEVEL_INNER
}
/*
@@ -2082,88 +1413,9 @@ static inline bool
rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
uint64 *value_p)
{
- rt_node *node = node_iter->node;
- bool found = false;
- uint64 value;
- uint8 key_chunk;
-
- switch (node->kind)
- {
- case RT_NODE_KIND_4:
- {
- rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node_iter->node;
-
- node_iter->current_idx++;
- if (node_iter->current_idx >= n4->base.n.count)
- break;
-
- value = n4->values[node_iter->current_idx];
- key_chunk = n4->base.chunks[node_iter->current_idx];
- found = true;
- break;
- }
- case RT_NODE_KIND_32:
- {
- rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node_iter->node;
-
- node_iter->current_idx++;
- if (node_iter->current_idx >= n32->base.n.count)
- break;
-
- value = n32->values[node_iter->current_idx];
- key_chunk = n32->base.chunks[node_iter->current_idx];
- found = true;
- break;
- }
- case RT_NODE_KIND_125:
- {
- rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) node_iter->node;
- int i;
-
- for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
- {
- if (node_125_is_chunk_used((rt_node_base_125 *) n125, i))
- break;
- }
-
- if (i >= RT_NODE_MAX_SLOTS)
- break;
-
- node_iter->current_idx = i;
- value = node_leaf_125_get_value(n125, i);
- key_chunk = i;
- found = true;
- break;
- }
- case RT_NODE_KIND_256:
- {
- rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node_iter->node;
- int i;
-
- for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
- {
- if (node_leaf_256_is_chunk_used(n256, i))
- break;
- }
-
- if (i >= RT_NODE_MAX_SLOTS)
- break;
-
- node_iter->current_idx = i;
- value = node_leaf_256_get_value(n256, i);
- key_chunk = i;
- found = true;
- break;
- }
- }
-
- if (found)
- {
- rt_iter_update_key(iter, key_chunk, node_iter->node->shift);
- *value_p = value;
- }
-
- return found;
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_iter_impl.h"
+#undef RT_NODE_LEVEL_LEAF
}
/*
@@ -2229,6 +1481,7 @@ rt_verify_node(rt_node *node)
for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
{
uint8 slot = n125->slot_idxs[i];
+ int idx = BM_IDX(slot);
int bitnum = BM_BIT(slot);
if (!node_125_is_chunk_used(n125, i))
@@ -2236,7 +1489,7 @@ rt_verify_node(rt_node *node)
/* Check if the corresponding slot is used */
Assert(slot < node->fanout);
- Assert((n125->isset[i] & ((bitmapword) 1 << bitnum)) != 0);
+ Assert((n125->isset[idx] & ((bitmapword) 1 << bitnum)) != 0);
cnt++;
}
@@ -2476,12 +1729,12 @@ rt_dump_search(radix_tree *tree, uint64 key)
uint64 dummy;
/* We reached at a leaf node, find the corresponding slot */
- rt_node_search_leaf(node, key, RT_ACTION_FIND, &dummy);
+ rt_node_search_leaf(node, key, &dummy);
break;
}
- if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ if (!rt_node_search_inner(node, key, &child))
break;
node = child;
diff --git a/src/include/lib/radixtree_delete_impl.h b/src/include/lib/radixtree_delete_impl.h
new file mode 100644
index 0000000000..24fd9cc02b
--- /dev/null
+++ b/src/include/lib/radixtree_delete_impl.h
@@ -0,0 +1,100 @@
+/* TODO: shrink nodes */
+
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE4_TYPE rt_node_inner_4
+#define RT_NODE32_TYPE rt_node_inner_32
+#define RT_NODE125_TYPE rt_node_inner_125
+#define RT_NODE256_TYPE rt_node_inner_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE4_TYPE rt_node_leaf_4
+#define RT_NODE32_TYPE rt_node_leaf_32
+#define RT_NODE125_TYPE rt_node_leaf_125
+#define RT_NODE256_TYPE rt_node_leaf_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ RT_NODE4_TYPE *n4 = (RT_NODE4_TYPE *) node;
+ int idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
+
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ chunk_values_array_delete(n4->base.chunks, (uint64 *) n4->values,
+ n4->base.n.count, idx);
+#else
+ chunk_children_array_delete(n4->base.chunks, n4->children,
+ n4->base.n.count, idx);
+#endif
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
+ int idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
+
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ chunk_values_array_delete(n32->base.chunks, (uint64 *) n32->values,
+ n32->base.n.count, idx);
+#else
+ chunk_children_array_delete(n32->base.chunks, n32->children,
+ n32->base.n.count, idx);
+#endif
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+ int slotpos = n125->base.slot_idxs[chunk];
+ int idx;
+ int bitnum;
+
+ if (slotpos == RT_NODE_125_INVALID_IDX)
+ return false;
+
+ idx = BM_IDX(slotpos);
+ bitnum = BM_BIT(slotpos);
+ n125->base.isset[idx] &= ~((bitmapword) 1 << bitnum);
+ n125->base.slot_idxs[chunk] = RT_NODE_125_INVALID_IDX;
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ if (!node_leaf_256_is_chunk_used(n256, chunk))
+#else
+ if (!node_inner_256_is_chunk_used(n256, chunk))
+#endif
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ node_leaf_256_delete(n256, chunk);
+#else
+ node_inner_256_delete(n256, chunk);
+#endif
+ break;
+ }
+ }
+
+ /* update statistics */
+ node->count--;
+
+ return true;
+
+#undef RT_NODE4_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_insert_impl.h b/src/include/lib/radixtree_insert_impl.h
new file mode 100644
index 0000000000..c63fe9a3c0
--- /dev/null
+++ b/src/include/lib/radixtree_insert_impl.h
@@ -0,0 +1,293 @@
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE4_TYPE rt_node_inner_4
+#define RT_NODE32_TYPE rt_node_inner_32
+#define RT_NODE125_TYPE rt_node_inner_125
+#define RT_NODE256_TYPE rt_node_inner_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE4_TYPE rt_node_leaf_4
+#define RT_NODE32_TYPE rt_node_leaf_32
+#define RT_NODE125_TYPE rt_node_leaf_125
+#define RT_NODE256_TYPE rt_node_leaf_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool chunk_exists = false;
+ rt_node *newnode = NULL;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ Assert(NODE_IS_LEAF(node));
+#else
+ Assert(!NODE_IS_LEAF(node));
+#endif
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ RT_NODE4_TYPE *n4 = (RT_NODE4_TYPE *) node;
+ int idx;
+
+ idx = node_4_search_eq(&n4->base, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+#ifdef RT_NODE_LEVEL_LEAF
+ n4->values[idx] = value;
+#else
+ n4->children[idx] = child;
+#endif
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n4)))
+ {
+ RT_NODE32_TYPE *new32;
+
+ /* grow node from 4 to 32 */
+ newnode = rt_grow_node_kind(tree, node, RT_NODE_KIND_32);
+ new32 = (RT_NODE32_TYPE *) newnode;
+#ifdef RT_NODE_LEVEL_LEAF
+ chunk_values_array_copy(n4->base.chunks, n4->values,
+ new32->base.chunks, new32->values);
+#else
+ chunk_children_array_copy(n4->base.chunks, n4->children,
+ new32->base.chunks, new32->children);
+#endif
+ Assert(parent != NULL);
+ rt_replace_node(tree, parent, node, newnode, key);
+ node = newnode;
+ }
+ else
+ {
+ int insertpos = node_4_get_insertpos(&n4->base, chunk);
+ int count = n4->base.n.count;
+
+ /* shift chunks and children */
+ if (insertpos < count)
+ {
+ Assert(count > 0);
+#ifdef RT_NODE_LEVEL_LEAF
+ chunk_values_array_shift(n4->base.chunks, n4->values,
+ count, insertpos);
+#else
+ chunk_children_array_shift(n4->base.chunks, n4->children,
+ count, insertpos);
+#endif
+ }
+
+ n4->base.chunks[insertpos] = chunk;
+#ifdef RT_NODE_LEVEL_LEAF
+ n4->values[insertpos] = value;
+#else
+ n4->children[insertpos] = child;
+#endif
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_32:
+ {
+ const rt_size_class_elem minclass = rt_size_class_info[RT_CLASS_32_PARTIAL];
+ const rt_size_class_elem maxclass = rt_size_class_info[RT_CLASS_32_FULL];
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
+ int idx;
+
+ idx = node_32_search_eq(&n32->base, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+#ifdef RT_NODE_LEVEL_LEAF
+ n32->values[idx] = value;
+#else
+ n32->children[idx] = child;
+#endif
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)) &&
+ n32->base.n.fanout == minclass.fanout)
+ {
+ /* grow to the next size class of this kind */
+#ifdef RT_NODE_LEVEL_LEAF
+ newnode = rt_alloc_node(tree, RT_CLASS_32_FULL, false);
+ memcpy(newnode, node, minclass.leaf_size);
+#else
+ newnode = rt_alloc_node(tree, RT_CLASS_32_FULL, true);
+ memcpy(newnode, node, minclass.inner_size);
+#endif
+ newnode->fanout = maxclass.fanout;
+
+ Assert(parent != NULL);
+ rt_replace_node(tree, parent, node, newnode, key);
+ node = newnode;
+
+ /* also update pointer for this kind */
+ n32 = (RT_NODE32_TYPE *) newnode;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)))
+ {
+ RT_NODE125_TYPE *new125;
+
+ Assert(n32->base.n.fanout == maxclass.fanout);
+
+ /* grow node from 32 to 125 */
+ newnode = rt_grow_node_kind(tree, node, RT_NODE_KIND_125);
+ new125 = (RT_NODE125_TYPE *) newnode;
+
+ for (int i = 0; i < maxclass.fanout; i++)
+ {
+ new125->base.slot_idxs[n32->base.chunks[i]] = i;
+#ifdef RT_NODE_LEVEL_LEAF
+ new125->values[i] = n32->values[i];
+#else
+ new125->children[i] = n32->children[i];
+#endif
+ }
+
+ Assert(maxclass.fanout <= sizeof(bitmapword) * BITS_PER_BYTE);
+ new125->base.isset[0] = (bitmapword) (((uint64) 1 << maxclass.fanout) - 1);
+
+ Assert(parent != NULL);
+ rt_replace_node(tree, parent, node, newnode, key);
+ node = newnode;
+ }
+ else
+ {
+ int insertpos = node_32_get_insertpos(&n32->base, chunk);
+ int count = n32->base.n.count;
+
+ if (insertpos < count)
+ {
+ Assert(count > 0);
+#ifdef RT_NODE_LEVEL_LEAF
+ chunk_values_array_shift(n32->base.chunks, n32->values,
+ count, insertpos);
+#else
+ chunk_children_array_shift(n32->base.chunks, n32->children,
+ count, insertpos);
+#endif
+ }
+
+ n32->base.chunks[insertpos] = chunk;
+#ifdef RT_NODE_LEVEL_LEAF
+ n32->values[insertpos] = value;
+#else
+ n32->children[insertpos] = child;
+#endif
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+ int slotpos = n125->base.slot_idxs[chunk];
+ int cnt = 0;
+
+ if (slotpos != RT_NODE_125_INVALID_IDX)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+#ifdef RT_NODE_LEVEL_LEAF
+ n125->values[slotpos] = value;
+#else
+ n125->children[slotpos] = child;
+#endif
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n125)))
+ {
+ RT_NODE256_TYPE *new256;
+
+ /* grow node from 125 to 256 */
+ newnode = rt_grow_node_kind(tree, node, RT_NODE_KIND_256);
+ new256 = (RT_NODE256_TYPE *) newnode;
+ for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n125->base.n.count; i++)
+ {
+ if (!node_125_is_chunk_used(&n125->base, i))
+ continue;
+#ifdef RT_NODE_LEVEL_LEAF
+ node_leaf_256_set(new256, i, node_leaf_125_get_value(n125, i));
+#else
+ node_inner_256_set(new256, i, node_inner_125_get_child(n125, i));
+#endif
+ cnt++;
+ }
+
+ Assert(parent != NULL);
+ rt_replace_node(tree, parent, node, newnode, key);
+ node = newnode;
+ }
+ else
+ {
+ int idx;
+ bitmapword inverse;
+
+ /* get the first word with at least one bit not set */
+ for (idx = 0; idx < BM_IDX(128); idx++)
+ {
+ if (n125->base.isset[idx] < ~((bitmapword) 0))
+ break;
+ }
+
+ /* To get the first unset bit in X, get the first set bit in ~X */
+ inverse = ~(n125->base.isset[idx]);
+ slotpos = idx * BITS_PER_BITMAPWORD;
+ slotpos += bmw_rightmost_one_pos(inverse);
+ Assert(slotpos < node->fanout);
+
+ /* mark the slot used */
+ n125->base.isset[idx] |= bmw_rightmost_one(inverse);
+ n125->base.slot_idxs[chunk] = slotpos;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ n125->values[slotpos] = value;
+#else
+ n125->children[slotpos] = child;
+#endif
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ chunk_exists = node_leaf_256_is_chunk_used(n256, chunk);
+#else
+ chunk_exists = node_inner_256_is_chunk_used(n256, chunk);
+#endif
+ Assert(chunk_exists || FIXED_NODE_HAS_FREE_SLOT(n256, RT_CLASS_256));
+
+#ifdef RT_NODE_LEVEL_LEAF
+ node_leaf_256_set(n256, chunk, value);
+#else
+ node_inner_256_set(n256, chunk, child);
+#endif
+ break;
+ }
+ }
+
+ /* Update statistics */
+ if (!chunk_exists)
+ node->count++;
+
+ /*
+ * Done. Finally, verify the chunk and value is inserted or replaced
+ * properly in the node.
+ */
+ rt_verify_node(node);
+
+ return chunk_exists;
+
+#undef RT_NODE4_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_iter_impl.h b/src/include/lib/radixtree_iter_impl.h
new file mode 100644
index 0000000000..bebf8e725a
--- /dev/null
+++ b/src/include/lib/radixtree_iter_impl.h
@@ -0,0 +1,129 @@
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE4_TYPE rt_node_inner_4
+#define RT_NODE32_TYPE rt_node_inner_32
+#define RT_NODE125_TYPE rt_node_inner_125
+#define RT_NODE256_TYPE rt_node_inner_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE4_TYPE rt_node_leaf_4
+#define RT_NODE32_TYPE rt_node_leaf_32
+#define RT_NODE125_TYPE rt_node_leaf_125
+#define RT_NODE256_TYPE rt_node_leaf_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+#ifdef RT_NODE_LEVEL_LEAF
+ uint64 value;
+#else
+ rt_node *child = NULL;
+#endif
+ bool found = false;
+ uint8 key_chunk;
+
+ switch (node_iter->node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ RT_NODE4_TYPE *n4 = (RT_NODE4_TYPE *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n4->base.n.count)
+ break;
+#ifdef RT_NODE_LEVEL_LEAF
+ value = n4->values[node_iter->current_idx];
+#else
+ child = n4->children[node_iter->current_idx];
+#endif
+ key_chunk = n4->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n32->base.n.count)
+ break;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ value = n32->values[node_iter->current_idx];
+#else
+ child = n32->children[node_iter->current_idx];
+#endif
+ key_chunk = n32->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (node_125_is_chunk_used((rt_node_base_125 *) n125, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+#ifdef RT_NODE_LEVEL_LEAF
+ value = node_leaf_125_get_value(n125, i);
+#else
+ child = node_inner_125_get_child(n125, i);
+#endif
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+#ifdef RT_NODE_LEVEL_LEAF
+ if (node_leaf_256_is_chunk_used(n256, i))
+#else
+ if (node_inner_256_is_chunk_used(n256, i))
+#endif
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+#ifdef RT_NODE_LEVEL_LEAF
+ value = node_leaf_256_get_value(n256, i);
+#else
+ child = node_inner_256_get_child(n256, i);
+#endif
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ }
+
+ if (found)
+ {
+ rt_iter_update_key(iter, key_chunk, node_iter->node->shift);
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = value;
+#endif
+ }
+
+#ifdef RT_NODE_LEVEL_LEAF
+ return found;
+#else
+ return child;
+#endif
+
+#undef RT_NODE4_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_search_impl.h b/src/include/lib/radixtree_search_impl.h
new file mode 100644
index 0000000000..d0366f9bb6
--- /dev/null
+++ b/src/include/lib/radixtree_search_impl.h
@@ -0,0 +1,102 @@
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE4_TYPE rt_node_inner_4
+#define RT_NODE32_TYPE rt_node_inner_32
+#define RT_NODE125_TYPE rt_node_inner_125
+#define RT_NODE256_TYPE rt_node_inner_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE4_TYPE rt_node_leaf_4
+#define RT_NODE32_TYPE rt_node_leaf_32
+#define RT_NODE125_TYPE rt_node_leaf_125
+#define RT_NODE256_TYPE rt_node_leaf_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+
+#ifdef RT_NODE_LEVEL_LEAF
+ uint64 value = 0;
+#else
+ rt_node *child = NULL;
+#endif
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ RT_NODE4_TYPE *n4 = (RT_NODE4_TYPE *) node;
+ int idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
+
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ value = n4->values[idx];
+#else
+ child = n4->children[idx];
+#endif
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
+ int idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
+
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ value = n32->values[idx];
+#else
+ child = n32->children[idx];
+#endif
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+
+ if (!node_125_is_chunk_used((rt_node_base_125 *) n125, chunk))
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ value = node_leaf_125_get_value(n125, chunk);
+#else
+ child = node_inner_125_get_child(n125, chunk);
+#endif
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ if (!node_leaf_256_is_chunk_used(n256, chunk))
+#else
+ if (!node_inner_256_is_chunk_used(n256, chunk))
+#endif
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ value = node_leaf_256_get_value(n256, chunk);
+#else
+ child = node_inner_256_get_child(n256, chunk);
+#endif
+ break;
+ }
+ }
+
+#ifdef RT_NODE_LEVEL_LEAF
+ Assert(value_p != NULL);
+ *value_p = value;
+#else
+ Assert(child_p != NULL);
+ *child_p = child;
+#endif
+
+ return true;
+
+#undef RT_NODE4_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
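(A reading aid, not part of the patch: the three radixtree_*_impl.h fragments above all use the same trick of being #include'd from inside a function body, with RT_NODE_LEVEL_INNER/RT_NODE_LEVEL_LEAF selecting the per-level types while the fragment refers to the enclosing function's parameters such as node, key, child and value. A minimal, self-contained sketch of the pattern with purely hypothetical names:

/* toy_impl.h -- a code fragment, never included standalone (hypothetical) */
#if defined(TOY_LEVEL_INNER)
#define TOY_SLOT_TYPE void *
#elif defined(TOY_LEVEL_LEAF)
#define TOY_SLOT_TYPE unsigned long
#else
#error level must be either inner or leaf
#endif

	/* 'slots' and 'idx' come from the enclosing function, as in the patch */
	TOY_SLOT_TYPE slot = slots[idx];

	/* level-specific work would go here; both levels share this skeleton */
	return slot != 0;

#undef TOY_SLOT_TYPE

/* toy.c -- each wrapper includes the fragment once, with the level selected */
#include <stdbool.h>

static bool
toy_used_inner(void **slots, int idx)
{
#define TOY_LEVEL_INNER
#include "toy_impl.h"
#undef TOY_LEVEL_INNER
}

static bool
toy_used_leaf(unsigned long *slots, int idx)
{
#define TOY_LEVEL_LEAF
#include "toy_impl.h"
#undef TOY_LEVEL_LEAF
}

The compiler ends up seeing two ordinary functions per operation, so there is no runtime dispatch on the node level; that is the point of splitting the bodies out into these headers.)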
diff --git a/src/tools/pginclude/cpluspluscheck b/src/tools/pginclude/cpluspluscheck
index e52fe9f509..09fa6e7432 100755
--- a/src/tools/pginclude/cpluspluscheck
+++ b/src/tools/pginclude/cpluspluscheck
@@ -101,6 +101,12 @@ do
test "$f" = src/include/nodes/nodetags.h && continue
test "$f" = src/backend/nodes/nodetags.h && continue
+ # radixtree_*_impl.h cannot be included standalone: they are just code fragments.
+ test "$f" = src/include/lib/radixtree_delete_impl.h && continue
+ test "$f" = src/include/lib/radixtree_insert_impl.h && continue
+ test "$f" = src/include/lib/radixtree_iter_impl.h && continue
+ test "$f" = src/include/lib/radixtree_search_impl.h && continue
+
# These files are not meant to be included standalone, because
# they contain lists that might have multiple use-cases.
test "$f" = src/include/access/rmgrlist.h && continue
diff --git a/src/tools/pginclude/headerscheck b/src/tools/pginclude/headerscheck
index abbba7aa63..d4d2f1da03 100755
--- a/src/tools/pginclude/headerscheck
+++ b/src/tools/pginclude/headerscheck
@@ -96,6 +96,12 @@ do
test "$f" = src/include/nodes/nodetags.h && continue
test "$f" = src/backend/nodes/nodetags.h && continue
+ # radixtree_*_impl.h cannot be included standalone: they are just code fragments.
+ test "$f" = src/include/lib/radixtree_delete_impl.h && continue
+ test "$f" = src/include/lib/radixtree_insert_impl.h && continue
+ test "$f" = src/include/lib/radixtree_iter_impl.h && continue
+ test "$f" = src/include/lib/radixtree_search_impl.h && continue
+
# These files are not meant to be included standalone, because
# they contain lists that might have multiple use-cases.
test "$f" = src/include/access/rmgrlist.h && continue
--
2.39.0
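(Not part of the patches: a rough usage sketch of the shared-memory API added by the next patch. It assumes the template is instantiated so that RT_CREATE/RT_ATTACH/RT_SET/RT_SEARCH come out as rt_create()/rt_attach()/rt_set()/rt_search(); my_tranche_id, handle and tree_dp are illustrative placeholders, and the handle-passing step (RT_GET_HANDLE) is only indicated:

/* creating backend */
dsa_area      *area = dsa_create(my_tranche_id);
rt_radix_tree *tree = rt_create(CurrentMemoryContext, area);
uint64         key = 1;

rt_set(tree, key, 10);

/* attaching backend: receives the DSA handle and the tree's dsa_pointer out of band */
dsa_area      *area2 = dsa_attach(handle);
rt_radix_tree *tree2 = rt_attach(area2, tree_dp);	/* tree_dp from RT_GET_HANDLE */
uint64         value;

if (rt_search(tree2, key, &value))
	Assert(value == 10);

As the XXX comments in the patch itself note, concurrent iteration and locking are still open items.)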
Attachment: v17-0009-Implement-shared-memory.patch (text/x-patch; charset=US-ASCII)
From c78f27a61d649b0981fc150c3894a0e1a992bcc0 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Mon, 9 Jan 2023 14:32:39 +0700
Subject: [PATCH v17 9/9] Implement shared memory
---
src/backend/utils/mmgr/dsa.c | 12 +
src/include/lib/radixtree.h | 376 +++++++++++++-----
src/include/lib/radixtree_delete_impl.h | 6 +
src/include/lib/radixtree_insert_impl.h | 43 +-
src/include/lib/radixtree_iter_impl.h | 19 +-
src/include/lib/radixtree_search_impl.h | 28 +-
src/include/utils/dsa.h | 1 +
.../modules/test_radixtree/test_radixtree.c | 43 ++
8 files changed, 402 insertions(+), 126 deletions(-)
diff --git a/src/backend/utils/mmgr/dsa.c b/src/backend/utils/mmgr/dsa.c
index 604b702a91..50f0aae3ab 100644
--- a/src/backend/utils/mmgr/dsa.c
+++ b/src/backend/utils/mmgr/dsa.c
@@ -1024,6 +1024,18 @@ dsa_set_size_limit(dsa_area *area, size_t limit)
LWLockRelease(DSA_AREA_LOCK(area));
}
+size_t
+dsa_get_total_size(dsa_area *area)
+{
+ size_t size;
+
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_SHARED);
+ size = area->control->total_segment_size;
+ LWLockRelease(DSA_AREA_LOCK(area));
+
+ return size;
+}
+
/*
* Aggressively free all spare memory in the hope of returning DSM segments to
* the operating system.
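(Aside, not part of the patch text: the new dsa_get_total_size() above just reads total_segment_size under the area lock, so the shared case of RT_MEMORY_USAGE below reduces to a one-liner, roughly:

#ifdef RT_SHMEM
	/* shared case: size of all DSM segments backing the DSA area */
	total = dsa_get_total_size(tree->dsa);
#endif

Since this counts whole segments, it can somewhat overstate the memory actually occupied by tree nodes.)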
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index b3d84da033..2b58a0cdf5 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -42,6 +42,8 @@
* - RT_DEFINE - if defined function definitions are generated
* - RT_SCOPE - in which scope (e.g. extern, static inline) do function
* declarations reside
+ * - RT_SHMEM - if defined, the radix tree is created in the DSA area
+ * so that multiple processes can access it simultaneously.
*
* Optional parameters:
* - RT_DEBUG - if defined add stats tracking and debugging functions
@@ -51,6 +53,9 @@
*
* RT_CREATE - Create a new, empty radix tree
* RT_FREE - Free the radix tree
+ * RT_ATTACH - Attach to the radix tree
+ * RT_DETACH - Detach from the radix tree
+ * RT_GET_HANDLE - Return the handle of the radix tree
* RT_SEARCH - Search a key-value pair
* RT_SET - Set a key-value pair
* RT_DELETE - Delete a key-value pair
@@ -80,7 +85,8 @@
#include "miscadmin.h"
#include "nodes/bitmapset.h"
#include "port/pg_bitutils.h"
-#include "port/pg_lfind.h"
+#include "port/simd.h"
+#include "utils/dsa.h"
#include "utils/memutils.h"
/* helpers */
@@ -92,6 +98,9 @@
#define RT_CREATE RT_MAKE_NAME(create)
#define RT_FREE RT_MAKE_NAME(free)
#define RT_SEARCH RT_MAKE_NAME(search)
+#ifdef RT_SHMEM
+#define RT_ATTACH RT_MAKE_NAME(attach)
+#endif
#define RT_SET RT_MAKE_NAME(set)
#define RT_BEGIN_ITERATE RT_MAKE_NAME(begin_iterate)
#define RT_ITERATE_NEXT RT_MAKE_NAME(iterate_next)
@@ -110,9 +119,11 @@
#define RT_FREE_NODE RT_MAKE_NAME(free_node)
#define RT_EXTEND RT_MAKE_NAME(extend)
#define RT_SET_EXTEND RT_MAKE_NAME(set_extend)
-#define RT_GROW_NODE_KIND RT_MAKE_NAME(grow_node_kind)
+//#define RT_GROW_NODE_KIND RT_MAKE_NAME(grow_node_kind)
#define RT_COPY_NODE RT_MAKE_NAME(copy_node)
#define RT_REPLACE_NODE RT_MAKE_NAME(replace_node)
+#define RT_PTR_GET_LOCAL RT_MAKE_NAME(ptr_get_local)
+#define RT_PTR_ALLOC_IS_VALID RT_MAKE_NAME(ptr_stored_is_valid)
#define RT_NODE_4_SEARCH_EQ RT_MAKE_NAME(node_4_search_eq)
#define RT_NODE_32_SEARCH_EQ RT_MAKE_NAME(node_32_search_eq)
#define RT_NODE_4_GET_INSERTPOS RT_MAKE_NAME(node_4_get_insertpos)
@@ -138,6 +149,7 @@
#define RT_SHIFT_GET_MAX_VAL RT_MAKE_NAME(shift_get_max_val)
#define RT_NODE_SEARCH_INNER RT_MAKE_NAME(node_search_inner)
#define RT_NODE_SEARCH_LEAF RT_MAKE_NAME(node_search_leaf)
+#define RT_NODE_UPDATE_INNER RT_MAKE_NAME(node_update_inner)
#define RT_NODE_DELETE_INNER RT_MAKE_NAME(node_delete_inner)
#define RT_NODE_DELETE_LEAF RT_MAKE_NAME(node_delete_leaf)
#define RT_NODE_INSERT_INNER RT_MAKE_NAME(node_insert_inner)
@@ -150,6 +162,7 @@
/* type declarations */
#define RT_RADIX_TREE RT_MAKE_NAME(radix_tree)
+#define RT_RADIX_TREE_CONTROL RT_MAKE_NAME(radix_tree_control)
#define RT_ITER RT_MAKE_NAME(iter)
#define RT_NODE RT_MAKE_NAME(node)
#define RT_NODE_ITER RT_MAKE_NAME(node_iter)
@@ -181,8 +194,14 @@
typedef struct RT_RADIX_TREE RT_RADIX_TREE;
typedef struct RT_ITER RT_ITER;
+#ifdef RT_SHMEM
+RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx, dsa_area *dsa);
+RT_SCOPE RT_RADIX_TREE * RT_ATTACH(dsa_area *dsa, dsa_pointer dp);
+#else
RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx);
+#endif
RT_SCOPE void RT_FREE(RT_RADIX_TREE *tree);
+
RT_SCOPE bool RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, uint64 *val_p);
RT_SCOPE bool RT_SET(RT_RADIX_TREE *tree, uint64 key, uint64 val);
RT_SCOPE bool RT_DELETE(RT_RADIX_TREE *tree, uint64 key);
@@ -301,9 +320,21 @@ typedef struct RT_NODE
uint8 kind;
} RT_NODE;
+
#define RT_PTR_LOCAL RT_NODE *
+#ifdef RT_SHMEM
+#define RT_PTR_ALLOC dsa_pointer
+#else
#define RT_PTR_ALLOC RT_PTR_LOCAL
+#endif
+
+
+#ifdef RT_SHMEM
+#define RT_INVALID_PTR_ALLOC InvalidDsaPointer
+#else
+#define RT_INVALID_PTR_ALLOC NULL
+#endif
#define NODE_IS_LEAF(n) (((RT_PTR_LOCAL) (n))->shift == 0)
#define NODE_IS_EMPTY(n) (((RT_PTR_LOCAL) (n))->count == 0)
@@ -512,21 +543,33 @@ static const RT_SIZE_CLASS RT_KIND_MIN_SIZE_CLASS[RT_NODE_KIND_COUNT] = {
};
/* A radix tree with nodes */
-typedef struct RT_RADIX_TREE
+typedef struct RT_RADIX_TREE_CONTROL
{
- MemoryContext context;
-
RT_PTR_ALLOC root;
uint64 max_val;
uint64 num_keys;
- MemoryContextData *inner_slabs[RT_SIZE_CLASS_COUNT];
- MemoryContextData *leaf_slabs[RT_SIZE_CLASS_COUNT];
-
/* statistics */
#ifdef RT_DEBUG
int32 cnt[RT_SIZE_CLASS_COUNT];
#endif
+} RT_RADIX_TREE_CONTROL;
+
+/* A radix tree with nodes */
+typedef struct RT_RADIX_TREE
+{
+ MemoryContext context;
+
+ /* pointing to either local memory or DSA */
+ RT_RADIX_TREE_CONTROL *ctl;
+
+#ifdef RT_SHMEM
+ dsa_area *dsa;
+ dsa_pointer ctl_dp;
+#else
+ MemoryContextData *inner_slabs[RT_SIZE_CLASS_COUNT];
+ MemoryContextData *leaf_slabs[RT_SIZE_CLASS_COUNT];
+#endif
} RT_RADIX_TREE;
/*
@@ -542,6 +585,11 @@ typedef struct RT_RADIX_TREE
* construct the key whenever updating the node iteration information, e.g., when
* advancing the current index within the node or when moving to the next node
* at the same level.
+ *
+ * XXX: Currently we allow only one process to iterate over the tree. Therefore,
+ * rt_node_iter holds local pointers to nodes rather than RT_PTR_ALLOC.
+ * We need either a safeguard that disallows other processes from beginning an
+ * iteration while one is in progress, or support for concurrent iterations.
*/
typedef struct RT_NODE_ITER
{
@@ -562,14 +610,35 @@ typedef struct RT_ITER
} RT_ITER;
-static bool RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_LOCAL node,
- uint64 key, RT_PTR_LOCAL child);
-static bool RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_LOCAL node,
+static bool RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC nodep, RT_PTR_LOCAL node,
+ uint64 key, RT_PTR_ALLOC child);
+static bool RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC nodep, RT_PTR_LOCAL node,
uint64 key, uint64 value);
/* verification (available only with assertion) */
static void RT_VERIFY_NODE(RT_PTR_LOCAL node);
+/* Get the local address of an allocated node */
+static inline RT_PTR_LOCAL
+RT_PTR_GET_LOCAL(RT_RADIX_TREE *tree, RT_PTR_ALLOC node)
+{
+#ifdef RT_SHMEM
+ return dsa_get_address(tree->dsa, (dsa_pointer) node);
+#else
+ return node;
+#endif
+}
+
+static inline bool
+RT_PTR_ALLOC_IS_VALID(RT_PTR_ALLOC ptr)
+{
+#ifdef RT_SHMEM
+ return DsaPointerIsValid(ptr);
+#else
+ return PointerIsValid(ptr);
+#endif
+}
+
/*
* Return index of the first element in 'base' that equals 'key'. Return -1
* if there is no such element.
@@ -801,7 +870,7 @@ static inline bool
RT_NODE_INNER_256_IS_CHUNK_USED(RT_NODE_INNER_256 *node, uint8 chunk)
{
Assert(!NODE_IS_LEAF(node));
- return (node->children[chunk] != NULL);
+ return node->children[chunk] != RT_INVALID_PTR_ALLOC;
}
static inline bool
@@ -855,7 +924,7 @@ static inline void
RT_NODE_INNER_256_DELETE(RT_NODE_INNER_256 *node, uint8 chunk)
{
Assert(!NODE_IS_LEAF(node));
- node->children[chunk] = NULL;
+ node->children[chunk] = RT_INVALID_PTR_ALLOC;
}
static inline void
@@ -897,21 +966,31 @@ RT_SHIFT_GET_MAX_VAL(int shift)
static RT_PTR_ALLOC
RT_ALLOC_NODE(RT_RADIX_TREE *tree, RT_SIZE_CLASS size_class, bool inner)
{
- RT_PTR_ALLOC newnode;
+ RT_PTR_ALLOC allocnode;
+ size_t allocsize;
if (inner)
- newnode = (RT_PTR_ALLOC) MemoryContextAlloc(tree->inner_slabs[size_class],
- RT_SIZE_CLASS_INFO[size_class].inner_size);
+ allocsize = RT_SIZE_CLASS_INFO[size_class].inner_size;
else
- newnode = (RT_PTR_ALLOC) MemoryContextAlloc(tree->leaf_slabs[size_class],
- RT_SIZE_CLASS_INFO[size_class].leaf_size);
+ allocsize = RT_SIZE_CLASS_INFO[size_class].leaf_size;
+
+#ifdef RT_SHMEM
+ allocnode = dsa_allocate(tree->dsa, allocsize);
+#else
+ if (inner)
+ allocnode = (RT_PTR_ALLOC) MemoryContextAlloc(tree->inner_slabs[size_class],
+ allocsize);
+ else
+ allocnode = (RT_PTR_ALLOC) MemoryContextAlloc(tree->leaf_slabs[size_class],
+ allocsize);
+#endif
#ifdef RT_DEBUG
/* update the statistics */
- tree->cnt[size_class]++;
+ tree->ctl->cnt[size_class]++;
#endif
- return newnode;
+ return allocnode;
}
/* Initialize the node contents */
@@ -951,13 +1030,15 @@ RT_NEW_ROOT(RT_RADIX_TREE *tree, uint64 key)
{
int shift = RT_KEY_GET_SHIFT(key);
bool inner = shift > 0;
- RT_PTR_ALLOC newnode;
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
- newnode = RT_ALLOC_NODE(tree, RT_CLASS_4_FULL, inner);
+ allocnode = RT_ALLOC_NODE(tree, RT_CLASS_4_FULL, inner);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
RT_INIT_NODE(newnode, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
newnode->shift = shift;
- tree->max_val = RT_SHIFT_GET_MAX_VAL(shift);
- tree->root = newnode;
+ tree->ctl->max_val = RT_SHIFT_GET_MAX_VAL(shift);
+ tree->ctl->root = allocnode;
}
static inline void
@@ -967,7 +1048,7 @@ RT_COPY_NODE(RT_PTR_LOCAL newnode, RT_PTR_LOCAL oldnode)
newnode->chunk = oldnode->chunk;
newnode->count = oldnode->count;
}
-
+#if 0
/*
* Create a new node with 'new_kind' and the same shift, chunk, and
* count of 'node'.
@@ -975,30 +1056,33 @@ RT_COPY_NODE(RT_PTR_LOCAL newnode, RT_PTR_LOCAL oldnode)
static RT_NODE*
RT_GROW_NODE_KIND(RT_RADIX_TREE *tree, RT_PTR_LOCAL node, uint8 new_kind)
{
- RT_PTR_ALLOC newnode;
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
bool inner = !NODE_IS_LEAF(node);
- newnode = RT_ALLOC_NODE(tree, RT_KIND_MIN_SIZE_CLASS[new_kind], inner);
+ allocnode = RT_ALLOC_NODE(tree, RT_KIND_MIN_SIZE_CLASS[new_kind], inner);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
RT_INIT_NODE(newnode, new_kind, RT_KIND_MIN_SIZE_CLASS[new_kind], inner);
RT_COPY_NODE(newnode, node);
return newnode;
}
-
+#endif
/* Free the given node */
static void
-RT_FREE_NODE(RT_RADIX_TREE *tree, RT_PTR_ALLOC node)
+RT_FREE_NODE(RT_RADIX_TREE *tree, RT_PTR_ALLOC allocnode)
{
/* If we're deleting the root node, make the tree empty */
- if (tree->root == node)
+ if (tree->ctl->root == allocnode)
{
- tree->root = NULL;
- tree->max_val = 0;
+ tree->ctl->root = RT_INVALID_PTR_ALLOC;
+ tree->ctl->max_val = 0;
}
#ifdef RT_DEBUG
{
int i;
+ RT_PTR_LOCAL node = RT_PTR_GET_LOCAL(tree, allocnode);
/* update the statistics */
for (i = 0; i < RT_SIZE_CLASS_COUNT; i++)
@@ -1011,12 +1095,26 @@ RT_FREE_NODE(RT_RADIX_TREE *tree, RT_PTR_ALLOC node)
if (i == RT_SIZE_CLASS_COUNT)
i = RT_CLASS_256;
- tree->cnt[i]--;
- Assert(tree->cnt[i] >= 0);
+ tree->ctl->cnt[i]--;
+ Assert(tree->ctl->cnt[i] >= 0);
}
#endif
- pfree(node);
+#ifdef RT_SHMEM
+ dsa_free(tree->dsa, allocnode);
+#else
+ pfree(allocnode);
+#endif
+}
+
+static inline bool
+RT_NODE_UPDATE_INNER(RT_PTR_LOCAL node, uint64 key, RT_PTR_ALLOC new_child)
+{
+#define RT_ACTION_UPDATE
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_INNER
+#undef RT_ACTION_UPDATE
}
/*
@@ -1026,19 +1124,25 @@ static void
RT_REPLACE_NODE(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC old_child,
RT_PTR_ALLOC new_child, uint64 key)
{
- Assert(old_child->chunk == new_child->chunk);
- Assert(old_child->shift == new_child->shift);
+ RT_PTR_LOCAL old = RT_PTR_GET_LOCAL(tree, old_child);
+
+#ifdef USE_ASSERT_CHECKING
+ RT_PTR_LOCAL new = RT_PTR_GET_LOCAL(tree, new_child);
+
+ Assert(old->chunk == new->chunk);
+ Assert(old->shift == new->shift);
+#endif
- if (parent == old_child)
+ if (parent == old)
{
/* Replace the root node with the new large node */
- tree->root = new_child;
+ tree->ctl->root = new_child;
}
else
{
- bool replaced PG_USED_FOR_ASSERTS_ONLY;
+ bool replaced PG_USED_FOR_ASSERTS_ONLY;
- replaced = RT_NODE_INSERT_INNER(tree, NULL, parent, key, new_child);
+ replaced = RT_NODE_UPDATE_INNER(parent, key, new_child);
Assert(replaced);
}
@@ -1053,7 +1157,8 @@ static void
RT_EXTEND(RT_RADIX_TREE *tree, uint64 key)
{
int target_shift;
- int shift = tree->root->shift + RT_NODE_SPAN;
+ RT_PTR_LOCAL root = RT_PTR_GET_LOCAL(tree, tree->ctl->root);
+ int shift = root->shift + RT_NODE_SPAN;
target_shift = RT_KEY_GET_SHIFT(key);
@@ -1065,22 +1170,23 @@ RT_EXTEND(RT_RADIX_TREE *tree, uint64 key)
RT_NODE_INNER_4 *n4;
allocnode = RT_ALLOC_NODE(tree, RT_CLASS_4_FULL, true);
- node = (RT_PTR_LOCAL) allocnode;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
RT_INIT_NODE(node, RT_NODE_KIND_4, RT_CLASS_4_FULL, true);
node->shift = shift;
node->count = 1;
n4 = (RT_NODE_INNER_4 *) node;
n4->base.chunks[0] = 0;
- n4->children[0] = tree->root;
+ n4->children[0] = tree->ctl->root;
- tree->root->chunk = 0;
- tree->root = node;
+ /* Update the root */
+ tree->ctl->root = allocnode;
+ root->chunk = 0;
shift += RT_NODE_SPAN;
}
- tree->max_val = RT_SHIFT_GET_MAX_VAL(target_shift);
+ tree->ctl->max_val = RT_SHIFT_GET_MAX_VAL(target_shift);
}
/*
@@ -1089,10 +1195,12 @@ RT_EXTEND(RT_RADIX_TREE *tree, uint64 key)
*/
static inline void
RT_SET_EXTEND(RT_RADIX_TREE *tree, uint64 key, uint64 value, RT_PTR_LOCAL parent,
- RT_PTR_LOCAL node)
+ RT_PTR_ALLOC nodep, RT_PTR_LOCAL node)
{
int shift = node->shift;
+ Assert(RT_PTR_GET_LOCAL(tree, nodep) == node);
+
while (shift >= RT_NODE_SPAN)
{
RT_PTR_ALLOC allocchild;
@@ -1101,19 +1209,20 @@ RT_SET_EXTEND(RT_RADIX_TREE *tree, uint64 key, uint64 value, RT_PTR_LOCAL parent
bool inner = newshift > 0;
allocchild = RT_ALLOC_NODE(tree, RT_CLASS_4_FULL, inner);
- newchild = (RT_PTR_LOCAL) allocchild;
+ newchild = RT_PTR_GET_LOCAL(tree, allocchild);
RT_INIT_NODE(newchild, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
newchild->shift = newshift;
newchild->chunk = RT_GET_KEY_CHUNK(key, node->shift);
- RT_NODE_INSERT_INNER(tree, parent, node, key, newchild);
+ RT_NODE_INSERT_INNER(tree, parent, nodep, node, key, allocchild);
parent = node;
node = newchild;
+ nodep = allocchild;
shift -= RT_NODE_SPAN;
}
- RT_NODE_INSERT_LEAF(tree, parent, node, key, value);
- tree->num_keys++;
+ RT_NODE_INSERT_LEAF(tree, parent, nodep, node, key, value);
+ tree->ctl->num_keys++;
}
/*
@@ -1172,8 +1281,8 @@ RT_NODE_DELETE_LEAF(RT_PTR_LOCAL node, uint64 key)
/* Insert the child to the inner node */
static bool
-RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_LOCAL node, uint64 key,
- RT_PTR_ALLOC child)
+RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC nodep, RT_PTR_LOCAL node,
+ uint64 key, RT_PTR_ALLOC child)
{
#define RT_NODE_LEVEL_INNER
#include "lib/radixtree_insert_impl.h"
@@ -1182,7 +1291,7 @@ RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_LOCAL node
/* Insert the value to the leaf node */
static bool
-RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_LOCAL node,
+RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC nodep, RT_PTR_LOCAL node,
uint64 key, uint64 value)
{
#define RT_NODE_LEVEL_LEAF
@@ -1194,18 +1303,26 @@ RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_LOCAL node,
* Create the radix tree in the given memory context and return it.
*/
RT_SCOPE RT_RADIX_TREE *
+#ifdef RT_SHMEM
+RT_CREATE(MemoryContext ctx, dsa_area *dsa)
+#else
RT_CREATE(MemoryContext ctx)
+#endif
{
RT_RADIX_TREE *tree;
MemoryContext old_ctx;
old_ctx = MemoryContextSwitchTo(ctx);
- tree = palloc(sizeof(RT_RADIX_TREE));
+ tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE));
tree->context = ctx;
- tree->root = NULL;
- tree->max_val = 0;
- tree->num_keys = 0;
+
+#ifdef RT_SHMEM
+ tree->dsa = dsa;
+ tree->ctl_dp = dsa_allocate0(dsa, sizeof(RT_RADIX_TREE_CONTROL));
+ tree->ctl = (RT_RADIX_TREE_CONTROL *) dsa_get_address(dsa, tree->ctl_dp);
+#else
+ tree->ctl = (RT_RADIX_TREE_CONTROL *) palloc0(sizeof(RT_RADIX_TREE_CONTROL));
/* Create the slab allocator for each size class */
for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
@@ -1218,27 +1335,52 @@ RT_CREATE(MemoryContext ctx)
RT_SIZE_CLASS_INFO[i].name,
RT_SIZE_CLASS_INFO[i].leaf_blocksize,
RT_SIZE_CLASS_INFO[i].leaf_size);
-#ifdef RT_DEBUG
- tree->cnt[i] = 0;
-#endif
}
+#endif
+
+ tree->ctl->root = RT_INVALID_PTR_ALLOC;
MemoryContextSwitchTo(old_ctx);
return tree;
}
+#ifdef RT_SHMEM
+RT_RADIX_TREE *
+RT_ATTACH(dsa_area *dsa, dsa_pointer dp)
+{
+ RT_RADIX_TREE *tree;
+
+ /* XXX: memory context support */
+ tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE));
+
+ tree->ctl_dp = dp;
+ tree->ctl = (RT_RADIX_TREE_CONTROL *) dsa_get_address(dsa, dp);
+
+ /* XXX: do we need to set a callback on exit to detach dsa? */
+
+ return tree;
+}
+#endif
+
/*
* Free the given radix tree.
*/
RT_SCOPE void
RT_FREE(RT_RADIX_TREE *tree)
{
+#ifdef RT_SHMEM
+ dsa_free(tree->dsa, tree->ctl_dp); // XXX
+ dsa_detach(tree->dsa);
+#else
+ pfree(tree->ctl);
+
for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
{
MemoryContextDelete(tree->inner_slabs[i]);
MemoryContextDelete(tree->leaf_slabs[i]);
}
+#endif
pfree(tree);
}
@@ -1252,46 +1394,50 @@ RT_SET(RT_RADIX_TREE *tree, uint64 key, uint64 value)
{
int shift;
bool updated;
- RT_PTR_LOCAL node;
RT_PTR_LOCAL parent;
+ RT_PTR_ALLOC nodep;
+ RT_PTR_LOCAL node;
/* Empty tree, create the root */
- if (!tree->root)
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
RT_NEW_ROOT(tree, key);
/* Extend the tree if necessary */
- if (key > tree->max_val)
+ if (key > tree->ctl->max_val)
RT_EXTEND(tree, key);
- Assert(tree->root);
+ //Assert(tree->ctl->root);
- shift = tree->root->shift;
- node = parent = tree->root;
+ nodep = tree->ctl->root;
+ parent = RT_PTR_GET_LOCAL(tree, nodep);
+ shift = parent->shift;
/* Descend the tree until a leaf node */
while (shift >= 0)
{
- RT_PTR_LOCAL child;
+ RT_PTR_ALLOC child;
+
+ node = RT_PTR_GET_LOCAL(tree, nodep);
if (NODE_IS_LEAF(node))
break;
if (!RT_NODE_SEARCH_INNER(node, key, &child))
{
- RT_SET_EXTEND(tree, key, value, parent, node);
+ RT_SET_EXTEND(tree, key, value, parent, nodep, node);
return false;
}
parent = node;
- node = child;
+ nodep = child;
shift -= RT_NODE_SPAN;
}
- updated = RT_NODE_INSERT_LEAF(tree, parent, node, key, value);
+ updated = RT_NODE_INSERT_LEAF(tree, parent, nodep, node, key, value);
/* Update the statistics */
if (!updated)
- tree->num_keys++;
+ tree->ctl->num_keys++;
return updated;
}
@@ -1309,11 +1455,11 @@ RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, uint64 *value_p)
Assert(value_p != NULL);
- if (!tree->root || key > tree->max_val)
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root) || key > tree->ctl->max_val)
return false;
- node = tree->root;
- shift = tree->root->shift;
+ node = RT_PTR_GET_LOCAL(tree, tree->ctl->root);
+ shift = node->shift;
/* Descend the tree until a leaf node */
while (shift >= 0)
@@ -1326,7 +1472,7 @@ RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, uint64 *value_p)
if (!RT_NODE_SEARCH_INNER(node, key, &child))
return false;
- node = child;
+ node = RT_PTR_GET_LOCAL(tree, child);
shift -= RT_NODE_SPAN;
}
@@ -1341,37 +1487,40 @@ RT_SCOPE bool
RT_DELETE(RT_RADIX_TREE *tree, uint64 key)
{
RT_PTR_LOCAL node;
+ RT_PTR_ALLOC allocnode;
RT_PTR_ALLOC stack[RT_MAX_LEVEL] = {0};
int shift;
int level;
bool deleted;
- if (!tree->root || key > tree->max_val)
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root) || key > tree->ctl->max_val)
return false;
/*
* Descend the tree to search the key while building a stack of nodes we
* visited.
*/
- node = tree->root;
- shift = tree->root->shift;
+ allocnode = tree->ctl->root;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ shift = node->shift;
level = -1;
while (shift > 0)
{
RT_PTR_ALLOC child;
/* Push the current node to the stack */
- stack[++level] = node;
+ stack[++level] = allocnode;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
if (!RT_NODE_SEARCH_INNER(node, key, &child))
return false;
- node = child;
+ allocnode = child;
shift -= RT_NODE_SPAN;
}
/* Delete the key from the leaf node if exists */
- Assert(NODE_IS_LEAF(node));
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
deleted = RT_NODE_DELETE_LEAF(node, key);
if (!deleted)
@@ -1381,7 +1530,7 @@ RT_DELETE(RT_RADIX_TREE *tree, uint64 key)
}
/* Found the key to delete. Update the statistics */
- tree->num_keys--;
+ tree->ctl->num_keys--;
/*
* Return if the leaf node still has keys and we don't need to delete the
@@ -1391,13 +1540,14 @@ RT_DELETE(RT_RADIX_TREE *tree, uint64 key)
return true;
/* Free the empty leaf node */
- RT_FREE_NODE(tree, node);
+ RT_FREE_NODE(tree, allocnode);
/* Delete the key in inner nodes recursively */
while (level >= 0)
{
- node = stack[level--];
+ allocnode = stack[level--];
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
deleted = RT_NODE_DELETE_INNER(node, key);
Assert(deleted);
@@ -1406,7 +1556,7 @@ RT_DELETE(RT_RADIX_TREE *tree, uint64 key)
break;
/* The node became empty */
- RT_FREE_NODE(tree, node);
+ RT_FREE_NODE(tree, allocnode);
}
return true;
@@ -1478,6 +1628,7 @@ RT_BEGIN_ITERATE(RT_RADIX_TREE *tree)
{
MemoryContext old_ctx;
RT_ITER *iter;
+ RT_PTR_LOCAL root;
int top_level;
old_ctx = MemoryContextSwitchTo(tree->context);
@@ -1486,17 +1637,18 @@ RT_BEGIN_ITERATE(RT_RADIX_TREE *tree)
iter->tree = tree;
/* empty tree */
- if (!iter->tree->root)
+ if (!iter->tree->ctl->root)
return iter;
- top_level = iter->tree->root->shift / RT_NODE_SPAN;
+ root = RT_PTR_GET_LOCAL(tree, iter->tree->ctl->root);
+ top_level = root->shift / RT_NODE_SPAN;
iter->stack_len = top_level;
/*
* Descend to the left most leaf node from the root. The key is being
* constructed while descending to the leaf.
*/
- RT_UPDATE_ITER_STACK(iter, iter->tree->root, top_level);
+ RT_UPDATE_ITER_STACK(iter, root, top_level);
MemoryContextSwitchTo(old_ctx);
@@ -1511,7 +1663,7 @@ RT_SCOPE bool
RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, uint64 *value_p)
{
/* Empty tree */
- if (!iter->tree->root)
+ if (!iter->tree->ctl->root)
return false;
for (;;)
@@ -1571,7 +1723,7 @@ RT_END_ITERATE(RT_ITER *iter)
RT_SCOPE uint64
RT_NUM_ENTRIES(RT_RADIX_TREE *tree)
{
- return tree->num_keys;
+ return tree->ctl->num_keys;
}
/*
@@ -1580,13 +1732,18 @@ RT_NUM_ENTRIES(RT_RADIX_TREE *tree)
RT_SCOPE uint64
RT_MEMORY_USAGE(RT_RADIX_TREE *tree)
{
+ // XXX is this necessary?
Size total = sizeof(RT_RADIX_TREE);
+#ifdef RT_SHMEM
+ total = dsa_get_total_size(tree->dsa);
+#else
for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
{
total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
}
+#endif
return total;
}
@@ -1670,13 +1827,13 @@ void
rt_stats(RT_RADIX_TREE *tree)
{
ereport(NOTICE, (errmsg("num_keys = " UINT64_FORMAT ", height = %d, n4 = %u, n15 = %u, n32 = %u, n125 = %u, n256 = %u",
- tree->num_keys,
- tree->root->shift / RT_NODE_SPAN,
- tree->cnt[RT_CLASS_4_FULL],
- tree->cnt[RT_CLASS_32_PARTIAL],
- tree->cnt[RT_CLASS_32_FULL],
- tree->cnt[RT_CLASS_125_FULL],
- tree->cnt[RT_CLASS_256])));
+ tree->ctl->num_keys,
+ tree->ctl->root->shift / RT_NODE_SPAN,
+ tree->ctl->cnt[RT_CLASS_4_FULL],
+ tree->ctl->cnt[RT_CLASS_32_PARTIAL],
+ tree->ctl->cnt[RT_CLASS_32_FULL],
+ tree->ctl->cnt[RT_CLASS_125_FULL],
+ tree->ctl->cnt[RT_CLASS_256])));
}
static void
@@ -1848,23 +2005,23 @@ rt_dump_search(RT_RADIX_TREE *tree, uint64 key)
elog(NOTICE, "-----------------------------------------------------------");
elog(NOTICE, "max_val = " UINT64_FORMAT "(0x" UINT64_FORMAT_HEX ")",
- tree->max_val, tree->max_val);
+ tree->ctl->max_val, tree->ctl->max_val);
- if (!tree->root)
+ if (!tree->ctl->root)
{
elog(NOTICE, "tree is empty");
return;
}
- if (key > tree->max_val)
+ if (key > tree->ctl->max_val)
{
elog(NOTICE, "key " UINT64_FORMAT "(0x" UINT64_FORMAT_HEX ") is larger than max val",
key, key);
return;
}
- node = tree->root;
- shift = tree->root->shift;
+ node = tree->ctl->root;
+ shift = tree->ctl->root->shift;
while (shift >= 0)
{
RT_PTR_LOCAL child;
@@ -1901,15 +2058,15 @@ rt_dump(RT_RADIX_TREE *tree)
RT_SIZE_CLASS_INFO[i].inner_blocksize,
RT_SIZE_CLASS_INFO[i].leaf_size,
RT_SIZE_CLASS_INFO[i].leaf_blocksize);
- fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->max_val);
+ fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->ctl->max_val);
- if (!tree->root)
+ if (!tree->ctl->root)
{
fprintf(stderr, "empty tree\n");
return;
}
- rt_dump_node(tree->root, 0, true);
+ rt_dump_node(tree->ctl->root, 0, true);
}
#endif
@@ -1931,6 +2088,7 @@ rt_dump(RT_RADIX_TREE *tree)
/* type declarations */
#undef RT_RADIX_TREE
+#undef RT_RADIX_TREE_CONTROL
#undef RT_ITER
#undef RT_NODE
#undef RT_NODE_ITER
@@ -1959,6 +2117,7 @@ rt_dump(RT_RADIX_TREE *tree)
/* function declarations */
#undef RT_CREATE
#undef RT_FREE
+#undef RT_ATTACH
#undef RT_SET
#undef RT_BEGIN_ITERATE
#undef RT_ITERATE_NEXT
@@ -1980,6 +2139,8 @@ rt_dump(RT_RADIX_TREE *tree)
#undef RT_GROW_NODE_KIND
#undef RT_COPY_NODE
#undef RT_REPLACE_NODE
+#undef RT_PTR_GET_LOCAL
+#undef RT_PTR_ALLOC_IS_VALID
#undef RT_NODE_4_SEARCH_EQ
#undef RT_NODE_32_SEARCH_EQ
#undef RT_NODE_4_GET_INSERTPOS
@@ -2005,6 +2166,7 @@ rt_dump(RT_RADIX_TREE *tree)
#undef RT_SHIFT_GET_MAX_VAL
#undef RT_NODE_SEARCH_INNER
#undef RT_NODE_SEARCH_LEAF
+#undef RT_NODE_UPDATE_INNER
#undef RT_NODE_DELETE_INNER
#undef RT_NODE_DELETE_LEAF
#undef RT_NODE_INSERT_INNER
diff --git a/src/include/lib/radixtree_delete_impl.h b/src/include/lib/radixtree_delete_impl.h
index 6eefc63e19..eb87866b90 100644
--- a/src/include/lib/radixtree_delete_impl.h
+++ b/src/include/lib/radixtree_delete_impl.h
@@ -16,6 +16,12 @@
uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+#ifdef RT_NODE_LEVEL_LEAF
+ Assert(NODE_IS_LEAF(node));
+#else
+ Assert(!NODE_IS_LEAF(node));
+#endif
+
switch (node->kind)
{
case RT_NODE_KIND_4:
diff --git a/src/include/lib/radixtree_insert_impl.h b/src/include/lib/radixtree_insert_impl.h
index ff76583402..e4faf54d9d 100644
--- a/src/include/lib/radixtree_insert_impl.h
+++ b/src/include/lib/radixtree_insert_impl.h
@@ -14,11 +14,14 @@
uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
bool chunk_exists = false;
- RT_NODE *newnode = NULL;
+ RT_PTR_LOCAL newnode = NULL;
+ RT_PTR_ALLOC allocnode;
#ifdef RT_NODE_LEVEL_LEAF
+ const bool inner = false;
Assert(NODE_IS_LEAF(node));
#else
+ const bool inner = true;
Assert(!NODE_IS_LEAF(node));
#endif
@@ -45,9 +48,15 @@
if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n4)))
{
RT_NODE32_TYPE *new32;
+ const uint8 new_kind = RT_NODE_KIND_32;
+ const RT_SIZE_CLASS new_class = RT_KIND_MIN_SIZE_CLASS[new_kind];
/* grow node from 4 to 32 */
- newnode = RT_GROW_NODE_KIND(tree, node, RT_NODE_KIND_32);
+ allocnode = RT_ALLOC_NODE(tree, new_class, inner);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(newnode, new_kind, new_class, inner);
+ RT_COPY_NODE(newnode, node);
+ //newnode = RT_GROW_NODE_KIND(tree, node, RT_NODE_KIND_32);
new32 = (RT_NODE32_TYPE *) newnode;
#ifdef RT_NODE_LEVEL_LEAF
RT_CHUNK_VALUES_ARRAY_COPY(n4->base.chunks, n4->values,
@@ -57,7 +66,7 @@
new32->base.chunks, new32->children);
#endif
Assert(parent != NULL);
- RT_REPLACE_NODE(tree, parent, node, newnode, key);
+ RT_REPLACE_NODE(tree, parent, nodep, allocnode, key);
node = newnode;
}
else
@@ -112,17 +121,19 @@
n32->base.n.fanout == class32_min.fanout)
{
/* grow to the next size class of this kind */
+ const RT_SIZE_CLASS new_class = RT_CLASS_32_FULL;
+
+ allocnode = RT_ALLOC_NODE(tree, new_class, inner);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
#ifdef RT_NODE_LEVEL_LEAF
- newnode = RT_ALLOC_NODE(tree, RT_CLASS_32_FULL, false);
memcpy(newnode, node, class32_min.leaf_size);
#else
- newnode = RT_ALLOC_NODE(tree, RT_CLASS_32_FULL, true);
memcpy(newnode, node, class32_min.inner_size);
#endif
newnode->fanout = class32_max.fanout;
Assert(parent != NULL);
- RT_REPLACE_NODE(tree, parent, node, newnode, key);
+ RT_REPLACE_NODE(tree, parent, nodep, allocnode, key);
node = newnode;
/* also update pointer for this kind */
@@ -132,11 +143,17 @@
if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)))
{
RT_NODE125_TYPE *new125;
+ const uint8 new_kind = RT_NODE_KIND_125;
+ const RT_SIZE_CLASS new_class = RT_KIND_MIN_SIZE_CLASS[new_kind];
Assert(n32->base.n.fanout == class32_max.fanout);
/* grow node from 32 to 125 */
- newnode = RT_GROW_NODE_KIND(tree, node, RT_NODE_KIND_125);
+ allocnode = RT_ALLOC_NODE(tree, new_class, inner);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(newnode, new_kind, new_class, inner);
+ RT_COPY_NODE(newnode, node);
+ //newnode = RT_GROW_NODE_KIND(tree, node, RT_NODE_KIND_125);
new125 = (RT_NODE125_TYPE *) newnode;
for (int i = 0; i < class32_max.fanout; i++)
@@ -153,7 +170,7 @@
new125->base.isset[0] = (bitmapword) (((uint64) 1 << class32_max.fanout) - 1);
Assert(parent != NULL);
- RT_REPLACE_NODE(tree, parent, node, newnode, key);
+ RT_REPLACE_NODE(tree, parent, nodep, allocnode, key);
node = newnode;
}
else
@@ -204,9 +221,15 @@
if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n125)))
{
RT_NODE256_TYPE *new256;
+ const uint8 new_kind = RT_NODE_KIND_256;
+ const RT_SIZE_CLASS new_class = RT_KIND_MIN_SIZE_CLASS[new_kind];
/* grow node from 125 to 256 */
- newnode = RT_GROW_NODE_KIND(tree, node, RT_NODE_KIND_256);
+ allocnode = RT_ALLOC_NODE(tree, new_class, inner);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(newnode, new_kind, new_class, inner);
+ RT_COPY_NODE(newnode, node);
+ //newnode = RT_GROW_NODE_KIND(tree, node, RT_NODE_KIND_256);
new256 = (RT_NODE256_TYPE *) newnode;
for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n125->base.n.count; i++)
{
@@ -221,7 +244,7 @@
}
Assert(parent != NULL);
- RT_REPLACE_NODE(tree, parent, node, newnode, key);
+ RT_REPLACE_NODE(tree, parent, nodep, allocnode, key);
node = newnode;
}
else
diff --git a/src/include/lib/radixtree_iter_impl.h b/src/include/lib/radixtree_iter_impl.h
index a153011376..09d2018dc0 100644
--- a/src/include/lib/radixtree_iter_impl.h
+++ b/src/include/lib/radixtree_iter_impl.h
@@ -12,13 +12,18 @@
#error node level must be either inner or leaf
#endif
+ bool found = false;
+ uint8 key_chunk;
+
#ifdef RT_NODE_LEVEL_LEAF
uint64 value;
+
+ Assert(NODE_IS_LEAF(node_iter->node));
#else
- RT_NODE *child = NULL;
+ RT_PTR_LOCAL child = NULL;
+
+ Assert(!NODE_IS_LEAF(node_iter->node));
#endif
- bool found = false;
- uint8 key_chunk;
switch (node_iter->node->kind)
{
@@ -32,7 +37,7 @@
#ifdef RT_NODE_LEVEL_LEAF
value = n4->values[node_iter->current_idx];
#else
- child = n4->children[node_iter->current_idx];
+ child = RT_PTR_GET_LOCAL(iter->tree, n4->children[node_iter->current_idx]);
#endif
key_chunk = n4->base.chunks[node_iter->current_idx];
found = true;
@@ -49,7 +54,7 @@
#ifdef RT_NODE_LEVEL_LEAF
value = n32->values[node_iter->current_idx];
#else
- child = n32->children[node_iter->current_idx];
+ child = RT_PTR_GET_LOCAL(iter->tree, n32->children[node_iter->current_idx]);
#endif
key_chunk = n32->base.chunks[node_iter->current_idx];
found = true;
@@ -73,7 +78,7 @@
#ifdef RT_NODE_LEVEL_LEAF
value = RT_NODE_LEAF_125_GET_VALUE(n125, i);
#else
- child = RT_NODE_INNER_125_GET_CHILD(n125, i);
+ child = RT_PTR_GET_LOCAL(iter->tree, RT_NODE_INNER_125_GET_CHILD(n125, i));
#endif
key_chunk = i;
found = true;
@@ -101,7 +106,7 @@
#ifdef RT_NODE_LEVEL_LEAF
value = RT_NODE_LEAF_256_GET_VALUE(n256, i);
#else
- child = RT_NODE_INNER_256_GET_CHILD(n256, i);
+ child = RT_PTR_GET_LOCAL(iter->tree, RT_NODE_INNER_256_GET_CHILD(n256, i));
#endif
key_chunk = i;
found = true;
diff --git a/src/include/lib/radixtree_search_impl.h b/src/include/lib/radixtree_search_impl.h
index cbc357dcc8..3e97c31c2c 100644
--- a/src/include/lib/radixtree_search_impl.h
+++ b/src/include/lib/radixtree_search_impl.h
@@ -16,8 +16,13 @@
#ifdef RT_NODE_LEVEL_LEAF
uint64 value = 0;
+
+ Assert(NODE_IS_LEAF(node));
#else
- RT_PTR_LOCAL child = NULL;
+#ifndef RT_ACTION_UPDATE
+ RT_PTR_ALLOC child = RT_INVALID_PTR_ALLOC;
+#endif
+ Assert(!NODE_IS_LEAF(node));
#endif
switch (node->kind)
@@ -32,8 +37,12 @@
#ifdef RT_NODE_LEVEL_LEAF
value = n4->values[idx];
+#else
+#ifdef RT_ACTION_UPDATE
+ n4->children[idx] = new_child;
#else
child = n4->children[idx];
+#endif
#endif
break;
}
@@ -47,22 +56,31 @@
#ifdef RT_NODE_LEVEL_LEAF
value = n32->values[idx];
+#else
+#ifdef RT_ACTION_UPDATE
+ n32->children[idx] = new_child;
#else
child = n32->children[idx];
+#endif
#endif
break;
}
case RT_NODE_KIND_125:
{
RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+ int slotpos = n125->base.slot_idxs[chunk];
- if (!RT_NODE_125_IS_CHUNK_USED((RT_NODE_BASE_125 *) n125, chunk))
+ if (slotpos == RT_NODE_125_INVALID_IDX)
return false;
#ifdef RT_NODE_LEVEL_LEAF
value = RT_NODE_LEAF_125_GET_VALUE(n125, chunk);
+#else
+#ifdef RT_ACTION_UPDATE
+ n125->children[slotpos] = new_child;
#else
child = RT_NODE_INNER_125_GET_CHILD(n125, chunk);
+#endif
#endif
break;
}
@@ -79,19 +97,25 @@
#ifdef RT_NODE_LEVEL_LEAF
value = RT_NODE_LEAF_256_GET_VALUE(n256, chunk);
+#else
+#ifdef RT_ACTION_UPDATE
+ RT_NODE_INNER_256_SET(n256, chunk, new_child);
#else
child = RT_NODE_INNER_256_GET_CHILD(n256, chunk);
+#endif
#endif
break;
}
}
+#ifndef RT_ACTION_UPDATE
#ifdef RT_NODE_LEVEL_LEAF
Assert(value_p != NULL);
*value_p = value;
#else
Assert(child_p != NULL);
*child_p = child;
+#endif
#endif
return true;
diff --git a/src/include/utils/dsa.h b/src/include/utils/dsa.h
index 104386e674..c67f936880 100644
--- a/src/include/utils/dsa.h
+++ b/src/include/utils/dsa.h
@@ -117,6 +117,7 @@ extern dsa_handle dsa_get_handle(dsa_area *area);
extern dsa_pointer dsa_allocate_extended(dsa_area *area, size_t size, int flags);
extern void dsa_free(dsa_area *area, dsa_pointer dp);
extern void *dsa_get_address(dsa_area *area, dsa_pointer dp);
+extern size_t dsa_get_total_size(dsa_area *area);
extern void dsa_trim(dsa_area *area);
extern void dsa_dump(dsa_area *area);
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
index 2256d08100..61d842789d 100644
--- a/src/test/modules/test_radixtree/test_radixtree.c
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -18,6 +18,7 @@
#include "nodes/bitmapset.h"
#include "storage/block.h"
#include "storage/itemptr.h"
+#include "storage/lwlock.h"
#include "utils/memutils.h"
#include "utils/timestamp.h"
@@ -103,6 +104,8 @@ static const test_spec test_specs[] = {
#define RT_SCOPE static
#define RT_DECLARE
#define RT_DEFINE
+// WIP: compiles with warnings because rt_attach is defined but not used
+// #define RT_SHMEM
#include "lib/radixtree.h"
@@ -119,7 +122,15 @@ test_empty(void)
uint64 key;
uint64 val;
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+ dsa = dsa_create(tranche_id);
+
+ radixtree = rt_create(CurrentMemoryContext, dsa);
+#else
radixtree = rt_create(CurrentMemoryContext);
+#endif
if (rt_search(radixtree, 0, &dummy))
elog(ERROR, "rt_search on empty tree returned true");
@@ -153,10 +164,20 @@ test_basic(int children, bool test_inner)
uint64 *keys;
int shift = test_inner ? 8 : 0;
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+ dsa = dsa_create(tranche_id);
+#endif
+
elog(NOTICE, "testing basic operations with %s node %d",
test_inner ? "inner" : "leaf", children);
+#ifdef RT_SHMEM
+ radixtree = rt_create(CurrentMemoryContext, dsa);
+#else
radixtree = rt_create(CurrentMemoryContext);
+#endif
/* prepare keys in order like 1, 32, 2, 31, 2, ... */
keys = palloc(sizeof(uint64) * children);
@@ -297,9 +318,19 @@ test_node_types(uint8 shift)
{
rt_radix_tree *radixtree;
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+ dsa = dsa_create(tranche_id);
+#endif
+
elog(NOTICE, "testing radix tree node types with shift \"%d\"", shift);
+#ifdef RT_SHMEM
+ radixtree = rt_create(CurrentMemoryContext, dsa);
+#else
radixtree = rt_create(CurrentMemoryContext);
+#endif
/*
* Insert and search entries for every node type at the 'shift' level,
@@ -332,6 +363,11 @@ test_pattern(const test_spec * spec)
int patternlen;
uint64 *pattern_values;
uint64 pattern_num_values;
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+ dsa = dsa_create(tranche_id);
+#endif
elog(NOTICE, "testing radix tree with pattern \"%s\"", spec->test_name);
if (rt_test_stats)
@@ -357,7 +393,13 @@ test_pattern(const test_spec * spec)
"radixtree test",
ALLOCSET_SMALL_SIZES);
MemoryContextSetIdentifier(radixtree_ctx, spec->test_name);
+
+#ifdef RT_SHMEM
+ radixtree = rt_create(radixtree_ctx, dsa);
+#else
radixtree = rt_create(radixtree_ctx);
+#endif
+
/*
* Add values to the set.
@@ -563,6 +605,7 @@ test_pattern(const test_spec * spec)
elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT "after " UINT64_FORMAT " deletion",
nafter, (nbefore - ndeleted), ndeleted);
+ rt_free(radixtree);
MemoryContextDelete(radixtree_ctx);
}
--
2.39.0
Attachment: v17-0008-Invent-specific-pointer-macros.patch (text/x-patch, charset=US-ASCII)
From 46ac0171f5a3bd80dfea8ad4061b1567650b8061 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Fri, 6 Jan 2023 14:20:51 +0700
Subject: [PATCH v17 8/9] Invent specific pointer macros
RT_PTR_LOCAL - a normal pointer to local memory
RT_PTR_ALLOC - the result of allocation, possibly a DSA pointer
RT_EXTEND and RT_SET_EXTEND have some code changes to show
how these are meant to be treated differently, but most of that
work is punted until later.
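To make the distinction concrete, here is a rough sketch, not part of the
patch, of how the two macros are expected to diverge once the tree can live
in a DSA area. The RT_SHMEM branch follows the DSA-support patch earlier in
this mail, which resolves allocated pointers with dsa_get_address(); the
exact definitions here are illustration only:

/*
 * Sketch only: local pointers vs. "allocated" pointers.  Under RT_SHMEM the
 * allocated form is a dsa_pointer that must be translated before use; in the
 * local-memory build the two are the same thing.
 */
#ifdef RT_SHMEM
#define RT_PTR_ALLOC dsa_pointer		/* relative pointer, valid in any backend */
#define RT_INVALID_PTR_ALLOC InvalidDsaPointer
#define RT_PTR_GET_LOCAL(tree, ptr) \
	((RT_PTR_LOCAL) dsa_get_address((tree)->dsa, (ptr)))
#else
#define RT_PTR_ALLOC RT_PTR_LOCAL		/* plain pointer into backend-local memory */
#define RT_INVALID_PTR_ALLOC NULL
#define RT_PTR_GET_LOCAL(tree, ptr) (ptr)
#endif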
---
src/include/lib/radixtree.h | 165 +++++++++++++-----------
src/include/lib/radixtree_search_impl.h | 2 +-
2 files changed, 89 insertions(+), 78 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index e4350730b7..b3d84da033 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -301,8 +301,12 @@ typedef struct RT_NODE
uint8 kind;
} RT_NODE;
-#define NODE_IS_LEAF(n) (((RT_NODE *) (n))->shift == 0)
-#define NODE_IS_EMPTY(n) (((RT_NODE *) (n))->count == 0)
+#define RT_PTR_LOCAL RT_NODE *
+
+#define RT_PTR_ALLOC RT_PTR_LOCAL
+
+#define NODE_IS_LEAF(n) (((RT_PTR_LOCAL) (n))->shift == 0)
+#define NODE_IS_EMPTY(n) (((RT_PTR_LOCAL) (n))->count == 0)
#define VAR_NODE_HAS_FREE_SLOT(node) \
((node)->base.n.count < (node)->base.n.fanout)
#define FIXED_NODE_HAS_FREE_SLOT(node, class) \
@@ -366,7 +370,7 @@ typedef struct RT_NODE_INNER_4
RT_NODE_BASE_4 base;
/* number of children depends on size class */
- RT_NODE *children[FLEXIBLE_ARRAY_MEMBER];
+ RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
} RT_NODE_INNER_4;
typedef struct RT_NODE_LEAF_4
@@ -382,7 +386,7 @@ typedef struct RT_NODE_INNER_32
RT_NODE_BASE_32 base;
/* number of children depends on size class */
- RT_NODE *children[FLEXIBLE_ARRAY_MEMBER];
+ RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
} RT_NODE_INNER_32;
typedef struct RT_NODE_LEAF_32
@@ -398,7 +402,7 @@ typedef struct RT_NODE_INNER_125
RT_NODE_BASE_125 base;
/* number of children depends on size class */
- RT_NODE *children[FLEXIBLE_ARRAY_MEMBER];
+ RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
} RT_NODE_INNER_125;
typedef struct RT_NODE_LEAF_125
@@ -418,7 +422,7 @@ typedef struct RT_NODE_INNER_256
RT_NODE_BASE_256 base;
/* Slots for 256 children */
- RT_NODE *children[RT_NODE_MAX_SLOTS];
+ RT_PTR_ALLOC children[RT_NODE_MAX_SLOTS];
} RT_NODE_INNER_256;
typedef struct RT_NODE_LEAF_256
@@ -458,33 +462,33 @@ static const RT_SIZE_CLASS_ELEM RT_SIZE_CLASS_INFO[] = {
[RT_CLASS_4_FULL] = {
.name = "radix tree node 4",
.fanout = 4,
- .inner_size = sizeof(RT_NODE_INNER_4) + 4 * sizeof(RT_NODE *),
+ .inner_size = sizeof(RT_NODE_INNER_4) + 4 * sizeof(RT_PTR_ALLOC),
.leaf_size = sizeof(RT_NODE_LEAF_4) + 4 * sizeof(uint64),
- .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_4) + 4 * sizeof(RT_NODE *)),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_4) + 4 * sizeof(RT_PTR_ALLOC)),
.leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_4) + 4 * sizeof(uint64)),
},
[RT_CLASS_32_PARTIAL] = {
.name = "radix tree node 15",
.fanout = 15,
- .inner_size = sizeof(RT_NODE_INNER_32) + 15 * sizeof(RT_NODE *),
+ .inner_size = sizeof(RT_NODE_INNER_32) + 15 * sizeof(RT_PTR_ALLOC),
.leaf_size = sizeof(RT_NODE_LEAF_32) + 15 * sizeof(uint64),
- .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_32) + 15 * sizeof(RT_NODE *)),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_32) + 15 * sizeof(RT_PTR_ALLOC)),
.leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_32) + 15 * sizeof(uint64)),
},
[RT_CLASS_32_FULL] = {
.name = "radix tree node 32",
.fanout = 32,
- .inner_size = sizeof(RT_NODE_INNER_32) + 32 * sizeof(RT_NODE *),
+ .inner_size = sizeof(RT_NODE_INNER_32) + 32 * sizeof(RT_PTR_ALLOC),
.leaf_size = sizeof(RT_NODE_LEAF_32) + 32 * sizeof(uint64),
- .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_32) + 32 * sizeof(RT_NODE *)),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_32) + 32 * sizeof(RT_PTR_ALLOC)),
.leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_32) + 32 * sizeof(uint64)),
},
[RT_CLASS_125_FULL] = {
.name = "radix tree node 125",
.fanout = 125,
- .inner_size = sizeof(RT_NODE_INNER_125) + 125 * sizeof(RT_NODE *),
+ .inner_size = sizeof(RT_NODE_INNER_125) + 125 * sizeof(RT_PTR_ALLOC),
.leaf_size = sizeof(RT_NODE_LEAF_125) + 125 * sizeof(uint64),
- .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_125) + 125 * sizeof(RT_NODE *)),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_125) + 125 * sizeof(RT_PTR_ALLOC)),
.leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_125) + 125 * sizeof(uint64)),
},
[RT_CLASS_256] = {
@@ -512,7 +516,7 @@ typedef struct RT_RADIX_TREE
{
MemoryContext context;
- RT_NODE *root;
+ RT_PTR_ALLOC root;
uint64 max_val;
uint64 num_keys;
@@ -541,7 +545,7 @@ typedef struct RT_RADIX_TREE
*/
typedef struct RT_NODE_ITER
{
- RT_NODE *node; /* current node being iterated */
+ RT_PTR_LOCAL node; /* current node being iterated */
int current_idx; /* current position. -1 for initial value */
} RT_NODE_ITER;
@@ -558,13 +562,13 @@ typedef struct RT_ITER
} RT_ITER;
-static bool RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_NODE *parent, RT_NODE *node,
- uint64 key, RT_NODE *child);
-static bool RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_NODE *parent, RT_NODE *node,
+static bool RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_LOCAL node,
+ uint64 key, RT_PTR_LOCAL child);
+static bool RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_LOCAL node,
uint64 key, uint64 value);
/* verification (available only with assertion) */
-static void RT_VERIFY_NODE(RT_NODE *node);
+static void RT_VERIFY_NODE(RT_PTR_LOCAL node);
/*
* Return index of the first element in 'base' that equals 'key'. Return -1
@@ -713,10 +717,10 @@ RT_NODE_32_GET_INSERTPOS(RT_NODE_BASE_32 *node, uint8 chunk)
/* Shift the elements right at 'idx' by one */
static inline void
-RT_CHUNK_CHILDREN_ARRAY_SHIFT(uint8 *chunks, RT_NODE **children, int count, int idx)
+RT_CHUNK_CHILDREN_ARRAY_SHIFT(uint8 *chunks, RT_PTR_ALLOC *children, int count, int idx)
{
memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
- memmove(&(children[idx + 1]), &(children[idx]), sizeof(RT_NODE *) * (count - idx));
+ memmove(&(children[idx + 1]), &(children[idx]), sizeof(RT_PTR_ALLOC) * (count - idx));
}
static inline void
@@ -728,10 +732,10 @@ RT_CHUNK_VALUES_ARRAY_SHIFT(uint8 *chunks, uint64 *values, int count, int idx)
/* Delete the element at 'idx' */
static inline void
-RT_CHUNK_CHILDREN_ARRAY_DELETE(uint8 *chunks, RT_NODE **children, int count, int idx)
+RT_CHUNK_CHILDREN_ARRAY_DELETE(uint8 *chunks, RT_PTR_ALLOC *children, int count, int idx)
{
memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
- memmove(&(children[idx]), &(children[idx + 1]), sizeof(RT_NODE *) * (count - idx - 1));
+ memmove(&(children[idx]), &(children[idx + 1]), sizeof(RT_PTR_ALLOC) * (count - idx - 1));
}
static inline void
@@ -743,12 +747,12 @@ RT_CHUNK_VALUES_ARRAY_DELETE(uint8 *chunks, uint64 *values, int count, int idx)
/* Copy both chunks and children/values arrays */
static inline void
-RT_CHUNK_CHILDREN_ARRAY_COPY(uint8 *src_chunks, RT_NODE **src_children,
- uint8 *dst_chunks, RT_NODE **dst_children)
+RT_CHUNK_CHILDREN_ARRAY_COPY(uint8 *src_chunks, RT_PTR_ALLOC *src_children,
+ uint8 *dst_chunks, RT_PTR_ALLOC *dst_children)
{
const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_4_FULL].fanout;
const Size chunk_size = sizeof(uint8) * fanout;
- const Size children_size = sizeof(RT_NODE *) * fanout;
+ const Size children_size = sizeof(RT_PTR_ALLOC) * fanout;
memcpy(dst_chunks, src_chunks, chunk_size);
memcpy(dst_children, src_children, children_size);
@@ -775,7 +779,7 @@ RT_NODE_125_IS_CHUNK_USED(RT_NODE_BASE_125 *node, uint8 chunk)
return node->slot_idxs[chunk] != RT_NODE_125_INVALID_IDX;
}
-static inline RT_NODE *
+static inline RT_PTR_ALLOC
RT_NODE_INNER_125_GET_CHILD(RT_NODE_INNER_125 *node, uint8 chunk)
{
Assert(!NODE_IS_LEAF(node));
@@ -810,7 +814,7 @@ RT_NODE_LEAF_256_IS_CHUNK_USED(RT_NODE_LEAF_256 *node, uint8 chunk)
return (node->isset[idx] & ((bitmapword) 1 << bitnum)) != 0;
}
-static inline RT_NODE *
+static inline RT_PTR_ALLOC
RT_NODE_INNER_256_GET_CHILD(RT_NODE_INNER_256 *node, uint8 chunk)
{
Assert(!NODE_IS_LEAF(node));
@@ -828,7 +832,7 @@ RT_NODE_LEAF_256_GET_VALUE(RT_NODE_LEAF_256 *node, uint8 chunk)
/* Set the child in the node-256 */
static inline void
-RT_NODE_INNER_256_SET(RT_NODE_INNER_256 *node, uint8 chunk, RT_NODE *child)
+RT_NODE_INNER_256_SET(RT_NODE_INNER_256 *node, uint8 chunk, RT_PTR_ALLOC child)
{
Assert(!NODE_IS_LEAF(node));
node->children[chunk] = child;
@@ -890,16 +894,16 @@ RT_SHIFT_GET_MAX_VAL(int shift)
/*
* Allocate a new node with the given node kind.
*/
-static RT_NODE *
+static RT_PTR_ALLOC
RT_ALLOC_NODE(RT_RADIX_TREE *tree, RT_SIZE_CLASS size_class, bool inner)
{
- RT_NODE *newnode;
+ RT_PTR_ALLOC newnode;
if (inner)
- newnode = (RT_NODE *) MemoryContextAlloc(tree->inner_slabs[size_class],
+ newnode = (RT_PTR_ALLOC) MemoryContextAlloc(tree->inner_slabs[size_class],
RT_SIZE_CLASS_INFO[size_class].inner_size);
else
- newnode = (RT_NODE *) MemoryContextAlloc(tree->leaf_slabs[size_class],
+ newnode = (RT_PTR_ALLOC) MemoryContextAlloc(tree->leaf_slabs[size_class],
RT_SIZE_CLASS_INFO[size_class].leaf_size);
#ifdef RT_DEBUG
@@ -912,7 +916,7 @@ RT_ALLOC_NODE(RT_RADIX_TREE *tree, RT_SIZE_CLASS size_class, bool inner)
/* Initialize the node contents */
static inline void
-RT_INIT_NODE(RT_NODE *node, uint8 kind, RT_SIZE_CLASS size_class, bool inner)
+RT_INIT_NODE(RT_PTR_LOCAL node, uint8 kind, RT_SIZE_CLASS size_class, bool inner)
{
if (inner)
MemSet(node, 0, RT_SIZE_CLASS_INFO[size_class].inner_size);
@@ -947,7 +951,7 @@ RT_NEW_ROOT(RT_RADIX_TREE *tree, uint64 key)
{
int shift = RT_KEY_GET_SHIFT(key);
bool inner = shift > 0;
- RT_NODE *newnode;
+ RT_PTR_ALLOC newnode;
newnode = RT_ALLOC_NODE(tree, RT_CLASS_4_FULL, inner);
RT_INIT_NODE(newnode, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
@@ -957,7 +961,7 @@ RT_NEW_ROOT(RT_RADIX_TREE *tree, uint64 key)
}
static inline void
-RT_COPY_NODE(RT_NODE *newnode, RT_NODE *oldnode)
+RT_COPY_NODE(RT_PTR_LOCAL newnode, RT_PTR_LOCAL oldnode)
{
newnode->shift = oldnode->shift;
newnode->chunk = oldnode->chunk;
@@ -969,9 +973,9 @@ RT_COPY_NODE(RT_NODE *newnode, RT_NODE *oldnode)
* count of 'node'.
*/
static RT_NODE*
-RT_GROW_NODE_KIND(RT_RADIX_TREE *tree, RT_NODE *node, uint8 new_kind)
+RT_GROW_NODE_KIND(RT_RADIX_TREE *tree, RT_PTR_LOCAL node, uint8 new_kind)
{
- RT_NODE *newnode;
+ RT_PTR_ALLOC newnode;
bool inner = !NODE_IS_LEAF(node);
newnode = RT_ALLOC_NODE(tree, RT_KIND_MIN_SIZE_CLASS[new_kind], inner);
@@ -983,7 +987,7 @@ RT_GROW_NODE_KIND(RT_RADIX_TREE *tree, RT_NODE *node, uint8 new_kind)
/* Free the given node */
static void
-RT_FREE_NODE(RT_RADIX_TREE *tree, RT_NODE *node)
+RT_FREE_NODE(RT_RADIX_TREE *tree, RT_PTR_ALLOC node)
{
/* If we're deleting the root node, make the tree empty */
if (tree->root == node)
@@ -1019,8 +1023,8 @@ RT_FREE_NODE(RT_RADIX_TREE *tree, RT_NODE *node)
* Replace old_child with new_child, and free the old one.
*/
static void
-RT_REPLACE_NODE(RT_RADIX_TREE *tree, RT_NODE *parent, RT_NODE *old_child,
- RT_NODE *new_child, uint64 key)
+RT_REPLACE_NODE(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC old_child,
+ RT_PTR_ALLOC new_child, uint64 key)
{
Assert(old_child->chunk == new_child->chunk);
Assert(old_child->shift == new_child->shift);
@@ -1056,17 +1060,22 @@ RT_EXTEND(RT_RADIX_TREE *tree, uint64 key)
/* Grow tree from 'shift' to 'target_shift' */
while (shift <= target_shift)
{
- RT_NODE_INNER_4 *node;
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL node;
+ RT_NODE_INNER_4 *n4;
+
+ allocnode = RT_ALLOC_NODE(tree, RT_CLASS_4_FULL, true);
+ node = (RT_PTR_LOCAL) allocnode;
+ RT_INIT_NODE(node, RT_NODE_KIND_4, RT_CLASS_4_FULL, true);
+ node->shift = shift;
+ node->count = 1;
- node = (RT_NODE_INNER_4 *) RT_ALLOC_NODE(tree, RT_CLASS_4_FULL, true);
- RT_INIT_NODE((RT_NODE *) node, RT_NODE_KIND_4, RT_CLASS_4_FULL, true);
- node->base.n.shift = shift;
- node->base.n.count = 1;
- node->base.chunks[0] = 0;
- node->children[0] = tree->root;
+ n4 = (RT_NODE_INNER_4 *) node;
+ n4->base.chunks[0] = 0;
+ n4->children[0] = tree->root;
tree->root->chunk = 0;
- tree->root = (RT_NODE *) node;
+ tree->root = node;
shift += RT_NODE_SPAN;
}
@@ -1079,18 +1088,20 @@ RT_EXTEND(RT_RADIX_TREE *tree, uint64 key)
* Insert inner and leaf nodes from 'node' to bottom.
*/
static inline void
-RT_SET_EXTEND(RT_RADIX_TREE *tree, uint64 key, uint64 value, RT_NODE *parent,
- RT_NODE *node)
+RT_SET_EXTEND(RT_RADIX_TREE *tree, uint64 key, uint64 value, RT_PTR_LOCAL parent,
+ RT_PTR_LOCAL node)
{
int shift = node->shift;
while (shift >= RT_NODE_SPAN)
{
- RT_NODE *newchild;
+ RT_PTR_ALLOC allocchild;
+ RT_PTR_LOCAL newchild;
int newshift = shift - RT_NODE_SPAN;
bool inner = newshift > 0;
- newchild = RT_ALLOC_NODE(tree, RT_CLASS_4_FULL, inner);
+ allocchild = RT_ALLOC_NODE(tree, RT_CLASS_4_FULL, inner);
+ newchild = (RT_PTR_LOCAL) allocchild;
RT_INIT_NODE(newchild, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
newchild->shift = newshift;
newchild->chunk = RT_GET_KEY_CHUNK(key, node->shift);
@@ -1112,7 +1123,7 @@ RT_SET_EXTEND(RT_RADIX_TREE *tree, uint64 key, uint64 value, RT_NODE *parent,
* pointer is set to child_p.
*/
static inline bool
-RT_NODE_SEARCH_INNER(RT_NODE *node, uint64 key, RT_NODE **child_p)
+RT_NODE_SEARCH_INNER(RT_PTR_LOCAL node, uint64 key, RT_PTR_ALLOC *child_p)
{
#define RT_NODE_LEVEL_INNER
#include "lib/radixtree_search_impl.h"
@@ -1126,7 +1137,7 @@ RT_NODE_SEARCH_INNER(RT_NODE *node, uint64 key, RT_NODE **child_p)
* to the value is set to value_p.
*/
static inline bool
-RT_NODE_SEARCH_LEAF(RT_NODE *node, uint64 key, uint64 *value_p)
+RT_NODE_SEARCH_LEAF(RT_PTR_LOCAL node, uint64 key, uint64 *value_p)
{
#define RT_NODE_LEVEL_LEAF
#include "lib/radixtree_search_impl.h"
@@ -1139,7 +1150,7 @@ RT_NODE_SEARCH_LEAF(RT_NODE *node, uint64 key, uint64 *value_p)
* Delete the node and return true if the key is found, otherwise return false.
*/
static inline bool
-RT_NODE_DELETE_INNER(RT_NODE *node, uint64 key)
+RT_NODE_DELETE_INNER(RT_PTR_LOCAL node, uint64 key)
{
#define RT_NODE_LEVEL_INNER
#include "lib/radixtree_delete_impl.h"
@@ -1152,7 +1163,7 @@ RT_NODE_DELETE_INNER(RT_NODE *node, uint64 key)
* Delete the node and return true if the key is found, otherwise return false.
*/
static inline bool
-RT_NODE_DELETE_LEAF(RT_NODE *node, uint64 key)
+RT_NODE_DELETE_LEAF(RT_PTR_LOCAL node, uint64 key)
{
#define RT_NODE_LEVEL_LEAF
#include "lib/radixtree_delete_impl.h"
@@ -1161,8 +1172,8 @@ RT_NODE_DELETE_LEAF(RT_NODE *node, uint64 key)
/* Insert the child to the inner node */
static bool
-RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_NODE *parent, RT_NODE *node, uint64 key,
- RT_NODE *child)
+RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_LOCAL node, uint64 key,
+ RT_PTR_ALLOC child)
{
#define RT_NODE_LEVEL_INNER
#include "lib/radixtree_insert_impl.h"
@@ -1171,7 +1182,7 @@ RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_NODE *parent, RT_NODE *node, uint64
/* Insert the value to the leaf node */
static bool
-RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_NODE *parent, RT_NODE *node,
+RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_LOCAL node,
uint64 key, uint64 value)
{
#define RT_NODE_LEVEL_LEAF
@@ -1241,8 +1252,8 @@ RT_SET(RT_RADIX_TREE *tree, uint64 key, uint64 value)
{
int shift;
bool updated;
- RT_NODE *node;
- RT_NODE *parent;
+ RT_PTR_LOCAL node;
+ RT_PTR_LOCAL parent;
/* Empty tree, create the root */
if (!tree->root)
@@ -1260,7 +1271,7 @@ RT_SET(RT_RADIX_TREE *tree, uint64 key, uint64 value)
/* Descend the tree until a leaf node */
while (shift >= 0)
{
- RT_NODE *child;
+ RT_PTR_LOCAL child;
if (NODE_IS_LEAF(node))
break;
@@ -1293,7 +1304,7 @@ RT_SET(RT_RADIX_TREE *tree, uint64 key, uint64 value)
RT_SCOPE bool
RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, uint64 *value_p)
{
- RT_NODE *node;
+ RT_PTR_LOCAL node;
int shift;
Assert(value_p != NULL);
@@ -1307,7 +1318,7 @@ RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, uint64 *value_p)
/* Descend the tree until a leaf node */
while (shift >= 0)
{
- RT_NODE *child;
+ RT_PTR_ALLOC child;
if (NODE_IS_LEAF(node))
break;
@@ -1329,8 +1340,8 @@ RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, uint64 *value_p)
RT_SCOPE bool
RT_DELETE(RT_RADIX_TREE *tree, uint64 key)
{
- RT_NODE *node;
- RT_NODE *stack[RT_MAX_LEVEL] = {0};
+ RT_PTR_LOCAL node;
+ RT_PTR_ALLOC stack[RT_MAX_LEVEL] = {0};
int shift;
int level;
bool deleted;
@@ -1347,7 +1358,7 @@ RT_DELETE(RT_RADIX_TREE *tree, uint64 key)
level = -1;
while (shift > 0)
{
- RT_NODE *child;
+ RT_PTR_ALLOC child;
/* Push the current node to the stack */
stack[++level] = node;
@@ -1412,7 +1423,7 @@ RT_ITER_UPDATE_KEY(RT_ITER *iter, uint8 chunk, uint8 shift)
* Advance the slot in the inner node. Return the child if exists, otherwise
* null.
*/
-static inline RT_NODE *
+static inline RT_PTR_LOCAL
RT_NODE_INNER_ITERATE_NEXT(RT_ITER *iter, RT_NODE_ITER *node_iter)
{
#define RT_NODE_LEVEL_INNER
@@ -1437,10 +1448,10 @@ RT_NODE_LEAF_ITERATE_NEXT(RT_ITER *iter, RT_NODE_ITER *node_iter,
* Update each node_iter for inner nodes in the iterator node stack.
*/
static void
-RT_UPDATE_ITER_STACK(RT_ITER *iter, RT_NODE *from_node, int from)
+RT_UPDATE_ITER_STACK(RT_ITER *iter, RT_PTR_LOCAL from_node, int from)
{
int level = from;
- RT_NODE *node = from_node;
+ RT_PTR_LOCAL node = from_node;
for (;;)
{
@@ -1505,7 +1516,7 @@ RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, uint64 *value_p)
for (;;)
{
- RT_NODE *child = NULL;
+ RT_PTR_LOCAL child = NULL;
uint64 value;
int level;
bool found;
@@ -1584,7 +1595,7 @@ RT_MEMORY_USAGE(RT_RADIX_TREE *tree)
* Verify the radix tree node.
*/
static void
-RT_VERIFY_NODE(RT_NODE *node)
+RT_VERIFY_NODE(RT_PTR_LOCAL node)
{
#ifdef USE_ASSERT_CHECKING
Assert(node->count >= 0);
@@ -1669,7 +1680,7 @@ rt_stats(RT_RADIX_TREE *tree)
}
static void
-rt_dump_node(RT_NODE *node, int level, bool recurse)
+rt_dump_node(RT_PTR_LOCAL node, int level, bool recurse)
{
char space[125] = {0};
@@ -1831,7 +1842,7 @@ rt_dump_node(RT_NODE *node, int level, bool recurse)
void
rt_dump_search(RT_RADIX_TREE *tree, uint64 key)
{
- RT_NODE *node;
+ RT_PTR_LOCAL node;
int shift;
int level = 0;
@@ -1856,7 +1867,7 @@ rt_dump_search(RT_RADIX_TREE *tree, uint64 key)
shift = tree->root->shift;
while (shift >= 0)
{
- RT_NODE *child;
+ RT_PTR_LOCAL child;
rt_dump_node(node, level, false);
diff --git a/src/include/lib/radixtree_search_impl.h b/src/include/lib/radixtree_search_impl.h
index 1a0d2d3f1f..cbc357dcc8 100644
--- a/src/include/lib/radixtree_search_impl.h
+++ b/src/include/lib/radixtree_search_impl.h
@@ -17,7 +17,7 @@
#ifdef RT_NODE_LEVEL_LEAF
uint64 value = 0;
#else
- RT_NODE *child = NULL;
+ RT_PTR_LOCAL child = NULL;
#endif
switch (node->kind)
--
2.39.0
Attachment: v17-0007-Convert-radixtree.h-into-a-template.patch (text/x-patch, charset=US-ASCII)
From b4857416c4030057a79cf52cdd7ffff88f55f73c Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Wed, 4 Jan 2023 14:43:17 +0700
Subject: [PATCH v17 7/9] Convert radixtree.h into a template
The only things configurable at this point are the function scope
and prefix, since the point is to see if this makes a shared-memory
implementation clear and maintainable.
The key and value types are still hard-coded to uint64.
To make this more useful, at least the value type should be
configurable.
It might be good at some point to offer a different tree type,
e.g. "single-value leaves", to allow for variable-length keys
and values, giving full flexibility to developers.
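For reference, a minimal usage sketch, not part of the patch, of how a
caller instantiates the template. It is modeled on the test_radixtree.c
changes above; the RT_PREFIX value and the example function are
illustrative assumptions:

/* Hypothetical instantiation of the radix tree template (sketch only). */
#define RT_PREFIX rt			/* generates rt_radix_tree, rt_create(), rt_set(), ... */
#define RT_SCOPE static
#define RT_DECLARE				/* emit type and function declarations */
#define RT_DEFINE				/* emit function definitions */
#include "lib/radixtree.h"		/* #undef's all RT_* parameters afterwards */

static void
radixtree_usage_example(void)
{
	rt_radix_tree *tree = rt_create(CurrentMemoryContext);
	uint64		value;

	rt_set(tree, 42, 100);		/* insert key 42 with value 100 */
	if (rt_search(tree, 42, &value))
		Assert(value == 100);
	rt_free(tree);
}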
---
src/include/lib/radixtree.h | 987 +++++++++++-------
src/include/lib/radixtree_delete_impl.h | 36 +-
src/include/lib/radixtree_insert_impl.h | 92 +-
src/include/lib/radixtree_iter_impl.h | 34 +-
src/include/lib/radixtree_search_impl.h | 36 +-
.../modules/test_radixtree/test_radixtree.c | 23 +-
6 files changed, 718 insertions(+), 490 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index fe517793f4..e4350730b7 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -29,24 +29,41 @@
*
* XXX: the radix tree node never be shrunk.
*
+ * To generate a radix tree and associated functions for a use case several
+ * macros have to be #define'ed before this file is included. Including
+ * the file #undef's all those, so a new radix tree can be generated
+ * afterwards.
+ * The relevant parameters are:
+ * - RT_PREFIX - prefix for all symbol names generated. A prefix of 'foo'
+ * will result in radix tree type 'foo_radix_tree' and functions like
+ * 'foo_create'/'foo_free' and so forth.
+ * - RT_DECLARE - if defined function prototypes and type declarations are
+ * generated
+ * - RT_DEFINE - if defined function definitions are generated
+ * - RT_SCOPE - in which scope (e.g. extern, static inline) do function
+ * declarations reside
+ *
+ * Optional parameters:
+ * - RT_DEBUG - if defined add stats tracking and debugging functions
+ *
* Interface
* ---------
*
- * rt_create - Create a new, empty radix tree
- * rt_free - Free the radix tree
- * rt_search - Search a key-value pair
- * rt_set - Set a key-value pair
- * rt_delete - Delete a key-value pair
- * rt_begin_iterate - Begin iterating through all key-value pairs
- * rt_iterate_next - Return next key-value pair, if any
- * rt_end_iter - End iteration
- * rt_memory_usage - Get the memory usage
- * rt_num_entries - Get the number of key-value pairs
+ * RT_CREATE - Create a new, empty radix tree
+ * RT_FREE - Free the radix tree
+ * RT_SEARCH - Search a key-value pair
+ * RT_SET - Set a key-value pair
+ * RT_DELETE - Delete a key-value pair
+ * RT_BEGIN_ITERATE - Begin iterating through all key-value pairs
+ * RT_ITERATE_NEXT - Return next key-value pair, if any
+ * RT_END_ITER - End iteration
+ * RT_MEMORY_USAGE - Get the memory usage
+ * RT_NUM_ENTRIES - Get the number of key-value pairs
*
- * rt_create() creates an empty radix tree in the given memory context
+ * RT_CREATE() creates an empty radix tree in the given memory context
* and memory contexts for all kinds of radix tree node under the memory context.
*
- * rt_iterate_next() ensures returning key-value pairs in the ascending
+ * RT_ITERATE_NEXT() ensures returning key-value pairs in the ascending
* order of the key.
*
* Copyright (c) 2022, PostgreSQL Global Development Group
@@ -66,6 +83,133 @@
#include "port/pg_lfind.h"
#include "utils/memutils.h"
+/* helpers */
+#define RT_MAKE_PREFIX(a) CppConcat(a,_)
+#define RT_MAKE_NAME(name) RT_MAKE_NAME_(RT_MAKE_PREFIX(RT_PREFIX),name)
+#define RT_MAKE_NAME_(a,b) CppConcat(a,b)
+
+/* function declarations */
+#define RT_CREATE RT_MAKE_NAME(create)
+#define RT_FREE RT_MAKE_NAME(free)
+#define RT_SEARCH RT_MAKE_NAME(search)
+#define RT_SET RT_MAKE_NAME(set)
+#define RT_BEGIN_ITERATE RT_MAKE_NAME(begin_iterate)
+#define RT_ITERATE_NEXT RT_MAKE_NAME(iterate_next)
+#define RT_END_ITERATE RT_MAKE_NAME(end_iterate)
+#define RT_DELETE RT_MAKE_NAME(delete)
+#define RT_MEMORY_USAGE RT_MAKE_NAME(memory_usage)
+#define RT_NUM_ENTRIES RT_MAKE_NAME(num_entries)
+#define RT_DUMP RT_MAKE_NAME(dump)
+#define RT_DUMP_SEARCH RT_MAKE_NAME(dump_search)
+#define RT_STATS RT_MAKE_NAME(stats)
+
+/* internal helper functions (no externally visible prototypes) */
+#define RT_NEW_ROOT RT_MAKE_NAME(new_root)
+#define RT_ALLOC_NODE RT_MAKE_NAME(alloc_node)
+#define RT_INIT_NODE RT_MAKE_NAME(init_node)
+#define RT_FREE_NODE RT_MAKE_NAME(free_node)
+#define RT_EXTEND RT_MAKE_NAME(extend)
+#define RT_SET_EXTEND RT_MAKE_NAME(set_extend)
+#define RT_GROW_NODE_KIND RT_MAKE_NAME(grow_node_kind)
+#define RT_COPY_NODE RT_MAKE_NAME(copy_node)
+#define RT_REPLACE_NODE RT_MAKE_NAME(replace_node)
+#define RT_NODE_4_SEARCH_EQ RT_MAKE_NAME(node_4_search_eq)
+#define RT_NODE_32_SEARCH_EQ RT_MAKE_NAME(node_32_search_eq)
+#define RT_NODE_4_GET_INSERTPOS RT_MAKE_NAME(node_4_get_insertpos)
+#define RT_NODE_32_GET_INSERTPOS RT_MAKE_NAME(node_32_get_insertpos)
+#define RT_CHUNK_CHILDREN_ARRAY_SHIFT RT_MAKE_NAME(chunk_children_array_shift)
+#define RT_CHUNK_VALUES_ARRAY_SHIFT RT_MAKE_NAME(chunk_values_array_shift)
+#define RT_CHUNK_CHILDREN_ARRAY_DELETE RT_MAKE_NAME(chunk_children_array_delete)
+#define RT_CHUNK_VALUES_ARRAY_DELETE RT_MAKE_NAME(chunk_values_array_delete)
+#define RT_CHUNK_CHILDREN_ARRAY_COPY RT_MAKE_NAME(chunk_children_array_copy)
+#define RT_CHUNK_VALUES_ARRAY_COPY RT_MAKE_NAME(chunk_values_array_copy)
+#define RT_NODE_125_IS_CHUNK_USED RT_MAKE_NAME(node_125_is_chunk_used)
+#define RT_NODE_INNER_125_GET_CHILD RT_MAKE_NAME(node_inner_125_get_child)
+#define RT_NODE_LEAF_125_GET_VALUE RT_MAKE_NAME(node_leaf_125_get_value)
+#define RT_NODE_INNER_256_IS_CHUNK_USED RT_MAKE_NAME(node_inner_256_is_chunk_used)
+#define RT_NODE_LEAF_256_IS_CHUNK_USED RT_MAKE_NAME(node_leaf_256_is_chunk_used)
+#define RT_NODE_INNER_256_GET_CHILD RT_MAKE_NAME(node_inner_256_get_child)
+#define RT_NODE_LEAF_256_GET_VALUE RT_MAKE_NAME(node_leaf_256_get_value)
+#define RT_NODE_INNER_256_SET RT_MAKE_NAME(node_inner_256_set)
+#define RT_NODE_LEAF_256_SET RT_MAKE_NAME(node_leaf_256_set)
+#define RT_NODE_INNER_256_DELETE RT_MAKE_NAME(node_inner_256_delete)
+#define RT_NODE_LEAF_256_DELETE RT_MAKE_NAME(node_leaf_256_delete)
+#define RT_KEY_GET_SHIFT RT_MAKE_NAME(key_get_shift)
+#define RT_SHIFT_GET_MAX_VAL RT_MAKE_NAME(shift_get_max_val)
+#define RT_NODE_SEARCH_INNER RT_MAKE_NAME(node_search_inner)
+#define RT_NODE_SEARCH_LEAF RT_MAKE_NAME(node_search_leaf)
+#define RT_NODE_DELETE_INNER RT_MAKE_NAME(node_delete_inner)
+#define RT_NODE_DELETE_LEAF RT_MAKE_NAME(node_delete_leaf)
+#define RT_NODE_INSERT_INNER RT_MAKE_NAME(node_insert_inner)
+#define RT_NODE_INSERT_LEAF RT_MAKE_NAME(node_insert_leaf)
+#define RT_NODE_INNER_ITERATE_NEXT RT_MAKE_NAME(node_inner_iterate_next)
+#define RT_NODE_LEAF_ITERATE_NEXT RT_MAKE_NAME(node_leaf_iterate_next)
+#define RT_UPDATE_ITER_STACK RT_MAKE_NAME(update_iter_stack)
+#define RT_ITER_UPDATE_KEY RT_MAKE_NAME(iter_update_key)
+#define RT_VERIFY_NODE RT_MAKE_NAME(verify_node)
+
+/* type declarations */
+#define RT_RADIX_TREE RT_MAKE_NAME(radix_tree)
+#define RT_ITER RT_MAKE_NAME(iter)
+#define RT_NODE RT_MAKE_NAME(node)
+#define RT_NODE_ITER RT_MAKE_NAME(node_iter)
+#define RT_NODE_BASE_4 RT_MAKE_NAME(node_base_4)
+#define RT_NODE_BASE_32 RT_MAKE_NAME(node_base_32)
+#define RT_NODE_BASE_125 RT_MAKE_NAME(node_base_125)
+#define RT_NODE_BASE_256 RT_MAKE_NAME(node_base_256)
+#define RT_NODE_INNER_4 RT_MAKE_NAME(node_inner_4)
+#define RT_NODE_INNER_32 RT_MAKE_NAME(node_inner_32)
+#define RT_NODE_INNER_125 RT_MAKE_NAME(node_inner_125)
+#define RT_NODE_INNER_256 RT_MAKE_NAME(node_inner_256)
+#define RT_NODE_LEAF_4 RT_MAKE_NAME(node_leaf_4)
+#define RT_NODE_LEAF_32 RT_MAKE_NAME(node_leaf_32)
+#define RT_NODE_LEAF_125 RT_MAKE_NAME(node_leaf_125)
+#define RT_NODE_LEAF_256 RT_MAKE_NAME(node_leaf_256)
+#define RT_SIZE_CLASS RT_MAKE_NAME(size_class)
+#define RT_SIZE_CLASS_ELEM RT_MAKE_NAME(size_class_elem)
+#define RT_SIZE_CLASS_INFO RT_MAKE_NAME(size_class_info)
+#define RT_CLASS_4_FULL RT_MAKE_NAME(class_4_full)
+#define RT_CLASS_32_PARTIAL RT_MAKE_NAME(class_32_partial)
+#define RT_CLASS_32_FULL RT_MAKE_NAME(class_32_full)
+#define RT_CLASS_125_FULL RT_MAKE_NAME(class_125_full)
+#define RT_CLASS_256 RT_MAKE_NAME(class_256)
+#define RT_KIND_MIN_SIZE_CLASS RT_MAKE_NAME(kind_min_size_class)
+
+/* generate forward declarations necessary to use the radix tree */
+#ifdef RT_DECLARE
+
+typedef struct RT_RADIX_TREE RT_RADIX_TREE;
+typedef struct RT_ITER RT_ITER;
+
+RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx);
+RT_SCOPE void RT_FREE(RT_RADIX_TREE *tree);
+RT_SCOPE bool RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, uint64 *val_p);
+RT_SCOPE bool RT_SET(RT_RADIX_TREE *tree, uint64 key, uint64 val);
+RT_SCOPE bool RT_DELETE(RT_RADIX_TREE *tree, uint64 key);
+
+RT_SCOPE RT_ITER * RT_BEGIN_ITERATE(RT_RADIX_TREE *tree);
+RT_SCOPE bool RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, uint64 *value_p);
+RT_SCOPE void RT_END_ITERATE(RT_ITER *iter);
+
+RT_SCOPE uint64 RT_MEMORY_USAGE(RT_RADIX_TREE *tree);
+RT_SCOPE uint64 RT_NUM_ENTRIES(RT_RADIX_TREE *tree);
+
+#ifdef RT_DEBUG
+RT_SCOPE void RT_DUMP(RT_RADIX_TREE *tree);
+RT_SCOPE void RT_DUMP_SEARCH(RT_RADIX_TREE *tree, uint64 key);
+RT_SCOPE void RT_STATS(RT_RADIX_TREE *tree);
+#endif
+
+#endif /* RT_DECLARE */
+
+
+/* generate implementation of the radix tree */
+#ifdef RT_DEFINE
+
+/* macros and types common to all implementations */
+#ifndef RT_COMMON
+#define RT_COMMON
+
#ifdef RT_DEBUG
#define UINT64_FORMAT_HEX "%" INT64_MODIFIER "X"
#endif
@@ -80,7 +224,7 @@
#define RT_CHUNK_MASK ((1 << RT_NODE_SPAN) - 1)
/* Maximum shift the radix tree uses */
-#define RT_MAX_SHIFT key_get_shift(UINT64_MAX)
+#define RT_MAX_SHIFT RT_KEY_GET_SHIFT(UINT64_MAX)
/* Tree level the radix tree uses */
#define RT_MAX_LEVEL ((sizeof(uint64) * BITS_PER_BYTE) / RT_NODE_SPAN)
@@ -101,7 +245,7 @@
* There are 4 node kinds and each node kind have one or two size classes,
* partial and full. The size classes in the same node kind have the same
* node structure but have the different number of fanout that is stored
- * in 'fanout' of rt_node. For example in size class 15, when a 16th element
+ * in 'fanout' of RT_NODE. For example in size class 15, when a 16th element
* is to be inserted, we allocate a larger area and memcpy the entire old
* node to it.
*
@@ -119,19 +263,20 @@
#define RT_NODE_KIND_256 0x03
#define RT_NODE_KIND_COUNT 4
-typedef enum rt_size_class
+#endif /* RT_COMMON */
+
+
+typedef enum RT_SIZE_CLASS
{
RT_CLASS_4_FULL = 0,
RT_CLASS_32_PARTIAL,
RT_CLASS_32_FULL,
RT_CLASS_125_FULL,
RT_CLASS_256
-
-#define RT_SIZE_CLASS_COUNT (RT_CLASS_256 + 1)
-} rt_size_class;
+} RT_SIZE_CLASS;
/* Common type for all nodes types */
-typedef struct rt_node
+typedef struct RT_NODE
{
/*
* Number of children. We use uint16 to be able to indicate 256 children
@@ -154,53 +299,54 @@ typedef struct rt_node
/* Node kind, one per search/set algorithm */
uint8 kind;
-} rt_node;
-#define NODE_IS_LEAF(n) (((rt_node *) (n))->shift == 0)
-#define NODE_IS_EMPTY(n) (((rt_node *) (n))->count == 0)
+} RT_NODE;
+
+#define NODE_IS_LEAF(n) (((RT_NODE *) (n))->shift == 0)
+#define NODE_IS_EMPTY(n) (((RT_NODE *) (n))->count == 0)
#define VAR_NODE_HAS_FREE_SLOT(node) \
((node)->base.n.count < (node)->base.n.fanout)
#define FIXED_NODE_HAS_FREE_SLOT(node, class) \
- ((node)->base.n.count < rt_size_class_info[class].fanout)
+ ((node)->base.n.count < RT_SIZE_CLASS_INFO[class].fanout)
/* Base type of each node kinds for leaf and inner nodes */
/* The base types must be a be able to accommodate the largest size
class for variable-sized node kinds*/
-typedef struct rt_node_base_4
+typedef struct RT_NODE_BASE_4
{
- rt_node n;
+ RT_NODE n;
/* 4 children, for key chunks */
uint8 chunks[4];
-} rt_node_base_4;
+} RT_NODE_BASE_4;
-typedef struct rt_node_base32
+typedef struct RT_NODE_BASE_32
{
- rt_node n;
+ RT_NODE n;
/* 32 children, for key chunks */
uint8 chunks[32];
-} rt_node_base_32;
+} RT_NODE_BASE_32;
/*
* node-125 uses slot_idx array, an array of RT_NODE_MAX_SLOTS length, typically
* 256, to store indexes into a second array that contains up to 125 values (or
* child pointers in inner nodes).
*/
-typedef struct rt_node_base125
+typedef struct RT_NODE_BASE_125
{
- rt_node n;
+ RT_NODE n;
/* The index of slots for each fanout */
uint8 slot_idxs[RT_NODE_MAX_SLOTS];
/* isset is a bitmap to track which slot is in use */
bitmapword isset[BM_IDX(128)];
-} rt_node_base_125;
+} RT_NODE_BASE_125;
-typedef struct rt_node_base256
+typedef struct RT_NODE_BASE_256
{
- rt_node n;
-} rt_node_base_256;
+ RT_NODE n;
+} RT_NODE_BASE_256;
/*
* Inner and leaf nodes.
@@ -215,79 +361,79 @@ typedef struct rt_node_base256
* good. It might be better to just indicate non-existing entries the same way
* in inner nodes.
*/
-typedef struct rt_node_inner_4
+typedef struct RT_NODE_INNER_4
{
- rt_node_base_4 base;
+ RT_NODE_BASE_4 base;
/* number of children depends on size class */
- rt_node *children[FLEXIBLE_ARRAY_MEMBER];
-} rt_node_inner_4;
+ RT_NODE *children[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_INNER_4;
-typedef struct rt_node_leaf_4
+typedef struct RT_NODE_LEAF_4
{
- rt_node_base_4 base;
+ RT_NODE_BASE_4 base;
/* number of values depends on size class */
uint64 values[FLEXIBLE_ARRAY_MEMBER];
-} rt_node_leaf_4;
+} RT_NODE_LEAF_4;
-typedef struct rt_node_inner_32
+typedef struct RT_NODE_INNER_32
{
- rt_node_base_32 base;
+ RT_NODE_BASE_32 base;
/* number of children depends on size class */
- rt_node *children[FLEXIBLE_ARRAY_MEMBER];
-} rt_node_inner_32;
+ RT_NODE *children[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_INNER_32;
-typedef struct rt_node_leaf_32
+typedef struct RT_NODE_LEAF_32
{
- rt_node_base_32 base;
+ RT_NODE_BASE_32 base;
/* number of values depends on size class */
uint64 values[FLEXIBLE_ARRAY_MEMBER];
-} rt_node_leaf_32;
+} RT_NODE_LEAF_32;
-typedef struct rt_node_inner_125
+typedef struct RT_NODE_INNER_125
{
- rt_node_base_125 base;
+ RT_NODE_BASE_125 base;
/* number of children depends on size class */
- rt_node *children[FLEXIBLE_ARRAY_MEMBER];
-} rt_node_inner_125;
+ RT_NODE *children[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_INNER_125;
-typedef struct rt_node_leaf_125
+typedef struct RT_NODE_LEAF_125
{
- rt_node_base_125 base;
+ RT_NODE_BASE_125 base;
/* number of values depends on size class */
uint64 values[FLEXIBLE_ARRAY_MEMBER];
-} rt_node_leaf_125;
+} RT_NODE_LEAF_125;
/*
* node-256 is the largest node type. This node has RT_NODE_MAX_SLOTS length array
* for directly storing values (or child pointers in inner nodes).
*/
-typedef struct rt_node_inner_256
+typedef struct RT_NODE_INNER_256
{
- rt_node_base_256 base;
+ RT_NODE_BASE_256 base;
/* Slots for 256 children */
- rt_node *children[RT_NODE_MAX_SLOTS];
-} rt_node_inner_256;
+ RT_NODE *children[RT_NODE_MAX_SLOTS];
+} RT_NODE_INNER_256;
-typedef struct rt_node_leaf_256
+typedef struct RT_NODE_LEAF_256
{
- rt_node_base_256 base;
+ RT_NODE_BASE_256 base;
/* isset is a bitmap to track which slot is in use */
bitmapword isset[BM_IDX(RT_NODE_MAX_SLOTS)];
/* Slots for 256 values */
uint64 values[RT_NODE_MAX_SLOTS];
-} rt_node_leaf_256;
+} RT_NODE_LEAF_256;
/* Information for each size class */
-typedef struct rt_size_class_elem
+typedef struct RT_SIZE_CLASS_ELEM
{
const char *name;
int fanout;
@@ -299,7 +445,7 @@ typedef struct rt_size_class_elem
/* slab block size */
Size inner_blocksize;
Size leaf_blocksize;
-} rt_size_class_elem;
+} RT_SIZE_CLASS_ELEM;
/*
* Calculate the slab blocksize so that we can allocate at least 32 chunks
@@ -307,51 +453,54 @@ typedef struct rt_size_class_elem
*/
#define NODE_SLAB_BLOCK_SIZE(size) \
Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * (size), (size) * 32)
-static rt_size_class_elem rt_size_class_info[RT_SIZE_CLASS_COUNT] = {
+
+static const RT_SIZE_CLASS_ELEM RT_SIZE_CLASS_INFO[] = {
[RT_CLASS_4_FULL] = {
.name = "radix tree node 4",
.fanout = 4,
- .inner_size = sizeof(rt_node_inner_4) + 4 * sizeof(rt_node *),
- .leaf_size = sizeof(rt_node_leaf_4) + 4 * sizeof(uint64),
- .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_4) + 4 * sizeof(rt_node *)),
- .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_4) + 4 * sizeof(uint64)),
+ .inner_size = sizeof(RT_NODE_INNER_4) + 4 * sizeof(RT_NODE *),
+ .leaf_size = sizeof(RT_NODE_LEAF_4) + 4 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_4) + 4 * sizeof(RT_NODE *)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_4) + 4 * sizeof(uint64)),
},
[RT_CLASS_32_PARTIAL] = {
.name = "radix tree node 15",
.fanout = 15,
- .inner_size = sizeof(rt_node_inner_32) + 15 * sizeof(rt_node *),
- .leaf_size = sizeof(rt_node_leaf_32) + 15 * sizeof(uint64),
- .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_32) + 15 * sizeof(rt_node *)),
- .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_32) + 15 * sizeof(uint64)),
+ .inner_size = sizeof(RT_NODE_INNER_32) + 15 * sizeof(RT_NODE *),
+ .leaf_size = sizeof(RT_NODE_LEAF_32) + 15 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_32) + 15 * sizeof(RT_NODE *)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_32) + 15 * sizeof(uint64)),
},
[RT_CLASS_32_FULL] = {
.name = "radix tree node 32",
.fanout = 32,
- .inner_size = sizeof(rt_node_inner_32) + 32 * sizeof(rt_node *),
- .leaf_size = sizeof(rt_node_leaf_32) + 32 * sizeof(uint64),
- .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_32) + 32 * sizeof(rt_node *)),
- .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_32) + 32 * sizeof(uint64)),
+ .inner_size = sizeof(RT_NODE_INNER_32) + 32 * sizeof(RT_NODE *),
+ .leaf_size = sizeof(RT_NODE_LEAF_32) + 32 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_32) + 32 * sizeof(RT_NODE *)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_32) + 32 * sizeof(uint64)),
},
[RT_CLASS_125_FULL] = {
.name = "radix tree node 125",
.fanout = 125,
- .inner_size = sizeof(rt_node_inner_125) + 125 * sizeof(rt_node *),
- .leaf_size = sizeof(rt_node_leaf_125) + 125 * sizeof(uint64),
- .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_125) + 125 * sizeof(rt_node *)),
- .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_125) + 125 * sizeof(uint64)),
+ .inner_size = sizeof(RT_NODE_INNER_125) + 125 * sizeof(RT_NODE *),
+ .leaf_size = sizeof(RT_NODE_LEAF_125) + 125 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_125) + 125 * sizeof(RT_NODE *)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_125) + 125 * sizeof(uint64)),
},
[RT_CLASS_256] = {
.name = "radix tree node 256",
.fanout = 256,
- .inner_size = sizeof(rt_node_inner_256),
- .leaf_size = sizeof(rt_node_leaf_256),
- .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_256)),
- .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_256)),
+ .inner_size = sizeof(RT_NODE_INNER_256),
+ .leaf_size = sizeof(RT_NODE_LEAF_256),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_256)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_256)),
},
};
+#define RT_SIZE_CLASS_COUNT lengthof(RT_SIZE_CLASS_INFO)
+
/* Map from the node kind to its minimum size class */
-static rt_size_class kind_min_size_class[RT_NODE_KIND_COUNT] = {
+static const RT_SIZE_CLASS RT_KIND_MIN_SIZE_CLASS[RT_NODE_KIND_COUNT] = {
[RT_NODE_KIND_4] = RT_CLASS_4_FULL,
[RT_NODE_KIND_32] = RT_CLASS_32_PARTIAL,
[RT_NODE_KIND_125] = RT_CLASS_125_FULL,
@@ -359,11 +508,11 @@ static rt_size_class kind_min_size_class[RT_NODE_KIND_COUNT] = {
};
/* A radix tree with nodes */
-typedef struct radix_tree
+typedef struct RT_RADIX_TREE
{
MemoryContext context;
- rt_node *root;
+ RT_NODE *root;
uint64 max_val;
uint64 num_keys;
@@ -374,7 +523,7 @@ typedef struct radix_tree
#ifdef RT_DEBUG
int32 cnt[RT_SIZE_CLASS_COUNT];
#endif
-} radix_tree;
+} RT_RADIX_TREE;
/*
* Iteration support.
@@ -382,79 +531,47 @@ typedef struct radix_tree
* Iterating the radix tree returns each pair of key and value in the ascending
* order of the key. To support this, the we iterate nodes of each level.
*
- * rt_node_iter struct is used to track the iteration within a node.
+ * RT_NODE_ITER struct is used to track the iteration within a node.
*
- * rt_iter is the struct for iteration of the radix tree, and uses rt_node_iter
+ * RT_ITER is the struct for iteration of the radix tree, and uses RT_NODE_ITER
* in order to track the iteration of each level. During the iteration, we also
* construct the key whenever updating the node iteration information, e.g., when
* advancing the current index within the node or when moving to the next node
* at the same level.
*/
-typedef struct rt_node_iter
+typedef struct RT_NODE_ITER
{
- rt_node *node; /* current node being iterated */
+ RT_NODE *node; /* current node being iterated */
int current_idx; /* current position. -1 for initial value */
-} rt_node_iter;
+} RT_NODE_ITER;
-typedef struct rt_iter
+typedef struct RT_ITER
{
- radix_tree *tree;
+ RT_RADIX_TREE *tree;
/* Track the iteration on nodes of each level */
- rt_node_iter stack[RT_MAX_LEVEL];
+ RT_NODE_ITER stack[RT_MAX_LEVEL];
int stack_len;
/* The key is being constructed during the iteration */
uint64 key;
-} rt_iter;
-
-extern radix_tree *rt_create(MemoryContext ctx);
-extern void rt_free(radix_tree *tree);
-extern bool rt_search(radix_tree *tree, uint64 key, uint64 *val_p);
-extern bool rt_set(radix_tree *tree, uint64 key, uint64 val);
-extern rt_iter *rt_begin_iterate(radix_tree *tree);
+} RT_ITER;
-extern bool rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p);
-extern void rt_end_iterate(rt_iter *iter);
-extern bool rt_delete(radix_tree *tree, uint64 key);
-extern uint64 rt_memory_usage(radix_tree *tree);
-extern uint64 rt_num_entries(radix_tree *tree);
-
-#ifdef RT_DEBUG
-extern void rt_dump(radix_tree *tree);
-extern void rt_dump_search(radix_tree *tree, uint64 key);
-extern void rt_stats(radix_tree *tree);
-#endif
-
-
-static void rt_new_root(radix_tree *tree, uint64 key);
-static rt_node *rt_alloc_node(radix_tree *tree, rt_size_class size_class, bool inner);
-static inline void rt_init_node(rt_node *node, uint8 kind, rt_size_class size_class,
- bool inner);
-static void rt_free_node(radix_tree *tree, rt_node *node);
-static void rt_extend(radix_tree *tree, uint64 key);
-static inline bool rt_node_search_inner(rt_node *node, uint64 key, rt_node **child_p);
-static inline bool rt_node_search_leaf(rt_node *node, uint64 key, uint64 *value_p);
-static bool rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node,
- uint64 key, rt_node *child);
-static bool rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
+static bool RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_NODE *parent, RT_NODE *node,
+ uint64 key, RT_NODE *child);
+static bool RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_NODE *parent, RT_NODE *node,
uint64 key, uint64 value);
-static inline rt_node *rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter);
-static inline bool rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
- uint64 *value_p);
-static void rt_update_iter_stack(rt_iter *iter, rt_node *from_node, int from);
-static inline void rt_iter_update_key(rt_iter *iter, uint8 chunk, uint8 shift);
/* verification (available only with assertion) */
-static void rt_verify_node(rt_node *node);
+static void RT_VERIFY_NODE(RT_NODE *node);
/*
* Return index of the first element in 'base' that equals 'key'. Return -1
* if there is no such element.
*/
static inline int
-node_4_search_eq(rt_node_base_4 *node, uint8 chunk)
+RT_NODE_4_SEARCH_EQ(RT_NODE_BASE_4 *node, uint8 chunk)
{
int idx = -1;
@@ -474,7 +591,7 @@ node_4_search_eq(rt_node_base_4 *node, uint8 chunk)
* Return index of the chunk to insert into chunks in the given node.
*/
static inline int
-node_4_get_insertpos(rt_node_base_4 *node, uint8 chunk)
+RT_NODE_4_GET_INSERTPOS(RT_NODE_BASE_4 *node, uint8 chunk)
{
int idx;
@@ -492,7 +609,7 @@ node_4_get_insertpos(rt_node_base_4 *node, uint8 chunk)
* if there is no such element.
*/
static inline int
-node_32_search_eq(rt_node_base_32 *node, uint8 chunk)
+RT_NODE_32_SEARCH_EQ(RT_NODE_BASE_32 *node, uint8 chunk)
{
int count = node->n.count;
#ifndef USE_NO_SIMD
@@ -541,7 +658,7 @@ node_32_search_eq(rt_node_base_32 *node, uint8 chunk)
* Return index of the chunk to insert into chunks in the given node.
*/
static inline int
-node_32_get_insertpos(rt_node_base_32 *node, uint8 chunk)
+RT_NODE_32_GET_INSERTPOS(RT_NODE_BASE_32 *node, uint8 chunk)
{
int count = node->n.count;
#ifndef USE_NO_SIMD
@@ -596,14 +713,14 @@ node_32_get_insertpos(rt_node_base_32 *node, uint8 chunk)
/* Shift the elements right at 'idx' by one */
static inline void
-chunk_children_array_shift(uint8 *chunks, rt_node **children, int count, int idx)
+RT_CHUNK_CHILDREN_ARRAY_SHIFT(uint8 *chunks, RT_NODE **children, int count, int idx)
{
memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
- memmove(&(children[idx + 1]), &(children[idx]), sizeof(rt_node *) * (count - idx));
+ memmove(&(children[idx + 1]), &(children[idx]), sizeof(RT_NODE *) * (count - idx));
}
static inline void
-chunk_values_array_shift(uint8 *chunks, uint64 *values, int count, int idx)
+RT_CHUNK_VALUES_ARRAY_SHIFT(uint8 *chunks, uint64 *values, int count, int idx)
{
memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
memmove(&(values[idx + 1]), &(values[idx]), sizeof(uint64 *) * (count - idx));
@@ -611,14 +728,14 @@ chunk_values_array_shift(uint8 *chunks, uint64 *values, int count, int idx)
/* Delete the element at 'idx' */
static inline void
-chunk_children_array_delete(uint8 *chunks, rt_node **children, int count, int idx)
+RT_CHUNK_CHILDREN_ARRAY_DELETE(uint8 *chunks, RT_NODE **children, int count, int idx)
{
memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
- memmove(&(children[idx]), &(children[idx + 1]), sizeof(rt_node *) * (count - idx - 1));
+ memmove(&(children[idx]), &(children[idx + 1]), sizeof(RT_NODE *) * (count - idx - 1));
}
static inline void
-chunk_values_array_delete(uint8 *chunks, uint64 *values, int count, int idx)
+RT_CHUNK_VALUES_ARRAY_DELETE(uint8 *chunks, uint64 *values, int count, int idx)
{
memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
memmove(&(values[idx]), &(values[idx + 1]), sizeof(uint64) * (count - idx - 1));
@@ -626,22 +743,22 @@ chunk_values_array_delete(uint8 *chunks, uint64 *values, int count, int idx)
/* Copy both chunks and children/values arrays */
static inline void
-chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
- uint8 *dst_chunks, rt_node **dst_children)
+RT_CHUNK_CHILDREN_ARRAY_COPY(uint8 *src_chunks, RT_NODE **src_children,
+ uint8 *dst_chunks, RT_NODE **dst_children)
{
- const int fanout = rt_size_class_info[RT_CLASS_4_FULL].fanout;
+ const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_4_FULL].fanout;
const Size chunk_size = sizeof(uint8) * fanout;
- const Size children_size = sizeof(rt_node *) * fanout;
+ const Size children_size = sizeof(RT_NODE *) * fanout;
memcpy(dst_chunks, src_chunks, chunk_size);
memcpy(dst_children, src_children, children_size);
}
static inline void
-chunk_values_array_copy(uint8 *src_chunks, uint64 *src_values,
+RT_CHUNK_VALUES_ARRAY_COPY(uint8 *src_chunks, uint64 *src_values,
uint8 *dst_chunks, uint64 *dst_values)
{
- const int fanout = rt_size_class_info[RT_CLASS_4_FULL].fanout;
+ const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_4_FULL].fanout;
const Size chunk_size = sizeof(uint8) * fanout;
const Size values_size = sizeof(uint64) * fanout;
@@ -653,23 +770,23 @@ chunk_values_array_copy(uint8 *src_chunks, uint64 *src_values,
/* Does the given chunk in the node have the value? */
static inline bool
-node_125_is_chunk_used(rt_node_base_125 *node, uint8 chunk)
+RT_NODE_125_IS_CHUNK_USED(RT_NODE_BASE_125 *node, uint8 chunk)
{
return node->slot_idxs[chunk] != RT_NODE_125_INVALID_IDX;
}
-static inline rt_node *
-node_inner_125_get_child(rt_node_inner_125 *node, uint8 chunk)
+static inline RT_NODE *
+RT_NODE_INNER_125_GET_CHILD(RT_NODE_INNER_125 *node, uint8 chunk)
{
Assert(!NODE_IS_LEAF(node));
return node->children[node->base.slot_idxs[chunk]];
}
static inline uint64
-node_leaf_125_get_value(rt_node_leaf_125 *node, uint8 chunk)
+RT_NODE_LEAF_125_GET_VALUE(RT_NODE_LEAF_125 *node, uint8 chunk)
{
Assert(NODE_IS_LEAF(node));
- Assert(((rt_node_base_125 *) node)->slot_idxs[chunk] != RT_NODE_125_INVALID_IDX);
+ Assert(((RT_NODE_BASE_125 *) node)->slot_idxs[chunk] != RT_NODE_125_INVALID_IDX);
return node->values[node->base.slot_idxs[chunk]];
}
@@ -677,14 +794,14 @@ node_leaf_125_get_value(rt_node_leaf_125 *node, uint8 chunk)
/* Return true if the slot corresponding to the given chunk is in use */
static inline bool
-node_inner_256_is_chunk_used(rt_node_inner_256 *node, uint8 chunk)
+RT_NODE_INNER_256_IS_CHUNK_USED(RT_NODE_INNER_256 *node, uint8 chunk)
{
Assert(!NODE_IS_LEAF(node));
return (node->children[chunk] != NULL);
}
static inline bool
-node_leaf_256_is_chunk_used(rt_node_leaf_256 *node, uint8 chunk)
+RT_NODE_LEAF_256_IS_CHUNK_USED(RT_NODE_LEAF_256 *node, uint8 chunk)
{
int idx = BM_IDX(chunk);
int bitnum = BM_BIT(chunk);
@@ -693,25 +810,25 @@ node_leaf_256_is_chunk_used(rt_node_leaf_256 *node, uint8 chunk)
return (node->isset[idx] & ((bitmapword) 1 << bitnum)) != 0;
}
-static inline rt_node *
-node_inner_256_get_child(rt_node_inner_256 *node, uint8 chunk)
+static inline RT_NODE *
+RT_NODE_INNER_256_GET_CHILD(RT_NODE_INNER_256 *node, uint8 chunk)
{
Assert(!NODE_IS_LEAF(node));
- Assert(node_inner_256_is_chunk_used(node, chunk));
+ Assert(RT_NODE_INNER_256_IS_CHUNK_USED(node, chunk));
return node->children[chunk];
}
static inline uint64
-node_leaf_256_get_value(rt_node_leaf_256 *node, uint8 chunk)
+RT_NODE_LEAF_256_GET_VALUE(RT_NODE_LEAF_256 *node, uint8 chunk)
{
Assert(NODE_IS_LEAF(node));
- Assert(node_leaf_256_is_chunk_used(node, chunk));
+ Assert(RT_NODE_LEAF_256_IS_CHUNK_USED(node, chunk));
return node->values[chunk];
}
/* Set the child in the node-256 */
static inline void
-node_inner_256_set(rt_node_inner_256 *node, uint8 chunk, rt_node *child)
+RT_NODE_INNER_256_SET(RT_NODE_INNER_256 *node, uint8 chunk, RT_NODE *child)
{
Assert(!NODE_IS_LEAF(node));
node->children[chunk] = child;
@@ -719,7 +836,7 @@ node_inner_256_set(rt_node_inner_256 *node, uint8 chunk, rt_node *child)
/* Set the value in the node-256 */
static inline void
-node_leaf_256_set(rt_node_leaf_256 *node, uint8 chunk, uint64 value)
+RT_NODE_LEAF_256_SET(RT_NODE_LEAF_256 *node, uint8 chunk, uint64 value)
{
int idx = BM_IDX(chunk);
int bitnum = BM_BIT(chunk);
@@ -731,14 +848,14 @@ node_leaf_256_set(rt_node_leaf_256 *node, uint8 chunk, uint64 value)
/* Set the slot at the given chunk position */
static inline void
-node_inner_256_delete(rt_node_inner_256 *node, uint8 chunk)
+RT_NODE_INNER_256_DELETE(RT_NODE_INNER_256 *node, uint8 chunk)
{
Assert(!NODE_IS_LEAF(node));
node->children[chunk] = NULL;
}
static inline void
-node_leaf_256_delete(rt_node_leaf_256 *node, uint8 chunk)
+RT_NODE_LEAF_256_DELETE(RT_NODE_LEAF_256 *node, uint8 chunk)
{
int idx = BM_IDX(chunk);
int bitnum = BM_BIT(chunk);
@@ -751,7 +868,7 @@ node_leaf_256_delete(rt_node_leaf_256 *node, uint8 chunk)
* Return the shift that is satisfied to store the given key.
*/
static inline int
-key_get_shift(uint64 key)
+RT_KEY_GET_SHIFT(uint64 key)
{
return (key == 0)
? 0
@@ -762,7 +879,7 @@ key_get_shift(uint64 key)
* Return the max value stored in a node with the given shift.
*/
static uint64
-shift_get_max_val(int shift)
+RT_SHIFT_GET_MAX_VAL(int shift)
{
if (shift == RT_MAX_SHIFT)
return UINT64_MAX;
@@ -770,38 +887,20 @@ shift_get_max_val(int shift)
return (UINT64CONST(1) << (shift + RT_NODE_SPAN)) - 1;
}
-/*
- * Create a new node as the root. Subordinate nodes will be created during
- * the insertion.
- */
-static void
-rt_new_root(radix_tree *tree, uint64 key)
-{
- int shift = key_get_shift(key);
- bool inner = shift > 0;
- rt_node *newnode;
-
- newnode = rt_alloc_node(tree, RT_CLASS_4_FULL, inner);
- rt_init_node(newnode, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
- newnode->shift = shift;
- tree->max_val = shift_get_max_val(shift);
- tree->root = newnode;
-}
-
/*
* Allocate a new node with the given node kind.
*/
-static rt_node *
-rt_alloc_node(radix_tree *tree, rt_size_class size_class, bool inner)
+static RT_NODE *
+RT_ALLOC_NODE(RT_RADIX_TREE *tree, RT_SIZE_CLASS size_class, bool inner)
{
- rt_node *newnode;
+ RT_NODE *newnode;
if (inner)
- newnode = (rt_node *) MemoryContextAlloc(tree->inner_slabs[size_class],
- rt_size_class_info[size_class].inner_size);
+ newnode = (RT_NODE *) MemoryContextAlloc(tree->inner_slabs[size_class],
+ RT_SIZE_CLASS_INFO[size_class].inner_size);
else
- newnode = (rt_node *) MemoryContextAlloc(tree->leaf_slabs[size_class],
- rt_size_class_info[size_class].leaf_size);
+ newnode = (RT_NODE *) MemoryContextAlloc(tree->leaf_slabs[size_class],
+ RT_SIZE_CLASS_INFO[size_class].leaf_size);
#ifdef RT_DEBUG
/* update the statistics */
@@ -813,20 +912,20 @@ rt_alloc_node(radix_tree *tree, rt_size_class size_class, bool inner)
/* Initialize the node contents */
static inline void
-rt_init_node(rt_node *node, uint8 kind, rt_size_class size_class, bool inner)
+RT_INIT_NODE(RT_NODE *node, uint8 kind, RT_SIZE_CLASS size_class, bool inner)
{
if (inner)
- MemSet(node, 0, rt_size_class_info[size_class].inner_size);
+ MemSet(node, 0, RT_SIZE_CLASS_INFO[size_class].inner_size);
else
- MemSet(node, 0, rt_size_class_info[size_class].leaf_size);
+ MemSet(node, 0, RT_SIZE_CLASS_INFO[size_class].leaf_size);
node->kind = kind;
- node->fanout = rt_size_class_info[size_class].fanout;
+ node->fanout = RT_SIZE_CLASS_INFO[size_class].fanout;
/* Initialize slot_idxs to invalid values */
if (kind == RT_NODE_KIND_125)
{
- rt_node_base_125 *n125 = (rt_node_base_125 *) node;
+ RT_NODE_BASE_125 *n125 = (RT_NODE_BASE_125 *) node;
memset(n125->slot_idxs, RT_NODE_125_INVALID_IDX, sizeof(n125->slot_idxs));
}
@@ -839,8 +938,26 @@ rt_init_node(rt_node *node, uint8 kind, rt_size_class size_class, bool inner)
node->fanout = 0;
}
+/*
+ * Create a new node as the root. Subordinate nodes will be created during
+ * the insertion.
+ */
+static void
+RT_NEW_ROOT(RT_RADIX_TREE *tree, uint64 key)
+{
+ int shift = RT_KEY_GET_SHIFT(key);
+ bool inner = shift > 0;
+ RT_NODE *newnode;
+
+ newnode = RT_ALLOC_NODE(tree, RT_CLASS_4_FULL, inner);
+ RT_INIT_NODE(newnode, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
+ newnode->shift = shift;
+ tree->max_val = RT_SHIFT_GET_MAX_VAL(shift);
+ tree->root = newnode;
+}
+
static inline void
-rt_copy_node(rt_node *newnode, rt_node *oldnode)
+RT_COPY_NODE(RT_NODE *newnode, RT_NODE *oldnode)
{
newnode->shift = oldnode->shift;
newnode->chunk = oldnode->chunk;
@@ -851,22 +968,22 @@ rt_copy_node(rt_node *newnode, rt_node *oldnode)
* Create a new node with 'new_kind' and the same shift, chunk, and
* count of 'node'.
*/
-static rt_node*
-rt_grow_node_kind(radix_tree *tree, rt_node *node, uint8 new_kind)
+static RT_NODE*
+RT_GROW_NODE_KIND(RT_RADIX_TREE *tree, RT_NODE *node, uint8 new_kind)
{
- rt_node *newnode;
+ RT_NODE *newnode;
bool inner = !NODE_IS_LEAF(node);
- newnode = rt_alloc_node(tree, kind_min_size_class[new_kind], inner);
- rt_init_node(newnode, new_kind, kind_min_size_class[new_kind], inner);
- rt_copy_node(newnode, node);
+ newnode = RT_ALLOC_NODE(tree, RT_KIND_MIN_SIZE_CLASS[new_kind], inner);
+ RT_INIT_NODE(newnode, new_kind, RT_KIND_MIN_SIZE_CLASS[new_kind], inner);
+ RT_COPY_NODE(newnode, node);
return newnode;
}
/* Free the given node */
static void
-rt_free_node(radix_tree *tree, rt_node *node)
+RT_FREE_NODE(RT_RADIX_TREE *tree, RT_NODE *node)
{
/* If we're deleting the root node, make the tree empty */
if (tree->root == node)
@@ -882,7 +999,7 @@ rt_free_node(radix_tree *tree, rt_node *node)
/* update the statistics */
for (i = 0; i < RT_SIZE_CLASS_COUNT; i++)
{
- if (node->fanout == rt_size_class_info[i].fanout)
+ if (node->fanout == RT_SIZE_CLASS_INFO[i].fanout)
break;
}
@@ -902,8 +1019,8 @@ rt_free_node(radix_tree *tree, rt_node *node)
* Replace old_child with new_child, and free the old one.
*/
static void
-rt_replace_node(radix_tree *tree, rt_node *parent, rt_node *old_child,
- rt_node *new_child, uint64 key)
+RT_REPLACE_NODE(RT_RADIX_TREE *tree, RT_NODE *parent, RT_NODE *old_child,
+ RT_NODE *new_child, uint64 key)
{
Assert(old_child->chunk == new_child->chunk);
Assert(old_child->shift == new_child->shift);
@@ -917,11 +1034,11 @@ rt_replace_node(radix_tree *tree, rt_node *parent, rt_node *old_child,
{
bool replaced PG_USED_FOR_ASSERTS_ONLY;
- replaced = rt_node_insert_inner(tree, NULL, parent, key, new_child);
+ replaced = RT_NODE_INSERT_INNER(tree, NULL, parent, key, new_child);
Assert(replaced);
}
- rt_free_node(tree, old_child);
+ RT_FREE_NODE(tree, old_child);
}
/*
@@ -929,32 +1046,32 @@ rt_replace_node(radix_tree *tree, rt_node *parent, rt_node *old_child,
* store the key.
*/
static void
-rt_extend(radix_tree *tree, uint64 key)
+RT_EXTEND(RT_RADIX_TREE *tree, uint64 key)
{
int target_shift;
int shift = tree->root->shift + RT_NODE_SPAN;
- target_shift = key_get_shift(key);
+ target_shift = RT_KEY_GET_SHIFT(key);
/* Grow tree from 'shift' to 'target_shift' */
while (shift <= target_shift)
{
- rt_node_inner_4 *node;
+ RT_NODE_INNER_4 *node;
- node = (rt_node_inner_4 *) rt_alloc_node(tree, RT_CLASS_4_FULL, true);
- rt_init_node((rt_node *) node, RT_NODE_KIND_4, RT_CLASS_4_FULL, true);
+ node = (RT_NODE_INNER_4 *) RT_ALLOC_NODE(tree, RT_CLASS_4_FULL, true);
+ RT_INIT_NODE((RT_NODE *) node, RT_NODE_KIND_4, RT_CLASS_4_FULL, true);
node->base.n.shift = shift;
node->base.n.count = 1;
node->base.chunks[0] = 0;
node->children[0] = tree->root;
tree->root->chunk = 0;
- tree->root = (rt_node *) node;
+ tree->root = (RT_NODE *) node;
shift += RT_NODE_SPAN;
}
- tree->max_val = shift_get_max_val(target_shift);
+ tree->max_val = RT_SHIFT_GET_MAX_VAL(target_shift);
}
/*
@@ -962,29 +1079,29 @@ rt_extend(radix_tree *tree, uint64 key)
* Insert inner and leaf nodes from 'node' to bottom.
*/
static inline void
-rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node *parent,
- rt_node *node)
+RT_SET_EXTEND(RT_RADIX_TREE *tree, uint64 key, uint64 value, RT_NODE *parent,
+ RT_NODE *node)
{
int shift = node->shift;
while (shift >= RT_NODE_SPAN)
{
- rt_node *newchild;
+ RT_NODE *newchild;
int newshift = shift - RT_NODE_SPAN;
bool inner = newshift > 0;
- newchild = rt_alloc_node(tree, RT_CLASS_4_FULL, inner);
- rt_init_node(newchild, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
+ newchild = RT_ALLOC_NODE(tree, RT_CLASS_4_FULL, inner);
+ RT_INIT_NODE(newchild, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
newchild->shift = newshift;
newchild->chunk = RT_GET_KEY_CHUNK(key, node->shift);
- rt_node_insert_inner(tree, parent, node, key, newchild);
+ RT_NODE_INSERT_INNER(tree, parent, node, key, newchild);
parent = node;
node = newchild;
shift -= RT_NODE_SPAN;
}
- rt_node_insert_leaf(tree, parent, node, key, value);
+ RT_NODE_INSERT_LEAF(tree, parent, node, key, value);
tree->num_keys++;
}
@@ -995,7 +1112,7 @@ rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node *parent,
* pointer is set to child_p.
*/
static inline bool
-rt_node_search_inner(rt_node *node, uint64 key, rt_node **child_p)
+RT_NODE_SEARCH_INNER(RT_NODE *node, uint64 key, RT_NODE **child_p)
{
#define RT_NODE_LEVEL_INNER
#include "lib/radixtree_search_impl.h"
@@ -1009,7 +1126,7 @@ rt_node_search_inner(rt_node *node, uint64 key, rt_node **child_p)
* to the value is set to value_p.
*/
static inline bool
-rt_node_search_leaf(rt_node *node, uint64 key, uint64 *value_p)
+RT_NODE_SEARCH_LEAF(RT_NODE *node, uint64 key, uint64 *value_p)
{
#define RT_NODE_LEVEL_LEAF
#include "lib/radixtree_search_impl.h"
@@ -1022,7 +1139,7 @@ rt_node_search_leaf(rt_node *node, uint64 key, uint64 *value_p)
* Delete the node and return true if the key is found, otherwise return false.
*/
static inline bool
-rt_node_delete_inner(rt_node *node, uint64 key)
+RT_NODE_DELETE_INNER(RT_NODE *node, uint64 key)
{
#define RT_NODE_LEVEL_INNER
#include "lib/radixtree_delete_impl.h"
@@ -1035,7 +1152,7 @@ rt_node_delete_inner(rt_node *node, uint64 key)
* Delete the node and return true if the key is found, otherwise return false.
*/
static inline bool
-rt_node_delete_leaf(rt_node *node, uint64 key)
+RT_NODE_DELETE_LEAF(RT_NODE *node, uint64 key)
{
#define RT_NODE_LEVEL_LEAF
#include "lib/radixtree_delete_impl.h"
@@ -1044,8 +1161,8 @@ rt_node_delete_leaf(rt_node *node, uint64 key)
/* Insert the child to the inner node */
static bool
-rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 key,
- rt_node *child)
+RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_NODE *parent, RT_NODE *node, uint64 key,
+ RT_NODE *child)
{
#define RT_NODE_LEVEL_INNER
#include "lib/radixtree_insert_impl.h"
@@ -1054,7 +1171,7 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
/* Insert the value to the leaf node */
static bool
-rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
+RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_NODE *parent, RT_NODE *node,
uint64 key, uint64 value)
{
#define RT_NODE_LEVEL_LEAF
@@ -1065,15 +1182,15 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
/*
* Create the radix tree in the given memory context and return it.
*/
-radix_tree *
-rt_create(MemoryContext ctx)
+RT_SCOPE RT_RADIX_TREE *
+RT_CREATE(MemoryContext ctx)
{
- radix_tree *tree;
+ RT_RADIX_TREE *tree;
MemoryContext old_ctx;
old_ctx = MemoryContextSwitchTo(ctx);
- tree = palloc(sizeof(radix_tree));
+ tree = palloc(sizeof(RT_RADIX_TREE));
tree->context = ctx;
tree->root = NULL;
tree->max_val = 0;
@@ -1083,13 +1200,13 @@ rt_create(MemoryContext ctx)
for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
{
tree->inner_slabs[i] = SlabContextCreate(ctx,
- rt_size_class_info[i].name,
- rt_size_class_info[i].inner_blocksize,
- rt_size_class_info[i].inner_size);
+ RT_SIZE_CLASS_INFO[i].name,
+ RT_SIZE_CLASS_INFO[i].inner_blocksize,
+ RT_SIZE_CLASS_INFO[i].inner_size);
tree->leaf_slabs[i] = SlabContextCreate(ctx,
- rt_size_class_info[i].name,
- rt_size_class_info[i].leaf_blocksize,
- rt_size_class_info[i].leaf_size);
+ RT_SIZE_CLASS_INFO[i].name,
+ RT_SIZE_CLASS_INFO[i].leaf_blocksize,
+ RT_SIZE_CLASS_INFO[i].leaf_size);
#ifdef RT_DEBUG
tree->cnt[i] = 0;
#endif
@@ -1103,8 +1220,8 @@ rt_create(MemoryContext ctx)
/*
* Free the given radix tree.
*/
-void
-rt_free(radix_tree *tree)
+RT_SCOPE void
+RT_FREE(RT_RADIX_TREE *tree)
{
for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
{
@@ -1119,21 +1236,21 @@ rt_free(radix_tree *tree)
* Set key to value. If the entry already exists, we update its value to 'value'
* and return true. Returns false if entry doesn't yet exist.
*/
-bool
-rt_set(radix_tree *tree, uint64 key, uint64 value)
+RT_SCOPE bool
+RT_SET(RT_RADIX_TREE *tree, uint64 key, uint64 value)
{
int shift;
bool updated;
- rt_node *node;
- rt_node *parent;
+ RT_NODE *node;
+ RT_NODE *parent;
/* Empty tree, create the root */
if (!tree->root)
- rt_new_root(tree, key);
+ RT_NEW_ROOT(tree, key);
/* Extend the tree if necessary */
if (key > tree->max_val)
- rt_extend(tree, key);
+ RT_EXTEND(tree, key);
Assert(tree->root);
@@ -1143,14 +1260,14 @@ rt_set(radix_tree *tree, uint64 key, uint64 value)
/* Descend the tree until a leaf node */
while (shift >= 0)
{
- rt_node *child;
+ RT_NODE *child;
if (NODE_IS_LEAF(node))
break;
- if (!rt_node_search_inner(node, key, &child))
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
{
- rt_set_extend(tree, key, value, parent, node);
+ RT_SET_EXTEND(tree, key, value, parent, node);
return false;
}
@@ -1159,7 +1276,7 @@ rt_set(radix_tree *tree, uint64 key, uint64 value)
shift -= RT_NODE_SPAN;
}
- updated = rt_node_insert_leaf(tree, parent, node, key, value);
+ updated = RT_NODE_INSERT_LEAF(tree, parent, node, key, value);
/* Update the statistics */
if (!updated)
@@ -1173,10 +1290,10 @@ rt_set(radix_tree *tree, uint64 key, uint64 value)
* otherwise return false. On success, we set the value to *val_p so it must
* not be NULL.
*/
-bool
-rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
+RT_SCOPE bool
+RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, uint64 *value_p)
{
- rt_node *node;
+ RT_NODE *node;
int shift;
Assert(value_p != NULL);
@@ -1190,30 +1307,30 @@ rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
/* Descend the tree until a leaf node */
while (shift >= 0)
{
- rt_node *child;
+ RT_NODE *child;
if (NODE_IS_LEAF(node))
break;
- if (!rt_node_search_inner(node, key, &child))
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
return false;
node = child;
shift -= RT_NODE_SPAN;
}
- return rt_node_search_leaf(node, key, value_p);
+ return RT_NODE_SEARCH_LEAF(node, key, value_p);
}
/*
* Delete the given key from the radix tree. Return true if the key is found (and
* deleted), otherwise do nothing and return false.
*/
-bool
-rt_delete(radix_tree *tree, uint64 key)
+RT_SCOPE bool
+RT_DELETE(RT_RADIX_TREE *tree, uint64 key)
{
- rt_node *node;
- rt_node *stack[RT_MAX_LEVEL] = {0};
+ RT_NODE *node;
+ RT_NODE *stack[RT_MAX_LEVEL] = {0};
int shift;
int level;
bool deleted;
@@ -1230,12 +1347,12 @@ rt_delete(radix_tree *tree, uint64 key)
level = -1;
while (shift > 0)
{
- rt_node *child;
+ RT_NODE *child;
/* Push the current node to the stack */
stack[++level] = node;
- if (!rt_node_search_inner(node, key, &child))
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
return false;
node = child;
@@ -1244,7 +1361,7 @@ rt_delete(radix_tree *tree, uint64 key)
/* Delete the key from the leaf node if exists */
Assert(NODE_IS_LEAF(node));
- deleted = rt_node_delete_leaf(node, key);
+ deleted = RT_NODE_DELETE_LEAF(node, key);
if (!deleted)
{
@@ -1263,14 +1380,14 @@ rt_delete(radix_tree *tree, uint64 key)
return true;
/* Free the empty leaf node */
- rt_free_node(tree, node);
+ RT_FREE_NODE(tree, node);
/* Delete the key in inner nodes recursively */
while (level >= 0)
{
node = stack[level--];
- deleted = rt_node_delete_inner(node, key);
+ deleted = RT_NODE_DELETE_INNER(node, key);
Assert(deleted);
/* If the node didn't become empty, we stop deleting the key */
@@ -1278,55 +1395,56 @@ rt_delete(radix_tree *tree, uint64 key)
break;
/* The node became empty */
- rt_free_node(tree, node);
+ RT_FREE_NODE(tree, node);
}
return true;
}
-/* Create and return the iterator for the given radix tree */
-rt_iter *
-rt_begin_iterate(radix_tree *tree)
+static inline void
+RT_ITER_UPDATE_KEY(RT_ITER *iter, uint8 chunk, uint8 shift)
{
- MemoryContext old_ctx;
- rt_iter *iter;
- int top_level;
-
- old_ctx = MemoryContextSwitchTo(tree->context);
-
- iter = (rt_iter *) palloc0(sizeof(rt_iter));
- iter->tree = tree;
-
- /* empty tree */
- if (!iter->tree->root)
- return iter;
-
- top_level = iter->tree->root->shift / RT_NODE_SPAN;
- iter->stack_len = top_level;
-
- /*
- * Descend to the left most leaf node from the root. The key is being
- * constructed while descending to the leaf.
- */
- rt_update_iter_stack(iter, iter->tree->root, top_level);
+ iter->key &= ~(((uint64) RT_CHUNK_MASK) << shift);
+ iter->key |= (((uint64) chunk) << shift);
+}
- MemoryContextSwitchTo(old_ctx);
+/*
+ * Advance the slot in the inner node. Return the child if exists, otherwise
+ * null.
+ */
+static inline RT_NODE *
+RT_NODE_INNER_ITERATE_NEXT(RT_ITER *iter, RT_NODE_ITER *node_iter)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_iter_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
- return iter;
+/*
+ * Advance the slot in the leaf node. On success, return true and the value
+ * is set to value_p, otherwise return false.
+ */
+static inline bool
+RT_NODE_LEAF_ITERATE_NEXT(RT_ITER *iter, RT_NODE_ITER *node_iter,
+ uint64 *value_p)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_iter_impl.h"
+#undef RT_NODE_LEVEL_LEAF
}
/*
* Update each node_iter for inner nodes in the iterator node stack.
*/
static void
-rt_update_iter_stack(rt_iter *iter, rt_node *from_node, int from)
+RT_UPDATE_ITER_STACK(RT_ITER *iter, RT_NODE *from_node, int from)
{
int level = from;
- rt_node *node = from_node;
+ RT_NODE *node = from_node;
for (;;)
{
- rt_node_iter *node_iter = &(iter->stack[level--]);
+ RT_NODE_ITER *node_iter = &(iter->stack[level--]);
node_iter->node = node;
node_iter->current_idx = -1;
@@ -1336,19 +1454,50 @@ rt_update_iter_stack(rt_iter *iter, rt_node *from_node, int from)
return;
/* Advance to the next slot in the inner node */
- node = rt_node_inner_iterate_next(iter, node_iter);
+ node = RT_NODE_INNER_ITERATE_NEXT(iter, node_iter);
/* We must find the first children in the node */
Assert(node);
}
}
+/* Create and return the iterator for the given radix tree */
+RT_SCOPE RT_ITER *
+RT_BEGIN_ITERATE(RT_RADIX_TREE *tree)
+{
+ MemoryContext old_ctx;
+ RT_ITER *iter;
+ int top_level;
+
+ old_ctx = MemoryContextSwitchTo(tree->context);
+
+ iter = (RT_ITER *) palloc0(sizeof(RT_ITER));
+ iter->tree = tree;
+
+ /* empty tree */
+ if (!iter->tree->root)
+ return iter;
+
+ top_level = iter->tree->root->shift / RT_NODE_SPAN;
+ iter->stack_len = top_level;
+
+ /*
+ * Descend to the left most leaf node from the root. The key is being
+ * constructed while descending to the leaf.
+ */
+ RT_UPDATE_ITER_STACK(iter, iter->tree->root, top_level);
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return iter;
+}
+
/*
* Return true with setting key_p and value_p if there is next key. Otherwise,
* return false.
*/
-bool
-rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p)
+RT_SCOPE bool
+RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, uint64 *value_p)
{
/* Empty tree */
if (!iter->tree->root)
@@ -1356,13 +1505,13 @@ rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p)
for (;;)
{
- rt_node *child = NULL;
+ RT_NODE *child = NULL;
uint64 value;
int level;
bool found;
/* Advance the leaf node iterator to get next key-value pair */
- found = rt_node_leaf_iterate_next(iter, &(iter->stack[0]), &value);
+ found = RT_NODE_LEAF_ITERATE_NEXT(iter, &(iter->stack[0]), &value);
if (found)
{
@@ -1377,7 +1526,7 @@ rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p)
*/
for (level = 1; level <= iter->stack_len; level++)
{
- child = rt_node_inner_iterate_next(iter, &(iter->stack[level]));
+ child = RT_NODE_INNER_ITERATE_NEXT(iter, &(iter->stack[level]));
if (child)
break;
@@ -1391,7 +1540,7 @@ rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p)
* Set the node to the node iterator and update the iterator stack
* from this node.
*/
- rt_update_iter_stack(iter, child, level - 1);
+ RT_UPDATE_ITER_STACK(iter, child, level - 1);
/* Node iterators are updated, so try again from the leaf */
}
@@ -1399,49 +1548,17 @@ rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p)
return false;
}
-void
-rt_end_iterate(rt_iter *iter)
+RT_SCOPE void
+RT_END_ITERATE(RT_ITER *iter)
{
pfree(iter);
}
-static inline void
-rt_iter_update_key(rt_iter *iter, uint8 chunk, uint8 shift)
-{
- iter->key &= ~(((uint64) RT_CHUNK_MASK) << shift);
- iter->key |= (((uint64) chunk) << shift);
-}
-
-/*
- * Advance the slot in the inner node. Return the child if exists, otherwise
- * null.
- */
-static inline rt_node *
-rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
-{
-#define RT_NODE_LEVEL_INNER
-#include "lib/radixtree_iter_impl.h"
-#undef RT_NODE_LEVEL_INNER
-}
-
-/*
- * Advance the slot in the leaf node. On success, return true and the value
- * is set to value_p, otherwise return false.
- */
-static inline bool
-rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
- uint64 *value_p)
-{
-#define RT_NODE_LEVEL_LEAF
-#include "lib/radixtree_iter_impl.h"
-#undef RT_NODE_LEVEL_LEAF
-}
-
/*
* Return the number of keys in the radix tree.
*/
-uint64
-rt_num_entries(radix_tree *tree)
+RT_SCOPE uint64
+RT_NUM_ENTRIES(RT_RADIX_TREE *tree)
{
return tree->num_keys;
}
@@ -1449,10 +1566,10 @@ rt_num_entries(radix_tree *tree)
/*
* Return the statistics of the amount of memory used by the radix tree.
*/
-uint64
-rt_memory_usage(radix_tree *tree)
+RT_SCOPE uint64
+RT_MEMORY_USAGE(RT_RADIX_TREE *tree)
{
- Size total = sizeof(radix_tree);
+ Size total = sizeof(RT_RADIX_TREE);
for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
{
@@ -1467,7 +1584,7 @@ rt_memory_usage(radix_tree *tree)
* Verify the radix tree node.
*/
static void
-rt_verify_node(rt_node *node)
+RT_VERIFY_NODE(RT_NODE *node)
{
#ifdef USE_ASSERT_CHECKING
Assert(node->count >= 0);
@@ -1476,7 +1593,7 @@ rt_verify_node(rt_node *node)
{
case RT_NODE_KIND_4:
{
- rt_node_base_4 *n4 = (rt_node_base_4 *) node;
+ RT_NODE_BASE_4 *n4 = (RT_NODE_BASE_4 *) node;
for (int i = 1; i < n4->n.count; i++)
Assert(n4->chunks[i - 1] < n4->chunks[i]);
@@ -1485,7 +1602,7 @@ rt_verify_node(rt_node *node)
}
case RT_NODE_KIND_32:
{
- rt_node_base_32 *n32 = (rt_node_base_32 *) node;
+ RT_NODE_BASE_32 *n32 = (RT_NODE_BASE_32 *) node;
for (int i = 1; i < n32->n.count; i++)
Assert(n32->chunks[i - 1] < n32->chunks[i]);
@@ -1494,7 +1611,7 @@ rt_verify_node(rt_node *node)
}
case RT_NODE_KIND_125:
{
- rt_node_base_125 *n125 = (rt_node_base_125 *) node;
+ RT_NODE_BASE_125 *n125 = (RT_NODE_BASE_125 *) node;
int cnt = 0;
for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
@@ -1503,7 +1620,7 @@ rt_verify_node(rt_node *node)
int idx = BM_IDX(slot);
int bitnum = BM_BIT(slot);
- if (!node_125_is_chunk_used(n125, i))
+ if (!RT_NODE_125_IS_CHUNK_USED(n125, i))
continue;
/* Check if the corresponding slot is used */
@@ -1520,7 +1637,7 @@ rt_verify_node(rt_node *node)
{
if (NODE_IS_LEAF(node))
{
- rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+ RT_NODE_LEAF_256 *n256 = (RT_NODE_LEAF_256 *) node;
int cnt = 0;
for (int i = 0; i < BM_IDX(RT_NODE_MAX_SLOTS); i++)
@@ -1539,7 +1656,7 @@ rt_verify_node(rt_node *node)
/***************** DEBUG FUNCTIONS *****************/
#ifdef RT_DEBUG
void
-rt_stats(radix_tree *tree)
+rt_stats(RT_RADIX_TREE *tree)
{
ereport(NOTICE, (errmsg("num_keys = " UINT64_FORMAT ", height = %d, n4 = %u, n15 = %u, n32 = %u, n125 = %u, n256 = %u",
tree->num_keys,
@@ -1552,7 +1669,7 @@ rt_stats(radix_tree *tree)
}
static void
-rt_dump_node(rt_node *node, int level, bool recurse)
+rt_dump_node(RT_NODE *node, int level, bool recurse)
{
char space[125] = {0};
@@ -1575,14 +1692,14 @@ rt_dump_node(rt_node *node, int level, bool recurse)
{
if (NODE_IS_LEAF(node))
{
- rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+ RT_NODE_LEAF_4 *n4 = (RT_NODE_LEAF_4 *) node;
fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
space, n4->base.chunks[i], n4->values[i]);
}
else
{
- rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+ RT_NODE_INNER_4 *n4 = (RT_NODE_INNER_4 *) node;
fprintf(stderr, "%schunk 0x%X ->",
space, n4->base.chunks[i]);
@@ -1601,14 +1718,14 @@ rt_dump_node(rt_node *node, int level, bool recurse)
{
if (NODE_IS_LEAF(node))
{
- rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+ RT_NODE_LEAF_32 *n32 = (RT_NODE_LEAF_32 *) node;
fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
space, n32->base.chunks[i], n32->values[i]);
}
else
{
- rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+ RT_NODE_INNER_32 *n32 = (RT_NODE_INNER_32 *) node;
fprintf(stderr, "%schunk 0x%X ->",
space, n32->base.chunks[i]);
@@ -1625,19 +1742,19 @@ rt_dump_node(rt_node *node, int level, bool recurse)
}
case RT_NODE_KIND_125:
{
- rt_node_base_125 *b125 = (rt_node_base_125 *) node;
+ RT_NODE_BASE_125 *b125 = (RT_NODE_BASE_125 *) node;
fprintf(stderr, "slot_idxs ");
for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
{
- if (!node_125_is_chunk_used(b125, i))
+ if (!RT_NODE_125_IS_CHUNK_USED(b125, i))
continue;
fprintf(stderr, " [%d]=%d, ", i, b125->slot_idxs[i]);
}
if (NODE_IS_LEAF(node))
{
- rt_node_leaf_125 *n = (rt_node_leaf_125 *) node;
+ RT_NODE_LEAF_125 *n = (RT_NODE_LEAF_125 *) node;
fprintf(stderr, ", isset-bitmap:");
for (int i = 0; i < BM_IDX(128); i++)
@@ -1649,25 +1766,25 @@ rt_dump_node(rt_node *node, int level, bool recurse)
for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
{
- if (!node_125_is_chunk_used(b125, i))
+ if (!RT_NODE_125_IS_CHUNK_USED(b125, i))
continue;
if (NODE_IS_LEAF(node))
{
- rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) b125;
+ RT_NODE_LEAF_125 *n125 = (RT_NODE_LEAF_125 *) b125;
fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
- space, i, node_leaf_125_get_value(n125, i));
+ space, i, RT_NODE_LEAF_125_GET_VALUE(n125, i));
}
else
{
- rt_node_inner_125 *n125 = (rt_node_inner_125 *) b125;
+ RT_NODE_INNER_125 *n125 = (RT_NODE_INNER_125 *) b125;
fprintf(stderr, "%schunk 0x%X ->",
space, i);
if (recurse)
- rt_dump_node(node_inner_125_get_child(n125, i),
+ rt_dump_node(RT_NODE_INNER_125_GET_CHILD(n125, i),
level + 1, recurse);
else
fprintf(stderr, "\n");
@@ -1681,26 +1798,26 @@ rt_dump_node(rt_node *node, int level, bool recurse)
{
if (NODE_IS_LEAF(node))
{
- rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+ RT_NODE_LEAF_256 *n256 = (RT_NODE_LEAF_256 *) node;
- if (!node_leaf_256_is_chunk_used(n256, i))
+ if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, i))
continue;
fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
- space, i, node_leaf_256_get_value(n256, i));
+ space, i, RT_NODE_LEAF_256_GET_VALUE(n256, i));
}
else
{
- rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+ RT_NODE_INNER_256 *n256 = (RT_NODE_INNER_256 *) node;
- if (!node_inner_256_is_chunk_used(n256, i))
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
continue;
fprintf(stderr, "%schunk 0x%X ->",
space, i);
if (recurse)
- rt_dump_node(node_inner_256_get_child(n256, i), level + 1,
+ rt_dump_node(RT_NODE_INNER_256_GET_CHILD(n256, i), level + 1,
recurse);
else
fprintf(stderr, "\n");
@@ -1712,9 +1829,9 @@ rt_dump_node(rt_node *node, int level, bool recurse)
}
void
-rt_dump_search(radix_tree *tree, uint64 key)
+rt_dump_search(RT_RADIX_TREE *tree, uint64 key)
{
- rt_node *node;
+ RT_NODE *node;
int shift;
int level = 0;
@@ -1739,7 +1856,7 @@ rt_dump_search(radix_tree *tree, uint64 key)
shift = tree->root->shift;
while (shift >= 0)
{
- rt_node *child;
+ RT_NODE *child;
rt_dump_node(node, level, false);
@@ -1748,12 +1865,12 @@ rt_dump_search(radix_tree *tree, uint64 key)
uint64 dummy;
/* We reached at a leaf node, find the corresponding slot */
- rt_node_search_leaf(node, key, &dummy);
+ RT_NODE_SEARCH_LEAF(node, key, &dummy);
break;
}
- if (!rt_node_search_inner(node, key, &child))
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
break;
node = child;
@@ -1763,16 +1880,16 @@ rt_dump_search(radix_tree *tree, uint64 key)
}
void
-rt_dump(radix_tree *tree)
+rt_dump(RT_RADIX_TREE *tree)
{
for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
fprintf(stderr, "%s\tinner_size %zu\tinner_blocksize %zu\tleaf_size %zu\tleaf_blocksize %zu\n",
- rt_size_class_info[i].name,
- rt_size_class_info[i].inner_size,
- rt_size_class_info[i].inner_blocksize,
- rt_size_class_info[i].leaf_size,
- rt_size_class_info[i].leaf_blocksize);
+ RT_SIZE_CLASS_INFO[i].name,
+ RT_SIZE_CLASS_INFO[i].inner_size,
+ RT_SIZE_CLASS_INFO[i].inner_blocksize,
+ RT_SIZE_CLASS_INFO[i].leaf_size,
+ RT_SIZE_CLASS_INFO[i].leaf_blocksize);
fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->max_val);
if (!tree->root)
@@ -1784,3 +1901,107 @@ rt_dump(radix_tree *tree)
rt_dump_node(tree->root, 0, true);
}
#endif
+
+#endif /* RT_DEFINE */
+
+
+/* undefine external parameters, so next radix tree can be defined */
+#undef RT_PREFIX
+#undef RT_SCOPE
+#undef RT_DECLARE
+#undef RT_DEFINE
+
+/* locally declared macros */
+#undef NODE_IS_LEAF
+#undef NODE_IS_EMPTY
+#undef VAR_NODE_HAS_FREE_SLOT
+#undef FIXED_NODE_HAS_FREE_SLOT
+#undef RT_SIZE_CLASS_COUNT
+
+/* type declarations */
+#undef RT_RADIX_TREE
+#undef RT_ITER
+#undef RT_NODE
+#undef RT_NODE_ITER
+#undef RT_NODE_BASE_4
+#undef RT_NODE_BASE_32
+#undef RT_NODE_BASE_125
+#undef RT_NODE_BASE_256
+#undef RT_NODE_INNER_4
+#undef RT_NODE_INNER_32
+#undef RT_NODE_INNER_125
+#undef RT_NODE_INNER_256
+#undef RT_NODE_LEAF_4
+#undef RT_NODE_LEAF_32
+#undef RT_NODE_LEAF_125
+#undef RT_NODE_LEAF_256
+#undef RT_SIZE_CLASS
+#undef RT_SIZE_CLASS_ELEM
+#undef RT_SIZE_CLASS_INFO
+#undef RT_CLASS_4_FULL
+#undef RT_CLASS_32_PARTIAL
+#undef RT_CLASS_32_FULL
+#undef RT_CLASS_125_FULL
+#undef RT_CLASS_256
+#undef RT_KIND_MIN_SIZE_CLASS
+
+/* function declarations */
+#undef RT_CREATE
+#undef RT_FREE
+#undef RT_SET
+#undef RT_BEGIN_ITERATE
+#undef RT_ITERATE_NEXT
+#undef RT_END_ITERATE
+#undef RT_DELETE
+#undef RT_MEMORY_USAGE
+#undef RT_NUM_ENTRIES
+#undef RT_DUMP
+#undef RT_DUMP_SEARCH
+#undef RT_STATS
+
+/* internal helper functions */
+#undef RT_NEW_ROOT
+#undef RT_ALLOC_NODE
+#undef RT_INIT_NODE
+#undef RT_FREE_NODE
+#undef RT_EXTEND
+#undef RT_SET_EXTEND
+#undef RT_GROW_NODE_KIND
+#undef RT_COPY_NODE
+#undef RT_REPLACE_NODE
+#undef RT_NODE_4_SEARCH_EQ
+#undef RT_NODE_32_SEARCH_EQ
+#undef RT_NODE_4_GET_INSERTPOS
+#undef RT_NODE_32_GET_INSERTPOS
+#undef RT_CHUNK_CHILDREN_ARRAY_SHIFT
+#undef RT_CHUNK_VALUES_ARRAY_SHIFT
+#undef RT_CHUNK_CHILDREN_ARRAY_DELETE
+#undef RT_CHUNK_VALUES_ARRAY_DELETE
+#undef RT_CHUNK_CHILDREN_ARRAY_COPY
+#undef RT_CHUNK_VALUES_ARRAY_COPY
+#undef RT_NODE_125_IS_CHUNK_USED
+#undef RT_NODE_INNER_125_GET_CHILD
+#undef RT_NODE_LEAF_125_GET_VALUE
+#undef RT_NODE_INNER_256_IS_CHUNK_USED
+#undef RT_NODE_LEAF_256_IS_CHUNK_USED
+#undef RT_NODE_INNER_256_GET_CHILD
+#undef RT_NODE_LEAF_256_GET_VALUE
+#undef RT_NODE_INNER_256_SET
+#undef RT_NODE_LEAF_256_SET
+#undef RT_NODE_INNER_256_DELETE
+#undef RT_NODE_LEAF_256_DELETE
+#undef RT_KEY_GET_SHIFT
+#undef RT_SHIFT_GET_MAX_VAL
+#undef RT_NODE_SEARCH_INNER
+#undef RT_NODE_SEARCH_LEAF
+#undef RT_NODE_DELETE_INNER
+#undef RT_NODE_DELETE_LEAF
+#undef RT_NODE_INSERT_INNER
+#undef RT_NODE_INSERT_LEAF
+#undef RT_NODE_INNER_ITERATE_NEXT
+#undef RT_NODE_LEAF_ITERATE_NEXT
+#undef RT_UPDATE_ITER_STACK
+#undef RT_ITER_UPDATE_KEY
+#undef RT_VERIFY_NODE
+
+#undef RT_DEBUG
diff --git a/src/include/lib/radixtree_delete_impl.h b/src/include/lib/radixtree_delete_impl.h
index 24fd9cc02b..6eefc63e19 100644
--- a/src/include/lib/radixtree_delete_impl.h
+++ b/src/include/lib/radixtree_delete_impl.h
@@ -1,15 +1,15 @@
/* TODO: shrink nodes */
#if defined(RT_NODE_LEVEL_INNER)
-#define RT_NODE4_TYPE rt_node_inner_4
-#define RT_NODE32_TYPE rt_node_inner_32
-#define RT_NODE125_TYPE rt_node_inner_125
-#define RT_NODE256_TYPE rt_node_inner_256
+#define RT_NODE4_TYPE RT_NODE_INNER_4
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
#elif defined(RT_NODE_LEVEL_LEAF)
-#define RT_NODE4_TYPE rt_node_leaf_4
-#define RT_NODE32_TYPE rt_node_leaf_32
-#define RT_NODE125_TYPE rt_node_leaf_125
-#define RT_NODE256_TYPE rt_node_leaf_256
+#define RT_NODE4_TYPE RT_NODE_LEAF_4
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
#else
#error node level must be either inner or leaf
#endif
@@ -21,16 +21,16 @@
case RT_NODE_KIND_4:
{
RT_NODE4_TYPE *n4 = (RT_NODE4_TYPE *) node;
- int idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
+ int idx = RT_NODE_4_SEARCH_EQ((RT_NODE_BASE_4 *) n4, chunk);
if (idx < 0)
return false;
#ifdef RT_NODE_LEVEL_LEAF
- chunk_values_array_delete(n4->base.chunks, (uint64 *) n4->values,
+ RT_CHUNK_VALUES_ARRAY_DELETE(n4->base.chunks, (uint64 *) n4->values,
n4->base.n.count, idx);
#else
- chunk_children_array_delete(n4->base.chunks, n4->children,
+ RT_CHUNK_CHILDREN_ARRAY_DELETE(n4->base.chunks, n4->children,
n4->base.n.count, idx);
#endif
break;
@@ -38,16 +38,16 @@
case RT_NODE_KIND_32:
{
RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
- int idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
+ int idx = RT_NODE_32_SEARCH_EQ((RT_NODE_BASE_32 *) n32, chunk);
if (idx < 0)
return false;
#ifdef RT_NODE_LEVEL_LEAF
- chunk_values_array_delete(n32->base.chunks, (uint64 *) n32->values,
+ RT_CHUNK_VALUES_ARRAY_DELETE(n32->base.chunks, (uint64 *) n32->values,
n32->base.n.count, idx);
#else
- chunk_children_array_delete(n32->base.chunks, n32->children,
+ RT_CHUNK_CHILDREN_ARRAY_DELETE(n32->base.chunks, n32->children,
n32->base.n.count, idx);
#endif
break;
@@ -74,16 +74,16 @@
RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
#ifdef RT_NODE_LEVEL_LEAF
- if (!node_leaf_256_is_chunk_used(n256, chunk))
+ if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk))
#else
- if (!node_inner_256_is_chunk_used(n256, chunk))
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, chunk))
#endif
return false;
#ifdef RT_NODE_LEVEL_LEAF
- node_leaf_256_delete(n256, chunk);
+ RT_NODE_LEAF_256_DELETE(n256, chunk);
#else
- node_inner_256_delete(n256, chunk);
+ RT_NODE_INNER_256_DELETE(n256, chunk);
#endif
break;
}
diff --git a/src/include/lib/radixtree_insert_impl.h b/src/include/lib/radixtree_insert_impl.h
index c63fe9a3c0..ff76583402 100644
--- a/src/include/lib/radixtree_insert_impl.h
+++ b/src/include/lib/radixtree_insert_impl.h
@@ -1,20 +1,20 @@
#if defined(RT_NODE_LEVEL_INNER)
-#define RT_NODE4_TYPE rt_node_inner_4
-#define RT_NODE32_TYPE rt_node_inner_32
-#define RT_NODE125_TYPE rt_node_inner_125
-#define RT_NODE256_TYPE rt_node_inner_256
+#define RT_NODE4_TYPE RT_NODE_INNER_4
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
#elif defined(RT_NODE_LEVEL_LEAF)
-#define RT_NODE4_TYPE rt_node_leaf_4
-#define RT_NODE32_TYPE rt_node_leaf_32
-#define RT_NODE125_TYPE rt_node_leaf_125
-#define RT_NODE256_TYPE rt_node_leaf_256
+#define RT_NODE4_TYPE RT_NODE_LEAF_4
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
#else
#error node level must be either inner or leaf
#endif
uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
bool chunk_exists = false;
- rt_node *newnode = NULL;
+ RT_NODE *newnode = NULL;
#ifdef RT_NODE_LEVEL_LEAF
Assert(NODE_IS_LEAF(node));
@@ -29,7 +29,7 @@
RT_NODE4_TYPE *n4 = (RT_NODE4_TYPE *) node;
int idx;
- idx = node_4_search_eq(&n4->base, chunk);
+ idx = RT_NODE_4_SEARCH_EQ(&n4->base, chunk);
if (idx != -1)
{
/* found the existing chunk */
@@ -47,22 +47,22 @@
RT_NODE32_TYPE *new32;
/* grow node from 4 to 32 */
- newnode = rt_grow_node_kind(tree, node, RT_NODE_KIND_32);
+ newnode = RT_GROW_NODE_KIND(tree, node, RT_NODE_KIND_32);
new32 = (RT_NODE32_TYPE *) newnode;
#ifdef RT_NODE_LEVEL_LEAF
- chunk_values_array_copy(n4->base.chunks, n4->values,
+ RT_CHUNK_VALUES_ARRAY_COPY(n4->base.chunks, n4->values,
new32->base.chunks, new32->values);
#else
- chunk_children_array_copy(n4->base.chunks, n4->children,
+ RT_CHUNK_CHILDREN_ARRAY_COPY(n4->base.chunks, n4->children,
new32->base.chunks, new32->children);
#endif
Assert(parent != NULL);
- rt_replace_node(tree, parent, node, newnode, key);
+ RT_REPLACE_NODE(tree, parent, node, newnode, key);
node = newnode;
}
else
{
- int insertpos = node_4_get_insertpos(&n4->base, chunk);
+ int insertpos = RT_NODE_4_GET_INSERTPOS(&n4->base, chunk);
int count = n4->base.n.count;
/* shift chunks and children */
@@ -70,10 +70,10 @@
{
Assert(count > 0);
#ifdef RT_NODE_LEVEL_LEAF
- chunk_values_array_shift(n4->base.chunks, n4->values,
+ RT_CHUNK_VALUES_ARRAY_SHIFT(n4->base.chunks, n4->values,
count, insertpos);
#else
- chunk_children_array_shift(n4->base.chunks, n4->children,
+ RT_CHUNK_CHILDREN_ARRAY_SHIFT(n4->base.chunks, n4->children,
count, insertpos);
#endif
}
@@ -90,12 +90,12 @@
/* FALLTHROUGH */
case RT_NODE_KIND_32:
{
- const rt_size_class_elem minclass = rt_size_class_info[RT_CLASS_32_PARTIAL];
- const rt_size_class_elem maxclass = rt_size_class_info[RT_CLASS_32_FULL];
+ const RT_SIZE_CLASS_ELEM class32_min = RT_SIZE_CLASS_INFO[RT_CLASS_32_PARTIAL];
+ const RT_SIZE_CLASS_ELEM class32_max = RT_SIZE_CLASS_INFO[RT_CLASS_32_FULL];
RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
int idx;
- idx = node_32_search_eq(&n32->base, chunk);
+ idx = RT_NODE_32_SEARCH_EQ(&n32->base, chunk);
if (idx != -1)
{
/* found the existing chunk */
@@ -109,20 +109,20 @@
}
if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)) &&
- n32->base.n.fanout == minclass.fanout)
+ n32->base.n.fanout == class32_min.fanout)
{
/* grow to the next size class of this kind */
#ifdef RT_NODE_LEVEL_LEAF
- newnode = rt_alloc_node(tree, RT_CLASS_32_FULL, false);
- memcpy(newnode, node, minclass.leaf_size);
+ newnode = RT_ALLOC_NODE(tree, RT_CLASS_32_FULL, false);
+ memcpy(newnode, node, class32_min.leaf_size);
#else
- newnode = rt_alloc_node(tree, RT_CLASS_32_FULL, true);
- memcpy(newnode, node, minclass.inner_size);
+ newnode = RT_ALLOC_NODE(tree, RT_CLASS_32_FULL, true);
+ memcpy(newnode, node, class32_min.inner_size);
#endif
- newnode->fanout = maxclass.fanout;
+ newnode->fanout = class32_max.fanout;
Assert(parent != NULL);
- rt_replace_node(tree, parent, node, newnode, key);
+ RT_REPLACE_NODE(tree, parent, node, newnode, key);
node = newnode;
/* also update pointer for this kind */
@@ -133,13 +133,13 @@
{
RT_NODE125_TYPE *new125;
- Assert(n32->base.n.fanout == maxclass.fanout);
+ Assert(n32->base.n.fanout == class32_max.fanout);
/* grow node from 32 to 125 */
- newnode = rt_grow_node_kind(tree, node, RT_NODE_KIND_125);
+ newnode = RT_GROW_NODE_KIND(tree, node, RT_NODE_KIND_125);
new125 = (RT_NODE125_TYPE *) newnode;
- for (int i = 0; i < maxclass.fanout; i++)
+ for (int i = 0; i < class32_max.fanout; i++)
{
new125->base.slot_idxs[n32->base.chunks[i]] = i;
#ifdef RT_NODE_LEVEL_LEAF
@@ -149,26 +149,26 @@
#endif
}
- Assert(maxclass.fanout <= sizeof(bitmapword) * BITS_PER_BYTE);
- new125->base.isset[0] = (bitmapword) (((uint64) 1 << maxclass.fanout) - 1);
+ Assert(class32_max.fanout <= sizeof(bitmapword) * BITS_PER_BYTE);
+ new125->base.isset[0] = (bitmapword) (((uint64) 1 << class32_max.fanout) - 1);
Assert(parent != NULL);
- rt_replace_node(tree, parent, node, newnode, key);
+ RT_REPLACE_NODE(tree, parent, node, newnode, key);
node = newnode;
}
else
{
- int insertpos = node_32_get_insertpos(&n32->base, chunk);
+ int insertpos = RT_NODE_32_GET_INSERTPOS(&n32->base, chunk);
int count = n32->base.n.count;
if (insertpos < count)
{
Assert(count > 0);
#ifdef RT_NODE_LEVEL_LEAF
- chunk_values_array_shift(n32->base.chunks, n32->values,
+ RT_CHUNK_VALUES_ARRAY_SHIFT(n32->base.chunks, n32->values,
count, insertpos);
#else
- chunk_children_array_shift(n32->base.chunks, n32->children,
+ RT_CHUNK_CHILDREN_ARRAY_SHIFT(n32->base.chunks, n32->children,
count, insertpos);
#endif
}
@@ -206,22 +206,22 @@
RT_NODE256_TYPE *new256;
/* grow node from 125 to 256 */
- newnode = rt_grow_node_kind(tree, node, RT_NODE_KIND_256);
+ newnode = RT_GROW_NODE_KIND(tree, node, RT_NODE_KIND_256);
new256 = (RT_NODE256_TYPE *) newnode;
for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n125->base.n.count; i++)
{
- if (!node_125_is_chunk_used(&n125->base, i))
+ if (!RT_NODE_125_IS_CHUNK_USED(&n125->base, i))
continue;
#ifdef RT_NODE_LEVEL_LEAF
- node_leaf_256_set(new256, i, node_leaf_125_get_value(n125, i));
+ RT_NODE_LEAF_256_SET(new256, i, RT_NODE_LEAF_125_GET_VALUE(n125, i));
#else
- node_inner_256_set(new256, i, node_inner_125_get_child(n125, i));
+ RT_NODE_INNER_256_SET(new256, i, RT_NODE_INNER_125_GET_CHILD(n125, i));
#endif
cnt++;
}
Assert(parent != NULL);
- rt_replace_node(tree, parent, node, newnode, key);
+ RT_REPLACE_NODE(tree, parent, node, newnode, key);
node = newnode;
}
else
@@ -260,16 +260,16 @@
RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
#ifdef RT_NODE_LEVEL_LEAF
- chunk_exists = node_leaf_256_is_chunk_used(n256, chunk);
+ chunk_exists = RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk);
#else
- chunk_exists = node_inner_256_is_chunk_used(n256, chunk);
+ chunk_exists = RT_NODE_INNER_256_IS_CHUNK_USED(n256, chunk);
#endif
Assert(chunk_exists || FIXED_NODE_HAS_FREE_SLOT(n256, RT_CLASS_256));
#ifdef RT_NODE_LEVEL_LEAF
- node_leaf_256_set(n256, chunk, value);
+ RT_NODE_LEAF_256_SET(n256, chunk, value);
#else
- node_inner_256_set(n256, chunk, child);
+ RT_NODE_INNER_256_SET(n256, chunk, child);
#endif
break;
}
@@ -283,7 +283,7 @@
* Done. Finally, verify the chunk and value is inserted or replaced
* properly in the node.
*/
- rt_verify_node(node);
+ RT_VERIFY_NODE(node);
return chunk_exists;
diff --git a/src/include/lib/radixtree_iter_impl.h b/src/include/lib/radixtree_iter_impl.h
index bebf8e725a..a153011376 100644
--- a/src/include/lib/radixtree_iter_impl.h
+++ b/src/include/lib/radixtree_iter_impl.h
@@ -1,13 +1,13 @@
#if defined(RT_NODE_LEVEL_INNER)
-#define RT_NODE4_TYPE rt_node_inner_4
-#define RT_NODE32_TYPE rt_node_inner_32
-#define RT_NODE125_TYPE rt_node_inner_125
-#define RT_NODE256_TYPE rt_node_inner_256
+#define RT_NODE4_TYPE RT_NODE_INNER_4
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
#elif defined(RT_NODE_LEVEL_LEAF)
-#define RT_NODE4_TYPE rt_node_leaf_4
-#define RT_NODE32_TYPE rt_node_leaf_32
-#define RT_NODE125_TYPE rt_node_leaf_125
-#define RT_NODE256_TYPE rt_node_leaf_256
+#define RT_NODE4_TYPE RT_NODE_LEAF_4
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
#else
#error node level must be either inner or leaf
#endif
@@ -15,7 +15,7 @@
#ifdef RT_NODE_LEVEL_LEAF
uint64 value;
#else
- rt_node *child = NULL;
+ RT_NODE *child = NULL;
#endif
bool found = false;
uint8 key_chunk;
@@ -62,7 +62,7 @@
for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
{
- if (node_125_is_chunk_used((rt_node_base_125 *) n125, i))
+ if (RT_NODE_125_IS_CHUNK_USED((RT_NODE_BASE_125 *) n125, i))
break;
}
@@ -71,9 +71,9 @@
node_iter->current_idx = i;
#ifdef RT_NODE_LEVEL_LEAF
- value = node_leaf_125_get_value(n125, i);
+ value = RT_NODE_LEAF_125_GET_VALUE(n125, i);
#else
- child = node_inner_125_get_child(n125, i);
+ child = RT_NODE_INNER_125_GET_CHILD(n125, i);
#endif
key_chunk = i;
found = true;
@@ -87,9 +87,9 @@
for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
{
#ifdef RT_NODE_LEVEL_LEAF
- if (node_leaf_256_is_chunk_used(n256, i))
+ if (RT_NODE_LEAF_256_IS_CHUNK_USED(n256, i))
#else
- if (node_inner_256_is_chunk_used(n256, i))
+ if (RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
#endif
break;
}
@@ -99,9 +99,9 @@
node_iter->current_idx = i;
#ifdef RT_NODE_LEVEL_LEAF
- value = node_leaf_256_get_value(n256, i);
+ value = RT_NODE_LEAF_256_GET_VALUE(n256, i);
#else
- child = node_inner_256_get_child(n256, i);
+ child = RT_NODE_INNER_256_GET_CHILD(n256, i);
#endif
key_chunk = i;
found = true;
@@ -111,7 +111,7 @@
if (found)
{
- rt_iter_update_key(iter, key_chunk, node_iter->node->shift);
+ RT_ITER_UPDATE_KEY(iter, key_chunk, node_iter->node->shift);
#ifdef RT_NODE_LEVEL_LEAF
*value_p = value;
#endif
diff --git a/src/include/lib/radixtree_search_impl.h b/src/include/lib/radixtree_search_impl.h
index d0366f9bb6..1a0d2d3f1f 100644
--- a/src/include/lib/radixtree_search_impl.h
+++ b/src/include/lib/radixtree_search_impl.h
@@ -1,13 +1,13 @@
#if defined(RT_NODE_LEVEL_INNER)
-#define RT_NODE4_TYPE rt_node_inner_4
-#define RT_NODE32_TYPE rt_node_inner_32
-#define RT_NODE125_TYPE rt_node_inner_125
-#define RT_NODE256_TYPE rt_node_inner_256
+#define RT_NODE4_TYPE RT_NODE_INNER_4
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
#elif defined(RT_NODE_LEVEL_LEAF)
-#define RT_NODE4_TYPE rt_node_leaf_4
-#define RT_NODE32_TYPE rt_node_leaf_32
-#define RT_NODE125_TYPE rt_node_leaf_125
-#define RT_NODE256_TYPE rt_node_leaf_256
+#define RT_NODE4_TYPE RT_NODE_LEAF_4
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
#else
#error node level must be either inner or leaf
#endif
@@ -17,7 +17,7 @@
#ifdef RT_NODE_LEVEL_LEAF
uint64 value = 0;
#else
- rt_node *child = NULL;
+ RT_NODE *child = NULL;
#endif
switch (node->kind)
@@ -25,7 +25,7 @@
case RT_NODE_KIND_4:
{
RT_NODE4_TYPE *n4 = (RT_NODE4_TYPE *) node;
- int idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
+ int idx = RT_NODE_4_SEARCH_EQ((RT_NODE_BASE_4 *) n4, chunk);
if (idx < 0)
return false;
@@ -40,7 +40,7 @@
case RT_NODE_KIND_32:
{
RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
- int idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
+ int idx = RT_NODE_32_SEARCH_EQ((RT_NODE_BASE_32 *) n32, chunk);
if (idx < 0)
return false;
@@ -56,13 +56,13 @@
{
RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
- if (!node_125_is_chunk_used((rt_node_base_125 *) n125, chunk))
+ if (!RT_NODE_125_IS_CHUNK_USED((RT_NODE_BASE_125 *) n125, chunk))
return false;
#ifdef RT_NODE_LEVEL_LEAF
- value = node_leaf_125_get_value(n125, chunk);
+ value = RT_NODE_LEAF_125_GET_VALUE(n125, chunk);
#else
- child = node_inner_125_get_child(n125, chunk);
+ child = RT_NODE_INNER_125_GET_CHILD(n125, chunk);
#endif
break;
}
@@ -71,16 +71,16 @@
RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
#ifdef RT_NODE_LEVEL_LEAF
- if (!node_leaf_256_is_chunk_used(n256, chunk))
+ if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk))
#else
- if (!node_inner_256_is_chunk_used(n256, chunk))
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, chunk))
#endif
return false;
#ifdef RT_NODE_LEVEL_LEAF
- value = node_leaf_256_get_value(n256, chunk);
+ value = RT_NODE_LEAF_256_GET_VALUE(n256, chunk);
#else
- child = node_inner_256_get_child(n256, chunk);
+ child = RT_NODE_INNER_256_GET_CHILD(n256, chunk);
#endif
break;
}
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
index ea993e63df..2256d08100 100644
--- a/src/test/modules/test_radixtree/test_radixtree.c
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -14,7 +14,6 @@
#include "common/pg_prng.h"
#include "fmgr.h"
-#include "lib/radixtree.h"
#include "miscadmin.h"
#include "nodes/bitmapset.h"
#include "storage/block.h"
@@ -99,6 +98,14 @@ static const test_spec test_specs[] = {
}
};
+/* define the radix tree implementation to test */
+#define RT_PREFIX rt
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#include "lib/radixtree.h"
+
+
PG_MODULE_MAGIC;
PG_FUNCTION_INFO_V1(test_radixtree);
@@ -106,7 +113,7 @@ PG_FUNCTION_INFO_V1(test_radixtree);
static void
test_empty(void)
{
- radix_tree *radixtree;
+ rt_radix_tree *radixtree;
rt_iter *iter;
uint64 dummy;
uint64 key;
@@ -142,7 +149,7 @@ test_empty(void)
static void
test_basic(int children, bool test_inner)
{
- radix_tree *radixtree;
+ rt_radix_tree *radixtree;
uint64 *keys;
int shift = test_inner ? 8 : 0;
@@ -192,7 +199,7 @@ test_basic(int children, bool test_inner)
* Check if keys from start to end with the shift exist in the tree.
*/
static void
-check_search_on_node(radix_tree *radixtree, uint8 shift, int start, int end,
+check_search_on_node(rt_radix_tree *radixtree, uint8 shift, int start, int end,
int incr)
{
for (int i = start; i < end; i++)
@@ -210,7 +217,7 @@ check_search_on_node(radix_tree *radixtree, uint8 shift, int start, int end,
}
static void
-test_node_types_insert(radix_tree *radixtree, uint8 shift, bool insert_asc)
+test_node_types_insert(rt_radix_tree *radixtree, uint8 shift, bool insert_asc)
{
uint64 num_entries;
int ninserted = 0;
@@ -257,7 +264,7 @@ test_node_types_insert(radix_tree *radixtree, uint8 shift, bool insert_asc)
}
static void
-test_node_types_delete(radix_tree *radixtree, uint8 shift)
+test_node_types_delete(rt_radix_tree *radixtree, uint8 shift)
{
uint64 num_entries;
@@ -288,7 +295,7 @@ test_node_types_delete(radix_tree *radixtree, uint8 shift)
static void
test_node_types(uint8 shift)
{
- radix_tree *radixtree;
+ rt_radix_tree *radixtree;
elog(NOTICE, "testing radix tree node types with shift \"%d\"", shift);
@@ -312,7 +319,7 @@ test_node_types(uint8 shift)
static void
test_pattern(const test_spec * spec)
{
- radix_tree *radixtree;
+ rt_radix_tree *radixtree;
rt_iter *iter;
MemoryContext radixtree_ctx;
TimestampTz starttime;
--
2.39.0
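For anyone skimming the series, here is a minimal usage sketch of the rt_* interface documented in the file header of the next patch (the memory context setup, keys, and values here are illustrative assumptions, not part of the patches):

    /* instantiate a tree in its own memory context, probe it, and tear it down */
    MemoryContext ctx = AllocSetContextCreate(CurrentMemoryContext,
                                              "radix tree example",
                                              ALLOCSET_DEFAULT_SIZES);
    radix_tree *tree = rt_create(ctx);
    uint64      value;

    rt_set(tree, UINT64CONST(42), UINT64CONST(4242));   /* returns false: key was new */

    if (rt_search(tree, UINT64CONST(42), &value))        /* sets value = 4242 */
        elog(NOTICE, "found " UINT64_FORMAT, value);

    rt_delete(tree, UINT64CONST(42));
    rt_free(tree);
    MemoryContextDelete(ctx);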
Attachment: v17-0006-Convert-radixtree.c-into-a-header.patch (text/x-patch; charset=US-ASCII)
From 45cad7dcb2c14e035ffd03ca59fcedaf51674bb4 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Wed, 4 Jan 2023 12:54:51 +0700
Subject: [PATCH v17 6/9] Convert radixtree.c into a header
Preparation for converting to a template.
---
src/backend/lib/Makefile | 1 -
src/backend/lib/meson.build | 1 -
src/backend/lib/radixtree.c | 1767 -----------------------------------
src/include/lib/radixtree.h | 1762 +++++++++++++++++++++++++++++++++-
4 files changed, 1753 insertions(+), 1778 deletions(-)
delete mode 100644 src/backend/lib/radixtree.c
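As a rough sketch of where this is heading (mirroring the test_radixtree.c hunk in the previous patch; the exact set of RT_* switches is only settled by the later template patches), a user of the header is expected to instantiate it along these lines:

  /* generate a local radix tree implementation named with the "rt_" prefix */
  #define RT_PREFIX rt
  #define RT_SCOPE static
  #define RT_DECLARE
  #define RT_DEFINE
  #include "lib/radixtree.h"

  /* ...which provides rt_radix_tree, rt_create(), rt_set(), rt_search(), etc. */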
diff --git a/src/backend/lib/Makefile b/src/backend/lib/Makefile
index 4c1db794b6..9dad31398a 100644
--- a/src/backend/lib/Makefile
+++ b/src/backend/lib/Makefile
@@ -22,7 +22,6 @@ OBJS = \
integerset.o \
knapsack.o \
pairingheap.o \
- radixtree.o \
rbtree.o \
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/lib/meson.build b/src/backend/lib/meson.build
index 5f8df32c5c..974cab8776 100644
--- a/src/backend/lib/meson.build
+++ b/src/backend/lib/meson.build
@@ -11,5 +11,4 @@ backend_sources += files(
'knapsack.c',
'pairingheap.c',
'rbtree.c',
- 'radixtree.c',
)
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
deleted file mode 100644
index 80cde09aaf..0000000000
--- a/src/backend/lib/radixtree.c
+++ /dev/null
@@ -1,1767 +0,0 @@
-/*-------------------------------------------------------------------------
- *
- * radixtree.c
- * Implementation for adaptive radix tree.
- *
- * This module employs the idea from the paper "The Adaptive Radix Tree: ARTful
- * Indexing for Main-Memory Databases" by Viktor Leis, Alfons Kemper, and Thomas
- * Neumann, 2013. The radix tree uses adaptive node sizes, a small number of node
- * types, each with a different number of elements. Depending on the number of
- * children, the appropriate node type is used.
- *
- * There are some differences from the proposed implementation. For instance,
- * there is no support for path compression or lazy path expansion. The radix
- * tree supports a fixed key length, so we don't expect the tree to become
- * very high.
- *
- * Both the key and the value are 64-bit unsigned integers. The inner nodes and
- * the leaf nodes have slightly different structures: inner nodes, with
- * shift > 0, store pointers to their child nodes as values, while leaf nodes,
- * with shift == 0, store the 64-bit unsigned integer specified by the user as
- * the value. The paper refers to this technique as "Multi-value leaves". We
- * choose it to avoid an additional pointer traversal, which is also the reason
- * this code currently does not support variable-length keys.
- *
- * XXX: Most functions in this file have two variants for inner nodes and leaf
- * nodes, so there is duplicated code. While this sometimes makes code
- * maintenance tricky, it reduces branch prediction misses when judging
- * whether a node is an inner node or a leaf node.
- *
- * XXX: radix tree nodes are never shrunk.
- *
- * Interface
- * ---------
- *
- * rt_create - Create a new, empty radix tree
- * rt_free - Free the radix tree
- * rt_search - Search a key-value pair
- * rt_set - Set a key-value pair
- * rt_delete - Delete a key-value pair
- * rt_begin_iterate - Begin iterating through all key-value pairs
- * rt_iterate_next - Return next key-value pair, if any
- * rt_end_iter - End iteration
- * rt_memory_usage - Get the memory usage
- * rt_num_entries - Get the number of key-value pairs
- *
- * rt_create() creates an empty radix tree in the given memory context
- * and memory contexts for all kinds of radix tree node under the memory context.
- *
- * rt_iterate_next() ensures returning key-value pairs in the ascending
- * order of the key.
- *
- * Copyright (c) 2022, PostgreSQL Global Development Group
- *
- * IDENTIFICATION
- * src/backend/lib/radixtree.c
- *
- *-------------------------------------------------------------------------
- */
-
-#include "postgres.h"
-
-#include "lib/radixtree.h"
-#include "lib/stringinfo.h"
-#include "miscadmin.h"
-#include "nodes/bitmapset.h"
-#include "port/pg_bitutils.h"
-#include "port/pg_lfind.h"
-#include "utils/memutils.h"
-
-#ifdef RT_DEBUG
-#define UINT64_FORMAT_HEX "%" INT64_MODIFIER "X"
-#endif
-
-/* The number of bits encoded in one tree level */
-#define RT_NODE_SPAN BITS_PER_BYTE
-
-/* The number of maximum slots in the node */
-#define RT_NODE_MAX_SLOTS (1 << RT_NODE_SPAN)
-
-/* Mask for extracting a chunk from the key */
-#define RT_CHUNK_MASK ((1 << RT_NODE_SPAN) - 1)
-
-/* Maximum shift the radix tree uses */
-#define RT_MAX_SHIFT key_get_shift(UINT64_MAX)
-
-/* Tree level the radix tree uses */
-#define RT_MAX_LEVEL ((sizeof(uint64) * BITS_PER_BYTE) / RT_NODE_SPAN)
-
-/* Invalid index used in node-125 */
-#define RT_NODE_125_INVALID_IDX 0xFF
-
-/* Get a chunk from the key */
-#define RT_GET_KEY_CHUNK(key, shift) ((uint8) (((key) >> (shift)) & RT_CHUNK_MASK))
-
-/* For accessing bitmaps */
-#define BM_IDX(x) ((x) / BITS_PER_BITMAPWORD)
-#define BM_BIT(x) ((x) % BITS_PER_BITMAPWORD)
-
-/*
- * Supported radix tree node kinds and size classes.
- *
- * There are 4 node kinds and each node kind has one or two size classes,
- * partial and full. The size classes within the same node kind have the same
- * node structure but a different fanout, which is stored
- * in 'fanout' of rt_node. For example in size class 15, when a 16th element
- * is to be inserted, we allocate a larger area and memcpy the entire old
- * node to it.
- *
- * This technique allows us to limit the node kinds to 4, which limits the
- * number of cases in switch statements. It also allows a possible future
- * optimization to encode the node kind in a pointer tag.
- *
- * These size classes have been chosen carefully so that they minimize the
- * allocator padding in both the inner and leaf nodes on DSA.
- *
- */
-#define RT_NODE_KIND_4 0x00
-#define RT_NODE_KIND_32 0x01
-#define RT_NODE_KIND_125 0x02
-#define RT_NODE_KIND_256 0x03
-#define RT_NODE_KIND_COUNT 4
-
-typedef enum rt_size_class
-{
- RT_CLASS_4_FULL = 0,
- RT_CLASS_32_PARTIAL,
- RT_CLASS_32_FULL,
- RT_CLASS_125_FULL,
- RT_CLASS_256
-
-#define RT_SIZE_CLASS_COUNT (RT_CLASS_256 + 1)
-} rt_size_class;
-
-/* Common type for all nodes types */
-typedef struct rt_node
-{
- /*
- * Number of children. We use uint16 to be able to indicate 256 children
- * at the fanout of 8.
- */
- uint16 count;
-
- /* Max number of children. We can use uint8 because we never need to store 256 */
- /* WIP: if we don't have a variable sized node4, this should instead be in the base
- types as needed, since saving every byte is crucial for the smallest node kind */
- uint8 fanout;
-
- /*
- * Shift indicates which part of the key space is represented by this
- * node. That is, the key is shifted by 'shift' and the lowest
- * RT_NODE_SPAN bits are then represented in chunk.
- */
- uint8 shift;
- uint8 chunk;
-
- /* Node kind, one per search/set algorithm */
- uint8 kind;
-} rt_node;
-#define NODE_IS_LEAF(n) (((rt_node *) (n))->shift == 0)
-#define NODE_IS_EMPTY(n) (((rt_node *) (n))->count == 0)
-#define VAR_NODE_HAS_FREE_SLOT(node) \
- ((node)->base.n.count < (node)->base.n.fanout)
-#define FIXED_NODE_HAS_FREE_SLOT(node, class) \
- ((node)->base.n.count < rt_size_class_info[class].fanout)
-
-/* Base type of each node kinds for leaf and inner nodes */
-/* The base types must be able to accommodate the largest size
-class for variable-sized node kinds */
-typedef struct rt_node_base_4
-{
- rt_node n;
-
- /* 4 children, for key chunks */
- uint8 chunks[4];
-} rt_node_base_4;
-
-typedef struct rt_node_base32
-{
- rt_node n;
-
- /* 32 children, for key chunks */
- uint8 chunks[32];
-} rt_node_base_32;
-
-/*
- * node-125 uses slot_idx array, an array of RT_NODE_MAX_SLOTS length, typically
- * 256, to store indexes into a second array that contains up to 125 values (or
- * child pointers in inner nodes).
- */
-typedef struct rt_node_base125
-{
- rt_node n;
-
- /* The index of slots for each fanout */
- uint8 slot_idxs[RT_NODE_MAX_SLOTS];
-
- /* isset is a bitmap to track which slot is in use */
- bitmapword isset[BM_IDX(128)];
-} rt_node_base_125;
-
-typedef struct rt_node_base256
-{
- rt_node n;
-} rt_node_base_256;
-
-/*
- * Inner and leaf nodes.
- *
- * These are separate for two main reasons:
- *
- * 1) the value type might be different than something fitting into a pointer
- * width type
- * 2) Need to represent non-existing values in a key-type independent way.
- *
- * 1) is clearly worth being concerned about, but it's not clear 2) is as
- * good. It might be better to just indicate non-existing entries the same way
- * in inner nodes.
- */
-typedef struct rt_node_inner_4
-{
- rt_node_base_4 base;
-
- /* number of children depends on size class */
- rt_node *children[FLEXIBLE_ARRAY_MEMBER];
-} rt_node_inner_4;
-
-typedef struct rt_node_leaf_4
-{
- rt_node_base_4 base;
-
- /* number of values depends on size class */
- uint64 values[FLEXIBLE_ARRAY_MEMBER];
-} rt_node_leaf_4;
-
-typedef struct rt_node_inner_32
-{
- rt_node_base_32 base;
-
- /* number of children depends on size class */
- rt_node *children[FLEXIBLE_ARRAY_MEMBER];
-} rt_node_inner_32;
-
-typedef struct rt_node_leaf_32
-{
- rt_node_base_32 base;
-
- /* number of values depends on size class */
- uint64 values[FLEXIBLE_ARRAY_MEMBER];
-} rt_node_leaf_32;
-
-typedef struct rt_node_inner_125
-{
- rt_node_base_125 base;
-
- /* number of children depends on size class */
- rt_node *children[FLEXIBLE_ARRAY_MEMBER];
-} rt_node_inner_125;
-
-typedef struct rt_node_leaf_125
-{
- rt_node_base_125 base;
-
- /* number of values depends on size class */
- uint64 values[FLEXIBLE_ARRAY_MEMBER];
-} rt_node_leaf_125;
-
-/*
- * node-256 is the largest node type. This node has RT_NODE_MAX_SLOTS length array
- * for directly storing values (or child pointers in inner nodes).
- */
-typedef struct rt_node_inner_256
-{
- rt_node_base_256 base;
-
- /* Slots for 256 children */
- rt_node *children[RT_NODE_MAX_SLOTS];
-} rt_node_inner_256;
-
-typedef struct rt_node_leaf_256
-{
- rt_node_base_256 base;
-
- /* isset is a bitmap to track which slot is in use */
- bitmapword isset[BM_IDX(RT_NODE_MAX_SLOTS)];
-
- /* Slots for 256 values */
- uint64 values[RT_NODE_MAX_SLOTS];
-} rt_node_leaf_256;
-
-/* Information for each size class */
-typedef struct rt_size_class_elem
-{
- const char *name;
- int fanout;
-
- /* slab chunk size */
- Size inner_size;
- Size leaf_size;
-
- /* slab block size */
- Size inner_blocksize;
- Size leaf_blocksize;
-} rt_size_class_elem;
-
-/*
- * Calculate the slab blocksize so that we can allocate at least 32 chunks
- * from the block.
- */
-#define NODE_SLAB_BLOCK_SIZE(size) \
- Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * (size), (size) * 32)
-static rt_size_class_elem rt_size_class_info[RT_SIZE_CLASS_COUNT] = {
- [RT_CLASS_4_FULL] = {
- .name = "radix tree node 4",
- .fanout = 4,
- .inner_size = sizeof(rt_node_inner_4) + 4 * sizeof(rt_node *),
- .leaf_size = sizeof(rt_node_leaf_4) + 4 * sizeof(uint64),
- .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_4) + 4 * sizeof(rt_node *)),
- .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_4) + 4 * sizeof(uint64)),
- },
- [RT_CLASS_32_PARTIAL] = {
- .name = "radix tree node 15",
- .fanout = 15,
- .inner_size = sizeof(rt_node_inner_32) + 15 * sizeof(rt_node *),
- .leaf_size = sizeof(rt_node_leaf_32) + 15 * sizeof(uint64),
- .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_32) + 15 * sizeof(rt_node *)),
- .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_32) + 15 * sizeof(uint64)),
- },
- [RT_CLASS_32_FULL] = {
- .name = "radix tree node 32",
- .fanout = 32,
- .inner_size = sizeof(rt_node_inner_32) + 32 * sizeof(rt_node *),
- .leaf_size = sizeof(rt_node_leaf_32) + 32 * sizeof(uint64),
- .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_32) + 32 * sizeof(rt_node *)),
- .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_32) + 32 * sizeof(uint64)),
- },
- [RT_CLASS_125_FULL] = {
- .name = "radix tree node 125",
- .fanout = 125,
- .inner_size = sizeof(rt_node_inner_125) + 125 * sizeof(rt_node *),
- .leaf_size = sizeof(rt_node_leaf_125) + 125 * sizeof(uint64),
- .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_125) + 125 * sizeof(rt_node *)),
- .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_125) + 125 * sizeof(uint64)),
- },
- [RT_CLASS_256] = {
- .name = "radix tree node 256",
- .fanout = 256,
- .inner_size = sizeof(rt_node_inner_256),
- .leaf_size = sizeof(rt_node_leaf_256),
- .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_256)),
- .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_256)),
- },
-};
-
-/* Map from the node kind to its minimum size class */
-static rt_size_class kind_min_size_class[RT_NODE_KIND_COUNT] = {
- [RT_NODE_KIND_4] = RT_CLASS_4_FULL,
- [RT_NODE_KIND_32] = RT_CLASS_32_PARTIAL,
- [RT_NODE_KIND_125] = RT_CLASS_125_FULL,
- [RT_NODE_KIND_256] = RT_CLASS_256,
-};
-
-/*
- * Iteration support.
- *
- * Iterating the radix tree returns each pair of key and value in the ascending
- * order of the key. To support this, we iterate over the nodes of each level.
- *
- * rt_node_iter struct is used to track the iteration within a node.
- *
- * rt_iter is the struct for iteration of the radix tree, and uses rt_node_iter
- * in order to track the iteration of each level. During the iteration, we also
- * construct the key whenever updating the node iteration information, e.g., when
- * advancing the current index within the node or when moving to the next node
- * at the same level.
- */
-typedef struct rt_node_iter
-{
- rt_node *node; /* current node being iterated */
- int current_idx; /* current position. -1 for initial value */
-} rt_node_iter;
-
-struct rt_iter
-{
- radix_tree *tree;
-
- /* Track the iteration on nodes of each level */
- rt_node_iter stack[RT_MAX_LEVEL];
- int stack_len;
-
- /* The key is being constructed during the iteration */
- uint64 key;
-};
-
-/* A radix tree with nodes */
-struct radix_tree
-{
- MemoryContext context;
-
- rt_node *root;
- uint64 max_val;
- uint64 num_keys;
-
- MemoryContextData *inner_slabs[RT_SIZE_CLASS_COUNT];
- MemoryContextData *leaf_slabs[RT_SIZE_CLASS_COUNT];
-
- /* statistics */
-#ifdef RT_DEBUG
- int32 cnt[RT_SIZE_CLASS_COUNT];
-#endif
-};
-
-static void rt_new_root(radix_tree *tree, uint64 key);
-static rt_node *rt_alloc_node(radix_tree *tree, rt_size_class size_class, bool inner);
-static inline void rt_init_node(rt_node *node, uint8 kind, rt_size_class size_class,
- bool inner);
-static void rt_free_node(radix_tree *tree, rt_node *node);
-static void rt_extend(radix_tree *tree, uint64 key);
-static inline bool rt_node_search_inner(rt_node *node, uint64 key, rt_node **child_p);
-static inline bool rt_node_search_leaf(rt_node *node, uint64 key, uint64 *value_p);
-static bool rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node,
- uint64 key, rt_node *child);
-static bool rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
- uint64 key, uint64 value);
-static inline rt_node *rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter);
-static inline bool rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
- uint64 *value_p);
-static void rt_update_iter_stack(rt_iter *iter, rt_node *from_node, int from);
-static inline void rt_iter_update_key(rt_iter *iter, uint8 chunk, uint8 shift);
-
-/* verification (available only with assertion) */
-static void rt_verify_node(rt_node *node);
-
-/*
- * Return index of the first element in 'base' that equals 'key'. Return -1
- * if there is no such element.
- */
-static inline int
-node_4_search_eq(rt_node_base_4 *node, uint8 chunk)
-{
- int idx = -1;
-
- for (int i = 0; i < node->n.count; i++)
- {
- if (node->chunks[i] == chunk)
- {
- idx = i;
- break;
- }
- }
-
- return idx;
-}
-
-/*
- * Return index of the chunk to insert into chunks in the given node.
- */
-static inline int
-node_4_get_insertpos(rt_node_base_4 *node, uint8 chunk)
-{
- int idx;
-
- for (idx = 0; idx < node->n.count; idx++)
- {
- if (node->chunks[idx] >= chunk)
- break;
- }
-
- return idx;
-}
-
-/*
- * Return index of the first element in 'base' that equals 'key'. Return -1
- * if there is no such element.
- */
-static inline int
-node_32_search_eq(rt_node_base_32 *node, uint8 chunk)
-{
- int count = node->n.count;
-#ifndef USE_NO_SIMD
- Vector8 spread_chunk;
- Vector8 haystack1;
- Vector8 haystack2;
- Vector8 cmp1;
- Vector8 cmp2;
- uint32 bitfield;
- int index_simd = -1;
-#endif
-
-#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
- int index = -1;
-
- for (int i = 0; i < count; i++)
- {
- if (node->chunks[i] == chunk)
- {
- index = i;
- break;
- }
- }
-#endif
-
-#ifndef USE_NO_SIMD
- spread_chunk = vector8_broadcast(chunk);
- vector8_load(&haystack1, &node->chunks[0]);
- vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
- cmp1 = vector8_eq(spread_chunk, haystack1);
- cmp2 = vector8_eq(spread_chunk, haystack2);
- bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
- bitfield &= ((UINT64CONST(1) << count) - 1);
-
- if (bitfield)
- index_simd = pg_rightmost_one_pos32(bitfield);
-
- Assert(index_simd == index);
- return index_simd;
-#else
- return index;
-#endif
-}
-
-/*
- * Return index of the chunk to insert into chunks in the given node.
- */
-static inline int
-node_32_get_insertpos(rt_node_base_32 *node, uint8 chunk)
-{
- int count = node->n.count;
-#ifndef USE_NO_SIMD
- Vector8 spread_chunk;
- Vector8 haystack1;
- Vector8 haystack2;
- Vector8 cmp1;
- Vector8 cmp2;
- Vector8 min1;
- Vector8 min2;
- uint32 bitfield;
- int index_simd;
-#endif
-
-#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
- int index;
-
- for (index = 0; index < count; index++)
- {
- if (node->chunks[index] >= chunk)
- break;
- }
-#endif
-
-#ifndef USE_NO_SIMD
- spread_chunk = vector8_broadcast(chunk);
- vector8_load(&haystack1, &node->chunks[0]);
- vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
- min1 = vector8_min(spread_chunk, haystack1);
- min2 = vector8_min(spread_chunk, haystack2);
- cmp1 = vector8_eq(spread_chunk, min1);
- cmp2 = vector8_eq(spread_chunk, min2);
- bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
- bitfield &= ((UINT64CONST(1) << count) - 1);
-
- if (bitfield)
- index_simd = pg_rightmost_one_pos32(bitfield);
- else
- index_simd = count;
-
- Assert(index_simd == index);
- return index_simd;
-#else
- return index;
-#endif
-}
-
-/*
- * Functions to manipulate both chunks array and children/values array.
- * These are used for node-4 and node-32.
- */
-
-/* Shift the elements right at 'idx' by one */
-static inline void
-chunk_children_array_shift(uint8 *chunks, rt_node **children, int count, int idx)
-{
- memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
- memmove(&(children[idx + 1]), &(children[idx]), sizeof(rt_node *) * (count - idx));
-}
-
-static inline void
-chunk_values_array_shift(uint8 *chunks, uint64 *values, int count, int idx)
-{
- memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
- memmove(&(values[idx + 1]), &(values[idx]), sizeof(uint64 *) * (count - idx));
-}
-
-/* Delete the element at 'idx' */
-static inline void
-chunk_children_array_delete(uint8 *chunks, rt_node **children, int count, int idx)
-{
- memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
- memmove(&(children[idx]), &(children[idx + 1]), sizeof(rt_node *) * (count - idx - 1));
-}
-
-static inline void
-chunk_values_array_delete(uint8 *chunks, uint64 *values, int count, int idx)
-{
- memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
- memmove(&(values[idx]), &(values[idx + 1]), sizeof(uint64) * (count - idx - 1));
-}
-
-/* Copy both chunks and children/values arrays */
-static inline void
-chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
- uint8 *dst_chunks, rt_node **dst_children)
-{
- const int fanout = rt_size_class_info[RT_CLASS_4_FULL].fanout;
- const Size chunk_size = sizeof(uint8) * fanout;
- const Size children_size = sizeof(rt_node *) * fanout;
-
- memcpy(dst_chunks, src_chunks, chunk_size);
- memcpy(dst_children, src_children, children_size);
-}
-
-static inline void
-chunk_values_array_copy(uint8 *src_chunks, uint64 *src_values,
- uint8 *dst_chunks, uint64 *dst_values)
-{
- const int fanout = rt_size_class_info[RT_CLASS_4_FULL].fanout;
- const Size chunk_size = sizeof(uint8) * fanout;
- const Size values_size = sizeof(uint64) * fanout;
-
- memcpy(dst_chunks, src_chunks, chunk_size);
- memcpy(dst_values, src_values, values_size);
-}
-
-/* Functions to manipulate inner and leaf node-125 */
-
-/* Does the given chunk in the node have a value? */
-static inline bool
-node_125_is_chunk_used(rt_node_base_125 *node, uint8 chunk)
-{
- return node->slot_idxs[chunk] != RT_NODE_125_INVALID_IDX;
-}
-
-static inline rt_node *
-node_inner_125_get_child(rt_node_inner_125 *node, uint8 chunk)
-{
- Assert(!NODE_IS_LEAF(node));
- return node->children[node->base.slot_idxs[chunk]];
-}
-
-static inline uint64
-node_leaf_125_get_value(rt_node_leaf_125 *node, uint8 chunk)
-{
- Assert(NODE_IS_LEAF(node));
- Assert(((rt_node_base_125 *) node)->slot_idxs[chunk] != RT_NODE_125_INVALID_IDX);
- return node->values[node->base.slot_idxs[chunk]];
-}
-
-/* Functions to manipulate inner and leaf node-256 */
-
-/* Return true if the slot corresponding to the given chunk is in use */
-static inline bool
-node_inner_256_is_chunk_used(rt_node_inner_256 *node, uint8 chunk)
-{
- Assert(!NODE_IS_LEAF(node));
- return (node->children[chunk] != NULL);
-}
-
-static inline bool
-node_leaf_256_is_chunk_used(rt_node_leaf_256 *node, uint8 chunk)
-{
- int idx = BM_IDX(chunk);
- int bitnum = BM_BIT(chunk);
-
- Assert(NODE_IS_LEAF(node));
- return (node->isset[idx] & ((bitmapword) 1 << bitnum)) != 0;
-}
-
-static inline rt_node *
-node_inner_256_get_child(rt_node_inner_256 *node, uint8 chunk)
-{
- Assert(!NODE_IS_LEAF(node));
- Assert(node_inner_256_is_chunk_used(node, chunk));
- return node->children[chunk];
-}
-
-static inline uint64
-node_leaf_256_get_value(rt_node_leaf_256 *node, uint8 chunk)
-{
- Assert(NODE_IS_LEAF(node));
- Assert(node_leaf_256_is_chunk_used(node, chunk));
- return node->values[chunk];
-}
-
-/* Set the child in the node-256 */
-static inline void
-node_inner_256_set(rt_node_inner_256 *node, uint8 chunk, rt_node *child)
-{
- Assert(!NODE_IS_LEAF(node));
- node->children[chunk] = child;
-}
-
-/* Set the value in the node-256 */
-static inline void
-node_leaf_256_set(rt_node_leaf_256 *node, uint8 chunk, uint64 value)
-{
- int idx = BM_IDX(chunk);
- int bitnum = BM_BIT(chunk);
-
- Assert(NODE_IS_LEAF(node));
- node->isset[idx] |= ((bitmapword) 1 << bitnum);
- node->values[chunk] = value;
-}
-
-/* Clear the slot at the given chunk position */
-static inline void
-node_inner_256_delete(rt_node_inner_256 *node, uint8 chunk)
-{
- Assert(!NODE_IS_LEAF(node));
- node->children[chunk] = NULL;
-}
-
-static inline void
-node_leaf_256_delete(rt_node_leaf_256 *node, uint8 chunk)
-{
- int idx = BM_IDX(chunk);
- int bitnum = BM_BIT(chunk);
-
- Assert(NODE_IS_LEAF(node));
- node->isset[idx] &= ~((bitmapword) 1 << bitnum);
-}
-
-/*
- * Return the shift needed to store the given key.
- */
-static inline int
-key_get_shift(uint64 key)
-{
- return (key == 0)
- ? 0
- : (pg_leftmost_one_pos64(key) / RT_NODE_SPAN) * RT_NODE_SPAN;
-}
-
-/*
- * Return the max value stored in a node with the given shift.
- */
-static uint64
-shift_get_max_val(int shift)
-{
- if (shift == RT_MAX_SHIFT)
- return UINT64_MAX;
-
- return (UINT64CONST(1) << (shift + RT_NODE_SPAN)) - 1;
-}
-
-/*
- * Create a new node as the root. Subordinate nodes will be created during
- * the insertion.
- */
-static void
-rt_new_root(radix_tree *tree, uint64 key)
-{
- int shift = key_get_shift(key);
- bool inner = shift > 0;
- rt_node *newnode;
-
- newnode = rt_alloc_node(tree, RT_CLASS_4_FULL, inner);
- rt_init_node(newnode, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
- newnode->shift = shift;
- tree->max_val = shift_get_max_val(shift);
- tree->root = newnode;
-}
-
-/*
- * Allocate a new node with the given node kind.
- */
-static rt_node *
-rt_alloc_node(radix_tree *tree, rt_size_class size_class, bool inner)
-{
- rt_node *newnode;
-
- if (inner)
- newnode = (rt_node *) MemoryContextAlloc(tree->inner_slabs[size_class],
- rt_size_class_info[size_class].inner_size);
- else
- newnode = (rt_node *) MemoryContextAlloc(tree->leaf_slabs[size_class],
- rt_size_class_info[size_class].leaf_size);
-
-#ifdef RT_DEBUG
- /* update the statistics */
- tree->cnt[size_class]++;
-#endif
-
- return newnode;
-}
-
-/* Initialize the node contents */
-static inline void
-rt_init_node(rt_node *node, uint8 kind, rt_size_class size_class, bool inner)
-{
- if (inner)
- MemSet(node, 0, rt_size_class_info[size_class].inner_size);
- else
- MemSet(node, 0, rt_size_class_info[size_class].leaf_size);
-
- node->kind = kind;
- node->fanout = rt_size_class_info[size_class].fanout;
-
- /* Initialize slot_idxs to invalid values */
- if (kind == RT_NODE_KIND_125)
- {
- rt_node_base_125 *n125 = (rt_node_base_125 *) node;
-
- memset(n125->slot_idxs, RT_NODE_125_INVALID_IDX, sizeof(n125->slot_idxs));
- }
-
- /*
- * Technically it's 256, but we cannot store that in a uint8,
- * and this is the max size class, so it will never grow.
- */
- if (kind == RT_NODE_KIND_256)
- node->fanout = 0;
-}
-
-static inline void
-rt_copy_node(rt_node *newnode, rt_node *oldnode)
-{
- newnode->shift = oldnode->shift;
- newnode->chunk = oldnode->chunk;
- newnode->count = oldnode->count;
-}
-
-/*
- * Create a new node with 'new_kind' and the same shift, chunk, and
- * count of 'node'.
- */
-static rt_node*
-rt_grow_node_kind(radix_tree *tree, rt_node *node, uint8 new_kind)
-{
- rt_node *newnode;
- bool inner = !NODE_IS_LEAF(node);
-
- newnode = rt_alloc_node(tree, kind_min_size_class[new_kind], inner);
- rt_init_node(newnode, new_kind, kind_min_size_class[new_kind], inner);
- rt_copy_node(newnode, node);
-
- return newnode;
-}
-
-/* Free the given node */
-static void
-rt_free_node(radix_tree *tree, rt_node *node)
-{
- /* If we're deleting the root node, make the tree empty */
- if (tree->root == node)
- {
- tree->root = NULL;
- tree->max_val = 0;
- }
-
-#ifdef RT_DEBUG
- {
- int i;
-
- /* update the statistics */
- for (i = 0; i < RT_SIZE_CLASS_COUNT; i++)
- {
- if (node->fanout == rt_size_class_info[i].fanout)
- break;
- }
-
- /* fanout of node256 is intentionally 0 */
- if (i == RT_SIZE_CLASS_COUNT)
- i = RT_CLASS_256;
-
- tree->cnt[i]--;
- Assert(tree->cnt[i] >= 0);
- }
-#endif
-
- pfree(node);
-}
-
-/*
- * Replace old_child with new_child, and free the old one.
- */
-static void
-rt_replace_node(radix_tree *tree, rt_node *parent, rt_node *old_child,
- rt_node *new_child, uint64 key)
-{
- Assert(old_child->chunk == new_child->chunk);
- Assert(old_child->shift == new_child->shift);
-
- if (parent == old_child)
- {
- /* Replace the root node with the new large node */
- tree->root = new_child;
- }
- else
- {
- bool replaced PG_USED_FOR_ASSERTS_ONLY;
-
- replaced = rt_node_insert_inner(tree, NULL, parent, key, new_child);
- Assert(replaced);
- }
-
- rt_free_node(tree, old_child);
-}
-
-/*
- * The radix tree doesn't have sufficient height. Extend the radix tree so it can
- * store the key.
- */
-static void
-rt_extend(radix_tree *tree, uint64 key)
-{
- int target_shift;
- int shift = tree->root->shift + RT_NODE_SPAN;
-
- target_shift = key_get_shift(key);
-
- /* Grow tree from 'shift' to 'target_shift' */
- while (shift <= target_shift)
- {
- rt_node_inner_4 *node;
-
- node = (rt_node_inner_4 *) rt_alloc_node(tree, RT_CLASS_4_FULL, true);
- rt_init_node((rt_node *) node, RT_NODE_KIND_4, RT_CLASS_4_FULL, true);
- node->base.n.shift = shift;
- node->base.n.count = 1;
- node->base.chunks[0] = 0;
- node->children[0] = tree->root;
-
- tree->root->chunk = 0;
- tree->root = (rt_node *) node;
-
- shift += RT_NODE_SPAN;
- }
-
- tree->max_val = shift_get_max_val(target_shift);
-}
-
-/*
- * The radix tree doesn't have inner and leaf nodes for given key-value pair.
- * Insert inner and leaf nodes from 'node' to bottom.
- */
-static inline void
-rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node *parent,
- rt_node *node)
-{
- int shift = node->shift;
-
- while (shift >= RT_NODE_SPAN)
- {
- rt_node *newchild;
- int newshift = shift - RT_NODE_SPAN;
- bool inner = newshift > 0;
-
- newchild = rt_alloc_node(tree, RT_CLASS_4_FULL, inner);
- rt_init_node(newchild, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
- newchild->shift = newshift;
- newchild->chunk = RT_GET_KEY_CHUNK(key, node->shift);
- rt_node_insert_inner(tree, parent, node, key, newchild);
-
- parent = node;
- node = newchild;
- shift -= RT_NODE_SPAN;
- }
-
- rt_node_insert_leaf(tree, parent, node, key, value);
- tree->num_keys++;
-}
-
-/*
- * Search for the child pointer corresponding to 'key' in the given node.
- *
- * Return true if the key is found, otherwise return false. On success, the child
- * pointer is set to child_p.
- */
-static inline bool
-rt_node_search_inner(rt_node *node, uint64 key, rt_node **child_p)
-{
-#define RT_NODE_LEVEL_INNER
-#include "lib/radixtree_search_impl.h"
-#undef RT_NODE_LEVEL_INNER
-}
-
-/*
- * Search for the value corresponding to 'key' in the given node.
- *
- * Return true if the key is found, otherwise return false. On success, the pointer
- * to the value is set to value_p.
- */
-static inline bool
-rt_node_search_leaf(rt_node *node, uint64 key, uint64 *value_p)
-{
-#define RT_NODE_LEVEL_LEAF
-#include "lib/radixtree_search_impl.h"
-#undef RT_NODE_LEVEL_LEAF
-}
-
-/*
- * Search for the child pointer corresponding to 'key' in the given node.
- *
- * Delete the entry and return true if the key is found, otherwise return false.
- */
-static inline bool
-rt_node_delete_inner(rt_node *node, uint64 key)
-{
-#define RT_NODE_LEVEL_INNER
-#include "lib/radixtree_delete_impl.h"
-#undef RT_NODE_LEVEL_INNER
-}
-
-/*
- * Search for the value corresponding to 'key' in the given node.
- *
- * Delete the entry and return true if the key is found, otherwise return false.
- */
-static inline bool
-rt_node_delete_leaf(rt_node *node, uint64 key)
-{
-#define RT_NODE_LEVEL_LEAF
-#include "lib/radixtree_delete_impl.h"
-#undef RT_NODE_LEVEL_LEAF
-}
-
-/* Insert the child to the inner node */
-static bool
-rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 key,
- rt_node *child)
-{
-#define RT_NODE_LEVEL_INNER
-#include "lib/radixtree_insert_impl.h"
-#undef RT_NODE_LEVEL_INNER
-}
-
-/* Insert the value to the leaf node */
-static bool
-rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
- uint64 key, uint64 value)
-{
-#define RT_NODE_LEVEL_LEAF
-#include "lib/radixtree_insert_impl.h"
-#undef RT_NODE_LEVEL_LEAF
-}
-
-/*
- * Create the radix tree in the given memory context and return it.
- */
-radix_tree *
-rt_create(MemoryContext ctx)
-{
- radix_tree *tree;
- MemoryContext old_ctx;
-
- old_ctx = MemoryContextSwitchTo(ctx);
-
- tree = palloc(sizeof(radix_tree));
- tree->context = ctx;
- tree->root = NULL;
- tree->max_val = 0;
- tree->num_keys = 0;
-
- /* Create the slab allocator for each size class */
- for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
- {
- tree->inner_slabs[i] = SlabContextCreate(ctx,
- rt_size_class_info[i].name,
- rt_size_class_info[i].inner_blocksize,
- rt_size_class_info[i].inner_size);
- tree->leaf_slabs[i] = SlabContextCreate(ctx,
- rt_size_class_info[i].name,
- rt_size_class_info[i].leaf_blocksize,
- rt_size_class_info[i].leaf_size);
-#ifdef RT_DEBUG
- tree->cnt[i] = 0;
-#endif
- }
-
- MemoryContextSwitchTo(old_ctx);
-
- return tree;
-}
-
-/*
- * Free the given radix tree.
- */
-void
-rt_free(radix_tree *tree)
-{
- for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
- {
- MemoryContextDelete(tree->inner_slabs[i]);
- MemoryContextDelete(tree->leaf_slabs[i]);
- }
-
- pfree(tree);
-}
-
-/*
- * Set key to value. If the entry already exists, we update its value to 'value'
- * and return true. Returns false if entry doesn't yet exist.
- */
-bool
-rt_set(radix_tree *tree, uint64 key, uint64 value)
-{
- int shift;
- bool updated;
- rt_node *node;
- rt_node *parent;
-
- /* Empty tree, create the root */
- if (!tree->root)
- rt_new_root(tree, key);
-
- /* Extend the tree if necessary */
- if (key > tree->max_val)
- rt_extend(tree, key);
-
- Assert(tree->root);
-
- shift = tree->root->shift;
- node = parent = tree->root;
-
- /* Descend the tree until a leaf node */
- while (shift >= 0)
- {
- rt_node *child;
-
- if (NODE_IS_LEAF(node))
- break;
-
- if (!rt_node_search_inner(node, key, &child))
- {
- rt_set_extend(tree, key, value, parent, node);
- return false;
- }
-
- parent = node;
- node = child;
- shift -= RT_NODE_SPAN;
- }
-
- updated = rt_node_insert_leaf(tree, parent, node, key, value);
-
- /* Update the statistics */
- if (!updated)
- tree->num_keys++;
-
- return updated;
-}
-
-/*
- * Search the given key in the radix tree. Return true if the key exists,
- * otherwise return false. On success, the value is set to *value_p, which
- * therefore must not be NULL.
- */
-bool
-rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
-{
- rt_node *node;
- int shift;
-
- Assert(value_p != NULL);
-
- if (!tree->root || key > tree->max_val)
- return false;
-
- node = tree->root;
- shift = tree->root->shift;
-
- /* Descend the tree until a leaf node */
- while (shift >= 0)
- {
- rt_node *child;
-
- if (NODE_IS_LEAF(node))
- break;
-
- if (!rt_node_search_inner(node, key, &child))
- return false;
-
- node = child;
- shift -= RT_NODE_SPAN;
- }
-
- return rt_node_search_leaf(node, key, value_p);
-}
-
-/*
- * Delete the given key from the radix tree. Return true if the key is found (and
- * deleted), otherwise do nothing and return false.
- */
-bool
-rt_delete(radix_tree *tree, uint64 key)
-{
- rt_node *node;
- rt_node *stack[RT_MAX_LEVEL] = {0};
- int shift;
- int level;
- bool deleted;
-
- if (!tree->root || key > tree->max_val)
- return false;
-
- /*
- * Descend the tree to search the key while building a stack of nodes we
- * visited.
- */
- node = tree->root;
- shift = tree->root->shift;
- level = -1;
- while (shift > 0)
- {
- rt_node *child;
-
- /* Push the current node to the stack */
- stack[++level] = node;
-
- if (!rt_node_search_inner(node, key, &child))
- return false;
-
- node = child;
- shift -= RT_NODE_SPAN;
- }
-
- /* Delete the key from the leaf node if exists */
- Assert(NODE_IS_LEAF(node));
- deleted = rt_node_delete_leaf(node, key);
-
- if (!deleted)
- {
- /* no key is found in the leaf node */
- return false;
- }
-
- /* Found the key to delete. Update the statistics */
- tree->num_keys--;
-
- /*
- * Return if the leaf node still has keys and we don't need to delete the
- * node.
- */
- if (!NODE_IS_EMPTY(node))
- return true;
-
- /* Free the empty leaf node */
- rt_free_node(tree, node);
-
- /* Delete the key in inner nodes recursively */
- while (level >= 0)
- {
- node = stack[level--];
-
- deleted = rt_node_delete_inner(node, key);
- Assert(deleted);
-
- /* If the node didn't become empty, we stop deleting the key */
- if (!NODE_IS_EMPTY(node))
- break;
-
- /* The node became empty */
- rt_free_node(tree, node);
- }
-
- return true;
-}
-
-/* Create and return the iterator for the given radix tree */
-rt_iter *
-rt_begin_iterate(radix_tree *tree)
-{
- MemoryContext old_ctx;
- rt_iter *iter;
- int top_level;
-
- old_ctx = MemoryContextSwitchTo(tree->context);
-
- iter = (rt_iter *) palloc0(sizeof(rt_iter));
- iter->tree = tree;
-
- /* empty tree */
- if (!iter->tree->root)
- return iter;
-
- top_level = iter->tree->root->shift / RT_NODE_SPAN;
- iter->stack_len = top_level;
-
- /*
- * Descend to the leftmost leaf node from the root. The key is being
- * constructed while descending to the leaf.
- */
- rt_update_iter_stack(iter, iter->tree->root, top_level);
-
- MemoryContextSwitchTo(old_ctx);
-
- return iter;
-}
-
-/*
- * Update each node_iter for inner nodes in the iterator node stack.
- */
-static void
-rt_update_iter_stack(rt_iter *iter, rt_node *from_node, int from)
-{
- int level = from;
- rt_node *node = from_node;
-
- for (;;)
- {
- rt_node_iter *node_iter = &(iter->stack[level--]);
-
- node_iter->node = node;
- node_iter->current_idx = -1;
-
- /* We don't advance the leaf node iterator here */
- if (NODE_IS_LEAF(node))
- return;
-
- /* Advance to the next slot in the inner node */
- node = rt_node_inner_iterate_next(iter, node_iter);
-
- /* We must find the first child in the node */
- Assert(node);
- }
-}
-
-/*
- * Return true and set key_p and value_p if there is a next key. Otherwise,
- * return false.
- */
-bool
-rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p)
-{
- /* Empty tree */
- if (!iter->tree->root)
- return false;
-
- for (;;)
- {
- rt_node *child = NULL;
- uint64 value;
- int level;
- bool found;
-
- /* Advance the leaf node iterator to get next key-value pair */
- found = rt_node_leaf_iterate_next(iter, &(iter->stack[0]), &value);
-
- if (found)
- {
- *key_p = iter->key;
- *value_p = value;
- return true;
- }
-
- /*
- * We've visited all values in the leaf node, so advance inner node
- * iterators from the level=1 until we find the next child node.
- */
- for (level = 1; level <= iter->stack_len; level++)
- {
- child = rt_node_inner_iterate_next(iter, &(iter->stack[level]));
-
- if (child)
- break;
- }
-
- /* the iteration finished */
- if (!child)
- return false;
-
- /*
- * Set the node to the node iterator and update the iterator stack
- * from this node.
- */
- rt_update_iter_stack(iter, child, level - 1);
-
- /* Node iterators are updated, so try again from the leaf */
- }
-
- return false;
-}
-
-void
-rt_end_iterate(rt_iter *iter)
-{
- pfree(iter);
-}
-
-static inline void
-rt_iter_update_key(rt_iter *iter, uint8 chunk, uint8 shift)
-{
- iter->key &= ~(((uint64) RT_CHUNK_MASK) << shift);
- iter->key |= (((uint64) chunk) << shift);
-}
-
-/*
- * Advance the slot in the inner node. Return the child if exists, otherwise
- * null.
- */
-static inline rt_node *
-rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
-{
-#define RT_NODE_LEVEL_INNER
-#include "lib/radixtree_iter_impl.h"
-#undef RT_NODE_LEVEL_INNER
-}
-
-/*
- * Advance the slot in the leaf node. On success, return true and the value
- * is set to value_p, otherwise return false.
- */
-static inline bool
-rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
- uint64 *value_p)
-{
-#define RT_NODE_LEVEL_LEAF
-#include "lib/radixtree_iter_impl.h"
-#undef RT_NODE_LEVEL_LEAF
-}
-
-/*
- * Return the number of keys in the radix tree.
- */
-uint64
-rt_num_entries(radix_tree *tree)
-{
- return tree->num_keys;
-}
-
-/*
- * Return the statistics of the amount of memory used by the radix tree.
- */
-uint64
-rt_memory_usage(radix_tree *tree)
-{
- Size total = sizeof(radix_tree);
-
- for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
- {
- total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
- total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
- }
-
- return total;
-}
-
-/*
- * Verify the radix tree node.
- */
-static void
-rt_verify_node(rt_node *node)
-{
-#ifdef USE_ASSERT_CHECKING
- Assert(node->count >= 0);
-
- switch (node->kind)
- {
- case RT_NODE_KIND_4:
- {
- rt_node_base_4 *n4 = (rt_node_base_4 *) node;
-
- for (int i = 1; i < n4->n.count; i++)
- Assert(n4->chunks[i - 1] < n4->chunks[i]);
-
- break;
- }
- case RT_NODE_KIND_32:
- {
- rt_node_base_32 *n32 = (rt_node_base_32 *) node;
-
- for (int i = 1; i < n32->n.count; i++)
- Assert(n32->chunks[i - 1] < n32->chunks[i]);
-
- break;
- }
- case RT_NODE_KIND_125:
- {
- rt_node_base_125 *n125 = (rt_node_base_125 *) node;
- int cnt = 0;
-
- for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
- {
- uint8 slot = n125->slot_idxs[i];
- int idx = BM_IDX(slot);
- int bitnum = BM_BIT(slot);
-
- if (!node_125_is_chunk_used(n125, i))
- continue;
-
- /* Check if the corresponding slot is used */
- Assert(slot < node->fanout);
- Assert((n125->isset[idx] & ((bitmapword) 1 << bitnum)) != 0);
-
- cnt++;
- }
-
- Assert(n125->n.count == cnt);
- break;
- }
- case RT_NODE_KIND_256:
- {
- if (NODE_IS_LEAF(node))
- {
- rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
- int cnt = 0;
-
- for (int i = 0; i < BM_IDX(RT_NODE_MAX_SLOTS); i++)
- cnt += bmw_popcount(n256->isset[i]);
-
- /* Check if the number of used chunk matches */
- Assert(n256->base.n.count == cnt);
-
- break;
- }
- }
- }
-#endif
-}
-
-/***************** DEBUG FUNCTIONS *****************/
-#ifdef RT_DEBUG
-void
-rt_stats(radix_tree *tree)
-{
- ereport(NOTICE, (errmsg("num_keys = " UINT64_FORMAT ", height = %d, n4 = %u, n15 = %u, n32 = %u, n125 = %u, n256 = %u",
- tree->num_keys,
- tree->root->shift / RT_NODE_SPAN,
- tree->cnt[RT_CLASS_4_FULL],
- tree->cnt[RT_CLASS_32_PARTIAL],
- tree->cnt[RT_CLASS_32_FULL],
- tree->cnt[RT_CLASS_125_FULL],
- tree->cnt[RT_CLASS_256])));
-}
-
-static void
-rt_dump_node(rt_node *node, int level, bool recurse)
-{
- char space[125] = {0};
-
- fprintf(stderr, "[%s] kind %d, fanout %d, count %u, shift %u, chunk 0x%X:\n",
- NODE_IS_LEAF(node) ? "LEAF" : "INNR",
- (node->kind == RT_NODE_KIND_4) ? 4 :
- (node->kind == RT_NODE_KIND_32) ? 32 :
- (node->kind == RT_NODE_KIND_125) ? 125 : 256,
- node->fanout == 0 ? 256 : node->fanout,
- node->count, node->shift, node->chunk);
-
- if (level > 0)
- sprintf(space, "%*c", level * 4, ' ');
-
- switch (node->kind)
- {
- case RT_NODE_KIND_4:
- {
- for (int i = 0; i < node->count; i++)
- {
- if (NODE_IS_LEAF(node))
- {
- rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
-
- fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
- space, n4->base.chunks[i], n4->values[i]);
- }
- else
- {
- rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
-
- fprintf(stderr, "%schunk 0x%X ->",
- space, n4->base.chunks[i]);
-
- if (recurse)
- rt_dump_node(n4->children[i], level + 1, recurse);
- else
- fprintf(stderr, "\n");
- }
- }
- break;
- }
- case RT_NODE_KIND_32:
- {
- for (int i = 0; i < node->count; i++)
- {
- if (NODE_IS_LEAF(node))
- {
- rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
-
- fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
- space, n32->base.chunks[i], n32->values[i]);
- }
- else
- {
- rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
-
- fprintf(stderr, "%schunk 0x%X ->",
- space, n32->base.chunks[i]);
-
- if (recurse)
- {
- rt_dump_node(n32->children[i], level + 1, recurse);
- }
- else
- fprintf(stderr, "\n");
- }
- }
- break;
- }
- case RT_NODE_KIND_125:
- {
- rt_node_base_125 *b125 = (rt_node_base_125 *) node;
-
- fprintf(stderr, "slot_idxs ");
- for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
- {
- if (!node_125_is_chunk_used(b125, i))
- continue;
-
- fprintf(stderr, " [%d]=%d, ", i, b125->slot_idxs[i]);
- }
- if (NODE_IS_LEAF(node))
- {
- rt_node_leaf_125 *n = (rt_node_leaf_125 *) node;
-
- fprintf(stderr, ", isset-bitmap:");
- for (int i = 0; i < BM_IDX(128); i++)
- {
- fprintf(stderr, UINT64_FORMAT_HEX " ", (uint64) n->base.isset[i]);
- }
- fprintf(stderr, "\n");
- }
-
- for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
- {
- if (!node_125_is_chunk_used(b125, i))
- continue;
-
- if (NODE_IS_LEAF(node))
- {
- rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) b125;
-
- fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
- space, i, node_leaf_125_get_value(n125, i));
- }
- else
- {
- rt_node_inner_125 *n125 = (rt_node_inner_125 *) b125;
-
- fprintf(stderr, "%schunk 0x%X ->",
- space, i);
-
- if (recurse)
- rt_dump_node(node_inner_125_get_child(n125, i),
- level + 1, recurse);
- else
- fprintf(stderr, "\n");
- }
- }
- break;
- }
- case RT_NODE_KIND_256:
- {
- for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
- {
- if (NODE_IS_LEAF(node))
- {
- rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
-
- if (!node_leaf_256_is_chunk_used(n256, i))
- continue;
-
- fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
- space, i, node_leaf_256_get_value(n256, i));
- }
- else
- {
- rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
-
- if (!node_inner_256_is_chunk_used(n256, i))
- continue;
-
- fprintf(stderr, "%schunk 0x%X ->",
- space, i);
-
- if (recurse)
- rt_dump_node(node_inner_256_get_child(n256, i), level + 1,
- recurse);
- else
- fprintf(stderr, "\n");
- }
- }
- break;
- }
- }
-}
-
-void
-rt_dump_search(radix_tree *tree, uint64 key)
-{
- rt_node *node;
- int shift;
- int level = 0;
-
- elog(NOTICE, "-----------------------------------------------------------");
- elog(NOTICE, "max_val = " UINT64_FORMAT "(0x" UINT64_FORMAT_HEX ")",
- tree->max_val, tree->max_val);
-
- if (!tree->root)
- {
- elog(NOTICE, "tree is empty");
- return;
- }
-
- if (key > tree->max_val)
- {
- elog(NOTICE, "key " UINT64_FORMAT "(0x" UINT64_FORMAT_HEX ") is larger than max val",
- key, key);
- return;
- }
-
- node = tree->root;
- shift = tree->root->shift;
- while (shift >= 0)
- {
- rt_node *child;
-
- rt_dump_node(node, level, false);
-
- if (NODE_IS_LEAF(node))
- {
- uint64 dummy;
-
- /* We reached at a leaf node, find the corresponding slot */
- rt_node_search_leaf(node, key, &dummy);
-
- break;
- }
-
- if (!rt_node_search_inner(node, key, &child))
- break;
-
- node = child;
- shift -= RT_NODE_SPAN;
- level++;
- }
-}
-
-void
-rt_dump(radix_tree *tree)
-{
-
- for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
- fprintf(stderr, "%s\tinner_size %zu\tinner_blocksize %zu\tleaf_size %zu\tleaf_blocksize %zu\n",
- rt_size_class_info[i].name,
- rt_size_class_info[i].inner_size,
- rt_size_class_info[i].inner_blocksize,
- rt_size_class_info[i].leaf_size,
- rt_size_class_info[i].leaf_blocksize);
- fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->max_val);
-
- if (!tree->root)
- {
- fprintf(stderr, "empty tree\n");
- return;
- }
-
- rt_dump_node(tree->root, 0, true);
-}
-#endif
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index d5d7668617..fe517793f4 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -1,24 +1,412 @@
/*-------------------------------------------------------------------------
*
- * radixtree.h
- * Interface for radix tree.
+ * radixtree.h
+ * Implementation for adaptive radix tree.
+ *
+ * This module employs the idea from the paper "The Adaptive Radix Tree: ARTful
+ * Indexing for Main-Memory Databases" by Viktor Leis, Alfons Kemper, and Thomas
+ * Neumann, 2013. The radix tree uses adaptive node sizes, a small number of node
+ * types, each with a different number of elements. Depending on the number of
+ * children, the appropriate node type is used.
+ *
+ * There are some differences from the proposed implementation. For instance,
+ * there is no support for path compression or lazy path expansion. The radix
+ * tree supports a fixed key length, so we don't expect the tree to become
+ * very high.
+ *
+ * Both the key and the value are 64-bit unsigned integers. The inner nodes and
+ * the leaf nodes have slightly different structures: inner nodes, with
+ * shift > 0, store pointers to their child nodes as values, while leaf nodes,
+ * with shift == 0, store the 64-bit unsigned integer specified by the user as
+ * the value. The paper refers to this technique as "Multi-value leaves". We
+ * choose it to avoid an additional pointer traversal, which is also the reason
+ * this code currently does not support variable-length keys.
+ *
+ * XXX: Most functions in this file have two variants for inner nodes and leaf
+ * nodes, so there is duplicated code. While this sometimes makes code
+ * maintenance tricky, it reduces branch prediction misses when judging
+ * whether a node is an inner node or a leaf node.
+ *
+ * XXX: radix tree nodes are never shrunk.
+ *
+ * Interface
+ * ---------
+ *
+ * rt_create - Create a new, empty radix tree
+ * rt_free - Free the radix tree
+ * rt_search - Search a key-value pair
+ * rt_set - Set a key-value pair
+ * rt_delete - Delete a key-value pair
+ * rt_begin_iterate - Begin iterating through all key-value pairs
+ * rt_iterate_next - Return next key-value pair, if any
+ * rt_end_iter - End iteration
+ * rt_memory_usage - Get the memory usage
+ * rt_num_entries - Get the number of key-value pairs
+ *
+ * rt_create() creates an empty radix tree in the given memory context
+ * and memory contexts for all kinds of radix tree node under the memory context.
+ *
+ * rt_iterate_next() ensures returning key-value pairs in the ascending
+ * order of the key.
*
* Copyright (c) 2022, PostgreSQL Global Development Group
*
* IDENTIFICATION
- * src/include/lib/radixtree.h
+ * src/backend/lib/radixtree.c
*
*-------------------------------------------------------------------------
*/
-#ifndef RADIXTREE_H
-#define RADIXTREE_H
#include "postgres.h"
-#define RT_DEBUG 1
+#include "lib/stringinfo.h"
+#include "miscadmin.h"
+#include "nodes/bitmapset.h"
+#include "port/pg_bitutils.h"
+#include "port/pg_lfind.h"
+#include "utils/memutils.h"
+
+#ifdef RT_DEBUG
+#define UINT64_FORMAT_HEX "%" INT64_MODIFIER "X"
+#endif
+
+/* The number of bits encoded in one tree level */
+#define RT_NODE_SPAN BITS_PER_BYTE
+
+/* The number of maximum slots in the node */
+#define RT_NODE_MAX_SLOTS (1 << RT_NODE_SPAN)
+
+/* Mask for extracting a chunk from the key */
+#define RT_CHUNK_MASK ((1 << RT_NODE_SPAN) - 1)
+
+/* Maximum shift the radix tree uses */
+#define RT_MAX_SHIFT key_get_shift(UINT64_MAX)
+
+/* Tree level the radix tree uses */
+#define RT_MAX_LEVEL ((sizeof(uint64) * BITS_PER_BYTE) / RT_NODE_SPAN)
+
+/* Invalid index used in node-125 */
+#define RT_NODE_125_INVALID_IDX 0xFF
+
+/* Get a chunk from the key */
+#define RT_GET_KEY_CHUNK(key, shift) ((uint8) (((key) >> (shift)) & RT_CHUNK_MASK))
+
+/* For accessing bitmaps */
+#define BM_IDX(x) ((x) / BITS_PER_BITMAPWORD)
+#define BM_BIT(x) ((x) % BITS_PER_BITMAPWORD)
+
+/*
+ * Supported radix tree node kinds and size classes.
+ *
+ * There are 4 node kinds and each node kind has one or two size classes,
+ * partial and full. The size classes within the same node kind have the same
+ * node structure but a different fanout, which is stored
+ * in 'fanout' of rt_node. For example in size class 15, when a 16th element
+ * is to be inserted, we allocate a larger area and memcpy the entire old
+ * node to it.
+ *
+ * This technique allows us to limit the node kinds to 4, which limits the
+ * number of cases in switch statements. It also allows a possible future
+ * optimization to encode the node kind in a pointer tag.
+ *
+ * These size classes have been chosen carefully so that they minimize the
+ * allocator padding in both the inner and leaf nodes on DSA.
+ *
+ */
+#define RT_NODE_KIND_4 0x00
+#define RT_NODE_KIND_32 0x01
+#define RT_NODE_KIND_125 0x02
+#define RT_NODE_KIND_256 0x03
+#define RT_NODE_KIND_COUNT 4
+
+typedef enum rt_size_class
+{
+ RT_CLASS_4_FULL = 0,
+ RT_CLASS_32_PARTIAL,
+ RT_CLASS_32_FULL,
+ RT_CLASS_125_FULL,
+ RT_CLASS_256
+
+#define RT_SIZE_CLASS_COUNT (RT_CLASS_256 + 1)
+} rt_size_class;
+
+/* Common type for all nodes types */
+typedef struct rt_node
+{
+ /*
+ * Number of children. We use uint16 to be able to indicate 256 children
+ * at the fanout of 8.
+ */
+ uint16 count;
+
+ /* Max number of children. We can use uint8 because we never need to store 256 */
+ /* WIP: if we don't have a variable sized node4, this should instead be in the base
+ types as needed, since saving every byte is crucial for the smallest node kind */
+ uint8 fanout;
+
+ /*
+ * Shift indicates which part of the key space is represented by this
+ * node. That is, the key is shifted by 'shift' and the lowest
+ * RT_NODE_SPAN bits are then represented in chunk.
+ */
+ uint8 shift;
+ uint8 chunk;
+
+ /* Node kind, one per search/set algorithm */
+ uint8 kind;
+} rt_node;
+#define NODE_IS_LEAF(n) (((rt_node *) (n))->shift == 0)
+#define NODE_IS_EMPTY(n) (((rt_node *) (n))->count == 0)
+#define VAR_NODE_HAS_FREE_SLOT(node) \
+ ((node)->base.n.count < (node)->base.n.fanout)
+#define FIXED_NODE_HAS_FREE_SLOT(node, class) \
+ ((node)->base.n.count < rt_size_class_info[class].fanout)
+
+/* Base type of each node kinds for leaf and inner nodes */
+/* The base types must be able to accommodate the largest size
+class for variable-sized node kinds */
+typedef struct rt_node_base_4
+{
+ rt_node n;
+
+ /* 4 children, for key chunks */
+ uint8 chunks[4];
+} rt_node_base_4;
+
+typedef struct rt_node_base32
+{
+ rt_node n;
+
+ /* 32 children, for key chunks */
+ uint8 chunks[32];
+} rt_node_base_32;
+
+/*
+ * node-125 uses slot_idx array, an array of RT_NODE_MAX_SLOTS length, typically
+ * 256, to store indexes into a second array that contains up to 125 values (or
+ * child pointers in inner nodes).
+ */
+typedef struct rt_node_base125
+{
+ rt_node n;
+
+ /* The index of slots for each fanout */
+ uint8 slot_idxs[RT_NODE_MAX_SLOTS];
+
+ /* isset is a bitmap to track which slot is in use */
+ bitmapword isset[BM_IDX(128)];
+} rt_node_base_125;
+
+typedef struct rt_node_base256
+{
+ rt_node n;
+} rt_node_base_256;
+
+/*
+ * Inner and leaf nodes.
+ *
+ * These are separate for two main reasons:
+ *
+ * 1) the value type might be different than something fitting into a pointer
+ * width type
+ * 2) Need to represent non-existing values in a key-type independent way.
+ *
+ * 1) is clearly worth being concerned about, but it's not clear 2) is as
+ * good. It might be better to just indicate non-existing entries the same way
+ * in inner nodes.
+ */
+typedef struct rt_node_inner_4
+{
+ rt_node_base_4 base;
+
+ /* number of children depends on size class */
+ rt_node *children[FLEXIBLE_ARRAY_MEMBER];
+} rt_node_inner_4;
+
+typedef struct rt_node_leaf_4
+{
+ rt_node_base_4 base;
+
+ /* number of values depends on size class */
+ uint64 values[FLEXIBLE_ARRAY_MEMBER];
+} rt_node_leaf_4;
+
+typedef struct rt_node_inner_32
+{
+ rt_node_base_32 base;
+
+ /* number of children depends on size class */
+ rt_node *children[FLEXIBLE_ARRAY_MEMBER];
+} rt_node_inner_32;
+
+typedef struct rt_node_leaf_32
+{
+ rt_node_base_32 base;
+
+ /* number of values depends on size class */
+ uint64 values[FLEXIBLE_ARRAY_MEMBER];
+} rt_node_leaf_32;
+
+typedef struct rt_node_inner_125
+{
+ rt_node_base_125 base;
+
+ /* number of children depends on size class */
+ rt_node *children[FLEXIBLE_ARRAY_MEMBER];
+} rt_node_inner_125;
+
+typedef struct rt_node_leaf_125
+{
+ rt_node_base_125 base;
+
+ /* number of values depends on size class */
+ uint64 values[FLEXIBLE_ARRAY_MEMBER];
+} rt_node_leaf_125;
+
+/*
+ * node-256 is the largest node type. This node has an array of RT_NODE_MAX_SLOTS
+ * entries for directly storing values (or child pointers in inner nodes).
+ */
+typedef struct rt_node_inner_256
+{
+ rt_node_base_256 base;
+
+ /* Slots for 256 children */
+ rt_node *children[RT_NODE_MAX_SLOTS];
+} rt_node_inner_256;
+
+typedef struct rt_node_leaf_256
+{
+ rt_node_base_256 base;
+
+ /* isset is a bitmap to track which slot is in use */
+ bitmapword isset[BM_IDX(RT_NODE_MAX_SLOTS)];
+
+ /* Slots for 256 values */
+ uint64 values[RT_NODE_MAX_SLOTS];
+} rt_node_leaf_256;
+
+/* Information for each size class */
+typedef struct rt_size_class_elem
+{
+ const char *name;
+ int fanout;
+
+ /* slab chunk size */
+ Size inner_size;
+ Size leaf_size;
+
+ /* slab block size */
+ Size inner_blocksize;
+ Size leaf_blocksize;
+} rt_size_class_elem;
+
+/*
+ * Calculate the slab blocksize so that we can allocate at least 32 chunks
+ * from the block.
+ */
+#define NODE_SLAB_BLOCK_SIZE(size) \
+ Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * (size), (size) * 32)
+static rt_size_class_elem rt_size_class_info[RT_SIZE_CLASS_COUNT] = {
+ [RT_CLASS_4_FULL] = {
+ .name = "radix tree node 4",
+ .fanout = 4,
+ .inner_size = sizeof(rt_node_inner_4) + 4 * sizeof(rt_node *),
+ .leaf_size = sizeof(rt_node_leaf_4) + 4 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_4) + 4 * sizeof(rt_node *)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_4) + 4 * sizeof(uint64)),
+ },
+ [RT_CLASS_32_PARTIAL] = {
+ .name = "radix tree node 15",
+ .fanout = 15,
+ .inner_size = sizeof(rt_node_inner_32) + 15 * sizeof(rt_node *),
+ .leaf_size = sizeof(rt_node_leaf_32) + 15 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_32) + 15 * sizeof(rt_node *)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_32) + 15 * sizeof(uint64)),
+ },
+ [RT_CLASS_32_FULL] = {
+ .name = "radix tree node 32",
+ .fanout = 32,
+ .inner_size = sizeof(rt_node_inner_32) + 32 * sizeof(rt_node *),
+ .leaf_size = sizeof(rt_node_leaf_32) + 32 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_32) + 32 * sizeof(rt_node *)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_32) + 32 * sizeof(uint64)),
+ },
+ [RT_CLASS_125_FULL] = {
+ .name = "radix tree node 125",
+ .fanout = 125,
+ .inner_size = sizeof(rt_node_inner_125) + 125 * sizeof(rt_node *),
+ .leaf_size = sizeof(rt_node_leaf_125) + 125 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_125) + 125 * sizeof(rt_node *)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_125) + 125 * sizeof(uint64)),
+ },
+ [RT_CLASS_256] = {
+ .name = "radix tree node 256",
+ .fanout = 256,
+ .inner_size = sizeof(rt_node_inner_256),
+ .leaf_size = sizeof(rt_node_leaf_256),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_256)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_256)),
+ },
+};
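
(Editorial note, not part of the patch: to make the NODE_SLAB_BLOCK_SIZE arithmetic
above concrete, here is a worked example for a hypothetical 48-byte node chunk,
assuming the stock 8kB SLAB_DEFAULT_BLOCK_SIZE; the actual struct sizes depend on
padding and pointer width.)

    /* hypothetical 48-byte chunk, assuming SLAB_DEFAULT_BLOCK_SIZE = 8192 */
    NODE_SLAB_BLOCK_SIZE(48)
        = Max((8192 / 48) * 48, 48 * 32)
        = Max(8160, 1536)
        = 8160
    /* i.e. the block size is rounded down to a multiple of the chunk size,
       but is never smaller than 32 chunks */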
+
+/* Map from the node kind to its minimum size class */
+static rt_size_class kind_min_size_class[RT_NODE_KIND_COUNT] = {
+ [RT_NODE_KIND_4] = RT_CLASS_4_FULL,
+ [RT_NODE_KIND_32] = RT_CLASS_32_PARTIAL,
+ [RT_NODE_KIND_125] = RT_CLASS_125_FULL,
+ [RT_NODE_KIND_256] = RT_CLASS_256,
+};
+
+/* A radix tree with nodes */
+typedef struct radix_tree
+{
+ MemoryContext context;
+
+ rt_node *root;
+ uint64 max_val;
+ uint64 num_keys;
+
+ MemoryContextData *inner_slabs[RT_SIZE_CLASS_COUNT];
+ MemoryContextData *leaf_slabs[RT_SIZE_CLASS_COUNT];
+
+ /* statistics */
+#ifdef RT_DEBUG
+ int32 cnt[RT_SIZE_CLASS_COUNT];
+#endif
+} radix_tree;
+
+/*
+ * Iteration support.
+ *
+ * Iterating the radix tree returns each pair of key and value in the ascending
+ * order of the key. To support this, we iterate over the nodes of each level.
+ *
+ * rt_node_iter struct is used to track the iteration within a node.
+ *
+ * rt_iter is the struct for iteration of the radix tree, and uses rt_node_iter
+ * in order to track the iteration of each level. During the iteration, we also
+ * construct the key whenever updating the node iteration information, e.g., when
+ * advancing the current index within the node or when moving to the next node
+ * at the same level.
+ */
+typedef struct rt_node_iter
+{
+ rt_node *node; /* current node being iterated */
+ int current_idx; /* current position. -1 for initial value */
+} rt_node_iter;
+
+typedef struct rt_iter
+{
+ radix_tree *tree;
-typedef struct radix_tree radix_tree;
-typedef struct rt_iter rt_iter;
+ /* Track the iteration on nodes of each level */
+ rt_node_iter stack[RT_MAX_LEVEL];
+ int stack_len;
+
+ /* The key is being constructed during the iteration */
+ uint64 key;
+} rt_iter;
extern radix_tree *rt_create(MemoryContext ctx);
extern void rt_free(radix_tree *tree);
@@ -39,4 +427,1360 @@ extern void rt_dump_search(radix_tree *tree, uint64 key);
extern void rt_stats(radix_tree *tree);
#endif
-#endif /* RADIXTREE_H */
+
+static void rt_new_root(radix_tree *tree, uint64 key);
+static rt_node *rt_alloc_node(radix_tree *tree, rt_size_class size_class, bool inner);
+static inline void rt_init_node(rt_node *node, uint8 kind, rt_size_class size_class,
+ bool inner);
+static void rt_free_node(radix_tree *tree, rt_node *node);
+static void rt_extend(radix_tree *tree, uint64 key);
+static inline bool rt_node_search_inner(rt_node *node, uint64 key, rt_node **child_p);
+static inline bool rt_node_search_leaf(rt_node *node, uint64 key, uint64 *value_p);
+static bool rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node,
+ uint64 key, rt_node *child);
+static bool rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
+ uint64 key, uint64 value);
+static inline rt_node *rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter);
+static inline bool rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
+ uint64 *value_p);
+static void rt_update_iter_stack(rt_iter *iter, rt_node *from_node, int from);
+static inline void rt_iter_update_key(rt_iter *iter, uint8 chunk, uint8 shift);
+
+/* verification (available only with assertion) */
+static void rt_verify_node(rt_node *node);
+
+/*
+ * Return the index of the first element in the node's chunk array that equals
+ * 'chunk'. Return -1 if there is no such element.
+ */
+static inline int
+node_4_search_eq(rt_node_base_4 *node, uint8 chunk)
+{
+ int idx = -1;
+
+ for (int i = 0; i < node->n.count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ idx = i;
+ break;
+ }
+ }
+
+ return idx;
+}
+
+/*
+ * Return the index at which 'chunk' should be inserted into the node's chunk array.
+ */
+static inline int
+node_4_get_insertpos(rt_node_base_4 *node, uint8 chunk)
+{
+ int idx;
+
+ for (idx = 0; idx < node->n.count; idx++)
+ {
+ if (node->chunks[idx] >= chunk)
+ break;
+ }
+
+ return idx;
+}
+
+/*
+ * Return the index of the first element in the node's chunk array that equals
+ * 'chunk'. Return -1 if there is no such element.
+ */
+static inline int
+node_32_search_eq(rt_node_base_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ uint32 bitfield;
+ int index_simd = -1;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index = -1;
+
+ for (int i = 0; i < count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ index = i;
+ break;
+ }
+ }
+#endif
+
+#ifndef USE_NO_SIMD
+ spread_chunk = vector8_broadcast(chunk);
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ cmp1 = vector8_eq(spread_chunk, haystack1);
+ cmp2 = vector8_eq(spread_chunk, haystack2);
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+/*
+ * Return the index at which 'chunk' should be inserted into the node's chunk array.
+ */
+static inline int
+node_32_get_insertpos(rt_node_base_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ Vector8 min1;
+ Vector8 min2;
+ uint32 bitfield;
+ int index_simd;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index;
+
+ for (index = 0; index < count; index++)
+ {
+ if (node->chunks[index] >= chunk)
+ break;
+ }
+#endif
+
+#ifndef USE_NO_SIMD
+ spread_chunk = vector8_broadcast(chunk);
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ min1 = vector8_min(spread_chunk, haystack1);
+ min2 = vector8_min(spread_chunk, haystack2);
+ cmp1 = vector8_eq(spread_chunk, min1);
+ cmp2 = vector8_eq(spread_chunk, min2);
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+ else
+ index_simd = count;
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+/*
+ * Functions to manipulate both chunks array and children/values array.
+ * These are used for node-4 and node-32.
+ */
+
+/* Shift the elements right at 'idx' by one */
+static inline void
+chunk_children_array_shift(uint8 *chunks, rt_node **children, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+ memmove(&(children[idx + 1]), &(children[idx]), sizeof(rt_node *) * (count - idx));
+}
+
+static inline void
+chunk_values_array_shift(uint8 *chunks, uint64 *values, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+	memmove(&(values[idx + 1]), &(values[idx]), sizeof(uint64) * (count - idx));
+}
+
+/* Delete the element at 'idx' */
+static inline void
+chunk_children_array_delete(uint8 *chunks, rt_node **children, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(children[idx]), &(children[idx + 1]), sizeof(rt_node *) * (count - idx - 1));
+}
+
+static inline void
+chunk_values_array_delete(uint8 *chunks, uint64 *values, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(values[idx]), &(values[idx + 1]), sizeof(uint64) * (count - idx - 1));
+}
+
+/* Copy both chunks and children/values arrays */
+static inline void
+chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
+ uint8 *dst_chunks, rt_node **dst_children)
+{
+ const int fanout = rt_size_class_info[RT_CLASS_4_FULL].fanout;
+ const Size chunk_size = sizeof(uint8) * fanout;
+ const Size children_size = sizeof(rt_node *) * fanout;
+
+ memcpy(dst_chunks, src_chunks, chunk_size);
+ memcpy(dst_children, src_children, children_size);
+}
+
+static inline void
+chunk_values_array_copy(uint8 *src_chunks, uint64 *src_values,
+ uint8 *dst_chunks, uint64 *dst_values)
+{
+ const int fanout = rt_size_class_info[RT_CLASS_4_FULL].fanout;
+ const Size chunk_size = sizeof(uint8) * fanout;
+ const Size values_size = sizeof(uint64) * fanout;
+
+ memcpy(dst_chunks, src_chunks, chunk_size);
+ memcpy(dst_values, src_values, values_size);
+}
+
+/* Functions to manipulate inner and leaf node-125 */
+
+/* Does the given chunk in the node have a value? */
+static inline bool
+node_125_is_chunk_used(rt_node_base_125 *node, uint8 chunk)
+{
+ return node->slot_idxs[chunk] != RT_NODE_125_INVALID_IDX;
+}
+
+static inline rt_node *
+node_inner_125_get_child(rt_node_inner_125 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ return node->children[node->base.slot_idxs[chunk]];
+}
+
+static inline uint64
+node_leaf_125_get_value(rt_node_leaf_125 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ Assert(((rt_node_base_125 *) node)->slot_idxs[chunk] != RT_NODE_125_INVALID_IDX);
+ return node->values[node->base.slot_idxs[chunk]];
+}
+
+/* Functions to manipulate inner and leaf node-256 */
+
+/* Return true if the slot corresponding to the given chunk is in use */
+static inline bool
+node_inner_256_is_chunk_used(rt_node_inner_256 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ return (node->children[chunk] != NULL);
+}
+
+static inline bool
+node_leaf_256_is_chunk_used(rt_node_leaf_256 *node, uint8 chunk)
+{
+ int idx = BM_IDX(chunk);
+ int bitnum = BM_BIT(chunk);
+
+ Assert(NODE_IS_LEAF(node));
+ return (node->isset[idx] & ((bitmapword) 1 << bitnum)) != 0;
+}
+
+static inline rt_node *
+node_inner_256_get_child(rt_node_inner_256 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ Assert(node_inner_256_is_chunk_used(node, chunk));
+ return node->children[chunk];
+}
+
+static inline uint64
+node_leaf_256_get_value(rt_node_leaf_256 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ Assert(node_leaf_256_is_chunk_used(node, chunk));
+ return node->values[chunk];
+}
+
+/* Set the child in the node-256 */
+static inline void
+node_inner_256_set(rt_node_inner_256 *node, uint8 chunk, rt_node *child)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->children[chunk] = child;
+}
+
+/* Set the value in the node-256 */
+static inline void
+node_leaf_256_set(rt_node_leaf_256 *node, uint8 chunk, uint64 value)
+{
+ int idx = BM_IDX(chunk);
+ int bitnum = BM_BIT(chunk);
+
+ Assert(NODE_IS_LEAF(node));
+ node->isset[idx] |= ((bitmapword) 1 << bitnum);
+ node->values[chunk] = value;
+}
+
+/* Clear the child at the given chunk position */
+static inline void
+node_inner_256_delete(rt_node_inner_256 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->children[chunk] = NULL;
+}
+
+static inline void
+node_leaf_256_delete(rt_node_leaf_256 *node, uint8 chunk)
+{
+ int idx = BM_IDX(chunk);
+ int bitnum = BM_BIT(chunk);
+
+ Assert(NODE_IS_LEAF(node));
+ node->isset[idx] &= ~((bitmapword) 1 << bitnum);
+}
+
+/*
+ * Return the shift needed to store the given key.
+ */
+static inline int
+key_get_shift(uint64 key)
+{
+ return (key == 0)
+ ? 0
+ : (pg_leftmost_one_pos64(key) / RT_NODE_SPAN) * RT_NODE_SPAN;
+}
+
+/*
+ * Return the maximum key value that can be stored in a tree whose root has the given shift.
+ */
+static uint64
+shift_get_max_val(int shift)
+{
+ if (shift == RT_MAX_SHIFT)
+ return UINT64_MAX;
+
+ return (UINT64CONST(1) << (shift + RT_NODE_SPAN)) - 1;
+}
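
(Editorial illustration, not part of the patch: this is how a key is decomposed into
per-level chunks using the two helpers above. The chunk extraction mirrors what
RT_GET_KEY_CHUNK does elsewhere in the patch, assuming RT_NODE_SPAN is 8 bits per
level as in this patch series.)

    /* illustration only: decompose a key into 8-bit chunks, one per level */
    uint64  key = UINT64CONST(0x0102030405060708);
    int     shift = key_get_shift(key);    /* = 56 for this key */

    while (shift >= 0)
    {
        uint8   chunk = (uint8) ((key >> shift) & 0xFF);

        /* chunk selects the slot at this level: 0x01, 0x02, ..., 0x08 */
        shift -= RT_NODE_SPAN;             /* 8 bits per level */
    }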
+
+/*
+ * Create a new node as the root. Subordinate nodes will be created during
+ * the insertion.
+ */
+static void
+rt_new_root(radix_tree *tree, uint64 key)
+{
+ int shift = key_get_shift(key);
+ bool inner = shift > 0;
+ rt_node *newnode;
+
+ newnode = rt_alloc_node(tree, RT_CLASS_4_FULL, inner);
+ rt_init_node(newnode, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
+ newnode->shift = shift;
+ tree->max_val = shift_get_max_val(shift);
+ tree->root = newnode;
+}
+
+/*
+ * Allocate a new node with the given size class.
+ */
+static rt_node *
+rt_alloc_node(radix_tree *tree, rt_size_class size_class, bool inner)
+{
+ rt_node *newnode;
+
+ if (inner)
+ newnode = (rt_node *) MemoryContextAlloc(tree->inner_slabs[size_class],
+ rt_size_class_info[size_class].inner_size);
+ else
+ newnode = (rt_node *) MemoryContextAlloc(tree->leaf_slabs[size_class],
+ rt_size_class_info[size_class].leaf_size);
+
+#ifdef RT_DEBUG
+ /* update the statistics */
+ tree->cnt[size_class]++;
+#endif
+
+ return newnode;
+}
+
+/* Initialize the node contents */
+static inline void
+rt_init_node(rt_node *node, uint8 kind, rt_size_class size_class, bool inner)
+{
+ if (inner)
+ MemSet(node, 0, rt_size_class_info[size_class].inner_size);
+ else
+ MemSet(node, 0, rt_size_class_info[size_class].leaf_size);
+
+ node->kind = kind;
+ node->fanout = rt_size_class_info[size_class].fanout;
+
+ /* Initialize slot_idxs to invalid values */
+ if (kind == RT_NODE_KIND_125)
+ {
+ rt_node_base_125 *n125 = (rt_node_base_125 *) node;
+
+ memset(n125->slot_idxs, RT_NODE_125_INVALID_IDX, sizeof(n125->slot_idxs));
+ }
+
+ /*
+ * Technically it's 256, but we cannot store that in a uint8,
+	 * and this is the largest size class, so it will never grow.
+ */
+ if (kind == RT_NODE_KIND_256)
+ node->fanout = 0;
+}
+
+static inline void
+rt_copy_node(rt_node *newnode, rt_node *oldnode)
+{
+ newnode->shift = oldnode->shift;
+ newnode->chunk = oldnode->chunk;
+ newnode->count = oldnode->count;
+}
+
+/*
+ * Create a new node with 'new_kind' and the same shift, chunk, and
+ * count of 'node'.
+ */
+static rt_node*
+rt_grow_node_kind(radix_tree *tree, rt_node *node, uint8 new_kind)
+{
+ rt_node *newnode;
+ bool inner = !NODE_IS_LEAF(node);
+
+ newnode = rt_alloc_node(tree, kind_min_size_class[new_kind], inner);
+ rt_init_node(newnode, new_kind, kind_min_size_class[new_kind], inner);
+ rt_copy_node(newnode, node);
+
+ return newnode;
+}
+
+/* Free the given node */
+static void
+rt_free_node(radix_tree *tree, rt_node *node)
+{
+ /* If we're deleting the root node, make the tree empty */
+ if (tree->root == node)
+ {
+ tree->root = NULL;
+ tree->max_val = 0;
+ }
+
+#ifdef RT_DEBUG
+ {
+ int i;
+
+ /* update the statistics */
+ for (i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ if (node->fanout == rt_size_class_info[i].fanout)
+ break;
+ }
+
+ /* fanout of node256 is intentionally 0 */
+ if (i == RT_SIZE_CLASS_COUNT)
+ i = RT_CLASS_256;
+
+ tree->cnt[i]--;
+ Assert(tree->cnt[i] >= 0);
+ }
+#endif
+
+ pfree(node);
+}
+
+/*
+ * Replace old_child with new_child, and free the old one.
+ */
+static void
+rt_replace_node(radix_tree *tree, rt_node *parent, rt_node *old_child,
+ rt_node *new_child, uint64 key)
+{
+ Assert(old_child->chunk == new_child->chunk);
+ Assert(old_child->shift == new_child->shift);
+
+ if (parent == old_child)
+ {
+ /* Replace the root node with the new large node */
+ tree->root = new_child;
+ }
+ else
+ {
+ bool replaced PG_USED_FOR_ASSERTS_ONLY;
+
+ replaced = rt_node_insert_inner(tree, NULL, parent, key, new_child);
+ Assert(replaced);
+ }
+
+ rt_free_node(tree, old_child);
+}
+
+/*
+ * The radix tree doesn't have sufficient height. Extend the radix tree so it
+ * can store the key.
+ */
+static void
+rt_extend(radix_tree *tree, uint64 key)
+{
+ int target_shift;
+ int shift = tree->root->shift + RT_NODE_SPAN;
+
+ target_shift = key_get_shift(key);
+
+ /* Grow tree from 'shift' to 'target_shift' */
+ while (shift <= target_shift)
+ {
+ rt_node_inner_4 *node;
+
+ node = (rt_node_inner_4 *) rt_alloc_node(tree, RT_CLASS_4_FULL, true);
+ rt_init_node((rt_node *) node, RT_NODE_KIND_4, RT_CLASS_4_FULL, true);
+ node->base.n.shift = shift;
+ node->base.n.count = 1;
+ node->base.chunks[0] = 0;
+ node->children[0] = tree->root;
+
+ tree->root->chunk = 0;
+ tree->root = (rt_node *) node;
+
+ shift += RT_NODE_SPAN;
+ }
+
+ tree->max_val = shift_get_max_val(target_shift);
+}
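
(Editorial worked example, not part of the patch: suppose the root currently has
shift 0 and key 0x10000 is inserted.)

    key          = 0x10000;                  /* highest set bit is bit 16 */
    target_shift = key_get_shift(key);       /* = 16 */
    /* loop iterations at shift = 8 and shift = 16: two new node-4 inner nodes
       are pushed on top of the old root */
    max_val      = shift_get_max_val(16);    /* = 0xFFFFFF */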
+
+/*
+ * The radix tree doesn't yet have the inner and leaf nodes for the given
+ * key-value pair. Insert inner nodes, and finally a leaf, from 'node' down to
+ * the bottom.
+ */
+static inline void
+rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node *parent,
+ rt_node *node)
+{
+ int shift = node->shift;
+
+ while (shift >= RT_NODE_SPAN)
+ {
+ rt_node *newchild;
+ int newshift = shift - RT_NODE_SPAN;
+ bool inner = newshift > 0;
+
+ newchild = rt_alloc_node(tree, RT_CLASS_4_FULL, inner);
+ rt_init_node(newchild, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
+ newchild->shift = newshift;
+ newchild->chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ rt_node_insert_inner(tree, parent, node, key, newchild);
+
+ parent = node;
+ node = newchild;
+ shift -= RT_NODE_SPAN;
+ }
+
+ rt_node_insert_leaf(tree, parent, node, key, value);
+ tree->num_keys++;
+}
+
+/*
+ * Search for the child pointer corresponding to 'key' in the given node.
+ *
+ * Return true if the key is found, otherwise return false. On success, the
+ * child pointer is set in *child_p.
+ */
+static inline bool
+rt_node_search_inner(rt_node *node, uint64 key, rt_node **child_p)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/*
+ * Search for the value corresponding to 'key' in the given node.
+ *
+ * Return true if the key is found, otherwise return false. On success, the
+ * value is set in *value_p.
+ */
+static inline bool
+rt_node_search_leaf(rt_node *node, uint64 key, uint64 *value_p)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
+/*
+ * Delete the child entry corresponding to 'key' from the given inner node.
+ *
+ * Return true if the key was found (and the entry deleted), otherwise return false.
+ */
+static inline bool
+rt_node_delete_inner(rt_node *node, uint64 key)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_delete_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/*
+ * Delete the value corresponding to 'key' from the given leaf node.
+ *
+ * Return true if the key was found (and the value deleted), otherwise return false.
+ */
+static inline bool
+rt_node_delete_leaf(rt_node *node, uint64 key)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_delete_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
+/* Insert the child to the inner node */
+static bool
+rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 key,
+ rt_node *child)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_insert_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/* Insert the value to the leaf node */
+static bool
+rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
+ uint64 key, uint64 value)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_insert_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
+/*
+ * Create the radix tree in the given memory context and return it.
+ */
+radix_tree *
+rt_create(MemoryContext ctx)
+{
+ radix_tree *tree;
+ MemoryContext old_ctx;
+
+ old_ctx = MemoryContextSwitchTo(ctx);
+
+ tree = palloc(sizeof(radix_tree));
+ tree->context = ctx;
+ tree->root = NULL;
+ tree->max_val = 0;
+ tree->num_keys = 0;
+
+ /* Create the slab allocator for each size class */
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ tree->inner_slabs[i] = SlabContextCreate(ctx,
+ rt_size_class_info[i].name,
+ rt_size_class_info[i].inner_blocksize,
+ rt_size_class_info[i].inner_size);
+ tree->leaf_slabs[i] = SlabContextCreate(ctx,
+ rt_size_class_info[i].name,
+ rt_size_class_info[i].leaf_blocksize,
+ rt_size_class_info[i].leaf_size);
+#ifdef RT_DEBUG
+ tree->cnt[i] = 0;
+#endif
+ }
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return tree;
+}
+
+/*
+ * Free the given radix tree.
+ */
+void
+rt_free(radix_tree *tree)
+{
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ MemoryContextDelete(tree->inner_slabs[i]);
+ MemoryContextDelete(tree->leaf_slabs[i]);
+ }
+
+ pfree(tree);
+}
+
+/*
+ * Set 'key' to 'value'. If the entry already exists, update its value and
+ * return true; otherwise insert it and return false.
+ */
+bool
+rt_set(radix_tree *tree, uint64 key, uint64 value)
+{
+ int shift;
+ bool updated;
+ rt_node *node;
+ rt_node *parent;
+
+ /* Empty tree, create the root */
+ if (!tree->root)
+ rt_new_root(tree, key);
+
+ /* Extend the tree if necessary */
+ if (key > tree->max_val)
+ rt_extend(tree, key);
+
+ Assert(tree->root);
+
+ shift = tree->root->shift;
+ node = parent = tree->root;
+
+ /* Descend the tree until a leaf node */
+ while (shift >= 0)
+ {
+ rt_node *child;
+
+ if (NODE_IS_LEAF(node))
+ break;
+
+ if (!rt_node_search_inner(node, key, &child))
+ {
+ rt_set_extend(tree, key, value, parent, node);
+ return false;
+ }
+
+ parent = node;
+ node = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ updated = rt_node_insert_leaf(tree, parent, node, key, value);
+
+ /* Update the statistics */
+ if (!updated)
+ tree->num_keys++;
+
+ return updated;
+}
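
(Editorial usage sketch, not part of the patch: a minimal example of the public API
implemented in this file, using rt_create/rt_set/rt_search/rt_delete/rt_free;
CurrentMemoryContext is just a convenient context to pass, and the key/value are
arbitrary.)

    /* minimal usage sketch, illustration only */
    radix_tree *tree = rt_create(CurrentMemoryContext);
    uint64      value;

    rt_set(tree, 42, 8675309);          /* returns false: key was not present */
    if (rt_search(tree, 42, &value))
        Assert(value == 8675309);

    rt_delete(tree, 42);
    rt_free(tree);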
+
+/*
+ * Search for the given key in the radix tree. Return true if the key is found,
+ * otherwise return false. On success, the value is set in *value_p, so it must
+ * not be NULL.
+ */
+bool
+rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
+{
+ rt_node *node;
+ int shift;
+
+ Assert(value_p != NULL);
+
+ if (!tree->root || key > tree->max_val)
+ return false;
+
+ node = tree->root;
+ shift = tree->root->shift;
+
+ /* Descend the tree until a leaf node */
+ while (shift >= 0)
+ {
+ rt_node *child;
+
+ if (NODE_IS_LEAF(node))
+ break;
+
+ if (!rt_node_search_inner(node, key, &child))
+ return false;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ return rt_node_search_leaf(node, key, value_p);
+}
+
+/*
+ * Delete the given key from the radix tree. Return true if the key is found (and
+ * deleted), otherwise do nothing and return false.
+ */
+bool
+rt_delete(radix_tree *tree, uint64 key)
+{
+ rt_node *node;
+ rt_node *stack[RT_MAX_LEVEL] = {0};
+ int shift;
+ int level;
+ bool deleted;
+
+ if (!tree->root || key > tree->max_val)
+ return false;
+
+ /*
+ * Descend the tree to search the key while building a stack of nodes we
+ * visited.
+ */
+ node = tree->root;
+ shift = tree->root->shift;
+ level = -1;
+ while (shift > 0)
+ {
+ rt_node *child;
+
+ /* Push the current node to the stack */
+ stack[++level] = node;
+
+ if (!rt_node_search_inner(node, key, &child))
+ return false;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+	/* Delete the key from the leaf node if it exists */
+ Assert(NODE_IS_LEAF(node));
+ deleted = rt_node_delete_leaf(node, key);
+
+ if (!deleted)
+ {
+ /* no key is found in the leaf node */
+ return false;
+ }
+
+ /* Found the key to delete. Update the statistics */
+ tree->num_keys--;
+
+ /*
+ * Return if the leaf node still has keys and we don't need to delete the
+ * node.
+ */
+ if (!NODE_IS_EMPTY(node))
+ return true;
+
+ /* Free the empty leaf node */
+ rt_free_node(tree, node);
+
+	/* Delete the key from the inner nodes, walking back up the stack */
+ while (level >= 0)
+ {
+ node = stack[level--];
+
+ deleted = rt_node_delete_inner(node, key);
+ Assert(deleted);
+
+ /* If the node didn't become empty, we stop deleting the key */
+ if (!NODE_IS_EMPTY(node))
+ break;
+
+ /* The node became empty */
+ rt_free_node(tree, node);
+ }
+
+ return true;
+}
+
+/* Create and return the iterator for the given radix tree */
+rt_iter *
+rt_begin_iterate(radix_tree *tree)
+{
+ MemoryContext old_ctx;
+ rt_iter *iter;
+ int top_level;
+
+ old_ctx = MemoryContextSwitchTo(tree->context);
+
+ iter = (rt_iter *) palloc0(sizeof(rt_iter));
+ iter->tree = tree;
+
+ /* empty tree */
+ if (!iter->tree->root)
+ return iter;
+
+ top_level = iter->tree->root->shift / RT_NODE_SPAN;
+ iter->stack_len = top_level;
+
+ /*
+	 * Descend to the leftmost leaf node from the root. The key is
+	 * constructed while descending to the leaf.
+ */
+ rt_update_iter_stack(iter, iter->tree->root, top_level);
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return iter;
+}
+
+/*
+ * Update each node_iter for inner nodes in the iterator node stack.
+ */
+static void
+rt_update_iter_stack(rt_iter *iter, rt_node *from_node, int from)
+{
+ int level = from;
+ rt_node *node = from_node;
+
+ for (;;)
+ {
+ rt_node_iter *node_iter = &(iter->stack[level--]);
+
+ node_iter->node = node;
+ node_iter->current_idx = -1;
+
+ /* We don't advance the leaf node iterator here */
+ if (NODE_IS_LEAF(node))
+ return;
+
+ /* Advance to the next slot in the inner node */
+ node = rt_node_inner_iterate_next(iter, node_iter);
+
+		/* We must find the first child in the node */
+ Assert(node);
+ }
+}
+
+/*
+ * Return true, setting *key_p and *value_p, if there is a next key. Otherwise,
+ * return false.
+ */
+bool
+rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p)
+{
+ /* Empty tree */
+ if (!iter->tree->root)
+ return false;
+
+ for (;;)
+ {
+ rt_node *child = NULL;
+ uint64 value;
+ int level;
+ bool found;
+
+ /* Advance the leaf node iterator to get next key-value pair */
+ found = rt_node_leaf_iterate_next(iter, &(iter->stack[0]), &value);
+
+ if (found)
+ {
+ *key_p = iter->key;
+ *value_p = value;
+ return true;
+ }
+
+ /*
+		 * We've visited all values in the leaf node, so advance the inner node
+		 * iterators, starting at level 1, until we find the next child node.
+ */
+ for (level = 1; level <= iter->stack_len; level++)
+ {
+ child = rt_node_inner_iterate_next(iter, &(iter->stack[level]));
+
+ if (child)
+ break;
+ }
+
+ /* the iteration finished */
+ if (!child)
+ return false;
+
+ /*
+ * Set the node to the node iterator and update the iterator stack
+ * from this node.
+ */
+ rt_update_iter_stack(iter, child, level - 1);
+
+ /* Node iterators are updated, so try again from the leaf */
+ }
+
+ return false;
+}
+
+void
+rt_end_iterate(rt_iter *iter)
+{
+ pfree(iter);
+}
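
(Editorial usage sketch, not part of the patch: minimal iteration over an existing
radix_tree 'tree' using the functions above.)

    /* illustration only: visit all key-value pairs in ascending key order */
    rt_iter    *iter = rt_begin_iterate(tree);
    uint64      key;
    uint64      value;

    while (rt_iterate_next(iter, &key, &value))
    {
        /* process (key, value) */
    }
    rt_end_iterate(iter);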
+
+static inline void
+rt_iter_update_key(rt_iter *iter, uint8 chunk, uint8 shift)
+{
+ iter->key &= ~(((uint64) RT_CHUNK_MASK) << shift);
+ iter->key |= (((uint64) chunk) << shift);
+}
+
+/*
+ * Advance the slot in the inner node. Return the child if it exists,
+ * otherwise NULL.
+ */
+static inline rt_node *
+rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_iter_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/*
+ * Advance the slot in the leaf node. On success, return true and set the
+ * value in *value_p; otherwise return false.
+ */
+static inline bool
+rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
+ uint64 *value_p)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_iter_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
+/*
+ * Return the number of keys in the radix tree.
+ */
+uint64
+rt_num_entries(radix_tree *tree)
+{
+ return tree->num_keys;
+}
+
+/*
+ * Return the amount of memory used by the radix tree.
+ */
+uint64
+rt_memory_usage(radix_tree *tree)
+{
+ Size total = sizeof(radix_tree);
+
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
+ total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
+ }
+
+ return total;
+}
+
+/*
+ * Verify the radix tree node.
+ */
+static void
+rt_verify_node(rt_node *node)
+{
+#ifdef USE_ASSERT_CHECKING
+ Assert(node->count >= 0);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_base_4 *n4 = (rt_node_base_4 *) node;
+
+ for (int i = 1; i < n4->n.count; i++)
+ Assert(n4->chunks[i - 1] < n4->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_base_32 *n32 = (rt_node_base_32 *) node;
+
+ for (int i = 1; i < n32->n.count; i++)
+ Assert(n32->chunks[i - 1] < n32->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ rt_node_base_125 *n125 = (rt_node_base_125 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ uint8 slot = n125->slot_idxs[i];
+ int idx = BM_IDX(slot);
+ int bitnum = BM_BIT(slot);
+
+ if (!node_125_is_chunk_used(n125, i))
+ continue;
+
+ /* Check if the corresponding slot is used */
+ Assert(slot < node->fanout);
+ Assert((n125->isset[idx] & ((bitmapword) 1 << bitnum)) != 0);
+
+ cnt++;
+ }
+
+ Assert(n125->n.count == cnt);
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < BM_IDX(RT_NODE_MAX_SLOTS); i++)
+ cnt += bmw_popcount(n256->isset[i]);
+
+ /* Check if the number of used chunk matches */
+ Assert(n256->base.n.count == cnt);
+
+ break;
+ }
+ }
+ }
+#endif
+}
+
+/***************** DEBUG FUNCTIONS *****************/
+#ifdef RT_DEBUG
+void
+rt_stats(radix_tree *tree)
+{
+ ereport(NOTICE, (errmsg("num_keys = " UINT64_FORMAT ", height = %d, n4 = %u, n15 = %u, n32 = %u, n125 = %u, n256 = %u",
+ tree->num_keys,
+ tree->root->shift / RT_NODE_SPAN,
+ tree->cnt[RT_CLASS_4_FULL],
+ tree->cnt[RT_CLASS_32_PARTIAL],
+ tree->cnt[RT_CLASS_32_FULL],
+ tree->cnt[RT_CLASS_125_FULL],
+ tree->cnt[RT_CLASS_256])));
+}
+
+static void
+rt_dump_node(rt_node *node, int level, bool recurse)
+{
+ char space[125] = {0};
+
+ fprintf(stderr, "[%s] kind %d, fanout %d, count %u, shift %u, chunk 0x%X:\n",
+ NODE_IS_LEAF(node) ? "LEAF" : "INNR",
+ (node->kind == RT_NODE_KIND_4) ? 4 :
+ (node->kind == RT_NODE_KIND_32) ? 32 :
+ (node->kind == RT_NODE_KIND_125) ? 125 : 256,
+ node->fanout == 0 ? 256 : node->fanout,
+ node->count, node->shift, node->chunk);
+
+ if (level > 0)
+ sprintf(space, "%*c", level * 4, ' ');
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, n4->base.chunks[i], n4->values[i]);
+ }
+ else
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, n4->base.chunks[i]);
+
+ if (recurse)
+ rt_dump_node(n4->children[i], level + 1, recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, n32->base.chunks[i], n32->values[i]);
+ }
+ else
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, n32->base.chunks[i]);
+
+ if (recurse)
+ {
+ rt_dump_node(n32->children[i], level + 1, recurse);
+ }
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ rt_node_base_125 *b125 = (rt_node_base_125 *) node;
+
+ fprintf(stderr, "slot_idxs ");
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!node_125_is_chunk_used(b125, i))
+ continue;
+
+ fprintf(stderr, " [%d]=%d, ", i, b125->slot_idxs[i]);
+ }
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_125 *n = (rt_node_leaf_125 *) node;
+
+ fprintf(stderr, ", isset-bitmap:");
+ for (int i = 0; i < BM_IDX(128); i++)
+ {
+ fprintf(stderr, UINT64_FORMAT_HEX " ", (uint64) n->base.isset[i]);
+ }
+ fprintf(stderr, "\n");
+ }
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!node_125_is_chunk_used(b125, i))
+ continue;
+
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) b125;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, i, node_leaf_125_get_value(n125, i));
+ }
+ else
+ {
+ rt_node_inner_125 *n125 = (rt_node_inner_125 *) b125;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, i);
+
+ if (recurse)
+ rt_dump_node(node_inner_125_get_child(n125, i),
+ level + 1, recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+
+ if (!node_leaf_256_is_chunk_used(n256, i))
+ continue;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, i, node_leaf_256_get_value(n256, i));
+ }
+ else
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+
+ if (!node_inner_256_is_chunk_used(n256, i))
+ continue;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, i);
+
+ if (recurse)
+ rt_dump_node(node_inner_256_get_child(n256, i), level + 1,
+ recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ }
+}
+
+void
+rt_dump_search(radix_tree *tree, uint64 key)
+{
+ rt_node *node;
+ int shift;
+ int level = 0;
+
+ elog(NOTICE, "-----------------------------------------------------------");
+ elog(NOTICE, "max_val = " UINT64_FORMAT "(0x" UINT64_FORMAT_HEX ")",
+ tree->max_val, tree->max_val);
+
+ if (!tree->root)
+ {
+ elog(NOTICE, "tree is empty");
+ return;
+ }
+
+ if (key > tree->max_val)
+ {
+ elog(NOTICE, "key " UINT64_FORMAT "(0x" UINT64_FORMAT_HEX ") is larger than max val",
+ key, key);
+ return;
+ }
+
+ node = tree->root;
+ shift = tree->root->shift;
+ while (shift >= 0)
+ {
+ rt_node *child;
+
+ rt_dump_node(node, level, false);
+
+ if (NODE_IS_LEAF(node))
+ {
+ uint64 dummy;
+
+			/* We reached a leaf node; find the corresponding slot */
+ rt_node_search_leaf(node, key, &dummy);
+
+ break;
+ }
+
+ if (!rt_node_search_inner(node, key, &child))
+ break;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ level++;
+ }
+}
+
+void
+rt_dump(radix_tree *tree)
+{
+
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ fprintf(stderr, "%s\tinner_size %zu\tinner_blocksize %zu\tleaf_size %zu\tleaf_blocksize %zu\n",
+ rt_size_class_info[i].name,
+ rt_size_class_info[i].inner_size,
+ rt_size_class_info[i].inner_blocksize,
+ rt_size_class_info[i].leaf_size,
+ rt_size_class_info[i].leaf_blocksize);
+ fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->max_val);
+
+ if (!tree->root)
+ {
+ fprintf(stderr, "empty tree\n");
+ return;
+ }
+
+ rt_dump_node(tree->root, 0, true);
+}
+#endif
--
2.39.0
Attachment: v17-0001-introduce-vector8_min-and-vector8_highbit_mask.patch (text/x-patch)
From b5653e1d6ac004f5b5420d240f9c0ee142495874 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 24 Oct 2022 14:07:09 +0900
Subject: [PATCH v17 1/9] introduce vector8_min and vector8_highbit_mask
---
src/include/port/simd.h | 47 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 47 insertions(+)
diff --git a/src/include/port/simd.h b/src/include/port/simd.h
index c836360d4b..84d41a340a 100644
--- a/src/include/port/simd.h
+++ b/src/include/port/simd.h
@@ -77,6 +77,7 @@ static inline bool vector8_has(const Vector8 v, const uint8 c);
static inline bool vector8_has_zero(const Vector8 v);
static inline bool vector8_has_le(const Vector8 v, const uint8 c);
static inline bool vector8_is_highbit_set(const Vector8 v);
+static inline uint32 vector8_highbit_mask(const Vector8 v);
#ifndef USE_NO_SIMD
static inline bool vector32_is_highbit_set(const Vector32 v);
#endif
@@ -96,6 +97,7 @@ static inline Vector8 vector8_ssub(const Vector8 v1, const Vector8 v2);
*/
#ifndef USE_NO_SIMD
static inline Vector8 vector8_eq(const Vector8 v1, const Vector8 v2);
+static inline Vector8 vector8_min(const Vector8 v1, const Vector8 v2);
static inline Vector32 vector32_eq(const Vector32 v1, const Vector32 v2);
#endif
@@ -277,6 +279,36 @@ vector8_is_highbit_set(const Vector8 v)
#endif
}
+/*
+ * Return the bitmask of the high bit of each element.
+ */
+static inline uint32
+vector8_highbit_mask(const Vector8 v)
+{
+#ifdef USE_SSE2
+ return (uint32) _mm_movemask_epi8(v);
+#elif defined(USE_NEON)
+ static const uint8 mask[16] = {
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ };
+
+ uint8x16_t masked = vandq_u8(vld1q_u8(mask), (uint8x16_t) vshrq_n_s8(v, 7));
+ uint8x16_t maskedhi = vextq_u8(masked, masked, 8);
+
+ return (uint32) vaddvq_u16((uint16x8_t) vzip1q_u8(masked, maskedhi));
+#else
+ uint32 mask = 0;
+
+ for (Size i = 0; i < sizeof(Vector8); i++)
+ mask |= (((const uint8 *) &v)[i] >> 7) << i;
+
+ return mask;
+#endif
+}
+
/*
* Exactly like vector8_is_highbit_set except for the input type, so it
* looks at each byte separately.
@@ -372,4 +404,19 @@ vector32_eq(const Vector32 v1, const Vector32 v2)
}
#endif /* ! USE_NO_SIMD */
+/*
+ * Compare the given vectors and return the vector of minimum elements.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_min(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+ return _mm_min_epu8(v1, v2);
+#elif defined(USE_NEON)
+ return vminq_u8(v1, v2);
+#endif
+}
+#endif /* ! USE_NO_SIMD */
+
#endif /* SIMD_H */
--
2.39.0
Attachment: v17-0002-Move-some-bitmap-logic-out-of-bitmapset.c.patch (text/x-patch)
From dd3ab2bf57b5fae0dd7c10a4a44d23db38d65140 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Tue, 6 Dec 2022 13:39:41 +0700
Subject: [PATCH v17 2/9] Move some bitmap logic out of bitmapset.c
Add pg_rightmost_one32/64 functions. This functionality was previously
private to bitmapset.c as the RIGHTMOST_ONE macro. It has practical
use in other contexts, so move to pg_bitutils.h.
Also move the logic for selecting appropriate pg_bitutils functions
based on word size to bitmapset.h for wider visibility and add
appropriate selection for bmw_rightmost_one().
Since the previous macro relied on casting to signedbitmapword,
and the new functions do not, remove that typedef.
Design input and review by Tom Lane
Discussion: https://www.postgresql.org/message-id/CAFBsxsFW2JjTo58jtDB%2B3sZhxMx3t-3evew8%3DAcr%2BGGhC%2BkFaA%40mail.gmail.com
---
src/backend/nodes/bitmapset.c | 36 ++------------------------------
src/include/nodes/bitmapset.h | 16 ++++++++++++--
src/include/port/pg_bitutils.h | 31 +++++++++++++++++++++++++++
src/tools/pgindent/typedefs.list | 1 -
4 files changed, 47 insertions(+), 37 deletions(-)
diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index 98660524ad..fcd8e2ccbc 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -32,39 +32,7 @@
#define BITMAPSET_SIZE(nwords) \
(offsetof(Bitmapset, words) + (nwords) * sizeof(bitmapword))
-/*----------
- * This is a well-known cute trick for isolating the rightmost one-bit
- * in a word. It assumes two's complement arithmetic. Consider any
- * nonzero value, and focus attention on the rightmost one. The value is
- * then something like
- * xxxxxx10000
- * where x's are unspecified bits. The two's complement negative is formed
- * by inverting all the bits and adding one. Inversion gives
- * yyyyyy01111
- * where each y is the inverse of the corresponding x. Incrementing gives
- * yyyyyy10000
- * and then ANDing with the original value gives
- * 00000010000
- * This works for all cases except original value = zero, where of course
- * we get zero.
- *----------
- */
-#define RIGHTMOST_ONE(x) ((signedbitmapword) (x) & -((signedbitmapword) (x)))
-
-#define HAS_MULTIPLE_ONES(x) ((bitmapword) RIGHTMOST_ONE(x) != (x))
-
-/* Select appropriate bit-twiddling functions for bitmap word size */
-#if BITS_PER_BITMAPWORD == 32
-#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos32(w)
-#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos32(w)
-#define bmw_popcount(w) pg_popcount32(w)
-#elif BITS_PER_BITMAPWORD == 64
-#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos64(w)
-#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos64(w)
-#define bmw_popcount(w) pg_popcount64(w)
-#else
-#error "invalid BITS_PER_BITMAPWORD"
-#endif
+#define HAS_MULTIPLE_ONES(x) (bmw_rightmost_one(x) != (x))
/*
@@ -1013,7 +981,7 @@ bms_first_member(Bitmapset *a)
{
int result;
- w = RIGHTMOST_ONE(w);
+ w = bmw_rightmost_one(w);
a->words[wordnum] &= ~w;
result = wordnum * BITS_PER_BITMAPWORD;
diff --git a/src/include/nodes/bitmapset.h b/src/include/nodes/bitmapset.h
index 0dca6bc5fa..80e91fac0f 100644
--- a/src/include/nodes/bitmapset.h
+++ b/src/include/nodes/bitmapset.h
@@ -38,13 +38,11 @@ struct List;
#define BITS_PER_BITMAPWORD 64
typedef uint64 bitmapword; /* must be an unsigned type */
-typedef int64 signedbitmapword; /* must be the matching signed type */
#else
#define BITS_PER_BITMAPWORD 32
typedef uint32 bitmapword; /* must be an unsigned type */
-typedef int32 signedbitmapword; /* must be the matching signed type */
#endif
@@ -75,6 +73,20 @@ typedef enum
BMS_MULTIPLE /* >1 member */
} BMS_Membership;
+/* Select appropriate bit-twiddling functions for bitmap word size */
+#if BITS_PER_BITMAPWORD == 32
+#define bmw_rightmost_one(w) pg_rightmost_one32(w)
+#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos32(w)
+#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos32(w)
+#define bmw_popcount(w) pg_popcount32(w)
+#elif BITS_PER_BITMAPWORD == 64
+#define bmw_rightmost_one(w) pg_rightmost_one64(w)
+#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos64(w)
+#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos64(w)
+#define bmw_popcount(w) pg_popcount64(w)
+#else
+#error "invalid BITS_PER_BITMAPWORD"
+#endif
/*
* function prototypes in nodes/bitmapset.c
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index a49a9c03d9..7235ad25ee 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -17,6 +17,37 @@ extern PGDLLIMPORT const uint8 pg_leftmost_one_pos[256];
extern PGDLLIMPORT const uint8 pg_rightmost_one_pos[256];
extern PGDLLIMPORT const uint8 pg_number_of_ones[256];
+/*----------
+ * This is a well-known cute trick for isolating the rightmost one-bit
+ * in a word. It assumes two's complement arithmetic. Consider any
+ * nonzero value, and focus attention on the rightmost one. The value is
+ * then something like
+ * xxxxxx10000
+ * where x's are unspecified bits. The two's complement negative is formed
+ * by inverting all the bits and adding one. Inversion gives
+ * yyyyyy01111
+ * where each y is the inverse of the corresponding x. Incrementing gives
+ * yyyyyy10000
+ * and then ANDing with the original value gives
+ * 00000010000
+ * This works for all cases except original value = zero, where of course
+ * we get zero.
+ *----------
+ */
+static inline uint32
+pg_rightmost_one32(uint32 word)
+{
+ int32 result = (int32) word & -((int32) word);
+ return (uint32) result;
+}
+
+static inline uint64
+pg_rightmost_one64(uint64 word)
+{
+ int64 result = (int64) word & -((int64) word);
+ return (uint64) result;
+}
+
/*
* pg_leftmost_one_pos32
* Returns the position of the most significant set bit in "word",
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 50d86cb01b..e19fd2966d 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3655,7 +3655,6 @@ shmem_request_hook_type
shmem_startup_hook_type
sig_atomic_t
sigjmp_buf
-signedbitmapword
sigset_t
size_t
slist_head
--
2.39.0
Attachment: v17-0004-tool-for-measuring-radix-tree-performance.patch (text/x-patch)
From f510d0d88460cbebebba9c089e38e02e054c71bb Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 16 Sep 2022 11:57:03 +0900
Subject: [PATCH v17 4/9] tool for measuring radix tree performance
---
contrib/bench_radix_tree/Makefile | 21 +
.../bench_radix_tree--1.0.sql | 76 +++
contrib/bench_radix_tree/bench_radix_tree.c | 635 ++++++++++++++++++
.../bench_radix_tree/bench_radix_tree.control | 6 +
contrib/bench_radix_tree/expected/bench.out | 13 +
contrib/bench_radix_tree/sql/bench.sql | 16 +
6 files changed, 767 insertions(+)
create mode 100644 contrib/bench_radix_tree/Makefile
create mode 100644 contrib/bench_radix_tree/bench_radix_tree--1.0.sql
create mode 100644 contrib/bench_radix_tree/bench_radix_tree.c
create mode 100644 contrib/bench_radix_tree/bench_radix_tree.control
create mode 100644 contrib/bench_radix_tree/expected/bench.out
create mode 100644 contrib/bench_radix_tree/sql/bench.sql
diff --git a/contrib/bench_radix_tree/Makefile b/contrib/bench_radix_tree/Makefile
new file mode 100644
index 0000000000..952bb0ceae
--- /dev/null
+++ b/contrib/bench_radix_tree/Makefile
@@ -0,0 +1,21 @@
+# contrib/bench_radix_tree/Makefile
+
+MODULE_big = bench_radix_tree
+OBJS = \
+ bench_radix_tree.o
+
+EXTENSION = bench_radix_tree
+DATA = bench_radix_tree--1.0.sql
+
+REGRESS = bench_fixed_height
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/bench_radix_tree
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
new file mode 100644
index 0000000000..2fd689aa91
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
@@ -0,0 +1,76 @@
+/* contrib/bench_radix_tree/bench_radix_tree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION bench_radix_tree" to load this file. \quit
+
+create function bench_shuffle_search(
+minblk int4,
+maxblk int4,
+random_block bool DEFAULT false,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT array_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT array_load_ms int8,
+OUT rt_search_ms int8,
+OUT array_search_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_seq_search(
+minblk int4,
+maxblk int4,
+random_block bool DEFAULT false,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT array_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT array_load_ms int8,
+OUT rt_search_ms int8,
+OUT array_search_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_load_random_int(
+cnt int8,
+OUT mem_allocated int8,
+OUT load_ms int8)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_search_random_nodes(
+cnt int8,
+filter_str text DEFAULT NULL,
+OUT mem_allocated int8,
+OUT search_ms int8)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C VOLATILE PARALLEL UNSAFE;
+
+create function bench_fixed_height_search(
+fanout int4,
+OUT fanout int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT rt_search_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_node128_load(
+fanout int4,
+OUT fanout int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT rt_sparseload_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c
new file mode 100644
index 0000000000..a0693695e6
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -0,0 +1,635 @@
+/*-------------------------------------------------------------------------
+ *
+ * bench_radix_tree.c
+ *
+ * Copyright (c) 2016-2022, PostgreSQL Global Development Group
+ *
+ * contrib/bench_radix_tree/bench_radix_tree.c
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "funcapi.h"
+#include "lib/radixtree.h"
+#include <math.h>
+#include "miscadmin.h"
+#include "storage/lwlock.h"
+#include "utils/builtins.h"
+#include "utils/timestamp.h"
+
+PG_MODULE_MAGIC;
+
+/* run benchmark also for binary-search case? */
+/* #define MEASURE_BINARY_SEARCH 1 */
+
+#define TIDS_PER_BLOCK_FOR_LOAD 30
+#define TIDS_PER_BLOCK_FOR_LOOKUP 50
+
+PG_FUNCTION_INFO_V1(bench_seq_search);
+PG_FUNCTION_INFO_V1(bench_shuffle_search);
+PG_FUNCTION_INFO_V1(bench_load_random_int);
+PG_FUNCTION_INFO_V1(bench_fixed_height_search);
+PG_FUNCTION_INFO_V1(bench_search_random_nodes);
+PG_FUNCTION_INFO_V1(bench_node128_load);
+
+static uint64
+tid_to_key_off(ItemPointer tid, uint32 *off)
+{
+ uint64 upper;
+ uint32 shift = pg_ceil_log2_32(MaxHeapTuplesPerPage);
+ int64 tid_i;
+
+ Assert(ItemPointerGetOffsetNumber(tid) < MaxHeapTuplesPerPage);
+
+ tid_i = ItemPointerGetOffsetNumber(tid);
+ tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << shift;
+
+ /* log(sizeof(uint64) * BITS_PER_BYTE, 2) = log(64, 2) = 6 */
+ *off = tid_i & ((1 << 6) - 1);
+ upper = tid_i >> 6;
+ Assert(*off < (sizeof(uint64) * BITS_PER_BYTE));
+
+ Assert(*off < 64);
+
+ return upper;
+}
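
(Editorial worked example, not part of the patch: with MaxHeapTuplesPerPage = 291,
pg_ceil_log2_32(291) = 9, so a TID of (block 1000, offset 5) is encoded as follows.)

    /* worked example, illustration only */
    tid_i = 5 | ((uint64) 1000 << 9);   /* = 512005 */
    key   = tid_i >> 6;                 /* = 8000: the radix tree key */
    off   = tid_i & 63;                 /* = 5: bit position within the 64-bit value */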
+
+static int
+shuffle_randrange(pg_prng_state *state, int lower, int upper)
+{
+ return (int) floor(pg_prng_double(state) * ((upper - lower) + 0.999999)) + lower;
+}
+
+/* Naive Fisher-Yates implementation */
+static void
+shuffle_itemptrs(ItemPointer itemptr, uint64 nitems)
+{
+	/* reproducibility */
+ pg_prng_state state;
+
+ pg_prng_seed(&state, 0);
+
+ for (int i = 0; i < nitems - 1; i++)
+ {
+ int j = shuffle_randrange(&state, i, nitems - 1);
+ ItemPointerData t = itemptr[j];
+
+ itemptr[j] = itemptr[i];
+ itemptr[i] = t;
+ }
+}
+
+static ItemPointer
+generate_tids(BlockNumber minblk, BlockNumber maxblk, int ntids_per_blk, uint64 *ntids_p,
+ bool random_block)
+{
+ ItemPointer tids;
+ uint64 maxitems;
+ uint64 ntids = 0;
+ pg_prng_state state;
+
+ maxitems = (maxblk - minblk + 1) * ntids_per_blk;
+ tids = MemoryContextAllocHuge(TopTransactionContext,
+ sizeof(ItemPointerData) * maxitems);
+
+ if (random_block)
+ pg_prng_seed(&state, 0x9E3779B185EBCA87);
+
+ for (BlockNumber blk = minblk; blk < maxblk; blk++)
+ {
+ if (random_block && !pg_prng_bool(&state))
+ continue;
+
+ for (OffsetNumber off = FirstOffsetNumber;
+ off <= ntids_per_blk; off++)
+ {
+ CHECK_FOR_INTERRUPTS();
+
+ ItemPointerSetBlockNumber(&(tids[ntids]), blk);
+ ItemPointerSetOffsetNumber(&(tids[ntids]), off);
+
+ ntids++;
+ }
+ }
+
+ *ntids_p = ntids;
+ return tids;
+}
+
+#ifdef MEASURE_BINARY_SEARCH
+static int
+vac_cmp_itemptr(const void *left, const void *right)
+{
+ BlockNumber lblk,
+ rblk;
+ OffsetNumber loff,
+ roff;
+
+ lblk = ItemPointerGetBlockNumber((ItemPointer) left);
+ rblk = ItemPointerGetBlockNumber((ItemPointer) right);
+
+ if (lblk < rblk)
+ return -1;
+ if (lblk > rblk)
+ return 1;
+
+ loff = ItemPointerGetOffsetNumber((ItemPointer) left);
+ roff = ItemPointerGetOffsetNumber((ItemPointer) right);
+
+ if (loff < roff)
+ return -1;
+ if (loff > roff)
+ return 1;
+
+ return 0;
+}
+#endif
+
+static Datum
+bench_search(FunctionCallInfo fcinfo, bool shuffle)
+{
+ BlockNumber minblk = PG_GETARG_INT32(0);
+ BlockNumber maxblk = PG_GETARG_INT32(1);
+ bool random_block = PG_GETARG_BOOL(2);
+ radix_tree *rt = NULL;
+ uint64 ntids;
+ uint64 key;
+ uint64 last_key = PG_UINT64_MAX;
+ uint64 val = 0;
+ ItemPointer tids;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_load_ms,
+ rt_search_ms;
+ Datum values[7];
+ bool nulls[7];
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ tids = generate_tids(minblk, maxblk, TIDS_PER_BLOCK_FOR_LOAD, &ntids, random_block);
+
+ /* measure the load time of the radix tree */
+ rt = rt_create(CurrentMemoryContext);
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ uint32 off;
+
+ CHECK_FOR_INTERRUPTS();
+
+ key = tid_to_key_off(tid, &off);
+
+ if (last_key != PG_UINT64_MAX && last_key != key)
+ {
+ rt_set(rt, last_key, val);
+ val = 0;
+ }
+
+ last_key = key;
+ val |= (uint64) 1 << off;
+ }
+ if (last_key != PG_UINT64_MAX)
+ rt_set(rt, last_key, val);
+
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_load_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ if (shuffle)
+ shuffle_itemptrs(tids, ntids);
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+	/* measure the search time of the radix tree */
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ uint32 off;
+ volatile bool ret; /* prevent calling rt_search from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ key = tid_to_key_off(tid, &off);
+
+ ret = rt_search(rt, key, &val);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_search_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int64GetDatum(rt_num_entries(rt));
+ values[1] = Int64GetDatum(rt_memory_usage(rt));
+ values[2] = Int64GetDatum(sizeof(ItemPointerData) * ntids);
+ values[3] = Int64GetDatum(rt_load_ms);
+ nulls[4] = true; /* ar_load_ms */
+ values[5] = Int64GetDatum(rt_search_ms);
+ nulls[6] = true; /* ar_search_ms */
+
+#ifdef MEASURE_BINARY_SEARCH
+ {
+ ItemPointer itemptrs = NULL;
+
+ int64 ar_load_ms,
+ ar_search_ms;
+
+ /* measure the load time of the array */
+ itemptrs = MemoryContextAllocHuge(CurrentMemoryContext,
+ sizeof(ItemPointerData) * ntids);
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointerSetBlockNumber(&(itemptrs[i]),
+ ItemPointerGetBlockNumber(&(tids[i])));
+ ItemPointerSetOffsetNumber(&(itemptrs[i]),
+ ItemPointerGetOffsetNumber(&(tids[i])));
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ ar_load_ms = secs * 1000 + usecs / 1000;
+
+		/* next, measure the search time of the array */
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ volatile bool ret; /* prevent calling bsearch from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ ret = bsearch((void *) tid,
+ (void *) itemptrs,
+ ntids,
+ sizeof(ItemPointerData),
+ vac_cmp_itemptr);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ ar_search_ms = secs * 1000 + usecs / 1000;
+
+ /* set the result */
+ nulls[4] = false;
+ values[4] = Int64GetDatum(ar_load_ms);
+ nulls[6] = false;
+ values[6] = Int64GetDatum(ar_search_ms);
+ }
+#endif
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_seq_search(PG_FUNCTION_ARGS)
+{
+ return bench_search(fcinfo, false);
+}
+
+Datum
+bench_shuffle_search(PG_FUNCTION_ARGS)
+{
+ return bench_search(fcinfo, true);
+}
+
+Datum
+bench_load_random_int(PG_FUNCTION_ARGS)
+{
+ uint64 cnt = (uint64) PG_GETARG_INT64(0);
+ radix_tree *rt;
+ pg_prng_state state;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 load_time_ms;
+ Datum values[2];
+ bool nulls[2];
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ pg_prng_seed(&state, 0);
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ uint64 key = pg_prng_uint64(&state);
+
+ rt_set(rt, key, key);
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ load_time_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int64GetDatum(rt_memory_usage(rt));
+ values[1] = Int64GetDatum(load_time_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+/* copy of splitmix64() */
+static uint64
+hash64(uint64 x)
+{
+ x ^= x >> 30;
+ x *= UINT64CONST(0xbf58476d1ce4e5b9);
+ x ^= x >> 27;
+ x *= UINT64CONST(0x94d049bb133111eb);
+ x ^= x >> 31;
+ return x;
+}
+
+/* attempts to have a relatively even population of node kinds */
+Datum
+bench_search_random_nodes(PG_FUNCTION_ARGS)
+{
+ uint64 cnt = (uint64) PG_GETARG_INT64(0);
+ radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 search_time_ms;
+ Datum values[2] = {0};
+ bool nulls[2] = {0};
+ /* from trial and error */
+ uint64 filter = (((uint64) 0x7F << 32) | (0x07 << 24) | (0xFF << 16) | 0xFF);
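+ /*
+ * Each byte of the filter caps how many distinct chunk values can appear
+ * at the corresponding tree level, so masking the hashed keys with it
+ * tends to populate a mix of node kinds across the levels.
+ */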
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ if (!PG_ARGISNULL(1))
+ {
+ char *filter_str = text_to_cstring(PG_GETARG_TEXT_P(1));
+
+ if (sscanf(filter_str, "0x%lX", &filter) == 0)
+ elog(ERROR, "invalid filter string %s", filter_str);
+ }
+ elog(NOTICE, "bench with filter 0x%lX", filter);
+
+ rt = rt_create(CurrentMemoryContext);
+
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ const uint64 hash = hash64(i);
+ const uint64 key = hash & filter;
+
+ rt_set(rt, key, key);
+ }
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ const uint64 hash = hash64(i);
+ const uint64 key = hash & filter;
+ uint64 val;
+ volatile bool ret; /* prevent calling rt_search from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ ret = rt_search(rt, key, &val);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ search_time_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ values[0] = Int64GetDatum(rt_memory_usage(rt));
+ values[1] = Int64GetDatum(search_time_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_fixed_height_search(PG_FUNCTION_ARGS)
+{
+ int fanout = PG_GETARG_INT32(0);
+ radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_load_ms,
+ rt_search_ms;
+ Datum values[5];
+ bool nulls[5];
+
+ /* test boundary between vector and iteration */
+ const int n_keys = 5 * 16 * 16 * 16 * 16;
+ uint64 r,
+ h,
+ i,
+ j,
+ k;
+ int key_id;
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+
+ key_id = 0;
+
+ /*
+ * lower nodes have limited fanout, the top is only limited by
+ * bits-per-byte
+ */
+ for (r = 1;; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ for (i = 1; i <= fanout; i++)
+ {
+ for (j = 1; j <= fanout; j++)
+ {
+ for (k = 1; k <= fanout; k++)
+ {
+ uint64 key;
+
+ key = (r << 32) | (h << 24) | (i << 16) | (j << 8) | (k);
+
+ CHECK_FOR_INTERRUPTS();
+
+ key_id++;
+ if (key_id > n_keys)
+ goto finish_set;
+
+ rt_set(rt, key, key_id);
+ }
+ }
+ }
+ }
+ }
+finish_set:
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_load_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ /* measure the search time of the radix tree */
+ start_time = GetCurrentTimestamp();
+
+ key_id = 0;
+ for (r = 1;; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ for (i = 1; i <= fanout; i++)
+ {
+ for (j = 1; j <= fanout; j++)
+ {
+ for (k = 1; k <= fanout; k++)
+ {
+ uint64 key,
+ val;
+
+ key = (r << 32) | (h << 24) | (i << 16) | (j << 8) | (k);
+
+ CHECK_FOR_INTERRUPTS();
+
+ key_id++;
+ if (key_id > n_keys)
+ goto finish_search;
+
+ rt_search(rt, key, &val);
+ }
+ }
+ }
+ }
+ }
+finish_search:
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_search_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int32GetDatum(fanout);
+ values[1] = Int64GetDatum(rt_num_entries(rt));
+ values[2] = Int64GetDatum(rt_memory_usage(rt));
+ values[3] = Int64GetDatum(rt_load_ms);
+ values[4] = Int64GetDatum(rt_search_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_node128_load(PG_FUNCTION_ARGS)
+{
+ int fanout = PG_GETARG_INT32(0);
+ radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_sparseload_ms;
+ Datum values[5];
+ bool nulls[5];
+
+ uint64 r,
+ h;
+ int key_id;
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ rt = rt_create(CurrentMemoryContext);
+
+ key_id = 0;
+
+ for (r = 1; r <= fanout; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (h);
+
+ key_id++;
+ rt_set(rt, key, key_id);
+ }
+ }
+
+ rt_stats(rt);
+
+ /* measure sparse deletion and re-loading */
+ start_time = GetCurrentTimestamp();
+
+ for (int t = 0; t < 10000; t++)
+ {
+ /* delete one key in each leaf */
+ for (r = 1; r <= fanout; r++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (fanout);
+
+ rt_delete(rt, key);
+ }
+
+ /* add them all back */
+ key_id = 0;
+ for (r = 1; r <= fanout; r++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (fanout);
+
+ key_id++;
+ rt_set(rt, key, key_id);
+ }
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_sparseload_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int32GetDatum(fanout);
+ values[1] = Int64GetDatum(rt_num_entries(rt));
+ values[2] = Int64GetDatum(rt_memory_usage(rt));
+ values[3] = Int64GetDatum(rt_sparseload_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
diff --git a/contrib/bench_radix_tree/bench_radix_tree.control b/contrib/bench_radix_tree/bench_radix_tree.control
new file mode 100644
index 0000000000..1d988e6c9a
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.control
@@ -0,0 +1,6 @@
+# bench_radix_tree extension
+comment = 'benchmark suite for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/bench_radix_tree'
+relocatable = true
+trusted = true
diff --git a/contrib/bench_radix_tree/expected/bench.out b/contrib/bench_radix_tree/expected/bench.out
new file mode 100644
index 0000000000..60c303892e
--- /dev/null
+++ b/contrib/bench_radix_tree/expected/bench.out
@@ -0,0 +1,13 @@
+create extension bench_radix_tree;
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
diff --git a/contrib/bench_radix_tree/sql/bench.sql b/contrib/bench_radix_tree/sql/bench.sql
new file mode 100644
index 0000000000..a46018c9d4
--- /dev/null
+++ b/contrib/bench_radix_tree/sql/bench.sql
@@ -0,0 +1,16 @@
+create extension bench_radix_tree;
+
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
--
2.39.0
v17-0003-Add-radix-implementation.patch
From dfe269fb71621a6ec580ef3d8ae601bdbc0c4b91 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 14 Sep 2022 12:38:51 +0000
Subject: [PATCH v17 3/9] Add radix implementation.
---
src/backend/lib/Makefile | 1 +
src/backend/lib/meson.build | 1 +
src/backend/lib/radixtree.c | 2514 +++++++++++++++++
src/include/lib/radixtree.h | 42 +
src/test/modules/Makefile | 1 +
src/test/modules/meson.build | 1 +
src/test/modules/test_radixtree/.gitignore | 4 +
src/test/modules/test_radixtree/Makefile | 23 +
src/test/modules/test_radixtree/README | 7 +
.../expected/test_radixtree.out | 36 +
src/test/modules/test_radixtree/meson.build | 34 +
.../test_radixtree/sql/test_radixtree.sql | 7 +
.../test_radixtree/test_radixtree--1.0.sql | 8 +
.../modules/test_radixtree/test_radixtree.c | 581 ++++
.../test_radixtree/test_radixtree.control | 4 +
15 files changed, 3264 insertions(+)
create mode 100644 src/backend/lib/radixtree.c
create mode 100644 src/include/lib/radixtree.h
create mode 100644 src/test/modules/test_radixtree/.gitignore
create mode 100644 src/test/modules/test_radixtree/Makefile
create mode 100644 src/test/modules/test_radixtree/README
create mode 100644 src/test/modules/test_radixtree/expected/test_radixtree.out
create mode 100644 src/test/modules/test_radixtree/meson.build
create mode 100644 src/test/modules/test_radixtree/sql/test_radixtree.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree--1.0.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree.c
create mode 100644 src/test/modules/test_radixtree/test_radixtree.control
diff --git a/src/backend/lib/Makefile b/src/backend/lib/Makefile
index 9dad31398a..4c1db794b6 100644
--- a/src/backend/lib/Makefile
+++ b/src/backend/lib/Makefile
@@ -22,6 +22,7 @@ OBJS = \
integerset.o \
knapsack.o \
pairingheap.o \
+ radixtree.o \
rbtree.o \
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/lib/meson.build b/src/backend/lib/meson.build
index 974cab8776..5f8df32c5c 100644
--- a/src/backend/lib/meson.build
+++ b/src/backend/lib/meson.build
@@ -11,4 +11,5 @@ backend_sources += files(
'knapsack.c',
'pairingheap.c',
'rbtree.c',
+ 'radixtree.c',
)
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
new file mode 100644
index 0000000000..5203127f76
--- /dev/null
+++ b/src/backend/lib/radixtree.c
@@ -0,0 +1,2514 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.c
+ * Implementation for adaptive radix tree.
+ *
+ * This module employs the idea from the paper "The Adaptive Radix Tree: ARTful
+ * Indexing for Main-Memory Databases" by Viktor Leis, Alfons Kemper, and Thomas
+ * Neumann, 2013. The radix tree uses adaptive node sizes: a small number of
+ * node types, each with a different number of elements. Depending on the
+ * number of children, the appropriate node type is used.
+ *
+ * There are some differences from the proposed implementation. For instance,
+ * there is no support for path compression or lazy path expansion. The radix
+ * tree supports only fixed-length keys, so we don't expect the tree to become
+ * very high.
+ *
+ * Both the key and the value are 64-bit unsigned integers. The inner nodes and
+ * the leaf nodes have slightly different structures: inner nodes, with
+ * shift > 0, store pointers to their child nodes as values, whereas leaf nodes,
+ * with shift == 0, store the 64-bit unsigned integer specified by the user as
+ * the value. The paper refers to this technique as "Multi-value leaves". We
+ * choose it to avoid an additional pointer traversal, and it is the reason this
+ * code currently does not support variable-length keys.
+ *
+ * XXX: Most functions in this file have two variants for inner nodes and leaf
+ * nodes, so there is a fair amount of duplicated code. While this sometimes
+ * makes code maintenance tricky, it reduces branch prediction misses when
+ * judging whether a node is an inner node or a leaf node.
+ *
+ * XXX: radix tree nodes are never shrunk.
+ *
+ * Interface
+ * ---------
+ *
+ * rt_create - Create a new, empty radix tree
+ * rt_free - Free the radix tree
+ * rt_search - Search a key-value pair
+ * rt_set - Set a key-value pair
+ * rt_delete - Delete a key-value pair
+ * rt_begin_iterate - Begin iterating through all key-value pairs
+ * rt_iterate_next - Return next key-value pair, if any
+ * rt_end_iter - End iteration
+ * rt_memory_usage - Get the memory usage
+ * rt_num_entries - Get the number of key-value pairs
+ *
+ * rt_create() creates an empty radix tree in the given memory context, along
+ * with child memory contexts under it for each kind of radix tree node.
+ *
+ * rt_iterate_next() returns key-value pairs in ascending key order.
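+ *
+ * A minimal usage sketch (illustrative only; the key and value are arbitrary):
+ *
+ *   radix_tree *rt = rt_create(CurrentMemoryContext);
+ *   uint64 val;
+ *
+ *   rt_set(rt, 42, 4242);
+ *   if (rt_search(rt, 42, &val))
+ *     Assert(val == 4242);
+ *   rt_free(rt);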
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/lib/radixtree.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "lib/radixtree.h"
+#include "lib/stringinfo.h"
+#include "miscadmin.h"
+#include "nodes/bitmapset.h"
+#include "port/pg_bitutils.h"
+#include "port/pg_lfind.h"
+#include "utils/memutils.h"
+
+#ifdef RT_DEBUG
+#define UINT64_FORMAT_HEX "%" INT64_MODIFIER "X"
+#endif
+
+/* The number of bits encoded in one tree level */
+#define RT_NODE_SPAN BITS_PER_BYTE
+
+/* The number of maximum slots in the node */
+#define RT_NODE_MAX_SLOTS (1 << RT_NODE_SPAN)
+
+/* Mask for extracting a chunk from the key */
+#define RT_CHUNK_MASK ((1 << RT_NODE_SPAN) - 1)
+
+/* Maximum shift the radix tree uses */
+#define RT_MAX_SHIFT key_get_shift(UINT64_MAX)
+
+/* Tree level the radix tree uses */
+#define RT_MAX_LEVEL ((sizeof(uint64) * BITS_PER_BYTE) / RT_NODE_SPAN)
+
+/* Invalid index used in node-125 */
+#define RT_NODE_125_INVALID_IDX 0xFF
+
+/* Get a chunk from the key */
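+/* e.g., with RT_NODE_SPAN = 8, RT_GET_KEY_CHUNK(0x10203, 8) == 0x02 */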
+#define RT_GET_KEY_CHUNK(key, shift) ((uint8) (((key) >> (shift)) & RT_CHUNK_MASK))
+
+/* For accessing bitmaps */
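+/* e.g., with 64-bit bitmapwords, bit 130 maps to BM_IDX(130) == 2 and BM_BIT(130) == 2 */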
+#define BM_IDX(x) ((x) / BITS_PER_BITMAPWORD)
+#define BM_BIT(x) ((x) % BITS_PER_BITMAPWORD)
+
+/* Enum used by rt_node_search() */
+typedef enum
+{
+ RT_ACTION_FIND = 0, /* find the key-value */
+ RT_ACTION_DELETE, /* delete the key-value */
+} rt_action;
+
+/*
+ * Supported radix tree node kinds and size classes.
+ *
+ * There are 4 node kinds, and each node kind has one or two size classes,
+ * partial and full. The size classes within the same node kind have the same
+ * node structure but a different fanout, which is stored in 'fanout' of
+ * rt_node. For example in size class 15, when a 16th element
+ * is to be inserted, we allocate a larger area and memcpy the entire old
+ * node to it.
+ *
+ * This technique allows us to limit the node kinds to 4, which limits the
+ * number of cases in switch statements. It also allows a possible future
+ * optimization to encode the node kind in a pointer tag.
+ *
+ * These size classes have been chosen carefully so that they minimize the
+ * allocator padding in both the inner and leaf nodes on DSA.
+ */
+#define RT_NODE_KIND_4 0x00
+#define RT_NODE_KIND_32 0x01
+#define RT_NODE_KIND_125 0x02
+#define RT_NODE_KIND_256 0x03
+#define RT_NODE_KIND_COUNT 4
+
+typedef enum rt_size_class
+{
+ RT_CLASS_4_FULL = 0,
+ RT_CLASS_32_PARTIAL,
+ RT_CLASS_32_FULL,
+ RT_CLASS_125_FULL,
+ RT_CLASS_256
+
+#define RT_SIZE_CLASS_COUNT (RT_CLASS_256 + 1)
+} rt_size_class;
+
+/* Common type for all node types */
+typedef struct rt_node
+{
+ /*
+ * Number of children. We use uint16 to be able to indicate 256 children
+ * at the fanout of 8.
+ */
+ uint16 count;
+
+ /* Max number of children. We can use uint8 because we never need to store 256 */
+ /* WIP: if we don't have a variable sized node4, this should instead be in the base
+ types as needed, since saving every byte is crucial for the smallest node kind */
+ uint8 fanout;
+
+ /*
+ * Shift indicates which part of the key space is represented by this
+ * node. That is, the key is shifted by 'shift' and the lowest
+ * RT_NODE_SPAN bits are then represented in chunk.
+ */
+ uint8 shift;
+ uint8 chunk;
+
+ /* Node kind, one per search/set algorithm */
+ uint8 kind;
+} rt_node;
+#define NODE_IS_LEAF(n) (((rt_node *) (n))->shift == 0)
+#define NODE_IS_EMPTY(n) (((rt_node *) (n))->count == 0)
+#define VAR_NODE_HAS_FREE_SLOT(node) \
+ ((node)->base.n.count < (node)->base.n.fanout)
+#define FIXED_NODE_HAS_FREE_SLOT(node, class) \
+ ((node)->base.n.count < rt_size_class_info[class].fanout)
+
+/* Base types of each node kind for leaf and inner nodes */
+/* The base types must be able to accommodate the largest size
+class for variable-sized node kinds */
+typedef struct rt_node_base_4
+{
+ rt_node n;
+
+ /* 4 children, for key chunks */
+ uint8 chunks[4];
+} rt_node_base_4;
+
+typedef struct rt_node_base32
+{
+ rt_node n;
+
+ /* 32 children, for key chunks */
+ uint8 chunks[32];
+} rt_node_base_32;
+
+/*
+ * node-125 uses the slot_idxs array, an array of length RT_NODE_MAX_SLOTS
+ * (typically 256), to store indexes into a second array that contains up to
+ * 125 values (or child pointers in inner nodes).
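+ *
+ * For example (hypothetical values): if slot_idxs[0x41] holds 7, the value or
+ * child pointer for chunk 0x41 is stored at index 7 of the values/children
+ * array, and bit 7 of isset marks that slot as used.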
+ */
+typedef struct rt_node_base125
+{
+ rt_node n;
+
+ /* The index of slots for each fanout */
+ uint8 slot_idxs[RT_NODE_MAX_SLOTS];
+
+ /* isset is a bitmap to track which slot is in use */
+ bitmapword isset[BM_IDX(128)];
+} rt_node_base_125;
+
+typedef struct rt_node_base256
+{
+ rt_node n;
+} rt_node_base_256;
+
+/*
+ * Inner and leaf nodes.
+ *
+ * These are separate for two main reasons:
+ *
+ * 1) the value type might be different than something fitting into a pointer
+ * width type
+ * 2) Need to represent non-existing values in a key-type independent way.
+ *
+ * 1) is clearly worth being concerned about, but it's not clear 2) is as
+ * good. It might be better to just indicate non-existing entries the same way
+ * in inner nodes.
+ */
+typedef struct rt_node_inner_4
+{
+ rt_node_base_4 base;
+
+ /* number of children depends on size class */
+ rt_node *children[FLEXIBLE_ARRAY_MEMBER];
+} rt_node_inner_4;
+
+typedef struct rt_node_leaf_4
+{
+ rt_node_base_4 base;
+
+ /* number of values depends on size class */
+ uint64 values[FLEXIBLE_ARRAY_MEMBER];
+} rt_node_leaf_4;
+
+typedef struct rt_node_inner_32
+{
+ rt_node_base_32 base;
+
+ /* number of children depends on size class */
+ rt_node *children[FLEXIBLE_ARRAY_MEMBER];
+} rt_node_inner_32;
+
+typedef struct rt_node_leaf_32
+{
+ rt_node_base_32 base;
+
+ /* number of values depends on size class */
+ uint64 values[FLEXIBLE_ARRAY_MEMBER];
+} rt_node_leaf_32;
+
+typedef struct rt_node_inner_125
+{
+ rt_node_base_125 base;
+
+ /* number of children depends on size class */
+ rt_node *children[FLEXIBLE_ARRAY_MEMBER];
+} rt_node_inner_125;
+
+typedef struct rt_node_leaf_125
+{
+ rt_node_base_125 base;
+
+ /* number of values depends on size class */
+ uint64 values[FLEXIBLE_ARRAY_MEMBER];
+} rt_node_leaf_125;
+
+/*
+ * node-256 is the largest node type. This node has an array of RT_NODE_MAX_SLOTS entries
+ * for directly storing values (or child pointers in inner nodes).
+ */
+typedef struct rt_node_inner_256
+{
+ rt_node_base_256 base;
+
+ /* Slots for 256 children */
+ rt_node *children[RT_NODE_MAX_SLOTS];
+} rt_node_inner_256;
+
+typedef struct rt_node_leaf_256
+{
+ rt_node_base_256 base;
+
+ /* isset is a bitmap to track which slot is in use */
+ bitmapword isset[BM_IDX(RT_NODE_MAX_SLOTS)];
+
+ /* Slots for 256 values */
+ uint64 values[RT_NODE_MAX_SLOTS];
+} rt_node_leaf_256;
+
+/* Information for each size class */
+typedef struct rt_size_class_elem
+{
+ const char *name;
+ int fanout;
+
+ /* slab chunk size */
+ Size inner_size;
+ Size leaf_size;
+
+ /* slab block size */
+ Size inner_blocksize;
+ Size leaf_blocksize;
+} rt_size_class_elem;
+
+/*
+ * Calculate the slab blocksize so that we can allocate at least 32 chunks
+ * from the block.
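+ * For instance, assuming SLAB_DEFAULT_BLOCK_SIZE is 8kB and a chunk size of 48
+ * bytes, this yields Max((8192 / 48) * 48, 48 * 32) = Max(8160, 1536) = 8160.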
+ */
+#define NODE_SLAB_BLOCK_SIZE(size) \
+ Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * (size), (size) * 32)
+static rt_size_class_elem rt_size_class_info[RT_SIZE_CLASS_COUNT] = {
+ [RT_CLASS_4_FULL] = {
+ .name = "radix tree node 4",
+ .fanout = 4,
+ .inner_size = sizeof(rt_node_inner_4) + 4 * sizeof(rt_node *),
+ .leaf_size = sizeof(rt_node_leaf_4) + 4 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_4) + 4 * sizeof(rt_node *)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_4) + 4 * sizeof(uint64)),
+ },
+ [RT_CLASS_32_PARTIAL] = {
+ .name = "radix tree node 15",
+ .fanout = 15,
+ .inner_size = sizeof(rt_node_inner_32) + 15 * sizeof(rt_node *),
+ .leaf_size = sizeof(rt_node_leaf_32) + 15 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_32) + 15 * sizeof(rt_node *)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_32) + 15 * sizeof(uint64)),
+ },
+ [RT_CLASS_32_FULL] = {
+ .name = "radix tree node 32",
+ .fanout = 32,
+ .inner_size = sizeof(rt_node_inner_32) + 32 * sizeof(rt_node *),
+ .leaf_size = sizeof(rt_node_leaf_32) + 32 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_32) + 32 * sizeof(rt_node *)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_32) + 32 * sizeof(uint64)),
+ },
+ [RT_CLASS_125_FULL] = {
+ .name = "radix tree node 125",
+ .fanout = 125,
+ .inner_size = sizeof(rt_node_inner_125) + 125 * sizeof(rt_node *),
+ .leaf_size = sizeof(rt_node_leaf_125) + 125 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_125) + 125 * sizeof(rt_node *)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_125) + 125 * sizeof(uint64)),
+ },
+ [RT_CLASS_256] = {
+ .name = "radix tree node 256",
+ .fanout = 256,
+ .inner_size = sizeof(rt_node_inner_256),
+ .leaf_size = sizeof(rt_node_leaf_256),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_256)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_256)),
+ },
+};
+
+/* Map from the node kind to its minimum size class */
+static rt_size_class kind_min_size_class[RT_NODE_KIND_COUNT] = {
+ [RT_NODE_KIND_4] = RT_CLASS_4_FULL,
+ [RT_NODE_KIND_32] = RT_CLASS_32_PARTIAL,
+ [RT_NODE_KIND_125] = RT_CLASS_125_FULL,
+ [RT_NODE_KIND_256] = RT_CLASS_256,
+};
+
+/*
+ * Iteration support.
+ *
+ * Iterating the radix tree returns each pair of key and value in the ascending
+ * order of the key. To support this, we iterate over nodes at each level.
+ *
+ * rt_node_iter struct is used to track the iteration within a node.
+ *
+ * rt_iter is the struct for iteration of the radix tree, and uses rt_node_iter
+ * in order to track the iteration of each level. During the iteration, we also
+ * construct the key whenever updating the node iteration information, e.g., when
+ * advancing the current index within the node or when moving to the next node
+ * at the same level.
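+ *
+ * For illustration, with RT_NODE_SPAN = 8 and a tree whose root has shift 16,
+ * stack[2] tracks the root, stack[1] the inner node at shift 8, and stack[0]
+ * the current leaf; the key is rebuilt one chunk at a time as each level's
+ * current_idx advances.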
+ */
+typedef struct rt_node_iter
+{
+ rt_node *node; /* current node being iterated */
+ int current_idx; /* current position. -1 for initial value */
+} rt_node_iter;
+
+struct rt_iter
+{
+ radix_tree *tree;
+
+ /* Track the iteration on nodes of each level */
+ rt_node_iter stack[RT_MAX_LEVEL];
+ int stack_len;
+
+ /* The key is being constructed during the iteration */
+ uint64 key;
+};
+
+/* A radix tree with nodes */
+struct radix_tree
+{
+ MemoryContext context;
+
+ rt_node *root;
+ uint64 max_val;
+ uint64 num_keys;
+
+ MemoryContextData *inner_slabs[RT_SIZE_CLASS_COUNT];
+ MemoryContextData *leaf_slabs[RT_SIZE_CLASS_COUNT];
+
+ /* statistics */
+#ifdef RT_DEBUG
+ int32 cnt[RT_SIZE_CLASS_COUNT];
+#endif
+};
+
+static void rt_new_root(radix_tree *tree, uint64 key);
+static rt_node *rt_alloc_node(radix_tree *tree, rt_size_class size_class, bool inner);
+static inline void rt_init_node(rt_node *node, uint8 kind, rt_size_class size_class,
+ bool inner);
+static void rt_free_node(radix_tree *tree, rt_node *node);
+static void rt_extend(radix_tree *tree, uint64 key);
+static inline bool rt_node_search_inner(rt_node *node, uint64 key, rt_action action,
+ rt_node **child_p);
+static inline bool rt_node_search_leaf(rt_node *node, uint64 key, rt_action action,
+ uint64 *value_p);
+static bool rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node,
+ uint64 key, rt_node *child);
+static bool rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
+ uint64 key, uint64 value);
+static inline rt_node *rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter);
+static inline bool rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
+ uint64 *value_p);
+static void rt_update_iter_stack(rt_iter *iter, rt_node *from_node, int from);
+static inline void rt_iter_update_key(rt_iter *iter, uint8 chunk, uint8 shift);
+
+/* verification (available only with assertion) */
+static void rt_verify_node(rt_node *node);
+
+/*
+ * Return the index of the first chunk in the node that equals 'chunk'. Return -1
+ * if there is no such element.
+ */
+static inline int
+node_4_search_eq(rt_node_base_4 *node, uint8 chunk)
+{
+ int idx = -1;
+
+ for (int i = 0; i < node->n.count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ idx = i;
+ break;
+ }
+ }
+
+ return idx;
+}
+
+/*
+ * Return index of the chunk to insert into chunks in the given node.
+ */
+static inline int
+node_4_get_insertpos(rt_node_base_4 *node, uint8 chunk)
+{
+ int idx;
+
+ for (idx = 0; idx < node->n.count; idx++)
+ {
+ if (node->chunks[idx] >= chunk)
+ break;
+ }
+
+ return idx;
+}
+
+/*
+ * Return the index of the first chunk in the node that equals 'chunk'. Return -1
+ * if there is no such element.
+ */
+static inline int
+node_32_search_eq(rt_node_base_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ uint32 bitfield;
+ int index_simd = -1;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index = -1;
+
+ for (int i = 0; i < count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ index = i;
+ break;
+ }
+ }
+#endif
+
+#ifndef USE_NO_SIMD
+ spread_chunk = vector8_broadcast(chunk);
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ cmp1 = vector8_eq(spread_chunk, haystack1);
+ cmp2 = vector8_eq(spread_chunk, haystack2);
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+/*
+ * Return index of the chunk to insert into chunks in the given node.
+ */
+static inline int
+node_32_get_insertpos(rt_node_base_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ Vector8 min1;
+ Vector8 min2;
+ uint32 bitfield;
+ int index_simd;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index;
+
+ for (index = 0; index < count; index++)
+ {
+ if (node->chunks[index] >= chunk)
+ break;
+ }
+#endif
+
+#ifndef USE_NO_SIMD
+ spread_chunk = vector8_broadcast(chunk);
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ min1 = vector8_min(spread_chunk, haystack1);
+ min2 = vector8_min(spread_chunk, haystack2);
+ cmp1 = vector8_eq(spread_chunk, min1);
+ cmp2 = vector8_eq(spread_chunk, min2);
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+ else
+ index_simd = count;
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+/*
+ * Functions to manipulate both chunks array and children/values array.
+ * These are used for node-4 and node-32.
+ */
+
+/* Shift the elements right at 'idx' by one */
+static inline void
+chunk_children_array_shift(uint8 *chunks, rt_node **children, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+ memmove(&(children[idx + 1]), &(children[idx]), sizeof(rt_node *) * (count - idx));
+}
+
+static inline void
+chunk_values_array_shift(uint8 *chunks, uint64 *values, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+ memmove(&(values[idx + 1]), &(values[idx]), sizeof(uint64) * (count - idx));
+}
+
+/* Delete the element at 'idx' */
+static inline void
+chunk_children_array_delete(uint8 *chunks, rt_node **children, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(children[idx]), &(children[idx + 1]), sizeof(rt_node *) * (count - idx - 1));
+}
+
+static inline void
+chunk_values_array_delete(uint8 *chunks, uint64 *values, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(values[idx]), &(values[idx + 1]), sizeof(uint64) * (count - idx - 1));
+}
+
+/* Copy both chunks and children/values arrays */
+static inline void
+chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
+ uint8 *dst_chunks, rt_node **dst_children)
+{
+ const int fanout = rt_size_class_info[RT_CLASS_4_FULL].fanout;
+ const Size chunk_size = sizeof(uint8) * fanout;
+ const Size children_size = sizeof(rt_node *) * fanout;
+
+ memcpy(dst_chunks, src_chunks, chunk_size);
+ memcpy(dst_children, src_children, children_size);
+}
+
+static inline void
+chunk_values_array_copy(uint8 *src_chunks, uint64 *src_values,
+ uint8 *dst_chunks, uint64 *dst_values)
+{
+ const int fanout = rt_size_class_info[RT_CLASS_4_FULL].fanout;
+ const Size chunk_size = sizeof(uint8) * fanout;
+ const Size values_size = sizeof(uint64) * fanout;
+
+ memcpy(dst_chunks, src_chunks, chunk_size);
+ memcpy(dst_values, src_values, values_size);
+}
+
+/* Functions to manipulate inner and leaf node-125 */
+
+/* Does the given chunk in the node have a value? */
+static inline bool
+node_125_is_chunk_used(rt_node_base_125 *node, uint8 chunk)
+{
+ return node->slot_idxs[chunk] != RT_NODE_125_INVALID_IDX;
+}
+
+static inline rt_node *
+node_inner_125_get_child(rt_node_inner_125 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ return node->children[node->base.slot_idxs[chunk]];
+}
+
+static inline uint64
+node_leaf_125_get_value(rt_node_leaf_125 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ Assert(((rt_node_base_125 *) node)->slot_idxs[chunk] != RT_NODE_125_INVALID_IDX);
+ return node->values[node->base.slot_idxs[chunk]];
+}
+
+static void
+node_inner_125_delete(rt_node_inner_125 *node, uint8 chunk)
+{
+ int slotpos = node->base.slot_idxs[chunk];
+ int idx = BM_IDX(slotpos);
+ int bitnum = BM_BIT(slotpos);
+
+ Assert(!NODE_IS_LEAF(node));
+
+ node->base.isset[idx] &= ~((bitmapword) 1 << bitnum);
+ node->children[node->base.slot_idxs[chunk]] = NULL;
+ node->base.slot_idxs[chunk] = RT_NODE_125_INVALID_IDX;
+}
+
+static void
+node_leaf_125_delete(rt_node_leaf_125 *node, uint8 chunk)
+{
+ int slotpos = node->base.slot_idxs[chunk];
+ int idx = BM_IDX(slotpos);
+ int bitnum = BM_BIT(slotpos);
+
+ Assert(NODE_IS_LEAF(node));
+ node->base.isset[idx] &= ~((bitmapword) 1 << bitnum);
+ node->base.slot_idxs[chunk] = RT_NODE_125_INVALID_IDX;
+}
+
+/* Return an unused slot in node-125 */
+static int
+node_125_find_unused_slot(bitmapword *isset)
+{
+ int slotpos;
+ int idx;
+ bitmapword inverse;
+
+ /* get the first word with at least one bit not set */
+ for (idx = 0; idx < BM_IDX(128); idx++)
+ {
+ if (isset[idx] < ~((bitmapword) 0))
+ break;
+ }
+
+ /* To get the first unset bit in X, get the first set bit in ~X */
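+ /* e.g., if isset[idx] is 0b0111 (slots 0-2 used), ~X has its first set bit at position 3 */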
+ inverse = ~(isset[idx]);
+ slotpos = idx * BITS_PER_BITMAPWORD;
+ slotpos += bmw_rightmost_one_pos(inverse);
+
+ /* mark the slot used */
+ isset[idx] |= bmw_rightmost_one(inverse);
+
+ return slotpos;
+ }
+
+static inline void
+node_inner_125_insert(rt_node_inner_125 *node, uint8 chunk, rt_node *child)
+{
+ int slotpos;
+
+ Assert(!NODE_IS_LEAF(node));
+
+ slotpos = node_125_find_unused_slot(node->base.isset);
+ Assert(slotpos < node->base.n.fanout);
+
+ node->base.slot_idxs[chunk] = slotpos;
+ node->children[slotpos] = child;
+}
+
+/* Set the slot at the corresponding chunk */
+static inline void
+node_leaf_125_insert(rt_node_leaf_125 *node, uint8 chunk, uint64 value)
+{
+ int slotpos;
+
+ Assert(NODE_IS_LEAF(node));
+
+ slotpos = node_125_find_unused_slot(node->base.isset);
+ Assert(slotpos < node->base.n.fanout);
+
+ node->base.slot_idxs[chunk] = slotpos;
+ node->values[slotpos] = value;
+}
+
+/* Update the child corresponding to 'chunk' to 'child' */
+static inline void
+node_inner_125_update(rt_node_inner_125 *node, uint8 chunk, rt_node *child)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->children[node->base.slot_idxs[chunk]] = child;
+}
+
+static inline void
+node_leaf_125_update(rt_node_leaf_125 *node, uint8 chunk, uint64 value)
+{
+ Assert(NODE_IS_LEAF(node));
+ node->values[node->base.slot_idxs[chunk]] = value;
+}
+
+/* Functions to manipulate inner and leaf node-256 */
+
+/* Return true if the slot corresponding to the given chunk is in use */
+static inline bool
+node_inner_256_is_chunk_used(rt_node_inner_256 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ return (node->children[chunk] != NULL);
+}
+
+static inline bool
+node_leaf_256_is_chunk_used(rt_node_leaf_256 *node, uint8 chunk)
+{
+ int idx = BM_IDX(chunk);
+ int bitnum = BM_BIT(chunk);
+
+ Assert(NODE_IS_LEAF(node));
+ return (node->isset[idx] & ((bitmapword) 1 << bitnum)) != 0;
+}
+
+static inline rt_node *
+node_inner_256_get_child(rt_node_inner_256 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ Assert(node_inner_256_is_chunk_used(node, chunk));
+ return node->children[chunk];
+}
+
+static inline uint64
+node_leaf_256_get_value(rt_node_leaf_256 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ Assert(node_leaf_256_is_chunk_used(node, chunk));
+ return node->values[chunk];
+}
+
+/* Set the child in the node-256 */
+static inline void
+node_inner_256_set(rt_node_inner_256 *node, uint8 chunk, rt_node *child)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->children[chunk] = child;
+}
+
+/* Set the value in the node-256 */
+static inline void
+node_leaf_256_set(rt_node_leaf_256 *node, uint8 chunk, uint64 value)
+{
+ int idx = BM_IDX(chunk);
+ int bitnum = BM_BIT(chunk);
+
+ Assert(NODE_IS_LEAF(node));
+ node->isset[idx] |= ((bitmapword) 1 << bitnum);
+ node->values[chunk] = value;
+}
+
+/* Delete the slot at the given chunk position */
+static inline void
+node_inner_256_delete(rt_node_inner_256 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->children[chunk] = NULL;
+}
+
+static inline void
+node_leaf_256_delete(rt_node_leaf_256 *node, uint8 chunk)
+{
+ int idx = BM_IDX(chunk);
+ int bitnum = BM_BIT(chunk);
+
+ Assert(NODE_IS_LEAF(node));
+ node->isset[idx] &= ~((bitmapword) 1 << bitnum);
+}
+
+/*
+ * Return the shift needed to store the given key.
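+ * For example, a key of 0x10000 has its leftmost set bit at position 16, so
+ * with RT_NODE_SPAN = 8 this returns (16 / 8) * 8 = 16; keys below 256 get 0.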
+ */
+static inline int
+key_get_shift(uint64 key)
+{
+ return (key == 0)
+ ? 0
+ : (pg_leftmost_one_pos64(key) / RT_NODE_SPAN) * RT_NODE_SPAN;
+}
+
+/*
+ * Return the max value stored in a node with the given shift.
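+ * For example, a root node with shift 8 can store keys up to (1 << 16) - 1 = 0xFFFF.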
+ */
+static uint64
+shift_get_max_val(int shift)
+{
+ if (shift == RT_MAX_SHIFT)
+ return UINT64_MAX;
+
+ return (UINT64CONST(1) << (shift + RT_NODE_SPAN)) - 1;
+}
+
+/*
+ * Create a new node as the root. Subordinate nodes will be created during
+ * the insertion.
+ */
+static void
+rt_new_root(radix_tree *tree, uint64 key)
+{
+ int shift = key_get_shift(key);
+ bool inner = shift > 0;
+ rt_node *newnode;
+
+ newnode = rt_alloc_node(tree, RT_CLASS_4_FULL, inner);
+ rt_init_node(newnode, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
+ newnode->shift = shift;
+ tree->max_val = shift_get_max_val(shift);
+ tree->root = newnode;
+}
+
+/*
+ * Allocate a new node with the given size class.
+ */
+static rt_node *
+rt_alloc_node(radix_tree *tree, rt_size_class size_class, bool inner)
+{
+ rt_node *newnode;
+
+ if (inner)
+ newnode = (rt_node *) MemoryContextAlloc(tree->inner_slabs[size_class],
+ rt_size_class_info[size_class].inner_size);
+ else
+ newnode = (rt_node *) MemoryContextAlloc(tree->leaf_slabs[size_class],
+ rt_size_class_info[size_class].leaf_size);
+
+#ifdef RT_DEBUG
+ /* update the statistics */
+ tree->cnt[size_class]++;
+#endif
+
+ return newnode;
+}
+
+/* Initialize the node contents */
+static inline void
+rt_init_node(rt_node *node, uint8 kind, rt_size_class size_class, bool inner)
+{
+ if (inner)
+ MemSet(node, 0, rt_size_class_info[size_class].inner_size);
+ else
+ MemSet(node, 0, rt_size_class_info[size_class].leaf_size);
+
+ node->kind = kind;
+ node->fanout = rt_size_class_info[size_class].fanout;
+
+ /* Initialize slot_idxs to invalid values */
+ if (kind == RT_NODE_KIND_125)
+ {
+ rt_node_base_125 *n125 = (rt_node_base_125 *) node;
+
+ memset(n125->slot_idxs, RT_NODE_125_INVALID_IDX, sizeof(n125->slot_idxs));
+ }
+
+ /*
+ * Technically it's 256, but we cannot store that in a uint8,
+ * and this is the max size class so it will never grow.
+ */
+ if (kind == RT_NODE_KIND_256)
+ node->fanout = 0;
+}
+
+static inline void
+rt_copy_node(rt_node *newnode, rt_node *oldnode)
+{
+ newnode->shift = oldnode->shift;
+ newnode->chunk = oldnode->chunk;
+ newnode->count = oldnode->count;
+}
+
+/*
+ * Create a new node with 'new_kind' and the same shift, chunk, and
+ * count as 'node'.
+ */
+static rt_node*
+rt_grow_node_kind(radix_tree *tree, rt_node *node, uint8 new_kind)
+{
+ rt_node *newnode;
+ bool inner = !NODE_IS_LEAF(node);
+
+ newnode = rt_alloc_node(tree, kind_min_size_class[new_kind], inner);
+ rt_init_node(newnode, new_kind, kind_min_size_class[new_kind], inner);
+ rt_copy_node(newnode, node);
+
+ return newnode;
+}
+
+/* Free the given node */
+static void
+rt_free_node(radix_tree *tree, rt_node *node)
+{
+ /* If we're deleting the root node, make the tree empty */
+ if (tree->root == node)
+ {
+ tree->root = NULL;
+ tree->max_val = 0;
+ }
+
+#ifdef RT_DEBUG
+ {
+ int i;
+
+ /* update the statistics */
+ for (i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ if (node->fanout == rt_size_class_info[i].fanout)
+ break;
+ }
+
+ /* fanout of node256 is intentionally 0 */
+ if (i == RT_SIZE_CLASS_COUNT)
+ i = RT_CLASS_256;
+
+ tree->cnt[i]--;
+ Assert(tree->cnt[i] >= 0);
+ }
+#endif
+
+ pfree(node);
+}
+
+/*
+ * Replace old_child with new_child, and free the old one.
+ */
+static void
+rt_replace_node(radix_tree *tree, rt_node *parent, rt_node *old_child,
+ rt_node *new_child, uint64 key)
+{
+ Assert(old_child->chunk == new_child->chunk);
+ Assert(old_child->shift == new_child->shift);
+
+ if (parent == old_child)
+ {
+ /* Replace the root node with the new large node */
+ tree->root = new_child;
+ }
+ else
+ {
+ bool replaced PG_USED_FOR_ASSERTS_ONLY;
+
+ replaced = rt_node_insert_inner(tree, NULL, parent, key, new_child);
+ Assert(replaced);
+ }
+
+ rt_free_node(tree, old_child);
+}
+
+/*
+ * The radix tree doesn't have sufficient height. Extend the radix tree so it can
+ * store the key.
+ */
+static void
+rt_extend(radix_tree *tree, uint64 key)
+{
+ int target_shift;
+ int shift = tree->root->shift + RT_NODE_SPAN;
+
+ target_shift = key_get_shift(key);
+
+ /* Grow tree from 'shift' to 'target_shift' */
+ while (shift <= target_shift)
+ {
+ rt_node_inner_4 *node;
+
+ node = (rt_node_inner_4 *) rt_alloc_node(tree, RT_CLASS_4_FULL, true);
+ rt_init_node((rt_node *) node, RT_NODE_KIND_4, RT_CLASS_4_FULL, true);
+ node->base.n.shift = shift;
+ node->base.n.count = 1;
+ node->base.chunks[0] = 0;
+ node->children[0] = tree->root;
+
+ tree->root->chunk = 0;
+ tree->root = (rt_node *) node;
+
+ shift += RT_NODE_SPAN;
+ }
+
+ tree->max_val = shift_get_max_val(target_shift);
+}
+
+/*
+ * The radix tree doesn't have the inner and leaf nodes for the given key-value
+ * pair. Insert inner and leaf nodes from 'node' down to the bottom.
+ */
+static inline void
+rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node *parent,
+ rt_node *node)
+{
+ int shift = node->shift;
+
+ while (shift >= RT_NODE_SPAN)
+ {
+ rt_node *newchild;
+ int newshift = shift - RT_NODE_SPAN;
+ bool inner = newshift > 0;
+
+ newchild = rt_alloc_node(tree, RT_CLASS_4_FULL, inner);
+ rt_init_node(newchild, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
+ newchild->shift = newshift;
+ newchild->chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ rt_node_insert_inner(tree, parent, node, key, newchild);
+
+ parent = node;
+ node = newchild;
+ shift -= RT_NODE_SPAN;
+ }
+
+ rt_node_insert_leaf(tree, parent, node, key, value);
+ tree->num_keys++;
+}
+
+/*
+ * Search for the child pointer corresponding to 'key' in the given node, and
+ * do the specified 'action'.
+ *
+ * Return true if the key is found, otherwise return false. On success, the child
+ * pointer is set to *child_p.
+ */
+static inline bool
+rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **child_p)
+{
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool found = false;
+ rt_node *child = NULL;
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+ int idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
+
+ if (idx < 0)
+ break;
+
+ found = true;
+
+ if (action == RT_ACTION_FIND)
+ child = n4->children[idx];
+ else /* RT_ACTION_DELETE */
+ chunk_children_array_delete(n4->base.chunks, n4->children,
+ n4->base.n.count, idx);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+ int idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
+
+ if (idx < 0)
+ break;
+
+ found = true;
+ if (action == RT_ACTION_FIND)
+ child = n32->children[idx];
+ else /* RT_ACTION_DELETE */
+ chunk_children_array_delete(n32->base.chunks, n32->children,
+ n32->base.n.count, idx);
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ rt_node_inner_125 *n125 = (rt_node_inner_125 *) node;
+
+ if (!node_125_is_chunk_used((rt_node_base_125 *) n125, chunk))
+ break;
+
+ found = true;
+
+ if (action == RT_ACTION_FIND)
+ child = node_inner_125_get_child(n125, chunk);
+ else /* RT_ACTION_DELETE */
+ node_inner_125_delete(n125, chunk);
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+
+ if (!node_inner_256_is_chunk_used(n256, chunk))
+ break;
+
+ found = true;
+ if (action == RT_ACTION_FIND)
+ child = node_inner_256_get_child(n256, chunk);
+ else /* RT_ACTION_DELETE */
+ node_inner_256_delete(n256, chunk);
+
+ break;
+ }
+ }
+
+ /* update statistics */
+ if (action == RT_ACTION_DELETE && found)
+ node->count--;
+
+ if (found && child_p)
+ *child_p = child;
+
+ return found;
+}
+
+/*
+ * Search for the value corresponding to 'key' in the given node, and do the
+ * specified 'action'.
+ *
+ * Return true if the key is found, otherwise return false. On success, the value
+ * is copied to *value_p.
+ */
+static inline bool
+rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p)
+{
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool found = false;
+ uint64 value = 0;
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+ int idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
+
+ if (idx < 0)
+ break;
+
+ found = true;
+
+ if (action == RT_ACTION_FIND)
+ value = n4->values[idx];
+ else /* RT_ACTION_DELETE */
+ chunk_values_array_delete(n4->base.chunks, (uint64 *) n4->values,
+ n4->base.n.count, idx);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+ int idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
+
+ if (idx < 0)
+ break;
+
+ found = true;
+ if (action == RT_ACTION_FIND)
+ value = n32->values[idx];
+ else /* RT_ACTION_DELETE */
+ chunk_values_array_delete(n32->base.chunks, (uint64 *) n32->values,
+ n32->base.n.count, idx);
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) node;
+
+ if (!node_125_is_chunk_used((rt_node_base_125 *) n125, chunk))
+ break;
+
+ found = true;
+
+ if (action == RT_ACTION_FIND)
+ value = node_leaf_125_get_value(n125, chunk);
+ else /* RT_ACTION_DELETE */
+ node_leaf_125_delete(n125, chunk);
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+
+ if (!node_leaf_256_is_chunk_used(n256, chunk))
+ break;
+
+ found = true;
+ if (action == RT_ACTION_FIND)
+ value = node_leaf_256_get_value(n256, chunk);
+ else /* RT_ACTION_DELETE */
+ node_leaf_256_delete(n256, chunk);
+
+ break;
+ }
+ }
+
+ /* update statistics */
+ if (action == RT_ACTION_DELETE && found)
+ node->count--;
+
+ if (found && value_p)
+ *value_p = value;
+
+ return found;
+}
+
+/* Insert the child to the inner node */
+static bool
+rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 key,
+ rt_node *child)
+{
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool chunk_exists = false;
+ rt_node *newnode = NULL;
+
+ Assert(!NODE_IS_LEAF(node));
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+ int idx;
+
+ idx = node_4_search_eq(&n4->base, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n4->children[idx] = child;
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n4)))
+ {
+ rt_node_inner_32 *new32;
+
+ /* grow node from 4 to 32 */
+ newnode = rt_grow_node_kind(tree, node, RT_NODE_KIND_32);
+ new32 = (rt_node_inner_32 *) newnode;
+ chunk_children_array_copy(n4->base.chunks, n4->children,
+ new32->base.chunks, new32->children);
+
+ Assert(parent != NULL);
+ rt_replace_node(tree, parent, node, newnode, key);
+ node = newnode;
+ }
+ else
+ {
+ int insertpos = node_4_get_insertpos(&n4->base, chunk);
+ uint16 count = n4->base.n.count;
+
+ /* shift chunks and children */
+ if (count != 0 && insertpos < count)
+ chunk_children_array_shift(n4->base.chunks, n4->children,
+ count, insertpos);
+
+ n4->base.chunks[insertpos] = chunk;
+ n4->children[insertpos] = child;
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_32:
+ {
+ const rt_size_class_elem minclass = rt_size_class_info[RT_CLASS_32_PARTIAL];
+ const rt_size_class_elem maxclass = rt_size_class_info[RT_CLASS_32_FULL];
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+ int idx;
+
+ idx = node_32_search_eq(&n32->base, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n32->children[idx] = child;
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)) &&
+ n32->base.n.count == minclass.fanout)
+ {
+ /* grow to the next size class of this kind */
+ newnode = rt_alloc_node(tree, RT_CLASS_32_FULL, true);
+ memcpy(newnode, node, minclass.inner_size);
+ newnode->fanout = maxclass.fanout;
+
+ Assert(parent != NULL);
+ rt_replace_node(tree, parent, node, newnode, key);
+ node = newnode;
+
+ /* also update pointer for this kind */
+ n32 = (rt_node_inner_32 *) newnode;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)))
+ {
+ rt_node_inner_125 *new125;
+
+ /* grow node from 32 to 125 */
+ newnode = rt_grow_node_kind(tree, node, RT_NODE_KIND_125);
+ new125 = (rt_node_inner_125 *) newnode;
+ for (int i = 0; i < n32->base.n.count; i++)
+ node_inner_125_insert(new125, n32->base.chunks[i], n32->children[i]);
+
+ Assert(parent != NULL);
+ rt_replace_node(tree, parent, node, newnode, key);
+ node = newnode;
+ }
+ else
+ {
+ int insertpos = node_32_get_insertpos(&n32->base, chunk);
+ int16 count = n32->base.n.count;
+
+ if (insertpos < count)
+ {
+ Assert(count > 0);
+ chunk_children_array_shift(n32->base.chunks, n32->children,
+ count, insertpos);
+ }
+
+ n32->base.chunks[insertpos] = chunk;
+ n32->children[insertpos] = child;
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_125:
+ {
+ rt_node_inner_125 *n125 = (rt_node_inner_125 *) node;
+ int cnt = 0;
+
+ if (node_125_is_chunk_used(&n125->base, chunk))
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ node_inner_125_update(n125, chunk, child);
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n125)))
+ {
+ rt_node_inner_256 *new256;
+ Assert(parent != NULL);
+
+ /* grow node from 125 to 256 */
+ newnode = rt_grow_node_kind(tree, node, RT_NODE_KIND_256);
+ new256 = (rt_node_inner_256 *) newnode;
+ for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n125->base.n.count; i++)
+ {
+ if (!node_125_is_chunk_used(&n125->base, i))
+ continue;
+
+ node_inner_256_set(new256, i, node_inner_125_get_child(n125, i));
+ cnt++;
+ }
+
+ rt_replace_node(tree, parent, node, newnode, key);
+ node = newnode;
+ }
+ else
+ {
+ node_inner_125_insert(n125, chunk, child);
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_256:
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+
+ chunk_exists = node_inner_256_is_chunk_used(n256, chunk);
+ Assert(chunk_exists || FIXED_NODE_HAS_FREE_SLOT(n256, RT_CLASS_256));
+
+ node_inner_256_set(n256, chunk, child);
+ break;
+ }
+ }
+
+ /* Update statistics */
+ if (!chunk_exists)
+ node->count++;
+
+ /*
+ * Done. Finally, verify that the chunk and value are inserted or replaced
+ * properly in the node.
+ */
+ rt_verify_node(node);
+
+ return chunk_exists;
+}
+
+/* Insert the value to the leaf node */
+static bool
+rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
+ uint64 key, uint64 value)
+{
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool chunk_exists = false;
+
+ Assert(NODE_IS_LEAF(node));
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+ int idx;
+
+ idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n4->values[idx] = value;
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n4)))
+ {
+ rt_node_leaf_32 *new32;
+ Assert(parent != NULL);
+
+ /* grow node from 4 to 32 */
+ new32 = (rt_node_leaf_32 *) rt_grow_node_kind(tree, (rt_node *) n4,
+ RT_NODE_KIND_32);
+ chunk_values_array_copy(n4->base.chunks, n4->values,
+ new32->base.chunks, new32->values);
+ rt_replace_node(tree, parent, (rt_node *) n4, (rt_node *) new32, key);
+ node = (rt_node *) new32;
+ }
+ else
+ {
+ int insertpos = node_4_get_insertpos((rt_node_base_4 *) n4, chunk);
+ int count = n4->base.n.count;
+
+ /* shift chunks and values */
+ if (count != 0 && insertpos < count)
+ chunk_values_array_shift(n4->base.chunks, n4->values,
+ count, insertpos);
+
+ n4->base.chunks[insertpos] = chunk;
+ n4->values[insertpos] = value;
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_32:
+ {
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+ int idx;
+
+ idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n32->values[idx] = value;
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)))
+ {
+ Assert(parent != NULL);
+
+ if (n32->base.n.count == rt_size_class_info[RT_CLASS_32_PARTIAL].fanout)
+ {
+ /* use the same node kind, but expand to the next size class */
+ const Size size = rt_size_class_info[RT_CLASS_32_PARTIAL].leaf_size;
+ const int fanout = rt_size_class_info[RT_CLASS_32_FULL].fanout;
+ rt_node_leaf_32 *new32;
+
+ new32 = (rt_node_leaf_32 *) rt_alloc_node(tree, RT_CLASS_32_FULL, false);
+ memcpy(new32, n32, size);
+ new32->base.n.fanout = fanout;
+
+ rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new32, key);
+
+ /* must update both pointers here */
+ node = (rt_node *) new32;
+ n32 = new32;
+
+ goto retry_insert_leaf_32;
+ }
+ else
+ {
+ rt_node_leaf_125 *new125;
+
+ /* grow node from 32 to 125 */
+ new125 = (rt_node_leaf_125 *) rt_grow_node_kind(tree, (rt_node *) n32,
+ RT_NODE_KIND_125);
+ for (int i = 0; i < n32->base.n.count; i++)
+ node_leaf_125_insert(new125, n32->base.chunks[i], n32->values[i]);
+
+ rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new125,
+ key);
+ node = (rt_node *) new125;
+ }
+ }
+ else
+ {
+ retry_insert_leaf_32:
+ {
+ int insertpos = node_32_get_insertpos((rt_node_base_32 *) n32, chunk);
+ int count = n32->base.n.count;
+
+ if (count != 0 && insertpos < count)
+ chunk_values_array_shift(n32->base.chunks, n32->values,
+ count, insertpos);
+
+ n32->base.chunks[insertpos] = chunk;
+ n32->values[insertpos] = value;
+ break;
+ }
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_125:
+ {
+ rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) node;
+ int cnt = 0;
+
+ if (node_125_is_chunk_used((rt_node_base_125 *) n125, chunk))
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ node_leaf_125_update(n125, chunk, value);
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n125)))
+ {
+ rt_node_leaf_256 *new256;
+ Assert(parent != NULL);
+
+ /* grow node from 125 to 256 */
+ new256 = (rt_node_leaf_256 *) rt_grow_node_kind(tree, (rt_node *) n125,
+ RT_NODE_KIND_256);
+ for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n125->base.n.count; i++)
+ {
+ if (!node_125_is_chunk_used((rt_node_base_125 *) n125, i))
+ continue;
+
+ node_leaf_256_set(new256, i, node_leaf_125_get_value(n125, i));
+ cnt++;
+ }
+
+ rt_replace_node(tree, parent, (rt_node *) n125, (rt_node *) new256,
+ key);
+ node = (rt_node *) new256;
+ }
+ else
+ {
+ node_leaf_125_insert(n125, chunk, value);
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_256:
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+
+ chunk_exists = node_leaf_256_is_chunk_used(n256, chunk);
+ Assert(chunk_exists || FIXED_NODE_HAS_FREE_SLOT(n256, RT_CLASS_256));
+
+ node_leaf_256_set(n256, chunk, value);
+ break;
+ }
+ }
+
+ /* Update statistics */
+ if (!chunk_exists)
+ node->count++;
+
+ /*
+ * Done. Finally, verify that the chunk and value are inserted or replaced
+ * properly in the node.
+ */
+ rt_verify_node(node);
+
+ return chunk_exists;
+}
+
+/*
+ * Create the radix tree in the given memory context and return it.
+ */
+radix_tree *
+rt_create(MemoryContext ctx)
+{
+ radix_tree *tree;
+ MemoryContext old_ctx;
+
+ old_ctx = MemoryContextSwitchTo(ctx);
+
+ tree = palloc(sizeof(radix_tree));
+ tree->context = ctx;
+ tree->root = NULL;
+ tree->max_val = 0;
+ tree->num_keys = 0;
+
+ /* Create the slab allocator for each size class */
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ tree->inner_slabs[i] = SlabContextCreate(ctx,
+ rt_size_class_info[i].name,
+ rt_size_class_info[i].inner_blocksize,
+ rt_size_class_info[i].inner_size);
+ tree->leaf_slabs[i] = SlabContextCreate(ctx,
+ rt_size_class_info[i].name,
+ rt_size_class_info[i].leaf_blocksize,
+ rt_size_class_info[i].leaf_size);
+#ifdef RT_DEBUG
+ tree->cnt[i] = 0;
+#endif
+ }
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return tree;
+}
+
+/*
+ * Free the given radix tree.
+ */
+void
+rt_free(radix_tree *tree)
+{
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ MemoryContextDelete(tree->inner_slabs[i]);
+ MemoryContextDelete(tree->leaf_slabs[i]);
+ }
+
+ pfree(tree);
+}
+
+/*
+ * Set key to value. If the entry already exists, update its value to 'value'
+ * and return true. Return false if the entry doesn't yet exist.
+ */
+bool
+rt_set(radix_tree *tree, uint64 key, uint64 value)
+{
+ int shift;
+ bool updated;
+ rt_node *node;
+ rt_node *parent;
+
+ /* Empty tree, create the root */
+ if (!tree->root)
+ rt_new_root(tree, key);
+
+ /* Extend the tree if necessary */
+ if (key > tree->max_val)
+ rt_extend(tree, key);
+
+ Assert(tree->root);
+
+ shift = tree->root->shift;
+ node = parent = tree->root;
+
+ /* Descend the tree until a leaf node */
+ while (shift >= 0)
+ {
+ rt_node *child;
+
+ if (NODE_IS_LEAF(node))
+ break;
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ {
+ rt_set_extend(tree, key, value, parent, node);
+ return false;
+ }
+
+ parent = node;
+ node = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ updated = rt_node_insert_leaf(tree, parent, node, key, value);
+
+ /* Update the statistics */
+ if (!updated)
+ tree->num_keys++;
+
+ return updated;
+}
+
+/*
+ * Search the given key in the radix tree. Return true if the key is found,
+ * otherwise return false. On success, the value is set to *value_p, which
+ * therefore must not be NULL.
+ */
+bool
+rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
+{
+ rt_node *node;
+ int shift;
+
+ Assert(value_p != NULL);
+
+ if (!tree->root || key > tree->max_val)
+ return false;
+
+ node = tree->root;
+ shift = tree->root->shift;
+
+ /* Descend the tree until a leaf node */
+ while (shift >= 0)
+ {
+ rt_node *child;
+
+ if (NODE_IS_LEAF(node))
+ break;
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ return false;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ return rt_node_search_leaf(node, key, RT_ACTION_FIND, value_p);
+}
+
+/*
+ * Delete the given key from the radix tree. Return true if the key is found (and
+ * deleted), otherwise do nothing and return false.
+ */
+bool
+rt_delete(radix_tree *tree, uint64 key)
+{
+ rt_node *node;
+ rt_node *stack[RT_MAX_LEVEL] = {0};
+ int shift;
+ int level;
+ bool deleted;
+
+ if (!tree->root || key > tree->max_val)
+ return false;
+
+ /*
+ * Descend the tree to search the key while building a stack of nodes we
+ * visited.
+ */
+ node = tree->root;
+ shift = tree->root->shift;
+ level = -1;
+ while (shift > 0)
+ {
+ rt_node *child;
+
+ /* Push the current node to the stack */
+ stack[++level] = node;
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ return false;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ /* Delete the key from the leaf node if exists */
+ Assert(NODE_IS_LEAF(node));
+ deleted = rt_node_search_leaf(node, key, RT_ACTION_DELETE, NULL);
+
+ if (!deleted)
+ {
+ /* no key is found in the leaf node */
+ return false;
+ }
+
+ /* Found the key to delete. Update the statistics */
+ tree->num_keys--;
+
+ /*
+ * Return if the leaf node still has keys and we don't need to delete the
+ * node.
+ */
+ if (!NODE_IS_EMPTY(node))
+ return true;
+
+ /* Free the empty leaf node */
+ rt_free_node(tree, node);
+
+ /* Delete the key in inner nodes recursively */
+ while (level >= 0)
+ {
+ node = stack[level--];
+
+ deleted = rt_node_search_inner(node, key, RT_ACTION_DELETE, NULL);
+ Assert(deleted);
+
+ /* If the node didn't become empty, we stop deleting the key */
+ if (!NODE_IS_EMPTY(node))
+ break;
+
+ /* The node became empty */
+ rt_free_node(tree, node);
+ }
+
+ return true;
+}
+
+/* Create and return the iterator for the given radix tree */
+rt_iter *
+rt_begin_iterate(radix_tree *tree)
+{
+ MemoryContext old_ctx;
+ rt_iter *iter;
+ int top_level;
+
+ old_ctx = MemoryContextSwitchTo(tree->context);
+
+ iter = (rt_iter *) palloc0(sizeof(rt_iter));
+ iter->tree = tree;
+
+ /* empty tree */
+ if (!iter->tree->root)
+ return iter;
+
+ top_level = iter->tree->root->shift / RT_NODE_SPAN;
+ iter->stack_len = top_level;
+
+ /*
+	 * Descend to the leftmost leaf node from the root. The key is being
+ * constructed while descending to the leaf.
+ */
+ rt_update_iter_stack(iter, iter->tree->root, top_level);
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return iter;
+}
+
+/*
+ * Update each node_iter for inner nodes in the iterator node stack.
+ */
+static void
+rt_update_iter_stack(rt_iter *iter, rt_node *from_node, int from)
+{
+ int level = from;
+ rt_node *node = from_node;
+
+ for (;;)
+ {
+ rt_node_iter *node_iter = &(iter->stack[level--]);
+
+ node_iter->node = node;
+ node_iter->current_idx = -1;
+
+ /* We don't advance the leaf node iterator here */
+ if (NODE_IS_LEAF(node))
+ return;
+
+ /* Advance to the next slot in the inner node */
+ node = rt_node_inner_iterate_next(iter, node_iter);
+
+		/* We must find the first child in the node */
+ Assert(node);
+ }
+}
+
+/*
+ * Return true, setting *key_p and *value_p, if there is a next key. Otherwise,
+ * return false.
+ */
+bool
+rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p)
+{
+ /* Empty tree */
+ if (!iter->tree->root)
+ return false;
+
+ for (;;)
+ {
+ rt_node *child = NULL;
+ uint64 value;
+ int level;
+ bool found;
+
+ /* Advance the leaf node iterator to get next key-value pair */
+ found = rt_node_leaf_iterate_next(iter, &(iter->stack[0]), &value);
+
+ if (found)
+ {
+ *key_p = iter->key;
+ *value_p = value;
+ return true;
+ }
+
+ /*
+ * We've visited all values in the leaf node, so advance inner node
+ * iterators from the level=1 until we find the next child node.
+ */
+ for (level = 1; level <= iter->stack_len; level++)
+ {
+ child = rt_node_inner_iterate_next(iter, &(iter->stack[level]));
+
+ if (child)
+ break;
+ }
+
+ /* the iteration finished */
+ if (!child)
+ return false;
+
+ /*
+ * Set the node to the node iterator and update the iterator stack
+ * from this node.
+ */
+ rt_update_iter_stack(iter, child, level - 1);
+
+ /* Node iterators are updated, so try again from the leaf */
+ }
+
+ return false;
+}
+
+void
+rt_end_iterate(rt_iter *iter)
+{
+ pfree(iter);
+}
+
+static inline void
+rt_iter_update_key(rt_iter *iter, uint8 chunk, uint8 shift)
+{
+ iter->key &= ~(((uint64) RT_CHUNK_MASK) << shift);
+ iter->key |= (((uint64) chunk) << shift);
+}
+
+/*
+ * Advance the slot in the inner node. Return the child if exists, otherwise
+ * null.
+ */
+static inline rt_node *
+rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
+{
+ rt_node *child = NULL;
+ bool found = false;
+ uint8 key_chunk;
+
+ switch (node_iter->node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n4->base.n.count)
+ break;
+
+ child = n4->children[node_iter->current_idx];
+ key_chunk = n4->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n32->base.n.count)
+ break;
+
+ child = n32->children[node_iter->current_idx];
+ key_chunk = n32->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ rt_node_inner_125 *n125 = (rt_node_inner_125 *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (node_125_is_chunk_used((rt_node_base_125 *) n125, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+ child = node_inner_125_get_child(n125, i);
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (node_inner_256_is_chunk_used(n256, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+ child = node_inner_256_get_child(n256, i);
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ }
+
+ if (found)
+ rt_iter_update_key(iter, key_chunk, node_iter->node->shift);
+
+ return child;
+}
+
+/*
+ * Advance the slot in the leaf node. On success, return true and the value
+ * is set to value_p, otherwise return false.
+ */
+static inline bool
+rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
+ uint64 *value_p)
+{
+ rt_node *node = node_iter->node;
+ bool found = false;
+ uint64 value;
+ uint8 key_chunk;
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n4->base.n.count)
+ break;
+
+ value = n4->values[node_iter->current_idx];
+ key_chunk = n4->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n32->base.n.count)
+ break;
+
+ value = n32->values[node_iter->current_idx];
+ key_chunk = n32->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (node_125_is_chunk_used((rt_node_base_125 *) n125, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+ value = node_leaf_125_get_value(n125, i);
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (node_leaf_256_is_chunk_used(n256, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+ value = node_leaf_256_get_value(n256, i);
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ }
+
+ if (found)
+ {
+ rt_iter_update_key(iter, key_chunk, node_iter->node->shift);
+ *value_p = value;
+ }
+
+ return found;
+}
+
+/*
+ * Return the number of keys in the radix tree.
+ */
+uint64
+rt_num_entries(radix_tree *tree)
+{
+ return tree->num_keys;
+}
+
+/*
+ * Return the statistics of the amount of memory used by the radix tree.
+ */
+uint64
+rt_memory_usage(radix_tree *tree)
+{
+ Size total = sizeof(radix_tree);
+
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
+ total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
+ }
+
+ return total;
+}
+
+/*
+ * Verify the radix tree node.
+ */
+static void
+rt_verify_node(rt_node *node)
+{
+#ifdef USE_ASSERT_CHECKING
+ Assert(node->count >= 0);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_base_4 *n4 = (rt_node_base_4 *) node;
+
+ for (int i = 1; i < n4->n.count; i++)
+ Assert(n4->chunks[i - 1] < n4->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_base_32 *n32 = (rt_node_base_32 *) node;
+
+ for (int i = 1; i < n32->n.count; i++)
+ Assert(n32->chunks[i - 1] < n32->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ rt_node_base_125 *n125 = (rt_node_base_125 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ uint8 slot = n125->slot_idxs[i];
+ int bitnum = BM_BIT(slot);
+
+ if (!node_125_is_chunk_used(n125, i))
+ continue;
+
+ /* Check if the corresponding slot is used */
+ Assert(slot < node->fanout);
+ Assert((n125->isset[i] & ((bitmapword) 1 << bitnum)) != 0);
+
+ cnt++;
+ }
+
+ Assert(n125->n.count == cnt);
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < BM_IDX(RT_NODE_MAX_SLOTS); i++)
+ cnt += bmw_popcount(n256->isset[i]);
+
+ /* Check if the number of used chunk matches */
+ Assert(n256->base.n.count == cnt);
+
+ break;
+ }
+ }
+ }
+#endif
+}
+
+/***************** DEBUG FUNCTIONS *****************/
+#ifdef RT_DEBUG
+void
+rt_stats(radix_tree *tree)
+{
+ ereport(NOTICE, (errmsg("num_keys = " UINT64_FORMAT ", height = %d, n4 = %u, n32_partial = %u, n32_full = %u, n125 = %u, n256 = %u",
+ tree->num_keys,
+ tree->root->shift / RT_NODE_SPAN,
+ tree->cnt[RT_CLASS_4_FULL],
+ tree->cnt[RT_CLASS_32_PARTIAL],
+ tree->cnt[RT_CLASS_32_FULL],
+ tree->cnt[RT_CLASS_125_FULL],
+ tree->cnt[RT_CLASS_256])));
+}
+
+static void
+rt_dump_node(rt_node *node, int level, bool recurse)
+{
+ char space[125] = {0};
+
+ fprintf(stderr, "[%s] kind %d, fanout %d, count %u, shift %u, chunk 0x%X:\n",
+ NODE_IS_LEAF(node) ? "LEAF" : "INNR",
+ (node->kind == RT_NODE_KIND_4) ? 4 :
+ (node->kind == RT_NODE_KIND_32) ? 32 :
+ (node->kind == RT_NODE_KIND_125) ? 125 : 256,
+ node->fanout == 0 ? 256 : node->fanout,
+ node->count, node->shift, node->chunk);
+
+ if (level > 0)
+ sprintf(space, "%*c", level * 4, ' ');
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, n4->base.chunks[i], n4->values[i]);
+ }
+ else
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, n4->base.chunks[i]);
+
+ if (recurse)
+ rt_dump_node(n4->children[i], level + 1, recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, n32->base.chunks[i], n32->values[i]);
+ }
+ else
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, n32->base.chunks[i]);
+
+ if (recurse)
+ {
+ rt_dump_node(n32->children[i], level + 1, recurse);
+ }
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ rt_node_base_125 *b125 = (rt_node_base_125 *) node;
+
+ fprintf(stderr, "slot_idxs ");
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!node_125_is_chunk_used(b125, i))
+ continue;
+
+ fprintf(stderr, " [%d]=%d, ", i, b125->slot_idxs[i]);
+ }
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_125 *n = (rt_node_leaf_125 *) node;
+
+ fprintf(stderr, ", isset-bitmap:");
+ for (int i = 0; i < BM_IDX(128); i++)
+ {
+ fprintf(stderr, UINT64_FORMAT_HEX " ", (uint64) n->base.isset[i]);
+ }
+ fprintf(stderr, "\n");
+ }
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!node_125_is_chunk_used(b125, i))
+ continue;
+
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) b125;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, i, node_leaf_125_get_value(n125, i));
+ }
+ else
+ {
+ rt_node_inner_125 *n125 = (rt_node_inner_125 *) b125;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, i);
+
+ if (recurse)
+ rt_dump_node(node_inner_125_get_child(n125, i),
+ level + 1, recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+
+ if (!node_leaf_256_is_chunk_used(n256, i))
+ continue;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, i, node_leaf_256_get_value(n256, i));
+ }
+ else
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+
+ if (!node_inner_256_is_chunk_used(n256, i))
+ continue;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, i);
+
+ if (recurse)
+ rt_dump_node(node_inner_256_get_child(n256, i), level + 1,
+ recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ }
+}
+
+void
+rt_dump_search(radix_tree *tree, uint64 key)
+{
+ rt_node *node;
+ int shift;
+ int level = 0;
+
+ elog(NOTICE, "-----------------------------------------------------------");
+ elog(NOTICE, "max_val = " UINT64_FORMAT "(0x" UINT64_FORMAT_HEX ")",
+ tree->max_val, tree->max_val);
+
+ if (!tree->root)
+ {
+ elog(NOTICE, "tree is empty");
+ return;
+ }
+
+ if (key > tree->max_val)
+ {
+ elog(NOTICE, "key " UINT64_FORMAT "(0x" UINT64_FORMAT_HEX ") is larger than max val",
+ key, key);
+ return;
+ }
+
+ node = tree->root;
+ shift = tree->root->shift;
+ while (shift >= 0)
+ {
+ rt_node *child;
+
+ rt_dump_node(node, level, false);
+
+ if (NODE_IS_LEAF(node))
+ {
+ uint64 dummy;
+
+			/* We reached a leaf node, find the corresponding slot */
+ rt_node_search_leaf(node, key, RT_ACTION_FIND, &dummy);
+
+ break;
+ }
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ break;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ level++;
+ }
+}
+
+void
+rt_dump(radix_tree *tree)
+{
+
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ fprintf(stderr, "%s\tinner_size %zu\tinner_blocksize %zu\tleaf_size %zu\tleaf_blocksize %zu\n",
+ rt_size_class_info[i].name,
+ rt_size_class_info[i].inner_size,
+ rt_size_class_info[i].inner_blocksize,
+ rt_size_class_info[i].leaf_size,
+ rt_size_class_info[i].leaf_blocksize);
+ fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->max_val);
+
+ if (!tree->root)
+ {
+ fprintf(stderr, "empty tree\n");
+ return;
+ }
+
+ rt_dump_node(tree->root, 0, true);
+}
+#endif
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
new file mode 100644
index 0000000000..d5d7668617
--- /dev/null
+++ b/src/include/lib/radixtree.h
@@ -0,0 +1,42 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.h
+ * Interface for radix tree.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/lib/radixtree.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef RADIXTREE_H
+#define RADIXTREE_H
+
+#include "postgres.h"
+
+#define RT_DEBUG 1
+
+typedef struct radix_tree radix_tree;
+typedef struct rt_iter rt_iter;
+
+extern radix_tree *rt_create(MemoryContext ctx);
+extern void rt_free(radix_tree *tree);
+extern bool rt_search(radix_tree *tree, uint64 key, uint64 *val_p);
+extern bool rt_set(radix_tree *tree, uint64 key, uint64 val);
+extern rt_iter *rt_begin_iterate(radix_tree *tree);
+
+extern bool rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p);
+extern void rt_end_iterate(rt_iter *iter);
+extern bool rt_delete(radix_tree *tree, uint64 key);
+
+extern uint64 rt_memory_usage(radix_tree *tree);
+extern uint64 rt_num_entries(radix_tree *tree);
+
+#ifdef RT_DEBUG
+extern void rt_dump(radix_tree *tree);
+extern void rt_dump_search(radix_tree *tree, uint64 key);
+extern void rt_stats(radix_tree *tree);
+#endif
+
+#endif /* RADIXTREE_H */
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index c629cbe383..9659eb85d7 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -28,6 +28,7 @@ SUBDIRS = \
test_pg_db_role_setting \
test_pg_dump \
test_predtest \
+ test_radixtree \
test_rbtree \
test_regex \
test_rls_hooks \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 1baa6b558d..232cbdac80 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -24,6 +24,7 @@ subdir('test_parser')
subdir('test_pg_db_role_setting')
subdir('test_pg_dump')
subdir('test_predtest')
+subdir('test_radixtree')
subdir('test_rbtree')
subdir('test_regex')
subdir('test_rls_hooks')
diff --git a/src/test/modules/test_radixtree/.gitignore b/src/test/modules/test_radixtree/.gitignore
new file mode 100644
index 0000000000..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/test_radixtree/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_radixtree/Makefile b/src/test/modules/test_radixtree/Makefile
new file mode 100644
index 0000000000..da06b93da3
--- /dev/null
+++ b/src/test/modules/test_radixtree/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_radixtree/Makefile
+
+MODULE_big = test_radixtree
+OBJS = \
+ $(WIN32RES) \
+ test_radixtree.o
+PGFILEDESC = "test_radixtree - test code for src/backend/lib/radixtree.c"
+
+EXTENSION = test_radixtree
+DATA = test_radixtree--1.0.sql
+
+REGRESS = test_radixtree
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_radixtree
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_radixtree/README b/src/test/modules/test_radixtree/README
new file mode 100644
index 0000000000..a8b271869a
--- /dev/null
+++ b/src/test/modules/test_radixtree/README
@@ -0,0 +1,7 @@
+test_radixtree contains unit tests for testing the radix tree implementation
+in src/backend/lib/radixtree.c.
+
+The tests verify the correctness of the implementation, but they can also be
+used as a micro-benchmark. If you set the 'rt_test_stats' flag in
+test_radixtree.c, the tests will print extra information about execution time
+and memory usage.
diff --git a/src/test/modules/test_radixtree/expected/test_radixtree.out b/src/test/modules/test_radixtree/expected/test_radixtree.out
new file mode 100644
index 0000000000..ce645cb8b5
--- /dev/null
+++ b/src/test/modules/test_radixtree/expected/test_radixtree.out
@@ -0,0 +1,36 @@
+CREATE EXTENSION test_radixtree;
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
+NOTICE: testing basic operations with leaf node 4
+NOTICE: testing basic operations with inner node 4
+NOTICE: testing basic operations with leaf node 32
+NOTICE: testing basic operations with inner node 32
+NOTICE: testing basic operations with leaf node 125
+NOTICE: testing basic operations with inner node 125
+NOTICE: testing basic operations with leaf node 256
+NOTICE: testing basic operations with inner node 256
+NOTICE: testing radix tree node types with shift "0"
+NOTICE: testing radix tree node types with shift "8"
+NOTICE: testing radix tree node types with shift "16"
+NOTICE: testing radix tree node types with shift "24"
+NOTICE: testing radix tree node types with shift "32"
+NOTICE: testing radix tree node types with shift "40"
+NOTICE: testing radix tree node types with shift "48"
+NOTICE: testing radix tree node types with shift "56"
+NOTICE: testing radix tree with pattern "all ones"
+NOTICE: testing radix tree with pattern "alternating bits"
+NOTICE: testing radix tree with pattern "clusters of ten"
+NOTICE: testing radix tree with pattern "clusters of hundred"
+NOTICE: testing radix tree with pattern "one-every-64k"
+NOTICE: testing radix tree with pattern "sparse"
+NOTICE: testing radix tree with pattern "single values, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^60"
+ test_radixtree
+----------------
+
+(1 row)
+
diff --git a/src/test/modules/test_radixtree/meson.build b/src/test/modules/test_radixtree/meson.build
new file mode 100644
index 0000000000..f96bf159d6
--- /dev/null
+++ b/src/test/modules/test_radixtree/meson.build
@@ -0,0 +1,34 @@
+# FIXME: prevent install during main install, but not during test :/
+
+test_radixtree_sources = files(
+ 'test_radixtree.c',
+)
+
+if host_system == 'windows'
+ test_radixtree_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'test_radixtree',
+ '--FILEDESC', 'test_radixtree - test code for src/backend/lib/radixtree.c',])
+endif
+
+test_radixtree = shared_module('test_radixtree',
+ test_radixtree_sources,
+ kwargs: pg_mod_args,
+)
+testprep_targets += test_radixtree
+
+install_data(
+ 'test_radixtree.control',
+ 'test_radixtree--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'test_radixtree',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'test_radixtree',
+ ],
+ },
+}
diff --git a/src/test/modules/test_radixtree/sql/test_radixtree.sql b/src/test/modules/test_radixtree/sql/test_radixtree.sql
new file mode 100644
index 0000000000..41ece5e9f5
--- /dev/null
+++ b/src/test/modules/test_radixtree/sql/test_radixtree.sql
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_radixtree;
+
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
diff --git a/src/test/modules/test_radixtree/test_radixtree--1.0.sql b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
new file mode 100644
index 0000000000..074a5a7ea7
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_radixtree/test_radixtree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_radixtree" to load this file. \quit
+
+CREATE FUNCTION test_radixtree()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
new file mode 100644
index 0000000000..ea993e63df
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -0,0 +1,581 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_radixtree.c
+ * Test radixtree set data structure.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/test/modules/test_radixtree/test_radixtree.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "lib/radixtree.h"
+#include "miscadmin.h"
+#include "nodes/bitmapset.h"
+#include "storage/block.h"
+#include "storage/itemptr.h"
+#include "utils/memutils.h"
+#include "utils/timestamp.h"
+
+#define UINT64_HEX_FORMAT "%" INT64_MODIFIER "X"
+
+/*
+ * If you enable this, the "pattern" tests will print information about
+ * how long populating, probing, and iterating the test set takes, and
+ * how much memory the test set consumed. That can be used as
+ * micro-benchmark of various operations and input patterns (you might
+ * want to increase the number of values used in each of the test, if
+ * you do that, to reduce noise).
+ *
+ * The information is printed to the server's stderr, mostly because
+ * that's where MemoryContextStats() output goes.
+ */
+static const bool rt_test_stats = false;
+
+static int rt_node_kind_fanouts[] = {
+ 0,
+ 4, /* RT_NODE_KIND_4 */
+ 32, /* RT_NODE_KIND_32 */
+ 125, /* RT_NODE_KIND_125 */
+ 256 /* RT_NODE_KIND_256 */
+};
+/*
+ * A struct to define a pattern of integers, for use with the test_pattern()
+ * function.
+ */
+typedef struct
+{
+ char *test_name; /* short name of the test, for humans */
+ char *pattern_str; /* a bit pattern */
+ uint64 spacing; /* pattern repeats at this interval */
+ uint64 num_values; /* number of integers to set in total */
+} test_spec;
+
+/* Test patterns borrowed from test_integerset.c */
+static const test_spec test_specs[] = {
+ {
+ "all ones", "1111111111",
+ 10, 1000000
+ },
+ {
+ "alternating bits", "0101010101",
+ 10, 1000000
+ },
+ {
+ "clusters of ten", "1111111111",
+ 10000, 1000000
+ },
+ {
+ "clusters of hundred",
+ "1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111",
+ 10000, 1000000
+ },
+ {
+ "one-every-64k", "1",
+ 65536, 1000000
+ },
+ {
+ "sparse", "100000000000000000000000000000001",
+ 10000000, 1000000
+ },
+ {
+ "single values, distance > 2^32", "1",
+ UINT64CONST(10000000000), 100000
+ },
+ {
+ "clusters, distance > 2^32", "10101010",
+ UINT64CONST(10000000000), 1000000
+ },
+ {
+ "clusters, distance > 2^60", "10101010",
+ UINT64CONST(2000000000000000000),
+ 23 /* can't be much higher than this, or we
+ * overflow uint64 */
+ }
+};
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(test_radixtree);
+
+static void
+test_empty(void)
+{
+ radix_tree *radixtree;
+ rt_iter *iter;
+ uint64 dummy;
+ uint64 key;
+ uint64 val;
+
+ radixtree = rt_create(CurrentMemoryContext);
+
+ if (rt_search(radixtree, 0, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, 1, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, PG_UINT64_MAX, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_delete(radixtree, 0))
+ elog(ERROR, "rt_delete on empty tree returned true");
+
+ if (rt_num_entries(radixtree) != 0)
+ elog(ERROR, "rt_num_entries on empty tree return non-zero");
+
+ iter = rt_begin_iterate(radixtree);
+
+ if (rt_iterate_next(iter, &key, &val))
+ elog(ERROR, "rt_itereate_next on empty tree returned true");
+
+ rt_end_iterate(iter);
+
+ rt_free(radixtree);
+}
+
+static void
+test_basic(int children, bool test_inner)
+{
+ radix_tree *radixtree;
+ uint64 *keys;
+ int shift = test_inner ? 8 : 0;
+
+ elog(NOTICE, "testing basic operations with %s node %d",
+ test_inner ? "inner" : "leaf", children);
+
+ radixtree = rt_create(CurrentMemoryContext);
+
+ /* prepare keys in an order like 1, 32, 2, 31, 3, ... */
+ keys = palloc(sizeof(uint64) * children);
+ for (int i = 0; i < children; i++)
+ {
+ if (i % 2 == 0)
+ keys[i] = (uint64) ((i / 2) + 1) << shift;
+ else
+ keys[i] = (uint64) (children - (i / 2)) << shift;
+ }
+
+ /* insert keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (rt_set(radixtree, keys[i], keys[i]))
+ elog(ERROR, "new inserted key 0x" UINT64_HEX_FORMAT " is found ", keys[i]);
+ }
+
+ /* update keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (!rt_set(radixtree, keys[i], keys[i] + 1))
+ elog(ERROR, "could not update key 0x" UINT64_HEX_FORMAT, keys[i]);
+ }
+
+ /* repeat deleting and inserting keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (!rt_delete(radixtree, keys[i]))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, keys[i]);
+ if (rt_set(radixtree, keys[i], keys[i]))
+ elog(ERROR, "new inserted key 0x" UINT64_HEX_FORMAT " is found ", keys[i]);
+ }
+
+ pfree(keys);
+ rt_free(radixtree);
+}
+
+/*
+ * Check if keys from start to end with the shift exist in the tree.
+ */
+static void
+check_search_on_node(radix_tree *radixtree, uint8 shift, int start, int end,
+ int incr)
+{
+ for (int i = start; i < end; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ uint64 val;
+
+ if (!rt_search(radixtree, key, &val))
+ elog(ERROR, "key 0x" UINT64_HEX_FORMAT " is not found on node-%d",
+ key, end);
+ if (val != key)
+ elog(ERROR, "rt_search with key 0x" UINT64_HEX_FORMAT " returns 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ key, val, key);
+ }
+}
+
+static void
+test_node_types_insert(radix_tree *radixtree, uint8 shift, bool insert_asc)
+{
+ uint64 num_entries;
+ int ninserted = 0;
+ int start = insert_asc ? 0 : 256;
+ int incr = insert_asc ? 1 : -1;
+ int end = insert_asc ? 256 : 0;
+ int node_kind_idx = 1;
+
+ for (int i = start; i != end; i += incr)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_set(radixtree, key, key);
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " is found", key);
+
+ /*
+ * After filling all slots in each node type, check if the values
+ * are stored properly.
+ */
+ if (ninserted == rt_node_kind_fanouts[node_kind_idx] - 1)
+ {
+ int check_start = insert_asc
+ ? rt_node_kind_fanouts[node_kind_idx - 1]
+ : rt_node_kind_fanouts[node_kind_idx];
+ int check_end = insert_asc
+ ? rt_node_kind_fanouts[node_kind_idx]
+ : rt_node_kind_fanouts[node_kind_idx - 1];
+
+ check_search_on_node(radixtree, shift, check_start, check_end, incr);
+ node_kind_idx++;
+ }
+
+ ninserted++;
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ if (num_entries != 256)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(256));
+}
+
+static void
+test_node_types_delete(radix_tree *radixtree, uint8 shift)
+{
+ uint64 num_entries;
+
+ for (int i = 0; i < 256; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_delete(radixtree, key);
+
+ if (!found)
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, key);
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ /* The tree must be empty */
+ if (num_entries != 0)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(256));
+}
+
+/*
+ * Test for inserting and deleting key-value pairs to each node type at the given shift
+ * level.
+ */
+static void
+test_node_types(uint8 shift)
+{
+ radix_tree *radixtree;
+
+ elog(NOTICE, "testing radix tree node types with shift \"%d\"", shift);
+
+ radixtree = rt_create(CurrentMemoryContext);
+
+ /*
+ * Insert and search entries for every node type at the 'shift' level,
+ * then delete all entries to make it empty, and insert and search entries
+ * again.
+ */
+ test_node_types_insert(radixtree, shift, true);
+ test_node_types_delete(radixtree, shift);
+ test_node_types_insert(radixtree, shift, false);
+
+ rt_free(radixtree);
+}
+
+/*
+ * Test with a repeating pattern, defined by the 'spec'.
+ */
+static void
+test_pattern(const test_spec * spec)
+{
+ radix_tree *radixtree;
+ rt_iter *iter;
+ MemoryContext radixtree_ctx;
+ TimestampTz starttime;
+ TimestampTz endtime;
+ uint64 n;
+ uint64 last_int;
+ uint64 ndeleted;
+ uint64 nbefore;
+ uint64 nafter;
+ int patternlen;
+ uint64 *pattern_values;
+ uint64 pattern_num_values;
+
+ elog(NOTICE, "testing radix tree with pattern \"%s\"", spec->test_name);
+ if (rt_test_stats)
+ fprintf(stderr, "-----\ntesting radix tree with pattern \"%s\"\n", spec->test_name);
+
+ /* Pre-process the pattern, creating an array of integers from it. */
+ patternlen = strlen(spec->pattern_str);
+ pattern_values = palloc(patternlen * sizeof(uint64));
+ pattern_num_values = 0;
+ for (int i = 0; i < patternlen; i++)
+ {
+ if (spec->pattern_str[i] == '1')
+ pattern_values[pattern_num_values++] = i;
+ }
+
+ /*
+ * Allocate the radix tree.
+ *
+ * Allocate it in a separate memory context, so that we can print its
+ * memory usage easily.
+ */
+ radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
+ "radixtree test",
+ ALLOCSET_SMALL_SIZES);
+ MemoryContextSetIdentifier(radixtree_ctx, spec->test_name);
+ radixtree = rt_create(radixtree_ctx);
+
+ /*
+ * Add values to the set.
+ */
+ starttime = GetCurrentTimestamp();
+
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ uint64 x = 0;
+
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ bool found;
+
+ x = last_int + pattern_values[i];
+
+ found = rt_set(radixtree, x, x);
+
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " found", x);
+
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+
+ endtime = GetCurrentTimestamp();
+
+ if (rt_test_stats)
+ fprintf(stderr, "added " UINT64_FORMAT " values in %d ms\n",
+ spec->num_values, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Print stats on the amount of memory used.
+ *
+ * We print the usage reported by rt_memory_usage(), as well as the stats
+ * from the memory context. They should be in the same ballpark, but it's
+ * hard to automate testing that, so if you're making changes to the
+ * implementation, just observe that manually.
+ */
+ if (rt_test_stats)
+ {
+ uint64 mem_usage;
+
+ /*
+ * Also print memory usage as reported by rt_memory_usage(). It
+ * should be in the same ballpark as the usage reported by
+ * MemoryContextStats().
+ */
+ mem_usage = rt_memory_usage(radixtree);
+ fprintf(stderr, "rt_memory_usage() reported " UINT64_FORMAT " (%0.2f bytes / integer)\n",
+ mem_usage, (double) mem_usage / spec->num_values);
+
+ MemoryContextStats(radixtree_ctx);
+ }
+
+ /* Check that rt_num_entries works */
+ n = rt_num_entries(radixtree);
+ if (n != spec->num_values)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT, n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_search()
+ */
+ starttime = GetCurrentTimestamp();
+
+ for (n = 0; n < 100000; n++)
+ {
+ bool found;
+ bool expected;
+ uint64 x;
+ uint64 v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Do we expect this value to be present in the set? */
+ if (x >= last_int)
+ expected = false;
+ else
+ {
+ uint64 idx = x % spec->spacing;
+
+ if (idx >= patternlen)
+ expected = false;
+ else if (spec->pattern_str[idx] == '1')
+ expected = true;
+ else
+ expected = false;
+ }
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (found != expected)
+ elog(ERROR, "mismatch at 0x" UINT64_HEX_FORMAT ": %d vs %d", x, found, expected);
+ if (found && (v != x))
+ elog(ERROR, "found 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ v, x);
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "probed " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Test iterator
+ */
+ starttime = GetCurrentTimestamp();
+
+ iter = rt_begin_iterate(radixtree);
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ uint64 expected = last_int + pattern_values[i];
+ uint64 x;
+ uint64 val;
+
+ if (!rt_iterate_next(iter, &x, &val))
+ break;
+
+ if (x != expected)
+ elog(ERROR,
+ "iterate returned wrong key; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d",
+ x, expected, i);
+ if (val != expected)
+ elog(ERROR,
+ "iterate returned wrong value; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d", x, expected, i);
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "iterated " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ rt_end_iterate(iter);
+
+ if (n < spec->num_values)
+ elog(ERROR, "iterator stopped short after " UINT64_FORMAT " entries, expected " UINT64_FORMAT, n, spec->num_values);
+ if (n > spec->num_values)
+ elog(ERROR, "iterator returned " UINT64_FORMAT " entries, " UINT64_FORMAT " was expected", n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_delete()
+ */
+ starttime = GetCurrentTimestamp();
+
+ nbefore = rt_num_entries(radixtree);
+ ndeleted = 0;
+ for (n = 0; n < 1; n++)
+ {
+ bool found;
+ uint64 x;
+ uint64 v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (!found)
+ continue;
+
+ /* If the key is found, delete it and check again */
+ if (!rt_delete(radixtree, x))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_search(radixtree, x, &v))
+ elog(ERROR, "found deleted key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_delete(radixtree, x))
+ elog(ERROR, "deleted already-deleted key 0x" UINT64_HEX_FORMAT, x);
+
+ ndeleted++;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "deleted " UINT64_FORMAT " values in %d ms\n",
+ ndeleted, (int) (endtime - starttime) / 1000);
+
+ nafter = rt_num_entries(radixtree);
+
+ /* Check that rt_num_entries works */
+ if ((nbefore - ndeleted) != nafter)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT "after " UINT64_FORMAT " deletion",
+ nafter, (nbefore - ndeleted), ndeleted);
+
+ MemoryContextDelete(radixtree_ctx);
+}
+
+Datum
+test_radixtree(PG_FUNCTION_ARGS)
+{
+ test_empty();
+
+ for (int i = 1; i < lengthof(rt_node_kind_fanouts); i++)
+ {
+ test_basic(rt_node_kind_fanouts[i], false);
+ test_basic(rt_node_kind_fanouts[i], true);
+ }
+
+ for (int shift = 0; shift <= (64 - 8); shift += 8)
+ test_node_types(shift);
+
+ /* Test different test patterns, with lots of entries */
+ for (int i = 0; i < lengthof(test_specs); i++)
+ test_pattern(&test_specs[i]);
+
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_radixtree/test_radixtree.control b/src/test/modules/test_radixtree/test_radixtree.control
new file mode 100644
index 0000000000..e53f2a3e0c
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.control
@@ -0,0 +1,4 @@
+comment = 'Test code for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/test_radixtree'
+relocatable = true
--
2.39.0
On Mon, Jan 9, 2023 at 5:59 PM John Naylor <john.naylor@enterprisedb.com> wrote:
[working on templating]
In the end, I decided to base my effort on v8, and not v12 (based on one of my less-well-thought-out ideas). The latter was a good experiment, but it did not lead to an increase in readability as I had hoped. The attached v17 is still rough, but it's in good enough shape to evaluate a mostly-complete templating implementation.
I really appreciate your work!
v13-0007 had some changes to the regression tests, but I haven't included those. The tests from v13-0003 do pass, both local and shared. I quickly hacked together switching the tests between shared and local by hand (which requires recompiling), but it would be good for maintainability if the tests could run once each with local and shared memory while using the same "expected" test output.
Agreed.
Also, I didn't look to see if there were any changes in v14/15 that didn't have to do with precise memory accounting.
At this point, Masahiko, I'd appreciate your feedback on whether this is an improvement at all (or at least a good base for improvement), especially for integrating with the TID store. I think there are some advantages to the template approach. One possible disadvantage is needing separate sets of functions for local and shared memory.
If we go this route, I do think the TID store should invoke the template as static functions. I'm not quite comfortable with a global function that may not fit well with future use cases.
It looks like there is no problem in terms of vacuum integration, although
I've not fully tested it yet. TID store uses the radix tree as the main
storage, and with the template radix tree, the data types for the shared and
non-shared cases will be different. TID store can have a union for the radix
tree, and the structure would look as follows:
/* Per-backend state for a TidStore */
struct TidStore
{
/*
* Control object. This is allocated in DSA area 'area' in the shared
* case, otherwise in backend-local memory.
*/
TidStoreControl *control;
/* Storage for Tids */
union tree
{
local_radix_tree *local;
shared_radix_tree *shared;
};
/* DSA area for TidStore if used */
dsa_area *area;
};
In the TID store functions, we need to call either the local or the shared
radix tree functions depending on whether the TID store is shared. We need an
if-branch for each key-value pair insertion (a rough sketch is below), but I
think it would not be a big performance problem in TID store use cases, since
vacuum is an I/O-intensive operation in many cases. Overall, I think there is
no problem, and I'll investigate it in depth.
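For illustration, a minimal sketch of that dispatch could look like the
following (the union member name "tree" and the local_rt_set()/shared_rt_set()
entry points are only assumptions for this sketch, not actual code; the shared
case is detected here via the DSA area pointer):

static bool
tidstore_set_key(TidStore *ts, uint64 key, uint64 value)
{
    /* branch once per key-value pair, depending on the memory kind */
    if (ts->area != NULL)
        return shared_rt_set(ts->tree.shared, key, value);
    else
        return local_rt_set(ts->tree.local, key, value);
}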
Apart from that, I've been considering lock support for the shared radix
tree. As we discussed before, the current usage (i.e., only parallel index
vacuum) doesn't require locking support at all, so it would be enough to have
a single lock for simplicity; a rough sketch of that is below. If we want to
use the shared radix tree for other use cases such as parallel heap vacuum or
replacing the hash table for shared buffers, we would need better lock
support. For example, if we want to support Optimistic Lock Coupling[1], we
would need to change not only the node structure but also the logic, which
would probably widen the gap between the code for the non-shared and shared
radix trees. In that case, once we have a better radix tree optimized for the
shared case, perhaps we can replace the templated shared radix tree with it.
I'd like to hear your opinion on this.
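As a minimal sketch of the single-lock idea (the embedded lock field and the
shared_rt_set() entry point are assumptions here, not actual code):

bool
shared_rt_set_locked(shared_radix_tree *tree, uint64 key, uint64 value)
{
    bool    found;

    /* one coarse-grained lock is enough while only parallel index vacuum uses it */
    LWLockAcquire(&tree->lock, LW_EXCLUSIVE);
    found = shared_rt_set(tree, key, value);
    LWLockRelease(&tree->lock);

    return found;
}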
One review point I'll mention: Somehow I didn't notice there is no use for the "chunk" field in the rt_node type -- it's only set to zero and copied when growing. What is the purpose? Removing it would allow the smallest node to take up only 32 bytes with a fanout of 3, by eliminating padding.
Oh, I didn't notice that. The chunk field was originally used when
redirecting the child pointer in the parent node from the old to the new
(grown) node. When redirecting the pointer, since the corresponding chunk
surely exists on the parent, we can skip the existence check. Currently we
use RT_NODE_UPDATE_INNER() for that (see RT_REPLACE_NODE()), but having a
dedicated function to update the existing chunk and child pointer might
improve performance. Or reducing the node size by getting rid of the chunk
field might be better.
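For instance, a dedicated update function for the smallest node could simply
scan for the chunk that is known to exist (just a sketch based on the node
layout in the patch; this helper itself is not part of the patch):

static inline void
rt_node_inner_4_update(rt_node_inner_4 *n4, uint8 chunk, rt_node *new_child)
{
    for (int i = 0; i < n4->base.n.count; i++)
    {
        if (n4->base.chunks[i] == chunk)
        {
            n4->children[i] = new_child;
            return;
        }
    }

    /* the chunk must exist on the parent when replacing a grown child */
    Assert(false);
}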
Also, v17-0005 has an optimization/simplification for growing into node125 (my version needs an assertion or fallback, but works well now), found by another reading of Andres' prototype. There is a lot of good engineering there; we should try to preserve it.
Agreed.
Regards,
[1]: https://db.in.tum.de/~leis/papers/artsync.pdf
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Tue, Jan 10, 2023 at 7:08 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
It looks no problem in terms of vacuum integration, although I've not
fully tested yet. TID store uses the radix tree as the main storage,
and with the template radix tree, the data types for shared and
non-shared will be different. TID store can have an union for the
radix tree and the structure would be like follows:
/* Storage for Tids */
union tree
{
local_radix_tree *local;
shared_radix_tree *shared;
};
We could possibly go back to using a common data type for this, but with
unused fields in each setting, as before. We would have to be more careful
of things like the 32-bit crash from a few weeks ago.
In the functions of TID store, we need to call either local or shared
radix tree functions depending on whether TID store is shared or not.
We need if-branch for each key-value pair insertion, but I think it
would not be a big performance problem in TID store use cases, since
vacuum is an I/O intensive operation in many cases.
Also, the branch will be easily predicted. That was still true in earlier
patches, but with many more branches and fatter code paths.
Overall, I think
there is no problem and I'll investigate it in depth.
Okay, great. If the separate-functions approach turns out to be ugly, we
can always go back to the branching approach for shared memory. I think
we'll want to keep this as a template overall, at least to allow different
value types and to ease adding variable-length keys if someone finds a need.
Apart from that, I've been considering the lock support for shared
radix tree. As we discussed before, the current usage (i.e, only
parallel index vacuum) doesn't require locking support at all, so it
would be enough to have a single lock for simplicity.
Right, that should be enough for PG16.
If we want to
use the shared radix tree for other use cases such as the parallel
heap vacuum or the replacement of the hash table for shared buffers,
we would need better lock support.
For future parallel pruning, I still think a global lock is "probably" fine
if the workers buffer in local arrays. Highly concurrent applications will
need additional work, of course.
For example, if we want to support
Optimistic Lock Coupling[1],
Interesting, from the same authors!
we would need to change not only the node
structure but also the logic. Which probably leads to widen the gap
between the code for non-shared and shared radix tree. In this case,
once we have a better radix tree optimized for shared case, perhaps we
can replace the templated shared radix tree with it. I'd like to hear
your opinion on this line.
I'm not in a position to speculate on how best to do scalable concurrency,
much less how it should coexist with the local implementation. It's
interesting that their "ROWEX" scheme gives up maintaining order in the
linear nodes.
One review point I'll mention: Somehow I didn't notice there is no use
for the "chunk" field in the rt_node type -- it's only set to zero and
copied when growing. What is the purpose? Removing it would allow the
smallest node to take up only 32 bytes with a fanout of 3, by eliminating
padding.
Oh, I didn't notice that. The chunk field was originally used when
redirecting the child pointer in the parent node from old to new
(grown) node. When redirecting the pointer, since the corresponding
chunk surely exists on the parent we can skip existence checks.
Currently we use RT_NODE_UPDATE_INNER() for that (see
RT_REPLACE_NODE()) but having a dedicated function to update the
existing chunk and child pointer might improve the performance. Or
reducing the node size by getting rid of the chunk field might be
better.
I see. IIUC from a brief re-reading of the code, saving that chunk would
only save us from re-loading "parent->shift" from L1 cache and shifting the
key. The cycles spent doing that seem small compared to the rest of the
work involved in growing a node. Expressions like "if (idx < 0) return
false;" return to an asserts-only variable, so in production builds, I
would hope that branch gets elided (I haven't checked).
I'm quite keen on making the smallest node padding-free, (since we don't
yet have path compression or lazy path expansion), and this seems the way
to get there.
--
John Naylor
EDB: http://www.enterprisedb.com
I wrote:
I see. IIUC from a brief re-reading of the code, saving that chunk would
only save us from re-loading "parent->shift" from L1 cache and shifting the
key. The cycles spent doing that seem small compared to the rest of the
work involved in growing a node. Expressions like "if (idx < 0) return
false;" return to an asserts-only variable, so in production builds, I
would hope that branch gets elided (I haven't checked).
On further reflection, this is completely false and I'm not sure what I was
thinking. However, for the update-inner case maybe we can assert that we
found a valid slot.
--
John Naylor
EDB: http://www.enterprisedb.com
On Wed, Jan 11, 2023 at 12:13 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Tue, Jan 10, 2023 at 7:08 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
It looks no problem in terms of vacuum integration, although I've not
fully tested yet. TID store uses the radix tree as the main storage,
and with the template radix tree, the data types for shared and
non-shared will be different. TID store can have an union for the
radix tree and the structure would be like follows:
/* Storage for Tids */
union tree
{
local_radix_tree *local;
shared_radix_tree *shared;
};
We could possibly go back to using a common data type for this, but with unused fields in each setting, as before. We would have to be more careful of things like the 32-bit crash from a few weeks ago.
One idea to have a common data type without unused fields is to use
radix_tree as a base class. We cast it to radix_tree_shared or
radix_tree_local depending on the is_shared flag in radix_tree. For
instance, we could have something like the following (based on the
non-template version):
struct radix_tree
{
bool is_shared;
MemoryContext context;
};
typedef struct rt_shared
{
rt_handle handle;
uint32 magic;
/* Root node */
dsa_pointer root;
uint64 max_val;
uint64 num_keys;
/* need a lwlock */
/* statistics */
#ifdef RT_DEBUG
int32 cnt[RT_SIZE_CLASS_COUNT];
#endif
} rt_shared;
struct radix_tree_shared
{
radix_tree rt;
rt_shared *shared;
dsa_area *area;
} radix_tree_shared;
struct radix_tree_local
{
radix_tree rt;
uint64 max_val;
uint64 num_keys;
rt_node *root;
/* used only when the radix tree is private */
MemoryContextData *inner_slabs[RT_SIZE_CLASS_COUNT];
MemoryContextData *leaf_slabs[RT_SIZE_CLASS_COUNT];
/* statistics */
#ifdef RT_DEBUG
int32 cnt[RT_SIZE_CLASS_COUNT];
#endif
} radix_tree_local;
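To illustrate how a caller could use this, the dispatch would cast based on
the is_shared flag (rt_set_shared()/rt_set_local() are assumed names for the
two sets of functions, not existing ones):

static bool
rt_set_dispatch(radix_tree *rt, uint64 key, uint64 value)
{
    if (rt->is_shared)
        return rt_set_shared((radix_tree_shared *) rt, key, value);
    else
        return rt_set_local((radix_tree_local *) rt, key, value);
}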
In the functions of TID store, we need to call either local or shared
radix tree functions depending on whether TID store is shared or not.
We need if-branch for each key-value pair insertion, but I think it
would not be a big performance problem in TID store use cases, since
vacuum is an I/O intensive operation in many cases.
Also, the branch will be easily predicted. That was still true in earlier patches, but with many more branches and fatter code paths.
Overall, I think
there is no problem and I'll investigate it in depth.
Okay, great. If the separate-functions approach turns out to be ugly, we can always go back to the branching approach for shared memory. I think we'll want to keep this as a template overall, at least to allow different value types and to ease adding variable-length keys if someone finds a need.
I agree to keep this as a template. From the vacuum integration
perspective, it would be better if we can use a common data type for
shared and local. It makes sense to have different data types if the
radix trees have different value types.
Apart from that, I've been considering the lock support for shared
radix tree. As we discussed before, the current usage (i.e, only
parallel index vacuum) doesn't require locking support at all, so it
would be enough to have a single lock for simplicity.
Right, that should be enough for PG16.
If we want to
use the shared radix tree for other use cases such as the parallel
heap vacuum or the replacement of the hash table for shared buffers,
we would need better lock support.
For future parallel pruning, I still think a global lock is "probably" fine if the workers buffer in local arrays. Highly concurrent applications will need additional work, of course.
For example, if we want to support
Optimistic Lock Coupling[1],
Interesting, from the same authors!
+1
we would need to change not only the node
structure but also the logic. Which probably leads to widen the gap
between the code for non-shared and shared radix tree. In this case,
once we have a better radix tree optimized for shared case, perhaps we
can replace the templated shared radix tree with it. I'd like to hear
your opinion on this line.
I'm not in a position to speculate on how best to do scalable concurrency, much less how it should coexist with the local implementation. It's interesting that their "ROWEX" scheme gives up maintaining order in the linear nodes.
One review point I'll mention: Somehow I didn't notice there is no use for the "chunk" field in the rt_node type -- it's only set to zero and copied when growing. What is the purpose? Removing it would allow the smallest node to take up only 32 bytes with a fanout of 3, by eliminating padding.
Oh, I didn't notice that. The chunk field was originally used when
redirecting the child pointer in the parent node from old to new
(grown) node. When redirecting the pointer, since the corresponding
chunk surely exists on the parent we can skip existence checks.
Currently we use RT_NODE_UPDATE_INNER() for that (see
RT_REPLACE_NODE()) but having a dedicated function to update the
existing chunk and child pointer might improve the performance. Or
reducing the node size by getting rid of the chunk field might be
better.
I see. IIUC from a brief re-reading of the code, saving that chunk would only save us from re-loading "parent->shift" from L1 cache and shifting the key. The cycles spent doing that seem small compared to the rest of the work involved in growing a node. Expressions like "if (idx < 0) return false;" return to an asserts-only variable, so in production builds, I would hope that branch gets elided (I haven't checked).
I'm quite keen on making the smallest node padding-free, (since we don't yet have path compression or lazy path expansion), and this seems the way to get there.
Okay, let's get rid of that in the v18.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Thu, Jan 12, 2023 at 12:44 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
On Wed, Jan 11, 2023 at 12:13 PM John Naylor
<john.naylor@enterprisedb.com> wrote:On Tue, Jan 10, 2023 at 7:08 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
I agree to keep this as a template.
Okay, I'll squash the previous patch and work on cleaning up the internals.
I'll keep the external APIs the same so that your work on vacuum
integration can be easily rebased on top of that, and we can work
independently.
From the vacuum integration
perspective, it would be better if we can use a common data type for
shared and local. It makes sense to have different data types if the
radix trees have different values types.
I agree it would be better, all else being equal. I have some further
thoughts below.
It looks no problem in terms of vacuum integration, although I've not
fully tested yet. TID store uses the radix tree as the main storage,
and with the template radix tree, the data types for shared and
non-shared will be different. TID store can have an union for the
radix tree and the structure would be like follows:
/* Storage for Tids */
union tree
{
local_radix_tree *local;
shared_radix_tree *shared;
};
We could possibly go back to using a common data type for this, but
with unused fields in each setting, as before. We would have to be more
careful of things like the 32-bit crash from a few weeks ago.
One idea to have a common data type without unused fields is to use
radix_tree as a base class. We cast it to radix_tree_shared or
radix_tree_local depending on the is_shared flag in radix_tree. For
instance, we would have something like (based on the non-template version):

struct radix_tree
{
    bool is_shared;
    MemoryContext context;
};
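For illustration, the casting pattern would end up looking something like the following; the function and type names here are made up for the example, not actual patch code:

typedef struct radix_tree
{
    bool            is_shared;
    MemoryContext   context;
} radix_tree;

typedef struct radix_tree_local
{
    radix_tree      common;     /* must be the first field */
    /* slab contexts and other backend-local state ... */
} radix_tree_local;

typedef struct radix_tree_shared
{
    radix_tree      common;     /* must be the first field */
    dsa_area       *dsa;
    /* pointer to the control object in shared memory, etc. ... */
} radix_tree_shared;

bool
rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
{
    /* every entry point needs a branch and a downcast */
    if (tree->is_shared)
        return rt_search_shared((radix_tree_shared *) tree, key, value_p);
    else
        return rt_search_local((radix_tree_local *) tree, key, value_p);
}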
That could work in principle. My first impression is that just a memory context
is not much of a base class. Also, casts can creep into a large number of
places.
Another thought came to mind: I'm guessing the TID store is unusual --
meaning most uses of radix tree will only need one kind of memory
(local/shared). I could be wrong about that, and it _is_ a guess about the
future. If true, then it makes more sense that only code that needs both
memory kinds should be responsible for keeping them separate.
The template might be easier for future use cases if shared memory were
all-or-nothing, meaning either
- completely different functions and types depending on RT_SHMEM, or
- branches (like v8)
The union sounds like a good thing to try, but do whatever seems right.
--
John Naylor
EDB: http://www.enterprisedb.com
On Thu, Jan 12, 2023 at 5:21 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Thu, Jan 12, 2023 at 12:44 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Wed, Jan 11, 2023 at 12:13 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Tue, Jan 10, 2023 at 7:08 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
I agree to keep this as a template.
Okay, I'll squash the previous patch and work on cleaning up the internals. I'll keep the external APIs the same so that your work on vacuum integration can be easily rebased on top of that, and we can work independently.
Thanks!
From the vacuum integration
perspective, it would be better if we can use a common data type for
shared and local. It makes sense to have different data types if the
radix trees have different value types.

I agree it would be better, all else being equal. I have some further thoughts below.
There seems to be no problem in terms of vacuum integration, although I've not
fully tested it yet. TID store uses the radix tree as the main storage,
and with the template radix tree, the data types for shared and
non-shared will be different. TID store can have a union for the
radix tree, and the structure would be like the following:

/* Storage for Tids */
union tree
{
    local_radix_tree *local;
    shared_radix_tree *shared;
};

We could possibly go back to using a common data type for this, but with unused fields in each setting, as before. We would have to be more careful of things like the 32-bit crash from a few weeks ago.
One idea to have a common data type without unused fields is to use
radix_tree as a base class. We cast it to radix_tree_shared or
radix_tree_local depending on the is_shared flag in radix_tree. For
instance, we would have something like (based on the non-template version):

struct radix_tree
{
    bool is_shared;
    MemoryContext context;
};

That could work in principle. My first impression is that just a memory context is not much of a base class. Also, casts can creep into a large number of places.
Another thought came to mind: I'm guessing the TID store is unusual -- meaning most uses of radix tree will only need one kind of memory (local/shared). I could be wrong about that, and it _is_ a guess about the future. If true, then it makes more sense that only code that needs both memory kinds should be responsible for keeping them separate.
True.
The template might be easier for future use cases if shared memory were all-or-nothing, meaning either
- completely different functions and types depending on RT_SHMEM, or
- branches (like v8)

The union sounds like a good thing to try, but do whatever seems right.
I've implemented the idea of using a union. Let me share WIP code for
discussion. I've attached three patches that can be applied on top of
the v17-0009 patch. v17-0010 implements missing shared memory support
functions such as RT_DETACH and RT_GET_HANDLE, and some fixes.
The v17-0011 patch adds TidStore, and the v17-0012 patch is the vacuum
integration.
Overall, the TidStore implementation with the union idea doesn't look so
ugly to me. But I got many compiler warnings about unused radix tree
functions like:
tidstore.c:99:19: warning: 'shared_rt_delete' defined but not used
[-Wunused-function]
I'm not sure there is a convenient way to suppress these warnings, but
one idea is to have some macros to specify what operations are
enabled/declared.
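For example, a rough sketch of such opt-in macros -- RT_USE_SET and friends are hypothetical names, not something the template currently provides -- could look like:

/* In tidstore.c: only declare/define the operations we actually call */
#define RT_PREFIX shared_rt
#define RT_SHMEM
#define RT_SCOPE static
#define RT_DECLARE
#define RT_DEFINE
#define RT_USE_SET              /* emit shared_rt_set() */
#define RT_USE_SEARCH           /* emit shared_rt_search() */
/* no RT_USE_DELETE, so shared_rt_delete() is never emitted and cannot warn */
#include "lib/radixtree.h"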
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachments:
v17-0010-fix-shmem-support.patch
From 56a45a0731abc33b3894d0aa0de06869d894637b Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Thu, 12 Jan 2023 23:22:22 +0900
Subject: [PATCH v17 10/12] fix shmem support
---
src/include/lib/radixtree.h | 87 ++++++++++++++++++++++++---
src/include/lib/radixtree_iter_impl.h | 4 ++
2 files changed, 82 insertions(+), 9 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 2b58a0cdf5..a2e2e7a190 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -100,6 +100,8 @@
#define RT_SEARCH RT_MAKE_NAME(search)
#ifdef RT_SHMEM
#define RT_ATTACH RT_MAKE_NAME(attach)
+#define RT_DETACH RT_MAKE_NAME(detach)
+#define RT_GET_HANDLE RT_MAKE_NAME(get_handle)
#endif
#define RT_SET RT_MAKE_NAME(set)
#define RT_BEGIN_ITERATE RT_MAKE_NAME(begin_iterate)
@@ -164,6 +166,9 @@
#define RT_RADIX_TREE RT_MAKE_NAME(radix_tree)
#define RT_RADIX_TREE_CONTROL RT_MAKE_NAME(radix_tree_control)
#define RT_ITER RT_MAKE_NAME(iter)
+#ifdef RT_SHMEM
+#define RT_HANDLE RT_MAKE_NAME(handle)
+#endif
#define RT_NODE RT_MAKE_NAME(node)
#define RT_NODE_ITER RT_MAKE_NAME(node_iter)
#define RT_NODE_BASE_4 RT_MAKE_NAME(node_base_4)
@@ -194,9 +199,15 @@
typedef struct RT_RADIX_TREE RT_RADIX_TREE;
typedef struct RT_ITER RT_ITER;
+#ifdef RT_SHMEM
+typedef dsa_pointer RT_HANDLE;
+#endif
+
#ifdef RT_SHMEM
RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx, dsa_area *dsa);
RT_SCOPE RT_RADIX_TREE * RT_ATTACH(dsa_area *dsa, dsa_pointer dp);
+RT_SCOPE void RT_DETACH(RT_RADIX_TREE *tree);
+RT_SCOPE RT_HANDLE RT_GET_HANDLE(RT_RADIX_TREE *tree);
#else
RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx);
#endif
@@ -542,9 +553,19 @@ static const RT_SIZE_CLASS RT_KIND_MIN_SIZE_CLASS[RT_NODE_KIND_COUNT] = {
[RT_NODE_KIND_256] = RT_CLASS_256,
};
+#ifdef RT_SHMEM
+/* A magic value used to identify our radix tree */
+#define RT_RADIX_TREE_MAGIC 0x54A48167
+#endif
+
/* A radix tree with nodes */
typedef struct RT_RADIX_TREE_CONTROL
{
+#ifdef RT_SHMEM
+ RT_HANDLE handle;
+ uint32 magic;
+#endif
+
RT_PTR_ALLOC root;
uint64 max_val;
uint64 num_keys;
@@ -565,7 +586,6 @@ typedef struct RT_RADIX_TREE
#ifdef RT_SHMEM
dsa_area *dsa;
- dsa_pointer ctl_dp;
#else
MemoryContextData *inner_slabs[RT_SIZE_CLASS_COUNT];
MemoryContextData *leaf_slabs[RT_SIZE_CLASS_COUNT];
@@ -1311,6 +1331,9 @@ RT_CREATE(MemoryContext ctx)
{
RT_RADIX_TREE *tree;
MemoryContext old_ctx;
+#ifdef RT_SHMEM
+ dsa_pointer dp;
+#endif
old_ctx = MemoryContextSwitchTo(ctx);
@@ -1319,8 +1342,10 @@ RT_CREATE(MemoryContext ctx)
#ifdef RT_SHMEM
tree->dsa = dsa;
- tree->ctl_dp = dsa_allocate0(dsa, sizeof(RT_RADIX_TREE_CONTROL));
- tree->ctl = (RT_RADIX_TREE_CONTROL *) dsa_get_address(dsa, tree->ctl_dp);
+ dp = dsa_allocate0(dsa, sizeof(RT_RADIX_TREE_CONTROL));
+ tree->ctl = (RT_RADIX_TREE_CONTROL *) dsa_get_address(dsa, dp);
+ tree->ctl->handle = dp;
+ tree->ctl->magic = RT_RADIX_TREE_MAGIC;
#else
tree->ctl = (RT_RADIX_TREE_CONTROL *) palloc0(sizeof(RT_RADIX_TREE_CONTROL));
@@ -1346,21 +1371,40 @@ RT_CREATE(MemoryContext ctx)
}
#ifdef RT_SHMEM
-RT_RADIX_TREE *
-RT_ATTACH(dsa_area *dsa, dsa_pointer dp)
+RT_SCOPE RT_RADIX_TREE *
+RT_ATTACH(dsa_area *dsa, RT_HANDLE handle)
{
RT_RADIX_TREE *tree;
+ dsa_pointer control;
/* XXX: memory context support */
tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE));
- tree->ctl_dp = dp;
- tree->ctl = (RT_RADIX_TREE_CONTROL *) dsa_get_address(dsa, dp);
+ /* Find the control object in shared memory */
+ control = handle;
+
+ tree->dsa = dsa;
+ tree->ctl = (RT_RADIX_TREE_CONTROL *) dsa_get_address(dsa, control);
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
/* XXX: do we need to set a callback on exit to detach dsa? */
return tree;
}
+
+RT_SCOPE void
+RT_DETACH(RT_RADIX_TREE *tree)
+{
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+ pfree(tree);
+}
+
+RT_SCOPE RT_HANDLE
+RT_GET_HANDLE(RT_RADIX_TREE *tree)
+{
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+ return tree->ctl->handle;
+}
#endif
/*
@@ -1370,8 +1414,15 @@ RT_SCOPE void
RT_FREE(RT_RADIX_TREE *tree)
{
#ifdef RT_SHMEM
- dsa_free(tree->dsa, tree->ctl_dp); // XXX
- dsa_detach(tree->dsa);
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+
+ /*
+ * Vandalize the control block to help catch programming errors where
+ * other backends access the memory formerly occupied by this radix tree.
+ */
+ tree->ctl->magic = 0;
+ dsa_free(tree->dsa, tree->ctl->handle); // XXX
+ //dsa_detach(tree->dsa);
#else
pfree(tree->ctl);
@@ -1398,6 +1449,10 @@ RT_SET(RT_RADIX_TREE *tree, uint64 key, uint64 value)
RT_PTR_ALLOC nodep;
RT_PTR_LOCAL node;
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+
/* Empty tree, create the root */
if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
RT_NEW_ROOT(tree, key);
@@ -1453,6 +1508,9 @@ RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, uint64 *value_p)
RT_PTR_LOCAL node;
int shift;
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
Assert(value_p != NULL);
if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root) || key > tree->ctl->max_val)
@@ -1493,6 +1551,10 @@ RT_DELETE(RT_RADIX_TREE *tree, uint64 key)
int level;
bool deleted;
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+
if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root) || key > tree->ctl->max_val)
return false;
@@ -1736,6 +1798,7 @@ RT_MEMORY_USAGE(RT_RADIX_TREE *tree)
Size total = sizeof(RT_RADIX_TREE);
#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
total = dsa_get_total_size(tree->dsa);
#else
for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
@@ -2085,10 +2148,14 @@ rt_dump(RT_RADIX_TREE *tree)
#undef VAR_NODE_HAS_FREE_SLOT
#undef FIXED_NODE_HAS_FREE_SLOT
#undef RT_SIZE_CLASS_COUNT
+#undef RT_RADIX_TREE_MAGIC
/* type declarations */
#undef RT_RADIX_TREE
#undef RT_RADIX_TREE_CONTROL
+#undef RT_PTR_ALLOC
+#undef RT_INVALID_PTR_ALLOC
+#undef RT_HANDLE
#undef RT_ITER
#undef RT_NODE
#undef RT_NODE_ITER
@@ -2118,6 +2185,8 @@ rt_dump(RT_RADIX_TREE *tree)
#undef RT_CREATE
#undef RT_FREE
#undef RT_ATTACH
+#undef RT_DETACH
+#undef RT_GET_HANDLE
#undef RT_SET
#undef RT_BEGIN_ITERATE
#undef RT_ITERATE_NEXT
diff --git a/src/include/lib/radixtree_iter_impl.h b/src/include/lib/radixtree_iter_impl.h
index 09d2018dc0..fd00851732 100644
--- a/src/include/lib/radixtree_iter_impl.h
+++ b/src/include/lib/radixtree_iter_impl.h
@@ -12,6 +12,10 @@
#error node level must be either inner or leaf
#endif
+#ifdef RT_SHMEM
+ Assert(iter->tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+
bool found = false;
uint8 key_chunk;
--
2.31.1
v17-0011-Add-TIDStore-to-store-sets-of-TIDs-ItemPointerDa.patch
From c9e8bb135bdfc555153f1e6b324968701f6a26a0 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 23 Dec 2022 23:50:39 +0900
Subject: [PATCH v17 11/12] Add TIDStore, to store sets of TIDs
(ItemPointerData) efficiently.
The TIDStore is designed to store large sets of TIDs efficiently, and
is backed by the radix tree. A TID is encoded into a 64-bit key and
value and inserted into the radix tree.
The TIDStore is not used for anything yet, aside from the test code,
but the follow-up patch integrates the TIDStore with lazy vacuum,
reducing lazy vacuum memory usage and lifting the 1GB limit on its
size, by storing the list of dead TIDs more efficiently.
This includes a unit test module, in src/test/modules/test_tidstore.
---
src/backend/access/common/Makefile | 1 +
src/backend/access/common/meson.build | 1 +
src/backend/access/common/tidstore.c | 587 ++++++++++++++++++
src/include/access/tidstore.h | 49 ++
src/test/modules/test_tidstore/Makefile | 23 +
.../test_tidstore/expected/test_tidstore.out | 13 +
src/test/modules/test_tidstore/meson.build | 34 +
.../test_tidstore/sql/test_tidstore.sql | 7 +
.../test_tidstore/test_tidstore--1.0.sql | 8 +
.../test_tidstore/test_tidstore.control | 4 +
10 files changed, 727 insertions(+)
create mode 100644 src/backend/access/common/tidstore.c
create mode 100644 src/include/access/tidstore.h
create mode 100644 src/test/modules/test_tidstore/Makefile
create mode 100644 src/test/modules/test_tidstore/expected/test_tidstore.out
create mode 100644 src/test/modules/test_tidstore/meson.build
create mode 100644 src/test/modules/test_tidstore/sql/test_tidstore.sql
create mode 100644 src/test/modules/test_tidstore/test_tidstore--1.0.sql
create mode 100644 src/test/modules/test_tidstore/test_tidstore.control
diff --git a/src/backend/access/common/Makefile b/src/backend/access/common/Makefile
index b9aff0ccfd..67b8cc6108 100644
--- a/src/backend/access/common/Makefile
+++ b/src/backend/access/common/Makefile
@@ -27,6 +27,7 @@ OBJS = \
syncscan.o \
toast_compression.o \
toast_internals.o \
+ tidstore.o \
tupconvert.o \
tupdesc.o
diff --git a/src/backend/access/common/meson.build b/src/backend/access/common/meson.build
index f5ac17b498..fce19c09ce 100644
--- a/src/backend/access/common/meson.build
+++ b/src/backend/access/common/meson.build
@@ -15,6 +15,7 @@ backend_sources += files(
'syncscan.c',
'toast_compression.c',
'toast_internals.c',
+ 'tidstore.c',
'tupconvert.c',
'tupdesc.c',
)
diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
new file mode 100644
index 0000000000..4170d13b3c
--- /dev/null
+++ b/src/backend/access/common/tidstore.c
@@ -0,0 +1,587 @@
+/*-------------------------------------------------------------------------
+ *
+ * tidstore.c
+ * Tid (ItemPointerData) storage implementation.
+ *
+ * This module provides an in-memory data structure to store Tids (ItemPointer).
+ * Internally, Tids are encoded as a pair of 64-bit key and 64-bit value, and
+ * stored in the radix tree.
+ *
+ * A TidStore can be shared among parallel worker processes by passing a DSA area
+ * to tidstore_create(). Other backends can attach to the shared TidStore by
+ * tidstore_attach(). It can support concurrent updates but only one process
+ * is allowed to iterate over the TidStore at a time.
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/common/tidstore.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "access/tidstore.h"
+#include "lib/radixtree.h"
+#include "miscadmin.h"
+#include "port/pg_bitutils.h"
+#include "utils/dsa.h"
+#include "utils/memutils.h"
+
+/*
+ * For encoding purposes, item pointers are represented as a pair of 64-bit
+ * key and 64-bit value. First, we construct 64-bit unsigned integer key that
+ * combines the block number and the offset number. The lowest 11 bits represent
+ * the offset number, and the next 32 bits are the block number. That is, only 43
+ * bits are used:
+ *
+ * XXXXXXXX XXXYYYYY YYYYYYYY YYYYYYYY YYYYYYYY YYYuuuu
+ *
+ * X = bits used for offset number
+ * Y = bits used for block number
+ * u = unused bit
+ *
+ * 11 bits are enough for the offset number, because MaxHeapTuplesPerPage < 2^11
+ * on all supported block sizes (TIDSTORE_OFFSET_NBITS). We are frugal with
+ * the bits, because smaller keys could help keep the radix tree shallow.
+ *
+ * XXX: If we want to support other table AMs that want to use the full range
+ * of possible offset numbers, we'll need to change this.
+ *
+ * The 64-bit value is the bitmap representation of the lowest 6 bits, and
+ * the remaining 37 bits are used as the key:
+ *
+ * value = bitmap representation of XXXXXX
+ * key = XXXXXYYY YYYYYYYY YYYYYYYY YYYYYYYY YYYYYuu
+ *
+ * The maximum height of the radix tree is 5.
+ *
+ * XXX: if we want to support non-heap table AM, we need to reconsider
+ * TIDSTORE_OFFSET_NBITS value.
+ */
+#define TIDSTORE_OFFSET_NBITS 11
+#define TIDSTORE_VALUE_NBITS 6
+
+/*
+ * Memory consumption depends on the number of Tids stored, but also on the
+ * distribution of them and how the radix tree stores them. The maximum bytes
+ * that a TidStore can use is specified by the max_bytes argument of tidstore_create().
+ *
+ * In non-shared cases, the radix tree uses a slab allocator for each kind of
+ * node class. The most memory consuming case while adding Tids associated
+ * with one page (i.e. during tidstore_add_tids()) is that we allocate the
+ * largest radix tree node in a new slab block, which is approximately 70kB.
+ * Therefore, we deduct 70kB from the maximum bytes.
+ *
+ * In shared cases, DSA allocates memory segments big enough to follow
+ * a geometric series that approximately doubles the total DSA size. So we
+ * limit the maximum bytes for a TidStore to 75%. The 75% threshold works
+ * perfectly in cases where the maximum bytes is a power of 2. In other cases,
+ * we use a 60% threshold.
+ */
+#define TIDSTORE_MEMORY_DEDUCT_BYTES (1024L * 70) /* 70kB */
+
+/* Get block number from the key */
+#define KEY_GET_BLKNO(key) \
+ ((BlockNumber) ((key) >> (TIDSTORE_OFFSET_NBITS - TIDSTORE_VALUE_NBITS)))
+
+/* A magic value used to identify our TidStores. */
+#define TIDSTORE_MAGIC 0x826f6a10
+
+#define RT_PREFIX local_rt
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#include "lib/radixtree.h"
+
+#define RT_PREFIX shared_rt
+#define RT_SHMEM
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#include "lib/radixtree.h"
+
+/* The header object for a TidStore */
+typedef struct TidStoreControl
+{
+ /*
+ * 'num_tids' is the number of Tids stored so far. 'max_bytes' is the maximum
+ * bytes a TidStore can use. These two fields are commonly used in both
+ * non-shared case and shared case.
+ */
+ uint32 num_tids;
+ uint64 max_bytes;
+
+ /* The below fields are used only in shared case */
+
+ uint32 magic;
+
+ /* handles for TidStore and radix tree */
+ tidstore_handle handle;
+ shared_rt_handle tree_handle;
+} TidStoreControl;
+
+/* Per-backend state for a TidStore */
+struct TidStore
+{
+ /*
+ * Control object. This is allocated in DSA area 'area' in the shared
+ * case, otherwise in backend-local memory.
+ */
+ TidStoreControl *control;
+
+ /* Storage for Tids */
+ union
+ {
+ local_rt_radix_tree *local;
+ shared_rt_radix_tree *shared;
+ } tree;
+
+ /* DSA area for TidStore if used */
+ dsa_area *area;
+};
+#define TidStoreIsShared(ts) ((ts)->area != NULL)
+
+/* Iterator for TidStore */
+typedef struct TidStoreIter
+{
+ TidStore *ts;
+
+ /* iterator of radix tree */
+ union
+ {
+ shared_rt_iter *shared;
+ local_rt_iter *local;
+ } tree_iter;
+
+ /* we returned all tids? */
+ bool finished;
+
+ /* save for the next iteration */
+ uint64 next_key;
+ uint64 next_val;
+
+ /* output for the caller */
+ TidStoreIterResult result;
+} TidStoreIter;
+
+static void tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, uint64 val);
+static inline uint64 tid_to_key_off(ItemPointer tid, uint32 *off);
+
+/*
+ * Create a TidStore. The returned object is allocated in backend-local memory.
+ * The radix tree for storage is allocated in the DSA area if 'area' is non-NULL.
+ */
+TidStore *
+tidstore_create(uint64 max_bytes, dsa_area *area)
+{
+ TidStore *ts;
+
+ ts = palloc0(sizeof(TidStore));
+
+ /*
+ * Create the radix tree for the main storage.
+ *
+ * We calculate the maximum bytes for the TidStore in different ways
+ * for the non-shared case and the shared case. Please refer to the comment
+ * above TIDSTORE_MEMORY_DEDUCT_BYTES for details.
+ */
+ if (area != NULL)
+ {
+ dsa_pointer dp;
+ float ratio = ((max_bytes & (max_bytes - 1)) == 0) ? 0.75 : 0.6;
+
+ ts->tree.shared = shared_rt_create(CurrentMemoryContext, area);
+
+ dp = dsa_allocate0(area, sizeof(TidStoreControl));
+ ts->control = (TidStoreControl *) dsa_get_address(area, dp);
+ ts->control->max_bytes = (uint64) (max_bytes * ratio);
+ ts->area = area;
+
+ ts->control->magic = TIDSTORE_MAGIC;
+ ts->control->handle = dp;
+ ts->control->tree_handle = shared_rt_get_handle(ts->tree.shared);
+ }
+ else
+ {
+ ts->tree.local = local_rt_create(CurrentMemoryContext);
+
+ ts->control = (TidStoreControl *) palloc0(sizeof(TidStoreControl));
+ ts->control->max_bytes = max_bytes - TIDSTORE_MEMORY_DEDUCT_BYTES;
+ }
+
+ return ts;
+}
+
+/*
+ * Attach to the shared TidStore using a handle. The returned object is
+ * allocated in backend-local memory using the CurrentMemoryContext.
+ */
+TidStore *
+tidstore_attach(dsa_area *area, tidstore_handle handle)
+{
+ TidStore *ts;
+ dsa_pointer control;
+
+ Assert(area != NULL);
+ Assert(DsaPointerIsValid(handle));
+
+ /* create per-backend state */
+ ts = palloc0(sizeof(TidStore));
+
+ /* Find the control object in shared memory */
+ control = handle;
+
+ /* Set up the TidStore */
+ ts->control = (TidStoreControl *) dsa_get_address(area, control);
+ Assert(ts->control->magic == TIDSTORE_MAGIC);
+
+ ts->tree.shared = shared_rt_attach(area, ts->control->tree_handle);
+ ts->area = area;
+
+ return ts;
+}
+
+/*
+ * Detach from a TidStore. This detaches from radix tree and frees the
+ * backend-local resources. The radix tree will continue to exist until
+ * it is either explicitly destroyed, or the area that backs it is returned
+ * to the operating system.
+ */
+void
+tidstore_detach(TidStore *ts)
+{
+ Assert(TidStoreIsShared(ts) && ts->control->magic == TIDSTORE_MAGIC);
+
+ shared_rt_detach(ts->tree.shared);
+ pfree(ts);
+}
+
+/*
+ * Destroy a TidStore, returning all memory. The caller must be certain that
+ * no other backend will attempt to access the TidStore before calling this
+ * function. Other backends must explicitly call tidstore_detach to free up
+ * backend-local memory associated with the TidStore. The backend that calls
+ * tidstore_destroy must not call tidstore_detach.
+ */
+void
+tidstore_destroy(TidStore *ts)
+{
+ if (TidStoreIsShared(ts))
+ {
+ Assert(ts->control->magic == TIDSTORE_MAGIC);
+
+ /*
+ * Vandalize the control block to help catch programming errors where
+ * other backends access the memory formerly occupied by this radix tree.
+ */
+ ts->control->magic = 0;
+ dsa_free(ts->area, ts->control->handle);
+ shared_rt_free(ts->tree.shared);
+ }
+ else
+ {
+ pfree(ts->control);
+ local_rt_free(ts->tree.local);
+ }
+
+ pfree(ts);
+}
+
+/* Forget all collected Tids */
+void
+tidstore_reset(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ /* Reset the statistics */
+ ts->control->num_tids = 0;
+
+ /*
+ * Free the current radix tree, and return allocated DSM segments
+ * to the operating system, if necessary. */
+ if (TidStoreIsShared(ts))
+ {
+ shared_rt_free(ts->tree.shared);
+ dsa_trim(ts->area);
+
+ /* Recreate the radix tree */
+ ts->tree.shared = shared_rt_create(CurrentMemoryContext, ts->area);
+
+ /* update the radix tree handle as we recreated it */
+ ts->control->tree_handle = shared_rt_get_handle(ts->tree.shared);
+ }
+ else
+ {
+ local_rt_free(ts->tree.local);
+ ts->tree.local = local_rt_create(CurrentMemoryContext);
+ }
+}
+
+static inline void
+tidstore_insert_kv(TidStore *ts, uint64 key, uint64 val)
+{
+ if (TidStoreIsShared(ts))
+ shared_rt_set(ts->tree.shared, key, val);
+ else
+ local_rt_set(ts->tree.local, key, val);
+}
+
+/*
+ * Add Tids on a block to TidStore. The caller must ensure the offset numbers
+ * in 'offsets' are in ascending order.
+ */
+void
+tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets)
+{
+ uint64 last_key = PG_UINT64_MAX;
+ uint64 key;
+ uint64 val = 0;
+ ItemPointerData tid;
+
+ ItemPointerSetBlockNumber(&tid, blkno);
+
+ for (int i = 0; i < num_offsets; i++)
+ {
+ uint32 off;
+
+ ItemPointerSetOffsetNumber(&tid, offsets[i]);
+
+ key = tid_to_key_off(&tid, &off);
+
+ if (last_key != PG_UINT64_MAX && last_key != key)
+ {
+ /* insert the key-value */
+ tidstore_insert_kv(ts, last_key, val);
+ val = 0;
+ }
+
+ last_key = key;
+ val |= UINT64CONST(1) << off;
+ }
+
+ if (last_key != PG_UINT64_MAX)
+ {
+ /* insert the key-value */
+ tidstore_insert_kv(ts, last_key, val);
+ }
+
+ /* update statistics */
+ ts->control->num_tids += num_offsets;
+}
+
+/* Return true if the given Tid is present in TidStore */
+bool
+tidstore_lookup_tid(TidStore *ts, ItemPointer tid)
+{
+ uint64 key;
+ uint64 val;
+ uint32 off;
+ bool found;
+
+ key = tid_to_key_off(tid, &off);
+
+ found = TidStoreIsShared(ts) ?
+ shared_rt_search(ts->tree.shared, key, &val) :
+ local_rt_search(ts->tree.local, key, &val);
+
+ if (!found)
+ return false;
+
+ return (val & (UINT64CONST(1) << off)) != 0;
+}
+
+/*
+ * Prepare to iterate through a TidStore. The caller must be certain that
+ * no other backend will attempt to update the TidStore during the iteration.
+ */
+TidStoreIter *
+tidstore_begin_iterate(TidStore *ts)
+{
+ TidStoreIter *iter;
+
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ iter = palloc0(sizeof(TidStoreIter));
+ iter->ts = ts;
+ iter->result.blkno = InvalidBlockNumber;
+
+ if (TidStoreIsShared(ts))
+ iter->tree_iter.shared = shared_rt_begin_iterate(ts->tree.shared);
+ else
+ iter->tree_iter.local = local_rt_begin_iterate(ts->tree.local);
+
+ /* If the TidStore is empty, there is nothing to do */
+ if (ts->control->num_tids == 0)
+ iter->finished = true;
+
+ return iter;
+}
+
+static inline bool
+tidstore_iter_kv(TidStoreIter *iter, uint64 *key, uint64 *val)
+{
+ if (TidStoreIsShared(iter->ts))
+ return shared_rt_iterate_next(iter->tree_iter.shared, key, val);
+ else
+ return local_rt_iterate_next(iter->tree_iter.local, key, val);
+}
+
+/*
+ * Scan the TidStore and return a TidStoreIterResult representing Tids
+ * in one page. Offset numbers in the result are sorted.
+ */
+TidStoreIterResult *
+tidstore_iterate_next(TidStoreIter *iter)
+{
+ uint64 key;
+ uint64 val;
+ TidStoreIterResult *result = &(iter->result);
+
+ if (iter->finished)
+ return NULL;
+
+ if (BlockNumberIsValid(result->blkno))
+ {
+ result->num_offsets = 0;
+ tidstore_iter_extract_tids(iter, iter->next_key, iter->next_val);
+ }
+
+ while (tidstore_iter_kv(iter, &key, &val))
+ {
+ BlockNumber blkno;
+
+ blkno = KEY_GET_BLKNO(key);
+
+ if (BlockNumberIsValid(result->blkno) && result->blkno != blkno)
+ {
+ /*
+ * Remember the key-value pair for the next block for the
+ * next iteration.
+ */
+ iter->next_key = key;
+ iter->next_val = val;
+ return result;
+ }
+
+ /* Collect tids extracted from the key-value pair */
+ tidstore_iter_extract_tids(iter, key, val);
+ }
+
+ iter->finished = true;
+ return result;
+}
+
+/* Finish an iteration over TidStore */
+void
+tidstore_end_iterate(TidStoreIter *iter)
+{
+ if (TidStoreIsShared(iter->ts))
+ shared_rt_end_iterate(iter->tree_iter.shared);
+ else
+ local_rt_end_iterate(iter->tree_iter.local);
+
+ pfree(iter);
+}
+
+/* Return the number of Tids we collected so far */
+uint64
+tidstore_num_tids(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ return ts->control->num_tids;
+}
+
+/* Return true if the current memory usage of TidStore exceeds the limit */
+bool
+tidstore_is_full(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ return (tidstore_memory_usage(ts) > ts->control->max_bytes);
+}
+
+/* Return the maximum memory TidStore can use */
+uint64
+tidstore_max_memory(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ return ts->control->max_bytes;
+}
+
+/* Return the memory usage of TidStore */
+uint64
+tidstore_memory_usage(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ /*
+ * In the shared case, TidStoreControl and radix_tree are backed by the
+ * same DSA area and rt_memory_usage() returns the value including both.
+ * So we don't need to add the size of TidStoreControl separately.
+ */
+ if (TidStoreIsShared(ts))
+ return (uint64) sizeof(TidStore) + shared_rt_memory_usage(ts->tree.shared);
+ else
+ return (uint64) sizeof(TidStore) + sizeof(TidStoreControl) +
+ local_rt_memory_usage(ts->tree.local);
+}
+
+/*
+ * Get a handle that can be used by other processes to attach to this TidStore
+ */
+tidstore_handle
+tidstore_get_handle(TidStore *ts)
+{
+ Assert(TidStoreIsShared(ts) && ts->control->magic == TIDSTORE_MAGIC);
+
+ return ts->control->handle;
+}
+
+/* Extract Tids from key-value pair */
+static void
+tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, uint64 val)
+{
+ TidStoreIterResult *result = (&iter->result);
+
+ for (int i = 0; i < sizeof(uint64) * BITS_PER_BYTE; i++)
+ {
+ uint64 tid_i;
+ OffsetNumber off;
+
+ if ((val & (UINT64CONST(1) << i)) == 0)
+ continue;
+
+ tid_i = key << TIDSTORE_VALUE_NBITS;
+ tid_i |= i;
+
+ off = tid_i & ((UINT64CONST(1) << TIDSTORE_OFFSET_NBITS) - 1);
+ result->offsets[result->num_offsets++] = off;
+ }
+
+ result->blkno = KEY_GET_BLKNO(key);
+}
+
+/*
+ * Encode a Tid into a key, and set *off to the bit position within the value.
+ */
+static inline uint64
+tid_to_key_off(ItemPointer tid, uint32 *off)
+{
+ uint64 upper;
+ uint64 tid_i;
+
+ tid_i = ItemPointerGetOffsetNumber(tid);
+ tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << TIDSTORE_OFFSET_NBITS;
+
+ *off = tid_i & ((1 << TIDSTORE_VALUE_NBITS) - 1);
+ upper = tid_i >> TIDSTORE_VALUE_NBITS;
+ Assert(*off < (sizeof(uint64) * BITS_PER_BYTE));
+
+ return upper;
+}
diff --git a/src/include/access/tidstore.h b/src/include/access/tidstore.h
new file mode 100644
index 0000000000..4bffdf0920
--- /dev/null
+++ b/src/include/access/tidstore.h
@@ -0,0 +1,49 @@
+/*-------------------------------------------------------------------------
+ *
+ * tidstore.h
+ * Tid storage.
+ *
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/tidstore.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef TIDSTORE_H
+#define TIDSTORE_H
+
+#include "lib/radixtree.h"
+#include "storage/itemptr.h"
+
+typedef dsa_pointer tidstore_handle;
+
+typedef struct TidStore TidStore;
+typedef struct TidStoreIter TidStoreIter;
+
+typedef struct TidStoreIterResult
+{
+ BlockNumber blkno;
+ OffsetNumber offsets[MaxOffsetNumber]; /* XXX: usually don't use up */
+ int num_offsets;
+} TidStoreIterResult;
+
+extern TidStore *tidstore_create(uint64 max_bytes, dsa_area *dsa);
+extern TidStore *tidstore_attach(dsa_area *dsa, dsa_pointer handle);
+extern void tidstore_detach(TidStore *ts);
+extern void tidstore_destroy(TidStore *ts);
+extern void tidstore_reset(TidStore *ts);
+extern void tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets);
+extern bool tidstore_lookup_tid(TidStore *ts, ItemPointer tid);
+extern TidStoreIter * tidstore_begin_iterate(TidStore *ts);
+extern TidStoreIterResult *tidstore_iterate_next(TidStoreIter *iter);
+extern void tidstore_end_iterate(TidStoreIter *iter);
+extern uint64 tidstore_num_tids(TidStore *ts);
+extern bool tidstore_is_full(TidStore *ts);
+extern uint64 tidstore_max_memory(TidStore *ts);
+extern uint64 tidstore_memory_usage(TidStore *ts);
+extern tidstore_handle tidstore_get_handle(TidStore *ts);
+
+#endif /* TIDSTORE_H */
diff --git a/src/test/modules/test_tidstore/Makefile b/src/test/modules/test_tidstore/Makefile
new file mode 100644
index 0000000000..1973963440
--- /dev/null
+++ b/src/test/modules/test_tidstore/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_tidstore/Makefile
+
+MODULE_big = test_tidstore
+OBJS = \
+ $(WIN32RES) \
+ test_tidstore.o
+PGFILEDESC = "test_tidstore - test code for src/backend/access/common/tidstore.c"
+
+EXTENSION = test_tidstore
+DATA = test_tidstore--1.0.sql
+
+REGRESS = test_tidstore
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_tidstore
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_tidstore/expected/test_tidstore.out b/src/test/modules/test_tidstore/expected/test_tidstore.out
new file mode 100644
index 0000000000..7ff2f9af87
--- /dev/null
+++ b/src/test/modules/test_tidstore/expected/test_tidstore.out
@@ -0,0 +1,13 @@
+CREATE EXTENSION test_tidstore;
+--
+-- All the logic is in the test_tidstore() function. It will throw
+-- an error if something fails.
+--
+SELECT test_tidstore();
+NOTICE: testing empty tidstore
+NOTICE: testing basic operations
+ test_tidstore
+---------------
+
+(1 row)
+
diff --git a/src/test/modules/test_tidstore/meson.build b/src/test/modules/test_tidstore/meson.build
new file mode 100644
index 0000000000..3365b073a4
--- /dev/null
+++ b/src/test/modules/test_tidstore/meson.build
@@ -0,0 +1,34 @@
+# FIXME: prevent install during main install, but not during test :/
+
+test_tidstore_sources = files(
+ 'test_tidstore.c',
+)
+
+if host_system == 'windows'
+ test_tidstore_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'test_tidstore',
+ '--FILEDESC', 'test_tidstore - test code for src/backend/access/common/tidstore.c',])
+endif
+
+test_tidstore = shared_module('test_tidstore',
+ test_tidstore_sources,
+ kwargs: pg_mod_args,
+)
+testprep_targets += test_tidstore
+
+install_data(
+ 'test_tidstore.control',
+ 'test_tidstore--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'test_tidstore',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'test_tidstore',
+ ],
+ },
+}
diff --git a/src/test/modules/test_tidstore/sql/test_tidstore.sql b/src/test/modules/test_tidstore/sql/test_tidstore.sql
new file mode 100644
index 0000000000..03aea31815
--- /dev/null
+++ b/src/test/modules/test_tidstore/sql/test_tidstore.sql
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_tidstore;
+
+--
+-- All the logic is in the test_tidstore() function. It will throw
+-- an error if something fails.
+--
+SELECT test_tidstore();
diff --git a/src/test/modules/test_tidstore/test_tidstore--1.0.sql b/src/test/modules/test_tidstore/test_tidstore--1.0.sql
new file mode 100644
index 0000000000..47e9149900
--- /dev/null
+++ b/src/test/modules/test_tidstore/test_tidstore--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_tidstore/test_tidstore--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_tidstore" to load this file. \quit
+
+CREATE FUNCTION test_tidstore()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_tidstore/test_tidstore.control b/src/test/modules/test_tidstore/test_tidstore.control
new file mode 100644
index 0000000000..9b6bd4638f
--- /dev/null
+++ b/src/test/modules/test_tidstore/test_tidstore.control
@@ -0,0 +1,4 @@
+comment = 'Test code for tidstore'
+default_version = '1.0'
+module_pathname = '$libdir/test_tidstore'
+relocatable = true
--
2.31.1
v17-0012-Use-TIDStore-for-storing-dead-tuple-TID-during-l.patch
From 6f5a52a3bd7c018b42cbd7db1f9cad47d378c816 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Thu, 12 Jan 2023 22:04:20 +0900
Subject: [PATCH v17 12/12] Use TIDStore for storing dead tuple TID during lazy
vacuum.
Previously, we used an array of ItemPointerData to store dead tuple
TIDs, which is not space efficient and is slow to look up. Also, we had
a 1GB limit on its size.
This change uses TIDStore for this purpose. Since the TIDStore,
backed by the radix tree, incrementally allocates memory, we get
rid of the 1GB limit.
Also, since we can no longer exactly estimate the maximum number of
TIDs that can be stored based on the amount of memory, the progress
columns max_dead_tuples and num_dead_tuples are renamed and now report
the progress information in bytes.
Furthermore, since TIDStore uses the radix tree internally, the
minimum amount of memory required by TIDStore is 1MB, which is the
initial DSA segment size. Due to that, this change increases the minimum
maintenance_work_mem from 1MB to 2MB.
---
doc/src/sgml/monitoring.sgml | 8 +-
src/backend/access/heap/vacuumlazy.c | 169 +++++++--------------
src/backend/catalog/system_views.sql | 2 +-
src/backend/commands/vacuum.c | 76 +--------
src/backend/commands/vacuumparallel.c | 64 +++++---
src/backend/storage/lmgr/lwlock.c | 2 +
src/backend/utils/misc/guc_tables.c | 2 +-
src/include/commands/progress.h | 4 +-
src/include/commands/vacuum.h | 25 +--
src/include/storage/lwlock.h | 1 +
src/test/regress/expected/cluster.out | 2 +-
src/test/regress/expected/create_index.out | 2 +-
src/test/regress/expected/rules.out | 4 +-
src/test/regress/sql/cluster.sql | 2 +-
src/test/regress/sql/create_index.sql | 2 +-
15 files changed, 122 insertions(+), 243 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 358d2ff90f..6ce7ea9e35 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -6840,10 +6840,10 @@ FROM pg_stat_get_backend_idset() AS backendid;
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>max_dead_tuples</structfield> <type>bigint</type>
+ <structfield>max_dead_tuple_bytes</structfield> <type>bigint</type>
</para>
<para>
- Number of dead tuples that we can store before needing to perform
+ Amount of dead tuple data that we can store before needing to perform
an index vacuum cycle, based on
<xref linkend="guc-maintenance-work-mem"/>.
</para></entry>
@@ -6851,10 +6851,10 @@ FROM pg_stat_get_backend_idset() AS backendid;
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>num_dead_tuples</structfield> <type>bigint</type>
+ <structfield>dead_tuple_bytes</structfield> <type>bigint</type>
</para>
<para>
- Number of dead tuples collected since the last index vacuum cycle.
+ Amount of dead tuple data collected since the last index vacuum cycle.
</para></entry>
</row>
</tbody>
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index a42e881da3..1041e6640f 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -40,6 +40,7 @@
#include "access/heapam_xlog.h"
#include "access/htup_details.h"
#include "access/multixact.h"
+#include "access/tidstore.h"
#include "access/transam.h"
#include "access/visibilitymap.h"
#include "access/xact.h"
@@ -188,7 +189,7 @@ typedef struct LVRelState
* lazy_vacuum_heap_rel, which marks the same LP_DEAD line pointers as
* LP_UNUSED during second heap pass.
*/
- VacDeadItems *dead_items; /* TIDs whose index tuples we'll delete */
+ TidStore *dead_items; /* TIDs whose index tuples we'll delete */
BlockNumber rel_pages; /* total number of pages */
BlockNumber scanned_pages; /* # pages examined (not skipped via VM) */
BlockNumber removed_pages; /* # pages removed by relation truncation */
@@ -259,8 +260,9 @@ static bool lazy_scan_noprune(LVRelState *vacrel, Buffer buf,
static void lazy_vacuum(LVRelState *vacrel);
static bool lazy_vacuum_all_indexes(LVRelState *vacrel);
static void lazy_vacuum_heap_rel(LVRelState *vacrel);
-static int lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
- Buffer buffer, int index, Buffer *vmbuffer);
+static void lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
+ OffsetNumber *offsets, int num_offsets,
+ Buffer buffer, Buffer *vmbuffer);
static bool lazy_check_wraparound_failsafe(LVRelState *vacrel);
static void lazy_cleanup_all_indexes(LVRelState *vacrel);
static IndexBulkDeleteResult *lazy_vacuum_one_index(Relation indrel,
@@ -825,21 +827,21 @@ lazy_scan_heap(LVRelState *vacrel)
blkno,
next_unskippable_block,
next_fsm_block_to_vacuum = 0;
- VacDeadItems *dead_items = vacrel->dead_items;
+ TidStore *dead_items = vacrel->dead_items;
Buffer vmbuffer = InvalidBuffer;
bool next_unskippable_allvis,
skipping_current_range;
const int initprog_index[] = {
PROGRESS_VACUUM_PHASE,
PROGRESS_VACUUM_TOTAL_HEAP_BLKS,
- PROGRESS_VACUUM_MAX_DEAD_TUPLES
+ PROGRESS_VACUUM_MAX_DEAD_TUPLE_BYTES
};
int64 initprog_val[3];
/* Report that we're scanning the heap, advertising total # of blocks */
initprog_val[0] = PROGRESS_VACUUM_PHASE_SCAN_HEAP;
initprog_val[1] = rel_pages;
- initprog_val[2] = dead_items->max_items;
+ initprog_val[2] = tidstore_max_memory(vacrel->dead_items);
pgstat_progress_update_multi_param(3, initprog_index, initprog_val);
/* Set up an initial range of skippable blocks using the visibility map */
@@ -906,8 +908,7 @@ lazy_scan_heap(LVRelState *vacrel)
* dead_items TIDs, pause and do a cycle of vacuuming before we tackle
* this page.
*/
- Assert(dead_items->max_items >= MaxHeapTuplesPerPage);
- if (dead_items->max_items - dead_items->num_items < MaxHeapTuplesPerPage)
+ if (tidstore_is_full(vacrel->dead_items))
{
/*
* Before beginning index vacuuming, we release any pin we may
@@ -1039,11 +1040,18 @@ lazy_scan_heap(LVRelState *vacrel)
if (prunestate.has_lpdead_items)
{
Size freespace;
+ TidStoreIter *iter;
+ TidStoreIterResult *result;
- lazy_vacuum_heap_page(vacrel, blkno, buf, 0, &vmbuffer);
+ iter = tidstore_begin_iterate(vacrel->dead_items);
+ result = tidstore_iterate_next(iter);
+ lazy_vacuum_heap_page(vacrel, blkno, result->offsets, result->num_offsets,
+ buf, &vmbuffer);
+ Assert(!tidstore_iterate_next(iter));
+ tidstore_end_iterate(iter);
/* Forget the LP_DEAD items that we just vacuumed */
- dead_items->num_items = 0;
+ tidstore_reset(dead_items);
/*
* Periodically perform FSM vacuuming to make newly-freed
@@ -1080,7 +1088,7 @@ lazy_scan_heap(LVRelState *vacrel)
* with prunestate-driven visibility map and FSM steps (just like
* the two-pass strategy).
*/
- Assert(dead_items->num_items == 0);
+ Assert(tidstore_num_tids(dead_items) == 0);
}
/*
@@ -1233,7 +1241,7 @@ lazy_scan_heap(LVRelState *vacrel)
* Do index vacuuming (call each index's ambulkdelete routine), then do
* related heap vacuuming
*/
- if (dead_items->num_items > 0)
+ if (tidstore_num_tids(dead_items) > 0)
lazy_vacuum(vacrel);
/*
@@ -1871,23 +1879,15 @@ retry:
*/
if (lpdead_items > 0)
{
- VacDeadItems *dead_items = vacrel->dead_items;
- ItemPointerData tmp;
+ TidStore *dead_items = vacrel->dead_items;
vacrel->lpdead_item_pages++;
prunestate->has_lpdead_items = true;
- ItemPointerSetBlockNumber(&tmp, blkno);
+ tidstore_add_tids(dead_items, blkno, deadoffsets, lpdead_items);
- for (int i = 0; i < lpdead_items; i++)
- {
- ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
- dead_items->items[dead_items->num_items++] = tmp;
- }
-
- Assert(dead_items->num_items <= dead_items->max_items);
- pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
- dead_items->num_items);
+ pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
+ tidstore_memory_usage(dead_items));
/*
* It was convenient to ignore LP_DEAD items in all_visible earlier on
@@ -2107,8 +2107,7 @@ lazy_scan_noprune(LVRelState *vacrel,
}
else
{
- VacDeadItems *dead_items = vacrel->dead_items;
- ItemPointerData tmp;
+ TidStore *dead_items = vacrel->dead_items;
/*
* Page has LP_DEAD items, and so any references/TIDs that remain in
@@ -2117,17 +2116,10 @@ lazy_scan_noprune(LVRelState *vacrel,
*/
vacrel->lpdead_item_pages++;
- ItemPointerSetBlockNumber(&tmp, blkno);
+ tidstore_add_tids(dead_items, blkno, deadoffsets, lpdead_items);
- for (int i = 0; i < lpdead_items; i++)
- {
- ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
- dead_items->items[dead_items->num_items++] = tmp;
- }
-
- Assert(dead_items->num_items <= dead_items->max_items);
- pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
- dead_items->num_items);
+ pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
+ tidstore_memory_usage(dead_items));
vacrel->lpdead_items += lpdead_items;
@@ -2176,7 +2168,7 @@ lazy_vacuum(LVRelState *vacrel)
if (!vacrel->do_index_vacuuming)
{
Assert(!vacrel->do_index_cleanup);
- vacrel->dead_items->num_items = 0;
+ tidstore_reset(vacrel->dead_items);
return;
}
@@ -2205,7 +2197,7 @@ lazy_vacuum(LVRelState *vacrel)
BlockNumber threshold;
Assert(vacrel->num_index_scans == 0);
- Assert(vacrel->lpdead_items == vacrel->dead_items->num_items);
+ Assert(vacrel->lpdead_items == tidstore_num_tids(vacrel->dead_items));
Assert(vacrel->do_index_vacuuming);
Assert(vacrel->do_index_cleanup);
@@ -2232,8 +2224,8 @@ lazy_vacuum(LVRelState *vacrel)
* cases then this may need to be reconsidered.
*/
threshold = (double) vacrel->rel_pages * BYPASS_THRESHOLD_PAGES;
- bypass = (vacrel->lpdead_item_pages < threshold &&
- vacrel->lpdead_items < MAXDEADITEMS(32L * 1024L * 1024L));
+ bypass = (vacrel->lpdead_item_pages < threshold) &&
+ tidstore_memory_usage(vacrel->dead_items) < (32L * 1024L * 1024L);
}
if (bypass)
@@ -2278,7 +2270,7 @@ lazy_vacuum(LVRelState *vacrel)
* Forget the LP_DEAD items that we just vacuumed (or just decided to not
* vacuum)
*/
- vacrel->dead_items->num_items = 0;
+ tidstore_reset(vacrel->dead_items);
}
/*
@@ -2351,7 +2343,7 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
* place).
*/
Assert(vacrel->num_index_scans > 0 ||
- vacrel->dead_items->num_items == vacrel->lpdead_items);
+ tidstore_num_tids(vacrel->dead_items) == vacrel->lpdead_items);
Assert(allindexes || vacrel->failsafe_active);
/*
@@ -2388,10 +2380,11 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
static void
lazy_vacuum_heap_rel(LVRelState *vacrel)
{
- int index;
BlockNumber vacuumed_pages;
Buffer vmbuffer = InvalidBuffer;
LVSavedErrInfo saved_err_info;
+ TidStoreIter *iter;
+ TidStoreIterResult *result;
Assert(vacrel->do_index_vacuuming);
Assert(vacrel->do_index_cleanup);
@@ -2408,8 +2401,8 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
vacuumed_pages = 0;
- index = 0;
- while (index < vacrel->dead_items->num_items)
+ iter = tidstore_begin_iterate(vacrel->dead_items);
+ while ((result = tidstore_iterate_next(iter)) != NULL)
{
BlockNumber tblk;
Buffer buf;
@@ -2418,12 +2411,13 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
vacuum_delay_point();
- tblk = ItemPointerGetBlockNumber(&vacrel->dead_items->items[index]);
+ tblk = result->blkno;
vacrel->blkno = tblk;
buf = ReadBufferExtended(vacrel->rel, MAIN_FORKNUM, tblk, RBM_NORMAL,
vacrel->bstrategy);
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
- index = lazy_vacuum_heap_page(vacrel, tblk, buf, index, &vmbuffer);
+ lazy_vacuum_heap_page(vacrel, tblk, result->offsets, result->num_offsets,
+ buf, &vmbuffer);
/* Now that we've vacuumed the page, record its available space */
page = BufferGetPage(buf);
@@ -2433,6 +2427,7 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
RecordPageWithFreeSpace(vacrel->rel, tblk, freespace);
vacuumed_pages++;
}
+ tidstore_end_iterate(iter);
/* Clear the block number information */
vacrel->blkno = InvalidBlockNumber;
@@ -2447,14 +2442,13 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
* We set all LP_DEAD items from the first heap pass to LP_UNUSED during
* the second heap pass. No more, no less.
*/
- Assert(index > 0);
Assert(vacrel->num_index_scans > 1 ||
- (index == vacrel->lpdead_items &&
+ (tidstore_num_tids(vacrel->dead_items) == vacrel->lpdead_items &&
vacuumed_pages == vacrel->lpdead_item_pages));
ereport(DEBUG2,
- (errmsg("table \"%s\": removed %lld dead item identifiers in %u pages",
- vacrel->relname, (long long) index, vacuumed_pages)));
+ (errmsg("table \"%s\": removed " UINT64_FORMAT "dead item identifiers in %u pages",
+ vacrel->relname, tidstore_num_tids(vacrel->dead_items), vacuumed_pages)));
/* Revert to the previous phase information for error traceback */
restore_vacuum_error_info(vacrel, &saved_err_info);
@@ -2471,11 +2465,10 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
* LP_DEAD item on the page. The return value is the first index immediately
* after all LP_DEAD items for the same page in the array.
*/
-static int
-lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
- int index, Buffer *vmbuffer)
+static void
+lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets, Buffer buffer, Buffer *vmbuffer)
{
- VacDeadItems *dead_items = vacrel->dead_items;
Page page = BufferGetPage(buffer);
OffsetNumber unused[MaxHeapTuplesPerPage];
int uncnt = 0;
@@ -2494,16 +2487,11 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
START_CRIT_SECTION();
- for (; index < dead_items->num_items; index++)
+ for (int i = 0; i < num_offsets; i++)
{
- BlockNumber tblk;
- OffsetNumber toff;
ItemId itemid;
+ OffsetNumber toff = offsets[i];
- tblk = ItemPointerGetBlockNumber(&dead_items->items[index]);
- if (tblk != blkno)
- break; /* past end of tuples for this block */
- toff = ItemPointerGetOffsetNumber(&dead_items->items[index]);
itemid = PageGetItemId(page, toff);
Assert(ItemIdIsDead(itemid) && !ItemIdHasStorage(itemid));
@@ -2583,7 +2571,6 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
/* Revert to the previous phase information for error traceback */
restore_vacuum_error_info(vacrel, &saved_err_info);
- return index;
}
/*
@@ -3079,46 +3066,6 @@ count_nondeletable_pages(LVRelState *vacrel, bool *lock_waiter_detected)
return vacrel->nonempty_pages;
}
-/*
- * Returns the number of dead TIDs that VACUUM should allocate space to
- * store, given a heap rel of size vacrel->rel_pages, and given current
- * maintenance_work_mem setting (or current autovacuum_work_mem setting,
- * when applicable).
- *
- * See the comments at the head of this file for rationale.
- */
-static int
-dead_items_max_items(LVRelState *vacrel)
-{
- int64 max_items;
- int vac_work_mem = IsAutoVacuumWorkerProcess() &&
- autovacuum_work_mem != -1 ?
- autovacuum_work_mem : maintenance_work_mem;
-
- if (vacrel->nindexes > 0)
- {
- BlockNumber rel_pages = vacrel->rel_pages;
-
- max_items = MAXDEADITEMS(vac_work_mem * 1024L);
- max_items = Min(max_items, INT_MAX);
- max_items = Min(max_items, MAXDEADITEMS(MaxAllocSize));
-
- /* curious coding here to ensure the multiplication can't overflow */
- if ((BlockNumber) (max_items / MaxHeapTuplesPerPage) > rel_pages)
- max_items = rel_pages * MaxHeapTuplesPerPage;
-
- /* stay sane if small maintenance_work_mem */
- max_items = Max(max_items, MaxHeapTuplesPerPage);
- }
- else
- {
- /* One-pass case only stores a single heap page's TIDs at a time */
- max_items = MaxHeapTuplesPerPage;
- }
-
- return (int) max_items;
-}
-
/*
* Allocate dead_items (either using palloc, or in dynamic shared memory).
* Sets dead_items in vacrel for caller.
@@ -3129,11 +3076,9 @@ dead_items_max_items(LVRelState *vacrel)
static void
dead_items_alloc(LVRelState *vacrel, int nworkers)
{
- VacDeadItems *dead_items;
- int max_items;
-
- max_items = dead_items_max_items(vacrel);
- Assert(max_items >= MaxHeapTuplesPerPage);
+ int vac_work_mem = IsAutoVacuumWorkerProcess() &&
+ autovacuum_work_mem != -1 ?
+ autovacuum_work_mem * 1024L : maintenance_work_mem * 1024L;
/*
* Initialize state for a parallel vacuum. As of now, only one worker can
@@ -3160,7 +3105,7 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
else
vacrel->pvs = parallel_vacuum_init(vacrel->rel, vacrel->indrels,
vacrel->nindexes, nworkers,
- max_items,
+ vac_work_mem,
vacrel->verbose ? INFO : DEBUG2,
vacrel->bstrategy);
@@ -3173,11 +3118,7 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
}
/* Serial VACUUM case */
- dead_items = (VacDeadItems *) palloc(vac_max_items_to_alloc_size(max_items));
- dead_items->max_items = max_items;
- dead_items->num_items = 0;
-
- vacrel->dead_items = dead_items;
+ vacrel->dead_items = tidstore_create(vac_work_mem, NULL);
}
/*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 447c9b970f..133e03d728 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1165,7 +1165,7 @@ CREATE VIEW pg_stat_progress_vacuum AS
END AS phase,
S.param2 AS heap_blks_total, S.param3 AS heap_blks_scanned,
S.param4 AS heap_blks_vacuumed, S.param5 AS index_vacuum_count,
- S.param6 AS max_dead_tuples, S.param7 AS num_dead_tuples
+ S.param6 AS max_dead_tuple_bytes, S.param7 AS dead_tuple_bytes
FROM pg_stat_get_progress_info('VACUUM') AS S
LEFT JOIN pg_database D ON S.datid = D.oid;
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index c4ed7efce3..7de4350cde 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -95,7 +95,6 @@ static bool vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params);
static double compute_parallel_delay(void);
static VacOptValue get_vacoptval_from_boolean(DefElem *def);
static bool vac_tid_reaped(ItemPointer itemptr, void *state);
-static int vac_cmp_itemptr(const void *left, const void *right);
/*
* Primary entry point for manual VACUUM and ANALYZE commands
@@ -2298,16 +2297,16 @@ get_vacoptval_from_boolean(DefElem *def)
*/
IndexBulkDeleteResult *
vac_bulkdel_one_index(IndexVacuumInfo *ivinfo, IndexBulkDeleteResult *istat,
- VacDeadItems *dead_items)
+ TidStore *dead_items)
{
/* Do bulk deletion */
istat = index_bulk_delete(ivinfo, istat, vac_tid_reaped,
(void *) dead_items);
ereport(ivinfo->message_level,
- (errmsg("scanned index \"%s\" to remove %d row versions",
+ (errmsg("scanned index \"%s\" to remove " UINT64_FORMAT " row versions",
RelationGetRelationName(ivinfo->index),
- dead_items->num_items)));
+ tidstore_num_tids(dead_items))));
return istat;
}
@@ -2338,18 +2337,6 @@ vac_cleanup_one_index(IndexVacuumInfo *ivinfo, IndexBulkDeleteResult *istat)
return istat;
}
-/*
- * Returns the total required space for VACUUM's dead_items array given a
- * max_items value.
- */
-Size
-vac_max_items_to_alloc_size(int max_items)
-{
- Assert(max_items <= MAXDEADITEMS(MaxAllocSize));
-
- return offsetof(VacDeadItems, items) + sizeof(ItemPointerData) * max_items;
-}
-
/*
* vac_tid_reaped() -- is a particular tid deletable?
*
@@ -2360,60 +2347,7 @@ vac_max_items_to_alloc_size(int max_items)
static bool
vac_tid_reaped(ItemPointer itemptr, void *state)
{
- VacDeadItems *dead_items = (VacDeadItems *) state;
- int64 litem,
- ritem,
- item;
- ItemPointer res;
-
- litem = itemptr_encode(&dead_items->items[0]);
- ritem = itemptr_encode(&dead_items->items[dead_items->num_items - 1]);
- item = itemptr_encode(itemptr);
-
- /*
- * Doing a simple bound check before bsearch() is useful to avoid the
- * extra cost of bsearch(), especially if dead items on the heap are
- * concentrated in a certain range. Since this function is called for
- * every index tuple, it pays to be really fast.
- */
- if (item < litem || item > ritem)
- return false;
-
- res = (ItemPointer) bsearch((void *) itemptr,
- (void *) dead_items->items,
- dead_items->num_items,
- sizeof(ItemPointerData),
- vac_cmp_itemptr);
-
- return (res != NULL);
-}
-
-/*
- * Comparator routines for use with qsort() and bsearch().
- */
-static int
-vac_cmp_itemptr(const void *left, const void *right)
-{
- BlockNumber lblk,
- rblk;
- OffsetNumber loff,
- roff;
-
- lblk = ItemPointerGetBlockNumber((ItemPointer) left);
- rblk = ItemPointerGetBlockNumber((ItemPointer) right);
-
- if (lblk < rblk)
- return -1;
- if (lblk > rblk)
- return 1;
-
- loff = ItemPointerGetOffsetNumber((ItemPointer) left);
- roff = ItemPointerGetOffsetNumber((ItemPointer) right);
-
- if (loff < roff)
- return -1;
- if (loff > roff)
- return 1;
+ TidStore *dead_items = (TidStore *) state;
- return 0;
+ return tidstore_lookup_tid(dead_items, itemptr);
}
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index bcd40c80a1..4c0ce4b7e6 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -44,7 +44,7 @@
* use small integers.
*/
#define PARALLEL_VACUUM_KEY_SHARED 1
-#define PARALLEL_VACUUM_KEY_DEAD_ITEMS 2
+#define PARALLEL_VACUUM_KEY_DSA 2
#define PARALLEL_VACUUM_KEY_QUERY_TEXT 3
#define PARALLEL_VACUUM_KEY_BUFFER_USAGE 4
#define PARALLEL_VACUUM_KEY_WAL_USAGE 5
@@ -103,6 +103,9 @@ typedef struct PVShared
/* Counter for vacuuming and cleanup */
pg_atomic_uint32 idx;
+
+ /* Handle of the shared TidStore */
+ tidstore_handle dead_items_handle;
} PVShared;
/* Status used during parallel index vacuum or cleanup */
@@ -166,7 +169,8 @@ struct ParallelVacuumState
PVIndStats *indstats;
/* Shared dead items space among parallel vacuum workers */
- VacDeadItems *dead_items;
+ TidStore *dead_items;
+ dsa_area *dead_items_area;
/* Points to buffer usage area in DSM */
BufferUsage *buffer_usage;
@@ -222,20 +226,23 @@ static void parallel_vacuum_error_callback(void *arg);
*/
ParallelVacuumState *
parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
- int nrequested_workers, int max_items,
- int elevel, BufferAccessStrategy bstrategy)
+ int nrequested_workers, int vac_work_mem,
+ int elevel,
+ BufferAccessStrategy bstrategy)
{
ParallelVacuumState *pvs;
ParallelContext *pcxt;
PVShared *shared;
- VacDeadItems *dead_items;
+ TidStore *dead_items;
PVIndStats *indstats;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
+ void *area_space;
+ dsa_area *dead_items_dsa;
bool *will_parallel_vacuum;
Size est_indstats_len;
Size est_shared_len;
- Size est_dead_items_len;
+ Size dsa_minsize = dsa_minimum_size();
int nindexes_mwm = 0;
int parallel_workers = 0;
int querylen;
@@ -283,9 +290,8 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_estimate_chunk(&pcxt->estimator, est_shared_len);
shm_toc_estimate_keys(&pcxt->estimator, 1);
- /* Estimate size for dead_items -- PARALLEL_VACUUM_KEY_DEAD_ITEMS */
- est_dead_items_len = vac_max_items_to_alloc_size(max_items);
- shm_toc_estimate_chunk(&pcxt->estimator, est_dead_items_len);
+ /* Estimate size for dead tuple DSA -- PARALLEL_VACUUM_KEY_DSA */
+ shm_toc_estimate_chunk(&pcxt->estimator, dsa_minsize);
shm_toc_estimate_keys(&pcxt->estimator, 1);
/*
@@ -351,6 +357,16 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_INDEX_STATS, indstats);
pvs->indstats = indstats;
+ /* Prepare DSA space for dead items */
+ area_space = shm_toc_allocate(pcxt->toc, dsa_minsize);
+ shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_DSA, area_space);
+ dead_items_dsa = dsa_create_in_place(area_space, dsa_minsize,
+ LWTRANCHE_PARALLEL_VACUUM_DSA,
+ pcxt->seg);
+ dead_items = tidstore_create(vac_work_mem, dead_items_dsa);
+ pvs->dead_items = dead_items;
+ pvs->dead_items_area = dead_items_dsa;
+
/* Prepare shared information */
shared = (PVShared *) shm_toc_allocate(pcxt->toc, est_shared_len);
MemSet(shared, 0, est_shared_len);
@@ -360,6 +376,7 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
(nindexes_mwm > 0) ?
maintenance_work_mem / Min(parallel_workers, nindexes_mwm) :
maintenance_work_mem;
+ shared->dead_items_handle = tidstore_get_handle(dead_items);
pg_atomic_init_u32(&(shared->cost_balance), 0);
pg_atomic_init_u32(&(shared->active_nworkers), 0);
@@ -368,15 +385,6 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_SHARED, shared);
pvs->shared = shared;
- /* Prepare the dead_items space */
- dead_items = (VacDeadItems *) shm_toc_allocate(pcxt->toc,
- est_dead_items_len);
- dead_items->max_items = max_items;
- dead_items->num_items = 0;
- MemSet(dead_items->items, 0, sizeof(ItemPointerData) * max_items);
- shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_DEAD_ITEMS, dead_items);
- pvs->dead_items = dead_items;
-
/*
* Allocate space for each worker's BufferUsage and WalUsage; no need to
* initialize
@@ -434,6 +442,9 @@ parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats)
istats[i] = NULL;
}
+ tidstore_destroy(pvs->dead_items);
+ dsa_detach(pvs->dead_items_area);
+
DestroyParallelContext(pvs->pcxt);
ExitParallelMode();
@@ -442,7 +453,7 @@ parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats)
}
/* Returns the dead items space */
-VacDeadItems *
+TidStore *
parallel_vacuum_get_dead_items(ParallelVacuumState *pvs)
{
return pvs->dead_items;
@@ -940,7 +951,9 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
Relation *indrels;
PVIndStats *indstats;
PVShared *shared;
- VacDeadItems *dead_items;
+ TidStore *dead_items;
+ void *area_space;
+ dsa_area *dead_items_area;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
int nindexes;
@@ -984,10 +997,10 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
PARALLEL_VACUUM_KEY_INDEX_STATS,
false);
- /* Set dead_items space */
- dead_items = (VacDeadItems *) shm_toc_lookup(toc,
- PARALLEL_VACUUM_KEY_DEAD_ITEMS,
- false);
+ /* Set dead items */
+ area_space = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_DSA, false);
+ dead_items_area = dsa_attach_in_place(area_space, seg);
+ dead_items = tidstore_attach(dead_items_area, shared->dead_items_handle);
/* Set cost-based vacuum delay */
VacuumCostActive = (VacuumCostDelay > 0);
@@ -1033,6 +1046,9 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
&wal_usage[ParallelWorkerNumber]);
+ tidstore_detach(pvs.dead_items);
+ dsa_detach(dead_items_area);
+
/* Pop the error context stack */
error_context_stack = errcallback.previous;
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 196bece0a3..ff75fae88a 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -186,6 +186,8 @@ static const char *const BuiltinTrancheNames[] = {
"PgStatsHash",
/* LWTRANCHE_PGSTATS_DATA: */
"PgStatsData",
+ /* LWTRANCHE_PARALLEL_VACUUM_DSA: */
+ "ParallelVacuumDSA",
};
StaticAssertDecl(lengthof(BuiltinTrancheNames) ==
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 92545b4958..3f8a5bc582 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2301,7 +2301,7 @@ struct config_int ConfigureNamesInt[] =
GUC_UNIT_KB
},
&maintenance_work_mem,
- 65536, 1024, MAX_KILOBYTES,
+ 65536, 2048, MAX_KILOBYTES,
NULL, NULL, NULL
},
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index e5add41352..b209d3cf84 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -23,8 +23,8 @@
#define PROGRESS_VACUUM_HEAP_BLKS_SCANNED 2
#define PROGRESS_VACUUM_HEAP_BLKS_VACUUMED 3
#define PROGRESS_VACUUM_NUM_INDEX_VACUUMS 4
-#define PROGRESS_VACUUM_MAX_DEAD_TUPLES 5
-#define PROGRESS_VACUUM_NUM_DEAD_TUPLES 6
+#define PROGRESS_VACUUM_MAX_DEAD_TUPLE_BYTES 5
+#define PROGRESS_VACUUM_DEAD_TUPLE_BYTES 6
/* Phases of vacuum (as advertised via PROGRESS_VACUUM_PHASE) */
#define PROGRESS_VACUUM_PHASE_SCAN_HEAP 1
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 689dbb7702..220d89fff7 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -17,6 +17,7 @@
#include "access/htup.h"
#include "access/genam.h"
#include "access/parallel.h"
+#include "access/tidstore.h"
#include "catalog/pg_class.h"
#include "catalog/pg_statistic.h"
#include "catalog/pg_type.h"
@@ -276,21 +277,6 @@ struct VacuumCutoffs
MultiXactId MultiXactCutoff;
};
-/*
- * VacDeadItems stores TIDs whose index tuples are deleted by index vacuuming.
- */
-typedef struct VacDeadItems
-{
- int max_items; /* # slots allocated in array */
- int num_items; /* current # of entries */
-
- /* Sorted array of TIDs to delete from indexes */
- ItemPointerData items[FLEXIBLE_ARRAY_MEMBER];
-} VacDeadItems;
-
-#define MAXDEADITEMS(avail_mem) \
- (((avail_mem) - offsetof(VacDeadItems, items)) / sizeof(ItemPointerData))
-
/* GUC parameters */
extern PGDLLIMPORT int default_statistics_target; /* PGDLLIMPORT for PostGIS */
extern PGDLLIMPORT int vacuum_freeze_min_age;
@@ -339,18 +325,17 @@ extern Relation vacuum_open_relation(Oid relid, RangeVar *relation,
LOCKMODE lmode);
extern IndexBulkDeleteResult *vac_bulkdel_one_index(IndexVacuumInfo *ivinfo,
IndexBulkDeleteResult *istat,
- VacDeadItems *dead_items);
+ TidStore *dead_items);
extern IndexBulkDeleteResult *vac_cleanup_one_index(IndexVacuumInfo *ivinfo,
IndexBulkDeleteResult *istat);
-extern Size vac_max_items_to_alloc_size(int max_items);
/* in commands/vacuumparallel.c */
extern ParallelVacuumState *parallel_vacuum_init(Relation rel, Relation *indrels,
int nindexes, int nrequested_workers,
- int max_items, int elevel,
- BufferAccessStrategy bstrategy);
+ int vac_work_mem,
+ int elevel, BufferAccessStrategy bstrategy);
extern void parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats);
-extern VacDeadItems *parallel_vacuum_get_dead_items(ParallelVacuumState *pvs);
+extern TidStore *parallel_vacuum_get_dead_items(ParallelVacuumState *pvs);
extern void parallel_vacuum_bulkdel_all_indexes(ParallelVacuumState *pvs,
long num_table_tuples,
int num_index_scans);
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index e4162db613..40dda03088 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -204,6 +204,7 @@ typedef enum BuiltinTrancheIds
LWTRANCHE_PGSTATS_DSA,
LWTRANCHE_PGSTATS_HASH,
LWTRANCHE_PGSTATS_DATA,
+ LWTRANCHE_PARALLEL_VACUUM_DSA,
LWTRANCHE_FIRST_USER_DEFINED
} BuiltinTrancheIds;
diff --git a/src/test/regress/expected/cluster.out b/src/test/regress/expected/cluster.out
index 542c2e098c..e678e6f79e 100644
--- a/src/test/regress/expected/cluster.out
+++ b/src/test/regress/expected/cluster.out
@@ -524,7 +524,7 @@ create index cluster_sort on clstr_4 (hundred, thousand, tenthous);
-- ensure we don't use the index in CLUSTER nor the checking SELECTs
set enable_indexscan = off;
-- Use external sort:
-set maintenance_work_mem = '1MB';
+set maintenance_work_mem = '2MB';
cluster clstr_4 using cluster_sort;
select * from
(select hundred, lag(hundred) over () as lhundred,
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 6cd57e3eaa..d1889b9d10 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -1214,7 +1214,7 @@ DROP TABLE unlogged_hash_table;
-- CREATE INDEX hash_ovfl_index ON hash_ovfl_heap USING hash (x int4_ops);
-- Test hash index build tuplesorting. Force hash tuplesort using low
-- maintenance_work_mem setting and fillfactor:
-SET maintenance_work_mem = '1MB';
+SET maintenance_work_mem = '2MB';
CREATE INDEX hash_tuplesort_idx ON tenk1 USING hash (stringu1 name_ops) WITH (fillfactor = 10);
EXPLAIN (COSTS OFF)
SELECT count(*) FROM tenk1 WHERE stringu1 = 'TVAAAA';
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index fb9f936d43..0c49354f04 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2020,8 +2020,8 @@ pg_stat_progress_vacuum| SELECT s.pid,
s.param3 AS heap_blks_scanned,
s.param4 AS heap_blks_vacuumed,
s.param5 AS index_vacuum_count,
- s.param6 AS max_dead_tuples,
- s.param7 AS num_dead_tuples
+ s.param6 AS max_dead_tuple_bytes,
+ s.param7 AS dead_tuple_bytes
FROM (pg_stat_get_progress_info('VACUUM'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
LEFT JOIN pg_database d ON ((s.datid = d.oid)));
pg_stat_recovery_prefetch| SELECT s.stats_reset,
diff --git a/src/test/regress/sql/cluster.sql b/src/test/regress/sql/cluster.sql
index 6cb9c926c0..a795d705d5 100644
--- a/src/test/regress/sql/cluster.sql
+++ b/src/test/regress/sql/cluster.sql
@@ -256,7 +256,7 @@ create index cluster_sort on clstr_4 (hundred, thousand, tenthous);
set enable_indexscan = off;
-- Use external sort:
-set maintenance_work_mem = '1MB';
+set maintenance_work_mem = '2MB';
cluster clstr_4 using cluster_sort;
select * from
(select hundred, lag(hundred) over () as lhundred,
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index a3738833b2..edb5e4b4f3 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -367,7 +367,7 @@ DROP TABLE unlogged_hash_table;
-- Test hash index build tuplesorting. Force hash tuplesort using low
-- maintenance_work_mem setting and fillfactor:
-SET maintenance_work_mem = '1MB';
+SET maintenance_work_mem = '2MB';
CREATE INDEX hash_tuplesort_idx ON tenk1 USING hash (stringu1 name_ops) WITH (fillfactor = 10);
EXPLAIN (COSTS OFF)
SELECT count(*) FROM tenk1 WHERE stringu1 = 'TVAAAA';
--
2.31.1
On Fri, Dec 23, 2022 at 4:33 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Thu, Dec 22, 2022 at 10:00 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
If the value is a power of 2, it seems to work perfectly fine. But for
example, if it's 700MB, the total memory exceeds the limit:

2*(1+2+4+8+16+32+64+128) = 510MB (72.8% of 700MB) -> keep going
510 + 256 = 766MB -> stop, but it exceeds the limit.

In a bigger case, if it's 11000MB:

2*(1+2+...+2048) = 8190MB (74.4%)
8190 + 4096 = 12286MB

That being said, I don't think these are common cases. So the 75%
threshold seems to work fine in most cases.

Thinking some more, I agree this doesn't have large practical risk, but thinking from the point of view of the community, being loose with memory limits by up to 10% is not a good precedent.
Agreed.
Perhaps we can be clever and use 75% when the limit is a power of two and 50% otherwise.

I'm skeptical of trying to be clever, and I just thought of an additional concern: We're assuming behavior of the growth in size of new DSA segments, which could possibly change. Given how allocators are typically coded, though, it seems safe to assume that they'll at most double in size.
Sounds good to me.
I've written a simple script to simulate the DSA memory usage and the
limit. The 75% limit works fine for power-of-two cases, and we can
use the 60% limit for other cases (it seems we can use up to about 66%,
but I used 60% for safety). It would be best if we could prove it
mathematically, but I could prove only the power-of-two cases. The script,
however, practically shows that the 60% threshold works for these cases.
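For illustration, here is a minimal, self-contained C sketch of that kind of
simulation. It assumes DSA creates two segments at each size, starting at 1MB
and doubling up to a 1GB cap (assumptions drawn from the arithmetic above,
not guarantees about DSA internals), and keeps allocating while usage is
below the threshold:

#include <stdio.h>
#include <stdint.h>

/*
 * Keep "allocating" segments while total usage is below
 * stop_fraction * limit, then report the peak usage.  Segment sizes
 * (1MB start, two segments per size, doubling, 1GB cap) are assumptions.
 */
static void
simulate(uint64_t limit_mb, double stop_fraction)
{
	uint64_t	seg_mb = 1;
	uint64_t	total_mb = 0;
	int			nsame = 0;

	while (total_mb < (uint64_t) (limit_mb * stop_fraction))
	{
		total_mb += seg_mb;
		if (++nsame == 2 && seg_mb < 1024)
		{
			seg_mb *= 2;
			nsame = 0;
		}
	}

	printf("limit %luMB, stop at %.0f%%: peak %luMB (%.1f%% of limit)\n",
		   (unsigned long) limit_mb, stop_fraction * 100,
		   (unsigned long) total_mb, 100.0 * total_mb / limit_mb);
}

int
main(void)
{
	simulate(700, 0.75);	/* overshoots: peak 766MB */
	simulate(700, 0.60);	/* stays under: peak 510MB */
	simulate(11000, 0.60);	/* stays under: peak 8190MB */
	return 0;
}

With these assumptions the 75% threshold overshoots a 700MB limit (peak
766MB), while the 60% threshold stays within both the 700MB and 11000MB
limits, matching the numbers above.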
Regards
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachments:
On Thu, Jan 12, 2023 at 9:51 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
On Thu, Jan 12, 2023 at 5:21 PM John Naylor
<john.naylor@enterprisedb.com> wrote:

Okay, I'll squash the previous patch and work on cleaning up the
internals. I'll keep the external APIs the same so that your work on vacuum
integration can be easily rebased on top of that, and we can work
independently.
There were some conflicts with HEAD, so to keep the CF bot busy, I've
quickly put together v18. I still have a lot of cleanup work to do, but
this is enough for now.
0003 contains all v17 local-memory coding squashed together.
0004 perf test not updated but it doesn't build by default so it's fine for
now
0005 removes node.chunk as discussed, but does not change node4 fanout yet.
0006 is a small cleanup regarding setting node fanout.
0007 squashes my shared memory work with Masahiko's fixes from the addendum
v17-0010.
0008 turns the existence checks in RT_NODE_UPDATE_INNER into Asserts, as
discussed.
0009/0010 are just copies of Masahiko's v17 addendum v17-0011/12, but the
latter rebased over recent variable renaming (it's possible I missed
something, so worth checking).
I've implemented the idea of using a union. Let me share the WIP code for
discussion; I've attached three patches that can be applied on top of
Seems fine as far as the union goes. Let's go ahead with this, and make
progress on locking etc.
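To make the idea concrete for readers, here is a rough, hypothetical sketch
of what a union-backed TidStore could look like. All type and field names
below are illustrative and are not taken from the attached patches:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Opaque stand-ins for the templated radix tree variants (illustrative). */
struct local_rt_radix_tree;
struct shared_rt_radix_tree;

/*
 * Hypothetical TidStore whose storage is either a backend-local radix
 * tree or a DSA-backed shared one, selected by a flag and accessed
 * through a union.
 */
typedef struct TidStore
{
	bool		is_shared;		/* which union member is valid */
	int64_t		num_tids;		/* number of TIDs stored so far */
	size_t		max_bytes;		/* memory budget for this store */

	union
	{
		struct local_rt_radix_tree *local;
		struct shared_rt_radix_tree *shared;
	}			tree;
} TidStore;

Call sites would then branch on is_shared (or hide that behind small wrapper
functions) when setting or looking up TIDs.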
Overall, the TidStore implementation with the union idea doesn't look so
ugly to me. But I got many compiler warnings about unused radix tree
functions, like:

tidstore.c:99:19: warning: 'shared_rt_delete' defined but not used
[-Wunused-function]

I'm not sure there is a convenient way to suppress this warning, but
one idea is to have some macros to specify which operations are
enabled/declared.
That sounds like a good idea. It's also worth wondering if we even need
RT_NUM_ENTRIES at all, since the caller is capable of keeping track of that
if necessary. It's also misnamed, since it's concerned with the number of
keys. The vacuum case cares about the number of TIDs, and not number of
(encoded) keys. Even if we ever (say) changed the key to blocknumber and
value to Bitmapset, the number of keys might not be interesting. It sounds
like we should at least make the delete functionality optional. (Side note
on optional functions: if an implementation didn't care about iteration or
its order, we could optimize insertion into linear nodes)
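As a small, self-contained illustration of that idea (the RT_USE_DELETE
switch name below is hypothetical, not from the patches), an opt-in macro can
keep an unused static function from being compiled at all, which also avoids
the -Wunused-function warning:

#include <stdio.h>

#define RT_USE_DELETE			/* comment this out to drop rt_delete() */

static void
rt_set(int key)
{
	printf("set %d\n", key);
}

#ifdef RT_USE_DELETE
static void
rt_delete(int key)
{
	printf("delete %d\n", key);
}
#endif

int
main(void)
{
	rt_set(42);
#ifdef RT_USE_DELETE
	rt_delete(42);
#endif
	return 0;
}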
Since this is WIP, you may already have some polish in mind, so I won't go
over the patches in detail, but I wanted to ask about a few things (numbers
referring to v17 addendum, not v18):
0011
+ * 'num_tids' is the number of Tids stored so far. 'max_byte' is the maximum
+ * bytes a TidStore can use. These two fields are commonly used in both
+ * non-shared case and shared case.
+ */
+ uint32 num_tids;
uint32 is how we store the block number, so this is too small and will wrap
around on overflow. int64 seems better.
+ * We calculate the maximum bytes for the TidStore in different ways
+ * for non-shared case and shared case. Please refer to the comment
+ * TIDSTORE_MEMORY_DEDUCT for details.
+ */
Maybe the #define and comment should be close to here.
+ * Destroy a TidStore, returning all memory. The caller must be certain that
+ * no other backend will attempt to access the TidStore before calling this
+ * function. Other backend must explicitly call tidstore_detach to free up
+ * backend-local memory associated with the TidStore. The backend that calls
+ * tidstore_destroy must not call tidstore_detach.
+ */
+void
+tidstore_destroy(TidStore *ts)
If not addressed by next patch, need to phrase comment with FIXME or TODO
about making certain.
+ * Add Tids on a block to TidStore. The caller must ensure the offset numbers
+ * in 'offsets' are ordered in ascending order.
Must? What happens otherwise?
+ uint64 last_key = PG_UINT64_MAX;
I'm having some difficulty understanding this sentinel and how it's used.
@@ -1039,11 +1040,18 @@ lazy_scan_heap(LVRelState *vacrel)
if (prunestate.has_lpdead_items)
{
Size freespace;
+ TidStoreIter *iter;
+ TidStoreIterResult *result;
- lazy_vacuum_heap_page(vacrel, blkno, buf, 0, &vmbuffer);
+ iter = tidstore_begin_iterate(vacrel->dead_items);
+ result = tidstore_iterate_next(iter);
+ lazy_vacuum_heap_page(vacrel, blkno, result->offsets, result->num_offsets,
+ buf, &vmbuffer);
+ Assert(!tidstore_iterate_next(iter));
+ tidstore_end_iterate(iter);
/* Forget the LP_DEAD items that we just vacuumed */
- dead_items->num_items = 0;
+ tidstore_reset(dead_items);
This part only runs "if (vacrel->nindexes == 0)", so it seems like unneeded
complexity. It arises because lazy_scan_prune() populates the tid store
even if no index vacuuming happens. Perhaps the caller of lazy_scan_prune()
could pass the deadoffsets array, and upon returning, either populate the
store or call lazy_vacuum_heap_page(), as needed. It's quite possible I'm
missing some detail, so some description of the design choices made would
be helpful.
On Mon, Jan 16, 2023 at 9:53 AM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
I've written a simple script to simulate the DSA memory usage and the
limit. The 75% limit works fine for power-of-two cases, and we can
use the 60% limit for other cases (it seems we can use up to about 66%,
but I used 60% for safety). It would be best if we could prove it
mathematically, but I could prove only the power-of-two cases. The script,
however, practically shows that the 60% threshold works for these cases.
Okay. It's worth highlighting this in the comments, and also the fact that
it depends on internal details of how DSA increases segment size.
--
John Naylor
EDB: http://www.enterprisedb.com
Attachments:
v18-0002-Move-some-bitmap-logic-out-of-bitmapset.c.patch
From c67b1955c95a036f93baaea8f43dcf49fa6e86f8 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Tue, 6 Dec 2022 13:39:41 +0700
Subject: [PATCH v18 02/10] Move some bitmap logic out of bitmapset.c
Add pg_rightmost_one32/64 functions. This functionality was previously
private to bitmapset.c as the RIGHTMOST_ONE macro. It has practical
use in other contexts, so move to pg_bitutils.h.
Also move the logic for selecting appropriate pg_bitutils functions
based on word size to bitmapset.h for wider visibility and add
appropriate selection for bmw_rightmost_one().
Since the previous macro relied on casting to signedbitmapword,
and the new functions do not, remove that typedef.
Design input and review by Tom Lane
Discussion: https://www.postgresql.org/message-id/CAFBsxsFW2JjTo58jtDB%2B3sZhxMx3t-3evew8%3DAcr%2BGGhC%2BkFaA%40mail.gmail.com
---
src/backend/nodes/bitmapset.c | 36 ++------------------------------
src/include/nodes/bitmapset.h | 16 ++++++++++++--
src/include/port/pg_bitutils.h | 31 +++++++++++++++++++++++++++
src/tools/pgindent/typedefs.list | 1 -
4 files changed, 47 insertions(+), 37 deletions(-)
diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index 98660524ad..fcd8e2ccbc 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -32,39 +32,7 @@
#define BITMAPSET_SIZE(nwords) \
(offsetof(Bitmapset, words) + (nwords) * sizeof(bitmapword))
-/*----------
- * This is a well-known cute trick for isolating the rightmost one-bit
- * in a word. It assumes two's complement arithmetic. Consider any
- * nonzero value, and focus attention on the rightmost one. The value is
- * then something like
- * xxxxxx10000
- * where x's are unspecified bits. The two's complement negative is formed
- * by inverting all the bits and adding one. Inversion gives
- * yyyyyy01111
- * where each y is the inverse of the corresponding x. Incrementing gives
- * yyyyyy10000
- * and then ANDing with the original value gives
- * 00000010000
- * This works for all cases except original value = zero, where of course
- * we get zero.
- *----------
- */
-#define RIGHTMOST_ONE(x) ((signedbitmapword) (x) & -((signedbitmapword) (x)))
-
-#define HAS_MULTIPLE_ONES(x) ((bitmapword) RIGHTMOST_ONE(x) != (x))
-
-/* Select appropriate bit-twiddling functions for bitmap word size */
-#if BITS_PER_BITMAPWORD == 32
-#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos32(w)
-#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos32(w)
-#define bmw_popcount(w) pg_popcount32(w)
-#elif BITS_PER_BITMAPWORD == 64
-#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos64(w)
-#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos64(w)
-#define bmw_popcount(w) pg_popcount64(w)
-#else
-#error "invalid BITS_PER_BITMAPWORD"
-#endif
+#define HAS_MULTIPLE_ONES(x) (bmw_rightmost_one(x) != (x))
/*
@@ -1013,7 +981,7 @@ bms_first_member(Bitmapset *a)
{
int result;
- w = RIGHTMOST_ONE(w);
+ w = bmw_rightmost_one(w);
a->words[wordnum] &= ~w;
result = wordnum * BITS_PER_BITMAPWORD;
diff --git a/src/include/nodes/bitmapset.h b/src/include/nodes/bitmapset.h
index 0dca6bc5fa..80e91fac0f 100644
--- a/src/include/nodes/bitmapset.h
+++ b/src/include/nodes/bitmapset.h
@@ -38,13 +38,11 @@ struct List;
#define BITS_PER_BITMAPWORD 64
typedef uint64 bitmapword; /* must be an unsigned type */
-typedef int64 signedbitmapword; /* must be the matching signed type */
#else
#define BITS_PER_BITMAPWORD 32
typedef uint32 bitmapword; /* must be an unsigned type */
-typedef int32 signedbitmapword; /* must be the matching signed type */
#endif
@@ -75,6 +73,20 @@ typedef enum
BMS_MULTIPLE /* >1 member */
} BMS_Membership;
+/* Select appropriate bit-twiddling functions for bitmap word size */
+#if BITS_PER_BITMAPWORD == 32
+#define bmw_rightmost_one(w) pg_rightmost_one32(w)
+#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos32(w)
+#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos32(w)
+#define bmw_popcount(w) pg_popcount32(w)
+#elif BITS_PER_BITMAPWORD == 64
+#define bmw_rightmost_one(w) pg_rightmost_one64(w)
+#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos64(w)
+#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos64(w)
+#define bmw_popcount(w) pg_popcount64(w)
+#else
+#error "invalid BITS_PER_BITMAPWORD"
+#endif
/*
* function prototypes in nodes/bitmapset.c
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index a49a9c03d9..7235ad25ee 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -17,6 +17,37 @@ extern PGDLLIMPORT const uint8 pg_leftmost_one_pos[256];
extern PGDLLIMPORT const uint8 pg_rightmost_one_pos[256];
extern PGDLLIMPORT const uint8 pg_number_of_ones[256];
+/*----------
+ * This is a well-known cute trick for isolating the rightmost one-bit
+ * in a word. It assumes two's complement arithmetic. Consider any
+ * nonzero value, and focus attention on the rightmost one. The value is
+ * then something like
+ * xxxxxx10000
+ * where x's are unspecified bits. The two's complement negative is formed
+ * by inverting all the bits and adding one. Inversion gives
+ * yyyyyy01111
+ * where each y is the inverse of the corresponding x. Incrementing gives
+ * yyyyyy10000
+ * and then ANDing with the original value gives
+ * 00000010000
+ * This works for all cases except original value = zero, where of course
+ * we get zero.
+ *----------
+ */
+static inline uint32
+pg_rightmost_one32(uint32 word)
+{
+ int32 result = (int32) word & -((int32) word);
+ return (uint32) result;
+}
+
+static inline uint64
+pg_rightmost_one64(uint64 word)
+{
+ int64 result = (int64) word & -((int64) word);
+ return (uint64) result;
+}
+
/*
* pg_leftmost_one_pos32
* Returns the position of the most significant set bit in "word",
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 23bafec5f7..5bd3da4948 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3662,7 +3662,6 @@ shmem_request_hook_type
shmem_startup_hook_type
sig_atomic_t
sigjmp_buf
-signedbitmapword
sigset_t
size_t
slist_head
--
2.39.0
v18-0001-introduce-vector8_min-and-vector8_highbit_mask.patch
From b7935edac9046631ee9fca095bd8b3901cc5629b Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 24 Oct 2022 14:07:09 +0900
Subject: [PATCH v18 01/10] introduce vector8_min and vector8_highbit_mask
---
src/include/port/simd.h | 47 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 47 insertions(+)
diff --git a/src/include/port/simd.h b/src/include/port/simd.h
index c836360d4b..84d41a340a 100644
--- a/src/include/port/simd.h
+++ b/src/include/port/simd.h
@@ -77,6 +77,7 @@ static inline bool vector8_has(const Vector8 v, const uint8 c);
static inline bool vector8_has_zero(const Vector8 v);
static inline bool vector8_has_le(const Vector8 v, const uint8 c);
static inline bool vector8_is_highbit_set(const Vector8 v);
+static inline uint32 vector8_highbit_mask(const Vector8 v);
#ifndef USE_NO_SIMD
static inline bool vector32_is_highbit_set(const Vector32 v);
#endif
@@ -96,6 +97,7 @@ static inline Vector8 vector8_ssub(const Vector8 v1, const Vector8 v2);
*/
#ifndef USE_NO_SIMD
static inline Vector8 vector8_eq(const Vector8 v1, const Vector8 v2);
+static inline Vector8 vector8_min(const Vector8 v1, const Vector8 v2);
static inline Vector32 vector32_eq(const Vector32 v1, const Vector32 v2);
#endif
@@ -277,6 +279,36 @@ vector8_is_highbit_set(const Vector8 v)
#endif
}
+/*
+ * Return the bitmask of the high bit of each element.
+ */
+static inline uint32
+vector8_highbit_mask(const Vector8 v)
+{
+#ifdef USE_SSE2
+ return (uint32) _mm_movemask_epi8(v);
+#elif defined(USE_NEON)
+ static const uint8 mask[16] = {
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ };
+
+ uint8x16_t masked = vandq_u8(vld1q_u8(mask), (uint8x16_t) vshrq_n_s8(v, 7));
+ uint8x16_t maskedhi = vextq_u8(masked, masked, 8);
+
+ return (uint32) vaddvq_u16((uint16x8_t) vzip1q_u8(masked, maskedhi));
+#else
+ uint32 mask = 0;
+
+ for (Size i = 0; i < sizeof(Vector8); i++)
+ mask |= (((const uint8 *) &v)[i] >> 7) << i;
+
+ return mask;
+#endif
+}
+
/*
* Exactly like vector8_is_highbit_set except for the input type, so it
* looks at each byte separately.
@@ -372,4 +404,19 @@ vector32_eq(const Vector32 v1, const Vector32 v2)
}
#endif /* ! USE_NO_SIMD */
+/*
+ * Compare the given vectors and return the vector of minimum elements.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_min(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+ return _mm_min_epu8(v1, v2);
+#elif defined(USE_NEON)
+ return vminq_u8(v1, v2);
+#endif
+}
+#endif /* ! USE_NO_SIMD */
+
#endif /* SIMD_H */
--
2.39.0
v18-0005-Remove-chunk-from-the-common-node-type.patch
From 4a385a0667e2489e6b4b850c2f7699049d652811 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Thu, 12 Jan 2023 20:32:06 +0700
Subject: [PATCH v18 05/10] Remove chunk from the common node type
This enabled a possible optimization for updating
the parent node's child pointer during node growth.
This is not likely to buy us much, and removing it
reduces the common type size to 5 bytes.
TODO: Reducing the smallest node to 3 members will
eliminate padding and only take up 32 bytes for
inner nodes.
---
src/include/lib/radixtree.h | 14 +++++---------
1 file changed, 5 insertions(+), 9 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index b3d84da033..72735c4643 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -295,7 +295,6 @@ typedef struct RT_NODE
* RT_NODE_SPAN bits are then represented in chunk.
*/
uint8 shift;
- uint8 chunk;
/* Node kind, one per search/set algorithm */
uint8 kind;
@@ -964,7 +963,6 @@ static inline void
RT_COPY_NODE(RT_PTR_LOCAL newnode, RT_PTR_LOCAL oldnode)
{
newnode->shift = oldnode->shift;
- newnode->chunk = oldnode->chunk;
newnode->count = oldnode->count;
}
@@ -1026,7 +1024,6 @@ static void
RT_REPLACE_NODE(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC old_child,
RT_PTR_ALLOC new_child, uint64 key)
{
- Assert(old_child->chunk == new_child->chunk);
Assert(old_child->shift == new_child->shift);
if (parent == old_child)
@@ -1074,8 +1071,8 @@ RT_EXTEND(RT_RADIX_TREE *tree, uint64 key)
n4->base.chunks[0] = 0;
n4->children[0] = tree->root;
- tree->root->chunk = 0;
- tree->root = node;
+ /* Update the root */
+ tree->ctl->root = allocnode;
shift += RT_NODE_SPAN;
}
@@ -1104,8 +1101,7 @@ RT_SET_EXTEND(RT_RADIX_TREE *tree, uint64 key, uint64 value, RT_PTR_LOCAL parent
newchild = (RT_PTR_LOCAL) allocchild;
RT_INIT_NODE(newchild, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
newchild->shift = newshift;
- newchild->chunk = RT_GET_KEY_CHUNK(key, node->shift);
- RT_NODE_INSERT_INNER(tree, parent, node, key, newchild);
+ RT_NODE_INSERT_INNER(tree, parent, nodep, node, key, allocchild);
parent = node;
node = newchild;
@@ -1684,13 +1680,13 @@ rt_dump_node(RT_PTR_LOCAL node, int level, bool recurse)
{
char space[125] = {0};
- fprintf(stderr, "[%s] kind %d, fanout %d, count %u, shift %u, chunk 0x%X:\n",
+ fprintf(stderr, "[%s] kind %d, fanout %d, count %u, shift %u:\n",
NODE_IS_LEAF(node) ? "LEAF" : "INNR",
(node->kind == RT_NODE_KIND_4) ? 4 :
(node->kind == RT_NODE_KIND_32) ? 32 :
(node->kind == RT_NODE_KIND_125) ? 125 : 256,
node->fanout == 0 ? 256 : node->fanout,
- node->count, node->shift, node->chunk);
+ node->count, node->shift);
if (level > 0)
sprintf(space, "%*c", level * 4, ' ');
--
2.39.0
v18-0004-tool-for-measuring-radix-tree-performance.patch
From 3c7efecad8161974b7168b8a325ce1ae985774fd Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 16 Sep 2022 11:57:03 +0900
Subject: [PATCH v18 04/10] tool for measuring radix tree performance
XXX: Not for commit
TODO: adjust for templating
---
contrib/bench_radix_tree/Makefile | 21 +
.../bench_radix_tree--1.0.sql | 76 +++
contrib/bench_radix_tree/bench_radix_tree.c | 635 ++++++++++++++++++
.../bench_radix_tree/bench_radix_tree.control | 6 +
contrib/bench_radix_tree/expected/bench.out | 13 +
contrib/bench_radix_tree/sql/bench.sql | 16 +
6 files changed, 767 insertions(+)
create mode 100644 contrib/bench_radix_tree/Makefile
create mode 100644 contrib/bench_radix_tree/bench_radix_tree--1.0.sql
create mode 100644 contrib/bench_radix_tree/bench_radix_tree.c
create mode 100644 contrib/bench_radix_tree/bench_radix_tree.control
create mode 100644 contrib/bench_radix_tree/expected/bench.out
create mode 100644 contrib/bench_radix_tree/sql/bench.sql
diff --git a/contrib/bench_radix_tree/Makefile b/contrib/bench_radix_tree/Makefile
new file mode 100644
index 0000000000..952bb0ceae
--- /dev/null
+++ b/contrib/bench_radix_tree/Makefile
@@ -0,0 +1,21 @@
+# contrib/bench_radix_tree/Makefile
+
+MODULE_big = bench_radix_tree
+OBJS = \
+ bench_radix_tree.o
+
+EXTENSION = bench_radix_tree
+DATA = bench_radix_tree--1.0.sql
+
+REGRESS = bench_fixed_height
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/bench_radix_tree
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
new file mode 100644
index 0000000000..2fd689aa91
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
@@ -0,0 +1,76 @@
+/* contrib/bench_radix_tree/bench_radix_tree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION bench_radix_tree" to load this file. \quit
+
+create function bench_shuffle_search(
+minblk int4,
+maxblk int4,
+random_block bool DEFAULT false,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT array_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT array_load_ms int8,
+OUT rt_search_ms int8,
+OUT array_serach_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_seq_search(
+minblk int4,
+maxblk int4,
+random_block bool DEFAULT false,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT array_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT array_load_ms int8,
+OUT rt_search_ms int8,
+OUT array_serach_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_load_random_int(
+cnt int8,
+OUT mem_allocated int8,
+OUT load_ms int8)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_search_random_nodes(
+cnt int8,
+filter_str text DEFAULT NULL,
+OUT mem_allocated int8,
+OUT search_ms int8)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C VOLATILE PARALLEL UNSAFE;
+
+create function bench_fixed_height_search(
+fanout int4,
+OUT fanout int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT rt_search_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_node128_load(
+fanout int4,
+OUT fanout int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT rt_sparseload_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c
new file mode 100644
index 0000000000..a0693695e6
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -0,0 +1,635 @@
+/*-------------------------------------------------------------------------
+ *
+ * bench_radix_tree.c
+ *
+ * Copyright (c) 2016-2022, PostgreSQL Global Development Group
+ *
+ * contrib/bench_radix_tree/bench_radix_tree.c
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "funcapi.h"
+#include "lib/radixtree.h"
+#include <math.h>
+#include "miscadmin.h"
+#include "storage/lwlock.h"
+#include "utils/builtins.h"
+#include "utils/timestamp.h"
+
+PG_MODULE_MAGIC;
+
+/* run benchmark also for binary-search case? */
+/* #define MEASURE_BINARY_SEARCH 1 */
+
+#define TIDS_PER_BLOCK_FOR_LOAD 30
+#define TIDS_PER_BLOCK_FOR_LOOKUP 50
+
+PG_FUNCTION_INFO_V1(bench_seq_search);
+PG_FUNCTION_INFO_V1(bench_shuffle_search);
+PG_FUNCTION_INFO_V1(bench_load_random_int);
+PG_FUNCTION_INFO_V1(bench_fixed_height_search);
+PG_FUNCTION_INFO_V1(bench_search_random_nodes);
+PG_FUNCTION_INFO_V1(bench_node128_load);
+
+static uint64
+tid_to_key_off(ItemPointer tid, uint32 *off)
+{
+ uint64 upper;
+ uint32 shift = pg_ceil_log2_32(MaxHeapTuplesPerPage);
+ int64 tid_i;
+
+ Assert(ItemPointerGetOffsetNumber(tid) < MaxHeapTuplesPerPage);
+
+ tid_i = ItemPointerGetOffsetNumber(tid);
+ tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << shift;
+
+ /* log(sizeof(uint64) * BITS_PER_BYTE, 2) = log(64, 2) = 6 */
+ *off = tid_i & ((1 << 6) - 1);
+ upper = tid_i >> 6;
+ Assert(*off < (sizeof(uint64) * BITS_PER_BYTE));
+
+ Assert(*off < 64);
+
+ return upper;
+}
+
+static int
+shuffle_randrange(pg_prng_state *state, int lower, int upper)
+{
+ return (int) floor(pg_prng_double(state) * ((upper - lower) + 0.999999)) + lower;
+}
+
+/* Naive Fisher-Yates implementation*/
+static void
+shuffle_itemptrs(ItemPointer itemptr, uint64 nitems)
+{
+ /* reproducibility */
+ pg_prng_state state;
+
+ pg_prng_seed(&state, 0);
+
+ for (int i = 0; i < nitems - 1; i++)
+ {
+ int j = shuffle_randrange(&state, i, nitems - 1);
+ ItemPointerData t = itemptr[j];
+
+ itemptr[j] = itemptr[i];
+ itemptr[i] = t;
+ }
+}
+
+static ItemPointer
+generate_tids(BlockNumber minblk, BlockNumber maxblk, int ntids_per_blk, uint64 *ntids_p,
+ bool random_block)
+{
+ ItemPointer tids;
+ uint64 maxitems;
+ uint64 ntids = 0;
+ pg_prng_state state;
+
+ maxitems = (maxblk - minblk + 1) * ntids_per_blk;
+ tids = MemoryContextAllocHuge(TopTransactionContext,
+ sizeof(ItemPointerData) * maxitems);
+
+ if (random_block)
+ pg_prng_seed(&state, 0x9E3779B185EBCA87);
+
+ for (BlockNumber blk = minblk; blk < maxblk; blk++)
+ {
+ if (random_block && !pg_prng_bool(&state))
+ continue;
+
+ for (OffsetNumber off = FirstOffsetNumber;
+ off <= ntids_per_blk; off++)
+ {
+ CHECK_FOR_INTERRUPTS();
+
+ ItemPointerSetBlockNumber(&(tids[ntids]), blk);
+ ItemPointerSetOffsetNumber(&(tids[ntids]), off);
+
+ ntids++;
+ }
+ }
+
+ *ntids_p = ntids;
+ return tids;
+}
+
+#ifdef MEASURE_BINARY_SEARCH
+static int
+vac_cmp_itemptr(const void *left, const void *right)
+{
+ BlockNumber lblk,
+ rblk;
+ OffsetNumber loff,
+ roff;
+
+ lblk = ItemPointerGetBlockNumber((ItemPointer) left);
+ rblk = ItemPointerGetBlockNumber((ItemPointer) right);
+
+ if (lblk < rblk)
+ return -1;
+ if (lblk > rblk)
+ return 1;
+
+ loff = ItemPointerGetOffsetNumber((ItemPointer) left);
+ roff = ItemPointerGetOffsetNumber((ItemPointer) right);
+
+ if (loff < roff)
+ return -1;
+ if (loff > roff)
+ return 1;
+
+ return 0;
+}
+#endif
+
+static Datum
+bench_search(FunctionCallInfo fcinfo, bool shuffle)
+{
+ BlockNumber minblk = PG_GETARG_INT32(0);
+ BlockNumber maxblk = PG_GETARG_INT32(1);
+ bool random_block = PG_GETARG_BOOL(2);
+ radix_tree *rt = NULL;
+ uint64 ntids;
+ uint64 key;
+ uint64 last_key = PG_UINT64_MAX;
+ uint64 val = 0;
+ ItemPointer tids;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_load_ms,
+ rt_search_ms;
+ Datum values[7];
+ bool nulls[7];
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ tids = generate_tids(minblk, maxblk, TIDS_PER_BLOCK_FOR_LOAD, &ntids, random_block);
+
+ /* measure the load time of the radix tree */
+ rt = rt_create(CurrentMemoryContext);
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ uint32 off;
+
+ CHECK_FOR_INTERRUPTS();
+
+ key = tid_to_key_off(tid, &off);
+
+ if (last_key != PG_UINT64_MAX && last_key != key)
+ {
+ rt_set(rt, last_key, val);
+ val = 0;
+ }
+
+ last_key = key;
+ val |= (uint64) 1 << off;
+ }
+ if (last_key != PG_UINT64_MAX)
+ rt_set(rt, last_key, val);
+
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_load_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ if (shuffle)
+ shuffle_itemptrs(tids, ntids);
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+ /* measure the search time of the radix tree */
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ uint32 off;
+ volatile bool ret; /* prevent calling rt_search from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ key = tid_to_key_off(tid, &off);
+
+ ret = rt_search(rt, key, &val);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_search_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int64GetDatum(rt_num_entries(rt));
+ values[1] = Int64GetDatum(rt_memory_usage(rt));
+ values[2] = Int64GetDatum(sizeof(ItemPointerData) * ntids);
+ values[3] = Int64GetDatum(rt_load_ms);
+ nulls[4] = true; /* ar_load_ms */
+ values[5] = Int64GetDatum(rt_search_ms);
+ nulls[6] = true; /* ar_search_ms */
+
+#ifdef MEASURE_BINARY_SEARCH
+ {
+ ItemPointer itemptrs = NULL;
+
+ int64 ar_load_ms,
+ ar_search_ms;
+
+ /* measure the load time of the array */
+ itemptrs = MemoryContextAllocHuge(CurrentMemoryContext,
+ sizeof(ItemPointerData) * ntids);
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointerSetBlockNumber(&(itemptrs[i]),
+ ItemPointerGetBlockNumber(&(tids[i])));
+ ItemPointerSetOffsetNumber(&(itemptrs[i]),
+ ItemPointerGetOffsetNumber(&(tids[i])));
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ ar_load_ms = secs * 1000 + usecs / 1000;
+
+ /* next, measure the search time of the array */
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ volatile bool ret; /* prevent calling bsearch from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ ret = bsearch((void *) tid,
+ (void *) itemptrs,
+ ntids,
+ sizeof(ItemPointerData),
+ vac_cmp_itemptr);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ ar_search_ms = secs * 1000 + usecs / 1000;
+
+ /* set the result */
+ nulls[4] = false;
+ values[4] = Int64GetDatum(ar_load_ms);
+ nulls[6] = false;
+ values[6] = Int64GetDatum(ar_search_ms);
+ }
+#endif
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_seq_search(PG_FUNCTION_ARGS)
+{
+ return bench_search(fcinfo, false);
+}
+
+Datum
+bench_shuffle_search(PG_FUNCTION_ARGS)
+{
+ return bench_search(fcinfo, true);
+}
+
+Datum
+bench_load_random_int(PG_FUNCTION_ARGS)
+{
+ uint64 cnt = (uint64) PG_GETARG_INT64(0);
+ radix_tree *rt;
+ pg_prng_state state;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 load_time_ms;
+ Datum values[2];
+ bool nulls[2];
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ pg_prng_seed(&state, 0);
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ uint64 key = pg_prng_uint64(&state);
+
+ rt_set(rt, key, key);
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ load_time_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int64GetDatum(rt_memory_usage(rt));
+ values[1] = Int64GetDatum(load_time_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+/* copy of splitmix64() */
+static uint64
+hash64(uint64 x)
+{
+ x ^= x >> 30;
+ x *= UINT64CONST(0xbf58476d1ce4e5b9);
+ x ^= x >> 27;
+ x *= UINT64CONST(0x94d049bb133111eb);
+ x ^= x >> 31;
+ return x;
+}
+
+/* attempts to have a relatively even population of node kinds */
+Datum
+bench_search_random_nodes(PG_FUNCTION_ARGS)
+{
+ uint64 cnt = (uint64) PG_GETARG_INT64(0);
+ radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 search_time_ms;
+ Datum values[2] = {0};
+ bool nulls[2] = {0};
+ /* from trial and error */
+ uint64 filter = (((uint64) 0x7F<<32) | (0x07<<24) | (0xFF<<16) | 0xFF);
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ if (!PG_ARGISNULL(1))
+ {
+ char *filter_str = text_to_cstring(PG_GETARG_TEXT_P(1));
+
+ if (sscanf(filter_str, "0x%lX", &filter) == 0)
+ elog(ERROR, "invalid filter string %s", filter_str);
+ }
+ elog(NOTICE, "bench with filter 0x%lX", filter);
+
+ rt = rt_create(CurrentMemoryContext);
+
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ const uint64 hash = hash64(i);
+ const uint64 key = hash & filter;
+
+ rt_set(rt, key, key);
+ }
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ const uint64 hash = hash64(i);
+ const uint64 key = hash & filter;
+ uint64 val;
+ volatile bool ret; /* prevent calling rt_search from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ ret = rt_search(rt, key, &val);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ search_time_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ values[0] = Int64GetDatum(rt_memory_usage(rt));
+ values[1] = Int64GetDatum(search_time_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_fixed_height_search(PG_FUNCTION_ARGS)
+{
+ int fanout = PG_GETARG_INT32(0);
+ radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_load_ms,
+ rt_search_ms;
+ Datum values[5];
+ bool nulls[5];
+
+ /* test boundary between vector and iteration */
+ const int n_keys = 5 * 16 * 16 * 16 * 16;
+ uint64 r,
+ h,
+ i,
+ j,
+ k;
+ int key_id;
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+
+ key_id = 0;
+
+ /*
+ * lower nodes have limited fanout, the top is only limited by
+ * bits-per-byte
+ */
+ for (r = 1;; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ for (i = 1; i <= fanout; i++)
+ {
+ for (j = 1; j <= fanout; j++)
+ {
+ for (k = 1; k <= fanout; k++)
+ {
+ uint64 key;
+
+ key = (r << 32) | (h << 24) | (i << 16) | (j << 8) | (k);
+
+ CHECK_FOR_INTERRUPTS();
+
+ key_id++;
+ if (key_id > n_keys)
+ goto finish_set;
+
+ rt_set(rt, key, key_id);
+ }
+ }
+ }
+ }
+ }
+finish_set:
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_load_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ /* measure the search time of the radix tree */
+ start_time = GetCurrentTimestamp();
+
+ key_id = 0;
+ for (r = 1;; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ for (i = 1; i <= fanout; i++)
+ {
+ for (j = 1; j <= fanout; j++)
+ {
+ for (k = 1; k <= fanout; k++)
+ {
+ uint64 key,
+ val;
+
+ key = (r << 32) | (h << 24) | (i << 16) | (j << 8) | (k);
+
+ CHECK_FOR_INTERRUPTS();
+
+ key_id++;
+ if (key_id > n_keys)
+ goto finish_search;
+
+ rt_search(rt, key, &val);
+ }
+ }
+ }
+ }
+ }
+finish_search:
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_search_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int32GetDatum(fanout);
+ values[1] = Int64GetDatum(rt_num_entries(rt));
+ values[2] = Int64GetDatum(rt_memory_usage(rt));
+ values[3] = Int64GetDatum(rt_load_ms);
+ values[4] = Int64GetDatum(rt_search_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_node128_load(PG_FUNCTION_ARGS)
+{
+ int fanout = PG_GETARG_INT32(0);
+ radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_sparseload_ms;
+ Datum values[5];
+ bool nulls[5];
+
+ uint64 r,
+ h;
+ int key_id;
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ rt = rt_create(CurrentMemoryContext);
+
+ key_id = 0;
+
+ for (r = 1; r <= fanout; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (h);
+
+ key_id++;
+ rt_set(rt, key, key_id);
+ }
+ }
+
+ rt_stats(rt);
+
+ /* measure sparse deletion and re-loading */
+ start_time = GetCurrentTimestamp();
+
+ for (int t = 0; t<10000; t++)
+ {
+ /* delete one key in each leaf */
+ for (r = 1; r <= fanout; r++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (fanout);
+
+ rt_delete(rt, key);
+ }
+
+ /* add them all back */
+ key_id = 0;
+ for (r = 1; r <= fanout; r++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (fanout);
+
+ key_id++;
+ rt_set(rt, key, key_id);
+ }
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_sparseload_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int32GetDatum(fanout);
+ values[1] = Int64GetDatum(rt_num_entries(rt));
+ values[2] = Int64GetDatum(rt_memory_usage(rt));
+ values[3] = Int64GetDatum(rt_sparseload_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
diff --git a/contrib/bench_radix_tree/bench_radix_tree.control b/contrib/bench_radix_tree/bench_radix_tree.control
new file mode 100644
index 0000000000..1d988e6c9a
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.control
@@ -0,0 +1,6 @@
+# bench_radix_tree extension
+comment = 'benchmark suite for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/bench_radix_tree'
+relocatable = true
+trusted = true
diff --git a/contrib/bench_radix_tree/expected/bench.out b/contrib/bench_radix_tree/expected/bench.out
new file mode 100644
index 0000000000..60c303892e
--- /dev/null
+++ b/contrib/bench_radix_tree/expected/bench.out
@@ -0,0 +1,13 @@
+create extension bench_radix_tree;
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
diff --git a/contrib/bench_radix_tree/sql/bench.sql b/contrib/bench_radix_tree/sql/bench.sql
new file mode 100644
index 0000000000..a46018c9d4
--- /dev/null
+++ b/contrib/bench_radix_tree/sql/bench.sql
@@ -0,0 +1,16 @@
+create extension bench_radix_tree;
+
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
--
2.39.0
v18-0003-Add-radixtree-template.patch
From 49a7e1f26c28668a35d96b3533ca59d88119a251 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 14 Sep 2022 12:38:51 +0000
Subject: [PATCH v18 03/10] Add radixtree template
The only thing configurable at this point is function scope
and prefix, since the point is to see if this makes a shared
memory implementation clear and maintainable.
The key and value types are still hard-coded to uint64.
To make this more useful, at least the value type should be
configurable.
It might be good at some point to offer a different tree type,
e.g. "single-value leaves" to allow for variable length keys
and values, giving full flexibility to developers.
---
src/include/lib/radixtree.h | 2018 +++++++++++++++++
src/include/lib/radixtree_delete_impl.h | 100 +
src/include/lib/radixtree_insert_impl.h | 293 +++
src/include/lib/radixtree_iter_impl.h | 129 ++
src/include/lib/radixtree_search_impl.h | 102 +
src/test/modules/Makefile | 1 +
src/test/modules/meson.build | 1 +
src/test/modules/test_radixtree/.gitignore | 4 +
src/test/modules/test_radixtree/Makefile | 23 +
src/test/modules/test_radixtree/README | 7 +
.../expected/test_radixtree.out | 36 +
src/test/modules/test_radixtree/meson.build | 34 +
.../test_radixtree/sql/test_radixtree.sql | 7 +
.../test_radixtree/test_radixtree--1.0.sql | 8 +
.../modules/test_radixtree/test_radixtree.c | 588 +++++
.../test_radixtree/test_radixtree.control | 4 +
src/tools/pginclude/cpluspluscheck | 6 +
src/tools/pginclude/headerscheck | 6 +
18 files changed, 3367 insertions(+)
create mode 100644 src/include/lib/radixtree.h
create mode 100644 src/include/lib/radixtree_delete_impl.h
create mode 100644 src/include/lib/radixtree_insert_impl.h
create mode 100644 src/include/lib/radixtree_iter_impl.h
create mode 100644 src/include/lib/radixtree_search_impl.h
create mode 100644 src/test/modules/test_radixtree/.gitignore
create mode 100644 src/test/modules/test_radixtree/Makefile
create mode 100644 src/test/modules/test_radixtree/README
create mode 100644 src/test/modules/test_radixtree/expected/test_radixtree.out
create mode 100644 src/test/modules/test_radixtree/meson.build
create mode 100644 src/test/modules/test_radixtree/sql/test_radixtree.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree--1.0.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree.c
create mode 100644 src/test/modules/test_radixtree/test_radixtree.control
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
new file mode 100644
index 0000000000..b3d84da033
--- /dev/null
+++ b/src/include/lib/radixtree.h
@@ -0,0 +1,2018 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.c
+ * Implementation for adaptive radix tree.
+ *
+ * This module employs the idea from the paper "The Adaptive Radix Tree: ARTful
+ * Indexing for Main-Memory Databases" by Viktor Leis, Alfons Kemper, and Thomas
+ * Neumann, 2013. The radix tree uses adaptive node sizes, a small number of node
+ * types, each with a different numbers of elements. Depending on the number of
+ * children, the appropriate node type is used.
+ *
+ * There are some differences from the proposed implementation. For instance,
+ * there is no support for path compression and lazy path expansion. The radix
+ * tree supports only fixed-length keys, so we don't expect the tree level
+ * to be high.
+ *
+ * Both the key and the value are 64-bit unsigned integers. The inner nodes and
+ * the leaf nodes have slightly different structures: inner tree nodes
+ * (shift > 0) store the pointer to a child node as the value, while leaf nodes
+ * (shift == 0) store the 64-bit unsigned integer that is specified by the user as
+ * the value. The paper refers to this technique as "Multi-value leaves". We
+ * choose it to avoid an additional pointer traversal. It is the reason this code
+ * currently does not support variable-length keys.
+ *
+ * XXX: Most functions in this file have two variants for inner nodes and leaf
+ * nodes, so there is some duplicated code. While this sometimes makes code
+ * maintenance tricky, it reduces branch prediction misses when judging
+ * whether a node is an inner node or a leaf node.
+ *
+ * XXX: radix tree nodes are never shrunk.
+ *
+ * To generate a radix tree and associated functions for a use case several
+ * macros have to be #define'ed before this file is included. Including
+ * the file #undef's all those, so a new radix tree can be generated
+ * afterwards.
+ * The relevant parameters are:
+ * - RT_PREFIX - prefix for all symbol names generated. A prefix of 'foo'
+ * will result in radix tree type 'foo_radix_tree' and functions like
+ * 'foo_create'/'foo_free' and so forth.
+ * - RT_DECLARE - if defined function prototypes and type declarations are
+ * generated
+ * - RT_DEFINE - if defined function definitions are generated
+ * - RT_SCOPE - in which scope (e.g. extern, static inline) do function
+ * declarations reside
+ *
+ * Optional parameters:
+ * - RT_DEBUG - if defined add stats tracking and debugging functions
+ *
+ * Interface
+ * ---------
+ *
+ * RT_CREATE - Create a new, empty radix tree
+ * RT_FREE - Free the radix tree
+ * RT_SEARCH - Search a key-value pair
+ * RT_SET - Set a key-value pair
+ * RT_DELETE - Delete a key-value pair
+ * RT_BEGIN_ITERATE - Begin iterating through all key-value pairs
+ * RT_ITERATE_NEXT - Return next key-value pair, if any
+ * RT_END_ITER - End iteration
+ * RT_MEMORY_USAGE - Get the memory usage
+ * RT_NUM_ENTRIES - Get the number of key-value pairs
+ *
+ * RT_CREATE() creates an empty radix tree in the given memory context, along
+ * with child memory contexts for each kind of radix tree node under it.
+ *
+ * RT_ITERATE_NEXT() returns key-value pairs in ascending order of the key.
+ *
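+ * A minimal usage sketch (the 'foo' prefix, the 'static inline' scope, and
+ * the use of CurrentMemoryContext below are only an example):
+ *
+ *	#define RT_PREFIX foo
+ *	#define RT_SCOPE static inline
+ *	#define RT_DECLARE
+ *	#define RT_DEFINE
+ *	#include "lib/radixtree.h"
+ *
+ *	foo_radix_tree *tree;
+ *	uint64		key = 42;
+ *	uint64		value = 1;
+ *
+ *	tree = foo_create(CurrentMemoryContext);
+ *	foo_set(tree, key, value);
+ *	if (foo_search(tree, key, &value))
+ *		elog(NOTICE, "found " UINT64_FORMAT, value);
+ *	foo_free(tree);
+ *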
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/include/lib/radixtree.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "lib/stringinfo.h"
+#include "miscadmin.h"
+#include "nodes/bitmapset.h"
+#include "port/pg_bitutils.h"
+#include "port/pg_lfind.h"
+#include "utils/memutils.h"
+
+/* helpers */
+#define RT_MAKE_PREFIX(a) CppConcat(a,_)
+#define RT_MAKE_NAME(name) RT_MAKE_NAME_(RT_MAKE_PREFIX(RT_PREFIX),name)
+#define RT_MAKE_NAME_(a,b) CppConcat(a,b)
+
+/* function declarations */
+#define RT_CREATE RT_MAKE_NAME(create)
+#define RT_FREE RT_MAKE_NAME(free)
+#define RT_SEARCH RT_MAKE_NAME(search)
+#define RT_SET RT_MAKE_NAME(set)
+#define RT_BEGIN_ITERATE RT_MAKE_NAME(begin_iterate)
+#define RT_ITERATE_NEXT RT_MAKE_NAME(iterate_next)
+#define RT_END_ITERATE RT_MAKE_NAME(end_iterate)
+#define RT_DELETE RT_MAKE_NAME(delete)
+#define RT_MEMORY_USAGE RT_MAKE_NAME(memory_usage)
+#define RT_NUM_ENTRIES RT_MAKE_NAME(num_entries)
+#define RT_DUMP RT_MAKE_NAME(dump)
+#define RT_DUMP_SEARCH RT_MAKE_NAME(dump_search)
+#define RT_STATS RT_MAKE_NAME(stats)
+
+/* internal helper functions (no externally visible prototypes) */
+#define RT_NEW_ROOT RT_MAKE_NAME(new_root)
+#define RT_ALLOC_NODE RT_MAKE_NAME(alloc_node)
+#define RT_INIT_NODE RT_MAKE_NAME(init_node)
+#define RT_FREE_NODE RT_MAKE_NAME(free_node)
+#define RT_EXTEND RT_MAKE_NAME(extend)
+#define RT_SET_EXTEND RT_MAKE_NAME(set_extend)
+#define RT_GROW_NODE_KIND RT_MAKE_NAME(grow_node_kind)
+#define RT_COPY_NODE RT_MAKE_NAME(copy_node)
+#define RT_REPLACE_NODE RT_MAKE_NAME(replace_node)
+#define RT_NODE_4_SEARCH_EQ RT_MAKE_NAME(node_4_search_eq)
+#define RT_NODE_32_SEARCH_EQ RT_MAKE_NAME(node_32_search_eq)
+#define RT_NODE_4_GET_INSERTPOS RT_MAKE_NAME(node_4_get_insertpos)
+#define RT_NODE_32_GET_INSERTPOS RT_MAKE_NAME(node_32_get_insertpos)
+#define RT_CHUNK_CHILDREN_ARRAY_SHIFT RT_MAKE_NAME(chunk_children_array_shift)
+#define RT_CHUNK_VALUES_ARRAY_SHIFT RT_MAKE_NAME(chunk_values_array_shift)
+#define RT_CHUNK_CHILDREN_ARRAY_DELETE RT_MAKE_NAME(chunk_children_array_delete)
+#define RT_CHUNK_VALUES_ARRAY_DELETE RT_MAKE_NAME(chunk_values_array_delete)
+#define RT_CHUNK_CHILDREN_ARRAY_COPY RT_MAKE_NAME(chunk_children_array_copy)
+#define RT_CHUNK_VALUES_ARRAY_COPY RT_MAKE_NAME(chunk_values_array_copy)
+#define RT_NODE_125_IS_CHUNK_USED RT_MAKE_NAME(node_125_is_chunk_used)
+#define RT_NODE_INNER_125_GET_CHILD RT_MAKE_NAME(node_inner_125_get_child)
+#define RT_NODE_LEAF_125_GET_VALUE RT_MAKE_NAME(node_leaf_125_get_value)
+#define RT_NODE_INNER_256_IS_CHUNK_USED RT_MAKE_NAME(node_inner_256_is_chunk_used)
+#define RT_NODE_LEAF_256_IS_CHUNK_USED RT_MAKE_NAME(node_leaf_256_is_chunk_used)
+#define RT_NODE_INNER_256_GET_CHILD RT_MAKE_NAME(node_inner_256_get_child)
+#define RT_NODE_LEAF_256_GET_VALUE RT_MAKE_NAME(node_leaf_256_get_value)
+#define RT_NODE_INNER_256_SET RT_MAKE_NAME(node_inner_256_set)
+#define RT_NODE_LEAF_256_SET RT_MAKE_NAME(node_leaf_256_set)
+#define RT_NODE_INNER_256_DELETE RT_MAKE_NAME(node_inner_256_delete)
+#define RT_NODE_LEAF_256_DELETE RT_MAKE_NAME(node_leaf_256_delete)
+#define RT_KEY_GET_SHIFT RT_MAKE_NAME(key_get_shift)
+#define RT_SHIFT_GET_MAX_VAL RT_MAKE_NAME(shift_get_max_val)
+#define RT_NODE_SEARCH_INNER RT_MAKE_NAME(node_search_inner)
+#define RT_NODE_SEARCH_LEAF RT_MAKE_NAME(node_search_leaf)
+#define RT_NODE_DELETE_INNER RT_MAKE_NAME(node_delete_inner)
+#define RT_NODE_DELETE_LEAF RT_MAKE_NAME(node_delete_leaf)
+#define RT_NODE_INSERT_INNER RT_MAKE_NAME(node_insert_inner)
+#define RT_NODE_INSERT_LEAF RT_MAKE_NAME(node_insert_leaf)
+#define RT_NODE_INNER_ITERATE_NEXT RT_MAKE_NAME(node_inner_iterate_next)
+#define RT_NODE_LEAF_ITERATE_NEXT RT_MAKE_NAME(node_leaf_iterate_next)
+#define RT_UPDATE_ITER_STACK RT_MAKE_NAME(update_iter_stack)
+#define RT_ITER_UPDATE_KEY RT_MAKE_NAME(iter_update_key)
+#define RT_VERIFY_NODE RT_MAKE_NAME(verify_node)
+
+/* type declarations */
+#define RT_RADIX_TREE RT_MAKE_NAME(radix_tree)
+#define RT_ITER RT_MAKE_NAME(iter)
+#define RT_NODE RT_MAKE_NAME(node)
+#define RT_NODE_ITER RT_MAKE_NAME(node_iter)
+#define RT_NODE_BASE_4 RT_MAKE_NAME(node_base_4)
+#define RT_NODE_BASE_32 RT_MAKE_NAME(node_base_32)
+#define RT_NODE_BASE_125 RT_MAKE_NAME(node_base_125)
+#define RT_NODE_BASE_256 RT_MAKE_NAME(node_base_256)
+#define RT_NODE_INNER_4 RT_MAKE_NAME(node_inner_4)
+#define RT_NODE_INNER_32 RT_MAKE_NAME(node_inner_32)
+#define RT_NODE_INNER_125 RT_MAKE_NAME(node_inner_125)
+#define RT_NODE_INNER_256 RT_MAKE_NAME(node_inner_256)
+#define RT_NODE_LEAF_4 RT_MAKE_NAME(node_leaf_4)
+#define RT_NODE_LEAF_32 RT_MAKE_NAME(node_leaf_32)
+#define RT_NODE_LEAF_125 RT_MAKE_NAME(node_leaf_125)
+#define RT_NODE_LEAF_256 RT_MAKE_NAME(node_leaf_256)
+#define RT_SIZE_CLASS RT_MAKE_NAME(size_class)
+#define RT_SIZE_CLASS_ELEM RT_MAKE_NAME(size_class_elem)
+#define RT_SIZE_CLASS_INFO RT_MAKE_NAME(size_class_info)
+#define RT_CLASS_4_FULL RT_MAKE_NAME(class_4_full)
+#define RT_CLASS_32_PARTIAL RT_MAKE_NAME(class_32_partial)
+#define RT_CLASS_32_FULL RT_MAKE_NAME(class_32_full)
+#define RT_CLASS_125_FULL RT_MAKE_NAME(class_125_full)
+#define RT_CLASS_256 RT_MAKE_NAME(class_256)
+#define RT_KIND_MIN_SIZE_CLASS RT_MAKE_NAME(kind_min_size_class)
+
+/* generate forward declarations necessary to use the radix tree */
+#ifdef RT_DECLARE
+
+typedef struct RT_RADIX_TREE RT_RADIX_TREE;
+typedef struct RT_ITER RT_ITER;
+
+RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx);
+RT_SCOPE void RT_FREE(RT_RADIX_TREE *tree);
+RT_SCOPE bool RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, uint64 *val_p);
+RT_SCOPE bool RT_SET(RT_RADIX_TREE *tree, uint64 key, uint64 val);
+RT_SCOPE bool RT_DELETE(RT_RADIX_TREE *tree, uint64 key);
+
+RT_SCOPE RT_ITER * RT_BEGIN_ITERATE(RT_RADIX_TREE *tree);
+RT_SCOPE bool RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, uint64 *value_p);
+RT_SCOPE void RT_END_ITERATE(RT_ITER *iter);
+
+RT_SCOPE uint64 RT_MEMORY_USAGE(RT_RADIX_TREE *tree);
+RT_SCOPE uint64 RT_NUM_ENTRIES(RT_RADIX_TREE *tree);
+
+#ifdef RT_DEBUG
+RT_SCOPE void RT_DUMP(RT_RADIX_TREE *tree);
+RT_SCOPE void RT_DUMP_SEARCH(RT_RADIX_TREE *tree, uint64 key);
+RT_SCOPE void RT_STATS(RT_RADIX_TREE *tree);
+#endif
+
+#endif /* RT_DECLARE */
+
+
+/* generate implementation of the radix tree */
+#ifdef RT_DEFINE
+
+/* macros and types common to all implementations */
+#ifndef RT_COMMON
+#define RT_COMMON
+
+#ifdef RT_DEBUG
+#define UINT64_FORMAT_HEX "%" INT64_MODIFIER "X"
+#endif
+
+/* The number of bits encoded in one tree level */
+#define RT_NODE_SPAN BITS_PER_BYTE
+
+/* The maximum number of slots in a node */
+#define RT_NODE_MAX_SLOTS (1 << RT_NODE_SPAN)
+
+/* Mask for extracting a chunk from the key */
+#define RT_CHUNK_MASK ((1 << RT_NODE_SPAN) - 1)
+
+/* Maximum shift the radix tree uses */
+#define RT_MAX_SHIFT RT_KEY_GET_SHIFT(UINT64_MAX)
+
+/* Maximum number of levels in the radix tree */
+#define RT_MAX_LEVEL ((sizeof(uint64) * BITS_PER_BYTE) / RT_NODE_SPAN)
+
+/* Invalid index used in node-125 */
+#define RT_NODE_125_INVALID_IDX 0xFF
+
+/* Get a chunk from the key */
+#define RT_GET_KEY_CHUNK(key, shift) ((uint8) (((key) >> (shift)) & RT_CHUNK_MASK))
+
+/* For accessing bitmaps */
+#define BM_IDX(x) ((x) / BITS_PER_BITMAPWORD)
+#define BM_BIT(x) ((x) % BITS_PER_BITMAPWORD)
+
+/*
+ * Supported radix tree node kinds and size classes.
+ *
+ * There are 4 node kinds, and each node kind has one or two size classes,
+ * partial and full. The size classes of the same node kind share the same
+ * node structure but have a different fanout, which is stored in the 'fanout'
+ * field of RT_NODE. For example, in the size class with fanout 15, when a
+ * 16th element is to be inserted we allocate a larger area and memcpy the
+ * entire old node to it.
+ *
+ * This technique allows us to limit the node kinds to 4, which limits the
+ * number of cases in switch statements. It also allows a possible future
+ * optimization to encode the node kind in a pointer tag.
+ *
+ * These size classes have been chosen carefully so that they minimize the
+ * allocator padding for both inner and leaf nodes on DSA.
+ */
+#define RT_NODE_KIND_4 0x00
+#define RT_NODE_KIND_32 0x01
+#define RT_NODE_KIND_125 0x02
+#define RT_NODE_KIND_256 0x03
+#define RT_NODE_KIND_COUNT 4
+
+#endif /* RT_COMMON */
+
+
+typedef enum RT_SIZE_CLASS
+{
+ RT_CLASS_4_FULL = 0,
+ RT_CLASS_32_PARTIAL,
+ RT_CLASS_32_FULL,
+ RT_CLASS_125_FULL,
+ RT_CLASS_256
+} RT_SIZE_CLASS;
+
+/* Common type for all nodes types */
+typedef struct RT_NODE
+{
+ /*
+	 * Number of children. We use uint16 to be able to represent a full
+	 * node with 256 children, since a uint8 can count only up to 255.
+ */
+ uint16 count;
+
+ /* Max number of children. We can use uint8 because we never need to store 256 */
+ /* WIP: if we don't have a variable sized node4, this should instead be in the base
+ types as needed, since saving every byte is crucial for the smallest node kind */
+ uint8 fanout;
+
+ /*
+ * Shift indicates which part of the key space is represented by this
+ * node. That is, the key is shifted by 'shift' and the lowest
+ * RT_NODE_SPAN bits are then represented in chunk.
+ */
+ uint8 shift;
+ uint8 chunk;
+
+ /* Node kind, one per search/set algorithm */
+ uint8 kind;
+} RT_NODE;
+
+#define RT_PTR_LOCAL RT_NODE *
+
+#define RT_PTR_ALLOC RT_PTR_LOCAL
+
+#define NODE_IS_LEAF(n) (((RT_PTR_LOCAL) (n))->shift == 0)
+#define NODE_IS_EMPTY(n) (((RT_PTR_LOCAL) (n))->count == 0)
+#define VAR_NODE_HAS_FREE_SLOT(node) \
+ ((node)->base.n.count < (node)->base.n.fanout)
+#define FIXED_NODE_HAS_FREE_SLOT(node, class) \
+ ((node)->base.n.count < RT_SIZE_CLASS_INFO[class].fanout)
+
+/*
+ * Base types of each node kind, for leaf and inner nodes.
+ *
+ * The base types must be able to accommodate the largest size class for
+ * variable-sized node kinds.
+ */
+typedef struct RT_NODE_BASE_4
+{
+ RT_NODE n;
+
+	/* key chunks for up to 4 children or values */
+ uint8 chunks[4];
+} RT_NODE_BASE_4;
+
+typedef struct RT_NODE_BASE_32
+{
+ RT_NODE n;
+
+	/* key chunks for up to 32 children or values */
+ uint8 chunks[32];
+} RT_NODE_BASE_32;
+
+/*
+ * node-125 uses the slot_idxs array, an array of length RT_NODE_MAX_SLOTS
+ * (256), to store indexes into a second array that contains up to 125 values
+ * (or child pointers in inner nodes).
+ */
+typedef struct RT_NODE_BASE_125
+{
+ RT_NODE n;
+
+	/* For each chunk, the index of its slot in the children/values array */
+ uint8 slot_idxs[RT_NODE_MAX_SLOTS];
+
+ /* isset is a bitmap to track which slot is in use */
+ bitmapword isset[BM_IDX(128)];
+} RT_NODE_BASE_125;
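+
+/*
+ * For example (an illustrative sketch): if chunk 0x21 currently maps to
+ * slot 3, then slot_idxs[0x21] == 3, bit 3 of 'isset' is set, and the entry
+ * itself lives in values[3] of a leaf node (or children[3] of an inner node).
+ */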
+
+typedef struct RT_NODE_BASE_256
+{
+ RT_NODE n;
+} RT_NODE_BASE_256;
+
+/*
+ * Inner and leaf nodes.
+ *
+ * These are separate for two main reasons:
+ *
+ * 1) the value type might be different from something fitting into a
+ *    pointer-width type
+ * 2) we need to represent non-existing values in a key-type independent way.
+ *
+ * 1) is clearly worth being concerned about, but it's not clear 2) is as
+ * strong a reason. It might be better to just indicate non-existing entries
+ * the same way in inner nodes.
+ */
+typedef struct RT_NODE_INNER_4
+{
+ RT_NODE_BASE_4 base;
+
+ /* number of children depends on size class */
+ RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_INNER_4;
+
+typedef struct RT_NODE_LEAF_4
+{
+ RT_NODE_BASE_4 base;
+
+ /* number of values depends on size class */
+ uint64 values[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_LEAF_4;
+
+typedef struct RT_NODE_INNER_32
+{
+ RT_NODE_BASE_32 base;
+
+ /* number of children depends on size class */
+ RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_INNER_32;
+
+typedef struct RT_NODE_LEAF_32
+{
+ RT_NODE_BASE_32 base;
+
+ /* number of values depends on size class */
+ uint64 values[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_LEAF_32;
+
+typedef struct RT_NODE_INNER_125
+{
+ RT_NODE_BASE_125 base;
+
+ /* number of children depends on size class */
+ RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_INNER_125;
+
+typedef struct RT_NODE_LEAF_125
+{
+ RT_NODE_BASE_125 base;
+
+ /* number of values depends on size class */
+ uint64 values[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_LEAF_125;
+
+/*
+ * node-256 is the largest node kind. It has an array of length
+ * RT_NODE_MAX_SLOTS for directly storing values (or child pointers in
+ * inner nodes).
+ */
+typedef struct RT_NODE_INNER_256
+{
+ RT_NODE_BASE_256 base;
+
+ /* Slots for 256 children */
+ RT_PTR_ALLOC children[RT_NODE_MAX_SLOTS];
+} RT_NODE_INNER_256;
+
+typedef struct RT_NODE_LEAF_256
+{
+ RT_NODE_BASE_256 base;
+
+ /* isset is a bitmap to track which slot is in use */
+ bitmapword isset[BM_IDX(RT_NODE_MAX_SLOTS)];
+
+ /* Slots for 256 values */
+ uint64 values[RT_NODE_MAX_SLOTS];
+} RT_NODE_LEAF_256;
+
+/* Information for each size class */
+typedef struct RT_SIZE_CLASS_ELEM
+{
+ const char *name;
+ int fanout;
+
+ /* slab chunk size */
+ Size inner_size;
+ Size leaf_size;
+
+ /* slab block size */
+ Size inner_blocksize;
+ Size leaf_blocksize;
+} RT_SIZE_CLASS_ELEM;
+
+/*
+ * Calculate the slab blocksize so that we can allocate at least 32 chunks
+ * from the block.
+ */
+#define NODE_SLAB_BLOCK_SIZE(size) \
+ Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * (size), (size) * 32)
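+
+/*
+ * For example (assuming SLAB_DEFAULT_BLOCK_SIZE is 8kB): for a hypothetical
+ * chunk size of 300 bytes, (8192 / 300) * 300 = 8100 bytes would hold only
+ * 27 chunks, so the Max() picks 300 * 32 = 9600 bytes instead.
+ */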
+
+static const RT_SIZE_CLASS_ELEM RT_SIZE_CLASS_INFO[] = {
+ [RT_CLASS_4_FULL] = {
+ .name = "radix tree node 4",
+ .fanout = 4,
+ .inner_size = sizeof(RT_NODE_INNER_4) + 4 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_4) + 4 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_4) + 4 * sizeof(RT_PTR_ALLOC)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_4) + 4 * sizeof(uint64)),
+ },
+ [RT_CLASS_32_PARTIAL] = {
+ .name = "radix tree node 15",
+ .fanout = 15,
+ .inner_size = sizeof(RT_NODE_INNER_32) + 15 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_32) + 15 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_32) + 15 * sizeof(RT_PTR_ALLOC)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_32) + 15 * sizeof(uint64)),
+ },
+ [RT_CLASS_32_FULL] = {
+ .name = "radix tree node 32",
+ .fanout = 32,
+ .inner_size = sizeof(RT_NODE_INNER_32) + 32 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_32) + 32 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_32) + 32 * sizeof(RT_PTR_ALLOC)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_32) + 32 * sizeof(uint64)),
+ },
+ [RT_CLASS_125_FULL] = {
+ .name = "radix tree node 125",
+ .fanout = 125,
+ .inner_size = sizeof(RT_NODE_INNER_125) + 125 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_125) + 125 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_125) + 125 * sizeof(RT_PTR_ALLOC)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_125) + 125 * sizeof(uint64)),
+ },
+ [RT_CLASS_256] = {
+ .name = "radix tree node 256",
+ .fanout = 256,
+ .inner_size = sizeof(RT_NODE_INNER_256),
+ .leaf_size = sizeof(RT_NODE_LEAF_256),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_256)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_256)),
+ },
+};
+
+#define RT_SIZE_CLASS_COUNT lengthof(RT_SIZE_CLASS_INFO)
+
+/* Map from the node kind to its minimum size class */
+static const RT_SIZE_CLASS RT_KIND_MIN_SIZE_CLASS[RT_NODE_KIND_COUNT] = {
+ [RT_NODE_KIND_4] = RT_CLASS_4_FULL,
+ [RT_NODE_KIND_32] = RT_CLASS_32_PARTIAL,
+ [RT_NODE_KIND_125] = RT_CLASS_125_FULL,
+ [RT_NODE_KIND_256] = RT_CLASS_256,
+};
+
+/* A radix tree with nodes */
+typedef struct RT_RADIX_TREE
+{
+ MemoryContext context;
+
+ RT_PTR_ALLOC root;
+ uint64 max_val;
+ uint64 num_keys;
+
+ MemoryContextData *inner_slabs[RT_SIZE_CLASS_COUNT];
+ MemoryContextData *leaf_slabs[RT_SIZE_CLASS_COUNT];
+
+ /* statistics */
+#ifdef RT_DEBUG
+ int32 cnt[RT_SIZE_CLASS_COUNT];
+#endif
+} RT_RADIX_TREE;
+
+/*
+ * Iteration support.
+ *
+ * Iterating over the radix tree returns each key-value pair in ascending
+ * order of the key. To support this, we iterate over the nodes at each level.
+ *
+ * The RT_NODE_ITER struct tracks the iteration within a single node.
+ *
+ * RT_ITER is the struct for iterating over the whole radix tree, and it uses
+ * one RT_NODE_ITER per level. During the iteration we also construct the key
+ * whenever the node iteration information is updated, e.g., when advancing
+ * the current index within a node or when moving to the next node at the
+ * same level.
+ */
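+
+/*
+ * A minimal iteration sketch, assuming an instantiation with the 'foo'
+ * prefix as in the example near the top of this file:
+ *
+ *	foo_iter   *iter = foo_begin_iterate(tree);
+ *	uint64		key;
+ *	uint64		value;
+ *
+ *	while (foo_iterate_next(iter, &key, &value))
+ *		elog(NOTICE, UINT64_FORMAT " -> " UINT64_FORMAT, key, value);
+ *	foo_end_iterate(iter);
+ */
+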
+typedef struct RT_NODE_ITER
+{
+ RT_PTR_LOCAL node; /* current node being iterated */
+ int current_idx; /* current position. -1 for initial value */
+} RT_NODE_ITER;
+
+typedef struct RT_ITER
+{
+ RT_RADIX_TREE *tree;
+
+ /* Track the iteration on nodes of each level */
+ RT_NODE_ITER stack[RT_MAX_LEVEL];
+ int stack_len;
+
+ /* The key is being constructed during the iteration */
+ uint64 key;
+} RT_ITER;
+
+
+static bool RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_LOCAL node,
+ uint64 key, RT_PTR_LOCAL child);
+static bool RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_LOCAL node,
+ uint64 key, uint64 value);
+
+/* verification (available only with assertion) */
+static void RT_VERIFY_NODE(RT_PTR_LOCAL node);
+
+/*
+ * Return the index of the first chunk in the node that equals 'chunk'.
+ * Return -1 if there is no such chunk.
+ */
+static inline int
+RT_NODE_4_SEARCH_EQ(RT_NODE_BASE_4 *node, uint8 chunk)
+{
+ int idx = -1;
+
+ for (int i = 0; i < node->n.count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ idx = i;
+ break;
+ }
+ }
+
+ return idx;
+}
+
+/*
+ * Return the position at which 'chunk' should be inserted in the given node.
+ */
+static inline int
+RT_NODE_4_GET_INSERTPOS(RT_NODE_BASE_4 *node, uint8 chunk)
+{
+ int idx;
+
+ for (idx = 0; idx < node->n.count; idx++)
+ {
+ if (node->chunks[idx] >= chunk)
+ break;
+ }
+
+ return idx;
+}
+
+/*
+ * Return the index of the first chunk in the node that equals 'chunk'.
+ * Return -1 if there is no such chunk.
+ */
+static inline int
+RT_NODE_32_SEARCH_EQ(RT_NODE_BASE_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ uint32 bitfield;
+ int index_simd = -1;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index = -1;
+
+ for (int i = 0; i < count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ index = i;
+ break;
+ }
+ }
+#endif
+
+#ifndef USE_NO_SIMD
+ spread_chunk = vector8_broadcast(chunk);
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ cmp1 = vector8_eq(spread_chunk, haystack1);
+ cmp2 = vector8_eq(spread_chunk, haystack2);
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+/*
+ * Return the position at which 'chunk' should be inserted in the given node.
+ */
+static inline int
+RT_NODE_32_GET_INSERTPOS(RT_NODE_BASE_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ Vector8 min1;
+ Vector8 min2;
+ uint32 bitfield;
+ int index_simd;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index;
+
+ for (index = 0; index < count; index++)
+ {
+ if (node->chunks[index] >= chunk)
+ break;
+ }
+#endif
+
+#ifndef USE_NO_SIMD
+ spread_chunk = vector8_broadcast(chunk);
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ min1 = vector8_min(spread_chunk, haystack1);
+ min2 = vector8_min(spread_chunk, haystack2);
+ cmp1 = vector8_eq(spread_chunk, min1);
+ cmp2 = vector8_eq(spread_chunk, min2);
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+ else
+ index_simd = count;
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+/*
+ * Functions to manipulate both chunks array and children/values array.
+ * These are used for node-4 and node-32.
+ */
+
+/* Shift the elements at and after 'idx' to the right by one */
+static inline void
+RT_CHUNK_CHILDREN_ARRAY_SHIFT(uint8 *chunks, RT_PTR_ALLOC *children, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+ memmove(&(children[idx + 1]), &(children[idx]), sizeof(RT_PTR_ALLOC) * (count - idx));
+}
+
+static inline void
+RT_CHUNK_VALUES_ARRAY_SHIFT(uint8 *chunks, uint64 *values, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+	memmove(&(values[idx + 1]), &(values[idx]), sizeof(uint64) * (count - idx));
+}
+
+/* Delete the element at 'idx' */
+static inline void
+RT_CHUNK_CHILDREN_ARRAY_DELETE(uint8 *chunks, RT_PTR_ALLOC *children, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(children[idx]), &(children[idx + 1]), sizeof(RT_PTR_ALLOC) * (count - idx - 1));
+}
+
+static inline void
+RT_CHUNK_VALUES_ARRAY_DELETE(uint8 *chunks, uint64 *values, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(values[idx]), &(values[idx + 1]), sizeof(uint64) * (count - idx - 1));
+}
+
+/* Copy both chunks and children/values arrays */
+static inline void
+RT_CHUNK_CHILDREN_ARRAY_COPY(uint8 *src_chunks, RT_PTR_ALLOC *src_children,
+ uint8 *dst_chunks, RT_PTR_ALLOC *dst_children)
+{
+ const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_4_FULL].fanout;
+ const Size chunk_size = sizeof(uint8) * fanout;
+ const Size children_size = sizeof(RT_PTR_ALLOC) * fanout;
+
+ memcpy(dst_chunks, src_chunks, chunk_size);
+ memcpy(dst_children, src_children, children_size);
+}
+
+static inline void
+RT_CHUNK_VALUES_ARRAY_COPY(uint8 *src_chunks, uint64 *src_values,
+ uint8 *dst_chunks, uint64 *dst_values)
+{
+ const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_4_FULL].fanout;
+ const Size chunk_size = sizeof(uint8) * fanout;
+ const Size values_size = sizeof(uint64) * fanout;
+
+ memcpy(dst_chunks, src_chunks, chunk_size);
+ memcpy(dst_values, src_values, values_size);
+}
+
+/* Functions to manipulate inner and leaf node-125 */
+
+/* Does the given chunk in the node have a value? */
+static inline bool
+RT_NODE_125_IS_CHUNK_USED(RT_NODE_BASE_125 *node, uint8 chunk)
+{
+ return node->slot_idxs[chunk] != RT_NODE_125_INVALID_IDX;
+}
+
+static inline RT_PTR_ALLOC
+RT_NODE_INNER_125_GET_CHILD(RT_NODE_INNER_125 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ return node->children[node->base.slot_idxs[chunk]];
+}
+
+static inline uint64
+RT_NODE_LEAF_125_GET_VALUE(RT_NODE_LEAF_125 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ Assert(((RT_NODE_BASE_125 *) node)->slot_idxs[chunk] != RT_NODE_125_INVALID_IDX);
+ return node->values[node->base.slot_idxs[chunk]];
+}
+
+/* Functions to manipulate inner and leaf node-256 */
+
+/* Return true if the slot corresponding to the given chunk is in use */
+static inline bool
+RT_NODE_INNER_256_IS_CHUNK_USED(RT_NODE_INNER_256 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ return (node->children[chunk] != NULL);
+}
+
+static inline bool
+RT_NODE_LEAF_256_IS_CHUNK_USED(RT_NODE_LEAF_256 *node, uint8 chunk)
+{
+ int idx = BM_IDX(chunk);
+ int bitnum = BM_BIT(chunk);
+
+ Assert(NODE_IS_LEAF(node));
+ return (node->isset[idx] & ((bitmapword) 1 << bitnum)) != 0;
+}
+
+static inline RT_PTR_ALLOC
+RT_NODE_INNER_256_GET_CHILD(RT_NODE_INNER_256 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ Assert(RT_NODE_INNER_256_IS_CHUNK_USED(node, chunk));
+ return node->children[chunk];
+}
+
+static inline uint64
+RT_NODE_LEAF_256_GET_VALUE(RT_NODE_LEAF_256 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_LEAF_256_IS_CHUNK_USED(node, chunk));
+ return node->values[chunk];
+}
+
+/* Set the child in the node-256 */
+static inline void
+RT_NODE_INNER_256_SET(RT_NODE_INNER_256 *node, uint8 chunk, RT_PTR_ALLOC child)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->children[chunk] = child;
+}
+
+/* Set the value in the node-256 */
+static inline void
+RT_NODE_LEAF_256_SET(RT_NODE_LEAF_256 *node, uint8 chunk, uint64 value)
+{
+ int idx = BM_IDX(chunk);
+ int bitnum = BM_BIT(chunk);
+
+ Assert(NODE_IS_LEAF(node));
+ node->isset[idx] |= ((bitmapword) 1 << bitnum);
+ node->values[chunk] = value;
+}
+
+/* Delete the child (or value) at the given chunk position */
+static inline void
+RT_NODE_INNER_256_DELETE(RT_NODE_INNER_256 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->children[chunk] = NULL;
+}
+
+static inline void
+RT_NODE_LEAF_256_DELETE(RT_NODE_LEAF_256 *node, uint8 chunk)
+{
+ int idx = BM_IDX(chunk);
+ int bitnum = BM_BIT(chunk);
+
+ Assert(NODE_IS_LEAF(node));
+ node->isset[idx] &= ~((bitmapword) 1 << bitnum);
+}
+
+/*
+ * Return the shift needed for a node that can store the given key.
+ */
+static inline int
+RT_KEY_GET_SHIFT(uint64 key)
+{
+ return (key == 0)
+ ? 0
+ : (pg_leftmost_one_pos64(key) / RT_NODE_SPAN) * RT_NODE_SPAN;
+}
+
+/*
+ * Return the maximum key value that can be stored under a node with the given shift.
+ */
+static uint64
+RT_SHIFT_GET_MAX_VAL(int shift)
+{
+ if (shift == RT_MAX_SHIFT)
+ return UINT64_MAX;
+
+ return (UINT64CONST(1) << (shift + RT_NODE_SPAN)) - 1;
+}
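+
+/*
+ * A worked example of the two helpers above: key 0x10000 has its leftmost
+ * one bit at position 16, so RT_KEY_GET_SHIFT returns (16 / 8) * 8 = 16, and
+ * RT_SHIFT_GET_MAX_VAL(16) returns (1 << 24) - 1; a root node with shift 16
+ * therefore covers keys up to 0xFFFFFF.
+ */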
+
+/*
+ * Allocate a new node of the given size class.
+ */
+static RT_PTR_ALLOC
+RT_ALLOC_NODE(RT_RADIX_TREE *tree, RT_SIZE_CLASS size_class, bool inner)
+{
+ RT_PTR_ALLOC newnode;
+
+ if (inner)
+ newnode = (RT_PTR_ALLOC) MemoryContextAlloc(tree->inner_slabs[size_class],
+ RT_SIZE_CLASS_INFO[size_class].inner_size);
+ else
+ newnode = (RT_PTR_ALLOC) MemoryContextAlloc(tree->leaf_slabs[size_class],
+ RT_SIZE_CLASS_INFO[size_class].leaf_size);
+
+#ifdef RT_DEBUG
+ /* update the statistics */
+ tree->cnt[size_class]++;
+#endif
+
+ return newnode;
+}
+
+/* Initialize the node contents */
+static inline void
+RT_INIT_NODE(RT_PTR_LOCAL node, uint8 kind, RT_SIZE_CLASS size_class, bool inner)
+{
+ if (inner)
+ MemSet(node, 0, RT_SIZE_CLASS_INFO[size_class].inner_size);
+ else
+ MemSet(node, 0, RT_SIZE_CLASS_INFO[size_class].leaf_size);
+
+ node->kind = kind;
+ node->fanout = RT_SIZE_CLASS_INFO[size_class].fanout;
+
+ /* Initialize slot_idxs to invalid values */
+ if (kind == RT_NODE_KIND_125)
+ {
+ RT_NODE_BASE_125 *n125 = (RT_NODE_BASE_125 *) node;
+
+ memset(n125->slot_idxs, RT_NODE_125_INVALID_IDX, sizeof(n125->slot_idxs));
+ }
+
+ /*
+	 * Technically the fanout is 256, but we cannot store that in a uint8,
+	 * and since this is the largest size class the node will never grow.
+ */
+ if (kind == RT_NODE_KIND_256)
+ node->fanout = 0;
+}
+
+/*
+ * Create a new node as the root. Subordinate nodes will be created during
+ * the insertion.
+ */
+static void
+RT_NEW_ROOT(RT_RADIX_TREE *tree, uint64 key)
+{
+ int shift = RT_KEY_GET_SHIFT(key);
+ bool inner = shift > 0;
+ RT_PTR_ALLOC newnode;
+
+ newnode = RT_ALLOC_NODE(tree, RT_CLASS_4_FULL, inner);
+ RT_INIT_NODE(newnode, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
+ newnode->shift = shift;
+ tree->max_val = RT_SHIFT_GET_MAX_VAL(shift);
+ tree->root = newnode;
+}
+
+static inline void
+RT_COPY_NODE(RT_PTR_LOCAL newnode, RT_PTR_LOCAL oldnode)
+{
+ newnode->shift = oldnode->shift;
+ newnode->chunk = oldnode->chunk;
+ newnode->count = oldnode->count;
+}
+
+/*
+ * Create a new node with 'new_kind' and the same shift, chunk, and
+ * count as 'node'.
+ */
+static RT_NODE*
+RT_GROW_NODE_KIND(RT_RADIX_TREE *tree, RT_PTR_LOCAL node, uint8 new_kind)
+{
+ RT_PTR_ALLOC newnode;
+ bool inner = !NODE_IS_LEAF(node);
+
+ newnode = RT_ALLOC_NODE(tree, RT_KIND_MIN_SIZE_CLASS[new_kind], inner);
+ RT_INIT_NODE(newnode, new_kind, RT_KIND_MIN_SIZE_CLASS[new_kind], inner);
+ RT_COPY_NODE(newnode, node);
+
+ return newnode;
+}
+
+/* Free the given node */
+static void
+RT_FREE_NODE(RT_RADIX_TREE *tree, RT_PTR_ALLOC node)
+{
+ /* If we're deleting the root node, make the tree empty */
+ if (tree->root == node)
+ {
+ tree->root = NULL;
+ tree->max_val = 0;
+ }
+
+#ifdef RT_DEBUG
+ {
+ int i;
+
+ /* update the statistics */
+ for (i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ if (node->fanout == RT_SIZE_CLASS_INFO[i].fanout)
+ break;
+ }
+
+ /* fanout of node256 is intentionally 0 */
+ if (i == RT_SIZE_CLASS_COUNT)
+ i = RT_CLASS_256;
+
+ tree->cnt[i]--;
+ Assert(tree->cnt[i] >= 0);
+ }
+#endif
+
+ pfree(node);
+}
+
+/*
+ * Replace old_child with new_child, and free the old one.
+ */
+static void
+RT_REPLACE_NODE(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC old_child,
+ RT_PTR_ALLOC new_child, uint64 key)
+{
+ Assert(old_child->chunk == new_child->chunk);
+ Assert(old_child->shift == new_child->shift);
+
+ if (parent == old_child)
+ {
+ /* Replace the root node with the new large node */
+ tree->root = new_child;
+ }
+ else
+ {
+ bool replaced PG_USED_FOR_ASSERTS_ONLY;
+
+ replaced = RT_NODE_INSERT_INNER(tree, NULL, parent, key, new_child);
+ Assert(replaced);
+ }
+
+ RT_FREE_NODE(tree, old_child);
+}
+
+/*
+ * The radix tree doesn't have sufficient height to store the key. Extend the
+ * radix tree by adding new root nodes until it does.
+ */
+static void
+RT_EXTEND(RT_RADIX_TREE *tree, uint64 key)
+{
+ int target_shift;
+ int shift = tree->root->shift + RT_NODE_SPAN;
+
+ target_shift = RT_KEY_GET_SHIFT(key);
+
+ /* Grow tree from 'shift' to 'target_shift' */
+ while (shift <= target_shift)
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL node;
+ RT_NODE_INNER_4 *n4;
+
+ allocnode = RT_ALLOC_NODE(tree, RT_CLASS_4_FULL, true);
+ node = (RT_PTR_LOCAL) allocnode;
+ RT_INIT_NODE(node, RT_NODE_KIND_4, RT_CLASS_4_FULL, true);
+ node->shift = shift;
+ node->count = 1;
+
+ n4 = (RT_NODE_INNER_4 *) node;
+ n4->base.chunks[0] = 0;
+ n4->children[0] = tree->root;
+
+ tree->root->chunk = 0;
+ tree->root = node;
+
+ shift += RT_NODE_SPAN;
+ }
+
+ tree->max_val = RT_SHIFT_GET_MAX_VAL(target_shift);
+}
+
+/*
+ * The radix tree doesn't yet have the inner and leaf nodes for the given
+ * key-value pair. Create them, descending from 'node' down to the bottom.
+ */
+static inline void
+RT_SET_EXTEND(RT_RADIX_TREE *tree, uint64 key, uint64 value, RT_PTR_LOCAL parent,
+ RT_PTR_LOCAL node)
+{
+ int shift = node->shift;
+
+ while (shift >= RT_NODE_SPAN)
+ {
+ RT_PTR_ALLOC allocchild;
+ RT_PTR_LOCAL newchild;
+ int newshift = shift - RT_NODE_SPAN;
+ bool inner = newshift > 0;
+
+ allocchild = RT_ALLOC_NODE(tree, RT_CLASS_4_FULL, inner);
+ newchild = (RT_PTR_LOCAL) allocchild;
+ RT_INIT_NODE(newchild, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
+ newchild->shift = newshift;
+ newchild->chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ RT_NODE_INSERT_INNER(tree, parent, node, key, newchild);
+
+ parent = node;
+ node = newchild;
+ shift -= RT_NODE_SPAN;
+ }
+
+ RT_NODE_INSERT_LEAF(tree, parent, node, key, value);
+ tree->num_keys++;
+}
+
+/*
+ * Search for the child pointer corresponding to 'key' in the given node.
+ *
+ * Return true if the key is found, otherwise return false. On success, the
+ * child pointer is stored in *child_p.
+ */
+static inline bool
+RT_NODE_SEARCH_INNER(RT_PTR_LOCAL node, uint64 key, RT_PTR_ALLOC *child_p)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/*
+ * Search for the value corresponding to 'key' in the given node.
+ *
+ * Return true if the key is found, otherwise return false. On success, the
+ * value is stored in *value_p.
+ */
+static inline bool
+RT_NODE_SEARCH_LEAF(RT_PTR_LOCAL node, uint64 key, uint64 *value_p)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
+/*
+ * Delete the child pointer corresponding to 'key' in the given node.
+ *
+ * Return true if the key was found and deleted, otherwise return false.
+ */
+static inline bool
+RT_NODE_DELETE_INNER(RT_PTR_LOCAL node, uint64 key)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_delete_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/*
+ * Delete the value corresponding to 'key' in the given node.
+ *
+ * Return true if the key was found and deleted, otherwise return false.
+ */
+static inline bool
+RT_NODE_DELETE_LEAF(RT_PTR_LOCAL node, uint64 key)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_delete_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
+/* Insert the child to the inner node */
+static bool
+RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_LOCAL node, uint64 key,
+ RT_PTR_ALLOC child)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_insert_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/* Insert the value to the leaf node */
+static bool
+RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_LOCAL node,
+ uint64 key, uint64 value)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_insert_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
+/*
+ * Create the radix tree in the given memory context and return it.
+ */
+RT_SCOPE RT_RADIX_TREE *
+RT_CREATE(MemoryContext ctx)
+{
+ RT_RADIX_TREE *tree;
+ MemoryContext old_ctx;
+
+ old_ctx = MemoryContextSwitchTo(ctx);
+
+ tree = palloc(sizeof(RT_RADIX_TREE));
+ tree->context = ctx;
+ tree->root = NULL;
+ tree->max_val = 0;
+ tree->num_keys = 0;
+
+ /* Create the slab allocator for each size class */
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ tree->inner_slabs[i] = SlabContextCreate(ctx,
+ RT_SIZE_CLASS_INFO[i].name,
+ RT_SIZE_CLASS_INFO[i].inner_blocksize,
+ RT_SIZE_CLASS_INFO[i].inner_size);
+ tree->leaf_slabs[i] = SlabContextCreate(ctx,
+ RT_SIZE_CLASS_INFO[i].name,
+ RT_SIZE_CLASS_INFO[i].leaf_blocksize,
+ RT_SIZE_CLASS_INFO[i].leaf_size);
+#ifdef RT_DEBUG
+ tree->cnt[i] = 0;
+#endif
+ }
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return tree;
+}
+
+/*
+ * Free the given radix tree.
+ */
+RT_SCOPE void
+RT_FREE(RT_RADIX_TREE *tree)
+{
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ MemoryContextDelete(tree->inner_slabs[i]);
+ MemoryContextDelete(tree->leaf_slabs[i]);
+ }
+
+ pfree(tree);
+}
+
+/*
+ * Set 'key' to 'value'. If the entry already exists, update its value to
+ * 'value' and return true; otherwise insert it and return false.
+ */
+RT_SCOPE bool
+RT_SET(RT_RADIX_TREE *tree, uint64 key, uint64 value)
+{
+ int shift;
+ bool updated;
+ RT_PTR_LOCAL node;
+ RT_PTR_LOCAL parent;
+
+ /* Empty tree, create the root */
+ if (!tree->root)
+ RT_NEW_ROOT(tree, key);
+
+ /* Extend the tree if necessary */
+ if (key > tree->max_val)
+ RT_EXTEND(tree, key);
+
+ Assert(tree->root);
+
+ shift = tree->root->shift;
+ node = parent = tree->root;
+
+ /* Descend the tree until a leaf node */
+ while (shift >= 0)
+ {
+ RT_PTR_LOCAL child;
+
+ if (NODE_IS_LEAF(node))
+ break;
+
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ {
+ RT_SET_EXTEND(tree, key, value, parent, node);
+ return false;
+ }
+
+ parent = node;
+ node = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ updated = RT_NODE_INSERT_LEAF(tree, parent, node, key, value);
+
+ /* Update the statistics */
+ if (!updated)
+ tree->num_keys++;
+
+ return updated;
+}
+
+/*
+ * Search for the given key in the radix tree. Return true if the key is
+ * found, otherwise return false. On success, the value is stored in
+ * *value_p, so value_p must not be NULL.
+ */
+RT_SCOPE bool
+RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, uint64 *value_p)
+{
+ RT_PTR_LOCAL node;
+ int shift;
+
+ Assert(value_p != NULL);
+
+ if (!tree->root || key > tree->max_val)
+ return false;
+
+ node = tree->root;
+ shift = tree->root->shift;
+
+ /* Descend the tree until a leaf node */
+ while (shift >= 0)
+ {
+ RT_PTR_ALLOC child;
+
+ if (NODE_IS_LEAF(node))
+ break;
+
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ return false;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ return RT_NODE_SEARCH_LEAF(node, key, value_p);
+}
+
+/*
+ * Delete the given key from the radix tree. Return true if the key is found (and
+ * deleted), otherwise do nothing and return false.
+ */
+RT_SCOPE bool
+RT_DELETE(RT_RADIX_TREE *tree, uint64 key)
+{
+ RT_PTR_LOCAL node;
+ RT_PTR_ALLOC stack[RT_MAX_LEVEL] = {0};
+ int shift;
+ int level;
+ bool deleted;
+
+ if (!tree->root || key > tree->max_val)
+ return false;
+
+ /*
+ * Descend the tree to search the key while building a stack of nodes we
+ * visited.
+ */
+ node = tree->root;
+ shift = tree->root->shift;
+ level = -1;
+ while (shift > 0)
+ {
+ RT_PTR_ALLOC child;
+
+ /* Push the current node to the stack */
+ stack[++level] = node;
+
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ return false;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+	/* Delete the key from the leaf node if it exists */
+ Assert(NODE_IS_LEAF(node));
+ deleted = RT_NODE_DELETE_LEAF(node, key);
+
+ if (!deleted)
+ {
+ /* no key is found in the leaf node */
+ return false;
+ }
+
+ /* Found the key to delete. Update the statistics */
+ tree->num_keys--;
+
+ /*
+ * Return if the leaf node still has keys and we don't need to delete the
+ * node.
+ */
+ if (!NODE_IS_EMPTY(node))
+ return true;
+
+ /* Free the empty leaf node */
+ RT_FREE_NODE(tree, node);
+
+	/* Delete the key from the inner nodes, walking up the stack */
+ while (level >= 0)
+ {
+ node = stack[level--];
+
+ deleted = RT_NODE_DELETE_INNER(node, key);
+ Assert(deleted);
+
+ /* If the node didn't become empty, we stop deleting the key */
+ if (!NODE_IS_EMPTY(node))
+ break;
+
+ /* The node became empty */
+ RT_FREE_NODE(tree, node);
+ }
+
+ return true;
+}
+
+static inline void
+RT_ITER_UPDATE_KEY(RT_ITER *iter, uint8 chunk, uint8 shift)
+{
+ iter->key &= ~(((uint64) RT_CHUNK_MASK) << shift);
+ iter->key |= (((uint64) chunk) << shift);
+}
+
+/*
+ * Advance the slot in the inner node. Return the child if one exists,
+ * otherwise return NULL.
+ */
+static inline RT_PTR_LOCAL
+RT_NODE_INNER_ITERATE_NEXT(RT_ITER *iter, RT_NODE_ITER *node_iter)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_iter_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/*
+ * Advance the slot in the leaf node. On success, return true and store the
+ * value in *value_p, otherwise return false.
+ */
+static inline bool
+RT_NODE_LEAF_ITERATE_NEXT(RT_ITER *iter, RT_NODE_ITER *node_iter,
+ uint64 *value_p)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_iter_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
+/*
+ * Update each node_iter for inner nodes in the iterator node stack.
+ */
+static void
+RT_UPDATE_ITER_STACK(RT_ITER *iter, RT_PTR_LOCAL from_node, int from)
+{
+ int level = from;
+ RT_PTR_LOCAL node = from_node;
+
+ for (;;)
+ {
+ RT_NODE_ITER *node_iter = &(iter->stack[level--]);
+
+ node_iter->node = node;
+ node_iter->current_idx = -1;
+
+ /* We don't advance the leaf node iterator here */
+ if (NODE_IS_LEAF(node))
+ return;
+
+ /* Advance to the next slot in the inner node */
+ node = RT_NODE_INNER_ITERATE_NEXT(iter, node_iter);
+
+		/* We must find the first child in the node */
+ Assert(node);
+ }
+}
+
+/* Create and return the iterator for the given radix tree */
+RT_SCOPE RT_ITER *
+RT_BEGIN_ITERATE(RT_RADIX_TREE *tree)
+{
+ MemoryContext old_ctx;
+ RT_ITER *iter;
+ int top_level;
+
+ old_ctx = MemoryContextSwitchTo(tree->context);
+
+ iter = (RT_ITER *) palloc0(sizeof(RT_ITER));
+ iter->tree = tree;
+
+	/* empty tree; restore the memory context before returning */
+	if (!iter->tree->root)
+	{
+		MemoryContextSwitchTo(old_ctx);
+		return iter;
+	}
+
+ top_level = iter->tree->root->shift / RT_NODE_SPAN;
+ iter->stack_len = top_level;
+
+ /*
+	 * Descend to the leftmost leaf node from the root. The key is
+	 * constructed while descending to the leaf.
+ */
+ RT_UPDATE_ITER_STACK(iter, iter->tree->root, top_level);
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return iter;
+}
+
+/*
+ * Return true, setting *key_p and *value_p, if there is a next key; otherwise
+ * return false.
+ */
+RT_SCOPE bool
+RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, uint64 *value_p)
+{
+ /* Empty tree */
+ if (!iter->tree->root)
+ return false;
+
+ for (;;)
+ {
+ RT_PTR_LOCAL child = NULL;
+ uint64 value;
+ int level;
+ bool found;
+
+ /* Advance the leaf node iterator to get next key-value pair */
+ found = RT_NODE_LEAF_ITERATE_NEXT(iter, &(iter->stack[0]), &value);
+
+ if (found)
+ {
+ *key_p = iter->key;
+ *value_p = value;
+ return true;
+ }
+
+ /*
+		 * We've visited all values in the leaf node, so advance the inner
+		 * node iterators from level 1 until we find the next child node.
+ */
+ for (level = 1; level <= iter->stack_len; level++)
+ {
+ child = RT_NODE_INNER_ITERATE_NEXT(iter, &(iter->stack[level]));
+
+ if (child)
+ break;
+ }
+
+ /* the iteration finished */
+ if (!child)
+ return false;
+
+ /*
+ * Set the node to the node iterator and update the iterator stack
+ * from this node.
+ */
+ RT_UPDATE_ITER_STACK(iter, child, level - 1);
+
+ /* Node iterators are updated, so try again from the leaf */
+ }
+
+ return false;
+}
+
+RT_SCOPE void
+RT_END_ITERATE(RT_ITER *iter)
+{
+ pfree(iter);
+}
+
+/*
+ * Return the number of keys in the radix tree.
+ */
+RT_SCOPE uint64
+RT_NUM_ENTRIES(RT_RADIX_TREE *tree)
+{
+ return tree->num_keys;
+}
+
+/*
+ * Return the amount of memory used by the radix tree.
+ */
+RT_SCOPE uint64
+RT_MEMORY_USAGE(RT_RADIX_TREE *tree)
+{
+ Size total = sizeof(RT_RADIX_TREE);
+
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
+ total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
+ }
+
+ return total;
+}
+
+/*
+ * Verify the radix tree node.
+ */
+static void
+RT_VERIFY_NODE(RT_PTR_LOCAL node)
+{
+#ifdef USE_ASSERT_CHECKING
+ Assert(node->count >= 0);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ RT_NODE_BASE_4 *n4 = (RT_NODE_BASE_4 *) node;
+
+ for (int i = 1; i < n4->n.count; i++)
+ Assert(n4->chunks[i - 1] < n4->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE_BASE_32 *n32 = (RT_NODE_BASE_32 *) node;
+
+ for (int i = 1; i < n32->n.count; i++)
+ Assert(n32->chunks[i - 1] < n32->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE_BASE_125 *n125 = (RT_NODE_BASE_125 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ uint8 slot = n125->slot_idxs[i];
+ int idx = BM_IDX(slot);
+ int bitnum = BM_BIT(slot);
+
+ if (!RT_NODE_125_IS_CHUNK_USED(n125, i))
+ continue;
+
+ /* Check if the corresponding slot is used */
+ Assert(slot < node->fanout);
+ Assert((n125->isset[idx] & ((bitmapword) 1 << bitnum)) != 0);
+
+ cnt++;
+ }
+
+ Assert(n125->n.count == cnt);
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_256 *n256 = (RT_NODE_LEAF_256 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < BM_IDX(RT_NODE_MAX_SLOTS); i++)
+ cnt += bmw_popcount(n256->isset[i]);
+
+				/* Check if the number of used chunks matches */
+ Assert(n256->base.n.count == cnt);
+
+ break;
+ }
+ }
+ }
+#endif
+}
+
+/***************** DEBUG FUNCTIONS *****************/
+#ifdef RT_DEBUG
+RT_SCOPE void
+RT_STATS(RT_RADIX_TREE *tree)
+{
+ ereport(NOTICE, (errmsg("num_keys = " UINT64_FORMAT ", height = %d, n4 = %u, n15 = %u, n32 = %u, n125 = %u, n256 = %u",
+ tree->num_keys,
+ tree->root->shift / RT_NODE_SPAN,
+ tree->cnt[RT_CLASS_4_FULL],
+ tree->cnt[RT_CLASS_32_PARTIAL],
+ tree->cnt[RT_CLASS_32_FULL],
+ tree->cnt[RT_CLASS_125_FULL],
+ tree->cnt[RT_CLASS_256])));
+}
+
+static void
+rt_dump_node(RT_PTR_LOCAL node, int level, bool recurse)
+{
+ char space[125] = {0};
+
+ fprintf(stderr, "[%s] kind %d, fanout %d, count %u, shift %u, chunk 0x%X:\n",
+ NODE_IS_LEAF(node) ? "LEAF" : "INNR",
+ (node->kind == RT_NODE_KIND_4) ? 4 :
+ (node->kind == RT_NODE_KIND_32) ? 32 :
+ (node->kind == RT_NODE_KIND_125) ? 125 : 256,
+ node->fanout == 0 ? 256 : node->fanout,
+ node->count, node->shift, node->chunk);
+
+ if (level > 0)
+ sprintf(space, "%*c", level * 4, ' ');
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_4 *n4 = (RT_NODE_LEAF_4 *) node;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, n4->base.chunks[i], n4->values[i]);
+ }
+ else
+ {
+ RT_NODE_INNER_4 *n4 = (RT_NODE_INNER_4 *) node;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, n4->base.chunks[i]);
+
+ if (recurse)
+ rt_dump_node(n4->children[i], level + 1, recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_32 *n32 = (RT_NODE_LEAF_32 *) node;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, n32->base.chunks[i], n32->values[i]);
+ }
+ else
+ {
+ RT_NODE_INNER_32 *n32 = (RT_NODE_INNER_32 *) node;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, n32->base.chunks[i]);
+
+ if (recurse)
+ {
+ rt_dump_node(n32->children[i], level + 1, recurse);
+ }
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE_BASE_125 *b125 = (RT_NODE_BASE_125 *) node;
+
+ fprintf(stderr, "slot_idxs ");
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(b125, i))
+ continue;
+
+ fprintf(stderr, " [%d]=%d, ", i, b125->slot_idxs[i]);
+ }
+ if (NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_125 *n = (RT_NODE_LEAF_125 *) node;
+
+ fprintf(stderr, ", isset-bitmap:");
+ for (int i = 0; i < BM_IDX(128); i++)
+ {
+ fprintf(stderr, UINT64_FORMAT_HEX " ", (uint64) n->base.isset[i]);
+ }
+ fprintf(stderr, "\n");
+ }
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(b125, i))
+ continue;
+
+ if (NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_125 *n125 = (RT_NODE_LEAF_125 *) b125;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, i, RT_NODE_LEAF_125_GET_VALUE(n125, i));
+ }
+ else
+ {
+ RT_NODE_INNER_125 *n125 = (RT_NODE_INNER_125 *) b125;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, i);
+
+ if (recurse)
+ rt_dump_node(RT_NODE_INNER_125_GET_CHILD(n125, i),
+ level + 1, recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_256 *n256 = (RT_NODE_LEAF_256 *) node;
+
+ if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, i))
+ continue;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, i, RT_NODE_LEAF_256_GET_VALUE(n256, i));
+ }
+ else
+ {
+ RT_NODE_INNER_256 *n256 = (RT_NODE_INNER_256 *) node;
+
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
+ continue;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, i);
+
+ if (recurse)
+ rt_dump_node(RT_NODE_INNER_256_GET_CHILD(n256, i), level + 1,
+ recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ }
+}
+
+RT_SCOPE void
+RT_DUMP_SEARCH(RT_RADIX_TREE *tree, uint64 key)
+{
+ RT_PTR_LOCAL node;
+ int shift;
+ int level = 0;
+
+ elog(NOTICE, "-----------------------------------------------------------");
+ elog(NOTICE, "max_val = " UINT64_FORMAT "(0x" UINT64_FORMAT_HEX ")",
+ tree->max_val, tree->max_val);
+
+ if (!tree->root)
+ {
+ elog(NOTICE, "tree is empty");
+ return;
+ }
+
+ if (key > tree->max_val)
+ {
+ elog(NOTICE, "key " UINT64_FORMAT "(0x" UINT64_FORMAT_HEX ") is larger than max val",
+ key, key);
+ return;
+ }
+
+ node = tree->root;
+ shift = tree->root->shift;
+ while (shift >= 0)
+ {
+ RT_PTR_LOCAL child;
+
+ rt_dump_node(node, level, false);
+
+ if (NODE_IS_LEAF(node))
+ {
+ uint64 dummy;
+
+			/* We reached a leaf node; find the corresponding slot */
+ RT_NODE_SEARCH_LEAF(node, key, &dummy);
+
+ break;
+ }
+
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ break;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ level++;
+ }
+}
+
+RT_SCOPE void
+RT_DUMP(RT_RADIX_TREE *tree)
+{
+
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ fprintf(stderr, "%s\tinner_size %zu\tinner_blocksize %zu\tleaf_size %zu\tleaf_blocksize %zu\n",
+ RT_SIZE_CLASS_INFO[i].name,
+ RT_SIZE_CLASS_INFO[i].inner_size,
+ RT_SIZE_CLASS_INFO[i].inner_blocksize,
+ RT_SIZE_CLASS_INFO[i].leaf_size,
+ RT_SIZE_CLASS_INFO[i].leaf_blocksize);
+ fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->max_val);
+
+ if (!tree->root)
+ {
+ fprintf(stderr, "empty tree\n");
+ return;
+ }
+
+ rt_dump_node(tree->root, 0, true);
+}
+#endif
+
+#endif /* RT_DEFINE */
+
+
+/* undefine external parameters, so next radix tree can be defined */
+#undef RT_PREFIX
+#undef RT_SCOPE
+#undef RT_DECLARE
+#undef RT_DEFINE
+
+/* locally declared macros */
+#undef NODE_IS_LEAF
+#undef NODE_IS_EMPTY
+#undef VAR_NODE_HAS_FREE_SLOT
+#undef FIXED_NODE_HAS_FREE_SLOT
+#undef RT_SIZE_CLASS_COUNT
+
+/* type declarations */
+#undef RT_RADIX_TREE
+#undef RT_ITER
+#undef RT_NODE
+#undef RT_NODE_ITER
+#undef RT_NODE_BASE_4
+#undef RT_NODE_BASE_32
+#undef RT_NODE_BASE_125
+#undef RT_NODE_BASE_256
+#undef RT_NODE_INNER_4
+#undef RT_NODE_INNER_32
+#undef RT_NODE_INNER_125
+#undef RT_NODE_INNER_256
+#undef RT_NODE_LEAF_4
+#undef RT_NODE_LEAF_32
+#undef RT_NODE_LEAF_125
+#undef RT_NODE_LEAF_256
+#undef RT_SIZE_CLASS
+#undef RT_SIZE_CLASS_ELEM
+#undef RT_SIZE_CLASS_INFO
+#undef RT_CLASS_4_FULL
+#undef RT_CLASS_32_PARTIAL
+#undef RT_CLASS_32_FULL
+#undef RT_CLASS_125_FULL
+#undef RT_CLASS_256
+#undef RT_KIND_MIN_SIZE_CLASS
+
+/* function declarations */
+#undef RT_CREATE
+#undef RT_FREE
+#undef RT_SEARCH
+#undef RT_SET
+#undef RT_BEGIN_ITERATE
+#undef RT_ITERATE_NEXT
+#undef RT_END_ITERATE
+#undef RT_DELETE
+#undef RT_MEMORY_USAGE
+#undef RT_NUM_ENTRIES
+#undef RT_DUMP
+#undef RT_DUMP_SEARCH
+#undef RT_STATS
+
+/* internal helper functions */
+#undef RT_NEW_ROOT
+#undef RT_ALLOC_NODE
+#undef RT_INIT_NODE
+#undef RT_FREE_NODE
+#undef RT_EXTEND
+#undef RT_SET_EXTEND
+#undef RT_GROW_NODE_KIND
+#undef RT_COPY_NODE
+#undef RT_REPLACE_NODE
+#undef RT_NODE_4_SEARCH_EQ
+#undef RT_NODE_32_SEARCH_EQ
+#undef RT_NODE_4_GET_INSERTPOS
+#undef RT_NODE_32_GET_INSERTPOS
+#undef RT_CHUNK_CHILDREN_ARRAY_SHIFT
+#undef RT_CHUNK_VALUES_ARRAY_SHIFT
+#undef RT_CHUNK_CHILDREN_ARRAY_DELETE
+#undef RT_CHUNK_VALUES_ARRAY_DELETE
+#undef RT_CHUNK_CHILDREN_ARRAY_COPY
+#undef RT_CHUNK_VALUES_ARRAY_COPY
+#undef RT_NODE_125_IS_CHUNK_USED
+#undef RT_NODE_INNER_125_GET_CHILD
+#undef RT_NODE_LEAF_125_GET_VALUE
+#undef RT_NODE_INNER_256_IS_CHUNK_USED
+#undef RT_NODE_LEAF_256_IS_CHUNK_USED
+#undef RT_NODE_INNER_256_GET_CHILD
+#undef RT_NODE_LEAF_256_GET_VALUE
+#undef RT_NODE_INNER_256_SET
+#undef RT_NODE_LEAF_256_SET
+#undef RT_NODE_INNER_256_DELETE
+#undef RT_NODE_LEAF_256_DELETE
+#undef RT_KEY_GET_SHIFT
+#undef RT_SHIFT_GET_MAX_VAL
+#undef RT_NODE_SEARCH_INNER
+#undef RT_NODE_SEARCH_LEAF
+#undef RT_NODE_DELETE_INNER
+#undef RT_NODE_DELETE_LEAF
+#undef RT_NODE_INSERT_INNER
+#undef RT_NODE_INSERT_LEAF
+#undef RT_NODE_INNER_ITERATE_NEXT
+#undef RT_NODE_LEAF_ITERATE_NEXT
+#undef RT_UPDATE_ITER_STACK
+#undef RT_ITER_UPDATE_KEY
+#undef RT_VERIFY_NODE
+
+#undef RT_DEBUG
diff --git a/src/include/lib/radixtree_delete_impl.h b/src/include/lib/radixtree_delete_impl.h
new file mode 100644
index 0000000000..6eefc63e19
--- /dev/null
+++ b/src/include/lib/radixtree_delete_impl.h
@@ -0,0 +1,100 @@
+/* TODO: shrink nodes */
+
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE4_TYPE RT_NODE_INNER_4
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE4_TYPE RT_NODE_LEAF_4
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ RT_NODE4_TYPE *n4 = (RT_NODE4_TYPE *) node;
+ int idx = RT_NODE_4_SEARCH_EQ((RT_NODE_BASE_4 *) n4, chunk);
+
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_DELETE(n4->base.chunks, (uint64 *) n4->values,
+ n4->base.n.count, idx);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_DELETE(n4->base.chunks, n4->children,
+ n4->base.n.count, idx);
+#endif
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
+ int idx = RT_NODE_32_SEARCH_EQ((RT_NODE_BASE_32 *) n32, chunk);
+
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_DELETE(n32->base.chunks, (uint64 *) n32->values,
+ n32->base.n.count, idx);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_DELETE(n32->base.chunks, n32->children,
+ n32->base.n.count, idx);
+#endif
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+ int slotpos = n125->base.slot_idxs[chunk];
+ int idx;
+ int bitnum;
+
+ if (slotpos == RT_NODE_125_INVALID_IDX)
+ return false;
+
+ idx = BM_IDX(slotpos);
+ bitnum = BM_BIT(slotpos);
+ n125->base.isset[idx] &= ~((bitmapword) 1 << bitnum);
+ n125->base.slot_idxs[chunk] = RT_NODE_125_INVALID_IDX;
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk))
+#else
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, chunk))
+#endif
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_NODE_LEAF_256_DELETE(n256, chunk);
+#else
+ RT_NODE_INNER_256_DELETE(n256, chunk);
+#endif
+ break;
+ }
+ }
+
+ /* update statistics */
+ node->count--;
+
+ return true;
+
+#undef RT_NODE4_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_insert_impl.h b/src/include/lib/radixtree_insert_impl.h
new file mode 100644
index 0000000000..ff76583402
--- /dev/null
+++ b/src/include/lib/radixtree_insert_impl.h
@@ -0,0 +1,293 @@
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE4_TYPE RT_NODE_INNER_4
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE4_TYPE RT_NODE_LEAF_4
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool chunk_exists = false;
+ RT_NODE *newnode = NULL;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ Assert(NODE_IS_LEAF(node));
+#else
+ Assert(!NODE_IS_LEAF(node));
+#endif
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ RT_NODE4_TYPE *n4 = (RT_NODE4_TYPE *) node;
+ int idx;
+
+ idx = RT_NODE_4_SEARCH_EQ(&n4->base, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+#ifdef RT_NODE_LEVEL_LEAF
+ n4->values[idx] = value;
+#else
+ n4->children[idx] = child;
+#endif
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n4)))
+ {
+ RT_NODE32_TYPE *new32;
+
+ /* grow node from 4 to 32 */
+ newnode = RT_GROW_NODE_KIND(tree, node, RT_NODE_KIND_32);
+ new32 = (RT_NODE32_TYPE *) newnode;
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_COPY(n4->base.chunks, n4->values,
+ new32->base.chunks, new32->values);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_COPY(n4->base.chunks, n4->children,
+ new32->base.chunks, new32->children);
+#endif
+ Assert(parent != NULL);
+ RT_REPLACE_NODE(tree, parent, node, newnode, key);
+ node = newnode;
+ }
+ else
+ {
+ int insertpos = RT_NODE_4_GET_INSERTPOS(&n4->base, chunk);
+ int count = n4->base.n.count;
+
+				/* shift chunks and children (or values) */
+ if (insertpos < count)
+ {
+ Assert(count > 0);
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_SHIFT(n4->base.chunks, n4->values,
+ count, insertpos);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_SHIFT(n4->base.chunks, n4->children,
+ count, insertpos);
+#endif
+ }
+
+ n4->base.chunks[insertpos] = chunk;
+#ifdef RT_NODE_LEVEL_LEAF
+ n4->values[insertpos] = value;
+#else
+ n4->children[insertpos] = child;
+#endif
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_32:
+ {
+ const RT_SIZE_CLASS_ELEM class32_min = RT_SIZE_CLASS_INFO[RT_CLASS_32_PARTIAL];
+ const RT_SIZE_CLASS_ELEM class32_max = RT_SIZE_CLASS_INFO[RT_CLASS_32_FULL];
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
+ int idx;
+
+ idx = RT_NODE_32_SEARCH_EQ(&n32->base, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+#ifdef RT_NODE_LEVEL_LEAF
+ n32->values[idx] = value;
+#else
+ n32->children[idx] = child;
+#endif
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)) &&
+ n32->base.n.fanout == class32_min.fanout)
+ {
+ /* grow to the next size class of this kind */
+#ifdef RT_NODE_LEVEL_LEAF
+ newnode = RT_ALLOC_NODE(tree, RT_CLASS_32_FULL, false);
+ memcpy(newnode, node, class32_min.leaf_size);
+#else
+ newnode = RT_ALLOC_NODE(tree, RT_CLASS_32_FULL, true);
+ memcpy(newnode, node, class32_min.inner_size);
+#endif
+ newnode->fanout = class32_max.fanout;
+
+ Assert(parent != NULL);
+ RT_REPLACE_NODE(tree, parent, node, newnode, key);
+ node = newnode;
+
+ /* also update pointer for this kind */
+ n32 = (RT_NODE32_TYPE *) newnode;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)))
+ {
+ RT_NODE125_TYPE *new125;
+
+ Assert(n32->base.n.fanout == class32_max.fanout);
+
+ /* grow node from 32 to 125 */
+ newnode = RT_GROW_NODE_KIND(tree, node, RT_NODE_KIND_125);
+ new125 = (RT_NODE125_TYPE *) newnode;
+
+ for (int i = 0; i < class32_max.fanout; i++)
+ {
+ new125->base.slot_idxs[n32->base.chunks[i]] = i;
+#ifdef RT_NODE_LEVEL_LEAF
+ new125->values[i] = n32->values[i];
+#else
+ new125->children[i] = n32->children[i];
+#endif
+ }
+
+ Assert(class32_max.fanout <= sizeof(bitmapword) * BITS_PER_BYTE);
+ new125->base.isset[0] = (bitmapword) (((uint64) 1 << class32_max.fanout) - 1);
+
+ Assert(parent != NULL);
+ RT_REPLACE_NODE(tree, parent, node, newnode, key);
+ node = newnode;
+ }
+ else
+ {
+ int insertpos = RT_NODE_32_GET_INSERTPOS(&n32->base, chunk);
+ int count = n32->base.n.count;
+
+ if (insertpos < count)
+ {
+ Assert(count > 0);
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_SHIFT(n32->base.chunks, n32->values,
+ count, insertpos);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_SHIFT(n32->base.chunks, n32->children,
+ count, insertpos);
+#endif
+ }
+
+ n32->base.chunks[insertpos] = chunk;
+#ifdef RT_NODE_LEVEL_LEAF
+ n32->values[insertpos] = value;
+#else
+ n32->children[insertpos] = child;
+#endif
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+ int slotpos = n125->base.slot_idxs[chunk];
+ int cnt = 0;
+
+ if (slotpos != RT_NODE_125_INVALID_IDX)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+#ifdef RT_NODE_LEVEL_LEAF
+ n125->values[slotpos] = value;
+#else
+ n125->children[slotpos] = child;
+#endif
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n125)))
+ {
+ RT_NODE256_TYPE *new256;
+
+ /* grow node from 125 to 256 */
+ newnode = RT_GROW_NODE_KIND(tree, node, RT_NODE_KIND_256);
+ new256 = (RT_NODE256_TYPE *) newnode;
+ for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n125->base.n.count; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(&n125->base, i))
+ continue;
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_NODE_LEAF_256_SET(new256, i, RT_NODE_LEAF_125_GET_VALUE(n125, i));
+#else
+ RT_NODE_INNER_256_SET(new256, i, RT_NODE_INNER_125_GET_CHILD(n125, i));
+#endif
+ cnt++;
+ }
+
+ Assert(parent != NULL);
+ RT_REPLACE_NODE(tree, parent, node, newnode, key);
+ node = newnode;
+ }
+ else
+ {
+ int idx;
+ bitmapword inverse;
+
+ /* get the first word with at least one bit not set */
+ for (idx = 0; idx < BM_IDX(128); idx++)
+ {
+ if (n125->base.isset[idx] < ~((bitmapword) 0))
+ break;
+ }
+
+ /* To get the first unset bit in X, get the first set bit in ~X */
+ inverse = ~(n125->base.isset[idx]);
+ slotpos = idx * BITS_PER_BITMAPWORD;
+ slotpos += bmw_rightmost_one_pos(inverse);
+ Assert(slotpos < node->fanout);
+
+ /* mark the slot used */
+ n125->base.isset[idx] |= bmw_rightmost_one(inverse);
+ n125->base.slot_idxs[chunk] = slotpos;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ n125->values[slotpos] = value;
+#else
+ n125->children[slotpos] = child;
+#endif
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ chunk_exists = RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk);
+#else
+ chunk_exists = RT_NODE_INNER_256_IS_CHUNK_USED(n256, chunk);
+#endif
+ Assert(chunk_exists || FIXED_NODE_HAS_FREE_SLOT(n256, RT_CLASS_256));
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_NODE_LEAF_256_SET(n256, chunk, value);
+#else
+ RT_NODE_INNER_256_SET(n256, chunk, child);
+#endif
+ break;
+ }
+ }
+
+ /* Update statistics */
+ if (!chunk_exists)
+ node->count++;
+
+ /*
+ * Done. Finally, verify the chunk and value is inserted or replaced
+ * properly in the node.
+ */
+ RT_VERIFY_NODE(node);
+
+ return chunk_exists;
+
+#undef RT_NODE4_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_iter_impl.h b/src/include/lib/radixtree_iter_impl.h
new file mode 100644
index 0000000000..a153011376
--- /dev/null
+++ b/src/include/lib/radixtree_iter_impl.h
@@ -0,0 +1,129 @@
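+/*
+ * radixtree_iter_impl.h
+ *
+ * Code fragment #include'd (with RT_NODE_LEVEL_INNER or RT_NODE_LEVEL_LEAF
+ * defined) into the per-node iteration routines of the radix tree template.
+ * It advances node_iter->current_idx to the next used slot of the current
+ * node and, if one is found, updates the iterator key and returns the child
+ * pointer (inner case) or stores the value into *value_p and returns true
+ * (leaf case).
+ */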
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE4_TYPE RT_NODE_INNER_4
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE4_TYPE RT_NODE_LEAF_4
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+#ifdef RT_NODE_LEVEL_LEAF
+ uint64 value;
+#else
+ RT_NODE *child = NULL;
+#endif
+ bool found = false;
+ uint8 key_chunk;
+
+ switch (node_iter->node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ RT_NODE4_TYPE *n4 = (RT_NODE4_TYPE *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n4->base.n.count)
+ break;
+#ifdef RT_NODE_LEVEL_LEAF
+ value = n4->values[node_iter->current_idx];
+#else
+ child = n4->children[node_iter->current_idx];
+#endif
+ key_chunk = n4->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n32->base.n.count)
+ break;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ value = n32->values[node_iter->current_idx];
+#else
+ child = n32->children[node_iter->current_idx];
+#endif
+ key_chunk = n32->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (RT_NODE_125_IS_CHUNK_USED((RT_NODE_BASE_125 *) n125, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+#ifdef RT_NODE_LEVEL_LEAF
+ value = RT_NODE_LEAF_125_GET_VALUE(n125, i);
+#else
+ child = RT_NODE_INNER_125_GET_CHILD(n125, i);
+#endif
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+#ifdef RT_NODE_LEVEL_LEAF
+ if (RT_NODE_LEAF_256_IS_CHUNK_USED(n256, i))
+#else
+ if (RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
+#endif
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+#ifdef RT_NODE_LEVEL_LEAF
+ value = RT_NODE_LEAF_256_GET_VALUE(n256, i);
+#else
+ child = RT_NODE_INNER_256_GET_CHILD(n256, i);
+#endif
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ }
+
+ if (found)
+ {
+ RT_ITER_UPDATE_KEY(iter, key_chunk, node_iter->node->shift);
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = value;
+#endif
+ }
+
+#ifdef RT_NODE_LEVEL_LEAF
+ return found;
+#else
+ return child;
+#endif
+
+#undef RT_NODE4_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_search_impl.h b/src/include/lib/radixtree_search_impl.h
new file mode 100644
index 0000000000..cbc357dcc8
--- /dev/null
+++ b/src/include/lib/radixtree_search_impl.h
@@ -0,0 +1,102 @@
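+/*
+ * radixtree_search_impl.h
+ *
+ * Code fragment #include'd (with RT_NODE_LEVEL_INNER or RT_NODE_LEVEL_LEAF
+ * defined) into the node-search routines of the radix tree template. It
+ * looks up the chunk of 'key' in 'node' and, if present, stores the value
+ * into *value_p (leaf case) or the child pointer into *child_p (inner case)
+ * and returns true; otherwise it returns false.
+ */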
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE4_TYPE RT_NODE_INNER_4
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE4_TYPE RT_NODE_LEAF_4
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+
+#ifdef RT_NODE_LEVEL_LEAF
+ uint64 value = 0;
+#else
+ RT_PTR_LOCAL child = NULL;
+#endif
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ RT_NODE4_TYPE *n4 = (RT_NODE4_TYPE *) node;
+ int idx = RT_NODE_4_SEARCH_EQ((RT_NODE_BASE_4 *) n4, chunk);
+
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ value = n4->values[idx];
+#else
+ child = n4->children[idx];
+#endif
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
+ int idx = RT_NODE_32_SEARCH_EQ((RT_NODE_BASE_32 *) n32, chunk);
+
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ value = n32->values[idx];
+#else
+ child = n32->children[idx];
+#endif
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+
+ if (!RT_NODE_125_IS_CHUNK_USED((RT_NODE_BASE_125 *) n125, chunk))
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ value = RT_NODE_LEAF_125_GET_VALUE(n125, chunk);
+#else
+ child = RT_NODE_INNER_125_GET_CHILD(n125, chunk);
+#endif
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk))
+#else
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, chunk))
+#endif
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ value = RT_NODE_LEAF_256_GET_VALUE(n256, chunk);
+#else
+ child = RT_NODE_INNER_256_GET_CHILD(n256, chunk);
+#endif
+ break;
+ }
+ }
+
+#ifdef RT_NODE_LEVEL_LEAF
+ Assert(value_p != NULL);
+ *value_p = value;
+#else
+ Assert(child_p != NULL);
+ *child_p = child;
+#endif
+
+ return true;
+
+#undef RT_NODE4_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index c629cbe383..9659eb85d7 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -28,6 +28,7 @@ SUBDIRS = \
test_pg_db_role_setting \
test_pg_dump \
test_predtest \
+ test_radixtree \
test_rbtree \
test_regex \
test_rls_hooks \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 1baa6b558d..232cbdac80 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -24,6 +24,7 @@ subdir('test_parser')
subdir('test_pg_db_role_setting')
subdir('test_pg_dump')
subdir('test_predtest')
+subdir('test_radixtree')
subdir('test_rbtree')
subdir('test_regex')
subdir('test_rls_hooks')
diff --git a/src/test/modules/test_radixtree/.gitignore b/src/test/modules/test_radixtree/.gitignore
new file mode 100644
index 0000000000..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/test_radixtree/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_radixtree/Makefile b/src/test/modules/test_radixtree/Makefile
new file mode 100644
index 0000000000..da06b93da3
--- /dev/null
+++ b/src/test/modules/test_radixtree/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_radixtree/Makefile
+
+MODULE_big = test_radixtree
+OBJS = \
+ $(WIN32RES) \
+ test_radixtree.o
+PGFILEDESC = "test_radixtree - test code for src/backend/lib/radixtree.c"
+
+EXTENSION = test_radixtree
+DATA = test_radixtree--1.0.sql
+
+REGRESS = test_radixtree
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_radixtree
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_radixtree/README b/src/test/modules/test_radixtree/README
new file mode 100644
index 0000000000..a8b271869a
--- /dev/null
+++ b/src/test/modules/test_radixtree/README
@@ -0,0 +1,7 @@
+test_radixtree contains unit tests for testing the radix tree implementation
+in src/include/lib/radixtree.h.
+
+The tests verify the correctness of the implementation, but they can also be
+used as a micro-benchmark. If you set the 'rt_test_stats' flag in
+test_radixtree.c, the tests will print extra information about execution time
+and memory usage.
diff --git a/src/test/modules/test_radixtree/expected/test_radixtree.out b/src/test/modules/test_radixtree/expected/test_radixtree.out
new file mode 100644
index 0000000000..ce645cb8b5
--- /dev/null
+++ b/src/test/modules/test_radixtree/expected/test_radixtree.out
@@ -0,0 +1,36 @@
+CREATE EXTENSION test_radixtree;
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
+NOTICE: testing basic operations with leaf node 4
+NOTICE: testing basic operations with inner node 4
+NOTICE: testing basic operations with leaf node 32
+NOTICE: testing basic operations with inner node 32
+NOTICE: testing basic operations with leaf node 125
+NOTICE: testing basic operations with inner node 125
+NOTICE: testing basic operations with leaf node 256
+NOTICE: testing basic operations with inner node 256
+NOTICE: testing radix tree node types with shift "0"
+NOTICE: testing radix tree node types with shift "8"
+NOTICE: testing radix tree node types with shift "16"
+NOTICE: testing radix tree node types with shift "24"
+NOTICE: testing radix tree node types with shift "32"
+NOTICE: testing radix tree node types with shift "40"
+NOTICE: testing radix tree node types with shift "48"
+NOTICE: testing radix tree node types with shift "56"
+NOTICE: testing radix tree with pattern "all ones"
+NOTICE: testing radix tree with pattern "alternating bits"
+NOTICE: testing radix tree with pattern "clusters of ten"
+NOTICE: testing radix tree with pattern "clusters of hundred"
+NOTICE: testing radix tree with pattern "one-every-64k"
+NOTICE: testing radix tree with pattern "sparse"
+NOTICE: testing radix tree with pattern "single values, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^60"
+ test_radixtree
+----------------
+
+(1 row)
+
diff --git a/src/test/modules/test_radixtree/meson.build b/src/test/modules/test_radixtree/meson.build
new file mode 100644
index 0000000000..f96bf159d6
--- /dev/null
+++ b/src/test/modules/test_radixtree/meson.build
@@ -0,0 +1,34 @@
+# FIXME: prevent install during main install, but not during test :/
+
+test_radixtree_sources = files(
+ 'test_radixtree.c',
+)
+
+if host_system == 'windows'
+ test_radixtree_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'test_radixtree',
+ '--FILEDESC', 'test_radixtree - test code for src/backend/lib/radixtree.c',])
+endif
+
+test_radixtree = shared_module('test_radixtree',
+ test_radixtree_sources,
+ kwargs: pg_mod_args,
+)
+testprep_targets += test_radixtree
+
+install_data(
+ 'test_radixtree.control',
+ 'test_radixtree--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'test_radixtree',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'test_radixtree',
+ ],
+ },
+}
diff --git a/src/test/modules/test_radixtree/sql/test_radixtree.sql b/src/test/modules/test_radixtree/sql/test_radixtree.sql
new file mode 100644
index 0000000000..41ece5e9f5
--- /dev/null
+++ b/src/test/modules/test_radixtree/sql/test_radixtree.sql
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_radixtree;
+
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
diff --git a/src/test/modules/test_radixtree/test_radixtree--1.0.sql b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
new file mode 100644
index 0000000000..074a5a7ea7
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_radixtree/test_radixtree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_radixtree" to load this file. \quit
+
+CREATE FUNCTION test_radixtree()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
new file mode 100644
index 0000000000..2256d08100
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -0,0 +1,588 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_radixtree.c
+ * Test radixtree set data structure.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/test/modules/test_radixtree/test_radixtree.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "miscadmin.h"
+#include "nodes/bitmapset.h"
+#include "storage/block.h"
+#include "storage/itemptr.h"
+#include "utils/memutils.h"
+#include "utils/timestamp.h"
+
+#define UINT64_HEX_FORMAT "%" INT64_MODIFIER "X"
+
+/*
+ * If you enable this, the "pattern" tests will print information about
+ * how long populating, probing, and iterating the test set takes, and
+ * how much memory the test set consumed. That can be used as a
+ * micro-benchmark of various operations and input patterns (you might
+ * want to increase the number of values used in each of the tests, if
+ * you do that, to reduce noise).
+ *
+ * The information is printed to the server's stderr, mostly because
+ * that's where MemoryContextStats() output goes.
+ */
+static const bool rt_test_stats = false;
+
+static int rt_node_kind_fanouts[] = {
+ 0,
+ 4, /* RT_NODE_KIND_4 */
+ 32, /* RT_NODE_KIND_32 */
+ 125, /* RT_NODE_KIND_125 */
+ 256 /* RT_NODE_KIND_256 */
+};
+/*
+ * A struct to define a pattern of integers, for use with the test_pattern()
+ * function.
+ */
+typedef struct
+{
+ char *test_name; /* short name of the test, for humans */
+ char *pattern_str; /* a bit pattern */
+ uint64 spacing; /* pattern repeats at this interval */
+ uint64 num_values; /* number of integers to set in total */
+} test_spec;
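+
+/*
+ * For example (illustration only), a spec with pattern_str "0101" and
+ * spacing 10 sets the integers 1 and 3 in the first repetition, 11 and 13
+ * in the second, and so on: each '1' contributes its index within the
+ * pattern, and the pattern repeats every 'spacing' integers.
+ */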
+
+/* Test patterns borrowed from test_integerset.c */
+static const test_spec test_specs[] = {
+ {
+ "all ones", "1111111111",
+ 10, 1000000
+ },
+ {
+ "alternating bits", "0101010101",
+ 10, 1000000
+ },
+ {
+ "clusters of ten", "1111111111",
+ 10000, 1000000
+ },
+ {
+ "clusters of hundred",
+ "1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111",
+ 10000, 1000000
+ },
+ {
+ "one-every-64k", "1",
+ 65536, 1000000
+ },
+ {
+ "sparse", "100000000000000000000000000000001",
+ 10000000, 1000000
+ },
+ {
+ "single values, distance > 2^32", "1",
+ UINT64CONST(10000000000), 100000
+ },
+ {
+ "clusters, distance > 2^32", "10101010",
+ UINT64CONST(10000000000), 1000000
+ },
+ {
+ "clusters, distance > 2^60", "10101010",
+ UINT64CONST(2000000000000000000),
+ 23 /* can't be much higher than this, or we
+ * overflow uint64 */
+ }
+};
+
+/* define the radix tree implementation to test */
+#define RT_PREFIX rt
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#include "lib/radixtree.h"
+
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(test_radixtree);
+
+static void
+test_empty(void)
+{
+ rt_radix_tree *radixtree;
+ rt_iter *iter;
+ uint64 dummy;
+ uint64 key;
+ uint64 val;
+
+ radixtree = rt_create(CurrentMemoryContext);
+
+ if (rt_search(radixtree, 0, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, 1, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, PG_UINT64_MAX, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_delete(radixtree, 0))
+ elog(ERROR, "rt_delete on empty tree returned true");
+
+ if (rt_num_entries(radixtree) != 0)
+ elog(ERROR, "rt_num_entries on empty tree return non-zero");
+
+ iter = rt_begin_iterate(radixtree);
+
+ if (rt_iterate_next(iter, &key, &val))
+ elog(ERROR, "rt_itereate_next on empty tree returned true");
+
+ rt_end_iterate(iter);
+
+ rt_free(radixtree);
+}
+
+static void
+test_basic(int children, bool test_inner)
+{
+ rt_radix_tree *radixtree;
+ uint64 *keys;
+ int shift = test_inner ? 8 : 0;
+
+ elog(NOTICE, "testing basic operations with %s node %d",
+ test_inner ? "inner" : "leaf", children);
+
+ radixtree = rt_create(CurrentMemoryContext);
+
+ /* prepare keys in an order like 1, 32, 2, 31, 3, ... */
+ keys = palloc(sizeof(uint64) * children);
+ for (int i = 0; i < children; i++)
+ {
+ if (i % 2 == 0)
+ keys[i] = (uint64) ((i / 2) + 1) << shift;
+ else
+ keys[i] = (uint64) (children - (i / 2)) << shift;
+ }
+
+ /* insert keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (rt_set(radixtree, keys[i], keys[i]))
+ elog(ERROR, "new inserted key 0x" UINT64_HEX_FORMAT " is found ", keys[i]);
+ }
+
+ /* update keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (!rt_set(radixtree, keys[i], keys[i] + 1))
+ elog(ERROR, "could not update key 0x" UINT64_HEX_FORMAT, keys[i]);
+ }
+
+ /* repeat deleting and inserting keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (!rt_delete(radixtree, keys[i]))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, keys[i]);
+ if (rt_set(radixtree, keys[i], keys[i]))
+ elog(ERROR, "new inserted key 0x" UINT64_HEX_FORMAT " is found ", keys[i]);
+ }
+
+ pfree(keys);
+ rt_free(radixtree);
+}
+
+/*
+ * Check that the keys in [start, end) with the given shift exist in the tree.
+ */
+static void
+check_search_on_node(rt_radix_tree *radixtree, uint8 shift, int start, int end,
+ int incr)
+{
+ for (int i = start; i < end; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ uint64 val;
+
+ if (!rt_search(radixtree, key, &val))
+ elog(ERROR, "key 0x" UINT64_HEX_FORMAT " is not found on node-%d",
+ key, end);
+ if (val != key)
+ elog(ERROR, "rt_search with key 0x" UINT64_HEX_FORMAT " returns 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ key, val, key);
+ }
+}
+
+static void
+test_node_types_insert(rt_radix_tree *radixtree, uint8 shift, bool insert_asc)
+{
+ uint64 num_entries;
+ int ninserted = 0;
+ int start = insert_asc ? 0 : 256;
+ int incr = insert_asc ? 1 : -1;
+ int end = insert_asc ? 256 : 0;
+ int node_kind_idx = 1;
+
+ for (int i = start; i != end; i += incr)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_set(radixtree, key, key);
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " is found", key);
+
+ /*
+ * After filling all slots in each node type, check if the values
+ * are stored properly.
+ */
+ if (ninserted == rt_node_kind_fanouts[node_kind_idx] - 1)
+ {
+ int check_start = insert_asc
+ ? rt_node_kind_fanouts[node_kind_idx - 1]
+ : rt_node_kind_fanouts[node_kind_idx];
+ int check_end = insert_asc
+ ? rt_node_kind_fanouts[node_kind_idx]
+ : rt_node_kind_fanouts[node_kind_idx - 1];
+
+ check_search_on_node(radixtree, shift, check_start, check_end, incr);
+ node_kind_idx++;
+ }
+
+ ninserted++;
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ if (num_entries != 256)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(256));
+}
+
+static void
+test_node_types_delete(rt_radix_tree *radixtree, uint8 shift)
+{
+ uint64 num_entries;
+
+ for (int i = 0; i < 256; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_delete(radixtree, key);
+
+ if (!found)
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, key);
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ /* The tree must be empty */
+ if (num_entries != 0)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(0));
+}
+
+/*
+ * Test for inserting and deleting key-value pairs to each node type at the given shift
+ * level.
+ */
+static void
+test_node_types(uint8 shift)
+{
+ rt_radix_tree *radixtree;
+
+ elog(NOTICE, "testing radix tree node types with shift \"%d\"", shift);
+
+ radixtree = rt_create(CurrentMemoryContext);
+
+ /*
+ * Insert and search entries for every node type at the 'shift' level,
+ * then delete all entries to make it empty, and insert and search entries
+ * again.
+ */
+ test_node_types_insert(radixtree, shift, true);
+ test_node_types_delete(radixtree, shift);
+ test_node_types_insert(radixtree, shift, false);
+
+ rt_free(radixtree);
+}
+
+/*
+ * Test with a repeating pattern, defined by the 'spec'.
+ */
+static void
+test_pattern(const test_spec * spec)
+{
+ rt_radix_tree *radixtree;
+ rt_iter *iter;
+ MemoryContext radixtree_ctx;
+ TimestampTz starttime;
+ TimestampTz endtime;
+ uint64 n;
+ uint64 last_int;
+ uint64 ndeleted;
+ uint64 nbefore;
+ uint64 nafter;
+ int patternlen;
+ uint64 *pattern_values;
+ uint64 pattern_num_values;
+
+ elog(NOTICE, "testing radix tree with pattern \"%s\"", spec->test_name);
+ if (rt_test_stats)
+ fprintf(stderr, "-----\ntesting radix tree with pattern \"%s\"\n", spec->test_name);
+
+ /* Pre-process the pattern, creating an array of integers from it. */
+ patternlen = strlen(spec->pattern_str);
+ pattern_values = palloc(patternlen * sizeof(uint64));
+ pattern_num_values = 0;
+ for (int i = 0; i < patternlen; i++)
+ {
+ if (spec->pattern_str[i] == '1')
+ pattern_values[pattern_num_values++] = i;
+ }
+
+ /*
+ * Allocate the radix tree.
+ *
+ * Allocate it in a separate memory context, so that we can print its
+ * memory usage easily.
+ */
+ radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
+ "radixtree test",
+ ALLOCSET_SMALL_SIZES);
+ MemoryContextSetIdentifier(radixtree_ctx, spec->test_name);
+ radixtree = rt_create(radixtree_ctx);
+
+ /*
+ * Add values to the set.
+ */
+ starttime = GetCurrentTimestamp();
+
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ uint64 x = 0;
+
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ bool found;
+
+ x = last_int + pattern_values[i];
+
+ found = rt_set(radixtree, x, x);
+
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " found", x);
+
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+
+ endtime = GetCurrentTimestamp();
+
+ if (rt_test_stats)
+ fprintf(stderr, "added " UINT64_FORMAT " values in %d ms\n",
+ spec->num_values, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Print stats on the amount of memory used.
+ *
+ * We print the usage reported by rt_memory_usage(), as well as the stats
+ * from the memory context. They should be in the same ballpark, but it's
+ * hard to automate testing that, so if you're making changes to the
+ * implementation, just observe that manually.
+ */
+ if (rt_test_stats)
+ {
+ uint64 mem_usage;
+
+ /*
+ * Also print memory usage as reported by rt_memory_usage(). It
+ * should be in the same ballpark as the usage reported by
+ * MemoryContextStats().
+ */
+ mem_usage = rt_memory_usage(radixtree);
+ fprintf(stderr, "rt_memory_usage() reported " UINT64_FORMAT " (%0.2f bytes / integer)\n",
+ mem_usage, (double) mem_usage / spec->num_values);
+
+ MemoryContextStats(radixtree_ctx);
+ }
+
+ /* Check that rt_num_entries works */
+ n = rt_num_entries(radixtree);
+ if (n != spec->num_values)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT, n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_search()
+ */
+ starttime = GetCurrentTimestamp();
+
+ for (n = 0; n < 100000; n++)
+ {
+ bool found;
+ bool expected;
+ uint64 x;
+ uint64 v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Do we expect this value to be present in the set? */
+ if (x >= last_int)
+ expected = false;
+ else
+ {
+ uint64 idx = x % spec->spacing;
+
+ if (idx >= patternlen)
+ expected = false;
+ else if (spec->pattern_str[idx] == '1')
+ expected = true;
+ else
+ expected = false;
+ }
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (found != expected)
+ elog(ERROR, "mismatch at 0x" UINT64_HEX_FORMAT ": %d vs %d", x, found, expected);
+ if (found && (v != x))
+ elog(ERROR, "found 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ v, x);
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "probed " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Test iterator
+ */
+ starttime = GetCurrentTimestamp();
+
+ iter = rt_begin_iterate(radixtree);
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ uint64 expected = last_int + pattern_values[i];
+ uint64 x;
+ uint64 val;
+
+ if (!rt_iterate_next(iter, &x, &val))
+ break;
+
+ if (x != expected)
+ elog(ERROR,
+ "iterate returned wrong key; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d",
+ x, expected, i);
+ if (val != expected)
+ elog(ERROR,
+ "iterate returned wrong value; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d", x, expected, i);
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "iterated " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ rt_end_iterate(iter);
+
+ if (n < spec->num_values)
+ elog(ERROR, "iterator stopped short after " UINT64_FORMAT " entries, expected " UINT64_FORMAT, n, spec->num_values);
+ if (n > spec->num_values)
+ elog(ERROR, "iterator returned " UINT64_FORMAT " entries, " UINT64_FORMAT " was expected", n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_delete()
+ */
+ starttime = GetCurrentTimestamp();
+
+ nbefore = rt_num_entries(radixtree);
+ ndeleted = 0;
+ for (n = 0; n < 1; n++)
+ {
+ bool found;
+ uint64 x;
+ uint64 v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (!found)
+ continue;
+
+ /* If the key is found, delete it and check again */
+ if (!rt_delete(radixtree, x))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_search(radixtree, x, &v))
+ elog(ERROR, "found deleted key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_delete(radixtree, x))
+ elog(ERROR, "deleted already-deleted key 0x" UINT64_HEX_FORMAT, x);
+
+ ndeleted++;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "deleted " UINT64_FORMAT " values in %d ms\n",
+ ndeleted, (int) (endtime - starttime) / 1000);
+
+ nafter = rt_num_entries(radixtree);
+
+ /* Check that rt_num_entries works */
+ if ((nbefore - ndeleted) != nafter)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT "after " UINT64_FORMAT " deletion",
+ nafter, (nbefore - ndeleted), ndeleted);
+
+ MemoryContextDelete(radixtree_ctx);
+}
+
+Datum
+test_radixtree(PG_FUNCTION_ARGS)
+{
+ test_empty();
+
+ for (int i = 1; i < lengthof(rt_node_kind_fanouts); i++)
+ {
+ test_basic(rt_node_kind_fanouts[i], false);
+ test_basic(rt_node_kind_fanouts[i], true);
+ }
+
+ for (int shift = 0; shift <= (64 - 8); shift += 8)
+ test_node_types(shift);
+
+ /* Test different test patterns, with lots of entries */
+ for (int i = 0; i < lengthof(test_specs); i++)
+ test_pattern(&test_specs[i]);
+
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_radixtree/test_radixtree.control b/src/test/modules/test_radixtree/test_radixtree.control
new file mode 100644
index 0000000000..e53f2a3e0c
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.control
@@ -0,0 +1,4 @@
+comment = 'Test code for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/test_radixtree'
+relocatable = true
diff --git a/src/tools/pginclude/cpluspluscheck b/src/tools/pginclude/cpluspluscheck
index e52fe9f509..09fa6e7432 100755
--- a/src/tools/pginclude/cpluspluscheck
+++ b/src/tools/pginclude/cpluspluscheck
@@ -101,6 +101,12 @@ do
test "$f" = src/include/nodes/nodetags.h && continue
test "$f" = src/backend/nodes/nodetags.h && continue
+ # radixtree_*_impl.h cannot be included standalone: they are just code fragments.
+ test "$f" = src/include/lib/radixtree_delete_impl.h && continue
+ test "$f" = src/include/lib/radixtree_insert_impl.h && continue
+ test "$f" = src/include/lib/radixtree_iter_impl.h && continue
+ test "$f" = src/include/lib/radixtree_search_impl.h && continue
+
# These files are not meant to be included standalone, because
# they contain lists that might have multiple use-cases.
test "$f" = src/include/access/rmgrlist.h && continue
diff --git a/src/tools/pginclude/headerscheck b/src/tools/pginclude/headerscheck
index abbba7aa63..d4d2f1da03 100755
--- a/src/tools/pginclude/headerscheck
+++ b/src/tools/pginclude/headerscheck
@@ -96,6 +96,12 @@ do
test "$f" = src/include/nodes/nodetags.h && continue
test "$f" = src/backend/nodes/nodetags.h && continue
+ # radixtree_*_impl.h cannot be included standalone: they are just code fragments.
+ test "$f" = src/include/lib/radixtree_delete_impl.h && continue
+ test "$f" = src/include/lib/radixtree_insert_impl.h && continue
+ test "$f" = src/include/lib/radixtree_iter_impl.h && continue
+ test "$f" = src/include/lib/radixtree_search_impl.h && continue
+
# These files are not meant to be included standalone, because
# they contain lists that might have multiple use-cases.
test "$f" = src/include/access/rmgrlist.h && continue
--
2.39.0
Attachment: v18-0006-Clarify-coding-around-fanout.patch (application/x-patch)
From f0bac77d49a88c82f4725bed5688d5f1e01dbe49 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Thu, 12 Jan 2023 20:39:19 +0700
Subject: [PATCH v18 06/10] Clarify coding around fanout
Change assignment of node256's fanout to an
assert and add some comments to the fanout
member of the RT_NODE struct.
---
src/include/lib/radixtree.h | 26 +++++++++++++++-----------
1 file changed, 15 insertions(+), 11 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 72735c4643..a02e835cd6 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -284,9 +284,15 @@ typedef struct RT_NODE
*/
uint16 count;
- /* Max number of children. We can use uint8 because we never need to store 256 */
- /* WIP: if we don't have a variable sized node4, this should instead be in the base
- types as needed, since saving every byte is crucial for the smallest node kind */
+ /*
+ * Max capacity for the current size class. Storing this in the
+ * node enables multiple size classes per node kind.
+ * Technically, kinds with a single size class don't need this, so we could
+ * keep this in the individual base types, but the code is simpler this way.
+ * Note: node256 is unique in that it cannot possibly have more than a
+ * single size class, so for that kind we store zero, and uint8 is
+ * sufficient for other kinds.
+ */
uint8 fanout;
/*
@@ -923,7 +929,12 @@ RT_INIT_NODE(RT_PTR_LOCAL node, uint8 kind, RT_SIZE_CLASS size_class, bool inner
MemSet(node, 0, RT_SIZE_CLASS_INFO[size_class].leaf_size);
node->kind = kind;
- node->fanout = RT_SIZE_CLASS_INFO[size_class].fanout;
+
+ if (kind == RT_NODE_KIND_256)
+ /* See comment for the RT_NODE type */
+ Assert(node->fanout == 0);
+ else
+ node->fanout = RT_SIZE_CLASS_INFO[size_class].fanout;
/* Initialize slot_idxs to invalid values */
if (kind == RT_NODE_KIND_125)
@@ -932,13 +943,6 @@ RT_INIT_NODE(RT_PTR_LOCAL node, uint8 kind, RT_SIZE_CLASS size_class, bool inner
memset(n125->slot_idxs, RT_NODE_125_INVALID_IDX, sizeof(n125->slot_idxs));
}
-
- /*
- * Technically it's 256, but we cannot store that in a uint8,
- * and this is the max size class to it will never grow.
- */
- if (kind == RT_NODE_KIND_256)
- node->fanout = 0;
}
/*
--
2.39.0
Attachment: v18-0009-Add-TIDStore-to-store-sets-of-TIDs-ItemPointerDa.patch (application/x-patch)
From 9ac5e8839bdf57eeaf357d3f1406b288c022edab Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 23 Dec 2022 23:50:39 +0900
Subject: [PATCH v18 09/10] Add TIDStore, to store sets of TIDs
(ItemPointerData) efficiently.
The TIDStore is designed to store large sets of TIDs efficiently, and
is backed by the radix tree. A TID is encoded into a 64-bit key and
value and inserted into the radix tree.
The TIDStore is not used for anything yet, aside from the test code,
but the follow-up patch integrates the TIDStore with lazy vacuum,
reducing lazy vacuum memory usage and lifting the 1GB limit on its
size, by storing the list of dead TIDs more efficiently.
This includes a unit test module, in src/test/modules/test_tidstore.
---
src/backend/access/common/Makefile | 1 +
src/backend/access/common/meson.build | 1 +
src/backend/access/common/tidstore.c | 587 ++++++++++++++++++
src/include/access/tidstore.h | 49 ++
src/test/modules/test_tidstore/Makefile | 23 +
.../test_tidstore/expected/test_tidstore.out | 13 +
src/test/modules/test_tidstore/meson.build | 34 +
.../test_tidstore/sql/test_tidstore.sql | 7 +
.../test_tidstore/test_tidstore--1.0.sql | 8 +
.../test_tidstore/test_tidstore.control | 4 +
10 files changed, 727 insertions(+)
create mode 100644 src/backend/access/common/tidstore.c
create mode 100644 src/include/access/tidstore.h
create mode 100644 src/test/modules/test_tidstore/Makefile
create mode 100644 src/test/modules/test_tidstore/expected/test_tidstore.out
create mode 100644 src/test/modules/test_tidstore/meson.build
create mode 100644 src/test/modules/test_tidstore/sql/test_tidstore.sql
create mode 100644 src/test/modules/test_tidstore/test_tidstore--1.0.sql
create mode 100644 src/test/modules/test_tidstore/test_tidstore.control
diff --git a/src/backend/access/common/Makefile b/src/backend/access/common/Makefile
index b9aff0ccfd..67b8cc6108 100644
--- a/src/backend/access/common/Makefile
+++ b/src/backend/access/common/Makefile
@@ -27,6 +27,7 @@ OBJS = \
syncscan.o \
toast_compression.o \
toast_internals.o \
+ tidstore.o \
tupconvert.o \
tupdesc.o
diff --git a/src/backend/access/common/meson.build b/src/backend/access/common/meson.build
index f5ac17b498..fce19c09ce 100644
--- a/src/backend/access/common/meson.build
+++ b/src/backend/access/common/meson.build
@@ -15,6 +15,7 @@ backend_sources += files(
'syncscan.c',
'toast_compression.c',
'toast_internals.c',
+ 'tidstore.c',
'tupconvert.c',
'tupdesc.c',
)
diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
new file mode 100644
index 0000000000..4170d13b3c
--- /dev/null
+++ b/src/backend/access/common/tidstore.c
@@ -0,0 +1,587 @@
+/*-------------------------------------------------------------------------
+ *
+ * tidstore.c
+ * Tid (ItemPointerData) storage implementation.
+ *
+ * This module provides an in-memory data structure to store Tids (ItemPointer).
+ * Internally, a Tid is encoded as a pair of 64-bit key and 64-bit value, and
+ * stored in the radix tree.
+ *
+ * A TidStore can be shared among parallel worker processes by passing a DSA
+ * area to tidstore_create(). Other backends can attach to the shared TidStore
+ * with tidstore_attach(). It supports concurrent updates, but only one process
+ * is allowed to iterate over the TidStore at a time.
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/common/tidstore.c
+ *
+ *-------------------------------------------------------------------------
+ */
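+
+/*
+ * Typical usage (an illustrative sketch only; 'max_bytes', 'blkno',
+ * 'offsets', 'tid' and the iteration variables are caller-side names):
+ *
+ *   ts = tidstore_create(max_bytes, NULL);   (NULL means backend-local)
+ *   tidstore_add_tids(ts, blkno, offsets, num_offsets);
+ *   ...
+ *   if (tidstore_lookup_tid(ts, &tid))
+ *       ... the Tid has been stored ...
+ *
+ *   iter = tidstore_begin_iterate(ts);
+ *   while ((result = tidstore_iterate_next(iter)) != NULL)
+ *       ... process result->blkno and result->offsets ...
+ *   tidstore_end_iterate(iter);
+ *
+ *   tidstore_destroy(ts);
+ */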
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "access/tidstore.h"
+#include "lib/radixtree.h"
+#include "miscadmin.h"
+#include "port/pg_bitutils.h"
+#include "utils/dsa.h"
+#include "utils/memutils.h"
+
+/*
+ * For encoding purposes, item pointers are represented as a pair of 64-bit
+ * key and 64-bit value. First, we construct a 64-bit unsigned integer that
+ * combines the block number and the offset number. The lowest 11 bits represent
+ * the offset number, and the next 32 bits are the block number. That is, only 43
+ * bits are used:
+ *
+ * XXXXXXXX XXXYYYYY YYYYYYYY YYYYYYYY YYYYYYYY YYYuuuu
+ *
+ * X = bits used for offset number
+ * Y = bits used for block number
+ * u = unused bit
+ *
+ * 11 bits are enough for the offset number, because MaxHeapTuplesPerPage < 2^11
+ * on all supported block sizes (TIDSTORE_OFFSET_NBITS). We are frugal with
+ * the bits, because smaller keys help keep the radix tree shallow.
+ *
+ * XXX: If we want to support other table AMs that want to use the full range
+ * of possible offset numbers, we'll need to change this.
+ *
+ * The 64-bit value is a bitmap representation of the lowest 6 bits, and
+ * the remaining 37 bits are used as the key:
+ *
+ * value = bitmap representation of XXXXXX
+ * key = XXXXXYYY YYYYYYYY YYYYYYYY YYYYYYYY YYYYYuu
+ *
+ * The maximum height of the radix tree is 5.
+ *
+ * XXX: if we want to support non-heap table AM, we need to reconsider
+ * TIDSTORE_OFFSET_NBITS value.
+ */
+#define TIDSTORE_OFFSET_NBITS 11
+#define TIDSTORE_VALUE_NBITS 6
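+
+/*
+ * Worked example (for illustration only): the Tid (block 1000, offset 5)
+ * gives tid_i = 5 | (1000 << 11) = 2048005. The low TIDSTORE_VALUE_NBITS
+ * bits select the bit position within the value, so off = 2048005 % 64 = 5
+ * and the value has bit 5 set. The remaining bits form the key,
+ * key = 2048005 >> 6 = 32000, and KEY_GET_BLKNO(32000) = 32000 >> (11 - 6)
+ * = 1000 recovers the block number.
+ */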
+
+/*
+ * Memory consumption depends on the number of Tids stored, but also on the
+ * distribution of them and how the radix tree stores them. The maximum bytes
+ * that a TidStore can use is specified by the max_bytes in tidstore_create().
+ *
+ * In non-shared cases, the radix tree uses a slab allocator for each kind of
+ * node class. The most memory consuming case while adding Tids associated
+ * with one page (i.e. during tidstore_add_tids()) is that we allocate the
+ * largest radix tree node in a new slab block, which is approximately 70kB.
+ * Therefore, we deduct 70kB from the maximum bytes.
+ *
+ * In shared cases, DSA allocates the memory segments to bit enough to follow
+ * a geometric series that approximately doubles the total DSA size. So we
+ * limit the maximum bytes for a TidStore to 75%. The 75% threshold perfectly
+ * works in case where the maximum bytes is power-of-2. In other cases, we
+ * use 60& threshold.
+ */
+#define TIDSTORE_MEMORY_DEDUCT_BYTES (1024L * 70) /* 70kB */
+
+/* Get block number from the key */
+#define KEY_GET_BLKNO(key) \
+ ((BlockNumber) ((key) >> (TIDSTORE_OFFSET_NBITS - TIDSTORE_VALUE_NBITS)))
+
+/* A magic value used to identify our TidStores. */
+#define TIDSTORE_MAGIC 0x826f6a10
+
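+/*
+ * Instantiate the radix tree template twice: a backend-local tree whose
+ * functions are prefixed with local_rt_, and a DSA-backed tree usable by
+ * multiple processes (RT_SHMEM), prefixed with shared_rt_. Which one a
+ * TidStore uses is determined by the 'area' argument of tidstore_create().
+ */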
+#define RT_PREFIX local_rt
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#include "lib/radixtree.h"
+
+#define RT_PREFIX shared_rt
+#define RT_SHMEM
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#include "lib/radixtree.h"
+
+/* The header object for a TidStore */
+typedef struct TidStoreControl
+{
+ /*
+ * 'num_tids' is the number of Tids stored so far. 'max_bytes' is the maximum
+ * bytes a TidStore can use. These two fields are used in both the
+ * non-shared and shared cases.
+ */
+ uint32 num_tids;
+ uint64 max_bytes;
+
+ /* The below fields are used only in shared case */
+
+ uint32 magic;
+
+ /* handles for TidStore and radix tree */
+ tidstore_handle handle;
+ shared_rt_handle tree_handle;
+} TidStoreControl;
+
+/* Per-backend state for a TidStore */
+struct TidStore
+{
+ /*
+ * Control object. This is allocated in DSA area 'area' in the shared
+ * case, otherwise in backend-local memory.
+ */
+ TidStoreControl *control;
+
+ /* Storage for Tids */
+ union
+ {
+ local_rt_radix_tree *local;
+ shared_rt_radix_tree *shared;
+ } tree;
+
+ /* DSA area for TidStore if used */
+ dsa_area *area;
+};
+#define TidStoreIsShared(ts) ((ts)->area != NULL)
+
+/* Iterator for TidStore */
+typedef struct TidStoreIter
+{
+ TidStore *ts;
+
+ /* iterator of radix tree */
+ union
+ {
+ shared_rt_iter *shared;
+ local_rt_iter *local;
+ } tree_iter;
+
+ /* we returned all tids? */
+ bool finished;
+
+ /* save for the next iteration */
+ uint64 next_key;
+ uint64 next_val;
+
+ /* output for the caller */
+ TidStoreIterResult result;
+} TidStoreIter;
+
+static void tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, uint64 val);
+static inline uint64 tid_to_key_off(ItemPointer tid, uint32 *off);
+
+/*
+ * Create a TidStore. The returned object is allocated in backend-local memory.
+ * The radix tree for storage is allocated in the DSA area if 'area' is non-NULL.
+ */
+TidStore *
+tidstore_create(uint64 max_bytes, dsa_area *area)
+{
+ TidStore *ts;
+
+ ts = palloc0(sizeof(TidStore));
+
+ /*
+ * Create the radix tree for the main storage.
+ *
+ * We calculate the maximum bytes for the TidStore in different ways
+ * for the non-shared and shared cases. Please refer to the comment above
+ * TIDSTORE_MEMORY_DEDUCT_BYTES for details.
+ */
+ if (area != NULL)
+ {
+ dsa_pointer dp;
+ float ratio = ((max_bytes & (max_bytes - 1)) == 0) ? 0.75 : 0.6;
+
+ ts->tree.shared = shared_rt_create(CurrentMemoryContext, area);
+
+ dp = dsa_allocate0(area, sizeof(TidStoreControl));
+ ts->control = (TidStoreControl *) dsa_get_address(area, dp);
+ ts->control->max_bytes = (uint64) (max_bytes * ratio);
+ ts->area = area;
+
+ ts->control->magic = TIDSTORE_MAGIC;
+ ts->control->handle = dp;
+ ts->control->tree_handle = shared_rt_get_handle(ts->tree.shared);
+ }
+ else
+ {
+ ts->tree.local = local_rt_create(CurrentMemoryContext);
+
+ ts->control = (TidStoreControl *) palloc0(sizeof(TidStoreControl));
+ ts->control->max_bytes = max_bytes - TIDSTORE_MEMORY_DEDUCT_BYTES;
+ }
+
+ return ts;
+}
+
+/*
+ * Attach to the shared TidStore using a handle. The returned object is
+ * allocated in backend-local memory using the CurrentMemoryContext.
+ */
+TidStore *
+tidstore_attach(dsa_area *area, tidstore_handle handle)
+{
+ TidStore *ts;
+ dsa_pointer control;
+
+ Assert(area != NULL);
+ Assert(DsaPointerIsValid(handle));
+
+ /* create per-backend state */
+ ts = palloc0(sizeof(TidStore));
+
+ /* Find the control object in shared memory */
+ control = handle;
+
+ /* Set up the TidStore */
+ ts->control = (TidStoreControl *) dsa_get_address(area, control);
+ Assert(ts->control->magic == TIDSTORE_MAGIC);
+
+ ts->tree.shared = shared_rt_attach(area, ts->control->tree_handle);
+ ts->area = area;
+
+ return ts;
+}
+
+/*
+ * Detach from a TidStore. This detaches from the radix tree and frees the
+ * backend-local resources. The radix tree will continue to exist until
+ * it is either explicitly destroyed, or the area that backs it is returned
+ * to the operating system.
+ */
+void
+tidstore_detach(TidStore *ts)
+{
+ Assert(TidStoreIsShared(ts) && ts->control->magic == TIDSTORE_MAGIC);
+
+ shared_rt_detach(ts->tree.shared);
+ pfree(ts);
+}
+
+/*
+ * Destroy a TidStore, returning all memory. The caller must be certain that
+ * no other backend will attempt to access the TidStore before calling this
+ * function. Other backends must explicitly call tidstore_detach to free up
+ * backend-local memory associated with the TidStore. The backend that calls
+ * tidstore_destroy must not call tidstore_detach.
+ */
+void
+tidstore_destroy(TidStore *ts)
+{
+ if (TidStoreIsShared(ts))
+ {
+ Assert(ts->control->magic == TIDSTORE_MAGIC);
+
+ /*
+ * Vandalize the control block to help catch programming errors where
+ * other backends access the memory formerly occupied by this TidStore.
+ */
+ ts->control->magic = 0;
+ dsa_free(ts->area, ts->control->handle);
+ shared_rt_free(ts->tree.shared);
+ }
+ else
+ {
+ pfree(ts->control);
+ local_rt_free(ts->tree.local);
+ }
+
+ pfree(ts);
+}
+
+/* Forget all collected Tids */
+void
+tidstore_reset(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ /* Reset the statistics */
+ ts->control->num_tids = 0;
+
+ /*
+ * Free the current radix tree, and return allocated DSM segments
+ * to the operating system, if necessary.
+ */
+ if (TidStoreIsShared(ts))
+ {
+ shared_rt_free(ts->tree.shared);
+ dsa_trim(ts->area);
+
+ /* Recreate the radix tree */
+ ts->tree.shared = shared_rt_create(CurrentMemoryContext, ts->area);
+
+ /* update the radix tree handle as we recreated it */
+ ts->control->tree_handle = shared_rt_get_handle(ts->tree.shared);
+ }
+ else
+ {
+ local_rt_free(ts->tree.local);
+ ts->tree.local = local_rt_create(CurrentMemoryContext);
+ }
+}
+
+static inline void
+tidstore_insert_kv(TidStore *ts, uint64 key, uint64 val)
+{
+ if (TidStoreIsShared(ts))
+ shared_rt_set(ts->tree.shared, key, val);
+ else
+ local_rt_set(ts->tree.local, key, val);
+}
+
+/*
+ * Add Tids on a block to TidStore. The caller must ensure the offset numbers
+ * in 'offsets' are sorted in ascending order.
+ */
+void
+tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets)
+{
+ uint64 last_key = PG_UINT64_MAX;
+ uint64 key;
+ uint64 val = 0;
+ ItemPointerData tid;
+
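+ /*
+ * Offsets on the same block that differ only in the low
+ * TIDSTORE_VALUE_NBITS bits map to the same key, so we accumulate their
+ * bits in 'val' and insert one key-value pair each time the key changes.
+ */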
+ ItemPointerSetBlockNumber(&tid, blkno);
+
+ for (int i = 0; i < num_offsets; i++)
+ {
+ uint32 off;
+
+ ItemPointerSetOffsetNumber(&tid, offsets[i]);
+
+ key = tid_to_key_off(&tid, &off);
+
+ if (last_key != PG_UINT64_MAX && last_key != key)
+ {
+ /* insert the key-value */
+ tidstore_insert_kv(ts, last_key, val);
+ val = 0;
+ }
+
+ last_key = key;
+ val |= UINT64CONST(1) << off;
+ }
+
+ if (last_key != PG_UINT64_MAX)
+ {
+ /* insert the key-value */
+ tidstore_insert_kv(ts, last_key, val);
+ }
+
+ /* update statistics */
+ ts->control->num_tids += num_offsets;
+}
+
+/* Return true if the given Tid is present in TidStore */
+bool
+tidstore_lookup_tid(TidStore *ts, ItemPointer tid)
+{
+ uint64 key;
+ uint64 val;
+ uint32 off;
+ bool found;
+
+ key = tid_to_key_off(tid, &off);
+
+ found = TidStoreIsShared(ts) ?
+ shared_rt_search(ts->tree.shared, key, &val) :
+ local_rt_search(ts->tree.local, key, &val);
+
+ if (!found)
+ return false;
+
+ return (val & (UINT64CONST(1) << off)) != 0;
+}
+
+/*
+ * Prepare to iterate through a TidStore. The caller must be certain that
+ * no other backend will attempt to update the TidStore during the iteration.
+ */
+TidStoreIter *
+tidstore_begin_iterate(TidStore *ts)
+{
+ TidStoreIter *iter;
+
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ iter = palloc0(sizeof(TidStoreIter));
+ iter->ts = ts;
+ iter->result.blkno = InvalidBlockNumber;
+
+ if (TidStoreIsShared(ts))
+ iter->tree_iter.shared = shared_rt_begin_iterate(ts->tree.shared);
+ else
+ iter->tree_iter.local = local_rt_begin_iterate(ts->tree.local);
+
+ /* If the TidStore is empty, there is nothing to iterate */
+ if (ts->control->num_tids == 0)
+ iter->finished = true;
+
+ return iter;
+}
+
+static inline bool
+tidstore_iter_kv(TidStoreIter *iter, uint64 *key, uint64 *val)
+{
+ if (TidStoreIsShared(iter->ts))
+ return shared_rt_iterate_next(iter->tree_iter.shared, key, val);
+ else
+ return local_rt_iterate_next(iter->tree_iter.local, key, val);
+}
+
+/*
+ * Scan the TidStore and return a TidStoreIterResult representing Tids
+ * in one page. Offset numbers in the result are sorted.
+ */
+TidStoreIterResult *
+tidstore_iterate_next(TidStoreIter *iter)
+{
+ uint64 key;
+ uint64 val;
+ TidStoreIterResult *result = &(iter->result);
+
+ if (iter->finished)
+ return NULL;
+
+ if (BlockNumberIsValid(result->blkno))
+ {
+ result->num_offsets = 0;
+ tidstore_iter_extract_tids(iter, iter->next_key, iter->next_val);
+ }
+
+ while (tidstore_iter_kv(iter, &key, &val))
+ {
+ BlockNumber blkno;
+
+ blkno = KEY_GET_BLKNO(key);
+
+ if (BlockNumberIsValid(result->blkno) && result->blkno != blkno)
+ {
+ /*
+ * Remember the key-value pair for the next block for the
+ * next iteration.
+ */
+ iter->next_key = key;
+ iter->next_val = val;
+ return result;
+ }
+
+ /* Collect tids extracted from the key-value pair */
+ tidstore_iter_extract_tids(iter, key, val);
+ }
+
+ iter->finished = true;
+ return result;
+}
+
+/* Finish an iteration over TidStore */
+void
+tidstore_end_iterate(TidStoreIter *iter)
+{
+ if (TidStoreIsShared(iter->ts))
+ shared_rt_end_iterate(iter->tree_iter.shared);
+ else
+ local_rt_end_iterate(iter->tree_iter.local);
+
+ pfree(iter);
+}
+
+/* Return the number of Tids we collected so far */
+uint64
+tidstore_num_tids(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ return ts->control->num_tids;
+}
+
+/* Return true if the current memory usage of TidStore exceeds the limit */
+bool
+tidstore_is_full(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ return (tidstore_memory_usage(ts) > ts->control->max_bytes);
+}
+
+/* Return the maximum memory TidStore can use */
+uint64
+tidstore_max_memory(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ return ts->control->max_bytes;
+}
+
+/* Return the memory usage of TidStore */
+uint64
+tidstore_memory_usage(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ /*
+ * In the shared case, TidStoreControl and radix_tree are backed by the
+ * same DSA area and rt_memory_usage() returns the value including both.
+ * So we don't need to add the size of TidStoreControl separately.
+ */
+ if (TidStoreIsShared(ts))
+ return (uint64) sizeof(TidStore) + shared_rt_memory_usage(ts->tree.shared);
+ else
+ return (uint64) sizeof(TidStore) + sizeof(TidStoreControl) +
+ local_rt_memory_usage(ts->tree.local);
+}
+
+/*
+ * Get a handle that can be used by other processes to attach to this TidStore
+ */
+tidstore_handle
+tidstore_get_handle(TidStore *ts)
+{
+ Assert(TidStoreIsShared(ts) && ts->control->magic == TIDSTORE_MAGIC);
+
+ return ts->control->handle;
+}
+
+/* Extract Tids from key-value pair */
+static void
+tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, uint64 val)
+{
+ TidStoreIterResult *result = (&iter->result);
+
+ for (int i = 0; i < sizeof(uint64) * BITS_PER_BYTE; i++)
+ {
+ uint64 tid_i;
+ OffsetNumber off;
+
+ if ((val & (UINT64CONST(1) << i)) == 0)
+ continue;
+
+ tid_i = key << TIDSTORE_VALUE_NBITS;
+ tid_i |= i;
+
+ off = tid_i & ((UINT64CONST(1) << TIDSTORE_OFFSET_NBITS) - 1);
+ result->offsets[result->num_offsets++] = off;
+ }
+
+ result->blkno = KEY_GET_BLKNO(key);
+}
+
+/*
+ * Encode a Tid into a key and the bit position ('off') within the 64-bit value.
+ */
+static inline uint64
+tid_to_key_off(ItemPointer tid, uint32 *off)
+{
+ uint64 upper;
+ uint64 tid_i;
+
+ tid_i = ItemPointerGetOffsetNumber(tid);
+ tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << TIDSTORE_OFFSET_NBITS;
+
+ *off = tid_i & ((1 << TIDSTORE_VALUE_NBITS) - 1);
+ upper = tid_i >> TIDSTORE_VALUE_NBITS;
+ Assert(*off < (sizeof(uint64) * BITS_PER_BYTE));
+
+ return upper;
+}
diff --git a/src/include/access/tidstore.h b/src/include/access/tidstore.h
new file mode 100644
index 0000000000..4bffdf0920
--- /dev/null
+++ b/src/include/access/tidstore.h
@@ -0,0 +1,49 @@
+/*-------------------------------------------------------------------------
+ *
+ * tidstore.h
+ * Tid storage.
+ *
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/tidstore.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef TIDSTORE_H
+#define TIDSTORE_H
+
+#include "lib/radixtree.h"
+#include "storage/itemptr.h"
+
+typedef dsa_pointer tidstore_handle;
+
+typedef struct TidStore TidStore;
+typedef struct TidStoreIter TidStoreIter;
+
+typedef struct TidStoreIterResult
+{
+ BlockNumber blkno;
+ OffsetNumber offsets[MaxOffsetNumber]; /* XXX: usually only partially used */
+ int num_offsets;
+} TidStoreIterResult;
+
+extern TidStore *tidstore_create(uint64 max_bytes, dsa_area *dsa);
+extern TidStore *tidstore_attach(dsa_area *dsa, dsa_pointer handle);
+extern void tidstore_detach(TidStore *ts);
+extern void tidstore_destroy(TidStore *ts);
+extern void tidstore_reset(TidStore *ts);
+extern void tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets);
+extern bool tidstore_lookup_tid(TidStore *ts, ItemPointer tid);
+extern TidStoreIter * tidstore_begin_iterate(TidStore *ts);
+extern TidStoreIterResult *tidstore_iterate_next(TidStoreIter *iter);
+extern void tidstore_end_iterate(TidStoreIter *iter);
+extern uint64 tidstore_num_tids(TidStore *ts);
+extern bool tidstore_is_full(TidStore *ts);
+extern uint64 tidstore_max_memory(TidStore *ts);
+extern uint64 tidstore_memory_usage(TidStore *ts);
+extern tidstore_handle tidstore_get_handle(TidStore *ts);
+
+#endif /* TIDSTORE_H */
diff --git a/src/test/modules/test_tidstore/Makefile b/src/test/modules/test_tidstore/Makefile
new file mode 100644
index 0000000000..1973963440
--- /dev/null
+++ b/src/test/modules/test_tidstore/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_tidstore/Makefile
+
+MODULE_big = test_tidstore
+OBJS = \
+ $(WIN32RES) \
+ test_tidstore.o
+PGFILEDESC = "test_tidstore - test code for src/backend/access/common/tidstore.c"
+
+EXTENSION = test_tidstore
+DATA = test_tidstore--1.0.sql
+
+REGRESS = test_tidstore
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_tidstore
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_tidstore/expected/test_tidstore.out b/src/test/modules/test_tidstore/expected/test_tidstore.out
new file mode 100644
index 0000000000..7ff2f9af87
--- /dev/null
+++ b/src/test/modules/test_tidstore/expected/test_tidstore.out
@@ -0,0 +1,13 @@
+CREATE EXTENSION test_tidstore;
+--
+-- All the logic is in the test_tidstore() function. It will throw
+-- an error if something fails.
+--
+SELECT test_tidstore();
+NOTICE: testing empty tidstore
+NOTICE: testing basic operations
+ test_tidstore
+---------------
+
+(1 row)
+
diff --git a/src/test/modules/test_tidstore/meson.build b/src/test/modules/test_tidstore/meson.build
new file mode 100644
index 0000000000..3365b073a4
--- /dev/null
+++ b/src/test/modules/test_tidstore/meson.build
@@ -0,0 +1,34 @@
+# FIXME: prevent install during main install, but not during test :/
+
+test_tidstore_sources = files(
+ 'test_tidstore.c',
+)
+
+if host_system == 'windows'
+ test_tidstore_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'test_tidstore',
+ '--FILEDESC', 'test_tidstore - test code for src/backend/access/common/tidstore.c',])
+endif
+
+test_tidstore = shared_module('test_tidstore',
+ test_tidstore_sources,
+ kwargs: pg_mod_args,
+)
+testprep_targets += test_tidstore
+
+install_data(
+ 'test_tidstore.control',
+ 'test_tidstore--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'test_tidstore',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'test_tidstore',
+ ],
+ },
+}
diff --git a/src/test/modules/test_tidstore/sql/test_tidstore.sql b/src/test/modules/test_tidstore/sql/test_tidstore.sql
new file mode 100644
index 0000000000..03aea31815
--- /dev/null
+++ b/src/test/modules/test_tidstore/sql/test_tidstore.sql
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_tidstore;
+
+--
+-- All the logic is in the test_tidstore() function. It will throw
+-- an error if something fails.
+--
+SELECT test_tidstore();
diff --git a/src/test/modules/test_tidstore/test_tidstore--1.0.sql b/src/test/modules/test_tidstore/test_tidstore--1.0.sql
new file mode 100644
index 0000000000..47e9149900
--- /dev/null
+++ b/src/test/modules/test_tidstore/test_tidstore--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_tidstore/test_tidstore--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_tidstore" to load this file. \quit
+
+CREATE FUNCTION test_tidstore()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_tidstore/test_tidstore.control b/src/test/modules/test_tidstore/test_tidstore.control
new file mode 100644
index 0000000000..9b6bd4638f
--- /dev/null
+++ b/src/test/modules/test_tidstore/test_tidstore.control
@@ -0,0 +1,4 @@
+comment = 'Test code for tidstore'
+default_version = '1.0'
+module_pathname = '$libdir/test_tidstore'
+relocatable = true
--
2.39.0
Attachment: v18-0008-Turn-branch-into-Assert-in-RT_NODE_UPDATE_INNER.patch
From 009c01a67817389fc5972d848334c1da00e8864c Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Sun, 15 Jan 2023 14:31:42 +0700
Subject: [PATCH v18 08/10] Turn branch into Assert in RT_NODE_UPDATE_INNER
---
src/include/lib/radixtree.h | 9 ++----
src/include/lib/radixtree_search_impl.h | 41 ++++++++++++++-----------
2 files changed, 25 insertions(+), 25 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index e053a2e56e..9f8bed09f7 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -1129,7 +1129,7 @@ RT_FREE_NODE(RT_RADIX_TREE *tree, RT_PTR_ALLOC allocnode)
#endif
}
-static inline bool
+static inline void
RT_NODE_UPDATE_INNER(RT_PTR_LOCAL node, uint64 key, RT_PTR_ALLOC new_child)
{
#define RT_ACTION_UPDATE
@@ -1160,12 +1160,7 @@ RT_REPLACE_NODE(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC old_child
tree->ctl->root = new_child;
}
else
- {
- bool replaced PG_USED_FOR_ASSERTS_ONLY;
-
- replaced = RT_NODE_UPDATE_INNER(parent, key, new_child);
- Assert(replaced);
- }
+ RT_NODE_UPDATE_INNER(parent, key, new_child);
RT_FREE_NODE(tree, old_child);
}
diff --git a/src/include/lib/radixtree_search_impl.h b/src/include/lib/radixtree_search_impl.h
index 3e97c31c2c..31e4978e4f 100644
--- a/src/include/lib/radixtree_search_impl.h
+++ b/src/include/lib/radixtree_search_impl.h
@@ -32,18 +32,19 @@
RT_NODE4_TYPE *n4 = (RT_NODE4_TYPE *) node;
int idx = RT_NODE_4_SEARCH_EQ((RT_NODE_BASE_4 *) n4, chunk);
+#ifdef RT_ACTION_UPDATE
+ Assert(idx >= 0);
+ n4->children[idx] = new_child;
+#else
if (idx < 0)
return false;
#ifdef RT_NODE_LEVEL_LEAF
value = n4->values[idx];
-#else
-#ifdef RT_ACTION_UPDATE
- n4->children[idx] = new_child;
#else
child = n4->children[idx];
#endif
-#endif
+#endif /* RT_ACTION_UPDATE */
break;
}
case RT_NODE_KIND_32:
@@ -51,18 +52,19 @@
RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
int idx = RT_NODE_32_SEARCH_EQ((RT_NODE_BASE_32 *) n32, chunk);
+#ifdef RT_ACTION_UPDATE
+ Assert(idx >= 0);
+ n32->children[idx] = new_child;
+#else
if (idx < 0)
return false;
#ifdef RT_NODE_LEVEL_LEAF
value = n32->values[idx];
-#else
-#ifdef RT_ACTION_UPDATE
- n32->children[idx] = new_child;
#else
child = n32->children[idx];
#endif
-#endif
+#endif /* RT_ACTION_UPDATE */
break;
}
case RT_NODE_KIND_125:
@@ -70,24 +72,28 @@
RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
int slotpos = n125->base.slot_idxs[chunk];
+#ifdef RT_ACTION_UPDATE
+ Assert(slotpos != RT_NODE_125_INVALID_IDX);
+ n125->children[slotpos] = new_child;
+#else
if (slotpos == RT_NODE_125_INVALID_IDX)
return false;
#ifdef RT_NODE_LEVEL_LEAF
value = RT_NODE_LEAF_125_GET_VALUE(n125, chunk);
-#else
-#ifdef RT_ACTION_UPDATE
- n125->children[slotpos] = new_child;
#else
child = RT_NODE_INNER_125_GET_CHILD(n125, chunk);
#endif
-#endif
+#endif /* RT_ACTION_UPDATE */
break;
}
case RT_NODE_KIND_256:
{
RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
+#ifdef RT_ACTION_UPDATE
+ RT_NODE_INNER_256_SET(n256, chunk, new_child);
+#else
#ifdef RT_NODE_LEVEL_LEAF
if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk))
#else
@@ -97,28 +103,27 @@
#ifdef RT_NODE_LEVEL_LEAF
value = RT_NODE_LEAF_256_GET_VALUE(n256, chunk);
-#else
-#ifdef RT_ACTION_UPDATE
- RT_NODE_INNER_256_SET(n256, chunk, new_child);
#else
child = RT_NODE_INNER_256_GET_CHILD(n256, chunk);
#endif
-#endif
+#endif /* RT_ACTION_UPDATE */
break;
}
}
-#ifndef RT_ACTION_UPDATE
+#ifdef RT_ACTION_UPDATE
+ return;
+#else
#ifdef RT_NODE_LEVEL_LEAF
Assert(value_p != NULL);
*value_p = value;
#else
Assert(child_p != NULL);
*child_p = child;
-#endif
#endif
return true;
+#endif /* RT_ACTION_UPDATE */
#undef RT_NODE4_TYPE
#undef RT_NODE32_TYPE
--
2.39.0
Attachment: v18-0007-Implement-shared-memory.patch
From ba09c9cb0b6abd31454ef286b8012f1e4d968d8b Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Mon, 9 Jan 2023 14:32:39 +0700
Subject: [PATCH v18 07/10] Implement shared memory
---
src/backend/utils/mmgr/dsa.c | 12 +
src/include/lib/radixtree.h | 434 ++++++++++++++----
src/include/lib/radixtree_delete_impl.h | 6 +
src/include/lib/radixtree_insert_impl.h | 43 +-
src/include/lib/radixtree_iter_impl.h | 23 +-
src/include/lib/radixtree_search_impl.h | 28 +-
src/include/utils/dsa.h | 1 +
.../modules/test_radixtree/test_radixtree.c | 43 ++
8 files changed, 469 insertions(+), 121 deletions(-)
diff --git a/src/backend/utils/mmgr/dsa.c b/src/backend/utils/mmgr/dsa.c
index 604b702a91..50f0aae3ab 100644
--- a/src/backend/utils/mmgr/dsa.c
+++ b/src/backend/utils/mmgr/dsa.c
@@ -1024,6 +1024,18 @@ dsa_set_size_limit(dsa_area *area, size_t limit)
LWLockRelease(DSA_AREA_LOCK(area));
}
+size_t
+dsa_get_total_size(dsa_area *area)
+{
+ size_t size;
+
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_SHARED);
+ size = area->control->total_segment_size;
+ LWLockRelease(DSA_AREA_LOCK(area));
+
+ return size;
+}
+
/*
* Aggressively free all spare memory in the hope of returning DSM segments to
* the operating system.
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index a02e835cd6..e053a2e56e 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -42,6 +42,8 @@
* - RT_DEFINE - if defined function definitions are generated
* - RT_SCOPE - in which scope (e.g. extern, static inline) do function
* declarations reside
+ * - RT_SHMEM - if defined, the radix tree is created in the DSA area
+ * so that multiple processes can access it simultaneously.
*
* Optional parameters:
* - RT_DEBUG - if defined add stats tracking and debugging functions
@@ -51,6 +53,9 @@
*
* RT_CREATE - Create a new, empty radix tree
* RT_FREE - Free the radix tree
+ * RT_ATTACH - Attach to the radix tree
+ * RT_DETACH - Detach from the radix tree
+ * RT_GET_HANDLE - Return the handle of the radix tree
* RT_SEARCH - Search a key-value pair
* RT_SET - Set a key-value pair
* RT_DELETE - Delete a key-value pair
@@ -80,7 +85,8 @@
#include "miscadmin.h"
#include "nodes/bitmapset.h"
#include "port/pg_bitutils.h"
-#include "port/pg_lfind.h"
+#include "port/simd.h"
+#include "utils/dsa.h"
#include "utils/memutils.h"
/* helpers */
@@ -92,6 +98,11 @@
#define RT_CREATE RT_MAKE_NAME(create)
#define RT_FREE RT_MAKE_NAME(free)
#define RT_SEARCH RT_MAKE_NAME(search)
+#ifdef RT_SHMEM
+#define RT_ATTACH RT_MAKE_NAME(attach)
+#define RT_DETACH RT_MAKE_NAME(detach)
+#define RT_GET_HANDLE RT_MAKE_NAME(get_handle)
+#endif
#define RT_SET RT_MAKE_NAME(set)
#define RT_BEGIN_ITERATE RT_MAKE_NAME(begin_iterate)
#define RT_ITERATE_NEXT RT_MAKE_NAME(iterate_next)
@@ -110,9 +121,11 @@
#define RT_FREE_NODE RT_MAKE_NAME(free_node)
#define RT_EXTEND RT_MAKE_NAME(extend)
#define RT_SET_EXTEND RT_MAKE_NAME(set_extend)
-#define RT_GROW_NODE_KIND RT_MAKE_NAME(grow_node_kind)
+//#define RT_GROW_NODE_KIND RT_MAKE_NAME(grow_node_kind)
#define RT_COPY_NODE RT_MAKE_NAME(copy_node)
#define RT_REPLACE_NODE RT_MAKE_NAME(replace_node)
+#define RT_PTR_GET_LOCAL RT_MAKE_NAME(ptr_get_local)
+#define RT_PTR_ALLOC_IS_VALID RT_MAKE_NAME(ptr_stored_is_valid)
#define RT_NODE_4_SEARCH_EQ RT_MAKE_NAME(node_4_search_eq)
#define RT_NODE_32_SEARCH_EQ RT_MAKE_NAME(node_32_search_eq)
#define RT_NODE_4_GET_INSERTPOS RT_MAKE_NAME(node_4_get_insertpos)
@@ -138,6 +151,7 @@
#define RT_SHIFT_GET_MAX_VAL RT_MAKE_NAME(shift_get_max_val)
#define RT_NODE_SEARCH_INNER RT_MAKE_NAME(node_search_inner)
#define RT_NODE_SEARCH_LEAF RT_MAKE_NAME(node_search_leaf)
+#define RT_NODE_UPDATE_INNER RT_MAKE_NAME(node_update_inner)
#define RT_NODE_DELETE_INNER RT_MAKE_NAME(node_delete_inner)
#define RT_NODE_DELETE_LEAF RT_MAKE_NAME(node_delete_leaf)
#define RT_NODE_INSERT_INNER RT_MAKE_NAME(node_insert_inner)
@@ -150,7 +164,11 @@
/* type declarations */
#define RT_RADIX_TREE RT_MAKE_NAME(radix_tree)
+#define RT_RADIX_TREE_CONTROL RT_MAKE_NAME(radix_tree_control)
#define RT_ITER RT_MAKE_NAME(iter)
+#ifdef RT_SHMEM
+#define RT_HANDLE RT_MAKE_NAME(handle)
+#endif
#define RT_NODE RT_MAKE_NAME(node)
#define RT_NODE_ITER RT_MAKE_NAME(node_iter)
#define RT_NODE_BASE_4 RT_MAKE_NAME(node_base_4)
@@ -181,8 +199,20 @@
typedef struct RT_RADIX_TREE RT_RADIX_TREE;
typedef struct RT_ITER RT_ITER;
+#ifdef RT_SHMEM
+typedef dsa_pointer RT_HANDLE;
+#endif
+
+#ifdef RT_SHMEM
+RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx, dsa_area *dsa);
+RT_SCOPE RT_RADIX_TREE * RT_ATTACH(dsa_area *dsa, dsa_pointer dp);
+RT_SCOPE void RT_DETACH(RT_RADIX_TREE *tree);
+RT_SCOPE RT_HANDLE RT_GET_HANDLE(RT_RADIX_TREE *tree);
+#else
RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx);
+#endif
RT_SCOPE void RT_FREE(RT_RADIX_TREE *tree);
+
RT_SCOPE bool RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, uint64 *val_p);
RT_SCOPE bool RT_SET(RT_RADIX_TREE *tree, uint64 key, uint64 val);
RT_SCOPE bool RT_DELETE(RT_RADIX_TREE *tree, uint64 key);
@@ -306,9 +336,21 @@ typedef struct RT_NODE
uint8 kind;
} RT_NODE;
+
#define RT_PTR_LOCAL RT_NODE *
+#ifdef RT_SHMEM
+#define RT_PTR_ALLOC dsa_pointer
+#else
#define RT_PTR_ALLOC RT_PTR_LOCAL
+#endif
+
+
+#ifdef RT_SHMEM
+#define RT_INVALID_PTR_ALLOC InvalidDsaPointer
+#else
+#define RT_INVALID_PTR_ALLOC NULL
+#endif
#define NODE_IS_LEAF(n) (((RT_PTR_LOCAL) (n))->shift == 0)
#define NODE_IS_EMPTY(n) (((RT_PTR_LOCAL) (n))->count == 0)
@@ -516,22 +558,43 @@ static const RT_SIZE_CLASS RT_KIND_MIN_SIZE_CLASS[RT_NODE_KIND_COUNT] = {
[RT_NODE_KIND_256] = RT_CLASS_256,
};
+#ifdef RT_SHMEM
+/* A magic value used to identify our radix tree */
+#define RT_RADIX_TREE_MAGIC 0x54A48167
+#endif
+
/* A radix tree with nodes */
-typedef struct RT_RADIX_TREE
+typedef struct RT_RADIX_TREE_CONTROL
{
- MemoryContext context;
+#ifdef RT_SHMEM
+ RT_HANDLE handle;
+ uint32 magic;
+#endif
RT_PTR_ALLOC root;
uint64 max_val;
uint64 num_keys;
- MemoryContextData *inner_slabs[RT_SIZE_CLASS_COUNT];
- MemoryContextData *leaf_slabs[RT_SIZE_CLASS_COUNT];
-
/* statistics */
#ifdef RT_DEBUG
int32 cnt[RT_SIZE_CLASS_COUNT];
#endif
+} RT_RADIX_TREE_CONTROL;
+
+/* A radix tree with nodes */
+typedef struct RT_RADIX_TREE
+{
+ MemoryContext context;
+
+ /* pointing to either local memory or DSA */
+ RT_RADIX_TREE_CONTROL *ctl;
+
+#ifdef RT_SHMEM
+ dsa_area *dsa;
+#else
+ MemoryContextData *inner_slabs[RT_SIZE_CLASS_COUNT];
+ MemoryContextData *leaf_slabs[RT_SIZE_CLASS_COUNT];
+#endif
} RT_RADIX_TREE;
/*
@@ -547,6 +610,11 @@ typedef struct RT_RADIX_TREE
* construct the key whenever updating the node iteration information, e.g., when
* advancing the current index within the node or when moving to the next node
* at the same level.
+ *
+ * XXX: Currently we allow only one process to do iteration. Therefore, rt_node_iter
+ * has local pointers to nodes, rather than RT_PTR_ALLOC.
+ * We need either a safeguard to disallow other processes from beginning the iteration
+ * while one process is doing it, or to allow multiple processes to iterate concurrently.
*/
typedef struct RT_NODE_ITER
{
@@ -567,14 +635,35 @@ typedef struct RT_ITER
} RT_ITER;
-static bool RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_LOCAL node,
- uint64 key, RT_PTR_LOCAL child);
-static bool RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_LOCAL node,
+static bool RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC nodep, RT_PTR_LOCAL node,
+ uint64 key, RT_PTR_ALLOC child);
+static bool RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC nodep, RT_PTR_LOCAL node,
uint64 key, uint64 value);
/* verification (available only with assertion) */
static void RT_VERIFY_NODE(RT_PTR_LOCAL node);
+/* Get the local address of an allocated node */
+static inline RT_PTR_LOCAL
+RT_PTR_GET_LOCAL(RT_RADIX_TREE *tree, RT_PTR_ALLOC node)
+{
+#ifdef RT_SHMEM
+ return dsa_get_address(tree->dsa, (dsa_pointer) node);
+#else
+ return node;
+#endif
+}
+
+static inline bool
+RT_PTR_ALLOC_IS_VALID(RT_PTR_ALLOC ptr)
+{
+#ifdef RT_SHMEM
+ return DsaPointerIsValid(ptr);
+#else
+ return PointerIsValid(ptr);
+#endif
+}
+
/*
* Return index of the first element in 'base' that equals 'key'. Return -1
* if there is no such element.
@@ -806,7 +895,7 @@ static inline bool
RT_NODE_INNER_256_IS_CHUNK_USED(RT_NODE_INNER_256 *node, uint8 chunk)
{
Assert(!NODE_IS_LEAF(node));
- return (node->children[chunk] != NULL);
+ return node->children[chunk] != RT_INVALID_PTR_ALLOC;
}
static inline bool
@@ -860,7 +949,7 @@ static inline void
RT_NODE_INNER_256_DELETE(RT_NODE_INNER_256 *node, uint8 chunk)
{
Assert(!NODE_IS_LEAF(node));
- node->children[chunk] = NULL;
+ node->children[chunk] = RT_INVALID_PTR_ALLOC;
}
static inline void
@@ -902,21 +991,31 @@ RT_SHIFT_GET_MAX_VAL(int shift)
static RT_PTR_ALLOC
RT_ALLOC_NODE(RT_RADIX_TREE *tree, RT_SIZE_CLASS size_class, bool inner)
{
- RT_PTR_ALLOC newnode;
+ RT_PTR_ALLOC allocnode;
+ size_t allocsize;
if (inner)
- newnode = (RT_PTR_ALLOC) MemoryContextAlloc(tree->inner_slabs[size_class],
- RT_SIZE_CLASS_INFO[size_class].inner_size);
+ allocsize = RT_SIZE_CLASS_INFO[size_class].inner_size;
else
- newnode = (RT_PTR_ALLOC) MemoryContextAlloc(tree->leaf_slabs[size_class],
- RT_SIZE_CLASS_INFO[size_class].leaf_size);
+ allocsize = RT_SIZE_CLASS_INFO[size_class].leaf_size;
+
+#ifdef RT_SHMEM
+ allocnode = dsa_allocate(tree->dsa, allocsize);
+#else
+ if (inner)
+ allocnode = (RT_PTR_ALLOC) MemoryContextAlloc(tree->inner_slabs[size_class],
+ allocsize);
+ else
+ allocnode = (RT_PTR_ALLOC) MemoryContextAlloc(tree->leaf_slabs[size_class],
+ allocsize);
+#endif
#ifdef RT_DEBUG
/* update the statistics */
- tree->cnt[size_class]++;
+ tree->ctl->cnt[size_class]++;
#endif
- return newnode;
+ return allocnode;
}
/* Initialize the node contents */
@@ -954,13 +1053,15 @@ RT_NEW_ROOT(RT_RADIX_TREE *tree, uint64 key)
{
int shift = RT_KEY_GET_SHIFT(key);
bool inner = shift > 0;
- RT_PTR_ALLOC newnode;
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
- newnode = RT_ALLOC_NODE(tree, RT_CLASS_4_FULL, inner);
+ allocnode = RT_ALLOC_NODE(tree, RT_CLASS_4_FULL, inner);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
RT_INIT_NODE(newnode, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
newnode->shift = shift;
- tree->max_val = RT_SHIFT_GET_MAX_VAL(shift);
- tree->root = newnode;
+ tree->ctl->max_val = RT_SHIFT_GET_MAX_VAL(shift);
+ tree->ctl->root = allocnode;
}
static inline void
@@ -969,7 +1070,7 @@ RT_COPY_NODE(RT_PTR_LOCAL newnode, RT_PTR_LOCAL oldnode)
newnode->shift = oldnode->shift;
newnode->count = oldnode->count;
}
-
+#if 0
/*
* Create a new node with 'new_kind' and the same shift, chunk, and
* count of 'node'.
@@ -977,30 +1078,33 @@ RT_COPY_NODE(RT_PTR_LOCAL newnode, RT_PTR_LOCAL oldnode)
static RT_NODE*
RT_GROW_NODE_KIND(RT_RADIX_TREE *tree, RT_PTR_LOCAL node, uint8 new_kind)
{
- RT_PTR_ALLOC newnode;
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
bool inner = !NODE_IS_LEAF(node);
- newnode = RT_ALLOC_NODE(tree, RT_KIND_MIN_SIZE_CLASS[new_kind], inner);
+ allocnode = RT_ALLOC_NODE(tree, RT_KIND_MIN_SIZE_CLASS[new_kind], inner);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
RT_INIT_NODE(newnode, new_kind, RT_KIND_MIN_SIZE_CLASS[new_kind], inner);
RT_COPY_NODE(newnode, node);
return newnode;
}
-
+#endif
/* Free the given node */
static void
-RT_FREE_NODE(RT_RADIX_TREE *tree, RT_PTR_ALLOC node)
+RT_FREE_NODE(RT_RADIX_TREE *tree, RT_PTR_ALLOC allocnode)
{
/* If we're deleting the root node, make the tree empty */
- if (tree->root == node)
+ if (tree->ctl->root == allocnode)
{
- tree->root = NULL;
- tree->max_val = 0;
+ tree->ctl->root = RT_INVALID_PTR_ALLOC;
+ tree->ctl->max_val = 0;
}
#ifdef RT_DEBUG
{
int i;
+ RT_PTR_LOCAL node = RT_PTR_GET_LOCAL(tree, allocnode);
/* update the statistics */
for (i = 0; i < RT_SIZE_CLASS_COUNT; i++)
@@ -1013,12 +1117,26 @@ RT_FREE_NODE(RT_RADIX_TREE *tree, RT_PTR_ALLOC node)
if (i == RT_SIZE_CLASS_COUNT)
i = RT_CLASS_256;
- tree->cnt[i]--;
- Assert(tree->cnt[i] >= 0);
+ tree->ctl->cnt[i]--;
+ Assert(tree->ctl->cnt[i] >= 0);
}
#endif
- pfree(node);
+#ifdef RT_SHMEM
+ dsa_free(tree->dsa, allocnode);
+#else
+ pfree(allocnode);
+#endif
+}
+
+static inline bool
+RT_NODE_UPDATE_INNER(RT_PTR_LOCAL node, uint64 key, RT_PTR_ALLOC new_child)
+{
+#define RT_ACTION_UPDATE
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_INNER
+#undef RT_ACTION_UPDATE
}
/*
@@ -1028,18 +1146,24 @@ static void
RT_REPLACE_NODE(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC old_child,
RT_PTR_ALLOC new_child, uint64 key)
{
- Assert(old_child->shift == new_child->shift);
+ RT_PTR_LOCAL old = RT_PTR_GET_LOCAL(tree, old_child);
+
+#ifdef USE_ASSERT_CHECKING
+ RT_PTR_LOCAL new = RT_PTR_GET_LOCAL(tree, new_child);
+
+ Assert(old->shift == new->shift);
+#endif
- if (parent == old_child)
+ if (parent == old)
{
/* Replace the root node with the new large node */
- tree->root = new_child;
+ tree->ctl->root = new_child;
}
else
{
- bool replaced PG_USED_FOR_ASSERTS_ONLY;
+ bool replaced PG_USED_FOR_ASSERTS_ONLY;
- replaced = RT_NODE_INSERT_INNER(tree, NULL, parent, key, new_child);
+ replaced = RT_NODE_UPDATE_INNER(parent, key, new_child);
Assert(replaced);
}
@@ -1054,7 +1178,8 @@ static void
RT_EXTEND(RT_RADIX_TREE *tree, uint64 key)
{
int target_shift;
- int shift = tree->root->shift + RT_NODE_SPAN;
+ RT_PTR_LOCAL root = RT_PTR_GET_LOCAL(tree, tree->ctl->root);
+ int shift = root->shift + RT_NODE_SPAN;
target_shift = RT_KEY_GET_SHIFT(key);
@@ -1066,14 +1191,14 @@ RT_EXTEND(RT_RADIX_TREE *tree, uint64 key)
RT_NODE_INNER_4 *n4;
allocnode = RT_ALLOC_NODE(tree, RT_CLASS_4_FULL, true);
- node = (RT_PTR_LOCAL) allocnode;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
RT_INIT_NODE(node, RT_NODE_KIND_4, RT_CLASS_4_FULL, true);
node->shift = shift;
node->count = 1;
n4 = (RT_NODE_INNER_4 *) node;
n4->base.chunks[0] = 0;
- n4->children[0] = tree->root;
+ n4->children[0] = tree->ctl->root;
/* Update the root */
tree->ctl->root = allocnode;
@@ -1081,7 +1206,7 @@ RT_EXTEND(RT_RADIX_TREE *tree, uint64 key)
shift += RT_NODE_SPAN;
}
- tree->max_val = RT_SHIFT_GET_MAX_VAL(target_shift);
+ tree->ctl->max_val = RT_SHIFT_GET_MAX_VAL(target_shift);
}
/*
@@ -1090,10 +1215,12 @@ RT_EXTEND(RT_RADIX_TREE *tree, uint64 key)
*/
static inline void
RT_SET_EXTEND(RT_RADIX_TREE *tree, uint64 key, uint64 value, RT_PTR_LOCAL parent,
- RT_PTR_LOCAL node)
+ RT_PTR_ALLOC nodep, RT_PTR_LOCAL node)
{
int shift = node->shift;
+ Assert(RT_PTR_GET_LOCAL(tree, nodep) == node);
+
while (shift >= RT_NODE_SPAN)
{
RT_PTR_ALLOC allocchild;
@@ -1102,18 +1229,19 @@ RT_SET_EXTEND(RT_RADIX_TREE *tree, uint64 key, uint64 value, RT_PTR_LOCAL parent
bool inner = newshift > 0;
allocchild = RT_ALLOC_NODE(tree, RT_CLASS_4_FULL, inner);
- newchild = (RT_PTR_LOCAL) allocchild;
+ newchild = RT_PTR_GET_LOCAL(tree, allocchild);
RT_INIT_NODE(newchild, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
newchild->shift = newshift;
RT_NODE_INSERT_INNER(tree, parent, nodep, node, key, allocchild);
parent = node;
node = newchild;
+ nodep = allocchild;
shift -= RT_NODE_SPAN;
}
- RT_NODE_INSERT_LEAF(tree, parent, node, key, value);
- tree->num_keys++;
+ RT_NODE_INSERT_LEAF(tree, parent, nodep, node, key, value);
+ tree->ctl->num_keys++;
}
/*
@@ -1172,8 +1300,8 @@ RT_NODE_DELETE_LEAF(RT_PTR_LOCAL node, uint64 key)
/* Insert the child to the inner node */
static bool
-RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_LOCAL node, uint64 key,
- RT_PTR_ALLOC child)
+RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC nodep, RT_PTR_LOCAL node,
+ uint64 key, RT_PTR_ALLOC child)
{
#define RT_NODE_LEVEL_INNER
#include "lib/radixtree_insert_impl.h"
@@ -1182,7 +1310,7 @@ RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_LOCAL node
/* Insert the value to the leaf node */
static bool
-RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_LOCAL node,
+RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC nodep, RT_PTR_LOCAL node,
uint64 key, uint64 value)
{
#define RT_NODE_LEVEL_LEAF
@@ -1194,18 +1322,31 @@ RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_LOCAL node,
* Create the radix tree in the given memory context and return it.
*/
RT_SCOPE RT_RADIX_TREE *
+#ifdef RT_SHMEM
+RT_CREATE(MemoryContext ctx, dsa_area *dsa)
+#else
RT_CREATE(MemoryContext ctx)
+#endif
{
RT_RADIX_TREE *tree;
MemoryContext old_ctx;
+#ifdef RT_SHMEM
+ dsa_pointer dp;
+#endif
old_ctx = MemoryContextSwitchTo(ctx);
- tree = palloc(sizeof(RT_RADIX_TREE));
+ tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE));
tree->context = ctx;
- tree->root = NULL;
- tree->max_val = 0;
- tree->num_keys = 0;
+
+#ifdef RT_SHMEM
+ tree->dsa = dsa;
+ dp = dsa_allocate0(dsa, sizeof(RT_RADIX_TREE_CONTROL));
+ tree->ctl = (RT_RADIX_TREE_CONTROL *) dsa_get_address(dsa, dp);
+ tree->ctl->handle = dp;
+ tree->ctl->magic = RT_RADIX_TREE_MAGIC;
+#else
+ tree->ctl = (RT_RADIX_TREE_CONTROL *) palloc0(sizeof(RT_RADIX_TREE_CONTROL));
/* Create the slab allocator for each size class */
for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
@@ -1218,27 +1359,78 @@ RT_CREATE(MemoryContext ctx)
RT_SIZE_CLASS_INFO[i].name,
RT_SIZE_CLASS_INFO[i].leaf_blocksize,
RT_SIZE_CLASS_INFO[i].leaf_size);
-#ifdef RT_DEBUG
- tree->cnt[i] = 0;
-#endif
}
+#endif
+
+ tree->ctl->root = RT_INVALID_PTR_ALLOC;
MemoryContextSwitchTo(old_ctx);
return tree;
}
+#ifdef RT_SHMEM
+RT_SCOPE RT_RADIX_TREE *
+RT_ATTACH(dsa_area *dsa, RT_HANDLE handle)
+{
+ RT_RADIX_TREE *tree;
+ dsa_pointer control;
+
+ /* XXX: memory context support */
+ tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE));
+
+ /* Find the control object in shared memory */
+ control = handle;
+
+ tree->dsa = dsa;
+ tree->ctl = (RT_RADIX_TREE_CONTROL *) dsa_get_address(dsa, control);
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+
+ /* XXX: do we need to set a callback on exit to detach dsa? */
+
+ return tree;
+}
+
+RT_SCOPE void
+RT_DETACH(RT_RADIX_TREE *tree)
+{
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+ pfree(tree);
+}
+
+RT_SCOPE RT_HANDLE
+RT_GET_HANDLE(RT_RADIX_TREE *tree)
+{
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+ return tree->ctl->handle;
+}
+#endif
+
/*
* Free the given radix tree.
*/
RT_SCOPE void
RT_FREE(RT_RADIX_TREE *tree)
{
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+
+ /*
+ * Vandalize the control block to help catch programming errors where
+ * other backends access the memory formerly occupied by this radix tree.
+ */
+ tree->ctl->magic = 0;
+ dsa_free(tree->dsa, tree->ctl->handle); // XXX
+ //dsa_detach(tree->dsa);
+#else
+ pfree(tree->ctl);
+
for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
{
MemoryContextDelete(tree->inner_slabs[i]);
MemoryContextDelete(tree->leaf_slabs[i]);
}
+#endif
pfree(tree);
}
@@ -1252,46 +1444,54 @@ RT_SET(RT_RADIX_TREE *tree, uint64 key, uint64 value)
{
int shift;
bool updated;
- RT_PTR_LOCAL node;
RT_PTR_LOCAL parent;
+ RT_PTR_ALLOC nodep;
+ RT_PTR_LOCAL node;
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
/* Empty tree, create the root */
- if (!tree->root)
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
RT_NEW_ROOT(tree, key);
/* Extend the tree if necessary */
- if (key > tree->max_val)
+ if (key > tree->ctl->max_val)
RT_EXTEND(tree, key);
- Assert(tree->root);
+ //Assert(tree->ctl->root);
- shift = tree->root->shift;
- node = parent = tree->root;
+ nodep = tree->ctl->root;
+ parent = RT_PTR_GET_LOCAL(tree, nodep);
+ shift = parent->shift;
/* Descend the tree until a leaf node */
while (shift >= 0)
{
- RT_PTR_LOCAL child;
+ RT_PTR_ALLOC child;
+
+ node = RT_PTR_GET_LOCAL(tree, nodep);
if (NODE_IS_LEAF(node))
break;
if (!RT_NODE_SEARCH_INNER(node, key, &child))
{
- RT_SET_EXTEND(tree, key, value, parent, node);
+ RT_SET_EXTEND(tree, key, value, parent, nodep, node);
return false;
}
parent = node;
- node = child;
+ nodep = child;
shift -= RT_NODE_SPAN;
}
- updated = RT_NODE_INSERT_LEAF(tree, parent, node, key, value);
+ updated = RT_NODE_INSERT_LEAF(tree, parent, nodep, node, key, value);
/* Update the statistics */
if (!updated)
- tree->num_keys++;
+ tree->ctl->num_keys++;
return updated;
}
@@ -1307,13 +1507,16 @@ RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, uint64 *value_p)
RT_PTR_LOCAL node;
int shift;
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
Assert(value_p != NULL);
- if (!tree->root || key > tree->max_val)
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root) || key > tree->ctl->max_val)
return false;
- node = tree->root;
- shift = tree->root->shift;
+ node = RT_PTR_GET_LOCAL(tree, tree->ctl->root);
+ shift = node->shift;
/* Descend the tree until a leaf node */
while (shift >= 0)
@@ -1326,7 +1529,7 @@ RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, uint64 *value_p)
if (!RT_NODE_SEARCH_INNER(node, key, &child))
return false;
- node = child;
+ node = RT_PTR_GET_LOCAL(tree, child);
shift -= RT_NODE_SPAN;
}
@@ -1341,37 +1544,44 @@ RT_SCOPE bool
RT_DELETE(RT_RADIX_TREE *tree, uint64 key)
{
RT_PTR_LOCAL node;
+ RT_PTR_ALLOC allocnode;
RT_PTR_ALLOC stack[RT_MAX_LEVEL] = {0};
int shift;
int level;
bool deleted;
- if (!tree->root || key > tree->max_val)
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root) || key > tree->ctl->max_val)
return false;
/*
* Descend the tree to search the key while building a stack of nodes we
* visited.
*/
- node = tree->root;
- shift = tree->root->shift;
+ allocnode = tree->ctl->root;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ shift = node->shift;
level = -1;
while (shift > 0)
{
RT_PTR_ALLOC child;
/* Push the current node to the stack */
- stack[++level] = node;
+ stack[++level] = allocnode;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
if (!RT_NODE_SEARCH_INNER(node, key, &child))
return false;
- node = child;
+ allocnode = child;
shift -= RT_NODE_SPAN;
}
/* Delete the key from the leaf node if exists */
- Assert(NODE_IS_LEAF(node));
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
deleted = RT_NODE_DELETE_LEAF(node, key);
if (!deleted)
@@ -1381,7 +1591,7 @@ RT_DELETE(RT_RADIX_TREE *tree, uint64 key)
}
/* Found the key to delete. Update the statistics */
- tree->num_keys--;
+ tree->ctl->num_keys--;
/*
* Return if the leaf node still has keys and we don't need to delete the
@@ -1391,13 +1601,14 @@ RT_DELETE(RT_RADIX_TREE *tree, uint64 key)
return true;
/* Free the empty leaf node */
- RT_FREE_NODE(tree, node);
+ RT_FREE_NODE(tree, allocnode);
/* Delete the key in inner nodes recursively */
while (level >= 0)
{
- node = stack[level--];
+ allocnode = stack[level--];
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
deleted = RT_NODE_DELETE_INNER(node, key);
Assert(deleted);
@@ -1406,7 +1617,7 @@ RT_DELETE(RT_RADIX_TREE *tree, uint64 key)
break;
/* The node became empty */
- RT_FREE_NODE(tree, node);
+ RT_FREE_NODE(tree, allocnode);
}
return true;
@@ -1478,6 +1689,7 @@ RT_BEGIN_ITERATE(RT_RADIX_TREE *tree)
{
MemoryContext old_ctx;
RT_ITER *iter;
+ RT_PTR_LOCAL root;
int top_level;
old_ctx = MemoryContextSwitchTo(tree->context);
@@ -1486,17 +1698,18 @@ RT_BEGIN_ITERATE(RT_RADIX_TREE *tree)
iter->tree = tree;
/* empty tree */
- if (!iter->tree->root)
+ if (!iter->tree->ctl->root)
return iter;
- top_level = iter->tree->root->shift / RT_NODE_SPAN;
+ root = RT_PTR_GET_LOCAL(tree, iter->tree->ctl->root);
+ top_level = root->shift / RT_NODE_SPAN;
iter->stack_len = top_level;
/*
* Descend to the left most leaf node from the root. The key is being
* constructed while descending to the leaf.
*/
- RT_UPDATE_ITER_STACK(iter, iter->tree->root, top_level);
+ RT_UPDATE_ITER_STACK(iter, root, top_level);
MemoryContextSwitchTo(old_ctx);
@@ -1511,7 +1724,7 @@ RT_SCOPE bool
RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, uint64 *value_p)
{
/* Empty tree */
- if (!iter->tree->root)
+ if (!iter->tree->ctl->root)
return false;
for (;;)
@@ -1571,7 +1784,7 @@ RT_END_ITERATE(RT_ITER *iter)
RT_SCOPE uint64
RT_NUM_ENTRIES(RT_RADIX_TREE *tree)
{
- return tree->num_keys;
+ return tree->ctl->num_keys;
}
/*
@@ -1580,13 +1793,19 @@ RT_NUM_ENTRIES(RT_RADIX_TREE *tree)
RT_SCOPE uint64
RT_MEMORY_USAGE(RT_RADIX_TREE *tree)
{
+ // XXX is this necessary?
Size total = sizeof(RT_RADIX_TREE);
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+ total = dsa_get_total_size(tree->dsa);
+#else
for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
{
total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
}
+#endif
return total;
}
@@ -1670,13 +1889,13 @@ void
rt_stats(RT_RADIX_TREE *tree)
{
ereport(NOTICE, (errmsg("num_keys = " UINT64_FORMAT ", height = %d, n4 = %u, n15 = %u, n32 = %u, n125 = %u, n256 = %u",
- tree->num_keys,
- tree->root->shift / RT_NODE_SPAN,
- tree->cnt[RT_CLASS_4_FULL],
- tree->cnt[RT_CLASS_32_PARTIAL],
- tree->cnt[RT_CLASS_32_FULL],
- tree->cnt[RT_CLASS_125_FULL],
- tree->cnt[RT_CLASS_256])));
+ tree->ctl->num_keys,
+ tree->ctl->root->shift / RT_NODE_SPAN,
+ tree->ctl->cnt[RT_CLASS_4_FULL],
+ tree->ctl->cnt[RT_CLASS_32_PARTIAL],
+ tree->ctl->cnt[RT_CLASS_32_FULL],
+ tree->ctl->cnt[RT_CLASS_125_FULL],
+ tree->ctl->cnt[RT_CLASS_256])));
}
static void
@@ -1848,23 +2067,23 @@ rt_dump_search(RT_RADIX_TREE *tree, uint64 key)
elog(NOTICE, "-----------------------------------------------------------");
elog(NOTICE, "max_val = " UINT64_FORMAT "(0x" UINT64_FORMAT_HEX ")",
- tree->max_val, tree->max_val);
+ tree->ctl->max_val, tree->ctl->max_val);
- if (!tree->root)
+ if (!tree->ctl->root)
{
elog(NOTICE, "tree is empty");
return;
}
- if (key > tree->max_val)
+ if (key > tree->ctl->max_val)
{
elog(NOTICE, "key " UINT64_FORMAT "(0x" UINT64_FORMAT_HEX ") is larger than max val",
key, key);
return;
}
- node = tree->root;
- shift = tree->root->shift;
+ node = tree->ctl->root;
+ shift = tree->ctl->root->shift;
while (shift >= 0)
{
RT_PTR_LOCAL child;
@@ -1901,15 +2120,15 @@ rt_dump(RT_RADIX_TREE *tree)
RT_SIZE_CLASS_INFO[i].inner_blocksize,
RT_SIZE_CLASS_INFO[i].leaf_size,
RT_SIZE_CLASS_INFO[i].leaf_blocksize);
- fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->max_val);
+ fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->ctl->max_val);
- if (!tree->root)
+ if (!tree->ctl->root)
{
fprintf(stderr, "empty tree\n");
return;
}
- rt_dump_node(tree->root, 0, true);
+ rt_dump_node(tree->ctl->root, 0, true);
}
#endif
@@ -1928,9 +2147,14 @@ rt_dump(RT_RADIX_TREE *tree)
#undef VAR_NODE_HAS_FREE_SLOT
#undef FIXED_NODE_HAS_FREE_SLOT
#undef RT_SIZE_CLASS_COUNT
+#undef RT_RADIX_TREE_MAGIC
/* type declarations */
#undef RT_RADIX_TREE
+#undef RT_RADIX_TREE_CONTROL
+#undef RT_PTR_ALLOC
+#undef RT_INVALID_PTR_ALLOC
+#undef RT_HANDLE
#undef RT_ITER
#undef RT_NODE
#undef RT_NODE_ITER
@@ -1959,6 +2183,9 @@ rt_dump(RT_RADIX_TREE *tree)
/* function declarations */
#undef RT_CREATE
#undef RT_FREE
+#undef RT_ATTACH
+#undef RT_DETACH
+#undef RT_GET_HANDLE
#undef RT_SET
#undef RT_BEGIN_ITERATE
#undef RT_ITERATE_NEXT
@@ -1980,6 +2207,8 @@ rt_dump(RT_RADIX_TREE *tree)
#undef RT_GROW_NODE_KIND
#undef RT_COPY_NODE
#undef RT_REPLACE_NODE
+#undef RT_PTR_GET_LOCAL
+#undef RT_PTR_ALLOC_IS_VALID
#undef RT_NODE_4_SEARCH_EQ
#undef RT_NODE_32_SEARCH_EQ
#undef RT_NODE_4_GET_INSERTPOS
@@ -2005,6 +2234,7 @@ rt_dump(RT_RADIX_TREE *tree)
#undef RT_SHIFT_GET_MAX_VAL
#undef RT_NODE_SEARCH_INNER
#undef RT_NODE_SEARCH_LEAF
+#undef RT_NODE_UPDATE_INNER
#undef RT_NODE_DELETE_INNER
#undef RT_NODE_DELETE_LEAF
#undef RT_NODE_INSERT_INNER
diff --git a/src/include/lib/radixtree_delete_impl.h b/src/include/lib/radixtree_delete_impl.h
index 6eefc63e19..eb87866b90 100644
--- a/src/include/lib/radixtree_delete_impl.h
+++ b/src/include/lib/radixtree_delete_impl.h
@@ -16,6 +16,12 @@
uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+#ifdef RT_NODE_LEVEL_LEAF
+ Assert(NODE_IS_LEAF(node));
+#else
+ Assert(!NODE_IS_LEAF(node));
+#endif
+
switch (node->kind)
{
case RT_NODE_KIND_4:
diff --git a/src/include/lib/radixtree_insert_impl.h b/src/include/lib/radixtree_insert_impl.h
index ff76583402..e4faf54d9d 100644
--- a/src/include/lib/radixtree_insert_impl.h
+++ b/src/include/lib/radixtree_insert_impl.h
@@ -14,11 +14,14 @@
uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
bool chunk_exists = false;
- RT_NODE *newnode = NULL;
+ RT_PTR_LOCAL newnode = NULL;
+ RT_PTR_ALLOC allocnode;
#ifdef RT_NODE_LEVEL_LEAF
+ const bool inner = false;
Assert(NODE_IS_LEAF(node));
#else
+ const bool inner = true;
Assert(!NODE_IS_LEAF(node));
#endif
@@ -45,9 +48,15 @@
if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n4)))
{
RT_NODE32_TYPE *new32;
+ const uint8 new_kind = RT_NODE_KIND_32;
+ const RT_SIZE_CLASS new_class = RT_KIND_MIN_SIZE_CLASS[new_kind];
/* grow node from 4 to 32 */
- newnode = RT_GROW_NODE_KIND(tree, node, RT_NODE_KIND_32);
+ allocnode = RT_ALLOC_NODE(tree, new_class, inner);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(newnode, new_kind, new_class, inner);
+ RT_COPY_NODE(newnode, node);
+ //newnode = RT_GROW_NODE_KIND(tree, node, RT_NODE_KIND_32);
new32 = (RT_NODE32_TYPE *) newnode;
#ifdef RT_NODE_LEVEL_LEAF
RT_CHUNK_VALUES_ARRAY_COPY(n4->base.chunks, n4->values,
@@ -57,7 +66,7 @@
new32->base.chunks, new32->children);
#endif
Assert(parent != NULL);
- RT_REPLACE_NODE(tree, parent, node, newnode, key);
+ RT_REPLACE_NODE(tree, parent, nodep, allocnode, key);
node = newnode;
}
else
@@ -112,17 +121,19 @@
n32->base.n.fanout == class32_min.fanout)
{
/* grow to the next size class of this kind */
+ const RT_SIZE_CLASS new_class = RT_CLASS_32_FULL;
+
+ allocnode = RT_ALLOC_NODE(tree, new_class, inner);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
#ifdef RT_NODE_LEVEL_LEAF
- newnode = RT_ALLOC_NODE(tree, RT_CLASS_32_FULL, false);
memcpy(newnode, node, class32_min.leaf_size);
#else
- newnode = RT_ALLOC_NODE(tree, RT_CLASS_32_FULL, true);
memcpy(newnode, node, class32_min.inner_size);
#endif
newnode->fanout = class32_max.fanout;
Assert(parent != NULL);
- RT_REPLACE_NODE(tree, parent, node, newnode, key);
+ RT_REPLACE_NODE(tree, parent, nodep, allocnode, key);
node = newnode;
/* also update pointer for this kind */
@@ -132,11 +143,17 @@
if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)))
{
RT_NODE125_TYPE *new125;
+ const uint8 new_kind = RT_NODE_KIND_125;
+ const RT_SIZE_CLASS new_class = RT_KIND_MIN_SIZE_CLASS[new_kind];
Assert(n32->base.n.fanout == class32_max.fanout);
/* grow node from 32 to 125 */
- newnode = RT_GROW_NODE_KIND(tree, node, RT_NODE_KIND_125);
+ allocnode = RT_ALLOC_NODE(tree, new_class, inner);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(newnode, new_kind, new_class, inner);
+ RT_COPY_NODE(newnode, node);
+ //newnode = RT_GROW_NODE_KIND(tree, node, RT_NODE_KIND_125);
new125 = (RT_NODE125_TYPE *) newnode;
for (int i = 0; i < class32_max.fanout; i++)
@@ -153,7 +170,7 @@
new125->base.isset[0] = (bitmapword) (((uint64) 1 << class32_max.fanout) - 1);
Assert(parent != NULL);
- RT_REPLACE_NODE(tree, parent, node, newnode, key);
+ RT_REPLACE_NODE(tree, parent, nodep, allocnode, key);
node = newnode;
}
else
@@ -204,9 +221,15 @@
if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n125)))
{
RT_NODE256_TYPE *new256;
+ const uint8 new_kind = RT_NODE_KIND_256;
+ const RT_SIZE_CLASS new_class = RT_KIND_MIN_SIZE_CLASS[new_kind];
/* grow node from 125 to 256 */
- newnode = RT_GROW_NODE_KIND(tree, node, RT_NODE_KIND_256);
+ allocnode = RT_ALLOC_NODE(tree, new_class, inner);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(newnode, new_kind, new_class, inner);
+ RT_COPY_NODE(newnode, node);
+ //newnode = RT_GROW_NODE_KIND(tree, node, RT_NODE_KIND_256);
new256 = (RT_NODE256_TYPE *) newnode;
for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n125->base.n.count; i++)
{
@@ -221,7 +244,7 @@
}
Assert(parent != NULL);
- RT_REPLACE_NODE(tree, parent, node, newnode, key);
+ RT_REPLACE_NODE(tree, parent, nodep, allocnode, key);
node = newnode;
}
else
diff --git a/src/include/lib/radixtree_iter_impl.h b/src/include/lib/radixtree_iter_impl.h
index a153011376..0b8b68df6c 100644
--- a/src/include/lib/radixtree_iter_impl.h
+++ b/src/include/lib/radixtree_iter_impl.h
@@ -12,13 +12,22 @@
#error node level must be either inner or leaf
#endif
+ bool found = false;
+ uint8 key_chunk;
+
#ifdef RT_NODE_LEVEL_LEAF
uint64 value;
+
+ Assert(NODE_IS_LEAF(node_iter->node));
#else
- RT_NODE *child = NULL;
+ RT_PTR_LOCAL child = NULL;
+
+ Assert(!NODE_IS_LEAF(node_iter->node));
+#endif
+
+#ifdef RT_SHMEM
+ Assert(iter->tree->ctl->magic == RT_RADIX_TREE_MAGIC);
#endif
- bool found = false;
- uint8 key_chunk;
switch (node_iter->node->kind)
{
@@ -32,7 +41,7 @@
#ifdef RT_NODE_LEVEL_LEAF
value = n4->values[node_iter->current_idx];
#else
- child = n4->children[node_iter->current_idx];
+ child = RT_PTR_GET_LOCAL(iter->tree, n4->children[node_iter->current_idx]);
#endif
key_chunk = n4->base.chunks[node_iter->current_idx];
found = true;
@@ -49,7 +58,7 @@
#ifdef RT_NODE_LEVEL_LEAF
value = n32->values[node_iter->current_idx];
#else
- child = n32->children[node_iter->current_idx];
+ child = RT_PTR_GET_LOCAL(iter->tree, n32->children[node_iter->current_idx]);
#endif
key_chunk = n32->base.chunks[node_iter->current_idx];
found = true;
@@ -73,7 +82,7 @@
#ifdef RT_NODE_LEVEL_LEAF
value = RT_NODE_LEAF_125_GET_VALUE(n125, i);
#else
- child = RT_NODE_INNER_125_GET_CHILD(n125, i);
+ child = RT_PTR_GET_LOCAL(iter->tree, RT_NODE_INNER_125_GET_CHILD(n125, i));
#endif
key_chunk = i;
found = true;
@@ -101,7 +110,7 @@
#ifdef RT_NODE_LEVEL_LEAF
value = RT_NODE_LEAF_256_GET_VALUE(n256, i);
#else
- child = RT_NODE_INNER_256_GET_CHILD(n256, i);
+ child = RT_PTR_GET_LOCAL(iter->tree, RT_NODE_INNER_256_GET_CHILD(n256, i));
#endif
key_chunk = i;
found = true;
diff --git a/src/include/lib/radixtree_search_impl.h b/src/include/lib/radixtree_search_impl.h
index cbc357dcc8..3e97c31c2c 100644
--- a/src/include/lib/radixtree_search_impl.h
+++ b/src/include/lib/radixtree_search_impl.h
@@ -16,8 +16,13 @@
#ifdef RT_NODE_LEVEL_LEAF
uint64 value = 0;
+
+ Assert(NODE_IS_LEAF(node));
#else
- RT_PTR_LOCAL child = NULL;
+#ifndef RT_ACTION_UPDATE
+ RT_PTR_ALLOC child = RT_INVALID_PTR_ALLOC;
+#endif
+ Assert(!NODE_IS_LEAF(node));
#endif
switch (node->kind)
@@ -32,8 +37,12 @@
#ifdef RT_NODE_LEVEL_LEAF
value = n4->values[idx];
+#else
+#ifdef RT_ACTION_UPDATE
+ n4->children[idx] = new_child;
#else
child = n4->children[idx];
+#endif
#endif
break;
}
@@ -47,22 +56,31 @@
#ifdef RT_NODE_LEVEL_LEAF
value = n32->values[idx];
+#else
+#ifdef RT_ACTION_UPDATE
+ n32->children[idx] = new_child;
#else
child = n32->children[idx];
+#endif
#endif
break;
}
case RT_NODE_KIND_125:
{
RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+ int slotpos = n125->base.slot_idxs[chunk];
- if (!RT_NODE_125_IS_CHUNK_USED((RT_NODE_BASE_125 *) n125, chunk))
+ if (slotpos == RT_NODE_125_INVALID_IDX)
return false;
#ifdef RT_NODE_LEVEL_LEAF
value = RT_NODE_LEAF_125_GET_VALUE(n125, chunk);
+#else
+#ifdef RT_ACTION_UPDATE
+ n125->children[slotpos] = new_child;
#else
child = RT_NODE_INNER_125_GET_CHILD(n125, chunk);
+#endif
#endif
break;
}
@@ -79,19 +97,25 @@
#ifdef RT_NODE_LEVEL_LEAF
value = RT_NODE_LEAF_256_GET_VALUE(n256, chunk);
+#else
+#ifdef RT_ACTION_UPDATE
+ RT_NODE_INNER_256_SET(n256, chunk, new_child);
#else
child = RT_NODE_INNER_256_GET_CHILD(n256, chunk);
+#endif
#endif
break;
}
}
+#ifndef RT_ACTION_UPDATE
#ifdef RT_NODE_LEVEL_LEAF
Assert(value_p != NULL);
*value_p = value;
#else
Assert(child_p != NULL);
*child_p = child;
+#endif
#endif
return true;
diff --git a/src/include/utils/dsa.h b/src/include/utils/dsa.h
index 104386e674..c67f936880 100644
--- a/src/include/utils/dsa.h
+++ b/src/include/utils/dsa.h
@@ -117,6 +117,7 @@ extern dsa_handle dsa_get_handle(dsa_area *area);
extern dsa_pointer dsa_allocate_extended(dsa_area *area, size_t size, int flags);
extern void dsa_free(dsa_area *area, dsa_pointer dp);
extern void *dsa_get_address(dsa_area *area, dsa_pointer dp);
+extern size_t dsa_get_total_size(dsa_area *area);
extern void dsa_trim(dsa_area *area);
extern void dsa_dump(dsa_area *area);
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
index 2256d08100..61d842789d 100644
--- a/src/test/modules/test_radixtree/test_radixtree.c
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -18,6 +18,7 @@
#include "nodes/bitmapset.h"
#include "storage/block.h"
#include "storage/itemptr.h"
+#include "storage/lwlock.h"
#include "utils/memutils.h"
#include "utils/timestamp.h"
@@ -103,6 +104,8 @@ static const test_spec test_specs[] = {
#define RT_SCOPE static
#define RT_DECLARE
#define RT_DEFINE
+// WIP: compiles with warnings because rt_attach is defined but not used
+// #define RT_SHMEM
#include "lib/radixtree.h"
@@ -119,7 +122,15 @@ test_empty(void)
uint64 key;
uint64 val;
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+ dsa = dsa_create(tranche_id);
+
+ radixtree = rt_create(CurrentMemoryContext, dsa);
+#else
radixtree = rt_create(CurrentMemoryContext);
+#endif
if (rt_search(radixtree, 0, &dummy))
elog(ERROR, "rt_search on empty tree returned true");
@@ -153,10 +164,20 @@ test_basic(int children, bool test_inner)
uint64 *keys;
int shift = test_inner ? 8 : 0;
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+ dsa = dsa_create(tranche_id);
+#endif
+
elog(NOTICE, "testing basic operations with %s node %d",
test_inner ? "inner" : "leaf", children);
+#ifdef RT_SHMEM
+ radixtree = rt_create(CurrentMemoryContext, dsa);
+#else
radixtree = rt_create(CurrentMemoryContext);
+#endif
/* prepare keys in order like 1, 32, 2, 31, 2, ... */
keys = palloc(sizeof(uint64) * children);
@@ -297,9 +318,19 @@ test_node_types(uint8 shift)
{
rt_radix_tree *radixtree;
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+ dsa = dsa_create(tranche_id);
+#endif
+
elog(NOTICE, "testing radix tree node types with shift \"%d\"", shift);
+#ifdef RT_SHMEM
+ radixtree = rt_create(CurrentMemoryContext, dsa);
+#else
radixtree = rt_create(CurrentMemoryContext);
+#endif
/*
* Insert and search entries for every node type at the 'shift' level,
@@ -332,6 +363,11 @@ test_pattern(const test_spec * spec)
int patternlen;
uint64 *pattern_values;
uint64 pattern_num_values;
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+ dsa = dsa_create(tranche_id);
+#endif
elog(NOTICE, "testing radix tree with pattern \"%s\"", spec->test_name);
if (rt_test_stats)
@@ -357,7 +393,13 @@ test_pattern(const test_spec * spec)
"radixtree test",
ALLOCSET_SMALL_SIZES);
MemoryContextSetIdentifier(radixtree_ctx, spec->test_name);
+
+#ifdef RT_SHMEM
+ radixtree = rt_create(radixtree_ctx, dsa);
+#else
radixtree = rt_create(radixtree_ctx);
+#endif
+
/*
* Add values to the set.
@@ -563,6 +605,7 @@ test_pattern(const test_spec * spec)
elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT "after " UINT64_FORMAT " deletion",
nafter, (nbefore - ndeleted), ndeleted);
+ rt_free(radixtree);
MemoryContextDelete(radixtree_ctx);
}
--
2.39.0
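As a reading aid for the RT_SHMEM parts above, the intended flow is roughly the
following sketch, pieced together from the new RT_ATTACH/RT_DETACH/RT_GET_HANDLE
declarations and the test module changes. The rt_* names assume the template is
instantiated with RT_SHMEM and the same RT_PREFIX as the test module; in a real worker
the dsa_area would come from dsa_attach() on the area the leader created.

/* Sketch: the leader creates the shared tree in a DSA area and publishes its handle. */
static rt_handle
leader_create_shared_tree(dsa_area *dsa)
{
	rt_radix_tree *tree = rt_create(CurrentMemoryContext, dsa);

	rt_set(tree, UINT64CONST(42), UINT64CONST(0xbeef));
	return rt_get_handle(tree);	/* hand this to workers, e.g. via shm_toc */
}

/* Sketch: a worker attaches to the existing tree and looks a key up. */
static void
worker_use_shared_tree(dsa_area *dsa, rt_handle handle)
{
	rt_radix_tree *tree = rt_attach(dsa, handle);
	uint64		val;

	if (rt_search(tree, UINT64CONST(42), &val))
		elog(NOTICE, "found value " UINT64_FORMAT, val);
	rt_detach(tree);
}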
Attachment: v18-0010-Use-TIDStore-for-storing-dead-tuple-TID-during-l.patch
From b2d17b4649e1a5f1de5d8f598ae5c1a5c220d85e Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Fri, 13 Jan 2023 15:38:59 +0700
Subject: [PATCH v18 10/10] Use TIDStore for storing dead tuple TID during lazy
vacuum
Previously, we used an array of ItemPointerData to store dead tuple
TIDs, which is neither space efficient nor fast to look up, and which
was limited to 1GB in size.

This changes lazy vacuum to use TIDStore for this purpose. Since
TIDStore, backed by the radix tree, allocates memory incrementally,
we get rid of the 1GB limit.

Also, since we can no longer estimate in advance exactly how many TIDs
fit in a given amount of memory, the progress columns max_dead_tuples
and num_dead_tuples are renamed and now report the progress
information in bytes.

Furthermore, since TIDStore uses the radix tree internally, the
minimum amount of memory it requires is 1MB, the initial DSA segment
size. Because of that, this change increases the minimum
maintenance_work_mem from 1MB to 2MB.
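To restate the central change (a simplified view of the lazy_scan_prune hunk further
down): instead of copying each dead offset into the flat dead_items array one at a
time, we now hand a whole block's dead offsets to the TidStore in a single call.

/* Before: copy each dead item pointer into the fixed-size array */
ItemPointerSetBlockNumber(&tmp, blkno);
for (int i = 0; i < lpdead_items; i++)
{
	ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
	dead_items->items[dead_items->num_items++] = tmp;
}

/* After: record all of this block's dead offsets at once */
tidstore_add_tids(dead_items, blkno, deadoffsets, lpdead_items);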
---
doc/src/sgml/monitoring.sgml | 8 +-
src/backend/access/heap/vacuumlazy.c | 168 +++++++--------------
src/backend/catalog/system_views.sql | 2 +-
src/backend/commands/vacuum.c | 76 +---------
src/backend/commands/vacuumparallel.c | 64 +++++---
src/backend/storage/lmgr/lwlock.c | 2 +
src/backend/utils/misc/guc_tables.c | 2 +-
src/include/commands/progress.h | 4 +-
src/include/commands/vacuum.h | 25 +--
src/include/storage/lwlock.h | 1 +
src/test/regress/expected/cluster.out | 2 +-
src/test/regress/expected/create_index.out | 2 +-
src/test/regress/expected/rules.out | 4 +-
src/test/regress/sql/cluster.sql | 2 +-
src/test/regress/sql/create_index.sql | 2 +-
15 files changed, 122 insertions(+), 242 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 358d2ff90f..6ce7ea9e35 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -6840,10 +6840,10 @@ FROM pg_stat_get_backend_idset() AS backendid;
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>max_dead_tuples</structfield> <type>bigint</type>
+ <structfield>max_dead_tuple_bytes</structfield> <type>bigint</type>
</para>
<para>
- Number of dead tuples that we can store before needing to perform
+ Amount of dead tuple data that we can store before needing to perform
an index vacuum cycle, based on
<xref linkend="guc-maintenance-work-mem"/>.
</para></entry>
@@ -6851,10 +6851,10 @@ FROM pg_stat_get_backend_idset() AS backendid;
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>num_dead_tuples</structfield> <type>bigint</type>
+ <structfield>num_dead_tuple_bytes</structfield> <type>bigint</type>
</para>
<para>
- Number of dead tuples collected since the last index vacuum cycle.
+ Amount of dead tuple data collected since the last index vacuum cycle.
</para></entry>
</row>
</tbody>
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 3694515167..58e87c4528 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -40,6 +40,7 @@
#include "access/heapam_xlog.h"
#include "access/htup_details.h"
#include "access/multixact.h"
+#include "access/tidstore.h"
#include "access/transam.h"
#include "access/visibilitymap.h"
#include "access/xact.h"
@@ -188,7 +189,7 @@ typedef struct LVRelState
* lazy_vacuum_heap_rel, which marks the same LP_DEAD line pointers as
* LP_UNUSED during second heap pass.
*/
- VacDeadItems *dead_items; /* TIDs whose index tuples we'll delete */
+ TidStore *dead_items; /* TIDs whose index tuples we'll delete */
BlockNumber rel_pages; /* total number of pages */
BlockNumber scanned_pages; /* # pages examined (not skipped via VM) */
BlockNumber removed_pages; /* # pages removed by relation truncation */
@@ -259,8 +260,9 @@ static bool lazy_scan_noprune(LVRelState *vacrel, Buffer buf,
static void lazy_vacuum(LVRelState *vacrel);
static bool lazy_vacuum_all_indexes(LVRelState *vacrel);
static void lazy_vacuum_heap_rel(LVRelState *vacrel);
-static int lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
- Buffer buffer, int index, Buffer *vmbuffer);
+static void lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
+ OffsetNumber *offsets, int num_offsets,
+ Buffer buffer, Buffer *vmbuffer);
static bool lazy_check_wraparound_failsafe(LVRelState *vacrel);
static void lazy_cleanup_all_indexes(LVRelState *vacrel);
static IndexBulkDeleteResult *lazy_vacuum_one_index(Relation indrel,
@@ -825,21 +827,21 @@ lazy_scan_heap(LVRelState *vacrel)
blkno,
next_unskippable_block,
next_fsm_block_to_vacuum = 0;
- VacDeadItems *dead_items = vacrel->dead_items;
+ TidStore *dead_items = vacrel->dead_items;
Buffer vmbuffer = InvalidBuffer;
bool next_unskippable_allvis,
skipping_current_range;
const int initprog_index[] = {
PROGRESS_VACUUM_PHASE,
PROGRESS_VACUUM_TOTAL_HEAP_BLKS,
- PROGRESS_VACUUM_MAX_DEAD_TUPLES
+ PROGRESS_VACUUM_MAX_DEAD_TUPLE_BYTES
};
int64 initprog_val[3];
/* Report that we're scanning the heap, advertising total # of blocks */
initprog_val[0] = PROGRESS_VACUUM_PHASE_SCAN_HEAP;
initprog_val[1] = rel_pages;
- initprog_val[2] = dead_items->max_items;
+ initprog_val[2] = tidstore_max_memory(vacrel->dead_items);
pgstat_progress_update_multi_param(3, initprog_index, initprog_val);
/* Set up an initial range of skippable blocks using the visibility map */
@@ -906,8 +908,7 @@ lazy_scan_heap(LVRelState *vacrel)
* dead_items TIDs, pause and do a cycle of vacuuming before we tackle
* this page.
*/
- Assert(dead_items->max_items >= MaxHeapTuplesPerPage);
- if (dead_items->max_items - dead_items->num_items < MaxHeapTuplesPerPage)
+ if (tidstore_is_full(vacrel->dead_items))
{
/*
* Before beginning index vacuuming, we release any pin we may
@@ -1039,11 +1040,18 @@ lazy_scan_heap(LVRelState *vacrel)
if (prunestate.has_lpdead_items)
{
Size freespace;
+ TidStoreIter *iter;
+ TidStoreIterResult *result;
- lazy_vacuum_heap_page(vacrel, blkno, buf, 0, &vmbuffer);
+ iter = tidstore_begin_iterate(vacrel->dead_items);
+ result = tidstore_iterate_next(iter);
+ lazy_vacuum_heap_page(vacrel, blkno, result->offsets, result->num_offsets,
+ buf, &vmbuffer);
+ Assert(!tidstore_iterate_next(iter));
+ tidstore_end_iterate(iter);
/* Forget the LP_DEAD items that we just vacuumed */
- dead_items->num_items = 0;
+ tidstore_reset(dead_items);
/*
* Periodically perform FSM vacuuming to make newly-freed
@@ -1080,7 +1088,7 @@ lazy_scan_heap(LVRelState *vacrel)
* with prunestate-driven visibility map and FSM steps (just like
* the two-pass strategy).
*/
- Assert(dead_items->num_items == 0);
+ Assert(tidstore_num_tids(dead_items) == 0);
}
/*
@@ -1233,7 +1241,7 @@ lazy_scan_heap(LVRelState *vacrel)
* Do index vacuuming (call each index's ambulkdelete routine), then do
* related heap vacuuming
*/
- if (dead_items->num_items > 0)
+ if (tidstore_num_tids(dead_items) > 0)
lazy_vacuum(vacrel);
/*
@@ -1871,23 +1879,15 @@ retry:
*/
if (lpdead_items > 0)
{
- VacDeadItems *dead_items = vacrel->dead_items;
- ItemPointerData tmp;
+ TidStore *dead_items = vacrel->dead_items;
vacrel->lpdead_item_pages++;
prunestate->has_lpdead_items = true;
- ItemPointerSetBlockNumber(&tmp, blkno);
+ tidstore_add_tids(dead_items, blkno, deadoffsets, lpdead_items);
- for (int i = 0; i < lpdead_items; i++)
- {
- ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
- dead_items->items[dead_items->num_items++] = tmp;
- }
-
- Assert(dead_items->num_items <= dead_items->max_items);
- pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
- dead_items->num_items);
+ pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
+ tidstore_memory_usage(dead_items));
/*
* It was convenient to ignore LP_DEAD items in all_visible earlier on
@@ -2107,8 +2107,7 @@ lazy_scan_noprune(LVRelState *vacrel,
}
else
{
- VacDeadItems *dead_items = vacrel->dead_items;
- ItemPointerData tmp;
+ TidStore *dead_items = vacrel->dead_items;
/*
* Page has LP_DEAD items, and so any references/TIDs that remain in
@@ -2117,17 +2116,10 @@ lazy_scan_noprune(LVRelState *vacrel,
*/
vacrel->lpdead_item_pages++;
- ItemPointerSetBlockNumber(&tmp, blkno);
+ tidstore_add_tids(dead_items, blkno, deadoffsets, lpdead_items);
- for (int i = 0; i < lpdead_items; i++)
- {
- ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
- dead_items->items[dead_items->num_items++] = tmp;
- }
-
- Assert(dead_items->num_items <= dead_items->max_items);
- pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
- dead_items->num_items);
+ pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
+ tidstore_memory_usage(dead_items));
vacrel->lpdead_items += lpdead_items;
@@ -2176,7 +2168,7 @@ lazy_vacuum(LVRelState *vacrel)
if (!vacrel->do_index_vacuuming)
{
Assert(!vacrel->do_index_cleanup);
- vacrel->dead_items->num_items = 0;
+ tidstore_reset(vacrel->dead_items);
return;
}
@@ -2205,7 +2197,7 @@ lazy_vacuum(LVRelState *vacrel)
BlockNumber threshold;
Assert(vacrel->num_index_scans == 0);
- Assert(vacrel->lpdead_items == vacrel->dead_items->num_items);
+ Assert(vacrel->lpdead_items == tidstore_num_tids(vacrel->dead_items));
Assert(vacrel->do_index_vacuuming);
Assert(vacrel->do_index_cleanup);
@@ -2232,8 +2224,8 @@ lazy_vacuum(LVRelState *vacrel)
* cases then this may need to be reconsidered.
*/
threshold = (double) vacrel->rel_pages * BYPASS_THRESHOLD_PAGES;
- bypass = (vacrel->lpdead_item_pages < threshold &&
- vacrel->lpdead_items < MAXDEADITEMS(32L * 1024L * 1024L));
+ bypass = (vacrel->lpdead_item_pages < threshold) &&
+ tidstore_memory_usage(vacrel->dead_items) < (32L * 1024L * 1024L);
}
if (bypass)
@@ -2278,7 +2270,7 @@ lazy_vacuum(LVRelState *vacrel)
* Forget the LP_DEAD items that we just vacuumed (or just decided to not
* vacuum)
*/
- vacrel->dead_items->num_items = 0;
+ tidstore_reset(vacrel->dead_items);
}
/*
@@ -2351,7 +2343,7 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
* place).
*/
Assert(vacrel->num_index_scans > 0 ||
- vacrel->dead_items->num_items == vacrel->lpdead_items);
+ tidstore_num_tids(vacrel->dead_items) == vacrel->lpdead_items);
Assert(allindexes || vacrel->failsafe_active);
/*
@@ -2388,10 +2380,11 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
static void
lazy_vacuum_heap_rel(LVRelState *vacrel)
{
- int index = 0;
BlockNumber vacuumed_pages = 0;
Buffer vmbuffer = InvalidBuffer;
LVSavedErrInfo saved_err_info;
+ TidStoreIter *iter;
+ TidStoreIterResult *result;
Assert(vacrel->do_index_vacuuming);
Assert(vacrel->do_index_cleanup);
@@ -2406,7 +2399,8 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
VACUUM_ERRCB_PHASE_VACUUM_HEAP,
InvalidBlockNumber, InvalidOffsetNumber);
- while (index < vacrel->dead_items->num_items)
+ iter = tidstore_begin_iterate(vacrel->dead_items);
+ while ((result = tidstore_iterate_next(iter)) != NULL)
{
BlockNumber blkno;
Buffer buf;
@@ -2415,12 +2409,13 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
vacuum_delay_point();
- blkno = ItemPointerGetBlockNumber(&vacrel->dead_items->items[index]);
+ blkno = result->blkno;
vacrel->blkno = blkno;
buf = ReadBufferExtended(vacrel->rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
vacrel->bstrategy);
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
- index = lazy_vacuum_heap_page(vacrel, blkno, buf, index, &vmbuffer);
+ lazy_vacuum_heap_page(vacrel, blkno, result->offsets, result->num_offsets,
+ buf, &vmbuffer);
/* Now that we've vacuumed the page, record its available space */
page = BufferGetPage(buf);
@@ -2430,6 +2425,7 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
vacuumed_pages++;
}
+ tidstore_end_iterate(iter);
vacrel->blkno = InvalidBlockNumber;
if (BufferIsValid(vmbuffer))
@@ -2439,14 +2435,13 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
* We set all LP_DEAD items from the first heap pass to LP_UNUSED during
* the second heap pass. No more, no less.
*/
- Assert(index > 0);
Assert(vacrel->num_index_scans > 1 ||
- (index == vacrel->lpdead_items &&
+ (tidstore_num_tids(vacrel->dead_items) == vacrel->lpdead_items &&
vacuumed_pages == vacrel->lpdead_item_pages));
ereport(DEBUG2,
- (errmsg("table \"%s\": removed %lld dead item identifiers in %u pages",
- vacrel->relname, (long long) index, vacuumed_pages)));
+ (errmsg("table \"%s\": removed " UINT64_FORMAT "dead item identifiers in %u pages",
+ vacrel->relname, tidstore_num_tids(vacrel->dead_items), vacuumed_pages)));
/* Revert to the previous phase information for error traceback */
restore_vacuum_error_info(vacrel, &saved_err_info);
@@ -2463,11 +2458,10 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
* LP_DEAD item on the page. The return value is the first index immediately
* after all LP_DEAD items for the same page in the array.
*/
-static int
-lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
- int index, Buffer *vmbuffer)
+static void
+lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets, Buffer buffer, Buffer *vmbuffer)
{
- VacDeadItems *dead_items = vacrel->dead_items;
Page page = BufferGetPage(buffer);
OffsetNumber unused[MaxHeapTuplesPerPage];
int nunused = 0;
@@ -2486,16 +2480,11 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
START_CRIT_SECTION();
- for (; index < dead_items->num_items; index++)
+ for (int i = 0; i < num_offsets; i++)
{
- BlockNumber tblk;
- OffsetNumber toff;
ItemId itemid;
+ OffsetNumber toff = offsets[i];
- tblk = ItemPointerGetBlockNumber(&dead_items->items[index]);
- if (tblk != blkno)
- break; /* past end of tuples for this block */
- toff = ItemPointerGetOffsetNumber(&dead_items->items[index]);
itemid = PageGetItemId(page, toff);
Assert(ItemIdIsDead(itemid) && !ItemIdHasStorage(itemid));
@@ -2575,7 +2564,6 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
/* Revert to the previous phase information for error traceback */
restore_vacuum_error_info(vacrel, &saved_err_info);
- return index;
}
/*
@@ -3071,46 +3059,6 @@ count_nondeletable_pages(LVRelState *vacrel, bool *lock_waiter_detected)
return vacrel->nonempty_pages;
}
-/*
- * Returns the number of dead TIDs that VACUUM should allocate space to
- * store, given a heap rel of size vacrel->rel_pages, and given current
- * maintenance_work_mem setting (or current autovacuum_work_mem setting,
- * when applicable).
- *
- * See the comments at the head of this file for rationale.
- */
-static int
-dead_items_max_items(LVRelState *vacrel)
-{
- int64 max_items;
- int vac_work_mem = IsAutoVacuumWorkerProcess() &&
- autovacuum_work_mem != -1 ?
- autovacuum_work_mem : maintenance_work_mem;
-
- if (vacrel->nindexes > 0)
- {
- BlockNumber rel_pages = vacrel->rel_pages;
-
- max_items = MAXDEADITEMS(vac_work_mem * 1024L);
- max_items = Min(max_items, INT_MAX);
- max_items = Min(max_items, MAXDEADITEMS(MaxAllocSize));
-
- /* curious coding here to ensure the multiplication can't overflow */
- if ((BlockNumber) (max_items / MaxHeapTuplesPerPage) > rel_pages)
- max_items = rel_pages * MaxHeapTuplesPerPage;
-
- /* stay sane if small maintenance_work_mem */
- max_items = Max(max_items, MaxHeapTuplesPerPage);
- }
- else
- {
- /* One-pass case only stores a single heap page's TIDs at a time */
- max_items = MaxHeapTuplesPerPage;
- }
-
- return (int) max_items;
-}
-
/*
* Allocate dead_items (either using palloc, or in dynamic shared memory).
* Sets dead_items in vacrel for caller.
@@ -3121,11 +3069,9 @@ dead_items_max_items(LVRelState *vacrel)
static void
dead_items_alloc(LVRelState *vacrel, int nworkers)
{
- VacDeadItems *dead_items;
- int max_items;
-
- max_items = dead_items_max_items(vacrel);
- Assert(max_items >= MaxHeapTuplesPerPage);
+ int vac_work_mem = IsAutoVacuumWorkerProcess() &&
+ autovacuum_work_mem != -1 ?
+ autovacuum_work_mem * 1024L : maintenance_work_mem * 1024L;
/*
* Initialize state for a parallel vacuum. As of now, only one worker can
@@ -3152,7 +3098,7 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
else
vacrel->pvs = parallel_vacuum_init(vacrel->rel, vacrel->indrels,
vacrel->nindexes, nworkers,
- max_items,
+ vac_work_mem,
vacrel->verbose ? INFO : DEBUG2,
vacrel->bstrategy);
@@ -3165,11 +3111,7 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
}
/* Serial VACUUM case */
- dead_items = (VacDeadItems *) palloc(vac_max_items_to_alloc_size(max_items));
- dead_items->max_items = max_items;
- dead_items->num_items = 0;
-
- vacrel->dead_items = dead_items;
+ vacrel->dead_items = tidstore_create(vac_work_mem, NULL);
}
/*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 447c9b970f..133e03d728 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1165,7 +1165,7 @@ CREATE VIEW pg_stat_progress_vacuum AS
END AS phase,
S.param2 AS heap_blks_total, S.param3 AS heap_blks_scanned,
S.param4 AS heap_blks_vacuumed, S.param5 AS index_vacuum_count,
- S.param6 AS max_dead_tuples, S.param7 AS num_dead_tuples
+ S.param6 AS max_dead_tuple_bytes, S.param7 AS dead_tuple_bytes
FROM pg_stat_get_progress_info('VACUUM') AS S
LEFT JOIN pg_database D ON S.datid = D.oid;
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index c4ed7efce3..7de4350cde 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -95,7 +95,6 @@ static bool vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params);
static double compute_parallel_delay(void);
static VacOptValue get_vacoptval_from_boolean(DefElem *def);
static bool vac_tid_reaped(ItemPointer itemptr, void *state);
-static int vac_cmp_itemptr(const void *left, const void *right);
/*
* Primary entry point for manual VACUUM and ANALYZE commands
@@ -2298,16 +2297,16 @@ get_vacoptval_from_boolean(DefElem *def)
*/
IndexBulkDeleteResult *
vac_bulkdel_one_index(IndexVacuumInfo *ivinfo, IndexBulkDeleteResult *istat,
- VacDeadItems *dead_items)
+ TidStore *dead_items)
{
/* Do bulk deletion */
istat = index_bulk_delete(ivinfo, istat, vac_tid_reaped,
(void *) dead_items);
ereport(ivinfo->message_level,
- (errmsg("scanned index \"%s\" to remove %d row versions",
+ (errmsg("scanned index \"%s\" to remove " UINT64_FORMAT " row versions",
RelationGetRelationName(ivinfo->index),
- dead_items->num_items)));
+ tidstore_num_tids(dead_items))));
return istat;
}
@@ -2338,18 +2337,6 @@ vac_cleanup_one_index(IndexVacuumInfo *ivinfo, IndexBulkDeleteResult *istat)
return istat;
}
-/*
- * Returns the total required space for VACUUM's dead_items array given a
- * max_items value.
- */
-Size
-vac_max_items_to_alloc_size(int max_items)
-{
- Assert(max_items <= MAXDEADITEMS(MaxAllocSize));
-
- return offsetof(VacDeadItems, items) + sizeof(ItemPointerData) * max_items;
-}
-
/*
* vac_tid_reaped() -- is a particular tid deletable?
*
@@ -2360,60 +2347,7 @@ vac_max_items_to_alloc_size(int max_items)
static bool
vac_tid_reaped(ItemPointer itemptr, void *state)
{
- VacDeadItems *dead_items = (VacDeadItems *) state;
- int64 litem,
- ritem,
- item;
- ItemPointer res;
-
- litem = itemptr_encode(&dead_items->items[0]);
- ritem = itemptr_encode(&dead_items->items[dead_items->num_items - 1]);
- item = itemptr_encode(itemptr);
-
- /*
- * Doing a simple bound check before bsearch() is useful to avoid the
- * extra cost of bsearch(), especially if dead items on the heap are
- * concentrated in a certain range. Since this function is called for
- * every index tuple, it pays to be really fast.
- */
- if (item < litem || item > ritem)
- return false;
-
- res = (ItemPointer) bsearch((void *) itemptr,
- (void *) dead_items->items,
- dead_items->num_items,
- sizeof(ItemPointerData),
- vac_cmp_itemptr);
-
- return (res != NULL);
-}
-
-/*
- * Comparator routines for use with qsort() and bsearch().
- */
-static int
-vac_cmp_itemptr(const void *left, const void *right)
-{
- BlockNumber lblk,
- rblk;
- OffsetNumber loff,
- roff;
-
- lblk = ItemPointerGetBlockNumber((ItemPointer) left);
- rblk = ItemPointerGetBlockNumber((ItemPointer) right);
-
- if (lblk < rblk)
- return -1;
- if (lblk > rblk)
- return 1;
-
- loff = ItemPointerGetOffsetNumber((ItemPointer) left);
- roff = ItemPointerGetOffsetNumber((ItemPointer) right);
-
- if (loff < roff)
- return -1;
- if (loff > roff)
- return 1;
+ TidStore *dead_items = (TidStore *) state;
- return 0;
+ return tidstore_lookup_tid(dead_items, itemptr);
}
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index bcd40c80a1..4c0ce4b7e6 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -44,7 +44,7 @@
* use small integers.
*/
#define PARALLEL_VACUUM_KEY_SHARED 1
-#define PARALLEL_VACUUM_KEY_DEAD_ITEMS 2
+#define PARALLEL_VACUUM_KEY_DSA 2
#define PARALLEL_VACUUM_KEY_QUERY_TEXT 3
#define PARALLEL_VACUUM_KEY_BUFFER_USAGE 4
#define PARALLEL_VACUUM_KEY_WAL_USAGE 5
@@ -103,6 +103,9 @@ typedef struct PVShared
/* Counter for vacuuming and cleanup */
pg_atomic_uint32 idx;
+
+ /* Handle of the shared TidStore */
+ tidstore_handle dead_items_handle;
} PVShared;
/* Status used during parallel index vacuum or cleanup */
@@ -166,7 +169,8 @@ struct ParallelVacuumState
PVIndStats *indstats;
/* Shared dead items space among parallel vacuum workers */
- VacDeadItems *dead_items;
+ TidStore *dead_items;
+ dsa_area *dead_items_area;
/* Points to buffer usage area in DSM */
BufferUsage *buffer_usage;
@@ -222,20 +226,23 @@ static void parallel_vacuum_error_callback(void *arg);
*/
ParallelVacuumState *
parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
- int nrequested_workers, int max_items,
- int elevel, BufferAccessStrategy bstrategy)
+ int nrequested_workers, int vac_work_mem,
+ int elevel,
+ BufferAccessStrategy bstrategy)
{
ParallelVacuumState *pvs;
ParallelContext *pcxt;
PVShared *shared;
- VacDeadItems *dead_items;
+ TidStore *dead_items;
PVIndStats *indstats;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
+ void *area_space;
+ dsa_area *dead_items_dsa;
bool *will_parallel_vacuum;
Size est_indstats_len;
Size est_shared_len;
- Size est_dead_items_len;
+ Size dsa_minsize = dsa_minimum_size();
int nindexes_mwm = 0;
int parallel_workers = 0;
int querylen;
@@ -283,9 +290,8 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_estimate_chunk(&pcxt->estimator, est_shared_len);
shm_toc_estimate_keys(&pcxt->estimator, 1);
- /* Estimate size for dead_items -- PARALLEL_VACUUM_KEY_DEAD_ITEMS */
- est_dead_items_len = vac_max_items_to_alloc_size(max_items);
- shm_toc_estimate_chunk(&pcxt->estimator, est_dead_items_len);
+ /* Estimate size for dead tuple DSA -- PARALLEL_VACUUM_KEY_DSA */
+ shm_toc_estimate_chunk(&pcxt->estimator, dsa_minsize);
shm_toc_estimate_keys(&pcxt->estimator, 1);
/*
@@ -351,6 +357,16 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_INDEX_STATS, indstats);
pvs->indstats = indstats;
+ /* Prepare DSA space for dead items */
+ area_space = shm_toc_allocate(pcxt->toc, dsa_minsize);
+ shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_DSA, area_space);
+ dead_items_dsa = dsa_create_in_place(area_space, dsa_minsize,
+ LWTRANCHE_PARALLEL_VACUUM_DSA,
+ pcxt->seg);
+ dead_items = tidstore_create(vac_work_mem, dead_items_dsa);
+ pvs->dead_items = dead_items;
+ pvs->dead_items_area = dead_items_dsa;
+
/* Prepare shared information */
shared = (PVShared *) shm_toc_allocate(pcxt->toc, est_shared_len);
MemSet(shared, 0, est_shared_len);
@@ -360,6 +376,7 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
(nindexes_mwm > 0) ?
maintenance_work_mem / Min(parallel_workers, nindexes_mwm) :
maintenance_work_mem;
+ shared->dead_items_handle = tidstore_get_handle(dead_items);
pg_atomic_init_u32(&(shared->cost_balance), 0);
pg_atomic_init_u32(&(shared->active_nworkers), 0);
@@ -368,15 +385,6 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_SHARED, shared);
pvs->shared = shared;
- /* Prepare the dead_items space */
- dead_items = (VacDeadItems *) shm_toc_allocate(pcxt->toc,
- est_dead_items_len);
- dead_items->max_items = max_items;
- dead_items->num_items = 0;
- MemSet(dead_items->items, 0, sizeof(ItemPointerData) * max_items);
- shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_DEAD_ITEMS, dead_items);
- pvs->dead_items = dead_items;
-
/*
* Allocate space for each worker's BufferUsage and WalUsage; no need to
* initialize
@@ -434,6 +442,9 @@ parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats)
istats[i] = NULL;
}
+ tidstore_destroy(pvs->dead_items);
+ dsa_detach(pvs->dead_items_area);
+
DestroyParallelContext(pvs->pcxt);
ExitParallelMode();
@@ -442,7 +453,7 @@ parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats)
}
/* Returns the dead items space */
-VacDeadItems *
+TidStore *
parallel_vacuum_get_dead_items(ParallelVacuumState *pvs)
{
return pvs->dead_items;
@@ -940,7 +951,9 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
Relation *indrels;
PVIndStats *indstats;
PVShared *shared;
- VacDeadItems *dead_items;
+ TidStore *dead_items;
+ void *area_space;
+ dsa_area *dead_items_area;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
int nindexes;
@@ -984,10 +997,10 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
PARALLEL_VACUUM_KEY_INDEX_STATS,
false);
- /* Set dead_items space */
- dead_items = (VacDeadItems *) shm_toc_lookup(toc,
- PARALLEL_VACUUM_KEY_DEAD_ITEMS,
- false);
+ /* Set dead items */
+ area_space = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_DSA, false);
+ dead_items_area = dsa_attach_in_place(area_space, seg);
+ dead_items = tidstore_attach(dead_items_area, shared->dead_items_handle);
/* Set cost-based vacuum delay */
VacuumCostActive = (VacuumCostDelay > 0);
@@ -1033,6 +1046,9 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
&wal_usage[ParallelWorkerNumber]);
+ tidstore_detach(pvs.dead_items);
+ dsa_detach(dead_items_area);
+
/* Pop the error context stack */
error_context_stack = errcallback.previous;
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 196bece0a3..ff75fae88a 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -186,6 +186,8 @@ static const char *const BuiltinTrancheNames[] = {
"PgStatsHash",
/* LWTRANCHE_PGSTATS_DATA: */
"PgStatsData",
+ /* LWTRANCHE_PARALLEL_VACUUM_DSA: */
+ "ParallelVacuumDSA",
};
StaticAssertDecl(lengthof(BuiltinTrancheNames) ==
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 5025e80f89..edee8a2b2b 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2301,7 +2301,7 @@ struct config_int ConfigureNamesInt[] =
GUC_UNIT_KB
},
&maintenance_work_mem,
- 65536, 1024, MAX_KILOBYTES,
+ 65536, 2048, MAX_KILOBYTES,
NULL, NULL, NULL
},
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index e5add41352..b209d3cf84 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -23,8 +23,8 @@
#define PROGRESS_VACUUM_HEAP_BLKS_SCANNED 2
#define PROGRESS_VACUUM_HEAP_BLKS_VACUUMED 3
#define PROGRESS_VACUUM_NUM_INDEX_VACUUMS 4
-#define PROGRESS_VACUUM_MAX_DEAD_TUPLES 5
-#define PROGRESS_VACUUM_NUM_DEAD_TUPLES 6
+#define PROGRESS_VACUUM_MAX_DEAD_TUPLE_BYTES 5
+#define PROGRESS_VACUUM_DEAD_TUPLE_BYTES 6
/* Phases of vacuum (as advertised via PROGRESS_VACUUM_PHASE) */
#define PROGRESS_VACUUM_PHASE_SCAN_HEAP 1
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 689dbb7702..220d89fff7 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -17,6 +17,7 @@
#include "access/htup.h"
#include "access/genam.h"
#include "access/parallel.h"
+#include "access/tidstore.h"
#include "catalog/pg_class.h"
#include "catalog/pg_statistic.h"
#include "catalog/pg_type.h"
@@ -276,21 +277,6 @@ struct VacuumCutoffs
MultiXactId MultiXactCutoff;
};
-/*
- * VacDeadItems stores TIDs whose index tuples are deleted by index vacuuming.
- */
-typedef struct VacDeadItems
-{
- int max_items; /* # slots allocated in array */
- int num_items; /* current # of entries */
-
- /* Sorted array of TIDs to delete from indexes */
- ItemPointerData items[FLEXIBLE_ARRAY_MEMBER];
-} VacDeadItems;
-
-#define MAXDEADITEMS(avail_mem) \
- (((avail_mem) - offsetof(VacDeadItems, items)) / sizeof(ItemPointerData))
-
/* GUC parameters */
extern PGDLLIMPORT int default_statistics_target; /* PGDLLIMPORT for PostGIS */
extern PGDLLIMPORT int vacuum_freeze_min_age;
@@ -339,18 +325,17 @@ extern Relation vacuum_open_relation(Oid relid, RangeVar *relation,
LOCKMODE lmode);
extern IndexBulkDeleteResult *vac_bulkdel_one_index(IndexVacuumInfo *ivinfo,
IndexBulkDeleteResult *istat,
- VacDeadItems *dead_items);
+ TidStore *dead_items);
extern IndexBulkDeleteResult *vac_cleanup_one_index(IndexVacuumInfo *ivinfo,
IndexBulkDeleteResult *istat);
-extern Size vac_max_items_to_alloc_size(int max_items);
/* in commands/vacuumparallel.c */
extern ParallelVacuumState *parallel_vacuum_init(Relation rel, Relation *indrels,
int nindexes, int nrequested_workers,
- int max_items, int elevel,
- BufferAccessStrategy bstrategy);
+ int vac_work_mem,
+ int elevel, BufferAccessStrategy bstrategy);
extern void parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats);
-extern VacDeadItems *parallel_vacuum_get_dead_items(ParallelVacuumState *pvs);
+extern TidStore *parallel_vacuum_get_dead_items(ParallelVacuumState *pvs);
extern void parallel_vacuum_bulkdel_all_indexes(ParallelVacuumState *pvs,
long num_table_tuples,
int num_index_scans);
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index e4162db613..40dda03088 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -204,6 +204,7 @@ typedef enum BuiltinTrancheIds
LWTRANCHE_PGSTATS_DSA,
LWTRANCHE_PGSTATS_HASH,
LWTRANCHE_PGSTATS_DATA,
+ LWTRANCHE_PARALLEL_VACUUM_DSA,
LWTRANCHE_FIRST_USER_DEFINED
} BuiltinTrancheIds;
diff --git a/src/test/regress/expected/cluster.out b/src/test/regress/expected/cluster.out
index 542c2e098c..e678e6f79e 100644
--- a/src/test/regress/expected/cluster.out
+++ b/src/test/regress/expected/cluster.out
@@ -524,7 +524,7 @@ create index cluster_sort on clstr_4 (hundred, thousand, tenthous);
-- ensure we don't use the index in CLUSTER nor the checking SELECTs
set enable_indexscan = off;
-- Use external sort:
-set maintenance_work_mem = '1MB';
+set maintenance_work_mem = '2MB';
cluster clstr_4 using cluster_sort;
select * from
(select hundred, lag(hundred) over () as lhundred,
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 6cd57e3eaa..d1889b9d10 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -1214,7 +1214,7 @@ DROP TABLE unlogged_hash_table;
-- CREATE INDEX hash_ovfl_index ON hash_ovfl_heap USING hash (x int4_ops);
-- Test hash index build tuplesorting. Force hash tuplesort using low
-- maintenance_work_mem setting and fillfactor:
-SET maintenance_work_mem = '1MB';
+SET maintenance_work_mem = '2MB';
CREATE INDEX hash_tuplesort_idx ON tenk1 USING hash (stringu1 name_ops) WITH (fillfactor = 10);
EXPLAIN (COSTS OFF)
SELECT count(*) FROM tenk1 WHERE stringu1 = 'TVAAAA';
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index fb9f936d43..0c49354f04 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2020,8 +2020,8 @@ pg_stat_progress_vacuum| SELECT s.pid,
s.param3 AS heap_blks_scanned,
s.param4 AS heap_blks_vacuumed,
s.param5 AS index_vacuum_count,
- s.param6 AS max_dead_tuples,
- s.param7 AS num_dead_tuples
+ s.param6 AS max_dead_tuple_bytes,
+ s.param7 AS dead_tuple_bytes
FROM (pg_stat_get_progress_info('VACUUM'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
LEFT JOIN pg_database d ON ((s.datid = d.oid)));
pg_stat_recovery_prefetch| SELECT s.stats_reset,
diff --git a/src/test/regress/sql/cluster.sql b/src/test/regress/sql/cluster.sql
index 6cb9c926c0..a795d705d5 100644
--- a/src/test/regress/sql/cluster.sql
+++ b/src/test/regress/sql/cluster.sql
@@ -256,7 +256,7 @@ create index cluster_sort on clstr_4 (hundred, thousand, tenthous);
set enable_indexscan = off;
-- Use external sort:
-set maintenance_work_mem = '1MB';
+set maintenance_work_mem = '2MB';
cluster clstr_4 using cluster_sort;
select * from
(select hundred, lag(hundred) over () as lhundred,
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index a3738833b2..edb5e4b4f3 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -367,7 +367,7 @@ DROP TABLE unlogged_hash_table;
-- Test hash index build tuplesorting. Force hash tuplesort using low
-- maintenance_work_mem setting and fillfactor:
-SET maintenance_work_mem = '1MB';
+SET maintenance_work_mem = '2MB';
CREATE INDEX hash_tuplesort_idx ON tenk1 USING hash (stringu1 name_ops) WITH (fillfactor = 10);
EXPLAIN (COSTS OFF)
SELECT count(*) FROM tenk1 WHERE stringu1 = 'TVAAAA';
--
2.39.0
On Mon, Jan 16, 2023 at 2:02 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Thu, Jan 12, 2023 at 9:51 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Thu, Jan 12, 2023 at 5:21 PM John Naylor
<john.naylor@enterprisedb.com> wrote:

Okay, I'll squash the previous patch and work on cleaning up the internals. I'll keep the external APIs the same so that your work on vacuum integration can be easily rebased on top of that, and we can work independently.
There were some conflicts with HEAD, so to keep the CF bot busy, I've quickly put together v18. I still have a lot of cleanup work to do, but this is enough for now.
Thanks! cfbot complains about some warnings, but these are expected
(due to unused delete routines etc). But one reported error[1] might
be relevant to the 0002 patch?
[05:44:11.759] "link" /MACHINE:x64
/OUT:src/test/modules/test_radixtree/test_radixtree.dll
src/test/modules/test_radixtree/test_radixtree.dll.p/win32ver.res
src/test/modules/test_radixtree/test_radixtree.dll.p/test_radixtree.c.obj
"/nologo" "/release" "/nologo" "/DEBUG"
"/PDB:src/test\modules\test_radixtree\test_radixtree.pdb" "/DLL"
"/IMPLIB:src/test\modules\test_radixtree\test_radixtree.lib"
"/INCREMENTAL:NO" "/STACK:4194304" "/NOEXP" "/DEBUG:FASTLINK"
"/NOIMPLIB" "C:/cirrus/build/src/backend/postgres.exe.lib"
"wldap32.lib" "c:/openssl/1.1/lib/libssl.lib"
"c:/openssl/1.1/lib/libcrypto.lib" "ws2_32.lib" "kernel32.lib"
"user32.lib" "gdi32.lib" "winspool.lib" "shell32.lib" "ole32.lib"
"oleaut32.lib" "uuid.lib" "comdlg32.lib" "advapi32.lib"
[05:44:11.819] test_radixtree.c.obj : error LNK2001: unresolved
external symbol pg_popcount64
[05:44:11.819] src\test\modules\test_radixtree\test_radixtree.dll :
fatal error LNK1120: 1 unresolved externals
0003 contains all v17 local-memory coding squashed together.
+ * XXX: Most functions in this file have two variants for inner nodes and leaf
+ * nodes, therefore there are duplication codes. While this sometimes makes the
+ * code maintenance tricky, this reduces branch prediction misses when judging
+ * whether the node is a inner node of a leaf node.
This comment seems to be out-of-date since we made it a template.
---
+#ifndef RT_COMMON
+#define RT_COMMON
What are we using this macro RT_COMMON for?
---
The following macros are defined but not undefined in radixtree.h:
RT_MAKE_PREFIX
RT_MAKE_NAME
RT_MAKE_NAME_
RT_SEARCH
UINT64_FORMAT_HEX
RT_NODE_SPAN
RT_NODE_MAX_SLOTS
RT_CHUNK_MASK
RT_MAX_SHIFT
RT_MAX_LEVEL
RT_NODE_125_INVALID_IDX
RT_GET_KEY_CHUNK
BM_IDX
BM_BIT
RT_NODE_KIND_4
RT_NODE_KIND_32
RT_NODE_KIND_125
RT_NODE_KIND_256
RT_NODE_KIND_COUNT
RT_PTR_LOCAL
RT_PTR_ALLOC
RT_INVALID_PTR_ALLOC
NODE_SLAB_BLOCK_SIZE
0004 perf test not updated but it doesn't build by default so it's fine for now
Okay.
0005 removes node.chunk as discussed, but does not change node4 fanout yet.
LGTM.
0006 is a small cleanup regarding setting node fanout.
LGTM.
0007 squashes my shared memory work with Masahiko's fixes from the addendum v17-0010.
+ /* XXX: do we need to set a callback on exit to detach dsa? */
In the current shared radix tree design, it's the caller's responsibility
to create (or attach to) a DSA area and pass it to RT_CREATE()
or RT_ATTACH(). That lets one DSA area be used not only for the radix
tree but also for other data, which is more flexible. So the caller needs
to detach from the DSA area somehow, and I think we don't need to set a
callback here for that.
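The vacuumparallel.c hunk earlier in this mail shows that pattern concretely. Condensed from the patch (TOC bookkeeping and error handling omitted), the leader and worker sides look roughly like this:

/* leader: create the DSA area, then the shared TidStore inside it */
dead_items_dsa = dsa_create_in_place(area_space, dsa_minsize,
                                     LWTRANCHE_PARALLEL_VACUUM_DSA, pcxt->seg);
dead_items = tidstore_create(vac_work_mem, dead_items_dsa);
shared->dead_items_handle = tidstore_get_handle(dead_items);

/* worker: attach to the same DSA area, then to the TidStore via the handle */
area_space = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_DSA, false);
dead_items_area = dsa_attach_in_place(area_space, seg);
dead_items = tidstore_attach(dead_items_area, shared->dead_items_handle);

/* each backend detaches from both when it is done */
tidstore_detach(dead_items);
dsa_detach(dead_items_area);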
---
+ dsa_free(tree->dsa, tree->ctl->handle); // XXX
+ //dsa_detach(tree->dsa);
Similar to above, I think we should not detach from the DSA area here.
Given that the DSA area used by the radix tree could also be used by
other data, I think that in RT_FREE() we need to free each radix tree
node allocated in DSA. In lazy vacuum, we check the memory usage
instead of the number of TIDs and need to reset the TidStore after an
index scan. So it does RT_FREE() and dsa_trim() to return DSM segments
to the OS. I've implemented rt_free_recurse() for this purpose in the
v15 version patch.
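For illustration, such a recursive free would presumably look something like the sketch below; the node-access helper and field names are made up here, only dsa_free() is the real API:

static void
rt_free_recurse(rt_radix_tree *tree, dsa_pointer nodep, int shift)
{
	/* rt_pointer_to_local() is a hypothetical DSA-pointer-to-local helper */
	rt_node    *node = rt_pointer_to_local(tree, nodep);

	/* inner nodes: free all subtrees before freeing the node itself */
	if (shift > 0)
	{
		for (int i = 0; i < node->count; i++)
			rt_free_recurse(tree, node->children[i], shift - RT_NODE_SPAN);
	}

	dsa_free(tree->dsa, nodep);
}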
--
- Assert(tree->root);
+ //Assert(tree->ctl->root);
I think we don't need this assertion in the first place. We check it
at the beginning of the function.
---
+#ifdef RT_NODE_LEVEL_LEAF
+ Assert(NODE_IS_LEAF(node));
+#else
+ Assert(!NODE_IS_LEAF(node));
+#endif
+
I think we can move this change to 0003 patch.
0008 turns the existence checks in RT_NODE_UPDATE_INNER into Asserts, as discussed.
LGTM.
0009/0010 are just copies of Masahiko's v17 addendum v17-0011/12, but the latter rebased over recent variable renaming (it's possible I missed something, so worth checking).
I've implemented the idea of using union. Let me share WIP code for
discussion, I've attached three patches that can be applied on top of

Seems fine as far as the union goes. Let's go ahead with this, and make progress on locking etc.
+1
Overall, the TidStore implementation with the union idea doesn't look so
ugly to me. But I got many compiler warnings about unused radix tree
functions like:

tidstore.c:99:19: warning: 'shared_rt_delete' defined but not used [-Wunused-function]

I'm not sure there is a convenient way to suppress this warning, but
one idea is to have some macros to specify what operations are
enabled/declared.

That sounds like a good idea. It's also worth wondering if we even need RT_NUM_ENTRIES at all, since the caller is capable of keeping track of that if necessary. It's also misnamed, since it's concerned with the number of keys. The vacuum case cares about the number of TIDs, and not the number of (encoded) keys. Even if we ever (say) changed the key to block number and value to Bitmapset, the number of keys might not be interesting.
Right. In fact, TidStore doesn't use RT_NUM_ENTRIES.
It sounds like we should at least make the delete functionality optional. (Side note on optional functions: if an implementation didn't care about iteration or its order, we could optimize insertion into linear nodes)
Agreed.
Since this is WIP, you may already have some polish in mind, so I won't go over the patches in detail, but I wanted to ask about a few things (numbers referring to v17 addendum, not v18):
0011
+ * 'num_tids' is the number of Tids stored so far. 'max_byte' is the maximum
+ * bytes a TidStore can use. These two fields are commonly used in both
+ * non-shared case and shared case.
+ */
+ uint32 num_tids;

uint32 is how we store the block number, so this is too small and will wrap around on overflow. int64 seems better.
Agreed, will fix.
+ * We calculate the maximum bytes for the TidStore in different ways
+ * for non-shared case and shared case. Please refer to the comment
+ * TIDSTORE_MEMORY_DEDUCT for details.
+ */

Maybe the #define and comment should be close to here.
Will fix.
+ * Destroy a TidStore, returning all memory. The caller must be certain that
+ * no other backend will attempt to access the TidStore before calling this
+ * function. Other backend must explicitly call tidstore_detach to free up
+ * backend-local memory associated with the TidStore. The backend that calls
+ * tidstore_destroy must not call tidstore_detach.
+ */
+void
+tidstore_destroy(TidStore *ts)

If not addressed by next patch, need to phrase comment with FIXME or TODO about making certain.
Will fix.
+ * Add Tids on a block to TidStore. The caller must ensure the offset numbers
+ * in 'offsets' are ordered in ascending order.

Must? What happens otherwise?
It ends up missing TIDs by overwriting the same key with different
values. Is it better to have a bool argument, say need_sort, to sort
the given array if the caller wants?
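Either way the fix is cheap on the caller side. A minimal sketch of sorting before the call, assuming a hypothetical offset_cmp() comparator and otherwise the names already used in the vacuum patch:

static int
offset_cmp(const void *a, const void *b)
{
	OffsetNumber oa = *(const OffsetNumber *) a;
	OffsetNumber ob = *(const OffsetNumber *) b;

	return (oa > ob) - (oa < ob);
}

	/* make sure the offsets are ascending before handing them over */
	qsort(deadoffsets, lpdead_items, sizeof(OffsetNumber), offset_cmp);
	tidstore_add_tids(dead_items, blkno, deadoffsets, lpdead_items);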
+ uint64 last_key = PG_UINT64_MAX;
I'm having some difficulty understanding this sentinel and how it's used.
Will improve the logic.
@@ -1039,11 +1040,18 @@ lazy_scan_heap(LVRelState *vacrel)
 if (prunestate.has_lpdead_items)
 {
 Size freespace;
+ TidStoreIter *iter;
+ TidStoreIterResult *result;
- lazy_vacuum_heap_page(vacrel, blkno, buf, 0, &vmbuffer);
+ iter = tidstore_begin_iterate(vacrel->dead_items);
+ result = tidstore_iterate_next(iter);
+ lazy_vacuum_heap_page(vacrel, blkno, result->offsets, result->num_offsets,
+ buf, &vmbuffer);
+ Assert(!tidstore_iterate_next(iter));
+ tidstore_end_iterate(iter);
 /* Forget the LP_DEAD items that we just vacuumed */
- dead_items->num_items = 0;
+ tidstore_reset(dead_items);

This part only runs "if (vacrel->nindexes == 0)", so seems like unneeded complexity. It arises because lazy_scan_prune() populates the tid store even if no index vacuuming happens. Perhaps the caller of lazy_scan_prune() could pass the deadoffsets array, and upon returning, either populate the store or call lazy_vacuum_heap_page(), as needed. It's quite possible I'm missing some detail, so some description of the design choices made would be helpful.
I agree that we don't need complexity here. I'll try this idea.
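To make the shape of that idea concrete, the caller side might end up looking roughly like the following sketch; the deadoffsets/ndeadoffsets out-arguments to lazy_scan_prune() are hypothetical, while the other calls use the signatures already in the patch:

	OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
	int			ndeadoffsets = 0;

	/* hypothetical out-arguments collecting this page's LP_DEAD offsets */
	lazy_scan_prune(vacrel, buf, blkno, page, &prunestate,
					deadoffsets, &ndeadoffsets);

	if (vacrel->nindexes == 0)
	{
		/* one-pass vacuum: reap this page's LP_DEAD items right away */
		if (ndeadoffsets > 0)
			lazy_vacuum_heap_page(vacrel, blkno, deadoffsets, ndeadoffsets,
								  buf, &vmbuffer);
	}
	else if (ndeadoffsets > 0)
	{
		/* two-pass vacuum: remember them for index and heap vacuuming */
		tidstore_add_tids(vacrel->dead_items, blkno, deadoffsets, ndeadoffsets);
	}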
On Mon, Jan 16, 2023 at 9:53 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
I've written a simple script to simulate the DSA memory usage and the
limit. The 75% limit works fine for the power-of-two cases, and we can
use the 60% limit for the other cases (it seems we can use up to about
66%, but I used 60% for safety). It would be best if we could
mathematically prove it, but I could prove only the power-of-two cases.
Still, the script practically shows that the 60% threshold works for
these cases.

Okay. It's worth highlighting this in the comments, and also the fact that it depends on internal details of how DSA increases segment size.
Agreed.
Since it seems you're working on another cleanup, I can address the
above comments after your work is completed. But I'm also fine with
including them into your cleanup work.
Regards,
[1]: https://cirrus-ci.com/task/5078505327689728
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Mon, Jan 16, 2023 at 3:18 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
On Mon, Jan 16, 2023 at 2:02 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
Thanks! cfbot complaints about some warnings but these are expected
(due to unused delete routines etc). But one reported error[1] might
be relevant with 0002 patch?
[05:44:11.819] test_radixtree.c.obj : error LNK2001: unresolved
external symbol pg_popcount64
[05:44:11.819] src\test\modules\test_radixtree\test_radixtree.dll :
fatal error LNK1120: 1 unresolved externals
Yeah, I'm not sure what's causing that. Since that comes from a debugging
function, we could work around it, but it would be nice to understand why,
so I'll probably have to experiment on my CI repo.
---
+#ifndef RT_COMMON
+#define RT_COMMON

What are we using this macro RT_COMMON for?
It was a quick way to define some things only once, so they probably all
showed up in the list of things you found not undefined. It's different
from the style of simplehash.h, which is to have a local name and #undef
for every single thing. simplehash.h is a precedent, so I'll change it to
match. I'll take a look at your list, too.
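For reference, the intended usage pattern of the template (per the header comment in the v19-0003 patch below) is that the user #define's the parameters and then includes the header, which is expected to #undef its own symbols at the end; the simplehash.h convention would simply extend that cleanup to every internal name on your list. The parameter values here are only illustrative:

/* a local (non-shared) radix tree specialized for this translation unit */
#define RT_PREFIX local_rt
#define RT_SCOPE static
#define RT_DECLARE
#define RT_DEFINE
#include "lib/radixtree.h"
/* at this point every RT_* macro should have been #undef'd by the header */

/* a shared-memory variant would additionally #define RT_SHMEM */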
+ * Add Tids on a block to TidStore. The caller must ensure the offset numbers
+ * in 'offsets' are ordered in ascending order.
Must? What happens otherwise?
It ends up missing TIDs by overwriting the same key with different
values. Is it better to have a bool argument, say need_sort, to sort
the given array if the caller wants?
Since it seems you're working on another cleanup, I can address the
above comments after your work is completed. But I'm also fine with
including them into your cleanup work.
I think we can work mostly simultaneously, if you work on tid store and
vacuum, and I work on the template. We can always submit a full patchset
including each other's latest work. That will catch rebase issues sooner.
--
John Naylor
EDB: http://www.enterprisedb.com
On Mon, Jan 16, 2023 at 3:18 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
On Mon, Jan 16, 2023 at 2:02 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
Attached is an update that mostly has the modest goal of getting CI green
again. v19-0003 has squashed the entire radix tree template from
previously. I've kept out the perf test module for now -- still needs
updating.
[05:44:11.819] test_radixtree.c.obj : error LNK2001: unresolved
external symbol pg_popcount64
[05:44:11.819] src\test\modules\test_radixtree\test_radixtree.dll :
fatal error LNK1120: 1 unresolved externals

Yeah, I'm not sure what's causing that. Since that comes from a debugging
function, we could work around it, but it would be nice to understand why,
so I'll probably have to experiment on my CI repo.
I'm still confused by this error, because it only occurs in the test
module. I successfully built with just 0002 in CI, so elsewhere the bmw_*
symbols resolve just fine on all platforms. I've worked around the error in
v19-0004 by using the general-purpose pg_popcount() function. We only need
to count bits in assert builds, so it doesn't matter a whole lot.
+ /* XXX: do we need to set a callback on exit to detach dsa? */
In the current shared radix tree design, it's the caller's responsibility
to create (or attach to) a DSA area and pass it to RT_CREATE()
or RT_ATTACH(). That lets one DSA area be used not only for the radix
tree but also for other data, which is more flexible. So the caller needs
to detach from the DSA area somehow, and I think we don't need to set a
callback here for that.

---
+ dsa_free(tree->dsa, tree->ctl->handle); // XXX
+ //dsa_detach(tree->dsa);

Similar to above, I think we should not detach from the DSA area here.
Given that the DSA area used by the radix tree could also be used by
other data, I think that in RT_FREE() we need to free each radix tree
node allocated in DSA. In lazy vacuum, we check the memory usage
instead of the number of TIDs and need to reset the TidStore after an
index scan. So it does RT_FREE() and dsa_trim() to return DSM segments
to the OS. I've implemented rt_free_recurse() for this purpose in the
v15 version patch.

--
- Assert(tree->root);
+ //Assert(tree->ctl->root);

I think we don't need this assertion in the first place. We check it
at the beginning of the function.
I've removed these in v19-0006.
That sounds like a good idea. It's also worth wondering if we even need
RT_NUM_ENTRIES at all, since the caller is capable of keeping track of that
if necessary. It's also misnamed, since it's concerned with the number of
keys. The vacuum case cares about the number of TIDs, and not number of
(encoded) keys. Even if we ever (say) changed the key to blocknumber and
value to Bitmapset, the number of keys might not be interesting.
Right. In fact, TidStore doesn't use RT_NUM_ENTRIES.
I've moved it to the test module, which uses it extensively. There, it's
clearer what the name is for, so I didn't change it.
It sounds like we should at least make the delete functionality
optional. (Side note on optional functions: if an implementation didn't
care about iteration or its order, we could optimize insertion into linear
nodes)
Agreed.
Done in v19-0007.
v19-0009 is just a rebase over some more vacuum cleanups.
I'll continue working on internals cleanup.
--
John Naylor
EDB: http://www.enterprisedb.com
Attachments:
v19-0001-introduce-vector8_min-and-vector8_highbit_mask.patch
From 2cff749da71a4e581e762aac7587ec6463a1dd3d Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 24 Oct 2022 14:07:09 +0900
Subject: [PATCH v19 1/9] introduce vector8_min and vector8_highbit_mask
---
src/include/port/simd.h | 47 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 47 insertions(+)
diff --git a/src/include/port/simd.h b/src/include/port/simd.h
index c836360d4b..84d41a340a 100644
--- a/src/include/port/simd.h
+++ b/src/include/port/simd.h
@@ -77,6 +77,7 @@ static inline bool vector8_has(const Vector8 v, const uint8 c);
static inline bool vector8_has_zero(const Vector8 v);
static inline bool vector8_has_le(const Vector8 v, const uint8 c);
static inline bool vector8_is_highbit_set(const Vector8 v);
+static inline uint32 vector8_highbit_mask(const Vector8 v);
#ifndef USE_NO_SIMD
static inline bool vector32_is_highbit_set(const Vector32 v);
#endif
@@ -96,6 +97,7 @@ static inline Vector8 vector8_ssub(const Vector8 v1, const Vector8 v2);
*/
#ifndef USE_NO_SIMD
static inline Vector8 vector8_eq(const Vector8 v1, const Vector8 v2);
+static inline Vector8 vector8_min(const Vector8 v1, const Vector8 v2);
static inline Vector32 vector32_eq(const Vector32 v1, const Vector32 v2);
#endif
@@ -277,6 +279,36 @@ vector8_is_highbit_set(const Vector8 v)
#endif
}
+/*
+ * Return the bitmask of the high bit of each element.
+ */
+static inline uint32
+vector8_highbit_mask(const Vector8 v)
+{
+#ifdef USE_SSE2
+ return (uint32) _mm_movemask_epi8(v);
+#elif defined(USE_NEON)
+ static const uint8 mask[16] = {
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ };
+
+ uint8x16_t masked = vandq_u8(vld1q_u8(mask), (uint8x16_t) vshrq_n_s8(v, 7));
+ uint8x16_t maskedhi = vextq_u8(masked, masked, 8);
+
+ return (uint32) vaddvq_u16((uint16x8_t) vzip1q_u8(masked, maskedhi));
+#else
+ uint32 mask = 0;
+
+ for (Size i = 0; i < sizeof(Vector8); i++)
+ mask |= (((const uint8 *) &v)[i] >> 7) << i;
+
+ return mask;
+#endif
+}
+
/*
* Exactly like vector8_is_highbit_set except for the input type, so it
* looks at each byte separately.
@@ -372,4 +404,19 @@ vector32_eq(const Vector32 v1, const Vector32 v2)
}
#endif /* ! USE_NO_SIMD */
+/*
+ * Compare the given vectors and return the vector of minimum elements.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_min(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+ return _mm_min_epu8(v1, v2);
+#elif defined(USE_NEON)
+ return vminq_u8(v1, v2);
+#endif
+}
+#endif /* ! USE_NO_SIMD */
+
#endif /* SIMD_H */
--
2.39.0
v19-0005-Remove-RT_NUM_ENTRIES.patch
From d801347976bdc6489c66dcaf64dfed343bed39dc Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Tue, 17 Jan 2023 16:38:09 +0700
Subject: [PATCH v19 5/9] Remove RT_NUM_ENTRIES
This is not expected to be used everywhere, and is very simple
to implement, so move definition to test module where it is
used extensively.
---
src/include/lib/radixtree.h | 13 -------------
src/test/modules/test_radixtree/test_radixtree.c | 9 +++++++++
2 files changed, 9 insertions(+), 13 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 7f928f02d6..ba326562d5 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -63,7 +63,6 @@
* RT_ITERATE_NEXT - Return next key-value pair, if any
* RT_END_ITER - End iteration
* RT_MEMORY_USAGE - Get the memory usage
- * RT_NUM_ENTRIES - Get the number of key-value pairs
*
* RT_CREATE() creates an empty radix tree in the given memory context
* and memory contexts for all kinds of radix tree node under the memory context.
@@ -109,7 +108,6 @@
#define RT_END_ITERATE RT_MAKE_NAME(end_iterate)
#define RT_DELETE RT_MAKE_NAME(delete)
#define RT_MEMORY_USAGE RT_MAKE_NAME(memory_usage)
-#define RT_NUM_ENTRIES RT_MAKE_NAME(num_entries)
#define RT_DUMP RT_MAKE_NAME(dump)
#define RT_DUMP_SEARCH RT_MAKE_NAME(dump_search)
#define RT_STATS RT_MAKE_NAME(stats)
@@ -222,7 +220,6 @@ RT_SCOPE bool RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, uint64 *value_p);
RT_SCOPE void RT_END_ITERATE(RT_ITER *iter);
RT_SCOPE uint64 RT_MEMORY_USAGE(RT_RADIX_TREE *tree);
-RT_SCOPE uint64 RT_NUM_ENTRIES(RT_RADIX_TREE *tree);
#ifdef RT_DEBUG
RT_SCOPE void RT_DUMP(RT_RADIX_TREE *tree);
@@ -1773,15 +1770,6 @@ RT_END_ITERATE(RT_ITER *iter)
pfree(iter);
}
-/*
- * Return the number of keys in the radix tree.
- */
-RT_SCOPE uint64
-RT_NUM_ENTRIES(RT_RADIX_TREE *tree)
-{
- return tree->ctl->num_keys;
-}
-
/*
* Return the statistics of the amount of memory used by the radix tree.
*/
@@ -2185,7 +2173,6 @@ rt_dump(RT_RADIX_TREE *tree)
#undef RT_END_ITERATE
#undef RT_DELETE
#undef RT_MEMORY_USAGE
-#undef RT_NUM_ENTRIES
#undef RT_DUMP
#undef RT_DUMP_SEARCH
#undef RT_STATS
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
index 61d842789d..076173f628 100644
--- a/src/test/modules/test_radixtree/test_radixtree.c
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -109,6 +109,15 @@ static const test_spec test_specs[] = {
#include "lib/radixtree.h"
+/*
+ * Return the number of keys in the radix tree.
+ */
+static uint64
+rt_num_entries(rt_radix_tree *tree)
+{
+ return tree->ctl->num_keys;
+}
+
PG_MODULE_MAGIC;
PG_FUNCTION_INFO_V1(test_radixtree);
--
2.39.0
v19-0004-Workaround-link-errors-on-Windows-CI.patch
From 1413044ac1546ea3c940c1bdaa69083bfa417f98 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Tue, 17 Jan 2023 15:45:39 +0700
Subject: [PATCH v19 4/9] Workaround link errors on Windows CI
For some reason, using bmw_popcount() here leads to
link errors, although bmw_rightmost_one_pos() works
fine.
---
src/include/lib/radixtree.h | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 9f8bed09f7..7f928f02d6 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -1863,12 +1863,10 @@ RT_VERIFY_NODE(RT_PTR_LOCAL node)
if (NODE_IS_LEAF(node))
{
RT_NODE_LEAF_256 *n256 = (RT_NODE_LEAF_256 *) node;
- int cnt = 0;
-
- for (int i = 0; i < BM_IDX(RT_NODE_MAX_SLOTS); i++)
- cnt += bmw_popcount(n256->isset[i]);
+ int cnt;
/* Check if the number of used chunk matches */
+ cnt = pg_popcount((const char *) n256->isset, sizeof(n256->isset));
Assert(n256->base.n.count == cnt);
break;
--
2.39.0
v19-0002-Move-some-bitmap-logic-out-of-bitmapset.c.patch
From 77251267b2c2a9123cdd7c2fe03907c45607cf7f Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Tue, 6 Dec 2022 13:39:41 +0700
Subject: [PATCH v19 2/9] Move some bitmap logic out of bitmapset.c
Add pg_rightmost_one32/64 functions. This functionality was previously
private to bitmapset.c as the RIGHTMOST_ONE macro. It has practical
use in other contexts, so move to pg_bitutils.h.
Also move the logic for selecting appropriate pg_bitutils functions
based on word size to bitmapset.h for wider visibility and add
appropriate selection for bmw_rightmost_one().
Since the previous macro relied on casting to signedbitmapword,
and the new functions do not, remove that typedef.
Design input and review by Tom Lane
Discussion: https://www.postgresql.org/message-id/CAFBsxsFW2JjTo58jtDB%2B3sZhxMx3t-3evew8%3DAcr%2BGGhC%2BkFaA%40mail.gmail.com
---
src/backend/nodes/bitmapset.c | 36 ++------------------------------
src/include/nodes/bitmapset.h | 16 ++++++++++++--
src/include/port/pg_bitutils.h | 31 +++++++++++++++++++++++++++
src/tools/pgindent/typedefs.list | 1 -
4 files changed, 47 insertions(+), 37 deletions(-)
diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index 98660524ad..fcd8e2ccbc 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -32,39 +32,7 @@
#define BITMAPSET_SIZE(nwords) \
(offsetof(Bitmapset, words) + (nwords) * sizeof(bitmapword))
-/*----------
- * This is a well-known cute trick for isolating the rightmost one-bit
- * in a word. It assumes two's complement arithmetic. Consider any
- * nonzero value, and focus attention on the rightmost one. The value is
- * then something like
- * xxxxxx10000
- * where x's are unspecified bits. The two's complement negative is formed
- * by inverting all the bits and adding one. Inversion gives
- * yyyyyy01111
- * where each y is the inverse of the corresponding x. Incrementing gives
- * yyyyyy10000
- * and then ANDing with the original value gives
- * 00000010000
- * This works for all cases except original value = zero, where of course
- * we get zero.
- *----------
- */
-#define RIGHTMOST_ONE(x) ((signedbitmapword) (x) & -((signedbitmapword) (x)))
-
-#define HAS_MULTIPLE_ONES(x) ((bitmapword) RIGHTMOST_ONE(x) != (x))
-
-/* Select appropriate bit-twiddling functions for bitmap word size */
-#if BITS_PER_BITMAPWORD == 32
-#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos32(w)
-#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos32(w)
-#define bmw_popcount(w) pg_popcount32(w)
-#elif BITS_PER_BITMAPWORD == 64
-#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos64(w)
-#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos64(w)
-#define bmw_popcount(w) pg_popcount64(w)
-#else
-#error "invalid BITS_PER_BITMAPWORD"
-#endif
+#define HAS_MULTIPLE_ONES(x) (bmw_rightmost_one(x) != (x))
/*
@@ -1013,7 +981,7 @@ bms_first_member(Bitmapset *a)
{
int result;
- w = RIGHTMOST_ONE(w);
+ w = bmw_rightmost_one(w);
a->words[wordnum] &= ~w;
result = wordnum * BITS_PER_BITMAPWORD;
diff --git a/src/include/nodes/bitmapset.h b/src/include/nodes/bitmapset.h
index 0dca6bc5fa..80e91fac0f 100644
--- a/src/include/nodes/bitmapset.h
+++ b/src/include/nodes/bitmapset.h
@@ -38,13 +38,11 @@ struct List;
#define BITS_PER_BITMAPWORD 64
typedef uint64 bitmapword; /* must be an unsigned type */
-typedef int64 signedbitmapword; /* must be the matching signed type */
#else
#define BITS_PER_BITMAPWORD 32
typedef uint32 bitmapword; /* must be an unsigned type */
-typedef int32 signedbitmapword; /* must be the matching signed type */
#endif
@@ -75,6 +73,20 @@ typedef enum
BMS_MULTIPLE /* >1 member */
} BMS_Membership;
+/* Select appropriate bit-twiddling functions for bitmap word size */
+#if BITS_PER_BITMAPWORD == 32
+#define bmw_rightmost_one(w) pg_rightmost_one32(w)
+#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos32(w)
+#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos32(w)
+#define bmw_popcount(w) pg_popcount32(w)
+#elif BITS_PER_BITMAPWORD == 64
+#define bmw_rightmost_one(w) pg_rightmost_one64(w)
+#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos64(w)
+#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos64(w)
+#define bmw_popcount(w) pg_popcount64(w)
+#else
+#error "invalid BITS_PER_BITMAPWORD"
+#endif
/*
* function prototypes in nodes/bitmapset.c
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index a49a9c03d9..7235ad25ee 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -17,6 +17,37 @@ extern PGDLLIMPORT const uint8 pg_leftmost_one_pos[256];
extern PGDLLIMPORT const uint8 pg_rightmost_one_pos[256];
extern PGDLLIMPORT const uint8 pg_number_of_ones[256];
+/*----------
+ * This is a well-known cute trick for isolating the rightmost one-bit
+ * in a word. It assumes two's complement arithmetic. Consider any
+ * nonzero value, and focus attention on the rightmost one. The value is
+ * then something like
+ * xxxxxx10000
+ * where x's are unspecified bits. The two's complement negative is formed
+ * by inverting all the bits and adding one. Inversion gives
+ * yyyyyy01111
+ * where each y is the inverse of the corresponding x. Incrementing gives
+ * yyyyyy10000
+ * and then ANDing with the original value gives
+ * 00000010000
+ * This works for all cases except original value = zero, where of course
+ * we get zero.
+ *----------
+ */
+static inline uint32
+pg_rightmost_one32(uint32 word)
+{
+ int32 result = (int32) word & -((int32) word);
+ return (uint32) result;
+}
+
+static inline uint64
+pg_rightmost_one64(uint64 word)
+{
+ int64 result = (int64) word & -((int64) word);
+ return (uint64) result;
+}
+
/*
* pg_leftmost_one_pos32
* Returns the position of the most significant set bit in "word",
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 23bafec5f7..5bd3da4948 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3662,7 +3662,6 @@ shmem_request_hook_type
shmem_startup_hook_type
sig_atomic_t
sigjmp_buf
-signedbitmapword
sigset_t
size_t
slist_head
--
2.39.0
v19-0003-Add-radixtree-template.patch
From 3e74bae1c7a27bd9c91c16f433614c1a7563d6de Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 14 Sep 2022 12:38:51 +0000
Subject: [PATCH v19 3/9] Add radixtree template
The only thing configurable at this point is function scope,
prefix, and local/shared memory.
The key and value type are still hard-coded to uint64.
To make this more useful, at least value type should be
configurable.
It might be good at some point to offer a different tree type,
e.g. "single-value leaves" to allow for variable length keys
and values, giving full flexibility to developers.
TODO: Reducing the smallest node to 3 members will
eliminate padding and only take up 32 bytes for
inner nodes.
---
src/backend/utils/mmgr/dsa.c | 12 +
src/include/lib/radixtree.h | 2243 +++++++++++++++++
src/include/lib/radixtree_delete_impl.h | 106 +
src/include/lib/radixtree_insert_impl.h | 316 +++
src/include/lib/radixtree_iter_impl.h | 138 +
src/include/lib/radixtree_search_impl.h | 131 +
src/include/utils/dsa.h | 1 +
src/test/modules/Makefile | 1 +
src/test/modules/meson.build | 1 +
src/test/modules/test_radixtree/.gitignore | 4 +
src/test/modules/test_radixtree/Makefile | 23 +
src/test/modules/test_radixtree/README | 7 +
.../expected/test_radixtree.out | 36 +
src/test/modules/test_radixtree/meson.build | 34 +
.../test_radixtree/sql/test_radixtree.sql | 7 +
.../test_radixtree/test_radixtree--1.0.sql | 8 +
.../modules/test_radixtree/test_radixtree.c | 631 +++++
.../test_radixtree/test_radixtree.control | 4 +
src/tools/pginclude/cpluspluscheck | 6 +
src/tools/pginclude/headerscheck | 6 +
20 files changed, 3715 insertions(+)
create mode 100644 src/include/lib/radixtree.h
create mode 100644 src/include/lib/radixtree_delete_impl.h
create mode 100644 src/include/lib/radixtree_insert_impl.h
create mode 100644 src/include/lib/radixtree_iter_impl.h
create mode 100644 src/include/lib/radixtree_search_impl.h
create mode 100644 src/test/modules/test_radixtree/.gitignore
create mode 100644 src/test/modules/test_radixtree/Makefile
create mode 100644 src/test/modules/test_radixtree/README
create mode 100644 src/test/modules/test_radixtree/expected/test_radixtree.out
create mode 100644 src/test/modules/test_radixtree/meson.build
create mode 100644 src/test/modules/test_radixtree/sql/test_radixtree.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree--1.0.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree.c
create mode 100644 src/test/modules/test_radixtree/test_radixtree.control
diff --git a/src/backend/utils/mmgr/dsa.c b/src/backend/utils/mmgr/dsa.c
index 604b702a91..50f0aae3ab 100644
--- a/src/backend/utils/mmgr/dsa.c
+++ b/src/backend/utils/mmgr/dsa.c
@@ -1024,6 +1024,18 @@ dsa_set_size_limit(dsa_area *area, size_t limit)
LWLockRelease(DSA_AREA_LOCK(area));
}
+size_t
+dsa_get_total_size(dsa_area *area)
+{
+ size_t size;
+
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_SHARED);
+ size = area->control->total_segment_size;
+ LWLockRelease(DSA_AREA_LOCK(area));
+
+ return size;
+}
+
/*
* Aggressively free all spare memory in the hope of returning DSM segments to
* the operating system.
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
new file mode 100644
index 0000000000..9f8bed09f7
--- /dev/null
+++ b/src/include/lib/radixtree.h
@@ -0,0 +1,2243 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.h
+ * Implementation for adaptive radix tree.
+ *
+ * This module employs the idea from the paper "The Adaptive Radix Tree: ARTful
+ * Indexing for Main-Memory Databases" by Viktor Leis, Alfons Kemper, and Thomas
+ * Neumann, 2013. The radix tree uses adaptive node sizes, a small number of node
+ * types, each with a different number of elements. Depending on the number of
+ * children, the appropriate node type is used.
+ *
+ * There are some differences from the proposed implementation. For instance,
+ * there is no support for path compression or lazy path expansion. The radix
+ * tree supports a fixed key length, so we don't expect the tree to become
+ * very deep.
+ *
+ * Both the key and the value are 64-bit unsigned integers. The inner nodes and
+ * the leaf nodes have slightly different structures: inner nodes (shift > 0)
+ * store a pointer to the child node as the value, while leaf nodes (shift == 0)
+ * store the 64-bit unsigned integer specified by the user as the value. The
+ * paper refers to this technique as "Multi-value leaves". We chose it to avoid
+ * an additional pointer traversal. It is the reason this code currently does
+ * not support variable-length keys.
+ *
+ * XXX: Most functions in this file have two variants for inner nodes and leaf
+ * nodes, so there is some code duplication. While this sometimes makes code
+ * maintenance tricky, it reduces branch prediction misses when judging
+ * whether the node is an inner node or a leaf node.
+ *
+ * XXX: radix tree nodes are never shrunk.
+ *
+ * To generate a radix tree and associated functions for a use case several
+ * macros have to be #define'ed before this file is included. Including
+ * the file #undef's all those, so a new radix tree can be generated
+ * afterwards.
+ * The relevant parameters are:
+ * - RT_PREFIX - prefix for all symbol names generated. A prefix of 'foo'
+ * will result in radix tree type 'foo_radix_tree' and functions like
+ * 'foo_create'/'foo_free' and so forth.
+ * - RT_DECLARE - if defined, function prototypes and type declarations are
+ *	 generated
+ * - RT_DEFINE - if defined, function definitions are generated
+ * - RT_SCOPE - the scope (e.g. extern, static inline) in which function
+ *	 declarations reside
+ * - RT_SHMEM - if defined, the radix tree is created in the DSA area
+ * so that multiple processes can access it simultaneously.
+ *
+ * Optional parameters:
+ * - RT_DEBUG - if defined add stats tracking and debugging functions
+ *
+ * Interface
+ * ---------
+ *
+ * RT_CREATE - Create a new, empty radix tree
+ * RT_FREE - Free the radix tree
+ * RT_ATTACH - Attach to the radix tree
+ * RT_DETACH - Detach from the radix tree
+ * RT_GET_HANDLE - Return the handle of the radix tree
+ * RT_SEARCH - Search a key-value pair
+ * RT_SET - Set a key-value pair
+ * RT_DELETE - Delete a key-value pair
+ * RT_BEGIN_ITERATE - Begin iterating through all key-value pairs
+ * RT_ITERATE_NEXT - Return next key-value pair, if any
+ * RT_END_ITERATE		- End iteration
+ * RT_MEMORY_USAGE - Get the memory usage
+ * RT_NUM_ENTRIES - Get the number of key-value pairs
+ *
+ * RT_CREATE() creates an empty radix tree in the given memory context, along
+ * with child memory contexts for the radix tree nodes.
+ *
+ * RT_ITERATE_NEXT() returns key-value pairs in ascending order of the key.
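+ *
+ * As a rough, illustrative sketch only (the 'foo' prefix and the calling code
+ * below are hypothetical, not part of this patch), a local-memory radix tree
+ * could be generated and used like this:
+ *
+ *	#define RT_PREFIX foo
+ *	#define RT_SCOPE static
+ *	#define RT_DECLARE
+ *	#define RT_DEFINE
+ *	#include "lib/radixtree.h"
+ *
+ *	foo_radix_tree *tree = foo_create(CurrentMemoryContext);
+ *	uint64		val;
+ *
+ *	foo_set(tree, 42, 123);
+ *	if (foo_search(tree, 42, &val))
+ *		Assert(val == 123);
+ *	foo_free(tree);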
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/include/lib/radixtree.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "lib/stringinfo.h"
+#include "miscadmin.h"
+#include "nodes/bitmapset.h"
+#include "port/pg_bitutils.h"
+#include "port/simd.h"
+#include "utils/dsa.h"
+#include "utils/memutils.h"
+
+/* helpers */
+#define RT_MAKE_PREFIX(a) CppConcat(a,_)
+#define RT_MAKE_NAME(name) RT_MAKE_NAME_(RT_MAKE_PREFIX(RT_PREFIX),name)
+#define RT_MAKE_NAME_(a,b) CppConcat(a,b)
+
+/* function declarations */
+#define RT_CREATE RT_MAKE_NAME(create)
+#define RT_FREE RT_MAKE_NAME(free)
+#define RT_SEARCH RT_MAKE_NAME(search)
+#ifdef RT_SHMEM
+#define RT_ATTACH RT_MAKE_NAME(attach)
+#define RT_DETACH RT_MAKE_NAME(detach)
+#define RT_GET_HANDLE RT_MAKE_NAME(get_handle)
+#endif
+#define RT_SET RT_MAKE_NAME(set)
+#define RT_BEGIN_ITERATE RT_MAKE_NAME(begin_iterate)
+#define RT_ITERATE_NEXT RT_MAKE_NAME(iterate_next)
+#define RT_END_ITERATE RT_MAKE_NAME(end_iterate)
+#define RT_DELETE RT_MAKE_NAME(delete)
+#define RT_MEMORY_USAGE RT_MAKE_NAME(memory_usage)
+#define RT_NUM_ENTRIES RT_MAKE_NAME(num_entries)
+#define RT_DUMP RT_MAKE_NAME(dump)
+#define RT_DUMP_SEARCH RT_MAKE_NAME(dump_search)
+#define RT_STATS RT_MAKE_NAME(stats)
+
+/* internal helper functions (no externally visible prototypes) */
+#define RT_NEW_ROOT RT_MAKE_NAME(new_root)
+#define RT_ALLOC_NODE RT_MAKE_NAME(alloc_node)
+#define RT_INIT_NODE RT_MAKE_NAME(init_node)
+#define RT_FREE_NODE RT_MAKE_NAME(free_node)
+#define RT_EXTEND RT_MAKE_NAME(extend)
+#define RT_SET_EXTEND RT_MAKE_NAME(set_extend)
+//#define RT_GROW_NODE_KIND RT_MAKE_NAME(grow_node_kind)
+#define RT_COPY_NODE RT_MAKE_NAME(copy_node)
+#define RT_REPLACE_NODE RT_MAKE_NAME(replace_node)
+#define RT_PTR_GET_LOCAL RT_MAKE_NAME(ptr_get_local)
+#define RT_PTR_ALLOC_IS_VALID RT_MAKE_NAME(ptr_stored_is_valid)
+#define RT_NODE_4_SEARCH_EQ RT_MAKE_NAME(node_4_search_eq)
+#define RT_NODE_32_SEARCH_EQ RT_MAKE_NAME(node_32_search_eq)
+#define RT_NODE_4_GET_INSERTPOS RT_MAKE_NAME(node_4_get_insertpos)
+#define RT_NODE_32_GET_INSERTPOS RT_MAKE_NAME(node_32_get_insertpos)
+#define RT_CHUNK_CHILDREN_ARRAY_SHIFT RT_MAKE_NAME(chunk_children_array_shift)
+#define RT_CHUNK_VALUES_ARRAY_SHIFT RT_MAKE_NAME(chunk_values_array_shift)
+#define RT_CHUNK_CHILDREN_ARRAY_DELETE RT_MAKE_NAME(chunk_children_array_delete)
+#define RT_CHUNK_VALUES_ARRAY_DELETE RT_MAKE_NAME(chunk_values_array_delete)
+#define RT_CHUNK_CHILDREN_ARRAY_COPY RT_MAKE_NAME(chunk_children_array_copy)
+#define RT_CHUNK_VALUES_ARRAY_COPY RT_MAKE_NAME(chunk_values_array_copy)
+#define RT_NODE_125_IS_CHUNK_USED RT_MAKE_NAME(node_125_is_chunk_used)
+#define RT_NODE_INNER_125_GET_CHILD RT_MAKE_NAME(node_inner_125_get_child)
+#define RT_NODE_LEAF_125_GET_VALUE RT_MAKE_NAME(node_leaf_125_get_value)
+#define RT_NODE_INNER_256_IS_CHUNK_USED RT_MAKE_NAME(node_inner_256_is_chunk_used)
+#define RT_NODE_LEAF_256_IS_CHUNK_USED RT_MAKE_NAME(node_leaf_256_is_chunk_used)
+#define RT_NODE_INNER_256_GET_CHILD RT_MAKE_NAME(node_inner_256_get_child)
+#define RT_NODE_LEAF_256_GET_VALUE RT_MAKE_NAME(node_leaf_256_get_value)
+#define RT_NODE_INNER_256_SET RT_MAKE_NAME(node_inner_256_set)
+#define RT_NODE_LEAF_256_SET RT_MAKE_NAME(node_leaf_256_set)
+#define RT_NODE_INNER_256_DELETE RT_MAKE_NAME(node_inner_256_delete)
+#define RT_NODE_LEAF_256_DELETE RT_MAKE_NAME(node_leaf_256_delete)
+#define RT_KEY_GET_SHIFT RT_MAKE_NAME(key_get_shift)
+#define RT_SHIFT_GET_MAX_VAL RT_MAKE_NAME(shift_get_max_val)
+#define RT_NODE_SEARCH_INNER RT_MAKE_NAME(node_search_inner)
+#define RT_NODE_SEARCH_LEAF RT_MAKE_NAME(node_search_leaf)
+#define RT_NODE_UPDATE_INNER RT_MAKE_NAME(node_update_inner)
+#define RT_NODE_DELETE_INNER RT_MAKE_NAME(node_delete_inner)
+#define RT_NODE_DELETE_LEAF RT_MAKE_NAME(node_delete_leaf)
+#define RT_NODE_INSERT_INNER RT_MAKE_NAME(node_insert_inner)
+#define RT_NODE_INSERT_LEAF RT_MAKE_NAME(node_insert_leaf)
+#define RT_NODE_INNER_ITERATE_NEXT RT_MAKE_NAME(node_inner_iterate_next)
+#define RT_NODE_LEAF_ITERATE_NEXT RT_MAKE_NAME(node_leaf_iterate_next)
+#define RT_UPDATE_ITER_STACK RT_MAKE_NAME(update_iter_stack)
+#define RT_ITER_UPDATE_KEY RT_MAKE_NAME(iter_update_key)
+#define RT_VERIFY_NODE RT_MAKE_NAME(verify_node)
+
+/* type declarations */
+#define RT_RADIX_TREE RT_MAKE_NAME(radix_tree)
+#define RT_RADIX_TREE_CONTROL RT_MAKE_NAME(radix_tree_control)
+#define RT_ITER RT_MAKE_NAME(iter)
+#ifdef RT_SHMEM
+#define RT_HANDLE RT_MAKE_NAME(handle)
+#endif
+#define RT_NODE RT_MAKE_NAME(node)
+#define RT_NODE_ITER RT_MAKE_NAME(node_iter)
+#define RT_NODE_BASE_4 RT_MAKE_NAME(node_base_4)
+#define RT_NODE_BASE_32 RT_MAKE_NAME(node_base_32)
+#define RT_NODE_BASE_125 RT_MAKE_NAME(node_base_125)
+#define RT_NODE_BASE_256 RT_MAKE_NAME(node_base_256)
+#define RT_NODE_INNER_4 RT_MAKE_NAME(node_inner_4)
+#define RT_NODE_INNER_32 RT_MAKE_NAME(node_inner_32)
+#define RT_NODE_INNER_125 RT_MAKE_NAME(node_inner_125)
+#define RT_NODE_INNER_256 RT_MAKE_NAME(node_inner_256)
+#define RT_NODE_LEAF_4 RT_MAKE_NAME(node_leaf_4)
+#define RT_NODE_LEAF_32 RT_MAKE_NAME(node_leaf_32)
+#define RT_NODE_LEAF_125 RT_MAKE_NAME(node_leaf_125)
+#define RT_NODE_LEAF_256 RT_MAKE_NAME(node_leaf_256)
+#define RT_SIZE_CLASS RT_MAKE_NAME(size_class)
+#define RT_SIZE_CLASS_ELEM RT_MAKE_NAME(size_class_elem)
+#define RT_SIZE_CLASS_INFO RT_MAKE_NAME(size_class_info)
+#define RT_CLASS_4_FULL RT_MAKE_NAME(class_4_full)
+#define RT_CLASS_32_PARTIAL RT_MAKE_NAME(class_32_partial)
+#define RT_CLASS_32_FULL RT_MAKE_NAME(class_32_full)
+#define RT_CLASS_125_FULL RT_MAKE_NAME(class_125_full)
+#define RT_CLASS_256 RT_MAKE_NAME(class_256)
+#define RT_KIND_MIN_SIZE_CLASS RT_MAKE_NAME(kind_min_size_class)
+
+/* generate forward declarations necessary to use the radix tree */
+#ifdef RT_DECLARE
+
+typedef struct RT_RADIX_TREE RT_RADIX_TREE;
+typedef struct RT_ITER RT_ITER;
+
+#ifdef RT_SHMEM
+typedef dsa_pointer RT_HANDLE;
+#endif
+
+#ifdef RT_SHMEM
+RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx, dsa_area *dsa);
+RT_SCOPE RT_RADIX_TREE * RT_ATTACH(dsa_area *dsa, dsa_pointer dp);
+RT_SCOPE void RT_DETACH(RT_RADIX_TREE *tree);
+RT_SCOPE RT_HANDLE RT_GET_HANDLE(RT_RADIX_TREE *tree);
+#else
+RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx);
+#endif
+RT_SCOPE void RT_FREE(RT_RADIX_TREE *tree);
+
+RT_SCOPE bool RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, uint64 *val_p);
+RT_SCOPE bool RT_SET(RT_RADIX_TREE *tree, uint64 key, uint64 val);
+RT_SCOPE bool RT_DELETE(RT_RADIX_TREE *tree, uint64 key);
+
+RT_SCOPE RT_ITER * RT_BEGIN_ITERATE(RT_RADIX_TREE *tree);
+RT_SCOPE bool RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, uint64 *value_p);
+RT_SCOPE void RT_END_ITERATE(RT_ITER *iter);
+
+RT_SCOPE uint64 RT_MEMORY_USAGE(RT_RADIX_TREE *tree);
+RT_SCOPE uint64 RT_NUM_ENTRIES(RT_RADIX_TREE *tree);
+
+#ifdef RT_DEBUG
+RT_SCOPE void RT_DUMP(RT_RADIX_TREE *tree);
+RT_SCOPE void RT_DUMP_SEARCH(RT_RADIX_TREE *tree, uint64 key);
+RT_SCOPE void RT_STATS(RT_RADIX_TREE *tree);
+#endif
+
+#endif /* RT_DECLARE */
+
+
+/* generate implementation of the radix tree */
+#ifdef RT_DEFINE
+
+/* macros and types common to all implementations */
+#ifndef RT_COMMON
+#define RT_COMMON
+
+#ifdef RT_DEBUG
+#define UINT64_FORMAT_HEX "%" INT64_MODIFIER "X"
+#endif
+
+/* The number of bits encoded in one tree level */
+#define RT_NODE_SPAN BITS_PER_BYTE
+
+/* The maximum number of slots in a node */
+#define RT_NODE_MAX_SLOTS (1 << RT_NODE_SPAN)
+
+/* Mask for extracting a chunk from the key */
+#define RT_CHUNK_MASK ((1 << RT_NODE_SPAN) - 1)
+
+/* Maximum shift the radix tree uses */
+#define RT_MAX_SHIFT RT_KEY_GET_SHIFT(UINT64_MAX)
+
+/* The maximum number of levels the radix tree can have */
+#define RT_MAX_LEVEL ((sizeof(uint64) * BITS_PER_BYTE) / RT_NODE_SPAN)
+
+/* Invalid index used in node-125 */
+#define RT_NODE_125_INVALID_IDX 0xFF
+
+/* Get a chunk from the key */
+#define RT_GET_KEY_CHUNK(key, shift) ((uint8) (((key) >> (shift)) & RT_CHUNK_MASK))
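+
+/*
+ * For example (illustrative values only), with RT_NODE_SPAN = 8,
+ * RT_GET_KEY_CHUNK(0x0102030405060708, 16) extracts the byte 0x06.
+ */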
+
+/* For accessing bitmaps */
+#define BM_IDX(x) ((x) / BITS_PER_BITMAPWORD)
+#define BM_BIT(x) ((x) % BITS_PER_BITMAPWORD)
+
+/*
+ * Supported radix tree node kinds and size classes.
+ *
+ * There are 4 node kinds, and each node kind has one or two size classes,
+ * partial and full. The size classes within the same node kind share the same
+ * node structure but have different fanouts, stored in the 'fanout' field of
+ * RT_NODE. For example, in the size class with fanout 15, when a 16th element
+ * is to be inserted, we allocate a larger area and memcpy the entire old
+ * node to it.
+ *
+ * This technique allows us to limit the node kinds to 4, which limits the
+ * number of cases in switch statements. It also allows a possible future
+ * optimization to encode the node kind in a pointer tag.
+ *
+ * These size classes have been chosen carefully so that they minimize the
+ * allocator padding in both the inner and leaf nodes on DSA.
+ */
+#define RT_NODE_KIND_4 0x00
+#define RT_NODE_KIND_32 0x01
+#define RT_NODE_KIND_125 0x02
+#define RT_NODE_KIND_256 0x03
+#define RT_NODE_KIND_COUNT 4
+
+#endif /* RT_COMMON */
+
+
+typedef enum RT_SIZE_CLASS
+{
+ RT_CLASS_4_FULL = 0,
+ RT_CLASS_32_PARTIAL,
+ RT_CLASS_32_FULL,
+ RT_CLASS_125_FULL,
+ RT_CLASS_256
+} RT_SIZE_CLASS;
+
+/* Common type for all node types */
+typedef struct RT_NODE
+{
+ /*
+	 * Number of children. We use uint16 to be able to indicate up to 256
+	 * children, since the node span is 8 bits.
+ */
+ uint16 count;
+
+ /*
+ * Max capacity for the current size class. Storing this in the
+ * node enables multiple size classes per node kind.
+ * Technically, kinds with a single size class don't need this, so we could
+ * keep this in the individual base types, but the code is simpler this way.
+ * Note: node256 is unique in that it cannot possibly have more than a
+ * single size class, so for that kind we store zero, and uint8 is
+ * sufficient for other kinds.
+ */
+ uint8 fanout;
+
+ /*
+ * Shift indicates which part of the key space is represented by this
+ * node. That is, the key is shifted by 'shift' and the lowest
+ * RT_NODE_SPAN bits are then represented in chunk.
+ */
+ uint8 shift;
+
+ /* Node kind, one per search/set algorithm */
+ uint8 kind;
+} RT_NODE;
+
+
+#define RT_PTR_LOCAL RT_NODE *
+
+#ifdef RT_SHMEM
+#define RT_PTR_ALLOC dsa_pointer
+#else
+#define RT_PTR_ALLOC RT_PTR_LOCAL
+#endif
+
+
+#ifdef RT_SHMEM
+#define RT_INVALID_PTR_ALLOC InvalidDsaPointer
+#else
+#define RT_INVALID_PTR_ALLOC NULL
+#endif
+
+#define NODE_IS_LEAF(n) (((RT_PTR_LOCAL) (n))->shift == 0)
+#define NODE_IS_EMPTY(n) (((RT_PTR_LOCAL) (n))->count == 0)
+#define VAR_NODE_HAS_FREE_SLOT(node) \
+ ((node)->base.n.count < (node)->base.n.fanout)
+#define FIXED_NODE_HAS_FREE_SLOT(node, class) \
+ ((node)->base.n.count < RT_SIZE_CLASS_INFO[class].fanout)
+
+/*
+ * Base types of each node kind, for both leaf and inner nodes.
+ *
+ * The base types must be able to accommodate the largest size class for
+ * variable-sized node kinds.
+ */
+typedef struct RT_NODE_BASE_4
+{
+ RT_NODE n;
+
+ /* 4 children, for key chunks */
+ uint8 chunks[4];
+} RT_NODE_BASE_4;
+
+typedef struct RT_NODE_BASE_32
+{
+ RT_NODE n;
+
+ /* 32 children, for key chunks */
+ uint8 chunks[32];
+} RT_NODE_BASE_32;
+
+/*
+ * node-125 uses slot_idx array, an array of RT_NODE_MAX_SLOTS length, typically
+ * 256, to store indexes into a second array that contains up to 125 values (or
+ * child pointers in inner nodes).
+ */
+typedef struct RT_NODE_BASE_125
+{
+ RT_NODE n;
+
+	/* The slot index for each chunk; RT_NODE_125_INVALID_IDX means unused */
+ uint8 slot_idxs[RT_NODE_MAX_SLOTS];
+
+ /* isset is a bitmap to track which slot is in use */
+ bitmapword isset[BM_IDX(128)];
+} RT_NODE_BASE_125;
+
+typedef struct RT_NODE_BASE_256
+{
+ RT_NODE n;
+} RT_NODE_BASE_256;
+
+/*
+ * Inner and leaf nodes.
+ *
+ * These are separate for two main reasons:
+ *
+ * 1) the value type might be different than something fitting into a pointer
+ * width type
+ * 2) Need to represent non-existing values in a key-type independent way.
+ *
+ * 1) is clearly worth being concerned about, but it's not clear 2) is as
+ * strong a reason. It might be better to just indicate non-existing entries
+ * the same way in inner nodes.
+ */
+typedef struct RT_NODE_INNER_4
+{
+ RT_NODE_BASE_4 base;
+
+ /* number of children depends on size class */
+ RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_INNER_4;
+
+typedef struct RT_NODE_LEAF_4
+{
+ RT_NODE_BASE_4 base;
+
+ /* number of values depends on size class */
+ uint64 values[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_LEAF_4;
+
+typedef struct RT_NODE_INNER_32
+{
+ RT_NODE_BASE_32 base;
+
+ /* number of children depends on size class */
+ RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_INNER_32;
+
+typedef struct RT_NODE_LEAF_32
+{
+ RT_NODE_BASE_32 base;
+
+ /* number of values depends on size class */
+ uint64 values[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_LEAF_32;
+
+typedef struct RT_NODE_INNER_125
+{
+ RT_NODE_BASE_125 base;
+
+ /* number of children depends on size class */
+ RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_INNER_125;
+
+typedef struct RT_NODE_LEAF_125
+{
+ RT_NODE_BASE_125 base;
+
+ /* number of values depends on size class */
+ uint64 values[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_LEAF_125;
+
+/*
+ * node-256 is the largest node type. This node has an array of RT_NODE_MAX_SLOTS
+ * entries for directly storing values (or child pointers in inner nodes).
+ */
+typedef struct RT_NODE_INNER_256
+{
+ RT_NODE_BASE_256 base;
+
+ /* Slots for 256 children */
+ RT_PTR_ALLOC children[RT_NODE_MAX_SLOTS];
+} RT_NODE_INNER_256;
+
+typedef struct RT_NODE_LEAF_256
+{
+ RT_NODE_BASE_256 base;
+
+ /* isset is a bitmap to track which slot is in use */
+ bitmapword isset[BM_IDX(RT_NODE_MAX_SLOTS)];
+
+ /* Slots for 256 values */
+ uint64 values[RT_NODE_MAX_SLOTS];
+} RT_NODE_LEAF_256;
+
+/* Information for each size class */
+typedef struct RT_SIZE_CLASS_ELEM
+{
+ const char *name;
+ int fanout;
+
+ /* slab chunk size */
+ Size inner_size;
+ Size leaf_size;
+
+ /* slab block size */
+ Size inner_blocksize;
+ Size leaf_blocksize;
+} RT_SIZE_CLASS_ELEM;
+
+/*
+ * Calculate the slab blocksize so that we can allocate at least 32 chunks
+ * from the block.
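+ *
+ * For example, assuming SLAB_DEFAULT_BLOCK_SIZE is the usual 8kB and taking an
+ * illustrative node size of 300 bytes, this yields
+ * Max((8192 / 300) * 300, 300 * 32) = Max(8100, 9600) = 9600 bytes.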
+ */
+#define NODE_SLAB_BLOCK_SIZE(size) \
+ Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * (size), (size) * 32)
+
+static const RT_SIZE_CLASS_ELEM RT_SIZE_CLASS_INFO[] = {
+ [RT_CLASS_4_FULL] = {
+ .name = "radix tree node 4",
+ .fanout = 4,
+ .inner_size = sizeof(RT_NODE_INNER_4) + 4 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_4) + 4 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_4) + 4 * sizeof(RT_PTR_ALLOC)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_4) + 4 * sizeof(uint64)),
+ },
+ [RT_CLASS_32_PARTIAL] = {
+ .name = "radix tree node 15",
+ .fanout = 15,
+ .inner_size = sizeof(RT_NODE_INNER_32) + 15 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_32) + 15 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_32) + 15 * sizeof(RT_PTR_ALLOC)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_32) + 15 * sizeof(uint64)),
+ },
+ [RT_CLASS_32_FULL] = {
+ .name = "radix tree node 32",
+ .fanout = 32,
+ .inner_size = sizeof(RT_NODE_INNER_32) + 32 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_32) + 32 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_32) + 32 * sizeof(RT_PTR_ALLOC)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_32) + 32 * sizeof(uint64)),
+ },
+ [RT_CLASS_125_FULL] = {
+ .name = "radix tree node 125",
+ .fanout = 125,
+ .inner_size = sizeof(RT_NODE_INNER_125) + 125 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_125) + 125 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_125) + 125 * sizeof(RT_PTR_ALLOC)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_125) + 125 * sizeof(uint64)),
+ },
+ [RT_CLASS_256] = {
+ .name = "radix tree node 256",
+ .fanout = 256,
+ .inner_size = sizeof(RT_NODE_INNER_256),
+ .leaf_size = sizeof(RT_NODE_LEAF_256),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_256)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_256)),
+ },
+};
+
+#define RT_SIZE_CLASS_COUNT lengthof(RT_SIZE_CLASS_INFO)
+
+/* Map from the node kind to its minimum size class */
+static const RT_SIZE_CLASS RT_KIND_MIN_SIZE_CLASS[RT_NODE_KIND_COUNT] = {
+ [RT_NODE_KIND_4] = RT_CLASS_4_FULL,
+ [RT_NODE_KIND_32] = RT_CLASS_32_PARTIAL,
+ [RT_NODE_KIND_125] = RT_CLASS_125_FULL,
+ [RT_NODE_KIND_256] = RT_CLASS_256,
+};
+
+#ifdef RT_SHMEM
+/* A magic value used to identify our radix tree */
+#define RT_RADIX_TREE_MAGIC 0x54A48167
+#endif
+
+/* Control data for a radix tree, stored in DSA when RT_SHMEM is defined */
+typedef struct RT_RADIX_TREE_CONTROL
+{
+#ifdef RT_SHMEM
+ RT_HANDLE handle;
+ uint32 magic;
+#endif
+
+ RT_PTR_ALLOC root;
+ uint64 max_val;
+ uint64 num_keys;
+
+ /* statistics */
+#ifdef RT_DEBUG
+ int32 cnt[RT_SIZE_CLASS_COUNT];
+#endif
+} RT_RADIX_TREE_CONTROL;
+
+/* A radix tree: per-backend state plus a pointer to the control data */
+typedef struct RT_RADIX_TREE
+{
+ MemoryContext context;
+
+ /* pointing to either local memory or DSA */
+ RT_RADIX_TREE_CONTROL *ctl;
+
+#ifdef RT_SHMEM
+ dsa_area *dsa;
+#else
+ MemoryContextData *inner_slabs[RT_SIZE_CLASS_COUNT];
+ MemoryContextData *leaf_slabs[RT_SIZE_CLASS_COUNT];
+#endif
+} RT_RADIX_TREE;
+
+/*
+ * Iteration support.
+ *
+ * Iterating over the radix tree returns each key-value pair in ascending
+ * order of the key. To support this, we iterate over the nodes of each level.
+ *
+ * RT_NODE_ITER struct is used to track the iteration within a node.
+ *
+ * RT_ITER is the struct for iteration of the radix tree, and uses RT_NODE_ITER
+ * in order to track the iteration of each level. During the iteration, we also
+ * construct the key whenever updating the node iteration information, e.g., when
+ * advancing the current index within the node or when moving to the next node
+ * at the same level.
+ *
+ * XXX: Currently we allow only one process to iterate at a time. Therefore,
+ * RT_NODE_ITER holds local pointers to nodes, rather than RT_PTR_ALLOC.
+ * We need either a safeguard to disallow other processes from beginning an
+ * iteration while one is in progress, or support for multiple concurrent iterations.
+ */
+typedef struct RT_NODE_ITER
+{
+ RT_PTR_LOCAL node; /* current node being iterated */
+ int current_idx; /* current position. -1 for initial value */
+} RT_NODE_ITER;
+
+typedef struct RT_ITER
+{
+ RT_RADIX_TREE *tree;
+
+ /* Track the iteration on nodes of each level */
+ RT_NODE_ITER stack[RT_MAX_LEVEL];
+ int stack_len;
+
+ /* The key is being constructed during the iteration */
+ uint64 key;
+} RT_ITER;
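+
+/*
+ * For illustration only (the 'foo' prefix and process() below are hypothetical
+ * placeholders), a typical iteration loop would look like:
+ *
+ *	foo_iter   *iter = foo_begin_iterate(tree);
+ *	uint64		key;
+ *	uint64		value;
+ *
+ *	while (foo_iterate_next(iter, &key, &value))
+ *		process(key, value);	(pairs come back in ascending key order)
+ *	foo_end_iterate(iter);
+ */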
+
+
+static bool RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC nodep, RT_PTR_LOCAL node,
+ uint64 key, RT_PTR_ALLOC child);
+static bool RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC nodep, RT_PTR_LOCAL node,
+ uint64 key, uint64 value);
+
+/* verification (used only with assertions enabled) */
+static void RT_VERIFY_NODE(RT_PTR_LOCAL node);
+
+/* Get the local address of an allocated node */
+static inline RT_PTR_LOCAL
+RT_PTR_GET_LOCAL(RT_RADIX_TREE *tree, RT_PTR_ALLOC node)
+{
+#ifdef RT_SHMEM
+ return dsa_get_address(tree->dsa, (dsa_pointer) node);
+#else
+ return node;
+#endif
+}
+
+static inline bool
+RT_PTR_ALLOC_IS_VALID(RT_PTR_ALLOC ptr)
+{
+#ifdef RT_SHMEM
+ return DsaPointerIsValid(ptr);
+#else
+ return PointerIsValid(ptr);
+#endif
+}
+
+/*
+ * Return the index of the first chunk in the node that equals 'chunk'. Return -1
+ * if there is no such element.
+ */
+static inline int
+RT_NODE_4_SEARCH_EQ(RT_NODE_BASE_4 *node, uint8 chunk)
+{
+ int idx = -1;
+
+ for (int i = 0; i < node->n.count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ idx = i;
+ break;
+ }
+ }
+
+ return idx;
+}
+
+/*
+ * Return index of the chunk to insert into chunks in the given node.
+ */
+static inline int
+RT_NODE_4_GET_INSERTPOS(RT_NODE_BASE_4 *node, uint8 chunk)
+{
+ int idx;
+
+ for (idx = 0; idx < node->n.count; idx++)
+ {
+ if (node->chunks[idx] >= chunk)
+ break;
+ }
+
+ return idx;
+}
+
+/*
+ * Return the index of the first chunk in the node that equals 'chunk'. Return -1
+ * if there is no such element.
+ */
+static inline int
+RT_NODE_32_SEARCH_EQ(RT_NODE_BASE_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ uint32 bitfield;
+ int index_simd = -1;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index = -1;
+
+ for (int i = 0; i < count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ index = i;
+ break;
+ }
+ }
+#endif
+
+#ifndef USE_NO_SIMD
+ spread_chunk = vector8_broadcast(chunk);
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ cmp1 = vector8_eq(spread_chunk, haystack1);
+ cmp2 = vector8_eq(spread_chunk, haystack2);
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+/*
+ * Return index of the chunk to insert into chunks in the given node.
+ */
+static inline int
+RT_NODE_32_GET_INSERTPOS(RT_NODE_BASE_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ Vector8 min1;
+ Vector8 min2;
+ uint32 bitfield;
+ int index_simd;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index;
+
+ for (index = 0; index < count; index++)
+ {
+ if (node->chunks[index] >= chunk)
+ break;
+ }
+#endif
+
+#ifndef USE_NO_SIMD
+ spread_chunk = vector8_broadcast(chunk);
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ min1 = vector8_min(spread_chunk, haystack1);
+ min2 = vector8_min(spread_chunk, haystack2);
+ cmp1 = vector8_eq(spread_chunk, min1);
+ cmp2 = vector8_eq(spread_chunk, min2);
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+ else
+ index_simd = count;
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+/*
+ * Functions to manipulate both chunks array and children/values array.
+ * These are used for node-4 and node-32.
+ */
+
+/* Shift the elements right at 'idx' by one */
+static inline void
+RT_CHUNK_CHILDREN_ARRAY_SHIFT(uint8 *chunks, RT_PTR_ALLOC *children, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+ memmove(&(children[idx + 1]), &(children[idx]), sizeof(RT_PTR_ALLOC) * (count - idx));
+}
+
+static inline void
+RT_CHUNK_VALUES_ARRAY_SHIFT(uint8 *chunks, uint64 *values, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+	memmove(&(values[idx + 1]), &(values[idx]), sizeof(uint64) * (count - idx));
+}
+
+/* Delete the element at 'idx' */
+static inline void
+RT_CHUNK_CHILDREN_ARRAY_DELETE(uint8 *chunks, RT_PTR_ALLOC *children, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(children[idx]), &(children[idx + 1]), sizeof(RT_PTR_ALLOC) * (count - idx - 1));
+}
+
+static inline void
+RT_CHUNK_VALUES_ARRAY_DELETE(uint8 *chunks, uint64 *values, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(values[idx]), &(values[idx + 1]), sizeof(uint64) * (count - idx - 1));
+}
+
+/* Copy both chunks and children/values arrays */
+static inline void
+RT_CHUNK_CHILDREN_ARRAY_COPY(uint8 *src_chunks, RT_PTR_ALLOC *src_children,
+ uint8 *dst_chunks, RT_PTR_ALLOC *dst_children)
+{
+ const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_4_FULL].fanout;
+ const Size chunk_size = sizeof(uint8) * fanout;
+ const Size children_size = sizeof(RT_PTR_ALLOC) * fanout;
+
+ memcpy(dst_chunks, src_chunks, chunk_size);
+ memcpy(dst_children, src_children, children_size);
+}
+
+static inline void
+RT_CHUNK_VALUES_ARRAY_COPY(uint8 *src_chunks, uint64 *src_values,
+ uint8 *dst_chunks, uint64 *dst_values)
+{
+ const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_4_FULL].fanout;
+ const Size chunk_size = sizeof(uint8) * fanout;
+ const Size values_size = sizeof(uint64) * fanout;
+
+ memcpy(dst_chunks, src_chunks, chunk_size);
+ memcpy(dst_values, src_values, values_size);
+}
+
+/* Functions to manipulate inner and leaf node-125 */
+
+/* Does the given chunk in the node have a value? */
+static inline bool
+RT_NODE_125_IS_CHUNK_USED(RT_NODE_BASE_125 *node, uint8 chunk)
+{
+ return node->slot_idxs[chunk] != RT_NODE_125_INVALID_IDX;
+}
+
+static inline RT_PTR_ALLOC
+RT_NODE_INNER_125_GET_CHILD(RT_NODE_INNER_125 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ return node->children[node->base.slot_idxs[chunk]];
+}
+
+static inline uint64
+RT_NODE_LEAF_125_GET_VALUE(RT_NODE_LEAF_125 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ Assert(((RT_NODE_BASE_125 *) node)->slot_idxs[chunk] != RT_NODE_125_INVALID_IDX);
+ return node->values[node->base.slot_idxs[chunk]];
+}
+
+/* Functions to manipulate inner and leaf node-256 */
+
+/* Return true if the slot corresponding to the given chunk is in use */
+static inline bool
+RT_NODE_INNER_256_IS_CHUNK_USED(RT_NODE_INNER_256 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ return node->children[chunk] != RT_INVALID_PTR_ALLOC;
+}
+
+static inline bool
+RT_NODE_LEAF_256_IS_CHUNK_USED(RT_NODE_LEAF_256 *node, uint8 chunk)
+{
+ int idx = BM_IDX(chunk);
+ int bitnum = BM_BIT(chunk);
+
+ Assert(NODE_IS_LEAF(node));
+ return (node->isset[idx] & ((bitmapword) 1 << bitnum)) != 0;
+}
+
+static inline RT_PTR_ALLOC
+RT_NODE_INNER_256_GET_CHILD(RT_NODE_INNER_256 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ Assert(RT_NODE_INNER_256_IS_CHUNK_USED(node, chunk));
+ return node->children[chunk];
+}
+
+static inline uint64
+RT_NODE_LEAF_256_GET_VALUE(RT_NODE_LEAF_256 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_LEAF_256_IS_CHUNK_USED(node, chunk));
+ return node->values[chunk];
+}
+
+/* Set the child in the node-256 */
+static inline void
+RT_NODE_INNER_256_SET(RT_NODE_INNER_256 *node, uint8 chunk, RT_PTR_ALLOC child)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->children[chunk] = child;
+}
+
+/* Set the value in the node-256 */
+static inline void
+RT_NODE_LEAF_256_SET(RT_NODE_LEAF_256 *node, uint8 chunk, uint64 value)
+{
+ int idx = BM_IDX(chunk);
+ int bitnum = BM_BIT(chunk);
+
+ Assert(NODE_IS_LEAF(node));
+ node->isset[idx] |= ((bitmapword) 1 << bitnum);
+ node->values[chunk] = value;
+}
+
+/* Delete the child at the given chunk position */
+static inline void
+RT_NODE_INNER_256_DELETE(RT_NODE_INNER_256 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->children[chunk] = RT_INVALID_PTR_ALLOC;
+}
+
+static inline void
+RT_NODE_LEAF_256_DELETE(RT_NODE_LEAF_256 *node, uint8 chunk)
+{
+ int idx = BM_IDX(chunk);
+ int bitnum = BM_BIT(chunk);
+
+ Assert(NODE_IS_LEAF(node));
+ node->isset[idx] &= ~((bitmapword) 1 << bitnum);
+}
+
+/*
+ * Return the shift needed to store the given key.
+ */
+static inline int
+RT_KEY_GET_SHIFT(uint64 key)
+{
+ return (key == 0)
+ ? 0
+ : (pg_leftmost_one_pos64(key) / RT_NODE_SPAN) * RT_NODE_SPAN;
+}
+
+/*
+ * Return the max value stored in a node with the given shift.
+ */
+static uint64
+RT_SHIFT_GET_MAX_VAL(int shift)
+{
+ if (shift == RT_MAX_SHIFT)
+ return UINT64_MAX;
+
+ return (UINT64CONST(1) << (shift + RT_NODE_SPAN)) - 1;
+}
+
+/*
+ * Allocate a new node of the given size class.
+ */
+static RT_PTR_ALLOC
+RT_ALLOC_NODE(RT_RADIX_TREE *tree, RT_SIZE_CLASS size_class, bool inner)
+{
+ RT_PTR_ALLOC allocnode;
+ size_t allocsize;
+
+ if (inner)
+ allocsize = RT_SIZE_CLASS_INFO[size_class].inner_size;
+ else
+ allocsize = RT_SIZE_CLASS_INFO[size_class].leaf_size;
+
+#ifdef RT_SHMEM
+ allocnode = dsa_allocate(tree->dsa, allocsize);
+#else
+ if (inner)
+ allocnode = (RT_PTR_ALLOC) MemoryContextAlloc(tree->inner_slabs[size_class],
+ allocsize);
+ else
+ allocnode = (RT_PTR_ALLOC) MemoryContextAlloc(tree->leaf_slabs[size_class],
+ allocsize);
+#endif
+
+#ifdef RT_DEBUG
+ /* update the statistics */
+ tree->ctl->cnt[size_class]++;
+#endif
+
+ return allocnode;
+}
+
+/* Initialize the node contents */
+static inline void
+RT_INIT_NODE(RT_PTR_LOCAL node, uint8 kind, RT_SIZE_CLASS size_class, bool inner)
+{
+ if (inner)
+ MemSet(node, 0, RT_SIZE_CLASS_INFO[size_class].inner_size);
+ else
+ MemSet(node, 0, RT_SIZE_CLASS_INFO[size_class].leaf_size);
+
+ node->kind = kind;
+
+ if (kind == RT_NODE_KIND_256)
+ /* See comment for the RT_NODE type */
+ Assert(node->fanout == 0);
+ else
+ node->fanout = RT_SIZE_CLASS_INFO[size_class].fanout;
+
+ /* Initialize slot_idxs to invalid values */
+ if (kind == RT_NODE_KIND_125)
+ {
+ RT_NODE_BASE_125 *n125 = (RT_NODE_BASE_125 *) node;
+
+ memset(n125->slot_idxs, RT_NODE_125_INVALID_IDX, sizeof(n125->slot_idxs));
+ }
+}
+
+/*
+ * Create a new node as the root. Subordinate nodes will be created during
+ * the insertion.
+ */
+static void
+RT_NEW_ROOT(RT_RADIX_TREE *tree, uint64 key)
+{
+ int shift = RT_KEY_GET_SHIFT(key);
+ bool inner = shift > 0;
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+
+ allocnode = RT_ALLOC_NODE(tree, RT_CLASS_4_FULL, inner);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(newnode, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
+ newnode->shift = shift;
+ tree->ctl->max_val = RT_SHIFT_GET_MAX_VAL(shift);
+ tree->ctl->root = allocnode;
+}
+
+static inline void
+RT_COPY_NODE(RT_PTR_LOCAL newnode, RT_PTR_LOCAL oldnode)
+{
+ newnode->shift = oldnode->shift;
+ newnode->count = oldnode->count;
+}
+#if 0
+/*
+ * Create a new node with 'new_kind' and the same shift, chunk, and
+ * count of 'node'.
+ */
+static RT_NODE*
+RT_GROW_NODE_KIND(RT_RADIX_TREE *tree, RT_PTR_LOCAL node, uint8 new_kind)
+{
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+ bool inner = !NODE_IS_LEAF(node);
+
+ allocnode = RT_ALLOC_NODE(tree, RT_KIND_MIN_SIZE_CLASS[new_kind], inner);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(newnode, new_kind, RT_KIND_MIN_SIZE_CLASS[new_kind], inner);
+ RT_COPY_NODE(newnode, node);
+
+ return newnode;
+}
+#endif
+/* Free the given node */
+static void
+RT_FREE_NODE(RT_RADIX_TREE *tree, RT_PTR_ALLOC allocnode)
+{
+ /* If we're deleting the root node, make the tree empty */
+ if (tree->ctl->root == allocnode)
+ {
+ tree->ctl->root = RT_INVALID_PTR_ALLOC;
+ tree->ctl->max_val = 0;
+ }
+
+#ifdef RT_DEBUG
+ {
+ int i;
+ RT_PTR_LOCAL node = RT_PTR_GET_LOCAL(tree, allocnode);
+
+ /* update the statistics */
+ for (i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ if (node->fanout == RT_SIZE_CLASS_INFO[i].fanout)
+ break;
+ }
+
+ /* fanout of node256 is intentionally 0 */
+ if (i == RT_SIZE_CLASS_COUNT)
+ i = RT_CLASS_256;
+
+ tree->ctl->cnt[i]--;
+ Assert(tree->ctl->cnt[i] >= 0);
+ }
+#endif
+
+#ifdef RT_SHMEM
+ dsa_free(tree->dsa, allocnode);
+#else
+ pfree(allocnode);
+#endif
+}
+
+static inline void
+RT_NODE_UPDATE_INNER(RT_PTR_LOCAL node, uint64 key, RT_PTR_ALLOC new_child)
+{
+#define RT_ACTION_UPDATE
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_INNER
+#undef RT_ACTION_UPDATE
+}
+
+/*
+ * Replace old_child with new_child, and free the old one.
+ */
+static void
+RT_REPLACE_NODE(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC old_child,
+ RT_PTR_ALLOC new_child, uint64 key)
+{
+ RT_PTR_LOCAL old = RT_PTR_GET_LOCAL(tree, old_child);
+
+#ifdef USE_ASSERT_CHECKING
+ RT_PTR_LOCAL new = RT_PTR_GET_LOCAL(tree, new_child);
+
+ Assert(old->shift == new->shift);
+#endif
+
+ if (parent == old)
+ {
+ /* Replace the root node with the new large node */
+ tree->ctl->root = new_child;
+ }
+ else
+ RT_NODE_UPDATE_INNER(parent, key, new_child);
+
+ RT_FREE_NODE(tree, old_child);
+}
+
+/*
+ * The radix tree doesn't have sufficient height. Extend the radix tree so it can
+ * store the key.
+ */
+static void
+RT_EXTEND(RT_RADIX_TREE *tree, uint64 key)
+{
+ int target_shift;
+ RT_PTR_LOCAL root = RT_PTR_GET_LOCAL(tree, tree->ctl->root);
+ int shift = root->shift + RT_NODE_SPAN;
+
+ target_shift = RT_KEY_GET_SHIFT(key);
+
+ /* Grow tree from 'shift' to 'target_shift' */
+ while (shift <= target_shift)
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL node;
+ RT_NODE_INNER_4 *n4;
+
+ allocnode = RT_ALLOC_NODE(tree, RT_CLASS_4_FULL, true);
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(node, RT_NODE_KIND_4, RT_CLASS_4_FULL, true);
+ node->shift = shift;
+ node->count = 1;
+
+ n4 = (RT_NODE_INNER_4 *) node;
+ n4->base.chunks[0] = 0;
+ n4->children[0] = tree->ctl->root;
+
+ /* Update the root */
+ tree->ctl->root = allocnode;
+
+ shift += RT_NODE_SPAN;
+ }
+
+ tree->ctl->max_val = RT_SHIFT_GET_MAX_VAL(target_shift);
+}
+
+/*
+ * The radix tree doesn't have the inner and leaf nodes needed for the given
+ * key-value pair. Insert inner and leaf nodes from 'node' down to the bottom.
+ */
+static inline void
+RT_SET_EXTEND(RT_RADIX_TREE *tree, uint64 key, uint64 value, RT_PTR_LOCAL parent,
+ RT_PTR_ALLOC nodep, RT_PTR_LOCAL node)
+{
+ int shift = node->shift;
+
+ Assert(RT_PTR_GET_LOCAL(tree, nodep) == node);
+
+ while (shift >= RT_NODE_SPAN)
+ {
+ RT_PTR_ALLOC allocchild;
+ RT_PTR_LOCAL newchild;
+ int newshift = shift - RT_NODE_SPAN;
+ bool inner = newshift > 0;
+
+ allocchild = RT_ALLOC_NODE(tree, RT_CLASS_4_FULL, inner);
+ newchild = RT_PTR_GET_LOCAL(tree, allocchild);
+ RT_INIT_NODE(newchild, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
+ newchild->shift = newshift;
+ RT_NODE_INSERT_INNER(tree, parent, nodep, node, key, allocchild);
+
+ parent = node;
+ node = newchild;
+ nodep = allocchild;
+ shift -= RT_NODE_SPAN;
+ }
+
+ RT_NODE_INSERT_LEAF(tree, parent, nodep, node, key, value);
+ tree->ctl->num_keys++;
+}
+
+/*
+ * Search for the child pointer corresponding to 'key' in the given node.
+ *
+ * Return true if the key is found, otherwise return false. On success, the child
+ * pointer is stored in *child_p.
+ */
+static inline bool
+RT_NODE_SEARCH_INNER(RT_PTR_LOCAL node, uint64 key, RT_PTR_ALLOC *child_p)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/*
+ * Search for the value corresponding to 'key' in the given node.
+ *
+ * Return true if the key is found, otherwise return false. On success, the
+ * value is stored in *value_p.
+ */
+static inline bool
+RT_NODE_SEARCH_LEAF(RT_PTR_LOCAL node, uint64 key, uint64 *value_p)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
+/*
+ * Delete the child pointer corresponding to 'key' in the given node.
+ *
+ * Return true if the key is found and deleted, otherwise return false.
+ */
+static inline bool
+RT_NODE_DELETE_INNER(RT_PTR_LOCAL node, uint64 key)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_delete_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/*
+ * Delete the value corresponding to 'key' in the given leaf node.
+ *
+ * Return true if the key is found and deleted, otherwise return false.
+ */
+static inline bool
+RT_NODE_DELETE_LEAF(RT_PTR_LOCAL node, uint64 key)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_delete_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
+/* Insert the child to the inner node */
+static bool
+RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC nodep, RT_PTR_LOCAL node,
+ uint64 key, RT_PTR_ALLOC child)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_insert_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/* Insert the value to the leaf node */
+static bool
+RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC nodep, RT_PTR_LOCAL node,
+ uint64 key, uint64 value)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_insert_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
+/*
+ * Create the radix tree in the given memory context and return it.
+ */
+RT_SCOPE RT_RADIX_TREE *
+#ifdef RT_SHMEM
+RT_CREATE(MemoryContext ctx, dsa_area *dsa)
+#else
+RT_CREATE(MemoryContext ctx)
+#endif
+{
+ RT_RADIX_TREE *tree;
+ MemoryContext old_ctx;
+#ifdef RT_SHMEM
+ dsa_pointer dp;
+#endif
+
+ old_ctx = MemoryContextSwitchTo(ctx);
+
+ tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE));
+ tree->context = ctx;
+
+#ifdef RT_SHMEM
+ tree->dsa = dsa;
+ dp = dsa_allocate0(dsa, sizeof(RT_RADIX_TREE_CONTROL));
+ tree->ctl = (RT_RADIX_TREE_CONTROL *) dsa_get_address(dsa, dp);
+ tree->ctl->handle = dp;
+ tree->ctl->magic = RT_RADIX_TREE_MAGIC;
+#else
+ tree->ctl = (RT_RADIX_TREE_CONTROL *) palloc0(sizeof(RT_RADIX_TREE_CONTROL));
+
+ /* Create the slab allocator for each size class */
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ tree->inner_slabs[i] = SlabContextCreate(ctx,
+ RT_SIZE_CLASS_INFO[i].name,
+ RT_SIZE_CLASS_INFO[i].inner_blocksize,
+ RT_SIZE_CLASS_INFO[i].inner_size);
+ tree->leaf_slabs[i] = SlabContextCreate(ctx,
+ RT_SIZE_CLASS_INFO[i].name,
+ RT_SIZE_CLASS_INFO[i].leaf_blocksize,
+ RT_SIZE_CLASS_INFO[i].leaf_size);
+ }
+#endif
+
+ tree->ctl->root = RT_INVALID_PTR_ALLOC;
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return tree;
+}
+
+#ifdef RT_SHMEM
+RT_SCOPE RT_RADIX_TREE *
+RT_ATTACH(dsa_area *dsa, RT_HANDLE handle)
+{
+ RT_RADIX_TREE *tree;
+ dsa_pointer control;
+
+ /* XXX: memory context support */
+ tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE));
+
+	/* Find the control object in shared memory */
+ control = handle;
+
+ tree->dsa = dsa;
+ tree->ctl = (RT_RADIX_TREE_CONTROL *) dsa_get_address(dsa, control);
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+
+ /* XXX: do we need to set a callback on exit to detach dsa? */
+
+ return tree;
+}
+
+RT_SCOPE void
+RT_DETACH(RT_RADIX_TREE *tree)
+{
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+ pfree(tree);
+}
+
+RT_SCOPE RT_HANDLE
+RT_GET_HANDLE(RT_RADIX_TREE *tree)
+{
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+ return tree->ctl->handle;
+}
+#endif
+
+/*
+ * Free the given radix tree.
+ */
+RT_SCOPE void
+RT_FREE(RT_RADIX_TREE *tree)
+{
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+
+ /*
+	 * Vandalize the control block to help catch programming errors where
+ * other backends access the memory formerly occupied by this radix tree.
+ */
+ tree->ctl->magic = 0;
+ dsa_free(tree->dsa, tree->ctl->handle); // XXX
+ //dsa_detach(tree->dsa);
+#else
+ pfree(tree->ctl);
+
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ MemoryContextDelete(tree->inner_slabs[i]);
+ MemoryContextDelete(tree->leaf_slabs[i]);
+ }
+#endif
+
+ pfree(tree);
+}
+
+/*
+ * Set key to value. If the entry already exists, we update its value to 'value'
+ * and return true. Returns false if the entry doesn't yet exist.
+ */
+RT_SCOPE bool
+RT_SET(RT_RADIX_TREE *tree, uint64 key, uint64 value)
+{
+ int shift;
+ bool updated;
+ RT_PTR_LOCAL parent;
+ RT_PTR_ALLOC nodep;
+ RT_PTR_LOCAL node;
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+
+ /* Empty tree, create the root */
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ RT_NEW_ROOT(tree, key);
+
+ /* Extend the tree if necessary */
+ if (key > tree->ctl->max_val)
+ RT_EXTEND(tree, key);
+
+ //Assert(tree->ctl->root);
+
+ nodep = tree->ctl->root;
+ parent = RT_PTR_GET_LOCAL(tree, nodep);
+ shift = parent->shift;
+
+ /* Descend the tree until a leaf node */
+ while (shift >= 0)
+ {
+ RT_PTR_ALLOC child;
+
+ node = RT_PTR_GET_LOCAL(tree, nodep);
+
+ if (NODE_IS_LEAF(node))
+ break;
+
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ {
+ RT_SET_EXTEND(tree, key, value, parent, nodep, node);
+ return false;
+ }
+
+ parent = node;
+ nodep = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ updated = RT_NODE_INSERT_LEAF(tree, parent, nodep, node, key, value);
+
+ /* Update the statistics */
+ if (!updated)
+ tree->ctl->num_keys++;
+
+ return updated;
+}
+
+/*
+ * Search for the given key in the radix tree. Return true if the key is found,
+ * otherwise return false. On success, the value is set in *value_p, so it must
+ * not be NULL.
+ */
+RT_SCOPE bool
+RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, uint64 *value_p)
+{
+ RT_PTR_LOCAL node;
+ int shift;
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+ Assert(value_p != NULL);
+
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root) || key > tree->ctl->max_val)
+ return false;
+
+ node = RT_PTR_GET_LOCAL(tree, tree->ctl->root);
+ shift = node->shift;
+
+ /* Descend the tree until a leaf node */
+ while (shift >= 0)
+ {
+ RT_PTR_ALLOC child;
+
+ if (NODE_IS_LEAF(node))
+ break;
+
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ return false;
+
+ node = RT_PTR_GET_LOCAL(tree, child);
+ shift -= RT_NODE_SPAN;
+ }
+
+ return RT_NODE_SEARCH_LEAF(node, key, value_p);
+}
+
+/*
+ * Delete the given key from the radix tree. Return true if the key is found (and
+ * deleted), otherwise do nothing and return false.
+ */
+RT_SCOPE bool
+RT_DELETE(RT_RADIX_TREE *tree, uint64 key)
+{
+ RT_PTR_LOCAL node;
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_ALLOC stack[RT_MAX_LEVEL] = {0};
+ int shift;
+ int level;
+ bool deleted;
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root) || key > tree->ctl->max_val)
+ return false;
+
+ /*
+ * Descend the tree to search the key while building a stack of nodes we
+ * visited.
+ */
+ allocnode = tree->ctl->root;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ shift = node->shift;
+ level = -1;
+ while (shift > 0)
+ {
+ RT_PTR_ALLOC child;
+
+ /* Push the current node to the stack */
+ stack[++level] = allocnode;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ return false;
+
+ allocnode = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+	/* Delete the key from the leaf node if it exists */
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ deleted = RT_NODE_DELETE_LEAF(node, key);
+
+ if (!deleted)
+ {
+ /* no key is found in the leaf node */
+ return false;
+ }
+
+ /* Found the key to delete. Update the statistics */
+ tree->ctl->num_keys--;
+
+ /*
+ * Return if the leaf node still has keys and we don't need to delete the
+ * node.
+ */
+ if (!NODE_IS_EMPTY(node))
+ return true;
+
+ /* Free the empty leaf node */
+ RT_FREE_NODE(tree, allocnode);
+
+ /* Delete the key in inner nodes recursively */
+ while (level >= 0)
+ {
+ allocnode = stack[level--];
+
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ deleted = RT_NODE_DELETE_INNER(node, key);
+ Assert(deleted);
+
+ /* If the node didn't become empty, we stop deleting the key */
+ if (!NODE_IS_EMPTY(node))
+ break;
+
+ /* The node became empty */
+ RT_FREE_NODE(tree, allocnode);
+ }
+
+ return true;
+}
+
+static inline void
+RT_ITER_UPDATE_KEY(RT_ITER *iter, uint8 chunk, uint8 shift)
+{
+ iter->key &= ~(((uint64) RT_CHUNK_MASK) << shift);
+ iter->key |= (((uint64) chunk) << shift);
+}
+
+/*
+ * Advance the slot in the inner node. Return the child if one exists,
+ * otherwise NULL.
+ */
+static inline RT_PTR_LOCAL
+RT_NODE_INNER_ITERATE_NEXT(RT_ITER *iter, RT_NODE_ITER *node_iter)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_iter_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/*
+ * Advance the slot in the leaf node. On success, return true and set the
+ * value in *value_p, otherwise return false.
+ */
+static inline bool
+RT_NODE_LEAF_ITERATE_NEXT(RT_ITER *iter, RT_NODE_ITER *node_iter,
+ uint64 *value_p)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_iter_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
+/*
+ * Update each node_iter for inner nodes in the iterator node stack.
+ */
+static void
+RT_UPDATE_ITER_STACK(RT_ITER *iter, RT_PTR_LOCAL from_node, int from)
+{
+ int level = from;
+ RT_PTR_LOCAL node = from_node;
+
+ for (;;)
+ {
+ RT_NODE_ITER *node_iter = &(iter->stack[level--]);
+
+ node_iter->node = node;
+ node_iter->current_idx = -1;
+
+ /* We don't advance the leaf node iterator here */
+ if (NODE_IS_LEAF(node))
+ return;
+
+ /* Advance to the next slot in the inner node */
+ node = RT_NODE_INNER_ITERATE_NEXT(iter, node_iter);
+
+		/* We must find the first child in the node */
+ Assert(node);
+ }
+}
+
+/* Create and return the iterator for the given radix tree */
+RT_SCOPE RT_ITER *
+RT_BEGIN_ITERATE(RT_RADIX_TREE *tree)
+{
+ MemoryContext old_ctx;
+ RT_ITER *iter;
+ RT_PTR_LOCAL root;
+ int top_level;
+
+ old_ctx = MemoryContextSwitchTo(tree->context);
+
+ iter = (RT_ITER *) palloc0(sizeof(RT_ITER));
+ iter->tree = tree;
+
+ /* empty tree */
+ if (!iter->tree->ctl->root)
+ return iter;
+
+ root = RT_PTR_GET_LOCAL(tree, iter->tree->ctl->root);
+ top_level = root->shift / RT_NODE_SPAN;
+ iter->stack_len = top_level;
+
+ /*
+	 * Descend to the leftmost leaf node from the root. The key is constructed
+	 * while descending to the leaf.
+ */
+ RT_UPDATE_ITER_STACK(iter, root, top_level);
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return iter;
+}
+
+/*
+ * Return true, setting *key_p and *value_p, if there is a next key. Otherwise
+ * return false.
+ */
+RT_SCOPE bool
+RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, uint64 *value_p)
+{
+ /* Empty tree */
+ if (!iter->tree->ctl->root)
+ return false;
+
+ for (;;)
+ {
+ RT_PTR_LOCAL child = NULL;
+ uint64 value;
+ int level;
+ bool found;
+
+ /* Advance the leaf node iterator to get next key-value pair */
+ found = RT_NODE_LEAF_ITERATE_NEXT(iter, &(iter->stack[0]), &value);
+
+ if (found)
+ {
+ *key_p = iter->key;
+ *value_p = value;
+ return true;
+ }
+
+ /*
+		 * We've visited all values in the leaf node, so advance inner node
+		 * iterators from level 1 upward until we find the next child node.
+ */
+ for (level = 1; level <= iter->stack_len; level++)
+ {
+ child = RT_NODE_INNER_ITERATE_NEXT(iter, &(iter->stack[level]));
+
+ if (child)
+ break;
+ }
+
+ /* the iteration finished */
+ if (!child)
+ return false;
+
+ /*
+ * Set the node to the node iterator and update the iterator stack
+ * from this node.
+ */
+ RT_UPDATE_ITER_STACK(iter, child, level - 1);
+
+ /* Node iterators are updated, so try again from the leaf */
+ }
+
+ return false;
+}
+
+RT_SCOPE void
+RT_END_ITERATE(RT_ITER *iter)
+{
+ pfree(iter);
+}
+
+/*
+ * Return the number of keys in the radix tree.
+ */
+RT_SCOPE uint64
+RT_NUM_ENTRIES(RT_RADIX_TREE *tree)
+{
+ return tree->ctl->num_keys;
+}
+
+/*
+ * Return the amount of memory used by the radix tree.
+ */
+RT_SCOPE uint64
+RT_MEMORY_USAGE(RT_RADIX_TREE *tree)
+{
+ // XXX is this necessary?
+ Size total = sizeof(RT_RADIX_TREE);
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+ total = dsa_get_total_size(tree->dsa);
+#else
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
+ total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
+ }
+#endif
+
+ return total;
+}
+
+/*
+ * Verify the radix tree node.
+ */
+static void
+RT_VERIFY_NODE(RT_PTR_LOCAL node)
+{
+#ifdef USE_ASSERT_CHECKING
+ Assert(node->count >= 0);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ RT_NODE_BASE_4 *n4 = (RT_NODE_BASE_4 *) node;
+
+ for (int i = 1; i < n4->n.count; i++)
+ Assert(n4->chunks[i - 1] < n4->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE_BASE_32 *n32 = (RT_NODE_BASE_32 *) node;
+
+ for (int i = 1; i < n32->n.count; i++)
+ Assert(n32->chunks[i - 1] < n32->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE_BASE_125 *n125 = (RT_NODE_BASE_125 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ uint8 slot = n125->slot_idxs[i];
+ int idx = BM_IDX(slot);
+ int bitnum = BM_BIT(slot);
+
+ if (!RT_NODE_125_IS_CHUNK_USED(n125, i))
+ continue;
+
+ /* Check if the corresponding slot is used */
+ Assert(slot < node->fanout);
+ Assert((n125->isset[idx] & ((bitmapword) 1 << bitnum)) != 0);
+
+ cnt++;
+ }
+
+ Assert(n125->n.count == cnt);
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_256 *n256 = (RT_NODE_LEAF_256 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < BM_IDX(RT_NODE_MAX_SLOTS); i++)
+ cnt += bmw_popcount(n256->isset[i]);
+
+					/* Check that the number of used chunks matches the count */
+ Assert(n256->base.n.count == cnt);
+
+ break;
+ }
+ }
+ }
+#endif
+}
+
+/***************** DEBUG FUNCTIONS *****************/
+#ifdef RT_DEBUG
+RT_SCOPE void
+RT_STATS(RT_RADIX_TREE *tree)
+{
+ ereport(NOTICE, (errmsg("num_keys = " UINT64_FORMAT ", height = %d, n4 = %u, n15 = %u, n32 = %u, n125 = %u, n256 = %u",
+ tree->ctl->num_keys,
+ tree->ctl->root->shift / RT_NODE_SPAN,
+ tree->ctl->cnt[RT_CLASS_4_FULL],
+ tree->ctl->cnt[RT_CLASS_32_PARTIAL],
+ tree->ctl->cnt[RT_CLASS_32_FULL],
+ tree->ctl->cnt[RT_CLASS_125_FULL],
+ tree->ctl->cnt[RT_CLASS_256])));
+}
+
+static void
+rt_dump_node(RT_PTR_LOCAL node, int level, bool recurse)
+{
+ char space[125] = {0};
+
+ fprintf(stderr, "[%s] kind %d, fanout %d, count %u, shift %u:\n",
+ NODE_IS_LEAF(node) ? "LEAF" : "INNR",
+ (node->kind == RT_NODE_KIND_4) ? 4 :
+ (node->kind == RT_NODE_KIND_32) ? 32 :
+ (node->kind == RT_NODE_KIND_125) ? 125 : 256,
+ node->fanout == 0 ? 256 : node->fanout,
+ node->count, node->shift);
+
+ if (level > 0)
+ sprintf(space, "%*c", level * 4, ' ');
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_4 *n4 = (RT_NODE_LEAF_4 *) node;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, n4->base.chunks[i], n4->values[i]);
+ }
+ else
+ {
+ RT_NODE_INNER_4 *n4 = (RT_NODE_INNER_4 *) node;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, n4->base.chunks[i]);
+
+ if (recurse)
+ rt_dump_node(n4->children[i], level + 1, recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_32 *n32 = (RT_NODE_LEAF_32 *) node;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, n32->base.chunks[i], n32->values[i]);
+ }
+ else
+ {
+ RT_NODE_INNER_32 *n32 = (RT_NODE_INNER_32 *) node;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, n32->base.chunks[i]);
+
+ if (recurse)
+ {
+ rt_dump_node(n32->children[i], level + 1, recurse);
+ }
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE_BASE_125 *b125 = (RT_NODE_BASE_125 *) node;
+
+ fprintf(stderr, "slot_idxs ");
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(b125, i))
+ continue;
+
+ fprintf(stderr, " [%d]=%d, ", i, b125->slot_idxs[i]);
+ }
+ if (NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_125 *n = (RT_NODE_LEAF_125 *) node;
+
+ fprintf(stderr, ", isset-bitmap:");
+ for (int i = 0; i < BM_IDX(128); i++)
+ {
+ fprintf(stderr, UINT64_FORMAT_HEX " ", (uint64) n->base.isset[i]);
+ }
+ fprintf(stderr, "\n");
+ }
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(b125, i))
+ continue;
+
+ if (NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_125 *n125 = (RT_NODE_LEAF_125 *) b125;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, i, RT_NODE_LEAF_125_GET_VALUE(n125, i));
+ }
+ else
+ {
+ RT_NODE_INNER_125 *n125 = (RT_NODE_INNER_125 *) b125;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, i);
+
+ if (recurse)
+ rt_dump_node(RT_NODE_INNER_125_GET_CHILD(n125, i),
+ level + 1, recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_256 *n256 = (RT_NODE_LEAF_256 *) node;
+
+ if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, i))
+ continue;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, i, RT_NODE_LEAF_256_GET_VALUE(n256, i));
+ }
+ else
+ {
+ RT_NODE_INNER_256 *n256 = (RT_NODE_INNER_256 *) node;
+
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
+ continue;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, i);
+
+ if (recurse)
+ rt_dump_node(RT_NODE_INNER_256_GET_CHILD(n256, i), level + 1,
+ recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ }
+}
+
+RT_SCOPE void
+RT_DUMP_SEARCH(RT_RADIX_TREE *tree, uint64 key)
+{
+ RT_PTR_LOCAL node;
+ int shift;
+ int level = 0;
+
+ elog(NOTICE, "-----------------------------------------------------------");
+ elog(NOTICE, "max_val = " UINT64_FORMAT "(0x" UINT64_FORMAT_HEX ")",
+ tree->ctl->max_val, tree->ctl->max_val);
+
+ if (!tree->ctl->root)
+ {
+ elog(NOTICE, "tree is empty");
+ return;
+ }
+
+ if (key > tree->ctl->max_val)
+ {
+ elog(NOTICE, "key " UINT64_FORMAT "(0x" UINT64_FORMAT_HEX ") is larger than max val",
+ key, key);
+ return;
+ }
+
+ node = tree->ctl->root;
+ shift = tree->ctl->root->shift;
+ while (shift >= 0)
+ {
+ RT_PTR_LOCAL child;
+
+ rt_dump_node(node, level, false);
+
+ if (NODE_IS_LEAF(node))
+ {
+ uint64 dummy;
+
+			/* We reached a leaf node, find the corresponding slot */
+ RT_NODE_SEARCH_LEAF(node, key, &dummy);
+
+ break;
+ }
+
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ break;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ level++;
+ }
+}
+
+RT_SCOPE void
+RT_DUMP(RT_RADIX_TREE *tree)
+{
+
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ fprintf(stderr, "%s\tinner_size %zu\tinner_blocksize %zu\tleaf_size %zu\tleaf_blocksize %zu\n",
+ RT_SIZE_CLASS_INFO[i].name,
+ RT_SIZE_CLASS_INFO[i].inner_size,
+ RT_SIZE_CLASS_INFO[i].inner_blocksize,
+ RT_SIZE_CLASS_INFO[i].leaf_size,
+ RT_SIZE_CLASS_INFO[i].leaf_blocksize);
+ fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->ctl->max_val);
+
+ if (!tree->ctl->root)
+ {
+ fprintf(stderr, "empty tree\n");
+ return;
+ }
+
+ rt_dump_node(tree->ctl->root, 0, true);
+}
+#endif
+
+#endif /* RT_DEFINE */
+
+
+/* undefine external parameters, so next radix tree can be defined */
+#undef RT_PREFIX
+#undef RT_SCOPE
+#undef RT_DECLARE
+#undef RT_DEFINE
+
+/* locally declared macros */
+#undef NODE_IS_LEAF
+#undef NODE_IS_EMPTY
+#undef VAR_NODE_HAS_FREE_SLOT
+#undef FIXED_NODE_HAS_FREE_SLOT
+#undef RT_SIZE_CLASS_COUNT
+#undef RT_RADIX_TREE_MAGIC
+
+/* type declarations */
+#undef RT_RADIX_TREE
+#undef RT_RADIX_TREE_CONTROL
+#undef RT_PTR_ALLOC
+#undef RT_INVALID_PTR_ALLOC
+#undef RT_HANDLE
+#undef RT_ITER
+#undef RT_NODE
+#undef RT_NODE_ITER
+#undef RT_NODE_BASE_4
+#undef RT_NODE_BASE_32
+#undef RT_NODE_BASE_125
+#undef RT_NODE_BASE_256
+#undef RT_NODE_INNER_4
+#undef RT_NODE_INNER_32
+#undef RT_NODE_INNER_125
+#undef RT_NODE_INNER_256
+#undef RT_NODE_LEAF_4
+#undef RT_NODE_LEAF_32
+#undef RT_NODE_LEAF_125
+#undef RT_NODE_LEAF_256
+#undef RT_SIZE_CLASS
+#undef RT_SIZE_CLASS_ELEM
+#undef RT_SIZE_CLASS_INFO
+#undef RT_CLASS_4_FULL
+#undef RT_CLASS_32_PARTIAL
+#undef RT_CLASS_32_FULL
+#undef RT_CLASS_125_FULL
+#undef RT_CLASS_256
+#undef RT_KIND_MIN_SIZE_CLASS
+
+/* function declarations */
+#undef RT_CREATE
+#undef RT_FREE
+#undef RT_ATTACH
+#undef RT_DETACH
+#undef RT_GET_HANDLE
+#undef RT_SET
+#undef RT_BEGIN_ITERATE
+#undef RT_ITERATE_NEXT
+#undef RT_END_ITERATE
+#undef RT_DELETE
+#undef RT_MEMORY_USAGE
+#undef RT_NUM_ENTRIES
+#undef RT_DUMP
+#undef RT_DUMP_SEARCH
+#undef RT_STATS
+
+/* internal helper functions */
+#undef RT_NEW_ROOT
+#undef RT_ALLOC_NODE
+#undef RT_INIT_NODE
+#undef RT_FREE_NODE
+#undef RT_EXTEND
+#undef RT_SET_EXTEND
+#undef RT_GROW_NODE_KIND
+#undef RT_COPY_NODE
+#undef RT_REPLACE_NODE
+#undef RT_PTR_GET_LOCAL
+#undef RT_PTR_ALLOC_IS_VALID
+#undef RT_NODE_4_SEARCH_EQ
+#undef RT_NODE_32_SEARCH_EQ
+#undef RT_NODE_4_GET_INSERTPOS
+#undef RT_NODE_32_GET_INSERTPOS
+#undef RT_CHUNK_CHILDREN_ARRAY_SHIFT
+#undef RT_CHUNK_VALUES_ARRAY_SHIFT
+#undef RT_CHUNK_CHILDREN_ARRAY_DELETE
+#undef RT_CHUNK_VALUES_ARRAY_DELETE
+#undef RT_CHUNK_CHILDREN_ARRAY_COPY
+#undef RT_CHUNK_VALUES_ARRAY_COPY
+#undef RT_NODE_125_IS_CHUNK_USED
+#undef RT_NODE_INNER_125_GET_CHILD
+#undef RT_NODE_LEAF_125_GET_VALUE
+#undef RT_NODE_INNER_256_IS_CHUNK_USED
+#undef RT_NODE_LEAF_256_IS_CHUNK_USED
+#undef RT_NODE_INNER_256_GET_CHILD
+#undef RT_NODE_LEAF_256_GET_VALUE
+#undef RT_NODE_INNER_256_SET
+#undef RT_NODE_LEAF_256_SET
+#undef RT_NODE_INNER_256_DELETE
+#undef RT_NODE_LEAF_256_DELETE
+#undef RT_KEY_GET_SHIFT
+#undef RT_SHIFT_GET_MAX_VAL
+#undef RT_NODE_SEARCH_INNER
+#undef RT_NODE_SEARCH_LEAF
+#undef RT_NODE_UPDATE_INNER
+#undef RT_NODE_DELETE_INNER
+#undef RT_NODE_DELETE_LEAF
+#undef RT_NODE_INSERT_INNER
+#undef RT_NODE_INSERT_LEAF
+#undef RT_NODE_INNER_ITERATE_NEXT
+#undef RT_NODE_LEAF_ITERATE_NEXT
+#undef RT_UPDATE_ITER_STACK
+#undef RT_ITER_UPDATE_KEY
+#undef RT_VERIFY_NODE
+
+#undef RT_DEBUG
diff --git a/src/include/lib/radixtree_delete_impl.h b/src/include/lib/radixtree_delete_impl.h
new file mode 100644
index 0000000000..eb87866b90
--- /dev/null
+++ b/src/include/lib/radixtree_delete_impl.h
@@ -0,0 +1,106 @@
+/* TODO: shrink nodes */
+
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE4_TYPE RT_NODE_INNER_4
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE4_TYPE RT_NODE_LEAF_4
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+
+#ifdef RT_NODE_LEVEL_LEAF
+ Assert(NODE_IS_LEAF(node));
+#else
+ Assert(!NODE_IS_LEAF(node));
+#endif
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ RT_NODE4_TYPE *n4 = (RT_NODE4_TYPE *) node;
+ int idx = RT_NODE_4_SEARCH_EQ((RT_NODE_BASE_4 *) n4, chunk);
+
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_DELETE(n4->base.chunks, (uint64 *) n4->values,
+ n4->base.n.count, idx);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_DELETE(n4->base.chunks, n4->children,
+ n4->base.n.count, idx);
+#endif
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
+ int idx = RT_NODE_32_SEARCH_EQ((RT_NODE_BASE_32 *) n32, chunk);
+
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_DELETE(n32->base.chunks, (uint64 *) n32->values,
+ n32->base.n.count, idx);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_DELETE(n32->base.chunks, n32->children,
+ n32->base.n.count, idx);
+#endif
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+ int slotpos = n125->base.slot_idxs[chunk];
+ int idx;
+ int bitnum;
+
+ if (slotpos == RT_NODE_125_INVALID_IDX)
+ return false;
+
+ idx = BM_IDX(slotpos);
+ bitnum = BM_BIT(slotpos);
+ n125->base.isset[idx] &= ~((bitmapword) 1 << bitnum);
+ n125->base.slot_idxs[chunk] = RT_NODE_125_INVALID_IDX;
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk))
+#else
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, chunk))
+#endif
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_NODE_LEAF_256_DELETE(n256, chunk);
+#else
+ RT_NODE_INNER_256_DELETE(n256, chunk);
+#endif
+ break;
+ }
+ }
+
+ /* update statistics */
+ node->count--;
+
+ return true;
+
+#undef RT_NODE4_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_insert_impl.h b/src/include/lib/radixtree_insert_impl.h
new file mode 100644
index 0000000000..e4faf54d9d
--- /dev/null
+++ b/src/include/lib/radixtree_insert_impl.h
@@ -0,0 +1,316 @@
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE4_TYPE RT_NODE_INNER_4
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE4_TYPE RT_NODE_LEAF_4
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool chunk_exists = false;
+ RT_PTR_LOCAL newnode = NULL;
+ RT_PTR_ALLOC allocnode;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ const bool inner = false;
+ Assert(NODE_IS_LEAF(node));
+#else
+ const bool inner = true;
+ Assert(!NODE_IS_LEAF(node));
+#endif
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ RT_NODE4_TYPE *n4 = (RT_NODE4_TYPE *) node;
+ int idx;
+
+ idx = RT_NODE_4_SEARCH_EQ(&n4->base, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+#ifdef RT_NODE_LEVEL_LEAF
+ n4->values[idx] = value;
+#else
+ n4->children[idx] = child;
+#endif
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n4)))
+ {
+ RT_NODE32_TYPE *new32;
+ const uint8 new_kind = RT_NODE_KIND_32;
+ const RT_SIZE_CLASS new_class = RT_KIND_MIN_SIZE_CLASS[new_kind];
+
+ /* grow node from 4 to 32 */
+ allocnode = RT_ALLOC_NODE(tree, new_class, inner);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(newnode, new_kind, new_class, inner);
+ RT_COPY_NODE(newnode, node);
+ //newnode = RT_GROW_NODE_KIND(tree, node, RT_NODE_KIND_32);
+ new32 = (RT_NODE32_TYPE *) newnode;
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_COPY(n4->base.chunks, n4->values,
+ new32->base.chunks, new32->values);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_COPY(n4->base.chunks, n4->children,
+ new32->base.chunks, new32->children);
+#endif
+ Assert(parent != NULL);
+ RT_REPLACE_NODE(tree, parent, nodep, allocnode, key);
+ node = newnode;
+ }
+ else
+ {
+ int insertpos = RT_NODE_4_GET_INSERTPOS(&n4->base, chunk);
+ int count = n4->base.n.count;
+
+ /* shift chunks and children */
+ if (insertpos < count)
+ {
+ Assert(count > 0);
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_SHIFT(n4->base.chunks, n4->values,
+ count, insertpos);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_SHIFT(n4->base.chunks, n4->children,
+ count, insertpos);
+#endif
+ }
+
+ n4->base.chunks[insertpos] = chunk;
+#ifdef RT_NODE_LEVEL_LEAF
+ n4->values[insertpos] = value;
+#else
+ n4->children[insertpos] = child;
+#endif
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_32:
+ {
+ const RT_SIZE_CLASS_ELEM class32_min = RT_SIZE_CLASS_INFO[RT_CLASS_32_PARTIAL];
+ const RT_SIZE_CLASS_ELEM class32_max = RT_SIZE_CLASS_INFO[RT_CLASS_32_FULL];
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
+ int idx;
+
+ idx = RT_NODE_32_SEARCH_EQ(&n32->base, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+#ifdef RT_NODE_LEVEL_LEAF
+ n32->values[idx] = value;
+#else
+ n32->children[idx] = child;
+#endif
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)) &&
+ n32->base.n.fanout == class32_min.fanout)
+ {
+ /* grow to the next size class of this kind */
+ const RT_SIZE_CLASS new_class = RT_CLASS_32_FULL;
+
+ allocnode = RT_ALLOC_NODE(tree, new_class, inner);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+#ifdef RT_NODE_LEVEL_LEAF
+ memcpy(newnode, node, class32_min.leaf_size);
+#else
+ memcpy(newnode, node, class32_min.inner_size);
+#endif
+ newnode->fanout = class32_max.fanout;
+
+ Assert(parent != NULL);
+ RT_REPLACE_NODE(tree, parent, nodep, allocnode, key);
+ node = newnode;
+
+ /* also update pointer for this kind */
+ n32 = (RT_NODE32_TYPE *) newnode;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)))
+ {
+ RT_NODE125_TYPE *new125;
+ const uint8 new_kind = RT_NODE_KIND_125;
+ const RT_SIZE_CLASS new_class = RT_KIND_MIN_SIZE_CLASS[new_kind];
+
+ Assert(n32->base.n.fanout == class32_max.fanout);
+
+ /* grow node from 32 to 125 */
+ allocnode = RT_ALLOC_NODE(tree, new_class, inner);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(newnode, new_kind, new_class, inner);
+ RT_COPY_NODE(newnode, node);
+ //newnode = RT_GROW_NODE_KIND(tree, node, RT_NODE_KIND_125);
+ new125 = (RT_NODE125_TYPE *) newnode;
+
+ for (int i = 0; i < class32_max.fanout; i++)
+ {
+ new125->base.slot_idxs[n32->base.chunks[i]] = i;
+#ifdef RT_NODE_LEVEL_LEAF
+ new125->values[i] = n32->values[i];
+#else
+ new125->children[i] = n32->children[i];
+#endif
+ }
+
+ Assert(class32_max.fanout <= sizeof(bitmapword) * BITS_PER_BYTE);
+ new125->base.isset[0] = (bitmapword) (((uint64) 1 << class32_max.fanout) - 1);
+
+ Assert(parent != NULL);
+ RT_REPLACE_NODE(tree, parent, nodep, allocnode, key);
+ node = newnode;
+ }
+ else
+ {
+ int insertpos = RT_NODE_32_GET_INSERTPOS(&n32->base, chunk);
+ int count = n32->base.n.count;
+
+ if (insertpos < count)
+ {
+ Assert(count > 0);
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_SHIFT(n32->base.chunks, n32->values,
+ count, insertpos);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_SHIFT(n32->base.chunks, n32->children,
+ count, insertpos);
+#endif
+ }
+
+ n32->base.chunks[insertpos] = chunk;
+#ifdef RT_NODE_LEVEL_LEAF
+ n32->values[insertpos] = value;
+#else
+ n32->children[insertpos] = child;
+#endif
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+ int slotpos = n125->base.slot_idxs[chunk];
+ int cnt = 0;
+
+ if (slotpos != RT_NODE_125_INVALID_IDX)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+#ifdef RT_NODE_LEVEL_LEAF
+ n125->values[slotpos] = value;
+#else
+ n125->children[slotpos] = child;
+#endif
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n125)))
+ {
+ RT_NODE256_TYPE *new256;
+ const uint8 new_kind = RT_NODE_KIND_256;
+ const RT_SIZE_CLASS new_class = RT_KIND_MIN_SIZE_CLASS[new_kind];
+
+ /* grow node from 125 to 256 */
+ allocnode = RT_ALLOC_NODE(tree, new_class, inner);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(newnode, new_kind, new_class, inner);
+ RT_COPY_NODE(newnode, node);
+ //newnode = RT_GROW_NODE_KIND(tree, node, RT_NODE_KIND_256);
+ new256 = (RT_NODE256_TYPE *) newnode;
+ for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n125->base.n.count; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(&n125->base, i))
+ continue;
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_NODE_LEAF_256_SET(new256, i, RT_NODE_LEAF_125_GET_VALUE(n125, i));
+#else
+ RT_NODE_INNER_256_SET(new256, i, RT_NODE_INNER_125_GET_CHILD(n125, i));
+#endif
+ cnt++;
+ }
+
+ Assert(parent != NULL);
+ RT_REPLACE_NODE(tree, parent, nodep, allocnode, key);
+ node = newnode;
+ }
+ else
+ {
+ int idx;
+ bitmapword inverse;
+
+ /* get the first word with at least one bit not set */
+ for (idx = 0; idx < BM_IDX(128); idx++)
+ {
+ if (n125->base.isset[idx] < ~((bitmapword) 0))
+ break;
+ }
+
+ /* To get the first unset bit in X, get the first set bit in ~X */
+ inverse = ~(n125->base.isset[idx]);
+ slotpos = idx * BITS_PER_BITMAPWORD;
+ slotpos += bmw_rightmost_one_pos(inverse);
+ Assert(slotpos < node->fanout);
+
+ /* mark the slot used */
+ n125->base.isset[idx] |= bmw_rightmost_one(inverse);
+ n125->base.slot_idxs[chunk] = slotpos;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ n125->values[slotpos] = value;
+#else
+ n125->children[slotpos] = child;
+#endif
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ chunk_exists = RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk);
+#else
+ chunk_exists = RT_NODE_INNER_256_IS_CHUNK_USED(n256, chunk);
+#endif
+ Assert(chunk_exists || FIXED_NODE_HAS_FREE_SLOT(n256, RT_CLASS_256));
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_NODE_LEAF_256_SET(n256, chunk, value);
+#else
+ RT_NODE_INNER_256_SET(n256, chunk, child);
+#endif
+ break;
+ }
+ }
+
+ /* Update statistics */
+ if (!chunk_exists)
+ node->count++;
+
+ /*
+ * Done. Finally, verify that the chunk and value were inserted or
+ * replaced properly in the node.
+ */
+ RT_VERIFY_NODE(node);
+
+ return chunk_exists;
+
+#undef RT_NODE4_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_iter_impl.h b/src/include/lib/radixtree_iter_impl.h
new file mode 100644
index 0000000000..0b8b68df6c
--- /dev/null
+++ b/src/include/lib/radixtree_iter_impl.h
@@ -0,0 +1,138 @@
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE4_TYPE RT_NODE_INNER_4
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE4_TYPE RT_NODE_LEAF_4
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ bool found = false;
+ uint8 key_chunk;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ uint64 value;
+
+ Assert(NODE_IS_LEAF(node_iter->node));
+#else
+ RT_PTR_LOCAL child = NULL;
+
+ Assert(!NODE_IS_LEAF(node_iter->node));
+#endif
+
+#ifdef RT_SHMEM
+ Assert(iter->tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+
+ switch (node_iter->node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ RT_NODE4_TYPE *n4 = (RT_NODE4_TYPE *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n4->base.n.count)
+ break;
+#ifdef RT_NODE_LEVEL_LEAF
+ value = n4->values[node_iter->current_idx];
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, n4->children[node_iter->current_idx]);
+#endif
+ key_chunk = n4->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n32->base.n.count)
+ break;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ value = n32->values[node_iter->current_idx];
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, n32->children[node_iter->current_idx]);
+#endif
+ key_chunk = n32->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (RT_NODE_125_IS_CHUNK_USED((RT_NODE_BASE_125 *) n125, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+#ifdef RT_NODE_LEVEL_LEAF
+ value = RT_NODE_LEAF_125_GET_VALUE(n125, i);
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, RT_NODE_INNER_125_GET_CHILD(n125, i));
+#endif
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+#ifdef RT_NODE_LEVEL_LEAF
+ if (RT_NODE_LEAF_256_IS_CHUNK_USED(n256, i))
+#else
+ if (RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
+#endif
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+#ifdef RT_NODE_LEVEL_LEAF
+ value = RT_NODE_LEAF_256_GET_VALUE(n256, i);
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, RT_NODE_INNER_256_GET_CHILD(n256, i));
+#endif
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ }
+
+ if (found)
+ {
+ RT_ITER_UPDATE_KEY(iter, key_chunk, node_iter->node->shift);
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = value;
+#endif
+ }
+
+#ifdef RT_NODE_LEVEL_LEAF
+ return found;
+#else
+ return child;
+#endif
+
+#undef RT_NODE4_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_search_impl.h b/src/include/lib/radixtree_search_impl.h
new file mode 100644
index 0000000000..31e4978e4f
--- /dev/null
+++ b/src/include/lib/radixtree_search_impl.h
@@ -0,0 +1,131 @@
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE4_TYPE RT_NODE_INNER_4
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE4_TYPE RT_NODE_LEAF_4
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+
+#ifdef RT_NODE_LEVEL_LEAF
+ uint64 value = 0;
+
+ Assert(NODE_IS_LEAF(node));
+#else
+#ifndef RT_ACTION_UPDATE
+ RT_PTR_ALLOC child = RT_INVALID_PTR_ALLOC;
+#endif
+ Assert(!NODE_IS_LEAF(node));
+#endif
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ RT_NODE4_TYPE *n4 = (RT_NODE4_TYPE *) node;
+ int idx = RT_NODE_4_SEARCH_EQ((RT_NODE_BASE_4 *) n4, chunk);
+
+#ifdef RT_ACTION_UPDATE
+ Assert(idx >= 0);
+ n4->children[idx] = new_child;
+#else
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ value = n4->values[idx];
+#else
+ child = n4->children[idx];
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
+ int idx = RT_NODE_32_SEARCH_EQ((RT_NODE_BASE_32 *) n32, chunk);
+
+#ifdef RT_ACTION_UPDATE
+ Assert(idx >= 0);
+ n32->children[idx] = new_child;
+#else
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ value = n32->values[idx];
+#else
+ child = n32->children[idx];
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+ int slotpos = n125->base.slot_idxs[chunk];
+
+#ifdef RT_ACTION_UPDATE
+ Assert(slotpos != RT_NODE_125_INVALID_IDX);
+ n125->children[slotpos] = new_child;
+#else
+ if (slotpos == RT_NODE_125_INVALID_IDX)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ value = RT_NODE_LEAF_125_GET_VALUE(n125, chunk);
+#else
+ child = RT_NODE_INNER_125_GET_CHILD(n125, chunk);
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
+
+#ifdef RT_ACTION_UPDATE
+ RT_NODE_INNER_256_SET(n256, chunk, new_child);
+#else
+#ifdef RT_NODE_LEVEL_LEAF
+ if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk))
+#else
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, chunk))
+#endif
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ value = RT_NODE_LEAF_256_GET_VALUE(n256, chunk);
+#else
+ child = RT_NODE_INNER_256_GET_CHILD(n256, chunk);
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ }
+
+#ifdef RT_ACTION_UPDATE
+ return;
+#else
+#ifdef RT_NODE_LEVEL_LEAF
+ Assert(value_p != NULL);
+ *value_p = value;
+#else
+ Assert(child_p != NULL);
+ *child_p = child;
+#endif
+
+ return true;
+#endif /* RT_ACTION_UPDATE */
+
+#undef RT_NODE4_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/utils/dsa.h b/src/include/utils/dsa.h
index 104386e674..c67f936880 100644
--- a/src/include/utils/dsa.h
+++ b/src/include/utils/dsa.h
@@ -117,6 +117,7 @@ extern dsa_handle dsa_get_handle(dsa_area *area);
extern dsa_pointer dsa_allocate_extended(dsa_area *area, size_t size, int flags);
extern void dsa_free(dsa_area *area, dsa_pointer dp);
extern void *dsa_get_address(dsa_area *area, dsa_pointer dp);
+extern size_t dsa_get_total_size(dsa_area *area);
extern void dsa_trim(dsa_area *area);
extern void dsa_dump(dsa_area *area);
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index c629cbe383..9659eb85d7 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -28,6 +28,7 @@ SUBDIRS = \
test_pg_db_role_setting \
test_pg_dump \
test_predtest \
+ test_radixtree \
test_rbtree \
test_regex \
test_rls_hooks \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 1baa6b558d..232cbdac80 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -24,6 +24,7 @@ subdir('test_parser')
subdir('test_pg_db_role_setting')
subdir('test_pg_dump')
subdir('test_predtest')
+subdir('test_radixtree')
subdir('test_rbtree')
subdir('test_regex')
subdir('test_rls_hooks')
diff --git a/src/test/modules/test_radixtree/.gitignore b/src/test/modules/test_radixtree/.gitignore
new file mode 100644
index 0000000000..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/test_radixtree/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_radixtree/Makefile b/src/test/modules/test_radixtree/Makefile
new file mode 100644
index 0000000000..da06b93da3
--- /dev/null
+++ b/src/test/modules/test_radixtree/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_radixtree/Makefile
+
+MODULE_big = test_radixtree
+OBJS = \
+ $(WIN32RES) \
+ test_radixtree.o
+PGFILEDESC = "test_radixtree - test code for src/backend/lib/radixtree.c"
+
+EXTENSION = test_radixtree
+DATA = test_radixtree--1.0.sql
+
+REGRESS = test_radixtree
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_radixtree
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_radixtree/README b/src/test/modules/test_radixtree/README
new file mode 100644
index 0000000000..a8b271869a
--- /dev/null
+++ b/src/test/modules/test_radixtree/README
@@ -0,0 +1,7 @@
+test_radixtree contains unit tests for testing the radix tree implementation
+in src/include/lib/radixtree.h.
+
+The tests verify the correctness of the implementation, but they can also be
+used as a micro-benchmark. If you set the 'rt_test_stats' flag in
+test_radixtree.c, the tests will print extra information about execution time
+and memory usage.
diff --git a/src/test/modules/test_radixtree/expected/test_radixtree.out b/src/test/modules/test_radixtree/expected/test_radixtree.out
new file mode 100644
index 0000000000..ce645cb8b5
--- /dev/null
+++ b/src/test/modules/test_radixtree/expected/test_radixtree.out
@@ -0,0 +1,36 @@
+CREATE EXTENSION test_radixtree;
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
+NOTICE: testing basic operations with leaf node 4
+NOTICE: testing basic operations with inner node 4
+NOTICE: testing basic operations with leaf node 32
+NOTICE: testing basic operations with inner node 32
+NOTICE: testing basic operations with leaf node 125
+NOTICE: testing basic operations with inner node 125
+NOTICE: testing basic operations with leaf node 256
+NOTICE: testing basic operations with inner node 256
+NOTICE: testing radix tree node types with shift "0"
+NOTICE: testing radix tree node types with shift "8"
+NOTICE: testing radix tree node types with shift "16"
+NOTICE: testing radix tree node types with shift "24"
+NOTICE: testing radix tree node types with shift "32"
+NOTICE: testing radix tree node types with shift "40"
+NOTICE: testing radix tree node types with shift "48"
+NOTICE: testing radix tree node types with shift "56"
+NOTICE: testing radix tree with pattern "all ones"
+NOTICE: testing radix tree with pattern "alternating bits"
+NOTICE: testing radix tree with pattern "clusters of ten"
+NOTICE: testing radix tree with pattern "clusters of hundred"
+NOTICE: testing radix tree with pattern "one-every-64k"
+NOTICE: testing radix tree with pattern "sparse"
+NOTICE: testing radix tree with pattern "single values, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^60"
+ test_radixtree
+----------------
+
+(1 row)
+
diff --git a/src/test/modules/test_radixtree/meson.build b/src/test/modules/test_radixtree/meson.build
new file mode 100644
index 0000000000..f96bf159d6
--- /dev/null
+++ b/src/test/modules/test_radixtree/meson.build
@@ -0,0 +1,34 @@
+# FIXME: prevent install during main install, but not during test :/
+
+test_radixtree_sources = files(
+ 'test_radixtree.c',
+)
+
+if host_system == 'windows'
+ test_radixtree_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'test_radixtree',
+ '--FILEDESC', 'test_radixtree - test code for src/backend/lib/radixtree.c',])
+endif
+
+test_radixtree = shared_module('test_radixtree',
+ test_radixtree_sources,
+ kwargs: pg_mod_args,
+)
+testprep_targets += test_radixtree
+
+install_data(
+ 'test_radixtree.control',
+ 'test_radixtree--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'test_radixtree',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'test_radixtree',
+ ],
+ },
+}
diff --git a/src/test/modules/test_radixtree/sql/test_radixtree.sql b/src/test/modules/test_radixtree/sql/test_radixtree.sql
new file mode 100644
index 0000000000..41ece5e9f5
--- /dev/null
+++ b/src/test/modules/test_radixtree/sql/test_radixtree.sql
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_radixtree;
+
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
diff --git a/src/test/modules/test_radixtree/test_radixtree--1.0.sql b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
new file mode 100644
index 0000000000..074a5a7ea7
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_radixtree/test_radixtree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_radixtree" to load this file. \quit
+
+CREATE FUNCTION test_radixtree()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
new file mode 100644
index 0000000000..61d842789d
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -0,0 +1,631 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_radixtree.c
+ * Test radixtree set data structure.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/test/modules/test_radixtree/test_radixtree.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "miscadmin.h"
+#include "nodes/bitmapset.h"
+#include "storage/block.h"
+#include "storage/itemptr.h"
+#include "storage/lwlock.h"
+#include "utils/memutils.h"
+#include "utils/timestamp.h"
+
+#define UINT64_HEX_FORMAT "%" INT64_MODIFIER "X"
+
+/*
+ * If you enable this, the "pattern" tests will print information about
+ * how long populating, probing, and iterating the test set takes, and
+ * how much memory the test set consumed. That can be used as a
+ * micro-benchmark of various operations and input patterns (if you
+ * do that, you might want to increase the number of values used in
+ * each of the tests, to reduce noise).
+ *
+ * The information is printed to the server's stderr, mostly because
+ * that's where MemoryContextStats() output goes.
+ */
+static const bool rt_test_stats = false;
+
+static int rt_node_kind_fanouts[] = {
+ 0,
+ 4, /* RT_NODE_KIND_4 */
+ 32, /* RT_NODE_KIND_32 */
+ 125, /* RT_NODE_KIND_125 */
+ 256 /* RT_NODE_KIND_256 */
+};
+/*
+ * A struct to define a pattern of integers, for use with the test_pattern()
+ * function.
+ */
+typedef struct
+{
+ char *test_name; /* short name of the test, for humans */
+ char *pattern_str; /* a bit pattern */
+ uint64 spacing; /* pattern repeats at this interval */
+ uint64 num_values; /* number of integers to set in total */
+} test_spec;
+
+/* Test patterns borrowed from test_integerset.c */
+static const test_spec test_specs[] = {
+ {
+ "all ones", "1111111111",
+ 10, 1000000
+ },
+ {
+ "alternating bits", "0101010101",
+ 10, 1000000
+ },
+ {
+ "clusters of ten", "1111111111",
+ 10000, 1000000
+ },
+ {
+ "clusters of hundred",
+ "1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111",
+ 10000, 1000000
+ },
+ {
+ "one-every-64k", "1",
+ 65536, 1000000
+ },
+ {
+ "sparse", "100000000000000000000000000000001",
+ 10000000, 1000000
+ },
+ {
+ "single values, distance > 2^32", "1",
+ UINT64CONST(10000000000), 100000
+ },
+ {
+ "clusters, distance > 2^32", "10101010",
+ UINT64CONST(10000000000), 1000000
+ },
+ {
+ "clusters, distance > 2^60", "10101010",
+ UINT64CONST(2000000000000000000),
+ 23 /* can't be much higher than this, or we
+ * overflow uint64 */
+ }
+};
+
+/* define the radix tree implementation to test */
+#define RT_PREFIX rt
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+// WIP: compiles with warnings because rt_attach is defined but not used
+// #define RT_SHMEM
+#include "lib/radixtree.h"
+
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(test_radixtree);
+
+static void
+test_empty(void)
+{
+ rt_radix_tree *radixtree;
+ rt_iter *iter;
+ uint64 dummy;
+ uint64 key;
+ uint64 val;
+
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+ dsa = dsa_create(tranche_id);
+
+ radixtree = rt_create(CurrentMemoryContext, dsa);
+#else
+ radixtree = rt_create(CurrentMemoryContext);
+#endif
+
+ if (rt_search(radixtree, 0, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, 1, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, PG_UINT64_MAX, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_delete(radixtree, 0))
+ elog(ERROR, "rt_delete on empty tree returned true");
+
+ if (rt_num_entries(radixtree) != 0)
+ elog(ERROR, "rt_num_entries on empty tree return non-zero");
+
+ iter = rt_begin_iterate(radixtree);
+
+ if (rt_iterate_next(iter, &key, &val))
+ elog(ERROR, "rt_itereate_next on empty tree returned true");
+
+ rt_end_iterate(iter);
+
+ rt_free(radixtree);
+}
+
+static void
+test_basic(int children, bool test_inner)
+{
+ rt_radix_tree *radixtree;
+ uint64 *keys;
+ int shift = test_inner ? 8 : 0;
+
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+ dsa = dsa_create(tranche_id);
+#endif
+
+ elog(NOTICE, "testing basic operations with %s node %d",
+ test_inner ? "inner" : "leaf", children);
+
+#ifdef RT_SHMEM
+ radixtree = rt_create(CurrentMemoryContext, dsa);
+#else
+ radixtree = rt_create(CurrentMemoryContext);
+#endif
+
+ /* prepare keys in interleaved order like 1, children, 2, children - 1, 3, ... */
+ keys = palloc(sizeof(uint64) * children);
+ for (int i = 0; i < children; i++)
+ {
+ if (i % 2 == 0)
+ keys[i] = (uint64) ((i / 2) + 1) << shift;
+ else
+ keys[i] = (uint64) (children - (i / 2)) << shift;
+ }
+
+ /* insert keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (rt_set(radixtree, keys[i], keys[i]))
+ elog(ERROR, "new inserted key 0x" UINT64_HEX_FORMAT " is found ", keys[i]);
+ }
+
+ /* update keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (!rt_set(radixtree, keys[i], keys[i] + 1))
+ elog(ERROR, "could not update key 0x" UINT64_HEX_FORMAT, keys[i]);
+ }
+
+ /* repeat deleting and inserting keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (!rt_delete(radixtree, keys[i]))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, keys[i]);
+ if (rt_set(radixtree, keys[i], keys[i]))
+ elog(ERROR, "new inserted key 0x" UINT64_HEX_FORMAT " is found ", keys[i]);
+ }
+
+ pfree(keys);
+ rt_free(radixtree);
+}
+
+/*
+ * Check if keys from start to end with the shift exist in the tree.
+ */
+static void
+check_search_on_node(rt_radix_tree *radixtree, uint8 shift, int start, int end,
+ int incr)
+{
+ for (int i = start; i < end; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ uint64 val;
+
+ if (!rt_search(radixtree, key, &val))
+ elog(ERROR, "key 0x" UINT64_HEX_FORMAT " is not found on node-%d",
+ key, end);
+ if (val != key)
+ elog(ERROR, "rt_search with key 0x" UINT64_HEX_FORMAT " returns 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ key, val, key);
+ }
+}
+
+static void
+test_node_types_insert(rt_radix_tree *radixtree, uint8 shift, bool insert_asc)
+{
+ uint64 num_entries;
+ int ninserted = 0;
+ int start = insert_asc ? 0 : 256;
+ int incr = insert_asc ? 1 : -1;
+ int end = insert_asc ? 256 : 0;
+ int node_kind_idx = 1;
+
+ for (int i = start; i != end; i += incr)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_set(radixtree, key, key);
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " is found", key);
+
+ /*
+ * After filling all slots in each node type, check if the values
+ * are stored properly.
+ */
+ if (ninserted == rt_node_kind_fanouts[node_kind_idx] - 1)
+ {
+ int check_start = insert_asc
+ ? rt_node_kind_fanouts[node_kind_idx - 1]
+ : rt_node_kind_fanouts[node_kind_idx];
+ int check_end = insert_asc
+ ? rt_node_kind_fanouts[node_kind_idx]
+ : rt_node_kind_fanouts[node_kind_idx - 1];
+
+ check_search_on_node(radixtree, shift, check_start, check_end, incr);
+ node_kind_idx++;
+ }
+
+ ninserted++;
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ if (num_entries != 256)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(256));
+}
+
+static void
+test_node_types_delete(rt_radix_tree *radixtree, uint8 shift)
+{
+ uint64 num_entries;
+
+ for (int i = 0; i < 256; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_delete(radixtree, key);
+
+ if (!found)
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, key);
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ /* The tree must be empty */
+ if (num_entries != 0)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(256));
+}
+
+/*
+ * Test for inserting and deleting key-value pairs to each node type at the given shift
+ * level.
+ */
+static void
+test_node_types(uint8 shift)
+{
+ rt_radix_tree *radixtree;
+
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+ dsa = dsa_create(tranche_id);
+#endif
+
+ elog(NOTICE, "testing radix tree node types with shift \"%d\"", shift);
+
+#ifdef RT_SHMEM
+ radixtree = rt_create(CurrentMemoryContext, dsa);
+#else
+ radixtree = rt_create(CurrentMemoryContext);
+#endif
+
+ /*
+ * Insert and search entries for every node type at the 'shift' level,
+ * then delete all entries to make it empty, and insert and search entries
+ * again.
+ */
+ test_node_types_insert(radixtree, shift, true);
+ test_node_types_delete(radixtree, shift);
+ test_node_types_insert(radixtree, shift, false);
+
+ rt_free(radixtree);
+}
+
+/*
+ * Test with a repeating pattern, defined by the 'spec'.
+ */
+static void
+test_pattern(const test_spec * spec)
+{
+ rt_radix_tree *radixtree;
+ rt_iter *iter;
+ MemoryContext radixtree_ctx;
+ TimestampTz starttime;
+ TimestampTz endtime;
+ uint64 n;
+ uint64 last_int;
+ uint64 ndeleted;
+ uint64 nbefore;
+ uint64 nafter;
+ int patternlen;
+ uint64 *pattern_values;
+ uint64 pattern_num_values;
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+ dsa = dsa_create(tranche_id);
+#endif
+
+ elog(NOTICE, "testing radix tree with pattern \"%s\"", spec->test_name);
+ if (rt_test_stats)
+ fprintf(stderr, "-----\ntesting radix tree with pattern \"%s\"\n", spec->test_name);
+
+ /* Pre-process the pattern, creating an array of integers from it. */
+ patternlen = strlen(spec->pattern_str);
+ pattern_values = palloc(patternlen * sizeof(uint64));
+ pattern_num_values = 0;
+ for (int i = 0; i < patternlen; i++)
+ {
+ if (spec->pattern_str[i] == '1')
+ pattern_values[pattern_num_values++] = i;
+ }
+
+ /*
+ * Allocate the radix tree.
+ *
+ * Allocate it in a separate memory context, so that we can print its
+ * memory usage easily.
+ */
+ radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
+ "radixtree test",
+ ALLOCSET_SMALL_SIZES);
+ MemoryContextSetIdentifier(radixtree_ctx, spec->test_name);
+
+#ifdef RT_SHMEM
+ radixtree = rt_create(radixtree_ctx, dsa);
+#else
+ radixtree = rt_create(radixtree_ctx);
+#endif
+
+
+ /*
+ * Add values to the set.
+ */
+ starttime = GetCurrentTimestamp();
+
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ uint64 x = 0;
+
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ bool found;
+
+ x = last_int + pattern_values[i];
+
+ found = rt_set(radixtree, x, x);
+
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " found", x);
+
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+
+ endtime = GetCurrentTimestamp();
+
+ if (rt_test_stats)
+ fprintf(stderr, "added " UINT64_FORMAT " values in %d ms\n",
+ spec->num_values, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Print stats on the amount of memory used.
+ *
+ * We print the usage reported by rt_memory_usage(), as well as the stats
+ * from the memory context. They should be in the same ballpark, but it's
+ * hard to automate testing that, so if you're making changes to the
+ * implementation, just observe that manually.
+ */
+ if (rt_test_stats)
+ {
+ uint64 mem_usage;
+
+ /*
+ * Also print memory usage as reported by rt_memory_usage(). It
+ * should be in the same ballpark as the usage reported by
+ * MemoryContextStats().
+ */
+ mem_usage = rt_memory_usage(radixtree);
+ fprintf(stderr, "rt_memory_usage() reported " UINT64_FORMAT " (%0.2f bytes / integer)\n",
+ mem_usage, (double) mem_usage / spec->num_values);
+
+ MemoryContextStats(radixtree_ctx);
+ }
+
+ /* Check that rt_num_entries works */
+ n = rt_num_entries(radixtree);
+ if (n != spec->num_values)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT, n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_search()
+ */
+ starttime = GetCurrentTimestamp();
+
+ for (n = 0; n < 100000; n++)
+ {
+ bool found;
+ bool expected;
+ uint64 x;
+ uint64 v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Do we expect this value to be present in the set? */
+ if (x >= last_int)
+ expected = false;
+ else
+ {
+ uint64 idx = x % spec->spacing;
+
+ if (idx >= patternlen)
+ expected = false;
+ else if (spec->pattern_str[idx] == '1')
+ expected = true;
+ else
+ expected = false;
+ }
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (found != expected)
+ elog(ERROR, "mismatch at 0x" UINT64_HEX_FORMAT ": %d vs %d", x, found, expected);
+ if (found && (v != x))
+ elog(ERROR, "found 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ v, x);
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "probed " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Test iterator
+ */
+ starttime = GetCurrentTimestamp();
+
+ iter = rt_begin_iterate(radixtree);
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ uint64 expected = last_int + pattern_values[i];
+ uint64 x;
+ uint64 val;
+
+ if (!rt_iterate_next(iter, &x, &val))
+ break;
+
+ if (x != expected)
+ elog(ERROR,
+ "iterate returned wrong key; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d",
+ x, expected, i);
+ if (val != expected)
+ elog(ERROR,
+ "iterate returned wrong value; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d", x, expected, i);
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "iterated " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ rt_end_iterate(iter);
+
+ if (n < spec->num_values)
+ elog(ERROR, "iterator stopped short after " UINT64_FORMAT " entries, expected " UINT64_FORMAT, n, spec->num_values);
+ if (n > spec->num_values)
+ elog(ERROR, "iterator returned " UINT64_FORMAT " entries, " UINT64_FORMAT " was expected", n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_delete()
+ */
+ starttime = GetCurrentTimestamp();
+
+ nbefore = rt_num_entries(radixtree);
+ ndeleted = 0;
+ for (n = 0; n < 1; n++)
+ {
+ bool found;
+ uint64 x;
+ uint64 v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (!found)
+ continue;
+
+ /* If the key is found, delete it and check again */
+ if (!rt_delete(radixtree, x))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_search(radixtree, x, &v))
+ elog(ERROR, "found deleted key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_delete(radixtree, x))
+ elog(ERROR, "deleted already-deleted key 0x" UINT64_HEX_FORMAT, x);
+
+ ndeleted++;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "deleted " UINT64_FORMAT " values in %d ms\n",
+ ndeleted, (int) (endtime - starttime) / 1000);
+
+ nafter = rt_num_entries(radixtree);
+
+ /* Check that rt_num_entries works */
+ if ((nbefore - ndeleted) != nafter)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT "after " UINT64_FORMAT " deletion",
+ nafter, (nbefore - ndeleted), ndeleted);
+
+ rt_free(radixtree);
+ MemoryContextDelete(radixtree_ctx);
+}
+
+Datum
+test_radixtree(PG_FUNCTION_ARGS)
+{
+ test_empty();
+
+ for (int i = 1; i < lengthof(rt_node_kind_fanouts); i++)
+ {
+ test_basic(rt_node_kind_fanouts[i], false);
+ test_basic(rt_node_kind_fanouts[i], true);
+ }
+
+ for (int shift = 0; shift <= (64 - 8); shift += 8)
+ test_node_types(shift);
+
+ /* Test different test patterns, with lots of entries */
+ for (int i = 0; i < lengthof(test_specs); i++)
+ test_pattern(&test_specs[i]);
+
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_radixtree/test_radixtree.control b/src/test/modules/test_radixtree/test_radixtree.control
new file mode 100644
index 0000000000..e53f2a3e0c
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.control
@@ -0,0 +1,4 @@
+comment = 'Test code for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/test_radixtree'
+relocatable = true
diff --git a/src/tools/pginclude/cpluspluscheck b/src/tools/pginclude/cpluspluscheck
index e52fe9f509..09fa6e7432 100755
--- a/src/tools/pginclude/cpluspluscheck
+++ b/src/tools/pginclude/cpluspluscheck
@@ -101,6 +101,12 @@ do
test "$f" = src/include/nodes/nodetags.h && continue
test "$f" = src/backend/nodes/nodetags.h && continue
+ # radixtree_*_impl.h cannot be included standalone: they are just code fragments.
+ test "$f" = src/include/lib/radixtree_delete_impl.h && continue
+ test "$f" = src/include/lib/radixtree_insert_impl.h && continue
+ test "$f" = src/include/lib/radixtree_iter_impl.h && continue
+ test "$f" = src/include/lib/radixtree_search_impl.h && continue
+
# These files are not meant to be included standalone, because
# they contain lists that might have multiple use-cases.
test "$f" = src/include/access/rmgrlist.h && continue
diff --git a/src/tools/pginclude/headerscheck b/src/tools/pginclude/headerscheck
index abbba7aa63..d4d2f1da03 100755
--- a/src/tools/pginclude/headerscheck
+++ b/src/tools/pginclude/headerscheck
@@ -96,6 +96,12 @@ do
test "$f" = src/include/nodes/nodetags.h && continue
test "$f" = src/backend/nodes/nodetags.h && continue
+ # radixtree_*_impl.h cannot be included standalone: they are just code fragments.
+ test "$f" = src/include/lib/radixtree_delete_impl.h && continue
+ test "$f" = src/include/lib/radixtree_insert_impl.h && continue
+ test "$f" = src/include/lib/radixtree_iter_impl.h && continue
+ test "$f" = src/include/lib/radixtree_search_impl.h && continue
+
# These files are not meant to be included standalone, because
# they contain lists that might have multiple use-cases.
test "$f" = src/include/access/rmgrlist.h && continue
--
2.39.0
v19-0009-Use-TIDStore-for-storing-dead-tuple-TID-during-l.patch (text/x-patch)
From ecc796c414a1dbd7c0c0df9bbcab0d922616b1ca Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Tue, 17 Jan 2023 17:20:37 +0700
Subject: [PATCH v19 9/9] Use TIDStore for storing dead tuple TID during lazy
vacuum
Previously, we used an array of ItemPointerData to store dead tuple
TIDs, which is neither space efficient nor fast to look up, and which
was limited to 1GB in size.
This commit switches to TIDStore for this purpose. Since TIDStore,
backed by the radix tree, allocates memory incrementally, we get rid
of the 1GB limit.
Since we can no longer estimate in advance exactly how many TIDs fit
into a given amount of memory, this commit also renames the progress
columns max_dead_tuples and num_dead_tuples to max_dead_tuple_bytes
and num_dead_tuple_bytes and reports the progress in bytes.
Furthermore, since TIDStore uses the radix tree internally, the
minimum amount of memory required by TIDStore is 1MB, the initial
DSA segment size. Because of that, this change increases the minimum
maintenance_work_mem from 1MB to 2MB.
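For readers skimming the patch, here is a minimal sketch (illustration only, not part of the patch) of the dead-item flow that lazy vacuum follows with the new TIDStore: collect offsets per block during the first heap pass, run index vacuuming once the store reports it is full, then iterate over the store for the second heap pass. The tidstore_* functions and the TidStoreIterResult fields are the ones used in the diff below; the wrapper function itself is hypothetical.

static void
dead_items_flow_sketch(TidStore *dead_items, BlockNumber blkno,
                       OffsetNumber *deadoffsets, int lpdead_items)
{
    /* First heap pass: remember this block's LP_DEAD offsets */
    tidstore_add_tids(dead_items, blkno, deadoffsets, lpdead_items);

    /* Index vacuuming starts once the store reports it is full */
    if (tidstore_is_full(dead_items))
    {
        TidStoreIter *iter;
        TidStoreIterResult *result;

        /* ... each index's ambulkdelete() probes dead_items here ... */

        /* Second heap pass: replay the collected TIDs block by block */
        iter = tidstore_begin_iterate(dead_items);
        while ((result = tidstore_iterate_next(iter)) != NULL)
        {
            /*
             * result->blkno and result->offsets[0 .. num_offsets - 1]
             * identify the LP_DEAD items to set LP_UNUSED.
             */
        }
        tidstore_end_iterate(iter);

        /* Forget the vacuumed TIDs before resuming the heap scan */
        tidstore_reset(dead_items);
    }
}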
---
doc/src/sgml/monitoring.sgml | 8 +-
src/backend/access/heap/vacuumlazy.c | 168 +++++++--------------
src/backend/catalog/system_views.sql | 2 +-
src/backend/commands/vacuum.c | 76 +---------
src/backend/commands/vacuumparallel.c | 64 +++++---
src/backend/storage/lmgr/lwlock.c | 2 +
src/backend/utils/misc/guc_tables.c | 2 +-
src/include/commands/progress.h | 4 +-
src/include/commands/vacuum.h | 25 +--
src/include/storage/lwlock.h | 1 +
src/test/regress/expected/cluster.out | 2 +-
src/test/regress/expected/create_index.out | 2 +-
src/test/regress/expected/rules.out | 4 +-
src/test/regress/sql/cluster.sql | 2 +-
src/test/regress/sql/create_index.sql | 2 +-
15 files changed, 122 insertions(+), 242 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 358d2ff90f..6ce7ea9e35 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -6840,10 +6840,10 @@ FROM pg_stat_get_backend_idset() AS backendid;
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>max_dead_tuples</structfield> <type>bigint</type>
+ <structfield>max_dead_tuple_bytes</structfield> <type>bigint</type>
</para>
<para>
- Number of dead tuples that we can store before needing to perform
+ Amount of dead tuple data that we can store before needing to perform
an index vacuum cycle, based on
<xref linkend="guc-maintenance-work-mem"/>.
</para></entry>
@@ -6851,10 +6851,10 @@ FROM pg_stat_get_backend_idset() AS backendid;
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>num_dead_tuples</structfield> <type>bigint</type>
+ <structfield>num_dead_tuple_bytes</structfield> <type>bigint</type>
</para>
<para>
- Number of dead tuples collected since the last index vacuum cycle.
+ Amount of dead tuple data collected since the last index vacuum cycle.
</para></entry>
</row>
</tbody>
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 8f14cf85f3..41af676dfa 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -40,6 +40,7 @@
#include "access/heapam_xlog.h"
#include "access/htup_details.h"
#include "access/multixact.h"
+#include "access/tidstore.h"
#include "access/transam.h"
#include "access/visibilitymap.h"
#include "access/xact.h"
@@ -188,7 +189,7 @@ typedef struct LVRelState
* lazy_vacuum_heap_rel, which marks the same LP_DEAD line pointers as
* LP_UNUSED during second heap pass.
*/
- VacDeadItems *dead_items; /* TIDs whose index tuples we'll delete */
+ TidStore *dead_items; /* TIDs whose index tuples we'll delete */
BlockNumber rel_pages; /* total number of pages */
BlockNumber scanned_pages; /* # pages examined (not skipped via VM) */
BlockNumber removed_pages; /* # pages removed by relation truncation */
@@ -259,8 +260,9 @@ static bool lazy_scan_noprune(LVRelState *vacrel, Buffer buf,
static void lazy_vacuum(LVRelState *vacrel);
static bool lazy_vacuum_all_indexes(LVRelState *vacrel);
static void lazy_vacuum_heap_rel(LVRelState *vacrel);
-static int lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
- Buffer buffer, int index, Buffer vmbuffer);
+static void lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
+ OffsetNumber *offsets, int num_offsets,
+ Buffer buffer, Buffer vmbuffer);
static bool lazy_check_wraparound_failsafe(LVRelState *vacrel);
static void lazy_cleanup_all_indexes(LVRelState *vacrel);
static IndexBulkDeleteResult *lazy_vacuum_one_index(Relation indrel,
@@ -825,21 +827,21 @@ lazy_scan_heap(LVRelState *vacrel)
blkno,
next_unskippable_block,
next_fsm_block_to_vacuum = 0;
- VacDeadItems *dead_items = vacrel->dead_items;
+ TidStore *dead_items = vacrel->dead_items;
Buffer vmbuffer = InvalidBuffer;
bool next_unskippable_allvis,
skipping_current_range;
const int initprog_index[] = {
PROGRESS_VACUUM_PHASE,
PROGRESS_VACUUM_TOTAL_HEAP_BLKS,
- PROGRESS_VACUUM_MAX_DEAD_TUPLES
+ PROGRESS_VACUUM_MAX_DEAD_TUPLE_BYTES
};
int64 initprog_val[3];
/* Report that we're scanning the heap, advertising total # of blocks */
initprog_val[0] = PROGRESS_VACUUM_PHASE_SCAN_HEAP;
initprog_val[1] = rel_pages;
- initprog_val[2] = dead_items->max_items;
+ initprog_val[2] = tidstore_max_memory(vacrel->dead_items);
pgstat_progress_update_multi_param(3, initprog_index, initprog_val);
/* Set up an initial range of skippable blocks using the visibility map */
@@ -906,8 +908,7 @@ lazy_scan_heap(LVRelState *vacrel)
* dead_items TIDs, pause and do a cycle of vacuuming before we tackle
* this page.
*/
- Assert(dead_items->max_items >= MaxHeapTuplesPerPage);
- if (dead_items->max_items - dead_items->num_items < MaxHeapTuplesPerPage)
+ if (tidstore_is_full(vacrel->dead_items))
{
/*
* Before beginning index vacuuming, we release any pin we may
@@ -1037,11 +1038,18 @@ lazy_scan_heap(LVRelState *vacrel)
if (prunestate.has_lpdead_items)
{
Size freespace;
+ TidStoreIter *iter;
+ TidStoreIterResult *result;
- lazy_vacuum_heap_page(vacrel, blkno, buf, 0, vmbuffer);
+ iter = tidstore_begin_iterate(vacrel->dead_items);
+ result = tidstore_iterate_next(iter);
+ lazy_vacuum_heap_page(vacrel, blkno, result->offsets, result->num_offsets,
+ buf, vmbuffer);
+ Assert(!tidstore_iterate_next(iter));
+ tidstore_end_iterate(iter);
/* Forget the LP_DEAD items that we just vacuumed */
- dead_items->num_items = 0;
+ tidstore_reset(dead_items);
/*
* Periodically perform FSM vacuuming to make newly-freed
@@ -1078,7 +1086,7 @@ lazy_scan_heap(LVRelState *vacrel)
* with prunestate-driven visibility map and FSM steps (just like
* the two-pass strategy).
*/
- Assert(dead_items->num_items == 0);
+ Assert(tidstore_num_tids(dead_items) == 0);
}
/*
@@ -1249,7 +1257,7 @@ lazy_scan_heap(LVRelState *vacrel)
* Do index vacuuming (call each index's ambulkdelete routine), then do
* related heap vacuuming
*/
- if (dead_items->num_items > 0)
+ if (tidstore_num_tids(dead_items) > 0)
lazy_vacuum(vacrel);
/*
@@ -1893,23 +1901,15 @@ retry:
*/
if (lpdead_items > 0)
{
- VacDeadItems *dead_items = vacrel->dead_items;
- ItemPointerData tmp;
+ TidStore *dead_items = vacrel->dead_items;
vacrel->lpdead_item_pages++;
prunestate->has_lpdead_items = true;
- ItemPointerSetBlockNumber(&tmp, blkno);
+ tidstore_add_tids(dead_items, blkno, deadoffsets, lpdead_items);
- for (int i = 0; i < lpdead_items; i++)
- {
- ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
- dead_items->items[dead_items->num_items++] = tmp;
- }
-
- Assert(dead_items->num_items <= dead_items->max_items);
- pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
- dead_items->num_items);
+ pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
+ tidstore_memory_usage(dead_items));
/*
* It was convenient to ignore LP_DEAD items in all_visible earlier on
@@ -2129,8 +2129,7 @@ lazy_scan_noprune(LVRelState *vacrel,
}
else
{
- VacDeadItems *dead_items = vacrel->dead_items;
- ItemPointerData tmp;
+ TidStore *dead_items = vacrel->dead_items;
/*
* Page has LP_DEAD items, and so any references/TIDs that remain in
@@ -2139,17 +2138,10 @@ lazy_scan_noprune(LVRelState *vacrel,
*/
vacrel->lpdead_item_pages++;
- ItemPointerSetBlockNumber(&tmp, blkno);
+ tidstore_add_tids(dead_items, blkno, deadoffsets, lpdead_items);
- for (int i = 0; i < lpdead_items; i++)
- {
- ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
- dead_items->items[dead_items->num_items++] = tmp;
- }
-
- Assert(dead_items->num_items <= dead_items->max_items);
- pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
- dead_items->num_items);
+ pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
+ tidstore_memory_usage(dead_items));
vacrel->lpdead_items += lpdead_items;
@@ -2198,7 +2190,7 @@ lazy_vacuum(LVRelState *vacrel)
if (!vacrel->do_index_vacuuming)
{
Assert(!vacrel->do_index_cleanup);
- vacrel->dead_items->num_items = 0;
+ tidstore_reset(vacrel->dead_items);
return;
}
@@ -2227,7 +2219,7 @@ lazy_vacuum(LVRelState *vacrel)
BlockNumber threshold;
Assert(vacrel->num_index_scans == 0);
- Assert(vacrel->lpdead_items == vacrel->dead_items->num_items);
+ Assert(vacrel->lpdead_items == tidstore_num_tids(vacrel->dead_items));
Assert(vacrel->do_index_vacuuming);
Assert(vacrel->do_index_cleanup);
@@ -2254,8 +2246,8 @@ lazy_vacuum(LVRelState *vacrel)
* cases then this may need to be reconsidered.
*/
threshold = (double) vacrel->rel_pages * BYPASS_THRESHOLD_PAGES;
- bypass = (vacrel->lpdead_item_pages < threshold &&
- vacrel->lpdead_items < MAXDEADITEMS(32L * 1024L * 1024L));
+ bypass = (vacrel->lpdead_item_pages < threshold) &&
+ tidstore_memory_usage(vacrel->dead_items) < (32L * 1024L * 1024L);
}
if (bypass)
@@ -2300,7 +2292,7 @@ lazy_vacuum(LVRelState *vacrel)
* Forget the LP_DEAD items that we just vacuumed (or just decided to not
* vacuum)
*/
- vacrel->dead_items->num_items = 0;
+ tidstore_reset(vacrel->dead_items);
}
/*
@@ -2373,7 +2365,7 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
* place).
*/
Assert(vacrel->num_index_scans > 0 ||
- vacrel->dead_items->num_items == vacrel->lpdead_items);
+ tidstore_num_tids(vacrel->dead_items) == vacrel->lpdead_items);
Assert(allindexes || vacrel->failsafe_active);
/*
@@ -2410,10 +2402,11 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
static void
lazy_vacuum_heap_rel(LVRelState *vacrel)
{
- int index = 0;
BlockNumber vacuumed_pages = 0;
Buffer vmbuffer = InvalidBuffer;
LVSavedErrInfo saved_err_info;
+ TidStoreIter *iter;
+ TidStoreIterResult *result;
Assert(vacrel->do_index_vacuuming);
Assert(vacrel->do_index_cleanup);
@@ -2428,7 +2421,8 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
VACUUM_ERRCB_PHASE_VACUUM_HEAP,
InvalidBlockNumber, InvalidOffsetNumber);
- while (index < vacrel->dead_items->num_items)
+ iter = tidstore_begin_iterate(vacrel->dead_items);
+ while ((result = tidstore_iterate_next(iter)) != NULL)
{
BlockNumber blkno;
Buffer buf;
@@ -2437,7 +2431,7 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
vacuum_delay_point();
- blkno = ItemPointerGetBlockNumber(&vacrel->dead_items->items[index]);
+ blkno = result->blkno;
vacrel->blkno = blkno;
/*
@@ -2451,7 +2445,8 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
buf = ReadBufferExtended(vacrel->rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
vacrel->bstrategy);
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
- index = lazy_vacuum_heap_page(vacrel, blkno, buf, index, vmbuffer);
+ lazy_vacuum_heap_page(vacrel, blkno, result->offsets, result->num_offsets,
+ buf, vmbuffer);
/* Now that we've vacuumed the page, record its available space */
page = BufferGetPage(buf);
@@ -2461,6 +2456,7 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
vacuumed_pages++;
}
+ tidstore_end_iterate(iter);
vacrel->blkno = InvalidBlockNumber;
if (BufferIsValid(vmbuffer))
@@ -2470,14 +2466,13 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
* We set all LP_DEAD items from the first heap pass to LP_UNUSED during
* the second heap pass. No more, no less.
*/
- Assert(index > 0);
Assert(vacrel->num_index_scans > 1 ||
- (index == vacrel->lpdead_items &&
+ (tidstore_num_tids(vacrel->dead_items) == vacrel->lpdead_items &&
vacuumed_pages == vacrel->lpdead_item_pages));
ereport(DEBUG2,
- (errmsg("table \"%s\": removed %lld dead item identifiers in %u pages",
- vacrel->relname, (long long) index, vacuumed_pages)));
+ (errmsg("table \"%s\": removed " UINT64_FORMAT "dead item identifiers in %u pages",
+ vacrel->relname, tidstore_num_tids(vacrel->dead_items), vacuumed_pages)));
/* Revert to the previous phase information for error traceback */
restore_vacuum_error_info(vacrel, &saved_err_info);
@@ -2495,11 +2490,10 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
* LP_DEAD item on the page. The return value is the first index immediately
* after all LP_DEAD items for the same page in the array.
*/
-static int
-lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
- int index, Buffer vmbuffer)
+static void
+lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets, Buffer buffer, Buffer vmbuffer)
{
- VacDeadItems *dead_items = vacrel->dead_items;
Page page = BufferGetPage(buffer);
OffsetNumber unused[MaxHeapTuplesPerPage];
int nunused = 0;
@@ -2518,16 +2512,11 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
START_CRIT_SECTION();
- for (; index < dead_items->num_items; index++)
+ for (int i = 0; i < num_offsets; i++)
{
- BlockNumber tblk;
- OffsetNumber toff;
ItemId itemid;
+ OffsetNumber toff = offsets[i];
- tblk = ItemPointerGetBlockNumber(&dead_items->items[index]);
- if (tblk != blkno)
- break; /* past end of tuples for this block */
- toff = ItemPointerGetOffsetNumber(&dead_items->items[index]);
itemid = PageGetItemId(page, toff);
Assert(ItemIdIsDead(itemid) && !ItemIdHasStorage(itemid));
@@ -2597,7 +2586,6 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
/* Revert to the previous phase information for error traceback */
restore_vacuum_error_info(vacrel, &saved_err_info);
- return index;
}
/*
@@ -3093,46 +3081,6 @@ count_nondeletable_pages(LVRelState *vacrel, bool *lock_waiter_detected)
return vacrel->nonempty_pages;
}
-/*
- * Returns the number of dead TIDs that VACUUM should allocate space to
- * store, given a heap rel of size vacrel->rel_pages, and given current
- * maintenance_work_mem setting (or current autovacuum_work_mem setting,
- * when applicable).
- *
- * See the comments at the head of this file for rationale.
- */
-static int
-dead_items_max_items(LVRelState *vacrel)
-{
- int64 max_items;
- int vac_work_mem = IsAutoVacuumWorkerProcess() &&
- autovacuum_work_mem != -1 ?
- autovacuum_work_mem : maintenance_work_mem;
-
- if (vacrel->nindexes > 0)
- {
- BlockNumber rel_pages = vacrel->rel_pages;
-
- max_items = MAXDEADITEMS(vac_work_mem * 1024L);
- max_items = Min(max_items, INT_MAX);
- max_items = Min(max_items, MAXDEADITEMS(MaxAllocSize));
-
- /* curious coding here to ensure the multiplication can't overflow */
- if ((BlockNumber) (max_items / MaxHeapTuplesPerPage) > rel_pages)
- max_items = rel_pages * MaxHeapTuplesPerPage;
-
- /* stay sane if small maintenance_work_mem */
- max_items = Max(max_items, MaxHeapTuplesPerPage);
- }
- else
- {
- /* One-pass case only stores a single heap page's TIDs at a time */
- max_items = MaxHeapTuplesPerPage;
- }
-
- return (int) max_items;
-}
-
/*
* Allocate dead_items (either using palloc, or in dynamic shared memory).
* Sets dead_items in vacrel for caller.
@@ -3143,11 +3091,9 @@ dead_items_max_items(LVRelState *vacrel)
static void
dead_items_alloc(LVRelState *vacrel, int nworkers)
{
- VacDeadItems *dead_items;
- int max_items;
-
- max_items = dead_items_max_items(vacrel);
- Assert(max_items >= MaxHeapTuplesPerPage);
+ int vac_work_mem = IsAutoVacuumWorkerProcess() &&
+ autovacuum_work_mem != -1 ?
+ autovacuum_work_mem * 1024L : maintenance_work_mem * 1024L;
/*
* Initialize state for a parallel vacuum. As of now, only one worker can
@@ -3174,7 +3120,7 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
else
vacrel->pvs = parallel_vacuum_init(vacrel->rel, vacrel->indrels,
vacrel->nindexes, nworkers,
- max_items,
+ vac_work_mem,
vacrel->verbose ? INFO : DEBUG2,
vacrel->bstrategy);
@@ -3187,11 +3133,7 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
}
/* Serial VACUUM case */
- dead_items = (VacDeadItems *) palloc(vac_max_items_to_alloc_size(max_items));
- dead_items->max_items = max_items;
- dead_items->num_items = 0;
-
- vacrel->dead_items = dead_items;
+ vacrel->dead_items = tidstore_create(vac_work_mem, NULL);
}
/*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index d2a8c82900..fdc8a99bba 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1164,7 +1164,7 @@ CREATE VIEW pg_stat_progress_vacuum AS
END AS phase,
S.param2 AS heap_blks_total, S.param3 AS heap_blks_scanned,
S.param4 AS heap_blks_vacuumed, S.param5 AS index_vacuum_count,
- S.param6 AS max_dead_tuples, S.param7 AS num_dead_tuples
+ S.param6 AS max_dead_tuple_bytes, S.param7 AS dead_tuple_bytes
FROM pg_stat_get_progress_info('VACUUM') AS S
LEFT JOIN pg_database D ON S.datid = D.oid;
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 7b1a4b127e..358ad25996 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -97,7 +97,6 @@ static bool vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
static double compute_parallel_delay(void);
static VacOptValue get_vacoptval_from_boolean(DefElem *def);
static bool vac_tid_reaped(ItemPointer itemptr, void *state);
-static int vac_cmp_itemptr(const void *left, const void *right);
/*
* Primary entry point for manual VACUUM and ANALYZE commands
@@ -2303,16 +2302,16 @@ get_vacoptval_from_boolean(DefElem *def)
*/
IndexBulkDeleteResult *
vac_bulkdel_one_index(IndexVacuumInfo *ivinfo, IndexBulkDeleteResult *istat,
- VacDeadItems *dead_items)
+ TidStore *dead_items)
{
/* Do bulk deletion */
istat = index_bulk_delete(ivinfo, istat, vac_tid_reaped,
(void *) dead_items);
ereport(ivinfo->message_level,
- (errmsg("scanned index \"%s\" to remove %d row versions",
+ (errmsg("scanned index \"%s\" to remove " UINT64_FORMAT " row versions",
RelationGetRelationName(ivinfo->index),
- dead_items->num_items)));
+ tidstore_num_tids(dead_items))));
return istat;
}
@@ -2343,18 +2342,6 @@ vac_cleanup_one_index(IndexVacuumInfo *ivinfo, IndexBulkDeleteResult *istat)
return istat;
}
-/*
- * Returns the total required space for VACUUM's dead_items array given a
- * max_items value.
- */
-Size
-vac_max_items_to_alloc_size(int max_items)
-{
- Assert(max_items <= MAXDEADITEMS(MaxAllocSize));
-
- return offsetof(VacDeadItems, items) + sizeof(ItemPointerData) * max_items;
-}
-
/*
* vac_tid_reaped() -- is a particular tid deletable?
*
@@ -2365,60 +2352,7 @@ vac_max_items_to_alloc_size(int max_items)
static bool
vac_tid_reaped(ItemPointer itemptr, void *state)
{
- VacDeadItems *dead_items = (VacDeadItems *) state;
- int64 litem,
- ritem,
- item;
- ItemPointer res;
-
- litem = itemptr_encode(&dead_items->items[0]);
- ritem = itemptr_encode(&dead_items->items[dead_items->num_items - 1]);
- item = itemptr_encode(itemptr);
-
- /*
- * Doing a simple bound check before bsearch() is useful to avoid the
- * extra cost of bsearch(), especially if dead items on the heap are
- * concentrated in a certain range. Since this function is called for
- * every index tuple, it pays to be really fast.
- */
- if (item < litem || item > ritem)
- return false;
-
- res = (ItemPointer) bsearch((void *) itemptr,
- (void *) dead_items->items,
- dead_items->num_items,
- sizeof(ItemPointerData),
- vac_cmp_itemptr);
-
- return (res != NULL);
-}
-
-/*
- * Comparator routines for use with qsort() and bsearch().
- */
-static int
-vac_cmp_itemptr(const void *left, const void *right)
-{
- BlockNumber lblk,
- rblk;
- OffsetNumber loff,
- roff;
-
- lblk = ItemPointerGetBlockNumber((ItemPointer) left);
- rblk = ItemPointerGetBlockNumber((ItemPointer) right);
-
- if (lblk < rblk)
- return -1;
- if (lblk > rblk)
- return 1;
-
- loff = ItemPointerGetOffsetNumber((ItemPointer) left);
- roff = ItemPointerGetOffsetNumber((ItemPointer) right);
-
- if (loff < roff)
- return -1;
- if (loff > roff)
- return 1;
+ TidStore *dead_items = (TidStore *) state;
- return 0;
+ return tidstore_lookup_tid(dead_items, itemptr);
}
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index bcd40c80a1..4c0ce4b7e6 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -44,7 +44,7 @@
* use small integers.
*/
#define PARALLEL_VACUUM_KEY_SHARED 1
-#define PARALLEL_VACUUM_KEY_DEAD_ITEMS 2
+#define PARALLEL_VACUUM_KEY_DSA 2
#define PARALLEL_VACUUM_KEY_QUERY_TEXT 3
#define PARALLEL_VACUUM_KEY_BUFFER_USAGE 4
#define PARALLEL_VACUUM_KEY_WAL_USAGE 5
@@ -103,6 +103,9 @@ typedef struct PVShared
/* Counter for vacuuming and cleanup */
pg_atomic_uint32 idx;
+
+ /* Handle of the shared TidStore */
+ tidstore_handle dead_items_handle;
} PVShared;
/* Status used during parallel index vacuum or cleanup */
@@ -166,7 +169,8 @@ struct ParallelVacuumState
PVIndStats *indstats;
/* Shared dead items space among parallel vacuum workers */
- VacDeadItems *dead_items;
+ TidStore *dead_items;
+ dsa_area *dead_items_area;
/* Points to buffer usage area in DSM */
BufferUsage *buffer_usage;
@@ -222,20 +226,23 @@ static void parallel_vacuum_error_callback(void *arg);
*/
ParallelVacuumState *
parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
- int nrequested_workers, int max_items,
- int elevel, BufferAccessStrategy bstrategy)
+ int nrequested_workers, int vac_work_mem,
+ int elevel,
+ BufferAccessStrategy bstrategy)
{
ParallelVacuumState *pvs;
ParallelContext *pcxt;
PVShared *shared;
- VacDeadItems *dead_items;
+ TidStore *dead_items;
PVIndStats *indstats;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
+ void *area_space;
+ dsa_area *dead_items_dsa;
bool *will_parallel_vacuum;
Size est_indstats_len;
Size est_shared_len;
- Size est_dead_items_len;
+ Size dsa_minsize = dsa_minimum_size();
int nindexes_mwm = 0;
int parallel_workers = 0;
int querylen;
@@ -283,9 +290,8 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_estimate_chunk(&pcxt->estimator, est_shared_len);
shm_toc_estimate_keys(&pcxt->estimator, 1);
- /* Estimate size for dead_items -- PARALLEL_VACUUM_KEY_DEAD_ITEMS */
- est_dead_items_len = vac_max_items_to_alloc_size(max_items);
- shm_toc_estimate_chunk(&pcxt->estimator, est_dead_items_len);
+ /* Estimate size for dead tuple DSA -- PARALLEL_VACUUM_KEY_DSA */
+ shm_toc_estimate_chunk(&pcxt->estimator, dsa_minsize);
shm_toc_estimate_keys(&pcxt->estimator, 1);
/*
@@ -351,6 +357,16 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_INDEX_STATS, indstats);
pvs->indstats = indstats;
+ /* Prepare DSA space for dead items */
+ area_space = shm_toc_allocate(pcxt->toc, dsa_minsize);
+ shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_DSA, area_space);
+ dead_items_dsa = dsa_create_in_place(area_space, dsa_minsize,
+ LWTRANCHE_PARALLEL_VACUUM_DSA,
+ pcxt->seg);
+ dead_items = tidstore_create(vac_work_mem, dead_items_dsa);
+ pvs->dead_items = dead_items;
+ pvs->dead_items_area = dead_items_dsa;
+
/* Prepare shared information */
shared = (PVShared *) shm_toc_allocate(pcxt->toc, est_shared_len);
MemSet(shared, 0, est_shared_len);
@@ -360,6 +376,7 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
(nindexes_mwm > 0) ?
maintenance_work_mem / Min(parallel_workers, nindexes_mwm) :
maintenance_work_mem;
+ shared->dead_items_handle = tidstore_get_handle(dead_items);
pg_atomic_init_u32(&(shared->cost_balance), 0);
pg_atomic_init_u32(&(shared->active_nworkers), 0);
@@ -368,15 +385,6 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_SHARED, shared);
pvs->shared = shared;
- /* Prepare the dead_items space */
- dead_items = (VacDeadItems *) shm_toc_allocate(pcxt->toc,
- est_dead_items_len);
- dead_items->max_items = max_items;
- dead_items->num_items = 0;
- MemSet(dead_items->items, 0, sizeof(ItemPointerData) * max_items);
- shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_DEAD_ITEMS, dead_items);
- pvs->dead_items = dead_items;
-
/*
* Allocate space for each worker's BufferUsage and WalUsage; no need to
* initialize
@@ -434,6 +442,9 @@ parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats)
istats[i] = NULL;
}
+ tidstore_destroy(pvs->dead_items);
+ dsa_detach(pvs->dead_items_area);
+
DestroyParallelContext(pvs->pcxt);
ExitParallelMode();
@@ -442,7 +453,7 @@ parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats)
}
/* Returns the dead items space */
-VacDeadItems *
+TidStore *
parallel_vacuum_get_dead_items(ParallelVacuumState *pvs)
{
return pvs->dead_items;
@@ -940,7 +951,9 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
Relation *indrels;
PVIndStats *indstats;
PVShared *shared;
- VacDeadItems *dead_items;
+ TidStore *dead_items;
+ void *area_space;
+ dsa_area *dead_items_area;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
int nindexes;
@@ -984,10 +997,10 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
PARALLEL_VACUUM_KEY_INDEX_STATS,
false);
- /* Set dead_items space */
- dead_items = (VacDeadItems *) shm_toc_lookup(toc,
- PARALLEL_VACUUM_KEY_DEAD_ITEMS,
- false);
+ /* Set dead items */
+ area_space = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_DSA, false);
+ dead_items_area = dsa_attach_in_place(area_space, seg);
+ dead_items = tidstore_attach(dead_items_area, shared->dead_items_handle);
/* Set cost-based vacuum delay */
VacuumCostActive = (VacuumCostDelay > 0);
@@ -1033,6 +1046,9 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
&wal_usage[ParallelWorkerNumber]);
+ tidstore_detach(pvs.dead_items);
+ dsa_detach(dead_items_area);
+
/* Pop the error context stack */
error_context_stack = errcallback.previous;
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 196bece0a3..ff75fae88a 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -186,6 +186,8 @@ static const char *const BuiltinTrancheNames[] = {
"PgStatsHash",
/* LWTRANCHE_PGSTATS_DATA: */
"PgStatsData",
+ /* LWTRANCHE_PARALLEL_VACUUM_DSA: */
+ "ParallelVacuumDSA",
};
StaticAssertDecl(lengthof(BuiltinTrancheNames) ==
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 5025e80f89..edee8a2b2b 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2301,7 +2301,7 @@ struct config_int ConfigureNamesInt[] =
GUC_UNIT_KB
},
&maintenance_work_mem,
- 65536, 1024, MAX_KILOBYTES,
+ 65536, 2048, MAX_KILOBYTES,
NULL, NULL, NULL
},
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index e5add41352..b209d3cf84 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -23,8 +23,8 @@
#define PROGRESS_VACUUM_HEAP_BLKS_SCANNED 2
#define PROGRESS_VACUUM_HEAP_BLKS_VACUUMED 3
#define PROGRESS_VACUUM_NUM_INDEX_VACUUMS 4
-#define PROGRESS_VACUUM_MAX_DEAD_TUPLES 5
-#define PROGRESS_VACUUM_NUM_DEAD_TUPLES 6
+#define PROGRESS_VACUUM_MAX_DEAD_TUPLE_BYTES 5
+#define PROGRESS_VACUUM_DEAD_TUPLE_BYTES 6
/* Phases of vacuum (as advertised via PROGRESS_VACUUM_PHASE) */
#define PROGRESS_VACUUM_PHASE_SCAN_HEAP 1
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 689dbb7702..220d89fff7 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -17,6 +17,7 @@
#include "access/htup.h"
#include "access/genam.h"
#include "access/parallel.h"
+#include "access/tidstore.h"
#include "catalog/pg_class.h"
#include "catalog/pg_statistic.h"
#include "catalog/pg_type.h"
@@ -276,21 +277,6 @@ struct VacuumCutoffs
MultiXactId MultiXactCutoff;
};
-/*
- * VacDeadItems stores TIDs whose index tuples are deleted by index vacuuming.
- */
-typedef struct VacDeadItems
-{
- int max_items; /* # slots allocated in array */
- int num_items; /* current # of entries */
-
- /* Sorted array of TIDs to delete from indexes */
- ItemPointerData items[FLEXIBLE_ARRAY_MEMBER];
-} VacDeadItems;
-
-#define MAXDEADITEMS(avail_mem) \
- (((avail_mem) - offsetof(VacDeadItems, items)) / sizeof(ItemPointerData))
-
/* GUC parameters */
extern PGDLLIMPORT int default_statistics_target; /* PGDLLIMPORT for PostGIS */
extern PGDLLIMPORT int vacuum_freeze_min_age;
@@ -339,18 +325,17 @@ extern Relation vacuum_open_relation(Oid relid, RangeVar *relation,
LOCKMODE lmode);
extern IndexBulkDeleteResult *vac_bulkdel_one_index(IndexVacuumInfo *ivinfo,
IndexBulkDeleteResult *istat,
- VacDeadItems *dead_items);
+ TidStore *dead_items);
extern IndexBulkDeleteResult *vac_cleanup_one_index(IndexVacuumInfo *ivinfo,
IndexBulkDeleteResult *istat);
-extern Size vac_max_items_to_alloc_size(int max_items);
/* in commands/vacuumparallel.c */
extern ParallelVacuumState *parallel_vacuum_init(Relation rel, Relation *indrels,
int nindexes, int nrequested_workers,
- int max_items, int elevel,
- BufferAccessStrategy bstrategy);
+ int vac_work_mem,
+ int elevel, BufferAccessStrategy bstrategy);
extern void parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats);
-extern VacDeadItems *parallel_vacuum_get_dead_items(ParallelVacuumState *pvs);
+extern TidStore *parallel_vacuum_get_dead_items(ParallelVacuumState *pvs);
extern void parallel_vacuum_bulkdel_all_indexes(ParallelVacuumState *pvs,
long num_table_tuples,
int num_index_scans);
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index e4162db613..40dda03088 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -204,6 +204,7 @@ typedef enum BuiltinTrancheIds
LWTRANCHE_PGSTATS_DSA,
LWTRANCHE_PGSTATS_HASH,
LWTRANCHE_PGSTATS_DATA,
+ LWTRANCHE_PARALLEL_VACUUM_DSA,
LWTRANCHE_FIRST_USER_DEFINED
} BuiltinTrancheIds;
diff --git a/src/test/regress/expected/cluster.out b/src/test/regress/expected/cluster.out
index 2eec483eaa..e04f50726f 100644
--- a/src/test/regress/expected/cluster.out
+++ b/src/test/regress/expected/cluster.out
@@ -526,7 +526,7 @@ create index cluster_sort on clstr_4 (hundred, thousand, tenthous);
-- ensure we don't use the index in CLUSTER nor the checking SELECTs
set enable_indexscan = off;
-- Use external sort:
-set maintenance_work_mem = '1MB';
+set maintenance_work_mem = '2MB';
cluster clstr_4 using cluster_sort;
select * from
(select hundred, lag(hundred) over () as lhundred,
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 6cd57e3eaa..d1889b9d10 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -1214,7 +1214,7 @@ DROP TABLE unlogged_hash_table;
-- CREATE INDEX hash_ovfl_index ON hash_ovfl_heap USING hash (x int4_ops);
-- Test hash index build tuplesorting. Force hash tuplesort using low
-- maintenance_work_mem setting and fillfactor:
-SET maintenance_work_mem = '1MB';
+SET maintenance_work_mem = '2MB';
CREATE INDEX hash_tuplesort_idx ON tenk1 USING hash (stringu1 name_ops) WITH (fillfactor = 10);
EXPLAIN (COSTS OFF)
SELECT count(*) FROM tenk1 WHERE stringu1 = 'TVAAAA';
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index a969ae63eb..630869255f 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2020,8 +2020,8 @@ pg_stat_progress_vacuum| SELECT s.pid,
s.param3 AS heap_blks_scanned,
s.param4 AS heap_blks_vacuumed,
s.param5 AS index_vacuum_count,
- s.param6 AS max_dead_tuples,
- s.param7 AS num_dead_tuples
+ s.param6 AS max_dead_tuple_bytes,
+ s.param7 AS dead_tuple_bytes
FROM (pg_stat_get_progress_info('VACUUM'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
LEFT JOIN pg_database d ON ((s.datid = d.oid)));
pg_stat_recovery_prefetch| SELECT s.stats_reset,
diff --git a/src/test/regress/sql/cluster.sql b/src/test/regress/sql/cluster.sql
index a4cfaae807..a4cb5b98a5 100644
--- a/src/test/regress/sql/cluster.sql
+++ b/src/test/regress/sql/cluster.sql
@@ -258,7 +258,7 @@ create index cluster_sort on clstr_4 (hundred, thousand, tenthous);
set enable_indexscan = off;
-- Use external sort:
-set maintenance_work_mem = '1MB';
+set maintenance_work_mem = '2MB';
cluster clstr_4 using cluster_sort;
select * from
(select hundred, lag(hundred) over () as lhundred,
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index a3738833b2..edb5e4b4f3 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -367,7 +367,7 @@ DROP TABLE unlogged_hash_table;
-- Test hash index build tuplesorting. Force hash tuplesort using low
-- maintenance_work_mem setting and fillfactor:
-SET maintenance_work_mem = '1MB';
+SET maintenance_work_mem = '2MB';
CREATE INDEX hash_tuplesort_idx ON tenk1 USING hash (stringu1 name_ops) WITH (fillfactor = 10);
EXPLAIN (COSTS OFF)
SELECT count(*) FROM tenk1 WHERE stringu1 = 'TVAAAA';
--
2.39.0
v19-0006-Shared-memory-cleanups.patch (text/x-patch; charset=US-ASCII)
From afccfde982c95815b4a7b8dcef62ae5bc1d416d0 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Tue, 17 Jan 2023 16:50:38 +0700
Subject: [PATCH v19 6/9] Shared memory cleanups
---
src/include/lib/radixtree.h | 7 +------
1 file changed, 1 insertion(+), 6 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index ba326562d5..7c7b126b98 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -1378,8 +1378,6 @@ RT_ATTACH(dsa_area *dsa, RT_HANDLE handle)
tree->ctl = (RT_RADIX_TREE_CONTROL *) dsa_get_address(dsa, control);
Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
- /* XXX: do we need to set a callback on exit to detach dsa? */
-
return tree;
}
@@ -1412,8 +1410,7 @@ RT_FREE(RT_RADIX_TREE *tree)
* other backends access the memory formerly occupied by this radix tree.
*/
tree->ctl->magic = 0;
- dsa_free(tree->dsa, tree->ctl->handle); // XXX
- //dsa_detach(tree->dsa);
+ dsa_free(tree->dsa, tree->ctl->handle);
#else
pfree(tree->ctl);
@@ -1452,8 +1449,6 @@ RT_SET(RT_RADIX_TREE *tree, uint64 key, uint64 value)
if (key > tree->ctl->max_val)
RT_EXTEND(tree, key);
- //Assert(tree->ctl->root);
-
nodep = tree->ctl->root;
parent = RT_PTR_GET_LOCAL(tree, nodep);
shift = parent->shift;
--
2.39.0
v19-0007-Make-RT_DELETE-optional.patch (text/x-patch; charset=US-ASCII)
From 88e0f6202959fa1a872eacb01c0e24cb27ae66d4 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Tue, 17 Jan 2023 17:34:28 +0700
Subject: [PATCH v19 7/9] Make RT_DELETE optional
To prevent compiler warnings in TIDStore
---
src/include/lib/radixtree.h | 16 +++++++++++++++-
src/test/modules/test_radixtree/test_radixtree.c | 1 +
2 files changed, 16 insertions(+), 1 deletion(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 7c7b126b98..c2df8e882e 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -58,7 +58,6 @@
* RT_GET_HANDLE - Return the handle of the radix tree
* RT_SEARCH - Search a key-value pair
* RT_SET - Set a key-value pair
- * RT_DELETE - Delete a key-value pair
* RT_BEGIN_ITERATE - Begin iterating through all key-value pairs
* RT_ITERATE_NEXT - Return next key-value pair, if any
* RT_END_ITER - End iteration
@@ -70,6 +69,12 @@
* RT_ITERATE_NEXT() ensures returning key-value pairs in the ascending
* order of the key.
*
+ * Optional Interface
+ * ------------------
+ *
+ * RT_DELETE - Delete a key-value pair. Declared/defined only if RT_USE_DELETE is defined
+ *
+ *
* Copyright (c) 2022, PostgreSQL Global Development Group
*
* IDENTIFICATION
@@ -106,7 +111,9 @@
#define RT_BEGIN_ITERATE RT_MAKE_NAME(begin_iterate)
#define RT_ITERATE_NEXT RT_MAKE_NAME(iterate_next)
#define RT_END_ITERATE RT_MAKE_NAME(end_iterate)
+#ifdef RT_USE_DELETE
#define RT_DELETE RT_MAKE_NAME(delete)
+#endif
#define RT_MEMORY_USAGE RT_MAKE_NAME(memory_usage)
#define RT_DUMP RT_MAKE_NAME(dump)
#define RT_DUMP_SEARCH RT_MAKE_NAME(dump_search)
@@ -213,7 +220,9 @@ RT_SCOPE void RT_FREE(RT_RADIX_TREE *tree);
RT_SCOPE bool RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, uint64 *val_p);
RT_SCOPE bool RT_SET(RT_RADIX_TREE *tree, uint64 key, uint64 val);
+#ifdef RT_USE_DELETE
RT_SCOPE bool RT_DELETE(RT_RADIX_TREE *tree, uint64 key);
+#endif
RT_SCOPE RT_ITER * RT_BEGIN_ITERATE(RT_RADIX_TREE *tree);
RT_SCOPE bool RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, uint64 *value_p);
@@ -1264,6 +1273,7 @@ RT_NODE_SEARCH_LEAF(RT_PTR_LOCAL node, uint64 key, uint64 *value_p)
#undef RT_NODE_LEVEL_LEAF
}
+#ifdef RT_USE_DELETE
/*
* Search for the child pointer corresponding to 'key' in the given node.
*
@@ -1289,6 +1299,7 @@ RT_NODE_DELETE_LEAF(RT_PTR_LOCAL node, uint64 key)
#include "lib/radixtree_delete_impl.h"
#undef RT_NODE_LEVEL_LEAF
}
+#endif
/* Insert the child to the inner node */
static bool
@@ -1523,6 +1534,7 @@ RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, uint64 *value_p)
return RT_NODE_SEARCH_LEAF(node, key, value_p);
}
+#ifdef RT_USE_DELETE
/*
* Delete the given key from the radix tree. Return true if the key is found (and
* deleted), otherwise do nothing and return false.
@@ -1609,6 +1621,7 @@ RT_DELETE(RT_RADIX_TREE *tree, uint64 key)
return true;
}
+#endif
static inline void
RT_ITER_UPDATE_KEY(RT_ITER *iter, uint8 chunk, uint8 shift)
@@ -2166,6 +2179,7 @@ rt_dump(RT_RADIX_TREE *tree)
#undef RT_BEGIN_ITERATE
#undef RT_ITERATE_NEXT
#undef RT_END_ITERATE
+#undef RT_USE_DELETE
#undef RT_DELETE
#undef RT_MEMORY_USAGE
#undef RT_DUMP
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
index 076173f628..f01d4dd733 100644
--- a/src/test/modules/test_radixtree/test_radixtree.c
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -104,6 +104,7 @@ static const test_spec test_specs[] = {
#define RT_SCOPE static
#define RT_DECLARE
#define RT_DEFINE
+#define RT_USE_DELETE
// WIP: compiles with warnings because rt_attach is defined but not used
// #define RT_SHMEM
#include "lib/radixtree.h"
--
2.39.0
v19-0008-Add-TIDStore-to-store-sets-of-TIDs-ItemPointerDa.patch (text/x-patch; charset=US-ASCII)
From 5205bba3e7d4542fe350fd3606acb78caace866d Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 23 Dec 2022 23:50:39 +0900
Subject: [PATCH v19 8/9] Add TIDStore, to store sets of TIDs (ItemPointerData)
efficiently.
The TIDStore is designed to store large sets of TIDs efficiently, and
is backed by the radix tree. A TID is encoded into a 64-bit key and
a 64-bit value and inserted into the radix tree.
The TIDStore is not used for anything yet, aside from the test code,
but the follow-up patch integrates the TIDStore with lazy vacuum,
reducing lazy vacuum memory usage and lifting the 1GB limit on its
size, by storing the list of dead TIDs more efficiently.
This includes a unit test module, in src/test/modules/test_tidstore.
---
src/backend/access/common/Makefile | 1 +
src/backend/access/common/meson.build | 1 +
src/backend/access/common/tidstore.c | 587 ++++++++++++++++++
src/include/access/tidstore.h | 49 ++
src/test/modules/test_tidstore/Makefile | 23 +
.../test_tidstore/expected/test_tidstore.out | 13 +
src/test/modules/test_tidstore/meson.build | 34 +
.../test_tidstore/sql/test_tidstore.sql | 7 +
.../test_tidstore/test_tidstore--1.0.sql | 8 +
.../test_tidstore/test_tidstore.control | 4 +
10 files changed, 727 insertions(+)
create mode 100644 src/backend/access/common/tidstore.c
create mode 100644 src/include/access/tidstore.h
create mode 100644 src/test/modules/test_tidstore/Makefile
create mode 100644 src/test/modules/test_tidstore/expected/test_tidstore.out
create mode 100644 src/test/modules/test_tidstore/meson.build
create mode 100644 src/test/modules/test_tidstore/sql/test_tidstore.sql
create mode 100644 src/test/modules/test_tidstore/test_tidstore--1.0.sql
create mode 100644 src/test/modules/test_tidstore/test_tidstore.control
diff --git a/src/backend/access/common/Makefile b/src/backend/access/common/Makefile
index b9aff0ccfd..67b8cc6108 100644
--- a/src/backend/access/common/Makefile
+++ b/src/backend/access/common/Makefile
@@ -27,6 +27,7 @@ OBJS = \
syncscan.o \
toast_compression.o \
toast_internals.o \
+ tidstore.o \
tupconvert.o \
tupdesc.o
diff --git a/src/backend/access/common/meson.build b/src/backend/access/common/meson.build
index f5ac17b498..fce19c09ce 100644
--- a/src/backend/access/common/meson.build
+++ b/src/backend/access/common/meson.build
@@ -15,6 +15,7 @@ backend_sources += files(
'syncscan.c',
'toast_compression.c',
'toast_internals.c',
+ 'tidstore.c',
'tupconvert.c',
'tupdesc.c',
)
diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
new file mode 100644
index 0000000000..4170d13b3c
--- /dev/null
+++ b/src/backend/access/common/tidstore.c
@@ -0,0 +1,587 @@
+/*-------------------------------------------------------------------------
+ *
+ * tidstore.c
+ * Tid (ItemPointerData) storage implementation.
+ *
+ * This module provides an in-memory data structure to store Tids (ItemPointer).
+ * Internally, Tids are encoded as a pair of a 64-bit key and a 64-bit value, and
+ * stored in the radix tree.
+ *
+ * A TidStore can be shared among parallel worker processes by passing DSA area
+ * to tidstore_create(). Other backends can attach to the shared TidStore by
+ * tidstore_attach(). It can support concurrent updates but only one process
+ * is allowed to iterate over the TidStore at a time.
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/common/tidstore.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "access/tidstore.h"
+#include "lib/radixtree.h"
+#include "miscadmin.h"
+#include "port/pg_bitutils.h"
+#include "utils/dsa.h"
+#include "utils/memutils.h"
+
+/*
+ * For encoding purposes, item pointers are represented as a pair of 64-bit
+ * key and 64-bit value. First, we construct 64-bit unsigned integer key that
+ * combines the block number and the offset number. The lowest 11 bits represent
+ * the offset number, and the next 32 bits are block number. That is, only 43
+ * bits are used:
+ *
+ * XXXXXXXX XXXYYYYY YYYYYYYY YYYYYYYY YYYYYYYY YYYuuuu
+ *
+ * X = bits used for offset number
+ * Y = bits used for block number
+ * u = unused bit
+ *
+ * 11 bits are enough for the offset number, because MaxHeapTuplesPerPage < 2^11
+ * on all supported block sizes (TIDSTORE_OFFSET_NBITS). We are frugal with
+ * the bits, because smaller keys could help keep the radix tree shallow.
+ *
+ * XXX: If we want to support other table AMs that want to use the full range
+ * of possible offset numbers, we'll need to change this.
+ *
+ * The 64-bit value is the bitmap representation of the lowest 6 bits, and
+ * the remaining 37 bits are used as the key:
+ *
+ * value = bitmap representation of XXXXXX
+ * key = XXXXXYYY YYYYYYYY YYYYYYYY YYYYYYYY YYYYYuu
+ *
+ * The maximum height of the radix tree is 5.
+ *
+ * XXX: if we want to support non-heap table AM, we need to reconsider
+ * TIDSTORE_OFFSET_NBITS value.
+ */
+#define TIDSTORE_OFFSET_NBITS 11
+#define TIDSTORE_VALUE_NBITS 6
+
+/*
+ * Memory consumption depends on the number of Tids stored, but also on the
+ * distribution of them and how the radix tree stores them. The maximum bytes
+ * that a TidStore can use is specified by the max_bytes in tidstore_create().
+ *
+ * In non-shared cases, the radix tree uses a slab allocator for each kind of
+ * node class. The most memory consuming case while adding Tids associated
+ * with one page (i.e. during tidstore_add_tids()) is that we allocate the
+ * largest radix tree node in a new slab block, which is approximately 70kB.
+ * Therefore, we deduct 70kB from the maximum bytes.
+ *
+ * In shared cases, DSA allocates memory in segments whose sizes follow a
+ * geometric series that approximately doubles the total DSA size. So we
+ * limit the maximum bytes for a TidStore to 75% of max_bytes. The 75%
+ * threshold works well when the maximum bytes is a power of 2. In other
+ * cases, we use a 60% threshold.
+ */
+#define TIDSTORE_MEMORY_DEDUCT_BYTES (1024L * 70) /* 70kB */
+
+/* Get block number from the key */
+#define KEY_GET_BLKNO(key) \
+ ((BlockNumber) ((key) >> (TIDSTORE_OFFSET_NBITS - TIDSTORE_VALUE_NBITS)))
+
+/* A magic value used to identify our TidStores. */
+#define TIDSTORE_MAGIC 0x826f6a10
+
+#define RT_PREFIX local_rt
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#include "lib/radixtree.h"
+
+#define RT_PREFIX shared_rt
+#define RT_SHMEM
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#include "lib/radixtree.h"
+
+/* The header object for a TidStore */
+typedef struct TidStoreControl
+{
+ /*
+ * 'num_tids' is the number of Tids stored so far. 'max_bytes' is the maximum
+ * bytes a TidStore can use. These two fields are commonly used in both
+ * non-shared case and shared case.
+ */
+ uint32 num_tids;
+ uint64 max_bytes;
+
+ /* The below fields are used only in shared case */
+
+ uint32 magic;
+
+ /* handles for TidStore and radix tree */
+ tidstore_handle handle;
+ shared_rt_handle tree_handle;
+} TidStoreControl;
+
+/* Per-backend state for a TidStore */
+struct TidStore
+{
+ /*
+ * Control object. This is allocated in DSA area 'area' in the shared
+ * case, otherwise in backend-local memory.
+ */
+ TidStoreControl *control;
+
+ /* Storage for Tids */
+ union
+ {
+ local_rt_radix_tree *local;
+ shared_rt_radix_tree *shared;
+ } tree;
+
+ /* DSA area for TidStore if used */
+ dsa_area *area;
+};
+#define TidStoreIsShared(ts) ((ts)->area != NULL)
+
+/* Iterator for TidStore */
+typedef struct TidStoreIter
+{
+ TidStore *ts;
+
+ /* iterator of radix tree */
+ union
+ {
+ shared_rt_iter *shared;
+ local_rt_iter *local;
+ } tree_iter;
+
+ /* we returned all tids? */
+ bool finished;
+
+ /* save for the next iteration */
+ uint64 next_key;
+ uint64 next_val;
+
+ /* output for the caller */
+ TidStoreIterResult result;
+} TidStoreIter;
+
+static void tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, uint64 val);
+static inline uint64 tid_to_key_off(ItemPointer tid, uint32 *off);
+
+/*
+ * Create a TidStore. The returned object is allocated in backend-local memory.
+ * The radix tree for storage is allocated in the DSA area if 'area' is non-NULL.
+ */
+TidStore *
+tidstore_create(uint64 max_bytes, dsa_area *area)
+{
+ TidStore *ts;
+
+ ts = palloc0(sizeof(TidStore));
+
+ /*
+ * Create the radix tree for the main storage.
+ *
+ * We calculate the maximum bytes for the TidStore in different ways
+ * for the non-shared case and the shared case. Please refer to the comment
+ * above TIDSTORE_MEMORY_DEDUCT_BYTES for details.
+ */
+ if (area != NULL)
+ {
+ dsa_pointer dp;
+ float ratio = ((max_bytes & (max_bytes - 1)) == 0) ? 0.75 : 0.6;
+
+ ts->tree.shared = shared_rt_create(CurrentMemoryContext, area);
+
+ dp = dsa_allocate0(area, sizeof(TidStoreControl));
+ ts->control = (TidStoreControl *) dsa_get_address(area, dp);
+ ts->control->max_bytes = (uint64) (max_bytes * ratio);
+ ts->area = area;
+
+ ts->control->magic = TIDSTORE_MAGIC;
+ ts->control->handle = dp;
+ ts->control->tree_handle = shared_rt_get_handle(ts->tree.shared);
+ }
+ else
+ {
+ ts->tree.local = local_rt_create(CurrentMemoryContext);
+
+ ts->control = (TidStoreControl *) palloc0(sizeof(TidStoreControl));
+ ts->control->max_bytes = max_bytes - TIDSTORE_MEMORY_DEDUCT_BYTES;
+ }
+
+ return ts;
+}
+
+/*
+ * Attach to the shared TidStore using a handle. The returned object is
+ * allocated in backend-local memory using the CurrentMemoryContext.
+ */
+TidStore *
+tidstore_attach(dsa_area *area, tidstore_handle handle)
+{
+ TidStore *ts;
+ dsa_pointer control;
+
+ Assert(area != NULL);
+ Assert(DsaPointerIsValid(handle));
+
+ /* create per-backend state */
+ ts = palloc0(sizeof(TidStore));
+
+ /* Find the control object in shared memory */
+ control = handle;
+
+ /* Set up the TidStore */
+ ts->control = (TidStoreControl *) dsa_get_address(area, control);
+ Assert(ts->control->magic == TIDSTORE_MAGIC);
+
+ ts->tree.shared = shared_rt_attach(area, ts->control->tree_handle);
+ ts->area = area;
+
+ return ts;
+}
+
+/*
+ * Detach from a TidStore. This detaches from radix tree and frees the
+ * backend-local resources. The radix tree will continue to exist until
+ * it is either explicitly destroyed, or the area that backs it is returned
+ * to the operating system.
+ */
+void
+tidstore_detach(TidStore *ts)
+{
+ Assert(TidStoreIsShared(ts) && ts->control->magic == TIDSTORE_MAGIC);
+
+ shared_rt_detach(ts->tree.shared);
+ pfree(ts);
+}
+
+/*
+ * Destroy a TidStore, returning all memory. The caller must be certain that
+ * no other backend will attempt to access the TidStore before calling this
+ * function. Other backends must explicitly call tidstore_detach to free up
+ * backend-local memory associated with the TidStore. The backend that calls
+ * tidstore_destroy must not call tidstore_detach.
+ */
+void
+tidstore_destroy(TidStore *ts)
+{
+ if (TidStoreIsShared(ts))
+ {
+ Assert(ts->control->magic == TIDSTORE_MAGIC);
+
+ /*
+ * Vandalize the control block to help catch programming error where
+ * other backends access the memory formerly occupied by this radix tree.
+ */
+ ts->control->magic = 0;
+ dsa_free(ts->area, ts->control->handle);
+ shared_rt_free(ts->tree.shared);
+ }
+ else
+ {
+ pfree(ts->control);
+ local_rt_free(ts->tree.local);
+ }
+
+ pfree(ts);
+}
+
+/* Forget all collected Tids */
+void
+tidstore_reset(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ /* Reset the statistics */
+ ts->control->num_tids = 0;
+
+ /*
+ * Free the current radix tree, and return allocated DSM segments
+ * to the operating system, if necessary. */
+ if (TidStoreIsShared(ts))
+ {
+ shared_rt_free(ts->tree.shared);
+ dsa_trim(ts->area);
+
+ /* Recreate the radix tree */
+ ts->tree.shared = shared_rt_create(CurrentMemoryContext, ts->area);
+
+ /* update the radix tree handle as we recreated it */
+ ts->control->tree_handle = shared_rt_get_handle(ts->tree.shared);
+ }
+ else
+ {
+ local_rt_free(ts->tree.local);
+ ts->tree.local = local_rt_create(CurrentMemoryContext);
+ }
+}
+
+static inline void
+tidstore_insert_kv(TidStore *ts, uint64 key, uint64 val)
+{
+ if (TidStoreIsShared(ts))
+ shared_rt_set(ts->tree.shared, key, val);
+ else
+ local_rt_set(ts->tree.local, key, val);
+}
+
+/*
+ * Add Tids on a block to TidStore. The caller must ensure the offset numbers
+ * in 'offsets' are ordered in ascending order.
+ */
+void
+tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets)
+{
+ uint64 last_key = PG_UINT64_MAX;
+ uint64 key;
+ uint64 val = 0;
+ ItemPointerData tid;
+
+ ItemPointerSetBlockNumber(&tid, blkno);
+
+ for (int i = 0; i < num_offsets; i++)
+ {
+ uint32 off;
+
+ ItemPointerSetOffsetNumber(&tid, offsets[i]);
+
+ key = tid_to_key_off(&tid, &off);
+
+ if (last_key != PG_UINT64_MAX && last_key != key)
+ {
+ /* insert the key-value */
+ tidstore_insert_kv(ts, last_key, val);
+ val = 0;
+ }
+
+ last_key = key;
+ val |= UINT64CONST(1) << off;
+ }
+
+ if (last_key != PG_UINT64_MAX)
+ {
+ /* insert the key-value */
+ tidstore_insert_kv(ts, last_key, val);
+ }
+
+ /* update statistics */
+ ts->control->num_tids += num_offsets;
+}
+
+/* Return true if the given Tid is present in TidStore */
+bool
+tidstore_lookup_tid(TidStore *ts, ItemPointer tid)
+{
+ uint64 key;
+ uint64 val;
+ uint32 off;
+ bool found;
+
+ key = tid_to_key_off(tid, &off);
+
+ found = TidStoreIsShared(ts) ?
+ shared_rt_search(ts->tree.shared, key, &val) :
+ local_rt_search(ts->tree.local, key, &val);
+
+ if (!found)
+ return false;
+
+ return (val & (UINT64CONST(1) << off)) != 0;
+}
+
+/*
+ * Prepare to iterate through a TidStore. The caller must be certain that
+ * no other backend will attempt to update the TidStore during the iteration.
+ */
+TidStoreIter *
+tidstore_begin_iterate(TidStore *ts)
+{
+ TidStoreIter *iter;
+
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ iter = palloc0(sizeof(TidStoreIter));
+ iter->ts = ts;
+ iter->result.blkno = InvalidBlockNumber;
+
+ if (TidStoreIsShared(ts))
+ iter->tree_iter.shared = shared_rt_begin_iterate(ts->tree.shared);
+ else
+ iter->tree_iter.local = local_rt_begin_iterate(ts->tree.local);
+
+ /* If the TidStore is empty, there is nothing to iterate */
+ if (ts->control->num_tids == 0)
+ iter->finished = true;
+
+ return iter;
+}
+
+static inline bool
+tidstore_iter_kv(TidStoreIter *iter, uint64 *key, uint64 *val)
+{
+ if (TidStoreIsShared(iter->ts))
+ return shared_rt_iterate_next(iter->tree_iter.shared, key, val);
+ else
+ return local_rt_iterate_next(iter->tree_iter.local, key, val);
+}
+
+/*
+ * Scan the TidStore and return a TidStoreIterResult representing Tids
+ * in one page. Offset numbers in the result are sorted.
+ */
+TidStoreIterResult *
+tidstore_iterate_next(TidStoreIter *iter)
+{
+ uint64 key;
+ uint64 val;
+ TidStoreIterResult *result = &(iter->result);
+
+ if (iter->finished)
+ return NULL;
+
+ if (BlockNumberIsValid(result->blkno))
+ {
+ result->num_offsets = 0;
+ tidstore_iter_extract_tids(iter, iter->next_key, iter->next_val);
+ }
+
+ while (tidstore_iter_kv(iter, &key, &val))
+ {
+ BlockNumber blkno;
+
+ blkno = KEY_GET_BLKNO(key);
+
+ if (BlockNumberIsValid(result->blkno) && result->blkno != blkno)
+ {
+ /*
+ * Remember the key-value pair for the next block for the
+ * next iteration.
+ */
+ iter->next_key = key;
+ iter->next_val = val;
+ return result;
+ }
+
+ /* Collect tids extracted from the key-value pair */
+ tidstore_iter_extract_tids(iter, key, val);
+ }
+
+ iter->finished = true;
+ return result;
+}
+
+/* Finish an iteration over TidStore */
+void
+tidstore_end_iterate(TidStoreIter *iter)
+{
+ if (TidStoreIsShared(iter->ts))
+ shared_rt_end_iterate(iter->tree_iter.shared);
+ else
+ local_rt_end_iterate(iter->tree_iter.local);
+
+ pfree(iter);
+}
+
+/* Return the number of Tids we collected so far */
+uint64
+tidstore_num_tids(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ return ts->control->num_tids;
+}
+
+/* Return true if the current memory usage of TidStore exceeds the limit */
+bool
+tidstore_is_full(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ return (tidstore_memory_usage(ts) > ts->control->max_bytes);
+}
+
+/* Return the maximum memory TidStore can use */
+uint64
+tidstore_max_memory(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ return ts->control->max_bytes;
+}
+
+/* Return the memory usage of TidStore */
+uint64
+tidstore_memory_usage(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ /*
+ * In the shared case, TidStoreControl and radix_tree are backed by the
+ * same DSA area and rt_memory_usage() returns the value including both.
+ * So we don't need to add the size of TidStoreControl separately.
+ */
+ if (TidStoreIsShared(ts))
+ return (uint64) sizeof(TidStore) + shared_rt_memory_usage(ts->tree.shared);
+ else
+ return (uint64) sizeof(TidStore) + sizeof(TidStoreControl) +
+ local_rt_memory_usage(ts->tree.local);
+}
+
+/*
+ * Get a handle that can be used by other processes to attach to this TidStore
+ */
+tidstore_handle
+tidstore_get_handle(TidStore *ts)
+{
+ Assert(TidStoreIsShared(ts) && ts->control->magic == TIDSTORE_MAGIC);
+
+ return ts->control->handle;
+}
+
+/* Extract Tids from key-value pair */
+static void
+tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, uint64 val)
+{
+ TidStoreIterResult *result = (&iter->result);
+
+ for (int i = 0; i < sizeof(uint64) * BITS_PER_BYTE; i++)
+ {
+ uint64 tid_i;
+ OffsetNumber off;
+
+ if ((val & (UINT64CONST(1) << i)) == 0)
+ continue;
+
+ tid_i = key << TIDSTORE_VALUE_NBITS;
+ tid_i |= i;
+
+ off = tid_i & ((UINT64CONST(1) << TIDSTORE_OFFSET_NBITS) - 1);
+ result->offsets[result->num_offsets++] = off;
+ }
+
+ result->blkno = KEY_GET_BLKNO(key);
+}
+
+/*
+ * Encode a Tid to key and val.
+ */
+static inline uint64
+tid_to_key_off(ItemPointer tid, uint32 *off)
+{
+ uint64 upper;
+ uint64 tid_i;
+
+ tid_i = ItemPointerGetOffsetNumber(tid);
+ tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << TIDSTORE_OFFSET_NBITS;
+
+ *off = tid_i & ((1 << TIDSTORE_VALUE_NBITS) - 1);
+ upper = tid_i >> TIDSTORE_VALUE_NBITS;
+ Assert(*off < (sizeof(uint64) * BITS_PER_BYTE));
+
+ return upper;
+}
diff --git a/src/include/access/tidstore.h b/src/include/access/tidstore.h
new file mode 100644
index 0000000000..4bffdf0920
--- /dev/null
+++ b/src/include/access/tidstore.h
@@ -0,0 +1,49 @@
+/*-------------------------------------------------------------------------
+ *
+ * tidstore.h
+ * Tid storage.
+ *
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/tidstore.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef TIDSTORE_H
+#define TIDSTORE_H
+
+#include "lib/radixtree.h"
+#include "storage/itemptr.h"
+
+typedef dsa_pointer tidstore_handle;
+
+typedef struct TidStore TidStore;
+typedef struct TidStoreIter TidStoreIter;
+
+typedef struct TidStoreIterResult
+{
+ BlockNumber blkno;
+ OffsetNumber offsets[MaxOffsetNumber]; /* XXX: usually not fully used */
+ int num_offsets;
+} TidStoreIterResult;
+
+extern TidStore *tidstore_create(uint64 max_bytes, dsa_area *dsa);
+extern TidStore *tidstore_attach(dsa_area *dsa, dsa_pointer handle);
+extern void tidstore_detach(TidStore *ts);
+extern void tidstore_destroy(TidStore *ts);
+extern void tidstore_reset(TidStore *ts);
+extern void tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets);
+extern bool tidstore_lookup_tid(TidStore *ts, ItemPointer tid);
+extern TidStoreIter * tidstore_begin_iterate(TidStore *ts);
+extern TidStoreIterResult *tidstore_iterate_next(TidStoreIter *iter);
+extern void tidstore_end_iterate(TidStoreIter *iter);
+extern uint64 tidstore_num_tids(TidStore *ts);
+extern bool tidstore_is_full(TidStore *ts);
+extern uint64 tidstore_max_memory(TidStore *ts);
+extern uint64 tidstore_memory_usage(TidStore *ts);
+extern tidstore_handle tidstore_get_handle(TidStore *ts);
+
+#endif /* TIDSTORE_H */
diff --git a/src/test/modules/test_tidstore/Makefile b/src/test/modules/test_tidstore/Makefile
new file mode 100644
index 0000000000..1973963440
--- /dev/null
+++ b/src/test/modules/test_tidstore/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_tidstore/Makefile
+
+MODULE_big = test_tidstore
+OBJS = \
+ $(WIN32RES) \
+ test_tidstore.o
+PGFILEDESC = "test_tidstore - test code for src/backend/access/common/tidstore.c"
+
+EXTENSION = test_tidstore
+DATA = test_tidstore--1.0.sql
+
+REGRESS = test_tidstore
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_tidstore
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_tidstore/expected/test_tidstore.out b/src/test/modules/test_tidstore/expected/test_tidstore.out
new file mode 100644
index 0000000000..7ff2f9af87
--- /dev/null
+++ b/src/test/modules/test_tidstore/expected/test_tidstore.out
@@ -0,0 +1,13 @@
+CREATE EXTENSION test_tidstore;
+--
+-- All the logic is in the test_tidstore() function. It will throw
+-- an error if something fails.
+--
+SELECT test_tidstore();
+NOTICE: testing empty tidstore
+NOTICE: testing basic operations
+ test_tidstore
+---------------
+
+(1 row)
+
diff --git a/src/test/modules/test_tidstore/meson.build b/src/test/modules/test_tidstore/meson.build
new file mode 100644
index 0000000000..3365b073a4
--- /dev/null
+++ b/src/test/modules/test_tidstore/meson.build
@@ -0,0 +1,34 @@
+# FIXME: prevent install during main install, but not during test :/
+
+test_tidstore_sources = files(
+ 'test_tidstore.c',
+)
+
+if host_system == 'windows'
+ test_tidstore_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'test_tidstore',
+ '--FILEDESC', 'test_tidstore - test code for src/backend/access/common/tidstore.c',])
+endif
+
+test_tidstore = shared_module('test_tidstore',
+ test_tidstore_sources,
+ kwargs: pg_mod_args,
+)
+testprep_targets += test_tidstore
+
+install_data(
+ 'test_tidstore.control',
+ 'test_tidstore--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'test_tidstore',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'test_tidstore',
+ ],
+ },
+}
diff --git a/src/test/modules/test_tidstore/sql/test_tidstore.sql b/src/test/modules/test_tidstore/sql/test_tidstore.sql
new file mode 100644
index 0000000000..03aea31815
--- /dev/null
+++ b/src/test/modules/test_tidstore/sql/test_tidstore.sql
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_tidstore;
+
+--
+-- All the logic is in the test_tidstore() function. It will throw
+-- an error if something fails.
+--
+SELECT test_tidstore();
diff --git a/src/test/modules/test_tidstore/test_tidstore--1.0.sql b/src/test/modules/test_tidstore/test_tidstore--1.0.sql
new file mode 100644
index 0000000000..47e9149900
--- /dev/null
+++ b/src/test/modules/test_tidstore/test_tidstore--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_tidstore/test_tidstore--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_tidstore" to load this file. \quit
+
+CREATE FUNCTION test_tidstore()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_tidstore/test_tidstore.control b/src/test/modules/test_tidstore/test_tidstore.control
new file mode 100644
index 0000000000..9b6bd4638f
--- /dev/null
+++ b/src/test/modules/test_tidstore/test_tidstore.control
@@ -0,0 +1,4 @@
+comment = 'Test code for tidstore'
+default_version = '1.0'
+module_pathname = '$libdir/test_tidstore'
+relocatable = true
--
2.39.0
On Mon, Jan 16, 2023 at 3:18 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
On Mon, Jan 16, 2023 at 2:02 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
+ * Add Tids on a block to TidStore. The caller must ensure the offset numbers
+ * in 'offsets' are ordered in ascending order.
Must? What happens otherwise?
It ends up missing TIDs by overwriting the same key with different
values. Is it better to have a bool argument, say need_sort, to sort
the given array if the caller wants?
Now that I've studied it some more, I see what's happening: We need all
bits set in the "value" before we insert it, since it would be too
expensive to retrieve the current value, add one bit, and put it back.
Also, as a consequence of the encoding, part of the tid is in the key, and
part in the value. It makes more sense now, but it needs more than zero
comments.
As for the order, I don't think it's the responsibility of the caller to
guess if it needs sorting -- if unordered offsets lead to data loss, this
function needs to take care of it.
+ uint64 last_key = PG_UINT64_MAX;
I'm having some difficulty understanding this sentinel and how it's
used.
Will improve the logic.
Part of the problem is the English language: "last" can mean "previous" or
"at the end", so maybe some name changes would help.
--
John Naylor
EDB: http://www.enterprisedb.com
On Tue, Jan 17, 2023 at 8:06 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Mon, Jan 16, 2023 at 3:18 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Mon, Jan 16, 2023 at 2:02 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
Attached is an update that mostly has the modest goal of getting CI green again. v19-0003 has squashed the entire radix tree template from previously. I've kept out the perf test module for now -- still needs updating.
[05:44:11.819] test_radixtree.c.obj : error LNK2001: unresolved
external symbol pg_popcount64
[05:44:11.819] src\test\modules\test_radixtree\test_radixtree.dll :
fatal error LNK1120: 1 unresolved externals
Yeah, I'm not sure what's causing that. Since that comes from a debugging function, we could work around it, but it would be nice to understand why, so I'll probably have to experiment on my CI repo.
I'm still confused by this error, because it only occurs in the test module. I successfully built with just 0002 in CI and elsewhere, where bmw_* symbols resolve just fine on all platforms. I've worked around the error in v19-0004 by using the general-purpose pg_popcount() function. We only need to count bits in assert builds, so it doesn't matter a whole lot.
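For reference, the workaround amounts to something like the following hypothetical assert-only helper (the helper name is made up; only the idea of going through the byte-wise pg_popcount() from port/pg_bitutils.h instead of pg_popcount64() is from v19-0004):

#include "postgres.h"
#include "port/pg_bitutils.h"

/*
 * Hypothetical helper: count set bits in a 64-bit word without referencing
 * pg_popcount64, which failed to link in the test module on Windows. It is
 * only used under assertions, so speed is not critical.
 */
static inline int
count_set_bits(uint64 word)
{
	return (int) pg_popcount((const char *) &word, sizeof(word));
}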
I spent today investigating this issue, I found out that on Windows,
libpgport_src.a is not linked when building codes outside of
src/backend unless explicitly linking it. It's not a problem on Linux
etc. but the linker raises a fatal error on Windows. I'm not sure the
right way to fix it but the attached patch resolved the issue on
cfbot. Since it seems not to be related to 0002 patch but maybe the
designed behavior or a problem in meson. We can discuss it on a
separate thread.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachments:
link_pgport_src.patch (application/octet-stream)
diff --git a/src/test/modules/test_radixtree/meson.build b/src/test/modules/test_radixtree/meson.build
index f96bf159d6..3f444ac05e 100644
--- a/src/test/modules/test_radixtree/meson.build
+++ b/src/test/modules/test_radixtree/meson.build
@@ -12,6 +12,7 @@ endif
test_radixtree = shared_module('test_radixtree',
test_radixtree_sources,
+ link_with: [pgport_srv],
kwargs: pg_mod_args,
)
testprep_targets += test_radixtree
On Tue, Jan 17, 2023 at 8:06 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Mon, Jan 16, 2023 at 3:18 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Mon, Jan 16, 2023 at 2:02 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
Attached is an update that mostly has the modest goal of getting CI green again. v19-0003 has squashed the entire radix tree template from previously. I've kept out the perf test module for now -- still needs updating.
[05:44:11.819] test_radixtree.c.obj : error LNK2001: unresolved
external symbol pg_popcount64
[05:44:11.819] src\test\modules\test_radixtree\test_radixtree.dll :
fatal error LNK1120: 1 unresolved externals
Yeah, I'm not sure what's causing that. Since that comes from a debugging function, we could work around it, but it would be nice to understand why, so I'll probably have to experiment on my CI repo.
I'm still confused by this error, because it only occurs in the test module. I successfully built with just 0002 in CI and elsewhere, where bmw_* symbols resolve just fine on all platforms. I've worked around the error in v19-0004 by using the general-purpose pg_popcount() function. We only need to count bits in assert builds, so it doesn't matter a whole lot.
+ /* XXX: do we need to set a callback on exit to detach dsa? */
In the current shared radix tree design, it is the caller's responsibility
to create (or attach to) a DSA area and pass it to RT_CREATE() or
RT_ATTACH(). That enables us to use one DSA not only for the radix tree
but also for other data, which is more flexible. So the caller needs to
detach from the DSA somehow, and I think we don't need to set a
callback here for that.
---
+ dsa_free(tree->dsa, tree->ctl->handle); // XXX
+ //dsa_detach(tree->dsa);
Similar to above, I think we should not detach from the DSA area here.
Given that the DSA area used by the radix tree could be used also by
other data, I think that in RT_FREE() we need to free each radix tree
node allocated in DSA. In lazy vacuum, we check the memory usage
instead of the number of TIDs and need to reset the TidStore after an
index scan. So it does RT_FREE() and dsa_trim() to return DSM segments
to the OS. I've implemented rt_free_recurse() for this purpose in the
v15 version patch.
--
- Assert(tree->root);
+ //Assert(tree->ctl->root);
I think we don't need this assertion in the first place. We check it
at the beginning of the function.
I've removed these in v19-0006.
That sounds like a good idea. It's also worth wondering if we even need RT_NUM_ENTRIES at all, since the caller is capable of keeping track of that if necessary. It's also misnamed, since it's concerned with the number of keys. The vacuum case cares about the number of TIDs, and not number of (encoded) keys. Even if we ever (say) changed the key to blocknumber and value to Bitmapset, the number of keys might not be interesting.
Right. In fact, TidStore doesn't use RT_NUM_ENTRIES.
I've moved it to the test module, which uses it extensively. There, it's clearer what the name is for, so I didn't change it.
It sounds like we should at least make the delete functionality optional. (Side note on optional functions: if an implementation didn't care about iteration or its order, we could optimize insertion into linear nodes)
Agreed.
Done in v19-0007.
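So each instantiation now opts in explicitly; a sketch of the template
usage (RT_USE_DELETE is the switch named in the template header, and
the rest mirrors what tidstore.c already does, so this is illustration
rather than new code):

/* a local (non-shared) tree that also wants deletion */
#define RT_PREFIX local_rt
#define RT_SCOPE static
#define RT_DECLARE
#define RT_DEFINE
#define RT_USE_DELETE			/* makes local_rt_delete() available */
#include "lib/radixtree.h"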
v19-0009 is just a rebase over some more vacuum cleanups.
Thank you for updating the patches!
I've attached new version patches. There is no change from the v19
patches for 0001 through 0006, and the 0004, 0005 and 0006 patches look
good to me; we can merge them into the 0003 patch.
The 0007 patch fixes functions that are defined when RT_DEBUG is set.
These functions might be removed before commit, but they are useful at
least during development. The 0008 patch fixes a bug in
RT_CHUNK_VALUES_ARRAY_SHIFT() and adds tests for it. The 0009 patch
fixes the cfbot issue by linking pgport_srv. The 0010 patch adds
RT_FREE_RECURSE() to free all radix tree nodes allocated in DSA. The
0011 patch updates the copyright etc. The 0012 and 0013 patches are
updated versions that incorporate all the comments I've gotten so far.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachments:
v20-0013-Use-TIDStore-for-storing-dead-tuple-TID-during-l.patch (application/octet-stream)
From 33f4c5ceed5659224e084549a608414f0f1495d4 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Tue, 17 Jan 2023 17:20:37 +0700
Subject: [PATCH v20 13/13] Use TIDStore for storing dead tuple TID during lazy
vacuum
Previously, we used an array of ItemPointerData to store dead tuple
TIDs, which is not space efficient and slow to lookup. Also, we had
the 1GB limit on its size.
This changes it to use TIDStore for this purpose. Since the TIDStore,
backed by the radix tree, incrementally allocates memory, we get rid
of the 1GB limit.
Also, since we are no longer able to exactly estimate the maximum
number of TIDs that can be stored based on the amount of memory, this
also renames the progress columns max_dead_tuples and num_dead_tuples
and reports the progress information in bytes.
Furthermore, since the TIDStore uses the radix tree internally, the
minimum amount of memory required by TIDStore is 1MB, which is the
initial DSA segment size. Due to that, this change increases the
minimum maintenance_work_mem from 1MB to 2MB.
XXX: needs to bump catalog version
---
doc/src/sgml/monitoring.sgml | 8 +-
src/backend/access/heap/vacuumlazy.c | 210 +++++++--------------
src/backend/catalog/system_views.sql | 2 +-
src/backend/commands/vacuum.c | 76 +-------
src/backend/commands/vacuumparallel.c | 64 ++++---
src/backend/storage/lmgr/lwlock.c | 2 +
src/backend/utils/misc/guc_tables.c | 2 +-
src/include/commands/progress.h | 4 +-
src/include/commands/vacuum.h | 25 +--
src/include/storage/lwlock.h | 1 +
src/test/regress/expected/cluster.out | 2 +-
src/test/regress/expected/create_index.out | 2 +-
src/test/regress/expected/rules.out | 4 +-
src/test/regress/sql/cluster.sql | 2 +-
src/test/regress/sql/create_index.sql | 2 +-
15 files changed, 138 insertions(+), 268 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index c9bc091045..68b13de735 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -6844,10 +6844,10 @@ FROM pg_stat_get_backend_idset() AS backendid;
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>max_dead_tuples</structfield> <type>bigint</type>
+ <structfield>max_dead_tuple_bytes</structfield> <type>bigint</type>
</para>
<para>
- Number of dead tuples that we can store before needing to perform
+ Amount of dead tuple data that we can store before needing to perform
an index vacuum cycle, based on
<xref linkend="guc-maintenance-work-mem"/>.
</para></entry>
@@ -6855,10 +6855,10 @@ FROM pg_stat_get_backend_idset() AS backendid;
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>num_dead_tuples</structfield> <type>bigint</type>
+ <structfield>dead_tuple_bytes</structfield> <type>bigint</type>
</para>
<para>
- Number of dead tuples collected since the last index vacuum cycle.
+ Amount of dead tuple data collected since the last index vacuum cycle.
</para></entry>
</row>
</tbody>
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 8f14cf85f3..90f8a5e087 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -40,6 +40,7 @@
#include "access/heapam_xlog.h"
#include "access/htup_details.h"
#include "access/multixact.h"
+#include "access/tidstore.h"
#include "access/transam.h"
#include "access/visibilitymap.h"
#include "access/xact.h"
@@ -188,7 +189,7 @@ typedef struct LVRelState
* lazy_vacuum_heap_rel, which marks the same LP_DEAD line pointers as
* LP_UNUSED during second heap pass.
*/
- VacDeadItems *dead_items; /* TIDs whose index tuples we'll delete */
+ TidStore *dead_items; /* TIDs whose index tuples we'll delete */
BlockNumber rel_pages; /* total number of pages */
BlockNumber scanned_pages; /* # pages examined (not skipped via VM) */
BlockNumber removed_pages; /* # pages removed by relation truncation */
@@ -220,17 +221,21 @@ typedef struct LVRelState
typedef struct LVPagePruneState
{
bool hastup; /* Page prevents rel truncation? */
- bool has_lpdead_items; /* includes existing LP_DEAD items */
+
+ /* collected LP_DEAD items including existing LP_DEAD items */
+ int lpdead_items;
+ OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
/*
* State describes the proper VM bit states to set for the page following
- * pruning and freezing. all_visible implies !has_lpdead_items, but don't
+ * pruning and freezing. all_visible implies !HAS_LPDEAD_ITEMS(), but don't
* trust all_frozen result unless all_visible is also set to true.
*/
bool all_visible; /* Every item visible to all? */
bool all_frozen; /* provided all_visible is also true */
TransactionId visibility_cutoff_xid; /* For recovery conflicts */
} LVPagePruneState;
+#define HAS_LPDEAD_ITEMS(state) (((state).lpdead_items) > 0)
/* Struct for saving and restoring vacuum error information. */
typedef struct LVSavedErrInfo
@@ -259,8 +264,9 @@ static bool lazy_scan_noprune(LVRelState *vacrel, Buffer buf,
static void lazy_vacuum(LVRelState *vacrel);
static bool lazy_vacuum_all_indexes(LVRelState *vacrel);
static void lazy_vacuum_heap_rel(LVRelState *vacrel);
-static int lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
- Buffer buffer, int index, Buffer vmbuffer);
+static void lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
+ OffsetNumber *offsets, int num_offsets,
+ Buffer buffer, Buffer vmbuffer);
static bool lazy_check_wraparound_failsafe(LVRelState *vacrel);
static void lazy_cleanup_all_indexes(LVRelState *vacrel);
static IndexBulkDeleteResult *lazy_vacuum_one_index(Relation indrel,
@@ -825,21 +831,21 @@ lazy_scan_heap(LVRelState *vacrel)
blkno,
next_unskippable_block,
next_fsm_block_to_vacuum = 0;
- VacDeadItems *dead_items = vacrel->dead_items;
+ TidStore *dead_items = vacrel->dead_items;
Buffer vmbuffer = InvalidBuffer;
bool next_unskippable_allvis,
skipping_current_range;
const int initprog_index[] = {
PROGRESS_VACUUM_PHASE,
PROGRESS_VACUUM_TOTAL_HEAP_BLKS,
- PROGRESS_VACUUM_MAX_DEAD_TUPLES
+ PROGRESS_VACUUM_MAX_DEAD_TUPLE_BYTES
};
int64 initprog_val[3];
/* Report that we're scanning the heap, advertising total # of blocks */
initprog_val[0] = PROGRESS_VACUUM_PHASE_SCAN_HEAP;
initprog_val[1] = rel_pages;
- initprog_val[2] = dead_items->max_items;
+ initprog_val[2] = tidstore_max_memory(vacrel->dead_items);
pgstat_progress_update_multi_param(3, initprog_index, initprog_val);
/* Set up an initial range of skippable blocks using the visibility map */
@@ -906,8 +912,7 @@ lazy_scan_heap(LVRelState *vacrel)
* dead_items TIDs, pause and do a cycle of vacuuming before we tackle
* this page.
*/
- Assert(dead_items->max_items >= MaxHeapTuplesPerPage);
- if (dead_items->max_items - dead_items->num_items < MaxHeapTuplesPerPage)
+ if (tidstore_is_full(vacrel->dead_items))
{
/*
* Before beginning index vacuuming, we release any pin we may
@@ -1018,7 +1023,7 @@ lazy_scan_heap(LVRelState *vacrel)
*/
lazy_scan_prune(vacrel, buf, blkno, page, &prunestate);
- Assert(!prunestate.all_visible || !prunestate.has_lpdead_items);
+ Assert(!prunestate.all_visible || !HAS_LPDEAD_ITEMS(prunestate));
/* Remember the location of the last page with nonremovable tuples */
if (prunestate.hastup)
@@ -1034,14 +1039,12 @@ lazy_scan_heap(LVRelState *vacrel)
* performed here can be thought of as the one-pass equivalent of
* a call to lazy_vacuum().
*/
- if (prunestate.has_lpdead_items)
+ if (HAS_LPDEAD_ITEMS(prunestate))
{
Size freespace;
- lazy_vacuum_heap_page(vacrel, blkno, buf, 0, vmbuffer);
-
- /* Forget the LP_DEAD items that we just vacuumed */
- dead_items->num_items = 0;
+ lazy_vacuum_heap_page(vacrel, blkno, prunestate.deadoffsets,
+ prunestate.lpdead_items, buf, vmbuffer);
/*
* Periodically perform FSM vacuuming to make newly-freed
@@ -1078,7 +1081,16 @@ lazy_scan_heap(LVRelState *vacrel)
* with prunestate-driven visibility map and FSM steps (just like
* the two-pass strategy).
*/
- Assert(dead_items->num_items == 0);
+ Assert(tidstore_num_tids(dead_items) == 0);
+ }
+ else if (HAS_LPDEAD_ITEMS(prunestate))
+ {
+ /* Save details of the LP_DEAD items from the page */
+ tidstore_add_tids(dead_items, blkno, prunestate.deadoffsets,
+ prunestate.lpdead_items);
+
+ pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
+ tidstore_memory_usage(dead_items));
}
/*
@@ -1145,7 +1157,7 @@ lazy_scan_heap(LVRelState *vacrel)
* There should never be LP_DEAD items on a page with PD_ALL_VISIBLE
* set, however.
*/
- else if (prunestate.has_lpdead_items && PageIsAllVisible(page))
+ else if (HAS_LPDEAD_ITEMS(prunestate) && PageIsAllVisible(page))
{
elog(WARNING, "page containing LP_DEAD items is marked as all-visible in relation \"%s\" page %u",
vacrel->relname, blkno);
@@ -1193,7 +1205,7 @@ lazy_scan_heap(LVRelState *vacrel)
* Final steps for block: drop cleanup lock, record free space in the
* FSM
*/
- if (prunestate.has_lpdead_items && vacrel->do_index_vacuuming)
+ if (HAS_LPDEAD_ITEMS(prunestate) && vacrel->do_index_vacuuming)
{
/*
* Wait until lazy_vacuum_heap_rel() to save free space. This
@@ -1249,7 +1261,7 @@ lazy_scan_heap(LVRelState *vacrel)
* Do index vacuuming (call each index's ambulkdelete routine), then do
* related heap vacuuming
*/
- if (dead_items->num_items > 0)
+ if (tidstore_num_tids(dead_items) > 0)
lazy_vacuum(vacrel);
/*
@@ -1543,13 +1555,11 @@ lazy_scan_prune(LVRelState *vacrel,
HTSV_Result res;
int tuples_deleted,
tuples_frozen,
- lpdead_items,
live_tuples,
recently_dead_tuples;
int nnewlpdead;
HeapPageFreeze pagefrz;
int64 fpi_before = pgWalUsage.wal_fpi;
- OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
HeapTupleFreeze frozen[MaxHeapTuplesPerPage];
Assert(BufferGetBlockNumber(buf) == blkno);
@@ -1571,7 +1581,6 @@ retry:
pagefrz.NoFreezePageRelminMxid = vacrel->NewRelminMxid;
tuples_deleted = 0;
tuples_frozen = 0;
- lpdead_items = 0;
live_tuples = 0;
recently_dead_tuples = 0;
@@ -1580,9 +1589,9 @@ retry:
*
* We count tuples removed by the pruning step as tuples_deleted. Its
* final value can be thought of as the number of tuples that have been
- * deleted from the table. It should not be confused with lpdead_items;
- * lpdead_items's final value can be thought of as the number of tuples
- * that were deleted from indexes.
+ * deleted from the table. It should not be confused with
+ * prunestate->lpdead_items; prunestate->lpdead_items's final value can
+ * be thought of as the number of tuples that were deleted from indexes.
*/
tuples_deleted = heap_page_prune(rel, buf, vacrel->vistest,
InvalidTransactionId, 0, &nnewlpdead,
@@ -1593,7 +1602,7 @@ retry:
* requiring freezing among remaining tuples with storage
*/
prunestate->hastup = false;
- prunestate->has_lpdead_items = false;
+ prunestate->lpdead_items = 0;
prunestate->all_visible = true;
prunestate->all_frozen = true;
prunestate->visibility_cutoff_xid = InvalidTransactionId;
@@ -1638,7 +1647,7 @@ retry:
* (This is another case where it's useful to anticipate that any
* LP_DEAD items will become LP_UNUSED during the ongoing VACUUM.)
*/
- deadoffsets[lpdead_items++] = offnum;
+ prunestate->deadoffsets[prunestate->lpdead_items++] = offnum;
continue;
}
@@ -1875,7 +1884,7 @@ retry:
*/
#ifdef USE_ASSERT_CHECKING
/* Note that all_frozen value does not matter when !all_visible */
- if (prunestate->all_visible && lpdead_items == 0)
+ if (prunestate->all_visible && prunestate->lpdead_items == 0)
{
TransactionId cutoff;
bool all_frozen;
@@ -1888,28 +1897,9 @@ retry:
}
#endif
- /*
- * Now save details of the LP_DEAD items from the page in vacrel
- */
- if (lpdead_items > 0)
+ if (prunestate->lpdead_items > 0)
{
- VacDeadItems *dead_items = vacrel->dead_items;
- ItemPointerData tmp;
-
vacrel->lpdead_item_pages++;
- prunestate->has_lpdead_items = true;
-
- ItemPointerSetBlockNumber(&tmp, blkno);
-
- for (int i = 0; i < lpdead_items; i++)
- {
- ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
- dead_items->items[dead_items->num_items++] = tmp;
- }
-
- Assert(dead_items->num_items <= dead_items->max_items);
- pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
- dead_items->num_items);
/*
* It was convenient to ignore LP_DEAD items in all_visible earlier on
@@ -1928,7 +1918,7 @@ retry:
/* Finally, add page-local counts to whole-VACUUM counts */
vacrel->tuples_deleted += tuples_deleted;
vacrel->tuples_frozen += tuples_frozen;
- vacrel->lpdead_items += lpdead_items;
+ vacrel->lpdead_items += prunestate->lpdead_items;
vacrel->live_tuples += live_tuples;
vacrel->recently_dead_tuples += recently_dead_tuples;
}
@@ -2129,8 +2119,7 @@ lazy_scan_noprune(LVRelState *vacrel,
}
else
{
- VacDeadItems *dead_items = vacrel->dead_items;
- ItemPointerData tmp;
+ TidStore *dead_items = vacrel->dead_items;
/*
* Page has LP_DEAD items, and so any references/TIDs that remain in
@@ -2139,17 +2128,10 @@ lazy_scan_noprune(LVRelState *vacrel,
*/
vacrel->lpdead_item_pages++;
- ItemPointerSetBlockNumber(&tmp, blkno);
+ tidstore_add_tids(dead_items, blkno, deadoffsets, lpdead_items);
- for (int i = 0; i < lpdead_items; i++)
- {
- ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
- dead_items->items[dead_items->num_items++] = tmp;
- }
-
- Assert(dead_items->num_items <= dead_items->max_items);
- pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
- dead_items->num_items);
+ pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
+ tidstore_memory_usage(dead_items));
vacrel->lpdead_items += lpdead_items;
@@ -2198,7 +2180,7 @@ lazy_vacuum(LVRelState *vacrel)
if (!vacrel->do_index_vacuuming)
{
Assert(!vacrel->do_index_cleanup);
- vacrel->dead_items->num_items = 0;
+ tidstore_reset(vacrel->dead_items);
return;
}
@@ -2227,7 +2209,7 @@ lazy_vacuum(LVRelState *vacrel)
BlockNumber threshold;
Assert(vacrel->num_index_scans == 0);
- Assert(vacrel->lpdead_items == vacrel->dead_items->num_items);
+ Assert(vacrel->lpdead_items == tidstore_num_tids(vacrel->dead_items));
Assert(vacrel->do_index_vacuuming);
Assert(vacrel->do_index_cleanup);
@@ -2254,8 +2236,8 @@ lazy_vacuum(LVRelState *vacrel)
* cases then this may need to be reconsidered.
*/
threshold = (double) vacrel->rel_pages * BYPASS_THRESHOLD_PAGES;
- bypass = (vacrel->lpdead_item_pages < threshold &&
- vacrel->lpdead_items < MAXDEADITEMS(32L * 1024L * 1024L));
+ bypass = (vacrel->lpdead_item_pages < threshold) &&
+ tidstore_memory_usage(vacrel->dead_items) < (32L * 1024L * 1024L);
}
if (bypass)
@@ -2300,7 +2282,7 @@ lazy_vacuum(LVRelState *vacrel)
* Forget the LP_DEAD items that we just vacuumed (or just decided to not
* vacuum)
*/
- vacrel->dead_items->num_items = 0;
+ tidstore_reset(vacrel->dead_items);
}
/*
@@ -2373,7 +2355,7 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
* place).
*/
Assert(vacrel->num_index_scans > 0 ||
- vacrel->dead_items->num_items == vacrel->lpdead_items);
+ tidstore_num_tids(vacrel->dead_items) == vacrel->lpdead_items);
Assert(allindexes || vacrel->failsafe_active);
/*
@@ -2410,10 +2392,11 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
static void
lazy_vacuum_heap_rel(LVRelState *vacrel)
{
- int index = 0;
BlockNumber vacuumed_pages = 0;
Buffer vmbuffer = InvalidBuffer;
LVSavedErrInfo saved_err_info;
+ TidStoreIter *iter;
+ TidStoreIterResult *result;
Assert(vacrel->do_index_vacuuming);
Assert(vacrel->do_index_cleanup);
@@ -2428,7 +2411,8 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
VACUUM_ERRCB_PHASE_VACUUM_HEAP,
InvalidBlockNumber, InvalidOffsetNumber);
- while (index < vacrel->dead_items->num_items)
+ iter = tidstore_begin_iterate(vacrel->dead_items);
+ while ((result = tidstore_iterate_next(iter)) != NULL)
{
BlockNumber blkno;
Buffer buf;
@@ -2437,7 +2421,7 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
vacuum_delay_point();
- blkno = ItemPointerGetBlockNumber(&vacrel->dead_items->items[index]);
+ blkno = result->blkno;
vacrel->blkno = blkno;
/*
@@ -2451,7 +2435,8 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
buf = ReadBufferExtended(vacrel->rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
vacrel->bstrategy);
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
- index = lazy_vacuum_heap_page(vacrel, blkno, buf, index, vmbuffer);
+ lazy_vacuum_heap_page(vacrel, blkno, result->offsets, result->num_offsets,
+ buf, vmbuffer);
/* Now that we've vacuumed the page, record its available space */
page = BufferGetPage(buf);
@@ -2461,6 +2446,7 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
vacuumed_pages++;
}
+ tidstore_end_iterate(iter);
vacrel->blkno = InvalidBlockNumber;
if (BufferIsValid(vmbuffer))
@@ -2470,14 +2456,13 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
* We set all LP_DEAD items from the first heap pass to LP_UNUSED during
* the second heap pass. No more, no less.
*/
- Assert(index > 0);
Assert(vacrel->num_index_scans > 1 ||
- (index == vacrel->lpdead_items &&
+ (tidstore_num_tids(vacrel->dead_items) == vacrel->lpdead_items &&
vacuumed_pages == vacrel->lpdead_item_pages));
ereport(DEBUG2,
- (errmsg("table \"%s\": removed %lld dead item identifiers in %u pages",
- vacrel->relname, (long long) index, vacuumed_pages)));
+ (errmsg("table \"%s\": removed " UINT64_FORMAT "dead item identifiers in %u pages",
+ vacrel->relname, tidstore_num_tids(vacrel->dead_items), vacuumed_pages)));
/* Revert to the previous phase information for error traceback */
restore_vacuum_error_info(vacrel, &saved_err_info);
@@ -2495,11 +2480,10 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
* LP_DEAD item on the page. The return value is the first index immediately
* after all LP_DEAD items for the same page in the array.
*/
-static int
-lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
- int index, Buffer vmbuffer)
+static void
+lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets, Buffer buffer, Buffer vmbuffer)
{
- VacDeadItems *dead_items = vacrel->dead_items;
Page page = BufferGetPage(buffer);
OffsetNumber unused[MaxHeapTuplesPerPage];
int nunused = 0;
@@ -2518,16 +2502,11 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
START_CRIT_SECTION();
- for (; index < dead_items->num_items; index++)
+ for (int i = 0; i < num_offsets; i++)
{
- BlockNumber tblk;
- OffsetNumber toff;
ItemId itemid;
+ OffsetNumber toff = offsets[i];
- tblk = ItemPointerGetBlockNumber(&dead_items->items[index]);
- if (tblk != blkno)
- break; /* past end of tuples for this block */
- toff = ItemPointerGetOffsetNumber(&dead_items->items[index]);
itemid = PageGetItemId(page, toff);
Assert(ItemIdIsDead(itemid) && !ItemIdHasStorage(itemid));
@@ -2597,7 +2576,6 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
/* Revert to the previous phase information for error traceback */
restore_vacuum_error_info(vacrel, &saved_err_info);
- return index;
}
/*
@@ -3093,46 +3071,6 @@ count_nondeletable_pages(LVRelState *vacrel, bool *lock_waiter_detected)
return vacrel->nonempty_pages;
}
-/*
- * Returns the number of dead TIDs that VACUUM should allocate space to
- * store, given a heap rel of size vacrel->rel_pages, and given current
- * maintenance_work_mem setting (or current autovacuum_work_mem setting,
- * when applicable).
- *
- * See the comments at the head of this file for rationale.
- */
-static int
-dead_items_max_items(LVRelState *vacrel)
-{
- int64 max_items;
- int vac_work_mem = IsAutoVacuumWorkerProcess() &&
- autovacuum_work_mem != -1 ?
- autovacuum_work_mem : maintenance_work_mem;
-
- if (vacrel->nindexes > 0)
- {
- BlockNumber rel_pages = vacrel->rel_pages;
-
- max_items = MAXDEADITEMS(vac_work_mem * 1024L);
- max_items = Min(max_items, INT_MAX);
- max_items = Min(max_items, MAXDEADITEMS(MaxAllocSize));
-
- /* curious coding here to ensure the multiplication can't overflow */
- if ((BlockNumber) (max_items / MaxHeapTuplesPerPage) > rel_pages)
- max_items = rel_pages * MaxHeapTuplesPerPage;
-
- /* stay sane if small maintenance_work_mem */
- max_items = Max(max_items, MaxHeapTuplesPerPage);
- }
- else
- {
- /* One-pass case only stores a single heap page's TIDs at a time */
- max_items = MaxHeapTuplesPerPage;
- }
-
- return (int) max_items;
-}
-
/*
* Allocate dead_items (either using palloc, or in dynamic shared memory).
* Sets dead_items in vacrel for caller.
@@ -3143,11 +3081,9 @@ dead_items_max_items(LVRelState *vacrel)
static void
dead_items_alloc(LVRelState *vacrel, int nworkers)
{
- VacDeadItems *dead_items;
- int max_items;
-
- max_items = dead_items_max_items(vacrel);
- Assert(max_items >= MaxHeapTuplesPerPage);
+ int vac_work_mem = IsAutoVacuumWorkerProcess() &&
+ autovacuum_work_mem != -1 ?
+ autovacuum_work_mem * 1024L : maintenance_work_mem * 1024L;
/*
* Initialize state for a parallel vacuum. As of now, only one worker can
@@ -3174,7 +3110,7 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
else
vacrel->pvs = parallel_vacuum_init(vacrel->rel, vacrel->indrels,
vacrel->nindexes, nworkers,
- max_items,
+ vac_work_mem,
vacrel->verbose ? INFO : DEBUG2,
vacrel->bstrategy);
@@ -3187,11 +3123,7 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
}
/* Serial VACUUM case */
- dead_items = (VacDeadItems *) palloc(vac_max_items_to_alloc_size(max_items));
- dead_items->max_items = max_items;
- dead_items->num_items = 0;
-
- vacrel->dead_items = dead_items;
+ vacrel->dead_items = tidstore_create(vac_work_mem, NULL);
}
/*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index d2a8c82900..fdc8a99bba 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1164,7 +1164,7 @@ CREATE VIEW pg_stat_progress_vacuum AS
END AS phase,
S.param2 AS heap_blks_total, S.param3 AS heap_blks_scanned,
S.param4 AS heap_blks_vacuumed, S.param5 AS index_vacuum_count,
- S.param6 AS max_dead_tuples, S.param7 AS num_dead_tuples
+ S.param6 AS max_dead_tuple_bytes, S.param7 AS dead_tuple_bytes
FROM pg_stat_get_progress_info('VACUUM') AS S
LEFT JOIN pg_database D ON S.datid = D.oid;
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 7b1a4b127e..358ad25996 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -97,7 +97,6 @@ static bool vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
static double compute_parallel_delay(void);
static VacOptValue get_vacoptval_from_boolean(DefElem *def);
static bool vac_tid_reaped(ItemPointer itemptr, void *state);
-static int vac_cmp_itemptr(const void *left, const void *right);
/*
* Primary entry point for manual VACUUM and ANALYZE commands
@@ -2303,16 +2302,16 @@ get_vacoptval_from_boolean(DefElem *def)
*/
IndexBulkDeleteResult *
vac_bulkdel_one_index(IndexVacuumInfo *ivinfo, IndexBulkDeleteResult *istat,
- VacDeadItems *dead_items)
+ TidStore *dead_items)
{
/* Do bulk deletion */
istat = index_bulk_delete(ivinfo, istat, vac_tid_reaped,
(void *) dead_items);
ereport(ivinfo->message_level,
- (errmsg("scanned index \"%s\" to remove %d row versions",
+ (errmsg("scanned index \"%s\" to remove " UINT64_FORMAT " row versions",
RelationGetRelationName(ivinfo->index),
- dead_items->num_items)));
+ tidstore_num_tids(dead_items))));
return istat;
}
@@ -2343,18 +2342,6 @@ vac_cleanup_one_index(IndexVacuumInfo *ivinfo, IndexBulkDeleteResult *istat)
return istat;
}
-/*
- * Returns the total required space for VACUUM's dead_items array given a
- * max_items value.
- */
-Size
-vac_max_items_to_alloc_size(int max_items)
-{
- Assert(max_items <= MAXDEADITEMS(MaxAllocSize));
-
- return offsetof(VacDeadItems, items) + sizeof(ItemPointerData) * max_items;
-}
-
/*
* vac_tid_reaped() -- is a particular tid deletable?
*
@@ -2365,60 +2352,7 @@ vac_max_items_to_alloc_size(int max_items)
static bool
vac_tid_reaped(ItemPointer itemptr, void *state)
{
- VacDeadItems *dead_items = (VacDeadItems *) state;
- int64 litem,
- ritem,
- item;
- ItemPointer res;
-
- litem = itemptr_encode(&dead_items->items[0]);
- ritem = itemptr_encode(&dead_items->items[dead_items->num_items - 1]);
- item = itemptr_encode(itemptr);
-
- /*
- * Doing a simple bound check before bsearch() is useful to avoid the
- * extra cost of bsearch(), especially if dead items on the heap are
- * concentrated in a certain range. Since this function is called for
- * every index tuple, it pays to be really fast.
- */
- if (item < litem || item > ritem)
- return false;
-
- res = (ItemPointer) bsearch((void *) itemptr,
- (void *) dead_items->items,
- dead_items->num_items,
- sizeof(ItemPointerData),
- vac_cmp_itemptr);
-
- return (res != NULL);
-}
-
-/*
- * Comparator routines for use with qsort() and bsearch().
- */
-static int
-vac_cmp_itemptr(const void *left, const void *right)
-{
- BlockNumber lblk,
- rblk;
- OffsetNumber loff,
- roff;
-
- lblk = ItemPointerGetBlockNumber((ItemPointer) left);
- rblk = ItemPointerGetBlockNumber((ItemPointer) right);
-
- if (lblk < rblk)
- return -1;
- if (lblk > rblk)
- return 1;
-
- loff = ItemPointerGetOffsetNumber((ItemPointer) left);
- roff = ItemPointerGetOffsetNumber((ItemPointer) right);
-
- if (loff < roff)
- return -1;
- if (loff > roff)
- return 1;
+ TidStore *dead_items = (TidStore *) state;
- return 0;
+ return tidstore_lookup_tid(dead_items, itemptr);
}
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index bcd40c80a1..4c0ce4b7e6 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -44,7 +44,7 @@
* use small integers.
*/
#define PARALLEL_VACUUM_KEY_SHARED 1
-#define PARALLEL_VACUUM_KEY_DEAD_ITEMS 2
+#define PARALLEL_VACUUM_KEY_DSA 2
#define PARALLEL_VACUUM_KEY_QUERY_TEXT 3
#define PARALLEL_VACUUM_KEY_BUFFER_USAGE 4
#define PARALLEL_VACUUM_KEY_WAL_USAGE 5
@@ -103,6 +103,9 @@ typedef struct PVShared
/* Counter for vacuuming and cleanup */
pg_atomic_uint32 idx;
+
+ /* Handle of the shared TidStore */
+ tidstore_handle dead_items_handle;
} PVShared;
/* Status used during parallel index vacuum or cleanup */
@@ -166,7 +169,8 @@ struct ParallelVacuumState
PVIndStats *indstats;
/* Shared dead items space among parallel vacuum workers */
- VacDeadItems *dead_items;
+ TidStore *dead_items;
+ dsa_area *dead_items_area;
/* Points to buffer usage area in DSM */
BufferUsage *buffer_usage;
@@ -222,20 +226,23 @@ static void parallel_vacuum_error_callback(void *arg);
*/
ParallelVacuumState *
parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
- int nrequested_workers, int max_items,
- int elevel, BufferAccessStrategy bstrategy)
+ int nrequested_workers, int vac_work_mem,
+ int elevel,
+ BufferAccessStrategy bstrategy)
{
ParallelVacuumState *pvs;
ParallelContext *pcxt;
PVShared *shared;
- VacDeadItems *dead_items;
+ TidStore *dead_items;
PVIndStats *indstats;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
+ void *area_space;
+ dsa_area *dead_items_dsa;
bool *will_parallel_vacuum;
Size est_indstats_len;
Size est_shared_len;
- Size est_dead_items_len;
+ Size dsa_minsize = dsa_minimum_size();
int nindexes_mwm = 0;
int parallel_workers = 0;
int querylen;
@@ -283,9 +290,8 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_estimate_chunk(&pcxt->estimator, est_shared_len);
shm_toc_estimate_keys(&pcxt->estimator, 1);
- /* Estimate size for dead_items -- PARALLEL_VACUUM_KEY_DEAD_ITEMS */
- est_dead_items_len = vac_max_items_to_alloc_size(max_items);
- shm_toc_estimate_chunk(&pcxt->estimator, est_dead_items_len);
+ /* Estimate size for dead tuple DSA -- PARALLEL_VACUUM_KEY_DSA */
+ shm_toc_estimate_chunk(&pcxt->estimator, dsa_minsize);
shm_toc_estimate_keys(&pcxt->estimator, 1);
/*
@@ -351,6 +357,16 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_INDEX_STATS, indstats);
pvs->indstats = indstats;
+ /* Prepare DSA space for dead items */
+ area_space = shm_toc_allocate(pcxt->toc, dsa_minsize);
+ shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_DSA, area_space);
+ dead_items_dsa = dsa_create_in_place(area_space, dsa_minsize,
+ LWTRANCHE_PARALLEL_VACUUM_DSA,
+ pcxt->seg);
+ dead_items = tidstore_create(vac_work_mem, dead_items_dsa);
+ pvs->dead_items = dead_items;
+ pvs->dead_items_area = dead_items_dsa;
+
/* Prepare shared information */
shared = (PVShared *) shm_toc_allocate(pcxt->toc, est_shared_len);
MemSet(shared, 0, est_shared_len);
@@ -360,6 +376,7 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
(nindexes_mwm > 0) ?
maintenance_work_mem / Min(parallel_workers, nindexes_mwm) :
maintenance_work_mem;
+ shared->dead_items_handle = tidstore_get_handle(dead_items);
pg_atomic_init_u32(&(shared->cost_balance), 0);
pg_atomic_init_u32(&(shared->active_nworkers), 0);
@@ -368,15 +385,6 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_SHARED, shared);
pvs->shared = shared;
- /* Prepare the dead_items space */
- dead_items = (VacDeadItems *) shm_toc_allocate(pcxt->toc,
- est_dead_items_len);
- dead_items->max_items = max_items;
- dead_items->num_items = 0;
- MemSet(dead_items->items, 0, sizeof(ItemPointerData) * max_items);
- shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_DEAD_ITEMS, dead_items);
- pvs->dead_items = dead_items;
-
/*
* Allocate space for each worker's BufferUsage and WalUsage; no need to
* initialize
@@ -434,6 +442,9 @@ parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats)
istats[i] = NULL;
}
+ tidstore_destroy(pvs->dead_items);
+ dsa_detach(pvs->dead_items_area);
+
DestroyParallelContext(pvs->pcxt);
ExitParallelMode();
@@ -442,7 +453,7 @@ parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats)
}
/* Returns the dead items space */
-VacDeadItems *
+TidStore *
parallel_vacuum_get_dead_items(ParallelVacuumState *pvs)
{
return pvs->dead_items;
@@ -940,7 +951,9 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
Relation *indrels;
PVIndStats *indstats;
PVShared *shared;
- VacDeadItems *dead_items;
+ TidStore *dead_items;
+ void *area_space;
+ dsa_area *dead_items_area;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
int nindexes;
@@ -984,10 +997,10 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
PARALLEL_VACUUM_KEY_INDEX_STATS,
false);
- /* Set dead_items space */
- dead_items = (VacDeadItems *) shm_toc_lookup(toc,
- PARALLEL_VACUUM_KEY_DEAD_ITEMS,
- false);
+ /* Set dead items */
+ area_space = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_DSA, false);
+ dead_items_area = dsa_attach_in_place(area_space, seg);
+ dead_items = tidstore_attach(dead_items_area, shared->dead_items_handle);
/* Set cost-based vacuum delay */
VacuumCostActive = (VacuumCostDelay > 0);
@@ -1033,6 +1046,9 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
&wal_usage[ParallelWorkerNumber]);
+ tidstore_detach(pvs.dead_items);
+ dsa_detach(dead_items_area);
+
/* Pop the error context stack */
error_context_stack = errcallback.previous;
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index cbfe329591..4c35af3412 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -188,6 +188,8 @@ static const char *const BuiltinTrancheNames[] = {
"PgStatsHash",
/* LWTRANCHE_PGSTATS_DATA: */
"PgStatsData",
+ /* LWTRANCHE_PARALLEL_VACUUM_DSA: */
+ "ParallelVacuumDSA",
};
StaticAssertDecl(lengthof(BuiltinTrancheNames) ==
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 5025e80f89..edee8a2b2b 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2301,7 +2301,7 @@ struct config_int ConfigureNamesInt[] =
GUC_UNIT_KB
},
&maintenance_work_mem,
- 65536, 1024, MAX_KILOBYTES,
+ 65536, 2048, MAX_KILOBYTES,
NULL, NULL, NULL
},
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index e5add41352..b209d3cf84 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -23,8 +23,8 @@
#define PROGRESS_VACUUM_HEAP_BLKS_SCANNED 2
#define PROGRESS_VACUUM_HEAP_BLKS_VACUUMED 3
#define PROGRESS_VACUUM_NUM_INDEX_VACUUMS 4
-#define PROGRESS_VACUUM_MAX_DEAD_TUPLES 5
-#define PROGRESS_VACUUM_NUM_DEAD_TUPLES 6
+#define PROGRESS_VACUUM_MAX_DEAD_TUPLE_BYTES 5
+#define PROGRESS_VACUUM_DEAD_TUPLE_BYTES 6
/* Phases of vacuum (as advertised via PROGRESS_VACUUM_PHASE) */
#define PROGRESS_VACUUM_PHASE_SCAN_HEAP 1
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 689dbb7702..220d89fff7 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -17,6 +17,7 @@
#include "access/htup.h"
#include "access/genam.h"
#include "access/parallel.h"
+#include "access/tidstore.h"
#include "catalog/pg_class.h"
#include "catalog/pg_statistic.h"
#include "catalog/pg_type.h"
@@ -276,21 +277,6 @@ struct VacuumCutoffs
MultiXactId MultiXactCutoff;
};
-/*
- * VacDeadItems stores TIDs whose index tuples are deleted by index vacuuming.
- */
-typedef struct VacDeadItems
-{
- int max_items; /* # slots allocated in array */
- int num_items; /* current # of entries */
-
- /* Sorted array of TIDs to delete from indexes */
- ItemPointerData items[FLEXIBLE_ARRAY_MEMBER];
-} VacDeadItems;
-
-#define MAXDEADITEMS(avail_mem) \
- (((avail_mem) - offsetof(VacDeadItems, items)) / sizeof(ItemPointerData))
-
/* GUC parameters */
extern PGDLLIMPORT int default_statistics_target; /* PGDLLIMPORT for PostGIS */
extern PGDLLIMPORT int vacuum_freeze_min_age;
@@ -339,18 +325,17 @@ extern Relation vacuum_open_relation(Oid relid, RangeVar *relation,
LOCKMODE lmode);
extern IndexBulkDeleteResult *vac_bulkdel_one_index(IndexVacuumInfo *ivinfo,
IndexBulkDeleteResult *istat,
- VacDeadItems *dead_items);
+ TidStore *dead_items);
extern IndexBulkDeleteResult *vac_cleanup_one_index(IndexVacuumInfo *ivinfo,
IndexBulkDeleteResult *istat);
-extern Size vac_max_items_to_alloc_size(int max_items);
/* in commands/vacuumparallel.c */
extern ParallelVacuumState *parallel_vacuum_init(Relation rel, Relation *indrels,
int nindexes, int nrequested_workers,
- int max_items, int elevel,
- BufferAccessStrategy bstrategy);
+ int vac_work_mem,
+ int elevel, BufferAccessStrategy bstrategy);
extern void parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats);
-extern VacDeadItems *parallel_vacuum_get_dead_items(ParallelVacuumState *pvs);
+extern TidStore *parallel_vacuum_get_dead_items(ParallelVacuumState *pvs);
extern void parallel_vacuum_bulkdel_all_indexes(ParallelVacuumState *pvs,
long num_table_tuples,
int num_index_scans);
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 7b7663e2e1..c9b4741e32 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -205,6 +205,7 @@ typedef enum BuiltinTrancheIds
LWTRANCHE_PGSTATS_DSA,
LWTRANCHE_PGSTATS_HASH,
LWTRANCHE_PGSTATS_DATA,
+ LWTRANCHE_PARALLEL_VACUUM_DSA,
LWTRANCHE_FIRST_USER_DEFINED
} BuiltinTrancheIds;
diff --git a/src/test/regress/expected/cluster.out b/src/test/regress/expected/cluster.out
index 2eec483eaa..e04f50726f 100644
--- a/src/test/regress/expected/cluster.out
+++ b/src/test/regress/expected/cluster.out
@@ -526,7 +526,7 @@ create index cluster_sort on clstr_4 (hundred, thousand, tenthous);
-- ensure we don't use the index in CLUSTER nor the checking SELECTs
set enable_indexscan = off;
-- Use external sort:
-set maintenance_work_mem = '1MB';
+set maintenance_work_mem = '2MB';
cluster clstr_4 using cluster_sort;
select * from
(select hundred, lag(hundred) over () as lhundred,
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 6cd57e3eaa..d1889b9d10 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -1214,7 +1214,7 @@ DROP TABLE unlogged_hash_table;
-- CREATE INDEX hash_ovfl_index ON hash_ovfl_heap USING hash (x int4_ops);
-- Test hash index build tuplesorting. Force hash tuplesort using low
-- maintenance_work_mem setting and fillfactor:
-SET maintenance_work_mem = '1MB';
+SET maintenance_work_mem = '2MB';
CREATE INDEX hash_tuplesort_idx ON tenk1 USING hash (stringu1 name_ops) WITH (fillfactor = 10);
EXPLAIN (COSTS OFF)
SELECT count(*) FROM tenk1 WHERE stringu1 = 'TVAAAA';
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index a969ae63eb..630869255f 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2020,8 +2020,8 @@ pg_stat_progress_vacuum| SELECT s.pid,
s.param3 AS heap_blks_scanned,
s.param4 AS heap_blks_vacuumed,
s.param5 AS index_vacuum_count,
- s.param6 AS max_dead_tuples,
- s.param7 AS num_dead_tuples
+ s.param6 AS max_dead_tuple_bytes,
+ s.param7 AS dead_tuple_bytes
FROM (pg_stat_get_progress_info('VACUUM'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
LEFT JOIN pg_database d ON ((s.datid = d.oid)));
pg_stat_recovery_prefetch| SELECT s.stats_reset,
diff --git a/src/test/regress/sql/cluster.sql b/src/test/regress/sql/cluster.sql
index a4cfaae807..a4cb5b98a5 100644
--- a/src/test/regress/sql/cluster.sql
+++ b/src/test/regress/sql/cluster.sql
@@ -258,7 +258,7 @@ create index cluster_sort on clstr_4 (hundred, thousand, tenthous);
set enable_indexscan = off;
-- Use external sort:
-set maintenance_work_mem = '1MB';
+set maintenance_work_mem = '2MB';
cluster clstr_4 using cluster_sort;
select * from
(select hundred, lag(hundred) over () as lhundred,
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index a3738833b2..edb5e4b4f3 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -367,7 +367,7 @@ DROP TABLE unlogged_hash_table;
-- Test hash index build tuplesorting. Force hash tuplesort using low
-- maintenance_work_mem setting and fillfactor:
-SET maintenance_work_mem = '1MB';
+SET maintenance_work_mem = '2MB';
CREATE INDEX hash_tuplesort_idx ON tenk1 USING hash (stringu1 name_ops) WITH (fillfactor = 10);
EXPLAIN (COSTS OFF)
SELECT count(*) FROM tenk1 WHERE stringu1 = 'TVAAAA';
--
2.31.1
v20-0010-Free-all-radix-tree-node-recursively.patch (application/octet-stream)
From cc2a07008e0eedef43c67c8ef9b55560ce2858b6 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Thu, 19 Jan 2023 14:50:37 +0900
Subject: [PATCH v20 10/13] Free all radix tree node recursively.
---
src/include/lib/radixtree.h | 78 +++++++++++++++++++++++++++++++++++++
1 file changed, 78 insertions(+)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 4ed463ba51..fe94335d53 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -127,6 +127,7 @@
#define RT_ALLOC_NODE RT_MAKE_NAME(alloc_node)
#define RT_INIT_NODE RT_MAKE_NAME(init_node)
#define RT_FREE_NODE RT_MAKE_NAME(free_node)
+#define RT_FREE_RECURSE RT_MAKE_NAME(free_recurse)
#define RT_EXTEND RT_MAKE_NAME(extend)
#define RT_SET_EXTEND RT_MAKE_NAME(set_extend)
//#define RT_GROW_NODE_KIND RT_MAKE_NAME(grow_node_kind)
@@ -1408,6 +1409,78 @@ RT_GET_HANDLE(RT_RADIX_TREE *tree)
Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
return tree->ctl->handle;
}
+
+/*
+ * Recursively free all nodes allocated to the DSA area.
+ */
+static inline void
+RT_FREE_RECURSE(RT_RADIX_TREE *tree, RT_PTR_ALLOC ptr)
+{
+ RT_PTR_LOCAL node = RT_PTR_GET_LOCAL(tree, ptr);
+
+ check_stack_depth();
+ CHECK_FOR_INTERRUPTS();
+
+ /* The leaf node doesn't have child pointers */
+ if (NODE_IS_LEAF(node))
+ {
+ dsa_free(tree->dsa, ptr);
+ return;
+ }
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ RT_NODE_INNER_4 *n4 = (RT_NODE_INNER_4 *) node;
+
+ for (int i = 0; i < n4->base.n.count; i++)
+ RT_FREE_RECURSE(tree, n4->children[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE_INNER_32 *n32 = (RT_NODE_INNER_32 *) node;
+
+ for (int i = 0; i < n32->base.n.count; i++)
+ RT_FREE_RECURSE(tree, n32->children[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE_INNER_125 *n125 = (RT_NODE_INNER_125 *) node;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(&n125->base, i))
+ continue;
+
+ RT_FREE_RECURSE(tree, RT_NODE_INNER_125_GET_CHILD(n125, i));
+ }
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE_INNER_256 *n256 = (RT_NODE_INNER_256 *) node;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
+ continue;
+
+ RT_FREE_RECURSE(tree, RT_NODE_INNER_256_GET_CHILD(n256, i));
+ }
+
+ break;
+ }
+ }
+
+ /* Free the inner node */
+ dsa_free(tree->dsa, ptr);
+}
#endif
/*
@@ -1419,6 +1492,10 @@ RT_FREE(RT_RADIX_TREE *tree)
#ifdef RT_SHMEM
Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+ /* Free all memory used for radix tree nodes */
+ if (RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ RT_FREE_RECURSE(tree, tree->ctl->root);
+
/*
* Vandalize the control block to help catch programming error where
* other backends access the memory formerly occupied by this radix tree.
@@ -2197,6 +2274,7 @@ RT_DUMP(RT_RADIX_TREE *tree)
#undef RT_ALLOC_NODE
#undef RT_INIT_NODE
#undef RT_FREE_NODE
+#undef RT_FREE_RECURSE
#undef RT_EXTEND
#undef RT_SET_EXTEND
#undef RT_GROW_NODE_KIND
--
2.31.1
v20-0011-Update-Copyright-and-Identification.patch (application/octet-stream)
From d458feb13ffa693e635e68592339e4be837f2b2b Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Thu, 19 Jan 2023 23:49:54 +0900
Subject: [PATCH v20 11/13] Update Copyright and Identification.
---
src/include/lib/radixtree.h | 6 +++---
src/test/modules/test_radixtree/meson.build | 2 +-
src/test/modules/test_radixtree/test_radixtree.c | 2 +-
3 files changed, 5 insertions(+), 5 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index fe94335d53..97cccdc9ca 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -1,6 +1,6 @@
/*-------------------------------------------------------------------------
*
- * radixtree.c
+ * radixtree.h
* Implementation for adaptive radix tree.
*
* This module employs the idea from the paper "The Adaptive Radix Tree: ARTful
@@ -75,10 +75,10 @@
* RT_DELETE - Delete a key-value pair. Declared/define if RT_USE_DELETE is defined
*
*
- * Copyright (c) 2022, PostgreSQL Global Development Group
+ * Copyright (c) 2023, PostgreSQL Global Development Group
*
* IDENTIFICATION
- * src/backend/lib/radixtree.c
+ * src/include/lib/radixtree.h
*
*-------------------------------------------------------------------------
*/
diff --git a/src/test/modules/test_radixtree/meson.build b/src/test/modules/test_radixtree/meson.build
index 72c91d0b7a..6add06bbdb 100644
--- a/src/test/modules/test_radixtree/meson.build
+++ b/src/test/modules/test_radixtree/meson.build
@@ -7,7 +7,7 @@ test_radixtree_sources = files(
if host_system == 'windows'
test_radixtree_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
'--NAME', 'test_radixtree',
- '--FILEDESC', 'test_radixtree - test code for src/backend/lib/radixtree.c',])
+ '--FILEDESC', 'test_radixtree - test code for src/include/lib/radixtree.h',])
endif
test_radixtree = shared_module('test_radixtree',
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
index 4b250be3f9..d8323f587f 100644
--- a/src/test/modules/test_radixtree/test_radixtree.c
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -3,7 +3,7 @@
* test_radixtree.c
* Test radixtree set data structure.
*
- * Copyright (c) 2022, PostgreSQL Global Development Group
+ * Copyright (c) 2023, PostgreSQL Global Development Group
*
* IDENTIFICATION
* src/test/modules/test_radixtree/test_radixtree.c
--
2.31.1
v20-0009-add-link-to-pgport_srv-in-test_radixtree.patch (application/octet-stream)
From cc9e2b8b0614e955231f45bfcedd8cfee1372683 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Thu, 19 Jan 2023 09:53:48 +0900
Subject: [PATCH v20 09/13] add link to pgport_srv in test_radixtree.
---
src/test/modules/test_radixtree/meson.build | 1 +
1 file changed, 1 insertion(+)
diff --git a/src/test/modules/test_radixtree/meson.build b/src/test/modules/test_radixtree/meson.build
index f96bf159d6..72c91d0b7a 100644
--- a/src/test/modules/test_radixtree/meson.build
+++ b/src/test/modules/test_radixtree/meson.build
@@ -12,6 +12,7 @@ endif
test_radixtree = shared_module('test_radixtree',
test_radixtree_sources,
+ link_with: pgport_srv,
kwargs: pg_mod_args,
)
testprep_targets += test_radixtree
--
2.31.1
v20-0012-Add-TIDStore-to-store-sets-of-TIDs-ItemPointerDa.patch (application/octet-stream)
From 21d455583898f55e2aa24419b35e4ac34cde4377 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 23 Dec 2022 23:50:39 +0900
Subject: [PATCH v20 12/13] Add TIDStore, to store sets of TIDs
(ItemPointerData) efficiently.
The TIDStore is designed to store large sets of TIDs efficiently, and
is backed by the radix tree. A TID is encoded into a 64-bit key and
value and inserted into the radix tree.
The TIDStore is not used for anything yet, aside from the test code,
but the follow up patch integrates the TIDStore with lazy vacuum,
reducing lazy vacuum memory usage and lifting the 1GB limit on its
size, by storing the list of dead TIDs more efficiently.
This includes a unit test module, in src/test/modules/test_tidstore.
---
doc/src/sgml/monitoring.sgml | 4 +
src/backend/access/common/Makefile | 1 +
src/backend/access/common/meson.build | 1 +
src/backend/access/common/tidstore.c | 624 ++++++++++++++++++
src/backend/storage/lmgr/lwlock.c | 2 +
src/include/access/tidstore.h | 49 ++
src/include/storage/lwlock.h | 1 +
src/test/modules/Makefile | 1 +
src/test/modules/meson.build | 1 +
src/test/modules/test_tidstore/Makefile | 23 +
.../test_tidstore/expected/test_tidstore.out | 13 +
src/test/modules/test_tidstore/meson.build | 35 +
.../test_tidstore/sql/test_tidstore.sql | 7 +
.../test_tidstore/test_tidstore--1.0.sql | 8 +
.../modules/test_tidstore/test_tidstore.c | 189 ++++++
.../test_tidstore/test_tidstore.control | 4 +
16 files changed, 963 insertions(+)
create mode 100644 src/backend/access/common/tidstore.c
create mode 100644 src/include/access/tidstore.h
create mode 100644 src/test/modules/test_tidstore/Makefile
create mode 100644 src/test/modules/test_tidstore/expected/test_tidstore.out
create mode 100644 src/test/modules/test_tidstore/meson.build
create mode 100644 src/test/modules/test_tidstore/sql/test_tidstore.sql
create mode 100644 src/test/modules/test_tidstore/test_tidstore--1.0.sql
create mode 100644 src/test/modules/test_tidstore/test_tidstore.c
create mode 100644 src/test/modules/test_tidstore/test_tidstore.control
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 358d2ff90f..c9bc091045 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -2180,6 +2180,10 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
<entry>Waiting to access a shared TID bitmap during a parallel bitmap
index scan.</entry>
</row>
+ <row>
+ <entry><literal>SharedTidStore</literal></entry>
+ <entry>Waiting to access a shared TID store.</entry>
+ </row>
<row>
<entry><literal>SharedTupleStore</literal></entry>
<entry>Waiting to access a shared tuple store during parallel
diff --git a/src/backend/access/common/Makefile b/src/backend/access/common/Makefile
index b9aff0ccfd..67b8cc6108 100644
--- a/src/backend/access/common/Makefile
+++ b/src/backend/access/common/Makefile
@@ -27,6 +27,7 @@ OBJS = \
syncscan.o \
toast_compression.o \
toast_internals.o \
+ tidstore.o \
tupconvert.o \
tupdesc.o
diff --git a/src/backend/access/common/meson.build b/src/backend/access/common/meson.build
index f5ac17b498..fce19c09ce 100644
--- a/src/backend/access/common/meson.build
+++ b/src/backend/access/common/meson.build
@@ -15,6 +15,7 @@ backend_sources += files(
'syncscan.c',
'toast_compression.c',
'toast_internals.c',
+ 'tidstore.c',
'tupconvert.c',
'tupdesc.c',
)
diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
new file mode 100644
index 0000000000..fa55793227
--- /dev/null
+++ b/src/backend/access/common/tidstore.c
@@ -0,0 +1,624 @@
+/*-------------------------------------------------------------------------
+ *
+ * tidstore.c
+ * Tid (ItemPointerData) storage implementation.
+ *
+ * This module provides an in-memory data structure to store Tids (ItemPointer).
+ * Internally, a Tid is encoded as a pair of 64-bit key and 64-bit value, and
+ * stored in the radix tree.
+ *
+ * A TidStore can be shared among parallel worker processes by passing a DSA
+ * area to tidstore_create(). Other backends can attach to the shared TidStore
+ * with tidstore_attach(). It supports concurrent updates, but only one process
+ * is allowed to iterate over the TidStore at a time.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/common/tidstore.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "access/tidstore.h"
+#include "miscadmin.h"
+#include "port/pg_bitutils.h"
+#include "storage/lwlock.h"
+#include "utils/dsa.h"
+#include "utils/memutils.h"
+
+/*
+ * For encoding purposes, item pointers are represented as a pair of 64-bit
+ * key and 64-bit value. First, we construct 64-bit unsigned integer key that
+ * combines the block number and the offset number. The lowest 11 bits represent
+ * the offset number, and the next 32 bits are block number. That is, only 43
+ * bits are used:
+ *
+ * XXXXXXXX XXXYYYYY YYYYYYYY YYYYYYYY YYYYYYYY YYYuuuu
+ *
+ * X = bits used for offset number
+ * Y = bits used for block number
+ * u = unused bit
+ *
+ * 11 bits are enough for the offset number, because MaxHeapTuplesPerPage < 2^11
+ * on all supported block sizes (TIDSTORE_OFFSET_NBITS). We are frugal with
+ * the bits, because smaller keys could help keep the radix tree shallow.
+ *
+ * The 64-bit value is the bitmap representation of the lowest 6 bits, and
+ * the rest 37 bits are used as the key:
+ *
+ * value = bitmap representation of XXXXXX
+ * key = XXXXXYYY YYYYYYYY YYYYYYYY YYYYYYYY YYYYYuu
+ *
+ * The maximum height of the radix tree is 5.
+ *
+ * XXX: if we want to support non-heap table AM that want to use the full
+ * range of possible offset numbers, we'll need to reconsider
+ * TIDSTORE_OFFSET_NBITS value.
+ */
+#define TIDSTORE_OFFSET_NBITS 11
+#define TIDSTORE_VALUE_NBITS 6
+
+/*
+ * Memory consumption depends on the number of Tids stored, but also on their
+ * distribution, how the radix tree stores them, and the memory management
+ * that backs the radix tree. The maximum bytes that a TidStore can
+ * use is specified by the max_bytes in tidstore_create(). We want the total
+ * amount of memory consumption not to exceed the max_bytes.
+ *
+ * In non-shared cases, the radix tree uses slab allocators for each kind of
+ * node class. The most memory consuming case while adding Tids associated
+ * with one page (i.e. during tidstore_add_tids()) is that we allocate the
+ * largest radix tree node in a new slab block, which is approximately 70kB.
+ * Therefore, we deduct 70kB from the maximum bytes.
+ *
+ * In shared cases, DSA allocates the memory segments big enough to follow
+ * a geometric series that approximately doubles the total DSA size (see
+ * make_new_segment() in dsa.c). We simulated how DSA increases the segment
+ * size, and the simulation showed that the 75% threshold for the maximum
+ * bytes works well when it is a power of 2, and the 60% threshold
+ * works for other cases.
+ */
+#define TIDSTORE_LOCAL_MAX_MEMORY_DEDUCT (1024L * 70) /* 70kB */
+#define TIDSTORE_SHARED_MAX_MEMORY_RATIO_PO2 (float) 0.75
+#define TIDSTORE_SHARED_MAX_MEMORY_RATIO (float) 0.6
+
+#define KEY_GET_BLKNO(key) \
+ ((BlockNumber) ((key) >> (TIDSTORE_OFFSET_NBITS - TIDSTORE_VALUE_NBITS)))
+#define BLKNO_GET_KEY(blkno) \
+ (((uint64) (blkno) << (TIDSTORE_OFFSET_NBITS - TIDSTORE_VALUE_NBITS)))
+
+/* A magic value used to identify our TidStores. */
+#define TIDSTORE_MAGIC 0x826f6a10
+
+#define RT_PREFIX local_rt
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#include "lib/radixtree.h"
+
+#define RT_PREFIX shared_rt
+#define RT_SHMEM
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#include "lib/radixtree.h"
+
+/* The header object for a TidStore */
+typedef struct TidStoreControl
+{
+ /*
+ * 'num_tids' is the number of Tids stored so far. 'max_bytes' is the maximum
+ * bytes a TidStore can use. These two fields are commonly used in both
+ * non-shared case and shared case.
+ */
+ uint64 num_tids;
+ uint64 max_bytes;
+
+ /* The below fields are used only in shared case */
+
+ uint32 magic;
+
+ /* protect the shared fields */
+ LWLock lock;
+
+ /* handles for TidStore and radix tree */
+ tidstore_handle handle;
+ shared_rt_handle tree_handle;
+} TidStoreControl;
+
+/* Per-backend state for a TidStore */
+struct TidStore
+{
+ /*
+ * Control object. This is allocated in DSA area 'area' in the shared
+ * case, otherwise in backend-local memory.
+ */
+ TidStoreControl *control;
+
+ /* Storage for Tids. Use either one depending on TidStoreIsShared() */
+ union
+ {
+ local_rt_radix_tree *local;
+ shared_rt_radix_tree *shared;
+ } tree;
+
+ /* DSA area for TidStore if used */
+ dsa_area *area;
+};
+#define TidStoreIsShared(ts) ((ts)->area != NULL)
+
+/* Iterator for TidStore */
+typedef struct TidStoreIter
+{
+ TidStore *ts;
+
+ /* iterator of radix tree. Use either one depending on TidStoreIsShared() */
+ union
+ {
+ shared_rt_iter *shared;
+ local_rt_iter *local;
+ } tree_iter;
+
+ /* we returned all tids? */
+ bool finished;
+
+ /* save for the next iteration */
+ uint64 next_key;
+ uint64 next_val;
+
+ /* output for the caller */
+ TidStoreIterResult result;
+} TidStoreIter;
+
+static void tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, uint64 val);
+static inline uint64 tid_to_key_off(ItemPointer tid, uint32 *off);
+
+/*
+ * Create a TidStore. The returned object is allocated in backend-local memory.
+ * The radix tree for storage is allocated in the DSA area if 'area' is non-NULL.
+ */
+TidStore *
+tidstore_create(uint64 max_bytes, dsa_area *area)
+{
+ TidStore *ts;
+
+ ts = palloc0(sizeof(TidStore));
+
+ /*
+ * Create the radix tree for the main storage.
+ */
+ if (area != NULL)
+ {
+ dsa_pointer dp;
+ float ratio = ((max_bytes & (max_bytes - 1)) == 0)
+ ? TIDSTORE_SHARED_MAX_MEMORY_RATIO_PO2
+ : TIDSTORE_SHARED_MAX_MEMORY_RATIO;
+
+ ts->tree.shared = shared_rt_create(CurrentMemoryContext, area);
+
+ dp = dsa_allocate0(area, sizeof(TidStoreControl));
+ ts->control = (TidStoreControl *) dsa_get_address(area, dp);
+ ts->control->max_bytes =(uint64) (max_bytes * ratio);
+ ts->area = area;
+
+ ts->control->magic = TIDSTORE_MAGIC;
+ LWLockInitialize(&ts->control->lock, LWTRANCHE_SHARED_TIDSTORE);
+ ts->control->handle = dp;
+ ts->control->tree_handle = shared_rt_get_handle(ts->tree.shared);
+ }
+ else
+ {
+ ts->tree.local = local_rt_create(CurrentMemoryContext);
+
+ ts->control = (TidStoreControl *) palloc0(sizeof(TidStoreControl));
+ ts->control->max_bytes = max_bytes - TIDSTORE_LOCAL_MAX_MEMORY_DEDUCT;
+ }
+
+ return ts;
+}
+
+/*
+ * Attach to the shared TidStore using a handle. The returned object is
+ * allocated in backend-local memory using the CurrentMemoryContext.
+ */
+TidStore *
+tidstore_attach(dsa_area *area, tidstore_handle handle)
+{
+ TidStore *ts;
+ dsa_pointer control;
+
+ Assert(area != NULL);
+ Assert(DsaPointerIsValid(handle));
+
+ /* create per-backend state */
+ ts = palloc0(sizeof(TidStore));
+
+ /* Find the control object in shared memory */
+ control = handle;
+
+ /* Set up the TidStore */
+ ts->control = (TidStoreControl *) dsa_get_address(area, control);
+ Assert(ts->control->magic == TIDSTORE_MAGIC);
+
+ ts->tree.shared = shared_rt_attach(area, ts->control->tree_handle);
+ ts->area = area;
+
+ return ts;
+}
+
+/*
+ * Detach from a TidStore. This detaches from the radix tree and frees the
+ * backend-local resources. The radix tree will continue to exist until
+ * it is either explicitly destroyed, or the area that backs it is returned
+ * to the operating system.
+ */
+void
+tidstore_detach(TidStore *ts)
+{
+ Assert(TidStoreIsShared(ts) && ts->control->magic == TIDSTORE_MAGIC);
+
+ shared_rt_detach(ts->tree.shared);
+ pfree(ts);
+}
+
+/*
+ * Destroy a TidStore, returning all memory. The caller must be certain that
+ * no other backend will attempt to access the TidStore before calling this
+ * function. Other backends must explicitly call tidstore_detach to free up
+ * backend-local memory associated with the TidStore. The backend that calls
+ * tidstore_destroy must not call tidstore_detach.
+ */
+void
+tidstore_destroy(TidStore *ts)
+{
+ if (TidStoreIsShared(ts))
+ {
+ Assert(ts->control->magic == TIDSTORE_MAGIC);
+
+ /*
+		 * Vandalize the control block to help catch programming errors where
+		 * other backends access the memory formerly occupied by this
+		 * TidStore.
+ */
+ ts->control->magic = 0;
+ dsa_free(ts->area, ts->control->handle);
+ shared_rt_free(ts->tree.shared);
+ }
+ else
+ {
+ pfree(ts->control);
+ local_rt_free(ts->tree.local);
+ }
+
+ pfree(ts);
+}
+
+/* Forget all collected Tids */
+void
+tidstore_reset(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ if (TidStoreIsShared(ts))
+ {
+ LWLockAcquire(&ts->control->lock, LW_EXCLUSIVE);
+
+ /*
+ * Free the radix tree and return allocated DSA segments to
+ * the operating system.
+ */
+ shared_rt_free(ts->tree.shared);
+ dsa_trim(ts->area);
+
+ /* Recreate the radix tree */
+ ts->tree.shared = shared_rt_create(CurrentMemoryContext, ts->area);
+
+ /* update the radix tree handle as we recreated it */
+ ts->control->tree_handle = shared_rt_get_handle(ts->tree.shared);
+
+ /* Reset the statistics */
+ ts->control->num_tids = 0;
+
+ LWLockRelease(&ts->control->lock);
+ }
+ else
+ {
+ local_rt_free(ts->tree.local);
+ ts->tree.local = local_rt_create(CurrentMemoryContext);
+
+ /* Reset the statistics */
+ ts->control->num_tids = 0;
+ }
+}
+
+static inline void
+tidstore_insert_kv(TidStore *ts, uint64 key, uint64 val)
+{
+ if (TidStoreIsShared(ts))
+ {
+ /*
+ * Since the shared radix tree supports concurrent insert,
+ * we don't need to acquire the lock.
+ */
+ shared_rt_set(ts->tree.shared, key, val);
+ }
+ else
+ local_rt_set(ts->tree.local, key, val);
+}
+
+/* Add Tids on a block to TidStore */
+void
+tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets)
+{
+#define NUM_KEYS_PER_BLOCK (1 << (TIDSTORE_OFFSET_NBITS - TIDSTORE_VALUE_NBITS))
+ ItemPointerData tid;
+ uint64 key_base;
+ uint64 values[NUM_KEYS_PER_BLOCK] = {0};
+
+ ItemPointerSetBlockNumber(&tid, blkno);
+ key_base = BLKNO_GET_KEY(blkno);
+
+ for (int i = 0; i < num_offsets; i++)
+ {
+ uint64 key;
+ uint32 off;
+ int idx;
+
+ ItemPointerSetOffsetNumber(&tid, offsets[i]);
+
+ /* encode the Tid to key and val */
+ key = tid_to_key_off(&tid, &off);
+
+ idx = key - key_base;
+ Assert(idx >= 0 && idx < NUM_KEYS_PER_BLOCK);
+
+ values[idx] |= UINT64CONST(1) << off;
+ }
+
+ /* insert the calculated key-values to the tree */
+ for (int i = 0; i < NUM_KEYS_PER_BLOCK; i++)
+ {
+ if (values[i])
+ {
+ uint64 key = key_base + i;
+
+ tidstore_insert_kv(ts, key, values[i]);
+ }
+ }
+
+ if (TidStoreIsShared(ts))
+ LWLockAcquire(&ts->control->lock, LW_EXCLUSIVE);
+
+ /* update statistics */
+ ts->control->num_tids += num_offsets;
+
+ if (TidStoreIsShared(ts))
+ LWLockRelease(&ts->control->lock);
+}
+
+/* Return true if the given Tid is present in TidStore */
+bool
+tidstore_lookup_tid(TidStore *ts, ItemPointer tid)
+{
+ uint64 key;
+ uint64 val;
+ uint32 off;
+ bool found;
+
+ key = tid_to_key_off(tid, &off);
+
+ found = TidStoreIsShared(ts) ?
+ shared_rt_search(ts->tree.shared, key, &val) :
+ local_rt_search(ts->tree.local, key, &val);
+
+ if (!found)
+ return false;
+
+ return (val & (UINT64CONST(1) << off)) != 0;
+}
+
+/*
+ * Prepare to iterate through a TidStore. The caller must be certain that
+ * no other backend will attempt to update the TidStore during the iteration.
+ */
+TidStoreIter *
+tidstore_begin_iterate(TidStore *ts)
+{
+ TidStoreIter *iter;
+
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ iter = palloc0(sizeof(TidStoreIter));
+ iter->ts = ts;
+ iter->result.blkno = InvalidBlockNumber;
+
+ if (TidStoreIsShared(ts))
+ iter->tree_iter.shared = shared_rt_begin_iterate(ts->tree.shared);
+ else
+ iter->tree_iter.local = local_rt_begin_iterate(ts->tree.local);
+
+	/* If the TidStore is empty, there is nothing to iterate */
+ if (tidstore_num_tids(ts) == 0)
+ iter->finished = true;
+
+ return iter;
+}
+
+static inline bool
+tidstore_iter_kv(TidStoreIter *iter, uint64 *key, uint64 *val)
+{
+ if (TidStoreIsShared(iter->ts))
+ return shared_rt_iterate_next(iter->tree_iter.shared, key, val);
+ else
+ return local_rt_iterate_next(iter->tree_iter.local, key, val);
+}
+
+/*
+ * Scan the TidStore and return a TidStoreIterResult representing Tids
+ * in one page. Offset numbers in the result are sorted.
+ */
+TidStoreIterResult *
+tidstore_iterate_next(TidStoreIter *iter)
+{
+ uint64 key;
+ uint64 val;
+ TidStoreIterResult *result = &(iter->result);
+
+ if (iter->finished)
+ return NULL;
+
+ if (BlockNumberIsValid(result->blkno))
+ {
+ result->num_offsets = 0;
+ tidstore_iter_extract_tids(iter, iter->next_key, iter->next_val);
+ }
+
+ while (tidstore_iter_kv(iter, &key, &val))
+ {
+ BlockNumber blkno;
+
+ blkno = KEY_GET_BLKNO(key);
+
+ if (BlockNumberIsValid(result->blkno) && result->blkno != blkno)
+ {
+ /*
+ * Remember the key-value pair for the next block for the
+ * next iteration.
+ */
+ iter->next_key = key;
+ iter->next_val = val;
+ return result;
+ }
+
+ /* Collect tids extracted from the key-value pair */
+ tidstore_iter_extract_tids(iter, key, val);
+ }
+
+ iter->finished = true;
+ return result;
+}
+
+/* Finish an iteration over TidStore */
+void
+tidstore_end_iterate(TidStoreIter *iter)
+{
+ if (TidStoreIsShared(iter->ts))
+ shared_rt_end_iterate(iter->tree_iter.shared);
+ else
+ local_rt_end_iterate(iter->tree_iter.local);
+
+ pfree(iter);
+}
+
+/* Return the number of Tids we collected so far */
+uint64
+tidstore_num_tids(TidStore *ts)
+{
+ uint64 num_tids;
+
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+	if (!TidStoreIsShared(ts))
+ return ts->control->num_tids;
+
+ LWLockAcquire(&ts->control->lock, LW_SHARED);
+ num_tids = ts->control->num_tids;
+ LWLockRelease(&ts->control->lock);
+
+ return num_tids;
+}
+
+/* Return true if the current memory usage of TidStore exceeds the limit */
+bool
+tidstore_is_full(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ return (tidstore_memory_usage(ts) > ts->control->max_bytes);
+}
+
+/* Return the maximum memory TidStore can use */
+uint64
+tidstore_max_memory(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ return ts->control->max_bytes;
+}
+
+/* Return the memory usage of TidStore */
+uint64
+tidstore_memory_usage(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ /*
+ * In the shared case, TidStoreControl and radix_tree are backed by the
+ * same DSA area and rt_memory_usage() returns the value including both.
+ * So we don't need to add the size of TidStoreControl separately.
+ */
+ if (TidStoreIsShared(ts))
+ return (uint64) sizeof(TidStore) + shared_rt_memory_usage(ts->tree.shared);
+ else
+		return (uint64) sizeof(TidStore) + sizeof(TidStoreControl) +
+ local_rt_memory_usage(ts->tree.local);
+}
+
+/*
+ * Get a handle that can be used by other processes to attach to this TidStore
+ */
+tidstore_handle
+tidstore_get_handle(TidStore *ts)
+{
+ Assert(TidStoreIsShared(ts) && ts->control->magic == TIDSTORE_MAGIC);
+
+ return ts->control->handle;
+}
+
+/* Extract Tids from the given key-value pair */
+static void
+tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, uint64 val)
+{
+ TidStoreIterResult *result = (&iter->result);
+
+ for (int i = 0; i < sizeof(uint64) * BITS_PER_BYTE; i++)
+ {
+ uint64 tid_i;
+ OffsetNumber off;
+
+ if ((val & (UINT64CONST(1) << i)) == 0)
+ continue;
+
+ tid_i = key << TIDSTORE_VALUE_NBITS;
+ tid_i |= i;
+
+ off = tid_i & ((UINT64CONST(1) << TIDSTORE_OFFSET_NBITS) - 1);
+ result->offsets[result->num_offsets++] = off;
+ }
+
+ result->blkno = KEY_GET_BLKNO(key);
+}
+
+/*
+ * Encode a Tid into a radix tree key and the bit offset within its value.
+ */
+static inline uint64
+tid_to_key_off(ItemPointer tid, uint32 *off)
+{
+ uint64 upper;
+ uint64 tid_i;
+
+ tid_i = ItemPointerGetOffsetNumber(tid);
+ tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << TIDSTORE_OFFSET_NBITS;
+
+ *off = tid_i & ((1 << TIDSTORE_VALUE_NBITS) - 1);
+ upper = tid_i >> TIDSTORE_VALUE_NBITS;
+ Assert(*off < (sizeof(uint64) * BITS_PER_BYTE));
+
+ return upper;
+}
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 196bece0a3..cbfe329591 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -176,6 +176,8 @@ static const char *const BuiltinTrancheNames[] = {
"SharedTupleStore",
/* LWTRANCHE_SHARED_TIDBITMAP: */
"SharedTidBitmap",
+ /* LWTRANCHE_SHARED_TIDSTORE: */
+ "SharedTidStore",
/* LWTRANCHE_PARALLEL_APPEND: */
"ParallelAppend",
/* LWTRANCHE_PER_XACT_PREDICATE_LIST: */
diff --git a/src/include/access/tidstore.h b/src/include/access/tidstore.h
new file mode 100644
index 0000000000..ec3d9f87f5
--- /dev/null
+++ b/src/include/access/tidstore.h
@@ -0,0 +1,49 @@
+/*-------------------------------------------------------------------------
+ *
+ * tidstore.h
+ * Tid storage.
+ *
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/tidstore.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef TIDSTORE_H
+#define TIDSTORE_H
+
+#include "storage/itemptr.h"
+#include "utils/dsa.h"
+
+typedef dsa_pointer tidstore_handle;
+
+typedef struct TidStore TidStore;
+typedef struct TidStoreIter TidStoreIter;
+
+typedef struct TidStoreIterResult
+{
+ BlockNumber blkno;
+	OffsetNumber offsets[MaxOffsetNumber]; /* XXX: usually not fully used */
+ int num_offsets;
+} TidStoreIterResult;
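+
+/*
+ * Expected usage (sketch): tidstore_begin_iterate(), then repeatedly call
+ * tidstore_iterate_next() until it returns NULL, then tidstore_end_iterate().
+ * Each non-NULL result describes one block and its sorted offset numbers.
+ */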
+
+extern TidStore *tidstore_create(uint64 max_bytes, dsa_area *dsa);
+extern TidStore *tidstore_attach(dsa_area *dsa, dsa_pointer handle);
+extern void tidstore_detach(TidStore *ts);
+extern void tidstore_destroy(TidStore *ts);
+extern void tidstore_reset(TidStore *ts);
+extern void tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets);
+extern bool tidstore_lookup_tid(TidStore *ts, ItemPointer tid);
+extern TidStoreIter * tidstore_begin_iterate(TidStore *ts);
+extern TidStoreIterResult *tidstore_iterate_next(TidStoreIter *iter);
+extern void tidstore_end_iterate(TidStoreIter *iter);
+extern uint64 tidstore_num_tids(TidStore *ts);
+extern bool tidstore_is_full(TidStore *ts);
+extern uint64 tidstore_max_memory(TidStore *ts);
+extern uint64 tidstore_memory_usage(TidStore *ts);
+extern tidstore_handle tidstore_get_handle(TidStore *ts);
+
+#endif /* TIDSTORE_H */
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index e4162db613..7b7663e2e1 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -199,6 +199,7 @@ typedef enum BuiltinTrancheIds
LWTRANCHE_PER_SESSION_RECORD_TYPMOD,
LWTRANCHE_SHARED_TUPLESTORE,
LWTRANCHE_SHARED_TIDBITMAP,
+ LWTRANCHE_SHARED_TIDSTORE,
LWTRANCHE_PARALLEL_APPEND,
LWTRANCHE_PER_XACT_PREDICATE_LIST,
LWTRANCHE_PGSTATS_DSA,
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 9659eb85d7..bddc16ada7 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -34,6 +34,7 @@ SUBDIRS = \
test_rls_hooks \
test_shm_mq \
test_slru \
+ test_tidstore \
unsafe_tests \
worker_spi
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 232cbdac80..c0d5645ad8 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -30,5 +30,6 @@ subdir('test_regex')
subdir('test_rls_hooks')
subdir('test_shm_mq')
subdir('test_slru')
+subdir('test_tidstore')
subdir('unsafe_tests')
subdir('worker_spi')
diff --git a/src/test/modules/test_tidstore/Makefile b/src/test/modules/test_tidstore/Makefile
new file mode 100644
index 0000000000..dab107d70c
--- /dev/null
+++ b/src/test/modules/test_tidstore/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_tidstore/Makefile
+
+MODULE_big = test_tidstore
+OBJS = \
+ $(WIN32RES) \
+ test_tidstore.o
+PGFILEDESC = "test_tidstore - test code for src/backend/access/common/tidstore.c"
+
+EXTENSION = test_tidstore
+DATA = test_tidstore--1.0.sql
+
+REGRESS = test_tidstore
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_tidstore
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_tidstore/expected/test_tidstore.out b/src/test/modules/test_tidstore/expected/test_tidstore.out
new file mode 100644
index 0000000000..7ff2f9af87
--- /dev/null
+++ b/src/test/modules/test_tidstore/expected/test_tidstore.out
@@ -0,0 +1,13 @@
+CREATE EXTENSION test_tidstore;
+--
+-- All the logic is in the test_tidstore() function. It will throw
+-- an error if something fails.
+--
+SELECT test_tidstore();
+NOTICE: testing empty tidstore
+NOTICE: testing basic operations
+ test_tidstore
+---------------
+
+(1 row)
+
diff --git a/src/test/modules/test_tidstore/meson.build b/src/test/modules/test_tidstore/meson.build
new file mode 100644
index 0000000000..31f2da7b61
--- /dev/null
+++ b/src/test/modules/test_tidstore/meson.build
@@ -0,0 +1,35 @@
+# FIXME: prevent install during main install, but not during test :/
+
+test_tidstore_sources = files(
+ 'test_tidstore.c',
+)
+
+if host_system == 'windows'
+ test_tidstore_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'test_tidstore',
+ '--FILEDESC', 'test_tidstore - test code for src/backend/access/common/tidstore.c',])
+endif
+
+test_tidstore = shared_module('test_tidstore',
+ test_tidstore_sources,
+ link_with: pgport_srv,
+ kwargs: pg_mod_args,
+)
+testprep_targets += test_tidstore
+
+install_data(
+ 'test_tidstore.control',
+ 'test_tidstore--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'test_tidstore',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'test_tidstore',
+ ],
+ },
+}
diff --git a/src/test/modules/test_tidstore/sql/test_tidstore.sql b/src/test/modules/test_tidstore/sql/test_tidstore.sql
new file mode 100644
index 0000000000..03aea31815
--- /dev/null
+++ b/src/test/modules/test_tidstore/sql/test_tidstore.sql
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_tidstore;
+
+--
+-- All the logic is in the test_tidstore() function. It will throw
+-- an error if something fails.
+--
+SELECT test_tidstore();
diff --git a/src/test/modules/test_tidstore/test_tidstore--1.0.sql b/src/test/modules/test_tidstore/test_tidstore--1.0.sql
new file mode 100644
index 0000000000..47e9149900
--- /dev/null
+++ b/src/test/modules/test_tidstore/test_tidstore--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_tidstore/test_tidstore--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_tidstore" to load this file. \quit
+
+CREATE FUNCTION test_tidstore()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_tidstore/test_tidstore.c b/src/test/modules/test_tidstore/test_tidstore.c
new file mode 100644
index 0000000000..5d38387450
--- /dev/null
+++ b/src/test/modules/test_tidstore/test_tidstore.c
@@ -0,0 +1,189 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_tidstore.c
+ * Test TidStore data structure.
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/test/modules/test_tidstore/test_tidstore.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "access/tidstore.h"
+#include "fmgr.h"
+#include "miscadmin.h"
+#include "storage/block.h"
+#include "storage/itemptr.h"
+#include "utils/memutils.h"
+
+PG_MODULE_MAGIC;
+
+#define TEST_TIDSTORE_MAX_BYTES (2 * 1024 * 1024L) /* 2MB */
+
+PG_FUNCTION_INFO_V1(test_tidstore);
+
+static void
+check_tid(TidStore *ts, BlockNumber blkno, OffsetNumber off, bool expect)
+{
+ ItemPointerData tid;
+ bool found;
+
+ ItemPointerSet(&tid, blkno, off);
+
+ found = tidstore_lookup_tid(ts, &tid);
+
+ if (found != expect)
+ elog(ERROR, "lookup TID (%u, %u) returned %d, expected %d",
+ blkno, off, found, expect);
+}
+
+static void
+test_basic(void)
+{
+#define TEST_TIDSTORE_NUM_BLOCKS 5
+#define TEST_TIDSTORE_NUM_OFFSETS 11
+#define IS_POWER_OF_TWO(x) (((x) & ((x) - 1)) == 0)
+
+ TidStore *ts;
+ TidStoreIter *iter;
+ TidStoreIterResult *iter_result;
+ BlockNumber blks[TEST_TIDSTORE_NUM_BLOCKS] = {
+ 0, MaxBlockNumber, MaxBlockNumber - 1, 1, MaxBlockNumber / 2,
+ };
+ BlockNumber blks_sorted[TEST_TIDSTORE_NUM_BLOCKS] = {
+ 0, 1, MaxBlockNumber / 2, MaxBlockNumber - 1, MaxBlockNumber
+ };
+ OffsetNumber offs[TEST_TIDSTORE_NUM_OFFSETS] = {
+ 1 << 5, 1 << 6, 1 << 7, 1 << 8, 1 << 9,
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3, 1 << 4,
+ 1 << 10
+ };
+ OffsetNumber offs_sorted[TEST_TIDSTORE_NUM_OFFSETS] = {
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3, 1 << 4,
+ 1 << 5, 1 << 6, 1 << 7, 1 << 8, 1 << 9,
+ 1 << 10
+ };
+ int blk_idx;
+
+ elog(NOTICE, "testing basic operations");
+
+ ts = tidstore_create(TEST_TIDSTORE_MAX_BYTES, NULL);
+
+ /* add tids */
+ for (int i = 0; i < TEST_TIDSTORE_NUM_BLOCKS; i++)
+ tidstore_add_tids(ts, blks[i], offs, TEST_TIDSTORE_NUM_OFFSETS);
+
+ /* lookup test */
+ for (OffsetNumber off = FirstOffsetNumber ; off < MaxHeapTuplesPerPage;
+ off++)
+ {
+ check_tid(ts, 0, off, IS_POWER_OF_TWO(off));
+ check_tid(ts, 2, off, false);
+ check_tid(ts, MaxBlockNumber - 2, off, false);
+ check_tid(ts, MaxBlockNumber, off, IS_POWER_OF_TWO(off));
+ }
+
+ /* test the number of tids */
+ if (tidstore_num_tids(ts) != (TEST_TIDSTORE_NUM_BLOCKS * TEST_TIDSTORE_NUM_OFFSETS))
+ elog(ERROR, "tidstore_num_tids returned " UINT64_FORMAT ", expected %d",
+ tidstore_num_tids(ts),
+ TEST_TIDSTORE_NUM_BLOCKS * TEST_TIDSTORE_NUM_OFFSETS);
+
+ /* iteration test */
+ iter = tidstore_begin_iterate(ts);
+ blk_idx = 0;
+ while ((iter_result = tidstore_iterate_next(iter)) != NULL)
+ {
+ /* check the returned block number */
+ if (blks_sorted[blk_idx] != iter_result->blkno)
+ elog(ERROR, "tidstore_iterate_next returned block number %u, expected %u",
+ iter_result->blkno, blks_sorted[blk_idx]);
+
+ /* check the returned offset numbers */
+ if (TEST_TIDSTORE_NUM_OFFSETS != iter_result->num_offsets)
+ elog(ERROR, "tidstore_iterate_next returned %u offsets, expected %u",
+ iter_result->num_offsets, TEST_TIDSTORE_NUM_OFFSETS);
+
+ for (int i = 0; i < iter_result->num_offsets; i++)
+ {
+ if (offs_sorted[i] != iter_result->offsets[i])
+ elog(ERROR, "tidstore_iterate_next returned offset number %u on block %u, expected %u",
+ iter_result->offsets[i], iter_result->blkno,
+ offs_sorted[i]);
+ }
+
+ blk_idx++;
+ }
+
+ if (blk_idx != TEST_TIDSTORE_NUM_BLOCKS)
+ elog(ERROR, "tidstore_iterate_next returned %d blocks, expected %d",
+ blk_idx, TEST_TIDSTORE_NUM_BLOCKS);
+
+ /* remove all tids */
+ tidstore_reset(ts);
+
+ /* test the number of tids */
+ if (tidstore_num_tids(ts) != 0)
+ elog(ERROR, "tidstore_num_tids on empty store returned non-zero");
+
+ /* lookup test for empty store */
+ for (OffsetNumber off = FirstOffsetNumber ; off < MaxHeapTuplesPerPage;
+ off++)
+ {
+ check_tid(ts, 0, off, false);
+ check_tid(ts, 2, off, false);
+ check_tid(ts, MaxBlockNumber - 2, off, false);
+ check_tid(ts, MaxBlockNumber, off, false);
+ }
+
+ tidstore_destroy(ts);
+}
+
+static void
+test_empty(void)
+{
+ TidStore *ts;
+ TidStoreIter *iter;
+ ItemPointerData tid;
+
+ elog(NOTICE, "testing empty tidstore");
+
+ ts = tidstore_create(TEST_TIDSTORE_MAX_BYTES, NULL);
+
+ ItemPointerSet(&tid, 0, FirstOffsetNumber);
+ if (tidstore_lookup_tid(ts, &tid))
+ elog(ERROR, "tidstore_lookup_tid for (0,1) on empty store returned true");
+
+ ItemPointerSet(&tid, MaxBlockNumber, MaxOffsetNumber);
+ if (tidstore_lookup_tid(ts, &tid))
+ elog(ERROR, "tidstore_lookup_tid for (%u,%u) on empty store returned true",
+ MaxBlockNumber, MaxOffsetNumber);
+
+ if (tidstore_num_tids(ts) != 0)
+		elog(ERROR, "tidstore_num_tids on empty store returned non-zero");
+
+ if (tidstore_is_full(ts))
+ elog(ERROR, "tidstore_is_full on empty store returned true");
+
+ iter = tidstore_begin_iterate(ts);
+
+ if (tidstore_iterate_next(iter) != NULL)
+ elog(ERROR, "tidstore_iterate_next on empty store returned TIDs");
+
+ tidstore_end_iterate(iter);
+
+ tidstore_destroy(ts);
+}
+
+Datum
+test_tidstore(PG_FUNCTION_ARGS)
+{
+ test_empty();
+ test_basic();
+
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_tidstore/test_tidstore.control b/src/test/modules/test_tidstore/test_tidstore.control
new file mode 100644
index 0000000000..9b6bd4638f
--- /dev/null
+++ b/src/test/modules/test_tidstore/test_tidstore.control
@@ -0,0 +1,4 @@
+comment = 'Test code for tidstore'
+default_version = '1.0'
+module_pathname = '$libdir/test_tidstore'
+relocatable = true
--
2.31.1
v20-0005-Shared-memory-cleanups.patch
From 071f8c13f5eb18d2d7449dfe5457d27a753b0528 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Tue, 17 Jan 2023 16:50:38 +0700
Subject: [PATCH v20 05/13] Shared memory cleanups
---
src/include/lib/radixtree.h | 7 +------
1 file changed, 1 insertion(+), 6 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index a78079b896..345b37e5fb 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -1378,8 +1378,6 @@ RT_ATTACH(dsa_area *dsa, RT_HANDLE handle)
tree->ctl = (RT_RADIX_TREE_CONTROL *) dsa_get_address(dsa, control);
Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
- /* XXX: do we need to set a callback on exit to detach dsa? */
-
return tree;
}
@@ -1412,8 +1410,7 @@ RT_FREE(RT_RADIX_TREE *tree)
* other backends access the memory formerly occupied by this radix tree.
*/
tree->ctl->magic = 0;
- dsa_free(tree->dsa, tree->ctl->handle); // XXX
- //dsa_detach(tree->dsa);
+ dsa_free(tree->dsa, tree->ctl->handle);
#else
pfree(tree->ctl);
@@ -1452,8 +1449,6 @@ RT_SET(RT_RADIX_TREE *tree, uint64 key, uint64 value)
if (key > tree->ctl->max_val)
RT_EXTEND(tree, key);
- //Assert(tree->ctl->root);
-
nodep = tree->ctl->root;
parent = RT_PTR_GET_LOCAL(tree, nodep);
shift = parent->shift;
--
2.31.1
v20-0006-Make-RT_DELETE-optional.patch
From 5d225cecf001837617b2eab36c96fecf2deb6af7 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Tue, 17 Jan 2023 17:34:28 +0700
Subject: [PATCH v20 06/13] Make RT_DELETE optional
To prevent compiler warnings in TIDStore
---
src/include/lib/radixtree.h | 16 +++++++++++++++-
src/test/modules/test_radixtree/test_radixtree.c | 1 +
2 files changed, 16 insertions(+), 1 deletion(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 345b37e5fb..5bdfa74f72 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -58,7 +58,6 @@
* RT_GET_HANDLE - Return the handle of the radix tree
* RT_SEARCH - Search a key-value pair
* RT_SET - Set a key-value pair
- * RT_DELETE - Delete a key-value pair
* RT_BEGIN_ITERATE - Begin iterating through all key-value pairs
* RT_ITERATE_NEXT - Return next key-value pair, if any
* RT_END_ITER - End iteration
@@ -70,6 +69,12 @@
* RT_ITERATE_NEXT() ensures returning key-value pairs in the ascending
* order of the key.
*
+ * Optional Interface
+ * ---------
+ *
+ * RT_DELETE - Delete a key-value pair. Declared/defined only if RT_USE_DELETE is defined
+ *
+ *
* Copyright (c) 2022, PostgreSQL Global Development Group
*
* IDENTIFICATION
@@ -106,7 +111,9 @@
#define RT_BEGIN_ITERATE RT_MAKE_NAME(begin_iterate)
#define RT_ITERATE_NEXT RT_MAKE_NAME(iterate_next)
#define RT_END_ITERATE RT_MAKE_NAME(end_iterate)
+#ifdef RT_USE_DELETE
#define RT_DELETE RT_MAKE_NAME(delete)
+#endif
#define RT_MEMORY_USAGE RT_MAKE_NAME(memory_usage)
#define RT_DUMP RT_MAKE_NAME(dump)
#define RT_DUMP_SEARCH RT_MAKE_NAME(dump_search)
@@ -213,7 +220,9 @@ RT_SCOPE void RT_FREE(RT_RADIX_TREE *tree);
RT_SCOPE bool RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, uint64 *val_p);
RT_SCOPE bool RT_SET(RT_RADIX_TREE *tree, uint64 key, uint64 val);
+#ifdef RT_USE_DELETE
RT_SCOPE bool RT_DELETE(RT_RADIX_TREE *tree, uint64 key);
+#endif
RT_SCOPE RT_ITER * RT_BEGIN_ITERATE(RT_RADIX_TREE *tree);
RT_SCOPE bool RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, uint64 *value_p);
@@ -1264,6 +1273,7 @@ RT_NODE_SEARCH_LEAF(RT_PTR_LOCAL node, uint64 key, uint64 *value_p)
#undef RT_NODE_LEVEL_LEAF
}
+#ifdef RT_USE_DELETE
/*
* Search for the child pointer corresponding to 'key' in the given node.
*
@@ -1289,6 +1299,7 @@ RT_NODE_DELETE_LEAF(RT_PTR_LOCAL node, uint64 key)
#include "lib/radixtree_delete_impl.h"
#undef RT_NODE_LEVEL_LEAF
}
+#endif
/* Insert the child to the inner node */
static bool
@@ -1523,6 +1534,7 @@ RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, uint64 *value_p)
return RT_NODE_SEARCH_LEAF(node, key, value_p);
}
+#ifdef RT_USE_DELETE
/*
* Delete the given key from the radix tree. Return true if the key is found (and
* deleted), otherwise do nothing and return false.
@@ -1609,6 +1621,7 @@ RT_DELETE(RT_RADIX_TREE *tree, uint64 key)
return true;
}
+#endif
static inline void
RT_ITER_UPDATE_KEY(RT_ITER *iter, uint8 chunk, uint8 shift)
@@ -2168,6 +2181,7 @@ rt_dump(RT_RADIX_TREE *tree)
#undef RT_BEGIN_ITERATE
#undef RT_ITERATE_NEXT
#undef RT_END_ITERATE
+#undef RT_USE_DELETE
#undef RT_DELETE
#undef RT_MEMORY_USAGE
#undef RT_DUMP
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
index 076173f628..f01d4dd733 100644
--- a/src/test/modules/test_radixtree/test_radixtree.c
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -104,6 +104,7 @@ static const test_spec test_specs[] = {
#define RT_SCOPE static
#define RT_DECLARE
#define RT_DEFINE
+#define RT_USE_DELETE
// WIP: compiles with warnings because rt_attach is defined but not used
// #define RT_SHMEM
#include "lib/radixtree.h"
--
2.31.1
v20-0008-Fix-bug-in-RT_CHUNK_VALUES_ARRAY_SHIFT.patch
From 58e98149a77ae23548c8d0fb3f3d229496ea1d9e Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Thu, 19 Jan 2023 23:49:33 +0900
Subject: [PATCH v20 08/13] Fix bug in RT_CHUNK_VALUES_ARRAY_SHIFT().
---
src/include/lib/radixtree.h | 2 +-
src/test/modules/test_radixtree/test_radixtree.c | 12 ++++++++++++
2 files changed, 13 insertions(+), 1 deletion(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index b9e09f5761..4ed463ba51 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -830,7 +830,7 @@ static inline void
RT_CHUNK_VALUES_ARRAY_SHIFT(uint8 *chunks, uint64 *values, int count, int idx)
{
memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
- memmove(&(values[idx + 1]), &(values[idx]), sizeof(uint64 *) * (count - idx));
+ memmove(&(values[idx + 1]), &(values[idx]), sizeof(uint64) * (count - idx));
}
/* Delete the element at 'idx' */
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
index f01d4dd733..4b250be3f9 100644
--- a/src/test/modules/test_radixtree/test_radixtree.c
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -206,6 +206,18 @@ test_basic(int children, bool test_inner)
elog(ERROR, "new inserted key 0x" UINT64_HEX_FORMAT " is found ", keys[i]);
}
+ /* look up keys */
+ for (int i = 0; i < children; i++)
+ {
+ uint64 value;
+
+ if (!rt_search(radixtree, keys[i], &value))
+ elog(ERROR, "could not find key 0x" UINT64_HEX_FORMAT, keys[i]);
+ if (value != keys[i])
+ elog(ERROR, "rt_search returned 0x" UINT64_HEX_FORMAT ", expected " UINT64_HEX_FORMAT,
+ value, keys[i]);
+ }
+
/* update keys */
for (int i = 0; i < children; i++)
{
--
2.31.1
v20-0007-Fix-RT_DEBUG-functions.patch
From 5c27ee115383d257d8d4f2280ad200e40dc36ceb Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Thu, 19 Jan 2023 23:29:16 +0900
Subject: [PATCH v20 07/13] Fix RT_DEBUG functions.
---
src/include/lib/radixtree.h | 30 +++++++++++++++++-------------
1 file changed, 17 insertions(+), 13 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 5bdfa74f72..b9e09f5761 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -115,9 +115,12 @@
#define RT_DELETE RT_MAKE_NAME(delete)
#endif
#define RT_MEMORY_USAGE RT_MAKE_NAME(memory_usage)
+#ifdef RT_DEBUG
#define RT_DUMP RT_MAKE_NAME(dump)
+#define RT_DUMP_NODE RT_MAKE_NAME(dump_node)
#define RT_DUMP_SEARCH RT_MAKE_NAME(dump_search)
#define RT_STATS RT_MAKE_NAME(stats)
+#endif
/* internal helper functions (no externally visible prototypes) */
#define RT_NEW_ROOT RT_MAKE_NAME(new_root)
@@ -1876,8 +1879,8 @@ RT_VERIFY_NODE(RT_PTR_LOCAL node)
/***************** DEBUG FUNCTIONS *****************/
#ifdef RT_DEBUG
-void
-rt_stats(RT_RADIX_TREE *tree)
+RT_SCOPE void
+RT_STATS(RT_RADIX_TREE *tree)
{
ereport(NOTICE, (errmsg("num_keys = " UINT64_FORMAT ", height = %d, n4 = %u, n15 = %u, n32 = %u, n125 = %u, n256 = %u",
tree->ctl->num_keys,
@@ -1890,7 +1893,7 @@ rt_stats(RT_RADIX_TREE *tree)
}
static void
-rt_dump_node(RT_PTR_LOCAL node, int level, bool recurse)
+RT_DUMP_NODE(RT_PTR_LOCAL node, int level, bool recurse)
{
char space[125] = {0};
@@ -1926,7 +1929,7 @@ rt_dump_node(RT_PTR_LOCAL node, int level, bool recurse)
space, n4->base.chunks[i]);
if (recurse)
- rt_dump_node(n4->children[i], level + 1, recurse);
+ RT_DUMP_NODE(n4->children[i], level + 1, recurse);
else
fprintf(stderr, "\n");
}
@@ -1953,7 +1956,7 @@ rt_dump_node(RT_PTR_LOCAL node, int level, bool recurse)
if (recurse)
{
- rt_dump_node(n32->children[i], level + 1, recurse);
+ RT_DUMP_NODE(n32->children[i], level + 1, recurse);
}
else
fprintf(stderr, "\n");
@@ -2005,7 +2008,7 @@ rt_dump_node(RT_PTR_LOCAL node, int level, bool recurse)
space, i);
if (recurse)
- rt_dump_node(RT_NODE_INNER_125_GET_CHILD(n125, i),
+ RT_DUMP_NODE(RT_NODE_INNER_125_GET_CHILD(n125, i),
level + 1, recurse);
else
fprintf(stderr, "\n");
@@ -2038,7 +2041,7 @@ rt_dump_node(RT_PTR_LOCAL node, int level, bool recurse)
space, i);
if (recurse)
- rt_dump_node(RT_NODE_INNER_256_GET_CHILD(n256, i), level + 1,
+ RT_DUMP_NODE(RT_NODE_INNER_256_GET_CHILD(n256, i), level + 1,
recurse);
else
fprintf(stderr, "\n");
@@ -2049,8 +2052,8 @@ rt_dump_node(RT_PTR_LOCAL node, int level, bool recurse)
}
}
-void
-rt_dump_search(RT_RADIX_TREE *tree, uint64 key)
+RT_SCOPE void
+RT_DUMP_SEARCH(RT_RADIX_TREE *tree, uint64 key)
{
RT_PTR_LOCAL node;
int shift;
@@ -2079,7 +2082,7 @@ rt_dump_search(RT_RADIX_TREE *tree, uint64 key)
{
RT_PTR_LOCAL child;
- rt_dump_node(node, level, false);
+ RT_DUMP_NODE(node, level, false);
if (NODE_IS_LEAF(node))
{
@@ -2100,8 +2103,8 @@ rt_dump_search(RT_RADIX_TREE *tree, uint64 key)
}
}
-void
-rt_dump(RT_RADIX_TREE *tree)
+RT_SCOPE void
+RT_DUMP(RT_RADIX_TREE *tree)
{
for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
@@ -2119,7 +2122,7 @@ rt_dump(RT_RADIX_TREE *tree)
return;
}
- rt_dump_node(tree->ctl->root, 0, true);
+ RT_DUMP_NODE(tree->ctl->root, 0, true);
}
#endif
@@ -2185,6 +2188,7 @@ rt_dump(RT_RADIX_TREE *tree)
#undef RT_DELETE
#undef RT_MEMORY_USAGE
#undef RT_DUMP
+#undef RT_DUMP_NODE
#undef RT_DUMP_SEARCH
#undef RT_STATS
--
2.31.1
v20-0004-Remove-RT_NUM_ENTRIES.patch
From 97a647cd9486f58b9186e6dc46fd0afdf474dfd9 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Tue, 17 Jan 2023 16:38:09 +0700
Subject: [PATCH v20 04/13] Remove RT_NUM_ENTRIES
This is not expected to be used everywhere, and is very simple
to implement, so move definition to test module where it is
used extensively.
---
src/include/lib/radixtree.h | 13 -------------
src/test/modules/test_radixtree/test_radixtree.c | 9 +++++++++
2 files changed, 9 insertions(+), 13 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 9f8bed09f7..a78079b896 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -63,7 +63,6 @@
* RT_ITERATE_NEXT - Return next key-value pair, if any
* RT_END_ITER - End iteration
* RT_MEMORY_USAGE - Get the memory usage
- * RT_NUM_ENTRIES - Get the number of key-value pairs
*
* RT_CREATE() creates an empty radix tree in the given memory context
* and memory contexts for all kinds of radix tree node under the memory context.
@@ -109,7 +108,6 @@
#define RT_END_ITERATE RT_MAKE_NAME(end_iterate)
#define RT_DELETE RT_MAKE_NAME(delete)
#define RT_MEMORY_USAGE RT_MAKE_NAME(memory_usage)
-#define RT_NUM_ENTRIES RT_MAKE_NAME(num_entries)
#define RT_DUMP RT_MAKE_NAME(dump)
#define RT_DUMP_SEARCH RT_MAKE_NAME(dump_search)
#define RT_STATS RT_MAKE_NAME(stats)
@@ -222,7 +220,6 @@ RT_SCOPE bool RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, uint64 *value_p);
RT_SCOPE void RT_END_ITERATE(RT_ITER *iter);
RT_SCOPE uint64 RT_MEMORY_USAGE(RT_RADIX_TREE *tree);
-RT_SCOPE uint64 RT_NUM_ENTRIES(RT_RADIX_TREE *tree);
#ifdef RT_DEBUG
RT_SCOPE void RT_DUMP(RT_RADIX_TREE *tree);
@@ -1773,15 +1770,6 @@ RT_END_ITERATE(RT_ITER *iter)
pfree(iter);
}
-/*
- * Return the number of keys in the radix tree.
- */
-RT_SCOPE uint64
-RT_NUM_ENTRIES(RT_RADIX_TREE *tree)
-{
- return tree->ctl->num_keys;
-}
-
/*
* Return the statistics of the amount of memory used by the radix tree.
*/
@@ -2187,7 +2175,6 @@ rt_dump(RT_RADIX_TREE *tree)
#undef RT_END_ITERATE
#undef RT_DELETE
#undef RT_MEMORY_USAGE
-#undef RT_NUM_ENTRIES
#undef RT_DUMP
#undef RT_DUMP_SEARCH
#undef RT_STATS
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
index 61d842789d..076173f628 100644
--- a/src/test/modules/test_radixtree/test_radixtree.c
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -109,6 +109,15 @@ static const test_spec test_specs[] = {
#include "lib/radixtree.h"
+/*
+ * Return the number of keys in the radix tree.
+ */
+static uint64
+rt_num_entries(rt_radix_tree *tree)
+{
+ return tree->ctl->num_keys;
+}
+
PG_MODULE_MAGIC;
PG_FUNCTION_INFO_V1(test_radixtree);
--
2.31.1
v20-0003-Add-radixtree-template.patch
From a81afc05faabfc4f2d49cb93cf5867032100a535 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 14 Sep 2022 12:38:51 +0000
Subject: [PATCH v20 03/13] Add radixtree template
The only thing configurable at this point is function scope,
prefix, and local/shared memory.
The key and value type are still hard-coded to uint64.
To make this more useful, at least value type should be
configurable.
It might be good at some point to offer a different tree type,
e.g. "single-value leaves" to allow for variable length keys
and values, giving full flexibility to developers.
TODO: Reducing the smallest node to 3 members will
eliminate padding and only take up 32 bytes for
inner nodes.
---
src/backend/utils/mmgr/dsa.c | 12 +
src/include/lib/radixtree.h | 2243 +++++++++++++++++
src/include/lib/radixtree_delete_impl.h | 106 +
src/include/lib/radixtree_insert_impl.h | 316 +++
src/include/lib/radixtree_iter_impl.h | 138 +
src/include/lib/radixtree_search_impl.h | 131 +
src/include/utils/dsa.h | 1 +
src/test/modules/Makefile | 1 +
src/test/modules/meson.build | 1 +
src/test/modules/test_radixtree/.gitignore | 4 +
src/test/modules/test_radixtree/Makefile | 23 +
src/test/modules/test_radixtree/README | 7 +
.../expected/test_radixtree.out | 36 +
src/test/modules/test_radixtree/meson.build | 34 +
.../test_radixtree/sql/test_radixtree.sql | 7 +
.../test_radixtree/test_radixtree--1.0.sql | 8 +
.../modules/test_radixtree/test_radixtree.c | 631 +++++
.../test_radixtree/test_radixtree.control | 4 +
src/tools/pginclude/cpluspluscheck | 6 +
src/tools/pginclude/headerscheck | 6 +
20 files changed, 3715 insertions(+)
create mode 100644 src/include/lib/radixtree.h
create mode 100644 src/include/lib/radixtree_delete_impl.h
create mode 100644 src/include/lib/radixtree_insert_impl.h
create mode 100644 src/include/lib/radixtree_iter_impl.h
create mode 100644 src/include/lib/radixtree_search_impl.h
create mode 100644 src/test/modules/test_radixtree/.gitignore
create mode 100644 src/test/modules/test_radixtree/Makefile
create mode 100644 src/test/modules/test_radixtree/README
create mode 100644 src/test/modules/test_radixtree/expected/test_radixtree.out
create mode 100644 src/test/modules/test_radixtree/meson.build
create mode 100644 src/test/modules/test_radixtree/sql/test_radixtree.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree--1.0.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree.c
create mode 100644 src/test/modules/test_radixtree/test_radixtree.control
diff --git a/src/backend/utils/mmgr/dsa.c b/src/backend/utils/mmgr/dsa.c
index 604b702a91..50f0aae3ab 100644
--- a/src/backend/utils/mmgr/dsa.c
+++ b/src/backend/utils/mmgr/dsa.c
@@ -1024,6 +1024,18 @@ dsa_set_size_limit(dsa_area *area, size_t limit)
LWLockRelease(DSA_AREA_LOCK(area));
}
+size_t
+dsa_get_total_size(dsa_area *area)
+{
+ size_t size;
+
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_SHARED);
+ size = area->control->total_segment_size;
+ LWLockRelease(DSA_AREA_LOCK(area));
+
+ return size;
+}
+
/*
* Aggressively free all spare memory in the hope of returning DSM segments to
* the operating system.
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
new file mode 100644
index 0000000000..9f8bed09f7
--- /dev/null
+++ b/src/include/lib/radixtree.h
@@ -0,0 +1,2243 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.h
+ * Implementation for adaptive radix tree.
+ *
+ * This module employs the idea from the paper "The Adaptive Radix Tree: ARTful
+ * Indexing for Main-Memory Databases" by Viktor Leis, Alfons Kemper, and Thomas
+ * Neumann, 2013. The radix tree uses adaptive node sizes, a small number of node
+ * types, each with a different number of elements. Depending on the number of
+ * children, the appropriate node type is used.
+ *
+ * There are some differences from the proposed implementation. For instance,
+ * there is no support for path compression or lazy path expansion. The radix
+ * tree supports only fixed-length keys, so we don't expect the tree to become
+ * very high.
+ *
+ * Both the key and the value are 64-bit unsigned integers. The inner nodes and
+ * the leaf nodes have slightly different structures: inner nodes (shift > 0)
+ * store pointers to their child nodes as values, whereas leaf nodes
+ * (shift == 0) store the 64-bit unsigned integers specified by the user as
+ * values. The paper refers to this technique as "Multi-value leaves". We chose
+ * it to avoid an additional pointer traversal, and it is the reason this code
+ * currently does not support variable-length keys.
+ *
+ * XXX: Most functions in this file have two variants, one for inner nodes and
+ * one for leaf nodes, so there is some duplicated code. While this sometimes
+ * makes code maintenance tricky, it reduces branch prediction misses when
+ * judging whether a node is an inner node or a leaf node.
+ *
+ * XXX: radix tree nodes are never shrunk.
+ *
+ * To generate a radix tree and associated functions for a use case several
+ * macros have to be #define'ed before this file is included. Including
+ * the file #undef's all those, so a new radix tree can be generated
+ * afterwards.
+ * The relevant parameters are:
+ * - RT_PREFIX - prefix for all symbol names generated. A prefix of 'foo'
+ * will result in radix tree type 'foo_radix_tree' and functions like
+ * 'foo_create'/'foo_free' and so forth.
+ * - RT_DECLARE - if defined function prototypes and type declarations are
+ * generated
+ * - RT_DEFINE - if defined function definitions are generated
+ * - RT_SCOPE - in which scope (e.g. extern, static inline) do function
+ * declarations reside
+ * - RT_SHMEM - if defined, the radix tree is created in the DSA area
+ * so that multiple processes can access it simultaneously.
+ *
+ * Optional parameters:
+ * - RT_DEBUG - if defined add stats tracking and debugging functions
+ *
+ * Interface
+ * ---------
+ *
+ * RT_CREATE - Create a new, empty radix tree
+ * RT_FREE - Free the radix tree
+ * RT_ATTACH - Attach to the radix tree
+ * RT_DETACH - Detach from the radix tree
+ * RT_GET_HANDLE - Return the handle of the radix tree
+ * RT_SEARCH - Search a key-value pair
+ * RT_SET - Set a key-value pair
+ * RT_DELETE - Delete a key-value pair
+ * RT_BEGIN_ITERATE - Begin iterating through all key-value pairs
+ * RT_ITERATE_NEXT - Return next key-value pair, if any
+ * RT_END_ITER - End iteration
+ * RT_MEMORY_USAGE - Get the memory usage
+ * RT_NUM_ENTRIES - Get the number of key-value pairs
+ *
+ * RT_CREATE() creates an empty radix tree in the given memory context
+ * and memory contexts for all kinds of radix tree node under the memory context.
+ *
+ * RT_ITERATE_NEXT() ensures returning key-value pairs in the ascending
+ * order of the key.
+ *
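+ * For example, a hypothetical instantiation for backend-local use might look
+ * like this (the prefix "foo" is just an illustration):
+ *
+ *     #define RT_PREFIX foo
+ *     #define RT_SCOPE static
+ *     #define RT_DECLARE
+ *     #define RT_DEFINE
+ *     #include "lib/radixtree.h"
+ *
+ * which generates foo_radix_tree, foo_create(), foo_set(), foo_search(), and
+ * so on.
+ *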
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/lib/radixtree.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "lib/stringinfo.h"
+#include "miscadmin.h"
+#include "nodes/bitmapset.h"
+#include "port/pg_bitutils.h"
+#include "port/simd.h"
+#include "utils/dsa.h"
+#include "utils/memutils.h"
+
+/* helpers */
+#define RT_MAKE_PREFIX(a) CppConcat(a,_)
+#define RT_MAKE_NAME(name) RT_MAKE_NAME_(RT_MAKE_PREFIX(RT_PREFIX),name)
+#define RT_MAKE_NAME_(a,b) CppConcat(a,b)
+
+/* function declarations */
+#define RT_CREATE RT_MAKE_NAME(create)
+#define RT_FREE RT_MAKE_NAME(free)
+#define RT_SEARCH RT_MAKE_NAME(search)
+#ifdef RT_SHMEM
+#define RT_ATTACH RT_MAKE_NAME(attach)
+#define RT_DETACH RT_MAKE_NAME(detach)
+#define RT_GET_HANDLE RT_MAKE_NAME(get_handle)
+#endif
+#define RT_SET RT_MAKE_NAME(set)
+#define RT_BEGIN_ITERATE RT_MAKE_NAME(begin_iterate)
+#define RT_ITERATE_NEXT RT_MAKE_NAME(iterate_next)
+#define RT_END_ITERATE RT_MAKE_NAME(end_iterate)
+#define RT_DELETE RT_MAKE_NAME(delete)
+#define RT_MEMORY_USAGE RT_MAKE_NAME(memory_usage)
+#define RT_NUM_ENTRIES RT_MAKE_NAME(num_entries)
+#define RT_DUMP RT_MAKE_NAME(dump)
+#define RT_DUMP_SEARCH RT_MAKE_NAME(dump_search)
+#define RT_STATS RT_MAKE_NAME(stats)
+
+/* internal helper functions (no externally visible prototypes) */
+#define RT_NEW_ROOT RT_MAKE_NAME(new_root)
+#define RT_ALLOC_NODE RT_MAKE_NAME(alloc_node)
+#define RT_INIT_NODE RT_MAKE_NAME(init_node)
+#define RT_FREE_NODE RT_MAKE_NAME(free_node)
+#define RT_EXTEND RT_MAKE_NAME(extend)
+#define RT_SET_EXTEND RT_MAKE_NAME(set_extend)
+//#define RT_GROW_NODE_KIND RT_MAKE_NAME(grow_node_kind)
+#define RT_COPY_NODE RT_MAKE_NAME(copy_node)
+#define RT_REPLACE_NODE RT_MAKE_NAME(replace_node)
+#define RT_PTR_GET_LOCAL RT_MAKE_NAME(ptr_get_local)
+#define RT_PTR_ALLOC_IS_VALID RT_MAKE_NAME(ptr_stored_is_valid)
+#define RT_NODE_4_SEARCH_EQ RT_MAKE_NAME(node_4_search_eq)
+#define RT_NODE_32_SEARCH_EQ RT_MAKE_NAME(node_32_search_eq)
+#define RT_NODE_4_GET_INSERTPOS RT_MAKE_NAME(node_4_get_insertpos)
+#define RT_NODE_32_GET_INSERTPOS RT_MAKE_NAME(node_32_get_insertpos)
+#define RT_CHUNK_CHILDREN_ARRAY_SHIFT RT_MAKE_NAME(chunk_children_array_shift)
+#define RT_CHUNK_VALUES_ARRAY_SHIFT RT_MAKE_NAME(chunk_values_array_shift)
+#define RT_CHUNK_CHILDREN_ARRAY_DELETE RT_MAKE_NAME(chunk_children_array_delete)
+#define RT_CHUNK_VALUES_ARRAY_DELETE RT_MAKE_NAME(chunk_values_array_delete)
+#define RT_CHUNK_CHILDREN_ARRAY_COPY RT_MAKE_NAME(chunk_children_array_copy)
+#define RT_CHUNK_VALUES_ARRAY_COPY RT_MAKE_NAME(chunk_values_array_copy)
+#define RT_NODE_125_IS_CHUNK_USED RT_MAKE_NAME(node_125_is_chunk_used)
+#define RT_NODE_INNER_125_GET_CHILD RT_MAKE_NAME(node_inner_125_get_child)
+#define RT_NODE_LEAF_125_GET_VALUE RT_MAKE_NAME(node_leaf_125_get_value)
+#define RT_NODE_INNER_256_IS_CHUNK_USED RT_MAKE_NAME(node_inner_256_is_chunk_used)
+#define RT_NODE_LEAF_256_IS_CHUNK_USED RT_MAKE_NAME(node_leaf_256_is_chunk_used)
+#define RT_NODE_INNER_256_GET_CHILD RT_MAKE_NAME(node_inner_256_get_child)
+#define RT_NODE_LEAF_256_GET_VALUE RT_MAKE_NAME(node_leaf_256_get_value)
+#define RT_NODE_INNER_256_SET RT_MAKE_NAME(node_inner_256_set)
+#define RT_NODE_LEAF_256_SET RT_MAKE_NAME(node_leaf_256_set)
+#define RT_NODE_INNER_256_DELETE RT_MAKE_NAME(node_inner_256_delete)
+#define RT_NODE_LEAF_256_DELETE RT_MAKE_NAME(node_leaf_256_delete)
+#define RT_KEY_GET_SHIFT RT_MAKE_NAME(key_get_shift)
+#define RT_SHIFT_GET_MAX_VAL RT_MAKE_NAME(shift_get_max_val)
+#define RT_NODE_SEARCH_INNER RT_MAKE_NAME(node_search_inner)
+#define RT_NODE_SEARCH_LEAF RT_MAKE_NAME(node_search_leaf)
+#define RT_NODE_UPDATE_INNER RT_MAKE_NAME(node_update_inner)
+#define RT_NODE_DELETE_INNER RT_MAKE_NAME(node_delete_inner)
+#define RT_NODE_DELETE_LEAF RT_MAKE_NAME(node_delete_leaf)
+#define RT_NODE_INSERT_INNER RT_MAKE_NAME(node_insert_inner)
+#define RT_NODE_INSERT_LEAF RT_MAKE_NAME(node_insert_leaf)
+#define RT_NODE_INNER_ITERATE_NEXT RT_MAKE_NAME(node_inner_iterate_next)
+#define RT_NODE_LEAF_ITERATE_NEXT RT_MAKE_NAME(node_leaf_iterate_next)
+#define RT_UPDATE_ITER_STACK RT_MAKE_NAME(update_iter_stack)
+#define RT_ITER_UPDATE_KEY RT_MAKE_NAME(iter_update_key)
+#define RT_VERIFY_NODE RT_MAKE_NAME(verify_node)
+
+/* type declarations */
+#define RT_RADIX_TREE RT_MAKE_NAME(radix_tree)
+#define RT_RADIX_TREE_CONTROL RT_MAKE_NAME(radix_tree_control)
+#define RT_ITER RT_MAKE_NAME(iter)
+#ifdef RT_SHMEM
+#define RT_HANDLE RT_MAKE_NAME(handle)
+#endif
+#define RT_NODE RT_MAKE_NAME(node)
+#define RT_NODE_ITER RT_MAKE_NAME(node_iter)
+#define RT_NODE_BASE_4 RT_MAKE_NAME(node_base_4)
+#define RT_NODE_BASE_32 RT_MAKE_NAME(node_base_32)
+#define RT_NODE_BASE_125 RT_MAKE_NAME(node_base_125)
+#define RT_NODE_BASE_256 RT_MAKE_NAME(node_base_256)
+#define RT_NODE_INNER_4 RT_MAKE_NAME(node_inner_4)
+#define RT_NODE_INNER_32 RT_MAKE_NAME(node_inner_32)
+#define RT_NODE_INNER_125 RT_MAKE_NAME(node_inner_125)
+#define RT_NODE_INNER_256 RT_MAKE_NAME(node_inner_256)
+#define RT_NODE_LEAF_4 RT_MAKE_NAME(node_leaf_4)
+#define RT_NODE_LEAF_32 RT_MAKE_NAME(node_leaf_32)
+#define RT_NODE_LEAF_125 RT_MAKE_NAME(node_leaf_125)
+#define RT_NODE_LEAF_256 RT_MAKE_NAME(node_leaf_256)
+#define RT_SIZE_CLASS RT_MAKE_NAME(size_class)
+#define RT_SIZE_CLASS_ELEM RT_MAKE_NAME(size_class_elem)
+#define RT_SIZE_CLASS_INFO RT_MAKE_NAME(size_class_info)
+#define RT_CLASS_4_FULL RT_MAKE_NAME(class_4_full)
+#define RT_CLASS_32_PARTIAL RT_MAKE_NAME(class_32_partial)
+#define RT_CLASS_32_FULL RT_MAKE_NAME(class_32_full)
+#define RT_CLASS_125_FULL RT_MAKE_NAME(class_125_full)
+#define RT_CLASS_256 RT_MAKE_NAME(class_256)
+#define RT_KIND_MIN_SIZE_CLASS RT_MAKE_NAME(kind_min_size_class)
+
+/* generate forward declarations necessary to use the radix tree */
+#ifdef RT_DECLARE
+
+typedef struct RT_RADIX_TREE RT_RADIX_TREE;
+typedef struct RT_ITER RT_ITER;
+
+#ifdef RT_SHMEM
+typedef dsa_pointer RT_HANDLE;
+#endif
+
+#ifdef RT_SHMEM
+RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx, dsa_area *dsa);
+RT_SCOPE RT_RADIX_TREE * RT_ATTACH(dsa_area *dsa, dsa_pointer dp);
+RT_SCOPE void RT_DETACH(RT_RADIX_TREE *tree);
+RT_SCOPE RT_HANDLE RT_GET_HANDLE(RT_RADIX_TREE *tree);
+#else
+RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx);
+#endif
+RT_SCOPE void RT_FREE(RT_RADIX_TREE *tree);
+
+RT_SCOPE bool RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, uint64 *val_p);
+RT_SCOPE bool RT_SET(RT_RADIX_TREE *tree, uint64 key, uint64 val);
+RT_SCOPE bool RT_DELETE(RT_RADIX_TREE *tree, uint64 key);
+
+RT_SCOPE RT_ITER * RT_BEGIN_ITERATE(RT_RADIX_TREE *tree);
+RT_SCOPE bool RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, uint64 *value_p);
+RT_SCOPE void RT_END_ITERATE(RT_ITER *iter);
+
+RT_SCOPE uint64 RT_MEMORY_USAGE(RT_RADIX_TREE *tree);
+RT_SCOPE uint64 RT_NUM_ENTRIES(RT_RADIX_TREE *tree);
+
+#ifdef RT_DEBUG
+RT_SCOPE void RT_DUMP(RT_RADIX_TREE *tree);
+RT_SCOPE void RT_DUMP_SEARCH(RT_RADIX_TREE *tree, uint64 key);
+RT_SCOPE void RT_STATS(RT_RADIX_TREE *tree);
+#endif
+
+#endif /* RT_DECLARE */
+
+
+/* generate implementation of the radix tree */
+#ifdef RT_DEFINE
+
+/* macros and types common to all implementations */
+#ifndef RT_COMMON
+#define RT_COMMON
+
+#ifdef RT_DEBUG
+#define UINT64_FORMAT_HEX "%" INT64_MODIFIER "X"
+#endif
+
+/* The number of bits encoded in one tree level */
+#define RT_NODE_SPAN BITS_PER_BYTE
+
+/* The number of maximum slots in the node */
+#define RT_NODE_MAX_SLOTS (1 << RT_NODE_SPAN)
+
+/* Mask for extracting a chunk from the key */
+#define RT_CHUNK_MASK ((1 << RT_NODE_SPAN) - 1)
+
+/* Maximum shift the radix tree uses */
+#define RT_MAX_SHIFT RT_KEY_GET_SHIFT(UINT64_MAX)
+
+/* Tree level the radix tree uses */
+#define RT_MAX_LEVEL ((sizeof(uint64) * BITS_PER_BYTE) / RT_NODE_SPAN)
+
+/* Invalid index used in node-125 */
+#define RT_NODE_125_INVALID_IDX 0xFF
+
+/* Get a chunk from the key */
+#define RT_GET_KEY_CHUNK(key, shift) ((uint8) (((key) >> (shift)) & RT_CHUNK_MASK))
+
+/* For accessing bitmaps */
+#define BM_IDX(x) ((x) / BITS_PER_BITMAPWORD)
+#define BM_BIT(x) ((x) % BITS_PER_BITMAPWORD)
+
+/*
+ * Supported radix tree node kinds and size classes.
+ *
+ * There are 4 node kinds, and each node kind has one or two size classes,
+ * partial and full. The size classes within the same node kind share the same
+ * node structure but have a different fanout, which is stored
+ * in 'fanout' of RT_NODE. For example in size class 15, when a 16th element
+ * is to be inserted, we allocate a larger area and memcpy the entire old
+ * node to it.
+ *
+ * This technique allows us to limit the node kinds to 4, which limits the
+ * number of cases in switch statements. It also allows a possible future
+ * optimization to encode the node kind in a pointer tag.
+ *
+ * These size classes have been chosen carefully to minimize the allocator
+ * padding in both the inner and leaf nodes on DSA.
+ */
+#define RT_NODE_KIND_4 0x00
+#define RT_NODE_KIND_32 0x01
+#define RT_NODE_KIND_125 0x02
+#define RT_NODE_KIND_256 0x03
+#define RT_NODE_KIND_COUNT 4
+
+#endif /* RT_COMMON */
+
+
+typedef enum RT_SIZE_CLASS
+{
+ RT_CLASS_4_FULL = 0,
+ RT_CLASS_32_PARTIAL,
+ RT_CLASS_32_FULL,
+ RT_CLASS_125_FULL,
+ RT_CLASS_256
+} RT_SIZE_CLASS;
+
+/* Common type for all nodes types */
+typedef struct RT_NODE
+{
+ /*
+ * Number of children. We use uint16 to be able to indicate 256 children
+ * at the fanout of 8.
+ */
+ uint16 count;
+
+ /*
+ * Max capacity for the current size class. Storing this in the
+ * node enables multiple size classes per node kind.
+ * Technically, kinds with a single size class don't need this, so we could
+ * keep this in the individual base types, but the code is simpler this way.
+ * Note: node256 is unique in that it cannot possibly have more than a
+ * single size class, so for that kind we store zero, and uint8 is
+ * sufficient for other kinds.
+ */
+ uint8 fanout;
+
+ /*
+ * Shift indicates which part of the key space is represented by this
+ * node. That is, the key is shifted by 'shift' and the lowest
+ * RT_NODE_SPAN bits are then represented in chunk.
+ */
+ uint8 shift;
+
+ /* Node kind, one per search/set algorithm */
+ uint8 kind;
+} RT_NODE;
+
+
+#define RT_PTR_LOCAL RT_NODE *
+
+#ifdef RT_SHMEM
+#define RT_PTR_ALLOC dsa_pointer
+#else
+#define RT_PTR_ALLOC RT_PTR_LOCAL
+#endif
+
+
+#ifdef RT_SHMEM
+#define RT_INVALID_PTR_ALLOC InvalidDsaPointer
+#else
+#define RT_INVALID_PTR_ALLOC NULL
+#endif
+
+#define NODE_IS_LEAF(n) (((RT_PTR_LOCAL) (n))->shift == 0)
+#define NODE_IS_EMPTY(n) (((RT_PTR_LOCAL) (n))->count == 0)
+#define VAR_NODE_HAS_FREE_SLOT(node) \
+ ((node)->base.n.count < (node)->base.n.fanout)
+#define FIXED_NODE_HAS_FREE_SLOT(node, class) \
+ ((node)->base.n.count < RT_SIZE_CLASS_INFO[class].fanout)
+
+/* Base type of each node kind, for leaf and inner nodes */
+/* The base types must be able to accommodate the largest size
+   class for variable-sized node kinds. */
+typedef struct RT_NODE_BASE_4
+{
+ RT_NODE n;
+
+ /* 4 children, for key chunks */
+ uint8 chunks[4];
+} RT_NODE_BASE_4;
+
+typedef struct RT_NODE_BASE_32
+{
+ RT_NODE n;
+
+ /* 32 children, for key chunks */
+ uint8 chunks[32];
+} RT_NODE_BASE_32;
+
+/*
+ * node-125 uses slot_idx array, an array of RT_NODE_MAX_SLOTS length, typically
+ * 256, to store indexes into a second array that contains up to 125 values (or
+ * child pointers in inner nodes).
+ */
+typedef struct RT_NODE_BASE_125
+{
+ RT_NODE n;
+
+ /* The index of slots for each fanout */
+ uint8 slot_idxs[RT_NODE_MAX_SLOTS];
+
+ /* isset is a bitmap to track which slot is in use */
+ bitmapword isset[BM_IDX(128)];
+} RT_NODE_BASE_125;
+
+typedef struct RT_NODE_BASE_256
+{
+ RT_NODE n;
+} RT_NODE_BASE_256;
+
+/*
+ * Inner and leaf nodes.
+ *
+ * These are separate for two main reasons:
+ *
+ * 1) the value type might be different than something fitting into a pointer
+ * width type
+ * 2) Need to represent non-existing values in a key-type independent way.
+ *
+ * 1) is clearly worth being concerned about, but it's not clear 2) is as
+ * good. It might be better to just indicate non-existing entries the same way
+ * in inner nodes.
+ */
+typedef struct RT_NODE_INNER_4
+{
+ RT_NODE_BASE_4 base;
+
+ /* number of children depends on size class */
+ RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_INNER_4;
+
+typedef struct RT_NODE_LEAF_4
+{
+ RT_NODE_BASE_4 base;
+
+ /* number of values depends on size class */
+ uint64 values[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_LEAF_4;
+
+typedef struct RT_NODE_INNER_32
+{
+ RT_NODE_BASE_32 base;
+
+ /* number of children depends on size class */
+ RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_INNER_32;
+
+typedef struct RT_NODE_LEAF_32
+{
+ RT_NODE_BASE_32 base;
+
+ /* number of values depends on size class */
+ uint64 values[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_LEAF_32;
+
+typedef struct RT_NODE_INNER_125
+{
+ RT_NODE_BASE_125 base;
+
+ /* number of children depends on size class */
+ RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_INNER_125;
+
+typedef struct RT_NODE_LEAF_125
+{
+ RT_NODE_BASE_125 base;
+
+ /* number of values depends on size class */
+ uint64 values[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_LEAF_125;
+
+/*
+ * node-256 is the largest node type. This node has RT_NODE_MAX_SLOTS length array
+ * for directly storing values (or child pointers in inner nodes).
+ */
+typedef struct RT_NODE_INNER_256
+{
+ RT_NODE_BASE_256 base;
+
+ /* Slots for 256 children */
+ RT_PTR_ALLOC children[RT_NODE_MAX_SLOTS];
+} RT_NODE_INNER_256;
+
+typedef struct RT_NODE_LEAF_256
+{
+ RT_NODE_BASE_256 base;
+
+ /* isset is a bitmap to track which slot is in use */
+ bitmapword isset[BM_IDX(RT_NODE_MAX_SLOTS)];
+
+ /* Slots for 256 values */
+ uint64 values[RT_NODE_MAX_SLOTS];
+} RT_NODE_LEAF_256;
+
+/* Information for each size class */
+typedef struct RT_SIZE_CLASS_ELEM
+{
+ const char *name;
+ int fanout;
+
+ /* slab chunk size */
+ Size inner_size;
+ Size leaf_size;
+
+ /* slab block size */
+ Size inner_blocksize;
+ Size leaf_blocksize;
+} RT_SIZE_CLASS_ELEM;
+
+/*
+ * Calculate the slab blocksize so that we can allocate at least 32 chunks
+ * from the block.
+ */
+#define NODE_SLAB_BLOCK_SIZE(size) \
+ Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * (size), (size) * 32)
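+
+/*
+ * For example, assuming SLAB_DEFAULT_BLOCK_SIZE is 8kB: a hypothetical
+ * 48-byte node gets a block size of (8192 / 48) * 48 = 8160 bytes (170
+ * chunks per block), whereas a hypothetical 2088-byte node would get
+ * 2088 * 32 = 66816 bytes, since fewer than 32 of them fit in the default
+ * block size.
+ */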
+
+static const RT_SIZE_CLASS_ELEM RT_SIZE_CLASS_INFO[] = {
+ [RT_CLASS_4_FULL] = {
+ .name = "radix tree node 4",
+ .fanout = 4,
+ .inner_size = sizeof(RT_NODE_INNER_4) + 4 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_4) + 4 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_4) + 4 * sizeof(RT_PTR_ALLOC)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_4) + 4 * sizeof(uint64)),
+ },
+ [RT_CLASS_32_PARTIAL] = {
+ .name = "radix tree node 15",
+ .fanout = 15,
+ .inner_size = sizeof(RT_NODE_INNER_32) + 15 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_32) + 15 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_32) + 15 * sizeof(RT_PTR_ALLOC)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_32) + 15 * sizeof(uint64)),
+ },
+ [RT_CLASS_32_FULL] = {
+ .name = "radix tree node 32",
+ .fanout = 32,
+ .inner_size = sizeof(RT_NODE_INNER_32) + 32 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_32) + 32 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_32) + 32 * sizeof(RT_PTR_ALLOC)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_32) + 32 * sizeof(uint64)),
+ },
+ [RT_CLASS_125_FULL] = {
+ .name = "radix tree node 125",
+ .fanout = 125,
+ .inner_size = sizeof(RT_NODE_INNER_125) + 125 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_125) + 125 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_125) + 125 * sizeof(RT_PTR_ALLOC)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_125) + 125 * sizeof(uint64)),
+ },
+ [RT_CLASS_256] = {
+ .name = "radix tree node 256",
+ .fanout = 256,
+ .inner_size = sizeof(RT_NODE_INNER_256),
+ .leaf_size = sizeof(RT_NODE_LEAF_256),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_256)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_256)),
+ },
+};
+
+#define RT_SIZE_CLASS_COUNT lengthof(RT_SIZE_CLASS_INFO)
+
+/* Map from the node kind to its minimum size class */
+static const RT_SIZE_CLASS RT_KIND_MIN_SIZE_CLASS[RT_NODE_KIND_COUNT] = {
+ [RT_NODE_KIND_4] = RT_CLASS_4_FULL,
+ [RT_NODE_KIND_32] = RT_CLASS_32_PARTIAL,
+ [RT_NODE_KIND_125] = RT_CLASS_125_FULL,
+ [RT_NODE_KIND_256] = RT_CLASS_256,
+};
+
+#ifdef RT_SHMEM
+/* A magic value used to identify our radix tree */
+#define RT_RADIX_TREE_MAGIC 0x54A48167
+#endif
+
+/* Tree-wide control data; lives in DSA when RT_SHMEM is defined */
+typedef struct RT_RADIX_TREE_CONTROL
+{
+#ifdef RT_SHMEM
+ RT_HANDLE handle;
+ uint32 magic;
+#endif
+
+ RT_PTR_ALLOC root;
+ uint64 max_val;
+ uint64 num_keys;
+
+ /* statistics */
+#ifdef RT_DEBUG
+ int32 cnt[RT_SIZE_CLASS_COUNT];
+#endif
+} RT_RADIX_TREE_CONTROL;
+
+/* Backend-local handle for a radix tree */
+typedef struct RT_RADIX_TREE
+{
+ MemoryContext context;
+
+ /* pointing to either local memory or DSA */
+ RT_RADIX_TREE_CONTROL *ctl;
+
+#ifdef RT_SHMEM
+ dsa_area *dsa;
+#else
+ MemoryContextData *inner_slabs[RT_SIZE_CLASS_COUNT];
+ MemoryContextData *leaf_slabs[RT_SIZE_CLASS_COUNT];
+#endif
+} RT_RADIX_TREE;
+
+/*
+ * Iteration support.
+ *
+ * Iterating over the radix tree returns each pair of key and value in
+ * ascending order of the key.  To support this, we iterate over the nodes of
+ * each level.
+ *
+ * RT_NODE_ITER struct is used to track the iteration within a node.
+ *
+ * RT_ITER is the struct for iteration of the radix tree, and uses
+ * RT_NODE_ITER in order to track the iteration at each level.  During the
+ * iteration, we also construct the key whenever updating the node iteration
+ * information, e.g., when advancing the current index within the node or
+ * when moving to the next node at the same level.
+ *
+ * XXX: Currently we allow only one process to do iteration.  Therefore,
+ * RT_NODE_ITER has local pointers to nodes, rather than RT_PTR_ALLOC.
+ * We need either a safeguard to disallow other processes from beginning an
+ * iteration while one is in progress, or to allow multiple processes to
+ * iterate concurrently.
+ */
+typedef struct RT_NODE_ITER
+{
+ RT_PTR_LOCAL node; /* current node being iterated */
+ int current_idx; /* current position. -1 for initial value */
+} RT_NODE_ITER;
+
+typedef struct RT_ITER
+{
+ RT_RADIX_TREE *tree;
+
+ /* Track the iteration on nodes of each level */
+ RT_NODE_ITER stack[RT_MAX_LEVEL];
+ int stack_len;
+
+ /* The key is being constructed during the iteration */
+ uint64 key;
+} RT_ITER;
+
+
+static bool RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC nodep, RT_PTR_LOCAL node,
+ uint64 key, RT_PTR_ALLOC child);
+static bool RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC nodep, RT_PTR_LOCAL node,
+ uint64 key, uint64 value);
+
+/* verification (available only with assertion) */
+static void RT_VERIFY_NODE(RT_PTR_LOCAL node);
+
+/* Get the local address of an allocated node */
+static inline RT_PTR_LOCAL
+RT_PTR_GET_LOCAL(RT_RADIX_TREE *tree, RT_PTR_ALLOC node)
+{
+#ifdef RT_SHMEM
+ return dsa_get_address(tree->dsa, (dsa_pointer) node);
+#else
+ return node;
+#endif
+}
+
+static inline bool
+RT_PTR_ALLOC_IS_VALID(RT_PTR_ALLOC ptr)
+{
+#ifdef RT_SHMEM
+ return DsaPointerIsValid(ptr);
+#else
+ return PointerIsValid(ptr);
+#endif
+}
+
+/*
+ * Return the index of the first chunk in 'node' that equals 'chunk'.  Return
+ * -1 if there is no such element.
+ */
+static inline int
+RT_NODE_4_SEARCH_EQ(RT_NODE_BASE_4 *node, uint8 chunk)
+{
+ int idx = -1;
+
+ for (int i = 0; i < node->n.count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ idx = i;
+ break;
+ }
+ }
+
+ return idx;
+}
+
+/*
+ * Return index of the chunk to insert into chunks in the given node.
+ */
+static inline int
+RT_NODE_4_GET_INSERTPOS(RT_NODE_BASE_4 *node, uint8 chunk)
+{
+ int idx;
+
+ for (idx = 0; idx < node->n.count; idx++)
+ {
+ if (node->chunks[idx] >= chunk)
+ break;
+ }
+
+ return idx;
+}
+
+/*
+ * Return the index of the first chunk in 'node' that equals 'chunk'.  Return
+ * -1 if there is no such element.
+ */
+static inline int
+RT_NODE_32_SEARCH_EQ(RT_NODE_BASE_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ uint32 bitfield;
+ int index_simd = -1;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index = -1;
+
+ for (int i = 0; i < count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ index = i;
+ break;
+ }
+ }
+#endif
+
+#ifndef USE_NO_SIMD
+ spread_chunk = vector8_broadcast(chunk);
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ cmp1 = vector8_eq(spread_chunk, haystack1);
+ cmp2 = vector8_eq(spread_chunk, haystack2);
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+/*
+ * Return index of the chunk to insert into chunks in the given node.
+ */
+static inline int
+RT_NODE_32_GET_INSERTPOS(RT_NODE_BASE_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ Vector8 min1;
+ Vector8 min2;
+ uint32 bitfield;
+ int index_simd;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index;
+
+ for (index = 0; index < count; index++)
+ {
+ if (node->chunks[index] >= chunk)
+ break;
+ }
+#endif
+
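+	/*
+	 * The SIMD path below finds the insert position without a loop: taking
+	 * the element-wise minimum of the broadcast chunk and the stored chunks
+	 * yields the broadcast chunk exactly at positions where the stored chunk
+	 * is >= it, so the lowest set bit of that comparison mask (restricted to
+	 * the first 'count' positions) is the insertion point; if no bit is set,
+	 * the new chunk goes at the end.
+	 */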
+#ifndef USE_NO_SIMD
+ spread_chunk = vector8_broadcast(chunk);
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ min1 = vector8_min(spread_chunk, haystack1);
+ min2 = vector8_min(spread_chunk, haystack2);
+ cmp1 = vector8_eq(spread_chunk, min1);
+ cmp2 = vector8_eq(spread_chunk, min2);
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+ else
+ index_simd = count;
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+/*
+ * Functions to manipulate both chunks array and children/values array.
+ * These are used for node-4 and node-32.
+ */
+
+/* Shift the elements right at 'idx' by one */
+static inline void
+RT_CHUNK_CHILDREN_ARRAY_SHIFT(uint8 *chunks, RT_PTR_ALLOC *children, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+ memmove(&(children[idx + 1]), &(children[idx]), sizeof(RT_PTR_ALLOC) * (count - idx));
+}
+
+static inline void
+RT_CHUNK_VALUES_ARRAY_SHIFT(uint8 *chunks, uint64 *values, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+	memmove(&(values[idx + 1]), &(values[idx]), sizeof(uint64) * (count - idx));
+}
+
+/* Delete the element at 'idx' */
+static inline void
+RT_CHUNK_CHILDREN_ARRAY_DELETE(uint8 *chunks, RT_PTR_ALLOC *children, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(children[idx]), &(children[idx + 1]), sizeof(RT_PTR_ALLOC) * (count - idx - 1));
+}
+
+static inline void
+RT_CHUNK_VALUES_ARRAY_DELETE(uint8 *chunks, uint64 *values, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(values[idx]), &(values[idx + 1]), sizeof(uint64) * (count - idx - 1));
+}
+
+/* Copy both chunks and children/values arrays */
+static inline void
+RT_CHUNK_CHILDREN_ARRAY_COPY(uint8 *src_chunks, RT_PTR_ALLOC *src_children,
+ uint8 *dst_chunks, RT_PTR_ALLOC *dst_children)
+{
+ const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_4_FULL].fanout;
+ const Size chunk_size = sizeof(uint8) * fanout;
+ const Size children_size = sizeof(RT_PTR_ALLOC) * fanout;
+
+ memcpy(dst_chunks, src_chunks, chunk_size);
+ memcpy(dst_children, src_children, children_size);
+}
+
+static inline void
+RT_CHUNK_VALUES_ARRAY_COPY(uint8 *src_chunks, uint64 *src_values,
+ uint8 *dst_chunks, uint64 *dst_values)
+{
+ const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_4_FULL].fanout;
+ const Size chunk_size = sizeof(uint8) * fanout;
+ const Size values_size = sizeof(uint64) * fanout;
+
+ memcpy(dst_chunks, src_chunks, chunk_size);
+ memcpy(dst_values, src_values, values_size);
+}
+
+/* Functions to manipulate inner and leaf node-125 */
+
+/* Does the given chunk in the node have an entry (value or child)? */
+static inline bool
+RT_NODE_125_IS_CHUNK_USED(RT_NODE_BASE_125 *node, uint8 chunk)
+{
+ return node->slot_idxs[chunk] != RT_NODE_125_INVALID_IDX;
+}
+
+static inline RT_PTR_ALLOC
+RT_NODE_INNER_125_GET_CHILD(RT_NODE_INNER_125 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ return node->children[node->base.slot_idxs[chunk]];
+}
+
+static inline uint64
+RT_NODE_LEAF_125_GET_VALUE(RT_NODE_LEAF_125 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ Assert(((RT_NODE_BASE_125 *) node)->slot_idxs[chunk] != RT_NODE_125_INVALID_IDX);
+ return node->values[node->base.slot_idxs[chunk]];
+}
+
+/* Functions to manipulate inner and leaf node-256 */
+
+/* Return true if the slot corresponding to the given chunk is in use */
+static inline bool
+RT_NODE_INNER_256_IS_CHUNK_USED(RT_NODE_INNER_256 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ return node->children[chunk] != RT_INVALID_PTR_ALLOC;
+}
+
+static inline bool
+RT_NODE_LEAF_256_IS_CHUNK_USED(RT_NODE_LEAF_256 *node, uint8 chunk)
+{
+ int idx = BM_IDX(chunk);
+ int bitnum = BM_BIT(chunk);
+
+ Assert(NODE_IS_LEAF(node));
+ return (node->isset[idx] & ((bitmapword) 1 << bitnum)) != 0;
+}
+
+static inline RT_PTR_ALLOC
+RT_NODE_INNER_256_GET_CHILD(RT_NODE_INNER_256 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ Assert(RT_NODE_INNER_256_IS_CHUNK_USED(node, chunk));
+ return node->children[chunk];
+}
+
+static inline uint64
+RT_NODE_LEAF_256_GET_VALUE(RT_NODE_LEAF_256 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_LEAF_256_IS_CHUNK_USED(node, chunk));
+ return node->values[chunk];
+}
+
+/* Set the child in the node-256 */
+static inline void
+RT_NODE_INNER_256_SET(RT_NODE_INNER_256 *node, uint8 chunk, RT_PTR_ALLOC child)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->children[chunk] = child;
+}
+
+/* Set the value in the node-256 */
+static inline void
+RT_NODE_LEAF_256_SET(RT_NODE_LEAF_256 *node, uint8 chunk, uint64 value)
+{
+ int idx = BM_IDX(chunk);
+ int bitnum = BM_BIT(chunk);
+
+ Assert(NODE_IS_LEAF(node));
+ node->isset[idx] |= ((bitmapword) 1 << bitnum);
+ node->values[chunk] = value;
+}
+
+/* Delete the child or value at the given chunk position in node-256 */
+static inline void
+RT_NODE_INNER_256_DELETE(RT_NODE_INNER_256 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->children[chunk] = RT_INVALID_PTR_ALLOC;
+}
+
+static inline void
+RT_NODE_LEAF_256_DELETE(RT_NODE_LEAF_256 *node, uint8 chunk)
+{
+ int idx = BM_IDX(chunk);
+ int bitnum = BM_BIT(chunk);
+
+ Assert(NODE_IS_LEAF(node));
+ node->isset[idx] &= ~((bitmapword) 1 << bitnum);
+}
+
+/*
+ * Return the shift needed to store the given key.
+ */
+static inline int
+RT_KEY_GET_SHIFT(uint64 key)
+{
+ return (key == 0)
+ ? 0
+ : (pg_leftmost_one_pos64(key) / RT_NODE_SPAN) * RT_NODE_SPAN;
+}
+
+/*
+ * Return the maximum key value that can be stored in a tree whose root node
+ * has the given shift.
+ */
+static uint64
+RT_SHIFT_GET_MAX_VAL(int shift)
+{
+ if (shift == RT_MAX_SHIFT)
+ return UINT64_MAX;
+
+ return (UINT64CONST(1) << (shift + RT_NODE_SPAN)) - 1;
+}
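+
+/*
+ * For example, assuming RT_NODE_SPAN is 8: RT_KEY_GET_SHIFT(0x10000) is
+ * (16 / 8) * 8 = 16, and RT_SHIFT_GET_MAX_VAL(16) is (1 << 24) - 1, i.e.
+ * any key that fits in three chunks fits under such a root.
+ */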
+
+/*
+ * Allocate a new node of the given size class.
+ */
+static RT_PTR_ALLOC
+RT_ALLOC_NODE(RT_RADIX_TREE *tree, RT_SIZE_CLASS size_class, bool inner)
+{
+ RT_PTR_ALLOC allocnode;
+ size_t allocsize;
+
+ if (inner)
+ allocsize = RT_SIZE_CLASS_INFO[size_class].inner_size;
+ else
+ allocsize = RT_SIZE_CLASS_INFO[size_class].leaf_size;
+
+#ifdef RT_SHMEM
+ allocnode = dsa_allocate(tree->dsa, allocsize);
+#else
+ if (inner)
+ allocnode = (RT_PTR_ALLOC) MemoryContextAlloc(tree->inner_slabs[size_class],
+ allocsize);
+ else
+ allocnode = (RT_PTR_ALLOC) MemoryContextAlloc(tree->leaf_slabs[size_class],
+ allocsize);
+#endif
+
+#ifdef RT_DEBUG
+ /* update the statistics */
+ tree->ctl->cnt[size_class]++;
+#endif
+
+ return allocnode;
+}
+
+/* Initialize the node contents */
+static inline void
+RT_INIT_NODE(RT_PTR_LOCAL node, uint8 kind, RT_SIZE_CLASS size_class, bool inner)
+{
+ if (inner)
+ MemSet(node, 0, RT_SIZE_CLASS_INFO[size_class].inner_size);
+ else
+ MemSet(node, 0, RT_SIZE_CLASS_INFO[size_class].leaf_size);
+
+ node->kind = kind;
+
+ if (kind == RT_NODE_KIND_256)
+ /* See comment for the RT_NODE type */
+ Assert(node->fanout == 0);
+ else
+ node->fanout = RT_SIZE_CLASS_INFO[size_class].fanout;
+
+ /* Initialize slot_idxs to invalid values */
+ if (kind == RT_NODE_KIND_125)
+ {
+ RT_NODE_BASE_125 *n125 = (RT_NODE_BASE_125 *) node;
+
+ memset(n125->slot_idxs, RT_NODE_125_INVALID_IDX, sizeof(n125->slot_idxs));
+ }
+}
+
+/*
+ * Create a new node as the root. Subordinate nodes will be created during
+ * the insertion.
+ */
+static void
+RT_NEW_ROOT(RT_RADIX_TREE *tree, uint64 key)
+{
+ int shift = RT_KEY_GET_SHIFT(key);
+ bool inner = shift > 0;
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+
+ allocnode = RT_ALLOC_NODE(tree, RT_CLASS_4_FULL, inner);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(newnode, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
+ newnode->shift = shift;
+ tree->ctl->max_val = RT_SHIFT_GET_MAX_VAL(shift);
+ tree->ctl->root = allocnode;
+}
+
+static inline void
+RT_COPY_NODE(RT_PTR_LOCAL newnode, RT_PTR_LOCAL oldnode)
+{
+ newnode->shift = oldnode->shift;
+ newnode->count = oldnode->count;
+}
+#if 0
+/*
+ * Create a new node with 'new_kind' and the same shift, chunk, and
+ * count of 'node'.
+ */
+static RT_NODE*
+RT_GROW_NODE_KIND(RT_RADIX_TREE *tree, RT_PTR_LOCAL node, uint8 new_kind)
+{
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+ bool inner = !NODE_IS_LEAF(node);
+
+ allocnode = RT_ALLOC_NODE(tree, RT_KIND_MIN_SIZE_CLASS[new_kind], inner);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(newnode, new_kind, RT_KIND_MIN_SIZE_CLASS[new_kind], inner);
+ RT_COPY_NODE(newnode, node);
+
+ return newnode;
+}
+#endif
+/* Free the given node */
+static void
+RT_FREE_NODE(RT_RADIX_TREE *tree, RT_PTR_ALLOC allocnode)
+{
+ /* If we're deleting the root node, make the tree empty */
+ if (tree->ctl->root == allocnode)
+ {
+ tree->ctl->root = RT_INVALID_PTR_ALLOC;
+ tree->ctl->max_val = 0;
+ }
+
+#ifdef RT_DEBUG
+ {
+ int i;
+ RT_PTR_LOCAL node = RT_PTR_GET_LOCAL(tree, allocnode);
+
+ /* update the statistics */
+ for (i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ if (node->fanout == RT_SIZE_CLASS_INFO[i].fanout)
+ break;
+ }
+
+ /* fanout of node256 is intentionally 0 */
+ if (i == RT_SIZE_CLASS_COUNT)
+ i = RT_CLASS_256;
+
+ tree->ctl->cnt[i]--;
+ Assert(tree->ctl->cnt[i] >= 0);
+ }
+#endif
+
+#ifdef RT_SHMEM
+ dsa_free(tree->dsa, allocnode);
+#else
+ pfree(allocnode);
+#endif
+}
+
+static inline void
+RT_NODE_UPDATE_INNER(RT_PTR_LOCAL node, uint64 key, RT_PTR_ALLOC new_child)
+{
+#define RT_ACTION_UPDATE
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_INNER
+#undef RT_ACTION_UPDATE
+}
+
+/*
+ * Replace old_child with new_child, and free the old one.
+ */
+static void
+RT_REPLACE_NODE(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC old_child,
+ RT_PTR_ALLOC new_child, uint64 key)
+{
+ RT_PTR_LOCAL old = RT_PTR_GET_LOCAL(tree, old_child);
+
+#ifdef USE_ASSERT_CHECKING
+ RT_PTR_LOCAL new = RT_PTR_GET_LOCAL(tree, new_child);
+
+ Assert(old->shift == new->shift);
+#endif
+
+ if (parent == old)
+ {
+ /* Replace the root node with the new large node */
+ tree->ctl->root = new_child;
+ }
+ else
+ RT_NODE_UPDATE_INNER(parent, key, new_child);
+
+ RT_FREE_NODE(tree, old_child);
+}
+
+/*
+ * The radix tree doesn't have sufficient height. Extend the radix tree so it
+ * can store the key.
+ */
+static void
+RT_EXTEND(RT_RADIX_TREE *tree, uint64 key)
+{
+ int target_shift;
+ RT_PTR_LOCAL root = RT_PTR_GET_LOCAL(tree, tree->ctl->root);
+ int shift = root->shift + RT_NODE_SPAN;
+
+ target_shift = RT_KEY_GET_SHIFT(key);
+
+ /* Grow tree from 'shift' to 'target_shift' */
+ while (shift <= target_shift)
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL node;
+ RT_NODE_INNER_4 *n4;
+
+ allocnode = RT_ALLOC_NODE(tree, RT_CLASS_4_FULL, true);
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(node, RT_NODE_KIND_4, RT_CLASS_4_FULL, true);
+ node->shift = shift;
+ node->count = 1;
+
+ n4 = (RT_NODE_INNER_4 *) node;
+ n4->base.chunks[0] = 0;
+ n4->children[0] = tree->ctl->root;
+
+ /* Update the root */
+ tree->ctl->root = allocnode;
+
+ shift += RT_NODE_SPAN;
+ }
+
+ tree->ctl->max_val = RT_SHIFT_GET_MAX_VAL(target_shift);
+}
+
+/*
+ * The radix tree doesn't have the inner and leaf nodes for the given
+ * key-value pair. Create the missing nodes from 'node' down to the bottom,
+ * then insert the value into the leaf.
+ */
+static inline void
+RT_SET_EXTEND(RT_RADIX_TREE *tree, uint64 key, uint64 value, RT_PTR_LOCAL parent,
+ RT_PTR_ALLOC nodep, RT_PTR_LOCAL node)
+{
+ int shift = node->shift;
+
+ Assert(RT_PTR_GET_LOCAL(tree, nodep) == node);
+
+ while (shift >= RT_NODE_SPAN)
+ {
+ RT_PTR_ALLOC allocchild;
+ RT_PTR_LOCAL newchild;
+ int newshift = shift - RT_NODE_SPAN;
+ bool inner = newshift > 0;
+
+ allocchild = RT_ALLOC_NODE(tree, RT_CLASS_4_FULL, inner);
+ newchild = RT_PTR_GET_LOCAL(tree, allocchild);
+ RT_INIT_NODE(newchild, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
+ newchild->shift = newshift;
+ RT_NODE_INSERT_INNER(tree, parent, nodep, node, key, allocchild);
+
+ parent = node;
+ node = newchild;
+ nodep = allocchild;
+ shift -= RT_NODE_SPAN;
+ }
+
+ RT_NODE_INSERT_LEAF(tree, parent, nodep, node, key, value);
+ tree->ctl->num_keys++;
+}
+
+/*
+ * Search for the child pointer corresponding to 'key' in the given node.
+ *
+ * Return true if the key is found, otherwise return false. On success, the
+ * child pointer is stored in *child_p.
+ */
+static inline bool
+RT_NODE_SEARCH_INNER(RT_PTR_LOCAL node, uint64 key, RT_PTR_ALLOC *child_p)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/*
+ * Search for the value corresponding to 'key' in the given node.
+ *
+ * Return true if the key is found, otherwise return false. On success, the
+ * value is stored in *value_p.
+ */
+static inline bool
+RT_NODE_SEARCH_LEAF(RT_PTR_LOCAL node, uint64 key, uint64 *value_p)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
+/*
+ * Delete the child pointer corresponding to 'key' from the given inner node.
+ *
+ * Return true if the key is found and the entry is deleted, otherwise return
+ * false.
+ */
+static inline bool
+RT_NODE_DELETE_INNER(RT_PTR_LOCAL node, uint64 key)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_delete_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/*
+ * Delete the value corresponding to 'key' from the given leaf node.
+ *
+ * Return true if the key is found and the entry is deleted, otherwise return
+ * false.
+ */
+static inline bool
+RT_NODE_DELETE_LEAF(RT_PTR_LOCAL node, uint64 key)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_delete_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
+/* Insert the child to the inner node */
+static bool
+RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC nodep, RT_PTR_LOCAL node,
+ uint64 key, RT_PTR_ALLOC child)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_insert_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/* Insert the value to the leaf node */
+static bool
+RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC nodep, RT_PTR_LOCAL node,
+ uint64 key, uint64 value)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_insert_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
+/*
+ * Create the radix tree in the given memory context and return it.
+ */
+RT_SCOPE RT_RADIX_TREE *
+#ifdef RT_SHMEM
+RT_CREATE(MemoryContext ctx, dsa_area *dsa)
+#else
+RT_CREATE(MemoryContext ctx)
+#endif
+{
+ RT_RADIX_TREE *tree;
+ MemoryContext old_ctx;
+#ifdef RT_SHMEM
+ dsa_pointer dp;
+#endif
+
+ old_ctx = MemoryContextSwitchTo(ctx);
+
+ tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE));
+ tree->context = ctx;
+
+#ifdef RT_SHMEM
+ tree->dsa = dsa;
+ dp = dsa_allocate0(dsa, sizeof(RT_RADIX_TREE_CONTROL));
+ tree->ctl = (RT_RADIX_TREE_CONTROL *) dsa_get_address(dsa, dp);
+ tree->ctl->handle = dp;
+ tree->ctl->magic = RT_RADIX_TREE_MAGIC;
+#else
+ tree->ctl = (RT_RADIX_TREE_CONTROL *) palloc0(sizeof(RT_RADIX_TREE_CONTROL));
+
+ /* Create the slab allocator for each size class */
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ tree->inner_slabs[i] = SlabContextCreate(ctx,
+ RT_SIZE_CLASS_INFO[i].name,
+ RT_SIZE_CLASS_INFO[i].inner_blocksize,
+ RT_SIZE_CLASS_INFO[i].inner_size);
+ tree->leaf_slabs[i] = SlabContextCreate(ctx,
+ RT_SIZE_CLASS_INFO[i].name,
+ RT_SIZE_CLASS_INFO[i].leaf_blocksize,
+ RT_SIZE_CLASS_INFO[i].leaf_size);
+ }
+#endif
+
+ tree->ctl->root = RT_INVALID_PTR_ALLOC;
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return tree;
+}
+
+#ifdef RT_SHMEM
+RT_SCOPE RT_RADIX_TREE *
+RT_ATTACH(dsa_area *dsa, RT_HANDLE handle)
+{
+ RT_RADIX_TREE *tree;
+ dsa_pointer control;
+
+ /* XXX: memory context support */
+ tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE));
+
+	/* Find the control object in shared memory */
+ control = handle;
+
+ tree->dsa = dsa;
+ tree->ctl = (RT_RADIX_TREE_CONTROL *) dsa_get_address(dsa, control);
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+
+ /* XXX: do we need to set a callback on exit to detach dsa? */
+
+ return tree;
+}
+
+RT_SCOPE void
+RT_DETACH(RT_RADIX_TREE *tree)
+{
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+ pfree(tree);
+}
+
+RT_SCOPE RT_HANDLE
+RT_GET_HANDLE(RT_RADIX_TREE *tree)
+{
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+ return tree->ctl->handle;
+}
+#endif
+
+/*
+ * Free the given radix tree.
+ */
+RT_SCOPE void
+RT_FREE(RT_RADIX_TREE *tree)
+{
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+
+ /*
+	 * Vandalize the control block to help catch programming errors where
+	 * other backends access the memory formerly occupied by this radix tree.
+ */
+ tree->ctl->magic = 0;
+ dsa_free(tree->dsa, tree->ctl->handle); // XXX
+ //dsa_detach(tree->dsa);
+#else
+ pfree(tree->ctl);
+
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ MemoryContextDelete(tree->inner_slabs[i]);
+ MemoryContextDelete(tree->leaf_slabs[i]);
+ }
+#endif
+
+ pfree(tree);
+}
+
+/*
+ * Set key to value. If the entry already exists, we update its value to
+ * 'value' and return true. Return false if the entry doesn't yet exist.
+ */
+RT_SCOPE bool
+RT_SET(RT_RADIX_TREE *tree, uint64 key, uint64 value)
+{
+ int shift;
+ bool updated;
+ RT_PTR_LOCAL parent;
+ RT_PTR_ALLOC nodep;
+ RT_PTR_LOCAL node;
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+
+ /* Empty tree, create the root */
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ RT_NEW_ROOT(tree, key);
+
+ /* Extend the tree if necessary */
+ if (key > tree->ctl->max_val)
+ RT_EXTEND(tree, key);
+
+ //Assert(tree->ctl->root);
+
+ nodep = tree->ctl->root;
+ parent = RT_PTR_GET_LOCAL(tree, nodep);
+ shift = parent->shift;
+
+ /* Descend the tree until a leaf node */
+ while (shift >= 0)
+ {
+ RT_PTR_ALLOC child;
+
+ node = RT_PTR_GET_LOCAL(tree, nodep);
+
+ if (NODE_IS_LEAF(node))
+ break;
+
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ {
+ RT_SET_EXTEND(tree, key, value, parent, nodep, node);
+ return false;
+ }
+
+ parent = node;
+ nodep = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ updated = RT_NODE_INSERT_LEAF(tree, parent, nodep, node, key, value);
+
+ /* Update the statistics */
+ if (!updated)
+ tree->ctl->num_keys++;
+
+ return updated;
+}
+
+/*
+ * Search for the given key in the radix tree. Return true if the key is
+ * present, otherwise return false. On success, the value is stored in
+ * *value_p, which therefore must not be NULL.
+ */
+RT_SCOPE bool
+RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, uint64 *value_p)
+{
+ RT_PTR_LOCAL node;
+ int shift;
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+ Assert(value_p != NULL);
+
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root) || key > tree->ctl->max_val)
+ return false;
+
+ node = RT_PTR_GET_LOCAL(tree, tree->ctl->root);
+ shift = node->shift;
+
+ /* Descend the tree until a leaf node */
+ while (shift >= 0)
+ {
+ RT_PTR_ALLOC child;
+
+ if (NODE_IS_LEAF(node))
+ break;
+
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ return false;
+
+ node = RT_PTR_GET_LOCAL(tree, child);
+ shift -= RT_NODE_SPAN;
+ }
+
+ return RT_NODE_SEARCH_LEAF(node, key, value_p);
+}
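+
+/*
+ * A minimal local-memory usage sketch (the names below are the template
+ * macros used in this file; each expands to a function name carrying the
+ * caller-supplied RT_PREFIX):
+ *
+ *	RT_RADIX_TREE *tree = RT_CREATE(CurrentMemoryContext);
+ *	uint64	val;
+ *
+ *	RT_SET(tree, key, val);
+ *	if (RT_SEARCH(tree, key, &val))
+ *		... use val ...
+ *	RT_FREE(tree);
+ */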
+
+/*
+ * Delete the given key from the radix tree. Return true if the key is found (and
+ * deleted), otherwise do nothing and return false.
+ */
+RT_SCOPE bool
+RT_DELETE(RT_RADIX_TREE *tree, uint64 key)
+{
+ RT_PTR_LOCAL node;
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_ALLOC stack[RT_MAX_LEVEL] = {0};
+ int shift;
+ int level;
+ bool deleted;
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root) || key > tree->ctl->max_val)
+ return false;
+
+ /*
+ * Descend the tree to search the key while building a stack of nodes we
+ * visited.
+ */
+ allocnode = tree->ctl->root;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ shift = node->shift;
+ level = -1;
+ while (shift > 0)
+ {
+ RT_PTR_ALLOC child;
+
+ /* Push the current node to the stack */
+ stack[++level] = allocnode;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ return false;
+
+ allocnode = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+	/* Delete the key from the leaf node if it exists */
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ deleted = RT_NODE_DELETE_LEAF(node, key);
+
+ if (!deleted)
+ {
+		/* the key was not found in the leaf node */
+ return false;
+ }
+
+ /* Found the key to delete. Update the statistics */
+ tree->ctl->num_keys--;
+
+ /*
+ * Return if the leaf node still has keys and we don't need to delete the
+ * node.
+ */
+ if (!NODE_IS_EMPTY(node))
+ return true;
+
+ /* Free the empty leaf node */
+ RT_FREE_NODE(tree, allocnode);
+
+	/* Delete the key from the inner nodes, walking back up the stack */
+ while (level >= 0)
+ {
+ allocnode = stack[level--];
+
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ deleted = RT_NODE_DELETE_INNER(node, key);
+ Assert(deleted);
+
+ /* If the node didn't become empty, we stop deleting the key */
+ if (!NODE_IS_EMPTY(node))
+ break;
+
+ /* The node became empty */
+ RT_FREE_NODE(tree, allocnode);
+ }
+
+ return true;
+}
+
+static inline void
+RT_ITER_UPDATE_KEY(RT_ITER *iter, uint8 chunk, uint8 shift)
+{
+ iter->key &= ~(((uint64) RT_CHUNK_MASK) << shift);
+ iter->key |= (((uint64) chunk) << shift);
+}
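+
+/*
+ * For example, with RT_NODE_SPAN of 8, advancing to chunk 0xAB at shift 8
+ * clears bits 8..15 of iter->key and sets them to 0xAB; the full key is thus
+ * assembled chunk by chunk as the iteration descends and advances.
+ */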
+
+/*
+ * Advance the slot in the inner node. Return the child if one exists,
+ * otherwise NULL.
+ */
+static inline RT_PTR_LOCAL
+RT_NODE_INNER_ITERATE_NEXT(RT_ITER *iter, RT_NODE_ITER *node_iter)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_iter_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/*
+ * Advance the slot in the leaf node. On success, return true and the value
+ * is set to value_p, otherwise return false.
+ */
+static inline bool
+RT_NODE_LEAF_ITERATE_NEXT(RT_ITER *iter, RT_NODE_ITER *node_iter,
+ uint64 *value_p)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_iter_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
+/*
+ * Update each node_iter for inner nodes in the iterator node stack.
+ */
+static void
+RT_UPDATE_ITER_STACK(RT_ITER *iter, RT_PTR_LOCAL from_node, int from)
+{
+ int level = from;
+ RT_PTR_LOCAL node = from_node;
+
+ for (;;)
+ {
+ RT_NODE_ITER *node_iter = &(iter->stack[level--]);
+
+ node_iter->node = node;
+ node_iter->current_idx = -1;
+
+ /* We don't advance the leaf node iterator here */
+ if (NODE_IS_LEAF(node))
+ return;
+
+ /* Advance to the next slot in the inner node */
+ node = RT_NODE_INNER_ITERATE_NEXT(iter, node_iter);
+
+		/* We must find the first child in the node */
+ Assert(node);
+ }
+}
+
+/* Create and return the iterator for the given radix tree */
+RT_SCOPE RT_ITER *
+RT_BEGIN_ITERATE(RT_RADIX_TREE *tree)
+{
+ MemoryContext old_ctx;
+ RT_ITER *iter;
+ RT_PTR_LOCAL root;
+ int top_level;
+
+ old_ctx = MemoryContextSwitchTo(tree->context);
+
+ iter = (RT_ITER *) palloc0(sizeof(RT_ITER));
+ iter->tree = tree;
+
+ /* empty tree */
+ if (!iter->tree->ctl->root)
+ return iter;
+
+ root = RT_PTR_GET_LOCAL(tree, iter->tree->ctl->root);
+ top_level = root->shift / RT_NODE_SPAN;
+ iter->stack_len = top_level;
+
+ /*
+	 * Descend from the root to the leftmost leaf node. The key is constructed
+	 * while descending to the leaf.
+ */
+ RT_UPDATE_ITER_STACK(iter, root, top_level);
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return iter;
+}
+
+/*
+ * Return true and set *key_p and *value_p if there is a next key. Otherwise,
+ * return false.
+ */
+RT_SCOPE bool
+RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, uint64 *value_p)
+{
+ /* Empty tree */
+ if (!iter->tree->ctl->root)
+ return false;
+
+ for (;;)
+ {
+ RT_PTR_LOCAL child = NULL;
+ uint64 value;
+ int level;
+ bool found;
+
+ /* Advance the leaf node iterator to get next key-value pair */
+ found = RT_NODE_LEAF_ITERATE_NEXT(iter, &(iter->stack[0]), &value);
+
+ if (found)
+ {
+ *key_p = iter->key;
+ *value_p = value;
+ return true;
+ }
+
+ /*
+		 * We've visited all values in the leaf node, so advance the inner
+		 * node iterators from level 1 until we find the next child node.
+ */
+ for (level = 1; level <= iter->stack_len; level++)
+ {
+ child = RT_NODE_INNER_ITERATE_NEXT(iter, &(iter->stack[level]));
+
+ if (child)
+ break;
+ }
+
+ /* the iteration finished */
+ if (!child)
+ return false;
+
+ /*
+ * Set the node to the node iterator and update the iterator stack
+ * from this node.
+ */
+ RT_UPDATE_ITER_STACK(iter, child, level - 1);
+
+ /* Node iterators are updated, so try again from the leaf */
+ }
+
+ return false;
+}
+
+RT_SCOPE void
+RT_END_ITERATE(RT_ITER *iter)
+{
+ pfree(iter);
+}
+
+/*
+ * Return the number of keys in the radix tree.
+ */
+RT_SCOPE uint64
+RT_NUM_ENTRIES(RT_RADIX_TREE *tree)
+{
+ return tree->ctl->num_keys;
+}
+
+/*
+ * Return the amount of memory used by the radix tree.
+ */
+RT_SCOPE uint64
+RT_MEMORY_USAGE(RT_RADIX_TREE *tree)
+{
+ // XXX is this necessary?
+ Size total = sizeof(RT_RADIX_TREE);
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+ total = dsa_get_total_size(tree->dsa);
+#else
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
+ total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
+ }
+#endif
+
+ return total;
+}
+
+/*
+ * Verify the radix tree node.
+ */
+static void
+RT_VERIFY_NODE(RT_PTR_LOCAL node)
+{
+#ifdef USE_ASSERT_CHECKING
+ Assert(node->count >= 0);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ RT_NODE_BASE_4 *n4 = (RT_NODE_BASE_4 *) node;
+
+ for (int i = 1; i < n4->n.count; i++)
+ Assert(n4->chunks[i - 1] < n4->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE_BASE_32 *n32 = (RT_NODE_BASE_32 *) node;
+
+ for (int i = 1; i < n32->n.count; i++)
+ Assert(n32->chunks[i - 1] < n32->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE_BASE_125 *n125 = (RT_NODE_BASE_125 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ uint8 slot = n125->slot_idxs[i];
+ int idx = BM_IDX(slot);
+ int bitnum = BM_BIT(slot);
+
+ if (!RT_NODE_125_IS_CHUNK_USED(n125, i))
+ continue;
+
+ /* Check if the corresponding slot is used */
+ Assert(slot < node->fanout);
+ Assert((n125->isset[idx] & ((bitmapword) 1 << bitnum)) != 0);
+
+ cnt++;
+ }
+
+ Assert(n125->n.count == cnt);
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_256 *n256 = (RT_NODE_LEAF_256 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < BM_IDX(RT_NODE_MAX_SLOTS); i++)
+ cnt += bmw_popcount(n256->isset[i]);
+
+				/* Check that the number of used chunks matches the count */
+ Assert(n256->base.n.count == cnt);
+
+ break;
+ }
+ }
+ }
+#endif
+}
+
+/***************** DEBUG FUNCTIONS *****************/
+#ifdef RT_DEBUG
+void
+rt_stats(RT_RADIX_TREE *tree)
+{
+ ereport(NOTICE, (errmsg("num_keys = " UINT64_FORMAT ", height = %d, n4 = %u, n15 = %u, n32 = %u, n125 = %u, n256 = %u",
+ tree->ctl->num_keys,
+ tree->ctl->root->shift / RT_NODE_SPAN,
+ tree->ctl->cnt[RT_CLASS_4_FULL],
+ tree->ctl->cnt[RT_CLASS_32_PARTIAL],
+ tree->ctl->cnt[RT_CLASS_32_FULL],
+ tree->ctl->cnt[RT_CLASS_125_FULL],
+ tree->ctl->cnt[RT_CLASS_256])));
+}
+
+static void
+rt_dump_node(RT_PTR_LOCAL node, int level, bool recurse)
+{
+ char space[125] = {0};
+
+ fprintf(stderr, "[%s] kind %d, fanout %d, count %u, shift %u:\n",
+ NODE_IS_LEAF(node) ? "LEAF" : "INNR",
+ (node->kind == RT_NODE_KIND_4) ? 4 :
+ (node->kind == RT_NODE_KIND_32) ? 32 :
+ (node->kind == RT_NODE_KIND_125) ? 125 : 256,
+ node->fanout == 0 ? 256 : node->fanout,
+ node->count, node->shift);
+
+ if (level > 0)
+ sprintf(space, "%*c", level * 4, ' ');
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_4 *n4 = (RT_NODE_LEAF_4 *) node;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, n4->base.chunks[i], n4->values[i]);
+ }
+ else
+ {
+ RT_NODE_INNER_4 *n4 = (RT_NODE_INNER_4 *) node;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, n4->base.chunks[i]);
+
+ if (recurse)
+ rt_dump_node(n4->children[i], level + 1, recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_32 *n32 = (RT_NODE_LEAF_32 *) node;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, n32->base.chunks[i], n32->values[i]);
+ }
+ else
+ {
+ RT_NODE_INNER_32 *n32 = (RT_NODE_INNER_32 *) node;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, n32->base.chunks[i]);
+
+ if (recurse)
+ {
+ rt_dump_node(n32->children[i], level + 1, recurse);
+ }
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE_BASE_125 *b125 = (RT_NODE_BASE_125 *) node;
+
+ fprintf(stderr, "slot_idxs ");
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(b125, i))
+ continue;
+
+ fprintf(stderr, " [%d]=%d, ", i, b125->slot_idxs[i]);
+ }
+ if (NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_125 *n = (RT_NODE_LEAF_125 *) node;
+
+ fprintf(stderr, ", isset-bitmap:");
+ for (int i = 0; i < BM_IDX(128); i++)
+ {
+ fprintf(stderr, UINT64_FORMAT_HEX " ", (uint64) n->base.isset[i]);
+ }
+ fprintf(stderr, "\n");
+ }
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(b125, i))
+ continue;
+
+ if (NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_125 *n125 = (RT_NODE_LEAF_125 *) b125;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, i, RT_NODE_LEAF_125_GET_VALUE(n125, i));
+ }
+ else
+ {
+ RT_NODE_INNER_125 *n125 = (RT_NODE_INNER_125 *) b125;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, i);
+
+ if (recurse)
+ rt_dump_node(RT_NODE_INNER_125_GET_CHILD(n125, i),
+ level + 1, recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_256 *n256 = (RT_NODE_LEAF_256 *) node;
+
+ if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, i))
+ continue;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, i, RT_NODE_LEAF_256_GET_VALUE(n256, i));
+ }
+ else
+ {
+ RT_NODE_INNER_256 *n256 = (RT_NODE_INNER_256 *) node;
+
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
+ continue;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, i);
+
+ if (recurse)
+ rt_dump_node(RT_NODE_INNER_256_GET_CHILD(n256, i), level + 1,
+ recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ }
+}
+
+void
+rt_dump_search(RT_RADIX_TREE *tree, uint64 key)
+{
+ RT_PTR_LOCAL node;
+ int shift;
+ int level = 0;
+
+ elog(NOTICE, "-----------------------------------------------------------");
+ elog(NOTICE, "max_val = " UINT64_FORMAT "(0x" UINT64_FORMAT_HEX ")",
+ tree->ctl->max_val, tree->ctl->max_val);
+
+ if (!tree->ctl->root)
+ {
+ elog(NOTICE, "tree is empty");
+ return;
+ }
+
+ if (key > tree->ctl->max_val)
+ {
+ elog(NOTICE, "key " UINT64_FORMAT "(0x" UINT64_FORMAT_HEX ") is larger than max val",
+ key, key);
+ return;
+ }
+
+ node = tree->ctl->root;
+ shift = tree->ctl->root->shift;
+ while (shift >= 0)
+ {
+ RT_PTR_LOCAL child;
+
+ rt_dump_node(node, level, false);
+
+ if (NODE_IS_LEAF(node))
+ {
+ uint64 dummy;
+
+			/* We reached a leaf node, find the corresponding slot */
+ RT_NODE_SEARCH_LEAF(node, key, &dummy);
+
+ break;
+ }
+
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ break;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ level++;
+ }
+}
+
+void
+rt_dump(RT_RADIX_TREE *tree)
+{
+
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ fprintf(stderr, "%s\tinner_size %zu\tinner_blocksize %zu\tleaf_size %zu\tleaf_blocksize %zu\n",
+ RT_SIZE_CLASS_INFO[i].name,
+ RT_SIZE_CLASS_INFO[i].inner_size,
+ RT_SIZE_CLASS_INFO[i].inner_blocksize,
+ RT_SIZE_CLASS_INFO[i].leaf_size,
+ RT_SIZE_CLASS_INFO[i].leaf_blocksize);
+ fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->ctl->max_val);
+
+ if (!tree->ctl->root)
+ {
+ fprintf(stderr, "empty tree\n");
+ return;
+ }
+
+ rt_dump_node(tree->ctl->root, 0, true);
+}
+#endif
+
+#endif /* RT_DEFINE */
+
+
+/* undefine external parameters, so next radix tree can be defined */
+#undef RT_PREFIX
+#undef RT_SCOPE
+#undef RT_DECLARE
+#undef RT_DEFINE
+
+/* locally declared macros */
+#undef NODE_IS_LEAF
+#undef NODE_IS_EMPTY
+#undef VAR_NODE_HAS_FREE_SLOT
+#undef FIXED_NODE_HAS_FREE_SLOT
+#undef RT_SIZE_CLASS_COUNT
+#undef RT_RADIX_TREE_MAGIC
+
+/* type declarations */
+#undef RT_RADIX_TREE
+#undef RT_RADIX_TREE_CONTROL
+#undef RT_PTR_ALLOC
+#undef RT_INVALID_PTR_ALLOC
+#undef RT_HANDLE
+#undef RT_ITER
+#undef RT_NODE
+#undef RT_NODE_ITER
+#undef RT_NODE_BASE_4
+#undef RT_NODE_BASE_32
+#undef RT_NODE_BASE_125
+#undef RT_NODE_BASE_256
+#undef RT_NODE_INNER_4
+#undef RT_NODE_INNER_32
+#undef RT_NODE_INNER_125
+#undef RT_NODE_INNER_256
+#undef RT_NODE_LEAF_4
+#undef RT_NODE_LEAF_32
+#undef RT_NODE_LEAF_125
+#undef RT_NODE_LEAF_256
+#undef RT_SIZE_CLASS
+#undef RT_SIZE_CLASS_ELEM
+#undef RT_SIZE_CLASS_INFO
+#undef RT_CLASS_4_FULL
+#undef RT_CLASS_32_PARTIAL
+#undef RT_CLASS_32_FULL
+#undef RT_CLASS_125_FULL
+#undef RT_CLASS_256
+#undef RT_KIND_MIN_SIZE_CLASS
+
+/* function declarations */
+#undef RT_CREATE
+#undef RT_FREE
+#undef RT_ATTACH
+#undef RT_DETACH
+#undef RT_GET_HANDLE
+#undef RT_SET
+#undef RT_BEGIN_ITERATE
+#undef RT_ITERATE_NEXT
+#undef RT_END_ITERATE
+#undef RT_DELETE
+#undef RT_MEMORY_USAGE
+#undef RT_NUM_ENTRIES
+#undef RT_DUMP
+#undef RT_DUMP_SEARCH
+#undef RT_STATS
+
+/* internal helper functions */
+#undef RT_NEW_ROOT
+#undef RT_ALLOC_NODE
+#undef RT_INIT_NODE
+#undef RT_FREE_NODE
+#undef RT_EXTEND
+#undef RT_SET_EXTEND
+#undef RT_GROW_NODE_KIND
+#undef RT_COPY_NODE
+#undef RT_REPLACE_NODE
+#undef RT_PTR_GET_LOCAL
+#undef RT_PTR_ALLOC_IS_VALID
+#undef RT_NODE_4_SEARCH_EQ
+#undef RT_NODE_32_SEARCH_EQ
+#undef RT_NODE_4_GET_INSERTPOS
+#undef RT_NODE_32_GET_INSERTPOS
+#undef RT_CHUNK_CHILDREN_ARRAY_SHIFT
+#undef RT_CHUNK_VALUES_ARRAY_SHIFT
+#undef RT_CHUNK_CHILDREN_ARRAY_DELETE
+#undef RT_CHUNK_VALUES_ARRAY_DELETE
+#undef RT_CHUNK_CHILDREN_ARRAY_COPY
+#undef RT_CHUNK_VALUES_ARRAY_COPY
+#undef RT_NODE_125_IS_CHUNK_USED
+#undef RT_NODE_INNER_125_GET_CHILD
+#undef RT_NODE_LEAF_125_GET_VALUE
+#undef RT_NODE_INNER_256_IS_CHUNK_USED
+#undef RT_NODE_LEAF_256_IS_CHUNK_USED
+#undef RT_NODE_INNER_256_GET_CHILD
+#undef RT_NODE_LEAF_256_GET_VALUE
+#undef RT_NODE_INNER_256_SET
+#undef RT_NODE_LEAF_256_SET
+#undef RT_NODE_INNER_256_DELETE
+#undef RT_NODE_LEAF_256_DELETE
+#undef RT_KEY_GET_SHIFT
+#undef RT_SHIFT_GET_MAX_VAL
+#undef RT_NODE_SEARCH_INNER
+#undef RT_NODE_SEARCH_LEAF
+#undef RT_NODE_UPDATE_INNER
+#undef RT_NODE_DELETE_INNER
+#undef RT_NODE_DELETE_LEAF
+#undef RT_NODE_INSERT_INNER
+#undef RT_NODE_INSERT_LEAF
+#undef RT_NODE_INNER_ITERATE_NEXT
+#undef RT_NODE_LEAF_ITERATE_NEXT
+#undef RT_UPDATE_ITER_STACK
+#undef RT_ITER_UPDATE_KEY
+#undef RT_VERIFY_NODE
+
+#undef RT_DEBUG
diff --git a/src/include/lib/radixtree_delete_impl.h b/src/include/lib/radixtree_delete_impl.h
new file mode 100644
index 0000000000..eb87866b90
--- /dev/null
+++ b/src/include/lib/radixtree_delete_impl.h
@@ -0,0 +1,106 @@
+/* TODO: shrink nodes */
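+
+/*
+ * This file (like radixtree_insert_impl.h, radixtree_search_impl.h, and
+ * radixtree_iter_impl.h) is not a standalone header: it is #include'd into
+ * the body of a function in radixtree.h with either RT_NODE_LEVEL_INNER or
+ * RT_NODE_LEVEL_LEAF defined, and expands to the per-node-kind handling for
+ * that level.
+ */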
+
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE4_TYPE RT_NODE_INNER_4
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE4_TYPE RT_NODE_LEAF_4
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+
+#ifdef RT_NODE_LEVEL_LEAF
+ Assert(NODE_IS_LEAF(node));
+#else
+ Assert(!NODE_IS_LEAF(node));
+#endif
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ RT_NODE4_TYPE *n4 = (RT_NODE4_TYPE *) node;
+ int idx = RT_NODE_4_SEARCH_EQ((RT_NODE_BASE_4 *) n4, chunk);
+
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_DELETE(n4->base.chunks, (uint64 *) n4->values,
+ n4->base.n.count, idx);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_DELETE(n4->base.chunks, n4->children,
+ n4->base.n.count, idx);
+#endif
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
+ int idx = RT_NODE_32_SEARCH_EQ((RT_NODE_BASE_32 *) n32, chunk);
+
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_DELETE(n32->base.chunks, (uint64 *) n32->values,
+ n32->base.n.count, idx);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_DELETE(n32->base.chunks, n32->children,
+ n32->base.n.count, idx);
+#endif
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+ int slotpos = n125->base.slot_idxs[chunk];
+ int idx;
+ int bitnum;
+
+ if (slotpos == RT_NODE_125_INVALID_IDX)
+ return false;
+
+ idx = BM_IDX(slotpos);
+ bitnum = BM_BIT(slotpos);
+ n125->base.isset[idx] &= ~((bitmapword) 1 << bitnum);
+ n125->base.slot_idxs[chunk] = RT_NODE_125_INVALID_IDX;
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk))
+#else
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, chunk))
+#endif
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_NODE_LEAF_256_DELETE(n256, chunk);
+#else
+ RT_NODE_INNER_256_DELETE(n256, chunk);
+#endif
+ break;
+ }
+ }
+
+ /* update statistics */
+ node->count--;
+
+ return true;
+
+#undef RT_NODE4_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_insert_impl.h b/src/include/lib/radixtree_insert_impl.h
new file mode 100644
index 0000000000..e4faf54d9d
--- /dev/null
+++ b/src/include/lib/radixtree_insert_impl.h
@@ -0,0 +1,316 @@
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE4_TYPE RT_NODE_INNER_4
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE4_TYPE RT_NODE_LEAF_4
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool chunk_exists = false;
+ RT_PTR_LOCAL newnode = NULL;
+ RT_PTR_ALLOC allocnode;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ const bool inner = false;
+ Assert(NODE_IS_LEAF(node));
+#else
+ const bool inner = true;
+ Assert(!NODE_IS_LEAF(node));
+#endif
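+
+	/*
+	 * When the current node is full, each case below allocates a node of the
+	 * next larger kind (4 -> 32 -> 125 -> 256; node-32 first grows from its
+	 * partial to its full size class), copies the existing entries into it,
+	 * replaces the old node in its parent, and then falls through to the
+	 * next case label to insert into the new node.
+	 */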
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ RT_NODE4_TYPE *n4 = (RT_NODE4_TYPE *) node;
+ int idx;
+
+ idx = RT_NODE_4_SEARCH_EQ(&n4->base, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+#ifdef RT_NODE_LEVEL_LEAF
+ n4->values[idx] = value;
+#else
+ n4->children[idx] = child;
+#endif
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n4)))
+ {
+ RT_NODE32_TYPE *new32;
+ const uint8 new_kind = RT_NODE_KIND_32;
+ const RT_SIZE_CLASS new_class = RT_KIND_MIN_SIZE_CLASS[new_kind];
+
+ /* grow node from 4 to 32 */
+ allocnode = RT_ALLOC_NODE(tree, new_class, inner);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(newnode, new_kind, new_class, inner);
+ RT_COPY_NODE(newnode, node);
+ //newnode = RT_GROW_NODE_KIND(tree, node, RT_NODE_KIND_32);
+ new32 = (RT_NODE32_TYPE *) newnode;
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_COPY(n4->base.chunks, n4->values,
+ new32->base.chunks, new32->values);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_COPY(n4->base.chunks, n4->children,
+ new32->base.chunks, new32->children);
+#endif
+ Assert(parent != NULL);
+ RT_REPLACE_NODE(tree, parent, nodep, allocnode, key);
+ node = newnode;
+ }
+ else
+ {
+ int insertpos = RT_NODE_4_GET_INSERTPOS(&n4->base, chunk);
+ int count = n4->base.n.count;
+
+ /* shift chunks and children */
+ if (insertpos < count)
+ {
+ Assert(count > 0);
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_SHIFT(n4->base.chunks, n4->values,
+ count, insertpos);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_SHIFT(n4->base.chunks, n4->children,
+ count, insertpos);
+#endif
+ }
+
+ n4->base.chunks[insertpos] = chunk;
+#ifdef RT_NODE_LEVEL_LEAF
+ n4->values[insertpos] = value;
+#else
+ n4->children[insertpos] = child;
+#endif
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_32:
+ {
+ const RT_SIZE_CLASS_ELEM class32_min = RT_SIZE_CLASS_INFO[RT_CLASS_32_PARTIAL];
+ const RT_SIZE_CLASS_ELEM class32_max = RT_SIZE_CLASS_INFO[RT_CLASS_32_FULL];
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
+ int idx;
+
+ idx = RT_NODE_32_SEARCH_EQ(&n32->base, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+#ifdef RT_NODE_LEVEL_LEAF
+ n32->values[idx] = value;
+#else
+ n32->children[idx] = child;
+#endif
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)) &&
+ n32->base.n.fanout == class32_min.fanout)
+ {
+ /* grow to the next size class of this kind */
+ const RT_SIZE_CLASS new_class = RT_CLASS_32_FULL;
+
+ allocnode = RT_ALLOC_NODE(tree, new_class, inner);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+#ifdef RT_NODE_LEVEL_LEAF
+ memcpy(newnode, node, class32_min.leaf_size);
+#else
+ memcpy(newnode, node, class32_min.inner_size);
+#endif
+ newnode->fanout = class32_max.fanout;
+
+ Assert(parent != NULL);
+ RT_REPLACE_NODE(tree, parent, nodep, allocnode, key);
+ node = newnode;
+
+ /* also update pointer for this kind */
+ n32 = (RT_NODE32_TYPE *) newnode;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)))
+ {
+ RT_NODE125_TYPE *new125;
+ const uint8 new_kind = RT_NODE_KIND_125;
+ const RT_SIZE_CLASS new_class = RT_KIND_MIN_SIZE_CLASS[new_kind];
+
+ Assert(n32->base.n.fanout == class32_max.fanout);
+
+ /* grow node from 32 to 125 */
+ allocnode = RT_ALLOC_NODE(tree, new_class, inner);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(newnode, new_kind, new_class, inner);
+ RT_COPY_NODE(newnode, node);
+ //newnode = RT_GROW_NODE_KIND(tree, node, RT_NODE_KIND_125);
+ new125 = (RT_NODE125_TYPE *) newnode;
+
+ for (int i = 0; i < class32_max.fanout; i++)
+ {
+ new125->base.slot_idxs[n32->base.chunks[i]] = i;
+#ifdef RT_NODE_LEVEL_LEAF
+ new125->values[i] = n32->values[i];
+#else
+ new125->children[i] = n32->children[i];
+#endif
+ }
+
+ Assert(class32_max.fanout <= sizeof(bitmapword) * BITS_PER_BYTE);
+ new125->base.isset[0] = (bitmapword) (((uint64) 1 << class32_max.fanout) - 1);
+
+ Assert(parent != NULL);
+ RT_REPLACE_NODE(tree, parent, nodep, allocnode, key);
+ node = newnode;
+ }
+ else
+ {
+ int insertpos = RT_NODE_32_GET_INSERTPOS(&n32->base, chunk);
+ int count = n32->base.n.count;
+
+ if (insertpos < count)
+ {
+ Assert(count > 0);
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_SHIFT(n32->base.chunks, n32->values,
+ count, insertpos);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_SHIFT(n32->base.chunks, n32->children,
+ count, insertpos);
+#endif
+ }
+
+ n32->base.chunks[insertpos] = chunk;
+#ifdef RT_NODE_LEVEL_LEAF
+ n32->values[insertpos] = value;
+#else
+ n32->children[insertpos] = child;
+#endif
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+ int slotpos = n125->base.slot_idxs[chunk];
+ int cnt = 0;
+
+ if (slotpos != RT_NODE_125_INVALID_IDX)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+#ifdef RT_NODE_LEVEL_LEAF
+ n125->values[slotpos] = value;
+#else
+ n125->children[slotpos] = child;
+#endif
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n125)))
+ {
+ RT_NODE256_TYPE *new256;
+ const uint8 new_kind = RT_NODE_KIND_256;
+ const RT_SIZE_CLASS new_class = RT_KIND_MIN_SIZE_CLASS[new_kind];
+
+ /* grow node from 125 to 256 */
+ allocnode = RT_ALLOC_NODE(tree, new_class, inner);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(newnode, new_kind, new_class, inner);
+ RT_COPY_NODE(newnode, node);
+ //newnode = RT_GROW_NODE_KIND(tree, node, RT_NODE_KIND_256);
+ new256 = (RT_NODE256_TYPE *) newnode;
+ for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n125->base.n.count; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(&n125->base, i))
+ continue;
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_NODE_LEAF_256_SET(new256, i, RT_NODE_LEAF_125_GET_VALUE(n125, i));
+#else
+ RT_NODE_INNER_256_SET(new256, i, RT_NODE_INNER_125_GET_CHILD(n125, i));
+#endif
+ cnt++;
+ }
+
+ Assert(parent != NULL);
+ RT_REPLACE_NODE(tree, parent, nodep, allocnode, key);
+ node = newnode;
+ }
+ else
+ {
+ int idx;
+ bitmapword inverse;
+
+ /* get the first word with at least one bit not set */
+ for (idx = 0; idx < BM_IDX(128); idx++)
+ {
+ if (n125->base.isset[idx] < ~((bitmapword) 0))
+ break;
+ }
+
+ /* To get the first unset bit in X, get the first set bit in ~X */
+ inverse = ~(n125->base.isset[idx]);
+ slotpos = idx * BITS_PER_BITMAPWORD;
+ slotpos += bmw_rightmost_one_pos(inverse);
+ Assert(slotpos < node->fanout);
+
+ /* mark the slot used */
+ n125->base.isset[idx] |= bmw_rightmost_one(inverse);
+ n125->base.slot_idxs[chunk] = slotpos;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ n125->values[slotpos] = value;
+#else
+ n125->children[slotpos] = child;
+#endif
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ chunk_exists = RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk);
+#else
+ chunk_exists = RT_NODE_INNER_256_IS_CHUNK_USED(n256, chunk);
+#endif
+ Assert(chunk_exists || FIXED_NODE_HAS_FREE_SLOT(n256, RT_CLASS_256));
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_NODE_LEAF_256_SET(n256, chunk, value);
+#else
+ RT_NODE_INNER_256_SET(n256, chunk, child);
+#endif
+ break;
+ }
+ }
+
+ /* Update statistics */
+ if (!chunk_exists)
+ node->count++;
+
+ /*
+	 * Done. Finally, verify that the chunk and value have been inserted or
+	 * replaced properly in the node.
+ */
+ RT_VERIFY_NODE(node);
+
+ return chunk_exists;
+
+#undef RT_NODE4_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_iter_impl.h b/src/include/lib/radixtree_iter_impl.h
new file mode 100644
index 0000000000..0b8b68df6c
--- /dev/null
+++ b/src/include/lib/radixtree_iter_impl.h
@@ -0,0 +1,138 @@
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE4_TYPE RT_NODE_INNER_4
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE4_TYPE RT_NODE_LEAF_4
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ bool found = false;
+ uint8 key_chunk;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ uint64 value;
+
+ Assert(NODE_IS_LEAF(node_iter->node));
+#else
+ RT_PTR_LOCAL child = NULL;
+
+ Assert(!NODE_IS_LEAF(node_iter->node));
+#endif
+
+#ifdef RT_SHMEM
+ Assert(iter->tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+
+ switch (node_iter->node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ RT_NODE4_TYPE *n4 = (RT_NODE4_TYPE *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n4->base.n.count)
+ break;
+#ifdef RT_NODE_LEVEL_LEAF
+ value = n4->values[node_iter->current_idx];
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, n4->children[node_iter->current_idx]);
+#endif
+ key_chunk = n4->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n32->base.n.count)
+ break;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ value = n32->values[node_iter->current_idx];
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, n32->children[node_iter->current_idx]);
+#endif
+ key_chunk = n32->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (RT_NODE_125_IS_CHUNK_USED((RT_NODE_BASE_125 *) n125, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+#ifdef RT_NODE_LEVEL_LEAF
+ value = RT_NODE_LEAF_125_GET_VALUE(n125, i);
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, RT_NODE_INNER_125_GET_CHILD(n125, i));
+#endif
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+#ifdef RT_NODE_LEVEL_LEAF
+ if (RT_NODE_LEAF_256_IS_CHUNK_USED(n256, i))
+#else
+ if (RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
+#endif
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+#ifdef RT_NODE_LEVEL_LEAF
+ value = RT_NODE_LEAF_256_GET_VALUE(n256, i);
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, RT_NODE_INNER_256_GET_CHILD(n256, i));
+#endif
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ }
+
+ if (found)
+ {
+ RT_ITER_UPDATE_KEY(iter, key_chunk, node_iter->node->shift);
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = value;
+#endif
+ }
+
+#ifdef RT_NODE_LEVEL_LEAF
+ return found;
+#else
+ return child;
+#endif
+
+#undef RT_NODE4_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_search_impl.h b/src/include/lib/radixtree_search_impl.h
new file mode 100644
index 0000000000..31e4978e4f
--- /dev/null
+++ b/src/include/lib/radixtree_search_impl.h
@@ -0,0 +1,131 @@
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE4_TYPE RT_NODE_INNER_4
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE4_TYPE RT_NODE_LEAF_4
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+
+#ifdef RT_NODE_LEVEL_LEAF
+ uint64 value = 0;
+
+ Assert(NODE_IS_LEAF(node));
+#else
+#ifndef RT_ACTION_UPDATE
+ RT_PTR_ALLOC child = RT_INVALID_PTR_ALLOC;
+#endif
+ Assert(!NODE_IS_LEAF(node));
+#endif
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ RT_NODE4_TYPE *n4 = (RT_NODE4_TYPE *) node;
+ int idx = RT_NODE_4_SEARCH_EQ((RT_NODE_BASE_4 *) n4, chunk);
+
+#ifdef RT_ACTION_UPDATE
+ Assert(idx >= 0);
+ n4->children[idx] = new_child;
+#else
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ value = n4->values[idx];
+#else
+ child = n4->children[idx];
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
+ int idx = RT_NODE_32_SEARCH_EQ((RT_NODE_BASE_32 *) n32, chunk);
+
+#ifdef RT_ACTION_UPDATE
+ Assert(idx >= 0);
+ n32->children[idx] = new_child;
+#else
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ value = n32->values[idx];
+#else
+ child = n32->children[idx];
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+ int slotpos = n125->base.slot_idxs[chunk];
+
+#ifdef RT_ACTION_UPDATE
+ Assert(slotpos != RT_NODE_125_INVALID_IDX);
+ n125->children[slotpos] = new_child;
+#else
+ if (slotpos == RT_NODE_125_INVALID_IDX)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ value = RT_NODE_LEAF_125_GET_VALUE(n125, chunk);
+#else
+ child = RT_NODE_INNER_125_GET_CHILD(n125, chunk);
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
+
+#ifdef RT_ACTION_UPDATE
+ RT_NODE_INNER_256_SET(n256, chunk, new_child);
+#else
+#ifdef RT_NODE_LEVEL_LEAF
+ if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk))
+#else
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, chunk))
+#endif
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ value = RT_NODE_LEAF_256_GET_VALUE(n256, chunk);
+#else
+ child = RT_NODE_INNER_256_GET_CHILD(n256, chunk);
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ }
+
+#ifdef RT_ACTION_UPDATE
+ return;
+#else
+#ifdef RT_NODE_LEVEL_LEAF
+ Assert(value_p != NULL);
+ *value_p = value;
+#else
+ Assert(child_p != NULL);
+ *child_p = child;
+#endif
+
+ return true;
+#endif /* RT_ACTION_UPDATE */
+
+#undef RT_NODE4_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/utils/dsa.h b/src/include/utils/dsa.h
index 104386e674..c67f936880 100644
--- a/src/include/utils/dsa.h
+++ b/src/include/utils/dsa.h
@@ -117,6 +117,7 @@ extern dsa_handle dsa_get_handle(dsa_area *area);
extern dsa_pointer dsa_allocate_extended(dsa_area *area, size_t size, int flags);
extern void dsa_free(dsa_area *area, dsa_pointer dp);
extern void *dsa_get_address(dsa_area *area, dsa_pointer dp);
+extern size_t dsa_get_total_size(dsa_area *area);
extern void dsa_trim(dsa_area *area);
extern void dsa_dump(dsa_area *area);
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index c629cbe383..9659eb85d7 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -28,6 +28,7 @@ SUBDIRS = \
test_pg_db_role_setting \
test_pg_dump \
test_predtest \
+ test_radixtree \
test_rbtree \
test_regex \
test_rls_hooks \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 1baa6b558d..232cbdac80 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -24,6 +24,7 @@ subdir('test_parser')
subdir('test_pg_db_role_setting')
subdir('test_pg_dump')
subdir('test_predtest')
+subdir('test_radixtree')
subdir('test_rbtree')
subdir('test_regex')
subdir('test_rls_hooks')
diff --git a/src/test/modules/test_radixtree/.gitignore b/src/test/modules/test_radixtree/.gitignore
new file mode 100644
index 0000000000..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/test_radixtree/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_radixtree/Makefile b/src/test/modules/test_radixtree/Makefile
new file mode 100644
index 0000000000..da06b93da3
--- /dev/null
+++ b/src/test/modules/test_radixtree/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_radixtree/Makefile
+
+MODULE_big = test_radixtree
+OBJS = \
+ $(WIN32RES) \
+ test_radixtree.o
+PGFILEDESC = "test_radixtree - test code for src/backend/lib/radixtree.c"
+
+EXTENSION = test_radixtree
+DATA = test_radixtree--1.0.sql
+
+REGRESS = test_radixtree
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_radixtree
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_radixtree/README b/src/test/modules/test_radixtree/README
new file mode 100644
index 0000000000..a8b271869a
--- /dev/null
+++ b/src/test/modules/test_radixtree/README
@@ -0,0 +1,7 @@
+test_radixtree contains unit tests for testing the radix tree implementation
+in src/backend/lib/radixtree.c.
+
+The tests verify the correctness of the implementation, but they can also be
+used as a micro-benchmark. If you set the 'rt_test_stats' flag in
+test_radixtree.c, the tests will print extra information about execution time
+and memory usage.
diff --git a/src/test/modules/test_radixtree/expected/test_radixtree.out b/src/test/modules/test_radixtree/expected/test_radixtree.out
new file mode 100644
index 0000000000..ce645cb8b5
--- /dev/null
+++ b/src/test/modules/test_radixtree/expected/test_radixtree.out
@@ -0,0 +1,36 @@
+CREATE EXTENSION test_radixtree;
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
+NOTICE: testing basic operations with leaf node 4
+NOTICE: testing basic operations with inner node 4
+NOTICE: testing basic operations with leaf node 32
+NOTICE: testing basic operations with inner node 32
+NOTICE: testing basic operations with leaf node 125
+NOTICE: testing basic operations with inner node 125
+NOTICE: testing basic operations with leaf node 256
+NOTICE: testing basic operations with inner node 256
+NOTICE: testing radix tree node types with shift "0"
+NOTICE: testing radix tree node types with shift "8"
+NOTICE: testing radix tree node types with shift "16"
+NOTICE: testing radix tree node types with shift "24"
+NOTICE: testing radix tree node types with shift "32"
+NOTICE: testing radix tree node types with shift "40"
+NOTICE: testing radix tree node types with shift "48"
+NOTICE: testing radix tree node types with shift "56"
+NOTICE: testing radix tree with pattern "all ones"
+NOTICE: testing radix tree with pattern "alternating bits"
+NOTICE: testing radix tree with pattern "clusters of ten"
+NOTICE: testing radix tree with pattern "clusters of hundred"
+NOTICE: testing radix tree with pattern "one-every-64k"
+NOTICE: testing radix tree with pattern "sparse"
+NOTICE: testing radix tree with pattern "single values, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^60"
+ test_radixtree
+----------------
+
+(1 row)
+
diff --git a/src/test/modules/test_radixtree/meson.build b/src/test/modules/test_radixtree/meson.build
new file mode 100644
index 0000000000..f96bf159d6
--- /dev/null
+++ b/src/test/modules/test_radixtree/meson.build
@@ -0,0 +1,34 @@
+# FIXME: prevent install during main install, but not during test :/
+
+test_radixtree_sources = files(
+ 'test_radixtree.c',
+)
+
+if host_system == 'windows'
+ test_radixtree_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'test_radixtree',
+ '--FILEDESC', 'test_radixtree - test code for src/backend/lib/radixtree.c',])
+endif
+
+test_radixtree = shared_module('test_radixtree',
+ test_radixtree_sources,
+ kwargs: pg_mod_args,
+)
+testprep_targets += test_radixtree
+
+install_data(
+ 'test_radixtree.control',
+ 'test_radixtree--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'test_radixtree',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'test_radixtree',
+ ],
+ },
+}
diff --git a/src/test/modules/test_radixtree/sql/test_radixtree.sql b/src/test/modules/test_radixtree/sql/test_radixtree.sql
new file mode 100644
index 0000000000..41ece5e9f5
--- /dev/null
+++ b/src/test/modules/test_radixtree/sql/test_radixtree.sql
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_radixtree;
+
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
diff --git a/src/test/modules/test_radixtree/test_radixtree--1.0.sql b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
new file mode 100644
index 0000000000..074a5a7ea7
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_radixtree/test_radixtree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_radixtree" to load this file. \quit
+
+CREATE FUNCTION test_radixtree()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
new file mode 100644
index 0000000000..61d842789d
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -0,0 +1,631 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_radixtree.c
+ * Test radixtree set data structure.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/test/modules/test_radixtree/test_radixtree.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "miscadmin.h"
+#include "nodes/bitmapset.h"
+#include "storage/block.h"
+#include "storage/itemptr.h"
+#include "storage/lwlock.h"
+#include "utils/memutils.h"
+#include "utils/timestamp.h"
+
+#define UINT64_HEX_FORMAT "%" INT64_MODIFIER "X"
+
+/*
+ * If you enable this, the "pattern" tests will print information about
+ * how long populating, probing, and iterating the test set takes, and
+ * how much memory the test set consumed. That can be used as
+ * micro-benchmark of various operations and input patterns (you might
+ * want to increase the number of values used in each of the test, if
+ * you do that, to reduce noise).
+ *
+ * The information is printed to the server's stderr, mostly because
+ * that's where MemoryContextStats() output goes.
+ */
+static const bool rt_test_stats = false;
+
+static int rt_node_kind_fanouts[] = {
+ 0,
+ 4, /* RT_NODE_KIND_4 */
+ 32, /* RT_NODE_KIND_32 */
+ 125, /* RT_NODE_KIND_125 */
+ 256 /* RT_NODE_KIND_256 */
+};
+/*
+ * A struct to define a pattern of integers, for use with the test_pattern()
+ * function.
+ */
+typedef struct
+{
+ char *test_name; /* short name of the test, for humans */
+ char *pattern_str; /* a bit pattern */
+ uint64 spacing; /* pattern repeats at this interval */
+ uint64 num_values; /* number of integers to set in total */
+} test_spec;
+
+/* Test patterns borrowed from test_integerset.c */
+static const test_spec test_specs[] = {
+ {
+ "all ones", "1111111111",
+ 10, 1000000
+ },
+ {
+ "alternating bits", "0101010101",
+ 10, 1000000
+ },
+ {
+ "clusters of ten", "1111111111",
+ 10000, 1000000
+ },
+ {
+ "clusters of hundred",
+ "1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111",
+ 10000, 1000000
+ },
+ {
+ "one-every-64k", "1",
+ 65536, 1000000
+ },
+ {
+ "sparse", "100000000000000000000000000000001",
+ 10000000, 1000000
+ },
+ {
+ "single values, distance > 2^32", "1",
+ UINT64CONST(10000000000), 100000
+ },
+ {
+ "clusters, distance > 2^32", "10101010",
+ UINT64CONST(10000000000), 1000000
+ },
+ {
+ "clusters, distance > 2^60", "10101010",
+ UINT64CONST(2000000000000000000),
+ 23 /* can't be much higher than this, or we
+ * overflow uint64 */
+ }
+};
+
+/* define the radix tree implementation to test */
+#define RT_PREFIX rt
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+// WIP: compiles with warnings because rt_attach is defined but not used
+// #define RT_SHMEM
+#include "lib/radixtree.h"
+
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(test_radixtree);
+
+static void
+test_empty(void)
+{
+ rt_radix_tree *radixtree;
+ rt_iter *iter;
+ uint64 dummy;
+ uint64 key;
+ uint64 val;
+
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+ dsa = dsa_create(tranche_id);
+
+ radixtree = rt_create(CurrentMemoryContext, dsa);
+#else
+ radixtree = rt_create(CurrentMemoryContext);
+#endif
+
+ if (rt_search(radixtree, 0, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, 1, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, PG_UINT64_MAX, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_delete(radixtree, 0))
+ elog(ERROR, "rt_delete on empty tree returned true");
+
+ if (rt_num_entries(radixtree) != 0)
+ elog(ERROR, "rt_num_entries on empty tree return non-zero");
+
+ iter = rt_begin_iterate(radixtree);
+
+ if (rt_iterate_next(iter, &key, &val))
+ elog(ERROR, "rt_itereate_next on empty tree returned true");
+
+ rt_end_iterate(iter);
+
+ rt_free(radixtree);
+}
+
+static void
+test_basic(int children, bool test_inner)
+{
+ rt_radix_tree *radixtree;
+ uint64 *keys;
+ int shift = test_inner ? 8 : 0;
+
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+ dsa = dsa_create(tranche_id);
+#endif
+
+ elog(NOTICE, "testing basic operations with %s node %d",
+ test_inner ? "inner" : "leaf", children);
+
+#ifdef RT_SHMEM
+ radixtree = rt_create(CurrentMemoryContext, dsa);
+#else
+ radixtree = rt_create(CurrentMemoryContext);
+#endif
+
+ /* prepare keys in order like 1, 32, 2, 31, 3, ... */
+ keys = palloc(sizeof(uint64) * children);
+ for (int i = 0; i < children; i++)
+ {
+ if (i % 2 == 0)
+ keys[i] = (uint64) ((i / 2) + 1) << shift;
+ else
+ keys[i] = (uint64) (children - (i / 2)) << shift;
+ }
+
+ /* insert keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (rt_set(radixtree, keys[i], keys[i]))
+ elog(ERROR, "new inserted key 0x" UINT64_HEX_FORMAT " is found ", keys[i]);
+ }
+
+ /* update keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (!rt_set(radixtree, keys[i], keys[i] + 1))
+ elog(ERROR, "could not update key 0x" UINT64_HEX_FORMAT, keys[i]);
+ }
+
+ /* repeat deleting and inserting keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (!rt_delete(radixtree, keys[i]))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, keys[i]);
+ if (rt_set(radixtree, keys[i], keys[i]))
+ elog(ERROR, "new inserted key 0x" UINT64_HEX_FORMAT " is found ", keys[i]);
+ }
+
+ pfree(keys);
+ rt_free(radixtree);
+}
+
+/*
+ * Check if keys from start to end with the shift exist in the tree.
+ */
+static void
+check_search_on_node(rt_radix_tree *radixtree, uint8 shift, int start, int end,
+ int incr)
+{
+ for (int i = start; i < end; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ uint64 val;
+
+ if (!rt_search(radixtree, key, &val))
+ elog(ERROR, "key 0x" UINT64_HEX_FORMAT " is not found on node-%d",
+ key, end);
+ if (val != key)
+ elog(ERROR, "rt_search with key 0x" UINT64_HEX_FORMAT " returns 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ key, val, key);
+ }
+}
+
+static void
+test_node_types_insert(rt_radix_tree *radixtree, uint8 shift, bool insert_asc)
+{
+ uint64 num_entries;
+ int ninserted = 0;
+ int start = insert_asc ? 0 : 256;
+ int incr = insert_asc ? 1 : -1;
+ int end = insert_asc ? 256 : 0;
+ int node_kind_idx = 1;
+
+ for (int i = start; i != end; i += incr)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_set(radixtree, key, key);
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " is found", key);
+
+ /*
+ * After filling all slots in each node type, check if the values
+ * are stored properly.
+ */
+ if (ninserted == rt_node_kind_fanouts[node_kind_idx] - 1)
+ {
+ int check_start = insert_asc
+ ? rt_node_kind_fanouts[node_kind_idx - 1]
+ : rt_node_kind_fanouts[node_kind_idx];
+ int check_end = insert_asc
+ ? rt_node_kind_fanouts[node_kind_idx]
+ : rt_node_kind_fanouts[node_kind_idx - 1];
+
+ check_search_on_node(radixtree, shift, check_start, check_end, incr);
+ node_kind_idx++;
+ }
+
+ ninserted++;
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ if (num_entries != 256)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(256));
+}
+
+static void
+test_node_types_delete(rt_radix_tree *radixtree, uint8 shift)
+{
+ uint64 num_entries;
+
+ for (int i = 0; i < 256; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_delete(radixtree, key);
+
+ if (!found)
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, key);
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ /* The tree must be empty */
+ if (num_entries != 0)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(0));
+}
+
+/*
+ * Test for inserting and deleting key-value pairs to each node type at the given shift
+ * level.
+ */
+static void
+test_node_types(uint8 shift)
+{
+ rt_radix_tree *radixtree;
+
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+ dsa = dsa_create(tranche_id);
+#endif
+
+ elog(NOTICE, "testing radix tree node types with shift \"%d\"", shift);
+
+#ifdef RT_SHMEM
+ radixtree = rt_create(CurrentMemoryContext, dsa);
+#else
+ radixtree = rt_create(CurrentMemoryContext);
+#endif
+
+ /*
+ * Insert and search entries for every node type at the 'shift' level,
+ * then delete all entries to make it empty, and insert and search entries
+ * again.
+ */
+ test_node_types_insert(radixtree, shift, true);
+ test_node_types_delete(radixtree, shift);
+ test_node_types_insert(radixtree, shift, false);
+
+ rt_free(radixtree);
+}
+
+/*
+ * Test with a repeating pattern, defined by the 'spec'.
+ */
+static void
+test_pattern(const test_spec * spec)
+{
+ rt_radix_tree *radixtree;
+ rt_iter *iter;
+ MemoryContext radixtree_ctx;
+ TimestampTz starttime;
+ TimestampTz endtime;
+ uint64 n;
+ uint64 last_int;
+ uint64 ndeleted;
+ uint64 nbefore;
+ uint64 nafter;
+ int patternlen;
+ uint64 *pattern_values;
+ uint64 pattern_num_values;
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+ dsa = dsa_create(tranche_id);
+#endif
+
+ elog(NOTICE, "testing radix tree with pattern \"%s\"", spec->test_name);
+ if (rt_test_stats)
+ fprintf(stderr, "-----\ntesting radix tree with pattern \"%s\"\n", spec->test_name);
+
+ /* Pre-process the pattern, creating an array of integers from it. */
+ patternlen = strlen(spec->pattern_str);
+ pattern_values = palloc(patternlen * sizeof(uint64));
+ pattern_num_values = 0;
+ for (int i = 0; i < patternlen; i++)
+ {
+ if (spec->pattern_str[i] == '1')
+ pattern_values[pattern_num_values++] = i;
+ }
+
+ /*
+ * Allocate the radix tree.
+ *
+ * Allocate it in a separate memory context, so that we can print its
+ * memory usage easily.
+ */
+ radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
+ "radixtree test",
+ ALLOCSET_SMALL_SIZES);
+ MemoryContextSetIdentifier(radixtree_ctx, spec->test_name);
+
+#ifdef RT_SHMEM
+ radixtree = rt_create(radixtree_ctx, dsa);
+#else
+ radixtree = rt_create(radixtree_ctx);
+#endif
+
+
+ /*
+ * Add values to the set.
+ */
+ starttime = GetCurrentTimestamp();
+
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ uint64 x = 0;
+
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ bool found;
+
+ x = last_int + pattern_values[i];
+
+ found = rt_set(radixtree, x, x);
+
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " found", x);
+
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+
+ endtime = GetCurrentTimestamp();
+
+ if (rt_test_stats)
+ fprintf(stderr, "added " UINT64_FORMAT " values in %d ms\n",
+ spec->num_values, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Print stats on the amount of memory used.
+ *
+ * We print the usage reported by rt_memory_usage(), as well as the stats
+ * from the memory context. They should be in the same ballpark, but it's
+ * hard to automate testing that, so if you're making changes to the
+ * implementation, just observe that manually.
+ */
+ if (rt_test_stats)
+ {
+ uint64 mem_usage;
+
+ /*
+ * Also print memory usage as reported by rt_memory_usage(). It
+ * should be in the same ballpark as the usage reported by
+ * MemoryContextStats().
+ */
+ mem_usage = rt_memory_usage(radixtree);
+ fprintf(stderr, "rt_memory_usage() reported " UINT64_FORMAT " (%0.2f bytes / integer)\n",
+ mem_usage, (double) mem_usage / spec->num_values);
+
+ MemoryContextStats(radixtree_ctx);
+ }
+
+ /* Check that rt_num_entries works */
+ n = rt_num_entries(radixtree);
+ if (n != spec->num_values)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT, n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_search()
+ */
+ starttime = GetCurrentTimestamp();
+
+ for (n = 0; n < 100000; n++)
+ {
+ bool found;
+ bool expected;
+ uint64 x;
+ uint64 v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Do we expect this value to be present in the set? */
+ if (x >= last_int)
+ expected = false;
+ else
+ {
+ uint64 idx = x % spec->spacing;
+
+ if (idx >= patternlen)
+ expected = false;
+ else if (spec->pattern_str[idx] == '1')
+ expected = true;
+ else
+ expected = false;
+ }
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (found != expected)
+ elog(ERROR, "mismatch at 0x" UINT64_HEX_FORMAT ": %d vs %d", x, found, expected);
+ if (found && (v != x))
+ elog(ERROR, "found 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ v, x);
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "probed " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Test iterator
+ */
+ starttime = GetCurrentTimestamp();
+
+ iter = rt_begin_iterate(radixtree);
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ uint64 expected = last_int + pattern_values[i];
+ uint64 x;
+ uint64 val;
+
+ if (!rt_iterate_next(iter, &x, &val))
+ break;
+
+ if (x != expected)
+ elog(ERROR,
+ "iterate returned wrong key; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d",
+ x, expected, i);
+ if (val != expected)
+ elog(ERROR,
+ "iterate returned wrong value; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d", x, expected, i);
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "iterated " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ rt_end_iterate(iter);
+
+ if (n < spec->num_values)
+ elog(ERROR, "iterator stopped short after " UINT64_FORMAT " entries, expected " UINT64_FORMAT, n, spec->num_values);
+ if (n > spec->num_values)
+ elog(ERROR, "iterator returned " UINT64_FORMAT " entries, " UINT64_FORMAT " was expected", n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_delete()
+ */
+ starttime = GetCurrentTimestamp();
+
+ nbefore = rt_num_entries(radixtree);
+ ndeleted = 0;
+ for (n = 0; n < 1; n++)
+ {
+ bool found;
+ uint64 x;
+ uint64 v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (!found)
+ continue;
+
+ /* If the key is found, delete it and check again */
+ if (!rt_delete(radixtree, x))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_search(radixtree, x, &v))
+ elog(ERROR, "found deleted key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_delete(radixtree, x))
+ elog(ERROR, "deleted already-deleted key 0x" UINT64_HEX_FORMAT, x);
+
+ ndeleted++;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "deleted " UINT64_FORMAT " values in %d ms\n",
+ ndeleted, (int) (endtime - starttime) / 1000);
+
+ nafter = rt_num_entries(radixtree);
+
+ /* Check that rt_num_entries works */
+ if ((nbefore - ndeleted) != nafter)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT "after " UINT64_FORMAT " deletion",
+ nafter, (nbefore - ndeleted), ndeleted);
+
+ rt_free(radixtree);
+ MemoryContextDelete(radixtree_ctx);
+}
+
+Datum
+test_radixtree(PG_FUNCTION_ARGS)
+{
+ test_empty();
+
+ for (int i = 1; i < lengthof(rt_node_kind_fanouts); i++)
+ {
+ test_basic(rt_node_kind_fanouts[i], false);
+ test_basic(rt_node_kind_fanouts[i], true);
+ }
+
+ for (int shift = 0; shift <= (64 - 8); shift += 8)
+ test_node_types(shift);
+
+ /* Test different test patterns, with lots of entries */
+ for (int i = 0; i < lengthof(test_specs); i++)
+ test_pattern(&test_specs[i]);
+
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_radixtree/test_radixtree.control b/src/test/modules/test_radixtree/test_radixtree.control
new file mode 100644
index 0000000000..e53f2a3e0c
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.control
@@ -0,0 +1,4 @@
+comment = 'Test code for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/test_radixtree'
+relocatable = true
diff --git a/src/tools/pginclude/cpluspluscheck b/src/tools/pginclude/cpluspluscheck
index e52fe9f509..09fa6e7432 100755
--- a/src/tools/pginclude/cpluspluscheck
+++ b/src/tools/pginclude/cpluspluscheck
@@ -101,6 +101,12 @@ do
test "$f" = src/include/nodes/nodetags.h && continue
test "$f" = src/backend/nodes/nodetags.h && continue
+ # radixtree_*_impl.h cannot be included standalone: they are just code fragments.
+ test "$f" = src/include/lib/radixtree_delete_impl.h && continue
+ test "$f" = src/include/lib/radixtree_insert_impl.h && continue
+ test "$f" = src/include/lib/radixtree_iter_impl.h && continue
+ test "$f" = src/include/lib/radixtree_search_impl.h && continue
+
# These files are not meant to be included standalone, because
# they contain lists that might have multiple use-cases.
test "$f" = src/include/access/rmgrlist.h && continue
diff --git a/src/tools/pginclude/headerscheck b/src/tools/pginclude/headerscheck
index abbba7aa63..d4d2f1da03 100755
--- a/src/tools/pginclude/headerscheck
+++ b/src/tools/pginclude/headerscheck
@@ -96,6 +96,12 @@ do
test "$f" = src/include/nodes/nodetags.h && continue
test "$f" = src/backend/nodes/nodetags.h && continue
+ # radixtree_*_impl.h cannot be included standalone: they are just code fragments.
+ test "$f" = src/include/lib/radixtree_delete_impl.h && continue
+ test "$f" = src/include/lib/radixtree_insert_impl.h && continue
+ test "$f" = src/include/lib/radixtree_iter_impl.h && continue
+ test "$f" = src/include/lib/radixtree_search_impl.h && continue
+
# These files are not meant to be included standalone, because
# they contain lists that might have multiple use-cases.
test "$f" = src/include/access/rmgrlist.h && continue
--
2.31.1
Attachment: v20-0001-introduce-vector8_min-and-vector8_highbit_mask.patch
From 055a13ace935bd5c6ca421437efb371a25e79b8f Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 24 Oct 2022 14:07:09 +0900
Subject: [PATCH v20 01/13] introduce vector8_min and vector8_highbit_mask
---
src/include/port/simd.h | 47 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 47 insertions(+)
diff --git a/src/include/port/simd.h b/src/include/port/simd.h
index c836360d4b..84d41a340a 100644
--- a/src/include/port/simd.h
+++ b/src/include/port/simd.h
@@ -77,6 +77,7 @@ static inline bool vector8_has(const Vector8 v, const uint8 c);
static inline bool vector8_has_zero(const Vector8 v);
static inline bool vector8_has_le(const Vector8 v, const uint8 c);
static inline bool vector8_is_highbit_set(const Vector8 v);
+static inline uint32 vector8_highbit_mask(const Vector8 v);
#ifndef USE_NO_SIMD
static inline bool vector32_is_highbit_set(const Vector32 v);
#endif
@@ -96,6 +97,7 @@ static inline Vector8 vector8_ssub(const Vector8 v1, const Vector8 v2);
*/
#ifndef USE_NO_SIMD
static inline Vector8 vector8_eq(const Vector8 v1, const Vector8 v2);
+static inline Vector8 vector8_min(const Vector8 v1, const Vector8 v2);
static inline Vector32 vector32_eq(const Vector32 v1, const Vector32 v2);
#endif
@@ -277,6 +279,36 @@ vector8_is_highbit_set(const Vector8 v)
#endif
}
+/*
+ * Return the bitmask of the high bit of each element.
+ */
+static inline uint32
+vector8_highbit_mask(const Vector8 v)
+{
+#ifdef USE_SSE2
+ return (uint32) _mm_movemask_epi8(v);
+#elif defined(USE_NEON)
+ static const uint8 mask[16] = {
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ };
+
+ uint8x16_t masked = vandq_u8(vld1q_u8(mask), (uint8x16_t) vshrq_n_s8(v, 7));
+ uint8x16_t maskedhi = vextq_u8(masked, masked, 8);
+
+ return (uint32) vaddvq_u16((uint16x8_t) vzip1q_u8(masked, maskedhi));
+#else
+ uint32 mask = 0;
+
+ for (Size i = 0; i < sizeof(Vector8); i++)
+ mask |= (((const uint8 *) &v)[i] >> 7) << i;
+
+ return mask;
+#endif
+}
+
/*
* Exactly like vector8_is_highbit_set except for the input type, so it
* looks at each byte separately.
@@ -372,4 +404,19 @@ vector32_eq(const Vector32 v1, const Vector32 v2)
}
#endif /* ! USE_NO_SIMD */
+/*
+ * Compare the given vectors and return the vector of minimum elements.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_min(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+ return _mm_min_epu8(v1, v2);
+#elif defined(USE_NEON)
+ return vminq_u8(v1, v2);
+#endif
+}
+#endif /* ! USE_NO_SIMD */
+
#endif /* SIMD_H */
--
2.31.1
Attachment: v20-0002-Move-some-bitmap-logic-out-of-bitmapset.c.patch
From 3e97873aee57c929e38cc38c35205de3e3fb8525 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Tue, 6 Dec 2022 13:39:41 +0700
Subject: [PATCH v20 02/13] Move some bitmap logic out of bitmapset.c
Add pg_rightmost_one32/64 functions. This functionality was previously
private to bitmapset.c as the RIGHTMOST_ONE macro. It has practical
use in other contexts, so move to pg_bitutils.h.
Also move the logic for selecting appropriate pg_bitutils functions
based on word size to bitmapset.h for wider visibility and add
appropriate selection for bmw_rightmost_one().
Since the previous macro relied on casting to signedbitmapword,
and the new functions do not, remove that typedef.
Design input and review by Tom Lane
Discussion: https://www.postgresql.org/message-id/CAFBsxsFW2JjTo58jtDB%2B3sZhxMx3t-3evew8%3DAcr%2BGGhC%2BkFaA%40mail.gmail.com
---
src/backend/nodes/bitmapset.c | 36 ++------------------------------
src/include/nodes/bitmapset.h | 16 ++++++++++++--
src/include/port/pg_bitutils.h | 31 +++++++++++++++++++++++++++
src/tools/pgindent/typedefs.list | 1 -
4 files changed, 47 insertions(+), 37 deletions(-)
diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index 98660524ad..fcd8e2ccbc 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -32,39 +32,7 @@
#define BITMAPSET_SIZE(nwords) \
(offsetof(Bitmapset, words) + (nwords) * sizeof(bitmapword))
-/*----------
- * This is a well-known cute trick for isolating the rightmost one-bit
- * in a word. It assumes two's complement arithmetic. Consider any
- * nonzero value, and focus attention on the rightmost one. The value is
- * then something like
- * xxxxxx10000
- * where x's are unspecified bits. The two's complement negative is formed
- * by inverting all the bits and adding one. Inversion gives
- * yyyyyy01111
- * where each y is the inverse of the corresponding x. Incrementing gives
- * yyyyyy10000
- * and then ANDing with the original value gives
- * 00000010000
- * This works for all cases except original value = zero, where of course
- * we get zero.
- *----------
- */
-#define RIGHTMOST_ONE(x) ((signedbitmapword) (x) & -((signedbitmapword) (x)))
-
-#define HAS_MULTIPLE_ONES(x) ((bitmapword) RIGHTMOST_ONE(x) != (x))
-
-/* Select appropriate bit-twiddling functions for bitmap word size */
-#if BITS_PER_BITMAPWORD == 32
-#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos32(w)
-#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos32(w)
-#define bmw_popcount(w) pg_popcount32(w)
-#elif BITS_PER_BITMAPWORD == 64
-#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos64(w)
-#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos64(w)
-#define bmw_popcount(w) pg_popcount64(w)
-#else
-#error "invalid BITS_PER_BITMAPWORD"
-#endif
+#define HAS_MULTIPLE_ONES(x) (bmw_rightmost_one(x) != (x))
/*
@@ -1013,7 +981,7 @@ bms_first_member(Bitmapset *a)
{
int result;
- w = RIGHTMOST_ONE(w);
+ w = bmw_rightmost_one(w);
a->words[wordnum] &= ~w;
result = wordnum * BITS_PER_BITMAPWORD;
diff --git a/src/include/nodes/bitmapset.h b/src/include/nodes/bitmapset.h
index 0dca6bc5fa..80e91fac0f 100644
--- a/src/include/nodes/bitmapset.h
+++ b/src/include/nodes/bitmapset.h
@@ -38,13 +38,11 @@ struct List;
#define BITS_PER_BITMAPWORD 64
typedef uint64 bitmapword; /* must be an unsigned type */
-typedef int64 signedbitmapword; /* must be the matching signed type */
#else
#define BITS_PER_BITMAPWORD 32
typedef uint32 bitmapword; /* must be an unsigned type */
-typedef int32 signedbitmapword; /* must be the matching signed type */
#endif
@@ -75,6 +73,20 @@ typedef enum
BMS_MULTIPLE /* >1 member */
} BMS_Membership;
+/* Select appropriate bit-twiddling functions for bitmap word size */
+#if BITS_PER_BITMAPWORD == 32
+#define bmw_rightmost_one(w) pg_rightmost_one32(w)
+#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos32(w)
+#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos32(w)
+#define bmw_popcount(w) pg_popcount32(w)
+#elif BITS_PER_BITMAPWORD == 64
+#define bmw_rightmost_one(w) pg_rightmost_one64(w)
+#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos64(w)
+#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos64(w)
+#define bmw_popcount(w) pg_popcount64(w)
+#else
+#error "invalid BITS_PER_BITMAPWORD"
+#endif
/*
* function prototypes in nodes/bitmapset.c
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index a49a9c03d9..7235ad25ee 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -17,6 +17,37 @@ extern PGDLLIMPORT const uint8 pg_leftmost_one_pos[256];
extern PGDLLIMPORT const uint8 pg_rightmost_one_pos[256];
extern PGDLLIMPORT const uint8 pg_number_of_ones[256];
+/*----------
+ * This is a well-known cute trick for isolating the rightmost one-bit
+ * in a word. It assumes two's complement arithmetic. Consider any
+ * nonzero value, and focus attention on the rightmost one. The value is
+ * then something like
+ * xxxxxx10000
+ * where x's are unspecified bits. The two's complement negative is formed
+ * by inverting all the bits and adding one. Inversion gives
+ * yyyyyy01111
+ * where each y is the inverse of the corresponding x. Incrementing gives
+ * yyyyyy10000
+ * and then ANDing with the original value gives
+ * 00000010000
+ * This works for all cases except original value = zero, where of course
+ * we get zero.
+ *----------
+ */
+static inline uint32
+pg_rightmost_one32(uint32 word)
+{
+ int32 result = (int32) word & -((int32) word);
+ return (uint32) result;
+}
+
+static inline uint64
+pg_rightmost_one64(uint64 word)
+{
+ int64 result = (int64) word & -((int64) word);
+ return (uint64) result;
+}
+
/*
* pg_leftmost_one_pos32
* Returns the position of the most significant set bit in "word",
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 23bafec5f7..5bd3da4948 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3662,7 +3662,6 @@ shmem_request_hook_type
shmem_startup_hook_type
sig_atomic_t
sigjmp_buf
-signedbitmapword
sigset_t
size_t
slist_head
--
2.31.1
On Mon, Jan 16, 2023 at 3:18 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
On Mon, Jan 16, 2023 at 2:02 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
In v21, all of your v20 improvements to the radix tree template and test
have been squashed into 0003, with one exception: v20-0010 (recursive
freeing of shared mem), which I've attached separately (for flexibility) as
v21-0006. I believe one of your earlier patches had a new DSA function for
freeing memory more quickly -- was there a problem with that approach? I
don't recall where that discussion went.
+ * XXX: Most functions in this file have two variants for inner nodes and leaf
+ * nodes, therefore there are duplication codes. While this sometimes makes the
+ * code maintenance tricky, this reduces branch prediction misses when judging
+ * whether the node is a inner node of a leaf node.
This comment seems to be out-of-date since we made it a template.
Done in 0020, along with a bunch of other comment editing.
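
(Aside for readers following the archive: the inner/leaf duplication referred
to above is produced by compiling small shared fragments twice, once per node
level. A rough sketch of the pattern as it appears inside radixtree.h is
below; the function names are illustrative and the RT_* symbols come from the
template, so treat this as a sketch rather than standalone-compilable code.)

/* Sketch of the per-level fragment pattern (illustrative names only). */
static bool
sketch_node_search_leaf(RT_PTR_LOCAL node, uint64 key, uint64 *value_p)
{
#define RT_NODE_LEVEL_LEAF
#include "lib/radixtree_search_impl.h"
#undef RT_NODE_LEVEL_LEAF
}

static bool
sketch_node_search_inner(RT_PTR_LOCAL node, uint64 key, RT_PTR_ALLOC *child_p)
{
#define RT_NODE_LEVEL_INNER
#include "lib/radixtree_search_impl.h"
#undef RT_NODE_LEVEL_INNER
}
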
The following macros are defined but not undefined in radixtree.h:
Fixed in v21-0018.
Also:
0007 makes the value type configurable. Some debug functionality still
assumes integer type, but I think the rest is agnostic.
0010 turns node4 into node3, as discussed, going from 48 bytes to 32.
0012 adopts the benchmark module to the template, and adds meson support
(builds with warnings, but okay because not meant for commit).
The rest are cleanups, small refactorings, and more comment rewrites. I've
kept them separate for visibility. Next patch can squash them unless there
is any discussion.
uint32 is how we store the block number, so this is too small and will
wrap around on overflow. int64 seems better.
Agreed, will fix.
Great, but it's now uint64, not int64. All the large counters in struct
LVRelState, for example, are signed integers, as the usual practice.
Unsigned ints are "usually" for things like bit patterns and where explicit
wraparound is desired. There's probably more that can be done here to
change to signed types, but I think it's still a bit early to get to that
level of nitpicking. (Soon, I hope :-) )
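
(A standalone illustration, not from the patch, of why a 32-bit tally is the
wrong type once a count can pass 2^32 -- the value silently wraps:)

/* Illustration only; not patch code. */
#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	int64_t		ntids = INT64_C(5000000000);	/* e.g. ~5 billion dead TIDs */
	uint32_t	narrow = (uint32_t) ntids;		/* wraps to 705032704 */

	printf("int64 counter : %lld\n", (long long) ntids);
	printf("uint32 counter: %u (wrapped)\n", (unsigned) narrow);
	return 0;
}
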
+ * We calculate the maximum bytes for the TidStore in different ways
+ * for non-shared case and shared case. Please refer to the comment
+ * TIDSTORE_MEMORY_DEDUCT for details.
+ */
Maybe the #define and comment should be close to here.
Will fix.
For this, I intended that "here" meant "in or just above the function".
+#define TIDSTORE_LOCAL_MAX_MEMORY_DEDUCT (1024L * 70) /* 70kB */
+#define TIDSTORE_SHARED_MAX_MEMORY_RATIO_PO2 (float) 0.75
+#define TIDSTORE_SHARED_MAX_MEMORY_RATIO (float) 0.6
These symbols are used only once, in tidstore_create(), and are difficult
to read. That function has few comments. The symbols have several
paragraphs, but they are far away. It might be better for readability to
just hard-code numbers in the function, with the explanation about the
numbers near where they are used.
+ * Destroy a TidStore, returning all memory. The caller must be certain that
+ * no other backend will attempt to access the TidStore before calling this
+ * function. Other backend must explicitly call tidstore_detach to free up
+ * backend-local memory associated with the TidStore. The backend that calls
+ * tidstore_destroy must not call tidstore_detach.
+ */
+void
+tidstore_destroy(TidStore *ts)
If not addressed by next patch, need to phrase comment with FIXME or
TODO about making certain.
Will fix.
Did anything change here? There is also this, in the template, which I'm
not sure has been addressed:
* XXX: Currently we allow only one process to do iteration. Therefore, rt_node_iter
* has the local pointers to nodes, rather than RT_PTR_ALLOC.
* We need either a safeguard to disallow other processes to begin the iteration
* while one process is doing or to allow multiple processes to do the iteration.
This part only runs "if (vacrel->nindexes == 0)", so seems like
unneeded complexity. It arises because lazy_scan_prune() populates the tid
store even if no index vacuuming happens. Perhaps the caller of
lazy_scan_prune() could pass the deadoffsets array, and upon returning,
either populate the store or call lazy_vacuum_heap_page(), as needed. It's
quite possible I'm missing some detail, so some description of the design
choices made would be helpful.
I agree that we don't need complexity here. I'll try this idea.
Keeping the offsets array in the prunestate seems to work out well.
Some other quick comments on tid store and vacuum, not comprehensive. Let
me know if I've misunderstood something:
TID store:
+ * XXXXXXXX XXXYYYYY YYYYYYYY YYYYYYYY YYYYYYYY YYYuuuu
+ *
+ * X = bits used for offset number
+ * Y = bits used for block number
+ * u = unused bit
I was confused for a while, and I realized the bits are in reverse order
from how they are usually pictured (high on left, low on the right).
+ * 11 bits enough for the offset number, because MaxHeapTuplesPerPage < 2^11
+ * on all supported block sizes (TIDSTORE_OFFSET_NBITS). We are frugal with
+ * XXX: if we want to support non-heap table AM that want to use the full
+ * range of possible offset numbers, we'll need to reconsider
+ * TIDSTORE_OFFSET_NBITS value.
Would it be worth it (or possible) to calculate constants based on
compile-time block size? And/or have a fallback for other table AMs? Since
this file is in access/common, the intention is to allow general-purpose, I
imagine.
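
(To make the quoted layout concrete, here is a hypothetical sketch of packing
a block/offset pair into and out of a 64-bit key with an 11-bit offset field.
The helper names are invented for illustration and are not the patch's API.)

/* Hypothetical sketch only; names are not from the patch. */
#include "postgres.h"
#include "storage/block.h"
#include "storage/off.h"

#define SKETCH_OFFSET_NBITS	11
#define SKETCH_OFFSET_MASK	((UINT64CONST(1) << SKETCH_OFFSET_NBITS) - 1)

static inline uint64
sketch_encode_tid_key(BlockNumber block, OffsetNumber offset)
{
	Assert(offset <= SKETCH_OFFSET_MASK);
	return ((uint64) block << SKETCH_OFFSET_NBITS) | (uint64) offset;
}

static inline void
sketch_decode_tid_key(uint64 key, BlockNumber *block, OffsetNumber *offset)
{
	*offset = (OffsetNumber) (key & SKETCH_OFFSET_MASK);
	*block = (BlockNumber) (key >> SKETCH_OFFSET_NBITS);
}
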
+typedef dsa_pointer tidstore_handle;
It's not clear why we need a typedef here, since here:
+tidstore_attach(dsa_area *area, tidstore_handle handle)
+{
+ TidStore *ts;
+ dsa_pointer control;
...
+ control = handle;
...there is a differently-named dsa_pointer variable that just gets the
function parameter.
+/* Return the maximum memory TidStore can use */
+uint64
+tidstore_max_memory(TidStore *ts)
size_t is more suitable for memory.
+ /*
+ * Since the shared radix tree supports concurrent insert,
+ * we don't need to acquire the lock.
+ */
Hmm? IIUC, the caller only acquires the lock after returning from here, to
update statistics. Why is it safe to insert with no lock? Am I missing
something?
VACUUM integration:
-#define PARALLEL_VACUUM_KEY_DEAD_ITEMS 2
+#define PARALLEL_VACUUM_KEY_DSA 2
Seems like unnecessary churn? It is still all about dead items, after all.
I understand using "DSA" for the LWLock, since that matches surrounding
code.
+#define HAS_LPDEAD_ITEMS(state) (((state).lpdead_items) > 0)
This macro helps the patch readability in some places, but I'm not sure it
helps readability of the file as a whole. The following is in the patch and
seems perfectly clear without the macro:
- if (lpdead_items > 0)
+ if (prunestate->lpdead_items > 0)
About shared memory: I have some mild reservations about the naming of the
"control object", which may be in shared memory. Is that an established
term? (If so, disregard the rest): It seems backwards -- the thing in
shared memory is the actual tree itself. The thing in backend-local memory
has the "handle", and that's how we control the tree. I don't have a better
naming scheme, though, and might not be that important. (Added a WIP
comment)
Now might be a good time to look at earlier XXX comments and come up with a
plan to address them.
That's all I have for now.
--
John Naylor
EDB: http://www.enterprisedb.com
Attachments:
Attachment: v21-0002-Move-some-bitmap-logic-out-of-bitmapset.c.patch
From 2bd133c432a960f79ec58edbf0fe0767620d81c0 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Tue, 6 Dec 2022 13:39:41 +0700
Subject: [PATCH v21 02/22] Move some bitmap logic out of bitmapset.c
Add pg_rightmost_one32/64 functions. This functionality was previously
private to bitmapset.c as the RIGHTMOST_ONE macro. It has practical
use in other contexts, so move to pg_bitutils.h.
Also move the logic for selecting appropriate pg_bitutils functions
based on word size to bitmapset.h for wider visibility and add
appropriate selection for bmw_rightmost_one().
Since the previous macro relied on casting to signedbitmapword,
and the new functions do not, remove that typedef.
Design input and review by Tom Lane
Discussion: https://www.postgresql.org/message-id/CAFBsxsFW2JjTo58jtDB%2B3sZhxMx3t-3evew8%3DAcr%2BGGhC%2BkFaA%40mail.gmail.com
---
src/backend/nodes/bitmapset.c | 36 ++------------------------------
src/include/nodes/bitmapset.h | 16 ++++++++++++--
src/include/port/pg_bitutils.h | 31 +++++++++++++++++++++++++++
src/tools/pgindent/typedefs.list | 1 -
4 files changed, 47 insertions(+), 37 deletions(-)
diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index 98660524ad..fcd8e2ccbc 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -32,39 +32,7 @@
#define BITMAPSET_SIZE(nwords) \
(offsetof(Bitmapset, words) + (nwords) * sizeof(bitmapword))
-/*----------
- * This is a well-known cute trick for isolating the rightmost one-bit
- * in a word. It assumes two's complement arithmetic. Consider any
- * nonzero value, and focus attention on the rightmost one. The value is
- * then something like
- * xxxxxx10000
- * where x's are unspecified bits. The two's complement negative is formed
- * by inverting all the bits and adding one. Inversion gives
- * yyyyyy01111
- * where each y is the inverse of the corresponding x. Incrementing gives
- * yyyyyy10000
- * and then ANDing with the original value gives
- * 00000010000
- * This works for all cases except original value = zero, where of course
- * we get zero.
- *----------
- */
-#define RIGHTMOST_ONE(x) ((signedbitmapword) (x) & -((signedbitmapword) (x)))
-
-#define HAS_MULTIPLE_ONES(x) ((bitmapword) RIGHTMOST_ONE(x) != (x))
-
-/* Select appropriate bit-twiddling functions for bitmap word size */
-#if BITS_PER_BITMAPWORD == 32
-#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos32(w)
-#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos32(w)
-#define bmw_popcount(w) pg_popcount32(w)
-#elif BITS_PER_BITMAPWORD == 64
-#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos64(w)
-#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos64(w)
-#define bmw_popcount(w) pg_popcount64(w)
-#else
-#error "invalid BITS_PER_BITMAPWORD"
-#endif
+#define HAS_MULTIPLE_ONES(x) (bmw_rightmost_one(x) != (x))
/*
@@ -1013,7 +981,7 @@ bms_first_member(Bitmapset *a)
{
int result;
- w = RIGHTMOST_ONE(w);
+ w = bmw_rightmost_one(w);
a->words[wordnum] &= ~w;
result = wordnum * BITS_PER_BITMAPWORD;
diff --git a/src/include/nodes/bitmapset.h b/src/include/nodes/bitmapset.h
index 0dca6bc5fa..80e91fac0f 100644
--- a/src/include/nodes/bitmapset.h
+++ b/src/include/nodes/bitmapset.h
@@ -38,13 +38,11 @@ struct List;
#define BITS_PER_BITMAPWORD 64
typedef uint64 bitmapword; /* must be an unsigned type */
-typedef int64 signedbitmapword; /* must be the matching signed type */
#else
#define BITS_PER_BITMAPWORD 32
typedef uint32 bitmapword; /* must be an unsigned type */
-typedef int32 signedbitmapword; /* must be the matching signed type */
#endif
@@ -75,6 +73,20 @@ typedef enum
BMS_MULTIPLE /* >1 member */
} BMS_Membership;
+/* Select appropriate bit-twiddling functions for bitmap word size */
+#if BITS_PER_BITMAPWORD == 32
+#define bmw_rightmost_one(w) pg_rightmost_one32(w)
+#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos32(w)
+#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos32(w)
+#define bmw_popcount(w) pg_popcount32(w)
+#elif BITS_PER_BITMAPWORD == 64
+#define bmw_rightmost_one(w) pg_rightmost_one64(w)
+#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos64(w)
+#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos64(w)
+#define bmw_popcount(w) pg_popcount64(w)
+#else
+#error "invalid BITS_PER_BITMAPWORD"
+#endif
/*
* function prototypes in nodes/bitmapset.c
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index a49a9c03d9..7235ad25ee 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -17,6 +17,37 @@ extern PGDLLIMPORT const uint8 pg_leftmost_one_pos[256];
extern PGDLLIMPORT const uint8 pg_rightmost_one_pos[256];
extern PGDLLIMPORT const uint8 pg_number_of_ones[256];
+/*----------
+ * This is a well-known cute trick for isolating the rightmost one-bit
+ * in a word. It assumes two's complement arithmetic. Consider any
+ * nonzero value, and focus attention on the rightmost one. The value is
+ * then something like
+ * xxxxxx10000
+ * where x's are unspecified bits. The two's complement negative is formed
+ * by inverting all the bits and adding one. Inversion gives
+ * yyyyyy01111
+ * where each y is the inverse of the corresponding x. Incrementing gives
+ * yyyyyy10000
+ * and then ANDing with the original value gives
+ * 00000010000
+ * This works for all cases except original value = zero, where of course
+ * we get zero.
+ *----------
+ */
+static inline uint32
+pg_rightmost_one32(uint32 word)
+{
+ int32 result = (int32) word & -((int32) word);
+ return (uint32) result;
+}
+
+static inline uint64
+pg_rightmost_one64(uint64 word)
+{
+ int64 result = (int64) word & -((int64) word);
+ return (uint64) result;
+}
+
/*
* pg_leftmost_one_pos32
* Returns the position of the most significant set bit in "word",
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 24510ac29e..758e20f148 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3660,7 +3660,6 @@ shmem_request_hook_type
shmem_startup_hook_type
sig_atomic_t
sigjmp_buf
-signedbitmapword
sigset_t
size_t
slist_head
--
2.39.0
Attachment: v21-0001-introduce-vector8_min-and-vector8_highbit_mask.patch
From b0edeae77488d98733752a9190d1af36838b645f Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 24 Oct 2022 14:07:09 +0900
Subject: [PATCH v21 01/22] introduce vector8_min and vector8_highbit_mask
TODO: commit message
TODO: Remove uint64 case.
separate-commit TODO: move non-SIMD fallbacks to own header
to clean up the #ifdef maze.
---
src/include/port/simd.h | 47 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 47 insertions(+)
diff --git a/src/include/port/simd.h b/src/include/port/simd.h
index c836360d4b..84d41a340a 100644
--- a/src/include/port/simd.h
+++ b/src/include/port/simd.h
@@ -77,6 +77,7 @@ static inline bool vector8_has(const Vector8 v, const uint8 c);
static inline bool vector8_has_zero(const Vector8 v);
static inline bool vector8_has_le(const Vector8 v, const uint8 c);
static inline bool vector8_is_highbit_set(const Vector8 v);
+static inline uint32 vector8_highbit_mask(const Vector8 v);
#ifndef USE_NO_SIMD
static inline bool vector32_is_highbit_set(const Vector32 v);
#endif
@@ -96,6 +97,7 @@ static inline Vector8 vector8_ssub(const Vector8 v1, const Vector8 v2);
*/
#ifndef USE_NO_SIMD
static inline Vector8 vector8_eq(const Vector8 v1, const Vector8 v2);
+static inline Vector8 vector8_min(const Vector8 v1, const Vector8 v2);
static inline Vector32 vector32_eq(const Vector32 v1, const Vector32 v2);
#endif
@@ -277,6 +279,36 @@ vector8_is_highbit_set(const Vector8 v)
#endif
}
+/*
+ * Return the bitmask of the high bit of each element.
+ */
+static inline uint32
+vector8_highbit_mask(const Vector8 v)
+{
+#ifdef USE_SSE2
+ return (uint32) _mm_movemask_epi8(v);
+#elif defined(USE_NEON)
+ static const uint8 mask[16] = {
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ };
+
+ uint8x16_t masked = vandq_u8(vld1q_u8(mask), (uint8x16_t) vshrq_n_s8(v, 7));
+ uint8x16_t maskedhi = vextq_u8(masked, masked, 8);
+
+ return (uint32) vaddvq_u16((uint16x8_t) vzip1q_u8(masked, maskedhi));
+#else
+ uint32 mask = 0;
+
+ for (Size i = 0; i < sizeof(Vector8); i++)
+ mask |= (((const uint8 *) &v)[i] >> 7) << i;
+
+ return mask;
+#endif
+}
+
/*
* Exactly like vector8_is_highbit_set except for the input type, so it
* looks at each byte separately.
@@ -372,4 +404,19 @@ vector32_eq(const Vector32 v1, const Vector32 v2)
}
#endif /* ! USE_NO_SIMD */
+/*
+ * Compare the given vectors and return the vector of minimum elements.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_min(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+ return _mm_min_epu8(v1, v2);
+#elif defined(USE_NEON)
+ return vminq_u8(v1, v2);
+#endif
+}
+#endif /* ! USE_NO_SIMD */
+
#endif /* SIMD_H */
--
2.39.0
Attachment: v21-0005-Restore-RT_GROW_NODE_KIND.patch
From 7af8716587b466a298052c8185cf51ce38399686 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Fri, 20 Jan 2023 11:32:24 +0700
Subject: [PATCH v21 05/22] Restore RT_GROW_NODE_KIND
(This was previously "exploded" out during the work to
switch this to a template)
Change the API so that we pass it the allocated pointer
and return the local pointer. That way, there is consistency
in growing nodes whether we change kind or not.
Also rename to RT_SWITCH_NODE_KIND, since it should work just as
well for shrinking nodes.
---
src/include/lib/radixtree.h | 104 +++---------------------
src/include/lib/radixtree_insert_impl.h | 24 ++----
2 files changed, 19 insertions(+), 109 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index a1458bc25f..c08016de3a 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -127,10 +127,9 @@
#define RT_ALLOC_NODE RT_MAKE_NAME(alloc_node)
#define RT_INIT_NODE RT_MAKE_NAME(init_node)
#define RT_FREE_NODE RT_MAKE_NAME(free_node)
-#define RT_FREE_RECURSE RT_MAKE_NAME(free_recurse)
#define RT_EXTEND RT_MAKE_NAME(extend)
#define RT_SET_EXTEND RT_MAKE_NAME(set_extend)
-//#define RT_GROW_NODE_KIND RT_MAKE_NAME(grow_node_kind)
+#define RT_SWITCH_NODE_KIND RT_MAKE_NAME(grow_node_kind)
#define RT_COPY_NODE RT_MAKE_NAME(copy_node)
#define RT_REPLACE_NODE RT_MAKE_NAME(replace_node)
#define RT_PTR_GET_LOCAL RT_MAKE_NAME(ptr_get_local)
@@ -1080,26 +1079,22 @@ RT_COPY_NODE(RT_PTR_LOCAL newnode, RT_PTR_LOCAL oldnode)
newnode->shift = oldnode->shift;
newnode->count = oldnode->count;
}
-#if 0
+
/*
- * Create a new node with 'new_kind' and the same shift, chunk, and
- * count of 'node'.
+ * Given a newly allocated node and an old node, initialize the new
+ * node with the necessary fields and return its local pointer.
*/
-static RT_NODE*
-RT_GROW_NODE_KIND(RT_RADIX_TREE *tree, RT_PTR_LOCAL node, uint8 new_kind)
+static inline RT_PTR_LOCAL
+RT_SWITCH_NODE_KIND(RT_RADIX_TREE *tree, RT_PTR_ALLOC allocnode, RT_PTR_LOCAL node,
+ uint8 new_kind, uint8 new_class, bool inner)
{
- RT_PTR_ALLOC allocnode;
- RT_PTR_LOCAL newnode;
- bool inner = !NODE_IS_LEAF(node);
-
- allocnode = RT_ALLOC_NODE(tree, RT_KIND_MIN_SIZE_CLASS[new_kind], inner);
- newnode = RT_PTR_GET_LOCAL(tree, allocnode);
- RT_INIT_NODE(newnode, new_kind, RT_KIND_MIN_SIZE_CLASS[new_kind], inner);
+ RT_PTR_LOCAL newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(newnode, new_kind, new_class, inner);
RT_COPY_NODE(newnode, node);
return newnode;
}
-#endif
+
/* Free the given node */
static void
RT_FREE_NODE(RT_RADIX_TREE *tree, RT_PTR_ALLOC allocnode)
@@ -1415,78 +1410,6 @@ RT_GET_HANDLE(RT_RADIX_TREE *tree)
Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
return tree->ctl->handle;
}
-
-/*
- * Recursively free all nodes allocated to the DSA area.
- */
-static inline void
-RT_FREE_RECURSE(RT_RADIX_TREE *tree, RT_PTR_ALLOC ptr)
-{
- RT_PTR_LOCAL node = RT_PTR_GET_LOCAL(tree, ptr);
-
- check_stack_depth();
- CHECK_FOR_INTERRUPTS();
-
- /* The leaf node doesn't have child pointers */
- if (NODE_IS_LEAF(node))
- {
- dsa_free(tree->dsa, ptr);
- return;
- }
-
- switch (node->kind)
- {
- case RT_NODE_KIND_4:
- {
- RT_NODE_INNER_4 *n4 = (RT_NODE_INNER_4 *) node;
-
- for (int i = 0; i < n4->base.n.count; i++)
- RT_FREE_RECURSE(tree, n4->children[i]);
-
- break;
- }
- case RT_NODE_KIND_32:
- {
- RT_NODE_INNER_32 *n32 = (RT_NODE_INNER_32 *) node;
-
- for (int i = 0; i < n32->base.n.count; i++)
- RT_FREE_RECURSE(tree, n32->children[i]);
-
- break;
- }
- case RT_NODE_KIND_125:
- {
- RT_NODE_INNER_125 *n125 = (RT_NODE_INNER_125 *) node;
-
- for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
- {
- if (!RT_NODE_125_IS_CHUNK_USED(&n125->base, i))
- continue;
-
- RT_FREE_RECURSE(tree, RT_NODE_INNER_125_GET_CHILD(n125, i));
- }
-
- break;
- }
- case RT_NODE_KIND_256:
- {
- RT_NODE_INNER_256 *n256 = (RT_NODE_INNER_256 *) node;
-
- for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
- {
- if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
- continue;
-
- RT_FREE_RECURSE(tree, RT_NODE_INNER_256_GET_CHILD(n256, i));
- }
-
- break;
- }
- }
-
- /* Free the inner node */
- dsa_free(tree->dsa, ptr);
-}
#endif
/*
@@ -1498,10 +1421,6 @@ RT_FREE(RT_RADIX_TREE *tree)
#ifdef RT_SHMEM
Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
- /* Free all memory used for radix tree nodes */
- if (RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
- RT_FREE_RECURSE(tree, tree->ctl->root);
-
/*
* Vandalize the control block to help catch programming error where
* other backends access the memory formerly occupied by this radix tree.
@@ -2280,10 +2199,9 @@ RT_DUMP(RT_RADIX_TREE *tree)
#undef RT_ALLOC_NODE
#undef RT_INIT_NODE
#undef RT_FREE_NODE
-#undef RT_FREE_RECURSE
#undef RT_EXTEND
#undef RT_SET_EXTEND
-#undef RT_GROW_NODE_KIND
+#undef RT_SWITCH_NODE_KIND
#undef RT_COPY_NODE
#undef RT_REPLACE_NODE
#undef RT_PTR_GET_LOCAL
diff --git a/src/include/lib/radixtree_insert_impl.h b/src/include/lib/radixtree_insert_impl.h
index 1d0eb396e2..e3e44669ea 100644
--- a/src/include/lib/radixtree_insert_impl.h
+++ b/src/include/lib/radixtree_insert_impl.h
@@ -53,11 +53,9 @@
/* grow node from 4 to 32 */
allocnode = RT_ALLOC_NODE(tree, new_class, inner);
- newnode = RT_PTR_GET_LOCAL(tree, allocnode);
- RT_INIT_NODE(newnode, new_kind, new_class, inner);
- RT_COPY_NODE(newnode, node);
- //newnode = RT_GROW_NODE_KIND(tree, node, RT_NODE_KIND_32);
+ newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, inner);
new32 = (RT_NODE32_TYPE *) newnode;
+
#ifdef RT_NODE_LEVEL_LEAF
RT_CHUNK_VALUES_ARRAY_COPY(n4->base.chunks, n4->values,
new32->base.chunks, new32->values);
@@ -119,13 +117,15 @@
if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)) &&
n32->base.n.fanout == class32_min.fanout)
{
- /* grow to the next size class of this kind */
RT_PTR_ALLOC allocnode;
RT_PTR_LOCAL newnode;
const RT_SIZE_CLASS new_class = RT_CLASS_32_FULL;
+ /* grow to the next size class of this kind */
allocnode = RT_ALLOC_NODE(tree, new_class, inner);
newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ n32 = (RT_NODE32_TYPE *) newnode;
+
#ifdef RT_NODE_LEVEL_LEAF
memcpy(newnode, node, class32_min.leaf_size);
#else
@@ -135,9 +135,6 @@
RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
node = newnode;
-
- /* also update pointer for this kind */
- n32 = (RT_NODE32_TYPE *) newnode;
}
if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)))
@@ -152,10 +149,7 @@
/* grow node from 32 to 125 */
allocnode = RT_ALLOC_NODE(tree, new_class, inner);
- newnode = RT_PTR_GET_LOCAL(tree, allocnode);
- RT_INIT_NODE(newnode, new_kind, new_class, inner);
- RT_COPY_NODE(newnode, node);
- //newnode = RT_GROW_NODE_KIND(tree, node, RT_NODE_KIND_125);
+ newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, inner);
new125 = (RT_NODE125_TYPE *) newnode;
for (int i = 0; i < class32_max.fanout; i++)
@@ -229,11 +223,9 @@
/* grow node from 125 to 256 */
allocnode = RT_ALLOC_NODE(tree, new_class, inner);
- newnode = RT_PTR_GET_LOCAL(tree, allocnode);
- RT_INIT_NODE(newnode, new_kind, new_class, inner);
- RT_COPY_NODE(newnode, node);
- //newnode = RT_GROW_NODE_KIND(tree, node, RT_NODE_KIND_256);
+ newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, inner);
new256 = (RT_NODE256_TYPE *) newnode;
+
for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n125->base.n.count; i++)
{
if (!RT_NODE_125_IS_CHUNK_USED(&n125->base, i))
--
2.39.0
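For reference, after this patch every growth path in radixtree_insert_impl.h reduces to the same shape. The fragment below is condensed from the patch for illustration only and is not compilable on its own:

    /* grow 'node' into the next kind (condensed from the patch) */
    allocnode = RT_ALLOC_NODE(tree, new_class, inner);
    newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, inner);
    /* ... copy the kind-specific chunks/children (or values) arrays ... */
    RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
    node = newnode;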
Attachment: v21-0004-Clean-up-some-nomenclature-around-node-insertion.patch (text/x-patch, US-ASCII)
From d9f4b6280f73076df05c1fd03ca6860df3b90c74 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Thu, 19 Jan 2023 16:33:51 +0700
Subject: [PATCH v21 04/22] Clean up some nomenclature around node insertion
Replace node/nodep with hopefully more informative names.
In passing, remove some outdated asserts and move some
variable declarations to the scope where they're used.
---
src/include/lib/radixtree.h | 64 ++++++++++++++-----------
src/include/lib/radixtree_insert_impl.h | 22 +++++----
2 files changed, 47 insertions(+), 39 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 97cccdc9ca..a1458bc25f 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -645,9 +645,9 @@ typedef struct RT_ITER
} RT_ITER;
-static bool RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC nodep, RT_PTR_LOCAL node,
+static bool RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
uint64 key, RT_PTR_ALLOC child);
-static bool RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC nodep, RT_PTR_LOCAL node,
+static bool RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
uint64 key, uint64 value);
/* verification (available only with assertion) */
@@ -1153,18 +1153,18 @@ RT_NODE_UPDATE_INNER(RT_PTR_LOCAL node, uint64 key, RT_PTR_ALLOC new_child)
* Replace old_child with new_child, and free the old one.
*/
static void
-RT_REPLACE_NODE(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC old_child,
+RT_REPLACE_NODE(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent,
+ RT_PTR_ALLOC stored_old_child, RT_PTR_LOCAL old_child,
RT_PTR_ALLOC new_child, uint64 key)
{
- RT_PTR_LOCAL old = RT_PTR_GET_LOCAL(tree, old_child);
-
#ifdef USE_ASSERT_CHECKING
RT_PTR_LOCAL new = RT_PTR_GET_LOCAL(tree, new_child);
- Assert(old->shift == new->shift);
+ Assert(old_child->shift == new->shift);
+ Assert(old_child->count == new->count);
#endif
- if (parent == old)
+ if (parent == old_child)
{
/* Replace the root node with the new large node */
tree->ctl->root = new_child;
@@ -1172,7 +1172,7 @@ RT_REPLACE_NODE(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC old_child
else
RT_NODE_UPDATE_INNER(parent, key, new_child);
- RT_FREE_NODE(tree, old_child);
+ RT_FREE_NODE(tree, stored_old_child);
}
/*
@@ -1220,11 +1220,11 @@ RT_EXTEND(RT_RADIX_TREE *tree, uint64 key)
*/
static inline void
RT_SET_EXTEND(RT_RADIX_TREE *tree, uint64 key, uint64 value, RT_PTR_LOCAL parent,
- RT_PTR_ALLOC nodep, RT_PTR_LOCAL node)
+ RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node)
{
int shift = node->shift;
- Assert(RT_PTR_GET_LOCAL(tree, nodep) == node);
+ Assert(RT_PTR_GET_LOCAL(tree, stored_node) == node);
while (shift >= RT_NODE_SPAN)
{
@@ -1237,15 +1237,15 @@ RT_SET_EXTEND(RT_RADIX_TREE *tree, uint64 key, uint64 value, RT_PTR_LOCAL parent
newchild = RT_PTR_GET_LOCAL(tree, allocchild);
RT_INIT_NODE(newchild, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
newchild->shift = newshift;
- RT_NODE_INSERT_INNER(tree, parent, nodep, node, key, allocchild);
+ RT_NODE_INSERT_INNER(tree, parent, stored_node, node, key, allocchild);
parent = node;
node = newchild;
- nodep = allocchild;
+ stored_node = allocchild;
shift -= RT_NODE_SPAN;
}
- RT_NODE_INSERT_LEAF(tree, parent, nodep, node, key, value);
+ RT_NODE_INSERT_LEAF(tree, parent, stored_node, node, key, value);
tree->ctl->num_keys++;
}
@@ -1305,9 +1305,15 @@ RT_NODE_DELETE_LEAF(RT_PTR_LOCAL node, uint64 key)
}
#endif
-/* Insert the child to the inner node */
+/*
+ * Insert "child" into "node".
+ *
+ * "parent" is the parent of "node", so the grandparent of the child.
+ * If the node we're inserting into needs to grow, we update the parent's
+ * child pointer with the pointer to the new larger node.
+ */
static bool
-RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC nodep, RT_PTR_LOCAL node,
+RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
uint64 key, RT_PTR_ALLOC child)
{
#define RT_NODE_LEVEL_INNER
@@ -1315,9 +1321,9 @@ RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC node
#undef RT_NODE_LEVEL_INNER
}
-/* Insert the value to the leaf node */
+/* Like RT_NODE_INSERT_INNER, but for leaf nodes */
static bool
-RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC nodep, RT_PTR_LOCAL node,
+RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
uint64 key, uint64 value)
{
#define RT_NODE_LEVEL_LEAF
@@ -1525,8 +1531,8 @@ RT_SET(RT_RADIX_TREE *tree, uint64 key, uint64 value)
int shift;
bool updated;
RT_PTR_LOCAL parent;
- RT_PTR_ALLOC nodep;
- RT_PTR_LOCAL node;
+ RT_PTR_ALLOC stored_child;
+ RT_PTR_LOCAL child;
#ifdef RT_SHMEM
Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
@@ -1540,32 +1546,32 @@ RT_SET(RT_RADIX_TREE *tree, uint64 key, uint64 value)
if (key > tree->ctl->max_val)
RT_EXTEND(tree, key);
- nodep = tree->ctl->root;
- parent = RT_PTR_GET_LOCAL(tree, nodep);
+ stored_child = tree->ctl->root;
+ parent = RT_PTR_GET_LOCAL(tree, stored_child);
shift = parent->shift;
/* Descend the tree until a leaf node */
while (shift >= 0)
{
- RT_PTR_ALLOC child;
+ RT_PTR_ALLOC new_child;
- node = RT_PTR_GET_LOCAL(tree, nodep);
+ child = RT_PTR_GET_LOCAL(tree, stored_child);
- if (NODE_IS_LEAF(node))
+ if (NODE_IS_LEAF(child))
break;
- if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ if (!RT_NODE_SEARCH_INNER(child, key, &new_child))
{
- RT_SET_EXTEND(tree, key, value, parent, nodep, node);
+ RT_SET_EXTEND(tree, key, value, parent, stored_child, child);
return false;
}
- parent = node;
- nodep = child;
+ parent = child;
+ stored_child = new_child;
shift -= RT_NODE_SPAN;
}
- updated = RT_NODE_INSERT_LEAF(tree, parent, nodep, node, key, value);
+ updated = RT_NODE_INSERT_LEAF(tree, parent, stored_child, child, key, value);
/* Update the statistics */
if (!updated)
diff --git a/src/include/lib/radixtree_insert_impl.h b/src/include/lib/radixtree_insert_impl.h
index e4faf54d9d..1d0eb396e2 100644
--- a/src/include/lib/radixtree_insert_impl.h
+++ b/src/include/lib/radixtree_insert_impl.h
@@ -14,8 +14,6 @@
uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
bool chunk_exists = false;
- RT_PTR_LOCAL newnode = NULL;
- RT_PTR_ALLOC allocnode;
#ifdef RT_NODE_LEVEL_LEAF
const bool inner = false;
@@ -47,6 +45,8 @@
if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n4)))
{
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
RT_NODE32_TYPE *new32;
const uint8 new_kind = RT_NODE_KIND_32;
const RT_SIZE_CLASS new_class = RT_KIND_MIN_SIZE_CLASS[new_kind];
@@ -65,8 +65,7 @@
RT_CHUNK_CHILDREN_ARRAY_COPY(n4->base.chunks, n4->children,
new32->base.chunks, new32->children);
#endif
- Assert(parent != NULL);
- RT_REPLACE_NODE(tree, parent, nodep, allocnode, key);
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
node = newnode;
}
else
@@ -121,6 +120,8 @@
n32->base.n.fanout == class32_min.fanout)
{
/* grow to the next size class of this kind */
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
const RT_SIZE_CLASS new_class = RT_CLASS_32_FULL;
allocnode = RT_ALLOC_NODE(tree, new_class, inner);
@@ -132,8 +133,7 @@
#endif
newnode->fanout = class32_max.fanout;
- Assert(parent != NULL);
- RT_REPLACE_NODE(tree, parent, nodep, allocnode, key);
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
node = newnode;
/* also update pointer for this kind */
@@ -142,6 +142,8 @@
if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)))
{
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
RT_NODE125_TYPE *new125;
const uint8 new_kind = RT_NODE_KIND_125;
const RT_SIZE_CLASS new_class = RT_KIND_MIN_SIZE_CLASS[new_kind];
@@ -169,8 +171,7 @@
Assert(class32_max.fanout <= sizeof(bitmapword) * BITS_PER_BYTE);
new125->base.isset[0] = (bitmapword) (((uint64) 1 << class32_max.fanout) - 1);
- Assert(parent != NULL);
- RT_REPLACE_NODE(tree, parent, nodep, allocnode, key);
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
node = newnode;
}
else
@@ -220,6 +221,8 @@
if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n125)))
{
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
RT_NODE256_TYPE *new256;
const uint8 new_kind = RT_NODE_KIND_256;
const RT_SIZE_CLASS new_class = RT_KIND_MIN_SIZE_CLASS[new_kind];
@@ -243,8 +246,7 @@
cnt++;
}
- Assert(parent != NULL);
- RT_REPLACE_NODE(tree, parent, nodep, allocnode, key);
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
node = newnode;
}
else
--
2.39.0
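For context on the template patch below: instantiating it for local memory would look roughly like the following. This is only a sketch based on the parameter list in radixtree.h's header comment; the prefix "rt" and the calling code are made up for illustration (the test module in the patch does something similar).

    /* instantiate a local-memory radix tree with prefix "rt" */
    #define RT_PREFIX rt
    #define RT_SCOPE static
    #define RT_DECLARE
    #define RT_DEFINE
    #define RT_USE_DELETE
    #include "lib/radixtree.h"

    static void
    example(void)
    {
        rt_radix_tree *tree = rt_create(CurrentMemoryContext);
        uint64      key = 123;
        uint64      value;

        rt_set(tree, key, UINT64CONST(42));
        if (rt_search(tree, key, &value))
            elog(NOTICE, "found " UINT64_FORMAT, value);
        rt_delete(tree, key);
        rt_free(tree);
    }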
Attachment: v21-0003-Add-radixtree-template.patch (text/x-patch, US-ASCII)
From 2035dde63943dc5461a69fc7aa1f510e68f1cd64 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 14 Sep 2022 12:38:51 +0000
Subject: [PATCH v21 03/22] Add radixtree template
The only things configurable in this commit are function scope,
prefix, and local/shared memory.
The key and value types are still hard-coded to uint64.
(A later commit in v21 will make value type configurable)
It might be good at some point to offer a different tree type,
e.g. "single-value leaves" to allow for variable length keys
and values, giving full flexibility to developers.
TODO: Much broader commit message
---
src/backend/utils/mmgr/dsa.c | 12 +
src/include/lib/radixtree.h | 2321 +++++++++++++++++
src/include/lib/radixtree_delete_impl.h | 106 +
src/include/lib/radixtree_insert_impl.h | 316 +++
src/include/lib/radixtree_iter_impl.h | 138 +
src/include/lib/radixtree_search_impl.h | 131 +
src/include/utils/dsa.h | 1 +
src/test/modules/Makefile | 1 +
src/test/modules/meson.build | 1 +
src/test/modules/test_radixtree/.gitignore | 4 +
src/test/modules/test_radixtree/Makefile | 23 +
src/test/modules/test_radixtree/README | 7 +
.../expected/test_radixtree.out | 36 +
src/test/modules/test_radixtree/meson.build | 35 +
.../test_radixtree/sql/test_radixtree.sql | 7 +
.../test_radixtree/test_radixtree--1.0.sql | 8 +
.../modules/test_radixtree/test_radixtree.c | 653 +++++
.../test_radixtree/test_radixtree.control | 4 +
src/tools/pginclude/cpluspluscheck | 6 +
src/tools/pginclude/headerscheck | 6 +
20 files changed, 3816 insertions(+)
create mode 100644 src/include/lib/radixtree.h
create mode 100644 src/include/lib/radixtree_delete_impl.h
create mode 100644 src/include/lib/radixtree_insert_impl.h
create mode 100644 src/include/lib/radixtree_iter_impl.h
create mode 100644 src/include/lib/radixtree_search_impl.h
create mode 100644 src/test/modules/test_radixtree/.gitignore
create mode 100644 src/test/modules/test_radixtree/Makefile
create mode 100644 src/test/modules/test_radixtree/README
create mode 100644 src/test/modules/test_radixtree/expected/test_radixtree.out
create mode 100644 src/test/modules/test_radixtree/meson.build
create mode 100644 src/test/modules/test_radixtree/sql/test_radixtree.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree--1.0.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree.c
create mode 100644 src/test/modules/test_radixtree/test_radixtree.control
diff --git a/src/backend/utils/mmgr/dsa.c b/src/backend/utils/mmgr/dsa.c
index 604b702a91..50f0aae3ab 100644
--- a/src/backend/utils/mmgr/dsa.c
+++ b/src/backend/utils/mmgr/dsa.c
@@ -1024,6 +1024,18 @@ dsa_set_size_limit(dsa_area *area, size_t limit)
LWLockRelease(DSA_AREA_LOCK(area));
}
+size_t
+dsa_get_total_size(dsa_area *area)
+{
+ size_t size;
+
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_SHARED);
+ size = area->control->total_segment_size;
+ LWLockRelease(DSA_AREA_LOCK(area));
+
+ return size;
+}
+
/*
* Aggressively free all spare memory in the hope of returning DSM segments to
* the operating system.
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
new file mode 100644
index 0000000000..97cccdc9ca
--- /dev/null
+++ b/src/include/lib/radixtree.h
@@ -0,0 +1,2321 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.h
+ * Implementation of an adaptive radix tree.
+ *
+ * This module employs the idea from the paper "The Adaptive Radix Tree: ARTful
+ * Indexing for Main-Memory Databases" by Viktor Leis, Alfons Kemper, and Thomas
+ * Neumann, 2013. The radix tree uses adaptive node sizes, a small number of node
+ * types, each with a different number of elements. Depending on the number of
+ * children, the appropriate node type is used.
+ *
+ * There are some differences from the proposed implementation. For instance,
+ * there is no support for path compression and lazy path expansion. The radix
+ * tree supports a fixed length of the key, so the tree is not expected to
+ * become very deep.
+ *
+ * Both the key and the value are 64-bit unsigned integers. The inner nodes and
+ * the leaf nodes have slightly different structures: inner nodes (shift > 0)
+ * store pointers to their child nodes as values, whereas leaf nodes
+ * (shift == 0) store the 64-bit unsigned integer specified by the user as the
+ * value. The paper refers to this technique as "Multi-value leaves". We choose
+ * it to avoid an additional pointer traversal. It is the reason this code
+ * currently does not support variable-length keys.
+ *
+ * XXX: Most functions in this file have two variants for inner nodes and leaf
+ * nodes, so there is some code duplication. While this sometimes makes code
+ * maintenance tricky, it reduces branch prediction misses when judging
+ * whether a node is an inner node or a leaf node.
+ *
+ * XXX: radix tree nodes are never shrunk.
+ *
+ * To generate a radix tree and associated functions for a use case several
+ * macros have to be #define'ed before this file is included. Including
+ * the file #undef's all those, so a new radix tree can be generated
+ * afterwards.
+ * The relevant parameters are:
+ * - RT_PREFIX - prefix for all symbol names generated. A prefix of 'foo'
+ * will result in radix tree type 'foo_radix_tree' and functions like
+ * 'foo_create'/'foo_free' and so forth.
+ * - RT_DECLARE - if defined, function prototypes and type declarations are
+ * generated
+ * - RT_DEFINE - if defined, function definitions are generated
+ * - RT_SCOPE - in which scope (e.g. extern, static inline) do function
+ * declarations reside
+ * - RT_SHMEM - if defined, the radix tree is created in the DSA area
+ * so that multiple processes can access it simultaneously.
+ *
+ * Optional parameters:
+ * - RT_DEBUG - if defined, add stats tracking and debugging functions
+ *
+ * Interface
+ * ---------
+ *
+ * RT_CREATE - Create a new, empty radix tree
+ * RT_FREE - Free the radix tree
+ * RT_ATTACH - Attach to the radix tree
+ * RT_DETACH - Detach from the radix tree
+ * RT_GET_HANDLE - Return the handle of the radix tree
+ * RT_SEARCH - Search a key-value pair
+ * RT_SET - Set a key-value pair
+ * RT_BEGIN_ITERATE - Begin iterating through all key-value pairs
+ * RT_ITERATE_NEXT - Return next key-value pair, if any
+ * RT_END_ITERATE - End iteration
+ * RT_MEMORY_USAGE - Get the memory usage
+ *
+ * RT_CREATE() creates an empty radix tree in the given memory context
+ * and creates memory contexts for each kind of radix tree node under that context.
+ *
+ * RT_ITERATE_NEXT() returns key-value pairs in the ascending
+ * order of the key.
+ *
+ * Optional Interface
+ * ---------
+ *
+ * RT_DELETE - Delete a key-value pair. Declared/defined only if RT_USE_DELETE is defined
+ *
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/lib/radixtree.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "lib/stringinfo.h"
+#include "miscadmin.h"
+#include "nodes/bitmapset.h"
+#include "port/pg_bitutils.h"
+#include "port/simd.h"
+#include "utils/dsa.h"
+#include "utils/memutils.h"
+
+/* helpers */
+#define RT_MAKE_PREFIX(a) CppConcat(a,_)
+#define RT_MAKE_NAME(name) RT_MAKE_NAME_(RT_MAKE_PREFIX(RT_PREFIX),name)
+#define RT_MAKE_NAME_(a,b) CppConcat(a,b)
+
+/* function declarations */
+#define RT_CREATE RT_MAKE_NAME(create)
+#define RT_FREE RT_MAKE_NAME(free)
+#define RT_SEARCH RT_MAKE_NAME(search)
+#ifdef RT_SHMEM
+#define RT_ATTACH RT_MAKE_NAME(attach)
+#define RT_DETACH RT_MAKE_NAME(detach)
+#define RT_GET_HANDLE RT_MAKE_NAME(get_handle)
+#endif
+#define RT_SET RT_MAKE_NAME(set)
+#define RT_BEGIN_ITERATE RT_MAKE_NAME(begin_iterate)
+#define RT_ITERATE_NEXT RT_MAKE_NAME(iterate_next)
+#define RT_END_ITERATE RT_MAKE_NAME(end_iterate)
+#ifdef RT_USE_DELETE
+#define RT_DELETE RT_MAKE_NAME(delete)
+#endif
+#define RT_MEMORY_USAGE RT_MAKE_NAME(memory_usage)
+#ifdef RT_DEBUG
+#define RT_DUMP RT_MAKE_NAME(dump)
+#define RT_DUMP_NODE RT_MAKE_NAME(dump_node)
+#define RT_DUMP_SEARCH RT_MAKE_NAME(dump_search)
+#define RT_STATS RT_MAKE_NAME(stats)
+#endif
+
+/* internal helper functions (no externally visible prototypes) */
+#define RT_NEW_ROOT RT_MAKE_NAME(new_root)
+#define RT_ALLOC_NODE RT_MAKE_NAME(alloc_node)
+#define RT_INIT_NODE RT_MAKE_NAME(init_node)
+#define RT_FREE_NODE RT_MAKE_NAME(free_node)
+#define RT_FREE_RECURSE RT_MAKE_NAME(free_recurse)
+#define RT_EXTEND RT_MAKE_NAME(extend)
+#define RT_SET_EXTEND RT_MAKE_NAME(set_extend)
+//#define RT_GROW_NODE_KIND RT_MAKE_NAME(grow_node_kind)
+#define RT_COPY_NODE RT_MAKE_NAME(copy_node)
+#define RT_REPLACE_NODE RT_MAKE_NAME(replace_node)
+#define RT_PTR_GET_LOCAL RT_MAKE_NAME(ptr_get_local)
+#define RT_PTR_ALLOC_IS_VALID RT_MAKE_NAME(ptr_stored_is_valid)
+#define RT_NODE_4_SEARCH_EQ RT_MAKE_NAME(node_4_search_eq)
+#define RT_NODE_32_SEARCH_EQ RT_MAKE_NAME(node_32_search_eq)
+#define RT_NODE_4_GET_INSERTPOS RT_MAKE_NAME(node_4_get_insertpos)
+#define RT_NODE_32_GET_INSERTPOS RT_MAKE_NAME(node_32_get_insertpos)
+#define RT_CHUNK_CHILDREN_ARRAY_SHIFT RT_MAKE_NAME(chunk_children_array_shift)
+#define RT_CHUNK_VALUES_ARRAY_SHIFT RT_MAKE_NAME(chunk_values_array_shift)
+#define RT_CHUNK_CHILDREN_ARRAY_DELETE RT_MAKE_NAME(chunk_children_array_delete)
+#define RT_CHUNK_VALUES_ARRAY_DELETE RT_MAKE_NAME(chunk_values_array_delete)
+#define RT_CHUNK_CHILDREN_ARRAY_COPY RT_MAKE_NAME(chunk_children_array_copy)
+#define RT_CHUNK_VALUES_ARRAY_COPY RT_MAKE_NAME(chunk_values_array_copy)
+#define RT_NODE_125_IS_CHUNK_USED RT_MAKE_NAME(node_125_is_chunk_used)
+#define RT_NODE_INNER_125_GET_CHILD RT_MAKE_NAME(node_inner_125_get_child)
+#define RT_NODE_LEAF_125_GET_VALUE RT_MAKE_NAME(node_leaf_125_get_value)
+#define RT_NODE_INNER_256_IS_CHUNK_USED RT_MAKE_NAME(node_inner_256_is_chunk_used)
+#define RT_NODE_LEAF_256_IS_CHUNK_USED RT_MAKE_NAME(node_leaf_256_is_chunk_used)
+#define RT_NODE_INNER_256_GET_CHILD RT_MAKE_NAME(node_inner_256_get_child)
+#define RT_NODE_LEAF_256_GET_VALUE RT_MAKE_NAME(node_leaf_256_get_value)
+#define RT_NODE_INNER_256_SET RT_MAKE_NAME(node_inner_256_set)
+#define RT_NODE_LEAF_256_SET RT_MAKE_NAME(node_leaf_256_set)
+#define RT_NODE_INNER_256_DELETE RT_MAKE_NAME(node_inner_256_delete)
+#define RT_NODE_LEAF_256_DELETE RT_MAKE_NAME(node_leaf_256_delete)
+#define RT_KEY_GET_SHIFT RT_MAKE_NAME(key_get_shift)
+#define RT_SHIFT_GET_MAX_VAL RT_MAKE_NAME(shift_get_max_val)
+#define RT_NODE_SEARCH_INNER RT_MAKE_NAME(node_search_inner)
+#define RT_NODE_SEARCH_LEAF RT_MAKE_NAME(node_search_leaf)
+#define RT_NODE_UPDATE_INNER RT_MAKE_NAME(node_update_inner)
+#define RT_NODE_DELETE_INNER RT_MAKE_NAME(node_delete_inner)
+#define RT_NODE_DELETE_LEAF RT_MAKE_NAME(node_delete_leaf)
+#define RT_NODE_INSERT_INNER RT_MAKE_NAME(node_insert_inner)
+#define RT_NODE_INSERT_LEAF RT_MAKE_NAME(node_insert_leaf)
+#define RT_NODE_INNER_ITERATE_NEXT RT_MAKE_NAME(node_inner_iterate_next)
+#define RT_NODE_LEAF_ITERATE_NEXT RT_MAKE_NAME(node_leaf_iterate_next)
+#define RT_UPDATE_ITER_STACK RT_MAKE_NAME(update_iter_stack)
+#define RT_ITER_UPDATE_KEY RT_MAKE_NAME(iter_update_key)
+#define RT_VERIFY_NODE RT_MAKE_NAME(verify_node)
+
+/* type declarations */
+#define RT_RADIX_TREE RT_MAKE_NAME(radix_tree)
+#define RT_RADIX_TREE_CONTROL RT_MAKE_NAME(radix_tree_control)
+#define RT_ITER RT_MAKE_NAME(iter)
+#ifdef RT_SHMEM
+#define RT_HANDLE RT_MAKE_NAME(handle)
+#endif
+#define RT_NODE RT_MAKE_NAME(node)
+#define RT_NODE_ITER RT_MAKE_NAME(node_iter)
+#define RT_NODE_BASE_4 RT_MAKE_NAME(node_base_4)
+#define RT_NODE_BASE_32 RT_MAKE_NAME(node_base_32)
+#define RT_NODE_BASE_125 RT_MAKE_NAME(node_base_125)
+#define RT_NODE_BASE_256 RT_MAKE_NAME(node_base_256)
+#define RT_NODE_INNER_4 RT_MAKE_NAME(node_inner_4)
+#define RT_NODE_INNER_32 RT_MAKE_NAME(node_inner_32)
+#define RT_NODE_INNER_125 RT_MAKE_NAME(node_inner_125)
+#define RT_NODE_INNER_256 RT_MAKE_NAME(node_inner_256)
+#define RT_NODE_LEAF_4 RT_MAKE_NAME(node_leaf_4)
+#define RT_NODE_LEAF_32 RT_MAKE_NAME(node_leaf_32)
+#define RT_NODE_LEAF_125 RT_MAKE_NAME(node_leaf_125)
+#define RT_NODE_LEAF_256 RT_MAKE_NAME(node_leaf_256)
+#define RT_SIZE_CLASS RT_MAKE_NAME(size_class)
+#define RT_SIZE_CLASS_ELEM RT_MAKE_NAME(size_class_elem)
+#define RT_SIZE_CLASS_INFO RT_MAKE_NAME(size_class_info)
+#define RT_CLASS_4_FULL RT_MAKE_NAME(class_4_full)
+#define RT_CLASS_32_PARTIAL RT_MAKE_NAME(class_32_partial)
+#define RT_CLASS_32_FULL RT_MAKE_NAME(class_32_full)
+#define RT_CLASS_125_FULL RT_MAKE_NAME(class_125_full)
+#define RT_CLASS_256 RT_MAKE_NAME(class_256)
+#define RT_KIND_MIN_SIZE_CLASS RT_MAKE_NAME(kind_min_size_class)
+
+/* generate forward declarations necessary to use the radix tree */
+#ifdef RT_DECLARE
+
+typedef struct RT_RADIX_TREE RT_RADIX_TREE;
+typedef struct RT_ITER RT_ITER;
+
+#ifdef RT_SHMEM
+typedef dsa_pointer RT_HANDLE;
+#endif
+
+#ifdef RT_SHMEM
+RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx, dsa_area *dsa);
+RT_SCOPE RT_RADIX_TREE * RT_ATTACH(dsa_area *dsa, dsa_pointer dp);
+RT_SCOPE void RT_DETACH(RT_RADIX_TREE *tree);
+RT_SCOPE RT_HANDLE RT_GET_HANDLE(RT_RADIX_TREE *tree);
+#else
+RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx);
+#endif
+RT_SCOPE void RT_FREE(RT_RADIX_TREE *tree);
+
+RT_SCOPE bool RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, uint64 *val_p);
+RT_SCOPE bool RT_SET(RT_RADIX_TREE *tree, uint64 key, uint64 val);
+#ifdef RT_USE_DELETE
+RT_SCOPE bool RT_DELETE(RT_RADIX_TREE *tree, uint64 key);
+#endif
+
+RT_SCOPE RT_ITER * RT_BEGIN_ITERATE(RT_RADIX_TREE *tree);
+RT_SCOPE bool RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, uint64 *value_p);
+RT_SCOPE void RT_END_ITERATE(RT_ITER *iter);
+
+RT_SCOPE uint64 RT_MEMORY_USAGE(RT_RADIX_TREE *tree);
+
+#ifdef RT_DEBUG
+RT_SCOPE void RT_DUMP(RT_RADIX_TREE *tree);
+RT_SCOPE void RT_DUMP_SEARCH(RT_RADIX_TREE *tree, uint64 key);
+RT_SCOPE void RT_STATS(RT_RADIX_TREE *tree);
+#endif
+
+#endif /* RT_DECLARE */
+
+
+/* generate implementation of the radix tree */
+#ifdef RT_DEFINE
+
+/* macros and types common to all implementations */
+#ifndef RT_COMMON
+#define RT_COMMON
+
+#ifdef RT_DEBUG
+#define UINT64_FORMAT_HEX "%" INT64_MODIFIER "X"
+#endif
+
+/* The number of bits encoded in one tree level */
+#define RT_NODE_SPAN BITS_PER_BYTE
+
+/* The maximum number of slots in a node */
+#define RT_NODE_MAX_SLOTS (1 << RT_NODE_SPAN)
+
+/* Mask for extracting a chunk from the key */
+#define RT_CHUNK_MASK ((1 << RT_NODE_SPAN) - 1)
+
+/* Maximum shift the radix tree uses */
+#define RT_MAX_SHIFT RT_KEY_GET_SHIFT(UINT64_MAX)
+
+/* Maximum number of levels the radix tree can have */
+#define RT_MAX_LEVEL ((sizeof(uint64) * BITS_PER_BYTE) / RT_NODE_SPAN)
+
+/* Invalid index used in node-125 */
+#define RT_NODE_125_INVALID_IDX 0xFF
+
+/* Get a chunk from the key */
+#define RT_GET_KEY_CHUNK(key, shift) ((uint8) (((key) >> (shift)) & RT_CHUNK_MASK))
+
+/* For accessing bitmaps */
+#define BM_IDX(x) ((x) / BITS_PER_BITMAPWORD)
+#define BM_BIT(x) ((x) % BITS_PER_BITMAPWORD)
+
+/*
+ * Supported radix tree node kinds and size classes.
+ *
+ * There are 4 node kinds and each node kind has one or two size classes,
+ * partial and full. The size classes within the same node kind have the same
+ * node structure but a different fanout, which is stored in the 'fanout'
+ * field of RT_NODE. For example in size class 15, when a 16th element
+ * is to be inserted, we allocate a larger area and memcpy the entire old
+ * node to it.
+ *
+ * This technique allows us to limit the node kinds to 4, which limits the
+ * number of cases in switch statements. It also allows a possible future
+ * optimization to encode the node kind in a pointer tag.
+ *
+ * These size classes have been chosen carefully so that they minimize the
+ * allocator padding in both the inner and leaf nodes on DSA.
+ *
+ */
+#define RT_NODE_KIND_4 0x00
+#define RT_NODE_KIND_32 0x01
+#define RT_NODE_KIND_125 0x02
+#define RT_NODE_KIND_256 0x03
+#define RT_NODE_KIND_COUNT 4
+
+#endif /* RT_COMMON */
+
+
+typedef enum RT_SIZE_CLASS
+{
+ RT_CLASS_4_FULL = 0,
+ RT_CLASS_32_PARTIAL,
+ RT_CLASS_32_FULL,
+ RT_CLASS_125_FULL,
+ RT_CLASS_256
+} RT_SIZE_CLASS;
+
+/* Common type for all node types */
+typedef struct RT_NODE
+{
+ /*
+ * Number of children. We use uint16 to be able to indicate 256 children
+ * at the fanout of 8.
+ */
+ uint16 count;
+
+ /*
+ * Max capacity for the current size class. Storing this in the
+ * node enables multiple size classes per node kind.
+ * Technically, kinds with a single size class don't need this, so we could
+ * keep this in the individual base types, but the code is simpler this way.
+ * Note: node256 is unique in that it cannot possibly have more than a
+ * single size class, so for that kind we store zero, and uint8 is
+ * sufficient for other kinds.
+ */
+ uint8 fanout;
+
+ /*
+ * Shift indicates which part of the key space is represented by this
+ * node. That is, the key is shifted by 'shift' and the lowest
+ * RT_NODE_SPAN bits are then represented in chunk.
+ */
+ uint8 shift;
+
+ /* Node kind, one per search/set algorithm */
+ uint8 kind;
+} RT_NODE;
+
+
+#define RT_PTR_LOCAL RT_NODE *
+
+#ifdef RT_SHMEM
+#define RT_PTR_ALLOC dsa_pointer
+#else
+#define RT_PTR_ALLOC RT_PTR_LOCAL
+#endif
+
+
+#ifdef RT_SHMEM
+#define RT_INVALID_PTR_ALLOC InvalidDsaPointer
+#else
+#define RT_INVALID_PTR_ALLOC NULL
+#endif
+
+#define NODE_IS_LEAF(n) (((RT_PTR_LOCAL) (n))->shift == 0)
+#define NODE_IS_EMPTY(n) (((RT_PTR_LOCAL) (n))->count == 0)
+#define VAR_NODE_HAS_FREE_SLOT(node) \
+ ((node)->base.n.count < (node)->base.n.fanout)
+#define FIXED_NODE_HAS_FREE_SLOT(node, class) \
+ ((node)->base.n.count < RT_SIZE_CLASS_INFO[class].fanout)
+
+/* Base type of each node kind for leaf and inner nodes */
+/* The base types must be able to accommodate the largest size
+ * class for variable-sized node kinds */
+typedef struct RT_NODE_BASE_4
+{
+ RT_NODE n;
+
+ /* 4 children, for key chunks */
+ uint8 chunks[4];
+} RT_NODE_BASE_4;
+
+typedef struct RT_NODE_BASE_32
+{
+ RT_NODE n;
+
+ /* 32 children, for key chunks */
+ uint8 chunks[32];
+} RT_NODE_BASE_32;
+
+/*
+ * node-125 uses the slot_idxs array, an array of RT_NODE_MAX_SLOTS length, typically
+ * 256, to store indexes into a second array that contains up to 125 values (or
+ * child pointers in inner nodes).
+ */
+typedef struct RT_NODE_BASE_125
+{
+ RT_NODE n;
+
+	/* For each chunk, the index of its slot in the values/children array */
+ uint8 slot_idxs[RT_NODE_MAX_SLOTS];
+
+ /* isset is a bitmap to track which slot is in use */
+ bitmapword isset[BM_IDX(128)];
+} RT_NODE_BASE_125;
+
+typedef struct RT_NODE_BASE_256
+{
+ RT_NODE n;
+} RT_NODE_BASE_256;
+
+/*
+ * Inner and leaf nodes.
+ *
+ * These are separate for two main reasons:
+ *
+ * 1) the value type might be different than something fitting into a pointer
+ * width type
+ * 2) Need to represent non-existing values in a key-type independent way.
+ *
+ * 1) is clearly worth being concerned about, but it's not clear 2) is as
+ * good. It might be better to just indicate non-existing entries the same way
+ * in inner nodes.
+ */
+typedef struct RT_NODE_INNER_4
+{
+ RT_NODE_BASE_4 base;
+
+ /* number of children depends on size class */
+ RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_INNER_4;
+
+typedef struct RT_NODE_LEAF_4
+{
+ RT_NODE_BASE_4 base;
+
+ /* number of values depends on size class */
+ uint64 values[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_LEAF_4;
+
+typedef struct RT_NODE_INNER_32
+{
+ RT_NODE_BASE_32 base;
+
+ /* number of children depends on size class */
+ RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_INNER_32;
+
+typedef struct RT_NODE_LEAF_32
+{
+ RT_NODE_BASE_32 base;
+
+ /* number of values depends on size class */
+ uint64 values[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_LEAF_32;
+
+typedef struct RT_NODE_INNER_125
+{
+ RT_NODE_BASE_125 base;
+
+ /* number of children depends on size class */
+ RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_INNER_125;
+
+typedef struct RT_NODE_LEAF_125
+{
+ RT_NODE_BASE_125 base;
+
+ /* number of values depends on size class */
+ uint64 values[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_LEAF_125;
+
+/*
+ * node-256 is the largest node type. This node has RT_NODE_MAX_SLOTS length array
+ * for directly storing values (or child pointers in inner nodes).
+ */
+typedef struct RT_NODE_INNER_256
+{
+ RT_NODE_BASE_256 base;
+
+ /* Slots for 256 children */
+ RT_PTR_ALLOC children[RT_NODE_MAX_SLOTS];
+} RT_NODE_INNER_256;
+
+typedef struct RT_NODE_LEAF_256
+{
+ RT_NODE_BASE_256 base;
+
+ /* isset is a bitmap to track which slot is in use */
+ bitmapword isset[BM_IDX(RT_NODE_MAX_SLOTS)];
+
+ /* Slots for 256 values */
+ uint64 values[RT_NODE_MAX_SLOTS];
+} RT_NODE_LEAF_256;
+
+/* Information for each size class */
+typedef struct RT_SIZE_CLASS_ELEM
+{
+ const char *name;
+ int fanout;
+
+ /* slab chunk size */
+ Size inner_size;
+ Size leaf_size;
+
+ /* slab block size */
+ Size inner_blocksize;
+ Size leaf_blocksize;
+} RT_SIZE_CLASS_ELEM;
+
+/*
+ * Calculate the slab blocksize so that we can allocate at least 32 chunks
+ * from the block.
+ */
+#define NODE_SLAB_BLOCK_SIZE(size) \
+ Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * (size), (size) * 32)
+
+static const RT_SIZE_CLASS_ELEM RT_SIZE_CLASS_INFO[] = {
+ [RT_CLASS_4_FULL] = {
+ .name = "radix tree node 4",
+ .fanout = 4,
+ .inner_size = sizeof(RT_NODE_INNER_4) + 4 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_4) + 4 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_4) + 4 * sizeof(RT_PTR_ALLOC)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_4) + 4 * sizeof(uint64)),
+ },
+ [RT_CLASS_32_PARTIAL] = {
+ .name = "radix tree node 15",
+ .fanout = 15,
+ .inner_size = sizeof(RT_NODE_INNER_32) + 15 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_32) + 15 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_32) + 15 * sizeof(RT_PTR_ALLOC)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_32) + 15 * sizeof(uint64)),
+ },
+ [RT_CLASS_32_FULL] = {
+ .name = "radix tree node 32",
+ .fanout = 32,
+ .inner_size = sizeof(RT_NODE_INNER_32) + 32 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_32) + 32 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_32) + 32 * sizeof(RT_PTR_ALLOC)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_32) + 32 * sizeof(uint64)),
+ },
+ [RT_CLASS_125_FULL] = {
+ .name = "radix tree node 125",
+ .fanout = 125,
+ .inner_size = sizeof(RT_NODE_INNER_125) + 125 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_125) + 125 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_125) + 125 * sizeof(RT_PTR_ALLOC)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_125) + 125 * sizeof(uint64)),
+ },
+ [RT_CLASS_256] = {
+ .name = "radix tree node 256",
+ .fanout = 256,
+ .inner_size = sizeof(RT_NODE_INNER_256),
+ .leaf_size = sizeof(RT_NODE_LEAF_256),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_256)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_256)),
+ },
+};
+
+#define RT_SIZE_CLASS_COUNT lengthof(RT_SIZE_CLASS_INFO)
+
+/* Map from the node kind to its minimum size class */
+static const RT_SIZE_CLASS RT_KIND_MIN_SIZE_CLASS[RT_NODE_KIND_COUNT] = {
+ [RT_NODE_KIND_4] = RT_CLASS_4_FULL,
+ [RT_NODE_KIND_32] = RT_CLASS_32_PARTIAL,
+ [RT_NODE_KIND_125] = RT_CLASS_125_FULL,
+ [RT_NODE_KIND_256] = RT_CLASS_256,
+};
+
+#ifdef RT_SHMEM
+/* A magic value used to identify our radix tree */
+#define RT_RADIX_TREE_MAGIC 0x54A48167
+#endif
+
+/* Control data for a radix tree */
+typedef struct RT_RADIX_TREE_CONTROL
+{
+#ifdef RT_SHMEM
+ RT_HANDLE handle;
+ uint32 magic;
+#endif
+
+ RT_PTR_ALLOC root;
+ uint64 max_val;
+ uint64 num_keys;
+
+ /* statistics */
+#ifdef RT_DEBUG
+ int32 cnt[RT_SIZE_CLASS_COUNT];
+#endif
+} RT_RADIX_TREE_CONTROL;
+
+/* A radix tree handle; the control data may live in local memory or DSA */
+typedef struct RT_RADIX_TREE
+{
+ MemoryContext context;
+
+ /* pointing to either local memory or DSA */
+ RT_RADIX_TREE_CONTROL *ctl;
+
+#ifdef RT_SHMEM
+ dsa_area *dsa;
+#else
+ MemoryContextData *inner_slabs[RT_SIZE_CLASS_COUNT];
+ MemoryContextData *leaf_slabs[RT_SIZE_CLASS_COUNT];
+#endif
+} RT_RADIX_TREE;
+
+/*
+ * Iteration support.
+ *
+ * Iterating the radix tree returns each pair of key and value in the ascending
+ * order of the key. To support this, we iterate over the nodes of each level.
+ *
+ * RT_NODE_ITER struct is used to track the iteration within a node.
+ *
+ * RT_ITER is the struct for iteration of the radix tree, and uses RT_NODE_ITER
+ * in order to track the iteration of each level. During the iteration, we also
+ * construct the key whenever updating the node iteration information, e.g., when
+ * advancing the current index within the node or when moving to the next node
+ * at the same level.
+ *
+ * XXX: Currently we allow only one process to do iteration. Therefore, rt_node_iter
+ * has local pointers to nodes, rather than RT_PTR_ALLOC.
+ * We need either a safeguard to disallow other processes from beginning the iteration
+ * while one process is doing it, or to allow multiple processes to iterate.
+ */
+typedef struct RT_NODE_ITER
+{
+ RT_PTR_LOCAL node; /* current node being iterated */
+ int current_idx; /* current position. -1 for initial value */
+} RT_NODE_ITER;
+
+typedef struct RT_ITER
+{
+ RT_RADIX_TREE *tree;
+
+ /* Track the iteration on nodes of each level */
+ RT_NODE_ITER stack[RT_MAX_LEVEL];
+ int stack_len;
+
+ /* The key is being constructed during the iteration */
+ uint64 key;
+} RT_ITER;
+
+
+static bool RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC nodep, RT_PTR_LOCAL node,
+ uint64 key, RT_PTR_ALLOC child);
+static bool RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC nodep, RT_PTR_LOCAL node,
+ uint64 key, uint64 value);
+
+/* verification (available only with assertion) */
+static void RT_VERIFY_NODE(RT_PTR_LOCAL node);
+
+/* Get the local address of an allocated node */
+static inline RT_PTR_LOCAL
+RT_PTR_GET_LOCAL(RT_RADIX_TREE *tree, RT_PTR_ALLOC node)
+{
+#ifdef RT_SHMEM
+ return dsa_get_address(tree->dsa, (dsa_pointer) node);
+#else
+ return node;
+#endif
+}
+
+static inline bool
+RT_PTR_ALLOC_IS_VALID(RT_PTR_ALLOC ptr)
+{
+#ifdef RT_SHMEM
+ return DsaPointerIsValid(ptr);
+#else
+ return PointerIsValid(ptr);
+#endif
+}
+
+/*
+ * Return index of the first element in 'base' that equals 'key'. Return -1
+ * if there is no such element.
+ */
+static inline int
+RT_NODE_4_SEARCH_EQ(RT_NODE_BASE_4 *node, uint8 chunk)
+{
+ int idx = -1;
+
+ for (int i = 0; i < node->n.count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ idx = i;
+ break;
+ }
+ }
+
+ return idx;
+}
+
+/*
+ * Return the index at which 'chunk' should be inserted into the chunks array of the given node.
+ */
+static inline int
+RT_NODE_4_GET_INSERTPOS(RT_NODE_BASE_4 *node, uint8 chunk)
+{
+ int idx;
+
+ for (idx = 0; idx < node->n.count; idx++)
+ {
+ if (node->chunks[idx] >= chunk)
+ break;
+ }
+
+ return idx;
+}
+
+/*
+ * Return the index of the first element in 'node' that equals 'chunk'. Return -1
+ * if there is no such element.
+ */
+static inline int
+RT_NODE_32_SEARCH_EQ(RT_NODE_BASE_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ uint32 bitfield;
+ int index_simd = -1;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index = -1;
+
+ for (int i = 0; i < count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ index = i;
+ break;
+ }
+ }
+#endif
+
+#ifndef USE_NO_SIMD
+ spread_chunk = vector8_broadcast(chunk);
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ cmp1 = vector8_eq(spread_chunk, haystack1);
+ cmp2 = vector8_eq(spread_chunk, haystack2);
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+/*
+ * Return the index at which 'chunk' should be inserted into the chunks array of the given node.
+ */
+static inline int
+RT_NODE_32_GET_INSERTPOS(RT_NODE_BASE_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ Vector8 min1;
+ Vector8 min2;
+ uint32 bitfield;
+ int index_simd;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index;
+
+ for (index = 0; index < count; index++)
+ {
+ if (node->chunks[index] >= chunk)
+ break;
+ }
+#endif
+
+#ifndef USE_NO_SIMD
+ spread_chunk = vector8_broadcast(chunk);
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ min1 = vector8_min(spread_chunk, haystack1);
+ min2 = vector8_min(spread_chunk, haystack2);
+ cmp1 = vector8_eq(spread_chunk, min1);
+ cmp2 = vector8_eq(spread_chunk, min2);
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+ else
+ index_simd = count;
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+/*
+ * Functions to manipulate both chunks array and children/values array.
+ * These are used for node-4 and node-32.
+ */
+
+/* Shift the elements right at 'idx' by one */
+static inline void
+RT_CHUNK_CHILDREN_ARRAY_SHIFT(uint8 *chunks, RT_PTR_ALLOC *children, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+ memmove(&(children[idx + 1]), &(children[idx]), sizeof(RT_PTR_ALLOC) * (count - idx));
+}
+
+static inline void
+RT_CHUNK_VALUES_ARRAY_SHIFT(uint8 *chunks, uint64 *values, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+ memmove(&(values[idx + 1]), &(values[idx]), sizeof(uint64) * (count - idx));
+}
+
+/* Delete the element at 'idx' */
+static inline void
+RT_CHUNK_CHILDREN_ARRAY_DELETE(uint8 *chunks, RT_PTR_ALLOC *children, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(children[idx]), &(children[idx + 1]), sizeof(RT_PTR_ALLOC) * (count - idx - 1));
+}
+
+static inline void
+RT_CHUNK_VALUES_ARRAY_DELETE(uint8 *chunks, uint64 *values, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(values[idx]), &(values[idx + 1]), sizeof(uint64) * (count - idx - 1));
+}
+
+/* Copy both chunks and children/values arrays */
+static inline void
+RT_CHUNK_CHILDREN_ARRAY_COPY(uint8 *src_chunks, RT_PTR_ALLOC *src_children,
+ uint8 *dst_chunks, RT_PTR_ALLOC *dst_children)
+{
+ const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_4_FULL].fanout;
+ const Size chunk_size = sizeof(uint8) * fanout;
+ const Size children_size = sizeof(RT_PTR_ALLOC) * fanout;
+
+ memcpy(dst_chunks, src_chunks, chunk_size);
+ memcpy(dst_children, src_children, children_size);
+}
+
+static inline void
+RT_CHUNK_VALUES_ARRAY_COPY(uint8 *src_chunks, uint64 *src_values,
+ uint8 *dst_chunks, uint64 *dst_values)
+{
+ const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_4_FULL].fanout;
+ const Size chunk_size = sizeof(uint8) * fanout;
+ const Size values_size = sizeof(uint64) * fanout;
+
+ memcpy(dst_chunks, src_chunks, chunk_size);
+ memcpy(dst_values, src_values, values_size);
+}
+
+/* Functions to manipulate inner and leaf node-125 */
+
+/* Does the given chunk in the node have a value? */
+static inline bool
+RT_NODE_125_IS_CHUNK_USED(RT_NODE_BASE_125 *node, uint8 chunk)
+{
+ return node->slot_idxs[chunk] != RT_NODE_125_INVALID_IDX;
+}
+
+static inline RT_PTR_ALLOC
+RT_NODE_INNER_125_GET_CHILD(RT_NODE_INNER_125 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ return node->children[node->base.slot_idxs[chunk]];
+}
+
+static inline uint64
+RT_NODE_LEAF_125_GET_VALUE(RT_NODE_LEAF_125 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ Assert(((RT_NODE_BASE_125 *) node)->slot_idxs[chunk] != RT_NODE_125_INVALID_IDX);
+ return node->values[node->base.slot_idxs[chunk]];
+}
+
+/* Functions to manipulate inner and leaf node-256 */
+
+/* Return true if the slot corresponding to the given chunk is in use */
+static inline bool
+RT_NODE_INNER_256_IS_CHUNK_USED(RT_NODE_INNER_256 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ return node->children[chunk] != RT_INVALID_PTR_ALLOC;
+}
+
+static inline bool
+RT_NODE_LEAF_256_IS_CHUNK_USED(RT_NODE_LEAF_256 *node, uint8 chunk)
+{
+ int idx = BM_IDX(chunk);
+ int bitnum = BM_BIT(chunk);
+
+ Assert(NODE_IS_LEAF(node));
+ return (node->isset[idx] & ((bitmapword) 1 << bitnum)) != 0;
+}
+
+static inline RT_PTR_ALLOC
+RT_NODE_INNER_256_GET_CHILD(RT_NODE_INNER_256 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ Assert(RT_NODE_INNER_256_IS_CHUNK_USED(node, chunk));
+ return node->children[chunk];
+}
+
+static inline uint64
+RT_NODE_LEAF_256_GET_VALUE(RT_NODE_LEAF_256 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_LEAF_256_IS_CHUNK_USED(node, chunk));
+ return node->values[chunk];
+}
+
+/* Set the child in the node-256 */
+static inline void
+RT_NODE_INNER_256_SET(RT_NODE_INNER_256 *node, uint8 chunk, RT_PTR_ALLOC child)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->children[chunk] = child;
+}
+
+/* Set the value in the node-256 */
+static inline void
+RT_NODE_LEAF_256_SET(RT_NODE_LEAF_256 *node, uint8 chunk, uint64 value)
+{
+ int idx = BM_IDX(chunk);
+ int bitnum = BM_BIT(chunk);
+
+ Assert(NODE_IS_LEAF(node));
+ node->isset[idx] |= ((bitmapword) 1 << bitnum);
+ node->values[chunk] = value;
+}
+
+/* Set the slot at the given chunk position */
+static inline void
+RT_NODE_INNER_256_DELETE(RT_NODE_INNER_256 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->children[chunk] = RT_INVALID_PTR_ALLOC;
+}
+
+static inline void
+RT_NODE_LEAF_256_DELETE(RT_NODE_LEAF_256 *node, uint8 chunk)
+{
+ int idx = BM_IDX(chunk);
+ int bitnum = BM_BIT(chunk);
+
+ Assert(NODE_IS_LEAF(node));
+ node->isset[idx] &= ~((bitmapword) 1 << bitnum);
+}
+
+/*
+ * Return the shift that is sufficient to store the given key.
+ */
+static inline int
+RT_KEY_GET_SHIFT(uint64 key)
+{
+ return (key == 0)
+ ? 0
+ : (pg_leftmost_one_pos64(key) / RT_NODE_SPAN) * RT_NODE_SPAN;
+}
+
+/*
+ * Return the max value that can be stored in a node with the given shift.
+ */
+static uint64
+RT_SHIFT_GET_MAX_VAL(int shift)
+{
+ if (shift == RT_MAX_SHIFT)
+ return UINT64_MAX;
+
+ return (UINT64CONST(1) << (shift + RT_NODE_SPAN)) - 1;
+}
+
+/*
+ * Allocate a new node of the given size class.
+ */
+static RT_PTR_ALLOC
+RT_ALLOC_NODE(RT_RADIX_TREE *tree, RT_SIZE_CLASS size_class, bool inner)
+{
+ RT_PTR_ALLOC allocnode;
+ size_t allocsize;
+
+ if (inner)
+ allocsize = RT_SIZE_CLASS_INFO[size_class].inner_size;
+ else
+ allocsize = RT_SIZE_CLASS_INFO[size_class].leaf_size;
+
+#ifdef RT_SHMEM
+ allocnode = dsa_allocate(tree->dsa, allocsize);
+#else
+ if (inner)
+ allocnode = (RT_PTR_ALLOC) MemoryContextAlloc(tree->inner_slabs[size_class],
+ allocsize);
+ else
+ allocnode = (RT_PTR_ALLOC) MemoryContextAlloc(tree->leaf_slabs[size_class],
+ allocsize);
+#endif
+
+#ifdef RT_DEBUG
+ /* update the statistics */
+ tree->ctl->cnt[size_class]++;
+#endif
+
+ return allocnode;
+}
+
+/* Initialize the node contents */
+static inline void
+RT_INIT_NODE(RT_PTR_LOCAL node, uint8 kind, RT_SIZE_CLASS size_class, bool inner)
+{
+ if (inner)
+ MemSet(node, 0, RT_SIZE_CLASS_INFO[size_class].inner_size);
+ else
+ MemSet(node, 0, RT_SIZE_CLASS_INFO[size_class].leaf_size);
+
+ node->kind = kind;
+
+ if (kind == RT_NODE_KIND_256)
+ /* See comment for the RT_NODE type */
+ Assert(node->fanout == 0);
+ else
+ node->fanout = RT_SIZE_CLASS_INFO[size_class].fanout;
+
+ /* Initialize slot_idxs to invalid values */
+ if (kind == RT_NODE_KIND_125)
+ {
+ RT_NODE_BASE_125 *n125 = (RT_NODE_BASE_125 *) node;
+
+ memset(n125->slot_idxs, RT_NODE_125_INVALID_IDX, sizeof(n125->slot_idxs));
+ }
+}
+
+/*
+ * Create a new node as the root. Subordinate nodes will be created during
+ * the insertion.
+ */
+static void
+RT_NEW_ROOT(RT_RADIX_TREE *tree, uint64 key)
+{
+ int shift = RT_KEY_GET_SHIFT(key);
+ bool inner = shift > 0;
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+
+ allocnode = RT_ALLOC_NODE(tree, RT_CLASS_4_FULL, inner);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(newnode, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
+ newnode->shift = shift;
+ tree->ctl->max_val = RT_SHIFT_GET_MAX_VAL(shift);
+ tree->ctl->root = allocnode;
+}
+
+static inline void
+RT_COPY_NODE(RT_PTR_LOCAL newnode, RT_PTR_LOCAL oldnode)
+{
+ newnode->shift = oldnode->shift;
+ newnode->count = oldnode->count;
+}
+#if 0
+/*
+ * Create a new node with 'new_kind' and the same shift, chunk, and
+ * count of 'node'.
+ */
+static RT_NODE*
+RT_GROW_NODE_KIND(RT_RADIX_TREE *tree, RT_PTR_LOCAL node, uint8 new_kind)
+{
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+ bool inner = !NODE_IS_LEAF(node);
+
+ allocnode = RT_ALLOC_NODE(tree, RT_KIND_MIN_SIZE_CLASS[new_kind], inner);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(newnode, new_kind, RT_KIND_MIN_SIZE_CLASS[new_kind], inner);
+ RT_COPY_NODE(newnode, node);
+
+ return newnode;
+}
+#endif
+/* Free the given node */
+static void
+RT_FREE_NODE(RT_RADIX_TREE *tree, RT_PTR_ALLOC allocnode)
+{
+ /* If we're deleting the root node, make the tree empty */
+ if (tree->ctl->root == allocnode)
+ {
+ tree->ctl->root = RT_INVALID_PTR_ALLOC;
+ tree->ctl->max_val = 0;
+ }
+
+#ifdef RT_DEBUG
+ {
+ int i;
+ RT_PTR_LOCAL node = RT_PTR_GET_LOCAL(tree, allocnode);
+
+ /* update the statistics */
+ for (i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ if (node->fanout == RT_SIZE_CLASS_INFO[i].fanout)
+ break;
+ }
+
+ /* fanout of node256 is intentionally 0 */
+ if (i == RT_SIZE_CLASS_COUNT)
+ i = RT_CLASS_256;
+
+ tree->ctl->cnt[i]--;
+ Assert(tree->ctl->cnt[i] >= 0);
+ }
+#endif
+
+#ifdef RT_SHMEM
+ dsa_free(tree->dsa, allocnode);
+#else
+ pfree(allocnode);
+#endif
+}
+
+static inline void
+RT_NODE_UPDATE_INNER(RT_PTR_LOCAL node, uint64 key, RT_PTR_ALLOC new_child)
+{
+#define RT_ACTION_UPDATE
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_INNER
+#undef RT_ACTION_UPDATE
+}
+
+/*
+ * Replace old_child with new_child, and free the old one.
+ */
+static void
+RT_REPLACE_NODE(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC old_child,
+ RT_PTR_ALLOC new_child, uint64 key)
+{
+ RT_PTR_LOCAL old = RT_PTR_GET_LOCAL(tree, old_child);
+
+#ifdef USE_ASSERT_CHECKING
+ RT_PTR_LOCAL new = RT_PTR_GET_LOCAL(tree, new_child);
+
+ Assert(old->shift == new->shift);
+#endif
+
+ if (parent == old)
+ {
+ /* Replace the root node with the new large node */
+ tree->ctl->root = new_child;
+ }
+ else
+ RT_NODE_UPDATE_INNER(parent, key, new_child);
+
+ RT_FREE_NODE(tree, old_child);
+}
+
+/*
+ * The radix tree doesn't have sufficient height. Extend the radix tree so it can
+ * store the key.
+ */
+static void
+RT_EXTEND(RT_RADIX_TREE *tree, uint64 key)
+{
+ int target_shift;
+ RT_PTR_LOCAL root = RT_PTR_GET_LOCAL(tree, tree->ctl->root);
+ int shift = root->shift + RT_NODE_SPAN;
+
+ target_shift = RT_KEY_GET_SHIFT(key);
+
+ /* Grow tree from 'shift' to 'target_shift' */
+ while (shift <= target_shift)
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL node;
+ RT_NODE_INNER_4 *n4;
+
+ allocnode = RT_ALLOC_NODE(tree, RT_CLASS_4_FULL, true);
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(node, RT_NODE_KIND_4, RT_CLASS_4_FULL, true);
+ node->shift = shift;
+ node->count = 1;
+
+ n4 = (RT_NODE_INNER_4 *) node;
+ n4->base.chunks[0] = 0;
+ n4->children[0] = tree->ctl->root;
+
+ /* Update the root */
+ tree->ctl->root = allocnode;
+
+ shift += RT_NODE_SPAN;
+ }
+
+ tree->ctl->max_val = RT_SHIFT_GET_MAX_VAL(target_shift);
+}
+
+/*
+ * The radix tree doesn't have the inner and leaf nodes needed for the given
+ * key-value pair. Insert inner and leaf nodes from 'node' down to the bottom.
+ */
+static inline void
+RT_SET_EXTEND(RT_RADIX_TREE *tree, uint64 key, uint64 value, RT_PTR_LOCAL parent,
+ RT_PTR_ALLOC nodep, RT_PTR_LOCAL node)
+{
+ int shift = node->shift;
+
+ Assert(RT_PTR_GET_LOCAL(tree, nodep) == node);
+
+ while (shift >= RT_NODE_SPAN)
+ {
+ RT_PTR_ALLOC allocchild;
+ RT_PTR_LOCAL newchild;
+ int newshift = shift - RT_NODE_SPAN;
+ bool inner = newshift > 0;
+
+ allocchild = RT_ALLOC_NODE(tree, RT_CLASS_4_FULL, inner);
+ newchild = RT_PTR_GET_LOCAL(tree, allocchild);
+ RT_INIT_NODE(newchild, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
+ newchild->shift = newshift;
+ RT_NODE_INSERT_INNER(tree, parent, nodep, node, key, allocchild);
+
+ parent = node;
+ node = newchild;
+ nodep = allocchild;
+ shift -= RT_NODE_SPAN;
+ }
+
+ RT_NODE_INSERT_LEAF(tree, parent, nodep, node, key, value);
+ tree->ctl->num_keys++;
+}
+
+/*
+ * Search for the child pointer corresponding to 'key' in the given node.
+ *
+ * Return true if the key is found, otherwise return false. On success, the child
+ * pointer is stored in *child_p.
+ */
+static inline bool
+RT_NODE_SEARCH_INNER(RT_PTR_LOCAL node, uint64 key, RT_PTR_ALLOC *child_p)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/*
+ * Search for the value corresponding to 'key' in the given node.
+ *
+ * Return true if the key is found, otherwise return false. On success, the value
+ * is stored in *value_p.
+ */
+static inline bool
+RT_NODE_SEARCH_LEAF(RT_PTR_LOCAL node, uint64 key, uint64 *value_p)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
+#ifdef RT_USE_DELETE
+/*
+ * Search for the child pointer corresponding to 'key' in the given node.
+ *
+ * Delete the child and return true if the key is found, otherwise return false.
+ */
+static inline bool
+RT_NODE_DELETE_INNER(RT_PTR_LOCAL node, uint64 key)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_delete_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/*
+ * Search for the value corresponding to 'key' in the given node.
+ *
+ * Delete the value and return true if the key is found, otherwise return false.
+ */
+static inline bool
+RT_NODE_DELETE_LEAF(RT_PTR_LOCAL node, uint64 key)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_delete_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+#endif
+
+/* Insert the child to the inner node */
+static bool
+RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC nodep, RT_PTR_LOCAL node,
+ uint64 key, RT_PTR_ALLOC child)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_insert_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/* Insert the value to the leaf node */
+static bool
+RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC nodep, RT_PTR_LOCAL node,
+ uint64 key, uint64 value)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_insert_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
+/*
+ * Create the radix tree in the given memory context and return it.
+ */
+RT_SCOPE RT_RADIX_TREE *
+#ifdef RT_SHMEM
+RT_CREATE(MemoryContext ctx, dsa_area *dsa)
+#else
+RT_CREATE(MemoryContext ctx)
+#endif
+{
+ RT_RADIX_TREE *tree;
+ MemoryContext old_ctx;
+#ifdef RT_SHMEM
+ dsa_pointer dp;
+#endif
+
+ old_ctx = MemoryContextSwitchTo(ctx);
+
+ tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE));
+ tree->context = ctx;
+
+#ifdef RT_SHMEM
+ tree->dsa = dsa;
+ dp = dsa_allocate0(dsa, sizeof(RT_RADIX_TREE_CONTROL));
+ tree->ctl = (RT_RADIX_TREE_CONTROL *) dsa_get_address(dsa, dp);
+ tree->ctl->handle = dp;
+ tree->ctl->magic = RT_RADIX_TREE_MAGIC;
+#else
+ tree->ctl = (RT_RADIX_TREE_CONTROL *) palloc0(sizeof(RT_RADIX_TREE_CONTROL));
+
+ /* Create the slab allocator for each size class */
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ tree->inner_slabs[i] = SlabContextCreate(ctx,
+ RT_SIZE_CLASS_INFO[i].name,
+ RT_SIZE_CLASS_INFO[i].inner_blocksize,
+ RT_SIZE_CLASS_INFO[i].inner_size);
+ tree->leaf_slabs[i] = SlabContextCreate(ctx,
+ RT_SIZE_CLASS_INFO[i].name,
+ RT_SIZE_CLASS_INFO[i].leaf_blocksize,
+ RT_SIZE_CLASS_INFO[i].leaf_size);
+ }
+#endif
+
+ tree->ctl->root = RT_INVALID_PTR_ALLOC;
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return tree;
+}
+
+#ifdef RT_SHMEM
+RT_SCOPE RT_RADIX_TREE *
+RT_ATTACH(dsa_area *dsa, RT_HANDLE handle)
+{
+ RT_RADIX_TREE *tree;
+ dsa_pointer control;
+
+ /* XXX: memory context support */
+ tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE));
+
+ /* Find the control object in shared memory */
+ control = handle;
+
+ tree->dsa = dsa;
+ tree->ctl = (RT_RADIX_TREE_CONTROL *) dsa_get_address(dsa, control);
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+
+ return tree;
+}
+
+RT_SCOPE void
+RT_DETACH(RT_RADIX_TREE *tree)
+{
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+ pfree(tree);
+}
+
+RT_SCOPE RT_HANDLE
+RT_GET_HANDLE(RT_RADIX_TREE *tree)
+{
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+ return tree->ctl->handle;
+}
+
+/*
+ * Recursively free all nodes allocated to the DSA area.
+ */
+static inline void
+RT_FREE_RECURSE(RT_RADIX_TREE *tree, RT_PTR_ALLOC ptr)
+{
+ RT_PTR_LOCAL node = RT_PTR_GET_LOCAL(tree, ptr);
+
+ check_stack_depth();
+ CHECK_FOR_INTERRUPTS();
+
+ /* The leaf node doesn't have child pointers */
+ if (NODE_IS_LEAF(node))
+ {
+ dsa_free(tree->dsa, ptr);
+ return;
+ }
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ RT_NODE_INNER_4 *n4 = (RT_NODE_INNER_4 *) node;
+
+ for (int i = 0; i < n4->base.n.count; i++)
+ RT_FREE_RECURSE(tree, n4->children[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE_INNER_32 *n32 = (RT_NODE_INNER_32 *) node;
+
+ for (int i = 0; i < n32->base.n.count; i++)
+ RT_FREE_RECURSE(tree, n32->children[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE_INNER_125 *n125 = (RT_NODE_INNER_125 *) node;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(&n125->base, i))
+ continue;
+
+ RT_FREE_RECURSE(tree, RT_NODE_INNER_125_GET_CHILD(n125, i));
+ }
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE_INNER_256 *n256 = (RT_NODE_INNER_256 *) node;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
+ continue;
+
+ RT_FREE_RECURSE(tree, RT_NODE_INNER_256_GET_CHILD(n256, i));
+ }
+
+ break;
+ }
+ }
+
+ /* Free the inner node */
+ dsa_free(tree->dsa, ptr);
+}
+#endif
+
+/*
+ * Free the given radix tree.
+ */
+RT_SCOPE void
+RT_FREE(RT_RADIX_TREE *tree)
+{
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+
+ /* Free all memory used for radix tree nodes */
+ if (RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ RT_FREE_RECURSE(tree, tree->ctl->root);
+
+ /*
+ * Vandalize the control block to help catch programming errors where
+ * other backends access the memory formerly occupied by this radix tree.
+ */
+ tree->ctl->magic = 0;
+ dsa_free(tree->dsa, tree->ctl->handle);
+#else
+ pfree(tree->ctl);
+
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ MemoryContextDelete(tree->inner_slabs[i]);
+ MemoryContextDelete(tree->leaf_slabs[i]);
+ }
+#endif
+
+ pfree(tree);
+}
+
+/*
+ * Set key to value. If the entry already exists, update its value to 'value'
+ * and return true. Return false if the entry doesn't yet exist.
+ */
+RT_SCOPE bool
+RT_SET(RT_RADIX_TREE *tree, uint64 key, uint64 value)
+{
+ int shift;
+ bool updated;
+ RT_PTR_LOCAL parent;
+ RT_PTR_ALLOC nodep;
+ RT_PTR_LOCAL node;
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+
+ /* Empty tree, create the root */
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ RT_NEW_ROOT(tree, key);
+
+ /* Extend the tree if necessary */
+ if (key > tree->ctl->max_val)
+ RT_EXTEND(tree, key);
+
+ nodep = tree->ctl->root;
+ parent = RT_PTR_GET_LOCAL(tree, nodep);
+ shift = parent->shift;
+
+ /* Descend the tree until we reach a leaf node */
+ while (shift >= 0)
+ {
+ RT_PTR_ALLOC child;
+
+ node = RT_PTR_GET_LOCAL(tree, nodep);
+
+ if (NODE_IS_LEAF(node))
+ break;
+
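+ /* No child for this chunk yet: create the missing path and insert the value */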
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ {
+ RT_SET_EXTEND(tree, key, value, parent, nodep, node);
+ return false;
+ }
+
+ parent = node;
+ nodep = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ updated = RT_NODE_INSERT_LEAF(tree, parent, nodep, node, key, value);
+
+ /* Update the statistics */
+ if (!updated)
+ tree->ctl->num_keys++;
+
+ return updated;
+}
+
+/*
+ * Search for the given key in the radix tree. Return true if the key exists,
+ * otherwise return false. On success, the value is set to *value_p, which
+ * therefore must not be NULL.
+ */
+RT_SCOPE bool
+RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, uint64 *value_p)
+{
+ RT_PTR_LOCAL node;
+ int shift;
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+ Assert(value_p != NULL);
+
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root) || key > tree->ctl->max_val)
+ return false;
+
+ node = RT_PTR_GET_LOCAL(tree, tree->ctl->root);
+ shift = node->shift;
+
+ /* Descend the tree until we reach a leaf node */
+ while (shift >= 0)
+ {
+ RT_PTR_ALLOC child;
+
+ if (NODE_IS_LEAF(node))
+ break;
+
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ return false;
+
+ node = RT_PTR_GET_LOCAL(tree, child);
+ shift -= RT_NODE_SPAN;
+ }
+
+ return RT_NODE_SEARCH_LEAF(node, key, value_p);
+}
+
+#ifdef RT_USE_DELETE
+/*
+ * Delete the given key from the radix tree. Return true if the key is found (and
+ * deleted), otherwise do nothing and return false.
+ */
+RT_SCOPE bool
+RT_DELETE(RT_RADIX_TREE *tree, uint64 key)
+{
+ RT_PTR_LOCAL node;
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_ALLOC stack[RT_MAX_LEVEL] = {0};
+ int shift;
+ int level;
+ bool deleted;
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root) || key > tree->ctl->max_val)
+ return false;
+
+ /*
+ * Descend the tree to search for the key while building a stack of nodes we
+ * visited.
+ */
+ allocnode = tree->ctl->root;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ shift = node->shift;
+ level = -1;
+ while (shift > 0)
+ {
+ RT_PTR_ALLOC child;
+
+ /* Push the current node to the stack */
+ stack[++level] = allocnode;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ return false;
+
+ allocnode = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ /* Delete the key from the leaf node if it exists */
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ deleted = RT_NODE_DELETE_LEAF(node, key);
+
+ if (!deleted)
+ {
+ /* no key is found in the leaf node */
+ return false;
+ }
+
+ /* Found the key to delete. Update the statistics */
+ tree->ctl->num_keys--;
+
+ /*
+ * Return if the leaf node still has keys and we don't need to delete the
+ * node.
+ */
+ if (!NODE_IS_EMPTY(node))
+ return true;
+
+ /* Free the empty leaf node */
+ RT_FREE_NODE(tree, allocnode);
+
+ /* Delete the child pointer from the inner nodes, walking back up the stack */
+ while (level >= 0)
+ {
+ allocnode = stack[level--];
+
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ deleted = RT_NODE_DELETE_INNER(node, key);
+ Assert(deleted);
+
+ /* If the node didn't become empty, we stop deleting the key */
+ if (!NODE_IS_EMPTY(node))
+ break;
+
+ /* The node became empty */
+ RT_FREE_NODE(tree, allocnode);
+ }
+
+ return true;
+}
+#endif
+
+static inline void
+RT_ITER_UPDATE_KEY(RT_ITER *iter, uint8 chunk, uint8 shift)
+{
+ iter->key &= ~(((uint64) RT_CHUNK_MASK) << shift);
+ iter->key |= (((uint64) chunk) << shift);
+}
+
+/*
+ * Advance the slot in the inner node. Return the child if one exists,
+ * otherwise NULL.
+ */
+static inline RT_PTR_LOCAL
+RT_NODE_INNER_ITERATE_NEXT(RT_ITER *iter, RT_NODE_ITER *node_iter)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_iter_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/*
+ * Advance the slot in the leaf node. On success, return true and set the
+ * value to *value_p; otherwise return false.
+ */
+static inline bool
+RT_NODE_LEAF_ITERATE_NEXT(RT_ITER *iter, RT_NODE_ITER *node_iter,
+ uint64 *value_p)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_iter_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
+/*
+ * Update each node_iter for inner nodes in the iterator node stack.
+ */
+static void
+RT_UPDATE_ITER_STACK(RT_ITER *iter, RT_PTR_LOCAL from_node, int from)
+{
+ int level = from;
+ RT_PTR_LOCAL node = from_node;
+
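+ /*
+ * Walk down from 'from_node', resetting each stack entry and advancing
+ * the inner node iterators to their first child.
+ */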
+ for (;;)
+ {
+ RT_NODE_ITER *node_iter = &(iter->stack[level--]);
+
+ node_iter->node = node;
+ node_iter->current_idx = -1;
+
+ /* We don't advance the leaf node iterator here */
+ if (NODE_IS_LEAF(node))
+ return;
+
+ /* Advance to the next slot in the inner node */
+ node = RT_NODE_INNER_ITERATE_NEXT(iter, node_iter);
+
+ /* We must find the first child in the node */
+ Assert(node);
+ }
+}
+
+/* Create and return the iterator for the given radix tree */
+RT_SCOPE RT_ITER *
+RT_BEGIN_ITERATE(RT_RADIX_TREE *tree)
+{
+ MemoryContext old_ctx;
+ RT_ITER *iter;
+ RT_PTR_LOCAL root;
+ int top_level;
+
+ old_ctx = MemoryContextSwitchTo(tree->context);
+
+ iter = (RT_ITER *) palloc0(sizeof(RT_ITER));
+ iter->tree = tree;
+
+ /* empty tree */
+ if (!iter->tree->ctl->root)
+ return iter;
+
+ root = RT_PTR_GET_LOCAL(tree, iter->tree->ctl->root);
+ top_level = root->shift / RT_NODE_SPAN;
+ iter->stack_len = top_level;
+
+ /*
+ * Descend to the leftmost leaf node from the root. The key is constructed
+ * while descending to the leaf.
+ */
+ RT_UPDATE_ITER_STACK(iter, root, top_level);
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return iter;
+}
+
+/*
+ * Return true, setting *key_p and *value_p, if there is a next key. Otherwise
+ * return false.
+ */
+RT_SCOPE bool
+RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, uint64 *value_p)
+{
+ /* Empty tree */
+ if (!iter->tree->ctl->root)
+ return false;
+
+ for (;;)
+ {
+ RT_PTR_LOCAL child = NULL;
+ uint64 value;
+ int level;
+ bool found;
+
+ /* Advance the leaf node iterator to get next key-value pair */
+ found = RT_NODE_LEAF_ITERATE_NEXT(iter, &(iter->stack[0]), &value);
+
+ if (found)
+ {
+ *key_p = iter->key;
+ *value_p = value;
+ return true;
+ }
+
+ /*
+ * We've visited all values in the leaf node, so advance the inner node
+ * iterators from level 1 upward until we find the next child node.
+ */
+ for (level = 1; level <= iter->stack_len; level++)
+ {
+ child = RT_NODE_INNER_ITERATE_NEXT(iter, &(iter->stack[level]));
+
+ if (child)
+ break;
+ }
+
+ /* the iteration finished */
+ if (!child)
+ return false;
+
+ /*
+ * Set the found node in the node iterator and update the iterator stack
+ * from this node downward.
+ */
+ RT_UPDATE_ITER_STACK(iter, child, level - 1);
+
+ /* Node iterators are updated, so try again from the leaf */
+ }
+
+ return false;
+}
+
+RT_SCOPE void
+RT_END_ITERATE(RT_ITER *iter)
+{
+ pfree(iter);
+}
+
+/*
+ * Return the statistics of the amount of memory used by the radix tree.
+ */
+RT_SCOPE uint64
+RT_MEMORY_USAGE(RT_RADIX_TREE *tree)
+{
+ // XXX is this necessary?
+ Size total = sizeof(RT_RADIX_TREE);
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+ total = dsa_get_total_size(tree->dsa);
+#else
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
+ total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
+ }
+#endif
+
+ return total;
+}
+
+/*
+ * Verify the radix tree node.
+ */
+static void
+RT_VERIFY_NODE(RT_PTR_LOCAL node)
+{
+#ifdef USE_ASSERT_CHECKING
+ Assert(node->count >= 0);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ RT_NODE_BASE_4 *n4 = (RT_NODE_BASE_4 *) node;
+
+ for (int i = 1; i < n4->n.count; i++)
+ Assert(n4->chunks[i - 1] < n4->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE_BASE_32 *n32 = (RT_NODE_BASE_32 *) node;
+
+ for (int i = 1; i < n32->n.count; i++)
+ Assert(n32->chunks[i - 1] < n32->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE_BASE_125 *n125 = (RT_NODE_BASE_125 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ uint8 slot = n125->slot_idxs[i];
+ int idx = BM_IDX(slot);
+ int bitnum = BM_BIT(slot);
+
+ if (!RT_NODE_125_IS_CHUNK_USED(n125, i))
+ continue;
+
+ /* Check if the corresponding slot is used */
+ Assert(slot < node->fanout);
+ Assert((n125->isset[idx] & ((bitmapword) 1 << bitnum)) != 0);
+
+ cnt++;
+ }
+
+ Assert(n125->n.count == cnt);
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_256 *n256 = (RT_NODE_LEAF_256 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < BM_IDX(RT_NODE_MAX_SLOTS); i++)
+ cnt += bmw_popcount(n256->isset[i]);
+
+ /* Check that the number of used chunks matches */
+ Assert(n256->base.n.count == cnt);
+
+ break;
+ }
+ }
+ }
+#endif
+}
+
+/***************** DEBUG FUNCTIONS *****************/
+#ifdef RT_DEBUG
+RT_SCOPE void
+RT_STATS(RT_RADIX_TREE *tree)
+{
+ ereport(NOTICE, (errmsg("num_keys = " UINT64_FORMAT ", height = %d, n4 = %u, n32_partial = %u, n32_full = %u, n125 = %u, n256 = %u",
+ tree->ctl->num_keys,
+ tree->ctl->root->shift / RT_NODE_SPAN,
+ tree->ctl->cnt[RT_CLASS_4_FULL],
+ tree->ctl->cnt[RT_CLASS_32_PARTIAL],
+ tree->ctl->cnt[RT_CLASS_32_FULL],
+ tree->ctl->cnt[RT_CLASS_125_FULL],
+ tree->ctl->cnt[RT_CLASS_256])));
+}
+
+static void
+RT_DUMP_NODE(RT_PTR_LOCAL node, int level, bool recurse)
+{
+ char space[125] = {0};
+
+ fprintf(stderr, "[%s] kind %d, fanout %d, count %u, shift %u:\n",
+ NODE_IS_LEAF(node) ? "LEAF" : "INNR",
+ (node->kind == RT_NODE_KIND_4) ? 4 :
+ (node->kind == RT_NODE_KIND_32) ? 32 :
+ (node->kind == RT_NODE_KIND_125) ? 125 : 256,
+ node->fanout == 0 ? 256 : node->fanout,
+ node->count, node->shift);
+
+ if (level > 0)
+ sprintf(space, "%*c", level * 4, ' ');
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_4 *n4 = (RT_NODE_LEAF_4 *) node;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, n4->base.chunks[i], n4->values[i]);
+ }
+ else
+ {
+ RT_NODE_INNER_4 *n4 = (RT_NODE_INNER_4 *) node;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, n4->base.chunks[i]);
+
+ if (recurse)
+ RT_DUMP_NODE(n4->children[i], level + 1, recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_32 *n32 = (RT_NODE_LEAF_32 *) node;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, n32->base.chunks[i], n32->values[i]);
+ }
+ else
+ {
+ RT_NODE_INNER_32 *n32 = (RT_NODE_INNER_32 *) node;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, n32->base.chunks[i]);
+
+ if (recurse)
+ {
+ RT_DUMP_NODE(n32->children[i], level + 1, recurse);
+ }
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE_BASE_125 *b125 = (RT_NODE_BASE_125 *) node;
+
+ fprintf(stderr, "slot_idxs ");
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(b125, i))
+ continue;
+
+ fprintf(stderr, " [%d]=%d, ", i, b125->slot_idxs[i]);
+ }
+ if (NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_125 *n = (RT_NODE_LEAF_125 *) node;
+
+ fprintf(stderr, ", isset-bitmap:");
+ for (int i = 0; i < BM_IDX(128); i++)
+ {
+ fprintf(stderr, UINT64_FORMAT_HEX " ", (uint64) n->base.isset[i]);
+ }
+ fprintf(stderr, "\n");
+ }
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(b125, i))
+ continue;
+
+ if (NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_125 *n125 = (RT_NODE_LEAF_125 *) b125;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, i, RT_NODE_LEAF_125_GET_VALUE(n125, i));
+ }
+ else
+ {
+ RT_NODE_INNER_125 *n125 = (RT_NODE_INNER_125 *) b125;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, i);
+
+ if (recurse)
+ RT_DUMP_NODE(RT_NODE_INNER_125_GET_CHILD(n125, i),
+ level + 1, recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_256 *n256 = (RT_NODE_LEAF_256 *) node;
+
+ if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, i))
+ continue;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, i, RT_NODE_LEAF_256_GET_VALUE(n256, i));
+ }
+ else
+ {
+ RT_NODE_INNER_256 *n256 = (RT_NODE_INNER_256 *) node;
+
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
+ continue;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, i);
+
+ if (recurse)
+ RT_DUMP_NODE(RT_NODE_INNER_256_GET_CHILD(n256, i), level + 1,
+ recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ }
+}
+
+RT_SCOPE void
+RT_DUMP_SEARCH(RT_RADIX_TREE *tree, uint64 key)
+{
+ RT_PTR_LOCAL node;
+ int shift;
+ int level = 0;
+
+ elog(NOTICE, "-----------------------------------------------------------");
+ elog(NOTICE, "max_val = " UINT64_FORMAT "(0x" UINT64_FORMAT_HEX ")",
+ tree->ctl->max_val, tree->ctl->max_val);
+
+ if (!tree->ctl->root)
+ {
+ elog(NOTICE, "tree is empty");
+ return;
+ }
+
+ if (key > tree->ctl->max_val)
+ {
+ elog(NOTICE, "key " UINT64_FORMAT "(0x" UINT64_FORMAT_HEX ") is larger than max val",
+ key, key);
+ return;
+ }
+
+ node = tree->ctl->root;
+ shift = tree->ctl->root->shift;
+ while (shift >= 0)
+ {
+ RT_PTR_LOCAL child;
+
+ RT_DUMP_NODE(node, level, false);
+
+ if (NODE_IS_LEAF(node))
+ {
+ uint64 dummy;
+
+ /* We reached a leaf node; find the corresponding slot */
+ RT_NODE_SEARCH_LEAF(node, key, &dummy);
+
+ break;
+ }
+
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ break;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ level++;
+ }
+}
+
+RT_SCOPE void
+RT_DUMP(RT_RADIX_TREE *tree)
+{
+
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ fprintf(stderr, "%s\tinner_size %zu\tinner_blocksize %zu\tleaf_size %zu\tleaf_blocksize %zu\n",
+ RT_SIZE_CLASS_INFO[i].name,
+ RT_SIZE_CLASS_INFO[i].inner_size,
+ RT_SIZE_CLASS_INFO[i].inner_blocksize,
+ RT_SIZE_CLASS_INFO[i].leaf_size,
+ RT_SIZE_CLASS_INFO[i].leaf_blocksize);
+ fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->ctl->max_val);
+
+ if (!tree->ctl->root)
+ {
+ fprintf(stderr, "empty tree\n");
+ return;
+ }
+
+ RT_DUMP_NODE(tree->ctl->root, 0, true);
+}
+#endif
+
+#endif /* RT_DEFINE */
+
+
+/* undefine external parameters, so next radix tree can be defined */
+#undef RT_PREFIX
+#undef RT_SCOPE
+#undef RT_DECLARE
+#undef RT_DEFINE
+
+/* locally declared macros */
+#undef NODE_IS_LEAF
+#undef NODE_IS_EMPTY
+#undef VAR_NODE_HAS_FREE_SLOT
+#undef FIXED_NODE_HAS_FREE_SLOT
+#undef RT_SIZE_CLASS_COUNT
+#undef RT_RADIX_TREE_MAGIC
+
+/* type declarations */
+#undef RT_RADIX_TREE
+#undef RT_RADIX_TREE_CONTROL
+#undef RT_PTR_ALLOC
+#undef RT_INVALID_PTR_ALLOC
+#undef RT_HANDLE
+#undef RT_ITER
+#undef RT_NODE
+#undef RT_NODE_ITER
+#undef RT_NODE_BASE_4
+#undef RT_NODE_BASE_32
+#undef RT_NODE_BASE_125
+#undef RT_NODE_BASE_256
+#undef RT_NODE_INNER_4
+#undef RT_NODE_INNER_32
+#undef RT_NODE_INNER_125
+#undef RT_NODE_INNER_256
+#undef RT_NODE_LEAF_4
+#undef RT_NODE_LEAF_32
+#undef RT_NODE_LEAF_125
+#undef RT_NODE_LEAF_256
+#undef RT_SIZE_CLASS
+#undef RT_SIZE_CLASS_ELEM
+#undef RT_SIZE_CLASS_INFO
+#undef RT_CLASS_4_FULL
+#undef RT_CLASS_32_PARTIAL
+#undef RT_CLASS_32_FULL
+#undef RT_CLASS_125_FULL
+#undef RT_CLASS_256
+#undef RT_KIND_MIN_SIZE_CLASS
+
+/* function declarations */
+#undef RT_CREATE
+#undef RT_FREE
+#undef RT_ATTACH
+#undef RT_DETACH
+#undef RT_GET_HANDLE
+#undef RT_SET
+#undef RT_BEGIN_ITERATE
+#undef RT_ITERATE_NEXT
+#undef RT_END_ITERATE
+#undef RT_USE_DELETE
+#undef RT_DELETE
+#undef RT_MEMORY_USAGE
+#undef RT_DUMP
+#undef RT_DUMP_NODE
+#undef RT_DUMP_SEARCH
+#undef RT_STATS
+
+/* internal helper functions */
+#undef RT_NEW_ROOT
+#undef RT_ALLOC_NODE
+#undef RT_INIT_NODE
+#undef RT_FREE_NODE
+#undef RT_FREE_RECURSE
+#undef RT_EXTEND
+#undef RT_SET_EXTEND
+#undef RT_GROW_NODE_KIND
+#undef RT_COPY_NODE
+#undef RT_REPLACE_NODE
+#undef RT_PTR_GET_LOCAL
+#undef RT_PTR_ALLOC_IS_VALID
+#undef RT_NODE_4_SEARCH_EQ
+#undef RT_NODE_32_SEARCH_EQ
+#undef RT_NODE_4_GET_INSERTPOS
+#undef RT_NODE_32_GET_INSERTPOS
+#undef RT_CHUNK_CHILDREN_ARRAY_SHIFT
+#undef RT_CHUNK_VALUES_ARRAY_SHIFT
+#undef RT_CHUNK_CHILDREN_ARRAY_DELETE
+#undef RT_CHUNK_VALUES_ARRAY_DELETE
+#undef RT_CHUNK_CHILDREN_ARRAY_COPY
+#undef RT_CHUNK_VALUES_ARRAY_COPY
+#undef RT_NODE_125_IS_CHUNK_USED
+#undef RT_NODE_INNER_125_GET_CHILD
+#undef RT_NODE_LEAF_125_GET_VALUE
+#undef RT_NODE_INNER_256_IS_CHUNK_USED
+#undef RT_NODE_LEAF_256_IS_CHUNK_USED
+#undef RT_NODE_INNER_256_GET_CHILD
+#undef RT_NODE_LEAF_256_GET_VALUE
+#undef RT_NODE_INNER_256_SET
+#undef RT_NODE_LEAF_256_SET
+#undef RT_NODE_INNER_256_DELETE
+#undef RT_NODE_LEAF_256_DELETE
+#undef RT_KEY_GET_SHIFT
+#undef RT_SHIFT_GET_MAX_VAL
+#undef RT_NODE_SEARCH_INNER
+#undef RT_NODE_SEARCH_LEAF
+#undef RT_NODE_UPDATE_INNER
+#undef RT_NODE_DELETE_INNER
+#undef RT_NODE_DELETE_LEAF
+#undef RT_NODE_INSERT_INNER
+#undef RT_NODE_INSERT_LEAF
+#undef RT_NODE_INNER_ITERATE_NEXT
+#undef RT_NODE_LEAF_ITERATE_NEXT
+#undef RT_UPDATE_ITER_STACK
+#undef RT_ITER_UPDATE_KEY
+#undef RT_VERIFY_NODE
+
+#undef RT_DEBUG
diff --git a/src/include/lib/radixtree_delete_impl.h b/src/include/lib/radixtree_delete_impl.h
new file mode 100644
index 0000000000..eb87866b90
--- /dev/null
+++ b/src/include/lib/radixtree_delete_impl.h
@@ -0,0 +1,106 @@
+/* TODO: shrink nodes */
+
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE4_TYPE RT_NODE_INNER_4
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE4_TYPE RT_NODE_LEAF_4
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+
+#ifdef RT_NODE_LEVEL_LEAF
+ Assert(NODE_IS_LEAF(node));
+#else
+ Assert(!NODE_IS_LEAF(node));
+#endif
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ RT_NODE4_TYPE *n4 = (RT_NODE4_TYPE *) node;
+ int idx = RT_NODE_4_SEARCH_EQ((RT_NODE_BASE_4 *) n4, chunk);
+
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_DELETE(n4->base.chunks, (uint64 *) n4->values,
+ n4->base.n.count, idx);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_DELETE(n4->base.chunks, n4->children,
+ n4->base.n.count, idx);
+#endif
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
+ int idx = RT_NODE_32_SEARCH_EQ((RT_NODE_BASE_32 *) n32, chunk);
+
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_DELETE(n32->base.chunks, (uint64 *) n32->values,
+ n32->base.n.count, idx);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_DELETE(n32->base.chunks, n32->children,
+ n32->base.n.count, idx);
+#endif
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+ int slotpos = n125->base.slot_idxs[chunk];
+ int idx;
+ int bitnum;
+
+ if (slotpos == RT_NODE_125_INVALID_IDX)
+ return false;
+
+ idx = BM_IDX(slotpos);
+ bitnum = BM_BIT(slotpos);
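+
+ /* Mark the slot free and invalidate the chunk's slot index */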
+ n125->base.isset[idx] &= ~((bitmapword) 1 << bitnum);
+ n125->base.slot_idxs[chunk] = RT_NODE_125_INVALID_IDX;
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk))
+#else
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, chunk))
+#endif
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_NODE_LEAF_256_DELETE(n256, chunk);
+#else
+ RT_NODE_INNER_256_DELETE(n256, chunk);
+#endif
+ break;
+ }
+ }
+
+ /* update statistics */
+ node->count--;
+
+ return true;
+
+#undef RT_NODE4_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_insert_impl.h b/src/include/lib/radixtree_insert_impl.h
new file mode 100644
index 0000000000..e4faf54d9d
--- /dev/null
+++ b/src/include/lib/radixtree_insert_impl.h
@@ -0,0 +1,316 @@
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE4_TYPE RT_NODE_INNER_4
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE4_TYPE RT_NODE_LEAF_4
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool chunk_exists = false;
+ RT_PTR_LOCAL newnode = NULL;
+ RT_PTR_ALLOC allocnode;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ const bool inner = false;
+ Assert(NODE_IS_LEAF(node));
+#else
+ const bool inner = true;
+ Assert(!NODE_IS_LEAF(node));
+#endif
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ RT_NODE4_TYPE *n4 = (RT_NODE4_TYPE *) node;
+ int idx;
+
+ idx = RT_NODE_4_SEARCH_EQ(&n4->base, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+#ifdef RT_NODE_LEVEL_LEAF
+ n4->values[idx] = value;
+#else
+ n4->children[idx] = child;
+#endif
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n4)))
+ {
+ RT_NODE32_TYPE *new32;
+ const uint8 new_kind = RT_NODE_KIND_32;
+ const RT_SIZE_CLASS new_class = RT_KIND_MIN_SIZE_CLASS[new_kind];
+
+ /* grow node from 4 to 32 */
+ allocnode = RT_ALLOC_NODE(tree, new_class, inner);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(newnode, new_kind, new_class, inner);
+ RT_COPY_NODE(newnode, node);
+ //newnode = RT_GROW_NODE_KIND(tree, node, RT_NODE_KIND_32);
+ new32 = (RT_NODE32_TYPE *) newnode;
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_COPY(n4->base.chunks, n4->values,
+ new32->base.chunks, new32->values);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_COPY(n4->base.chunks, n4->children,
+ new32->base.chunks, new32->children);
+#endif
+ Assert(parent != NULL);
+ RT_REPLACE_NODE(tree, parent, nodep, allocnode, key);
+ node = newnode;
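+
+ /* fall through to the kind 32 case below to insert the new chunk into the grown node */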
+ }
+ else
+ {
+ int insertpos = RT_NODE_4_GET_INSERTPOS(&n4->base, chunk);
+ int count = n4->base.n.count;
+
+ /* shift chunks and children */
+ if (insertpos < count)
+ {
+ Assert(count > 0);
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_SHIFT(n4->base.chunks, n4->values,
+ count, insertpos);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_SHIFT(n4->base.chunks, n4->children,
+ count, insertpos);
+#endif
+ }
+
+ n4->base.chunks[insertpos] = chunk;
+#ifdef RT_NODE_LEVEL_LEAF
+ n4->values[insertpos] = value;
+#else
+ n4->children[insertpos] = child;
+#endif
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_32:
+ {
+ const RT_SIZE_CLASS_ELEM class32_min = RT_SIZE_CLASS_INFO[RT_CLASS_32_PARTIAL];
+ const RT_SIZE_CLASS_ELEM class32_max = RT_SIZE_CLASS_INFO[RT_CLASS_32_FULL];
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
+ int idx;
+
+ idx = RT_NODE_32_SEARCH_EQ(&n32->base, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+#ifdef RT_NODE_LEVEL_LEAF
+ n32->values[idx] = value;
+#else
+ n32->children[idx] = child;
+#endif
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)) &&
+ n32->base.n.fanout == class32_min.fanout)
+ {
+ /* grow to the next size class of this kind */
+ const RT_SIZE_CLASS new_class = RT_CLASS_32_FULL;
+
+ allocnode = RT_ALLOC_NODE(tree, new_class, inner);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+#ifdef RT_NODE_LEVEL_LEAF
+ memcpy(newnode, node, class32_min.leaf_size);
+#else
+ memcpy(newnode, node, class32_min.inner_size);
+#endif
+ newnode->fanout = class32_max.fanout;
+
+ Assert(parent != NULL);
+ RT_REPLACE_NODE(tree, parent, nodep, allocnode, key);
+ node = newnode;
+
+ /* also update pointer for this kind */
+ n32 = (RT_NODE32_TYPE *) newnode;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)))
+ {
+ RT_NODE125_TYPE *new125;
+ const uint8 new_kind = RT_NODE_KIND_125;
+ const RT_SIZE_CLASS new_class = RT_KIND_MIN_SIZE_CLASS[new_kind];
+
+ Assert(n32->base.n.fanout == class32_max.fanout);
+
+ /* grow node from 32 to 125 */
+ allocnode = RT_ALLOC_NODE(tree, new_class, inner);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(newnode, new_kind, new_class, inner);
+ RT_COPY_NODE(newnode, node);
+ //newnode = RT_GROW_NODE_KIND(tree, node, RT_NODE_KIND_125);
+ new125 = (RT_NODE125_TYPE *) newnode;
+
+ for (int i = 0; i < class32_max.fanout; i++)
+ {
+ new125->base.slot_idxs[n32->base.chunks[i]] = i;
+#ifdef RT_NODE_LEVEL_LEAF
+ new125->values[i] = n32->values[i];
+#else
+ new125->children[i] = n32->children[i];
+#endif
+ }
+
+ Assert(class32_max.fanout <= sizeof(bitmapword) * BITS_PER_BYTE);
+ new125->base.isset[0] = (bitmapword) (((uint64) 1 << class32_max.fanout) - 1);
+
+ Assert(parent != NULL);
+ RT_REPLACE_NODE(tree, parent, nodep, allocnode, key);
+ node = newnode;
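+
+ /* fall through to the kind 125 case below to insert the new chunk into the grown node */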
+ }
+ else
+ {
+ int insertpos = RT_NODE_32_GET_INSERTPOS(&n32->base, chunk);
+ int count = n32->base.n.count;
+
+ if (insertpos < count)
+ {
+ Assert(count > 0);
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_SHIFT(n32->base.chunks, n32->values,
+ count, insertpos);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_SHIFT(n32->base.chunks, n32->children,
+ count, insertpos);
+#endif
+ }
+
+ n32->base.chunks[insertpos] = chunk;
+#ifdef RT_NODE_LEVEL_LEAF
+ n32->values[insertpos] = value;
+#else
+ n32->children[insertpos] = child;
+#endif
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+ int slotpos = n125->base.slot_idxs[chunk];
+ int cnt = 0;
+
+ if (slotpos != RT_NODE_125_INVALID_IDX)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+#ifdef RT_NODE_LEVEL_LEAF
+ n125->values[slotpos] = value;
+#else
+ n125->children[slotpos] = child;
+#endif
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n125)))
+ {
+ RT_NODE256_TYPE *new256;
+ const uint8 new_kind = RT_NODE_KIND_256;
+ const RT_SIZE_CLASS new_class = RT_KIND_MIN_SIZE_CLASS[new_kind];
+
+ /* grow node from 125 to 256 */
+ allocnode = RT_ALLOC_NODE(tree, new_class, inner);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(newnode, new_kind, new_class, inner);
+ RT_COPY_NODE(newnode, node);
+ //newnode = RT_GROW_NODE_KIND(tree, node, RT_NODE_KIND_256);
+ new256 = (RT_NODE256_TYPE *) newnode;
+ for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n125->base.n.count; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(&n125->base, i))
+ continue;
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_NODE_LEAF_256_SET(new256, i, RT_NODE_LEAF_125_GET_VALUE(n125, i));
+#else
+ RT_NODE_INNER_256_SET(new256, i, RT_NODE_INNER_125_GET_CHILD(n125, i));
+#endif
+ cnt++;
+ }
+
+ Assert(parent != NULL);
+ RT_REPLACE_NODE(tree, parent, nodep, allocnode, key);
+ node = newnode;
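+
+ /* fall through to the kind 256 case below to insert the new chunk into the grown node */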
+ }
+ else
+ {
+ int idx;
+ bitmapword inverse;
+
+ /* get the first word with at least one bit not set */
+ for (idx = 0; idx < BM_IDX(128); idx++)
+ {
+ if (n125->base.isset[idx] < ~((bitmapword) 0))
+ break;
+ }
+
+ /* To get the first unset bit in X, get the first set bit in ~X */
+ inverse = ~(n125->base.isset[idx]);
+ slotpos = idx * BITS_PER_BITMAPWORD;
+ slotpos += bmw_rightmost_one_pos(inverse);
+ Assert(slotpos < node->fanout);
+
+ /* mark the slot used */
+ n125->base.isset[idx] |= bmw_rightmost_one(inverse);
+ n125->base.slot_idxs[chunk] = slotpos;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ n125->values[slotpos] = value;
+#else
+ n125->children[slotpos] = child;
+#endif
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ chunk_exists = RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk);
+#else
+ chunk_exists = RT_NODE_INNER_256_IS_CHUNK_USED(n256, chunk);
+#endif
+ Assert(chunk_exists || FIXED_NODE_HAS_FREE_SLOT(n256, RT_CLASS_256));
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_NODE_LEAF_256_SET(n256, chunk, value);
+#else
+ RT_NODE_INNER_256_SET(n256, chunk, child);
+#endif
+ break;
+ }
+ }
+
+ /* Update statistics */
+ if (!chunk_exists)
+ node->count++;
+
+ /*
+ * Done. Finally, verify that the chunk and value are inserted or replaced
+ * properly in the node.
+ */
+ RT_VERIFY_NODE(node);
+
+ return chunk_exists;
+
+#undef RT_NODE4_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_iter_impl.h b/src/include/lib/radixtree_iter_impl.h
new file mode 100644
index 0000000000..0b8b68df6c
--- /dev/null
+++ b/src/include/lib/radixtree_iter_impl.h
@@ -0,0 +1,138 @@
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE4_TYPE RT_NODE_INNER_4
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE4_TYPE RT_NODE_LEAF_4
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ bool found = false;
+ uint8 key_chunk;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ uint64 value;
+
+ Assert(NODE_IS_LEAF(node_iter->node));
+#else
+ RT_PTR_LOCAL child = NULL;
+
+ Assert(!NODE_IS_LEAF(node_iter->node));
+#endif
+
+#ifdef RT_SHMEM
+ Assert(iter->tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+
+ switch (node_iter->node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ RT_NODE4_TYPE *n4 = (RT_NODE4_TYPE *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n4->base.n.count)
+ break;
+#ifdef RT_NODE_LEVEL_LEAF
+ value = n4->values[node_iter->current_idx];
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, n4->children[node_iter->current_idx]);
+#endif
+ key_chunk = n4->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n32->base.n.count)
+ break;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ value = n32->values[node_iter->current_idx];
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, n32->children[node_iter->current_idx]);
+#endif
+ key_chunk = n32->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node_iter->node;
+ int i;
+
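+ /* Find the next chunk that has an assigned slot */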
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (RT_NODE_125_IS_CHUNK_USED((RT_NODE_BASE_125 *) n125, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+#ifdef RT_NODE_LEVEL_LEAF
+ value = RT_NODE_LEAF_125_GET_VALUE(n125, i);
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, RT_NODE_INNER_125_GET_CHILD(n125, i));
+#endif
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+#ifdef RT_NODE_LEVEL_LEAF
+ if (RT_NODE_LEAF_256_IS_CHUNK_USED(n256, i))
+#else
+ if (RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
+#endif
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+#ifdef RT_NODE_LEVEL_LEAF
+ value = RT_NODE_LEAF_256_GET_VALUE(n256, i);
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, RT_NODE_INNER_256_GET_CHILD(n256, i));
+#endif
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ }
+
+ if (found)
+ {
+ RT_ITER_UPDATE_KEY(iter, key_chunk, node_iter->node->shift);
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = value;
+#endif
+ }
+
+#ifdef RT_NODE_LEVEL_LEAF
+ return found;
+#else
+ return child;
+#endif
+
+#undef RT_NODE4_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_search_impl.h b/src/include/lib/radixtree_search_impl.h
new file mode 100644
index 0000000000..31e4978e4f
--- /dev/null
+++ b/src/include/lib/radixtree_search_impl.h
@@ -0,0 +1,131 @@
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE4_TYPE RT_NODE_INNER_4
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE4_TYPE RT_NODE_LEAF_4
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
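+/*
+ * When RT_ACTION_UPDATE is defined, this template replaces the child pointer
+ * for the chunk instead of returning it to the caller.
+ */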
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+
+#ifdef RT_NODE_LEVEL_LEAF
+ uint64 value = 0;
+
+ Assert(NODE_IS_LEAF(node));
+#else
+#ifndef RT_ACTION_UPDATE
+ RT_PTR_ALLOC child = RT_INVALID_PTR_ALLOC;
+#endif
+ Assert(!NODE_IS_LEAF(node));
+#endif
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ RT_NODE4_TYPE *n4 = (RT_NODE4_TYPE *) node;
+ int idx = RT_NODE_4_SEARCH_EQ((RT_NODE_BASE_4 *) n4, chunk);
+
+#ifdef RT_ACTION_UPDATE
+ Assert(idx >= 0);
+ n4->children[idx] = new_child;
+#else
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ value = n4->values[idx];
+#else
+ child = n4->children[idx];
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
+ int idx = RT_NODE_32_SEARCH_EQ((RT_NODE_BASE_32 *) n32, chunk);
+
+#ifdef RT_ACTION_UPDATE
+ Assert(idx >= 0);
+ n32->children[idx] = new_child;
+#else
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ value = n32->values[idx];
+#else
+ child = n32->children[idx];
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+ int slotpos = n125->base.slot_idxs[chunk];
+
+#ifdef RT_ACTION_UPDATE
+ Assert(slotpos != RT_NODE_125_INVALID_IDX);
+ n125->children[slotpos] = new_child;
+#else
+ if (slotpos == RT_NODE_125_INVALID_IDX)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ value = RT_NODE_LEAF_125_GET_VALUE(n125, chunk);
+#else
+ child = RT_NODE_INNER_125_GET_CHILD(n125, chunk);
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
+
+#ifdef RT_ACTION_UPDATE
+ RT_NODE_INNER_256_SET(n256, chunk, new_child);
+#else
+#ifdef RT_NODE_LEVEL_LEAF
+ if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk))
+#else
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, chunk))
+#endif
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ value = RT_NODE_LEAF_256_GET_VALUE(n256, chunk);
+#else
+ child = RT_NODE_INNER_256_GET_CHILD(n256, chunk);
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ }
+
+#ifdef RT_ACTION_UPDATE
+ return;
+#else
+#ifdef RT_NODE_LEVEL_LEAF
+ Assert(value_p != NULL);
+ *value_p = value;
+#else
+ Assert(child_p != NULL);
+ *child_p = child;
+#endif
+
+ return true;
+#endif /* RT_ACTION_UPDATE */
+
+#undef RT_NODE4_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/utils/dsa.h b/src/include/utils/dsa.h
index 104386e674..c67f936880 100644
--- a/src/include/utils/dsa.h
+++ b/src/include/utils/dsa.h
@@ -117,6 +117,7 @@ extern dsa_handle dsa_get_handle(dsa_area *area);
extern dsa_pointer dsa_allocate_extended(dsa_area *area, size_t size, int flags);
extern void dsa_free(dsa_area *area, dsa_pointer dp);
extern void *dsa_get_address(dsa_area *area, dsa_pointer dp);
+extern size_t dsa_get_total_size(dsa_area *area);
extern void dsa_trim(dsa_area *area);
extern void dsa_dump(dsa_area *area);
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index c629cbe383..9659eb85d7 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -28,6 +28,7 @@ SUBDIRS = \
test_pg_db_role_setting \
test_pg_dump \
test_predtest \
+ test_radixtree \
test_rbtree \
test_regex \
test_rls_hooks \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 1baa6b558d..232cbdac80 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -24,6 +24,7 @@ subdir('test_parser')
subdir('test_pg_db_role_setting')
subdir('test_pg_dump')
subdir('test_predtest')
+subdir('test_radixtree')
subdir('test_rbtree')
subdir('test_regex')
subdir('test_rls_hooks')
diff --git a/src/test/modules/test_radixtree/.gitignore b/src/test/modules/test_radixtree/.gitignore
new file mode 100644
index 0000000000..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/test_radixtree/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_radixtree/Makefile b/src/test/modules/test_radixtree/Makefile
new file mode 100644
index 0000000000..da06b93da3
--- /dev/null
+++ b/src/test/modules/test_radixtree/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_radixtree/Makefile
+
+MODULE_big = test_radixtree
+OBJS = \
+ $(WIN32RES) \
+ test_radixtree.o
+PGFILEDESC = "test_radixtree - test code for src/include/lib/radixtree.h"
+
+EXTENSION = test_radixtree
+DATA = test_radixtree--1.0.sql
+
+REGRESS = test_radixtree
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_radixtree
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_radixtree/README b/src/test/modules/test_radixtree/README
new file mode 100644
index 0000000000..a8b271869a
--- /dev/null
+++ b/src/test/modules/test_radixtree/README
@@ -0,0 +1,7 @@
+test_radixtree contains unit tests for testing the radix tree implementation
+in src/include/lib/radixtree.h.
+
+The tests verify the correctness of the implementation, but they can also be
+used as a micro-benchmark. If you set the 'rt_test_stats' flag in
+test_radixtree.c, the tests will print extra information about execution time
+and memory usage.
diff --git a/src/test/modules/test_radixtree/expected/test_radixtree.out b/src/test/modules/test_radixtree/expected/test_radixtree.out
new file mode 100644
index 0000000000..ce645cb8b5
--- /dev/null
+++ b/src/test/modules/test_radixtree/expected/test_radixtree.out
@@ -0,0 +1,36 @@
+CREATE EXTENSION test_radixtree;
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
+NOTICE: testing basic operations with leaf node 4
+NOTICE: testing basic operations with inner node 4
+NOTICE: testing basic operations with leaf node 32
+NOTICE: testing basic operations with inner node 32
+NOTICE: testing basic operations with leaf node 125
+NOTICE: testing basic operations with inner node 125
+NOTICE: testing basic operations with leaf node 256
+NOTICE: testing basic operations with inner node 256
+NOTICE: testing radix tree node types with shift "0"
+NOTICE: testing radix tree node types with shift "8"
+NOTICE: testing radix tree node types with shift "16"
+NOTICE: testing radix tree node types with shift "24"
+NOTICE: testing radix tree node types with shift "32"
+NOTICE: testing radix tree node types with shift "40"
+NOTICE: testing radix tree node types with shift "48"
+NOTICE: testing radix tree node types with shift "56"
+NOTICE: testing radix tree with pattern "all ones"
+NOTICE: testing radix tree with pattern "alternating bits"
+NOTICE: testing radix tree with pattern "clusters of ten"
+NOTICE: testing radix tree with pattern "clusters of hundred"
+NOTICE: testing radix tree with pattern "one-every-64k"
+NOTICE: testing radix tree with pattern "sparse"
+NOTICE: testing radix tree with pattern "single values, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^60"
+ test_radixtree
+----------------
+
+(1 row)
+
diff --git a/src/test/modules/test_radixtree/meson.build b/src/test/modules/test_radixtree/meson.build
new file mode 100644
index 0000000000..6add06bbdb
--- /dev/null
+++ b/src/test/modules/test_radixtree/meson.build
@@ -0,0 +1,35 @@
+# FIXME: prevent install during main install, but not during test :/
+
+test_radixtree_sources = files(
+ 'test_radixtree.c',
+)
+
+if host_system == 'windows'
+ test_radixtree_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'test_radixtree',
+ '--FILEDESC', 'test_radixtree - test code for src/include/lib/radixtree.h',])
+endif
+
+test_radixtree = shared_module('test_radixtree',
+ test_radixtree_sources,
+ link_with: pgport_srv,
+ kwargs: pg_mod_args,
+)
+testprep_targets += test_radixtree
+
+install_data(
+ 'test_radixtree.control',
+ 'test_radixtree--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'test_radixtree',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'test_radixtree',
+ ],
+ },
+}
diff --git a/src/test/modules/test_radixtree/sql/test_radixtree.sql b/src/test/modules/test_radixtree/sql/test_radixtree.sql
new file mode 100644
index 0000000000..41ece5e9f5
--- /dev/null
+++ b/src/test/modules/test_radixtree/sql/test_radixtree.sql
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_radixtree;
+
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
diff --git a/src/test/modules/test_radixtree/test_radixtree--1.0.sql b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
new file mode 100644
index 0000000000..074a5a7ea7
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_radixtree/test_radixtree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_radixtree" to load this file. \quit
+
+CREATE FUNCTION test_radixtree()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
new file mode 100644
index 0000000000..d8323f587f
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -0,0 +1,653 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_radixtree.c
+ * Test radixtree set data structure.
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/test/modules/test_radixtree/test_radixtree.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "miscadmin.h"
+#include "nodes/bitmapset.h"
+#include "storage/block.h"
+#include "storage/itemptr.h"
+#include "storage/lwlock.h"
+#include "utils/memutils.h"
+#include "utils/timestamp.h"
+
+#define UINT64_HEX_FORMAT "%" INT64_MODIFIER "X"
+
+/*
+ * If you enable this, the "pattern" tests will print information about
+ * how long populating, probing, and iterating the test set takes, and
+ * how much memory the test set consumed. That can be used as
+ * micro-benchmark of various operations and input patterns (you might
+ * want to increase the number of values used in each of the test, if
+ * you do that, to reduce noise).
+ *
+ * The information is printed to the server's stderr, mostly because
+ * that's where MemoryContextStats() output goes.
+ */
+static const bool rt_test_stats = false;
+
+static int rt_node_kind_fanouts[] = {
+ 0,
+ 4, /* RT_NODE_KIND_4 */
+ 32, /* RT_NODE_KIND_32 */
+ 125, /* RT_NODE_KIND_125 */
+ 256 /* RT_NODE_KIND_256 */
+};
+/*
+ * A struct to define a pattern of integers, for use with the test_pattern()
+ * function.
+ */
+typedef struct
+{
+ char *test_name; /* short name of the test, for humans */
+ char *pattern_str; /* a bit pattern */
+ uint64 spacing; /* pattern repeats at this interval */
+ uint64 num_values; /* number of integers to set in total */
+} test_spec;
+
+/* Test patterns borrowed from test_integerset.c */
+static const test_spec test_specs[] = {
+ {
+ "all ones", "1111111111",
+ 10, 1000000
+ },
+ {
+ "alternating bits", "0101010101",
+ 10, 1000000
+ },
+ {
+ "clusters of ten", "1111111111",
+ 10000, 1000000
+ },
+ {
+ "clusters of hundred",
+ "1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111",
+ 10000, 1000000
+ },
+ {
+ "one-every-64k", "1",
+ 65536, 1000000
+ },
+ {
+ "sparse", "100000000000000000000000000000001",
+ 10000000, 1000000
+ },
+ {
+ "single values, distance > 2^32", "1",
+ UINT64CONST(10000000000), 100000
+ },
+ {
+ "clusters, distance > 2^32", "10101010",
+ UINT64CONST(10000000000), 1000000
+ },
+ {
+ "clusters, distance > 2^60", "10101010",
+ UINT64CONST(2000000000000000000),
+ 23 /* can't be much higher than this, or we
+ * overflow uint64 */
+ }
+};
+
+/* define the radix tree implementation to test */
+#define RT_PREFIX rt
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#define RT_USE_DELETE
+// WIP: compiles with warnings because rt_attach is defined but not used
+// #define RT_SHMEM
+#include "lib/radixtree.h"
+
+
+/*
+ * Return the number of keys in the radix tree.
+ */
+static uint64
+rt_num_entries(rt_radix_tree *tree)
+{
+ return tree->ctl->num_keys;
+}
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(test_radixtree);
+
+static void
+test_empty(void)
+{
+ rt_radix_tree *radixtree;
+ rt_iter *iter;
+ uint64 dummy;
+ uint64 key;
+ uint64 val;
+
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+ dsa = dsa_create(tranche_id);
+
+ radixtree = rt_create(CurrentMemoryContext, dsa);
+#else
+ radixtree = rt_create(CurrentMemoryContext);
+#endif
+
+ if (rt_search(radixtree, 0, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, 1, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, PG_UINT64_MAX, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_delete(radixtree, 0))
+ elog(ERROR, "rt_delete on empty tree returned true");
+
+ if (rt_num_entries(radixtree) != 0)
+ elog(ERROR, "rt_num_entries on empty tree return non-zero");
+
+ iter = rt_begin_iterate(radixtree);
+
+ if (rt_iterate_next(iter, &key, &val))
+ elog(ERROR, "rt_itereate_next on empty tree returned true");
+
+ rt_end_iterate(iter);
+
+ rt_free(radixtree);
+}
+
+static void
+test_basic(int children, bool test_inner)
+{
+ rt_radix_tree *radixtree;
+ uint64 *keys;
+ int shift = test_inner ? 8 : 0;
+
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+ dsa = dsa_create(tranche_id);
+#endif
+
+ elog(NOTICE, "testing basic operations with %s node %d",
+ test_inner ? "inner" : "leaf", children);
+
+#ifdef RT_SHMEM
+ radixtree = rt_create(CurrentMemoryContext, dsa);
+#else
+ radixtree = rt_create(CurrentMemoryContext);
+#endif
+
+ /* prepare keys in an interleaved order like 1, 32, 2, 31, 3, 30, ... */
+ keys = palloc(sizeof(uint64) * children);
+ for (int i = 0; i < children; i++)
+ {
+ if (i % 2 == 0)
+ keys[i] = (uint64) ((i / 2) + 1) << shift;
+ else
+ keys[i] = (uint64) (children - (i / 2)) << shift;
+ }
+
+ /* insert keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (rt_set(radixtree, keys[i], keys[i]))
+ elog(ERROR, "new inserted key 0x" UINT64_HEX_FORMAT " is found ", keys[i]);
+ }
+
+ /* look up keys */
+ for (int i = 0; i < children; i++)
+ {
+ uint64 value;
+
+ if (!rt_search(radixtree, keys[i], &value))
+ elog(ERROR, "could not find key 0x" UINT64_HEX_FORMAT, keys[i]);
+ if (value != keys[i])
+ elog(ERROR, "rt_search returned 0x" UINT64_HEX_FORMAT ", expected " UINT64_HEX_FORMAT,
+ value, keys[i]);
+ }
+
+ /* update keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (!rt_set(radixtree, keys[i], keys[i] + 1))
+ elog(ERROR, "could not update key 0x" UINT64_HEX_FORMAT, keys[i]);
+ }
+
+ /* repeat deleting and inserting keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (!rt_delete(radixtree, keys[i]))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, keys[i]);
+ if (rt_set(radixtree, keys[i], keys[i]))
+ elog(ERROR, "new inserted key 0x" UINT64_HEX_FORMAT " is found ", keys[i]);
+ }
+
+ pfree(keys);
+ rt_free(radixtree);
+}
+
+/*
+ * Check that keys from start to end, shifted by 'shift', exist in the tree.
+ */
+static void
+check_search_on_node(rt_radix_tree *radixtree, uint8 shift, int start, int end,
+ int incr)
+{
+ for (int i = start; i < end; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ uint64 val;
+
+ if (!rt_search(radixtree, key, &val))
+ elog(ERROR, "key 0x" UINT64_HEX_FORMAT " is not found on node-%d",
+ key, end);
+ if (val != key)
+ elog(ERROR, "rt_search with key 0x" UINT64_HEX_FORMAT " returns 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ key, val, key);
+ }
+}
+
+static void
+test_node_types_insert(rt_radix_tree *radixtree, uint8 shift, bool insert_asc)
+{
+ uint64 num_entries;
+ int ninserted = 0;
+ int start = insert_asc ? 0 : 256;
+ int incr = insert_asc ? 1 : -1;
+ int end = insert_asc ? 256 : 0;
+ int node_kind_idx = 1;
+
+ for (int i = start; i != end; i += incr)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_set(radixtree, key, key);
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " is found", key);
+
+ /*
+ * After filling all slots in each node type, check if the values
+ * are stored properly.
+ */
+ if (ninserted == rt_node_kind_fanouts[node_kind_idx] - 1)
+ {
+ int check_start = insert_asc
+ ? rt_node_kind_fanouts[node_kind_idx - 1]
+ : rt_node_kind_fanouts[node_kind_idx];
+ int check_end = insert_asc
+ ? rt_node_kind_fanouts[node_kind_idx]
+ : rt_node_kind_fanouts[node_kind_idx - 1];
+
+ check_search_on_node(radixtree, shift, check_start, check_end, incr);
+ node_kind_idx++;
+ }
+
+ ninserted++;
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ if (num_entries != 256)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(256));
+}
+
+static void
+test_node_types_delete(rt_radix_tree *radixtree, uint8 shift)
+{
+ uint64 num_entries;
+
+ for (int i = 0; i < 256; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_delete(radixtree, key);
+
+ if (!found)
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, key);
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ /* The tree must be empty */
+ if (num_entries != 0)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(256));
+}
+
+/*
+ * Test inserting and deleting key-value pairs into each node type at the
+ * given shift level.
+ */
+static void
+test_node_types(uint8 shift)
+{
+ rt_radix_tree *radixtree;
+
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+ dsa = dsa_create(tranche_id);
+#endif
+
+ elog(NOTICE, "testing radix tree node types with shift \"%d\"", shift);
+
+#ifdef RT_SHMEM
+ radixtree = rt_create(CurrentMemoryContext, dsa);
+#else
+ radixtree = rt_create(CurrentMemoryContext);
+#endif
+
+ /*
+ * Insert and search entries for every node type at the 'shift' level,
+ * then delete all entries to make it empty, and insert and search entries
+ * again.
+ */
+ test_node_types_insert(radixtree, shift, true);
+ test_node_types_delete(radixtree, shift);
+ test_node_types_insert(radixtree, shift, false);
+
+ rt_free(radixtree);
+}
+
+/*
+ * Test with a repeating pattern, defined by the 'spec'.
+ */
+static void
+test_pattern(const test_spec * spec)
+{
+ rt_radix_tree *radixtree;
+ rt_iter *iter;
+ MemoryContext radixtree_ctx;
+ TimestampTz starttime;
+ TimestampTz endtime;
+ uint64 n;
+ uint64 last_int;
+ uint64 ndeleted;
+ uint64 nbefore;
+ uint64 nafter;
+ int patternlen;
+ uint64 *pattern_values;
+ uint64 pattern_num_values;
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+ dsa = dsa_create(tranche_id);
+#endif
+
+ elog(NOTICE, "testing radix tree with pattern \"%s\"", spec->test_name);
+ if (rt_test_stats)
+ fprintf(stderr, "-----\ntesting radix tree with pattern \"%s\"\n", spec->test_name);
+
+ /* Pre-process the pattern, creating an array of integers from it. */
+ patternlen = strlen(spec->pattern_str);
+ pattern_values = palloc(patternlen * sizeof(uint64));
+ pattern_num_values = 0;
+ for (int i = 0; i < patternlen; i++)
+ {
+ if (spec->pattern_str[i] == '1')
+ pattern_values[pattern_num_values++] = i;
+ }
+
+ /*
+ * Allocate the radix tree.
+ *
+ * Allocate it in a separate memory context, so that we can print its
+ * memory usage easily.
+ */
+ radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
+ "radixtree test",
+ ALLOCSET_SMALL_SIZES);
+ MemoryContextSetIdentifier(radixtree_ctx, spec->test_name);
+
+#ifdef RT_SHMEM
+ radixtree = rt_create(radixtree_ctx, dsa);
+#else
+ radixtree = rt_create(radixtree_ctx);
+#endif
+
+
+ /*
+ * Add values to the set.
+ */
+ starttime = GetCurrentTimestamp();
+
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ uint64 x = 0;
+
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ bool found;
+
+ x = last_int + pattern_values[i];
+
+ found = rt_set(radixtree, x, x);
+
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " found", x);
+
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+
+ endtime = GetCurrentTimestamp();
+
+ if (rt_test_stats)
+ fprintf(stderr, "added " UINT64_FORMAT " values in %d ms\n",
+ spec->num_values, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Print stats on the amount of memory used.
+ *
+ * We print the usage reported by rt_memory_usage(), as well as the stats
+ * from the memory context. They should be in the same ballpark, but it's
+ * hard to automate testing that, so if you're making changes to the
+ * implementation, just observe that manually.
+ */
+ if (rt_test_stats)
+ {
+ uint64 mem_usage;
+
+ /*
+ * Also print memory usage as reported by rt_memory_usage(). It
+ * should be in the same ballpark as the usage reported by
+ * MemoryContextStats().
+ */
+ mem_usage = rt_memory_usage(radixtree);
+ fprintf(stderr, "rt_memory_usage() reported " UINT64_FORMAT " (%0.2f bytes / integer)\n",
+ mem_usage, (double) mem_usage / spec->num_values);
+
+ MemoryContextStats(radixtree_ctx);
+ }
+
+ /* Check that rt_num_entries works */
+ n = rt_num_entries(radixtree);
+ if (n != spec->num_values)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT, n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_search()
+ */
+ starttime = GetCurrentTimestamp();
+
+ for (n = 0; n < 100000; n++)
+ {
+ bool found;
+ bool expected;
+ uint64 x;
+ uint64 v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Do we expect this value to be present in the set? */
+ if (x >= last_int)
+ expected = false;
+ else
+ {
+ uint64 idx = x % spec->spacing;
+
+ if (idx >= patternlen)
+ expected = false;
+ else if (spec->pattern_str[idx] == '1')
+ expected = true;
+ else
+ expected = false;
+ }
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (found != expected)
+ elog(ERROR, "mismatch at 0x" UINT64_HEX_FORMAT ": %d vs %d", x, found, expected);
+ if (found && (v != x))
+ elog(ERROR, "found 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ v, x);
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "probed " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Test iterator
+ */
+ starttime = GetCurrentTimestamp();
+
+ iter = rt_begin_iterate(radixtree);
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ uint64 expected = last_int + pattern_values[i];
+ uint64 x;
+ uint64 val;
+
+ if (!rt_iterate_next(iter, &x, &val))
+ break;
+
+ if (x != expected)
+ elog(ERROR,
+ "iterate returned wrong key; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d",
+ x, expected, i);
+ if (val != expected)
+ elog(ERROR,
+ "iterate returned wrong value; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d", x, expected, i);
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "iterated " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ rt_end_iterate(iter);
+
+ if (n < spec->num_values)
+ elog(ERROR, "iterator stopped short after " UINT64_FORMAT " entries, expected " UINT64_FORMAT, n, spec->num_values);
+ if (n > spec->num_values)
+ elog(ERROR, "iterator returned " UINT64_FORMAT " entries, " UINT64_FORMAT " was expected", n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_delete()
+ */
+ starttime = GetCurrentTimestamp();
+
+ nbefore = rt_num_entries(radixtree);
+ ndeleted = 0;
+ for (n = 0; n < 1; n++)
+ {
+ bool found;
+ uint64 x;
+ uint64 v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (!found)
+ continue;
+
+ /* If the key is found, delete it and check again */
+ if (!rt_delete(radixtree, x))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_search(radixtree, x, &v))
+ elog(ERROR, "found deleted key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_delete(radixtree, x))
+ elog(ERROR, "deleted already-deleted key 0x" UINT64_HEX_FORMAT, x);
+
+ ndeleted++;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "deleted " UINT64_FORMAT " values in %d ms\n",
+ ndeleted, (int) (endtime - starttime) / 1000);
+
+ nafter = rt_num_entries(radixtree);
+
+ /* Check that rt_num_entries works */
+ if ((nbefore - ndeleted) != nafter)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT "after " UINT64_FORMAT " deletion",
+ nafter, (nbefore - ndeleted), ndeleted);
+
+ rt_free(radixtree);
+ MemoryContextDelete(radixtree_ctx);
+}
+
+Datum
+test_radixtree(PG_FUNCTION_ARGS)
+{
+ test_empty();
+
+ for (int i = 1; i < lengthof(rt_node_kind_fanouts); i++)
+ {
+ test_basic(rt_node_kind_fanouts[i], false);
+ test_basic(rt_node_kind_fanouts[i], true);
+ }
+
+ for (int shift = 0; shift <= (64 - 8); shift += 8)
+ test_node_types(shift);
+
+ /* Test different test patterns, with lots of entries */
+ for (int i = 0; i < lengthof(test_specs); i++)
+ test_pattern(&test_specs[i]);
+
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_radixtree/test_radixtree.control b/src/test/modules/test_radixtree/test_radixtree.control
new file mode 100644
index 0000000000..e53f2a3e0c
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.control
@@ -0,0 +1,4 @@
+comment = 'Test code for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/test_radixtree'
+relocatable = true
diff --git a/src/tools/pginclude/cpluspluscheck b/src/tools/pginclude/cpluspluscheck
index e52fe9f509..09fa6e7432 100755
--- a/src/tools/pginclude/cpluspluscheck
+++ b/src/tools/pginclude/cpluspluscheck
@@ -101,6 +101,12 @@ do
test "$f" = src/include/nodes/nodetags.h && continue
test "$f" = src/backend/nodes/nodetags.h && continue
+ # radixtree_*_impl.h cannot be included standalone: they are just code fragments.
+ test "$f" = src/include/lib/radixtree_delete_impl.h && continue
+ test "$f" = src/include/lib/radixtree_insert_impl.h && continue
+ test "$f" = src/include/lib/radixtree_iter_impl.h && continue
+ test "$f" = src/include/lib/radixtree_search_impl.h && continue
+
# These files are not meant to be included standalone, because
# they contain lists that might have multiple use-cases.
test "$f" = src/include/access/rmgrlist.h && continue
diff --git a/src/tools/pginclude/headerscheck b/src/tools/pginclude/headerscheck
index abbba7aa63..d4d2f1da03 100755
--- a/src/tools/pginclude/headerscheck
+++ b/src/tools/pginclude/headerscheck
@@ -96,6 +96,12 @@ do
test "$f" = src/include/nodes/nodetags.h && continue
test "$f" = src/backend/nodes/nodetags.h && continue
+ # radixtree_*_impl.h cannot be included standalone: they are just code fragments.
+ test "$f" = src/include/lib/radixtree_delete_impl.h && continue
+ test "$f" = src/include/lib/radixtree_insert_impl.h && continue
+ test "$f" = src/include/lib/radixtree_iter_impl.h && continue
+ test "$f" = src/include/lib/radixtree_search_impl.h && continue
+
# These files are not meant to be included standalone, because
# they contain lists that might have multiple use-cases.
test "$f" = src/include/access/rmgrlist.h && continue
--
2.39.0
Attachment: v21-0006-Free-all-radix-tree-nodes-recursively.patch
From 6dca6018bd9ffbb6f00e26b01dfc80377a910440 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 20 Jan 2023 12:38:54 +0700
Subject: [PATCH v21 06/22] Free all radix tree nodes recursively
TODO: Consider adding more general functionality to DSA
to free all segments.
---
src/include/lib/radixtree.h | 78 +++++++++++++++++++++++++++++++++++++
1 file changed, 78 insertions(+)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index c08016de3a..98e4597eac 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -127,6 +127,7 @@
#define RT_ALLOC_NODE RT_MAKE_NAME(alloc_node)
#define RT_INIT_NODE RT_MAKE_NAME(init_node)
#define RT_FREE_NODE RT_MAKE_NAME(free_node)
+#define RT_FREE_RECURSE RT_MAKE_NAME(free_recurse)
#define RT_EXTEND RT_MAKE_NAME(extend)
#define RT_SET_EXTEND RT_MAKE_NAME(set_extend)
#define RT_SWITCH_NODE_KIND RT_MAKE_NAME(grow_node_kind)
@@ -1410,6 +1411,78 @@ RT_GET_HANDLE(RT_RADIX_TREE *tree)
Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
return tree->ctl->handle;
}
+
+/*
+ * Recursively free all nodes allocated to the DSA area.
+ */
+static inline void
+RT_FREE_RECURSE(RT_RADIX_TREE *tree, RT_PTR_ALLOC ptr)
+{
+ RT_PTR_LOCAL node = RT_PTR_GET_LOCAL(tree, ptr);
+
+ check_stack_depth();
+ CHECK_FOR_INTERRUPTS();
+
+ /* The leaf node doesn't have child pointers */
+ if (NODE_IS_LEAF(node))
+ {
+ dsa_free(tree->dsa, ptr);
+ return;
+ }
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ RT_NODE_INNER_4 *n4 = (RT_NODE_INNER_4 *) node;
+
+ for (int i = 0; i < n4->base.n.count; i++)
+ RT_FREE_RECURSE(tree, n4->children[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE_INNER_32 *n32 = (RT_NODE_INNER_32 *) node;
+
+ for (int i = 0; i < n32->base.n.count; i++)
+ RT_FREE_RECURSE(tree, n32->children[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE_INNER_125 *n125 = (RT_NODE_INNER_125 *) node;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(&n125->base, i))
+ continue;
+
+ RT_FREE_RECURSE(tree, RT_NODE_INNER_125_GET_CHILD(n125, i));
+ }
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE_INNER_256 *n256 = (RT_NODE_INNER_256 *) node;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
+ continue;
+
+ RT_FREE_RECURSE(tree, RT_NODE_INNER_256_GET_CHILD(n256, i));
+ }
+
+ break;
+ }
+ }
+
+ /* Free the inner node */
+ dsa_free(tree->dsa, ptr);
+}
#endif
/*
@@ -1421,6 +1494,10 @@ RT_FREE(RT_RADIX_TREE *tree)
#ifdef RT_SHMEM
Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+ /* Free all memory used for radix tree nodes */
+ if (RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ RT_FREE_RECURSE(tree, tree->ctl->root);
+
/*
* Vandalize the control block to help catch programming error where
* other backends access the memory formerly occupied by this radix tree.
@@ -2199,6 +2276,7 @@ RT_DUMP(RT_RADIX_TREE *tree)
#undef RT_ALLOC_NODE
#undef RT_INIT_NODE
#undef RT_FREE_NODE
+#undef RT_FREE_RECURSE
#undef RT_EXTEND
#undef RT_SET_EXTEND
#undef RT_SWITCH_NODE_KIND
--
2.39.0
Attachment: v21-0009-Remove-hard-coded-128.patch
From 5b4fff91055335d5dcc22ae6eee26168cf889486 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Fri, 20 Jan 2023 15:51:21 +0700
Subject: [PATCH v21 09/22] Remove hard-coded 128
Also comment that 64 could be a valid number of bits
in the bitmap for this node type.
TODO: Consider whether we should in fact limit this
node to ~64.
In passing, remove "125" from invalid-slot-index macro.
---
src/include/lib/radixtree.h | 19 +++++++++++++------
src/include/lib/radixtree_delete_impl.h | 4 ++--
src/include/lib/radixtree_insert_impl.h | 4 ++--
src/include/lib/radixtree_search_impl.h | 4 ++--
4 files changed, 19 insertions(+), 12 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 172d62c6b0..d15ea8f0fe 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -270,8 +270,15 @@ RT_SCOPE void RT_STATS(RT_RADIX_TREE *tree);
/* Tree level the radix tree uses */
#define RT_MAX_LEVEL ((sizeof(uint64) * BITS_PER_BYTE) / RT_NODE_SPAN)
+/*
+ * Number of bits necessary for isset array in the slot-index node.
+ * Since bitmapword can be 64 bits, the only values that make sense
+ * here are 64 and 128.
+ */
+#define RT_SLOT_IDX_LIMIT (RT_NODE_MAX_SLOTS / 2)
+
/* Invalid index used in node-125 */
-#define RT_NODE_125_INVALID_IDX 0xFF
+#define RT_INVALID_SLOT_IDX 0xFF
/* Get a chunk from the key */
#define RT_GET_KEY_CHUNK(key, shift) ((uint8) (((key) >> (shift)) & RT_CHUNK_MASK))
@@ -409,7 +416,7 @@ typedef struct RT_NODE_BASE_125
uint8 slot_idxs[RT_NODE_MAX_SLOTS];
/* isset is a bitmap to track which slot is in use */
- bitmapword isset[BM_IDX(128)];
+ bitmapword isset[BM_IDX(RT_SLOT_IDX_LIMIT)];
} RT_NODE_BASE_125;
typedef struct RT_NODE_BASE_256
@@ -867,7 +874,7 @@ RT_CHUNK_VALUES_ARRAY_COPY(uint8 *src_chunks, RT_VALUE_TYPE *src_values,
static inline bool
RT_NODE_125_IS_CHUNK_USED(RT_NODE_BASE_125 *node, uint8 chunk)
{
- return node->slot_idxs[chunk] != RT_NODE_125_INVALID_IDX;
+ return node->slot_idxs[chunk] != RT_INVALID_SLOT_IDX;
}
static inline RT_PTR_ALLOC
@@ -881,7 +888,7 @@ static inline RT_VALUE_TYPE
RT_NODE_LEAF_125_GET_VALUE(RT_NODE_LEAF_125 *node, uint8 chunk)
{
Assert(NODE_IS_LEAF(node));
- Assert(((RT_NODE_BASE_125 *) node)->slot_idxs[chunk] != RT_NODE_125_INVALID_IDX);
+ Assert(((RT_NODE_BASE_125 *) node)->slot_idxs[chunk] != RT_INVALID_SLOT_IDX);
return node->values[node->base.slot_idxs[chunk]];
}
@@ -1037,7 +1044,7 @@ RT_INIT_NODE(RT_PTR_LOCAL node, uint8 kind, RT_SIZE_CLASS size_class, bool inner
{
RT_NODE_BASE_125 *n125 = (RT_NODE_BASE_125 *) node;
- memset(n125->slot_idxs, RT_NODE_125_INVALID_IDX, sizeof(n125->slot_idxs));
+ memset(n125->slot_idxs, RT_INVALID_SLOT_IDX, sizeof(n125->slot_idxs));
}
}
@@ -2052,7 +2059,7 @@ RT_DUMP_NODE(RT_PTR_LOCAL node, int level, bool recurse)
RT_NODE_LEAF_125 *n = (RT_NODE_LEAF_125 *) node;
fprintf(stderr, ", isset-bitmap:");
- for (int i = 0; i < BM_IDX(128); i++)
+ for (int i = 0; i < BM_IDX(RT_SLOT_IDX_LIMIT); i++)
{
fprintf(stderr, UINT64_FORMAT_HEX " ", (uint64) n->base.isset[i]);
}
diff --git a/src/include/lib/radixtree_delete_impl.h b/src/include/lib/radixtree_delete_impl.h
index 2612730481..2f1c172672 100644
--- a/src/include/lib/radixtree_delete_impl.h
+++ b/src/include/lib/radixtree_delete_impl.h
@@ -65,13 +65,13 @@
int idx;
int bitnum;
- if (slotpos == RT_NODE_125_INVALID_IDX)
+ if (slotpos == RT_INVALID_SLOT_IDX)
return false;
idx = BM_IDX(slotpos);
bitnum = BM_BIT(slotpos);
n125->base.isset[idx] &= ~((bitmapword) 1 << bitnum);
- n125->base.slot_idxs[chunk] = RT_NODE_125_INVALID_IDX;
+ n125->base.slot_idxs[chunk] = RT_INVALID_SLOT_IDX;
break;
}
diff --git a/src/include/lib/radixtree_insert_impl.h b/src/include/lib/radixtree_insert_impl.h
index e3e44669ea..90fe5f539e 100644
--- a/src/include/lib/radixtree_insert_impl.h
+++ b/src/include/lib/radixtree_insert_impl.h
@@ -201,7 +201,7 @@
int slotpos = n125->base.slot_idxs[chunk];
int cnt = 0;
- if (slotpos != RT_NODE_125_INVALID_IDX)
+ if (slotpos != RT_INVALID_SLOT_IDX)
{
/* found the existing chunk */
chunk_exists = true;
@@ -247,7 +247,7 @@
bitmapword inverse;
/* get the first word with at least one bit not set */
- for (idx = 0; idx < BM_IDX(128); idx++)
+ for (idx = 0; idx < BM_IDX(RT_SLOT_IDX_LIMIT); idx++)
{
if (n125->base.isset[idx] < ~((bitmapword) 0))
break;
diff --git a/src/include/lib/radixtree_search_impl.h b/src/include/lib/radixtree_search_impl.h
index 365abaa46d..d2bbdd2450 100644
--- a/src/include/lib/radixtree_search_impl.h
+++ b/src/include/lib/radixtree_search_impl.h
@@ -73,10 +73,10 @@
int slotpos = n125->base.slot_idxs[chunk];
#ifdef RT_ACTION_UPDATE
- Assert(slotpos != RT_NODE_125_INVALID_IDX);
+ Assert(slotpos != RT_INVALID_SLOT_IDX);
n125->children[slotpos] = new_child;
#else
- if (slotpos == RT_NODE_125_INVALID_IDX)
+ if (slotpos == RT_INVALID_SLOT_IDX)
return false;
#ifdef RT_NODE_LEVEL_LEAF
--
2.39.0
Attachment: v21-0008-Streamline-calculation-of-slab-blocksize.patch
From e02aa8c8c9d36f2d45f91c462e3f55f5f39428e6 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Fri, 20 Jan 2023 14:55:25 +0700
Subject: [PATCH v21 08/22] Streamline calculation of slab blocksize
To reduce duplication. This will likely lead to
division instructions, but a few cycles won't
matter at all when creating the tree.
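To make the rounding behaviour concrete, here is a small stand-alone sketch of
what the macro computes (SLAB_DEFAULT_BLOCK_SIZE is assumed to be the usual
8kB, and Max() is redefined locally so the snippet builds outside the tree):

#include <stdio.h>

#define Max(a, b) ((a) > (b) ? (a) : (b))
#define SLAB_DEFAULT_BLOCK_SIZE (8 * 1024)	/* assumption: 8kB, as in memutils.h */
#define RT_SLAB_BLOCK_SIZE(size) \
	Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * (size), (size) * 32)

int
main(void)
{
	/* small chunk: default block size rounded down to a multiple of the chunk size */
	printf("%d\n", RT_SLAB_BLOCK_SIZE(40));		/* 8160 */

	/* large chunk: the at-least-32-chunks minimum wins */
	printf("%d\n", RT_SLAB_BLOCK_SIZE(2088));	/* 66816 */

	return 0;
}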
---
src/include/lib/radixtree.h | 50 ++++++++++++++-----------------------
1 file changed, 19 insertions(+), 31 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 0a39bd6664..172d62c6b0 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -304,6 +304,13 @@ RT_SCOPE void RT_STATS(RT_RADIX_TREE *tree);
#define RT_NODE_KIND_256 0x03
#define RT_NODE_KIND_COUNT 4
+/*
+ * Calculate the slab blocksize so that we can allocate at least 32 chunks
+ * from the block.
+ */
+#define RT_SLAB_BLOCK_SIZE(size) \
+ Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * (size), (size) * 32)
+
#endif /* RT_COMMON */
@@ -503,59 +510,38 @@ typedef struct RT_SIZE_CLASS_ELEM
/* slab chunk size */
Size inner_size;
Size leaf_size;
-
- /* slab block size */
- Size inner_blocksize;
- Size leaf_blocksize;
} RT_SIZE_CLASS_ELEM;
-/*
- * Calculate the slab blocksize so that we can allocate at least 32 chunks
- * from the block.
- */
-#define NODE_SLAB_BLOCK_SIZE(size) \
- Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * (size), (size) * 32)
-
static const RT_SIZE_CLASS_ELEM RT_SIZE_CLASS_INFO[] = {
[RT_CLASS_4_FULL] = {
.name = "radix tree node 4",
.fanout = 4,
.inner_size = sizeof(RT_NODE_INNER_4) + 4 * sizeof(RT_PTR_ALLOC),
.leaf_size = sizeof(RT_NODE_LEAF_4) + 4 * sizeof(RT_VALUE_TYPE),
- .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_4) + 4 * sizeof(RT_PTR_ALLOC)),
- .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_4) + 4 * sizeof(RT_VALUE_TYPE)),
},
[RT_CLASS_32_PARTIAL] = {
.name = "radix tree node 15",
.fanout = 15,
.inner_size = sizeof(RT_NODE_INNER_32) + 15 * sizeof(RT_PTR_ALLOC),
.leaf_size = sizeof(RT_NODE_LEAF_32) + 15 * sizeof(RT_VALUE_TYPE),
- .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_32) + 15 * sizeof(RT_PTR_ALLOC)),
- .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_32) + 15 * sizeof(RT_VALUE_TYPE)),
},
[RT_CLASS_32_FULL] = {
.name = "radix tree node 32",
.fanout = 32,
.inner_size = sizeof(RT_NODE_INNER_32) + 32 * sizeof(RT_PTR_ALLOC),
.leaf_size = sizeof(RT_NODE_LEAF_32) + 32 * sizeof(RT_VALUE_TYPE),
- .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_32) + 32 * sizeof(RT_PTR_ALLOC)),
- .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_32) + 32 * sizeof(RT_VALUE_TYPE)),
},
[RT_CLASS_125_FULL] = {
.name = "radix tree node 125",
.fanout = 125,
.inner_size = sizeof(RT_NODE_INNER_125) + 125 * sizeof(RT_PTR_ALLOC),
.leaf_size = sizeof(RT_NODE_LEAF_125) + 125 * sizeof(RT_VALUE_TYPE),
- .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_125) + 125 * sizeof(RT_PTR_ALLOC)),
- .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_125) + 125 * sizeof(RT_VALUE_TYPE)),
},
[RT_CLASS_256] = {
.name = "radix tree node 256",
.fanout = 256,
.inner_size = sizeof(RT_NODE_INNER_256),
.leaf_size = sizeof(RT_NODE_LEAF_256),
- .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_256)),
- .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_256)),
},
};
@@ -1361,14 +1347,18 @@ RT_CREATE(MemoryContext ctx)
/* Create the slab allocator for each size class */
for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
{
+ RT_SIZE_CLASS_ELEM size_class = RT_SIZE_CLASS_INFO[i];
+ size_t inner_blocksize = RT_SLAB_BLOCK_SIZE(size_class.inner_size);
+ size_t leaf_blocksize = RT_SLAB_BLOCK_SIZE(size_class.leaf_size);
+
tree->inner_slabs[i] = SlabContextCreate(ctx,
- RT_SIZE_CLASS_INFO[i].name,
- RT_SIZE_CLASS_INFO[i].inner_blocksize,
- RT_SIZE_CLASS_INFO[i].inner_size);
+ size_class.name,
+ inner_blocksize,
+ size_class.inner_size);
tree->leaf_slabs[i] = SlabContextCreate(ctx,
- RT_SIZE_CLASS_INFO[i].name,
- RT_SIZE_CLASS_INFO[i].leaf_blocksize,
- RT_SIZE_CLASS_INFO[i].leaf_size);
+ size_class.name,
+ leaf_blocksize,
+ size_class.leaf_size);
}
#endif
@@ -2189,12 +2179,10 @@ RT_DUMP(RT_RADIX_TREE *tree)
{
for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
- fprintf(stderr, "%s\tinner_size %zu\tinner_blocksize %zu\tleaf_size %zu\tleaf_blocksize %zu\n",
+ fprintf(stderr, "%s\tinner_size %zu\tleaf_size %zu\t%zu\n",
RT_SIZE_CLASS_INFO[i].name,
RT_SIZE_CLASS_INFO[i].inner_size,
- RT_SIZE_CLASS_INFO[i].inner_blocksize,
- RT_SIZE_CLASS_INFO[i].leaf_size,
- RT_SIZE_CLASS_INFO[i].leaf_blocksize);
+ RT_SIZE_CLASS_INFO[i].leaf_size);
fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->ctl->max_val);
if (!tree->ctl->root)
--
2.39.0
Attachment: v21-0010-Reduce-node4-to-node3.patch
From 1226982cc3c3ac779953de4afb6e85f31be11a28 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Fri, 20 Jan 2023 18:05:15 +0700
Subject: [PATCH v21 10/22] Reduce node4 to node3
Now that we don't store "chunk", the base node type is only
5 bytes in size. With 3 key chunks, there is no alignment
padding between the chunks array and the child/value array.
This reduces the smallest inner node to 32 bytes on 64-bit
platforms.
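Spelling out the layout arithmetic (taking the 5-byte header above at face
value, with 8-byte child pointers on a 64-bit platform):

    node3 inner: 5 (header) + 3 (chunks) = 8 bytes, then 3 * 8 = 24 bytes of
                 children -> 32 bytes total
    node4 inner: 5 (header) + 4 (chunks) = 9 bytes, padded to 16 before the
                 children, then 4 * 8 = 32 bytes -> 48 bytes total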
---
src/include/lib/radixtree.h | 124 ++++++++++++------------
src/include/lib/radixtree_delete_impl.h | 20 ++--
src/include/lib/radixtree_insert_impl.h | 38 ++++----
src/include/lib/radixtree_iter_impl.h | 18 ++--
src/include/lib/radixtree_search_impl.h | 18 ++--
5 files changed, 109 insertions(+), 109 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index d15ea8f0fe..6cc8442c89 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -136,9 +136,9 @@
#define RT_REPLACE_NODE RT_MAKE_NAME(replace_node)
#define RT_PTR_GET_LOCAL RT_MAKE_NAME(ptr_get_local)
#define RT_PTR_ALLOC_IS_VALID RT_MAKE_NAME(ptr_stored_is_valid)
-#define RT_NODE_4_SEARCH_EQ RT_MAKE_NAME(node_4_search_eq)
+#define RT_NODE_3_SEARCH_EQ RT_MAKE_NAME(node_3_search_eq)
#define RT_NODE_32_SEARCH_EQ RT_MAKE_NAME(node_32_search_eq)
-#define RT_NODE_4_GET_INSERTPOS RT_MAKE_NAME(node_4_get_insertpos)
+#define RT_NODE_3_GET_INSERTPOS RT_MAKE_NAME(node_3_get_insertpos)
#define RT_NODE_32_GET_INSERTPOS RT_MAKE_NAME(node_32_get_insertpos)
#define RT_CHUNK_CHILDREN_ARRAY_SHIFT RT_MAKE_NAME(chunk_children_array_shift)
#define RT_CHUNK_VALUES_ARRAY_SHIFT RT_MAKE_NAME(chunk_values_array_shift)
@@ -181,22 +181,22 @@
#endif
#define RT_NODE RT_MAKE_NAME(node)
#define RT_NODE_ITER RT_MAKE_NAME(node_iter)
-#define RT_NODE_BASE_4 RT_MAKE_NAME(node_base_4)
+#define RT_NODE_BASE_3 RT_MAKE_NAME(node_base_3)
#define RT_NODE_BASE_32 RT_MAKE_NAME(node_base_32)
#define RT_NODE_BASE_125 RT_MAKE_NAME(node_base_125)
#define RT_NODE_BASE_256 RT_MAKE_NAME(node_base_256)
-#define RT_NODE_INNER_4 RT_MAKE_NAME(node_inner_4)
+#define RT_NODE_INNER_3 RT_MAKE_NAME(node_inner_3)
#define RT_NODE_INNER_32 RT_MAKE_NAME(node_inner_32)
#define RT_NODE_INNER_125 RT_MAKE_NAME(node_inner_125)
#define RT_NODE_INNER_256 RT_MAKE_NAME(node_inner_256)
-#define RT_NODE_LEAF_4 RT_MAKE_NAME(node_leaf_4)
+#define RT_NODE_LEAF_3 RT_MAKE_NAME(node_leaf_3)
#define RT_NODE_LEAF_32 RT_MAKE_NAME(node_leaf_32)
#define RT_NODE_LEAF_125 RT_MAKE_NAME(node_leaf_125)
#define RT_NODE_LEAF_256 RT_MAKE_NAME(node_leaf_256)
#define RT_SIZE_CLASS RT_MAKE_NAME(size_class)
#define RT_SIZE_CLASS_ELEM RT_MAKE_NAME(size_class_elem)
#define RT_SIZE_CLASS_INFO RT_MAKE_NAME(size_class_info)
-#define RT_CLASS_4_FULL RT_MAKE_NAME(class_4_full)
+#define RT_CLASS_3_FULL RT_MAKE_NAME(class_3_full)
#define RT_CLASS_32_PARTIAL RT_MAKE_NAME(class_32_partial)
#define RT_CLASS_32_FULL RT_MAKE_NAME(class_32_full)
#define RT_CLASS_125_FULL RT_MAKE_NAME(class_125_full)
@@ -305,7 +305,7 @@ RT_SCOPE void RT_STATS(RT_RADIX_TREE *tree);
* allocator padding in both the inner and leaf nodes on DSA.
* node
*/
-#define RT_NODE_KIND_4 0x00
+#define RT_NODE_KIND_3 0x00
#define RT_NODE_KIND_32 0x01
#define RT_NODE_KIND_125 0x02
#define RT_NODE_KIND_256 0x03
@@ -323,7 +323,7 @@ RT_SCOPE void RT_STATS(RT_RADIX_TREE *tree);
typedef enum RT_SIZE_CLASS
{
- RT_CLASS_4_FULL = 0,
+ RT_CLASS_3_FULL = 0,
RT_CLASS_32_PARTIAL,
RT_CLASS_32_FULL,
RT_CLASS_125_FULL,
@@ -387,13 +387,13 @@ typedef struct RT_NODE
/* Base type of each node kind for leaf and inner nodes */
/* The base types must be able to accommodate the largest size
class for variable-sized node kinds */
-typedef struct RT_NODE_BASE_4
+typedef struct RT_NODE_BASE_3
{
RT_NODE n;
- /* 4 children, for key chunks */
- uint8 chunks[4];
-} RT_NODE_BASE_4;
+ /* 3 children, for key chunks */
+ uint8 chunks[3];
+} RT_NODE_BASE_3;
typedef struct RT_NODE_BASE_32
{
@@ -437,21 +437,21 @@ typedef struct RT_NODE_BASE_256
* good. It might be better to just indicate non-existing entries the same way
* in inner nodes.
*/
-typedef struct RT_NODE_INNER_4
+typedef struct RT_NODE_INNER_3
{
- RT_NODE_BASE_4 base;
+ RT_NODE_BASE_3 base;
/* number of children depends on size class */
RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
-} RT_NODE_INNER_4;
+} RT_NODE_INNER_3;
-typedef struct RT_NODE_LEAF_4
+typedef struct RT_NODE_LEAF_3
{
- RT_NODE_BASE_4 base;
+ RT_NODE_BASE_3 base;
/* number of values depends on size class */
RT_VALUE_TYPE values[FLEXIBLE_ARRAY_MEMBER];
-} RT_NODE_LEAF_4;
+} RT_NODE_LEAF_3;
typedef struct RT_NODE_INNER_32
{
@@ -520,11 +520,11 @@ typedef struct RT_SIZE_CLASS_ELEM
} RT_SIZE_CLASS_ELEM;
static const RT_SIZE_CLASS_ELEM RT_SIZE_CLASS_INFO[] = {
- [RT_CLASS_4_FULL] = {
- .name = "radix tree node 4",
- .fanout = 4,
- .inner_size = sizeof(RT_NODE_INNER_4) + 4 * sizeof(RT_PTR_ALLOC),
- .leaf_size = sizeof(RT_NODE_LEAF_4) + 4 * sizeof(RT_VALUE_TYPE),
+ [RT_CLASS_3_FULL] = {
+ .name = "radix tree node 3",
+ .fanout = 3,
+ .inner_size = sizeof(RT_NODE_INNER_3) + 3 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_3) + 3 * sizeof(RT_VALUE_TYPE),
},
[RT_CLASS_32_PARTIAL] = {
.name = "radix tree node 15",
@@ -556,7 +556,7 @@ static const RT_SIZE_CLASS_ELEM RT_SIZE_CLASS_INFO[] = {
/* Map from the node kind to its minimum size class */
static const RT_SIZE_CLASS RT_KIND_MIN_SIZE_CLASS[RT_NODE_KIND_COUNT] = {
- [RT_NODE_KIND_4] = RT_CLASS_4_FULL,
+ [RT_NODE_KIND_3] = RT_CLASS_3_FULL,
[RT_NODE_KIND_32] = RT_CLASS_32_PARTIAL,
[RT_NODE_KIND_125] = RT_CLASS_125_FULL,
[RT_NODE_KIND_256] = RT_CLASS_256,
@@ -673,7 +673,7 @@ RT_PTR_ALLOC_IS_VALID(RT_PTR_ALLOC ptr)
* if there is no such element.
*/
static inline int
-RT_NODE_4_SEARCH_EQ(RT_NODE_BASE_4 *node, uint8 chunk)
+RT_NODE_3_SEARCH_EQ(RT_NODE_BASE_3 *node, uint8 chunk)
{
int idx = -1;
@@ -693,7 +693,7 @@ RT_NODE_4_SEARCH_EQ(RT_NODE_BASE_4 *node, uint8 chunk)
* Return index of the chunk to insert into chunks in the given node.
*/
static inline int
-RT_NODE_4_GET_INSERTPOS(RT_NODE_BASE_4 *node, uint8 chunk)
+RT_NODE_3_GET_INSERTPOS(RT_NODE_BASE_3 *node, uint8 chunk)
{
int idx;
@@ -810,7 +810,7 @@ RT_NODE_32_GET_INSERTPOS(RT_NODE_BASE_32 *node, uint8 chunk)
/*
* Functions to manipulate both chunks array and children/values array.
- * These are used for node-4 and node-32.
+ * These are used for node-3 and node-32.
*/
/* Shift the elements right at 'idx' by one */
@@ -848,7 +848,7 @@ static inline void
RT_CHUNK_CHILDREN_ARRAY_COPY(uint8 *src_chunks, RT_PTR_ALLOC *src_children,
uint8 *dst_chunks, RT_PTR_ALLOC *dst_children)
{
- const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_4_FULL].fanout;
+ const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_3_FULL].fanout;
const Size chunk_size = sizeof(uint8) * fanout;
const Size children_size = sizeof(RT_PTR_ALLOC) * fanout;
@@ -860,7 +860,7 @@ static inline void
RT_CHUNK_VALUES_ARRAY_COPY(uint8 *src_chunks, RT_VALUE_TYPE *src_values,
uint8 *dst_chunks, RT_VALUE_TYPE *dst_values)
{
- const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_4_FULL].fanout;
+ const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_3_FULL].fanout;
const Size chunk_size = sizeof(uint8) * fanout;
const Size values_size = sizeof(RT_VALUE_TYPE) * fanout;
@@ -1060,9 +1060,9 @@ RT_NEW_ROOT(RT_RADIX_TREE *tree, uint64 key)
RT_PTR_ALLOC allocnode;
RT_PTR_LOCAL newnode;
- allocnode = RT_ALLOC_NODE(tree, RT_CLASS_4_FULL, inner);
+ allocnode = RT_ALLOC_NODE(tree, RT_CLASS_3_FULL, inner);
newnode = RT_PTR_GET_LOCAL(tree, allocnode);
- RT_INIT_NODE(newnode, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
+ RT_INIT_NODE(newnode, RT_NODE_KIND_3, RT_CLASS_3_FULL, inner);
newnode->shift = shift;
tree->ctl->max_val = RT_SHIFT_GET_MAX_VAL(shift);
tree->ctl->root = allocnode;
@@ -1183,17 +1183,17 @@ RT_EXTEND(RT_RADIX_TREE *tree, uint64 key)
{
RT_PTR_ALLOC allocnode;
RT_PTR_LOCAL node;
- RT_NODE_INNER_4 *n4;
+ RT_NODE_INNER_3 *n3;
- allocnode = RT_ALLOC_NODE(tree, RT_CLASS_4_FULL, true);
+ allocnode = RT_ALLOC_NODE(tree, RT_CLASS_3_FULL, true);
node = RT_PTR_GET_LOCAL(tree, allocnode);
- RT_INIT_NODE(node, RT_NODE_KIND_4, RT_CLASS_4_FULL, true);
+ RT_INIT_NODE(node, RT_NODE_KIND_3, RT_CLASS_3_FULL, true);
node->shift = shift;
node->count = 1;
- n4 = (RT_NODE_INNER_4 *) node;
- n4->base.chunks[0] = 0;
- n4->children[0] = tree->ctl->root;
+ n3 = (RT_NODE_INNER_3 *) node;
+ n3->base.chunks[0] = 0;
+ n3->children[0] = tree->ctl->root;
/* Update the root */
tree->ctl->root = allocnode;
@@ -1223,9 +1223,9 @@ RT_SET_EXTEND(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE value, RT_PTR_LOCAL
int newshift = shift - RT_NODE_SPAN;
bool inner = newshift > 0;
- allocchild = RT_ALLOC_NODE(tree, RT_CLASS_4_FULL, inner);
+ allocchild = RT_ALLOC_NODE(tree, RT_CLASS_3_FULL, inner);
newchild = RT_PTR_GET_LOCAL(tree, allocchild);
- RT_INIT_NODE(newchild, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
+ RT_INIT_NODE(newchild, RT_NODE_KIND_3, RT_CLASS_3_FULL, inner);
newchild->shift = newshift;
RT_NODE_INSERT_INNER(tree, parent, stored_node, node, key, allocchild);
@@ -1430,12 +1430,12 @@ RT_FREE_RECURSE(RT_RADIX_TREE *tree, RT_PTR_ALLOC ptr)
switch (node->kind)
{
- case RT_NODE_KIND_4:
+ case RT_NODE_KIND_3:
{
- RT_NODE_INNER_4 *n4 = (RT_NODE_INNER_4 *) node;
+ RT_NODE_INNER_3 *n3 = (RT_NODE_INNER_3 *) node;
- for (int i = 0; i < n4->base.n.count; i++)
- RT_FREE_RECURSE(tree, n4->children[i]);
+ for (int i = 0; i < n3->base.n.count; i++)
+ RT_FREE_RECURSE(tree, n3->children[i]);
break;
}
@@ -1892,12 +1892,12 @@ RT_VERIFY_NODE(RT_PTR_LOCAL node)
switch (node->kind)
{
- case RT_NODE_KIND_4:
+ case RT_NODE_KIND_3:
{
- RT_NODE_BASE_4 *n4 = (RT_NODE_BASE_4 *) node;
+ RT_NODE_BASE_3 *n3 = (RT_NODE_BASE_3 *) node;
- for (int i = 1; i < n4->n.count; i++)
- Assert(n4->chunks[i - 1] < n4->chunks[i]);
+ for (int i = 1; i < n3->n.count; i++)
+ Assert(n3->chunks[i - 1] < n3->chunks[i]);
break;
}
@@ -1959,10 +1959,10 @@ RT_VERIFY_NODE(RT_PTR_LOCAL node)
RT_SCOPE void
RT_STATS(RT_RADIX_TREE *tree)
{
- ereport(NOTICE, (errmsg("num_keys = " UINT64_FORMAT ", height = %d, n4 = %u, n15 = %u, n32 = %u, n125 = %u, n256 = %u",
+ ereport(NOTICE, (errmsg("num_keys = " UINT64_FORMAT ", height = %d, n3 = %u, n15 = %u, n32 = %u, n125 = %u, n256 = %u",
tree->ctl->num_keys,
tree->ctl->root->shift / RT_NODE_SPAN,
- tree->ctl->cnt[RT_CLASS_4_FULL],
+ tree->ctl->cnt[RT_CLASS_3_FULL],
tree->ctl->cnt[RT_CLASS_32_PARTIAL],
tree->ctl->cnt[RT_CLASS_32_FULL],
tree->ctl->cnt[RT_CLASS_125_FULL],
@@ -1977,7 +1977,7 @@ RT_DUMP_NODE(RT_PTR_LOCAL node, int level, bool recurse)
fprintf(stderr, "[%s] kind %d, fanout %d, count %u, shift %u:\n",
NODE_IS_LEAF(node) ? "LEAF" : "INNR",
- (node->kind == RT_NODE_KIND_4) ? 4 :
+ (node->kind == RT_NODE_KIND_3) ? 3 :
(node->kind == RT_NODE_KIND_32) ? 32 :
(node->kind == RT_NODE_KIND_125) ? 125 : 256,
node->fanout == 0 ? 256 : node->fanout,
@@ -1988,26 +1988,26 @@ RT_DUMP_NODE(RT_PTR_LOCAL node, int level, bool recurse)
switch (node->kind)
{
- case RT_NODE_KIND_4:
+ case RT_NODE_KIND_3:
{
for (int i = 0; i < node->count; i++)
{
if (NODE_IS_LEAF(node))
{
- RT_NODE_LEAF_4 *n4 = (RT_NODE_LEAF_4 *) node;
+ RT_NODE_LEAF_3 *n3 = (RT_NODE_LEAF_3 *) node;
fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
- space, n4->base.chunks[i], (uint64) n4->values[i]);
+ space, n3->base.chunks[i], (uint64) n3->values[i]);
}
else
{
- RT_NODE_INNER_4 *n4 = (RT_NODE_INNER_4 *) node;
+ RT_NODE_INNER_3 *n3 = (RT_NODE_INNER_3 *) node;
fprintf(stderr, "%schunk 0x%X ->",
- space, n4->base.chunks[i]);
+ space, n3->base.chunks[i]);
if (recurse)
- RT_DUMP_NODE(n4->children[i], level + 1, recurse);
+ RT_DUMP_NODE(n3->children[i], level + 1, recurse);
else
fprintf(stderr, "\n");
}
@@ -2229,22 +2229,22 @@ RT_DUMP(RT_RADIX_TREE *tree)
#undef RT_ITER
#undef RT_NODE
#undef RT_NODE_ITER
-#undef RT_NODE_BASE_4
+#undef RT_NODE_BASE_3
#undef RT_NODE_BASE_32
#undef RT_NODE_BASE_125
#undef RT_NODE_BASE_256
-#undef RT_NODE_INNER_4
+#undef RT_NODE_INNER_3
#undef RT_NODE_INNER_32
#undef RT_NODE_INNER_125
#undef RT_NODE_INNER_256
-#undef RT_NODE_LEAF_4
+#undef RT_NODE_LEAF_3
#undef RT_NODE_LEAF_32
#undef RT_NODE_LEAF_125
#undef RT_NODE_LEAF_256
#undef RT_SIZE_CLASS
#undef RT_SIZE_CLASS_ELEM
#undef RT_SIZE_CLASS_INFO
-#undef RT_CLASS_4_FULL
+#undef RT_CLASS_3_FULL
#undef RT_CLASS_32_PARTIAL
#undef RT_CLASS_32_FULL
#undef RT_CLASS_125_FULL
@@ -2282,9 +2282,9 @@ RT_DUMP(RT_RADIX_TREE *tree)
#undef RT_REPLACE_NODE
#undef RT_PTR_GET_LOCAL
#undef RT_PTR_ALLOC_IS_VALID
-#undef RT_NODE_4_SEARCH_EQ
+#undef RT_NODE_3_SEARCH_EQ
#undef RT_NODE_32_SEARCH_EQ
-#undef RT_NODE_4_GET_INSERTPOS
+#undef RT_NODE_3_GET_INSERTPOS
#undef RT_NODE_32_GET_INSERTPOS
#undef RT_CHUNK_CHILDREN_ARRAY_SHIFT
#undef RT_CHUNK_VALUES_ARRAY_SHIFT
diff --git a/src/include/lib/radixtree_delete_impl.h b/src/include/lib/radixtree_delete_impl.h
index 2f1c172672..b9f07f4eb5 100644
--- a/src/include/lib/radixtree_delete_impl.h
+++ b/src/include/lib/radixtree_delete_impl.h
@@ -1,12 +1,12 @@
/* TODO: shrink nodes */
#if defined(RT_NODE_LEVEL_INNER)
-#define RT_NODE4_TYPE RT_NODE_INNER_4
+#define RT_NODE3_TYPE RT_NODE_INNER_3
#define RT_NODE32_TYPE RT_NODE_INNER_32
#define RT_NODE125_TYPE RT_NODE_INNER_125
#define RT_NODE256_TYPE RT_NODE_INNER_256
#elif defined(RT_NODE_LEVEL_LEAF)
-#define RT_NODE4_TYPE RT_NODE_LEAF_4
+#define RT_NODE3_TYPE RT_NODE_LEAF_3
#define RT_NODE32_TYPE RT_NODE_LEAF_32
#define RT_NODE125_TYPE RT_NODE_LEAF_125
#define RT_NODE256_TYPE RT_NODE_LEAF_256
@@ -24,20 +24,20 @@
switch (node->kind)
{
- case RT_NODE_KIND_4:
+ case RT_NODE_KIND_3:
{
- RT_NODE4_TYPE *n4 = (RT_NODE4_TYPE *) node;
- int idx = RT_NODE_4_SEARCH_EQ((RT_NODE_BASE_4 *) n4, chunk);
+ RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node;
+ int idx = RT_NODE_3_SEARCH_EQ((RT_NODE_BASE_3 *) n3, chunk);
if (idx < 0)
return false;
#ifdef RT_NODE_LEVEL_LEAF
- RT_CHUNK_VALUES_ARRAY_DELETE(n4->base.chunks, n4->values,
- n4->base.n.count, idx);
+ RT_CHUNK_VALUES_ARRAY_DELETE(n3->base.chunks, n3->values,
+ n3->base.n.count, idx);
#else
- RT_CHUNK_CHILDREN_ARRAY_DELETE(n4->base.chunks, n4->children,
- n4->base.n.count, idx);
+ RT_CHUNK_CHILDREN_ARRAY_DELETE(n3->base.chunks, n3->children,
+ n3->base.n.count, idx);
#endif
break;
}
@@ -100,7 +100,7 @@
return true;
-#undef RT_NODE4_TYPE
+#undef RT_NODE3_TYPE
#undef RT_NODE32_TYPE
#undef RT_NODE125_TYPE
#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_insert_impl.h b/src/include/lib/radixtree_insert_impl.h
index 90fe5f539e..16461bdb03 100644
--- a/src/include/lib/radixtree_insert_impl.h
+++ b/src/include/lib/radixtree_insert_impl.h
@@ -1,10 +1,10 @@
#if defined(RT_NODE_LEVEL_INNER)
-#define RT_NODE4_TYPE RT_NODE_INNER_4
+#define RT_NODE3_TYPE RT_NODE_INNER_3
#define RT_NODE32_TYPE RT_NODE_INNER_32
#define RT_NODE125_TYPE RT_NODE_INNER_125
#define RT_NODE256_TYPE RT_NODE_INNER_256
#elif defined(RT_NODE_LEVEL_LEAF)
-#define RT_NODE4_TYPE RT_NODE_LEAF_4
+#define RT_NODE3_TYPE RT_NODE_LEAF_3
#define RT_NODE32_TYPE RT_NODE_LEAF_32
#define RT_NODE125_TYPE RT_NODE_LEAF_125
#define RT_NODE256_TYPE RT_NODE_LEAF_256
@@ -25,25 +25,25 @@
switch (node->kind)
{
- case RT_NODE_KIND_4:
+ case RT_NODE_KIND_3:
{
- RT_NODE4_TYPE *n4 = (RT_NODE4_TYPE *) node;
+ RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node;
int idx;
- idx = RT_NODE_4_SEARCH_EQ(&n4->base, chunk);
+ idx = RT_NODE_3_SEARCH_EQ(&n3->base, chunk);
if (idx != -1)
{
/* found the existing chunk */
chunk_exists = true;
#ifdef RT_NODE_LEVEL_LEAF
- n4->values[idx] = value;
+ n3->values[idx] = value;
#else
- n4->children[idx] = child;
+ n3->children[idx] = child;
#endif
break;
}
- if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n4)))
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n3)))
{
RT_PTR_ALLOC allocnode;
RT_PTR_LOCAL newnode;
@@ -51,16 +51,16 @@
const uint8 new_kind = RT_NODE_KIND_32;
const RT_SIZE_CLASS new_class = RT_KIND_MIN_SIZE_CLASS[new_kind];
- /* grow node from 4 to 32 */
+ /* grow node from 3 to 32 */
allocnode = RT_ALLOC_NODE(tree, new_class, inner);
newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, inner);
new32 = (RT_NODE32_TYPE *) newnode;
#ifdef RT_NODE_LEVEL_LEAF
- RT_CHUNK_VALUES_ARRAY_COPY(n4->base.chunks, n4->values,
+ RT_CHUNK_VALUES_ARRAY_COPY(n3->base.chunks, n3->values,
new32->base.chunks, new32->values);
#else
- RT_CHUNK_CHILDREN_ARRAY_COPY(n4->base.chunks, n4->children,
+ RT_CHUNK_CHILDREN_ARRAY_COPY(n3->base.chunks, n3->children,
new32->base.chunks, new32->children);
#endif
RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
@@ -68,27 +68,27 @@
}
else
{
- int insertpos = RT_NODE_4_GET_INSERTPOS(&n4->base, chunk);
- int count = n4->base.n.count;
+ int insertpos = RT_NODE_3_GET_INSERTPOS(&n3->base, chunk);
+ int count = n3->base.n.count;
/* shift chunks and children */
if (insertpos < count)
{
Assert(count > 0);
#ifdef RT_NODE_LEVEL_LEAF
- RT_CHUNK_VALUES_ARRAY_SHIFT(n4->base.chunks, n4->values,
+ RT_CHUNK_VALUES_ARRAY_SHIFT(n3->base.chunks, n3->values,
count, insertpos);
#else
- RT_CHUNK_CHILDREN_ARRAY_SHIFT(n4->base.chunks, n4->children,
+ RT_CHUNK_CHILDREN_ARRAY_SHIFT(n3->base.chunks, n3->children,
count, insertpos);
#endif
}
- n4->base.chunks[insertpos] = chunk;
+ n3->base.chunks[insertpos] = chunk;
#ifdef RT_NODE_LEVEL_LEAF
- n4->values[insertpos] = value;
+ n3->values[insertpos] = value;
#else
- n4->children[insertpos] = child;
+ n3->children[insertpos] = child;
#endif
break;
}
@@ -304,7 +304,7 @@
return chunk_exists;
-#undef RT_NODE4_TYPE
+#undef RT_NODE3_TYPE
#undef RT_NODE32_TYPE
#undef RT_NODE125_TYPE
#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_iter_impl.h b/src/include/lib/radixtree_iter_impl.h
index 5c06f8b414..c428531438 100644
--- a/src/include/lib/radixtree_iter_impl.h
+++ b/src/include/lib/radixtree_iter_impl.h
@@ -1,10 +1,10 @@
#if defined(RT_NODE_LEVEL_INNER)
-#define RT_NODE4_TYPE RT_NODE_INNER_4
+#define RT_NODE3_TYPE RT_NODE_INNER_3
#define RT_NODE32_TYPE RT_NODE_INNER_32
#define RT_NODE125_TYPE RT_NODE_INNER_125
#define RT_NODE256_TYPE RT_NODE_INNER_256
#elif defined(RT_NODE_LEVEL_LEAF)
-#define RT_NODE4_TYPE RT_NODE_LEAF_4
+#define RT_NODE3_TYPE RT_NODE_LEAF_3
#define RT_NODE32_TYPE RT_NODE_LEAF_32
#define RT_NODE125_TYPE RT_NODE_LEAF_125
#define RT_NODE256_TYPE RT_NODE_LEAF_256
@@ -31,19 +31,19 @@
switch (node_iter->node->kind)
{
- case RT_NODE_KIND_4:
+ case RT_NODE_KIND_3:
{
- RT_NODE4_TYPE *n4 = (RT_NODE4_TYPE *) node_iter->node;
+ RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node_iter->node;
node_iter->current_idx++;
- if (node_iter->current_idx >= n4->base.n.count)
+ if (node_iter->current_idx >= n3->base.n.count)
break;
#ifdef RT_NODE_LEVEL_LEAF
- value = n4->values[node_iter->current_idx];
+ value = n3->values[node_iter->current_idx];
#else
- child = RT_PTR_GET_LOCAL(iter->tree, n4->children[node_iter->current_idx]);
+ child = RT_PTR_GET_LOCAL(iter->tree, n3->children[node_iter->current_idx]);
#endif
- key_chunk = n4->base.chunks[node_iter->current_idx];
+ key_chunk = n3->base.chunks[node_iter->current_idx];
found = true;
break;
}
@@ -132,7 +132,7 @@
return child;
#endif
-#undef RT_NODE4_TYPE
+#undef RT_NODE3_TYPE
#undef RT_NODE32_TYPE
#undef RT_NODE125_TYPE
#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_search_impl.h b/src/include/lib/radixtree_search_impl.h
index d2bbdd2450..31138b6a72 100644
--- a/src/include/lib/radixtree_search_impl.h
+++ b/src/include/lib/radixtree_search_impl.h
@@ -1,10 +1,10 @@
#if defined(RT_NODE_LEVEL_INNER)
-#define RT_NODE4_TYPE RT_NODE_INNER_4
+#define RT_NODE3_TYPE RT_NODE_INNER_3
#define RT_NODE32_TYPE RT_NODE_INNER_32
#define RT_NODE125_TYPE RT_NODE_INNER_125
#define RT_NODE256_TYPE RT_NODE_INNER_256
#elif defined(RT_NODE_LEVEL_LEAF)
-#define RT_NODE4_TYPE RT_NODE_LEAF_4
+#define RT_NODE3_TYPE RT_NODE_LEAF_3
#define RT_NODE32_TYPE RT_NODE_LEAF_32
#define RT_NODE125_TYPE RT_NODE_LEAF_125
#define RT_NODE256_TYPE RT_NODE_LEAF_256
@@ -27,22 +27,22 @@
switch (node->kind)
{
- case RT_NODE_KIND_4:
+ case RT_NODE_KIND_3:
{
- RT_NODE4_TYPE *n4 = (RT_NODE4_TYPE *) node;
- int idx = RT_NODE_4_SEARCH_EQ((RT_NODE_BASE_4 *) n4, chunk);
+ RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node;
+ int idx = RT_NODE_3_SEARCH_EQ((RT_NODE_BASE_3 *) n3, chunk);
#ifdef RT_ACTION_UPDATE
Assert(idx >= 0);
- n4->children[idx] = new_child;
+ n3->children[idx] = new_child;
#else
if (idx < 0)
return false;
#ifdef RT_NODE_LEVEL_LEAF
- value = n4->values[idx];
+ value = n3->values[idx];
#else
- child = n4->children[idx];
+ child = n3->children[idx];
#endif
#endif /* RT_ACTION_UPDATE */
break;
@@ -125,7 +125,7 @@
return true;
#endif /* RT_ACTION_UPDATE */
-#undef RT_NODE4_TYPE
+#undef RT_NODE3_TYPE
#undef RT_NODE32_TYPE
#undef RT_NODE125_TYPE
#undef RT_NODE256_TYPE
--
2.39.0
Attachment: v21-0007-Make-value-type-configurable.patch
From 9b6adf0d916cb4bce3dd0e329b59cd1b27013e67 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Fri, 20 Jan 2023 14:19:15 +0700
Subject: [PATCH v21 07/22] Make value type configurable
Tests pass with uint32, although the test module builds
with warnings.
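For anyone following along, a minimal sketch of instantiating the template
with a non-default value type, mirroring the test module's defines (RT_PREFIX
and RT_SCOPE are the other template parameters from the full header, not
visible in the hunks below):

#define RT_PREFIX rt
#define RT_SCOPE static
#define RT_DECLARE
#define RT_DEFINE
#define RT_USE_DELETE
#define RT_VALUE_TYPE uint32	/* any fixed-size value type */
#include "lib/radixtree.h"

/* ... yields rt_create(), rt_set(), rt_search(), rt_delete() taking uint32 values */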
---
src/include/lib/radixtree.h | 79 ++++++++++---------
src/include/lib/radixtree_delete_impl.h | 4 +-
src/include/lib/radixtree_iter_impl.h | 2 +-
src/include/lib/radixtree_search_impl.h | 2 +-
.../modules/test_radixtree/test_radixtree.c | 41 ++++++----
5 files changed, 69 insertions(+), 59 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 98e4597eac..0a39bd6664 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -44,6 +44,7 @@
* declarations reside
* - RT_SHMEM - if defined, the radix tree is created in the DSA area
* so that multiple processes can access it simultaneously.
+ * - RT_VALUE_TYPE - the type of the value.
*
* Optional parameters:
* - RT_DEBUG - if defined add stats tracking and debugging functions
@@ -222,14 +223,14 @@ RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx);
#endif
RT_SCOPE void RT_FREE(RT_RADIX_TREE *tree);
-RT_SCOPE bool RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, uint64 *val_p);
-RT_SCOPE bool RT_SET(RT_RADIX_TREE *tree, uint64 key, uint64 val);
+RT_SCOPE bool RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *val_p);
+RT_SCOPE bool RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE val);
#ifdef RT_USE_DELETE
RT_SCOPE bool RT_DELETE(RT_RADIX_TREE *tree, uint64 key);
#endif
RT_SCOPE RT_ITER * RT_BEGIN_ITERATE(RT_RADIX_TREE *tree);
-RT_SCOPE bool RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, uint64 *value_p);
+RT_SCOPE bool RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, RT_VALUE_TYPE *value_p);
RT_SCOPE void RT_END_ITERATE(RT_ITER *iter);
RT_SCOPE uint64 RT_MEMORY_USAGE(RT_RADIX_TREE *tree);
@@ -435,7 +436,7 @@ typedef struct RT_NODE_LEAF_4
RT_NODE_BASE_4 base;
/* number of values depends on size class */
- uint64 values[FLEXIBLE_ARRAY_MEMBER];
+ RT_VALUE_TYPE values[FLEXIBLE_ARRAY_MEMBER];
} RT_NODE_LEAF_4;
typedef struct RT_NODE_INNER_32
@@ -451,7 +452,7 @@ typedef struct RT_NODE_LEAF_32
RT_NODE_BASE_32 base;
/* number of values depends on size class */
- uint64 values[FLEXIBLE_ARRAY_MEMBER];
+ RT_VALUE_TYPE values[FLEXIBLE_ARRAY_MEMBER];
} RT_NODE_LEAF_32;
typedef struct RT_NODE_INNER_125
@@ -467,7 +468,7 @@ typedef struct RT_NODE_LEAF_125
RT_NODE_BASE_125 base;
/* number of values depends on size class */
- uint64 values[FLEXIBLE_ARRAY_MEMBER];
+ RT_VALUE_TYPE values[FLEXIBLE_ARRAY_MEMBER];
} RT_NODE_LEAF_125;
/*
@@ -490,7 +491,7 @@ typedef struct RT_NODE_LEAF_256
bitmapword isset[BM_IDX(RT_NODE_MAX_SLOTS)];
/* Slots for 256 values */
- uint64 values[RT_NODE_MAX_SLOTS];
+ RT_VALUE_TYPE values[RT_NODE_MAX_SLOTS];
} RT_NODE_LEAF_256;
/* Information for each size class */
@@ -520,33 +521,33 @@ static const RT_SIZE_CLASS_ELEM RT_SIZE_CLASS_INFO[] = {
.name = "radix tree node 4",
.fanout = 4,
.inner_size = sizeof(RT_NODE_INNER_4) + 4 * sizeof(RT_PTR_ALLOC),
- .leaf_size = sizeof(RT_NODE_LEAF_4) + 4 * sizeof(uint64),
+ .leaf_size = sizeof(RT_NODE_LEAF_4) + 4 * sizeof(RT_VALUE_TYPE),
.inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_4) + 4 * sizeof(RT_PTR_ALLOC)),
- .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_4) + 4 * sizeof(uint64)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_4) + 4 * sizeof(RT_VALUE_TYPE)),
},
[RT_CLASS_32_PARTIAL] = {
.name = "radix tree node 15",
.fanout = 15,
.inner_size = sizeof(RT_NODE_INNER_32) + 15 * sizeof(RT_PTR_ALLOC),
- .leaf_size = sizeof(RT_NODE_LEAF_32) + 15 * sizeof(uint64),
+ .leaf_size = sizeof(RT_NODE_LEAF_32) + 15 * sizeof(RT_VALUE_TYPE),
.inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_32) + 15 * sizeof(RT_PTR_ALLOC)),
- .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_32) + 15 * sizeof(uint64)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_32) + 15 * sizeof(RT_VALUE_TYPE)),
},
[RT_CLASS_32_FULL] = {
.name = "radix tree node 32",
.fanout = 32,
.inner_size = sizeof(RT_NODE_INNER_32) + 32 * sizeof(RT_PTR_ALLOC),
- .leaf_size = sizeof(RT_NODE_LEAF_32) + 32 * sizeof(uint64),
+ .leaf_size = sizeof(RT_NODE_LEAF_32) + 32 * sizeof(RT_VALUE_TYPE),
.inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_32) + 32 * sizeof(RT_PTR_ALLOC)),
- .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_32) + 32 * sizeof(uint64)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_32) + 32 * sizeof(RT_VALUE_TYPE)),
},
[RT_CLASS_125_FULL] = {
.name = "radix tree node 125",
.fanout = 125,
.inner_size = sizeof(RT_NODE_INNER_125) + 125 * sizeof(RT_PTR_ALLOC),
- .leaf_size = sizeof(RT_NODE_LEAF_125) + 125 * sizeof(uint64),
+ .leaf_size = sizeof(RT_NODE_LEAF_125) + 125 * sizeof(RT_VALUE_TYPE),
.inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_125) + 125 * sizeof(RT_PTR_ALLOC)),
- .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_125) + 125 * sizeof(uint64)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_125) + 125 * sizeof(RT_VALUE_TYPE)),
},
[RT_CLASS_256] = {
.name = "radix tree node 256",
@@ -648,7 +649,7 @@ typedef struct RT_ITER
static bool RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
uint64 key, RT_PTR_ALLOC child);
static bool RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
- uint64 key, uint64 value);
+ uint64 key, RT_VALUE_TYPE value);
/* verification (available only with assertion) */
static void RT_VERIFY_NODE(RT_PTR_LOCAL node);
@@ -828,10 +829,10 @@ RT_CHUNK_CHILDREN_ARRAY_SHIFT(uint8 *chunks, RT_PTR_ALLOC *children, int count,
}
static inline void
-RT_CHUNK_VALUES_ARRAY_SHIFT(uint8 *chunks, uint64 *values, int count, int idx)
+RT_CHUNK_VALUES_ARRAY_SHIFT(uint8 *chunks, RT_VALUE_TYPE *values, int count, int idx)
{
memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
- memmove(&(values[idx + 1]), &(values[idx]), sizeof(uint64) * (count - idx));
+ memmove(&(values[idx + 1]), &(values[idx]), sizeof(RT_VALUE_TYPE) * (count - idx));
}
/* Delete the element at 'idx' */
@@ -843,10 +844,10 @@ RT_CHUNK_CHILDREN_ARRAY_DELETE(uint8 *chunks, RT_PTR_ALLOC *children, int count,
}
static inline void
-RT_CHUNK_VALUES_ARRAY_DELETE(uint8 *chunks, uint64 *values, int count, int idx)
+RT_CHUNK_VALUES_ARRAY_DELETE(uint8 *chunks, RT_VALUE_TYPE *values, int count, int idx)
{
memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
- memmove(&(values[idx]), &(values[idx + 1]), sizeof(uint64) * (count - idx - 1));
+ memmove(&(values[idx]), &(values[idx + 1]), sizeof(RT_VALUE_TYPE) * (count - idx - 1));
}
/* Copy both chunks and children/values arrays */
@@ -863,12 +864,12 @@ RT_CHUNK_CHILDREN_ARRAY_COPY(uint8 *src_chunks, RT_PTR_ALLOC *src_children,
}
static inline void
-RT_CHUNK_VALUES_ARRAY_COPY(uint8 *src_chunks, uint64 *src_values,
- uint8 *dst_chunks, uint64 *dst_values)
+RT_CHUNK_VALUES_ARRAY_COPY(uint8 *src_chunks, RT_VALUE_TYPE *src_values,
+ uint8 *dst_chunks, RT_VALUE_TYPE *dst_values)
{
const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_4_FULL].fanout;
const Size chunk_size = sizeof(uint8) * fanout;
- const Size values_size = sizeof(uint64) * fanout;
+ const Size values_size = sizeof(RT_VALUE_TYPE) * fanout;
memcpy(dst_chunks, src_chunks, chunk_size);
memcpy(dst_values, src_values, values_size);
@@ -890,7 +891,7 @@ RT_NODE_INNER_125_GET_CHILD(RT_NODE_INNER_125 *node, uint8 chunk)
return node->children[node->base.slot_idxs[chunk]];
}
-static inline uint64
+static inline RT_VALUE_TYPE
RT_NODE_LEAF_125_GET_VALUE(RT_NODE_LEAF_125 *node, uint8 chunk)
{
Assert(NODE_IS_LEAF(node));
@@ -926,7 +927,7 @@ RT_NODE_INNER_256_GET_CHILD(RT_NODE_INNER_256 *node, uint8 chunk)
return node->children[chunk];
}
-static inline uint64
+static inline RT_VALUE_TYPE
RT_NODE_LEAF_256_GET_VALUE(RT_NODE_LEAF_256 *node, uint8 chunk)
{
Assert(NODE_IS_LEAF(node));
@@ -944,7 +945,7 @@ RT_NODE_INNER_256_SET(RT_NODE_INNER_256 *node, uint8 chunk, RT_PTR_ALLOC child)
/* Set the value in the node-256 */
static inline void
-RT_NODE_LEAF_256_SET(RT_NODE_LEAF_256 *node, uint8 chunk, uint64 value)
+RT_NODE_LEAF_256_SET(RT_NODE_LEAF_256 *node, uint8 chunk, RT_VALUE_TYPE value)
{
int idx = BM_IDX(chunk);
int bitnum = BM_BIT(chunk);
@@ -1215,7 +1216,7 @@ RT_EXTEND(RT_RADIX_TREE *tree, uint64 key)
* Insert inner and leaf nodes from 'node' to bottom.
*/
static inline void
-RT_SET_EXTEND(RT_RADIX_TREE *tree, uint64 key, uint64 value, RT_PTR_LOCAL parent,
+RT_SET_EXTEND(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE value, RT_PTR_LOCAL parent,
RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node)
{
int shift = node->shift;
@@ -1266,7 +1267,7 @@ RT_NODE_SEARCH_INNER(RT_PTR_LOCAL node, uint64 key, RT_PTR_ALLOC *child_p)
* to the value is set to value_p.
*/
static inline bool
-RT_NODE_SEARCH_LEAF(RT_PTR_LOCAL node, uint64 key, uint64 *value_p)
+RT_NODE_SEARCH_LEAF(RT_PTR_LOCAL node, uint64 key, RT_VALUE_TYPE *value_p)
{
#define RT_NODE_LEVEL_LEAF
#include "lib/radixtree_search_impl.h"
@@ -1320,7 +1321,7 @@ RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stor
/* Like, RT_NODE_INSERT_INNER, but for leaf nodes */
static bool
RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
- uint64 key, uint64 value)
+ uint64 key, RT_VALUE_TYPE value)
{
#define RT_NODE_LEVEL_LEAF
#include "lib/radixtree_insert_impl.h"
@@ -1522,7 +1523,7 @@ RT_FREE(RT_RADIX_TREE *tree)
* and return true. Returns false if entry doesn't yet exist.
*/
RT_SCOPE bool
-RT_SET(RT_RADIX_TREE *tree, uint64 key, uint64 value)
+RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE value)
{
int shift;
bool updated;
@@ -1582,7 +1583,7 @@ RT_SET(RT_RADIX_TREE *tree, uint64 key, uint64 value)
* not be NULL.
*/
RT_SCOPE bool
-RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, uint64 *value_p)
+RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p)
{
RT_PTR_LOCAL node;
int shift;
@@ -1730,7 +1731,7 @@ RT_NODE_INNER_ITERATE_NEXT(RT_ITER *iter, RT_NODE_ITER *node_iter)
*/
static inline bool
RT_NODE_LEAF_ITERATE_NEXT(RT_ITER *iter, RT_NODE_ITER *node_iter,
- uint64 *value_p)
+ RT_VALUE_TYPE *value_p)
{
#define RT_NODE_LEVEL_LEAF
#include "lib/radixtree_iter_impl.h"
@@ -1803,7 +1804,7 @@ RT_BEGIN_ITERATE(RT_RADIX_TREE *tree)
* return false.
*/
RT_SCOPE bool
-RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, uint64 *value_p)
+RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, RT_VALUE_TYPE *value_p)
{
/* Empty tree */
if (!iter->tree->ctl->root)
@@ -1812,7 +1813,7 @@ RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, uint64 *value_p)
for (;;)
{
RT_PTR_LOCAL child = NULL;
- uint64 value;
+ RT_VALUE_TYPE value;
int level;
bool found;
@@ -1971,6 +1972,7 @@ RT_STATS(RT_RADIX_TREE *tree)
tree->ctl->cnt[RT_CLASS_256])));
}
+/* XXX For display, assumes value type is numeric */
static void
RT_DUMP_NODE(RT_PTR_LOCAL node, int level, bool recurse)
{
@@ -1998,7 +2000,7 @@ RT_DUMP_NODE(RT_PTR_LOCAL node, int level, bool recurse)
RT_NODE_LEAF_4 *n4 = (RT_NODE_LEAF_4 *) node;
fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
- space, n4->base.chunks[i], n4->values[i]);
+ space, n4->base.chunks[i], (uint64) n4->values[i]);
}
else
{
@@ -2024,7 +2026,7 @@ RT_DUMP_NODE(RT_PTR_LOCAL node, int level, bool recurse)
RT_NODE_LEAF_32 *n32 = (RT_NODE_LEAF_32 *) node;
fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
- space, n32->base.chunks[i], n32->values[i]);
+ space, n32->base.chunks[i], (uint64) n32->values[i]);
}
else
{
@@ -2077,7 +2079,7 @@ RT_DUMP_NODE(RT_PTR_LOCAL node, int level, bool recurse)
RT_NODE_LEAF_125 *n125 = (RT_NODE_LEAF_125 *) b125;
fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
- space, i, RT_NODE_LEAF_125_GET_VALUE(n125, i));
+ space, i, (uint64) RT_NODE_LEAF_125_GET_VALUE(n125, i));
}
else
{
@@ -2107,7 +2109,7 @@ RT_DUMP_NODE(RT_PTR_LOCAL node, int level, bool recurse)
continue;
fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
- space, i, RT_NODE_LEAF_256_GET_VALUE(n256, i));
+ space, i, (uint64) RT_NODE_LEAF_256_GET_VALUE(n256, i));
}
else
{
@@ -2213,6 +2215,7 @@ RT_DUMP(RT_RADIX_TREE *tree)
#undef RT_SCOPE
#undef RT_DECLARE
#undef RT_DEFINE
+#undef RT_VALUE_TYPE
/* locally declared macros */
#undef NODE_IS_LEAF
diff --git a/src/include/lib/radixtree_delete_impl.h b/src/include/lib/radixtree_delete_impl.h
index eb87866b90..2612730481 100644
--- a/src/include/lib/radixtree_delete_impl.h
+++ b/src/include/lib/radixtree_delete_impl.h
@@ -33,7 +33,7 @@
return false;
#ifdef RT_NODE_LEVEL_LEAF
- RT_CHUNK_VALUES_ARRAY_DELETE(n4->base.chunks, (uint64 *) n4->values,
+ RT_CHUNK_VALUES_ARRAY_DELETE(n4->base.chunks, n4->values,
n4->base.n.count, idx);
#else
RT_CHUNK_CHILDREN_ARRAY_DELETE(n4->base.chunks, n4->children,
@@ -50,7 +50,7 @@
return false;
#ifdef RT_NODE_LEVEL_LEAF
- RT_CHUNK_VALUES_ARRAY_DELETE(n32->base.chunks, (uint64 *) n32->values,
+ RT_CHUNK_VALUES_ARRAY_DELETE(n32->base.chunks, n32->values,
n32->base.n.count, idx);
#else
RT_CHUNK_CHILDREN_ARRAY_DELETE(n32->base.chunks, n32->children,
diff --git a/src/include/lib/radixtree_iter_impl.h b/src/include/lib/radixtree_iter_impl.h
index 0b8b68df6c..5c06f8b414 100644
--- a/src/include/lib/radixtree_iter_impl.h
+++ b/src/include/lib/radixtree_iter_impl.h
@@ -16,7 +16,7 @@
uint8 key_chunk;
#ifdef RT_NODE_LEVEL_LEAF
- uint64 value;
+ RT_VALUE_TYPE value;
Assert(NODE_IS_LEAF(node_iter->node));
#else
diff --git a/src/include/lib/radixtree_search_impl.h b/src/include/lib/radixtree_search_impl.h
index 31e4978e4f..365abaa46d 100644
--- a/src/include/lib/radixtree_search_impl.h
+++ b/src/include/lib/radixtree_search_impl.h
@@ -15,7 +15,7 @@
uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
#ifdef RT_NODE_LEVEL_LEAF
- uint64 value = 0;
+ RT_VALUE_TYPE value = 0;
Assert(NODE_IS_LEAF(node));
#else
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
index d8323f587f..64d46dfe9a 100644
--- a/src/test/modules/test_radixtree/test_radixtree.c
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -24,6 +24,12 @@
#define UINT64_HEX_FORMAT "%" INT64_MODIFIER "X"
+/*
+ * The tests pass with uint32, but build with warnings because the string
+ * format expects uint64.
+ */
+typedef uint64 TestValueType;
+
/*
* If you enable this, the "pattern" tests will print information about
* how long populating, probing, and iterating the test set takes, and
@@ -105,6 +111,7 @@ static const test_spec test_specs[] = {
#define RT_DECLARE
#define RT_DEFINE
#define RT_USE_DELETE
+#define RT_VALUE_TYPE TestValueType
// WIP: compiles with warnings because rt_attach is defined but not used
// #define RT_SHMEM
#include "lib/radixtree.h"
@@ -128,9 +135,9 @@ test_empty(void)
{
rt_radix_tree *radixtree;
rt_iter *iter;
- uint64 dummy;
+ TestValueType dummy;
uint64 key;
- uint64 val;
+ TestValueType val;
#ifdef RT_SHMEM
int tranche_id = LWLockNewTrancheId();
@@ -202,26 +209,26 @@ test_basic(int children, bool test_inner)
/* insert keys */
for (int i = 0; i < children; i++)
{
- if (rt_set(radixtree, keys[i], keys[i]))
+ if (rt_set(radixtree, keys[i], (TestValueType) keys[i]))
elog(ERROR, "new inserted key 0x" UINT64_HEX_FORMAT " is found ", keys[i]);
}
/* look up keys */
for (int i = 0; i < children; i++)
{
- uint64 value;
+ TestValueType value;
if (!rt_search(radixtree, keys[i], &value))
elog(ERROR, "could not find key 0x" UINT64_HEX_FORMAT, keys[i]);
- if (value != keys[i])
+ if (value != (TestValueType) keys[i])
elog(ERROR, "rt_search returned 0x" UINT64_HEX_FORMAT ", expected " UINT64_HEX_FORMAT,
- value, keys[i]);
+ value, (TestValueType) keys[i]);
}
/* update keys */
for (int i = 0; i < children; i++)
{
- if (!rt_set(radixtree, keys[i], keys[i] + 1))
+ if (!rt_set(radixtree, keys[i], (TestValueType) (keys[i] + 1)))
elog(ERROR, "could not update key 0x" UINT64_HEX_FORMAT, keys[i]);
}
@@ -230,7 +237,7 @@ test_basic(int children, bool test_inner)
{
if (!rt_delete(radixtree, keys[i]))
elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, keys[i]);
- if (rt_set(radixtree, keys[i], keys[i]))
+ if (rt_set(radixtree, keys[i], (TestValueType) keys[i]))
elog(ERROR, "new inserted key 0x" UINT64_HEX_FORMAT " is found ", keys[i]);
}
@@ -248,12 +255,12 @@ check_search_on_node(rt_radix_tree *radixtree, uint8 shift, int start, int end,
for (int i = start; i < end; i++)
{
uint64 key = ((uint64) i << shift);
- uint64 val;
+ TestValueType val;
if (!rt_search(radixtree, key, &val))
elog(ERROR, "key 0x" UINT64_HEX_FORMAT " is not found on node-%d",
key, end);
- if (val != key)
+ if (val != (TestValueType) key)
elog(ERROR, "rt_search with key 0x" UINT64_HEX_FORMAT " returns 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
key, val, key);
}
@@ -274,7 +281,7 @@ test_node_types_insert(rt_radix_tree *radixtree, uint8 shift, bool insert_asc)
uint64 key = ((uint64) i << shift);
bool found;
- found = rt_set(radixtree, key, key);
+ found = rt_set(radixtree, key, (TestValueType) key);
if (found)
elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " is found", key);
@@ -440,7 +447,7 @@ test_pattern(const test_spec * spec)
x = last_int + pattern_values[i];
- found = rt_set(radixtree, x, x);
+ found = rt_set(radixtree, x, (TestValueType) x);
if (found)
elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " found", x);
@@ -495,7 +502,7 @@ test_pattern(const test_spec * spec)
bool found;
bool expected;
uint64 x;
- uint64 v;
+ TestValueType v;
/*
* Pick next value to probe at random. We limit the probes to the
@@ -526,7 +533,7 @@ test_pattern(const test_spec * spec)
if (found != expected)
elog(ERROR, "mismatch at 0x" UINT64_HEX_FORMAT ": %d vs %d", x, found, expected);
- if (found && (v != x))
+ if (found && (v != (TestValueType) x))
elog(ERROR, "found 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
v, x);
}
@@ -549,7 +556,7 @@ test_pattern(const test_spec * spec)
{
uint64 expected = last_int + pattern_values[i];
uint64 x;
- uint64 val;
+ TestValueType val;
if (!rt_iterate_next(iter, &x, &val))
break;
@@ -558,7 +565,7 @@ test_pattern(const test_spec * spec)
elog(ERROR,
"iterate returned wrong key; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d",
x, expected, i);
- if (val != expected)
+ if (val != (TestValueType) expected)
elog(ERROR,
"iterate returned wrong value; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d", x, expected, i);
n++;
@@ -588,7 +595,7 @@ test_pattern(const test_spec * spec)
{
bool found;
uint64 x;
- uint64 v;
+ TestValueType v;
/*
* Pick next value to probe at random. We limit the probes to the
--
2.39.0
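As a reading aid for the attachments that follow: the template is instantiated by defining a handful of RT_* macros and then including radixtree.h, exactly as test_radixtree.c above and bench_radix_tree.c below do. The following is a minimal sketch, not part of the patch set, assuming a local (non-RT_SHMEM) tree and an integer value type; with RT_PREFIX set to "myrt" the generated names become myrt_create(), myrt_set(), and so on.

#include "postgres.h"

typedef uint32 MyValueType;	/* any integer type is fine at this point in the series */

#define RT_PREFIX myrt
#define RT_SCOPE static
#define RT_DECLARE
#define RT_DEFINE
#define RT_USE_DELETE
#define RT_VALUE_TYPE MyValueType
#include "lib/radixtree.h"

static void
radixtree_sketch(void)
{
	myrt_radix_tree *tree = myrt_create(CurrentMemoryContext);
	MyValueType val;

	/* myrt_set() returns true if the key was already present */
	if (!myrt_set(tree, UINT64CONST(42), (MyValueType) 100))
		elog(NOTICE, "inserted new key");

	if (myrt_search(tree, UINT64CONST(42), &val))
		elog(NOTICE, "found value %u", val);

	myrt_delete(tree, UINT64CONST(42));
	myrt_free(tree);
}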
Attachment: v21-0012-Tool-for-measuring-radix-tree-performance.patch
From 96efc422dd1858951de4e41563c081b1d2faaf5f Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 16 Sep 2022 11:57:03 +0900
Subject: [PATCH v21 12/22] Tool for measuring radix tree performance
Includes Meson support, but commented out to avoid warnings
XXX: Not for commit
---
contrib/bench_radix_tree/Makefile | 21 +
.../bench_radix_tree--1.0.sql | 76 ++
contrib/bench_radix_tree/bench_radix_tree.c | 656 ++++++++++++++++++
.../bench_radix_tree/bench_radix_tree.control | 6 +
contrib/bench_radix_tree/expected/bench.out | 13 +
contrib/bench_radix_tree/meson.build | 33 +
contrib/bench_radix_tree/sql/bench.sql | 16 +
contrib/meson.build | 1 +
8 files changed, 822 insertions(+)
create mode 100644 contrib/bench_radix_tree/Makefile
create mode 100644 contrib/bench_radix_tree/bench_radix_tree--1.0.sql
create mode 100644 contrib/bench_radix_tree/bench_radix_tree.c
create mode 100644 contrib/bench_radix_tree/bench_radix_tree.control
create mode 100644 contrib/bench_radix_tree/expected/bench.out
create mode 100644 contrib/bench_radix_tree/meson.build
create mode 100644 contrib/bench_radix_tree/sql/bench.sql
diff --git a/contrib/bench_radix_tree/Makefile b/contrib/bench_radix_tree/Makefile
new file mode 100644
index 0000000000..952bb0ceae
--- /dev/null
+++ b/contrib/bench_radix_tree/Makefile
@@ -0,0 +1,21 @@
+# contrib/bench_radix_tree/Makefile
+
+MODULE_big = bench_radix_tree
+OBJS = \
+ bench_radix_tree.o
+
+EXTENSION = bench_radix_tree
+DATA = bench_radix_tree--1.0.sql
+
+REGRESS = bench_fixed_height
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/bench_radix_tree
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
new file mode 100644
index 0000000000..2fd689aa91
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
@@ -0,0 +1,76 @@
+/* contrib/bench_radix_tree/bench_radix_tree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION bench_radix_tree" to load this file. \quit
+
+create function bench_shuffle_search(
+minblk int4,
+maxblk int4,
+random_block bool DEFAULT false,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT array_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT array_load_ms int8,
+OUT rt_search_ms int8,
+OUT array_serach_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_seq_search(
+minblk int4,
+maxblk int4,
+random_block bool DEFAULT false,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT array_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT array_load_ms int8,
+OUT rt_search_ms int8,
+OUT array_serach_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_load_random_int(
+cnt int8,
+OUT mem_allocated int8,
+OUT load_ms int8)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_search_random_nodes(
+cnt int8,
+filter_str text DEFAULT NULL,
+OUT mem_allocated int8,
+OUT search_ms int8)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C VOLATILE PARALLEL UNSAFE;
+
+create function bench_fixed_height_search(
+fanout int4,
+OUT fanout int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT rt_search_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_node128_load(
+fanout int4,
+OUT fanout int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT rt_sparseload_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c
new file mode 100644
index 0000000000..4c785c7336
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -0,0 +1,656 @@
+/*-------------------------------------------------------------------------
+ *
+ * bench_radix_tree.c
+ *
+ * Copyright (c) 2016-2022, PostgreSQL Global Development Group
+ *
+ * contrib/bench_radix_tree/bench_radix_tree.c
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "funcapi.h"
+#include "lib/radixtree.h"
+#include <math.h>
+#include "miscadmin.h"
+#include "storage/lwlock.h"
+#include "utils/builtins.h"
+#include "utils/timestamp.h"
+
+PG_MODULE_MAGIC;
+
+/* run benchmark also for binary-search case? */
+/* #define MEASURE_BINARY_SEARCH 1 */
+
+#define TIDS_PER_BLOCK_FOR_LOAD 30
+#define TIDS_PER_BLOCK_FOR_LOOKUP 50
+
+#define RT_DEBUG
+#define RT_PREFIX rt
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#define RT_USE_DELETE
+#define RT_VALUE_TYPE uint64
+// WIP: compiles with warnings because rt_attach is defined but not used
+// #define RT_SHMEM
+#include "lib/radixtree.h"
+
+/*
+ * Return the number of keys in the radix tree.
+ */
+static uint64
+rt_num_entries(rt_radix_tree *tree)
+{
+ return tree->ctl->num_keys;
+}
+
+
+PG_FUNCTION_INFO_V1(bench_seq_search);
+PG_FUNCTION_INFO_V1(bench_shuffle_search);
+PG_FUNCTION_INFO_V1(bench_load_random_int);
+PG_FUNCTION_INFO_V1(bench_fixed_height_search);
+PG_FUNCTION_INFO_V1(bench_search_random_nodes);
+PG_FUNCTION_INFO_V1(bench_node128_load);
+
+static uint64
+tid_to_key_off(ItemPointer tid, uint32 *off)
+{
+ uint64 upper;
+ uint32 shift = pg_ceil_log2_32(MaxHeapTuplesPerPage);
+ int64 tid_i;
+
+ Assert(ItemPointerGetOffsetNumber(tid) < MaxHeapTuplesPerPage);
+
+ tid_i = ItemPointerGetOffsetNumber(tid);
+ tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << shift;
+
+ /* log(sizeof(uint64) * BITS_PER_BYTE, 2) = log(64, 2) = 6 */
+ *off = tid_i & ((1 << 6) - 1);
+ upper = tid_i >> 6;
+ Assert(*off < (sizeof(uint64) * BITS_PER_BYTE));
+
+ Assert(*off < 64);
+
+ return upper;
+}
+
+static int
+shuffle_randrange(pg_prng_state *state, int lower, int upper)
+{
+ return (int) floor(pg_prng_double(state) * ((upper - lower) + 0.999999)) + lower;
+}
+
+/* Naive Fisher-Yates implementation */
+static void
+shuffle_itemptrs(ItemPointer itemptr, uint64 nitems)
+{
+ /* reproducibility */
+ pg_prng_state state;
+
+ pg_prng_seed(&state, 0);
+
+ for (int i = 0; i < nitems - 1; i++)
+ {
+ int j = shuffle_randrange(&state, i, nitems - 1);
+ ItemPointerData t = itemptr[j];
+
+ itemptr[j] = itemptr[i];
+ itemptr[i] = t;
+ }
+}
+
+static ItemPointer
+generate_tids(BlockNumber minblk, BlockNumber maxblk, int ntids_per_blk, uint64 *ntids_p,
+ bool random_block)
+{
+ ItemPointer tids;
+ uint64 maxitems;
+ uint64 ntids = 0;
+ pg_prng_state state;
+
+ maxitems = (maxblk - minblk + 1) * ntids_per_blk;
+ tids = MemoryContextAllocHuge(TopTransactionContext,
+ sizeof(ItemPointerData) * maxitems);
+
+ if (random_block)
+ pg_prng_seed(&state, 0x9E3779B185EBCA87);
+
+ for (BlockNumber blk = minblk; blk < maxblk; blk++)
+ {
+ if (random_block && !pg_prng_bool(&state))
+ continue;
+
+ for (OffsetNumber off = FirstOffsetNumber;
+ off <= ntids_per_blk; off++)
+ {
+ CHECK_FOR_INTERRUPTS();
+
+ ItemPointerSetBlockNumber(&(tids[ntids]), blk);
+ ItemPointerSetOffsetNumber(&(tids[ntids]), off);
+
+ ntids++;
+ }
+ }
+
+ *ntids_p = ntids;
+ return tids;
+}
+
+#ifdef MEASURE_BINARY_SEARCH
+static int
+vac_cmp_itemptr(const void *left, const void *right)
+{
+ BlockNumber lblk,
+ rblk;
+ OffsetNumber loff,
+ roff;
+
+ lblk = ItemPointerGetBlockNumber((ItemPointer) left);
+ rblk = ItemPointerGetBlockNumber((ItemPointer) right);
+
+ if (lblk < rblk)
+ return -1;
+ if (lblk > rblk)
+ return 1;
+
+ loff = ItemPointerGetOffsetNumber((ItemPointer) left);
+ roff = ItemPointerGetOffsetNumber((ItemPointer) right);
+
+ if (loff < roff)
+ return -1;
+ if (loff > roff)
+ return 1;
+
+ return 0;
+}
+#endif
+
+static Datum
+bench_search(FunctionCallInfo fcinfo, bool shuffle)
+{
+ BlockNumber minblk = PG_GETARG_INT32(0);
+ BlockNumber maxblk = PG_GETARG_INT32(1);
+ bool random_block = PG_GETARG_BOOL(2);
+ rt_radix_tree *rt = NULL;
+ uint64 ntids;
+ uint64 key;
+ uint64 last_key = PG_UINT64_MAX;
+ uint64 val = 0;
+ ItemPointer tids;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_load_ms,
+ rt_search_ms;
+ Datum values[7];
+ bool nulls[7];
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ tids = generate_tids(minblk, maxblk, TIDS_PER_BLOCK_FOR_LOAD, &ntids, random_block);
+
+ /* measure the load time of the radix tree */
+ rt = rt_create(CurrentMemoryContext);
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ uint32 off;
+
+ CHECK_FOR_INTERRUPTS();
+
+ key = tid_to_key_off(tid, &off);
+
+ if (last_key != PG_UINT64_MAX && last_key != key)
+ {
+ rt_set(rt, last_key, val);
+ val = 0;
+ }
+
+ last_key = key;
+ val |= (uint64) 1 << off;
+ }
+ if (last_key != PG_UINT64_MAX)
+ rt_set(rt, last_key, val);
+
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_load_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ if (shuffle)
+ shuffle_itemptrs(tids, ntids);
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+ /* measure the search time of the radix tree */
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ uint32 off;
+ volatile bool ret; /* prevent calling rt_search from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ key = tid_to_key_off(tid, &off);
+
+ ret = rt_search(rt, key, &val);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_search_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int64GetDatum(rt_num_entries(rt));
+ values[1] = Int64GetDatum(rt_memory_usage(rt));
+ values[2] = Int64GetDatum(sizeof(ItemPointerData) * ntids);
+ values[3] = Int64GetDatum(rt_load_ms);
+ nulls[4] = true; /* ar_load_ms */
+ values[5] = Int64GetDatum(rt_search_ms);
+ nulls[6] = true; /* ar_search_ms */
+
+#ifdef MEASURE_BINARY_SEARCH
+ {
+ ItemPointer itemptrs = NULL;
+
+ int64 ar_load_ms,
+ ar_search_ms;
+
+ /* measure the load time of the array */
+ itemptrs = MemoryContextAllocHuge(CurrentMemoryContext,
+ sizeof(ItemPointerData) * ntids);
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointerSetBlockNumber(&(itemptrs[i]),
+ ItemPointerGetBlockNumber(&(tids[i])));
+ ItemPointerSetOffsetNumber(&(itemptrs[i]),
+ ItemPointerGetOffsetNumber(&(tids[i])));
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ ar_load_ms = secs * 1000 + usecs / 1000;
+
+ /* next, measure the search time of the array */
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ volatile bool ret; /* prevent calling bsearch from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ ret = bsearch((void *) tid,
+ (void *) itemptrs,
+ ntids,
+ sizeof(ItemPointerData),
+ vac_cmp_itemptr);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ ar_search_ms = secs * 1000 + usecs / 1000;
+
+ /* set the result */
+ nulls[4] = false;
+ values[4] = Int64GetDatum(ar_load_ms);
+ nulls[6] = false;
+ values[6] = Int64GetDatum(ar_search_ms);
+ }
+#endif
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_seq_search(PG_FUNCTION_ARGS)
+{
+ return bench_search(fcinfo, false);
+}
+
+Datum
+bench_shuffle_search(PG_FUNCTION_ARGS)
+{
+ return bench_search(fcinfo, true);
+}
+
+Datum
+bench_load_random_int(PG_FUNCTION_ARGS)
+{
+ uint64 cnt = (uint64) PG_GETARG_INT64(0);
+ rt_radix_tree *rt;
+ pg_prng_state state;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 load_time_ms;
+ Datum values[2];
+ bool nulls[2];
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ pg_prng_seed(&state, 0);
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ uint64 key = pg_prng_uint64(&state);
+
+ rt_set(rt, key, key);
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ load_time_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int64GetDatum(rt_memory_usage(rt));
+ values[1] = Int64GetDatum(load_time_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+/* copy of splitmix64() */
+static uint64
+hash64(uint64 x)
+{
+ x ^= x >> 30;
+ x *= UINT64CONST(0xbf58476d1ce4e5b9);
+ x ^= x >> 27;
+ x *= UINT64CONST(0x94d049bb133111eb);
+ x ^= x >> 31;
+ return x;
+}
+
+/* attempts to have a relatively even population of node kinds */
+Datum
+bench_search_random_nodes(PG_FUNCTION_ARGS)
+{
+ uint64 cnt = (uint64) PG_GETARG_INT64(0);
+ rt_radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 search_time_ms;
+ Datum values[2] = {0};
+ bool nulls[2] = {0};
+ /* from trial and error */
+ uint64 filter = (((uint64) 0x7F<<32) | (0x07<<24) | (0xFF<<16) | 0xFF);
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ if (!PG_ARGISNULL(1))
+ {
+ char *filter_str = text_to_cstring(PG_GETARG_TEXT_P(1));
+
+ if (sscanf(filter_str, "0x%lX", &filter) == 0)
+ elog(ERROR, "invalid filter string %s", filter_str);
+ }
+ elog(NOTICE, "bench with filter 0x%lX", filter);
+
+ rt = rt_create(CurrentMemoryContext);
+
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ const uint64 hash = hash64(i);
+ const uint64 key = hash & filter;
+
+ rt_set(rt, key, key);
+ }
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ const uint64 hash = hash64(i);
+ const uint64 key = hash & filter;
+ uint64 val;
+ volatile bool ret; /* prevent calling rt_search from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ ret = rt_search(rt, key, &val);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ search_time_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ values[0] = Int64GetDatum(rt_memory_usage(rt));
+ values[1] = Int64GetDatum(search_time_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_fixed_height_search(PG_FUNCTION_ARGS)
+{
+ int fanout = PG_GETARG_INT32(0);
+ rt_radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_load_ms,
+ rt_search_ms;
+ Datum values[5];
+ bool nulls[5];
+
+ /* test boundary between vector and iteration */
+ const int n_keys = 5 * 16 * 16 * 16 * 16;
+ uint64 r,
+ h,
+ i,
+ j,
+ k;
+ int key_id;
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+
+ key_id = 0;
+
+ /*
+ * lower nodes have limited fanout, the top is only limited by
+ * bits-per-byte
+ */
+ for (r = 1;; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ for (i = 1; i <= fanout; i++)
+ {
+ for (j = 1; j <= fanout; j++)
+ {
+ for (k = 1; k <= fanout; k++)
+ {
+ uint64 key;
+
+ key = (r << 32) | (h << 24) | (i << 16) | (j << 8) | (k);
+
+ CHECK_FOR_INTERRUPTS();
+
+ key_id++;
+ if (key_id > n_keys)
+ goto finish_set;
+
+ rt_set(rt, key, key_id);
+ }
+ }
+ }
+ }
+ }
+finish_set:
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_load_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ /* measure the search time of the radix tree */
+ start_time = GetCurrentTimestamp();
+
+ key_id = 0;
+ for (r = 1;; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ for (i = 1; i <= fanout; i++)
+ {
+ for (j = 1; j <= fanout; j++)
+ {
+ for (k = 1; k <= fanout; k++)
+ {
+ uint64 key,
+ val;
+
+ key = (r << 32) | (h << 24) | (i << 16) | (j << 8) | (k);
+
+ CHECK_FOR_INTERRUPTS();
+
+ key_id++;
+ if (key_id > n_keys)
+ goto finish_search;
+
+ rt_search(rt, key, &val);
+ }
+ }
+ }
+ }
+ }
+finish_search:
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_search_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int32GetDatum(fanout);
+ values[1] = Int64GetDatum(rt_num_entries(rt));
+ values[2] = Int64GetDatum(rt_memory_usage(rt));
+ values[3] = Int64GetDatum(rt_load_ms);
+ values[4] = Int64GetDatum(rt_search_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_node128_load(PG_FUNCTION_ARGS)
+{
+ int fanout = PG_GETARG_INT32(0);
+ rt_radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_sparseload_ms;
+ Datum values[5];
+ bool nulls[5];
+
+ uint64 r,
+ h;
+ int key_id;
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ rt = rt_create(CurrentMemoryContext);
+
+ key_id = 0;
+
+ for (r = 1; r <= fanout; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (h);
+
+ key_id++;
+ rt_set(rt, key, key_id);
+ }
+ }
+
+ rt_stats(rt);
+
+ /* measure sparse deletion and re-loading */
+ start_time = GetCurrentTimestamp();
+
+ for (int t = 0; t<10000; t++)
+ {
+ /* delete one key in each leaf */
+ for (r = 1; r <= fanout; r++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (fanout);
+
+ rt_delete(rt, key);
+ }
+
+ /* add them all back */
+ key_id = 0;
+ for (r = 1; r <= fanout; r++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (fanout);
+
+ key_id++;
+ rt_set(rt, key, key_id);
+ }
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_sparseload_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int32GetDatum(fanout);
+ values[1] = Int64GetDatum(rt_num_entries(rt));
+ values[2] = Int64GetDatum(rt_memory_usage(rt));
+ values[3] = Int64GetDatum(rt_sparseload_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
diff --git a/contrib/bench_radix_tree/bench_radix_tree.control b/contrib/bench_radix_tree/bench_radix_tree.control
new file mode 100644
index 0000000000..1d988e6c9a
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.control
@@ -0,0 +1,6 @@
+# bench_radix_tree extension
+comment = 'benchmark suite for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/bench_radix_tree'
+relocatable = true
+trusted = true
diff --git a/contrib/bench_radix_tree/expected/bench.out b/contrib/bench_radix_tree/expected/bench.out
new file mode 100644
index 0000000000..60c303892e
--- /dev/null
+++ b/contrib/bench_radix_tree/expected/bench.out
@@ -0,0 +1,13 @@
+create extension bench_radix_tree;
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
diff --git a/contrib/bench_radix_tree/meson.build b/contrib/bench_radix_tree/meson.build
new file mode 100644
index 0000000000..332c1ae7df
--- /dev/null
+++ b/contrib/bench_radix_tree/meson.build
@@ -0,0 +1,33 @@
+bench_radix_tree_sources = files(
+ 'bench_radix_tree.c',
+)
+
+if host_system == 'windows'
+ bench_radix_tree_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'bench_radix_tree',
+ '--FILEDESC', 'bench_radix_tree - performance test code for radix tree',])
+endif
+
+bench_radix_tree = shared_module('bench_radix_tree',
+ bench_radix_tree_sources,
+ link_with: pgport_srv,
+ kwargs: pg_mod_args,
+)
+testprep_targets += bench_radix_tree
+
+install_data(
+ 'bench_radix_tree.control',
+ 'bench_radix_tree--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'bench_radix_tree',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'bench_radix_tree',
+ ],
+ },
+}
diff --git a/contrib/bench_radix_tree/sql/bench.sql b/contrib/bench_radix_tree/sql/bench.sql
new file mode 100644
index 0000000000..a46018c9d4
--- /dev/null
+++ b/contrib/bench_radix_tree/sql/bench.sql
@@ -0,0 +1,16 @@
+create extension bench_radix_tree;
+
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
diff --git a/contrib/meson.build b/contrib/meson.build
index bd4a57c43c..52253de793 100644
--- a/contrib/meson.build
+++ b/contrib/meson.build
@@ -12,6 +12,7 @@ subdir('amcheck')
subdir('auth_delay')
subdir('auto_explain')
subdir('basic_archive')
+#subdir('bench_radix_tree')
subdir('bloom')
subdir('basebackup_to_shell')
subdir('bool_plperl')
--
2.39.0
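Before moving on to the next patch, a worked example of the key encoding in tid_to_key_off() above may help, assuming the default 8kB block size so that MaxHeapTuplesPerPage is 291 and pg_ceil_log2_32(291) is 9 (these numbers are not spelled out in the patch; they follow from the defaults):

    block = 10, offset = 7
    tid_i = offset | (block << 9) = 7 | 5120 = 5127
    off   = tid_i & 63            = 7        (bit position within the uint64 value)
    key   = tid_i >> 6            = 80       (radix tree key)
    value stored under key 80    |= UINT64CONST(1) << 7

So one radix tree entry covers 64 consecutive (block, offset) slots, and each heap block, rounded up to 512 slots, spans 8 consecutive keys.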
Attachment: v21-0013-Get-rid-of-NODE_IS_EMPTY-macro.patch
From a3829e483ac68d31efb19cb2eca128e200d92c1f Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Sat, 21 Jan 2023 13:40:28 +0700
Subject: [PATCH v21 13/22] Get rid of NODE_IS_EMPTY macro
It's already pretty clear what "count == 0" means, and the
existing comments make it obvious.
---
src/include/lib/radixtree.h | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 4a2dad82bf..567eab4bc8 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -372,7 +372,6 @@ typedef struct RT_NODE
#endif
#define NODE_IS_LEAF(n) (((RT_PTR_LOCAL) (n))->shift == 0)
-#define NODE_IS_EMPTY(n) (((RT_PTR_LOCAL) (n))->count == 0)
#define VAR_NODE_HAS_FREE_SLOT(node) \
((node)->base.n.count < (node)->base.n.fanout)
#define FIXED_NODE_HAS_FREE_SLOT(node, class) \
@@ -1701,7 +1700,7 @@ RT_DELETE(RT_RADIX_TREE *tree, uint64 key)
* Return if the leaf node still has keys and we don't need to delete the
* node.
*/
- if (!NODE_IS_EMPTY(node))
+ if (node->count > 0)
return true;
/* Free the empty leaf node */
@@ -1717,7 +1716,7 @@ RT_DELETE(RT_RADIX_TREE *tree, uint64 key)
Assert(deleted);
/* If the node didn't become empty, we stop deleting the key */
- if (!NODE_IS_EMPTY(node))
+ if (node->count > 0)
break;
/* The node became empty */
@@ -2239,7 +2238,6 @@ RT_DUMP(RT_RADIX_TREE *tree)
/* locally declared macros */
#undef NODE_IS_LEAF
-#undef NODE_IS_EMPTY
#undef VAR_NODE_HAS_FREE_SLOT
#undef FIXED_NODE_HAS_FREE_SLOT
#undef RT_NODE_KIND_COUNT
--
2.39.0
Attachment: v21-0015-Get-rid-of-FIXED_NODE_HAS_FREE_SLOT.patch
From f1288439306d54677d626cfc0c1cbdc41fcf0ca5 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Sun, 22 Jan 2023 11:53:33 +0700
Subject: [PATCH v21 15/22] Get rid of FIXED_NODE_HAS_FREE_SLOT
It's only used in one assert for the node256 kind, whose
fanout is necessarily fixed, and we already have a
convenient macro to compare that with.
---
src/include/lib/radixtree.h | 3 ---
src/include/lib/radixtree_insert_impl.h | 2 +-
2 files changed, 1 insertion(+), 4 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index d48c915373..8fbc0b5086 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -374,8 +374,6 @@ typedef struct RT_NODE
#define NODE_IS_LEAF(n) (((RT_PTR_LOCAL) (n))->shift == 0)
#define VAR_NODE_HAS_FREE_SLOT(node) \
((node)->base.n.count < (node)->base.n.fanout)
-#define FIXED_NODE_HAS_FREE_SLOT(node, class) \
- ((node)->base.n.count < RT_SIZE_CLASS_INFO[class].fanout)
/* Base type of each node kinds for leaf and inner nodes */
/* The base types must be a be able to accommodate the largest size
@@ -2262,7 +2260,6 @@ RT_DUMP(RT_RADIX_TREE *tree)
/* locally declared macros */
#undef NODE_IS_LEAF
#undef VAR_NODE_HAS_FREE_SLOT
-#undef FIXED_NODE_HAS_FREE_SLOT
#undef RT_NODE_KIND_COUNT
#undef RT_SIZE_CLASS_COUNT
#undef RT_RADIX_TREE_MAGIC
diff --git a/src/include/lib/radixtree_insert_impl.h b/src/include/lib/radixtree_insert_impl.h
index 8470c8fc70..b484b7a099 100644
--- a/src/include/lib/radixtree_insert_impl.h
+++ b/src/include/lib/radixtree_insert_impl.h
@@ -286,7 +286,7 @@
#else
chunk_exists = RT_NODE_INNER_256_IS_CHUNK_USED(n256, chunk);
#endif
- Assert(chunk_exists || FIXED_NODE_HAS_FREE_SLOT(n256, RT_CLASS_256));
+ Assert(chunk_exists || node->count < RT_NODE_MAX_SLOTS);
#ifdef RT_NODE_LEVEL_LEAF
RT_NODE_LEAF_256_SET(n256, chunk, value);
--
2.39.0
Attachment: v21-0014-Add-some-comments-for-insert-logic.patch
From 503ccef8841efcb0809acc73f6f4cc2428342080 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Sat, 21 Jan 2023 14:21:55 +0700
Subject: [PATCH v21 14/22] Add some comments for insert logic
---
src/include/lib/radixtree.h | 29 ++++++++++++++++++++++---
src/include/lib/radixtree_insert_impl.h | 5 +++++
2 files changed, 31 insertions(+), 3 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 567eab4bc8..d48c915373 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -731,8 +731,8 @@ RT_NODE_3_GET_INSERTPOS(RT_NODE_BASE_3 *node, uint8 chunk)
}
/*
- * Return index of the first element in 'base' that equals 'key'. Return -1
- * if there is no such element.
+ * Return index of the first element in the node's chunk array that equals
+ * 'chunk'. Return -1 if there is no such element.
*/
static inline int
RT_NODE_32_SEARCH_EQ(RT_NODE_BASE_32 *node, uint8 chunk)
@@ -762,14 +762,22 @@ RT_NODE_32_SEARCH_EQ(RT_NODE_BASE_32 *node, uint8 chunk)
#endif
#ifndef USE_NO_SIMD
+ /* replicate the search key */
spread_chunk = vector8_broadcast(chunk);
+
+ /* compare to the 32 keys stored in the node */
vector8_load(&haystack1, &node->chunks[0]);
vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
cmp1 = vector8_eq(spread_chunk, haystack1);
cmp2 = vector8_eq(spread_chunk, haystack2);
+
+ /* convert comparison to a bitfield */
bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+
+ /* mask off invalid entries */
bitfield &= ((UINT64CONST(1) << count) - 1);
+ /* convert bitfield to index by counting trailing zeros */
if (bitfield)
index_simd = pg_rightmost_one_pos32(bitfield);
@@ -781,7 +789,8 @@ RT_NODE_32_SEARCH_EQ(RT_NODE_BASE_32 *node, uint8 chunk)
}
/*
- * Return index of the chunk to insert into chunks in the given node.
+ * Return index of the node's chunk array to insert into,
+ * such that the chunk array remains ordered.
*/
static inline int
RT_NODE_32_GET_INSERTPOS(RT_NODE_BASE_32 *node, uint8 chunk)
@@ -804,12 +813,26 @@ RT_NODE_32_GET_INSERTPOS(RT_NODE_BASE_32 *node, uint8 chunk)
for (index = 0; index < count; index++)
{
+ /*
+ * This is coded with '>=' to match what we can do with SIMD,
+ * with an assert to keep us honest.
+ */
if (node->chunks[index] >= chunk)
+ {
+ Assert(node->chunks[index] != chunk);
break;
+ }
}
#endif
#ifndef USE_NO_SIMD
+ /*
+ * This is a bit more complicated than RT_NODE_32_SEARCH_EQ(), because
+ * no unsigned uint8 comparison instruction exists, at least for SSE2. So
+ * we need to play some trickery using vector8_min() to effectively get
+ * <=. There'll never be any equal elements in the current uses, but that's
+ * what we get here...
+ */
spread_chunk = vector8_broadcast(chunk);
vector8_load(&haystack1, &node->chunks[0]);
vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
diff --git a/src/include/lib/radixtree_insert_impl.h b/src/include/lib/radixtree_insert_impl.h
index 16461bdb03..8470c8fc70 100644
--- a/src/include/lib/radixtree_insert_impl.h
+++ b/src/include/lib/radixtree_insert_impl.h
@@ -162,6 +162,11 @@
#endif
}
+ /*
+ * Since we just copied a dense array, we can set the bits
+ * using a single store, provided the length of that array
+ * is at most the number of bits in a bitmapword.
+ */
Assert(class32_max.fanout <= sizeof(bitmapword) * BITS_PER_BYTE);
new125->base.isset[0] = (bitmapword) (((uint64) 1 << class32_max.fanout) - 1);
--
2.39.0
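To restate the comments this patch adds in scalar terms: RT_NODE_32_SEARCH_EQ() builds a bitfield with one bit per occupied slot, set where that slot's chunk equals the search key, and then returns the position of the lowest set bit. A plain-C sketch of the same computation, for illustration only and not part of the patch set (pg_rightmost_one_pos32() is the same helper the SIMD path already uses):

#include "postgres.h"
#include "port/pg_bitutils.h"

/*
 * Illustration: scalar equivalent of the SIMD path in RT_NODE_32_SEARCH_EQ().
 * Returns the index of the first slot whose chunk equals 'chunk', or -1.
 */
static int
node32_search_eq_scalar(const uint8 *chunks, int count, uint8 chunk)
{
	uint32		bitfield = 0;

	for (int i = 0; i < count; i++)
	{
		if (chunks[i] == chunk)
			bitfield |= (uint32) 1 << i;
	}

	/* only the first 'count' bits were ever set, so no extra masking needed */
	if (bitfield == 0)
		return -1;

	/* index = position of the rightmost (lowest) set bit */
	return pg_rightmost_one_pos32(bitfield);
}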
Attachment: v21-0011-Expand-commentary-for-kinds-vs.-size-classes.patch
From 8711b9afb019ba45a5c4c3e2ec41f72130208a68 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Sat, 21 Jan 2023 12:52:53 +0700
Subject: [PATCH v21 11/22] Expand commentary for kinds vs. size classes
Also move class enum closer to array and add #undef's
---
src/include/lib/radixtree.h | 76 ++++++++++++++++++++++++++-----------
1 file changed, 53 insertions(+), 23 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 6cc8442c89..4a2dad82bf 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -288,22 +288,26 @@ RT_SCOPE void RT_STATS(RT_RADIX_TREE *tree);
#define BM_BIT(x) ((x) % BITS_PER_BITMAPWORD)
/*
- * Supported radix tree node kinds and size classes.
+ * Node kinds
*
- * There are 4 node kinds and each node kind have one or two size classes,
- * partial and full. The size classes in the same node kind have the same
- * node structure but have the different number of fanout that is stored
- * in 'fanout' of RT_NODE. For example in size class 15, when a 16th element
- * is to be inserted, we allocate a larger area and memcpy the entire old
- * node to it.
+ * The different node kinds are what make the tree "adaptive".
*
- * This technique allows us to limit the node kinds to 4, which limits the
- * number of cases in switch statements. It also allows a possible future
- * optimization to encode the node kind in a pointer tag.
+ * Each node kind is associated with a different datatype and different
+ * search/set/delete/iterate algorithms adapted for its size. The largest
+ * kind, node256 is basically the same as a traditional radix tree,
+ * and would be most wasteful of memory when sparsely populated. The
+ * smaller nodes expend some additional CPU time to enable a smaller
+ * memory footprint.
*
- * These size classes have been chose carefully so that it minimizes the
- * allocator padding in both the inner and leaf nodes on DSA.
- * node
+ * XXX There are 4 node kinds, and this should never be increased,
+ * for several reasons:
+ * 1. With 5 or more kinds, gcc tends to use a jump table for switch
+ * statements.
+ * 2. The 4 kinds can be represented with 2 bits, so we have the option
+ * in the future to tag the node pointer with the kind, even on
+ * platforms with 32-bit pointers. This might speed up node traversal
+ * in trees with highly random node kinds.
+ * 3. We can have multiple size classes per node kind.
*/
#define RT_NODE_KIND_3 0x00
#define RT_NODE_KIND_32 0x01
@@ -320,16 +324,6 @@ RT_SCOPE void RT_STATS(RT_RADIX_TREE *tree);
#endif /* RT_COMMON */
-
-typedef enum RT_SIZE_CLASS
-{
- RT_CLASS_3_FULL = 0,
- RT_CLASS_32_PARTIAL,
- RT_CLASS_32_FULL,
- RT_CLASS_125_FULL,
- RT_CLASS_256
-} RT_SIZE_CLASS;
-
/* Common type for all nodes types */
typedef struct RT_NODE
{
@@ -508,6 +502,37 @@ typedef struct RT_NODE_LEAF_256
RT_VALUE_TYPE values[RT_NODE_MAX_SLOTS];
} RT_NODE_LEAF_256;
+/*
+ * Node size classes
+ *
+ * Nodes of different kinds necessarily belong to different size classes.
+ * The main innovation in our implementation compared to the ART paper
+ * is decoupling the notion of size class from kind.
+ *
+ * The size classes within a given node kind have the same underlying
+ * type, but a variable number of children/values. This is possible
+ * because the base type contains small fixed data structures that
+ * work the same way regardless of how full the node is. We store the
+ * node's allocated capacity in the "fanout" member of RT_NODE, to allow
+ * runtime introspection.
+ *
+ * Growing from one node kind to another requires special code for each
+ * case, but growing from one size class to another within the same kind
+ * is basically just allocate + memcpy.
+ *
+ * The size classes have been chosen so that inner nodes on platforms
+ * with 64-bit pointers (and leaf nodes when using a 64-bit key) are
+ * equal to or slightly smaller than some DSA size class.
+ */
+typedef enum RT_SIZE_CLASS
+{
+ RT_CLASS_3_FULL = 0,
+ RT_CLASS_32_PARTIAL,
+ RT_CLASS_32_FULL,
+ RT_CLASS_125_FULL,
+ RT_CLASS_256
+} RT_SIZE_CLASS;
+
/* Information for each size class */
typedef struct RT_SIZE_CLASS_ELEM
{
@@ -2217,6 +2242,7 @@ RT_DUMP(RT_RADIX_TREE *tree)
#undef NODE_IS_EMPTY
#undef VAR_NODE_HAS_FREE_SLOT
#undef FIXED_NODE_HAS_FREE_SLOT
+#undef RT_NODE_KIND_COUNT
#undef RT_SIZE_CLASS_COUNT
#undef RT_RADIX_TREE_MAGIC
@@ -2229,6 +2255,10 @@ RT_DUMP(RT_RADIX_TREE *tree)
#undef RT_ITER
#undef RT_NODE
#undef RT_NODE_ITER
+#undef RT_NODE_KIND_3
+#undef RT_NODE_KIND_32
+#undef RT_NODE_KIND_125
+#undef RT_NODE_KIND_256
#undef RT_NODE_BASE_3
#undef RT_NODE_BASE_32
#undef RT_NODE_BASE_125
--
2.39.0
Attachment: v21-0019-Standardize-on-testing-for-is-leaf.patch
From 42bdeca3facdcaf43284ba0a6c85b6db0ac63ead Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Sun, 22 Jan 2023 15:10:10 +0700
Subject: [PATCH v21 19/22] Standardize on testing for "is leaf"
Some recent code decided to test for "is inner", so make
everything consistent.
---
src/include/lib/radixtree.h | 38 ++++++++++++-------------
src/include/lib/radixtree_insert_impl.h | 18 ++++++------
2 files changed, 28 insertions(+), 28 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 95124696ef..5927437034 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -1019,24 +1019,24 @@ RT_SHIFT_GET_MAX_VAL(int shift)
* Allocate a new node with the given node kind.
*/
static RT_PTR_ALLOC
-RT_ALLOC_NODE(RT_RADIX_TREE *tree, RT_SIZE_CLASS size_class, bool inner)
+RT_ALLOC_NODE(RT_RADIX_TREE *tree, RT_SIZE_CLASS size_class, bool is_leaf)
{
RT_PTR_ALLOC allocnode;
size_t allocsize;
- if (inner)
- allocsize = RT_SIZE_CLASS_INFO[size_class].inner_size;
- else
+ if (is_leaf)
allocsize = RT_SIZE_CLASS_INFO[size_class].leaf_size;
+ else
+ allocsize = RT_SIZE_CLASS_INFO[size_class].inner_size;
#ifdef RT_SHMEM
allocnode = dsa_allocate(tree->dsa, allocsize);
#else
- if (inner)
- allocnode = (RT_PTR_ALLOC) MemoryContextAlloc(tree->inner_slabs[size_class],
+ if (is_leaf)
+ allocnode = (RT_PTR_ALLOC) MemoryContextAlloc(tree->leaf_slabs[size_class],
allocsize);
else
- allocnode = (RT_PTR_ALLOC) MemoryContextAlloc(tree->leaf_slabs[size_class],
+ allocnode = (RT_PTR_ALLOC) MemoryContextAlloc(tree->inner_slabs[size_class],
allocsize);
#endif
@@ -1050,12 +1050,12 @@ RT_ALLOC_NODE(RT_RADIX_TREE *tree, RT_SIZE_CLASS size_class, bool inner)
/* Initialize the node contents */
static inline void
-RT_INIT_NODE(RT_PTR_LOCAL node, uint8 kind, RT_SIZE_CLASS size_class, bool inner)
+RT_INIT_NODE(RT_PTR_LOCAL node, uint8 kind, RT_SIZE_CLASS size_class, bool is_leaf)
{
- if (inner)
- MemSet(node, 0, RT_SIZE_CLASS_INFO[size_class].inner_size);
- else
+ if (is_leaf)
MemSet(node, 0, RT_SIZE_CLASS_INFO[size_class].leaf_size);
+ else
+ MemSet(node, 0, RT_SIZE_CLASS_INFO[size_class].inner_size);
node->kind = kind;
@@ -1082,13 +1082,13 @@ static void
RT_NEW_ROOT(RT_RADIX_TREE *tree, uint64 key)
{
int shift = RT_KEY_GET_SHIFT(key);
- bool inner = shift > 0;
+ bool is_leaf = shift == 0;
RT_PTR_ALLOC allocnode;
RT_PTR_LOCAL newnode;
- allocnode = RT_ALLOC_NODE(tree, RT_CLASS_3, inner);
+ allocnode = RT_ALLOC_NODE(tree, RT_CLASS_3, is_leaf);
newnode = RT_PTR_GET_LOCAL(tree, allocnode);
- RT_INIT_NODE(newnode, RT_NODE_KIND_3, RT_CLASS_3, inner);
+ RT_INIT_NODE(newnode, RT_NODE_KIND_3, RT_CLASS_3, is_leaf);
newnode->shift = shift;
tree->ctl->max_val = RT_SHIFT_GET_MAX_VAL(shift);
tree->ctl->root = allocnode;
@@ -1107,10 +1107,10 @@ RT_COPY_NODE(RT_PTR_LOCAL newnode, RT_PTR_LOCAL oldnode)
*/
static inline RT_PTR_LOCAL
RT_SWITCH_NODE_KIND(RT_RADIX_TREE *tree, RT_PTR_ALLOC allocnode, RT_PTR_LOCAL node,
- uint8 new_kind, uint8 new_class, bool inner)
+ uint8 new_kind, uint8 new_class, bool is_leaf)
{
RT_PTR_LOCAL newnode = RT_PTR_GET_LOCAL(tree, allocnode);
- RT_INIT_NODE(newnode, new_kind, new_class, inner);
+ RT_INIT_NODE(newnode, new_kind, new_class, is_leaf);
RT_COPY_NODE(newnode, node);
return newnode;
@@ -1247,11 +1247,11 @@ RT_SET_EXTEND(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE value, RT_PTR_LOCAL
RT_PTR_ALLOC allocchild;
RT_PTR_LOCAL newchild;
int newshift = shift - RT_NODE_SPAN;
- bool inner = newshift > 0;
+ bool is_leaf = newshift == 0;
- allocchild = RT_ALLOC_NODE(tree, RT_CLASS_3, inner);
+ allocchild = RT_ALLOC_NODE(tree, RT_CLASS_3, is_leaf);
newchild = RT_PTR_GET_LOCAL(tree, allocchild);
- RT_INIT_NODE(newchild, RT_NODE_KIND_3, RT_CLASS_3, inner);
+ RT_INIT_NODE(newchild, RT_NODE_KIND_3, RT_CLASS_3, is_leaf);
newchild->shift = newshift;
RT_NODE_INSERT_INNER(tree, parent, stored_node, node, key, allocchild);
diff --git a/src/include/lib/radixtree_insert_impl.h b/src/include/lib/radixtree_insert_impl.h
index 0fcebf1c6b..22aca0e6cc 100644
--- a/src/include/lib/radixtree_insert_impl.h
+++ b/src/include/lib/radixtree_insert_impl.h
@@ -16,10 +16,10 @@
bool chunk_exists = false;
#ifdef RT_NODE_LEVEL_LEAF
- const bool inner = false;
+ const bool is_leaf = true;
Assert(RT_NODE_IS_LEAF(node));
#else
- const bool inner = true;
+ const bool is_leaf = false;
Assert(!RT_NODE_IS_LEAF(node));
#endif
@@ -52,8 +52,8 @@
const RT_SIZE_CLASS new_class = RT_CLASS_32_MIN;
/* grow node from 3 to 32 */
- allocnode = RT_ALLOC_NODE(tree, new_class, inner);
- newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, inner);
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
+ newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, is_leaf);
new32 = (RT_NODE32_TYPE *) newnode;
#ifdef RT_NODE_LEVEL_LEAF
@@ -124,7 +124,7 @@
Assert(n32->base.n.fanout == class32_min.fanout);
/* grow to the next size class of this kind */
- allocnode = RT_ALLOC_NODE(tree, new_class, inner);
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
newnode = RT_PTR_GET_LOCAL(tree, allocnode);
n32 = (RT_NODE32_TYPE *) newnode;
@@ -150,8 +150,8 @@
Assert(n32->base.n.fanout == class32_max.fanout);
/* grow node from 32 to 125 */
- allocnode = RT_ALLOC_NODE(tree, new_class, inner);
- newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, inner);
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
+ newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, is_leaf);
new125 = (RT_NODE125_TYPE *) newnode;
for (int i = 0; i < class32_max.fanout; i++)
@@ -229,8 +229,8 @@
const RT_SIZE_CLASS new_class = RT_CLASS_256;
/* grow node from 125 to 256 */
- allocnode = RT_ALLOC_NODE(tree, new_class, inner);
- newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, inner);
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
+ newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, is_leaf);
new256 = (RT_NODE256_TYPE *) newnode;
for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n125->base.n.count; i++)
--
2.39.0
Attachment: v21-0016-s-VAR_NODE_HAS_FREE_SLOT-RT_NODE_MUST_GROW.patch
From 55c4517ebaa67e94b2e52e1a1164d44bf09e0bb4 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Sun, 22 Jan 2023 12:11:11 +0700
Subject: [PATCH v21 16/22] s/VAR_NODE_HAS_FREE_SLOT/RT_NODE_MUST_GROW/
---
src/include/lib/radixtree.h | 6 +++---
src/include/lib/radixtree_insert_impl.h | 8 ++++----
2 files changed, 7 insertions(+), 7 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 8fbc0b5086..cd8b8d1c22 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -372,8 +372,8 @@ typedef struct RT_NODE
#endif
#define NODE_IS_LEAF(n) (((RT_PTR_LOCAL) (n))->shift == 0)
-#define VAR_NODE_HAS_FREE_SLOT(node) \
- ((node)->base.n.count < (node)->base.n.fanout)
+#define RT_NODE_MUST_GROW(node) \
+ ((node)->base.n.count == (node)->base.n.fanout)
/* Base type of each node kinds for leaf and inner nodes */
/* The base types must be a be able to accommodate the largest size
@@ -2259,7 +2259,7 @@ RT_DUMP(RT_RADIX_TREE *tree)
/* locally declared macros */
#undef NODE_IS_LEAF
-#undef VAR_NODE_HAS_FREE_SLOT
+#undef RT_NODE_MUST_GROW
#undef RT_NODE_KIND_COUNT
#undef RT_SIZE_CLASS_COUNT
#undef RT_RADIX_TREE_MAGIC
diff --git a/src/include/lib/radixtree_insert_impl.h b/src/include/lib/radixtree_insert_impl.h
index b484b7a099..a0f46b37d3 100644
--- a/src/include/lib/radixtree_insert_impl.h
+++ b/src/include/lib/radixtree_insert_impl.h
@@ -43,7 +43,7 @@
break;
}
- if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n3)))
+ if (unlikely(RT_NODE_MUST_GROW(n3)))
{
RT_PTR_ALLOC allocnode;
RT_PTR_LOCAL newnode;
@@ -114,7 +114,7 @@
break;
}
- if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)) &&
+ if (unlikely(RT_NODE_MUST_GROW(n32)) &&
n32->base.n.fanout == class32_min.fanout)
{
RT_PTR_ALLOC allocnode;
@@ -137,7 +137,7 @@
node = newnode;
}
- if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)))
+ if (unlikely(RT_NODE_MUST_GROW(n32)))
{
RT_PTR_ALLOC allocnode;
RT_PTR_LOCAL newnode;
@@ -218,7 +218,7 @@
break;
}
- if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n125)))
+ if (unlikely(RT_NODE_MUST_GROW(n125)))
{
RT_PTR_ALLOC allocnode;
RT_PTR_LOCAL newnode;
--
2.39.0
Attachment: v21-0017-Remove-some-maintenance-hazards-in-growing-nodes.patch
From 5c577598bc1f667333c97dca70155df6d296c251 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Sun, 22 Jan 2023 13:29:18 +0700
Subject: [PATCH v21 17/22] Remove some maintenance hazards in growing nodes
Arrange so that kinds with only one size class have no
"full" suffix. This ensures that splitting such a class
into multiple classes will force compilation errors if
the dev has not thought through which new class should
apply in each case.
For node32, make growing into a new size class a bit
more general. It's not clear we would ever need more
than 2 classes, but let's not put up additional road
blocks. Change partial/full to min/max. It's a bit
shorter this way, matches some newer coding, and allows
for the possibility of a "mid" class.
Also remove RT_KIND_MIN_SIZE_CLASS, since it doesn't
reduce the need for future changes; it only moves such
a change further away from its effect.
In passing, move a declaration to the block where it's used.
---
src/include/lib/radixtree.h | 66 +++++++++++--------------
src/include/lib/radixtree_insert_impl.h | 16 +++---
2 files changed, 37 insertions(+), 45 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index cd8b8d1c22..7c3f3dcf4f 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -196,12 +196,11 @@
#define RT_SIZE_CLASS RT_MAKE_NAME(size_class)
#define RT_SIZE_CLASS_ELEM RT_MAKE_NAME(size_class_elem)
#define RT_SIZE_CLASS_INFO RT_MAKE_NAME(size_class_info)
-#define RT_CLASS_3_FULL RT_MAKE_NAME(class_3_full)
-#define RT_CLASS_32_PARTIAL RT_MAKE_NAME(class_32_partial)
-#define RT_CLASS_32_FULL RT_MAKE_NAME(class_32_full)
-#define RT_CLASS_125_FULL RT_MAKE_NAME(class_125_full)
+#define RT_CLASS_3 RT_MAKE_NAME(class_3)
+#define RT_CLASS_32_MIN RT_MAKE_NAME(class_32_min)
+#define RT_CLASS_32_MAX RT_MAKE_NAME(class_32_max)
+#define RT_CLASS_125 RT_MAKE_NAME(class_125)
#define RT_CLASS_256 RT_MAKE_NAME(class_256)
-#define RT_KIND_MIN_SIZE_CLASS RT_MAKE_NAME(kind_min_size_class)
/* generate forward declarations necessary to use the radix tree */
#ifdef RT_DECLARE
@@ -523,10 +522,10 @@ typedef struct RT_NODE_LEAF_256
*/
typedef enum RT_SIZE_CLASS
{
- RT_CLASS_3_FULL = 0,
- RT_CLASS_32_PARTIAL,
- RT_CLASS_32_FULL,
- RT_CLASS_125_FULL,
+ RT_CLASS_3 = 0,
+ RT_CLASS_32_MIN,
+ RT_CLASS_32_MAX,
+ RT_CLASS_125,
RT_CLASS_256
} RT_SIZE_CLASS;
@@ -542,25 +541,25 @@ typedef struct RT_SIZE_CLASS_ELEM
} RT_SIZE_CLASS_ELEM;
static const RT_SIZE_CLASS_ELEM RT_SIZE_CLASS_INFO[] = {
- [RT_CLASS_3_FULL] = {
+ [RT_CLASS_3] = {
.name = "radix tree node 3",
.fanout = 3,
.inner_size = sizeof(RT_NODE_INNER_3) + 3 * sizeof(RT_PTR_ALLOC),
.leaf_size = sizeof(RT_NODE_LEAF_3) + 3 * sizeof(RT_VALUE_TYPE),
},
- [RT_CLASS_32_PARTIAL] = {
+ [RT_CLASS_32_MIN] = {
.name = "radix tree node 15",
.fanout = 15,
.inner_size = sizeof(RT_NODE_INNER_32) + 15 * sizeof(RT_PTR_ALLOC),
.leaf_size = sizeof(RT_NODE_LEAF_32) + 15 * sizeof(RT_VALUE_TYPE),
},
- [RT_CLASS_32_FULL] = {
+ [RT_CLASS_32_MAX] = {
.name = "radix tree node 32",
.fanout = 32,
.inner_size = sizeof(RT_NODE_INNER_32) + 32 * sizeof(RT_PTR_ALLOC),
.leaf_size = sizeof(RT_NODE_LEAF_32) + 32 * sizeof(RT_VALUE_TYPE),
},
- [RT_CLASS_125_FULL] = {
+ [RT_CLASS_125] = {
.name = "radix tree node 125",
.fanout = 125,
.inner_size = sizeof(RT_NODE_INNER_125) + 125 * sizeof(RT_PTR_ALLOC),
@@ -576,14 +575,6 @@ static const RT_SIZE_CLASS_ELEM RT_SIZE_CLASS_INFO[] = {
#define RT_SIZE_CLASS_COUNT lengthof(RT_SIZE_CLASS_INFO)
-/* Map from the node kind to its minimum size class */
-static const RT_SIZE_CLASS RT_KIND_MIN_SIZE_CLASS[RT_NODE_KIND_COUNT] = {
- [RT_NODE_KIND_3] = RT_CLASS_3_FULL,
- [RT_NODE_KIND_32] = RT_CLASS_32_PARTIAL,
- [RT_NODE_KIND_125] = RT_CLASS_125_FULL,
- [RT_NODE_KIND_256] = RT_CLASS_256,
-};
-
#ifdef RT_SHMEM
/* A magic value used to identify our radix tree */
#define RT_RADIX_TREE_MAGIC 0x54A48167
@@ -893,7 +884,7 @@ static inline void
RT_CHUNK_CHILDREN_ARRAY_COPY(uint8 *src_chunks, RT_PTR_ALLOC *src_children,
uint8 *dst_chunks, RT_PTR_ALLOC *dst_children)
{
- const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_3_FULL].fanout;
+ const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_3].fanout;
const Size chunk_size = sizeof(uint8) * fanout;
const Size children_size = sizeof(RT_PTR_ALLOC) * fanout;
@@ -905,7 +896,7 @@ static inline void
RT_CHUNK_VALUES_ARRAY_COPY(uint8 *src_chunks, RT_VALUE_TYPE *src_values,
uint8 *dst_chunks, RT_VALUE_TYPE *dst_values)
{
- const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_3_FULL].fanout;
+ const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_3].fanout;
const Size chunk_size = sizeof(uint8) * fanout;
const Size values_size = sizeof(RT_VALUE_TYPE) * fanout;
@@ -1105,9 +1096,9 @@ RT_NEW_ROOT(RT_RADIX_TREE *tree, uint64 key)
RT_PTR_ALLOC allocnode;
RT_PTR_LOCAL newnode;
- allocnode = RT_ALLOC_NODE(tree, RT_CLASS_3_FULL, inner);
+ allocnode = RT_ALLOC_NODE(tree, RT_CLASS_3, inner);
newnode = RT_PTR_GET_LOCAL(tree, allocnode);
- RT_INIT_NODE(newnode, RT_NODE_KIND_3, RT_CLASS_3_FULL, inner);
+ RT_INIT_NODE(newnode, RT_NODE_KIND_3, RT_CLASS_3, inner);
newnode->shift = shift;
tree->ctl->max_val = RT_SHIFT_GET_MAX_VAL(shift);
tree->ctl->root = allocnode;
@@ -1230,9 +1221,9 @@ RT_EXTEND(RT_RADIX_TREE *tree, uint64 key)
RT_PTR_LOCAL node;
RT_NODE_INNER_3 *n3;
- allocnode = RT_ALLOC_NODE(tree, RT_CLASS_3_FULL, true);
+ allocnode = RT_ALLOC_NODE(tree, RT_CLASS_3, true);
node = RT_PTR_GET_LOCAL(tree, allocnode);
- RT_INIT_NODE(node, RT_NODE_KIND_3, RT_CLASS_3_FULL, true);
+ RT_INIT_NODE(node, RT_NODE_KIND_3, RT_CLASS_3, true);
node->shift = shift;
node->count = 1;
@@ -1268,9 +1259,9 @@ RT_SET_EXTEND(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE value, RT_PTR_LOCAL
int newshift = shift - RT_NODE_SPAN;
bool inner = newshift > 0;
- allocchild = RT_ALLOC_NODE(tree, RT_CLASS_3_FULL, inner);
+ allocchild = RT_ALLOC_NODE(tree, RT_CLASS_3, inner);
newchild = RT_PTR_GET_LOCAL(tree, allocchild);
- RT_INIT_NODE(newchild, RT_NODE_KIND_3, RT_CLASS_3_FULL, inner);
+ RT_INIT_NODE(newchild, RT_NODE_KIND_3, RT_CLASS_3, inner);
newchild->shift = newshift;
RT_NODE_INSERT_INNER(tree, parent, stored_node, node, key, allocchild);
@@ -2007,10 +1998,10 @@ RT_STATS(RT_RADIX_TREE *tree)
ereport(NOTICE, (errmsg("num_keys = " UINT64_FORMAT ", height = %d, n3 = %u, n15 = %u, n32 = %u, n125 = %u, n256 = %u",
tree->ctl->num_keys,
tree->ctl->root->shift / RT_NODE_SPAN,
- tree->ctl->cnt[RT_CLASS_3_FULL],
- tree->ctl->cnt[RT_CLASS_32_PARTIAL],
- tree->ctl->cnt[RT_CLASS_32_FULL],
- tree->ctl->cnt[RT_CLASS_125_FULL],
+ tree->ctl->cnt[RT_CLASS_3],
+ tree->ctl->cnt[RT_CLASS_32_MIN],
+ tree->ctl->cnt[RT_CLASS_32_MAX],
+ tree->ctl->cnt[RT_CLASS_125],
tree->ctl->cnt[RT_CLASS_256])));
}
@@ -2292,12 +2283,11 @@ RT_DUMP(RT_RADIX_TREE *tree)
#undef RT_SIZE_CLASS
#undef RT_SIZE_CLASS_ELEM
#undef RT_SIZE_CLASS_INFO
-#undef RT_CLASS_3_FULL
-#undef RT_CLASS_32_PARTIAL
-#undef RT_CLASS_32_FULL
-#undef RT_CLASS_125_FULL
+#undef RT_CLASS_3
+#undef RT_CLASS_32_MIN
+#undef RT_CLASS_32_MAX
+#undef RT_CLASS_125
#undef RT_CLASS_256
-#undef RT_KIND_MIN_SIZE_CLASS
/* function declarations */
#undef RT_CREATE
diff --git a/src/include/lib/radixtree_insert_impl.h b/src/include/lib/radixtree_insert_impl.h
index a0f46b37d3..e3c3f7a69d 100644
--- a/src/include/lib/radixtree_insert_impl.h
+++ b/src/include/lib/radixtree_insert_impl.h
@@ -49,7 +49,7 @@
RT_PTR_LOCAL newnode;
RT_NODE32_TYPE *new32;
const uint8 new_kind = RT_NODE_KIND_32;
- const RT_SIZE_CLASS new_class = RT_KIND_MIN_SIZE_CLASS[new_kind];
+ const RT_SIZE_CLASS new_class = RT_CLASS_32_MIN;
/* grow node from 3 to 32 */
allocnode = RT_ALLOC_NODE(tree, new_class, inner);
@@ -96,8 +96,7 @@
/* FALLTHROUGH */
case RT_NODE_KIND_32:
{
- const RT_SIZE_CLASS_ELEM class32_min = RT_SIZE_CLASS_INFO[RT_CLASS_32_PARTIAL];
- const RT_SIZE_CLASS_ELEM class32_max = RT_SIZE_CLASS_INFO[RT_CLASS_32_FULL];
+ const RT_SIZE_CLASS_ELEM class32_max = RT_SIZE_CLASS_INFO[RT_CLASS_32_MAX];
RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
int idx;
@@ -115,11 +114,14 @@
}
if (unlikely(RT_NODE_MUST_GROW(n32)) &&
- n32->base.n.fanout == class32_min.fanout)
+ n32->base.n.fanout < class32_max.fanout)
{
RT_PTR_ALLOC allocnode;
RT_PTR_LOCAL newnode;
- const RT_SIZE_CLASS new_class = RT_CLASS_32_FULL;
+ const RT_SIZE_CLASS_ELEM class32_min = RT_SIZE_CLASS_INFO[RT_CLASS_32_MIN];
+ const RT_SIZE_CLASS new_class = RT_CLASS_32_MAX;
+
+ Assert(n32->base.n.fanout == class32_min.fanout);
/* grow to the next size class of this kind */
allocnode = RT_ALLOC_NODE(tree, new_class, inner);
@@ -143,7 +145,7 @@
RT_PTR_LOCAL newnode;
RT_NODE125_TYPE *new125;
const uint8 new_kind = RT_NODE_KIND_125;
- const RT_SIZE_CLASS new_class = RT_KIND_MIN_SIZE_CLASS[new_kind];
+ const RT_SIZE_CLASS new_class = RT_CLASS_125;
Assert(n32->base.n.fanout == class32_max.fanout);
@@ -224,7 +226,7 @@
RT_PTR_LOCAL newnode;
RT_NODE256_TYPE *new256;
const uint8 new_kind = RT_NODE_KIND_256;
- const RT_SIZE_CLASS new_class = RT_KIND_MIN_SIZE_CLASS[new_kind];
+ const RT_SIZE_CLASS new_class = RT_CLASS_256;
/* grow node from 125 to 256 */
allocnode = RT_ALLOC_NODE(tree, new_class, inner);
--
2.39.0
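To summarize what the hunks above do: a node kind can now have more than one size class, and the enum names describe the class itself (RT_CLASS_3, RT_CLASS_32_MIN, RT_CLASS_32_MAX, RT_CLASS_125, RT_CLASS_256) rather than whether it is "partial" or "full". The growth decision for a full kind-32 node then looks roughly like this (a sketch only; grow_within_kind() and grow_to_next_kind() are not real functions, just shorthand for the copy-and-replace logic in radixtree_insert_impl.h):

    if (RT_NODE_MUST_GROW(n32))
    {
        if (n32->base.n.fanout < RT_SIZE_CLASS_INFO[RT_CLASS_32_MAX].fanout)
            grow_within_kind(tree, n32, RT_CLASS_32_MAX);   /* 32-min -> 32-max, same kind */
        else
            grow_to_next_kind(tree, n32, RT_CLASS_125);     /* 32-max is full -> node-125 */
    }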
Attachment: v21-0018-Clean-up-symbols.patch (text/x-patch)
From 96e730fd7056ca0e13b36489ce9e6717fef37318 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Sun, 22 Jan 2023 14:37:53 +0700
Subject: [PATCH v21 18/22] Clean up symbols
Remove remaining stragglers that weren't named "RT_*"
and get rid of the temporary expedient RT_COMMON
block in favor of explicit #undefs everywhere.
---
src/include/lib/radixtree.h | 91 ++++++++++++++-----------
src/include/lib/radixtree_delete_impl.h | 4 +-
src/include/lib/radixtree_insert_impl.h | 4 +-
src/include/lib/radixtree_iter_impl.h | 4 +-
src/include/lib/radixtree_search_impl.h | 4 +-
5 files changed, 58 insertions(+), 49 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 7c3f3dcf4f..95124696ef 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -246,14 +246,6 @@ RT_SCOPE void RT_STATS(RT_RADIX_TREE *tree);
/* generate implementation of the radix tree */
#ifdef RT_DEFINE
-/* macros and types common to all implementations */
-#ifndef RT_COMMON
-#define RT_COMMON
-
-#ifdef RT_DEBUG
-#define UINT64_FORMAT_HEX "%" INT64_MODIFIER "X"
-#endif
-
/* The number of bits encoded in one tree level */
#define RT_NODE_SPAN BITS_PER_BYTE
@@ -321,8 +313,6 @@ RT_SCOPE void RT_STATS(RT_RADIX_TREE *tree);
#define RT_SLAB_BLOCK_SIZE(size) \
Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * (size), (size) * 32)
-#endif /* RT_COMMON */
-
/* Common type for all nodes types */
typedef struct RT_NODE
{
@@ -370,7 +360,7 @@ typedef struct RT_NODE
#define RT_INVALID_PTR_ALLOC NULL
#endif
-#define NODE_IS_LEAF(n) (((RT_PTR_LOCAL) (n))->shift == 0)
+#define RT_NODE_IS_LEAF(n) (((RT_PTR_LOCAL) (n))->shift == 0)
#define RT_NODE_MUST_GROW(node) \
((node)->base.n.count == (node)->base.n.fanout)
@@ -916,14 +906,14 @@ RT_NODE_125_IS_CHUNK_USED(RT_NODE_BASE_125 *node, uint8 chunk)
static inline RT_PTR_ALLOC
RT_NODE_INNER_125_GET_CHILD(RT_NODE_INNER_125 *node, uint8 chunk)
{
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
return node->children[node->base.slot_idxs[chunk]];
}
static inline RT_VALUE_TYPE
RT_NODE_LEAF_125_GET_VALUE(RT_NODE_LEAF_125 *node, uint8 chunk)
{
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
Assert(((RT_NODE_BASE_125 *) node)->slot_idxs[chunk] != RT_INVALID_SLOT_IDX);
return node->values[node->base.slot_idxs[chunk]];
}
@@ -934,7 +924,7 @@ RT_NODE_LEAF_125_GET_VALUE(RT_NODE_LEAF_125 *node, uint8 chunk)
static inline bool
RT_NODE_INNER_256_IS_CHUNK_USED(RT_NODE_INNER_256 *node, uint8 chunk)
{
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
return node->children[chunk] != RT_INVALID_PTR_ALLOC;
}
@@ -944,14 +934,14 @@ RT_NODE_LEAF_256_IS_CHUNK_USED(RT_NODE_LEAF_256 *node, uint8 chunk)
int idx = BM_IDX(chunk);
int bitnum = BM_BIT(chunk);
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
return (node->isset[idx] & ((bitmapword) 1 << bitnum)) != 0;
}
static inline RT_PTR_ALLOC
RT_NODE_INNER_256_GET_CHILD(RT_NODE_INNER_256 *node, uint8 chunk)
{
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
Assert(RT_NODE_INNER_256_IS_CHUNK_USED(node, chunk));
return node->children[chunk];
}
@@ -959,7 +949,7 @@ RT_NODE_INNER_256_GET_CHILD(RT_NODE_INNER_256 *node, uint8 chunk)
static inline RT_VALUE_TYPE
RT_NODE_LEAF_256_GET_VALUE(RT_NODE_LEAF_256 *node, uint8 chunk)
{
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
Assert(RT_NODE_LEAF_256_IS_CHUNK_USED(node, chunk));
return node->values[chunk];
}
@@ -968,7 +958,7 @@ RT_NODE_LEAF_256_GET_VALUE(RT_NODE_LEAF_256 *node, uint8 chunk)
static inline void
RT_NODE_INNER_256_SET(RT_NODE_INNER_256 *node, uint8 chunk, RT_PTR_ALLOC child)
{
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
node->children[chunk] = child;
}
@@ -979,7 +969,7 @@ RT_NODE_LEAF_256_SET(RT_NODE_LEAF_256 *node, uint8 chunk, RT_VALUE_TYPE value)
int idx = BM_IDX(chunk);
int bitnum = BM_BIT(chunk);
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
node->isset[idx] |= ((bitmapword) 1 << bitnum);
node->values[chunk] = value;
}
@@ -988,7 +978,7 @@ RT_NODE_LEAF_256_SET(RT_NODE_LEAF_256 *node, uint8 chunk, RT_VALUE_TYPE value)
static inline void
RT_NODE_INNER_256_DELETE(RT_NODE_INNER_256 *node, uint8 chunk)
{
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
node->children[chunk] = RT_INVALID_PTR_ALLOC;
}
@@ -998,7 +988,7 @@ RT_NODE_LEAF_256_DELETE(RT_NODE_LEAF_256 *node, uint8 chunk)
int idx = BM_IDX(chunk);
int bitnum = BM_BIT(chunk);
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
node->isset[idx] &= ~((bitmapword) 1 << bitnum);
}
@@ -1458,7 +1448,7 @@ RT_FREE_RECURSE(RT_RADIX_TREE *tree, RT_PTR_ALLOC ptr)
CHECK_FOR_INTERRUPTS();
/* The leaf node doesn't have child pointers */
- if (NODE_IS_LEAF(node))
+ if (RT_NODE_IS_LEAF(node))
{
dsa_free(tree->dsa, ptr);
return;
@@ -1587,7 +1577,7 @@ RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE value)
child = RT_PTR_GET_LOCAL(tree, stored_child);
- if (NODE_IS_LEAF(child))
+ if (RT_NODE_IS_LEAF(child))
break;
if (!RT_NODE_SEARCH_INNER(child, key, &new_child))
@@ -1637,7 +1627,7 @@ RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p)
{
RT_PTR_ALLOC child;
- if (NODE_IS_LEAF(node))
+ if (RT_NODE_IS_LEAF(node))
break;
if (!RT_NODE_SEARCH_INNER(node, key, &child))
@@ -1788,7 +1778,7 @@ RT_UPDATE_ITER_STACK(RT_ITER *iter, RT_PTR_LOCAL from_node, int from)
node_iter->current_idx = -1;
/* We don't advance the leaf node iterator here */
- if (NODE_IS_LEAF(node))
+ if (RT_NODE_IS_LEAF(node))
return;
/* Advance to the next slot in the inner node */
@@ -1972,7 +1962,7 @@ RT_VERIFY_NODE(RT_PTR_LOCAL node)
}
case RT_NODE_KIND_256:
{
- if (NODE_IS_LEAF(node))
+ if (RT_NODE_IS_LEAF(node))
{
RT_NODE_LEAF_256 *n256 = (RT_NODE_LEAF_256 *) node;
int cnt = 0;
@@ -1992,6 +1982,9 @@ RT_VERIFY_NODE(RT_PTR_LOCAL node)
/***************** DEBUG FUNCTIONS *****************/
#ifdef RT_DEBUG
+
+#define RT_UINT64_FORMAT_HEX "%" INT64_MODIFIER "X"
+
RT_SCOPE void
RT_STATS(RT_RADIX_TREE *tree)
{
@@ -2012,7 +2005,7 @@ RT_DUMP_NODE(RT_PTR_LOCAL node, int level, bool recurse)
char space[125] = {0};
fprintf(stderr, "[%s] kind %d, fanout %d, count %u, shift %u:\n",
- NODE_IS_LEAF(node) ? "LEAF" : "INNR",
+ RT_NODE_IS_LEAF(node) ? "LEAF" : "INNR",
(node->kind == RT_NODE_KIND_3) ? 3 :
(node->kind == RT_NODE_KIND_32) ? 32 :
(node->kind == RT_NODE_KIND_125) ? 125 : 256,
@@ -2028,11 +2021,11 @@ RT_DUMP_NODE(RT_PTR_LOCAL node, int level, bool recurse)
{
for (int i = 0; i < node->count; i++)
{
- if (NODE_IS_LEAF(node))
+ if (RT_NODE_IS_LEAF(node))
{
RT_NODE_LEAF_3 *n3 = (RT_NODE_LEAF_3 *) node;
- fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ fprintf(stderr, "%schunk 0x%X value 0x" RT_UINT64_FORMAT_HEX "\n",
space, n3->base.chunks[i], (uint64) n3->values[i]);
}
else
@@ -2054,11 +2047,11 @@ RT_DUMP_NODE(RT_PTR_LOCAL node, int level, bool recurse)
{
for (int i = 0; i < node->count; i++)
{
- if (NODE_IS_LEAF(node))
+ if (RT_NODE_IS_LEAF(node))
{
RT_NODE_LEAF_32 *n32 = (RT_NODE_LEAF_32 *) node;
- fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ fprintf(stderr, "%schunk 0x%X value 0x" RT_UINT64_FORMAT_HEX "\n",
space, n32->base.chunks[i], (uint64) n32->values[i]);
}
else
@@ -2090,14 +2083,14 @@ RT_DUMP_NODE(RT_PTR_LOCAL node, int level, bool recurse)
fprintf(stderr, " [%d]=%d, ", i, b125->slot_idxs[i]);
}
- if (NODE_IS_LEAF(node))
+ if (RT_NODE_IS_LEAF(node))
{
RT_NODE_LEAF_125 *n = (RT_NODE_LEAF_125 *) node;
fprintf(stderr, ", isset-bitmap:");
for (int i = 0; i < BM_IDX(RT_SLOT_IDX_LIMIT); i++)
{
- fprintf(stderr, UINT64_FORMAT_HEX " ", (uint64) n->base.isset[i]);
+ fprintf(stderr, RT_UINT64_FORMAT_HEX " ", (uint64) n->base.isset[i]);
}
fprintf(stderr, "\n");
}
@@ -2107,11 +2100,11 @@ RT_DUMP_NODE(RT_PTR_LOCAL node, int level, bool recurse)
if (!RT_NODE_125_IS_CHUNK_USED(b125, i))
continue;
- if (NODE_IS_LEAF(node))
+ if (RT_NODE_IS_LEAF(node))
{
RT_NODE_LEAF_125 *n125 = (RT_NODE_LEAF_125 *) b125;
- fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ fprintf(stderr, "%schunk 0x%X value 0x" RT_UINT64_FORMAT_HEX "\n",
space, i, (uint64) RT_NODE_LEAF_125_GET_VALUE(n125, i));
}
else
@@ -2134,14 +2127,14 @@ RT_DUMP_NODE(RT_PTR_LOCAL node, int level, bool recurse)
{
for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
{
- if (NODE_IS_LEAF(node))
+ if (RT_NODE_IS_LEAF(node))
{
RT_NODE_LEAF_256 *n256 = (RT_NODE_LEAF_256 *) node;
if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, i))
continue;
- fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ fprintf(stderr, "%schunk 0x%X value 0x" RT_UINT64_FORMAT_HEX "\n",
space, i, (uint64) RT_NODE_LEAF_256_GET_VALUE(n256, i));
}
else
@@ -2174,7 +2167,7 @@ RT_DUMP_SEARCH(RT_RADIX_TREE *tree, uint64 key)
int level = 0;
elog(NOTICE, "-----------------------------------------------------------");
- elog(NOTICE, "max_val = " UINT64_FORMAT "(0x" UINT64_FORMAT_HEX ")",
+ elog(NOTICE, "max_val = " UINT64_FORMAT "(0x" RT_UINT64_FORMAT_HEX ")",
tree->ctl->max_val, tree->ctl->max_val);
if (!tree->ctl->root)
@@ -2185,7 +2178,7 @@ RT_DUMP_SEARCH(RT_RADIX_TREE *tree, uint64 key)
if (key > tree->ctl->max_val)
{
- elog(NOTICE, "key " UINT64_FORMAT "(0x" UINT64_FORMAT_HEX ") is larger than max val",
+ elog(NOTICE, "key " UINT64_FORMAT "(0x" RT_UINT64_FORMAT_HEX ") is larger than max val",
key, key);
return;
}
@@ -2198,7 +2191,7 @@ RT_DUMP_SEARCH(RT_RADIX_TREE *tree, uint64 key)
RT_DUMP_NODE(node, level, false);
- if (NODE_IS_LEAF(node))
+ if (RT_NODE_IS_LEAF(node))
{
uint64 dummy;
@@ -2249,15 +2242,30 @@ RT_DUMP(RT_RADIX_TREE *tree)
#undef RT_VALUE_TYPE
/* locally declared macros */
-#undef NODE_IS_LEAF
+#undef RT_MAKE_PREFIX
+#undef RT_MAKE_NAME
+#undef RT_MAKE_NAME_
+#undef RT_NODE_SPAN
+#undef RT_NODE_MAX_SLOTS
+#undef RT_CHUNK_MASK
+#undef RT_MAX_SHIFT
+#undef RT_MAX_LEVEL
+#undef RT_GET_KEY_CHUNK
+#undef BM_IDX
+#undef BM_BIT
+#undef RT_NODE_IS_LEAF
#undef RT_NODE_MUST_GROW
#undef RT_NODE_KIND_COUNT
#undef RT_SIZE_CLASS_COUNT
+#undef RT_INVALID_SLOT_IDX
+#undef RT_SLAB_BLOCK_SIZE
#undef RT_RADIX_TREE_MAGIC
+#undef RT_UINT64_FORMAT_HEX
/* type declarations */
#undef RT_RADIX_TREE
#undef RT_RADIX_TREE_CONTROL
+#undef RT_PTR_LOCAL
#undef RT_PTR_ALLOC
#undef RT_INVALID_PTR_ALLOC
#undef RT_HANDLE
@@ -2295,6 +2303,7 @@ RT_DUMP(RT_RADIX_TREE *tree)
#undef RT_ATTACH
#undef RT_DETACH
#undef RT_GET_HANDLE
+#undef RT_SEARCH
#undef RT_SET
#undef RT_BEGIN_ITERATE
#undef RT_ITERATE_NEXT
diff --git a/src/include/lib/radixtree_delete_impl.h b/src/include/lib/radixtree_delete_impl.h
index b9f07f4eb5..99c90771b9 100644
--- a/src/include/lib/radixtree_delete_impl.h
+++ b/src/include/lib/radixtree_delete_impl.h
@@ -17,9 +17,9 @@
uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
#ifdef RT_NODE_LEVEL_LEAF
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
#else
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
#endif
switch (node->kind)
diff --git a/src/include/lib/radixtree_insert_impl.h b/src/include/lib/radixtree_insert_impl.h
index e3c3f7a69d..0fcebf1c6b 100644
--- a/src/include/lib/radixtree_insert_impl.h
+++ b/src/include/lib/radixtree_insert_impl.h
@@ -17,10 +17,10 @@
#ifdef RT_NODE_LEVEL_LEAF
const bool inner = false;
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
#else
const bool inner = true;
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
#endif
switch (node->kind)
diff --git a/src/include/lib/radixtree_iter_impl.h b/src/include/lib/radixtree_iter_impl.h
index c428531438..823d7107c4 100644
--- a/src/include/lib/radixtree_iter_impl.h
+++ b/src/include/lib/radixtree_iter_impl.h
@@ -18,11 +18,11 @@
#ifdef RT_NODE_LEVEL_LEAF
RT_VALUE_TYPE value;
- Assert(NODE_IS_LEAF(node_iter->node));
+ Assert(RT_NODE_IS_LEAF(node_iter->node));
#else
RT_PTR_LOCAL child = NULL;
- Assert(!NODE_IS_LEAF(node_iter->node));
+ Assert(!RT_NODE_IS_LEAF(node_iter->node));
#endif
#ifdef RT_SHMEM
diff --git a/src/include/lib/radixtree_search_impl.h b/src/include/lib/radixtree_search_impl.h
index 31138b6a72..c4352045c8 100644
--- a/src/include/lib/radixtree_search_impl.h
+++ b/src/include/lib/radixtree_search_impl.h
@@ -17,12 +17,12 @@
#ifdef RT_NODE_LEVEL_LEAF
RT_VALUE_TYPE value = 0;
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
#else
#ifndef RT_ACTION_UPDATE
RT_PTR_ALLOC child = RT_INVALID_PTR_ALLOC;
#endif
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
#endif
switch (node->kind)
--
2.39.0
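The reason every generated or helper symbol has to be #undef'ed at the end of radixtree.h is that the template can then be included more than once in the same translation unit with different prefixes. A minimal sketch of the pattern (the TIDStore patch further down does exactly this, with local_rt and shared_rt prefixes):

    #define RT_PREFIX local_rt
    #define RT_SCOPE static
    #define RT_DECLARE
    #define RT_DEFINE
    #include "lib/radixtree.h"

    #define RT_PREFIX shared_rt
    #define RT_SHMEM
    #define RT_SCOPE static
    #define RT_DECLARE
    #define RT_DEFINE
    #include "lib/radixtree.h"

Any straggler still defined after the first inclusion would leak into, or collide with, the second instantiation, which is why the remaining non-RT_* names are renamed and explicitly #undef'ed here.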
Attachment: v21-0020-Do-some-rewriting-and-proofreading-of-comments.patch (text/x-patch)
From bf3219324a0b336166390dacfe2ab91ba96d6417 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Mon, 23 Jan 2023 18:00:20 +0700
Subject: [PATCH v21 20/22] Do some rewriting and proofreading of comments
In passing, change one ternary operator to if/else.
---
src/include/lib/radixtree.h | 160 +++++++++++++++++++++---------------
1 file changed, 92 insertions(+), 68 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 5927437034..7fcd212ea4 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -9,25 +9,38 @@
* types, each with a different numbers of elements. Depending on the number of
* children, the appropriate node type is used.
*
- * There are some differences from the proposed implementation. For instance,
- * there is not support for path compression and lazy path expansion. The radix
- * tree supports fixed length of the key so we don't expect the tree level
- * wouldn't be high.
+ * WIP: notes about traditional radix tree trading off span vs height...
*
- * Both the key and the value are 64-bit unsigned integer. The inner nodes and
- * the leaf nodes have slightly different structure: for inner tree nodes,
- * shift > 0, store the pointer to its child node as the value. The leaf nodes,
- * shift == 0, have the 64-bit unsigned integer that is specified by the user as
- * the value. The paper refers to this technique as "Multi-value leaves". We
- * choose it to avoid an additional pointer traversal. It is the reason this code
- * currently does not support variable-length keys.
+ * There are two kinds of nodes, inner nodes and leaves. Inner nodes
+ * map partial keys to child pointers.
*
- * XXX: Most functions in this file have two variants for inner nodes and leaf
- * nodes, therefore there are duplication codes. While this sometimes makes the
- * code maintenance tricky, this reduces branch prediction misses when judging
- * whether the node is a inner node of a leaf node.
+ * The ART paper mentions three ways to implement leaves:
*
- * XXX: the radix tree node never be shrunk.
+ * "- Single-value leaves: The values are stored using an addi-
+ * tional leaf node type which stores one value.
+ * - Multi-value leaves: The values are stored in one of four
+ * different leaf node types, which mirror the structure of
+ * inner nodes, but contain values instead of pointers.
+ * - Combined pointer/value slots: If values fit into point-
+ * ers, no separate node types are necessary. Instead, each
+ * pointer storage location in an inner node can either
+ * store a pointer or a value."
+ *
+ * We chose "multi-value leaves" to avoid the additional pointer traversal
+ * required by "single-value leaves"
+ *
+ * For simplicity, the key is assumed to be a 64-bit unsigned integer. The
+ * tree doesn't need to contain paths where the highest bytes of all keys
+ * are zero. That way, the tree's height adapts to the distribution of keys.
+ *
+ * TODO: In the future it might be worthwhile to offer configurability of
+ * leaf implementation for different use cases. Single-value leaves would
+ * give more flexibility in key type, including variable-length keys.
+ *
+ * There are some optimizations not yet implemented, particularly path
+ * compression and lazy path expansion.
+ *
+ * WIP: the radix tree nodes don't shrink.
*
* To generate a radix tree and associated functions for a use case several
* macros have to be #define'ed before this file is included. Including
@@ -42,11 +55,11 @@
* - RT_DEFINE - if defined function definitions are generated
* - RT_SCOPE - in which scope (e.g. extern, static inline) do function
* declarations reside
- * - RT_SHMEM - if defined, the radix tree is created in the DSA area
- * so that multiple processes can access it simultaneously.
* - RT_VALUE_TYPE - the type of the value.
*
* Optional parameters:
+ * - RT_SHMEM - if defined, the radix tree is created in the DSA area
+ * so that multiple processes can access it simultaneously.
* - RT_DEBUG - if defined add stats tracking and debugging functions
*
* Interface
@@ -54,9 +67,6 @@
*
* RT_CREATE - Create a new, empty radix tree
* RT_FREE - Free the radix tree
- * RT_ATTACH - Attach to the radix tree
- * RT_DETACH - Detach from the radix tree
- * RT_GET_HANDLE - Return the handle of the radix tree
* RT_SEARCH - Search a key-value pair
* RT_SET - Set a key-value pair
* RT_BEGIN_ITERATE - Begin iterating through all key-value pairs
@@ -64,11 +74,12 @@
* RT_END_ITER - End iteration
* RT_MEMORY_USAGE - Get the memory usage
*
- * RT_CREATE() creates an empty radix tree in the given memory context
- * and memory contexts for all kinds of radix tree node under the memory context.
+ * Interface for Shared Memory
+ * ---------
*
- * RT_ITERATE_NEXT() ensures returning key-value pairs in the ascending
- * order of the key.
+ * RT_ATTACH - Attach to the radix tree
+ * RT_DETACH - Detach from the radix tree
+ * RT_GET_HANDLE - Return the handle of the radix tree
*
* Optional Interface
* ---------
@@ -360,13 +371,23 @@ typedef struct RT_NODE
#define RT_INVALID_PTR_ALLOC NULL
#endif
+/*
+ * Inner nodes and leaf nodes have analogous structure. To distinguish
+ * them at runtime, we take advantage of the fact that the key chunk
+ * is accessed by shifting: Inner tree nodes (shift > 0), store the
+ * pointer to its child node in the slot. In leaf nodes (shift == 0),
+ * the slot contains the value corresponding to the key.
+ */
#define RT_NODE_IS_LEAF(n) (((RT_PTR_LOCAL) (n))->shift == 0)
+
#define RT_NODE_MUST_GROW(node) \
((node)->base.n.count == (node)->base.n.fanout)
-/* Base type of each node kinds for leaf and inner nodes */
-/* The base types must be a be able to accommodate the largest size
-class for variable-sized node kinds*/
+/*
+ * Base type of each node kind, for both leaf and inner nodes.
+ * The base types must be able to accommodate the largest size
+ * class for variable-sized node kinds.
+ */
typedef struct RT_NODE_BASE_3
{
RT_NODE n;
@@ -384,9 +405,9 @@ typedef struct RT_NODE_BASE_32
} RT_NODE_BASE_32;
/*
- * node-125 uses slot_idx array, an array of RT_NODE_MAX_SLOTS length, typically
- * 256, to store indexes into a second array that contains up to 125 values (or
- * child pointers in inner nodes).
+ * node-125 uses the slot_idx array, an array of RT_NODE_MAX_SLOTS length,
+ * to store indexes into a second array that contains the values (or
+ * child pointers).
*/
typedef struct RT_NODE_BASE_125
{
@@ -407,15 +428,8 @@ typedef struct RT_NODE_BASE_256
/*
* Inner and leaf nodes.
*
- * Theres are separate for two main reasons:
- *
- * 1) the value type might be different than something fitting into a pointer
- * width type
- * 2) Need to represent non-existing values in a key-type independent way.
- *
- * 1) is clearly worth being concerned about, but it's not clear 2) is as
- * good. It might be better to just indicate non-existing entries the same way
- * in inner nodes.
+ * These are separate because the value type might be different from
+ * something fitting into a pointer-width type.
*/
typedef struct RT_NODE_INNER_3
{
@@ -466,8 +480,10 @@ typedef struct RT_NODE_LEAF_125
} RT_NODE_LEAF_125;
/*
- * node-256 is the largest node type. This node has RT_NODE_MAX_SLOTS length array
+ * node-256 is the largest node type. This node has an array
* for directly storing values (or child pointers in inner nodes).
+ * Unlike other node kinds, its array size is fixed by
+ * definition.
*/
typedef struct RT_NODE_INNER_256
{
@@ -481,7 +497,10 @@ typedef struct RT_NODE_LEAF_256
{
RT_NODE_BASE_256 base;
- /* isset is a bitmap to track which slot is in use */
+ /*
+ * Unlike with inner256, zero is a valid value here, so we use a
+ * bitmap to track which slot is in use.
+ */
bitmapword isset[BM_IDX(RT_NODE_MAX_SLOTS)];
/* Slots for 256 values */
@@ -570,7 +589,8 @@ static const RT_SIZE_CLASS_ELEM RT_SIZE_CLASS_INFO[] = {
#define RT_RADIX_TREE_MAGIC 0x54A48167
#endif
-/* A radix tree with nodes */
+/* Contains the actual tree and ancillary info */
+// WIP: this name is a bit strange
typedef struct RT_RADIX_TREE_CONTROL
{
#ifdef RT_SHMEM
@@ -588,7 +608,7 @@ typedef struct RT_RADIX_TREE_CONTROL
#endif
} RT_RADIX_TREE_CONTROL;
-/* A radix tree with nodes */
+/* Entry point for allocating and accessing the tree */
typedef struct RT_RADIX_TREE
{
MemoryContext context;
@@ -613,15 +633,15 @@ typedef struct RT_RADIX_TREE
* RT_NODE_ITER struct is used to track the iteration within a node.
*
* RT_ITER is the struct for iteration of the radix tree, and uses RT_NODE_ITER
- * in order to track the iteration of each level. During the iteration, we also
+ * in order to track the iteration of each level. During iteration, we also
* construct the key whenever updating the node iteration information, e.g., when
* advancing the current index within the node or when moving to the next node
* at the same level.
-+ *
-+ * XXX: Currently we allow only one process to do iteration. Therefore, rt_node_iter
-+ * has the local pointers to nodes, rather than RT_PTR_ALLOC.
-+ * We need either a safeguard to disallow other processes to begin the iteration
-+ * while one process is doing or to allow multiple processes to do the iteration.
+ *
+ * XXX: Currently we allow only one process to do iteration. Therefore, rt_node_iter
+ * has the local pointers to nodes, rather than RT_PTR_ALLOC.
+ * We need either a safeguard to disallow other processes to begin the iteration
+ * while one process is doing or to allow multiple processes to do the iteration.
*/
typedef struct RT_NODE_ITER
{
@@ -637,7 +657,7 @@ typedef struct RT_ITER
RT_NODE_ITER stack[RT_MAX_LEVEL];
int stack_len;
- /* The key is being constructed during the iteration */
+ /* The key is constructed during iteration */
uint64 key;
} RT_ITER;
@@ -672,8 +692,8 @@ RT_PTR_ALLOC_IS_VALID(RT_PTR_ALLOC ptr)
}
/*
- * Return index of the first element in 'base' that equals 'key'. Return -1
- * if there is no such element.
+ * Return index of the first element in the node's chunk array that equals
+ * 'chunk'. Return -1 if there is no such element.
*/
static inline int
RT_NODE_3_SEARCH_EQ(RT_NODE_BASE_3 *node, uint8 chunk)
@@ -693,7 +713,8 @@ RT_NODE_3_SEARCH_EQ(RT_NODE_BASE_3 *node, uint8 chunk)
}
/*
- * Return index of the chunk to insert into chunks in the given node.
+ * Return index of the chunk and slot arrays for inserting into the node,
+ * such that the chunk array remains ordered.
*/
static inline int
RT_NODE_3_GET_INSERTPOS(RT_NODE_BASE_3 *node, uint8 chunk)
@@ -744,7 +765,7 @@ RT_NODE_32_SEARCH_EQ(RT_NODE_BASE_32 *node, uint8 chunk)
/* replicate the search key */
spread_chunk = vector8_broadcast(chunk);
- /* compare to the 32 keys stored in the node */
+ /* compare to all 32 keys stored in the node */
vector8_load(&haystack1, &node->chunks[0]);
vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
cmp1 = vector8_eq(spread_chunk, haystack1);
@@ -768,7 +789,7 @@ RT_NODE_32_SEARCH_EQ(RT_NODE_BASE_32 *node, uint8 chunk)
}
/*
- * Return index of the node's chunk array to insert into,
+ * Return index of the chunk and slot arrays for inserting into the node,
* such that the chunk array remains ordered.
*/
static inline int
@@ -809,7 +830,7 @@ RT_NODE_32_GET_INSERTPOS(RT_NODE_BASE_32 *node, uint8 chunk)
* This is a bit more complicated than RT_NODE_32_SEARCH_EQ(), because
* no unsigned uint8 comparison instruction exists, at least for SSE2. So
* we need to play some trickery using vector8_min() to effectively get
- * <=. There'll never be any equal elements in the current uses, but that's
+ * <=. There'll never be any equal elements in current uses, but that's
* what we get here...
*/
spread_chunk = vector8_broadcast(chunk);
@@ -834,6 +855,7 @@ RT_NODE_32_GET_INSERTPOS(RT_NODE_BASE_32 *node, uint8 chunk)
#endif
}
+
/*
* Functions to manipulate both chunks array and children/values array.
* These are used for node-3 and node-32.
@@ -993,18 +1015,19 @@ RT_NODE_LEAF_256_DELETE(RT_NODE_LEAF_256 *node, uint8 chunk)
}
/*
- * Return the shift that is satisfied to store the given key.
+ * Return the largest shift that will allow storing the given key.
*/
static inline int
RT_KEY_GET_SHIFT(uint64 key)
{
- return (key == 0)
- ? 0
- : (pg_leftmost_one_pos64(key) / RT_NODE_SPAN) * RT_NODE_SPAN;
+ if (key == 0)
+ return 0;
+ else
+ return (pg_leftmost_one_pos64(key) / RT_NODE_SPAN) * RT_NODE_SPAN;
}
/*
- * Return the max value stored in a node with the given shift.
+ * Return the max value that can be stored in the tree with the given shift.
*/
static uint64
RT_SHIFT_GET_MAX_VAL(int shift)
@@ -1155,6 +1178,7 @@ RT_FREE_NODE(RT_RADIX_TREE *tree, RT_PTR_ALLOC allocnode)
#endif
}
+/* Update the parent's pointer when growing a node */
static inline void
RT_NODE_UPDATE_INNER(RT_PTR_LOCAL node, uint64 key, RT_PTR_ALLOC new_child)
{
@@ -1182,7 +1206,7 @@ RT_REPLACE_NODE(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent,
if (parent == old_child)
{
- /* Replace the root node with the new large node */
+ /* Replace the root node with the new larger node */
tree->ctl->root = new_child;
}
else
@@ -1192,8 +1216,8 @@ RT_REPLACE_NODE(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent,
}
/*
- * The radix tree doesn't sufficient height. Extend the radix tree so it can
- * store the key.
+ * The radix tree doesn't have sufficient height. Extend the radix tree so
+ * it can store the key.
*/
static void
RT_EXTEND(RT_RADIX_TREE *tree, uint64 key)
@@ -1337,7 +1361,7 @@ RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stor
#undef RT_NODE_LEVEL_INNER
}
-/* Like, RT_NODE_INSERT_INNER, but for leaf nodes */
+/* Like RT_NODE_INSERT_INNER, but for leaf nodes */
static bool
RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
uint64 key, RT_VALUE_TYPE value)
@@ -1377,7 +1401,7 @@ RT_CREATE(MemoryContext ctx)
#else
tree->ctl = (RT_RADIX_TREE_CONTROL *) palloc0(sizeof(RT_RADIX_TREE_CONTROL));
- /* Create the slab allocator for each size class */
+ /* Create a slab context for each size class */
for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
{
RT_SIZE_CLASS_ELEM size_class = RT_SIZE_CLASS_INFO[i];
@@ -1570,7 +1594,7 @@ RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE value)
parent = RT_PTR_GET_LOCAL(tree, stored_child);
shift = parent->shift;
- /* Descend the tree until a leaf node */
+ /* Descend the tree until we reach a leaf node */
while (shift >= 0)
{
RT_PTR_ALLOC new_child;
--
2.39.0
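With the parameters and interface documented in the header comment above, using a generated tree looks roughly like the following (a sketch only, assuming RT_PREFIX is rt and a uint64 value type; the generated names follow the prefix, as the TIDStore patch below shows with local_rt_*/shared_rt_*):

    rt_radix_tree *tree;
    rt_iter    *iter;
    uint64      key = 0x1234;
    uint64      value = 42;

    tree = rt_create(CurrentMemoryContext);
    rt_set(tree, key, value);               /* insert or update one key */

    if (rt_search(tree, key, &value))       /* existence check plus fetch */
        elog(NOTICE, "found " UINT64_FORMAT, value);

    /* iterate over all pairs in ascending key order */
    iter = rt_begin_iterate(tree);
    while (rt_iterate_next(iter, &key, &value))
    {
        /* ... use key and value ... */
    }
    rt_end_iterate(iter);

    rt_free(tree);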
Attachment: v21-0021-Add-TIDStore-to-store-sets-of-TIDs-ItemPointerDa.patch (text/x-patch)
From f3f586bc84026364d46e7bcf6eddd04a83264de4 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 23 Dec 2022 23:50:39 +0900
Subject: [PATCH v21 21/22] Add TIDStore, to store sets of TIDs
(ItemPointerData) efficiently.
The TIDStore is designed to store large sets of TIDs efficiently, and
is backed by the radix tree. A TID is encoded into a 64-bit key and
value and inserted into the radix tree.
The TIDStore is not used for anything yet, aside from the test code,
but the follow-up patch integrates the TIDStore with lazy vacuum,
reducing lazy vacuum memory usage and lifting the 1GB limit on its
size, by storing the list of dead TIDs more efficiently.
This includes a unit test module, in src/test/modules/test_tidstore.
---
doc/src/sgml/monitoring.sgml | 4 +
src/backend/access/common/Makefile | 1 +
src/backend/access/common/meson.build | 1 +
src/backend/access/common/tidstore.c | 624 ++++++++++++++++++
src/backend/storage/lmgr/lwlock.c | 2 +
src/include/access/tidstore.h | 49 ++
src/include/storage/lwlock.h | 1 +
src/test/modules/Makefile | 1 +
src/test/modules/meson.build | 1 +
src/test/modules/test_tidstore/Makefile | 23 +
.../test_tidstore/expected/test_tidstore.out | 13 +
src/test/modules/test_tidstore/meson.build | 35 +
.../test_tidstore/sql/test_tidstore.sql | 7 +
.../test_tidstore/test_tidstore--1.0.sql | 8 +
.../modules/test_tidstore/test_tidstore.c | 189 ++++++
.../test_tidstore/test_tidstore.control | 4 +
16 files changed, 963 insertions(+)
create mode 100644 src/backend/access/common/tidstore.c
create mode 100644 src/include/access/tidstore.h
create mode 100644 src/test/modules/test_tidstore/Makefile
create mode 100644 src/test/modules/test_tidstore/expected/test_tidstore.out
create mode 100644 src/test/modules/test_tidstore/meson.build
create mode 100644 src/test/modules/test_tidstore/sql/test_tidstore.sql
create mode 100644 src/test/modules/test_tidstore/test_tidstore--1.0.sql
create mode 100644 src/test/modules/test_tidstore/test_tidstore.c
create mode 100644 src/test/modules/test_tidstore/test_tidstore.control
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index e3a783abd0..38bc3589ae 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -2182,6 +2182,10 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
<entry>Waiting to access a shared TID bitmap during a parallel bitmap
index scan.</entry>
</row>
+ <row>
+ <entry><literal>SharedTidStore</literal></entry>
+ <entry>Waiting to access a shared TID store.</entry>
+ </row>
<row>
<entry><literal>SharedTupleStore</literal></entry>
<entry>Waiting to access a shared tuple store during parallel
diff --git a/src/backend/access/common/Makefile b/src/backend/access/common/Makefile
index b9aff0ccfd..67b8cc6108 100644
--- a/src/backend/access/common/Makefile
+++ b/src/backend/access/common/Makefile
@@ -27,6 +27,7 @@ OBJS = \
syncscan.o \
toast_compression.o \
toast_internals.o \
+ tidstore.o \
tupconvert.o \
tupdesc.o
diff --git a/src/backend/access/common/meson.build b/src/backend/access/common/meson.build
index f5ac17b498..fce19c09ce 100644
--- a/src/backend/access/common/meson.build
+++ b/src/backend/access/common/meson.build
@@ -15,6 +15,7 @@ backend_sources += files(
'syncscan.c',
'toast_compression.c',
'toast_internals.c',
+ 'tidstore.c',
'tupconvert.c',
'tupdesc.c',
)
diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
new file mode 100644
index 0000000000..fa55793227
--- /dev/null
+++ b/src/backend/access/common/tidstore.c
@@ -0,0 +1,624 @@
+/*-------------------------------------------------------------------------
+ *
+ * tidstore.c
+ * Tid (ItemPointerData) storage implementation.
+ *
+ * This module provides an in-memory data structure to store Tids (ItemPointer).
+ * Internally, a Tid is encoded as a pair of a 64-bit key and a 64-bit value,
+ * and stored in the radix tree.
+ *
+ * A TidStore can be shared among parallel worker processes by passing a DSA
+ * area to tidstore_create(). Other backends can attach to the shared TidStore
+ * with tidstore_attach(). It supports concurrent updates, but only one process
+ * is allowed to iterate over the TidStore at a time.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/common/tidstore.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "access/tidstore.h"
+#include "miscadmin.h"
+#include "port/pg_bitutils.h"
+#include "storage/lwlock.h"
+#include "utils/dsa.h"
+#include "utils/memutils.h"
+
+/*
+ * For encoding purposes, item pointers are represented as a pair of a 64-bit
+ * key and a 64-bit value. First, we construct a 64-bit unsigned integer key
+ * that combines the block number and the offset number. The lowest 11 bits
+ * represent the offset number, and the next 32 bits are the block number.
+ * That is, only 43 bits are used (most significant bit on the left):
+ *
+ * uuuuuuuu uuuuuuuu uuuuuYYY YYYYYYYY YYYYYYYY YYYYYYYY YYYYYXXX XXXXXXXX
+ *
+ * X = bits used for offset number
+ * Y = bits used for block number
+ * u = unused bit
+ *
+ * 11 bits are enough for the offset number, because MaxHeapTuplesPerPage < 2^11
+ * on all supported block sizes (TIDSTORE_OFFSET_NBITS). We are frugal with
+ * the bits, because smaller keys could help keep the radix tree shallow.
+ *
+ * The 64-bit value is the bitmap representation of the lowest 6 bits, and
+ * the remaining 37 bits are used as the key:
+ *
+ * value = bitmap representation of the lowest 6 bits (XXXXXX)
+ * key   = uuuuuuuu uuuuuuuu uuuuuuuu uuuYYYYY YYYYYYYY YYYYYYYY YYYYYYYY YYYXXXXX
+ *
+ * The maximum height of the radix tree is 5.
+ *
+ * XXX: if we want to support non-heap table AMs that want to use the full
+ * range of possible offset numbers, we'll need to reconsider the
+ * TIDSTORE_OFFSET_NBITS value.
+ */
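+
+/*
+ * As a worked example (using the defaults defined below, TIDSTORE_OFFSET_NBITS
+ * = 11 and TIDSTORE_VALUE_NBITS = 6): the TID (blkno = 1000, offset = 17)
+ * gives tid_i = 17 | (1000 << 11) = 2048017, so the bit offset within the
+ * value is 2048017 % 64 = 17 and the key is 2048017 / 64 = 32000. The TID
+ * (1000, 100) maps to key 32001 with bit offset 36. One heap block thus
+ * spreads over at most 2^(11 - 6) = 32 keys, and KEY_GET_BLKNO() recovers
+ * the block number from either key as key >> 5 = 1000.
+ */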
+#define TIDSTORE_OFFSET_NBITS 11
+#define TIDSTORE_VALUE_NBITS 6
+
+/*
+ * Memory consumption depends on the number of Tids stored, but also on their
+ * distribution, on how the radix tree stores them, and on the memory management
+ * that backs the radix tree. The maximum number of bytes that a TidStore can
+ * use is specified by max_bytes in tidstore_create(). We want the total
+ * amount of memory consumption not to exceed max_bytes.
+ *
+ * In non-shared cases, the radix tree uses a slab allocator for each node
+ * size class. The most memory-consuming case while adding Tids associated
+ * with one page (i.e., during tidstore_add_tids()) is allocating the
+ * largest radix tree node in a new slab block, which is approximately 70kB.
+ * Therefore, we deduct 70kB from the maximum bytes.
+ *
+ * In shared cases, DSA allocates memory segments big enough to follow
+ * a geometric series that approximately doubles the total DSA size (see
+ * make_new_segment() in dsa.c). We simulated how DSA increases segment
+ * size, and the simulation showed that a 75% threshold for the maximum
+ * bytes works well when max_bytes is a power of two, and a 60% threshold
+ * works for other cases.
+ */
+#define TIDSTORE_LOCAL_MAX_MEMORY_DEDUCT (1024L * 70) /* 70kB */
+#define TIDSTORE_SHARED_MAX_MEMORY_RATIO_PO2 (float) 0.75
+#define TIDSTORE_SHARED_MAX_MEMORY_RATIO (float) 0.6
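+
+/*
+ * For example, a max_bytes of 1GB (a power of two) gives a shared-memory
+ * limit of 768MB, and a max_bytes of 1.5GB gives a limit of 0.9GB.
+ */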
+
+#define KEY_GET_BLKNO(key) \
+ ((BlockNumber) ((key) >> (TIDSTORE_OFFSET_NBITS - TIDSTORE_VALUE_NBITS)))
+#define BLKNO_GET_KEY(blkno) \
+ (((uint64) (blkno) << (TIDSTORE_OFFSET_NBITS - TIDSTORE_VALUE_NBITS)))
+
+/* A magic value used to identify our TidStores. */
+#define TIDSTORE_MAGIC 0x826f6a10
+
+#define RT_PREFIX local_rt
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#include "lib/radixtree.h"
+
+#define RT_PREFIX shared_rt
+#define RT_SHMEM
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#include "lib/radixtree.h"
+
+/* The header object for a TidStore */
+typedef struct TidStoreControl
+{
+ /*
+ * 'num_tids' is the number of Tids stored so far. 'max_bytes' is the maximum
+ * number of bytes a TidStore can use. These two fields are used in both
+ * the non-shared and shared cases.
+ */
+ uint64 num_tids;
+ uint64 max_bytes;
+
+ /* The below fields are used only in shared case */
+
+ uint32 magic;
+
+ /* protect the shared fields */
+ LWLock lock;
+
+ /* handles for TidStore and radix tree */
+ tidstore_handle handle;
+ shared_rt_handle tree_handle;
+} TidStoreControl;
+
+/* Per-backend state for a TidStore */
+struct TidStore
+{
+ /*
+ * Control object. This is allocated in DSA area 'area' in the shared
+ * case, otherwise in backend-local memory.
+ */
+ TidStoreControl *control;
+
+ /* Storage for Tids. Use either one depending on TidStoreIsShared() */
+ union
+ {
+ local_rt_radix_tree *local;
+ shared_rt_radix_tree *shared;
+ } tree;
+
+ /* DSA area for TidStore if used */
+ dsa_area *area;
+};
+#define TidStoreIsShared(ts) ((ts)->area != NULL)
+
+/* Iterator for TidStore */
+typedef struct TidStoreIter
+{
+ TidStore *ts;
+
+ /* iterator of radix tree. Use either one depending on TidStoreIsShared() */
+ union
+ {
+ shared_rt_iter *shared;
+ local_rt_iter *local;
+ } tree_iter;
+
+ /* we returned all tids? */
+ bool finished;
+
+ /* save for the next iteration */
+ uint64 next_key;
+ uint64 next_val;
+
+ /* output for the caller */
+ TidStoreIterResult result;
+} TidStoreIter;
+
+static void tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, uint64 val);
+static inline uint64 tid_to_key_off(ItemPointer tid, uint32 *off);
+
+/*
+ * Create a TidStore. The returned object is allocated in backend-local memory.
+ * The radix tree for storage is allocated in the DSA area if 'area' is non-NULL.
+ */
+TidStore *
+tidstore_create(uint64 max_bytes, dsa_area *area)
+{
+ TidStore *ts;
+
+ ts = palloc0(sizeof(TidStore));
+
+ /*
+ * Create the radix tree for the main storage.
+ */
+ if (area != NULL)
+ {
+ dsa_pointer dp;
+ float ratio = ((max_bytes & (max_bytes - 1)) == 0)
+ ? TIDSTORE_SHARED_MAX_MEMORY_RATIO_PO2
+ : TIDSTORE_SHARED_MAX_MEMORY_RATIO;
+
+ ts->tree.shared = shared_rt_create(CurrentMemoryContext, area);
+
+ dp = dsa_allocate0(area, sizeof(TidStoreControl));
+ ts->control = (TidStoreControl *) dsa_get_address(area, dp);
+ ts->control->max_bytes = (uint64) (max_bytes * ratio);
+ ts->area = area;
+
+ ts->control->magic = TIDSTORE_MAGIC;
+ LWLockInitialize(&ts->control->lock, LWTRANCHE_SHARED_TIDSTORE);
+ ts->control->handle = dp;
+ ts->control->tree_handle = shared_rt_get_handle(ts->tree.shared);
+ }
+ else
+ {
+ ts->tree.local = local_rt_create(CurrentMemoryContext);
+
+ ts->control = (TidStoreControl *) palloc0(sizeof(TidStoreControl));
+ ts->control->max_bytes = max_bytes - TIDSTORE_LOCAL_MAX_MEMORY_DEDUCT;
+ }
+
+ return ts;
+}
+
+/*
+ * Attach to the shared TidStore using a handle. The returned object is
+ * allocated in backend-local memory using the CurrentMemoryContext.
+ */
+TidStore *
+tidstore_attach(dsa_area *area, tidstore_handle handle)
+{
+ TidStore *ts;
+ dsa_pointer control;
+
+ Assert(area != NULL);
+ Assert(DsaPointerIsValid(handle));
+
+ /* create per-backend state */
+ ts = palloc0(sizeof(TidStore));
+
+ /* Find the control object in shared memory */
+ control = handle;
+
+ /* Set up the TidStore */
+ ts->control = (TidStoreControl *) dsa_get_address(area, control);
+ Assert(ts->control->magic == TIDSTORE_MAGIC);
+
+ ts->tree.shared = shared_rt_attach(area, ts->control->tree_handle);
+ ts->area = area;
+
+ return ts;
+}
+
+/*
+ * Detach from a TidStore. This detaches from the radix tree and frees the
+ * backend-local resources. The radix tree will continue to exist until
+ * it is either explicitly destroyed, or the area that backs it is returned
+ * to the operating system.
+ */
+void
+tidstore_detach(TidStore *ts)
+{
+ Assert(TidStoreIsShared(ts) && ts->control->magic == TIDSTORE_MAGIC);
+
+ shared_rt_detach(ts->tree.shared);
+ pfree(ts);
+}
+
+/*
+ * Destroy a TidStore, returning all memory. The caller must be certain that
+ * no other backend will attempt to access the TidStore before calling this
+ * function. Other backends must explicitly call tidstore_detach to free up
+ * backend-local memory associated with the TidStore. The backend that calls
+ * tidstore_destroy must not call tidstore_detach.
+ */
+void
+tidstore_destroy(TidStore *ts)
+{
+ if (TidStoreIsShared(ts))
+ {
+ Assert(ts->control->magic == TIDSTORE_MAGIC);
+
+ /*
+ * Vandalize the control block to help catch programming errors where
+ * other backends access the memory formerly occupied by this radix
+ * tree.
+ */
+ ts->control->magic = 0;
+ dsa_free(ts->area, ts->control->handle);
+ shared_rt_free(ts->tree.shared);
+ }
+ else
+ {
+ pfree(ts->control);
+ local_rt_free(ts->tree.local);
+ }
+
+ pfree(ts);
+}
+
+/* Forget all collected Tids */
+void
+tidstore_reset(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ if (TidStoreIsShared(ts))
+ {
+ LWLockAcquire(&ts->control->lock, LW_EXCLUSIVE);
+
+ /*
+ * Free the radix tree and return allocated DSA segments to
+ * the operating system.
+ */
+ shared_rt_free(ts->tree.shared);
+ dsa_trim(ts->area);
+
+ /* Recreate the radix tree */
+ ts->tree.shared = shared_rt_create(CurrentMemoryContext, ts->area);
+
+ /* update the radix tree handle as we recreated it */
+ ts->control->tree_handle = shared_rt_get_handle(ts->tree.shared);
+
+ /* Reset the statistics */
+ ts->control->num_tids = 0;
+
+ LWLockRelease(&ts->control->lock);
+ }
+ else
+ {
+ local_rt_free(ts->tree.local);
+ ts->tree.local = local_rt_create(CurrentMemoryContext);
+
+ /* Reset the statistics */
+ ts->control->num_tids = 0;
+ }
+}
+
+static inline void
+tidstore_insert_kv(TidStore *ts, uint64 key, uint64 val)
+{
+ if (TidStoreIsShared(ts))
+ {
+ /*
+ * Since the shared radix tree supports concurrent inserts,
+ * we don't need to acquire the lock.
+ */
+ shared_rt_set(ts->tree.shared, key, val);
+ }
+ else
+ local_rt_set(ts->tree.local, key, val);
+}
+
+/* Add Tids on a block to TidStore */
+void
+tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets)
+{
+#define NUM_KEYS_PER_BLOCK (1 << (TIDSTORE_OFFSET_NBITS - TIDSTORE_VALUE_NBITS))
+ ItemPointerData tid;
+ uint64 key_base;
+ uint64 values[NUM_KEYS_PER_BLOCK] = {0};
+
+ ItemPointerSetBlockNumber(&tid, blkno);
+ key_base = BLKNO_GET_KEY(blkno);
+
+ for (int i = 0; i < num_offsets; i++)
+ {
+ uint64 key;
+ uint32 off;
+ int idx;
+
+ ItemPointerSetOffsetNumber(&tid, offsets[i]);
+
+ /* encode the Tid to key and val */
+ key = tid_to_key_off(&tid, &off);
+
+ idx = key - key_base;
+ Assert(idx >= 0 && idx < NUM_KEYS_PER_BLOCK);
+
+ values[idx] |= UINT64CONST(1) << off;
+ }
+
+ /* insert the calculated key-values to the tree */
+ for (int i = 0; i < NUM_KEYS_PER_BLOCK; i++)
+ {
+ if (values[i])
+ {
+ uint64 key = key_base + i;
+
+ tidstore_insert_kv(ts, key, values[i]);
+ }
+ }
+
+ if (TidStoreIsShared(ts))
+ LWLockAcquire(&ts->control->lock, LW_EXCLUSIVE);
+
+ /* update statistics */
+ ts->control->num_tids += num_offsets;
+
+ if (TidStoreIsShared(ts))
+ LWLockRelease(&ts->control->lock);
+}
+
+/* Return true if the given Tid is present in TidStore */
+bool
+tidstore_lookup_tid(TidStore *ts, ItemPointer tid)
+{
+ uint64 key;
+ uint64 val;
+ uint32 off;
+ bool found;
+
+ key = tid_to_key_off(tid, &off);
+
+ found = TidStoreIsShared(ts) ?
+ shared_rt_search(ts->tree.shared, key, &val) :
+ local_rt_search(ts->tree.local, key, &val);
+
+ if (!found)
+ return false;
+
+ return (val & (UINT64CONST(1) << off)) != 0;
+}
+
+/*
+ * Prepare to iterate through a TidStore. The caller must be certain that
+ * no other backend will attempt to update the TidStore during the iteration.
+ */
+TidStoreIter *
+tidstore_begin_iterate(TidStore *ts)
+{
+ TidStoreIter *iter;
+
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ iter = palloc0(sizeof(TidStoreIter));
+ iter->ts = ts;
+ iter->result.blkno = InvalidBlockNumber;
+
+ if (TidStoreIsShared(ts))
+ iter->tree_iter.shared = shared_rt_begin_iterate(ts->tree.shared);
+ else
+ iter->tree_iter.local = local_rt_begin_iterate(ts->tree.local);
+
+ /* If the TidStore is empty, there is nothing to iterate over */
+ if (tidstore_num_tids(ts) == 0)
+ iter->finished = true;
+
+ return iter;
+}
+
+static inline bool
+tidstore_iter_kv(TidStoreIter *iter, uint64 *key, uint64 *val)
+{
+ if (TidStoreIsShared(iter->ts))
+ return shared_rt_iterate_next(iter->tree_iter.shared, key, val);
+ else
+ return local_rt_iterate_next(iter->tree_iter.local, key, val);
+}
+
+/*
+ * Scan the TidStore and return a TidStoreIterResult representing Tids
+ * in one page. Offset numbers in the result are sorted.
+ */
+TidStoreIterResult *
+tidstore_iterate_next(TidStoreIter *iter)
+{
+ uint64 key;
+ uint64 val;
+ TidStoreIterResult *result = &(iter->result);
+
+ if (iter->finished)
+ return NULL;
+
+ if (BlockNumberIsValid(result->blkno))
+ {
+ result->num_offsets = 0;
+ tidstore_iter_extract_tids(iter, iter->next_key, iter->next_val);
+ }
+
+ while (tidstore_iter_kv(iter, &key, &val))
+ {
+ BlockNumber blkno;
+
+ blkno = KEY_GET_BLKNO(key);
+
+ if (BlockNumberIsValid(result->blkno) && result->blkno != blkno)
+ {
+ /*
+ * Remember the key-value pair for the next block for the
+ * next iteration.
+ */
+ iter->next_key = key;
+ iter->next_val = val;
+ return result;
+ }
+
+ /* Collect tids extracted from the key-value pair */
+ tidstore_iter_extract_tids(iter, key, val);
+ }
+
+ iter->finished = true;
+ return result;
+}
+
+/* Finish an iteration over TidStore */
+void
+tidstore_end_iterate(TidStoreIter *iter)
+{
+ if (TidStoreIsShared(iter->ts))
+ shared_rt_end_iterate(iter->tree_iter.shared);
+ else
+ local_rt_end_iterate(iter->tree_iter.local);
+
+ pfree(iter);
+}
+
+/* Return the number of Tids we collected so far */
+uint64
+tidstore_num_tids(TidStore *ts)
+{
+ uint64 num_tids;
+
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ if (!TidStoreIsShared(ts))
+ return ts->control->num_tids;
+
+ LWLockAcquire(&ts->control->lock, LW_SHARED);
+ num_tids = ts->control->num_tids;
+ LWLockRelease(&ts->control->lock);
+
+ return num_tids;
+}
+
+/* Return true if the current memory usage of TidStore exceeds the limit */
+bool
+tidstore_is_full(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ return (tidstore_memory_usage(ts) > ts->control->max_bytes);
+}
+
+/* Return the maximum memory TidStore can use */
+uint64
+tidstore_max_memory(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ return ts->control->max_bytes;
+}
+
+/* Return the memory usage of TidStore */
+uint64
+tidstore_memory_usage(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ /*
+ * In the shared case, TidStoreControl and radix_tree are backed by the
+ * same DSA area and rt_memory_usage() returns the value including both.
+ * So we don't need to add the size of TidStoreControl separately.
+ */
+ if (TidStoreIsShared(ts))
+ return (uint64) sizeof(TidStore) + shared_rt_memory_usage(ts->tree.shared);
+ else
+ return (uint64) sizeof(TidStore) + sizeof(TidStoreControl) +
+ local_rt_memory_usage(ts->tree.local);
+}
+
+/*
+ * Get a handle that can be used by other processes to attach to this TidStore
+ */
+tidstore_handle
+tidstore_get_handle(TidStore *ts)
+{
+ Assert(TidStoreIsShared(ts) && ts->control->magic == TIDSTORE_MAGIC);
+
+ return ts->control->handle;
+}
+
+/* Extract Tids from the given key-value pair */
+static void
+tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, uint64 val)
+{
+ TidStoreIterResult *result = (&iter->result);
+
+ for (int i = 0; i < sizeof(uint64) * BITS_PER_BYTE; i++)
+ {
+ uint64 tid_i;
+ OffsetNumber off;
+
+ if ((val & (UINT64CONST(1) << i)) == 0)
+ continue;
+
+ tid_i = key << TIDSTORE_VALUE_NBITS;
+ tid_i |= i;
+
+ off = tid_i & ((UINT64CONST(1) << TIDSTORE_OFFSET_NBITS) - 1);
+ result->offsets[result->num_offsets++] = off;
+ }
+
+ result->blkno = KEY_GET_BLKNO(key);
+}
+
+/*
+ * Encode a Tid into a key, and set *off to its bit offset within the value.
+ */
+static inline uint64
+tid_to_key_off(ItemPointer tid, uint32 *off)
+{
+ uint64 upper;
+ uint64 tid_i;
+
+ tid_i = ItemPointerGetOffsetNumber(tid);
+ tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << TIDSTORE_OFFSET_NBITS;
+
+ *off = tid_i & ((1 << TIDSTORE_VALUE_NBITS) - 1);
+ upper = tid_i >> TIDSTORE_VALUE_NBITS;
+ Assert(*off < (sizeof(uint64) * BITS_PER_BYTE));
+
+ return upper;
+}
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 196bece0a3..cbfe329591 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -176,6 +176,8 @@ static const char *const BuiltinTrancheNames[] = {
"SharedTupleStore",
/* LWTRANCHE_SHARED_TIDBITMAP: */
"SharedTidBitmap",
+ /* LWTRANCHE_SHARED_TIDSTORE: */
+ "SharedTidStore",
/* LWTRANCHE_PARALLEL_APPEND: */
"ParallelAppend",
/* LWTRANCHE_PER_XACT_PREDICATE_LIST: */
diff --git a/src/include/access/tidstore.h b/src/include/access/tidstore.h
new file mode 100644
index 0000000000..ec3d9f87f5
--- /dev/null
+++ b/src/include/access/tidstore.h
@@ -0,0 +1,49 @@
+/*-------------------------------------------------------------------------
+ *
+ * tidstore.h
+ * Tid storage.
+ *
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/tidstore.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef TIDSTORE_H
+#define TIDSTORE_H
+
+#include "storage/itemptr.h"
+#include "utils/dsa.h"
+
+typedef dsa_pointer tidstore_handle;
+
+typedef struct TidStore TidStore;
+typedef struct TidStoreIter TidStoreIter;
+
+typedef struct TidStoreIterResult
+{
+ BlockNumber blkno;
+ OffsetNumber offsets[MaxOffsetNumber]; /* XXX: usually not fully used */
+ int num_offsets;
+} TidStoreIterResult;
+
+extern TidStore *tidstore_create(uint64 max_bytes, dsa_area *dsa);
+extern TidStore *tidstore_attach(dsa_area *dsa, dsa_pointer handle);
+extern void tidstore_detach(TidStore *ts);
+extern void tidstore_destroy(TidStore *ts);
+extern void tidstore_reset(TidStore *ts);
+extern void tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets);
+extern bool tidstore_lookup_tid(TidStore *ts, ItemPointer tid);
+extern TidStoreIter * tidstore_begin_iterate(TidStore *ts);
+extern TidStoreIterResult *tidstore_iterate_next(TidStoreIter *iter);
+extern void tidstore_end_iterate(TidStoreIter *iter);
+extern uint64 tidstore_num_tids(TidStore *ts);
+extern bool tidstore_is_full(TidStore *ts);
+extern uint64 tidstore_max_memory(TidStore *ts);
+extern uint64 tidstore_memory_usage(TidStore *ts);
+extern tidstore_handle tidstore_get_handle(TidStore *ts);
+
+#endif /* TIDSTORE_H */
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index e4162db613..7b7663e2e1 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -199,6 +199,7 @@ typedef enum BuiltinTrancheIds
LWTRANCHE_PER_SESSION_RECORD_TYPMOD,
LWTRANCHE_SHARED_TUPLESTORE,
LWTRANCHE_SHARED_TIDBITMAP,
+ LWTRANCHE_SHARED_TIDSTORE,
LWTRANCHE_PARALLEL_APPEND,
LWTRANCHE_PER_XACT_PREDICATE_LIST,
LWTRANCHE_PGSTATS_DSA,
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 9659eb85d7..bddc16ada7 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -34,6 +34,7 @@ SUBDIRS = \
test_rls_hooks \
test_shm_mq \
test_slru \
+ test_tidstore \
unsafe_tests \
worker_spi
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 232cbdac80..c0d5645ad8 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -30,5 +30,6 @@ subdir('test_regex')
subdir('test_rls_hooks')
subdir('test_shm_mq')
subdir('test_slru')
+subdir('test_tidstore')
subdir('unsafe_tests')
subdir('worker_spi')
diff --git a/src/test/modules/test_tidstore/Makefile b/src/test/modules/test_tidstore/Makefile
new file mode 100644
index 0000000000..dab107d70c
--- /dev/null
+++ b/src/test/modules/test_tidstore/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_tidstore/Makefile
+
+MODULE_big = test_tidstore
+OBJS = \
+ $(WIN32RES) \
+ test_tidstore.o
+PGFILEDESC = "test_tidstore - test code for src/backend/access/common/tidstore.c"
+
+EXTENSION = test_tidstore
+DATA = test_tidstore--1.0.sql
+
+REGRESS = test_tidstore
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_tidstore
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_tidstore/expected/test_tidstore.out b/src/test/modules/test_tidstore/expected/test_tidstore.out
new file mode 100644
index 0000000000..7ff2f9af87
--- /dev/null
+++ b/src/test/modules/test_tidstore/expected/test_tidstore.out
@@ -0,0 +1,13 @@
+CREATE EXTENSION test_tidstore;
+--
+-- All the logic is in the test_tidstore() function. It will throw
+-- an error if something fails.
+--
+SELECT test_tidstore();
+NOTICE: testing empty tidstore
+NOTICE: testing basic operations
+ test_tidstore
+---------------
+
+(1 row)
+
diff --git a/src/test/modules/test_tidstore/meson.build b/src/test/modules/test_tidstore/meson.build
new file mode 100644
index 0000000000..31f2da7b61
--- /dev/null
+++ b/src/test/modules/test_tidstore/meson.build
@@ -0,0 +1,35 @@
+# FIXME: prevent install during main install, but not during test :/
+
+test_tidstore_sources = files(
+ 'test_tidstore.c',
+)
+
+if host_system == 'windows'
+ test_tidstore_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'test_tidstore',
+ '--FILEDESC', 'test_tidstore - test code for src/backend/access/common/tidstore.c',])
+endif
+
+test_tidstore = shared_module('test_tidstore',
+ test_tidstore_sources,
+ link_with: pgport_srv,
+ kwargs: pg_mod_args,
+)
+testprep_targets += test_tidstore
+
+install_data(
+ 'test_tidstore.control',
+ 'test_tidstore--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'test_tidstore',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'test_tidstore',
+ ],
+ },
+}
diff --git a/src/test/modules/test_tidstore/sql/test_tidstore.sql b/src/test/modules/test_tidstore/sql/test_tidstore.sql
new file mode 100644
index 0000000000..03aea31815
--- /dev/null
+++ b/src/test/modules/test_tidstore/sql/test_tidstore.sql
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_tidstore;
+
+--
+-- All the logic is in the test_tidstore() function. It will throw
+-- an error if something fails.
+--
+SELECT test_tidstore();
diff --git a/src/test/modules/test_tidstore/test_tidstore--1.0.sql b/src/test/modules/test_tidstore/test_tidstore--1.0.sql
new file mode 100644
index 0000000000..47e9149900
--- /dev/null
+++ b/src/test/modules/test_tidstore/test_tidstore--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_tidstore/test_tidstore--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_tidstore" to load this file. \quit
+
+CREATE FUNCTION test_tidstore()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_tidstore/test_tidstore.c b/src/test/modules/test_tidstore/test_tidstore.c
new file mode 100644
index 0000000000..5d38387450
--- /dev/null
+++ b/src/test/modules/test_tidstore/test_tidstore.c
@@ -0,0 +1,189 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_tidstore.c
+ * Test TidStore data structure.
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/test/modules/test_tidstore/test_tidstore.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "access/tidstore.h"
+#include "fmgr.h"
+#include "miscadmin.h"
+#include "storage/block.h"
+#include "storage/itemptr.h"
+#include "utils/memutils.h"
+
+PG_MODULE_MAGIC;
+
+#define TEST_TIDSTORE_MAX_BYTES (2 * 1024 * 1024L) /* 2MB */
+
+PG_FUNCTION_INFO_V1(test_tidstore);
+
+static void
+check_tid(TidStore *ts, BlockNumber blkno, OffsetNumber off, bool expect)
+{
+ ItemPointerData tid;
+ bool found;
+
+ ItemPointerSet(&tid, blkno, off);
+
+ found = tidstore_lookup_tid(ts, &tid);
+
+ if (found != expect)
+ elog(ERROR, "lookup TID (%u, %u) returned %d, expected %d",
+ blkno, off, found, expect);
+}
+
+static void
+test_basic(void)
+{
+#define TEST_TIDSTORE_NUM_BLOCKS 5
+#define TEST_TIDSTORE_NUM_OFFSETS 11
+#define IS_POWER_OF_TWO(x) (((x) & ((x) - 1)) == 0)
+
+ TidStore *ts;
+ TidStoreIter *iter;
+ TidStoreIterResult *iter_result;
+ BlockNumber blks[TEST_TIDSTORE_NUM_BLOCKS] = {
+ 0, MaxBlockNumber, MaxBlockNumber - 1, 1, MaxBlockNumber / 2,
+ };
+ BlockNumber blks_sorted[TEST_TIDSTORE_NUM_BLOCKS] = {
+ 0, 1, MaxBlockNumber / 2, MaxBlockNumber - 1, MaxBlockNumber
+ };
+ OffsetNumber offs[TEST_TIDSTORE_NUM_OFFSETS] = {
+ 1 << 5, 1 << 6, 1 << 7, 1 << 8, 1 << 9,
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3, 1 << 4,
+ 1 << 10
+ };
+ OffsetNumber offs_sorted[TEST_TIDSTORE_NUM_OFFSETS] = {
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3, 1 << 4,
+ 1 << 5, 1 << 6, 1 << 7, 1 << 8, 1 << 9,
+ 1 << 10
+ };
+ int blk_idx;
+
+ elog(NOTICE, "testing basic operations");
+
+ ts = tidstore_create(TEST_TIDSTORE_MAX_BYTES, NULL);
+
+ /* add tids */
+ for (int i = 0; i < TEST_TIDSTORE_NUM_BLOCKS; i++)
+ tidstore_add_tids(ts, blks[i], offs, TEST_TIDSTORE_NUM_OFFSETS);
+
+ /* lookup test */
+ for (OffsetNumber off = FirstOffsetNumber ; off < MaxHeapTuplesPerPage;
+ off++)
+ {
+ check_tid(ts, 0, off, IS_POWER_OF_TWO(off));
+ check_tid(ts, 2, off, false);
+ check_tid(ts, MaxBlockNumber - 2, off, false);
+ check_tid(ts, MaxBlockNumber, off, IS_POWER_OF_TWO(off));
+ }
+
+ /* test the number of tids */
+ if (tidstore_num_tids(ts) != (TEST_TIDSTORE_NUM_BLOCKS * TEST_TIDSTORE_NUM_OFFSETS))
+ elog(ERROR, "tidstore_num_tids returned " UINT64_FORMAT ", expected %d",
+ tidstore_num_tids(ts),
+ TEST_TIDSTORE_NUM_BLOCKS * TEST_TIDSTORE_NUM_OFFSETS);
+
+ /* iteration test */
+ iter = tidstore_begin_iterate(ts);
+ blk_idx = 0;
+ while ((iter_result = tidstore_iterate_next(iter)) != NULL)
+ {
+ /* check the returned block number */
+ if (blks_sorted[blk_idx] != iter_result->blkno)
+ elog(ERROR, "tidstore_iterate_next returned block number %u, expected %u",
+ iter_result->blkno, blks_sorted[blk_idx]);
+
+ /* check the returned offset numbers */
+ if (TEST_TIDSTORE_NUM_OFFSETS != iter_result->num_offsets)
+ elog(ERROR, "tidstore_iterate_next returned %u offsets, expected %u",
+ iter_result->num_offsets, TEST_TIDSTORE_NUM_OFFSETS);
+
+ for (int i = 0; i < iter_result->num_offsets; i++)
+ {
+ if (offs_sorted[i] != iter_result->offsets[i])
+ elog(ERROR, "tidstore_iterate_next returned offset number %u on block %u, expected %u",
+ iter_result->offsets[i], iter_result->blkno,
+ offs_sorted[i]);
+ }
+
+ blk_idx++;
+ }
+
+ if (blk_idx != TEST_TIDSTORE_NUM_BLOCKS)
+ elog(ERROR, "tidstore_iterate_next returned %d blocks, expected %d",
+ blk_idx, TEST_TIDSTORE_NUM_BLOCKS);
+
+ /* remove all tids */
+ tidstore_reset(ts);
+
+ /* test the number of tids */
+ if (tidstore_num_tids(ts) != 0)
+ elog(ERROR, "tidstore_num_tids on empty store returned non-zero");
+
+ /* lookup test for empty store */
+ for (OffsetNumber off = FirstOffsetNumber ; off < MaxHeapTuplesPerPage;
+ off++)
+ {
+ check_tid(ts, 0, off, false);
+ check_tid(ts, 2, off, false);
+ check_tid(ts, MaxBlockNumber - 2, off, false);
+ check_tid(ts, MaxBlockNumber, off, false);
+ }
+
+ tidstore_destroy(ts);
+}
+
+static void
+test_empty(void)
+{
+ TidStore *ts;
+ TidStoreIter *iter;
+ ItemPointerData tid;
+
+ elog(NOTICE, "testing empty tidstore");
+
+ ts = tidstore_create(TEST_TIDSTORE_MAX_BYTES, NULL);
+
+ ItemPointerSet(&tid, 0, FirstOffsetNumber);
+ if (tidstore_lookup_tid(ts, &tid))
+ elog(ERROR, "tidstore_lookup_tid for (0,1) on empty store returned true");
+
+ ItemPointerSet(&tid, MaxBlockNumber, MaxOffsetNumber);
+ if (tidstore_lookup_tid(ts, &tid))
+ elog(ERROR, "tidstore_lookup_tid for (%u,%u) on empty store returned true",
+ MaxBlockNumber, MaxOffsetNumber);
+
+ if (tidstore_num_tids(ts) != 0)
+ elog(ERROR, "tidstore_num_entries on empty store returned non-zero");
+
+ if (tidstore_is_full(ts))
+ elog(ERROR, "tidstore_is_full on empty store returned true");
+
+ iter = tidstore_begin_iterate(ts);
+
+ if (tidstore_iterate_next(iter) != NULL)
+ elog(ERROR, "tidstore_iterate_next on empty store returned TIDs");
+
+ tidstore_end_iterate(iter);
+
+ tidstore_destroy(ts);
+}
+
+Datum
+test_tidstore(PG_FUNCTION_ARGS)
+{
+ test_empty();
+ test_basic();
+
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_tidstore/test_tidstore.control b/src/test/modules/test_tidstore/test_tidstore.control
new file mode 100644
index 0000000000..9b6bd4638f
--- /dev/null
+++ b/src/test/modules/test_tidstore/test_tidstore.control
@@ -0,0 +1,4 @@
+comment = 'Test code for tidstore'
+default_version = '1.0'
+module_pathname = '$libdir/test_tidstore'
+relocatable = true
--
2.39.0
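As a quick orientation for readers skimming the test module above, the TidStore API it exercises boils down to the pattern below. This is only an illustrative sketch (not part of any patch), using the function names and struct fields that appear in test_tidstore.c; the authoritative signatures live in access/tidstore.h.

    /* Sketch only: the TidStore usage pattern exercised by test_tidstore.c */
    TidStore   *ts = tidstore_create(2 * 1024 * 1024L, NULL);  /* 2MB cap, local memory */
    OffsetNumber offs[3] = {1, 2, 4};
    ItemPointerData tid;
    TidStoreIter *iter;
    TidStoreIterResult *res;

    /* store the offsets of dead tuples, grouped by block */
    tidstore_add_tids(ts, (BlockNumber) 0, offs, 3);

    /* existence check -- the operation vac_tid_reaped() needs */
    ItemPointerSet(&tid, 0, 2);
    if (tidstore_lookup_tid(ts, &tid))
        elog(NOTICE, "TID (0,2) is a dead tuple");

    /* iterate in block order, getting the sorted offsets for each block */
    iter = tidstore_begin_iterate(ts);
    while ((res = tidstore_iterate_next(iter)) != NULL)
        elog(NOTICE, "block %u has %d dead offsets", res->blkno, res->num_offsets);
    tidstore_end_iterate(iter);

    tidstore_destroy(ts);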
v21-0022-Use-TIDStore-for-storing-dead-tuple-TID-during-l.patch
From 2c93e9cdb3b6825df9633bfe9e122b08d936780c Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 20 Jan 2023 10:29:31 +0700
Subject: [PATCH v21 22/22] Use TIDStore for storing dead tuple TID during lazy
vacuum
Previously, we used an array of ItemPointerData to store dead tuple
TIDs, which is not space efficient and is slow to look up. It also had
a 1GB limit on its size.
This commit switches to TIDStore for this purpose. Since TIDStore,
backed by the radix tree, allocates memory incrementally, the 1GB
limit goes away.
Also, since we can no longer estimate exactly how many TIDs fit in a
given amount of memory, the pg_stat_progress_vacuum columns
max_dead_tuples and num_dead_tuples are renamed and now report the
progress information in bytes.
Furthermore, since TIDStore uses the radix tree internally, the
minimum amount of memory it requires is 1MB, the initial DSA segment
size. Due to that, this change increases the minimum
maintenance_work_mem from 1MB to 2MB.
XXX: needs to bump catalog version
---
doc/src/sgml/monitoring.sgml | 8 +-
src/backend/access/heap/vacuumlazy.c | 210 +++++++--------------
src/backend/catalog/system_views.sql | 2 +-
src/backend/commands/vacuum.c | 76 +-------
src/backend/commands/vacuumparallel.c | 64 ++++---
src/backend/storage/lmgr/lwlock.c | 2 +
src/backend/utils/misc/guc_tables.c | 2 +-
src/include/commands/progress.h | 4 +-
src/include/commands/vacuum.h | 25 +--
src/include/storage/lwlock.h | 1 +
src/test/regress/expected/cluster.out | 2 +-
src/test/regress/expected/create_index.out | 2 +-
src/test/regress/expected/rules.out | 4 +-
src/test/regress/sql/cluster.sql | 2 +-
src/test/regress/sql/create_index.sql | 2 +-
15 files changed, 138 insertions(+), 268 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 38bc3589ae..b96bca38db 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -6860,10 +6860,10 @@ FROM pg_stat_get_backend_idset() AS backendid;
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>max_dead_tuples</structfield> <type>bigint</type>
+ <structfield>max_dead_tuple_bytes</structfield> <type>bigint</type>
</para>
<para>
- Number of dead tuples that we can store before needing to perform
+ Amount of dead tuple data that we can store before needing to perform
an index vacuum cycle, based on
<xref linkend="guc-maintenance-work-mem"/>.
</para></entry>
@@ -6871,10 +6871,10 @@ FROM pg_stat_get_backend_idset() AS backendid;
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>num_dead_tuples</structfield> <type>bigint</type>
+ <structfield>num_dead_tuple_bytes</structfield> <type>bigint</type>
</para>
<para>
- Number of dead tuples collected since the last index vacuum cycle.
+ Amount of dead tuple data collected since the last index vacuum cycle.
</para></entry>
</row>
</tbody>
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 8f14cf85f3..90f8a5e087 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -40,6 +40,7 @@
#include "access/heapam_xlog.h"
#include "access/htup_details.h"
#include "access/multixact.h"
+#include "access/tidstore.h"
#include "access/transam.h"
#include "access/visibilitymap.h"
#include "access/xact.h"
@@ -188,7 +189,7 @@ typedef struct LVRelState
* lazy_vacuum_heap_rel, which marks the same LP_DEAD line pointers as
* LP_UNUSED during second heap pass.
*/
- VacDeadItems *dead_items; /* TIDs whose index tuples we'll delete */
+ TidStore *dead_items; /* TIDs whose index tuples we'll delete */
BlockNumber rel_pages; /* total number of pages */
BlockNumber scanned_pages; /* # pages examined (not skipped via VM) */
BlockNumber removed_pages; /* # pages removed by relation truncation */
@@ -220,17 +221,21 @@ typedef struct LVRelState
typedef struct LVPagePruneState
{
bool hastup; /* Page prevents rel truncation? */
- bool has_lpdead_items; /* includes existing LP_DEAD items */
+
+ /* collected LP_DEAD items including existing LP_DEAD items */
+ int lpdead_items;
+ OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
/*
* State describes the proper VM bit states to set for the page following
- * pruning and freezing. all_visible implies !has_lpdead_items, but don't
+ * pruning and freezing. all_visible implies !HAS_LPDEAD_ITEMS(), but don't
* trust all_frozen result unless all_visible is also set to true.
*/
bool all_visible; /* Every item visible to all? */
bool all_frozen; /* provided all_visible is also true */
TransactionId visibility_cutoff_xid; /* For recovery conflicts */
} LVPagePruneState;
+#define HAS_LPDEAD_ITEMS(state) (((state).lpdead_items) > 0)
/* Struct for saving and restoring vacuum error information. */
typedef struct LVSavedErrInfo
@@ -259,8 +264,9 @@ static bool lazy_scan_noprune(LVRelState *vacrel, Buffer buf,
static void lazy_vacuum(LVRelState *vacrel);
static bool lazy_vacuum_all_indexes(LVRelState *vacrel);
static void lazy_vacuum_heap_rel(LVRelState *vacrel);
-static int lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
- Buffer buffer, int index, Buffer vmbuffer);
+static void lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
+ OffsetNumber *offsets, int num_offsets,
+ Buffer buffer, Buffer vmbuffer);
static bool lazy_check_wraparound_failsafe(LVRelState *vacrel);
static void lazy_cleanup_all_indexes(LVRelState *vacrel);
static IndexBulkDeleteResult *lazy_vacuum_one_index(Relation indrel,
@@ -825,21 +831,21 @@ lazy_scan_heap(LVRelState *vacrel)
blkno,
next_unskippable_block,
next_fsm_block_to_vacuum = 0;
- VacDeadItems *dead_items = vacrel->dead_items;
+ TidStore *dead_items = vacrel->dead_items;
Buffer vmbuffer = InvalidBuffer;
bool next_unskippable_allvis,
skipping_current_range;
const int initprog_index[] = {
PROGRESS_VACUUM_PHASE,
PROGRESS_VACUUM_TOTAL_HEAP_BLKS,
- PROGRESS_VACUUM_MAX_DEAD_TUPLES
+ PROGRESS_VACUUM_MAX_DEAD_TUPLE_BYTES
};
int64 initprog_val[3];
/* Report that we're scanning the heap, advertising total # of blocks */
initprog_val[0] = PROGRESS_VACUUM_PHASE_SCAN_HEAP;
initprog_val[1] = rel_pages;
- initprog_val[2] = dead_items->max_items;
+ initprog_val[2] = tidstore_max_memory(vacrel->dead_items);
pgstat_progress_update_multi_param(3, initprog_index, initprog_val);
/* Set up an initial range of skippable blocks using the visibility map */
@@ -906,8 +912,7 @@ lazy_scan_heap(LVRelState *vacrel)
* dead_items TIDs, pause and do a cycle of vacuuming before we tackle
* this page.
*/
- Assert(dead_items->max_items >= MaxHeapTuplesPerPage);
- if (dead_items->max_items - dead_items->num_items < MaxHeapTuplesPerPage)
+ if (tidstore_is_full(vacrel->dead_items))
{
/*
* Before beginning index vacuuming, we release any pin we may
@@ -1018,7 +1023,7 @@ lazy_scan_heap(LVRelState *vacrel)
*/
lazy_scan_prune(vacrel, buf, blkno, page, &prunestate);
- Assert(!prunestate.all_visible || !prunestate.has_lpdead_items);
+ Assert(!prunestate.all_visible || !HAS_LPDEAD_ITEMS(prunestate));
/* Remember the location of the last page with nonremovable tuples */
if (prunestate.hastup)
@@ -1034,14 +1039,12 @@ lazy_scan_heap(LVRelState *vacrel)
* performed here can be thought of as the one-pass equivalent of
* a call to lazy_vacuum().
*/
- if (prunestate.has_lpdead_items)
+ if (HAS_LPDEAD_ITEMS(prunestate))
{
Size freespace;
- lazy_vacuum_heap_page(vacrel, blkno, buf, 0, vmbuffer);
-
- /* Forget the LP_DEAD items that we just vacuumed */
- dead_items->num_items = 0;
+ lazy_vacuum_heap_page(vacrel, blkno, prunestate.deadoffsets,
+ prunestate.lpdead_items, buf, vmbuffer);
/*
* Periodically perform FSM vacuuming to make newly-freed
@@ -1078,7 +1081,16 @@ lazy_scan_heap(LVRelState *vacrel)
* with prunestate-driven visibility map and FSM steps (just like
* the two-pass strategy).
*/
- Assert(dead_items->num_items == 0);
+ Assert(tidstore_num_tids(dead_items) == 0);
+ }
+ else if (HAS_LPDEAD_ITEMS(prunestate))
+ {
+ /* Save details of the LP_DEAD items from the page */
+ tidstore_add_tids(dead_items, blkno, prunestate.deadoffsets,
+ prunestate.lpdead_items);
+
+ pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
+ tidstore_memory_usage(dead_items));
}
/*
@@ -1145,7 +1157,7 @@ lazy_scan_heap(LVRelState *vacrel)
* There should never be LP_DEAD items on a page with PD_ALL_VISIBLE
* set, however.
*/
- else if (prunestate.has_lpdead_items && PageIsAllVisible(page))
+ else if (HAS_LPDEAD_ITEMS(prunestate) && PageIsAllVisible(page))
{
elog(WARNING, "page containing LP_DEAD items is marked as all-visible in relation \"%s\" page %u",
vacrel->relname, blkno);
@@ -1193,7 +1205,7 @@ lazy_scan_heap(LVRelState *vacrel)
* Final steps for block: drop cleanup lock, record free space in the
* FSM
*/
- if (prunestate.has_lpdead_items && vacrel->do_index_vacuuming)
+ if (HAS_LPDEAD_ITEMS(prunestate) && vacrel->do_index_vacuuming)
{
/*
* Wait until lazy_vacuum_heap_rel() to save free space. This
@@ -1249,7 +1261,7 @@ lazy_scan_heap(LVRelState *vacrel)
* Do index vacuuming (call each index's ambulkdelete routine), then do
* related heap vacuuming
*/
- if (dead_items->num_items > 0)
+ if (tidstore_num_tids(dead_items) > 0)
lazy_vacuum(vacrel);
/*
@@ -1543,13 +1555,11 @@ lazy_scan_prune(LVRelState *vacrel,
HTSV_Result res;
int tuples_deleted,
tuples_frozen,
- lpdead_items,
live_tuples,
recently_dead_tuples;
int nnewlpdead;
HeapPageFreeze pagefrz;
int64 fpi_before = pgWalUsage.wal_fpi;
- OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
HeapTupleFreeze frozen[MaxHeapTuplesPerPage];
Assert(BufferGetBlockNumber(buf) == blkno);
@@ -1571,7 +1581,6 @@ retry:
pagefrz.NoFreezePageRelminMxid = vacrel->NewRelminMxid;
tuples_deleted = 0;
tuples_frozen = 0;
- lpdead_items = 0;
live_tuples = 0;
recently_dead_tuples = 0;
@@ -1580,9 +1589,9 @@ retry:
*
* We count tuples removed by the pruning step as tuples_deleted. Its
* final value can be thought of as the number of tuples that have been
- * deleted from the table. It should not be confused with lpdead_items;
- * lpdead_items's final value can be thought of as the number of tuples
- * that were deleted from indexes.
+ * deleted from the table. It should not be confused with
+ * prunestate->lpdead_items; prunestate->lpdead_items's final value can
+ * be thought of as the number of tuples that were deleted from indexes.
*/
tuples_deleted = heap_page_prune(rel, buf, vacrel->vistest,
InvalidTransactionId, 0, &nnewlpdead,
@@ -1593,7 +1602,7 @@ retry:
* requiring freezing among remaining tuples with storage
*/
prunestate->hastup = false;
- prunestate->has_lpdead_items = false;
+ prunestate->lpdead_items = 0;
prunestate->all_visible = true;
prunestate->all_frozen = true;
prunestate->visibility_cutoff_xid = InvalidTransactionId;
@@ -1638,7 +1647,7 @@ retry:
* (This is another case where it's useful to anticipate that any
* LP_DEAD items will become LP_UNUSED during the ongoing VACUUM.)
*/
- deadoffsets[lpdead_items++] = offnum;
+ prunestate->deadoffsets[prunestate->lpdead_items++] = offnum;
continue;
}
@@ -1875,7 +1884,7 @@ retry:
*/
#ifdef USE_ASSERT_CHECKING
/* Note that all_frozen value does not matter when !all_visible */
- if (prunestate->all_visible && lpdead_items == 0)
+ if (prunestate->all_visible && prunestate->lpdead_items == 0)
{
TransactionId cutoff;
bool all_frozen;
@@ -1888,28 +1897,9 @@ retry:
}
#endif
- /*
- * Now save details of the LP_DEAD items from the page in vacrel
- */
- if (lpdead_items > 0)
+ if (prunestate->lpdead_items > 0)
{
- VacDeadItems *dead_items = vacrel->dead_items;
- ItemPointerData tmp;
-
vacrel->lpdead_item_pages++;
- prunestate->has_lpdead_items = true;
-
- ItemPointerSetBlockNumber(&tmp, blkno);
-
- for (int i = 0; i < lpdead_items; i++)
- {
- ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
- dead_items->items[dead_items->num_items++] = tmp;
- }
-
- Assert(dead_items->num_items <= dead_items->max_items);
- pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
- dead_items->num_items);
/*
* It was convenient to ignore LP_DEAD items in all_visible earlier on
@@ -1928,7 +1918,7 @@ retry:
/* Finally, add page-local counts to whole-VACUUM counts */
vacrel->tuples_deleted += tuples_deleted;
vacrel->tuples_frozen += tuples_frozen;
- vacrel->lpdead_items += lpdead_items;
+ vacrel->lpdead_items += prunestate->lpdead_items;
vacrel->live_tuples += live_tuples;
vacrel->recently_dead_tuples += recently_dead_tuples;
}
@@ -2129,8 +2119,7 @@ lazy_scan_noprune(LVRelState *vacrel,
}
else
{
- VacDeadItems *dead_items = vacrel->dead_items;
- ItemPointerData tmp;
+ TidStore *dead_items = vacrel->dead_items;
/*
* Page has LP_DEAD items, and so any references/TIDs that remain in
@@ -2139,17 +2128,10 @@ lazy_scan_noprune(LVRelState *vacrel,
*/
vacrel->lpdead_item_pages++;
- ItemPointerSetBlockNumber(&tmp, blkno);
+ tidstore_add_tids(dead_items, blkno, deadoffsets, lpdead_items);
- for (int i = 0; i < lpdead_items; i++)
- {
- ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
- dead_items->items[dead_items->num_items++] = tmp;
- }
-
- Assert(dead_items->num_items <= dead_items->max_items);
- pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
- dead_items->num_items);
+ pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
+ tidstore_memory_usage(dead_items));
vacrel->lpdead_items += lpdead_items;
@@ -2198,7 +2180,7 @@ lazy_vacuum(LVRelState *vacrel)
if (!vacrel->do_index_vacuuming)
{
Assert(!vacrel->do_index_cleanup);
- vacrel->dead_items->num_items = 0;
+ tidstore_reset(vacrel->dead_items);
return;
}
@@ -2227,7 +2209,7 @@ lazy_vacuum(LVRelState *vacrel)
BlockNumber threshold;
Assert(vacrel->num_index_scans == 0);
- Assert(vacrel->lpdead_items == vacrel->dead_items->num_items);
+ Assert(vacrel->lpdead_items == tidstore_num_tids(vacrel->dead_items));
Assert(vacrel->do_index_vacuuming);
Assert(vacrel->do_index_cleanup);
@@ -2254,8 +2236,8 @@ lazy_vacuum(LVRelState *vacrel)
* cases then this may need to be reconsidered.
*/
threshold = (double) vacrel->rel_pages * BYPASS_THRESHOLD_PAGES;
- bypass = (vacrel->lpdead_item_pages < threshold &&
- vacrel->lpdead_items < MAXDEADITEMS(32L * 1024L * 1024L));
+ bypass = (vacrel->lpdead_item_pages < threshold) &&
+ tidstore_memory_usage(vacrel->dead_items) < (32L * 1024L * 1024L);
}
if (bypass)
@@ -2300,7 +2282,7 @@ lazy_vacuum(LVRelState *vacrel)
* Forget the LP_DEAD items that we just vacuumed (or just decided to not
* vacuum)
*/
- vacrel->dead_items->num_items = 0;
+ tidstore_reset(vacrel->dead_items);
}
/*
@@ -2373,7 +2355,7 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
* place).
*/
Assert(vacrel->num_index_scans > 0 ||
- vacrel->dead_items->num_items == vacrel->lpdead_items);
+ tidstore_num_tids(vacrel->dead_items) == vacrel->lpdead_items);
Assert(allindexes || vacrel->failsafe_active);
/*
@@ -2410,10 +2392,11 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
static void
lazy_vacuum_heap_rel(LVRelState *vacrel)
{
- int index = 0;
BlockNumber vacuumed_pages = 0;
Buffer vmbuffer = InvalidBuffer;
LVSavedErrInfo saved_err_info;
+ TidStoreIter *iter;
+ TidStoreIterResult *result;
Assert(vacrel->do_index_vacuuming);
Assert(vacrel->do_index_cleanup);
@@ -2428,7 +2411,8 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
VACUUM_ERRCB_PHASE_VACUUM_HEAP,
InvalidBlockNumber, InvalidOffsetNumber);
- while (index < vacrel->dead_items->num_items)
+ iter = tidstore_begin_iterate(vacrel->dead_items);
+ while ((result = tidstore_iterate_next(iter)) != NULL)
{
BlockNumber blkno;
Buffer buf;
@@ -2437,7 +2421,7 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
vacuum_delay_point();
- blkno = ItemPointerGetBlockNumber(&vacrel->dead_items->items[index]);
+ blkno = result->blkno;
vacrel->blkno = blkno;
/*
@@ -2451,7 +2435,8 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
buf = ReadBufferExtended(vacrel->rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
vacrel->bstrategy);
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
- index = lazy_vacuum_heap_page(vacrel, blkno, buf, index, vmbuffer);
+ lazy_vacuum_heap_page(vacrel, blkno, result->offsets, result->num_offsets,
+ buf, vmbuffer);
/* Now that we've vacuumed the page, record its available space */
page = BufferGetPage(buf);
@@ -2461,6 +2446,7 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
vacuumed_pages++;
}
+ tidstore_end_iterate(iter);
vacrel->blkno = InvalidBlockNumber;
if (BufferIsValid(vmbuffer))
@@ -2470,14 +2456,13 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
* We set all LP_DEAD items from the first heap pass to LP_UNUSED during
* the second heap pass. No more, no less.
*/
- Assert(index > 0);
Assert(vacrel->num_index_scans > 1 ||
- (index == vacrel->lpdead_items &&
+ (tidstore_num_tids(vacrel->dead_items) == vacrel->lpdead_items &&
vacuumed_pages == vacrel->lpdead_item_pages));
ereport(DEBUG2,
- (errmsg("table \"%s\": removed %lld dead item identifiers in %u pages",
- vacrel->relname, (long long) index, vacuumed_pages)));
+ (errmsg("table \"%s\": removed " UINT64_FORMAT "dead item identifiers in %u pages",
+ vacrel->relname, tidstore_num_tids(vacrel->dead_items), vacuumed_pages)));
/* Revert to the previous phase information for error traceback */
restore_vacuum_error_info(vacrel, &saved_err_info);
@@ -2495,11 +2480,10 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
* LP_DEAD item on the page. The return value is the first index immediately
* after all LP_DEAD items for the same page in the array.
*/
-static int
-lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
- int index, Buffer vmbuffer)
+static void
+lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets, Buffer buffer, Buffer vmbuffer)
{
- VacDeadItems *dead_items = vacrel->dead_items;
Page page = BufferGetPage(buffer);
OffsetNumber unused[MaxHeapTuplesPerPage];
int nunused = 0;
@@ -2518,16 +2502,11 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
START_CRIT_SECTION();
- for (; index < dead_items->num_items; index++)
+ for (int i = 0; i < num_offsets; i++)
{
- BlockNumber tblk;
- OffsetNumber toff;
ItemId itemid;
+ OffsetNumber toff = offsets[i];
- tblk = ItemPointerGetBlockNumber(&dead_items->items[index]);
- if (tblk != blkno)
- break; /* past end of tuples for this block */
- toff = ItemPointerGetOffsetNumber(&dead_items->items[index]);
itemid = PageGetItemId(page, toff);
Assert(ItemIdIsDead(itemid) && !ItemIdHasStorage(itemid));
@@ -2597,7 +2576,6 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
/* Revert to the previous phase information for error traceback */
restore_vacuum_error_info(vacrel, &saved_err_info);
- return index;
}
/*
@@ -3093,46 +3071,6 @@ count_nondeletable_pages(LVRelState *vacrel, bool *lock_waiter_detected)
return vacrel->nonempty_pages;
}
-/*
- * Returns the number of dead TIDs that VACUUM should allocate space to
- * store, given a heap rel of size vacrel->rel_pages, and given current
- * maintenance_work_mem setting (or current autovacuum_work_mem setting,
- * when applicable).
- *
- * See the comments at the head of this file for rationale.
- */
-static int
-dead_items_max_items(LVRelState *vacrel)
-{
- int64 max_items;
- int vac_work_mem = IsAutoVacuumWorkerProcess() &&
- autovacuum_work_mem != -1 ?
- autovacuum_work_mem : maintenance_work_mem;
-
- if (vacrel->nindexes > 0)
- {
- BlockNumber rel_pages = vacrel->rel_pages;
-
- max_items = MAXDEADITEMS(vac_work_mem * 1024L);
- max_items = Min(max_items, INT_MAX);
- max_items = Min(max_items, MAXDEADITEMS(MaxAllocSize));
-
- /* curious coding here to ensure the multiplication can't overflow */
- if ((BlockNumber) (max_items / MaxHeapTuplesPerPage) > rel_pages)
- max_items = rel_pages * MaxHeapTuplesPerPage;
-
- /* stay sane if small maintenance_work_mem */
- max_items = Max(max_items, MaxHeapTuplesPerPage);
- }
- else
- {
- /* One-pass case only stores a single heap page's TIDs at a time */
- max_items = MaxHeapTuplesPerPage;
- }
-
- return (int) max_items;
-}
-
/*
* Allocate dead_items (either using palloc, or in dynamic shared memory).
* Sets dead_items in vacrel for caller.
@@ -3143,11 +3081,9 @@ dead_items_max_items(LVRelState *vacrel)
static void
dead_items_alloc(LVRelState *vacrel, int nworkers)
{
- VacDeadItems *dead_items;
- int max_items;
-
- max_items = dead_items_max_items(vacrel);
- Assert(max_items >= MaxHeapTuplesPerPage);
+ int vac_work_mem = IsAutoVacuumWorkerProcess() &&
+ autovacuum_work_mem != -1 ?
+ autovacuum_work_mem * 1024L : maintenance_work_mem * 1024L;
/*
* Initialize state for a parallel vacuum. As of now, only one worker can
@@ -3174,7 +3110,7 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
else
vacrel->pvs = parallel_vacuum_init(vacrel->rel, vacrel->indrels,
vacrel->nindexes, nworkers,
- max_items,
+ vac_work_mem,
vacrel->verbose ? INFO : DEBUG2,
vacrel->bstrategy);
@@ -3187,11 +3123,7 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
}
/* Serial VACUUM case */
- dead_items = (VacDeadItems *) palloc(vac_max_items_to_alloc_size(max_items));
- dead_items->max_items = max_items;
- dead_items->num_items = 0;
-
- vacrel->dead_items = dead_items;
+ vacrel->dead_items = tidstore_create(vac_work_mem, NULL);
}
/*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 8608e3fa5b..a526e607fe 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1165,7 +1165,7 @@ CREATE VIEW pg_stat_progress_vacuum AS
END AS phase,
S.param2 AS heap_blks_total, S.param3 AS heap_blks_scanned,
S.param4 AS heap_blks_vacuumed, S.param5 AS index_vacuum_count,
- S.param6 AS max_dead_tuples, S.param7 AS num_dead_tuples
+ S.param6 AS max_dead_tuple_bytes, S.param7 AS dead_tuple_bytes
FROM pg_stat_get_progress_info('VACUUM') AS S
LEFT JOIN pg_database D ON S.datid = D.oid;
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 7b1a4b127e..358ad25996 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -97,7 +97,6 @@ static bool vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
static double compute_parallel_delay(void);
static VacOptValue get_vacoptval_from_boolean(DefElem *def);
static bool vac_tid_reaped(ItemPointer itemptr, void *state);
-static int vac_cmp_itemptr(const void *left, const void *right);
/*
* Primary entry point for manual VACUUM and ANALYZE commands
@@ -2303,16 +2302,16 @@ get_vacoptval_from_boolean(DefElem *def)
*/
IndexBulkDeleteResult *
vac_bulkdel_one_index(IndexVacuumInfo *ivinfo, IndexBulkDeleteResult *istat,
- VacDeadItems *dead_items)
+ TidStore *dead_items)
{
/* Do bulk deletion */
istat = index_bulk_delete(ivinfo, istat, vac_tid_reaped,
(void *) dead_items);
ereport(ivinfo->message_level,
- (errmsg("scanned index \"%s\" to remove %d row versions",
+ (errmsg("scanned index \"%s\" to remove " UINT64_FORMAT " row versions",
RelationGetRelationName(ivinfo->index),
- dead_items->num_items)));
+ tidstore_num_tids(dead_items))));
return istat;
}
@@ -2343,18 +2342,6 @@ vac_cleanup_one_index(IndexVacuumInfo *ivinfo, IndexBulkDeleteResult *istat)
return istat;
}
-/*
- * Returns the total required space for VACUUM's dead_items array given a
- * max_items value.
- */
-Size
-vac_max_items_to_alloc_size(int max_items)
-{
- Assert(max_items <= MAXDEADITEMS(MaxAllocSize));
-
- return offsetof(VacDeadItems, items) + sizeof(ItemPointerData) * max_items;
-}
-
/*
* vac_tid_reaped() -- is a particular tid deletable?
*
@@ -2365,60 +2352,7 @@ vac_max_items_to_alloc_size(int max_items)
static bool
vac_tid_reaped(ItemPointer itemptr, void *state)
{
- VacDeadItems *dead_items = (VacDeadItems *) state;
- int64 litem,
- ritem,
- item;
- ItemPointer res;
-
- litem = itemptr_encode(&dead_items->items[0]);
- ritem = itemptr_encode(&dead_items->items[dead_items->num_items - 1]);
- item = itemptr_encode(itemptr);
-
- /*
- * Doing a simple bound check before bsearch() is useful to avoid the
- * extra cost of bsearch(), especially if dead items on the heap are
- * concentrated in a certain range. Since this function is called for
- * every index tuple, it pays to be really fast.
- */
- if (item < litem || item > ritem)
- return false;
-
- res = (ItemPointer) bsearch((void *) itemptr,
- (void *) dead_items->items,
- dead_items->num_items,
- sizeof(ItemPointerData),
- vac_cmp_itemptr);
-
- return (res != NULL);
-}
-
-/*
- * Comparator routines for use with qsort() and bsearch().
- */
-static int
-vac_cmp_itemptr(const void *left, const void *right)
-{
- BlockNumber lblk,
- rblk;
- OffsetNumber loff,
- roff;
-
- lblk = ItemPointerGetBlockNumber((ItemPointer) left);
- rblk = ItemPointerGetBlockNumber((ItemPointer) right);
-
- if (lblk < rblk)
- return -1;
- if (lblk > rblk)
- return 1;
-
- loff = ItemPointerGetOffsetNumber((ItemPointer) left);
- roff = ItemPointerGetOffsetNumber((ItemPointer) right);
-
- if (loff < roff)
- return -1;
- if (loff > roff)
- return 1;
+ TidStore *dead_items = (TidStore *) state;
- return 0;
+ return tidstore_lookup_tid(dead_items, itemptr);
}
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index bcd40c80a1..4c0ce4b7e6 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -44,7 +44,7 @@
* use small integers.
*/
#define PARALLEL_VACUUM_KEY_SHARED 1
-#define PARALLEL_VACUUM_KEY_DEAD_ITEMS 2
+#define PARALLEL_VACUUM_KEY_DSA 2
#define PARALLEL_VACUUM_KEY_QUERY_TEXT 3
#define PARALLEL_VACUUM_KEY_BUFFER_USAGE 4
#define PARALLEL_VACUUM_KEY_WAL_USAGE 5
@@ -103,6 +103,9 @@ typedef struct PVShared
/* Counter for vacuuming and cleanup */
pg_atomic_uint32 idx;
+
+ /* Handle of the shared TidStore */
+ tidstore_handle dead_items_handle;
} PVShared;
/* Status used during parallel index vacuum or cleanup */
@@ -166,7 +169,8 @@ struct ParallelVacuumState
PVIndStats *indstats;
/* Shared dead items space among parallel vacuum workers */
- VacDeadItems *dead_items;
+ TidStore *dead_items;
+ dsa_area *dead_items_area;
/* Points to buffer usage area in DSM */
BufferUsage *buffer_usage;
@@ -222,20 +226,23 @@ static void parallel_vacuum_error_callback(void *arg);
*/
ParallelVacuumState *
parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
- int nrequested_workers, int max_items,
- int elevel, BufferAccessStrategy bstrategy)
+ int nrequested_workers, int vac_work_mem,
+ int elevel,
+ BufferAccessStrategy bstrategy)
{
ParallelVacuumState *pvs;
ParallelContext *pcxt;
PVShared *shared;
- VacDeadItems *dead_items;
+ TidStore *dead_items;
PVIndStats *indstats;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
+ void *area_space;
+ dsa_area *dead_items_dsa;
bool *will_parallel_vacuum;
Size est_indstats_len;
Size est_shared_len;
- Size est_dead_items_len;
+ Size dsa_minsize = dsa_minimum_size();
int nindexes_mwm = 0;
int parallel_workers = 0;
int querylen;
@@ -283,9 +290,8 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_estimate_chunk(&pcxt->estimator, est_shared_len);
shm_toc_estimate_keys(&pcxt->estimator, 1);
- /* Estimate size for dead_items -- PARALLEL_VACUUM_KEY_DEAD_ITEMS */
- est_dead_items_len = vac_max_items_to_alloc_size(max_items);
- shm_toc_estimate_chunk(&pcxt->estimator, est_dead_items_len);
+ /* Estimate size for dead tuple DSA -- PARALLEL_VACUUM_KEY_DSA */
+ shm_toc_estimate_chunk(&pcxt->estimator, dsa_minsize);
shm_toc_estimate_keys(&pcxt->estimator, 1);
/*
@@ -351,6 +357,16 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_INDEX_STATS, indstats);
pvs->indstats = indstats;
+ /* Prepare DSA space for dead items */
+ area_space = shm_toc_allocate(pcxt->toc, dsa_minsize);
+ shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_DSA, area_space);
+ dead_items_dsa = dsa_create_in_place(area_space, dsa_minsize,
+ LWTRANCHE_PARALLEL_VACUUM_DSA,
+ pcxt->seg);
+ dead_items = tidstore_create(vac_work_mem, dead_items_dsa);
+ pvs->dead_items = dead_items;
+ pvs->dead_items_area = dead_items_dsa;
+
/* Prepare shared information */
shared = (PVShared *) shm_toc_allocate(pcxt->toc, est_shared_len);
MemSet(shared, 0, est_shared_len);
@@ -360,6 +376,7 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
(nindexes_mwm > 0) ?
maintenance_work_mem / Min(parallel_workers, nindexes_mwm) :
maintenance_work_mem;
+ shared->dead_items_handle = tidstore_get_handle(dead_items);
pg_atomic_init_u32(&(shared->cost_balance), 0);
pg_atomic_init_u32(&(shared->active_nworkers), 0);
@@ -368,15 +385,6 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_SHARED, shared);
pvs->shared = shared;
- /* Prepare the dead_items space */
- dead_items = (VacDeadItems *) shm_toc_allocate(pcxt->toc,
- est_dead_items_len);
- dead_items->max_items = max_items;
- dead_items->num_items = 0;
- MemSet(dead_items->items, 0, sizeof(ItemPointerData) * max_items);
- shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_DEAD_ITEMS, dead_items);
- pvs->dead_items = dead_items;
-
/*
* Allocate space for each worker's BufferUsage and WalUsage; no need to
* initialize
@@ -434,6 +442,9 @@ parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats)
istats[i] = NULL;
}
+ tidstore_destroy(pvs->dead_items);
+ dsa_detach(pvs->dead_items_area);
+
DestroyParallelContext(pvs->pcxt);
ExitParallelMode();
@@ -442,7 +453,7 @@ parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats)
}
/* Returns the dead items space */
-VacDeadItems *
+TidStore *
parallel_vacuum_get_dead_items(ParallelVacuumState *pvs)
{
return pvs->dead_items;
@@ -940,7 +951,9 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
Relation *indrels;
PVIndStats *indstats;
PVShared *shared;
- VacDeadItems *dead_items;
+ TidStore *dead_items;
+ void *area_space;
+ dsa_area *dead_items_area;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
int nindexes;
@@ -984,10 +997,10 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
PARALLEL_VACUUM_KEY_INDEX_STATS,
false);
- /* Set dead_items space */
- dead_items = (VacDeadItems *) shm_toc_lookup(toc,
- PARALLEL_VACUUM_KEY_DEAD_ITEMS,
- false);
+ /* Set dead items */
+ area_space = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_DSA, false);
+ dead_items_area = dsa_attach_in_place(area_space, seg);
+ dead_items = tidstore_attach(dead_items_area, shared->dead_items_handle);
/* Set cost-based vacuum delay */
VacuumCostActive = (VacuumCostDelay > 0);
@@ -1033,6 +1046,9 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
&wal_usage[ParallelWorkerNumber]);
+ tidstore_detach(pvs.dead_items);
+ dsa_detach(dead_items_area);
+
/* Pop the error context stack */
error_context_stack = errcallback.previous;
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index cbfe329591..4c35af3412 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -188,6 +188,8 @@ static const char *const BuiltinTrancheNames[] = {
"PgStatsHash",
/* LWTRANCHE_PGSTATS_DATA: */
"PgStatsData",
+ /* LWTRANCHE_PARALLEL_VACUUM_DSA: */
+ "ParallelVacuumDSA",
};
StaticAssertDecl(lengthof(BuiltinTrancheNames) ==
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index cd0fc2cb8f..85e42269be 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2301,7 +2301,7 @@ struct config_int ConfigureNamesInt[] =
GUC_UNIT_KB
},
&maintenance_work_mem,
- 65536, 1024, MAX_KILOBYTES,
+ 65536, 2048, MAX_KILOBYTES,
NULL, NULL, NULL
},
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index e5add41352..b209d3cf84 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -23,8 +23,8 @@
#define PROGRESS_VACUUM_HEAP_BLKS_SCANNED 2
#define PROGRESS_VACUUM_HEAP_BLKS_VACUUMED 3
#define PROGRESS_VACUUM_NUM_INDEX_VACUUMS 4
-#define PROGRESS_VACUUM_MAX_DEAD_TUPLES 5
-#define PROGRESS_VACUUM_NUM_DEAD_TUPLES 6
+#define PROGRESS_VACUUM_MAX_DEAD_TUPLE_BYTES 5
+#define PROGRESS_VACUUM_DEAD_TUPLE_BYTES 6
/* Phases of vacuum (as advertised via PROGRESS_VACUUM_PHASE) */
#define PROGRESS_VACUUM_PHASE_SCAN_HEAP 1
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 689dbb7702..220d89fff7 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -17,6 +17,7 @@
#include "access/htup.h"
#include "access/genam.h"
#include "access/parallel.h"
+#include "access/tidstore.h"
#include "catalog/pg_class.h"
#include "catalog/pg_statistic.h"
#include "catalog/pg_type.h"
@@ -276,21 +277,6 @@ struct VacuumCutoffs
MultiXactId MultiXactCutoff;
};
-/*
- * VacDeadItems stores TIDs whose index tuples are deleted by index vacuuming.
- */
-typedef struct VacDeadItems
-{
- int max_items; /* # slots allocated in array */
- int num_items; /* current # of entries */
-
- /* Sorted array of TIDs to delete from indexes */
- ItemPointerData items[FLEXIBLE_ARRAY_MEMBER];
-} VacDeadItems;
-
-#define MAXDEADITEMS(avail_mem) \
- (((avail_mem) - offsetof(VacDeadItems, items)) / sizeof(ItemPointerData))
-
/* GUC parameters */
extern PGDLLIMPORT int default_statistics_target; /* PGDLLIMPORT for PostGIS */
extern PGDLLIMPORT int vacuum_freeze_min_age;
@@ -339,18 +325,17 @@ extern Relation vacuum_open_relation(Oid relid, RangeVar *relation,
LOCKMODE lmode);
extern IndexBulkDeleteResult *vac_bulkdel_one_index(IndexVacuumInfo *ivinfo,
IndexBulkDeleteResult *istat,
- VacDeadItems *dead_items);
+ TidStore *dead_items);
extern IndexBulkDeleteResult *vac_cleanup_one_index(IndexVacuumInfo *ivinfo,
IndexBulkDeleteResult *istat);
-extern Size vac_max_items_to_alloc_size(int max_items);
/* in commands/vacuumparallel.c */
extern ParallelVacuumState *parallel_vacuum_init(Relation rel, Relation *indrels,
int nindexes, int nrequested_workers,
- int max_items, int elevel,
- BufferAccessStrategy bstrategy);
+ int vac_work_mem,
+ int elevel, BufferAccessStrategy bstrategy);
extern void parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats);
-extern VacDeadItems *parallel_vacuum_get_dead_items(ParallelVacuumState *pvs);
+extern TidStore *parallel_vacuum_get_dead_items(ParallelVacuumState *pvs);
extern void parallel_vacuum_bulkdel_all_indexes(ParallelVacuumState *pvs,
long num_table_tuples,
int num_index_scans);
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 7b7663e2e1..c9b4741e32 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -205,6 +205,7 @@ typedef enum BuiltinTrancheIds
LWTRANCHE_PGSTATS_DSA,
LWTRANCHE_PGSTATS_HASH,
LWTRANCHE_PGSTATS_DATA,
+ LWTRANCHE_PARALLEL_VACUUM_DSA,
LWTRANCHE_FIRST_USER_DEFINED
} BuiltinTrancheIds;
diff --git a/src/test/regress/expected/cluster.out b/src/test/regress/expected/cluster.out
index 2eec483eaa..e04f50726f 100644
--- a/src/test/regress/expected/cluster.out
+++ b/src/test/regress/expected/cluster.out
@@ -526,7 +526,7 @@ create index cluster_sort on clstr_4 (hundred, thousand, tenthous);
-- ensure we don't use the index in CLUSTER nor the checking SELECTs
set enable_indexscan = off;
-- Use external sort:
-set maintenance_work_mem = '1MB';
+set maintenance_work_mem = '2MB';
cluster clstr_4 using cluster_sort;
select * from
(select hundred, lag(hundred) over () as lhundred,
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 6cd57e3eaa..d1889b9d10 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -1214,7 +1214,7 @@ DROP TABLE unlogged_hash_table;
-- CREATE INDEX hash_ovfl_index ON hash_ovfl_heap USING hash (x int4_ops);
-- Test hash index build tuplesorting. Force hash tuplesort using low
-- maintenance_work_mem setting and fillfactor:
-SET maintenance_work_mem = '1MB';
+SET maintenance_work_mem = '2MB';
CREATE INDEX hash_tuplesort_idx ON tenk1 USING hash (stringu1 name_ops) WITH (fillfactor = 10);
EXPLAIN (COSTS OFF)
SELECT count(*) FROM tenk1 WHERE stringu1 = 'TVAAAA';
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index e7a2f5856a..f6ae02eb14 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2020,8 +2020,8 @@ pg_stat_progress_vacuum| SELECT s.pid,
s.param3 AS heap_blks_scanned,
s.param4 AS heap_blks_vacuumed,
s.param5 AS index_vacuum_count,
- s.param6 AS max_dead_tuples,
- s.param7 AS num_dead_tuples
+ s.param6 AS max_dead_tuple_bytes,
+ s.param7 AS dead_tuple_bytes
FROM (pg_stat_get_progress_info('VACUUM'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
LEFT JOIN pg_database d ON ((s.datid = d.oid)));
pg_stat_recovery_prefetch| SELECT stats_reset,
diff --git a/src/test/regress/sql/cluster.sql b/src/test/regress/sql/cluster.sql
index a4cfaae807..a4cb5b98a5 100644
--- a/src/test/regress/sql/cluster.sql
+++ b/src/test/regress/sql/cluster.sql
@@ -258,7 +258,7 @@ create index cluster_sort on clstr_4 (hundred, thousand, tenthous);
set enable_indexscan = off;
-- Use external sort:
-set maintenance_work_mem = '1MB';
+set maintenance_work_mem = '2MB';
cluster clstr_4 using cluster_sort;
select * from
(select hundred, lag(hundred) over () as lhundred,
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index a3738833b2..edb5e4b4f3 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -367,7 +367,7 @@ DROP TABLE unlogged_hash_table;
-- Test hash index build tuplesorting. Force hash tuplesort using low
-- maintenance_work_mem setting and fillfactor:
-SET maintenance_work_mem = '1MB';
+SET maintenance_work_mem = '2MB';
CREATE INDEX hash_tuplesort_idx ON tenk1 USING hash (stringu1 name_ops) WITH (fillfactor = 10);
EXPLAIN (COSTS OFF)
SELECT count(*) FROM tenk1 WHERE stringu1 = 'TVAAAA';
--
2.39.0
Attached is a rebase to fix conflicts from recent commits.
--
John Naylor
EDB: http://www.enterprisedb.com
Attachments:
v22-0002-Move-some-bitmap-logic-out-of-bitmapset.c.patch
From dc2ac74612299ad60e3da958314338a7c3ff1ad5 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Tue, 6 Dec 2022 13:39:41 +0700
Subject: [PATCH v22 02/22] Move some bitmap logic out of bitmapset.c
Add pg_rightmost_one32/64 functions. This functionality was previously
private to bitmapset.c as the RIGHTMOST_ONE macro. It has practical
use in other contexts, so move to pg_bitutils.h.
Also move the logic for selecting appropriate pg_bitutils functions
based on word size to bitmapset.h for wider visibility and add
appropriate selection for bmw_rightmost_one().
Since the previous macro relied on casting to signedbitmapword,
and the new functions do not, remove that typedef.
Design input and review by Tom Lane
Discussion: https://www.postgresql.org/message-id/CAFBsxsFW2JjTo58jtDB%2B3sZhxMx3t-3evew8%3DAcr%2BGGhC%2BkFaA%40mail.gmail.com
---
src/backend/nodes/bitmapset.c | 36 ++------------------------------
src/include/nodes/bitmapset.h | 16 ++++++++++++--
src/include/port/pg_bitutils.h | 31 +++++++++++++++++++++++++++
src/tools/pgindent/typedefs.list | 1 -
4 files changed, 47 insertions(+), 37 deletions(-)
diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index 98660524ad..fcd8e2ccbc 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -32,39 +32,7 @@
#define BITMAPSET_SIZE(nwords) \
(offsetof(Bitmapset, words) + (nwords) * sizeof(bitmapword))
-/*----------
- * This is a well-known cute trick for isolating the rightmost one-bit
- * in a word. It assumes two's complement arithmetic. Consider any
- * nonzero value, and focus attention on the rightmost one. The value is
- * then something like
- * xxxxxx10000
- * where x's are unspecified bits. The two's complement negative is formed
- * by inverting all the bits and adding one. Inversion gives
- * yyyyyy01111
- * where each y is the inverse of the corresponding x. Incrementing gives
- * yyyyyy10000
- * and then ANDing with the original value gives
- * 00000010000
- * This works for all cases except original value = zero, where of course
- * we get zero.
- *----------
- */
-#define RIGHTMOST_ONE(x) ((signedbitmapword) (x) & -((signedbitmapword) (x)))
-
-#define HAS_MULTIPLE_ONES(x) ((bitmapword) RIGHTMOST_ONE(x) != (x))
-
-/* Select appropriate bit-twiddling functions for bitmap word size */
-#if BITS_PER_BITMAPWORD == 32
-#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos32(w)
-#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos32(w)
-#define bmw_popcount(w) pg_popcount32(w)
-#elif BITS_PER_BITMAPWORD == 64
-#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos64(w)
-#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos64(w)
-#define bmw_popcount(w) pg_popcount64(w)
-#else
-#error "invalid BITS_PER_BITMAPWORD"
-#endif
+#define HAS_MULTIPLE_ONES(x) (bmw_rightmost_one(x) != (x))
/*
@@ -1013,7 +981,7 @@ bms_first_member(Bitmapset *a)
{
int result;
- w = RIGHTMOST_ONE(w);
+ w = bmw_rightmost_one(w);
a->words[wordnum] &= ~w;
result = wordnum * BITS_PER_BITMAPWORD;
diff --git a/src/include/nodes/bitmapset.h b/src/include/nodes/bitmapset.h
index 0dca6bc5fa..80e91fac0f 100644
--- a/src/include/nodes/bitmapset.h
+++ b/src/include/nodes/bitmapset.h
@@ -38,13 +38,11 @@ struct List;
#define BITS_PER_BITMAPWORD 64
typedef uint64 bitmapword; /* must be an unsigned type */
-typedef int64 signedbitmapword; /* must be the matching signed type */
#else
#define BITS_PER_BITMAPWORD 32
typedef uint32 bitmapword; /* must be an unsigned type */
-typedef int32 signedbitmapword; /* must be the matching signed type */
#endif
@@ -75,6 +73,20 @@ typedef enum
BMS_MULTIPLE /* >1 member */
} BMS_Membership;
+/* Select appropriate bit-twiddling functions for bitmap word size */
+#if BITS_PER_BITMAPWORD == 32
+#define bmw_rightmost_one(w) pg_rightmost_one32(w)
+#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos32(w)
+#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos32(w)
+#define bmw_popcount(w) pg_popcount32(w)
+#elif BITS_PER_BITMAPWORD == 64
+#define bmw_rightmost_one(w) pg_rightmost_one64(w)
+#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos64(w)
+#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos64(w)
+#define bmw_popcount(w) pg_popcount64(w)
+#else
+#error "invalid BITS_PER_BITMAPWORD"
+#endif
/*
* function prototypes in nodes/bitmapset.c
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index a49a9c03d9..7235ad25ee 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -17,6 +17,37 @@ extern PGDLLIMPORT const uint8 pg_leftmost_one_pos[256];
extern PGDLLIMPORT const uint8 pg_rightmost_one_pos[256];
extern PGDLLIMPORT const uint8 pg_number_of_ones[256];
+/*----------
+ * This is a well-known cute trick for isolating the rightmost one-bit
+ * in a word. It assumes two's complement arithmetic. Consider any
+ * nonzero value, and focus attention on the rightmost one. The value is
+ * then something like
+ * xxxxxx10000
+ * where x's are unspecified bits. The two's complement negative is formed
+ * by inverting all the bits and adding one. Inversion gives
+ * yyyyyy01111
+ * where each y is the inverse of the corresponding x. Incrementing gives
+ * yyyyyy10000
+ * and then ANDing with the original value gives
+ * 00000010000
+ * This works for all cases except original value = zero, where of course
+ * we get zero.
+ *----------
+ */
+static inline uint32
+pg_rightmost_one32(uint32 word)
+{
+ int32 result = (int32) word & -((int32) word);
+ return (uint32) result;
+}
+
+static inline uint64
+pg_rightmost_one64(uint64 word)
+{
+ int64 result = (int64) word & -((int64) word);
+ return (uint64) result;
+}
+
/*
* pg_leftmost_one_pos32
* Returns the position of the most significant set bit in "word",
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 51484ca7e2..077f197a64 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3662,7 +3662,6 @@ shmem_request_hook_type
shmem_startup_hook_type
sig_atomic_t
sigjmp_buf
-signedbitmapword
sigset_t
size_t
slist_head
--
2.39.0
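As a side note, the behavior that pg_rightmost_one32/64 encapsulate (described in the comment the patch moves to pg_bitutils.h) is easy to see with a small worked example. This is only an illustration, not part of the patch:

    uint32  w = 0xb0;                       /* ...10110000 */
    uint32  bit = pg_rightmost_one32(w);    /* 0x10, i.e. ...00010000 */

    /* bms_first_member()-style loop: peel off one set bit per iteration */
    while (w != 0)
    {
        bit = pg_rightmost_one32(w);        /* isolate the rightmost one-bit */
        w &= ~bit;                          /* clear it and continue */
    }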
v22-0004-Clean-up-some-nomenclature-around-node-insertion.patch
From 3dab17562a62b9e5086bcf473cf1a81768f70552 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Thu, 19 Jan 2023 16:33:51 +0700
Subject: [PATCH v22 04/22] Clean up some nomenclature around node insertion
Replace node/nodep with hopefully more informative names.
In passing, remove some outdated asserts and move some
variable declarations to the scope where they're used.
---
src/include/lib/radixtree.h | 64 ++++++++++++++-----------
src/include/lib/radixtree_insert_impl.h | 22 +++++----
2 files changed, 47 insertions(+), 39 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 97cccdc9ca..a1458bc25f 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -645,9 +645,9 @@ typedef struct RT_ITER
} RT_ITER;
-static bool RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC nodep, RT_PTR_LOCAL node,
+static bool RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
uint64 key, RT_PTR_ALLOC child);
-static bool RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC nodep, RT_PTR_LOCAL node,
+static bool RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
uint64 key, uint64 value);
/* verification (available only with assertion) */
@@ -1153,18 +1153,18 @@ RT_NODE_UPDATE_INNER(RT_PTR_LOCAL node, uint64 key, RT_PTR_ALLOC new_child)
* Replace old_child with new_child, and free the old one.
*/
static void
-RT_REPLACE_NODE(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC old_child,
+RT_REPLACE_NODE(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent,
+ RT_PTR_ALLOC stored_old_child, RT_PTR_LOCAL old_child,
RT_PTR_ALLOC new_child, uint64 key)
{
- RT_PTR_LOCAL old = RT_PTR_GET_LOCAL(tree, old_child);
-
#ifdef USE_ASSERT_CHECKING
RT_PTR_LOCAL new = RT_PTR_GET_LOCAL(tree, new_child);
- Assert(old->shift == new->shift);
+ Assert(old_child->shift == new->shift);
+ Assert(old_child->count == new->count);
#endif
- if (parent == old)
+ if (parent == old_child)
{
/* Replace the root node with the new large node */
tree->ctl->root = new_child;
@@ -1172,7 +1172,7 @@ RT_REPLACE_NODE(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC old_child
else
RT_NODE_UPDATE_INNER(parent, key, new_child);
- RT_FREE_NODE(tree, old_child);
+ RT_FREE_NODE(tree, stored_old_child);
}
/*
@@ -1220,11 +1220,11 @@ RT_EXTEND(RT_RADIX_TREE *tree, uint64 key)
*/
static inline void
RT_SET_EXTEND(RT_RADIX_TREE *tree, uint64 key, uint64 value, RT_PTR_LOCAL parent,
- RT_PTR_ALLOC nodep, RT_PTR_LOCAL node)
+ RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node)
{
int shift = node->shift;
- Assert(RT_PTR_GET_LOCAL(tree, nodep) == node);
+ Assert(RT_PTR_GET_LOCAL(tree, stored_node) == node);
while (shift >= RT_NODE_SPAN)
{
@@ -1237,15 +1237,15 @@ RT_SET_EXTEND(RT_RADIX_TREE *tree, uint64 key, uint64 value, RT_PTR_LOCAL parent
newchild = RT_PTR_GET_LOCAL(tree, allocchild);
RT_INIT_NODE(newchild, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
newchild->shift = newshift;
- RT_NODE_INSERT_INNER(tree, parent, nodep, node, key, allocchild);
+ RT_NODE_INSERT_INNER(tree, parent, stored_node, node, key, allocchild);
parent = node;
node = newchild;
- nodep = allocchild;
+ stored_node = allocchild;
shift -= RT_NODE_SPAN;
}
- RT_NODE_INSERT_LEAF(tree, parent, nodep, node, key, value);
+ RT_NODE_INSERT_LEAF(tree, parent, stored_node, node, key, value);
tree->ctl->num_keys++;
}
@@ -1305,9 +1305,15 @@ RT_NODE_DELETE_LEAF(RT_PTR_LOCAL node, uint64 key)
}
#endif
-/* Insert the child to the inner node */
+/*
+ * Insert "child" into "node".
+ *
+ * "parent" is the parent of "node", so the grandparent of the child.
+ * If the node we're inserting into needs to grow, we update the parent's
+ * child pointer with the pointer to the new larger node.
+ */
static bool
-RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC nodep, RT_PTR_LOCAL node,
+RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
uint64 key, RT_PTR_ALLOC child)
{
#define RT_NODE_LEVEL_INNER
@@ -1315,9 +1321,9 @@ RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC node
#undef RT_NODE_LEVEL_INNER
}
-/* Insert the value to the leaf node */
+/* Like RT_NODE_INSERT_INNER, but for leaf nodes */
static bool
-RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC nodep, RT_PTR_LOCAL node,
+RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
uint64 key, uint64 value)
{
#define RT_NODE_LEVEL_LEAF
@@ -1525,8 +1531,8 @@ RT_SET(RT_RADIX_TREE *tree, uint64 key, uint64 value)
int shift;
bool updated;
RT_PTR_LOCAL parent;
- RT_PTR_ALLOC nodep;
- RT_PTR_LOCAL node;
+ RT_PTR_ALLOC stored_child;
+ RT_PTR_LOCAL child;
#ifdef RT_SHMEM
Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
@@ -1540,32 +1546,32 @@ RT_SET(RT_RADIX_TREE *tree, uint64 key, uint64 value)
if (key > tree->ctl->max_val)
RT_EXTEND(tree, key);
- nodep = tree->ctl->root;
- parent = RT_PTR_GET_LOCAL(tree, nodep);
+ stored_child = tree->ctl->root;
+ parent = RT_PTR_GET_LOCAL(tree, stored_child);
shift = parent->shift;
/* Descend the tree until a leaf node */
while (shift >= 0)
{
- RT_PTR_ALLOC child;
+ RT_PTR_ALLOC new_child;
- node = RT_PTR_GET_LOCAL(tree, nodep);
+ child = RT_PTR_GET_LOCAL(tree, stored_child);
- if (NODE_IS_LEAF(node))
+ if (NODE_IS_LEAF(child))
break;
- if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ if (!RT_NODE_SEARCH_INNER(child, key, &new_child))
{
- RT_SET_EXTEND(tree, key, value, parent, nodep, node);
+ RT_SET_EXTEND(tree, key, value, parent, stored_child, child);
return false;
}
- parent = node;
- nodep = child;
+ parent = child;
+ stored_child = new_child;
shift -= RT_NODE_SPAN;
}
- updated = RT_NODE_INSERT_LEAF(tree, parent, nodep, node, key, value);
+ updated = RT_NODE_INSERT_LEAF(tree, parent, stored_child, child, key, value);
/* Update the statistics */
if (!updated)
diff --git a/src/include/lib/radixtree_insert_impl.h b/src/include/lib/radixtree_insert_impl.h
index e4faf54d9d..1d0eb396e2 100644
--- a/src/include/lib/radixtree_insert_impl.h
+++ b/src/include/lib/radixtree_insert_impl.h
@@ -14,8 +14,6 @@
uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
bool chunk_exists = false;
- RT_PTR_LOCAL newnode = NULL;
- RT_PTR_ALLOC allocnode;
#ifdef RT_NODE_LEVEL_LEAF
const bool inner = false;
@@ -47,6 +45,8 @@
if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n4)))
{
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
RT_NODE32_TYPE *new32;
const uint8 new_kind = RT_NODE_KIND_32;
const RT_SIZE_CLASS new_class = RT_KIND_MIN_SIZE_CLASS[new_kind];
@@ -65,8 +65,7 @@
RT_CHUNK_CHILDREN_ARRAY_COPY(n4->base.chunks, n4->children,
new32->base.chunks, new32->children);
#endif
- Assert(parent != NULL);
- RT_REPLACE_NODE(tree, parent, nodep, allocnode, key);
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
node = newnode;
}
else
@@ -121,6 +120,8 @@
n32->base.n.fanout == class32_min.fanout)
{
/* grow to the next size class of this kind */
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
const RT_SIZE_CLASS new_class = RT_CLASS_32_FULL;
allocnode = RT_ALLOC_NODE(tree, new_class, inner);
@@ -132,8 +133,7 @@
#endif
newnode->fanout = class32_max.fanout;
- Assert(parent != NULL);
- RT_REPLACE_NODE(tree, parent, nodep, allocnode, key);
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
node = newnode;
/* also update pointer for this kind */
@@ -142,6 +142,8 @@
if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)))
{
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
RT_NODE125_TYPE *new125;
const uint8 new_kind = RT_NODE_KIND_125;
const RT_SIZE_CLASS new_class = RT_KIND_MIN_SIZE_CLASS[new_kind];
@@ -169,8 +171,7 @@
Assert(class32_max.fanout <= sizeof(bitmapword) * BITS_PER_BYTE);
new125->base.isset[0] = (bitmapword) (((uint64) 1 << class32_max.fanout) - 1);
- Assert(parent != NULL);
- RT_REPLACE_NODE(tree, parent, nodep, allocnode, key);
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
node = newnode;
}
else
@@ -220,6 +221,8 @@
if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n125)))
{
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
RT_NODE256_TYPE *new256;
const uint8 new_kind = RT_NODE_KIND_256;
const RT_SIZE_CLASS new_class = RT_KIND_MIN_SIZE_CLASS[new_kind];
@@ -243,8 +246,7 @@
cnt++;
}
- Assert(parent != NULL);
- RT_REPLACE_NODE(tree, parent, nodep, allocnode, key);
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
node = newnode;
}
else
--
2.39.0
Attachment: v22-0001-introduce-vector8_min-and-vector8_highbit_mask.patch (text/x-patch, US-ASCII)
From 2b4d8c3a7a538c029faaa14ef5f22beec10406bc Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 24 Oct 2022 14:07:09 +0900
Subject: [PATCH v22 01/22] introduce vector8_min and vector8_highbit_mask
TODO: commit message
TODO: Remove uint64 case.
separate-commit TODO: move non-SIMD fallbacks to own header
to clean up the #ifdef maze.
---
src/include/port/simd.h | 47 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 47 insertions(+)
diff --git a/src/include/port/simd.h b/src/include/port/simd.h
index c836360d4b..84d41a340a 100644
--- a/src/include/port/simd.h
+++ b/src/include/port/simd.h
@@ -77,6 +77,7 @@ static inline bool vector8_has(const Vector8 v, const uint8 c);
static inline bool vector8_has_zero(const Vector8 v);
static inline bool vector8_has_le(const Vector8 v, const uint8 c);
static inline bool vector8_is_highbit_set(const Vector8 v);
+static inline uint32 vector8_highbit_mask(const Vector8 v);
#ifndef USE_NO_SIMD
static inline bool vector32_is_highbit_set(const Vector32 v);
#endif
@@ -96,6 +97,7 @@ static inline Vector8 vector8_ssub(const Vector8 v1, const Vector8 v2);
*/
#ifndef USE_NO_SIMD
static inline Vector8 vector8_eq(const Vector8 v1, const Vector8 v2);
+static inline Vector8 vector8_min(const Vector8 v1, const Vector8 v2);
static inline Vector32 vector32_eq(const Vector32 v1, const Vector32 v2);
#endif
@@ -277,6 +279,36 @@ vector8_is_highbit_set(const Vector8 v)
#endif
}
+/*
+ * Return a bitmask formed from the high bit of each element.
+ */
+static inline uint32
+vector8_highbit_mask(const Vector8 v)
+{
+#ifdef USE_SSE2
+ return (uint32) _mm_movemask_epi8(v);
+#elif defined(USE_NEON)
+ static const uint8 mask[16] = {
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ };
+
+ uint8x16_t masked = vandq_u8(vld1q_u8(mask), (uint8x16_t) vshrq_n_s8(v, 7));
+ uint8x16_t maskedhi = vextq_u8(masked, masked, 8);
+
+ return (uint32) vaddvq_u16((uint16x8_t) vzip1q_u8(masked, maskedhi));
+#else
+ uint32 mask = 0;
+
+ for (Size i = 0; i < sizeof(Vector8); i++)
+ mask |= (((const uint8 *) &v)[i] >> 7) << i;
+
+ return mask;
+#endif
+}
+
/*
* Exactly like vector8_is_highbit_set except for the input type, so it
* looks at each byte separately.
@@ -372,4 +404,19 @@ vector32_eq(const Vector32 v1, const Vector32 v2)
}
#endif /* ! USE_NO_SIMD */
+/*
+ * Compare the given vectors and return the vector of minimum elements.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_min(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+ return _mm_min_epu8(v1, v2);
+#elif defined(USE_NEON)
+ return vminq_u8(v1, v2);
+#endif
+}
+#endif /* ! USE_NO_SIMD */
+
#endif /* SIMD_H */
--
2.39.0
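To make the intended use of the two new helpers concrete, here is a minimal standalone sketch (not part of the patches) of the search pattern they enable: broadcast the search byte, compare it against a block of chunk bytes, and turn the comparison result into a bitmask whose rightmost set bit gives the match index. It is written with raw SSE2 intrinsics rather than the simd.h wrappers, and the function and variable names are made up for illustration; the node-32 search in the radixtree patch later in this series does the same thing portably via vector8_eq() and vector8_highbit_mask().

#include <emmintrin.h>		/* SSE2 intrinsics */
#include <stdint.h>
#include <stdio.h>
#include <strings.h>		/* ffs() */

/* Find the index of the first of 'count' bytes in 'chunks' equal to 'key'. */
static int
chunk_search_eq_16(const uint8_t chunks[16], int count, uint8_t key)
{
	__m128i		spread = _mm_set1_epi8((char) key);
	__m128i		haystack = _mm_loadu_si128((const __m128i *) chunks);
	__m128i		cmp = _mm_cmpeq_epi8(spread, haystack);
	uint32_t	bitfield = (uint32_t) _mm_movemask_epi8(cmp);

	/* mask out lanes beyond the number of valid entries */
	bitfield &= (1u << count) - 1;

	return bitfield ? ffs((int) bitfield) - 1 : -1;
}

int
main(void)
{
	uint8_t		chunks[16] = {3, 7, 9, 42, 100};

	printf("index of 42: %d\n", chunk_search_eq_16(chunks, 5, 42));	/* 3 */
	printf("index of 8: %d\n", chunk_search_eq_16(chunks, 5, 8));	/* -1 */
	return 0;
}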
Attachment: v22-0005-Restore-RT_GROW_NODE_KIND.patch (text/x-patch, US-ASCII)
From 413cce02ce2419d4760f411a77f24213958ea906 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Fri, 20 Jan 2023 11:32:24 +0700
Subject: [PATCH v22 05/22] Restore RT_GROW_NODE_KIND
(This was previously "exploded" out during the work to
switch this to a template)
Change the API so that we pass it the allocated pointer
and return the local pointer. That way, there is consistency
in growing nodes whether we change kind or not.
Also rename to RT_SWITCH_NODE_KIND, since it should work just as
well for shrinking nodes.
---
src/include/lib/radixtree.h | 104 +++---------------------
src/include/lib/radixtree_insert_impl.h | 24 ++----
2 files changed, 19 insertions(+), 109 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index a1458bc25f..c08016de3a 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -127,10 +127,9 @@
#define RT_ALLOC_NODE RT_MAKE_NAME(alloc_node)
#define RT_INIT_NODE RT_MAKE_NAME(init_node)
#define RT_FREE_NODE RT_MAKE_NAME(free_node)
-#define RT_FREE_RECURSE RT_MAKE_NAME(free_recurse)
#define RT_EXTEND RT_MAKE_NAME(extend)
#define RT_SET_EXTEND RT_MAKE_NAME(set_extend)
-//#define RT_GROW_NODE_KIND RT_MAKE_NAME(grow_node_kind)
+#define RT_SWITCH_NODE_KIND RT_MAKE_NAME(grow_node_kind)
#define RT_COPY_NODE RT_MAKE_NAME(copy_node)
#define RT_REPLACE_NODE RT_MAKE_NAME(replace_node)
#define RT_PTR_GET_LOCAL RT_MAKE_NAME(ptr_get_local)
@@ -1080,26 +1079,22 @@ RT_COPY_NODE(RT_PTR_LOCAL newnode, RT_PTR_LOCAL oldnode)
newnode->shift = oldnode->shift;
newnode->count = oldnode->count;
}
-#if 0
+
/*
- * Create a new node with 'new_kind' and the same shift, chunk, and
- * count of 'node'.
+ * Given a newly allocated node and an old node, initialize the new
+ * node with the necessary fields and return its local pointer.
*/
-static RT_NODE*
-RT_GROW_NODE_KIND(RT_RADIX_TREE *tree, RT_PTR_LOCAL node, uint8 new_kind)
+static inline RT_PTR_LOCAL
+RT_SWITCH_NODE_KIND(RT_RADIX_TREE *tree, RT_PTR_ALLOC allocnode, RT_PTR_LOCAL node,
+ uint8 new_kind, uint8 new_class, bool inner)
{
- RT_PTR_ALLOC allocnode;
- RT_PTR_LOCAL newnode;
- bool inner = !NODE_IS_LEAF(node);
-
- allocnode = RT_ALLOC_NODE(tree, RT_KIND_MIN_SIZE_CLASS[new_kind], inner);
- newnode = RT_PTR_GET_LOCAL(tree, allocnode);
- RT_INIT_NODE(newnode, new_kind, RT_KIND_MIN_SIZE_CLASS[new_kind], inner);
+ RT_PTR_LOCAL newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(newnode, new_kind, new_class, inner);
RT_COPY_NODE(newnode, node);
return newnode;
}
-#endif
+
/* Free the given node */
static void
RT_FREE_NODE(RT_RADIX_TREE *tree, RT_PTR_ALLOC allocnode)
@@ -1415,78 +1410,6 @@ RT_GET_HANDLE(RT_RADIX_TREE *tree)
Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
return tree->ctl->handle;
}
-
-/*
- * Recursively free all nodes allocated to the DSA area.
- */
-static inline void
-RT_FREE_RECURSE(RT_RADIX_TREE *tree, RT_PTR_ALLOC ptr)
-{
- RT_PTR_LOCAL node = RT_PTR_GET_LOCAL(tree, ptr);
-
- check_stack_depth();
- CHECK_FOR_INTERRUPTS();
-
- /* The leaf node doesn't have child pointers */
- if (NODE_IS_LEAF(node))
- {
- dsa_free(tree->dsa, ptr);
- return;
- }
-
- switch (node->kind)
- {
- case RT_NODE_KIND_4:
- {
- RT_NODE_INNER_4 *n4 = (RT_NODE_INNER_4 *) node;
-
- for (int i = 0; i < n4->base.n.count; i++)
- RT_FREE_RECURSE(tree, n4->children[i]);
-
- break;
- }
- case RT_NODE_KIND_32:
- {
- RT_NODE_INNER_32 *n32 = (RT_NODE_INNER_32 *) node;
-
- for (int i = 0; i < n32->base.n.count; i++)
- RT_FREE_RECURSE(tree, n32->children[i]);
-
- break;
- }
- case RT_NODE_KIND_125:
- {
- RT_NODE_INNER_125 *n125 = (RT_NODE_INNER_125 *) node;
-
- for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
- {
- if (!RT_NODE_125_IS_CHUNK_USED(&n125->base, i))
- continue;
-
- RT_FREE_RECURSE(tree, RT_NODE_INNER_125_GET_CHILD(n125, i));
- }
-
- break;
- }
- case RT_NODE_KIND_256:
- {
- RT_NODE_INNER_256 *n256 = (RT_NODE_INNER_256 *) node;
-
- for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
- {
- if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
- continue;
-
- RT_FREE_RECURSE(tree, RT_NODE_INNER_256_GET_CHILD(n256, i));
- }
-
- break;
- }
- }
-
- /* Free the inner node */
- dsa_free(tree->dsa, ptr);
-}
#endif
/*
@@ -1498,10 +1421,6 @@ RT_FREE(RT_RADIX_TREE *tree)
#ifdef RT_SHMEM
Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
- /* Free all memory used for radix tree nodes */
- if (RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
- RT_FREE_RECURSE(tree, tree->ctl->root);
-
/*
* Vandalize the control block to help catch programming error where
* other backends access the memory formerly occupied by this radix tree.
@@ -2280,10 +2199,9 @@ RT_DUMP(RT_RADIX_TREE *tree)
#undef RT_ALLOC_NODE
#undef RT_INIT_NODE
#undef RT_FREE_NODE
-#undef RT_FREE_RECURSE
#undef RT_EXTEND
#undef RT_SET_EXTEND
-#undef RT_GROW_NODE_KIND
+#undef RT_SWITCH_NODE_KIND
#undef RT_COPY_NODE
#undef RT_REPLACE_NODE
#undef RT_PTR_GET_LOCAL
diff --git a/src/include/lib/radixtree_insert_impl.h b/src/include/lib/radixtree_insert_impl.h
index 1d0eb396e2..e3e44669ea 100644
--- a/src/include/lib/radixtree_insert_impl.h
+++ b/src/include/lib/radixtree_insert_impl.h
@@ -53,11 +53,9 @@
/* grow node from 4 to 32 */
allocnode = RT_ALLOC_NODE(tree, new_class, inner);
- newnode = RT_PTR_GET_LOCAL(tree, allocnode);
- RT_INIT_NODE(newnode, new_kind, new_class, inner);
- RT_COPY_NODE(newnode, node);
- //newnode = RT_GROW_NODE_KIND(tree, node, RT_NODE_KIND_32);
+ newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, inner);
new32 = (RT_NODE32_TYPE *) newnode;
+
#ifdef RT_NODE_LEVEL_LEAF
RT_CHUNK_VALUES_ARRAY_COPY(n4->base.chunks, n4->values,
new32->base.chunks, new32->values);
@@ -119,13 +117,15 @@
if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)) &&
n32->base.n.fanout == class32_min.fanout)
{
- /* grow to the next size class of this kind */
RT_PTR_ALLOC allocnode;
RT_PTR_LOCAL newnode;
const RT_SIZE_CLASS new_class = RT_CLASS_32_FULL;
+ /* grow to the next size class of this kind */
allocnode = RT_ALLOC_NODE(tree, new_class, inner);
newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ n32 = (RT_NODE32_TYPE *) newnode;
+
#ifdef RT_NODE_LEVEL_LEAF
memcpy(newnode, node, class32_min.leaf_size);
#else
@@ -135,9 +135,6 @@
RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
node = newnode;
-
- /* also update pointer for this kind */
- n32 = (RT_NODE32_TYPE *) newnode;
}
if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)))
@@ -152,10 +149,7 @@
/* grow node from 32 to 125 */
allocnode = RT_ALLOC_NODE(tree, new_class, inner);
- newnode = RT_PTR_GET_LOCAL(tree, allocnode);
- RT_INIT_NODE(newnode, new_kind, new_class, inner);
- RT_COPY_NODE(newnode, node);
- //newnode = RT_GROW_NODE_KIND(tree, node, RT_NODE_KIND_125);
+ newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, inner);
new125 = (RT_NODE125_TYPE *) newnode;
for (int i = 0; i < class32_max.fanout; i++)
@@ -229,11 +223,9 @@
/* grow node from 125 to 256 */
allocnode = RT_ALLOC_NODE(tree, new_class, inner);
- newnode = RT_PTR_GET_LOCAL(tree, allocnode);
- RT_INIT_NODE(newnode, new_kind, new_class, inner);
- RT_COPY_NODE(newnode, node);
- //newnode = RT_GROW_NODE_KIND(tree, node, RT_NODE_KIND_256);
+ newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, inner);
new256 = (RT_NODE256_TYPE *) newnode;
+
for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n125->base.n.count; i++)
{
if (!RT_NODE_125_IS_CHUNK_USED(&n125->base, i))
--
2.39.0
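After this patch, every kind-changing grow path in radixtree_insert_impl.h has the same shape. The following sketch is assembled from the hunks above (the copying of the kind-specific chunks and children/values arrays in between is elided):

	/* grow node, e.g. from 4 to 32; 32 -> 125 and 125 -> 256 look the same */
	allocnode = RT_ALLOC_NODE(tree, new_class, inner);
	newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, inner);
	new32 = (RT_NODE32_TYPE *) newnode;

	/* ... copy the old node's kind-specific arrays into the new node ... */

	RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
	node = newnode;

Growing within a kind (node 15 to node 32) differs only in that it memcpy's the whole old node instead of calling RT_SWITCH_NODE_KIND.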
Attachment: v22-0003-Add-radixtree-template.patch (text/x-patch, US-ASCII)
From 7af7400b44e61957b38d4c974cdd4606c32f6b0f Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 14 Sep 2022 12:38:51 +0000
Subject: [PATCH v22 03/22] Add radixtree template
The only things configurable in this commit are function scope,
prefix, and local/shared memory.
The key and value type are still hard-coded to uint64.
(A later commit in v21 will make value type configurable)
It might be good at some point to offer a different tree type,
e.g. "single-value leaves" to allow for variable length keys
and values, giving full flexibility to developers.
TODO: Much broader commit message
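As a rough sketch of how a caller instantiates the template, based on the interface documented in radixtree.h below (the 'rt' prefix and the calling code here are made up for illustration; a local-memory tree is used, so RT_SHMEM is left undefined):

#include "postgres.h"

#define RT_PREFIX rt
#define RT_SCOPE static
#define RT_DECLARE
#define RT_DEFINE
#define RT_USE_DELETE
#include "lib/radixtree.h"

static void
radixtree_example(void)
{
	rt_radix_tree *tree;
	uint64		value;

	/* the local-memory variant takes only a memory context */
	tree = rt_create(CurrentMemoryContext);

	rt_set(tree, UINT64CONST(1234), UINT64CONST(5678));

	if (rt_search(tree, UINT64CONST(1234), &value))
		elog(NOTICE, "found value " UINT64_FORMAT, value);

	rt_delete(tree, UINT64CONST(1234));
	rt_free(tree);
}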
---
src/backend/utils/mmgr/dsa.c | 12 +
src/include/lib/radixtree.h | 2321 +++++++++++++++++
src/include/lib/radixtree_delete_impl.h | 106 +
src/include/lib/radixtree_insert_impl.h | 316 +++
src/include/lib/radixtree_iter_impl.h | 138 +
src/include/lib/radixtree_search_impl.h | 131 +
src/include/utils/dsa.h | 1 +
src/test/modules/Makefile | 1 +
src/test/modules/meson.build | 1 +
src/test/modules/test_radixtree/.gitignore | 4 +
src/test/modules/test_radixtree/Makefile | 23 +
src/test/modules/test_radixtree/README | 7 +
.../expected/test_radixtree.out | 36 +
src/test/modules/test_radixtree/meson.build | 35 +
.../test_radixtree/sql/test_radixtree.sql | 7 +
.../test_radixtree/test_radixtree--1.0.sql | 8 +
.../modules/test_radixtree/test_radixtree.c | 653 +++++
.../test_radixtree/test_radixtree.control | 4 +
src/tools/pginclude/cpluspluscheck | 6 +
src/tools/pginclude/headerscheck | 6 +
20 files changed, 3816 insertions(+)
create mode 100644 src/include/lib/radixtree.h
create mode 100644 src/include/lib/radixtree_delete_impl.h
create mode 100644 src/include/lib/radixtree_insert_impl.h
create mode 100644 src/include/lib/radixtree_iter_impl.h
create mode 100644 src/include/lib/radixtree_search_impl.h
create mode 100644 src/test/modules/test_radixtree/.gitignore
create mode 100644 src/test/modules/test_radixtree/Makefile
create mode 100644 src/test/modules/test_radixtree/README
create mode 100644 src/test/modules/test_radixtree/expected/test_radixtree.out
create mode 100644 src/test/modules/test_radixtree/meson.build
create mode 100644 src/test/modules/test_radixtree/sql/test_radixtree.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree--1.0.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree.c
create mode 100644 src/test/modules/test_radixtree/test_radixtree.control
diff --git a/src/backend/utils/mmgr/dsa.c b/src/backend/utils/mmgr/dsa.c
index 604b702a91..50f0aae3ab 100644
--- a/src/backend/utils/mmgr/dsa.c
+++ b/src/backend/utils/mmgr/dsa.c
@@ -1024,6 +1024,18 @@ dsa_set_size_limit(dsa_area *area, size_t limit)
LWLockRelease(DSA_AREA_LOCK(area));
}
+size_t
+dsa_get_total_size(dsa_area *area)
+{
+ size_t size;
+
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_SHARED);
+ size = area->control->total_segment_size;
+ LWLockRelease(DSA_AREA_LOCK(area));
+
+ return size;
+}
+
/*
* Aggressively free all spare memory in the hope of returning DSM segments to
* the operating system.
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
new file mode 100644
index 0000000000..97cccdc9ca
--- /dev/null
+++ b/src/include/lib/radixtree.h
@@ -0,0 +1,2321 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.h
+ * Implementation for adaptive radix tree.
+ *
+ * This module employs the idea from the paper "The Adaptive Radix Tree: ARTful
+ * Indexing for Main-Memory Databases" by Viktor Leis, Alfons Kemper, and Thomas
+ * Neumann, 2013. The radix tree uses adaptive node sizes, a small number of node
+ * types, each with a different numbers of elements. Depending on the number of
+ * children, the appropriate node type is used.
+ *
+ * There are some differences from the proposed implementation. For instance,
+ * there is no support for path compression or lazy path expansion. The radix
+ * tree supports only fixed-length keys, so we don't expect the tree to become
+ * very high.
+ *
+ * Both the key and the value are 64-bit unsigned integers. The inner nodes and
+ * the leaf nodes have slightly different structures: inner nodes (shift > 0)
+ * store pointers to their child nodes as values, while leaf nodes (shift == 0)
+ * store the 64-bit unsigned integer specified by the user as the value. The
+ * paper refers to this technique as "Multi-value leaves". We chose it to avoid
+ * an additional pointer traversal. It is the reason this code currently does
+ * not support variable-length keys.
+ *
+ * XXX: Most functions in this file have two variants for inner nodes and leaf
+ * nodes, so there is some code duplication. While this sometimes makes code
+ * maintenance tricky, it reduces branch prediction misses when deciding
+ * whether a node is an inner node or a leaf node.
+ *
+ * XXX: radix tree nodes are never shrunk.
+ *
+ * To generate a radix tree and associated functions for a use case several
+ * macros have to be #define'ed before this file is included. Including
+ * the file #undef's all those, so a new radix tree can be generated
+ * afterwards.
+ * The relevant parameters are:
+ * - RT_PREFIX - prefix for all symbol names generated. A prefix of 'foo'
+ * will result in radix tree type 'foo_radix_tree' and functions like
+ * 'foo_create'/'foo_free' and so forth.
+ * - RT_DECLARE - if defined function prototypes and type declarations are
+ * generated
+ * - RT_DEFINE - if defined function definitions are generated
+ * - RT_SCOPE - in which scope (e.g. extern, static inline) do function
+ * declarations reside
+ * - RT_SHMEM - if defined, the radix tree is created in the DSA area
+ * so that multiple processes can access it simultaneously.
+ *
+ * Optional parameters:
+ * - RT_DEBUG - if defined add stats tracking and debugging functions
+ *
+ * Interface
+ * ---------
+ *
+ * RT_CREATE - Create a new, empty radix tree
+ * RT_FREE - Free the radix tree
+ * RT_ATTACH - Attach to the radix tree
+ * RT_DETACH - Detach from the radix tree
+ * RT_GET_HANDLE - Return the handle of the radix tree
+ * RT_SEARCH - Search a key-value pair
+ * RT_SET - Set a key-value pair
+ * RT_BEGIN_ITERATE - Begin iterating through all key-value pairs
+ * RT_ITERATE_NEXT - Return next key-value pair, if any
+ * RT_END_ITER - End iteration
+ * RT_MEMORY_USAGE - Get the memory usage
+ *
+ * RT_CREATE() creates an empty radix tree in the given memory context, along
+ * with child memory contexts for each kind of radix tree node.
+ *
+ * RT_ITERATE_NEXT() guarantees that key-value pairs are returned in
+ * ascending order of the key.
+ *
+ * Optional Interface
+ * ---------
+ *
+ * RT_DELETE - Delete a key-value pair. Declared/defined only if RT_USE_DELETE is defined
+ *
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/lib/radixtree.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "lib/stringinfo.h"
+#include "miscadmin.h"
+#include "nodes/bitmapset.h"
+#include "port/pg_bitutils.h"
+#include "port/simd.h"
+#include "utils/dsa.h"
+#include "utils/memutils.h"
+
+/* helpers */
+#define RT_MAKE_PREFIX(a) CppConcat(a,_)
+#define RT_MAKE_NAME(name) RT_MAKE_NAME_(RT_MAKE_PREFIX(RT_PREFIX),name)
+#define RT_MAKE_NAME_(a,b) CppConcat(a,b)
+
+/* function declarations */
+#define RT_CREATE RT_MAKE_NAME(create)
+#define RT_FREE RT_MAKE_NAME(free)
+#define RT_SEARCH RT_MAKE_NAME(search)
+#ifdef RT_SHMEM
+#define RT_ATTACH RT_MAKE_NAME(attach)
+#define RT_DETACH RT_MAKE_NAME(detach)
+#define RT_GET_HANDLE RT_MAKE_NAME(get_handle)
+#endif
+#define RT_SET RT_MAKE_NAME(set)
+#define RT_BEGIN_ITERATE RT_MAKE_NAME(begin_iterate)
+#define RT_ITERATE_NEXT RT_MAKE_NAME(iterate_next)
+#define RT_END_ITERATE RT_MAKE_NAME(end_iterate)
+#ifdef RT_USE_DELETE
+#define RT_DELETE RT_MAKE_NAME(delete)
+#endif
+#define RT_MEMORY_USAGE RT_MAKE_NAME(memory_usage)
+#ifdef RT_DEBUG
+#define RT_DUMP RT_MAKE_NAME(dump)
+#define RT_DUMP_NODE RT_MAKE_NAME(dump_node)
+#define RT_DUMP_SEARCH RT_MAKE_NAME(dump_search)
+#define RT_STATS RT_MAKE_NAME(stats)
+#endif
+
+/* internal helper functions (no externally visible prototypes) */
+#define RT_NEW_ROOT RT_MAKE_NAME(new_root)
+#define RT_ALLOC_NODE RT_MAKE_NAME(alloc_node)
+#define RT_INIT_NODE RT_MAKE_NAME(init_node)
+#define RT_FREE_NODE RT_MAKE_NAME(free_node)
+#define RT_FREE_RECURSE RT_MAKE_NAME(free_recurse)
+#define RT_EXTEND RT_MAKE_NAME(extend)
+#define RT_SET_EXTEND RT_MAKE_NAME(set_extend)
+//#define RT_GROW_NODE_KIND RT_MAKE_NAME(grow_node_kind)
+#define RT_COPY_NODE RT_MAKE_NAME(copy_node)
+#define RT_REPLACE_NODE RT_MAKE_NAME(replace_node)
+#define RT_PTR_GET_LOCAL RT_MAKE_NAME(ptr_get_local)
+#define RT_PTR_ALLOC_IS_VALID RT_MAKE_NAME(ptr_stored_is_valid)
+#define RT_NODE_4_SEARCH_EQ RT_MAKE_NAME(node_4_search_eq)
+#define RT_NODE_32_SEARCH_EQ RT_MAKE_NAME(node_32_search_eq)
+#define RT_NODE_4_GET_INSERTPOS RT_MAKE_NAME(node_4_get_insertpos)
+#define RT_NODE_32_GET_INSERTPOS RT_MAKE_NAME(node_32_get_insertpos)
+#define RT_CHUNK_CHILDREN_ARRAY_SHIFT RT_MAKE_NAME(chunk_children_array_shift)
+#define RT_CHUNK_VALUES_ARRAY_SHIFT RT_MAKE_NAME(chunk_values_array_shift)
+#define RT_CHUNK_CHILDREN_ARRAY_DELETE RT_MAKE_NAME(chunk_children_array_delete)
+#define RT_CHUNK_VALUES_ARRAY_DELETE RT_MAKE_NAME(chunk_values_array_delete)
+#define RT_CHUNK_CHILDREN_ARRAY_COPY RT_MAKE_NAME(chunk_children_array_copy)
+#define RT_CHUNK_VALUES_ARRAY_COPY RT_MAKE_NAME(chunk_values_array_copy)
+#define RT_NODE_125_IS_CHUNK_USED RT_MAKE_NAME(node_125_is_chunk_used)
+#define RT_NODE_INNER_125_GET_CHILD RT_MAKE_NAME(node_inner_125_get_child)
+#define RT_NODE_LEAF_125_GET_VALUE RT_MAKE_NAME(node_leaf_125_get_value)
+#define RT_NODE_INNER_256_IS_CHUNK_USED RT_MAKE_NAME(node_inner_256_is_chunk_used)
+#define RT_NODE_LEAF_256_IS_CHUNK_USED RT_MAKE_NAME(node_leaf_256_is_chunk_used)
+#define RT_NODE_INNER_256_GET_CHILD RT_MAKE_NAME(node_inner_256_get_child)
+#define RT_NODE_LEAF_256_GET_VALUE RT_MAKE_NAME(node_leaf_256_get_value)
+#define RT_NODE_INNER_256_SET RT_MAKE_NAME(node_inner_256_set)
+#define RT_NODE_LEAF_256_SET RT_MAKE_NAME(node_leaf_256_set)
+#define RT_NODE_INNER_256_DELETE RT_MAKE_NAME(node_inner_256_delete)
+#define RT_NODE_LEAF_256_DELETE RT_MAKE_NAME(node_leaf_256_delete)
+#define RT_KEY_GET_SHIFT RT_MAKE_NAME(key_get_shift)
+#define RT_SHIFT_GET_MAX_VAL RT_MAKE_NAME(shift_get_max_val)
+#define RT_NODE_SEARCH_INNER RT_MAKE_NAME(node_search_inner)
+#define RT_NODE_SEARCH_LEAF RT_MAKE_NAME(node_search_leaf)
+#define RT_NODE_UPDATE_INNER RT_MAKE_NAME(node_update_inner)
+#define RT_NODE_DELETE_INNER RT_MAKE_NAME(node_delete_inner)
+#define RT_NODE_DELETE_LEAF RT_MAKE_NAME(node_delete_leaf)
+#define RT_NODE_INSERT_INNER RT_MAKE_NAME(node_insert_inner)
+#define RT_NODE_INSERT_LEAF RT_MAKE_NAME(node_insert_leaf)
+#define RT_NODE_INNER_ITERATE_NEXT RT_MAKE_NAME(node_inner_iterate_next)
+#define RT_NODE_LEAF_ITERATE_NEXT RT_MAKE_NAME(node_leaf_iterate_next)
+#define RT_UPDATE_ITER_STACK RT_MAKE_NAME(update_iter_stack)
+#define RT_ITER_UPDATE_KEY RT_MAKE_NAME(iter_update_key)
+#define RT_VERIFY_NODE RT_MAKE_NAME(verify_node)
+
+/* type declarations */
+#define RT_RADIX_TREE RT_MAKE_NAME(radix_tree)
+#define RT_RADIX_TREE_CONTROL RT_MAKE_NAME(radix_tree_control)
+#define RT_ITER RT_MAKE_NAME(iter)
+#ifdef RT_SHMEM
+#define RT_HANDLE RT_MAKE_NAME(handle)
+#endif
+#define RT_NODE RT_MAKE_NAME(node)
+#define RT_NODE_ITER RT_MAKE_NAME(node_iter)
+#define RT_NODE_BASE_4 RT_MAKE_NAME(node_base_4)
+#define RT_NODE_BASE_32 RT_MAKE_NAME(node_base_32)
+#define RT_NODE_BASE_125 RT_MAKE_NAME(node_base_125)
+#define RT_NODE_BASE_256 RT_MAKE_NAME(node_base_256)
+#define RT_NODE_INNER_4 RT_MAKE_NAME(node_inner_4)
+#define RT_NODE_INNER_32 RT_MAKE_NAME(node_inner_32)
+#define RT_NODE_INNER_125 RT_MAKE_NAME(node_inner_125)
+#define RT_NODE_INNER_256 RT_MAKE_NAME(node_inner_256)
+#define RT_NODE_LEAF_4 RT_MAKE_NAME(node_leaf_4)
+#define RT_NODE_LEAF_32 RT_MAKE_NAME(node_leaf_32)
+#define RT_NODE_LEAF_125 RT_MAKE_NAME(node_leaf_125)
+#define RT_NODE_LEAF_256 RT_MAKE_NAME(node_leaf_256)
+#define RT_SIZE_CLASS RT_MAKE_NAME(size_class)
+#define RT_SIZE_CLASS_ELEM RT_MAKE_NAME(size_class_elem)
+#define RT_SIZE_CLASS_INFO RT_MAKE_NAME(size_class_info)
+#define RT_CLASS_4_FULL RT_MAKE_NAME(class_4_full)
+#define RT_CLASS_32_PARTIAL RT_MAKE_NAME(class_32_partial)
+#define RT_CLASS_32_FULL RT_MAKE_NAME(class_32_full)
+#define RT_CLASS_125_FULL RT_MAKE_NAME(class_125_full)
+#define RT_CLASS_256 RT_MAKE_NAME(class_256)
+#define RT_KIND_MIN_SIZE_CLASS RT_MAKE_NAME(kind_min_size_class)
+
+/* generate forward declarations necessary to use the radix tree */
+#ifdef RT_DECLARE
+
+typedef struct RT_RADIX_TREE RT_RADIX_TREE;
+typedef struct RT_ITER RT_ITER;
+
+#ifdef RT_SHMEM
+typedef dsa_pointer RT_HANDLE;
+#endif
+
+#ifdef RT_SHMEM
+RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx, dsa_area *dsa);
+RT_SCOPE RT_RADIX_TREE * RT_ATTACH(dsa_area *dsa, dsa_pointer dp);
+RT_SCOPE void RT_DETACH(RT_RADIX_TREE *tree);
+RT_SCOPE RT_HANDLE RT_GET_HANDLE(RT_RADIX_TREE *tree);
+#else
+RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx);
+#endif
+RT_SCOPE void RT_FREE(RT_RADIX_TREE *tree);
+
+RT_SCOPE bool RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, uint64 *val_p);
+RT_SCOPE bool RT_SET(RT_RADIX_TREE *tree, uint64 key, uint64 val);
+#ifdef RT_USE_DELETE
+RT_SCOPE bool RT_DELETE(RT_RADIX_TREE *tree, uint64 key);
+#endif
+
+RT_SCOPE RT_ITER * RT_BEGIN_ITERATE(RT_RADIX_TREE *tree);
+RT_SCOPE bool RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, uint64 *value_p);
+RT_SCOPE void RT_END_ITERATE(RT_ITER *iter);
+
+RT_SCOPE uint64 RT_MEMORY_USAGE(RT_RADIX_TREE *tree);
+
+#ifdef RT_DEBUG
+RT_SCOPE void RT_DUMP(RT_RADIX_TREE *tree);
+RT_SCOPE void RT_DUMP_SEARCH(RT_RADIX_TREE *tree, uint64 key);
+RT_SCOPE void RT_STATS(RT_RADIX_TREE *tree);
+#endif
+
+#endif /* RT_DECLARE */
+
+
+/* generate implementation of the radix tree */
+#ifdef RT_DEFINE
+
+/* macros and types common to all implementations */
+#ifndef RT_COMMON
+#define RT_COMMON
+
+#ifdef RT_DEBUG
+#define UINT64_FORMAT_HEX "%" INT64_MODIFIER "X"
+#endif
+
+/* The number of bits encoded in one tree level */
+#define RT_NODE_SPAN BITS_PER_BYTE
+
+/* The number of maximum slots in the node */
+#define RT_NODE_MAX_SLOTS (1 << RT_NODE_SPAN)
+
+/* Mask for extracting a chunk from the key */
+#define RT_CHUNK_MASK ((1 << RT_NODE_SPAN) - 1)
+
+/* Maximum shift the radix tree uses */
+#define RT_MAX_SHIFT RT_KEY_GET_SHIFT(UINT64_MAX)
+
+/* Tree level the radix tree uses */
+#define RT_MAX_LEVEL ((sizeof(uint64) * BITS_PER_BYTE) / RT_NODE_SPAN)
+
+/* Invalid index used in node-125 */
+#define RT_NODE_125_INVALID_IDX 0xFF
+
+/* Get a chunk from the key */
+#define RT_GET_KEY_CHUNK(key, shift) ((uint8) (((key) >> (shift)) & RT_CHUNK_MASK))
+
+/* For accessing bitmaps */
+#define BM_IDX(x) ((x) / BITS_PER_BITMAPWORD)
+#define BM_BIT(x) ((x) % BITS_PER_BITMAPWORD)
+
+/*
+ * Supported radix tree node kinds and size classes.
+ *
+ * There are 4 node kinds and each node kind has one or two size classes,
+ * partial and full. The size classes within the same node kind share the same
+ * node structure but have a different fanout, which is stored in the 'fanout'
+ * field of RT_NODE. For example in size class 15, when a 16th element
+ * is to be inserted, we allocate a larger area and memcpy the entire old
+ * node to it.
+ *
+ * This technique allows us to limit the node kinds to 4, which limits the
+ * number of cases in switch statements. It also allows a possible future
+ * optimization to encode the node kind in a pointer tag.
+ *
+ * These size classes have been chosen carefully so that they minimize the
+ * allocator padding in both the inner and leaf nodes on DSA.
+ *
+ */
+#define RT_NODE_KIND_4 0x00
+#define RT_NODE_KIND_32 0x01
+#define RT_NODE_KIND_125 0x02
+#define RT_NODE_KIND_256 0x03
+#define RT_NODE_KIND_COUNT 4
+
+#endif /* RT_COMMON */
+
+
+typedef enum RT_SIZE_CLASS
+{
+ RT_CLASS_4_FULL = 0,
+ RT_CLASS_32_PARTIAL,
+ RT_CLASS_32_FULL,
+ RT_CLASS_125_FULL,
+ RT_CLASS_256
+} RT_SIZE_CLASS;
+
+/* Common type for all nodes types */
+typedef struct RT_NODE
+{
+ /*
+ * Number of children. We use uint16 to be able to indicate 256 children
+ * at the fanout of 8.
+ */
+ uint16 count;
+
+ /*
+ * Max capacity for the current size class. Storing this in the
+ * node enables multiple size classes per node kind.
+ * Technically, kinds with a single size class don't need this, so we could
+ * keep this in the individual base types, but the code is simpler this way.
+ * Note: node256 is unique in that it cannot possibly have more than a
+ * single size class, so for that kind we store zero, and uint8 is
+ * sufficient for other kinds.
+ */
+ uint8 fanout;
+
+ /*
+ * Shift indicates which part of the key space is represented by this
+ * node. That is, the key is shifted by 'shift' and the lowest
+ * RT_NODE_SPAN bits are then represented in chunk.
+ */
+ uint8 shift;
+
+ /* Node kind, one per search/set algorithm */
+ uint8 kind;
+} RT_NODE;
+
+
+#define RT_PTR_LOCAL RT_NODE *
+
+#ifdef RT_SHMEM
+#define RT_PTR_ALLOC dsa_pointer
+#else
+#define RT_PTR_ALLOC RT_PTR_LOCAL
+#endif
+
+
+#ifdef RT_SHMEM
+#define RT_INVALID_PTR_ALLOC InvalidDsaPointer
+#else
+#define RT_INVALID_PTR_ALLOC NULL
+#endif
+
+#define NODE_IS_LEAF(n) (((RT_PTR_LOCAL) (n))->shift == 0)
+#define NODE_IS_EMPTY(n) (((RT_PTR_LOCAL) (n))->count == 0)
+#define VAR_NODE_HAS_FREE_SLOT(node) \
+ ((node)->base.n.count < (node)->base.n.fanout)
+#define FIXED_NODE_HAS_FREE_SLOT(node, class) \
+ ((node)->base.n.count < RT_SIZE_CLASS_INFO[class].fanout)
+
+/* Base type of each node kind for leaf and inner nodes */
+/* The base types must be able to accommodate the largest size
+ * class for variable-sized node kinds. */
+typedef struct RT_NODE_BASE_4
+{
+ RT_NODE n;
+
+ /* 4 children, for key chunks */
+ uint8 chunks[4];
+} RT_NODE_BASE_4;
+
+typedef struct RT_NODE_BASE_32
+{
+ RT_NODE n;
+
+ /* 32 children, for key chunks */
+ uint8 chunks[32];
+} RT_NODE_BASE_32;
+
+/*
+ * node-125 uses slot_idx array, an array of RT_NODE_MAX_SLOTS length, typically
+ * 256, to store indexes into a second array that contains up to 125 values (or
+ * child pointers in inner nodes).
+ */
+typedef struct RT_NODE_BASE_125
+{
+ RT_NODE n;
+
+ /* The index of slots for each fanout */
+ uint8 slot_idxs[RT_NODE_MAX_SLOTS];
+
+ /* isset is a bitmap to track which slot is in use */
+ bitmapword isset[BM_IDX(128)];
+} RT_NODE_BASE_125;
+
+typedef struct RT_NODE_BASE_256
+{
+ RT_NODE n;
+} RT_NODE_BASE_256;
+
+/*
+ * Inner and leaf nodes.
+ *
+ * These are separate for two main reasons:
+ *
+ * 1) the value type might be different than something fitting into a pointer
+ * width type
+ * 2) Need to represent non-existing values in a key-type independent way.
+ *
+ * 1) is clearly worth being concerned about, but it's not clear 2) is as
+ * good. It might be better to just indicate non-existing entries the same way
+ * in inner nodes.
+ */
+typedef struct RT_NODE_INNER_4
+{
+ RT_NODE_BASE_4 base;
+
+ /* number of children depends on size class */
+ RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_INNER_4;
+
+typedef struct RT_NODE_LEAF_4
+{
+ RT_NODE_BASE_4 base;
+
+ /* number of values depends on size class */
+ uint64 values[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_LEAF_4;
+
+typedef struct RT_NODE_INNER_32
+{
+ RT_NODE_BASE_32 base;
+
+ /* number of children depends on size class */
+ RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_INNER_32;
+
+typedef struct RT_NODE_LEAF_32
+{
+ RT_NODE_BASE_32 base;
+
+ /* number of values depends on size class */
+ uint64 values[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_LEAF_32;
+
+typedef struct RT_NODE_INNER_125
+{
+ RT_NODE_BASE_125 base;
+
+ /* number of children depends on size class */
+ RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_INNER_125;
+
+typedef struct RT_NODE_LEAF_125
+{
+ RT_NODE_BASE_125 base;
+
+ /* number of values depends on size class */
+ uint64 values[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_LEAF_125;
+
+/*
+ * node-256 is the largest node type. This node has RT_NODE_MAX_SLOTS length array
+ * for directly storing values (or child pointers in inner nodes).
+ */
+typedef struct RT_NODE_INNER_256
+{
+ RT_NODE_BASE_256 base;
+
+ /* Slots for 256 children */
+ RT_PTR_ALLOC children[RT_NODE_MAX_SLOTS];
+} RT_NODE_INNER_256;
+
+typedef struct RT_NODE_LEAF_256
+{
+ RT_NODE_BASE_256 base;
+
+ /* isset is a bitmap to track which slot is in use */
+ bitmapword isset[BM_IDX(RT_NODE_MAX_SLOTS)];
+
+ /* Slots for 256 values */
+ uint64 values[RT_NODE_MAX_SLOTS];
+} RT_NODE_LEAF_256;
+
+/* Information for each size class */
+typedef struct RT_SIZE_CLASS_ELEM
+{
+ const char *name;
+ int fanout;
+
+ /* slab chunk size */
+ Size inner_size;
+ Size leaf_size;
+
+ /* slab block size */
+ Size inner_blocksize;
+ Size leaf_blocksize;
+} RT_SIZE_CLASS_ELEM;
+
+/*
+ * Calculate the slab blocksize so that we can allocate at least 32 chunks
+ * from the block.
+ */
+#define NODE_SLAB_BLOCK_SIZE(size) \
+ Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * (size), (size) * 32)
+
+static const RT_SIZE_CLASS_ELEM RT_SIZE_CLASS_INFO[] = {
+ [RT_CLASS_4_FULL] = {
+ .name = "radix tree node 4",
+ .fanout = 4,
+ .inner_size = sizeof(RT_NODE_INNER_4) + 4 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_4) + 4 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_4) + 4 * sizeof(RT_PTR_ALLOC)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_4) + 4 * sizeof(uint64)),
+ },
+ [RT_CLASS_32_PARTIAL] = {
+ .name = "radix tree node 15",
+ .fanout = 15,
+ .inner_size = sizeof(RT_NODE_INNER_32) + 15 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_32) + 15 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_32) + 15 * sizeof(RT_PTR_ALLOC)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_32) + 15 * sizeof(uint64)),
+ },
+ [RT_CLASS_32_FULL] = {
+ .name = "radix tree node 32",
+ .fanout = 32,
+ .inner_size = sizeof(RT_NODE_INNER_32) + 32 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_32) + 32 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_32) + 32 * sizeof(RT_PTR_ALLOC)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_32) + 32 * sizeof(uint64)),
+ },
+ [RT_CLASS_125_FULL] = {
+ .name = "radix tree node 125",
+ .fanout = 125,
+ .inner_size = sizeof(RT_NODE_INNER_125) + 125 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_125) + 125 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_125) + 125 * sizeof(RT_PTR_ALLOC)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_125) + 125 * sizeof(uint64)),
+ },
+ [RT_CLASS_256] = {
+ .name = "radix tree node 256",
+ .fanout = 256,
+ .inner_size = sizeof(RT_NODE_INNER_256),
+ .leaf_size = sizeof(RT_NODE_LEAF_256),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_256)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_256)),
+ },
+};
+
+#define RT_SIZE_CLASS_COUNT lengthof(RT_SIZE_CLASS_INFO)
+
+/* Map from the node kind to its minimum size class */
+static const RT_SIZE_CLASS RT_KIND_MIN_SIZE_CLASS[RT_NODE_KIND_COUNT] = {
+ [RT_NODE_KIND_4] = RT_CLASS_4_FULL,
+ [RT_NODE_KIND_32] = RT_CLASS_32_PARTIAL,
+ [RT_NODE_KIND_125] = RT_CLASS_125_FULL,
+ [RT_NODE_KIND_256] = RT_CLASS_256,
+};
+
+#ifdef RT_SHMEM
+/* A magic value used to identify our radix tree */
+#define RT_RADIX_TREE_MAGIC 0x54A48167
+#endif
+
+/* A radix tree with nodes */
+typedef struct RT_RADIX_TREE_CONTROL
+{
+#ifdef RT_SHMEM
+ RT_HANDLE handle;
+ uint32 magic;
+#endif
+
+ RT_PTR_ALLOC root;
+ uint64 max_val;
+ uint64 num_keys;
+
+ /* statistics */
+#ifdef RT_DEBUG
+ int32 cnt[RT_SIZE_CLASS_COUNT];
+#endif
+} RT_RADIX_TREE_CONTROL;
+
+/* A radix tree with nodes */
+typedef struct RT_RADIX_TREE
+{
+ MemoryContext context;
+
+ /* pointing to either local memory or DSA */
+ RT_RADIX_TREE_CONTROL *ctl;
+
+#ifdef RT_SHMEM
+ dsa_area *dsa;
+#else
+ MemoryContextData *inner_slabs[RT_SIZE_CLASS_COUNT];
+ MemoryContextData *leaf_slabs[RT_SIZE_CLASS_COUNT];
+#endif
+} RT_RADIX_TREE;
+
+/*
+ * Iteration support.
+ *
+ * Iterating the radix tree returns each pair of key and value in the ascending
+ * order of the key. To support this, we iterate over the nodes at each level.
+ *
+ * RT_NODE_ITER struct is used to track the iteration within a node.
+ *
+ * RT_ITER is the struct for iteration of the radix tree, and uses RT_NODE_ITER
+ * in order to track the iteration of each level. During the iteration, we also
+ * construct the key whenever updating the node iteration information, e.g., when
+ * advancing the current index within the node or when moving to the next node
+ * at the same level.
+ *
+ * XXX: Currently we allow only one process to do iteration. Therefore, rt_node_iter
+ * has local pointers to nodes, rather than RT_PTR_ALLOC.
+ * We need either a safeguard to disallow other processes from beginning the
+ * iteration while one process is doing so, or to allow multiple processes to iterate.
+ */
+typedef struct RT_NODE_ITER
+{
+ RT_PTR_LOCAL node; /* current node being iterated */
+ int current_idx; /* current position. -1 for initial value */
+} RT_NODE_ITER;
+
+typedef struct RT_ITER
+{
+ RT_RADIX_TREE *tree;
+
+ /* Track the iteration on nodes of each level */
+ RT_NODE_ITER stack[RT_MAX_LEVEL];
+ int stack_len;
+
+ /* The key is being constructed during the iteration */
+ uint64 key;
+} RT_ITER;
+
+
+static bool RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC nodep, RT_PTR_LOCAL node,
+ uint64 key, RT_PTR_ALLOC child);
+static bool RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC nodep, RT_PTR_LOCAL node,
+ uint64 key, uint64 value);
+
+/* verification (available only with assertion) */
+static void RT_VERIFY_NODE(RT_PTR_LOCAL node);
+
+/* Get the local address of an allocated node */
+static inline RT_PTR_LOCAL
+RT_PTR_GET_LOCAL(RT_RADIX_TREE *tree, RT_PTR_ALLOC node)
+{
+#ifdef RT_SHMEM
+ return dsa_get_address(tree->dsa, (dsa_pointer) node);
+#else
+ return node;
+#endif
+}
+
+static inline bool
+RT_PTR_ALLOC_IS_VALID(RT_PTR_ALLOC ptr)
+{
+#ifdef RT_SHMEM
+ return DsaPointerIsValid(ptr);
+#else
+ return PointerIsValid(ptr);
+#endif
+}
+
+/*
+ * Return the index of the first element in 'node' that equals 'chunk'. Return -1
+ * if there is no such element.
+ */
+static inline int
+RT_NODE_4_SEARCH_EQ(RT_NODE_BASE_4 *node, uint8 chunk)
+{
+ int idx = -1;
+
+ for (int i = 0; i < node->n.count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ idx = i;
+ break;
+ }
+ }
+
+ return idx;
+}
+
+/*
+ * Return index of the chunk to insert into chunks in the given node.
+ */
+static inline int
+RT_NODE_4_GET_INSERTPOS(RT_NODE_BASE_4 *node, uint8 chunk)
+{
+ int idx;
+
+ for (idx = 0; idx < node->n.count; idx++)
+ {
+ if (node->chunks[idx] >= chunk)
+ break;
+ }
+
+ return idx;
+}
+
+/*
+ * Return the index of the first element in 'node' that equals 'chunk'. Return -1
+ * if there is no such element.
+ */
+static inline int
+RT_NODE_32_SEARCH_EQ(RT_NODE_BASE_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ uint32 bitfield;
+ int index_simd = -1;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index = -1;
+
+ for (int i = 0; i < count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ index = i;
+ break;
+ }
+ }
+#endif
+
+#ifndef USE_NO_SIMD
+ spread_chunk = vector8_broadcast(chunk);
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ cmp1 = vector8_eq(spread_chunk, haystack1);
+ cmp2 = vector8_eq(spread_chunk, haystack2);
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+/*
+ * Return index of the chunk to insert into chunks in the given node.
+ */
+static inline int
+RT_NODE_32_GET_INSERTPOS(RT_NODE_BASE_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ Vector8 min1;
+ Vector8 min2;
+ uint32 bitfield;
+ int index_simd;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index;
+
+ for (index = 0; index < count; index++)
+ {
+ if (node->chunks[index] >= chunk)
+ break;
+ }
+#endif
+
+#ifndef USE_NO_SIMD
+ spread_chunk = vector8_broadcast(chunk);
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ min1 = vector8_min(spread_chunk, haystack1);
+ min2 = vector8_min(spread_chunk, haystack2);
+ cmp1 = vector8_eq(spread_chunk, min1);
+ cmp2 = vector8_eq(spread_chunk, min2);
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+ else
+ index_simd = count;
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+/*
+ * Functions to manipulate both chunks array and children/values array.
+ * These are used for node-4 and node-32.
+ */
+
+/* Shift the elements right at 'idx' by one */
+static inline void
+RT_CHUNK_CHILDREN_ARRAY_SHIFT(uint8 *chunks, RT_PTR_ALLOC *children, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+ memmove(&(children[idx + 1]), &(children[idx]), sizeof(RT_PTR_ALLOC) * (count - idx));
+}
+
+static inline void
+RT_CHUNK_VALUES_ARRAY_SHIFT(uint8 *chunks, uint64 *values, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+ memmove(&(values[idx + 1]), &(values[idx]), sizeof(uint64) * (count - idx));
+}
+
+/* Delete the element at 'idx' */
+static inline void
+RT_CHUNK_CHILDREN_ARRAY_DELETE(uint8 *chunks, RT_PTR_ALLOC *children, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(children[idx]), &(children[idx + 1]), sizeof(RT_PTR_ALLOC) * (count - idx - 1));
+}
+
+static inline void
+RT_CHUNK_VALUES_ARRAY_DELETE(uint8 *chunks, uint64 *values, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(values[idx]), &(values[idx + 1]), sizeof(uint64) * (count - idx - 1));
+}
+
+/* Copy both chunks and children/values arrays */
+static inline void
+RT_CHUNK_CHILDREN_ARRAY_COPY(uint8 *src_chunks, RT_PTR_ALLOC *src_children,
+ uint8 *dst_chunks, RT_PTR_ALLOC *dst_children)
+{
+ const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_4_FULL].fanout;
+ const Size chunk_size = sizeof(uint8) * fanout;
+ const Size children_size = sizeof(RT_PTR_ALLOC) * fanout;
+
+ memcpy(dst_chunks, src_chunks, chunk_size);
+ memcpy(dst_children, src_children, children_size);
+}
+
+static inline void
+RT_CHUNK_VALUES_ARRAY_COPY(uint8 *src_chunks, uint64 *src_values,
+ uint8 *dst_chunks, uint64 *dst_values)
+{
+ const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_4_FULL].fanout;
+ const Size chunk_size = sizeof(uint8) * fanout;
+ const Size values_size = sizeof(uint64) * fanout;
+
+ memcpy(dst_chunks, src_chunks, chunk_size);
+ memcpy(dst_values, src_values, values_size);
+}
+
+/* Functions to manipulate inner and leaf node-125 */
+
+/* Does the given chunk in the node have a value? */
+static inline bool
+RT_NODE_125_IS_CHUNK_USED(RT_NODE_BASE_125 *node, uint8 chunk)
+{
+ return node->slot_idxs[chunk] != RT_NODE_125_INVALID_IDX;
+}
+
+static inline RT_PTR_ALLOC
+RT_NODE_INNER_125_GET_CHILD(RT_NODE_INNER_125 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ return node->children[node->base.slot_idxs[chunk]];
+}
+
+static inline uint64
+RT_NODE_LEAF_125_GET_VALUE(RT_NODE_LEAF_125 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ Assert(((RT_NODE_BASE_125 *) node)->slot_idxs[chunk] != RT_NODE_125_INVALID_IDX);
+ return node->values[node->base.slot_idxs[chunk]];
+}
+
+/* Functions to manipulate inner and leaf node-256 */
+
+/* Return true if the slot corresponding to the given chunk is in use */
+static inline bool
+RT_NODE_INNER_256_IS_CHUNK_USED(RT_NODE_INNER_256 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ return node->children[chunk] != RT_INVALID_PTR_ALLOC;
+}
+
+static inline bool
+RT_NODE_LEAF_256_IS_CHUNK_USED(RT_NODE_LEAF_256 *node, uint8 chunk)
+{
+ int idx = BM_IDX(chunk);
+ int bitnum = BM_BIT(chunk);
+
+ Assert(NODE_IS_LEAF(node));
+ return (node->isset[idx] & ((bitmapword) 1 << bitnum)) != 0;
+}
+
+static inline RT_PTR_ALLOC
+RT_NODE_INNER_256_GET_CHILD(RT_NODE_INNER_256 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ Assert(RT_NODE_INNER_256_IS_CHUNK_USED(node, chunk));
+ return node->children[chunk];
+}
+
+static inline uint64
+RT_NODE_LEAF_256_GET_VALUE(RT_NODE_LEAF_256 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_LEAF_256_IS_CHUNK_USED(node, chunk));
+ return node->values[chunk];
+}
+
+/* Set the child in the node-256 */
+static inline void
+RT_NODE_INNER_256_SET(RT_NODE_INNER_256 *node, uint8 chunk, RT_PTR_ALLOC child)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->children[chunk] = child;
+}
+
+/* Set the value in the node-256 */
+static inline void
+RT_NODE_LEAF_256_SET(RT_NODE_LEAF_256 *node, uint8 chunk, uint64 value)
+{
+ int idx = BM_IDX(chunk);
+ int bitnum = BM_BIT(chunk);
+
+ Assert(NODE_IS_LEAF(node));
+ node->isset[idx] |= ((bitmapword) 1 << bitnum);
+ node->values[chunk] = value;
+}
+
+/* Clear the slot at the given chunk position */
+static inline void
+RT_NODE_INNER_256_DELETE(RT_NODE_INNER_256 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->children[chunk] = RT_INVALID_PTR_ALLOC;
+}
+
+static inline void
+RT_NODE_LEAF_256_DELETE(RT_NODE_LEAF_256 *node, uint8 chunk)
+{
+ int idx = BM_IDX(chunk);
+ int bitnum = BM_BIT(chunk);
+
+ Assert(NODE_IS_LEAF(node));
+ node->isset[idx] &= ~((bitmapword) 1 << bitnum);
+}
+
+/*
+ * Return the shift needed to store the given key.
+ */
+static inline int
+RT_KEY_GET_SHIFT(uint64 key)
+{
+ return (key == 0)
+ ? 0
+ : (pg_leftmost_one_pos64(key) / RT_NODE_SPAN) * RT_NODE_SPAN;
+}
+
+/*
+ * Return the max value stored in a node with the given shift.
+ */
+static uint64
+RT_SHIFT_GET_MAX_VAL(int shift)
+{
+ if (shift == RT_MAX_SHIFT)
+ return UINT64_MAX;
+
+ return (UINT64CONST(1) << (shift + RT_NODE_SPAN)) - 1;
+}
+
+/*
+ * Allocate a new node with the given node kind.
+ */
+static RT_PTR_ALLOC
+RT_ALLOC_NODE(RT_RADIX_TREE *tree, RT_SIZE_CLASS size_class, bool inner)
+{
+ RT_PTR_ALLOC allocnode;
+ size_t allocsize;
+
+ if (inner)
+ allocsize = RT_SIZE_CLASS_INFO[size_class].inner_size;
+ else
+ allocsize = RT_SIZE_CLASS_INFO[size_class].leaf_size;
+
+#ifdef RT_SHMEM
+ allocnode = dsa_allocate(tree->dsa, allocsize);
+#else
+ if (inner)
+ allocnode = (RT_PTR_ALLOC) MemoryContextAlloc(tree->inner_slabs[size_class],
+ allocsize);
+ else
+ allocnode = (RT_PTR_ALLOC) MemoryContextAlloc(tree->leaf_slabs[size_class],
+ allocsize);
+#endif
+
+#ifdef RT_DEBUG
+ /* update the statistics */
+ tree->ctl->cnt[size_class]++;
+#endif
+
+ return allocnode;
+}
+
+/* Initialize the node contents */
+static inline void
+RT_INIT_NODE(RT_PTR_LOCAL node, uint8 kind, RT_SIZE_CLASS size_class, bool inner)
+{
+ if (inner)
+ MemSet(node, 0, RT_SIZE_CLASS_INFO[size_class].inner_size);
+ else
+ MemSet(node, 0, RT_SIZE_CLASS_INFO[size_class].leaf_size);
+
+ node->kind = kind;
+
+ if (kind == RT_NODE_KIND_256)
+ /* See comment for the RT_NODE type */
+ Assert(node->fanout == 0);
+ else
+ node->fanout = RT_SIZE_CLASS_INFO[size_class].fanout;
+
+ /* Initialize slot_idxs to invalid values */
+ if (kind == RT_NODE_KIND_125)
+ {
+ RT_NODE_BASE_125 *n125 = (RT_NODE_BASE_125 *) node;
+
+ memset(n125->slot_idxs, RT_NODE_125_INVALID_IDX, sizeof(n125->slot_idxs));
+ }
+}
+
+/*
+ * Create a new node as the root. Subordinate nodes will be created during
+ * the insertion.
+ */
+static void
+RT_NEW_ROOT(RT_RADIX_TREE *tree, uint64 key)
+{
+ int shift = RT_KEY_GET_SHIFT(key);
+ bool inner = shift > 0;
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+
+ allocnode = RT_ALLOC_NODE(tree, RT_CLASS_4_FULL, inner);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(newnode, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
+ newnode->shift = shift;
+ tree->ctl->max_val = RT_SHIFT_GET_MAX_VAL(shift);
+ tree->ctl->root = allocnode;
+}
+
+static inline void
+RT_COPY_NODE(RT_PTR_LOCAL newnode, RT_PTR_LOCAL oldnode)
+{
+ newnode->shift = oldnode->shift;
+ newnode->count = oldnode->count;
+}
+#if 0
+/*
+ * Create a new node with 'new_kind' and the same shift, chunk, and
+ * count of 'node'.
+ */
+static RT_NODE*
+RT_GROW_NODE_KIND(RT_RADIX_TREE *tree, RT_PTR_LOCAL node, uint8 new_kind)
+{
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+ bool inner = !NODE_IS_LEAF(node);
+
+ allocnode = RT_ALLOC_NODE(tree, RT_KIND_MIN_SIZE_CLASS[new_kind], inner);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(newnode, new_kind, RT_KIND_MIN_SIZE_CLASS[new_kind], inner);
+ RT_COPY_NODE(newnode, node);
+
+ return newnode;
+}
+#endif
+/* Free the given node */
+static void
+RT_FREE_NODE(RT_RADIX_TREE *tree, RT_PTR_ALLOC allocnode)
+{
+ /* If we're deleting the root node, make the tree empty */
+ if (tree->ctl->root == allocnode)
+ {
+ tree->ctl->root = RT_INVALID_PTR_ALLOC;
+ tree->ctl->max_val = 0;
+ }
+
+#ifdef RT_DEBUG
+ {
+ int i;
+ RT_PTR_LOCAL node = RT_PTR_GET_LOCAL(tree, allocnode);
+
+ /* update the statistics */
+ for (i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ if (node->fanout == RT_SIZE_CLASS_INFO[i].fanout)
+ break;
+ }
+
+ /* fanout of node256 is intentionally 0 */
+ if (i == RT_SIZE_CLASS_COUNT)
+ i = RT_CLASS_256;
+
+ tree->ctl->cnt[i]--;
+ Assert(tree->ctl->cnt[i] >= 0);
+ }
+#endif
+
+#ifdef RT_SHMEM
+ dsa_free(tree->dsa, allocnode);
+#else
+ pfree(allocnode);
+#endif
+}
+
+static inline void
+RT_NODE_UPDATE_INNER(RT_PTR_LOCAL node, uint64 key, RT_PTR_ALLOC new_child)
+{
+#define RT_ACTION_UPDATE
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_INNER
+#undef RT_ACTION_UPDATE
+}
+
+/*
+ * Replace old_child with new_child, and free the old one.
+ */
+static void
+RT_REPLACE_NODE(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC old_child,
+ RT_PTR_ALLOC new_child, uint64 key)
+{
+ RT_PTR_LOCAL old = RT_PTR_GET_LOCAL(tree, old_child);
+
+#ifdef USE_ASSERT_CHECKING
+ RT_PTR_LOCAL new = RT_PTR_GET_LOCAL(tree, new_child);
+
+ Assert(old->shift == new->shift);
+#endif
+
+ if (parent == old)
+ {
+ /* Replace the root node with the new large node */
+ tree->ctl->root = new_child;
+ }
+ else
+ RT_NODE_UPDATE_INNER(parent, key, new_child);
+
+ RT_FREE_NODE(tree, old_child);
+}
+
+/*
+ * The radix tree doesn't have sufficient height. Extend the radix tree so it can
+ * store the key.
+ */
+static void
+RT_EXTEND(RT_RADIX_TREE *tree, uint64 key)
+{
+ int target_shift;
+ RT_PTR_LOCAL root = RT_PTR_GET_LOCAL(tree, tree->ctl->root);
+ int shift = root->shift + RT_NODE_SPAN;
+
+ target_shift = RT_KEY_GET_SHIFT(key);
+
+ /* Grow tree from 'shift' to 'target_shift' */
+ while (shift <= target_shift)
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL node;
+ RT_NODE_INNER_4 *n4;
+
+ allocnode = RT_ALLOC_NODE(tree, RT_CLASS_4_FULL, true);
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(node, RT_NODE_KIND_4, RT_CLASS_4_FULL, true);
+ node->shift = shift;
+ node->count = 1;
+
+ n4 = (RT_NODE_INNER_4 *) node;
+ n4->base.chunks[0] = 0;
+ n4->children[0] = tree->ctl->root;
+
+ /* Update the root */
+ tree->ctl->root = allocnode;
+
+ shift += RT_NODE_SPAN;
+ }
+
+ tree->ctl->max_val = RT_SHIFT_GET_MAX_VAL(target_shift);
+}
+
+/*
+ * The radix tree doesn't have the inner and leaf nodes for the given key-value
+ * pair. Insert inner and leaf nodes from 'node' down to the bottom.
+ */
+static inline void
+RT_SET_EXTEND(RT_RADIX_TREE *tree, uint64 key, uint64 value, RT_PTR_LOCAL parent,
+ RT_PTR_ALLOC nodep, RT_PTR_LOCAL node)
+{
+ int shift = node->shift;
+
+ Assert(RT_PTR_GET_LOCAL(tree, nodep) == node);
+
+ while (shift >= RT_NODE_SPAN)
+ {
+ RT_PTR_ALLOC allocchild;
+ RT_PTR_LOCAL newchild;
+ int newshift = shift - RT_NODE_SPAN;
+ bool inner = newshift > 0;
+
+ allocchild = RT_ALLOC_NODE(tree, RT_CLASS_4_FULL, inner);
+ newchild = RT_PTR_GET_LOCAL(tree, allocchild);
+ RT_INIT_NODE(newchild, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
+ newchild->shift = newshift;
+ RT_NODE_INSERT_INNER(tree, parent, nodep, node, key, allocchild);
+
+ parent = node;
+ node = newchild;
+ nodep = allocchild;
+ shift -= RT_NODE_SPAN;
+ }
+
+ RT_NODE_INSERT_LEAF(tree, parent, nodep, node, key, value);
+ tree->ctl->num_keys++;
+}
+
+/*
+ * Search for the child pointer corresponding to 'key' in the given node.
+ *
+ * Return true if the key is found, otherwise return false. On success, the child
+ * pointer is set to child_p.
+ */
+static inline bool
+RT_NODE_SEARCH_INNER(RT_PTR_LOCAL node, uint64 key, RT_PTR_ALLOC *child_p)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/*
+ * Search for the value corresponding to 'key' in the given node.
+ *
+ * Return true if the key is found, otherwise return false. On success, the pointer
+ * to the value is set to value_p.
+ */
+static inline bool
+RT_NODE_SEARCH_LEAF(RT_PTR_LOCAL node, uint64 key, uint64 *value_p)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
+#ifdef RT_USE_DELETE
+/*
+ * Search for the child pointer corresponding to 'key' in the given node.
+ *
+ * Delete the child pointer and return true if the key is found, otherwise return false.
+ */
+static inline bool
+RT_NODE_DELETE_INNER(RT_PTR_LOCAL node, uint64 key)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_delete_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/*
+ * Search for the value corresponding to 'key' in the given node.
+ *
+ * Delete the value and return true if the key is found, otherwise return false.
+ */
+static inline bool
+RT_NODE_DELETE_LEAF(RT_PTR_LOCAL node, uint64 key)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_delete_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+#endif
+
+/* Insert the child to the inner node */
+static bool
+RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC nodep, RT_PTR_LOCAL node,
+ uint64 key, RT_PTR_ALLOC child)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_insert_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/* Insert the value to the leaf node */
+static bool
+RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC nodep, RT_PTR_LOCAL node,
+ uint64 key, uint64 value)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_insert_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
+/*
+ * Create the radix tree in the given memory context and return it.
+ */
+RT_SCOPE RT_RADIX_TREE *
+#ifdef RT_SHMEM
+RT_CREATE(MemoryContext ctx, dsa_area *dsa)
+#else
+RT_CREATE(MemoryContext ctx)
+#endif
+{
+ RT_RADIX_TREE *tree;
+ MemoryContext old_ctx;
+#ifdef RT_SHMEM
+ dsa_pointer dp;
+#endif
+
+ old_ctx = MemoryContextSwitchTo(ctx);
+
+ tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE));
+ tree->context = ctx;
+
+#ifdef RT_SHMEM
+ tree->dsa = dsa;
+ dp = dsa_allocate0(dsa, sizeof(RT_RADIX_TREE_CONTROL));
+ tree->ctl = (RT_RADIX_TREE_CONTROL *) dsa_get_address(dsa, dp);
+ tree->ctl->handle = dp;
+ tree->ctl->magic = RT_RADIX_TREE_MAGIC;
+#else
+ tree->ctl = (RT_RADIX_TREE_CONTROL *) palloc0(sizeof(RT_RADIX_TREE_CONTROL));
+
+ /* Create the slab allocator for each size class */
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ tree->inner_slabs[i] = SlabContextCreate(ctx,
+ RT_SIZE_CLASS_INFO[i].name,
+ RT_SIZE_CLASS_INFO[i].inner_blocksize,
+ RT_SIZE_CLASS_INFO[i].inner_size);
+ tree->leaf_slabs[i] = SlabContextCreate(ctx,
+ RT_SIZE_CLASS_INFO[i].name,
+ RT_SIZE_CLASS_INFO[i].leaf_blocksize,
+ RT_SIZE_CLASS_INFO[i].leaf_size);
+ }
+#endif
+
+ tree->ctl->root = RT_INVALID_PTR_ALLOC;
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return tree;
+}
+
+#ifdef RT_SHMEM
+RT_SCOPE RT_RADIX_TREE *
+RT_ATTACH(dsa_area *dsa, RT_HANDLE handle)
+{
+ RT_RADIX_TREE *tree;
+ dsa_pointer control;
+
+ /* XXX: memory context support */
+ tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE));
+
+ /* Find the control object in shared memory */
+ control = handle;
+
+ tree->dsa = dsa;
+ tree->ctl = (RT_RADIX_TREE_CONTROL *) dsa_get_address(dsa, control);
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+
+ return tree;
+}
+
+RT_SCOPE void
+RT_DETACH(RT_RADIX_TREE *tree)
+{
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+ pfree(tree);
+}
+
+RT_SCOPE RT_HANDLE
+RT_GET_HANDLE(RT_RADIX_TREE *tree)
+{
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+ return tree->ctl->handle;
+}
+
+/*
+ * Recursively free all nodes allocated to the DSA area.
+ */
+static inline void
+RT_FREE_RECURSE(RT_RADIX_TREE *tree, RT_PTR_ALLOC ptr)
+{
+ RT_PTR_LOCAL node = RT_PTR_GET_LOCAL(tree, ptr);
+
+ check_stack_depth();
+ CHECK_FOR_INTERRUPTS();
+
+ /* The leaf node doesn't have child pointers */
+ if (NODE_IS_LEAF(node))
+ {
+ dsa_free(tree->dsa, ptr);
+ return;
+ }
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ RT_NODE_INNER_4 *n4 = (RT_NODE_INNER_4 *) node;
+
+ for (int i = 0; i < n4->base.n.count; i++)
+ RT_FREE_RECURSE(tree, n4->children[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE_INNER_32 *n32 = (RT_NODE_INNER_32 *) node;
+
+ for (int i = 0; i < n32->base.n.count; i++)
+ RT_FREE_RECURSE(tree, n32->children[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE_INNER_125 *n125 = (RT_NODE_INNER_125 *) node;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(&n125->base, i))
+ continue;
+
+ RT_FREE_RECURSE(tree, RT_NODE_INNER_125_GET_CHILD(n125, i));
+ }
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE_INNER_256 *n256 = (RT_NODE_INNER_256 *) node;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
+ continue;
+
+ RT_FREE_RECURSE(tree, RT_NODE_INNER_256_GET_CHILD(n256, i));
+ }
+
+ break;
+ }
+ }
+
+ /* Free the inner node */
+ dsa_free(tree->dsa, ptr);
+}
+#endif
+
+/*
+ * Free the given radix tree.
+ */
+RT_SCOPE void
+RT_FREE(RT_RADIX_TREE *tree)
+{
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+
+ /* Free all memory used for radix tree nodes */
+ if (RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ RT_FREE_RECURSE(tree, tree->ctl->root);
+
+ /*
+	 * Vandalize the control block to help catch programming errors where
+ * other backends access the memory formerly occupied by this radix tree.
+ */
+ tree->ctl->magic = 0;
+ dsa_free(tree->dsa, tree->ctl->handle);
+#else
+ pfree(tree->ctl);
+
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ MemoryContextDelete(tree->inner_slabs[i]);
+ MemoryContextDelete(tree->leaf_slabs[i]);
+ }
+#endif
+
+ pfree(tree);
+}
+
+/*
+ * Set key to value. If the entry already exists, update its value to 'value'
+ * and return true; otherwise insert a new entry and return false.
+ */
+RT_SCOPE bool
+RT_SET(RT_RADIX_TREE *tree, uint64 key, uint64 value)
+{
+ int shift;
+ bool updated;
+ RT_PTR_LOCAL parent;
+ RT_PTR_ALLOC nodep;
+ RT_PTR_LOCAL node;
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+
+ /* Empty tree, create the root */
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ RT_NEW_ROOT(tree, key);
+
+ /* Extend the tree if necessary */
+ if (key > tree->ctl->max_val)
+ RT_EXTEND(tree, key);
+
+ nodep = tree->ctl->root;
+ parent = RT_PTR_GET_LOCAL(tree, nodep);
+ shift = parent->shift;
+
+ /* Descend the tree until a leaf node */
+ while (shift >= 0)
+ {
+ RT_PTR_ALLOC child;
+
+ node = RT_PTR_GET_LOCAL(tree, nodep);
+
+ if (NODE_IS_LEAF(node))
+ break;
+
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ {
+ RT_SET_EXTEND(tree, key, value, parent, nodep, node);
+ return false;
+ }
+
+ parent = node;
+ nodep = child;
+ shift -= RT_NODE_SPAN;
+ }
+
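+	/* We've arrived at the leaf node; insert or update the value there */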
+ updated = RT_NODE_INSERT_LEAF(tree, parent, nodep, node, key, value);
+
+ /* Update the statistics */
+ if (!updated)
+ tree->ctl->num_keys++;
+
+ return updated;
+}
+
+/*
+ * Search for the given key in the radix tree. If the key is found, set its
+ * value to *value_p (which therefore must not be NULL) and return true;
+ * otherwise return false.
+ */
+RT_SCOPE bool
+RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, uint64 *value_p)
+{
+ RT_PTR_LOCAL node;
+ int shift;
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+ Assert(value_p != NULL);
+
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root) || key > tree->ctl->max_val)
+ return false;
+
+ node = RT_PTR_GET_LOCAL(tree, tree->ctl->root);
+ shift = node->shift;
+
+ /* Descend the tree until a leaf node */
+ while (shift >= 0)
+ {
+ RT_PTR_ALLOC child;
+
+ if (NODE_IS_LEAF(node))
+ break;
+
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ return false;
+
+ node = RT_PTR_GET_LOCAL(tree, child);
+ shift -= RT_NODE_SPAN;
+ }
+
+ return RT_NODE_SEARCH_LEAF(node, key, value_p);
+}
+
+#ifdef RT_USE_DELETE
+/*
+ * Delete the given key from the radix tree. Return true if the key is found (and
+ * deleted), otherwise do nothing and return false.
+ */
+RT_SCOPE bool
+RT_DELETE(RT_RADIX_TREE *tree, uint64 key)
+{
+ RT_PTR_LOCAL node;
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_ALLOC stack[RT_MAX_LEVEL] = {0};
+ int shift;
+ int level;
+ bool deleted;
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root) || key > tree->ctl->max_val)
+ return false;
+
+ /*
+ * Descend the tree to search the key while building a stack of nodes we
+ * visited.
+ */
+ allocnode = tree->ctl->root;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ shift = node->shift;
+ level = -1;
+ while (shift > 0)
+ {
+ RT_PTR_ALLOC child;
+
+ /* Push the current node to the stack */
+ stack[++level] = allocnode;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ return false;
+
+ allocnode = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ /* Delete the key from the leaf node if exists */
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ deleted = RT_NODE_DELETE_LEAF(node, key);
+
+ if (!deleted)
+ {
+		/* the key was not found in the leaf node */
+ return false;
+ }
+
+ /* Found the key to delete. Update the statistics */
+ tree->ctl->num_keys--;
+
+ /*
+ * Return if the leaf node still has keys and we don't need to delete the
+ * node.
+ */
+ if (!NODE_IS_EMPTY(node))
+ return true;
+
+ /* Free the empty leaf node */
+ RT_FREE_NODE(tree, allocnode);
+
+	/* Delete the key from the inner nodes, walking up the saved stack */
+ while (level >= 0)
+ {
+ allocnode = stack[level--];
+
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ deleted = RT_NODE_DELETE_INNER(node, key);
+ Assert(deleted);
+
+ /* If the node didn't become empty, we stop deleting the key */
+ if (!NODE_IS_EMPTY(node))
+ break;
+
+ /* The node became empty */
+ RT_FREE_NODE(tree, allocnode);
+ }
+
+ return true;
+}
+#endif
+
+static inline void
+RT_ITER_UPDATE_KEY(RT_ITER *iter, uint8 chunk, uint8 shift)
+{
+ iter->key &= ~(((uint64) RT_CHUNK_MASK) << shift);
+ iter->key |= (((uint64) chunk) << shift);
+}
+
+/*
+ * Advance to the next slot in the inner node. Return the child if one exists,
+ * otherwise NULL.
+ */
+static inline RT_PTR_LOCAL
+RT_NODE_INNER_ITERATE_NEXT(RT_ITER *iter, RT_NODE_ITER *node_iter)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_iter_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/*
+ * Advance to the next slot in the leaf node. On success, return true and set
+ * the value to *value_p; otherwise return false.
+ */
+static inline bool
+RT_NODE_LEAF_ITERATE_NEXT(RT_ITER *iter, RT_NODE_ITER *node_iter,
+ uint64 *value_p)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_iter_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
+/*
+ * Update each node_iter for inner nodes in the iterator node stack.
+ */
+static void
+RT_UPDATE_ITER_STACK(RT_ITER *iter, RT_PTR_LOCAL from_node, int from)
+{
+ int level = from;
+ RT_PTR_LOCAL node = from_node;
+
+ for (;;)
+ {
+ RT_NODE_ITER *node_iter = &(iter->stack[level--]);
+
+ node_iter->node = node;
+ node_iter->current_idx = -1;
+
+ /* We don't advance the leaf node iterator here */
+ if (NODE_IS_LEAF(node))
+ return;
+
+ /* Advance to the next slot in the inner node */
+ node = RT_NODE_INNER_ITERATE_NEXT(iter, node_iter);
+
+		/* We must find the first child in the node */
+ Assert(node);
+ }
+}
+
+/* Create and return the iterator for the given radix tree */
+RT_SCOPE RT_ITER *
+RT_BEGIN_ITERATE(RT_RADIX_TREE *tree)
+{
+ MemoryContext old_ctx;
+ RT_ITER *iter;
+ RT_PTR_LOCAL root;
+ int top_level;
+
+ old_ctx = MemoryContextSwitchTo(tree->context);
+
+ iter = (RT_ITER *) palloc0(sizeof(RT_ITER));
+ iter->tree = tree;
+
+ /* empty tree */
+ if (!iter->tree->ctl->root)
+ return iter;
+
+ root = RT_PTR_GET_LOCAL(tree, iter->tree->ctl->root);
+ top_level = root->shift / RT_NODE_SPAN;
+ iter->stack_len = top_level;
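+
+	/*
+	 * The iterator's node stack is indexed by level: stack[0] will hold the
+	 * leaf node and stack[stack_len] the root.
+	 */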
+
+ /*
+	 * Descend from the root to the leftmost leaf node. The key is constructed
+	 * while descending to the leaf.
+ */
+ RT_UPDATE_ITER_STACK(iter, root, top_level);
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return iter;
+}
+
+/*
+ * If there is a next key, set *key_p and *value_p and return true; otherwise
+ * return false.
+ */
+RT_SCOPE bool
+RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, uint64 *value_p)
+{
+ /* Empty tree */
+ if (!iter->tree->ctl->root)
+ return false;
+
+ for (;;)
+ {
+ RT_PTR_LOCAL child = NULL;
+ uint64 value;
+ int level;
+ bool found;
+
+ /* Advance the leaf node iterator to get next key-value pair */
+ found = RT_NODE_LEAF_ITERATE_NEXT(iter, &(iter->stack[0]), &value);
+
+ if (found)
+ {
+ *key_p = iter->key;
+ *value_p = value;
+ return true;
+ }
+
+ /*
+		 * We've visited all values in the leaf node, so advance the inner node
+		 * iterators, starting at level 1, until we find the next child node.
+ */
+ for (level = 1; level <= iter->stack_len; level++)
+ {
+ child = RT_NODE_INNER_ITERATE_NEXT(iter, &(iter->stack[level]));
+
+ if (child)
+ break;
+ }
+
+ /* the iteration finished */
+ if (!child)
+ return false;
+
+ /*
+ * Set the node to the node iterator and update the iterator stack
+ * from this node.
+ */
+ RT_UPDATE_ITER_STACK(iter, child, level - 1);
+
+ /* Node iterators are updated, so try again from the leaf */
+ }
+
+ return false;
+}
+
+RT_SCOPE void
+RT_END_ITERATE(RT_ITER *iter)
+{
+ pfree(iter);
+}
+
+/*
+ * Return the amount of memory used by the radix tree.
+ */
+RT_SCOPE uint64
+RT_MEMORY_USAGE(RT_RADIX_TREE *tree)
+{
+	/* XXX: is this necessary? */
+ Size total = sizeof(RT_RADIX_TREE);
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+ total = dsa_get_total_size(tree->dsa);
+#else
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
+ total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
+ }
+#endif
+
+ return total;
+}
+
+/*
+ * Verify the radix tree node.
+ */
+static void
+RT_VERIFY_NODE(RT_PTR_LOCAL node)
+{
+#ifdef USE_ASSERT_CHECKING
+ Assert(node->count >= 0);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ RT_NODE_BASE_4 *n4 = (RT_NODE_BASE_4 *) node;
+
+ for (int i = 1; i < n4->n.count; i++)
+ Assert(n4->chunks[i - 1] < n4->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE_BASE_32 *n32 = (RT_NODE_BASE_32 *) node;
+
+ for (int i = 1; i < n32->n.count; i++)
+ Assert(n32->chunks[i - 1] < n32->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE_BASE_125 *n125 = (RT_NODE_BASE_125 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ uint8 slot = n125->slot_idxs[i];
+ int idx = BM_IDX(slot);
+ int bitnum = BM_BIT(slot);
+
+ if (!RT_NODE_125_IS_CHUNK_USED(n125, i))
+ continue;
+
+ /* Check if the corresponding slot is used */
+ Assert(slot < node->fanout);
+ Assert((n125->isset[idx] & ((bitmapword) 1 << bitnum)) != 0);
+
+ cnt++;
+ }
+
+ Assert(n125->n.count == cnt);
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_256 *n256 = (RT_NODE_LEAF_256 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < BM_IDX(RT_NODE_MAX_SLOTS); i++)
+ cnt += bmw_popcount(n256->isset[i]);
+
+				/* Check if the number of used chunks matches */
+ Assert(n256->base.n.count == cnt);
+
+ break;
+ }
+ }
+ }
+#endif
+}
+
+/***************** DEBUG FUNCTIONS *****************/
+#ifdef RT_DEBUG
+RT_SCOPE void
+RT_STATS(RT_RADIX_TREE *tree)
+{
+ ereport(NOTICE, (errmsg("num_keys = " UINT64_FORMAT ", height = %d, n4 = %u, n15 = %u, n32 = %u, n125 = %u, n256 = %u",
+ tree->ctl->num_keys,
+ tree->ctl->root->shift / RT_NODE_SPAN,
+ tree->ctl->cnt[RT_CLASS_4_FULL],
+ tree->ctl->cnt[RT_CLASS_32_PARTIAL],
+ tree->ctl->cnt[RT_CLASS_32_FULL],
+ tree->ctl->cnt[RT_CLASS_125_FULL],
+ tree->ctl->cnt[RT_CLASS_256])));
+}
+
+static void
+RT_DUMP_NODE(RT_PTR_LOCAL node, int level, bool recurse)
+{
+ char space[125] = {0};
+
+ fprintf(stderr, "[%s] kind %d, fanout %d, count %u, shift %u:\n",
+ NODE_IS_LEAF(node) ? "LEAF" : "INNR",
+ (node->kind == RT_NODE_KIND_4) ? 4 :
+ (node->kind == RT_NODE_KIND_32) ? 32 :
+ (node->kind == RT_NODE_KIND_125) ? 125 : 256,
+ node->fanout == 0 ? 256 : node->fanout,
+ node->count, node->shift);
+
+ if (level > 0)
+ sprintf(space, "%*c", level * 4, ' ');
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_4 *n4 = (RT_NODE_LEAF_4 *) node;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, n4->base.chunks[i], n4->values[i]);
+ }
+ else
+ {
+ RT_NODE_INNER_4 *n4 = (RT_NODE_INNER_4 *) node;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, n4->base.chunks[i]);
+
+ if (recurse)
+ RT_DUMP_NODE(n4->children[i], level + 1, recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_32 *n32 = (RT_NODE_LEAF_32 *) node;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, n32->base.chunks[i], n32->values[i]);
+ }
+ else
+ {
+ RT_NODE_INNER_32 *n32 = (RT_NODE_INNER_32 *) node;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, n32->base.chunks[i]);
+
+ if (recurse)
+ {
+ RT_DUMP_NODE(n32->children[i], level + 1, recurse);
+ }
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE_BASE_125 *b125 = (RT_NODE_BASE_125 *) node;
+
+ fprintf(stderr, "slot_idxs ");
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(b125, i))
+ continue;
+
+ fprintf(stderr, " [%d]=%d, ", i, b125->slot_idxs[i]);
+ }
+ if (NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_125 *n = (RT_NODE_LEAF_125 *) node;
+
+ fprintf(stderr, ", isset-bitmap:");
+ for (int i = 0; i < BM_IDX(128); i++)
+ {
+ fprintf(stderr, UINT64_FORMAT_HEX " ", (uint64) n->base.isset[i]);
+ }
+ fprintf(stderr, "\n");
+ }
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(b125, i))
+ continue;
+
+ if (NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_125 *n125 = (RT_NODE_LEAF_125 *) b125;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, i, RT_NODE_LEAF_125_GET_VALUE(n125, i));
+ }
+ else
+ {
+ RT_NODE_INNER_125 *n125 = (RT_NODE_INNER_125 *) b125;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, i);
+
+ if (recurse)
+ RT_DUMP_NODE(RT_NODE_INNER_125_GET_CHILD(n125, i),
+ level + 1, recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_256 *n256 = (RT_NODE_LEAF_256 *) node;
+
+ if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, i))
+ continue;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, i, RT_NODE_LEAF_256_GET_VALUE(n256, i));
+ }
+ else
+ {
+ RT_NODE_INNER_256 *n256 = (RT_NODE_INNER_256 *) node;
+
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
+ continue;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, i);
+
+ if (recurse)
+ RT_DUMP_NODE(RT_NODE_INNER_256_GET_CHILD(n256, i), level + 1,
+ recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ }
+}
+
+RT_SCOPE void
+RT_DUMP_SEARCH(RT_RADIX_TREE *tree, uint64 key)
+{
+ RT_PTR_LOCAL node;
+ int shift;
+ int level = 0;
+
+ elog(NOTICE, "-----------------------------------------------------------");
+ elog(NOTICE, "max_val = " UINT64_FORMAT "(0x" UINT64_FORMAT_HEX ")",
+ tree->ctl->max_val, tree->ctl->max_val);
+
+ if (!tree->ctl->root)
+ {
+ elog(NOTICE, "tree is empty");
+ return;
+ }
+
+ if (key > tree->ctl->max_val)
+ {
+ elog(NOTICE, "key " UINT64_FORMAT "(0x" UINT64_FORMAT_HEX ") is larger than max val",
+ key, key);
+ return;
+ }
+
+ node = tree->ctl->root;
+ shift = tree->ctl->root->shift;
+ while (shift >= 0)
+ {
+ RT_PTR_LOCAL child;
+
+ RT_DUMP_NODE(node, level, false);
+
+ if (NODE_IS_LEAF(node))
+ {
+ uint64 dummy;
+
+			/* We've reached a leaf node; find the corresponding slot */
+ RT_NODE_SEARCH_LEAF(node, key, &dummy);
+
+ break;
+ }
+
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ break;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ level++;
+ }
+}
+
+RT_SCOPE void
+RT_DUMP(RT_RADIX_TREE *tree)
+{
+
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ fprintf(stderr, "%s\tinner_size %zu\tinner_blocksize %zu\tleaf_size %zu\tleaf_blocksize %zu\n",
+ RT_SIZE_CLASS_INFO[i].name,
+ RT_SIZE_CLASS_INFO[i].inner_size,
+ RT_SIZE_CLASS_INFO[i].inner_blocksize,
+ RT_SIZE_CLASS_INFO[i].leaf_size,
+ RT_SIZE_CLASS_INFO[i].leaf_blocksize);
+ fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->ctl->max_val);
+
+ if (!tree->ctl->root)
+ {
+ fprintf(stderr, "empty tree\n");
+ return;
+ }
+
+ RT_DUMP_NODE(tree->ctl->root, 0, true);
+}
+#endif
+
+#endif /* RT_DEFINE */
+
+
+/* undefine external parameters, so next radix tree can be defined */
+#undef RT_PREFIX
+#undef RT_SCOPE
+#undef RT_DECLARE
+#undef RT_DEFINE
+
+/* locally declared macros */
+#undef NODE_IS_LEAF
+#undef NODE_IS_EMPTY
+#undef VAR_NODE_HAS_FREE_SLOT
+#undef FIXED_NODE_HAS_FREE_SLOT
+#undef RT_SIZE_CLASS_COUNT
+#undef RT_RADIX_TREE_MAGIC
+
+/* type declarations */
+#undef RT_RADIX_TREE
+#undef RT_RADIX_TREE_CONTROL
+#undef RT_PTR_ALLOC
+#undef RT_INVALID_PTR_ALLOC
+#undef RT_HANDLE
+#undef RT_ITER
+#undef RT_NODE
+#undef RT_NODE_ITER
+#undef RT_NODE_BASE_4
+#undef RT_NODE_BASE_32
+#undef RT_NODE_BASE_125
+#undef RT_NODE_BASE_256
+#undef RT_NODE_INNER_4
+#undef RT_NODE_INNER_32
+#undef RT_NODE_INNER_125
+#undef RT_NODE_INNER_256
+#undef RT_NODE_LEAF_4
+#undef RT_NODE_LEAF_32
+#undef RT_NODE_LEAF_125
+#undef RT_NODE_LEAF_256
+#undef RT_SIZE_CLASS
+#undef RT_SIZE_CLASS_ELEM
+#undef RT_SIZE_CLASS_INFO
+#undef RT_CLASS_4_FULL
+#undef RT_CLASS_32_PARTIAL
+#undef RT_CLASS_32_FULL
+#undef RT_CLASS_125_FULL
+#undef RT_CLASS_256
+#undef RT_KIND_MIN_SIZE_CLASS
+
+/* function declarations */
+#undef RT_CREATE
+#undef RT_FREE
+#undef RT_ATTACH
+#undef RT_DETACH
+#undef RT_GET_HANDLE
+#undef RT_SET
+#undef RT_BEGIN_ITERATE
+#undef RT_ITERATE_NEXT
+#undef RT_END_ITERATE
+#undef RT_USE_DELETE
+#undef RT_DELETE
+#undef RT_MEMORY_USAGE
+#undef RT_DUMP
+#undef RT_DUMP_NODE
+#undef RT_DUMP_SEARCH
+#undef RT_STATS
+
+/* internal helper functions */
+#undef RT_NEW_ROOT
+#undef RT_ALLOC_NODE
+#undef RT_INIT_NODE
+#undef RT_FREE_NODE
+#undef RT_FREE_RECURSE
+#undef RT_EXTEND
+#undef RT_SET_EXTEND
+#undef RT_GROW_NODE_KIND
+#undef RT_COPY_NODE
+#undef RT_REPLACE_NODE
+#undef RT_PTR_GET_LOCAL
+#undef RT_PTR_ALLOC_IS_VALID
+#undef RT_NODE_4_SEARCH_EQ
+#undef RT_NODE_32_SEARCH_EQ
+#undef RT_NODE_4_GET_INSERTPOS
+#undef RT_NODE_32_GET_INSERTPOS
+#undef RT_CHUNK_CHILDREN_ARRAY_SHIFT
+#undef RT_CHUNK_VALUES_ARRAY_SHIFT
+#undef RT_CHUNK_CHILDREN_ARRAY_DELETE
+#undef RT_CHUNK_VALUES_ARRAY_DELETE
+#undef RT_CHUNK_CHILDREN_ARRAY_COPY
+#undef RT_CHUNK_VALUES_ARRAY_COPY
+#undef RT_NODE_125_IS_CHUNK_USED
+#undef RT_NODE_INNER_125_GET_CHILD
+#undef RT_NODE_LEAF_125_GET_VALUE
+#undef RT_NODE_INNER_256_IS_CHUNK_USED
+#undef RT_NODE_LEAF_256_IS_CHUNK_USED
+#undef RT_NODE_INNER_256_GET_CHILD
+#undef RT_NODE_LEAF_256_GET_VALUE
+#undef RT_NODE_INNER_256_SET
+#undef RT_NODE_LEAF_256_SET
+#undef RT_NODE_INNER_256_DELETE
+#undef RT_NODE_LEAF_256_DELETE
+#undef RT_KEY_GET_SHIFT
+#undef RT_SHIFT_GET_MAX_VAL
+#undef RT_NODE_SEARCH_INNER
+#undef RT_NODE_SEARCH_LEAF
+#undef RT_NODE_UPDATE_INNER
+#undef RT_NODE_DELETE_INNER
+#undef RT_NODE_DELETE_LEAF
+#undef RT_NODE_INSERT_INNER
+#undef RT_NODE_INSERT_LEAF
+#undef RT_NODE_INNER_ITERATE_NEXT
+#undef RT_NODE_LEAF_ITERATE_NEXT
+#undef RT_UPDATE_ITER_STACK
+#undef RT_ITER_UPDATE_KEY
+#undef RT_VERIFY_NODE
+
+#undef RT_DEBUG
diff --git a/src/include/lib/radixtree_delete_impl.h b/src/include/lib/radixtree_delete_impl.h
new file mode 100644
index 0000000000..eb87866b90
--- /dev/null
+++ b/src/include/lib/radixtree_delete_impl.h
@@ -0,0 +1,106 @@
+/* TODO: shrink nodes */
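+
+/*
+ * Template fragment for RT_NODE_DELETE_INNER and RT_NODE_DELETE_LEAF.  It is
+ * #include'd with either RT_NODE_LEVEL_INNER or RT_NODE_LEVEL_LEAF defined,
+ * which selects the concrete node types used below.
+ */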
+
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE4_TYPE RT_NODE_INNER_4
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE4_TYPE RT_NODE_LEAF_4
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+
+#ifdef RT_NODE_LEVEL_LEAF
+ Assert(NODE_IS_LEAF(node));
+#else
+ Assert(!NODE_IS_LEAF(node));
+#endif
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ RT_NODE4_TYPE *n4 = (RT_NODE4_TYPE *) node;
+ int idx = RT_NODE_4_SEARCH_EQ((RT_NODE_BASE_4 *) n4, chunk);
+
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_DELETE(n4->base.chunks, (uint64 *) n4->values,
+ n4->base.n.count, idx);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_DELETE(n4->base.chunks, n4->children,
+ n4->base.n.count, idx);
+#endif
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
+ int idx = RT_NODE_32_SEARCH_EQ((RT_NODE_BASE_32 *) n32, chunk);
+
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_DELETE(n32->base.chunks, (uint64 *) n32->values,
+ n32->base.n.count, idx);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_DELETE(n32->base.chunks, n32->children,
+ n32->base.n.count, idx);
+#endif
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+ int slotpos = n125->base.slot_idxs[chunk];
+ int idx;
+ int bitnum;
+
+ if (slotpos == RT_NODE_125_INVALID_IDX)
+ return false;
+
+ idx = BM_IDX(slotpos);
+ bitnum = BM_BIT(slotpos);
+ n125->base.isset[idx] &= ~((bitmapword) 1 << bitnum);
+ n125->base.slot_idxs[chunk] = RT_NODE_125_INVALID_IDX;
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk))
+#else
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, chunk))
+#endif
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_NODE_LEAF_256_DELETE(n256, chunk);
+#else
+ RT_NODE_INNER_256_DELETE(n256, chunk);
+#endif
+ break;
+ }
+ }
+
+ /* update statistics */
+ node->count--;
+
+ return true;
+
+#undef RT_NODE4_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_insert_impl.h b/src/include/lib/radixtree_insert_impl.h
new file mode 100644
index 0000000000..e4faf54d9d
--- /dev/null
+++ b/src/include/lib/radixtree_insert_impl.h
@@ -0,0 +1,316 @@
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE4_TYPE RT_NODE_INNER_4
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE4_TYPE RT_NODE_LEAF_4
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
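+/*
+ * Template fragment for RT_NODE_INSERT_INNER and RT_NODE_INSERT_LEAF: insert
+ * 'child' (inner) or 'value' (leaf) for the chunk of 'key' at this node's
+ * level, growing the node into a larger kind when it is full.  Returns true
+ * if the chunk already existed, i.e. this was an update.
+ */
+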
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool chunk_exists = false;
+ RT_PTR_LOCAL newnode = NULL;
+ RT_PTR_ALLOC allocnode;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ const bool inner = false;
+ Assert(NODE_IS_LEAF(node));
+#else
+ const bool inner = true;
+ Assert(!NODE_IS_LEAF(node));
+#endif
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ RT_NODE4_TYPE *n4 = (RT_NODE4_TYPE *) node;
+ int idx;
+
+ idx = RT_NODE_4_SEARCH_EQ(&n4->base, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+#ifdef RT_NODE_LEVEL_LEAF
+ n4->values[idx] = value;
+#else
+ n4->children[idx] = child;
+#endif
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n4)))
+ {
+ RT_NODE32_TYPE *new32;
+ const uint8 new_kind = RT_NODE_KIND_32;
+ const RT_SIZE_CLASS new_class = RT_KIND_MIN_SIZE_CLASS[new_kind];
+
+ /* grow node from 4 to 32 */
+ allocnode = RT_ALLOC_NODE(tree, new_class, inner);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(newnode, new_kind, new_class, inner);
+ RT_COPY_NODE(newnode, node);
+ //newnode = RT_GROW_NODE_KIND(tree, node, RT_NODE_KIND_32);
+ new32 = (RT_NODE32_TYPE *) newnode;
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_COPY(n4->base.chunks, n4->values,
+ new32->base.chunks, new32->values);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_COPY(n4->base.chunks, n4->children,
+ new32->base.chunks, new32->children);
+#endif
+ Assert(parent != NULL);
+ RT_REPLACE_NODE(tree, parent, nodep, allocnode, key);
+ node = newnode;
+ }
+ else
+ {
+ int insertpos = RT_NODE_4_GET_INSERTPOS(&n4->base, chunk);
+ int count = n4->base.n.count;
+
+ /* shift chunks and children */
+ if (insertpos < count)
+ {
+ Assert(count > 0);
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_SHIFT(n4->base.chunks, n4->values,
+ count, insertpos);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_SHIFT(n4->base.chunks, n4->children,
+ count, insertpos);
+#endif
+ }
+
+ n4->base.chunks[insertpos] = chunk;
+#ifdef RT_NODE_LEVEL_LEAF
+ n4->values[insertpos] = value;
+#else
+ n4->children[insertpos] = child;
+#endif
+ break;
+ }
+ }
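+			/*
+			 * If we get here, the node was grown into a node of the next
+			 * kind above, so fall through and insert the chunk into the new,
+			 * larger node.  All other paths broke out of the switch already.
+			 */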
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_32:
+ {
+ const RT_SIZE_CLASS_ELEM class32_min = RT_SIZE_CLASS_INFO[RT_CLASS_32_PARTIAL];
+ const RT_SIZE_CLASS_ELEM class32_max = RT_SIZE_CLASS_INFO[RT_CLASS_32_FULL];
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
+ int idx;
+
+ idx = RT_NODE_32_SEARCH_EQ(&n32->base, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+#ifdef RT_NODE_LEVEL_LEAF
+ n32->values[idx] = value;
+#else
+ n32->children[idx] = child;
+#endif
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)) &&
+ n32->base.n.fanout == class32_min.fanout)
+ {
+ /* grow to the next size class of this kind */
+ const RT_SIZE_CLASS new_class = RT_CLASS_32_FULL;
+
+ allocnode = RT_ALLOC_NODE(tree, new_class, inner);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+#ifdef RT_NODE_LEVEL_LEAF
+ memcpy(newnode, node, class32_min.leaf_size);
+#else
+ memcpy(newnode, node, class32_min.inner_size);
+#endif
+ newnode->fanout = class32_max.fanout;
+
+ Assert(parent != NULL);
+ RT_REPLACE_NODE(tree, parent, nodep, allocnode, key);
+ node = newnode;
+
+ /* also update pointer for this kind */
+ n32 = (RT_NODE32_TYPE *) newnode;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)))
+ {
+ RT_NODE125_TYPE *new125;
+ const uint8 new_kind = RT_NODE_KIND_125;
+ const RT_SIZE_CLASS new_class = RT_KIND_MIN_SIZE_CLASS[new_kind];
+
+ Assert(n32->base.n.fanout == class32_max.fanout);
+
+ /* grow node from 32 to 125 */
+ allocnode = RT_ALLOC_NODE(tree, new_class, inner);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(newnode, new_kind, new_class, inner);
+ RT_COPY_NODE(newnode, node);
+ //newnode = RT_GROW_NODE_KIND(tree, node, RT_NODE_KIND_125);
+ new125 = (RT_NODE125_TYPE *) newnode;
+
+ for (int i = 0; i < class32_max.fanout; i++)
+ {
+ new125->base.slot_idxs[n32->base.chunks[i]] = i;
+#ifdef RT_NODE_LEVEL_LEAF
+ new125->values[i] = n32->values[i];
+#else
+ new125->children[i] = n32->children[i];
+#endif
+ }
+
+ Assert(class32_max.fanout <= sizeof(bitmapword) * BITS_PER_BYTE);
+ new125->base.isset[0] = (bitmapword) (((uint64) 1 << class32_max.fanout) - 1);
+
+ Assert(parent != NULL);
+ RT_REPLACE_NODE(tree, parent, nodep, allocnode, key);
+ node = newnode;
+ }
+ else
+ {
+ int insertpos = RT_NODE_32_GET_INSERTPOS(&n32->base, chunk);
+ int count = n32->base.n.count;
+
+ if (insertpos < count)
+ {
+ Assert(count > 0);
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_SHIFT(n32->base.chunks, n32->values,
+ count, insertpos);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_SHIFT(n32->base.chunks, n32->children,
+ count, insertpos);
+#endif
+ }
+
+ n32->base.chunks[insertpos] = chunk;
+#ifdef RT_NODE_LEVEL_LEAF
+ n32->values[insertpos] = value;
+#else
+ n32->children[insertpos] = child;
+#endif
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+ int slotpos = n125->base.slot_idxs[chunk];
+ int cnt = 0;
+
+ if (slotpos != RT_NODE_125_INVALID_IDX)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+#ifdef RT_NODE_LEVEL_LEAF
+ n125->values[slotpos] = value;
+#else
+ n125->children[slotpos] = child;
+#endif
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n125)))
+ {
+ RT_NODE256_TYPE *new256;
+ const uint8 new_kind = RT_NODE_KIND_256;
+ const RT_SIZE_CLASS new_class = RT_KIND_MIN_SIZE_CLASS[new_kind];
+
+ /* grow node from 125 to 256 */
+ allocnode = RT_ALLOC_NODE(tree, new_class, inner);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(newnode, new_kind, new_class, inner);
+ RT_COPY_NODE(newnode, node);
+ //newnode = RT_GROW_NODE_KIND(tree, node, RT_NODE_KIND_256);
+ new256 = (RT_NODE256_TYPE *) newnode;
+ for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n125->base.n.count; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(&n125->base, i))
+ continue;
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_NODE_LEAF_256_SET(new256, i, RT_NODE_LEAF_125_GET_VALUE(n125, i));
+#else
+ RT_NODE_INNER_256_SET(new256, i, RT_NODE_INNER_125_GET_CHILD(n125, i));
+#endif
+ cnt++;
+ }
+
+ Assert(parent != NULL);
+ RT_REPLACE_NODE(tree, parent, nodep, allocnode, key);
+ node = newnode;
+ }
+ else
+ {
+ int idx;
+ bitmapword inverse;
+
+ /* get the first word with at least one bit not set */
+ for (idx = 0; idx < BM_IDX(128); idx++)
+ {
+ if (n125->base.isset[idx] < ~((bitmapword) 0))
+ break;
+ }
+
+ /* To get the first unset bit in X, get the first set bit in ~X */
+ inverse = ~(n125->base.isset[idx]);
+ slotpos = idx * BITS_PER_BITMAPWORD;
+ slotpos += bmw_rightmost_one_pos(inverse);
+ Assert(slotpos < node->fanout);
+
+ /* mark the slot used */
+ n125->base.isset[idx] |= bmw_rightmost_one(inverse);
+ n125->base.slot_idxs[chunk] = slotpos;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ n125->values[slotpos] = value;
+#else
+ n125->children[slotpos] = child;
+#endif
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ chunk_exists = RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk);
+#else
+ chunk_exists = RT_NODE_INNER_256_IS_CHUNK_USED(n256, chunk);
+#endif
+ Assert(chunk_exists || FIXED_NODE_HAS_FREE_SLOT(n256, RT_CLASS_256));
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_NODE_LEAF_256_SET(n256, chunk, value);
+#else
+ RT_NODE_INNER_256_SET(n256, chunk, child);
+#endif
+ break;
+ }
+ }
+
+ /* Update statistics */
+ if (!chunk_exists)
+ node->count++;
+
+ /*
+	 * Done. Finally, verify that the chunk and value are inserted or replaced
+ * properly in the node.
+ */
+ RT_VERIFY_NODE(node);
+
+ return chunk_exists;
+
+#undef RT_NODE4_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_iter_impl.h b/src/include/lib/radixtree_iter_impl.h
new file mode 100644
index 0000000000..0b8b68df6c
--- /dev/null
+++ b/src/include/lib/radixtree_iter_impl.h
@@ -0,0 +1,138 @@
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE4_TYPE RT_NODE_INNER_4
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE4_TYPE RT_NODE_LEAF_4
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
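+/*
+ * Advance node_iter->current_idx to the next used slot in this node.  For
+ * inner nodes the corresponding child is returned (NULL when the node is
+ * exhausted); for leaf nodes the value is stored in *value_p and true/false
+ * is returned instead.
+ */
+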
+ bool found = false;
+ uint8 key_chunk;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ uint64 value;
+
+ Assert(NODE_IS_LEAF(node_iter->node));
+#else
+ RT_PTR_LOCAL child = NULL;
+
+ Assert(!NODE_IS_LEAF(node_iter->node));
+#endif
+
+#ifdef RT_SHMEM
+ Assert(iter->tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+
+ switch (node_iter->node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ RT_NODE4_TYPE *n4 = (RT_NODE4_TYPE *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n4->base.n.count)
+ break;
+#ifdef RT_NODE_LEVEL_LEAF
+ value = n4->values[node_iter->current_idx];
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, n4->children[node_iter->current_idx]);
+#endif
+ key_chunk = n4->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n32->base.n.count)
+ break;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ value = n32->values[node_iter->current_idx];
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, n32->children[node_iter->current_idx]);
+#endif
+ key_chunk = n32->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (RT_NODE_125_IS_CHUNK_USED((RT_NODE_BASE_125 *) n125, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+#ifdef RT_NODE_LEVEL_LEAF
+ value = RT_NODE_LEAF_125_GET_VALUE(n125, i);
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, RT_NODE_INNER_125_GET_CHILD(n125, i));
+#endif
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+#ifdef RT_NODE_LEVEL_LEAF
+ if (RT_NODE_LEAF_256_IS_CHUNK_USED(n256, i))
+#else
+ if (RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
+#endif
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+#ifdef RT_NODE_LEVEL_LEAF
+ value = RT_NODE_LEAF_256_GET_VALUE(n256, i);
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, RT_NODE_INNER_256_GET_CHILD(n256, i));
+#endif
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ }
+
+ if (found)
+ {
+ RT_ITER_UPDATE_KEY(iter, key_chunk, node_iter->node->shift);
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = value;
+#endif
+ }
+
+#ifdef RT_NODE_LEVEL_LEAF
+ return found;
+#else
+ return child;
+#endif
+
+#undef RT_NODE4_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_search_impl.h b/src/include/lib/radixtree_search_impl.h
new file mode 100644
index 0000000000..31e4978e4f
--- /dev/null
+++ b/src/include/lib/radixtree_search_impl.h
@@ -0,0 +1,131 @@
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE4_TYPE RT_NODE_INNER_4
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE4_TYPE RT_NODE_LEAF_4
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
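+/*
+ * This fragment serves both the search and the child-update paths: when
+ * RT_ACTION_UPDATE is defined, the child pointer for the chunk is replaced
+ * with 'new_child' and nothing is returned; otherwise the found value (leaf)
+ * or child (inner) is returned through the output parameter.
+ */
+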
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+
+#ifdef RT_NODE_LEVEL_LEAF
+ uint64 value = 0;
+
+ Assert(NODE_IS_LEAF(node));
+#else
+#ifndef RT_ACTION_UPDATE
+ RT_PTR_ALLOC child = RT_INVALID_PTR_ALLOC;
+#endif
+ Assert(!NODE_IS_LEAF(node));
+#endif
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ RT_NODE4_TYPE *n4 = (RT_NODE4_TYPE *) node;
+ int idx = RT_NODE_4_SEARCH_EQ((RT_NODE_BASE_4 *) n4, chunk);
+
+#ifdef RT_ACTION_UPDATE
+ Assert(idx >= 0);
+ n4->children[idx] = new_child;
+#else
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ value = n4->values[idx];
+#else
+ child = n4->children[idx];
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
+ int idx = RT_NODE_32_SEARCH_EQ((RT_NODE_BASE_32 *) n32, chunk);
+
+#ifdef RT_ACTION_UPDATE
+ Assert(idx >= 0);
+ n32->children[idx] = new_child;
+#else
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ value = n32->values[idx];
+#else
+ child = n32->children[idx];
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+ int slotpos = n125->base.slot_idxs[chunk];
+
+#ifdef RT_ACTION_UPDATE
+ Assert(slotpos != RT_NODE_125_INVALID_IDX);
+ n125->children[slotpos] = new_child;
+#else
+ if (slotpos == RT_NODE_125_INVALID_IDX)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ value = RT_NODE_LEAF_125_GET_VALUE(n125, chunk);
+#else
+ child = RT_NODE_INNER_125_GET_CHILD(n125, chunk);
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
+
+#ifdef RT_ACTION_UPDATE
+ RT_NODE_INNER_256_SET(n256, chunk, new_child);
+#else
+#ifdef RT_NODE_LEVEL_LEAF
+ if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk))
+#else
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, chunk))
+#endif
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ value = RT_NODE_LEAF_256_GET_VALUE(n256, chunk);
+#else
+ child = RT_NODE_INNER_256_GET_CHILD(n256, chunk);
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ }
+
+#ifdef RT_ACTION_UPDATE
+ return;
+#else
+#ifdef RT_NODE_LEVEL_LEAF
+ Assert(value_p != NULL);
+ *value_p = value;
+#else
+ Assert(child_p != NULL);
+ *child_p = child;
+#endif
+
+ return true;
+#endif /* RT_ACTION_UPDATE */
+
+#undef RT_NODE4_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/utils/dsa.h b/src/include/utils/dsa.h
index 104386e674..c67f936880 100644
--- a/src/include/utils/dsa.h
+++ b/src/include/utils/dsa.h
@@ -117,6 +117,7 @@ extern dsa_handle dsa_get_handle(dsa_area *area);
extern dsa_pointer dsa_allocate_extended(dsa_area *area, size_t size, int flags);
extern void dsa_free(dsa_area *area, dsa_pointer dp);
extern void *dsa_get_address(dsa_area *area, dsa_pointer dp);
+extern size_t dsa_get_total_size(dsa_area *area);
extern void dsa_trim(dsa_area *area);
extern void dsa_dump(dsa_area *area);
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index c629cbe383..9659eb85d7 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -28,6 +28,7 @@ SUBDIRS = \
test_pg_db_role_setting \
test_pg_dump \
test_predtest \
+ test_radixtree \
test_rbtree \
test_regex \
test_rls_hooks \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 1baa6b558d..232cbdac80 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -24,6 +24,7 @@ subdir('test_parser')
subdir('test_pg_db_role_setting')
subdir('test_pg_dump')
subdir('test_predtest')
+subdir('test_radixtree')
subdir('test_rbtree')
subdir('test_regex')
subdir('test_rls_hooks')
diff --git a/src/test/modules/test_radixtree/.gitignore b/src/test/modules/test_radixtree/.gitignore
new file mode 100644
index 0000000000..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/test_radixtree/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_radixtree/Makefile b/src/test/modules/test_radixtree/Makefile
new file mode 100644
index 0000000000..da06b93da3
--- /dev/null
+++ b/src/test/modules/test_radixtree/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_radixtree/Makefile
+
+MODULE_big = test_radixtree
+OBJS = \
+ $(WIN32RES) \
+ test_radixtree.o
+PGFILEDESC = "test_radixtree - test code for src/backend/lib/radixtree.c"
+
+EXTENSION = test_radixtree
+DATA = test_radixtree--1.0.sql
+
+REGRESS = test_radixtree
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_radixtree
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_radixtree/README b/src/test/modules/test_radixtree/README
new file mode 100644
index 0000000000..a8b271869a
--- /dev/null
+++ b/src/test/modules/test_radixtree/README
@@ -0,0 +1,7 @@
+test_radixtree contains unit tests for the radix tree implementation in
+src/include/lib/radixtree.h.
+
+The tests verify the correctness of the implementation, but they can also be
+used as a micro-benchmark. If you set the 'rt_test_stats' flag in
+test_radixtree.c, the tests will print extra information about execution time
+and memory usage.
diff --git a/src/test/modules/test_radixtree/expected/test_radixtree.out b/src/test/modules/test_radixtree/expected/test_radixtree.out
new file mode 100644
index 0000000000..ce645cb8b5
--- /dev/null
+++ b/src/test/modules/test_radixtree/expected/test_radixtree.out
@@ -0,0 +1,36 @@
+CREATE EXTENSION test_radixtree;
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
+NOTICE: testing basic operations with leaf node 4
+NOTICE: testing basic operations with inner node 4
+NOTICE: testing basic operations with leaf node 32
+NOTICE: testing basic operations with inner node 32
+NOTICE: testing basic operations with leaf node 125
+NOTICE: testing basic operations with inner node 125
+NOTICE: testing basic operations with leaf node 256
+NOTICE: testing basic operations with inner node 256
+NOTICE: testing radix tree node types with shift "0"
+NOTICE: testing radix tree node types with shift "8"
+NOTICE: testing radix tree node types with shift "16"
+NOTICE: testing radix tree node types with shift "24"
+NOTICE: testing radix tree node types with shift "32"
+NOTICE: testing radix tree node types with shift "40"
+NOTICE: testing radix tree node types with shift "48"
+NOTICE: testing radix tree node types with shift "56"
+NOTICE: testing radix tree with pattern "all ones"
+NOTICE: testing radix tree with pattern "alternating bits"
+NOTICE: testing radix tree with pattern "clusters of ten"
+NOTICE: testing radix tree with pattern "clusters of hundred"
+NOTICE: testing radix tree with pattern "one-every-64k"
+NOTICE: testing radix tree with pattern "sparse"
+NOTICE: testing radix tree with pattern "single values, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^60"
+ test_radixtree
+----------------
+
+(1 row)
+
diff --git a/src/test/modules/test_radixtree/meson.build b/src/test/modules/test_radixtree/meson.build
new file mode 100644
index 0000000000..6add06bbdb
--- /dev/null
+++ b/src/test/modules/test_radixtree/meson.build
@@ -0,0 +1,35 @@
+# FIXME: prevent install during main install, but not during test :/
+
+test_radixtree_sources = files(
+ 'test_radixtree.c',
+)
+
+if host_system == 'windows'
+ test_radixtree_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'test_radixtree',
+    '--FILEDESC', 'test_radixtree - test code for src/include/lib/radixtree.h',])
+endif
+
+test_radixtree = shared_module('test_radixtree',
+ test_radixtree_sources,
+ link_with: pgport_srv,
+ kwargs: pg_mod_args,
+)
+testprep_targets += test_radixtree
+
+install_data(
+ 'test_radixtree.control',
+ 'test_radixtree--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'test_radixtree',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'test_radixtree',
+ ],
+ },
+}
diff --git a/src/test/modules/test_radixtree/sql/test_radixtree.sql b/src/test/modules/test_radixtree/sql/test_radixtree.sql
new file mode 100644
index 0000000000..41ece5e9f5
--- /dev/null
+++ b/src/test/modules/test_radixtree/sql/test_radixtree.sql
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_radixtree;
+
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
diff --git a/src/test/modules/test_radixtree/test_radixtree--1.0.sql b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
new file mode 100644
index 0000000000..074a5a7ea7
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_radixtree/test_radixtree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_radixtree" to load this file. \quit
+
+CREATE FUNCTION test_radixtree()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
new file mode 100644
index 0000000000..d8323f587f
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -0,0 +1,653 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_radixtree.c
+ * Test radixtree set data structure.
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/test/modules/test_radixtree/test_radixtree.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "miscadmin.h"
+#include "nodes/bitmapset.h"
+#include "storage/block.h"
+#include "storage/itemptr.h"
+#include "storage/lwlock.h"
+#include "utils/memutils.h"
+#include "utils/timestamp.h"
+
+#define UINT64_HEX_FORMAT "%" INT64_MODIFIER "X"
+
+/*
+ * If you enable this, the "pattern" tests will print information about
+ * how long populating, probing, and iterating the test set takes, and
+ * how much memory the test set consumed. That can be used as
+ * micro-benchmark of various operations and input patterns (you might
+ * want to increase the number of values used in each of the test, if
+ * you do that, to reduce noise).
+ *
+ * The information is printed to the server's stderr, mostly because
+ * that's where MemoryContextStats() output goes.
+ */
+static const bool rt_test_stats = false;
+
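+/*
+ * Fanout of each node kind.  The leading 0 acts as a lower-bound sentinel
+ * for the range checks in test_node_types_insert().
+ */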
+static int rt_node_kind_fanouts[] = {
+ 0,
+ 4, /* RT_NODE_KIND_4 */
+ 32, /* RT_NODE_KIND_32 */
+ 125, /* RT_NODE_KIND_125 */
+ 256 /* RT_NODE_KIND_256 */
+};
+/*
+ * A struct to define a pattern of integers, for use with the test_pattern()
+ * function.
+ */
+typedef struct
+{
+ char *test_name; /* short name of the test, for humans */
+ char *pattern_str; /* a bit pattern */
+ uint64 spacing; /* pattern repeats at this interval */
+ uint64 num_values; /* number of integers to set in total */
+} test_spec;
+
+/* Test patterns borrowed from test_integerset.c */
+static const test_spec test_specs[] = {
+ {
+ "all ones", "1111111111",
+ 10, 1000000
+ },
+ {
+ "alternating bits", "0101010101",
+ 10, 1000000
+ },
+ {
+ "clusters of ten", "1111111111",
+ 10000, 1000000
+ },
+ {
+ "clusters of hundred",
+ "1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111",
+ 10000, 1000000
+ },
+ {
+ "one-every-64k", "1",
+ 65536, 1000000
+ },
+ {
+ "sparse", "100000000000000000000000000000001",
+ 10000000, 1000000
+ },
+ {
+ "single values, distance > 2^32", "1",
+ UINT64CONST(10000000000), 100000
+ },
+ {
+ "clusters, distance > 2^32", "10101010",
+ UINT64CONST(10000000000), 1000000
+ },
+ {
+ "clusters, distance > 2^60", "10101010",
+ UINT64CONST(2000000000000000000),
+ 23 /* can't be much higher than this, or we
+ * overflow uint64 */
+ }
+};
+
+/* define the radix tree implementation to test */
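+/*
+ * RT_PREFIX is prepended to the exposed names (rt_create, rt_set, rt_search,
+ * rt_iterate_next, ...), RT_SCOPE sets their scope, RT_DECLARE and RT_DEFINE
+ * request the declarations and the definitions respectively, and
+ * RT_USE_DELETE makes rt_delete available.
+ */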
+#define RT_PREFIX rt
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#define RT_USE_DELETE
+// WIP: compiles with warnings because rt_attach is defined but not used
+// #define RT_SHMEM
+#include "lib/radixtree.h"
+
+
+/*
+ * Return the number of keys in the radix tree.
+ */
+static uint64
+rt_num_entries(rt_radix_tree *tree)
+{
+ return tree->ctl->num_keys;
+}
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(test_radixtree);
+
+static void
+test_empty(void)
+{
+ rt_radix_tree *radixtree;
+ rt_iter *iter;
+ uint64 dummy;
+ uint64 key;
+ uint64 val;
+
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+ dsa = dsa_create(tranche_id);
+
+ radixtree = rt_create(CurrentMemoryContext, dsa);
+#else
+ radixtree = rt_create(CurrentMemoryContext);
+#endif
+
+ if (rt_search(radixtree, 0, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, 1, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, PG_UINT64_MAX, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_delete(radixtree, 0))
+ elog(ERROR, "rt_delete on empty tree returned true");
+
+ if (rt_num_entries(radixtree) != 0)
+		elog(ERROR, "rt_num_entries on empty tree returned non-zero");
+
+ iter = rt_begin_iterate(radixtree);
+
+ if (rt_iterate_next(iter, &key, &val))
+		elog(ERROR, "rt_iterate_next on empty tree returned true");
+
+ rt_end_iterate(iter);
+
+ rt_free(radixtree);
+}
+
+static void
+test_basic(int children, bool test_inner)
+{
+ rt_radix_tree *radixtree;
+ uint64 *keys;
+ int shift = test_inner ? 8 : 0;
+
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+ dsa = dsa_create(tranche_id);
+#endif
+
+ elog(NOTICE, "testing basic operations with %s node %d",
+ test_inner ? "inner" : "leaf", children);
+
+#ifdef RT_SHMEM
+ radixtree = rt_create(CurrentMemoryContext, dsa);
+#else
+ radixtree = rt_create(CurrentMemoryContext);
+#endif
+
+	/* prepare keys in interleaved order like 1, children, 2, children - 1, 3, ... */
+ keys = palloc(sizeof(uint64) * children);
+ for (int i = 0; i < children; i++)
+ {
+ if (i % 2 == 0)
+ keys[i] = (uint64) ((i / 2) + 1) << shift;
+ else
+ keys[i] = (uint64) (children - (i / 2)) << shift;
+ }
+
+ /* insert keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (rt_set(radixtree, keys[i], keys[i]))
+			elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " is found", keys[i]);
+ }
+
+ /* look up keys */
+ for (int i = 0; i < children; i++)
+ {
+ uint64 value;
+
+ if (!rt_search(radixtree, keys[i], &value))
+ elog(ERROR, "could not find key 0x" UINT64_HEX_FORMAT, keys[i]);
+ if (value != keys[i])
+			elog(ERROR, "rt_search returned 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ value, keys[i]);
+ }
+
+ /* update keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (!rt_set(radixtree, keys[i], keys[i] + 1))
+ elog(ERROR, "could not update key 0x" UINT64_HEX_FORMAT, keys[i]);
+ }
+
+ /* repeat deleting and inserting keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (!rt_delete(radixtree, keys[i]))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, keys[i]);
+ if (rt_set(radixtree, keys[i], keys[i]))
+			elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " is found", keys[i]);
+ }
+
+ pfree(keys);
+ rt_free(radixtree);
+}
+
+/*
+ * Check if keys from start to end with the shift exist in the tree.
+ */
+static void
+check_search_on_node(rt_radix_tree *radixtree, uint8 shift, int start, int end,
+ int incr)
+{
+ for (int i = start; i < end; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ uint64 val;
+
+ if (!rt_search(radixtree, key, &val))
+ elog(ERROR, "key 0x" UINT64_HEX_FORMAT " is not found on node-%d",
+ key, end);
+ if (val != key)
+ elog(ERROR, "rt_search with key 0x" UINT64_HEX_FORMAT " returns 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ key, val, key);
+ }
+}
+
+static void
+test_node_types_insert(rt_radix_tree *radixtree, uint8 shift, bool insert_asc)
+{
+ uint64 num_entries;
+ int ninserted = 0;
+ int start = insert_asc ? 0 : 256;
+ int incr = insert_asc ? 1 : -1;
+ int end = insert_asc ? 256 : 0;
+ int node_kind_idx = 1;
+
+ for (int i = start; i != end; i += incr)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_set(radixtree, key, key);
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " is found", key);
+
+ /*
+ * After filling all slots in each node type, check if the values
+ * are stored properly.
+ */
+ if (ninserted == rt_node_kind_fanouts[node_kind_idx] - 1)
+ {
+ int check_start = insert_asc
+ ? rt_node_kind_fanouts[node_kind_idx - 1]
+ : rt_node_kind_fanouts[node_kind_idx];
+ int check_end = insert_asc
+ ? rt_node_kind_fanouts[node_kind_idx]
+ : rt_node_kind_fanouts[node_kind_idx - 1];
+
+ check_search_on_node(radixtree, shift, check_start, check_end, incr);
+ node_kind_idx++;
+ }
+
+ ninserted++;
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ if (num_entries != 256)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(256));
+}
+
+static void
+test_node_types_delete(rt_radix_tree *radixtree, uint8 shift)
+{
+ uint64 num_entries;
+
+ for (int i = 0; i < 256; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_delete(radixtree, key);
+
+ if (!found)
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, key);
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ /* The tree must be empty */
+ if (num_entries != 0)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+			 num_entries, UINT64CONST(0));
+}
+
+/*
+ * Test inserting and deleting key-value pairs for each node type at the given
+ * shift level.
+ */
+static void
+test_node_types(uint8 shift)
+{
+ rt_radix_tree *radixtree;
+
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+ dsa = dsa_create(tranche_id);
+#endif
+
+ elog(NOTICE, "testing radix tree node types with shift \"%d\"", shift);
+
+#ifdef RT_SHMEM
+ radixtree = rt_create(CurrentMemoryContext, dsa);
+#else
+ radixtree = rt_create(CurrentMemoryContext);
+#endif
+
+ /*
+ * Insert and search entries for every node type at the 'shift' level,
+ * then delete all entries to make it empty, and insert and search entries
+ * again.
+ */
+ test_node_types_insert(radixtree, shift, true);
+ test_node_types_delete(radixtree, shift);
+ test_node_types_insert(radixtree, shift, false);
+
+ rt_free(radixtree);
+}
+
+/*
+ * Test with a repeating pattern, defined by the 'spec'.
+ */
+static void
+test_pattern(const test_spec * spec)
+{
+ rt_radix_tree *radixtree;
+ rt_iter *iter;
+ MemoryContext radixtree_ctx;
+ TimestampTz starttime;
+ TimestampTz endtime;
+ uint64 n;
+ uint64 last_int;
+ uint64 ndeleted;
+ uint64 nbefore;
+ uint64 nafter;
+ int patternlen;
+ uint64 *pattern_values;
+ uint64 pattern_num_values;
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+ dsa = dsa_create(tranche_id);
+#endif
+
+ elog(NOTICE, "testing radix tree with pattern \"%s\"", spec->test_name);
+ if (rt_test_stats)
+ fprintf(stderr, "-----\ntesting radix tree with pattern \"%s\"\n", spec->test_name);
+
+ /* Pre-process the pattern, creating an array of integers from it. */
+ patternlen = strlen(spec->pattern_str);
+ pattern_values = palloc(patternlen * sizeof(uint64));
+ pattern_num_values = 0;
+ for (int i = 0; i < patternlen; i++)
+ {
+ if (spec->pattern_str[i] == '1')
+ pattern_values[pattern_num_values++] = i;
+ }
+
+ /*
+ * Allocate the radix tree.
+ *
+ * Allocate it in a separate memory context, so that we can print its
+ * memory usage easily.
+ */
+ radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
+ "radixtree test",
+ ALLOCSET_SMALL_SIZES);
+ MemoryContextSetIdentifier(radixtree_ctx, spec->test_name);
+
+#ifdef RT_SHMEM
+ radixtree = rt_create(radixtree_ctx, dsa);
+#else
+ radixtree = rt_create(radixtree_ctx);
+#endif
+
+
+ /*
+ * Add values to the set.
+ */
+ starttime = GetCurrentTimestamp();
+
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ uint64 x = 0;
+
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ bool found;
+
+ x = last_int + pattern_values[i];
+
+ found = rt_set(radixtree, x, x);
+
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " found", x);
+
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+
+ endtime = GetCurrentTimestamp();
+
+ if (rt_test_stats)
+ fprintf(stderr, "added " UINT64_FORMAT " values in %d ms\n",
+ spec->num_values, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Print stats on the amount of memory used.
+ *
+ * We print the usage reported by rt_memory_usage(), as well as the stats
+ * from the memory context. They should be in the same ballpark, but it's
+ * hard to automate testing that, so if you're making changes to the
+ * implementation, just observe that manually.
+ */
+ if (rt_test_stats)
+ {
+ uint64 mem_usage;
+
+ /*
+ * Also print memory usage as reported by rt_memory_usage(). It
+ * should be in the same ballpark as the usage reported by
+ * MemoryContextStats().
+ */
+ mem_usage = rt_memory_usage(radixtree);
+ fprintf(stderr, "rt_memory_usage() reported " UINT64_FORMAT " (%0.2f bytes / integer)\n",
+ mem_usage, (double) mem_usage / spec->num_values);
+
+ MemoryContextStats(radixtree_ctx);
+ }
+
+ /* Check that rt_num_entries works */
+ n = rt_num_entries(radixtree);
+ if (n != spec->num_values)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT, n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_search()
+ */
+ starttime = GetCurrentTimestamp();
+
+ for (n = 0; n < 100000; n++)
+ {
+ bool found;
+ bool expected;
+ uint64 x;
+ uint64 v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Do we expect this value to be present in the set? */
+ if (x >= last_int)
+ expected = false;
+ else
+ {
+ uint64 idx = x % spec->spacing;
+
+ if (idx >= patternlen)
+ expected = false;
+ else if (spec->pattern_str[idx] == '1')
+ expected = true;
+ else
+ expected = false;
+ }
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (found != expected)
+ elog(ERROR, "mismatch at 0x" UINT64_HEX_FORMAT ": %d vs %d", x, found, expected);
+ if (found && (v != x))
+ elog(ERROR, "found 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ v, x);
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "probed " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Test iterator
+ */
+ starttime = GetCurrentTimestamp();
+
+ iter = rt_begin_iterate(radixtree);
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ uint64 expected = last_int + pattern_values[i];
+ uint64 x;
+ uint64 val;
+
+ if (!rt_iterate_next(iter, &x, &val))
+ break;
+
+ if (x != expected)
+ elog(ERROR,
+ "iterate returned wrong key; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d",
+ x, expected, i);
+ if (val != expected)
+ elog(ERROR,
+ "iterate returned wrong value; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d", x, expected, i);
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "iterated " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ rt_end_iterate(iter);
+
+ if (n < spec->num_values)
+ elog(ERROR, "iterator stopped short after " UINT64_FORMAT " entries, expected " UINT64_FORMAT, n, spec->num_values);
+ if (n > spec->num_values)
+ elog(ERROR, "iterator returned " UINT64_FORMAT " entries, " UINT64_FORMAT " was expected", n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_delete()
+ */
+ starttime = GetCurrentTimestamp();
+
+ nbefore = rt_num_entries(radixtree);
+ ndeleted = 0;
+ for (n = 0; n < 1; n++)
+ {
+ bool found;
+ uint64 x;
+ uint64 v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (!found)
+ continue;
+
+ /* If the key is found, delete it and check again */
+ if (!rt_delete(radixtree, x))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_search(radixtree, x, &v))
+ elog(ERROR, "found deleted key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_delete(radixtree, x))
+ elog(ERROR, "deleted already-deleted key 0x" UINT64_HEX_FORMAT, x);
+
+ ndeleted++;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "deleted " UINT64_FORMAT " values in %d ms\n",
+ ndeleted, (int) (endtime - starttime) / 1000);
+
+ nafter = rt_num_entries(radixtree);
+
+ /* Check that rt_num_entries works */
+ if ((nbefore - ndeleted) != nafter)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT " after " UINT64_FORMAT " deletions",
+ nafter, (nbefore - ndeleted), ndeleted);
+
+ rt_free(radixtree);
+ MemoryContextDelete(radixtree_ctx);
+}
+
+Datum
+test_radixtree(PG_FUNCTION_ARGS)
+{
+ test_empty();
+
+ for (int i = 1; i < lengthof(rt_node_kind_fanouts); i++)
+ {
+ test_basic(rt_node_kind_fanouts[i], false);
+ test_basic(rt_node_kind_fanouts[i], true);
+ }
+
+ for (int shift = 0; shift <= (64 - 8); shift += 8)
+ test_node_types(shift);
+
+ /* Test different test patterns, with lots of entries */
+ for (int i = 0; i < lengthof(test_specs); i++)
+ test_pattern(&test_specs[i]);
+
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_radixtree/test_radixtree.control b/src/test/modules/test_radixtree/test_radixtree.control
new file mode 100644
index 0000000000..e53f2a3e0c
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.control
@@ -0,0 +1,4 @@
+comment = 'Test code for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/test_radixtree'
+relocatable = true
diff --git a/src/tools/pginclude/cpluspluscheck b/src/tools/pginclude/cpluspluscheck
index e52fe9f509..09fa6e7432 100755
--- a/src/tools/pginclude/cpluspluscheck
+++ b/src/tools/pginclude/cpluspluscheck
@@ -101,6 +101,12 @@ do
test "$f" = src/include/nodes/nodetags.h && continue
test "$f" = src/backend/nodes/nodetags.h && continue
+ # radixtree_*_impl.h cannot be included standalone: they are just code fragments.
+ test "$f" = src/include/lib/radixtree_delete_impl.h && continue
+ test "$f" = src/include/lib/radixtree_insert_impl.h && continue
+ test "$f" = src/include/lib/radixtree_iter_impl.h && continue
+ test "$f" = src/include/lib/radixtree_search_impl.h && continue
+
# These files are not meant to be included standalone, because
# they contain lists that might have multiple use-cases.
test "$f" = src/include/access/rmgrlist.h && continue
diff --git a/src/tools/pginclude/headerscheck b/src/tools/pginclude/headerscheck
index abbba7aa63..d4d2f1da03 100755
--- a/src/tools/pginclude/headerscheck
+++ b/src/tools/pginclude/headerscheck
@@ -96,6 +96,12 @@ do
test "$f" = src/include/nodes/nodetags.h && continue
test "$f" = src/backend/nodes/nodetags.h && continue
+ # radixtree_*_impl.h cannot be included standalone: they are just code fragments.
+ test "$f" = src/include/lib/radixtree_delete_impl.h && continue
+ test "$f" = src/include/lib/radixtree_insert_impl.h && continue
+ test "$f" = src/include/lib/radixtree_iter_impl.h && continue
+ test "$f" = src/include/lib/radixtree_search_impl.h && continue
+
# These files are not meant to be included standalone, because
# they contain lists that might have multiple use-cases.
test "$f" = src/include/access/rmgrlist.h && continue
--
2.39.0
Attachment: v22-0007-Make-value-type-configurable.patch (text/x-patch)
From b0cc522b623c126c97b65376b1e7a071cb69f1c6 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Fri, 20 Jan 2023 14:19:15 +0700
Subject: [PATCH v22 07/22] Make value type configurable
Tests pass with uint32, although the test module builds
with warnings.
---
src/include/lib/radixtree.h | 79 ++++++++++---------
src/include/lib/radixtree_delete_impl.h | 4 +-
src/include/lib/radixtree_iter_impl.h | 2 +-
src/include/lib/radixtree_search_impl.h | 2 +-
.../modules/test_radixtree/test_radixtree.c | 41 ++++++----
5 files changed, 69 insertions(+), 59 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 98e4597eac..0a39bd6664 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -44,6 +44,7 @@
* declarations reside
* - RT_SHMEM - if defined, the radix tree is created in the DSA area
* so that multiple processes can access it simultaneously.
+ * - RT_VALUE_TYPE - the type of the value.
*
* Optional parameters:
* - RT_DEBUG - if defined add stats tracking and debugging functions
@@ -222,14 +223,14 @@ RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx);
#endif
RT_SCOPE void RT_FREE(RT_RADIX_TREE *tree);
-RT_SCOPE bool RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, uint64 *val_p);
-RT_SCOPE bool RT_SET(RT_RADIX_TREE *tree, uint64 key, uint64 val);
+RT_SCOPE bool RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *val_p);
+RT_SCOPE bool RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE val);
#ifdef RT_USE_DELETE
RT_SCOPE bool RT_DELETE(RT_RADIX_TREE *tree, uint64 key);
#endif
RT_SCOPE RT_ITER * RT_BEGIN_ITERATE(RT_RADIX_TREE *tree);
-RT_SCOPE bool RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, uint64 *value_p);
+RT_SCOPE bool RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, RT_VALUE_TYPE *value_p);
RT_SCOPE void RT_END_ITERATE(RT_ITER *iter);
RT_SCOPE uint64 RT_MEMORY_USAGE(RT_RADIX_TREE *tree);
@@ -435,7 +436,7 @@ typedef struct RT_NODE_LEAF_4
RT_NODE_BASE_4 base;
/* number of values depends on size class */
- uint64 values[FLEXIBLE_ARRAY_MEMBER];
+ RT_VALUE_TYPE values[FLEXIBLE_ARRAY_MEMBER];
} RT_NODE_LEAF_4;
typedef struct RT_NODE_INNER_32
@@ -451,7 +452,7 @@ typedef struct RT_NODE_LEAF_32
RT_NODE_BASE_32 base;
/* number of values depends on size class */
- uint64 values[FLEXIBLE_ARRAY_MEMBER];
+ RT_VALUE_TYPE values[FLEXIBLE_ARRAY_MEMBER];
} RT_NODE_LEAF_32;
typedef struct RT_NODE_INNER_125
@@ -467,7 +468,7 @@ typedef struct RT_NODE_LEAF_125
RT_NODE_BASE_125 base;
/* number of values depends on size class */
- uint64 values[FLEXIBLE_ARRAY_MEMBER];
+ RT_VALUE_TYPE values[FLEXIBLE_ARRAY_MEMBER];
} RT_NODE_LEAF_125;
/*
@@ -490,7 +491,7 @@ typedef struct RT_NODE_LEAF_256
bitmapword isset[BM_IDX(RT_NODE_MAX_SLOTS)];
/* Slots for 256 values */
- uint64 values[RT_NODE_MAX_SLOTS];
+ RT_VALUE_TYPE values[RT_NODE_MAX_SLOTS];
} RT_NODE_LEAF_256;
/* Information for each size class */
@@ -520,33 +521,33 @@ static const RT_SIZE_CLASS_ELEM RT_SIZE_CLASS_INFO[] = {
.name = "radix tree node 4",
.fanout = 4,
.inner_size = sizeof(RT_NODE_INNER_4) + 4 * sizeof(RT_PTR_ALLOC),
- .leaf_size = sizeof(RT_NODE_LEAF_4) + 4 * sizeof(uint64),
+ .leaf_size = sizeof(RT_NODE_LEAF_4) + 4 * sizeof(RT_VALUE_TYPE),
.inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_4) + 4 * sizeof(RT_PTR_ALLOC)),
- .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_4) + 4 * sizeof(uint64)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_4) + 4 * sizeof(RT_VALUE_TYPE)),
},
[RT_CLASS_32_PARTIAL] = {
.name = "radix tree node 15",
.fanout = 15,
.inner_size = sizeof(RT_NODE_INNER_32) + 15 * sizeof(RT_PTR_ALLOC),
- .leaf_size = sizeof(RT_NODE_LEAF_32) + 15 * sizeof(uint64),
+ .leaf_size = sizeof(RT_NODE_LEAF_32) + 15 * sizeof(RT_VALUE_TYPE),
.inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_32) + 15 * sizeof(RT_PTR_ALLOC)),
- .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_32) + 15 * sizeof(uint64)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_32) + 15 * sizeof(RT_VALUE_TYPE)),
},
[RT_CLASS_32_FULL] = {
.name = "radix tree node 32",
.fanout = 32,
.inner_size = sizeof(RT_NODE_INNER_32) + 32 * sizeof(RT_PTR_ALLOC),
- .leaf_size = sizeof(RT_NODE_LEAF_32) + 32 * sizeof(uint64),
+ .leaf_size = sizeof(RT_NODE_LEAF_32) + 32 * sizeof(RT_VALUE_TYPE),
.inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_32) + 32 * sizeof(RT_PTR_ALLOC)),
- .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_32) + 32 * sizeof(uint64)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_32) + 32 * sizeof(RT_VALUE_TYPE)),
},
[RT_CLASS_125_FULL] = {
.name = "radix tree node 125",
.fanout = 125,
.inner_size = sizeof(RT_NODE_INNER_125) + 125 * sizeof(RT_PTR_ALLOC),
- .leaf_size = sizeof(RT_NODE_LEAF_125) + 125 * sizeof(uint64),
+ .leaf_size = sizeof(RT_NODE_LEAF_125) + 125 * sizeof(RT_VALUE_TYPE),
.inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_125) + 125 * sizeof(RT_PTR_ALLOC)),
- .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_125) + 125 * sizeof(uint64)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_125) + 125 * sizeof(RT_VALUE_TYPE)),
},
[RT_CLASS_256] = {
.name = "radix tree node 256",
@@ -648,7 +649,7 @@ typedef struct RT_ITER
static bool RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
uint64 key, RT_PTR_ALLOC child);
static bool RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
- uint64 key, uint64 value);
+ uint64 key, RT_VALUE_TYPE value);
/* verification (available only with assertion) */
static void RT_VERIFY_NODE(RT_PTR_LOCAL node);
@@ -828,10 +829,10 @@ RT_CHUNK_CHILDREN_ARRAY_SHIFT(uint8 *chunks, RT_PTR_ALLOC *children, int count,
}
static inline void
-RT_CHUNK_VALUES_ARRAY_SHIFT(uint8 *chunks, uint64 *values, int count, int idx)
+RT_CHUNK_VALUES_ARRAY_SHIFT(uint8 *chunks, RT_VALUE_TYPE *values, int count, int idx)
{
memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
- memmove(&(values[idx + 1]), &(values[idx]), sizeof(uint64) * (count - idx));
+ memmove(&(values[idx + 1]), &(values[idx]), sizeof(RT_VALUE_TYPE) * (count - idx));
}
/* Delete the element at 'idx' */
@@ -843,10 +844,10 @@ RT_CHUNK_CHILDREN_ARRAY_DELETE(uint8 *chunks, RT_PTR_ALLOC *children, int count,
}
static inline void
-RT_CHUNK_VALUES_ARRAY_DELETE(uint8 *chunks, uint64 *values, int count, int idx)
+RT_CHUNK_VALUES_ARRAY_DELETE(uint8 *chunks, RT_VALUE_TYPE *values, int count, int idx)
{
memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
- memmove(&(values[idx]), &(values[idx + 1]), sizeof(uint64) * (count - idx - 1));
+ memmove(&(values[idx]), &(values[idx + 1]), sizeof(RT_VALUE_TYPE) * (count - idx - 1));
}
/* Copy both chunks and children/values arrays */
@@ -863,12 +864,12 @@ RT_CHUNK_CHILDREN_ARRAY_COPY(uint8 *src_chunks, RT_PTR_ALLOC *src_children,
}
static inline void
-RT_CHUNK_VALUES_ARRAY_COPY(uint8 *src_chunks, uint64 *src_values,
- uint8 *dst_chunks, uint64 *dst_values)
+RT_CHUNK_VALUES_ARRAY_COPY(uint8 *src_chunks, RT_VALUE_TYPE *src_values,
+ uint8 *dst_chunks, RT_VALUE_TYPE *dst_values)
{
const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_4_FULL].fanout;
const Size chunk_size = sizeof(uint8) * fanout;
- const Size values_size = sizeof(uint64) * fanout;
+ const Size values_size = sizeof(RT_VALUE_TYPE) * fanout;
memcpy(dst_chunks, src_chunks, chunk_size);
memcpy(dst_values, src_values, values_size);
@@ -890,7 +891,7 @@ RT_NODE_INNER_125_GET_CHILD(RT_NODE_INNER_125 *node, uint8 chunk)
return node->children[node->base.slot_idxs[chunk]];
}
-static inline uint64
+static inline RT_VALUE_TYPE
RT_NODE_LEAF_125_GET_VALUE(RT_NODE_LEAF_125 *node, uint8 chunk)
{
Assert(NODE_IS_LEAF(node));
@@ -926,7 +927,7 @@ RT_NODE_INNER_256_GET_CHILD(RT_NODE_INNER_256 *node, uint8 chunk)
return node->children[chunk];
}
-static inline uint64
+static inline RT_VALUE_TYPE
RT_NODE_LEAF_256_GET_VALUE(RT_NODE_LEAF_256 *node, uint8 chunk)
{
Assert(NODE_IS_LEAF(node));
@@ -944,7 +945,7 @@ RT_NODE_INNER_256_SET(RT_NODE_INNER_256 *node, uint8 chunk, RT_PTR_ALLOC child)
/* Set the value in the node-256 */
static inline void
-RT_NODE_LEAF_256_SET(RT_NODE_LEAF_256 *node, uint8 chunk, uint64 value)
+RT_NODE_LEAF_256_SET(RT_NODE_LEAF_256 *node, uint8 chunk, RT_VALUE_TYPE value)
{
int idx = BM_IDX(chunk);
int bitnum = BM_BIT(chunk);
@@ -1215,7 +1216,7 @@ RT_EXTEND(RT_RADIX_TREE *tree, uint64 key)
* Insert inner and leaf nodes from 'node' to bottom.
*/
static inline void
-RT_SET_EXTEND(RT_RADIX_TREE *tree, uint64 key, uint64 value, RT_PTR_LOCAL parent,
+RT_SET_EXTEND(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE value, RT_PTR_LOCAL parent,
RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node)
{
int shift = node->shift;
@@ -1266,7 +1267,7 @@ RT_NODE_SEARCH_INNER(RT_PTR_LOCAL node, uint64 key, RT_PTR_ALLOC *child_p)
* to the value is set to value_p.
*/
static inline bool
-RT_NODE_SEARCH_LEAF(RT_PTR_LOCAL node, uint64 key, uint64 *value_p)
+RT_NODE_SEARCH_LEAF(RT_PTR_LOCAL node, uint64 key, RT_VALUE_TYPE *value_p)
{
#define RT_NODE_LEVEL_LEAF
#include "lib/radixtree_search_impl.h"
@@ -1320,7 +1321,7 @@ RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stor
/* Like, RT_NODE_INSERT_INNER, but for leaf nodes */
static bool
RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
- uint64 key, uint64 value)
+ uint64 key, RT_VALUE_TYPE value)
{
#define RT_NODE_LEVEL_LEAF
#include "lib/radixtree_insert_impl.h"
@@ -1522,7 +1523,7 @@ RT_FREE(RT_RADIX_TREE *tree)
* and return true. Returns false if entry doesn't yet exist.
*/
RT_SCOPE bool
-RT_SET(RT_RADIX_TREE *tree, uint64 key, uint64 value)
+RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE value)
{
int shift;
bool updated;
@@ -1582,7 +1583,7 @@ RT_SET(RT_RADIX_TREE *tree, uint64 key, uint64 value)
* not be NULL.
*/
RT_SCOPE bool
-RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, uint64 *value_p)
+RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p)
{
RT_PTR_LOCAL node;
int shift;
@@ -1730,7 +1731,7 @@ RT_NODE_INNER_ITERATE_NEXT(RT_ITER *iter, RT_NODE_ITER *node_iter)
*/
static inline bool
RT_NODE_LEAF_ITERATE_NEXT(RT_ITER *iter, RT_NODE_ITER *node_iter,
- uint64 *value_p)
+ RT_VALUE_TYPE *value_p)
{
#define RT_NODE_LEVEL_LEAF
#include "lib/radixtree_iter_impl.h"
@@ -1803,7 +1804,7 @@ RT_BEGIN_ITERATE(RT_RADIX_TREE *tree)
* return false.
*/
RT_SCOPE bool
-RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, uint64 *value_p)
+RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, RT_VALUE_TYPE *value_p)
{
/* Empty tree */
if (!iter->tree->ctl->root)
@@ -1812,7 +1813,7 @@ RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, uint64 *value_p)
for (;;)
{
RT_PTR_LOCAL child = NULL;
- uint64 value;
+ RT_VALUE_TYPE value;
int level;
bool found;
@@ -1971,6 +1972,7 @@ RT_STATS(RT_RADIX_TREE *tree)
tree->ctl->cnt[RT_CLASS_256])));
}
+/* XXX For display, assumes value type is numeric */
static void
RT_DUMP_NODE(RT_PTR_LOCAL node, int level, bool recurse)
{
@@ -1998,7 +2000,7 @@ RT_DUMP_NODE(RT_PTR_LOCAL node, int level, bool recurse)
RT_NODE_LEAF_4 *n4 = (RT_NODE_LEAF_4 *) node;
fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
- space, n4->base.chunks[i], n4->values[i]);
+ space, n4->base.chunks[i], (uint64) n4->values[i]);
}
else
{
@@ -2024,7 +2026,7 @@ RT_DUMP_NODE(RT_PTR_LOCAL node, int level, bool recurse)
RT_NODE_LEAF_32 *n32 = (RT_NODE_LEAF_32 *) node;
fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
- space, n32->base.chunks[i], n32->values[i]);
+ space, n32->base.chunks[i], (uint64) n32->values[i]);
}
else
{
@@ -2077,7 +2079,7 @@ RT_DUMP_NODE(RT_PTR_LOCAL node, int level, bool recurse)
RT_NODE_LEAF_125 *n125 = (RT_NODE_LEAF_125 *) b125;
fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
- space, i, RT_NODE_LEAF_125_GET_VALUE(n125, i));
+ space, i, (uint64) RT_NODE_LEAF_125_GET_VALUE(n125, i));
}
else
{
@@ -2107,7 +2109,7 @@ RT_DUMP_NODE(RT_PTR_LOCAL node, int level, bool recurse)
continue;
fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
- space, i, RT_NODE_LEAF_256_GET_VALUE(n256, i));
+ space, i, (uint64) RT_NODE_LEAF_256_GET_VALUE(n256, i));
}
else
{
@@ -2213,6 +2215,7 @@ RT_DUMP(RT_RADIX_TREE *tree)
#undef RT_SCOPE
#undef RT_DECLARE
#undef RT_DEFINE
+#undef RT_VALUE_TYPE
/* locally declared macros */
#undef NODE_IS_LEAF
diff --git a/src/include/lib/radixtree_delete_impl.h b/src/include/lib/radixtree_delete_impl.h
index eb87866b90..2612730481 100644
--- a/src/include/lib/radixtree_delete_impl.h
+++ b/src/include/lib/radixtree_delete_impl.h
@@ -33,7 +33,7 @@
return false;
#ifdef RT_NODE_LEVEL_LEAF
- RT_CHUNK_VALUES_ARRAY_DELETE(n4->base.chunks, (uint64 *) n4->values,
+ RT_CHUNK_VALUES_ARRAY_DELETE(n4->base.chunks, n4->values,
n4->base.n.count, idx);
#else
RT_CHUNK_CHILDREN_ARRAY_DELETE(n4->base.chunks, n4->children,
@@ -50,7 +50,7 @@
return false;
#ifdef RT_NODE_LEVEL_LEAF
- RT_CHUNK_VALUES_ARRAY_DELETE(n32->base.chunks, (uint64 *) n32->values,
+ RT_CHUNK_VALUES_ARRAY_DELETE(n32->base.chunks, n32->values,
n32->base.n.count, idx);
#else
RT_CHUNK_CHILDREN_ARRAY_DELETE(n32->base.chunks, n32->children,
diff --git a/src/include/lib/radixtree_iter_impl.h b/src/include/lib/radixtree_iter_impl.h
index 0b8b68df6c..5c06f8b414 100644
--- a/src/include/lib/radixtree_iter_impl.h
+++ b/src/include/lib/radixtree_iter_impl.h
@@ -16,7 +16,7 @@
uint8 key_chunk;
#ifdef RT_NODE_LEVEL_LEAF
- uint64 value;
+ RT_VALUE_TYPE value;
Assert(NODE_IS_LEAF(node_iter->node));
#else
diff --git a/src/include/lib/radixtree_search_impl.h b/src/include/lib/radixtree_search_impl.h
index 31e4978e4f..365abaa46d 100644
--- a/src/include/lib/radixtree_search_impl.h
+++ b/src/include/lib/radixtree_search_impl.h
@@ -15,7 +15,7 @@
uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
#ifdef RT_NODE_LEVEL_LEAF
- uint64 value = 0;
+ RT_VALUE_TYPE value = 0;
Assert(NODE_IS_LEAF(node));
#else
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
index d8323f587f..64d46dfe9a 100644
--- a/src/test/modules/test_radixtree/test_radixtree.c
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -24,6 +24,12 @@
#define UINT64_HEX_FORMAT "%" INT64_MODIFIER "X"
+/*
+ * The tests pass with uint32, but build with warnings because the string
+ * format expects uint64.
+ */
+typedef uint64 TestValueType;
+
/*
* If you enable this, the "pattern" tests will print information about
* how long populating, probing, and iterating the test set takes, and
@@ -105,6 +111,7 @@ static const test_spec test_specs[] = {
#define RT_DECLARE
#define RT_DEFINE
#define RT_USE_DELETE
+#define RT_VALUE_TYPE TestValueType
// WIP: compiles with warnings because rt_attach is defined but not used
// #define RT_SHMEM
#include "lib/radixtree.h"
@@ -128,9 +135,9 @@ test_empty(void)
{
rt_radix_tree *radixtree;
rt_iter *iter;
- uint64 dummy;
+ TestValueType dummy;
uint64 key;
- uint64 val;
+ TestValueType val;
#ifdef RT_SHMEM
int tranche_id = LWLockNewTrancheId();
@@ -202,26 +209,26 @@ test_basic(int children, bool test_inner)
/* insert keys */
for (int i = 0; i < children; i++)
{
- if (rt_set(radixtree, keys[i], keys[i]))
+ if (rt_set(radixtree, keys[i], (TestValueType) keys[i]))
elog(ERROR, "new inserted key 0x" UINT64_HEX_FORMAT " is found ", keys[i]);
}
/* look up keys */
for (int i = 0; i < children; i++)
{
- uint64 value;
+ TestValueType value;
if (!rt_search(radixtree, keys[i], &value))
elog(ERROR, "could not find key 0x" UINT64_HEX_FORMAT, keys[i]);
- if (value != keys[i])
+ if (value != (TestValueType) keys[i])
elog(ERROR, "rt_search returned 0x" UINT64_HEX_FORMAT ", expected " UINT64_HEX_FORMAT,
- value, keys[i]);
+ value, (TestValueType) keys[i]);
}
/* update keys */
for (int i = 0; i < children; i++)
{
- if (!rt_set(radixtree, keys[i], keys[i] + 1))
+ if (!rt_set(radixtree, keys[i], (TestValueType) (keys[i] + 1)))
elog(ERROR, "could not update key 0x" UINT64_HEX_FORMAT, keys[i]);
}
@@ -230,7 +237,7 @@ test_basic(int children, bool test_inner)
{
if (!rt_delete(radixtree, keys[i]))
elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, keys[i]);
- if (rt_set(radixtree, keys[i], keys[i]))
+ if (rt_set(radixtree, keys[i], (TestValueType) keys[i]))
elog(ERROR, "new inserted key 0x" UINT64_HEX_FORMAT " is found ", keys[i]);
}
@@ -248,12 +255,12 @@ check_search_on_node(rt_radix_tree *radixtree, uint8 shift, int start, int end,
for (int i = start; i < end; i++)
{
uint64 key = ((uint64) i << shift);
- uint64 val;
+ TestValueType val;
if (!rt_search(radixtree, key, &val))
elog(ERROR, "key 0x" UINT64_HEX_FORMAT " is not found on node-%d",
key, end);
- if (val != key)
+ if (val != (TestValueType) key)
elog(ERROR, "rt_search with key 0x" UINT64_HEX_FORMAT " returns 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
key, val, key);
}
@@ -274,7 +281,7 @@ test_node_types_insert(rt_radix_tree *radixtree, uint8 shift, bool insert_asc)
uint64 key = ((uint64) i << shift);
bool found;
- found = rt_set(radixtree, key, key);
+ found = rt_set(radixtree, key, (TestValueType) key);
if (found)
elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " is found", key);
@@ -440,7 +447,7 @@ test_pattern(const test_spec * spec)
x = last_int + pattern_values[i];
- found = rt_set(radixtree, x, x);
+ found = rt_set(radixtree, x, (TestValueType) x);
if (found)
elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " found", x);
@@ -495,7 +502,7 @@ test_pattern(const test_spec * spec)
bool found;
bool expected;
uint64 x;
- uint64 v;
+ TestValueType v;
/*
* Pick next value to probe at random. We limit the probes to the
@@ -526,7 +533,7 @@ test_pattern(const test_spec * spec)
if (found != expected)
elog(ERROR, "mismatch at 0x" UINT64_HEX_FORMAT ": %d vs %d", x, found, expected);
- if (found && (v != x))
+ if (found && (v != (TestValueType) x))
elog(ERROR, "found 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
v, x);
}
@@ -549,7 +556,7 @@ test_pattern(const test_spec * spec)
{
uint64 expected = last_int + pattern_values[i];
uint64 x;
- uint64 val;
+ TestValueType val;
if (!rt_iterate_next(iter, &x, &val))
break;
@@ -558,7 +565,7 @@ test_pattern(const test_spec * spec)
elog(ERROR,
"iterate returned wrong key; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d",
x, expected, i);
- if (val != expected)
+ if (val != (TestValueType) expected)
elog(ERROR,
"iterate returned wrong value; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d", x, expected, i);
n++;
@@ -588,7 +595,7 @@ test_pattern(const test_spec * spec)
{
bool found;
uint64 x;
- uint64 v;
+ TestValueType v;
/*
* Pick next value to probe at random. We limit the probes to the
--
2.39.0
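
As a usage illustration for the new RT_VALUE_TYPE parameter: a caller defines it together with the other template switches before including radixtree.h, and the generated leaf nodes and function signatures then carry that type. The following is only a minimal sketch modelled on the test module changes above; the module's name-prefix and scope defines sit earlier in that file and are omitted here, so the rt_* names and the helper function are assumptions for illustration, not part of the patch.

#define RT_DECLARE
#define RT_DEFINE
#define RT_USE_DELETE
#define RT_VALUE_TYPE uint32	/* new template parameter; previously fixed to uint64 */
#include "lib/radixtree.h"

static void
rt_value_type_example(MemoryContext ctx)
{
	rt_radix_tree *tree = rt_create(ctx);	/* local (non-RT_SHMEM) variant */
	uint32		val;

	/* rt_set() reports whether the key was already present */
	if (rt_set(tree, UINT64CONST(42), (uint32) 7))
		elog(ERROR, "key 42 unexpectedly present");

	if (!rt_search(tree, UINT64CONST(42), &val) || val != 7)
		elog(ERROR, "lookup of key 42 failed");

	if (!rt_delete(tree, UINT64CONST(42)))
		elog(ERROR, "could not delete key 42");

	rt_free(tree);
}

As the commit message notes, a 4-byte value type still trips format-string warnings in the test module, because its elog() calls keep using the 64-bit hex format.
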
Attachment: v22-0006-Free-all-radix-tree-nodes-recursively.patch (text/x-patch)
From fe4ed7bf8033453b1ba38b6d298aa519fbe5b9f8 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 20 Jan 2023 12:38:54 +0700
Subject: [PATCH v22 06/22] Free all radix tree nodes recursively
TODO: Consider adding more general functionality to DSA
to free all segments.
---
src/include/lib/radixtree.h | 78 +++++++++++++++++++++++++++++++++++++
1 file changed, 78 insertions(+)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index c08016de3a..98e4597eac 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -127,6 +127,7 @@
#define RT_ALLOC_NODE RT_MAKE_NAME(alloc_node)
#define RT_INIT_NODE RT_MAKE_NAME(init_node)
#define RT_FREE_NODE RT_MAKE_NAME(free_node)
+#define RT_FREE_RECURSE RT_MAKE_NAME(free_recurse)
#define RT_EXTEND RT_MAKE_NAME(extend)
#define RT_SET_EXTEND RT_MAKE_NAME(set_extend)
#define RT_SWITCH_NODE_KIND RT_MAKE_NAME(grow_node_kind)
@@ -1410,6 +1411,78 @@ RT_GET_HANDLE(RT_RADIX_TREE *tree)
Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
return tree->ctl->handle;
}
+
+/*
+ * Recursively free all nodes allocated to the DSA area.
+ */
+static inline void
+RT_FREE_RECURSE(RT_RADIX_TREE *tree, RT_PTR_ALLOC ptr)
+{
+ RT_PTR_LOCAL node = RT_PTR_GET_LOCAL(tree, ptr);
+
+ check_stack_depth();
+ CHECK_FOR_INTERRUPTS();
+
+ /* The leaf node doesn't have child pointers */
+ if (NODE_IS_LEAF(node))
+ {
+ dsa_free(tree->dsa, ptr);
+ return;
+ }
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ RT_NODE_INNER_4 *n4 = (RT_NODE_INNER_4 *) node;
+
+ for (int i = 0; i < n4->base.n.count; i++)
+ RT_FREE_RECURSE(tree, n4->children[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE_INNER_32 *n32 = (RT_NODE_INNER_32 *) node;
+
+ for (int i = 0; i < n32->base.n.count; i++)
+ RT_FREE_RECURSE(tree, n32->children[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE_INNER_125 *n125 = (RT_NODE_INNER_125 *) node;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(&n125->base, i))
+ continue;
+
+ RT_FREE_RECURSE(tree, RT_NODE_INNER_125_GET_CHILD(n125, i));
+ }
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE_INNER_256 *n256 = (RT_NODE_INNER_256 *) node;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
+ continue;
+
+ RT_FREE_RECURSE(tree, RT_NODE_INNER_256_GET_CHILD(n256, i));
+ }
+
+ break;
+ }
+ }
+
+ /* Free the inner node */
+ dsa_free(tree->dsa, ptr);
+}
#endif
/*
@@ -1421,6 +1494,10 @@ RT_FREE(RT_RADIX_TREE *tree)
#ifdef RT_SHMEM
Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+ /* Free all memory used for radix tree nodes */
+ if (RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ RT_FREE_RECURSE(tree, tree->ctl->root);
+
/*
* Vandalize the control block to help catch programming error where
* other backends access the memory formerly occupied by this radix tree.
@@ -2199,6 +2276,7 @@ RT_DUMP(RT_RADIX_TREE *tree)
#undef RT_ALLOC_NODE
#undef RT_INIT_NODE
#undef RT_FREE_NODE
+#undef RT_FREE_RECURSE
#undef RT_EXTEND
#undef RT_SET_EXTEND
#undef RT_SWITCH_NODE_KIND
--
2.39.0
Attachment: v22-0008-Streamline-calculation-of-slab-blocksize.patch (text/x-patch)
From 26d69b070472d5e2af3a87565d900dad91b273e8 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Fri, 20 Jan 2023 14:55:25 +0700
Subject: [PATCH v22 08/22] Streamline calculation of slab blocksize
To reduce duplication. This will likely lead to
division instructions, but a few cycles won't
matter at all when creating the tree.
---
src/include/lib/radixtree.h | 50 ++++++++++++++-----------------------
1 file changed, 19 insertions(+), 31 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 0a39bd6664..172d62c6b0 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -304,6 +304,13 @@ RT_SCOPE void RT_STATS(RT_RADIX_TREE *tree);
#define RT_NODE_KIND_256 0x03
#define RT_NODE_KIND_COUNT 4
+/*
+ * Calculate the slab blocksize so that we can allocate at least 32 chunks
+ * from the block.
+ */
+#define RT_SLAB_BLOCK_SIZE(size) \
+ Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * (size), (size) * 32)
+
#endif /* RT_COMMON */
@@ -503,59 +510,38 @@ typedef struct RT_SIZE_CLASS_ELEM
/* slab chunk size */
Size inner_size;
Size leaf_size;
-
- /* slab block size */
- Size inner_blocksize;
- Size leaf_blocksize;
} RT_SIZE_CLASS_ELEM;
-/*
- * Calculate the slab blocksize so that we can allocate at least 32 chunks
- * from the block.
- */
-#define NODE_SLAB_BLOCK_SIZE(size) \
- Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * (size), (size) * 32)
-
static const RT_SIZE_CLASS_ELEM RT_SIZE_CLASS_INFO[] = {
[RT_CLASS_4_FULL] = {
.name = "radix tree node 4",
.fanout = 4,
.inner_size = sizeof(RT_NODE_INNER_4) + 4 * sizeof(RT_PTR_ALLOC),
.leaf_size = sizeof(RT_NODE_LEAF_4) + 4 * sizeof(RT_VALUE_TYPE),
- .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_4) + 4 * sizeof(RT_PTR_ALLOC)),
- .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_4) + 4 * sizeof(RT_VALUE_TYPE)),
},
[RT_CLASS_32_PARTIAL] = {
.name = "radix tree node 15",
.fanout = 15,
.inner_size = sizeof(RT_NODE_INNER_32) + 15 * sizeof(RT_PTR_ALLOC),
.leaf_size = sizeof(RT_NODE_LEAF_32) + 15 * sizeof(RT_VALUE_TYPE),
- .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_32) + 15 * sizeof(RT_PTR_ALLOC)),
- .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_32) + 15 * sizeof(RT_VALUE_TYPE)),
},
[RT_CLASS_32_FULL] = {
.name = "radix tree node 32",
.fanout = 32,
.inner_size = sizeof(RT_NODE_INNER_32) + 32 * sizeof(RT_PTR_ALLOC),
.leaf_size = sizeof(RT_NODE_LEAF_32) + 32 * sizeof(RT_VALUE_TYPE),
- .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_32) + 32 * sizeof(RT_PTR_ALLOC)),
- .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_32) + 32 * sizeof(RT_VALUE_TYPE)),
},
[RT_CLASS_125_FULL] = {
.name = "radix tree node 125",
.fanout = 125,
.inner_size = sizeof(RT_NODE_INNER_125) + 125 * sizeof(RT_PTR_ALLOC),
.leaf_size = sizeof(RT_NODE_LEAF_125) + 125 * sizeof(RT_VALUE_TYPE),
- .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_125) + 125 * sizeof(RT_PTR_ALLOC)),
- .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_125) + 125 * sizeof(RT_VALUE_TYPE)),
},
[RT_CLASS_256] = {
.name = "radix tree node 256",
.fanout = 256,
.inner_size = sizeof(RT_NODE_INNER_256),
.leaf_size = sizeof(RT_NODE_LEAF_256),
- .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_256)),
- .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_256)),
},
};
@@ -1361,14 +1347,18 @@ RT_CREATE(MemoryContext ctx)
/* Create the slab allocator for each size class */
for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
{
+ RT_SIZE_CLASS_ELEM size_class = RT_SIZE_CLASS_INFO[i];
+ size_t inner_blocksize = RT_SLAB_BLOCK_SIZE(size_class.inner_size);
+ size_t leaf_blocksize = RT_SLAB_BLOCK_SIZE(size_class.leaf_size);
+
tree->inner_slabs[i] = SlabContextCreate(ctx,
- RT_SIZE_CLASS_INFO[i].name,
- RT_SIZE_CLASS_INFO[i].inner_blocksize,
- RT_SIZE_CLASS_INFO[i].inner_size);
+ size_class.name,
+ inner_blocksize,
+ size_class.inner_size);
tree->leaf_slabs[i] = SlabContextCreate(ctx,
- RT_SIZE_CLASS_INFO[i].name,
- RT_SIZE_CLASS_INFO[i].leaf_blocksize,
- RT_SIZE_CLASS_INFO[i].leaf_size);
+ size_class.name,
+ leaf_blocksize,
+ size_class.leaf_size);
}
#endif
@@ -2189,12 +2179,10 @@ RT_DUMP(RT_RADIX_TREE *tree)
{
for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
- fprintf(stderr, "%s\tinner_size %zu\tinner_blocksize %zu\tleaf_size %zu\tleaf_blocksize %zu\n",
+ fprintf(stderr, "%s\tinner_size %zu\tleaf_size %zu\t%zu\n",
RT_SIZE_CLASS_INFO[i].name,
RT_SIZE_CLASS_INFO[i].inner_size,
- RT_SIZE_CLASS_INFO[i].inner_blocksize,
- RT_SIZE_CLASS_INFO[i].leaf_size,
- RT_SIZE_CLASS_INFO[i].leaf_blocksize);
+ RT_SIZE_CLASS_INFO[i].leaf_size);
fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->ctl->max_val);
if (!tree->ctl->root)
--
2.39.0
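
To make the consolidated macro concrete, here is a tiny standalone program that evaluates the same Max() expression with two made-up chunk sizes (the real inner/leaf sizes come from the struct definitions above; SLAB_DEFAULT_BLOCK_SIZE is 8 kB as in memutils.h):

#include <stdio.h>

#define Max(a, b)	((a) > (b) ? (a) : (b))
#define SLAB_DEFAULT_BLOCK_SIZE	(8 * 1024)
#define RT_SLAB_BLOCK_SIZE(size) \
	Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * (size), (size) * 32)

int
main(void)
{
	/* small chunk: keep roughly the default block, rounded down to a chunk multiple */
	printf("size 40   -> blocksize %d\n", RT_SLAB_BLOCK_SIZE(40));		/* 8160 */

	/* large chunk: grow the block so that at least 32 chunks fit */
	printf("size 2072 -> blocksize %d\n", RT_SLAB_BLOCK_SIZE(2072));	/* 66304 */

	return 0;
}

The expression is integer arithmetic throughout, which is where the division instructions mentioned in the commit message come from.
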
Attachment: v22-0009-Remove-hard-coded-128.patch (text/x-patch)
From e3c3cae8de8db407334aa5f16d187b69baea6279 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Fri, 20 Jan 2023 15:51:21 +0700
Subject: [PATCH v22 09/22] Remove hard-coded 128
Also comment that 64 could be a valid number of bits
in the bitmap for this node type.
TODO: Consider whether we should in fact limit this
node to ~64.
In passing, remove "125" from invalid-slot-index macro.
---
src/include/lib/radixtree.h | 19 +++++++++++++------
src/include/lib/radixtree_delete_impl.h | 4 ++--
src/include/lib/radixtree_insert_impl.h | 4 ++--
src/include/lib/radixtree_search_impl.h | 4 ++--
4 files changed, 19 insertions(+), 12 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 172d62c6b0..d15ea8f0fe 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -270,8 +270,15 @@ RT_SCOPE void RT_STATS(RT_RADIX_TREE *tree);
/* Tree level the radix tree uses */
#define RT_MAX_LEVEL ((sizeof(uint64) * BITS_PER_BYTE) / RT_NODE_SPAN)
+/*
+ * Number of bits necessary for isset array in the slot-index node.
+ * Since bitmapword can be 64 bits, the only values that make sense
+ * here are 64 and 128.
+ */
+#define RT_SLOT_IDX_LIMIT (RT_NODE_MAX_SLOTS / 2)
+
/* Invalid index used in node-125 */
-#define RT_NODE_125_INVALID_IDX 0xFF
+#define RT_INVALID_SLOT_IDX 0xFF
/* Get a chunk from the key */
#define RT_GET_KEY_CHUNK(key, shift) ((uint8) (((key) >> (shift)) & RT_CHUNK_MASK))
@@ -409,7 +416,7 @@ typedef struct RT_NODE_BASE_125
uint8 slot_idxs[RT_NODE_MAX_SLOTS];
/* isset is a bitmap to track which slot is in use */
- bitmapword isset[BM_IDX(128)];
+ bitmapword isset[BM_IDX(RT_SLOT_IDX_LIMIT)];
} RT_NODE_BASE_125;
typedef struct RT_NODE_BASE_256
@@ -867,7 +874,7 @@ RT_CHUNK_VALUES_ARRAY_COPY(uint8 *src_chunks, RT_VALUE_TYPE *src_values,
static inline bool
RT_NODE_125_IS_CHUNK_USED(RT_NODE_BASE_125 *node, uint8 chunk)
{
- return node->slot_idxs[chunk] != RT_NODE_125_INVALID_IDX;
+ return node->slot_idxs[chunk] != RT_INVALID_SLOT_IDX;
}
static inline RT_PTR_ALLOC
@@ -881,7 +888,7 @@ static inline RT_VALUE_TYPE
RT_NODE_LEAF_125_GET_VALUE(RT_NODE_LEAF_125 *node, uint8 chunk)
{
Assert(NODE_IS_LEAF(node));
- Assert(((RT_NODE_BASE_125 *) node)->slot_idxs[chunk] != RT_NODE_125_INVALID_IDX);
+ Assert(((RT_NODE_BASE_125 *) node)->slot_idxs[chunk] != RT_INVALID_SLOT_IDX);
return node->values[node->base.slot_idxs[chunk]];
}
@@ -1037,7 +1044,7 @@ RT_INIT_NODE(RT_PTR_LOCAL node, uint8 kind, RT_SIZE_CLASS size_class, bool inner
{
RT_NODE_BASE_125 *n125 = (RT_NODE_BASE_125 *) node;
- memset(n125->slot_idxs, RT_NODE_125_INVALID_IDX, sizeof(n125->slot_idxs));
+ memset(n125->slot_idxs, RT_INVALID_SLOT_IDX, sizeof(n125->slot_idxs));
}
}
@@ -2052,7 +2059,7 @@ RT_DUMP_NODE(RT_PTR_LOCAL node, int level, bool recurse)
RT_NODE_LEAF_125 *n = (RT_NODE_LEAF_125 *) node;
fprintf(stderr, ", isset-bitmap:");
- for (int i = 0; i < BM_IDX(128); i++)
+ for (int i = 0; i < BM_IDX(RT_SLOT_IDX_LIMIT); i++)
{
fprintf(stderr, UINT64_FORMAT_HEX " ", (uint64) n->base.isset[i]);
}
diff --git a/src/include/lib/radixtree_delete_impl.h b/src/include/lib/radixtree_delete_impl.h
index 2612730481..2f1c172672 100644
--- a/src/include/lib/radixtree_delete_impl.h
+++ b/src/include/lib/radixtree_delete_impl.h
@@ -65,13 +65,13 @@
int idx;
int bitnum;
- if (slotpos == RT_NODE_125_INVALID_IDX)
+ if (slotpos == RT_INVALID_SLOT_IDX)
return false;
idx = BM_IDX(slotpos);
bitnum = BM_BIT(slotpos);
n125->base.isset[idx] &= ~((bitmapword) 1 << bitnum);
- n125->base.slot_idxs[chunk] = RT_NODE_125_INVALID_IDX;
+ n125->base.slot_idxs[chunk] = RT_INVALID_SLOT_IDX;
break;
}
diff --git a/src/include/lib/radixtree_insert_impl.h b/src/include/lib/radixtree_insert_impl.h
index e3e44669ea..90fe5f539e 100644
--- a/src/include/lib/radixtree_insert_impl.h
+++ b/src/include/lib/radixtree_insert_impl.h
@@ -201,7 +201,7 @@
int slotpos = n125->base.slot_idxs[chunk];
int cnt = 0;
- if (slotpos != RT_NODE_125_INVALID_IDX)
+ if (slotpos != RT_INVALID_SLOT_IDX)
{
/* found the existing chunk */
chunk_exists = true;
@@ -247,7 +247,7 @@
bitmapword inverse;
/* get the first word with at least one bit not set */
- for (idx = 0; idx < BM_IDX(128); idx++)
+ for (idx = 0; idx < BM_IDX(RT_SLOT_IDX_LIMIT); idx++)
{
if (n125->base.isset[idx] < ~((bitmapword) 0))
break;
diff --git a/src/include/lib/radixtree_search_impl.h b/src/include/lib/radixtree_search_impl.h
index 365abaa46d..d2bbdd2450 100644
--- a/src/include/lib/radixtree_search_impl.h
+++ b/src/include/lib/radixtree_search_impl.h
@@ -73,10 +73,10 @@
int slotpos = n125->base.slot_idxs[chunk];
#ifdef RT_ACTION_UPDATE
- Assert(slotpos != RT_NODE_125_INVALID_IDX);
+ Assert(slotpos != RT_INVALID_SLOT_IDX);
n125->children[slotpos] = new_child;
#else
- if (slotpos == RT_NODE_125_INVALID_IDX)
+ if (slotpos == RT_INVALID_SLOT_IDX)
return false;
#ifdef RT_NODE_LEVEL_LEAF
--
2.39.0
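
For reference, the arithmetic behind the new constant, assuming the usual RT_NODE_MAX_SLOTS of 256 and a 64-bit bitmapword (a sketch of the sizes only, not new code):

/*
 * RT_SLOT_IDX_LIMIT = RT_NODE_MAX_SLOTS / 2 = 256 / 2 = 128
 *
 * so isset[BM_IDX(RT_SLOT_IDX_LIMIT)] stays at 128 / 64 = 2 bitmapwords on
 * 64-bit builds (4 words where bitmapword is 32 bits), the same as the old
 * hard-coded BM_IDX(128).  Limiting the node to ~64 entries, as the TODO
 * above contemplates, would halve the bitmap.
 */
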
Attachment: v22-0010-Reduce-node4-to-node3.patch (text/x-patch)
From dfa2aece9d83cc6e9ab791c6b1641aca1d02d8f6 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Fri, 20 Jan 2023 18:05:15 +0700
Subject: [PATCH v22 10/22] Reduce node4 to node3
Now that we don't store "chunk", the base node type is only
5 bytes in size. With 3 key chunks, there is no alignment
padding between the chunks array and the child/value array.
This reduces the smallest inner node to 32 bytes on 64-bit
platforms.
---
src/include/lib/radixtree.h | 124 ++++++++++++------------
src/include/lib/radixtree_delete_impl.h | 20 ++--
src/include/lib/radixtree_insert_impl.h | 38 ++++----
src/include/lib/radixtree_iter_impl.h | 18 ++--
src/include/lib/radixtree_search_impl.h | 18 ++--
5 files changed, 109 insertions(+), 109 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index d15ea8f0fe..6cc8442c89 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -136,9 +136,9 @@
#define RT_REPLACE_NODE RT_MAKE_NAME(replace_node)
#define RT_PTR_GET_LOCAL RT_MAKE_NAME(ptr_get_local)
#define RT_PTR_ALLOC_IS_VALID RT_MAKE_NAME(ptr_stored_is_valid)
-#define RT_NODE_4_SEARCH_EQ RT_MAKE_NAME(node_4_search_eq)
+#define RT_NODE_3_SEARCH_EQ RT_MAKE_NAME(node_3_search_eq)
#define RT_NODE_32_SEARCH_EQ RT_MAKE_NAME(node_32_search_eq)
-#define RT_NODE_4_GET_INSERTPOS RT_MAKE_NAME(node_4_get_insertpos)
+#define RT_NODE_3_GET_INSERTPOS RT_MAKE_NAME(node_3_get_insertpos)
#define RT_NODE_32_GET_INSERTPOS RT_MAKE_NAME(node_32_get_insertpos)
#define RT_CHUNK_CHILDREN_ARRAY_SHIFT RT_MAKE_NAME(chunk_children_array_shift)
#define RT_CHUNK_VALUES_ARRAY_SHIFT RT_MAKE_NAME(chunk_values_array_shift)
@@ -181,22 +181,22 @@
#endif
#define RT_NODE RT_MAKE_NAME(node)
#define RT_NODE_ITER RT_MAKE_NAME(node_iter)
-#define RT_NODE_BASE_4 RT_MAKE_NAME(node_base_4)
+#define RT_NODE_BASE_3 RT_MAKE_NAME(node_base_3)
#define RT_NODE_BASE_32 RT_MAKE_NAME(node_base_32)
#define RT_NODE_BASE_125 RT_MAKE_NAME(node_base_125)
#define RT_NODE_BASE_256 RT_MAKE_NAME(node_base_256)
-#define RT_NODE_INNER_4 RT_MAKE_NAME(node_inner_4)
+#define RT_NODE_INNER_3 RT_MAKE_NAME(node_inner_3)
#define RT_NODE_INNER_32 RT_MAKE_NAME(node_inner_32)
#define RT_NODE_INNER_125 RT_MAKE_NAME(node_inner_125)
#define RT_NODE_INNER_256 RT_MAKE_NAME(node_inner_256)
-#define RT_NODE_LEAF_4 RT_MAKE_NAME(node_leaf_4)
+#define RT_NODE_LEAF_3 RT_MAKE_NAME(node_leaf_3)
#define RT_NODE_LEAF_32 RT_MAKE_NAME(node_leaf_32)
#define RT_NODE_LEAF_125 RT_MAKE_NAME(node_leaf_125)
#define RT_NODE_LEAF_256 RT_MAKE_NAME(node_leaf_256)
#define RT_SIZE_CLASS RT_MAKE_NAME(size_class)
#define RT_SIZE_CLASS_ELEM RT_MAKE_NAME(size_class_elem)
#define RT_SIZE_CLASS_INFO RT_MAKE_NAME(size_class_info)
-#define RT_CLASS_4_FULL RT_MAKE_NAME(class_4_full)
+#define RT_CLASS_3_FULL RT_MAKE_NAME(class_3_full)
#define RT_CLASS_32_PARTIAL RT_MAKE_NAME(class_32_partial)
#define RT_CLASS_32_FULL RT_MAKE_NAME(class_32_full)
#define RT_CLASS_125_FULL RT_MAKE_NAME(class_125_full)
@@ -305,7 +305,7 @@ RT_SCOPE void RT_STATS(RT_RADIX_TREE *tree);
* allocator padding in both the inner and leaf nodes on DSA.
* node
*/
-#define RT_NODE_KIND_4 0x00
+#define RT_NODE_KIND_3 0x00
#define RT_NODE_KIND_32 0x01
#define RT_NODE_KIND_125 0x02
#define RT_NODE_KIND_256 0x03
@@ -323,7 +323,7 @@ RT_SCOPE void RT_STATS(RT_RADIX_TREE *tree);
typedef enum RT_SIZE_CLASS
{
- RT_CLASS_4_FULL = 0,
+ RT_CLASS_3_FULL = 0,
RT_CLASS_32_PARTIAL,
RT_CLASS_32_FULL,
RT_CLASS_125_FULL,
@@ -387,13 +387,13 @@ typedef struct RT_NODE
/* Base type of each node kinds for leaf and inner nodes */
/* The base types must be a be able to accommodate the largest size
class for variable-sized node kinds*/
-typedef struct RT_NODE_BASE_4
+typedef struct RT_NODE_BASE_3
{
RT_NODE n;
- /* 4 children, for key chunks */
- uint8 chunks[4];
-} RT_NODE_BASE_4;
+ /* 3 children, for key chunks */
+ uint8 chunks[3];
+} RT_NODE_BASE_3;
typedef struct RT_NODE_BASE_32
{
@@ -437,21 +437,21 @@ typedef struct RT_NODE_BASE_256
* good. It might be better to just indicate non-existing entries the same way
* in inner nodes.
*/
-typedef struct RT_NODE_INNER_4
+typedef struct RT_NODE_INNER_3
{
- RT_NODE_BASE_4 base;
+ RT_NODE_BASE_3 base;
/* number of children depends on size class */
RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
-} RT_NODE_INNER_4;
+} RT_NODE_INNER_3;
-typedef struct RT_NODE_LEAF_4
+typedef struct RT_NODE_LEAF_3
{
- RT_NODE_BASE_4 base;
+ RT_NODE_BASE_3 base;
/* number of values depends on size class */
RT_VALUE_TYPE values[FLEXIBLE_ARRAY_MEMBER];
-} RT_NODE_LEAF_4;
+} RT_NODE_LEAF_3;
typedef struct RT_NODE_INNER_32
{
@@ -520,11 +520,11 @@ typedef struct RT_SIZE_CLASS_ELEM
} RT_SIZE_CLASS_ELEM;
static const RT_SIZE_CLASS_ELEM RT_SIZE_CLASS_INFO[] = {
- [RT_CLASS_4_FULL] = {
- .name = "radix tree node 4",
- .fanout = 4,
- .inner_size = sizeof(RT_NODE_INNER_4) + 4 * sizeof(RT_PTR_ALLOC),
- .leaf_size = sizeof(RT_NODE_LEAF_4) + 4 * sizeof(RT_VALUE_TYPE),
+ [RT_CLASS_3_FULL] = {
+ .name = "radix tree node 3",
+ .fanout = 3,
+ .inner_size = sizeof(RT_NODE_INNER_3) + 3 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_3) + 3 * sizeof(RT_VALUE_TYPE),
},
[RT_CLASS_32_PARTIAL] = {
.name = "radix tree node 15",
@@ -556,7 +556,7 @@ static const RT_SIZE_CLASS_ELEM RT_SIZE_CLASS_INFO[] = {
/* Map from the node kind to its minimum size class */
static const RT_SIZE_CLASS RT_KIND_MIN_SIZE_CLASS[RT_NODE_KIND_COUNT] = {
- [RT_NODE_KIND_4] = RT_CLASS_4_FULL,
+ [RT_NODE_KIND_3] = RT_CLASS_3_FULL,
[RT_NODE_KIND_32] = RT_CLASS_32_PARTIAL,
[RT_NODE_KIND_125] = RT_CLASS_125_FULL,
[RT_NODE_KIND_256] = RT_CLASS_256,
@@ -673,7 +673,7 @@ RT_PTR_ALLOC_IS_VALID(RT_PTR_ALLOC ptr)
* if there is no such element.
*/
static inline int
-RT_NODE_4_SEARCH_EQ(RT_NODE_BASE_4 *node, uint8 chunk)
+RT_NODE_3_SEARCH_EQ(RT_NODE_BASE_3 *node, uint8 chunk)
{
int idx = -1;
@@ -693,7 +693,7 @@ RT_NODE_4_SEARCH_EQ(RT_NODE_BASE_4 *node, uint8 chunk)
* Return index of the chunk to insert into chunks in the given node.
*/
static inline int
-RT_NODE_4_GET_INSERTPOS(RT_NODE_BASE_4 *node, uint8 chunk)
+RT_NODE_3_GET_INSERTPOS(RT_NODE_BASE_3 *node, uint8 chunk)
{
int idx;
@@ -810,7 +810,7 @@ RT_NODE_32_GET_INSERTPOS(RT_NODE_BASE_32 *node, uint8 chunk)
/*
* Functions to manipulate both chunks array and children/values array.
- * These are used for node-4 and node-32.
+ * These are used for node-3 and node-32.
*/
/* Shift the elements right at 'idx' by one */
@@ -848,7 +848,7 @@ static inline void
RT_CHUNK_CHILDREN_ARRAY_COPY(uint8 *src_chunks, RT_PTR_ALLOC *src_children,
uint8 *dst_chunks, RT_PTR_ALLOC *dst_children)
{
- const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_4_FULL].fanout;
+ const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_3_FULL].fanout;
const Size chunk_size = sizeof(uint8) * fanout;
const Size children_size = sizeof(RT_PTR_ALLOC) * fanout;
@@ -860,7 +860,7 @@ static inline void
RT_CHUNK_VALUES_ARRAY_COPY(uint8 *src_chunks, RT_VALUE_TYPE *src_values,
uint8 *dst_chunks, RT_VALUE_TYPE *dst_values)
{
- const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_4_FULL].fanout;
+ const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_3_FULL].fanout;
const Size chunk_size = sizeof(uint8) * fanout;
const Size values_size = sizeof(RT_VALUE_TYPE) * fanout;
@@ -1060,9 +1060,9 @@ RT_NEW_ROOT(RT_RADIX_TREE *tree, uint64 key)
RT_PTR_ALLOC allocnode;
RT_PTR_LOCAL newnode;
- allocnode = RT_ALLOC_NODE(tree, RT_CLASS_4_FULL, inner);
+ allocnode = RT_ALLOC_NODE(tree, RT_CLASS_3_FULL, inner);
newnode = RT_PTR_GET_LOCAL(tree, allocnode);
- RT_INIT_NODE(newnode, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
+ RT_INIT_NODE(newnode, RT_NODE_KIND_3, RT_CLASS_3_FULL, inner);
newnode->shift = shift;
tree->ctl->max_val = RT_SHIFT_GET_MAX_VAL(shift);
tree->ctl->root = allocnode;
@@ -1183,17 +1183,17 @@ RT_EXTEND(RT_RADIX_TREE *tree, uint64 key)
{
RT_PTR_ALLOC allocnode;
RT_PTR_LOCAL node;
- RT_NODE_INNER_4 *n4;
+ RT_NODE_INNER_3 *n3;
- allocnode = RT_ALLOC_NODE(tree, RT_CLASS_4_FULL, true);
+ allocnode = RT_ALLOC_NODE(tree, RT_CLASS_3_FULL, true);
node = RT_PTR_GET_LOCAL(tree, allocnode);
- RT_INIT_NODE(node, RT_NODE_KIND_4, RT_CLASS_4_FULL, true);
+ RT_INIT_NODE(node, RT_NODE_KIND_3, RT_CLASS_3_FULL, true);
node->shift = shift;
node->count = 1;
- n4 = (RT_NODE_INNER_4 *) node;
- n4->base.chunks[0] = 0;
- n4->children[0] = tree->ctl->root;
+ n3 = (RT_NODE_INNER_3 *) node;
+ n3->base.chunks[0] = 0;
+ n3->children[0] = tree->ctl->root;
/* Update the root */
tree->ctl->root = allocnode;
@@ -1223,9 +1223,9 @@ RT_SET_EXTEND(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE value, RT_PTR_LOCAL
int newshift = shift - RT_NODE_SPAN;
bool inner = newshift > 0;
- allocchild = RT_ALLOC_NODE(tree, RT_CLASS_4_FULL, inner);
+ allocchild = RT_ALLOC_NODE(tree, RT_CLASS_3_FULL, inner);
newchild = RT_PTR_GET_LOCAL(tree, allocchild);
- RT_INIT_NODE(newchild, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
+ RT_INIT_NODE(newchild, RT_NODE_KIND_3, RT_CLASS_3_FULL, inner);
newchild->shift = newshift;
RT_NODE_INSERT_INNER(tree, parent, stored_node, node, key, allocchild);
@@ -1430,12 +1430,12 @@ RT_FREE_RECURSE(RT_RADIX_TREE *tree, RT_PTR_ALLOC ptr)
switch (node->kind)
{
- case RT_NODE_KIND_4:
+ case RT_NODE_KIND_3:
{
- RT_NODE_INNER_4 *n4 = (RT_NODE_INNER_4 *) node;
+ RT_NODE_INNER_3 *n3 = (RT_NODE_INNER_3 *) node;
- for (int i = 0; i < n4->base.n.count; i++)
- RT_FREE_RECURSE(tree, n4->children[i]);
+ for (int i = 0; i < n3->base.n.count; i++)
+ RT_FREE_RECURSE(tree, n3->children[i]);
break;
}
@@ -1892,12 +1892,12 @@ RT_VERIFY_NODE(RT_PTR_LOCAL node)
switch (node->kind)
{
- case RT_NODE_KIND_4:
+ case RT_NODE_KIND_3:
{
- RT_NODE_BASE_4 *n4 = (RT_NODE_BASE_4 *) node;
+ RT_NODE_BASE_3 *n3 = (RT_NODE_BASE_3 *) node;
- for (int i = 1; i < n4->n.count; i++)
- Assert(n4->chunks[i - 1] < n4->chunks[i]);
+ for (int i = 1; i < n3->n.count; i++)
+ Assert(n3->chunks[i - 1] < n3->chunks[i]);
break;
}
@@ -1959,10 +1959,10 @@ RT_VERIFY_NODE(RT_PTR_LOCAL node)
RT_SCOPE void
RT_STATS(RT_RADIX_TREE *tree)
{
- ereport(NOTICE, (errmsg("num_keys = " UINT64_FORMAT ", height = %d, n4 = %u, n15 = %u, n32 = %u, n125 = %u, n256 = %u",
+ ereport(NOTICE, (errmsg("num_keys = " UINT64_FORMAT ", height = %d, n3 = %u, n15 = %u, n32 = %u, n125 = %u, n256 = %u",
tree->ctl->num_keys,
tree->ctl->root->shift / RT_NODE_SPAN,
- tree->ctl->cnt[RT_CLASS_4_FULL],
+ tree->ctl->cnt[RT_CLASS_3_FULL],
tree->ctl->cnt[RT_CLASS_32_PARTIAL],
tree->ctl->cnt[RT_CLASS_32_FULL],
tree->ctl->cnt[RT_CLASS_125_FULL],
@@ -1977,7 +1977,7 @@ RT_DUMP_NODE(RT_PTR_LOCAL node, int level, bool recurse)
fprintf(stderr, "[%s] kind %d, fanout %d, count %u, shift %u:\n",
NODE_IS_LEAF(node) ? "LEAF" : "INNR",
- (node->kind == RT_NODE_KIND_4) ? 4 :
+ (node->kind == RT_NODE_KIND_3) ? 3 :
(node->kind == RT_NODE_KIND_32) ? 32 :
(node->kind == RT_NODE_KIND_125) ? 125 : 256,
node->fanout == 0 ? 256 : node->fanout,
@@ -1988,26 +1988,26 @@ RT_DUMP_NODE(RT_PTR_LOCAL node, int level, bool recurse)
switch (node->kind)
{
- case RT_NODE_KIND_4:
+ case RT_NODE_KIND_3:
{
for (int i = 0; i < node->count; i++)
{
if (NODE_IS_LEAF(node))
{
- RT_NODE_LEAF_4 *n4 = (RT_NODE_LEAF_4 *) node;
+ RT_NODE_LEAF_3 *n3 = (RT_NODE_LEAF_3 *) node;
fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
- space, n4->base.chunks[i], (uint64) n4->values[i]);
+ space, n3->base.chunks[i], (uint64) n3->values[i]);
}
else
{
- RT_NODE_INNER_4 *n4 = (RT_NODE_INNER_4 *) node;
+ RT_NODE_INNER_3 *n3 = (RT_NODE_INNER_3 *) node;
fprintf(stderr, "%schunk 0x%X ->",
- space, n4->base.chunks[i]);
+ space, n3->base.chunks[i]);
if (recurse)
- RT_DUMP_NODE(n4->children[i], level + 1, recurse);
+ RT_DUMP_NODE(n3->children[i], level + 1, recurse);
else
fprintf(stderr, "\n");
}
@@ -2229,22 +2229,22 @@ RT_DUMP(RT_RADIX_TREE *tree)
#undef RT_ITER
#undef RT_NODE
#undef RT_NODE_ITER
-#undef RT_NODE_BASE_4
+#undef RT_NODE_BASE_3
#undef RT_NODE_BASE_32
#undef RT_NODE_BASE_125
#undef RT_NODE_BASE_256
-#undef RT_NODE_INNER_4
+#undef RT_NODE_INNER_3
#undef RT_NODE_INNER_32
#undef RT_NODE_INNER_125
#undef RT_NODE_INNER_256
-#undef RT_NODE_LEAF_4
+#undef RT_NODE_LEAF_3
#undef RT_NODE_LEAF_32
#undef RT_NODE_LEAF_125
#undef RT_NODE_LEAF_256
#undef RT_SIZE_CLASS
#undef RT_SIZE_CLASS_ELEM
#undef RT_SIZE_CLASS_INFO
-#undef RT_CLASS_4_FULL
+#undef RT_CLASS_3_FULL
#undef RT_CLASS_32_PARTIAL
#undef RT_CLASS_32_FULL
#undef RT_CLASS_125_FULL
@@ -2282,9 +2282,9 @@ RT_DUMP(RT_RADIX_TREE *tree)
#undef RT_REPLACE_NODE
#undef RT_PTR_GET_LOCAL
#undef RT_PTR_ALLOC_IS_VALID
-#undef RT_NODE_4_SEARCH_EQ
+#undef RT_NODE_3_SEARCH_EQ
#undef RT_NODE_32_SEARCH_EQ
-#undef RT_NODE_4_GET_INSERTPOS
+#undef RT_NODE_3_GET_INSERTPOS
#undef RT_NODE_32_GET_INSERTPOS
#undef RT_CHUNK_CHILDREN_ARRAY_SHIFT
#undef RT_CHUNK_VALUES_ARRAY_SHIFT
diff --git a/src/include/lib/radixtree_delete_impl.h b/src/include/lib/radixtree_delete_impl.h
index 2f1c172672..b9f07f4eb5 100644
--- a/src/include/lib/radixtree_delete_impl.h
+++ b/src/include/lib/radixtree_delete_impl.h
@@ -1,12 +1,12 @@
/* TODO: shrink nodes */
#if defined(RT_NODE_LEVEL_INNER)
-#define RT_NODE4_TYPE RT_NODE_INNER_4
+#define RT_NODE3_TYPE RT_NODE_INNER_3
#define RT_NODE32_TYPE RT_NODE_INNER_32
#define RT_NODE125_TYPE RT_NODE_INNER_125
#define RT_NODE256_TYPE RT_NODE_INNER_256
#elif defined(RT_NODE_LEVEL_LEAF)
-#define RT_NODE4_TYPE RT_NODE_LEAF_4
+#define RT_NODE3_TYPE RT_NODE_LEAF_3
#define RT_NODE32_TYPE RT_NODE_LEAF_32
#define RT_NODE125_TYPE RT_NODE_LEAF_125
#define RT_NODE256_TYPE RT_NODE_LEAF_256
@@ -24,20 +24,20 @@
switch (node->kind)
{
- case RT_NODE_KIND_4:
+ case RT_NODE_KIND_3:
{
- RT_NODE4_TYPE *n4 = (RT_NODE4_TYPE *) node;
- int idx = RT_NODE_4_SEARCH_EQ((RT_NODE_BASE_4 *) n4, chunk);
+ RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node;
+ int idx = RT_NODE_3_SEARCH_EQ((RT_NODE_BASE_3 *) n3, chunk);
if (idx < 0)
return false;
#ifdef RT_NODE_LEVEL_LEAF
- RT_CHUNK_VALUES_ARRAY_DELETE(n4->base.chunks, n4->values,
- n4->base.n.count, idx);
+ RT_CHUNK_VALUES_ARRAY_DELETE(n3->base.chunks, n3->values,
+ n3->base.n.count, idx);
#else
- RT_CHUNK_CHILDREN_ARRAY_DELETE(n4->base.chunks, n4->children,
- n4->base.n.count, idx);
+ RT_CHUNK_CHILDREN_ARRAY_DELETE(n3->base.chunks, n3->children,
+ n3->base.n.count, idx);
#endif
break;
}
@@ -100,7 +100,7 @@
return true;
-#undef RT_NODE4_TYPE
+#undef RT_NODE3_TYPE
#undef RT_NODE32_TYPE
#undef RT_NODE125_TYPE
#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_insert_impl.h b/src/include/lib/radixtree_insert_impl.h
index 90fe5f539e..16461bdb03 100644
--- a/src/include/lib/radixtree_insert_impl.h
+++ b/src/include/lib/radixtree_insert_impl.h
@@ -1,10 +1,10 @@
#if defined(RT_NODE_LEVEL_INNER)
-#define RT_NODE4_TYPE RT_NODE_INNER_4
+#define RT_NODE3_TYPE RT_NODE_INNER_3
#define RT_NODE32_TYPE RT_NODE_INNER_32
#define RT_NODE125_TYPE RT_NODE_INNER_125
#define RT_NODE256_TYPE RT_NODE_INNER_256
#elif defined(RT_NODE_LEVEL_LEAF)
-#define RT_NODE4_TYPE RT_NODE_LEAF_4
+#define RT_NODE3_TYPE RT_NODE_LEAF_3
#define RT_NODE32_TYPE RT_NODE_LEAF_32
#define RT_NODE125_TYPE RT_NODE_LEAF_125
#define RT_NODE256_TYPE RT_NODE_LEAF_256
@@ -25,25 +25,25 @@
switch (node->kind)
{
- case RT_NODE_KIND_4:
+ case RT_NODE_KIND_3:
{
- RT_NODE4_TYPE *n4 = (RT_NODE4_TYPE *) node;
+ RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node;
int idx;
- idx = RT_NODE_4_SEARCH_EQ(&n4->base, chunk);
+ idx = RT_NODE_3_SEARCH_EQ(&n3->base, chunk);
if (idx != -1)
{
/* found the existing chunk */
chunk_exists = true;
#ifdef RT_NODE_LEVEL_LEAF
- n4->values[idx] = value;
+ n3->values[idx] = value;
#else
- n4->children[idx] = child;
+ n3->children[idx] = child;
#endif
break;
}
- if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n4)))
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n3)))
{
RT_PTR_ALLOC allocnode;
RT_PTR_LOCAL newnode;
@@ -51,16 +51,16 @@
const uint8 new_kind = RT_NODE_KIND_32;
const RT_SIZE_CLASS new_class = RT_KIND_MIN_SIZE_CLASS[new_kind];
- /* grow node from 4 to 32 */
+ /* grow node from 3 to 32 */
allocnode = RT_ALLOC_NODE(tree, new_class, inner);
newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, inner);
new32 = (RT_NODE32_TYPE *) newnode;
#ifdef RT_NODE_LEVEL_LEAF
- RT_CHUNK_VALUES_ARRAY_COPY(n4->base.chunks, n4->values,
+ RT_CHUNK_VALUES_ARRAY_COPY(n3->base.chunks, n3->values,
new32->base.chunks, new32->values);
#else
- RT_CHUNK_CHILDREN_ARRAY_COPY(n4->base.chunks, n4->children,
+ RT_CHUNK_CHILDREN_ARRAY_COPY(n3->base.chunks, n3->children,
new32->base.chunks, new32->children);
#endif
RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
@@ -68,27 +68,27 @@
}
else
{
- int insertpos = RT_NODE_4_GET_INSERTPOS(&n4->base, chunk);
- int count = n4->base.n.count;
+ int insertpos = RT_NODE_3_GET_INSERTPOS(&n3->base, chunk);
+ int count = n3->base.n.count;
/* shift chunks and children */
if (insertpos < count)
{
Assert(count > 0);
#ifdef RT_NODE_LEVEL_LEAF
- RT_CHUNK_VALUES_ARRAY_SHIFT(n4->base.chunks, n4->values,
+ RT_CHUNK_VALUES_ARRAY_SHIFT(n3->base.chunks, n3->values,
count, insertpos);
#else
- RT_CHUNK_CHILDREN_ARRAY_SHIFT(n4->base.chunks, n4->children,
+ RT_CHUNK_CHILDREN_ARRAY_SHIFT(n3->base.chunks, n3->children,
count, insertpos);
#endif
}
- n4->base.chunks[insertpos] = chunk;
+ n3->base.chunks[insertpos] = chunk;
#ifdef RT_NODE_LEVEL_LEAF
- n4->values[insertpos] = value;
+ n3->values[insertpos] = value;
#else
- n4->children[insertpos] = child;
+ n3->children[insertpos] = child;
#endif
break;
}
@@ -304,7 +304,7 @@
return chunk_exists;
-#undef RT_NODE4_TYPE
+#undef RT_NODE3_TYPE
#undef RT_NODE32_TYPE
#undef RT_NODE125_TYPE
#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_iter_impl.h b/src/include/lib/radixtree_iter_impl.h
index 5c06f8b414..c428531438 100644
--- a/src/include/lib/radixtree_iter_impl.h
+++ b/src/include/lib/radixtree_iter_impl.h
@@ -1,10 +1,10 @@
#if defined(RT_NODE_LEVEL_INNER)
-#define RT_NODE4_TYPE RT_NODE_INNER_4
+#define RT_NODE3_TYPE RT_NODE_INNER_3
#define RT_NODE32_TYPE RT_NODE_INNER_32
#define RT_NODE125_TYPE RT_NODE_INNER_125
#define RT_NODE256_TYPE RT_NODE_INNER_256
#elif defined(RT_NODE_LEVEL_LEAF)
-#define RT_NODE4_TYPE RT_NODE_LEAF_4
+#define RT_NODE3_TYPE RT_NODE_LEAF_3
#define RT_NODE32_TYPE RT_NODE_LEAF_32
#define RT_NODE125_TYPE RT_NODE_LEAF_125
#define RT_NODE256_TYPE RT_NODE_LEAF_256
@@ -31,19 +31,19 @@
switch (node_iter->node->kind)
{
- case RT_NODE_KIND_4:
+ case RT_NODE_KIND_3:
{
- RT_NODE4_TYPE *n4 = (RT_NODE4_TYPE *) node_iter->node;
+ RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node_iter->node;
node_iter->current_idx++;
- if (node_iter->current_idx >= n4->base.n.count)
+ if (node_iter->current_idx >= n3->base.n.count)
break;
#ifdef RT_NODE_LEVEL_LEAF
- value = n4->values[node_iter->current_idx];
+ value = n3->values[node_iter->current_idx];
#else
- child = RT_PTR_GET_LOCAL(iter->tree, n4->children[node_iter->current_idx]);
+ child = RT_PTR_GET_LOCAL(iter->tree, n3->children[node_iter->current_idx]);
#endif
- key_chunk = n4->base.chunks[node_iter->current_idx];
+ key_chunk = n3->base.chunks[node_iter->current_idx];
found = true;
break;
}
@@ -132,7 +132,7 @@
return child;
#endif
-#undef RT_NODE4_TYPE
+#undef RT_NODE3_TYPE
#undef RT_NODE32_TYPE
#undef RT_NODE125_TYPE
#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_search_impl.h b/src/include/lib/radixtree_search_impl.h
index d2bbdd2450..31138b6a72 100644
--- a/src/include/lib/radixtree_search_impl.h
+++ b/src/include/lib/radixtree_search_impl.h
@@ -1,10 +1,10 @@
#if defined(RT_NODE_LEVEL_INNER)
-#define RT_NODE4_TYPE RT_NODE_INNER_4
+#define RT_NODE3_TYPE RT_NODE_INNER_3
#define RT_NODE32_TYPE RT_NODE_INNER_32
#define RT_NODE125_TYPE RT_NODE_INNER_125
#define RT_NODE256_TYPE RT_NODE_INNER_256
#elif defined(RT_NODE_LEVEL_LEAF)
-#define RT_NODE4_TYPE RT_NODE_LEAF_4
+#define RT_NODE3_TYPE RT_NODE_LEAF_3
#define RT_NODE32_TYPE RT_NODE_LEAF_32
#define RT_NODE125_TYPE RT_NODE_LEAF_125
#define RT_NODE256_TYPE RT_NODE_LEAF_256
@@ -27,22 +27,22 @@
switch (node->kind)
{
- case RT_NODE_KIND_4:
+ case RT_NODE_KIND_3:
{
- RT_NODE4_TYPE *n4 = (RT_NODE4_TYPE *) node;
- int idx = RT_NODE_4_SEARCH_EQ((RT_NODE_BASE_4 *) n4, chunk);
+ RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node;
+ int idx = RT_NODE_3_SEARCH_EQ((RT_NODE_BASE_3 *) n3, chunk);
#ifdef RT_ACTION_UPDATE
Assert(idx >= 0);
- n4->children[idx] = new_child;
+ n3->children[idx] = new_child;
#else
if (idx < 0)
return false;
#ifdef RT_NODE_LEVEL_LEAF
- value = n4->values[idx];
+ value = n3->values[idx];
#else
- child = n4->children[idx];
+ child = n3->children[idx];
#endif
#endif /* RT_ACTION_UPDATE */
break;
@@ -125,7 +125,7 @@
return true;
#endif /* RT_ACTION_UPDATE */
-#undef RT_NODE4_TYPE
+#undef RT_NODE3_TYPE
#undef RT_NODE32_TYPE
#undef RT_NODE125_TYPE
#undef RT_NODE256_TYPE
--
2.39.0
Attachment: v22-0011-Expand-commentary-for-kinds-vs.-size-classes.patch (text/x-patch)
From 78faaad01a69a5a81eb219e3f45983c1b466e173 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Sat, 21 Jan 2023 12:52:53 +0700
Subject: [PATCH v22 11/22] Expand commentary for kinds vs. size classes
Also move class enum closer to array and add #undef's
---
src/include/lib/radixtree.h | 76 ++++++++++++++++++++++++++-----------
1 file changed, 53 insertions(+), 23 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 6cc8442c89..4a2dad82bf 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -288,22 +288,26 @@ RT_SCOPE void RT_STATS(RT_RADIX_TREE *tree);
#define BM_BIT(x) ((x) % BITS_PER_BITMAPWORD)
/*
- * Supported radix tree node kinds and size classes.
+ * Node kinds
*
- * There are 4 node kinds and each node kind have one or two size classes,
- * partial and full. The size classes in the same node kind have the same
- * node structure but have the different number of fanout that is stored
- * in 'fanout' of RT_NODE. For example in size class 15, when a 16th element
- * is to be inserted, we allocate a larger area and memcpy the entire old
- * node to it.
+ * The different node kinds are what make the tree "adaptive".
*
- * This technique allows us to limit the node kinds to 4, which limits the
- * number of cases in switch statements. It also allows a possible future
- * optimization to encode the node kind in a pointer tag.
+ * Each node kind is associated with a different datatype and different
+ * search/set/delete/iterate algorithms adapted for its size. The largest
+ * kind, node256 is basically the same as a traditional radix tree,
+ * and would be most wasteful of memory when sparsely populated. The
+ * smaller nodes expend some additional CPU time to enable a smaller
+ * memory footprint.
*
- * These size classes have been chose carefully so that it minimizes the
- * allocator padding in both the inner and leaf nodes on DSA.
- * node
+ * XXX There are 4 node kinds, and this should never be increased,
+ * for several reasons:
+ * 1. With 5 or more kinds, gcc tends to use a jump table for switch
+ * statements.
+ * 2. The 4 kinds can be represented with 2 bits, so we have the option
+ * in the future to tag the node pointer with the kind, even on
+ * platforms with 32-bit pointers. This might speed up node traversal
+ * in trees with highly random node kinds.
+ * 3. We can have multiple size classes per node kind.
*/
#define RT_NODE_KIND_3 0x00
#define RT_NODE_KIND_32 0x01
@@ -320,16 +324,6 @@ RT_SCOPE void RT_STATS(RT_RADIX_TREE *tree);
#endif /* RT_COMMON */
-
-typedef enum RT_SIZE_CLASS
-{
- RT_CLASS_3_FULL = 0,
- RT_CLASS_32_PARTIAL,
- RT_CLASS_32_FULL,
- RT_CLASS_125_FULL,
- RT_CLASS_256
-} RT_SIZE_CLASS;
-
/* Common type for all nodes types */
typedef struct RT_NODE
{
@@ -508,6 +502,37 @@ typedef struct RT_NODE_LEAF_256
RT_VALUE_TYPE values[RT_NODE_MAX_SLOTS];
} RT_NODE_LEAF_256;
+/*
+ * Node size classes
+ *
+ * Nodes of different kinds necessarily belong to different size classes.
+ * The main innovation in our implementation compared to the ART paper
+ * is decoupling the notion of size class from kind.
+ *
+ * The size classes within a given node kind have the same underlying
+ * type, but a variable number of children/values. This is possible
+ * because the base type contains small fixed data structures that
+ * work the same way regardless of how full the node is. We store the
+ * node's allocated capacity in the "fanout" member of RT_NODE, to allow
+ * runtime introspection.
+ *
+ * Growing from one node kind to another requires special code for each
+ * case, but growing from one size class to another within the same kind
+ * is basically just allocate + memcpy.
+ *
+ * The size classes have been chosen so that inner nodes on platforms
+ * with 64-bit pointers (and leaf nodes when using a 64-bit key) are
+ * equal to or slightly smaller than some DSA size class.
+ */
+typedef enum RT_SIZE_CLASS
+{
+ RT_CLASS_3_FULL = 0,
+ RT_CLASS_32_PARTIAL,
+ RT_CLASS_32_FULL,
+ RT_CLASS_125_FULL,
+ RT_CLASS_256
+} RT_SIZE_CLASS;
+
/* Information for each size class */
typedef struct RT_SIZE_CLASS_ELEM
{
@@ -2217,6 +2242,7 @@ RT_DUMP(RT_RADIX_TREE *tree)
#undef NODE_IS_EMPTY
#undef VAR_NODE_HAS_FREE_SLOT
#undef FIXED_NODE_HAS_FREE_SLOT
+#undef RT_NODE_KIND_COUNT
#undef RT_SIZE_CLASS_COUNT
#undef RT_RADIX_TREE_MAGIC
@@ -2229,6 +2255,10 @@ RT_DUMP(RT_RADIX_TREE *tree)
#undef RT_ITER
#undef RT_NODE
#undef RT_NODE_ITER
+#undef RT_NODE_KIND_3
+#undef RT_NODE_KIND_32
+#undef RT_NODE_KIND_125
+#undef RT_NODE_KIND_256
#undef RT_NODE_BASE_3
#undef RT_NODE_BASE_32
#undef RT_NODE_BASE_125
--
2.39.0
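As a supplement to the commentary added in 0011, here is a minimal standalone C sketch of the "size classes within one node kind" idea: growing to a different kind needs dedicated conversion code, but growing within a kind is just allocate + memcpy of the smaller allocation. All names, the 15/32 fanouts, and the layout below are illustrative assumptions, not the tree's actual symbols; it compiles on its own with a C99 compiler.

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* illustrative "node32" kind with two size classes: fanout 15 and fanout 32 */
typedef struct Node32
{
    uint8_t  count;         /* slots in use */
    uint8_t  fanout;        /* allocated capacity: 15 or 32 */
    uint8_t  chunks[32];    /* sorted key bytes; only the first 'fanout' usable */
    uint64_t values[32];    /* only the first 'fanout' slots are allocated */
} Node32;

static size_t
node32_size(uint8_t fanout)
{
    /* allocate only as many value slots as this size class needs */
    return offsetof(Node32, values) + fanout * sizeof(uint64_t);
}

static Node32 *
node32_alloc(uint8_t fanout)
{
    Node32 *n = calloc(1, node32_size(fanout));

    n->fanout = fanout;
    return n;
}

static Node32 *
node32_grow(Node32 *small)
{
    /* same kind, next size class: allocate + memcpy, no per-field logic */
    Node32 *big = node32_alloc(32);

    memcpy(big, small, node32_size(small->fanout));
    big->fanout = 32;
    free(small);
    return big;
}

int
main(void)
{
    Node32 *n = node32_alloc(15);

    for (int i = 0; i < 15; i++)
    {
        n->chunks[i] = (uint8_t) i;
        n->values[i] = (uint64_t) i * 10;
        n->count++;
    }

    /* a 16th entry does not fit in the small class, so grow first */
    n = node32_grow(n);
    n->chunks[n->count] = 15;
    n->values[n->count] = 150;
    n->count++;

    printf("fanout=%u count=%u last=%llu\n",
           n->fanout, n->count, (unsigned long long) n->values[15]);
    free(n);
    return 0;
}

The point mirrored from the patch is that node32_grow() needs no per-field conversion logic, which is what makes adding further size classes within a kind cheap.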
Attachment: v22-0012-Tool-for-measuring-radix-tree-performance.patch (text/x-patch)
From 626a2545ffaaf6e1ee09a502df152fa0597276fa Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 16 Sep 2022 11:57:03 +0900
Subject: [PATCH v22 12/22] Tool for measuring radix tree performance
Includes Meson support, but commented out to avoid warnings
XXX: Not for commit
---
contrib/bench_radix_tree/Makefile | 21 +
.../bench_radix_tree--1.0.sql | 76 ++
contrib/bench_radix_tree/bench_radix_tree.c | 656 ++++++++++++++++++
.../bench_radix_tree/bench_radix_tree.control | 6 +
contrib/bench_radix_tree/expected/bench.out | 13 +
contrib/bench_radix_tree/meson.build | 33 +
contrib/bench_radix_tree/sql/bench.sql | 16 +
contrib/meson.build | 1 +
8 files changed, 822 insertions(+)
create mode 100644 contrib/bench_radix_tree/Makefile
create mode 100644 contrib/bench_radix_tree/bench_radix_tree--1.0.sql
create mode 100644 contrib/bench_radix_tree/bench_radix_tree.c
create mode 100644 contrib/bench_radix_tree/bench_radix_tree.control
create mode 100644 contrib/bench_radix_tree/expected/bench.out
create mode 100644 contrib/bench_radix_tree/meson.build
create mode 100644 contrib/bench_radix_tree/sql/bench.sql
diff --git a/contrib/bench_radix_tree/Makefile b/contrib/bench_radix_tree/Makefile
new file mode 100644
index 0000000000..952bb0ceae
--- /dev/null
+++ b/contrib/bench_radix_tree/Makefile
@@ -0,0 +1,21 @@
+# contrib/bench_radix_tree/Makefile
+
+MODULE_big = bench_radix_tree
+OBJS = \
+ bench_radix_tree.o
+
+EXTENSION = bench_radix_tree
+DATA = bench_radix_tree--1.0.sql
+
+REGRESS = bench_fixed_height
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/bench_radix_tree
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
new file mode 100644
index 0000000000..2fd689aa91
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
@@ -0,0 +1,76 @@
+/* contrib/bench_radix_tree/bench_radix_tree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION bench_radix_tree" to load this file. \quit
+
+create function bench_shuffle_search(
+minblk int4,
+maxblk int4,
+random_block bool DEFAULT false,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT array_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT array_load_ms int8,
+OUT rt_search_ms int8,
+OUT array_search_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_seq_search(
+minblk int4,
+maxblk int4,
+random_block bool DEFAULT false,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT array_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT array_load_ms int8,
+OUT rt_search_ms int8,
+OUT array_search_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_load_random_int(
+cnt int8,
+OUT mem_allocated int8,
+OUT load_ms int8)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_search_random_nodes(
+cnt int8,
+filter_str text DEFAULT NULL,
+OUT mem_allocated int8,
+OUT search_ms int8)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C VOLATILE PARALLEL UNSAFE;
+
+create function bench_fixed_height_search(
+fanout int4,
+OUT fanout int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT rt_search_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_node128_load(
+fanout int4,
+OUT fanout int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT rt_sparseload_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c
new file mode 100644
index 0000000000..4c785c7336
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -0,0 +1,656 @@
+/*-------------------------------------------------------------------------
+ *
+ * bench_radix_tree.c
+ *
+ * Copyright (c) 2016-2022, PostgreSQL Global Development Group
+ *
+ * contrib/bench_radix_tree/bench_radix_tree.c
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "funcapi.h"
+#include "lib/radixtree.h"
+#include <math.h>
+#include "miscadmin.h"
+#include "storage/lwlock.h"
+#include "utils/builtins.h"
+#include "utils/timestamp.h"
+
+PG_MODULE_MAGIC;
+
+/* run benchmark also for binary-search case? */
+/* #define MEASURE_BINARY_SEARCH 1 */
+
+#define TIDS_PER_BLOCK_FOR_LOAD 30
+#define TIDS_PER_BLOCK_FOR_LOOKUP 50
+
+#define RT_DEBUG
+#define RT_PREFIX rt
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#define RT_USE_DELETE
+#define RT_VALUE_TYPE uint64
+// WIP: compiles with warnings because rt_attach is defined but not used
+// #define RT_SHMEM
+#include "lib/radixtree.h"
+
+/*
+ * Return the number of keys in the radix tree.
+ */
+static uint64
+rt_num_entries(rt_radix_tree *tree)
+{
+ return tree->ctl->num_keys;
+}
+
+
+PG_FUNCTION_INFO_V1(bench_seq_search);
+PG_FUNCTION_INFO_V1(bench_shuffle_search);
+PG_FUNCTION_INFO_V1(bench_load_random_int);
+PG_FUNCTION_INFO_V1(bench_fixed_height_search);
+PG_FUNCTION_INFO_V1(bench_search_random_nodes);
+PG_FUNCTION_INFO_V1(bench_node128_load);
+
+static uint64
+tid_to_key_off(ItemPointer tid, uint32 *off)
+{
+ uint64 upper;
+ uint32 shift = pg_ceil_log2_32(MaxHeapTuplesPerPage);
+ int64 tid_i;
+
+ Assert(ItemPointerGetOffsetNumber(tid) < MaxHeapTuplesPerPage);
+
+ tid_i = ItemPointerGetOffsetNumber(tid);
+ tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << shift;
+
+ /* log(sizeof(uint64) * BITS_PER_BYTE, 2) = log(64, 2) = 6 */
+ *off = tid_i & ((1 << 6) - 1);
+ upper = tid_i >> 6;
+ Assert(*off < (sizeof(uint64) * BITS_PER_BYTE));
+
+ Assert(*off < 64);
+
+ return upper;
+}
+
+static int
+shuffle_randrange(pg_prng_state *state, int lower, int upper)
+{
+ return (int) floor(pg_prng_double(state) * ((upper - lower) + 0.999999)) + lower;
+}
+
+/* Naive Fisher-Yates implementation */
+static void
+shuffle_itemptrs(ItemPointer itemptr, uint64 nitems)
+{
+ /* reproducibility */
+ pg_prng_state state;
+
+ pg_prng_seed(&state, 0);
+
+ for (int i = 0; i < nitems - 1; i++)
+ {
+ int j = shuffle_randrange(&state, i, nitems - 1);
+ ItemPointerData t = itemptr[j];
+
+ itemptr[j] = itemptr[i];
+ itemptr[i] = t;
+ }
+}
+
+static ItemPointer
+generate_tids(BlockNumber minblk, BlockNumber maxblk, int ntids_per_blk, uint64 *ntids_p,
+ bool random_block)
+{
+ ItemPointer tids;
+ uint64 maxitems;
+ uint64 ntids = 0;
+ pg_prng_state state;
+
+ maxitems = (maxblk - minblk + 1) * ntids_per_blk;
+ tids = MemoryContextAllocHuge(TopTransactionContext,
+ sizeof(ItemPointerData) * maxitems);
+
+ if (random_block)
+ pg_prng_seed(&state, 0x9E3779B185EBCA87);
+
+ for (BlockNumber blk = minblk; blk < maxblk; blk++)
+ {
+ if (random_block && !pg_prng_bool(&state))
+ continue;
+
+ for (OffsetNumber off = FirstOffsetNumber;
+ off <= ntids_per_blk; off++)
+ {
+ CHECK_FOR_INTERRUPTS();
+
+ ItemPointerSetBlockNumber(&(tids[ntids]), blk);
+ ItemPointerSetOffsetNumber(&(tids[ntids]), off);
+
+ ntids++;
+ }
+ }
+
+ *ntids_p = ntids;
+ return tids;
+}
+
+#ifdef MEASURE_BINARY_SEARCH
+static int
+vac_cmp_itemptr(const void *left, const void *right)
+{
+ BlockNumber lblk,
+ rblk;
+ OffsetNumber loff,
+ roff;
+
+ lblk = ItemPointerGetBlockNumber((ItemPointer) left);
+ rblk = ItemPointerGetBlockNumber((ItemPointer) right);
+
+ if (lblk < rblk)
+ return -1;
+ if (lblk > rblk)
+ return 1;
+
+ loff = ItemPointerGetOffsetNumber((ItemPointer) left);
+ roff = ItemPointerGetOffsetNumber((ItemPointer) right);
+
+ if (loff < roff)
+ return -1;
+ if (loff > roff)
+ return 1;
+
+ return 0;
+}
+#endif
+
+static Datum
+bench_search(FunctionCallInfo fcinfo, bool shuffle)
+{
+ BlockNumber minblk = PG_GETARG_INT32(0);
+ BlockNumber maxblk = PG_GETARG_INT32(1);
+ bool random_block = PG_GETARG_BOOL(2);
+ rt_radix_tree *rt = NULL;
+ uint64 ntids;
+ uint64 key;
+ uint64 last_key = PG_UINT64_MAX;
+ uint64 val = 0;
+ ItemPointer tids;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_load_ms,
+ rt_search_ms;
+ Datum values[7];
+ bool nulls[7];
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ tids = generate_tids(minblk, maxblk, TIDS_PER_BLOCK_FOR_LOAD, &ntids, random_block);
+
+ /* measure the load time of the radix tree */
+ rt = rt_create(CurrentMemoryContext);
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ uint32 off;
+
+ CHECK_FOR_INTERRUPTS();
+
+ key = tid_to_key_off(tid, &off);
+
+ if (last_key != PG_UINT64_MAX && last_key != key)
+ {
+ rt_set(rt, last_key, val);
+ val = 0;
+ }
+
+ last_key = key;
+ val |= (uint64) 1 << off;
+ }
+ if (last_key != PG_UINT64_MAX)
+ rt_set(rt, last_key, val);
+
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_load_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ if (shuffle)
+ shuffle_itemptrs(tids, ntids);
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+ /* measure the search time of the radix tree */
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ uint32 off;
+ volatile bool ret; /* prevent calling rt_search from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ key = tid_to_key_off(tid, &off);
+
+ ret = rt_search(rt, key, &val);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_search_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int64GetDatum(rt_num_entries(rt));
+ values[1] = Int64GetDatum(rt_memory_usage(rt));
+ values[2] = Int64GetDatum(sizeof(ItemPointerData) * ntids);
+ values[3] = Int64GetDatum(rt_load_ms);
+ nulls[4] = true; /* ar_load_ms */
+ values[5] = Int64GetDatum(rt_search_ms);
+ nulls[6] = true; /* ar_search_ms */
+
+#ifdef MEASURE_BINARY_SEARCH
+ {
+ ItemPointer itemptrs = NULL;
+
+ int64 ar_load_ms,
+ ar_search_ms;
+
+ /* measure the load time of the array */
+ itemptrs = MemoryContextAllocHuge(CurrentMemoryContext,
+ sizeof(ItemPointerData) * ntids);
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointerSetBlockNumber(&(itemptrs[i]),
+ ItemPointerGetBlockNumber(&(tids[i])));
+ ItemPointerSetOffsetNumber(&(itemptrs[i]),
+ ItemPointerGetOffsetNumber(&(tids[i])));
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ ar_load_ms = secs * 1000 + usecs / 1000;
+
+ /* next, measure the search time of the array */
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ volatile bool ret; /* prevent calling bsearch from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ ret = bsearch((void *) tid,
+ (void *) itemptrs,
+ ntids,
+ sizeof(ItemPointerData),
+ vac_cmp_itemptr);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ ar_search_ms = secs * 1000 + usecs / 1000;
+
+ /* set the result */
+ nulls[4] = false;
+ values[4] = Int64GetDatum(ar_load_ms);
+ nulls[6] = false;
+ values[6] = Int64GetDatum(ar_search_ms);
+ }
+#endif
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_seq_search(PG_FUNCTION_ARGS)
+{
+ return bench_search(fcinfo, false);
+}
+
+Datum
+bench_shuffle_search(PG_FUNCTION_ARGS)
+{
+ return bench_search(fcinfo, true);
+}
+
+Datum
+bench_load_random_int(PG_FUNCTION_ARGS)
+{
+ uint64 cnt = (uint64) PG_GETARG_INT64(0);
+ rt_radix_tree *rt;
+ pg_prng_state state;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 load_time_ms;
+ Datum values[2];
+ bool nulls[2];
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ pg_prng_seed(&state, 0);
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ uint64 key = pg_prng_uint64(&state);
+
+ rt_set(rt, key, key);
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ load_time_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int64GetDatum(rt_memory_usage(rt));
+ values[1] = Int64GetDatum(load_time_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+/* copy of splitmix64() */
+static uint64
+hash64(uint64 x)
+{
+ x ^= x >> 30;
+ x *= UINT64CONST(0xbf58476d1ce4e5b9);
+ x ^= x >> 27;
+ x *= UINT64CONST(0x94d049bb133111eb);
+ x ^= x >> 31;
+ return x;
+}
+
+/* attempts to have a relatively even population of node kinds */
+Datum
+bench_search_random_nodes(PG_FUNCTION_ARGS)
+{
+ uint64 cnt = (uint64) PG_GETARG_INT64(0);
+ rt_radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 search_time_ms;
+ Datum values[2] = {0};
+ bool nulls[2] = {0};
+ /* from trial and error */
+ uint64 filter = (((uint64) 0x7F<<32) | (0x07<<24) | (0xFF<<16) | 0xFF);
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ if (!PG_ARGISNULL(1))
+ {
+ char *filter_str = text_to_cstring(PG_GETARG_TEXT_P(1));
+
+ if (sscanf(filter_str, "0x%lX", &filter) == 0)
+ elog(ERROR, "invalid filter string %s", filter_str);
+ }
+ elog(NOTICE, "bench with filter 0x%lX", filter);
+
+ rt = rt_create(CurrentMemoryContext);
+
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ const uint64 hash = hash64(i);
+ const uint64 key = hash & filter;
+
+ rt_set(rt, key, key);
+ }
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ const uint64 hash = hash64(i);
+ const uint64 key = hash & filter;
+ uint64 val;
+ volatile bool ret; /* prevent calling rt_search from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ ret = rt_search(rt, key, &val);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ search_time_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ values[0] = Int64GetDatum(rt_memory_usage(rt));
+ values[1] = Int64GetDatum(search_time_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_fixed_height_search(PG_FUNCTION_ARGS)
+{
+ int fanout = PG_GETARG_INT32(0);
+ rt_radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_load_ms,
+ rt_search_ms;
+ Datum values[5];
+ bool nulls[5];
+
+ /* test boundary between vector and iteration */
+ const int n_keys = 5 * 16 * 16 * 16 * 16;
+ uint64 r,
+ h,
+ i,
+ j,
+ k;
+ int key_id;
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+
+ key_id = 0;
+
+ /*
+ * lower nodes have limited fanout, the top is only limited by
+ * bits-per-byte
+ */
+ for (r = 1;; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ for (i = 1; i <= fanout; i++)
+ {
+ for (j = 1; j <= fanout; j++)
+ {
+ for (k = 1; k <= fanout; k++)
+ {
+ uint64 key;
+
+ key = (r << 32) | (h << 24) | (i << 16) | (j << 8) | (k);
+
+ CHECK_FOR_INTERRUPTS();
+
+ key_id++;
+ if (key_id > n_keys)
+ goto finish_set;
+
+ rt_set(rt, key, key_id);
+ }
+ }
+ }
+ }
+ }
+finish_set:
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_load_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ /* measure the search time of the radix tree */
+ start_time = GetCurrentTimestamp();
+
+ key_id = 0;
+ for (r = 1;; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ for (i = 1; i <= fanout; i++)
+ {
+ for (j = 1; j <= fanout; j++)
+ {
+ for (k = 1; k <= fanout; k++)
+ {
+ uint64 key,
+ val;
+
+ key = (r << 32) | (h << 24) | (i << 16) | (j << 8) | (k);
+
+ CHECK_FOR_INTERRUPTS();
+
+ key_id++;
+ if (key_id > n_keys)
+ goto finish_search;
+
+ rt_search(rt, key, &val);
+ }
+ }
+ }
+ }
+ }
+finish_search:
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_search_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int32GetDatum(fanout);
+ values[1] = Int64GetDatum(rt_num_entries(rt));
+ values[2] = Int64GetDatum(rt_memory_usage(rt));
+ values[3] = Int64GetDatum(rt_load_ms);
+ values[4] = Int64GetDatum(rt_search_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_node128_load(PG_FUNCTION_ARGS)
+{
+ int fanout = PG_GETARG_INT32(0);
+ rt_radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_sparseload_ms;
+ Datum values[5];
+ bool nulls[5];
+
+ uint64 r,
+ h;
+ int key_id;
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ rt = rt_create(CurrentMemoryContext);
+
+ key_id = 0;
+
+ for (r = 1; r <= fanout; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (h);
+
+ key_id++;
+ rt_set(rt, key, key_id);
+ }
+ }
+
+ rt_stats(rt);
+
+ /* measure sparse deletion and re-loading */
+ start_time = GetCurrentTimestamp();
+
+ for (int t = 0; t<10000; t++)
+ {
+ /* delete one key in each leaf */
+ for (r = 1; r <= fanout; r++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (fanout);
+
+ rt_delete(rt, key);
+ }
+
+ /* add them all back */
+ key_id = 0;
+ for (r = 1; r <= fanout; r++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (fanout);
+
+ key_id++;
+ rt_set(rt, key, key_id);
+ }
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_sparseload_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int32GetDatum(fanout);
+ values[1] = Int64GetDatum(rt_num_entries(rt));
+ values[2] = Int64GetDatum(rt_memory_usage(rt));
+ values[3] = Int64GetDatum(rt_sparseload_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
diff --git a/contrib/bench_radix_tree/bench_radix_tree.control b/contrib/bench_radix_tree/bench_radix_tree.control
new file mode 100644
index 0000000000..1d988e6c9a
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.control
@@ -0,0 +1,6 @@
+# bench_radix_tree extension
+comment = 'benchmark suite for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/bench_radix_tree'
+relocatable = true
+trusted = true
diff --git a/contrib/bench_radix_tree/expected/bench.out b/contrib/bench_radix_tree/expected/bench.out
new file mode 100644
index 0000000000..60c303892e
--- /dev/null
+++ b/contrib/bench_radix_tree/expected/bench.out
@@ -0,0 +1,13 @@
+create extension bench_radix_tree;
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
diff --git a/contrib/bench_radix_tree/meson.build b/contrib/bench_radix_tree/meson.build
new file mode 100644
index 0000000000..332c1ae7df
--- /dev/null
+++ b/contrib/bench_radix_tree/meson.build
@@ -0,0 +1,33 @@
+bench_radix_tree_sources = files(
+ 'bench_radix_tree.c',
+)
+
+if host_system == 'windows'
+ bench_radix_tree_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'bench_radix_tree',
+ '--FILEDESC', 'bench_radix_tree - performance test code for radix tree',])
+endif
+
+bench_radix_tree = shared_module('bench_radix_tree',
+ bench_radix_tree_sources,
+ link_with: pgport_srv,
+ kwargs: pg_mod_args,
+)
+testprep_targets += bench_radix_tree
+
+install_data(
+ 'bench_radix_tree.control',
+ 'bench_radix_tree--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'bench_radix_tree',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'bench_radix_tree',
+ ],
+ },
+}
diff --git a/contrib/bench_radix_tree/sql/bench.sql b/contrib/bench_radix_tree/sql/bench.sql
new file mode 100644
index 0000000000..a46018c9d4
--- /dev/null
+++ b/contrib/bench_radix_tree/sql/bench.sql
@@ -0,0 +1,16 @@
+create extension bench_radix_tree;
+
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
diff --git a/contrib/meson.build b/contrib/meson.build
index bd4a57c43c..52253de793 100644
--- a/contrib/meson.build
+++ b/contrib/meson.build
@@ -12,6 +12,7 @@ subdir('amcheck')
subdir('auth_delay')
subdir('auto_explain')
subdir('basic_archive')
+#subdir('bench_radix_tree')
subdir('bloom')
subdir('basebackup_to_shell')
subdir('bool_plperl')
--
2.39.0
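For readers of the benchmark module, here is a worked, self-contained sketch of the TID-to-key encoding that tid_to_key_off() performs: the offset number occupies the low bits, the block number the high bits, and the combined integer is split into a bit position within a 64-bit value (the low 6 bits) and a radix tree key (the rest). OFFSET_BITS below is an assumption standing in for pg_ceil_log2_32(MaxHeapTuplesPerPage), which is 9 for 8kB heap pages; nothing here is PostgreSQL API.

#include <stdint.h>
#include <stdio.h>

#define OFFSET_BITS 9    /* stand-in for pg_ceil_log2_32(MaxHeapTuplesPerPage) */

static uint64_t
tid_to_key(uint32_t block, uint16_t offnum, uint32_t *bit)
{
    uint64_t tid_i = offnum | ((uint64_t) block << OFFSET_BITS);

    *bit = (uint32_t) (tid_i & 63);   /* bit position within the 64-bit value */
    return tid_i >> 6;                /* radix tree key */
}

int
main(void)
{
    uint32_t bit;
    uint64_t key = tid_to_key(1000, 5, &bit);

    printf("key=%llu bit=%u\n", (unsigned long long) key, bit);   /* key=8000 bit=5 */
    return 0;
}

So (block 1000, offset 5) lands in key 8000 with bit 5 set, and up to 64 consecutive encoded positions share one key/value pair, which is what keeps the per-TID memory footprint low.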
Attachment: v22-0013-Get-rid-of-NODE_IS_EMPTY-macro.patch (text/x-patch)
From d9944828bfc3ab39f29b522aadedda6e5d978041 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Sat, 21 Jan 2023 13:40:28 +0700
Subject: [PATCH v22 13/22] Get rid of NODE_IS_EMPTY macro
It's already pretty clear what "count == 0" means, and the
existing comments make it obvious.
---
src/include/lib/radixtree.h | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 4a2dad82bf..567eab4bc8 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -372,7 +372,6 @@ typedef struct RT_NODE
#endif
#define NODE_IS_LEAF(n) (((RT_PTR_LOCAL) (n))->shift == 0)
-#define NODE_IS_EMPTY(n) (((RT_PTR_LOCAL) (n))->count == 0)
#define VAR_NODE_HAS_FREE_SLOT(node) \
((node)->base.n.count < (node)->base.n.fanout)
#define FIXED_NODE_HAS_FREE_SLOT(node, class) \
@@ -1701,7 +1700,7 @@ RT_DELETE(RT_RADIX_TREE *tree, uint64 key)
* Return if the leaf node still has keys and we don't need to delete the
* node.
*/
- if (!NODE_IS_EMPTY(node))
+ if (node->count > 0)
return true;
/* Free the empty leaf node */
@@ -1717,7 +1716,7 @@ RT_DELETE(RT_RADIX_TREE *tree, uint64 key)
Assert(deleted);
/* If the node didn't become empty, we stop deleting the key */
- if (!NODE_IS_EMPTY(node))
+ if (node->count > 0)
break;
/* The node became empty */
@@ -2239,7 +2238,6 @@ RT_DUMP(RT_RADIX_TREE *tree)
/* locally declared macros */
#undef NODE_IS_LEAF
-#undef NODE_IS_EMPTY
#undef VAR_NODE_HAS_FREE_SLOT
#undef FIXED_NODE_HAS_FREE_SLOT
#undef RT_NODE_KIND_COUNT
--
2.39.0
Attachment: v22-0014-Add-some-comments-for-insert-logic.patch (text/x-patch)
From dec37d66a36728ea9581ac51b91ab91850ec0e3b Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Sat, 21 Jan 2023 14:21:55 +0700
Subject: [PATCH v22 14/22] Add some comments for insert logic
---
src/include/lib/radixtree.h | 29 ++++++++++++++++++++++---
src/include/lib/radixtree_insert_impl.h | 5 +++++
2 files changed, 31 insertions(+), 3 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 567eab4bc8..d48c915373 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -731,8 +731,8 @@ RT_NODE_3_GET_INSERTPOS(RT_NODE_BASE_3 *node, uint8 chunk)
}
/*
- * Return index of the first element in 'base' that equals 'key'. Return -1
- * if there is no such element.
+ * Return index of the first element in the node's chunk array that equals
+ * 'chunk'. Return -1 if there is no such element.
*/
static inline int
RT_NODE_32_SEARCH_EQ(RT_NODE_BASE_32 *node, uint8 chunk)
@@ -762,14 +762,22 @@ RT_NODE_32_SEARCH_EQ(RT_NODE_BASE_32 *node, uint8 chunk)
#endif
#ifndef USE_NO_SIMD
+ /* replicate the search key */
spread_chunk = vector8_broadcast(chunk);
+
+ /* compare to the 32 keys stored in the node */
vector8_load(&haystack1, &node->chunks[0]);
vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
cmp1 = vector8_eq(spread_chunk, haystack1);
cmp2 = vector8_eq(spread_chunk, haystack2);
+
+ /* convert comparison to a bitfield */
bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+
+ /* mask off invalid entries */
bitfield &= ((UINT64CONST(1) << count) - 1);
+ /* convert bitfield to index by counting trailing zeros */
if (bitfield)
index_simd = pg_rightmost_one_pos32(bitfield);
@@ -781,7 +789,8 @@ RT_NODE_32_SEARCH_EQ(RT_NODE_BASE_32 *node, uint8 chunk)
}
/*
- * Return index of the chunk to insert into chunks in the given node.
+ * Return index of the node's chunk array to insert into,
+ * such that the chunk array remains ordered.
*/
static inline int
RT_NODE_32_GET_INSERTPOS(RT_NODE_BASE_32 *node, uint8 chunk)
@@ -804,12 +813,26 @@ RT_NODE_32_GET_INSERTPOS(RT_NODE_BASE_32 *node, uint8 chunk)
for (index = 0; index < count; index++)
{
+ /*
+ * This is coded with '>=' to match what we can do with SIMD,
+ * with an assert to keep us honest.
+ */
if (node->chunks[index] >= chunk)
+ {
+ Assert(node->chunks[index] != chunk);
break;
+ }
}
#endif
#ifndef USE_NO_SIMD
+ /*
+ * This is a bit more complicated than RT_NODE_32_SEARCH_EQ(), because
+ * no unsigned uint8 comparison instruction exists, at least for SSE2. So
+ * we need to play some trickery using vector8_min() to effectively get
+ * <=. There'll never be any equal elements in the current uses, but that's
+ * what we get here...
+ */
spread_chunk = vector8_broadcast(chunk);
vector8_load(&haystack1, &node->chunks[0]);
vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
diff --git a/src/include/lib/radixtree_insert_impl.h b/src/include/lib/radixtree_insert_impl.h
index 16461bdb03..8470c8fc70 100644
--- a/src/include/lib/radixtree_insert_impl.h
+++ b/src/include/lib/radixtree_insert_impl.h
@@ -162,6 +162,11 @@
#endif
}
+ /*
+ * Since we just copied a dense array, we can set the bits
+ * using a single store, provided the length of that array
+ * is at most the number of bits in a bitmapword.
+ */
Assert(class32_max.fanout <= sizeof(bitmapword) * BITS_PER_BYTE);
new125->base.isset[0] = (bitmapword) (((uint64) 1 << class32_max.fanout) - 1);
--
2.39.0
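To make the comments added in 0014 concrete, below is a standalone sketch of the broadcast / compare / movemask / count-trailing-zeros search that RT_NODE_32_SEARCH_EQ performs through the vector8 wrappers. This version uses raw SSE2 intrinsics and a single 16-byte vector (the node32 code compares 32 bytes with two loads), and __builtin_ctz assumes gcc or clang; none of this is PostgreSQL's simd.h API, just an illustration of the technique.

#include <emmintrin.h>   /* SSE2 */
#include <stdint.h>
#include <stdio.h>

static int
search_eq_16(const uint8_t *chunks, int count, uint8_t key)
{
    __m128i  spread = _mm_set1_epi8((char) key);                /* replicate the key byte */
    __m128i  haystack = _mm_loadu_si128((const __m128i *) chunks);
    __m128i  cmp = _mm_cmpeq_epi8(spread, haystack);            /* 0xFF in matching lanes */
    uint32_t bitfield = (uint32_t) _mm_movemask_epi8(cmp);      /* one bit per lane */

    bitfield &= (1u << count) - 1;                              /* ignore slots past 'count' */
    return bitfield ? __builtin_ctz(bitfield) : -1;             /* index of first match */
}

int
main(void)
{
    uint8_t chunks[16] = {2, 5, 9, 17, 42};   /* 5 slots in use, rest zero */

    printf("%d %d\n",
           search_eq_16(chunks, 5, 42),       /* 4 */
           search_eq_16(chunks, 5, 7));       /* -1 */
    return 0;
}

On x86-64 this builds with plain gcc since SSE2 is baseline; masking with (1 << count) - 1 is what keeps stale bytes past 'count' from producing false matches, mirroring the masking step described in the patch's comments.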
Attachment: v22-0015-Get-rid-of-FIXED_NODE_HAS_FREE_SLOT.patch (text/x-patch)
From 23527a3d2b725a4f3876125e5f663540ab411e92 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Sun, 22 Jan 2023 11:53:33 +0700
Subject: [PATCH v22 15/22] Get rid of FIXED_NODE_HAS_FREE_SLOT
It's only used in one assert for the node256 kind, whose
fanout is necessarily fixed, and we already have a
convenient macro to compare that with.
---
src/include/lib/radixtree.h | 3 ---
src/include/lib/radixtree_insert_impl.h | 2 +-
2 files changed, 1 insertion(+), 4 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index d48c915373..8fbc0b5086 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -374,8 +374,6 @@ typedef struct RT_NODE
#define NODE_IS_LEAF(n) (((RT_PTR_LOCAL) (n))->shift == 0)
#define VAR_NODE_HAS_FREE_SLOT(node) \
((node)->base.n.count < (node)->base.n.fanout)
-#define FIXED_NODE_HAS_FREE_SLOT(node, class) \
- ((node)->base.n.count < RT_SIZE_CLASS_INFO[class].fanout)
/* Base type of each node kinds for leaf and inner nodes */
/* The base types must be a be able to accommodate the largest size
@@ -2262,7 +2260,6 @@ RT_DUMP(RT_RADIX_TREE *tree)
/* locally declared macros */
#undef NODE_IS_LEAF
#undef VAR_NODE_HAS_FREE_SLOT
-#undef FIXED_NODE_HAS_FREE_SLOT
#undef RT_NODE_KIND_COUNT
#undef RT_SIZE_CLASS_COUNT
#undef RT_RADIX_TREE_MAGIC
diff --git a/src/include/lib/radixtree_insert_impl.h b/src/include/lib/radixtree_insert_impl.h
index 8470c8fc70..b484b7a099 100644
--- a/src/include/lib/radixtree_insert_impl.h
+++ b/src/include/lib/radixtree_insert_impl.h
@@ -286,7 +286,7 @@
#else
chunk_exists = RT_NODE_INNER_256_IS_CHUNK_USED(n256, chunk);
#endif
- Assert(chunk_exists || FIXED_NODE_HAS_FREE_SLOT(n256, RT_CLASS_256));
+ Assert(chunk_exists || node->count < RT_NODE_MAX_SLOTS);
#ifdef RT_NODE_LEVEL_LEAF
RT_NODE_LEAF_256_SET(n256, chunk, value);
--
2.39.0
Attachment: v22-0016-s-VAR_NODE_HAS_FREE_SLOT-RT_NODE_MUST_GROW.patch (text/x-patch)
From 48033e8a97ff0d8f6276578c0ffd86209a2e129b Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Sun, 22 Jan 2023 12:11:11 +0700
Subject: [PATCH v22 16/22] s/VAR_NODE_HAS_FREE_SLOT/RT_NODE_MUST_GROW/
---
src/include/lib/radixtree.h | 6 +++---
src/include/lib/radixtree_insert_impl.h | 8 ++++----
2 files changed, 7 insertions(+), 7 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 8fbc0b5086..cd8b8d1c22 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -372,8 +372,8 @@ typedef struct RT_NODE
#endif
#define NODE_IS_LEAF(n) (((RT_PTR_LOCAL) (n))->shift == 0)
-#define VAR_NODE_HAS_FREE_SLOT(node) \
- ((node)->base.n.count < (node)->base.n.fanout)
+#define RT_NODE_MUST_GROW(node) \
+ ((node)->base.n.count == (node)->base.n.fanout)
/* Base type of each node kinds for leaf and inner nodes */
/* The base types must be a be able to accommodate the largest size
@@ -2259,7 +2259,7 @@ RT_DUMP(RT_RADIX_TREE *tree)
/* locally declared macros */
#undef NODE_IS_LEAF
-#undef VAR_NODE_HAS_FREE_SLOT
+#undef RT_NODE_MUST_GROW
#undef RT_NODE_KIND_COUNT
#undef RT_SIZE_CLASS_COUNT
#undef RT_RADIX_TREE_MAGIC
diff --git a/src/include/lib/radixtree_insert_impl.h b/src/include/lib/radixtree_insert_impl.h
index b484b7a099..a0f46b37d3 100644
--- a/src/include/lib/radixtree_insert_impl.h
+++ b/src/include/lib/radixtree_insert_impl.h
@@ -43,7 +43,7 @@
break;
}
- if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n3)))
+ if (unlikely(RT_NODE_MUST_GROW(n3)))
{
RT_PTR_ALLOC allocnode;
RT_PTR_LOCAL newnode;
@@ -114,7 +114,7 @@
break;
}
- if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)) &&
+ if (unlikely(RT_NODE_MUST_GROW(n32)) &&
n32->base.n.fanout == class32_min.fanout)
{
RT_PTR_ALLOC allocnode;
@@ -137,7 +137,7 @@
node = newnode;
}
- if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)))
+ if (unlikely(RT_NODE_MUST_GROW(n32)))
{
RT_PTR_ALLOC allocnode;
RT_PTR_LOCAL newnode;
@@ -218,7 +218,7 @@
break;
}
- if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n125)))
+ if (unlikely(RT_NODE_MUST_GROW(n125)))
{
RT_PTR_ALLOC allocnode;
RT_PTR_LOCAL newnode;
--
2.39.0
Attachment: v22-0018-Clean-up-symbols.patch (text/x-patch)
From 67984ba863923017a7c9f976be58fef706eeccd2 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Sun, 22 Jan 2023 14:37:53 +0700
Subject: [PATCH v22 18/22] Clean up symbols
Remove remaining stragglers that weren't named "RT_*"
and get rid of the temporary expedient RT_COMMON
block in favor of explicit #undefs everywhere.
---
src/include/lib/radixtree.h | 91 ++++++++++++++-----------
src/include/lib/radixtree_delete_impl.h | 4 +-
src/include/lib/radixtree_insert_impl.h | 4 +-
src/include/lib/radixtree_iter_impl.h | 4 +-
src/include/lib/radixtree_search_impl.h | 4 +-
5 files changed, 58 insertions(+), 49 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 7c3f3dcf4f..95124696ef 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -246,14 +246,6 @@ RT_SCOPE void RT_STATS(RT_RADIX_TREE *tree);
/* generate implementation of the radix tree */
#ifdef RT_DEFINE
-/* macros and types common to all implementations */
-#ifndef RT_COMMON
-#define RT_COMMON
-
-#ifdef RT_DEBUG
-#define UINT64_FORMAT_HEX "%" INT64_MODIFIER "X"
-#endif
-
/* The number of bits encoded in one tree level */
#define RT_NODE_SPAN BITS_PER_BYTE
@@ -321,8 +313,6 @@ RT_SCOPE void RT_STATS(RT_RADIX_TREE *tree);
#define RT_SLAB_BLOCK_SIZE(size) \
Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * (size), (size) * 32)
-#endif /* RT_COMMON */
-
/* Common type for all nodes types */
typedef struct RT_NODE
{
@@ -370,7 +360,7 @@ typedef struct RT_NODE
#define RT_INVALID_PTR_ALLOC NULL
#endif
-#define NODE_IS_LEAF(n) (((RT_PTR_LOCAL) (n))->shift == 0)
+#define RT_NODE_IS_LEAF(n) (((RT_PTR_LOCAL) (n))->shift == 0)
#define RT_NODE_MUST_GROW(node) \
((node)->base.n.count == (node)->base.n.fanout)
@@ -916,14 +906,14 @@ RT_NODE_125_IS_CHUNK_USED(RT_NODE_BASE_125 *node, uint8 chunk)
static inline RT_PTR_ALLOC
RT_NODE_INNER_125_GET_CHILD(RT_NODE_INNER_125 *node, uint8 chunk)
{
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
return node->children[node->base.slot_idxs[chunk]];
}
static inline RT_VALUE_TYPE
RT_NODE_LEAF_125_GET_VALUE(RT_NODE_LEAF_125 *node, uint8 chunk)
{
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
Assert(((RT_NODE_BASE_125 *) node)->slot_idxs[chunk] != RT_INVALID_SLOT_IDX);
return node->values[node->base.slot_idxs[chunk]];
}
@@ -934,7 +924,7 @@ RT_NODE_LEAF_125_GET_VALUE(RT_NODE_LEAF_125 *node, uint8 chunk)
static inline bool
RT_NODE_INNER_256_IS_CHUNK_USED(RT_NODE_INNER_256 *node, uint8 chunk)
{
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
return node->children[chunk] != RT_INVALID_PTR_ALLOC;
}
@@ -944,14 +934,14 @@ RT_NODE_LEAF_256_IS_CHUNK_USED(RT_NODE_LEAF_256 *node, uint8 chunk)
int idx = BM_IDX(chunk);
int bitnum = BM_BIT(chunk);
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
return (node->isset[idx] & ((bitmapword) 1 << bitnum)) != 0;
}
static inline RT_PTR_ALLOC
RT_NODE_INNER_256_GET_CHILD(RT_NODE_INNER_256 *node, uint8 chunk)
{
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
Assert(RT_NODE_INNER_256_IS_CHUNK_USED(node, chunk));
return node->children[chunk];
}
@@ -959,7 +949,7 @@ RT_NODE_INNER_256_GET_CHILD(RT_NODE_INNER_256 *node, uint8 chunk)
static inline RT_VALUE_TYPE
RT_NODE_LEAF_256_GET_VALUE(RT_NODE_LEAF_256 *node, uint8 chunk)
{
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
Assert(RT_NODE_LEAF_256_IS_CHUNK_USED(node, chunk));
return node->values[chunk];
}
@@ -968,7 +958,7 @@ RT_NODE_LEAF_256_GET_VALUE(RT_NODE_LEAF_256 *node, uint8 chunk)
static inline void
RT_NODE_INNER_256_SET(RT_NODE_INNER_256 *node, uint8 chunk, RT_PTR_ALLOC child)
{
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
node->children[chunk] = child;
}
@@ -979,7 +969,7 @@ RT_NODE_LEAF_256_SET(RT_NODE_LEAF_256 *node, uint8 chunk, RT_VALUE_TYPE value)
int idx = BM_IDX(chunk);
int bitnum = BM_BIT(chunk);
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
node->isset[idx] |= ((bitmapword) 1 << bitnum);
node->values[chunk] = value;
}
@@ -988,7 +978,7 @@ RT_NODE_LEAF_256_SET(RT_NODE_LEAF_256 *node, uint8 chunk, RT_VALUE_TYPE value)
static inline void
RT_NODE_INNER_256_DELETE(RT_NODE_INNER_256 *node, uint8 chunk)
{
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
node->children[chunk] = RT_INVALID_PTR_ALLOC;
}
@@ -998,7 +988,7 @@ RT_NODE_LEAF_256_DELETE(RT_NODE_LEAF_256 *node, uint8 chunk)
int idx = BM_IDX(chunk);
int bitnum = BM_BIT(chunk);
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
node->isset[idx] &= ~((bitmapword) 1 << bitnum);
}
@@ -1458,7 +1448,7 @@ RT_FREE_RECURSE(RT_RADIX_TREE *tree, RT_PTR_ALLOC ptr)
CHECK_FOR_INTERRUPTS();
/* The leaf node doesn't have child pointers */
- if (NODE_IS_LEAF(node))
+ if (RT_NODE_IS_LEAF(node))
{
dsa_free(tree->dsa, ptr);
return;
@@ -1587,7 +1577,7 @@ RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE value)
child = RT_PTR_GET_LOCAL(tree, stored_child);
- if (NODE_IS_LEAF(child))
+ if (RT_NODE_IS_LEAF(child))
break;
if (!RT_NODE_SEARCH_INNER(child, key, &new_child))
@@ -1637,7 +1627,7 @@ RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p)
{
RT_PTR_ALLOC child;
- if (NODE_IS_LEAF(node))
+ if (RT_NODE_IS_LEAF(node))
break;
if (!RT_NODE_SEARCH_INNER(node, key, &child))
@@ -1788,7 +1778,7 @@ RT_UPDATE_ITER_STACK(RT_ITER *iter, RT_PTR_LOCAL from_node, int from)
node_iter->current_idx = -1;
/* We don't advance the leaf node iterator here */
- if (NODE_IS_LEAF(node))
+ if (RT_NODE_IS_LEAF(node))
return;
/* Advance to the next slot in the inner node */
@@ -1972,7 +1962,7 @@ RT_VERIFY_NODE(RT_PTR_LOCAL node)
}
case RT_NODE_KIND_256:
{
- if (NODE_IS_LEAF(node))
+ if (RT_NODE_IS_LEAF(node))
{
RT_NODE_LEAF_256 *n256 = (RT_NODE_LEAF_256 *) node;
int cnt = 0;
@@ -1992,6 +1982,9 @@ RT_VERIFY_NODE(RT_PTR_LOCAL node)
/***************** DEBUG FUNCTIONS *****************/
#ifdef RT_DEBUG
+
+#define RT_UINT64_FORMAT_HEX "%" INT64_MODIFIER "X"
+
RT_SCOPE void
RT_STATS(RT_RADIX_TREE *tree)
{
@@ -2012,7 +2005,7 @@ RT_DUMP_NODE(RT_PTR_LOCAL node, int level, bool recurse)
char space[125] = {0};
fprintf(stderr, "[%s] kind %d, fanout %d, count %u, shift %u:\n",
- NODE_IS_LEAF(node) ? "LEAF" : "INNR",
+ RT_NODE_IS_LEAF(node) ? "LEAF" : "INNR",
(node->kind == RT_NODE_KIND_3) ? 3 :
(node->kind == RT_NODE_KIND_32) ? 32 :
(node->kind == RT_NODE_KIND_125) ? 125 : 256,
@@ -2028,11 +2021,11 @@ RT_DUMP_NODE(RT_PTR_LOCAL node, int level, bool recurse)
{
for (int i = 0; i < node->count; i++)
{
- if (NODE_IS_LEAF(node))
+ if (RT_NODE_IS_LEAF(node))
{
RT_NODE_LEAF_3 *n3 = (RT_NODE_LEAF_3 *) node;
- fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ fprintf(stderr, "%schunk 0x%X value 0x" RT_UINT64_FORMAT_HEX "\n",
space, n3->base.chunks[i], (uint64) n3->values[i]);
}
else
@@ -2054,11 +2047,11 @@ RT_DUMP_NODE(RT_PTR_LOCAL node, int level, bool recurse)
{
for (int i = 0; i < node->count; i++)
{
- if (NODE_IS_LEAF(node))
+ if (RT_NODE_IS_LEAF(node))
{
RT_NODE_LEAF_32 *n32 = (RT_NODE_LEAF_32 *) node;
- fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ fprintf(stderr, "%schunk 0x%X value 0x" RT_UINT64_FORMAT_HEX "\n",
space, n32->base.chunks[i], (uint64) n32->values[i]);
}
else
@@ -2090,14 +2083,14 @@ RT_DUMP_NODE(RT_PTR_LOCAL node, int level, bool recurse)
fprintf(stderr, " [%d]=%d, ", i, b125->slot_idxs[i]);
}
- if (NODE_IS_LEAF(node))
+ if (RT_NODE_IS_LEAF(node))
{
RT_NODE_LEAF_125 *n = (RT_NODE_LEAF_125 *) node;
fprintf(stderr, ", isset-bitmap:");
for (int i = 0; i < BM_IDX(RT_SLOT_IDX_LIMIT); i++)
{
- fprintf(stderr, UINT64_FORMAT_HEX " ", (uint64) n->base.isset[i]);
+ fprintf(stderr, RT_UINT64_FORMAT_HEX " ", (uint64) n->base.isset[i]);
}
fprintf(stderr, "\n");
}
@@ -2107,11 +2100,11 @@ RT_DUMP_NODE(RT_PTR_LOCAL node, int level, bool recurse)
if (!RT_NODE_125_IS_CHUNK_USED(b125, i))
continue;
- if (NODE_IS_LEAF(node))
+ if (RT_NODE_IS_LEAF(node))
{
RT_NODE_LEAF_125 *n125 = (RT_NODE_LEAF_125 *) b125;
- fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ fprintf(stderr, "%schunk 0x%X value 0x" RT_UINT64_FORMAT_HEX "\n",
space, i, (uint64) RT_NODE_LEAF_125_GET_VALUE(n125, i));
}
else
@@ -2134,14 +2127,14 @@ RT_DUMP_NODE(RT_PTR_LOCAL node, int level, bool recurse)
{
for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
{
- if (NODE_IS_LEAF(node))
+ if (RT_NODE_IS_LEAF(node))
{
RT_NODE_LEAF_256 *n256 = (RT_NODE_LEAF_256 *) node;
if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, i))
continue;
- fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ fprintf(stderr, "%schunk 0x%X value 0x" RT_UINT64_FORMAT_HEX "\n",
space, i, (uint64) RT_NODE_LEAF_256_GET_VALUE(n256, i));
}
else
@@ -2174,7 +2167,7 @@ RT_DUMP_SEARCH(RT_RADIX_TREE *tree, uint64 key)
int level = 0;
elog(NOTICE, "-----------------------------------------------------------");
- elog(NOTICE, "max_val = " UINT64_FORMAT "(0x" UINT64_FORMAT_HEX ")",
+ elog(NOTICE, "max_val = " UINT64_FORMAT "(0x" RT_UINT64_FORMAT_HEX ")",
tree->ctl->max_val, tree->ctl->max_val);
if (!tree->ctl->root)
@@ -2185,7 +2178,7 @@ RT_DUMP_SEARCH(RT_RADIX_TREE *tree, uint64 key)
if (key > tree->ctl->max_val)
{
- elog(NOTICE, "key " UINT64_FORMAT "(0x" UINT64_FORMAT_HEX ") is larger than max val",
+ elog(NOTICE, "key " UINT64_FORMAT "(0x" RT_UINT64_FORMAT_HEX ") is larger than max val",
key, key);
return;
}
@@ -2198,7 +2191,7 @@ RT_DUMP_SEARCH(RT_RADIX_TREE *tree, uint64 key)
RT_DUMP_NODE(node, level, false);
- if (NODE_IS_LEAF(node))
+ if (RT_NODE_IS_LEAF(node))
{
uint64 dummy;
@@ -2249,15 +2242,30 @@ RT_DUMP(RT_RADIX_TREE *tree)
#undef RT_VALUE_TYPE
/* locally declared macros */
-#undef NODE_IS_LEAF
+#undef RT_MAKE_PREFIX
+#undef RT_MAKE_NAME
+#undef RT_MAKE_NAME_
+#undef RT_NODE_SPAN
+#undef RT_NODE_MAX_SLOTS
+#undef RT_CHUNK_MASK
+#undef RT_MAX_SHIFT
+#undef RT_MAX_LEVEL
+#undef RT_GET_KEY_CHUNK
+#undef BM_IDX
+#undef BM_BIT
+#undef RT_NODE_IS_LEAF
#undef RT_NODE_MUST_GROW
#undef RT_NODE_KIND_COUNT
#undef RT_SIZE_CLASS_COUNT
+#undef RT_INVALID_SLOT_IDX
+#undef RT_SLAB_BLOCK_SIZE
#undef RT_RADIX_TREE_MAGIC
+#undef RT_UINT64_FORMAT_HEX
/* type declarations */
#undef RT_RADIX_TREE
#undef RT_RADIX_TREE_CONTROL
+#undef RT_PTR_LOCAL
#undef RT_PTR_ALLOC
#undef RT_INVALID_PTR_ALLOC
#undef RT_HANDLE
@@ -2295,6 +2303,7 @@ RT_DUMP(RT_RADIX_TREE *tree)
#undef RT_ATTACH
#undef RT_DETACH
#undef RT_GET_HANDLE
+#undef RT_SEARCH
#undef RT_SET
#undef RT_BEGIN_ITERATE
#undef RT_ITERATE_NEXT
diff --git a/src/include/lib/radixtree_delete_impl.h b/src/include/lib/radixtree_delete_impl.h
index b9f07f4eb5..99c90771b9 100644
--- a/src/include/lib/radixtree_delete_impl.h
+++ b/src/include/lib/radixtree_delete_impl.h
@@ -17,9 +17,9 @@
uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
#ifdef RT_NODE_LEVEL_LEAF
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
#else
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
#endif
switch (node->kind)
diff --git a/src/include/lib/radixtree_insert_impl.h b/src/include/lib/radixtree_insert_impl.h
index e3c3f7a69d..0fcebf1c6b 100644
--- a/src/include/lib/radixtree_insert_impl.h
+++ b/src/include/lib/radixtree_insert_impl.h
@@ -17,10 +17,10 @@
#ifdef RT_NODE_LEVEL_LEAF
const bool inner = false;
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
#else
const bool inner = true;
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
#endif
switch (node->kind)
diff --git a/src/include/lib/radixtree_iter_impl.h b/src/include/lib/radixtree_iter_impl.h
index c428531438..823d7107c4 100644
--- a/src/include/lib/radixtree_iter_impl.h
+++ b/src/include/lib/radixtree_iter_impl.h
@@ -18,11 +18,11 @@
#ifdef RT_NODE_LEVEL_LEAF
RT_VALUE_TYPE value;
- Assert(NODE_IS_LEAF(node_iter->node));
+ Assert(RT_NODE_IS_LEAF(node_iter->node));
#else
RT_PTR_LOCAL child = NULL;
- Assert(!NODE_IS_LEAF(node_iter->node));
+ Assert(!RT_NODE_IS_LEAF(node_iter->node));
#endif
#ifdef RT_SHMEM
diff --git a/src/include/lib/radixtree_search_impl.h b/src/include/lib/radixtree_search_impl.h
index 31138b6a72..c4352045c8 100644
--- a/src/include/lib/radixtree_search_impl.h
+++ b/src/include/lib/radixtree_search_impl.h
@@ -17,12 +17,12 @@
#ifdef RT_NODE_LEVEL_LEAF
RT_VALUE_TYPE value = 0;
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
#else
#ifndef RT_ACTION_UPDATE
RT_PTR_ALLOC child = RT_INVALID_PTR_ALLOC;
#endif
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
#endif
switch (node->kind)
--
2.39.0
Attachment: v22-0017-Remove-some-maintenance-hazards-in-growing-nodes.patch (text/x-patch)
From 57a34d75a143086ba8bb3920486747957b87552d Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Sun, 22 Jan 2023 13:29:18 +0700
Subject: [PATCH v22 17/22] Remove some maintenance hazards in growing nodes
Arrange so that kinds with only one size class have no
"full" suffix. This ensures that splitting such a class
into multiple classes will force compilation errors if
the dev has not thought through which new class should
apply in each case.
For node32, make growing into a new size class a bit
more general. It's not clear we would ever need more
than 2 classes, but let's not put up additional road
blocks. Change partial/full to min/max. It's a bit
shorter this way, matches some newer coding, and allows
for the possibility of a "mid" class.
Also remove RT_KIND_MIN_SIZE_CLASS, since it doesn't
reduce the need for future changes, only makes such
a change further away from the effect.
In passing, move a declaration to the block where it's used.
---
src/include/lib/radixtree.h | 66 +++++++++++--------------
src/include/lib/radixtree_insert_impl.h | 16 +++---
2 files changed, 37 insertions(+), 45 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index cd8b8d1c22..7c3f3dcf4f 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -196,12 +196,11 @@
#define RT_SIZE_CLASS RT_MAKE_NAME(size_class)
#define RT_SIZE_CLASS_ELEM RT_MAKE_NAME(size_class_elem)
#define RT_SIZE_CLASS_INFO RT_MAKE_NAME(size_class_info)
-#define RT_CLASS_3_FULL RT_MAKE_NAME(class_3_full)
-#define RT_CLASS_32_PARTIAL RT_MAKE_NAME(class_32_partial)
-#define RT_CLASS_32_FULL RT_MAKE_NAME(class_32_full)
-#define RT_CLASS_125_FULL RT_MAKE_NAME(class_125_full)
+#define RT_CLASS_3 RT_MAKE_NAME(class_3)
+#define RT_CLASS_32_MIN RT_MAKE_NAME(class_32_min)
+#define RT_CLASS_32_MAX RT_MAKE_NAME(class_32_max)
+#define RT_CLASS_125 RT_MAKE_NAME(class_125)
#define RT_CLASS_256 RT_MAKE_NAME(class_256)
-#define RT_KIND_MIN_SIZE_CLASS RT_MAKE_NAME(kind_min_size_class)
/* generate forward declarations necessary to use the radix tree */
#ifdef RT_DECLARE
@@ -523,10 +522,10 @@ typedef struct RT_NODE_LEAF_256
*/
typedef enum RT_SIZE_CLASS
{
- RT_CLASS_3_FULL = 0,
- RT_CLASS_32_PARTIAL,
- RT_CLASS_32_FULL,
- RT_CLASS_125_FULL,
+ RT_CLASS_3 = 0,
+ RT_CLASS_32_MIN,
+ RT_CLASS_32_MAX,
+ RT_CLASS_125,
RT_CLASS_256
} RT_SIZE_CLASS;
@@ -542,25 +541,25 @@ typedef struct RT_SIZE_CLASS_ELEM
} RT_SIZE_CLASS_ELEM;
static const RT_SIZE_CLASS_ELEM RT_SIZE_CLASS_INFO[] = {
- [RT_CLASS_3_FULL] = {
+ [RT_CLASS_3] = {
.name = "radix tree node 3",
.fanout = 3,
.inner_size = sizeof(RT_NODE_INNER_3) + 3 * sizeof(RT_PTR_ALLOC),
.leaf_size = sizeof(RT_NODE_LEAF_3) + 3 * sizeof(RT_VALUE_TYPE),
},
- [RT_CLASS_32_PARTIAL] = {
+ [RT_CLASS_32_MIN] = {
.name = "radix tree node 15",
.fanout = 15,
.inner_size = sizeof(RT_NODE_INNER_32) + 15 * sizeof(RT_PTR_ALLOC),
.leaf_size = sizeof(RT_NODE_LEAF_32) + 15 * sizeof(RT_VALUE_TYPE),
},
- [RT_CLASS_32_FULL] = {
+ [RT_CLASS_32_MAX] = {
.name = "radix tree node 32",
.fanout = 32,
.inner_size = sizeof(RT_NODE_INNER_32) + 32 * sizeof(RT_PTR_ALLOC),
.leaf_size = sizeof(RT_NODE_LEAF_32) + 32 * sizeof(RT_VALUE_TYPE),
},
- [RT_CLASS_125_FULL] = {
+ [RT_CLASS_125] = {
.name = "radix tree node 125",
.fanout = 125,
.inner_size = sizeof(RT_NODE_INNER_125) + 125 * sizeof(RT_PTR_ALLOC),
@@ -576,14 +575,6 @@ static const RT_SIZE_CLASS_ELEM RT_SIZE_CLASS_INFO[] = {
#define RT_SIZE_CLASS_COUNT lengthof(RT_SIZE_CLASS_INFO)
-/* Map from the node kind to its minimum size class */
-static const RT_SIZE_CLASS RT_KIND_MIN_SIZE_CLASS[RT_NODE_KIND_COUNT] = {
- [RT_NODE_KIND_3] = RT_CLASS_3_FULL,
- [RT_NODE_KIND_32] = RT_CLASS_32_PARTIAL,
- [RT_NODE_KIND_125] = RT_CLASS_125_FULL,
- [RT_NODE_KIND_256] = RT_CLASS_256,
-};
-
#ifdef RT_SHMEM
/* A magic value used to identify our radix tree */
#define RT_RADIX_TREE_MAGIC 0x54A48167
@@ -893,7 +884,7 @@ static inline void
RT_CHUNK_CHILDREN_ARRAY_COPY(uint8 *src_chunks, RT_PTR_ALLOC *src_children,
uint8 *dst_chunks, RT_PTR_ALLOC *dst_children)
{
- const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_3_FULL].fanout;
+ const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_3].fanout;
const Size chunk_size = sizeof(uint8) * fanout;
const Size children_size = sizeof(RT_PTR_ALLOC) * fanout;
@@ -905,7 +896,7 @@ static inline void
RT_CHUNK_VALUES_ARRAY_COPY(uint8 *src_chunks, RT_VALUE_TYPE *src_values,
uint8 *dst_chunks, RT_VALUE_TYPE *dst_values)
{
- const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_3_FULL].fanout;
+ const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_3].fanout;
const Size chunk_size = sizeof(uint8) * fanout;
const Size values_size = sizeof(RT_VALUE_TYPE) * fanout;
@@ -1105,9 +1096,9 @@ RT_NEW_ROOT(RT_RADIX_TREE *tree, uint64 key)
RT_PTR_ALLOC allocnode;
RT_PTR_LOCAL newnode;
- allocnode = RT_ALLOC_NODE(tree, RT_CLASS_3_FULL, inner);
+ allocnode = RT_ALLOC_NODE(tree, RT_CLASS_3, inner);
newnode = RT_PTR_GET_LOCAL(tree, allocnode);
- RT_INIT_NODE(newnode, RT_NODE_KIND_3, RT_CLASS_3_FULL, inner);
+ RT_INIT_NODE(newnode, RT_NODE_KIND_3, RT_CLASS_3, inner);
newnode->shift = shift;
tree->ctl->max_val = RT_SHIFT_GET_MAX_VAL(shift);
tree->ctl->root = allocnode;
@@ -1230,9 +1221,9 @@ RT_EXTEND(RT_RADIX_TREE *tree, uint64 key)
RT_PTR_LOCAL node;
RT_NODE_INNER_3 *n3;
- allocnode = RT_ALLOC_NODE(tree, RT_CLASS_3_FULL, true);
+ allocnode = RT_ALLOC_NODE(tree, RT_CLASS_3, true);
node = RT_PTR_GET_LOCAL(tree, allocnode);
- RT_INIT_NODE(node, RT_NODE_KIND_3, RT_CLASS_3_FULL, true);
+ RT_INIT_NODE(node, RT_NODE_KIND_3, RT_CLASS_3, true);
node->shift = shift;
node->count = 1;
@@ -1268,9 +1259,9 @@ RT_SET_EXTEND(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE value, RT_PTR_LOCAL
int newshift = shift - RT_NODE_SPAN;
bool inner = newshift > 0;
- allocchild = RT_ALLOC_NODE(tree, RT_CLASS_3_FULL, inner);
+ allocchild = RT_ALLOC_NODE(tree, RT_CLASS_3, inner);
newchild = RT_PTR_GET_LOCAL(tree, allocchild);
- RT_INIT_NODE(newchild, RT_NODE_KIND_3, RT_CLASS_3_FULL, inner);
+ RT_INIT_NODE(newchild, RT_NODE_KIND_3, RT_CLASS_3, inner);
newchild->shift = newshift;
RT_NODE_INSERT_INNER(tree, parent, stored_node, node, key, allocchild);
@@ -2007,10 +1998,10 @@ RT_STATS(RT_RADIX_TREE *tree)
ereport(NOTICE, (errmsg("num_keys = " UINT64_FORMAT ", height = %d, n3 = %u, n15 = %u, n32 = %u, n125 = %u, n256 = %u",
tree->ctl->num_keys,
tree->ctl->root->shift / RT_NODE_SPAN,
- tree->ctl->cnt[RT_CLASS_3_FULL],
- tree->ctl->cnt[RT_CLASS_32_PARTIAL],
- tree->ctl->cnt[RT_CLASS_32_FULL],
- tree->ctl->cnt[RT_CLASS_125_FULL],
+ tree->ctl->cnt[RT_CLASS_3],
+ tree->ctl->cnt[RT_CLASS_32_MIN],
+ tree->ctl->cnt[RT_CLASS_32_MAX],
+ tree->ctl->cnt[RT_CLASS_125],
tree->ctl->cnt[RT_CLASS_256])));
}
@@ -2292,12 +2283,11 @@ RT_DUMP(RT_RADIX_TREE *tree)
#undef RT_SIZE_CLASS
#undef RT_SIZE_CLASS_ELEM
#undef RT_SIZE_CLASS_INFO
-#undef RT_CLASS_3_FULL
-#undef RT_CLASS_32_PARTIAL
-#undef RT_CLASS_32_FULL
-#undef RT_CLASS_125_FULL
+#undef RT_CLASS_3
+#undef RT_CLASS_32_MIN
+#undef RT_CLASS_32_MAX
+#undef RT_CLASS_125
#undef RT_CLASS_256
-#undef RT_KIND_MIN_SIZE_CLASS
/* function declarations */
#undef RT_CREATE
diff --git a/src/include/lib/radixtree_insert_impl.h b/src/include/lib/radixtree_insert_impl.h
index a0f46b37d3..e3c3f7a69d 100644
--- a/src/include/lib/radixtree_insert_impl.h
+++ b/src/include/lib/radixtree_insert_impl.h
@@ -49,7 +49,7 @@
RT_PTR_LOCAL newnode;
RT_NODE32_TYPE *new32;
const uint8 new_kind = RT_NODE_KIND_32;
- const RT_SIZE_CLASS new_class = RT_KIND_MIN_SIZE_CLASS[new_kind];
+ const RT_SIZE_CLASS new_class = RT_CLASS_32_MIN;
/* grow node from 3 to 32 */
allocnode = RT_ALLOC_NODE(tree, new_class, inner);
@@ -96,8 +96,7 @@
/* FALLTHROUGH */
case RT_NODE_KIND_32:
{
- const RT_SIZE_CLASS_ELEM class32_min = RT_SIZE_CLASS_INFO[RT_CLASS_32_PARTIAL];
- const RT_SIZE_CLASS_ELEM class32_max = RT_SIZE_CLASS_INFO[RT_CLASS_32_FULL];
+ const RT_SIZE_CLASS_ELEM class32_max = RT_SIZE_CLASS_INFO[RT_CLASS_32_MAX];
RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
int idx;
@@ -115,11 +114,14 @@
}
if (unlikely(RT_NODE_MUST_GROW(n32)) &&
- n32->base.n.fanout == class32_min.fanout)
+ n32->base.n.fanout < class32_max.fanout)
{
RT_PTR_ALLOC allocnode;
RT_PTR_LOCAL newnode;
- const RT_SIZE_CLASS new_class = RT_CLASS_32_FULL;
+ const RT_SIZE_CLASS_ELEM class32_min = RT_SIZE_CLASS_INFO[RT_CLASS_32_MIN];
+ const RT_SIZE_CLASS new_class = RT_CLASS_32_MAX;
+
+ Assert(n32->base.n.fanout == class32_min.fanout);
/* grow to the next size class of this kind */
allocnode = RT_ALLOC_NODE(tree, new_class, inner);
@@ -143,7 +145,7 @@
RT_PTR_LOCAL newnode;
RT_NODE125_TYPE *new125;
const uint8 new_kind = RT_NODE_KIND_125;
- const RT_SIZE_CLASS new_class = RT_KIND_MIN_SIZE_CLASS[new_kind];
+ const RT_SIZE_CLASS new_class = RT_CLASS_125;
Assert(n32->base.n.fanout == class32_max.fanout);
@@ -224,7 +226,7 @@
RT_PTR_LOCAL newnode;
RT_NODE256_TYPE *new256;
const uint8 new_kind = RT_NODE_KIND_256;
- const RT_SIZE_CLASS new_class = RT_KIND_MIN_SIZE_CLASS[new_kind];
+ const RT_SIZE_CLASS new_class = RT_CLASS_256;
/* grow node from 125 to 256 */
allocnode = RT_ALLOC_NODE(tree, new_class, inner);
--
2.39.0
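As an aside for readers following the renaming in the patch above: the size classes and their fanouts (3, 15, 32, 125, 256) come straight from RT_SIZE_CLASS_INFO, with two classes sharing the node-32 kind. The standalone sketch below only illustrates that growth order; the enum, the helper, and the driver program are made-up stand-ins, not code from the patch.

#include <stdio.h>

/*
 * Illustrative size classes only; the fanouts mirror RT_SIZE_CLASS_INFO
 * in the patch, everything else here is simplified.
 */
typedef enum
{
    CLASS_3,        /* node kind 3 */
    CLASS_32_MIN,   /* node kind 32, partial allocation (fanout 15) */
    CLASS_32_MAX,   /* node kind 32, full allocation (fanout 32) */
    CLASS_125,      /* node kind 125 */
    CLASS_256       /* node kind 256 */
} size_class;

static const int fanout[] = {3, 15, 32, 125, 256};

/* Smallest size class whose fanout can hold "count" children (count <= 256) */
static size_class
class_for_count(int count)
{
    size_class  sc = CLASS_3;

    while (count > fanout[sc])
        sc++;
    return sc;
}

int
main(void)
{
    /*
     * Growing past CLASS_32_MIN lands in CLASS_32_MAX without changing the
     * node kind; only the 3->32, 32->125 and 125->256 steps switch kinds.
     */
    printf("16 children -> class %d (fanout %d)\n",
           (int) class_for_count(16), fanout[class_for_count(16)]);
    printf("33 children -> class %d (fanout %d)\n",
           (int) class_for_count(33), fanout[class_for_count(33)]);
    return 0;
}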
Attachment: v22-0019-Standardize-on-testing-for-is-leaf.patch (text/x-patch; charset=US-ASCII)
From 9908dfdecbd22eacbc57a7863fe67cbb42b22f90 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Sun, 22 Jan 2023 15:10:10 +0700
Subject: [PATCH v22 19/22] Standardize on testing for "is leaf"
Some recent code decided to test for "is inner", so make
everything consistent.
---
src/include/lib/radixtree.h | 38 ++++++++++++-------------
src/include/lib/radixtree_insert_impl.h | 18 ++++++------
2 files changed, 28 insertions(+), 28 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 95124696ef..5927437034 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -1019,24 +1019,24 @@ RT_SHIFT_GET_MAX_VAL(int shift)
* Allocate a new node with the given node kind.
*/
static RT_PTR_ALLOC
-RT_ALLOC_NODE(RT_RADIX_TREE *tree, RT_SIZE_CLASS size_class, bool inner)
+RT_ALLOC_NODE(RT_RADIX_TREE *tree, RT_SIZE_CLASS size_class, bool is_leaf)
{
RT_PTR_ALLOC allocnode;
size_t allocsize;
- if (inner)
- allocsize = RT_SIZE_CLASS_INFO[size_class].inner_size;
- else
+ if (is_leaf)
allocsize = RT_SIZE_CLASS_INFO[size_class].leaf_size;
+ else
+ allocsize = RT_SIZE_CLASS_INFO[size_class].inner_size;
#ifdef RT_SHMEM
allocnode = dsa_allocate(tree->dsa, allocsize);
#else
- if (inner)
- allocnode = (RT_PTR_ALLOC) MemoryContextAlloc(tree->inner_slabs[size_class],
+ if (is_leaf)
+ allocnode = (RT_PTR_ALLOC) MemoryContextAlloc(tree->leaf_slabs[size_class],
allocsize);
else
- allocnode = (RT_PTR_ALLOC) MemoryContextAlloc(tree->leaf_slabs[size_class],
+ allocnode = (RT_PTR_ALLOC) MemoryContextAlloc(tree->inner_slabs[size_class],
allocsize);
#endif
@@ -1050,12 +1050,12 @@ RT_ALLOC_NODE(RT_RADIX_TREE *tree, RT_SIZE_CLASS size_class, bool inner)
/* Initialize the node contents */
static inline void
-RT_INIT_NODE(RT_PTR_LOCAL node, uint8 kind, RT_SIZE_CLASS size_class, bool inner)
+RT_INIT_NODE(RT_PTR_LOCAL node, uint8 kind, RT_SIZE_CLASS size_class, bool is_leaf)
{
- if (inner)
- MemSet(node, 0, RT_SIZE_CLASS_INFO[size_class].inner_size);
- else
+ if (is_leaf)
MemSet(node, 0, RT_SIZE_CLASS_INFO[size_class].leaf_size);
+ else
+ MemSet(node, 0, RT_SIZE_CLASS_INFO[size_class].inner_size);
node->kind = kind;
@@ -1082,13 +1082,13 @@ static void
RT_NEW_ROOT(RT_RADIX_TREE *tree, uint64 key)
{
int shift = RT_KEY_GET_SHIFT(key);
- bool inner = shift > 0;
+ bool is_leaf = shift == 0;
RT_PTR_ALLOC allocnode;
RT_PTR_LOCAL newnode;
- allocnode = RT_ALLOC_NODE(tree, RT_CLASS_3, inner);
+ allocnode = RT_ALLOC_NODE(tree, RT_CLASS_3, is_leaf);
newnode = RT_PTR_GET_LOCAL(tree, allocnode);
- RT_INIT_NODE(newnode, RT_NODE_KIND_3, RT_CLASS_3, inner);
+ RT_INIT_NODE(newnode, RT_NODE_KIND_3, RT_CLASS_3, is_leaf);
newnode->shift = shift;
tree->ctl->max_val = RT_SHIFT_GET_MAX_VAL(shift);
tree->ctl->root = allocnode;
@@ -1107,10 +1107,10 @@ RT_COPY_NODE(RT_PTR_LOCAL newnode, RT_PTR_LOCAL oldnode)
*/
static inline RT_PTR_LOCAL
RT_SWITCH_NODE_KIND(RT_RADIX_TREE *tree, RT_PTR_ALLOC allocnode, RT_PTR_LOCAL node,
- uint8 new_kind, uint8 new_class, bool inner)
+ uint8 new_kind, uint8 new_class, bool is_leaf)
{
RT_PTR_LOCAL newnode = RT_PTR_GET_LOCAL(tree, allocnode);
- RT_INIT_NODE(newnode, new_kind, new_class, inner);
+ RT_INIT_NODE(newnode, new_kind, new_class, is_leaf);
RT_COPY_NODE(newnode, node);
return newnode;
@@ -1247,11 +1247,11 @@ RT_SET_EXTEND(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE value, RT_PTR_LOCAL
RT_PTR_ALLOC allocchild;
RT_PTR_LOCAL newchild;
int newshift = shift - RT_NODE_SPAN;
- bool inner = newshift > 0;
+ bool is_leaf = newshift == 0;
- allocchild = RT_ALLOC_NODE(tree, RT_CLASS_3, inner);
+ allocchild = RT_ALLOC_NODE(tree, RT_CLASS_3, is_leaf);
newchild = RT_PTR_GET_LOCAL(tree, allocchild);
- RT_INIT_NODE(newchild, RT_NODE_KIND_3, RT_CLASS_3, inner);
+ RT_INIT_NODE(newchild, RT_NODE_KIND_3, RT_CLASS_3, is_leaf);
newchild->shift = newshift;
RT_NODE_INSERT_INNER(tree, parent, stored_node, node, key, allocchild);
diff --git a/src/include/lib/radixtree_insert_impl.h b/src/include/lib/radixtree_insert_impl.h
index 0fcebf1c6b..22aca0e6cc 100644
--- a/src/include/lib/radixtree_insert_impl.h
+++ b/src/include/lib/radixtree_insert_impl.h
@@ -16,10 +16,10 @@
bool chunk_exists = false;
#ifdef RT_NODE_LEVEL_LEAF
- const bool inner = false;
+ const bool is_leaf = true;
Assert(RT_NODE_IS_LEAF(node));
#else
- const bool inner = true;
+ const bool is_leaf = false;
Assert(!RT_NODE_IS_LEAF(node));
#endif
@@ -52,8 +52,8 @@
const RT_SIZE_CLASS new_class = RT_CLASS_32_MIN;
/* grow node from 3 to 32 */
- allocnode = RT_ALLOC_NODE(tree, new_class, inner);
- newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, inner);
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
+ newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, is_leaf);
new32 = (RT_NODE32_TYPE *) newnode;
#ifdef RT_NODE_LEVEL_LEAF
@@ -124,7 +124,7 @@
Assert(n32->base.n.fanout == class32_min.fanout);
/* grow to the next size class of this kind */
- allocnode = RT_ALLOC_NODE(tree, new_class, inner);
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
newnode = RT_PTR_GET_LOCAL(tree, allocnode);
n32 = (RT_NODE32_TYPE *) newnode;
@@ -150,8 +150,8 @@
Assert(n32->base.n.fanout == class32_max.fanout);
/* grow node from 32 to 125 */
- allocnode = RT_ALLOC_NODE(tree, new_class, inner);
- newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, inner);
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
+ newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, is_leaf);
new125 = (RT_NODE125_TYPE *) newnode;
for (int i = 0; i < class32_max.fanout; i++)
@@ -229,8 +229,8 @@
const RT_SIZE_CLASS new_class = RT_CLASS_256;
/* grow node from 125 to 256 */
- allocnode = RT_ALLOC_NODE(tree, new_class, inner);
- newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, inner);
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
+ newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, is_leaf);
new256 = (RT_NODE256_TYPE *) newnode;
for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n125->base.n.count; i++)
--
2.39.0
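For anyone skimming the mechanical changes above, the convention this patch settles on fits in a few lines: a node with shift == 0 is a leaf, anything else is an inner node. The snippet below is only a sketch with invented type names and an assumed 8-bit span; the real check is the RT_NODE_IS_LEAF() macro in radixtree.h.

#include <stdbool.h>
#include <stdio.h>

#define NODE_SPAN   8           /* bits consumed per tree level (assumed) */

typedef struct node
{
    int         shift;          /* 0 for leaves, a positive multiple of NODE_SPAN otherwise */
} node;

static inline bool
node_is_leaf(const node *n)
{
    /* mirrors RT_NODE_IS_LEAF(): leaf nodes sit at shift zero */
    return n->shift == 0;
}

int
main(void)
{
    node        leaf = {.shift = 0};
    node        inner = {.shift = 2 * NODE_SPAN};

    printf("first node:  %s\n", node_is_leaf(&leaf) ? "leaf" : "inner");
    printf("second node: %s\n", node_is_leaf(&inner) ? "leaf" : "inner");
    return 0;
}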
Attachment: v22-0020-Do-some-rewriting-and-proofreading-of-comments.patch (text/x-patch; charset=US-ASCII)
From cd7664aea7022902e08d26ef91a1a88421fde3c6 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Mon, 23 Jan 2023 18:00:20 +0700
Subject: [PATCH v22 20/22] Do some rewriting and proofreading of comments
In passing, change one ternary operator to if/else.
---
src/include/lib/radixtree.h | 160 +++++++++++++++++++++---------------
1 file changed, 92 insertions(+), 68 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 5927437034..7fcd212ea4 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -9,25 +9,38 @@
* types, each with a different numbers of elements. Depending on the number of
* children, the appropriate node type is used.
*
- * There are some differences from the proposed implementation. For instance,
- * there is not support for path compression and lazy path expansion. The radix
- * tree supports fixed length of the key so we don't expect the tree level
- * wouldn't be high.
+ * WIP: notes about traditional radix tree trading off span vs height...
*
- * Both the key and the value are 64-bit unsigned integer. The inner nodes and
- * the leaf nodes have slightly different structure: for inner tree nodes,
- * shift > 0, store the pointer to its child node as the value. The leaf nodes,
- * shift == 0, have the 64-bit unsigned integer that is specified by the user as
- * the value. The paper refers to this technique as "Multi-value leaves". We
- * choose it to avoid an additional pointer traversal. It is the reason this code
- * currently does not support variable-length keys.
+ * There are two kinds of nodes, inner nodes and leaves. Inner nodes
+ * map partial keys to child pointers.
*
- * XXX: Most functions in this file have two variants for inner nodes and leaf
- * nodes, therefore there are duplication codes. While this sometimes makes the
- * code maintenance tricky, this reduces branch prediction misses when judging
- * whether the node is a inner node of a leaf node.
+ * The ART paper mentions three ways to implement leaves:
*
- * XXX: the radix tree node never be shrunk.
+ * "- Single-value leaves: The values are stored using an addi-
+ * tional leaf node type which stores one value.
+ * - Multi-value leaves: The values are stored in one of four
+ * different leaf node types, which mirror the structure of
+ * inner nodes, but contain values instead of pointers.
+ * - Combined pointer/value slots: If values fit into point-
+ * ers, no separate node types are necessary. Instead, each
+ * pointer storage location in an inner node can either
+ * store a pointer or a value."
+ *
+ * We chose "multi-value leaves" to avoid the additional pointer traversal
+ * required by "single-value leaves"
+ *
+ * For simplicity, the key is assumed to be 64-bit unsigned integer. The
+ * tree doesn't need to contain paths where the highest bytes of all keys
+ * are zero. That way, the tree's height adapts to the distribution of keys.
+ *
+ * TODO: In the future it might be worthwhile to offer configurability of
+ * leaf implementation for different use cases. Single-value leaves would
+ * give more flexibility in key type, including variable-length keys.
+ *
+ * There are some optimizations not yet implemented, particularly path
+ * compression and lazy path expansion.
+ *
+ * WIP: the radix tree nodes don't shrink.
*
* To generate a radix tree and associated functions for a use case several
* macros have to be #define'ed before this file is included. Including
@@ -42,11 +55,11 @@
* - RT_DEFINE - if defined function definitions are generated
* - RT_SCOPE - in which scope (e.g. extern, static inline) do function
* declarations reside
- * - RT_SHMEM - if defined, the radix tree is created in the DSA area
- * so that multiple processes can access it simultaneously.
* - RT_VALUE_TYPE - the type of the value.
*
* Optional parameters:
+ * - RT_SHMEM - if defined, the radix tree is created in the DSA area
+ * so that multiple processes can access it simultaneously.
* - RT_DEBUG - if defined add stats tracking and debugging functions
*
* Interface
@@ -54,9 +67,6 @@
*
* RT_CREATE - Create a new, empty radix tree
* RT_FREE - Free the radix tree
- * RT_ATTACH - Attach to the radix tree
- * RT_DETACH - Detach from the radix tree
- * RT_GET_HANDLE - Return the handle of the radix tree
* RT_SEARCH - Search a key-value pair
* RT_SET - Set a key-value pair
* RT_BEGIN_ITERATE - Begin iterating through all key-value pairs
@@ -64,11 +74,12 @@
* RT_END_ITER - End iteration
* RT_MEMORY_USAGE - Get the memory usage
*
- * RT_CREATE() creates an empty radix tree in the given memory context
- * and memory contexts for all kinds of radix tree node under the memory context.
+ * Interface for Shared Memory
+ * ---------
*
- * RT_ITERATE_NEXT() ensures returning key-value pairs in the ascending
- * order of the key.
+ * RT_ATTACH - Attach to the radix tree
+ * RT_DETACH - Detach from the radix tree
+ * RT_GET_HANDLE - Return the handle of the radix tree
*
* Optional Interface
* ---------
@@ -360,13 +371,23 @@ typedef struct RT_NODE
#define RT_INVALID_PTR_ALLOC NULL
#endif
+/*
+ * Inner nodes and leaf nodes have analogous structure. To distinguish
+ * them at runtime, we take advantage of the fact that the key chunk
+ * is accessed by shifting: inner tree nodes (shift > 0) store the
+ * pointer to a child node in the slot. In leaf nodes (shift == 0),
+ * the slot contains the value corresponding to the key.
+ */
#define RT_NODE_IS_LEAF(n) (((RT_PTR_LOCAL) (n))->shift == 0)
+
#define RT_NODE_MUST_GROW(node) \
((node)->base.n.count == (node)->base.n.fanout)
-/* Base type of each node kinds for leaf and inner nodes */
-/* The base types must be a be able to accommodate the largest size
-class for variable-sized node kinds*/
+/*
+ * Base type of each node kind for leaf and inner nodes.
+ * The base types must be able to accommodate the largest size
+ * class for variable-sized node kinds.
+ */
typedef struct RT_NODE_BASE_3
{
RT_NODE n;
@@ -384,9 +405,9 @@ typedef struct RT_NODE_BASE_32
} RT_NODE_BASE_32;
/*
- * node-125 uses slot_idx array, an array of RT_NODE_MAX_SLOTS length, typically
- * 256, to store indexes into a second array that contains up to 125 values (or
- * child pointers in inner nodes).
+ * node-125 uses slot_idx array, an array of RT_NODE_MAX_SLOTS length
+ * to store indexes into a second array that contains the values (or
+ * child pointers).
*/
typedef struct RT_NODE_BASE_125
{
@@ -407,15 +428,8 @@ typedef struct RT_NODE_BASE_256
/*
* Inner and leaf nodes.
*
- * Theres are separate for two main reasons:
- *
- * 1) the value type might be different than something fitting into a pointer
- * width type
- * 2) Need to represent non-existing values in a key-type independent way.
- *
- * 1) is clearly worth being concerned about, but it's not clear 2) is as
- * good. It might be better to just indicate non-existing entries the same way
- * in inner nodes.
+ * These are separate because the value type might be different than
+ * something fitting into a pointer-width type.
*/
typedef struct RT_NODE_INNER_3
{
@@ -466,8 +480,10 @@ typedef struct RT_NODE_LEAF_125
} RT_NODE_LEAF_125;
/*
- * node-256 is the largest node type. This node has RT_NODE_MAX_SLOTS length array
+ * node-256 is the largest node type. This node has an array
* for directly storing values (or child pointers in inner nodes).
+ * Unlike other node kinds, its array size is by definition
+ * fixed.
*/
typedef struct RT_NODE_INNER_256
{
@@ -481,7 +497,10 @@ typedef struct RT_NODE_LEAF_256
{
RT_NODE_BASE_256 base;
- /* isset is a bitmap to track which slot is in use */
+ /*
+ * Unlike with inner256, zero is a valid value here, so we use a
+ * bitmap to track which slot is in use.
+ */
bitmapword isset[BM_IDX(RT_NODE_MAX_SLOTS)];
/* Slots for 256 values */
@@ -570,7 +589,8 @@ static const RT_SIZE_CLASS_ELEM RT_SIZE_CLASS_INFO[] = {
#define RT_RADIX_TREE_MAGIC 0x54A48167
#endif
-/* A radix tree with nodes */
+/* Contains the actual tree and ancillary info */
+// WIP: this name is a bit strange
typedef struct RT_RADIX_TREE_CONTROL
{
#ifdef RT_SHMEM
@@ -588,7 +608,7 @@ typedef struct RT_RADIX_TREE_CONTROL
#endif
} RT_RADIX_TREE_CONTROL;
-/* A radix tree with nodes */
+/* Entry point for allocating and accessing the tree */
typedef struct RT_RADIX_TREE
{
MemoryContext context;
@@ -613,15 +633,15 @@ typedef struct RT_RADIX_TREE
* RT_NODE_ITER struct is used to track the iteration within a node.
*
* RT_ITER is the struct for iteration of the radix tree, and uses RT_NODE_ITER
- * in order to track the iteration of each level. During the iteration, we also
+ * in order to track the iteration of each level. During iteration, we also
* construct the key whenever updating the node iteration information, e.g., when
* advancing the current index within the node or when moving to the next node
* at the same level.
-+ *
-+ * XXX: Currently we allow only one process to do iteration. Therefore, rt_node_iter
-+ * has the local pointers to nodes, rather than RT_PTR_ALLOC.
-+ * We need either a safeguard to disallow other processes to begin the iteration
-+ * while one process is doing or to allow multiple processes to do the iteration.
+ *
+ * XXX: Currently we allow only one process to do iteration. Therefore, rt_node_iter
+ * has the local pointers to nodes, rather than RT_PTR_ALLOC.
+ * We need either a safeguard to disallow other processes to begin the iteration
+ * while one process is doing or to allow multiple processes to do the iteration.
*/
typedef struct RT_NODE_ITER
{
@@ -637,7 +657,7 @@ typedef struct RT_ITER
RT_NODE_ITER stack[RT_MAX_LEVEL];
int stack_len;
- /* The key is being constructed during the iteration */
+ /* The key is constructed during iteration */
uint64 key;
} RT_ITER;
@@ -672,8 +692,8 @@ RT_PTR_ALLOC_IS_VALID(RT_PTR_ALLOC ptr)
}
/*
- * Return index of the first element in 'base' that equals 'key'. Return -1
- * if there is no such element.
+ * Return index of the first element in the node's chunk array that equals
+ * 'chunk'. Return -1 if there is no such element.
*/
static inline int
RT_NODE_3_SEARCH_EQ(RT_NODE_BASE_3 *node, uint8 chunk)
@@ -693,7 +713,8 @@ RT_NODE_3_SEARCH_EQ(RT_NODE_BASE_3 *node, uint8 chunk)
}
/*
- * Return index of the chunk to insert into chunks in the given node.
+ * Return index of the chunk and slot arrays for inserting into the node,
+ * such that the chunk array remains ordered.
*/
static inline int
RT_NODE_3_GET_INSERTPOS(RT_NODE_BASE_3 *node, uint8 chunk)
@@ -744,7 +765,7 @@ RT_NODE_32_SEARCH_EQ(RT_NODE_BASE_32 *node, uint8 chunk)
/* replicate the search key */
spread_chunk = vector8_broadcast(chunk);
- /* compare to the 32 keys stored in the node */
+ /* compare to all 32 keys stored in the node */
vector8_load(&haystack1, &node->chunks[0]);
vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
cmp1 = vector8_eq(spread_chunk, haystack1);
@@ -768,7 +789,7 @@ RT_NODE_32_SEARCH_EQ(RT_NODE_BASE_32 *node, uint8 chunk)
}
/*
- * Return index of the node's chunk array to insert into,
+ * Return index of the chunk and slot arrays for inserting into the node,
* such that the chunk array remains ordered.
*/
static inline int
@@ -809,7 +830,7 @@ RT_NODE_32_GET_INSERTPOS(RT_NODE_BASE_32 *node, uint8 chunk)
* This is a bit more complicated than RT_NODE_32_SEARCH_EQ(), because
* no unsigned uint8 comparison instruction exists, at least for SSE2. So
* we need to play some trickery using vector8_min() to effectively get
- * <=. There'll never be any equal elements in the current uses, but that's
+ * <=. There'll never be any equal elements in urrent uses, but that's
* what we get here...
*/
spread_chunk = vector8_broadcast(chunk);
@@ -834,6 +855,7 @@ RT_NODE_32_GET_INSERTPOS(RT_NODE_BASE_32 *node, uint8 chunk)
#endif
}
+
/*
* Functions to manipulate both chunks array and children/values array.
* These are used for node-3 and node-32.
@@ -993,18 +1015,19 @@ RT_NODE_LEAF_256_DELETE(RT_NODE_LEAF_256 *node, uint8 chunk)
}
/*
- * Return the shift that is satisfied to store the given key.
+ * Return the largest shift that will allow storing the given key.
*/
static inline int
RT_KEY_GET_SHIFT(uint64 key)
{
- return (key == 0)
- ? 0
- : (pg_leftmost_one_pos64(key) / RT_NODE_SPAN) * RT_NODE_SPAN;
+ if (key == 0)
+ return 0;
+ else
+ return (pg_leftmost_one_pos64(key) / RT_NODE_SPAN) * RT_NODE_SPAN;
}
/*
- * Return the max value stored in a node with the given shift.
+ * Return the max value that can be stored in the tree with the given shift.
*/
static uint64
RT_SHIFT_GET_MAX_VAL(int shift)
@@ -1155,6 +1178,7 @@ RT_FREE_NODE(RT_RADIX_TREE *tree, RT_PTR_ALLOC allocnode)
#endif
}
+/* Update the parent's pointer when growing a node */
static inline void
RT_NODE_UPDATE_INNER(RT_PTR_LOCAL node, uint64 key, RT_PTR_ALLOC new_child)
{
@@ -1182,7 +1206,7 @@ RT_REPLACE_NODE(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent,
if (parent == old_child)
{
- /* Replace the root node with the new large node */
+ /* Replace the root node with the new larger node */
tree->ctl->root = new_child;
}
else
@@ -1192,8 +1216,8 @@ RT_REPLACE_NODE(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent,
}
/*
- * The radix tree doesn't sufficient height. Extend the radix tree so it can
- * store the key.
+ * The radix tree doesn't have sufficient height. Extend the radix tree so
+ * it can store the key.
*/
static void
RT_EXTEND(RT_RADIX_TREE *tree, uint64 key)
@@ -1337,7 +1361,7 @@ RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stor
#undef RT_NODE_LEVEL_INNER
}
-/* Like, RT_NODE_INSERT_INNER, but for leaf nodes */
+/* Like RT_NODE_INSERT_INNER, but for leaf nodes */
static bool
RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
uint64 key, RT_VALUE_TYPE value)
@@ -1377,7 +1401,7 @@ RT_CREATE(MemoryContext ctx)
#else
tree->ctl = (RT_RADIX_TREE_CONTROL *) palloc0(sizeof(RT_RADIX_TREE_CONTROL));
- /* Create the slab allocator for each size class */
+ /* Create a slab context for each size class */
for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
{
RT_SIZE_CLASS_ELEM size_class = RT_SIZE_CLASS_INFO[i];
@@ -1570,7 +1594,7 @@ RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE value)
parent = RT_PTR_GET_LOCAL(tree, stored_child);
shift = parent->shift;
- /* Descend the tree until a leaf node */
+ /* Descend the tree until we reach a leaf node */
while (shift >= 0)
{
RT_PTR_ALLOC new_child;
--
2.39.0
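One of the comments rewritten above says RT_KEY_GET_SHIFT() returns the largest shift that still allows storing the given key. The standalone sketch below mirrors that calculation; it assumes an 8-bit span (the value of RT_NODE_SPAN is not visible in these hunks) and substitutes a GCC/Clang builtin for pg_leftmost_one_pos64(), so treat it as an approximation of the logic rather than the patch's code.

#include <stdint.h>
#include <stdio.h>

#define NODE_SPAN   8           /* assumed per-level key span in bits */

/* Largest starting shift that still covers the key's most significant set bit */
static int
key_get_shift(uint64_t key)
{
    int         msb;

    if (key == 0)
        return 0;

    msb = 63 - __builtin_clzll(key);    /* stand-in for pg_leftmost_one_pos64() */
    return (msb / NODE_SPAN) * NODE_SPAN;
}

int
main(void)
{
    /* keys below 256 fit at shift 0; each extra byte pushes the shift up by one span */
    printf("shift for 0xff       = %d\n", key_get_shift(UINT64_C(0xff)));
    printf("shift for 0x1ff      = %d\n", key_get_shift(UINT64_C(0x1ff)));
    printf("shift for 0xffffffff = %d\n", key_get_shift(UINT64_C(0xffffffff)));
    return 0;
}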
Attachment: v22-0022-Use-TIDStore-for-storing-dead-tuple-TID-during-l.patch (text/x-patch; charset=US-ASCII)
From cd1cc048b81abbd942a9a7e66b1d64a9a844ac84 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Tue, 17 Jan 2023 17:20:37 +0700
Subject: [PATCH v22 22/22] Use TIDStore for storing dead tuple TID during lazy
vacuum
Previously, we used an array of ItemPointerData to store dead tuple
TIDs, which is not space efficient and is slow to look up. It also
had a hard 1GB limit on its size.
This commit switches to TIDStore for this purpose. Since TIDStore,
backed by the radix tree, allocates memory incrementally, the 1GB
limit goes away.
Also, since we can no longer estimate in advance exactly how many
TIDs fit in a given amount of memory, the columns max_dead_tuples
and num_dead_tuples are renamed and the progress information is now
reported in bytes.
Furthermore, since TIDStore uses the radix tree internally, the
minimum amount of memory required by TIDStore is 1MB, the initial
DSA segment size. Because of that, this change increases the minimum
maintenance_work_mem from 1MB to 2MB.
XXX: needs to bump catalog version
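As a rough editorial illustration of the flow described above (not part of the patch itself): the TidStore calls below, and their arguments, are the ones used in the hunks that follow, while the wrapper function is invented for illustration and would only compile in a backend tree that has the TIDStore patches applied.

#include "postgres.h"

#include "access/tidstore.h"
#include "storage/itemptr.h"

static void
dead_items_usage_sketch(int vac_work_mem)
{
    TidStore   *dead_items;
    OffsetNumber offsets[2] = {1, 5};
    ItemPointerData tid;
    TidStoreIter *iter;
    TidStoreIterResult *result;

    /* local store; pass a dsa_area * instead of NULL for parallel vacuum */
    dead_items = tidstore_create(vac_work_mem, NULL);

    /* first heap pass: remember the LP_DEAD offsets found on block 10 */
    tidstore_add_tids(dead_items, (BlockNumber) 10, offsets, 2);

    /* index pass: existence check that replaces the old bsearch() */
    ItemPointerSet(&tid, 10, 5);
    if (tidstore_lookup_tid(dead_items, &tid))
        elog(DEBUG2, "TID (10,5) is dead");

    /* second heap pass: visit each block that has dead TIDs */
    iter = tidstore_begin_iterate(dead_items);
    while ((result = tidstore_iterate_next(iter)) != NULL)
        elog(DEBUG2, "block %u has %d dead offsets",
             result->blkno, result->num_offsets);
    tidstore_end_iterate(iter);

    tidstore_destroy(dead_items);
}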
---
doc/src/sgml/monitoring.sgml | 8 +-
src/backend/access/heap/vacuumlazy.c | 210 +++++++--------------
src/backend/catalog/system_views.sql | 2 +-
src/backend/commands/vacuum.c | 76 +-------
src/backend/commands/vacuumparallel.c | 64 ++++---
src/backend/storage/lmgr/lwlock.c | 2 +
src/backend/utils/misc/guc_tables.c | 2 +-
src/include/commands/progress.h | 4 +-
src/include/commands/vacuum.h | 25 +--
src/include/storage/lwlock.h | 1 +
src/test/regress/expected/cluster.out | 2 +-
src/test/regress/expected/create_index.out | 2 +-
src/test/regress/expected/rules.out | 4 +-
src/test/regress/sql/cluster.sql | 2 +-
src/test/regress/sql/create_index.sql | 2 +-
15 files changed, 138 insertions(+), 268 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index d936aa3da3..0230c74e3d 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -6870,10 +6870,10 @@ FROM pg_stat_get_backend_idset() AS backendid;
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>max_dead_tuples</structfield> <type>bigint</type>
+ <structfield>max_dead_tuple_bytes</structfield> <type>bigint</type>
</para>
<para>
- Number of dead tuples that we can store before needing to perform
+ Amount of dead tuple data that we can store before needing to perform
an index vacuum cycle, based on
<xref linkend="guc-maintenance-work-mem"/>.
</para></entry>
@@ -6881,10 +6881,10 @@ FROM pg_stat_get_backend_idset() AS backendid;
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>num_dead_tuples</structfield> <type>bigint</type>
+ <structfield>dead_tuple_bytes</structfield> <type>bigint</type>
</para>
<para>
- Number of dead tuples collected since the last index vacuum cycle.
+ Amount of dead tuple data collected since the last index vacuum cycle.
</para></entry>
</row>
</tbody>
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 8f14cf85f3..90f8a5e087 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -40,6 +40,7 @@
#include "access/heapam_xlog.h"
#include "access/htup_details.h"
#include "access/multixact.h"
+#include "access/tidstore.h"
#include "access/transam.h"
#include "access/visibilitymap.h"
#include "access/xact.h"
@@ -188,7 +189,7 @@ typedef struct LVRelState
* lazy_vacuum_heap_rel, which marks the same LP_DEAD line pointers as
* LP_UNUSED during second heap pass.
*/
- VacDeadItems *dead_items; /* TIDs whose index tuples we'll delete */
+ TidStore *dead_items; /* TIDs whose index tuples we'll delete */
BlockNumber rel_pages; /* total number of pages */
BlockNumber scanned_pages; /* # pages examined (not skipped via VM) */
BlockNumber removed_pages; /* # pages removed by relation truncation */
@@ -220,17 +221,21 @@ typedef struct LVRelState
typedef struct LVPagePruneState
{
bool hastup; /* Page prevents rel truncation? */
- bool has_lpdead_items; /* includes existing LP_DEAD items */
+
+ /* collected LP_DEAD items including existing LP_DEAD items */
+ int lpdead_items;
+ OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
/*
* State describes the proper VM bit states to set for the page following
- * pruning and freezing. all_visible implies !has_lpdead_items, but don't
+ * pruning and freezing. all_visible implies !HAS_LPDEAD_ITEMS(), but don't
* trust all_frozen result unless all_visible is also set to true.
*/
bool all_visible; /* Every item visible to all? */
bool all_frozen; /* provided all_visible is also true */
TransactionId visibility_cutoff_xid; /* For recovery conflicts */
} LVPagePruneState;
+#define HAS_LPDEAD_ITEMS(state) (((state).lpdead_items) > 0)
/* Struct for saving and restoring vacuum error information. */
typedef struct LVSavedErrInfo
@@ -259,8 +264,9 @@ static bool lazy_scan_noprune(LVRelState *vacrel, Buffer buf,
static void lazy_vacuum(LVRelState *vacrel);
static bool lazy_vacuum_all_indexes(LVRelState *vacrel);
static void lazy_vacuum_heap_rel(LVRelState *vacrel);
-static int lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
- Buffer buffer, int index, Buffer vmbuffer);
+static void lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
+ OffsetNumber *offsets, int num_offsets,
+ Buffer buffer, Buffer vmbuffer);
static bool lazy_check_wraparound_failsafe(LVRelState *vacrel);
static void lazy_cleanup_all_indexes(LVRelState *vacrel);
static IndexBulkDeleteResult *lazy_vacuum_one_index(Relation indrel,
@@ -825,21 +831,21 @@ lazy_scan_heap(LVRelState *vacrel)
blkno,
next_unskippable_block,
next_fsm_block_to_vacuum = 0;
- VacDeadItems *dead_items = vacrel->dead_items;
+ TidStore *dead_items = vacrel->dead_items;
Buffer vmbuffer = InvalidBuffer;
bool next_unskippable_allvis,
skipping_current_range;
const int initprog_index[] = {
PROGRESS_VACUUM_PHASE,
PROGRESS_VACUUM_TOTAL_HEAP_BLKS,
- PROGRESS_VACUUM_MAX_DEAD_TUPLES
+ PROGRESS_VACUUM_MAX_DEAD_TUPLE_BYTES
};
int64 initprog_val[3];
/* Report that we're scanning the heap, advertising total # of blocks */
initprog_val[0] = PROGRESS_VACUUM_PHASE_SCAN_HEAP;
initprog_val[1] = rel_pages;
- initprog_val[2] = dead_items->max_items;
+ initprog_val[2] = tidstore_max_memory(vacrel->dead_items);
pgstat_progress_update_multi_param(3, initprog_index, initprog_val);
/* Set up an initial range of skippable blocks using the visibility map */
@@ -906,8 +912,7 @@ lazy_scan_heap(LVRelState *vacrel)
* dead_items TIDs, pause and do a cycle of vacuuming before we tackle
* this page.
*/
- Assert(dead_items->max_items >= MaxHeapTuplesPerPage);
- if (dead_items->max_items - dead_items->num_items < MaxHeapTuplesPerPage)
+ if (tidstore_is_full(vacrel->dead_items))
{
/*
* Before beginning index vacuuming, we release any pin we may
@@ -1018,7 +1023,7 @@ lazy_scan_heap(LVRelState *vacrel)
*/
lazy_scan_prune(vacrel, buf, blkno, page, &prunestate);
- Assert(!prunestate.all_visible || !prunestate.has_lpdead_items);
+ Assert(!prunestate.all_visible || !HAS_LPDEAD_ITEMS(prunestate));
/* Remember the location of the last page with nonremovable tuples */
if (prunestate.hastup)
@@ -1034,14 +1039,12 @@ lazy_scan_heap(LVRelState *vacrel)
* performed here can be thought of as the one-pass equivalent of
* a call to lazy_vacuum().
*/
- if (prunestate.has_lpdead_items)
+ if (HAS_LPDEAD_ITEMS(prunestate))
{
Size freespace;
- lazy_vacuum_heap_page(vacrel, blkno, buf, 0, vmbuffer);
-
- /* Forget the LP_DEAD items that we just vacuumed */
- dead_items->num_items = 0;
+ lazy_vacuum_heap_page(vacrel, blkno, prunestate.deadoffsets,
+ prunestate.lpdead_items, buf, vmbuffer);
/*
* Periodically perform FSM vacuuming to make newly-freed
@@ -1078,7 +1081,16 @@ lazy_scan_heap(LVRelState *vacrel)
* with prunestate-driven visibility map and FSM steps (just like
* the two-pass strategy).
*/
- Assert(dead_items->num_items == 0);
+ Assert(tidstore_num_tids(dead_items) == 0);
+ }
+ else if (HAS_LPDEAD_ITEMS(prunestate))
+ {
+ /* Save details of the LP_DEAD items from the page */
+ tidstore_add_tids(dead_items, blkno, prunestate.deadoffsets,
+ prunestate.lpdead_items);
+
+ pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
+ tidstore_memory_usage(dead_items));
}
/*
@@ -1145,7 +1157,7 @@ lazy_scan_heap(LVRelState *vacrel)
* There should never be LP_DEAD items on a page with PD_ALL_VISIBLE
* set, however.
*/
- else if (prunestate.has_lpdead_items && PageIsAllVisible(page))
+ else if (HAS_LPDEAD_ITEMS(prunestate) && PageIsAllVisible(page))
{
elog(WARNING, "page containing LP_DEAD items is marked as all-visible in relation \"%s\" page %u",
vacrel->relname, blkno);
@@ -1193,7 +1205,7 @@ lazy_scan_heap(LVRelState *vacrel)
* Final steps for block: drop cleanup lock, record free space in the
* FSM
*/
- if (prunestate.has_lpdead_items && vacrel->do_index_vacuuming)
+ if (HAS_LPDEAD_ITEMS(prunestate) && vacrel->do_index_vacuuming)
{
/*
* Wait until lazy_vacuum_heap_rel() to save free space. This
@@ -1249,7 +1261,7 @@ lazy_scan_heap(LVRelState *vacrel)
* Do index vacuuming (call each index's ambulkdelete routine), then do
* related heap vacuuming
*/
- if (dead_items->num_items > 0)
+ if (tidstore_num_tids(dead_items) > 0)
lazy_vacuum(vacrel);
/*
@@ -1543,13 +1555,11 @@ lazy_scan_prune(LVRelState *vacrel,
HTSV_Result res;
int tuples_deleted,
tuples_frozen,
- lpdead_items,
live_tuples,
recently_dead_tuples;
int nnewlpdead;
HeapPageFreeze pagefrz;
int64 fpi_before = pgWalUsage.wal_fpi;
- OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
HeapTupleFreeze frozen[MaxHeapTuplesPerPage];
Assert(BufferGetBlockNumber(buf) == blkno);
@@ -1571,7 +1581,6 @@ retry:
pagefrz.NoFreezePageRelminMxid = vacrel->NewRelminMxid;
tuples_deleted = 0;
tuples_frozen = 0;
- lpdead_items = 0;
live_tuples = 0;
recently_dead_tuples = 0;
@@ -1580,9 +1589,9 @@ retry:
*
* We count tuples removed by the pruning step as tuples_deleted. Its
* final value can be thought of as the number of tuples that have been
- * deleted from the table. It should not be confused with lpdead_items;
- * lpdead_items's final value can be thought of as the number of tuples
- * that were deleted from indexes.
+ * deleted from the table. It should not be confused with
+ * prunestate->lpdead_items; prunestate->lpdead_items's final value can
+ * be thought of as the number of tuples that were deleted from indexes.
*/
tuples_deleted = heap_page_prune(rel, buf, vacrel->vistest,
InvalidTransactionId, 0, &nnewlpdead,
@@ -1593,7 +1602,7 @@ retry:
* requiring freezing among remaining tuples with storage
*/
prunestate->hastup = false;
- prunestate->has_lpdead_items = false;
+ prunestate->lpdead_items = 0;
prunestate->all_visible = true;
prunestate->all_frozen = true;
prunestate->visibility_cutoff_xid = InvalidTransactionId;
@@ -1638,7 +1647,7 @@ retry:
* (This is another case where it's useful to anticipate that any
* LP_DEAD items will become LP_UNUSED during the ongoing VACUUM.)
*/
- deadoffsets[lpdead_items++] = offnum;
+ prunestate->deadoffsets[prunestate->lpdead_items++] = offnum;
continue;
}
@@ -1875,7 +1884,7 @@ retry:
*/
#ifdef USE_ASSERT_CHECKING
/* Note that all_frozen value does not matter when !all_visible */
- if (prunestate->all_visible && lpdead_items == 0)
+ if (prunestate->all_visible && prunestate->lpdead_items == 0)
{
TransactionId cutoff;
bool all_frozen;
@@ -1888,28 +1897,9 @@ retry:
}
#endif
- /*
- * Now save details of the LP_DEAD items from the page in vacrel
- */
- if (lpdead_items > 0)
+ if (prunestate->lpdead_items > 0)
{
- VacDeadItems *dead_items = vacrel->dead_items;
- ItemPointerData tmp;
-
vacrel->lpdead_item_pages++;
- prunestate->has_lpdead_items = true;
-
- ItemPointerSetBlockNumber(&tmp, blkno);
-
- for (int i = 0; i < lpdead_items; i++)
- {
- ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
- dead_items->items[dead_items->num_items++] = tmp;
- }
-
- Assert(dead_items->num_items <= dead_items->max_items);
- pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
- dead_items->num_items);
/*
* It was convenient to ignore LP_DEAD items in all_visible earlier on
@@ -1928,7 +1918,7 @@ retry:
/* Finally, add page-local counts to whole-VACUUM counts */
vacrel->tuples_deleted += tuples_deleted;
vacrel->tuples_frozen += tuples_frozen;
- vacrel->lpdead_items += lpdead_items;
+ vacrel->lpdead_items += prunestate->lpdead_items;
vacrel->live_tuples += live_tuples;
vacrel->recently_dead_tuples += recently_dead_tuples;
}
@@ -2129,8 +2119,7 @@ lazy_scan_noprune(LVRelState *vacrel,
}
else
{
- VacDeadItems *dead_items = vacrel->dead_items;
- ItemPointerData tmp;
+ TidStore *dead_items = vacrel->dead_items;
/*
* Page has LP_DEAD items, and so any references/TIDs that remain in
@@ -2139,17 +2128,10 @@ lazy_scan_noprune(LVRelState *vacrel,
*/
vacrel->lpdead_item_pages++;
- ItemPointerSetBlockNumber(&tmp, blkno);
+ tidstore_add_tids(dead_items, blkno, deadoffsets, lpdead_items);
- for (int i = 0; i < lpdead_items; i++)
- {
- ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
- dead_items->items[dead_items->num_items++] = tmp;
- }
-
- Assert(dead_items->num_items <= dead_items->max_items);
- pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
- dead_items->num_items);
+ pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
+ tidstore_memory_usage(dead_items));
vacrel->lpdead_items += lpdead_items;
@@ -2198,7 +2180,7 @@ lazy_vacuum(LVRelState *vacrel)
if (!vacrel->do_index_vacuuming)
{
Assert(!vacrel->do_index_cleanup);
- vacrel->dead_items->num_items = 0;
+ tidstore_reset(vacrel->dead_items);
return;
}
@@ -2227,7 +2209,7 @@ lazy_vacuum(LVRelState *vacrel)
BlockNumber threshold;
Assert(vacrel->num_index_scans == 0);
- Assert(vacrel->lpdead_items == vacrel->dead_items->num_items);
+ Assert(vacrel->lpdead_items == tidstore_num_tids(vacrel->dead_items));
Assert(vacrel->do_index_vacuuming);
Assert(vacrel->do_index_cleanup);
@@ -2254,8 +2236,8 @@ lazy_vacuum(LVRelState *vacrel)
* cases then this may need to be reconsidered.
*/
threshold = (double) vacrel->rel_pages * BYPASS_THRESHOLD_PAGES;
- bypass = (vacrel->lpdead_item_pages < threshold &&
- vacrel->lpdead_items < MAXDEADITEMS(32L * 1024L * 1024L));
+ bypass = (vacrel->lpdead_item_pages < threshold) &&
+ tidstore_memory_usage(vacrel->dead_items) < (32L * 1024L * 1024L);
}
if (bypass)
@@ -2300,7 +2282,7 @@ lazy_vacuum(LVRelState *vacrel)
* Forget the LP_DEAD items that we just vacuumed (or just decided to not
* vacuum)
*/
- vacrel->dead_items->num_items = 0;
+ tidstore_reset(vacrel->dead_items);
}
/*
@@ -2373,7 +2355,7 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
* place).
*/
Assert(vacrel->num_index_scans > 0 ||
- vacrel->dead_items->num_items == vacrel->lpdead_items);
+ tidstore_num_tids(vacrel->dead_items) == vacrel->lpdead_items);
Assert(allindexes || vacrel->failsafe_active);
/*
@@ -2410,10 +2392,11 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
static void
lazy_vacuum_heap_rel(LVRelState *vacrel)
{
- int index = 0;
BlockNumber vacuumed_pages = 0;
Buffer vmbuffer = InvalidBuffer;
LVSavedErrInfo saved_err_info;
+ TidStoreIter *iter;
+ TidStoreIterResult *result;
Assert(vacrel->do_index_vacuuming);
Assert(vacrel->do_index_cleanup);
@@ -2428,7 +2411,8 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
VACUUM_ERRCB_PHASE_VACUUM_HEAP,
InvalidBlockNumber, InvalidOffsetNumber);
- while (index < vacrel->dead_items->num_items)
+ iter = tidstore_begin_iterate(vacrel->dead_items);
+ while ((result = tidstore_iterate_next(iter)) != NULL)
{
BlockNumber blkno;
Buffer buf;
@@ -2437,7 +2421,7 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
vacuum_delay_point();
- blkno = ItemPointerGetBlockNumber(&vacrel->dead_items->items[index]);
+ blkno = result->blkno;
vacrel->blkno = blkno;
/*
@@ -2451,7 +2435,8 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
buf = ReadBufferExtended(vacrel->rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
vacrel->bstrategy);
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
- index = lazy_vacuum_heap_page(vacrel, blkno, buf, index, vmbuffer);
+ lazy_vacuum_heap_page(vacrel, blkno, result->offsets, result->num_offsets,
+ buf, vmbuffer);
/* Now that we've vacuumed the page, record its available space */
page = BufferGetPage(buf);
@@ -2461,6 +2446,7 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
vacuumed_pages++;
}
+ tidstore_end_iterate(iter);
vacrel->blkno = InvalidBlockNumber;
if (BufferIsValid(vmbuffer))
@@ -2470,14 +2456,13 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
* We set all LP_DEAD items from the first heap pass to LP_UNUSED during
* the second heap pass. No more, no less.
*/
- Assert(index > 0);
Assert(vacrel->num_index_scans > 1 ||
- (index == vacrel->lpdead_items &&
+ (tidstore_num_tids(vacrel->dead_items) == vacrel->lpdead_items &&
vacuumed_pages == vacrel->lpdead_item_pages));
ereport(DEBUG2,
- (errmsg("table \"%s\": removed %lld dead item identifiers in %u pages",
- vacrel->relname, (long long) index, vacuumed_pages)));
+ (errmsg("table \"%s\": removed " UINT64_FORMAT "dead item identifiers in %u pages",
+ vacrel->relname, tidstore_num_tids(vacrel->dead_items), vacuumed_pages)));
/* Revert to the previous phase information for error traceback */
restore_vacuum_error_info(vacrel, &saved_err_info);
@@ -2495,11 +2480,10 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
* LP_DEAD item on the page. The return value is the first index immediately
* after all LP_DEAD items for the same page in the array.
*/
-static int
-lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
- int index, Buffer vmbuffer)
+static void
+lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets, Buffer buffer, Buffer vmbuffer)
{
- VacDeadItems *dead_items = vacrel->dead_items;
Page page = BufferGetPage(buffer);
OffsetNumber unused[MaxHeapTuplesPerPage];
int nunused = 0;
@@ -2518,16 +2502,11 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
START_CRIT_SECTION();
- for (; index < dead_items->num_items; index++)
+ for (int i = 0; i < num_offsets; i++)
{
- BlockNumber tblk;
- OffsetNumber toff;
ItemId itemid;
+ OffsetNumber toff = offsets[i];
- tblk = ItemPointerGetBlockNumber(&dead_items->items[index]);
- if (tblk != blkno)
- break; /* past end of tuples for this block */
- toff = ItemPointerGetOffsetNumber(&dead_items->items[index]);
itemid = PageGetItemId(page, toff);
Assert(ItemIdIsDead(itemid) && !ItemIdHasStorage(itemid));
@@ -2597,7 +2576,6 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
/* Revert to the previous phase information for error traceback */
restore_vacuum_error_info(vacrel, &saved_err_info);
- return index;
}
/*
@@ -3093,46 +3071,6 @@ count_nondeletable_pages(LVRelState *vacrel, bool *lock_waiter_detected)
return vacrel->nonempty_pages;
}
-/*
- * Returns the number of dead TIDs that VACUUM should allocate space to
- * store, given a heap rel of size vacrel->rel_pages, and given current
- * maintenance_work_mem setting (or current autovacuum_work_mem setting,
- * when applicable).
- *
- * See the comments at the head of this file for rationale.
- */
-static int
-dead_items_max_items(LVRelState *vacrel)
-{
- int64 max_items;
- int vac_work_mem = IsAutoVacuumWorkerProcess() &&
- autovacuum_work_mem != -1 ?
- autovacuum_work_mem : maintenance_work_mem;
-
- if (vacrel->nindexes > 0)
- {
- BlockNumber rel_pages = vacrel->rel_pages;
-
- max_items = MAXDEADITEMS(vac_work_mem * 1024L);
- max_items = Min(max_items, INT_MAX);
- max_items = Min(max_items, MAXDEADITEMS(MaxAllocSize));
-
- /* curious coding here to ensure the multiplication can't overflow */
- if ((BlockNumber) (max_items / MaxHeapTuplesPerPage) > rel_pages)
- max_items = rel_pages * MaxHeapTuplesPerPage;
-
- /* stay sane if small maintenance_work_mem */
- max_items = Max(max_items, MaxHeapTuplesPerPage);
- }
- else
- {
- /* One-pass case only stores a single heap page's TIDs at a time */
- max_items = MaxHeapTuplesPerPage;
- }
-
- return (int) max_items;
-}
-
/*
* Allocate dead_items (either using palloc, or in dynamic shared memory).
* Sets dead_items in vacrel for caller.
@@ -3143,11 +3081,9 @@ dead_items_max_items(LVRelState *vacrel)
static void
dead_items_alloc(LVRelState *vacrel, int nworkers)
{
- VacDeadItems *dead_items;
- int max_items;
-
- max_items = dead_items_max_items(vacrel);
- Assert(max_items >= MaxHeapTuplesPerPage);
+ int vac_work_mem = IsAutoVacuumWorkerProcess() &&
+ autovacuum_work_mem != -1 ?
+ autovacuum_work_mem * 1024L : maintenance_work_mem * 1024L;
/*
* Initialize state for a parallel vacuum. As of now, only one worker can
@@ -3174,7 +3110,7 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
else
vacrel->pvs = parallel_vacuum_init(vacrel->rel, vacrel->indrels,
vacrel->nindexes, nworkers,
- max_items,
+ vac_work_mem,
vacrel->verbose ? INFO : DEBUG2,
vacrel->bstrategy);
@@ -3187,11 +3123,7 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
}
/* Serial VACUUM case */
- dead_items = (VacDeadItems *) palloc(vac_max_items_to_alloc_size(max_items));
- dead_items->max_items = max_items;
- dead_items->num_items = 0;
-
- vacrel->dead_items = dead_items;
+ vacrel->dead_items = tidstore_create(vac_work_mem, NULL);
}
/*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 8608e3fa5b..a526e607fe 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1165,7 +1165,7 @@ CREATE VIEW pg_stat_progress_vacuum AS
END AS phase,
S.param2 AS heap_blks_total, S.param3 AS heap_blks_scanned,
S.param4 AS heap_blks_vacuumed, S.param5 AS index_vacuum_count,
- S.param6 AS max_dead_tuples, S.param7 AS num_dead_tuples
+ S.param6 AS max_dead_tuple_bytes, S.param7 AS dead_tuple_bytes
FROM pg_stat_get_progress_info('VACUUM') AS S
LEFT JOIN pg_database D ON S.datid = D.oid;
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 7b1a4b127e..358ad25996 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -97,7 +97,6 @@ static bool vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
static double compute_parallel_delay(void);
static VacOptValue get_vacoptval_from_boolean(DefElem *def);
static bool vac_tid_reaped(ItemPointer itemptr, void *state);
-static int vac_cmp_itemptr(const void *left, const void *right);
/*
* Primary entry point for manual VACUUM and ANALYZE commands
@@ -2303,16 +2302,16 @@ get_vacoptval_from_boolean(DefElem *def)
*/
IndexBulkDeleteResult *
vac_bulkdel_one_index(IndexVacuumInfo *ivinfo, IndexBulkDeleteResult *istat,
- VacDeadItems *dead_items)
+ TidStore *dead_items)
{
/* Do bulk deletion */
istat = index_bulk_delete(ivinfo, istat, vac_tid_reaped,
(void *) dead_items);
ereport(ivinfo->message_level,
- (errmsg("scanned index \"%s\" to remove %d row versions",
+ (errmsg("scanned index \"%s\" to remove " UINT64_FORMAT " row versions",
RelationGetRelationName(ivinfo->index),
- dead_items->num_items)));
+ tidstore_num_tids(dead_items))));
return istat;
}
@@ -2343,18 +2342,6 @@ vac_cleanup_one_index(IndexVacuumInfo *ivinfo, IndexBulkDeleteResult *istat)
return istat;
}
-/*
- * Returns the total required space for VACUUM's dead_items array given a
- * max_items value.
- */
-Size
-vac_max_items_to_alloc_size(int max_items)
-{
- Assert(max_items <= MAXDEADITEMS(MaxAllocSize));
-
- return offsetof(VacDeadItems, items) + sizeof(ItemPointerData) * max_items;
-}
-
/*
* vac_tid_reaped() -- is a particular tid deletable?
*
@@ -2365,60 +2352,7 @@ vac_max_items_to_alloc_size(int max_items)
static bool
vac_tid_reaped(ItemPointer itemptr, void *state)
{
- VacDeadItems *dead_items = (VacDeadItems *) state;
- int64 litem,
- ritem,
- item;
- ItemPointer res;
-
- litem = itemptr_encode(&dead_items->items[0]);
- ritem = itemptr_encode(&dead_items->items[dead_items->num_items - 1]);
- item = itemptr_encode(itemptr);
-
- /*
- * Doing a simple bound check before bsearch() is useful to avoid the
- * extra cost of bsearch(), especially if dead items on the heap are
- * concentrated in a certain range. Since this function is called for
- * every index tuple, it pays to be really fast.
- */
- if (item < litem || item > ritem)
- return false;
-
- res = (ItemPointer) bsearch((void *) itemptr,
- (void *) dead_items->items,
- dead_items->num_items,
- sizeof(ItemPointerData),
- vac_cmp_itemptr);
-
- return (res != NULL);
-}
-
-/*
- * Comparator routines for use with qsort() and bsearch().
- */
-static int
-vac_cmp_itemptr(const void *left, const void *right)
-{
- BlockNumber lblk,
- rblk;
- OffsetNumber loff,
- roff;
-
- lblk = ItemPointerGetBlockNumber((ItemPointer) left);
- rblk = ItemPointerGetBlockNumber((ItemPointer) right);
-
- if (lblk < rblk)
- return -1;
- if (lblk > rblk)
- return 1;
-
- loff = ItemPointerGetOffsetNumber((ItemPointer) left);
- roff = ItemPointerGetOffsetNumber((ItemPointer) right);
-
- if (loff < roff)
- return -1;
- if (loff > roff)
- return 1;
+ TidStore *dead_items = (TidStore *) state;
- return 0;
+ return tidstore_lookup_tid(dead_items, itemptr);
}
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index bcd40c80a1..4c0ce4b7e6 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -44,7 +44,7 @@
* use small integers.
*/
#define PARALLEL_VACUUM_KEY_SHARED 1
-#define PARALLEL_VACUUM_KEY_DEAD_ITEMS 2
+#define PARALLEL_VACUUM_KEY_DSA 2
#define PARALLEL_VACUUM_KEY_QUERY_TEXT 3
#define PARALLEL_VACUUM_KEY_BUFFER_USAGE 4
#define PARALLEL_VACUUM_KEY_WAL_USAGE 5
@@ -103,6 +103,9 @@ typedef struct PVShared
/* Counter for vacuuming and cleanup */
pg_atomic_uint32 idx;
+
+ /* Handle of the shared TidStore */
+ tidstore_handle dead_items_handle;
} PVShared;
/* Status used during parallel index vacuum or cleanup */
@@ -166,7 +169,8 @@ struct ParallelVacuumState
PVIndStats *indstats;
/* Shared dead items space among parallel vacuum workers */
- VacDeadItems *dead_items;
+ TidStore *dead_items;
+ dsa_area *dead_items_area;
/* Points to buffer usage area in DSM */
BufferUsage *buffer_usage;
@@ -222,20 +226,23 @@ static void parallel_vacuum_error_callback(void *arg);
*/
ParallelVacuumState *
parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
- int nrequested_workers, int max_items,
- int elevel, BufferAccessStrategy bstrategy)
+ int nrequested_workers, int vac_work_mem,
+ int elevel,
+ BufferAccessStrategy bstrategy)
{
ParallelVacuumState *pvs;
ParallelContext *pcxt;
PVShared *shared;
- VacDeadItems *dead_items;
+ TidStore *dead_items;
PVIndStats *indstats;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
+ void *area_space;
+ dsa_area *dead_items_dsa;
bool *will_parallel_vacuum;
Size est_indstats_len;
Size est_shared_len;
- Size est_dead_items_len;
+ Size dsa_minsize = dsa_minimum_size();
int nindexes_mwm = 0;
int parallel_workers = 0;
int querylen;
@@ -283,9 +290,8 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_estimate_chunk(&pcxt->estimator, est_shared_len);
shm_toc_estimate_keys(&pcxt->estimator, 1);
- /* Estimate size for dead_items -- PARALLEL_VACUUM_KEY_DEAD_ITEMS */
- est_dead_items_len = vac_max_items_to_alloc_size(max_items);
- shm_toc_estimate_chunk(&pcxt->estimator, est_dead_items_len);
+ /* Estimate size for dead tuple DSA -- PARALLEL_VACUUM_KEY_DSA */
+ shm_toc_estimate_chunk(&pcxt->estimator, dsa_minsize);
shm_toc_estimate_keys(&pcxt->estimator, 1);
/*
@@ -351,6 +357,16 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_INDEX_STATS, indstats);
pvs->indstats = indstats;
+ /* Prepare DSA space for dead items */
+ area_space = shm_toc_allocate(pcxt->toc, dsa_minsize);
+ shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_DSA, area_space);
+ dead_items_dsa = dsa_create_in_place(area_space, dsa_minsize,
+ LWTRANCHE_PARALLEL_VACUUM_DSA,
+ pcxt->seg);
+ dead_items = tidstore_create(vac_work_mem, dead_items_dsa);
+ pvs->dead_items = dead_items;
+ pvs->dead_items_area = dead_items_dsa;
+
/* Prepare shared information */
shared = (PVShared *) shm_toc_allocate(pcxt->toc, est_shared_len);
MemSet(shared, 0, est_shared_len);
@@ -360,6 +376,7 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
(nindexes_mwm > 0) ?
maintenance_work_mem / Min(parallel_workers, nindexes_mwm) :
maintenance_work_mem;
+ shared->dead_items_handle = tidstore_get_handle(dead_items);
pg_atomic_init_u32(&(shared->cost_balance), 0);
pg_atomic_init_u32(&(shared->active_nworkers), 0);
@@ -368,15 +385,6 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_SHARED, shared);
pvs->shared = shared;
- /* Prepare the dead_items space */
- dead_items = (VacDeadItems *) shm_toc_allocate(pcxt->toc,
- est_dead_items_len);
- dead_items->max_items = max_items;
- dead_items->num_items = 0;
- MemSet(dead_items->items, 0, sizeof(ItemPointerData) * max_items);
- shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_DEAD_ITEMS, dead_items);
- pvs->dead_items = dead_items;
-
/*
* Allocate space for each worker's BufferUsage and WalUsage; no need to
* initialize
@@ -434,6 +442,9 @@ parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats)
istats[i] = NULL;
}
+ tidstore_destroy(pvs->dead_items);
+ dsa_detach(pvs->dead_items_area);
+
DestroyParallelContext(pvs->pcxt);
ExitParallelMode();
@@ -442,7 +453,7 @@ parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats)
}
/* Returns the dead items space */
-VacDeadItems *
+TidStore *
parallel_vacuum_get_dead_items(ParallelVacuumState *pvs)
{
return pvs->dead_items;
@@ -940,7 +951,9 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
Relation *indrels;
PVIndStats *indstats;
PVShared *shared;
- VacDeadItems *dead_items;
+ TidStore *dead_items;
+ void *area_space;
+ dsa_area *dead_items_area;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
int nindexes;
@@ -984,10 +997,10 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
PARALLEL_VACUUM_KEY_INDEX_STATS,
false);
- /* Set dead_items space */
- dead_items = (VacDeadItems *) shm_toc_lookup(toc,
- PARALLEL_VACUUM_KEY_DEAD_ITEMS,
- false);
+ /* Set dead items */
+ area_space = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_DSA, false);
+ dead_items_area = dsa_attach_in_place(area_space, seg);
+ dead_items = tidstore_attach(dead_items_area, shared->dead_items_handle);
/* Set cost-based vacuum delay */
VacuumCostActive = (VacuumCostDelay > 0);
@@ -1033,6 +1046,9 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
&wal_usage[ParallelWorkerNumber]);
+ tidstore_detach(pvs.dead_items);
+ dsa_detach(dead_items_area);
+
/* Pop the error context stack */
error_context_stack = errcallback.previous;
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 55b3a04097..c223a7dc94 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -192,6 +192,8 @@ static const char *const BuiltinTrancheNames[] = {
"LogicalRepLauncherDSA",
/* LWTRANCHE_LAUNCHER_HASH: */
"LogicalRepLauncherHash",
+ /* LWTRANCHE_PARALLEL_VACUUM_DSA: */
+ "ParallelVacuumDSA",
};
StaticAssertDecl(lengthof(BuiltinTrancheNames) ==
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 4ac808ed22..422914f0a9 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2312,7 +2312,7 @@ struct config_int ConfigureNamesInt[] =
GUC_UNIT_KB
},
&maintenance_work_mem,
- 65536, 1024, MAX_KILOBYTES,
+ 65536, 2048, MAX_KILOBYTES,
NULL, NULL, NULL
},
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index e5add41352..b209d3cf84 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -23,8 +23,8 @@
#define PROGRESS_VACUUM_HEAP_BLKS_SCANNED 2
#define PROGRESS_VACUUM_HEAP_BLKS_VACUUMED 3
#define PROGRESS_VACUUM_NUM_INDEX_VACUUMS 4
-#define PROGRESS_VACUUM_MAX_DEAD_TUPLES 5
-#define PROGRESS_VACUUM_NUM_DEAD_TUPLES 6
+#define PROGRESS_VACUUM_MAX_DEAD_TUPLE_BYTES 5
+#define PROGRESS_VACUUM_DEAD_TUPLE_BYTES 6
/* Phases of vacuum (as advertised via PROGRESS_VACUUM_PHASE) */
#define PROGRESS_VACUUM_PHASE_SCAN_HEAP 1
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 689dbb7702..220d89fff7 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -17,6 +17,7 @@
#include "access/htup.h"
#include "access/genam.h"
#include "access/parallel.h"
+#include "access/tidstore.h"
#include "catalog/pg_class.h"
#include "catalog/pg_statistic.h"
#include "catalog/pg_type.h"
@@ -276,21 +277,6 @@ struct VacuumCutoffs
MultiXactId MultiXactCutoff;
};
-/*
- * VacDeadItems stores TIDs whose index tuples are deleted by index vacuuming.
- */
-typedef struct VacDeadItems
-{
- int max_items; /* # slots allocated in array */
- int num_items; /* current # of entries */
-
- /* Sorted array of TIDs to delete from indexes */
- ItemPointerData items[FLEXIBLE_ARRAY_MEMBER];
-} VacDeadItems;
-
-#define MAXDEADITEMS(avail_mem) \
- (((avail_mem) - offsetof(VacDeadItems, items)) / sizeof(ItemPointerData))
-
/* GUC parameters */
extern PGDLLIMPORT int default_statistics_target; /* PGDLLIMPORT for PostGIS */
extern PGDLLIMPORT int vacuum_freeze_min_age;
@@ -339,18 +325,17 @@ extern Relation vacuum_open_relation(Oid relid, RangeVar *relation,
LOCKMODE lmode);
extern IndexBulkDeleteResult *vac_bulkdel_one_index(IndexVacuumInfo *ivinfo,
IndexBulkDeleteResult *istat,
- VacDeadItems *dead_items);
+ TidStore *dead_items);
extern IndexBulkDeleteResult *vac_cleanup_one_index(IndexVacuumInfo *ivinfo,
IndexBulkDeleteResult *istat);
-extern Size vac_max_items_to_alloc_size(int max_items);
/* in commands/vacuumparallel.c */
extern ParallelVacuumState *parallel_vacuum_init(Relation rel, Relation *indrels,
int nindexes, int nrequested_workers,
- int max_items, int elevel,
- BufferAccessStrategy bstrategy);
+ int vac_work_mem,
+ int elevel, BufferAccessStrategy bstrategy);
extern void parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats);
-extern VacDeadItems *parallel_vacuum_get_dead_items(ParallelVacuumState *pvs);
+extern TidStore *parallel_vacuum_get_dead_items(ParallelVacuumState *pvs);
extern void parallel_vacuum_bulkdel_all_indexes(ParallelVacuumState *pvs,
long num_table_tuples,
int num_index_scans);
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 07002fdfbe..537b34b30c 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -207,6 +207,7 @@ typedef enum BuiltinTrancheIds
LWTRANCHE_PGSTATS_DATA,
LWTRANCHE_LAUNCHER_DSA,
LWTRANCHE_LAUNCHER_HASH,
+ LWTRANCHE_PARALLEL_VACUUM_DSA,
LWTRANCHE_FIRST_USER_DEFINED
} BuiltinTrancheIds;
diff --git a/src/test/regress/expected/cluster.out b/src/test/regress/expected/cluster.out
index 2eec483eaa..e04f50726f 100644
--- a/src/test/regress/expected/cluster.out
+++ b/src/test/regress/expected/cluster.out
@@ -526,7 +526,7 @@ create index cluster_sort on clstr_4 (hundred, thousand, tenthous);
-- ensure we don't use the index in CLUSTER nor the checking SELECTs
set enable_indexscan = off;
-- Use external sort:
-set maintenance_work_mem = '1MB';
+set maintenance_work_mem = '2MB';
cluster clstr_4 using cluster_sort;
select * from
(select hundred, lag(hundred) over () as lhundred,
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 6cd57e3eaa..d1889b9d10 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -1214,7 +1214,7 @@ DROP TABLE unlogged_hash_table;
-- CREATE INDEX hash_ovfl_index ON hash_ovfl_heap USING hash (x int4_ops);
-- Test hash index build tuplesorting. Force hash tuplesort using low
-- maintenance_work_mem setting and fillfactor:
-SET maintenance_work_mem = '1MB';
+SET maintenance_work_mem = '2MB';
CREATE INDEX hash_tuplesort_idx ON tenk1 USING hash (stringu1 name_ops) WITH (fillfactor = 10);
EXPLAIN (COSTS OFF)
SELECT count(*) FROM tenk1 WHERE stringu1 = 'TVAAAA';
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index e7a2f5856a..f6ae02eb14 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2020,8 +2020,8 @@ pg_stat_progress_vacuum| SELECT s.pid,
s.param3 AS heap_blks_scanned,
s.param4 AS heap_blks_vacuumed,
s.param5 AS index_vacuum_count,
- s.param6 AS max_dead_tuples,
- s.param7 AS num_dead_tuples
+ s.param6 AS max_dead_tuple_bytes,
+ s.param7 AS dead_tuple_bytes
FROM (pg_stat_get_progress_info('VACUUM'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
LEFT JOIN pg_database d ON ((s.datid = d.oid)));
pg_stat_recovery_prefetch| SELECT stats_reset,
diff --git a/src/test/regress/sql/cluster.sql b/src/test/regress/sql/cluster.sql
index a4cfaae807..a4cb5b98a5 100644
--- a/src/test/regress/sql/cluster.sql
+++ b/src/test/regress/sql/cluster.sql
@@ -258,7 +258,7 @@ create index cluster_sort on clstr_4 (hundred, thousand, tenthous);
set enable_indexscan = off;
-- Use external sort:
-set maintenance_work_mem = '1MB';
+set maintenance_work_mem = '2MB';
cluster clstr_4 using cluster_sort;
select * from
(select hundred, lag(hundred) over () as lhundred,
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index a3738833b2..edb5e4b4f3 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -367,7 +367,7 @@ DROP TABLE unlogged_hash_table;
-- Test hash index build tuplesorting. Force hash tuplesort using low
-- maintenance_work_mem setting and fillfactor:
-SET maintenance_work_mem = '1MB';
+SET maintenance_work_mem = '2MB';
CREATE INDEX hash_tuplesort_idx ON tenk1 USING hash (stringu1 name_ops) WITH (fillfactor = 10);
EXPLAIN (COSTS OFF)
SELECT count(*) FROM tenk1 WHERE stringu1 = 'TVAAAA';
--
2.39.0
Attachment: v22-0021-Add-TIDStore-to-store-sets-of-TIDs-ItemPointerDa.patch (text/x-patch)
From 777bc2d7c18cba89122e581962634696e72ada56 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 23 Dec 2022 23:50:39 +0900
Subject: [PATCH v22 21/22] Add TIDStore, to store sets of TIDs
(ItemPointerData) efficiently.
The TIDStore is designed to store large sets of TIDs efficiently, and
is backed by a radix tree. A TID is encoded into a 64-bit key and a
64-bit value and inserted into the radix tree.
The TIDStore is not used for anything yet, aside from the test code,
but the follow-up patch integrates the TIDStore with lazy vacuum,
reducing lazy vacuum memory usage and lifting the 1GB limit on its
size, by storing the list of dead TIDs more efficiently.
This includes a unit test module, in src/test/modules/test_tidstore.
---
doc/src/sgml/monitoring.sgml | 4 +
src/backend/access/common/Makefile | 1 +
src/backend/access/common/meson.build | 1 +
src/backend/access/common/tidstore.c | 626 ++++++++++++++++++
src/backend/storage/lmgr/lwlock.c | 2 +
src/include/access/tidstore.h | 49 ++
src/include/storage/lwlock.h | 1 +
src/test/modules/Makefile | 1 +
src/test/modules/meson.build | 1 +
src/test/modules/test_tidstore/Makefile | 23 +
.../test_tidstore/expected/test_tidstore.out | 13 +
src/test/modules/test_tidstore/meson.build | 35 +
.../test_tidstore/sql/test_tidstore.sql | 7 +
.../test_tidstore/test_tidstore--1.0.sql | 8 +
.../modules/test_tidstore/test_tidstore.c | 189 ++++++
.../test_tidstore/test_tidstore.control | 4 +
16 files changed, 965 insertions(+)
create mode 100644 src/backend/access/common/tidstore.c
create mode 100644 src/include/access/tidstore.h
create mode 100644 src/test/modules/test_tidstore/Makefile
create mode 100644 src/test/modules/test_tidstore/expected/test_tidstore.out
create mode 100644 src/test/modules/test_tidstore/meson.build
create mode 100644 src/test/modules/test_tidstore/sql/test_tidstore.sql
create mode 100644 src/test/modules/test_tidstore/test_tidstore--1.0.sql
create mode 100644 src/test/modules/test_tidstore/test_tidstore.c
create mode 100644 src/test/modules/test_tidstore/test_tidstore.control
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 1756f1a4b6..d936aa3da3 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -2192,6 +2192,10 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
<entry>Waiting to access a shared TID bitmap during a parallel bitmap
index scan.</entry>
</row>
+ <row>
+ <entry><literal>SharedTidStore</literal></entry>
+ <entry>Waiting to access a shared TID store.</entry>
+ </row>
<row>
<entry><literal>SharedTupleStore</literal></entry>
<entry>Waiting to access a shared tuple store during parallel
diff --git a/src/backend/access/common/Makefile b/src/backend/access/common/Makefile
index b9aff0ccfd..67b8cc6108 100644
--- a/src/backend/access/common/Makefile
+++ b/src/backend/access/common/Makefile
@@ -27,6 +27,7 @@ OBJS = \
syncscan.o \
toast_compression.o \
toast_internals.o \
+ tidstore.o \
tupconvert.o \
tupdesc.o
diff --git a/src/backend/access/common/meson.build b/src/backend/access/common/meson.build
index f5ac17b498..fce19c09ce 100644
--- a/src/backend/access/common/meson.build
+++ b/src/backend/access/common/meson.build
@@ -15,6 +15,7 @@ backend_sources += files(
'syncscan.c',
'toast_compression.c',
'toast_internals.c',
+ 'tidstore.c',
'tupconvert.c',
'tupdesc.c',
)
diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
new file mode 100644
index 0000000000..26e3077b5e
--- /dev/null
+++ b/src/backend/access/common/tidstore.c
@@ -0,0 +1,626 @@
+/*-------------------------------------------------------------------------
+ *
+ * tidstore.c
+ * Tid (ItemPointerData) storage implementation.
+ *
+ * This module provides an in-memory data structure to store Tids (ItemPointer).
+ * Internally, a Tid is encoded as a pair of a 64-bit key and a 64-bit value, and
+ * stored in the radix tree.
+ *
+ * A TidStore can be shared among parallel worker processes by passing a DSA area
+ * to tidstore_create(). Other backends can attach to the shared TidStore with
+ * tidstore_attach(). It supports concurrent updates, but only one process
+ * is allowed to iterate over the TidStore at a time.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/common/tidstore.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "access/tidstore.h"
+#include "miscadmin.h"
+#include "port/pg_bitutils.h"
+#include "storage/lwlock.h"
+#include "utils/dsa.h"
+#include "utils/memutils.h"
+
+/*
+ * For encoding purposes, item pointers are represented as a pair of 64-bit
+ * key and 64-bit value. First, we construct a 64-bit unsigned integer key that
+ * combines the block number and the offset number. The lowest 11 bits represent
+ * the offset number, and the next 32 bits are the block number. That is, only 43
+ * bits are used:
+ *
+ * XXXXXXXX XXXYYYYY YYYYYYYY YYYYYYYY YYYYYYYY YYYuuuu
+ *
+ * X = bits used for offset number
+ * Y = bits used for block number
+ * u = unused bit
+ *
+ * 11 bits enough for the offset number, because MaxHeapTuplesPerPage < 2^11
+ * on all supported block sizes (TIDSTORE_OFFSET_NBITS). We are frugal with
+ * the bits, because smaller keys could help keeping the radix tree shallow.
+ *
+ * The 64-bit value is the bitmap representation of the lowest 6 bits, and
+ * the remaining 37 bits are used as the key:
+ *
+ * value = bitmap representation of XXXXXX
+ * key = XXXXXYYY YYYYYYYY YYYYYYYY YYYYYYYY YYYYYuu
+ *
+ * The maximum height of the radix tree is 5.
+ *
+ * XXX: if we want to support non-heap table AM that want to use the full
+ * range of possible offset numbers, we'll need to reconsider
+ * TIDSTORE_OFFSET_NBITS value.
+ */
+#define TIDSTORE_OFFSET_NBITS 11
+#define TIDSTORE_VALUE_NBITS 6
+
+/*
+ * Memory consumption depends not only on the number of Tids stored, but also
+ * on their distribution, on how the radix tree stores them, and on the memory
+ * management that backs the radix tree. The maximum number of bytes that a
+ * TidStore can use is specified by max_bytes in tidstore_create(). We want
+ * the total memory consumption not to exceed max_bytes.
+ *
+ * In non-shared cases, the radix tree uses slab allocators for each kind of
+ * node class. The most memory-consuming case while adding Tids associated
+ * with one page (i.e., during tidstore_add_tids()) is that we allocate the
+ * largest radix tree node in a new slab block, which is approximately 70kB.
+ * Therefore, we deduct 70kB from the maximum bytes.
+ *
+ * In shared cases, DSA allocates memory segments big enough to follow
+ * a geometric series that approximately doubles the total DSA size (see
+ * make_new_segment() in dsa.c). We simulated how DSA increases the segment
+ * size, and the simulation showed that a 75% threshold for the maximum bytes
+ * works well when max_bytes is a power of two, and a 60% threshold works
+ * for other cases.
+ */
+#define TIDSTORE_LOCAL_MAX_MEMORY_DEDUCT (1024L * 70) /* 70kB */
+#define TIDSTORE_SHARED_MAX_MEMORY_RATIO_PO2 (float) 0.75
+#define TIDSTORE_SHARED_MAX_MEMORY_RATIO (float) 0.6
+
+#define KEY_GET_BLKNO(key) \
+ ((BlockNumber) ((key) >> (TIDSTORE_OFFSET_NBITS - TIDSTORE_VALUE_NBITS)))
+#define BLKNO_GET_KEY(blkno) \
+ (((uint64) (blkno) << (TIDSTORE_OFFSET_NBITS - TIDSTORE_VALUE_NBITS)))
+
+/* A magic value used to identify our TidStores. */
+#define TIDSTORE_MAGIC 0x826f6a10
+
+#define RT_PREFIX local_rt
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#define RT_VALUE_TYPE uint64
+#include "lib/radixtree.h"
+
+#define RT_PREFIX shared_rt
+#define RT_SHMEM
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#define RT_VALUE_TYPE uint64
+#include "lib/radixtree.h"
+
+/* The header object for a TidStore */
+typedef struct TidStoreControl
+{
+ /*
+ * 'num_tids' is the number of Tids stored so far. 'max_bytes' is the maximum
+ * bytes a TidStore can use. These two fields are used in both the
+ * non-shared case and the shared case.
+ */
+ uint64 num_tids;
+ uint64 max_bytes;
+
+ /* The below fields are used only in shared case */
+
+ uint32 magic;
+
+ /* protect the shared fields */
+ LWLock lock;
+
+ /* handles for TidStore and radix tree */
+ tidstore_handle handle;
+ shared_rt_handle tree_handle;
+} TidStoreControl;
+
+/* Per-backend state for a TidStore */
+struct TidStore
+{
+ /*
+ * Control object. This is allocated in DSA area 'area' in the shared
+ * case, otherwise in backend-local memory.
+ */
+ TidStoreControl *control;
+
+ /* Storage for Tids. Use either one depending on TidStoreIsShared() */
+ union
+ {
+ local_rt_radix_tree *local;
+ shared_rt_radix_tree *shared;
+ } tree;
+
+ /* DSA area for TidStore if used */
+ dsa_area *area;
+};
+#define TidStoreIsShared(ts) ((ts)->area != NULL)
+
+/* Iterator for TidStore */
+typedef struct TidStoreIter
+{
+ TidStore *ts;
+
+ /* iterator of radix tree. Use either one depending on TidStoreIsShared() */
+ union
+ {
+ shared_rt_iter *shared;
+ local_rt_iter *local;
+ } tree_iter;
+
+ /* we returned all tids? */
+ bool finished;
+
+ /* save for the next iteration */
+ uint64 next_key;
+ uint64 next_val;
+
+ /* output for the caller */
+ TidStoreIterResult result;
+} TidStoreIter;
+
+static void tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, uint64 val);
+static inline uint64 tid_to_key_off(ItemPointer tid, uint32 *off);
+
+/*
+ * Create a TidStore. The returned object is allocated in backend-local memory.
+ * The radix tree for storage is allocated in the DSA area if 'area' is non-NULL.
+ */
+TidStore *
+tidstore_create(uint64 max_bytes, dsa_area *area)
+{
+ TidStore *ts;
+
+ ts = palloc0(sizeof(TidStore));
+
+ /*
+ * Create the radix tree for the main storage.
+ */
+ if (area != NULL)
+ {
+ dsa_pointer dp;
+ float ratio = ((max_bytes & (max_bytes - 1)) == 0)
+ ? TIDSTORE_SHARED_MAX_MEMORY_RATIO_PO2
+ : TIDSTORE_SHARED_MAX_MEMORY_RATIO;
+
+ ts->tree.shared = shared_rt_create(CurrentMemoryContext, area);
+
+ dp = dsa_allocate0(area, sizeof(TidStoreControl));
+ ts->control = (TidStoreControl *) dsa_get_address(area, dp);
+ ts->control->max_bytes =(uint64) (max_bytes * ratio);
+ ts->area = area;
+
+ ts->control->magic = TIDSTORE_MAGIC;
+ LWLockInitialize(&ts->control->lock, LWTRANCHE_SHARED_TIDSTORE);
+ ts->control->handle = dp;
+ ts->control->tree_handle = shared_rt_get_handle(ts->tree.shared);
+ }
+ else
+ {
+ ts->tree.local = local_rt_create(CurrentMemoryContext);
+
+ ts->control = (TidStoreControl *) palloc0(sizeof(TidStoreControl));
+ ts->control->max_bytes = max_bytes - TIDSTORE_LOCAL_MAX_MEMORY_DEDUCT;
+ }
+
+ return ts;
+}
+
+/*
+ * Attach to the shared TidStore using a handle. The returned object is
+ * allocated in backend-local memory using the CurrentMemoryContext.
+ */
+TidStore *
+tidstore_attach(dsa_area *area, tidstore_handle handle)
+{
+ TidStore *ts;
+ dsa_pointer control;
+
+ Assert(area != NULL);
+ Assert(DsaPointerIsValid(handle));
+
+ /* create per-backend state */
+ ts = palloc0(sizeof(TidStore));
+
+ /* Find the control object in shared memory */
+ control = handle;
+
+ /* Set up the TidStore */
+ ts->control = (TidStoreControl *) dsa_get_address(area, control);
+ Assert(ts->control->magic == TIDSTORE_MAGIC);
+
+ ts->tree.shared = shared_rt_attach(area, ts->control->tree_handle);
+ ts->area = area;
+
+ return ts;
+}
+
+/*
+ * Detach from a TidStore. This detaches from radix tree and frees the
+ * backend-local resources. The radix tree will continue to exist until
+ * it is either explicitly destroyed, or the area that backs it is returned
+ * to the operating system.
+ */
+void
+tidstore_detach(TidStore *ts)
+{
+ Assert(TidStoreIsShared(ts) && ts->control->magic == TIDSTORE_MAGIC);
+
+ shared_rt_detach(ts->tree.shared);
+ pfree(ts);
+}
+
+/*
+ * Destroy a TidStore, returning all memory. The caller must be certain that
+ * no other backend will attempt to access the TidStore before calling this
+ * function. Other backend must explicitly call tidstore_detach to free up
+ * backend-local memory associated with the TidStore. The backend that calls
+ * tidstore_destroy must not call tidstore_detach.
+ */
+void
+tidstore_destroy(TidStore *ts)
+{
+ if (TidStoreIsShared(ts))
+ {
+ Assert(ts->control->magic == TIDSTORE_MAGIC);
+
+ /*
+ * Vandalize the control block to help catch programming errors where
+ * other backends access the memory formerly occupied by this radix
+ * tree.
+ */
+ ts->control->magic = 0;
+ dsa_free(ts->area, ts->control->handle);
+ shared_rt_free(ts->tree.shared);
+ }
+ else
+ {
+ pfree(ts->control);
+ local_rt_free(ts->tree.local);
+ }
+
+ pfree(ts);
+}
+
+/* Forget all collected Tids */
+void
+tidstore_reset(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ if (TidStoreIsShared(ts))
+ {
+ LWLockAcquire(&ts->control->lock, LW_EXCLUSIVE);
+
+ /*
+ * Free the radix tree and return allocated DSA segments to
+ * the operating system.
+ */
+ shared_rt_free(ts->tree.shared);
+ dsa_trim(ts->area);
+
+ /* Recreate the radix tree */
+ ts->tree.shared = shared_rt_create(CurrentMemoryContext, ts->area);
+
+ /* update the radix tree handle as we recreated it */
+ ts->control->tree_handle = shared_rt_get_handle(ts->tree.shared);
+
+ /* Reset the statistics */
+ ts->control->num_tids = 0;
+
+ LWLockRelease(&ts->control->lock);
+ }
+ else
+ {
+ local_rt_free(ts->tree.local);
+ ts->tree.local = local_rt_create(CurrentMemoryContext);
+
+ /* Reset the statistics */
+ ts->control->num_tids = 0;
+ }
+}
+
+static inline void
+tidstore_insert_kv(TidStore *ts, uint64 key, uint64 val)
+{
+ if (TidStoreIsShared(ts))
+ {
+ /*
+ * Since the shared radix tree supports concurrent insert,
+ * we don't need to acquire the lock.
+ */
+ shared_rt_set(ts->tree.shared, key, val);
+ }
+ else
+ local_rt_set(ts->tree.local, key, val);
+}
+
+/* Add Tids on a block to TidStore */
+void
+tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets)
+{
+#define NUM_KEYS_PER_BLOCK (1 << (TIDSTORE_OFFSET_NBITS - TIDSTORE_VALUE_NBITS))
+ ItemPointerData tid;
+ uint64 key_base;
+ uint64 values[NUM_KEYS_PER_BLOCK] = {0};
+
+ ItemPointerSetBlockNumber(&tid, blkno);
+ key_base = BLKNO_GET_KEY(blkno);
+
+ for (int i = 0; i < num_offsets; i++)
+ {
+ uint64 key;
+ uint32 off;
+ int idx;
+
+ ItemPointerSetOffsetNumber(&tid, offsets[i]);
+
+ /* encode the Tid to key and val */
+ key = tid_to_key_off(&tid, &off);
+
+ idx = key - key_base;
+ Assert(idx >= 0 && idx < NUM_KEYS_PER_BLOCK);
+
+ values[idx] |= UINT64CONST(1) << off;
+ }
+
+ /* insert the calculated key-values to the tree */
+ for (int i = 0; i < NUM_KEYS_PER_BLOCK; i++)
+ {
+ if (values[i])
+ {
+ uint64 key = key_base + i;
+
+ tidstore_insert_kv(ts, key, values[i]);
+ }
+ }
+
+ if (TidStoreIsShared(ts))
+ LWLockAcquire(&ts->control->lock, LW_EXCLUSIVE);
+
+ /* update statistics */
+ ts->control->num_tids += num_offsets;
+
+ if (TidStoreIsShared(ts))
+ LWLockRelease(&ts->control->lock);
+}
+
+/* Return true if the given Tid is present in TidStore */
+bool
+tidstore_lookup_tid(TidStore *ts, ItemPointer tid)
+{
+ uint64 key;
+ uint64 val;
+ uint32 off;
+ bool found;
+
+ key = tid_to_key_off(tid, &off);
+
+ found = TidStoreIsShared(ts) ?
+ shared_rt_search(ts->tree.shared, key, &val) :
+ local_rt_search(ts->tree.local, key, &val);
+
+ if (!found)
+ return false;
+
+ return (val & (UINT64CONST(1) << off)) != 0;
+}
+
+/*
+ * Prepare to iterate through a TidStore. The caller must be certain that
+ * no other backend will attempt to update the TidStore during the iteration.
+ */
+TidStoreIter *
+tidstore_begin_iterate(TidStore *ts)
+{
+ TidStoreIter *iter;
+
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ iter = palloc0(sizeof(TidStoreIter));
+ iter->ts = ts;
+ iter->result.blkno = InvalidBlockNumber;
+
+ if (TidStoreIsShared(ts))
+ iter->tree_iter.shared = shared_rt_begin_iterate(ts->tree.shared);
+ else
+ iter->tree_iter.local = local_rt_begin_iterate(ts->tree.local);
+
+ /* If the TidStore is empty, there is nothing to iterate */
+ if (tidstore_num_tids(ts) == 0)
+ iter->finished = true;
+
+ return iter;
+}
+
+static inline bool
+tidstore_iter_kv(TidStoreIter *iter, uint64 *key, uint64 *val)
+{
+ if (TidStoreIsShared(iter->ts))
+ return shared_rt_iterate_next(iter->tree_iter.shared, key, val);
+ else
+ return local_rt_iterate_next(iter->tree_iter.local, key, val);
+}
+
+/*
+ * Scan the TidStore and return a TidStoreIterResult representing Tids
+ * in one page. Offset numbers in the result are sorted.
+ */
+TidStoreIterResult *
+tidstore_iterate_next(TidStoreIter *iter)
+{
+ uint64 key;
+ uint64 val;
+ TidStoreIterResult *result = &(iter->result);
+
+ if (iter->finished)
+ return NULL;
+
+ if (BlockNumberIsValid(result->blkno))
+ {
+ result->num_offsets = 0;
+ tidstore_iter_extract_tids(iter, iter->next_key, iter->next_val);
+ }
+
+ while (tidstore_iter_kv(iter, &key, &val))
+ {
+ BlockNumber blkno;
+
+ blkno = KEY_GET_BLKNO(key);
+
+ if (BlockNumberIsValid(result->blkno) && result->blkno != blkno)
+ {
+ /*
+ * Remember the key-value pair for the next block for the
+ * next iteration.
+ */
+ iter->next_key = key;
+ iter->next_val = val;
+ return result;
+ }
+
+ /* Collect tids extracted from the key-value pair */
+ tidstore_iter_extract_tids(iter, key, val);
+ }
+
+ iter->finished = true;
+ return result;
+}
+
+/* Finish an iteration over TidStore */
+void
+tidstore_end_iterate(TidStoreIter *iter)
+{
+ if (TidStoreIsShared(iter->ts))
+ shared_rt_end_iterate(iter->tree_iter.shared);
+ else
+ local_rt_end_iterate(iter->tree_iter.local);
+
+ pfree(iter);
+}
+
+/* Return the number of Tids we collected so far */
+uint64
+tidstore_num_tids(TidStore *ts)
+{
+ uint64 num_tids;
+
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ /* in the local case there is no lock to take */
+ if (!TidStoreIsShared(ts))
+ return ts->control->num_tids;
+
+ LWLockAcquire(&ts->control->lock, LW_SHARED);
+ num_tids = ts->control->num_tids;
+ LWLockRelease(&ts->control->lock);
+
+ return num_tids;
+}
+
+/* Return true if the current memory usage of TidStore exceeds the limit */
+bool
+tidstore_is_full(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ return (tidstore_memory_usage(ts) > ts->control->max_bytes);
+}
+
+/* Return the maximum memory TidStore can use */
+uint64
+tidstore_max_memory(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ return ts->control->max_bytes;
+}
+
+/* Return the memory usage of TidStore */
+uint64
+tidstore_memory_usage(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ /*
+ * In the shared case, TidStoreControl and radix_tree are backed by the
+ * same DSA area and rt_memory_usage() returns the value including both.
+ * So we don't need to add the size of TidStoreControl separately.
+ */
+ if (TidStoreIsShared(ts))
+ return (uint64) sizeof(TidStore) + shared_rt_memory_usage(ts->tree.shared);
+ else
+ return (uint64) sizeof(TidStore) + sizeof(TidStoreControl) +
+ local_rt_memory_usage(ts->tree.local);
+}
+
+/*
+ * Get a handle that can be used by other processes to attach to this TidStore
+ */
+tidstore_handle
+tidstore_get_handle(TidStore *ts)
+{
+ Assert(TidStoreIsShared(ts) && ts->control->magic == TIDSTORE_MAGIC);
+
+ return ts->control->handle;
+}
+
+/* Extract Tids from the given key-value pair */
+static void
+tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, uint64 val)
+{
+ TidStoreIterResult *result = (&iter->result);
+
+ for (int i = 0; i < sizeof(uint64) * BITS_PER_BYTE; i++)
+ {
+ uint64 tid_i;
+ OffsetNumber off;
+
+ if ((val & (UINT64CONST(1) << i)) == 0)
+ continue;
+
+ tid_i = key << TIDSTORE_VALUE_NBITS;
+ tid_i |= i;
+
+ off = tid_i & ((UINT64CONST(1) << TIDSTORE_OFFSET_NBITS) - 1);
+ result->offsets[result->num_offsets++] = off;
+ }
+
+ result->blkno = KEY_GET_BLKNO(key);
+}
+
+/*
+ * Encode a Tid to key and val.
+ */
+static inline uint64
+tid_to_key_off(ItemPointer tid, uint32 *off)
+{
+ uint64 upper;
+ uint64 tid_i;
+
+ tid_i = ItemPointerGetOffsetNumber(tid);
+ tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << TIDSTORE_OFFSET_NBITS;
+
+ *off = tid_i & ((1 << TIDSTORE_VALUE_NBITS) - 1);
+ upper = tid_i >> TIDSTORE_VALUE_NBITS;
+ Assert(*off < (sizeof(uint64) * BITS_PER_BYTE));
+
+ return upper;
+}
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index d2ec396045..55b3a04097 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -176,6 +176,8 @@ static const char *const BuiltinTrancheNames[] = {
"SharedTupleStore",
/* LWTRANCHE_SHARED_TIDBITMAP: */
"SharedTidBitmap",
+ /* LWTRANCHE_SHARED_TIDSTORE: */
+ "SharedTidStore",
/* LWTRANCHE_PARALLEL_APPEND: */
"ParallelAppend",
/* LWTRANCHE_PER_XACT_PREDICATE_LIST: */
diff --git a/src/include/access/tidstore.h b/src/include/access/tidstore.h
new file mode 100644
index 0000000000..ec3d9f87f5
--- /dev/null
+++ b/src/include/access/tidstore.h
@@ -0,0 +1,49 @@
+/*-------------------------------------------------------------------------
+ *
+ * tidstore.h
+ * Tid storage.
+ *
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/tidstore.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef TIDSTORE_H
+#define TIDSTORE_H
+
+#include "storage/itemptr.h"
+#include "utils/dsa.h"
+
+typedef dsa_pointer tidstore_handle;
+
+typedef struct TidStore TidStore;
+typedef struct TidStoreIter TidStoreIter;
+
+typedef struct TidStoreIterResult
+{
+ BlockNumber blkno;
+ OffsetNumber offsets[MaxOffsetNumber]; /* XXX: usually don't use up */
+ int num_offsets;
+} TidStoreIterResult;
+
+extern TidStore *tidstore_create(uint64 max_bytes, dsa_area *dsa);
+extern TidStore *tidstore_attach(dsa_area *dsa, dsa_pointer handle);
+extern void tidstore_detach(TidStore *ts);
+extern void tidstore_destroy(TidStore *ts);
+extern void tidstore_reset(TidStore *ts);
+extern void tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets);
+extern bool tidstore_lookup_tid(TidStore *ts, ItemPointer tid);
+extern TidStoreIter * tidstore_begin_iterate(TidStore *ts);
+extern TidStoreIterResult *tidstore_iterate_next(TidStoreIter *iter);
+extern void tidstore_end_iterate(TidStoreIter *iter);
+extern uint64 tidstore_num_tids(TidStore *ts);
+extern bool tidstore_is_full(TidStore *ts);
+extern uint64 tidstore_max_memory(TidStore *ts);
+extern uint64 tidstore_memory_usage(TidStore *ts);
+extern tidstore_handle tidstore_get_handle(TidStore *ts);
+
+#endif /* TIDSTORE_H */
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index d2c7afb8f4..07002fdfbe 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -199,6 +199,7 @@ typedef enum BuiltinTrancheIds
LWTRANCHE_PER_SESSION_RECORD_TYPMOD,
LWTRANCHE_SHARED_TUPLESTORE,
LWTRANCHE_SHARED_TIDBITMAP,
+ LWTRANCHE_SHARED_TIDSTORE,
LWTRANCHE_PARALLEL_APPEND,
LWTRANCHE_PER_XACT_PREDICATE_LIST,
LWTRANCHE_PGSTATS_DSA,
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 9659eb85d7..bddc16ada7 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -34,6 +34,7 @@ SUBDIRS = \
test_rls_hooks \
test_shm_mq \
test_slru \
+ test_tidstore \
unsafe_tests \
worker_spi
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 232cbdac80..c0d5645ad8 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -30,5 +30,6 @@ subdir('test_regex')
subdir('test_rls_hooks')
subdir('test_shm_mq')
subdir('test_slru')
+subdir('test_tidstore')
subdir('unsafe_tests')
subdir('worker_spi')
diff --git a/src/test/modules/test_tidstore/Makefile b/src/test/modules/test_tidstore/Makefile
new file mode 100644
index 0000000000..dab107d70c
--- /dev/null
+++ b/src/test/modules/test_tidstore/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_tidstore/Makefile
+
+MODULE_big = test_tidstore
+OBJS = \
+ $(WIN32RES) \
+ test_tidstore.o
+PGFILEDESC = "test_tidstore - test code for src/backend/access/common/tidstore.c"
+
+EXTENSION = test_tidstore
+DATA = test_tidstore--1.0.sql
+
+REGRESS = test_tidstore
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_tidstore
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_tidstore/expected/test_tidstore.out b/src/test/modules/test_tidstore/expected/test_tidstore.out
new file mode 100644
index 0000000000..7ff2f9af87
--- /dev/null
+++ b/src/test/modules/test_tidstore/expected/test_tidstore.out
@@ -0,0 +1,13 @@
+CREATE EXTENSION test_tidstore;
+--
+-- All the logic is in the test_tidstore() function. It will throw
+-- an error if something fails.
+--
+SELECT test_tidstore();
+NOTICE: testing empty tidstore
+NOTICE: testing basic operations
+ test_tidstore
+---------------
+
+(1 row)
+
diff --git a/src/test/modules/test_tidstore/meson.build b/src/test/modules/test_tidstore/meson.build
new file mode 100644
index 0000000000..31f2da7b61
--- /dev/null
+++ b/src/test/modules/test_tidstore/meson.build
@@ -0,0 +1,35 @@
+# FIXME: prevent install during main install, but not during test :/
+
+test_tidstore_sources = files(
+ 'test_tidstore.c',
+)
+
+if host_system == 'windows'
+ test_tidstore_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'test_tidstore',
+ '--FILEDESC', 'test_tidstore - test code for src/backend/access/common/tidstore.c',])
+endif
+
+test_tidstore = shared_module('test_tidstore',
+ test_tidstore_sources,
+ link_with: pgport_srv,
+ kwargs: pg_mod_args,
+)
+testprep_targets += test_tidstore
+
+install_data(
+ 'test_tidstore.control',
+ 'test_tidstore--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'test_tidstore',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'test_tidstore',
+ ],
+ },
+}
diff --git a/src/test/modules/test_tidstore/sql/test_tidstore.sql b/src/test/modules/test_tidstore/sql/test_tidstore.sql
new file mode 100644
index 0000000000..03aea31815
--- /dev/null
+++ b/src/test/modules/test_tidstore/sql/test_tidstore.sql
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_tidstore;
+
+--
+-- All the logic is in the test_tidstore() function. It will throw
+-- an error if something fails.
+--
+SELECT test_tidstore();
diff --git a/src/test/modules/test_tidstore/test_tidstore--1.0.sql b/src/test/modules/test_tidstore/test_tidstore--1.0.sql
new file mode 100644
index 0000000000..47e9149900
--- /dev/null
+++ b/src/test/modules/test_tidstore/test_tidstore--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_tidstore/test_tidstore--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_tidstore" to load this file. \quit
+
+CREATE FUNCTION test_tidstore()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_tidstore/test_tidstore.c b/src/test/modules/test_tidstore/test_tidstore.c
new file mode 100644
index 0000000000..5d38387450
--- /dev/null
+++ b/src/test/modules/test_tidstore/test_tidstore.c
@@ -0,0 +1,189 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_tidstore.c
+ * Test TidStore data structure.
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/test/modules/test_tidstore/test_tidstore.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "access/tidstore.h"
+#include "fmgr.h"
+#include "miscadmin.h"
+#include "storage/block.h"
+#include "storage/itemptr.h"
+#include "utils/memutils.h"
+
+PG_MODULE_MAGIC;
+
+#define TEST_TIDSTORE_MAX_BYTES (2 * 1024 * 1024L) /* 2MB */
+
+PG_FUNCTION_INFO_V1(test_tidstore);
+
+static void
+check_tid(TidStore *ts, BlockNumber blkno, OffsetNumber off, bool expect)
+{
+ ItemPointerData tid;
+ bool found;
+
+ ItemPointerSet(&tid, blkno, off);
+
+ found = tidstore_lookup_tid(ts, &tid);
+
+ if (found != expect)
+ elog(ERROR, "lookup TID (%u, %u) returned %d, expected %d",
+ blkno, off, found, expect);
+}
+
+static void
+test_basic(void)
+{
+#define TEST_TIDSTORE_NUM_BLOCKS 5
+#define TEST_TIDSTORE_NUM_OFFSETS 11
+#define IS_POWER_OF_TWO(x) (((x) & ((x) - 1)) == 0)
+
+ TidStore *ts;
+ TidStoreIter *iter;
+ TidStoreIterResult *iter_result;
+ BlockNumber blks[TEST_TIDSTORE_NUM_BLOCKS] = {
+ 0, MaxBlockNumber, MaxBlockNumber - 1, 1, MaxBlockNumber / 2,
+ };
+ BlockNumber blks_sorted[TEST_TIDSTORE_NUM_BLOCKS] = {
+ 0, 1, MaxBlockNumber / 2, MaxBlockNumber - 1, MaxBlockNumber
+ };
+ OffsetNumber offs[TEST_TIDSTORE_NUM_OFFSETS] = {
+ 1 << 5, 1 << 6, 1 << 7, 1 << 8, 1 << 9,
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3, 1 << 4,
+ 1 << 10
+ };
+ OffsetNumber offs_sorted[TEST_TIDSTORE_NUM_OFFSETS] = {
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3, 1 << 4,
+ 1 << 5, 1 << 6, 1 << 7, 1 << 8, 1 << 9,
+ 1 << 10
+ };
+ int blk_idx;
+
+ elog(NOTICE, "testing basic operations");
+
+ ts = tidstore_create(TEST_TIDSTORE_MAX_BYTES, NULL);
+
+ /* add tids */
+ for (int i = 0; i < TEST_TIDSTORE_NUM_BLOCKS; i++)
+ tidstore_add_tids(ts, blks[i], offs, TEST_TIDSTORE_NUM_OFFSETS);
+
+ /* lookup test */
+ for (OffsetNumber off = FirstOffsetNumber ; off < MaxHeapTuplesPerPage;
+ off++)
+ {
+ check_tid(ts, 0, off, IS_POWER_OF_TWO(off));
+ check_tid(ts, 2, off, false);
+ check_tid(ts, MaxBlockNumber - 2, off, false);
+ check_tid(ts, MaxBlockNumber, off, IS_POWER_OF_TWO(off));
+ }
+
+ /* test the number of tids */
+ if (tidstore_num_tids(ts) != (TEST_TIDSTORE_NUM_BLOCKS * TEST_TIDSTORE_NUM_OFFSETS))
+ elog(ERROR, "tidstore_num_tids returned " UINT64_FORMAT ", expected %d",
+ tidstore_num_tids(ts),
+ TEST_TIDSTORE_NUM_BLOCKS * TEST_TIDSTORE_NUM_OFFSETS);
+
+ /* iteration test */
+ iter = tidstore_begin_iterate(ts);
+ blk_idx = 0;
+ while ((iter_result = tidstore_iterate_next(iter)) != NULL)
+ {
+ /* check the returned block number */
+ if (blks_sorted[blk_idx] != iter_result->blkno)
+ elog(ERROR, "tidstore_iterate_next returned block number %u, expected %u",
+ iter_result->blkno, blks_sorted[blk_idx]);
+
+ /* check the returned offset numbers */
+ if (TEST_TIDSTORE_NUM_OFFSETS != iter_result->num_offsets)
+ elog(ERROR, "tidstore_iterate_next returned %u offsets, expected %u",
+ iter_result->num_offsets, TEST_TIDSTORE_NUM_OFFSETS);
+
+ for (int i = 0; i < iter_result->num_offsets; i++)
+ {
+ if (offs_sorted[i] != iter_result->offsets[i])
+ elog(ERROR, "tidstore_iterate_next returned offset number %u on block %u, expected %u",
+ iter_result->offsets[i], iter_result->blkno,
+ offs_sorted[i]);
+ }
+
+ blk_idx++;
+ }
+
+ if (blk_idx != TEST_TIDSTORE_NUM_BLOCKS)
+ elog(ERROR, "tidstore_iterate_next returned %d blocks, expected %d",
+ blk_idx, TEST_TIDSTORE_NUM_BLOCKS);
+
+ /* remove all tids */
+ tidstore_reset(ts);
+
+ /* test the number of tids */
+ if (tidstore_num_tids(ts) != 0)
+ elog(ERROR, "tidstore_num_tids on empty store returned non-zero");
+
+ /* lookup test for empty store */
+ for (OffsetNumber off = FirstOffsetNumber ; off < MaxHeapTuplesPerPage;
+ off++)
+ {
+ check_tid(ts, 0, off, false);
+ check_tid(ts, 2, off, false);
+ check_tid(ts, MaxBlockNumber - 2, off, false);
+ check_tid(ts, MaxBlockNumber, off, false);
+ }
+
+ tidstore_destroy(ts);
+}
+
+static void
+test_empty(void)
+{
+ TidStore *ts;
+ TidStoreIter *iter;
+ ItemPointerData tid;
+
+ elog(NOTICE, "testing empty tidstore");
+
+ ts = tidstore_create(TEST_TIDSTORE_MAX_BYTES, NULL);
+
+ ItemPointerSet(&tid, 0, FirstOffsetNumber);
+ if (tidstore_lookup_tid(ts, &tid))
+ elog(ERROR, "tidstore_lookup_tid for (0,1) on empty store returned true");
+
+ ItemPointerSet(&tid, MaxBlockNumber, MaxOffsetNumber);
+ if (tidstore_lookup_tid(ts, &tid))
+ elog(ERROR, "tidstore_lookup_tid for (%u,%u) on empty store returned true",
+ MaxBlockNumber, MaxOffsetNumber);
+
+ if (tidstore_num_tids(ts) != 0)
+ elog(ERROR, "tidstore_num_entries on empty store returned non-zero");
+
+ if (tidstore_is_full(ts))
+ elog(ERROR, "tidstore_is_full on empty store returned true");
+
+ iter = tidstore_begin_iterate(ts);
+
+ if (tidstore_iterate_next(iter) != NULL)
+ elog(ERROR, "tidstore_iterate_next on empty store returned TIDs");
+
+ tidstore_end_iterate(iter);
+
+ tidstore_destroy(ts);
+}
+
+Datum
+test_tidstore(PG_FUNCTION_ARGS)
+{
+ test_empty();
+ test_basic();
+
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_tidstore/test_tidstore.control b/src/test/modules/test_tidstore/test_tidstore.control
new file mode 100644
index 0000000000..9b6bd4638f
--- /dev/null
+++ b/src/test/modules/test_tidstore/test_tidstore.control
@@ -0,0 +1,4 @@
+comment = 'Test code for tidstore'
+default_version = '1.0'
+module_pathname = '$libdir/test_tidstore'
+relocatable = true
--
2.39.0
On Mon, Jan 23, 2023 at 6:00 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
Attached is a rebase to fix conflicts from recent commits.
I have reviewed v22-0022* patch and I have some comments.
1.
It also changes to the column names max_dead_tuples and num_dead_tuples and to
show the progress information in bytes.
I think this statement needs to be rephrased.
2.
/*
* vac_tid_reaped() -- is a particular tid deletable?
*
* This has the right signature to be an IndexBulkDeleteCallback.
*
* Assumes dead_items array is sorted (in ascending TID order).
*/
I think this comment 'Assumes dead_items array is sorted' is not valid anymore.
3.
We are changing the min value of 'maintenance_work_mem' to 2MB. Should
we do the same for the 'autovacuum_work_mem'?
4.
+
+ /* collected LP_DEAD items including existing LP_DEAD items */
+ int lpdead_items;
+ OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
We are actually collecting dead offsets but the variable name says
'lpdead_items' instead of something like ndeadoffsets or num_deadoffsets.
And the comment also says dead items.
5.
/*
* lazy_vacuum_heap_page() -- free page's LP_DEAD items listed in the
* vacrel->dead_items array.
*
* Caller must have an exclusive buffer lock on the buffer (though a full
* cleanup lock is also acceptable). vmbuffer must be valid and already have
* a pin on blkno's visibility map page.
*
* index is an offset into the vacrel->dead_items array for the first listed
* LP_DEAD item on the page. The return value is the first index immediately
* after all LP_DEAD items for the same page in the array.
*/
This comment needs to be changed as this is referring to the
'vacrel->dead_items array' which no longer exists.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Mon, Jan 23, 2023 at 8:20 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Mon, Jan 16, 2023 at 3:18 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Mon, Jan 16, 2023 at 2:02 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
In v21, all of your v20 improvements to the radix tree template and test have been squashed into 0003, with one exception: v20-0010 (recursive freeing of shared mem), which I've attached separately (for flexibility) as v21-0006. I believe one of your earlier patches had a new DSA function for freeing memory more quickly -- was there a problem with that approach? I don't recall where that discussion went.
Hmm, I don't remember I proposed such a patch, either.
One idea to address it would be that we pass a shared memory to
RT_CREATE() and we create a DSA area dedicated to the radix tree in
place. We should return the created DSA area along with the radix tree
so that the caller can use it (e.g., for dsa_get_handle(), dsa_pin(),
and dsa_pin_mapping() etc). In RT_FREE(), we just detach from the DSA
area. A downside of this idea would be that one DSA area only for a
radix tree is always required.
Another idea would be that we allocate a big enough DSA area and
quarry small memory for nodes from there. But it would need to
introduce another complexity so I prefer to avoid it.
FYI the current design is inspired by dshash.c. In dshash_destroy(),
we dsa_free() each element allocated by dshash.c.
+ * XXX: Most functions in this file have two variants for inner nodes and leaf
+ * nodes, therefore there are duplication codes. While this sometimes makes the
+ * code maintenance tricky, this reduces branch prediction misses when judging
+ * whether the node is a inner node of a leaf node.

This comment seems to be out-of-date since we made it a template.
Done in 0020, along with a bunch of other comment editing.
The following macros are defined but not undefined in radixtree.h:
Fixed in v21-0018.
Also:
0007 makes the value type configurable. Some debug functionality still assumes integer type, but I think the rest is agnostic.
radixtree_search_impl.h still assumes that the value type is an
integer type as follows:
#ifdef RT_NODE_LEVEL_LEAF
RT_VALUE_TYPE value = 0;
Assert(RT_NODE_IS_LEAF(node));
#else
Also, I think if we make the value type configurable, it's better to
pass the pointer of the value to RT_SET() instead of copying the
values since the value size could be large.
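In other words, something along these lines (only an illustration; I'm hand-waving the scope and return type here):

    RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p)

so that a large value type is passed by reference rather than copied on every call.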
0010 turns node4 into node3, as discussed, going from 48 bytes to 32.
0012 adopts the benchmark module to the template, and adds meson support (builds with warnings, but okay because not meant for commit).

The rest are cleanups, small refactorings, and more comment rewrites. I've kept them separate for visibility. Next patch can squash them unless there is any discussion.
0008 patch
for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
- fprintf(stderr, "%s\tinner_size %zu\tinner_blocksize
%zu\tleaf_size %zu\tleaf_blocksize %zu\n",
+ fprintf(stderr, "%s\tinner_size %zu\tleaf_size %zu\t%zu\n",
RT_SIZE_CLASS_INFO[i].name,
RT_SIZE_CLASS_INFO[i].inner_size,
- RT_SIZE_CLASS_INFO[i].inner_blocksize,
- RT_SIZE_CLASS_INFO[i].leaf_size,
- RT_SIZE_CLASS_INFO[i].leaf_blocksize);
+ RT_SIZE_CLASS_INFO[i].leaf_size);
There is an additional '%zu' at the end of the format string.
---
0011 patch
+ * 1. With 5 or more kinds, gcc tends to use a jump table for switch
+ * statments.
typo: s/statments/statements/
The rest look good to me. I'll incorporate these fixes in the next
version patch.
uint32 is how we store the block number, so this is too small and will wrap around on overflow. int64 seems better.
Agreed, will fix.
Great, but it's now uint64, not int64. All the large counters in struct LVRelState, for example, are signed integers, as the usual practice. Unsigned ints are "usually" for things like bit patterns and where explicit wraparound is desired. There's probably more that can be done here to change to signed types, but I think it's still a bit early to get to that level of nitpicking. (Soon, I hope :-) )
Agreed. I'll change it in the next version patch.
+ * We calculate the maximum bytes for the TidStore in different ways
+ * for non-shared case and shared case. Please refer to the comment
+ * TIDSTORE_MEMORY_DEDUCT for details.
+ */

Maybe the #define and comment should be close to here.
Will fix.
For this, I intended that "here" meant "in or just above the function".
+#define TIDSTORE_LOCAL_MAX_MEMORY_DEDUCT (1024L * 70) /* 70kB */
+#define TIDSTORE_SHARED_MAX_MEMORY_RATIO_PO2 (float) 0.75
+#define TIDSTORE_SHARED_MAX_MEMORY_RATIO (float) 0.6

These symbols are used only once, in tidstore_create(), and are difficult to read. That function has few comments. The symbols have several paragraphs, but they are far away. It might be better for readability to just hard-code numbers in the function, with the explanation about the numbers near where they are used.
Agreed, will fix.
+ * Destroy a TidStore, returning all memory. The caller must be certain that
+ * no other backend will attempt to access the TidStore before calling this
+ * function. Other backend must explicitly call tidstore_detach to free up
+ * backend-local memory associated with the TidStore. The backend that calls
+ * tidstore_destroy must not call tidstore_detach.
+ */
+void
+tidstore_destroy(TidStore *ts)

If not addressed by next patch, need to phrase comment with FIXME or TODO about making certain.
Will fix.
Did anything change here?
Oops, the fix is missed in the patch for some reason. I'll fix it.
There is also this, in the template, which I'm not sure has been addressed:
* XXX: Currently we allow only one process to do iteration. Therefore, rt_node_iter
* has the local pointers to nodes, rather than RT_PTR_ALLOC.
* We need either a safeguard to disallow other processes to begin the iteration
* while one process is doing or to allow multiple processes to do the iteration.
It's not addressed yet. I think adding a safeguard is better for the
first version. A simple solution is to add a flag, say iter_active, to
allow only one process to enable the iteration. What do you think?
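To be concrete, I'm thinking of something like the following in tidstore_begin_iterate() (a rough sketch; the iter_active field in TidStoreControl and the error message are only illustrative):

    if (TidStoreIsShared(ts))
    {
        LWLockAcquire(&ts->control->lock, LW_EXCLUSIVE);
        if (ts->control->iter_active)
        {
            LWLockRelease(&ts->control->lock);
            elog(ERROR, "concurrent iteration over a shared TidStore is not supported");
        }
        ts->control->iter_active = true;
        LWLockRelease(&ts->control->lock);
    }

and tidstore_end_iterate() would clear the flag again under the lock.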
This part only runs "if (vacrel->nindexes == 0)", so seems like unneeded complexity. It arises because lazy_scan_prune() populates the tid store even if no index vacuuming happens. Perhaps the caller of lazy_scan_prune() could pass the deadoffsets array, and upon returning, either populate the store or call lazy_vacuum_heap_page(), as needed. It's quite possible I'm missing some detail, so some description of the design choices made would be helpful.
I agree that we don't need complexity here. I'll try this idea.
Keeping the offsets array in the prunestate seems to work out well.
Some other quick comments on tid store and vacuum, not comprehensive. Let me know if I've misunderstood something:
TID store:
+ * XXXXXXXX XXXYYYYY YYYYYYYY YYYYYYYY YYYYYYYY YYYuuuu
+ *
+ * X = bits used for offset number
+ * Y = bits used for block number
+ * u = unused bit

I was confused for a while, and I realized the bits are in reverse order from how they are usually pictured (high on left, low on the right).
I borrowed it from ginpostinglist.c but it seems better to write it in
the common order.
+ * 11 bits enough for the offset number, because MaxHeapTuplesPerPage < 2^11
+ * on all supported block sizes (TIDSTORE_OFFSET_NBITS). We are frugal with

+ * XXX: if we want to support non-heap table AM that want to use the full
+ * range of possible offset numbers, we'll need to reconsider
+ * TIDSTORE_OFFSET_NBITS value.

Would it be worth it (or possible) to calculate constants based on compile-time block size? And/or have a fallback for other table AMs? Since this file is in access/common, the intention is to allow general-purpose, I imagine.
I think we can pass the maximum offset numbers to tidstore_create()
and calculate these values.
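For example (a rough sketch; the max_offset parameter and the offset_nbits field don't exist yet and are only illustrative):

    TidStore *
    tidstore_create(uint64 max_bytes, OffsetNumber max_offset, dsa_area *area)
    {
        TidStore   *ts = palloc0(sizeof(TidStore));

        /* bits needed to represent offset numbers up to max_offset */
        ts->offset_nbits = pg_ceil_log2_32(max_offset + 1);
        ...
    }

The key encoding would then use the per-store offset_nbits instead of the compile-time TIDSTORE_OFFSET_NBITS.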
+typedef dsa_pointer tidstore_handle;
It's not clear why we need a typedef here, since here:
+tidstore_attach(dsa_area *area, tidstore_handle handle)
+{
+ TidStore *ts;
+ dsa_pointer control;
...
+ control = handle;

...there is a differently-named dsa_pointer variable that just gets the function parameter.
I guess one reason is to improve compatibility; we can stash the
actual value of the handle, which could help some cases, for example,
when we need to change the actual value of the handle. dshash.c uses
the same idea. Another reason would be to improve readability.
+/* Return the maximum memory TidStore can use */
+uint64
+tidstore_max_memory(TidStore *ts)

size_t is more suitable for memory.
Will fix.
+ /*
+ * Since the shared radix tree supports concurrent insert,
+ * we don't need to acquire the lock.
+ */

Hmm? IIUC, the caller only acquires the lock after returning from here, to update statistics. Why is it safe to insert with no lock? Am I missing something?
You're right. I was missing something. The lock should be taken before
adding key-value pairs.
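I'm thinking of fixing it roughly like this in tidstore_add_tids(), i.e. acquiring the lock before the inserts rather than only around the statistics update (a sketch based on the v22 code):

    if (TidStoreIsShared(ts))
        LWLockAcquire(&ts->control->lock, LW_EXCLUSIVE);

    /* insert the calculated key-values to the tree */
    for (int i = 0; i < NUM_KEYS_PER_BLOCK; i++)
    {
        if (values[i])
            tidstore_insert_kv(ts, key_base + i, values[i]);
    }

    /* update statistics */
    ts->control->num_tids += num_offsets;

    if (TidStoreIsShared(ts))
        LWLockRelease(&ts->control->lock);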
VACUUM integration:
-#define PARALLEL_VACUUM_KEY_DEAD_ITEMS 2
+#define PARALLEL_VACUUM_KEY_DSA 2

Seems like unnecessary churn? It is still all about dead items, after all. I understand using "DSA" for the LWLock, since that matches surrounding code.
Agreed, will remove.
+#define HAS_LPDEAD_ITEMS(state) (((state).lpdead_items) > 0)
This macro helps the patch readability in some places, but I'm not sure it helps readability of the file as a whole. The following is in the patch and seems perfectly clear without the macro:
- if (lpdead_items > 0)
+ if (prunestate->lpdead_items > 0)
Will remove the macro.
About shared memory: I have some mild reservations about the naming of the "control object", which may be in shared memory. Is that an established term? (If so, disregard the rest): It seems backwards -- the thing in shared memory is the actual tree itself. The thing in backend-local memory has the "handle", and that's how we control the tree. I don't have a better naming scheme, though, and might not be that important. (Added a WIP comment)
That seems a valid concern. I borrowed the "control object" from
dshash.c but it supports only shared cases. The fact that the radix
tree supports both local and shared seems to introduce this confusion.
I came up with other names such as RT_RADIX_TREE_CORE or
RT_RADIX_TREE_ROOT but not sure these are better than the current
one.
Now might be a good time to look at earlier XXX comments and come up with a plan to address them.
Agreed.
Other XXX comments that are not mentioned yet are:
+ /* XXX: memory context support */
+ tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE));
I'm not sure we really need memory context support for RT_ATTACH()
since in the shared case, we allocate backend-local memory only for
RT_RADIX_TREE.
---
+RT_SCOPE uint64
+RT_MEMORY_USAGE(RT_RADIX_TREE *tree)
+{
+ // XXX is this necessary?
+ Size total = sizeof(RT_RADIX_TREE);
Regarding this, I followed intset_memory_usage(). But in the radix
tree, RT_RADIX_TREE is very small so probably we can ignore it.
---
+/* XXX For display, assumes value type is numeric */
+static void
+RT_DUMP_NODE(RT_PTR_LOCAL node, int level, bool recurse)
I think we can display values in hex encoded format but given the
value could be large, we don't necessarily need to display actual
values. Displaying the tree structure and chunks would be helpful for
debugging the radix tree.
---
There is no XXX comment but I'll try to add lock support in the next
version patch.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Wed, Jan 25, 2023 at 8:42 AM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
On Mon, Jan 23, 2023 at 8:20 PM John Naylor
<john.naylor@enterprisedb.com> wrote:On Mon, Jan 16, 2023 at 3:18 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
On Mon, Jan 16, 2023 at 2:02 PM John Naylor
<john.naylor@enterprisedb.com> wrote:In v21, all of your v20 improvements to the radix tree template and
test have been squashed into 0003, with one exception: v20-0010 (recursive
freeing of shared mem), which I've attached separately (for flexibility) as
v21-0006. I believe one of your earlier patches had a new DSA function for
freeing memory more quickly -- was there a problem with that approach? I
don't recall where that discussion went.
Hmm, I don't remember I proposed such a patch, either.
I went looking, and it turns out I remembered wrong, sorry.
One idea to address it would be that we pass a shared memory to
RT_CREATE() and we create a DSA area dedicated to the radix tree in
place. We should return the created DSA area along with the radix tree
so that the caller can use it (e.g., for dsa_get_handle(), dsa_pin(),
and dsa_pin_mapping() etc). In RT_FREE(), we just detach from the DSA
area. A downside of this idea would be that one DSA area only for a
radix tree is always required.Another idea would be that we allocate a big enough DSA area and
quarry small memory for nodes from there. But it would need to
introduce another complexity so I prefer to avoid it.FYI the current design is inspired by dshash.c. In dshash_destory(),
we dsa_free() each elements allocated by dshash.c
Okay, thanks for the info.
0007 makes the value type configurable. Some debug functionality still
assumes integer type, but I think the rest is agnostic.
radixtree_search_impl.h still assumes that the value type is an
integer type as follows:

#ifdef RT_NODE_LEVEL_LEAF
RT_VALUE_TYPE value = 0;

Assert(RT_NODE_IS_LEAF(node));
#else

Also, I think if we make the value type configurable, it's better to
pass the pointer of the value to RT_SET() instead of copying the
values since the value size could be large.
Thanks, I will remove the assignment and look into pass-by-reference.
Oops, the fix is missed in the patch for some reason. I'll fix it.
There is also this, in the template, which I'm not sure has been
addressed:
* XXX: Currently we allow only one process to do iteration. Therefore, rt_node_iter
* has the local pointers to nodes, rather than RT_PTR_ALLOC.
* We need either a safeguard to disallow other processes to begin the iteration
* while one process is doing or to allow multiple processes to do the iteration.
It's not addressed yet. I think adding a safeguard is better for the
first version. A simple solution is to add a flag, say iter_active, to
allow only one process to enable the iteration. What do you think?
I don't quite have enough info to offer an opinion, but this sounds like a
different form of locking. I'm sure it's come up before, but could you
describe why iteration is different from other operations, regarding
concurrency?
Would it be worth it (or possible) to calculate constants based on
compile-time block size? And/or have a fallback for other table AMs? Since
this file is in access/common, the intention is to allow general-purpose, I
imagine.
I think we can pass the maximum offset numbers to tidstore_create()
and calculate these values.
That would work easily for vacuumlazy.c, since it's in the "heap" subdir so
we know the max possible offset. I haven't looked at vacuumparallel.c, but
I can tell it is not in a heap-specific directory, so I don't know how easy
that would be to pass along the right value.
About shared memory: I have some mild reservations about the naming of
the "control object", which may be in shared memory. Is that an established
term? (If so, disregard the rest): It seems backwards -- the thing in
shared memory is the actual tree itself. The thing in backend-local memory
has the "handle", and that's how we control the tree. I don't have a better
naming scheme, though, and might not be that important. (Added a WIP
comment)
That seems a valid concern. I borrowed the "control object" from
dshash.c but it supports only shared cases. The fact that the radix
tree supports both local and shared seems to introduce this confusion.
I came up with other names such as RT_RADIX_TREE_CORE or
RT_RADIX_TREE_ROOT but not sure these are better than the current
one.
Okay, if dshash uses it, we have some precedent.
Now might be a good time to look at earlier XXX comments and come up
with a plan to address them.
Agreed.
Other XXX comments that are not mentioned yet are:
+ /* XXX: memory context support */
+ tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE));

I'm not sure we really need memory context support for RT_ATTACH()
since in the shared case, we allocate backend-local memory only for
RT_RADIX_TREE.
Okay, we can remove this.
---

+RT_SCOPE uint64
+RT_MEMORY_USAGE(RT_RADIX_TREE *tree)
+{
+   // XXX is this necessary?
+   Size total = sizeof(RT_RADIX_TREE);

Regarding this, I followed intset_memory_usage(). But in the radix
tree, RT_RADIX_TREE is very small so probably we can ignore it.
That was more a note to myself that I forgot about, so here is my
reasoning: In the shared case, we just overwrite that initial total, but
for the local case we add to it. A future reader could think this is
inconsistent and needs to be fixed. Since we deduct from the guc limit to
guard against worst-case re-allocation, and that deduction is not very
precise (nor needs to be), I agree we should just forget about tiny sizes
like this in both cases.
---

+/* XXX For display, assumes value type is numeric */
+static void
+RT_DUMP_NODE(RT_PTR_LOCAL node, int level, bool recurse)

I think we can display values in hex encoded format but given the
value could be large, we don't necessarily need to display actual
values. Displaying the tree structure and chunks would be helpful for
debugging the radix tree.
Okay, I can try that unless you do it first.
There is no XXX comment but I'll try to add lock support in the next
version patch.
Since there were calls to LWLockAcquire/Release in the last version, I'm a
bit confused by this. Perhaps for the next patch, the email should contain
a few sentences describing how locking is intended to work, including for
iteration.
Hmm, I wonder if we need to use the isolation tester. It's both a blessing
and a curse that the first client of this data structure is tid lookup.
It's a blessing because it doesn't present a highly-concurrent workload
mixing reads and writes and so simple locking is adequate. It's a curse
because to test locking and have any chance of finding bugs, we can't rely
on vacuum to tell us that because (as you've said) it might very well work
fine with no locking at all. So we must come up with test cases ourselves.
--
John Naylor
EDB: http://www.enterprisedb.com
On Tue, Jan 24, 2023 at 1:17 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Mon, Jan 23, 2023 at 6:00 PM John Naylor
<john.naylor@enterprisedb.com> wrote:

Attached is a rebase to fix conflicts from recent commits.
I have reviewed the v22-0022* patch and I have some comments.
1.
It also changes to the column names max_dead_tuples and num_dead_tuples
and to
show the progress information in bytes.
I think this statement needs to be rephrased.
Could you be more specific?
3.
We are changing the min value of 'maintenance_work_mem' to 2MB. Should
we do the same for the 'autovacuum_work_mem'?
Yes, we should change that, too. We've discussed previously that
autovacuum_work_mem is possibly rendered unnecessary by this work, but
we agreed that that should be a separate thread, and it needs
additional testing to verify.
I agree with your other comments.
--
John Naylor
EDB: http://www.enterprisedb.com
On Thu, Jan 26, 2023 at 3:54 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Wed, Jan 25, 2023 at 8:42 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Mon, Jan 23, 2023 at 8:20 PM John Naylor
<john.naylor@enterprisedb.com> wrote:

On Mon, Jan 16, 2023 at 3:18 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Mon, Jan 16, 2023 at 2:02 PM John Naylor
<john.naylor@enterprisedb.com> wrote:

In v21, all of your v20 improvements to the radix tree template and test have been squashed into 0003, with one exception: v20-0010 (recursive freeing of shared mem), which I've attached separately (for flexibility) as v21-0006. I believe one of your earlier patches had a new DSA function for freeing memory more quickly -- was there a problem with that approach? I don't recall where that discussion went.
Hmm, I don't remember proposing such a patch, either.
I went looking, and it turns out I remembered wrong, sorry.
One idea to address it would be that we pass a shared memory to
RT_CREATE() and we create a DSA area dedicated to the radix tree in
place. We should return the created DSA area along with the radix tree
so that the caller can use it (e.g., for dsa_get_handle(), dsa_pin(),
and dsa_pin_mapping() etc). In RT_FREE(), we just detach from the DSA
area. A downside of this idea would be that one DSA area only for a
radix tree is always required.

Another idea would be to allocate a big enough DSA area and carve out
small chunks of memory for nodes from there. But it would introduce
another layer of complexity, so I prefer to avoid it.

FYI the current design is inspired by dshash.c. In dshash_destroy(),
we dsa_free() each element allocated by dshash.c.

Okay, thanks for the info.
0007 makes the value type configurable. Some debug functionality still assumes integer type, but I think the rest is agnostic.
radixtree_search_impl.h still assumes that the value type is an
integer type, as follows:

#ifdef RT_NODE_LEVEL_LEAF
    RT_VALUE_TYPE value = 0;

    Assert(RT_NODE_IS_LEAF(node));
#else

Also, I think if we make the value type configurable, it's better to
pass a pointer to the value to RT_SET() instead of copying the value,
since the value size could be large.

Thanks, I will remove the assignment and look into pass-by-reference.

Oops, the fix is missing from the patch for some reason. I'll fix it.
There is also this, in the template, which I'm not sure has been addressed:
* XXX: Currently we allow only one process to do iteration. Therefore, rt_node_iter
* has the local pointers to nodes, rather than RT_PTR_ALLOC.
* We need either a safeguard to disallow other processes to begin the iteration
* while one process is doing or to allow multiple processes to do the iteration.

It's not addressed yet. I think adding a safeguard is better for the first version. A simple solution is to add a flag, say iter_active, so that only one process can run an iteration at a time. What do you think?

I don't quite have enough info to offer an opinion, but this sounds like a different form of locking. I'm sure it's come up before, but could you describe why iteration is different from other operations, regarding concurrency?
I think that we need to prevent concurrent updates (RT_SET() and
RT_DELETE()) during the iteration in order to get a consistent result
over the whole iteration. Unlike other operations such as
RT_SET(), we cannot expect that a job doing something for each
key-value pair in the radix tree completes in a short time, so we
cannot keep holding the radix tree lock until the end of the
iteration. So the idea is that we set iter_active to true (with the
lock in exclusive mode), and prevent concurrent updates when the flag
is true.
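
To make the intended usage concrete, here is a minimal sketch of the calling pattern, assuming the template's default rt_* names and a uint64 value type (the loop body is purely illustrative):

static void
scan_all_keys(rt_radix_tree *tree)
{
    rt_iter    *iter;
    uint64      key;
    uint64      value;

    /* rt_begin_iterate() sets iter_active while briefly holding the lock */
    iter = rt_begin_iterate(tree);

    while (rt_iterate_next(iter, &key, &value))
    {
        /*
         * Arbitrarily long per-key work happens here with no lock held.
         * Concurrent rt_set()/rt_delete() calls error out as long as
         * iter_active remains set.
         */
    }

    /* rt_end_iterate() clears iter_active again */
    rt_end_iterate(iter);
}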
Would it be worth it (or possible) to calculate constants based on compile-time block size? And/or have a fallback for other table AMs? Since this file is in access/common, the intention is to allow general-purpose use, I imagine.
I think we can pass the maximum offset number to tidstore_create() and calculate these values.

That would work easily for vacuumlazy.c, since it's in the "heap" subdir so we know the max possible offset. I haven't looked at vacuumparallel.c, but I can tell it is not in a heap-specific directory, so I don't know how easy that would be to pass along the right value.
I think the user (e.g., vacuumlazy.c) can pass the maximum offset
number to the parallel vacuum.
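
As a strawman, here is a hedged sketch of a tidstore_create() that takes the caller's maximum offset number; the signature mirrors the tidstore_create(vac_work_mem, MaxHeapTuplesPerPage, NULL) call in the attached 0018 patch, but the struct fields and the derived constant are purely illustrative:

#include "postgres.h"

#include "port/pg_bitutils.h"
#include "utils/dsa.h"

/* Illustrative layout only; the real TidStore will differ. */
typedef struct TidStore
{
    size_t      max_bytes;      /* memory limit given by the caller */
    int         max_offset;     /* largest offset number the table AM uses */
    int         offset_nbits;   /* bits needed to encode one offset */
    dsa_area   *dsa;            /* NULL for backend-local storage */
} TidStore;

TidStore *
tidstore_create(size_t max_bytes, int max_offset, dsa_area *dsa)
{
    TidStore   *ts = palloc0(sizeof(TidStore));

    ts->max_bytes = max_bytes;
    ts->max_offset = max_offset;

    /*
     * Derive the encoding width from the caller-supplied maximum offset
     * rather than hard-coding MaxHeapTuplesPerPage, so other table AMs
     * could use the store with their own page geometry.
     */
    ts->offset_nbits = pg_ceil_log2_32((uint32) (max_offset + 1));
    ts->dsa = dsa;

    /* The underlying radix tree would be created here. */

    return ts;
}

vacuumparallel.c then only needs to forward the value it receives from its caller, which is what the attached patch does with MaxHeapTuplesPerPage.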
About shared memory: I have some mild reservations about the naming of the "control object", which may be in shared memory. Is that an established term? (If so, disregard the rest): It seems backwards -- the thing in shared memory is the actual tree itself. The thing in backend-local memory has the "handle", and that's how we control the tree. I don't have a better naming scheme, though, and might not be that important. (Added a WIP comment)
That seems a valid concern. I borrowed the "control object" from
dshash.c but it supports only shared cases. The fact that the radix
tree supports both local and shared seems to introduce this confusion.
I came up with other names such as RT_RADIX_TREE_CORE or
RT_RADIX_TREE_ROOT but not sure these are better than the current
one.

Okay, if dshash uses it, we have some precedent.
Now might be a good time to look at earlier XXX comments and come up with a plan to address them.
Agreed.
Other XXX comments that are not mentioned yet are:
+   /* XXX: memory context support */
+   tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE));

I'm not sure we really need memory context support for RT_ATTACH() since in the shared case, we allocate backend-local memory only for RT_RADIX_TREE.

Okay, we can remove this.
---

+RT_SCOPE uint64
+RT_MEMORY_USAGE(RT_RADIX_TREE *tree)
+{
+   // XXX is this necessary?
+   Size total = sizeof(RT_RADIX_TREE);

Regarding this, I followed intset_memory_usage(). But in the radix tree, RT_RADIX_TREE is very small so probably we can ignore it.

That was more a note to myself that I forgot about, so here is my reasoning: In the shared case, we just overwrite that initial total, but for the local case we add to it. A future reader could think this is inconsistent and needs to be fixed. Since we deduct from the guc limit to guard against worst-case re-allocation, and that deduction is not very precise (nor needs to be), I agree we should just forget about tiny sizes like this in both cases.
Thanks for your explanation, agreed.
---

+/* XXX For display, assumes value type is numeric */
+static void
+RT_DUMP_NODE(RT_PTR_LOCAL node, int level, bool recurse)

I think we can display values in hex encoded format but given the value could be large, we don't necessarily need to display actual values. Displaying the tree structure and chunks would be helpful for debugging the radix tree.

Okay, I can try that unless you do it first.
There is no XXX comment but I'll try to add lock support in the next
version patch.

Since there were calls to LWLockAcquire/Release in the last version, I'm a bit confused by this. Perhaps for the next patch, the email should contain a few sentences describing how locking is intended to work, including for iteration.
The lock I'm thinking of adding is a simple readers-writer lock. This
lock is used for concurrent radix tree operations other than the
iteration. For operations concurrent with an iteration, I use the flag,
for the reason I mentioned above.
Hmm, I wonder if we need to use the isolation tester. It's both a blessing and a curse that the first client of this data structure is tid lookup. It's a blessing because it doesn't present a highly-concurrent workload mixing reads and writes and so simple locking is adequate. It's a curse because to test locking and have any chance of finding bugs, we can't rely on vacuum to tell us that because (as you've said) it might very well work fine with no locking at all. So we must come up with test cases ourselves.
Using the isolation tester to test locking seems like a good idea. We
can include it in test_radixtree. But given that the locking in the
radix tree is very simple, the test case would be very simple too. It
may be controversial whether it's worth adding both a new test module
and test cases for such simple locking.
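
If we do add something there, a sketch of such a test could be as small as the following; it assumes the module's existing rt_* instantiation, and catching the error with a bare PG_TRY is only for illustration (a real test would want a subtransaction or an error-expecting SQL-level test):

static void
test_iteration_guard(void)
{
    rt_radix_tree *radixtree;
    rt_iter    *iter;
    bool        failed = false;

    radixtree = rt_create(CurrentMemoryContext);
    rt_set(radixtree, 1, 10);

    /* Begin an iteration; this should set iter_active. */
    iter = rt_begin_iterate(radixtree);

    PG_TRY();
    {
        /* Must be rejected while the iteration is in progress. */
        rt_set(radixtree, 2, 20);
    }
    PG_CATCH();
    {
        failed = true;
        FlushErrorState();
    }
    PG_END_TRY();

    if (!failed)
        elog(ERROR, "updating the radix tree during iteration unexpectedly succeeded");

    rt_end_iterate(iter);
    rt_free(radixtree);
}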
I'm working on the fixes I mentioned in the previous email and going
to share the updated patch today. Please hold off on these fixes, if
that's okay with you.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Thu, Jan 26, 2023 at 5:32 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
I'm working on the fixes I mentioned in the previous email and going
to share the updated patch today. Please hold off on these fixes, if
that's okay with you.
I've attached updated patches. As we agreed, I've merged your changes
from v22 into the main (0003) patch, but I still kept the patch that
recursively frees nodes separate, since we might need more discussion
on it. In the attached v23, patches 0006 through 0016 are fixes and
improvements for the radix tree. I've incorporated all the comments I
got, unless I'm missing something.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachments:
v23-0016-Add-read-write-lock-to-radix-tree-in-RT_SHMEM-ca.patch
From 730cdcba6c89954806ac40e2ed63720a93d3fe56 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 25 Jan 2023 17:43:29 +0900
Subject: [PATCH v23 16/18] Add read-write lock to radix tree in RT_SHMEM case.
---
src/include/lib/radixtree.h | 100 +++++++++++++++++-
.../modules/test_radixtree/test_radixtree.c | 8 +-
2 files changed, 99 insertions(+), 9 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 11716fbfca..542daae6d0 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -40,6 +40,8 @@
* There are some optimizations not yet implemented, particularly path
* compression and lazy path expansion.
*
+ * WIP: describe about how locking works.
+ *
* WIP: the radix tree nodes don't shrink.
*
* To generate a radix tree and associated functions for a use case several
@@ -224,7 +226,7 @@ typedef dsa_pointer RT_HANDLE;
#endif
#ifdef RT_SHMEM
-RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx, dsa_area *dsa);
+RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx, dsa_area *dsa, int tranche_id);
RT_SCOPE RT_RADIX_TREE * RT_ATTACH(dsa_area *dsa, dsa_pointer dp);
RT_SCOPE void RT_DETACH(RT_RADIX_TREE *tree);
RT_SCOPE RT_HANDLE RT_GET_HANDLE(RT_RADIX_TREE *tree);
@@ -371,6 +373,16 @@ typedef struct RT_NODE
#define RT_INVALID_PTR_ALLOC NULL
#endif
+#ifdef RT_SHMEM
+#define RT_LOCK_EXCLUSIVE(tree) LWLockAcquire(&tree->ctl->lock, LW_EXCLUSIVE)
+#define RT_LOCK_SHARED(tree) LWLockAcquire(&tree->ctl->lock, LW_SHARED)
+#define RT_UNLOCK(tree) LWLockRelease(&tree->ctl->lock);
+#else
+#define RT_LOCK_EXCLUSIVE(tree) ((void) 0)
+#define RT_LOCK_SHARED(tree) ((void) 0)
+#define RT_UNLOCK(tree) ((void) 0)
+#endif
+
/*
* Inner nodes and leaf nodes have analogous structure. To distinguish
* them at runtime, we take advantage of the fact that the key chunk
@@ -596,6 +608,7 @@ typedef struct RT_RADIX_TREE_CONTROL
#ifdef RT_SHMEM
RT_HANDLE handle;
uint32 magic;
+ LWLock lock;
#endif
RT_PTR_ALLOC root;
@@ -1376,7 +1389,7 @@ RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC store
*/
RT_SCOPE RT_RADIX_TREE *
#ifdef RT_SHMEM
-RT_CREATE(MemoryContext ctx, dsa_area *dsa)
+RT_CREATE(MemoryContext ctx, dsa_area *dsa, int tranche_id)
#else
RT_CREATE(MemoryContext ctx)
#endif
@@ -1398,6 +1411,7 @@ RT_CREATE(MemoryContext ctx)
tree->ctl = (RT_RADIX_TREE_CONTROL *) dsa_get_address(dsa, dp);
tree->ctl->handle = dp;
tree->ctl->magic = RT_RADIX_TREE_MAGIC;
+ LWLockInitialize(&tree->ctl->lock, tranche_id);
#else
tree->ctl = (RT_RADIX_TREE_CONTROL *) palloc0(sizeof(RT_RADIX_TREE_CONTROL));
@@ -1581,8 +1595,13 @@ RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE value)
Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
#endif
+ RT_LOCK_EXCLUSIVE(tree);
+
if (unlikely(tree->ctl->iter_active))
+ {
+ RT_UNLOCK(tree);
elog(ERROR, "cannot add new key-value to radix tree while iteration is in progress");
+ }
/* Empty tree, create the root */
if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
@@ -1609,6 +1628,7 @@ RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE value)
if (!RT_NODE_SEARCH_INNER(child, key, &new_child))
{
RT_SET_EXTEND(tree, key, value, parent, stored_child, child);
+ RT_UNLOCK(tree);
return false;
}
@@ -1623,12 +1643,13 @@ RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE value)
if (!updated)
tree->ctl->num_keys++;
+ RT_UNLOCK(tree);
return updated;
}
/*
* Search the given key in the radix tree. Return true if there is the key,
- * otherwise return false. On success, we set the value to *val_p so it must
+ * otherwise return false. On success, we set the value to *val_p so it must
* not be NULL.
*/
RT_SCOPE bool
@@ -1636,14 +1657,20 @@ RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p)
{
RT_PTR_LOCAL node;
int shift;
+ bool found;
#ifdef RT_SHMEM
Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
#endif
Assert(value_p != NULL);
+ RT_LOCK_SHARED(tree);
+
if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root) || key > tree->ctl->max_val)
+ {
+ RT_UNLOCK(tree);
return false;
+ }
node = RT_PTR_GET_LOCAL(tree, tree->ctl->root);
shift = node->shift;
@@ -1657,13 +1684,19 @@ RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p)
break;
if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ {
+ RT_UNLOCK(tree);
return false;
+ }
node = RT_PTR_GET_LOCAL(tree, child);
shift -= RT_NODE_SPAN;
}
- return RT_NODE_SEARCH_LEAF(node, key, value_p);
+ found = RT_NODE_SEARCH_LEAF(node, key, value_p);
+
+ RT_UNLOCK(tree);
+ return found;
}
#ifdef RT_USE_DELETE
@@ -1685,11 +1718,19 @@ RT_DELETE(RT_RADIX_TREE *tree, uint64 key)
Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
#endif
+ RT_LOCK_EXCLUSIVE(tree);
+
if (unlikely(tree->ctl->iter_active))
+ {
+ RT_UNLOCK(tree);
elog(ERROR, "cannot delete key to radix tree while iteration is in progress");
+ }
if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root) || key > tree->ctl->max_val)
+ {
+ RT_UNLOCK(tree);
return false;
+ }
/*
* Descend the tree to search the key while building a stack of nodes we
@@ -1708,7 +1749,10 @@ RT_DELETE(RT_RADIX_TREE *tree, uint64 key)
node = RT_PTR_GET_LOCAL(tree, allocnode);
if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ {
+ RT_UNLOCK(tree);
return false;
+ }
allocnode = child;
shift -= RT_NODE_SPAN;
@@ -1721,6 +1765,7 @@ RT_DELETE(RT_RADIX_TREE *tree, uint64 key)
if (!deleted)
{
/* no key is found in the leaf node */
+ RT_UNLOCK(tree);
return false;
}
@@ -1732,7 +1777,10 @@ RT_DELETE(RT_RADIX_TREE *tree, uint64 key)
* node.
*/
if (node->count > 0)
+ {
+ RT_UNLOCK(tree);
return true;
+ }
/* Free the empty leaf node */
RT_FREE_NODE(tree, allocnode);
@@ -1754,6 +1802,7 @@ RT_DELETE(RT_RADIX_TREE *tree, uint64 key)
RT_FREE_NODE(tree, allocnode);
}
+ RT_UNLOCK(tree);
return true;
}
#endif
@@ -1827,8 +1876,13 @@ RT_BEGIN_ITERATE(RT_RADIX_TREE *tree)
RT_PTR_LOCAL root;
int top_level;
+ RT_LOCK_EXCLUSIVE(tree);
+
if (unlikely(tree->ctl->iter_active))
+ {
+ RT_UNLOCK(tree);
elog(ERROR, "cannot begin iteration while another iteration is in progress");
+ }
old_ctx = MemoryContextSwitchTo(tree->context);
@@ -1838,7 +1892,10 @@ RT_BEGIN_ITERATE(RT_RADIX_TREE *tree)
/* empty tree */
if (!iter->tree->ctl->root)
+ {
+ RT_UNLOCK(tree);
return iter;
+ }
root = RT_PTR_GET_LOCAL(tree, iter->tree->ctl->root);
top_level = root->shift / RT_NODE_SPAN;
@@ -1852,11 +1909,12 @@ RT_BEGIN_ITERATE(RT_RADIX_TREE *tree)
MemoryContextSwitchTo(old_ctx);
+ RT_UNLOCK(tree);
return iter;
}
/*
- * Return true with setting key_p and value_p if there is next key. Otherwise,
+ * Return true with setting key_p and value_p if there is next key. Otherwise,
* return false.
*/
RT_SCOPE bool
@@ -1864,9 +1922,14 @@ RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, RT_VALUE_TYPE *value_p)
{
Assert(iter->tree->ctl->iter_active);
+ RT_LOCK_SHARED(iter->tree);
+
/* Empty tree */
if (!iter->tree->ctl->root)
+ {
+ RT_UNLOCK(iter->tree);
return false;
+ }
for (;;)
{
@@ -1882,6 +1945,7 @@ RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, RT_VALUE_TYPE *value_p)
{
*key_p = iter->key;
*value_p = value;
+ RT_UNLOCK(iter->tree);
return true;
}
@@ -1899,7 +1963,10 @@ RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, RT_VALUE_TYPE *value_p)
/* the iteration finished */
if (!child)
+ {
+ RT_UNLOCK(iter->tree);
return false;
+ }
/*
* Set the node to the node iterator and update the iterator stack
@@ -1910,13 +1977,17 @@ RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, RT_VALUE_TYPE *value_p)
/* Node iterators are updated, so try again from the leaf */
}
+ RT_UNLOCK(iter->tree);
return false;
}
RT_SCOPE void
RT_END_ITERATE(RT_ITER *iter)
{
+ RT_LOCK_EXCLUSIVE(iter->tree);
iter->tree->ctl->iter_active = false;
+ RT_UNLOCK(iter->tree);
+
pfree(iter);
}
@@ -1928,6 +1999,8 @@ RT_MEMORY_USAGE(RT_RADIX_TREE *tree)
{
Size total = 0;
+ RT_LOCK_SHARED(tree);
+
#ifdef RT_SHMEM
Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
total = dsa_get_total_size(tree->dsa);
@@ -1939,6 +2012,7 @@ RT_MEMORY_USAGE(RT_RADIX_TREE *tree)
}
#endif
+ RT_UNLOCK(tree);
return total;
}
@@ -2023,6 +2097,8 @@ RT_VERIFY_NODE(RT_PTR_LOCAL node)
RT_SCOPE void
RT_STATS(RT_RADIX_TREE *tree)
{
+ RT_LOCK_SHARED(tree);
+
fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->ctl->max_val);
fprintf(stderr, "num_keys = " UINT64_FORMAT "\n", tree->ctl->num_keys);
@@ -2042,6 +2118,8 @@ RT_STATS(RT_RADIX_TREE *tree)
tree->ctl->cnt[RT_CLASS_125],
tree->ctl->cnt[RT_CLASS_256]);
}
+
+ RT_UNLOCK(tree);
}
static void
@@ -2235,14 +2313,18 @@ RT_DUMP_SEARCH(RT_RADIX_TREE *tree, uint64 key)
RT_STATS(tree);
+ RT_LOCK_SHARED(tree);
+
if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
{
+ RT_UNLOCK(tree);
fprintf(stderr, "empty tree\n");
return;
}
if (key > tree->ctl->max_val)
{
+ RT_UNLOCK(tree);
fprintf(stderr, "key " UINT64_FORMAT "(0x" RT_UINT64_FORMAT_HEX ") is larger than max val\n",
key, key);
return;
@@ -2276,6 +2358,7 @@ RT_DUMP_SEARCH(RT_RADIX_TREE *tree, uint64 key)
shift -= RT_NODE_SPAN;
level++;
}
+ RT_UNLOCK(tree);
fprintf(stderr, "%s", buf.data);
}
@@ -2287,8 +2370,11 @@ RT_DUMP(RT_RADIX_TREE *tree)
RT_STATS(tree);
+ RT_LOCK_SHARED(tree);
+
if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
{
+ RT_UNLOCK(tree);
fprintf(stderr, "empty tree\n");
return;
}
@@ -2296,6 +2382,7 @@ RT_DUMP(RT_RADIX_TREE *tree)
initStringInfo(&buf);
RT_DUMP_NODE(tree, tree->ctl->root, 0, true, &buf);
+ RT_UNLOCK(tree);
fprintf(stderr, "%s",buf.data);
}
@@ -2323,6 +2410,9 @@ RT_DUMP(RT_RADIX_TREE *tree)
#undef RT_GET_KEY_CHUNK
#undef BM_IDX
#undef BM_BIT
+#undef RT_LOCK_EXCLUSIVE
+#undef RT_LOCK_SHARED
+#undef RT_UNLOCK
#undef RT_NODE_IS_LEAF
#undef RT_NODE_MUST_GROW
#undef RT_NODE_KIND_COUNT
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
index 2a93e731ae..bbe1a619b6 100644
--- a/src/test/modules/test_radixtree/test_radixtree.c
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -144,7 +144,7 @@ test_empty(void)
dsa_area *dsa;
dsa = dsa_create(tranche_id);
- radixtree = rt_create(CurrentMemoryContext, dsa);
+ radixtree = rt_create(CurrentMemoryContext, dsa, tranche_id);
#else
radixtree = rt_create(CurrentMemoryContext);
#endif
@@ -195,7 +195,7 @@ test_basic(int children, bool test_inner)
test_inner ? "inner" : "leaf", children);
#ifdef RT_SHMEM
- radixtree = rt_create(CurrentMemoryContext, dsa);
+ radixtree = rt_create(CurrentMemoryContext, dsa, tranche_id);
#else
radixtree = rt_create(CurrentMemoryContext);
#endif
@@ -363,7 +363,7 @@ test_node_types(uint8 shift)
elog(NOTICE, "testing radix tree node types with shift \"%d\"", shift);
#ifdef RT_SHMEM
- radixtree = rt_create(CurrentMemoryContext, dsa);
+ radixtree = rt_create(CurrentMemoryContext, dsa, tranche_id);
#else
radixtree = rt_create(CurrentMemoryContext);
#endif
@@ -434,7 +434,7 @@ test_pattern(const test_spec * spec)
MemoryContextSetIdentifier(radixtree_ctx, spec->test_name);
#ifdef RT_SHMEM
- radixtree = rt_create(radixtree_ctx, dsa);
+ radixtree = rt_create(radixtree_ctx, dsa, tranche_id);
#else
radixtree = rt_create(radixtree_ctx);
#endif
--
2.31.1
v23-0014-Improve-RT_DUMP-and-RT_DUMP_SEARCH-output.patch
From d13da75dfe46d9ea7776751134fe4c22f83cd15d Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 25 Jan 2023 16:52:41 +0900
Subject: [PATCH v23 14/18] Improve RT_DUMP() and RT_DUMP_SEARCH() output.
We don't display values since these might not be integers.
---
src/include/lib/radixtree.h | 201 +++++++++++++++++++++---------------
1 file changed, 118 insertions(+), 83 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index dbf9df604f..11716fbfca 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -2023,32 +2023,46 @@ RT_VERIFY_NODE(RT_PTR_LOCAL node)
RT_SCOPE void
RT_STATS(RT_RADIX_TREE *tree)
{
- ereport(NOTICE, (errmsg("num_keys = " UINT64_FORMAT ", height = %d, n3 = %u, n15 = %u, n32 = %u, n125 = %u, n256 = %u",
- tree->ctl->num_keys,
- tree->ctl->root->shift / RT_NODE_SPAN,
- tree->ctl->cnt[RT_CLASS_3],
- tree->ctl->cnt[RT_CLASS_32_MIN],
- tree->ctl->cnt[RT_CLASS_32_MAX],
- tree->ctl->cnt[RT_CLASS_125],
- tree->ctl->cnt[RT_CLASS_256])));
+ fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->ctl->max_val);
+ fprintf(stderr, "num_keys = " UINT64_FORMAT "\n", tree->ctl->num_keys);
+
+#ifdef RT_SHMEM
+ fprintf(stderr, "handle = " UINT64_FORMAT "\n", tree->ctl->handle);
+#endif
+
+ if (RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ {
+ RT_PTR_LOCAL root = RT_PTR_GET_LOCAL(tree, tree->ctl->root);
+
+ fprintf(stderr, "height = %d, n3 = %u, n15 = %u, n32 = %u, n125 = %u, n256 = %u\n",
+ root->shift / RT_NODE_SPAN,
+ tree->ctl->cnt[RT_CLASS_3],
+ tree->ctl->cnt[RT_CLASS_32_MIN],
+ tree->ctl->cnt[RT_CLASS_32_MAX],
+ tree->ctl->cnt[RT_CLASS_125],
+ tree->ctl->cnt[RT_CLASS_256]);
+ }
}
-/* XXX For display, assumes value type is numeric */
static void
-RT_DUMP_NODE(RT_PTR_LOCAL node, int level, bool recurse)
+RT_DUMP_NODE(RT_RADIX_TREE *tree, RT_PTR_ALLOC allocnode, int level,
+ bool recurse, StringInfo buf)
{
- char space[125] = {0};
+ RT_PTR_LOCAL node = RT_PTR_GET_LOCAL(tree, allocnode);
+ StringInfoData spaces;
- fprintf(stderr, "[%s] kind %d, fanout %d, count %u, shift %u:\n",
- RT_NODE_IS_LEAF(node) ? "LEAF" : "INNR",
- (node->kind == RT_NODE_KIND_3) ? 3 :
- (node->kind == RT_NODE_KIND_32) ? 32 :
- (node->kind == RT_NODE_KIND_125) ? 125 : 256,
- node->fanout == 0 ? 256 : node->fanout,
- node->count, node->shift);
+ initStringInfo(&spaces);
+ appendStringInfoSpaces(&spaces, (level * 4) + 1);
- if (level > 0)
- sprintf(space, "%*c", level * 4, ' ');
+ appendStringInfo(buf, "%s%s[%s] kind %d, fanout %d, count %u, shift %u:\n",
+ spaces.data,
+ level == 0 ? "" : "-> ",
+ RT_NODE_IS_LEAF(node) ? "LEAF" : "INNR",
+ (node->kind == RT_NODE_KIND_3) ? 3 :
+ (node->kind == RT_NODE_KIND_32) ? 32 :
+ (node->kind == RT_NODE_KIND_125) ? 125 : 256,
+ node->fanout == 0 ? 256 : node->fanout,
+ node->count, node->shift);
switch (node->kind)
{
@@ -2060,20 +2074,24 @@ RT_DUMP_NODE(RT_PTR_LOCAL node, int level, bool recurse)
{
RT_NODE_LEAF_3 *n3 = (RT_NODE_LEAF_3 *) node;
- fprintf(stderr, "%schunk 0x%X value 0x" RT_UINT64_FORMAT_HEX "\n",
- space, n3->base.chunks[i], (uint64) n3->values[i]);
+ appendStringInfo(buf, "%schunk[%d] 0x%X\n",
+ spaces.data, i, n3->base.chunks[i]);
}
else
{
RT_NODE_INNER_3 *n3 = (RT_NODE_INNER_3 *) node;
- fprintf(stderr, "%schunk 0x%X ->",
- space, n3->base.chunks[i]);
+ appendStringInfo(buf, "%schunk[%d] 0x%X",
+ spaces.data, i, n3->base.chunks[i]);
if (recurse)
- RT_DUMP_NODE(n3->children[i], level + 1, recurse);
+ {
+ appendStringInfo(buf, "\n");
+ RT_DUMP_NODE(tree, n3->children[i], level + 1,
+ recurse, buf);
+ }
else
- fprintf(stderr, "\n");
+ appendStringInfo(buf, " (skipped)\n");
}
}
break;
@@ -2086,22 +2104,25 @@ RT_DUMP_NODE(RT_PTR_LOCAL node, int level, bool recurse)
{
RT_NODE_LEAF_32 *n32 = (RT_NODE_LEAF_32 *) node;
- fprintf(stderr, "%schunk 0x%X value 0x" RT_UINT64_FORMAT_HEX "\n",
- space, n32->base.chunks[i], (uint64) n32->values[i]);
+ appendStringInfo(buf, "%schunk[%d] 0x%X\n",
+ spaces.data, i, n32->base.chunks[i]);
}
else
{
RT_NODE_INNER_32 *n32 = (RT_NODE_INNER_32 *) node;
- fprintf(stderr, "%schunk 0x%X ->",
- space, n32->base.chunks[i]);
+ appendStringInfo(buf, "%schunk[%d] 0x%X",
+ spaces.data, i, n32->base.chunks[i]);
if (recurse)
{
- RT_DUMP_NODE(n32->children[i], level + 1, recurse);
+ appendStringInfo(buf, "\n");
+ RT_DUMP_NODE(tree, n32->children[i], level + 1,
+ recurse, buf);
}
else
- fprintf(stderr, "\n");
+ appendStringInfo(buf, " (skipped)\n");
+
}
}
break;
@@ -2109,26 +2130,23 @@ RT_DUMP_NODE(RT_PTR_LOCAL node, int level, bool recurse)
case RT_NODE_KIND_125:
{
RT_NODE_BASE_125 *b125 = (RT_NODE_BASE_125 *) node;
+ char *sep = "";
- fprintf(stderr, "slot_idxs ");
+ appendStringInfo(buf, "%sslot_idxs: ", spaces.data);
for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
{
if (!RT_NODE_125_IS_CHUNK_USED(b125, i))
continue;
- fprintf(stderr, " [%d]=%d, ", i, b125->slot_idxs[i]);
+ appendStringInfo(buf, "%s[%d]=%d ",
+ sep, i, b125->slot_idxs[i]);
+ sep = ",";
}
- if (RT_NODE_IS_LEAF(node))
- {
- RT_NODE_LEAF_125 *n = (RT_NODE_LEAF_125 *) node;
- fprintf(stderr, ", isset-bitmap:");
- for (int i = 0; i < BM_IDX(RT_SLOT_IDX_LIMIT); i++)
- {
- fprintf(stderr, RT_UINT64_FORMAT_HEX " ", (uint64) n->base.isset[i]);
- }
- fprintf(stderr, "\n");
- }
+ appendStringInfo(buf, "\n%sisset-bitmap: ", spaces.data);
+ for (int i = 0; i < (RT_SLOT_IDX_LIMIT / BITS_PER_BYTE); i++)
+ appendStringInfo(buf, "%X ", ((uint8 *) b125->isset)[i]);
+ appendStringInfo(buf, "\n");
for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
{
@@ -2136,30 +2154,39 @@ RT_DUMP_NODE(RT_PTR_LOCAL node, int level, bool recurse)
continue;
if (RT_NODE_IS_LEAF(node))
- {
- RT_NODE_LEAF_125 *n125 = (RT_NODE_LEAF_125 *) b125;
-
- fprintf(stderr, "%schunk 0x%X value 0x" RT_UINT64_FORMAT_HEX "\n",
- space, i, (uint64) RT_NODE_LEAF_125_GET_VALUE(n125, i));
- }
+ appendStringInfo(buf, "%schunk 0x%X\n",
+ spaces.data, i);
else
{
RT_NODE_INNER_125 *n125 = (RT_NODE_INNER_125 *) b125;
- fprintf(stderr, "%schunk 0x%X ->",
- space, i);
+ appendStringInfo(buf, "%schunk 0x%X",
+ spaces.data, i);
if (recurse)
- RT_DUMP_NODE(RT_NODE_INNER_125_GET_CHILD(n125, i),
- level + 1, recurse);
+ {
+ appendStringInfo(buf, "\n");
+ RT_DUMP_NODE(tree, RT_NODE_INNER_125_GET_CHILD(n125, i),
+ level + 1, recurse, buf);
+ }
else
- fprintf(stderr, "\n");
+ appendStringInfo(buf, " (skipped)\n");
}
}
break;
}
case RT_NODE_KIND_256:
{
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_256 *n256 = (RT_NODE_LEAF_256 *) node;
+
+ appendStringInfo(buf, "%sisset-bitmap: ", spaces.data);
+ for (int i = 0; i < (RT_SLOT_IDX_LIMIT / BITS_PER_BYTE); i++)
+ appendStringInfo(buf, "%X ", ((uint8 *) n256->isset)[i]);
+ appendStringInfo(buf, "\n");
+ }
+
for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
{
if (RT_NODE_IS_LEAF(node))
@@ -2169,8 +2196,8 @@ RT_DUMP_NODE(RT_PTR_LOCAL node, int level, bool recurse)
if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, i))
continue;
- fprintf(stderr, "%schunk 0x%X value 0x" RT_UINT64_FORMAT_HEX "\n",
- space, i, (uint64) RT_NODE_LEAF_256_GET_VALUE(n256, i));
+ appendStringInfo(buf, "%schunk 0x%X\n",
+ spaces.data, i);
}
else
{
@@ -2179,14 +2206,17 @@ RT_DUMP_NODE(RT_PTR_LOCAL node, int level, bool recurse)
if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
continue;
- fprintf(stderr, "%schunk 0x%X ->",
- space, i);
+ appendStringInfo(buf, "%schunk 0x%X",
+ spaces.data, i);
if (recurse)
- RT_DUMP_NODE(RT_NODE_INNER_256_GET_CHILD(n256, i), level + 1,
- recurse);
+ {
+ appendStringInfo(buf, "\n");
+ RT_DUMP_NODE(tree, RT_NODE_INNER_256_GET_CHILD(n256, i),
+ level + 1, recurse, buf);
+ }
else
- fprintf(stderr, "\n");
+ appendStringInfo(buf, " (skipped)\n");
}
}
break;
@@ -2197,38 +2227,40 @@ RT_DUMP_NODE(RT_PTR_LOCAL node, int level, bool recurse)
RT_SCOPE void
RT_DUMP_SEARCH(RT_RADIX_TREE *tree, uint64 key)
{
+ RT_PTR_ALLOC allocnode;
RT_PTR_LOCAL node;
+ StringInfoData buf;
int shift;
int level = 0;
- elog(NOTICE, "-----------------------------------------------------------");
- elog(NOTICE, "max_val = " UINT64_FORMAT "(0x" RT_UINT64_FORMAT_HEX ")",
- tree->ctl->max_val, tree->ctl->max_val);
+ RT_STATS(tree);
- if (!tree->ctl->root)
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
{
- elog(NOTICE, "tree is empty");
+ fprintf(stderr, "empty tree\n");
return;
}
if (key > tree->ctl->max_val)
{
- elog(NOTICE, "key " UINT64_FORMAT "(0x" RT_UINT64_FORMAT_HEX ") is larger than max val",
- key, key);
+ fprintf(stderr, "key " UINT64_FORMAT "(0x" RT_UINT64_FORMAT_HEX ") is larger than max val\n",
+ key, key);
return;
}
- node = tree->ctl->root;
- shift = tree->ctl->root->shift;
+ initStringInfo(&buf);
+ allocnode = tree->ctl->root;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ shift = node->shift;
while (shift >= 0)
{
- RT_PTR_LOCAL child;
+ RT_PTR_ALLOC child;
- RT_DUMP_NODE(node, level, false);
+ RT_DUMP_NODE(tree, allocnode, level, false, &buf);
if (RT_NODE_IS_LEAF(node))
{
- uint64 dummy;
+ RT_VALUE_TYPE dummy;
/* We reached at a leaf node, find the corresponding slot */
RT_NODE_SEARCH_LEAF(node, key, &dummy);
@@ -2239,30 +2271,33 @@ RT_DUMP_SEARCH(RT_RADIX_TREE *tree, uint64 key)
if (!RT_NODE_SEARCH_INNER(node, key, &child))
break;
- node = child;
+ allocnode = child;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
shift -= RT_NODE_SPAN;
level++;
}
+
+ fprintf(stderr, "%s", buf.data);
}
RT_SCOPE void
RT_DUMP(RT_RADIX_TREE *tree)
{
+ StringInfoData buf;
- for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
- fprintf(stderr, "%s\tinner_size %zu\tleaf_size %zu\n",
- RT_SIZE_CLASS_INFO[i].name,
- RT_SIZE_CLASS_INFO[i].inner_size,
- RT_SIZE_CLASS_INFO[i].leaf_size);
- fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->ctl->max_val);
+ RT_STATS(tree);
- if (!tree->ctl->root)
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
{
fprintf(stderr, "empty tree\n");
return;
}
- RT_DUMP_NODE(tree->ctl->root, 0, true);
+ initStringInfo(&buf);
+
+ RT_DUMP_NODE(tree, tree->ctl->root, 0, true, &buf);
+
+ fprintf(stderr, "%s",buf.data);
}
#endif
--
2.31.1
v23-0018-Use-TIDStore-for-storing-dead-tuple-TID-during-l.patch
From 0822ccf1c1df26abf50e865c62a69a302fcfc58f Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Tue, 17 Jan 2023 17:20:37 +0700
Subject: [PATCH v23 18/18] Use TIDStore for storing dead tuple TID during lazy
vacuum
Previously, we used an array of ItemPointerData to store dead tuple
TIDs, which was not space efficient and was slow to look up. Also, it
had a 1GB limit on its size.
Now we use TIDStore to store dead tuple TIDs. Since the TIDStore,
backed by the radix tree, incrementally allocates memory, we get rid
of the 1GB limit.
Since we are no longer able to exactly estimate the maximum number of
TIDs that can be stored, pg_stat_progress_vacuum now shows the progress
information based on the amount of memory in bytes. The column names
are also changed to max_dead_tuple_bytes and num_dead_tuple_bytes.
In addition, since the TIDStore uses the radix tree internally, the
minimum amount of memory required by TIDStore is 1MB, the initial DSA
segment size. Due to that, we increase the minimum value of
maintenance_work_mem (and autovacuum_work_mem) from 1MB to 2MB.
XXX: needs to bump catalog version
---
doc/src/sgml/monitoring.sgml | 8 +-
src/backend/access/heap/vacuumlazy.c | 218 +++++++--------------
src/backend/catalog/system_views.sql | 2 +-
src/backend/commands/vacuum.c | 78 +-------
src/backend/commands/vacuumparallel.c | 62 +++---
src/backend/postmaster/autovacuum.c | 6 +-
src/backend/storage/lmgr/lwlock.c | 2 +
src/backend/utils/misc/guc_tables.c | 2 +-
src/include/commands/progress.h | 4 +-
src/include/commands/vacuum.h | 25 +--
src/include/storage/lwlock.h | 1 +
src/test/regress/expected/cluster.out | 2 +-
src/test/regress/expected/create_index.out | 2 +-
src/test/regress/expected/rules.out | 4 +-
src/test/regress/sql/cluster.sql | 2 +-
src/test/regress/sql/create_index.sql | 2 +-
16 files changed, 142 insertions(+), 278 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index d936aa3da3..0230c74e3d 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -6870,10 +6870,10 @@ FROM pg_stat_get_backend_idset() AS backendid;
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>max_dead_tuples</structfield> <type>bigint</type>
+ <structfield>max_dead_tuple_bytes</structfield> <type>bigint</type>
</para>
<para>
- Number of dead tuples that we can store before needing to perform
+ Amount of dead tuple data that we can store before needing to perform
an index vacuum cycle, based on
<xref linkend="guc-maintenance-work-mem"/>.
</para></entry>
@@ -6881,10 +6881,10 @@ FROM pg_stat_get_backend_idset() AS backendid;
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>num_dead_tuples</structfield> <type>bigint</type>
+ <structfield>num_dead_tuple_bytes</structfield> <type>bigint</type>
</para>
<para>
- Number of dead tuples collected since the last index vacuum cycle.
+ Amount of dead tuple data collected since the last index vacuum cycle.
</para></entry>
</row>
</tbody>
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 8f14cf85f3..3537df16fd 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -40,6 +40,7 @@
#include "access/heapam_xlog.h"
#include "access/htup_details.h"
#include "access/multixact.h"
+#include "access/tidstore.h"
#include "access/transam.h"
#include "access/visibilitymap.h"
#include "access/xact.h"
@@ -188,7 +189,7 @@ typedef struct LVRelState
* lazy_vacuum_heap_rel, which marks the same LP_DEAD line pointers as
* LP_UNUSED during second heap pass.
*/
- VacDeadItems *dead_items; /* TIDs whose index tuples we'll delete */
+ TidStore *dead_items; /* TIDs whose index tuples we'll delete */
BlockNumber rel_pages; /* total number of pages */
BlockNumber scanned_pages; /* # pages examined (not skipped via VM) */
BlockNumber removed_pages; /* # pages removed by relation truncation */
@@ -220,11 +221,14 @@ typedef struct LVRelState
typedef struct LVPagePruneState
{
bool hastup; /* Page prevents rel truncation? */
- bool has_lpdead_items; /* includes existing LP_DEAD items */
+
+ /* collected offsets of LP_DEAD items including existing ones */
+ OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
+ int num_offsets;
/*
* State describes the proper VM bit states to set for the page following
- * pruning and freezing. all_visible implies !has_lpdead_items, but don't
+ * pruning and freezing. all_visible implies num_offsets == 0, but don't
* trust all_frozen result unless all_visible is also set to true.
*/
bool all_visible; /* Every item visible to all? */
@@ -259,8 +263,9 @@ static bool lazy_scan_noprune(LVRelState *vacrel, Buffer buf,
static void lazy_vacuum(LVRelState *vacrel);
static bool lazy_vacuum_all_indexes(LVRelState *vacrel);
static void lazy_vacuum_heap_rel(LVRelState *vacrel);
-static int lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
- Buffer buffer, int index, Buffer vmbuffer);
+static void lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
+ OffsetNumber *offsets, int num_offsets,
+ Buffer buffer, Buffer vmbuffer);
static bool lazy_check_wraparound_failsafe(LVRelState *vacrel);
static void lazy_cleanup_all_indexes(LVRelState *vacrel);
static IndexBulkDeleteResult *lazy_vacuum_one_index(Relation indrel,
@@ -825,21 +830,21 @@ lazy_scan_heap(LVRelState *vacrel)
blkno,
next_unskippable_block,
next_fsm_block_to_vacuum = 0;
- VacDeadItems *dead_items = vacrel->dead_items;
+ TidStore *dead_items = vacrel->dead_items;
Buffer vmbuffer = InvalidBuffer;
bool next_unskippable_allvis,
skipping_current_range;
const int initprog_index[] = {
PROGRESS_VACUUM_PHASE,
PROGRESS_VACUUM_TOTAL_HEAP_BLKS,
- PROGRESS_VACUUM_MAX_DEAD_TUPLES
+ PROGRESS_VACUUM_MAX_DEAD_TUPLE_BYTES
};
int64 initprog_val[3];
/* Report that we're scanning the heap, advertising total # of blocks */
initprog_val[0] = PROGRESS_VACUUM_PHASE_SCAN_HEAP;
initprog_val[1] = rel_pages;
- initprog_val[2] = dead_items->max_items;
+ initprog_val[2] = tidstore_max_memory(vacrel->dead_items);
pgstat_progress_update_multi_param(3, initprog_index, initprog_val);
/* Set up an initial range of skippable blocks using the visibility map */
@@ -906,8 +911,7 @@ lazy_scan_heap(LVRelState *vacrel)
* dead_items TIDs, pause and do a cycle of vacuuming before we tackle
* this page.
*/
- Assert(dead_items->max_items >= MaxHeapTuplesPerPage);
- if (dead_items->max_items - dead_items->num_items < MaxHeapTuplesPerPage)
+ if (tidstore_is_full(vacrel->dead_items))
{
/*
* Before beginning index vacuuming, we release any pin we may
@@ -1018,7 +1022,7 @@ lazy_scan_heap(LVRelState *vacrel)
*/
lazy_scan_prune(vacrel, buf, blkno, page, &prunestate);
- Assert(!prunestate.all_visible || !prunestate.has_lpdead_items);
+ Assert(!prunestate.all_visible || (prunestate.num_offsets == 0));
/* Remember the location of the last page with nonremovable tuples */
if (prunestate.hastup)
@@ -1034,14 +1038,12 @@ lazy_scan_heap(LVRelState *vacrel)
* performed here can be thought of as the one-pass equivalent of
* a call to lazy_vacuum().
*/
- if (prunestate.has_lpdead_items)
+ if (prunestate.num_offsets > 0)
{
Size freespace;
- lazy_vacuum_heap_page(vacrel, blkno, buf, 0, vmbuffer);
-
- /* Forget the LP_DEAD items that we just vacuumed */
- dead_items->num_items = 0;
+ lazy_vacuum_heap_page(vacrel, blkno, prunestate.deadoffsets,
+ prunestate.num_offsets, buf, vmbuffer);
/*
* Periodically perform FSM vacuuming to make newly-freed
@@ -1078,7 +1080,16 @@ lazy_scan_heap(LVRelState *vacrel)
* with prunestate-driven visibility map and FSM steps (just like
* the two-pass strategy).
*/
- Assert(dead_items->num_items == 0);
+ Assert(tidstore_num_tids(dead_items) == 0);
+ }
+ else if (prunestate.num_offsets > 0)
+ {
+ /* Save details of the LP_DEAD items from the page */
+ tidstore_add_tids(dead_items, blkno, prunestate.deadoffsets,
+ prunestate.num_offsets);
+
+ pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
+ tidstore_memory_usage(dead_items));
}
/*
@@ -1145,7 +1156,7 @@ lazy_scan_heap(LVRelState *vacrel)
* There should never be LP_DEAD items on a page with PD_ALL_VISIBLE
* set, however.
*/
- else if (prunestate.has_lpdead_items && PageIsAllVisible(page))
+ else if ((prunestate.num_offsets > 0) && PageIsAllVisible(page))
{
elog(WARNING, "page containing LP_DEAD items is marked as all-visible in relation \"%s\" page %u",
vacrel->relname, blkno);
@@ -1193,7 +1204,7 @@ lazy_scan_heap(LVRelState *vacrel)
* Final steps for block: drop cleanup lock, record free space in the
* FSM
*/
- if (prunestate.has_lpdead_items && vacrel->do_index_vacuuming)
+ if ((prunestate.num_offsets > 0) && vacrel->do_index_vacuuming)
{
/*
* Wait until lazy_vacuum_heap_rel() to save free space. This
@@ -1249,7 +1260,7 @@ lazy_scan_heap(LVRelState *vacrel)
* Do index vacuuming (call each index's ambulkdelete routine), then do
* related heap vacuuming
*/
- if (dead_items->num_items > 0)
+ if (tidstore_num_tids(dead_items) > 0)
lazy_vacuum(vacrel);
/*
@@ -1543,13 +1554,11 @@ lazy_scan_prune(LVRelState *vacrel,
HTSV_Result res;
int tuples_deleted,
tuples_frozen,
- lpdead_items,
live_tuples,
recently_dead_tuples;
int nnewlpdead;
HeapPageFreeze pagefrz;
int64 fpi_before = pgWalUsage.wal_fpi;
- OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
HeapTupleFreeze frozen[MaxHeapTuplesPerPage];
Assert(BufferGetBlockNumber(buf) == blkno);
@@ -1571,7 +1580,6 @@ retry:
pagefrz.NoFreezePageRelminMxid = vacrel->NewRelminMxid;
tuples_deleted = 0;
tuples_frozen = 0;
- lpdead_items = 0;
live_tuples = 0;
recently_dead_tuples = 0;
@@ -1580,9 +1588,9 @@ retry:
*
* We count tuples removed by the pruning step as tuples_deleted. Its
* final value can be thought of as the number of tuples that have been
- * deleted from the table. It should not be confused with lpdead_items;
- * lpdead_items's final value can be thought of as the number of tuples
- * that were deleted from indexes.
+ * deleted from the table. It should not be confused with
+ * prunestate->deadoffsets; prunestate->deadoffsets's final value can
+ * be thought of as the number of tuples that were deleted from indexes.
*/
tuples_deleted = heap_page_prune(rel, buf, vacrel->vistest,
InvalidTransactionId, 0, &nnewlpdead,
@@ -1593,7 +1601,7 @@ retry:
* requiring freezing among remaining tuples with storage
*/
prunestate->hastup = false;
- prunestate->has_lpdead_items = false;
+ prunestate->num_offsets = 0;
prunestate->all_visible = true;
prunestate->all_frozen = true;
prunestate->visibility_cutoff_xid = InvalidTransactionId;
@@ -1638,7 +1646,7 @@ retry:
* (This is another case where it's useful to anticipate that any
* LP_DEAD items will become LP_UNUSED during the ongoing VACUUM.)
*/
- deadoffsets[lpdead_items++] = offnum;
+ prunestate->deadoffsets[prunestate->num_offsets++] = offnum;
continue;
}
@@ -1875,7 +1883,7 @@ retry:
*/
#ifdef USE_ASSERT_CHECKING
/* Note that all_frozen value does not matter when !all_visible */
- if (prunestate->all_visible && lpdead_items == 0)
+ if (prunestate->all_visible && prunestate->num_offsets == 0)
{
TransactionId cutoff;
bool all_frozen;
@@ -1888,28 +1896,9 @@ retry:
}
#endif
- /*
- * Now save details of the LP_DEAD items from the page in vacrel
- */
- if (lpdead_items > 0)
+ if (prunestate->num_offsets > 0)
{
- VacDeadItems *dead_items = vacrel->dead_items;
- ItemPointerData tmp;
-
vacrel->lpdead_item_pages++;
- prunestate->has_lpdead_items = true;
-
- ItemPointerSetBlockNumber(&tmp, blkno);
-
- for (int i = 0; i < lpdead_items; i++)
- {
- ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
- dead_items->items[dead_items->num_items++] = tmp;
- }
-
- Assert(dead_items->num_items <= dead_items->max_items);
- pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
- dead_items->num_items);
/*
* It was convenient to ignore LP_DEAD items in all_visible earlier on
@@ -1928,7 +1917,7 @@ retry:
/* Finally, add page-local counts to whole-VACUUM counts */
vacrel->tuples_deleted += tuples_deleted;
vacrel->tuples_frozen += tuples_frozen;
- vacrel->lpdead_items += lpdead_items;
+ vacrel->lpdead_items += prunestate->num_offsets;
vacrel->live_tuples += live_tuples;
vacrel->recently_dead_tuples += recently_dead_tuples;
}
@@ -2129,8 +2118,7 @@ lazy_scan_noprune(LVRelState *vacrel,
}
else
{
- VacDeadItems *dead_items = vacrel->dead_items;
- ItemPointerData tmp;
+ TidStore *dead_items = vacrel->dead_items;
/*
* Page has LP_DEAD items, and so any references/TIDs that remain in
@@ -2139,17 +2127,10 @@ lazy_scan_noprune(LVRelState *vacrel,
*/
vacrel->lpdead_item_pages++;
- ItemPointerSetBlockNumber(&tmp, blkno);
+ tidstore_add_tids(dead_items, blkno, deadoffsets, lpdead_items);
- for (int i = 0; i < lpdead_items; i++)
- {
- ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
- dead_items->items[dead_items->num_items++] = tmp;
- }
-
- Assert(dead_items->num_items <= dead_items->max_items);
- pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
- dead_items->num_items);
+ pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
+ tidstore_memory_usage(dead_items));
vacrel->lpdead_items += lpdead_items;
@@ -2198,7 +2179,7 @@ lazy_vacuum(LVRelState *vacrel)
if (!vacrel->do_index_vacuuming)
{
Assert(!vacrel->do_index_cleanup);
- vacrel->dead_items->num_items = 0;
+ tidstore_reset(vacrel->dead_items);
return;
}
@@ -2227,7 +2208,7 @@ lazy_vacuum(LVRelState *vacrel)
BlockNumber threshold;
Assert(vacrel->num_index_scans == 0);
- Assert(vacrel->lpdead_items == vacrel->dead_items->num_items);
+ Assert(vacrel->lpdead_items == tidstore_num_tids(vacrel->dead_items));
Assert(vacrel->do_index_vacuuming);
Assert(vacrel->do_index_cleanup);
@@ -2254,8 +2235,8 @@ lazy_vacuum(LVRelState *vacrel)
* cases then this may need to be reconsidered.
*/
threshold = (double) vacrel->rel_pages * BYPASS_THRESHOLD_PAGES;
- bypass = (vacrel->lpdead_item_pages < threshold &&
- vacrel->lpdead_items < MAXDEADITEMS(32L * 1024L * 1024L));
+ bypass = (vacrel->lpdead_item_pages < threshold) &&
+ tidstore_memory_usage(vacrel->dead_items) < (32L * 1024L * 1024L);
}
if (bypass)
@@ -2300,7 +2281,7 @@ lazy_vacuum(LVRelState *vacrel)
* Forget the LP_DEAD items that we just vacuumed (or just decided to not
* vacuum)
*/
- vacrel->dead_items->num_items = 0;
+ tidstore_reset(vacrel->dead_items);
}
/*
@@ -2373,7 +2354,7 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
* place).
*/
Assert(vacrel->num_index_scans > 0 ||
- vacrel->dead_items->num_items == vacrel->lpdead_items);
+ tidstore_num_tids(vacrel->dead_items) == vacrel->lpdead_items);
Assert(allindexes || vacrel->failsafe_active);
/*
@@ -2410,10 +2391,11 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
static void
lazy_vacuum_heap_rel(LVRelState *vacrel)
{
- int index = 0;
BlockNumber vacuumed_pages = 0;
Buffer vmbuffer = InvalidBuffer;
LVSavedErrInfo saved_err_info;
+ TidStoreIter *iter;
+ TidStoreIterResult *result;
Assert(vacrel->do_index_vacuuming);
Assert(vacrel->do_index_cleanup);
@@ -2428,7 +2410,8 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
VACUUM_ERRCB_PHASE_VACUUM_HEAP,
InvalidBlockNumber, InvalidOffsetNumber);
- while (index < vacrel->dead_items->num_items)
+ iter = tidstore_begin_iterate(vacrel->dead_items);
+ while ((result = tidstore_iterate_next(iter)) != NULL)
{
BlockNumber blkno;
Buffer buf;
@@ -2437,7 +2420,7 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
vacuum_delay_point();
- blkno = ItemPointerGetBlockNumber(&vacrel->dead_items->items[index]);
+ blkno = result->blkno;
vacrel->blkno = blkno;
/*
@@ -2451,7 +2434,8 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
buf = ReadBufferExtended(vacrel->rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
vacrel->bstrategy);
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
- index = lazy_vacuum_heap_page(vacrel, blkno, buf, index, vmbuffer);
+ lazy_vacuum_heap_page(vacrel, blkno, result->offsets, result->num_offsets,
+ buf, vmbuffer);
/* Now that we've vacuumed the page, record its available space */
page = BufferGetPage(buf);
@@ -2461,6 +2445,7 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
vacuumed_pages++;
}
+ tidstore_end_iterate(iter);
vacrel->blkno = InvalidBlockNumber;
if (BufferIsValid(vmbuffer))
@@ -2470,36 +2455,30 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
* We set all LP_DEAD items from the first heap pass to LP_UNUSED during
* the second heap pass. No more, no less.
*/
- Assert(index > 0);
Assert(vacrel->num_index_scans > 1 ||
- (index == vacrel->lpdead_items &&
+ (tidstore_num_tids(vacrel->dead_items) == vacrel->lpdead_items &&
vacuumed_pages == vacrel->lpdead_item_pages));
ereport(DEBUG2,
- (errmsg("table \"%s\": removed %lld dead item identifiers in %u pages",
- vacrel->relname, (long long) index, vacuumed_pages)));
+ (errmsg("table \"%s\": removed " UINT64_FORMAT "dead item identifiers in %u pages",
+ vacrel->relname, tidstore_num_tids(vacrel->dead_items), vacuumed_pages)));
/* Revert to the previous phase information for error traceback */
restore_vacuum_error_info(vacrel, &saved_err_info);
}
/*
- * lazy_vacuum_heap_page() -- free page's LP_DEAD items listed in the
- * vacrel->dead_items array.
+ * lazy_vacuum_heap_page() -- free page's LP_DEAD items.
*
* Caller must have an exclusive buffer lock on the buffer (though a full
* cleanup lock is also acceptable). vmbuffer must be valid and already have
* a pin on blkno's visibility map page.
- *
- * index is an offset into the vacrel->dead_items array for the first listed
- * LP_DEAD item on the page. The return value is the first index immediately
- * after all LP_DEAD items for the same page in the array.
*/
-static int
-lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
- int index, Buffer vmbuffer)
+static void
+lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
+ OffsetNumber *deadoffsets, int num_offsets, Buffer buffer,
+ Buffer vmbuffer)
{
- VacDeadItems *dead_items = vacrel->dead_items;
Page page = BufferGetPage(buffer);
OffsetNumber unused[MaxHeapTuplesPerPage];
int nunused = 0;
@@ -2518,16 +2497,11 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
START_CRIT_SECTION();
- for (; index < dead_items->num_items; index++)
+ for (int i = 0; i < num_offsets; i++)
{
- BlockNumber tblk;
- OffsetNumber toff;
ItemId itemid;
+ OffsetNumber toff = deadoffsets[i];
- tblk = ItemPointerGetBlockNumber(&dead_items->items[index]);
- if (tblk != blkno)
- break; /* past end of tuples for this block */
- toff = ItemPointerGetOffsetNumber(&dead_items->items[index]);
itemid = PageGetItemId(page, toff);
Assert(ItemIdIsDead(itemid) && !ItemIdHasStorage(itemid));
@@ -2597,7 +2571,6 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
/* Revert to the previous phase information for error traceback */
restore_vacuum_error_info(vacrel, &saved_err_info);
- return index;
}
/*
@@ -3093,46 +3066,6 @@ count_nondeletable_pages(LVRelState *vacrel, bool *lock_waiter_detected)
return vacrel->nonempty_pages;
}
-/*
- * Returns the number of dead TIDs that VACUUM should allocate space to
- * store, given a heap rel of size vacrel->rel_pages, and given current
- * maintenance_work_mem setting (or current autovacuum_work_mem setting,
- * when applicable).
- *
- * See the comments at the head of this file for rationale.
- */
-static int
-dead_items_max_items(LVRelState *vacrel)
-{
- int64 max_items;
- int vac_work_mem = IsAutoVacuumWorkerProcess() &&
- autovacuum_work_mem != -1 ?
- autovacuum_work_mem : maintenance_work_mem;
-
- if (vacrel->nindexes > 0)
- {
- BlockNumber rel_pages = vacrel->rel_pages;
-
- max_items = MAXDEADITEMS(vac_work_mem * 1024L);
- max_items = Min(max_items, INT_MAX);
- max_items = Min(max_items, MAXDEADITEMS(MaxAllocSize));
-
- /* curious coding here to ensure the multiplication can't overflow */
- if ((BlockNumber) (max_items / MaxHeapTuplesPerPage) > rel_pages)
- max_items = rel_pages * MaxHeapTuplesPerPage;
-
- /* stay sane if small maintenance_work_mem */
- max_items = Max(max_items, MaxHeapTuplesPerPage);
- }
- else
- {
- /* One-pass case only stores a single heap page's TIDs at a time */
- max_items = MaxHeapTuplesPerPage;
- }
-
- return (int) max_items;
-}
-
/*
* Allocate dead_items (either using palloc, or in dynamic shared memory).
* Sets dead_items in vacrel for caller.
@@ -3143,11 +3076,9 @@ dead_items_max_items(LVRelState *vacrel)
static void
dead_items_alloc(LVRelState *vacrel, int nworkers)
{
- VacDeadItems *dead_items;
- int max_items;
-
- max_items = dead_items_max_items(vacrel);
- Assert(max_items >= MaxHeapTuplesPerPage);
+ int vac_work_mem = IsAutoVacuumWorkerProcess() &&
+ autovacuum_work_mem != -1 ?
+ autovacuum_work_mem * 1024L : maintenance_work_mem * 1024L;
/*
* Initialize state for a parallel vacuum. As of now, only one worker can
@@ -3174,7 +3105,7 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
else
vacrel->pvs = parallel_vacuum_init(vacrel->rel, vacrel->indrels,
vacrel->nindexes, nworkers,
- max_items,
+ vac_work_mem, MaxHeapTuplesPerPage,
vacrel->verbose ? INFO : DEBUG2,
vacrel->bstrategy);
@@ -3187,11 +3118,8 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
}
/* Serial VACUUM case */
- dead_items = (VacDeadItems *) palloc(vac_max_items_to_alloc_size(max_items));
- dead_items->max_items = max_items;
- dead_items->num_items = 0;
-
- vacrel->dead_items = dead_items;
+ vacrel->dead_items = tidstore_create(vac_work_mem, MaxHeapTuplesPerPage,
+ NULL);
}
/*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 8608e3fa5b..a526e607fe 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1165,7 +1165,7 @@ CREATE VIEW pg_stat_progress_vacuum AS
END AS phase,
S.param2 AS heap_blks_total, S.param3 AS heap_blks_scanned,
S.param4 AS heap_blks_vacuumed, S.param5 AS index_vacuum_count,
- S.param6 AS max_dead_tuples, S.param7 AS num_dead_tuples
+ S.param6 AS max_dead_tuple_bytes, S.param7 AS dead_tuple_bytes
FROM pg_stat_get_progress_info('VACUUM') AS S
LEFT JOIN pg_database D ON S.datid = D.oid;
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 7b1a4b127e..d8e680ca20 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -97,7 +97,6 @@ static bool vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
static double compute_parallel_delay(void);
static VacOptValue get_vacoptval_from_boolean(DefElem *def);
static bool vac_tid_reaped(ItemPointer itemptr, void *state);
-static int vac_cmp_itemptr(const void *left, const void *right);
/*
* Primary entry point for manual VACUUM and ANALYZE commands
@@ -2303,16 +2302,16 @@ get_vacoptval_from_boolean(DefElem *def)
*/
IndexBulkDeleteResult *
vac_bulkdel_one_index(IndexVacuumInfo *ivinfo, IndexBulkDeleteResult *istat,
- VacDeadItems *dead_items)
+ TidStore *dead_items)
{
/* Do bulk deletion */
istat = index_bulk_delete(ivinfo, istat, vac_tid_reaped,
(void *) dead_items);
ereport(ivinfo->message_level,
- (errmsg("scanned index \"%s\" to remove %d row versions",
+ (errmsg("scanned index \"%s\" to remove " UINT64_FORMAT " row versions",
RelationGetRelationName(ivinfo->index),
- dead_items->num_items)));
+ tidstore_num_tids(dead_items))));
return istat;
}
@@ -2343,82 +2342,15 @@ vac_cleanup_one_index(IndexVacuumInfo *ivinfo, IndexBulkDeleteResult *istat)
return istat;
}
-/*
- * Returns the total required space for VACUUM's dead_items array given a
- * max_items value.
- */
-Size
-vac_max_items_to_alloc_size(int max_items)
-{
- Assert(max_items <= MAXDEADITEMS(MaxAllocSize));
-
- return offsetof(VacDeadItems, items) + sizeof(ItemPointerData) * max_items;
-}
-
/*
* vac_tid_reaped() -- is a particular tid deletable?
*
* This has the right signature to be an IndexBulkDeleteCallback.
- *
- * Assumes dead_items array is sorted (in ascending TID order).
*/
static bool
vac_tid_reaped(ItemPointer itemptr, void *state)
{
- VacDeadItems *dead_items = (VacDeadItems *) state;
- int64 litem,
- ritem,
- item;
- ItemPointer res;
-
- litem = itemptr_encode(&dead_items->items[0]);
- ritem = itemptr_encode(&dead_items->items[dead_items->num_items - 1]);
- item = itemptr_encode(itemptr);
-
- /*
- * Doing a simple bound check before bsearch() is useful to avoid the
- * extra cost of bsearch(), especially if dead items on the heap are
- * concentrated in a certain range. Since this function is called for
- * every index tuple, it pays to be really fast.
- */
- if (item < litem || item > ritem)
- return false;
-
- res = (ItemPointer) bsearch((void *) itemptr,
- (void *) dead_items->items,
- dead_items->num_items,
- sizeof(ItemPointerData),
- vac_cmp_itemptr);
-
- return (res != NULL);
-}
-
-/*
- * Comparator routines for use with qsort() and bsearch().
- */
-static int
-vac_cmp_itemptr(const void *left, const void *right)
-{
- BlockNumber lblk,
- rblk;
- OffsetNumber loff,
- roff;
-
- lblk = ItemPointerGetBlockNumber((ItemPointer) left);
- rblk = ItemPointerGetBlockNumber((ItemPointer) right);
-
- if (lblk < rblk)
- return -1;
- if (lblk > rblk)
- return 1;
-
- loff = ItemPointerGetOffsetNumber((ItemPointer) left);
- roff = ItemPointerGetOffsetNumber((ItemPointer) right);
-
- if (loff < roff)
- return -1;
- if (loff > roff)
- return 1;
+ TidStore *dead_items = (TidStore *) state;
- return 0;
+ return tidstore_lookup_tid(dead_items, itemptr);
}
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index bcd40c80a1..5c7e6ed99c 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -103,6 +103,9 @@ typedef struct PVShared
/* Counter for vacuuming and cleanup */
pg_atomic_uint32 idx;
+
+ /* Handle of the shared TidStore */
+ tidstore_handle dead_items_handle;
} PVShared;
/* Status used during parallel index vacuum or cleanup */
@@ -166,7 +169,8 @@ struct ParallelVacuumState
PVIndStats *indstats;
/* Shared dead items space among parallel vacuum workers */
- VacDeadItems *dead_items;
+ TidStore *dead_items;
+ dsa_area *dead_items_area;
/* Points to buffer usage area in DSM */
BufferUsage *buffer_usage;
@@ -222,20 +226,23 @@ static void parallel_vacuum_error_callback(void *arg);
*/
ParallelVacuumState *
parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
- int nrequested_workers, int max_items,
- int elevel, BufferAccessStrategy bstrategy)
+ int nrequested_workers, int vac_work_mem,
+ int max_offset, int elevel,
+ BufferAccessStrategy bstrategy)
{
ParallelVacuumState *pvs;
ParallelContext *pcxt;
PVShared *shared;
- VacDeadItems *dead_items;
+ TidStore *dead_items;
PVIndStats *indstats;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
+ void *area_space;
+ dsa_area *dead_items_dsa;
bool *will_parallel_vacuum;
Size est_indstats_len;
Size est_shared_len;
- Size est_dead_items_len;
+ Size dsa_minsize = dsa_minimum_size();
int nindexes_mwm = 0;
int parallel_workers = 0;
int querylen;
@@ -283,9 +290,8 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_estimate_chunk(&pcxt->estimator, est_shared_len);
shm_toc_estimate_keys(&pcxt->estimator, 1);
- /* Estimate size for dead_items -- PARALLEL_VACUUM_KEY_DEAD_ITEMS */
- est_dead_items_len = vac_max_items_to_alloc_size(max_items);
- shm_toc_estimate_chunk(&pcxt->estimator, est_dead_items_len);
+ /* Estimate size for dead tuple DSA -- PARALLEL_VACUUM_KEY_DSA */
+ shm_toc_estimate_chunk(&pcxt->estimator, dsa_minsize);
shm_toc_estimate_keys(&pcxt->estimator, 1);
/*
@@ -351,6 +357,16 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_INDEX_STATS, indstats);
pvs->indstats = indstats;
+ /* Prepare DSA space for dead items */
+ area_space = shm_toc_allocate(pcxt->toc, dsa_minsize);
+ shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_DEAD_ITEMS, area_space);
+ dead_items_dsa = dsa_create_in_place(area_space, dsa_minsize,
+ LWTRANCHE_PARALLEL_VACUUM_DSA,
+ pcxt->seg);
+ dead_items = tidstore_create(vac_work_mem, max_offset, dead_items_dsa);
+ pvs->dead_items = dead_items;
+ pvs->dead_items_area = dead_items_dsa;
+
/* Prepare shared information */
shared = (PVShared *) shm_toc_allocate(pcxt->toc, est_shared_len);
MemSet(shared, 0, est_shared_len);
@@ -360,6 +376,7 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
(nindexes_mwm > 0) ?
maintenance_work_mem / Min(parallel_workers, nindexes_mwm) :
maintenance_work_mem;
+ shared->dead_items_handle = tidstore_get_handle(dead_items);
pg_atomic_init_u32(&(shared->cost_balance), 0);
pg_atomic_init_u32(&(shared->active_nworkers), 0);
@@ -368,15 +385,6 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_SHARED, shared);
pvs->shared = shared;
- /* Prepare the dead_items space */
- dead_items = (VacDeadItems *) shm_toc_allocate(pcxt->toc,
- est_dead_items_len);
- dead_items->max_items = max_items;
- dead_items->num_items = 0;
- MemSet(dead_items->items, 0, sizeof(ItemPointerData) * max_items);
- shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_DEAD_ITEMS, dead_items);
- pvs->dead_items = dead_items;
-
/*
* Allocate space for each worker's BufferUsage and WalUsage; no need to
* initialize
@@ -434,6 +442,9 @@ parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats)
istats[i] = NULL;
}
+ tidstore_destroy(pvs->dead_items);
+ dsa_detach(pvs->dead_items_area);
+
DestroyParallelContext(pvs->pcxt);
ExitParallelMode();
@@ -442,7 +453,7 @@ parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats)
}
/* Returns the dead items space */
-VacDeadItems *
+TidStore *
parallel_vacuum_get_dead_items(ParallelVacuumState *pvs)
{
return pvs->dead_items;
@@ -940,7 +951,9 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
Relation *indrels;
PVIndStats *indstats;
PVShared *shared;
- VacDeadItems *dead_items;
+ TidStore *dead_items;
+ void *area_space;
+ dsa_area *dead_items_area;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
int nindexes;
@@ -984,10 +997,10 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
PARALLEL_VACUUM_KEY_INDEX_STATS,
false);
- /* Set dead_items space */
- dead_items = (VacDeadItems *) shm_toc_lookup(toc,
- PARALLEL_VACUUM_KEY_DEAD_ITEMS,
- false);
+ /* Set dead items */
+ area_space = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_DEAD_ITEMS, false);
+ dead_items_area = dsa_attach_in_place(area_space, seg);
+ dead_items = tidstore_attach(dead_items_area, shared->dead_items_handle);
/* Set cost-based vacuum delay */
VacuumCostActive = (VacuumCostDelay > 0);
@@ -1033,6 +1046,9 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
&wal_usage[ParallelWorkerNumber]);
+ tidstore_detach(pvs.dead_items);
+ dsa_detach(dead_items_area);
+
/* Pop the error context stack */
error_context_stack = errcallback.previous;
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index f5ea381c53..d88db3e1f8 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -3397,12 +3397,12 @@ check_autovacuum_work_mem(int *newval, void **extra, GucSource source)
return true;
/*
- * We clamp manually-set values to at least 1MB. Since
+ * We clamp manually-set values to at least 2MB. Since
* maintenance_work_mem is always set to at least this value, do the same
* here.
*/
- if (*newval < 1024)
- *newval = 1024;
+ if (*newval < 2048)
+ *newval = 2048;
return true;
}
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 55b3a04097..c223a7dc94 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -192,6 +192,8 @@ static const char *const BuiltinTrancheNames[] = {
"LogicalRepLauncherDSA",
/* LWTRANCHE_LAUNCHER_HASH: */
"LogicalRepLauncherHash",
+ /* LWTRANCHE_PARALLEL_VACUUM_DSA: */
+ "ParallelVacuumDSA",
};
StaticAssertDecl(lengthof(BuiltinTrancheNames) ==
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 4ac808ed22..422914f0a9 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2312,7 +2312,7 @@ struct config_int ConfigureNamesInt[] =
GUC_UNIT_KB
},
&maintenance_work_mem,
- 65536, 1024, MAX_KILOBYTES,
+ 65536, 2048, MAX_KILOBYTES,
NULL, NULL, NULL
},
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index e5add41352..b209d3cf84 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -23,8 +23,8 @@
#define PROGRESS_VACUUM_HEAP_BLKS_SCANNED 2
#define PROGRESS_VACUUM_HEAP_BLKS_VACUUMED 3
#define PROGRESS_VACUUM_NUM_INDEX_VACUUMS 4
-#define PROGRESS_VACUUM_MAX_DEAD_TUPLES 5
-#define PROGRESS_VACUUM_NUM_DEAD_TUPLES 6
+#define PROGRESS_VACUUM_MAX_DEAD_TUPLE_BYTES 5
+#define PROGRESS_VACUUM_DEAD_TUPLE_BYTES 6
/* Phases of vacuum (as advertised via PROGRESS_VACUUM_PHASE) */
#define PROGRESS_VACUUM_PHASE_SCAN_HEAP 1
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 689dbb7702..a3ebb169ef 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -17,6 +17,7 @@
#include "access/htup.h"
#include "access/genam.h"
#include "access/parallel.h"
+#include "access/tidstore.h"
#include "catalog/pg_class.h"
#include "catalog/pg_statistic.h"
#include "catalog/pg_type.h"
@@ -276,21 +277,6 @@ struct VacuumCutoffs
MultiXactId MultiXactCutoff;
};
-/*
- * VacDeadItems stores TIDs whose index tuples are deleted by index vacuuming.
- */
-typedef struct VacDeadItems
-{
- int max_items; /* # slots allocated in array */
- int num_items; /* current # of entries */
-
- /* Sorted array of TIDs to delete from indexes */
- ItemPointerData items[FLEXIBLE_ARRAY_MEMBER];
-} VacDeadItems;
-
-#define MAXDEADITEMS(avail_mem) \
- (((avail_mem) - offsetof(VacDeadItems, items)) / sizeof(ItemPointerData))
-
/* GUC parameters */
extern PGDLLIMPORT int default_statistics_target; /* PGDLLIMPORT for PostGIS */
extern PGDLLIMPORT int vacuum_freeze_min_age;
@@ -339,18 +325,17 @@ extern Relation vacuum_open_relation(Oid relid, RangeVar *relation,
LOCKMODE lmode);
extern IndexBulkDeleteResult *vac_bulkdel_one_index(IndexVacuumInfo *ivinfo,
IndexBulkDeleteResult *istat,
- VacDeadItems *dead_items);
+ TidStore *dead_items);
extern IndexBulkDeleteResult *vac_cleanup_one_index(IndexVacuumInfo *ivinfo,
IndexBulkDeleteResult *istat);
-extern Size vac_max_items_to_alloc_size(int max_items);
/* in commands/vacuumparallel.c */
extern ParallelVacuumState *parallel_vacuum_init(Relation rel, Relation *indrels,
int nindexes, int nrequested_workers,
- int max_items, int elevel,
- BufferAccessStrategy bstrategy);
+ int vac_work_mem, int max_offset,
+ int elevel, BufferAccessStrategy bstrategy);
extern void parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats);
-extern VacDeadItems *parallel_vacuum_get_dead_items(ParallelVacuumState *pvs);
+extern TidStore *parallel_vacuum_get_dead_items(ParallelVacuumState *pvs);
extern void parallel_vacuum_bulkdel_all_indexes(ParallelVacuumState *pvs,
long num_table_tuples,
int num_index_scans);
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 07002fdfbe..537b34b30c 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -207,6 +207,7 @@ typedef enum BuiltinTrancheIds
LWTRANCHE_PGSTATS_DATA,
LWTRANCHE_LAUNCHER_DSA,
LWTRANCHE_LAUNCHER_HASH,
+ LWTRANCHE_PARALLEL_VACUUM_DSA,
LWTRANCHE_FIRST_USER_DEFINED
} BuiltinTrancheIds;
diff --git a/src/test/regress/expected/cluster.out b/src/test/regress/expected/cluster.out
index 2eec483eaa..e04f50726f 100644
--- a/src/test/regress/expected/cluster.out
+++ b/src/test/regress/expected/cluster.out
@@ -526,7 +526,7 @@ create index cluster_sort on clstr_4 (hundred, thousand, tenthous);
-- ensure we don't use the index in CLUSTER nor the checking SELECTs
set enable_indexscan = off;
-- Use external sort:
-set maintenance_work_mem = '1MB';
+set maintenance_work_mem = '2MB';
cluster clstr_4 using cluster_sort;
select * from
(select hundred, lag(hundred) over () as lhundred,
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 6cd57e3eaa..d1889b9d10 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -1214,7 +1214,7 @@ DROP TABLE unlogged_hash_table;
-- CREATE INDEX hash_ovfl_index ON hash_ovfl_heap USING hash (x int4_ops);
-- Test hash index build tuplesorting. Force hash tuplesort using low
-- maintenance_work_mem setting and fillfactor:
-SET maintenance_work_mem = '1MB';
+SET maintenance_work_mem = '2MB';
CREATE INDEX hash_tuplesort_idx ON tenk1 USING hash (stringu1 name_ops) WITH (fillfactor = 10);
EXPLAIN (COSTS OFF)
SELECT count(*) FROM tenk1 WHERE stringu1 = 'TVAAAA';
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index e7a2f5856a..f6ae02eb14 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2020,8 +2020,8 @@ pg_stat_progress_vacuum| SELECT s.pid,
s.param3 AS heap_blks_scanned,
s.param4 AS heap_blks_vacuumed,
s.param5 AS index_vacuum_count,
- s.param6 AS max_dead_tuples,
- s.param7 AS num_dead_tuples
+ s.param6 AS max_dead_tuple_bytes,
+ s.param7 AS dead_tuple_bytes
FROM (pg_stat_get_progress_info('VACUUM'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
LEFT JOIN pg_database d ON ((s.datid = d.oid)));
pg_stat_recovery_prefetch| SELECT stats_reset,
diff --git a/src/test/regress/sql/cluster.sql b/src/test/regress/sql/cluster.sql
index a4cfaae807..a4cb5b98a5 100644
--- a/src/test/regress/sql/cluster.sql
+++ b/src/test/regress/sql/cluster.sql
@@ -258,7 +258,7 @@ create index cluster_sort on clstr_4 (hundred, thousand, tenthous);
set enable_indexscan = off;
-- Use external sort:
-set maintenance_work_mem = '1MB';
+set maintenance_work_mem = '2MB';
cluster clstr_4 using cluster_sort;
select * from
(select hundred, lag(hundred) over () as lhundred,
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index a3738833b2..edb5e4b4f3 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -367,7 +367,7 @@ DROP TABLE unlogged_hash_table;
-- Test hash index build tuplesorting. Force hash tuplesort using low
-- maintenance_work_mem setting and fillfactor:
-SET maintenance_work_mem = '1MB';
+SET maintenance_work_mem = '2MB';
CREATE INDEX hash_tuplesort_idx ON tenk1 USING hash (stringu1 name_ops) WITH (fillfactor = 10);
EXPLAIN (COSTS OFF)
SELECT count(*) FROM tenk1 WHERE stringu1 = 'TVAAAA';
--
2.31.1
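As a quick illustration of how the change above is meant to be driven
from the heap-scan side, here is a rough sketch (not part of any
attached patch). Only the tidstore_* calls, LVRelState, lazy_vacuum()
and MaxHeapTuplesPerPage exist in the tree or in the patches;
collect_dead_offsets() and the loop framing are hypothetical
placeholders.

/* Sketch only: how lazy vacuum could feed the TidStore from the heap scan */
static void
lazy_scan_heap_sketch(LVRelState *vacrel, BlockNumber nblocks)
{
    for (BlockNumber blkno = 0; blkno < nblocks; blkno++)
    {
        OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
        int         ndead;

        /* hypothetical helper returning this page's dead item offsets */
        ndead = collect_dead_offsets(vacrel, blkno, deadoffsets);

        if (ndead > 0)
            tidstore_add_tids(vacrel->dead_items, blkno, deadoffsets, ndead);

        /*
         * The memory-based limit replaces the old max_items counter: once
         * the store is full, vacuum the indexes and the heap, then start
         * over with an empty store.
         */
        if (tidstore_is_full(vacrel->dead_items))
        {
            lazy_vacuum(vacrel);
            tidstore_reset(vacrel->dead_items);
        }
    }
}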
Attachment: v23-0017-Add-TIDStore-to-store-sets-of-TIDs-ItemPointerDa.patch (application/octet-stream)
From 32ccdca354e5d9e82f8be512e3afc65ee9930f2a Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 23 Dec 2022 23:50:39 +0900
Subject: [PATCH v23 17/18] Add TIDStore, to store sets of TIDs
(ItemPointerData) efficiently.
The TIDStore is designed to store large sets of TIDs efficiently, and
is backed by a radix tree. A TID is encoded into a 64-bit key and
value and inserted into the radix tree.
The TIDStore is not used for anything yet, aside from the test code,
but the follow-up patch integrates the TIDStore with lazy vacuum,
reducing lazy vacuum memory usage and lifting the 1GB limit on its
size, by storing the list of dead TIDs more efficiently.
This includes a unit test module, in src/test/modules/test_tidstore.
---
doc/src/sgml/monitoring.sgml | 4 +
src/backend/access/common/Makefile | 1 +
src/backend/access/common/meson.build | 1 +
src/backend/access/common/tidstore.c | 674 ++++++++++++++++++
src/backend/storage/lmgr/lwlock.c | 2 +
src/include/access/tidstore.h | 49 ++
src/include/storage/lwlock.h | 1 +
src/test/modules/Makefile | 1 +
src/test/modules/meson.build | 1 +
src/test/modules/test_tidstore/Makefile | 23 +
.../test_tidstore/expected/test_tidstore.out | 13 +
src/test/modules/test_tidstore/meson.build | 35 +
.../test_tidstore/sql/test_tidstore.sql | 7 +
.../test_tidstore/test_tidstore--1.0.sql | 8 +
.../modules/test_tidstore/test_tidstore.c | 195 +++++
.../test_tidstore/test_tidstore.control | 4 +
16 files changed, 1019 insertions(+)
create mode 100644 src/backend/access/common/tidstore.c
create mode 100644 src/include/access/tidstore.h
create mode 100644 src/test/modules/test_tidstore/Makefile
create mode 100644 src/test/modules/test_tidstore/expected/test_tidstore.out
create mode 100644 src/test/modules/test_tidstore/meson.build
create mode 100644 src/test/modules/test_tidstore/sql/test_tidstore.sql
create mode 100644 src/test/modules/test_tidstore/test_tidstore--1.0.sql
create mode 100644 src/test/modules/test_tidstore/test_tidstore.c
create mode 100644 src/test/modules/test_tidstore/test_tidstore.control
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 1756f1a4b6..d936aa3da3 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -2192,6 +2192,10 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
<entry>Waiting to access a shared TID bitmap during a parallel bitmap
index scan.</entry>
</row>
+ <row>
+ <entry><literal>SharedTidStore</literal></entry>
+ <entry>Waiting to access a shared TID store.</entry>
+ </row>
<row>
<entry><literal>SharedTupleStore</literal></entry>
<entry>Waiting to access a shared tuple store during parallel
diff --git a/src/backend/access/common/Makefile b/src/backend/access/common/Makefile
index b9aff0ccfd..67b8cc6108 100644
--- a/src/backend/access/common/Makefile
+++ b/src/backend/access/common/Makefile
@@ -27,6 +27,7 @@ OBJS = \
syncscan.o \
toast_compression.o \
toast_internals.o \
+ tidstore.o \
tupconvert.o \
tupdesc.o
diff --git a/src/backend/access/common/meson.build b/src/backend/access/common/meson.build
index f5ac17b498..fce19c09ce 100644
--- a/src/backend/access/common/meson.build
+++ b/src/backend/access/common/meson.build
@@ -15,6 +15,7 @@ backend_sources += files(
'syncscan.c',
'toast_compression.c',
'toast_internals.c',
+ 'tidstore.c',
'tupconvert.c',
'tupdesc.c',
)
diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
new file mode 100644
index 0000000000..89aea71945
--- /dev/null
+++ b/src/backend/access/common/tidstore.c
@@ -0,0 +1,674 @@
+/*-------------------------------------------------------------------------
+ *
+ * tidstore.c
+ * Tid (ItemPointerData) storage implementation.
+ *
+ * This module provides an in-memory data structure to store Tids (ItemPointer).
+ * Internally, a tid is encoded as a pair of 64-bit key and 64-bit value, and
+ * stored in the radix tree.
+ *
+ * A TidStore can be shared among parallel worker processes by passing a DSA
+ * area to tidstore_create(). Other backends can attach to the shared TidStore
+ * with tidstore_attach().
+ *
+ * XXX: Only one process is allowed to iterate over the TidStore at a time.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/common/tidstore.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "access/tidstore.h"
+#include "miscadmin.h"
+#include "port/pg_bitutils.h"
+#include "storage/lwlock.h"
+#include "utils/dsa.h"
+#include "utils/memutils.h"
+
+/*
+ * For encoding purposes, tids are represented as a pair of a 64-bit key and
+ * a 64-bit value. First, we construct a 64-bit unsigned integer by combining
+ * the block number and the offset number. The number of bits used for the
+ * offset number is specified by max_offset in tidstore_create(). We are
+ * frugal with the bits, because smaller keys could help keep the radix
+ * tree shallow.
+ *
+ * For example, a tid in a heap with 8kB blocks uses the lowest 9 bits for
+ * the offset number and the next 32 bits for the block number. That is,
+ * only 41 bits are used:
+ *
+ * uuuuuuuY YYYYYYYY YYYYYYYY YYYYYYYY YYYYYYYX XXXXXXXX
+ *
+ * X = bits used for offset number
+ * Y = bits used for block number
+ * u = unused bit
+ * (high on the left, low on the right)
+ *
+ * 9 bits are enough for the offset number, because MaxHeapTuplesPerPage < 2^9
+ * on 8kB blocks.
+ *
+ * The 64-bit value is the bitmap representation of the lowest 6 bits
+ * (TIDSTORE_VALUE_NBITS) of the integer, and the remaining 35 bits are used
+ * as the key:
+ *
+ * uuuuuuuY YYYYYYYY YYYYYYYY YYYYYYYY YYYYYYYX XXXXXXXX
+ * |----| value
+ * |---------------------------------------------| key
+ *
+ * The maximum height of the radix tree is 5 in this case.
+ *
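+ * As a concrete example (numbers chosen only for illustration): with 9
+ * offset bits, the tid (blkno = 1000, off = 5) becomes
+ * tid_i = (1000 << 9) | 5 = 512005, which is stored as key = 512005 >> 6
+ * = 8000 with bit 5 set in the 64-bit value.
+ *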
+ * If the number of bits needed for the offset number fits in the 64-bit
+ * value, we don't encode tids; the block number is used directly as the key
+ * and the offset number as the bit position in the value.
+ */
+#define TIDSTORE_VALUE_NBITS 6 /* log(64, 2) */
+
+/* A magic value used to identify our TidStores. */
+#define TIDSTORE_MAGIC 0x826f6a10
+
+#define RT_PREFIX local_rt
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#define RT_VALUE_TYPE uint64
+#include "lib/radixtree.h"
+
+#define RT_PREFIX shared_rt
+#define RT_SHMEM
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#define RT_VALUE_TYPE uint64
+#include "lib/radixtree.h"
+
+/* The header object for a TidStore */
+typedef struct TidStoreControl
+{
+ int64 num_tids; /* the number of Tids stored so far */
+ size_t max_bytes; /* the maximum bytes a TidStore can use */
+ int max_offset; /* the maximum offset number */
+ bool encode_tids; /* do we use tid encoding? */
+ int offset_nbits; /* the number of bits used for offset number */
+ int offset_key_nbits; /* the number of bits of an offset number
+ * used for the key */
+
+ /* The fields below are used only in the shared case */
+
+ uint32 magic;
+ LWLock lock;
+
+ /* handles for TidStore and radix tree */
+ tidstore_handle handle;
+ shared_rt_handle tree_handle;
+} TidStoreControl;
+
+/* Per-backend state for a TidStore */
+struct TidStore
+{
+ /*
+ * Control object. This is allocated in DSA area 'area' in the shared
+ * case, otherwise in backend-local memory.
+ */
+ TidStoreControl *control;
+
+ /* Storage for Tids. Use either one depending on TidStoreIsShared() */
+ union
+ {
+ local_rt_radix_tree *local;
+ shared_rt_radix_tree *shared;
+ } tree;
+
+ /* DSA area for TidStore if used */
+ dsa_area *area;
+};
+#define TidStoreIsShared(ts) ((ts)->area != NULL)
+
+/* Iterator for TidStore */
+typedef struct TidStoreIter
+{
+ TidStore *ts;
+
+ /* iterator of radix tree. Use either one depending on TidStoreIsShared() */
+ union
+ {
+ shared_rt_iter *shared;
+ local_rt_iter *local;
+ } tree_iter;
+
+ /* we returned all tids? */
+ bool finished;
+
+ /* save for the next iteration */
+ uint64 next_key;
+ uint64 next_val;
+
+ /* output for the caller */
+ TidStoreIterResult result;
+} TidStoreIter;
+
+static void tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, uint64 val);
+static inline BlockNumber key_get_blkno(TidStore *ts, uint64 key);
+static inline uint64 tid_to_key_off(TidStore *ts, ItemPointer tid, uint32 *off);
+
+/*
+ * Create a TidStore. The returned object is allocated in backend-local memory.
+ * The radix tree for storage is allocated in the DSA area if 'area' is non-NULL.
+ */
+TidStore *
+tidstore_create(size_t max_bytes, int max_offset, dsa_area *area)
+{
+ TidStore *ts;
+
+ ts = palloc0(sizeof(TidStore));
+
+ /*
+ * Create the radix tree for the main storage.
+ *
+ * Memory consumption depends not only on the number of Tids stored but also
+ * on their distribution, on how the radix tree stores them, and on the memory
+ * management backing the radix tree. The maximum number of bytes a TidStore
+ * may use is specified by max_bytes in tidstore_create(), and we want the
+ * total memory consumption not to exceed it.
+ *
+ * In the non-shared case, the radix tree uses a slab allocator for each node
+ * class. The most memory-consuming case while adding Tids associated with
+ * one page (i.e. during tidstore_add_tids()) is allocating the largest radix
+ * tree node in a new slab block, which is approximately 70kB. Therefore, we
+ * deduct 70kB from the maximum bytes.
+ *
+ * In the shared case, DSA allocates memory segments following a geometric
+ * series that approximately doubles the total DSA size (see
+ * make_new_segment() in dsa.c). We simulated how DSA grows its segments; the
+ * simulation showed that a 75% threshold of the maximum bytes works well when
+ * it is a power of two, and a 60% threshold works for other cases.
+ */
+ if (area != NULL)
+ {
+ dsa_pointer dp;
+ float ratio = ((max_bytes & (max_bytes - 1)) == 0) ? 0.75 : 0.6;
+
+ ts->tree.shared = shared_rt_create(CurrentMemoryContext, area,
+ LWTRANCHE_SHARED_TIDSTORE);
+
+ dp = dsa_allocate0(area, sizeof(TidStoreControl));
+ ts->control = (TidStoreControl *) dsa_get_address(area, dp);
+ ts->control->max_bytes = (uint64) (max_bytes * ratio);
+ ts->area = area;
+
+ ts->control->magic = TIDSTORE_MAGIC;
+ LWLockInitialize(&ts->control->lock, LWTRANCHE_SHARED_TIDSTORE);
+ ts->control->handle = dp;
+ ts->control->tree_handle = shared_rt_get_handle(ts->tree.shared);
+ }
+ else
+ {
+ ts->tree.local = local_rt_create(CurrentMemoryContext);
+
+ ts->control = (TidStoreControl *) palloc0(sizeof(TidStoreControl));
+ ts->control->max_bytes = max_bytes - (1024 * 70);
+ }
+
+ ts->control->max_offset = max_offset;
+ ts->control->offset_nbits = pg_ceil_log2_32(max_offset);
+
+ if (ts->control->offset_nbits > TIDSTORE_VALUE_NBITS)
+ {
+ ts->control->encode_tids = true;
+ ts->control->offset_key_nbits =
+ ts->control->offset_nbits - TIDSTORE_VALUE_NBITS;
+ }
+ else
+ {
+ ts->control->encode_tids = false;
+ ts->control->offset_key_nbits = 0;
+ }
+
+ return ts;
+}
+
+/*
+ * Attach to the shared TidStore using a handle. The returned object is
+ * allocated in backend-local memory using the CurrentMemoryContext.
+ */
+TidStore *
+tidstore_attach(dsa_area *area, tidstore_handle handle)
+{
+ TidStore *ts;
+ dsa_pointer control;
+
+ Assert(area != NULL);
+ Assert(DsaPointerIsValid(handle));
+
+ /* create per-backend state */
+ ts = palloc0(sizeof(TidStore));
+
+ /* Find the control object in shared memory */
+ control = handle;
+
+ /* Set up the TidStore */
+ ts->control = (TidStoreControl *) dsa_get_address(area, control);
+ Assert(ts->control->magic == TIDSTORE_MAGIC);
+
+ ts->tree.shared = shared_rt_attach(area, ts->control->tree_handle);
+ ts->area = area;
+
+ return ts;
+}
+
+/*
+ * Detach from a TidStore. This detaches from the radix tree and frees the
+ * backend-local resources. The radix tree will continue to exist until
+ * it is either explicitly destroyed, or the area that backs it is returned
+ * to the operating system.
+ */
+void
+tidstore_detach(TidStore *ts)
+{
+ Assert(TidStoreIsShared(ts) && ts->control->magic == TIDSTORE_MAGIC);
+
+ shared_rt_detach(ts->tree.shared);
+ pfree(ts);
+}
+
+/*
+ * Destroy a TidStore, returning all memory.
+ *
+ * TODO: The caller must be certain that no other backend will attempt to
+ * access the TidStore before calling this function. Other backends must
+ * explicitly call tidstore_detach to free up backend-local memory associated
+ * with the TidStore. The backend that calls tidstore_destroy must not call
+ * tidstore_detach.
+ */
+void
+tidstore_destroy(TidStore *ts)
+{
+ if (TidStoreIsShared(ts))
+ {
+ Assert(ts->control->magic == TIDSTORE_MAGIC);
+
+ /*
+ * Vandalize the control block to help catch programming errors where
+ * other backends access the memory formerly occupied by this TidStore.
+ */
+ ts->control->magic = 0;
+ dsa_free(ts->area, ts->control->handle);
+ shared_rt_free(ts->tree.shared);
+ }
+ else
+ {
+ pfree(ts->control);
+ local_rt_free(ts->tree.local);
+ }
+
+ pfree(ts);
+}
+
+/* Forget all collected Tids */
+void
+tidstore_reset(TidStore *ts)
+{
+ if (TidStoreIsShared(ts))
+ {
+ Assert(ts->control->magic == TIDSTORE_MAGIC);
+
+ LWLockAcquire(&ts->control->lock, LW_EXCLUSIVE);
+
+ /*
+ * Free the radix tree and return allocated DSA segments to
+ * the operating system.
+ */
+ shared_rt_free(ts->tree.shared);
+ dsa_trim(ts->area);
+
+ /* Recreate the radix tree */
+ ts->tree.shared = shared_rt_create(CurrentMemoryContext, ts->area,
+ LWTRANCHE_SHARED_TIDSTORE);
+
+ /* update the radix tree handle as we recreated it */
+ ts->control->tree_handle = shared_rt_get_handle(ts->tree.shared);
+
+ /* Reset the statistics */
+ ts->control->num_tids = 0;
+
+ LWLockRelease(&ts->control->lock);
+ }
+ else
+ {
+ local_rt_free(ts->tree.local);
+ ts->tree.local = local_rt_create(CurrentMemoryContext);
+
+ /* Reset the statistics */
+ ts->control->num_tids = 0;
+ }
+}
+
+static inline void
+tidstore_insert_kv(TidStore *ts, uint64 key, uint64 val)
+{
+ if (TidStoreIsShared(ts))
+ shared_rt_set(ts->tree.shared, key, val);
+ else
+ local_rt_set(ts->tree.local, key, val);
+}
+
+/* Add Tids on a block to TidStore */
+void
+tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets)
+{
+ ItemPointerData tid;
+ uint64 key_base;
+ uint64 *values;
+ int nkeys;
+
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ ItemPointerSetBlockNumber(&tid, blkno);
+
+ if (ts->control->encode_tids)
+ {
+ key_base = ((uint64) blkno) << ts->control->offset_key_nbits;
+ nkeys = UINT64CONST(1) << ts->control->offset_key_nbits;
+ }
+ else
+ {
+ key_base = (uint64) blkno;
+ nkeys = 1;
+ }
+
+ values = palloc0(sizeof(uint64) * nkeys);
+
+ for (int i = 0; i < num_offsets; i++)
+ {
+ uint64 key;
+ uint32 off;
+ int idx;
+
+ ItemPointerSetOffsetNumber(&tid, offsets[i]);
+
+ /* encode the tid to key and val */
+ key = tid_to_key_off(ts, &tid, &off);
+
+ idx = key - key_base;
+ Assert(idx >= 0 && idx < nkeys);
+
+ values[idx] |= UINT64CONST(1) << off;
+ }
+
+ if (TidStoreIsShared(ts))
+ LWLockAcquire(&ts->control->lock, LW_EXCLUSIVE);
+
+ /* insert the calculated key-values to the tree */
+ for (int i = 0; i < nkeys; i++)
+ {
+ if (values[i])
+ {
+ uint64 key = key_base + i;
+
+ tidstore_insert_kv(ts, key, values[i]);
+ }
+ }
+
+ /* update statistics */
+ ts->control->num_tids += num_offsets;
+
+ if (TidStoreIsShared(ts))
+ LWLockRelease(&ts->control->lock);
+
+ pfree(values);
+}
+
+/* Return true if the given tid is present in the TidStore */
+bool
+tidstore_lookup_tid(TidStore *ts, ItemPointer tid)
+{
+ uint64 key;
+ uint64 val = 0;
+ uint32 off;
+ bool found;
+
+ key = tid_to_key_off(ts, tid, &off);
+
+ if (TidStoreIsShared(ts))
+ found = shared_rt_search(ts->tree.shared, key, &val);
+ else
+ found = local_rt_search(ts->tree.local, key, &val);
+
+ if (!found)
+ return false;
+
+ return (val & (UINT64CONST(1) << off)) != 0;
+}
+
+/*
+ * Prepare to iterate through a TidStore. The caller must be certain that
+ * no other backend will attempt to update the TidStore during the iteration.
+ */
+TidStoreIter *
+tidstore_begin_iterate(TidStore *ts)
+{
+ TidStoreIter *iter;
+
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ iter = palloc0(sizeof(TidStoreIter));
+ iter->ts = ts;
+
+ iter->result.blkno = InvalidBlockNumber;
+ iter->result.offsets = palloc(sizeof(OffsetNumber) * ts->control->max_offset);
+
+ if (TidStoreIsShared(ts))
+ iter->tree_iter.shared = shared_rt_begin_iterate(ts->tree.shared);
+ else
+ iter->tree_iter.local = local_rt_begin_iterate(ts->tree.local);
+
+ /* If the TidStore is empty, there is nothing to do */
+ if (tidstore_num_tids(ts) == 0)
+ iter->finished = true;
+
+ return iter;
+}
+
+static inline bool
+tidstore_iter_kv(TidStoreIter *iter, uint64 *key, uint64 *val)
+{
+ if (TidStoreIsShared(iter->ts))
+ return shared_rt_iterate_next(iter->tree_iter.shared, key, val);
+ else
+ return local_rt_iterate_next(iter->tree_iter.local, key, val);
+}
+
+/*
+ * Scan the TidStore and return a TidStoreIterResult representing Tids
+ * in one page. Offset numbers in the result are sorted in ascending order.
+ */
+TidStoreIterResult *
+tidstore_iterate_next(TidStoreIter *iter)
+{
+ uint64 key;
+ uint64 val;
+ TidStoreIterResult *result = &(iter->result);
+
+ if (iter->finished)
+ return NULL;
+
+ if (BlockNumberIsValid(result->blkno))
+ {
+ result->num_offsets = 0;
+ tidstore_iter_extract_tids(iter, iter->next_key, iter->next_val);
+ }
+
+ while (tidstore_iter_kv(iter, &key, &val))
+ {
+ BlockNumber blkno;
+
+ blkno = key_get_blkno(iter->ts, key);
+
+ if (BlockNumberIsValid(result->blkno) && result->blkno != blkno)
+ {
+ /*
+ * Remember the key-value pair for the next block for the
+ * next iteration.
+ */
+ iter->next_key = key;
+ iter->next_val = val;
+ return result;
+ }
+
+ /* Collect tids extracted from the key-value pair */
+ tidstore_iter_extract_tids(iter, key, val);
+ }
+
+ iter->finished = true;
+ return result;
+}
+
+/* Finish an iteration over TidStore */
+void
+tidstore_end_iterate(TidStoreIter *iter)
+{
+ if (TidStoreIsShared(iter->ts))
+ shared_rt_end_iterate(iter->tree_iter.shared);
+ else
+ local_rt_end_iterate(iter->tree_iter.local);
+
+ pfree(iter->result.offsets);
+ pfree(iter);
+}
+
+/* Return the number of Tids we collected so far */
+int64
+tidstore_num_tids(TidStore *ts)
+{
+ int64 num_tids;
+
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ /* The lock is initialized only in the shared case */
+ if (!TidStoreIsShared(ts))
+ return ts->control->num_tids;
+
+ LWLockAcquire(&ts->control->lock, LW_SHARED);
+ num_tids = ts->control->num_tids;
+ LWLockRelease(&ts->control->lock);
+
+ return num_tids;
+}
+
+/* Return true if the current memory usage of TidStore exceeds the limit */
+bool
+tidstore_is_full(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ return (tidstore_memory_usage(ts) > ts->control->max_bytes);
+}
+
+/* Return the maximum memory TidStore can use */
+size_t
+tidstore_max_memory(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ return ts->control->max_bytes;
+}
+
+/* Return the memory usage of TidStore */
+size_t
+tidstore_memory_usage(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ /*
+ * In the shared case, TidStoreControl and radix_tree are backed by the
+ * same DSA area and rt_memory_usage() returns the value including both.
+ * So we don't need to add the size of TidStoreControl separately.
+ */
+ if (TidStoreIsShared(ts))
+ return sizeof(TidStore) + shared_rt_memory_usage(ts->tree.shared);
+ else
+ return sizeof(TidStore) + sizeof(TidStoreControl) +
+ local_rt_memory_usage(ts->tree.local);
+}
+
+/*
+ * Get a handle that can be used by other processes to attach to this TidStore
+ */
+tidstore_handle
+tidstore_get_handle(TidStore *ts)
+{
+ Assert(TidStoreIsShared(ts) && ts->control->magic == TIDSTORE_MAGIC);
+
+ return ts->control->handle;
+}
+
+/* Extract Tids from the given key-value pair */
+static void
+tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, uint64 val)
+{
+ TidStoreIterResult *result = (&iter->result);
+
+ for (int i = 0; i < sizeof(uint64) * BITS_PER_BYTE; i++)
+ {
+ uint64 tid_i;
+ OffsetNumber off;
+
+ if (i > iter->ts->control->max_offset)
+ break;
+
+ if ((val & (UINT64CONST(1) << i)) == 0)
+ continue;
+
+ tid_i = key << TIDSTORE_VALUE_NBITS;
+ tid_i |= i;
+
+ off = tid_i & ((UINT64CONST(1) << iter->ts->control->offset_nbits) - 1);
+
+ Assert(result->num_offsets < iter->ts->control->max_offset);
+ result->offsets[result->num_offsets++] = off;
+ }
+
+ result->blkno = key_get_blkno(iter->ts, key);
+}
+
+/* Get block number from the given key */
+static inline BlockNumber
+key_get_blkno(TidStore *ts, uint64 key)
+{
+ if (ts->control->encode_tids)
+ return (BlockNumber) (key >> ts->control->offset_key_nbits);
+ else
+ return (BlockNumber) key;
+}
+
+/* Encode a tid to key and offset */
+static inline uint64
+tid_to_key_off(TidStore *ts, ItemPointer tid, uint32 *off)
+{
+ uint64 key;
+ uint64 tid_i;
+
+ if (!ts->control->encode_tids)
+ {
+ *off = ItemPointerGetOffsetNumber(tid);
+
+ /* Use the block number as the key */
+ return (uint64) ItemPointerGetBlockNumber(tid);
+ }
+
+ tid_i = ItemPointerGetOffsetNumber(tid);
+ tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << ts->control->offset_nbits;
+
+ *off = tid_i & ((UINT64CONST(1) << TIDSTORE_VALUE_NBITS) - 1);
+ key = tid_i >> TIDSTORE_VALUE_NBITS;
+ Assert(*off < (sizeof(uint64) * BITS_PER_BYTE));
+
+ return key;
+}
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index d2ec396045..55b3a04097 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -176,6 +176,8 @@ static const char *const BuiltinTrancheNames[] = {
"SharedTupleStore",
/* LWTRANCHE_SHARED_TIDBITMAP: */
"SharedTidBitmap",
+ /* LWTRANCHE_SHARED_TIDSTORE: */
+ "SharedTidStore",
/* LWTRANCHE_PARALLEL_APPEND: */
"ParallelAppend",
/* LWTRANCHE_PER_XACT_PREDICATE_LIST: */
diff --git a/src/include/access/tidstore.h b/src/include/access/tidstore.h
new file mode 100644
index 0000000000..a35a52124a
--- /dev/null
+++ b/src/include/access/tidstore.h
@@ -0,0 +1,49 @@
+/*-------------------------------------------------------------------------
+ *
+ * tidstore.h
+ * Tid storage.
+ *
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/tidstore.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef TIDSTORE_H
+#define TIDSTORE_H
+
+#include "storage/itemptr.h"
+#include "utils/dsa.h"
+
+typedef dsa_pointer tidstore_handle;
+
+typedef struct TidStore TidStore;
+typedef struct TidStoreIter TidStoreIter;
+
+typedef struct TidStoreIterResult
+{
+ BlockNumber blkno;
+ OffsetNumber *offsets;
+ int num_offsets;
+} TidStoreIterResult;
+
+extern TidStore *tidstore_create(size_t max_bytes, int max_offset, dsa_area *dsa);
+extern TidStore *tidstore_attach(dsa_area *dsa, dsa_pointer handle);
+extern void tidstore_detach(TidStore *ts);
+extern void tidstore_destroy(TidStore *ts);
+extern void tidstore_reset(TidStore *ts);
+extern void tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets);
+extern bool tidstore_lookup_tid(TidStore *ts, ItemPointer tid);
+extern TidStoreIter *tidstore_begin_iterate(TidStore *ts);
+extern TidStoreIterResult *tidstore_iterate_next(TidStoreIter *iter);
+extern void tidstore_end_iterate(TidStoreIter *iter);
+extern int64 tidstore_num_tids(TidStore *ts);
+extern bool tidstore_is_full(TidStore *ts);
+extern size_t tidstore_max_memory(TidStore *ts);
+extern size_t tidstore_memory_usage(TidStore *ts);
+extern tidstore_handle tidstore_get_handle(TidStore *ts);
+
+#endif /* TIDSTORE_H */
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index d2c7afb8f4..07002fdfbe 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -199,6 +199,7 @@ typedef enum BuiltinTrancheIds
LWTRANCHE_PER_SESSION_RECORD_TYPMOD,
LWTRANCHE_SHARED_TUPLESTORE,
LWTRANCHE_SHARED_TIDBITMAP,
+ LWTRANCHE_SHARED_TIDSTORE,
LWTRANCHE_PARALLEL_APPEND,
LWTRANCHE_PER_XACT_PREDICATE_LIST,
LWTRANCHE_PGSTATS_DSA,
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 9659eb85d7..bddc16ada7 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -34,6 +34,7 @@ SUBDIRS = \
test_rls_hooks \
test_shm_mq \
test_slru \
+ test_tidstore \
unsafe_tests \
worker_spi
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 232cbdac80..c0d5645ad8 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -30,5 +30,6 @@ subdir('test_regex')
subdir('test_rls_hooks')
subdir('test_shm_mq')
subdir('test_slru')
+subdir('test_tidstore')
subdir('unsafe_tests')
subdir('worker_spi')
diff --git a/src/test/modules/test_tidstore/Makefile b/src/test/modules/test_tidstore/Makefile
new file mode 100644
index 0000000000..dab107d70c
--- /dev/null
+++ b/src/test/modules/test_tidstore/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_tidstore/Makefile
+
+MODULE_big = test_tidstore
+OBJS = \
+ $(WIN32RES) \
+ test_tidstore.o
+PGFILEDESC = "test_tidstore - test code for src/backend/access/common/tidstore.c"
+
+EXTENSION = test_tidstore
+DATA = test_tidstore--1.0.sql
+
+REGRESS = test_tidstore
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_tidstore
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_tidstore/expected/test_tidstore.out b/src/test/modules/test_tidstore/expected/test_tidstore.out
new file mode 100644
index 0000000000..7ff2f9af87
--- /dev/null
+++ b/src/test/modules/test_tidstore/expected/test_tidstore.out
@@ -0,0 +1,13 @@
+CREATE EXTENSION test_tidstore;
+--
+-- All the logic is in the test_tidstore() function. It will throw
+-- an error if something fails.
+--
+SELECT test_tidstore();
+NOTICE: testing empty tidstore
+NOTICE: testing basic operations
+ test_tidstore
+---------------
+
+(1 row)
+
diff --git a/src/test/modules/test_tidstore/meson.build b/src/test/modules/test_tidstore/meson.build
new file mode 100644
index 0000000000..31f2da7b61
--- /dev/null
+++ b/src/test/modules/test_tidstore/meson.build
@@ -0,0 +1,35 @@
+# FIXME: prevent install during main install, but not during test :/
+
+test_tidstore_sources = files(
+ 'test_tidstore.c',
+)
+
+if host_system == 'windows'
+ test_tidstore_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'test_tidstore',
+ '--FILEDESC', 'test_tidstore - test code for src/backend/access/common/tidstore.c',])
+endif
+
+test_tidstore = shared_module('test_tidstore',
+ test_tidstore_sources,
+ link_with: pgport_srv,
+ kwargs: pg_mod_args,
+)
+testprep_targets += test_tidstore
+
+install_data(
+ 'test_tidstore.control',
+ 'test_tidstore--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'test_tidstore',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'test_tidstore',
+ ],
+ },
+}
diff --git a/src/test/modules/test_tidstore/sql/test_tidstore.sql b/src/test/modules/test_tidstore/sql/test_tidstore.sql
new file mode 100644
index 0000000000..03aea31815
--- /dev/null
+++ b/src/test/modules/test_tidstore/sql/test_tidstore.sql
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_tidstore;
+
+--
+-- All the logic is in the test_tidstore() function. It will throw
+-- an error if something fails.
+--
+SELECT test_tidstore();
diff --git a/src/test/modules/test_tidstore/test_tidstore--1.0.sql b/src/test/modules/test_tidstore/test_tidstore--1.0.sql
new file mode 100644
index 0000000000..47e9149900
--- /dev/null
+++ b/src/test/modules/test_tidstore/test_tidstore--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_tidstore/test_tidstore--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_tidstore" to load this file. \quit
+
+CREATE FUNCTION test_tidstore()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_tidstore/test_tidstore.c b/src/test/modules/test_tidstore/test_tidstore.c
new file mode 100644
index 0000000000..9b849ae8e8
--- /dev/null
+++ b/src/test/modules/test_tidstore/test_tidstore.c
@@ -0,0 +1,195 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_tidstore.c
+ * Test TidStore data structure.
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/test/modules/test_tidstore/test_tidstore.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "access/tidstore.h"
+#include "fmgr.h"
+#include "miscadmin.h"
+#include "storage/block.h"
+#include "storage/itemptr.h"
+#include "utils/memutils.h"
+
+PG_MODULE_MAGIC;
+
+#define TEST_TIDSTORE_MAX_BYTES (2 * 1024 * 1024L) /* 2MB */
+
+PG_FUNCTION_INFO_V1(test_tidstore);
+
+static void
+check_tid(TidStore *ts, BlockNumber blkno, OffsetNumber off, bool expect)
+{
+ ItemPointerData tid;
+ bool found;
+
+ ItemPointerSet(&tid, blkno, off);
+
+ found = tidstore_lookup_tid(ts, &tid);
+
+ if (found != expect)
+ elog(ERROR, "lookup TID (%u, %u) returned %d, expected %d",
+ blkno, off, found, expect);
+}
+
+static void
+test_basic(int max_offset)
+{
+#define TEST_TIDSTORE_NUM_BLOCKS 5
+#define TEST_TIDSTORE_NUM_OFFSETS 5
+
+ TidStore *ts;
+ TidStoreIter *iter;
+ TidStoreIterResult *iter_result;
+ BlockNumber blks[TEST_TIDSTORE_NUM_BLOCKS] = {
+ 0, MaxBlockNumber, MaxBlockNumber - 1, 1, MaxBlockNumber / 2,
+ };
+ BlockNumber blks_sorted[TEST_TIDSTORE_NUM_BLOCKS] = {
+ 0, 1, MaxBlockNumber / 2, MaxBlockNumber - 1, MaxBlockNumber
+ };
+ OffsetNumber offs[TEST_TIDSTORE_NUM_OFFSETS];
+ int blk_idx;
+
+ /* prepare the offset array */
+ offs[0] = FirstOffsetNumber;
+ offs[1] = FirstOffsetNumber + 1;
+ offs[2] = max_offset / 2;
+ offs[3] = max_offset - 1;
+ offs[4] = max_offset;
+
+ ts = tidstore_create(TEST_TIDSTORE_MAX_BYTES, max_offset, NULL);
+
+ /* add tids */
+ for (int i = 0; i < TEST_TIDSTORE_NUM_BLOCKS; i++)
+ tidstore_add_tids(ts, blks[i], offs, TEST_TIDSTORE_NUM_OFFSETS);
+
+ /* lookup test */
+ for (OffsetNumber off = FirstOffsetNumber ; off < max_offset; off++)
+ {
+ bool expect = false;
+ for (int i = 0; i < TEST_TIDSTORE_NUM_OFFSETS; i++)
+ {
+ if (offs[i] == off)
+ {
+ expect = true;
+ break;
+ }
+ }
+
+ check_tid(ts, 0, off, expect);
+ check_tid(ts, 2, off, false);
+ check_tid(ts, MaxBlockNumber - 2, off, false);
+ check_tid(ts, MaxBlockNumber, off, expect);
+ }
+
+ /* test the number of tids */
+ if (tidstore_num_tids(ts) != (TEST_TIDSTORE_NUM_BLOCKS * TEST_TIDSTORE_NUM_OFFSETS))
+ elog(ERROR, "tidstore_num_tids returned " UINT64_FORMAT ", expected %d",
+ tidstore_num_tids(ts),
+ TEST_TIDSTORE_NUM_BLOCKS * TEST_TIDSTORE_NUM_OFFSETS);
+
+ /* iteration test */
+ iter = tidstore_begin_iterate(ts);
+ blk_idx = 0;
+ while ((iter_result = tidstore_iterate_next(iter)) != NULL)
+ {
+ /* check the returned block number */
+ if (blks_sorted[blk_idx] != iter_result->blkno)
+ elog(ERROR, "tidstore_iterate_next returned block number %u, expected %u",
+ iter_result->blkno, blks_sorted[blk_idx]);
+
+ /* check the returned offset numbers */
+ if (TEST_TIDSTORE_NUM_OFFSETS != iter_result->num_offsets)
+ elog(ERROR, "tidstore_iterate_next returned %u offsets, expected %u",
+ iter_result->num_offsets, TEST_TIDSTORE_NUM_OFFSETS);
+
+ for (int i = 0; i < iter_result->num_offsets; i++)
+ {
+ if (offs[i] != iter_result->offsets[i])
+ elog(ERROR, "tidstore_iterate_next returned offset number %u on block %u, expected %u",
+ iter_result->offsets[i], iter_result->blkno, offs[i]);
+ }
+
+ blk_idx++;
+ }
+
+ if (blk_idx != TEST_TIDSTORE_NUM_BLOCKS)
+ elog(ERROR, "tidstore_iterate_next returned %d blocks, expected %d",
+ blk_idx, TEST_TIDSTORE_NUM_BLOCKS);
+
+ /* remove all tids */
+ tidstore_reset(ts);
+
+ /* test the number of tids */
+ if (tidstore_num_tids(ts) != 0)
+ elog(ERROR, "tidstore_num_tids on empty store returned non-zero");
+
+ /* lookup test for empty store */
+ for (OffsetNumber off = FirstOffsetNumber ; off < MaxHeapTuplesPerPage;
+ off++)
+ {
+ check_tid(ts, 0, off, false);
+ check_tid(ts, 2, off, false);
+ check_tid(ts, MaxBlockNumber - 2, off, false);
+ check_tid(ts, MaxBlockNumber, off, false);
+ }
+
+ tidstore_destroy(ts);
+}
+
+static void
+test_empty(void)
+{
+ TidStore *ts;
+ TidStoreIter *iter;
+ ItemPointerData tid;
+
+ elog(NOTICE, "testing empty tidstore");
+
+ ts = tidstore_create(TEST_TIDSTORE_MAX_BYTES, MaxHeapTuplesPerPage, NULL);
+
+ ItemPointerSet(&tid, 0, FirstOffsetNumber);
+ if (tidstore_lookup_tid(ts, &tid))
+ elog(ERROR, "tidstore_lookup_tid for (0,1) on empty store returned true");
+
+ ItemPointerSet(&tid, MaxBlockNumber, MaxOffsetNumber);
+ if (tidstore_lookup_tid(ts, &tid))
+ elog(ERROR, "tidstore_lookup_tid for (%u,%u) on empty store returned true",
+ MaxBlockNumber, MaxOffsetNumber);
+
+ if (tidstore_num_tids(ts) != 0)
+ elog(ERROR, "tidstore_num_entries on empty store returned non-zero");
+
+ if (tidstore_is_full(ts))
+ elog(ERROR, "tidstore_is_full on empty store returned true");
+
+ iter = tidstore_begin_iterate(ts);
+
+ if (tidstore_iterate_next(iter) != NULL)
+ elog(ERROR, "tidstore_iterate_next on empty store returned TIDs");
+
+ tidstore_end_iterate(iter);
+
+ tidstore_destroy(ts);
+}
+
+Datum
+test_tidstore(PG_FUNCTION_ARGS)
+{
+ test_empty();
+
+ elog(NOTICE, "testing basic operations");
+ test_basic(MaxHeapTuplesPerPage);
+ test_basic(10);
+
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_tidstore/test_tidstore.control b/src/test/modules/test_tidstore/test_tidstore.control
new file mode 100644
index 0000000000..9b6bd4638f
--- /dev/null
+++ b/src/test/modules/test_tidstore/test_tidstore.control
@@ -0,0 +1,4 @@
+comment = 'Test code for tidstore'
+default_version = '1.0'
+module_pathname = '$libdir/test_tidstore'
+relocatable = true
--
2.31.1
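For anyone who wants to try the new API without going through the test
module, a minimal backend-local usage sketch looks roughly like the
following. The numbers and the function itself are illustrative only;
the tidstore_* calls and macros used here are the ones declared in the
patch and in existing headers.

#include "postgres.h"

#include "access/htup_details.h"
#include "access/tidstore.h"

static void
tidstore_usage_sketch(void)
{
    TidStore   *ts;
    TidStoreIter *iter;
    TidStoreIterResult *res;
    OffsetNumber offs[] = {1, 2, 50};
    ItemPointerData tid;

    /* up to 1MB of TID storage, backend-local since no DSA area is given */
    ts = tidstore_create(1024 * 1024, MaxHeapTuplesPerPage, NULL);

    /* add the same three offsets on two different blocks */
    tidstore_add_tids(ts, 10, offs, lengthof(offs));
    tidstore_add_tids(ts, 20, offs, lengthof(offs));

    ItemPointerSet(&tid, 10, 2);
    if (!tidstore_lookup_tid(ts, &tid))
        elog(ERROR, "(10,2) should be present");

    /* iterate block by block; offsets come back sorted */
    iter = tidstore_begin_iterate(ts);
    while ((res = tidstore_iterate_next(iter)) != NULL)
        elog(DEBUG1, "block %u has %d dead offsets",
             res->blkno, res->num_offsets);
    tidstore_end_iterate(iter);

    Assert(tidstore_num_tids(ts) == 6);

    tidstore_destroy(ts);
}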
Attachment: v23-0015-Detach-DSA-after-tests-in-test_radixtree.patch (application/octet-stream)
From 139100053c485f7ade6117e42ab6567dd94bdd76 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 25 Jan 2023 17:43:04 +0900
Subject: [PATCH v23 15/18] Detach DSA after tests in test_radixtree.
---
src/test/modules/test_radixtree/test_radixtree.c | 13 +++++++++++++
1 file changed, 13 insertions(+)
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
index 64d46dfe9a..2a93e731ae 100644
--- a/src/test/modules/test_radixtree/test_radixtree.c
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -172,6 +172,10 @@ test_empty(void)
rt_end_iterate(iter);
rt_free(radixtree);
+
+#ifdef RT_SHMEM
+ dsa_detach(dsa);
+#endif
}
static void
@@ -243,6 +247,9 @@ test_basic(int children, bool test_inner)
pfree(keys);
rt_free(radixtree);
+#ifdef RT_SHMEM
+ dsa_detach(dsa);
+#endif
}
/*
@@ -371,6 +378,9 @@ test_node_types(uint8 shift)
test_node_types_insert(radixtree, shift, false);
rt_free(radixtree);
+#ifdef RT_SHMEM
+ dsa_detach(dsa);
+#endif
}
/*
@@ -636,6 +646,9 @@ test_pattern(const test_spec * spec)
rt_free(radixtree);
MemoryContextDelete(radixtree_ctx);
+#ifdef RT_SHMEM
+ dsa_detach(dsa);
+#endif
}
Datum
--
2.31.1
Attachment: v23-0013-Remove-XXX-comment-for-MemoryContext-support-for.patch (application/octet-stream)
From 8e5ca0c31972bde4e0d64b76ef8cfee599af1044 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 25 Jan 2023 11:07:38 +0900
Subject: [PATCH v23 13/18] Remove XXX comment for MemoryContext support for
RT_ATTACH() as discussed.
---
src/include/lib/radixtree.h | 1 -
1 file changed, 1 deletion(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index e9ff3aa05d..dbf9df604f 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -1433,7 +1433,6 @@ RT_ATTACH(dsa_area *dsa, RT_HANDLE handle)
RT_RADIX_TREE *tree;
dsa_pointer control;
- /* XXX: memory context support */
tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE));
/* Find the control object in shard memory */
--
2.31.1
Attachment: v23-0011-Add-a-safeguard-for-concurrent-iteration-in-RT_S.patch (application/octet-stream)
From 56c458643f58723c59ed28477f6d129374a59e6c Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 25 Jan 2023 11:04:41 +0900
Subject: [PATCH v23 11/18] Add a safeguard for concurrent iteration in
RT_SHMEM case.
---
src/include/lib/radixtree.h | 21 +++++++++++++++++----
1 file changed, 17 insertions(+), 4 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 003e8215aa..0277d5e6fb 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -602,6 +602,9 @@ typedef struct RT_RADIX_TREE_CONTROL
uint64 max_val;
uint64 num_keys;
+ /* is iteration in progress? */
+ bool iter_active;
+
/* statistics */
#ifdef RT_DEBUG
int32 cnt[RT_SIZE_CLASS_COUNT];
@@ -638,10 +641,7 @@ typedef struct RT_RADIX_TREE
* advancing the current index within the node or when moving to the next node
* at the same level.
*
- * XXX: Currently we allow only one process to do iteration. Therefore, rt_node_iter
- * has the local pointers to nodes, rather than RT_PTR_ALLOC.
- * We need either a safeguard to disallow other processes to begin the iteration
- * while one process is doing or to allow multiple processes to do the iteration.
+ * In RT_SHMEM case, only one process is allowed to do iteration.
*/
typedef struct RT_NODE_ITER
{
@@ -1582,6 +1582,9 @@ RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE value)
Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
#endif
+ if (unlikely(tree->ctl->iter_active))
+ elog(ERROR, "cannot add new key-value to radix tree while iteration is in progress");
+
/* Empty tree, create the root */
if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
RT_NEW_ROOT(tree, key);
@@ -1683,6 +1686,9 @@ RT_DELETE(RT_RADIX_TREE *tree, uint64 key)
Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
#endif
+ if (unlikely(tree->ctl->iter_active))
+ elog(ERROR, "cannot delete key to radix tree while iteration is in progress");
+
if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root) || key > tree->ctl->max_val)
return false;
@@ -1822,10 +1828,14 @@ RT_BEGIN_ITERATE(RT_RADIX_TREE *tree)
RT_PTR_LOCAL root;
int top_level;
+ if (unlikely(tree->ctl->iter_active))
+ elog(ERROR, "cannot begin iteration while another iteration is in progress");
+
old_ctx = MemoryContextSwitchTo(tree->context);
iter = (RT_ITER *) palloc0(sizeof(RT_ITER));
iter->tree = tree;
+ tree->ctl->iter_active = true;
/* empty tree */
if (!iter->tree->ctl->root)
@@ -1853,6 +1863,8 @@ RT_BEGIN_ITERATE(RT_RADIX_TREE *tree)
RT_SCOPE bool
RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, RT_VALUE_TYPE *value_p)
{
+ Assert(iter->tree->ctl->iter_active);
+
/* Empty tree */
if (!iter->tree->ctl->root)
return false;
@@ -1905,6 +1917,7 @@ RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, RT_VALUE_TYPE *value_p)
RT_SCOPE void
RT_END_ITERATE(RT_ITER *iter)
{
+ iter->tree->ctl->iter_active = false;
pfree(iter);
}
--
2.31.1
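To make the new safeguard concrete, here is a sketch of how it would
trip for a local (non-shared) tree. The demo_rt prefix and the function
are hypothetical, but the template parameters mirror the instantiation
in tidstore.c and the error message is the one added by this patch.

#include "postgres.h"

#define RT_PREFIX demo_rt
#define RT_SCOPE static
#define RT_DECLARE
#define RT_DEFINE
#define RT_VALUE_TYPE uint64
#include "lib/radixtree.h"

static void
demo_iteration_guard(void)
{
    demo_rt_radix_tree *tree = demo_rt_create(CurrentMemoryContext);
    demo_rt_iter *iter;

    demo_rt_set(tree, 42, 1);

    iter = demo_rt_begin_iterate(tree);

    /*
     * With iter_active now set, this call is expected to fail with
     * "cannot add new key-value to radix tree while iteration is in
     * progress" instead of silently disturbing the ongoing iteration.
     */
    demo_rt_set(tree, 43, 1);

    /* not reached; normally we would iterate here and then clean up */
    demo_rt_end_iterate(iter);
    demo_rt_free(tree);
}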
Attachment: v23-0010-Fix-a-typo-in-simd.h.patch (application/octet-stream)
From d8b39122cea6ca7363b0ae6d96d99bd018a264c4 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 25 Jan 2023 10:51:12 +0900
Subject: [PATCH v23 10/18] Fix a typo in simd.h
---
src/include/port/simd.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/src/include/port/simd.h b/src/include/port/simd.h
index 84d41a340a..f0bba33c53 100644
--- a/src/include/port/simd.h
+++ b/src/include/port/simd.h
@@ -280,7 +280,7 @@ vector8_is_highbit_set(const Vector8 v)
}
/*
- * Return the bitmak of the high-bit of each element.
+ * Return the bitmask of the high-bit of each element.
*/
static inline uint32
vector8_highbit_mask(const Vector8 v)
--
2.31.1
Attachment: v23-0009-Miscellaneous-fixes.patch (application/octet-stream)
From 222e13f6e19baa6189c25167d0f20919230842c3 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 25 Jan 2023 10:50:33 +0900
Subject: [PATCH v23 09/18] Miscellaneous fixes.
---
src/include/lib/radixtree.h | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index b389ee3ed3..003e8215aa 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -304,7 +304,7 @@ RT_SCOPE void RT_STATS(RT_RADIX_TREE *tree);
* XXX There are 4 node kinds, and this should never be increased,
* for several reasons:
* 1. With 5 or more kinds, gcc tends to use a jump table for switch
- * statments.
+ * statements.
* 2. The 4 kinds can be represented with 2 bits, so we have the option
* in the future to tag the node pointer with the kind, even on
* platforms with 32-bit pointers. This might speed up node traversal
@@ -2239,7 +2239,7 @@ RT_DUMP(RT_RADIX_TREE *tree)
{
for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
- fprintf(stderr, "%s\tinner_size %zu\tleaf_size %zu\t%zu\n",
+ fprintf(stderr, "%s\tinner_size %zu\tleaf_size %zu\n",
RT_SIZE_CLASS_INFO[i].name,
RT_SIZE_CLASS_INFO[i].inner_size,
RT_SIZE_CLASS_INFO[i].leaf_size);
--
2.31.1
v23-0012-Don-t-include-the-size-of-RT_RADIX_TREE-to-memor.patch
From b90e3412b94bfc5bf8de7e2f1e6a0fe286075f52 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 25 Jan 2023 11:06:30 +0900
Subject: [PATCH v23 12/18] Don't include the size of RT_RADIX_TREE to memory
usage as discussed.
---
src/include/lib/radixtree.h | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 0277d5e6fb..e9ff3aa05d 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -1927,8 +1927,7 @@ RT_END_ITERATE(RT_ITER *iter)
RT_SCOPE uint64
RT_MEMORY_USAGE(RT_RADIX_TREE *tree)
{
- // XXX is this necessary?
- Size total = sizeof(RT_RADIX_TREE);
+ Size total = 0;
#ifdef RT_SHMEM
Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
--
2.31.1
v23-0008-Align-indents-of-the-file-header-comments.patch
From 6c08547c8d6b56ff7ff4a686cab863d58c6a16e6 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 25 Jan 2023 10:49:17 +0900
Subject: [PATCH v23 08/18] Align indents of the file header comments.
---
src/include/lib/radixtree.h | 36 ++++++++++++++++++------------------
1 file changed, 18 insertions(+), 18 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 6852cb0b45..b389ee3ed3 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -42,25 +42,25 @@
*
* WIP: the radix tree nodes don't shrink.
*
- * To generate a radix tree and associated functions for a use case several
- * macros have to be #define'ed before this file is included. Including
- * the file #undef's all those, so a new radix tree can be generated
- * afterwards.
- * The relevant parameters are:
- * - RT_PREFIX - prefix for all symbol names generated. A prefix of 'foo'
- * will result in radix tree type 'foo_radix_tree' and functions like
- * 'foo_create'/'foo_free' and so forth.
- * - RT_DECLARE - if defined function prototypes and type declarations are
- * generated
- * - RT_DEFINE - if defined function definitions are generated
- * - RT_SCOPE - in which scope (e.g. extern, static inline) do function
- * declarations reside
- * - RT_VALUE_TYPE - the type of the value.
+ * To generate a radix tree and associated functions for a use case several
+ * macros have to be #define'ed before this file is included. Including
+ * the file #undef's all those, so a new radix tree can be generated
+ * afterwards.
+ * The relevant parameters are:
+ * - RT_PREFIX - prefix for all symbol names generated. A prefix of 'foo'
+ * will result in radix tree type 'foo_radix_tree' and functions like
+ * 'foo_create'/'foo_free' and so forth.
+ * - RT_DECLARE - if defined function prototypes and type declarations are
+ * generated
+ * - RT_DEFINE - if defined function definitions are generated
+ * - RT_SCOPE - in which scope (e.g. extern, static inline) do function
+ * declarations reside
+ * - RT_VALUE_TYPE - the type of the value.
*
- * Optional parameters:
- * - RT_SHMEM - if defined, the radix tree is created in the DSA area
- * so that multiple processes can access it simultaneously.
- * - RT_DEBUG - if defined add stats tracking and debugging functions
+ * Optional parameters:
+ * - RT_SHMEM - if defined, the radix tree is created in the DSA area
+ * so that multiple processes can access it simultaneously.
+ * - RT_DEBUG - if defined add stats tracking and debugging functions
*
* Interface
* ---------
--
2.31.1
v23-0007-undef-RT_SLOT_IDX_LIMIT.patch
From 31742053ef1824698e0ae0c3a059eb2f06164522 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 25 Jan 2023 10:45:16 +0900
Subject: [PATCH v23 07/18] undef RT_SLOT_IDX_LIMIT.
---
src/include/lib/radixtree.h | 1 +
1 file changed, 1 insertion(+)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 7fcd212ea4..6852cb0b45 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -2281,6 +2281,7 @@ RT_DUMP(RT_RADIX_TREE *tree)
#undef RT_NODE_MUST_GROW
#undef RT_NODE_KIND_COUNT
#undef RT_SIZE_CLASS_COUNT
+#undef RT_SLOT_IDX_LIMIT
#undef RT_INVALID_SLOT_IDX
#undef RT_SLAB_BLOCK_SIZE
#undef RT_RADIX_TREE_MAGIC
--
2.31.1
v23-0006-Fix-compile-error-when-RT_VALUE_TYPE-is-non-inte.patch
From 00d0b18389d7852b34a3eee16f69038a2f07ebaa Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 25 Jan 2023 10:43:01 +0900
Subject: [PATCH v23 06/18] Fix compile error when RT_VALUE_TYPE is
non-integer.
'value' must be initialized since we assign it
to *value_p, to suppress a compiler warning.
---
src/include/lib/radixtree_search_impl.h | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/src/include/lib/radixtree_search_impl.h b/src/include/lib/radixtree_search_impl.h
index c4352045c8..a319c46c39 100644
--- a/src/include/lib/radixtree_search_impl.h
+++ b/src/include/lib/radixtree_search_impl.h
@@ -15,7 +15,8 @@
uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
#ifdef RT_NODE_LEVEL_LEAF
- RT_VALUE_TYPE value = 0;
+ RT_VALUE_TYPE value;
+ MemSet(&value, 0, sizeof(RT_VALUE_TYPE));
Assert(RT_NODE_IS_LEAF(node));
#else
--
2.31.1
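(Not part of the patch: for context, a hypothetical instantiation of the kind this fix targets. With a struct value type, the previous initialization "RT_VALUE_TYPE value = 0" does not compile, while MemSet works for any value type. All names below are made up for illustration.)

    /* hypothetical value type that is not an integer */
    typedef struct BlockBitmap
    {
        uint64      words[4];
    } BlockBitmap;

    #define RT_PREFIX bbrt
    #define RT_SCOPE static
    #define RT_DECLARE
    #define RT_DEFINE
    #define RT_VALUE_TYPE BlockBitmap
    #include "lib/radixtree.h"

    /* bbrt_set()/bbrt_search() now copy whole BlockBitmap values */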
v23-0005-Tool-for-measuring-radix-tree-performance.patch
From 5157516f81a3f19de42809fbaec6f3b1e523c68a Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 16 Sep 2022 11:57:03 +0900
Subject: [PATCH v23 05/18] Tool for measuring radix tree performance
Includes Meson support, but commented out to avoid warnings
XXX: Not for commit
---
contrib/bench_radix_tree/Makefile | 21 +
.../bench_radix_tree--1.0.sql | 76 ++
contrib/bench_radix_tree/bench_radix_tree.c | 656 ++++++++++++++++++
.../bench_radix_tree/bench_radix_tree.control | 6 +
contrib/bench_radix_tree/expected/bench.out | 13 +
contrib/bench_radix_tree/meson.build | 33 +
contrib/bench_radix_tree/sql/bench.sql | 16 +
contrib/meson.build | 1 +
8 files changed, 822 insertions(+)
create mode 100644 contrib/bench_radix_tree/Makefile
create mode 100644 contrib/bench_radix_tree/bench_radix_tree--1.0.sql
create mode 100644 contrib/bench_radix_tree/bench_radix_tree.c
create mode 100644 contrib/bench_radix_tree/bench_radix_tree.control
create mode 100644 contrib/bench_radix_tree/expected/bench.out
create mode 100644 contrib/bench_radix_tree/meson.build
create mode 100644 contrib/bench_radix_tree/sql/bench.sql
diff --git a/contrib/bench_radix_tree/Makefile b/contrib/bench_radix_tree/Makefile
new file mode 100644
index 0000000000..952bb0ceae
--- /dev/null
+++ b/contrib/bench_radix_tree/Makefile
@@ -0,0 +1,21 @@
+# contrib/bench_radix_tree/Makefile
+
+MODULE_big = bench_radix_tree
+OBJS = \
+ bench_radix_tree.o
+
+EXTENSION = bench_radix_tree
+DATA = bench_radix_tree--1.0.sql
+
+REGRESS = bench_fixed_height
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/bench_radix_tree
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
new file mode 100644
index 0000000000..2fd689aa91
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
@@ -0,0 +1,76 @@
+/* contrib/bench_radix_tree/bench_radix_tree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION bench_radix_tree" to load this file. \quit
+
+create function bench_shuffle_search(
+minblk int4,
+maxblk int4,
+random_block bool DEFAULT false,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT array_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT array_load_ms int8,
+OUT rt_search_ms int8,
+OUT array_search_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_seq_search(
+minblk int4,
+maxblk int4,
+random_block bool DEFAULT false,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT array_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT array_load_ms int8,
+OUT rt_search_ms int8,
+OUT array_search_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_load_random_int(
+cnt int8,
+OUT mem_allocated int8,
+OUT load_ms int8)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_search_random_nodes(
+cnt int8,
+filter_str text DEFAULT NULL,
+OUT mem_allocated int8,
+OUT search_ms int8)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C VOLATILE PARALLEL UNSAFE;
+
+create function bench_fixed_height_search(
+fanout int4,
+OUT fanout int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT rt_search_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_node128_load(
+fanout int4,
+OUT fanout int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT rt_sparseload_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c
new file mode 100644
index 0000000000..4c785c7336
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -0,0 +1,656 @@
+/*-------------------------------------------------------------------------
+ *
+ * bench_radix_tree.c
+ *
+ * Copyright (c) 2016-2022, PostgreSQL Global Development Group
+ *
+ * contrib/bench_radix_tree/bench_radix_tree.c
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "funcapi.h"
+#include "lib/radixtree.h"
+#include <math.h>
+#include "miscadmin.h"
+#include "storage/lwlock.h"
+#include "utils/builtins.h"
+#include "utils/timestamp.h"
+
+PG_MODULE_MAGIC;
+
+/* run benchmark also for binary-search case? */
+/* #define MEASURE_BINARY_SEARCH 1 */
+
+#define TIDS_PER_BLOCK_FOR_LOAD 30
+#define TIDS_PER_BLOCK_FOR_LOOKUP 50
+
+#define RT_DEBUG
+#define RT_PREFIX rt
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#define RT_USE_DELETE
+#define RT_VALUE_TYPE uint64
+// WIP: compiles with warnings because rt_attach is defined but not used
+// #define RT_SHMEM
+#include "lib/radixtree.h"
+
+/*
+ * Return the number of keys in the radix tree.
+ */
+static uint64
+rt_num_entries(rt_radix_tree *tree)
+{
+ return tree->ctl->num_keys;
+}
+
+
+PG_FUNCTION_INFO_V1(bench_seq_search);
+PG_FUNCTION_INFO_V1(bench_shuffle_search);
+PG_FUNCTION_INFO_V1(bench_load_random_int);
+PG_FUNCTION_INFO_V1(bench_fixed_height_search);
+PG_FUNCTION_INFO_V1(bench_search_random_nodes);
+PG_FUNCTION_INFO_V1(bench_node128_load);
+
+static uint64
+tid_to_key_off(ItemPointer tid, uint32 *off)
+{
+ uint64 upper;
+ uint32 shift = pg_ceil_log2_32(MaxHeapTuplesPerPage);
+ int64 tid_i;
+
+ Assert(ItemPointerGetOffsetNumber(tid) < MaxHeapTuplesPerPage);
+
+ tid_i = ItemPointerGetOffsetNumber(tid);
+ tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << shift;
+
+ /* log(sizeof(uint64) * BITS_PER_BYTE, 2) = log(64, 2) = 6 */
+ *off = tid_i & ((1 << 6) - 1);
+ upper = tid_i >> 6;
+ Assert(*off < (sizeof(uint64) * BITS_PER_BYTE));
+
+ Assert(*off < 64);
+
+ return upper;
+}
+
+static int
+shuffle_randrange(pg_prng_state *state, int lower, int upper)
+{
+ return (int) floor(pg_prng_double(state) * ((upper - lower) + 0.999999)) + lower;
+}
+
+/* Naive Fisher-Yates implementation */
+static void
+shuffle_itemptrs(ItemPointer itemptr, uint64 nitems)
+{
+ /* reproducibility */
+ pg_prng_state state;
+
+ pg_prng_seed(&state, 0);
+
+ for (int i = 0; i < nitems - 1; i++)
+ {
+ int j = shuffle_randrange(&state, i, nitems - 1);
+ ItemPointerData t = itemptr[j];
+
+ itemptr[j] = itemptr[i];
+ itemptr[i] = t;
+ }
+}
+
+static ItemPointer
+generate_tids(BlockNumber minblk, BlockNumber maxblk, int ntids_per_blk, uint64 *ntids_p,
+ bool random_block)
+{
+ ItemPointer tids;
+ uint64 maxitems;
+ uint64 ntids = 0;
+ pg_prng_state state;
+
+ maxitems = (maxblk - minblk + 1) * ntids_per_blk;
+ tids = MemoryContextAllocHuge(TopTransactionContext,
+ sizeof(ItemPointerData) * maxitems);
+
+ if (random_block)
+ pg_prng_seed(&state, 0x9E3779B185EBCA87);
+
+ for (BlockNumber blk = minblk; blk < maxblk; blk++)
+ {
+ if (random_block && !pg_prng_bool(&state))
+ continue;
+
+ for (OffsetNumber off = FirstOffsetNumber;
+ off <= ntids_per_blk; off++)
+ {
+ CHECK_FOR_INTERRUPTS();
+
+ ItemPointerSetBlockNumber(&(tids[ntids]), blk);
+ ItemPointerSetOffsetNumber(&(tids[ntids]), off);
+
+ ntids++;
+ }
+ }
+
+ *ntids_p = ntids;
+ return tids;
+}
+
+#ifdef MEASURE_BINARY_SEARCH
+static int
+vac_cmp_itemptr(const void *left, const void *right)
+{
+ BlockNumber lblk,
+ rblk;
+ OffsetNumber loff,
+ roff;
+
+ lblk = ItemPointerGetBlockNumber((ItemPointer) left);
+ rblk = ItemPointerGetBlockNumber((ItemPointer) right);
+
+ if (lblk < rblk)
+ return -1;
+ if (lblk > rblk)
+ return 1;
+
+ loff = ItemPointerGetOffsetNumber((ItemPointer) left);
+ roff = ItemPointerGetOffsetNumber((ItemPointer) right);
+
+ if (loff < roff)
+ return -1;
+ if (loff > roff)
+ return 1;
+
+ return 0;
+}
+#endif
+
+static Datum
+bench_search(FunctionCallInfo fcinfo, bool shuffle)
+{
+ BlockNumber minblk = PG_GETARG_INT32(0);
+ BlockNumber maxblk = PG_GETARG_INT32(1);
+ bool random_block = PG_GETARG_BOOL(2);
+ rt_radix_tree *rt = NULL;
+ uint64 ntids;
+ uint64 key;
+ uint64 last_key = PG_UINT64_MAX;
+ uint64 val = 0;
+ ItemPointer tids;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_load_ms,
+ rt_search_ms;
+ Datum values[7];
+ bool nulls[7];
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ tids = generate_tids(minblk, maxblk, TIDS_PER_BLOCK_FOR_LOAD, &ntids, random_block);
+
+ /* measure the load time of the radix tree */
+ rt = rt_create(CurrentMemoryContext);
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ uint32 off;
+
+ CHECK_FOR_INTERRUPTS();
+
+ key = tid_to_key_off(tid, &off);
+
+ if (last_key != PG_UINT64_MAX && last_key != key)
+ {
+ rt_set(rt, last_key, val);
+ val = 0;
+ }
+
+ last_key = key;
+ val |= (uint64) 1 << off;
+ }
+ if (last_key != PG_UINT64_MAX)
+ rt_set(rt, last_key, val);
+
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_load_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ if (shuffle)
+ shuffle_itemptrs(tids, ntids);
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+ /* measure the search time of the radix tree */
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ uint32 off;
+ volatile bool ret; /* prevent calling rt_search from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ key = tid_to_key_off(tid, &off);
+
+ ret = rt_search(rt, key, &val);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_search_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int64GetDatum(rt_num_entries(rt));
+ values[1] = Int64GetDatum(rt_memory_usage(rt));
+ values[2] = Int64GetDatum(sizeof(ItemPointerData) * ntids);
+ values[3] = Int64GetDatum(rt_load_ms);
+ nulls[4] = true; /* ar_load_ms */
+ values[5] = Int64GetDatum(rt_search_ms);
+ nulls[6] = true; /* ar_search_ms */
+
+#ifdef MEASURE_BINARY_SEARCH
+ {
+ ItemPointer itemptrs = NULL;
+
+ int64 ar_load_ms,
+ ar_search_ms;
+
+ /* measure the load time of the array */
+ itemptrs = MemoryContextAllocHuge(CurrentMemoryContext,
+ sizeof(ItemPointerData) * ntids);
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointerSetBlockNumber(&(itemptrs[i]),
+ ItemPointerGetBlockNumber(&(tids[i])));
+ ItemPointerSetOffsetNumber(&(itemptrs[i]),
+ ItemPointerGetOffsetNumber(&(tids[i])));
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ ar_load_ms = secs * 1000 + usecs / 1000;
+
+ /* next, measure the search time of the array */
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ volatile bool ret; /* prevent calling bsearch from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ ret = bsearch((void *) tid,
+ (void *) itemptrs,
+ ntids,
+ sizeof(ItemPointerData),
+ vac_cmp_itemptr);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ ar_search_ms = secs * 1000 + usecs / 1000;
+
+ /* set the result */
+ nulls[4] = false;
+ values[4] = Int64GetDatum(ar_load_ms);
+ nulls[6] = false;
+ values[6] = Int64GetDatum(ar_search_ms);
+ }
+#endif
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_seq_search(PG_FUNCTION_ARGS)
+{
+ return bench_search(fcinfo, false);
+}
+
+Datum
+bench_shuffle_search(PG_FUNCTION_ARGS)
+{
+ return bench_search(fcinfo, true);
+}
+
+Datum
+bench_load_random_int(PG_FUNCTION_ARGS)
+{
+ uint64 cnt = (uint64) PG_GETARG_INT64(0);
+ rt_radix_tree *rt;
+ pg_prng_state state;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 load_time_ms;
+ Datum values[2];
+ bool nulls[2];
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ pg_prng_seed(&state, 0);
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ uint64 key = pg_prng_uint64(&state);
+
+ rt_set(rt, key, key);
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ load_time_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int64GetDatum(rt_memory_usage(rt));
+ values[1] = Int64GetDatum(load_time_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+/* copy of splitmix64() */
+static uint64
+hash64(uint64 x)
+{
+ x ^= x >> 30;
+ x *= UINT64CONST(0xbf58476d1ce4e5b9);
+ x ^= x >> 27;
+ x *= UINT64CONST(0x94d049bb133111eb);
+ x ^= x >> 31;
+ return x;
+}
+
+/* attempts to have a relatively even population of node kinds */
+Datum
+bench_search_random_nodes(PG_FUNCTION_ARGS)
+{
+ uint64 cnt = (uint64) PG_GETARG_INT64(0);
+ rt_radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 search_time_ms;
+ Datum values[2] = {0};
+ bool nulls[2] = {0};
+ /* from trial and error */
+ uint64 filter = (((uint64) 0x7F<<32) | (0x07<<24) | (0xFF<<16) | 0xFF);
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ if (!PG_ARGISNULL(1))
+ {
+ char *filter_str = text_to_cstring(PG_GETARG_TEXT_P(1));
+
+ if (sscanf(filter_str, "0x%lX", &filter) == 0)
+ elog(ERROR, "invalid filter string %s", filter_str);
+ }
+ elog(NOTICE, "bench with filter 0x%lX", filter);
+
+ rt = rt_create(CurrentMemoryContext);
+
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ const uint64 hash = hash64(i);
+ const uint64 key = hash & filter;
+
+ rt_set(rt, key, key);
+ }
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ const uint64 hash = hash64(i);
+ const uint64 key = hash & filter;
+ uint64 val;
+ volatile bool ret; /* prevent calling rt_search from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ ret = rt_search(rt, key, &val);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ search_time_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ values[0] = Int64GetDatum(rt_memory_usage(rt));
+ values[1] = Int64GetDatum(search_time_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_fixed_height_search(PG_FUNCTION_ARGS)
+{
+ int fanout = PG_GETARG_INT32(0);
+ rt_radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_load_ms,
+ rt_search_ms;
+ Datum values[5];
+ bool nulls[5];
+
+ /* test boundary between vector and iteration */
+ const int n_keys = 5 * 16 * 16 * 16 * 16;
+ uint64 r,
+ h,
+ i,
+ j,
+ k;
+ int key_id;
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+
+ key_id = 0;
+
+ /*
+ * lower nodes have limited fanout, the top is only limited by
+ * bits-per-byte
+ */
+ for (r = 1;; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ for (i = 1; i <= fanout; i++)
+ {
+ for (j = 1; j <= fanout; j++)
+ {
+ for (k = 1; k <= fanout; k++)
+ {
+ uint64 key;
+
+ key = (r << 32) | (h << 24) | (i << 16) | (j << 8) | (k);
+
+ CHECK_FOR_INTERRUPTS();
+
+ key_id++;
+ if (key_id > n_keys)
+ goto finish_set;
+
+ rt_set(rt, key, key_id);
+ }
+ }
+ }
+ }
+ }
+finish_set:
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_load_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ /* measure the search time of the radix tree */
+ start_time = GetCurrentTimestamp();
+
+ key_id = 0;
+ for (r = 1;; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ for (i = 1; i <= fanout; i++)
+ {
+ for (j = 1; j <= fanout; j++)
+ {
+ for (k = 1; k <= fanout; k++)
+ {
+ uint64 key,
+ val;
+
+ key = (r << 32) | (h << 24) | (i << 16) | (j << 8) | (k);
+
+ CHECK_FOR_INTERRUPTS();
+
+ key_id++;
+ if (key_id > n_keys)
+ goto finish_search;
+
+ rt_search(rt, key, &val);
+ }
+ }
+ }
+ }
+ }
+finish_search:
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_search_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int32GetDatum(fanout);
+ values[1] = Int64GetDatum(rt_num_entries(rt));
+ values[2] = Int64GetDatum(rt_memory_usage(rt));
+ values[3] = Int64GetDatum(rt_load_ms);
+ values[4] = Int64GetDatum(rt_search_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_node128_load(PG_FUNCTION_ARGS)
+{
+ int fanout = PG_GETARG_INT32(0);
+ rt_radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_sparseload_ms;
+ Datum values[5];
+ bool nulls[5];
+
+ uint64 r,
+ h;
+ int key_id;
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ rt = rt_create(CurrentMemoryContext);
+
+ key_id = 0;
+
+ for (r = 1; r <= fanout; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (h);
+
+ key_id++;
+ rt_set(rt, key, key_id);
+ }
+ }
+
+ rt_stats(rt);
+
+ /* measure sparse deletion and re-loading */
+ start_time = GetCurrentTimestamp();
+
+ for (int t = 0; t<10000; t++)
+ {
+ /* delete one key in each leaf */
+ for (r = 1; r <= fanout; r++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (fanout);
+
+ rt_delete(rt, key);
+ }
+
+ /* add them all back */
+ key_id = 0;
+ for (r = 1; r <= fanout; r++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (fanout);
+
+ key_id++;
+ rt_set(rt, key, key_id);
+ }
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_sparseload_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int32GetDatum(fanout);
+ values[1] = Int64GetDatum(rt_num_entries(rt));
+ values[2] = Int64GetDatum(rt_memory_usage(rt));
+ values[3] = Int64GetDatum(rt_sparseload_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
diff --git a/contrib/bench_radix_tree/bench_radix_tree.control b/contrib/bench_radix_tree/bench_radix_tree.control
new file mode 100644
index 0000000000..1d988e6c9a
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.control
@@ -0,0 +1,6 @@
+# bench_radix_tree extension
+comment = 'benchmark suite for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/bench_radix_tree'
+relocatable = true
+trusted = true
diff --git a/contrib/bench_radix_tree/expected/bench.out b/contrib/bench_radix_tree/expected/bench.out
new file mode 100644
index 0000000000..60c303892e
--- /dev/null
+++ b/contrib/bench_radix_tree/expected/bench.out
@@ -0,0 +1,13 @@
+create extension bench_radix_tree;
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
diff --git a/contrib/bench_radix_tree/meson.build b/contrib/bench_radix_tree/meson.build
new file mode 100644
index 0000000000..332c1ae7df
--- /dev/null
+++ b/contrib/bench_radix_tree/meson.build
@@ -0,0 +1,33 @@
+bench_radix_tree_sources = files(
+ 'bench_radix_tree.c',
+)
+
+if host_system == 'windows'
+ bench_radix_tree_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'bench_radix_tree',
+ '--FILEDESC', 'bench_radix_tree - performance test code for radix tree',])
+endif
+
+bench_radix_tree = shared_module('bench_radix_tree',
+ bench_radix_tree_sources,
+ link_with: pgport_srv,
+ kwargs: pg_mod_args,
+)
+testprep_targets += bench_radix_tree
+
+install_data(
+ 'bench_radix_tree.control',
+ 'bench_radix_tree--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'bench_radix_tree',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'bench_radix_tree',
+ ],
+ },
+}
diff --git a/contrib/bench_radix_tree/sql/bench.sql b/contrib/bench_radix_tree/sql/bench.sql
new file mode 100644
index 0000000000..a46018c9d4
--- /dev/null
+++ b/contrib/bench_radix_tree/sql/bench.sql
@@ -0,0 +1,16 @@
+create extension bench_radix_tree;
+
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
diff --git a/contrib/meson.build b/contrib/meson.build
index bd4a57c43c..52253de793 100644
--- a/contrib/meson.build
+++ b/contrib/meson.build
@@ -12,6 +12,7 @@ subdir('amcheck')
subdir('auth_delay')
subdir('auto_explain')
subdir('basic_archive')
+#subdir('bench_radix_tree')
subdir('bloom')
subdir('basebackup_to_shell')
subdir('bool_plperl')
--
2.31.1
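(Not part of the patch: a worked example of the key encoding used by tid_to_key_off() in bench_radix_tree.c, assuming 8kB pages so that MaxHeapTuplesPerPage is 291 and pg_ceil_log2_32() yields a shift of 9.)

    /*
     * tid   = (block 10, offset 5)
     * tid_i = 5 | (10 << 9)  = 5125
     * off   = 5125 & 0x3F    = 5    -> bit position within the value
     * key   = 5125 >> 6      = 80   -> radix tree key
     *
     * So each radix tree entry holds a 64-bit bitmap covering 64 consecutive
     * (block, offset) slots, and the benchmark loads this TID as:
     */
    rt_set(rt, UINT64CONST(80), UINT64CONST(1) << 5);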
v23-0004-Free-all-radix-tree-nodes-recursively.patch
From 9df198bc8781a4d619e4d8c4e584305ef560be48 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 20 Jan 2023 12:38:54 +0700
Subject: [PATCH v23 04/18] Free all radix tree nodes recursively
TODO: Consider adding more general functionality to DSA
to free all segments.
---
src/include/lib/radixtree.h | 78 +++++++++++++++++++++++++++++++++++++
1 file changed, 78 insertions(+)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index bc0c0b5853..7fcd212ea4 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -139,6 +139,7 @@
#define RT_ALLOC_NODE RT_MAKE_NAME(alloc_node)
#define RT_INIT_NODE RT_MAKE_NAME(init_node)
#define RT_FREE_NODE RT_MAKE_NAME(free_node)
+#define RT_FREE_RECURSE RT_MAKE_NAME(free_recurse)
#define RT_EXTEND RT_MAKE_NAME(extend)
#define RT_SET_EXTEND RT_MAKE_NAME(set_extend)
#define RT_SWITCH_NODE_KIND RT_MAKE_NAME(grow_node_kind)
@@ -1458,6 +1459,78 @@ RT_GET_HANDLE(RT_RADIX_TREE *tree)
Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
return tree->ctl->handle;
}
+
+/*
+ * Recursively free all nodes allocated to the DSA area.
+ */
+static inline void
+RT_FREE_RECURSE(RT_RADIX_TREE *tree, RT_PTR_ALLOC ptr)
+{
+ RT_PTR_LOCAL node = RT_PTR_GET_LOCAL(tree, ptr);
+
+ check_stack_depth();
+ CHECK_FOR_INTERRUPTS();
+
+ /* The leaf node doesn't have child pointers */
+ if (RT_NODE_IS_LEAF(node))
+ {
+ dsa_free(tree->dsa, ptr);
+ return;
+ }
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE_INNER_3 *n3 = (RT_NODE_INNER_3 *) node;
+
+ for (int i = 0; i < n3->base.n.count; i++)
+ RT_FREE_RECURSE(tree, n3->children[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE_INNER_32 *n32 = (RT_NODE_INNER_32 *) node;
+
+ for (int i = 0; i < n32->base.n.count; i++)
+ RT_FREE_RECURSE(tree, n32->children[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE_INNER_125 *n125 = (RT_NODE_INNER_125 *) node;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(&n125->base, i))
+ continue;
+
+ RT_FREE_RECURSE(tree, RT_NODE_INNER_125_GET_CHILD(n125, i));
+ }
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE_INNER_256 *n256 = (RT_NODE_INNER_256 *) node;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
+ continue;
+
+ RT_FREE_RECURSE(tree, RT_NODE_INNER_256_GET_CHILD(n256, i));
+ }
+
+ break;
+ }
+ }
+
+ /* Free the inner node */
+ dsa_free(tree->dsa, ptr);
+}
#endif
/*
@@ -1469,6 +1542,10 @@ RT_FREE(RT_RADIX_TREE *tree)
#ifdef RT_SHMEM
Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+ /* Free all memory used for radix tree nodes */
+ if (RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ RT_FREE_RECURSE(tree, tree->ctl->root);
+
/*
* Vandalize the control block to help catch programming error where
* other backends access the memory formerly occupied by this radix tree.
@@ -2268,6 +2345,7 @@ RT_DUMP(RT_RADIX_TREE *tree)
#undef RT_ALLOC_NODE
#undef RT_INIT_NODE
#undef RT_FREE_NODE
+#undef RT_FREE_RECURSE
#undef RT_EXTEND
#undef RT_SET_EXTEND
#undef RT_SWITCH_NODE_KIND
--
2.31.1
v23-0002-Move-some-bitmap-logic-out-of-bitmapset.c.patch
From beaaee64bc91286d05b9e3c47e9f42eeb2ff5f19 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Tue, 6 Dec 2022 13:39:41 +0700
Subject: [PATCH v23 02/18] Move some bitmap logic out of bitmapset.c
Add pg_rightmost_one32/64 functions. This functionality was previously
private to bitmapset.c as the RIGHTMOST_ONE macro. It has practical
use in other contexts, so move to pg_bitutils.h.
Also move the logic for selecting appropriate pg_bitutils functions
based on word size to bitmapset.h for wider visibility and add
appropriate selection for bmw_rightmost_one().
Since the previous macro relied on casting to signedbitmapword,
and the new functions do not, remove that typedef.
Design input and review by Tom Lane
Discussion: https://www.postgresql.org/message-id/CAFBsxsFW2JjTo58jtDB%2B3sZhxMx3t-3evew8%3DAcr%2BGGhC%2BkFaA%40mail.gmail.com
---
src/backend/nodes/bitmapset.c | 36 ++------------------------------
src/include/nodes/bitmapset.h | 16 ++++++++++++--
src/include/port/pg_bitutils.h | 31 +++++++++++++++++++++++++++
src/tools/pgindent/typedefs.list | 1 -
4 files changed, 47 insertions(+), 37 deletions(-)
diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index 98660524ad..fcd8e2ccbc 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -32,39 +32,7 @@
#define BITMAPSET_SIZE(nwords) \
(offsetof(Bitmapset, words) + (nwords) * sizeof(bitmapword))
-/*----------
- * This is a well-known cute trick for isolating the rightmost one-bit
- * in a word. It assumes two's complement arithmetic. Consider any
- * nonzero value, and focus attention on the rightmost one. The value is
- * then something like
- * xxxxxx10000
- * where x's are unspecified bits. The two's complement negative is formed
- * by inverting all the bits and adding one. Inversion gives
- * yyyyyy01111
- * where each y is the inverse of the corresponding x. Incrementing gives
- * yyyyyy10000
- * and then ANDing with the original value gives
- * 00000010000
- * This works for all cases except original value = zero, where of course
- * we get zero.
- *----------
- */
-#define RIGHTMOST_ONE(x) ((signedbitmapword) (x) & -((signedbitmapword) (x)))
-
-#define HAS_MULTIPLE_ONES(x) ((bitmapword) RIGHTMOST_ONE(x) != (x))
-
-/* Select appropriate bit-twiddling functions for bitmap word size */
-#if BITS_PER_BITMAPWORD == 32
-#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos32(w)
-#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos32(w)
-#define bmw_popcount(w) pg_popcount32(w)
-#elif BITS_PER_BITMAPWORD == 64
-#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos64(w)
-#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos64(w)
-#define bmw_popcount(w) pg_popcount64(w)
-#else
-#error "invalid BITS_PER_BITMAPWORD"
-#endif
+#define HAS_MULTIPLE_ONES(x) (bmw_rightmost_one(x) != (x))
/*
@@ -1013,7 +981,7 @@ bms_first_member(Bitmapset *a)
{
int result;
- w = RIGHTMOST_ONE(w);
+ w = bmw_rightmost_one(w);
a->words[wordnum] &= ~w;
result = wordnum * BITS_PER_BITMAPWORD;
diff --git a/src/include/nodes/bitmapset.h b/src/include/nodes/bitmapset.h
index 0dca6bc5fa..80e91fac0f 100644
--- a/src/include/nodes/bitmapset.h
+++ b/src/include/nodes/bitmapset.h
@@ -38,13 +38,11 @@ struct List;
#define BITS_PER_BITMAPWORD 64
typedef uint64 bitmapword; /* must be an unsigned type */
-typedef int64 signedbitmapword; /* must be the matching signed type */
#else
#define BITS_PER_BITMAPWORD 32
typedef uint32 bitmapword; /* must be an unsigned type */
-typedef int32 signedbitmapword; /* must be the matching signed type */
#endif
@@ -75,6 +73,20 @@ typedef enum
BMS_MULTIPLE /* >1 member */
} BMS_Membership;
+/* Select appropriate bit-twiddling functions for bitmap word size */
+#if BITS_PER_BITMAPWORD == 32
+#define bmw_rightmost_one(w) pg_rightmost_one32(w)
+#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos32(w)
+#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos32(w)
+#define bmw_popcount(w) pg_popcount32(w)
+#elif BITS_PER_BITMAPWORD == 64
+#define bmw_rightmost_one(w) pg_rightmost_one64(w)
+#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos64(w)
+#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos64(w)
+#define bmw_popcount(w) pg_popcount64(w)
+#else
+#error "invalid BITS_PER_BITMAPWORD"
+#endif
/*
* function prototypes in nodes/bitmapset.c
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index a49a9c03d9..7235ad25ee 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -17,6 +17,37 @@ extern PGDLLIMPORT const uint8 pg_leftmost_one_pos[256];
extern PGDLLIMPORT const uint8 pg_rightmost_one_pos[256];
extern PGDLLIMPORT const uint8 pg_number_of_ones[256];
+/*----------
+ * This is a well-known cute trick for isolating the rightmost one-bit
+ * in a word. It assumes two's complement arithmetic. Consider any
+ * nonzero value, and focus attention on the rightmost one. The value is
+ * then something like
+ * xxxxxx10000
+ * where x's are unspecified bits. The two's complement negative is formed
+ * by inverting all the bits and adding one. Inversion gives
+ * yyyyyy01111
+ * where each y is the inverse of the corresponding x. Incrementing gives
+ * yyyyyy10000
+ * and then ANDing with the original value gives
+ * 00000010000
+ * This works for all cases except original value = zero, where of course
+ * we get zero.
+ *----------
+ */
+static inline uint32
+pg_rightmost_one32(uint32 word)
+{
+ int32 result = (int32) word & -((int32) word);
+ return (uint32) result;
+}
+
+static inline uint64
+pg_rightmost_one64(uint64 word)
+{
+ int64 result = (int64) word & -((int64) word);
+ return (uint64) result;
+}
+
/*
* pg_leftmost_one_pos32
* Returns the position of the most significant set bit in "word",
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 51484ca7e2..077f197a64 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3662,7 +3662,6 @@ shmem_request_hook_type
shmem_startup_hook_type
sig_atomic_t
sigjmp_buf
-signedbitmapword
sigset_t
size_t
slist_head
--
2.31.1
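(Not part of the patch: a quick numeric check of the relocated rightmost-one helpers.)

    uint32      w = 0x34;                       /* binary 0110100 */
    uint32      lsb32 = pg_rightmost_one32(w);  /* 0x04, the lowest set bit */
    uint64      lsb64 = pg_rightmost_one64(UINT64CONST(0xF000));   /* 0x1000 */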
v23-0003-Add-radixtree-template.patch
From 0dfc3627858a18821ac12e9a0f84c922194f3ac7 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 14 Sep 2022 12:38:51 +0000
Subject: [PATCH v23 03/18] Add radixtree template
The only thing configurable in this commit is function scope,
prefix, and local/shared memory.
The key and value type are still hard-coded to uint64.
(A later commit in v21 will make value type configurable)
It might be good at some point to offer a different tree type,
e.g. "single-value leaves" to allow for variable length keys
and values, giving full flexibility to developers.
TODO: Much broader commit message
---
src/backend/utils/mmgr/dsa.c | 12 +
src/include/lib/radixtree.h | 2314 +++++++++++++++++
src/include/lib/radixtree_delete_impl.h | 106 +
src/include/lib/radixtree_insert_impl.h | 317 +++
src/include/lib/radixtree_iter_impl.h | 138 +
src/include/lib/radixtree_search_impl.h | 131 +
src/include/utils/dsa.h | 1 +
src/test/modules/Makefile | 1 +
src/test/modules/meson.build | 1 +
src/test/modules/test_radixtree/.gitignore | 4 +
src/test/modules/test_radixtree/Makefile | 23 +
src/test/modules/test_radixtree/README | 7 +
.../expected/test_radixtree.out | 36 +
src/test/modules/test_radixtree/meson.build | 35 +
.../test_radixtree/sql/test_radixtree.sql | 7 +
.../test_radixtree/test_radixtree--1.0.sql | 8 +
.../modules/test_radixtree/test_radixtree.c | 660 +++++
.../test_radixtree/test_radixtree.control | 4 +
src/tools/pginclude/cpluspluscheck | 6 +
src/tools/pginclude/headerscheck | 6 +
20 files changed, 3817 insertions(+)
create mode 100644 src/include/lib/radixtree.h
create mode 100644 src/include/lib/radixtree_delete_impl.h
create mode 100644 src/include/lib/radixtree_insert_impl.h
create mode 100644 src/include/lib/radixtree_iter_impl.h
create mode 100644 src/include/lib/radixtree_search_impl.h
create mode 100644 src/test/modules/test_radixtree/.gitignore
create mode 100644 src/test/modules/test_radixtree/Makefile
create mode 100644 src/test/modules/test_radixtree/README
create mode 100644 src/test/modules/test_radixtree/expected/test_radixtree.out
create mode 100644 src/test/modules/test_radixtree/meson.build
create mode 100644 src/test/modules/test_radixtree/sql/test_radixtree.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree--1.0.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree.c
create mode 100644 src/test/modules/test_radixtree/test_radixtree.control
diff --git a/src/backend/utils/mmgr/dsa.c b/src/backend/utils/mmgr/dsa.c
index 604b702a91..50f0aae3ab 100644
--- a/src/backend/utils/mmgr/dsa.c
+++ b/src/backend/utils/mmgr/dsa.c
@@ -1024,6 +1024,18 @@ dsa_set_size_limit(dsa_area *area, size_t limit)
LWLockRelease(DSA_AREA_LOCK(area));
}
+size_t
+dsa_get_total_size(dsa_area *area)
+{
+ size_t size;
+
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_SHARED);
+ size = area->control->total_segment_size;
+ LWLockRelease(DSA_AREA_LOCK(area));
+
+ return size;
+}
+
/*
* Aggressively free all spare memory in the hope of returning DSM segments to
* the operating system.
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
new file mode 100644
index 0000000000..bc0c0b5853
--- /dev/null
+++ b/src/include/lib/radixtree.h
@@ -0,0 +1,2314 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.h
+ * Implementation for adaptive radix tree.
+ *
+ * This module employs the idea from the paper "The Adaptive Radix Tree: ARTful
+ * Indexing for Main-Memory Databases" by Viktor Leis, Alfons Kemper, and Thomas
+ * Neumann, 2013. The radix tree uses adaptive node sizes, a small number of node
+ * types, each with a different number of elements. Depending on the number of
+ * children, the appropriate node type is used.
+ *
+ * WIP: notes about traditional radix tree trading off span vs height...
+ *
+ * There are two kinds of nodes, inner nodes and leaves. Inner nodes
+ * map partial keys to child pointers.
+ *
+ * The ART paper mentions three ways to implement leaves:
+ *
+ * "- Single-value leaves: The values are stored using an addi-
+ * tional leaf node type which stores one value.
+ * - Multi-value leaves: The values are stored in one of four
+ * different leaf node types, which mirror the structure of
+ * inner nodes, but contain values instead of pointers.
+ * - Combined pointer/value slots: If values fit into point-
+ * ers, no separate node types are necessary. Instead, each
+ * pointer storage location in an inner node can either
+ * store a pointer or a value."
+ *
+ * We chose "multi-value leaves" to avoid the additional pointer traversal
+ * required by "single-value leaves"
+ *
+ * For simplicity, the key is assumed to be 64-bit unsigned integer. The
+ * tree doesn't need to contain paths where the highest bytes of all keys
+ * are zero. That way, the tree's height adapts to the distribution of keys.
+ *
+ * TODO: In the future it might be worthwhile to offer configurability of
+ * leaf implementation for different use cases. Single-value leaves would
+ * give more flexibility in key type, including variable-length keys.
+ *
+ * There are some optimizations not yet implemented, particularly path
+ * compression and lazy path expansion.
+ *
+ * WIP: the radix tree nodes don't shrink.
+ *
+ * To generate a radix tree and associated functions for a use case several
+ * macros have to be #define'ed before this file is included. Including
+ * the file #undef's all those, so a new radix tree can be generated
+ * afterwards.
+ * The relevant parameters are:
+ * - RT_PREFIX - prefix for all symbol names generated. A prefix of 'foo'
+ * will result in radix tree type 'foo_radix_tree' and functions like
+ * 'foo_create'/'foo_free' and so forth.
+ * - RT_DECLARE - if defined function prototypes and type declarations are
+ * generated
+ * - RT_DEFINE - if defined function definitions are generated
+ * - RT_SCOPE - in which scope (e.g. extern, static inline) do function
+ * declarations reside
+ * - RT_VALUE_TYPE - the type of the value.
+ *
+ * Optional parameters:
+ * - RT_SHMEM - if defined, the radix tree is created in the DSA area
+ * so that multiple processes can access it simultaneously.
+ * - RT_DEBUG - if defined add stats tracking and debugging functions
+ *
+ * Interface
+ * ---------
+ *
+ * RT_CREATE - Create a new, empty radix tree
+ * RT_FREE - Free the radix tree
+ * RT_SEARCH - Search a key-value pair
+ * RT_SET - Set a key-value pair
+ * RT_BEGIN_ITERATE - Begin iterating through all key-value pairs
+ * RT_ITERATE_NEXT - Return next key-value pair, if any
+ * RT_END_ITERATE - End iteration
+ * RT_MEMORY_USAGE - Get the memory usage
+ *
+ * Interface for Shared Memory
+ * ---------
+ *
+ * RT_ATTACH - Attach to the radix tree
+ * RT_DETACH - Detach from the radix tree
+ * RT_GET_HANDLE - Return the handle of the radix tree
+ *
+ * Optional Interface
+ * ---------
+ *
+ * RT_DELETE - Delete a key-value pair. Declared/defined if RT_USE_DELETE is defined
+ *
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/lib/radixtree.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "lib/stringinfo.h"
+#include "miscadmin.h"
+#include "nodes/bitmapset.h"
+#include "port/pg_bitutils.h"
+#include "port/simd.h"
+#include "utils/dsa.h"
+#include "utils/memutils.h"
+
+/* helpers */
+#define RT_MAKE_PREFIX(a) CppConcat(a,_)
+#define RT_MAKE_NAME(name) RT_MAKE_NAME_(RT_MAKE_PREFIX(RT_PREFIX),name)
+#define RT_MAKE_NAME_(a,b) CppConcat(a,b)
+
+/* function declarations */
+#define RT_CREATE RT_MAKE_NAME(create)
+#define RT_FREE RT_MAKE_NAME(free)
+#define RT_SEARCH RT_MAKE_NAME(search)
+#ifdef RT_SHMEM
+#define RT_ATTACH RT_MAKE_NAME(attach)
+#define RT_DETACH RT_MAKE_NAME(detach)
+#define RT_GET_HANDLE RT_MAKE_NAME(get_handle)
+#endif
+#define RT_SET RT_MAKE_NAME(set)
+#define RT_BEGIN_ITERATE RT_MAKE_NAME(begin_iterate)
+#define RT_ITERATE_NEXT RT_MAKE_NAME(iterate_next)
+#define RT_END_ITERATE RT_MAKE_NAME(end_iterate)
+#ifdef RT_USE_DELETE
+#define RT_DELETE RT_MAKE_NAME(delete)
+#endif
+#define RT_MEMORY_USAGE RT_MAKE_NAME(memory_usage)
+#ifdef RT_DEBUG
+#define RT_DUMP RT_MAKE_NAME(dump)
+#define RT_DUMP_NODE RT_MAKE_NAME(dump_node)
+#define RT_DUMP_SEARCH RT_MAKE_NAME(dump_search)
+#define RT_STATS RT_MAKE_NAME(stats)
+#endif
+
+/* internal helper functions (no externally visible prototypes) */
+#define RT_NEW_ROOT RT_MAKE_NAME(new_root)
+#define RT_ALLOC_NODE RT_MAKE_NAME(alloc_node)
+#define RT_INIT_NODE RT_MAKE_NAME(init_node)
+#define RT_FREE_NODE RT_MAKE_NAME(free_node)
+#define RT_EXTEND RT_MAKE_NAME(extend)
+#define RT_SET_EXTEND RT_MAKE_NAME(set_extend)
+#define RT_SWITCH_NODE_KIND RT_MAKE_NAME(grow_node_kind)
+#define RT_COPY_NODE RT_MAKE_NAME(copy_node)
+#define RT_REPLACE_NODE RT_MAKE_NAME(replace_node)
+#define RT_PTR_GET_LOCAL RT_MAKE_NAME(ptr_get_local)
+#define RT_PTR_ALLOC_IS_VALID RT_MAKE_NAME(ptr_stored_is_valid)
+#define RT_NODE_3_SEARCH_EQ RT_MAKE_NAME(node_3_search_eq)
+#define RT_NODE_32_SEARCH_EQ RT_MAKE_NAME(node_32_search_eq)
+#define RT_NODE_3_GET_INSERTPOS RT_MAKE_NAME(node_3_get_insertpos)
+#define RT_NODE_32_GET_INSERTPOS RT_MAKE_NAME(node_32_get_insertpos)
+#define RT_CHUNK_CHILDREN_ARRAY_SHIFT RT_MAKE_NAME(chunk_children_array_shift)
+#define RT_CHUNK_VALUES_ARRAY_SHIFT RT_MAKE_NAME(chunk_values_array_shift)
+#define RT_CHUNK_CHILDREN_ARRAY_DELETE RT_MAKE_NAME(chunk_children_array_delete)
+#define RT_CHUNK_VALUES_ARRAY_DELETE RT_MAKE_NAME(chunk_values_array_delete)
+#define RT_CHUNK_CHILDREN_ARRAY_COPY RT_MAKE_NAME(chunk_children_array_copy)
+#define RT_CHUNK_VALUES_ARRAY_COPY RT_MAKE_NAME(chunk_values_array_copy)
+#define RT_NODE_125_IS_CHUNK_USED RT_MAKE_NAME(node_125_is_chunk_used)
+#define RT_NODE_INNER_125_GET_CHILD RT_MAKE_NAME(node_inner_125_get_child)
+#define RT_NODE_LEAF_125_GET_VALUE RT_MAKE_NAME(node_leaf_125_get_value)
+#define RT_NODE_INNER_256_IS_CHUNK_USED RT_MAKE_NAME(node_inner_256_is_chunk_used)
+#define RT_NODE_LEAF_256_IS_CHUNK_USED RT_MAKE_NAME(node_leaf_256_is_chunk_used)
+#define RT_NODE_INNER_256_GET_CHILD RT_MAKE_NAME(node_inner_256_get_child)
+#define RT_NODE_LEAF_256_GET_VALUE RT_MAKE_NAME(node_leaf_256_get_value)
+#define RT_NODE_INNER_256_SET RT_MAKE_NAME(node_inner_256_set)
+#define RT_NODE_LEAF_256_SET RT_MAKE_NAME(node_leaf_256_set)
+#define RT_NODE_INNER_256_DELETE RT_MAKE_NAME(node_inner_256_delete)
+#define RT_NODE_LEAF_256_DELETE RT_MAKE_NAME(node_leaf_256_delete)
+#define RT_KEY_GET_SHIFT RT_MAKE_NAME(key_get_shift)
+#define RT_SHIFT_GET_MAX_VAL RT_MAKE_NAME(shift_get_max_val)
+#define RT_NODE_SEARCH_INNER RT_MAKE_NAME(node_search_inner)
+#define RT_NODE_SEARCH_LEAF RT_MAKE_NAME(node_search_leaf)
+#define RT_NODE_UPDATE_INNER RT_MAKE_NAME(node_update_inner)
+#define RT_NODE_DELETE_INNER RT_MAKE_NAME(node_delete_inner)
+#define RT_NODE_DELETE_LEAF RT_MAKE_NAME(node_delete_leaf)
+#define RT_NODE_INSERT_INNER RT_MAKE_NAME(node_insert_inner)
+#define RT_NODE_INSERT_LEAF RT_MAKE_NAME(node_insert_leaf)
+#define RT_NODE_INNER_ITERATE_NEXT RT_MAKE_NAME(node_inner_iterate_next)
+#define RT_NODE_LEAF_ITERATE_NEXT RT_MAKE_NAME(node_leaf_iterate_next)
+#define RT_UPDATE_ITER_STACK RT_MAKE_NAME(update_iter_stack)
+#define RT_ITER_UPDATE_KEY RT_MAKE_NAME(iter_update_key)
+#define RT_VERIFY_NODE RT_MAKE_NAME(verify_node)
+
+/* type declarations */
+#define RT_RADIX_TREE RT_MAKE_NAME(radix_tree)
+#define RT_RADIX_TREE_CONTROL RT_MAKE_NAME(radix_tree_control)
+#define RT_ITER RT_MAKE_NAME(iter)
+#ifdef RT_SHMEM
+#define RT_HANDLE RT_MAKE_NAME(handle)
+#endif
+#define RT_NODE RT_MAKE_NAME(node)
+#define RT_NODE_ITER RT_MAKE_NAME(node_iter)
+#define RT_NODE_BASE_3 RT_MAKE_NAME(node_base_3)
+#define RT_NODE_BASE_32 RT_MAKE_NAME(node_base_32)
+#define RT_NODE_BASE_125 RT_MAKE_NAME(node_base_125)
+#define RT_NODE_BASE_256 RT_MAKE_NAME(node_base_256)
+#define RT_NODE_INNER_3 RT_MAKE_NAME(node_inner_3)
+#define RT_NODE_INNER_32 RT_MAKE_NAME(node_inner_32)
+#define RT_NODE_INNER_125 RT_MAKE_NAME(node_inner_125)
+#define RT_NODE_INNER_256 RT_MAKE_NAME(node_inner_256)
+#define RT_NODE_LEAF_3 RT_MAKE_NAME(node_leaf_3)
+#define RT_NODE_LEAF_32 RT_MAKE_NAME(node_leaf_32)
+#define RT_NODE_LEAF_125 RT_MAKE_NAME(node_leaf_125)
+#define RT_NODE_LEAF_256 RT_MAKE_NAME(node_leaf_256)
+#define RT_SIZE_CLASS RT_MAKE_NAME(size_class)
+#define RT_SIZE_CLASS_ELEM RT_MAKE_NAME(size_class_elem)
+#define RT_SIZE_CLASS_INFO RT_MAKE_NAME(size_class_info)
+#define RT_CLASS_3 RT_MAKE_NAME(class_3)
+#define RT_CLASS_32_MIN RT_MAKE_NAME(class_32_min)
+#define RT_CLASS_32_MAX RT_MAKE_NAME(class_32_max)
+#define RT_CLASS_125 RT_MAKE_NAME(class_125)
+#define RT_CLASS_256 RT_MAKE_NAME(class_256)
+
+/* generate forward declarations necessary to use the radix tree */
+#ifdef RT_DECLARE
+
+typedef struct RT_RADIX_TREE RT_RADIX_TREE;
+typedef struct RT_ITER RT_ITER;
+
+#ifdef RT_SHMEM
+typedef dsa_pointer RT_HANDLE;
+#endif
+
+#ifdef RT_SHMEM
+RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx, dsa_area *dsa);
+RT_SCOPE RT_RADIX_TREE * RT_ATTACH(dsa_area *dsa, dsa_pointer dp);
+RT_SCOPE void RT_DETACH(RT_RADIX_TREE *tree);
+RT_SCOPE RT_HANDLE RT_GET_HANDLE(RT_RADIX_TREE *tree);
+#else
+RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx);
+#endif
+RT_SCOPE void RT_FREE(RT_RADIX_TREE *tree);
+
+RT_SCOPE bool RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *val_p);
+RT_SCOPE bool RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE val);
+#ifdef RT_USE_DELETE
+RT_SCOPE bool RT_DELETE(RT_RADIX_TREE *tree, uint64 key);
+#endif
+
+RT_SCOPE RT_ITER * RT_BEGIN_ITERATE(RT_RADIX_TREE *tree);
+RT_SCOPE bool RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, RT_VALUE_TYPE *value_p);
+RT_SCOPE void RT_END_ITERATE(RT_ITER *iter);
+
+RT_SCOPE uint64 RT_MEMORY_USAGE(RT_RADIX_TREE *tree);
+
+#ifdef RT_DEBUG
+RT_SCOPE void RT_DUMP(RT_RADIX_TREE *tree);
+RT_SCOPE void RT_DUMP_SEARCH(RT_RADIX_TREE *tree, uint64 key);
+RT_SCOPE void RT_STATS(RT_RADIX_TREE *tree);
+#endif
+
+#endif /* RT_DECLARE */
+
+
+/* generate implementation of the radix tree */
+#ifdef RT_DEFINE
+
+/* The number of bits encoded in one tree level */
+#define RT_NODE_SPAN BITS_PER_BYTE
+
+/* The number of maximum slots in the node */
+#define RT_NODE_MAX_SLOTS (1 << RT_NODE_SPAN)
+
+/* Mask for extracting a chunk from the key */
+#define RT_CHUNK_MASK ((1 << RT_NODE_SPAN) - 1)
+
+/* Maximum shift the radix tree uses */
+#define RT_MAX_SHIFT RT_KEY_GET_SHIFT(UINT64_MAX)
+
+/* Tree level the radix tree uses */
+#define RT_MAX_LEVEL ((sizeof(uint64) * BITS_PER_BYTE) / RT_NODE_SPAN)
+
+/*
+ * Number of bits necessary for isset array in the slot-index node.
+ * Since bitmapword can be 64 bits, the only values that make sense
+ * here are 64 and 128.
+ */
+#define RT_SLOT_IDX_LIMIT (RT_NODE_MAX_SLOTS / 2)
+
+/* Invalid index used in node-125 */
+#define RT_INVALID_SLOT_IDX 0xFF
+
+/* Get a chunk from the key */
+#define RT_GET_KEY_CHUNK(key, shift) ((uint8) (((key) >> (shift)) & RT_CHUNK_MASK))
+
+/* For accessing bitmaps */
+#define BM_IDX(x) ((x) / BITS_PER_BITMAPWORD)
+#define BM_BIT(x) ((x) % BITS_PER_BITMAPWORD)
+
+/*
+ * Node kinds
+ *
+ * The different node kinds are what make the tree "adaptive".
+ *
+ * Each node kind is associated with a different datatype and different
+ * search/set/delete/iterate algorithms adapted for its size. The largest
+ * kind, node256 is basically the same as a traditional radix tree,
+ * and would be most wasteful of memory when sparsely populated. The
+ * smaller nodes expend some additional CPU time to enable a smaller
+ * memory footprint.
+ *
+ * XXX There are 4 node kinds, and this should never be increased,
+ * for several reasons:
+ * 1. With 5 or more kinds, gcc tends to use a jump table for switch
+ * statments.
+ * 2. The 4 kinds can be represented with 2 bits, so we have the option
+ * in the future to tag the node pointer with the kind, even on
+ * platforms with 32-bit pointers. This might speed up node traversal
+ * in trees with highly random node kinds.
+ * 3. We can have multiple size classes per node kind.
+ */
+#define RT_NODE_KIND_3 0x00
+#define RT_NODE_KIND_32 0x01
+#define RT_NODE_KIND_125 0x02
+#define RT_NODE_KIND_256 0x03
+#define RT_NODE_KIND_COUNT 4
+
+/*
+ * Calculate the slab blocksize so that we can allocate at least 32 chunks
+ * from the block.
+ */
+#define RT_SLAB_BLOCK_SIZE(size) \
+ Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * (size), (size) * 32)
+
+/* Common type for all nodes types */
+typedef struct RT_NODE
+{
+ /*
+ * Number of children. We use uint16 to be able to indicate 256 children
+ * at the fanout of 8.
+ */
+ uint16 count;
+
+ /*
+ * Max capacity for the current size class. Storing this in the
+ * node enables multiple size classes per node kind.
+ * Technically, kinds with a single size class don't need this, so we could
+ * keep this in the individual base types, but the code is simpler this way.
+ * Note: node256 is unique in that it cannot possibly have more than a
+ * single size class, so for that kind we store zero, and uint8 is
+ * sufficient for other kinds.
+ */
+ uint8 fanout;
+
+ /*
+ * Shift indicates which part of the key space is represented by this
+ * node. That is, the key is shifted by 'shift' and the lowest
+ * RT_NODE_SPAN bits are then represented in chunk.
+ */
+ uint8 shift;
+
+ /* Node kind, one per search/set algorithm */
+ uint8 kind;
+} RT_NODE;
+
+
+#define RT_PTR_LOCAL RT_NODE *
+
+#ifdef RT_SHMEM
+#define RT_PTR_ALLOC dsa_pointer
+#else
+#define RT_PTR_ALLOC RT_PTR_LOCAL
+#endif
+
+
+#ifdef RT_SHMEM
+#define RT_INVALID_PTR_ALLOC InvalidDsaPointer
+#else
+#define RT_INVALID_PTR_ALLOC NULL
+#endif
+
+/*
+ * Inner nodes and leaf nodes have analogous structure. To distinguish
+ * them at runtime, we take advantage of the fact that the key chunk
+ * is accessed by shifting: inner tree nodes (shift > 0) store the
+ * pointer to a child node in the slot. In leaf nodes (shift == 0),
+ * the slot contains the value corresponding to the key.
+ */
+#define RT_NODE_IS_LEAF(n) (((RT_PTR_LOCAL) (n))->shift == 0)
+
+#define RT_NODE_MUST_GROW(node) \
+ ((node)->base.n.count == (node)->base.n.fanout)
+
+/*
+ * Base type of each node kind for leaf and inner nodes.
+ * The base types must be able to accommodate the largest size
+ * class for variable-sized node kinds.
+ */
+typedef struct RT_NODE_BASE_3
+{
+ RT_NODE n;
+
+ /* 3 children, for key chunks */
+ uint8 chunks[3];
+} RT_NODE_BASE_3;
+
+typedef struct RT_NODE_BASE_32
+{
+ RT_NODE n;
+
+ /* 32 children, for key chunks */
+ uint8 chunks[32];
+} RT_NODE_BASE_32;
+
+/*
+ * node-125 uses a slot_idxs array, an array of RT_NODE_MAX_SLOTS length,
+ * to store indexes into a second array that contains the values (or
+ * child pointers).
+ */
+typedef struct RT_NODE_BASE_125
+{
+ RT_NODE n;
+
+ /* The index of slots for each fanout */
+ uint8 slot_idxs[RT_NODE_MAX_SLOTS];
+
+ /* isset is a bitmap to track which slot is in use */
+ bitmapword isset[BM_IDX(RT_SLOT_IDX_LIMIT)];
+} RT_NODE_BASE_125;
+
+typedef struct RT_NODE_BASE_256
+{
+ RT_NODE n;
+} RT_NODE_BASE_256;
+
+/*
+ * Inner and leaf nodes.
+ *
+ * These are separate because the value type might be different than
+ * something fitting into a pointer-width type.
+ */
+typedef struct RT_NODE_INNER_3
+{
+ RT_NODE_BASE_3 base;
+
+ /* number of children depends on size class */
+ RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_INNER_3;
+
+typedef struct RT_NODE_LEAF_3
+{
+ RT_NODE_BASE_3 base;
+
+ /* number of values depends on size class */
+ RT_VALUE_TYPE values[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_LEAF_3;
+
+typedef struct RT_NODE_INNER_32
+{
+ RT_NODE_BASE_32 base;
+
+ /* number of children depends on size class */
+ RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_INNER_32;
+
+typedef struct RT_NODE_LEAF_32
+{
+ RT_NODE_BASE_32 base;
+
+ /* number of values depends on size class */
+ RT_VALUE_TYPE values[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_LEAF_32;
+
+typedef struct RT_NODE_INNER_125
+{
+ RT_NODE_BASE_125 base;
+
+ /* number of children depends on size class */
+ RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_INNER_125;
+
+typedef struct RT_NODE_LEAF_125
+{
+ RT_NODE_BASE_125 base;
+
+ /* number of values depends on size class */
+ RT_VALUE_TYPE values[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_LEAF_125;
+
+/*
+ * node-256 is the largest node type. This node has an array
+ * for directly storing values (or child pointers in inner nodes).
+ * Unlike other node kinds, its array size is by definition
+ * fixed.
+ */
+typedef struct RT_NODE_INNER_256
+{
+ RT_NODE_BASE_256 base;
+
+ /* Slots for 256 children */
+ RT_PTR_ALLOC children[RT_NODE_MAX_SLOTS];
+} RT_NODE_INNER_256;
+
+typedef struct RT_NODE_LEAF_256
+{
+ RT_NODE_BASE_256 base;
+
+ /*
+ * Unlike with inner256, zero is a valid value here, so we use a
+ * bitmap to track which slot is in use.
+ */
+ bitmapword isset[BM_IDX(RT_NODE_MAX_SLOTS)];
+
+ /* Slots for 256 values */
+ RT_VALUE_TYPE values[RT_NODE_MAX_SLOTS];
+} RT_NODE_LEAF_256;
+
+/*
+ * Node size classes
+ *
+ * Nodes of different kinds necessarily belong to different size classes.
+ * The main innovation in our implementation compared to the ART paper
+ * is decoupling the notion of size class from kind.
+ *
+ * The size classes within a given node kind have the same underlying
+ * type, but a variable number of children/values. This is possible
+ * because the base type contains small fixed data structures that
+ * work the same way regardless of how full the node is. We store the
+ * node's allocated capacity in the "fanout" member of RT_NODE, to allow
+ * runtime introspection.
+ *
+ * Growing from one node kind to another requires special code for each
+ * case, but growing from one size class to another within the same kind
+ * is basically just allocate + memcpy.
+ *
+ * The size classes have been chosen so that inner nodes on platforms
+ * with 64-bit pointers (and leaf nodes when using a 64-bit key) are
+ * equal to or slightly smaller than some DSA size class.
+ */
+typedef enum RT_SIZE_CLASS
+{
+ RT_CLASS_3 = 0,
+ RT_CLASS_32_MIN,
+ RT_CLASS_32_MAX,
+ RT_CLASS_125,
+ RT_CLASS_256
+} RT_SIZE_CLASS;
+
+/* Information for each size class */
+typedef struct RT_SIZE_CLASS_ELEM
+{
+ const char *name;
+ int fanout;
+
+ /* slab chunk size */
+ Size inner_size;
+ Size leaf_size;
+} RT_SIZE_CLASS_ELEM;
+
+static const RT_SIZE_CLASS_ELEM RT_SIZE_CLASS_INFO[] = {
+ [RT_CLASS_3] = {
+ .name = "radix tree node 3",
+ .fanout = 3,
+ .inner_size = sizeof(RT_NODE_INNER_3) + 3 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_3) + 3 * sizeof(RT_VALUE_TYPE),
+ },
+ [RT_CLASS_32_MIN] = {
+ .name = "radix tree node 15",
+ .fanout = 15,
+ .inner_size = sizeof(RT_NODE_INNER_32) + 15 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_32) + 15 * sizeof(RT_VALUE_TYPE),
+ },
+ [RT_CLASS_32_MAX] = {
+ .name = "radix tree node 32",
+ .fanout = 32,
+ .inner_size = sizeof(RT_NODE_INNER_32) + 32 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_32) + 32 * sizeof(RT_VALUE_TYPE),
+ },
+ [RT_CLASS_125] = {
+ .name = "radix tree node 125",
+ .fanout = 125,
+ .inner_size = sizeof(RT_NODE_INNER_125) + 125 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_125) + 125 * sizeof(RT_VALUE_TYPE),
+ },
+ [RT_CLASS_256] = {
+ .name = "radix tree node 256",
+ .fanout = 256,
+ .inner_size = sizeof(RT_NODE_INNER_256),
+ .leaf_size = sizeof(RT_NODE_LEAF_256),
+ },
+};
+
+#define RT_SIZE_CLASS_COUNT lengthof(RT_SIZE_CLASS_INFO)
+
+#ifdef RT_SHMEM
+/* A magic value used to identify our radix tree */
+#define RT_RADIX_TREE_MAGIC 0x54A48167
+#endif
+
+/* Contains the actual tree and ancillary info */
+// WIP: this name is a bit strange
+typedef struct RT_RADIX_TREE_CONTROL
+{
+#ifdef RT_SHMEM
+ RT_HANDLE handle;
+ uint32 magic;
+#endif
+
+ RT_PTR_ALLOC root;
+ uint64 max_val;
+ uint64 num_keys;
+
+ /* statistics */
+#ifdef RT_DEBUG
+ int32 cnt[RT_SIZE_CLASS_COUNT];
+#endif
+} RT_RADIX_TREE_CONTROL;
+
+/* Entry point for allocating and accessing the tree */
+typedef struct RT_RADIX_TREE
+{
+ MemoryContext context;
+
+ /* pointing to either local memory or DSA */
+ RT_RADIX_TREE_CONTROL *ctl;
+
+#ifdef RT_SHMEM
+ dsa_area *dsa;
+#else
+ MemoryContextData *inner_slabs[RT_SIZE_CLASS_COUNT];
+ MemoryContextData *leaf_slabs[RT_SIZE_CLASS_COUNT];
+#endif
+} RT_RADIX_TREE;
+
+/*
+ * Iteration support.
+ *
+ * Iterating the radix tree returns each key-value pair in ascending key
+ * order. To support this, we iterate over the nodes at each level.
+ *
+ * The RT_NODE_ITER struct is used to track the iteration within a node.
+ *
+ * RT_ITER is the struct for iteration over the radix tree, and uses
+ * RT_NODE_ITER to track the iteration at each level. During iteration, we
+ * also construct the key whenever updating the node iteration information,
+ * e.g., when advancing the current index within a node or when moving to the
+ * next node at the same level.
+ *
+ * XXX: Currently we allow only one process to iterate. Therefore, RT_NODE_ITER
+ * holds local pointers to nodes, rather than RT_PTR_ALLOC.
+ * We need either a safeguard that disallows other processes from beginning an
+ * iteration while one is in progress, or support for multiple processes
+ * iterating concurrently.
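+ *
+ * For example, in a tree whose root has shift 16 (stack_len = 2, with the
+ * 8-bit span), stack[2] tracks the root: when its iterator yields chunk 0xAB,
+ * RT_ITER_UPDATE_KEY() sets bits 16-23 of iter->key to 0xAB; the node at
+ * shift 8 then sets bits 8-15, and the leaf iterator sets bits 0-7, at which
+ * point iter->key is the complete key returned together with the value.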
+ */
+typedef struct RT_NODE_ITER
+{
+ RT_PTR_LOCAL node; /* current node being iterated */
+ int current_idx; /* current position. -1 for initial value */
+} RT_NODE_ITER;
+
+typedef struct RT_ITER
+{
+ RT_RADIX_TREE *tree;
+
+ /* Track the iteration on nodes of each level */
+ RT_NODE_ITER stack[RT_MAX_LEVEL];
+ int stack_len;
+
+ /* The key is constructed during iteration */
+ uint64 key;
+} RT_ITER;
+
+
+static bool RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
+ uint64 key, RT_PTR_ALLOC child);
+static bool RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
+ uint64 key, RT_VALUE_TYPE value);
+
+/* verification (available only with assertion) */
+static void RT_VERIFY_NODE(RT_PTR_LOCAL node);
+
+/* Get the local address of an allocated node */
+static inline RT_PTR_LOCAL
+RT_PTR_GET_LOCAL(RT_RADIX_TREE *tree, RT_PTR_ALLOC node)
+{
+#ifdef RT_SHMEM
+ return dsa_get_address(tree->dsa, (dsa_pointer) node);
+#else
+ return node;
+#endif
+}
+
+static inline bool
+RT_PTR_ALLOC_IS_VALID(RT_PTR_ALLOC ptr)
+{
+#ifdef RT_SHMEM
+ return DsaPointerIsValid(ptr);
+#else
+ return PointerIsValid(ptr);
+#endif
+}
+
+/*
+ * Return index of the first element in the node's chunk array that equals
+ * 'chunk'. Return -1 if there is no such element.
+ */
+static inline int
+RT_NODE_3_SEARCH_EQ(RT_NODE_BASE_3 *node, uint8 chunk)
+{
+ int idx = -1;
+
+ for (int i = 0; i < node->n.count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ idx = i;
+ break;
+ }
+ }
+
+ return idx;
+}
+
+/*
+ * Return the index in the chunk and slot arrays at which to insert into the
+ * node, such that the chunk array remains ordered.
+ */
+static inline int
+RT_NODE_3_GET_INSERTPOS(RT_NODE_BASE_3 *node, uint8 chunk)
+{
+ int idx;
+
+ for (idx = 0; idx < node->n.count; idx++)
+ {
+ if (node->chunks[idx] >= chunk)
+ break;
+ }
+
+ return idx;
+}
+
+/*
+ * Return index of the first element in the node's chunk array that equals
+ * 'chunk'. Return -1 if there is no such element.
+ */
+static inline int
+RT_NODE_32_SEARCH_EQ(RT_NODE_BASE_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ uint32 bitfield;
+ int index_simd = -1;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index = -1;
+
+ for (int i = 0; i < count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ index = i;
+ break;
+ }
+ }
+#endif
+
+#ifndef USE_NO_SIMD
+ /* replicate the search key */
+ spread_chunk = vector8_broadcast(chunk);
+
+ /* compare to all 32 keys stored in the node */
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ cmp1 = vector8_eq(spread_chunk, haystack1);
+ cmp2 = vector8_eq(spread_chunk, haystack2);
+
+ /* convert comparison to a bitfield */
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+
+ /* mask off invalid entries */
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ /* convert bitfield to index by counting trailing zeros */
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+/*
+ * Return the index in the chunk and slot arrays at which to insert into the
+ * node, such that the chunk array remains ordered.
+ */
+static inline int
+RT_NODE_32_GET_INSERTPOS(RT_NODE_BASE_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ Vector8 min1;
+ Vector8 min2;
+ uint32 bitfield;
+ int index_simd;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index;
+
+ for (index = 0; index < count; index++)
+ {
+ /*
+ * This is coded with '>=' to match what we can do with SIMD,
+ * with an assert to keep us honest.
+ */
+ if (node->chunks[index] >= chunk)
+ {
+ Assert(node->chunks[index] != chunk);
+ break;
+ }
+ }
+#endif
+
+#ifndef USE_NO_SIMD
+ /*
+	 * This is a bit more complicated than RT_NODE_32_SEARCH_EQ(), because
+	 * no unsigned uint8 comparison instruction exists, at least for SSE2. So
+	 * we need to play some trickery using vector8_min() to effectively get
+	 * <=: the minimum of the search key and an element equals the search key
+	 * exactly where that element is >= the key. There'll never be any equal
+	 * elements in current uses, but that's what we get here...
+ */
+ spread_chunk = vector8_broadcast(chunk);
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ min1 = vector8_min(spread_chunk, haystack1);
+ min2 = vector8_min(spread_chunk, haystack2);
+ cmp1 = vector8_eq(spread_chunk, min1);
+ cmp2 = vector8_eq(spread_chunk, min2);
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+ else
+ index_simd = count;
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+
+/*
+ * Functions to manipulate both chunks array and children/values array.
+ * These are used for node-3 and node-32.
+ */
+
+/* Shift the elements right at 'idx' by one */
+static inline void
+RT_CHUNK_CHILDREN_ARRAY_SHIFT(uint8 *chunks, RT_PTR_ALLOC *children, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+ memmove(&(children[idx + 1]), &(children[idx]), sizeof(RT_PTR_ALLOC) * (count - idx));
+}
+
+static inline void
+RT_CHUNK_VALUES_ARRAY_SHIFT(uint8 *chunks, RT_VALUE_TYPE *values, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+ memmove(&(values[idx + 1]), &(values[idx]), sizeof(RT_VALUE_TYPE) * (count - idx));
+}
+
+/* Delete the element at 'idx' */
+static inline void
+RT_CHUNK_CHILDREN_ARRAY_DELETE(uint8 *chunks, RT_PTR_ALLOC *children, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(children[idx]), &(children[idx + 1]), sizeof(RT_PTR_ALLOC) * (count - idx - 1));
+}
+
+static inline void
+RT_CHUNK_VALUES_ARRAY_DELETE(uint8 *chunks, RT_VALUE_TYPE *values, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(values[idx]), &(values[idx + 1]), sizeof(RT_VALUE_TYPE) * (count - idx - 1));
+}
+
+/* Copy both chunks and children/values arrays */
+static inline void
+RT_CHUNK_CHILDREN_ARRAY_COPY(uint8 *src_chunks, RT_PTR_ALLOC *src_children,
+ uint8 *dst_chunks, RT_PTR_ALLOC *dst_children)
+{
+ const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_3].fanout;
+ const Size chunk_size = sizeof(uint8) * fanout;
+ const Size children_size = sizeof(RT_PTR_ALLOC) * fanout;
+
+ memcpy(dst_chunks, src_chunks, chunk_size);
+ memcpy(dst_children, src_children, children_size);
+}
+
+static inline void
+RT_CHUNK_VALUES_ARRAY_COPY(uint8 *src_chunks, RT_VALUE_TYPE *src_values,
+ uint8 *dst_chunks, RT_VALUE_TYPE *dst_values)
+{
+ const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_3].fanout;
+ const Size chunk_size = sizeof(uint8) * fanout;
+ const Size values_size = sizeof(RT_VALUE_TYPE) * fanout;
+
+ memcpy(dst_chunks, src_chunks, chunk_size);
+ memcpy(dst_values, src_values, values_size);
+}
+
+/* Functions to manipulate inner and leaf node-125 */
+
+/* Does the given chunk in the node have a value? */
+static inline bool
+RT_NODE_125_IS_CHUNK_USED(RT_NODE_BASE_125 *node, uint8 chunk)
+{
+ return node->slot_idxs[chunk] != RT_INVALID_SLOT_IDX;
+}
+
+static inline RT_PTR_ALLOC
+RT_NODE_INNER_125_GET_CHILD(RT_NODE_INNER_125 *node, uint8 chunk)
+{
+ Assert(!RT_NODE_IS_LEAF(node));
+ return node->children[node->base.slot_idxs[chunk]];
+}
+
+static inline RT_VALUE_TYPE
+RT_NODE_LEAF_125_GET_VALUE(RT_NODE_LEAF_125 *node, uint8 chunk)
+{
+ Assert(RT_NODE_IS_LEAF(node));
+ Assert(((RT_NODE_BASE_125 *) node)->slot_idxs[chunk] != RT_INVALID_SLOT_IDX);
+ return node->values[node->base.slot_idxs[chunk]];
+}
+
+/* Functions to manipulate inner and leaf node-256 */
+
+/* Return true if the slot corresponding to the given chunk is in use */
+static inline bool
+RT_NODE_INNER_256_IS_CHUNK_USED(RT_NODE_INNER_256 *node, uint8 chunk)
+{
+ Assert(!RT_NODE_IS_LEAF(node));
+ return node->children[chunk] != RT_INVALID_PTR_ALLOC;
+}
+
+static inline bool
+RT_NODE_LEAF_256_IS_CHUNK_USED(RT_NODE_LEAF_256 *node, uint8 chunk)
+{
+ int idx = BM_IDX(chunk);
+ int bitnum = BM_BIT(chunk);
+
+ Assert(RT_NODE_IS_LEAF(node));
+ return (node->isset[idx] & ((bitmapword) 1 << bitnum)) != 0;
+}
+
+static inline RT_PTR_ALLOC
+RT_NODE_INNER_256_GET_CHILD(RT_NODE_INNER_256 *node, uint8 chunk)
+{
+ Assert(!RT_NODE_IS_LEAF(node));
+ Assert(RT_NODE_INNER_256_IS_CHUNK_USED(node, chunk));
+ return node->children[chunk];
+}
+
+static inline RT_VALUE_TYPE
+RT_NODE_LEAF_256_GET_VALUE(RT_NODE_LEAF_256 *node, uint8 chunk)
+{
+ Assert(RT_NODE_IS_LEAF(node));
+ Assert(RT_NODE_LEAF_256_IS_CHUNK_USED(node, chunk));
+ return node->values[chunk];
+}
+
+/* Set the child in the node-256 */
+static inline void
+RT_NODE_INNER_256_SET(RT_NODE_INNER_256 *node, uint8 chunk, RT_PTR_ALLOC child)
+{
+ Assert(!RT_NODE_IS_LEAF(node));
+ node->children[chunk] = child;
+}
+
+/* Set the value in the node-256 */
+static inline void
+RT_NODE_LEAF_256_SET(RT_NODE_LEAF_256 *node, uint8 chunk, RT_VALUE_TYPE value)
+{
+ int idx = BM_IDX(chunk);
+ int bitnum = BM_BIT(chunk);
+
+ Assert(RT_NODE_IS_LEAF(node));
+ node->isset[idx] |= ((bitmapword) 1 << bitnum);
+ node->values[chunk] = value;
+}
+
+/* Delete the child in the node-256 */
+static inline void
+RT_NODE_INNER_256_DELETE(RT_NODE_INNER_256 *node, uint8 chunk)
+{
+ Assert(!RT_NODE_IS_LEAF(node));
+ node->children[chunk] = RT_INVALID_PTR_ALLOC;
+}
+
+static inline void
+RT_NODE_LEAF_256_DELETE(RT_NODE_LEAF_256 *node, uint8 chunk)
+{
+ int idx = BM_IDX(chunk);
+ int bitnum = BM_BIT(chunk);
+
+ Assert(RT_NODE_IS_LEAF(node));
+ node->isset[idx] &= ~((bitmapword) 1 << bitnum);
+}
+
+/*
+ * Return the largest shift needed to store the given key.
+ */
+static inline int
+RT_KEY_GET_SHIFT(uint64 key)
+{
+ if (key == 0)
+ return 0;
+ else
+ return (pg_leftmost_one_pos64(key) / RT_NODE_SPAN) * RT_NODE_SPAN;
+}
+
+/*
+ * Return the max value that can be stored in the tree with the given shift.
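+ *
+ * Worked example for this function and RT_KEY_GET_SHIFT() above, with the
+ * 8-bit span: for key 0x123456 the highest set bit is bit 20, so
+ * RT_KEY_GET_SHIFT() returns 16, and RT_SHIFT_GET_MAX_VAL(16) returns
+ * 0xFFFFFF, i.e. a root at shift 16 can hold any key that fits in three
+ * chunks.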
+ */
+static uint64
+RT_SHIFT_GET_MAX_VAL(int shift)
+{
+ if (shift == RT_MAX_SHIFT)
+ return UINT64_MAX;
+
+ return (UINT64CONST(1) << (shift + RT_NODE_SPAN)) - 1;
+}
+
+/*
+ * Allocate a new node of the given size class.
+ */
+static RT_PTR_ALLOC
+RT_ALLOC_NODE(RT_RADIX_TREE *tree, RT_SIZE_CLASS size_class, bool is_leaf)
+{
+ RT_PTR_ALLOC allocnode;
+ size_t allocsize;
+
+ if (is_leaf)
+ allocsize = RT_SIZE_CLASS_INFO[size_class].leaf_size;
+ else
+ allocsize = RT_SIZE_CLASS_INFO[size_class].inner_size;
+
+#ifdef RT_SHMEM
+ allocnode = dsa_allocate(tree->dsa, allocsize);
+#else
+ if (is_leaf)
+ allocnode = (RT_PTR_ALLOC) MemoryContextAlloc(tree->leaf_slabs[size_class],
+ allocsize);
+ else
+ allocnode = (RT_PTR_ALLOC) MemoryContextAlloc(tree->inner_slabs[size_class],
+ allocsize);
+#endif
+
+#ifdef RT_DEBUG
+ /* update the statistics */
+ tree->ctl->cnt[size_class]++;
+#endif
+
+ return allocnode;
+}
+
+/* Initialize the node contents */
+static inline void
+RT_INIT_NODE(RT_PTR_LOCAL node, uint8 kind, RT_SIZE_CLASS size_class, bool is_leaf)
+{
+ if (is_leaf)
+ MemSet(node, 0, RT_SIZE_CLASS_INFO[size_class].leaf_size);
+ else
+ MemSet(node, 0, RT_SIZE_CLASS_INFO[size_class].inner_size);
+
+ node->kind = kind;
+
+ if (kind == RT_NODE_KIND_256)
+ /* See comment for the RT_NODE type */
+ Assert(node->fanout == 0);
+ else
+ node->fanout = RT_SIZE_CLASS_INFO[size_class].fanout;
+
+ /* Initialize slot_idxs to invalid values */
+ if (kind == RT_NODE_KIND_125)
+ {
+ RT_NODE_BASE_125 *n125 = (RT_NODE_BASE_125 *) node;
+
+ memset(n125->slot_idxs, RT_INVALID_SLOT_IDX, sizeof(n125->slot_idxs));
+ }
+}
+
+/*
+ * Create a new node as the root. Subordinate nodes will be created during
+ * the insertion.
+ */
+static void
+RT_NEW_ROOT(RT_RADIX_TREE *tree, uint64 key)
+{
+ int shift = RT_KEY_GET_SHIFT(key);
+ bool is_leaf = shift == 0;
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+
+ allocnode = RT_ALLOC_NODE(tree, RT_CLASS_3, is_leaf);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(newnode, RT_NODE_KIND_3, RT_CLASS_3, is_leaf);
+ newnode->shift = shift;
+ tree->ctl->max_val = RT_SHIFT_GET_MAX_VAL(shift);
+ tree->ctl->root = allocnode;
+}
+
+static inline void
+RT_COPY_NODE(RT_PTR_LOCAL newnode, RT_PTR_LOCAL oldnode)
+{
+ newnode->shift = oldnode->shift;
+ newnode->count = oldnode->count;
+}
+
+/*
+ * Given a newly allocated node and an old node, initialize the new
+ * node with the necessary fields and return its local pointer.
+ */
+static inline RT_PTR_LOCAL
+RT_SWITCH_NODE_KIND(RT_RADIX_TREE *tree, RT_PTR_ALLOC allocnode, RT_PTR_LOCAL node,
+ uint8 new_kind, uint8 new_class, bool is_leaf)
+{
+ RT_PTR_LOCAL newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(newnode, new_kind, new_class, is_leaf);
+ RT_COPY_NODE(newnode, node);
+
+ return newnode;
+}
+
+/* Free the given node */
+static void
+RT_FREE_NODE(RT_RADIX_TREE *tree, RT_PTR_ALLOC allocnode)
+{
+ /* If we're deleting the root node, make the tree empty */
+ if (tree->ctl->root == allocnode)
+ {
+ tree->ctl->root = RT_INVALID_PTR_ALLOC;
+ tree->ctl->max_val = 0;
+ }
+
+#ifdef RT_DEBUG
+ {
+ int i;
+ RT_PTR_LOCAL node = RT_PTR_GET_LOCAL(tree, allocnode);
+
+ /* update the statistics */
+ for (i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ if (node->fanout == RT_SIZE_CLASS_INFO[i].fanout)
+ break;
+ }
+
+ /* fanout of node256 is intentionally 0 */
+ if (i == RT_SIZE_CLASS_COUNT)
+ i = RT_CLASS_256;
+
+ tree->ctl->cnt[i]--;
+ Assert(tree->ctl->cnt[i] >= 0);
+ }
+#endif
+
+#ifdef RT_SHMEM
+ dsa_free(tree->dsa, allocnode);
+#else
+ pfree(allocnode);
+#endif
+}
+
+/* Update the parent's pointer when growing a node */
+static inline void
+RT_NODE_UPDATE_INNER(RT_PTR_LOCAL node, uint64 key, RT_PTR_ALLOC new_child)
+{
+#define RT_ACTION_UPDATE
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_INNER
+#undef RT_ACTION_UPDATE
+}
+
+/*
+ * Replace old_child with new_child, and free the old one.
+ */
+static void
+RT_REPLACE_NODE(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent,
+ RT_PTR_ALLOC stored_old_child, RT_PTR_LOCAL old_child,
+ RT_PTR_ALLOC new_child, uint64 key)
+{
+#ifdef USE_ASSERT_CHECKING
+ RT_PTR_LOCAL new = RT_PTR_GET_LOCAL(tree, new_child);
+
+ Assert(old_child->shift == new->shift);
+ Assert(old_child->count == new->count);
+#endif
+
+ if (parent == old_child)
+ {
+ /* Replace the root node with the new larger node */
+ tree->ctl->root = new_child;
+ }
+ else
+ RT_NODE_UPDATE_INNER(parent, key, new_child);
+
+ RT_FREE_NODE(tree, stored_old_child);
+}
+
+/*
+ * The radix tree doesn't have sufficient height. Extend the radix tree so
+ * it can store the key.
+ */
+static void
+RT_EXTEND(RT_RADIX_TREE *tree, uint64 key)
+{
+ int target_shift;
+ RT_PTR_LOCAL root = RT_PTR_GET_LOCAL(tree, tree->ctl->root);
+ int shift = root->shift + RT_NODE_SPAN;
+
+ target_shift = RT_KEY_GET_SHIFT(key);
+
+ /* Grow tree from 'shift' to 'target_shift' */
+ while (shift <= target_shift)
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL node;
+ RT_NODE_INNER_3 *n3;
+
+		allocnode = RT_ALLOC_NODE(tree, RT_CLASS_3, false);
+		node = RT_PTR_GET_LOCAL(tree, allocnode);
+		RT_INIT_NODE(node, RT_NODE_KIND_3, RT_CLASS_3, false);
+ node->shift = shift;
+ node->count = 1;
+
+ n3 = (RT_NODE_INNER_3 *) node;
+ n3->base.chunks[0] = 0;
+ n3->children[0] = tree->ctl->root;
+
+ /* Update the root */
+ tree->ctl->root = allocnode;
+
+ shift += RT_NODE_SPAN;
+ }
+
+ tree->ctl->max_val = RT_SHIFT_GET_MAX_VAL(target_shift);
+}
+
+/*
+ * The radix tree doesn't yet have inner and leaf nodes for the given
+ * key-value pair. Insert inner and leaf nodes from 'node' down to the bottom.
+ */
+static inline void
+RT_SET_EXTEND(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE value, RT_PTR_LOCAL parent,
+ RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node)
+{
+ int shift = node->shift;
+
+ Assert(RT_PTR_GET_LOCAL(tree, stored_node) == node);
+
+ while (shift >= RT_NODE_SPAN)
+ {
+ RT_PTR_ALLOC allocchild;
+ RT_PTR_LOCAL newchild;
+ int newshift = shift - RT_NODE_SPAN;
+ bool is_leaf = newshift == 0;
+
+ allocchild = RT_ALLOC_NODE(tree, RT_CLASS_3, is_leaf);
+ newchild = RT_PTR_GET_LOCAL(tree, allocchild);
+ RT_INIT_NODE(newchild, RT_NODE_KIND_3, RT_CLASS_3, is_leaf);
+ newchild->shift = newshift;
+ RT_NODE_INSERT_INNER(tree, parent, stored_node, node, key, allocchild);
+
+ parent = node;
+ node = newchild;
+ stored_node = allocchild;
+ shift -= RT_NODE_SPAN;
+ }
+
+ RT_NODE_INSERT_LEAF(tree, parent, stored_node, node, key, value);
+ tree->ctl->num_keys++;
+}
+
+/*
+ * Search for the child pointer corresponding to 'key' in the given node.
+ *
+ * Return true if the key is found, otherwise return false. On success, the
+ * child pointer is stored in *child_p.
+ */
+static inline bool
+RT_NODE_SEARCH_INNER(RT_PTR_LOCAL node, uint64 key, RT_PTR_ALLOC *child_p)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/*
+ * Search for the value corresponding to 'key' in the given node.
+ *
+ * Return true if the key is found, otherwise return false. On success, the
+ * value is copied to *value_p.
+ */
+static inline bool
+RT_NODE_SEARCH_LEAF(RT_PTR_LOCAL node, uint64 key, RT_VALUE_TYPE *value_p)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
+#ifdef RT_USE_DELETE
+/*
+ * Delete the child pointer corresponding to 'key' from the given node.
+ *
+ * Return true if the key was found and deleted, otherwise return false.
+ */
+static inline bool
+RT_NODE_DELETE_INNER(RT_PTR_LOCAL node, uint64 key)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_delete_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/*
+ * Delete the value corresponding to 'key' from the given leaf node.
+ *
+ * Return true if the key was found and deleted, otherwise return false.
+ */
+static inline bool
+RT_NODE_DELETE_LEAF(RT_PTR_LOCAL node, uint64 key)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_delete_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+#endif
+
+/*
+ * Insert "child" into "node".
+ *
+ * "parent" is the parent of "node", so the grandparent of the child.
+ * If the node we're inserting into needs to grow, we update the parent's
+ * child pointer with the pointer to the new larger node.
+ */
+static bool
+RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
+ uint64 key, RT_PTR_ALLOC child)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_insert_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/* Like RT_NODE_INSERT_INNER, but for leaf nodes */
+static bool
+RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
+ uint64 key, RT_VALUE_TYPE value)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_insert_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
+/*
+ * Create the radix tree in the given memory context and return it.
+ */
+RT_SCOPE RT_RADIX_TREE *
+#ifdef RT_SHMEM
+RT_CREATE(MemoryContext ctx, dsa_area *dsa)
+#else
+RT_CREATE(MemoryContext ctx)
+#endif
+{
+ RT_RADIX_TREE *tree;
+ MemoryContext old_ctx;
+#ifdef RT_SHMEM
+ dsa_pointer dp;
+#endif
+
+ old_ctx = MemoryContextSwitchTo(ctx);
+
+ tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE));
+ tree->context = ctx;
+
+#ifdef RT_SHMEM
+ tree->dsa = dsa;
+ dp = dsa_allocate0(dsa, sizeof(RT_RADIX_TREE_CONTROL));
+ tree->ctl = (RT_RADIX_TREE_CONTROL *) dsa_get_address(dsa, dp);
+ tree->ctl->handle = dp;
+ tree->ctl->magic = RT_RADIX_TREE_MAGIC;
+#else
+ tree->ctl = (RT_RADIX_TREE_CONTROL *) palloc0(sizeof(RT_RADIX_TREE_CONTROL));
+
+ /* Create a slab context for each size class */
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ RT_SIZE_CLASS_ELEM size_class = RT_SIZE_CLASS_INFO[i];
+ size_t inner_blocksize = RT_SLAB_BLOCK_SIZE(size_class.inner_size);
+ size_t leaf_blocksize = RT_SLAB_BLOCK_SIZE(size_class.leaf_size);
+
+ tree->inner_slabs[i] = SlabContextCreate(ctx,
+ size_class.name,
+ inner_blocksize,
+ size_class.inner_size);
+ tree->leaf_slabs[i] = SlabContextCreate(ctx,
+ size_class.name,
+ leaf_blocksize,
+ size_class.leaf_size);
+ }
+#endif
+
+ tree->ctl->root = RT_INVALID_PTR_ALLOC;
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return tree;
+}
+
+#ifdef RT_SHMEM
+RT_SCOPE RT_RADIX_TREE *
+RT_ATTACH(dsa_area *dsa, RT_HANDLE handle)
+{
+ RT_RADIX_TREE *tree;
+ dsa_pointer control;
+
+ /* XXX: memory context support */
+ tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE));
+
+	/* Find the control object in shared memory */
+ control = handle;
+
+ tree->dsa = dsa;
+ tree->ctl = (RT_RADIX_TREE_CONTROL *) dsa_get_address(dsa, control);
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+
+ return tree;
+}
+
+RT_SCOPE void
+RT_DETACH(RT_RADIX_TREE *tree)
+{
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+ pfree(tree);
+}
+
+RT_SCOPE RT_HANDLE
+RT_GET_HANDLE(RT_RADIX_TREE *tree)
+{
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+ return tree->ctl->handle;
+}
+#endif
+
+/*
+ * Free the given radix tree.
+ */
+RT_SCOPE void
+RT_FREE(RT_RADIX_TREE *tree)
+{
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+
+ /*
+	 * Vandalize the control block to help catch programming errors where
+ * other backends access the memory formerly occupied by this radix tree.
+ */
+ tree->ctl->magic = 0;
+ dsa_free(tree->dsa, tree->ctl->handle);
+#else
+ pfree(tree->ctl);
+
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ MemoryContextDelete(tree->inner_slabs[i]);
+ MemoryContextDelete(tree->leaf_slabs[i]);
+ }
+#endif
+
+ pfree(tree);
+}
+
+/*
+ * Set key to value. If the entry already exists, we update its value to 'value'
+ * and return true. Return false if the entry doesn't yet exist.
+ */
+RT_SCOPE bool
+RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE value)
+{
+ int shift;
+ bool updated;
+ RT_PTR_LOCAL parent;
+ RT_PTR_ALLOC stored_child;
+ RT_PTR_LOCAL child;
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+
+ /* Empty tree, create the root */
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ RT_NEW_ROOT(tree, key);
+
+ /* Extend the tree if necessary */
+ if (key > tree->ctl->max_val)
+ RT_EXTEND(tree, key);
+
+ stored_child = tree->ctl->root;
+ parent = RT_PTR_GET_LOCAL(tree, stored_child);
+ shift = parent->shift;
+
+ /* Descend the tree until we reach a leaf node */
+ while (shift >= 0)
+ {
+ RT_PTR_ALLOC new_child;
+
+ child = RT_PTR_GET_LOCAL(tree, stored_child);
+
+ if (RT_NODE_IS_LEAF(child))
+ break;
+
+ if (!RT_NODE_SEARCH_INNER(child, key, &new_child))
+ {
+ RT_SET_EXTEND(tree, key, value, parent, stored_child, child);
+ return false;
+ }
+
+ parent = child;
+ stored_child = new_child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ updated = RT_NODE_INSERT_LEAF(tree, parent, stored_child, child, key, value);
+
+ /* Update the statistics */
+ if (!updated)
+ tree->ctl->num_keys++;
+
+ return updated;
+}
+
+/*
+ * Search for the given key in the radix tree. Return true if the key is found,
+ * otherwise return false. On success, we set the value to *value_p, so value_p
+ * must not be NULL.
+ */
+RT_SCOPE bool
+RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p)
+{
+ RT_PTR_LOCAL node;
+ int shift;
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+ Assert(value_p != NULL);
+
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root) || key > tree->ctl->max_val)
+ return false;
+
+ node = RT_PTR_GET_LOCAL(tree, tree->ctl->root);
+ shift = node->shift;
+
+ /* Descend the tree until a leaf node */
+ while (shift >= 0)
+ {
+ RT_PTR_ALLOC child;
+
+ if (RT_NODE_IS_LEAF(node))
+ break;
+
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ return false;
+
+ node = RT_PTR_GET_LOCAL(tree, child);
+ shift -= RT_NODE_SPAN;
+ }
+
+ return RT_NODE_SEARCH_LEAF(node, key, value_p);
+}
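+
+/*
+ * A minimal usage sketch for the local-memory case, assuming simplehash.h-
+ * style name generation (so that RT_PREFIX "rt" with RT_VALUE_TYPE uint64
+ * produces rt_radix_tree, rt_create(), rt_set(), rt_search(), and so on) and
+ * that the template is included as lib/radixtree.h:
+ *
+ *		#define RT_PREFIX rt
+ *		#define RT_SCOPE static
+ *		#define RT_DECLARE
+ *		#define RT_DEFINE
+ *		#define RT_VALUE_TYPE uint64
+ *		#include "lib/radixtree.h"
+ *
+ *		rt_radix_tree *tree = rt_create(CurrentMemoryContext);
+ *		uint64		value = 42;
+ *
+ *		rt_set(tree, key, value);
+ *		if (rt_search(tree, key, &value))
+ *			elog(NOTICE, "found " UINT64_FORMAT, value);
+ *		rt_free(tree);
+ */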
+
+#ifdef RT_USE_DELETE
+/*
+ * Delete the given key from the radix tree. Return true if the key is found (and
+ * deleted), otherwise do nothing and return false.
+ */
+RT_SCOPE bool
+RT_DELETE(RT_RADIX_TREE *tree, uint64 key)
+{
+ RT_PTR_LOCAL node;
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_ALLOC stack[RT_MAX_LEVEL] = {0};
+ int shift;
+ int level;
+ bool deleted;
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root) || key > tree->ctl->max_val)
+ return false;
+
+ /*
+ * Descend the tree to search the key while building a stack of nodes we
+ * visited.
+ */
+ allocnode = tree->ctl->root;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ shift = node->shift;
+ level = -1;
+ while (shift > 0)
+ {
+ RT_PTR_ALLOC child;
+
+ /* Push the current node to the stack */
+ stack[++level] = allocnode;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ return false;
+
+ allocnode = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+	/* Delete the key from the leaf node if it exists */
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ deleted = RT_NODE_DELETE_LEAF(node, key);
+
+ if (!deleted)
+ {
+ /* no key is found in the leaf node */
+ return false;
+ }
+
+ /* Found the key to delete. Update the statistics */
+ tree->ctl->num_keys--;
+
+ /*
+ * Return if the leaf node still has keys and we don't need to delete the
+ * node.
+ */
+ if (node->count > 0)
+ return true;
+
+ /* Free the empty leaf node */
+ RT_FREE_NODE(tree, allocnode);
+
+	/* Delete the key from the inner nodes, walking back up the stack */
+ while (level >= 0)
+ {
+ allocnode = stack[level--];
+
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ deleted = RT_NODE_DELETE_INNER(node, key);
+ Assert(deleted);
+
+ /* If the node didn't become empty, we stop deleting the key */
+ if (node->count > 0)
+ break;
+
+ /* The node became empty */
+ RT_FREE_NODE(tree, allocnode);
+ }
+
+ return true;
+}
+#endif
+
+static inline void
+RT_ITER_UPDATE_KEY(RT_ITER *iter, uint8 chunk, uint8 shift)
+{
+ iter->key &= ~(((uint64) RT_CHUNK_MASK) << shift);
+ iter->key |= (((uint64) chunk) << shift);
+}
+
+/*
+ * Advance the slot in the inner node. Return the child if one exists,
+ * otherwise NULL.
+ */
+static inline RT_PTR_LOCAL
+RT_NODE_INNER_ITERATE_NEXT(RT_ITER *iter, RT_NODE_ITER *node_iter)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_iter_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/*
+ * Advance the slot in the leaf node. On success, return true and the value
+ * is set to *value_p, otherwise return false.
+ */
+static inline bool
+RT_NODE_LEAF_ITERATE_NEXT(RT_ITER *iter, RT_NODE_ITER *node_iter,
+ RT_VALUE_TYPE *value_p)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_iter_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
+/*
+ * Update each node_iter for inner nodes in the iterator node stack.
+ */
+static void
+RT_UPDATE_ITER_STACK(RT_ITER *iter, RT_PTR_LOCAL from_node, int from)
+{
+ int level = from;
+ RT_PTR_LOCAL node = from_node;
+
+ for (;;)
+ {
+ RT_NODE_ITER *node_iter = &(iter->stack[level--]);
+
+ node_iter->node = node;
+ node_iter->current_idx = -1;
+
+ /* We don't advance the leaf node iterator here */
+ if (RT_NODE_IS_LEAF(node))
+ return;
+
+ /* Advance to the next slot in the inner node */
+ node = RT_NODE_INNER_ITERATE_NEXT(iter, node_iter);
+
+		/* We must find the first child in the node */
+ Assert(node);
+ }
+}
+
+/* Create and return the iterator for the given radix tree */
+RT_SCOPE RT_ITER *
+RT_BEGIN_ITERATE(RT_RADIX_TREE *tree)
+{
+ MemoryContext old_ctx;
+ RT_ITER *iter;
+ RT_PTR_LOCAL root;
+ int top_level;
+
+ old_ctx = MemoryContextSwitchTo(tree->context);
+
+ iter = (RT_ITER *) palloc0(sizeof(RT_ITER));
+ iter->tree = tree;
+
+ /* empty tree */
+ if (!iter->tree->ctl->root)
+ return iter;
+
+ root = RT_PTR_GET_LOCAL(tree, iter->tree->ctl->root);
+ top_level = root->shift / RT_NODE_SPAN;
+ iter->stack_len = top_level;
+
+ /*
+	 * Descend to the leftmost leaf node from the root. The key is being
+ * constructed while descending to the leaf.
+ */
+ RT_UPDATE_ITER_STACK(iter, root, top_level);
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return iter;
+}
+
+/*
+ * Return true, setting *key_p and *value_p, if there is a next key. Otherwise,
+ * return false.
+ */
+RT_SCOPE bool
+RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, RT_VALUE_TYPE *value_p)
+{
+ /* Empty tree */
+ if (!iter->tree->ctl->root)
+ return false;
+
+ for (;;)
+ {
+ RT_PTR_LOCAL child = NULL;
+ RT_VALUE_TYPE value;
+ int level;
+ bool found;
+
+ /* Advance the leaf node iterator to get next key-value pair */
+ found = RT_NODE_LEAF_ITERATE_NEXT(iter, &(iter->stack[0]), &value);
+
+ if (found)
+ {
+ *key_p = iter->key;
+ *value_p = value;
+ return true;
+ }
+
+ /*
+ * We've visited all values in the leaf node, so advance inner node
+		 * iterators from level 1 upward until we find the next child node.
+ */
+ for (level = 1; level <= iter->stack_len; level++)
+ {
+ child = RT_NODE_INNER_ITERATE_NEXT(iter, &(iter->stack[level]));
+
+ if (child)
+ break;
+ }
+
+ /* the iteration finished */
+ if (!child)
+ return false;
+
+ /*
+		 * Found the next child node. Update the iterator stack from this
+		 * node down to the leaf level.
+ */
+ RT_UPDATE_ITER_STACK(iter, child, level - 1);
+
+ /* Node iterators are updated, so try again from the leaf */
+ }
+
+ return false;
+}
+
+RT_SCOPE void
+RT_END_ITERATE(RT_ITER *iter)
+{
+ pfree(iter);
+}
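+
+/*
+ * Sketch of the expected iteration pattern, using the same hypothetical "rt"
+ * prefix as in the usage sketch above; keys are returned in ascending order:
+ *
+ *		rt_iter    *iter = rt_begin_iterate(tree);
+ *		uint64		key;
+ *		uint64		value;
+ *
+ *		while (rt_iterate_next(iter, &key, &value))
+ *			do_something(key, value);
+ *		rt_end_iterate(iter);
+ */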
+
+/*
+ * Return the amount of memory used by the radix tree.
+ */
+RT_SCOPE uint64
+RT_MEMORY_USAGE(RT_RADIX_TREE *tree)
+{
+ // XXX is this necessary?
+ Size total = sizeof(RT_RADIX_TREE);
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+ total = dsa_get_total_size(tree->dsa);
+#else
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
+ total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
+ }
+#endif
+
+ return total;
+}
+
+/*
+ * Verify the radix tree node.
+ */
+static void
+RT_VERIFY_NODE(RT_PTR_LOCAL node)
+{
+#ifdef USE_ASSERT_CHECKING
+ Assert(node->count >= 0);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE_BASE_3 *n3 = (RT_NODE_BASE_3 *) node;
+
+ for (int i = 1; i < n3->n.count; i++)
+ Assert(n3->chunks[i - 1] < n3->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE_BASE_32 *n32 = (RT_NODE_BASE_32 *) node;
+
+ for (int i = 1; i < n32->n.count; i++)
+ Assert(n32->chunks[i - 1] < n32->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE_BASE_125 *n125 = (RT_NODE_BASE_125 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ uint8 slot = n125->slot_idxs[i];
+ int idx = BM_IDX(slot);
+ int bitnum = BM_BIT(slot);
+
+ if (!RT_NODE_125_IS_CHUNK_USED(n125, i))
+ continue;
+
+ /* Check if the corresponding slot is used */
+ Assert(slot < node->fanout);
+ Assert((n125->isset[idx] & ((bitmapword) 1 << bitnum)) != 0);
+
+ cnt++;
+ }
+
+ Assert(n125->n.count == cnt);
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_256 *n256 = (RT_NODE_LEAF_256 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < BM_IDX(RT_NODE_MAX_SLOTS); i++)
+ cnt += bmw_popcount(n256->isset[i]);
+
+					/* Check if the number of used chunks matches */
+ Assert(n256->base.n.count == cnt);
+
+ break;
+ }
+ }
+ }
+#endif
+}
+
+/***************** DEBUG FUNCTIONS *****************/
+#ifdef RT_DEBUG
+
+#define RT_UINT64_FORMAT_HEX "%" INT64_MODIFIER "X"
+
+RT_SCOPE void
+RT_STATS(RT_RADIX_TREE *tree)
+{
+ ereport(NOTICE, (errmsg("num_keys = " UINT64_FORMAT ", height = %d, n3 = %u, n15 = %u, n32 = %u, n125 = %u, n256 = %u",
+ tree->ctl->num_keys,
+ tree->ctl->root->shift / RT_NODE_SPAN,
+ tree->ctl->cnt[RT_CLASS_3],
+ tree->ctl->cnt[RT_CLASS_32_MIN],
+ tree->ctl->cnt[RT_CLASS_32_MAX],
+ tree->ctl->cnt[RT_CLASS_125],
+ tree->ctl->cnt[RT_CLASS_256])));
+}
+
+/* XXX For display, assumes value type is numeric */
+static void
+RT_DUMP_NODE(RT_PTR_LOCAL node, int level, bool recurse)
+{
+ char space[125] = {0};
+
+ fprintf(stderr, "[%s] kind %d, fanout %d, count %u, shift %u:\n",
+ RT_NODE_IS_LEAF(node) ? "LEAF" : "INNR",
+ (node->kind == RT_NODE_KIND_3) ? 3 :
+ (node->kind == RT_NODE_KIND_32) ? 32 :
+ (node->kind == RT_NODE_KIND_125) ? 125 : 256,
+ node->fanout == 0 ? 256 : node->fanout,
+ node->count, node->shift);
+
+ if (level > 0)
+ sprintf(space, "%*c", level * 4, ' ');
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_3 *n3 = (RT_NODE_LEAF_3 *) node;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" RT_UINT64_FORMAT_HEX "\n",
+ space, n3->base.chunks[i], (uint64) n3->values[i]);
+ }
+ else
+ {
+ RT_NODE_INNER_3 *n3 = (RT_NODE_INNER_3 *) node;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, n3->base.chunks[i]);
+
+ if (recurse)
+ RT_DUMP_NODE(n3->children[i], level + 1, recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_32 *n32 = (RT_NODE_LEAF_32 *) node;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" RT_UINT64_FORMAT_HEX "\n",
+ space, n32->base.chunks[i], (uint64) n32->values[i]);
+ }
+ else
+ {
+ RT_NODE_INNER_32 *n32 = (RT_NODE_INNER_32 *) node;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, n32->base.chunks[i]);
+
+ if (recurse)
+ {
+ RT_DUMP_NODE(n32->children[i], level + 1, recurse);
+ }
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE_BASE_125 *b125 = (RT_NODE_BASE_125 *) node;
+
+ fprintf(stderr, "slot_idxs ");
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(b125, i))
+ continue;
+
+ fprintf(stderr, " [%d]=%d, ", i, b125->slot_idxs[i]);
+ }
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_125 *n = (RT_NODE_LEAF_125 *) node;
+
+ fprintf(stderr, ", isset-bitmap:");
+ for (int i = 0; i < BM_IDX(RT_SLOT_IDX_LIMIT); i++)
+ {
+ fprintf(stderr, RT_UINT64_FORMAT_HEX " ", (uint64) n->base.isset[i]);
+ }
+ fprintf(stderr, "\n");
+ }
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(b125, i))
+ continue;
+
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_125 *n125 = (RT_NODE_LEAF_125 *) b125;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" RT_UINT64_FORMAT_HEX "\n",
+ space, i, (uint64) RT_NODE_LEAF_125_GET_VALUE(n125, i));
+ }
+ else
+ {
+ RT_NODE_INNER_125 *n125 = (RT_NODE_INNER_125 *) b125;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, i);
+
+ if (recurse)
+ RT_DUMP_NODE(RT_NODE_INNER_125_GET_CHILD(n125, i),
+ level + 1, recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_256 *n256 = (RT_NODE_LEAF_256 *) node;
+
+ if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, i))
+ continue;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" RT_UINT64_FORMAT_HEX "\n",
+ space, i, (uint64) RT_NODE_LEAF_256_GET_VALUE(n256, i));
+ }
+ else
+ {
+ RT_NODE_INNER_256 *n256 = (RT_NODE_INNER_256 *) node;
+
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
+ continue;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, i);
+
+ if (recurse)
+ RT_DUMP_NODE(RT_NODE_INNER_256_GET_CHILD(n256, i), level + 1,
+ recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ }
+}
+
+RT_SCOPE void
+RT_DUMP_SEARCH(RT_RADIX_TREE *tree, uint64 key)
+{
+ RT_PTR_LOCAL node;
+ int shift;
+ int level = 0;
+
+ elog(NOTICE, "-----------------------------------------------------------");
+ elog(NOTICE, "max_val = " UINT64_FORMAT "(0x" RT_UINT64_FORMAT_HEX ")",
+ tree->ctl->max_val, tree->ctl->max_val);
+
+ if (!tree->ctl->root)
+ {
+ elog(NOTICE, "tree is empty");
+ return;
+ }
+
+ if (key > tree->ctl->max_val)
+ {
+ elog(NOTICE, "key " UINT64_FORMAT "(0x" RT_UINT64_FORMAT_HEX ") is larger than max val",
+ key, key);
+ return;
+ }
+
+ node = tree->ctl->root;
+ shift = tree->ctl->root->shift;
+ while (shift >= 0)
+ {
+ RT_PTR_LOCAL child;
+
+ RT_DUMP_NODE(node, level, false);
+
+ if (RT_NODE_IS_LEAF(node))
+ {
+ uint64 dummy;
+
+			/* We reached a leaf node; find the corresponding slot */
+ RT_NODE_SEARCH_LEAF(node, key, &dummy);
+
+ break;
+ }
+
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ break;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ level++;
+ }
+}
+
+RT_SCOPE void
+RT_DUMP(RT_RADIX_TREE *tree)
+{
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ fprintf(stderr, "%s\tinner_size %zu\tleaf_size %zu\t%zu\n",
+ RT_SIZE_CLASS_INFO[i].name,
+ RT_SIZE_CLASS_INFO[i].inner_size,
+ RT_SIZE_CLASS_INFO[i].leaf_size);
+ fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->ctl->max_val);
+
+ if (!tree->ctl->root)
+ {
+ fprintf(stderr, "empty tree\n");
+ return;
+ }
+
+ RT_DUMP_NODE(tree->ctl->root, 0, true);
+}
+#endif
+
+#endif /* RT_DEFINE */
+
+
+/* undefine external parameters, so next radix tree can be defined */
+#undef RT_PREFIX
+#undef RT_SCOPE
+#undef RT_DECLARE
+#undef RT_DEFINE
+#undef RT_VALUE_TYPE
+
+/* locally declared macros */
+#undef RT_MAKE_PREFIX
+#undef RT_MAKE_NAME
+#undef RT_MAKE_NAME_
+#undef RT_NODE_SPAN
+#undef RT_NODE_MAX_SLOTS
+#undef RT_CHUNK_MASK
+#undef RT_MAX_SHIFT
+#undef RT_MAX_LEVEL
+#undef RT_GET_KEY_CHUNK
+#undef BM_IDX
+#undef BM_BIT
+#undef RT_NODE_IS_LEAF
+#undef RT_NODE_MUST_GROW
+#undef RT_NODE_KIND_COUNT
+#undef RT_SIZE_CLASS_COUNT
+#undef RT_INVALID_SLOT_IDX
+#undef RT_SLAB_BLOCK_SIZE
+#undef RT_RADIX_TREE_MAGIC
+#undef RT_UINT64_FORMAT_HEX
+
+/* type declarations */
+#undef RT_RADIX_TREE
+#undef RT_RADIX_TREE_CONTROL
+#undef RT_PTR_LOCAL
+#undef RT_PTR_ALLOC
+#undef RT_INVALID_PTR_ALLOC
+#undef RT_HANDLE
+#undef RT_ITER
+#undef RT_NODE
+#undef RT_NODE_ITER
+#undef RT_NODE_KIND_3
+#undef RT_NODE_KIND_32
+#undef RT_NODE_KIND_125
+#undef RT_NODE_KIND_256
+#undef RT_NODE_BASE_3
+#undef RT_NODE_BASE_32
+#undef RT_NODE_BASE_125
+#undef RT_NODE_BASE_256
+#undef RT_NODE_INNER_3
+#undef RT_NODE_INNER_32
+#undef RT_NODE_INNER_125
+#undef RT_NODE_INNER_256
+#undef RT_NODE_LEAF_3
+#undef RT_NODE_LEAF_32
+#undef RT_NODE_LEAF_125
+#undef RT_NODE_LEAF_256
+#undef RT_SIZE_CLASS
+#undef RT_SIZE_CLASS_ELEM
+#undef RT_SIZE_CLASS_INFO
+#undef RT_CLASS_3
+#undef RT_CLASS_32_MIN
+#undef RT_CLASS_32_MAX
+#undef RT_CLASS_125
+#undef RT_CLASS_256
+
+/* function declarations */
+#undef RT_CREATE
+#undef RT_FREE
+#undef RT_ATTACH
+#undef RT_DETACH
+#undef RT_GET_HANDLE
+#undef RT_SEARCH
+#undef RT_SET
+#undef RT_BEGIN_ITERATE
+#undef RT_ITERATE_NEXT
+#undef RT_END_ITERATE
+#undef RT_USE_DELETE
+#undef RT_DELETE
+#undef RT_MEMORY_USAGE
+#undef RT_DUMP
+#undef RT_DUMP_NODE
+#undef RT_DUMP_SEARCH
+#undef RT_STATS
+
+/* internal helper functions */
+#undef RT_NEW_ROOT
+#undef RT_ALLOC_NODE
+#undef RT_INIT_NODE
+#undef RT_FREE_NODE
+#undef RT_EXTEND
+#undef RT_SET_EXTEND
+#undef RT_SWITCH_NODE_KIND
+#undef RT_COPY_NODE
+#undef RT_REPLACE_NODE
+#undef RT_PTR_GET_LOCAL
+#undef RT_PTR_ALLOC_IS_VALID
+#undef RT_NODE_3_SEARCH_EQ
+#undef RT_NODE_32_SEARCH_EQ
+#undef RT_NODE_3_GET_INSERTPOS
+#undef RT_NODE_32_GET_INSERTPOS
+#undef RT_CHUNK_CHILDREN_ARRAY_SHIFT
+#undef RT_CHUNK_VALUES_ARRAY_SHIFT
+#undef RT_CHUNK_CHILDREN_ARRAY_DELETE
+#undef RT_CHUNK_VALUES_ARRAY_DELETE
+#undef RT_CHUNK_CHILDREN_ARRAY_COPY
+#undef RT_CHUNK_VALUES_ARRAY_COPY
+#undef RT_NODE_125_IS_CHUNK_USED
+#undef RT_NODE_INNER_125_GET_CHILD
+#undef RT_NODE_LEAF_125_GET_VALUE
+#undef RT_NODE_INNER_256_IS_CHUNK_USED
+#undef RT_NODE_LEAF_256_IS_CHUNK_USED
+#undef RT_NODE_INNER_256_GET_CHILD
+#undef RT_NODE_LEAF_256_GET_VALUE
+#undef RT_NODE_INNER_256_SET
+#undef RT_NODE_LEAF_256_SET
+#undef RT_NODE_INNER_256_DELETE
+#undef RT_NODE_LEAF_256_DELETE
+#undef RT_KEY_GET_SHIFT
+#undef RT_SHIFT_GET_MAX_VAL
+#undef RT_NODE_SEARCH_INNER
+#undef RT_NODE_SEARCH_LEAF
+#undef RT_NODE_UPDATE_INNER
+#undef RT_NODE_DELETE_INNER
+#undef RT_NODE_DELETE_LEAF
+#undef RT_NODE_INSERT_INNER
+#undef RT_NODE_INSERT_LEAF
+#undef RT_NODE_INNER_ITERATE_NEXT
+#undef RT_NODE_LEAF_ITERATE_NEXT
+#undef RT_UPDATE_ITER_STACK
+#undef RT_ITER_UPDATE_KEY
+#undef RT_VERIFY_NODE
+
+#undef RT_DEBUG
diff --git a/src/include/lib/radixtree_delete_impl.h b/src/include/lib/radixtree_delete_impl.h
new file mode 100644
index 0000000000..99c90771b9
--- /dev/null
+++ b/src/include/lib/radixtree_delete_impl.h
@@ -0,0 +1,106 @@
+/* TODO: shrink nodes */
+
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE3_TYPE RT_NODE_INNER_3
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE3_TYPE RT_NODE_LEAF_3
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+
+#ifdef RT_NODE_LEVEL_LEAF
+ Assert(RT_NODE_IS_LEAF(node));
+#else
+ Assert(!RT_NODE_IS_LEAF(node));
+#endif
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node;
+ int idx = RT_NODE_3_SEARCH_EQ((RT_NODE_BASE_3 *) n3, chunk);
+
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_DELETE(n3->base.chunks, n3->values,
+ n3->base.n.count, idx);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_DELETE(n3->base.chunks, n3->children,
+ n3->base.n.count, idx);
+#endif
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
+ int idx = RT_NODE_32_SEARCH_EQ((RT_NODE_BASE_32 *) n32, chunk);
+
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_DELETE(n32->base.chunks, n32->values,
+ n32->base.n.count, idx);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_DELETE(n32->base.chunks, n32->children,
+ n32->base.n.count, idx);
+#endif
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+ int slotpos = n125->base.slot_idxs[chunk];
+ int idx;
+ int bitnum;
+
+ if (slotpos == RT_INVALID_SLOT_IDX)
+ return false;
+
+ idx = BM_IDX(slotpos);
+ bitnum = BM_BIT(slotpos);
+ n125->base.isset[idx] &= ~((bitmapword) 1 << bitnum);
+ n125->base.slot_idxs[chunk] = RT_INVALID_SLOT_IDX;
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk))
+#else
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, chunk))
+#endif
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_NODE_LEAF_256_DELETE(n256, chunk);
+#else
+ RT_NODE_INNER_256_DELETE(n256, chunk);
+#endif
+ break;
+ }
+ }
+
+ /* update statistics */
+ node->count--;
+
+ return true;
+
+#undef RT_NODE3_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_insert_impl.h b/src/include/lib/radixtree_insert_impl.h
new file mode 100644
index 0000000000..22aca0e6cc
--- /dev/null
+++ b/src/include/lib/radixtree_insert_impl.h
@@ -0,0 +1,317 @@
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE3_TYPE RT_NODE_INNER_3
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE3_TYPE RT_NODE_LEAF_3
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool chunk_exists = false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ const bool is_leaf = true;
+ Assert(RT_NODE_IS_LEAF(node));
+#else
+ const bool is_leaf = false;
+ Assert(!RT_NODE_IS_LEAF(node));
+#endif
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node;
+ int idx;
+
+ idx = RT_NODE_3_SEARCH_EQ(&n3->base, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+#ifdef RT_NODE_LEVEL_LEAF
+ n3->values[idx] = value;
+#else
+ n3->children[idx] = child;
+#endif
+ break;
+ }
+
+ if (unlikely(RT_NODE_MUST_GROW(n3)))
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+ RT_NODE32_TYPE *new32;
+ const uint8 new_kind = RT_NODE_KIND_32;
+ const RT_SIZE_CLASS new_class = RT_CLASS_32_MIN;
+
+ /* grow node from 3 to 32 */
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
+ newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, is_leaf);
+ new32 = (RT_NODE32_TYPE *) newnode;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_COPY(n3->base.chunks, n3->values,
+ new32->base.chunks, new32->values);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_COPY(n3->base.chunks, n3->children,
+ new32->base.chunks, new32->children);
+#endif
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
+ node = newnode;
+ }
+ else
+ {
+ int insertpos = RT_NODE_3_GET_INSERTPOS(&n3->base, chunk);
+ int count = n3->base.n.count;
+
+ /* shift chunks and children */
+ if (insertpos < count)
+ {
+ Assert(count > 0);
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_SHIFT(n3->base.chunks, n3->values,
+ count, insertpos);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_SHIFT(n3->base.chunks, n3->children,
+ count, insertpos);
+#endif
+ }
+
+ n3->base.chunks[insertpos] = chunk;
+#ifdef RT_NODE_LEVEL_LEAF
+ n3->values[insertpos] = value;
+#else
+ n3->children[insertpos] = child;
+#endif
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_32:
+ {
+ const RT_SIZE_CLASS_ELEM class32_max = RT_SIZE_CLASS_INFO[RT_CLASS_32_MAX];
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
+ int idx;
+
+ idx = RT_NODE_32_SEARCH_EQ(&n32->base, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+#ifdef RT_NODE_LEVEL_LEAF
+ n32->values[idx] = value;
+#else
+ n32->children[idx] = child;
+#endif
+ break;
+ }
+
+ if (unlikely(RT_NODE_MUST_GROW(n32)) &&
+ n32->base.n.fanout < class32_max.fanout)
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+ const RT_SIZE_CLASS_ELEM class32_min = RT_SIZE_CLASS_INFO[RT_CLASS_32_MIN];
+ const RT_SIZE_CLASS new_class = RT_CLASS_32_MAX;
+
+ Assert(n32->base.n.fanout == class32_min.fanout);
+
+ /* grow to the next size class of this kind */
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ n32 = (RT_NODE32_TYPE *) newnode;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ memcpy(newnode, node, class32_min.leaf_size);
+#else
+ memcpy(newnode, node, class32_min.inner_size);
+#endif
+ newnode->fanout = class32_max.fanout;
+
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
+ node = newnode;
+ }
+
+ if (unlikely(RT_NODE_MUST_GROW(n32)))
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+ RT_NODE125_TYPE *new125;
+ const uint8 new_kind = RT_NODE_KIND_125;
+ const RT_SIZE_CLASS new_class = RT_CLASS_125;
+
+ Assert(n32->base.n.fanout == class32_max.fanout);
+
+ /* grow node from 32 to 125 */
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
+ newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, is_leaf);
+ new125 = (RT_NODE125_TYPE *) newnode;
+
+ for (int i = 0; i < class32_max.fanout; i++)
+ {
+ new125->base.slot_idxs[n32->base.chunks[i]] = i;
+#ifdef RT_NODE_LEVEL_LEAF
+ new125->values[i] = n32->values[i];
+#else
+ new125->children[i] = n32->children[i];
+#endif
+ }
+
+ /*
+ * Since we just copied a dense array, we can set the bits
+ * using a single store, provided the length of that array
+ * is at most the number of bits in a bitmapword.
+ */
+ Assert(class32_max.fanout <= sizeof(bitmapword) * BITS_PER_BYTE);
+ new125->base.isset[0] = (bitmapword) (((uint64) 1 << class32_max.fanout) - 1);
+
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
+ node = newnode;
+ }
+ else
+ {
+ int insertpos = RT_NODE_32_GET_INSERTPOS(&n32->base, chunk);
+ int count = n32->base.n.count;
+
+ if (insertpos < count)
+ {
+ Assert(count > 0);
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_SHIFT(n32->base.chunks, n32->values,
+ count, insertpos);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_SHIFT(n32->base.chunks, n32->children,
+ count, insertpos);
+#endif
+ }
+
+ n32->base.chunks[insertpos] = chunk;
+#ifdef RT_NODE_LEVEL_LEAF
+ n32->values[insertpos] = value;
+#else
+ n32->children[insertpos] = child;
+#endif
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+ int slotpos = n125->base.slot_idxs[chunk];
+ int cnt = 0;
+
+ if (slotpos != RT_INVALID_SLOT_IDX)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+#ifdef RT_NODE_LEVEL_LEAF
+ n125->values[slotpos] = value;
+#else
+ n125->children[slotpos] = child;
+#endif
+ break;
+ }
+
+ if (unlikely(RT_NODE_MUST_GROW(n125)))
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+ RT_NODE256_TYPE *new256;
+ const uint8 new_kind = RT_NODE_KIND_256;
+ const RT_SIZE_CLASS new_class = RT_CLASS_256;
+
+ /* grow node from 125 to 256 */
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
+ newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, is_leaf);
+ new256 = (RT_NODE256_TYPE *) newnode;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n125->base.n.count; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(&n125->base, i))
+ continue;
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_NODE_LEAF_256_SET(new256, i, RT_NODE_LEAF_125_GET_VALUE(n125, i));
+#else
+ RT_NODE_INNER_256_SET(new256, i, RT_NODE_INNER_125_GET_CHILD(n125, i));
+#endif
+ cnt++;
+ }
+
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
+ node = newnode;
+ }
+ else
+ {
+ int idx;
+ bitmapword inverse;
+
+ /* get the first word with at least one bit not set */
+ for (idx = 0; idx < BM_IDX(RT_SLOT_IDX_LIMIT); idx++)
+ {
+ if (n125->base.isset[idx] < ~((bitmapword) 0))
+ break;
+ }
+
+ /* To get the first unset bit in X, get the first set bit in ~X */
+ inverse = ~(n125->base.isset[idx]);
+ slotpos = idx * BITS_PER_BITMAPWORD;
+ slotpos += bmw_rightmost_one_pos(inverse);
+ Assert(slotpos < node->fanout);
+
+ /* mark the slot used */
+ n125->base.isset[idx] |= bmw_rightmost_one(inverse);
+ n125->base.slot_idxs[chunk] = slotpos;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ n125->values[slotpos] = value;
+#else
+ n125->children[slotpos] = child;
+#endif
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ chunk_exists = RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk);
+#else
+ chunk_exists = RT_NODE_INNER_256_IS_CHUNK_USED(n256, chunk);
+#endif
+ Assert(chunk_exists || node->count < RT_NODE_MAX_SLOTS);
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_NODE_LEAF_256_SET(n256, chunk, value);
+#else
+ RT_NODE_INNER_256_SET(n256, chunk, child);
+#endif
+ break;
+ }
+ }
+
+ /* Update statistics */
+ if (!chunk_exists)
+ node->count++;
+
+ /*
+ * Done. Finally, verify that the chunk and value are inserted or replaced
+ * properly in the node.
+ */
+ RT_VERIFY_NODE(node);
+
+ return chunk_exists;
+
+#undef RT_NODE3_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
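
As an aside, the slot-allocation step in the RT_NODE_KIND_125 branch above boils down to the following standalone sketch. This is illustration only, not part of the patch; it assumes a 64-bit bitmapword so that pg_rightmost_one_pos64() from port/pg_bitutils.h applies directly.

#include "postgres.h"
#include "port/pg_bitutils.h"

/*
 * Find the first free (zero) bit in a 64-bit "isset" word: the first 0 bit
 * in X is the first 1 bit in ~X.  The caller must ensure a free slot exists.
 */
static int
first_unset_bit(uint64 isset)
{
	uint64		inverse = ~isset;

	Assert(inverse != 0);
	return pg_rightmost_one_pos64(inverse);
}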
diff --git a/src/include/lib/radixtree_iter_impl.h b/src/include/lib/radixtree_iter_impl.h
new file mode 100644
index 0000000000..823d7107c4
--- /dev/null
+++ b/src/include/lib/radixtree_iter_impl.h
@@ -0,0 +1,138 @@
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE3_TYPE RT_NODE_INNER_3
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE3_TYPE RT_NODE_LEAF_3
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ bool found = false;
+ uint8 key_chunk;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_VALUE_TYPE value;
+
+ Assert(RT_NODE_IS_LEAF(node_iter->node));
+#else
+ RT_PTR_LOCAL child = NULL;
+
+ Assert(!RT_NODE_IS_LEAF(node_iter->node));
+#endif
+
+#ifdef RT_SHMEM
+ Assert(iter->tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+
+ switch (node_iter->node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n3->base.n.count)
+ break;
+#ifdef RT_NODE_LEVEL_LEAF
+ value = n3->values[node_iter->current_idx];
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, n3->children[node_iter->current_idx]);
+#endif
+ key_chunk = n3->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n32->base.n.count)
+ break;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ value = n32->values[node_iter->current_idx];
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, n32->children[node_iter->current_idx]);
+#endif
+ key_chunk = n32->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (RT_NODE_125_IS_CHUNK_USED((RT_NODE_BASE_125 *) n125, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+#ifdef RT_NODE_LEVEL_LEAF
+ value = RT_NODE_LEAF_125_GET_VALUE(n125, i);
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, RT_NODE_INNER_125_GET_CHILD(n125, i));
+#endif
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+#ifdef RT_NODE_LEVEL_LEAF
+ if (RT_NODE_LEAF_256_IS_CHUNK_USED(n256, i))
+#else
+ if (RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
+#endif
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+#ifdef RT_NODE_LEVEL_LEAF
+ value = RT_NODE_LEAF_256_GET_VALUE(n256, i);
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, RT_NODE_INNER_256_GET_CHILD(n256, i));
+#endif
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ }
+
+ if (found)
+ {
+ RT_ITER_UPDATE_KEY(iter, key_chunk, node_iter->node->shift);
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = value;
+#endif
+ }
+
+#ifdef RT_NODE_LEVEL_LEAF
+ return found;
+#else
+ return child;
+#endif
+
+#undef RT_NODE3_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_search_impl.h b/src/include/lib/radixtree_search_impl.h
new file mode 100644
index 0000000000..c4352045c8
--- /dev/null
+++ b/src/include/lib/radixtree_search_impl.h
@@ -0,0 +1,131 @@
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE3_TYPE RT_NODE_INNER_3
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE3_TYPE RT_NODE_LEAF_3
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_VALUE_TYPE value = 0;
+
+ Assert(RT_NODE_IS_LEAF(node));
+#else
+#ifndef RT_ACTION_UPDATE
+ RT_PTR_ALLOC child = RT_INVALID_PTR_ALLOC;
+#endif
+ Assert(!RT_NODE_IS_LEAF(node));
+#endif
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node;
+ int idx = RT_NODE_3_SEARCH_EQ((RT_NODE_BASE_3 *) n3, chunk);
+
+#ifdef RT_ACTION_UPDATE
+ Assert(idx >= 0);
+ n3->children[idx] = new_child;
+#else
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ value = n3->values[idx];
+#else
+ child = n3->children[idx];
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
+ int idx = RT_NODE_32_SEARCH_EQ((RT_NODE_BASE_32 *) n32, chunk);
+
+#ifdef RT_ACTION_UPDATE
+ Assert(idx >= 0);
+ n32->children[idx] = new_child;
+#else
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ value = n32->values[idx];
+#else
+ child = n32->children[idx];
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+ int slotpos = n125->base.slot_idxs[chunk];
+
+#ifdef RT_ACTION_UPDATE
+ Assert(slotpos != RT_INVALID_SLOT_IDX);
+ n125->children[slotpos] = new_child;
+#else
+ if (slotpos == RT_INVALID_SLOT_IDX)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ value = RT_NODE_LEAF_125_GET_VALUE(n125, chunk);
+#else
+ child = RT_NODE_INNER_125_GET_CHILD(n125, chunk);
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
+
+#ifdef RT_ACTION_UPDATE
+ RT_NODE_INNER_256_SET(n256, chunk, new_child);
+#else
+#ifdef RT_NODE_LEVEL_LEAF
+ if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk))
+#else
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, chunk))
+#endif
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ value = RT_NODE_LEAF_256_GET_VALUE(n256, chunk);
+#else
+ child = RT_NODE_INNER_256_GET_CHILD(n256, chunk);
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ }
+
+#ifdef RT_ACTION_UPDATE
+ return;
+#else
+#ifdef RT_NODE_LEVEL_LEAF
+ Assert(value_p != NULL);
+ *value_p = value;
+#else
+ Assert(child_p != NULL);
+ *child_p = child;
+#endif
+
+ return true;
+#endif /* RT_ACTION_UPDATE */
+
+#undef RT_NODE3_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/utils/dsa.h b/src/include/utils/dsa.h
index 104386e674..c67f936880 100644
--- a/src/include/utils/dsa.h
+++ b/src/include/utils/dsa.h
@@ -117,6 +117,7 @@ extern dsa_handle dsa_get_handle(dsa_area *area);
extern dsa_pointer dsa_allocate_extended(dsa_area *area, size_t size, int flags);
extern void dsa_free(dsa_area *area, dsa_pointer dp);
extern void *dsa_get_address(dsa_area *area, dsa_pointer dp);
+extern size_t dsa_get_total_size(dsa_area *area);
extern void dsa_trim(dsa_area *area);
extern void dsa_dump(dsa_area *area);
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index c629cbe383..9659eb85d7 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -28,6 +28,7 @@ SUBDIRS = \
test_pg_db_role_setting \
test_pg_dump \
test_predtest \
+ test_radixtree \
test_rbtree \
test_regex \
test_rls_hooks \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 1baa6b558d..232cbdac80 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -24,6 +24,7 @@ subdir('test_parser')
subdir('test_pg_db_role_setting')
subdir('test_pg_dump')
subdir('test_predtest')
+subdir('test_radixtree')
subdir('test_rbtree')
subdir('test_regex')
subdir('test_rls_hooks')
diff --git a/src/test/modules/test_radixtree/.gitignore b/src/test/modules/test_radixtree/.gitignore
new file mode 100644
index 0000000000..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/test_radixtree/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_radixtree/Makefile b/src/test/modules/test_radixtree/Makefile
new file mode 100644
index 0000000000..da06b93da3
--- /dev/null
+++ b/src/test/modules/test_radixtree/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_radixtree/Makefile
+
+MODULE_big = test_radixtree
+OBJS = \
+ $(WIN32RES) \
+ test_radixtree.o
+PGFILEDESC = "test_radixtree - test code for src/include/lib/radixtree.h"
+
+EXTENSION = test_radixtree
+DATA = test_radixtree--1.0.sql
+
+REGRESS = test_radixtree
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_radixtree
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_radixtree/README b/src/test/modules/test_radixtree/README
new file mode 100644
index 0000000000..a8b271869a
--- /dev/null
+++ b/src/test/modules/test_radixtree/README
@@ -0,0 +1,7 @@
+test_radixtree contains unit tests for testing the radix tree implementation
+in src/include/lib/radixtree.h.
+
+The tests verify the correctness of the implementation, but they can also be
+used as a micro-benchmark. If you set the 'rt_test_stats' flag in
+test_radixtree.c, the tests will print extra information about execution time
+and memory usage.
diff --git a/src/test/modules/test_radixtree/expected/test_radixtree.out b/src/test/modules/test_radixtree/expected/test_radixtree.out
new file mode 100644
index 0000000000..ce645cb8b5
--- /dev/null
+++ b/src/test/modules/test_radixtree/expected/test_radixtree.out
@@ -0,0 +1,36 @@
+CREATE EXTENSION test_radixtree;
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
+NOTICE: testing basic operations with leaf node 4
+NOTICE: testing basic operations with inner node 4
+NOTICE: testing basic operations with leaf node 32
+NOTICE: testing basic operations with inner node 32
+NOTICE: testing basic operations with leaf node 125
+NOTICE: testing basic operations with inner node 125
+NOTICE: testing basic operations with leaf node 256
+NOTICE: testing basic operations with inner node 256
+NOTICE: testing radix tree node types with shift "0"
+NOTICE: testing radix tree node types with shift "8"
+NOTICE: testing radix tree node types with shift "16"
+NOTICE: testing radix tree node types with shift "24"
+NOTICE: testing radix tree node types with shift "32"
+NOTICE: testing radix tree node types with shift "40"
+NOTICE: testing radix tree node types with shift "48"
+NOTICE: testing radix tree node types with shift "56"
+NOTICE: testing radix tree with pattern "all ones"
+NOTICE: testing radix tree with pattern "alternating bits"
+NOTICE: testing radix tree with pattern "clusters of ten"
+NOTICE: testing radix tree with pattern "clusters of hundred"
+NOTICE: testing radix tree with pattern "one-every-64k"
+NOTICE: testing radix tree with pattern "sparse"
+NOTICE: testing radix tree with pattern "single values, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^60"
+ test_radixtree
+----------------
+
+(1 row)
+
diff --git a/src/test/modules/test_radixtree/meson.build b/src/test/modules/test_radixtree/meson.build
new file mode 100644
index 0000000000..6add06bbdb
--- /dev/null
+++ b/src/test/modules/test_radixtree/meson.build
@@ -0,0 +1,35 @@
+# FIXME: prevent install during main install, but not during test :/
+
+test_radixtree_sources = files(
+ 'test_radixtree.c',
+)
+
+if host_system == 'windows'
+ test_radixtree_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'test_radixtree',
'--FILEDESC', 'test_radixtree - test code for src/include/lib/radixtree.h',])
+endif
+
+test_radixtree = shared_module('test_radixtree',
+ test_radixtree_sources,
+ link_with: pgport_srv,
+ kwargs: pg_mod_args,
+)
+testprep_targets += test_radixtree
+
+install_data(
+ 'test_radixtree.control',
+ 'test_radixtree--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'test_radixtree',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'test_radixtree',
+ ],
+ },
+}
diff --git a/src/test/modules/test_radixtree/sql/test_radixtree.sql b/src/test/modules/test_radixtree/sql/test_radixtree.sql
new file mode 100644
index 0000000000..41ece5e9f5
--- /dev/null
+++ b/src/test/modules/test_radixtree/sql/test_radixtree.sql
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_radixtree;
+
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
diff --git a/src/test/modules/test_radixtree/test_radixtree--1.0.sql b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
new file mode 100644
index 0000000000..074a5a7ea7
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_radixtree/test_radixtree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_radixtree" to load this file. \quit
+
+CREATE FUNCTION test_radixtree()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
new file mode 100644
index 0000000000..64d46dfe9a
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -0,0 +1,660 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_radixtree.c
+ * Test radixtree set data structure.
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/test/modules/test_radixtree/test_radixtree.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "miscadmin.h"
+#include "nodes/bitmapset.h"
+#include "storage/block.h"
+#include "storage/itemptr.h"
+#include "storage/lwlock.h"
+#include "utils/memutils.h"
+#include "utils/timestamp.h"
+
+#define UINT64_HEX_FORMAT "%" INT64_MODIFIER "X"
+
+/*
+ * The tests pass with uint32, but build with warnings because the string
+ * format expects uint64.
+ */
+typedef uint64 TestValueType;
+
+/*
+ * If you enable this, the "pattern" tests will print information about
+ * how long populating, probing, and iterating the test set takes, and
+ * how much memory the test set consumed. That can be used as
+ * micro-benchmark of various operations and input patterns (you might
+ * want to increase the number of values used in each of the tests, if
+ * you do that, to reduce noise).
+ *
+ * The information is printed to the server's stderr, mostly because
+ * that's where MemoryContextStats() output goes.
+ */
+static const bool rt_test_stats = false;
+
+static int rt_node_kind_fanouts[] = {
+ 0,
+ 4, /* RT_NODE_KIND_4 */
+ 32, /* RT_NODE_KIND_32 */
+ 125, /* RT_NODE_KIND_125 */
+ 256 /* RT_NODE_KIND_256 */
+};
+/*
+ * A struct to define a pattern of integers, for use with the test_pattern()
+ * function.
+ */
+typedef struct
+{
+ char *test_name; /* short name of the test, for humans */
+ char *pattern_str; /* a bit pattern */
+ uint64 spacing; /* pattern repeats at this interval */
+ uint64 num_values; /* number of integers to set in total */
+} test_spec;
+
+/* Test patterns borrowed from test_integerset.c */
+static const test_spec test_specs[] = {
+ {
+ "all ones", "1111111111",
+ 10, 1000000
+ },
+ {
+ "alternating bits", "0101010101",
+ 10, 1000000
+ },
+ {
+ "clusters of ten", "1111111111",
+ 10000, 1000000
+ },
+ {
+ "clusters of hundred",
+ "1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111",
+ 10000, 1000000
+ },
+ {
+ "one-every-64k", "1",
+ 65536, 1000000
+ },
+ {
+ "sparse", "100000000000000000000000000000001",
+ 10000000, 1000000
+ },
+ {
+ "single values, distance > 2^32", "1",
+ UINT64CONST(10000000000), 100000
+ },
+ {
+ "clusters, distance > 2^32", "10101010",
+ UINT64CONST(10000000000), 1000000
+ },
+ {
+ "clusters, distance > 2^60", "10101010",
+ UINT64CONST(2000000000000000000),
+ 23 /* can't be much higher than this, or we
+ * overflow uint64 */
+ }
+};
+
+/* define the radix tree implementation to test */
+#define RT_PREFIX rt
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#define RT_USE_DELETE
+#define RT_VALUE_TYPE TestValueType
+// WIP: compiles with warnings because rt_attach is defined but not used
+// #define RT_SHMEM
+#include "lib/radixtree.h"
+
+
+/*
+ * Return the number of keys in the radix tree.
+ */
+static uint64
+rt_num_entries(rt_radix_tree *tree)
+{
+ return tree->ctl->num_keys;
+}
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(test_radixtree);
+
+static void
+test_empty(void)
+{
+ rt_radix_tree *radixtree;
+ rt_iter *iter;
+ TestValueType dummy;
+ uint64 key;
+ TestValueType val;
+
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+ dsa = dsa_create(tranche_id);
+
+ radixtree = rt_create(CurrentMemoryContext, dsa);
+#else
+ radixtree = rt_create(CurrentMemoryContext);
+#endif
+
+ if (rt_search(radixtree, 0, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, 1, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, PG_UINT64_MAX, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_delete(radixtree, 0))
+ elog(ERROR, "rt_delete on empty tree returned true");
+
+ if (rt_num_entries(radixtree) != 0)
+ elog(ERROR, "rt_num_entries on empty tree return non-zero");
+
+ iter = rt_begin_iterate(radixtree);
+
+ if (rt_iterate_next(iter, &key, &val))
+ elog(ERROR, "rt_itereate_next on empty tree returned true");
+
+ rt_end_iterate(iter);
+
+ rt_free(radixtree);
+}
+
+static void
+test_basic(int children, bool test_inner)
+{
+ rt_radix_tree *radixtree;
+ uint64 *keys;
+ int shift = test_inner ? 8 : 0;
+
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+ dsa = dsa_create(tranche_id);
+#endif
+
+ elog(NOTICE, "testing basic operations with %s node %d",
+ test_inner ? "inner" : "leaf", children);
+
+#ifdef RT_SHMEM
+ radixtree = rt_create(CurrentMemoryContext, dsa);
+#else
+ radixtree = rt_create(CurrentMemoryContext);
+#endif
+
+ /* prepare keys in order like 1, 32, 2, 31, 3, ... */
+ keys = palloc(sizeof(uint64) * children);
+ for (int i = 0; i < children; i++)
+ {
+ if (i % 2 == 0)
+ keys[i] = (uint64) ((i / 2) + 1) << shift;
+ else
+ keys[i] = (uint64) (children - (i / 2)) << shift;
+ }
+
+ /* insert keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (rt_set(radixtree, keys[i], (TestValueType) keys[i]))
+ elog(ERROR, "new inserted key 0x" UINT64_HEX_FORMAT " is found ", keys[i]);
+ }
+
+ /* look up keys */
+ for (int i = 0; i < children; i++)
+ {
+ TestValueType value;
+
+ if (!rt_search(radixtree, keys[i], &value))
+ elog(ERROR, "could not find key 0x" UINT64_HEX_FORMAT, keys[i]);
+ if (value != (TestValueType) keys[i])
+ elog(ERROR, "rt_search returned 0x" UINT64_HEX_FORMAT ", expected " UINT64_HEX_FORMAT,
+ value, (TestValueType) keys[i]);
+ }
+
+ /* update keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (!rt_set(radixtree, keys[i], (TestValueType) (keys[i] + 1)))
+ elog(ERROR, "could not update key 0x" UINT64_HEX_FORMAT, keys[i]);
+ }
+
+ /* repeat deleting and inserting keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (!rt_delete(radixtree, keys[i]))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, keys[i]);
+ if (rt_set(radixtree, keys[i], (TestValueType) keys[i]))
+ elog(ERROR, "new inserted key 0x" UINT64_HEX_FORMAT " is found ", keys[i]);
+ }
+
+ pfree(keys);
+ rt_free(radixtree);
+}
+
+/*
+ * Check if keys from start to end with the shift exist in the tree.
+ */
+static void
+check_search_on_node(rt_radix_tree *radixtree, uint8 shift, int start, int end,
+ int incr)
+{
+ for (int i = start; i < end; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ TestValueType val;
+
+ if (!rt_search(radixtree, key, &val))
+ elog(ERROR, "key 0x" UINT64_HEX_FORMAT " is not found on node-%d",
+ key, end);
+ if (val != (TestValueType) key)
+ elog(ERROR, "rt_search with key 0x" UINT64_HEX_FORMAT " returns 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ key, val, key);
+ }
+}
+
+static void
+test_node_types_insert(rt_radix_tree *radixtree, uint8 shift, bool insert_asc)
+{
+ uint64 num_entries;
+ int ninserted = 0;
+ int start = insert_asc ? 0 : 256;
+ int incr = insert_asc ? 1 : -1;
+ int end = insert_asc ? 256 : 0;
+ int node_kind_idx = 1;
+
+ for (int i = start; i != end; i += incr)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_set(radixtree, key, (TestValueType) key);
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " is found", key);
+
+ /*
+ * After filling all slots in each node type, check if the values
+ * are stored properly.
+ */
+ if (ninserted == rt_node_kind_fanouts[node_kind_idx] - 1)
+ {
+ int check_start = insert_asc
+ ? rt_node_kind_fanouts[node_kind_idx - 1]
+ : rt_node_kind_fanouts[node_kind_idx];
+ int check_end = insert_asc
+ ? rt_node_kind_fanouts[node_kind_idx]
+ : rt_node_kind_fanouts[node_kind_idx - 1];
+
+ check_search_on_node(radixtree, shift, check_start, check_end, incr);
+ node_kind_idx++;
+ }
+
+ ninserted++;
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ if (num_entries != 256)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(256));
+}
+
+static void
+test_node_types_delete(rt_radix_tree *radixtree, uint8 shift)
+{
+ uint64 num_entries;
+
+ for (int i = 0; i < 256; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_delete(radixtree, key);
+
+ if (!found)
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, key);
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ /* The tree must be empty */
+ if (num_entries != 0)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(0));
+}
+
+/*
+ * Test for inserting and deleting key-value pairs to each node type at the given shift
+ * level.
+ */
+static void
+test_node_types(uint8 shift)
+{
+ rt_radix_tree *radixtree;
+
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+ dsa = dsa_create(tranche_id);
+#endif
+
+ elog(NOTICE, "testing radix tree node types with shift \"%d\"", shift);
+
+#ifdef RT_SHMEM
+ radixtree = rt_create(CurrentMemoryContext, dsa);
+#else
+ radixtree = rt_create(CurrentMemoryContext);
+#endif
+
+ /*
+ * Insert and search entries for every node type at the 'shift' level,
+ * then delete all entries to make it empty, and insert and search entries
+ * again.
+ */
+ test_node_types_insert(radixtree, shift, true);
+ test_node_types_delete(radixtree, shift);
+ test_node_types_insert(radixtree, shift, false);
+
+ rt_free(radixtree);
+}
+
+/*
+ * Test with a repeating pattern, defined by the 'spec'.
+ */
+static void
+test_pattern(const test_spec * spec)
+{
+ rt_radix_tree *radixtree;
+ rt_iter *iter;
+ MemoryContext radixtree_ctx;
+ TimestampTz starttime;
+ TimestampTz endtime;
+ uint64 n;
+ uint64 last_int;
+ uint64 ndeleted;
+ uint64 nbefore;
+ uint64 nafter;
+ int patternlen;
+ uint64 *pattern_values;
+ uint64 pattern_num_values;
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+ dsa = dsa_create(tranche_id);
+#endif
+
+ elog(NOTICE, "testing radix tree with pattern \"%s\"", spec->test_name);
+ if (rt_test_stats)
+ fprintf(stderr, "-----\ntesting radix tree with pattern \"%s\"\n", spec->test_name);
+
+ /* Pre-process the pattern, creating an array of integers from it. */
+ patternlen = strlen(spec->pattern_str);
+ pattern_values = palloc(patternlen * sizeof(uint64));
+ pattern_num_values = 0;
+ for (int i = 0; i < patternlen; i++)
+ {
+ if (spec->pattern_str[i] == '1')
+ pattern_values[pattern_num_values++] = i;
+ }
+
+ /*
+ * Allocate the radix tree.
+ *
+ * Allocate it in a separate memory context, so that we can print its
+ * memory usage easily.
+ */
+ radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
+ "radixtree test",
+ ALLOCSET_SMALL_SIZES);
+ MemoryContextSetIdentifier(radixtree_ctx, spec->test_name);
+
+#ifdef RT_SHMEM
+ radixtree = rt_create(radixtree_ctx, dsa);
+#else
+ radixtree = rt_create(radixtree_ctx);
+#endif
+
+
+ /*
+ * Add values to the set.
+ */
+ starttime = GetCurrentTimestamp();
+
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ uint64 x = 0;
+
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ bool found;
+
+ x = last_int + pattern_values[i];
+
+ found = rt_set(radixtree, x, (TestValueType) x);
+
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " found", x);
+
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+
+ endtime = GetCurrentTimestamp();
+
+ if (rt_test_stats)
+ fprintf(stderr, "added " UINT64_FORMAT " values in %d ms\n",
+ spec->num_values, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Print stats on the amount of memory used.
+ *
+ * We print the usage reported by rt_memory_usage(), as well as the stats
+ * from the memory context. They should be in the same ballpark, but it's
+ * hard to automate testing that, so if you're making changes to the
+ * implementation, just observe that manually.
+ */
+ if (rt_test_stats)
+ {
+ uint64 mem_usage;
+
+ /*
+ * Also print memory usage as reported by rt_memory_usage(). It
+ * should be in the same ballpark as the usage reported by
+ * MemoryContextStats().
+ */
+ mem_usage = rt_memory_usage(radixtree);
+ fprintf(stderr, "rt_memory_usage() reported " UINT64_FORMAT " (%0.2f bytes / integer)\n",
+ mem_usage, (double) mem_usage / spec->num_values);
+
+ MemoryContextStats(radixtree_ctx);
+ }
+
+ /* Check that rt_num_entries works */
+ n = rt_num_entries(radixtree);
+ if (n != spec->num_values)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT, n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_search()
+ */
+ starttime = GetCurrentTimestamp();
+
+ for (n = 0; n < 100000; n++)
+ {
+ bool found;
+ bool expected;
+ uint64 x;
+ TestValueType v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Do we expect this value to be present in the set? */
+ if (x >= last_int)
+ expected = false;
+ else
+ {
+ uint64 idx = x % spec->spacing;
+
+ if (idx >= patternlen)
+ expected = false;
+ else if (spec->pattern_str[idx] == '1')
+ expected = true;
+ else
+ expected = false;
+ }
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (found != expected)
+ elog(ERROR, "mismatch at 0x" UINT64_HEX_FORMAT ": %d vs %d", x, found, expected);
+ if (found && (v != (TestValueType) x))
+ elog(ERROR, "found 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ v, x);
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "probed " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Test iterator
+ */
+ starttime = GetCurrentTimestamp();
+
+ iter = rt_begin_iterate(radixtree);
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ uint64 expected = last_int + pattern_values[i];
+ uint64 x;
+ TestValueType val;
+
+ if (!rt_iterate_next(iter, &x, &val))
+ break;
+
+ if (x != expected)
+ elog(ERROR,
+ "iterate returned wrong key; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d",
+ x, expected, i);
+ if (val != (TestValueType) expected)
+ elog(ERROR,
+ "iterate returned wrong value; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d", x, expected, i);
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "iterated " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ rt_end_iterate(iter);
+
+ if (n < spec->num_values)
+ elog(ERROR, "iterator stopped short after " UINT64_FORMAT " entries, expected " UINT64_FORMAT, n, spec->num_values);
+ if (n > spec->num_values)
+ elog(ERROR, "iterator returned " UINT64_FORMAT " entries, " UINT64_FORMAT " was expected", n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_delete()
+ */
+ starttime = GetCurrentTimestamp();
+
+ nbefore = rt_num_entries(radixtree);
+ ndeleted = 0;
+ for (n = 0; n < 1; n++)
+ {
+ bool found;
+ uint64 x;
+ TestValueType v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (!found)
+ continue;
+
+ /* If the key is found, delete it and check again */
+ if (!rt_delete(radixtree, x))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_search(radixtree, x, &v))
+ elog(ERROR, "found deleted key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_delete(radixtree, x))
+ elog(ERROR, "deleted already-deleted key 0x" UINT64_HEX_FORMAT, x);
+
+ ndeleted++;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "deleted " UINT64_FORMAT " values in %d ms\n",
+ ndeleted, (int) (endtime - starttime) / 1000);
+
+ nafter = rt_num_entries(radixtree);
+
+ /* Check that rt_num_entries works */
+ if ((nbefore - ndeleted) != nafter)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT "after " UINT64_FORMAT " deletion",
+ nafter, (nbefore - ndeleted), ndeleted);
+
+ rt_free(radixtree);
+ MemoryContextDelete(radixtree_ctx);
+}
+
+Datum
+test_radixtree(PG_FUNCTION_ARGS)
+{
+ test_empty();
+
+ for (int i = 1; i < lengthof(rt_node_kind_fanouts); i++)
+ {
+ test_basic(rt_node_kind_fanouts[i], false);
+ test_basic(rt_node_kind_fanouts[i], true);
+ }
+
+ for (int shift = 0; shift <= (64 - 8); shift += 8)
+ test_node_types(shift);
+
+ /* Test different test patterns, with lots of entries */
+ for (int i = 0; i < lengthof(test_specs); i++)
+ test_pattern(&test_specs[i]);
+
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_radixtree/test_radixtree.control b/src/test/modules/test_radixtree/test_radixtree.control
new file mode 100644
index 0000000000..e53f2a3e0c
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.control
@@ -0,0 +1,4 @@
+comment = 'Test code for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/test_radixtree'
+relocatable = true
diff --git a/src/tools/pginclude/cpluspluscheck b/src/tools/pginclude/cpluspluscheck
index e52fe9f509..09fa6e7432 100755
--- a/src/tools/pginclude/cpluspluscheck
+++ b/src/tools/pginclude/cpluspluscheck
@@ -101,6 +101,12 @@ do
test "$f" = src/include/nodes/nodetags.h && continue
test "$f" = src/backend/nodes/nodetags.h && continue
+ # radixtree_*_impl.h cannot be included standalone: they are just code fragments.
+ test "$f" = src/include/lib/radixtree_delete_impl.h && continue
+ test "$f" = src/include/lib/radixtree_insert_impl.h && continue
+ test "$f" = src/include/lib/radixtree_iter_impl.h && continue
+ test "$f" = src/include/lib/radixtree_search_impl.h && continue
+
# These files are not meant to be included standalone, because
# they contain lists that might have multiple use-cases.
test "$f" = src/include/access/rmgrlist.h && continue
diff --git a/src/tools/pginclude/headerscheck b/src/tools/pginclude/headerscheck
index abbba7aa63..d4d2f1da03 100755
--- a/src/tools/pginclude/headerscheck
+++ b/src/tools/pginclude/headerscheck
@@ -96,6 +96,12 @@ do
test "$f" = src/include/nodes/nodetags.h && continue
test "$f" = src/backend/nodes/nodetags.h && continue
+ # radixtree_*_impl.h cannot be included standalone: they are just code fragments.
+ test "$f" = src/include/lib/radixtree_delete_impl.h && continue
+ test "$f" = src/include/lib/radixtree_insert_impl.h && continue
+ test "$f" = src/include/lib/radixtree_iter_impl.h && continue
+ test "$f" = src/include/lib/radixtree_search_impl.h && continue
+
# These files are not meant to be included standalone, because
# they contain lists that might have multiple use-cases.
test "$f" = src/include/access/rmgrlist.h && continue
--
2.31.1
Attachment: v23-0001-introduce-vector8_min-and-vector8_highbit_mask.patch (application/octet-stream)
From 990c01fbf68b39b5f2c6109440f63e6c305ba7f0 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 24 Oct 2022 14:07:09 +0900
Subject: [PATCH v23 01/18] introduce vector8_min and vector8_highbit_mask
TODO: commit message
TODO: Remove uint64 case.
separate-commit TODO: move non-SIMD fallbacks to own header
to clean up the #ifdef maze.
---
src/include/port/simd.h | 47 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 47 insertions(+)
diff --git a/src/include/port/simd.h b/src/include/port/simd.h
index c836360d4b..84d41a340a 100644
--- a/src/include/port/simd.h
+++ b/src/include/port/simd.h
@@ -77,6 +77,7 @@ static inline bool vector8_has(const Vector8 v, const uint8 c);
static inline bool vector8_has_zero(const Vector8 v);
static inline bool vector8_has_le(const Vector8 v, const uint8 c);
static inline bool vector8_is_highbit_set(const Vector8 v);
+static inline uint32 vector8_highbit_mask(const Vector8 v);
#ifndef USE_NO_SIMD
static inline bool vector32_is_highbit_set(const Vector32 v);
#endif
@@ -96,6 +97,7 @@ static inline Vector8 vector8_ssub(const Vector8 v1, const Vector8 v2);
*/
#ifndef USE_NO_SIMD
static inline Vector8 vector8_eq(const Vector8 v1, const Vector8 v2);
+static inline Vector8 vector8_min(const Vector8 v1, const Vector8 v2);
static inline Vector32 vector32_eq(const Vector32 v1, const Vector32 v2);
#endif
@@ -277,6 +279,36 @@ vector8_is_highbit_set(const Vector8 v)
#endif
}
+/*
+ * Return the bitmak of the high-bit of each element.
+ */
+static inline uint32
+vector8_highbit_mask(const Vector8 v)
+{
+#ifdef USE_SSE2
+ return (uint32) _mm_movemask_epi8(v);
+#elif defined(USE_NEON)
+ static const uint8 mask[16] = {
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ };
+
+ uint8x16_t masked = vandq_u8(vld1q_u8(mask), (uint8x16_t) vshrq_n_s8(v, 7));
+ uint8x16_t maskedhi = vextq_u8(masked, masked, 8);
+
+ return (uint32) vaddvq_u16((uint16x8_t) vzip1q_u8(masked, maskedhi));
+#else
+ uint32 mask = 0;
+
+ for (Size i = 0; i < sizeof(Vector8); i++)
+ mask |= (((const uint8 *) &v)[i] >> 7) << i;
+
+ return mask;
+#endif
+}
+
/*
* Exactly like vector8_is_highbit_set except for the input type, so it
* looks at each byte separately.
@@ -372,4 +404,19 @@ vector32_eq(const Vector32 v1, const Vector32 v2)
}
#endif /* ! USE_NO_SIMD */
+/*
+ * Compare the given vectors and return the vector of minimum elements.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_min(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+ return _mm_min_epu8(v1, v2);
+#elif defined(USE_NEON)
+ return vminq_u8(v1, v2);
+#endif
+}
+#endif /* ! USE_NO_SIMD */
+
#endif /* SIMD_H */
--
2.31.1
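
To show how the new helper is intended to be used (for example, by the node-32 chunk search in the radix tree patch above), here is a rough usage sketch. It is not part of the patch, assumes USE_NO_SIMD is not defined so that vector8_eq() is available, and the helper name chunk_search_eq is made up for illustration:

#include "postgres.h"
#include "port/pg_bitutils.h"
#include "port/simd.h"

/*
 * Return the index of the first byte in 'chunks' equal to 'key', or -1 if
 * none matches.  'count' must be no larger than sizeof(Vector8), and the
 * array must be at least sizeof(Vector8) bytes long so the load is safe.
 */
static inline int
chunk_search_eq(const uint8 *chunks, uint8 key, int count)
{
	Vector8		spread_chunk = vector8_broadcast(key);
	Vector8		haystack;
	uint32		bitfield;

	vector8_load(&haystack, chunks);

	/* matching bytes become 0xFF; collect their high bits into a bitmask */
	bitfield = vector8_highbit_mask(vector8_eq(haystack, spread_chunk));

	/* ignore lanes beyond the number of valid entries */
	bitfield &= ((uint32) 1 << count) - 1;

	return (bitfield != 0) ? pg_rightmost_one_pos32(bitfield) : -1;
}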
On Thu, Jan 26, 2023 at 3:33 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
On Thu, Jan 26, 2023 at 3:54 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
I think that we need to prevent concurrent updates (RT_SET() and
RT_DELETE()) during the iteration to get the consistent result through
the whole iteration operation. Unlike other operations such as
RT_SET(), we cannot expect that a job doing something for each
key-value pair in the radix tree completes in a short time, so we
cannot keep holding the radix tree lock until the end of the
iteration.
This sounds like a performance concern, rather than a correctness concern,
is that right? If so, I don't think we should worry too much about
optimizing simple locking, because it will *never* be fast enough for
highly-concurrent read-write workloads anyway, and anyone interested in
those workloads will have to completely replace the locking scheme,
possibly using one of the ideas in the last ART paper you mentioned.
The first implementation should be simple, easy to test/verify, easy to
understand, and easy to replace. As much as possible anyway.
So the idea is that we set iter_active to true (with the
lock in exclusive mode), and prevent concurrent updates when the flag
is true.
...by throwing elog(ERROR)? I'm not so sure users of this API would prefer
that to waiting.
Since there were calls to LWLockAcquire/Release in the last version,
I'm a bit confused by this. Perhaps for the next patch, the email should
contain a few sentences describing how locking is intended to work,
including for iteration.
The lock I'm thinking of adding is a simple readers-writer lock. This
lock is used for concurrent radix tree operations except for the
iteration. For operations concurrent to the iteration, I used a flag
for the reason I mentioned above.
This doesn't tell me anything -- we already agreed on "simple reader-writer
lock", months ago I believe. And I only have a vague idea about the
tradeoffs made regarding iteration.
+ * WIP: describe about how locking works.
A first draft of what is intended for this WIP would be a good start. This
WIP is from v23-0016, which contains no comments and a one-line commit
message. I'd rather not try closely studying that patch (or how it works
with 0011) until I have a clearer understanding of what requirements are
assumed, what trade-offs are considered, and how it should be tested.
[thinks some more...] Is there an API-level assumption that hasn't been
spelled out? Would it help to have a parameter for whether the iteration
function wants to reserve the privilege to perform writes? It could take
the appropriate lock at the start, and there could then be multiple
read-only iterators, but only one read/write iterator. Note, I'm just
guessing here, and I don't want to make things more difficult for future
improvements.
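To make that concrete, a minimal sketch of such an API could look like the following; the _ext function names, the opts struct, and the ctl->lock field are hypothetical, not taken from the patch:

#include "postgres.h"
#include "storage/lwlock.h"

/* Hypothetical: the caller states up front whether it will write during iteration. */
typedef struct rt_iter_opts
{
	bool		exclusive;		/* true if the caller intends to modify the tree */
} rt_iter_opts;

static rt_iter *
rt_begin_iterate_ext(rt_radix_tree *tree, const rt_iter_opts *opts)
{
	/*
	 * Shared mode allows many concurrent read-only iterators; exclusive mode
	 * allows a single iterator that may also call rt_set()/rt_delete().
	 */
	LWLockAcquire(&tree->ctl->lock, opts->exclusive ? LW_EXCLUSIVE : LW_SHARED);
	return rt_begin_iterate(tree);
}

static void
rt_end_iterate_ext(rt_radix_tree *tree, rt_iter *iter)
{
	rt_end_iterate(iter);
	LWLockRelease(&tree->ctl->lock);
}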
Hmm, I wonder if we need to use the isolation tester. It's both a
blessing and a curse that the first client of this data structure is tid
lookup. It's a blessing because it doesn't present a highly-concurrent
workload mixing reads and writes and so simple locking is adequate. It's a
curse because to test locking and have any chance of finding bugs, we can't
rely on vacuum to tell us that because (as you've said) it might very well
work fine with no locking at all. So we must come up with test cases
ourselves.
Using the isolation tester to test locking seems like a good idea. We
can include it in test_radixtree. But given that the locking in the
radix tree is very simple, the test case would be very simple. It may
be controversial whether it's worth adding such testing by adding both
the new test module and test cases.
I mean that the isolation tester (or something else) would contain test
cases. I didn't mean to imply redundant testing.
I think the user (e.g, vacuumlazy.c) can pass the maximum offset
number to the parallel vacuum.
Okay, sounds good.
Most of v23's cleanups/fixes in the radix template look good to me,
although I didn't read the debugging code very closely. There is one
exception:
0006 - I've never heard of memset'ing a variable to avoid "variable unused"
compiler warnings, and it seems strange. It turns out we don't actually
need this variable in the first place. The attached .txt patch removes the
local variable and just writes to the passed pointer. This required callers
to initialize a couple of their own variables, but only child pointers, at
least on gcc 12. And I will work later on making "value" in the public API
a pointer.
0017 - I haven't taken a close look at the new changes, but I did notice
this some time ago:
+ if (TidStoreIsShared(ts))
+ return sizeof(TidStore) + shared_rt_memory_usage(ts->tree.shared);
+ else
+ return sizeof(TidStore) + sizeof(TidStore) +
+ local_rt_memory_usage(ts->tree.local);
There is repetition in the else branch.
--
John Naylor
EDB: http://www.enterprisedb.com
Attachments:
Attachment: remove-intermediate-variables.txt (text/plain)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 542daae6d0..c2ee7f4fa1 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -1618,7 +1618,7 @@ RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE value)
/* Descend the tree until we reach a leaf node */
while (shift >= 0)
{
- RT_PTR_ALLOC new_child;
+ RT_PTR_ALLOC new_child = RT_INVALID_PTR_ALLOC;
child = RT_PTR_GET_LOCAL(tree, stored_child);
@@ -1678,7 +1678,7 @@ RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p)
/* Descend the tree until a leaf node */
while (shift >= 0)
{
- RT_PTR_ALLOC child;
+ RT_PTR_ALLOC child = RT_INVALID_PTR_ALLOC;
if (RT_NODE_IS_LEAF(node))
break;
@@ -1742,7 +1742,7 @@ RT_DELETE(RT_RADIX_TREE *tree, uint64 key)
level = -1;
while (shift > 0)
{
- RT_PTR_ALLOC child;
+ RT_PTR_ALLOC child = RT_INVALID_PTR_ALLOC;
/* Push the current node to the stack */
stack[++level] = allocnode;
diff --git a/src/include/lib/radixtree_search_impl.h b/src/include/lib/radixtree_search_impl.h
index a319c46c39..c8410e9a5c 100644
--- a/src/include/lib/radixtree_search_impl.h
+++ b/src/include/lib/radixtree_search_impl.h
@@ -15,13 +15,11 @@
uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
#ifdef RT_NODE_LEVEL_LEAF
- RT_VALUE_TYPE value;
- MemSet(&value, 0, sizeof(RT_VALUE_TYPE));
-
+ Assert(value_p != NULL);
Assert(RT_NODE_IS_LEAF(node));
#else
#ifndef RT_ACTION_UPDATE
- RT_PTR_ALLOC child = RT_INVALID_PTR_ALLOC;
+ Assert(child_p != NULL);
#endif
Assert(!RT_NODE_IS_LEAF(node));
#endif
@@ -41,9 +39,9 @@
return false;
#ifdef RT_NODE_LEVEL_LEAF
- value = n3->values[idx];
+ *value_p = n3->values[idx];
#else
- child = n3->children[idx];
+ *child_p = n3->children[idx];
#endif
#endif /* RT_ACTION_UPDATE */
break;
@@ -61,9 +59,9 @@
return false;
#ifdef RT_NODE_LEVEL_LEAF
- value = n32->values[idx];
+ *value_p = n32->values[idx];
#else
- child = n32->children[idx];
+ *child_p = n32->children[idx];
#endif
#endif /* RT_ACTION_UPDATE */
break;
@@ -81,9 +79,9 @@
return false;
#ifdef RT_NODE_LEVEL_LEAF
- value = RT_NODE_LEAF_125_GET_VALUE(n125, chunk);
+ *value_p = RT_NODE_LEAF_125_GET_VALUE(n125, chunk);
#else
- child = RT_NODE_INNER_125_GET_CHILD(n125, chunk);
+ *child_p = RT_NODE_INNER_125_GET_CHILD(n125, chunk);
#endif
#endif /* RT_ACTION_UPDATE */
break;
@@ -103,9 +101,9 @@
return false;
#ifdef RT_NODE_LEVEL_LEAF
- value = RT_NODE_LEAF_256_GET_VALUE(n256, chunk);
+ *value_p = RT_NODE_LEAF_256_GET_VALUE(n256, chunk);
#else
- child = RT_NODE_INNER_256_GET_CHILD(n256, chunk);
+ *child_p = RT_NODE_INNER_256_GET_CHILD(n256, chunk);
#endif
#endif /* RT_ACTION_UPDATE */
break;
@@ -115,14 +113,6 @@
#ifdef RT_ACTION_UPDATE
return;
#else
-#ifdef RT_NODE_LEVEL_LEAF
- Assert(value_p != NULL);
- *value_p = value;
-#else
- Assert(child_p != NULL);
- *child_p = child;
-#endif
-
return true;
#endif /* RT_ACTION_UPDATE */
On Sat, Jan 28, 2023 at 8:33 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Thu, Jan 26, 2023 at 3:33 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Thu, Jan 26, 2023 at 3:54 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
I think that we need to prevent concurrent updates (RT_SET() and
RT_DELETE()) during the iteration to get the consistent result through
the whole iteration operation. Unlike other operations such as
RT_SET(), we cannot expect that a job doing something for each
key-value pair in the radix tree completes in a short time, so we
cannot keep holding the radix tree lock until the end of the
iteration.
This sounds like a performance concern, rather than a correctness concern, is that right? If so, I don't think we should worry too much about optimizing simple locking, because it will *never* be fast enough for highly-concurrent read-write workloads anyway, and anyone interested in those workloads will have to completely replace the locking scheme, possibly using one of the ideas in the last ART paper you mentioned.
The first implementation should be simple, easy to test/verify, easy to understand, and easy to replace. As much as possible anyway.
Yes, but if a concurrent writer waits for another process to finish
the iteration, it ends up waiting on a lwlock, which is not
interruptible.
So the idea is that we set iter_active to true (with the
lock in exclusive mode), and prevent concurrent updates when the flag
is true.
...by throwing elog(ERROR)? I'm not so sure users of this API would prefer that to waiting.
Right. I think if we want to wait rather than an ERROR, the waiter
should wait in an interruptible way, for example, a condition
variable. I did a simpler way in the v22 patch.
...but looking at dshash.c, dshash_seq_next() seems to return an entry
while holding a lwlock on the partition. My assumption might be wrong.
Since there were calls to LWLockAcquire/Release in the last version, I'm a bit confused by this. Perhaps for the next patch, the email should contain a few sentences describing how locking is intended to work, including for iteration.
The lock I'm thinking of adding is a simple readers-writer lock. This
lock is used for concurrent radix tree operations except for the
iteration. For operations concurrent to the iteration, I used a flag
for the reason I mentioned above.
This doesn't tell me anything -- we already agreed on "simple reader-writer lock", months ago I believe. And I only have a vague idea about the tradeoffs made regarding iteration.
+ * WIP: describe about how locking works.
A first draft of what is intended for this WIP would be a good start. This WIP is from v23-0016, which contains no comments and a one-line commit message. I'd rather not try closely studying that patch (or how it works with 0011) until I have a clearer understanding of what requirements are assumed, what trade-offs are considered, and how it should be tested.
[thinks some more...] Is there an API-level assumption that hasn't been spelled out? Would it help to have a parameter for whether the iteration function wants to reserve the privilege to perform writes? It could take the appropriate lock at the start, and there could then be multiple read-only iterators, but only one read/write iterator. Note, I'm just guessing here, and I don't want to make things more difficult for future improvements.
Seems a good idea. Given the use case for parallel heap vacuum, it
would be a good idea to support having multiple read-only iterators. The
iteration of the v22 is read-only, so if we want to support read-write
iterator, we would need to support a function that modifies the
current key-value returned by the iteration.
Hmm, I wonder if we need to use the isolation tester. It's both a blessing and a curse that the first client of this data structure is tid lookup. It's a blessing because it doesn't present a highly-concurrent workload mixing reads and writes and so simple locking is adequate. It's a curse because to test locking and have any chance of finding bugs, we can't rely on vacuum to tell us that because (as you've said) it might very well work fine with no locking at all. So we must come up with test cases ourselves.
Using the isolation tester to test locking seems like a good idea. We
can include it in test_radixtree. But given that the locking in the
radix tree is very simple, the test case would be very simple. It may
be controversial whether it's worth adding such testing by adding both
the new test module and test cases.
I mean that the isolation tester (or something else) would contain test cases. I didn't mean to imply redundant testing.
Okay, understood.
I think the user (e.g, vacuumlazy.c) can pass the maximum offset
number to the parallel vacuum.
Okay, sounds good.
Most of v23's cleanups/fixes in the radix template look good to me, although I didn't read the debugging code very closely. There is one exception:
0006 - I've never heard of memset'ing a variable to avoid "variable unused" compiler warnings, and it seems strange. It turns out we don't actually need this variable in the first place. The attached .txt patch removes the local variable and just writes to the passed pointer. This required callers to initialize a couple of their own variables, but only child pointers, at least on gcc 12.
Agreed with the attached patch.
And I will work later on making "value" in the public API a pointer.
Thanks!
0017 - I haven't taken a close look at the new changes, but I did notice this some time ago:
+ if (TidStoreIsShared(ts))
+ return sizeof(TidStore) + shared_rt_memory_usage(ts->tree.shared);
+ else
+ return sizeof(TidStore) + sizeof(TidStore) +
+ local_rt_memory_usage(ts->tree.local);
There is repetition in the else branch.
Agreed, will remove.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Thu, Jan 26, 2023 at 12:39 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Tue, Jan 24, 2023 at 1:17 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Mon, Jan 23, 2023 at 6:00 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
Attached is a rebase to fix conflicts from recent commits.
I have reviewed v22-0022* patch and I have some comments.
1.
It also changes to the column names max_dead_tuples and num_dead_tuples and to
show the progress information in bytes.
I think this statement needs to be rephrased.
Could you be more specific?
I mean the below statement in the commit message doesn't look
grammatically correct to me.
"It also changes to the column names max_dead_tuples and
num_dead_tuples and to show the progress information in bytes."
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Sun, Jan 29, 2023 at 9:50 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
On Sat, Jan 28, 2023 at 8:33 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
The first implementation should be simple, easy to test/verify, easy to
understand, and easy to replace. As much as possible anyway.
Yes, but if a concurrent writer waits for another process to finish
the iteration, it ends up waiting on a lwlock, which is not
interruptible.
So the idea is that we set iter_active to true (with the
lock in exclusive mode), and prevent concurrent updates when the flag
is true.
...by throwing elog(ERROR)? I'm not so sure users of this API would
prefer that to waiting.
Right. I think if we want to wait rather than an ERROR, the waiter
should wait in an interruptible way, for example, a condition
variable. I did a simpler way in the v22 patch.
...but looking at dshash.c, dshash_seq_next() seems to return an entry
while holding a lwlock on the partition. My assumption might be wrong.
Using partitions there makes holding a lock less painful on average, I
imagine, but I don't know the details there.
If we make it clear that the first committed version is not (yet) designed
for high concurrency with mixed read-write workloads, I think waiting (as a
protocol) is fine. If waiting is a problem for some use case, at that point
we should just go all the way and replace the locking entirely. In fact, it
might be good to spell this out in the top-level comment and include a link
to the second ART paper.
[thinks some more...] Is there an API-level assumption that hasn't been
spelled out? Would it help to have a parameter for whether the iteration
function wants to reserve the privilege to perform writes? It could take
the appropriate lock at the start, and there could then be multiple
read-only iterators, but only one read/write iterator. Note, I'm just
guessing here, and I don't want to make things more difficult for future
improvements.
Seems a good idea. Given the use case for parallel heap vacuum, it
would be a good idea to support having multiple read-only iterators. The
iteration of the v22 is read-only, so if we want to support read-write
iterator, we would need to support a function that modifies the
current key-value returned by the iteration.
Okay, so updating during iteration is not currently supported. It could in
the future, but I'd say that can also wait for fine-grained concurrency
support. Intermediate-term, we should at least make it straightforward to
support:
1) parallel heap vacuum -> multiple read-only iterators
2) parallel heap pruning -> multiple writers
It may or may not be worth it for someone to actually start either of those
projects, and there are other ways to improve vacuum that may be more
pressing. That said, it seems the tid store with global locking would
certainly work fine for #1 and maybe "not too bad" for #2. #2 can also
mitigate waiting by using larger batching, or the leader process could
"pre-warm" the tid store with zero-values using block numbers from the
visibility map.
--
John Naylor
EDB: http://www.enterprisedb.com
On Mon, Jan 30, 2023 at 1:08 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Thu, Jan 26, 2023 at 12:39 PM John Naylor
<john.naylor@enterprisedb.com> wrote:On Tue, Jan 24, 2023 at 1:17 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Mon, Jan 23, 2023 at 6:00 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
Attached is a rebase to fix conflicts from recent commits.
I have reviewed v22-0022* patch and I have some comments.
1.
It also changes to the column names max_dead_tuples and num_dead_tuples and to
show the progress information in bytes.
I think this statement needs to be rephrased.
Could you be more specific?
I mean the below statement in the commit message doesn't look
grammatically correct to me."It also changes to the column names max_dead_tuples and
num_dead_tuples and to show the progress information in bytes."
I've changed the commit message in the v23 patch. Please check it.
Other comments are also incorporated in the v23 patch. Thank you for
the comments!
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Mon, Jan 30, 2023 at 1:31 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Sun, Jan 29, 2023 at 9:50 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Sat, Jan 28, 2023 at 8:33 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
The first implementation should be simple, easy to test/verify, easy to understand, and easy to replace. As much as possible anyway.
Yes, but if a concurrent writer waits for another process to finish
the iteration, it ends up waiting on a lwlock, which is not
interruptible.
So the idea is that we set iter_active to true (with the
lock in exclusive mode), and prevent concurrent updates when the flag
is true.
...by throwing elog(ERROR)? I'm not so sure users of this API would prefer that to waiting.
Right. I think if we want to wait rather than an ERROR, the waiter
should wait in an interruptible way, for example, a condition
variable. I did a simpler way in the v22 patch.
...but looking at dshash.c, dshash_seq_next() seems to return an entry
while holding a lwlock on the partition. My assumption might be wrong.
Using partitions there makes holding a lock less painful on average, I imagine, but I don't know the details there.
If we make it clear that the first committed version is not (yet) designed for high concurrency with mixed read-write workloads, I think waiting (as a protocol) is fine. If waiting is a problem for some use case, at that point we should just go all the way and replace the locking entirely. In fact, it might be good to spell this out in the top-level comment and include a link to the second ART paper.
Agreed. Will update the comments.
[thinks some more...] Is there an API-level assumption that hasn't been spelled out? Would it help to have a parameter for whether the iteration function wants to reserve the privilege to perform writes? It could take the appropriate lock at the start, and there could then be multiple read-only iterators, but only one read/write iterator. Note, I'm just guessing here, and I don't want to make things more difficult for future improvements.
Seems like a good idea. Given the use case of parallel heap vacuum, it
would be good to support having multiple read-only iterators. The
iteration in the v22 patch is read-only, so if we want to support a
read-write iterator, we would need a function that modifies the
current key-value pair returned by the iteration.
Okay, so updating during iteration is not currently supported. It could be in the future, but I'd say that can also wait for fine-grained concurrency support. Intermediate-term, we should at least make it straightforward to support:
1) parallel heap vacuum -> multiple read-only iterators
2) parallel heap pruning -> multiple writers
It may or may not be worth it for someone to actually start either of those projects, and there are other ways to improve vacuum that may be more pressing. That said, it seems the tid store with global locking would certainly work fine for #1 and maybe "not too bad" for #2. #2 can also mitigate waiting by using larger batching, or the leader process could "pre-warm" the tid store with zero-values using block numbers from the visibility map.
True. Using a larger batching method seems to be worth testing when we
implement the parallel heap pruning.
In the next version of the patch, I'm going to update the locking support
part and incorporate the other comments I got.
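For illustration only, the iterator API parameter described above might look
something like the following (hypothetical; the v22/v24 patches only take the
shared lock, and the "exclusive" field does not exist there):

RT_SCOPE RT_ITER *
RT_BEGIN_ITERATE(RT_RADIX_TREE *tree, bool exclusive)
{
	RT_ITER    *iter;

	iter = (RT_ITER *) palloc0(sizeof(RT_ITER));
	iter->tree = tree;
	iter->exclusive = exclusive;	/* hypothetical field, e.g. for assertions */

	if (exclusive)
		RT_LOCK_EXCLUSIVE(tree);	/* at most one read-write iterator */
	else
		RT_LOCK_SHARED(tree);		/* any number of read-only iterators */

	/* ... set up the iteration stack as the current patch does ... */

	return iter;
}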
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Mon, Jan 30, 2023 at 11:30 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Mon, Jan 30, 2023 at 1:31 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Sun, Jan 29, 2023 at 9:50 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Sat, Jan 28, 2023 at 8:33 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
The first implementation should be simple, easy to test/verify, easy to understand, and easy to replace. As much as possible anyway.
Yes, but if a concurrent writer waits for another process to finish
the iteration, it ends up waiting on a lwlock, which is not
interruptible.
So the idea is that we set iter_active to true (with the
lock in exclusive mode), and prevent concurrent updates when the flag
is true.
...by throwing elog(ERROR)? I'm not so sure users of this API would prefer that to waiting.
Right. I think if we want to wait rather than an ERROR, the waiter
should wait in an interruptible way, for example, a condition
variable. I took a simpler approach in the v22 patch.
...but looking at dshash.c, dshash_seq_next() seems to return an entry
while holding a lwlock on the partition. My assumption might be wrong.
Using partitions there makes holding a lock less painful on average, I imagine, but I don't know the details there.
If we make it clear that the first committed version is not (yet) designed for high concurrency with mixed read-write workloads, I think waiting (as a protocol) is fine. If waiting is a problem for some use case, at that point we should just go all the way and replace the locking entirely. In fact, it might be good to spell this out in the top-level comment and include a link to the second ART paper.
Agreed. Will update the comments.
[thinks some more...] Is there an API-level assumption that hasn't been spelled out? Would it help to have a parameter for whether the iteration function wants to reserve the privilege to perform writes? It could take the appropriate lock at the start, and there could then be multiple read-only iterators, but only one read/write iterator. Note, I'm just guessing here, and I don't want to make things more difficult for future improvements.
Seems like a good idea. Given the use case of parallel heap vacuum, it
would be good to support having multiple read-only iterators. The
iteration in the v22 patch is read-only, so if we want to support a
read-write iterator, we would need a function that modifies the
current key-value pair returned by the iteration.
Okay, so updating during iteration is not currently supported. It could be in the future, but I'd say that can also wait for fine-grained concurrency support. Intermediate-term, we should at least make it straightforward to support:
1) parallel heap vacuum -> multiple read-only iterators
2) parallel heap pruning -> multiple writers
It may or may not be worth it for someone to actually start either of those projects, and there are other ways to improve vacuum that may be more pressing. That said, it seems the tid store with global locking would certainly work fine for #1 and maybe "not too bad" for #2. #2 can also mitigate waiting by using larger batching, or the leader process could "pre-warm" the tid store with zero-values using block numbers from the visibility map.
True. Using a larger batching method seems to be worth testing when we
implement the parallel heap pruning.
In the next version of the patch, I'm going to update the locking support part and incorporate the other comments I got.
I've attached the v24 patches. The locking support patch is separated
out (as the 0005 patch). I've also kept the updates to TidStore and the
vacuum integration from v23 as separate patches.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachments:
v24-0005-Add-read-write-lock-to-radix-tree-in-RT_SHMEM-ca.patch
From 1085ef0b9b8b31795616abc43063a91b27e7d5a4 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 25 Jan 2023 17:43:29 +0900
Subject: [PATCH v24 5/9] Add read-write lock to radix tree in RT_SHMEM case.
---
src/include/lib/radixtree.h | 102 ++++++++++++++++--
.../modules/test_radixtree/test_radixtree.c | 8 +-
2 files changed, 100 insertions(+), 10 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index f591d903fc..48134b10e4 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -40,6 +40,18 @@
* There are some optimizations not yet implemented, particularly path
* compression and lazy path expansion.
*
+ * To handle concurrency, we use a single reader-writer lock for the radix
+ * tree. The radix tree is exclusively locked during write operations such
+ * as RT_SET() and RT_DELETE(), and shared locked during read operations
+ * such as RT_SEARCH(). An iteration also holds the shared lock on the radix
+ * tree until it is completed.
+ *
+ * TODO: The current locking mechanism is not optimized for high concurrency
+ * with mixed read-write workloads. In the future it might be worthwhile
+ * to replace it with the Optimistic Lock Coupling or ROWEX mentioned in
+ * the paper "The ART of Practical Synchronization" by the same authors as
+ * the ART paper, 2016.
+ *
* WIP: the radix tree nodes don't shrink.
*
* To generate a radix tree and associated functions for a use case several
@@ -224,7 +236,7 @@ typedef dsa_pointer RT_HANDLE;
#endif
#ifdef RT_SHMEM
-RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx, dsa_area *dsa);
+RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx, dsa_area *dsa, int tranche_id);
RT_SCOPE RT_RADIX_TREE * RT_ATTACH(dsa_area *dsa, dsa_pointer dp);
RT_SCOPE void RT_DETACH(RT_RADIX_TREE *tree);
RT_SCOPE RT_HANDLE RT_GET_HANDLE(RT_RADIX_TREE *tree);
@@ -371,6 +383,16 @@ typedef struct RT_NODE
#define RT_INVALID_PTR_ALLOC NULL
#endif
+#ifdef RT_SHMEM
+#define RT_LOCK_EXCLUSIVE(tree) LWLockAcquire(&tree->ctl->lock, LW_EXCLUSIVE)
+#define RT_LOCK_SHARED(tree) LWLockAcquire(&tree->ctl->lock, LW_SHARED)
+#define RT_UNLOCK(tree) LWLockRelease(&tree->ctl->lock);
+#else
+#define RT_LOCK_EXCLUSIVE(tree) ((void) 0)
+#define RT_LOCK_SHARED(tree) ((void) 0)
+#define RT_UNLOCK(tree) ((void) 0)
+#endif
+
/*
* Inner nodes and leaf nodes have analogous structure. To distinguish
* them at runtime, we take advantage of the fact that the key chunk
@@ -596,6 +618,7 @@ typedef struct RT_RADIX_TREE_CONTROL
#ifdef RT_SHMEM
RT_HANDLE handle;
uint32 magic;
+ LWLock lock;
#endif
RT_PTR_ALLOC root;
@@ -1376,7 +1399,7 @@ RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC store
*/
RT_SCOPE RT_RADIX_TREE *
#ifdef RT_SHMEM
-RT_CREATE(MemoryContext ctx, dsa_area *dsa)
+RT_CREATE(MemoryContext ctx, dsa_area *dsa, int tranche_id)
#else
RT_CREATE(MemoryContext ctx)
#endif
@@ -1398,6 +1421,7 @@ RT_CREATE(MemoryContext ctx)
tree->ctl = (RT_RADIX_TREE_CONTROL *) dsa_get_address(dsa, dp);
tree->ctl->handle = dp;
tree->ctl->magic = RT_RADIX_TREE_MAGIC;
+ LWLockInitialize(&tree->ctl->lock, tranche_id);
#else
tree->ctl = (RT_RADIX_TREE_CONTROL *) palloc0(sizeof(RT_RADIX_TREE_CONTROL));
@@ -1581,6 +1605,8 @@ RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE value)
Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
#endif
+ RT_LOCK_EXCLUSIVE(tree);
+
/* Empty tree, create the root */
if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
RT_NEW_ROOT(tree, key);
@@ -1606,6 +1632,7 @@ RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE value)
if (!RT_NODE_SEARCH_INNER(child, key, &new_child))
{
RT_SET_EXTEND(tree, key, value, parent, stored_child, child);
+ RT_UNLOCK(tree);
return false;
}
@@ -1620,12 +1647,13 @@ RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE value)
if (!updated)
tree->ctl->num_keys++;
+ RT_UNLOCK(tree);
return updated;
}
/*
* Search the given key in the radix tree. Return true if there is the key,
- * otherwise return false. On success, we set the value to *val_p so it must
+ * otherwise return false. On success, we set the value to *val_p so it must
* not be NULL.
*/
RT_SCOPE bool
@@ -1633,14 +1661,20 @@ RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p)
{
RT_PTR_LOCAL node;
int shift;
+ bool found;
#ifdef RT_SHMEM
Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
#endif
Assert(value_p != NULL);
+ RT_LOCK_SHARED(tree);
+
if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root) || key > tree->ctl->max_val)
+ {
+ RT_UNLOCK(tree);
return false;
+ }
node = RT_PTR_GET_LOCAL(tree, tree->ctl->root);
shift = node->shift;
@@ -1654,13 +1688,19 @@ RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p)
break;
if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ {
+ RT_UNLOCK(tree);
return false;
+ }
node = RT_PTR_GET_LOCAL(tree, child);
shift -= RT_NODE_SPAN;
}
- return RT_NODE_SEARCH_LEAF(node, key, value_p);
+ found = RT_NODE_SEARCH_LEAF(node, key, value_p);
+
+ RT_UNLOCK(tree);
+ return found;
}
#ifdef RT_USE_DELETE
@@ -1682,8 +1722,13 @@ RT_DELETE(RT_RADIX_TREE *tree, uint64 key)
Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
#endif
+ RT_LOCK_EXCLUSIVE(tree);
+
if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root) || key > tree->ctl->max_val)
+ {
+ RT_UNLOCK(tree);
return false;
+ }
/*
* Descend the tree to search the key while building a stack of nodes we
@@ -1702,7 +1747,10 @@ RT_DELETE(RT_RADIX_TREE *tree, uint64 key)
node = RT_PTR_GET_LOCAL(tree, allocnode);
if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ {
+ RT_UNLOCK(tree);
return false;
+ }
allocnode = child;
shift -= RT_NODE_SPAN;
@@ -1715,6 +1763,7 @@ RT_DELETE(RT_RADIX_TREE *tree, uint64 key)
if (!deleted)
{
/* no key is found in the leaf node */
+ RT_UNLOCK(tree);
return false;
}
@@ -1726,7 +1775,10 @@ RT_DELETE(RT_RADIX_TREE *tree, uint64 key)
* node.
*/
if (node->count > 0)
+ {
+ RT_UNLOCK(tree);
return true;
+ }
/* Free the empty leaf node */
RT_FREE_NODE(tree, allocnode);
@@ -1748,6 +1800,7 @@ RT_DELETE(RT_RADIX_TREE *tree, uint64 key)
RT_FREE_NODE(tree, allocnode);
}
+ RT_UNLOCK(tree);
return true;
}
#endif
@@ -1812,7 +1865,12 @@ RT_UPDATE_ITER_STACK(RT_ITER *iter, RT_PTR_LOCAL from_node, int from)
}
}
-/* Create and return the iterator for the given radix tree */
+/*
+ * Create and return the iterator for the given radix tree.
+ *
+ * The radix tree is locked in shared mode during the iteration, so
+ * RT_END_ITERATE needs to be called when finished to release the lock.
+ */
RT_SCOPE RT_ITER *
RT_BEGIN_ITERATE(RT_RADIX_TREE *tree)
{
@@ -1826,6 +1884,8 @@ RT_BEGIN_ITERATE(RT_RADIX_TREE *tree)
iter = (RT_ITER *) palloc0(sizeof(RT_ITER));
iter->tree = tree;
+ RT_LOCK_SHARED(tree);
+
/* empty tree */
if (!iter->tree->ctl->root)
return iter;
@@ -1846,7 +1906,7 @@ RT_BEGIN_ITERATE(RT_RADIX_TREE *tree)
}
/*
- * Return true with setting key_p and value_p if there is next key. Otherwise,
+ * Return true with setting key_p and value_p if there is next key. Otherwise
* return false.
*/
RT_SCOPE bool
@@ -1901,9 +1961,20 @@ RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, RT_VALUE_TYPE *value_p)
return false;
}
+/*
+ * Terminate the iteration and release the lock.
+ *
+ * This function needs to be called after finishing or when exiting an
+ * iteration.
+ */
RT_SCOPE void
RT_END_ITERATE(RT_ITER *iter)
{
+#ifdef RT_SHMEM
+ Assert(LWLockHeldByMe(&iter->tree->ctl->lock));
+#endif
+
+ RT_UNLOCK(iter->tree);
pfree(iter);
}
@@ -1915,6 +1986,8 @@ RT_MEMORY_USAGE(RT_RADIX_TREE *tree)
{
Size total = 0;
+ RT_LOCK_SHARED(tree);
+
#ifdef RT_SHMEM
Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
total = dsa_get_total_size(tree->dsa);
@@ -1926,6 +1999,7 @@ RT_MEMORY_USAGE(RT_RADIX_TREE *tree)
}
#endif
+ RT_UNLOCK(tree);
return total;
}
@@ -2010,6 +2084,8 @@ RT_VERIFY_NODE(RT_PTR_LOCAL node)
RT_SCOPE void
RT_STATS(RT_RADIX_TREE *tree)
{
+ RT_LOCK_SHARED(tree);
+
fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->ctl->max_val);
fprintf(stderr, "num_keys = " UINT64_FORMAT "\n", tree->ctl->num_keys);
@@ -2029,6 +2105,8 @@ RT_STATS(RT_RADIX_TREE *tree)
tree->ctl->cnt[RT_CLASS_125],
tree->ctl->cnt[RT_CLASS_256]);
}
+
+ RT_UNLOCK(tree);
}
static void
@@ -2222,14 +2300,18 @@ RT_DUMP_SEARCH(RT_RADIX_TREE *tree, uint64 key)
RT_STATS(tree);
+ RT_LOCK_SHARED(tree);
+
if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
{
+ RT_UNLOCK(tree);
fprintf(stderr, "empty tree\n");
return;
}
if (key > tree->ctl->max_val)
{
+ RT_UNLOCK(tree);
fprintf(stderr, "key " UINT64_FORMAT "(0x" RT_UINT64_FORMAT_HEX ") is larger than max val\n",
key, key);
return;
@@ -2263,6 +2345,7 @@ RT_DUMP_SEARCH(RT_RADIX_TREE *tree, uint64 key)
shift -= RT_NODE_SPAN;
level++;
}
+ RT_UNLOCK(tree);
fprintf(stderr, "%s", buf.data);
}
@@ -2274,8 +2357,11 @@ RT_DUMP(RT_RADIX_TREE *tree)
RT_STATS(tree);
+ RT_LOCK_SHARED(tree);
+
if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
{
+ RT_UNLOCK(tree);
fprintf(stderr, "empty tree\n");
return;
}
@@ -2283,6 +2369,7 @@ RT_DUMP(RT_RADIX_TREE *tree)
initStringInfo(&buf);
RT_DUMP_NODE(tree, tree->ctl->root, 0, true, &buf);
+ RT_UNLOCK(tree);
fprintf(stderr, "%s",buf.data);
}
@@ -2310,6 +2397,9 @@ RT_DUMP(RT_RADIX_TREE *tree)
#undef RT_GET_KEY_CHUNK
#undef BM_IDX
#undef BM_BIT
+#undef RT_LOCK_EXCLUSIVE
+#undef RT_LOCK_SHARED
+#undef RT_UNLOCK
#undef RT_NODE_IS_LEAF
#undef RT_NODE_MUST_GROW
#undef RT_NODE_KIND_COUNT
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
index 2a93e731ae..bbe1a619b6 100644
--- a/src/test/modules/test_radixtree/test_radixtree.c
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -144,7 +144,7 @@ test_empty(void)
dsa_area *dsa;
dsa = dsa_create(tranche_id);
- radixtree = rt_create(CurrentMemoryContext, dsa);
+ radixtree = rt_create(CurrentMemoryContext, dsa, tranche_id);
#else
radixtree = rt_create(CurrentMemoryContext);
#endif
@@ -195,7 +195,7 @@ test_basic(int children, bool test_inner)
test_inner ? "inner" : "leaf", children);
#ifdef RT_SHMEM
- radixtree = rt_create(CurrentMemoryContext, dsa);
+ radixtree = rt_create(CurrentMemoryContext, dsa, tranche_id);
#else
radixtree = rt_create(CurrentMemoryContext);
#endif
@@ -363,7 +363,7 @@ test_node_types(uint8 shift)
elog(NOTICE, "testing radix tree node types with shift \"%d\"", shift);
#ifdef RT_SHMEM
- radixtree = rt_create(CurrentMemoryContext, dsa);
+ radixtree = rt_create(CurrentMemoryContext, dsa, tranche_id);
#else
radixtree = rt_create(CurrentMemoryContext);
#endif
@@ -434,7 +434,7 @@ test_pattern(const test_spec * spec)
MemoryContextSetIdentifier(radixtree_ctx, spec->test_name);
#ifdef RT_SHMEM
- radixtree = rt_create(radixtree_ctx, dsa);
+ radixtree = rt_create(radixtree_ctx, dsa, tranche_id);
#else
radixtree = rt_create(radixtree_ctx);
#endif
--
2.31.1
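As a usage note (a sketch, not code from the patch set; the tranche name is made
up), a caller creating a shared radix tree is now expected to supply an LWLock
tranche in addition to the DSA area, roughly:

	int			tranche_id = LWLockNewTrancheId();
	dsa_area   *dsa;
	rt_radix_tree *radixtree;

	LWLockRegisterTranche(tranche_id, "my_radix_tree");
	dsa = dsa_create(tranche_id);
	radixtree = rt_create(CurrentMemoryContext, dsa, tranche_id);

	/* rt_set/rt_search take the tree's lock internally */
	rt_set(radixtree, UINT64CONST(42), (uint64) 1);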
v24-0008-Update-TidStore-patch-from-v23.patch
From c76104ba85a5668cfbcb236610bc494127642102 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Tue, 31 Jan 2023 17:41:31 +0900
Subject: [PATCH v24 8/9] Update TidStore patch from v23.
Incorporate the comments, update comments, and add the description of
concurrency support.
---
src/backend/access/common/tidstore.c | 110 +++++++++++++++------------
1 file changed, 62 insertions(+), 48 deletions(-)
diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
index 89aea71945..f656de2189 100644
--- a/src/backend/access/common/tidstore.c
+++ b/src/backend/access/common/tidstore.c
@@ -11,7 +11,10 @@
* to tidstore_create(). Other backends can attach to the shared TidStore by
* tidstore_attach().
*
- * XXX: Only one process is allowed to iterate over the TidStore at a time.
+ * Regarding concurrency, we basically rely on the concurrency support in
+ * the radix tree, but we acquire the lock on a TidStore in some cases, for
+ * example, when resetting the store and when accessing the number of tids
+ * in the store (num_tids).
*
* Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
@@ -23,7 +26,6 @@
*/
#include "postgres.h"
-#include "access/htup_details.h"
#include "access/tidstore.h"
#include "miscadmin.h"
#include "port/pg_bitutils.h"
@@ -87,14 +89,17 @@
#define RT_VALUE_TYPE uint64
#include "lib/radixtree.h"
-/* The header object for a TidStore */
+/* The control object for a TidStore */
typedef struct TidStoreControl
{
- int64 num_tids; /* the number of Tids stored so far */
+ /* the number of tids in the store */
+ int64 num_tids;
+
+ /* These values are never changed after creation */
size_t max_bytes; /* the maximum bytes a TidStore can use */
int max_offset; /* the maximum offset number */
+ int offset_nbits; /* the number of bits required for max_offset */
bool encode_tids; /* do we use tid encoding? */
- int offset_nbits; /* the number of bits used for offset number */
int offset_key_nbits; /* the number of bits of a offset number
* used for the key */
@@ -117,7 +122,7 @@ struct TidStore
*/
TidStoreControl *control;
- /* Storage for Tids. Use either one depending on TidStoreIsShared() */
+ /* Storage for Tids. Use either one depending on TidStoreIsShared() */
union
{
local_rt_radix_tree *local;
@@ -170,24 +175,24 @@ tidstore_create(size_t max_bytes, int max_offset, dsa_area *area)
/*
* Create the radix tree for the main storage.
*
- * Memory consumption depends on the number of Tids stored, but also on the
+ * Memory consumption depends on the number of stored tids, but also on the
* distribution of them, how the radix tree stores, and the memory management
* that backed the radix tree. The maximum bytes that a TidStore can
* use is specified by the max_bytes in tidstore_create(). We want the total
- * amount of memory consumption not to exceed the max_bytes.
+ * amount of memory consumption by a TidStore not to exceed the max_bytes.
*
- * In non-shared cases, the radix tree uses slab allocators for each kind of
- * node class. The most memory consuming case while adding Tids associated
- * with one page (i.e. during tidstore_add_tids()) is that we allocate the
- * largest radix tree node in a new slab block, which is approximately 70kB.
- * Therefore, we deduct 70kB from the maximum bytes.
+ * In local TidStore cases, the radix tree uses slab allocators for each kind
+ * of node class. The most memory consuming case while adding Tids associated
+ * with one page (i.e. during tidstore_add_tids()) is that we allocate a new
+ * slab block for a new radix tree node, which is approximately 70kB. Therefore,
+ * we deduct 70kB from the max_bytes.
*
* In shared cases, DSA allocates the memory segments big enough to follow
* a geometric series that approximately doubles the total DSA size (see
* make_new_segment() in dsa.c). We simulated the how DSA increases segment
* size and the simulation revealed, the 75% threshold for the maximum bytes
- * perfectly works in case where it is a power-of-2, and the 60% threshold
- * works for other cases.
+ * perfectly works in case where the max_bytes is a power-of-2, and the 60%
+ * threshold works for other cases.
*/
if (area != NULL)
{
@@ -199,7 +204,7 @@ tidstore_create(size_t max_bytes, int max_offset, dsa_area *area)
dp = dsa_allocate0(area, sizeof(TidStoreControl));
ts->control = (TidStoreControl *) dsa_get_address(area, dp);
- ts->control->max_bytes =(uint64) (max_bytes * ratio);
+ ts->control->max_bytes = (uint64) (max_bytes * ratio);
ts->area = area;
ts->control->magic = TIDSTORE_MAGIC;
@@ -212,12 +217,16 @@ tidstore_create(size_t max_bytes, int max_offset, dsa_area *area)
ts->tree.local = local_rt_create(CurrentMemoryContext);
ts->control = (TidStoreControl *) palloc0(sizeof(TidStoreControl));
- ts->control->max_bytes = max_bytes - (1024 * 70);
+ ts->control->max_bytes = max_bytes - (70 * 1024);
}
ts->control->max_offset = max_offset;
ts->control->offset_nbits = pg_ceil_log2_32(max_offset);
+ /*
+ * We use tid encoding if the number of bits required for the offset number
+ * doesn't fit in a uint64 value.
+ */
if (ts->control->offset_nbits > TIDSTORE_VALUE_NBITS)
{
ts->control->encode_tids = true;
@@ -311,7 +320,10 @@ tidstore_destroy(TidStore *ts)
pfree(ts);
}
-/* Forget all collected Tids */
+/*
+ * Forget all collected Tids. It's similar to tidstore_destroy but we don't
+ * free the entire TidStore; we only recreate the radix tree storage.
+ */
void
tidstore_reset(TidStore *ts)
{
@@ -350,15 +362,6 @@ tidstore_reset(TidStore *ts)
}
}
-static inline void
-tidstore_insert_kv(TidStore *ts, uint64 key, uint64 val)
-{
- if (TidStoreIsShared(ts))
- shared_rt_set(ts->tree.shared, key, val);
- else
- local_rt_set(ts->tree.local, key, val);
-}
-
/* Add Tids on a block to TidStore */
void
tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
@@ -371,8 +374,6 @@ tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
- ItemPointerSetBlockNumber(&tid, blkno);
-
if (ts->control->encode_tids)
{
key_base = ((uint64) blkno) << ts->control->offset_key_nbits;
@@ -383,9 +384,9 @@ tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
key_base = (uint64) blkno;
nkeys = 1;
}
-
values = palloc0(sizeof(uint64) * nkeys);
+ ItemPointerSetBlockNumber(&tid, blkno);
for (int i = 0; i < num_offsets; i++)
{
uint64 key;
@@ -413,7 +414,10 @@ tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
{
uint64 key = key_base + i;
- tidstore_insert_kv(ts, key, values[i]);
+ if (TidStoreIsShared(ts))
+ shared_rt_set(ts->tree.shared, key, values[i]);
+ else
+ local_rt_set(ts->tree.local, key, values[i]);
}
}
@@ -449,8 +453,11 @@ tidstore_lookup_tid(TidStore *ts, ItemPointer tid)
}
/*
- * Prepare to iterate through a TidStore. The caller must be certain that
- * no other backend will attempt to update the TidStore during the iteration.
+ * Prepare to iterate through a TidStore. Since the radix tree is locked during
+ * the iteration, tidstore_end_iterate() needs to be called when finished.
+ *
+ * Concurrent updates during the iteration will be blocked when inserting a
+ * key-value to the radix tree.
*/
TidStoreIter *
tidstore_begin_iterate(TidStore *ts)
@@ -482,13 +489,14 @@ tidstore_iter_kv(TidStoreIter *iter, uint64 *key, uint64 *val)
{
if (TidStoreIsShared(iter->ts))
return shared_rt_iterate_next(iter->tree_iter.shared, key, val);
- else
- return local_rt_iterate_next(iter->tree_iter.local, key, val);
+
+ return local_rt_iterate_next(iter->tree_iter.local, key, val);
}
/*
- * Scan the TidStore and return a TidStoreIterResult representing Tids
- * in one page. Offset numbers in the result is sorted.
+ * Scan the TidStore and return a pointer to TidStoreIterResult that has tids
+ * in one block. We return the block numbers in ascending order and the offset
+ * numbers in each result are also sorted in ascending order.
*/
TidStoreIterResult *
tidstore_iterate_next(TidStoreIter *iter)
@@ -502,6 +510,7 @@ tidstore_iterate_next(TidStoreIter *iter)
if (BlockNumberIsValid(result->blkno))
{
+ /* Process the previously collected key-value */
result->num_offsets = 0;
tidstore_iter_extract_tids(iter, iter->next_key, iter->next_val);
}
@@ -515,8 +524,8 @@ tidstore_iterate_next(TidStoreIter *iter)
if (BlockNumberIsValid(result->blkno) && result->blkno != blkno)
{
/*
- * Remember the key-value pair for the next block for the
- * next iteration.
+ * We got a key-value pair for a different block. So return the
+ * collected tids, and remember the key-value for the next iteration.
*/
iter->next_key = key;
iter->next_val = val;
@@ -531,7 +540,10 @@ tidstore_iterate_next(TidStoreIter *iter)
return result;
}
-/* Finish an iteration over TidStore */
+/*
+ * Finish an iteration over TidStore. This needs to be called after finishing
+ * or when exiting an iteration.
+ */
void
tidstore_end_iterate(TidStoreIter *iter)
{
@@ -544,7 +556,7 @@ tidstore_end_iterate(TidStoreIter *iter)
pfree(iter);
}
-/* Return the number of Tids we collected so far */
+/* Return the number of tids we collected so far */
int64
tidstore_num_tids(TidStore *ts)
{
@@ -552,7 +564,7 @@ tidstore_num_tids(TidStore *ts)
Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
- if (TidStoreIsShared(ts))
+ if (!TidStoreIsShared(ts))
return ts->control->num_tids;
LWLockAcquire(&ts->control->lock, LW_SHARED);
@@ -593,9 +605,8 @@ tidstore_memory_usage(TidStore *ts)
*/
if (TidStoreIsShared(ts))
return sizeof(TidStore) + shared_rt_memory_usage(ts->tree.shared);
- else
- return sizeof(TidStore) + sizeof(TidStore) +
- local_rt_memory_usage(ts->tree.local);
+
+ return sizeof(TidStore) + sizeof(TidStore) + local_rt_memory_usage(ts->tree.local);
}
/*
@@ -609,7 +620,7 @@ tidstore_get_handle(TidStore *ts)
return ts->control->handle;
}
-/* Extract Tids from the given key-value pair */
+/* Extract tids from the given key-value pair */
static void
tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, uint64 val)
{
@@ -621,7 +632,10 @@ tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, uint64 val)
OffsetNumber off;
if (i > iter->ts->control->max_offset)
+ {
+ Assert(!iter->ts->control->encode_tids);
break;
+ }
if ((val & (UINT64CONST(1) << i)) == 0)
continue;
@@ -644,8 +658,8 @@ key_get_blkno(TidStore *ts, uint64 key)
{
if (ts->control->encode_tids)
return (BlockNumber) (key >> ts->control->offset_key_nbits);
- else
- return (BlockNumber) key;
+
+ return (BlockNumber) key;
}
/* Encode a tid to key and offset */
--
2.31.1
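For reference, the iteration protocol described in these comments is used like
the following condensed sketch of what lazy_vacuum_heap_rel() does in the 0009
patch (process_block() is just a placeholder here):

	TidStoreIter *iter;
	TidStoreIterResult *result;

	iter = tidstore_begin_iterate(dead_items);	/* locks the underlying radix tree */
	while ((result = tidstore_iterate_next(iter)) != NULL)
	{
		/* blocks come back in ascending order; offsets are sorted per block */
		process_block(result->blkno, result->offsets, result->num_offsets);
	}
	tidstore_end_iterate(iter);					/* releases the lock */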
v24-0009-Update-vacuum-integration-patch-from-v23.patch
From fd380a199f38545a56d7fa11c45ec088d62389f4 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Tue, 31 Jan 2023 22:44:40 +0900
Subject: [PATCH v24 9/9] Update vacuum integration patch from v23.
---
src/backend/access/heap/vacuumlazy.c | 64 +++++++++++++--------------
src/backend/commands/vacuumparallel.c | 11 +++--
2 files changed, 37 insertions(+), 38 deletions(-)
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 3537df16fd..b4e40423a8 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -3,18 +3,18 @@
* vacuumlazy.c
* Concurrent ("lazy") vacuuming.
*
- * The major space usage for vacuuming is storage for the array of dead TIDs
+ * The major space usage for vacuuming is TidStore, a storage for dead TIDs
* that are to be removed from indexes. We want to ensure we can vacuum even
* the very largest relations with finite memory space usage. To do that, we
- * set upper bounds on the number of TIDs we can keep track of at once.
+ * set upper bounds on the maximum memory that can be used for keeping track
+ * of dead TIDs at once.
*
* We are willing to use at most maintenance_work_mem (or perhaps
* autovacuum_work_mem) memory space to keep track of dead TIDs. We initially
- * allocate an array of TIDs of that size, with an upper limit that depends on
- * table size (this limit ensures we don't allocate a huge area uselessly for
- * vacuuming small tables). If the array threatens to overflow, we must call
- * lazy_vacuum to vacuum indexes (and to vacuum the pages that we've pruned).
- * This frees up the memory space dedicated to storing dead TIDs.
+ * create a TidStore with the maximum bytes that can be used by the TidStore.
+ * If the TidStore is full, we must call lazy_vacuum to vacuum indexes (and to
+ * vacuum the pages that we've pruned). This frees up the memory space dedicated
+ * to storing dead TIDs.
*
* In practice VACUUM will often complete its initial pass over the target
* heap relation without ever running out of space to store TIDs. This means
@@ -492,11 +492,11 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
}
/*
- * Allocate dead_items array memory using dead_items_alloc. This handles
- * parallel VACUUM initialization as part of allocating shared memory
- * space used for dead_items. (But do a failsafe precheck first, to
- * ensure that parallel VACUUM won't be attempted at all when relfrozenxid
- * is already dangerously old.)
+ * Allocate dead_items memory using dead_items_alloc. This handles parallel
+ * VACUUM initialization as part of allocating shared memory space used for
+ * dead_items. (But do a failsafe precheck first, to ensure that parallel
+ * VACUUM won't be attempted at all when relfrozenxid is already dangerously
+ * old.)
*/
lazy_check_wraparound_failsafe(vacrel);
dead_items_alloc(vacrel, params->nworkers);
@@ -802,7 +802,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
* have collected the TIDs whose index tuples need to be removed.
*
* Finally, invokes lazy_vacuum_heap_rel to vacuum heap pages, which
- * largely consists of marking LP_DEAD items (from collected TID array)
+ * largely consists of marking LP_DEAD items (from vacrel->dead_items)
* as LP_UNUSED. This has to happen in a second, final pass over the
* heap, to preserve a basic invariant that all index AMs rely on: no
* extant index tuple can ever be allowed to contain a TID that points to
@@ -973,7 +973,7 @@ lazy_scan_heap(LVRelState *vacrel)
continue;
}
- /* Collect LP_DEAD items in dead_items array, count tuples */
+ /* Collect LP_DEAD items in dead_items, count tuples */
if (lazy_scan_noprune(vacrel, buf, blkno, page, &hastup,
&recordfreespace))
{
@@ -1015,10 +1015,10 @@ lazy_scan_heap(LVRelState *vacrel)
* Prune, freeze, and count tuples.
*
* Accumulates details of remaining LP_DEAD line pointers on page in
- * dead_items array. This includes LP_DEAD line pointers that we
- * pruned ourselves, as well as existing LP_DEAD line pointers that
- * were pruned some time earlier. Also considers freezing XIDs in the
- * tuple headers of remaining items with storage.
+ * dead_items. This includes LP_DEAD line pointers that we pruned
+ * ourselves, as well as existing LP_DEAD line pointers that were pruned
+ * some time earlier. Also considers freezing XIDs in the tuple headers
+ * of remaining items with storage.
*/
lazy_scan_prune(vacrel, buf, blkno, page, &prunestate);
@@ -1084,7 +1084,7 @@ lazy_scan_heap(LVRelState *vacrel)
}
else if (prunestate.num_offsets > 0)
{
- /* Save details of the LP_DEAD items from the page */
+ /* Save details of the LP_DEAD items from the page in dead_items */
tidstore_add_tids(dead_items, blkno, prunestate.deadoffsets,
prunestate.num_offsets);
@@ -1535,9 +1535,9 @@ lazy_scan_new_or_empty(LVRelState *vacrel, Buffer buf, BlockNumber blkno,
* The approach we take now is to restart pruning when the race condition is
* detected. This allows heap_page_prune() to prune the tuples inserted by
* the now-aborted transaction. This is a little crude, but it guarantees
- * that any items that make it into the dead_items array are simple LP_DEAD
- * line pointers, and that every remaining item with tuple storage is
- * considered as a candidate for freezing.
+ * that any items that make it into the dead_items are simple LP_DEAD line
+ * pointers, and that every remaining item with tuple storage is considered
+ * as a candidate for freezing.
*/
static void
lazy_scan_prune(LVRelState *vacrel,
@@ -1929,7 +1929,7 @@ retry:
* lazy_scan_prune, which requires a full cleanup lock. While pruning isn't
* performed here, it's quite possible that an earlier opportunistic pruning
* operation left LP_DEAD items behind. We'll at least collect any such items
- * in the dead_items array for removal from indexes.
+ * in the dead_items for removal from indexes.
*
* For aggressive VACUUM callers, we may return false to indicate that a full
* cleanup lock is required for processing by lazy_scan_prune. This is only
@@ -2088,7 +2088,7 @@ lazy_scan_noprune(LVRelState *vacrel,
vacrel->NewRelfrozenXid = NoFreezePageRelfrozenXid;
vacrel->NewRelminMxid = NoFreezePageRelminMxid;
- /* Save any LP_DEAD items found on the page in dead_items array */
+ /* Save any LP_DEAD items found on the page in dead_items */
if (vacrel->nindexes == 0)
{
/* Using one-pass strategy (since table has no indexes) */
@@ -2373,9 +2373,8 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
/*
* lazy_vacuum_heap_rel() -- second pass over the heap for two pass strategy
*
- * This routine marks LP_DEAD items in vacrel->dead_items array as LP_UNUSED.
- * Pages that never had lazy_scan_prune record LP_DEAD items are not visited
- * at all.
+ * This routine marks LP_DEAD items in vacrel->dead_items as LP_UNUSED. Pages
+ * that never had lazy_scan_prune record LP_DEAD items are not visited at all.
*
* We may also be able to truncate the line pointer array of the heap pages we
* visit. If there is a contiguous group of LP_UNUSED items at the end of the
@@ -2461,7 +2460,8 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
ereport(DEBUG2,
(errmsg("table \"%s\": removed " UINT64_FORMAT "dead item identifiers in %u pages",
- vacrel->relname, tidstore_num_tids(vacrel->dead_items), vacuumed_pages)));
+ vacrel->relname, tidstore_num_tids(vacrel->dead_items),
+ vacuumed_pages)));
/* Revert to the previous phase information for error traceback */
restore_vacuum_error_info(vacrel, &saved_err_info);
@@ -2660,8 +2660,8 @@ lazy_cleanup_all_indexes(LVRelState *vacrel)
* lazy_vacuum_one_index() -- vacuum index relation.
*
* Delete all the index tuples containing a TID collected in
- * vacrel->dead_items array. Also update running statistics.
- * Exact details depend on index AM's ambulkdelete routine.
+ * vacrel->dead_items. Also update running statistics. Exact
+ * details depend on index AM's ambulkdelete routine.
*
* reltuples is the number of heap tuples to be passed to the
* bulkdelete callback. It's always assumed to be estimated.
@@ -3067,8 +3067,8 @@ count_nondeletable_pages(LVRelState *vacrel, bool *lock_waiter_detected)
}
/*
- * Allocate dead_items (either using palloc, or in dynamic shared memory).
- * Sets dead_items in vacrel for caller.
+ * Allocate a (local or shared) TidStore for storing dead TIDs. Sets dead_items
+ * in vacrel for caller.
*
* Also handles parallel initialization as part of allocating dead_items in
* DSM when required.
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index 5c7e6ed99c..d653683693 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -9,12 +9,11 @@
* In a parallel vacuum, we perform both index bulk deletion and index cleanup
* with parallel worker processes. Individual indexes are processed by one
* vacuum process. ParalleVacuumState contains shared information as well as
- * the memory space for storing dead items allocated in the DSM segment. We
- * launch parallel worker processes at the start of parallel index
- * bulk-deletion and index cleanup and once all indexes are processed, the
- * parallel worker processes exit. Each time we process indexes in parallel,
- * the parallel context is re-initialized so that the same DSM can be used for
- * multiple passes of index bulk-deletion and index cleanup.
+ * the shared TidStore. We launch parallel worker processes at the start of
+ * parallel index bulk-deletion and index cleanup and once all indexes are
+ * processed, the parallel worker processes exit. Each time we process indexes
+ * in parallel, the parallel context is re-initialized so that the same DSM can
+ * be used for multiple passes of index bulk-deletion and index cleanup.
*
* Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
--
2.31.1
v24-0007-Use-TIDStore-for-storing-dead-tuple-TID-during-l.patch
From 850aff99cfddb2e77822d616248a4550cdae269c Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Tue, 17 Jan 2023 17:20:37 +0700
Subject: [PATCH v24 7/9] Use TIDStore for storing dead tuple TID during lazy
vacuum
Previously, we used an array of ItemPointerData to store dead tuple
TIDs, which was not space efficient and was slow to look up. It also
had a 1GB limit on its size.
Now we use TIDStore to store dead tuple TIDs. Since the TIDStore,
backed by the radix tree, allocates memory incrementally, we get rid
of the 1GB limit.
Since we are no longer able to exactly estimate the maximum number of
TIDs that can be stored, pg_stat_progress_vacuum shows the progress
information based on the amount of memory in bytes. The column names
are also changed to max_dead_tuple_bytes and num_dead_tuple_bytes.
In addition, since the TIDStore uses the radix tree internally, the
minimum amount of memory required by a TIDStore is 1MB, the initial
DSA segment size. Due to that, we increase the minimum value of
maintenance_work_mem (and autovacuum_work_mem) from 1MB to 2MB.
XXX: needs to bump catalog version
---
doc/src/sgml/monitoring.sgml | 8 +-
src/backend/access/heap/vacuumlazy.c | 218 +++++++--------------
src/backend/catalog/system_views.sql | 2 +-
src/backend/commands/vacuum.c | 78 +-------
src/backend/commands/vacuumparallel.c | 62 +++---
src/backend/postmaster/autovacuum.c | 6 +-
src/backend/storage/lmgr/lwlock.c | 2 +
src/backend/utils/misc/guc_tables.c | 2 +-
src/include/commands/progress.h | 4 +-
src/include/commands/vacuum.h | 25 +--
src/include/storage/lwlock.h | 1 +
src/test/regress/expected/cluster.out | 2 +-
src/test/regress/expected/create_index.out | 2 +-
src/test/regress/expected/rules.out | 4 +-
src/test/regress/sql/cluster.sql | 2 +-
src/test/regress/sql/create_index.sql | 2 +-
16 files changed, 142 insertions(+), 278 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index d936aa3da3..0230c74e3d 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -6870,10 +6870,10 @@ FROM pg_stat_get_backend_idset() AS backendid;
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>max_dead_tuples</structfield> <type>bigint</type>
+ <structfield>max_dead_tuple_bytes</structfield> <type>bigint</type>
</para>
<para>
- Number of dead tuples that we can store before needing to perform
+ Amount of dead tuple data that we can store before needing to perform
an index vacuum cycle, based on
<xref linkend="guc-maintenance-work-mem"/>.
</para></entry>
@@ -6881,10 +6881,10 @@ FROM pg_stat_get_backend_idset() AS backendid;
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>num_dead_tuples</structfield> <type>bigint</type>
+ <structfield>num_dead_tuple_bytes</structfield> <type>bigint</type>
</para>
<para>
- Number of dead tuples collected since the last index vacuum cycle.
+ Amount of dead tuple data collected since the last index vacuum cycle.
</para></entry>
</row>
</tbody>
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 8f14cf85f3..3537df16fd 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -40,6 +40,7 @@
#include "access/heapam_xlog.h"
#include "access/htup_details.h"
#include "access/multixact.h"
+#include "access/tidstore.h"
#include "access/transam.h"
#include "access/visibilitymap.h"
#include "access/xact.h"
@@ -188,7 +189,7 @@ typedef struct LVRelState
* lazy_vacuum_heap_rel, which marks the same LP_DEAD line pointers as
* LP_UNUSED during second heap pass.
*/
- VacDeadItems *dead_items; /* TIDs whose index tuples we'll delete */
+ TidStore *dead_items; /* TIDs whose index tuples we'll delete */
BlockNumber rel_pages; /* total number of pages */
BlockNumber scanned_pages; /* # pages examined (not skipped via VM) */
BlockNumber removed_pages; /* # pages removed by relation truncation */
@@ -220,11 +221,14 @@ typedef struct LVRelState
typedef struct LVPagePruneState
{
bool hastup; /* Page prevents rel truncation? */
- bool has_lpdead_items; /* includes existing LP_DEAD items */
+
+ /* collected offsets of LP_DEAD items including existing ones */
+ OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
+ int num_offsets;
/*
* State describes the proper VM bit states to set for the page following
- * pruning and freezing. all_visible implies !has_lpdead_items, but don't
+ * pruning and freezing. all_visible implies num_offsets == 0, but don't
* trust all_frozen result unless all_visible is also set to true.
*/
bool all_visible; /* Every item visible to all? */
@@ -259,8 +263,9 @@ static bool lazy_scan_noprune(LVRelState *vacrel, Buffer buf,
static void lazy_vacuum(LVRelState *vacrel);
static bool lazy_vacuum_all_indexes(LVRelState *vacrel);
static void lazy_vacuum_heap_rel(LVRelState *vacrel);
-static int lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
- Buffer buffer, int index, Buffer vmbuffer);
+static void lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
+ OffsetNumber *offsets, int num_offsets,
+ Buffer buffer, Buffer vmbuffer);
static bool lazy_check_wraparound_failsafe(LVRelState *vacrel);
static void lazy_cleanup_all_indexes(LVRelState *vacrel);
static IndexBulkDeleteResult *lazy_vacuum_one_index(Relation indrel,
@@ -825,21 +830,21 @@ lazy_scan_heap(LVRelState *vacrel)
blkno,
next_unskippable_block,
next_fsm_block_to_vacuum = 0;
- VacDeadItems *dead_items = vacrel->dead_items;
+ TidStore *dead_items = vacrel->dead_items;
Buffer vmbuffer = InvalidBuffer;
bool next_unskippable_allvis,
skipping_current_range;
const int initprog_index[] = {
PROGRESS_VACUUM_PHASE,
PROGRESS_VACUUM_TOTAL_HEAP_BLKS,
- PROGRESS_VACUUM_MAX_DEAD_TUPLES
+ PROGRESS_VACUUM_MAX_DEAD_TUPLE_BYTES
};
int64 initprog_val[3];
/* Report that we're scanning the heap, advertising total # of blocks */
initprog_val[0] = PROGRESS_VACUUM_PHASE_SCAN_HEAP;
initprog_val[1] = rel_pages;
- initprog_val[2] = dead_items->max_items;
+ initprog_val[2] = tidstore_max_memory(vacrel->dead_items);
pgstat_progress_update_multi_param(3, initprog_index, initprog_val);
/* Set up an initial range of skippable blocks using the visibility map */
@@ -906,8 +911,7 @@ lazy_scan_heap(LVRelState *vacrel)
* dead_items TIDs, pause and do a cycle of vacuuming before we tackle
* this page.
*/
- Assert(dead_items->max_items >= MaxHeapTuplesPerPage);
- if (dead_items->max_items - dead_items->num_items < MaxHeapTuplesPerPage)
+ if (tidstore_is_full(vacrel->dead_items))
{
/*
* Before beginning index vacuuming, we release any pin we may
@@ -1018,7 +1022,7 @@ lazy_scan_heap(LVRelState *vacrel)
*/
lazy_scan_prune(vacrel, buf, blkno, page, &prunestate);
- Assert(!prunestate.all_visible || !prunestate.has_lpdead_items);
+ Assert(!prunestate.all_visible || (prunestate.num_offsets == 0));
/* Remember the location of the last page with nonremovable tuples */
if (prunestate.hastup)
@@ -1034,14 +1038,12 @@ lazy_scan_heap(LVRelState *vacrel)
* performed here can be thought of as the one-pass equivalent of
* a call to lazy_vacuum().
*/
- if (prunestate.has_lpdead_items)
+ if (prunestate.num_offsets > 0)
{
Size freespace;
- lazy_vacuum_heap_page(vacrel, blkno, buf, 0, vmbuffer);
-
- /* Forget the LP_DEAD items that we just vacuumed */
- dead_items->num_items = 0;
+ lazy_vacuum_heap_page(vacrel, blkno, prunestate.deadoffsets,
+ prunestate.num_offsets, buf, vmbuffer);
/*
* Periodically perform FSM vacuuming to make newly-freed
@@ -1078,7 +1080,16 @@ lazy_scan_heap(LVRelState *vacrel)
* with prunestate-driven visibility map and FSM steps (just like
* the two-pass strategy).
*/
- Assert(dead_items->num_items == 0);
+ Assert(tidstore_num_tids(dead_items) == 0);
+ }
+ else if (prunestate.num_offsets > 0)
+ {
+ /* Save details of the LP_DEAD items from the page */
+ tidstore_add_tids(dead_items, blkno, prunestate.deadoffsets,
+ prunestate.num_offsets);
+
+ pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
+ tidstore_memory_usage(dead_items));
}
/*
@@ -1145,7 +1156,7 @@ lazy_scan_heap(LVRelState *vacrel)
* There should never be LP_DEAD items on a page with PD_ALL_VISIBLE
* set, however.
*/
- else if (prunestate.has_lpdead_items && PageIsAllVisible(page))
+ else if ((prunestate.num_offsets > 0) && PageIsAllVisible(page))
{
elog(WARNING, "page containing LP_DEAD items is marked as all-visible in relation \"%s\" page %u",
vacrel->relname, blkno);
@@ -1193,7 +1204,7 @@ lazy_scan_heap(LVRelState *vacrel)
* Final steps for block: drop cleanup lock, record free space in the
* FSM
*/
- if (prunestate.has_lpdead_items && vacrel->do_index_vacuuming)
+ if ((prunestate.num_offsets > 0) && vacrel->do_index_vacuuming)
{
/*
* Wait until lazy_vacuum_heap_rel() to save free space. This
@@ -1249,7 +1260,7 @@ lazy_scan_heap(LVRelState *vacrel)
* Do index vacuuming (call each index's ambulkdelete routine), then do
* related heap vacuuming
*/
- if (dead_items->num_items > 0)
+ if (tidstore_num_tids(dead_items) > 0)
lazy_vacuum(vacrel);
/*
@@ -1543,13 +1554,11 @@ lazy_scan_prune(LVRelState *vacrel,
HTSV_Result res;
int tuples_deleted,
tuples_frozen,
- lpdead_items,
live_tuples,
recently_dead_tuples;
int nnewlpdead;
HeapPageFreeze pagefrz;
int64 fpi_before = pgWalUsage.wal_fpi;
- OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
HeapTupleFreeze frozen[MaxHeapTuplesPerPage];
Assert(BufferGetBlockNumber(buf) == blkno);
@@ -1571,7 +1580,6 @@ retry:
pagefrz.NoFreezePageRelminMxid = vacrel->NewRelminMxid;
tuples_deleted = 0;
tuples_frozen = 0;
- lpdead_items = 0;
live_tuples = 0;
recently_dead_tuples = 0;
@@ -1580,9 +1588,9 @@ retry:
*
* We count tuples removed by the pruning step as tuples_deleted. Its
* final value can be thought of as the number of tuples that have been
- * deleted from the table. It should not be confused with lpdead_items;
- * lpdead_items's final value can be thought of as the number of tuples
- * that were deleted from indexes.
+ * deleted from the table. It should not be confused with
+ * prunestate->deadoffsets; prunestate->deadoffsets's final value can
+ * be thought of as the number of tuples that were deleted from indexes.
*/
tuples_deleted = heap_page_prune(rel, buf, vacrel->vistest,
InvalidTransactionId, 0, &nnewlpdead,
@@ -1593,7 +1601,7 @@ retry:
* requiring freezing among remaining tuples with storage
*/
prunestate->hastup = false;
- prunestate->has_lpdead_items = false;
+ prunestate->num_offsets = 0;
prunestate->all_visible = true;
prunestate->all_frozen = true;
prunestate->visibility_cutoff_xid = InvalidTransactionId;
@@ -1638,7 +1646,7 @@ retry:
* (This is another case where it's useful to anticipate that any
* LP_DEAD items will become LP_UNUSED during the ongoing VACUUM.)
*/
- deadoffsets[lpdead_items++] = offnum;
+ prunestate->deadoffsets[prunestate->num_offsets++] = offnum;
continue;
}
@@ -1875,7 +1883,7 @@ retry:
*/
#ifdef USE_ASSERT_CHECKING
/* Note that all_frozen value does not matter when !all_visible */
- if (prunestate->all_visible && lpdead_items == 0)
+ if (prunestate->all_visible && prunestate->num_offsets == 0)
{
TransactionId cutoff;
bool all_frozen;
@@ -1888,28 +1896,9 @@ retry:
}
#endif
- /*
- * Now save details of the LP_DEAD items from the page in vacrel
- */
- if (lpdead_items > 0)
+ if (prunestate->num_offsets > 0)
{
- VacDeadItems *dead_items = vacrel->dead_items;
- ItemPointerData tmp;
-
vacrel->lpdead_item_pages++;
- prunestate->has_lpdead_items = true;
-
- ItemPointerSetBlockNumber(&tmp, blkno);
-
- for (int i = 0; i < lpdead_items; i++)
- {
- ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
- dead_items->items[dead_items->num_items++] = tmp;
- }
-
- Assert(dead_items->num_items <= dead_items->max_items);
- pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
- dead_items->num_items);
/*
* It was convenient to ignore LP_DEAD items in all_visible earlier on
@@ -1928,7 +1917,7 @@ retry:
/* Finally, add page-local counts to whole-VACUUM counts */
vacrel->tuples_deleted += tuples_deleted;
vacrel->tuples_frozen += tuples_frozen;
- vacrel->lpdead_items += lpdead_items;
+ vacrel->lpdead_items += prunestate->num_offsets;
vacrel->live_tuples += live_tuples;
vacrel->recently_dead_tuples += recently_dead_tuples;
}
@@ -2129,8 +2118,7 @@ lazy_scan_noprune(LVRelState *vacrel,
}
else
{
- VacDeadItems *dead_items = vacrel->dead_items;
- ItemPointerData tmp;
+ TidStore *dead_items = vacrel->dead_items;
/*
* Page has LP_DEAD items, and so any references/TIDs that remain in
@@ -2139,17 +2127,10 @@ lazy_scan_noprune(LVRelState *vacrel,
*/
vacrel->lpdead_item_pages++;
- ItemPointerSetBlockNumber(&tmp, blkno);
+ tidstore_add_tids(dead_items, blkno, deadoffsets, lpdead_items);
- for (int i = 0; i < lpdead_items; i++)
- {
- ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
- dead_items->items[dead_items->num_items++] = tmp;
- }
-
- Assert(dead_items->num_items <= dead_items->max_items);
- pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
- dead_items->num_items);
+ pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
+ tidstore_memory_usage(dead_items));
vacrel->lpdead_items += lpdead_items;
@@ -2198,7 +2179,7 @@ lazy_vacuum(LVRelState *vacrel)
if (!vacrel->do_index_vacuuming)
{
Assert(!vacrel->do_index_cleanup);
- vacrel->dead_items->num_items = 0;
+ tidstore_reset(vacrel->dead_items);
return;
}
@@ -2227,7 +2208,7 @@ lazy_vacuum(LVRelState *vacrel)
BlockNumber threshold;
Assert(vacrel->num_index_scans == 0);
- Assert(vacrel->lpdead_items == vacrel->dead_items->num_items);
+ Assert(vacrel->lpdead_items == tidstore_num_tids(vacrel->dead_items));
Assert(vacrel->do_index_vacuuming);
Assert(vacrel->do_index_cleanup);
@@ -2254,8 +2235,8 @@ lazy_vacuum(LVRelState *vacrel)
* cases then this may need to be reconsidered.
*/
threshold = (double) vacrel->rel_pages * BYPASS_THRESHOLD_PAGES;
- bypass = (vacrel->lpdead_item_pages < threshold &&
- vacrel->lpdead_items < MAXDEADITEMS(32L * 1024L * 1024L));
+ bypass = (vacrel->lpdead_item_pages < threshold) &&
+ tidstore_memory_usage(vacrel->dead_items) < (32L * 1024L * 1024L);
}
if (bypass)
@@ -2300,7 +2281,7 @@ lazy_vacuum(LVRelState *vacrel)
* Forget the LP_DEAD items that we just vacuumed (or just decided to not
* vacuum)
*/
- vacrel->dead_items->num_items = 0;
+ tidstore_reset(vacrel->dead_items);
}
/*
@@ -2373,7 +2354,7 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
* place).
*/
Assert(vacrel->num_index_scans > 0 ||
- vacrel->dead_items->num_items == vacrel->lpdead_items);
+ tidstore_num_tids(vacrel->dead_items) == vacrel->lpdead_items);
Assert(allindexes || vacrel->failsafe_active);
/*
@@ -2410,10 +2391,11 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
static void
lazy_vacuum_heap_rel(LVRelState *vacrel)
{
- int index = 0;
BlockNumber vacuumed_pages = 0;
Buffer vmbuffer = InvalidBuffer;
LVSavedErrInfo saved_err_info;
+ TidStoreIter *iter;
+ TidStoreIterResult *result;
Assert(vacrel->do_index_vacuuming);
Assert(vacrel->do_index_cleanup);
@@ -2428,7 +2410,8 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
VACUUM_ERRCB_PHASE_VACUUM_HEAP,
InvalidBlockNumber, InvalidOffsetNumber);
- while (index < vacrel->dead_items->num_items)
+ iter = tidstore_begin_iterate(vacrel->dead_items);
+ while ((result = tidstore_iterate_next(iter)) != NULL)
{
BlockNumber blkno;
Buffer buf;
@@ -2437,7 +2420,7 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
vacuum_delay_point();
- blkno = ItemPointerGetBlockNumber(&vacrel->dead_items->items[index]);
+ blkno = result->blkno;
vacrel->blkno = blkno;
/*
@@ -2451,7 +2434,8 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
buf = ReadBufferExtended(vacrel->rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
vacrel->bstrategy);
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
- index = lazy_vacuum_heap_page(vacrel, blkno, buf, index, vmbuffer);
+ lazy_vacuum_heap_page(vacrel, blkno, result->offsets, result->num_offsets,
+ buf, vmbuffer);
/* Now that we've vacuumed the page, record its available space */
page = BufferGetPage(buf);
@@ -2461,6 +2445,7 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
vacuumed_pages++;
}
+ tidstore_end_iterate(iter);
vacrel->blkno = InvalidBlockNumber;
if (BufferIsValid(vmbuffer))
@@ -2470,36 +2455,30 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
* We set all LP_DEAD items from the first heap pass to LP_UNUSED during
* the second heap pass. No more, no less.
*/
- Assert(index > 0);
Assert(vacrel->num_index_scans > 1 ||
- (index == vacrel->lpdead_items &&
+ (tidstore_num_tids(vacrel->dead_items) == vacrel->lpdead_items &&
vacuumed_pages == vacrel->lpdead_item_pages));
ereport(DEBUG2,
- (errmsg("table \"%s\": removed %lld dead item identifiers in %u pages",
- vacrel->relname, (long long) index, vacuumed_pages)));
+ (errmsg("table \"%s\": removed " UINT64_FORMAT "dead item identifiers in %u pages",
+ vacrel->relname, tidstore_num_tids(vacrel->dead_items), vacuumed_pages)));
/* Revert to the previous phase information for error traceback */
restore_vacuum_error_info(vacrel, &saved_err_info);
}
/*
- * lazy_vacuum_heap_page() -- free page's LP_DEAD items listed in the
- * vacrel->dead_items array.
+ * lazy_vacuum_heap_page() -- free page's LP_DEAD items.
*
* Caller must have an exclusive buffer lock on the buffer (though a full
* cleanup lock is also acceptable). vmbuffer must be valid and already have
* a pin on blkno's visibility map page.
- *
- * index is an offset into the vacrel->dead_items array for the first listed
- * LP_DEAD item on the page. The return value is the first index immediately
- * after all LP_DEAD items for the same page in the array.
*/
-static int
-lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
- int index, Buffer vmbuffer)
+static void
+lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
+ OffsetNumber *deadoffsets, int num_offsets, Buffer buffer,
+ Buffer vmbuffer)
{
- VacDeadItems *dead_items = vacrel->dead_items;
Page page = BufferGetPage(buffer);
OffsetNumber unused[MaxHeapTuplesPerPage];
int nunused = 0;
@@ -2518,16 +2497,11 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
START_CRIT_SECTION();
- for (; index < dead_items->num_items; index++)
+ for (int i = 0; i < num_offsets; i++)
{
- BlockNumber tblk;
- OffsetNumber toff;
ItemId itemid;
+ OffsetNumber toff = deadoffsets[i];
- tblk = ItemPointerGetBlockNumber(&dead_items->items[index]);
- if (tblk != blkno)
- break; /* past end of tuples for this block */
- toff = ItemPointerGetOffsetNumber(&dead_items->items[index]);
itemid = PageGetItemId(page, toff);
Assert(ItemIdIsDead(itemid) && !ItemIdHasStorage(itemid));
@@ -2597,7 +2571,6 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
/* Revert to the previous phase information for error traceback */
restore_vacuum_error_info(vacrel, &saved_err_info);
- return index;
}
/*
@@ -3093,46 +3066,6 @@ count_nondeletable_pages(LVRelState *vacrel, bool *lock_waiter_detected)
return vacrel->nonempty_pages;
}
-/*
- * Returns the number of dead TIDs that VACUUM should allocate space to
- * store, given a heap rel of size vacrel->rel_pages, and given current
- * maintenance_work_mem setting (or current autovacuum_work_mem setting,
- * when applicable).
- *
- * See the comments at the head of this file for rationale.
- */
-static int
-dead_items_max_items(LVRelState *vacrel)
-{
- int64 max_items;
- int vac_work_mem = IsAutoVacuumWorkerProcess() &&
- autovacuum_work_mem != -1 ?
- autovacuum_work_mem : maintenance_work_mem;
-
- if (vacrel->nindexes > 0)
- {
- BlockNumber rel_pages = vacrel->rel_pages;
-
- max_items = MAXDEADITEMS(vac_work_mem * 1024L);
- max_items = Min(max_items, INT_MAX);
- max_items = Min(max_items, MAXDEADITEMS(MaxAllocSize));
-
- /* curious coding here to ensure the multiplication can't overflow */
- if ((BlockNumber) (max_items / MaxHeapTuplesPerPage) > rel_pages)
- max_items = rel_pages * MaxHeapTuplesPerPage;
-
- /* stay sane if small maintenance_work_mem */
- max_items = Max(max_items, MaxHeapTuplesPerPage);
- }
- else
- {
- /* One-pass case only stores a single heap page's TIDs at a time */
- max_items = MaxHeapTuplesPerPage;
- }
-
- return (int) max_items;
-}
-
/*
* Allocate dead_items (either using palloc, or in dynamic shared memory).
* Sets dead_items in vacrel for caller.
@@ -3143,11 +3076,9 @@ dead_items_max_items(LVRelState *vacrel)
static void
dead_items_alloc(LVRelState *vacrel, int nworkers)
{
- VacDeadItems *dead_items;
- int max_items;
-
- max_items = dead_items_max_items(vacrel);
- Assert(max_items >= MaxHeapTuplesPerPage);
+ int vac_work_mem = IsAutoVacuumWorkerProcess() &&
+ autovacuum_work_mem != -1 ?
+ autovacuum_work_mem * 1024L : maintenance_work_mem * 1024L;
/*
* Initialize state for a parallel vacuum. As of now, only one worker can
@@ -3174,7 +3105,7 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
else
vacrel->pvs = parallel_vacuum_init(vacrel->rel, vacrel->indrels,
vacrel->nindexes, nworkers,
- max_items,
+ vac_work_mem, MaxHeapTuplesPerPage,
vacrel->verbose ? INFO : DEBUG2,
vacrel->bstrategy);
@@ -3187,11 +3118,8 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
}
/* Serial VACUUM case */
- dead_items = (VacDeadItems *) palloc(vac_max_items_to_alloc_size(max_items));
- dead_items->max_items = max_items;
- dead_items->num_items = 0;
-
- vacrel->dead_items = dead_items;
+ vacrel->dead_items = tidstore_create(vac_work_mem, MaxHeapTuplesPerPage,
+ NULL);
}
/*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 8608e3fa5b..a526e607fe 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1165,7 +1165,7 @@ CREATE VIEW pg_stat_progress_vacuum AS
END AS phase,
S.param2 AS heap_blks_total, S.param3 AS heap_blks_scanned,
S.param4 AS heap_blks_vacuumed, S.param5 AS index_vacuum_count,
- S.param6 AS max_dead_tuples, S.param7 AS num_dead_tuples
+ S.param6 AS max_dead_tuple_bytes, S.param7 AS dead_tuple_bytes
FROM pg_stat_get_progress_info('VACUUM') AS S
LEFT JOIN pg_database D ON S.datid = D.oid;
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 7b1a4b127e..d8e680ca20 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -97,7 +97,6 @@ static bool vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
static double compute_parallel_delay(void);
static VacOptValue get_vacoptval_from_boolean(DefElem *def);
static bool vac_tid_reaped(ItemPointer itemptr, void *state);
-static int vac_cmp_itemptr(const void *left, const void *right);
/*
* Primary entry point for manual VACUUM and ANALYZE commands
@@ -2303,16 +2302,16 @@ get_vacoptval_from_boolean(DefElem *def)
*/
IndexBulkDeleteResult *
vac_bulkdel_one_index(IndexVacuumInfo *ivinfo, IndexBulkDeleteResult *istat,
- VacDeadItems *dead_items)
+ TidStore *dead_items)
{
/* Do bulk deletion */
istat = index_bulk_delete(ivinfo, istat, vac_tid_reaped,
(void *) dead_items);
ereport(ivinfo->message_level,
- (errmsg("scanned index \"%s\" to remove %d row versions",
+ (errmsg("scanned index \"%s\" to remove " UINT64_FORMAT " row versions",
RelationGetRelationName(ivinfo->index),
- dead_items->num_items)));
+ tidstore_num_tids(dead_items))));
return istat;
}
@@ -2343,82 +2342,15 @@ vac_cleanup_one_index(IndexVacuumInfo *ivinfo, IndexBulkDeleteResult *istat)
return istat;
}
-/*
- * Returns the total required space for VACUUM's dead_items array given a
- * max_items value.
- */
-Size
-vac_max_items_to_alloc_size(int max_items)
-{
- Assert(max_items <= MAXDEADITEMS(MaxAllocSize));
-
- return offsetof(VacDeadItems, items) + sizeof(ItemPointerData) * max_items;
-}
-
/*
* vac_tid_reaped() -- is a particular tid deletable?
*
* This has the right signature to be an IndexBulkDeleteCallback.
- *
- * Assumes dead_items array is sorted (in ascending TID order).
*/
static bool
vac_tid_reaped(ItemPointer itemptr, void *state)
{
- VacDeadItems *dead_items = (VacDeadItems *) state;
- int64 litem,
- ritem,
- item;
- ItemPointer res;
-
- litem = itemptr_encode(&dead_items->items[0]);
- ritem = itemptr_encode(&dead_items->items[dead_items->num_items - 1]);
- item = itemptr_encode(itemptr);
-
- /*
- * Doing a simple bound check before bsearch() is useful to avoid the
- * extra cost of bsearch(), especially if dead items on the heap are
- * concentrated in a certain range. Since this function is called for
- * every index tuple, it pays to be really fast.
- */
- if (item < litem || item > ritem)
- return false;
-
- res = (ItemPointer) bsearch((void *) itemptr,
- (void *) dead_items->items,
- dead_items->num_items,
- sizeof(ItemPointerData),
- vac_cmp_itemptr);
-
- return (res != NULL);
-}
-
-/*
- * Comparator routines for use with qsort() and bsearch().
- */
-static int
-vac_cmp_itemptr(const void *left, const void *right)
-{
- BlockNumber lblk,
- rblk;
- OffsetNumber loff,
- roff;
-
- lblk = ItemPointerGetBlockNumber((ItemPointer) left);
- rblk = ItemPointerGetBlockNumber((ItemPointer) right);
-
- if (lblk < rblk)
- return -1;
- if (lblk > rblk)
- return 1;
-
- loff = ItemPointerGetOffsetNumber((ItemPointer) left);
- roff = ItemPointerGetOffsetNumber((ItemPointer) right);
-
- if (loff < roff)
- return -1;
- if (loff > roff)
- return 1;
+ TidStore *dead_items = (TidStore *) state;
- return 0;
+ return tidstore_lookup_tid(dead_items, itemptr);
}
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index bcd40c80a1..5c7e6ed99c 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -103,6 +103,9 @@ typedef struct PVShared
/* Counter for vacuuming and cleanup */
pg_atomic_uint32 idx;
+
+ /* Handle of the shared TidStore */
+ tidstore_handle dead_items_handle;
} PVShared;
/* Status used during parallel index vacuum or cleanup */
@@ -166,7 +169,8 @@ struct ParallelVacuumState
PVIndStats *indstats;
/* Shared dead items space among parallel vacuum workers */
- VacDeadItems *dead_items;
+ TidStore *dead_items;
+ dsa_area *dead_items_area;
/* Points to buffer usage area in DSM */
BufferUsage *buffer_usage;
@@ -222,20 +226,23 @@ static void parallel_vacuum_error_callback(void *arg);
*/
ParallelVacuumState *
parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
- int nrequested_workers, int max_items,
- int elevel, BufferAccessStrategy bstrategy)
+ int nrequested_workers, int vac_work_mem,
+ int max_offset, int elevel,
+ BufferAccessStrategy bstrategy)
{
ParallelVacuumState *pvs;
ParallelContext *pcxt;
PVShared *shared;
- VacDeadItems *dead_items;
+ TidStore *dead_items;
PVIndStats *indstats;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
+ void *area_space;
+ dsa_area *dead_items_dsa;
bool *will_parallel_vacuum;
Size est_indstats_len;
Size est_shared_len;
- Size est_dead_items_len;
+ Size dsa_minsize = dsa_minimum_size();
int nindexes_mwm = 0;
int parallel_workers = 0;
int querylen;
@@ -283,9 +290,8 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_estimate_chunk(&pcxt->estimator, est_shared_len);
shm_toc_estimate_keys(&pcxt->estimator, 1);
- /* Estimate size for dead_items -- PARALLEL_VACUUM_KEY_DEAD_ITEMS */
- est_dead_items_len = vac_max_items_to_alloc_size(max_items);
- shm_toc_estimate_chunk(&pcxt->estimator, est_dead_items_len);
+ /* Estimate size for dead tuple DSA -- PARALLEL_VACUUM_KEY_DSA */
+ shm_toc_estimate_chunk(&pcxt->estimator, dsa_minsize);
shm_toc_estimate_keys(&pcxt->estimator, 1);
/*
@@ -351,6 +357,16 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_INDEX_STATS, indstats);
pvs->indstats = indstats;
+ /* Prepare DSA space for dead items */
+ area_space = shm_toc_allocate(pcxt->toc, dsa_minsize);
+ shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_DEAD_ITEMS, area_space);
+ dead_items_dsa = dsa_create_in_place(area_space, dsa_minsize,
+ LWTRANCHE_PARALLEL_VACUUM_DSA,
+ pcxt->seg);
+ dead_items = tidstore_create(vac_work_mem, max_offset, dead_items_dsa);
+ pvs->dead_items = dead_items;
+ pvs->dead_items_area = dead_items_dsa;
+
/* Prepare shared information */
shared = (PVShared *) shm_toc_allocate(pcxt->toc, est_shared_len);
MemSet(shared, 0, est_shared_len);
@@ -360,6 +376,7 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
(nindexes_mwm > 0) ?
maintenance_work_mem / Min(parallel_workers, nindexes_mwm) :
maintenance_work_mem;
+ shared->dead_items_handle = tidstore_get_handle(dead_items);
pg_atomic_init_u32(&(shared->cost_balance), 0);
pg_atomic_init_u32(&(shared->active_nworkers), 0);
@@ -368,15 +385,6 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_SHARED, shared);
pvs->shared = shared;
- /* Prepare the dead_items space */
- dead_items = (VacDeadItems *) shm_toc_allocate(pcxt->toc,
- est_dead_items_len);
- dead_items->max_items = max_items;
- dead_items->num_items = 0;
- MemSet(dead_items->items, 0, sizeof(ItemPointerData) * max_items);
- shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_DEAD_ITEMS, dead_items);
- pvs->dead_items = dead_items;
-
/*
* Allocate space for each worker's BufferUsage and WalUsage; no need to
* initialize
@@ -434,6 +442,9 @@ parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats)
istats[i] = NULL;
}
+ tidstore_destroy(pvs->dead_items);
+ dsa_detach(pvs->dead_items_area);
+
DestroyParallelContext(pvs->pcxt);
ExitParallelMode();
@@ -442,7 +453,7 @@ parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats)
}
/* Returns the dead items space */
-VacDeadItems *
+TidStore *
parallel_vacuum_get_dead_items(ParallelVacuumState *pvs)
{
return pvs->dead_items;
@@ -940,7 +951,9 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
Relation *indrels;
PVIndStats *indstats;
PVShared *shared;
- VacDeadItems *dead_items;
+ TidStore *dead_items;
+ void *area_space;
+ dsa_area *dead_items_area;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
int nindexes;
@@ -984,10 +997,10 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
PARALLEL_VACUUM_KEY_INDEX_STATS,
false);
- /* Set dead_items space */
- dead_items = (VacDeadItems *) shm_toc_lookup(toc,
- PARALLEL_VACUUM_KEY_DEAD_ITEMS,
- false);
+ /* Set dead items */
+ area_space = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_DEAD_ITEMS, false);
+ dead_items_area = dsa_attach_in_place(area_space, seg);
+ dead_items = tidstore_attach(dead_items_area, shared->dead_items_handle);
/* Set cost-based vacuum delay */
VacuumCostActive = (VacuumCostDelay > 0);
@@ -1033,6 +1046,9 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
&wal_usage[ParallelWorkerNumber]);
+ tidstore_detach(pvs.dead_items);
+ dsa_detach(dead_items_area);
+
/* Pop the error context stack */
error_context_stack = errcallback.previous;
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index f5ea381c53..d88db3e1f8 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -3397,12 +3397,12 @@ check_autovacuum_work_mem(int *newval, void **extra, GucSource source)
return true;
/*
- * We clamp manually-set values to at least 1MB. Since
+ * We clamp manually-set values to at least 2MB. Since
* maintenance_work_mem is always set to at least this value, do the same
* here.
*/
- if (*newval < 1024)
- *newval = 1024;
+ if (*newval < 2048)
+ *newval = 2048;
return true;
}
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 55b3a04097..c223a7dc94 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -192,6 +192,8 @@ static const char *const BuiltinTrancheNames[] = {
"LogicalRepLauncherDSA",
/* LWTRANCHE_LAUNCHER_HASH: */
"LogicalRepLauncherHash",
+ /* LWTRANCHE_PARALLEL_VACUUM_DSA: */
+ "ParallelVacuumDSA",
};
StaticAssertDecl(lengthof(BuiltinTrancheNames) ==
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index c5a95f5dcc..a8e7041395 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2312,7 +2312,7 @@ struct config_int ConfigureNamesInt[] =
GUC_UNIT_KB
},
&maintenance_work_mem,
- 65536, 1024, MAX_KILOBYTES,
+ 65536, 2048, MAX_KILOBYTES,
NULL, NULL, NULL
},
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index e5add41352..b209d3cf84 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -23,8 +23,8 @@
#define PROGRESS_VACUUM_HEAP_BLKS_SCANNED 2
#define PROGRESS_VACUUM_HEAP_BLKS_VACUUMED 3
#define PROGRESS_VACUUM_NUM_INDEX_VACUUMS 4
-#define PROGRESS_VACUUM_MAX_DEAD_TUPLES 5
-#define PROGRESS_VACUUM_NUM_DEAD_TUPLES 6
+#define PROGRESS_VACUUM_MAX_DEAD_TUPLE_BYTES 5
+#define PROGRESS_VACUUM_DEAD_TUPLE_BYTES 6
/* Phases of vacuum (as advertised via PROGRESS_VACUUM_PHASE) */
#define PROGRESS_VACUUM_PHASE_SCAN_HEAP 1
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 689dbb7702..a3ebb169ef 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -17,6 +17,7 @@
#include "access/htup.h"
#include "access/genam.h"
#include "access/parallel.h"
+#include "access/tidstore.h"
#include "catalog/pg_class.h"
#include "catalog/pg_statistic.h"
#include "catalog/pg_type.h"
@@ -276,21 +277,6 @@ struct VacuumCutoffs
MultiXactId MultiXactCutoff;
};
-/*
- * VacDeadItems stores TIDs whose index tuples are deleted by index vacuuming.
- */
-typedef struct VacDeadItems
-{
- int max_items; /* # slots allocated in array */
- int num_items; /* current # of entries */
-
- /* Sorted array of TIDs to delete from indexes */
- ItemPointerData items[FLEXIBLE_ARRAY_MEMBER];
-} VacDeadItems;
-
-#define MAXDEADITEMS(avail_mem) \
- (((avail_mem) - offsetof(VacDeadItems, items)) / sizeof(ItemPointerData))
-
/* GUC parameters */
extern PGDLLIMPORT int default_statistics_target; /* PGDLLIMPORT for PostGIS */
extern PGDLLIMPORT int vacuum_freeze_min_age;
@@ -339,18 +325,17 @@ extern Relation vacuum_open_relation(Oid relid, RangeVar *relation,
LOCKMODE lmode);
extern IndexBulkDeleteResult *vac_bulkdel_one_index(IndexVacuumInfo *ivinfo,
IndexBulkDeleteResult *istat,
- VacDeadItems *dead_items);
+ TidStore *dead_items);
extern IndexBulkDeleteResult *vac_cleanup_one_index(IndexVacuumInfo *ivinfo,
IndexBulkDeleteResult *istat);
-extern Size vac_max_items_to_alloc_size(int max_items);
/* in commands/vacuumparallel.c */
extern ParallelVacuumState *parallel_vacuum_init(Relation rel, Relation *indrels,
int nindexes, int nrequested_workers,
- int max_items, int elevel,
- BufferAccessStrategy bstrategy);
+ int vac_work_mem, int max_offset,
+ int elevel, BufferAccessStrategy bstrategy);
extern void parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats);
-extern VacDeadItems *parallel_vacuum_get_dead_items(ParallelVacuumState *pvs);
+extern TidStore *parallel_vacuum_get_dead_items(ParallelVacuumState *pvs);
extern void parallel_vacuum_bulkdel_all_indexes(ParallelVacuumState *pvs,
long num_table_tuples,
int num_index_scans);
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 07002fdfbe..537b34b30c 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -207,6 +207,7 @@ typedef enum BuiltinTrancheIds
LWTRANCHE_PGSTATS_DATA,
LWTRANCHE_LAUNCHER_DSA,
LWTRANCHE_LAUNCHER_HASH,
+ LWTRANCHE_PARALLEL_VACUUM_DSA,
LWTRANCHE_FIRST_USER_DEFINED
} BuiltinTrancheIds;
diff --git a/src/test/regress/expected/cluster.out b/src/test/regress/expected/cluster.out
index 2eec483eaa..e04f50726f 100644
--- a/src/test/regress/expected/cluster.out
+++ b/src/test/regress/expected/cluster.out
@@ -526,7 +526,7 @@ create index cluster_sort on clstr_4 (hundred, thousand, tenthous);
-- ensure we don't use the index in CLUSTER nor the checking SELECTs
set enable_indexscan = off;
-- Use external sort:
-set maintenance_work_mem = '1MB';
+set maintenance_work_mem = '2MB';
cluster clstr_4 using cluster_sort;
select * from
(select hundred, lag(hundred) over () as lhundred,
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 6cd57e3eaa..d1889b9d10 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -1214,7 +1214,7 @@ DROP TABLE unlogged_hash_table;
-- CREATE INDEX hash_ovfl_index ON hash_ovfl_heap USING hash (x int4_ops);
-- Test hash index build tuplesorting. Force hash tuplesort using low
-- maintenance_work_mem setting and fillfactor:
-SET maintenance_work_mem = '1MB';
+SET maintenance_work_mem = '2MB';
CREATE INDEX hash_tuplesort_idx ON tenk1 USING hash (stringu1 name_ops) WITH (fillfactor = 10);
EXPLAIN (COSTS OFF)
SELECT count(*) FROM tenk1 WHERE stringu1 = 'TVAAAA';
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index e7a2f5856a..f6ae02eb14 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2020,8 +2020,8 @@ pg_stat_progress_vacuum| SELECT s.pid,
s.param3 AS heap_blks_scanned,
s.param4 AS heap_blks_vacuumed,
s.param5 AS index_vacuum_count,
- s.param6 AS max_dead_tuples,
- s.param7 AS num_dead_tuples
+ s.param6 AS max_dead_tuple_bytes,
+ s.param7 AS dead_tuple_bytes
FROM (pg_stat_get_progress_info('VACUUM'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
LEFT JOIN pg_database d ON ((s.datid = d.oid)));
pg_stat_recovery_prefetch| SELECT stats_reset,
diff --git a/src/test/regress/sql/cluster.sql b/src/test/regress/sql/cluster.sql
index a4cfaae807..a4cb5b98a5 100644
--- a/src/test/regress/sql/cluster.sql
+++ b/src/test/regress/sql/cluster.sql
@@ -258,7 +258,7 @@ create index cluster_sort on clstr_4 (hundred, thousand, tenthous);
set enable_indexscan = off;
-- Use external sort:
-set maintenance_work_mem = '1MB';
+set maintenance_work_mem = '2MB';
cluster clstr_4 using cluster_sort;
select * from
(select hundred, lag(hundred) over () as lhundred,
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index a3738833b2..edb5e4b4f3 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -367,7 +367,7 @@ DROP TABLE unlogged_hash_table;
-- Test hash index build tuplesorting. Force hash tuplesort using low
-- maintenance_work_mem setting and fillfactor:
-SET maintenance_work_mem = '1MB';
+SET maintenance_work_mem = '2MB';
CREATE INDEX hash_tuplesort_idx ON tenk1 USING hash (stringu1 name_ops) WITH (fillfactor = 10);
EXPLAIN (COSTS OFF)
SELECT count(*) FROM tenk1 WHERE stringu1 = 'TVAAAA';
--
2.31.1
v24-0006-Add-TIDStore-to-store-sets-of-TIDs-ItemPointerDa.patch
From 9bb09e2742c2c8aa21802697c33fb3357f7516d9 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 23 Dec 2022 23:50:39 +0900
Subject: [PATCH v24 6/9] Add TIDStore, to store sets of TIDs (ItemPointerData)
efficiently.
The TIDStore is designed to store large sets of TIDs efficiently, and
is backed by a radix tree. A TID is encoded into a 64-bit key and
value and inserted into the radix tree.
The TIDStore is not used for anything yet, aside from the test code,
but the follow-up patch integrates the TIDStore with lazy vacuum,
reducing lazy vacuum memory usage and lifting the 1GB limit on its
size, by storing the list of dead TIDs more efficiently.
This includes a unit test module, in src/test/modules/test_tidstore.
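As a rough usage sketch, based only on the functions declared in tidstore.h
in this patch (the block number, offsets, and error handling are made up for
illustration):

    TidStore   *ts;
    TidStoreIter *iter;
    TidStoreIterResult *result;
    OffsetNumber offs[3] = {1, 5, 12};
    ItemPointerData tid;

    /* backend-local store, limited to maintenance_work_mem bytes */
    ts = tidstore_create(maintenance_work_mem * 1024L, MaxHeapTuplesPerPage, NULL);

    /* remember three dead item offsets on block 42 */
    tidstore_add_tids(ts, (BlockNumber) 42, offs, 3);

    /* existence check, as index vacuuming would do for each index tuple */
    ItemPointerSet(&tid, 42, 5);
    if (!tidstore_lookup_tid(ts, &tid))
        elog(ERROR, "tid unexpectedly missing");

    /* iterate back block by block; offsets come back sorted per block */
    iter = tidstore_begin_iterate(ts);
    while ((result = tidstore_iterate_next(iter)) != NULL)
    {
        /* use result->blkno and result->offsets[0 .. num_offsets - 1] */
    }
    tidstore_end_iterate(iter);

    tidstore_destroy(ts);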
---
doc/src/sgml/monitoring.sgml | 4 +
src/backend/access/common/Makefile | 1 +
src/backend/access/common/meson.build | 1 +
src/backend/access/common/tidstore.c | 674 ++++++++++++++++++
src/backend/storage/lmgr/lwlock.c | 2 +
src/include/access/tidstore.h | 49 ++
src/include/storage/lwlock.h | 1 +
src/test/modules/Makefile | 1 +
src/test/modules/meson.build | 1 +
src/test/modules/test_tidstore/Makefile | 23 +
.../test_tidstore/expected/test_tidstore.out | 13 +
src/test/modules/test_tidstore/meson.build | 35 +
.../test_tidstore/sql/test_tidstore.sql | 7 +
.../test_tidstore/test_tidstore--1.0.sql | 8 +
.../modules/test_tidstore/test_tidstore.c | 195 +++++
.../test_tidstore/test_tidstore.control | 4 +
16 files changed, 1019 insertions(+)
create mode 100644 src/backend/access/common/tidstore.c
create mode 100644 src/include/access/tidstore.h
create mode 100644 src/test/modules/test_tidstore/Makefile
create mode 100644 src/test/modules/test_tidstore/expected/test_tidstore.out
create mode 100644 src/test/modules/test_tidstore/meson.build
create mode 100644 src/test/modules/test_tidstore/sql/test_tidstore.sql
create mode 100644 src/test/modules/test_tidstore/test_tidstore--1.0.sql
create mode 100644 src/test/modules/test_tidstore/test_tidstore.c
create mode 100644 src/test/modules/test_tidstore/test_tidstore.control
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 1756f1a4b6..d936aa3da3 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -2192,6 +2192,10 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
<entry>Waiting to access a shared TID bitmap during a parallel bitmap
index scan.</entry>
</row>
+ <row>
+ <entry><literal>SharedTidStore</literal></entry>
+ <entry>Waiting to access a shared TID store.</entry>
+ </row>
<row>
<entry><literal>SharedTupleStore</literal></entry>
<entry>Waiting to access a shared tuple store during parallel
diff --git a/src/backend/access/common/Makefile b/src/backend/access/common/Makefile
index b9aff0ccfd..67b8cc6108 100644
--- a/src/backend/access/common/Makefile
+++ b/src/backend/access/common/Makefile
@@ -27,6 +27,7 @@ OBJS = \
syncscan.o \
toast_compression.o \
toast_internals.o \
+ tidstore.o \
tupconvert.o \
tupdesc.o
diff --git a/src/backend/access/common/meson.build b/src/backend/access/common/meson.build
index f5ac17b498..fce19c09ce 100644
--- a/src/backend/access/common/meson.build
+++ b/src/backend/access/common/meson.build
@@ -15,6 +15,7 @@ backend_sources += files(
'syncscan.c',
'toast_compression.c',
'toast_internals.c',
+ 'tidstore.c',
'tupconvert.c',
'tupdesc.c',
)
diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
new file mode 100644
index 0000000000..89aea71945
--- /dev/null
+++ b/src/backend/access/common/tidstore.c
@@ -0,0 +1,674 @@
+/*-------------------------------------------------------------------------
+ *
+ * tidstore.c
+ * Tid (ItemPointerData) storage implementation.
+ *
+ * This module provides an in-memory data structure to store Tids (ItemPointer).
+ * Internally, a tid is encoded as a pair of a 64-bit key and a 64-bit value and
+ * stored in the radix tree.
+ *
+ * A TidStore can be shared among parallel worker processes by passing a DSA
+ * area to tidstore_create(). Other backends can attach to the shared TidStore
+ * using tidstore_attach().
+ *
+ * XXX: Only one process is allowed to iterate over the TidStore at a time.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/common/tidstore.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "access/tidstore.h"
+#include "miscadmin.h"
+#include "port/pg_bitutils.h"
+#include "storage/lwlock.h"
+#include "utils/dsa.h"
+#include "utils/memutils.h"
+
+/*
+ * For encoding purposes, tids are represented as a pair of a 64-bit key and a
+ * 64-bit value. First, we construct a 64-bit unsigned integer by combining
+ * the block number and the offset number. The number of bits used for the
+ * offset number is determined by max_offset in tidstore_create(). We are
+ * frugal with the bits, because smaller keys help keep the radix tree shallow.
+ *
+ * For example, a tid of heap with 8kB blocks uses the lowest 9 bits for
+ * the offset number and uses the next 32 bits for the block number. That
+ * is, only 41 bits are used:
+ *
+ * uuuuuuuY YYYYYYYY YYYYYYYY YYYYYYYY YYYYYYYX XXXXXXXX
+ *
+ * X = bits used for offset number
+ * Y = bits used for block number
+ * u = unused bit
+ * (high on the left, low on the right)
+ *
+ * 9 bits are enough for the offset number, because MaxHeapTuplesPerPage < 2^9
+ * on 8kB blocks.
+ *
+ * The 64-bit value is the bitmap representation of the lowest 6 bits
+ * (TIDSTORE_VALUE_NBITS) of the integer, and the remaining 35 bits are used
+ * as the key:
+ *
+ * uuuuuuuY YYYYYYYY YYYYYYYY YYYYYYYY YYYYYYYX XXXXXXXX
+ * |----| value
+ * |---------------------------------------------| key
+ *
+ * The maximum height of the radix tree is 5 in this case.
+ *
+ * If all possible offset numbers fit in the 64-bit value (i.e. offset_nbits is
+ * at most TIDSTORE_VALUE_NBITS), we don't encode tids; the block number is used
+ * directly as the key and the value is the bitmap of offset numbers.
+ */
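+/*
+ * As a worked example, assume 8kB heap blocks, so max_offset is
+ * MaxHeapTuplesPerPage (291) and offset_nbits is 9. The tid (block 1000,
+ * offset 5) is combined into the integer (1000 << 9) | 5 = 512005. The
+ * low 6 bits select bit 5 of the value (512005 & 63 = 5), and the
+ * remaining bits form the key (512005 >> 6 = 8000). The block number is
+ * recovered from the key as 8000 >> (9 - 6) = 1000.
+ */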
+#define TIDSTORE_VALUE_NBITS 6 /* log(64, 2) */
+
+/* A magic value used to identify our TidStores. */
+#define TIDSTORE_MAGIC 0x826f6a10
+
+#define RT_PREFIX local_rt
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#define RT_VALUE_TYPE uint64
+#include "lib/radixtree.h"
+
+#define RT_PREFIX shared_rt
+#define RT_SHMEM
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#define RT_VALUE_TYPE uint64
+#include "lib/radixtree.h"
+
+/* The header object for a TidStore */
+typedef struct TidStoreControl
+{
+ int64 num_tids; /* the number of Tids stored so far */
+ size_t max_bytes; /* the maximum bytes a TidStore can use */
+ int max_offset; /* the maximum offset number */
+ bool encode_tids; /* do we use tid encoding? */
+ int offset_nbits; /* the number of bits used for offset number */
+ int offset_key_nbits; /* the number of bits of an offset number
+ * used for the key */
+
+ /* The below fields are used only in shared case */
+
+ uint32 magic;
+ LWLock lock;
+
+ /* handles for TidStore and radix tree */
+ tidstore_handle handle;
+ shared_rt_handle tree_handle;
+} TidStoreControl;
+
+/* Per-backend state for a TidStore */
+struct TidStore
+{
+ /*
+ * Control object. This is allocated in DSA area 'area' in the shared
+ * case, otherwise in backend-local memory.
+ */
+ TidStoreControl *control;
+
+ /* Storage for Tids. Use either one depending on TidStoreIsShared() */
+ union
+ {
+ local_rt_radix_tree *local;
+ shared_rt_radix_tree *shared;
+ } tree;
+
+ /* DSA area for TidStore if used */
+ dsa_area *area;
+};
+#define TidStoreIsShared(ts) ((ts)->area != NULL)
+
+/* Iterator for TidStore */
+typedef struct TidStoreIter
+{
+ TidStore *ts;
+
+ /* iterator of radix tree. Use either one depending on TidStoreIsShared() */
+ union
+ {
+ shared_rt_iter *shared;
+ local_rt_iter *local;
+ } tree_iter;
+
+ /* have we returned all the tids? */
+ bool finished;
+
+ /* save for the next iteration */
+ uint64 next_key;
+ uint64 next_val;
+
+ /* output for the caller */
+ TidStoreIterResult result;
+} TidStoreIter;
+
+static void tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, uint64 val);
+static inline BlockNumber key_get_blkno(TidStore *ts, uint64 key);
+static inline uint64 tid_to_key_off(TidStore *ts, ItemPointer tid, uint32 *off);
+
+/*
+ * Create a TidStore. The returned object is allocated in backend-local memory.
+ * The radix tree for storage is allocated in the DSA area if 'area' is non-NULL.
+ */
+TidStore *
+tidstore_create(size_t max_bytes, int max_offset, dsa_area *area)
+{
+ TidStore *ts;
+
+ ts = palloc0(sizeof(TidStore));
+
+ /*
+ * Create the radix tree for the main storage.
+ *
+ * Memory consumption depends not only on the number of Tids stored but also
+ * on their distribution, on how the radix tree stores them, and on the memory
+ * management backing the radix tree. The maximum number of bytes that a
+ * TidStore may use is specified by max_bytes in tidstore_create(). We want
+ * the total memory consumption not to exceed max_bytes.
+ *
+ * In non-shared cases, the radix tree uses slab allocators for each kind of
+ * node class. The most memory-consuming case while adding Tids associated
+ * with one page (i.e. during tidstore_add_tids()) is allocating the largest
+ * radix tree node in a new slab block, which is approximately 70kB.
+ * Therefore, we deduct 70kB from the maximum bytes.
+ *
+ * In shared cases, DSA allocates memory segments following a geometric series
+ * that approximately doubles the total DSA size (see make_new_segment() in
+ * dsa.c). We simulated how DSA grows its segments; the simulation showed that
+ * a 75% threshold of the maximum bytes works well when max_bytes is a power
+ * of two, and that a 60% threshold works for other cases.
+ */
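+ /*
+ * For example, in the shared case a max_bytes of 256MB (a power of two)
+ * yields an effective limit of 192MB, while 200MB yields 120MB.
+ */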
+ if (area != NULL)
+ {
+ dsa_pointer dp;
+ float ratio = ((max_bytes & (max_bytes - 1)) == 0) ? 0.75 : 0.6;
+
+ ts->tree.shared = shared_rt_create(CurrentMemoryContext, area,
+ LWTRANCHE_SHARED_TIDSTORE);
+
+ dp = dsa_allocate0(area, sizeof(TidStoreControl));
+ ts->control = (TidStoreControl *) dsa_get_address(area, dp);
+ ts->control->max_bytes = (uint64) (max_bytes * ratio);
+ ts->area = area;
+
+ ts->control->magic = TIDSTORE_MAGIC;
+ LWLockInitialize(&ts->control->lock, LWTRANCHE_SHARED_TIDSTORE);
+ ts->control->handle = dp;
+ ts->control->tree_handle = shared_rt_get_handle(ts->tree.shared);
+ }
+ else
+ {
+ ts->tree.local = local_rt_create(CurrentMemoryContext);
+
+ ts->control = (TidStoreControl *) palloc0(sizeof(TidStoreControl));
+ ts->control->max_bytes = max_bytes - (1024 * 70);
+ }
+
+ ts->control->max_offset = max_offset;
+ ts->control->offset_nbits = pg_ceil_log2_32(max_offset);
+
+ if (ts->control->offset_nbits > TIDSTORE_VALUE_NBITS)
+ {
+ ts->control->encode_tids = true;
+ ts->control->offset_key_nbits =
+ ts->control->offset_nbits - TIDSTORE_VALUE_NBITS;
+ }
+ else
+ {
+ ts->control->encode_tids = false;
+ ts->control->offset_key_nbits = 0;
+ }
+
+ return ts;
+}
+
+/*
+ * Attach to the shared TidStore using a handle. The returned object is
+ * allocated in backend-local memory using the CurrentMemoryContext.
+ */
+TidStore *
+tidstore_attach(dsa_area *area, tidstore_handle handle)
+{
+ TidStore *ts;
+ dsa_pointer control;
+
+ Assert(area != NULL);
+ Assert(DsaPointerIsValid(handle));
+
+ /* create per-backend state */
+ ts = palloc0(sizeof(TidStore));
+
+ /* Find the control object in shared memory */
+ control = handle;
+
+ /* Set up the TidStore */
+ ts->control = (TidStoreControl *) dsa_get_address(area, control);
+ Assert(ts->control->magic == TIDSTORE_MAGIC);
+
+ ts->tree.shared = shared_rt_attach(area, ts->control->tree_handle);
+ ts->area = area;
+
+ return ts;
+}
+
+/*
+ * Detach from a TidStore. This detaches from the radix tree and frees the
+ * backend-local resources. The radix tree will continue to exist until
+ * it is either explicitly destroyed, or the area that backs it is returned
+ * to the operating system.
+ */
+void
+tidstore_detach(TidStore *ts)
+{
+ Assert(TidStoreIsShared(ts) && ts->control->magic == TIDSTORE_MAGIC);
+
+ shared_rt_detach(ts->tree.shared);
+ pfree(ts);
+}
+
+/*
+ * Destroy a TidStore, returning all memory.
+ *
+ * TODO: The caller must be certain that no other backend will attempt to
+ * access the TidStore before calling this function. Other backends must
+ * explicitly call tidstore_detach to free up backend-local memory associated
+ * with the TidStore. The backend that calls tidstore_destroy must not call
+ * tidstore_detach.
+ */
+void
+tidstore_destroy(TidStore *ts)
+{
+ if (TidStoreIsShared(ts))
+ {
+ Assert(ts->control->magic == TIDSTORE_MAGIC);
+
+ /*
+ * Vandalize the control block to help catch programming error where
+ * other backends access the memory formerly occupied by this radix
+ * tree.
+ */
+ ts->control->magic = 0;
+ dsa_free(ts->area, ts->control->handle);
+ shared_rt_free(ts->tree.shared);
+ }
+ else
+ {
+ pfree(ts->control);
+ local_rt_free(ts->tree.local);
+ }
+
+ pfree(ts);
+}
+
+/* Forget all collected Tids */
+void
+tidstore_reset(TidStore *ts)
+{
+ if (TidStoreIsShared(ts))
+ {
+ Assert(ts->control->magic == TIDSTORE_MAGIC);
+
+ LWLockAcquire(&ts->control->lock, LW_EXCLUSIVE);
+
+ /*
+ * Free the radix tree and return allocated DSA segments to
+ * the operating system.
+ */
+ shared_rt_free(ts->tree.shared);
+ dsa_trim(ts->area);
+
+ /* Recreate the radix tree */
+ ts->tree.shared = shared_rt_create(CurrentMemoryContext, ts->area,
+ LWTRANCHE_SHARED_TIDSTORE);
+
+ /* update the radix tree handle as we recreated it */
+ ts->control->tree_handle = shared_rt_get_handle(ts->tree.shared);
+
+ /* Reset the statistics */
+ ts->control->num_tids = 0;
+
+ LWLockRelease(&ts->control->lock);
+ }
+ else
+ {
+ local_rt_free(ts->tree.local);
+ ts->tree.local = local_rt_create(CurrentMemoryContext);
+
+ /* Reset the statistics */
+ ts->control->num_tids = 0;
+ }
+}
+
+static inline void
+tidstore_insert_kv(TidStore *ts, uint64 key, uint64 val)
+{
+ if (TidStoreIsShared(ts))
+ shared_rt_set(ts->tree.shared, key, val);
+ else
+ local_rt_set(ts->tree.local, key, val);
+}
+
+/* Add Tids on a block to TidStore */
+void
+tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets)
+{
+ ItemPointerData tid;
+ uint64 key_base;
+ uint64 *values;
+ int nkeys;
+
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ ItemPointerSetBlockNumber(&tid, blkno);
+
+ if (ts->control->encode_tids)
+ {
+ key_base = ((uint64) blkno) << ts->control->offset_key_nbits;
+ nkeys = UINT64CONST(1) << ts->control->offset_key_nbits;
+ }
+ else
+ {
+ key_base = (uint64) blkno;
+ nkeys = 1;
+ }
+
+ values = palloc0(sizeof(uint64) * nkeys);
+
+ for (int i = 0; i < num_offsets; i++)
+ {
+ uint64 key;
+ uint32 off;
+ int idx;
+
+ ItemPointerSetOffsetNumber(&tid, offsets[i]);
+
+ /* encode the tid to key and val */
+ key = tid_to_key_off(ts, &tid, &off);
+
+ idx = key - key_base;
+ Assert(idx >= 0 && idx < nkeys);
+
+ values[idx] |= UINT64CONST(1) << off;
+ }
+
+ if (TidStoreIsShared(ts))
+ LWLockAcquire(&ts->control->lock, LW_EXCLUSIVE);
+
+ /* insert the calculated key-values to the tree */
+ for (int i = 0; i < nkeys; i++)
+ {
+ if (values[i])
+ {
+ uint64 key = key_base + i;
+
+ tidstore_insert_kv(ts, key, values[i]);
+ }
+ }
+
+ /* update statistics */
+ ts->control->num_tids += num_offsets;
+
+ if (TidStoreIsShared(ts))
+ LWLockRelease(&ts->control->lock);
+
+ pfree(values);
+}
+
+/* Return true if the given tid is present in the TidStore */
+bool
+tidstore_lookup_tid(TidStore *ts, ItemPointer tid)
+{
+ uint64 key;
+ uint64 val = 0;
+ uint32 off;
+ bool found;
+
+ key = tid_to_key_off(ts, tid, &off);
+
+ if (TidStoreIsShared(ts))
+ found = shared_rt_search(ts->tree.shared, key, &val);
+ else
+ found = local_rt_search(ts->tree.local, key, &val);
+
+ if (!found)
+ return false;
+
+ return (val & (UINT64CONST(1) << off)) != 0;
+}
+
+/*
+ * Prepare to iterate through a TidStore. The caller must be certain that
+ * no other backend will attempt to update the TidStore during the iteration.
+ */
+TidStoreIter *
+tidstore_begin_iterate(TidStore *ts)
+{
+ TidStoreIter *iter;
+
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ iter = palloc0(sizeof(TidStoreIter));
+ iter->ts = ts;
+
+ iter->result.blkno = InvalidBlockNumber;
+ iter->result.offsets = palloc(sizeof(OffsetNumber) * ts->control->max_offset);
+
+ if (TidStoreIsShared(ts))
+ iter->tree_iter.shared = shared_rt_begin_iterate(ts->tree.shared);
+ else
+ iter->tree_iter.local = local_rt_begin_iterate(ts->tree.local);
+
+ /* If the TidStore is empty, there is nothing to iterate over */
+ if (tidstore_num_tids(ts) == 0)
+ iter->finished = true;
+
+ return iter;
+}
+
+static inline bool
+tidstore_iter_kv(TidStoreIter *iter, uint64 *key, uint64 *val)
+{
+ if (TidStoreIsShared(iter->ts))
+ return shared_rt_iterate_next(iter->tree_iter.shared, key, val);
+ else
+ return local_rt_iterate_next(iter->tree_iter.local, key, val);
+}
+
+/*
+ * Scan the TidStore and return a TidStoreIterResult representing Tids
+ * in one page. Offset numbers in the result are sorted.
+ */
+TidStoreIterResult *
+tidstore_iterate_next(TidStoreIter *iter)
+{
+ uint64 key;
+ uint64 val;
+ TidStoreIterResult *result = &(iter->result);
+
+ if (iter->finished)
+ return NULL;
+
+ if (BlockNumberIsValid(result->blkno))
+ {
+ result->num_offsets = 0;
+ tidstore_iter_extract_tids(iter, iter->next_key, iter->next_val);
+ }
+
+ while (tidstore_iter_kv(iter, &key, &val))
+ {
+ BlockNumber blkno;
+
+ blkno = key_get_blkno(iter->ts, key);
+
+ if (BlockNumberIsValid(result->blkno) && result->blkno != blkno)
+ {
+ /*
+ * Remember the key-value pair for the next block for the
+ * next iteration.
+ */
+ iter->next_key = key;
+ iter->next_val = val;
+ return result;
+ }
+
+ /* Collect tids extracted from the key-value pair */
+ tidstore_iter_extract_tids(iter, key, val);
+ }
+
+ iter->finished = true;
+ return result;
+}
+
+/* Finish an iteration over TidStore */
+void
+tidstore_end_iterate(TidStoreIter *iter)
+{
+ if (TidStoreIsShared(iter->ts))
+ shared_rt_end_iterate(iter->tree_iter.shared);
+ else
+ local_rt_end_iterate(iter->tree_iter.local);
+
+ pfree(iter->result.offsets);
+ pfree(iter);
+}
+
+/* Return the number of Tids we collected so far */
+int64
+tidstore_num_tids(TidStore *ts)
+{
+ uint64 num_tids;
+
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ if (!TidStoreIsShared(ts))
+ return ts->control->num_tids;
+
+ LWLockAcquire(&ts->control->lock, LW_SHARED);
+ num_tids = ts->control->num_tids;
+ LWLockRelease(&ts->control->lock);
+
+ return num_tids;
+}
+
+/* Return true if the current memory usage of TidStore exceeds the limit */
+bool
+tidstore_is_full(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ return (tidstore_memory_usage(ts) > ts->control->max_bytes);
+}
+
+/* Return the maximum memory TidStore can use */
+size_t
+tidstore_max_memory(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ return ts->control->max_bytes;
+}
+
+/* Return the memory usage of TidStore */
+size_t
+tidstore_memory_usage(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ /*
+ * In the shared case, TidStoreControl and radix_tree are backed by the
+ * same DSA area and rt_memory_usage() returns the value including both.
+ * So we don't need to add the size of TidStoreControl separately.
+ */
+ if (TidStoreIsShared(ts))
+ return sizeof(TidStore) + shared_rt_memory_usage(ts->tree.shared);
+ else
+ return sizeof(TidStore) + sizeof(TidStoreControl) +
+ local_rt_memory_usage(ts->tree.local);
+}
+
+/*
+ * Get a handle that can be used by other processes to attach to this TidStore
+ */
+tidstore_handle
+tidstore_get_handle(TidStore *ts)
+{
+ Assert(TidStoreIsShared(ts) && ts->control->magic == TIDSTORE_MAGIC);
+
+ return ts->control->handle;
+}
+
+/* Extract Tids from the given key-value pair */
+static void
+tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, uint64 val)
+{
+ TidStoreIterResult *result = (&iter->result);
+
+ for (int i = 0; i < sizeof(uint64) * BITS_PER_BYTE; i++)
+ {
+ uint64 tid_i;
+ OffsetNumber off;
+
+ if (i > iter->ts->control->max_offset)
+ break;
+
+ if ((val & (UINT64CONST(1) << i)) == 0)
+ continue;
+
+ tid_i = key << TIDSTORE_VALUE_NBITS;
+ tid_i |= i;
+
+ off = tid_i & ((UINT64CONST(1) << iter->ts->control->offset_nbits) - 1);
+
+ Assert(result->num_offsets < iter->ts->control->max_offset);
+ result->offsets[result->num_offsets++] = off;
+ }
+
+ result->blkno = key_get_blkno(iter->ts, key);
+}
+
+/* Get block number from the given key */
+static inline BlockNumber
+key_get_blkno(TidStore *ts, uint64 key)
+{
+ if (ts->control->encode_tids)
+ return (BlockNumber) (key >> ts->control->offset_key_nbits);
+ else
+ return (BlockNumber) key;
+}
+
+/* Encode a tid to key and offset */
+static inline uint64
+tid_to_key_off(TidStore *ts, ItemPointer tid, uint32 *off)
+{
+ uint64 key;
+ uint64 tid_i;
+
+ if (!ts->control->encode_tids)
+ {
+ *off = ItemPointerGetOffsetNumber(tid);
+
+ /* Use the block number as the key */
+ return (int64) ItemPointerGetBlockNumber(tid);
+ }
+
+ tid_i = ItemPointerGetOffsetNumber(tid);
+ tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << ts->control->offset_nbits;
+
+ *off = tid_i & ((UINT64CONST(1) << TIDSTORE_VALUE_NBITS) - 1);
+ key = tid_i >> TIDSTORE_VALUE_NBITS;
+ Assert(*off < (sizeof(uint64) * BITS_PER_BYTE));
+
+ return key;
+}
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index d2ec396045..55b3a04097 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -176,6 +176,8 @@ static const char *const BuiltinTrancheNames[] = {
"SharedTupleStore",
/* LWTRANCHE_SHARED_TIDBITMAP: */
"SharedTidBitmap",
+ /* LWTRANCHE_SHARED_TIDSTORE: */
+ "SharedTidStore",
/* LWTRANCHE_PARALLEL_APPEND: */
"ParallelAppend",
/* LWTRANCHE_PER_XACT_PREDICATE_LIST: */
diff --git a/src/include/access/tidstore.h b/src/include/access/tidstore.h
new file mode 100644
index 0000000000..a35a52124a
--- /dev/null
+++ b/src/include/access/tidstore.h
@@ -0,0 +1,49 @@
+/*-------------------------------------------------------------------------
+ *
+ * tidstore.h
+ * Tid storage.
+ *
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/tidstore.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef TIDSTORE_H
+#define TIDSTORE_H
+
+#include "storage/itemptr.h"
+#include "utils/dsa.h"
+
+typedef dsa_pointer tidstore_handle;
+
+typedef struct TidStore TidStore;
+typedef struct TidStoreIter TidStoreIter;
+
+typedef struct TidStoreIterResult
+{
+ BlockNumber blkno;
+ OffsetNumber *offsets;
+ int num_offsets;
+} TidStoreIterResult;
+
+extern TidStore *tidstore_create(size_t max_bytes, int max_offset, dsa_area *dsa);
+extern TidStore *tidstore_attach(dsa_area *dsa, dsa_pointer handle);
+extern void tidstore_detach(TidStore *ts);
+extern void tidstore_destroy(TidStore *ts);
+extern void tidstore_reset(TidStore *ts);
+extern void tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets);
+extern bool tidstore_lookup_tid(TidStore *ts, ItemPointer tid);
+extern TidStoreIter * tidstore_begin_iterate(TidStore *ts);
+extern TidStoreIterResult *tidstore_iterate_next(TidStoreIter *iter);
+extern void tidstore_end_iterate(TidStoreIter *iter);
+extern int64 tidstore_num_tids(TidStore *ts);
+extern bool tidstore_is_full(TidStore *ts);
+extern size_t tidstore_max_memory(TidStore *ts);
+extern size_t tidstore_memory_usage(TidStore *ts);
+extern tidstore_handle tidstore_get_handle(TidStore *ts);
+
+#endif /* TIDSTORE_H */
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index d2c7afb8f4..07002fdfbe 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -199,6 +199,7 @@ typedef enum BuiltinTrancheIds
LWTRANCHE_PER_SESSION_RECORD_TYPMOD,
LWTRANCHE_SHARED_TUPLESTORE,
LWTRANCHE_SHARED_TIDBITMAP,
+ LWTRANCHE_SHARED_TIDSTORE,
LWTRANCHE_PARALLEL_APPEND,
LWTRANCHE_PER_XACT_PREDICATE_LIST,
LWTRANCHE_PGSTATS_DSA,
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 9659eb85d7..bddc16ada7 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -34,6 +34,7 @@ SUBDIRS = \
test_rls_hooks \
test_shm_mq \
test_slru \
+ test_tidstore \
unsafe_tests \
worker_spi
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 232cbdac80..c0d5645ad8 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -30,5 +30,6 @@ subdir('test_regex')
subdir('test_rls_hooks')
subdir('test_shm_mq')
subdir('test_slru')
+subdir('test_tidstore')
subdir('unsafe_tests')
subdir('worker_spi')
diff --git a/src/test/modules/test_tidstore/Makefile b/src/test/modules/test_tidstore/Makefile
new file mode 100644
index 0000000000..dab107d70c
--- /dev/null
+++ b/src/test/modules/test_tidstore/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_tidstore/Makefile
+
+MODULE_big = test_tidstore
+OBJS = \
+ $(WIN32RES) \
+ test_tidstore.o
+PGFILEDESC = "test_tidstore - test code for src/backend/access/common/tidstore.c"
+
+EXTENSION = test_tidstore
+DATA = test_tidstore--1.0.sql
+
+REGRESS = test_tidstore
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_tidstore
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_tidstore/expected/test_tidstore.out b/src/test/modules/test_tidstore/expected/test_tidstore.out
new file mode 100644
index 0000000000..7ff2f9af87
--- /dev/null
+++ b/src/test/modules/test_tidstore/expected/test_tidstore.out
@@ -0,0 +1,13 @@
+CREATE EXTENSION test_tidstore;
+--
+-- All the logic is in the test_tidstore() function. It will throw
+-- an error if something fails.
+--
+SELECT test_tidstore();
+NOTICE: testing empty tidstore
+NOTICE: testing basic operations
+ test_tidstore
+---------------
+
+(1 row)
+
diff --git a/src/test/modules/test_tidstore/meson.build b/src/test/modules/test_tidstore/meson.build
new file mode 100644
index 0000000000..31f2da7b61
--- /dev/null
+++ b/src/test/modules/test_tidstore/meson.build
@@ -0,0 +1,35 @@
+# FIXME: prevent install during main install, but not during test :/
+
+test_tidstore_sources = files(
+ 'test_tidstore.c',
+)
+
+if host_system == 'windows'
+ test_tidstore_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'test_tidstore',
+ '--FILEDESC', 'test_tidstore - test code for src/backend/access/common/tidstore.c',])
+endif
+
+test_tidstore = shared_module('test_tidstore',
+ test_tidstore_sources,
+ link_with: pgport_srv,
+ kwargs: pg_mod_args,
+)
+testprep_targets += test_tidstore
+
+install_data(
+ 'test_tidstore.control',
+ 'test_tidstore--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'test_tidstore',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'test_tidstore',
+ ],
+ },
+}
diff --git a/src/test/modules/test_tidstore/sql/test_tidstore.sql b/src/test/modules/test_tidstore/sql/test_tidstore.sql
new file mode 100644
index 0000000000..03aea31815
--- /dev/null
+++ b/src/test/modules/test_tidstore/sql/test_tidstore.sql
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_tidstore;
+
+--
+-- All the logic is in the test_tidstore() function. It will throw
+-- an error if something fails.
+--
+SELECT test_tidstore();
diff --git a/src/test/modules/test_tidstore/test_tidstore--1.0.sql b/src/test/modules/test_tidstore/test_tidstore--1.0.sql
new file mode 100644
index 0000000000..47e9149900
--- /dev/null
+++ b/src/test/modules/test_tidstore/test_tidstore--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_tidstore/test_tidstore--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_tidstore" to load this file. \quit
+
+CREATE FUNCTION test_tidstore()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_tidstore/test_tidstore.c b/src/test/modules/test_tidstore/test_tidstore.c
new file mode 100644
index 0000000000..9b849ae8e8
--- /dev/null
+++ b/src/test/modules/test_tidstore/test_tidstore.c
@@ -0,0 +1,195 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_tidstore.c
+ * Test TidStore data structure.
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/test/modules/test_tidstore/test_tidstore.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "access/tidstore.h"
+#include "fmgr.h"
+#include "miscadmin.h"
+#include "storage/block.h"
+#include "storage/itemptr.h"
+#include "utils/memutils.h"
+
+PG_MODULE_MAGIC;
+
+#define TEST_TIDSTORE_MAX_BYTES (2 * 1024 * 1024L) /* 2MB */
+
+PG_FUNCTION_INFO_V1(test_tidstore);
+
+static void
+check_tid(TidStore *ts, BlockNumber blkno, OffsetNumber off, bool expect)
+{
+ ItemPointerData tid;
+ bool found;
+
+ ItemPointerSet(&tid, blkno, off);
+
+ found = tidstore_lookup_tid(ts, &tid);
+
+ if (found != expect)
+ elog(ERROR, "lookup TID (%u, %u) returned %d, expected %d",
+ blkno, off, found, expect);
+}
+
+static void
+test_basic(int max_offset)
+{
+#define TEST_TIDSTORE_NUM_BLOCKS 5
+#define TEST_TIDSTORE_NUM_OFFSETS 5
+
+ TidStore *ts;
+ TidStoreIter *iter;
+ TidStoreIterResult *iter_result;
+ BlockNumber blks[TEST_TIDSTORE_NUM_BLOCKS] = {
+ 0, MaxBlockNumber, MaxBlockNumber - 1, 1, MaxBlockNumber / 2,
+ };
+ BlockNumber blks_sorted[TEST_TIDSTORE_NUM_BLOCKS] = {
+ 0, 1, MaxBlockNumber / 2, MaxBlockNumber - 1, MaxBlockNumber
+ };
+ OffsetNumber offs[TEST_TIDSTORE_NUM_OFFSETS];
+ int blk_idx;
+
+ /* prepare the offset array */
+ offs[0] = FirstOffsetNumber;
+ offs[1] = FirstOffsetNumber + 1;
+ offs[2] = max_offset / 2;
+ offs[3] = max_offset - 1;
+ offs[4] = max_offset;
+
+ ts = tidstore_create(TEST_TIDSTORE_MAX_BYTES, max_offset, NULL);
+
+ /* add tids */
+ for (int i = 0; i < TEST_TIDSTORE_NUM_BLOCKS; i++)
+ tidstore_add_tids(ts, blks[i], offs, TEST_TIDSTORE_NUM_OFFSETS);
+
+ /* lookup test */
+ for (OffsetNumber off = FirstOffsetNumber; off < max_offset; off++)
+ {
+ bool expect = false;
+ for (int i = 0; i < TEST_TIDSTORE_NUM_OFFSETS; i++)
+ {
+ if (offs[i] == off)
+ {
+ expect = true;
+ break;
+ }
+ }
+
+ check_tid(ts, 0, off, expect);
+ check_tid(ts, 2, off, false);
+ check_tid(ts, MaxBlockNumber - 2, off, false);
+ check_tid(ts, MaxBlockNumber, off, expect);
+ }
+
+ /* test the number of tids */
+ if (tidstore_num_tids(ts) != (TEST_TIDSTORE_NUM_BLOCKS * TEST_TIDSTORE_NUM_OFFSETS))
+ elog(ERROR, "tidstore_num_tids returned " UINT64_FORMAT ", expected %d",
+ tidstore_num_tids(ts),
+ TEST_TIDSTORE_NUM_BLOCKS * TEST_TIDSTORE_NUM_OFFSETS);
+
+ /* iteration test */
+ iter = tidstore_begin_iterate(ts);
+ blk_idx = 0;
+ while ((iter_result = tidstore_iterate_next(iter)) != NULL)
+ {
+ /* check the returned block number */
+ if (blks_sorted[blk_idx] != iter_result->blkno)
+ elog(ERROR, "tidstore_iterate_next returned block number %u, expected %u",
+ iter_result->blkno, blks_sorted[blk_idx]);
+
+ /* check the returned offset numbers */
+ if (TEST_TIDSTORE_NUM_OFFSETS != iter_result->num_offsets)
+ elog(ERROR, "tidstore_iterate_next returned %u offsets, expected %u",
+ iter_result->num_offsets, TEST_TIDSTORE_NUM_OFFSETS);
+
+ for (int i = 0; i < iter_result->num_offsets; i++)
+ {
+ if (offs[i] != iter_result->offsets[i])
+ elog(ERROR, "tidstore_iterate_next returned offset number %u on block %u, expected %u",
+ iter_result->offsets[i], iter_result->blkno, offs[i]);
+ }
+
+ blk_idx++;
+ }
+
+ if (blk_idx != TEST_TIDSTORE_NUM_BLOCKS)
+ elog(ERROR, "tidstore_iterate_next returned %d blocks, expected %d",
+ blk_idx, TEST_TIDSTORE_NUM_BLOCKS);
+
+ /* remove all tids */
+ tidstore_reset(ts);
+
+ /* test the number of tids */
+ if (tidstore_num_tids(ts) != 0)
+ elog(ERROR, "tidstore_num_tids on empty store returned non-zero");
+
+ /* lookup test for empty store */
+ for (OffsetNumber off = FirstOffsetNumber; off < MaxHeapTuplesPerPage;
+ off++)
+ {
+ check_tid(ts, 0, off, false);
+ check_tid(ts, 2, off, false);
+ check_tid(ts, MaxBlockNumber - 2, off, false);
+ check_tid(ts, MaxBlockNumber, off, false);
+ }
+
+ tidstore_destroy(ts);
+}
+
+static void
+test_empty(void)
+{
+ TidStore *ts;
+ TidStoreIter *iter;
+ ItemPointerData tid;
+
+ elog(NOTICE, "testing empty tidstore");
+
+ ts = tidstore_create(TEST_TIDSTORE_MAX_BYTES, MaxHeapTuplesPerPage, NULL);
+
+ ItemPointerSet(&tid, 0, FirstOffsetNumber);
+ if (tidstore_lookup_tid(ts, &tid))
+ elog(ERROR, "tidstore_lookup_tid for (0,1) on empty store returned true");
+
+ ItemPointerSet(&tid, MaxBlockNumber, MaxOffsetNumber);
+ if (tidstore_lookup_tid(ts, &tid))
+ elog(ERROR, "tidstore_lookup_tid for (%u,%u) on empty store returned true",
+ MaxBlockNumber, MaxOffsetNumber);
+
+ if (tidstore_num_tids(ts) != 0)
+ elog(ERROR, "tidstore_num_entries on empty store returned non-zero");
+
+ if (tidstore_is_full(ts))
+ elog(ERROR, "tidstore_is_full on empty store returned true");
+
+ iter = tidstore_begin_iterate(ts);
+
+ if (tidstore_iterate_next(iter) != NULL)
+ elog(ERROR, "tidstore_iterate_next on empty store returned TIDs");
+
+ tidstore_end_iterate(iter);
+
+ tidstore_destroy(ts);
+}
+
+Datum
+test_tidstore(PG_FUNCTION_ARGS)
+{
+ test_empty();
+
+ elog(NOTICE, "testing basic operations");
+ test_basic(MaxHeapTuplesPerPage);
+ test_basic(10);
+
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_tidstore/test_tidstore.control b/src/test/modules/test_tidstore/test_tidstore.control
new file mode 100644
index 0000000000..9b6bd4638f
--- /dev/null
+++ b/src/test/modules/test_tidstore/test_tidstore.control
@@ -0,0 +1,4 @@
+comment = 'Test code for tidstore'
+default_version = '1.0'
+module_pathname = '$libdir/test_tidstore'
+relocatable = true
--
2.31.1
v24-0002-Move-some-bitmap-logic-out-of-bitmapset.c.patch
From f4ae4a7c957b5e9351607ffbd85cd044ed09c339 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Tue, 6 Dec 2022 13:39:41 +0700
Subject: [PATCH v24 2/9] Move some bitmap logic out of bitmapset.c
Add pg_rightmost_one32/64 functions. This functionality was previously
private to bitmapset.c as the RIGHTMOST_ONE macro. It has practical
use in other contexts, so move to pg_bitutils.h.
Also move the logic for selecting appropriate pg_bitutils functions
based on word size to bitmapset.h for wider visibility and add
appropriate selection for bmw_rightmost_one().
Since the previous macro relied on casting to signedbitmapword,
and the new functions do not, remove that typedef.
Design input and review by Tom Lane
Discussion: https://www.postgresql.org/message-id/CAFBsxsFW2JjTo58jtDB%2B3sZhxMx3t-3evew8%3DAcr%2BGGhC%2BkFaA%40mail.gmail.com
---
src/backend/nodes/bitmapset.c | 36 ++------------------------------
src/include/nodes/bitmapset.h | 16 ++++++++++++--
src/include/port/pg_bitutils.h | 31 +++++++++++++++++++++++++++
src/tools/pgindent/typedefs.list | 1 -
4 files changed, 47 insertions(+), 37 deletions(-)
diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index 98660524ad..fcd8e2ccbc 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -32,39 +32,7 @@
#define BITMAPSET_SIZE(nwords) \
(offsetof(Bitmapset, words) + (nwords) * sizeof(bitmapword))
-/*----------
- * This is a well-known cute trick for isolating the rightmost one-bit
- * in a word. It assumes two's complement arithmetic. Consider any
- * nonzero value, and focus attention on the rightmost one. The value is
- * then something like
- * xxxxxx10000
- * where x's are unspecified bits. The two's complement negative is formed
- * by inverting all the bits and adding one. Inversion gives
- * yyyyyy01111
- * where each y is the inverse of the corresponding x. Incrementing gives
- * yyyyyy10000
- * and then ANDing with the original value gives
- * 00000010000
- * This works for all cases except original value = zero, where of course
- * we get zero.
- *----------
- */
-#define RIGHTMOST_ONE(x) ((signedbitmapword) (x) & -((signedbitmapword) (x)))
-
-#define HAS_MULTIPLE_ONES(x) ((bitmapword) RIGHTMOST_ONE(x) != (x))
-
-/* Select appropriate bit-twiddling functions for bitmap word size */
-#if BITS_PER_BITMAPWORD == 32
-#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos32(w)
-#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos32(w)
-#define bmw_popcount(w) pg_popcount32(w)
-#elif BITS_PER_BITMAPWORD == 64
-#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos64(w)
-#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos64(w)
-#define bmw_popcount(w) pg_popcount64(w)
-#else
-#error "invalid BITS_PER_BITMAPWORD"
-#endif
+#define HAS_MULTIPLE_ONES(x) (bmw_rightmost_one(x) != (x))
/*
@@ -1013,7 +981,7 @@ bms_first_member(Bitmapset *a)
{
int result;
- w = RIGHTMOST_ONE(w);
+ w = bmw_rightmost_one(w);
a->words[wordnum] &= ~w;
result = wordnum * BITS_PER_BITMAPWORD;
diff --git a/src/include/nodes/bitmapset.h b/src/include/nodes/bitmapset.h
index 0dca6bc5fa..80e91fac0f 100644
--- a/src/include/nodes/bitmapset.h
+++ b/src/include/nodes/bitmapset.h
@@ -38,13 +38,11 @@ struct List;
#define BITS_PER_BITMAPWORD 64
typedef uint64 bitmapword; /* must be an unsigned type */
-typedef int64 signedbitmapword; /* must be the matching signed type */
#else
#define BITS_PER_BITMAPWORD 32
typedef uint32 bitmapword; /* must be an unsigned type */
-typedef int32 signedbitmapword; /* must be the matching signed type */
#endif
@@ -75,6 +73,20 @@ typedef enum
BMS_MULTIPLE /* >1 member */
} BMS_Membership;
+/* Select appropriate bit-twiddling functions for bitmap word size */
+#if BITS_PER_BITMAPWORD == 32
+#define bmw_rightmost_one(w) pg_rightmost_one32(w)
+#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos32(w)
+#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos32(w)
+#define bmw_popcount(w) pg_popcount32(w)
+#elif BITS_PER_BITMAPWORD == 64
+#define bmw_rightmost_one(w) pg_rightmost_one64(w)
+#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos64(w)
+#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos64(w)
+#define bmw_popcount(w) pg_popcount64(w)
+#else
+#error "invalid BITS_PER_BITMAPWORD"
+#endif
/*
* function prototypes in nodes/bitmapset.c
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index a49a9c03d9..7235ad25ee 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -17,6 +17,37 @@ extern PGDLLIMPORT const uint8 pg_leftmost_one_pos[256];
extern PGDLLIMPORT const uint8 pg_rightmost_one_pos[256];
extern PGDLLIMPORT const uint8 pg_number_of_ones[256];
+/*----------
+ * This is a well-known cute trick for isolating the rightmost one-bit
+ * in a word. It assumes two's complement arithmetic. Consider any
+ * nonzero value, and focus attention on the rightmost one. The value is
+ * then something like
+ * xxxxxx10000
+ * where x's are unspecified bits. The two's complement negative is formed
+ * by inverting all the bits and adding one. Inversion gives
+ * yyyyyy01111
+ * where each y is the inverse of the corresponding x. Incrementing gives
+ * yyyyyy10000
+ * and then ANDing with the original value gives
+ * 00000010000
+ * This works for all cases except original value = zero, where of course
+ * we get zero.
+ *----------
+ */
+static inline uint32
+pg_rightmost_one32(uint32 word)
+{
+ int32 result = (int32) word & -((int32) word);
+ return (uint32) result;
+}
+
+static inline uint64
+pg_rightmost_one64(uint64 word)
+{
+ int64 result = (int64) word & -((int64) word);
+ return (uint64) result;
+}
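+
+/*
+ * Illustrative example (not part of the patch): for word = 0b101100 (44),
+ * -word ends in ...010100 in two's complement, so word & -word = 0b000100,
+ * leaving only the least significant set bit.
+ */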
+
/*
* pg_leftmost_one_pos32
* Returns the position of the most significant set bit in "word",
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 07fbb7ccf6..f4d1d60cd2 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3662,7 +3662,6 @@ shmem_request_hook_type
shmem_startup_hook_type
sig_atomic_t
sigjmp_buf
-signedbitmapword
sigset_t
size_t
slist_head
--
2.31.1
Attachment: v24-0001-introduce-vector8_min-and-vector8_highbit_mask.patch (application/octet-stream)
From a42eb01c87675698ae5972421f8896f85f048f2b Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 24 Oct 2022 14:07:09 +0900
Subject: [PATCH v24 1/9] introduce vector8_min and vector8_highbit_mask
TODO: commit message
TODO: Remove uint64 case.
separate-commit TODO: move non-SIMD fallbacks to own header
to clean up the #ifdef maze.
---
src/include/port/simd.h | 47 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 47 insertions(+)
diff --git a/src/include/port/simd.h b/src/include/port/simd.h
index c836360d4b..f0bba33c53 100644
--- a/src/include/port/simd.h
+++ b/src/include/port/simd.h
@@ -77,6 +77,7 @@ static inline bool vector8_has(const Vector8 v, const uint8 c);
static inline bool vector8_has_zero(const Vector8 v);
static inline bool vector8_has_le(const Vector8 v, const uint8 c);
static inline bool vector8_is_highbit_set(const Vector8 v);
+static inline uint32 vector8_highbit_mask(const Vector8 v);
#ifndef USE_NO_SIMD
static inline bool vector32_is_highbit_set(const Vector32 v);
#endif
@@ -96,6 +97,7 @@ static inline Vector8 vector8_ssub(const Vector8 v1, const Vector8 v2);
*/
#ifndef USE_NO_SIMD
static inline Vector8 vector8_eq(const Vector8 v1, const Vector8 v2);
+static inline Vector8 vector8_min(const Vector8 v1, const Vector8 v2);
static inline Vector32 vector32_eq(const Vector32 v1, const Vector32 v2);
#endif
@@ -277,6 +279,36 @@ vector8_is_highbit_set(const Vector8 v)
#endif
}
+/*
+ * Return the bitmask of the high-bit of each element.
+ */
+static inline uint32
+vector8_highbit_mask(const Vector8 v)
+{
+#ifdef USE_SSE2
+ return (uint32) _mm_movemask_epi8(v);
+#elif defined(USE_NEON)
+ static const uint8 mask[16] = {
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ };
+
+ uint8x16_t masked = vandq_u8(vld1q_u8(mask), (uint8x16_t) vshrq_n_s8(v, 7));
+ uint8x16_t maskedhi = vextq_u8(masked, masked, 8);
+
+ return (uint32) vaddvq_u16((uint16x8_t) vzip1q_u8(masked, maskedhi));
+#else
+ uint32 mask = 0;
+
+ for (Size i = 0; i < sizeof(Vector8); i++)
+ mask |= (((const uint8 *) &v)[i] >> 7) << i;
+
+ return mask;
+#endif
+}
+
/*
* Exactly like vector8_is_highbit_set except for the input type, so it
* looks at each byte separately.
@@ -372,4 +404,19 @@ vector32_eq(const Vector32 v1, const Vector32 v2)
}
#endif /* ! USE_NO_SIMD */
+/*
+ * Compare the given vectors and return the vector of minimum elements.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_min(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+ return _mm_min_epu8(v1, v2);
+#elif defined(USE_NEON)
+ return vminq_u8(v1, v2);
+#endif
+}
+#endif /* ! USE_NO_SIMD */
+
#endif /* SIMD_H */
--
2.31.1
Attachment: v24-0003-Add-radixtree-template.patch (application/octet-stream)
From 3d16fe0d216f4efb093dd880da02a6e54651d109 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 14 Sep 2022 12:38:51 +0000
Subject: [PATCH v24 3/9] Add radixtree template
The only things configurable in this commit are function scope,
prefix, and local/shared memory.
The key and value type are still hard-coded to uint64.
(A later commit in v21 will make value type configurable)
It might be good at some point to offer a different tree type,
e.g. "single-value leaves" to allow for variable length keys
and values, giving full flexibility to developers.
TODO: Much broader commit message
---
src/backend/utils/mmgr/dsa.c | 12 +
src/include/lib/radixtree.h | 2426 +++++++++++++++++
src/include/lib/radixtree_delete_impl.h | 106 +
src/include/lib/radixtree_insert_impl.h | 317 +++
src/include/lib/radixtree_iter_impl.h | 138 +
src/include/lib/radixtree_search_impl.h | 122 +
src/include/utils/dsa.h | 1 +
src/test/modules/Makefile | 1 +
src/test/modules/meson.build | 1 +
src/test/modules/test_radixtree/.gitignore | 4 +
src/test/modules/test_radixtree/Makefile | 23 +
src/test/modules/test_radixtree/README | 7 +
.../expected/test_radixtree.out | 36 +
src/test/modules/test_radixtree/meson.build | 35 +
.../test_radixtree/sql/test_radixtree.sql | 7 +
.../test_radixtree/test_radixtree--1.0.sql | 8 +
.../modules/test_radixtree/test_radixtree.c | 673 +++++
.../test_radixtree/test_radixtree.control | 4 +
src/tools/pginclude/cpluspluscheck | 6 +
src/tools/pginclude/headerscheck | 6 +
20 files changed, 3933 insertions(+)
create mode 100644 src/include/lib/radixtree.h
create mode 100644 src/include/lib/radixtree_delete_impl.h
create mode 100644 src/include/lib/radixtree_insert_impl.h
create mode 100644 src/include/lib/radixtree_iter_impl.h
create mode 100644 src/include/lib/radixtree_search_impl.h
create mode 100644 src/test/modules/test_radixtree/.gitignore
create mode 100644 src/test/modules/test_radixtree/Makefile
create mode 100644 src/test/modules/test_radixtree/README
create mode 100644 src/test/modules/test_radixtree/expected/test_radixtree.out
create mode 100644 src/test/modules/test_radixtree/meson.build
create mode 100644 src/test/modules/test_radixtree/sql/test_radixtree.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree--1.0.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree.c
create mode 100644 src/test/modules/test_radixtree/test_radixtree.control
diff --git a/src/backend/utils/mmgr/dsa.c b/src/backend/utils/mmgr/dsa.c
index f5a62061a3..80555aefff 100644
--- a/src/backend/utils/mmgr/dsa.c
+++ b/src/backend/utils/mmgr/dsa.c
@@ -1024,6 +1024,18 @@ dsa_set_size_limit(dsa_area *area, size_t limit)
LWLockRelease(DSA_AREA_LOCK(area));
}
+size_t
+dsa_get_total_size(dsa_area *area)
+{
+ size_t size;
+
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_SHARED);
+ size = area->control->total_segment_size;
+ LWLockRelease(DSA_AREA_LOCK(area));
+
+ return size;
+}
+
/*
* Aggressively free all spare memory in the hope of returning DSM segments to
* the operating system.
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
new file mode 100644
index 0000000000..f591d903fc
--- /dev/null
+++ b/src/include/lib/radixtree.h
@@ -0,0 +1,2426 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.h
+ * Implementation for adaptive radix tree.
+ *
+ * This module employs the idea from the paper "The Adaptive Radix Tree: ARTful
+ * Indexing for Main-Memory Databases" by Viktor Leis, Alfons Kemper, and Thomas
+ * Neumann, 2013. The radix tree uses adaptive node sizes, a small number of node
+ * types, each with a different number of elements. Depending on the number of
+ * children, the appropriate node type is used.
+ *
+ * WIP: notes about traditional radix tree trading off span vs height...
+ *
+ * There are two kinds of nodes, inner nodes and leaves. Inner nodes
+ * map partial keys to child pointers.
+ *
+ * The ART paper mentions three ways to implement leaves:
+ *
+ * "- Single-value leaves: The values are stored using an addi-
+ * tional leaf node type which stores one value.
+ * - Multi-value leaves: The values are stored in one of four
+ * different leaf node types, which mirror the structure of
+ * inner nodes, but contain values instead of pointers.
+ * - Combined pointer/value slots: If values fit into point-
+ * ers, no separate node types are necessary. Instead, each
+ * pointer storage location in an inner node can either
+ * store a pointer or a value."
+ *
+ * We chose "multi-value leaves" to avoid the additional pointer traversal
+ * required by "single-value leaves".
+ *
+ * For simplicity, the key is assumed to be a 64-bit unsigned integer. The
+ * tree doesn't need to contain paths where the highest bytes of all keys
+ * are zero. That way, the tree's height adapts to the distribution of keys.
+ *
+ * TODO: In the future it might be worthwhile to offer configurability of
+ * leaf implementation for different use cases. Single-value leaves would
+ * give more flexibility in key type, including variable-length keys.
+ *
+ * There are some optimizations not yet implemented, particularly path
+ * compression and lazy path expansion.
+ *
+ * WIP: the radix tree nodes don't shrink.
+ *
+ * To generate a radix tree and associated functions for a use case several
+ * macros have to be #define'ed before this file is included. Including
+ * the file #undef's all those, so a new radix tree can be generated
+ * afterwards.
+ * The relevant parameters are:
+ * - RT_PREFIX - prefix for all symbol names generated. A prefix of 'foo'
+ * will result in radix tree type 'foo_radix_tree' and functions like
+ * 'foo_create'/'foo_free' and so forth.
+ * - RT_DECLARE - if defined function prototypes and type declarations are
+ * generated
+ * - RT_DEFINE - if defined function definitions are generated
+ * - RT_SCOPE - in which scope (e.g. extern, static inline) do function
+ * declarations reside
+ * - RT_VALUE_TYPE - the type of the value.
+ *
+ * Optional parameters:
+ * - RT_SHMEM - if defined, the radix tree is created in the DSA area
+ * so that multiple processes can access it simultaneously.
+ * - RT_DEBUG - if defined add stats tracking and debugging functions
+ *
+ * Interface
+ * ---------
+ *
+ * RT_CREATE - Create a new, empty radix tree
+ * RT_FREE - Free the radix tree
+ * RT_SEARCH - Search a key-value pair
+ * RT_SET - Set a key-value pair
+ * RT_BEGIN_ITERATE - Begin iterating through all key-value pairs
+ * RT_ITERATE_NEXT - Return next key-value pair, if any
+ * RT_END_ITERATE - End iteration
+ * RT_MEMORY_USAGE - Get the memory usage
+ *
+ * Interface for Shared Memory
+ * ---------
+ *
+ * RT_ATTACH - Attach to the radix tree
+ * RT_DETACH - Detach from the radix tree
+ * RT_GET_HANDLE - Return the handle of the radix tree
+ *
+ * Optional Interface
+ * ---------
+ *
+ * RT_DELETE - Delete a key-value pair. Declared/defined if RT_USE_DELETE is defined
+ *
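+ * A minimal usage sketch (illustrative only; the prefix "item_rt" and the
+ * uint64 value type are arbitrary choices, not part of this patch):
+ *
+ * #define RT_PREFIX item_rt
+ * #define RT_SCOPE static
+ * #define RT_DECLARE
+ * #define RT_DEFINE
+ * #define RT_VALUE_TYPE uint64
+ * #include "lib/radixtree.h"
+ *
+ * item_rt_radix_tree *tree = item_rt_create(CurrentMemoryContext);
+ * uint64 key = 123;
+ * uint64 value;
+ *
+ * item_rt_set(tree, key, 42);
+ * if (item_rt_search(tree, key, &value))
+ *     ... use value ...
+ * item_rt_free(tree);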
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/lib/radixtree.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "lib/stringinfo.h"
+#include "miscadmin.h"
+#include "nodes/bitmapset.h"
+#include "port/pg_bitutils.h"
+#include "port/simd.h"
+#include "utils/dsa.h"
+#include "utils/memutils.h"
+
+/* helpers */
+#define RT_MAKE_PREFIX(a) CppConcat(a,_)
+#define RT_MAKE_NAME(name) RT_MAKE_NAME_(RT_MAKE_PREFIX(RT_PREFIX),name)
+#define RT_MAKE_NAME_(a,b) CppConcat(a,b)
+
+/* function declarations */
+#define RT_CREATE RT_MAKE_NAME(create)
+#define RT_FREE RT_MAKE_NAME(free)
+#define RT_SEARCH RT_MAKE_NAME(search)
+#ifdef RT_SHMEM
+#define RT_ATTACH RT_MAKE_NAME(attach)
+#define RT_DETACH RT_MAKE_NAME(detach)
+#define RT_GET_HANDLE RT_MAKE_NAME(get_handle)
+#endif
+#define RT_SET RT_MAKE_NAME(set)
+#define RT_BEGIN_ITERATE RT_MAKE_NAME(begin_iterate)
+#define RT_ITERATE_NEXT RT_MAKE_NAME(iterate_next)
+#define RT_END_ITERATE RT_MAKE_NAME(end_iterate)
+#ifdef RT_USE_DELETE
+#define RT_DELETE RT_MAKE_NAME(delete)
+#endif
+#define RT_MEMORY_USAGE RT_MAKE_NAME(memory_usage)
+#ifdef RT_DEBUG
+#define RT_DUMP RT_MAKE_NAME(dump)
+#define RT_DUMP_NODE RT_MAKE_NAME(dump_node)
+#define RT_DUMP_SEARCH RT_MAKE_NAME(dump_search)
+#define RT_STATS RT_MAKE_NAME(stats)
+#endif
+
+/* internal helper functions (no externally visible prototypes) */
+#define RT_NEW_ROOT RT_MAKE_NAME(new_root)
+#define RT_ALLOC_NODE RT_MAKE_NAME(alloc_node)
+#define RT_INIT_NODE RT_MAKE_NAME(init_node)
+#define RT_FREE_NODE RT_MAKE_NAME(free_node)
+#define RT_FREE_RECURSE RT_MAKE_NAME(free_recurse)
+#define RT_EXTEND RT_MAKE_NAME(extend)
+#define RT_SET_EXTEND RT_MAKE_NAME(set_extend)
+#define RT_SWITCH_NODE_KIND RT_MAKE_NAME(grow_node_kind)
+#define RT_COPY_NODE RT_MAKE_NAME(copy_node)
+#define RT_REPLACE_NODE RT_MAKE_NAME(replace_node)
+#define RT_PTR_GET_LOCAL RT_MAKE_NAME(ptr_get_local)
+#define RT_PTR_ALLOC_IS_VALID RT_MAKE_NAME(ptr_stored_is_valid)
+#define RT_NODE_3_SEARCH_EQ RT_MAKE_NAME(node_3_search_eq)
+#define RT_NODE_32_SEARCH_EQ RT_MAKE_NAME(node_32_search_eq)
+#define RT_NODE_3_GET_INSERTPOS RT_MAKE_NAME(node_3_get_insertpos)
+#define RT_NODE_32_GET_INSERTPOS RT_MAKE_NAME(node_32_get_insertpos)
+#define RT_CHUNK_CHILDREN_ARRAY_SHIFT RT_MAKE_NAME(chunk_children_array_shift)
+#define RT_CHUNK_VALUES_ARRAY_SHIFT RT_MAKE_NAME(chunk_values_array_shift)
+#define RT_CHUNK_CHILDREN_ARRAY_DELETE RT_MAKE_NAME(chunk_children_array_delete)
+#define RT_CHUNK_VALUES_ARRAY_DELETE RT_MAKE_NAME(chunk_values_array_delete)
+#define RT_CHUNK_CHILDREN_ARRAY_COPY RT_MAKE_NAME(chunk_children_array_copy)
+#define RT_CHUNK_VALUES_ARRAY_COPY RT_MAKE_NAME(chunk_values_array_copy)
+#define RT_NODE_125_IS_CHUNK_USED RT_MAKE_NAME(node_125_is_chunk_used)
+#define RT_NODE_INNER_125_GET_CHILD RT_MAKE_NAME(node_inner_125_get_child)
+#define RT_NODE_LEAF_125_GET_VALUE RT_MAKE_NAME(node_leaf_125_get_value)
+#define RT_NODE_INNER_256_IS_CHUNK_USED RT_MAKE_NAME(node_inner_256_is_chunk_used)
+#define RT_NODE_LEAF_256_IS_CHUNK_USED RT_MAKE_NAME(node_leaf_256_is_chunk_used)
+#define RT_NODE_INNER_256_GET_CHILD RT_MAKE_NAME(node_inner_256_get_child)
+#define RT_NODE_LEAF_256_GET_VALUE RT_MAKE_NAME(node_leaf_256_get_value)
+#define RT_NODE_INNER_256_SET RT_MAKE_NAME(node_inner_256_set)
+#define RT_NODE_LEAF_256_SET RT_MAKE_NAME(node_leaf_256_set)
+#define RT_NODE_INNER_256_DELETE RT_MAKE_NAME(node_inner_256_delete)
+#define RT_NODE_LEAF_256_DELETE RT_MAKE_NAME(node_leaf_256_delete)
+#define RT_KEY_GET_SHIFT RT_MAKE_NAME(key_get_shift)
+#define RT_SHIFT_GET_MAX_VAL RT_MAKE_NAME(shift_get_max_val)
+#define RT_NODE_SEARCH_INNER RT_MAKE_NAME(node_search_inner)
+#define RT_NODE_SEARCH_LEAF RT_MAKE_NAME(node_search_leaf)
+#define RT_NODE_UPDATE_INNER RT_MAKE_NAME(node_update_inner)
+#define RT_NODE_DELETE_INNER RT_MAKE_NAME(node_delete_inner)
+#define RT_NODE_DELETE_LEAF RT_MAKE_NAME(node_delete_leaf)
+#define RT_NODE_INSERT_INNER RT_MAKE_NAME(node_insert_inner)
+#define RT_NODE_INSERT_LEAF RT_MAKE_NAME(node_insert_leaf)
+#define RT_NODE_INNER_ITERATE_NEXT RT_MAKE_NAME(node_inner_iterate_next)
+#define RT_NODE_LEAF_ITERATE_NEXT RT_MAKE_NAME(node_leaf_iterate_next)
+#define RT_UPDATE_ITER_STACK RT_MAKE_NAME(update_iter_stack)
+#define RT_ITER_UPDATE_KEY RT_MAKE_NAME(iter_update_key)
+#define RT_VERIFY_NODE RT_MAKE_NAME(verify_node)
+
+/* type declarations */
+#define RT_RADIX_TREE RT_MAKE_NAME(radix_tree)
+#define RT_RADIX_TREE_CONTROL RT_MAKE_NAME(radix_tree_control)
+#define RT_ITER RT_MAKE_NAME(iter)
+#ifdef RT_SHMEM
+#define RT_HANDLE RT_MAKE_NAME(handle)
+#endif
+#define RT_NODE RT_MAKE_NAME(node)
+#define RT_NODE_ITER RT_MAKE_NAME(node_iter)
+#define RT_NODE_BASE_3 RT_MAKE_NAME(node_base_3)
+#define RT_NODE_BASE_32 RT_MAKE_NAME(node_base_32)
+#define RT_NODE_BASE_125 RT_MAKE_NAME(node_base_125)
+#define RT_NODE_BASE_256 RT_MAKE_NAME(node_base_256)
+#define RT_NODE_INNER_3 RT_MAKE_NAME(node_inner_3)
+#define RT_NODE_INNER_32 RT_MAKE_NAME(node_inner_32)
+#define RT_NODE_INNER_125 RT_MAKE_NAME(node_inner_125)
+#define RT_NODE_INNER_256 RT_MAKE_NAME(node_inner_256)
+#define RT_NODE_LEAF_3 RT_MAKE_NAME(node_leaf_3)
+#define RT_NODE_LEAF_32 RT_MAKE_NAME(node_leaf_32)
+#define RT_NODE_LEAF_125 RT_MAKE_NAME(node_leaf_125)
+#define RT_NODE_LEAF_256 RT_MAKE_NAME(node_leaf_256)
+#define RT_SIZE_CLASS RT_MAKE_NAME(size_class)
+#define RT_SIZE_CLASS_ELEM RT_MAKE_NAME(size_class_elem)
+#define RT_SIZE_CLASS_INFO RT_MAKE_NAME(size_class_info)
+#define RT_CLASS_3 RT_MAKE_NAME(class_3)
+#define RT_CLASS_32_MIN RT_MAKE_NAME(class_32_min)
+#define RT_CLASS_32_MAX RT_MAKE_NAME(class_32_max)
+#define RT_CLASS_125 RT_MAKE_NAME(class_125)
+#define RT_CLASS_256 RT_MAKE_NAME(class_256)
+
+/* generate forward declarations necessary to use the radix tree */
+#ifdef RT_DECLARE
+
+typedef struct RT_RADIX_TREE RT_RADIX_TREE;
+typedef struct RT_ITER RT_ITER;
+
+#ifdef RT_SHMEM
+typedef dsa_pointer RT_HANDLE;
+#endif
+
+#ifdef RT_SHMEM
+RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx, dsa_area *dsa);
+RT_SCOPE RT_RADIX_TREE * RT_ATTACH(dsa_area *dsa, dsa_pointer dp);
+RT_SCOPE void RT_DETACH(RT_RADIX_TREE *tree);
+RT_SCOPE RT_HANDLE RT_GET_HANDLE(RT_RADIX_TREE *tree);
+#else
+RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx);
+#endif
+RT_SCOPE void RT_FREE(RT_RADIX_TREE *tree);
+
+RT_SCOPE bool RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *val_p);
+RT_SCOPE bool RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE val);
+#ifdef RT_USE_DELETE
+RT_SCOPE bool RT_DELETE(RT_RADIX_TREE *tree, uint64 key);
+#endif
+
+RT_SCOPE RT_ITER * RT_BEGIN_ITERATE(RT_RADIX_TREE *tree);
+RT_SCOPE bool RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, RT_VALUE_TYPE *value_p);
+RT_SCOPE void RT_END_ITERATE(RT_ITER *iter);
+
+RT_SCOPE uint64 RT_MEMORY_USAGE(RT_RADIX_TREE *tree);
+
+#ifdef RT_DEBUG
+RT_SCOPE void RT_DUMP(RT_RADIX_TREE *tree);
+RT_SCOPE void RT_DUMP_SEARCH(RT_RADIX_TREE *tree, uint64 key);
+RT_SCOPE void RT_STATS(RT_RADIX_TREE *tree);
+#endif
+
+#endif /* RT_DECLARE */
+
+
+/* generate implementation of the radix tree */
+#ifdef RT_DEFINE
+
+/* The number of bits encoded in one tree level */
+#define RT_NODE_SPAN BITS_PER_BYTE
+
+/* The maximum number of slots in a node */
+#define RT_NODE_MAX_SLOTS (1 << RT_NODE_SPAN)
+
+/* Mask for extracting a chunk from the key */
+#define RT_CHUNK_MASK ((1 << RT_NODE_SPAN) - 1)
+
+/* Maximum shift the radix tree uses */
+#define RT_MAX_SHIFT RT_KEY_GET_SHIFT(UINT64_MAX)
+
+/* Maximum number of levels in the radix tree */
+#define RT_MAX_LEVEL ((sizeof(uint64) * BITS_PER_BYTE) / RT_NODE_SPAN)
+
+/*
+ * Number of bits necessary for isset array in the slot-index node.
+ * Since bitmapword can be 64 bits, the only values that make sense
+ * here are 64 and 128.
+ */
+#define RT_SLOT_IDX_LIMIT (RT_NODE_MAX_SLOTS / 2)
+
+/* Invalid index used in node-125 */
+#define RT_INVALID_SLOT_IDX 0xFF
+
+/* Get a chunk from the key */
+#define RT_GET_KEY_CHUNK(key, shift) ((uint8) (((key) >> (shift)) & RT_CHUNK_MASK))
+
+/* For accessing bitmaps */
+#define BM_IDX(x) ((x) / BITS_PER_BITMAPWORD)
+#define BM_BIT(x) ((x) % BITS_PER_BITMAPWORD)
+
+/*
+ * Node kinds
+ *
+ * The different node kinds are what make the tree "adaptive".
+ *
+ * Each node kind is associated with a different datatype and different
+ * search/set/delete/iterate algorithms adapted for its size. The largest
+ * kind, node256 is basically the same as a traditional radix tree,
+ * and would be most wasteful of memory when sparsely populated. The
+ * smaller nodes expend some additional CPU time to enable a smaller
+ * memory footprint.
+ *
+ * XXX There are 4 node kinds, and this should never be increased,
+ * for several reasons:
+ * 1. With 5 or more kinds, gcc tends to use a jump table for switch
+ * statements.
+ * 2. The 4 kinds can be represented with 2 bits, so we have the option
+ * in the future to tag the node pointer with the kind, even on
+ * platforms with 32-bit pointers. This might speed up node traversal
+ * in trees with highly random node kinds.
+ * 3. We can have multiple size classes per node kind.
+ */
+#define RT_NODE_KIND_3 0x00
+#define RT_NODE_KIND_32 0x01
+#define RT_NODE_KIND_125 0x02
+#define RT_NODE_KIND_256 0x03
+#define RT_NODE_KIND_COUNT 4
+
+/*
+ * Calculate the slab blocksize so that we can allocate at least 32 chunks
+ * from the block.
+ */
+#define RT_SLAB_BLOCK_SIZE(size) \
+ Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * (size), (size) * 32)
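+
+/*
+ * Illustrative example (not part of the patch): assuming
+ * SLAB_DEFAULT_BLOCK_SIZE is 8 kB and a hypothetical chunk size of 296 bytes,
+ * the first term rounds the default block size down to a multiple of the
+ * chunk size (27 * 296 = 7992), but the second term wins because 32 chunks
+ * need 9472 bytes, so the block size becomes 9472.
+ */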
+
+/* Common type for all node types */
+typedef struct RT_NODE
+{
+ /*
+ * Number of children. We use uint16 to be able to indicate 256 children
+ * at the full fanout of an 8-bit span.
+ */
+ uint16 count;
+
+ /*
+ * Max capacity for the current size class. Storing this in the
+ * node enables multiple size classes per node kind.
+ * Technically, kinds with a single size class don't need this, so we could
+ * keep this in the individual base types, but the code is simpler this way.
+ * Note: node256 is unique in that it cannot possibly have more than a
+ * single size class, so for that kind we store zero, and uint8 is
+ * sufficient for other kinds.
+ */
+ uint8 fanout;
+
+ /*
+ * Shift indicates which part of the key space is represented by this
+ * node. That is, the key is shifted by 'shift' and the lowest
+ * RT_NODE_SPAN bits are then represented in chunk.
+ */
+ uint8 shift;
+
+ /* Node kind, one per search/set algorithm */
+ uint8 kind;
+} RT_NODE;
+
+
+#define RT_PTR_LOCAL RT_NODE *
+
+#ifdef RT_SHMEM
+#define RT_PTR_ALLOC dsa_pointer
+#else
+#define RT_PTR_ALLOC RT_PTR_LOCAL
+#endif
+
+
+#ifdef RT_SHMEM
+#define RT_INVALID_PTR_ALLOC InvalidDsaPointer
+#else
+#define RT_INVALID_PTR_ALLOC NULL
+#endif
+
+/*
+ * Inner nodes and leaf nodes have analogous structure. To distinguish
+ * them at runtime, we take advantage of the fact that the key chunk
+ * is accessed by shifting: inner tree nodes (shift > 0) store pointers
+ * to their child nodes in the slots. In leaf nodes (shift == 0),
+ * the slot contains the value corresponding to the key.
+ */
+#define RT_NODE_IS_LEAF(n) (((RT_PTR_LOCAL) (n))->shift == 0)
+
+#define RT_NODE_MUST_GROW(node) \
+ ((node)->base.n.count == (node)->base.n.fanout)
+
+/*
+ * Base type for each node kind, for both leaf and inner nodes.
+ * The base types must be able to accommodate the largest size
+ * class for variable-sized node kinds.
+ */
+typedef struct RT_NODE_BASE_3
+{
+ RT_NODE n;
+
+ /* 3 children, for key chunks */
+ uint8 chunks[3];
+} RT_NODE_BASE_3;
+
+typedef struct RT_NODE_BASE_32
+{
+ RT_NODE n;
+
+ /* 32 children, for key chunks */
+ uint8 chunks[32];
+} RT_NODE_BASE_32;
+
+/*
+ * node-125 uses a slot_idxs array of RT_NODE_MAX_SLOTS length
+ * to store indexes into a second array that contains the values (or
+ * child pointers).
+ */
+typedef struct RT_NODE_BASE_125
+{
+ RT_NODE n;
+
+ /* The slot index for each chunk; RT_INVALID_SLOT_IDX means unused */
+ uint8 slot_idxs[RT_NODE_MAX_SLOTS];
+
+ /* isset is a bitmap to track which slot is in use */
+ bitmapword isset[BM_IDX(RT_SLOT_IDX_LIMIT)];
+} RT_NODE_BASE_125;
+
+typedef struct RT_NODE_BASE_256
+{
+ RT_NODE n;
+} RT_NODE_BASE_256;
+
+/*
+ * Inner and leaf nodes.
+ *
+ * These are separate because the value type might be different from
+ * something fitting into a pointer-width type.
+ */
+typedef struct RT_NODE_INNER_3
+{
+ RT_NODE_BASE_3 base;
+
+ /* number of children depends on size class */
+ RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_INNER_3;
+
+typedef struct RT_NODE_LEAF_3
+{
+ RT_NODE_BASE_3 base;
+
+ /* number of values depends on size class */
+ RT_VALUE_TYPE values[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_LEAF_3;
+
+typedef struct RT_NODE_INNER_32
+{
+ RT_NODE_BASE_32 base;
+
+ /* number of children depends on size class */
+ RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_INNER_32;
+
+typedef struct RT_NODE_LEAF_32
+{
+ RT_NODE_BASE_32 base;
+
+ /* number of values depends on size class */
+ RT_VALUE_TYPE values[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_LEAF_32;
+
+typedef struct RT_NODE_INNER_125
+{
+ RT_NODE_BASE_125 base;
+
+ /* number of children depends on size class */
+ RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_INNER_125;
+
+typedef struct RT_NODE_LEAF_125
+{
+ RT_NODE_BASE_125 base;
+
+ /* number of values depends on size class */
+ RT_VALUE_TYPE values[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_LEAF_125;
+
+/*
+ * node-256 is the largest node type. This node has an array
+ * for directly storing values (or child pointers in inner nodes).
+ * Unlike other node kinds, its array size is by definition
+ * fixed.
+ */
+typedef struct RT_NODE_INNER_256
+{
+ RT_NODE_BASE_256 base;
+
+ /* Slots for 256 children */
+ RT_PTR_ALLOC children[RT_NODE_MAX_SLOTS];
+} RT_NODE_INNER_256;
+
+typedef struct RT_NODE_LEAF_256
+{
+ RT_NODE_BASE_256 base;
+
+ /*
+ * Unlike with inner256, zero is a valid value here, so we use a
+ * bitmap to track which slot is in use.
+ */
+ bitmapword isset[BM_IDX(RT_NODE_MAX_SLOTS)];
+
+ /* Slots for 256 values */
+ RT_VALUE_TYPE values[RT_NODE_MAX_SLOTS];
+} RT_NODE_LEAF_256;
+
+/*
+ * Node size classes
+ *
+ * Nodes of different kinds necessarily belong to different size classes.
+ * The main innovation in our implementation compared to the ART paper
+ * is decoupling the notion of size class from kind.
+ *
+ * The size classes within a given node kind have the same underlying
+ * type, but a variable number of children/values. This is possible
+ * because the base type contains small fixed data structures that
+ * work the same way regardless of how full the node is. We store the
+ * node's allocated capacity in the "fanout" member of RT_NODE, to allow
+ * runtime introspection.
+ *
+ * Growing from one node kind to another requires special code for each
+ * case, but growing from one size class to another within the same kind
+ * is basically just allocate + memcpy.
+ *
+ * The size classes have been chosen so that inner nodes on platforms
+ * with 64-bit pointers (and leaf nodes when using a 64-bit key) are
+ * equal to or slightly smaller than some DSA size class.
+ */
+typedef enum RT_SIZE_CLASS
+{
+ RT_CLASS_3 = 0,
+ RT_CLASS_32_MIN,
+ RT_CLASS_32_MAX,
+ RT_CLASS_125,
+ RT_CLASS_256
+} RT_SIZE_CLASS;
+
+/* Information for each size class */
+typedef struct RT_SIZE_CLASS_ELEM
+{
+ const char *name;
+ int fanout;
+
+ /* slab chunk size */
+ Size inner_size;
+ Size leaf_size;
+} RT_SIZE_CLASS_ELEM;
+
+static const RT_SIZE_CLASS_ELEM RT_SIZE_CLASS_INFO[] = {
+ [RT_CLASS_3] = {
+ .name = "radix tree node 3",
+ .fanout = 3,
+ .inner_size = sizeof(RT_NODE_INNER_3) + 3 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_3) + 3 * sizeof(RT_VALUE_TYPE),
+ },
+ [RT_CLASS_32_MIN] = {
+ .name = "radix tree node 15",
+ .fanout = 15,
+ .inner_size = sizeof(RT_NODE_INNER_32) + 15 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_32) + 15 * sizeof(RT_VALUE_TYPE),
+ },
+ [RT_CLASS_32_MAX] = {
+ .name = "radix tree node 32",
+ .fanout = 32,
+ .inner_size = sizeof(RT_NODE_INNER_32) + 32 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_32) + 32 * sizeof(RT_VALUE_TYPE),
+ },
+ [RT_CLASS_125] = {
+ .name = "radix tree node 125",
+ .fanout = 125,
+ .inner_size = sizeof(RT_NODE_INNER_125) + 125 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_125) + 125 * sizeof(RT_VALUE_TYPE),
+ },
+ [RT_CLASS_256] = {
+ .name = "radix tree node 256",
+ .fanout = 256,
+ .inner_size = sizeof(RT_NODE_INNER_256),
+ .leaf_size = sizeof(RT_NODE_LEAF_256),
+ },
+};
+
+#define RT_SIZE_CLASS_COUNT lengthof(RT_SIZE_CLASS_INFO)
+
+#ifdef RT_SHMEM
+/* A magic value used to identify our radix tree */
+#define RT_RADIX_TREE_MAGIC 0x54A48167
+#endif
+
+/* Contains the actual tree and ancillary info */
+// WIP: this name is a bit strange
+typedef struct RT_RADIX_TREE_CONTROL
+{
+#ifdef RT_SHMEM
+ RT_HANDLE handle;
+ uint32 magic;
+#endif
+
+ RT_PTR_ALLOC root;
+ uint64 max_val;
+ uint64 num_keys;
+
+ /* statistics */
+#ifdef RT_DEBUG
+ int32 cnt[RT_SIZE_CLASS_COUNT];
+#endif
+} RT_RADIX_TREE_CONTROL;
+
+/* Entry point for allocating and accessing the tree */
+typedef struct RT_RADIX_TREE
+{
+ MemoryContext context;
+
+ /* pointing to either local memory or DSA */
+ RT_RADIX_TREE_CONTROL *ctl;
+
+#ifdef RT_SHMEM
+ dsa_area *dsa;
+#else
+ MemoryContextData *inner_slabs[RT_SIZE_CLASS_COUNT];
+ MemoryContextData *leaf_slabs[RT_SIZE_CLASS_COUNT];
+#endif
+} RT_RADIX_TREE;
+
+/*
+ * Iteration support.
+ *
+ * Iterating the radix tree returns each pair of key and value in the ascending
+ * order of the key. To support this, we iterate over the nodes at each level.
+ *
+ * RT_NODE_ITER struct is used to track the iteration within a node.
+ *
+ * RT_ITER is the struct for iteration of the radix tree, and uses RT_NODE_ITER
+ * in order to track the iteration of each level. During iteration, we also
+ * construct the key whenever updating the node iteration information, e.g., when
+ * advancing the current index within the node or when moving to the next node
+ * at the same level.
+ *
+ * XXX: Currently we allow only one process to iterate at a time. Therefore,
+ * RT_NODE_ITER holds local pointers to nodes, rather than RT_PTR_ALLOC.
+ * We need either a safeguard that disallows other processes from beginning an
+ * iteration while one is in progress, or support for multiple concurrent iterations.
+ */
+typedef struct RT_NODE_ITER
+{
+ RT_PTR_LOCAL node; /* current node being iterated */
+ int current_idx; /* current position. -1 for initial value */
+} RT_NODE_ITER;
+
+typedef struct RT_ITER
+{
+ RT_RADIX_TREE *tree;
+
+ /* Track the iteration on nodes of each level */
+ RT_NODE_ITER stack[RT_MAX_LEVEL];
+ int stack_len;
+
+ /* The key is constructed during iteration */
+ uint64 key;
+} RT_ITER;
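+
+/*
+ * A minimal iteration sketch (illustrative; assumes a tree instantiated with
+ * RT_PREFIX = item_rt and RT_VALUE_TYPE = uint64, names not from this patch):
+ *
+ *   item_rt_iter *iter = item_rt_begin_iterate(tree);
+ *   uint64  key;
+ *   uint64  value;
+ *
+ *   while (item_rt_iterate_next(iter, &key, &value))
+ *       ... keys and values arrive in ascending key order ...
+ *   item_rt_end_iterate(iter);
+ */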
+
+
+static bool RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
+ uint64 key, RT_PTR_ALLOC child);
+static bool RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
+ uint64 key, RT_VALUE_TYPE value);
+
+/* verification (available only with assertion) */
+static void RT_VERIFY_NODE(RT_PTR_LOCAL node);
+
+/* Get the local address of an allocated node */
+static inline RT_PTR_LOCAL
+RT_PTR_GET_LOCAL(RT_RADIX_TREE *tree, RT_PTR_ALLOC node)
+{
+#ifdef RT_SHMEM
+ return dsa_get_address(tree->dsa, (dsa_pointer) node);
+#else
+ return node;
+#endif
+}
+
+static inline bool
+RT_PTR_ALLOC_IS_VALID(RT_PTR_ALLOC ptr)
+{
+#ifdef RT_SHMEM
+ return DsaPointerIsValid(ptr);
+#else
+ return PointerIsValid(ptr);
+#endif
+}
+
+/*
+ * Return index of the first element in the node's chunk array that equals
+ * 'chunk'. Return -1 if there is no such element.
+ */
+static inline int
+RT_NODE_3_SEARCH_EQ(RT_NODE_BASE_3 *node, uint8 chunk)
+{
+ int idx = -1;
+
+ for (int i = 0; i < node->n.count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ idx = i;
+ break;
+ }
+ }
+
+ return idx;
+}
+
+/*
+ * Return index of the chunk and slot arrays for inserting into the node,
+ * such that the chunk array remains ordered.
+ */
+static inline int
+RT_NODE_3_GET_INSERTPOS(RT_NODE_BASE_3 *node, uint8 chunk)
+{
+ int idx;
+
+ for (idx = 0; idx < node->n.count; idx++)
+ {
+ if (node->chunks[idx] >= chunk)
+ break;
+ }
+
+ return idx;
+}
+
+/*
+ * Return index of the first element in the node's chunk array that equals
+ * 'chunk'. Return -1 if there is no such element.
+ */
+static inline int
+RT_NODE_32_SEARCH_EQ(RT_NODE_BASE_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ uint32 bitfield;
+ int index_simd = -1;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index = -1;
+
+ for (int i = 0; i < count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ index = i;
+ break;
+ }
+ }
+#endif
+
+#ifndef USE_NO_SIMD
+ /* replicate the search key */
+ spread_chunk = vector8_broadcast(chunk);
+
+ /* compare to all 32 keys stored in the node */
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ cmp1 = vector8_eq(spread_chunk, haystack1);
+ cmp2 = vector8_eq(spread_chunk, haystack2);
+
+ /* convert comparison to a bitfield */
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+
+ /* mask off invalid entries */
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ /* convert bitfield to index by counting trailing zeros */
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+/*
+ * Return index of the chunk and slot arrays for inserting into the node,
+ * such that the chunk array remains ordered.
+ */
+static inline int
+RT_NODE_32_GET_INSERTPOS(RT_NODE_BASE_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ Vector8 min1;
+ Vector8 min2;
+ uint32 bitfield;
+ int index_simd;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index;
+
+ for (index = 0; index < count; index++)
+ {
+ /*
+ * This is coded with '>=' to match what we can do with SIMD,
+ * with an assert to keep us honest.
+ */
+ if (node->chunks[index] >= chunk)
+ {
+ Assert(node->chunks[index] != chunk);
+ break;
+ }
+ }
+#endif
+
+#ifndef USE_NO_SIMD
+ /*
+ * This is a bit more complicated than RT_NODE_32_SEARCH_EQ(), because
+ * no unsigned uint8 comparison instruction exists, at least for SSE2. So
+ * we need to play some trickery using vector8_min() to effectively get
+ * <=. There'll never be any equal elements in current uses, but that's
+ * what we get here...
+ */
+ spread_chunk = vector8_broadcast(chunk);
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ min1 = vector8_min(spread_chunk, haystack1);
+ min2 = vector8_min(spread_chunk, haystack2);
+ cmp1 = vector8_eq(spread_chunk, min1);
+ cmp2 = vector8_eq(spread_chunk, min2);
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+ else
+ index_simd = count;
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+
+/*
+ * Functions to manipulate both chunks array and children/values array.
+ * These are used for node-3 and node-32.
+ */
+
+/* Shift the elements right at 'idx' by one */
+static inline void
+RT_CHUNK_CHILDREN_ARRAY_SHIFT(uint8 *chunks, RT_PTR_ALLOC *children, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+ memmove(&(children[idx + 1]), &(children[idx]), sizeof(RT_PTR_ALLOC) * (count - idx));
+}
+
+static inline void
+RT_CHUNK_VALUES_ARRAY_SHIFT(uint8 *chunks, RT_VALUE_TYPE *values, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+ memmove(&(values[idx + 1]), &(values[idx]), sizeof(RT_VALUE_TYPE) * (count - idx));
+}
+
+/* Delete the element at 'idx' */
+static inline void
+RT_CHUNK_CHILDREN_ARRAY_DELETE(uint8 *chunks, RT_PTR_ALLOC *children, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(children[idx]), &(children[idx + 1]), sizeof(RT_PTR_ALLOC) * (count - idx - 1));
+}
+
+static inline void
+RT_CHUNK_VALUES_ARRAY_DELETE(uint8 *chunks, RT_VALUE_TYPE *values, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(values[idx]), &(values[idx + 1]), sizeof(RT_VALUE_TYPE) * (count - idx - 1));
+}
+
+/* Copy both chunks and children/values arrays */
+static inline void
+RT_CHUNK_CHILDREN_ARRAY_COPY(uint8 *src_chunks, RT_PTR_ALLOC *src_children,
+ uint8 *dst_chunks, RT_PTR_ALLOC *dst_children)
+{
+ const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_3].fanout;
+ const Size chunk_size = sizeof(uint8) * fanout;
+ const Size children_size = sizeof(RT_PTR_ALLOC) * fanout;
+
+ memcpy(dst_chunks, src_chunks, chunk_size);
+ memcpy(dst_children, src_children, children_size);
+}
+
+static inline void
+RT_CHUNK_VALUES_ARRAY_COPY(uint8 *src_chunks, RT_VALUE_TYPE *src_values,
+ uint8 *dst_chunks, RT_VALUE_TYPE *dst_values)
+{
+ const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_3].fanout;
+ const Size chunk_size = sizeof(uint8) * fanout;
+ const Size values_size = sizeof(RT_VALUE_TYPE) * fanout;
+
+ memcpy(dst_chunks, src_chunks, chunk_size);
+ memcpy(dst_values, src_values, values_size);
+}
+
+/* Functions to manipulate inner and leaf node-125 */
+
+/* Does the given chunk in the node have a value? */
+static inline bool
+RT_NODE_125_IS_CHUNK_USED(RT_NODE_BASE_125 *node, uint8 chunk)
+{
+ return node->slot_idxs[chunk] != RT_INVALID_SLOT_IDX;
+}
+
+static inline RT_PTR_ALLOC
+RT_NODE_INNER_125_GET_CHILD(RT_NODE_INNER_125 *node, uint8 chunk)
+{
+ Assert(!RT_NODE_IS_LEAF(node));
+ return node->children[node->base.slot_idxs[chunk]];
+}
+
+static inline RT_VALUE_TYPE
+RT_NODE_LEAF_125_GET_VALUE(RT_NODE_LEAF_125 *node, uint8 chunk)
+{
+ Assert(RT_NODE_IS_LEAF(node));
+ Assert(((RT_NODE_BASE_125 *) node)->slot_idxs[chunk] != RT_INVALID_SLOT_IDX);
+ return node->values[node->base.slot_idxs[chunk]];
+}
+
+/* Functions to manipulate inner and leaf node-256 */
+
+/* Return true if the slot corresponding to the given chunk is in use */
+static inline bool
+RT_NODE_INNER_256_IS_CHUNK_USED(RT_NODE_INNER_256 *node, uint8 chunk)
+{
+ Assert(!RT_NODE_IS_LEAF(node));
+ return node->children[chunk] != RT_INVALID_PTR_ALLOC;
+}
+
+static inline bool
+RT_NODE_LEAF_256_IS_CHUNK_USED(RT_NODE_LEAF_256 *node, uint8 chunk)
+{
+ int idx = BM_IDX(chunk);
+ int bitnum = BM_BIT(chunk);
+
+ Assert(RT_NODE_IS_LEAF(node));
+ return (node->isset[idx] & ((bitmapword) 1 << bitnum)) != 0;
+}
+
+static inline RT_PTR_ALLOC
+RT_NODE_INNER_256_GET_CHILD(RT_NODE_INNER_256 *node, uint8 chunk)
+{
+ Assert(!RT_NODE_IS_LEAF(node));
+ Assert(RT_NODE_INNER_256_IS_CHUNK_USED(node, chunk));
+ return node->children[chunk];
+}
+
+static inline RT_VALUE_TYPE
+RT_NODE_LEAF_256_GET_VALUE(RT_NODE_LEAF_256 *node, uint8 chunk)
+{
+ Assert(RT_NODE_IS_LEAF(node));
+ Assert(RT_NODE_LEAF_256_IS_CHUNK_USED(node, chunk));
+ return node->values[chunk];
+}
+
+/* Set the child in the node-256 */
+static inline void
+RT_NODE_INNER_256_SET(RT_NODE_INNER_256 *node, uint8 chunk, RT_PTR_ALLOC child)
+{
+ Assert(!RT_NODE_IS_LEAF(node));
+ node->children[chunk] = child;
+}
+
+/* Set the value in the node-256 */
+static inline void
+RT_NODE_LEAF_256_SET(RT_NODE_LEAF_256 *node, uint8 chunk, RT_VALUE_TYPE value)
+{
+ int idx = BM_IDX(chunk);
+ int bitnum = BM_BIT(chunk);
+
+ Assert(RT_NODE_IS_LEAF(node));
+ node->isset[idx] |= ((bitmapword) 1 << bitnum);
+ node->values[chunk] = value;
+}
+
+/* Clear the slot at the given chunk position */
+static inline void
+RT_NODE_INNER_256_DELETE(RT_NODE_INNER_256 *node, uint8 chunk)
+{
+ Assert(!RT_NODE_IS_LEAF(node));
+ node->children[chunk] = RT_INVALID_PTR_ALLOC;
+}
+
+static inline void
+RT_NODE_LEAF_256_DELETE(RT_NODE_LEAF_256 *node, uint8 chunk)
+{
+ int idx = BM_IDX(chunk);
+ int bitnum = BM_BIT(chunk);
+
+ Assert(RT_NODE_IS_LEAF(node));
+ node->isset[idx] &= ~((bitmapword) 1 << bitnum);
+}
+
+/*
+ * Return the largest shift that will allow storing the given key.
+ */
+static inline int
+RT_KEY_GET_SHIFT(uint64 key)
+{
+ if (key == 0)
+ return 0;
+ else
+ return (pg_leftmost_one_pos64(key) / RT_NODE_SPAN) * RT_NODE_SPAN;
+}
+
+/*
+ * Return the max value that can be stored in the tree with the given shift.
+ */
+static uint64
+RT_SHIFT_GET_MAX_VAL(int shift)
+{
+ if (shift == RT_MAX_SHIFT)
+ return UINT64_MAX;
+
+ return (UINT64CONST(1) << (shift + RT_NODE_SPAN)) - 1;
+}
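+
+/*
+ * Worked example (illustrative, not part of the patch): for key 0x010203,
+ * pg_leftmost_one_pos64() returns 16, so RT_KEY_GET_SHIFT() returns 16 and
+ * the tree needs three levels (shifts 16, 8 and 0). RT_GET_KEY_CHUNK() then
+ * yields chunk 0x01 at shift 16, 0x02 at shift 8, and 0x03 at shift 0, and
+ * RT_SHIFT_GET_MAX_VAL(16) is 0xFFFFFF, the largest key such a tree can hold
+ * before RT_EXTEND() must add another level.
+ */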
+
+/*
+ * Allocate a new node with the given size class.
+ */
+static RT_PTR_ALLOC
+RT_ALLOC_NODE(RT_RADIX_TREE *tree, RT_SIZE_CLASS size_class, bool is_leaf)
+{
+ RT_PTR_ALLOC allocnode;
+ size_t allocsize;
+
+ if (is_leaf)
+ allocsize = RT_SIZE_CLASS_INFO[size_class].leaf_size;
+ else
+ allocsize = RT_SIZE_CLASS_INFO[size_class].inner_size;
+
+#ifdef RT_SHMEM
+ allocnode = dsa_allocate(tree->dsa, allocsize);
+#else
+ if (is_leaf)
+ allocnode = (RT_PTR_ALLOC) MemoryContextAlloc(tree->leaf_slabs[size_class],
+ allocsize);
+ else
+ allocnode = (RT_PTR_ALLOC) MemoryContextAlloc(tree->inner_slabs[size_class],
+ allocsize);
+#endif
+
+#ifdef RT_DEBUG
+ /* update the statistics */
+ tree->ctl->cnt[size_class]++;
+#endif
+
+ return allocnode;
+}
+
+/* Initialize the node contents */
+static inline void
+RT_INIT_NODE(RT_PTR_LOCAL node, uint8 kind, RT_SIZE_CLASS size_class, bool is_leaf)
+{
+ if (is_leaf)
+ MemSet(node, 0, RT_SIZE_CLASS_INFO[size_class].leaf_size);
+ else
+ MemSet(node, 0, RT_SIZE_CLASS_INFO[size_class].inner_size);
+
+ node->kind = kind;
+
+ if (kind == RT_NODE_KIND_256)
+ /* See comment for the RT_NODE type */
+ Assert(node->fanout == 0);
+ else
+ node->fanout = RT_SIZE_CLASS_INFO[size_class].fanout;
+
+ /* Initialize slot_idxs to invalid values */
+ if (kind == RT_NODE_KIND_125)
+ {
+ RT_NODE_BASE_125 *n125 = (RT_NODE_BASE_125 *) node;
+
+ memset(n125->slot_idxs, RT_INVALID_SLOT_IDX, sizeof(n125->slot_idxs));
+ }
+}
+
+/*
+ * Create a new node as the root. Subordinate nodes will be created during
+ * the insertion.
+ */
+static void
+RT_NEW_ROOT(RT_RADIX_TREE *tree, uint64 key)
+{
+ int shift = RT_KEY_GET_SHIFT(key);
+ bool is_leaf = shift == 0;
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+
+ allocnode = RT_ALLOC_NODE(tree, RT_CLASS_3, is_leaf);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(newnode, RT_NODE_KIND_3, RT_CLASS_3, is_leaf);
+ newnode->shift = shift;
+ tree->ctl->max_val = RT_SHIFT_GET_MAX_VAL(shift);
+ tree->ctl->root = allocnode;
+}
+
+static inline void
+RT_COPY_NODE(RT_PTR_LOCAL newnode, RT_PTR_LOCAL oldnode)
+{
+ newnode->shift = oldnode->shift;
+ newnode->count = oldnode->count;
+}
+
+/*
+ * Given a newly allocated node and an old node, initialize the new
+ * node with the necessary fields and return its local pointer.
+ */
+static inline RT_PTR_LOCAL
+RT_SWITCH_NODE_KIND(RT_RADIX_TREE *tree, RT_PTR_ALLOC allocnode, RT_PTR_LOCAL node,
+ uint8 new_kind, uint8 new_class, bool is_leaf)
+{
+ RT_PTR_LOCAL newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(newnode, new_kind, new_class, is_leaf);
+ RT_COPY_NODE(newnode, node);
+
+ return newnode;
+}
+
+/* Free the given node */
+static void
+RT_FREE_NODE(RT_RADIX_TREE *tree, RT_PTR_ALLOC allocnode)
+{
+ /* If we're deleting the root node, make the tree empty */
+ if (tree->ctl->root == allocnode)
+ {
+ tree->ctl->root = RT_INVALID_PTR_ALLOC;
+ tree->ctl->max_val = 0;
+ }
+
+#ifdef RT_DEBUG
+ {
+ int i;
+ RT_PTR_LOCAL node = RT_PTR_GET_LOCAL(tree, allocnode);
+
+ /* update the statistics */
+ for (i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ if (node->fanout == RT_SIZE_CLASS_INFO[i].fanout)
+ break;
+ }
+
+ /* fanout of node256 is intentionally 0 */
+ if (i == RT_SIZE_CLASS_COUNT)
+ i = RT_CLASS_256;
+
+ tree->ctl->cnt[i]--;
+ Assert(tree->ctl->cnt[i] >= 0);
+ }
+#endif
+
+#ifdef RT_SHMEM
+ dsa_free(tree->dsa, allocnode);
+#else
+ pfree(allocnode);
+#endif
+}
+
+/* Update the parent's pointer when growing a node */
+static inline void
+RT_NODE_UPDATE_INNER(RT_PTR_LOCAL node, uint64 key, RT_PTR_ALLOC new_child)
+{
+#define RT_ACTION_UPDATE
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_INNER
+#undef RT_ACTION_UPDATE
+}
+
+/*
+ * Replace old_child with new_child, and free the old one.
+ */
+static void
+RT_REPLACE_NODE(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent,
+ RT_PTR_ALLOC stored_old_child, RT_PTR_LOCAL old_child,
+ RT_PTR_ALLOC new_child, uint64 key)
+{
+#ifdef USE_ASSERT_CHECKING
+ RT_PTR_LOCAL new = RT_PTR_GET_LOCAL(tree, new_child);
+
+ Assert(old_child->shift == new->shift);
+ Assert(old_child->count == new->count);
+#endif
+
+ if (parent == old_child)
+ {
+ /* Replace the root node with the new larger node */
+ tree->ctl->root = new_child;
+ }
+ else
+ RT_NODE_UPDATE_INNER(parent, key, new_child);
+
+ RT_FREE_NODE(tree, stored_old_child);
+}
+
+/*
+ * The radix tree doesn't have sufficient height. Extend the radix tree so
+ * it can store the key.
+ */
+static void
+RT_EXTEND(RT_RADIX_TREE *tree, uint64 key)
+{
+ int target_shift;
+ RT_PTR_LOCAL root = RT_PTR_GET_LOCAL(tree, tree->ctl->root);
+ int shift = root->shift + RT_NODE_SPAN;
+
+ target_shift = RT_KEY_GET_SHIFT(key);
+
+ /* Grow tree from 'shift' to 'target_shift' */
+ while (shift <= target_shift)
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL node;
+ RT_NODE_INNER_3 *n3;
+
+ allocnode = RT_ALLOC_NODE(tree, RT_CLASS_3, false);
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(node, RT_NODE_KIND_3, RT_CLASS_3, false);
+ node->shift = shift;
+ node->count = 1;
+
+ n3 = (RT_NODE_INNER_3 *) node;
+ n3->base.chunks[0] = 0;
+ n3->children[0] = tree->ctl->root;
+
+ /* Update the root */
+ tree->ctl->root = allocnode;
+
+ shift += RT_NODE_SPAN;
+ }
+
+ tree->ctl->max_val = RT_SHIFT_GET_MAX_VAL(target_shift);
+}
+
+/*
+ * The radix tree doesn't yet have the inner and leaf nodes for the given key.
+ * Insert inner nodes and a leaf node from 'node' down to the bottom.
+ */
+static inline void
+RT_SET_EXTEND(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE value, RT_PTR_LOCAL parent,
+ RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node)
+{
+ int shift = node->shift;
+
+ Assert(RT_PTR_GET_LOCAL(tree, stored_node) == node);
+
+ while (shift >= RT_NODE_SPAN)
+ {
+ RT_PTR_ALLOC allocchild;
+ RT_PTR_LOCAL newchild;
+ int newshift = shift - RT_NODE_SPAN;
+ bool is_leaf = newshift == 0;
+
+ allocchild = RT_ALLOC_NODE(tree, RT_CLASS_3, is_leaf);
+ newchild = RT_PTR_GET_LOCAL(tree, allocchild);
+ RT_INIT_NODE(newchild, RT_NODE_KIND_3, RT_CLASS_3, is_leaf);
+ newchild->shift = newshift;
+ RT_NODE_INSERT_INNER(tree, parent, stored_node, node, key, allocchild);
+
+ parent = node;
+ node = newchild;
+ stored_node = allocchild;
+ shift -= RT_NODE_SPAN;
+ }
+
+ RT_NODE_INSERT_LEAF(tree, parent, stored_node, node, key, value);
+ tree->ctl->num_keys++;
+}
+
+/*
+ * Search for the child pointer corresponding to 'key' in the given node.
+ *
+ * Return true if the key is found, otherwise return false. On success, the child
+ * pointer is set to child_p.
+ */
+static inline bool
+RT_NODE_SEARCH_INNER(RT_PTR_LOCAL node, uint64 key, RT_PTR_ALLOC *child_p)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/*
+ * Search for the value corresponding to 'key' in the given node.
+ *
+ * Return true if the key is found, otherwise return false. On success, the pointer
+ * to the value is set to value_p.
+ */
+static inline bool
+RT_NODE_SEARCH_LEAF(RT_PTR_LOCAL node, uint64 key, RT_VALUE_TYPE *value_p)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
+#ifdef RT_USE_DELETE
+/*
+ * Search for the child pointer corresponding to 'key' in the given node.
+ *
+ * Delete the child pointer and return true if the key is found, otherwise return false.
+ */
+static inline bool
+RT_NODE_DELETE_INNER(RT_PTR_LOCAL node, uint64 key)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_delete_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/*
+ * Search for the value corresponding to 'key' in the given node.
+ *
+ * Delete the value and return true if the key is found, otherwise return false.
+ */
+static inline bool
+RT_NODE_DELETE_LEAF(RT_PTR_LOCAL node, uint64 key)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_delete_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+#endif
+
+/*
+ * Insert "child" into "node".
+ *
+ * "parent" is the parent of "node", so the grandparent of the child.
+ * If the node we're inserting into needs to grow, we update the parent's
+ * child pointer with the pointer to the new larger node.
+ */
+static bool
+RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
+ uint64 key, RT_PTR_ALLOC child)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_insert_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/* Like RT_NODE_INSERT_INNER, but for leaf nodes */
+static bool
+RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
+ uint64 key, RT_VALUE_TYPE value)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_insert_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
+/*
+ * Create the radix tree in the given memory context and return it.
+ */
+RT_SCOPE RT_RADIX_TREE *
+#ifdef RT_SHMEM
+RT_CREATE(MemoryContext ctx, dsa_area *dsa)
+#else
+RT_CREATE(MemoryContext ctx)
+#endif
+{
+ RT_RADIX_TREE *tree;
+ MemoryContext old_ctx;
+#ifdef RT_SHMEM
+ dsa_pointer dp;
+#endif
+
+ old_ctx = MemoryContextSwitchTo(ctx);
+
+ tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE));
+ tree->context = ctx;
+
+#ifdef RT_SHMEM
+ tree->dsa = dsa;
+ dp = dsa_allocate0(dsa, sizeof(RT_RADIX_TREE_CONTROL));
+ tree->ctl = (RT_RADIX_TREE_CONTROL *) dsa_get_address(dsa, dp);
+ tree->ctl->handle = dp;
+ tree->ctl->magic = RT_RADIX_TREE_MAGIC;
+#else
+ tree->ctl = (RT_RADIX_TREE_CONTROL *) palloc0(sizeof(RT_RADIX_TREE_CONTROL));
+
+ /* Create a slab context for each size class */
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ RT_SIZE_CLASS_ELEM size_class = RT_SIZE_CLASS_INFO[i];
+ size_t inner_blocksize = RT_SLAB_BLOCK_SIZE(size_class.inner_size);
+ size_t leaf_blocksize = RT_SLAB_BLOCK_SIZE(size_class.leaf_size);
+
+ tree->inner_slabs[i] = SlabContextCreate(ctx,
+ size_class.name,
+ inner_blocksize,
+ size_class.inner_size);
+ tree->leaf_slabs[i] = SlabContextCreate(ctx,
+ size_class.name,
+ leaf_blocksize,
+ size_class.leaf_size);
+ }
+#endif
+
+ tree->ctl->root = RT_INVALID_PTR_ALLOC;
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return tree;
+}
+
+#ifdef RT_SHMEM
+RT_SCOPE RT_RADIX_TREE *
+RT_ATTACH(dsa_area *dsa, RT_HANDLE handle)
+{
+ RT_RADIX_TREE *tree;
+ dsa_pointer control;
+
+ tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE));
+
+ /* Find the control object in shared memory */
+ control = handle;
+
+ tree->dsa = dsa;
+ tree->ctl = (RT_RADIX_TREE_CONTROL *) dsa_get_address(dsa, control);
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+
+ return tree;
+}
+
+RT_SCOPE void
+RT_DETACH(RT_RADIX_TREE *tree)
+{
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+ pfree(tree);
+}
+
+RT_SCOPE RT_HANDLE
+RT_GET_HANDLE(RT_RADIX_TREE *tree)
+{
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+ return tree->ctl->handle;
+}
+
+/*
+ * Recursively free all nodes allocated to the DSA area.
+ */
+static inline void
+RT_FREE_RECURSE(RT_RADIX_TREE *tree, RT_PTR_ALLOC ptr)
+{
+ RT_PTR_LOCAL node = RT_PTR_GET_LOCAL(tree, ptr);
+
+ check_stack_depth();
+ CHECK_FOR_INTERRUPTS();
+
+ /* The leaf node doesn't have child pointers */
+ if (RT_NODE_IS_LEAF(node))
+ {
+ dsa_free(tree->dsa, ptr);
+ return;
+ }
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE_INNER_3 *n3 = (RT_NODE_INNER_3 *) node;
+
+ for (int i = 0; i < n3->base.n.count; i++)
+ RT_FREE_RECURSE(tree, n3->children[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE_INNER_32 *n32 = (RT_NODE_INNER_32 *) node;
+
+ for (int i = 0; i < n32->base.n.count; i++)
+ RT_FREE_RECURSE(tree, n32->children[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE_INNER_125 *n125 = (RT_NODE_INNER_125 *) node;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(&n125->base, i))
+ continue;
+
+ RT_FREE_RECURSE(tree, RT_NODE_INNER_125_GET_CHILD(n125, i));
+ }
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE_INNER_256 *n256 = (RT_NODE_INNER_256 *) node;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
+ continue;
+
+ RT_FREE_RECURSE(tree, RT_NODE_INNER_256_GET_CHILD(n256, i));
+ }
+
+ break;
+ }
+ }
+
+ /* Free the inner node */
+ dsa_free(tree->dsa, ptr);
+}
+#endif
+
+/*
+ * Free the given radix tree.
+ */
+RT_SCOPE void
+RT_FREE(RT_RADIX_TREE *tree)
+{
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+
+ /* Free all memory used for radix tree nodes */
+ if (RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ RT_FREE_RECURSE(tree, tree->ctl->root);
+
+ /*
+ * Vandalize the control block to help catch programming errors where
+ * other backends access the memory formerly occupied by this radix tree.
+ */
+ tree->ctl->magic = 0;
+ dsa_free(tree->dsa, tree->ctl->handle);
+#else
+ pfree(tree->ctl);
+
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ MemoryContextDelete(tree->inner_slabs[i]);
+ MemoryContextDelete(tree->leaf_slabs[i]);
+ }
+#endif
+
+ pfree(tree);
+}
+
+/*
+ * Set key to value. If the entry already exists, update its value to 'value'
+ * and return true; otherwise insert a new entry and return false.
+ */
+RT_SCOPE bool
+RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE value)
+{
+ int shift;
+ bool updated;
+ RT_PTR_LOCAL parent;
+ RT_PTR_ALLOC stored_child;
+ RT_PTR_LOCAL child;
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+
+ /* Empty tree, create the root */
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ RT_NEW_ROOT(tree, key);
+
+ /* Extend the tree if necessary */
+ if (key > tree->ctl->max_val)
+ RT_EXTEND(tree, key);
+
+ stored_child = tree->ctl->root;
+ parent = RT_PTR_GET_LOCAL(tree, stored_child);
+ shift = parent->shift;
+
+ /* Descend the tree until we reach a leaf node */
+ while (shift >= 0)
+ {
+ RT_PTR_ALLOC new_child = RT_INVALID_PTR_ALLOC;
+
+ child = RT_PTR_GET_LOCAL(tree, stored_child);
+
+ if (RT_NODE_IS_LEAF(child))
+ break;
+
+ if (!RT_NODE_SEARCH_INNER(child, key, &new_child))
+ {
+ RT_SET_EXTEND(tree, key, value, parent, stored_child, child);
+ return false;
+ }
+
+ parent = child;
+ stored_child = new_child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ updated = RT_NODE_INSERT_LEAF(tree, parent, stored_child, child, key, value);
+
+ /* Update the statistics */
+ if (!updated)
+ tree->ctl->num_keys++;
+
+ return updated;
+}
+
+/*
+ * Search for the given key in the radix tree. Return true if the key is found,
+ * otherwise return false. On success, the value is stored in *value_p, which
+ * must not be NULL.
+ */
+RT_SCOPE bool
+RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p)
+{
+ RT_PTR_LOCAL node;
+ int shift;
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+ Assert(value_p != NULL);
+
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root) || key > tree->ctl->max_val)
+ return false;
+
+ node = RT_PTR_GET_LOCAL(tree, tree->ctl->root);
+ shift = node->shift;
+
+ /* Descend the tree until a leaf node */
+ while (shift >= 0)
+ {
+ RT_PTR_ALLOC child = RT_INVALID_PTR_ALLOC;
+
+ if (RT_NODE_IS_LEAF(node))
+ break;
+
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ return false;
+
+ node = RT_PTR_GET_LOCAL(tree, child);
+ shift -= RT_NODE_SPAN;
+ }
+
+ return RT_NODE_SEARCH_LEAF(node, key, value_p);
+}
+
+#ifdef RT_USE_DELETE
+/*
+ * Delete the given key from the radix tree. Return true if the key is found (and
+ * deleted), otherwise do nothing and return false.
+ */
+RT_SCOPE bool
+RT_DELETE(RT_RADIX_TREE *tree, uint64 key)
+{
+ RT_PTR_LOCAL node;
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_ALLOC stack[RT_MAX_LEVEL] = {0};
+ int shift;
+ int level;
+ bool deleted;
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root) || key > tree->ctl->max_val)
+ return false;
+
+ /*
+ * Descend the tree to search for the key, building a stack of the nodes we
+ * visited along the way.
+ */
+ allocnode = tree->ctl->root;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ shift = node->shift;
+ level = -1;
+ while (shift > 0)
+ {
+ RT_PTR_ALLOC child = RT_INVALID_PTR_ALLOC;
+
+ /* Push the current node to the stack */
+ stack[++level] = allocnode;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ return false;
+
+ allocnode = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ /* Delete the key from the leaf node if it exists */
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ deleted = RT_NODE_DELETE_LEAF(node, key);
+
+ if (!deleted)
+ {
+ /* the key was not found in the leaf node */
+ return false;
+ }
+
+ /* Found the key to delete. Update the statistics */
+ tree->ctl->num_keys--;
+
+ /*
+ * Return if the leaf node still has keys; in that case we don't need to
+ * delete the node.
+ */
+ if (node->count > 0)
+ return true;
+
+ /* Free the empty leaf node */
+ RT_FREE_NODE(tree, allocnode);
+
+ /* Delete the key from the inner nodes, walking back up the stack */
+ while (level >= 0)
+ {
+ allocnode = stack[level--];
+
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ deleted = RT_NODE_DELETE_INNER(node, key);
+ Assert(deleted);
+
+ /* If the node didn't become empty, we stop deleting the key */
+ if (node->count > 0)
+ break;
+
+ /* The node became empty */
+ RT_FREE_NODE(tree, allocnode);
+ }
+
+ return true;
+}
+#endif
+
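+/*
+ * Replace the chunk of the key under construction at the given shift,
+ * leaving the rest of the key intact.
+ */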
+static inline void
+RT_ITER_UPDATE_KEY(RT_ITER *iter, uint8 chunk, uint8 shift)
+{
+ iter->key &= ~(((uint64) RT_CHUNK_MASK) << shift);
+ iter->key |= (((uint64) chunk) << shift);
+}
+
+/*
+ * Advance the slot in the inner node. Return the child if one exists,
+ * otherwise NULL.
+ */
+static inline RT_PTR_LOCAL
+RT_NODE_INNER_ITERATE_NEXT(RT_ITER *iter, RT_NODE_ITER *node_iter)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_iter_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/*
+ * Advance the slot in the leaf node. On success, return true and set the
+ * value to *value_p; otherwise return false.
+ */
+static inline bool
+RT_NODE_LEAF_ITERATE_NEXT(RT_ITER *iter, RT_NODE_ITER *node_iter,
+ RT_VALUE_TYPE *value_p)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_iter_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
+/*
+ * Update each node_iter for inner nodes in the iterator node stack.
+ */
+static void
+RT_UPDATE_ITER_STACK(RT_ITER *iter, RT_PTR_LOCAL from_node, int from)
+{
+ int level = from;
+ RT_PTR_LOCAL node = from_node;
+
+ for (;;)
+ {
+ RT_NODE_ITER *node_iter = &(iter->stack[level--]);
+
+ node_iter->node = node;
+ node_iter->current_idx = -1;
+
+ /* We don't advance the leaf node iterator here */
+ if (RT_NODE_IS_LEAF(node))
+ return;
+
+ /* Advance to the next slot in the inner node */
+ node = RT_NODE_INNER_ITERATE_NEXT(iter, node_iter);
+
+ /* We must find the first child in the node */
+ Assert(node);
+ }
+}
+
+/* Create and return the iterator for the given radix tree */
+RT_SCOPE RT_ITER *
+RT_BEGIN_ITERATE(RT_RADIX_TREE *tree)
+{
+ MemoryContext old_ctx;
+ RT_ITER *iter;
+ RT_PTR_LOCAL root;
+ int top_level;
+
+ old_ctx = MemoryContextSwitchTo(tree->context);
+
+ iter = (RT_ITER *) palloc0(sizeof(RT_ITER));
+ iter->tree = tree;
+
+ /* empty tree */
+ if (!RT_PTR_ALLOC_IS_VALID(iter->tree->ctl->root))
+ return iter;
+
+ root = RT_PTR_GET_LOCAL(tree, iter->tree->ctl->root);
+ top_level = root->shift / RT_NODE_SPAN;
+ iter->stack_len = top_level;
+
+ /*
+ * Descend to the leftmost leaf node from the root. The key is
+ * constructed while descending to the leaf.
+ */
+ RT_UPDATE_ITER_STACK(iter, root, top_level);
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return iter;
+}
+
+/*
+ * If there is a next key, set *key_p and *value_p and return true.
+ * Otherwise return false.
+ */
+RT_SCOPE bool
+RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, RT_VALUE_TYPE *value_p)
+{
+ /* Empty tree */
+ if (!RT_PTR_ALLOC_IS_VALID(iter->tree->ctl->root))
+ return false;
+
+ for (;;)
+ {
+ RT_PTR_LOCAL child = NULL;
+ RT_VALUE_TYPE value;
+ int level;
+ bool found;
+
+ /* Advance the leaf node iterator to get next key-value pair */
+ found = RT_NODE_LEAF_ITERATE_NEXT(iter, &(iter->stack[0]), &value);
+
+ if (found)
+ {
+ *key_p = iter->key;
+ *value_p = value;
+ return true;
+ }
+
+ /*
+ * We've visited all values in the leaf node, so advance the inner
+ * node iterators, starting from level 1, until we find the next
+ * child node.
+ */
+ for (level = 1; level <= iter->stack_len; level++)
+ {
+ child = RT_NODE_INNER_ITERATE_NEXT(iter, &(iter->stack[level]));
+
+ if (child)
+ break;
+ }
+
+ /* the iteration finished */
+ if (!child)
+ return false;
+
+ /*
+ * Set the node to the node iterator and update the iterator stack
+ * from this node.
+ */
+ RT_UPDATE_ITER_STACK(iter, child, level - 1);
+
+ /* Node iterators are updated, so try again from the leaf */
+ }
+
+ return false;
+}
+
+RT_SCOPE void
+RT_END_ITERATE(RT_ITER *iter)
+{
+ pfree(iter);
+}
+
+/*
+ * Return the amount of memory used by the radix tree.
+ */
+RT_SCOPE uint64
+RT_MEMORY_USAGE(RT_RADIX_TREE *tree)
+{
+ Size total = 0;
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+ total = dsa_get_total_size(tree->dsa);
+#else
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
+ total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
+ }
+#endif
+
+ return total;
+}
+
+/*
+ * Verify the radix tree node.
+ */
+static void
+RT_VERIFY_NODE(RT_PTR_LOCAL node)
+{
+#ifdef USE_ASSERT_CHECKING
+ Assert(node->count >= 0);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE_BASE_3 *n3 = (RT_NODE_BASE_3 *) node;
+
+ for (int i = 1; i < n3->n.count; i++)
+ Assert(n3->chunks[i - 1] < n3->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE_BASE_32 *n32 = (RT_NODE_BASE_32 *) node;
+
+ for (int i = 1; i < n32->n.count; i++)
+ Assert(n32->chunks[i - 1] < n32->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE_BASE_125 *n125 = (RT_NODE_BASE_125 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ uint8 slot = n125->slot_idxs[i];
+ int idx = BM_IDX(slot);
+ int bitnum = BM_BIT(slot);
+
+ if (!RT_NODE_125_IS_CHUNK_USED(n125, i))
+ continue;
+
+ /* Check if the corresponding slot is used */
+ Assert(slot < node->fanout);
+ Assert((n125->isset[idx] & ((bitmapword) 1 << bitnum)) != 0);
+
+ cnt++;
+ }
+
+ Assert(n125->n.count == cnt);
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_256 *n256 = (RT_NODE_LEAF_256 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < BM_IDX(RT_NODE_MAX_SLOTS); i++)
+ cnt += bmw_popcount(n256->isset[i]);
+
+ /* Check that the number of used chunks matches the count */
+ Assert(n256->base.n.count == cnt);
+
+ break;
+ }
+ }
+ }
+#endif
+}
+
+/***************** DEBUG FUNCTIONS *****************/
+#ifdef RT_DEBUG
+
+#define RT_UINT64_FORMAT_HEX "%" INT64_MODIFIER "X"
+
+RT_SCOPE void
+RT_STATS(RT_RADIX_TREE *tree)
+{
+ fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->ctl->max_val);
+ fprintf(stderr, "num_keys = " UINT64_FORMAT "\n", tree->ctl->num_keys);
+
+#ifdef RT_SHMEM
+ fprintf(stderr, "handle = " UINT64_FORMAT "\n", tree->ctl->handle);
+#endif
+
+ if (RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ {
+ RT_PTR_LOCAL root = RT_PTR_GET_LOCAL(tree, tree->ctl->root);
+
+ fprintf(stderr, "height = %d, n3 = %u, n15 = %u, n32 = %u, n125 = %u, n256 = %u\n",
+ root->shift / RT_NODE_SPAN,
+ tree->ctl->cnt[RT_CLASS_3],
+ tree->ctl->cnt[RT_CLASS_32_MIN],
+ tree->ctl->cnt[RT_CLASS_32_MAX],
+ tree->ctl->cnt[RT_CLASS_125],
+ tree->ctl->cnt[RT_CLASS_256]);
+ }
+}
+
+static void
+RT_DUMP_NODE(RT_RADIX_TREE *tree, RT_PTR_ALLOC allocnode, int level,
+ bool recurse, StringInfo buf)
+{
+ RT_PTR_LOCAL node = RT_PTR_GET_LOCAL(tree, allocnode);
+ StringInfoData spaces;
+
+ initStringInfo(&spaces);
+ appendStringInfoSpaces(&spaces, (level * 4) + 1);
+
+ appendStringInfo(buf, "%s%s[%s] kind %d, fanout %d, count %u, shift %u:\n",
+ spaces.data,
+ level == 0 ? "" : "-> ",
+ RT_NODE_IS_LEAF(node) ? "LEAF" : "INNR",
+ (node->kind == RT_NODE_KIND_3) ? 3 :
+ (node->kind == RT_NODE_KIND_32) ? 32 :
+ (node->kind == RT_NODE_KIND_125) ? 125 : 256,
+ node->fanout == 0 ? 256 : node->fanout,
+ node->count, node->shift);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_3 *n3 = (RT_NODE_LEAF_3 *) node;
+
+ appendStringInfo(buf, "%schunk[%d] 0x%X\n",
+ spaces.data, i, n3->base.chunks[i]);
+ }
+ else
+ {
+ RT_NODE_INNER_3 *n3 = (RT_NODE_INNER_3 *) node;
+
+ appendStringInfo(buf, "%schunk[%d] 0x%X",
+ spaces.data, i, n3->base.chunks[i]);
+
+ if (recurse)
+ {
+ appendStringInfo(buf, "\n");
+ RT_DUMP_NODE(tree, n3->children[i], level + 1,
+ recurse, buf);
+ }
+ else
+ appendStringInfo(buf, " (skipped)\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_32 *n32 = (RT_NODE_LEAF_32 *) node;
+
+ appendStringInfo(buf, "%schunk[%d] 0x%X\n",
+ spaces.data, i, n32->base.chunks[i]);
+ }
+ else
+ {
+ RT_NODE_INNER_32 *n32 = (RT_NODE_INNER_32 *) node;
+
+ appendStringInfo(buf, "%schunk[%d] 0x%X",
+ spaces.data, i, n32->base.chunks[i]);
+
+ if (recurse)
+ {
+ appendStringInfo(buf, "\n");
+ RT_DUMP_NODE(tree, n32->children[i], level + 1,
+ recurse, buf);
+ }
+ else
+ appendStringInfo(buf, " (skipped)\n");
+
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE_BASE_125 *b125 = (RT_NODE_BASE_125 *) node;
+ char *sep = "";
+
+ appendStringInfo(buf, "%sslot_idxs: ", spaces.data);
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(b125, i))
+ continue;
+
+ appendStringInfo(buf, "%s[%d]=%d ",
+ sep, i, b125->slot_idxs[i]);
+ sep = ",";
+ }
+
+ appendStringInfo(buf, "\n%sisset-bitmap: ", spaces.data);
+ for (int i = 0; i < (RT_SLOT_IDX_LIMIT / BITS_PER_BYTE); i++)
+ appendStringInfo(buf, "%X ", ((uint8 *) b125->isset)[i]);
+ appendStringInfo(buf, "\n");
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(b125, i))
+ continue;
+
+ if (RT_NODE_IS_LEAF(node))
+ appendStringInfo(buf, "%schunk 0x%X\n",
+ spaces.data, i);
+ else
+ {
+ RT_NODE_INNER_125 *n125 = (RT_NODE_INNER_125 *) b125;
+
+ appendStringInfo(buf, "%schunk 0x%X",
+ spaces.data, i);
+
+ if (recurse)
+ {
+ appendStringInfo(buf, "\n");
+ RT_DUMP_NODE(tree, RT_NODE_INNER_125_GET_CHILD(n125, i),
+ level + 1, recurse, buf);
+ }
+ else
+ appendStringInfo(buf, " (skipped)\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_256 *n256 = (RT_NODE_LEAF_256 *) node;
+
+ appendStringInfo(buf, "%sisset-bitmap: ", spaces.data);
+ for (int i = 0; i < (RT_SLOT_IDX_LIMIT / BITS_PER_BYTE); i++)
+ appendStringInfo(buf, "%X ", ((uint8 *) n256->isset)[i]);
+ appendStringInfo(buf, "\n");
+ }
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_256 *n256 = (RT_NODE_LEAF_256 *) node;
+
+ if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, i))
+ continue;
+
+ appendStringInfo(buf, "%schunk 0x%X\n",
+ spaces.data, i);
+ }
+ else
+ {
+ RT_NODE_INNER_256 *n256 = (RT_NODE_INNER_256 *) node;
+
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
+ continue;
+
+ appendStringInfo(buf, "%schunk 0x%X",
+ spaces.data, i);
+
+ if (recurse)
+ {
+ appendStringInfo(buf, "\n");
+ RT_DUMP_NODE(tree, RT_NODE_INNER_256_GET_CHILD(n256, i),
+ level + 1, recurse, buf);
+ }
+ else
+ appendStringInfo(buf, " (skipped)\n");
+ }
+ }
+ break;
+ }
+ }
+}
+
+RT_SCOPE void
+RT_DUMP_SEARCH(RT_RADIX_TREE *tree, uint64 key)
+{
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL node;
+ StringInfoData buf;
+ int shift;
+ int level = 0;
+
+ RT_STATS(tree);
+
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ {
+ fprintf(stderr, "empty tree\n");
+ return;
+ }
+
+ if (key > tree->ctl->max_val)
+ {
+ fprintf(stderr, "key " UINT64_FORMAT "(0x" RT_UINT64_FORMAT_HEX ") is larger than max val\n",
+ key, key);
+ return;
+ }
+
+ initStringInfo(&buf);
+ allocnode = tree->ctl->root;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ shift = node->shift;
+ while (shift >= 0)
+ {
+ RT_PTR_ALLOC child;
+
+ RT_DUMP_NODE(tree, allocnode, level, false, &buf);
+
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_VALUE_TYPE dummy;
+
+ /* We reached a leaf node, so find the corresponding slot */
+ RT_NODE_SEARCH_LEAF(node, key, &dummy);
+
+ break;
+ }
+
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ break;
+
+ allocnode = child;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ shift -= RT_NODE_SPAN;
+ level++;
+ }
+
+ fprintf(stderr, "%s", buf.data);
+}
+
+RT_SCOPE void
+RT_DUMP(RT_RADIX_TREE *tree)
+{
+ StringInfoData buf;
+
+ RT_STATS(tree);
+
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ {
+ fprintf(stderr, "empty tree\n");
+ return;
+ }
+
+ initStringInfo(&buf);
+
+ RT_DUMP_NODE(tree, tree->ctl->root, 0, true, &buf);
+
+ fprintf(stderr, "%s",buf.data);
+}
+#endif
+
+#endif /* RT_DEFINE */
+
+
+/* undefine external parameters, so next radix tree can be defined */
+#undef RT_PREFIX
+#undef RT_SCOPE
+#undef RT_DECLARE
+#undef RT_DEFINE
+#undef RT_VALUE_TYPE
+
+/* locally declared macros */
+#undef RT_MAKE_PREFIX
+#undef RT_MAKE_NAME
+#undef RT_MAKE_NAME_
+#undef RT_NODE_SPAN
+#undef RT_NODE_MAX_SLOTS
+#undef RT_CHUNK_MASK
+#undef RT_MAX_SHIFT
+#undef RT_MAX_LEVEL
+#undef RT_GET_KEY_CHUNK
+#undef BM_IDX
+#undef BM_BIT
+#undef RT_NODE_IS_LEAF
+#undef RT_NODE_MUST_GROW
+#undef RT_NODE_KIND_COUNT
+#undef RT_SIZE_CLASS_COUNT
+#undef RT_SLOT_IDX_LIMIT
+#undef RT_INVALID_SLOT_IDX
+#undef RT_SLAB_BLOCK_SIZE
+#undef RT_RADIX_TREE_MAGIC
+#undef RT_UINT64_FORMAT_HEX
+
+/* type declarations */
+#undef RT_RADIX_TREE
+#undef RT_RADIX_TREE_CONTROL
+#undef RT_PTR_LOCAL
+#undef RT_PTR_ALLOC
+#undef RT_INVALID_PTR_ALLOC
+#undef RT_HANDLE
+#undef RT_ITER
+#undef RT_NODE
+#undef RT_NODE_ITER
+#undef RT_NODE_KIND_3
+#undef RT_NODE_KIND_32
+#undef RT_NODE_KIND_125
+#undef RT_NODE_KIND_256
+#undef RT_NODE_BASE_3
+#undef RT_NODE_BASE_32
+#undef RT_NODE_BASE_125
+#undef RT_NODE_BASE_256
+#undef RT_NODE_INNER_3
+#undef RT_NODE_INNER_32
+#undef RT_NODE_INNER_125
+#undef RT_NODE_INNER_256
+#undef RT_NODE_LEAF_3
+#undef RT_NODE_LEAF_32
+#undef RT_NODE_LEAF_125
+#undef RT_NODE_LEAF_256
+#undef RT_SIZE_CLASS
+#undef RT_SIZE_CLASS_ELEM
+#undef RT_SIZE_CLASS_INFO
+#undef RT_CLASS_3
+#undef RT_CLASS_32_MIN
+#undef RT_CLASS_32_MAX
+#undef RT_CLASS_125
+#undef RT_CLASS_256
+
+/* function declarations */
+#undef RT_CREATE
+#undef RT_FREE
+#undef RT_ATTACH
+#undef RT_DETACH
+#undef RT_GET_HANDLE
+#undef RT_SEARCH
+#undef RT_SET
+#undef RT_BEGIN_ITERATE
+#undef RT_ITERATE_NEXT
+#undef RT_END_ITERATE
+#undef RT_USE_DELETE
+#undef RT_DELETE
+#undef RT_MEMORY_USAGE
+#undef RT_DUMP
+#undef RT_DUMP_NODE
+#undef RT_DUMP_SEARCH
+#undef RT_STATS
+
+/* internal helper functions */
+#undef RT_NEW_ROOT
+#undef RT_ALLOC_NODE
+#undef RT_INIT_NODE
+#undef RT_FREE_NODE
+#undef RT_FREE_RECURSE
+#undef RT_EXTEND
+#undef RT_SET_EXTEND
+#undef RT_SWITCH_NODE_KIND
+#undef RT_COPY_NODE
+#undef RT_REPLACE_NODE
+#undef RT_PTR_GET_LOCAL
+#undef RT_PTR_ALLOC_IS_VALID
+#undef RT_NODE_3_SEARCH_EQ
+#undef RT_NODE_32_SEARCH_EQ
+#undef RT_NODE_3_GET_INSERTPOS
+#undef RT_NODE_32_GET_INSERTPOS
+#undef RT_CHUNK_CHILDREN_ARRAY_SHIFT
+#undef RT_CHUNK_VALUES_ARRAY_SHIFT
+#undef RT_CHUNK_CHILDREN_ARRAY_DELETE
+#undef RT_CHUNK_VALUES_ARRAY_DELETE
+#undef RT_CHUNK_CHILDREN_ARRAY_COPY
+#undef RT_CHUNK_VALUES_ARRAY_COPY
+#undef RT_NODE_125_IS_CHUNK_USED
+#undef RT_NODE_INNER_125_GET_CHILD
+#undef RT_NODE_LEAF_125_GET_VALUE
+#undef RT_NODE_INNER_256_IS_CHUNK_USED
+#undef RT_NODE_LEAF_256_IS_CHUNK_USED
+#undef RT_NODE_INNER_256_GET_CHILD
+#undef RT_NODE_LEAF_256_GET_VALUE
+#undef RT_NODE_INNER_256_SET
+#undef RT_NODE_LEAF_256_SET
+#undef RT_NODE_INNER_256_DELETE
+#undef RT_NODE_LEAF_256_DELETE
+#undef RT_KEY_GET_SHIFT
+#undef RT_SHIFT_GET_MAX_VAL
+#undef RT_NODE_SEARCH_INNER
+#undef RT_NODE_SEARCH_LEAF
+#undef RT_NODE_UPDATE_INNER
+#undef RT_NODE_DELETE_INNER
+#undef RT_NODE_DELETE_LEAF
+#undef RT_NODE_INSERT_INNER
+#undef RT_NODE_INSERT_LEAF
+#undef RT_NODE_INNER_ITERATE_NEXT
+#undef RT_NODE_LEAF_ITERATE_NEXT
+#undef RT_UPDATE_ITER_STACK
+#undef RT_ITER_UPDATE_KEY
+#undef RT_VERIFY_NODE
+
+#undef RT_DEBUG
diff --git a/src/include/lib/radixtree_delete_impl.h b/src/include/lib/radixtree_delete_impl.h
new file mode 100644
index 0000000000..99c90771b9
--- /dev/null
+++ b/src/include/lib/radixtree_delete_impl.h
@@ -0,0 +1,106 @@
+/* TODO: shrink nodes */
+
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE3_TYPE RT_NODE_INNER_3
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE3_TYPE RT_NODE_LEAF_3
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
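+ /*
+ * Delete the slot for 'chunk' from the node. Return false if the chunk is
+ * not present; otherwise remove it, decrement the node's count, and return
+ * true.
+ */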
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+
+#ifdef RT_NODE_LEVEL_LEAF
+ Assert(RT_NODE_IS_LEAF(node));
+#else
+ Assert(!RT_NODE_IS_LEAF(node));
+#endif
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node;
+ int idx = RT_NODE_3_SEARCH_EQ((RT_NODE_BASE_3 *) n3, chunk);
+
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_DELETE(n3->base.chunks, n3->values,
+ n3->base.n.count, idx);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_DELETE(n3->base.chunks, n3->children,
+ n3->base.n.count, idx);
+#endif
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
+ int idx = RT_NODE_32_SEARCH_EQ((RT_NODE_BASE_32 *) n32, chunk);
+
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_DELETE(n32->base.chunks, n32->values,
+ n32->base.n.count, idx);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_DELETE(n32->base.chunks, n32->children,
+ n32->base.n.count, idx);
+#endif
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+ int slotpos = n125->base.slot_idxs[chunk];
+ int idx;
+ int bitnum;
+
+ if (slotpos == RT_INVALID_SLOT_IDX)
+ return false;
+
+ idx = BM_IDX(slotpos);
+ bitnum = BM_BIT(slotpos);
+ n125->base.isset[idx] &= ~((bitmapword) 1 << bitnum);
+ n125->base.slot_idxs[chunk] = RT_INVALID_SLOT_IDX;
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk))
+#else
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, chunk))
+#endif
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_NODE_LEAF_256_DELETE(n256, chunk);
+#else
+ RT_NODE_INNER_256_DELETE(n256, chunk);
+#endif
+ break;
+ }
+ }
+
+ /* update statistics */
+ node->count--;
+
+ return true;
+
+#undef RT_NODE3_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_insert_impl.h b/src/include/lib/radixtree_insert_impl.h
new file mode 100644
index 0000000000..22aca0e6cc
--- /dev/null
+++ b/src/include/lib/radixtree_insert_impl.h
@@ -0,0 +1,317 @@
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE3_TYPE RT_NODE_INNER_3
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE3_TYPE RT_NODE_LEAF_3
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool chunk_exists = false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ const bool is_leaf = true;
+ Assert(RT_NODE_IS_LEAF(node));
+#else
+ const bool is_leaf = false;
+ Assert(!RT_NODE_IS_LEAF(node));
+#endif
+
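+ /*
+ * Insert 'chunk' into the node. Each case below first checks whether the
+ * chunk already exists and, if so, simply replaces the value (or child).
+ * If the node has no free slot, it is grown to a larger size class or to
+ * the next node kind; growing to a new kind falls through to that kind's
+ * case to perform the actual insertion.
+ */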
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node;
+ int idx;
+
+ idx = RT_NODE_3_SEARCH_EQ(&n3->base, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+#ifdef RT_NODE_LEVEL_LEAF
+ n3->values[idx] = value;
+#else
+ n3->children[idx] = child;
+#endif
+ break;
+ }
+
+ if (unlikely(RT_NODE_MUST_GROW(n3)))
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+ RT_NODE32_TYPE *new32;
+ const uint8 new_kind = RT_NODE_KIND_32;
+ const RT_SIZE_CLASS new_class = RT_CLASS_32_MIN;
+
+ /* grow node from 3 to 32 */
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
+ newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, is_leaf);
+ new32 = (RT_NODE32_TYPE *) newnode;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_COPY(n3->base.chunks, n3->values,
+ new32->base.chunks, new32->values);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_COPY(n3->base.chunks, n3->children,
+ new32->base.chunks, new32->children);
+#endif
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
+ node = newnode;
+ }
+ else
+ {
+ int insertpos = RT_NODE_3_GET_INSERTPOS(&n3->base, chunk);
+ int count = n3->base.n.count;
+
+ /* shift chunks and children */
+ if (insertpos < count)
+ {
+ Assert(count > 0);
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_SHIFT(n3->base.chunks, n3->values,
+ count, insertpos);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_SHIFT(n3->base.chunks, n3->children,
+ count, insertpos);
+#endif
+ }
+
+ n3->base.chunks[insertpos] = chunk;
+#ifdef RT_NODE_LEVEL_LEAF
+ n3->values[insertpos] = value;
+#else
+ n3->children[insertpos] = child;
+#endif
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_32:
+ {
+ const RT_SIZE_CLASS_ELEM class32_max = RT_SIZE_CLASS_INFO[RT_CLASS_32_MAX];
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
+ int idx;
+
+ idx = RT_NODE_32_SEARCH_EQ(&n32->base, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+#ifdef RT_NODE_LEVEL_LEAF
+ n32->values[idx] = value;
+#else
+ n32->children[idx] = child;
+#endif
+ break;
+ }
+
+ if (unlikely(RT_NODE_MUST_GROW(n32)) &&
+ n32->base.n.fanout < class32_max.fanout)
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+ const RT_SIZE_CLASS_ELEM class32_min = RT_SIZE_CLASS_INFO[RT_CLASS_32_MIN];
+ const RT_SIZE_CLASS new_class = RT_CLASS_32_MAX;
+
+ Assert(n32->base.n.fanout == class32_min.fanout);
+
+ /* grow to the next size class of this kind */
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ n32 = (RT_NODE32_TYPE *) newnode;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ memcpy(newnode, node, class32_min.leaf_size);
+#else
+ memcpy(newnode, node, class32_min.inner_size);
+#endif
+ newnode->fanout = class32_max.fanout;
+
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
+ node = newnode;
+ }
+
+ if (unlikely(RT_NODE_MUST_GROW(n32)))
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+ RT_NODE125_TYPE *new125;
+ const uint8 new_kind = RT_NODE_KIND_125;
+ const RT_SIZE_CLASS new_class = RT_CLASS_125;
+
+ Assert(n32->base.n.fanout == class32_max.fanout);
+
+ /* grow node from 32 to 125 */
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
+ newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, is_leaf);
+ new125 = (RT_NODE125_TYPE *) newnode;
+
+ for (int i = 0; i < class32_max.fanout; i++)
+ {
+ new125->base.slot_idxs[n32->base.chunks[i]] = i;
+#ifdef RT_NODE_LEVEL_LEAF
+ new125->values[i] = n32->values[i];
+#else
+ new125->children[i] = n32->children[i];
+#endif
+ }
+
+ /*
+ * Since we just copied a dense array, we can set the bits
+ * using a single store, provided the length of that array
+ * is at most the number of bits in a bitmapword.
+ */
+ Assert(class32_max.fanout <= sizeof(bitmapword) * BITS_PER_BYTE);
+ new125->base.isset[0] = (bitmapword) (((uint64) 1 << class32_max.fanout) - 1);
+
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
+ node = newnode;
+ }
+ else
+ {
+ int insertpos = RT_NODE_32_GET_INSERTPOS(&n32->base, chunk);
+ int count = n32->base.n.count;
+
+ if (insertpos < count)
+ {
+ Assert(count > 0);
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_SHIFT(n32->base.chunks, n32->values,
+ count, insertpos);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_SHIFT(n32->base.chunks, n32->children,
+ count, insertpos);
+#endif
+ }
+
+ n32->base.chunks[insertpos] = chunk;
+#ifdef RT_NODE_LEVEL_LEAF
+ n32->values[insertpos] = value;
+#else
+ n32->children[insertpos] = child;
+#endif
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+ int slotpos = n125->base.slot_idxs[chunk];
+ int cnt = 0;
+
+ if (slotpos != RT_INVALID_SLOT_IDX)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+#ifdef RT_NODE_LEVEL_LEAF
+ n125->values[slotpos] = value;
+#else
+ n125->children[slotpos] = child;
+#endif
+ break;
+ }
+
+ if (unlikely(RT_NODE_MUST_GROW(n125)))
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+ RT_NODE256_TYPE *new256;
+ const uint8 new_kind = RT_NODE_KIND_256;
+ const RT_SIZE_CLASS new_class = RT_CLASS_256;
+
+ /* grow node from 125 to 256 */
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
+ newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, is_leaf);
+ new256 = (RT_NODE256_TYPE *) newnode;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n125->base.n.count; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(&n125->base, i))
+ continue;
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_NODE_LEAF_256_SET(new256, i, RT_NODE_LEAF_125_GET_VALUE(n125, i));
+#else
+ RT_NODE_INNER_256_SET(new256, i, RT_NODE_INNER_125_GET_CHILD(n125, i));
+#endif
+ cnt++;
+ }
+
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
+ node = newnode;
+ }
+ else
+ {
+ int idx;
+ bitmapword inverse;
+
+ /* get the first word with at least one bit not set */
+ for (idx = 0; idx < BM_IDX(RT_SLOT_IDX_LIMIT); idx++)
+ {
+ if (n125->base.isset[idx] < ~((bitmapword) 0))
+ break;
+ }
+
+ /* To get the first unset bit in X, get the first set bit in ~X */
+ inverse = ~(n125->base.isset[idx]);
+ slotpos = idx * BITS_PER_BITMAPWORD;
+ slotpos += bmw_rightmost_one_pos(inverse);
+ Assert(slotpos < node->fanout);
+
+ /* mark the slot used */
+ n125->base.isset[idx] |= bmw_rightmost_one(inverse);
+ n125->base.slot_idxs[chunk] = slotpos;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ n125->values[slotpos] = value;
+#else
+ n125->children[slotpos] = child;
+#endif
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ chunk_exists = RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk);
+#else
+ chunk_exists = RT_NODE_INNER_256_IS_CHUNK_USED(n256, chunk);
+#endif
+ Assert(chunk_exists || node->count < RT_NODE_MAX_SLOTS);
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_NODE_LEAF_256_SET(n256, chunk, value);
+#else
+ RT_NODE_INNER_256_SET(n256, chunk, child);
+#endif
+ break;
+ }
+ }
+
+ /* Update statistics */
+ if (!chunk_exists)
+ node->count++;
+
+ /*
+ * Done. Finally, verify that the chunk and value were inserted or
+ * replaced properly in the node.
+ */
+ RT_VERIFY_NODE(node);
+
+ return chunk_exists;
+
+#undef RT_NODE3_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_iter_impl.h b/src/include/lib/radixtree_iter_impl.h
new file mode 100644
index 0000000000..823d7107c4
--- /dev/null
+++ b/src/include/lib/radixtree_iter_impl.h
@@ -0,0 +1,138 @@
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE3_TYPE RT_NODE_INNER_3
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE3_TYPE RT_NODE_LEAF_3
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
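+ /*
+ * Advance the node iterator to the next used slot in this node. For an
+ * inner node the corresponding child is returned (NULL when no more slots
+ * remain); for a leaf node the value is copied to *value_p and true is
+ * returned (false when no more slots remain). In both cases the iterator's
+ * key is updated with the chunk of the slot found.
+ */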
+ bool found = false;
+ uint8 key_chunk;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_VALUE_TYPE value;
+
+ Assert(RT_NODE_IS_LEAF(node_iter->node));
+#else
+ RT_PTR_LOCAL child = NULL;
+
+ Assert(!RT_NODE_IS_LEAF(node_iter->node));
+#endif
+
+#ifdef RT_SHMEM
+ Assert(iter->tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+
+ switch (node_iter->node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n3->base.n.count)
+ break;
+#ifdef RT_NODE_LEVEL_LEAF
+ value = n3->values[node_iter->current_idx];
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, n3->children[node_iter->current_idx]);
+#endif
+ key_chunk = n3->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n32->base.n.count)
+ break;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ value = n32->values[node_iter->current_idx];
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, n32->children[node_iter->current_idx]);
+#endif
+ key_chunk = n32->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (RT_NODE_125_IS_CHUNK_USED((RT_NODE_BASE_125 *) n125, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+#ifdef RT_NODE_LEVEL_LEAF
+ value = RT_NODE_LEAF_125_GET_VALUE(n125, i);
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, RT_NODE_INNER_125_GET_CHILD(n125, i));
+#endif
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+#ifdef RT_NODE_LEVEL_LEAF
+ if (RT_NODE_LEAF_256_IS_CHUNK_USED(n256, i))
+#else
+ if (RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
+#endif
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+#ifdef RT_NODE_LEVEL_LEAF
+ value = RT_NODE_LEAF_256_GET_VALUE(n256, i);
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, RT_NODE_INNER_256_GET_CHILD(n256, i));
+#endif
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ }
+
+ if (found)
+ {
+ RT_ITER_UPDATE_KEY(iter, key_chunk, node_iter->node->shift);
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = value;
+#endif
+ }
+
+#ifdef RT_NODE_LEVEL_LEAF
+ return found;
+#else
+ return child;
+#endif
+
+#undef RT_NODE3_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_search_impl.h b/src/include/lib/radixtree_search_impl.h
new file mode 100644
index 0000000000..c8410e9a5c
--- /dev/null
+++ b/src/include/lib/radixtree_search_impl.h
@@ -0,0 +1,122 @@
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE3_TYPE RT_NODE_INNER_3
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE3_TYPE RT_NODE_LEAF_3
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
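+ /*
+ * Look up 'chunk' in the node. For an inner node the child pointer is
+ * returned in *child_p, for a leaf node the value in *value_p; false is
+ * returned if the chunk is not present. When RT_ACTION_UPDATE is defined,
+ * the existing child pointer is instead replaced with 'new_child'.
+ */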
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+
+#ifdef RT_NODE_LEVEL_LEAF
+ Assert(value_p != NULL);
+ Assert(RT_NODE_IS_LEAF(node));
+#else
+#ifndef RT_ACTION_UPDATE
+ Assert(child_p != NULL);
+#endif
+ Assert(!RT_NODE_IS_LEAF(node));
+#endif
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node;
+ int idx = RT_NODE_3_SEARCH_EQ((RT_NODE_BASE_3 *) n3, chunk);
+
+#ifdef RT_ACTION_UPDATE
+ Assert(idx >= 0);
+ n3->children[idx] = new_child;
+#else
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = n3->values[idx];
+#else
+ *child_p = n3->children[idx];
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
+ int idx = RT_NODE_32_SEARCH_EQ((RT_NODE_BASE_32 *) n32, chunk);
+
+#ifdef RT_ACTION_UPDATE
+ Assert(idx >= 0);
+ n32->children[idx] = new_child;
+#else
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = n32->values[idx];
+#else
+ *child_p = n32->children[idx];
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+ int slotpos = n125->base.slot_idxs[chunk];
+
+#ifdef RT_ACTION_UPDATE
+ Assert(slotpos != RT_INVALID_SLOT_IDX);
+ n125->children[slotpos] = new_child;
+#else
+ if (slotpos == RT_INVALID_SLOT_IDX)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = RT_NODE_LEAF_125_GET_VALUE(n125, chunk);
+#else
+ *child_p = RT_NODE_INNER_125_GET_CHILD(n125, chunk);
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
+
+#ifdef RT_ACTION_UPDATE
+ RT_NODE_INNER_256_SET(n256, chunk, new_child);
+#else
+#ifdef RT_NODE_LEVEL_LEAF
+ if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk))
+#else
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, chunk))
+#endif
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = RT_NODE_LEAF_256_GET_VALUE(n256, chunk);
+#else
+ *child_p = RT_NODE_INNER_256_GET_CHILD(n256, chunk);
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ }
+
+#ifdef RT_ACTION_UPDATE
+ return;
+#else
+ return true;
+#endif /* RT_ACTION_UPDATE */
+
+#undef RT_NODE3_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/utils/dsa.h b/src/include/utils/dsa.h
index 3ce4ee300a..2af215484f 100644
--- a/src/include/utils/dsa.h
+++ b/src/include/utils/dsa.h
@@ -121,6 +121,7 @@ extern dsa_handle dsa_get_handle(dsa_area *area);
extern dsa_pointer dsa_allocate_extended(dsa_area *area, size_t size, int flags);
extern void dsa_free(dsa_area *area, dsa_pointer dp);
extern void *dsa_get_address(dsa_area *area, dsa_pointer dp);
+extern size_t dsa_get_total_size(dsa_area *area);
extern void dsa_trim(dsa_area *area);
extern void dsa_dump(dsa_area *area);
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index c629cbe383..9659eb85d7 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -28,6 +28,7 @@ SUBDIRS = \
test_pg_db_role_setting \
test_pg_dump \
test_predtest \
+ test_radixtree \
test_rbtree \
test_regex \
test_rls_hooks \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 1baa6b558d..232cbdac80 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -24,6 +24,7 @@ subdir('test_parser')
subdir('test_pg_db_role_setting')
subdir('test_pg_dump')
subdir('test_predtest')
+subdir('test_radixtree')
subdir('test_rbtree')
subdir('test_regex')
subdir('test_rls_hooks')
diff --git a/src/test/modules/test_radixtree/.gitignore b/src/test/modules/test_radixtree/.gitignore
new file mode 100644
index 0000000000..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/test_radixtree/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_radixtree/Makefile b/src/test/modules/test_radixtree/Makefile
new file mode 100644
index 0000000000..da06b93da3
--- /dev/null
+++ b/src/test/modules/test_radixtree/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_radixtree/Makefile
+
+MODULE_big = test_radixtree
+OBJS = \
+ $(WIN32RES) \
+ test_radixtree.o
+PGFILEDESC = "test_radixtree - test code for src/backend/lib/radixtree.c"
+
+EXTENSION = test_radixtree
+DATA = test_radixtree--1.0.sql
+
+REGRESS = test_radixtree
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_radixtree
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_radixtree/README b/src/test/modules/test_radixtree/README
new file mode 100644
index 0000000000..a8b271869a
--- /dev/null
+++ b/src/test/modules/test_radixtree/README
@@ -0,0 +1,7 @@
+test_radixtree contains unit tests for the radix tree implementation in
+src/include/lib/radixtree.h.
+
+The tests verify the correctness of the implementation, but they can also be
+used as a micro-benchmark. If you set the 'rt_test_stats' flag in
+test_radixtree.c, the tests will print extra information about execution time
+and memory usage.
diff --git a/src/test/modules/test_radixtree/expected/test_radixtree.out b/src/test/modules/test_radixtree/expected/test_radixtree.out
new file mode 100644
index 0000000000..ce645cb8b5
--- /dev/null
+++ b/src/test/modules/test_radixtree/expected/test_radixtree.out
@@ -0,0 +1,36 @@
+CREATE EXTENSION test_radixtree;
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
+NOTICE: testing basic operations with leaf node 4
+NOTICE: testing basic operations with inner node 4
+NOTICE: testing basic operations with leaf node 32
+NOTICE: testing basic operations with inner node 32
+NOTICE: testing basic operations with leaf node 125
+NOTICE: testing basic operations with inner node 125
+NOTICE: testing basic operations with leaf node 256
+NOTICE: testing basic operations with inner node 256
+NOTICE: testing radix tree node types with shift "0"
+NOTICE: testing radix tree node types with shift "8"
+NOTICE: testing radix tree node types with shift "16"
+NOTICE: testing radix tree node types with shift "24"
+NOTICE: testing radix tree node types with shift "32"
+NOTICE: testing radix tree node types with shift "40"
+NOTICE: testing radix tree node types with shift "48"
+NOTICE: testing radix tree node types with shift "56"
+NOTICE: testing radix tree with pattern "all ones"
+NOTICE: testing radix tree with pattern "alternating bits"
+NOTICE: testing radix tree with pattern "clusters of ten"
+NOTICE: testing radix tree with pattern "clusters of hundred"
+NOTICE: testing radix tree with pattern "one-every-64k"
+NOTICE: testing radix tree with pattern "sparse"
+NOTICE: testing radix tree with pattern "single values, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^60"
+ test_radixtree
+----------------
+
+(1 row)
+
diff --git a/src/test/modules/test_radixtree/meson.build b/src/test/modules/test_radixtree/meson.build
new file mode 100644
index 0000000000..6add06bbdb
--- /dev/null
+++ b/src/test/modules/test_radixtree/meson.build
@@ -0,0 +1,35 @@
+# FIXME: prevent install during main install, but not during test :/
+
+test_radixtree_sources = files(
+ 'test_radixtree.c',
+)
+
+if host_system == 'windows'
+ test_radixtree_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'test_radixtree',
+ '--FILEDESC', 'test_radixtree - test code for src/include/lib/radixtree.h',])
+endif
+
+test_radixtree = shared_module('test_radixtree',
+ test_radixtree_sources,
+ link_with: pgport_srv,
+ kwargs: pg_mod_args,
+)
+testprep_targets += test_radixtree
+
+install_data(
+ 'test_radixtree.control',
+ 'test_radixtree--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'test_radixtree',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'test_radixtree',
+ ],
+ },
+}
diff --git a/src/test/modules/test_radixtree/sql/test_radixtree.sql b/src/test/modules/test_radixtree/sql/test_radixtree.sql
new file mode 100644
index 0000000000..41ece5e9f5
--- /dev/null
+++ b/src/test/modules/test_radixtree/sql/test_radixtree.sql
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_radixtree;
+
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
diff --git a/src/test/modules/test_radixtree/test_radixtree--1.0.sql b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
new file mode 100644
index 0000000000..074a5a7ea7
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_radixtree/test_radixtree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_radixtree" to load this file. \quit
+
+CREATE FUNCTION test_radixtree()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
new file mode 100644
index 0000000000..2a93e731ae
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -0,0 +1,673 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_radixtree.c
+ * Test radixtree set data structure.
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/test/modules/test_radixtree/test_radixtree.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "miscadmin.h"
+#include "nodes/bitmapset.h"
+#include "storage/block.h"
+#include "storage/itemptr.h"
+#include "storage/lwlock.h"
+#include "utils/memutils.h"
+#include "utils/timestamp.h"
+
+#define UINT64_HEX_FORMAT "%" INT64_MODIFIER "X"
+
+/*
+ * The tests pass with uint32, but build with warnings because the string
+ * format expects uint64.
+ */
+typedef uint64 TestValueType;
+
+/*
+ * If you enable this, the "pattern" tests will print information about
+ * how long populating, probing, and iterating the test set takes, and
+ * how much memory the test set consumed. That can be used as a
+ * micro-benchmark of various operations and input patterns (if you do
+ * that, you might want to increase the number of values used in each of
+ * the tests to reduce noise).
+ *
+ * The information is printed to the server's stderr, mostly because
+ * that's where MemoryContextStats() output goes.
+ */
+static const bool rt_test_stats = false;
+
+static int rt_node_kind_fanouts[] = {
+ 0,
+ 4, /* RT_NODE_KIND_4 */
+ 32, /* RT_NODE_KIND_32 */
+ 125, /* RT_NODE_KIND_125 */
+ 256 /* RT_NODE_KIND_256 */
+};
+/*
+ * A struct to define a pattern of integers, for use with the test_pattern()
+ * function.
+ */
+typedef struct
+{
+ char *test_name; /* short name of the test, for humans */
+ char *pattern_str; /* a bit pattern */
+ uint64 spacing; /* pattern repeats at this interval */
+ uint64 num_values; /* number of integers to set in total */
+} test_spec;
+
+/* Test patterns borrowed from test_integerset.c */
+static const test_spec test_specs[] = {
+ {
+ "all ones", "1111111111",
+ 10, 1000000
+ },
+ {
+ "alternating bits", "0101010101",
+ 10, 1000000
+ },
+ {
+ "clusters of ten", "1111111111",
+ 10000, 1000000
+ },
+ {
+ "clusters of hundred",
+ "1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111",
+ 10000, 1000000
+ },
+ {
+ "one-every-64k", "1",
+ 65536, 1000000
+ },
+ {
+ "sparse", "100000000000000000000000000000001",
+ 10000000, 1000000
+ },
+ {
+ "single values, distance > 2^32", "1",
+ UINT64CONST(10000000000), 100000
+ },
+ {
+ "clusters, distance > 2^32", "10101010",
+ UINT64CONST(10000000000), 1000000
+ },
+ {
+ "clusters, distance > 2^60", "10101010",
+ UINT64CONST(2000000000000000000),
+ 23 /* can't be much higher than this, or we
+ * overflow uint64 */
+ }
+};
+
+/* define the radix tree implementation to test */
+#define RT_PREFIX rt
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#define RT_USE_DELETE
+#define RT_VALUE_TYPE TestValueType
+// WIP: compiles with warnings because rt_attach is defined but not used
+// #define RT_SHMEM
+#include "lib/radixtree.h"
+
+
+/*
+ * Return the number of keys in the radix tree.
+ */
+static uint64
+rt_num_entries(rt_radix_tree *tree)
+{
+ return tree->ctl->num_keys;
+}
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(test_radixtree);
+
+static void
+test_empty(void)
+{
+ rt_radix_tree *radixtree;
+ rt_iter *iter;
+ TestValueType dummy;
+ uint64 key;
+ TestValueType val;
+
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+ dsa = dsa_create(tranche_id);
+
+ radixtree = rt_create(CurrentMemoryContext, dsa);
+#else
+ radixtree = rt_create(CurrentMemoryContext);
+#endif
+
+ if (rt_search(radixtree, 0, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, 1, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, PG_UINT64_MAX, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_delete(radixtree, 0))
+ elog(ERROR, "rt_delete on empty tree returned true");
+
+ if (rt_num_entries(radixtree) != 0)
+ elog(ERROR, "rt_num_entries on empty tree return non-zero");
+
+ iter = rt_begin_iterate(radixtree);
+
+ if (rt_iterate_next(iter, &key, &val))
+ elog(ERROR, "rt_itereate_next on empty tree returned true");
+
+ rt_end_iterate(iter);
+
+ rt_free(radixtree);
+
+#ifdef RT_SHMEM
+ dsa_detach(dsa);
+#endif
+}
+
+static void
+test_basic(int children, bool test_inner)
+{
+ rt_radix_tree *radixtree;
+ uint64 *keys;
+ int shift = test_inner ? 8 : 0;
+
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+ dsa = dsa_create(tranche_id);
+#endif
+
+ elog(NOTICE, "testing basic operations with %s node %d",
+ test_inner ? "inner" : "leaf", children);
+
+#ifdef RT_SHMEM
+ radixtree = rt_create(CurrentMemoryContext, dsa);
+#else
+ radixtree = rt_create(CurrentMemoryContext);
+#endif
+
+ /* prepare keys in an interleaved order: 1, children, 2, children - 1, 3, ... */
+ keys = palloc(sizeof(uint64) * children);
+ for (int i = 0; i < children; i++)
+ {
+ if (i % 2 == 0)
+ keys[i] = (uint64) ((i / 2) + 1) << shift;
+ else
+ keys[i] = (uint64) (children - (i / 2)) << shift;
+ }
+
+ /* insert keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (rt_set(radixtree, keys[i], (TestValueType) keys[i]))
+ elog(ERROR, "new inserted key 0x" UINT64_HEX_FORMAT " is found ", keys[i]);
+ }
+
+ /* look up keys */
+ for (int i = 0; i < children; i++)
+ {
+ TestValueType value;
+
+ if (!rt_search(radixtree, keys[i], &value))
+ elog(ERROR, "could not find key 0x" UINT64_HEX_FORMAT, keys[i]);
+ if (value != (TestValueType) keys[i])
+ elog(ERROR, "rt_search returned 0x" UINT64_HEX_FORMAT ", expected " UINT64_HEX_FORMAT,
+ value, (TestValueType) keys[i]);
+ }
+
+ /* update keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (!rt_set(radixtree, keys[i], (TestValueType) (keys[i] + 1)))
+ elog(ERROR, "could not update key 0x" UINT64_HEX_FORMAT, keys[i]);
+ }
+
+ /* repeat deleting and inserting keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (!rt_delete(radixtree, keys[i]))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, keys[i]);
+ if (rt_set(radixtree, keys[i], (TestValueType) keys[i]))
+ elog(ERROR, "new inserted key 0x" UINT64_HEX_FORMAT " is found ", keys[i]);
+ }
+
+ pfree(keys);
+ rt_free(radixtree);
+#ifdef RT_SHMEM
+ dsa_detach(dsa);
+#endif
+}
+
+/*
+ * Check if keys from start to end with the shift exist in the tree.
+ */
+static void
+check_search_on_node(rt_radix_tree *radixtree, uint8 shift, int start, int end,
+ int incr)
+{
+ for (int i = start; i < end; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ TestValueType val;
+
+ if (!rt_search(radixtree, key, &val))
+ elog(ERROR, "key 0x" UINT64_HEX_FORMAT " is not found on node-%d",
+ key, end);
+ if (val != (TestValueType) key)
+ elog(ERROR, "rt_search with key 0x" UINT64_HEX_FORMAT " returns 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ key, val, key);
+ }
+}
+
+static void
+test_node_types_insert(rt_radix_tree *radixtree, uint8 shift, bool insert_asc)
+{
+ uint64 num_entries;
+ int ninserted = 0;
+ int start = insert_asc ? 0 : 256;
+ int incr = insert_asc ? 1 : -1;
+ int end = insert_asc ? 256 : 0;
+ int node_kind_idx = 1;
+
+ for (int i = start; i != end; i += incr)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_set(radixtree, key, (TestValueType) key);
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " is found", key);
+
+ /*
+ * After filling all slots in each node type, check if the values
+ * are stored properly.
+ */
+ if (ninserted == rt_node_kind_fanouts[node_kind_idx] - 1)
+ {
+ int check_start = insert_asc
+ ? rt_node_kind_fanouts[node_kind_idx - 1]
+ : rt_node_kind_fanouts[node_kind_idx];
+ int check_end = insert_asc
+ ? rt_node_kind_fanouts[node_kind_idx]
+ : rt_node_kind_fanouts[node_kind_idx - 1];
+
+ check_search_on_node(radixtree, shift, check_start, check_end, incr);
+ node_kind_idx++;
+ }
+
+ ninserted++;
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ if (num_entries != 256)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(256));
+}
+
+static void
+test_node_types_delete(rt_radix_tree *radixtree, uint8 shift)
+{
+ uint64 num_entries;
+
+ for (int i = 0; i < 256; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_delete(radixtree, key);
+
+ if (!found)
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, key);
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ /* The tree must be empty */
+ if (num_entries != 0)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(0));
+}
+
+/*
+ * Test for inserting and deleting key-value pairs to each node type at the given shift
+ * level.
+ */
+static void
+test_node_types(uint8 shift)
+{
+ rt_radix_tree *radixtree;
+
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+ dsa = dsa_create(tranche_id);
+#endif
+
+ elog(NOTICE, "testing radix tree node types with shift \"%d\"", shift);
+
+#ifdef RT_SHMEM
+ radixtree = rt_create(CurrentMemoryContext, dsa);
+#else
+ radixtree = rt_create(CurrentMemoryContext);
+#endif
+
+ /*
+ * Insert and search entries for every node type at the 'shift' level,
+ * then delete all entries to make it empty, and insert and search entries
+ * again.
+ */
+ test_node_types_insert(radixtree, shift, true);
+ test_node_types_delete(radixtree, shift);
+ test_node_types_insert(radixtree, shift, false);
+
+ rt_free(radixtree);
+#ifdef RT_SHMEM
+ dsa_detach(dsa);
+#endif
+}
+
+/*
+ * Test with a repeating pattern, defined by the 'spec'.
+ */
+static void
+test_pattern(const test_spec * spec)
+{
+ rt_radix_tree *radixtree;
+ rt_iter *iter;
+ MemoryContext radixtree_ctx;
+ TimestampTz starttime;
+ TimestampTz endtime;
+ uint64 n;
+ uint64 last_int;
+ uint64 ndeleted;
+ uint64 nbefore;
+ uint64 nafter;
+ int patternlen;
+ uint64 *pattern_values;
+ uint64 pattern_num_values;
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+ dsa = dsa_create(tranche_id);
+#endif
+
+ elog(NOTICE, "testing radix tree with pattern \"%s\"", spec->test_name);
+ if (rt_test_stats)
+ fprintf(stderr, "-----\ntesting radix tree with pattern \"%s\"\n", spec->test_name);
+
+ /* Pre-process the pattern, creating an array of integers from it. */
+ patternlen = strlen(spec->pattern_str);
+ pattern_values = palloc(patternlen * sizeof(uint64));
+ pattern_num_values = 0;
+ for (int i = 0; i < patternlen; i++)
+ {
+ if (spec->pattern_str[i] == '1')
+ pattern_values[pattern_num_values++] = i;
+ }
+
+ /*
+ * Allocate the radix tree.
+ *
+ * Allocate it in a separate memory context, so that we can print its
+ * memory usage easily.
+ */
+ radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
+ "radixtree test",
+ ALLOCSET_SMALL_SIZES);
+ MemoryContextSetIdentifier(radixtree_ctx, spec->test_name);
+
+#ifdef RT_SHMEM
+ radixtree = rt_create(radixtree_ctx, dsa);
+#else
+ radixtree = rt_create(radixtree_ctx);
+#endif
+
+
+ /*
+ * Add values to the set.
+ */
+ starttime = GetCurrentTimestamp();
+
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ uint64 x = 0;
+
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ bool found;
+
+ x = last_int + pattern_values[i];
+
+ found = rt_set(radixtree, x, (TestValueType) x);
+
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " found", x);
+
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+
+ endtime = GetCurrentTimestamp();
+
+ if (rt_test_stats)
+ fprintf(stderr, "added " UINT64_FORMAT " values in %d ms\n",
+ spec->num_values, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Print stats on the amount of memory used.
+ *
+ * We print the usage reported by rt_memory_usage(), as well as the stats
+ * from the memory context. They should be in the same ballpark, but it's
+ * hard to automate testing that, so if you're making changes to the
+ * implementation, just observe that manually.
+ */
+ if (rt_test_stats)
+ {
+ uint64 mem_usage;
+
+ /*
+ * Also print memory usage as reported by rt_memory_usage(). It
+ * should be in the same ballpark as the usage reported by
+ * MemoryContextStats().
+ */
+ mem_usage = rt_memory_usage(radixtree);
+ fprintf(stderr, "rt_memory_usage() reported " UINT64_FORMAT " (%0.2f bytes / integer)\n",
+ mem_usage, (double) mem_usage / spec->num_values);
+
+ MemoryContextStats(radixtree_ctx);
+ }
+
+ /* Check that rt_num_entries works */
+ n = rt_num_entries(radixtree);
+ if (n != spec->num_values)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT, n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_search()
+ */
+ starttime = GetCurrentTimestamp();
+
+ for (n = 0; n < 100000; n++)
+ {
+ bool found;
+ bool expected;
+ uint64 x;
+ TestValueType v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Do we expect this value to be present in the set? */
+ if (x >= last_int)
+ expected = false;
+ else
+ {
+ uint64 idx = x % spec->spacing;
+
+ if (idx >= patternlen)
+ expected = false;
+ else if (spec->pattern_str[idx] == '1')
+ expected = true;
+ else
+ expected = false;
+ }
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (found != expected)
+ elog(ERROR, "mismatch at 0x" UINT64_HEX_FORMAT ": %d vs %d", x, found, expected);
+ if (found && (v != (TestValueType) x))
+ elog(ERROR, "found 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ v, x);
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "probed " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Test iterator
+ */
+ starttime = GetCurrentTimestamp();
+
+ iter = rt_begin_iterate(radixtree);
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ uint64 expected = last_int + pattern_values[i];
+ uint64 x;
+ TestValueType val;
+
+ if (!rt_iterate_next(iter, &x, &val))
+ break;
+
+ if (x != expected)
+ elog(ERROR,
+ "iterate returned wrong key; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d",
+ x, expected, i);
+ if (val != (TestValueType) expected)
+ elog(ERROR,
+ "iterate returned wrong value; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d", x, expected, i);
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "iterated " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ rt_end_iterate(iter);
+
+ if (n < spec->num_values)
+ elog(ERROR, "iterator stopped short after " UINT64_FORMAT " entries, expected " UINT64_FORMAT, n, spec->num_values);
+ if (n > spec->num_values)
+ elog(ERROR, "iterator returned " UINT64_FORMAT " entries, " UINT64_FORMAT " was expected", n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_delete()
+ */
+ starttime = GetCurrentTimestamp();
+
+ nbefore = rt_num_entries(radixtree);
+ ndeleted = 0;
+ for (n = 0; n < 1; n++)
+ {
+ bool found;
+ uint64 x;
+ TestValueType v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (!found)
+ continue;
+
+ /* If the key is found, delete it and check again */
+ if (!rt_delete(radixtree, x))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_search(radixtree, x, &v))
+ elog(ERROR, "found deleted key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_delete(radixtree, x))
+ elog(ERROR, "deleted already-deleted key 0x" UINT64_HEX_FORMAT, x);
+
+ ndeleted++;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "deleted " UINT64_FORMAT " values in %d ms\n",
+ ndeleted, (int) (endtime - starttime) / 1000);
+
+ nafter = rt_num_entries(radixtree);
+
+ /* Check that rt_num_entries works */
+ if ((nbefore - ndeleted) != nafter)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT "after " UINT64_FORMAT " deletion",
+ nafter, (nbefore - ndeleted), ndeleted);
+
+ rt_free(radixtree);
+ MemoryContextDelete(radixtree_ctx);
+#ifdef RT_SHMEM
+ dsa_detach(dsa);
+#endif
+}
+
+Datum
+test_radixtree(PG_FUNCTION_ARGS)
+{
+ test_empty();
+
+ for (int i = 1; i < lengthof(rt_node_kind_fanouts); i++)
+ {
+ test_basic(rt_node_kind_fanouts[i], false);
+ test_basic(rt_node_kind_fanouts[i], true);
+ }
+
+ for (int shift = 0; shift <= (64 - 8); shift += 8)
+ test_node_types(shift);
+
+ /* Test different test patterns, with lots of entries */
+ for (int i = 0; i < lengthof(test_specs); i++)
+ test_pattern(&test_specs[i]);
+
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_radixtree/test_radixtree.control b/src/test/modules/test_radixtree/test_radixtree.control
new file mode 100644
index 0000000000..e53f2a3e0c
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.control
@@ -0,0 +1,4 @@
+comment = 'Test code for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/test_radixtree'
+relocatable = true
diff --git a/src/tools/pginclude/cpluspluscheck b/src/tools/pginclude/cpluspluscheck
index e52fe9f509..09fa6e7432 100755
--- a/src/tools/pginclude/cpluspluscheck
+++ b/src/tools/pginclude/cpluspluscheck
@@ -101,6 +101,12 @@ do
test "$f" = src/include/nodes/nodetags.h && continue
test "$f" = src/backend/nodes/nodetags.h && continue
+ # radixtree_*_impl.h cannot be included standalone: they are just code fragments.
+ test "$f" = src/include/lib/radixtree_delete_impl.h && continue
+ test "$f" = src/include/lib/radixtree_insert_impl.h && continue
+ test "$f" = src/include/lib/radixtree_iter_impl.h && continue
+ test "$f" = src/include/lib/radixtree_search_impl.h && continue
+
# These files are not meant to be included standalone, because
# they contain lists that might have multiple use-cases.
test "$f" = src/include/access/rmgrlist.h && continue
diff --git a/src/tools/pginclude/headerscheck b/src/tools/pginclude/headerscheck
index abbba7aa63..d4d2f1da03 100755
--- a/src/tools/pginclude/headerscheck
+++ b/src/tools/pginclude/headerscheck
@@ -96,6 +96,12 @@ do
test "$f" = src/include/nodes/nodetags.h && continue
test "$f" = src/backend/nodes/nodetags.h && continue
+ # radixtree_*_impl.h cannot be included standalone: they are just code fragments.
+ test "$f" = src/include/lib/radixtree_delete_impl.h && continue
+ test "$f" = src/include/lib/radixtree_insert_impl.h && continue
+ test "$f" = src/include/lib/radixtree_iter_impl.h && continue
+ test "$f" = src/include/lib/radixtree_search_impl.h && continue
+
# These files are not meant to be included standalone, because
# they contain lists that might have multiple use-cases.
test "$f" = src/include/access/rmgrlist.h && continue
--
2.31.1
Attachment: v24-0004-Tool-for-measuring-radix-tree-performance.patch
From aa1bb230f2760dbc9185b3237bbd4aba735b20c0 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 16 Sep 2022 11:57:03 +0900
Subject: [PATCH v24 4/9] Tool for measuring radix tree performance
Includes Meson support, but commented out to avoid warnings
XXX: Not for commit
---
contrib/bench_radix_tree/Makefile | 21 +
.../bench_radix_tree--1.0.sql | 76 ++
contrib/bench_radix_tree/bench_radix_tree.c | 656 ++++++++++++++++++
.../bench_radix_tree/bench_radix_tree.control | 6 +
contrib/bench_radix_tree/expected/bench.out | 13 +
contrib/bench_radix_tree/meson.build | 33 +
contrib/bench_radix_tree/sql/bench.sql | 16 +
contrib/meson.build | 1 +
8 files changed, 822 insertions(+)
create mode 100644 contrib/bench_radix_tree/Makefile
create mode 100644 contrib/bench_radix_tree/bench_radix_tree--1.0.sql
create mode 100644 contrib/bench_radix_tree/bench_radix_tree.c
create mode 100644 contrib/bench_radix_tree/bench_radix_tree.control
create mode 100644 contrib/bench_radix_tree/expected/bench.out
create mode 100644 contrib/bench_radix_tree/meson.build
create mode 100644 contrib/bench_radix_tree/sql/bench.sql
diff --git a/contrib/bench_radix_tree/Makefile b/contrib/bench_radix_tree/Makefile
new file mode 100644
index 0000000000..952bb0ceae
--- /dev/null
+++ b/contrib/bench_radix_tree/Makefile
@@ -0,0 +1,21 @@
+# contrib/bench_radix_tree/Makefile
+
+MODULE_big = bench_radix_tree
+OBJS = \
+ bench_radix_tree.o
+
+EXTENSION = bench_radix_tree
+DATA = bench_radix_tree--1.0.sql
+
+REGRESS = bench_fixed_height
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/bench_radix_tree
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
new file mode 100644
index 0000000000..2fd689aa91
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
@@ -0,0 +1,76 @@
+/* contrib/bench_radix_tree/bench_radix_tree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION bench_radix_tree" to load this file. \quit
+
+create function bench_shuffle_search(
+minblk int4,
+maxblk int4,
+random_block bool DEFAULT false,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT array_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT array_load_ms int8,
+OUT rt_search_ms int8,
+OUT array_serach_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_seq_search(
+minblk int4,
+maxblk int4,
+random_block bool DEFAULT false,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT array_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT array_load_ms int8,
+OUT rt_search_ms int8,
+OUT array_serach_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_load_random_int(
+cnt int8,
+OUT mem_allocated int8,
+OUT load_ms int8)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_search_random_nodes(
+cnt int8,
+filter_str text DEFAULT NULL,
+OUT mem_allocated int8,
+OUT search_ms int8)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C VOLATILE PARALLEL UNSAFE;
+
+create function bench_fixed_height_search(
+fanout int4,
+OUT fanout int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT rt_search_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_node128_load(
+fanout int4,
+OUT fanout int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT rt_sparseload_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c
new file mode 100644
index 0000000000..4c785c7336
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -0,0 +1,656 @@
+/*-------------------------------------------------------------------------
+ *
+ * bench_radix_tree.c
+ *
+ * Copyright (c) 2016-2022, PostgreSQL Global Development Group
+ *
+ * contrib/bench_radix_tree/bench_radix_tree.c
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "funcapi.h"
+#include "lib/radixtree.h"
+#include <math.h>
+#include "miscadmin.h"
+#include "storage/lwlock.h"
+#include "utils/builtins.h"
+#include "utils/timestamp.h"
+
+PG_MODULE_MAGIC;
+
+/* run benchmark also for binary-search case? */
+/* #define MEASURE_BINARY_SEARCH 1 */
+
+#define TIDS_PER_BLOCK_FOR_LOAD 30
+#define TIDS_PER_BLOCK_FOR_LOOKUP 50
+
+#define RT_DEBUG
+#define RT_PREFIX rt
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#define RT_USE_DELETE
+#define RT_VALUE_TYPE uint64
+// WIP: compiles with warnings because rt_attach is defined but not used
+// #define RT_SHMEM
+#include "lib/radixtree.h"
+
+/*
+ * Return the number of keys in the radix tree.
+ */
+static uint64
+rt_num_entries(rt_radix_tree *tree)
+{
+ return tree->ctl->num_keys;
+}
+
+
+PG_FUNCTION_INFO_V1(bench_seq_search);
+PG_FUNCTION_INFO_V1(bench_shuffle_search);
+PG_FUNCTION_INFO_V1(bench_load_random_int);
+PG_FUNCTION_INFO_V1(bench_fixed_height_search);
+PG_FUNCTION_INFO_V1(bench_search_random_nodes);
+PG_FUNCTION_INFO_V1(bench_node128_load);
+
+static uint64
+tid_to_key_off(ItemPointer tid, uint32 *off)
+{
+ uint64 upper;
+ uint32 shift = pg_ceil_log2_32(MaxHeapTuplesPerPage);
+ int64 tid_i;
+
+ Assert(ItemPointerGetOffsetNumber(tid) < MaxHeapTuplesPerPage);
+
+ tid_i = ItemPointerGetOffsetNumber(tid);
+ tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << shift;
+
+ /* log(sizeof(uint64) * BITS_PER_BYTE, 2) = log(64, 2) = 6 */
+ *off = tid_i & ((1 << 6) - 1);
+ upper = tid_i >> 6;
+ Assert(*off < (sizeof(uint64) * BITS_PER_BYTE));
+
+ Assert(*off < 64);
+
+ return upper;
+}
+
+static int
+shuffle_randrange(pg_prng_state *state, int lower, int upper)
+{
+ return (int) floor(pg_prng_double(state) * ((upper - lower) + 0.999999)) + lower;
+}
+
+/* Naive Fisher-Yates implementation */
+static void
+shuffle_itemptrs(ItemPointer itemptr, uint64 nitems)
+{
+ /* reproducibility */
+ pg_prng_state state;
+
+ pg_prng_seed(&state, 0);
+
+ for (int i = 0; i < nitems - 1; i++)
+ {
+ int j = shuffle_randrange(&state, i, nitems - 1);
+ ItemPointerData t = itemptr[j];
+
+ itemptr[j] = itemptr[i];
+ itemptr[i] = t;
+ }
+}
+
+static ItemPointer
+generate_tids(BlockNumber minblk, BlockNumber maxblk, int ntids_per_blk, uint64 *ntids_p,
+ bool random_block)
+{
+ ItemPointer tids;
+ uint64 maxitems;
+ uint64 ntids = 0;
+ pg_prng_state state;
+
+ maxitems = (maxblk - minblk + 1) * ntids_per_blk;
+ tids = MemoryContextAllocHuge(TopTransactionContext,
+ sizeof(ItemPointerData) * maxitems);
+
+ if (random_block)
+ pg_prng_seed(&state, 0x9E3779B185EBCA87);
+
+ for (BlockNumber blk = minblk; blk < maxblk; blk++)
+ {
+ if (random_block && !pg_prng_bool(&state))
+ continue;
+
+ for (OffsetNumber off = FirstOffsetNumber;
+ off <= ntids_per_blk; off++)
+ {
+ CHECK_FOR_INTERRUPTS();
+
+ ItemPointerSetBlockNumber(&(tids[ntids]), blk);
+ ItemPointerSetOffsetNumber(&(tids[ntids]), off);
+
+ ntids++;
+ }
+ }
+
+ *ntids_p = ntids;
+ return tids;
+}
+
+#ifdef MEASURE_BINARY_SEARCH
+static int
+vac_cmp_itemptr(const void *left, const void *right)
+{
+ BlockNumber lblk,
+ rblk;
+ OffsetNumber loff,
+ roff;
+
+ lblk = ItemPointerGetBlockNumber((ItemPointer) left);
+ rblk = ItemPointerGetBlockNumber((ItemPointer) right);
+
+ if (lblk < rblk)
+ return -1;
+ if (lblk > rblk)
+ return 1;
+
+ loff = ItemPointerGetOffsetNumber((ItemPointer) left);
+ roff = ItemPointerGetOffsetNumber((ItemPointer) right);
+
+ if (loff < roff)
+ return -1;
+ if (loff > roff)
+ return 1;
+
+ return 0;
+}
+#endif
+
+static Datum
+bench_search(FunctionCallInfo fcinfo, bool shuffle)
+{
+ BlockNumber minblk = PG_GETARG_INT32(0);
+ BlockNumber maxblk = PG_GETARG_INT32(1);
+ bool random_block = PG_GETARG_BOOL(2);
+ rt_radix_tree *rt = NULL;
+ uint64 ntids;
+ uint64 key;
+ uint64 last_key = PG_UINT64_MAX;
+ uint64 val = 0;
+ ItemPointer tids;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_load_ms,
+ rt_search_ms;
+ Datum values[7];
+ bool nulls[7];
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ tids = generate_tids(minblk, maxblk, TIDS_PER_BLOCK_FOR_LOAD, &ntids, random_block);
+
+ /* measure the load time of the radix tree */
+ rt = rt_create(CurrentMemoryContext);
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ uint32 off;
+
+ CHECK_FOR_INTERRUPTS();
+
+ key = tid_to_key_off(tid, &off);
+
+ if (last_key != PG_UINT64_MAX && last_key != key)
+ {
+ rt_set(rt, last_key, val);
+ val = 0;
+ }
+
+ last_key = key;
+ val |= (uint64) 1 << off;
+ }
+ if (last_key != PG_UINT64_MAX)
+ rt_set(rt, last_key, val);
+
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_load_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ if (shuffle)
+ shuffle_itemptrs(tids, ntids);
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+ /* measure the search time of the radix tree */
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ uint32 off;
+ volatile bool ret; /* prevent calling rt_search from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ key = tid_to_key_off(tid, &off);
+
+ ret = rt_search(rt, key, &val);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_search_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int64GetDatum(rt_num_entries(rt));
+ values[1] = Int64GetDatum(rt_memory_usage(rt));
+ values[2] = Int64GetDatum(sizeof(ItemPointerData) * ntids);
+ values[3] = Int64GetDatum(rt_load_ms);
+ nulls[4] = true; /* ar_load_ms */
+ values[5] = Int64GetDatum(rt_search_ms);
+ nulls[6] = true; /* ar_search_ms */
+
+#ifdef MEASURE_BINARY_SEARCH
+ {
+ ItemPointer itemptrs = NULL;
+
+ int64 ar_load_ms,
+ ar_search_ms;
+
+ /* measure the load time of the array */
+ itemptrs = MemoryContextAllocHuge(CurrentMemoryContext,
+ sizeof(ItemPointerData) * ntids);
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointerSetBlockNumber(&(itemptrs[i]),
+ ItemPointerGetBlockNumber(&(tids[i])));
+ ItemPointerSetOffsetNumber(&(itemptrs[i]),
+ ItemPointerGetOffsetNumber(&(tids[i])));
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ ar_load_ms = secs * 1000 + usecs / 1000;
+
+ /* next, measure the search time of the array */
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ volatile bool ret; /* prevent calling bsearch from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ ret = bsearch((void *) tid,
+ (void *) itemptrs,
+ ntids,
+ sizeof(ItemPointerData),
+ vac_cmp_itemptr);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ ar_search_ms = secs * 1000 + usecs / 1000;
+
+ /* set the result */
+ nulls[4] = false;
+ values[4] = Int64GetDatum(ar_load_ms);
+ nulls[6] = false;
+ values[6] = Int64GetDatum(ar_search_ms);
+ }
+#endif
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_seq_search(PG_FUNCTION_ARGS)
+{
+ return bench_search(fcinfo, false);
+}
+
+Datum
+bench_shuffle_search(PG_FUNCTION_ARGS)
+{
+ return bench_search(fcinfo, true);
+}
+
+Datum
+bench_load_random_int(PG_FUNCTION_ARGS)
+{
+ uint64 cnt = (uint64) PG_GETARG_INT64(0);
+ rt_radix_tree *rt;
+ pg_prng_state state;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 load_time_ms;
+ Datum values[2];
+ bool nulls[2];
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ pg_prng_seed(&state, 0);
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ uint64 key = pg_prng_uint64(&state);
+
+ rt_set(rt, key, key);
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ load_time_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int64GetDatum(rt_memory_usage(rt));
+ values[1] = Int64GetDatum(load_time_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+/* copy of splitmix64() */
+static uint64
+hash64(uint64 x)
+{
+ x ^= x >> 30;
+ x *= UINT64CONST(0xbf58476d1ce4e5b9);
+ x ^= x >> 27;
+ x *= UINT64CONST(0x94d049bb133111eb);
+ x ^= x >> 31;
+ return x;
+}
+
+/* attempts to have a relatively even population of node kinds */
+Datum
+bench_search_random_nodes(PG_FUNCTION_ARGS)
+{
+ uint64 cnt = (uint64) PG_GETARG_INT64(0);
+ rt_radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 search_time_ms;
+ Datum values[2] = {0};
+ bool nulls[2] = {0};
+ /* from trial and error */
+ uint64 filter = (((uint64) 0x7F<<32) | (0x07<<24) | (0xFF<<16) | 0xFF);
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ if (!PG_ARGISNULL(1))
+ {
+ char *filter_str = text_to_cstring(PG_GETARG_TEXT_P(1));
+
+ if (sscanf(filter_str, "0x%lX", &filter) == 0)
+ elog(ERROR, "invalid filter string %s", filter_str);
+ }
+ elog(NOTICE, "bench with filter 0x%lX", filter);
+
+ rt = rt_create(CurrentMemoryContext);
+
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ const uint64 hash = hash64(i);
+ const uint64 key = hash & filter;
+
+ rt_set(rt, key, key);
+ }
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ const uint64 hash = hash64(i);
+ const uint64 key = hash & filter;
+ uint64 val;
+ volatile bool ret; /* prevent calling rt_search from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ ret = rt_search(rt, key, &val);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ search_time_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ values[0] = Int64GetDatum(rt_memory_usage(rt));
+ values[1] = Int64GetDatum(search_time_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_fixed_height_search(PG_FUNCTION_ARGS)
+{
+ int fanout = PG_GETARG_INT32(0);
+ rt_radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_load_ms,
+ rt_search_ms;
+ Datum values[5];
+ bool nulls[5];
+
+ /* test boundary between vector and iteration */
+ const int n_keys = 5 * 16 * 16 * 16 * 16;
+ uint64 r,
+ h,
+ i,
+ j,
+ k;
+ int key_id;
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+
+ key_id = 0;
+
+ /*
+ * lower nodes have limited fanout, the top is only limited by
+ * bits-per-byte
+ */
+ for (r = 1;; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ for (i = 1; i <= fanout; i++)
+ {
+ for (j = 1; j <= fanout; j++)
+ {
+ for (k = 1; k <= fanout; k++)
+ {
+ uint64 key;
+
+ key = (r << 32) | (h << 24) | (i << 16) | (j << 8) | (k);
+
+ CHECK_FOR_INTERRUPTS();
+
+ key_id++;
+ if (key_id > n_keys)
+ goto finish_set;
+
+ rt_set(rt, key, key_id);
+ }
+ }
+ }
+ }
+ }
+finish_set:
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_load_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ /* measure the search time of the radix tree */
+ start_time = GetCurrentTimestamp();
+
+ key_id = 0;
+ for (r = 1;; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ for (i = 1; i <= fanout; i++)
+ {
+ for (j = 1; j <= fanout; j++)
+ {
+ for (k = 1; k <= fanout; k++)
+ {
+ uint64 key,
+ val;
+
+ key = (r << 32) | (h << 24) | (i << 16) | (j << 8) | (k);
+
+ CHECK_FOR_INTERRUPTS();
+
+ key_id++;
+ if (key_id > n_keys)
+ goto finish_search;
+
+ rt_search(rt, key, &val);
+ }
+ }
+ }
+ }
+ }
+finish_search:
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_search_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int32GetDatum(fanout);
+ values[1] = Int64GetDatum(rt_num_entries(rt));
+ values[2] = Int64GetDatum(rt_memory_usage(rt));
+ values[3] = Int64GetDatum(rt_load_ms);
+ values[4] = Int64GetDatum(rt_search_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_node128_load(PG_FUNCTION_ARGS)
+{
+ int fanout = PG_GETARG_INT32(0);
+ rt_radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_sparseload_ms;
+ Datum values[5];
+ bool nulls[5];
+
+ uint64 r,
+ h;
+ int key_id;
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ rt = rt_create(CurrentMemoryContext);
+
+ key_id = 0;
+
+ for (r = 1; r <= fanout; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (h);
+
+ key_id++;
+ rt_set(rt, key, key_id);
+ }
+ }
+
+ rt_stats(rt);
+
+ /* measure sparse deletion and re-loading */
+ start_time = GetCurrentTimestamp();
+
+ for (int t = 0; t<10000; t++)
+ {
+ /* delete one key in each leaf */
+ for (r = 1; r <= fanout; r++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (fanout);
+
+ rt_delete(rt, key);
+ }
+
+ /* add them all back */
+ key_id = 0;
+ for (r = 1; r <= fanout; r++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (fanout);
+
+ key_id++;
+ rt_set(rt, key, key_id);
+ }
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_sparseload_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int32GetDatum(fanout);
+ values[1] = Int64GetDatum(rt_num_entries(rt));
+ values[2] = Int64GetDatum(rt_memory_usage(rt));
+ values[3] = Int64GetDatum(rt_sparseload_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
diff --git a/contrib/bench_radix_tree/bench_radix_tree.control b/contrib/bench_radix_tree/bench_radix_tree.control
new file mode 100644
index 0000000000..1d988e6c9a
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.control
@@ -0,0 +1,6 @@
+# bench_radix_tree extension
+comment = 'benchmark suite for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/bench_radix_tree'
+relocatable = true
+trusted = true
diff --git a/contrib/bench_radix_tree/expected/bench.out b/contrib/bench_radix_tree/expected/bench.out
new file mode 100644
index 0000000000..60c303892e
--- /dev/null
+++ b/contrib/bench_radix_tree/expected/bench.out
@@ -0,0 +1,13 @@
+create extension bench_radix_tree;
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
diff --git a/contrib/bench_radix_tree/meson.build b/contrib/bench_radix_tree/meson.build
new file mode 100644
index 0000000000..332c1ae7df
--- /dev/null
+++ b/contrib/bench_radix_tree/meson.build
@@ -0,0 +1,33 @@
+bench_radix_tree_sources = files(
+ 'bench_radix_tree.c',
+)
+
+if host_system == 'windows'
+ bench_radix_tree_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'bench_radix_tree',
+ '--FILEDESC', 'bench_radix_tree - performance test code for radix tree',])
+endif
+
+bench_radix_tree = shared_module('bench_radix_tree',
+ bench_radix_tree_sources,
+ link_with: pgport_srv,
+ kwargs: pg_mod_args,
+)
+testprep_targets += bench_radix_tree
+
+install_data(
+ 'bench_radix_tree.control',
+ 'bench_radix_tree--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'bench_radix_tree',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'bench_radix_tree',
+ ],
+ },
+}
diff --git a/contrib/bench_radix_tree/sql/bench.sql b/contrib/bench_radix_tree/sql/bench.sql
new file mode 100644
index 0000000000..a46018c9d4
--- /dev/null
+++ b/contrib/bench_radix_tree/sql/bench.sql
@@ -0,0 +1,16 @@
+create extension bench_radix_tree;
+
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
diff --git a/contrib/meson.build b/contrib/meson.build
index bd4a57c43c..52253de793 100644
--- a/contrib/meson.build
+++ b/contrib/meson.build
@@ -12,6 +12,7 @@ subdir('amcheck')
subdir('auth_delay')
subdir('auto_explain')
subdir('basic_archive')
+#subdir('bench_radix_tree')
subdir('bloom')
subdir('basebackup_to_shell')
subdir('bool_plperl')
--
2.31.1
On Tue, Jan 31, 2023 at 9:43 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
I've attached v24 patches. The locking support patch is separated
(0005 patch). Also I kept the updates for TidStore and the vacuum
integration from v23 separate.
Okay, that's a lot more simple, and closer to what I imagined. For v25, I
squashed v24's additions and added a couple of my own. I've kept the CF
status at "needs review" because no specific action is required at the
moment.
I did start to review the TID store some more, but that's on hold because
something else came up: On a lark I decided to re-run some benchmarks to
see if anything got lost in converting to a template, and that led me down
a rabbit hole -- some good and bad news on that below.
0001:
I removed the uint64 case, as discussed. There is now a brief commit
message, but it needs to be fleshed out a bit. I took another look at the Arm
optimization that Nathan found some months ago, for forming the highbit
mask, but that doesn't play nicely with how node32 uses it, so I decided
against it. I added a comment to describe the reasoning in case someone
else gets a similar idea.
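To make the node32 usage concrete, here is a rough sketch (my illustration
only, not the actual template code; it ignores the non-SIMD fallback, and
"chunks"/"count"/"chunk" stand in for the real node fields) of how
vector8_min() and vector8_highbit_mask() from 0001 combine with the existing
vector8_load/vector8_broadcast/vector8_eq helpers to find the first slot
whose chunk is >= the search key in a sorted 16-byte array:

	Vector8		haystack;
	Vector8		spread_chunk;
	Vector8		min_vec;
	Vector8		cmp;
	uint32		bitfield;
	int		index;

	/* load the 16 sorted chunk bytes and broadcast the search key */
	vector8_load(&haystack, &chunks[0]);
	spread_chunk = vector8_broadcast(chunk);

	/* for unsigned bytes, min(a, chunk) == chunk is the same as a >= chunk */
	min_vec = vector8_min(haystack, spread_chunk);
	cmp = vector8_eq(min_vec, spread_chunk);

	/* one mask bit per byte position whose comparison was true */
	bitfield = vector8_highbit_mask(cmp);

	/* the real code must also mask out slots beyond the node's count */
	if (bitfield)
		index = pg_rightmost_one_pos32(bitfield);	/* first chunk >= key */
	else
		index = -1;		/* no chunk is >= the search key */

The faster Arm variant mentioned in the 0001 comment returns a 64-bit mask in
which the bit position would have to be divided by 4, which is why it doesn't
drop into this one-bit-per-byte scheme cleanly.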
I briefly looked into "separate-commit TODO: move non-SIMD fallbacks to
their own header to clean up the #ifdef maze.", but decided it wasn't such
a clear win to justify starting the work now. It's still in the back of my
mind, but I removed the reminder from the commit message.
0003:
The template now requires the value to be passed as a pointer. That was a
pretty trivial change, but affected multiple other patches, so not sent
separately. Also adds a forgotten RT_ prefix to the bitmap macros and adds
a top comment to the *_impl.h headers. There are some comment fixes. The
changes were either trivial or discussed earlier, so also not sent
separately.
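Concretely, a caller that previously wrote the first line below now writes the
second (using the bench module's rt_ prefix; "key" and "val" are just
placeholders):

	rt_set(rt, key, val);	/* v24 and earlier: value passed by value */
	rt_set(rt, key, &val);	/* v25: value passed as a pointer */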
0004/5: I wanted to measure the load time as well as search time in
bench_search_random_nodes(). That's kept separate to make it easier to test
other patch versions.
The bad news is that the speed of loading TIDs in
bench_seq/shuffle_search() has regressed noticeably. I can't reproduce this
in any other bench function, which was the reason for writing 0005 to begin
with. More confusingly, my efforts to fix this improved *other* functions,
but the former didn't budge at all. First the patches:
0006 adds and removes some "inline" declarations (where it made sense), and
added some for "pg_noinline" based on Andres' advice some months ago.
0007 removes some dead code. RT_NODE_INSERT_INNER is only called during
RT_SET_EXTEND, so it can't possibly find an existing key. This kind of
change is much easier with the inner/node cases handled together in a
template, as far as being sure of how those cases are different. I thought
about trying the search in assert builds and verifying it doesn't exist,
but thought yet another #ifdef would be too messy.
v25-addendum-try-no-maintain-order.txt -- It makes keeping the key chunks
in order optional for the linear-search nodes. I believe the TID store no
longer cares about the ordering, but this is a text file for now because I
don't want to clutter the CI with a behavior change. Also, the second ART
paper (on concurrency) mentioned that some locking schemes don't allow
these arrays to be shifted. So it might make sense to give up entirely on
guaranteeing ordered iteration, or at least make it optional as in the
patch.
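To make the ordering question concrete, here is a minimal sketch of the two
insert strategies for a small node's chunk array (illustrative only, not the
template's code; "chunks" and "count" stand in for the node-3/32 fields, and
growing a full node is ignored):

#include <string.h>		/* for memmove */

static void
node_insert_ordered(uint8 *chunks, int *count, uint8 chunk)
{
	int		insertpos = 0;

	/* find the first slot whose chunk is >= the new one */
	while (insertpos < *count && chunks[insertpos] < chunk)
		insertpos++;

	/* shift the tail up one slot so the array stays sorted */
	memmove(&chunks[insertpos + 1], &chunks[insertpos], *count - insertpos);
	chunks[insertpos] = chunk;
	(*count)++;
}

static void
node_insert_unordered(uint8 *chunks, int *count, uint8 chunk)
{
	/* just append; iterating over the node no longer visits keys in order */
	chunks[(*count)++] = chunk;
}

The real nodes also have to shift the corresponding child or value slots when
keeping things ordered, which is where the extra cost during loading comes
from.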
Now for some numbers:
========================================
psql -c "select * from bench_search_random_nodes(10*1000*1000)"
(min load time of three)
v15:
mem_allocated | load_ms | search_ms
---------------+---------+-----------
334182184 | 3352 | 2073
v25-0005:
mem_allocated | load_ms | search_ms
---------------+---------+-----------
331987008 | 3426 | 2126
v25-0006 (inlining or not):
mem_allocated | load_ms | search_ms
---------------+---------+-----------
331987008 | 3327 | 2035
v25-0007 (remove dead code):
mem_allocated | load_ms | search_ms
---------------+---------+-----------
331987008 | 3313 | 2037
v25-addendum...txt (no ordering):
mem_allocated | load_ms | search_ms
---------------+---------+-----------
331987008 | 2762 | 2042
Allowing unordered inserts helps a lot here in loading. That's expected
because there are a lot of inserts into the linear nodes. 0006 might help a
little.
========================================
psql -c "select avg(load_ms) from generate_series(1,30) x(x), lateral
(select * from bench_load_random_int(500 * 1000 * (1+x-x))) a"
v15:
avg
----------------------
207.3000000000000000
v25-0005:
avg
----------------------
190.6000000000000000
v25-0006 (inlining or not):
avg
----------------------
189.3333333333333333
v25-0007 (remove dead code):
avg
----------------------
186.4666666666666667
v25-addendum...txt (no ordering):
avg
----------------------
179.7000000000000000
Most of the improvement from v15 to v25 probably comes from the change from
node4 to node3, and this test stresses that node the most. That shows in
the total memory used: it goes from 152MB to 132MB. Allowing unordered
inserts helps some, the others are not convincing.
========================================
psql -c "select rt_load_ms, rt_search_ms from bench_seq_search(0, 1 * 1000
* 1000)"
(min load time of three)
v15:
rt_load_ms | rt_search_ms
------------+--------------
113 | 455
v25-0005:
rt_load_ms | rt_search_ms
------------+--------------
135 | 456
v25-0006 (inlining or not):
rt_load_ms | rt_search_ms
------------+--------------
136 | 455
v25-0007 (remove dead code):
rt_load_ms | rt_search_ms
------------+--------------
135 | 455
v25-addendum...txt (no ordering):
rt_load_ms | rt_search_ms
------------+--------------
134 | 455
Note: The regression seems to have started in v17, which is the first with
a full template.
Nothing so far has helped here, and previous experience has shown that
trying to profile 100ms will not be useful. Instead of putting more effort
into diving deeper, it seems a better use of time to write a benchmark that
calls the tid store itself. That's more realistic, since this function was
intended to test load and search of tids, but the tid store doesn't quite
operate so simply anymore. What do you think, Masahiko?
I'm inclined to keep 0006, because it might give a slight boost, and 0007
because it's never a bad idea to remove dead code.
--
John Naylor
EDB: http://www.enterprisedb.com
Attachments:
v25-addendum-try-no-maintain-order.txt
diff --git a/src/include/lib/radixtree_insert_impl.h b/src/include/lib/radixtree_insert_impl.h
index 4e00b46d9b..3f831227c9 100644
--- a/src/include/lib/radixtree_insert_impl.h
+++ b/src/include/lib/radixtree_insert_impl.h
@@ -80,9 +80,10 @@
}
else
{
- int insertpos = RT_NODE_3_GET_INSERTPOS(&n3->base, chunk);
+ int insertpos;// = RT_NODE_3_GET_INSERTPOS(&n3->base, chunk);
int count = n3->base.n.count;
-
+#ifdef RT_MAINTAIN_ORDERING
+ insertpos = RT_NODE_3_GET_INSERTPOS(&n3->base, chunk);
/* shift chunks and children */
if (insertpos < count)
{
@@ -95,6 +96,9 @@
count, insertpos);
#endif
}
+#else
+ insertpos = count;
+#endif /* order */
n3->base.chunks[insertpos] = chunk;
#ifdef RT_NODE_LEVEL_LEAF
@@ -186,8 +190,10 @@
}
else
{
- int insertpos = RT_NODE_32_GET_INSERTPOS(&n32->base, chunk);
+ int insertpos;
int count = n32->base.n.count;
+#ifdef RT_MAINTAIN_ORDERING
+ insertpos = RT_NODE_32_GET_INSERTPOS(&n32->base, chunk);
if (insertpos < count)
{
@@ -200,6 +206,9 @@
count, insertpos);
#endif
}
+#else
+ insertpos = count;
+#endif
n32->base.chunks[insertpos] = chunk;
#ifdef RT_NODE_LEVEL_LEAF
v25-0002-Move-some-bitmap-logic-out-of-bitmapset.c.patch
From 86c2d232a0ea193a856cb0348e0825b5e4b7a4b7 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Tue, 6 Dec 2022 13:39:41 +0700
Subject: [PATCH v25 2/9] Move some bitmap logic out of bitmapset.c
Add pg_rightmost_one32/64 functions. This functionality was previously
private to bitmapset.c as the RIGHTMOST_ONE macro. It has practical
use in other contexts, so move to pg_bitutils.h.
Also move the logic for selecting appropriate pg_bitutils functions
based on word size to bitmapset.h for wider visibility and add
appropriate selection for bmw_rightmost_one().
Since the previous macro relied on casting to signedbitmapword,
and the new functions do not, remove that typedef.
Design input and review by Tom Lane
Discussion: https://www.postgresql.org/message-id/CAFBsxsFW2JjTo58jtDB%2B3sZhxMx3t-3evew8%3DAcr%2BGGhC%2BkFaA%40mail.gmail.com
---
src/backend/nodes/bitmapset.c | 36 ++------------------------------
src/include/nodes/bitmapset.h | 16 ++++++++++++--
src/include/port/pg_bitutils.h | 31 +++++++++++++++++++++++++++
src/tools/pgindent/typedefs.list | 1 -
4 files changed, 47 insertions(+), 37 deletions(-)
diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index 98660524ad..fcd8e2ccbc 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -32,39 +32,7 @@
#define BITMAPSET_SIZE(nwords) \
(offsetof(Bitmapset, words) + (nwords) * sizeof(bitmapword))
-/*----------
- * This is a well-known cute trick for isolating the rightmost one-bit
- * in a word. It assumes two's complement arithmetic. Consider any
- * nonzero value, and focus attention on the rightmost one. The value is
- * then something like
- * xxxxxx10000
- * where x's are unspecified bits. The two's complement negative is formed
- * by inverting all the bits and adding one. Inversion gives
- * yyyyyy01111
- * where each y is the inverse of the corresponding x. Incrementing gives
- * yyyyyy10000
- * and then ANDing with the original value gives
- * 00000010000
- * This works for all cases except original value = zero, where of course
- * we get zero.
- *----------
- */
-#define RIGHTMOST_ONE(x) ((signedbitmapword) (x) & -((signedbitmapword) (x)))
-
-#define HAS_MULTIPLE_ONES(x) ((bitmapword) RIGHTMOST_ONE(x) != (x))
-
-/* Select appropriate bit-twiddling functions for bitmap word size */
-#if BITS_PER_BITMAPWORD == 32
-#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos32(w)
-#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos32(w)
-#define bmw_popcount(w) pg_popcount32(w)
-#elif BITS_PER_BITMAPWORD == 64
-#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos64(w)
-#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos64(w)
-#define bmw_popcount(w) pg_popcount64(w)
-#else
-#error "invalid BITS_PER_BITMAPWORD"
-#endif
+#define HAS_MULTIPLE_ONES(x) (bmw_rightmost_one(x) != (x))
/*
@@ -1013,7 +981,7 @@ bms_first_member(Bitmapset *a)
{
int result;
- w = RIGHTMOST_ONE(w);
+ w = bmw_rightmost_one(w);
a->words[wordnum] &= ~w;
result = wordnum * BITS_PER_BITMAPWORD;
diff --git a/src/include/nodes/bitmapset.h b/src/include/nodes/bitmapset.h
index 3d2225e1ae..5f9a511b4a 100644
--- a/src/include/nodes/bitmapset.h
+++ b/src/include/nodes/bitmapset.h
@@ -38,13 +38,11 @@ struct List;
#define BITS_PER_BITMAPWORD 64
typedef uint64 bitmapword; /* must be an unsigned type */
-typedef int64 signedbitmapword; /* must be the matching signed type */
#else
#define BITS_PER_BITMAPWORD 32
typedef uint32 bitmapword; /* must be an unsigned type */
-typedef int32 signedbitmapword; /* must be the matching signed type */
#endif
@@ -75,6 +73,20 @@ typedef enum
BMS_MULTIPLE /* >1 member */
} BMS_Membership;
+/* Select appropriate bit-twiddling functions for bitmap word size */
+#if BITS_PER_BITMAPWORD == 32
+#define bmw_rightmost_one(w) pg_rightmost_one32(w)
+#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos32(w)
+#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos32(w)
+#define bmw_popcount(w) pg_popcount32(w)
+#elif BITS_PER_BITMAPWORD == 64
+#define bmw_rightmost_one(w) pg_rightmost_one64(w)
+#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos64(w)
+#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos64(w)
+#define bmw_popcount(w) pg_popcount64(w)
+#else
+#error "invalid BITS_PER_BITMAPWORD"
+#endif
/*
* function prototypes in nodes/bitmapset.c
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index a49a9c03d9..7235ad25ee 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -17,6 +17,37 @@ extern PGDLLIMPORT const uint8 pg_leftmost_one_pos[256];
extern PGDLLIMPORT const uint8 pg_rightmost_one_pos[256];
extern PGDLLIMPORT const uint8 pg_number_of_ones[256];
+/*----------
+ * This is a well-known cute trick for isolating the rightmost one-bit
+ * in a word. It assumes two's complement arithmetic. Consider any
+ * nonzero value, and focus attention on the rightmost one. The value is
+ * then something like
+ * xxxxxx10000
+ * where x's are unspecified bits. The two's complement negative is formed
+ * by inverting all the bits and adding one. Inversion gives
+ * yyyyyy01111
+ * where each y is the inverse of the corresponding x. Incrementing gives
+ * yyyyyy10000
+ * and then ANDing with the original value gives
+ * 00000010000
+ * This works for all cases except original value = zero, where of course
+ * we get zero.
+ *----------
+ */
+static inline uint32
+pg_rightmost_one32(uint32 word)
+{
+ int32 result = (int32) word & -((int32) word);
+ return (uint32) result;
+}
+
+static inline uint64
+pg_rightmost_one64(uint64 word)
+{
+ int64 result = (int64) word & -((int64) word);
+ return (uint64) result;
+}
+
/*
* pg_leftmost_one_pos32
* Returns the position of the most significant set bit in "word",
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 07fbb7ccf6..f4d1d60cd2 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3662,7 +3662,6 @@ shmem_request_hook_type
shmem_startup_hook_type
sig_atomic_t
sigjmp_buf
-signedbitmapword
sigset_t
size_t
slist_head
--
2.39.1
v25-0001-Introduce-helper-SIMD-functions-for-small-byte-a.patch
From 949c6eef5ff7cc4f8ef2673f9aa63142a1d913ae Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 24 Oct 2022 14:07:09 +0900
Subject: [PATCH v25 1/9] Introduce helper SIMD functions for small byte arrays
vector8_min - helper for emulating ">=" semantics
vector8_highbit_mask - used to turn the result of a vector
comparison into a bitmask
Masahiko Sawada
Reviewed by Nathan Bossart, additional adjustments by me
Discussion: https://www.postgresql.org/message-id/CAD21AoDap240WDDdUDE0JMpCmuMMnGajrKrkCRxM7zn9Xk3JRA%40mail.gmail.com
---
src/include/port/simd.h | 47 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 47 insertions(+)
diff --git a/src/include/port/simd.h b/src/include/port/simd.h
index c836360d4b..350e2caaea 100644
--- a/src/include/port/simd.h
+++ b/src/include/port/simd.h
@@ -79,6 +79,7 @@ static inline bool vector8_has_le(const Vector8 v, const uint8 c);
static inline bool vector8_is_highbit_set(const Vector8 v);
#ifndef USE_NO_SIMD
static inline bool vector32_is_highbit_set(const Vector32 v);
+static inline uint32 vector8_highbit_mask(const Vector8 v);
#endif
/* arithmetic operations */
@@ -96,6 +97,7 @@ static inline Vector8 vector8_ssub(const Vector8 v1, const Vector8 v2);
*/
#ifndef USE_NO_SIMD
static inline Vector8 vector8_eq(const Vector8 v1, const Vector8 v2);
+static inline Vector8 vector8_min(const Vector8 v1, const Vector8 v2);
static inline Vector32 vector32_eq(const Vector32 v1, const Vector32 v2);
#endif
@@ -299,6 +301,36 @@ vector32_is_highbit_set(const Vector32 v)
}
#endif /* ! USE_NO_SIMD */
+/*
+ * Return a bitmask formed from the high-bit of each element.
+ */
+#ifndef USE_NO_SIMD
+static inline uint32
+vector8_highbit_mask(const Vector8 v)
+{
+#ifdef USE_SSE2
+ return (uint32) _mm_movemask_epi8(v);
+#elif defined(USE_NEON)
+ /*
+ * Note: There is a faster way to do this, but it returns a uint64, and
+ * if the caller wanted to extract the bit position using CTZ,
+ * it would have to divide that result by 4.
+ */
+ static const uint8 mask[16] = {
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ };
+
+ uint8x16_t masked = vandq_u8(vld1q_u8(mask), (uint8x16_t) vshrq_n_s8(v, 7));
+ uint8x16_t maskedhi = vextq_u8(masked, masked, 8);
+
+ return (uint32) vaddvq_u16((uint16x8_t) vzip1q_u8(masked, maskedhi));
+#endif
+}
+#endif /* ! USE_NO_SIMD */
+
/*
* Return the bitwise OR of the inputs
*/
@@ -372,4 +404,19 @@ vector32_eq(const Vector32 v1, const Vector32 v2)
}
#endif /* ! USE_NO_SIMD */
+/*
+ * Given two vectors, return a vector with the minimum element of each.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_min(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+ return _mm_min_epu8(v1, v2);
+#elif defined(USE_NEON)
+ return vminq_u8(v1, v2);
+#endif
+}
+#endif /* ! USE_NO_SIMD */
+
#endif /* SIMD_H */
--
2.39.1
v25-0005-Measure-load-time-of-bench_search_random_nodes.patch
From 8edd5b4c0fcbf7681c5388faaf85a96ae451c99e Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Tue, 7 Feb 2023 13:06:00 +0700
Subject: [PATCH v25 5/9] Measure load time of bench_search_random_nodes
---
.../bench_radix_tree/bench_radix_tree--1.0.sql | 1 +
contrib/bench_radix_tree/bench_radix_tree.c | 17 ++++++++++++-----
2 files changed, 13 insertions(+), 5 deletions(-)
diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
index 2fd689aa91..95eedbbe10 100644
--- a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
+++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
@@ -47,6 +47,7 @@ create function bench_search_random_nodes(
cnt int8,
filter_str text DEFAULT NULL,
OUT mem_allocated int8,
+OUT load_ms int8,
OUT search_ms int8)
returns record
as 'MODULE_PATHNAME'
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c
index 73ddee32de..7d1e2eee57 100644
--- a/contrib/bench_radix_tree/bench_radix_tree.c
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -395,9 +395,10 @@ bench_search_random_nodes(PG_FUNCTION_ARGS)
end_time;
long secs;
int usecs;
+ int64 load_time_ms;
int64 search_time_ms;
- Datum values[2] = {0};
- bool nulls[2] = {0};
+ Datum values[3] = {0};
+ bool nulls[3] = {0};
/* from trial and error */
uint64 filter = (((uint64) 0x7F<<32) | (0x07<<24) | (0xFF<<16) | 0xFF);
@@ -416,13 +417,18 @@ bench_search_random_nodes(PG_FUNCTION_ARGS)
rt = rt_create(CurrentMemoryContext);
+ start_time = GetCurrentTimestamp();
for (uint64 i = 0; i < cnt; i++)
{
- const uint64 hash = hash64(i);
- const uint64 key = hash & filter;
+ uint64 hash = hash64(i);
+ uint64 key = hash & filter;
rt_set(rt, key, &key);
}
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ load_time_ms = secs * 1000 + usecs / 1000;
elog(NOTICE, "sleeping for 2 seconds...");
pg_usleep(2 * 1000000L);
@@ -449,7 +455,8 @@ bench_search_random_nodes(PG_FUNCTION_ARGS)
rt_stats(rt);
values[0] = Int64GetDatum(rt_memory_usage(rt));
- values[1] = Int64GetDatum(search_time_ms);
+ values[1] = Int64GetDatum(load_time_ms);
+ values[2] = Int64GetDatum(search_time_ms);
rt_free(rt);
PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
--
2.39.1
v25-0004-Tool-for-measuring-radix-tree-performance.patch
From 6fb21eb0b44b5923c0b736d82e86b1d4a40a71d6 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 16 Sep 2022 11:57:03 +0900
Subject: [PATCH v25 4/9] Tool for measuring radix tree performance
Includes Meson support, but commented out to avoid warnings
XXX: Not for commit
---
contrib/bench_radix_tree/Makefile | 21 +
.../bench_radix_tree--1.0.sql | 76 ++
contrib/bench_radix_tree/bench_radix_tree.c | 656 ++++++++++++++++++
.../bench_radix_tree/bench_radix_tree.control | 6 +
contrib/bench_radix_tree/expected/bench.out | 13 +
contrib/bench_radix_tree/meson.build | 33 +
contrib/bench_radix_tree/sql/bench.sql | 16 +
contrib/meson.build | 1 +
8 files changed, 822 insertions(+)
create mode 100644 contrib/bench_radix_tree/Makefile
create mode 100644 contrib/bench_radix_tree/bench_radix_tree--1.0.sql
create mode 100644 contrib/bench_radix_tree/bench_radix_tree.c
create mode 100644 contrib/bench_radix_tree/bench_radix_tree.control
create mode 100644 contrib/bench_radix_tree/expected/bench.out
create mode 100644 contrib/bench_radix_tree/meson.build
create mode 100644 contrib/bench_radix_tree/sql/bench.sql
diff --git a/contrib/bench_radix_tree/Makefile b/contrib/bench_radix_tree/Makefile
new file mode 100644
index 0000000000..952bb0ceae
--- /dev/null
+++ b/contrib/bench_radix_tree/Makefile
@@ -0,0 +1,21 @@
+# contrib/bench_radix_tree/Makefile
+
+MODULE_big = bench_radix_tree
+OBJS = \
+ bench_radix_tree.o
+
+EXTENSION = bench_radix_tree
+DATA = bench_radix_tree--1.0.sql
+
+REGRESS = bench_fixed_height
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/bench_radix_tree
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
new file mode 100644
index 0000000000..2fd689aa91
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
@@ -0,0 +1,76 @@
+/* contrib/bench_radix_tree/bench_radix_tree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION bench_radix_tree" to load this file. \quit
+
+create function bench_shuffle_search(
+minblk int4,
+maxblk int4,
+random_block bool DEFAULT false,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT array_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT array_load_ms int8,
+OUT rt_search_ms int8,
+OUT array_serach_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_seq_search(
+minblk int4,
+maxblk int4,
+random_block bool DEFAULT false,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT array_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT array_load_ms int8,
+OUT rt_search_ms int8,
+OUT array_serach_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_load_random_int(
+cnt int8,
+OUT mem_allocated int8,
+OUT load_ms int8)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_search_random_nodes(
+cnt int8,
+filter_str text DEFAULT NULL,
+OUT mem_allocated int8,
+OUT search_ms int8)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C VOLATILE PARALLEL UNSAFE;
+
+create function bench_fixed_height_search(
+fanout int4,
+OUT fanout int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT rt_search_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_node128_load(
+fanout int4,
+OUT fanout int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT rt_sparseload_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c
new file mode 100644
index 0000000000..73ddee32de
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -0,0 +1,656 @@
+/*-------------------------------------------------------------------------
+ *
+ * bench_radix_tree.c
+ *
+ * Copyright (c) 2016-2022, PostgreSQL Global Development Group
+ *
+ * contrib/bench_radix_tree/bench_radix_tree.c
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "funcapi.h"
+#include "lib/radixtree.h"
+#include <math.h>
+#include "miscadmin.h"
+#include "storage/lwlock.h"
+#include "utils/builtins.h"
+#include "utils/timestamp.h"
+
+PG_MODULE_MAGIC;
+
+/* run benchmark also for binary-search case? */
+/* #define MEASURE_BINARY_SEARCH 1 */
+
+#define TIDS_PER_BLOCK_FOR_LOAD 30
+#define TIDS_PER_BLOCK_FOR_LOOKUP 50
+
+#define RT_DEBUG
+#define RT_PREFIX rt
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#define RT_USE_DELETE
+#define RT_VALUE_TYPE uint64
+// WIP: compiles with warnings because rt_attach is defined but not used
+// #define RT_SHMEM
+#include "lib/radixtree.h"
+
+/*
+ * Return the number of keys in the radix tree.
+ */
+static uint64
+rt_num_entries(rt_radix_tree *tree)
+{
+ return tree->ctl->num_keys;
+}
+
+
+PG_FUNCTION_INFO_V1(bench_seq_search);
+PG_FUNCTION_INFO_V1(bench_shuffle_search);
+PG_FUNCTION_INFO_V1(bench_load_random_int);
+PG_FUNCTION_INFO_V1(bench_fixed_height_search);
+PG_FUNCTION_INFO_V1(bench_search_random_nodes);
+PG_FUNCTION_INFO_V1(bench_node128_load);
+
+static uint64
+tid_to_key_off(ItemPointer tid, uint32 *off)
+{
+ uint64 upper;
+ uint32 shift = pg_ceil_log2_32(MaxHeapTuplesPerPage);
+ int64 tid_i;
+
+ Assert(ItemPointerGetOffsetNumber(tid) < MaxHeapTuplesPerPage);
+
+ tid_i = ItemPointerGetOffsetNumber(tid);
+ tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << shift;
+
+ /* log(sizeof(uint64) * BITS_PER_BYTE, 2) = log(64, 2) = 6 */
+ *off = tid_i & ((1 << 6) - 1);
+ upper = tid_i >> 6;
+ Assert(*off < (sizeof(uint64) * BITS_PER_BYTE));
+
+ Assert(*off < 64);
+
+ return upper;
+}
+
+static int
+shuffle_randrange(pg_prng_state *state, int lower, int upper)
+{
+ return (int) floor(pg_prng_double(state) * ((upper - lower) + 0.999999)) + lower;
+}
+
+/* Naive Fisher-Yates implementation */
+static void
+shuffle_itemptrs(ItemPointer itemptr, uint64 nitems)
+{
+ /* reproducibility */
+ pg_prng_state state;
+
+ pg_prng_seed(&state, 0);
+
+ for (int i = 0; i < nitems - 1; i++)
+ {
+ int j = shuffle_randrange(&state, i, nitems - 1);
+ ItemPointerData t = itemptr[j];
+
+ itemptr[j] = itemptr[i];
+ itemptr[i] = t;
+ }
+}
+
+static ItemPointer
+generate_tids(BlockNumber minblk, BlockNumber maxblk, int ntids_per_blk, uint64 *ntids_p,
+ bool random_block)
+{
+ ItemPointer tids;
+ uint64 maxitems;
+ uint64 ntids = 0;
+ pg_prng_state state;
+
+ maxitems = (maxblk - minblk + 1) * ntids_per_blk;
+ tids = MemoryContextAllocHuge(TopTransactionContext,
+ sizeof(ItemPointerData) * maxitems);
+
+ if (random_block)
+ pg_prng_seed(&state, 0x9E3779B185EBCA87);
+
+ for (BlockNumber blk = minblk; blk < maxblk; blk++)
+ {
+ if (random_block && !pg_prng_bool(&state))
+ continue;
+
+ for (OffsetNumber off = FirstOffsetNumber;
+ off <= ntids_per_blk; off++)
+ {
+ CHECK_FOR_INTERRUPTS();
+
+ ItemPointerSetBlockNumber(&(tids[ntids]), blk);
+ ItemPointerSetOffsetNumber(&(tids[ntids]), off);
+
+ ntids++;
+ }
+ }
+
+ *ntids_p = ntids;
+ return tids;
+}
+
+#ifdef MEASURE_BINARY_SEARCH
+static int
+vac_cmp_itemptr(const void *left, const void *right)
+{
+ BlockNumber lblk,
+ rblk;
+ OffsetNumber loff,
+ roff;
+
+ lblk = ItemPointerGetBlockNumber((ItemPointer) left);
+ rblk = ItemPointerGetBlockNumber((ItemPointer) right);
+
+ if (lblk < rblk)
+ return -1;
+ if (lblk > rblk)
+ return 1;
+
+ loff = ItemPointerGetOffsetNumber((ItemPointer) left);
+ roff = ItemPointerGetOffsetNumber((ItemPointer) right);
+
+ if (loff < roff)
+ return -1;
+ if (loff > roff)
+ return 1;
+
+ return 0;
+}
+#endif
+
+static Datum
+bench_search(FunctionCallInfo fcinfo, bool shuffle)
+{
+ BlockNumber minblk = PG_GETARG_INT32(0);
+ BlockNumber maxblk = PG_GETARG_INT32(1);
+ bool random_block = PG_GETARG_BOOL(2);
+ rt_radix_tree *rt = NULL;
+ uint64 ntids;
+ uint64 key;
+ uint64 last_key = PG_UINT64_MAX;
+ uint64 val = 0;
+ ItemPointer tids;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_load_ms,
+ rt_search_ms;
+ Datum values[7];
+ bool nulls[7];
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ tids = generate_tids(minblk, maxblk, TIDS_PER_BLOCK_FOR_LOAD, &ntids, random_block);
+
+ /* measure the load time of the radix tree */
+ rt = rt_create(CurrentMemoryContext);
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ uint32 off;
+
+ CHECK_FOR_INTERRUPTS();
+
+ key = tid_to_key_off(tid, &off);
+
+ if (last_key != PG_UINT64_MAX && last_key != key)
+ {
+ rt_set(rt, last_key, &val);
+ val = 0;
+ }
+
+ last_key = key;
+ val |= (uint64) 1 << off;
+ }
+ if (last_key != PG_UINT64_MAX)
+ rt_set(rt, last_key, &val);
+
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_load_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ if (shuffle)
+ shuffle_itemptrs(tids, ntids);
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+ /* measure the search time of the radix tree */
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ uint32 off;
+ volatile bool ret; /* prevent calling rt_search from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ key = tid_to_key_off(tid, &off);
+
+ ret = rt_search(rt, key, &val);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_search_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int64GetDatum(rt_num_entries(rt));
+ values[1] = Int64GetDatum(rt_memory_usage(rt));
+ values[2] = Int64GetDatum(sizeof(ItemPointerData) * ntids);
+ values[3] = Int64GetDatum(rt_load_ms);
+ nulls[4] = true; /* ar_load_ms */
+ values[5] = Int64GetDatum(rt_search_ms);
+ nulls[6] = true; /* ar_search_ms */
+
+#ifdef MEASURE_BINARY_SEARCH
+ {
+ ItemPointer itemptrs = NULL;
+
+ int64 ar_load_ms,
+ ar_search_ms;
+
+ /* measure the load time of the array */
+ itemptrs = MemoryContextAllocHuge(CurrentMemoryContext,
+ sizeof(ItemPointerData) * ntids);
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointerSetBlockNumber(&(itemptrs[i]),
+ ItemPointerGetBlockNumber(&(tids[i])));
+ ItemPointerSetOffsetNumber(&(itemptrs[i]),
+ ItemPointerGetOffsetNumber(&(tids[i])));
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ ar_load_ms = secs * 1000 + usecs / 1000;
+
+ /* next, measure the search time of the array */
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ volatile bool ret; /* prevent calling bsearch from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ ret = bsearch((void *) tid,
+ (void *) itemptrs,
+ ntids,
+ sizeof(ItemPointerData),
+ vac_cmp_itemptr);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ ar_search_ms = secs * 1000 + usecs / 1000;
+
+ /* set the result */
+ nulls[4] = false;
+ values[4] = Int64GetDatum(ar_load_ms);
+ nulls[6] = false;
+ values[6] = Int64GetDatum(ar_search_ms);
+ }
+#endif
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_seq_search(PG_FUNCTION_ARGS)
+{
+ return bench_search(fcinfo, false);
+}
+
+Datum
+bench_shuffle_search(PG_FUNCTION_ARGS)
+{
+ return bench_search(fcinfo, true);
+}
+
+Datum
+bench_load_random_int(PG_FUNCTION_ARGS)
+{
+ uint64 cnt = (uint64) PG_GETARG_INT64(0);
+ rt_radix_tree *rt;
+ pg_prng_state state;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 load_time_ms;
+ Datum values[2];
+ bool nulls[2];
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ pg_prng_seed(&state, 0);
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ uint64 key = pg_prng_uint64(&state);
+
+ rt_set(rt, key, &key);
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ load_time_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int64GetDatum(rt_memory_usage(rt));
+ values[1] = Int64GetDatum(load_time_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+/* copy of splitmix64() */
+static uint64
+hash64(uint64 x)
+{
+ x ^= x >> 30;
+ x *= UINT64CONST(0xbf58476d1ce4e5b9);
+ x ^= x >> 27;
+ x *= UINT64CONST(0x94d049bb133111eb);
+ x ^= x >> 31;
+ return x;
+}
+
+/* attempts to have a relatively even population of node kinds */
+Datum
+bench_search_random_nodes(PG_FUNCTION_ARGS)
+{
+ uint64 cnt = (uint64) PG_GETARG_INT64(0);
+ rt_radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 search_time_ms;
+ Datum values[2] = {0};
+ bool nulls[2] = {0};
+ /* from trial and error */
+ uint64 filter = (((uint64) 0x7F<<32) | (0x07<<24) | (0xFF<<16) | 0xFF);
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ if (!PG_ARGISNULL(1))
+ {
+ char *filter_str = text_to_cstring(PG_GETARG_TEXT_P(1));
+
+ if (sscanf(filter_str, "0x%lX", &filter) == 0)
+ elog(ERROR, "invalid filter string %s", filter_str);
+ }
+ elog(NOTICE, "bench with filter 0x%lX", filter);
+
+ rt = rt_create(CurrentMemoryContext);
+
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ const uint64 hash = hash64(i);
+ const uint64 key = hash & filter;
+
+ rt_set(rt, key, &key);
+ }
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ uint64 hash = hash64(i);
+ uint64 key = hash & filter;
+ uint64 val;
+ volatile bool ret; /* prevent calling rt_search from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ ret = rt_search(rt, key, &val);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ search_time_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ values[0] = Int64GetDatum(rt_memory_usage(rt));
+ values[1] = Int64GetDatum(search_time_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_fixed_height_search(PG_FUNCTION_ARGS)
+{
+ int fanout = PG_GETARG_INT32(0);
+ rt_radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_load_ms,
+ rt_search_ms;
+ Datum values[5];
+ bool nulls[5];
+
+ /* test boundary between vector and iteration */
+ const int n_keys = 5 * 16 * 16 * 16 * 16;
+ uint64 r,
+ h,
+ i,
+ j,
+ k;
+ uint64 key_id;
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+
+ key_id = 0;
+
+ /*
+ * lower nodes have limited fanout, the top is only limited by
+ * bits-per-byte
+ */
+ for (r = 1;; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ for (i = 1; i <= fanout; i++)
+ {
+ for (j = 1; j <= fanout; j++)
+ {
+ for (k = 1; k <= fanout; k++)
+ {
+ uint64 key;
+
+ key = (r << 32) | (h << 24) | (i << 16) | (j << 8) | (k);
+
+ CHECK_FOR_INTERRUPTS();
+
+ key_id++;
+ if (key_id > n_keys)
+ goto finish_set;
+
+ rt_set(rt, key, &key_id);
+ }
+ }
+ }
+ }
+ }
+finish_set:
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_load_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ /* measure the search time of the radix tree */
+ start_time = GetCurrentTimestamp();
+
+ key_id = 0;
+ for (r = 1;; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ for (i = 1; i <= fanout; i++)
+ {
+ for (j = 1; j <= fanout; j++)
+ {
+ for (k = 1; k <= fanout; k++)
+ {
+ uint64 key,
+ val;
+
+ key = (r << 32) | (h << 24) | (i << 16) | (j << 8) | (k);
+
+ CHECK_FOR_INTERRUPTS();
+
+ key_id++;
+ if (key_id > n_keys)
+ goto finish_search;
+
+ rt_search(rt, key, &val);
+ }
+ }
+ }
+ }
+ }
+finish_search:
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_search_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int32GetDatum(fanout);
+ values[1] = Int64GetDatum(rt_num_entries(rt));
+ values[2] = Int64GetDatum(rt_memory_usage(rt));
+ values[3] = Int64GetDatum(rt_load_ms);
+ values[4] = Int64GetDatum(rt_search_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_node128_load(PG_FUNCTION_ARGS)
+{
+ int fanout = PG_GETARG_INT32(0);
+ rt_radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_sparseload_ms;
+ Datum values[5];
+ bool nulls[5];
+
+ uint64 r,
+ h;
+ uint64 key_id;
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ rt = rt_create(CurrentMemoryContext);
+
+ key_id = 0;
+
+ for (r = 1; r <= fanout; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (h);
+
+ key_id++;
+ rt_set(rt, key, &key_id);
+ }
+ }
+
+ rt_stats(rt);
+
+ /* measure sparse deletion and re-loading */
+ start_time = GetCurrentTimestamp();
+
+ for (int t = 0; t<10000; t++)
+ {
+ /* delete one key in each leaf */
+ for (r = 1; r <= fanout; r++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (fanout);
+
+ rt_delete(rt, key);
+ }
+
+ /* add them all back */
+ key_id = 0;
+ for (r = 1; r <= fanout; r++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (fanout);
+
+ key_id++;
+ rt_set(rt, key, &key_id);
+ }
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_sparseload_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int32GetDatum(fanout);
+ values[1] = Int64GetDatum(rt_num_entries(rt));
+ values[2] = Int64GetDatum(rt_memory_usage(rt));
+ values[3] = Int64GetDatum(rt_sparseload_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
diff --git a/contrib/bench_radix_tree/bench_radix_tree.control b/contrib/bench_radix_tree/bench_radix_tree.control
new file mode 100644
index 0000000000..1d988e6c9a
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.control
@@ -0,0 +1,6 @@
+# bench_radix_tree extension
+comment = 'benchmark suite for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/bench_radix_tree'
+relocatable = true
+trusted = true
diff --git a/contrib/bench_radix_tree/expected/bench.out b/contrib/bench_radix_tree/expected/bench.out
new file mode 100644
index 0000000000..60c303892e
--- /dev/null
+++ b/contrib/bench_radix_tree/expected/bench.out
@@ -0,0 +1,13 @@
+create extension bench_radix_tree;
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
diff --git a/contrib/bench_radix_tree/meson.build b/contrib/bench_radix_tree/meson.build
new file mode 100644
index 0000000000..332c1ae7df
--- /dev/null
+++ b/contrib/bench_radix_tree/meson.build
@@ -0,0 +1,33 @@
+bench_radix_tree_sources = files(
+ 'bench_radix_tree.c',
+)
+
+if host_system == 'windows'
+ bench_radix_tree_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'bench_radix_tree',
+ '--FILEDESC', 'bench_radix_tree - performance test code for radix tree',])
+endif
+
+bench_radix_tree = shared_module('bench_radix_tree',
+ bench_radix_tree_sources,
+ link_with: pgport_srv,
+ kwargs: pg_mod_args,
+)
+testprep_targets += bench_radix_tree
+
+install_data(
+ 'bench_radix_tree.control',
+ 'bench_radix_tree--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'bench_radix_tree',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'bench_radix_tree',
+ ],
+ },
+}
diff --git a/contrib/bench_radix_tree/sql/bench.sql b/contrib/bench_radix_tree/sql/bench.sql
new file mode 100644
index 0000000000..a46018c9d4
--- /dev/null
+++ b/contrib/bench_radix_tree/sql/bench.sql
@@ -0,0 +1,16 @@
+create extension bench_radix_tree;
+
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
diff --git a/contrib/meson.build b/contrib/meson.build
index bd4a57c43c..52253de793 100644
--- a/contrib/meson.build
+++ b/contrib/meson.build
@@ -12,6 +12,7 @@ subdir('amcheck')
subdir('auth_delay')
subdir('auto_explain')
subdir('basic_archive')
+#subdir('bench_radix_tree')
subdir('bloom')
subdir('basebackup_to_shell')
subdir('bool_plperl')
--
2.39.1
v25-0003-Add-radixtree-template.patch
From f421579a2e04baa04b258399e01f01485ce6f358 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 14 Sep 2022 12:38:51 +0000
Subject: [PATCH v25 3/9] Add radixtree template
WIP: commit message based on template comments
---
src/backend/utils/mmgr/dsa.c | 12 +
src/include/lib/radixtree.h | 2516 +++++++++++++++++
src/include/lib/radixtree_delete_impl.h | 122 +
src/include/lib/radixtree_insert_impl.h | 332 +++
src/include/lib/radixtree_iter_impl.h | 153 +
src/include/lib/radixtree_search_impl.h | 138 +
src/include/utils/dsa.h | 1 +
src/test/modules/Makefile | 1 +
src/test/modules/meson.build | 1 +
src/test/modules/test_radixtree/.gitignore | 4 +
src/test/modules/test_radixtree/Makefile | 23 +
src/test/modules/test_radixtree/README | 7 +
.../expected/test_radixtree.out | 36 +
src/test/modules/test_radixtree/meson.build | 35 +
.../test_radixtree/sql/test_radixtree.sql | 7 +
.../test_radixtree/test_radixtree--1.0.sql | 8 +
.../modules/test_radixtree/test_radixtree.c | 674 +++++
.../test_radixtree/test_radixtree.control | 4 +
src/tools/pginclude/cpluspluscheck | 6 +
src/tools/pginclude/headerscheck | 6 +
20 files changed, 4086 insertions(+)
create mode 100644 src/include/lib/radixtree.h
create mode 100644 src/include/lib/radixtree_delete_impl.h
create mode 100644 src/include/lib/radixtree_insert_impl.h
create mode 100644 src/include/lib/radixtree_iter_impl.h
create mode 100644 src/include/lib/radixtree_search_impl.h
create mode 100644 src/test/modules/test_radixtree/.gitignore
create mode 100644 src/test/modules/test_radixtree/Makefile
create mode 100644 src/test/modules/test_radixtree/README
create mode 100644 src/test/modules/test_radixtree/expected/test_radixtree.out
create mode 100644 src/test/modules/test_radixtree/meson.build
create mode 100644 src/test/modules/test_radixtree/sql/test_radixtree.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree--1.0.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree.c
create mode 100644 src/test/modules/test_radixtree/test_radixtree.control
diff --git a/src/backend/utils/mmgr/dsa.c b/src/backend/utils/mmgr/dsa.c
index f5a62061a3..80555aefff 100644
--- a/src/backend/utils/mmgr/dsa.c
+++ b/src/backend/utils/mmgr/dsa.c
@@ -1024,6 +1024,18 @@ dsa_set_size_limit(dsa_area *area, size_t limit)
LWLockRelease(DSA_AREA_LOCK(area));
}
+size_t
+dsa_get_total_size(dsa_area *area)
+{
+ size_t size;
+
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_SHARED);
+ size = area->control->total_segment_size;
+ LWLockRelease(DSA_AREA_LOCK(area));
+
+ return size;
+}
+
/*
* Aggressively free all spare memory in the hope of returning DSM segments to
* the operating system.
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
new file mode 100644
index 0000000000..d6919aef08
--- /dev/null
+++ b/src/include/lib/radixtree.h
@@ -0,0 +1,2516 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.h
+ * Template for adaptive radix tree.
+ *
+ * This module employs the idea from the paper "The Adaptive Radix Tree: ARTful
+ * Indexing for Main-Memory Databases" by Viktor Leis, Alfons Kemper, and Thomas
+ * Neumann, 2013. The radix tree uses adaptive node sizes, a small number of node
+ * types, each with a different numbers of elements. Depending on the number of
+ * children, the appropriate node type is used.
+ *
+ * WIP: notes about traditional radix tree trading off span vs height...
+ *
+ * There are two kinds of nodes, inner nodes and leaves. Inner nodes
+ * map partial keys to child pointers.
+ *
+ * The ART paper mentions three ways to implement leaves:
+ *
+ * "- Single-value leaves: The values are stored using an addi-
+ * tional leaf node type which stores one value.
+ * - Multi-value leaves: The values are stored in one of four
+ * different leaf node types, which mirror the structure of
+ * inner nodes, but contain values instead of pointers.
+ * - Combined pointer/value slots: If values fit into point-
+ * ers, no separate node types are necessary. Instead, each
+ * pointer storage location in an inner node can either
+ * store a pointer or a value."
+ *
+ * We chose "multi-value leaves" to avoid the additional pointer traversal
+ * required by "single-value leaves"
+ *
+ * For simplicity, the key is assumed to be 64-bit unsigned integer. The
+ * tree doesn't need to contain paths where the highest bytes of all keys
+ * are zero. That way, the tree's height adapts to the distribution of keys.
+ *
+ * TODO: In the future it might be worthwhile to offer configurability of
+ * leaf implementation for different use cases. Single-value leaves would
+ * give more flexibility in key type, including variable-length keys.
+ *
+ * There are some optimizations not yet implemented, particularly path
+ * compression and lazy path expansion.
+ *
+ * To handle concurrency, we use a single reader-writer lock for the radix
+ * tree. The radix tree is exclusively locked during write operations such
+ * as RT_SET() and RT_DELETE(), and shared locked during read operations
+ * such as RT_SEARCH(). An iteration also holds the shared lock on the radix
+ * tree until it is completed.
+ *
+ * TODO: The current locking mechanism is not optimized for high concurrency
+ * with mixed read-write workloads. In the future it might be worthwhile
+ * to replace it with the Optimistic Lock Coupling or ROWEX mentioned in
+ * the paper "The ART of Practical Synchronization" by the same authors as
+ * the ART paper, 2016.
+ *
+ * WIP: the radix tree nodes don't shrink.
+ *
+ * To generate a radix tree and associated functions for a use case several
+ * macros have to be #define'ed before this file is included. Including
+ * the file #undef's all those, so a new radix tree can be generated
+ * afterwards.
+ * The relevant parameters are:
+ * - RT_PREFIX - prefix for all symbol names generated. A prefix of 'foo'
+ * will result in radix tree type 'foo_radix_tree' and functions like
+ * 'foo_create'/'foo_free' and so forth.
+ * - RT_DECLARE - if defined function prototypes and type declarations are
+ * generated
+ * - RT_DEFINE - if defined function definitions are generated
+ * - RT_SCOPE - in which scope (e.g. extern, static inline) do function
+ * declarations reside
+ * - RT_VALUE_TYPE - the type of the value.
+ *
+ * Optional parameters:
+ * - RT_SHMEM - if defined, the radix tree is created in the DSA area
+ * so that multiple processes can access it simultaneously.
+ * - RT_DEBUG - if defined add stats tracking and debugging functions
+ *
+ * Interface
+ * ---------
+ *
+ * RT_CREATE - Create a new, empty radix tree
+ * RT_FREE - Free the radix tree
+ * RT_SEARCH - Search a key-value pair
+ * RT_SET - Set a key-value pair
+ * RT_BEGIN_ITERATE - Begin iterating through all key-value pairs
+ * RT_ITERATE_NEXT - Return next key-value pair, if any
+ * RT_END_ITERATE - End iteration
+ * RT_MEMORY_USAGE - Get the memory usage
+ *
+ * Interface for Shared Memory
+ * ---------
+ *
+ * RT_ATTACH - Attach to the radix tree
+ * RT_DETACH - Detach from the radix tree
+ * RT_GET_HANDLE - Return the handle of the radix tree
+ *
+ * Optional Interface
+ * ---------
+ *
+ * RT_DELETE - Delete a key-value pair. Declared/defined only if RT_USE_DELETE is defined
+ *
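+ * For example, a minimal local (non-shared) instantiation mapping uint64
+ * keys to uint64 values could look like the following sketch (the prefix
+ * "rt" and the value type are arbitrary choices for illustration; any
+ * RT_PREFIX and RT_VALUE_TYPE work the same way):
+ *
+ *   #define RT_PREFIX rt
+ *   #define RT_SCOPE static
+ *   #define RT_DECLARE
+ *   #define RT_DEFINE
+ *   #define RT_VALUE_TYPE uint64
+ *   #include "lib/radixtree.h"
+ *
+ *   rt_radix_tree *tree;
+ *   uint64 key = 1;
+ *   uint64 val = 42;
+ *
+ *   tree = rt_create(CurrentMemoryContext);
+ *   rt_set(tree, key, &val);
+ *   if (rt_search(tree, key, &val))
+ *       ...
+ *   rt_free(tree);
+ *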
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/lib/radixtree.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "lib/stringinfo.h"
+#include "miscadmin.h"
+#include "nodes/bitmapset.h"
+#include "port/pg_bitutils.h"
+#include "port/simd.h"
+#include "utils/dsa.h"
+#include "utils/memutils.h"
+
+/* helpers */
+#define RT_MAKE_PREFIX(a) CppConcat(a,_)
+#define RT_MAKE_NAME(name) RT_MAKE_NAME_(RT_MAKE_PREFIX(RT_PREFIX),name)
+#define RT_MAKE_NAME_(a,b) CppConcat(a,b)
+
+/* function declarations */
+#define RT_CREATE RT_MAKE_NAME(create)
+#define RT_FREE RT_MAKE_NAME(free)
+#define RT_SEARCH RT_MAKE_NAME(search)
+#ifdef RT_SHMEM
+#define RT_ATTACH RT_MAKE_NAME(attach)
+#define RT_DETACH RT_MAKE_NAME(detach)
+#define RT_GET_HANDLE RT_MAKE_NAME(get_handle)
+#endif
+#define RT_SET RT_MAKE_NAME(set)
+#define RT_BEGIN_ITERATE RT_MAKE_NAME(begin_iterate)
+#define RT_ITERATE_NEXT RT_MAKE_NAME(iterate_next)
+#define RT_END_ITERATE RT_MAKE_NAME(end_iterate)
+#ifdef RT_USE_DELETE
+#define RT_DELETE RT_MAKE_NAME(delete)
+#endif
+#define RT_MEMORY_USAGE RT_MAKE_NAME(memory_usage)
+#ifdef RT_DEBUG
+#define RT_DUMP RT_MAKE_NAME(dump)
+#define RT_DUMP_NODE RT_MAKE_NAME(dump_node)
+#define RT_DUMP_SEARCH RT_MAKE_NAME(dump_search)
+#define RT_STATS RT_MAKE_NAME(stats)
+#endif
+
+/* internal helper functions (no externally visible prototypes) */
+#define RT_NEW_ROOT RT_MAKE_NAME(new_root)
+#define RT_ALLOC_NODE RT_MAKE_NAME(alloc_node)
+#define RT_INIT_NODE RT_MAKE_NAME(init_node)
+#define RT_FREE_NODE RT_MAKE_NAME(free_node)
+#define RT_FREE_RECURSE RT_MAKE_NAME(free_recurse)
+#define RT_EXTEND RT_MAKE_NAME(extend)
+#define RT_SET_EXTEND RT_MAKE_NAME(set_extend)
+#define RT_SWITCH_NODE_KIND RT_MAKE_NAME(grow_node_kind)
+#define RT_COPY_NODE RT_MAKE_NAME(copy_node)
+#define RT_REPLACE_NODE RT_MAKE_NAME(replace_node)
+#define RT_PTR_GET_LOCAL RT_MAKE_NAME(ptr_get_local)
+#define RT_PTR_ALLOC_IS_VALID RT_MAKE_NAME(ptr_stored_is_valid)
+#define RT_NODE_3_SEARCH_EQ RT_MAKE_NAME(node_3_search_eq)
+#define RT_NODE_32_SEARCH_EQ RT_MAKE_NAME(node_32_search_eq)
+#define RT_NODE_3_GET_INSERTPOS RT_MAKE_NAME(node_3_get_insertpos)
+#define RT_NODE_32_GET_INSERTPOS RT_MAKE_NAME(node_32_get_insertpos)
+#define RT_CHUNK_CHILDREN_ARRAY_SHIFT RT_MAKE_NAME(chunk_children_array_shift)
+#define RT_CHUNK_VALUES_ARRAY_SHIFT RT_MAKE_NAME(chunk_values_array_shift)
+#define RT_CHUNK_CHILDREN_ARRAY_DELETE RT_MAKE_NAME(chunk_children_array_delete)
+#define RT_CHUNK_VALUES_ARRAY_DELETE RT_MAKE_NAME(chunk_values_array_delete)
+#define RT_CHUNK_CHILDREN_ARRAY_COPY RT_MAKE_NAME(chunk_children_array_copy)
+#define RT_CHUNK_VALUES_ARRAY_COPY RT_MAKE_NAME(chunk_values_array_copy)
+#define RT_NODE_125_IS_CHUNK_USED RT_MAKE_NAME(node_125_is_chunk_used)
+#define RT_NODE_INNER_125_GET_CHILD RT_MAKE_NAME(node_inner_125_get_child)
+#define RT_NODE_LEAF_125_GET_VALUE RT_MAKE_NAME(node_leaf_125_get_value)
+#define RT_NODE_INNER_256_IS_CHUNK_USED RT_MAKE_NAME(node_inner_256_is_chunk_used)
+#define RT_NODE_LEAF_256_IS_CHUNK_USED RT_MAKE_NAME(node_leaf_256_is_chunk_used)
+#define RT_NODE_INNER_256_GET_CHILD RT_MAKE_NAME(node_inner_256_get_child)
+#define RT_NODE_LEAF_256_GET_VALUE RT_MAKE_NAME(node_leaf_256_get_value)
+#define RT_NODE_INNER_256_SET RT_MAKE_NAME(node_inner_256_set)
+#define RT_NODE_LEAF_256_SET RT_MAKE_NAME(node_leaf_256_set)
+#define RT_NODE_INNER_256_DELETE RT_MAKE_NAME(node_inner_256_delete)
+#define RT_NODE_LEAF_256_DELETE RT_MAKE_NAME(node_leaf_256_delete)
+#define RT_KEY_GET_SHIFT RT_MAKE_NAME(key_get_shift)
+#define RT_SHIFT_GET_MAX_VAL RT_MAKE_NAME(shift_get_max_val)
+#define RT_NODE_SEARCH_INNER RT_MAKE_NAME(node_search_inner)
+#define RT_NODE_SEARCH_LEAF RT_MAKE_NAME(node_search_leaf)
+#define RT_NODE_UPDATE_INNER RT_MAKE_NAME(node_update_inner)
+#define RT_NODE_DELETE_INNER RT_MAKE_NAME(node_delete_inner)
+#define RT_NODE_DELETE_LEAF RT_MAKE_NAME(node_delete_leaf)
+#define RT_NODE_INSERT_INNER RT_MAKE_NAME(node_insert_inner)
+#define RT_NODE_INSERT_LEAF RT_MAKE_NAME(node_insert_leaf)
+#define RT_NODE_INNER_ITERATE_NEXT RT_MAKE_NAME(node_inner_iterate_next)
+#define RT_NODE_LEAF_ITERATE_NEXT RT_MAKE_NAME(node_leaf_iterate_next)
+#define RT_UPDATE_ITER_STACK RT_MAKE_NAME(update_iter_stack)
+#define RT_ITER_UPDATE_KEY RT_MAKE_NAME(iter_update_key)
+#define RT_VERIFY_NODE RT_MAKE_NAME(verify_node)
+
+/* type declarations */
+#define RT_RADIX_TREE RT_MAKE_NAME(radix_tree)
+#define RT_RADIX_TREE_CONTROL RT_MAKE_NAME(radix_tree_control)
+#define RT_ITER RT_MAKE_NAME(iter)
+#ifdef RT_SHMEM
+#define RT_HANDLE RT_MAKE_NAME(handle)
+#endif
+#define RT_NODE RT_MAKE_NAME(node)
+#define RT_NODE_ITER RT_MAKE_NAME(node_iter)
+#define RT_NODE_BASE_3 RT_MAKE_NAME(node_base_3)
+#define RT_NODE_BASE_32 RT_MAKE_NAME(node_base_32)
+#define RT_NODE_BASE_125 RT_MAKE_NAME(node_base_125)
+#define RT_NODE_BASE_256 RT_MAKE_NAME(node_base_256)
+#define RT_NODE_INNER_3 RT_MAKE_NAME(node_inner_3)
+#define RT_NODE_INNER_32 RT_MAKE_NAME(node_inner_32)
+#define RT_NODE_INNER_125 RT_MAKE_NAME(node_inner_125)
+#define RT_NODE_INNER_256 RT_MAKE_NAME(node_inner_256)
+#define RT_NODE_LEAF_3 RT_MAKE_NAME(node_leaf_3)
+#define RT_NODE_LEAF_32 RT_MAKE_NAME(node_leaf_32)
+#define RT_NODE_LEAF_125 RT_MAKE_NAME(node_leaf_125)
+#define RT_NODE_LEAF_256 RT_MAKE_NAME(node_leaf_256)
+#define RT_SIZE_CLASS RT_MAKE_NAME(size_class)
+#define RT_SIZE_CLASS_ELEM RT_MAKE_NAME(size_class_elem)
+#define RT_SIZE_CLASS_INFO RT_MAKE_NAME(size_class_info)
+#define RT_CLASS_3 RT_MAKE_NAME(class_3)
+#define RT_CLASS_32_MIN RT_MAKE_NAME(class_32_min)
+#define RT_CLASS_32_MAX RT_MAKE_NAME(class_32_max)
+#define RT_CLASS_125 RT_MAKE_NAME(class_125)
+#define RT_CLASS_256 RT_MAKE_NAME(class_256)
+
+/* generate forward declarations necessary to use the radix tree */
+#ifdef RT_DECLARE
+
+typedef struct RT_RADIX_TREE RT_RADIX_TREE;
+typedef struct RT_ITER RT_ITER;
+
+#ifdef RT_SHMEM
+typedef dsa_pointer RT_HANDLE;
+#endif
+
+#ifdef RT_SHMEM
+RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx, dsa_area *dsa, int tranche_id);
+RT_SCOPE RT_RADIX_TREE * RT_ATTACH(dsa_area *dsa, dsa_pointer dp);
+RT_SCOPE void RT_DETACH(RT_RADIX_TREE *tree);
+RT_SCOPE RT_HANDLE RT_GET_HANDLE(RT_RADIX_TREE *tree);
+#else
+RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx);
+#endif
+RT_SCOPE void RT_FREE(RT_RADIX_TREE *tree);
+
+RT_SCOPE bool RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p);
+RT_SCOPE bool RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p);
+#ifdef RT_USE_DELETE
+RT_SCOPE bool RT_DELETE(RT_RADIX_TREE *tree, uint64 key);
+#endif
+
+RT_SCOPE RT_ITER * RT_BEGIN_ITERATE(RT_RADIX_TREE *tree);
+RT_SCOPE bool RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, RT_VALUE_TYPE *value_p);
+RT_SCOPE void RT_END_ITERATE(RT_ITER *iter);
+
+RT_SCOPE uint64 RT_MEMORY_USAGE(RT_RADIX_TREE *tree);
+
+#ifdef RT_DEBUG
+RT_SCOPE void RT_DUMP(RT_RADIX_TREE *tree);
+RT_SCOPE void RT_DUMP_SEARCH(RT_RADIX_TREE *tree, uint64 key);
+RT_SCOPE void RT_STATS(RT_RADIX_TREE *tree);
+#endif
+
+#endif /* RT_DECLARE */
+
+
+/* generate implementation of the radix tree */
+#ifdef RT_DEFINE
+
+/* The number of bits encoded in one tree level */
+#define RT_NODE_SPAN BITS_PER_BYTE
+
+/* The number of maximum slots in the node */
+#define RT_NODE_MAX_SLOTS (1 << RT_NODE_SPAN)
+
+/* Mask for extracting a chunk from the key */
+#define RT_CHUNK_MASK ((1 << RT_NODE_SPAN) - 1)
+
+/* Maximum shift the radix tree uses */
+#define RT_MAX_SHIFT RT_KEY_GET_SHIFT(UINT64_MAX)
+
+/* Tree level the radix tree uses */
+#define RT_MAX_LEVEL ((sizeof(uint64) * BITS_PER_BYTE) / RT_NODE_SPAN)
+
+/*
+ * Number of bits necessary for isset array in the slot-index node.
+ * Since bitmapword can be 64 bits, the only values that make sense
+ * here are 64 and 128.
+ */
+#define RT_SLOT_IDX_LIMIT (RT_NODE_MAX_SLOTS / 2)
+
+/* Invalid index used in node-125 */
+#define RT_INVALID_SLOT_IDX 0xFF
+
+/* Get a chunk from the key */
+#define RT_GET_KEY_CHUNK(key, shift) ((uint8) (((key) >> (shift)) & RT_CHUNK_MASK))
+
+/* For accessing bitmaps */
+#define RT_BM_IDX(x) ((x) / BITS_PER_BITMAPWORD)
+#define RT_BM_BIT(x) ((x) % BITS_PER_BITMAPWORD)
+
+/*
+ * Node kinds
+ *
+ * The different node kinds are what make the tree "adaptive".
+ *
+ * Each node kind is associated with a different datatype and different
+ * search/set/delete/iterate algorithms adapted for its size. The largest
+ * kind, node256 is basically the same as a traditional radix tree,
+ * and would be most wasteful of memory when sparsely populated. The
+ * smaller nodes expend some additional CPU time to enable a smaller
+ * memory footprint.
+ *
+ * XXX There are 4 node kinds, and this should never be increased,
+ * for several reasons:
+ * 1. With 5 or more kinds, gcc tends to use a jump table for switch
+ * statements.
+ * 2. The 4 kinds can be represented with 2 bits, so we have the option
+ * in the future to tag the node pointer with the kind, even on
+ * platforms with 32-bit pointers. This might speed up node traversal
+ * in trees with highly random node kinds.
+ * 3. We can have multiple size classes per node kind.
+ */
+#define RT_NODE_KIND_3 0x00
+#define RT_NODE_KIND_32 0x01
+#define RT_NODE_KIND_125 0x02
+#define RT_NODE_KIND_256 0x03
+#define RT_NODE_KIND_COUNT 4
+
+/*
+ * Calculate the slab blocksize so that we can allocate at least 32 chunks
+ * from the block.
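+ *
+ * For example, assuming the default 8kB slab block size, a chunk size of
+ * 296 bytes gives (8192 / 296) * 296 = 7992 bytes, which is less than
+ * 296 * 32 = 9472, so the block size is rounded up to 9472 bytes.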
+ */
+#define RT_SLAB_BLOCK_SIZE(size) \
+ Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * (size), (size) * 32)
+
+/* Common type for all nodes types */
+typedef struct RT_NODE
+{
+ /*
+ * Number of children. We use uint16 to be able to indicate 256 children
+ * at the fanout of 8.
+ */
+ uint16 count;
+
+ /*
+ * Max capacity for the current size class. Storing this in the
+ * node enables multiple size classes per node kind.
+ * Technically, kinds with a single size class don't need this, so we could
+ * keep this in the individual base types, but the code is simpler this way.
+ * Note: node256 is unique in that it cannot possibly have more than a
+ * single size class, so for that kind we store zero, and uint8 is
+ * sufficient for other kinds.
+ */
+ uint8 fanout;
+
+ /*
+ * Shift indicates which part of the key space is represented by this
+ * node. That is, the key is shifted by 'shift' and the lowest
+ * RT_NODE_SPAN bits are then represented in chunk.
+ */
+ uint8 shift;
+
+ /* Node kind, one per search/set algorithm */
+ uint8 kind;
+} RT_NODE;
+
+
+#define RT_PTR_LOCAL RT_NODE *
+
+#ifdef RT_SHMEM
+#define RT_PTR_ALLOC dsa_pointer
+#else
+#define RT_PTR_ALLOC RT_PTR_LOCAL
+#endif
+
+
+#ifdef RT_SHMEM
+#define RT_INVALID_PTR_ALLOC InvalidDsaPointer
+#else
+#define RT_INVALID_PTR_ALLOC NULL
+#endif
+
+#ifdef RT_SHMEM
+#define RT_LOCK_EXCLUSIVE(tree) LWLockAcquire(&tree->ctl->lock, LW_EXCLUSIVE)
+#define RT_LOCK_SHARED(tree) LWLockAcquire(&tree->ctl->lock, LW_SHARED)
+#define RT_UNLOCK(tree) LWLockRelease(&tree->ctl->lock);
+#else
+#define RT_LOCK_EXCLUSIVE(tree) ((void) 0)
+#define RT_LOCK_SHARED(tree) ((void) 0)
+#define RT_UNLOCK(tree) ((void) 0)
+#endif
+
+/*
+ * Inner nodes and leaf nodes have analogous structure. To distinguish
+ * them at runtime, we take advantage of the fact that the key chunk
+ * is accessed by shifting: inner tree nodes (shift > 0) store pointers
+ * to child nodes in their slots. In leaf nodes (shift == 0),
+ * the slot contains the value corresponding to the key.
+ */
+#define RT_NODE_IS_LEAF(n) (((RT_PTR_LOCAL) (n))->shift == 0)
+
+#define RT_NODE_MUST_GROW(node) \
+ ((node)->base.n.count == (node)->base.n.fanout)
+
+/*
+ * Base types for each node kind, for leaf and inner nodes.
+ * The base types must be able to accommodate the largest size
+ * class for variable-sized node kinds.
+ */
+typedef struct RT_NODE_BASE_3
+{
+ RT_NODE n;
+
+ /* 3 children, for key chunks */
+ uint8 chunks[3];
+} RT_NODE_BASE_3;
+
+typedef struct RT_NODE_BASE_32
+{
+ RT_NODE n;
+
+ /* 32 children, for key chunks */
+ uint8 chunks[32];
+} RT_NODE_BASE_32;
+
+/*
+ * node-125 uses a slot_idxs array of RT_NODE_MAX_SLOTS length
+ * to store indexes into a second array that contains the values (or
+ * child pointers).
+ */
+typedef struct RT_NODE_BASE_125
+{
+ RT_NODE n;
+
+ /* Index into the children/values array, for each possible chunk value */
+ uint8 slot_idxs[RT_NODE_MAX_SLOTS];
+
+ /* bitmap to track which slots are in use */
+ bitmapword isset[RT_BM_IDX(RT_SLOT_IDX_LIMIT)];
+} RT_NODE_BASE_125;
+
+typedef struct RT_NODE_BASE_256
+{
+ RT_NODE n;
+} RT_NODE_BASE_256;
+
+/*
+ * Inner and leaf nodes.
+ *
+ * These are separate because the value type might be different than
+ * something fitting into a pointer-width type.
+ */
+typedef struct RT_NODE_INNER_3
+{
+ RT_NODE_BASE_3 base;
+
+ /* number of children depends on size class */
+ RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_INNER_3;
+
+typedef struct RT_NODE_LEAF_3
+{
+ RT_NODE_BASE_3 base;
+
+ /* number of values depends on size class */
+ RT_VALUE_TYPE values[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_LEAF_3;
+
+typedef struct RT_NODE_INNER_32
+{
+ RT_NODE_BASE_32 base;
+
+ /* number of children depends on size class */
+ RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_INNER_32;
+
+typedef struct RT_NODE_LEAF_32
+{
+ RT_NODE_BASE_32 base;
+
+ /* number of values depends on size class */
+ RT_VALUE_TYPE values[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_LEAF_32;
+
+typedef struct RT_NODE_INNER_125
+{
+ RT_NODE_BASE_125 base;
+
+ /* number of children depends on size class */
+ RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_INNER_125;
+
+typedef struct RT_NODE_LEAF_125
+{
+ RT_NODE_BASE_125 base;
+
+ /* number of values depends on size class */
+ RT_VALUE_TYPE values[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_LEAF_125;
+
+/*
+ * node-256 is the largest node type. This node has an array
+ * for directly storing values (or child pointers in inner nodes).
+ * Unlike other node kinds, its array size is by definition
+ * fixed.
+ */
+typedef struct RT_NODE_INNER_256
+{
+ RT_NODE_BASE_256 base;
+
+ /* Slots for 256 children */
+ RT_PTR_ALLOC children[RT_NODE_MAX_SLOTS];
+} RT_NODE_INNER_256;
+
+typedef struct RT_NODE_LEAF_256
+{
+ RT_NODE_BASE_256 base;
+
+ /*
+ * Unlike with inner256, zero is a valid value here, so we use a
+ * bitmap to track which slots are in use.
+ */
+ bitmapword isset[RT_BM_IDX(RT_NODE_MAX_SLOTS)];
+
+ /* Slots for 256 values */
+ RT_VALUE_TYPE values[RT_NODE_MAX_SLOTS];
+} RT_NODE_LEAF_256;
+
+/*
+ * Node size classes
+ *
+ * Nodes of different kinds necessarily belong to different size classes.
+ * The main innovation in our implementation compared to the ART paper
+ * is decoupling the notion of size class from kind.
+ *
+ * The size classes within a given node kind have the same underlying
+ * type, but a variable number of children/values. This is possible
+ * because the base type contains small fixed data structures that
+ * work the same way regardless of how full the node is. We store the
+ * node's allocated capacity in the "fanout" member of RT_NODE, to allow
+ * runtime introspection.
+ *
+ * Growing from one node kind to another requires special code for each
+ * case, but growing from one size class to another within the same kind
+ * is basically just allocate + memcpy.
+ *
+ * The size classes have been chosen so that inner nodes on platforms
+ * with 64-bit pointers (and leaf nodes when using a 64-bit key) are
+ * equal to or slightly smaller than some DSA size class.
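+ *
+ * For instance, RT_CLASS_32_MIN (fanout 15) and RT_CLASS_32_MAX (fanout 32)
+ * below share the node-32 base type, so growing from the former to the
+ * latter only requires allocating the larger node and copying the contents,
+ * with no change to the search or insert logic.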
+ */
+typedef enum RT_SIZE_CLASS
+{
+ RT_CLASS_3 = 0,
+ RT_CLASS_32_MIN,
+ RT_CLASS_32_MAX,
+ RT_CLASS_125,
+ RT_CLASS_256
+} RT_SIZE_CLASS;
+
+/* Information for each size class */
+typedef struct RT_SIZE_CLASS_ELEM
+{
+ const char *name;
+ int fanout;
+
+ /* slab chunk size */
+ Size inner_size;
+ Size leaf_size;
+} RT_SIZE_CLASS_ELEM;
+
+static const RT_SIZE_CLASS_ELEM RT_SIZE_CLASS_INFO[] = {
+ [RT_CLASS_3] = {
+ .name = "radix tree node 3",
+ .fanout = 3,
+ .inner_size = sizeof(RT_NODE_INNER_3) + 3 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_3) + 3 * sizeof(RT_VALUE_TYPE),
+ },
+ [RT_CLASS_32_MIN] = {
+ .name = "radix tree node 15",
+ .fanout = 15,
+ .inner_size = sizeof(RT_NODE_INNER_32) + 15 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_32) + 15 * sizeof(RT_VALUE_TYPE),
+ },
+ [RT_CLASS_32_MAX] = {
+ .name = "radix tree node 32",
+ .fanout = 32,
+ .inner_size = sizeof(RT_NODE_INNER_32) + 32 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_32) + 32 * sizeof(RT_VALUE_TYPE),
+ },
+ [RT_CLASS_125] = {
+ .name = "radix tree node 125",
+ .fanout = 125,
+ .inner_size = sizeof(RT_NODE_INNER_125) + 125 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_125) + 125 * sizeof(RT_VALUE_TYPE),
+ },
+ [RT_CLASS_256] = {
+ .name = "radix tree node 256",
+ .fanout = 256,
+ .inner_size = sizeof(RT_NODE_INNER_256),
+ .leaf_size = sizeof(RT_NODE_LEAF_256),
+ },
+};
+
+#define RT_SIZE_CLASS_COUNT lengthof(RT_SIZE_CLASS_INFO)
+
+#ifdef RT_SHMEM
+/* A magic value used to identify our radix tree */
+#define RT_RADIX_TREE_MAGIC 0x54A48167
+#endif
+
+/* Contains the actual tree and ancillary info */
+// WIP: this name is a bit strange
+typedef struct RT_RADIX_TREE_CONTROL
+{
+#ifdef RT_SHMEM
+ RT_HANDLE handle;
+ uint32 magic;
+ LWLock lock;
+#endif
+
+ RT_PTR_ALLOC root;
+ uint64 max_val;
+ uint64 num_keys;
+
+ /* statistics */
+#ifdef RT_DEBUG
+ int32 cnt[RT_SIZE_CLASS_COUNT];
+#endif
+} RT_RADIX_TREE_CONTROL;
+
+/* Entry point for allocating and accessing the tree */
+typedef struct RT_RADIX_TREE
+{
+ MemoryContext context;
+
+ /* pointing to either local memory or DSA */
+ RT_RADIX_TREE_CONTROL *ctl;
+
+#ifdef RT_SHMEM
+ dsa_area *dsa;
+#else
+ MemoryContextData *inner_slabs[RT_SIZE_CLASS_COUNT];
+ MemoryContextData *leaf_slabs[RT_SIZE_CLASS_COUNT];
+#endif
+} RT_RADIX_TREE;
+
+/*
+ * Iteration support.
+ *
+ * Iterating the radix tree returns each pair of key and value in the ascending
+ * order of the key. To support this, we iterate over the nodes of each level.
+ *
+ * RT_NODE_ITER struct is used to track the iteration within a node.
+ *
+ * RT_ITER is the struct for iteration of the radix tree, and uses RT_NODE_ITER
+ * in order to track the iteration of each level. During iteration, we also
+ * construct the key whenever updating the node iteration information, e.g., when
+ * advancing the current index within the node or when moving to the next node
+ * at the same level.
+ *
+ * XXX: Currently we allow only one process to do iteration. Therefore, rt_node_iter
+ * has the local pointers to nodes, rather than RT_PTR_ALLOC.
+ * We need either a safeguard to disallow other processes from beginning an
+ * iteration while one is in progress, or support for multiple concurrent iterations.
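+ *
+ * A minimal usage sketch, assuming an instantiation like the hypothetical
+ * rt_* example in the file header comment:
+ *
+ *   rt_iter *iter = rt_begin_iterate(tree);
+ *   uint64 key;
+ *   uint64 val;
+ *
+ *   while (rt_iterate_next(iter, &key, &val))
+ *       ... process (key, val) in ascending key order ...
+ *   rt_end_iterate(iter);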
+ */
+typedef struct RT_NODE_ITER
+{
+ RT_PTR_LOCAL node; /* current node being iterated */
+ int current_idx; /* current position. -1 for initial value */
+} RT_NODE_ITER;
+
+typedef struct RT_ITER
+{
+ RT_RADIX_TREE *tree;
+
+ /* Track the iteration on nodes of each level */
+ RT_NODE_ITER stack[RT_MAX_LEVEL];
+ int stack_len;
+
+ /* The key is constructed during iteration */
+ uint64 key;
+} RT_ITER;
+
+
+static bool RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
+ uint64 key, RT_PTR_ALLOC child);
+static bool RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
+ uint64 key, RT_VALUE_TYPE *value_p);
+
+/* verification (available only with assertion) */
+static void RT_VERIFY_NODE(RT_PTR_LOCAL node);
+
+/* Get the local address of an allocated node */
+static inline RT_PTR_LOCAL
+RT_PTR_GET_LOCAL(RT_RADIX_TREE *tree, RT_PTR_ALLOC node)
+{
+#ifdef RT_SHMEM
+ return dsa_get_address(tree->dsa, (dsa_pointer) node);
+#else
+ return node;
+#endif
+}
+
+static inline bool
+RT_PTR_ALLOC_IS_VALID(RT_PTR_ALLOC ptr)
+{
+#ifdef RT_SHMEM
+ return DsaPointerIsValid(ptr);
+#else
+ return PointerIsValid(ptr);
+#endif
+}
+
+/*
+ * Return index of the first element in the node's chunk array that equals
+ * 'chunk'. Return -1 if there is no such element.
+ */
+static inline int
+RT_NODE_3_SEARCH_EQ(RT_NODE_BASE_3 *node, uint8 chunk)
+{
+ int idx = -1;
+
+ for (int i = 0; i < node->n.count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ idx = i;
+ break;
+ }
+ }
+
+ return idx;
+}
+
+/*
+ * Return index of the chunk and slot arrays for inserting into the node,
+ * such that the chunk array remains ordered.
+ */
+static inline int
+RT_NODE_3_GET_INSERTPOS(RT_NODE_BASE_3 *node, uint8 chunk)
+{
+ int idx;
+
+ for (idx = 0; idx < node->n.count; idx++)
+ {
+ if (node->chunks[idx] >= chunk)
+ break;
+ }
+
+ return idx;
+}
+
+/*
+ * Return index of the first element in the node's chunk array that equals
+ * 'chunk'. Return -1 if there is no such element.
+ */
+static inline int
+RT_NODE_32_SEARCH_EQ(RT_NODE_BASE_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ uint32 bitfield;
+ int index_simd = -1;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index = -1;
+
+ for (int i = 0; i < count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ index = i;
+ break;
+ }
+ }
+#endif
+
+#ifndef USE_NO_SIMD
+ /* replicate the search key */
+ spread_chunk = vector8_broadcast(chunk);
+
+ /* compare to all 32 keys stored in the node */
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ cmp1 = vector8_eq(spread_chunk, haystack1);
+ cmp2 = vector8_eq(spread_chunk, haystack2);
+
+ /* convert comparison to a bitfield */
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+
+ /* mask off invalid entries */
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ /* convert bitfield to index by counting trailing zeros */
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+/*
+ * Return index of the chunk and slot arrays for inserting into the node,
+ * such that the chunk array remains ordered.
+ */
+static inline int
+RT_NODE_32_GET_INSERTPOS(RT_NODE_BASE_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ Vector8 min1;
+ Vector8 min2;
+ uint32 bitfield;
+ int index_simd;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index;
+
+ for (index = 0; index < count; index++)
+ {
+ /*
+ * This is coded with '>=' to match what we can do with SIMD,
+ * with an assert to keep us honest.
+ */
+ if (node->chunks[index] >= chunk)
+ {
+ Assert(node->chunks[index] != chunk);
+ break;
+ }
+ }
+#endif
+
+#ifndef USE_NO_SIMD
+ /*
+ * This is a bit more complicated than RT_NODE_32_SEARCH_EQ(), because
+ * no unsigned uint8 comparison instruction exists, at least for SSE2. So
+ * we need to play some trickery using vector8_min() to effectively get
+ * >=. There'll never be any equal elements in current uses, but that's
+ * what we get here...
+ */
+ spread_chunk = vector8_broadcast(chunk);
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ min1 = vector8_min(spread_chunk, haystack1);
+ min2 = vector8_min(spread_chunk, haystack2);
+ cmp1 = vector8_eq(spread_chunk, min1);
+ cmp2 = vector8_eq(spread_chunk, min2);
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+ else
+ index_simd = count;
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+
+/*
+ * Functions to manipulate both the chunks array and the children/values arrays.
+ * These are used for node-3 and node-32.
+ */
+
+/* Shift the elements right at 'idx' by one */
+static inline void
+RT_CHUNK_CHILDREN_ARRAY_SHIFT(uint8 *chunks, RT_PTR_ALLOC *children, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+ memmove(&(children[idx + 1]), &(children[idx]), sizeof(RT_PTR_ALLOC) * (count - idx));
+}
+
+static inline void
+RT_CHUNK_VALUES_ARRAY_SHIFT(uint8 *chunks, RT_VALUE_TYPE *values, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+ memmove(&(values[idx + 1]), &(values[idx]), sizeof(RT_VALUE_TYPE) * (count - idx));
+}
+
+/* Delete the element at 'idx' */
+static inline void
+RT_CHUNK_CHILDREN_ARRAY_DELETE(uint8 *chunks, RT_PTR_ALLOC *children, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(children[idx]), &(children[idx + 1]), sizeof(RT_PTR_ALLOC) * (count - idx - 1));
+}
+
+static inline void
+RT_CHUNK_VALUES_ARRAY_DELETE(uint8 *chunks, RT_VALUE_TYPE *values, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(values[idx]), &(values[idx + 1]), sizeof(RT_VALUE_TYPE) * (count - idx - 1));
+}
+
+/* Copy both chunks and children/values arrays */
+static inline void
+RT_CHUNK_CHILDREN_ARRAY_COPY(uint8 *src_chunks, RT_PTR_ALLOC *src_children,
+ uint8 *dst_chunks, RT_PTR_ALLOC *dst_children)
+{
+ const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_3].fanout;
+ const Size chunk_size = sizeof(uint8) * fanout;
+ const Size children_size = sizeof(RT_PTR_ALLOC) * fanout;
+
+ memcpy(dst_chunks, src_chunks, chunk_size);
+ memcpy(dst_children, src_children, children_size);
+}
+
+static inline void
+RT_CHUNK_VALUES_ARRAY_COPY(uint8 *src_chunks, RT_VALUE_TYPE *src_values,
+ uint8 *dst_chunks, RT_VALUE_TYPE *dst_values)
+{
+ const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_3].fanout;
+ const Size chunk_size = sizeof(uint8) * fanout;
+ const Size values_size = sizeof(RT_VALUE_TYPE) * fanout;
+
+ memcpy(dst_chunks, src_chunks, chunk_size);
+ memcpy(dst_values, src_values, values_size);
+}
+
+/* Functions to manipulate inner and leaf node-125 */
+
+/* Does the given chunk in the node have a value? */
+static inline bool
+RT_NODE_125_IS_CHUNK_USED(RT_NODE_BASE_125 *node, uint8 chunk)
+{
+ return node->slot_idxs[chunk] != RT_INVALID_SLOT_IDX;
+}
+
+static inline RT_PTR_ALLOC
+RT_NODE_INNER_125_GET_CHILD(RT_NODE_INNER_125 *node, uint8 chunk)
+{
+ Assert(!RT_NODE_IS_LEAF(node));
+ return node->children[node->base.slot_idxs[chunk]];
+}
+
+static inline RT_VALUE_TYPE
+RT_NODE_LEAF_125_GET_VALUE(RT_NODE_LEAF_125 *node, uint8 chunk)
+{
+ Assert(RT_NODE_IS_LEAF(node));
+ Assert(((RT_NODE_BASE_125 *) node)->slot_idxs[chunk] != RT_INVALID_SLOT_IDX);
+ return node->values[node->base.slot_idxs[chunk]];
+}
+
+/* Functions to manipulate inner and leaf node-256 */
+
+/* Return true if the slot corresponding to the given chunk is in use */
+static inline bool
+RT_NODE_INNER_256_IS_CHUNK_USED(RT_NODE_INNER_256 *node, uint8 chunk)
+{
+ Assert(!RT_NODE_IS_LEAF(node));
+ return node->children[chunk] != RT_INVALID_PTR_ALLOC;
+}
+
+static inline bool
+RT_NODE_LEAF_256_IS_CHUNK_USED(RT_NODE_LEAF_256 *node, uint8 chunk)
+{
+ int idx = RT_BM_IDX(chunk);
+ int bitnum = RT_BM_BIT(chunk);
+
+ Assert(RT_NODE_IS_LEAF(node));
+ return (node->isset[idx] & ((bitmapword) 1 << bitnum)) != 0;
+}
+
+static inline RT_PTR_ALLOC
+RT_NODE_INNER_256_GET_CHILD(RT_NODE_INNER_256 *node, uint8 chunk)
+{
+ Assert(!RT_NODE_IS_LEAF(node));
+ Assert(RT_NODE_INNER_256_IS_CHUNK_USED(node, chunk));
+ return node->children[chunk];
+}
+
+static inline RT_VALUE_TYPE
+RT_NODE_LEAF_256_GET_VALUE(RT_NODE_LEAF_256 *node, uint8 chunk)
+{
+ Assert(RT_NODE_IS_LEAF(node));
+ Assert(RT_NODE_LEAF_256_IS_CHUNK_USED(node, chunk));
+ return node->values[chunk];
+}
+
+/* Set the child in the node-256 */
+static inline void
+RT_NODE_INNER_256_SET(RT_NODE_INNER_256 *node, uint8 chunk, RT_PTR_ALLOC child)
+{
+ Assert(!RT_NODE_IS_LEAF(node));
+ node->children[chunk] = child;
+}
+
+/* Set the value in the node-256 */
+static inline void
+RT_NODE_LEAF_256_SET(RT_NODE_LEAF_256 *node, uint8 chunk, RT_VALUE_TYPE value)
+{
+ int idx = RT_BM_IDX(chunk);
+ int bitnum = RT_BM_BIT(chunk);
+
+ Assert(RT_NODE_IS_LEAF(node));
+ node->isset[idx] |= ((bitmapword) 1 << bitnum);
+ node->values[chunk] = value;
+}
+
+/* Delete the child at the given chunk position */
+static inline void
+RT_NODE_INNER_256_DELETE(RT_NODE_INNER_256 *node, uint8 chunk)
+{
+ Assert(!RT_NODE_IS_LEAF(node));
+ node->children[chunk] = RT_INVALID_PTR_ALLOC;
+}
+
+static inline void
+RT_NODE_LEAF_256_DELETE(RT_NODE_LEAF_256 *node, uint8 chunk)
+{
+ int idx = RT_BM_IDX(chunk);
+ int bitnum = RT_BM_BIT(chunk);
+
+ Assert(RT_NODE_IS_LEAF(node));
+ node->isset[idx] &= ~((bitmapword) 1 << bitnum);
+}
+
+/*
+ * Return the largest shift that allows storing the given key.
+ */
+static inline int
+RT_KEY_GET_SHIFT(uint64 key)
+{
+ if (key == 0)
+ return 0;
+ else
+ return (pg_leftmost_one_pos64(key) / RT_NODE_SPAN) * RT_NODE_SPAN;
+}
+
+/*
+ * Return the max value that can be stored in the tree with the given shift.
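+ *
+ * For example, RT_KEY_GET_SHIFT(0x1234) is 8 (the highest set bit is in the
+ * second-lowest byte), and a tree whose root has shift 8 can store any key
+ * up to RT_SHIFT_GET_MAX_VAL(8) = 0xFFFF.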
+ */
+static uint64
+RT_SHIFT_GET_MAX_VAL(int shift)
+{
+ if (shift == RT_MAX_SHIFT)
+ return UINT64_MAX;
+
+ return (UINT64CONST(1) << (shift + RT_NODE_SPAN)) - 1;
+}
+
+/*
+ * Allocate a new node with the given node kind.
+ */
+static RT_PTR_ALLOC
+RT_ALLOC_NODE(RT_RADIX_TREE *tree, RT_SIZE_CLASS size_class, bool is_leaf)
+{
+ RT_PTR_ALLOC allocnode;
+ size_t allocsize;
+
+ if (is_leaf)
+ allocsize = RT_SIZE_CLASS_INFO[size_class].leaf_size;
+ else
+ allocsize = RT_SIZE_CLASS_INFO[size_class].inner_size;
+
+#ifdef RT_SHMEM
+ allocnode = dsa_allocate(tree->dsa, allocsize);
+#else
+ if (is_leaf)
+ allocnode = (RT_PTR_ALLOC) MemoryContextAlloc(tree->leaf_slabs[size_class],
+ allocsize);
+ else
+ allocnode = (RT_PTR_ALLOC) MemoryContextAlloc(tree->inner_slabs[size_class],
+ allocsize);
+#endif
+
+#ifdef RT_DEBUG
+ /* update the statistics */
+ tree->ctl->cnt[size_class]++;
+#endif
+
+ return allocnode;
+}
+
+/* Initialize the node contents */
+static inline void
+RT_INIT_NODE(RT_PTR_LOCAL node, uint8 kind, RT_SIZE_CLASS size_class, bool is_leaf)
+{
+ if (is_leaf)
+ MemSet(node, 0, RT_SIZE_CLASS_INFO[size_class].leaf_size);
+ else
+ MemSet(node, 0, RT_SIZE_CLASS_INFO[size_class].inner_size);
+
+ node->kind = kind;
+
+ if (kind == RT_NODE_KIND_256)
+ /* See comment for the RT_NODE type */
+ Assert(node->fanout == 0);
+ else
+ node->fanout = RT_SIZE_CLASS_INFO[size_class].fanout;
+
+ /* Initialize slot_idxs to invalid values */
+ if (kind == RT_NODE_KIND_125)
+ {
+ RT_NODE_BASE_125 *n125 = (RT_NODE_BASE_125 *) node;
+
+ memset(n125->slot_idxs, RT_INVALID_SLOT_IDX, sizeof(n125->slot_idxs));
+ }
+}
+
+/*
+ * Create a new node as the root. Subordinate nodes will be created during
+ * the insertion.
+ */
+static void
+RT_NEW_ROOT(RT_RADIX_TREE *tree, uint64 key)
+{
+ int shift = RT_KEY_GET_SHIFT(key);
+ bool is_leaf = shift == 0;
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+
+ allocnode = RT_ALLOC_NODE(tree, RT_CLASS_3, is_leaf);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(newnode, RT_NODE_KIND_3, RT_CLASS_3, is_leaf);
+ newnode->shift = shift;
+ tree->ctl->max_val = RT_SHIFT_GET_MAX_VAL(shift);
+ tree->ctl->root = allocnode;
+}
+
+static inline void
+RT_COPY_NODE(RT_PTR_LOCAL newnode, RT_PTR_LOCAL oldnode)
+{
+ newnode->shift = oldnode->shift;
+ newnode->count = oldnode->count;
+}
+
+/*
+ * Given a newly allocated node and an old node, initialize the new
+ * node with the necessary fields and return its local pointer.
+ */
+static inline RT_PTR_LOCAL
+RT_SWITCH_NODE_KIND(RT_RADIX_TREE *tree, RT_PTR_ALLOC allocnode, RT_PTR_LOCAL node,
+ uint8 new_kind, uint8 new_class, bool is_leaf)
+{
+ RT_PTR_LOCAL newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(newnode, new_kind, new_class, is_leaf);
+ RT_COPY_NODE(newnode, node);
+
+ return newnode;
+}
+
+/* Free the given node */
+static void
+RT_FREE_NODE(RT_RADIX_TREE *tree, RT_PTR_ALLOC allocnode)
+{
+ /* If we're deleting the root node, make the tree empty */
+ if (tree->ctl->root == allocnode)
+ {
+ tree->ctl->root = RT_INVALID_PTR_ALLOC;
+ tree->ctl->max_val = 0;
+ }
+
+#ifdef RT_DEBUG
+ {
+ int i;
+ RT_PTR_LOCAL node = RT_PTR_GET_LOCAL(tree, allocnode);
+
+ /* update the statistics */
+ for (i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ if (node->fanout == RT_SIZE_CLASS_INFO[i].fanout)
+ break;
+ }
+
+ /* fanout of node256 is intentionally 0 */
+ if (i == RT_SIZE_CLASS_COUNT)
+ i = RT_CLASS_256;
+
+ tree->ctl->cnt[i]--;
+ Assert(tree->ctl->cnt[i] >= 0);
+ }
+#endif
+
+#ifdef RT_SHMEM
+ dsa_free(tree->dsa, allocnode);
+#else
+ pfree(allocnode);
+#endif
+}
+
+/* Update the parent's pointer when growing a node */
+static inline void
+RT_NODE_UPDATE_INNER(RT_PTR_LOCAL node, uint64 key, RT_PTR_ALLOC new_child)
+{
+#define RT_ACTION_UPDATE
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_INNER
+#undef RT_ACTION_UPDATE
+}
+
+/*
+ * Replace old_child with new_child, and free the old one.
+ */
+static void
+RT_REPLACE_NODE(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent,
+ RT_PTR_ALLOC stored_old_child, RT_PTR_LOCAL old_child,
+ RT_PTR_ALLOC new_child, uint64 key)
+{
+#ifdef USE_ASSERT_CHECKING
+ RT_PTR_LOCAL new = RT_PTR_GET_LOCAL(tree, new_child);
+
+ Assert(old_child->shift == new->shift);
+ Assert(old_child->count == new->count);
+#endif
+
+ if (parent == old_child)
+ {
+ /* Replace the root node with the new larger node */
+ tree->ctl->root = new_child;
+ }
+ else
+ RT_NODE_UPDATE_INNER(parent, key, new_child);
+
+ RT_FREE_NODE(tree, stored_old_child);
+}
+
+/*
+ * The radix tree doesn't have sufficient height. Extend the radix tree so
+ * it can store the key.
+ */
+static void
+RT_EXTEND(RT_RADIX_TREE *tree, uint64 key)
+{
+ int target_shift;
+ RT_PTR_LOCAL root = RT_PTR_GET_LOCAL(tree, tree->ctl->root);
+ int shift = root->shift + RT_NODE_SPAN;
+
+ target_shift = RT_KEY_GET_SHIFT(key);
+
+ /* Grow tree from 'shift' to 'target_shift' */
+ while (shift <= target_shift)
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL node;
+ RT_NODE_INNER_3 *n3;
+
+ /* these new upper nodes are inner nodes, not leaves */
+ allocnode = RT_ALLOC_NODE(tree, RT_CLASS_3, false);
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(node, RT_NODE_KIND_3, RT_CLASS_3, false);
+ node->shift = shift;
+ node->count = 1;
+
+ n3 = (RT_NODE_INNER_3 *) node;
+ n3->base.chunks[0] = 0;
+ n3->children[0] = tree->ctl->root;
+
+ /* Update the root */
+ tree->ctl->root = allocnode;
+
+ shift += RT_NODE_SPAN;
+ }
+
+ tree->ctl->max_val = RT_SHIFT_GET_MAX_VAL(target_shift);
+}
+
+/*
+ * The radix tree doesn't have inner and leaf nodes for the given key-value pair.
+ * Insert inner and leaf nodes from 'node' to bottom.
+ */
+static inline void
+RT_SET_EXTEND(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p, RT_PTR_LOCAL parent,
+ RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node)
+{
+ int shift = node->shift;
+
+ Assert(RT_PTR_GET_LOCAL(tree, stored_node) == node);
+
+ while (shift >= RT_NODE_SPAN)
+ {
+ RT_PTR_ALLOC allocchild;
+ RT_PTR_LOCAL newchild;
+ int newshift = shift - RT_NODE_SPAN;
+ bool is_leaf = newshift == 0;
+
+ allocchild = RT_ALLOC_NODE(tree, RT_CLASS_3, is_leaf);
+ newchild = RT_PTR_GET_LOCAL(tree, allocchild);
+ RT_INIT_NODE(newchild, RT_NODE_KIND_3, RT_CLASS_3, is_leaf);
+ newchild->shift = newshift;
+ RT_NODE_INSERT_INNER(tree, parent, stored_node, node, key, allocchild);
+
+ parent = node;
+ node = newchild;
+ stored_node = allocchild;
+ shift -= RT_NODE_SPAN;
+ }
+
+ RT_NODE_INSERT_LEAF(tree, parent, stored_node, node, key, value_p);
+ tree->ctl->num_keys++;
+}
+
+/*
+ * Search for the child pointer corresponding to 'key' in the given node.
+ *
+ * Return true if the key is found, otherwise return false. On success, the
+ * child pointer is returned in *child_p.
+ */
+static inline bool
+RT_NODE_SEARCH_INNER(RT_PTR_LOCAL node, uint64 key, RT_PTR_ALLOC *child_p)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/*
+ * Search for the value corresponding to 'key' in the given node.
+ *
+ * Return true if the key is found, otherwise return false. On success, the
+ * value is returned in *value_p.
+ */
+static inline bool
+RT_NODE_SEARCH_LEAF(RT_PTR_LOCAL node, uint64 key, RT_VALUE_TYPE *value_p)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
+#ifdef RT_USE_DELETE
+/*
+ * Search for the child pointer corresponding to 'key' in the given node.
+ *
+ * Delete the child entry and return true if the key is found, otherwise return false.
+ */
+static inline bool
+RT_NODE_DELETE_INNER(RT_PTR_LOCAL node, uint64 key)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_delete_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/*
+ * Search for the value corresponding to 'key' in the given node.
+ *
+ * Delete the value entry and return true if the key is found, otherwise return false.
+ */
+static inline bool
+RT_NODE_DELETE_LEAF(RT_PTR_LOCAL node, uint64 key)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_delete_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+#endif
+
+/*
+ * Insert "child" into "node".
+ *
+ * "parent" is the parent of "node", so the grandparent of the child.
+ * If the node we're inserting into needs to grow, we update the parent's
+ * child pointer with the pointer to the new larger node.
+ */
+static bool
+RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
+ uint64 key, RT_PTR_ALLOC child)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_insert_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/* Like RT_NODE_INSERT_INNER, but for leaf nodes */
+static bool
+RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
+ uint64 key, RT_VALUE_TYPE *value_p)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_insert_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
+/*
+ * Create the radix tree in the given memory context and return it.
+ */
+RT_SCOPE RT_RADIX_TREE *
+#ifdef RT_SHMEM
+RT_CREATE(MemoryContext ctx, dsa_area *dsa, int tranche_id)
+#else
+RT_CREATE(MemoryContext ctx)
+#endif
+{
+ RT_RADIX_TREE *tree;
+ MemoryContext old_ctx;
+#ifdef RT_SHMEM
+ dsa_pointer dp;
+#endif
+
+ old_ctx = MemoryContextSwitchTo(ctx);
+
+ tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE));
+ tree->context = ctx;
+
+#ifdef RT_SHMEM
+ tree->dsa = dsa;
+ dp = dsa_allocate0(dsa, sizeof(RT_RADIX_TREE_CONTROL));
+ tree->ctl = (RT_RADIX_TREE_CONTROL *) dsa_get_address(dsa, dp);
+ tree->ctl->handle = dp;
+ tree->ctl->magic = RT_RADIX_TREE_MAGIC;
+ LWLockInitialize(&tree->ctl->lock, tranche_id);
+#else
+ tree->ctl = (RT_RADIX_TREE_CONTROL *) palloc0(sizeof(RT_RADIX_TREE_CONTROL));
+
+ /* Create a slab context for each size class */
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ RT_SIZE_CLASS_ELEM size_class = RT_SIZE_CLASS_INFO[i];
+ size_t inner_blocksize = RT_SLAB_BLOCK_SIZE(size_class.inner_size);
+ size_t leaf_blocksize = RT_SLAB_BLOCK_SIZE(size_class.leaf_size);
+
+ tree->inner_slabs[i] = SlabContextCreate(ctx,
+ size_class.name,
+ inner_blocksize,
+ size_class.inner_size);
+ tree->leaf_slabs[i] = SlabContextCreate(ctx,
+ size_class.name,
+ leaf_blocksize,
+ size_class.leaf_size);
+ }
+#endif
+
+ tree->ctl->root = RT_INVALID_PTR_ALLOC;
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return tree;
+}
+
+#ifdef RT_SHMEM
+RT_SCOPE RT_RADIX_TREE *
+RT_ATTACH(dsa_area *dsa, RT_HANDLE handle)
+{
+ RT_RADIX_TREE *tree;
+ dsa_pointer control;
+
+ tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE));
+
+	/* Find the control object in shared memory */
+ control = handle;
+
+ tree->dsa = dsa;
+ tree->ctl = (RT_RADIX_TREE_CONTROL *) dsa_get_address(dsa, control);
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+
+ return tree;
+}
+
+RT_SCOPE void
+RT_DETACH(RT_RADIX_TREE *tree)
+{
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+ pfree(tree);
+}
+
+RT_SCOPE RT_HANDLE
+RT_GET_HANDLE(RT_RADIX_TREE *tree)
+{
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+ return tree->ctl->handle;
+}
+
+/*
+ * Recursively free all nodes allocated to the DSA area.
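+ *
+ * Children are freed before their parent (a depth-first traversal). This is
+ * only needed for shared trees; for local trees, RT_FREE just deletes the
+ * slab contexts that hold the nodes.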
+ */
+static inline void
+RT_FREE_RECURSE(RT_RADIX_TREE *tree, RT_PTR_ALLOC ptr)
+{
+ RT_PTR_LOCAL node = RT_PTR_GET_LOCAL(tree, ptr);
+
+ check_stack_depth();
+ CHECK_FOR_INTERRUPTS();
+
+ /* The leaf node doesn't have child pointers */
+ if (RT_NODE_IS_LEAF(node))
+ {
+ dsa_free(tree->dsa, ptr);
+ return;
+ }
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE_INNER_3 *n3 = (RT_NODE_INNER_3 *) node;
+
+ for (int i = 0; i < n3->base.n.count; i++)
+ RT_FREE_RECURSE(tree, n3->children[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE_INNER_32 *n32 = (RT_NODE_INNER_32 *) node;
+
+ for (int i = 0; i < n32->base.n.count; i++)
+ RT_FREE_RECURSE(tree, n32->children[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE_INNER_125 *n125 = (RT_NODE_INNER_125 *) node;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(&n125->base, i))
+ continue;
+
+ RT_FREE_RECURSE(tree, RT_NODE_INNER_125_GET_CHILD(n125, i));
+ }
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE_INNER_256 *n256 = (RT_NODE_INNER_256 *) node;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
+ continue;
+
+ RT_FREE_RECURSE(tree, RT_NODE_INNER_256_GET_CHILD(n256, i));
+ }
+
+ break;
+ }
+ }
+
+ /* Free the inner node */
+ dsa_free(tree->dsa, ptr);
+}
+#endif
+
+/*
+ * Free the given radix tree.
+ */
+RT_SCOPE void
+RT_FREE(RT_RADIX_TREE *tree)
+{
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+
+ /* Free all memory used for radix tree nodes */
+ if (RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ RT_FREE_RECURSE(tree, tree->ctl->root);
+
+ /*
+	 * Vandalize the control block to help catch programming errors where
+ * other backends access the memory formerly occupied by this radix tree.
+ */
+ tree->ctl->magic = 0;
+ dsa_free(tree->dsa, tree->ctl->handle);
+#else
+ pfree(tree->ctl);
+
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ MemoryContextDelete(tree->inner_slabs[i]);
+ MemoryContextDelete(tree->leaf_slabs[i]);
+ }
+#endif
+
+ pfree(tree);
+}
+
+/*
+ * Set key to the value pointed to by 'value_p'. If the entry already exists,
+ * update its value and return true; otherwise insert a new entry and return
+ * false.
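+ *
+ * A minimal usage sketch (hypothetical names; assumes a local tree created
+ * with RT_CREATE, the "rt_" prefix used by the test module, and a uint64
+ * value type):
+ *
+ *     uint64  val = 42;
+ *
+ *     if (!rt_set(tree, key, &val))
+ *         elog(DEBUG1, "inserted key " UINT64_FORMAT, key);
+ *     else
+ *         elog(DEBUG1, "updated key " UINT64_FORMAT, key);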
+ */
+RT_SCOPE bool
+RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p)
+{
+ int shift;
+ bool updated;
+ RT_PTR_LOCAL parent;
+ RT_PTR_ALLOC stored_child;
+ RT_PTR_LOCAL child;
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+
+ RT_LOCK_EXCLUSIVE(tree);
+
+ /* Empty tree, create the root */
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ RT_NEW_ROOT(tree, key);
+
+ /* Extend the tree if necessary */
+ if (key > tree->ctl->max_val)
+ RT_EXTEND(tree, key);
+
+ stored_child = tree->ctl->root;
+ parent = RT_PTR_GET_LOCAL(tree, stored_child);
+ shift = parent->shift;
+
+ /* Descend the tree until we reach a leaf node */
+ while (shift >= 0)
+ {
+		RT_PTR_ALLOC new_child = RT_INVALID_PTR_ALLOC;
+
+ child = RT_PTR_GET_LOCAL(tree, stored_child);
+
+ if (RT_NODE_IS_LEAF(child))
+ break;
+
+ if (!RT_NODE_SEARCH_INNER(child, key, &new_child))
+ {
+ RT_SET_EXTEND(tree, key, value_p, parent, stored_child, child);
+ RT_UNLOCK(tree);
+ return false;
+ }
+
+ parent = child;
+ stored_child = new_child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ updated = RT_NODE_INSERT_LEAF(tree, parent, stored_child, child, key, value_p);
+
+ /* Update the statistics */
+ if (!updated)
+ tree->ctl->num_keys++;
+
+ RT_UNLOCK(tree);
+ return updated;
+}
+
+/*
+ * Search for the given key in the radix tree. Return true if the key is found,
+ * otherwise return false. On success, the value is copied into *value_p, so
+ * value_p must not be NULL.
+ */
+RT_SCOPE bool
+RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p)
+{
+ RT_PTR_LOCAL node;
+ int shift;
+ bool found;
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+ Assert(value_p != NULL);
+
+ RT_LOCK_SHARED(tree);
+
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root) || key > tree->ctl->max_val)
+ {
+ RT_UNLOCK(tree);
+ return false;
+ }
+
+ node = RT_PTR_GET_LOCAL(tree, tree->ctl->root);
+ shift = node->shift;
+
+	/* Descend the tree until we reach a leaf node */
+ while (shift >= 0)
+ {
+ RT_PTR_ALLOC child = RT_INVALID_PTR_ALLOC;
+
+ if (RT_NODE_IS_LEAF(node))
+ break;
+
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ {
+ RT_UNLOCK(tree);
+ return false;
+ }
+
+ node = RT_PTR_GET_LOCAL(tree, child);
+ shift -= RT_NODE_SPAN;
+ }
+
+ found = RT_NODE_SEARCH_LEAF(node, key, value_p);
+
+ RT_UNLOCK(tree);
+ return found;
+}
+
+#ifdef RT_USE_DELETE
+/*
+ * Delete the given key from the radix tree. Return true if the key is found (and
+ * deleted), otherwise do nothing and return false.
+ */
+RT_SCOPE bool
+RT_DELETE(RT_RADIX_TREE *tree, uint64 key)
+{
+ RT_PTR_LOCAL node;
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_ALLOC stack[RT_MAX_LEVEL] = {0};
+ int shift;
+ int level;
+ bool deleted;
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+
+ RT_LOCK_EXCLUSIVE(tree);
+
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root) || key > tree->ctl->max_val)
+ {
+ RT_UNLOCK(tree);
+ return false;
+ }
+
+ /*
+ * Descend the tree to search the key while building a stack of nodes we
+ * visited.
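+	 *
+	 * The stack is used after the leaf entry is deleted: if a node becomes
+	 * empty, we walk back up the stack, delete the corresponding entry from
+	 * its parent, and free the emptied node.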
+ */
+ allocnode = tree->ctl->root;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ shift = node->shift;
+ level = -1;
+ while (shift > 0)
+ {
+ RT_PTR_ALLOC child = RT_INVALID_PTR_ALLOC;
+
+ /* Push the current node to the stack */
+ stack[++level] = allocnode;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ {
+ RT_UNLOCK(tree);
+ return false;
+ }
+
+ allocnode = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+	/* Delete the key from the leaf node if it exists */
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ deleted = RT_NODE_DELETE_LEAF(node, key);
+
+ if (!deleted)
+ {
+		/* the key was not found in the leaf node */
+ RT_UNLOCK(tree);
+ return false;
+ }
+
+ /* Found the key to delete. Update the statistics */
+ tree->ctl->num_keys--;
+
+ /*
+ * Return if the leaf node still has keys and we don't need to delete the
+ * node.
+ */
+ if (node->count > 0)
+ {
+ RT_UNLOCK(tree);
+ return true;
+ }
+
+ /* Free the empty leaf node */
+ RT_FREE_NODE(tree, allocnode);
+
+	/* Delete the key from the inner nodes, walking back up the stack */
+ while (level >= 0)
+ {
+ allocnode = stack[level--];
+
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ deleted = RT_NODE_DELETE_INNER(node, key);
+ Assert(deleted);
+
+		/* If the node didn't become empty, we can stop deleting entries */
+ if (node->count > 0)
+ break;
+
+ /* The node became empty */
+ RT_FREE_NODE(tree, allocnode);
+ }
+
+ RT_UNLOCK(tree);
+ return true;
+}
+#endif
+
+static inline void
+RT_ITER_UPDATE_KEY(RT_ITER *iter, uint8 chunk, uint8 shift)
+{
+ iter->key &= ~(((uint64) RT_CHUNK_MASK) << shift);
+ iter->key |= (((uint64) chunk) << shift);
+}
+
+/*
+ * Advance the slot in the inner node. Return the next child if one exists,
+ * otherwise NULL.
+ */
+static inline RT_PTR_LOCAL
+RT_NODE_INNER_ITERATE_NEXT(RT_ITER *iter, RT_NODE_ITER *node_iter)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_iter_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/*
+ * Advance the slot in the leaf node. On success, return true and copy the
+ * value into *value_p; otherwise return false.
+ */
+static inline bool
+RT_NODE_LEAF_ITERATE_NEXT(RT_ITER *iter, RT_NODE_ITER *node_iter,
+ RT_VALUE_TYPE *value_p)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_iter_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
+/*
+ * Initialize the node iterators in the stack from 'from_node' down to the
+ * leaf, advancing each inner node to its first child.
+ */
+static void
+RT_UPDATE_ITER_STACK(RT_ITER *iter, RT_PTR_LOCAL from_node, int from)
+{
+ int level = from;
+ RT_PTR_LOCAL node = from_node;
+
+ for (;;)
+ {
+ RT_NODE_ITER *node_iter = &(iter->stack[level--]);
+
+ node_iter->node = node;
+ node_iter->current_idx = -1;
+
+ /* We don't advance the leaf node iterator here */
+ if (RT_NODE_IS_LEAF(node))
+ return;
+
+ /* Advance to the next slot in the inner node */
+ node = RT_NODE_INNER_ITERATE_NEXT(iter, node_iter);
+
+		/* We must find the first child in the node */
+ Assert(node);
+ }
+}
+
+/*
+ * Create and return the iterator for the given radix tree.
+ *
+ * The radix tree is locked in shared mode during the iteration, so
+ * RT_END_ITERATE needs to be called when finished to release the lock.
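+ *
+ * A minimal iteration sketch (hypothetical names; assumes the "rt_" prefix
+ * used by the test module and a uint64 value type):
+ *
+ *     rt_iter    *iter = rt_begin_iterate(tree);
+ *     uint64      key;
+ *     uint64      val;
+ *
+ *     while (rt_iterate_next(iter, &key, &val))
+ *         elog(DEBUG1, "key " UINT64_FORMAT " value " UINT64_FORMAT, key, val);
+ *     rt_end_iterate(iter);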
+ */
+RT_SCOPE RT_ITER *
+RT_BEGIN_ITERATE(RT_RADIX_TREE *tree)
+{
+ MemoryContext old_ctx;
+ RT_ITER *iter;
+ RT_PTR_LOCAL root;
+ int top_level;
+
+ old_ctx = MemoryContextSwitchTo(tree->context);
+
+ iter = (RT_ITER *) palloc0(sizeof(RT_ITER));
+ iter->tree = tree;
+
+ RT_LOCK_SHARED(tree);
+
+	/* empty tree */
+	if (!iter->tree->ctl->root)
+	{
+		MemoryContextSwitchTo(old_ctx);
+		return iter;
+	}
+
+ root = RT_PTR_GET_LOCAL(tree, iter->tree->ctl->root);
+ top_level = root->shift / RT_NODE_SPAN;
+ iter->stack_len = top_level;
+
+ /*
+	 * Descend from the root to the leftmost leaf node. The key is constructed
+	 * while descending to the leaf.
+ */
+ RT_UPDATE_ITER_STACK(iter, root, top_level);
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return iter;
+}
+
+/*
+ * Return true, setting *key_p and *value_p, if there is a next key; otherwise
+ * return false.
+ */
+RT_SCOPE bool
+RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, RT_VALUE_TYPE *value_p)
+{
+ /* Empty tree */
+ if (!iter->tree->ctl->root)
+ return false;
+
+ for (;;)
+ {
+ RT_PTR_LOCAL child = NULL;
+ RT_VALUE_TYPE value;
+ int level;
+ bool found;
+
+		/* Advance the leaf node iterator to get the next key-value pair */
+ found = RT_NODE_LEAF_ITERATE_NEXT(iter, &(iter->stack[0]), &value);
+
+ if (found)
+ {
+ *key_p = iter->key;
+ *value_p = value;
+ return true;
+ }
+
+ /*
+		 * We've visited all values in the leaf node, so advance the inner node
+		 * iterators, starting at level 1, until we find the next child node.
+ */
+ for (level = 1; level <= iter->stack_len; level++)
+ {
+ child = RT_NODE_INNER_ITERATE_NEXT(iter, &(iter->stack[level]));
+
+ if (child)
+ break;
+ }
+
+ /* the iteration finished */
+ if (!child)
+ return false;
+
+ /*
+		 * Found the next child node. Update the iterator stack from this node
+		 * down to the leaf.
+ */
+ RT_UPDATE_ITER_STACK(iter, child, level - 1);
+
+ /* Node iterators are updated, so try again from the leaf */
+ }
+
+ return false;
+}
+
+/*
+ * Terminate the iteration and release the lock.
+ *
+ * This function needs to be called when the iteration is finished, or when
+ * bailing out of it early.
+ */
+RT_SCOPE void
+RT_END_ITERATE(RT_ITER *iter)
+{
+#ifdef RT_SHMEM
+ Assert(LWLockHeldByMe(&iter->tree->ctl->lock));
+#endif
+
+ RT_UNLOCK(iter->tree);
+ pfree(iter);
+}
+
+/*
+ * Return the amount of memory used by the radix tree.
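+ *
+ * For a shared tree this is the total size of the backing DSA area, so it may
+ * overstate the tree's own usage if other data lives in the same area.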
+ */
+RT_SCOPE uint64
+RT_MEMORY_USAGE(RT_RADIX_TREE *tree)
+{
+ Size total = 0;
+
+ RT_LOCK_SHARED(tree);
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+ total = dsa_get_total_size(tree->dsa);
+#else
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
+ total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
+ }
+#endif
+
+ RT_UNLOCK(tree);
+ return total;
+}
+
+/*
+ * Verify the invariants of the given radix tree node (a no-op unless
+ * assertions are enabled).
+ */
+static void
+RT_VERIFY_NODE(RT_PTR_LOCAL node)
+{
+#ifdef USE_ASSERT_CHECKING
+ Assert(node->count >= 0);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE_BASE_3 *n3 = (RT_NODE_BASE_3 *) node;
+
+ for (int i = 1; i < n3->n.count; i++)
+ Assert(n3->chunks[i - 1] < n3->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE_BASE_32 *n32 = (RT_NODE_BASE_32 *) node;
+
+ for (int i = 1; i < n32->n.count; i++)
+ Assert(n32->chunks[i - 1] < n32->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE_BASE_125 *n125 = (RT_NODE_BASE_125 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ uint8 slot = n125->slot_idxs[i];
+ int idx = RT_BM_IDX(slot);
+ int bitnum = RT_BM_BIT(slot);
+
+ if (!RT_NODE_125_IS_CHUNK_USED(n125, i))
+ continue;
+
+ /* Check if the corresponding slot is used */
+ Assert(slot < node->fanout);
+ Assert((n125->isset[idx] & ((bitmapword) 1 << bitnum)) != 0);
+
+ cnt++;
+ }
+
+ Assert(n125->n.count == cnt);
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_256 *n256 = (RT_NODE_LEAF_256 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RT_BM_IDX(RT_NODE_MAX_SLOTS); i++)
+ cnt += bmw_popcount(n256->isset[i]);
+
+					/* Check if the number of used chunks matches */
+ Assert(n256->base.n.count == cnt);
+
+ break;
+ }
+ }
+ }
+#endif
+}
+
+/***************** DEBUG FUNCTIONS *****************/
+#ifdef RT_DEBUG
+
+#define RT_UINT64_FORMAT_HEX "%" INT64_MODIFIER "X"
+
+RT_SCOPE void
+RT_STATS(RT_RADIX_TREE *tree)
+{
+ RT_LOCK_SHARED(tree);
+
+ fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->ctl->max_val);
+ fprintf(stderr, "num_keys = " UINT64_FORMAT "\n", tree->ctl->num_keys);
+
+#ifdef RT_SHMEM
+ fprintf(stderr, "handle = " UINT64_FORMAT "\n", tree->ctl->handle);
+#endif
+
+ if (RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ {
+ RT_PTR_LOCAL root = RT_PTR_GET_LOCAL(tree, tree->ctl->root);
+
+ fprintf(stderr, "height = %d, n3 = %u, n15 = %u, n32 = %u, n125 = %u, n256 = %u\n",
+ root->shift / RT_NODE_SPAN,
+ tree->ctl->cnt[RT_CLASS_3],
+ tree->ctl->cnt[RT_CLASS_32_MIN],
+ tree->ctl->cnt[RT_CLASS_32_MAX],
+ tree->ctl->cnt[RT_CLASS_125],
+ tree->ctl->cnt[RT_CLASS_256]);
+ }
+
+ RT_UNLOCK(tree);
+}
+
+static void
+RT_DUMP_NODE(RT_RADIX_TREE *tree, RT_PTR_ALLOC allocnode, int level,
+ bool recurse, StringInfo buf)
+{
+ RT_PTR_LOCAL node = RT_PTR_GET_LOCAL(tree, allocnode);
+ StringInfoData spaces;
+
+ initStringInfo(&spaces);
+ appendStringInfoSpaces(&spaces, (level * 4) + 1);
+
+ appendStringInfo(buf, "%s%s[%s] kind %d, fanout %d, count %u, shift %u:\n",
+ spaces.data,
+ level == 0 ? "" : "-> ",
+ RT_NODE_IS_LEAF(node) ? "LEAF" : "INNR",
+ (node->kind == RT_NODE_KIND_3) ? 3 :
+ (node->kind == RT_NODE_KIND_32) ? 32 :
+ (node->kind == RT_NODE_KIND_125) ? 125 : 256,
+ node->fanout == 0 ? 256 : node->fanout,
+ node->count, node->shift);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_3 *n3 = (RT_NODE_LEAF_3 *) node;
+
+ appendStringInfo(buf, "%schunk[%d] 0x%X\n",
+ spaces.data, i, n3->base.chunks[i]);
+ }
+ else
+ {
+ RT_NODE_INNER_3 *n3 = (RT_NODE_INNER_3 *) node;
+
+ appendStringInfo(buf, "%schunk[%d] 0x%X",
+ spaces.data, i, n3->base.chunks[i]);
+
+ if (recurse)
+ {
+ appendStringInfo(buf, "\n");
+ RT_DUMP_NODE(tree, n3->children[i], level + 1,
+ recurse, buf);
+ }
+ else
+ appendStringInfo(buf, " (skipped)\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_32 *n32 = (RT_NODE_LEAF_32 *) node;
+
+ appendStringInfo(buf, "%schunk[%d] 0x%X\n",
+ spaces.data, i, n32->base.chunks[i]);
+ }
+ else
+ {
+ RT_NODE_INNER_32 *n32 = (RT_NODE_INNER_32 *) node;
+
+ appendStringInfo(buf, "%schunk[%d] 0x%X",
+ spaces.data, i, n32->base.chunks[i]);
+
+ if (recurse)
+ {
+ appendStringInfo(buf, "\n");
+ RT_DUMP_NODE(tree, n32->children[i], level + 1,
+ recurse, buf);
+ }
+ else
+ appendStringInfo(buf, " (skipped)\n");
+
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE_BASE_125 *b125 = (RT_NODE_BASE_125 *) node;
+ char *sep = "";
+
+ appendStringInfo(buf, "%sslot_idxs: ", spaces.data);
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(b125, i))
+ continue;
+
+ appendStringInfo(buf, "%s[%d]=%d ",
+ sep, i, b125->slot_idxs[i]);
+ sep = ",";
+ }
+
+ appendStringInfo(buf, "\n%sisset-bitmap: ", spaces.data);
+ for (int i = 0; i < (RT_SLOT_IDX_LIMIT / BITS_PER_BYTE); i++)
+ appendStringInfo(buf, "%X ", ((uint8 *) b125->isset)[i]);
+ appendStringInfo(buf, "\n");
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(b125, i))
+ continue;
+
+ if (RT_NODE_IS_LEAF(node))
+ appendStringInfo(buf, "%schunk 0x%X\n",
+ spaces.data, i);
+ else
+ {
+ RT_NODE_INNER_125 *n125 = (RT_NODE_INNER_125 *) b125;
+
+ appendStringInfo(buf, "%schunk 0x%X",
+ spaces.data, i);
+
+ if (recurse)
+ {
+ appendStringInfo(buf, "\n");
+ RT_DUMP_NODE(tree, RT_NODE_INNER_125_GET_CHILD(n125, i),
+ level + 1, recurse, buf);
+ }
+ else
+ appendStringInfo(buf, " (skipped)\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_256 *n256 = (RT_NODE_LEAF_256 *) node;
+
+ appendStringInfo(buf, "%sisset-bitmap: ", spaces.data);
+ for (int i = 0; i < (RT_SLOT_IDX_LIMIT / BITS_PER_BYTE); i++)
+ appendStringInfo(buf, "%X ", ((uint8 *) n256->isset)[i]);
+ appendStringInfo(buf, "\n");
+ }
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_256 *n256 = (RT_NODE_LEAF_256 *) node;
+
+ if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, i))
+ continue;
+
+ appendStringInfo(buf, "%schunk 0x%X\n",
+ spaces.data, i);
+ }
+ else
+ {
+ RT_NODE_INNER_256 *n256 = (RT_NODE_INNER_256 *) node;
+
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
+ continue;
+
+ appendStringInfo(buf, "%schunk 0x%X",
+ spaces.data, i);
+
+ if (recurse)
+ {
+ appendStringInfo(buf, "\n");
+ RT_DUMP_NODE(tree, RT_NODE_INNER_256_GET_CHILD(n256, i),
+ level + 1, recurse, buf);
+ }
+ else
+ appendStringInfo(buf, " (skipped)\n");
+ }
+ }
+ break;
+ }
+ }
+}
+
+RT_SCOPE void
+RT_DUMP_SEARCH(RT_RADIX_TREE *tree, uint64 key)
+{
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL node;
+ StringInfoData buf;
+ int shift;
+ int level = 0;
+
+ RT_STATS(tree);
+
+ RT_LOCK_SHARED(tree);
+
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ {
+ RT_UNLOCK(tree);
+ fprintf(stderr, "empty tree\n");
+ return;
+ }
+
+ if (key > tree->ctl->max_val)
+ {
+ RT_UNLOCK(tree);
+ fprintf(stderr, "key " UINT64_FORMAT "(0x" RT_UINT64_FORMAT_HEX ") is larger than max val\n",
+ key, key);
+ return;
+ }
+
+ initStringInfo(&buf);
+ allocnode = tree->ctl->root;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ shift = node->shift;
+ while (shift >= 0)
+ {
+ RT_PTR_ALLOC child;
+
+ RT_DUMP_NODE(tree, allocnode, level, false, &buf);
+
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_VALUE_TYPE dummy;
+
+			/* We reached a leaf node; find the corresponding slot */
+ RT_NODE_SEARCH_LEAF(node, key, &dummy);
+
+ break;
+ }
+
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ break;
+
+ allocnode = child;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ shift -= RT_NODE_SPAN;
+ level++;
+ }
+ RT_UNLOCK(tree);
+
+ fprintf(stderr, "%s", buf.data);
+}
+
+RT_SCOPE void
+RT_DUMP(RT_RADIX_TREE *tree)
+{
+ StringInfoData buf;
+
+ RT_STATS(tree);
+
+ RT_LOCK_SHARED(tree);
+
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ {
+ RT_UNLOCK(tree);
+ fprintf(stderr, "empty tree\n");
+ return;
+ }
+
+ initStringInfo(&buf);
+
+ RT_DUMP_NODE(tree, tree->ctl->root, 0, true, &buf);
+ RT_UNLOCK(tree);
+
+	fprintf(stderr, "%s", buf.data);
+}
+#endif
+
+#endif /* RT_DEFINE */
+
+
+/* undefine external parameters, so next radix tree can be defined */
+#undef RT_PREFIX
+#undef RT_SCOPE
+#undef RT_DECLARE
+#undef RT_DEFINE
+#undef RT_VALUE_TYPE
+
+/* locally declared macros */
+#undef RT_MAKE_PREFIX
+#undef RT_MAKE_NAME
+#undef RT_MAKE_NAME_
+#undef RT_NODE_SPAN
+#undef RT_NODE_MAX_SLOTS
+#undef RT_CHUNK_MASK
+#undef RT_MAX_SHIFT
+#undef RT_MAX_LEVEL
+#undef RT_GET_KEY_CHUNK
+#undef RT_BM_IDX
+#undef RT_BM_BIT
+#undef RT_LOCK_EXCLUSIVE
+#undef RT_LOCK_SHARED
+#undef RT_UNLOCK
+#undef RT_NODE_IS_LEAF
+#undef RT_NODE_MUST_GROW
+#undef RT_NODE_KIND_COUNT
+#undef RT_SIZE_CLASS_COUNT
+#undef RT_SLOT_IDX_LIMIT
+#undef RT_INVALID_SLOT_IDX
+#undef RT_SLAB_BLOCK_SIZE
+#undef RT_RADIX_TREE_MAGIC
+#undef RT_UINT64_FORMAT_HEX
+
+/* type declarations */
+#undef RT_RADIX_TREE
+#undef RT_RADIX_TREE_CONTROL
+#undef RT_PTR_LOCAL
+#undef RT_PTR_ALLOC
+#undef RT_INVALID_PTR_ALLOC
+#undef RT_HANDLE
+#undef RT_ITER
+#undef RT_NODE
+#undef RT_NODE_ITER
+#undef RT_NODE_KIND_3
+#undef RT_NODE_KIND_32
+#undef RT_NODE_KIND_125
+#undef RT_NODE_KIND_256
+#undef RT_NODE_BASE_3
+#undef RT_NODE_BASE_32
+#undef RT_NODE_BASE_125
+#undef RT_NODE_BASE_256
+#undef RT_NODE_INNER_3
+#undef RT_NODE_INNER_32
+#undef RT_NODE_INNER_125
+#undef RT_NODE_INNER_256
+#undef RT_NODE_LEAF_3
+#undef RT_NODE_LEAF_32
+#undef RT_NODE_LEAF_125
+#undef RT_NODE_LEAF_256
+#undef RT_SIZE_CLASS
+#undef RT_SIZE_CLASS_ELEM
+#undef RT_SIZE_CLASS_INFO
+#undef RT_CLASS_3
+#undef RT_CLASS_32_MIN
+#undef RT_CLASS_32_MAX
+#undef RT_CLASS_125
+#undef RT_CLASS_256
+
+/* function declarations */
+#undef RT_CREATE
+#undef RT_FREE
+#undef RT_ATTACH
+#undef RT_DETACH
+#undef RT_GET_HANDLE
+#undef RT_SEARCH
+#undef RT_SET
+#undef RT_BEGIN_ITERATE
+#undef RT_ITERATE_NEXT
+#undef RT_END_ITERATE
+#undef RT_USE_DELETE
+#undef RT_DELETE
+#undef RT_MEMORY_USAGE
+#undef RT_DUMP
+#undef RT_DUMP_NODE
+#undef RT_DUMP_SEARCH
+#undef RT_STATS
+
+/* internal helper functions */
+#undef RT_NEW_ROOT
+#undef RT_ALLOC_NODE
+#undef RT_INIT_NODE
+#undef RT_FREE_NODE
+#undef RT_FREE_RECURSE
+#undef RT_EXTEND
+#undef RT_SET_EXTEND
+#undef RT_SWITCH_NODE_KIND
+#undef RT_COPY_NODE
+#undef RT_REPLACE_NODE
+#undef RT_PTR_GET_LOCAL
+#undef RT_PTR_ALLOC_IS_VALID
+#undef RT_NODE_3_SEARCH_EQ
+#undef RT_NODE_32_SEARCH_EQ
+#undef RT_NODE_3_GET_INSERTPOS
+#undef RT_NODE_32_GET_INSERTPOS
+#undef RT_CHUNK_CHILDREN_ARRAY_SHIFT
+#undef RT_CHUNK_VALUES_ARRAY_SHIFT
+#undef RT_CHUNK_CHILDREN_ARRAY_DELETE
+#undef RT_CHUNK_VALUES_ARRAY_DELETE
+#undef RT_CHUNK_CHILDREN_ARRAY_COPY
+#undef RT_CHUNK_VALUES_ARRAY_COPY
+#undef RT_NODE_125_IS_CHUNK_USED
+#undef RT_NODE_INNER_125_GET_CHILD
+#undef RT_NODE_LEAF_125_GET_VALUE
+#undef RT_NODE_INNER_256_IS_CHUNK_USED
+#undef RT_NODE_LEAF_256_IS_CHUNK_USED
+#undef RT_NODE_INNER_256_GET_CHILD
+#undef RT_NODE_LEAF_256_GET_VALUE
+#undef RT_NODE_INNER_256_SET
+#undef RT_NODE_LEAF_256_SET
+#undef RT_NODE_INNER_256_DELETE
+#undef RT_NODE_LEAF_256_DELETE
+#undef RT_KEY_GET_SHIFT
+#undef RT_SHIFT_GET_MAX_VAL
+#undef RT_NODE_SEARCH_INNER
+#undef RT_NODE_SEARCH_LEAF
+#undef RT_NODE_UPDATE_INNER
+#undef RT_NODE_DELETE_INNER
+#undef RT_NODE_DELETE_LEAF
+#undef RT_NODE_INSERT_INNER
+#undef RT_NODE_INSERT_LEAF
+#undef RT_NODE_INNER_ITERATE_NEXT
+#undef RT_NODE_LEAF_ITERATE_NEXT
+#undef RT_UPDATE_ITER_STACK
+#undef RT_ITER_UPDATE_KEY
+#undef RT_VERIFY_NODE
+
+#undef RT_DEBUG
diff --git a/src/include/lib/radixtree_delete_impl.h b/src/include/lib/radixtree_delete_impl.h
new file mode 100644
index 0000000000..5f6dda1f12
--- /dev/null
+++ b/src/include/lib/radixtree_delete_impl.h
@@ -0,0 +1,122 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree_delete_impl.h
+ * Common implementation for deletion in leaf and inner nodes.
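+ *
+ * This only removes the entry for the key's chunk and decrements the node's
+ * count; freeing a node that becomes empty is the caller's responsibility.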
+ *
+ * Note: There is deliberately no #include guard here
+ *
+ * TODO: Shrink nodes when deletion would allow them to fit in a smaller
+ * size class.
+ *
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * src/include/lib/radixtree_delete_impl.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE3_TYPE RT_NODE_INNER_3
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE3_TYPE RT_NODE_LEAF_3
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+
+#ifdef RT_NODE_LEVEL_LEAF
+ Assert(RT_NODE_IS_LEAF(node));
+#else
+ Assert(!RT_NODE_IS_LEAF(node));
+#endif
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node;
+ int idx = RT_NODE_3_SEARCH_EQ((RT_NODE_BASE_3 *) n3, chunk);
+
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_DELETE(n3->base.chunks, n3->values,
+ n3->base.n.count, idx);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_DELETE(n3->base.chunks, n3->children,
+ n3->base.n.count, idx);
+#endif
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
+ int idx = RT_NODE_32_SEARCH_EQ((RT_NODE_BASE_32 *) n32, chunk);
+
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_DELETE(n32->base.chunks, n32->values,
+ n32->base.n.count, idx);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_DELETE(n32->base.chunks, n32->children,
+ n32->base.n.count, idx);
+#endif
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+ int slotpos = n125->base.slot_idxs[chunk];
+ int idx;
+ int bitnum;
+
+ if (slotpos == RT_INVALID_SLOT_IDX)
+ return false;
+
+ idx = RT_BM_IDX(slotpos);
+ bitnum = RT_BM_BIT(slotpos);
+ n125->base.isset[idx] &= ~((bitmapword) 1 << bitnum);
+ n125->base.slot_idxs[chunk] = RT_INVALID_SLOT_IDX;
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk))
+#else
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, chunk))
+#endif
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_NODE_LEAF_256_DELETE(n256, chunk);
+#else
+ RT_NODE_INNER_256_DELETE(n256, chunk);
+#endif
+ break;
+ }
+ }
+
+ /* update statistics */
+ node->count--;
+
+ return true;
+
+#undef RT_NODE3_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_insert_impl.h b/src/include/lib/radixtree_insert_impl.h
new file mode 100644
index 0000000000..c18e26b537
--- /dev/null
+++ b/src/include/lib/radixtree_insert_impl.h
@@ -0,0 +1,332 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree_insert_impl.h
+ * Common implementation for insertion in leaf and inner nodes.
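+ *
+ * If the target node is full, it is first grown, either to the larger size
+ * class of the same kind or to the next node kind; in the latter case control
+ * falls through to the switch case for the new kind, which then performs the
+ * actual insertion.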
+ *
+ * Note: There is deliberately no #include guard here
+ *
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * src/include/lib/radixtree_insert_impl.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE3_TYPE RT_NODE_INNER_3
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE3_TYPE RT_NODE_LEAF_3
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool chunk_exists = false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ const bool is_leaf = true;
+ Assert(RT_NODE_IS_LEAF(node));
+#else
+ const bool is_leaf = false;
+ Assert(!RT_NODE_IS_LEAF(node));
+#endif
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node;
+ int idx;
+
+ idx = RT_NODE_3_SEARCH_EQ(&n3->base, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+#ifdef RT_NODE_LEVEL_LEAF
+ n3->values[idx] = *value_p;
+#else
+ n3->children[idx] = child;
+#endif
+ break;
+ }
+
+ if (unlikely(RT_NODE_MUST_GROW(n3)))
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+ RT_NODE32_TYPE *new32;
+ const uint8 new_kind = RT_NODE_KIND_32;
+ const RT_SIZE_CLASS new_class = RT_CLASS_32_MIN;
+
+ /* grow node from 3 to 32 */
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
+ newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, is_leaf);
+ new32 = (RT_NODE32_TYPE *) newnode;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_COPY(n3->base.chunks, n3->values,
+ new32->base.chunks, new32->values);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_COPY(n3->base.chunks, n3->children,
+ new32->base.chunks, new32->children);
+#endif
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
+ node = newnode;
+ }
+ else
+ {
+ int insertpos = RT_NODE_3_GET_INSERTPOS(&n3->base, chunk);
+ int count = n3->base.n.count;
+
+ /* shift chunks and children */
+ if (insertpos < count)
+ {
+ Assert(count > 0);
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_SHIFT(n3->base.chunks, n3->values,
+ count, insertpos);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_SHIFT(n3->base.chunks, n3->children,
+ count, insertpos);
+#endif
+ }
+
+ n3->base.chunks[insertpos] = chunk;
+#ifdef RT_NODE_LEVEL_LEAF
+ n3->values[insertpos] = *value_p;
+#else
+ n3->children[insertpos] = child;
+#endif
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_32:
+ {
+ const RT_SIZE_CLASS_ELEM class32_max = RT_SIZE_CLASS_INFO[RT_CLASS_32_MAX];
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
+ int idx;
+
+ idx = RT_NODE_32_SEARCH_EQ(&n32->base, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+#ifdef RT_NODE_LEVEL_LEAF
+ n32->values[idx] = *value_p;
+#else
+ n32->children[idx] = child;
+#endif
+ break;
+ }
+
+ if (unlikely(RT_NODE_MUST_GROW(n32)) &&
+ n32->base.n.fanout < class32_max.fanout)
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+ const RT_SIZE_CLASS_ELEM class32_min = RT_SIZE_CLASS_INFO[RT_CLASS_32_MIN];
+ const RT_SIZE_CLASS new_class = RT_CLASS_32_MAX;
+
+ Assert(n32->base.n.fanout == class32_min.fanout);
+
+ /* grow to the next size class of this kind */
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ n32 = (RT_NODE32_TYPE *) newnode;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ memcpy(newnode, node, class32_min.leaf_size);
+#else
+ memcpy(newnode, node, class32_min.inner_size);
+#endif
+ newnode->fanout = class32_max.fanout;
+
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
+ node = newnode;
+ }
+
+ if (unlikely(RT_NODE_MUST_GROW(n32)))
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+ RT_NODE125_TYPE *new125;
+ const uint8 new_kind = RT_NODE_KIND_125;
+ const RT_SIZE_CLASS new_class = RT_CLASS_125;
+
+ Assert(n32->base.n.fanout == class32_max.fanout);
+
+ /* grow node from 32 to 125 */
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
+ newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, is_leaf);
+ new125 = (RT_NODE125_TYPE *) newnode;
+
+ for (int i = 0; i < class32_max.fanout; i++)
+ {
+ new125->base.slot_idxs[n32->base.chunks[i]] = i;
+#ifdef RT_NODE_LEVEL_LEAF
+ new125->values[i] = n32->values[i];
+#else
+ new125->children[i] = n32->children[i];
+#endif
+ }
+
+ /*
+ * Since we just copied a dense array, we can set the bits
+ * using a single store, provided the length of that array
+ * is at most the number of bits in a bitmapword.
+ */
+ Assert(class32_max.fanout <= sizeof(bitmapword) * BITS_PER_BYTE);
+ new125->base.isset[0] = (bitmapword) (((uint64) 1 << class32_max.fanout) - 1);
+
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
+ node = newnode;
+ }
+ else
+ {
+ int insertpos = RT_NODE_32_GET_INSERTPOS(&n32->base, chunk);
+ int count = n32->base.n.count;
+
+ if (insertpos < count)
+ {
+ Assert(count > 0);
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_SHIFT(n32->base.chunks, n32->values,
+ count, insertpos);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_SHIFT(n32->base.chunks, n32->children,
+ count, insertpos);
+#endif
+ }
+
+ n32->base.chunks[insertpos] = chunk;
+#ifdef RT_NODE_LEVEL_LEAF
+ n32->values[insertpos] = *value_p;
+#else
+ n32->children[insertpos] = child;
+#endif
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+ int slotpos = n125->base.slot_idxs[chunk];
+ int cnt = 0;
+
+ if (slotpos != RT_INVALID_SLOT_IDX)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+#ifdef RT_NODE_LEVEL_LEAF
+ n125->values[slotpos] = *value_p;
+#else
+ n125->children[slotpos] = child;
+#endif
+ break;
+ }
+
+ if (unlikely(RT_NODE_MUST_GROW(n125)))
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+ RT_NODE256_TYPE *new256;
+ const uint8 new_kind = RT_NODE_KIND_256;
+ const RT_SIZE_CLASS new_class = RT_CLASS_256;
+
+ /* grow node from 125 to 256 */
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
+ newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, is_leaf);
+ new256 = (RT_NODE256_TYPE *) newnode;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n125->base.n.count; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(&n125->base, i))
+ continue;
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_NODE_LEAF_256_SET(new256, i, RT_NODE_LEAF_125_GET_VALUE(n125, i));
+#else
+ RT_NODE_INNER_256_SET(new256, i, RT_NODE_INNER_125_GET_CHILD(n125, i));
+#endif
+ cnt++;
+ }
+
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
+ node = newnode;
+ }
+ else
+ {
+ int idx;
+ bitmapword inverse;
+
+ /* get the first word with at least one bit not set */
+ for (idx = 0; idx < RT_BM_IDX(RT_SLOT_IDX_LIMIT); idx++)
+ {
+ if (n125->base.isset[idx] < ~((bitmapword) 0))
+ break;
+ }
+
+ /* To get the first unset bit in X, get the first set bit in ~X */
+ inverse = ~(n125->base.isset[idx]);
+ slotpos = idx * BITS_PER_BITMAPWORD;
+ slotpos += bmw_rightmost_one_pos(inverse);
+ Assert(slotpos < node->fanout);
+
+ /* mark the slot used */
+ n125->base.isset[idx] |= bmw_rightmost_one(inverse);
+ n125->base.slot_idxs[chunk] = slotpos;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ n125->values[slotpos] = *value_p;
+#else
+ n125->children[slotpos] = child;
+#endif
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ chunk_exists = RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk);
+#else
+ chunk_exists = RT_NODE_INNER_256_IS_CHUNK_USED(n256, chunk);
+#endif
+ Assert(chunk_exists || node->count < RT_NODE_MAX_SLOTS);
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_NODE_LEAF_256_SET(n256, chunk, *value_p);
+#else
+ RT_NODE_INNER_256_SET(n256, chunk, child);
+#endif
+ break;
+ }
+ }
+
+ /* Update statistics */
+ if (!chunk_exists)
+ node->count++;
+
+ /*
+	 * Done. Finally, verify that the chunk and value were inserted or replaced
+	 * properly in the node.
+ */
+ RT_VERIFY_NODE(node);
+
+ return chunk_exists;
+
+#undef RT_NODE3_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_iter_impl.h b/src/include/lib/radixtree_iter_impl.h
new file mode 100644
index 0000000000..98c78eb237
--- /dev/null
+++ b/src/include/lib/radixtree_iter_impl.h
@@ -0,0 +1,153 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree_iter_impl.h
+ * Common implementation for iteration in leaf and inner nodes.
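+ *
+ * For inner nodes this returns the next child node, or NULL when the node is
+ * exhausted; for leaf nodes it returns whether a next value was found and
+ * stores that value in *value_p. In both cases the iterator's key is updated
+ * with the chunk just visited.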
+ *
+ * Note: There is deliberately no #include guard here
+ *
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * src/include/lib/radixtree_iter_impl.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE3_TYPE RT_NODE_INNER_3
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE3_TYPE RT_NODE_LEAF_3
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ bool found = false;
+ uint8 key_chunk;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_VALUE_TYPE value;
+
+ Assert(RT_NODE_IS_LEAF(node_iter->node));
+#else
+ RT_PTR_LOCAL child = NULL;
+
+ Assert(!RT_NODE_IS_LEAF(node_iter->node));
+#endif
+
+#ifdef RT_SHMEM
+ Assert(iter->tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+
+ switch (node_iter->node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n3->base.n.count)
+ break;
+#ifdef RT_NODE_LEVEL_LEAF
+ value = n3->values[node_iter->current_idx];
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, n3->children[node_iter->current_idx]);
+#endif
+ key_chunk = n3->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n32->base.n.count)
+ break;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ value = n32->values[node_iter->current_idx];
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, n32->children[node_iter->current_idx]);
+#endif
+ key_chunk = n32->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (RT_NODE_125_IS_CHUNK_USED((RT_NODE_BASE_125 *) n125, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+#ifdef RT_NODE_LEVEL_LEAF
+ value = RT_NODE_LEAF_125_GET_VALUE(n125, i);
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, RT_NODE_INNER_125_GET_CHILD(n125, i));
+#endif
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+#ifdef RT_NODE_LEVEL_LEAF
+ if (RT_NODE_LEAF_256_IS_CHUNK_USED(n256, i))
+#else
+ if (RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
+#endif
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+#ifdef RT_NODE_LEVEL_LEAF
+ value = RT_NODE_LEAF_256_GET_VALUE(n256, i);
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, RT_NODE_INNER_256_GET_CHILD(n256, i));
+#endif
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ }
+
+ if (found)
+ {
+ RT_ITER_UPDATE_KEY(iter, key_chunk, node_iter->node->shift);
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = value;
+#endif
+ }
+
+#ifdef RT_NODE_LEVEL_LEAF
+ return found;
+#else
+ return child;
+#endif
+
+#undef RT_NODE3_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_search_impl.h b/src/include/lib/radixtree_search_impl.h
new file mode 100644
index 0000000000..a8925c75d0
--- /dev/null
+++ b/src/include/lib/radixtree_search_impl.h
@@ -0,0 +1,138 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree_search_impl.h
+ * Common implementation for search in leaf and inner nodes, plus
+ * update for inner nodes only.
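+ *
+ * When RT_ACTION_UPDATE is defined (inner nodes only), the chunk is assumed
+ * to exist and its child pointer is overwritten with 'new_child' rather than
+ * returned.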
+ *
+ * Note: There is deliberately no #include guard here
+ *
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * src/include/lib/radixtree_search_impl.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE3_TYPE RT_NODE_INNER_3
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE3_TYPE RT_NODE_LEAF_3
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+
+#ifdef RT_NODE_LEVEL_LEAF
+ Assert(value_p != NULL);
+ Assert(RT_NODE_IS_LEAF(node));
+#else
+#ifndef RT_ACTION_UPDATE
+ Assert(child_p != NULL);
+#endif
+ Assert(!RT_NODE_IS_LEAF(node));
+#endif
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node;
+ int idx = RT_NODE_3_SEARCH_EQ((RT_NODE_BASE_3 *) n3, chunk);
+
+#ifdef RT_ACTION_UPDATE
+ Assert(idx >= 0);
+ n3->children[idx] = new_child;
+#else
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = n3->values[idx];
+#else
+ *child_p = n3->children[idx];
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
+ int idx = RT_NODE_32_SEARCH_EQ((RT_NODE_BASE_32 *) n32, chunk);
+
+#ifdef RT_ACTION_UPDATE
+ Assert(idx >= 0);
+ n32->children[idx] = new_child;
+#else
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = n32->values[idx];
+#else
+ *child_p = n32->children[idx];
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+ int slotpos = n125->base.slot_idxs[chunk];
+
+#ifdef RT_ACTION_UPDATE
+ Assert(slotpos != RT_INVALID_SLOT_IDX);
+ n125->children[slotpos] = new_child;
+#else
+ if (slotpos == RT_INVALID_SLOT_IDX)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = RT_NODE_LEAF_125_GET_VALUE(n125, chunk);
+#else
+ *child_p = RT_NODE_INNER_125_GET_CHILD(n125, chunk);
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
+
+#ifdef RT_ACTION_UPDATE
+ RT_NODE_INNER_256_SET(n256, chunk, new_child);
+#else
+#ifdef RT_NODE_LEVEL_LEAF
+ if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk))
+#else
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, chunk))
+#endif
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = RT_NODE_LEAF_256_GET_VALUE(n256, chunk);
+#else
+ *child_p = RT_NODE_INNER_256_GET_CHILD(n256, chunk);
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ }
+
+#ifdef RT_ACTION_UPDATE
+ return;
+#else
+ return true;
+#endif /* RT_ACTION_UPDATE */
+
+#undef RT_NODE3_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/utils/dsa.h b/src/include/utils/dsa.h
index 3ce4ee300a..2af215484f 100644
--- a/src/include/utils/dsa.h
+++ b/src/include/utils/dsa.h
@@ -121,6 +121,7 @@ extern dsa_handle dsa_get_handle(dsa_area *area);
extern dsa_pointer dsa_allocate_extended(dsa_area *area, size_t size, int flags);
extern void dsa_free(dsa_area *area, dsa_pointer dp);
extern void *dsa_get_address(dsa_area *area, dsa_pointer dp);
+extern size_t dsa_get_total_size(dsa_area *area);
extern void dsa_trim(dsa_area *area);
extern void dsa_dump(dsa_area *area);
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index c629cbe383..9659eb85d7 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -28,6 +28,7 @@ SUBDIRS = \
test_pg_db_role_setting \
test_pg_dump \
test_predtest \
+ test_radixtree \
test_rbtree \
test_regex \
test_rls_hooks \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 1baa6b558d..232cbdac80 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -24,6 +24,7 @@ subdir('test_parser')
subdir('test_pg_db_role_setting')
subdir('test_pg_dump')
subdir('test_predtest')
+subdir('test_radixtree')
subdir('test_rbtree')
subdir('test_regex')
subdir('test_rls_hooks')
diff --git a/src/test/modules/test_radixtree/.gitignore b/src/test/modules/test_radixtree/.gitignore
new file mode 100644
index 0000000000..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/test_radixtree/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_radixtree/Makefile b/src/test/modules/test_radixtree/Makefile
new file mode 100644
index 0000000000..da06b93da3
--- /dev/null
+++ b/src/test/modules/test_radixtree/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_radixtree/Makefile
+
+MODULE_big = test_radixtree
+OBJS = \
+ $(WIN32RES) \
+ test_radixtree.o
+PGFILEDESC = "test_radixtree - test code for src/include/lib/radixtree.h"
+
+EXTENSION = test_radixtree
+DATA = test_radixtree--1.0.sql
+
+REGRESS = test_radixtree
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_radixtree
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_radixtree/README b/src/test/modules/test_radixtree/README
new file mode 100644
index 0000000000..a8b271869a
--- /dev/null
+++ b/src/test/modules/test_radixtree/README
@@ -0,0 +1,7 @@
+test_radixtree contains unit tests for testing the radix tree implementation
+in src/include/lib/radixtree.h.
+
+The tests verify the correctness of the implementation, but they can also be
+used as a micro-benchmark. If you set the 'rt_test_stats' flag in
+test_radixtree.c, the tests will print extra information about execution time
+and memory usage.
diff --git a/src/test/modules/test_radixtree/expected/test_radixtree.out b/src/test/modules/test_radixtree/expected/test_radixtree.out
new file mode 100644
index 0000000000..ce645cb8b5
--- /dev/null
+++ b/src/test/modules/test_radixtree/expected/test_radixtree.out
@@ -0,0 +1,36 @@
+CREATE EXTENSION test_radixtree;
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
+NOTICE: testing basic operations with leaf node 4
+NOTICE: testing basic operations with inner node 4
+NOTICE: testing basic operations with leaf node 32
+NOTICE: testing basic operations with inner node 32
+NOTICE: testing basic operations with leaf node 125
+NOTICE: testing basic operations with inner node 125
+NOTICE: testing basic operations with leaf node 256
+NOTICE: testing basic operations with inner node 256
+NOTICE: testing radix tree node types with shift "0"
+NOTICE: testing radix tree node types with shift "8"
+NOTICE: testing radix tree node types with shift "16"
+NOTICE: testing radix tree node types with shift "24"
+NOTICE: testing radix tree node types with shift "32"
+NOTICE: testing radix tree node types with shift "40"
+NOTICE: testing radix tree node types with shift "48"
+NOTICE: testing radix tree node types with shift "56"
+NOTICE: testing radix tree with pattern "all ones"
+NOTICE: testing radix tree with pattern "alternating bits"
+NOTICE: testing radix tree with pattern "clusters of ten"
+NOTICE: testing radix tree with pattern "clusters of hundred"
+NOTICE: testing radix tree with pattern "one-every-64k"
+NOTICE: testing radix tree with pattern "sparse"
+NOTICE: testing radix tree with pattern "single values, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^60"
+ test_radixtree
+----------------
+
+(1 row)
+
diff --git a/src/test/modules/test_radixtree/meson.build b/src/test/modules/test_radixtree/meson.build
new file mode 100644
index 0000000000..6add06bbdb
--- /dev/null
+++ b/src/test/modules/test_radixtree/meson.build
@@ -0,0 +1,35 @@
+# FIXME: prevent install during main install, but not during test :/
+
+test_radixtree_sources = files(
+ 'test_radixtree.c',
+)
+
+if host_system == 'windows'
+ test_radixtree_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'test_radixtree',
+    '--FILEDESC', 'test_radixtree - test code for src/include/lib/radixtree.h',])
+endif
+
+test_radixtree = shared_module('test_radixtree',
+ test_radixtree_sources,
+ link_with: pgport_srv,
+ kwargs: pg_mod_args,
+)
+testprep_targets += test_radixtree
+
+install_data(
+ 'test_radixtree.control',
+ 'test_radixtree--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'test_radixtree',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'test_radixtree',
+ ],
+ },
+}
diff --git a/src/test/modules/test_radixtree/sql/test_radixtree.sql b/src/test/modules/test_radixtree/sql/test_radixtree.sql
new file mode 100644
index 0000000000..41ece5e9f5
--- /dev/null
+++ b/src/test/modules/test_radixtree/sql/test_radixtree.sql
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_radixtree;
+
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
diff --git a/src/test/modules/test_radixtree/test_radixtree--1.0.sql b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
new file mode 100644
index 0000000000..074a5a7ea7
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_radixtree/test_radixtree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_radixtree" to load this file. \quit
+
+CREATE FUNCTION test_radixtree()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
new file mode 100644
index 0000000000..f944945db9
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -0,0 +1,674 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_radixtree.c
+ * Test radixtree set data structure.
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/test/modules/test_radixtree/test_radixtree.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "miscadmin.h"
+#include "nodes/bitmapset.h"
+#include "storage/block.h"
+#include "storage/itemptr.h"
+#include "storage/lwlock.h"
+#include "utils/memutils.h"
+#include "utils/timestamp.h"
+
+#define UINT64_HEX_FORMAT "%" INT64_MODIFIER "X"
+
+/*
+ * The tests pass with uint32, but build with warnings because the string
+ * format expects uint64.
+ */
+typedef uint64 TestValueType;
+
+/*
+ * If you enable this, the "pattern" tests will print information about
+ * how long populating, probing, and iterating the test set takes, and
+ * how much memory the test set consumed. That can be used as
+ * micro-benchmark of various operations and input patterns (you might
+ * want to increase the number of values used in each of the test, if
+ * you do that, to reduce noise).
+ *
+ * The information is printed to the server's stderr, mostly because
+ * that's where MemoryContextStats() output goes.
+ */
+static const bool rt_test_stats = false;
+
+static int rt_node_kind_fanouts[] = {
+ 0,
+ 4, /* RT_NODE_KIND_4 */
+ 32, /* RT_NODE_KIND_32 */
+ 125, /* RT_NODE_KIND_125 */
+ 256 /* RT_NODE_KIND_256 */
+};
+/*
+ * A struct to define a pattern of integers, for use with the test_pattern()
+ * function.
+ */
+typedef struct
+{
+ char *test_name; /* short name of the test, for humans */
+ char *pattern_str; /* a bit pattern */
+ uint64 spacing; /* pattern repeats at this interval */
+ uint64 num_values; /* number of integers to set in total */
+} test_spec;
+
+/* Test patterns borrowed from test_integerset.c */
+static const test_spec test_specs[] = {
+ {
+ "all ones", "1111111111",
+ 10, 1000000
+ },
+ {
+ "alternating bits", "0101010101",
+ 10, 1000000
+ },
+ {
+ "clusters of ten", "1111111111",
+ 10000, 1000000
+ },
+ {
+ "clusters of hundred",
+ "1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111",
+ 10000, 1000000
+ },
+ {
+ "one-every-64k", "1",
+ 65536, 1000000
+ },
+ {
+ "sparse", "100000000000000000000000000000001",
+ 10000000, 1000000
+ },
+ {
+ "single values, distance > 2^32", "1",
+ UINT64CONST(10000000000), 100000
+ },
+ {
+ "clusters, distance > 2^32", "10101010",
+ UINT64CONST(10000000000), 1000000
+ },
+ {
+ "clusters, distance > 2^60", "10101010",
+ UINT64CONST(2000000000000000000),
+ 23 /* can't be much higher than this, or we
+ * overflow uint64 */
+ }
+};
+
+/* define the radix tree implementation to test */
+#define RT_PREFIX rt
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#define RT_USE_DELETE
+#define RT_VALUE_TYPE TestValueType
+// WIP: compiles with warnings because rt_attach is defined but not used
+// #define RT_SHMEM
+#include "lib/radixtree.h"
+
+
+/*
+ * Return the number of keys in the radix tree.
+ */
+static uint64
+rt_num_entries(rt_radix_tree *tree)
+{
+ return tree->ctl->num_keys;
+}
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(test_radixtree);
+
+static void
+test_empty(void)
+{
+ rt_radix_tree *radixtree;
+ rt_iter *iter;
+ TestValueType dummy;
+ uint64 key;
+ TestValueType val;
+
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+ dsa = dsa_create(tranche_id);
+
+ radixtree = rt_create(CurrentMemoryContext, dsa, tranche_id);
+#else
+ radixtree = rt_create(CurrentMemoryContext);
+#endif
+
+ if (rt_search(radixtree, 0, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, 1, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, PG_UINT64_MAX, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_delete(radixtree, 0))
+ elog(ERROR, "rt_delete on empty tree returned true");
+
+ if (rt_num_entries(radixtree) != 0)
+		elog(ERROR, "rt_num_entries on empty tree returned non-zero");
+
+ iter = rt_begin_iterate(radixtree);
+
+ if (rt_iterate_next(iter, &key, &val))
+		elog(ERROR, "rt_iterate_next on empty tree returned true");
+
+ rt_end_iterate(iter);
+
+ rt_free(radixtree);
+
+#ifdef RT_SHMEM
+ dsa_detach(dsa);
+#endif
+}
+
+static void
+test_basic(int children, bool test_inner)
+{
+ rt_radix_tree *radixtree;
+ uint64 *keys;
+ int shift = test_inner ? 8 : 0;
+
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+ dsa = dsa_create(tranche_id);
+#endif
+
+ elog(NOTICE, "testing basic operations with %s node %d",
+ test_inner ? "inner" : "leaf", children);
+
+#ifdef RT_SHMEM
+ radixtree = rt_create(CurrentMemoryContext, dsa, tranche_id);
+#else
+ radixtree = rt_create(CurrentMemoryContext);
+#endif
+
+	/* prepare keys out of order, like 1, 32, 2, 31, 3, ... */
+ keys = palloc(sizeof(uint64) * children);
+ for (int i = 0; i < children; i++)
+ {
+ if (i % 2 == 0)
+ keys[i] = (uint64) ((i / 2) + 1) << shift;
+ else
+ keys[i] = (uint64) (children - (i / 2)) << shift;
+ }
+
+ /* insert keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (rt_set(radixtree, keys[i], (TestValueType*) &keys[i]))
+			elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " is found", keys[i]);
+ }
+
+ /* look up keys */
+ for (int i = 0; i < children; i++)
+ {
+ TestValueType value;
+
+ if (!rt_search(radixtree, keys[i], &value))
+ elog(ERROR, "could not find key 0x" UINT64_HEX_FORMAT, keys[i]);
+ if (value != (TestValueType) keys[i])
+ elog(ERROR, "rt_search returned 0x" UINT64_HEX_FORMAT ", expected " UINT64_HEX_FORMAT,
+ value, (TestValueType) keys[i]);
+ }
+
+ /* update keys */
+ for (int i = 0; i < children; i++)
+ {
+ TestValueType update = keys[i] + 1;
+ if (!rt_set(radixtree, keys[i], (TestValueType*) &update))
+ elog(ERROR, "could not update key 0x" UINT64_HEX_FORMAT, keys[i]);
+ }
+
+ /* repeat deleting and inserting keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (!rt_delete(radixtree, keys[i]))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, keys[i]);
+ if (rt_set(radixtree, keys[i], (TestValueType*) &keys[i]))
+			elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " is found", keys[i]);
+ }
+
+ pfree(keys);
+ rt_free(radixtree);
+#ifdef RT_SHMEM
+ dsa_detach(dsa);
+#endif
+}
+
+/*
+ * Check if keys from start to end with the shift exist in the tree.
+ */
+static void
+check_search_on_node(rt_radix_tree *radixtree, uint8 shift, int start, int end,
+ int incr)
+{
+ for (int i = start; i < end; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ TestValueType val;
+
+ if (!rt_search(radixtree, key, &val))
+ elog(ERROR, "key 0x" UINT64_HEX_FORMAT " is not found on node-%d",
+ key, end);
+ if (val != (TestValueType) key)
+ elog(ERROR, "rt_search with key 0x" UINT64_HEX_FORMAT " returns 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ key, val, key);
+ }
+}
+
+static void
+test_node_types_insert(rt_radix_tree *radixtree, uint8 shift, bool insert_asc)
+{
+ uint64 num_entries;
+ int ninserted = 0;
+ int start = insert_asc ? 0 : 256;
+ int incr = insert_asc ? 1 : -1;
+ int end = insert_asc ? 256 : 0;
+ int node_kind_idx = 1;
+
+ for (int i = start; i != end; i += incr)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_set(radixtree, key, (TestValueType*) &key);
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " is found", key);
+
+ /*
+ * After filling all slots in each node type, check if the values
+ * are stored properly.
+ */
+ if (ninserted == rt_node_kind_fanouts[node_kind_idx] - 1)
+ {
+ int check_start = insert_asc
+ ? rt_node_kind_fanouts[node_kind_idx - 1]
+ : rt_node_kind_fanouts[node_kind_idx];
+ int check_end = insert_asc
+ ? rt_node_kind_fanouts[node_kind_idx]
+ : rt_node_kind_fanouts[node_kind_idx - 1];
+
+ check_search_on_node(radixtree, shift, check_start, check_end, incr);
+ node_kind_idx++;
+ }
+
+ ninserted++;
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ if (num_entries != 256)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(256));
+}
+
+static void
+test_node_types_delete(rt_radix_tree *radixtree, uint8 shift)
+{
+ uint64 num_entries;
+
+ for (int i = 0; i < 256; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_delete(radixtree, key);
+
+ if (!found)
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, key);
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ /* The tree must be empty */
+ if (num_entries != 0)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(0));
+}
+
+/*
+ * Test for inserting and deleting key-value pairs to each node type at the given shift
+ * level.
+ */
+static void
+test_node_types(uint8 shift)
+{
+ rt_radix_tree *radixtree;
+
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+ dsa = dsa_create(tranche_id);
+#endif
+
+ elog(NOTICE, "testing radix tree node types with shift \"%d\"", shift);
+
+#ifdef RT_SHMEM
+ radixtree = rt_create(CurrentMemoryContext, dsa, tranche_id);
+#else
+ radixtree = rt_create(CurrentMemoryContext);
+#endif
+
+ /*
+ * Insert and search entries for every node type at the 'shift' level,
+ * then delete all entries to make it empty, and insert and search entries
+ * again.
+ */
+ test_node_types_insert(radixtree, shift, true);
+ test_node_types_delete(radixtree, shift);
+ test_node_types_insert(radixtree, shift, false);
+
+ rt_free(radixtree);
+#ifdef RT_SHMEM
+ dsa_detach(dsa);
+#endif
+}
+
+/*
+ * Test with a repeating pattern, defined by the 'spec'.
+ */
+static void
+test_pattern(const test_spec * spec)
+{
+ rt_radix_tree *radixtree;
+ rt_iter *iter;
+ MemoryContext radixtree_ctx;
+ TimestampTz starttime;
+ TimestampTz endtime;
+ uint64 n;
+ uint64 last_int;
+ uint64 ndeleted;
+ uint64 nbefore;
+ uint64 nafter;
+ int patternlen;
+ uint64 *pattern_values;
+ uint64 pattern_num_values;
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+ dsa = dsa_create(tranche_id);
+#endif
+
+ elog(NOTICE, "testing radix tree with pattern \"%s\"", spec->test_name);
+ if (rt_test_stats)
+ fprintf(stderr, "-----\ntesting radix tree with pattern \"%s\"\n", spec->test_name);
+
+ /* Pre-process the pattern, creating an array of integers from it. */
+ patternlen = strlen(spec->pattern_str);
+ pattern_values = palloc(patternlen * sizeof(uint64));
+ pattern_num_values = 0;
+ for (int i = 0; i < patternlen; i++)
+ {
+ if (spec->pattern_str[i] == '1')
+ pattern_values[pattern_num_values++] = i;
+ }
+
+ /*
+ * Allocate the radix tree.
+ *
+ * Allocate it in a separate memory context, so that we can print its
+ * memory usage easily.
+ */
+ radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
+ "radixtree test",
+ ALLOCSET_SMALL_SIZES);
+ MemoryContextSetIdentifier(radixtree_ctx, spec->test_name);
+
+#ifdef RT_SHMEM
+ radixtree = rt_create(radixtree_ctx, dsa, tranche_id);
+#else
+ radixtree = rt_create(radixtree_ctx);
+#endif
+
+
+ /*
+ * Add values to the set.
+ */
+ starttime = GetCurrentTimestamp();
+
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ uint64 x = 0;
+
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ bool found;
+
+ x = last_int + pattern_values[i];
+
+ found = rt_set(radixtree, x, (TestValueType*) &x);
+
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " found", x);
+
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+
+ endtime = GetCurrentTimestamp();
+
+ if (rt_test_stats)
+ fprintf(stderr, "added " UINT64_FORMAT " values in %d ms\n",
+ spec->num_values, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Print stats on the amount of memory used.
+ *
+ * We print the usage reported by rt_memory_usage(), as well as the stats
+ * from the memory context. They should be in the same ballpark, but it's
+ * hard to automate testing that, so if you're making changes to the
+ * implementation, just observe that manually.
+ */
+ if (rt_test_stats)
+ {
+ uint64 mem_usage;
+
+ /*
+ * Also print memory usage as reported by rt_memory_usage(). It
+ * should be in the same ballpark as the usage reported by
+ * MemoryContextStats().
+ */
+ mem_usage = rt_memory_usage(radixtree);
+ fprintf(stderr, "rt_memory_usage() reported " UINT64_FORMAT " (%0.2f bytes / integer)\n",
+ mem_usage, (double) mem_usage / spec->num_values);
+
+ MemoryContextStats(radixtree_ctx);
+ }
+
+ /* Check that rt_num_entries works */
+ n = rt_num_entries(radixtree);
+ if (n != spec->num_values)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT, n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_search()
+ */
+ starttime = GetCurrentTimestamp();
+
+ for (n = 0; n < 100000; n++)
+ {
+ bool found;
+ bool expected;
+ uint64 x;
+ TestValueType v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Do we expect this value to be present in the set? */
+ if (x >= last_int)
+ expected = false;
+ else
+ {
+ uint64 idx = x % spec->spacing;
+
+ if (idx >= patternlen)
+ expected = false;
+ else if (spec->pattern_str[idx] == '1')
+ expected = true;
+ else
+ expected = false;
+ }
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (found != expected)
+ elog(ERROR, "mismatch at 0x" UINT64_HEX_FORMAT ": %d vs %d", x, found, expected);
+ if (found && (v != (TestValueType) x))
+ elog(ERROR, "found 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ v, x);
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "probed " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Test iterator
+ */
+ starttime = GetCurrentTimestamp();
+
+ iter = rt_begin_iterate(radixtree);
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ uint64 expected = last_int + pattern_values[i];
+ uint64 x;
+ TestValueType val;
+
+ if (!rt_iterate_next(iter, &x, &val))
+ break;
+
+ if (x != expected)
+ elog(ERROR,
+ "iterate returned wrong key; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d",
+ x, expected, i);
+ if (val != (TestValueType) expected)
+ elog(ERROR,
+ "iterate returned wrong value; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d", x, expected, i);
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "iterated " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ rt_end_iterate(iter);
+
+ if (n < spec->num_values)
+ elog(ERROR, "iterator stopped short after " UINT64_FORMAT " entries, expected " UINT64_FORMAT, n, spec->num_values);
+ if (n > spec->num_values)
+ elog(ERROR, "iterator returned " UINT64_FORMAT " entries, " UINT64_FORMAT " was expected", n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_delete()
+ */
+ starttime = GetCurrentTimestamp();
+
+ nbefore = rt_num_entries(radixtree);
+ ndeleted = 0;
+ for (n = 0; n < 100000; n++)
+ {
+ bool found;
+ uint64 x;
+ TestValueType v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (!found)
+ continue;
+
+ /* If the key is found, delete it and check again */
+ if (!rt_delete(radixtree, x))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_search(radixtree, x, &v))
+ elog(ERROR, "found deleted key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_delete(radixtree, x))
+ elog(ERROR, "deleted already-deleted key 0x" UINT64_HEX_FORMAT, x);
+
+ ndeleted++;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "deleted " UINT64_FORMAT " values in %d ms\n",
+ ndeleted, (int) (endtime - starttime) / 1000);
+
+ nafter = rt_num_entries(radixtree);
+
+ /* Check that rt_num_entries works */
+ if ((nbefore - ndeleted) != nafter)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT "after " UINT64_FORMAT " deletion",
+ nafter, (nbefore - ndeleted), ndeleted);
+
+ rt_free(radixtree);
+ MemoryContextDelete(radixtree_ctx);
+#ifdef RT_SHMEM
+ dsa_detach(dsa);
+#endif
+}
+
+Datum
+test_radixtree(PG_FUNCTION_ARGS)
+{
+ test_empty();
+
+ for (int i = 1; i < lengthof(rt_node_kind_fanouts); i++)
+ {
+ test_basic(rt_node_kind_fanouts[i], false);
+ test_basic(rt_node_kind_fanouts[i], true);
+ }
+
+ for (int shift = 0; shift <= (64 - 8); shift += 8)
+ test_node_types(shift);
+
+ /* Test different test patterns, with lots of entries */
+ for (int i = 0; i < lengthof(test_specs); i++)
+ test_pattern(&test_specs[i]);
+
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_radixtree/test_radixtree.control b/src/test/modules/test_radixtree/test_radixtree.control
new file mode 100644
index 0000000000..e53f2a3e0c
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.control
@@ -0,0 +1,4 @@
+comment = 'Test code for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/test_radixtree'
+relocatable = true
diff --git a/src/tools/pginclude/cpluspluscheck b/src/tools/pginclude/cpluspluscheck
index e52fe9f509..09fa6e7432 100755
--- a/src/tools/pginclude/cpluspluscheck
+++ b/src/tools/pginclude/cpluspluscheck
@@ -101,6 +101,12 @@ do
test "$f" = src/include/nodes/nodetags.h && continue
test "$f" = src/backend/nodes/nodetags.h && continue
+ # radixtree_*_impl.h cannot be included standalone: they are just code fragments.
+ test "$f" = src/include/lib/radixtree_delete_impl.h && continue
+ test "$f" = src/include/lib/radixtree_insert_impl.h && continue
+ test "$f" = src/include/lib/radixtree_iter_impl.h && continue
+ test "$f" = src/include/lib/radixtree_search_impl.h && continue
+
# These files are not meant to be included standalone, because
# they contain lists that might have multiple use-cases.
test "$f" = src/include/access/rmgrlist.h && continue
diff --git a/src/tools/pginclude/headerscheck b/src/tools/pginclude/headerscheck
index abbba7aa63..d4d2f1da03 100755
--- a/src/tools/pginclude/headerscheck
+++ b/src/tools/pginclude/headerscheck
@@ -96,6 +96,12 @@ do
test "$f" = src/include/nodes/nodetags.h && continue
test "$f" = src/backend/nodes/nodetags.h && continue
+ # radixtree_*_impl.h cannot be included standalone: they are just code fragments.
+ test "$f" = src/include/lib/radixtree_delete_impl.h && continue
+ test "$f" = src/include/lib/radixtree_insert_impl.h && continue
+ test "$f" = src/include/lib/radixtree_iter_impl.h && continue
+ test "$f" = src/include/lib/radixtree_search_impl.h && continue
+
# These files are not meant to be included standalone, because
# they contain lists that might have multiple use-cases.
test "$f" = src/include/access/rmgrlist.h && continue
--
2.39.1
Attachment: v25-0006-Adjust-some-inlining-declarations.patch (text/x-patch)
From 77541d3f48e6fef39645df5b3c535ac431e12194 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Mon, 6 Feb 2023 21:04:14 +0700
Subject: [PATCH v25 6/9] Adjust some inlining declarations
---
src/include/lib/radixtree.h | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index d6919aef08..4bd0aaa810 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -1124,7 +1124,7 @@ RT_INIT_NODE(RT_PTR_LOCAL node, uint8 kind, RT_SIZE_CLASS size_class, bool is_le
* Create a new node as the root. Subordinate nodes will be created during
* the insertion.
*/
-static void
+static pg_noinline void
RT_NEW_ROOT(RT_RADIX_TREE *tree, uint64 key)
{
int shift = RT_KEY_GET_SHIFT(key);
@@ -1215,7 +1215,7 @@ RT_NODE_UPDATE_INNER(RT_PTR_LOCAL node, uint64 key, RT_PTR_ALLOC new_child)
/*
* Replace old_child with new_child, and free the old one.
*/
-static void
+static inline void
RT_REPLACE_NODE(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent,
RT_PTR_ALLOC stored_old_child, RT_PTR_LOCAL old_child,
RT_PTR_ALLOC new_child, uint64 key)
@@ -1242,7 +1242,7 @@ RT_REPLACE_NODE(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent,
* The radix tree doesn't have sufficient height. Extend the radix tree so
* it can store the key.
*/
-static void
+static pg_noinline void
RT_EXTEND(RT_RADIX_TREE *tree, uint64 key)
{
int target_shift;
@@ -1281,7 +1281,7 @@ RT_EXTEND(RT_RADIX_TREE *tree, uint64 key)
* The radix tree doesn't have inner and leaf nodes for given key-value pair.
* Insert inner and leaf nodes from 'node' to bottom.
*/
-static inline void
+static pg_noinline void
RT_SET_EXTEND(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p, RT_PTR_LOCAL parent,
RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node)
{
@@ -1486,7 +1486,7 @@ RT_GET_HANDLE(RT_RADIX_TREE *tree)
/*
* Recursively free all nodes allocated to the DSA area.
*/
-static inline void
+static void
RT_FREE_RECURSE(RT_RADIX_TREE *tree, RT_PTR_ALLOC ptr)
{
RT_PTR_LOCAL node = RT_PTR_GET_LOCAL(tree, ptr);
--
2.39.1
Attachment: v25-0007-Skip-unnecessary-searches-in-RT_NODE_INSERT_INNE.patch (text/x-patch)
From f2a3340200ea26c17de5c5261adbeaada64ae4b6 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Mon, 6 Feb 2023 22:04:50 +0700
Subject: [PATCH v25 7/9] Skip unnecessary searches in RT_NODE_INSERT_INNER
For inner nodes, we know the key chunk doesn't exist already,
otherwise we would have found it while descending the tree.
To reinforce this fact, declare this function to return void.
---
src/include/lib/radixtree.h | 4 +--
src/include/lib/radixtree_insert_impl.h | 48 ++++++++++++-------------
2 files changed, 24 insertions(+), 28 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 4bd0aaa810..1cdb995e54 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -685,7 +685,7 @@ typedef struct RT_ITER
} RT_ITER;
-static bool RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
+static void RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
uint64 key, RT_PTR_ALLOC child);
static bool RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
uint64 key, RT_VALUE_TYPE *value_p);
@@ -1375,7 +1375,7 @@ RT_NODE_DELETE_LEAF(RT_PTR_LOCAL node, uint64 key)
* If the node we're inserting into needs to grow, we update the parent's
* child pointer with the pointer to the new larger node.
*/
-static bool
+static void
RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
uint64 key, RT_PTR_ALLOC child)
{
diff --git a/src/include/lib/radixtree_insert_impl.h b/src/include/lib/radixtree_insert_impl.h
index c18e26b537..d56e58dcac 100644
--- a/src/include/lib/radixtree_insert_impl.h
+++ b/src/include/lib/radixtree_insert_impl.h
@@ -28,10 +28,10 @@
#endif
uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
- bool chunk_exists = false;
#ifdef RT_NODE_LEVEL_LEAF
const bool is_leaf = true;
+ bool chunk_exists = false;
Assert(RT_NODE_IS_LEAF(node));
#else
const bool is_leaf = false;
@@ -43,21 +43,18 @@
case RT_NODE_KIND_3:
{
RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node;
- int idx;
- idx = RT_NODE_3_SEARCH_EQ(&n3->base, chunk);
+#ifdef RT_NODE_LEVEL_LEAF
+ int idx = RT_NODE_3_SEARCH_EQ(&n3->base, chunk);
+
if (idx != -1)
{
/* found the existing chunk */
chunk_exists = true;
-#ifdef RT_NODE_LEVEL_LEAF
n3->values[idx] = *value_p;
-#else
- n3->children[idx] = child;
-#endif
break;
}
-
+#endif
if (unlikely(RT_NODE_MUST_GROW(n3)))
{
RT_PTR_ALLOC allocnode;
@@ -113,21 +110,18 @@
{
const RT_SIZE_CLASS_ELEM class32_max = RT_SIZE_CLASS_INFO[RT_CLASS_32_MAX];
RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
- int idx;
- idx = RT_NODE_32_SEARCH_EQ(&n32->base, chunk);
+#ifdef RT_NODE_LEVEL_LEAF
+ int idx = RT_NODE_32_SEARCH_EQ(&n32->base, chunk);
+
if (idx != -1)
{
/* found the existing chunk */
chunk_exists = true;
-#ifdef RT_NODE_LEVEL_LEAF
n32->values[idx] = *value_p;
-#else
- n32->children[idx] = child;
-#endif
break;
}
-
+#endif
if (unlikely(RT_NODE_MUST_GROW(n32)) &&
n32->base.n.fanout < class32_max.fanout)
{
@@ -220,21 +214,19 @@
case RT_NODE_KIND_125:
{
RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
- int slotpos = n125->base.slot_idxs[chunk];
+ int slotpos;
int cnt = 0;
+#ifdef RT_NODE_LEVEL_LEAF
+ slotpos = n125->base.slot_idxs[chunk];
if (slotpos != RT_INVALID_SLOT_IDX)
{
/* found the existing chunk */
chunk_exists = true;
-#ifdef RT_NODE_LEVEL_LEAF
n125->values[slotpos] = *value_p;
-#else
- n125->children[slotpos] = child;
-#endif
break;
}
-
+#endif
if (unlikely(RT_NODE_MUST_GROW(n125)))
{
RT_PTR_ALLOC allocnode;
@@ -300,14 +292,10 @@
#ifdef RT_NODE_LEVEL_LEAF
chunk_exists = RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk);
-#else
- chunk_exists = RT_NODE_INNER_256_IS_CHUNK_USED(n256, chunk);
-#endif
Assert(chunk_exists || node->count < RT_NODE_MAX_SLOTS);
-
-#ifdef RT_NODE_LEVEL_LEAF
RT_NODE_LEAF_256_SET(n256, chunk, *value_p);
#else
+ Assert(node->count < RT_NODE_MAX_SLOTS);
RT_NODE_INNER_256_SET(n256, chunk, child);
#endif
break;
@@ -315,8 +303,12 @@
}
/* Update statistics */
+#ifdef RT_NODE_LEVEL_LEAF
if (!chunk_exists)
node->count++;
+#else
+ node->count++;
+#endif
/*
* Done. Finally, verify the chunk and value is inserted or replaced
@@ -324,7 +316,11 @@
*/
RT_VERIFY_NODE(node);
+#ifdef RT_NODE_LEVEL_LEAF
return chunk_exists;
+#else
+ return;
+#endif
#undef RT_NODE3_TYPE
#undef RT_NODE32_TYPE
--
2.39.1
Attachment: v25-0008-Add-TIDStore-to-store-sets-of-TIDs-ItemPointerDa.patch (text/x-patch)
From 693e335f77211e9947cd356d9287c9af96e78815 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 23 Dec 2022 23:50:39 +0900
Subject: [PATCH v25 8/9] Add TIDStore, to store sets of TIDs (ItemPointerData)
efficiently.
The TIDStore is designed to store large sets of TIDs efficiently, and
is backed by the radix tree. A TID is encoded into a 64-bit key and a
64-bit value and inserted into the radix tree.
The TIDStore is not used for anything yet, aside from the test code,
but the follow-up patch integrates the TIDStore with lazy vacuum,
reducing lazy vacuum memory usage and lifting the 1GB limit on its
size, by storing the list of dead TIDs more efficiently.
This includes a unit test module, in src/test/modules/test_tidstore.
---
doc/src/sgml/monitoring.sgml | 4 +
src/backend/access/common/Makefile | 1 +
src/backend/access/common/meson.build | 1 +
src/backend/access/common/tidstore.c | 688 ++++++++++++++++++
src/backend/storage/lmgr/lwlock.c | 2 +
src/include/access/tidstore.h | 49 ++
src/include/storage/lwlock.h | 1 +
src/test/modules/Makefile | 1 +
src/test/modules/meson.build | 1 +
src/test/modules/test_tidstore/Makefile | 23 +
.../test_tidstore/expected/test_tidstore.out | 13 +
src/test/modules/test_tidstore/meson.build | 35 +
.../test_tidstore/sql/test_tidstore.sql | 7 +
.../test_tidstore/test_tidstore--1.0.sql | 8 +
.../modules/test_tidstore/test_tidstore.c | 195 +++++
.../test_tidstore/test_tidstore.control | 4 +
16 files changed, 1033 insertions(+)
create mode 100644 src/backend/access/common/tidstore.c
create mode 100644 src/include/access/tidstore.h
create mode 100644 src/test/modules/test_tidstore/Makefile
create mode 100644 src/test/modules/test_tidstore/expected/test_tidstore.out
create mode 100644 src/test/modules/test_tidstore/meson.build
create mode 100644 src/test/modules/test_tidstore/sql/test_tidstore.sql
create mode 100644 src/test/modules/test_tidstore/test_tidstore--1.0.sql
create mode 100644 src/test/modules/test_tidstore/test_tidstore.c
create mode 100644 src/test/modules/test_tidstore/test_tidstore.control
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 1756f1a4b6..d936aa3da3 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -2192,6 +2192,10 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
<entry>Waiting to access a shared TID bitmap during a parallel bitmap
index scan.</entry>
</row>
+ <row>
+ <entry><literal>SharedTidStore</literal></entry>
+ <entry>Waiting to access a shared TID store.</entry>
+ </row>
<row>
<entry><literal>SharedTupleStore</literal></entry>
<entry>Waiting to access a shared tuple store during parallel
diff --git a/src/backend/access/common/Makefile b/src/backend/access/common/Makefile
index b9aff0ccfd..67b8cc6108 100644
--- a/src/backend/access/common/Makefile
+++ b/src/backend/access/common/Makefile
@@ -27,6 +27,7 @@ OBJS = \
syncscan.o \
toast_compression.o \
toast_internals.o \
+ tidstore.o \
tupconvert.o \
tupdesc.o
diff --git a/src/backend/access/common/meson.build b/src/backend/access/common/meson.build
index f5ac17b498..fce19c09ce 100644
--- a/src/backend/access/common/meson.build
+++ b/src/backend/access/common/meson.build
@@ -15,6 +15,7 @@ backend_sources += files(
'syncscan.c',
'toast_compression.c',
'toast_internals.c',
+ 'tidstore.c',
'tupconvert.c',
'tupdesc.c',
)
diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
new file mode 100644
index 0000000000..4c72673ce9
--- /dev/null
+++ b/src/backend/access/common/tidstore.c
@@ -0,0 +1,688 @@
+/*-------------------------------------------------------------------------
+ *
+ * tidstore.c
+ * Tid (ItemPointerData) storage implementation.
+ *
+ * This module provides an in-memory data structure to store Tids (ItemPointer).
+ * Internally, a tid is encoded as a pair of a 64-bit key and a 64-bit value,
+ * and stored in the radix tree.
+ *
+ * A TidStore can be shared among parallel worker processes by passing a DSA
+ * area to tidstore_create(). Other backends can attach to the shared TidStore
+ * with tidstore_attach().
+ *
+ * Regarding concurrency, this module mostly relies on the concurrency support
+ * in the radix tree, but we acquire a lock on the TidStore in some cases, for
+ * example, when resetting the store and when accessing the number of tids in
+ * the store (num_tids).
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/common/tidstore.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/tidstore.h"
+#include "miscadmin.h"
+#include "port/pg_bitutils.h"
+#include "storage/lwlock.h"
+#include "utils/dsa.h"
+#include "utils/memutils.h"
+
+/*
+ * For encoding purposes, tids are represented as a pair of a 64-bit key and
+ * a 64-bit value. First, we construct a 64-bit unsigned integer by combining
+ * the block number and the offset number. The number of bits used for the
+ * offset number is derived from max_offset in tidstore_create(). We are
+ * frugal with the bits, because smaller keys help keep the radix tree
+ * shallow.
+ *
+ * For example, a tid of heap with 8kB blocks uses the lowest 9 bits for
+ * the offset number and uses the next 32 bits for the block number. That
+ * is, only 41 bits are used:
+ *
+ * uuuuuuuY YYYYYYYY YYYYYYYY YYYYYYYY YYYYYYYX XXXXXXXX
+ *
+ * X = bits used for offset number
+ * Y = bits used for block number
+ * u = unused bit
+ * (high on the left, low on the right)
+ *
+ * 9 bits are enough for the offset number, because MaxHeapTuplesPerPage < 2^9
+ * on 8kB blocks.
+ *
+ * The 64-bit value is the bitmap representation of the lowest 6 bits
+ * (TIDSTORE_VALUE_NBITS) of the integer, and the remaining 35 bits are used
+ * as the key:
+ *
+ * uuuuuuuY YYYYYYYY YYYYYYYY YYYYYYYY YYYYYYYX XXXXXXXX
+ * |----| value
+ * |---------------------------------------------| key
+ *
+ * The maximum height of the radix tree is 5 in this case.
+ *
+ * If the bitmap of all possible offset numbers fits in a single 64-bit value,
+ * we don't encode tids but directly use the block number and the offset
+ * number as key and value, respectively.
+ */
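+
+/*
+ * A worked example (illustrative, assuming 8kB heap blocks so offset_nbits
+ * is 9): for blkno = 1000 and offset = 7, the combined integer is
+ * (1000 << 9) | 7 = 512007.  The low 6 bits (7) select the bit to set in
+ * the 64-bit value, and the key is 512007 >> 6 = 8000.
+ */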
+#define TIDSTORE_VALUE_NBITS 6 /* log(64, 2) */
+
+/* A magic value used to identify our TidStores. */
+#define TIDSTORE_MAGIC 0x826f6a10
+
+#define RT_PREFIX local_rt
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#define RT_VALUE_TYPE uint64
+#include "lib/radixtree.h"
+
+#define RT_PREFIX shared_rt
+#define RT_SHMEM
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#define RT_VALUE_TYPE uint64
+#include "lib/radixtree.h"
+
+/* The control object for a TidStore */
+typedef struct TidStoreControl
+{
+ /* the number of tids in the store */
+ int64 num_tids;
+
+ /* These values are never changed after creation */
+ size_t max_bytes; /* the maximum bytes a TidStore can use */
+ int max_offset; /* the maximum offset number */
+ int offset_nbits; /* the number of bits required for max_offset */
+ bool encode_tids; /* do we use tid encoding? */
+ int offset_key_nbits; /* the number of bits of an offset number
+ * used for the key */
+
+ /* The below fields are used only in shared case */
+
+ uint32 magic;
+ LWLock lock;
+
+ /* handles for TidStore and radix tree */
+ tidstore_handle handle;
+ shared_rt_handle tree_handle;
+} TidStoreControl;
+
+/* Per-backend state for a TidStore */
+struct TidStore
+{
+ /*
+ * Control object. This is allocated in DSA area 'area' in the shared
+ * case, otherwise in backend-local memory.
+ */
+ TidStoreControl *control;
+
+ /* Storage for Tids. Use either one depending on TidStoreIsShared() */
+ union
+ {
+ local_rt_radix_tree *local;
+ shared_rt_radix_tree *shared;
+ } tree;
+
+ /* DSA area for TidStore if used */
+ dsa_area *area;
+};
+#define TidStoreIsShared(ts) ((ts)->area != NULL)
+
+/* Iterator for TidStore */
+typedef struct TidStoreIter
+{
+ TidStore *ts;
+
+ /* iterator of radix tree. Use either one depending on TidStoreIsShared() */
+ union
+ {
+ shared_rt_iter *shared;
+ local_rt_iter *local;
+ } tree_iter;
+
+ /* have we returned all the tids? */
+ bool finished;
+
+ /* save for the next iteration */
+ uint64 next_key;
+ uint64 next_val;
+
+ /* output for the caller */
+ TidStoreIterResult result;
+} TidStoreIter;
+
+static void tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, uint64 val);
+static inline BlockNumber key_get_blkno(TidStore *ts, uint64 key);
+static inline uint64 tid_to_key_off(TidStore *ts, ItemPointer tid, uint32 *off);
+
+/*
+ * Create a TidStore. The returned object is allocated in backend-local memory.
+ * The radix tree for storage is allocated in the DSA area if 'area' is non-NULL.
+ */
+TidStore *
+tidstore_create(size_t max_bytes, int max_offset, dsa_area *area)
+{
+ TidStore *ts;
+
+ ts = palloc0(sizeof(TidStore));
+
+ /*
+ * Create the radix tree for the main storage.
+ *
+ * Memory consumption depends on the number of stored tids, but also on their
+ * distribution, on how the radix tree stores them, and on the memory
+ * management backing the radix tree. The maximum bytes that a TidStore can
+ * use is specified by max_bytes in tidstore_create(). We want the total
+ * amount of memory consumed by a TidStore not to exceed max_bytes.
+ *
+ * In the local TidStore case, the radix tree uses a slab allocator for each
+ * node class. The most memory-consuming step while adding tids associated
+ * with one page (i.e. during tidstore_add_tids()) is allocating a new
+ * slab block for a new radix tree node, which is approximately 70kB.
+ * Therefore, we deduct 70kB from max_bytes.
+ *
+ * In the shared case, DSA allocates memory segments following a geometric
+ * series that approximately doubles the total DSA size (see
+ * make_new_segment() in dsa.c). We simulated how DSA increases segment
+ * size, and the simulation showed that a 75% threshold for the maximum bytes
+ * works well when max_bytes is a power of two, and a 60% threshold works
+ * for other cases.
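+ *
+ * For example (illustrative numbers): with a power-of-two max_bytes of 256MB,
+ * the shared-case limit becomes 192MB, while a 100MB local TidStore ends up
+ * capped at 100MB minus 70kB.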
+ */
+ if (area != NULL)
+ {
+ dsa_pointer dp;
+ float ratio = ((max_bytes & (max_bytes - 1)) == 0) ? 0.75 : 0.6;
+
+ ts->tree.shared = shared_rt_create(CurrentMemoryContext, area,
+ LWTRANCHE_SHARED_TIDSTORE);
+
+ dp = dsa_allocate0(area, sizeof(TidStoreControl));
+ ts->control = (TidStoreControl *) dsa_get_address(area, dp);
+ ts->control->max_bytes = (uint64) (max_bytes * ratio);
+ ts->area = area;
+
+ ts->control->magic = TIDSTORE_MAGIC;
+ LWLockInitialize(&ts->control->lock, LWTRANCHE_SHARED_TIDSTORE);
+ ts->control->handle = dp;
+ ts->control->tree_handle = shared_rt_get_handle(ts->tree.shared);
+ }
+ else
+ {
+ ts->tree.local = local_rt_create(CurrentMemoryContext);
+
+ ts->control = (TidStoreControl *) palloc0(sizeof(TidStoreControl));
+ ts->control->max_bytes = max_bytes - (70 * 1024);
+ }
+
+ ts->control->max_offset = max_offset;
+ ts->control->offset_nbits = pg_ceil_log2_32(max_offset);
+
+ /*
+ * We use tid encoding if the bitmap of all possible offset numbers doesn't
+ * fit in a single uint64 value.
+ */
+ if (ts->control->offset_nbits > TIDSTORE_VALUE_NBITS)
+ {
+ ts->control->encode_tids = true;
+ ts->control->offset_key_nbits =
+ ts->control->offset_nbits - TIDSTORE_VALUE_NBITS;
+ }
+ else
+ {
+ ts->control->encode_tids = false;
+ ts->control->offset_key_nbits = 0;
+ }
+
+ return ts;
+}
+
+/*
+ * Attach to the shared TidStore using a handle. The returned object is
+ * allocated in backend-local memory using the CurrentMemoryContext.
+ */
+TidStore *
+tidstore_attach(dsa_area *area, tidstore_handle handle)
+{
+ TidStore *ts;
+ dsa_pointer control;
+
+ Assert(area != NULL);
+ Assert(DsaPointerIsValid(handle));
+
+ /* create per-backend state */
+ ts = palloc0(sizeof(TidStore));
+
+ /* Find the control object in shared memory */
+ control = handle;
+
+ /* Set up the TidStore */
+ ts->control = (TidStoreControl *) dsa_get_address(area, control);
+ Assert(ts->control->magic == TIDSTORE_MAGIC);
+
+ ts->tree.shared = shared_rt_attach(area, ts->control->tree_handle);
+ ts->area = area;
+
+ return ts;
+}
+
+/*
+ * Detach from a TidStore. This detaches from radix tree and frees the
+ * backend-local resources. The radix tree will continue to exist until
+ * it is either explicitly destroyed, or the area that backs it is returned
+ * to the operating system.
+ */
+void
+tidstore_detach(TidStore *ts)
+{
+ Assert(TidStoreIsShared(ts) && ts->control->magic == TIDSTORE_MAGIC);
+
+ shared_rt_detach(ts->tree.shared);
+ pfree(ts);
+}
+
+/*
+ * Destroy a TidStore, returning all memory.
+ *
+ * TODO: The caller must be certain that no other backend will attempt to
+ * access the TidStore before calling this function. Other backend must
+ * explicitly call tidstore_detach to free up backend-local memory associated
+ * with the TidStore. The backend that calls tidstore_destroy must not call
+ * tidstore_detach.
+ */
+void
+tidstore_destroy(TidStore *ts)
+{
+ if (TidStoreIsShared(ts))
+ {
+ Assert(ts->control->magic == TIDSTORE_MAGIC);
+
+ /*
+ * Vandalize the control block to help catch programming error where
+ * other backends access the memory formerly occupied by this radix
+ * tree.
+ */
+ ts->control->magic = 0;
+ dsa_free(ts->area, ts->control->handle);
+ shared_rt_free(ts->tree.shared);
+ }
+ else
+ {
+ pfree(ts->control);
+ local_rt_free(ts->tree.local);
+ }
+
+ pfree(ts);
+}
+
+/*
+ * Forget all collected Tids. It's similar to tidstore_destroy, but instead of
+ * freeing the entire TidStore, we recreate only the radix tree storage.
+ */
+void
+tidstore_reset(TidStore *ts)
+{
+ if (TidStoreIsShared(ts))
+ {
+ Assert(ts->control->magic == TIDSTORE_MAGIC);
+
+ LWLockAcquire(&ts->control->lock, LW_EXCLUSIVE);
+
+ /*
+ * Free the radix tree and return allocated DSA segments to
+ * the operating system.
+ */
+ shared_rt_free(ts->tree.shared);
+ dsa_trim(ts->area);
+
+ /* Recreate the radix tree */
+ ts->tree.shared = shared_rt_create(CurrentMemoryContext, ts->area,
+ LWTRANCHE_SHARED_TIDSTORE);
+
+ /* update the radix tree handle as we recreated it */
+ ts->control->tree_handle = shared_rt_get_handle(ts->tree.shared);
+
+ /* Reset the statistics */
+ ts->control->num_tids = 0;
+
+ LWLockRelease(&ts->control->lock);
+ }
+ else
+ {
+ local_rt_free(ts->tree.local);
+ ts->tree.local = local_rt_create(CurrentMemoryContext);
+
+ /* Reset the statistics */
+ ts->control->num_tids = 0;
+ }
+}
+
+/* Add Tids on a block to TidStore */
+void
+tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets)
+{
+ ItemPointerData tid;
+ uint64 key_base;
+ uint64 *values;
+ int nkeys;
+
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ if (ts->control->encode_tids)
+ {
+ key_base = ((uint64) blkno) << ts->control->offset_key_nbits;
+ nkeys = UINT64CONST(1) << ts->control->offset_key_nbits;
+ }
+ else
+ {
+ key_base = (uint64) blkno;
+ nkeys = 1;
+ }
+ values = palloc0(sizeof(uint64) * nkeys);
+
+ ItemPointerSetBlockNumber(&tid, blkno);
+ for (int i = 0; i < num_offsets; i++)
+ {
+ uint64 key;
+ uint32 off;
+ int idx;
+
+ ItemPointerSetOffsetNumber(&tid, offsets[i]);
+
+ /* encode the tid to key and val */
+ key = tid_to_key_off(ts, &tid, &off);
+
+ idx = key - key_base;
+ Assert(idx >= 0 && idx < nkeys);
+
+ values[idx] |= UINT64CONST(1) << off;
+ }
+
+ if (TidStoreIsShared(ts))
+ LWLockAcquire(&ts->control->lock, LW_EXCLUSIVE);
+
+ /* insert the calculated key-values to the tree */
+ for (int i = 0; i < nkeys; i++)
+ {
+ if (values[i])
+ {
+ uint64 key = key_base + i;
+
+ if (TidStoreIsShared(ts))
+ shared_rt_set(ts->tree.shared, key, &values[i]);
+ else
+ local_rt_set(ts->tree.local, key, &values[i]);
+ }
+ }
+
+ /* update statistics */
+ ts->control->num_tids += num_offsets;
+
+ if (TidStoreIsShared(ts))
+ LWLockRelease(&ts->control->lock);
+
+ pfree(values);
+}
+
+/* Return true if the given tid is present in the TidStore */
+bool
+tidstore_lookup_tid(TidStore *ts, ItemPointer tid)
+{
+ uint64 key;
+ uint64 val = 0;
+ uint32 off;
+ bool found;
+
+ key = tid_to_key_off(ts, tid, &off);
+
+ if (TidStoreIsShared(ts))
+ found = shared_rt_search(ts->tree.shared, key, &val);
+ else
+ found = local_rt_search(ts->tree.local, key, &val);
+
+ if (!found)
+ return false;
+
+ return (val & (UINT64CONST(1) << off)) != 0;
+}
+
+/*
+ * Prepare to iterate through a TidStore. Since the radix tree is locked during
+ * the iteration, tidstore_end_iterate() needs to be called when finished.
+ *
+ * Concurrent updates during the iteration will be blocked when inserting a
+ * key-value pair into the radix tree.
+ */
+TidStoreIter *
+tidstore_begin_iterate(TidStore *ts)
+{
+ TidStoreIter *iter;
+
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ iter = palloc0(sizeof(TidStoreIter));
+ iter->ts = ts;
+
+ iter->result.blkno = InvalidBlockNumber;
+ iter->result.offsets = palloc(sizeof(OffsetNumber) * ts->control->max_offset);
+
+ if (TidStoreIsShared(ts))
+ iter->tree_iter.shared = shared_rt_begin_iterate(ts->tree.shared);
+ else
+ iter->tree_iter.local = local_rt_begin_iterate(ts->tree.local);
+
+ /* If the TidStore is empty, there is nothing to iterate */
+ if (tidstore_num_tids(ts) == 0)
+ iter->finished = true;
+
+ return iter;
+}
+
+static inline bool
+tidstore_iter_kv(TidStoreIter *iter, uint64 *key, uint64 *val)
+{
+ if (TidStoreIsShared(iter->ts))
+ return shared_rt_iterate_next(iter->tree_iter.shared, key, val);
+
+ return local_rt_iterate_next(iter->tree_iter.local, key, val);
+}
+
+/*
+ * Scan the TidStore and return a pointer to a TidStoreIterResult that has the
+ * tids in one block. We return the block numbers in ascending order, and the
+ * offset numbers in each result are also sorted in ascending order.
+ */
+TidStoreIterResult *
+tidstore_iterate_next(TidStoreIter *iter)
+{
+ uint64 key;
+ uint64 val;
+ TidStoreIterResult *result = &(iter->result);
+
+ if (iter->finished)
+ return NULL;
+
+ if (BlockNumberIsValid(result->blkno))
+ {
+ /* Process the previously collected key-value */
+ result->num_offsets = 0;
+ tidstore_iter_extract_tids(iter, iter->next_key, iter->next_val);
+ }
+
+ while (tidstore_iter_kv(iter, &key, &val))
+ {
+ BlockNumber blkno;
+
+ blkno = key_get_blkno(iter->ts, key);
+
+ if (BlockNumberIsValid(result->blkno) && result->blkno != blkno)
+ {
+ /*
+ * We got a key-value pair for a different block. So return the
+ * collected tids, and remember the key-value for the next iteration.
+ */
+ iter->next_key = key;
+ iter->next_val = val;
+ return result;
+ }
+
+ /* Collect tids extracted from the key-value pair */
+ tidstore_iter_extract_tids(iter, key, val);
+ }
+
+ iter->finished = true;
+ return result;
+}
+
+/*
+ * Finish an iteration over a TidStore. This needs to be called after finishing
+ * an iteration, or when exiting one early.
+ */
+void
+tidstore_end_iterate(TidStoreIter *iter)
+{
+ if (TidStoreIsShared(iter->ts))
+ shared_rt_end_iterate(iter->tree_iter.shared);
+ else
+ local_rt_end_iterate(iter->tree_iter.local);
+
+ pfree(iter->result.offsets);
+ pfree(iter);
+}
+
+/* Return the number of tids we collected so far */
+int64
+tidstore_num_tids(TidStore *ts)
+{
+ uint64 num_tids;
+
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ if (!TidStoreIsShared(ts))
+ return ts->control->num_tids;
+
+ LWLockAcquire(&ts->control->lock, LW_SHARED);
+ num_tids = ts->control->num_tids;
+ LWLockRelease(&ts->control->lock);
+
+ return num_tids;
+}
+
+/* Return true if the current memory usage of TidStore exceeds the limit */
+bool
+tidstore_is_full(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ return (tidstore_memory_usage(ts) > ts->control->max_bytes);
+}
+
+/* Return the maximum memory TidStore can use */
+size_t
+tidstore_max_memory(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ return ts->control->max_bytes;
+}
+
+/* Return the memory usage of TidStore */
+size_t
+tidstore_memory_usage(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ /*
+ * In the shared case, TidStoreControl and radix_tree are backed by the
+ * same DSA area and rt_memory_usage() returns the value including both.
+ * So we don't need to add the size of TidStoreControl separately.
+ */
+ if (TidStoreIsShared(ts))
+ return sizeof(TidStore) + shared_rt_memory_usage(ts->tree.shared);
+
+ return sizeof(TidStore) + sizeof(TidStoreControl) + local_rt_memory_usage(ts->tree.local);
+}
+
+/*
+ * Get a handle that can be used by other processes to attach to this TidStore
+ */
+tidstore_handle
+tidstore_get_handle(TidStore *ts)
+{
+ Assert(TidStoreIsShared(ts) && ts->control->magic == TIDSTORE_MAGIC);
+
+ return ts->control->handle;
+}
+
+/* Extract tids from the given key-value pair */
+static void
+tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, uint64 val)
+{
+ TidStoreIterResult *result = (&iter->result);
+
+ for (int i = 0; i < sizeof(uint64) * BITS_PER_BYTE; i++)
+ {
+ uint64 tid_i;
+ OffsetNumber off;
+
+ if (i > iter->ts->control->max_offset)
+ {
+ Assert(!iter->ts->control->encode_tids);
+ break;
+ }
+
+ if ((val & (UINT64CONST(1) << i)) == 0)
+ continue;
+
+ tid_i = key << TIDSTORE_VALUE_NBITS;
+ tid_i |= i;
+
+ off = tid_i & ((UINT64CONST(1) << iter->ts->control->offset_nbits) - 1);
+
+ Assert(result->num_offsets < iter->ts->control->max_offset);
+ result->offsets[result->num_offsets++] = off;
+ }
+
+ result->blkno = key_get_blkno(iter->ts, key);
+}
+
+/* Get block number from the given key */
+static inline BlockNumber
+key_get_blkno(TidStore *ts, uint64 key)
+{
+ if (ts->control->encode_tids)
+ return (BlockNumber) (key >> ts->control->offset_key_nbits);
+
+ return (BlockNumber) key;
+}
+
+/* Encode a tid to key and offset */
+static inline uint64
+tid_to_key_off(TidStore *ts, ItemPointer tid, uint32 *off)
+{
+ uint64 key;
+ uint64 tid_i;
+
+ if (!ts->control->encode_tids)
+ {
+ *off = ItemPointerGetOffsetNumber(tid);
+
+ /* Use the block number as the key */
+ return (uint64) ItemPointerGetBlockNumber(tid);
+ }
+
+ tid_i = ItemPointerGetOffsetNumber(tid);
+ tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << ts->control->offset_nbits;
+
+ *off = tid_i & ((UINT64CONST(1) << TIDSTORE_VALUE_NBITS) - 1);
+ key = tid_i >> TIDSTORE_VALUE_NBITS;
+ Assert(*off < (sizeof(uint64) * BITS_PER_BYTE));
+
+ return key;
+}
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index d2ec396045..55b3a04097 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -176,6 +176,8 @@ static const char *const BuiltinTrancheNames[] = {
"SharedTupleStore",
/* LWTRANCHE_SHARED_TIDBITMAP: */
"SharedTidBitmap",
+ /* LWTRANCHE_SHARED_TIDSTORE: */
+ "SharedTidStore",
/* LWTRANCHE_PARALLEL_APPEND: */
"ParallelAppend",
/* LWTRANCHE_PER_XACT_PREDICATE_LIST: */
diff --git a/src/include/access/tidstore.h b/src/include/access/tidstore.h
new file mode 100644
index 0000000000..a35a52124a
--- /dev/null
+++ b/src/include/access/tidstore.h
@@ -0,0 +1,49 @@
+/*-------------------------------------------------------------------------
+ *
+ * tidstore.h
+ * Tid storage.
+ *
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/tidstore.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef TIDSTORE_H
+#define TIDSTORE_H
+
+#include "storage/itemptr.h"
+#include "utils/dsa.h"
+
+typedef dsa_pointer tidstore_handle;
+
+typedef struct TidStore TidStore;
+typedef struct TidStoreIter TidStoreIter;
+
+typedef struct TidStoreIterResult
+{
+ BlockNumber blkno;
+ OffsetNumber *offsets;
+ int num_offsets;
+} TidStoreIterResult;
+
+extern TidStore *tidstore_create(size_t max_bytes, int max_offset, dsa_area *dsa);
+extern TidStore *tidstore_attach(dsa_area *dsa, dsa_pointer handle);
+extern void tidstore_detach(TidStore *ts);
+extern void tidstore_destroy(TidStore *ts);
+extern void tidstore_reset(TidStore *ts);
+extern void tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets);
+extern bool tidstore_lookup_tid(TidStore *ts, ItemPointer tid);
+extern TidStoreIter * tidstore_begin_iterate(TidStore *ts);
+extern TidStoreIterResult *tidstore_iterate_next(TidStoreIter *iter);
+extern void tidstore_end_iterate(TidStoreIter *iter);
+extern int64 tidstore_num_tids(TidStore *ts);
+extern bool tidstore_is_full(TidStore *ts);
+extern size_t tidstore_max_memory(TidStore *ts);
+extern size_t tidstore_memory_usage(TidStore *ts);
+extern tidstore_handle tidstore_get_handle(TidStore *ts);
+
+#endif /* TIDSTORE_H */
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index d2c7afb8f4..07002fdfbe 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -199,6 +199,7 @@ typedef enum BuiltinTrancheIds
LWTRANCHE_PER_SESSION_RECORD_TYPMOD,
LWTRANCHE_SHARED_TUPLESTORE,
LWTRANCHE_SHARED_TIDBITMAP,
+ LWTRANCHE_SHARED_TIDSTORE,
LWTRANCHE_PARALLEL_APPEND,
LWTRANCHE_PER_XACT_PREDICATE_LIST,
LWTRANCHE_PGSTATS_DSA,
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 9659eb85d7..bddc16ada7 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -34,6 +34,7 @@ SUBDIRS = \
test_rls_hooks \
test_shm_mq \
test_slru \
+ test_tidstore \
unsafe_tests \
worker_spi
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 232cbdac80..c0d5645ad8 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -30,5 +30,6 @@ subdir('test_regex')
subdir('test_rls_hooks')
subdir('test_shm_mq')
subdir('test_slru')
+subdir('test_tidstore')
subdir('unsafe_tests')
subdir('worker_spi')
diff --git a/src/test/modules/test_tidstore/Makefile b/src/test/modules/test_tidstore/Makefile
new file mode 100644
index 0000000000..dab107d70c
--- /dev/null
+++ b/src/test/modules/test_tidstore/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_tidstore/Makefile
+
+MODULE_big = test_tidstore
+OBJS = \
+ $(WIN32RES) \
+ test_tidstore.o
+PGFILEDESC = "test_tidstore - test code for src/backend/access/common/tidstore.c"
+
+EXTENSION = test_tidstore
+DATA = test_tidstore--1.0.sql
+
+REGRESS = test_tidstore
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_tidstore
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_tidstore/expected/test_tidstore.out b/src/test/modules/test_tidstore/expected/test_tidstore.out
new file mode 100644
index 0000000000..7ff2f9af87
--- /dev/null
+++ b/src/test/modules/test_tidstore/expected/test_tidstore.out
@@ -0,0 +1,13 @@
+CREATE EXTENSION test_tidstore;
+--
+-- All the logic is in the test_tidstore() function. It will throw
+-- an error if something fails.
+--
+SELECT test_tidstore();
+NOTICE: testing empty tidstore
+NOTICE: testing basic operations
+ test_tidstore
+---------------
+
+(1 row)
+
diff --git a/src/test/modules/test_tidstore/meson.build b/src/test/modules/test_tidstore/meson.build
new file mode 100644
index 0000000000..31f2da7b61
--- /dev/null
+++ b/src/test/modules/test_tidstore/meson.build
@@ -0,0 +1,35 @@
+# FIXME: prevent install during main install, but not during test :/
+
+test_tidstore_sources = files(
+ 'test_tidstore.c',
+)
+
+if host_system == 'windows'
+ test_tidstore_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'test_tidstore',
+ '--FILEDESC', 'test_tidstore - test code for src/backend/access/common/tidstore.c',])
+endif
+
+test_tidstore = shared_module('test_tidstore',
+ test_tidstore_sources,
+ link_with: pgport_srv,
+ kwargs: pg_mod_args,
+)
+testprep_targets += test_tidstore
+
+install_data(
+ 'test_tidstore.control',
+ 'test_tidstore--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'test_tidstore',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'test_tidstore',
+ ],
+ },
+}
diff --git a/src/test/modules/test_tidstore/sql/test_tidstore.sql b/src/test/modules/test_tidstore/sql/test_tidstore.sql
new file mode 100644
index 0000000000..03aea31815
--- /dev/null
+++ b/src/test/modules/test_tidstore/sql/test_tidstore.sql
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_tidstore;
+
+--
+-- All the logic is in the test_tidstore() function. It will throw
+-- an error if something fails.
+--
+SELECT test_tidstore();
diff --git a/src/test/modules/test_tidstore/test_tidstore--1.0.sql b/src/test/modules/test_tidstore/test_tidstore--1.0.sql
new file mode 100644
index 0000000000..47e9149900
--- /dev/null
+++ b/src/test/modules/test_tidstore/test_tidstore--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_tidstore/test_tidstore--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_tidstore" to load this file. \quit
+
+CREATE FUNCTION test_tidstore()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_tidstore/test_tidstore.c b/src/test/modules/test_tidstore/test_tidstore.c
new file mode 100644
index 0000000000..9b849ae8e8
--- /dev/null
+++ b/src/test/modules/test_tidstore/test_tidstore.c
@@ -0,0 +1,195 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_tidstore.c
+ * Test TidStore data structure.
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/test/modules/test_tidstore/test_tidstore.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "access/tidstore.h"
+#include "fmgr.h"
+#include "miscadmin.h"
+#include "storage/block.h"
+#include "storage/itemptr.h"
+#include "utils/memutils.h"
+
+PG_MODULE_MAGIC;
+
+#define TEST_TIDSTORE_MAX_BYTES (2 * 1024 * 1024L) /* 2MB */
+
+PG_FUNCTION_INFO_V1(test_tidstore);
+
+static void
+check_tid(TidStore *ts, BlockNumber blkno, OffsetNumber off, bool expect)
+{
+ ItemPointerData tid;
+ bool found;
+
+ ItemPointerSet(&tid, blkno, off);
+
+ found = tidstore_lookup_tid(ts, &tid);
+
+ if (found != expect)
+ elog(ERROR, "lookup TID (%u, %u) returned %d, expected %d",
+ blkno, off, found, expect);
+}
+
+static void
+test_basic(int max_offset)
+{
+#define TEST_TIDSTORE_NUM_BLOCKS 5
+#define TEST_TIDSTORE_NUM_OFFSETS 5
+
+ TidStore *ts;
+ TidStoreIter *iter;
+ TidStoreIterResult *iter_result;
+ BlockNumber blks[TEST_TIDSTORE_NUM_BLOCKS] = {
+ 0, MaxBlockNumber, MaxBlockNumber - 1, 1, MaxBlockNumber / 2,
+ };
+ BlockNumber blks_sorted[TEST_TIDSTORE_NUM_BLOCKS] = {
+ 0, 1, MaxBlockNumber / 2, MaxBlockNumber - 1, MaxBlockNumber
+ };
+ OffsetNumber offs[TEST_TIDSTORE_NUM_OFFSETS];
+ int blk_idx;
+
+ /* prepare the offset array */
+ offs[0] = FirstOffsetNumber;
+ offs[1] = FirstOffsetNumber + 1;
+ offs[2] = max_offset / 2;
+ offs[3] = max_offset - 1;
+ offs[4] = max_offset;
+
+ ts = tidstore_create(TEST_TIDSTORE_MAX_BYTES, max_offset, NULL);
+
+ /* add tids */
+ for (int i = 0; i < TEST_TIDSTORE_NUM_BLOCKS; i++)
+ tidstore_add_tids(ts, blks[i], offs, TEST_TIDSTORE_NUM_OFFSETS);
+
+ /* lookup test */
+ for (OffsetNumber off = FirstOffsetNumber ; off < max_offset; off++)
+ {
+ bool expect = false;
+ for (int i = 0; i < TEST_TIDSTORE_NUM_OFFSETS; i++)
+ {
+ if (offs[i] == off)
+ {
+ expect = true;
+ break;
+ }
+ }
+
+ check_tid(ts, 0, off, expect);
+ check_tid(ts, 2, off, false);
+ check_tid(ts, MaxBlockNumber - 2, off, false);
+ check_tid(ts, MaxBlockNumber, off, expect);
+ }
+
+ /* test the number of tids */
+ if (tidstore_num_tids(ts) != (TEST_TIDSTORE_NUM_BLOCKS * TEST_TIDSTORE_NUM_OFFSETS))
+ elog(ERROR, "tidstore_num_tids returned " UINT64_FORMAT ", expected %d",
+ tidstore_num_tids(ts),
+ TEST_TIDSTORE_NUM_BLOCKS * TEST_TIDSTORE_NUM_OFFSETS);
+
+ /* iteration test */
+ iter = tidstore_begin_iterate(ts);
+ blk_idx = 0;
+ while ((iter_result = tidstore_iterate_next(iter)) != NULL)
+ {
+ /* check the returned block number */
+ if (blks_sorted[blk_idx] != iter_result->blkno)
+ elog(ERROR, "tidstore_iterate_next returned block number %u, expected %u",
+ iter_result->blkno, blks_sorted[blk_idx]);
+
+ /* check the returned offset numbers */
+ if (TEST_TIDSTORE_NUM_OFFSETS != iter_result->num_offsets)
+ elog(ERROR, "tidstore_iterate_next returned %u offsets, expected %u",
+ iter_result->num_offsets, TEST_TIDSTORE_NUM_OFFSETS);
+
+ for (int i = 0; i < iter_result->num_offsets; i++)
+ {
+ if (offs[i] != iter_result->offsets[i])
+ elog(ERROR, "tidstore_iterate_next returned offset number %u on block %u, expected %u",
+ iter_result->offsets[i], iter_result->blkno, offs[i]);
+ }
+
+ blk_idx++;
+ }
+
+ if (blk_idx != TEST_TIDSTORE_NUM_BLOCKS)
+ elog(ERROR, "tidstore_iterate_next returned %d blocks, expected %d",
+ blk_idx, TEST_TIDSTORE_NUM_BLOCKS);
+
+ /* remove all tids */
+ tidstore_reset(ts);
+
+ /* test the number of tids */
+ if (tidstore_num_tids(ts) != 0)
+ elog(ERROR, "tidstore_num_tids on empty store returned non-zero");
+
+ /* lookup test for empty store */
+ for (OffsetNumber off = FirstOffsetNumber ; off < MaxHeapTuplesPerPage;
+ off++)
+ {
+ check_tid(ts, 0, off, false);
+ check_tid(ts, 2, off, false);
+ check_tid(ts, MaxBlockNumber - 2, off, false);
+ check_tid(ts, MaxBlockNumber, off, false);
+ }
+
+ tidstore_destroy(ts);
+}
+
+static void
+test_empty(void)
+{
+ TidStore *ts;
+ TidStoreIter *iter;
+ ItemPointerData tid;
+
+ elog(NOTICE, "testing empty tidstore");
+
+ ts = tidstore_create(TEST_TIDSTORE_MAX_BYTES, MaxHeapTuplesPerPage, NULL);
+
+ ItemPointerSet(&tid, 0, FirstOffsetNumber);
+ if (tidstore_lookup_tid(ts, &tid))
+ elog(ERROR, "tidstore_lookup_tid for (0,1) on empty store returned true");
+
+ ItemPointerSet(&tid, MaxBlockNumber, MaxOffsetNumber);
+ if (tidstore_lookup_tid(ts, &tid))
+ elog(ERROR, "tidstore_lookup_tid for (%u,%u) on empty store returned true",
+ MaxBlockNumber, MaxOffsetNumber);
+
+ if (tidstore_num_tids(ts) != 0)
+ elog(ERROR, "tidstore_num_entries on empty store returned non-zero");
+
+ if (tidstore_is_full(ts))
+ elog(ERROR, "tidstore_is_full on empty store returned true");
+
+ iter = tidstore_begin_iterate(ts);
+
+ if (tidstore_iterate_next(iter) != NULL)
+ elog(ERROR, "tidstore_iterate_next on empty store returned TIDs");
+
+ tidstore_end_iterate(iter);
+
+ tidstore_destroy(ts);
+}
+
+Datum
+test_tidstore(PG_FUNCTION_ARGS)
+{
+ test_empty();
+
+ elog(NOTICE, "testing basic operations");
+ test_basic(MaxHeapTuplesPerPage);
+ test_basic(10);
+
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_tidstore/test_tidstore.control b/src/test/modules/test_tidstore/test_tidstore.control
new file mode 100644
index 0000000000..9b6bd4638f
--- /dev/null
+++ b/src/test/modules/test_tidstore/test_tidstore.control
@@ -0,0 +1,4 @@
+comment = 'Test code for tidstore'
+default_version = '1.0'
+module_pathname = '$libdir/test_tidstore'
+relocatable = true
--
2.39.1
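
For reference, here is a minimal sketch (not part of the patches) of how the backend-local TidStore API introduced above could be used. The function name and the constants are illustrative only; the real usage is in the test module above and in the lazy vacuum patch below:

    /* Illustrative sketch of the TidStore API; not part of the patch set. */
    #include "postgres.h"
    #include "access/htup_details.h"
    #include "access/tidstore.h"

    static void
    tidstore_usage_sketch(void)
    {
        TidStore   *ts;
        TidStoreIter *iter;
        TidStoreIterResult *result;
        OffsetNumber offs[] = {1, 2, 5};
        ItemPointerData tid;

        /* Backend-local store: pass NULL instead of a DSA area. */
        ts = tidstore_create(64 * 1024 * 1024, MaxHeapTuplesPerPage, NULL);

        /* Record three dead offsets on block 42. */
        tidstore_add_tids(ts, 42, offs, lengthof(offs));

        /* Point lookup, as lazy_tid_reaped() would do for each index tuple. */
        ItemPointerSet(&tid, 42, 2);
        Assert(tidstore_lookup_tid(ts, &tid));

        /* Iterate block by block; offsets come back sorted per block. */
        iter = tidstore_begin_iterate(ts);
        while ((result = tidstore_iterate_next(iter)) != NULL)
            elog(DEBUG1, "block %u has %d dead offsets",
                 result->blkno, result->num_offsets);
        tidstore_end_iterate(iter);

        tidstore_destroy(ts);
    }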
Attachment: v25-0009-Use-TIDStore-for-storing-dead-tuple-TID-during-l.patch (text/x-patch)
From dcbcf6cdd786f9debf1536ac73093107debfafe8 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Tue, 17 Jan 2023 17:20:37 +0700
Subject: [PATCH v25 9/9] Use TIDStore for storing dead tuple TID during lazy
vacuum
Previously, we used an array of ItemPointerData to store dead tuple
TIDs, which was not space efficient and slow to lookup. Also, we had
the 1GB limit on its size.
Now we use TIDStore to store dead tuple TIDs. Since the TIDStore,
backed by the radix tree, incrementally allocates memory, we get
rid of the 1GB limit.
Since we are no longer able to exactly estimate the maximum number of
TIDs that can be stored, pg_stat_progress_vacuum now shows the progress
information based on the amount of memory in bytes. The column names
are also changed to max_dead_tuple_bytes and num_dead_tuple_bytes.
In addition, since the TIDStore uses the radix tree internally, the
minimum amount of memory required by the TIDStore is 1MB, the initial DSA
segment size. Due to that, we increase the minimum value of
maintenance_work_mem (also autovacuum_work_mem) from 1MB to 2MB.
XXX: needs to bump catalog version
---
doc/src/sgml/monitoring.sgml | 8 +-
src/backend/access/heap/vacuumlazy.c | 278 ++++++++-------------
src/backend/catalog/system_views.sql | 2 +-
src/backend/commands/vacuum.c | 78 +-----
src/backend/commands/vacuumparallel.c | 73 +++---
src/backend/postmaster/autovacuum.c | 6 +-
src/backend/storage/lmgr/lwlock.c | 2 +
src/backend/utils/misc/guc_tables.c | 2 +-
src/include/commands/progress.h | 4 +-
src/include/commands/vacuum.h | 25 +-
src/include/storage/lwlock.h | 1 +
src/test/regress/expected/cluster.out | 2 +-
src/test/regress/expected/create_index.out | 2 +-
src/test/regress/expected/rules.out | 4 +-
src/test/regress/sql/cluster.sql | 2 +-
src/test/regress/sql/create_index.sql | 2 +-
16 files changed, 177 insertions(+), 314 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index d936aa3da3..0230c74e3d 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -6870,10 +6870,10 @@ FROM pg_stat_get_backend_idset() AS backendid;
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>max_dead_tuples</structfield> <type>bigint</type>
+ <structfield>max_dead_tuple_bytes</structfield> <type>bigint</type>
</para>
<para>
- Number of dead tuples that we can store before needing to perform
+ Amount of dead tuple data that we can store before needing to perform
an index vacuum cycle, based on
<xref linkend="guc-maintenance-work-mem"/>.
</para></entry>
@@ -6881,10 +6881,10 @@ FROM pg_stat_get_backend_idset() AS backendid;
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>num_dead_tuples</structfield> <type>bigint</type>
+ <structfield>dead_tuple_bytes</structfield> <type>bigint</type>
</para>
<para>
- Number of dead tuples collected since the last index vacuum cycle.
+ Amount of dead tuple data collected since the last index vacuum cycle.
</para></entry>
</row>
</tbody>
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 8f14cf85f3..b4e40423a8 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -3,18 +3,18 @@
* vacuumlazy.c
* Concurrent ("lazy") vacuuming.
*
- * The major space usage for vacuuming is storage for the array of dead TIDs
+ * The major space usage for vacuuming is the TidStore, which stores the dead TIDs
* that are to be removed from indexes. We want to ensure we can vacuum even
* the very largest relations with finite memory space usage. To do that, we
- * set upper bounds on the number of TIDs we can keep track of at once.
+ * set upper bounds on the maximum memory that can be used for keeping track
+ * of dead TIDs at once.
*
* We are willing to use at most maintenance_work_mem (or perhaps
* autovacuum_work_mem) memory space to keep track of dead TIDs. We initially
- * allocate an array of TIDs of that size, with an upper limit that depends on
- * table size (this limit ensures we don't allocate a huge area uselessly for
- * vacuuming small tables). If the array threatens to overflow, we must call
- * lazy_vacuum to vacuum indexes (and to vacuum the pages that we've pruned).
- * This frees up the memory space dedicated to storing dead TIDs.
+ * create a TidStore with that value as the upper limit of its memory usage.
+ * If the TidStore is full, we must call lazy_vacuum to vacuum indexes (and to
+ * vacuum the pages that we've pruned). This frees up the memory space dedicated
+ * to storing dead TIDs.
*
* In practice VACUUM will often complete its initial pass over the target
* heap relation without ever running out of space to store TIDs. This means
@@ -40,6 +40,7 @@
#include "access/heapam_xlog.h"
#include "access/htup_details.h"
#include "access/multixact.h"
+#include "access/tidstore.h"
#include "access/transam.h"
#include "access/visibilitymap.h"
#include "access/xact.h"
@@ -188,7 +189,7 @@ typedef struct LVRelState
* lazy_vacuum_heap_rel, which marks the same LP_DEAD line pointers as
* LP_UNUSED during second heap pass.
*/
- VacDeadItems *dead_items; /* TIDs whose index tuples we'll delete */
+ TidStore *dead_items; /* TIDs whose index tuples we'll delete */
BlockNumber rel_pages; /* total number of pages */
BlockNumber scanned_pages; /* # pages examined (not skipped via VM) */
BlockNumber removed_pages; /* # pages removed by relation truncation */
@@ -220,11 +221,14 @@ typedef struct LVRelState
typedef struct LVPagePruneState
{
bool hastup; /* Page prevents rel truncation? */
- bool has_lpdead_items; /* includes existing LP_DEAD items */
+
+ /* collected offsets of LP_DEAD items including existing ones */
+ OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
+ int num_offsets;
/*
* State describes the proper VM bit states to set for the page following
- * pruning and freezing. all_visible implies !has_lpdead_items, but don't
+ * pruning and freezing. all_visible implies num_offsets == 0, but don't
* trust all_frozen result unless all_visible is also set to true.
*/
bool all_visible; /* Every item visible to all? */
@@ -259,8 +263,9 @@ static bool lazy_scan_noprune(LVRelState *vacrel, Buffer buf,
static void lazy_vacuum(LVRelState *vacrel);
static bool lazy_vacuum_all_indexes(LVRelState *vacrel);
static void lazy_vacuum_heap_rel(LVRelState *vacrel);
-static int lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
- Buffer buffer, int index, Buffer vmbuffer);
+static void lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
+ OffsetNumber *offsets, int num_offsets,
+ Buffer buffer, Buffer vmbuffer);
static bool lazy_check_wraparound_failsafe(LVRelState *vacrel);
static void lazy_cleanup_all_indexes(LVRelState *vacrel);
static IndexBulkDeleteResult *lazy_vacuum_one_index(Relation indrel,
@@ -487,11 +492,11 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
}
/*
- * Allocate dead_items array memory using dead_items_alloc. This handles
- * parallel VACUUM initialization as part of allocating shared memory
- * space used for dead_items. (But do a failsafe precheck first, to
- * ensure that parallel VACUUM won't be attempted at all when relfrozenxid
- * is already dangerously old.)
+ * Allocate dead_items memory using dead_items_alloc. This handles parallel
+ * VACUUM initialization as part of allocating shared memory space used for
+ * dead_items. (But do a failsafe precheck first, to ensure that parallel
+ * VACUUM won't be attempted at all when relfrozenxid is already dangerously
+ * old.)
*/
lazy_check_wraparound_failsafe(vacrel);
dead_items_alloc(vacrel, params->nworkers);
@@ -797,7 +802,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
* have collected the TIDs whose index tuples need to be removed.
*
* Finally, invokes lazy_vacuum_heap_rel to vacuum heap pages, which
- * largely consists of marking LP_DEAD items (from collected TID array)
+ * largely consists of marking LP_DEAD items (from vacrel->dead_items)
* as LP_UNUSED. This has to happen in a second, final pass over the
* heap, to preserve a basic invariant that all index AMs rely on: no
* extant index tuple can ever be allowed to contain a TID that points to
@@ -825,21 +830,21 @@ lazy_scan_heap(LVRelState *vacrel)
blkno,
next_unskippable_block,
next_fsm_block_to_vacuum = 0;
- VacDeadItems *dead_items = vacrel->dead_items;
+ TidStore *dead_items = vacrel->dead_items;
Buffer vmbuffer = InvalidBuffer;
bool next_unskippable_allvis,
skipping_current_range;
const int initprog_index[] = {
PROGRESS_VACUUM_PHASE,
PROGRESS_VACUUM_TOTAL_HEAP_BLKS,
- PROGRESS_VACUUM_MAX_DEAD_TUPLES
+ PROGRESS_VACUUM_MAX_DEAD_TUPLE_BYTES
};
int64 initprog_val[3];
/* Report that we're scanning the heap, advertising total # of blocks */
initprog_val[0] = PROGRESS_VACUUM_PHASE_SCAN_HEAP;
initprog_val[1] = rel_pages;
- initprog_val[2] = dead_items->max_items;
+ initprog_val[2] = tidstore_max_memory(vacrel->dead_items);
pgstat_progress_update_multi_param(3, initprog_index, initprog_val);
/* Set up an initial range of skippable blocks using the visibility map */
@@ -906,8 +911,7 @@ lazy_scan_heap(LVRelState *vacrel)
* dead_items TIDs, pause and do a cycle of vacuuming before we tackle
* this page.
*/
- Assert(dead_items->max_items >= MaxHeapTuplesPerPage);
- if (dead_items->max_items - dead_items->num_items < MaxHeapTuplesPerPage)
+ if (tidstore_is_full(vacrel->dead_items))
{
/*
* Before beginning index vacuuming, we release any pin we may
@@ -969,7 +973,7 @@ lazy_scan_heap(LVRelState *vacrel)
continue;
}
- /* Collect LP_DEAD items in dead_items array, count tuples */
+ /* Collect LP_DEAD items in dead_items, count tuples */
if (lazy_scan_noprune(vacrel, buf, blkno, page, &hastup,
&recordfreespace))
{
@@ -1011,14 +1015,14 @@ lazy_scan_heap(LVRelState *vacrel)
* Prune, freeze, and count tuples.
*
* Accumulates details of remaining LP_DEAD line pointers on page in
- * dead_items array. This includes LP_DEAD line pointers that we
- * pruned ourselves, as well as existing LP_DEAD line pointers that
- * were pruned some time earlier. Also considers freezing XIDs in the
- * tuple headers of remaining items with storage.
+ * dead_items. This includes LP_DEAD line pointers that we pruned
+ * ourselves, as well as existing LP_DEAD line pointers that were pruned
+ * some time earlier. Also considers freezing XIDs in the tuple headers
+ * of remaining items with storage.
*/
lazy_scan_prune(vacrel, buf, blkno, page, &prunestate);
- Assert(!prunestate.all_visible || !prunestate.has_lpdead_items);
+ Assert(!prunestate.all_visible || (prunestate.num_offsets == 0));
/* Remember the location of the last page with nonremovable tuples */
if (prunestate.hastup)
@@ -1034,14 +1038,12 @@ lazy_scan_heap(LVRelState *vacrel)
* performed here can be thought of as the one-pass equivalent of
* a call to lazy_vacuum().
*/
- if (prunestate.has_lpdead_items)
+ if (prunestate.num_offsets > 0)
{
Size freespace;
- lazy_vacuum_heap_page(vacrel, blkno, buf, 0, vmbuffer);
-
- /* Forget the LP_DEAD items that we just vacuumed */
- dead_items->num_items = 0;
+ lazy_vacuum_heap_page(vacrel, blkno, prunestate.deadoffsets,
+ prunestate.num_offsets, buf, vmbuffer);
/*
* Periodically perform FSM vacuuming to make newly-freed
@@ -1078,7 +1080,16 @@ lazy_scan_heap(LVRelState *vacrel)
* with prunestate-driven visibility map and FSM steps (just like
* the two-pass strategy).
*/
- Assert(dead_items->num_items == 0);
+ Assert(tidstore_num_tids(dead_items) == 0);
+ }
+ else if (prunestate.num_offsets > 0)
+ {
+ /* Save details of the LP_DEAD items from the page in dead_items */
+ tidstore_add_tids(dead_items, blkno, prunestate.deadoffsets,
+ prunestate.num_offsets);
+
+ pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
+ tidstore_memory_usage(dead_items));
}
/*
@@ -1145,7 +1156,7 @@ lazy_scan_heap(LVRelState *vacrel)
* There should never be LP_DEAD items on a page with PD_ALL_VISIBLE
* set, however.
*/
- else if (prunestate.has_lpdead_items && PageIsAllVisible(page))
+ else if ((prunestate.num_offsets > 0) && PageIsAllVisible(page))
{
elog(WARNING, "page containing LP_DEAD items is marked as all-visible in relation \"%s\" page %u",
vacrel->relname, blkno);
@@ -1193,7 +1204,7 @@ lazy_scan_heap(LVRelState *vacrel)
* Final steps for block: drop cleanup lock, record free space in the
* FSM
*/
- if (prunestate.has_lpdead_items && vacrel->do_index_vacuuming)
+ if ((prunestate.num_offsets > 0) && vacrel->do_index_vacuuming)
{
/*
* Wait until lazy_vacuum_heap_rel() to save free space. This
@@ -1249,7 +1260,7 @@ lazy_scan_heap(LVRelState *vacrel)
* Do index vacuuming (call each index's ambulkdelete routine), then do
* related heap vacuuming
*/
- if (dead_items->num_items > 0)
+ if (tidstore_num_tids(dead_items) > 0)
lazy_vacuum(vacrel);
/*
@@ -1524,9 +1535,9 @@ lazy_scan_new_or_empty(LVRelState *vacrel, Buffer buf, BlockNumber blkno,
* The approach we take now is to restart pruning when the race condition is
* detected. This allows heap_page_prune() to prune the tuples inserted by
* the now-aborted transaction. This is a little crude, but it guarantees
- * that any items that make it into the dead_items array are simple LP_DEAD
- * line pointers, and that every remaining item with tuple storage is
- * considered as a candidate for freezing.
+ * that any items that make it into the dead_items are simple LP_DEAD line
+ * pointers, and that every remaining item with tuple storage is considered
+ * as a candidate for freezing.
*/
static void
lazy_scan_prune(LVRelState *vacrel,
@@ -1543,13 +1554,11 @@ lazy_scan_prune(LVRelState *vacrel,
HTSV_Result res;
int tuples_deleted,
tuples_frozen,
- lpdead_items,
live_tuples,
recently_dead_tuples;
int nnewlpdead;
HeapPageFreeze pagefrz;
int64 fpi_before = pgWalUsage.wal_fpi;
- OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
HeapTupleFreeze frozen[MaxHeapTuplesPerPage];
Assert(BufferGetBlockNumber(buf) == blkno);
@@ -1571,7 +1580,6 @@ retry:
pagefrz.NoFreezePageRelminMxid = vacrel->NewRelminMxid;
tuples_deleted = 0;
tuples_frozen = 0;
- lpdead_items = 0;
live_tuples = 0;
recently_dead_tuples = 0;
@@ -1580,9 +1588,9 @@ retry:
*
* We count tuples removed by the pruning step as tuples_deleted. Its
* final value can be thought of as the number of tuples that have been
- * deleted from the table. It should not be confused with lpdead_items;
- * lpdead_items's final value can be thought of as the number of tuples
- * that were deleted from indexes.
+ * deleted from the table. It should not be confused with
+ * prunestate->deadoffsets; prunestate->deadoffsets's final value can
+ * be thought of as the number of tuples that were deleted from indexes.
*/
tuples_deleted = heap_page_prune(rel, buf, vacrel->vistest,
InvalidTransactionId, 0, &nnewlpdead,
@@ -1593,7 +1601,7 @@ retry:
* requiring freezing among remaining tuples with storage
*/
prunestate->hastup = false;
- prunestate->has_lpdead_items = false;
+ prunestate->num_offsets = 0;
prunestate->all_visible = true;
prunestate->all_frozen = true;
prunestate->visibility_cutoff_xid = InvalidTransactionId;
@@ -1638,7 +1646,7 @@ retry:
* (This is another case where it's useful to anticipate that any
* LP_DEAD items will become LP_UNUSED during the ongoing VACUUM.)
*/
- deadoffsets[lpdead_items++] = offnum;
+ prunestate->deadoffsets[prunestate->num_offsets++] = offnum;
continue;
}
@@ -1875,7 +1883,7 @@ retry:
*/
#ifdef USE_ASSERT_CHECKING
/* Note that all_frozen value does not matter when !all_visible */
- if (prunestate->all_visible && lpdead_items == 0)
+ if (prunestate->all_visible && prunestate->num_offsets == 0)
{
TransactionId cutoff;
bool all_frozen;
@@ -1888,28 +1896,9 @@ retry:
}
#endif
- /*
- * Now save details of the LP_DEAD items from the page in vacrel
- */
- if (lpdead_items > 0)
+ if (prunestate->num_offsets > 0)
{
- VacDeadItems *dead_items = vacrel->dead_items;
- ItemPointerData tmp;
-
vacrel->lpdead_item_pages++;
- prunestate->has_lpdead_items = true;
-
- ItemPointerSetBlockNumber(&tmp, blkno);
-
- for (int i = 0; i < lpdead_items; i++)
- {
- ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
- dead_items->items[dead_items->num_items++] = tmp;
- }
-
- Assert(dead_items->num_items <= dead_items->max_items);
- pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
- dead_items->num_items);
/*
* It was convenient to ignore LP_DEAD items in all_visible earlier on
@@ -1928,7 +1917,7 @@ retry:
/* Finally, add page-local counts to whole-VACUUM counts */
vacrel->tuples_deleted += tuples_deleted;
vacrel->tuples_frozen += tuples_frozen;
- vacrel->lpdead_items += lpdead_items;
+ vacrel->lpdead_items += prunestate->num_offsets;
vacrel->live_tuples += live_tuples;
vacrel->recently_dead_tuples += recently_dead_tuples;
}
@@ -1940,7 +1929,7 @@ retry:
* lazy_scan_prune, which requires a full cleanup lock. While pruning isn't
* performed here, it's quite possible that an earlier opportunistic pruning
* operation left LP_DEAD items behind. We'll at least collect any such items
- * in the dead_items array for removal from indexes.
+ * in the dead_items for removal from indexes.
*
* For aggressive VACUUM callers, we may return false to indicate that a full
* cleanup lock is required for processing by lazy_scan_prune. This is only
@@ -2099,7 +2088,7 @@ lazy_scan_noprune(LVRelState *vacrel,
vacrel->NewRelfrozenXid = NoFreezePageRelfrozenXid;
vacrel->NewRelminMxid = NoFreezePageRelminMxid;
- /* Save any LP_DEAD items found on the page in dead_items array */
+ /* Save any LP_DEAD items found on the page in dead_items */
if (vacrel->nindexes == 0)
{
/* Using one-pass strategy (since table has no indexes) */
@@ -2129,8 +2118,7 @@ lazy_scan_noprune(LVRelState *vacrel,
}
else
{
- VacDeadItems *dead_items = vacrel->dead_items;
- ItemPointerData tmp;
+ TidStore *dead_items = vacrel->dead_items;
/*
* Page has LP_DEAD items, and so any references/TIDs that remain in
@@ -2139,17 +2127,10 @@ lazy_scan_noprune(LVRelState *vacrel,
*/
vacrel->lpdead_item_pages++;
- ItemPointerSetBlockNumber(&tmp, blkno);
+ tidstore_add_tids(dead_items, blkno, deadoffsets, lpdead_items);
- for (int i = 0; i < lpdead_items; i++)
- {
- ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
- dead_items->items[dead_items->num_items++] = tmp;
- }
-
- Assert(dead_items->num_items <= dead_items->max_items);
- pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
- dead_items->num_items);
+ pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
+ tidstore_memory_usage(dead_items));
vacrel->lpdead_items += lpdead_items;
@@ -2198,7 +2179,7 @@ lazy_vacuum(LVRelState *vacrel)
if (!vacrel->do_index_vacuuming)
{
Assert(!vacrel->do_index_cleanup);
- vacrel->dead_items->num_items = 0;
+ tidstore_reset(vacrel->dead_items);
return;
}
@@ -2227,7 +2208,7 @@ lazy_vacuum(LVRelState *vacrel)
BlockNumber threshold;
Assert(vacrel->num_index_scans == 0);
- Assert(vacrel->lpdead_items == vacrel->dead_items->num_items);
+ Assert(vacrel->lpdead_items == tidstore_num_tids(vacrel->dead_items));
Assert(vacrel->do_index_vacuuming);
Assert(vacrel->do_index_cleanup);
@@ -2254,8 +2235,8 @@ lazy_vacuum(LVRelState *vacrel)
* cases then this may need to be reconsidered.
*/
threshold = (double) vacrel->rel_pages * BYPASS_THRESHOLD_PAGES;
- bypass = (vacrel->lpdead_item_pages < threshold &&
- vacrel->lpdead_items < MAXDEADITEMS(32L * 1024L * 1024L));
+ bypass = (vacrel->lpdead_item_pages < threshold) &&
+ tidstore_memory_usage(vacrel->dead_items) < (32L * 1024L * 1024L);
}
if (bypass)
@@ -2300,7 +2281,7 @@ lazy_vacuum(LVRelState *vacrel)
* Forget the LP_DEAD items that we just vacuumed (or just decided to not
* vacuum)
*/
- vacrel->dead_items->num_items = 0;
+ tidstore_reset(vacrel->dead_items);
}
/*
@@ -2373,7 +2354,7 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
* place).
*/
Assert(vacrel->num_index_scans > 0 ||
- vacrel->dead_items->num_items == vacrel->lpdead_items);
+ tidstore_num_tids(vacrel->dead_items) == vacrel->lpdead_items);
Assert(allindexes || vacrel->failsafe_active);
/*
@@ -2392,9 +2373,8 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
/*
* lazy_vacuum_heap_rel() -- second pass over the heap for two pass strategy
*
- * This routine marks LP_DEAD items in vacrel->dead_items array as LP_UNUSED.
- * Pages that never had lazy_scan_prune record LP_DEAD items are not visited
- * at all.
+ * This routine marks LP_DEAD items in vacrel->dead_items as LP_UNUSED. Pages
+ * that never had lazy_scan_prune record LP_DEAD items are not visited at all.
*
* We may also be able to truncate the line pointer array of the heap pages we
* visit. If there is a contiguous group of LP_UNUSED items at the end of the
@@ -2410,10 +2390,11 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
static void
lazy_vacuum_heap_rel(LVRelState *vacrel)
{
- int index = 0;
BlockNumber vacuumed_pages = 0;
Buffer vmbuffer = InvalidBuffer;
LVSavedErrInfo saved_err_info;
+ TidStoreIter *iter;
+ TidStoreIterResult *result;
Assert(vacrel->do_index_vacuuming);
Assert(vacrel->do_index_cleanup);
@@ -2428,7 +2409,8 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
VACUUM_ERRCB_PHASE_VACUUM_HEAP,
InvalidBlockNumber, InvalidOffsetNumber);
- while (index < vacrel->dead_items->num_items)
+ iter = tidstore_begin_iterate(vacrel->dead_items);
+ while ((result = tidstore_iterate_next(iter)) != NULL)
{
BlockNumber blkno;
Buffer buf;
@@ -2437,7 +2419,7 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
vacuum_delay_point();
- blkno = ItemPointerGetBlockNumber(&vacrel->dead_items->items[index]);
+ blkno = result->blkno;
vacrel->blkno = blkno;
/*
@@ -2451,7 +2433,8 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
buf = ReadBufferExtended(vacrel->rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
vacrel->bstrategy);
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
- index = lazy_vacuum_heap_page(vacrel, blkno, buf, index, vmbuffer);
+ lazy_vacuum_heap_page(vacrel, blkno, result->offsets, result->num_offsets,
+ buf, vmbuffer);
/* Now that we've vacuumed the page, record its available space */
page = BufferGetPage(buf);
@@ -2461,6 +2444,7 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
vacuumed_pages++;
}
+ tidstore_end_iterate(iter);
vacrel->blkno = InvalidBlockNumber;
if (BufferIsValid(vmbuffer))
@@ -2470,36 +2454,31 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
* We set all LP_DEAD items from the first heap pass to LP_UNUSED during
* the second heap pass. No more, no less.
*/
- Assert(index > 0);
Assert(vacrel->num_index_scans > 1 ||
- (index == vacrel->lpdead_items &&
+ (tidstore_num_tids(vacrel->dead_items) == vacrel->lpdead_items &&
vacuumed_pages == vacrel->lpdead_item_pages));
ereport(DEBUG2,
- (errmsg("table \"%s\": removed %lld dead item identifiers in %u pages",
- vacrel->relname, (long long) index, vacuumed_pages)));
+ (errmsg("table \"%s\": removed " UINT64_FORMAT "dead item identifiers in %u pages",
+ vacrel->relname, tidstore_num_tids(vacrel->dead_items),
+ vacuumed_pages)));
/* Revert to the previous phase information for error traceback */
restore_vacuum_error_info(vacrel, &saved_err_info);
}
/*
- * lazy_vacuum_heap_page() -- free page's LP_DEAD items listed in the
- * vacrel->dead_items array.
+ * lazy_vacuum_heap_page() -- free page's LP_DEAD items.
*
* Caller must have an exclusive buffer lock on the buffer (though a full
* cleanup lock is also acceptable). vmbuffer must be valid and already have
* a pin on blkno's visibility map page.
- *
- * index is an offset into the vacrel->dead_items array for the first listed
- * LP_DEAD item on the page. The return value is the first index immediately
- * after all LP_DEAD items for the same page in the array.
*/
-static int
-lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
- int index, Buffer vmbuffer)
+static void
+lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
+ OffsetNumber *deadoffsets, int num_offsets, Buffer buffer,
+ Buffer vmbuffer)
{
- VacDeadItems *dead_items = vacrel->dead_items;
Page page = BufferGetPage(buffer);
OffsetNumber unused[MaxHeapTuplesPerPage];
int nunused = 0;
@@ -2518,16 +2497,11 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
START_CRIT_SECTION();
- for (; index < dead_items->num_items; index++)
+ for (int i = 0; i < num_offsets; i++)
{
- BlockNumber tblk;
- OffsetNumber toff;
ItemId itemid;
+ OffsetNumber toff = deadoffsets[i];
- tblk = ItemPointerGetBlockNumber(&dead_items->items[index]);
- if (tblk != blkno)
- break; /* past end of tuples for this block */
- toff = ItemPointerGetOffsetNumber(&dead_items->items[index]);
itemid = PageGetItemId(page, toff);
Assert(ItemIdIsDead(itemid) && !ItemIdHasStorage(itemid));
@@ -2597,7 +2571,6 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
/* Revert to the previous phase information for error traceback */
restore_vacuum_error_info(vacrel, &saved_err_info);
- return index;
}
/*
@@ -2687,8 +2660,8 @@ lazy_cleanup_all_indexes(LVRelState *vacrel)
* lazy_vacuum_one_index() -- vacuum index relation.
*
* Delete all the index tuples containing a TID collected in
- * vacrel->dead_items array. Also update running statistics.
- * Exact details depend on index AM's ambulkdelete routine.
+ * vacrel->dead_items. Also update running statistics. Exact
+ * details depend on index AM's ambulkdelete routine.
*
* reltuples is the number of heap tuples to be passed to the
* bulkdelete callback. It's always assumed to be estimated.
@@ -3094,48 +3067,8 @@ count_nondeletable_pages(LVRelState *vacrel, bool *lock_waiter_detected)
}
/*
- * Returns the number of dead TIDs that VACUUM should allocate space to
- * store, given a heap rel of size vacrel->rel_pages, and given current
- * maintenance_work_mem setting (or current autovacuum_work_mem setting,
- * when applicable).
- *
- * See the comments at the head of this file for rationale.
- */
-static int
-dead_items_max_items(LVRelState *vacrel)
-{
- int64 max_items;
- int vac_work_mem = IsAutoVacuumWorkerProcess() &&
- autovacuum_work_mem != -1 ?
- autovacuum_work_mem : maintenance_work_mem;
-
- if (vacrel->nindexes > 0)
- {
- BlockNumber rel_pages = vacrel->rel_pages;
-
- max_items = MAXDEADITEMS(vac_work_mem * 1024L);
- max_items = Min(max_items, INT_MAX);
- max_items = Min(max_items, MAXDEADITEMS(MaxAllocSize));
-
- /* curious coding here to ensure the multiplication can't overflow */
- if ((BlockNumber) (max_items / MaxHeapTuplesPerPage) > rel_pages)
- max_items = rel_pages * MaxHeapTuplesPerPage;
-
- /* stay sane if small maintenance_work_mem */
- max_items = Max(max_items, MaxHeapTuplesPerPage);
- }
- else
- {
- /* One-pass case only stores a single heap page's TIDs at a time */
- max_items = MaxHeapTuplesPerPage;
- }
-
- return (int) max_items;
-}
-
-/*
- * Allocate dead_items (either using palloc, or in dynamic shared memory).
- * Sets dead_items in vacrel for caller.
+ * Allocate a (local or shared) TidStore for storing dead TIDs. Sets dead_items
+ * in vacrel for caller.
*
* Also handles parallel initialization as part of allocating dead_items in
* DSM when required.
@@ -3143,11 +3076,9 @@ dead_items_max_items(LVRelState *vacrel)
static void
dead_items_alloc(LVRelState *vacrel, int nworkers)
{
- VacDeadItems *dead_items;
- int max_items;
-
- max_items = dead_items_max_items(vacrel);
- Assert(max_items >= MaxHeapTuplesPerPage);
+ int vac_work_mem = IsAutoVacuumWorkerProcess() &&
+ autovacuum_work_mem != -1 ?
+ autovacuum_work_mem * 1024L : maintenance_work_mem * 1024L;
/*
* Initialize state for a parallel vacuum. As of now, only one worker can
@@ -3174,7 +3105,7 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
else
vacrel->pvs = parallel_vacuum_init(vacrel->rel, vacrel->indrels,
vacrel->nindexes, nworkers,
- max_items,
+ vac_work_mem, MaxHeapTuplesPerPage,
vacrel->verbose ? INFO : DEBUG2,
vacrel->bstrategy);
@@ -3187,11 +3118,8 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
}
/* Serial VACUUM case */
- dead_items = (VacDeadItems *) palloc(vac_max_items_to_alloc_size(max_items));
- dead_items->max_items = max_items;
- dead_items->num_items = 0;
-
- vacrel->dead_items = dead_items;
+ vacrel->dead_items = tidstore_create(vac_work_mem, MaxHeapTuplesPerPage,
+ NULL);
}
/*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 8608e3fa5b..a526e607fe 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1165,7 +1165,7 @@ CREATE VIEW pg_stat_progress_vacuum AS
END AS phase,
S.param2 AS heap_blks_total, S.param3 AS heap_blks_scanned,
S.param4 AS heap_blks_vacuumed, S.param5 AS index_vacuum_count,
- S.param6 AS max_dead_tuples, S.param7 AS num_dead_tuples
+ S.param6 AS max_dead_tuple_bytes, S.param7 AS dead_tuple_bytes
FROM pg_stat_get_progress_info('VACUUM') AS S
LEFT JOIN pg_database D ON S.datid = D.oid;
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 7b1a4b127e..d8e680ca20 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -97,7 +97,6 @@ static bool vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
static double compute_parallel_delay(void);
static VacOptValue get_vacoptval_from_boolean(DefElem *def);
static bool vac_tid_reaped(ItemPointer itemptr, void *state);
-static int vac_cmp_itemptr(const void *left, const void *right);
/*
* Primary entry point for manual VACUUM and ANALYZE commands
@@ -2303,16 +2302,16 @@ get_vacoptval_from_boolean(DefElem *def)
*/
IndexBulkDeleteResult *
vac_bulkdel_one_index(IndexVacuumInfo *ivinfo, IndexBulkDeleteResult *istat,
- VacDeadItems *dead_items)
+ TidStore *dead_items)
{
/* Do bulk deletion */
istat = index_bulk_delete(ivinfo, istat, vac_tid_reaped,
(void *) dead_items);
ereport(ivinfo->message_level,
- (errmsg("scanned index \"%s\" to remove %d row versions",
+ (errmsg("scanned index \"%s\" to remove " UINT64_FORMAT " row versions",
RelationGetRelationName(ivinfo->index),
- dead_items->num_items)));
+ tidstore_num_tids(dead_items))));
return istat;
}
@@ -2343,82 +2342,15 @@ vac_cleanup_one_index(IndexVacuumInfo *ivinfo, IndexBulkDeleteResult *istat)
return istat;
}
-/*
- * Returns the total required space for VACUUM's dead_items array given a
- * max_items value.
- */
-Size
-vac_max_items_to_alloc_size(int max_items)
-{
- Assert(max_items <= MAXDEADITEMS(MaxAllocSize));
-
- return offsetof(VacDeadItems, items) + sizeof(ItemPointerData) * max_items;
-}
-
/*
* vac_tid_reaped() -- is a particular tid deletable?
*
* This has the right signature to be an IndexBulkDeleteCallback.
- *
- * Assumes dead_items array is sorted (in ascending TID order).
*/
static bool
vac_tid_reaped(ItemPointer itemptr, void *state)
{
- VacDeadItems *dead_items = (VacDeadItems *) state;
- int64 litem,
- ritem,
- item;
- ItemPointer res;
-
- litem = itemptr_encode(&dead_items->items[0]);
- ritem = itemptr_encode(&dead_items->items[dead_items->num_items - 1]);
- item = itemptr_encode(itemptr);
-
- /*
- * Doing a simple bound check before bsearch() is useful to avoid the
- * extra cost of bsearch(), especially if dead items on the heap are
- * concentrated in a certain range. Since this function is called for
- * every index tuple, it pays to be really fast.
- */
- if (item < litem || item > ritem)
- return false;
-
- res = (ItemPointer) bsearch((void *) itemptr,
- (void *) dead_items->items,
- dead_items->num_items,
- sizeof(ItemPointerData),
- vac_cmp_itemptr);
-
- return (res != NULL);
-}
-
-/*
- * Comparator routines for use with qsort() and bsearch().
- */
-static int
-vac_cmp_itemptr(const void *left, const void *right)
-{
- BlockNumber lblk,
- rblk;
- OffsetNumber loff,
- roff;
-
- lblk = ItemPointerGetBlockNumber((ItemPointer) left);
- rblk = ItemPointerGetBlockNumber((ItemPointer) right);
-
- if (lblk < rblk)
- return -1;
- if (lblk > rblk)
- return 1;
-
- loff = ItemPointerGetOffsetNumber((ItemPointer) left);
- roff = ItemPointerGetOffsetNumber((ItemPointer) right);
-
- if (loff < roff)
- return -1;
- if (loff > roff)
- return 1;
+ TidStore *dead_items = (TidStore *) state;
- return 0;
+ return tidstore_lookup_tid(dead_items, itemptr);
}
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index bcd40c80a1..d653683693 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -9,12 +9,11 @@
* In a parallel vacuum, we perform both index bulk deletion and index cleanup
* with parallel worker processes. Individual indexes are processed by one
* vacuum process. ParalleVacuumState contains shared information as well as
- * the memory space for storing dead items allocated in the DSM segment. We
- * launch parallel worker processes at the start of parallel index
- * bulk-deletion and index cleanup and once all indexes are processed, the
- * parallel worker processes exit. Each time we process indexes in parallel,
- * the parallel context is re-initialized so that the same DSM can be used for
- * multiple passes of index bulk-deletion and index cleanup.
+ * the shared TidStore. We launch parallel worker processes at the start of
+ * parallel index bulk-deletion and index cleanup and once all indexes are
+ * processed, the parallel worker processes exit. Each time we process indexes
+ * in parallel, the parallel context is re-initialized so that the same DSM can
+ * be used for multiple passes of index bulk-deletion and index cleanup.
*
* Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
@@ -103,6 +102,9 @@ typedef struct PVShared
/* Counter for vacuuming and cleanup */
pg_atomic_uint32 idx;
+
+ /* Handle of the shared TidStore */
+ tidstore_handle dead_items_handle;
} PVShared;
/* Status used during parallel index vacuum or cleanup */
@@ -166,7 +168,8 @@ struct ParallelVacuumState
PVIndStats *indstats;
/* Shared dead items space among parallel vacuum workers */
- VacDeadItems *dead_items;
+ TidStore *dead_items;
+ dsa_area *dead_items_area;
/* Points to buffer usage area in DSM */
BufferUsage *buffer_usage;
@@ -222,20 +225,23 @@ static void parallel_vacuum_error_callback(void *arg);
*/
ParallelVacuumState *
parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
- int nrequested_workers, int max_items,
- int elevel, BufferAccessStrategy bstrategy)
+ int nrequested_workers, int vac_work_mem,
+ int max_offset, int elevel,
+ BufferAccessStrategy bstrategy)
{
ParallelVacuumState *pvs;
ParallelContext *pcxt;
PVShared *shared;
- VacDeadItems *dead_items;
+ TidStore *dead_items;
PVIndStats *indstats;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
+ void *area_space;
+ dsa_area *dead_items_dsa;
bool *will_parallel_vacuum;
Size est_indstats_len;
Size est_shared_len;
- Size est_dead_items_len;
+ Size dsa_minsize = dsa_minimum_size();
int nindexes_mwm = 0;
int parallel_workers = 0;
int querylen;
@@ -283,9 +289,8 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_estimate_chunk(&pcxt->estimator, est_shared_len);
shm_toc_estimate_keys(&pcxt->estimator, 1);
- /* Estimate size for dead_items -- PARALLEL_VACUUM_KEY_DEAD_ITEMS */
- est_dead_items_len = vac_max_items_to_alloc_size(max_items);
- shm_toc_estimate_chunk(&pcxt->estimator, est_dead_items_len);
+ /* Estimate size for dead tuple DSA -- PARALLEL_VACUUM_KEY_DSA */
+ shm_toc_estimate_chunk(&pcxt->estimator, dsa_minsize);
shm_toc_estimate_keys(&pcxt->estimator, 1);
/*
@@ -351,6 +356,16 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_INDEX_STATS, indstats);
pvs->indstats = indstats;
+ /* Prepare DSA space for dead items */
+ area_space = shm_toc_allocate(pcxt->toc, dsa_minsize);
+ shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_DEAD_ITEMS, area_space);
+ dead_items_dsa = dsa_create_in_place(area_space, dsa_minsize,
+ LWTRANCHE_PARALLEL_VACUUM_DSA,
+ pcxt->seg);
+ dead_items = tidstore_create(vac_work_mem, max_offset, dead_items_dsa);
+ pvs->dead_items = dead_items;
+ pvs->dead_items_area = dead_items_dsa;
+
/* Prepare shared information */
shared = (PVShared *) shm_toc_allocate(pcxt->toc, est_shared_len);
MemSet(shared, 0, est_shared_len);
@@ -360,6 +375,7 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
(nindexes_mwm > 0) ?
maintenance_work_mem / Min(parallel_workers, nindexes_mwm) :
maintenance_work_mem;
+ shared->dead_items_handle = tidstore_get_handle(dead_items);
pg_atomic_init_u32(&(shared->cost_balance), 0);
pg_atomic_init_u32(&(shared->active_nworkers), 0);
@@ -368,15 +384,6 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_SHARED, shared);
pvs->shared = shared;
- /* Prepare the dead_items space */
- dead_items = (VacDeadItems *) shm_toc_allocate(pcxt->toc,
- est_dead_items_len);
- dead_items->max_items = max_items;
- dead_items->num_items = 0;
- MemSet(dead_items->items, 0, sizeof(ItemPointerData) * max_items);
- shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_DEAD_ITEMS, dead_items);
- pvs->dead_items = dead_items;
-
/*
* Allocate space for each worker's BufferUsage and WalUsage; no need to
* initialize
@@ -434,6 +441,9 @@ parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats)
istats[i] = NULL;
}
+ tidstore_destroy(pvs->dead_items);
+ dsa_detach(pvs->dead_items_area);
+
DestroyParallelContext(pvs->pcxt);
ExitParallelMode();
@@ -442,7 +452,7 @@ parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats)
}
/* Returns the dead items space */
-VacDeadItems *
+TidStore *
parallel_vacuum_get_dead_items(ParallelVacuumState *pvs)
{
return pvs->dead_items;
@@ -940,7 +950,9 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
Relation *indrels;
PVIndStats *indstats;
PVShared *shared;
- VacDeadItems *dead_items;
+ TidStore *dead_items;
+ void *area_space;
+ dsa_area *dead_items_area;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
int nindexes;
@@ -984,10 +996,10 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
PARALLEL_VACUUM_KEY_INDEX_STATS,
false);
- /* Set dead_items space */
- dead_items = (VacDeadItems *) shm_toc_lookup(toc,
- PARALLEL_VACUUM_KEY_DEAD_ITEMS,
- false);
+ /* Set dead items */
+ area_space = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_DEAD_ITEMS, false);
+ dead_items_area = dsa_attach_in_place(area_space, seg);
+ dead_items = tidstore_attach(dead_items_area, shared->dead_items_handle);
/* Set cost-based vacuum delay */
VacuumCostActive = (VacuumCostDelay > 0);
@@ -1033,6 +1045,9 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
&wal_usage[ParallelWorkerNumber]);
+ tidstore_detach(pvs.dead_items);
+ dsa_detach(dead_items_area);
+
/* Pop the error context stack */
error_context_stack = errcallback.previous;
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index f5ea381c53..d88db3e1f8 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -3397,12 +3397,12 @@ check_autovacuum_work_mem(int *newval, void **extra, GucSource source)
return true;
/*
- * We clamp manually-set values to at least 1MB. Since
+ * We clamp manually-set values to at least 2MB. Since
* maintenance_work_mem is always set to at least this value, do the same
* here.
*/
- if (*newval < 1024)
- *newval = 1024;
+ if (*newval < 2048)
+ *newval = 2048;
return true;
}
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 55b3a04097..c223a7dc94 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -192,6 +192,8 @@ static const char *const BuiltinTrancheNames[] = {
"LogicalRepLauncherDSA",
/* LWTRANCHE_LAUNCHER_HASH: */
"LogicalRepLauncherHash",
+ /* LWTRANCHE_PARALLEL_VACUUM_DSA: */
+ "ParallelVacuumDSA",
};
StaticAssertDecl(lengthof(BuiltinTrancheNames) ==
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index b46e3b8c55..27a88b9369 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2312,7 +2312,7 @@ struct config_int ConfigureNamesInt[] =
GUC_UNIT_KB
},
&maintenance_work_mem,
- 65536, 1024, MAX_KILOBYTES,
+ 65536, 2048, MAX_KILOBYTES,
NULL, NULL, NULL
},
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index e5add41352..b209d3cf84 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -23,8 +23,8 @@
#define PROGRESS_VACUUM_HEAP_BLKS_SCANNED 2
#define PROGRESS_VACUUM_HEAP_BLKS_VACUUMED 3
#define PROGRESS_VACUUM_NUM_INDEX_VACUUMS 4
-#define PROGRESS_VACUUM_MAX_DEAD_TUPLES 5
-#define PROGRESS_VACUUM_NUM_DEAD_TUPLES 6
+#define PROGRESS_VACUUM_MAX_DEAD_TUPLE_BYTES 5
+#define PROGRESS_VACUUM_DEAD_TUPLE_BYTES 6
/* Phases of vacuum (as advertised via PROGRESS_VACUUM_PHASE) */
#define PROGRESS_VACUUM_PHASE_SCAN_HEAP 1
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 689dbb7702..a3ebb169ef 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -17,6 +17,7 @@
#include "access/htup.h"
#include "access/genam.h"
#include "access/parallel.h"
+#include "access/tidstore.h"
#include "catalog/pg_class.h"
#include "catalog/pg_statistic.h"
#include "catalog/pg_type.h"
@@ -276,21 +277,6 @@ struct VacuumCutoffs
MultiXactId MultiXactCutoff;
};
-/*
- * VacDeadItems stores TIDs whose index tuples are deleted by index vacuuming.
- */
-typedef struct VacDeadItems
-{
- int max_items; /* # slots allocated in array */
- int num_items; /* current # of entries */
-
- /* Sorted array of TIDs to delete from indexes */
- ItemPointerData items[FLEXIBLE_ARRAY_MEMBER];
-} VacDeadItems;
-
-#define MAXDEADITEMS(avail_mem) \
- (((avail_mem) - offsetof(VacDeadItems, items)) / sizeof(ItemPointerData))
-
/* GUC parameters */
extern PGDLLIMPORT int default_statistics_target; /* PGDLLIMPORT for PostGIS */
extern PGDLLIMPORT int vacuum_freeze_min_age;
@@ -339,18 +325,17 @@ extern Relation vacuum_open_relation(Oid relid, RangeVar *relation,
LOCKMODE lmode);
extern IndexBulkDeleteResult *vac_bulkdel_one_index(IndexVacuumInfo *ivinfo,
IndexBulkDeleteResult *istat,
- VacDeadItems *dead_items);
+ TidStore *dead_items);
extern IndexBulkDeleteResult *vac_cleanup_one_index(IndexVacuumInfo *ivinfo,
IndexBulkDeleteResult *istat);
-extern Size vac_max_items_to_alloc_size(int max_items);
/* in commands/vacuumparallel.c */
extern ParallelVacuumState *parallel_vacuum_init(Relation rel, Relation *indrels,
int nindexes, int nrequested_workers,
- int max_items, int elevel,
- BufferAccessStrategy bstrategy);
+ int vac_work_mem, int max_offset,
+ int elevel, BufferAccessStrategy bstrategy);
extern void parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats);
-extern VacDeadItems *parallel_vacuum_get_dead_items(ParallelVacuumState *pvs);
+extern TidStore *parallel_vacuum_get_dead_items(ParallelVacuumState *pvs);
extern void parallel_vacuum_bulkdel_all_indexes(ParallelVacuumState *pvs,
long num_table_tuples,
int num_index_scans);
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 07002fdfbe..537b34b30c 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -207,6 +207,7 @@ typedef enum BuiltinTrancheIds
LWTRANCHE_PGSTATS_DATA,
LWTRANCHE_LAUNCHER_DSA,
LWTRANCHE_LAUNCHER_HASH,
+ LWTRANCHE_PARALLEL_VACUUM_DSA,
LWTRANCHE_FIRST_USER_DEFINED
} BuiltinTrancheIds;
diff --git a/src/test/regress/expected/cluster.out b/src/test/regress/expected/cluster.out
index 2eec483eaa..e04f50726f 100644
--- a/src/test/regress/expected/cluster.out
+++ b/src/test/regress/expected/cluster.out
@@ -526,7 +526,7 @@ create index cluster_sort on clstr_4 (hundred, thousand, tenthous);
-- ensure we don't use the index in CLUSTER nor the checking SELECTs
set enable_indexscan = off;
-- Use external sort:
-set maintenance_work_mem = '1MB';
+set maintenance_work_mem = '2MB';
cluster clstr_4 using cluster_sort;
select * from
(select hundred, lag(hundred) over () as lhundred,
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 6cd57e3eaa..d1889b9d10 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -1214,7 +1214,7 @@ DROP TABLE unlogged_hash_table;
-- CREATE INDEX hash_ovfl_index ON hash_ovfl_heap USING hash (x int4_ops);
-- Test hash index build tuplesorting. Force hash tuplesort using low
-- maintenance_work_mem setting and fillfactor:
-SET maintenance_work_mem = '1MB';
+SET maintenance_work_mem = '2MB';
CREATE INDEX hash_tuplesort_idx ON tenk1 USING hash (stringu1 name_ops) WITH (fillfactor = 10);
EXPLAIN (COSTS OFF)
SELECT count(*) FROM tenk1 WHERE stringu1 = 'TVAAAA';
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index e7a2f5856a..f6ae02eb14 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2020,8 +2020,8 @@ pg_stat_progress_vacuum| SELECT s.pid,
s.param3 AS heap_blks_scanned,
s.param4 AS heap_blks_vacuumed,
s.param5 AS index_vacuum_count,
- s.param6 AS max_dead_tuples,
- s.param7 AS num_dead_tuples
+ s.param6 AS max_dead_tuple_bytes,
+ s.param7 AS dead_tuple_bytes
FROM (pg_stat_get_progress_info('VACUUM'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
LEFT JOIN pg_database d ON ((s.datid = d.oid)));
pg_stat_recovery_prefetch| SELECT stats_reset,
diff --git a/src/test/regress/sql/cluster.sql b/src/test/regress/sql/cluster.sql
index a4cfaae807..a4cb5b98a5 100644
--- a/src/test/regress/sql/cluster.sql
+++ b/src/test/regress/sql/cluster.sql
@@ -258,7 +258,7 @@ create index cluster_sort on clstr_4 (hundred, thousand, tenthous);
set enable_indexscan = off;
-- Use external sort:
-set maintenance_work_mem = '1MB';
+set maintenance_work_mem = '2MB';
cluster clstr_4 using cluster_sort;
select * from
(select hundred, lag(hundred) over () as lhundred,
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index a3738833b2..edb5e4b4f3 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -367,7 +367,7 @@ DROP TABLE unlogged_hash_table;
-- Test hash index build tuplesorting. Force hash tuplesort using low
-- maintenance_work_mem setting and fillfactor:
-SET maintenance_work_mem = '1MB';
+SET maintenance_work_mem = '2MB';
CREATE INDEX hash_tuplesort_idx ON tenk1 USING hash (stringu1 name_ops) WITH (fillfactor = 10);
EXPLAIN (COSTS OFF)
SELECT count(*) FROM tenk1 WHERE stringu1 = 'TVAAAA';
--
2.39.1
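As a quick aside (not part of the patch set): with 0009 applied, the byte-based
progress counters can be watched with a query along these lines, using the
renamed columns from the updated pg_stat_progress_vacuum view above:
-- assumes the v25-0009 patch is applied, so the byte-based columns exist
SELECT pid, phase, index_vacuum_count,
       max_dead_tuple_bytes, dead_tuple_bytes
FROM pg_stat_progress_vacuum;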
On Tue, Feb 7, 2023 at 4:25 PM John Naylor <john.naylor@enterprisedb.com>
wrote:
[v25]
This conflicted with a commit from earlier today, so rebased in v26 with no
further changes.
--
John Naylor
EDB: http://www.enterprisedb.com
Attachments:
Attachment: v25-addendum-try-no-maintain-order.txt (text/plain; charset=US-ASCII)
diff --git a/src/include/lib/radixtree_insert_impl.h b/src/include/lib/radixtree_insert_impl.h
index 4e00b46d9b..3f831227c9 100644
--- a/src/include/lib/radixtree_insert_impl.h
+++ b/src/include/lib/radixtree_insert_impl.h
@@ -80,9 +80,10 @@
}
else
{
- int insertpos = RT_NODE_3_GET_INSERTPOS(&n3->base, chunk);
+ int insertpos;
int count = n3->base.n.count;
-
+#ifdef RT_MAINTAIN_ORDERING
+ insertpos = RT_NODE_3_GET_INSERTPOS(&n3->base, chunk);
/* shift chunks and children */
if (insertpos < count)
{
@@ -95,6 +96,9 @@
count, insertpos);
#endif
}
+#else
+ insertpos = count;
+#endif /* order */
n3->base.chunks[insertpos] = chunk;
#ifdef RT_NODE_LEVEL_LEAF
@@ -186,8 +190,10 @@
}
else
{
- int insertpos = RT_NODE_32_GET_INSERTPOS(&n32->base, chunk);
+ int insertpos;
int count = n32->base.n.count;
+#ifdef RT_MAINTAIN_ORDERING
+ insertpos = RT_NODE_32_GET_INSERTPOS(&n32->base, chunk);
if (insertpos < count)
{
@@ -200,6 +206,9 @@
count, insertpos);
#endif
}
+#else
+ insertpos = count;
+#endif
n32->base.chunks[insertpos] = chunk;
#ifdef RT_NODE_LEVEL_LEAF
Attachment: v26-0002-Move-some-bitmap-logic-out-of-bitmapset.c.patch (text/x-patch; charset=US-ASCII)
From f6f476ba71864821cb5144f513165671c64db1b2 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Tue, 6 Dec 2022 13:39:41 +0700
Subject: [PATCH v26 2/9] Move some bitmap logic out of bitmapset.c
Add pg_rightmost_one32/64 functions. This functionality was previously
private to bitmapset.c as the RIGHTMOST_ONE macro. It has practical
use in other contexts, so move to pg_bitutils.h.
Also move the logic for selecting appropriate pg_bitutils functions
based on word size to bitmapset.h for wider visibility and add
appropriate selection for bmw_rightmost_one().
Since the previous macro relied on casting to signedbitmapword,
and the new functions do not, remove that typedef.
Design input and review by Tom Lane
Discussion: https://www.postgresql.org/message-id/CAFBsxsFW2JjTo58jtDB%2B3sZhxMx3t-3evew8%3DAcr%2BGGhC%2BkFaA%40mail.gmail.com
---
src/backend/nodes/bitmapset.c | 36 ++------------------------------
src/include/nodes/bitmapset.h | 16 ++++++++++++--
src/include/port/pg_bitutils.h | 31 +++++++++++++++++++++++++++
src/tools/pgindent/typedefs.list | 1 -
4 files changed, 47 insertions(+), 37 deletions(-)
diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index 98660524ad..fcd8e2ccbc 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -32,39 +32,7 @@
#define BITMAPSET_SIZE(nwords) \
(offsetof(Bitmapset, words) + (nwords) * sizeof(bitmapword))
-/*----------
- * This is a well-known cute trick for isolating the rightmost one-bit
- * in a word. It assumes two's complement arithmetic. Consider any
- * nonzero value, and focus attention on the rightmost one. The value is
- * then something like
- * xxxxxx10000
- * where x's are unspecified bits. The two's complement negative is formed
- * by inverting all the bits and adding one. Inversion gives
- * yyyyyy01111
- * where each y is the inverse of the corresponding x. Incrementing gives
- * yyyyyy10000
- * and then ANDing with the original value gives
- * 00000010000
- * This works for all cases except original value = zero, where of course
- * we get zero.
- *----------
- */
-#define RIGHTMOST_ONE(x) ((signedbitmapword) (x) & -((signedbitmapword) (x)))
-
-#define HAS_MULTIPLE_ONES(x) ((bitmapword) RIGHTMOST_ONE(x) != (x))
-
-/* Select appropriate bit-twiddling functions for bitmap word size */
-#if BITS_PER_BITMAPWORD == 32
-#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos32(w)
-#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos32(w)
-#define bmw_popcount(w) pg_popcount32(w)
-#elif BITS_PER_BITMAPWORD == 64
-#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos64(w)
-#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos64(w)
-#define bmw_popcount(w) pg_popcount64(w)
-#else
-#error "invalid BITS_PER_BITMAPWORD"
-#endif
+#define HAS_MULTIPLE_ONES(x) (bmw_rightmost_one(x) != (x))
/*
@@ -1013,7 +981,7 @@ bms_first_member(Bitmapset *a)
{
int result;
- w = RIGHTMOST_ONE(w);
+ w = bmw_rightmost_one(w);
a->words[wordnum] &= ~w;
result = wordnum * BITS_PER_BITMAPWORD;
diff --git a/src/include/nodes/bitmapset.h b/src/include/nodes/bitmapset.h
index 3d2225e1ae..5f9a511b4a 100644
--- a/src/include/nodes/bitmapset.h
+++ b/src/include/nodes/bitmapset.h
@@ -38,13 +38,11 @@ struct List;
#define BITS_PER_BITMAPWORD 64
typedef uint64 bitmapword; /* must be an unsigned type */
-typedef int64 signedbitmapword; /* must be the matching signed type */
#else
#define BITS_PER_BITMAPWORD 32
typedef uint32 bitmapword; /* must be an unsigned type */
-typedef int32 signedbitmapword; /* must be the matching signed type */
#endif
@@ -75,6 +73,20 @@ typedef enum
BMS_MULTIPLE /* >1 member */
} BMS_Membership;
+/* Select appropriate bit-twiddling functions for bitmap word size */
+#if BITS_PER_BITMAPWORD == 32
+#define bmw_rightmost_one(w) pg_rightmost_one32(w)
+#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos32(w)
+#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos32(w)
+#define bmw_popcount(w) pg_popcount32(w)
+#elif BITS_PER_BITMAPWORD == 64
+#define bmw_rightmost_one(w) pg_rightmost_one64(w)
+#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos64(w)
+#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos64(w)
+#define bmw_popcount(w) pg_popcount64(w)
+#else
+#error "invalid BITS_PER_BITMAPWORD"
+#endif
/*
* function prototypes in nodes/bitmapset.c
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index a49a9c03d9..7235ad25ee 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -17,6 +17,37 @@ extern PGDLLIMPORT const uint8 pg_leftmost_one_pos[256];
extern PGDLLIMPORT const uint8 pg_rightmost_one_pos[256];
extern PGDLLIMPORT const uint8 pg_number_of_ones[256];
+/*----------
+ * This is a well-known cute trick for isolating the rightmost one-bit
+ * in a word. It assumes two's complement arithmetic. Consider any
+ * nonzero value, and focus attention on the rightmost one. The value is
+ * then something like
+ * xxxxxx10000
+ * where x's are unspecified bits. The two's complement negative is formed
+ * by inverting all the bits and adding one. Inversion gives
+ * yyyyyy01111
+ * where each y is the inverse of the corresponding x. Incrementing gives
+ * yyyyyy10000
+ * and then ANDing with the original value gives
+ * 00000010000
+ * This works for all cases except original value = zero, where of course
+ * we get zero.
+ *----------
+ */
+static inline uint32
+pg_rightmost_one32(uint32 word)
+{
+ int32 result = (int32) word & -((int32) word);
+ return (uint32) result;
+}
+
+static inline uint64
+pg_rightmost_one64(uint64 word)
+{
+ int64 result = (int64) word & -((int64) word);
+ return (uint64) result;
+}
+
/*
* pg_leftmost_one_pos32
* Returns the position of the most significant set bit in "word",
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 07fbb7ccf6..f4d1d60cd2 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3662,7 +3662,6 @@ shmem_request_hook_type
shmem_startup_hook_type
sig_atomic_t
sigjmp_buf
-signedbitmapword
sigset_t
size_t
slist_head
--
2.39.1
Attachment: v26-0001-Introduce-helper-SIMD-functions-for-small-byte-a.patch (text/x-patch; charset=US-ASCII)
From cf3e16ed894fc0c6574c48eddad7c587e5dec688 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 24 Oct 2022 14:07:09 +0900
Subject: [PATCH v26 1/9] Introduce helper SIMD functions for small byte arrays
vector8_min - helper for emulating ">=" semantics
vector8_highbit_mask - used to turn the result of a vector
comparison into a bitmask
Masahiko Sawada
Reviewed by Nathan Bossart, additional adjustments by me
Discussion: https://www.postgresql.org/message-id/CAD21AoDap240WDDdUDE0JMpCmuMMnGajrKrkCRxM7zn9Xk3JRA%40mail.gmail.com
---
src/include/port/simd.h | 47 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 47 insertions(+)
diff --git a/src/include/port/simd.h b/src/include/port/simd.h
index c836360d4b..350e2caaea 100644
--- a/src/include/port/simd.h
+++ b/src/include/port/simd.h
@@ -79,6 +79,7 @@ static inline bool vector8_has_le(const Vector8 v, const uint8 c);
static inline bool vector8_is_highbit_set(const Vector8 v);
#ifndef USE_NO_SIMD
static inline bool vector32_is_highbit_set(const Vector32 v);
+static inline uint32 vector8_highbit_mask(const Vector8 v);
#endif
/* arithmetic operations */
@@ -96,6 +97,7 @@ static inline Vector8 vector8_ssub(const Vector8 v1, const Vector8 v2);
*/
#ifndef USE_NO_SIMD
static inline Vector8 vector8_eq(const Vector8 v1, const Vector8 v2);
+static inline Vector8 vector8_min(const Vector8 v1, const Vector8 v2);
static inline Vector32 vector32_eq(const Vector32 v1, const Vector32 v2);
#endif
@@ -299,6 +301,36 @@ vector32_is_highbit_set(const Vector32 v)
}
#endif /* ! USE_NO_SIMD */
+/*
+ * Return a bitmask formed from the high-bit of each element.
+ */
+#ifndef USE_NO_SIMD
+static inline uint32
+vector8_highbit_mask(const Vector8 v)
+{
+#ifdef USE_SSE2
+ return (uint32) _mm_movemask_epi8(v);
+#elif defined(USE_NEON)
+ /*
+ * Note: There is a faster way to do this, but it returns a uint64 and
+ * and if the caller wanted to extract the bit position using CTZ,
+ * it would have to divide that result by 4.
+ */
+ static const uint8 mask[16] = {
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ };
+
+ uint8x16_t masked = vandq_u8(vld1q_u8(mask), (uint8x16_t) vshrq_n_s8(v, 7));
+ uint8x16_t maskedhi = vextq_u8(masked, masked, 8);
+
+ return (uint32) vaddvq_u16((uint16x8_t) vzip1q_u8(masked, maskedhi));
+#endif
+}
+#endif /* ! USE_NO_SIMD */
+
/*
* Return the bitwise OR of the inputs
*/
@@ -372,4 +404,19 @@ vector32_eq(const Vector32 v1, const Vector32 v2)
}
#endif /* ! USE_NO_SIMD */
+/*
+ * Given two vectors, return a vector with the minimum element of each.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_min(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+ return _mm_min_epu8(v1, v2);
+#elif defined(USE_NEON)
+ return vminq_u8(v1, v2);
+#endif
+}
+#endif /* ! USE_NO_SIMD */
+
#endif /* SIMD_H */
--
2.39.1
Attachment: v26-0005-Measure-load-time-of-bench_search_random_nodes.patch (text/x-patch; charset=US-ASCII)
From 4a0d293937876ec348f30ce4ab94da14b925b020 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Tue, 7 Feb 2023 13:06:00 +0700
Subject: [PATCH v26 5/9] Measure load time of bench_search_random_nodes
---
.../bench_radix_tree/bench_radix_tree--1.0.sql | 1 +
contrib/bench_radix_tree/bench_radix_tree.c | 17 ++++++++++++-----
2 files changed, 13 insertions(+), 5 deletions(-)
diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
index 2fd689aa91..95eedbbe10 100644
--- a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
+++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
@@ -47,6 +47,7 @@ create function bench_search_random_nodes(
cnt int8,
filter_str text DEFAULT NULL,
OUT mem_allocated int8,
+OUT load_ms int8,
OUT search_ms int8)
returns record
as 'MODULE_PATHNAME'
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c
index 73ddee32de..7d1e2eee57 100644
--- a/contrib/bench_radix_tree/bench_radix_tree.c
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -395,9 +395,10 @@ bench_search_random_nodes(PG_FUNCTION_ARGS)
end_time;
long secs;
int usecs;
+ int64 load_time_ms;
int64 search_time_ms;
- Datum values[2] = {0};
- bool nulls[2] = {0};
+ Datum values[3] = {0};
+ bool nulls[3] = {0};
/* from trial and error */
uint64 filter = (((uint64) 0x7F<<32) | (0x07<<24) | (0xFF<<16) | 0xFF);
@@ -416,13 +417,18 @@ bench_search_random_nodes(PG_FUNCTION_ARGS)
rt = rt_create(CurrentMemoryContext);
+ start_time = GetCurrentTimestamp();
for (uint64 i = 0; i < cnt; i++)
{
- const uint64 hash = hash64(i);
- const uint64 key = hash & filter;
+ uint64 hash = hash64(i);
+ uint64 key = hash & filter;
rt_set(rt, key, &key);
}
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ load_time_ms = secs * 1000 + usecs / 1000;
elog(NOTICE, "sleeping for 2 seconds...");
pg_usleep(2 * 1000000L);
@@ -449,7 +455,8 @@ bench_search_random_nodes(PG_FUNCTION_ARGS)
rt_stats(rt);
values[0] = Int64GetDatum(rt_memory_usage(rt));
- values[1] = Int64GetDatum(search_time_ms);
+ values[1] = Int64GetDatum(load_time_ms);
+ values[2] = Int64GetDatum(search_time_ms);
rt_free(rt);
PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
--
2.39.1
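The new load_ms column is computed with the same elapsed-time idiom used
throughout the benchmark module. For reference, a stripped-down sketch of
that idiom (illustration only; elapsed_ms() and run_workload() are
hypothetical names, not part of the patch):

#include "postgres.h"
#include "utils/timestamp.h"

/* Illustration only: the elapsed-millisecond idiom used by the benchmarks. */
static int64
elapsed_ms(void (*run_workload) (void))
{
	TimestampTz start_time = GetCurrentTimestamp();
	TimestampTz end_time;
	long		secs;
	int			usecs;

	run_workload();

	end_time = GetCurrentTimestamp();
	TimestampDifference(start_time, end_time, &secs, &usecs);

	return (int64) secs * 1000 + usecs / 1000;
}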
Attachment: v26-0004-Tool-for-measuring-radix-tree-performance.patch (text/x-patch)
From 0e328f6d85d30797af158f2a4070004fe40d93fe Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 16 Sep 2022 11:57:03 +0900
Subject: [PATCH v26 4/9] Tool for measuring radix tree performance
Includes Meson support, but commented out to avoid warnings
XXX: Not for commit
---
contrib/bench_radix_tree/Makefile | 21 +
.../bench_radix_tree--1.0.sql | 76 ++
contrib/bench_radix_tree/bench_radix_tree.c | 656 ++++++++++++++++++
.../bench_radix_tree/bench_radix_tree.control | 6 +
contrib/bench_radix_tree/expected/bench.out | 13 +
contrib/bench_radix_tree/meson.build | 33 +
contrib/bench_radix_tree/sql/bench.sql | 16 +
contrib/meson.build | 1 +
8 files changed, 822 insertions(+)
create mode 100644 contrib/bench_radix_tree/Makefile
create mode 100644 contrib/bench_radix_tree/bench_radix_tree--1.0.sql
create mode 100644 contrib/bench_radix_tree/bench_radix_tree.c
create mode 100644 contrib/bench_radix_tree/bench_radix_tree.control
create mode 100644 contrib/bench_radix_tree/expected/bench.out
create mode 100644 contrib/bench_radix_tree/meson.build
create mode 100644 contrib/bench_radix_tree/sql/bench.sql
diff --git a/contrib/bench_radix_tree/Makefile b/contrib/bench_radix_tree/Makefile
new file mode 100644
index 0000000000..952bb0ceae
--- /dev/null
+++ b/contrib/bench_radix_tree/Makefile
@@ -0,0 +1,21 @@
+# contrib/bench_radix_tree/Makefile
+
+MODULE_big = bench_radix_tree
+OBJS = \
+ bench_radix_tree.o
+
+EXTENSION = bench_radix_tree
+DATA = bench_radix_tree--1.0.sql
+
+REGRESS = bench_fixed_height
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/bench_radix_tree
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
new file mode 100644
index 0000000000..2fd689aa91
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
@@ -0,0 +1,76 @@
+/* contrib/bench_radix_tree/bench_radix_tree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION bench_radix_tree" to load this file. \quit
+
+create function bench_shuffle_search(
+minblk int4,
+maxblk int4,
+random_block bool DEFAULT false,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT array_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT array_load_ms int8,
+OUT rt_search_ms int8,
+OUT array_serach_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_seq_search(
+minblk int4,
+maxblk int4,
+random_block bool DEFAULT false,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT array_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT array_load_ms int8,
+OUT rt_search_ms int8,
+OUT array_serach_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_load_random_int(
+cnt int8,
+OUT mem_allocated int8,
+OUT load_ms int8)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_search_random_nodes(
+cnt int8,
+filter_str text DEFAULT NULL,
+OUT mem_allocated int8,
+OUT search_ms int8)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C VOLATILE PARALLEL UNSAFE;
+
+create function bench_fixed_height_search(
+fanout int4,
+OUT fanout int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT rt_search_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_node128_load(
+fanout int4,
+OUT fanout int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT rt_sparseload_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c
new file mode 100644
index 0000000000..73ddee32de
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -0,0 +1,656 @@
+/*-------------------------------------------------------------------------
+ *
+ * bench_radix_tree.c
+ *
+ * Copyright (c) 2016-2022, PostgreSQL Global Development Group
+ *
+ * contrib/bench_radix_tree/bench_radix_tree.c
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "funcapi.h"
+#include "lib/radixtree.h"
+#include <math.h>
+#include "miscadmin.h"
+#include "storage/lwlock.h"
+#include "utils/builtins.h"
+#include "utils/timestamp.h"
+
+PG_MODULE_MAGIC;
+
+/* run benchmark also for binary-search case? */
+/* #define MEASURE_BINARY_SEARCH 1 */
+
+#define TIDS_PER_BLOCK_FOR_LOAD 30
+#define TIDS_PER_BLOCK_FOR_LOOKUP 50
+
+#define RT_DEBUG
+#define RT_PREFIX rt
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#define RT_USE_DELETE
+#define RT_VALUE_TYPE uint64
+// WIP: compiles with warnings because rt_attach is defined but not used
+// #define RT_SHMEM
+#include "lib/radixtree.h"
+
+/*
+ * Return the number of keys in the radix tree.
+ */
+static uint64
+rt_num_entries(rt_radix_tree *tree)
+{
+ return tree->ctl->num_keys;
+}
+
+
+PG_FUNCTION_INFO_V1(bench_seq_search);
+PG_FUNCTION_INFO_V1(bench_shuffle_search);
+PG_FUNCTION_INFO_V1(bench_load_random_int);
+PG_FUNCTION_INFO_V1(bench_fixed_height_search);
+PG_FUNCTION_INFO_V1(bench_search_random_nodes);
+PG_FUNCTION_INFO_V1(bench_node128_load);
+
+static uint64
+tid_to_key_off(ItemPointer tid, uint32 *off)
+{
+ uint64 upper;
+ uint32 shift = pg_ceil_log2_32(MaxHeapTuplesPerPage);
+ int64 tid_i;
+
+ Assert(ItemPointerGetOffsetNumber(tid) < MaxHeapTuplesPerPage);
+
+ tid_i = ItemPointerGetOffsetNumber(tid);
+ tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << shift;
+
+ /* log(sizeof(uint64) * BITS_PER_BYTE, 2) = log(64, 2) = 6 */
+ *off = tid_i & ((1 << 6) - 1);
+ upper = tid_i >> 6;
+ Assert(*off < (sizeof(uint64) * BITS_PER_BYTE));
+
+ Assert(*off < 64);
+
+ return upper;
+}
+
+static int
+shuffle_randrange(pg_prng_state *state, int lower, int upper)
+{
+ return (int) floor(pg_prng_double(state) * ((upper - lower) + 0.999999)) + lower;
+}
+
+/* Naive Fisher-Yates implementation */
+static void
+shuffle_itemptrs(ItemPointer itemptr, uint64 nitems)
+{
+ /* reproducibility */
+ pg_prng_state state;
+
+ pg_prng_seed(&state, 0);
+
+ for (int i = 0; i < nitems - 1; i++)
+ {
+ int j = shuffle_randrange(&state, i, nitems - 1);
+ ItemPointerData t = itemptr[j];
+
+ itemptr[j] = itemptr[i];
+ itemptr[i] = t;
+ }
+}
+
+static ItemPointer
+generate_tids(BlockNumber minblk, BlockNumber maxblk, int ntids_per_blk, uint64 *ntids_p,
+ bool random_block)
+{
+ ItemPointer tids;
+ uint64 maxitems;
+ uint64 ntids = 0;
+ pg_prng_state state;
+
+ maxitems = (maxblk - minblk + 1) * ntids_per_blk;
+ tids = MemoryContextAllocHuge(TopTransactionContext,
+ sizeof(ItemPointerData) * maxitems);
+
+ if (random_block)
+ pg_prng_seed(&state, 0x9E3779B185EBCA87);
+
+ for (BlockNumber blk = minblk; blk < maxblk; blk++)
+ {
+ if (random_block && !pg_prng_bool(&state))
+ continue;
+
+ for (OffsetNumber off = FirstOffsetNumber;
+ off <= ntids_per_blk; off++)
+ {
+ CHECK_FOR_INTERRUPTS();
+
+ ItemPointerSetBlockNumber(&(tids[ntids]), blk);
+ ItemPointerSetOffsetNumber(&(tids[ntids]), off);
+
+ ntids++;
+ }
+ }
+
+ *ntids_p = ntids;
+ return tids;
+}
+
+#ifdef MEASURE_BINARY_SEARCH
+static int
+vac_cmp_itemptr(const void *left, const void *right)
+{
+ BlockNumber lblk,
+ rblk;
+ OffsetNumber loff,
+ roff;
+
+ lblk = ItemPointerGetBlockNumber((ItemPointer) left);
+ rblk = ItemPointerGetBlockNumber((ItemPointer) right);
+
+ if (lblk < rblk)
+ return -1;
+ if (lblk > rblk)
+ return 1;
+
+ loff = ItemPointerGetOffsetNumber((ItemPointer) left);
+ roff = ItemPointerGetOffsetNumber((ItemPointer) right);
+
+ if (loff < roff)
+ return -1;
+ if (loff > roff)
+ return 1;
+
+ return 0;
+}
+#endif
+
+static Datum
+bench_search(FunctionCallInfo fcinfo, bool shuffle)
+{
+ BlockNumber minblk = PG_GETARG_INT32(0);
+ BlockNumber maxblk = PG_GETARG_INT32(1);
+ bool random_block = PG_GETARG_BOOL(2);
+ rt_radix_tree *rt = NULL;
+ uint64 ntids;
+ uint64 key;
+ uint64 last_key = PG_UINT64_MAX;
+ uint64 val = 0;
+ ItemPointer tids;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_load_ms,
+ rt_search_ms;
+ Datum values[7];
+ bool nulls[7];
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ tids = generate_tids(minblk, maxblk, TIDS_PER_BLOCK_FOR_LOAD, &ntids, random_block);
+
+ /* measure the load time of the radix tree */
+ rt = rt_create(CurrentMemoryContext);
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ uint32 off;
+
+ CHECK_FOR_INTERRUPTS();
+
+ key = tid_to_key_off(tid, &off);
+
+ if (last_key != PG_UINT64_MAX && last_key != key)
+ {
+ rt_set(rt, last_key, &val);
+ val = 0;
+ }
+
+ last_key = key;
+ val |= (uint64) 1 << off;
+ }
+ if (last_key != PG_UINT64_MAX)
+ rt_set(rt, last_key, &val);
+
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_load_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ if (shuffle)
+ shuffle_itemptrs(tids, ntids);
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+ /* measure the search time of the radix tree */
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ uint32 off;
+ volatile bool ret; /* prevent calling rt_search from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ key = tid_to_key_off(tid, &off);
+
+ ret = rt_search(rt, key, &val);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_search_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int64GetDatum(rt_num_entries(rt));
+ values[1] = Int64GetDatum(rt_memory_usage(rt));
+ values[2] = Int64GetDatum(sizeof(ItemPointerData) * ntids);
+ values[3] = Int64GetDatum(rt_load_ms);
+ nulls[4] = true; /* ar_load_ms */
+ values[5] = Int64GetDatum(rt_search_ms);
+ nulls[6] = true; /* ar_search_ms */
+
+#ifdef MEASURE_BINARY_SEARCH
+ {
+ ItemPointer itemptrs = NULL;
+
+ int64 ar_load_ms,
+ ar_search_ms;
+
+ /* measure the load time of the array */
+ itemptrs = MemoryContextAllocHuge(CurrentMemoryContext,
+ sizeof(ItemPointerData) * ntids);
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointerSetBlockNumber(&(itemptrs[i]),
+ ItemPointerGetBlockNumber(&(tids[i])));
+ ItemPointerSetOffsetNumber(&(itemptrs[i]),
+ ItemPointerGetOffsetNumber(&(tids[i])));
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ ar_load_ms = secs * 1000 + usecs / 1000;
+
+ /* next, measure the search time of the array */
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ volatile bool ret; /* prevent calling bsearch from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ ret = bsearch((void *) tid,
+ (void *) itemptrs,
+ ntids,
+ sizeof(ItemPointerData),
+ vac_cmp_itemptr);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ ar_search_ms = secs * 1000 + usecs / 1000;
+
+ /* set the result */
+ nulls[4] = false;
+ values[4] = Int64GetDatum(ar_load_ms);
+ nulls[6] = false;
+ values[6] = Int64GetDatum(ar_search_ms);
+ }
+#endif
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_seq_search(PG_FUNCTION_ARGS)
+{
+ return bench_search(fcinfo, false);
+}
+
+Datum
+bench_shuffle_search(PG_FUNCTION_ARGS)
+{
+ return bench_search(fcinfo, true);
+}
+
+Datum
+bench_load_random_int(PG_FUNCTION_ARGS)
+{
+ uint64 cnt = (uint64) PG_GETARG_INT64(0);
+ rt_radix_tree *rt;
+ pg_prng_state state;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 load_time_ms;
+ Datum values[2];
+ bool nulls[2];
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ pg_prng_seed(&state, 0);
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ uint64 key = pg_prng_uint64(&state);
+
+ rt_set(rt, key, &key);
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ load_time_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int64GetDatum(rt_memory_usage(rt));
+ values[1] = Int64GetDatum(load_time_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+/* copy of splitmix64() */
+static uint64
+hash64(uint64 x)
+{
+ x ^= x >> 30;
+ x *= UINT64CONST(0xbf58476d1ce4e5b9);
+ x ^= x >> 27;
+ x *= UINT64CONST(0x94d049bb133111eb);
+ x ^= x >> 31;
+ return x;
+}
+
+/* attempts to have a relatively even population of node kinds */
+Datum
+bench_search_random_nodes(PG_FUNCTION_ARGS)
+{
+ uint64 cnt = (uint64) PG_GETARG_INT64(0);
+ rt_radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 search_time_ms;
+ Datum values[2] = {0};
+ bool nulls[2] = {0};
+ /* from trial and error */
+ uint64 filter = (((uint64) 0x7F<<32) | (0x07<<24) | (0xFF<<16) | 0xFF);
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ if (!PG_ARGISNULL(1))
+ {
+ char *filter_str = text_to_cstring(PG_GETARG_TEXT_P(1));
+
+ if (sscanf(filter_str, "0x%lX", &filter) == 0)
+ elog(ERROR, "invalid filter string %s", filter_str);
+ }
+ elog(NOTICE, "bench with filter 0x%lX", filter);
+
+ rt = rt_create(CurrentMemoryContext);
+
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ const uint64 hash = hash64(i);
+ const uint64 key = hash & filter;
+
+ rt_set(rt, key, &key);
+ }
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ uint64 hash = hash64(i);
+ uint64 key = hash & filter;
+ uint64 val;
+ volatile bool ret; /* prevent calling rt_search from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ ret = rt_search(rt, key, &val);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ search_time_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ values[0] = Int64GetDatum(rt_memory_usage(rt));
+ values[1] = Int64GetDatum(search_time_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_fixed_height_search(PG_FUNCTION_ARGS)
+{
+ int fanout = PG_GETARG_INT32(0);
+ rt_radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_load_ms,
+ rt_search_ms;
+ Datum values[5];
+ bool nulls[5];
+
+ /* test boundary between vector and iteration */
+ const int n_keys = 5 * 16 * 16 * 16 * 16;
+ uint64 r,
+ h,
+ i,
+ j,
+ k;
+ uint64 key_id;
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+
+ key_id = 0;
+
+ /*
+ * lower nodes have limited fanout, the top is only limited by
+ * bits-per-byte
+ */
+ for (r = 1;; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ for (i = 1; i <= fanout; i++)
+ {
+ for (j = 1; j <= fanout; j++)
+ {
+ for (k = 1; k <= fanout; k++)
+ {
+ uint64 key;
+
+ key = (r << 32) | (h << 24) | (i << 16) | (j << 8) | (k);
+
+ CHECK_FOR_INTERRUPTS();
+
+ key_id++;
+ if (key_id > n_keys)
+ goto finish_set;
+
+ rt_set(rt, key, &key_id);
+ }
+ }
+ }
+ }
+ }
+finish_set:
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_load_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ /* measure the search time of the radix tree */
+ start_time = GetCurrentTimestamp();
+
+ key_id = 0;
+ for (r = 1;; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ for (i = 1; i <= fanout; i++)
+ {
+ for (j = 1; j <= fanout; j++)
+ {
+ for (k = 1; k <= fanout; k++)
+ {
+ uint64 key,
+ val;
+
+ key = (r << 32) | (h << 24) | (i << 16) | (j << 8) | (k);
+
+ CHECK_FOR_INTERRUPTS();
+
+ key_id++;
+ if (key_id > n_keys)
+ goto finish_search;
+
+ rt_search(rt, key, &val);
+ }
+ }
+ }
+ }
+ }
+finish_search:
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_search_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int32GetDatum(fanout);
+ values[1] = Int64GetDatum(rt_num_entries(rt));
+ values[2] = Int64GetDatum(rt_memory_usage(rt));
+ values[3] = Int64GetDatum(rt_load_ms);
+ values[4] = Int64GetDatum(rt_search_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_node128_load(PG_FUNCTION_ARGS)
+{
+ int fanout = PG_GETARG_INT32(0);
+ rt_radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_sparseload_ms;
+ Datum values[5];
+ bool nulls[5];
+
+ uint64 r,
+ h;
+ uint64 key_id;
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ rt = rt_create(CurrentMemoryContext);
+
+ key_id = 0;
+
+ for (r = 1; r <= fanout; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (h);
+
+ key_id++;
+ rt_set(rt, key, &key_id);
+ }
+ }
+
+ rt_stats(rt);
+
+ /* measure sparse deletion and re-loading */
+ start_time = GetCurrentTimestamp();
+
+ for (int t = 0; t<10000; t++)
+ {
+ /* delete one key in each leaf */
+ for (r = 1; r <= fanout; r++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (fanout);
+
+ rt_delete(rt, key);
+ }
+
+ /* add them all back */
+ key_id = 0;
+ for (r = 1; r <= fanout; r++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (fanout);
+
+ key_id++;
+ rt_set(rt, key, &key_id);
+ }
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_sparseload_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int32GetDatum(fanout);
+ values[1] = Int64GetDatum(rt_num_entries(rt));
+ values[2] = Int64GetDatum(rt_memory_usage(rt));
+ values[3] = Int64GetDatum(rt_sparseload_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
diff --git a/contrib/bench_radix_tree/bench_radix_tree.control b/contrib/bench_radix_tree/bench_radix_tree.control
new file mode 100644
index 0000000000..1d988e6c9a
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.control
@@ -0,0 +1,6 @@
+# bench_radix_tree extension
+comment = 'benchmark suite for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/bench_radix_tree'
+relocatable = true
+trusted = true
diff --git a/contrib/bench_radix_tree/expected/bench.out b/contrib/bench_radix_tree/expected/bench.out
new file mode 100644
index 0000000000..60c303892e
--- /dev/null
+++ b/contrib/bench_radix_tree/expected/bench.out
@@ -0,0 +1,13 @@
+create extension bench_radix_tree;
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
diff --git a/contrib/bench_radix_tree/meson.build b/contrib/bench_radix_tree/meson.build
new file mode 100644
index 0000000000..332c1ae7df
--- /dev/null
+++ b/contrib/bench_radix_tree/meson.build
@@ -0,0 +1,33 @@
+bench_radix_tree_sources = files(
+ 'bench_radix_tree.c',
+)
+
+if host_system == 'windows'
+ bench_radix_tree_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'bench_radix_tree',
+ '--FILEDESC', 'bench_radix_tree - performance test code for radix tree',])
+endif
+
+bench_radix_tree = shared_module('bench_radix_tree',
+ bench_radix_tree_sources,
+ link_with: pgport_srv,
+ kwargs: pg_mod_args,
+)
+testprep_targets += bench_radix_tree
+
+install_data(
+ 'bench_radix_tree.control',
+ 'bench_radix_tree--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'bench_radix_tree',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'bench_radix_tree',
+ ],
+ },
+}
diff --git a/contrib/bench_radix_tree/sql/bench.sql b/contrib/bench_radix_tree/sql/bench.sql
new file mode 100644
index 0000000000..a46018c9d4
--- /dev/null
+++ b/contrib/bench_radix_tree/sql/bench.sql
@@ -0,0 +1,16 @@
+create extension bench_radix_tree;
+
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
diff --git a/contrib/meson.build b/contrib/meson.build
index bd4a57c43c..52253de793 100644
--- a/contrib/meson.build
+++ b/contrib/meson.build
@@ -12,6 +12,7 @@ subdir('amcheck')
subdir('auth_delay')
subdir('auto_explain')
subdir('basic_archive')
+#subdir('bench_radix_tree')
subdir('bloom')
subdir('basebackup_to_shell')
subdir('bool_plperl')
--
2.39.1
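To make the benchmark's key encoding concrete: tid_to_key_off() packs a TID as
(block << ceil_log2(MaxHeapTuplesPerPage)) | offset, then uses everything
above the low 6 bits as the radix tree key and the low 6 bits as a bit
position within the 64-bit value stored under that key. With 8kB pages the
shift is 9, so block 10 / offset 3 encodes to 5123, i.e. key 80 with bit 3
set. A hypothetical inverse, for illustration only (key_off_to_tid() is not
part of the patch), would look like this:

#include "postgres.h"
#include "access/htup_details.h"
#include "port/pg_bitutils.h"
#include "storage/itemptr.h"

/* Illustration only: rebuild a TID from a radix tree key and bit position. */
static void
key_off_to_tid(uint64 key, uint32 off, ItemPointer tid)
{
	uint32		shift = pg_ceil_log2_32(MaxHeapTuplesPerPage);
	uint64		tid_i = (key << 6) | off;

	Assert(off < sizeof(uint64) * BITS_PER_BYTE);

	ItemPointerSet(tid,
				   (BlockNumber) (tid_i >> shift),
				   (OffsetNumber) (tid_i & ((1 << shift) - 1)));
}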
Attachment: v26-0003-Add-radixtree-template.patch (text/x-patch)
From 4c4cbb9b13da160b8883e6c7f861516f3eedac6a Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 14 Sep 2022 12:38:51 +0000
Subject: [PATCH v26 3/9] Add radixtree template
WIP: commit message based on template comments
---
src/backend/utils/mmgr/dsa.c | 12 +
src/include/lib/radixtree.h | 2516 +++++++++++++++++
src/include/lib/radixtree_delete_impl.h | 122 +
src/include/lib/radixtree_insert_impl.h | 332 +++
src/include/lib/radixtree_iter_impl.h | 153 +
src/include/lib/radixtree_search_impl.h | 138 +
src/include/utils/dsa.h | 1 +
src/test/modules/Makefile | 1 +
src/test/modules/meson.build | 1 +
src/test/modules/test_radixtree/.gitignore | 4 +
src/test/modules/test_radixtree/Makefile | 23 +
src/test/modules/test_radixtree/README | 7 +
.../expected/test_radixtree.out | 36 +
src/test/modules/test_radixtree/meson.build | 35 +
.../test_radixtree/sql/test_radixtree.sql | 7 +
.../test_radixtree/test_radixtree--1.0.sql | 8 +
.../modules/test_radixtree/test_radixtree.c | 674 +++++
.../test_radixtree/test_radixtree.control | 4 +
src/tools/pginclude/cpluspluscheck | 6 +
src/tools/pginclude/headerscheck | 6 +
20 files changed, 4086 insertions(+)
create mode 100644 src/include/lib/radixtree.h
create mode 100644 src/include/lib/radixtree_delete_impl.h
create mode 100644 src/include/lib/radixtree_insert_impl.h
create mode 100644 src/include/lib/radixtree_iter_impl.h
create mode 100644 src/include/lib/radixtree_search_impl.h
create mode 100644 src/test/modules/test_radixtree/.gitignore
create mode 100644 src/test/modules/test_radixtree/Makefile
create mode 100644 src/test/modules/test_radixtree/README
create mode 100644 src/test/modules/test_radixtree/expected/test_radixtree.out
create mode 100644 src/test/modules/test_radixtree/meson.build
create mode 100644 src/test/modules/test_radixtree/sql/test_radixtree.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree--1.0.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree.c
create mode 100644 src/test/modules/test_radixtree/test_radixtree.control
diff --git a/src/backend/utils/mmgr/dsa.c b/src/backend/utils/mmgr/dsa.c
index f5a62061a3..80555aefff 100644
--- a/src/backend/utils/mmgr/dsa.c
+++ b/src/backend/utils/mmgr/dsa.c
@@ -1024,6 +1024,18 @@ dsa_set_size_limit(dsa_area *area, size_t limit)
LWLockRelease(DSA_AREA_LOCK(area));
}
+size_t
+dsa_get_total_size(dsa_area *area)
+{
+ size_t size;
+
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_SHARED);
+ size = area->control->total_segment_size;
+ LWLockRelease(DSA_AREA_LOCK(area));
+
+ return size;
+}
+
/*
* Aggressively free all spare memory in the hope of returning DSM segments to
* the operating system.
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
new file mode 100644
index 0000000000..d6919aef08
--- /dev/null
+++ b/src/include/lib/radixtree.h
@@ -0,0 +1,2516 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.h
+ * Template for adaptive radix tree.
+ *
+ * This module employs the idea from the paper "The Adaptive Radix Tree: ARTful
+ * Indexing for Main-Memory Databases" by Viktor Leis, Alfons Kemper, and Thomas
+ * Neumann, 2013. The radix tree uses adaptive node sizes, a small number of node
+ * types, each with a different number of elements. Depending on the number of
+ * children, the appropriate node type is used.
+ *
+ * WIP: notes about traditional radix tree trading off span vs height...
+ *
+ * There are two kinds of nodes, inner nodes and leaves. Inner nodes
+ * map partial keys to child pointers.
+ *
+ * The ART paper mentions three ways to implement leaves:
+ *
+ * "- Single-value leaves: The values are stored using an addi-
+ * tional leaf node type which stores one value.
+ * - Multi-value leaves: The values are stored in one of four
+ * different leaf node types, which mirror the structure of
+ * inner nodes, but contain values instead of pointers.
+ * - Combined pointer/value slots: If values fit into point-
+ * ers, no separate node types are necessary. Instead, each
+ * pointer storage location in an inner node can either
+ * store a pointer or a value."
+ *
+ * We chose "multi-value leaves" to avoid the additional pointer traversal
+ * required by "single-value leaves"
+ *
+ * For simplicity, the key is assumed to be 64-bit unsigned integer. The
+ * tree doesn't need to contain paths where the highest bytes of all keys
+ * are zero. That way, the tree's height adapts to the distribution of keys.
+ *
+ * TODO: In the future it might be worthwhile to offer configurability of
+ * leaf implementation for different use cases. Single-value leaves would
+ * give more flexibility in key type, including variable-length keys.
+ *
+ * There are some optimizations not yet implemented, particularly path
+ * compression and lazy path expansion.
+ *
+ * To handle concurrency, we use a single reader-writer lock for the radix
+ * tree. The radix tree is exclusively locked during write operations such
+ * as RT_SET() and RT_DELETE(), and shared locked during read operations
+ * such as RT_SEARCH(). An iteration also holds the shared lock on the radix
+ * tree until it is completed.
+ *
+ * TODO: The current locking mechanism is not optimized for high concurrency
+ * with mixed read-write workloads. In the future it might be worthwhile
+ * to replace it with the Optimistic Lock Coupling or ROWEX mentioned in
+ * the paper "The ART of Practical Synchronization" by the same authors as
+ * the ART paper, 2016.
+ *
+ * WIP: the radix tree nodes don't shrink.
+ *
+ * To generate a radix tree and associated functions for a use case several
+ * macros have to be #define'ed before this file is included. Including
+ * the file #undef's all those, so a new radix tree can be generated
+ * afterwards.
+ * The relevant parameters are:
+ * - RT_PREFIX - prefix for all symbol names generated. A prefix of 'foo'
+ * will result in radix tree type 'foo_radix_tree' and functions like
+ * 'foo_create'/'foo_free' and so forth.
+ * - RT_DECLARE - if defined function prototypes and type declarations are
+ * generated
+ * - RT_DEFINE - if defined function definitions are generated
+ * - RT_SCOPE - in which scope (e.g. extern, static inline) do function
+ * declarations reside
+ * - RT_VALUE_TYPE - the type of the value.
+ *
+ * Optional parameters:
+ * - RT_SHMEM - if defined, the radix tree is created in the DSA area
+ * so that multiple processes can access it simultaneously.
+ * - RT_DEBUG - if defined add stats tracking and debugging functions
+ *
+ * Interface
+ * ---------
+ *
+ * RT_CREATE - Create a new, empty radix tree
+ * RT_FREE - Free the radix tree
+ * RT_SEARCH - Search a key-value pair
+ * RT_SET - Set a key-value pair
+ * RT_BEGIN_ITERATE - Begin iterating through all key-value pairs
+ * RT_ITERATE_NEXT - Return next key-value pair, if any
+ * RT_END_ITER - End iteration
+ * RT_MEMORY_USAGE - Get the memory usage
+ *
+ * Interface for Shared Memory
+ * ---------
+ *
+ * RT_ATTACH - Attach to the radix tree
+ * RT_DETACH - Detach from the radix tree
+ * RT_GET_HANDLE - Return the handle of the radix tree
+ *
+ * Optional Interface
+ * ---------
+ *
+ * RT_DELETE - Delete a key-value pair. Declared/defined if RT_USE_DELETE is defined
+ *
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/lib/radixtree.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "lib/stringinfo.h"
+#include "miscadmin.h"
+#include "nodes/bitmapset.h"
+#include "port/pg_bitutils.h"
+#include "port/simd.h"
+#include "utils/dsa.h"
+#include "utils/memutils.h"
+
+/* helpers */
+#define RT_MAKE_PREFIX(a) CppConcat(a,_)
+#define RT_MAKE_NAME(name) RT_MAKE_NAME_(RT_MAKE_PREFIX(RT_PREFIX),name)
+#define RT_MAKE_NAME_(a,b) CppConcat(a,b)
+
+/* function declarations */
+#define RT_CREATE RT_MAKE_NAME(create)
+#define RT_FREE RT_MAKE_NAME(free)
+#define RT_SEARCH RT_MAKE_NAME(search)
+#ifdef RT_SHMEM
+#define RT_ATTACH RT_MAKE_NAME(attach)
+#define RT_DETACH RT_MAKE_NAME(detach)
+#define RT_GET_HANDLE RT_MAKE_NAME(get_handle)
+#endif
+#define RT_SET RT_MAKE_NAME(set)
+#define RT_BEGIN_ITERATE RT_MAKE_NAME(begin_iterate)
+#define RT_ITERATE_NEXT RT_MAKE_NAME(iterate_next)
+#define RT_END_ITERATE RT_MAKE_NAME(end_iterate)
+#ifdef RT_USE_DELETE
+#define RT_DELETE RT_MAKE_NAME(delete)
+#endif
+#define RT_MEMORY_USAGE RT_MAKE_NAME(memory_usage)
+#ifdef RT_DEBUG
+#define RT_DUMP RT_MAKE_NAME(dump)
+#define RT_DUMP_NODE RT_MAKE_NAME(dump_node)
+#define RT_DUMP_SEARCH RT_MAKE_NAME(dump_search)
+#define RT_STATS RT_MAKE_NAME(stats)
+#endif
+
+/* internal helper functions (no externally visible prototypes) */
+#define RT_NEW_ROOT RT_MAKE_NAME(new_root)
+#define RT_ALLOC_NODE RT_MAKE_NAME(alloc_node)
+#define RT_INIT_NODE RT_MAKE_NAME(init_node)
+#define RT_FREE_NODE RT_MAKE_NAME(free_node)
+#define RT_FREE_RECURSE RT_MAKE_NAME(free_recurse)
+#define RT_EXTEND RT_MAKE_NAME(extend)
+#define RT_SET_EXTEND RT_MAKE_NAME(set_extend)
+#define RT_SWITCH_NODE_KIND RT_MAKE_NAME(grow_node_kind)
+#define RT_COPY_NODE RT_MAKE_NAME(copy_node)
+#define RT_REPLACE_NODE RT_MAKE_NAME(replace_node)
+#define RT_PTR_GET_LOCAL RT_MAKE_NAME(ptr_get_local)
+#define RT_PTR_ALLOC_IS_VALID RT_MAKE_NAME(ptr_stored_is_valid)
+#define RT_NODE_3_SEARCH_EQ RT_MAKE_NAME(node_3_search_eq)
+#define RT_NODE_32_SEARCH_EQ RT_MAKE_NAME(node_32_search_eq)
+#define RT_NODE_3_GET_INSERTPOS RT_MAKE_NAME(node_3_get_insertpos)
+#define RT_NODE_32_GET_INSERTPOS RT_MAKE_NAME(node_32_get_insertpos)
+#define RT_CHUNK_CHILDREN_ARRAY_SHIFT RT_MAKE_NAME(chunk_children_array_shift)
+#define RT_CHUNK_VALUES_ARRAY_SHIFT RT_MAKE_NAME(chunk_values_array_shift)
+#define RT_CHUNK_CHILDREN_ARRAY_DELETE RT_MAKE_NAME(chunk_children_array_delete)
+#define RT_CHUNK_VALUES_ARRAY_DELETE RT_MAKE_NAME(chunk_values_array_delete)
+#define RT_CHUNK_CHILDREN_ARRAY_COPY RT_MAKE_NAME(chunk_children_array_copy)
+#define RT_CHUNK_VALUES_ARRAY_COPY RT_MAKE_NAME(chunk_values_array_copy)
+#define RT_NODE_125_IS_CHUNK_USED RT_MAKE_NAME(node_125_is_chunk_used)
+#define RT_NODE_INNER_125_GET_CHILD RT_MAKE_NAME(node_inner_125_get_child)
+#define RT_NODE_LEAF_125_GET_VALUE RT_MAKE_NAME(node_leaf_125_get_value)
+#define RT_NODE_INNER_256_IS_CHUNK_USED RT_MAKE_NAME(node_inner_256_is_chunk_used)
+#define RT_NODE_LEAF_256_IS_CHUNK_USED RT_MAKE_NAME(node_leaf_256_is_chunk_used)
+#define RT_NODE_INNER_256_GET_CHILD RT_MAKE_NAME(node_inner_256_get_child)
+#define RT_NODE_LEAF_256_GET_VALUE RT_MAKE_NAME(node_leaf_256_get_value)
+#define RT_NODE_INNER_256_SET RT_MAKE_NAME(node_inner_256_set)
+#define RT_NODE_LEAF_256_SET RT_MAKE_NAME(node_leaf_256_set)
+#define RT_NODE_INNER_256_DELETE RT_MAKE_NAME(node_inner_256_delete)
+#define RT_NODE_LEAF_256_DELETE RT_MAKE_NAME(node_leaf_256_delete)
+#define RT_KEY_GET_SHIFT RT_MAKE_NAME(key_get_shift)
+#define RT_SHIFT_GET_MAX_VAL RT_MAKE_NAME(shift_get_max_val)
+#define RT_NODE_SEARCH_INNER RT_MAKE_NAME(node_search_inner)
+#define RT_NODE_SEARCH_LEAF RT_MAKE_NAME(node_search_leaf)
+#define RT_NODE_UPDATE_INNER RT_MAKE_NAME(node_update_inner)
+#define RT_NODE_DELETE_INNER RT_MAKE_NAME(node_delete_inner)
+#define RT_NODE_DELETE_LEAF RT_MAKE_NAME(node_delete_leaf)
+#define RT_NODE_INSERT_INNER RT_MAKE_NAME(node_insert_inner)
+#define RT_NODE_INSERT_LEAF RT_MAKE_NAME(node_insert_leaf)
+#define RT_NODE_INNER_ITERATE_NEXT RT_MAKE_NAME(node_inner_iterate_next)
+#define RT_NODE_LEAF_ITERATE_NEXT RT_MAKE_NAME(node_leaf_iterate_next)
+#define RT_UPDATE_ITER_STACK RT_MAKE_NAME(update_iter_stack)
+#define RT_ITER_UPDATE_KEY RT_MAKE_NAME(iter_update_key)
+#define RT_VERIFY_NODE RT_MAKE_NAME(verify_node)
+
+/* type declarations */
+#define RT_RADIX_TREE RT_MAKE_NAME(radix_tree)
+#define RT_RADIX_TREE_CONTROL RT_MAKE_NAME(radix_tree_control)
+#define RT_ITER RT_MAKE_NAME(iter)
+#ifdef RT_SHMEM
+#define RT_HANDLE RT_MAKE_NAME(handle)
+#endif
+#define RT_NODE RT_MAKE_NAME(node)
+#define RT_NODE_ITER RT_MAKE_NAME(node_iter)
+#define RT_NODE_BASE_3 RT_MAKE_NAME(node_base_3)
+#define RT_NODE_BASE_32 RT_MAKE_NAME(node_base_32)
+#define RT_NODE_BASE_125 RT_MAKE_NAME(node_base_125)
+#define RT_NODE_BASE_256 RT_MAKE_NAME(node_base_256)
+#define RT_NODE_INNER_3 RT_MAKE_NAME(node_inner_3)
+#define RT_NODE_INNER_32 RT_MAKE_NAME(node_inner_32)
+#define RT_NODE_INNER_125 RT_MAKE_NAME(node_inner_125)
+#define RT_NODE_INNER_256 RT_MAKE_NAME(node_inner_256)
+#define RT_NODE_LEAF_3 RT_MAKE_NAME(node_leaf_3)
+#define RT_NODE_LEAF_32 RT_MAKE_NAME(node_leaf_32)
+#define RT_NODE_LEAF_125 RT_MAKE_NAME(node_leaf_125)
+#define RT_NODE_LEAF_256 RT_MAKE_NAME(node_leaf_256)
+#define RT_SIZE_CLASS RT_MAKE_NAME(size_class)
+#define RT_SIZE_CLASS_ELEM RT_MAKE_NAME(size_class_elem)
+#define RT_SIZE_CLASS_INFO RT_MAKE_NAME(size_class_info)
+#define RT_CLASS_3 RT_MAKE_NAME(class_3)
+#define RT_CLASS_32_MIN RT_MAKE_NAME(class_32_min)
+#define RT_CLASS_32_MAX RT_MAKE_NAME(class_32_max)
+#define RT_CLASS_125 RT_MAKE_NAME(class_125)
+#define RT_CLASS_256 RT_MAKE_NAME(class_256)
+
+/* generate forward declarations necessary to use the radix tree */
+#ifdef RT_DECLARE
+
+typedef struct RT_RADIX_TREE RT_RADIX_TREE;
+typedef struct RT_ITER RT_ITER;
+
+#ifdef RT_SHMEM
+typedef dsa_pointer RT_HANDLE;
+#endif
+
+#ifdef RT_SHMEM
+RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx, dsa_area *dsa, int tranche_id);
+RT_SCOPE RT_RADIX_TREE * RT_ATTACH(dsa_area *dsa, dsa_pointer dp);
+RT_SCOPE void RT_DETACH(RT_RADIX_TREE *tree);
+RT_SCOPE RT_HANDLE RT_GET_HANDLE(RT_RADIX_TREE *tree);
+#else
+RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx);
+#endif
+RT_SCOPE void RT_FREE(RT_RADIX_TREE *tree);
+
+RT_SCOPE bool RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p);
+RT_SCOPE bool RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p);
+#ifdef RT_USE_DELETE
+RT_SCOPE bool RT_DELETE(RT_RADIX_TREE *tree, uint64 key);
+#endif
+
+RT_SCOPE RT_ITER * RT_BEGIN_ITERATE(RT_RADIX_TREE *tree);
+RT_SCOPE bool RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, RT_VALUE_TYPE *value_p);
+RT_SCOPE void RT_END_ITERATE(RT_ITER *iter);
+
+RT_SCOPE uint64 RT_MEMORY_USAGE(RT_RADIX_TREE *tree);
+
+#ifdef RT_DEBUG
+RT_SCOPE void RT_DUMP(RT_RADIX_TREE *tree);
+RT_SCOPE void RT_DUMP_SEARCH(RT_RADIX_TREE *tree, uint64 key);
+RT_SCOPE void RT_STATS(RT_RADIX_TREE *tree);
+#endif
+
+#endif /* RT_DECLARE */
+
+
+/* generate implementation of the radix tree */
+#ifdef RT_DEFINE
+
+/* The number of bits encoded in one tree level */
+#define RT_NODE_SPAN BITS_PER_BYTE
+
+/* The number of maximum slots in the node */
+#define RT_NODE_MAX_SLOTS (1 << RT_NODE_SPAN)
+
+/* Mask for extracting a chunk from the key */
+#define RT_CHUNK_MASK ((1 << RT_NODE_SPAN) - 1)
+
+/* Maximum shift the radix tree uses */
+#define RT_MAX_SHIFT RT_KEY_GET_SHIFT(UINT64_MAX)
+
+/* Tree level the radix tree uses */
+#define RT_MAX_LEVEL ((sizeof(uint64) * BITS_PER_BYTE) / RT_NODE_SPAN)
+
+/*
+ * Number of bits necessary for isset array in the slot-index node.
+ * Since bitmapword can be 64 bits, the only values that make sense
+ * here are 64 and 128.
+ */
+#define RT_SLOT_IDX_LIMIT (RT_NODE_MAX_SLOTS / 2)
+
+/* Invalid index used in node-125 */
+#define RT_INVALID_SLOT_IDX 0xFF
+
+/* Get a chunk from the key */
+#define RT_GET_KEY_CHUNK(key, shift) ((uint8) (((key) >> (shift)) & RT_CHUNK_MASK))
+
+/* For accessing bitmaps */
+#define RT_BM_IDX(x) ((x) / BITS_PER_BITMAPWORD)
+#define RT_BM_BIT(x) ((x) % BITS_PER_BITMAPWORD)
+
+/*
+ * Node kinds
+ *
+ * The different node kinds are what make the tree "adaptive".
+ *
+ * Each node kind is associated with a different datatype and different
+ * search/set/delete/iterate algorithms adapted for its size. The largest
+ * kind, node256, is basically the same as a traditional radix tree,
+ * and would be most wasteful of memory when sparsely populated. The
+ * smaller nodes expend some additional CPU time to enable a smaller
+ * memory footprint.
+ *
+ * XXX There are 4 node kinds, and this should never be increased,
+ * for several reasons:
+ * 1. With 5 or more kinds, gcc tends to use a jump table for switch
+ * statements.
+ * 2. The 4 kinds can be represented with 2 bits, so we have the option
+ * in the future to tag the node pointer with the kind, even on
+ * platforms with 32-bit pointers. This might speed up node traversal
+ * in trees with highly random node kinds.
+ * 3. We can have multiple size classes per node kind.
+ */
+#define RT_NODE_KIND_3 0x00
+#define RT_NODE_KIND_32 0x01
+#define RT_NODE_KIND_125 0x02
+#define RT_NODE_KIND_256 0x03
+#define RT_NODE_KIND_COUNT 4
+
+/*
+ * Calculate the slab blocksize so that we can allocate at least 32 chunks
+ * from the block.
+ */
+#define RT_SLAB_BLOCK_SIZE(size) \
+ Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * (size), (size) * 32)
+
+/* Common type for all nodes types */
+typedef struct RT_NODE
+{
+ /*
+ * Number of children. We use uint16 to be able to indicate 256 children
+ * at the fanout of 8.
+ */
+ uint16 count;
+
+ /*
+ * Max capacity for the current size class. Storing this in the
+ * node enables multiple size classes per node kind.
+ * Technically, kinds with a single size class don't need this, so we could
+ * keep this in the individual base types, but the code is simpler this way.
+ * Note: node256 is unique in that it cannot possibly have more than a
+ * single size class, so for that kind we store zero, and uint8 is
+ * sufficient for other kinds.
+ */
+ uint8 fanout;
+
+ /*
+ * Shift indicates which part of the key space is represented by this
+ * node. That is, the key is shifted by 'shift' and the lowest
+ * RT_NODE_SPAN bits are then represented in chunk.
+ */
+ uint8 shift;
+
+ /* Node kind, one per search/set algorithm */
+ uint8 kind;
+} RT_NODE;
+
+
+#define RT_PTR_LOCAL RT_NODE *
+
+#ifdef RT_SHMEM
+#define RT_PTR_ALLOC dsa_pointer
+#else
+#define RT_PTR_ALLOC RT_PTR_LOCAL
+#endif
+
+
+#ifdef RT_SHMEM
+#define RT_INVALID_PTR_ALLOC InvalidDsaPointer
+#else
+#define RT_INVALID_PTR_ALLOC NULL
+#endif
+
+#ifdef RT_SHMEM
+#define RT_LOCK_EXCLUSIVE(tree) LWLockAcquire(&tree->ctl->lock, LW_EXCLUSIVE)
+#define RT_LOCK_SHARED(tree) LWLockAcquire(&tree->ctl->lock, LW_SHARED)
+#define RT_UNLOCK(tree) LWLockRelease(&tree->ctl->lock);
+#else
+#define RT_LOCK_EXCLUSIVE(tree) ((void) 0)
+#define RT_LOCK_SHARED(tree) ((void) 0)
+#define RT_UNLOCK(tree) ((void) 0)
+#endif
+
+/*
+ * Inner nodes and leaf nodes have analogous structure. To distinguish
+ * them at runtime, we take advantage of the fact that the key chunk
+ * is accessed by shifting: inner tree nodes (shift > 0) store a
+ * pointer to the child node in the slot. In leaf nodes (shift == 0),
+ * the slot contains the value corresponding to the key.
+ */
+#define RT_NODE_IS_LEAF(n) (((RT_PTR_LOCAL) (n))->shift == 0)
+
+#define RT_NODE_MUST_GROW(node) \
+ ((node)->base.n.count == (node)->base.n.fanout)
+
+/*
+ * Base type of each node kind, for leaf and inner nodes.
+ * The base types must be able to accommodate the largest size
+ * class for variable-sized node kinds.
+ */
+typedef struct RT_NODE_BASE_3
+{
+ RT_NODE n;
+
+ /* 3 children, for key chunks */
+ uint8 chunks[3];
+} RT_NODE_BASE_3;
+
+typedef struct RT_NODE_BASE_32
+{
+ RT_NODE n;
+
+ /* 32 children, for key chunks */
+ uint8 chunks[32];
+} RT_NODE_BASE_32;
+
+/*
+ * node-125 uses slot_idx array, an array of RT_NODE_MAX_SLOTS length
+ * to store indexes into a second array that contains the values (or
+ * child pointers).
+ */
+typedef struct RT_NODE_BASE_125
+{
+ RT_NODE n;
+
+ /* The index of slots for each fanout */
+ uint8 slot_idxs[RT_NODE_MAX_SLOTS];
+
+ /* bitmap to track which slots are in use */
+ bitmapword isset[RT_BM_IDX(RT_SLOT_IDX_LIMIT)];
+} RT_NODE_BASE_125;
+
+typedef struct RT_NODE_BASE_256
+{
+ RT_NODE n;
+} RT_NODE_BASE_256;
+
+/*
+ * Inner and leaf nodes.
+ *
+ * These are separate because the value type might be different than
+ * something fitting into a pointer-width type.
+ */
+typedef struct RT_NODE_INNER_3
+{
+ RT_NODE_BASE_3 base;
+
+ /* number of children depends on size class */
+ RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_INNER_3;
+
+typedef struct RT_NODE_LEAF_3
+{
+ RT_NODE_BASE_3 base;
+
+ /* number of values depends on size class */
+ RT_VALUE_TYPE values[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_LEAF_3;
+
+typedef struct RT_NODE_INNER_32
+{
+ RT_NODE_BASE_32 base;
+
+ /* number of children depends on size class */
+ RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_INNER_32;
+
+typedef struct RT_NODE_LEAF_32
+{
+ RT_NODE_BASE_32 base;
+
+ /* number of values depends on size class */
+ RT_VALUE_TYPE values[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_LEAF_32;
+
+typedef struct RT_NODE_INNER_125
+{
+ RT_NODE_BASE_125 base;
+
+ /* number of children depends on size class */
+ RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_INNER_125;
+
+typedef struct RT_NODE_LEAF_125
+{
+ RT_NODE_BASE_125 base;
+
+ /* number of values depends on size class */
+ RT_VALUE_TYPE values[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_LEAF_125;
+
+/*
+ * node-256 is the largest node type. This node has an array
+ * for directly storing values (or child pointers in inner nodes).
+ * Unlike other node kinds, its array size is by definition
+ * fixed.
+ */
+typedef struct RT_NODE_INNER_256
+{
+ RT_NODE_BASE_256 base;
+
+ /* Slots for 256 children */
+ RT_PTR_ALLOC children[RT_NODE_MAX_SLOTS];
+} RT_NODE_INNER_256;
+
+typedef struct RT_NODE_LEAF_256
+{
+ RT_NODE_BASE_256 base;
+
+ /*
+ * Unlike with inner256, zero is a valid value here, so we use a
+ * bitmap to track which slots are in use.
+ */
+ bitmapword isset[RT_BM_IDX(RT_NODE_MAX_SLOTS)];
+
+ /* Slots for 256 values */
+ RT_VALUE_TYPE values[RT_NODE_MAX_SLOTS];
+} RT_NODE_LEAF_256;
+
+/*
+ * Node size classes
+ *
+ * Nodes of different kinds necessarily belong to different size classes.
+ * The main innovation in our implementation compared to the ART paper
+ * is decoupling the notion of size class from kind.
+ *
+ * The size classes within a given node kind have the same underlying
+ * type, but a variable number of children/values. This is possible
+ * because the base type contains small fixed data structures that
+ * work the same way regardless of how full the node is. We store the
+ * node's allocated capacity in the "fanout" member of RT_NODE, to allow
+ * runtime introspection.
+ *
+ * Growing from one node kind to another requires special code for each
+ * case, but growing from one size class to another within the same kind
+ * is basically just allocate + memcpy.
+ *
+ * The size classes have been chosen so that inner nodes on platforms
+ * with 64-bit pointers (and leaf nodes when using a 64-bit key) are
+ * equal to or slightly smaller than some DSA size class.
+ */
+typedef enum RT_SIZE_CLASS
+{
+ RT_CLASS_3 = 0,
+ RT_CLASS_32_MIN,
+ RT_CLASS_32_MAX,
+ RT_CLASS_125,
+ RT_CLASS_256
+} RT_SIZE_CLASS;
+
+/* Information for each size class */
+typedef struct RT_SIZE_CLASS_ELEM
+{
+ const char *name;
+ int fanout;
+
+ /* slab chunk size */
+ Size inner_size;
+ Size leaf_size;
+} RT_SIZE_CLASS_ELEM;
+
+static const RT_SIZE_CLASS_ELEM RT_SIZE_CLASS_INFO[] = {
+ [RT_CLASS_3] = {
+ .name = "radix tree node 3",
+ .fanout = 3,
+ .inner_size = sizeof(RT_NODE_INNER_3) + 3 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_3) + 3 * sizeof(RT_VALUE_TYPE),
+ },
+ [RT_CLASS_32_MIN] = {
+ .name = "radix tree node 15",
+ .fanout = 15,
+ .inner_size = sizeof(RT_NODE_INNER_32) + 15 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_32) + 15 * sizeof(RT_VALUE_TYPE),
+ },
+ [RT_CLASS_32_MAX] = {
+ .name = "radix tree node 32",
+ .fanout = 32,
+ .inner_size = sizeof(RT_NODE_INNER_32) + 32 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_32) + 32 * sizeof(RT_VALUE_TYPE),
+ },
+ [RT_CLASS_125] = {
+ .name = "radix tree node 125",
+ .fanout = 125,
+ .inner_size = sizeof(RT_NODE_INNER_125) + 125 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_125) + 125 * sizeof(RT_VALUE_TYPE),
+ },
+ [RT_CLASS_256] = {
+ .name = "radix tree node 256",
+ .fanout = 256,
+ .inner_size = sizeof(RT_NODE_INNER_256),
+ .leaf_size = sizeof(RT_NODE_LEAF_256),
+ },
+};
+
+#define RT_SIZE_CLASS_COUNT lengthof(RT_SIZE_CLASS_INFO)
+
+#ifdef RT_SHMEM
+/* A magic value used to identify our radix tree */
+#define RT_RADIX_TREE_MAGIC 0x54A48167
+#endif
+
+/* Contains the actual tree and ancillary info */
+// WIP: this name is a bit strange
+typedef struct RT_RADIX_TREE_CONTROL
+{
+#ifdef RT_SHMEM
+ RT_HANDLE handle;
+ uint32 magic;
+ LWLock lock;
+#endif
+
+ RT_PTR_ALLOC root;
+ uint64 max_val;
+ uint64 num_keys;
+
+ /* statistics */
+#ifdef RT_DEBUG
+ int32 cnt[RT_SIZE_CLASS_COUNT];
+#endif
+} RT_RADIX_TREE_CONTROL;
+
+/* Entry point for allocating and accessing the tree */
+typedef struct RT_RADIX_TREE
+{
+ MemoryContext context;
+
+ /* pointing to either local memory or DSA */
+ RT_RADIX_TREE_CONTROL *ctl;
+
+#ifdef RT_SHMEM
+ dsa_area *dsa;
+#else
+ MemoryContextData *inner_slabs[RT_SIZE_CLASS_COUNT];
+ MemoryContextData *leaf_slabs[RT_SIZE_CLASS_COUNT];
+#endif
+} RT_RADIX_TREE;
+
+/*
+ * Iteration support.
+ *
+ * Iterating the radix tree returns each pair of key and value in the ascending
+ * order of the key. To support this, we iterate over the nodes of each level.
+ *
+ * RT_NODE_ITER struct is used to track the iteration within a node.
+ *
+ * RT_ITER is the struct for iteration of the radix tree, and uses RT_NODE_ITER
+ * in order to track the iteration of each level. During iteration, we also
+ * construct the key whenever updating the node iteration information, e.g., when
+ * advancing the current index within the node or when moving to the next node
+ * at the same level.
+ *
+ * XXX: Currently we allow only one process to do iteration. Therefore, rt_node_iter
+ * has the local pointers to nodes, rather than RT_PTR_ALLOC.
+ * We need either a safeguard to disallow other processes to begin the iteration
+ * while one process is doing it, or to allow multiple processes to do the iteration.
+ */
+typedef struct RT_NODE_ITER
+{
+ RT_PTR_LOCAL node; /* current node being iterated */
+ int current_idx; /* current position. -1 for initial value */
+} RT_NODE_ITER;
+
+typedef struct RT_ITER
+{
+ RT_RADIX_TREE *tree;
+
+ /* Track the iteration on nodes of each level */
+ RT_NODE_ITER stack[RT_MAX_LEVEL];
+ int stack_len;
+
+ /* The key is constructed during iteration */
+ uint64 key;
+} RT_ITER;
+
+
+static bool RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
+ uint64 key, RT_PTR_ALLOC child);
+static bool RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
+ uint64 key, RT_VALUE_TYPE *value_p);
+
+/* verification (available only with assertion) */
+static void RT_VERIFY_NODE(RT_PTR_LOCAL node);
+
+/* Get the local address of an allocated node */
+static inline RT_PTR_LOCAL
+RT_PTR_GET_LOCAL(RT_RADIX_TREE *tree, RT_PTR_ALLOC node)
+{
+#ifdef RT_SHMEM
+ return dsa_get_address(tree->dsa, (dsa_pointer) node);
+#else
+ return node;
+#endif
+}
+
+static inline bool
+RT_PTR_ALLOC_IS_VALID(RT_PTR_ALLOC ptr)
+{
+#ifdef RT_SHMEM
+ return DsaPointerIsValid(ptr);
+#else
+ return PointerIsValid(ptr);
+#endif
+}
+
+/*
+ * Return index of the first element in the node's chunk array that equals
+ * 'chunk'. Return -1 if there is no such element.
+ */
+static inline int
+RT_NODE_3_SEARCH_EQ(RT_NODE_BASE_3 *node, uint8 chunk)
+{
+ int idx = -1;
+
+ for (int i = 0; i < node->n.count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ idx = i;
+ break;
+ }
+ }
+
+ return idx;
+}
+
+/*
+ * Return index of the chunk and slot arrays for inserting into the node,
+ * such that the chunk array remains ordered.
+ */
+static inline int
+RT_NODE_3_GET_INSERTPOS(RT_NODE_BASE_3 *node, uint8 chunk)
+{
+ int idx;
+
+ for (idx = 0; idx < node->n.count; idx++)
+ {
+ if (node->chunks[idx] >= chunk)
+ break;
+ }
+
+ return idx;
+}
+
+/*
+ * Return index of the first element in the node's chunk array that equals
+ * 'chunk'. Return -1 if there is no such element.
+ */
+static inline int
+RT_NODE_32_SEARCH_EQ(RT_NODE_BASE_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ uint32 bitfield;
+ int index_simd = -1;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index = -1;
+
+ for (int i = 0; i < count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ index = i;
+ break;
+ }
+ }
+#endif
+
+#ifndef USE_NO_SIMD
+ /* replicate the search key */
+ spread_chunk = vector8_broadcast(chunk);
+
+ /* compare to all 32 keys stored in the node */
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ cmp1 = vector8_eq(spread_chunk, haystack1);
+ cmp2 = vector8_eq(spread_chunk, haystack2);
+
+ /* convert comparison to a bitfield */
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+
+ /* mask off invalid entries */
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ /* convert bitfield to index by counting trailing zeros */
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+/*
+ * Return index of the chunk and slot arrays for inserting into the node,
+ * such that the chunk array remains ordered.
+ */
+static inline int
+RT_NODE_32_GET_INSERTPOS(RT_NODE_BASE_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ Vector8 min1;
+ Vector8 min2;
+ uint32 bitfield;
+ int index_simd;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index;
+
+ for (index = 0; index < count; index++)
+ {
+ /*
+ * This is coded with '>=' to match what we can do with SIMD,
+ * with an assert to keep us honest.
+ */
+ if (node->chunks[index] >= chunk)
+ {
+ Assert(node->chunks[index] != chunk);
+ break;
+ }
+ }
+#endif
+
+#ifndef USE_NO_SIMD
+ /*
+ * This is a bit more complicated than RT_NODE_32_SEARCH_EQ(), because
+ * no unsigned uint8 comparison instruction exists, at least for SSE2. So
+ * we need to play some trickery using vector8_min() to effectively get
+ * >=. There'll never be any equal elements in current uses, but that's
+ * what we get here...
+ */
+ spread_chunk = vector8_broadcast(chunk);
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ min1 = vector8_min(spread_chunk, haystack1);
+ min2 = vector8_min(spread_chunk, haystack2);
+ cmp1 = vector8_eq(spread_chunk, min1);
+ cmp2 = vector8_eq(spread_chunk, min2);
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+ else
+ index_simd = count;
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+
+/*
+ * Functions to manipulate both chunks array and children/values array.
+ * These are used for node-3 and node-32.
+ */
+
+/* Shift the elements right at 'idx' by one */
+static inline void
+RT_CHUNK_CHILDREN_ARRAY_SHIFT(uint8 *chunks, RT_PTR_ALLOC *children, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+ memmove(&(children[idx + 1]), &(children[idx]), sizeof(RT_PTR_ALLOC) * (count - idx));
+}
+
+static inline void
+RT_CHUNK_VALUES_ARRAY_SHIFT(uint8 *chunks, RT_VALUE_TYPE *values, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+ memmove(&(values[idx + 1]), &(values[idx]), sizeof(RT_VALUE_TYPE) * (count - idx));
+}
+
+/* Delete the element at 'idx' */
+static inline void
+RT_CHUNK_CHILDREN_ARRAY_DELETE(uint8 *chunks, RT_PTR_ALLOC *children, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(children[idx]), &(children[idx + 1]), sizeof(RT_PTR_ALLOC) * (count - idx - 1));
+}
+
+static inline void
+RT_CHUNK_VALUES_ARRAY_DELETE(uint8 *chunks, RT_VALUE_TYPE *values, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(values[idx]), &(values[idx + 1]), sizeof(RT_VALUE_TYPE) * (count - idx - 1));
+}
+
+/* Copy both chunks and children/values arrays */
+static inline void
+RT_CHUNK_CHILDREN_ARRAY_COPY(uint8 *src_chunks, RT_PTR_ALLOC *src_children,
+ uint8 *dst_chunks, RT_PTR_ALLOC *dst_children)
+{
+ const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_3].fanout;
+ const Size chunk_size = sizeof(uint8) * fanout;
+ const Size children_size = sizeof(RT_PTR_ALLOC) * fanout;
+
+ memcpy(dst_chunks, src_chunks, chunk_size);
+ memcpy(dst_children, src_children, children_size);
+}
+
+static inline void
+RT_CHUNK_VALUES_ARRAY_COPY(uint8 *src_chunks, RT_VALUE_TYPE *src_values,
+ uint8 *dst_chunks, RT_VALUE_TYPE *dst_values)
+{
+ const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_3].fanout;
+ const Size chunk_size = sizeof(uint8) * fanout;
+ const Size values_size = sizeof(RT_VALUE_TYPE) * fanout;
+
+ memcpy(dst_chunks, src_chunks, chunk_size);
+ memcpy(dst_values, src_values, values_size);
+}
+
+/* Functions to manipulate inner and leaf node-125 */
+
+/* Does the given chunk in the node have a value? */
+static inline bool
+RT_NODE_125_IS_CHUNK_USED(RT_NODE_BASE_125 *node, uint8 chunk)
+{
+ return node->slot_idxs[chunk] != RT_INVALID_SLOT_IDX;
+}
+
+static inline RT_PTR_ALLOC
+RT_NODE_INNER_125_GET_CHILD(RT_NODE_INNER_125 *node, uint8 chunk)
+{
+ Assert(!RT_NODE_IS_LEAF(node));
+ return node->children[node->base.slot_idxs[chunk]];
+}
+
+static inline RT_VALUE_TYPE
+RT_NODE_LEAF_125_GET_VALUE(RT_NODE_LEAF_125 *node, uint8 chunk)
+{
+ Assert(RT_NODE_IS_LEAF(node));
+ Assert(((RT_NODE_BASE_125 *) node)->slot_idxs[chunk] != RT_INVALID_SLOT_IDX);
+ return node->values[node->base.slot_idxs[chunk]];
+}
+
+/* Functions to manipulate inner and leaf node-256 */
+
+/* Return true if the slot corresponding to the given chunk is in use */
+static inline bool
+RT_NODE_INNER_256_IS_CHUNK_USED(RT_NODE_INNER_256 *node, uint8 chunk)
+{
+ Assert(!RT_NODE_IS_LEAF(node));
+ return node->children[chunk] != RT_INVALID_PTR_ALLOC;
+}
+
+static inline bool
+RT_NODE_LEAF_256_IS_CHUNK_USED(RT_NODE_LEAF_256 *node, uint8 chunk)
+{
+ int idx = RT_BM_IDX(chunk);
+ int bitnum = RT_BM_BIT(chunk);
+
+ Assert(RT_NODE_IS_LEAF(node));
+ return (node->isset[idx] & ((bitmapword) 1 << bitnum)) != 0;
+}
+
+static inline RT_PTR_ALLOC
+RT_NODE_INNER_256_GET_CHILD(RT_NODE_INNER_256 *node, uint8 chunk)
+{
+ Assert(!RT_NODE_IS_LEAF(node));
+ Assert(RT_NODE_INNER_256_IS_CHUNK_USED(node, chunk));
+ return node->children[chunk];
+}
+
+static inline RT_VALUE_TYPE
+RT_NODE_LEAF_256_GET_VALUE(RT_NODE_LEAF_256 *node, uint8 chunk)
+{
+ Assert(RT_NODE_IS_LEAF(node));
+ Assert(RT_NODE_LEAF_256_IS_CHUNK_USED(node, chunk));
+ return node->values[chunk];
+}
+
+/* Set the child in the node-256 */
+static inline void
+RT_NODE_INNER_256_SET(RT_NODE_INNER_256 *node, uint8 chunk, RT_PTR_ALLOC child)
+{
+ Assert(!RT_NODE_IS_LEAF(node));
+ node->children[chunk] = child;
+}
+
+/* Set the value in the node-256 */
+static inline void
+RT_NODE_LEAF_256_SET(RT_NODE_LEAF_256 *node, uint8 chunk, RT_VALUE_TYPE value)
+{
+ int idx = RT_BM_IDX(chunk);
+ int bitnum = RT_BM_BIT(chunk);
+
+ Assert(RT_NODE_IS_LEAF(node));
+ node->isset[idx] |= ((bitmapword) 1 << bitnum);
+ node->values[chunk] = value;
+}
+
+/* Delete the child in the node-256 */
+static inline void
+RT_NODE_INNER_256_DELETE(RT_NODE_INNER_256 *node, uint8 chunk)
+{
+ Assert(!RT_NODE_IS_LEAF(node));
+ node->children[chunk] = RT_INVALID_PTR_ALLOC;
+}
+
+static inline void
+RT_NODE_LEAF_256_DELETE(RT_NODE_LEAF_256 *node, uint8 chunk)
+{
+ int idx = RT_BM_IDX(chunk);
+ int bitnum = RT_BM_BIT(chunk);
+
+ Assert(RT_NODE_IS_LEAF(node));
+ node->isset[idx] &= ~((bitmapword) 1 << bitnum);
+}
+
+/*
+ * Return the largest shift that allows storing the given key.
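+ * For example, with RT_NODE_SPAN = 8, key 0x10000 has its highest set bit at
+ * position 16, so the returned shift is 16.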
+ */
+static inline int
+RT_KEY_GET_SHIFT(uint64 key)
+{
+ if (key == 0)
+ return 0;
+ else
+ return (pg_leftmost_one_pos64(key) / RT_NODE_SPAN) * RT_NODE_SPAN;
+}
+
+/*
+ * Return the max value that can be stored in the tree with the given shift.
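+ * For example, with RT_NODE_SPAN = 8, a tree whose root has shift 8 can store
+ * keys up to 0xFFFF.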
+ */
+static uint64
+RT_SHIFT_GET_MAX_VAL(int shift)
+{
+ if (shift == RT_MAX_SHIFT)
+ return UINT64_MAX;
+
+ return (UINT64CONST(1) << (shift + RT_NODE_SPAN)) - 1;
+}
+
+/*
+ * Allocate a new node of the given size class.
+ */
+static RT_PTR_ALLOC
+RT_ALLOC_NODE(RT_RADIX_TREE *tree, RT_SIZE_CLASS size_class, bool is_leaf)
+{
+ RT_PTR_ALLOC allocnode;
+ size_t allocsize;
+
+ if (is_leaf)
+ allocsize = RT_SIZE_CLASS_INFO[size_class].leaf_size;
+ else
+ allocsize = RT_SIZE_CLASS_INFO[size_class].inner_size;
+
+#ifdef RT_SHMEM
+ allocnode = dsa_allocate(tree->dsa, allocsize);
+#else
+ if (is_leaf)
+ allocnode = (RT_PTR_ALLOC) MemoryContextAlloc(tree->leaf_slabs[size_class],
+ allocsize);
+ else
+ allocnode = (RT_PTR_ALLOC) MemoryContextAlloc(tree->inner_slabs[size_class],
+ allocsize);
+#endif
+
+#ifdef RT_DEBUG
+ /* update the statistics */
+ tree->ctl->cnt[size_class]++;
+#endif
+
+ return allocnode;
+}
+
+/* Initialize the node contents */
+static inline void
+RT_INIT_NODE(RT_PTR_LOCAL node, uint8 kind, RT_SIZE_CLASS size_class, bool is_leaf)
+{
+ if (is_leaf)
+ MemSet(node, 0, RT_SIZE_CLASS_INFO[size_class].leaf_size);
+ else
+ MemSet(node, 0, RT_SIZE_CLASS_INFO[size_class].inner_size);
+
+ node->kind = kind;
+
+ if (kind == RT_NODE_KIND_256)
+ /* See comment for the RT_NODE type */
+ Assert(node->fanout == 0);
+ else
+ node->fanout = RT_SIZE_CLASS_INFO[size_class].fanout;
+
+ /* Initialize slot_idxs to invalid values */
+ if (kind == RT_NODE_KIND_125)
+ {
+ RT_NODE_BASE_125 *n125 = (RT_NODE_BASE_125 *) node;
+
+ memset(n125->slot_idxs, RT_INVALID_SLOT_IDX, sizeof(n125->slot_idxs));
+ }
+}
+
+/*
+ * Create a new node as the root. Subordinate nodes will be created during
+ * the insertion.
+ */
+static void
+RT_NEW_ROOT(RT_RADIX_TREE *tree, uint64 key)
+{
+ int shift = RT_KEY_GET_SHIFT(key);
+ bool is_leaf = shift == 0;
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+
+ allocnode = RT_ALLOC_NODE(tree, RT_CLASS_3, is_leaf);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(newnode, RT_NODE_KIND_3, RT_CLASS_3, is_leaf);
+ newnode->shift = shift;
+ tree->ctl->max_val = RT_SHIFT_GET_MAX_VAL(shift);
+ tree->ctl->root = allocnode;
+}
+
+static inline void
+RT_COPY_NODE(RT_PTR_LOCAL newnode, RT_PTR_LOCAL oldnode)
+{
+ newnode->shift = oldnode->shift;
+ newnode->count = oldnode->count;
+}
+
+/*
+ * Given a newly allocated node and an old node, initialize the new
+ * node with the necessary fields and return its local pointer.
+ */
+static inline RT_PTR_LOCAL
+RT_SWITCH_NODE_KIND(RT_RADIX_TREE *tree, RT_PTR_ALLOC allocnode, RT_PTR_LOCAL node,
+ uint8 new_kind, uint8 new_class, bool is_leaf)
+{
+ RT_PTR_LOCAL newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(newnode, new_kind, new_class, is_leaf);
+ RT_COPY_NODE(newnode, node);
+
+ return newnode;
+}
+
+/* Free the given node */
+static void
+RT_FREE_NODE(RT_RADIX_TREE *tree, RT_PTR_ALLOC allocnode)
+{
+ /* If we're deleting the root node, make the tree empty */
+ if (tree->ctl->root == allocnode)
+ {
+ tree->ctl->root = RT_INVALID_PTR_ALLOC;
+ tree->ctl->max_val = 0;
+ }
+
+#ifdef RT_DEBUG
+ {
+ int i;
+ RT_PTR_LOCAL node = RT_PTR_GET_LOCAL(tree, allocnode);
+
+ /* update the statistics */
+ for (i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ if (node->fanout == RT_SIZE_CLASS_INFO[i].fanout)
+ break;
+ }
+
+ /* fanout of node256 is intentionally 0 */
+ if (i == RT_SIZE_CLASS_COUNT)
+ i = RT_CLASS_256;
+
+ tree->ctl->cnt[i]--;
+ Assert(tree->ctl->cnt[i] >= 0);
+ }
+#endif
+
+#ifdef RT_SHMEM
+ dsa_free(tree->dsa, allocnode);
+#else
+ pfree(allocnode);
+#endif
+}
+
+/* Update the parent's pointer when growing a node */
+static inline void
+RT_NODE_UPDATE_INNER(RT_PTR_LOCAL node, uint64 key, RT_PTR_ALLOC new_child)
+{
+#define RT_ACTION_UPDATE
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_INNER
+#undef RT_ACTION_UPDATE
+}
+
+/*
+ * Replace old_child with new_child, and free the old one.
+ */
+static void
+RT_REPLACE_NODE(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent,
+ RT_PTR_ALLOC stored_old_child, RT_PTR_LOCAL old_child,
+ RT_PTR_ALLOC new_child, uint64 key)
+{
+#ifdef USE_ASSERT_CHECKING
+ RT_PTR_LOCAL new = RT_PTR_GET_LOCAL(tree, new_child);
+
+ Assert(old_child->shift == new->shift);
+ Assert(old_child->count == new->count);
+#endif
+
+ if (parent == old_child)
+ {
+ /* Replace the root node with the new larger node */
+ tree->ctl->root = new_child;
+ }
+ else
+ RT_NODE_UPDATE_INNER(parent, key, new_child);
+
+ RT_FREE_NODE(tree, stored_old_child);
+}
+
+/*
+ * The radix tree doesn't have sufficient height. Extend the radix tree so
+ * it can store the key.
+ */
+static void
+RT_EXTEND(RT_RADIX_TREE *tree, uint64 key)
+{
+ int target_shift;
+ RT_PTR_LOCAL root = RT_PTR_GET_LOCAL(tree, tree->ctl->root);
+ int shift = root->shift + RT_NODE_SPAN;
+
+ target_shift = RT_KEY_GET_SHIFT(key);
+
+ /* Grow tree from 'shift' to 'target_shift' */
+ while (shift <= target_shift)
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL node;
+ RT_NODE_INNER_3 *n3;
+
+ allocnode = RT_ALLOC_NODE(tree, RT_CLASS_3, false);
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(node, RT_NODE_KIND_3, RT_CLASS_3, false);
+ node->shift = shift;
+ node->count = 1;
+
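+ /* The old root becomes the only child of the new node, at chunk 0 */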
+ n3 = (RT_NODE_INNER_3 *) node;
+ n3->base.chunks[0] = 0;
+ n3->children[0] = tree->ctl->root;
+
+ /* Update the root */
+ tree->ctl->root = allocnode;
+
+ shift += RT_NODE_SPAN;
+ }
+
+ tree->ctl->max_val = RT_SHIFT_GET_MAX_VAL(target_shift);
+}
+
+/*
+ * The radix tree doesn't yet have the inner and leaf nodes for the given
+ * key-value pair. Insert the inner and leaf nodes from 'node' down to the bottom.
+ */
+static inline void
+RT_SET_EXTEND(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p, RT_PTR_LOCAL parent,
+ RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node)
+{
+ int shift = node->shift;
+
+ Assert(RT_PTR_GET_LOCAL(tree, stored_node) == node);
+
+ while (shift >= RT_NODE_SPAN)
+ {
+ RT_PTR_ALLOC allocchild;
+ RT_PTR_LOCAL newchild;
+ int newshift = shift - RT_NODE_SPAN;
+ bool is_leaf = newshift == 0;
+
+ allocchild = RT_ALLOC_NODE(tree, RT_CLASS_3, is_leaf);
+ newchild = RT_PTR_GET_LOCAL(tree, allocchild);
+ RT_INIT_NODE(newchild, RT_NODE_KIND_3, RT_CLASS_3, is_leaf);
+ newchild->shift = newshift;
+ RT_NODE_INSERT_INNER(tree, parent, stored_node, node, key, allocchild);
+
+ parent = node;
+ node = newchild;
+ stored_node = allocchild;
+ shift -= RT_NODE_SPAN;
+ }
+
+ RT_NODE_INSERT_LEAF(tree, parent, stored_node, node, key, value_p);
+ tree->ctl->num_keys++;
+}
+
+/*
+ * Search for the child pointer corresponding to 'key' in the given node.
+ *
+ * Return true if the key is found, otherwise return false. On success, the
+ * child pointer is stored in *child_p.
+ */
+static inline bool
+RT_NODE_SEARCH_INNER(RT_PTR_LOCAL node, uint64 key, RT_PTR_ALLOC *child_p)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/*
+ * Search for the value corresponding to 'key' in the given node.
+ *
+ * Return true if the key is found, otherwise return false. On success, the
+ * value is copied into *value_p.
+ */
+static inline bool
+RT_NODE_SEARCH_LEAF(RT_PTR_LOCAL node, uint64 key, RT_VALUE_TYPE *value_p)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
+#ifdef RT_USE_DELETE
+/*
+ * Search for the child pointer corresponding to 'key' in the given node, and
+ * delete it if found.
+ *
+ * Return true if the entry was found and deleted, otherwise return false.
+ */
+static inline bool
+RT_NODE_DELETE_INNER(RT_PTR_LOCAL node, uint64 key)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_delete_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/*
+ * Search for the value corresponding to 'key' in the given node, and delete
+ * it if found.
+ *
+ * Return true if the entry was found and deleted, otherwise return false.
+ */
+static inline bool
+RT_NODE_DELETE_LEAF(RT_PTR_LOCAL node, uint64 key)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_delete_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+#endif
+
+/*
+ * Insert "child" into "node".
+ *
+ * "parent" is the parent of "node", so the grandparent of the child.
+ * If the node we're inserting into needs to grow, we update the parent's
+ * child pointer with the pointer to the new larger node.
+ */
+static bool
+RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
+ uint64 key, RT_PTR_ALLOC child)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_insert_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/* Like RT_NODE_INSERT_INNER, but for leaf nodes */
+static bool
+RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
+ uint64 key, RT_VALUE_TYPE *value_p)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_insert_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
+/*
+ * Create the radix tree in the given memory context and return it.
+ */
+RT_SCOPE RT_RADIX_TREE *
+#ifdef RT_SHMEM
+RT_CREATE(MemoryContext ctx, dsa_area *dsa, int tranche_id)
+#else
+RT_CREATE(MemoryContext ctx)
+#endif
+{
+ RT_RADIX_TREE *tree;
+ MemoryContext old_ctx;
+#ifdef RT_SHMEM
+ dsa_pointer dp;
+#endif
+
+ old_ctx = MemoryContextSwitchTo(ctx);
+
+ tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE));
+ tree->context = ctx;
+
+#ifdef RT_SHMEM
+ tree->dsa = dsa;
+ dp = dsa_allocate0(dsa, sizeof(RT_RADIX_TREE_CONTROL));
+ tree->ctl = (RT_RADIX_TREE_CONTROL *) dsa_get_address(dsa, dp);
+ tree->ctl->handle = dp;
+ tree->ctl->magic = RT_RADIX_TREE_MAGIC;
+ LWLockInitialize(&tree->ctl->lock, tranche_id);
+#else
+ tree->ctl = (RT_RADIX_TREE_CONTROL *) palloc0(sizeof(RT_RADIX_TREE_CONTROL));
+
+ /* Create a slab context for each size class */
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ RT_SIZE_CLASS_ELEM size_class = RT_SIZE_CLASS_INFO[i];
+ size_t inner_blocksize = RT_SLAB_BLOCK_SIZE(size_class.inner_size);
+ size_t leaf_blocksize = RT_SLAB_BLOCK_SIZE(size_class.leaf_size);
+
+ tree->inner_slabs[i] = SlabContextCreate(ctx,
+ size_class.name,
+ inner_blocksize,
+ size_class.inner_size);
+ tree->leaf_slabs[i] = SlabContextCreate(ctx,
+ size_class.name,
+ leaf_blocksize,
+ size_class.leaf_size);
+ }
+#endif
+
+ tree->ctl->root = RT_INVALID_PTR_ALLOC;
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return tree;
+}
+
+#ifdef RT_SHMEM
+RT_SCOPE RT_RADIX_TREE *
+RT_ATTACH(dsa_area *dsa, RT_HANDLE handle)
+{
+ RT_RADIX_TREE *tree;
+ dsa_pointer control;
+
+ tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE));
+
+ /* Find the control object in shared memory */
+ control = handle;
+
+ tree->dsa = dsa;
+ tree->ctl = (RT_RADIX_TREE_CONTROL *) dsa_get_address(dsa, control);
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+
+ return tree;
+}
+
+RT_SCOPE void
+RT_DETACH(RT_RADIX_TREE *tree)
+{
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+ pfree(tree);
+}
+
+RT_SCOPE RT_HANDLE
+RT_GET_HANDLE(RT_RADIX_TREE *tree)
+{
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+ return tree->ctl->handle;
+}
+
+/*
+ * Recursively free all nodes allocated to the DSA area.
+ */
+static inline void
+RT_FREE_RECURSE(RT_RADIX_TREE *tree, RT_PTR_ALLOC ptr)
+{
+ RT_PTR_LOCAL node = RT_PTR_GET_LOCAL(tree, ptr);
+
+ check_stack_depth();
+ CHECK_FOR_INTERRUPTS();
+
+ /* The leaf node doesn't have child pointers */
+ if (RT_NODE_IS_LEAF(node))
+ {
+ dsa_free(tree->dsa, ptr);
+ return;
+ }
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE_INNER_3 *n3 = (RT_NODE_INNER_3 *) node;
+
+ for (int i = 0; i < n3->base.n.count; i++)
+ RT_FREE_RECURSE(tree, n3->children[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE_INNER_32 *n32 = (RT_NODE_INNER_32 *) node;
+
+ for (int i = 0; i < n32->base.n.count; i++)
+ RT_FREE_RECURSE(tree, n32->children[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE_INNER_125 *n125 = (RT_NODE_INNER_125 *) node;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(&n125->base, i))
+ continue;
+
+ RT_FREE_RECURSE(tree, RT_NODE_INNER_125_GET_CHILD(n125, i));
+ }
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE_INNER_256 *n256 = (RT_NODE_INNER_256 *) node;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
+ continue;
+
+ RT_FREE_RECURSE(tree, RT_NODE_INNER_256_GET_CHILD(n256, i));
+ }
+
+ break;
+ }
+ }
+
+ /* Free the inner node */
+ dsa_free(tree->dsa, ptr);
+}
+#endif
+
+/*
+ * Free the given radix tree.
+ */
+RT_SCOPE void
+RT_FREE(RT_RADIX_TREE *tree)
+{
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+
+ /* Free all memory used for radix tree nodes */
+ if (RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ RT_FREE_RECURSE(tree, tree->ctl->root);
+
+ /*
+ * Vandalize the control block to help catch programming errors where
+ * other backends access the memory formerly occupied by this radix tree.
+ */
+ tree->ctl->magic = 0;
+ dsa_free(tree->dsa, tree->ctl->handle);
+#else
+ pfree(tree->ctl);
+
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ MemoryContextDelete(tree->inner_slabs[i]);
+ MemoryContextDelete(tree->leaf_slabs[i]);
+ }
+#endif
+
+ pfree(tree);
+}
+
+/*
+ * Set the value for the given key. If the entry already exists, update its
+ * value and return true. Otherwise insert a new entry and return false.
+ */
+RT_SCOPE bool
+RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p)
+{
+ int shift;
+ bool updated;
+ RT_PTR_LOCAL parent;
+ RT_PTR_ALLOC stored_child;
+ RT_PTR_LOCAL child;
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+
+ RT_LOCK_EXCLUSIVE(tree);
+
+ /* Empty tree, create the root */
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ RT_NEW_ROOT(tree, key);
+
+ /* Extend the tree if necessary */
+ if (key > tree->ctl->max_val)
+ RT_EXTEND(tree, key);
+
+ stored_child = tree->ctl->root;
+ parent = RT_PTR_GET_LOCAL(tree, stored_child);
+ shift = parent->shift;
+
+ /* Descend the tree until we reach a leaf node */
+ while (shift >= 0)
+ {
+ RT_PTR_ALLOC new_child = RT_INVALID_PTR_ALLOC;
+
+ child = RT_PTR_GET_LOCAL(tree, stored_child);
+
+ if (RT_NODE_IS_LEAF(child))
+ break;
+
+ if (!RT_NODE_SEARCH_INNER(child, key, &new_child))
+ {
+ RT_SET_EXTEND(tree, key, value_p, parent, stored_child, child);
+ RT_UNLOCK(tree);
+ return false;
+ }
+
+ parent = child;
+ stored_child = new_child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ updated = RT_NODE_INSERT_LEAF(tree, parent, stored_child, child, key, value_p);
+
+ /* Update the statistics */
+ if (!updated)
+ tree->ctl->num_keys++;
+
+ RT_UNLOCK(tree);
+ return updated;
+}
+
+/*
+ * Search for the given key in the radix tree. Return true if the key is found,
+ * otherwise return false. On success, the value is copied into *value_p, so
+ * value_p must not be NULL.
+ */
+RT_SCOPE bool
+RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p)
+{
+ RT_PTR_LOCAL node;
+ int shift;
+ bool found;
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+ Assert(value_p != NULL);
+
+ RT_LOCK_SHARED(tree);
+
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root) || key > tree->ctl->max_val)
+ {
+ RT_UNLOCK(tree);
+ return false;
+ }
+
+ node = RT_PTR_GET_LOCAL(tree, tree->ctl->root);
+ shift = node->shift;
+
+ /* Descend the tree until a leaf node */
+ while (shift >= 0)
+ {
+ RT_PTR_ALLOC child = RT_INVALID_PTR_ALLOC;
+
+ if (RT_NODE_IS_LEAF(node))
+ break;
+
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ {
+ RT_UNLOCK(tree);
+ return false;
+ }
+
+ node = RT_PTR_GET_LOCAL(tree, child);
+ shift -= RT_NODE_SPAN;
+ }
+
+ found = RT_NODE_SEARCH_LEAF(node, key, value_p);
+
+ RT_UNLOCK(tree);
+ return found;
+}
+
+#ifdef RT_USE_DELETE
+/*
+ * Delete the given key from the radix tree. Return true if the key is found (and
+ * deleted), otherwise do nothing and return false.
+ */
+RT_SCOPE bool
+RT_DELETE(RT_RADIX_TREE *tree, uint64 key)
+{
+ RT_PTR_LOCAL node;
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_ALLOC stack[RT_MAX_LEVEL] = {0};
+ int shift;
+ int level;
+ bool deleted;
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+
+ RT_LOCK_EXCLUSIVE(tree);
+
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root) || key > tree->ctl->max_val)
+ {
+ RT_UNLOCK(tree);
+ return false;
+ }
+
+ /*
+ * Descend the tree searching for the key, while building a stack of the
+ * nodes we visited.
+ */
+ allocnode = tree->ctl->root;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ shift = node->shift;
+ level = -1;
+ while (shift > 0)
+ {
+ RT_PTR_ALLOC child = RT_INVALID_PTR_ALLOC;
+
+ /* Push the current node to the stack */
+ stack[++level] = allocnode;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ {
+ RT_UNLOCK(tree);
+ return false;
+ }
+
+ allocnode = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ /* Delete the key from the leaf node if it exists */
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ deleted = RT_NODE_DELETE_LEAF(node, key);
+
+ if (!deleted)
+ {
+ /* no key is found in the leaf node */
+ RT_UNLOCK(tree);
+ return false;
+ }
+
+ /* Found the key to delete. Update the statistics */
+ tree->ctl->num_keys--;
+
+ /*
+ * If the leaf node still has keys, we don't need to delete the node, so
+ * we're done.
+ */
+ if (node->count > 0)
+ {
+ RT_UNLOCK(tree);
+ return true;
+ }
+
+ /* Free the empty leaf node */
+ RT_FREE_NODE(tree, allocnode);
+
+ /* Delete the key from the inner nodes, walking back up the stack of visited nodes */
+ while (level >= 0)
+ {
+ allocnode = stack[level--];
+
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ deleted = RT_NODE_DELETE_INNER(node, key);
+ Assert(deleted);
+
+ /* If the node didn't become empty, we stop deleting the key */
+ if (node->count > 0)
+ break;
+
+ /* The node became empty */
+ RT_FREE_NODE(tree, allocnode);
+ }
+
+ RT_UNLOCK(tree);
+ return true;
+}
+#endif
+
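+/* Replace the chunk of the iterator's key at the given shift position */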
+static inline void
+RT_ITER_UPDATE_KEY(RT_ITER *iter, uint8 chunk, uint8 shift)
+{
+ iter->key &= ~(((uint64) RT_CHUNK_MASK) << shift);
+ iter->key |= (((uint64) chunk) << shift);
+}
+
+/*
+ * Advance the slot in the inner node. Return the child if one exists,
+ * otherwise NULL.
+ */
+static inline RT_PTR_LOCAL
+RT_NODE_INNER_ITERATE_NEXT(RT_ITER *iter, RT_NODE_ITER *node_iter)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_iter_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/*
+ * Advance the slot in the leaf node. On success, return true and set the
+ * value to *value_p; otherwise return false.
+ */
+static inline bool
+RT_NODE_LEAF_ITERATE_NEXT(RT_ITER *iter, RT_NODE_ITER *node_iter,
+ RT_VALUE_TYPE *value_p)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_iter_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
+/*
+ * Update each node_iter for inner nodes in the iterator node stack.
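+ *
+ * Starting from 'from_node' at stack level 'from', descend to the leftmost
+ * leaf, initializing the node iterator at each level along the way.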
+ */
+static void
+RT_UPDATE_ITER_STACK(RT_ITER *iter, RT_PTR_LOCAL from_node, int from)
+{
+ int level = from;
+ RT_PTR_LOCAL node = from_node;
+
+ for (;;)
+ {
+ RT_NODE_ITER *node_iter = &(iter->stack[level--]);
+
+ node_iter->node = node;
+ node_iter->current_idx = -1;
+
+ /* We don't advance the leaf node iterator here */
+ if (RT_NODE_IS_LEAF(node))
+ return;
+
+ /* Advance to the next slot in the inner node */
+ node = RT_NODE_INNER_ITERATE_NEXT(iter, node_iter);
+
+ /* We must be able to find the first child in the inner node */
+ Assert(node);
+ }
+}
+
+/*
+ * Create and return the iterator for the given radix tree.
+ *
+ * The radix tree is locked in shared mode during the iteration, so
+ * RT_END_ITERATE needs to be called when finished to release the lock.
+ */
+RT_SCOPE RT_ITER *
+RT_BEGIN_ITERATE(RT_RADIX_TREE *tree)
+{
+ MemoryContext old_ctx;
+ RT_ITER *iter;
+ RT_PTR_LOCAL root;
+ int top_level;
+
+ old_ctx = MemoryContextSwitchTo(tree->context);
+
+ iter = (RT_ITER *) palloc0(sizeof(RT_ITER));
+ iter->tree = tree;
+
+ RT_LOCK_SHARED(tree);
+
+ /* empty tree */
+ if (!RT_PTR_ALLOC_IS_VALID(iter->tree->ctl->root))
+ {
+ MemoryContextSwitchTo(old_ctx);
+ return iter;
+ }
+
+ root = RT_PTR_GET_LOCAL(tree, iter->tree->ctl->root);
+ top_level = root->shift / RT_NODE_SPAN;
+ iter->stack_len = top_level;
+
+ /*
+ * Descend to the leftmost leaf node from the root. The key is constructed
+ * while descending to the leaf.
+ */
+ RT_UPDATE_ITER_STACK(iter, root, top_level);
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return iter;
+}
+
+/*
+ * Return true and set *key_p and *value_p if there is a next key. Otherwise
+ * return false.
+ */
+RT_SCOPE bool
+RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, RT_VALUE_TYPE *value_p)
+{
+ /* Empty tree */
+ if (!RT_PTR_ALLOC_IS_VALID(iter->tree->ctl->root))
+ return false;
+
+ for (;;)
+ {
+ RT_PTR_LOCAL child = NULL;
+ RT_VALUE_TYPE value;
+ int level;
+ bool found;
+
+ /* Advance the leaf node iterator to get next key-value pair */
+ found = RT_NODE_LEAF_ITERATE_NEXT(iter, &(iter->stack[0]), &value);
+
+ if (found)
+ {
+ *key_p = iter->key;
+ *value_p = value;
+ return true;
+ }
+
+ /*
+ * We've visited all values in the leaf node, so advance the inner node
+ * iterators, starting from level 1, until we find the next child node.
+ */
+ for (level = 1; level <= iter->stack_len; level++)
+ {
+ child = RT_NODE_INNER_ITERATE_NEXT(iter, &(iter->stack[level]));
+
+ if (child)
+ break;
+ }
+
+ /* the iteration finished */
+ if (!child)
+ return false;
+
+ /*
+ * Set the node to the node iterator and update the iterator stack
+ * from this node.
+ */
+ RT_UPDATE_ITER_STACK(iter, child, level - 1);
+
+ /* Node iterators are updated, so try again from the leaf */
+ }
+
+ return false;
+}
+
+/*
+ * Terminate the iteration and release the lock.
+ *
+ * This function needs to be called when the iteration is finished, or when
+ * bailing out of it early, so that the lock is released.
+ */
+RT_SCOPE void
+RT_END_ITERATE(RT_ITER *iter)
+{
+#ifdef RT_SHMEM
+ Assert(LWLockHeldByMe(&iter->tree->ctl->lock));
+#endif
+
+ RT_UNLOCK(iter->tree);
+ pfree(iter);
+}
+
+/*
+ * Return the amount of memory used by the radix tree.
+ */
+RT_SCOPE uint64
+RT_MEMORY_USAGE(RT_RADIX_TREE *tree)
+{
+ Size total = 0;
+
+ RT_LOCK_SHARED(tree);
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+ total = dsa_get_total_size(tree->dsa);
+#else
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
+ total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
+ }
+#endif
+
+ RT_UNLOCK(tree);
+ return total;
+}
+
+/*
+ * Verify the radix tree node.
+ */
+static void
+RT_VERIFY_NODE(RT_PTR_LOCAL node)
+{
+#ifdef USE_ASSERT_CHECKING
+ Assert(node->count >= 0);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE_BASE_3 *n3 = (RT_NODE_BASE_3 *) node;
+
+ for (int i = 1; i < n3->n.count; i++)
+ Assert(n3->chunks[i - 1] < n3->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE_BASE_32 *n32 = (RT_NODE_BASE_32 *) node;
+
+ for (int i = 1; i < n32->n.count; i++)
+ Assert(n32->chunks[i - 1] < n32->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE_BASE_125 *n125 = (RT_NODE_BASE_125 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ uint8 slot = n125->slot_idxs[i];
+ int idx = RT_BM_IDX(slot);
+ int bitnum = RT_BM_BIT(slot);
+
+ if (!RT_NODE_125_IS_CHUNK_USED(n125, i))
+ continue;
+
+ /* Check if the corresponding slot is used */
+ Assert(slot < node->fanout);
+ Assert((n125->isset[idx] & ((bitmapword) 1 << bitnum)) != 0);
+
+ cnt++;
+ }
+
+ Assert(n125->n.count == cnt);
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_256 *n256 = (RT_NODE_LEAF_256 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RT_BM_IDX(RT_NODE_MAX_SLOTS); i++)
+ cnt += bmw_popcount(n256->isset[i]);
+
+ /* Check if the number of used chunks matches */
+ Assert(n256->base.n.count == cnt);
+
+ break;
+ }
+ }
+ }
+#endif
+}
+
+/***************** DEBUG FUNCTIONS *****************/
+#ifdef RT_DEBUG
+
+#define RT_UINT64_FORMAT_HEX "%" INT64_MODIFIER "X"
+
+RT_SCOPE void
+RT_STATS(RT_RADIX_TREE *tree)
+{
+ RT_LOCK_SHARED(tree);
+
+ fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->ctl->max_val);
+ fprintf(stderr, "num_keys = " UINT64_FORMAT "\n", tree->ctl->num_keys);
+
+#ifdef RT_SHMEM
+ fprintf(stderr, "handle = " UINT64_FORMAT "\n", tree->ctl->handle);
+#endif
+
+ if (RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ {
+ RT_PTR_LOCAL root = RT_PTR_GET_LOCAL(tree, tree->ctl->root);
+
+ fprintf(stderr, "height = %d, n3 = %u, n15 = %u, n32 = %u, n125 = %u, n256 = %u\n",
+ root->shift / RT_NODE_SPAN,
+ tree->ctl->cnt[RT_CLASS_3],
+ tree->ctl->cnt[RT_CLASS_32_MIN],
+ tree->ctl->cnt[RT_CLASS_32_MAX],
+ tree->ctl->cnt[RT_CLASS_125],
+ tree->ctl->cnt[RT_CLASS_256]);
+ }
+
+ RT_UNLOCK(tree);
+}
+
+static void
+RT_DUMP_NODE(RT_RADIX_TREE *tree, RT_PTR_ALLOC allocnode, int level,
+ bool recurse, StringInfo buf)
+{
+ RT_PTR_LOCAL node = RT_PTR_GET_LOCAL(tree, allocnode);
+ StringInfoData spaces;
+
+ initStringInfo(&spaces);
+ appendStringInfoSpaces(&spaces, (level * 4) + 1);
+
+ appendStringInfo(buf, "%s%s[%s] kind %d, fanout %d, count %u, shift %u:\n",
+ spaces.data,
+ level == 0 ? "" : "-> ",
+ RT_NODE_IS_LEAF(node) ? "LEAF" : "INNR",
+ (node->kind == RT_NODE_KIND_3) ? 3 :
+ (node->kind == RT_NODE_KIND_32) ? 32 :
+ (node->kind == RT_NODE_KIND_125) ? 125 : 256,
+ node->fanout == 0 ? 256 : node->fanout,
+ node->count, node->shift);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_3 *n3 = (RT_NODE_LEAF_3 *) node;
+
+ appendStringInfo(buf, "%schunk[%d] 0x%X\n",
+ spaces.data, i, n3->base.chunks[i]);
+ }
+ else
+ {
+ RT_NODE_INNER_3 *n3 = (RT_NODE_INNER_3 *) node;
+
+ appendStringInfo(buf, "%schunk[%d] 0x%X",
+ spaces.data, i, n3->base.chunks[i]);
+
+ if (recurse)
+ {
+ appendStringInfo(buf, "\n");
+ RT_DUMP_NODE(tree, n3->children[i], level + 1,
+ recurse, buf);
+ }
+ else
+ appendStringInfo(buf, " (skipped)\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_32 *n32 = (RT_NODE_LEAF_32 *) node;
+
+ appendStringInfo(buf, "%schunk[%d] 0x%X\n",
+ spaces.data, i, n32->base.chunks[i]);
+ }
+ else
+ {
+ RT_NODE_INNER_32 *n32 = (RT_NODE_INNER_32 *) node;
+
+ appendStringInfo(buf, "%schunk[%d] 0x%X",
+ spaces.data, i, n32->base.chunks[i]);
+
+ if (recurse)
+ {
+ appendStringInfo(buf, "\n");
+ RT_DUMP_NODE(tree, n32->children[i], level + 1,
+ recurse, buf);
+ }
+ else
+ appendStringInfo(buf, " (skipped)\n");
+
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE_BASE_125 *b125 = (RT_NODE_BASE_125 *) node;
+ char *sep = "";
+
+ appendStringInfo(buf, "%sslot_idxs: ", spaces.data);
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(b125, i))
+ continue;
+
+ appendStringInfo(buf, "%s[%d]=%d ",
+ sep, i, b125->slot_idxs[i]);
+ sep = ",";
+ }
+
+ appendStringInfo(buf, "\n%sisset-bitmap: ", spaces.data);
+ for (int i = 0; i < (RT_SLOT_IDX_LIMIT / BITS_PER_BYTE); i++)
+ appendStringInfo(buf, "%X ", ((uint8 *) b125->isset)[i]);
+ appendStringInfo(buf, "\n");
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(b125, i))
+ continue;
+
+ if (RT_NODE_IS_LEAF(node))
+ appendStringInfo(buf, "%schunk 0x%X\n",
+ spaces.data, i);
+ else
+ {
+ RT_NODE_INNER_125 *n125 = (RT_NODE_INNER_125 *) b125;
+
+ appendStringInfo(buf, "%schunk 0x%X",
+ spaces.data, i);
+
+ if (recurse)
+ {
+ appendStringInfo(buf, "\n");
+ RT_DUMP_NODE(tree, RT_NODE_INNER_125_GET_CHILD(n125, i),
+ level + 1, recurse, buf);
+ }
+ else
+ appendStringInfo(buf, " (skipped)\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_256 *n256 = (RT_NODE_LEAF_256 *) node;
+
+ appendStringInfo(buf, "%sisset-bitmap: ", spaces.data);
+ for (int i = 0; i < (RT_SLOT_IDX_LIMIT / BITS_PER_BYTE); i++)
+ appendStringInfo(buf, "%X ", ((uint8 *) n256->isset)[i]);
+ appendStringInfo(buf, "\n");
+ }
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_256 *n256 = (RT_NODE_LEAF_256 *) node;
+
+ if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, i))
+ continue;
+
+ appendStringInfo(buf, "%schunk 0x%X\n",
+ spaces.data, i);
+ }
+ else
+ {
+ RT_NODE_INNER_256 *n256 = (RT_NODE_INNER_256 *) node;
+
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
+ continue;
+
+ appendStringInfo(buf, "%schunk 0x%X",
+ spaces.data, i);
+
+ if (recurse)
+ {
+ appendStringInfo(buf, "\n");
+ RT_DUMP_NODE(tree, RT_NODE_INNER_256_GET_CHILD(n256, i),
+ level + 1, recurse, buf);
+ }
+ else
+ appendStringInfo(buf, " (skipped)\n");
+ }
+ }
+ break;
+ }
+ }
+}
+
+RT_SCOPE void
+RT_DUMP_SEARCH(RT_RADIX_TREE *tree, uint64 key)
+{
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL node;
+ StringInfoData buf;
+ int shift;
+ int level = 0;
+
+ RT_STATS(tree);
+
+ RT_LOCK_SHARED(tree);
+
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ {
+ RT_UNLOCK(tree);
+ fprintf(stderr, "empty tree\n");
+ return;
+ }
+
+ if (key > tree->ctl->max_val)
+ {
+ RT_UNLOCK(tree);
+ fprintf(stderr, "key " UINT64_FORMAT "(0x" RT_UINT64_FORMAT_HEX ") is larger than max val\n",
+ key, key);
+ return;
+ }
+
+ initStringInfo(&buf);
+ allocnode = tree->ctl->root;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ shift = node->shift;
+ while (shift >= 0)
+ {
+ RT_PTR_ALLOC child;
+
+ RT_DUMP_NODE(tree, allocnode, level, false, &buf);
+
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_VALUE_TYPE dummy;
+
+ /* We reached a leaf node; find the corresponding slot */
+ RT_NODE_SEARCH_LEAF(node, key, &dummy);
+
+ break;
+ }
+
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ break;
+
+ allocnode = child;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ shift -= RT_NODE_SPAN;
+ level++;
+ }
+ RT_UNLOCK(tree);
+
+ fprintf(stderr, "%s", buf.data);
+}
+
+RT_SCOPE void
+RT_DUMP(RT_RADIX_TREE *tree)
+{
+ StringInfoData buf;
+
+ RT_STATS(tree);
+
+ RT_LOCK_SHARED(tree);
+
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ {
+ RT_UNLOCK(tree);
+ fprintf(stderr, "empty tree\n");
+ return;
+ }
+
+ initStringInfo(&buf);
+
+ RT_DUMP_NODE(tree, tree->ctl->root, 0, true, &buf);
+ RT_UNLOCK(tree);
+
+ fprintf(stderr, "%s",buf.data);
+}
+#endif
+
+#endif /* RT_DEFINE */
+
+
+/* undefine external parameters, so next radix tree can be defined */
+#undef RT_PREFIX
+#undef RT_SCOPE
+#undef RT_DECLARE
+#undef RT_DEFINE
+#undef RT_VALUE_TYPE
+
+/* locally declared macros */
+#undef RT_MAKE_PREFIX
+#undef RT_MAKE_NAME
+#undef RT_MAKE_NAME_
+#undef RT_NODE_SPAN
+#undef RT_NODE_MAX_SLOTS
+#undef RT_CHUNK_MASK
+#undef RT_MAX_SHIFT
+#undef RT_MAX_LEVEL
+#undef RT_GET_KEY_CHUNK
+#undef RT_BM_IDX
+#undef RT_BM_BIT
+#undef RT_LOCK_EXCLUSIVE
+#undef RT_LOCK_SHARED
+#undef RT_UNLOCK
+#undef RT_NODE_IS_LEAF
+#undef RT_NODE_MUST_GROW
+#undef RT_NODE_KIND_COUNT
+#undef RT_SIZE_CLASS_COUNT
+#undef RT_SLOT_IDX_LIMIT
+#undef RT_INVALID_SLOT_IDX
+#undef RT_SLAB_BLOCK_SIZE
+#undef RT_RADIX_TREE_MAGIC
+#undef RT_UINT64_FORMAT_HEX
+
+/* type declarations */
+#undef RT_RADIX_TREE
+#undef RT_RADIX_TREE_CONTROL
+#undef RT_PTR_LOCAL
+#undef RT_PTR_ALLOC
+#undef RT_INVALID_PTR_ALLOC
+#undef RT_HANDLE
+#undef RT_ITER
+#undef RT_NODE
+#undef RT_NODE_ITER
+#undef RT_NODE_KIND_3
+#undef RT_NODE_KIND_32
+#undef RT_NODE_KIND_125
+#undef RT_NODE_KIND_256
+#undef RT_NODE_BASE_3
+#undef RT_NODE_BASE_32
+#undef RT_NODE_BASE_125
+#undef RT_NODE_BASE_256
+#undef RT_NODE_INNER_3
+#undef RT_NODE_INNER_32
+#undef RT_NODE_INNER_125
+#undef RT_NODE_INNER_256
+#undef RT_NODE_LEAF_3
+#undef RT_NODE_LEAF_32
+#undef RT_NODE_LEAF_125
+#undef RT_NODE_LEAF_256
+#undef RT_SIZE_CLASS
+#undef RT_SIZE_CLASS_ELEM
+#undef RT_SIZE_CLASS_INFO
+#undef RT_CLASS_3
+#undef RT_CLASS_32_MIN
+#undef RT_CLASS_32_MAX
+#undef RT_CLASS_125
+#undef RT_CLASS_256
+
+/* function declarations */
+#undef RT_CREATE
+#undef RT_FREE
+#undef RT_ATTACH
+#undef RT_DETACH
+#undef RT_GET_HANDLE
+#undef RT_SEARCH
+#undef RT_SET
+#undef RT_BEGIN_ITERATE
+#undef RT_ITERATE_NEXT
+#undef RT_END_ITERATE
+#undef RT_USE_DELETE
+#undef RT_DELETE
+#undef RT_MEMORY_USAGE
+#undef RT_DUMP
+#undef RT_DUMP_NODE
+#undef RT_DUMP_SEARCH
+#undef RT_STATS
+
+/* internal helper functions */
+#undef RT_NEW_ROOT
+#undef RT_ALLOC_NODE
+#undef RT_INIT_NODE
+#undef RT_FREE_NODE
+#undef RT_FREE_RECURSE
+#undef RT_EXTEND
+#undef RT_SET_EXTEND
+#undef RT_SWITCH_NODE_KIND
+#undef RT_COPY_NODE
+#undef RT_REPLACE_NODE
+#undef RT_PTR_GET_LOCAL
+#undef RT_PTR_ALLOC_IS_VALID
+#undef RT_NODE_3_SEARCH_EQ
+#undef RT_NODE_32_SEARCH_EQ
+#undef RT_NODE_3_GET_INSERTPOS
+#undef RT_NODE_32_GET_INSERTPOS
+#undef RT_CHUNK_CHILDREN_ARRAY_SHIFT
+#undef RT_CHUNK_VALUES_ARRAY_SHIFT
+#undef RT_CHUNK_CHILDREN_ARRAY_DELETE
+#undef RT_CHUNK_VALUES_ARRAY_DELETE
+#undef RT_CHUNK_CHILDREN_ARRAY_COPY
+#undef RT_CHUNK_VALUES_ARRAY_COPY
+#undef RT_NODE_125_IS_CHUNK_USED
+#undef RT_NODE_INNER_125_GET_CHILD
+#undef RT_NODE_LEAF_125_GET_VALUE
+#undef RT_NODE_INNER_256_IS_CHUNK_USED
+#undef RT_NODE_LEAF_256_IS_CHUNK_USED
+#undef RT_NODE_INNER_256_GET_CHILD
+#undef RT_NODE_LEAF_256_GET_VALUE
+#undef RT_NODE_INNER_256_SET
+#undef RT_NODE_LEAF_256_SET
+#undef RT_NODE_INNER_256_DELETE
+#undef RT_NODE_LEAF_256_DELETE
+#undef RT_KEY_GET_SHIFT
+#undef RT_SHIFT_GET_MAX_VAL
+#undef RT_NODE_SEARCH_INNER
+#undef RT_NODE_SEARCH_LEAF
+#undef RT_NODE_UPDATE_INNER
+#undef RT_NODE_DELETE_INNER
+#undef RT_NODE_DELETE_LEAF
+#undef RT_NODE_INSERT_INNER
+#undef RT_NODE_INSERT_LEAF
+#undef RT_NODE_INNER_ITERATE_NEXT
+#undef RT_NODE_LEAF_ITERATE_NEXT
+#undef RT_UPDATE_ITER_STACK
+#undef RT_ITER_UPDATE_KEY
+#undef RT_VERIFY_NODE
+
+#undef RT_DEBUG
diff --git a/src/include/lib/radixtree_delete_impl.h b/src/include/lib/radixtree_delete_impl.h
new file mode 100644
index 0000000000..5f6dda1f12
--- /dev/null
+++ b/src/include/lib/radixtree_delete_impl.h
@@ -0,0 +1,122 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree_delete_impl.h
+ * Common implementation for deletion in leaf and inner nodes.
+ *
+ * Note: There is deliberately no #include guard here
+ *
+ * TODO: Shrink nodes when deletion would allow them to fit in a smaller
+ * size class.
+ *
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * src/include/lib/radixtree_delete_impl.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE3_TYPE RT_NODE_INNER_3
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE3_TYPE RT_NODE_LEAF_3
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+
+#ifdef RT_NODE_LEVEL_LEAF
+ Assert(RT_NODE_IS_LEAF(node));
+#else
+ Assert(!RT_NODE_IS_LEAF(node));
+#endif
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node;
+ int idx = RT_NODE_3_SEARCH_EQ((RT_NODE_BASE_3 *) n3, chunk);
+
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_DELETE(n3->base.chunks, n3->values,
+ n3->base.n.count, idx);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_DELETE(n3->base.chunks, n3->children,
+ n3->base.n.count, idx);
+#endif
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
+ int idx = RT_NODE_32_SEARCH_EQ((RT_NODE_BASE_32 *) n32, chunk);
+
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_DELETE(n32->base.chunks, n32->values,
+ n32->base.n.count, idx);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_DELETE(n32->base.chunks, n32->children,
+ n32->base.n.count, idx);
+#endif
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+ int slotpos = n125->base.slot_idxs[chunk];
+ int idx;
+ int bitnum;
+
+ if (slotpos == RT_INVALID_SLOT_IDX)
+ return false;
+
+ idx = RT_BM_IDX(slotpos);
+ bitnum = RT_BM_BIT(slotpos);
+ n125->base.isset[idx] &= ~((bitmapword) 1 << bitnum);
+ n125->base.slot_idxs[chunk] = RT_INVALID_SLOT_IDX;
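+ /* The slot's array entry itself is not cleared; clearing the isset bit frees it for reuse */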
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk))
+#else
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, chunk))
+#endif
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_NODE_LEAF_256_DELETE(n256, chunk);
+#else
+ RT_NODE_INNER_256_DELETE(n256, chunk);
+#endif
+ break;
+ }
+ }
+
+ /* update statistics */
+ node->count--;
+
+ return true;
+
+#undef RT_NODE3_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_insert_impl.h b/src/include/lib/radixtree_insert_impl.h
new file mode 100644
index 0000000000..c18e26b537
--- /dev/null
+++ b/src/include/lib/radixtree_insert_impl.h
@@ -0,0 +1,332 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree_insert_impl.h
+ * Common implementation for insertion in leaf and inner nodes.
+ *
+ * Note: There is deliberately no #include guard here
+ *
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * src/include/lib/radixtree_insert_impl.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE3_TYPE RT_NODE_INNER_3
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE3_TYPE RT_NODE_LEAF_3
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool chunk_exists = false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ const bool is_leaf = true;
+ Assert(RT_NODE_IS_LEAF(node));
+#else
+ const bool is_leaf = false;
+ Assert(!RT_NODE_IS_LEAF(node));
+#endif
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node;
+ int idx;
+
+ idx = RT_NODE_3_SEARCH_EQ(&n3->base, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+#ifdef RT_NODE_LEVEL_LEAF
+ n3->values[idx] = *value_p;
+#else
+ n3->children[idx] = child;
+#endif
+ break;
+ }
+
+ if (unlikely(RT_NODE_MUST_GROW(n3)))
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+ RT_NODE32_TYPE *new32;
+ const uint8 new_kind = RT_NODE_KIND_32;
+ const RT_SIZE_CLASS new_class = RT_CLASS_32_MIN;
+
+ /* grow node from 3 to 32 */
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
+ newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, is_leaf);
+ new32 = (RT_NODE32_TYPE *) newnode;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_COPY(n3->base.chunks, n3->values,
+ new32->base.chunks, new32->values);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_COPY(n3->base.chunks, n3->children,
+ new32->base.chunks, new32->children);
+#endif
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
+ node = newnode;
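+ /* The node has grown into a node-32; fall through to the node-32 case to insert into it */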
+ }
+ else
+ {
+ int insertpos = RT_NODE_3_GET_INSERTPOS(&n3->base, chunk);
+ int count = n3->base.n.count;
+
+ /* shift chunks and children/values to make room at insertpos */
+ if (insertpos < count)
+ {
+ Assert(count > 0);
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_SHIFT(n3->base.chunks, n3->values,
+ count, insertpos);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_SHIFT(n3->base.chunks, n3->children,
+ count, insertpos);
+#endif
+ }
+
+ n3->base.chunks[insertpos] = chunk;
+#ifdef RT_NODE_LEVEL_LEAF
+ n3->values[insertpos] = *value_p;
+#else
+ n3->children[insertpos] = child;
+#endif
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_32:
+ {
+ const RT_SIZE_CLASS_ELEM class32_max = RT_SIZE_CLASS_INFO[RT_CLASS_32_MAX];
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
+ int idx;
+
+ idx = RT_NODE_32_SEARCH_EQ(&n32->base, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+#ifdef RT_NODE_LEVEL_LEAF
+ n32->values[idx] = *value_p;
+#else
+ n32->children[idx] = child;
+#endif
+ break;
+ }
+
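+ /* If the node is full but still in the smaller size class, grow within the node-32 kind first */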
+ if (unlikely(RT_NODE_MUST_GROW(n32)) &&
+ n32->base.n.fanout < class32_max.fanout)
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+ const RT_SIZE_CLASS_ELEM class32_min = RT_SIZE_CLASS_INFO[RT_CLASS_32_MIN];
+ const RT_SIZE_CLASS new_class = RT_CLASS_32_MAX;
+
+ Assert(n32->base.n.fanout == class32_min.fanout);
+
+ /* grow to the next size class of this kind */
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ n32 = (RT_NODE32_TYPE *) newnode;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ memcpy(newnode, node, class32_min.leaf_size);
+#else
+ memcpy(newnode, node, class32_min.inner_size);
+#endif
+ newnode->fanout = class32_max.fanout;
+
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
+ node = newnode;
+ }
+
+ if (unlikely(RT_NODE_MUST_GROW(n32)))
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+ RT_NODE125_TYPE *new125;
+ const uint8 new_kind = RT_NODE_KIND_125;
+ const RT_SIZE_CLASS new_class = RT_CLASS_125;
+
+ Assert(n32->base.n.fanout == class32_max.fanout);
+
+ /* grow node from 32 to 125 */
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
+ newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, is_leaf);
+ new125 = (RT_NODE125_TYPE *) newnode;
+
+ for (int i = 0; i < class32_max.fanout; i++)
+ {
+ new125->base.slot_idxs[n32->base.chunks[i]] = i;
+#ifdef RT_NODE_LEVEL_LEAF
+ new125->values[i] = n32->values[i];
+#else
+ new125->children[i] = n32->children[i];
+#endif
+ }
+
+ /*
+ * Since we just copied a dense array, we can set the bits
+ * using a single store, provided the length of that array
+ * is at most the number of bits in a bitmapword.
+ */
+ Assert(class32_max.fanout <= sizeof(bitmapword) * BITS_PER_BYTE);
+ new125->base.isset[0] = (bitmapword) (((uint64) 1 << class32_max.fanout) - 1);
+
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
+ node = newnode;
+ }
+ else
+ {
+ int insertpos = RT_NODE_32_GET_INSERTPOS(&n32->base, chunk);
+ int count = n32->base.n.count;
+
+ if (insertpos < count)
+ {
+ Assert(count > 0);
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_SHIFT(n32->base.chunks, n32->values,
+ count, insertpos);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_SHIFT(n32->base.chunks, n32->children,
+ count, insertpos);
+#endif
+ }
+
+ n32->base.chunks[insertpos] = chunk;
+#ifdef RT_NODE_LEVEL_LEAF
+ n32->values[insertpos] = *value_p;
+#else
+ n32->children[insertpos] = child;
+#endif
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+ int slotpos = n125->base.slot_idxs[chunk];
+ int cnt = 0;
+
+ if (slotpos != RT_INVALID_SLOT_IDX)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+#ifdef RT_NODE_LEVEL_LEAF
+ n125->values[slotpos] = *value_p;
+#else
+ n125->children[slotpos] = child;
+#endif
+ break;
+ }
+
+ if (unlikely(RT_NODE_MUST_GROW(n125)))
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+ RT_NODE256_TYPE *new256;
+ const uint8 new_kind = RT_NODE_KIND_256;
+ const RT_SIZE_CLASS new_class = RT_CLASS_256;
+
+ /* grow node from 125 to 256 */
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
+ newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, is_leaf);
+ new256 = (RT_NODE256_TYPE *) newnode;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n125->base.n.count; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(&n125->base, i))
+ continue;
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_NODE_LEAF_256_SET(new256, i, RT_NODE_LEAF_125_GET_VALUE(n125, i));
+#else
+ RT_NODE_INNER_256_SET(new256, i, RT_NODE_INNER_125_GET_CHILD(n125, i));
+#endif
+ cnt++;
+ }
+
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
+ node = newnode;
+ }
+ else
+ {
+ int idx;
+ bitmapword inverse;
+
+ /* get the first word with at least one bit not set */
+ for (idx = 0; idx < RT_BM_IDX(RT_SLOT_IDX_LIMIT); idx++)
+ {
+ if (n125->base.isset[idx] < ~((bitmapword) 0))
+ break;
+ }
+
+ /* To get the first unset bit in X, get the first set bit in ~X */
+ inverse = ~(n125->base.isset[idx]);
+ slotpos = idx * BITS_PER_BITMAPWORD;
+ slotpos += bmw_rightmost_one_pos(inverse);
+ Assert(slotpos < node->fanout);
+
+ /* mark the slot used */
+ n125->base.isset[idx] |= bmw_rightmost_one(inverse);
+ n125->base.slot_idxs[chunk] = slotpos;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ n125->values[slotpos] = *value_p;
+#else
+ n125->children[slotpos] = child;
+#endif
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ chunk_exists = RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk);
+#else
+ chunk_exists = RT_NODE_INNER_256_IS_CHUNK_USED(n256, chunk);
+#endif
+ Assert(chunk_exists || node->count < RT_NODE_MAX_SLOTS);
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_NODE_LEAF_256_SET(n256, chunk, *value_p);
+#else
+ RT_NODE_INNER_256_SET(n256, chunk, child);
+#endif
+ break;
+ }
+ }
+
+ /* Update statistics */
+ if (!chunk_exists)
+ node->count++;
+
+ /*
+ * Done. Finally, verify that the chunk and value were inserted or replaced
+ * properly in the node.
+ */
+ RT_VERIFY_NODE(node);
+
+ return chunk_exists;
+
+#undef RT_NODE3_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_iter_impl.h b/src/include/lib/radixtree_iter_impl.h
new file mode 100644
index 0000000000..98c78eb237
--- /dev/null
+++ b/src/include/lib/radixtree_iter_impl.h
@@ -0,0 +1,153 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree_iter_impl.h
+ * Common implementation for iteration in leaf and inner nodes.
+ *
+ * Note: There is deliberately no #include guard here
+ *
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * src/include/lib/radixtree_iter_impl.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE3_TYPE RT_NODE_INNER_3
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE3_TYPE RT_NODE_LEAF_3
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ bool found = false;
+ uint8 key_chunk;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_VALUE_TYPE value;
+
+ Assert(RT_NODE_IS_LEAF(node_iter->node));
+#else
+ RT_PTR_LOCAL child = NULL;
+
+ Assert(!RT_NODE_IS_LEAF(node_iter->node));
+#endif
+
+#ifdef RT_SHMEM
+ Assert(iter->tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+
+ switch (node_iter->node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n3->base.n.count)
+ break;
+#ifdef RT_NODE_LEVEL_LEAF
+ value = n3->values[node_iter->current_idx];
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, n3->children[node_iter->current_idx]);
+#endif
+ key_chunk = n3->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n32->base.n.count)
+ break;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ value = n32->values[node_iter->current_idx];
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, n32->children[node_iter->current_idx]);
+#endif
+ key_chunk = n32->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node_iter->node;
+ int i;
+
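+ /* Find the next chunk in use, starting just after the previous position */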
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (RT_NODE_125_IS_CHUNK_USED((RT_NODE_BASE_125 *) n125, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+#ifdef RT_NODE_LEVEL_LEAF
+ value = RT_NODE_LEAF_125_GET_VALUE(n125, i);
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, RT_NODE_INNER_125_GET_CHILD(n125, i));
+#endif
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+#ifdef RT_NODE_LEVEL_LEAF
+ if (RT_NODE_LEAF_256_IS_CHUNK_USED(n256, i))
+#else
+ if (RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
+#endif
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+#ifdef RT_NODE_LEVEL_LEAF
+ value = RT_NODE_LEAF_256_GET_VALUE(n256, i);
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, RT_NODE_INNER_256_GET_CHILD(n256, i));
+#endif
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ }
+
+ if (found)
+ {
+ RT_ITER_UPDATE_KEY(iter, key_chunk, node_iter->node->shift);
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = value;
+#endif
+ }
+
+#ifdef RT_NODE_LEVEL_LEAF
+ return found;
+#else
+ return child;
+#endif
+
+#undef RT_NODE3_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_search_impl.h b/src/include/lib/radixtree_search_impl.h
new file mode 100644
index 0000000000..a8925c75d0
--- /dev/null
+++ b/src/include/lib/radixtree_search_impl.h
@@ -0,0 +1,138 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree_search_impl.h
+ * Common implementation for search in leaf and inner nodes, plus
+ * update for inner nodes only.
+ *
+ * Note: There is deliberately no #include guard here
+ *
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * src/include/lib/radixtree_search_impl.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE3_TYPE RT_NODE_INNER_3
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE3_TYPE RT_NODE_LEAF_3
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+
+#ifdef RT_NODE_LEVEL_LEAF
+ Assert(value_p != NULL);
+ Assert(RT_NODE_IS_LEAF(node));
+#else
+#ifndef RT_ACTION_UPDATE
+ Assert(child_p != NULL);
+#endif
+ Assert(!RT_NODE_IS_LEAF(node));
+#endif
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node;
+ int idx = RT_NODE_3_SEARCH_EQ((RT_NODE_BASE_3 *) n3, chunk);
+
+#ifdef RT_ACTION_UPDATE
+ Assert(idx >= 0);
+ n3->children[idx] = new_child;
+#else
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = n3->values[idx];
+#else
+ *child_p = n3->children[idx];
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
+ int idx = RT_NODE_32_SEARCH_EQ((RT_NODE_BASE_32 *) n32, chunk);
+
+#ifdef RT_ACTION_UPDATE
+ Assert(idx >= 0);
+ n32->children[idx] = new_child;
+#else
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = n32->values[idx];
+#else
+ *child_p = n32->children[idx];
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+ int slotpos = n125->base.slot_idxs[chunk];
+
+#ifdef RT_ACTION_UPDATE
+ Assert(slotpos != RT_INVALID_SLOT_IDX);
+ n125->children[slotpos] = new_child;
+#else
+ if (slotpos == RT_INVALID_SLOT_IDX)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = RT_NODE_LEAF_125_GET_VALUE(n125, chunk);
+#else
+ *child_p = RT_NODE_INNER_125_GET_CHILD(n125, chunk);
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
+
+#ifdef RT_ACTION_UPDATE
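+ /* The update action is only used for inner nodes, so only the inner setter is needed here */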
+ RT_NODE_INNER_256_SET(n256, chunk, new_child);
+#else
+#ifdef RT_NODE_LEVEL_LEAF
+ if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk))
+#else
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, chunk))
+#endif
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = RT_NODE_LEAF_256_GET_VALUE(n256, chunk);
+#else
+ *child_p = RT_NODE_INNER_256_GET_CHILD(n256, chunk);
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ }
+
+#ifdef RT_ACTION_UPDATE
+ return;
+#else
+ return true;
+#endif /* RT_ACTION_UPDATE */
+
+#undef RT_NODE3_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/utils/dsa.h b/src/include/utils/dsa.h
index 3ce4ee300a..2af215484f 100644
--- a/src/include/utils/dsa.h
+++ b/src/include/utils/dsa.h
@@ -121,6 +121,7 @@ extern dsa_handle dsa_get_handle(dsa_area *area);
extern dsa_pointer dsa_allocate_extended(dsa_area *area, size_t size, int flags);
extern void dsa_free(dsa_area *area, dsa_pointer dp);
extern void *dsa_get_address(dsa_area *area, dsa_pointer dp);
+extern size_t dsa_get_total_size(dsa_area *area);
extern void dsa_trim(dsa_area *area);
extern void dsa_dump(dsa_area *area);
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index c629cbe383..9659eb85d7 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -28,6 +28,7 @@ SUBDIRS = \
test_pg_db_role_setting \
test_pg_dump \
test_predtest \
+ test_radixtree \
test_rbtree \
test_regex \
test_rls_hooks \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 1baa6b558d..232cbdac80 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -24,6 +24,7 @@ subdir('test_parser')
subdir('test_pg_db_role_setting')
subdir('test_pg_dump')
subdir('test_predtest')
+subdir('test_radixtree')
subdir('test_rbtree')
subdir('test_regex')
subdir('test_rls_hooks')
diff --git a/src/test/modules/test_radixtree/.gitignore b/src/test/modules/test_radixtree/.gitignore
new file mode 100644
index 0000000000..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/test_radixtree/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_radixtree/Makefile b/src/test/modules/test_radixtree/Makefile
new file mode 100644
index 0000000000..da06b93da3
--- /dev/null
+++ b/src/test/modules/test_radixtree/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_radixtree/Makefile
+
+MODULE_big = test_radixtree
+OBJS = \
+ $(WIN32RES) \
+ test_radixtree.o
+PGFILEDESC = "test_radixtree - test code for src/backend/lib/radixtree.c"
+
+EXTENSION = test_radixtree
+DATA = test_radixtree--1.0.sql
+
+REGRESS = test_radixtree
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_radixtree
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_radixtree/README b/src/test/modules/test_radixtree/README
new file mode 100644
index 0000000000..a8b271869a
--- /dev/null
+++ b/src/test/modules/test_radixtree/README
@@ -0,0 +1,7 @@
+test_radixtree contains unit tests for testing the radix tree implementation
+in src/include/lib/radixtree.h.
+
+The tests verify the correctness of the implementation, but they can also be
+used as a micro-benchmark. If you set the 'rt_test_stats' flag in
+test_radixtree.c, the tests will print extra information about execution time
+and memory usage.
diff --git a/src/test/modules/test_radixtree/expected/test_radixtree.out b/src/test/modules/test_radixtree/expected/test_radixtree.out
new file mode 100644
index 0000000000..ce645cb8b5
--- /dev/null
+++ b/src/test/modules/test_radixtree/expected/test_radixtree.out
@@ -0,0 +1,36 @@
+CREATE EXTENSION test_radixtree;
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
+NOTICE: testing basic operations with leaf node 4
+NOTICE: testing basic operations with inner node 4
+NOTICE: testing basic operations with leaf node 32
+NOTICE: testing basic operations with inner node 32
+NOTICE: testing basic operations with leaf node 125
+NOTICE: testing basic operations with inner node 125
+NOTICE: testing basic operations with leaf node 256
+NOTICE: testing basic operations with inner node 256
+NOTICE: testing radix tree node types with shift "0"
+NOTICE: testing radix tree node types with shift "8"
+NOTICE: testing radix tree node types with shift "16"
+NOTICE: testing radix tree node types with shift "24"
+NOTICE: testing radix tree node types with shift "32"
+NOTICE: testing radix tree node types with shift "40"
+NOTICE: testing radix tree node types with shift "48"
+NOTICE: testing radix tree node types with shift "56"
+NOTICE: testing radix tree with pattern "all ones"
+NOTICE: testing radix tree with pattern "alternating bits"
+NOTICE: testing radix tree with pattern "clusters of ten"
+NOTICE: testing radix tree with pattern "clusters of hundred"
+NOTICE: testing radix tree with pattern "one-every-64k"
+NOTICE: testing radix tree with pattern "sparse"
+NOTICE: testing radix tree with pattern "single values, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^60"
+ test_radixtree
+----------------
+
+(1 row)
+
diff --git a/src/test/modules/test_radixtree/meson.build b/src/test/modules/test_radixtree/meson.build
new file mode 100644
index 0000000000..6add06bbdb
--- /dev/null
+++ b/src/test/modules/test_radixtree/meson.build
@@ -0,0 +1,35 @@
+# FIXME: prevent install during main install, but not during test :/
+
+test_radixtree_sources = files(
+ 'test_radixtree.c',
+)
+
+if host_system == 'windows'
+ test_radixtree_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'test_radixtree',
'--FILEDESC', 'test_radixtree - test code for src/include/lib/radixtree.h',])
+endif
+
+test_radixtree = shared_module('test_radixtree',
+ test_radixtree_sources,
+ link_with: pgport_srv,
+ kwargs: pg_mod_args,
+)
+testprep_targets += test_radixtree
+
+install_data(
+ 'test_radixtree.control',
+ 'test_radixtree--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'test_radixtree',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'test_radixtree',
+ ],
+ },
+}
diff --git a/src/test/modules/test_radixtree/sql/test_radixtree.sql b/src/test/modules/test_radixtree/sql/test_radixtree.sql
new file mode 100644
index 0000000000..41ece5e9f5
--- /dev/null
+++ b/src/test/modules/test_radixtree/sql/test_radixtree.sql
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_radixtree;
+
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
diff --git a/src/test/modules/test_radixtree/test_radixtree--1.0.sql b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
new file mode 100644
index 0000000000..074a5a7ea7
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_radixtree/test_radixtree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_radixtree" to load this file. \quit
+
+CREATE FUNCTION test_radixtree()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
new file mode 100644
index 0000000000..f944945db9
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -0,0 +1,674 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_radixtree.c
+ * Test radixtree set data structure.
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/test/modules/test_radixtree/test_radixtree.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "miscadmin.h"
+#include "nodes/bitmapset.h"
+#include "storage/block.h"
+#include "storage/itemptr.h"
+#include "storage/lwlock.h"
+#include "utils/memutils.h"
+#include "utils/timestamp.h"
+
+#define UINT64_HEX_FORMAT "%" INT64_MODIFIER "X"
+
+/*
+ * The tests pass with uint32, but build with warnings because the string
+ * format expects uint64.
+ */
+typedef uint64 TestValueType;
+
+/*
+ * If you enable this, the "pattern" tests will print information about
+ * how long populating, probing, and iterating the test set takes, and
+ * how much memory the test set consumed. That can be used as
+ * micro-benchmark of various operations and input patterns (you might
+ * want to increase the number of values used in each of the test, if
+ * you do that, to reduce noise).
+ *
+ * The information is printed to the server's stderr, mostly because
+ * that's where MemoryContextStats() output goes.
+ */
+static const bool rt_test_stats = false;
+
+static int rt_node_kind_fanouts[] = {
+ 0,
+ 4, /* RT_NODE_KIND_4 */
+ 32, /* RT_NODE_KIND_32 */
+ 125, /* RT_NODE_KIND_125 */
+ 256 /* RT_NODE_KIND_256 */
+};
+/*
+ * A struct to define a pattern of integers, for use with the test_pattern()
+ * function.
+ */
+typedef struct
+{
+ char *test_name; /* short name of the test, for humans */
+ char *pattern_str; /* a bit pattern */
+ uint64 spacing; /* pattern repeats at this interval */
+ uint64 num_values; /* number of integers to set in total */
+} test_spec;
+
+/* Test patterns borrowed from test_integerset.c */
+static const test_spec test_specs[] = {
+ {
+ "all ones", "1111111111",
+ 10, 1000000
+ },
+ {
+ "alternating bits", "0101010101",
+ 10, 1000000
+ },
+ {
+ "clusters of ten", "1111111111",
+ 10000, 1000000
+ },
+ {
+ "clusters of hundred",
+ "1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111",
+ 10000, 1000000
+ },
+ {
+ "one-every-64k", "1",
+ 65536, 1000000
+ },
+ {
+ "sparse", "100000000000000000000000000000001",
+ 10000000, 1000000
+ },
+ {
+ "single values, distance > 2^32", "1",
+ UINT64CONST(10000000000), 100000
+ },
+ {
+ "clusters, distance > 2^32", "10101010",
+ UINT64CONST(10000000000), 1000000
+ },
+ {
+ "clusters, distance > 2^60", "10101010",
+ UINT64CONST(2000000000000000000),
+ 23 /* can't be much higher than this, or we
+ * overflow uint64 */
+ }
+};
+
+/* define the radix tree implementation to test */
+#define RT_PREFIX rt
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#define RT_USE_DELETE
+#define RT_VALUE_TYPE TestValueType
+// WIP: compiles with warnings because rt_attach is defined but not used
+// #define RT_SHMEM
+#include "lib/radixtree.h"
+
+
+/*
+ * Return the number of keys in the radix tree.
+ */
+static uint64
+rt_num_entries(rt_radix_tree *tree)
+{
+ return tree->ctl->num_keys;
+}
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(test_radixtree);
+
+static void
+test_empty(void)
+{
+ rt_radix_tree *radixtree;
+ rt_iter *iter;
+ TestValueType dummy;
+ uint64 key;
+ TestValueType val;
+
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+ dsa = dsa_create(tranche_id);
+
+ radixtree = rt_create(CurrentMemoryContext, dsa, tranche_id);
+#else
+ radixtree = rt_create(CurrentMemoryContext);
+#endif
+
+ if (rt_search(radixtree, 0, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, 1, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, PG_UINT64_MAX, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_delete(radixtree, 0))
+ elog(ERROR, "rt_delete on empty tree returned true");
+
+ if (rt_num_entries(radixtree) != 0)
+ elog(ERROR, "rt_num_entries on empty tree return non-zero");
+
+ iter = rt_begin_iterate(radixtree);
+
+ if (rt_iterate_next(iter, &key, &val))
+ elog(ERROR, "rt_itereate_next on empty tree returned true");
+
+ rt_end_iterate(iter);
+
+ rt_free(radixtree);
+
+#ifdef RT_SHMEM
+ dsa_detach(dsa);
+#endif
+}
+
+static void
+test_basic(int children, bool test_inner)
+{
+ rt_radix_tree *radixtree;
+ uint64 *keys;
+ int shift = test_inner ? 8 : 0;
+
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+ dsa = dsa_create(tranche_id);
+#endif
+
+ elog(NOTICE, "testing basic operations with %s node %d",
+ test_inner ? "inner" : "leaf", children);
+
+#ifdef RT_SHMEM
+ radixtree = rt_create(CurrentMemoryContext, dsa, tranche_id);
+#else
+ radixtree = rt_create(CurrentMemoryContext);
+#endif
+
+ /* prepare keys in alternating order like 1, children, 2, children - 1, 3, ... */
+ keys = palloc(sizeof(uint64) * children);
+ for (int i = 0; i < children; i++)
+ {
+ if (i % 2 == 0)
+ keys[i] = (uint64) ((i / 2) + 1) << shift;
+ else
+ keys[i] = (uint64) (children - (i / 2)) << shift;
+ }
+
+ /* insert keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (rt_set(radixtree, keys[i], (TestValueType*) &keys[i]))
+ elog(ERROR, "new inserted key 0x" UINT64_HEX_FORMAT " is found ", keys[i]);
+ }
+
+ /* look up keys */
+ for (int i = 0; i < children; i++)
+ {
+ TestValueType value;
+
+ if (!rt_search(radixtree, keys[i], &value))
+ elog(ERROR, "could not find key 0x" UINT64_HEX_FORMAT, keys[i]);
+ if (value != (TestValueType) keys[i])
+ elog(ERROR, "rt_search returned 0x" UINT64_HEX_FORMAT ", expected " UINT64_HEX_FORMAT,
+ value, (TestValueType) keys[i]);
+ }
+
+ /* update keys */
+ for (int i = 0; i < children; i++)
+ {
+ TestValueType update = keys[i] + 1;
+ if (!rt_set(radixtree, keys[i], (TestValueType*) &update))
+ elog(ERROR, "could not update key 0x" UINT64_HEX_FORMAT, keys[i]);
+ }
+
+ /* repeat deleting and inserting keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (!rt_delete(radixtree, keys[i]))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, keys[i]);
+ if (rt_set(radixtree, keys[i], (TestValueType*) &keys[i]))
+ elog(ERROR, "new inserted key 0x" UINT64_HEX_FORMAT " is found ", keys[i]);
+ }
+
+ pfree(keys);
+ rt_free(radixtree);
+#ifdef RT_SHMEM
+ dsa_detach(dsa);
+#endif
+}
+
+/*
+ * Check if keys from start to end with the shift exist in the tree.
+ */
+static void
+check_search_on_node(rt_radix_tree *radixtree, uint8 shift, int start, int end,
+ int incr)
+{
+ for (int i = start; i < end; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ TestValueType val;
+
+ if (!rt_search(radixtree, key, &val))
+ elog(ERROR, "key 0x" UINT64_HEX_FORMAT " is not found on node-%d",
+ key, end);
+ if (val != (TestValueType) key)
+ elog(ERROR, "rt_search with key 0x" UINT64_HEX_FORMAT " returns 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ key, val, key);
+ }
+}
+
+static void
+test_node_types_insert(rt_radix_tree *radixtree, uint8 shift, bool insert_asc)
+{
+ uint64 num_entries;
+ int ninserted = 0;
+ int start = insert_asc ? 0 : 256;
+ int incr = insert_asc ? 1 : -1;
+ int end = insert_asc ? 256 : 0;
+ int node_kind_idx = 1;
+
+ for (int i = start; i != end; i += incr)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_set(radixtree, key, (TestValueType*) &key);
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " is found", key);
+
+ /*
+ * After filling all slots in each node type, check if the values
+ * are stored properly.
+ */
+ if (ninserted == rt_node_kind_fanouts[node_kind_idx] - 1)
+ {
+ int check_start = insert_asc
+ ? rt_node_kind_fanouts[node_kind_idx - 1]
+ : rt_node_kind_fanouts[node_kind_idx];
+ int check_end = insert_asc
+ ? rt_node_kind_fanouts[node_kind_idx]
+ : rt_node_kind_fanouts[node_kind_idx - 1];
+
+ check_search_on_node(radixtree, shift, check_start, check_end, incr);
+ node_kind_idx++;
+ }
+
+ ninserted++;
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ if (num_entries != 256)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(256));
+}
+
+static void
+test_node_types_delete(rt_radix_tree *radixtree, uint8 shift)
+{
+ uint64 num_entries;
+
+ for (int i = 0; i < 256; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_delete(radixtree, key);
+
+ if (!found)
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, key);
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ /* The tree must be empty */
+ if (num_entries != 0)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(0));
+}
+
+/*
+ * Test for inserting and deleting key-value pairs to each node type at the given shift
+ * level.
+ */
+static void
+test_node_types(uint8 shift)
+{
+ rt_radix_tree *radixtree;
+
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+ dsa = dsa_create(tranche_id);
+#endif
+
+ elog(NOTICE, "testing radix tree node types with shift \"%d\"", shift);
+
+#ifdef RT_SHMEM
+ radixtree = rt_create(CurrentMemoryContext, dsa, tranche_id);
+#else
+ radixtree = rt_create(CurrentMemoryContext);
+#endif
+
+ /*
+ * Insert and search entries for every node type at the 'shift' level,
+ * then delete all entries to make it empty, and insert and search entries
+ * again.
+ */
+ test_node_types_insert(radixtree, shift, true);
+ test_node_types_delete(radixtree, shift);
+ test_node_types_insert(radixtree, shift, false);
+
+ rt_free(radixtree);
+#ifdef RT_SHMEM
+ dsa_detach(dsa);
+#endif
+}
+
+/*
+ * Test with a repeating pattern, defined by the 'spec'.
+ */
+static void
+test_pattern(const test_spec * spec)
+{
+ rt_radix_tree *radixtree;
+ rt_iter *iter;
+ MemoryContext radixtree_ctx;
+ TimestampTz starttime;
+ TimestampTz endtime;
+ uint64 n;
+ uint64 last_int;
+ uint64 ndeleted;
+ uint64 nbefore;
+ uint64 nafter;
+ int patternlen;
+ uint64 *pattern_values;
+ uint64 pattern_num_values;
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+ dsa = dsa_create(tranche_id);
+#endif
+
+ elog(NOTICE, "testing radix tree with pattern \"%s\"", spec->test_name);
+ if (rt_test_stats)
+ fprintf(stderr, "-----\ntesting radix tree with pattern \"%s\"\n", spec->test_name);
+
+ /* Pre-process the pattern, creating an array of integers from it. */
+ patternlen = strlen(spec->pattern_str);
+ pattern_values = palloc(patternlen * sizeof(uint64));
+ pattern_num_values = 0;
+ for (int i = 0; i < patternlen; i++)
+ {
+ if (spec->pattern_str[i] == '1')
+ pattern_values[pattern_num_values++] = i;
+ }
+
+ /*
+ * Allocate the radix tree.
+ *
+ * Allocate it in a separate memory context, so that we can print its
+ * memory usage easily.
+ */
+ radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
+ "radixtree test",
+ ALLOCSET_SMALL_SIZES);
+ MemoryContextSetIdentifier(radixtree_ctx, spec->test_name);
+
+#ifdef RT_SHMEM
+ radixtree = rt_create(radixtree_ctx, dsa, tranche_id);
+#else
+ radixtree = rt_create(radixtree_ctx);
+#endif
+
+
+ /*
+ * Add values to the set.
+ */
+ starttime = GetCurrentTimestamp();
+
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ uint64 x = 0;
+
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ bool found;
+
+ x = last_int + pattern_values[i];
+
+ found = rt_set(radixtree, x, (TestValueType*) &x);
+
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " found", x);
+
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+
+ endtime = GetCurrentTimestamp();
+
+ if (rt_test_stats)
+ fprintf(stderr, "added " UINT64_FORMAT " values in %d ms\n",
+ spec->num_values, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Print stats on the amount of memory used.
+ *
+ * We print the usage reported by rt_memory_usage(), as well as the stats
+ * from the memory context. They should be in the same ballpark, but it's
+ * hard to automate testing that, so if you're making changes to the
+ * implementation, just observe that manually.
+ */
+ if (rt_test_stats)
+ {
+ uint64 mem_usage;
+
+ /*
+ * Also print memory usage as reported by rt_memory_usage(). It
+ * should be in the same ballpark as the usage reported by
+ * MemoryContextStats().
+ */
+ mem_usage = rt_memory_usage(radixtree);
+ fprintf(stderr, "rt_memory_usage() reported " UINT64_FORMAT " (%0.2f bytes / integer)\n",
+ mem_usage, (double) mem_usage / spec->num_values);
+
+ MemoryContextStats(radixtree_ctx);
+ }
+
+ /* Check that rt_num_entries works */
+ n = rt_num_entries(radixtree);
+ if (n != spec->num_values)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT, n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_search()
+ */
+ starttime = GetCurrentTimestamp();
+
+ for (n = 0; n < 100000; n++)
+ {
+ bool found;
+ bool expected;
+ uint64 x;
+ TestValueType v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Do we expect this value to be present in the set? */
+ if (x >= last_int)
+ expected = false;
+ else
+ {
+ uint64 idx = x % spec->spacing;
+
+ if (idx >= patternlen)
+ expected = false;
+ else if (spec->pattern_str[idx] == '1')
+ expected = true;
+ else
+ expected = false;
+ }
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (found != expected)
+ elog(ERROR, "mismatch at 0x" UINT64_HEX_FORMAT ": %d vs %d", x, found, expected);
+ if (found && (v != (TestValueType) x))
+ elog(ERROR, "found 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ v, x);
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "probed " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Test iterator
+ */
+ starttime = GetCurrentTimestamp();
+
+ iter = rt_begin_iterate(radixtree);
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ uint64 expected = last_int + pattern_values[i];
+ uint64 x;
+ TestValueType val;
+
+ if (!rt_iterate_next(iter, &x, &val))
+ break;
+
+ if (x != expected)
+ elog(ERROR,
+ "iterate returned wrong key; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d",
+ x, expected, i);
+ if (val != (TestValueType) expected)
+ elog(ERROR,
+ "iterate returned wrong value; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d", x, expected, i);
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "iterated " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ rt_end_iterate(iter);
+
+ if (n < spec->num_values)
+ elog(ERROR, "iterator stopped short after " UINT64_FORMAT " entries, expected " UINT64_FORMAT, n, spec->num_values);
+ if (n > spec->num_values)
+ elog(ERROR, "iterator returned " UINT64_FORMAT " entries, " UINT64_FORMAT " was expected", n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_delete()
+ */
+ starttime = GetCurrentTimestamp();
+
+ nbefore = rt_num_entries(radixtree);
+ ndeleted = 0;
+ for (n = 0; n < 1; n++)
+ {
+ bool found;
+ uint64 x;
+ TestValueType v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (!found)
+ continue;
+
+ /* If the key is found, delete it and check again */
+ if (!rt_delete(radixtree, x))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_search(radixtree, x, &v))
+ elog(ERROR, "found deleted key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_delete(radixtree, x))
+ elog(ERROR, "deleted already-deleted key 0x" UINT64_HEX_FORMAT, x);
+
+ ndeleted++;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "deleted " UINT64_FORMAT " values in %d ms\n",
+ ndeleted, (int) (endtime - starttime) / 1000);
+
+ nafter = rt_num_entries(radixtree);
+
+ /* Check that rt_num_entries works */
+ if ((nbefore - ndeleted) != nafter)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT "after " UINT64_FORMAT " deletion",
+ nafter, (nbefore - ndeleted), ndeleted);
+
+ rt_free(radixtree);
+ MemoryContextDelete(radixtree_ctx);
+#ifdef RT_SHMEM
+ dsa_detach(dsa);
+#endif
+}
+
+Datum
+test_radixtree(PG_FUNCTION_ARGS)
+{
+ test_empty();
+
+ for (int i = 1; i < lengthof(rt_node_kind_fanouts); i++)
+ {
+ test_basic(rt_node_kind_fanouts[i], false);
+ test_basic(rt_node_kind_fanouts[i], true);
+ }
+
+ for (int shift = 0; shift <= (64 - 8); shift += 8)
+ test_node_types(shift);
+
+ /* Test different test patterns, with lots of entries */
+ for (int i = 0; i < lengthof(test_specs); i++)
+ test_pattern(&test_specs[i]);
+
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_radixtree/test_radixtree.control b/src/test/modules/test_radixtree/test_radixtree.control
new file mode 100644
index 0000000000..e53f2a3e0c
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.control
@@ -0,0 +1,4 @@
+comment = 'Test code for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/test_radixtree'
+relocatable = true
diff --git a/src/tools/pginclude/cpluspluscheck b/src/tools/pginclude/cpluspluscheck
index e52fe9f509..09fa6e7432 100755
--- a/src/tools/pginclude/cpluspluscheck
+++ b/src/tools/pginclude/cpluspluscheck
@@ -101,6 +101,12 @@ do
test "$f" = src/include/nodes/nodetags.h && continue
test "$f" = src/backend/nodes/nodetags.h && continue
+ # radixtree_*_impl.h cannot be included standalone: they are just code fragments.
+ test "$f" = src/include/lib/radixtree_delete_impl.h && continue
+ test "$f" = src/include/lib/radixtree_insert_impl.h && continue
+ test "$f" = src/include/lib/radixtree_iter_impl.h && continue
+ test "$f" = src/include/lib/radixtree_search_impl.h && continue
+
# These files are not meant to be included standalone, because
# they contain lists that might have multiple use-cases.
test "$f" = src/include/access/rmgrlist.h && continue
diff --git a/src/tools/pginclude/headerscheck b/src/tools/pginclude/headerscheck
index abbba7aa63..d4d2f1da03 100755
--- a/src/tools/pginclude/headerscheck
+++ b/src/tools/pginclude/headerscheck
@@ -96,6 +96,12 @@ do
test "$f" = src/include/nodes/nodetags.h && continue
test "$f" = src/backend/nodes/nodetags.h && continue
+ # radixtree_*_impl.h cannot be included standalone: they are just code fragments.
+ test "$f" = src/include/lib/radixtree_delete_impl.h && continue
+ test "$f" = src/include/lib/radixtree_insert_impl.h && continue
+ test "$f" = src/include/lib/radixtree_iter_impl.h && continue
+ test "$f" = src/include/lib/radixtree_search_impl.h && continue
+
# These files are not meant to be included standalone, because
# they contain lists that might have multiple use-cases.
test "$f" = src/include/access/rmgrlist.h && continue
--
2.39.1
Attachment: v26-0006-Adjust-some-inlining-declarations.patch (text/x-patch)
From 0726bb6b4e0250a72ce399d945d250724b4a29ab Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Mon, 6 Feb 2023 21:04:14 +0700
Subject: [PATCH v26 6/9] Adjust some inlining declarations
---
src/include/lib/radixtree.h | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index d6919aef08..4bd0aaa810 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -1124,7 +1124,7 @@ RT_INIT_NODE(RT_PTR_LOCAL node, uint8 kind, RT_SIZE_CLASS size_class, bool is_le
* Create a new node as the root. Subordinate nodes will be created during
* the insertion.
*/
-static void
+static pg_noinline void
RT_NEW_ROOT(RT_RADIX_TREE *tree, uint64 key)
{
int shift = RT_KEY_GET_SHIFT(key);
@@ -1215,7 +1215,7 @@ RT_NODE_UPDATE_INNER(RT_PTR_LOCAL node, uint64 key, RT_PTR_ALLOC new_child)
/*
* Replace old_child with new_child, and free the old one.
*/
-static void
+static inline void
RT_REPLACE_NODE(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent,
RT_PTR_ALLOC stored_old_child, RT_PTR_LOCAL old_child,
RT_PTR_ALLOC new_child, uint64 key)
@@ -1242,7 +1242,7 @@ RT_REPLACE_NODE(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent,
* The radix tree doesn't have sufficient height. Extend the radix tree so
* it can store the key.
*/
-static void
+static pg_noinline void
RT_EXTEND(RT_RADIX_TREE *tree, uint64 key)
{
int target_shift;
@@ -1281,7 +1281,7 @@ RT_EXTEND(RT_RADIX_TREE *tree, uint64 key)
* The radix tree doesn't have inner and leaf nodes for given key-value pair.
* Insert inner and leaf nodes from 'node' to bottom.
*/
-static inline void
+static pg_noinline void
RT_SET_EXTEND(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p, RT_PTR_LOCAL parent,
RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node)
{
@@ -1486,7 +1486,7 @@ RT_GET_HANDLE(RT_RADIX_TREE *tree)
/*
* Recursively free all nodes allocated to the DSA area.
*/
-static inline void
+static void
RT_FREE_RECURSE(RT_RADIX_TREE *tree, RT_PTR_ALLOC ptr)
{
RT_PTR_LOCAL node = RT_PTR_GET_LOCAL(tree, ptr);
--
2.39.1
Attachment: v26-0007-Skip-unnecessary-searches-in-RT_NODE_INSERT_INNE.patch (text/x-patch)
From 6831fe27a2c9c5765113b7903403c426f09f55f6 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Mon, 6 Feb 2023 22:04:50 +0700
Subject: [PATCH v26 7/9] Skip unnecessary searches in RT_NODE_INSERT_INNER
For inner nodes, we know the key chunk doesn't exist already,
otherwise we would have found it while descending the tree.
To reinforce this fact, declare this function to return void.
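To illustrate the invariant with a toy, self-contained sketch (not the patch's
code; the node layout and names below are made up): the inner-node insert is
only reached after the chunk search failed during the descent, so it can
assert absence instead of searching again.

#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_FANOUT 3

typedef struct ToyInner
{
	int			count;
	uint8_t		chunks[TOY_FANOUT];
	void	   *children[TOY_FANOUT];
} ToyInner;

/* Return the child for 'chunk', or NULL if the chunk is not present. */
static void *
toy_search_inner(ToyInner *node, uint8_t chunk)
{
	for (int i = 0; i < node->count; i++)
	{
		if (node->chunks[i] == chunk)
			return node->children[i];
	}
	return NULL;
}

/*
 * Insert a child for 'chunk'.  The caller only gets here after
 * toy_search_inner() returned NULL while descending, so we assert the
 * chunk is absent instead of searching for it again.
 */
static void
toy_insert_inner(ToyInner *node, uint8_t chunk, void *child)
{
	assert(toy_search_inner(node, chunk) == NULL);
	assert(node->count < TOY_FANOUT);
	node->chunks[node->count] = chunk;
	node->children[node->count] = child;
	node->count++;
}

int
main(void)
{
	ToyInner	node = {0};
	int			dummy_child;

	/* During descent: search first; only insert when the chunk is absent. */
	if (toy_search_inner(&node, 42) == NULL)
		toy_insert_inner(&node, 42, &dummy_child);
	return 0;
}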
---
src/include/lib/radixtree.h | 4 +--
src/include/lib/radixtree_insert_impl.h | 48 ++++++++++++-------------
2 files changed, 24 insertions(+), 28 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 4bd0aaa810..1cdb995e54 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -685,7 +685,7 @@ typedef struct RT_ITER
} RT_ITER;
-static bool RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
+static void RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
uint64 key, RT_PTR_ALLOC child);
static bool RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
uint64 key, RT_VALUE_TYPE *value_p);
@@ -1375,7 +1375,7 @@ RT_NODE_DELETE_LEAF(RT_PTR_LOCAL node, uint64 key)
* If the node we're inserting into needs to grow, we update the parent's
* child pointer with the pointer to the new larger node.
*/
-static bool
+static void
RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
uint64 key, RT_PTR_ALLOC child)
{
diff --git a/src/include/lib/radixtree_insert_impl.h b/src/include/lib/radixtree_insert_impl.h
index c18e26b537..d56e58dcac 100644
--- a/src/include/lib/radixtree_insert_impl.h
+++ b/src/include/lib/radixtree_insert_impl.h
@@ -28,10 +28,10 @@
#endif
uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
- bool chunk_exists = false;
#ifdef RT_NODE_LEVEL_LEAF
const bool is_leaf = true;
+ bool chunk_exists = false;
Assert(RT_NODE_IS_LEAF(node));
#else
const bool is_leaf = false;
@@ -43,21 +43,18 @@
case RT_NODE_KIND_3:
{
RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node;
- int idx;
- idx = RT_NODE_3_SEARCH_EQ(&n3->base, chunk);
+#ifdef RT_NODE_LEVEL_LEAF
+ int idx = RT_NODE_3_SEARCH_EQ(&n3->base, chunk);
+
if (idx != -1)
{
/* found the existing chunk */
chunk_exists = true;
-#ifdef RT_NODE_LEVEL_LEAF
n3->values[idx] = *value_p;
-#else
- n3->children[idx] = child;
-#endif
break;
}
-
+#endif
if (unlikely(RT_NODE_MUST_GROW(n3)))
{
RT_PTR_ALLOC allocnode;
@@ -113,21 +110,18 @@
{
const RT_SIZE_CLASS_ELEM class32_max = RT_SIZE_CLASS_INFO[RT_CLASS_32_MAX];
RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
- int idx;
- idx = RT_NODE_32_SEARCH_EQ(&n32->base, chunk);
+#ifdef RT_NODE_LEVEL_LEAF
+ int idx = RT_NODE_32_SEARCH_EQ(&n32->base, chunk);
+
if (idx != -1)
{
/* found the existing chunk */
chunk_exists = true;
-#ifdef RT_NODE_LEVEL_LEAF
n32->values[idx] = *value_p;
-#else
- n32->children[idx] = child;
-#endif
break;
}
-
+#endif
if (unlikely(RT_NODE_MUST_GROW(n32)) &&
n32->base.n.fanout < class32_max.fanout)
{
@@ -220,21 +214,19 @@
case RT_NODE_KIND_125:
{
RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
- int slotpos = n125->base.slot_idxs[chunk];
+ int slotpos;
int cnt = 0;
+#ifdef RT_NODE_LEVEL_LEAF
+ slotpos = n125->base.slot_idxs[chunk];
if (slotpos != RT_INVALID_SLOT_IDX)
{
/* found the existing chunk */
chunk_exists = true;
-#ifdef RT_NODE_LEVEL_LEAF
n125->values[slotpos] = *value_p;
-#else
- n125->children[slotpos] = child;
-#endif
break;
}
-
+#endif
if (unlikely(RT_NODE_MUST_GROW(n125)))
{
RT_PTR_ALLOC allocnode;
@@ -300,14 +292,10 @@
#ifdef RT_NODE_LEVEL_LEAF
chunk_exists = RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk);
-#else
- chunk_exists = RT_NODE_INNER_256_IS_CHUNK_USED(n256, chunk);
-#endif
Assert(chunk_exists || node->count < RT_NODE_MAX_SLOTS);
-
-#ifdef RT_NODE_LEVEL_LEAF
RT_NODE_LEAF_256_SET(n256, chunk, *value_p);
#else
+ Assert(node->count < RT_NODE_MAX_SLOTS);
RT_NODE_INNER_256_SET(n256, chunk, child);
#endif
break;
@@ -315,8 +303,12 @@
}
/* Update statistics */
+#ifdef RT_NODE_LEVEL_LEAF
if (!chunk_exists)
node->count++;
+#else
+ node->count++;
+#endif
/*
* Done. Finally, verify the chunk and value is inserted or replaced
@@ -324,7 +316,11 @@
*/
RT_VERIFY_NODE(node);
+#ifdef RT_NODE_LEVEL_LEAF
return chunk_exists;
+#else
+ return;
+#endif
#undef RT_NODE3_TYPE
#undef RT_NODE32_TYPE
--
2.39.1
Attachment: v26-0008-Add-TIDStore-to-store-sets-of-TIDs-ItemPointerDa.patch (text/x-patch)
From f17e983832736a1daa64e67a10f9a64189b68210 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 23 Dec 2022 23:50:39 +0900
Subject: [PATCH v26 8/9] Add TIDStore, to store sets of TIDs (ItemPointerData)
efficiently.
The TIDStore is designed to store large sets of TIDs efficiently, and
is backed by the radix tree. A TID is encoded into a 64-bit key and
value and inserted into the radix tree.
The TIDStore is not used for anything yet, aside from the test code,
but a follow-up patch integrates the TIDStore with lazy vacuum,
reducing lazy vacuum memory usage and lifting the 1GB limit on its
size, by storing the list of dead TIDs more efficiently.
This includes a unit test module, in src/test/modules/test_tidstore.
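For illustration, here is a minimal, self-contained sketch of the encoding
idea (not the patch's code; the names and constants below are made up and
assume 8kB heap pages, where MaxHeapTuplesPerPage < 2^9):

#include <stdint.h>
#include <stdio.h>

#define OFFSET_NBITS	9	/* bits reserved for the offset number */
#define VALUE_NBITS		6	/* log2(64): bits covered by one bitmap value */

/*
 * Combine block and offset into one integer, then split it: the lowest
 * VALUE_NBITS bits select a bit in the 64-bit bitmap value, and the
 * remaining bits form the radix tree key.
 */
static uint64_t
toy_encode_tid(uint32_t blkno, uint16_t offnum, unsigned *bit)
{
	uint64_t	tid_i = ((uint64_t) blkno << OFFSET_NBITS) | offnum;

	*bit = (unsigned) (tid_i & ((UINT64_C(1) << VALUE_NBITS) - 1));
	return tid_i >> VALUE_NBITS;
}

int
main(void)
{
	unsigned	bit;
	uint64_t	key = toy_encode_tid(1000, 7, &bit);

	/* Each key covers 64 consecutive offsets of one block (8 keys per block). */
	printf("key=%llu bit=%u\n", (unsigned long long) key, bit);
	return 0;
}

One radix tree entry can thus cover up to 64 offsets of a block.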
---
doc/src/sgml/monitoring.sgml | 4 +
src/backend/access/common/Makefile | 1 +
src/backend/access/common/meson.build | 1 +
src/backend/access/common/tidstore.c | 688 ++++++++++++++++++
src/backend/storage/lmgr/lwlock.c | 2 +
src/include/access/tidstore.h | 49 ++
src/include/storage/lwlock.h | 1 +
src/test/modules/Makefile | 1 +
src/test/modules/meson.build | 1 +
src/test/modules/test_tidstore/Makefile | 23 +
.../test_tidstore/expected/test_tidstore.out | 13 +
src/test/modules/test_tidstore/meson.build | 35 +
.../test_tidstore/sql/test_tidstore.sql | 7 +
.../test_tidstore/test_tidstore--1.0.sql | 8 +
.../modules/test_tidstore/test_tidstore.c | 195 +++++
.../test_tidstore/test_tidstore.control | 4 +
16 files changed, 1033 insertions(+)
create mode 100644 src/backend/access/common/tidstore.c
create mode 100644 src/include/access/tidstore.h
create mode 100644 src/test/modules/test_tidstore/Makefile
create mode 100644 src/test/modules/test_tidstore/expected/test_tidstore.out
create mode 100644 src/test/modules/test_tidstore/meson.build
create mode 100644 src/test/modules/test_tidstore/sql/test_tidstore.sql
create mode 100644 src/test/modules/test_tidstore/test_tidstore--1.0.sql
create mode 100644 src/test/modules/test_tidstore/test_tidstore.c
create mode 100644 src/test/modules/test_tidstore/test_tidstore.control
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 1756f1a4b6..d936aa3da3 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -2192,6 +2192,10 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
<entry>Waiting to access a shared TID bitmap during a parallel bitmap
index scan.</entry>
</row>
+ <row>
+ <entry><literal>SharedTidStore</literal></entry>
+ <entry>Waiting to access a shared TID store.</entry>
+ </row>
<row>
<entry><literal>SharedTupleStore</literal></entry>
<entry>Waiting to access a shared tuple store during parallel
diff --git a/src/backend/access/common/Makefile b/src/backend/access/common/Makefile
index b9aff0ccfd..67b8cc6108 100644
--- a/src/backend/access/common/Makefile
+++ b/src/backend/access/common/Makefile
@@ -27,6 +27,7 @@ OBJS = \
syncscan.o \
toast_compression.o \
toast_internals.o \
+ tidstore.o \
tupconvert.o \
tupdesc.o
diff --git a/src/backend/access/common/meson.build b/src/backend/access/common/meson.build
index f5ac17b498..fce19c09ce 100644
--- a/src/backend/access/common/meson.build
+++ b/src/backend/access/common/meson.build
@@ -15,6 +15,7 @@ backend_sources += files(
'syncscan.c',
'toast_compression.c',
'toast_internals.c',
+ 'tidstore.c',
'tupconvert.c',
'tupdesc.c',
)
diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
new file mode 100644
index 0000000000..4c72673ce9
--- /dev/null
+++ b/src/backend/access/common/tidstore.c
@@ -0,0 +1,688 @@
+/*-------------------------------------------------------------------------
+ *
+ * tidstore.c
+ * Tid (ItemPointerData) storage implementation.
+ *
+ * This module provides an in-memory data structure to store Tids (ItemPointer).
+ * Internally, a tid is encoded as a pair of a 64-bit key and a 64-bit value,
+ * and stored in the radix tree.
+ *
+ * A TidStore can be shared among parallel worker processes by passing a DSA
+ * area to tidstore_create(). Other backends can attach to the shared TidStore
+ * by tidstore_attach().
+ *
+ * For concurrency, it mostly relies on the concurrency support in the radix
+ * tree, but we acquire the lock on a TidStore in some cases, for example,
+ * when resetting the store and when accessing the number of tids in the
+ * store (num_tids).
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/common/tidstore.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/tidstore.h"
+#include "miscadmin.h"
+#include "port/pg_bitutils.h"
+#include "storage/lwlock.h"
+#include "utils/dsa.h"
+#include "utils/memutils.h"
+
+/*
+ * For encoding purposes, tids are represented as a pair of a 64-bit key and
+ * a 64-bit value. First, we construct a 64-bit unsigned integer by combining
+ * the block number and the offset number. The number of bits used for the
+ * offset number is derived from max_offset in tidstore_create(). We are
+ * frugal with the bits, because smaller keys could help keep the radix
+ * tree shallow.
+ *
+ * For example, a heap tid with 8kB blocks uses the lowest 9 bits for
+ * the offset number and the next 32 bits for the block number. That
+ * is, only 41 bits are used:
+ *
+ * uuuuuuuY YYYYYYYY YYYYYYYY YYYYYYYY YYYYYYYX XXXXXXXX
+ *
+ * X = bits used for offset number
+ * Y = bits used for block number
+ * u = unused bit
+ * (high on the left, low on the right)
+ *
+ * 9 bits are enough for the offset number, because MaxHeapTuplesPerPage < 2^9
+ * on 8kB blocks.
+ *
+ * The 64-bit value is the bitmap representation of the lowest 6 bits
+ * (TIDSTORE_VALUE_NBITS) of the integer, and the remaining 35 bits are used
+ * as the key:
+ *
+ * uuuuuuuY YYYYYYYY YYYYYYYY YYYYYYYY YYYYYYYX XXXXXXXX
+ * |----| value
+ * |---------------------------------------------| key
+ *
+ * The maximum height of the radix tree is 5 in this case.
+ *
+ * If the bitmap of all possible offset numbers fits in a 64-bit value, we
+ * don't encode tids but directly use the block number as the key and the
+ * offset number as the bit position in the value.
+ */
+#define TIDSTORE_VALUE_NBITS 6 /* log(64, 2) */
+
+/* A magic value used to identify our TidStores. */
+#define TIDSTORE_MAGIC 0x826f6a10
+
+#define RT_PREFIX local_rt
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#define RT_VALUE_TYPE uint64
+#include "lib/radixtree.h"
+
+#define RT_PREFIX shared_rt
+#define RT_SHMEM
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#define RT_VALUE_TYPE uint64
+#include "lib/radixtree.h"
+
+/* The control object for a TidStore */
+typedef struct TidStoreControl
+{
+ /* the number of tids in the store */
+ int64 num_tids;
+
+ /* These values are never changed after creation */
+ size_t max_bytes; /* the maximum bytes a TidStore can use */
+ int max_offset; /* the maximum offset number */
+ int offset_nbits; /* the number of bits required for max_offset */
+ bool encode_tids; /* do we use tid encoding? */
+ int offset_key_nbits; /* the number of bits of an offset number
+ * used for the key */
+
+ /* The below fields are used only in shared case */
+
+ uint32 magic;
+ LWLock lock;
+
+ /* handles for TidStore and radix tree */
+ tidstore_handle handle;
+ shared_rt_handle tree_handle;
+} TidStoreControl;
+
+/* Per-backend state for a TidStore */
+struct TidStore
+{
+ /*
+ * Control object. This is allocated in DSA area 'area' in the shared
+ * case, otherwise in backend-local memory.
+ */
+ TidStoreControl *control;
+
+ /* Storage for Tids. Use either one depending on TidStoreIsShared() */
+ union
+ {
+ local_rt_radix_tree *local;
+ shared_rt_radix_tree *shared;
+ } tree;
+
+ /* DSA area for TidStore if used */
+ dsa_area *area;
+};
+#define TidStoreIsShared(ts) ((ts)->area != NULL)
+
+/* Iterator for TidStore */
+typedef struct TidStoreIter
+{
+ TidStore *ts;
+
+ /* iterator of radix tree. Use either one depending on TidStoreIsShared() */
+ union
+ {
+ shared_rt_iter *shared;
+ local_rt_iter *local;
+ } tree_iter;
+
+ /* we returned all tids? */
+ bool finished;
+
+ /* save for the next iteration */
+ uint64 next_key;
+ uint64 next_val;
+
+ /* output for the caller */
+ TidStoreIterResult result;
+} TidStoreIter;
+
+static void tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, uint64 val);
+static inline BlockNumber key_get_blkno(TidStore *ts, uint64 key);
+static inline uint64 tid_to_key_off(TidStore *ts, ItemPointer tid, uint32 *off);
+
+/*
+ * Create a TidStore. The returned object is allocated in backend-local memory.
+ * The radix tree for storage is allocated in the DSA area if 'area' is non-NULL.
+ */
+TidStore *
+tidstore_create(size_t max_bytes, int max_offset, dsa_area *area)
+{
+ TidStore *ts;
+
+ ts = palloc0(sizeof(TidStore));
+
+ /*
+ * Create the radix tree for the main storage.
+ *
+ * Memory consumption depends not only on the number of stored tids, but
+ * also on their distribution, on how the radix tree stores them, and on
+ * the memory management that backs the radix tree. The maximum number of
+ * bytes a TidStore can use is specified by max_bytes in tidstore_create(),
+ * and we want the total memory consumption of a TidStore not to exceed it.
+ *
+ * In the local TidStore case, the radix tree uses a slab allocator for each
+ * node class. The most memory-consuming case while adding tids associated
+ * with one page (i.e. during tidstore_add_tids()) is allocating a new
+ * slab block for a new radix tree node, which is approximately 70kB.
+ * Therefore, we deduct 70kB from max_bytes.
+ *
+ * In the shared case, DSA allocates memory segments whose sizes follow a
+ * geometric series that approximately doubles the total DSA size (see
+ * make_new_segment() in dsa.c). We simulated how DSA increases segment
+ * size, and the simulation showed that a 75% threshold for the maximum
+ * bytes works well when max_bytes is a power of two, and a 60% threshold
+ * works for other cases.
+ */
+ if (area != NULL)
+ {
+ dsa_pointer dp;
+ float ratio = ((max_bytes & (max_bytes - 1)) == 0) ? 0.75 : 0.6;
+
+ ts->tree.shared = shared_rt_create(CurrentMemoryContext, area,
+ LWTRANCHE_SHARED_TIDSTORE);
+
+ dp = dsa_allocate0(area, sizeof(TidStoreControl));
+ ts->control = (TidStoreControl *) dsa_get_address(area, dp);
+ ts->control->max_bytes = (uint64) (max_bytes * ratio);
+ ts->area = area;
+
+ ts->control->magic = TIDSTORE_MAGIC;
+ LWLockInitialize(&ts->control->lock, LWTRANCHE_SHARED_TIDSTORE);
+ ts->control->handle = dp;
+ ts->control->tree_handle = shared_rt_get_handle(ts->tree.shared);
+ }
+ else
+ {
+ ts->tree.local = local_rt_create(CurrentMemoryContext);
+
+ ts->control = (TidStoreControl *) palloc0(sizeof(TidStoreControl));
+ ts->control->max_bytes = max_bytes - (70 * 1024);
+ }
+
+ ts->control->max_offset = max_offset;
+ ts->control->offset_nbits = pg_ceil_log2_32(max_offset);
+
+ /*
+ * We use tid encoding if the bitmap of all possible offset numbers doesn't
+ * fit in a uint64 value.
+ */
+ if (ts->control->offset_nbits > TIDSTORE_VALUE_NBITS)
+ {
+ ts->control->encode_tids = true;
+ ts->control->offset_key_nbits =
+ ts->control->offset_nbits - TIDSTORE_VALUE_NBITS;
+ }
+ else
+ {
+ ts->control->encode_tids = false;
+ ts->control->offset_key_nbits = 0;
+ }
+
+ return ts;
+}
+
+/*
+ * Attach to the shared TidStore using a handle. The returned object is
+ * allocated in backend-local memory using the CurrentMemoryContext.
+ */
+TidStore *
+tidstore_attach(dsa_area *area, tidstore_handle handle)
+{
+ TidStore *ts;
+ dsa_pointer control;
+
+ Assert(area != NULL);
+ Assert(DsaPointerIsValid(handle));
+
+ /* create per-backend state */
+ ts = palloc0(sizeof(TidStore));
+
+ /* Find the control object in shared memory */
+ control = handle;
+
+ /* Set up the TidStore */
+ ts->control = (TidStoreControl *) dsa_get_address(area, control);
+ Assert(ts->control->magic == TIDSTORE_MAGIC);
+
+ ts->tree.shared = shared_rt_attach(area, ts->control->tree_handle);
+ ts->area = area;
+
+ return ts;
+}
+
+/*
+ * Detach from a TidStore. This detaches from radix tree and frees the
+ * backend-local resources. The radix tree will continue to exist until
+ * it is either explicitly destroyed, or the area that backs it is returned
+ * to the operating system.
+ */
+void
+tidstore_detach(TidStore *ts)
+{
+ Assert(TidStoreIsShared(ts) && ts->control->magic == TIDSTORE_MAGIC);
+
+ shared_rt_detach(ts->tree.shared);
+ pfree(ts);
+}
+
+/*
+ * Destroy a TidStore, returning all memory.
+ *
+ * TODO: The caller must be certain that no other backend will attempt to
+ * access the TidStore before calling this function. Other backends must
+ * explicitly call tidstore_detach to free up backend-local memory associated
+ * with the TidStore. The backend that calls tidstore_destroy must not call
+ * tidstore_detach.
+ */
+void
+tidstore_destroy(TidStore *ts)
+{
+ if (TidStoreIsShared(ts))
+ {
+ Assert(ts->control->magic == TIDSTORE_MAGIC);
+
+ /*
+ * Vandalize the control block to help catch programming errors where
+ * other backends access the memory formerly occupied by this radix
+ * tree.
+ */
+ ts->control->magic = 0;
+ dsa_free(ts->area, ts->control->handle);
+ shared_rt_free(ts->tree.shared);
+ }
+ else
+ {
+ pfree(ts->control);
+ local_rt_free(ts->tree.local);
+ }
+
+ pfree(ts);
+}
+
+/*
+ * Forget all collected Tids. It's similar to tidstore_destroy but we don't
+ * free the entire TidStore; we recreate only the radix tree storage.
+ */
+void
+tidstore_reset(TidStore *ts)
+{
+ if (TidStoreIsShared(ts))
+ {
+ Assert(ts->control->magic == TIDSTORE_MAGIC);
+
+ LWLockAcquire(&ts->control->lock, LW_EXCLUSIVE);
+
+ /*
+ * Free the radix tree and return allocated DSA segments to
+ * the operating system.
+ */
+ shared_rt_free(ts->tree.shared);
+ dsa_trim(ts->area);
+
+ /* Recreate the radix tree */
+ ts->tree.shared = shared_rt_create(CurrentMemoryContext, ts->area,
+ LWTRANCHE_SHARED_TIDSTORE);
+
+ /* update the radix tree handle as we recreated it */
+ ts->control->tree_handle = shared_rt_get_handle(ts->tree.shared);
+
+ /* Reset the statistics */
+ ts->control->num_tids = 0;
+
+ LWLockRelease(&ts->control->lock);
+ }
+ else
+ {
+ local_rt_free(ts->tree.local);
+ ts->tree.local = local_rt_create(CurrentMemoryContext);
+
+ /* Reset the statistics */
+ ts->control->num_tids = 0;
+ }
+}
+
+/* Add Tids on a block to TidStore */
+void
+tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets)
+{
+ ItemPointerData tid;
+ uint64 key_base;
+ uint64 *values;
+ int nkeys;
+
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ if (ts->control->encode_tids)
+ {
+ key_base = ((uint64) blkno) << ts->control->offset_key_nbits;
+ nkeys = UINT64CONST(1) << ts->control->offset_key_nbits;
+ }
+ else
+ {
+ key_base = (uint64) blkno;
+ nkeys = 1;
+ }
+ values = palloc0(sizeof(uint64) * nkeys);
+
+ ItemPointerSetBlockNumber(&tid, blkno);
+ for (int i = 0; i < num_offsets; i++)
+ {
+ uint64 key;
+ uint32 off;
+ int idx;
+
+ ItemPointerSetOffsetNumber(&tid, offsets[i]);
+
+ /* encode the tid to key and val */
+ key = tid_to_key_off(ts, &tid, &off);
+
+ idx = key - key_base;
+ Assert(idx >= 0 && idx < nkeys);
+
+ values[idx] |= UINT64CONST(1) << off;
+ }
+
+ if (TidStoreIsShared(ts))
+ LWLockAcquire(&ts->control->lock, LW_EXCLUSIVE);
+
+ /* insert the calculated key-values to the tree */
+ for (int i = 0; i < nkeys; i++)
+ {
+ if (values[i])
+ {
+ uint64 key = key_base + i;
+
+ if (TidStoreIsShared(ts))
+ shared_rt_set(ts->tree.shared, key, &values[i]);
+ else
+ local_rt_set(ts->tree.local, key, &values[i]);
+ }
+ }
+
+ /* update statistics */
+ ts->control->num_tids += num_offsets;
+
+ if (TidStoreIsShared(ts))
+ LWLockRelease(&ts->control->lock);
+
+ pfree(values);
+}
+
+/* Return true if the given tid is present in the TidStore */
+bool
+tidstore_lookup_tid(TidStore *ts, ItemPointer tid)
+{
+ uint64 key;
+ uint64 val = 0;
+ uint32 off;
+ bool found;
+
+ key = tid_to_key_off(ts, tid, &off);
+
+ if (TidStoreIsShared(ts))
+ found = shared_rt_search(ts->tree.shared, key, &val);
+ else
+ found = local_rt_search(ts->tree.local, key, &val);
+
+ if (!found)
+ return false;
+
+ return (val & (UINT64CONST(1) << off)) != 0;
+}
+
+/*
+ * Prepare to iterate through a TidStore. Since the radix tree is locked during
+ * the iteration, tidstore_end_iterate() needs to be called when finished.
+ *
+ * Concurrent updates during the iteration will be blocked when inserting a
+ * key-value pair into the radix tree.
+ */
+TidStoreIter *
+tidstore_begin_iterate(TidStore *ts)
+{
+ TidStoreIter *iter;
+
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ iter = palloc0(sizeof(TidStoreIter));
+ iter->ts = ts;
+
+ iter->result.blkno = InvalidBlockNumber;
+ iter->result.offsets = palloc(sizeof(OffsetNumber) * ts->control->max_offset);
+
+ if (TidStoreIsShared(ts))
+ iter->tree_iter.shared = shared_rt_begin_iterate(ts->tree.shared);
+ else
+ iter->tree_iter.local = local_rt_begin_iterate(ts->tree.local);
+
+ /* If the TidStore is empty, there is nothing to iterate */
+ if (tidstore_num_tids(ts) == 0)
+ iter->finished = true;
+
+ return iter;
+}
+
+static inline bool
+tidstore_iter_kv(TidStoreIter *iter, uint64 *key, uint64 *val)
+{
+ if (TidStoreIsShared(iter->ts))
+ return shared_rt_iterate_next(iter->tree_iter.shared, key, val);
+
+ return local_rt_iterate_next(iter->tree_iter.local, key, val);
+}
+
+/*
+ * Scan the TidStore and return a pointer to TidStoreIterResult that has tids
+ * in one block. We return the block numbers in ascending order and the offset
+ * numbers in each result are also sorted in ascending order.
+ */
+TidStoreIterResult *
+tidstore_iterate_next(TidStoreIter *iter)
+{
+ uint64 key;
+ uint64 val;
+ TidStoreIterResult *result = &(iter->result);
+
+ if (iter->finished)
+ return NULL;
+
+ if (BlockNumberIsValid(result->blkno))
+ {
+ /* Process the previously collected key-value */
+ result->num_offsets = 0;
+ tidstore_iter_extract_tids(iter, iter->next_key, iter->next_val);
+ }
+
+ while (tidstore_iter_kv(iter, &key, &val))
+ {
+ BlockNumber blkno;
+
+ blkno = key_get_blkno(iter->ts, key);
+
+ if (BlockNumberIsValid(result->blkno) && result->blkno != blkno)
+ {
+ /*
+ * We got a key-value pair for a different block. So return the
+ * collected tids, and remember the key-value for the next iteration.
+ */
+ iter->next_key = key;
+ iter->next_val = val;
+ return result;
+ }
+
+ /* Collect tids extracted from the key-value pair */
+ tidstore_iter_extract_tids(iter, key, val);
+ }
+
+ iter->finished = true;
+ return result;
+}
+
+/*
+ * Finish an iteration over TidStore. This needs to be called after finishing
+ * or when exiting an iteration.
+ */
+void
+tidstore_end_iterate(TidStoreIter *iter)
+{
+ if (TidStoreIsShared(iter->ts))
+ shared_rt_end_iterate(iter->tree_iter.shared);
+ else
+ local_rt_end_iterate(iter->tree_iter.local);
+
+ pfree(iter->result.offsets);
+ pfree(iter);
+}
+
+/* Return the number of tids we collected so far */
+int64
+tidstore_num_tids(TidStore *ts)
+{
+ uint64 num_tids;
+
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ if (!TidStoreIsShared(ts))
+ return ts->control->num_tids;
+
+ LWLockAcquire(&ts->control->lock, LW_SHARED);
+ num_tids = ts->control->num_tids;
+ LWLockRelease(&ts->control->lock);
+
+ return num_tids;
+}
+
+/* Return true if the current memory usage of TidStore exceeds the limit */
+bool
+tidstore_is_full(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ return (tidstore_memory_usage(ts) > ts->control->max_bytes);
+}
+
+/* Return the maximum memory TidStore can use */
+size_t
+tidstore_max_memory(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ return ts->control->max_bytes;
+}
+
+/* Return the memory usage of TidStore */
+size_t
+tidstore_memory_usage(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ /*
+ * In the shared case, TidStoreControl and radix_tree are backed by the
+ * same DSA area and rt_memory_usage() returns the value including both.
+ * So we don't need to add the size of TidStoreControl separately.
+ */
+ if (TidStoreIsShared(ts))
+ return sizeof(TidStore) + shared_rt_memory_usage(ts->tree.shared);
+
+ return sizeof(TidStore) + sizeof(TidStoreControl) + local_rt_memory_usage(ts->tree.local);
+}
+
+/*
+ * Get a handle that can be used by other processes to attach to this TidStore
+ */
+tidstore_handle
+tidstore_get_handle(TidStore *ts)
+{
+ Assert(TidStoreIsShared(ts) && ts->control->magic == TIDSTORE_MAGIC);
+
+ return ts->control->handle;
+}
+
+/* Extract tids from the given key-value pair */
+static void
+tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, uint64 val)
+{
+ TidStoreIterResult *result = (&iter->result);
+
+ for (int i = 0; i < sizeof(uint64) * BITS_PER_BYTE; i++)
+ {
+ uint64 tid_i;
+ OffsetNumber off;
+
+ if (i > iter->ts->control->max_offset)
+ {
+ Assert(!iter->ts->control->encode_tids);
+ break;
+ }
+
+ if ((val & (UINT64CONST(1) << i)) == 0)
+ continue;
+
+ tid_i = key << TIDSTORE_VALUE_NBITS;
+ tid_i |= i;
+
+ off = tid_i & ((UINT64CONST(1) << iter->ts->control->offset_nbits) - 1);
+
+ Assert(result->num_offsets < iter->ts->control->max_offset);
+ result->offsets[result->num_offsets++] = off;
+ }
+
+ result->blkno = key_get_blkno(iter->ts, key);
+}
+
+/* Get block number from the given key */
+static inline BlockNumber
+key_get_blkno(TidStore *ts, uint64 key)
+{
+ if (ts->control->encode_tids)
+ return (BlockNumber) (key >> ts->control->offset_key_nbits);
+
+ return (BlockNumber) key;
+}
+
+/* Encode a tid to key and offset */
+static inline uint64
+tid_to_key_off(TidStore *ts, ItemPointer tid, uint32 *off)
+{
+ uint64 key;
+ uint64 tid_i;
+
+ if (!ts->control->encode_tids)
+ {
+ *off = ItemPointerGetOffsetNumber(tid);
+
+ /* Use the block number as the key */
+ return (uint64) ItemPointerGetBlockNumber(tid);
+ }
+
+ tid_i = ItemPointerGetOffsetNumber(tid);
+ tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << ts->control->offset_nbits;
+
+ *off = tid_i & ((UINT64CONST(1) << TIDSTORE_VALUE_NBITS) - 1);
+ key = tid_i >> TIDSTORE_VALUE_NBITS;
+ Assert(*off < (sizeof(uint64) * BITS_PER_BYTE));
+
+ return key;
+}
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index d2ec396045..55b3a04097 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -176,6 +176,8 @@ static const char *const BuiltinTrancheNames[] = {
"SharedTupleStore",
/* LWTRANCHE_SHARED_TIDBITMAP: */
"SharedTidBitmap",
+ /* LWTRANCHE_SHARED_TIDSTORE: */
+ "SharedTidStore",
/* LWTRANCHE_PARALLEL_APPEND: */
"ParallelAppend",
/* LWTRANCHE_PER_XACT_PREDICATE_LIST: */
diff --git a/src/include/access/tidstore.h b/src/include/access/tidstore.h
new file mode 100644
index 0000000000..a35a52124a
--- /dev/null
+++ b/src/include/access/tidstore.h
@@ -0,0 +1,49 @@
+/*-------------------------------------------------------------------------
+ *
+ * tidstore.h
+ * Tid storage.
+ *
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/tidstore.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef TIDSTORE_H
+#define TIDSTORE_H
+
+#include "storage/itemptr.h"
+#include "utils/dsa.h"
+
+typedef dsa_pointer tidstore_handle;
+
+typedef struct TidStore TidStore;
+typedef struct TidStoreIter TidStoreIter;
+
+typedef struct TidStoreIterResult
+{
+ BlockNumber blkno;
+ OffsetNumber *offsets;
+ int num_offsets;
+} TidStoreIterResult;
+
+extern TidStore *tidstore_create(size_t max_bytes, int max_offset, dsa_area *dsa);
+extern TidStore *tidstore_attach(dsa_area *dsa, dsa_pointer handle);
+extern void tidstore_detach(TidStore *ts);
+extern void tidstore_destroy(TidStore *ts);
+extern void tidstore_reset(TidStore *ts);
+extern void tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets);
+extern bool tidstore_lookup_tid(TidStore *ts, ItemPointer tid);
+extern TidStoreIter * tidstore_begin_iterate(TidStore *ts);
+extern TidStoreIterResult *tidstore_iterate_next(TidStoreIter *iter);
+extern void tidstore_end_iterate(TidStoreIter *iter);
+extern int64 tidstore_num_tids(TidStore *ts);
+extern bool tidstore_is_full(TidStore *ts);
+extern size_t tidstore_max_memory(TidStore *ts);
+extern size_t tidstore_memory_usage(TidStore *ts);
+extern tidstore_handle tidstore_get_handle(TidStore *ts);
+
+#endif /* TIDSTORE_H */
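As a quick orientation, the following is a minimal backend-local usage
sketch based only on the declarations above (block and offset values are
made up, and error handling is omitted); passing NULL for the dsa_area
keeps the store in local memory, as the test module below and the serial
vacuum code in the later patch do. The test module exercises the same
API more thoroughly.

/* Sketch of backend-local TidStore usage based on the header above. */
#include "postgres.h"

#include "access/htup_details.h"	/* MaxHeapTuplesPerPage */
#include "access/tidstore.h"

static void
collect_and_scan(void)
{
	TidStore   *ts;
	TidStoreIter *iter;
	TidStoreIterResult *res;
	OffsetNumber offs[] = {1, 2, 5};
	ItemPointerData tid;

	/* NULL dsa_area: the store lives in backend-local memory */
	ts = tidstore_create(1024 * 1024, MaxHeapTuplesPerPage, NULL);

	tidstore_add_tids(ts, 42, offs, lengthof(offs));

	ItemPointerSet(&tid, 42, 5);
	if (tidstore_lookup_tid(ts, &tid))
		elog(DEBUG1, "TID (42,5) is stored");

	/* iterate in block order */
	iter = tidstore_begin_iterate(ts);
	while ((res = tidstore_iterate_next(iter)) != NULL)
		elog(DEBUG1, "block %u has %d offsets", res->blkno, res->num_offsets);
	tidstore_end_iterate(iter);

	tidstore_destroy(ts);
}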
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index d2c7afb8f4..07002fdfbe 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -199,6 +199,7 @@ typedef enum BuiltinTrancheIds
LWTRANCHE_PER_SESSION_RECORD_TYPMOD,
LWTRANCHE_SHARED_TUPLESTORE,
LWTRANCHE_SHARED_TIDBITMAP,
+ LWTRANCHE_SHARED_TIDSTORE,
LWTRANCHE_PARALLEL_APPEND,
LWTRANCHE_PER_XACT_PREDICATE_LIST,
LWTRANCHE_PGSTATS_DSA,
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 9659eb85d7..bddc16ada7 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -34,6 +34,7 @@ SUBDIRS = \
test_rls_hooks \
test_shm_mq \
test_slru \
+ test_tidstore \
unsafe_tests \
worker_spi
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 232cbdac80..c0d5645ad8 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -30,5 +30,6 @@ subdir('test_regex')
subdir('test_rls_hooks')
subdir('test_shm_mq')
subdir('test_slru')
+subdir('test_tidstore')
subdir('unsafe_tests')
subdir('worker_spi')
diff --git a/src/test/modules/test_tidstore/Makefile b/src/test/modules/test_tidstore/Makefile
new file mode 100644
index 0000000000..dab107d70c
--- /dev/null
+++ b/src/test/modules/test_tidstore/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_tidstore/Makefile
+
+MODULE_big = test_tidstore
+OBJS = \
+ $(WIN32RES) \
+ test_tidstore.o
+PGFILEDESC = "test_tidstore - test code for src/backend/access/common/tidstore.c"
+
+EXTENSION = test_tidstore
+DATA = test_tidstore--1.0.sql
+
+REGRESS = test_tidstore
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_tidstore
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_tidstore/expected/test_tidstore.out b/src/test/modules/test_tidstore/expected/test_tidstore.out
new file mode 100644
index 0000000000..7ff2f9af87
--- /dev/null
+++ b/src/test/modules/test_tidstore/expected/test_tidstore.out
@@ -0,0 +1,13 @@
+CREATE EXTENSION test_tidstore;
+--
+-- All the logic is in the test_tidstore() function. It will throw
+-- an error if something fails.
+--
+SELECT test_tidstore();
+NOTICE: testing empty tidstore
+NOTICE: testing basic operations
+ test_tidstore
+---------------
+
+(1 row)
+
diff --git a/src/test/modules/test_tidstore/meson.build b/src/test/modules/test_tidstore/meson.build
new file mode 100644
index 0000000000..31f2da7b61
--- /dev/null
+++ b/src/test/modules/test_tidstore/meson.build
@@ -0,0 +1,35 @@
+# FIXME: prevent install during main install, but not during test :/
+
+test_tidstore_sources = files(
+ 'test_tidstore.c',
+)
+
+if host_system == 'windows'
+ test_tidstore_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'test_tidstore',
+ '--FILEDESC', 'test_tidstore - test code for src/backend/access/common/tidstore.c',])
+endif
+
+test_tidstore = shared_module('test_tidstore',
+ test_tidstore_sources,
+ link_with: pgport_srv,
+ kwargs: pg_mod_args,
+)
+testprep_targets += test_tidstore
+
+install_data(
+ 'test_tidstore.control',
+ 'test_tidstore--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'test_tidstore',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'test_tidstore',
+ ],
+ },
+}
diff --git a/src/test/modules/test_tidstore/sql/test_tidstore.sql b/src/test/modules/test_tidstore/sql/test_tidstore.sql
new file mode 100644
index 0000000000..03aea31815
--- /dev/null
+++ b/src/test/modules/test_tidstore/sql/test_tidstore.sql
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_tidstore;
+
+--
+-- All the logic is in the test_tidstore() function. It will throw
+-- an error if something fails.
+--
+SELECT test_tidstore();
diff --git a/src/test/modules/test_tidstore/test_tidstore--1.0.sql b/src/test/modules/test_tidstore/test_tidstore--1.0.sql
new file mode 100644
index 0000000000..47e9149900
--- /dev/null
+++ b/src/test/modules/test_tidstore/test_tidstore--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_tidstore/test_tidstore--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_tidstore" to load this file. \quit
+
+CREATE FUNCTION test_tidstore()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_tidstore/test_tidstore.c b/src/test/modules/test_tidstore/test_tidstore.c
new file mode 100644
index 0000000000..9b849ae8e8
--- /dev/null
+++ b/src/test/modules/test_tidstore/test_tidstore.c
@@ -0,0 +1,195 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_tidstore.c
+ * Test TidStore data structure.
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/test/modules/test_tidstore/test_tidstore.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "access/tidstore.h"
+#include "fmgr.h"
+#include "miscadmin.h"
+#include "storage/block.h"
+#include "storage/itemptr.h"
+#include "utils/memutils.h"
+
+PG_MODULE_MAGIC;
+
+#define TEST_TIDSTORE_MAX_BYTES (2 * 1024 * 1024L) /* 2MB */
+
+PG_FUNCTION_INFO_V1(test_tidstore);
+
+static void
+check_tid(TidStore *ts, BlockNumber blkno, OffsetNumber off, bool expect)
+{
+ ItemPointerData tid;
+ bool found;
+
+ ItemPointerSet(&tid, blkno, off);
+
+ found = tidstore_lookup_tid(ts, &tid);
+
+ if (found != expect)
+ elog(ERROR, "lookup TID (%u, %u) returned %d, expected %d",
+ blkno, off, found, expect);
+}
+
+static void
+test_basic(int max_offset)
+{
+#define TEST_TIDSTORE_NUM_BLOCKS 5
+#define TEST_TIDSTORE_NUM_OFFSETS 5
+
+ TidStore *ts;
+ TidStoreIter *iter;
+ TidStoreIterResult *iter_result;
+ BlockNumber blks[TEST_TIDSTORE_NUM_BLOCKS] = {
+ 0, MaxBlockNumber, MaxBlockNumber - 1, 1, MaxBlockNumber / 2,
+ };
+ BlockNumber blks_sorted[TEST_TIDSTORE_NUM_BLOCKS] = {
+ 0, 1, MaxBlockNumber / 2, MaxBlockNumber - 1, MaxBlockNumber
+ };
+ OffsetNumber offs[TEST_TIDSTORE_NUM_OFFSETS];
+ int blk_idx;
+
+ /* prepare the offset array */
+ offs[0] = FirstOffsetNumber;
+ offs[1] = FirstOffsetNumber + 1;
+ offs[2] = max_offset / 2;
+ offs[3] = max_offset - 1;
+ offs[4] = max_offset;
+
+ ts = tidstore_create(TEST_TIDSTORE_MAX_BYTES, max_offset, NULL);
+
+ /* add tids */
+ for (int i = 0; i < TEST_TIDSTORE_NUM_BLOCKS; i++)
+ tidstore_add_tids(ts, blks[i], offs, TEST_TIDSTORE_NUM_OFFSETS);
+
+ /* lookup test */
+ for (OffsetNumber off = FirstOffsetNumber ; off < max_offset; off++)
+ {
+ bool expect = false;
+ for (int i = 0; i < TEST_TIDSTORE_NUM_OFFSETS; i++)
+ {
+ if (offs[i] == off)
+ {
+ expect = true;
+ break;
+ }
+ }
+
+ check_tid(ts, 0, off, expect);
+ check_tid(ts, 2, off, false);
+ check_tid(ts, MaxBlockNumber - 2, off, false);
+ check_tid(ts, MaxBlockNumber, off, expect);
+ }
+
+ /* test the number of tids */
+ if (tidstore_num_tids(ts) != (TEST_TIDSTORE_NUM_BLOCKS * TEST_TIDSTORE_NUM_OFFSETS))
+ elog(ERROR, "tidstore_num_tids returned " UINT64_FORMAT ", expected %d",
+ tidstore_num_tids(ts),
+ TEST_TIDSTORE_NUM_BLOCKS * TEST_TIDSTORE_NUM_OFFSETS);
+
+ /* iteration test */
+ iter = tidstore_begin_iterate(ts);
+ blk_idx = 0;
+ while ((iter_result = tidstore_iterate_next(iter)) != NULL)
+ {
+ /* check the returned block number */
+ if (blks_sorted[blk_idx] != iter_result->blkno)
+ elog(ERROR, "tidstore_iterate_next returned block number %u, expected %u",
+ iter_result->blkno, blks_sorted[blk_idx]);
+
+ /* check the returned offset numbers */
+ if (TEST_TIDSTORE_NUM_OFFSETS != iter_result->num_offsets)
+ elog(ERROR, "tidstore_iterate_next returned %u offsets, expected %u",
+ iter_result->num_offsets, TEST_TIDSTORE_NUM_OFFSETS);
+
+ for (int i = 0; i < iter_result->num_offsets; i++)
+ {
+ if (offs[i] != iter_result->offsets[i])
+ elog(ERROR, "tidstore_iterate_next returned offset number %u on block %u, expected %u",
+ iter_result->offsets[i], iter_result->blkno, offs[i]);
+ }
+
+ blk_idx++;
+ }
+
+ if (blk_idx != TEST_TIDSTORE_NUM_BLOCKS)
+ elog(ERROR, "tidstore_iterate_next returned %d blocks, expected %d",
+ blk_idx, TEST_TIDSTORE_NUM_BLOCKS);
+
+ /* remove all tids */
+ tidstore_reset(ts);
+
+ /* test the number of tids */
+ if (tidstore_num_tids(ts) != 0)
+ elog(ERROR, "tidstore_num_tids on empty store returned non-zero");
+
+ /* lookup test for empty store */
+ for (OffsetNumber off = FirstOffsetNumber ; off < MaxHeapTuplesPerPage;
+ off++)
+ {
+ check_tid(ts, 0, off, false);
+ check_tid(ts, 2, off, false);
+ check_tid(ts, MaxBlockNumber - 2, off, false);
+ check_tid(ts, MaxBlockNumber, off, false);
+ }
+
+ tidstore_destroy(ts);
+}
+
+static void
+test_empty(void)
+{
+ TidStore *ts;
+ TidStoreIter *iter;
+ ItemPointerData tid;
+
+ elog(NOTICE, "testing empty tidstore");
+
+ ts = tidstore_create(TEST_TIDSTORE_MAX_BYTES, MaxHeapTuplesPerPage, NULL);
+
+ ItemPointerSet(&tid, 0, FirstOffsetNumber);
+ if (tidstore_lookup_tid(ts, &tid))
+ elog(ERROR, "tidstore_lookup_tid for (0,1) on empty store returned true");
+
+ ItemPointerSet(&tid, MaxBlockNumber, MaxOffsetNumber);
+ if (tidstore_lookup_tid(ts, &tid))
+ elog(ERROR, "tidstore_lookup_tid for (%u,%u) on empty store returned true",
+ MaxBlockNumber, MaxOffsetNumber);
+
+ if (tidstore_num_tids(ts) != 0)
+ elog(ERROR, "tidstore_num_entries on empty store returned non-zero");
+
+ if (tidstore_is_full(ts))
+ elog(ERROR, "tidstore_is_full on empty store returned true");
+
+ iter = tidstore_begin_iterate(ts);
+
+ if (tidstore_iterate_next(iter) != NULL)
+ elog(ERROR, "tidstore_iterate_next on empty store returned TIDs");
+
+ tidstore_end_iterate(iter);
+
+ tidstore_destroy(ts);
+}
+
+Datum
+test_tidstore(PG_FUNCTION_ARGS)
+{
+ test_empty();
+
+ elog(NOTICE, "testing basic operations");
+ test_basic(MaxHeapTuplesPerPage);
+ test_basic(10);
+
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_tidstore/test_tidstore.control b/src/test/modules/test_tidstore/test_tidstore.control
new file mode 100644
index 0000000000..9b6bd4638f
--- /dev/null
+++ b/src/test/modules/test_tidstore/test_tidstore.control
@@ -0,0 +1,4 @@
+comment = 'Test code for tidstore'
+default_version = '1.0'
+module_pathname = '$libdir/test_tidstore'
+relocatable = true
--
2.39.1
v26-0009-Use-TIDStore-for-storing-dead-tuple-TID-during-l.patch
From ed8115b5f5c1b0745e35a0d6d72064ad9df4cf42 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Tue, 7 Feb 2023 17:19:29 +0700
Subject: [PATCH v26 9/9] Use TIDStore for storing dead tuple TID during lazy
vacuum
Previously, we used an array of ItemPointerData to store dead tuple
TIDs, which was not space efficient and was slow to look up. Also, we
had a 1GB limit on its size.
Now we use TidStore to store dead tuple TIDs. Since the TidStore,
backed by the radix tree, incrementally allocates memory, we get rid
of the 1GB limit.
Since we are no longer able to exactly estimate the maximum number of
TIDs that can be stored, pg_stat_progress_vacuum now shows the progress
information based on the amount of memory in bytes. The column names
are also changed to max_dead_tuple_bytes and num_dead_tuple_bytes.
In addition, since the TidStore uses the radix tree internally, the
minimum amount of memory required by the TidStore is 1MB, the initial
DSA segment size. Due to that, we increase the minimum value of
maintenance_work_mem (and autovacuum_work_mem) from 1MB to 2MB.
XXX: needs to bump catalog version
---
doc/src/sgml/monitoring.sgml | 8 +-
src/backend/access/heap/vacuumlazy.c | 278 ++++++++-------------
src/backend/catalog/system_views.sql | 2 +-
src/backend/commands/vacuum.c | 78 +-----
src/backend/commands/vacuumparallel.c | 73 +++---
src/backend/postmaster/autovacuum.c | 6 +-
src/backend/storage/lmgr/lwlock.c | 2 +
src/backend/utils/misc/guc_tables.c | 2 +-
src/include/commands/progress.h | 4 +-
src/include/commands/vacuum.h | 25 +-
src/include/storage/lwlock.h | 1 +
src/test/regress/expected/cluster.out | 2 +-
src/test/regress/expected/create_index.out | 2 +-
src/test/regress/expected/rules.out | 4 +-
src/test/regress/sql/cluster.sql | 2 +-
src/test/regress/sql/create_index.sql | 2 +-
16 files changed, 177 insertions(+), 314 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index d936aa3da3..0230c74e3d 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -6870,10 +6870,10 @@ FROM pg_stat_get_backend_idset() AS backendid;
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>max_dead_tuples</structfield> <type>bigint</type>
+ <structfield>max_dead_tuple_bytes</structfield> <type>bigint</type>
</para>
<para>
- Number of dead tuples that we can store before needing to perform
+ Amount of dead tuple data that we can store before needing to perform
an index vacuum cycle, based on
<xref linkend="guc-maintenance-work-mem"/>.
</para></entry>
@@ -6881,10 +6881,10 @@ FROM pg_stat_get_backend_idset() AS backendid;
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>num_dead_tuples</structfield> <type>bigint</type>
+ <structfield>num_dead_tuple_bytes</structfield> <type>bigint</type>
</para>
<para>
- Number of dead tuples collected since the last index vacuum cycle.
+ Amount of dead tuple data collected since the last index vacuum cycle.
</para></entry>
</row>
</tbody>
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 8f14cf85f3..b4e40423a8 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -3,18 +3,18 @@
* vacuumlazy.c
* Concurrent ("lazy") vacuuming.
*
- * The major space usage for vacuuming is storage for the array of dead TIDs
+ * The major space usage for vacuuming is TidStore, a storage for dead TIDs
* that are to be removed from indexes. We want to ensure we can vacuum even
* the very largest relations with finite memory space usage. To do that, we
- * set upper bounds on the number of TIDs we can keep track of at once.
+ * set upper bounds on the maximum memory that can be used for keeping track
+ * of dead TIDs at once.
*
* We are willing to use at most maintenance_work_mem (or perhaps
* autovacuum_work_mem) memory space to keep track of dead TIDs. We initially
- * allocate an array of TIDs of that size, with an upper limit that depends on
- * table size (this limit ensures we don't allocate a huge area uselessly for
- * vacuuming small tables). If the array threatens to overflow, we must call
- * lazy_vacuum to vacuum indexes (and to vacuum the pages that we've pruned).
- * This frees up the memory space dedicated to storing dead TIDs.
+ * create a TidStore with the maximum bytes that can be used by the TidStore.
+ * If the TidStore is full, we must call lazy_vacuum to vacuum indexes (and to
+ * vacuum the pages that we've pruned). This frees up the memory space dedicated
+ * to storing dead TIDs.
*
* In practice VACUUM will often complete its initial pass over the target
* heap relation without ever running out of space to store TIDs. This means
@@ -40,6 +40,7 @@
#include "access/heapam_xlog.h"
#include "access/htup_details.h"
#include "access/multixact.h"
+#include "access/tidstore.h"
#include "access/transam.h"
#include "access/visibilitymap.h"
#include "access/xact.h"
@@ -188,7 +189,7 @@ typedef struct LVRelState
* lazy_vacuum_heap_rel, which marks the same LP_DEAD line pointers as
* LP_UNUSED during second heap pass.
*/
- VacDeadItems *dead_items; /* TIDs whose index tuples we'll delete */
+ TidStore *dead_items; /* TIDs whose index tuples we'll delete */
BlockNumber rel_pages; /* total number of pages */
BlockNumber scanned_pages; /* # pages examined (not skipped via VM) */
BlockNumber removed_pages; /* # pages removed by relation truncation */
@@ -220,11 +221,14 @@ typedef struct LVRelState
typedef struct LVPagePruneState
{
bool hastup; /* Page prevents rel truncation? */
- bool has_lpdead_items; /* includes existing LP_DEAD items */
+
+ /* collected offsets of LP_DEAD items including existing ones */
+ OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
+ int num_offsets;
/*
* State describes the proper VM bit states to set for the page following
- * pruning and freezing. all_visible implies !has_lpdead_items, but don't
+ * pruning and freezing. all_visible implies num_offsets == 0, but don't
* trust all_frozen result unless all_visible is also set to true.
*/
bool all_visible; /* Every item visible to all? */
@@ -259,8 +263,9 @@ static bool lazy_scan_noprune(LVRelState *vacrel, Buffer buf,
static void lazy_vacuum(LVRelState *vacrel);
static bool lazy_vacuum_all_indexes(LVRelState *vacrel);
static void lazy_vacuum_heap_rel(LVRelState *vacrel);
-static int lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
- Buffer buffer, int index, Buffer vmbuffer);
+static void lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
+ OffsetNumber *offsets, int num_offsets,
+ Buffer buffer, Buffer vmbuffer);
static bool lazy_check_wraparound_failsafe(LVRelState *vacrel);
static void lazy_cleanup_all_indexes(LVRelState *vacrel);
static IndexBulkDeleteResult *lazy_vacuum_one_index(Relation indrel,
@@ -487,11 +492,11 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
}
/*
- * Allocate dead_items array memory using dead_items_alloc. This handles
- * parallel VACUUM initialization as part of allocating shared memory
- * space used for dead_items. (But do a failsafe precheck first, to
- * ensure that parallel VACUUM won't be attempted at all when relfrozenxid
- * is already dangerously old.)
+ * Allocate dead_items memory using dead_items_alloc. This handles parallel
+ * VACUUM initialization as part of allocating shared memory space used for
+ * dead_items. (But do a failsafe precheck first, to ensure that parallel
+ * VACUUM won't be attempted at all when relfrozenxid is already dangerously
+ * old.)
*/
lazy_check_wraparound_failsafe(vacrel);
dead_items_alloc(vacrel, params->nworkers);
@@ -797,7 +802,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
* have collected the TIDs whose index tuples need to be removed.
*
* Finally, invokes lazy_vacuum_heap_rel to vacuum heap pages, which
- * largely consists of marking LP_DEAD items (from collected TID array)
+ * largely consists of marking LP_DEAD items (from vacrel->dead_items)
* as LP_UNUSED. This has to happen in a second, final pass over the
* heap, to preserve a basic invariant that all index AMs rely on: no
* extant index tuple can ever be allowed to contain a TID that points to
@@ -825,21 +830,21 @@ lazy_scan_heap(LVRelState *vacrel)
blkno,
next_unskippable_block,
next_fsm_block_to_vacuum = 0;
- VacDeadItems *dead_items = vacrel->dead_items;
+ TidStore *dead_items = vacrel->dead_items;
Buffer vmbuffer = InvalidBuffer;
bool next_unskippable_allvis,
skipping_current_range;
const int initprog_index[] = {
PROGRESS_VACUUM_PHASE,
PROGRESS_VACUUM_TOTAL_HEAP_BLKS,
- PROGRESS_VACUUM_MAX_DEAD_TUPLES
+ PROGRESS_VACUUM_MAX_DEAD_TUPLE_BYTES
};
int64 initprog_val[3];
/* Report that we're scanning the heap, advertising total # of blocks */
initprog_val[0] = PROGRESS_VACUUM_PHASE_SCAN_HEAP;
initprog_val[1] = rel_pages;
- initprog_val[2] = dead_items->max_items;
+ initprog_val[2] = tidstore_max_memory(vacrel->dead_items);
pgstat_progress_update_multi_param(3, initprog_index, initprog_val);
/* Set up an initial range of skippable blocks using the visibility map */
@@ -906,8 +911,7 @@ lazy_scan_heap(LVRelState *vacrel)
* dead_items TIDs, pause and do a cycle of vacuuming before we tackle
* this page.
*/
- Assert(dead_items->max_items >= MaxHeapTuplesPerPage);
- if (dead_items->max_items - dead_items->num_items < MaxHeapTuplesPerPage)
+ if (tidstore_is_full(vacrel->dead_items))
{
/*
* Before beginning index vacuuming, we release any pin we may
@@ -969,7 +973,7 @@ lazy_scan_heap(LVRelState *vacrel)
continue;
}
- /* Collect LP_DEAD items in dead_items array, count tuples */
+ /* Collect LP_DEAD items in dead_items, count tuples */
if (lazy_scan_noprune(vacrel, buf, blkno, page, &hastup,
&recordfreespace))
{
@@ -1011,14 +1015,14 @@ lazy_scan_heap(LVRelState *vacrel)
* Prune, freeze, and count tuples.
*
* Accumulates details of remaining LP_DEAD line pointers on page in
- * dead_items array. This includes LP_DEAD line pointers that we
- * pruned ourselves, as well as existing LP_DEAD line pointers that
- * were pruned some time earlier. Also considers freezing XIDs in the
- * tuple headers of remaining items with storage.
+ * dead_items. This includes LP_DEAD line pointers that we pruned
+ * ourselves, as well as existing LP_DEAD line pointers that were pruned
+ * some time earlier. Also considers freezing XIDs in the tuple headers
+ * of remaining items with storage.
*/
lazy_scan_prune(vacrel, buf, blkno, page, &prunestate);
- Assert(!prunestate.all_visible || !prunestate.has_lpdead_items);
+ Assert(!prunestate.all_visible || (prunestate.num_offsets == 0));
/* Remember the location of the last page with nonremovable tuples */
if (prunestate.hastup)
@@ -1034,14 +1038,12 @@ lazy_scan_heap(LVRelState *vacrel)
* performed here can be thought of as the one-pass equivalent of
* a call to lazy_vacuum().
*/
- if (prunestate.has_lpdead_items)
+ if (prunestate.num_offsets > 0)
{
Size freespace;
- lazy_vacuum_heap_page(vacrel, blkno, buf, 0, vmbuffer);
-
- /* Forget the LP_DEAD items that we just vacuumed */
- dead_items->num_items = 0;
+ lazy_vacuum_heap_page(vacrel, blkno, prunestate.deadoffsets,
+ prunestate.num_offsets, buf, vmbuffer);
/*
* Periodically perform FSM vacuuming to make newly-freed
@@ -1078,7 +1080,16 @@ lazy_scan_heap(LVRelState *vacrel)
* with prunestate-driven visibility map and FSM steps (just like
* the two-pass strategy).
*/
- Assert(dead_items->num_items == 0);
+ Assert(tidstore_num_tids(dead_items) == 0);
+ }
+ else if (prunestate.num_offsets > 0)
+ {
+ /* Save details of the LP_DEAD items from the page in dead_items */
+ tidstore_add_tids(dead_items, blkno, prunestate.deadoffsets,
+ prunestate.num_offsets);
+
+ pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
+ tidstore_memory_usage(dead_items));
}
/*
@@ -1145,7 +1156,7 @@ lazy_scan_heap(LVRelState *vacrel)
* There should never be LP_DEAD items on a page with PD_ALL_VISIBLE
* set, however.
*/
- else if (prunestate.has_lpdead_items && PageIsAllVisible(page))
+ else if ((prunestate.num_offsets > 0) && PageIsAllVisible(page))
{
elog(WARNING, "page containing LP_DEAD items is marked as all-visible in relation \"%s\" page %u",
vacrel->relname, blkno);
@@ -1193,7 +1204,7 @@ lazy_scan_heap(LVRelState *vacrel)
* Final steps for block: drop cleanup lock, record free space in the
* FSM
*/
- if (prunestate.has_lpdead_items && vacrel->do_index_vacuuming)
+ if ((prunestate.num_offsets > 0) && vacrel->do_index_vacuuming)
{
/*
* Wait until lazy_vacuum_heap_rel() to save free space. This
@@ -1249,7 +1260,7 @@ lazy_scan_heap(LVRelState *vacrel)
* Do index vacuuming (call each index's ambulkdelete routine), then do
* related heap vacuuming
*/
- if (dead_items->num_items > 0)
+ if (tidstore_num_tids(dead_items) > 0)
lazy_vacuum(vacrel);
/*
@@ -1524,9 +1535,9 @@ lazy_scan_new_or_empty(LVRelState *vacrel, Buffer buf, BlockNumber blkno,
* The approach we take now is to restart pruning when the race condition is
* detected. This allows heap_page_prune() to prune the tuples inserted by
* the now-aborted transaction. This is a little crude, but it guarantees
- * that any items that make it into the dead_items array are simple LP_DEAD
- * line pointers, and that every remaining item with tuple storage is
- * considered as a candidate for freezing.
+ * that any items that make it into the dead_items are simple LP_DEAD line
+ * pointers, and that every remaining item with tuple storage is considered
+ * as a candidate for freezing.
*/
static void
lazy_scan_prune(LVRelState *vacrel,
@@ -1543,13 +1554,11 @@ lazy_scan_prune(LVRelState *vacrel,
HTSV_Result res;
int tuples_deleted,
tuples_frozen,
- lpdead_items,
live_tuples,
recently_dead_tuples;
int nnewlpdead;
HeapPageFreeze pagefrz;
int64 fpi_before = pgWalUsage.wal_fpi;
- OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
HeapTupleFreeze frozen[MaxHeapTuplesPerPage];
Assert(BufferGetBlockNumber(buf) == blkno);
@@ -1571,7 +1580,6 @@ retry:
pagefrz.NoFreezePageRelminMxid = vacrel->NewRelminMxid;
tuples_deleted = 0;
tuples_frozen = 0;
- lpdead_items = 0;
live_tuples = 0;
recently_dead_tuples = 0;
@@ -1580,9 +1588,9 @@ retry:
*
* We count tuples removed by the pruning step as tuples_deleted. Its
* final value can be thought of as the number of tuples that have been
- * deleted from the table. It should not be confused with lpdead_items;
- * lpdead_items's final value can be thought of as the number of tuples
- * that were deleted from indexes.
+ * deleted from the table. It should not be confused with
+ * prunestate->deadoffsets; prunestate->deadoffsets's final value can
+ * be thought of as the number of tuples that were deleted from indexes.
*/
tuples_deleted = heap_page_prune(rel, buf, vacrel->vistest,
InvalidTransactionId, 0, &nnewlpdead,
@@ -1593,7 +1601,7 @@ retry:
* requiring freezing among remaining tuples with storage
*/
prunestate->hastup = false;
- prunestate->has_lpdead_items = false;
+ prunestate->num_offsets = 0;
prunestate->all_visible = true;
prunestate->all_frozen = true;
prunestate->visibility_cutoff_xid = InvalidTransactionId;
@@ -1638,7 +1646,7 @@ retry:
* (This is another case where it's useful to anticipate that any
* LP_DEAD items will become LP_UNUSED during the ongoing VACUUM.)
*/
- deadoffsets[lpdead_items++] = offnum;
+ prunestate->deadoffsets[prunestate->num_offsets++] = offnum;
continue;
}
@@ -1875,7 +1883,7 @@ retry:
*/
#ifdef USE_ASSERT_CHECKING
/* Note that all_frozen value does not matter when !all_visible */
- if (prunestate->all_visible && lpdead_items == 0)
+ if (prunestate->all_visible && prunestate->num_offsets == 0)
{
TransactionId cutoff;
bool all_frozen;
@@ -1888,28 +1896,9 @@ retry:
}
#endif
- /*
- * Now save details of the LP_DEAD items from the page in vacrel
- */
- if (lpdead_items > 0)
+ if (prunestate->num_offsets > 0)
{
- VacDeadItems *dead_items = vacrel->dead_items;
- ItemPointerData tmp;
-
vacrel->lpdead_item_pages++;
- prunestate->has_lpdead_items = true;
-
- ItemPointerSetBlockNumber(&tmp, blkno);
-
- for (int i = 0; i < lpdead_items; i++)
- {
- ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
- dead_items->items[dead_items->num_items++] = tmp;
- }
-
- Assert(dead_items->num_items <= dead_items->max_items);
- pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
- dead_items->num_items);
/*
* It was convenient to ignore LP_DEAD items in all_visible earlier on
@@ -1928,7 +1917,7 @@ retry:
/* Finally, add page-local counts to whole-VACUUM counts */
vacrel->tuples_deleted += tuples_deleted;
vacrel->tuples_frozen += tuples_frozen;
- vacrel->lpdead_items += lpdead_items;
+ vacrel->lpdead_items += prunestate->num_offsets;
vacrel->live_tuples += live_tuples;
vacrel->recently_dead_tuples += recently_dead_tuples;
}
@@ -1940,7 +1929,7 @@ retry:
* lazy_scan_prune, which requires a full cleanup lock. While pruning isn't
* performed here, it's quite possible that an earlier opportunistic pruning
* operation left LP_DEAD items behind. We'll at least collect any such items
- * in the dead_items array for removal from indexes.
+ * in the dead_items for removal from indexes.
*
* For aggressive VACUUM callers, we may return false to indicate that a full
* cleanup lock is required for processing by lazy_scan_prune. This is only
@@ -2099,7 +2088,7 @@ lazy_scan_noprune(LVRelState *vacrel,
vacrel->NewRelfrozenXid = NoFreezePageRelfrozenXid;
vacrel->NewRelminMxid = NoFreezePageRelminMxid;
- /* Save any LP_DEAD items found on the page in dead_items array */
+ /* Save any LP_DEAD items found on the page in dead_items */
if (vacrel->nindexes == 0)
{
/* Using one-pass strategy (since table has no indexes) */
@@ -2129,8 +2118,7 @@ lazy_scan_noprune(LVRelState *vacrel,
}
else
{
- VacDeadItems *dead_items = vacrel->dead_items;
- ItemPointerData tmp;
+ TidStore *dead_items = vacrel->dead_items;
/*
* Page has LP_DEAD items, and so any references/TIDs that remain in
@@ -2139,17 +2127,10 @@ lazy_scan_noprune(LVRelState *vacrel,
*/
vacrel->lpdead_item_pages++;
- ItemPointerSetBlockNumber(&tmp, blkno);
+ tidstore_add_tids(dead_items, blkno, deadoffsets, lpdead_items);
- for (int i = 0; i < lpdead_items; i++)
- {
- ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
- dead_items->items[dead_items->num_items++] = tmp;
- }
-
- Assert(dead_items->num_items <= dead_items->max_items);
- pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
- dead_items->num_items);
+ pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
+ tidstore_memory_usage(dead_items));
vacrel->lpdead_items += lpdead_items;
@@ -2198,7 +2179,7 @@ lazy_vacuum(LVRelState *vacrel)
if (!vacrel->do_index_vacuuming)
{
Assert(!vacrel->do_index_cleanup);
- vacrel->dead_items->num_items = 0;
+ tidstore_reset(vacrel->dead_items);
return;
}
@@ -2227,7 +2208,7 @@ lazy_vacuum(LVRelState *vacrel)
BlockNumber threshold;
Assert(vacrel->num_index_scans == 0);
- Assert(vacrel->lpdead_items == vacrel->dead_items->num_items);
+ Assert(vacrel->lpdead_items == tidstore_num_tids(vacrel->dead_items));
Assert(vacrel->do_index_vacuuming);
Assert(vacrel->do_index_cleanup);
@@ -2254,8 +2235,8 @@ lazy_vacuum(LVRelState *vacrel)
* cases then this may need to be reconsidered.
*/
threshold = (double) vacrel->rel_pages * BYPASS_THRESHOLD_PAGES;
- bypass = (vacrel->lpdead_item_pages < threshold &&
- vacrel->lpdead_items < MAXDEADITEMS(32L * 1024L * 1024L));
+ bypass = (vacrel->lpdead_item_pages < threshold) &&
+ tidstore_memory_usage(vacrel->dead_items) < (32L * 1024L * 1024L);
}
if (bypass)
@@ -2300,7 +2281,7 @@ lazy_vacuum(LVRelState *vacrel)
* Forget the LP_DEAD items that we just vacuumed (or just decided to not
* vacuum)
*/
- vacrel->dead_items->num_items = 0;
+ tidstore_reset(vacrel->dead_items);
}
/*
@@ -2373,7 +2354,7 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
* place).
*/
Assert(vacrel->num_index_scans > 0 ||
- vacrel->dead_items->num_items == vacrel->lpdead_items);
+ tidstore_num_tids(vacrel->dead_items) == vacrel->lpdead_items);
Assert(allindexes || vacrel->failsafe_active);
/*
@@ -2392,9 +2373,8 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
/*
* lazy_vacuum_heap_rel() -- second pass over the heap for two pass strategy
*
- * This routine marks LP_DEAD items in vacrel->dead_items array as LP_UNUSED.
- * Pages that never had lazy_scan_prune record LP_DEAD items are not visited
- * at all.
+ * This routine marks LP_DEAD items in vacrel->dead_items as LP_UNUSED. Pages
+ * that never had lazy_scan_prune record LP_DEAD items are not visited at all.
*
* We may also be able to truncate the line pointer array of the heap pages we
* visit. If there is a contiguous group of LP_UNUSED items at the end of the
@@ -2410,10 +2390,11 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
static void
lazy_vacuum_heap_rel(LVRelState *vacrel)
{
- int index = 0;
BlockNumber vacuumed_pages = 0;
Buffer vmbuffer = InvalidBuffer;
LVSavedErrInfo saved_err_info;
+ TidStoreIter *iter;
+ TidStoreIterResult *result;
Assert(vacrel->do_index_vacuuming);
Assert(vacrel->do_index_cleanup);
@@ -2428,7 +2409,8 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
VACUUM_ERRCB_PHASE_VACUUM_HEAP,
InvalidBlockNumber, InvalidOffsetNumber);
- while (index < vacrel->dead_items->num_items)
+ iter = tidstore_begin_iterate(vacrel->dead_items);
+ while ((result = tidstore_iterate_next(iter)) != NULL)
{
BlockNumber blkno;
Buffer buf;
@@ -2437,7 +2419,7 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
vacuum_delay_point();
- blkno = ItemPointerGetBlockNumber(&vacrel->dead_items->items[index]);
+ blkno = result->blkno;
vacrel->blkno = blkno;
/*
@@ -2451,7 +2433,8 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
buf = ReadBufferExtended(vacrel->rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
vacrel->bstrategy);
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
- index = lazy_vacuum_heap_page(vacrel, blkno, buf, index, vmbuffer);
+ lazy_vacuum_heap_page(vacrel, blkno, result->offsets, result->num_offsets,
+ buf, vmbuffer);
/* Now that we've vacuumed the page, record its available space */
page = BufferGetPage(buf);
@@ -2461,6 +2444,7 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
vacuumed_pages++;
}
+ tidstore_end_iterate(iter);
vacrel->blkno = InvalidBlockNumber;
if (BufferIsValid(vmbuffer))
@@ -2470,36 +2454,31 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
* We set all LP_DEAD items from the first heap pass to LP_UNUSED during
* the second heap pass. No more, no less.
*/
- Assert(index > 0);
Assert(vacrel->num_index_scans > 1 ||
- (index == vacrel->lpdead_items &&
+ (tidstore_num_tids(vacrel->dead_items) == vacrel->lpdead_items &&
vacuumed_pages == vacrel->lpdead_item_pages));
ereport(DEBUG2,
- (errmsg("table \"%s\": removed %lld dead item identifiers in %u pages",
- vacrel->relname, (long long) index, vacuumed_pages)));
+ (errmsg("table \"%s\": removed " UINT64_FORMAT "dead item identifiers in %u pages",
+ vacrel->relname, tidstore_num_tids(vacrel->dead_items),
+ vacuumed_pages)));
/* Revert to the previous phase information for error traceback */
restore_vacuum_error_info(vacrel, &saved_err_info);
}
/*
- * lazy_vacuum_heap_page() -- free page's LP_DEAD items listed in the
- * vacrel->dead_items array.
+ * lazy_vacuum_heap_page() -- free page's LP_DEAD items.
*
* Caller must have an exclusive buffer lock on the buffer (though a full
* cleanup lock is also acceptable). vmbuffer must be valid and already have
* a pin on blkno's visibility map page.
- *
- * index is an offset into the vacrel->dead_items array for the first listed
- * LP_DEAD item on the page. The return value is the first index immediately
- * after all LP_DEAD items for the same page in the array.
*/
-static int
-lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
- int index, Buffer vmbuffer)
+static void
+lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
+ OffsetNumber *deadoffsets, int num_offsets, Buffer buffer,
+ Buffer vmbuffer)
{
- VacDeadItems *dead_items = vacrel->dead_items;
Page page = BufferGetPage(buffer);
OffsetNumber unused[MaxHeapTuplesPerPage];
int nunused = 0;
@@ -2518,16 +2497,11 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
START_CRIT_SECTION();
- for (; index < dead_items->num_items; index++)
+ for (int i = 0; i < num_offsets; i++)
{
- BlockNumber tblk;
- OffsetNumber toff;
ItemId itemid;
+ OffsetNumber toff = deadoffsets[i];
- tblk = ItemPointerGetBlockNumber(&dead_items->items[index]);
- if (tblk != blkno)
- break; /* past end of tuples for this block */
- toff = ItemPointerGetOffsetNumber(&dead_items->items[index]);
itemid = PageGetItemId(page, toff);
Assert(ItemIdIsDead(itemid) && !ItemIdHasStorage(itemid));
@@ -2597,7 +2571,6 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
/* Revert to the previous phase information for error traceback */
restore_vacuum_error_info(vacrel, &saved_err_info);
- return index;
}
/*
@@ -2687,8 +2660,8 @@ lazy_cleanup_all_indexes(LVRelState *vacrel)
* lazy_vacuum_one_index() -- vacuum index relation.
*
* Delete all the index tuples containing a TID collected in
- * vacrel->dead_items array. Also update running statistics.
- * Exact details depend on index AM's ambulkdelete routine.
+ * vacrel->dead_items. Also update running statistics. Exact
+ * details depend on index AM's ambulkdelete routine.
*
* reltuples is the number of heap tuples to be passed to the
* bulkdelete callback. It's always assumed to be estimated.
@@ -3094,48 +3067,8 @@ count_nondeletable_pages(LVRelState *vacrel, bool *lock_waiter_detected)
}
/*
- * Returns the number of dead TIDs that VACUUM should allocate space to
- * store, given a heap rel of size vacrel->rel_pages, and given current
- * maintenance_work_mem setting (or current autovacuum_work_mem setting,
- * when applicable).
- *
- * See the comments at the head of this file for rationale.
- */
-static int
-dead_items_max_items(LVRelState *vacrel)
-{
- int64 max_items;
- int vac_work_mem = IsAutoVacuumWorkerProcess() &&
- autovacuum_work_mem != -1 ?
- autovacuum_work_mem : maintenance_work_mem;
-
- if (vacrel->nindexes > 0)
- {
- BlockNumber rel_pages = vacrel->rel_pages;
-
- max_items = MAXDEADITEMS(vac_work_mem * 1024L);
- max_items = Min(max_items, INT_MAX);
- max_items = Min(max_items, MAXDEADITEMS(MaxAllocSize));
-
- /* curious coding here to ensure the multiplication can't overflow */
- if ((BlockNumber) (max_items / MaxHeapTuplesPerPage) > rel_pages)
- max_items = rel_pages * MaxHeapTuplesPerPage;
-
- /* stay sane if small maintenance_work_mem */
- max_items = Max(max_items, MaxHeapTuplesPerPage);
- }
- else
- {
- /* One-pass case only stores a single heap page's TIDs at a time */
- max_items = MaxHeapTuplesPerPage;
- }
-
- return (int) max_items;
-}
-
-/*
- * Allocate dead_items (either using palloc, or in dynamic shared memory).
- * Sets dead_items in vacrel for caller.
+ * Allocate a (local or shared) TidStore for storing dead TIDs. Sets dead_items
+ * in vacrel for caller.
*
* Also handles parallel initialization as part of allocating dead_items in
* DSM when required.
@@ -3143,11 +3076,9 @@ dead_items_max_items(LVRelState *vacrel)
static void
dead_items_alloc(LVRelState *vacrel, int nworkers)
{
- VacDeadItems *dead_items;
- int max_items;
-
- max_items = dead_items_max_items(vacrel);
- Assert(max_items >= MaxHeapTuplesPerPage);
+ int vac_work_mem = IsAutoVacuumWorkerProcess() &&
+ autovacuum_work_mem != -1 ?
+ autovacuum_work_mem * 1024L : maintenance_work_mem * 1024L;
/*
* Initialize state for a parallel vacuum. As of now, only one worker can
@@ -3174,7 +3105,7 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
else
vacrel->pvs = parallel_vacuum_init(vacrel->rel, vacrel->indrels,
vacrel->nindexes, nworkers,
- max_items,
+ vac_work_mem, MaxHeapTuplesPerPage,
vacrel->verbose ? INFO : DEBUG2,
vacrel->bstrategy);
@@ -3187,11 +3118,8 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
}
/* Serial VACUUM case */
- dead_items = (VacDeadItems *) palloc(vac_max_items_to_alloc_size(max_items));
- dead_items->max_items = max_items;
- dead_items->num_items = 0;
-
- vacrel->dead_items = dead_items;
+ vacrel->dead_items = tidstore_create(vac_work_mem, MaxHeapTuplesPerPage,
+ NULL);
}
/*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 8608e3fa5b..a526e607fe 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1165,7 +1165,7 @@ CREATE VIEW pg_stat_progress_vacuum AS
END AS phase,
S.param2 AS heap_blks_total, S.param3 AS heap_blks_scanned,
S.param4 AS heap_blks_vacuumed, S.param5 AS index_vacuum_count,
- S.param6 AS max_dead_tuples, S.param7 AS num_dead_tuples
+ S.param6 AS max_dead_tuple_bytes, S.param7 AS dead_tuple_bytes
FROM pg_stat_get_progress_info('VACUUM') AS S
LEFT JOIN pg_database D ON S.datid = D.oid;
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index aa79d9de4d..d8e680ca20 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -97,7 +97,6 @@ static bool vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
static double compute_parallel_delay(void);
static VacOptValue get_vacoptval_from_boolean(DefElem *def);
static bool vac_tid_reaped(ItemPointer itemptr, void *state);
-static int vac_cmp_itemptr(const void *left, const void *right);
/*
* Primary entry point for manual VACUUM and ANALYZE commands
@@ -2303,16 +2302,16 @@ get_vacoptval_from_boolean(DefElem *def)
*/
IndexBulkDeleteResult *
vac_bulkdel_one_index(IndexVacuumInfo *ivinfo, IndexBulkDeleteResult *istat,
- VacDeadItems *dead_items)
+ TidStore *dead_items)
{
/* Do bulk deletion */
istat = index_bulk_delete(ivinfo, istat, vac_tid_reaped,
(void *) dead_items);
ereport(ivinfo->message_level,
- (errmsg("scanned index \"%s\" to remove %d row versions",
+ (errmsg("scanned index \"%s\" to remove " UINT64_FORMAT " row versions",
RelationGetRelationName(ivinfo->index),
- dead_items->num_items)));
+ tidstore_num_tids(dead_items))));
return istat;
}
@@ -2343,82 +2342,15 @@ vac_cleanup_one_index(IndexVacuumInfo *ivinfo, IndexBulkDeleteResult *istat)
return istat;
}
-/*
- * Returns the total required space for VACUUM's dead_items array given a
- * max_items value.
- */
-Size
-vac_max_items_to_alloc_size(int max_items)
-{
- Assert(max_items <= MAXDEADITEMS(MaxAllocSize));
-
- return offsetof(VacDeadItems, items) + sizeof(ItemPointerData) * max_items;
-}
-
/*
* vac_tid_reaped() -- is a particular tid deletable?
*
* This has the right signature to be an IndexBulkDeleteCallback.
- *
- * Assumes dead_items array is sorted (in ascending TID order).
*/
static bool
vac_tid_reaped(ItemPointer itemptr, void *state)
{
- VacDeadItems *dead_items = (VacDeadItems *) state;
- int64 litem,
- ritem,
- item;
- ItemPointer res;
-
- litem = itemptr_encode(&dead_items->items[0]);
- ritem = itemptr_encode(&dead_items->items[dead_items->num_items - 1]);
- item = itemptr_encode(itemptr);
-
- /*
- * Doing a simple bound check before bsearch() is useful to avoid the
- * extra cost of bsearch(), especially if dead items on the heap are
- * concentrated in a certain range. Since this function is called for
- * every index tuple, it pays to be really fast.
- */
- if (item < litem || item > ritem)
- return false;
-
- res = (ItemPointer) bsearch(itemptr,
- dead_items->items,
- dead_items->num_items,
- sizeof(ItemPointerData),
- vac_cmp_itemptr);
-
- return (res != NULL);
-}
-
-/*
- * Comparator routines for use with qsort() and bsearch().
- */
-static int
-vac_cmp_itemptr(const void *left, const void *right)
-{
- BlockNumber lblk,
- rblk;
- OffsetNumber loff,
- roff;
-
- lblk = ItemPointerGetBlockNumber((ItemPointer) left);
- rblk = ItemPointerGetBlockNumber((ItemPointer) right);
-
- if (lblk < rblk)
- return -1;
- if (lblk > rblk)
- return 1;
-
- loff = ItemPointerGetOffsetNumber((ItemPointer) left);
- roff = ItemPointerGetOffsetNumber((ItemPointer) right);
-
- if (loff < roff)
- return -1;
- if (loff > roff)
- return 1;
+ TidStore *dead_items = (TidStore *) state;
- return 0;
+ return tidstore_lookup_tid(dead_items, itemptr);
}
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index bcd40c80a1..d653683693 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -9,12 +9,11 @@
* In a parallel vacuum, we perform both index bulk deletion and index cleanup
* with parallel worker processes. Individual indexes are processed by one
* vacuum process. ParalleVacuumState contains shared information as well as
- * the memory space for storing dead items allocated in the DSM segment. We
- * launch parallel worker processes at the start of parallel index
- * bulk-deletion and index cleanup and once all indexes are processed, the
- * parallel worker processes exit. Each time we process indexes in parallel,
- * the parallel context is re-initialized so that the same DSM can be used for
- * multiple passes of index bulk-deletion and index cleanup.
+ * the shared TidStore. We launch parallel worker processes at the start of
+ * parallel index bulk-deletion and index cleanup and once all indexes are
+ * processed, the parallel worker processes exit. Each time we process indexes
+ * in parallel, the parallel context is re-initialized so that the same DSM can
+ * be used for multiple passes of index bulk-deletion and index cleanup.
*
* Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
@@ -103,6 +102,9 @@ typedef struct PVShared
/* Counter for vacuuming and cleanup */
pg_atomic_uint32 idx;
+
+ /* Handle of the shared TidStore */
+ tidstore_handle dead_items_handle;
} PVShared;
/* Status used during parallel index vacuum or cleanup */
@@ -166,7 +168,8 @@ struct ParallelVacuumState
PVIndStats *indstats;
/* Shared dead items space among parallel vacuum workers */
- VacDeadItems *dead_items;
+ TidStore *dead_items;
+ dsa_area *dead_items_area;
/* Points to buffer usage area in DSM */
BufferUsage *buffer_usage;
@@ -222,20 +225,23 @@ static void parallel_vacuum_error_callback(void *arg);
*/
ParallelVacuumState *
parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
- int nrequested_workers, int max_items,
- int elevel, BufferAccessStrategy bstrategy)
+ int nrequested_workers, int vac_work_mem,
+ int max_offset, int elevel,
+ BufferAccessStrategy bstrategy)
{
ParallelVacuumState *pvs;
ParallelContext *pcxt;
PVShared *shared;
- VacDeadItems *dead_items;
+ TidStore *dead_items;
PVIndStats *indstats;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
+ void *area_space;
+ dsa_area *dead_items_dsa;
bool *will_parallel_vacuum;
Size est_indstats_len;
Size est_shared_len;
- Size est_dead_items_len;
+ Size dsa_minsize = dsa_minimum_size();
int nindexes_mwm = 0;
int parallel_workers = 0;
int querylen;
@@ -283,9 +289,8 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_estimate_chunk(&pcxt->estimator, est_shared_len);
shm_toc_estimate_keys(&pcxt->estimator, 1);
- /* Estimate size for dead_items -- PARALLEL_VACUUM_KEY_DEAD_ITEMS */
- est_dead_items_len = vac_max_items_to_alloc_size(max_items);
- shm_toc_estimate_chunk(&pcxt->estimator, est_dead_items_len);
+ /* Estimate size for dead tuple DSA -- PARALLEL_VACUUM_KEY_DSA */
+ shm_toc_estimate_chunk(&pcxt->estimator, dsa_minsize);
shm_toc_estimate_keys(&pcxt->estimator, 1);
/*
@@ -351,6 +356,16 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_INDEX_STATS, indstats);
pvs->indstats = indstats;
+ /* Prepare DSA space for dead items */
+ area_space = shm_toc_allocate(pcxt->toc, dsa_minsize);
+ shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_DEAD_ITEMS, area_space);
+ dead_items_dsa = dsa_create_in_place(area_space, dsa_minsize,
+ LWTRANCHE_PARALLEL_VACUUM_DSA,
+ pcxt->seg);
+ dead_items = tidstore_create(vac_work_mem, max_offset, dead_items_dsa);
+ pvs->dead_items = dead_items;
+ pvs->dead_items_area = dead_items_dsa;
+
/* Prepare shared information */
shared = (PVShared *) shm_toc_allocate(pcxt->toc, est_shared_len);
MemSet(shared, 0, est_shared_len);
@@ -360,6 +375,7 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
(nindexes_mwm > 0) ?
maintenance_work_mem / Min(parallel_workers, nindexes_mwm) :
maintenance_work_mem;
+ shared->dead_items_handle = tidstore_get_handle(dead_items);
pg_atomic_init_u32(&(shared->cost_balance), 0);
pg_atomic_init_u32(&(shared->active_nworkers), 0);
@@ -368,15 +384,6 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_SHARED, shared);
pvs->shared = shared;
- /* Prepare the dead_items space */
- dead_items = (VacDeadItems *) shm_toc_allocate(pcxt->toc,
- est_dead_items_len);
- dead_items->max_items = max_items;
- dead_items->num_items = 0;
- MemSet(dead_items->items, 0, sizeof(ItemPointerData) * max_items);
- shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_DEAD_ITEMS, dead_items);
- pvs->dead_items = dead_items;
-
/*
* Allocate space for each worker's BufferUsage and WalUsage; no need to
* initialize
@@ -434,6 +441,9 @@ parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats)
istats[i] = NULL;
}
+ tidstore_destroy(pvs->dead_items);
+ dsa_detach(pvs->dead_items_area);
+
DestroyParallelContext(pvs->pcxt);
ExitParallelMode();
@@ -442,7 +452,7 @@ parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats)
}
/* Returns the dead items space */
-VacDeadItems *
+TidStore *
parallel_vacuum_get_dead_items(ParallelVacuumState *pvs)
{
return pvs->dead_items;
@@ -940,7 +950,9 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
Relation *indrels;
PVIndStats *indstats;
PVShared *shared;
- VacDeadItems *dead_items;
+ TidStore *dead_items;
+ void *area_space;
+ dsa_area *dead_items_area;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
int nindexes;
@@ -984,10 +996,10 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
PARALLEL_VACUUM_KEY_INDEX_STATS,
false);
- /* Set dead_items space */
- dead_items = (VacDeadItems *) shm_toc_lookup(toc,
- PARALLEL_VACUUM_KEY_DEAD_ITEMS,
- false);
+ /* Set dead items */
+ area_space = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_DEAD_ITEMS, false);
+ dead_items_area = dsa_attach_in_place(area_space, seg);
+ dead_items = tidstore_attach(dead_items_area, shared->dead_items_handle);
/* Set cost-based vacuum delay */
VacuumCostActive = (VacuumCostDelay > 0);
@@ -1033,6 +1045,9 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
&wal_usage[ParallelWorkerNumber]);
+ tidstore_detach(pvs.dead_items);
+ dsa_detach(dead_items_area);
+
/* Pop the error context stack */
error_context_stack = errcallback.previous;
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index ff6149a179..a371f6fbba 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -3397,12 +3397,12 @@ check_autovacuum_work_mem(int *newval, void **extra, GucSource source)
return true;
/*
- * We clamp manually-set values to at least 1MB. Since
+ * We clamp manually-set values to at least 2MB. Since
* maintenance_work_mem is always set to at least this value, do the same
* here.
*/
- if (*newval < 1024)
- *newval = 1024;
+ if (*newval < 2048)
+ *newval = 2048;
return true;
}
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 55b3a04097..c223a7dc94 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -192,6 +192,8 @@ static const char *const BuiltinTrancheNames[] = {
"LogicalRepLauncherDSA",
/* LWTRANCHE_LAUNCHER_HASH: */
"LogicalRepLauncherHash",
+ /* LWTRANCHE_PARALLEL_VACUUM_DSA: */
+ "ParallelVacuumDSA",
};
StaticAssertDecl(lengthof(BuiltinTrancheNames) ==
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index b46e3b8c55..27a88b9369 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2312,7 +2312,7 @@ struct config_int ConfigureNamesInt[] =
GUC_UNIT_KB
},
&maintenance_work_mem,
- 65536, 1024, MAX_KILOBYTES,
+ 65536, 2048, MAX_KILOBYTES,
NULL, NULL, NULL
},
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index e5add41352..b209d3cf84 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -23,8 +23,8 @@
#define PROGRESS_VACUUM_HEAP_BLKS_SCANNED 2
#define PROGRESS_VACUUM_HEAP_BLKS_VACUUMED 3
#define PROGRESS_VACUUM_NUM_INDEX_VACUUMS 4
-#define PROGRESS_VACUUM_MAX_DEAD_TUPLES 5
-#define PROGRESS_VACUUM_NUM_DEAD_TUPLES 6
+#define PROGRESS_VACUUM_MAX_DEAD_TUPLE_BYTES 5
+#define PROGRESS_VACUUM_DEAD_TUPLE_BYTES 6
/* Phases of vacuum (as advertised via PROGRESS_VACUUM_PHASE) */
#define PROGRESS_VACUUM_PHASE_SCAN_HEAP 1
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 689dbb7702..a3ebb169ef 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -17,6 +17,7 @@
#include "access/htup.h"
#include "access/genam.h"
#include "access/parallel.h"
+#include "access/tidstore.h"
#include "catalog/pg_class.h"
#include "catalog/pg_statistic.h"
#include "catalog/pg_type.h"
@@ -276,21 +277,6 @@ struct VacuumCutoffs
MultiXactId MultiXactCutoff;
};
-/*
- * VacDeadItems stores TIDs whose index tuples are deleted by index vacuuming.
- */
-typedef struct VacDeadItems
-{
- int max_items; /* # slots allocated in array */
- int num_items; /* current # of entries */
-
- /* Sorted array of TIDs to delete from indexes */
- ItemPointerData items[FLEXIBLE_ARRAY_MEMBER];
-} VacDeadItems;
-
-#define MAXDEADITEMS(avail_mem) \
- (((avail_mem) - offsetof(VacDeadItems, items)) / sizeof(ItemPointerData))
-
/* GUC parameters */
extern PGDLLIMPORT int default_statistics_target; /* PGDLLIMPORT for PostGIS */
extern PGDLLIMPORT int vacuum_freeze_min_age;
@@ -339,18 +325,17 @@ extern Relation vacuum_open_relation(Oid relid, RangeVar *relation,
LOCKMODE lmode);
extern IndexBulkDeleteResult *vac_bulkdel_one_index(IndexVacuumInfo *ivinfo,
IndexBulkDeleteResult *istat,
- VacDeadItems *dead_items);
+ TidStore *dead_items);
extern IndexBulkDeleteResult *vac_cleanup_one_index(IndexVacuumInfo *ivinfo,
IndexBulkDeleteResult *istat);
-extern Size vac_max_items_to_alloc_size(int max_items);
/* in commands/vacuumparallel.c */
extern ParallelVacuumState *parallel_vacuum_init(Relation rel, Relation *indrels,
int nindexes, int nrequested_workers,
- int max_items, int elevel,
- BufferAccessStrategy bstrategy);
+ int vac_work_mem, int max_offset,
+ int elevel, BufferAccessStrategy bstrategy);
extern void parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats);
-extern VacDeadItems *parallel_vacuum_get_dead_items(ParallelVacuumState *pvs);
+extern TidStore *parallel_vacuum_get_dead_items(ParallelVacuumState *pvs);
extern void parallel_vacuum_bulkdel_all_indexes(ParallelVacuumState *pvs,
long num_table_tuples,
int num_index_scans);
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 07002fdfbe..537b34b30c 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -207,6 +207,7 @@ typedef enum BuiltinTrancheIds
LWTRANCHE_PGSTATS_DATA,
LWTRANCHE_LAUNCHER_DSA,
LWTRANCHE_LAUNCHER_HASH,
+ LWTRANCHE_PARALLEL_VACUUM_DSA,
LWTRANCHE_FIRST_USER_DEFINED
} BuiltinTrancheIds;
diff --git a/src/test/regress/expected/cluster.out b/src/test/regress/expected/cluster.out
index 2eec483eaa..e04f50726f 100644
--- a/src/test/regress/expected/cluster.out
+++ b/src/test/regress/expected/cluster.out
@@ -526,7 +526,7 @@ create index cluster_sort on clstr_4 (hundred, thousand, tenthous);
-- ensure we don't use the index in CLUSTER nor the checking SELECTs
set enable_indexscan = off;
-- Use external sort:
-set maintenance_work_mem = '1MB';
+set maintenance_work_mem = '2MB';
cluster clstr_4 using cluster_sort;
select * from
(select hundred, lag(hundred) over () as lhundred,
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 6cd57e3eaa..d1889b9d10 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -1214,7 +1214,7 @@ DROP TABLE unlogged_hash_table;
-- CREATE INDEX hash_ovfl_index ON hash_ovfl_heap USING hash (x int4_ops);
-- Test hash index build tuplesorting. Force hash tuplesort using low
-- maintenance_work_mem setting and fillfactor:
-SET maintenance_work_mem = '1MB';
+SET maintenance_work_mem = '2MB';
CREATE INDEX hash_tuplesort_idx ON tenk1 USING hash (stringu1 name_ops) WITH (fillfactor = 10);
EXPLAIN (COSTS OFF)
SELECT count(*) FROM tenk1 WHERE stringu1 = 'TVAAAA';
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index e7a2f5856a..f6ae02eb14 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2020,8 +2020,8 @@ pg_stat_progress_vacuum| SELECT s.pid,
s.param3 AS heap_blks_scanned,
s.param4 AS heap_blks_vacuumed,
s.param5 AS index_vacuum_count,
- s.param6 AS max_dead_tuples,
- s.param7 AS num_dead_tuples
+ s.param6 AS max_dead_tuple_bytes,
+ s.param7 AS dead_tuple_bytes
FROM (pg_stat_get_progress_info('VACUUM'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
LEFT JOIN pg_database d ON ((s.datid = d.oid)));
pg_stat_recovery_prefetch| SELECT stats_reset,
diff --git a/src/test/regress/sql/cluster.sql b/src/test/regress/sql/cluster.sql
index a4cfaae807..a4cb5b98a5 100644
--- a/src/test/regress/sql/cluster.sql
+++ b/src/test/regress/sql/cluster.sql
@@ -258,7 +258,7 @@ create index cluster_sort on clstr_4 (hundred, thousand, tenthous);
set enable_indexscan = off;
-- Use external sort:
-set maintenance_work_mem = '1MB';
+set maintenance_work_mem = '2MB';
cluster clstr_4 using cluster_sort;
select * from
(select hundred, lag(hundred) over () as lhundred,
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index a3738833b2..edb5e4b4f3 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -367,7 +367,7 @@ DROP TABLE unlogged_hash_table;
-- Test hash index build tuplesorting. Force hash tuplesort using low
-- maintenance_work_mem setting and fillfactor:
-SET maintenance_work_mem = '1MB';
+SET maintenance_work_mem = '2MB';
CREATE INDEX hash_tuplesort_idx ON tenk1 USING hash (stringu1 name_ops) WITH (fillfactor = 10);
EXPLAIN (COSTS OFF)
SELECT count(*) FROM tenk1 WHERE stringu1 = 'TVAAAA';
--
2.39.1
Hi,
On Tue, Feb 7, 2023 at 6:25 PM John Naylor <john.naylor@enterprisedb.com> wrote:
On Tue, Jan 31, 2023 at 9:43 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
I've attached v24 patches. The locking support patch is separated
(0005 patch). Also I kept the updates for TidStore and the vacuum
integration from v23 separate.

Okay, that's a lot more simple, and closer to what I imagined. For v25, I squashed v24's additions and added a couple of my own. I've kept the CF status at "needs review" because no specific action is required at the moment.
I did start to review the TID store some more, but that's on hold because something else came up: On a lark I decided to re-run some benchmarks to see if anything got lost in converting to a template, and that led me down a rabbit hole -- some good and bad news on that below.
0001:
I removed the uint64 case, as discussed. There is now a brief commit message, but it needs to be fleshed out a bit. I took another look at the Arm optimization that Nathan found some months ago, for forming the highbit mask, but that doesn't play nicely with how node32 uses it, so I decided against it. I added a comment to describe the reasoning in case someone else gets a similar idea.
I briefly looked into "separate-commit TODO: move non-SIMD fallbacks to their own header to clean up the #ifdef maze.", but decided it wasn't such a clear win to justify starting the work now. It's still in the back of my mind, but I removed the reminder from the commit message.
The changes make sense to me.
0003:
The template now requires the value to be passed as a pointer. That was a pretty trivial change, but affected multiple other patches, so not sent separately. Also adds a forgotten RT_ prefix to the bitmap macros and adds a top comment to the *_impl.h headers. There are some comment fixes. The changes were either trivial or discussed earlier, so also not sent separately.
Great.
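For readers following the thread, a minimal sketch of what the pointer-value API looks like for a caller, using the instantiation macros that appear later in the TidStore patch; the snippet itself is illustrative, not code from any patch:

#define RT_PREFIX local_rt
#define RT_SCOPE static
#define RT_DECLARE
#define RT_DEFINE
#define RT_VALUE_TYPE uint64
#include "lib/radixtree.h"

/* values are now handed to the tree by address, not by value */
uint64		val = UINT64CONST(1) << off;

local_rt_set(tree, key, &val);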
0004/5: I wanted to measure the load time as well as search time in bench_search_random_nodes(). That's kept separate to make it easier to test other patch versions.
The bad news is that the speed of loading TIDs in bench_seq/shuffle_search() has regressed noticeably. I can't reproduce this in any other bench function, which was the reason for writing 0005 to begin with. More confusingly, my efforts to fix this improved *other* functions, but the former didn't budge at all. First the patches:
0006 adds and removes some "inline" declarations (where it made sense), and adds some "pg_noinline" declarations based on Andres' advice some months ago.
Agreed.
0007 removes some dead code. RT_NODE_INSERT_INNER is only called during RT_SET_EXTEND, so it can't possibly find an existing key. This kind of change is much easier with the inner/node cases handled together in a template, as far as being sure of how those cases are different. I thought about trying the search in assert builds and verifying it doesn't exist, but thought yet another #ifdef would be too messy.
Agreed.
v25-addendum-try-no-maintain-order.txt -- It makes keeping the key chunks in order optional for the linear-search nodes. I believe the TID store no longer cares about the ordering, but this is a text file for now because I don't want to clutter the CI with a behavior change. Also, the second ART paper (on concurrency) mentioned that some locking schemes don't allow these arrays to be shifted. So it might make sense to give up entirely on guaranteeing ordered iteration, or at least make it optional as in the patch.
I think it's still important for lazy vacuum that an iteration over a
TID store returns TIDs in ascending order, because otherwise a heap
vacuum does random writes. That being said, we can have
RT_ITERATE_NEXT() return key-value pairs in an order regardless of how
the key chunks are stored in a node.
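As a rough sketch of why that ordering matters for the second heap pass (the iterator function names are from the TidStore patch quoted later in this thread; the result fields and the page-level helper are assumptions for illustration only):

TidStoreIter *iter = tidstore_begin_iterate(dead_items);
TidStoreIterResult *result;		/* assumed to expose blkno and offsets */

/*
 * The radix tree iterates in key order, and the key's high bits are the
 * block number, so blocks come back in ascending order and the pass below
 * writes heap pages sequentially rather than randomly.
 */
while ((result = tidstore_iterate_next(iter)) != NULL)
	vacuum_one_heap_page(result->blkno, result->offsets);	/* hypothetical helper */

tidstore_end_iterate(iter);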
========================================
psql -c "select rt_load_ms, rt_search_ms from bench_seq_search(0, 1 * 1000 * 1000)"
(min load time of three)

v15:
rt_load_ms | rt_search_ms
------------+--------------
113 | 455

v25-0005:
rt_load_ms | rt_search_ms
------------+--------------
135 | 456

v25-0006 (inlining or not):
rt_load_ms | rt_search_ms
------------+--------------
136 | 455

v25-0007 (remove dead code):
rt_load_ms | rt_search_ms
------------+--------------
135 | 455

v25-addendum...txt (no ordering):
rt_load_ms | rt_search_ms
------------+--------------
134 | 455

Note: The regression seems to have started in v17, which is the first with a full template.
Nothing so far has helped here, and previous experience has shown that trying to profile 100ms will not be useful. Instead of putting more effort into diving deeper, it seems a better use of time to write a benchmark that calls the tid store itself. That's more realistic, since this function was intended to test load and search of tids, but the tid store doesn't quite operate so simply anymore. What do you think, Masahiko?
Yeah, that's more realistic. TidStore now encodes TIDs slightly
differently from the benchmark test.
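For reference, a minimal standalone sketch of that encoding, following the logic of tid_to_key_off() in the TidStore patch: the low 6 bits of (block << offset_nbits | offset) select a bit in a 64-bit value, and the remaining high bits form the radix tree key. The constants below assume 8kB pages; the program itself is only an illustration.

#include <stdint.h>
#include <stdio.h>

#define VALUE_NBITS 6	/* log2(64): offset bits kept in the bitmap value */

static uint64_t
encode_key_off(uint32_t block, uint32_t offset, int offset_nbits,
			   uint64_t *off_bit)
{
	uint64_t	tid_i = offset | ((uint64_t) block << offset_nbits);

	/* low 6 bits pick a bit within the 64-bit bitmap value */
	*off_bit = UINT64_C(1) << (tid_i & ((UINT64_C(1) << VALUE_NBITS) - 1));
	/* everything above those 6 bits becomes the radix tree key */
	return tid_i >> VALUE_NBITS;
}

int
main(void)
{
	uint64_t	off_bit;
	/* 9 offset bits cover MaxHeapTuplesPerPage (291) with 8kB pages */
	uint64_t	key = encode_key_off(42, 7, 9, &off_bit);

	printf("key=%llu off_bit=0x%llx\n",
		   (unsigned long long) key, (unsigned long long) off_bit);
	return 0;
}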
I've attached the patch that adds a simple benchmark test using
TidStore. With this test, I got similar trends of results to yours
with gcc, but I've not analyzed them in depth yet.
query: select * from bench_tidstore_load(0, 10 * 1000 * 1000)
v15:
load_ms
---------
816
v25-0007 (remove dead code):
load_ms
---------
839
v25-addendum...txt (no ordering):
load_ms
---------
820
BTW it would be better to remove the RT_DEBUG macro from bench_radix_tree.c.
I'm inclined to keep 0006, because it might give a slight boost, and 0007 because it's never a bad idea to remove dead code.
Yeah, these two changes make sense to me too.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachments:
0001-Add-bench_tidstore_load.patch.txt
From e056133360436e115a434a8a21685a99602a5b5d Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 8 Feb 2023 15:53:14 +0900
Subject: [PATCH] Add bench_tidstore_load()
---
.../bench_radix_tree--1.0.sql | 10 ++++
contrib/bench_radix_tree/bench_radix_tree.c | 46 +++++++++++++++++++
2 files changed, 56 insertions(+)
diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
index 95eedbbe10..fbf51c1086 100644
--- a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
+++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
@@ -75,3 +75,13 @@ OUT rt_sparseload_ms int8
returns record
as 'MODULE_PATHNAME'
LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_tidstore_load(
+minblk int4,
+maxblk int4,
+OUT mem_allocated int8,
+OUT load_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c
index 7d1e2eee57..3c2caa3b90 100644
--- a/contrib/bench_radix_tree/bench_radix_tree.c
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -9,6 +9,7 @@
*/
#include "postgres.h"
+#include "access/tidstore.h"
#include "common/pg_prng.h"
#include "fmgr.h"
#include "funcapi.h"
@@ -54,6 +55,7 @@ PG_FUNCTION_INFO_V1(bench_load_random_int);
PG_FUNCTION_INFO_V1(bench_fixed_height_search);
PG_FUNCTION_INFO_V1(bench_search_random_nodes);
PG_FUNCTION_INFO_V1(bench_node128_load);
+PG_FUNCTION_INFO_V1(bench_tidstore_load);
static uint64
tid_to_key_off(ItemPointer tid, uint32 *off)
@@ -168,6 +170,50 @@ vac_cmp_itemptr(const void *left, const void *right)
}
#endif
+Datum
+bench_tidstore_load(PG_FUNCTION_ARGS)
+{
+ BlockNumber minblk = PG_GETARG_INT32(0);
+ BlockNumber maxblk = PG_GETARG_INT32(1);
+ TidStore *ts;
+ OffsetNumber *offs;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 load_ms;
+ TupleDesc tupdesc;
+ Datum values[2];
+ bool nulls[2] = {false};
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ offs = palloc(sizeof(OffsetNumber) * TIDS_PER_BLOCK_FOR_LOAD);
+ for (int i = 0; i < TIDS_PER_BLOCK_FOR_LOAD; i++)
+ offs[i] = i + 1; /* FirstOffsetNumber is 1 */
+
+ ts = tidstore_create(1 * 1024L * 1024L * 1024L, MaxHeapTuplesPerPage, NULL);
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+ /* load tids */
+ start_time = GetCurrentTimestamp();
+ for (BlockNumber blkno = minblk; blkno < maxblk; blkno++)
+ tidstore_add_tids(ts, blkno, offs, TIDS_PER_BLOCK_FOR_LOAD);
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ load_ms = secs * 1000 + usecs / 1000;
+
+ values[0] = Int64GetDatum(tidstore_memory_usage(ts));
+ values[1] = Int64GetDatum(load_ms);
+
+ tidstore_destroy(ts);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
static Datum
bench_search(FunctionCallInfo fcinfo, bool shuffle)
{
--
2.31.1
On Thu, Feb 9, 2023 at 2:08 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
I think it's still important for lazy vacuum that an iteration over a
TID store returns TIDs in ascending order, because otherwise a heap
vacuum does random writes. That being said, we can have
RT_ITERATE_NEXT() return key-value pairs in an order regardless of how
the key chunks are stored in a node.
Okay, we can keep that possibility in mind if we need to go there.
Note: The regression seems to have started in v17, which is the first
with a full template.
0007 removes some dead code. RT_NODE_INSERT_INNER is only called during
RT_SET_EXTEND, so it can't possibly find an existing key. This kind of
change is much easier with the inner/node cases handled together in a
template, as far as being sure of how those cases are different. I thought
about trying the search in assert builds and verifying it doesn't exist,
but thought yet another #ifdef would be too messy.
It just occurred to me that these facts might be related. v17 was the first
use of the full template, and I decided then I liked one of your earlier
patches where replace_node() calls node_update_inner() better than calling
node_insert_inner() with a NULL parent, which was a bit hard to understand.
That now-dead code was actually used in the latter case for updating the
(original) parent. It's possible that trying to use separate paths
contributed to the regression. I'll try the other way and report back.
I've attached the patch that adds a simple benchmark test using
TidStore. With this test, I got similar trends of results to yours
with gcc, but I've not analyzed them in depth yet.
Thanks for that! I'll take a look.
BTW it would be better to remove the RT_DEBUG macro from
bench_radix_tree.c.
Absolutely.
--
John Naylor
EDB: http://www.enterprisedb.com
On Thu, Feb 9, 2023 at 2:08 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
query: select * from bench_tidstore_load(0, 10 * 1000 * 1000)
v15:
load_ms
---------
816
How did you build the tid store and test on v15? I first tried to
apply v15-0009-PoC-lazy-vacuum-integration.patch, which conflicts with
vacuum now, so reset all that, but still getting build errors because the
tid store types and functions have changed.
--
John Naylor
EDB: http://www.enterprisedb.com
On Fri, Feb 10, 2023 at 3:51 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Thu, Feb 9, 2023 at 2:08 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
query: select * from bench_tidstore_load(0, 10 * 1000 * 1000)
v15:
load_ms
---------
816

How did you build the tid store and test on v15? I first tried to apply v15-0009-PoC-lazy-vacuum-integration.patch, which conflicts with vacuum now, so reset all that, but still getting build errors because the tid store types and functions have changed.
I applied v26-0008-Add-TIDStore-to-store-sets-of-TIDs-ItemPointerDa.patch
on top of v15 radix tree and changed the TidStore so that it uses v15
(non-templated) radixtree. That way, we can test TidStore using v15
radix tree. I've attached the patch that I applied on top of
v26-0008-Add-TIDStore-to-store-sets-of-TIDs-ItemPointerDa.patch.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachments:
change_tidstore_for_v15.patch
commit f2d6acbce26d7e05e64666ae00fca030a657de76
Author: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed Feb 8 15:52:47 2023 +0900
Add TidStore from v26 patch.
diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
index 4c72673ce9..5048400a9f 100644
--- a/src/backend/access/common/tidstore.c
+++ b/src/backend/access/common/tidstore.c
@@ -29,6 +29,7 @@
#include "access/tidstore.h"
#include "miscadmin.h"
#include "port/pg_bitutils.h"
+#include "lib/radixtree.h"
#include "storage/lwlock.h"
#include "utils/dsa.h"
#include "utils/memutils.h"
@@ -74,21 +75,6 @@
/* A magic value used to identify our TidStores. */
#define TIDSTORE_MAGIC 0x826f6a10
-#define RT_PREFIX local_rt
-#define RT_SCOPE static
-#define RT_DECLARE
-#define RT_DEFINE
-#define RT_VALUE_TYPE uint64
-#include "lib/radixtree.h"
-
-#define RT_PREFIX shared_rt
-#define RT_SHMEM
-#define RT_SCOPE static
-#define RT_DECLARE
-#define RT_DEFINE
-#define RT_VALUE_TYPE uint64
-#include "lib/radixtree.h"
-
/* The control object for a TidStore */
typedef struct TidStoreControl
{
@@ -110,7 +96,6 @@ typedef struct TidStoreControl
/* handles for TidStore and radix tree */
tidstore_handle handle;
- shared_rt_handle tree_handle;
} TidStoreControl;
/* Per-backend state for a TidStore */
@@ -125,14 +110,9 @@ struct TidStore
/* Storage for Tids. Use either one depending on TidStoreIsShared() */
union
{
- local_rt_radix_tree *local;
- shared_rt_radix_tree *shared;
+ radix_tree *local;
} tree;
-
- /* DSA area for TidStore if used */
- dsa_area *area;
};
-#define TidStoreIsShared(ts) ((ts)->area != NULL)
/* Iterator for TidStore */
typedef struct TidStoreIter
@@ -142,8 +122,8 @@ typedef struct TidStoreIter
/* iterator of radix tree. Use either one depending on TidStoreIsShared() */
union
{
- shared_rt_iter *shared;
- local_rt_iter *local;
+ rt_iter *shared;
+ rt_iter *local;
} tree_iter;
/* we returned all tids? */
@@ -194,31 +174,10 @@ tidstore_create(size_t max_bytes, int max_offset, dsa_area *area)
* perfectly works in case where the max_bytes is a power-of-2, and the 60%
* threshold works for other cases.
*/
- if (area != NULL)
- {
- dsa_pointer dp;
- float ratio = ((max_bytes & (max_bytes - 1)) == 0) ? 0.75 : 0.6;
-
- ts->tree.shared = shared_rt_create(CurrentMemoryContext, area,
- LWTRANCHE_SHARED_TIDSTORE);
-
- dp = dsa_allocate0(area, sizeof(TidStoreControl));
- ts->control = (TidStoreControl *) dsa_get_address(area, dp);
- ts->control->max_bytes = (uint64) (max_bytes * ratio);
- ts->area = area;
+ ts->tree.local = rt_create(CurrentMemoryContext);
- ts->control->magic = TIDSTORE_MAGIC;
- LWLockInitialize(&ts->control->lock, LWTRANCHE_SHARED_TIDSTORE);
- ts->control->handle = dp;
- ts->control->tree_handle = shared_rt_get_handle(ts->tree.shared);
- }
- else
- {
- ts->tree.local = local_rt_create(CurrentMemoryContext);
-
- ts->control = (TidStoreControl *) palloc0(sizeof(TidStoreControl));
- ts->control->max_bytes = max_bytes - (70 * 1024);
- }
+ ts->control = (TidStoreControl *) palloc0(sizeof(TidStoreControl));
+ ts->control->max_bytes = max_bytes - (70 * 1024);
ts->control->max_offset = max_offset;
ts->control->offset_nbits = pg_ceil_log2_32(max_offset);
@@ -242,50 +201,6 @@ tidstore_create(size_t max_bytes, int max_offset, dsa_area *area)
return ts;
}
-/*
- * Attach to the shared TidStore using a handle. The returned object is
- * allocated in backend-local memory using the CurrentMemoryContext.
- */
-TidStore *
-tidstore_attach(dsa_area *area, tidstore_handle handle)
-{
- TidStore *ts;
- dsa_pointer control;
-
- Assert(area != NULL);
- Assert(DsaPointerIsValid(handle));
-
- /* create per-backend state */
- ts = palloc0(sizeof(TidStore));
-
- /* Find the control object in shared memory */
- control = handle;
-
- /* Set up the TidStore */
- ts->control = (TidStoreControl *) dsa_get_address(area, control);
- Assert(ts->control->magic == TIDSTORE_MAGIC);
-
- ts->tree.shared = shared_rt_attach(area, ts->control->tree_handle);
- ts->area = area;
-
- return ts;
-}
-
-/*
- * Detach from a TidStore. This detaches from radix tree and frees the
- * backend-local resources. The radix tree will continue to exist until
- * it is either explicitly destroyed, or the area that backs it is returned
- * to the operating system.
- */
-void
-tidstore_detach(TidStore *ts)
-{
- Assert(TidStoreIsShared(ts) && ts->control->magic == TIDSTORE_MAGIC);
-
- shared_rt_detach(ts->tree.shared);
- pfree(ts);
-}
-
/*
* Destroy a TidStore, returning all memory.
*
@@ -298,25 +213,8 @@ tidstore_detach(TidStore *ts)
void
tidstore_destroy(TidStore *ts)
{
- if (TidStoreIsShared(ts))
- {
- Assert(ts->control->magic == TIDSTORE_MAGIC);
-
- /*
- * Vandalize the control block to help catch programming error where
- * other backends access the memory formerly occupied by this radix
- * tree.
- */
- ts->control->magic = 0;
- dsa_free(ts->area, ts->control->handle);
- shared_rt_free(ts->tree.shared);
- }
- else
- {
- pfree(ts->control);
- local_rt_free(ts->tree.local);
- }
-
+ pfree(ts->control);
+ rt_free(ts->tree.local);
pfree(ts);
}
@@ -327,39 +225,11 @@ tidstore_destroy(TidStore *ts)
void
tidstore_reset(TidStore *ts)
{
- if (TidStoreIsShared(ts))
- {
- Assert(ts->control->magic == TIDSTORE_MAGIC);
-
- LWLockAcquire(&ts->control->lock, LW_EXCLUSIVE);
-
- /*
- * Free the radix tree and return allocated DSA segments to
- * the operating system.
- */
- shared_rt_free(ts->tree.shared);
- dsa_trim(ts->area);
+ rt_free(ts->tree.local);
+ ts->tree.local = rt_create(CurrentMemoryContext);
- /* Recreate the radix tree */
- ts->tree.shared = shared_rt_create(CurrentMemoryContext, ts->area,
- LWTRANCHE_SHARED_TIDSTORE);
-
- /* update the radix tree handle as we recreated it */
- ts->control->tree_handle = shared_rt_get_handle(ts->tree.shared);
-
- /* Reset the statistics */
- ts->control->num_tids = 0;
-
- LWLockRelease(&ts->control->lock);
- }
- else
- {
- local_rt_free(ts->tree.local);
- ts->tree.local = local_rt_create(CurrentMemoryContext);
-
- /* Reset the statistics */
- ts->control->num_tids = 0;
- }
+ /* Reset the statistics */
+ ts->control->num_tids = 0;
}
/* Add Tids on a block to TidStore */
@@ -372,8 +242,6 @@ tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
uint64 *values;
int nkeys;
- Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
-
if (ts->control->encode_tids)
{
key_base = ((uint64) blkno) << ts->control->offset_key_nbits;
@@ -404,9 +272,6 @@ tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
values[idx] |= UINT64CONST(1) << off;
}
- if (TidStoreIsShared(ts))
- LWLockAcquire(&ts->control->lock, LW_EXCLUSIVE);
-
/* insert the calculated key-values to the tree */
for (int i = 0; i < nkeys; i++)
{
@@ -414,19 +279,13 @@ tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
{
uint64 key = key_base + i;
- if (TidStoreIsShared(ts))
- shared_rt_set(ts->tree.shared, key, &values[i]);
- else
- local_rt_set(ts->tree.local, key, &values[i]);
+ rt_set(ts->tree.local, key, values[i]);
}
}
/* update statistics */
ts->control->num_tids += num_offsets;
- if (TidStoreIsShared(ts))
- LWLockRelease(&ts->control->lock);
-
pfree(values);
}
@@ -441,10 +300,7 @@ tidstore_lookup_tid(TidStore *ts, ItemPointer tid)
key = tid_to_key_off(ts, tid, &off);
- if (TidStoreIsShared(ts))
- found = shared_rt_search(ts->tree.shared, key, &val);
- else
- found = local_rt_search(ts->tree.local, key, &val);
+ found = rt_search(ts->tree.local, key, &val);
if (!found)
return false;
@@ -464,18 +320,13 @@ tidstore_begin_iterate(TidStore *ts)
{
TidStoreIter *iter;
- Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
-
iter = palloc0(sizeof(TidStoreIter));
iter->ts = ts;
iter->result.blkno = InvalidBlockNumber;
iter->result.offsets = palloc(sizeof(OffsetNumber) * ts->control->max_offset);
- if (TidStoreIsShared(ts))
- iter->tree_iter.shared = shared_rt_begin_iterate(ts->tree.shared);
- else
- iter->tree_iter.local = local_rt_begin_iterate(ts->tree.local);
+ iter->tree_iter.local = rt_begin_iterate(ts->tree.local);
/* If the TidStore is empty, there is no business */
if (tidstore_num_tids(ts) == 0)
@@ -487,10 +338,7 @@ tidstore_begin_iterate(TidStore *ts)
static inline bool
tidstore_iter_kv(TidStoreIter *iter, uint64 *key, uint64 *val)
{
- if (TidStoreIsShared(iter->ts))
- return shared_rt_iterate_next(iter->tree_iter.shared, key, val);
-
- return local_rt_iterate_next(iter->tree_iter.local, key, val);
+ return rt_iterate_next(iter->tree_iter.local, key, val);
}
/*
@@ -547,10 +395,7 @@ tidstore_iterate_next(TidStoreIter *iter)
void
tidstore_end_iterate(TidStoreIter *iter)
{
- if (TidStoreIsShared(iter->ts))
- shared_rt_end_iterate(iter->tree_iter.shared);
- else
- local_rt_end_iterate(iter->tree_iter.local);
+ rt_end_iterate(iter->tree_iter.local);
pfree(iter->result.offsets);
pfree(iter);
@@ -560,26 +405,13 @@ tidstore_end_iterate(TidStoreIter *iter)
int64
tidstore_num_tids(TidStore *ts)
{
- uint64 num_tids;
-
- Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
-
- if (!TidStoreIsShared(ts))
- return ts->control->num_tids;
-
- LWLockAcquire(&ts->control->lock, LW_SHARED);
- num_tids = ts->control->num_tids;
- LWLockRelease(&ts->control->lock);
-
- return num_tids;
+ return ts->control->num_tids;
}
/* Return true if the current memory usage of TidStore exceeds the limit */
bool
tidstore_is_full(TidStore *ts)
{
- Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
-
return (tidstore_memory_usage(ts) > ts->control->max_bytes);
}
@@ -587,8 +419,6 @@ tidstore_is_full(TidStore *ts)
size_t
tidstore_max_memory(TidStore *ts)
{
- Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
-
return ts->control->max_bytes;
}
@@ -596,17 +426,7 @@ tidstore_max_memory(TidStore *ts)
size_t
tidstore_memory_usage(TidStore *ts)
{
- Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
-
- /*
- * In the shared case, TidStoreControl and radix_tree are backed by the
- * same DSA area and rt_memory_usage() returns the value including both.
- * So we don't need to add the size of TidStoreControl separately.
- */
- if (TidStoreIsShared(ts))
- return sizeof(TidStore) + shared_rt_memory_usage(ts->tree.shared);
-
- return sizeof(TidStore) + sizeof(TidStore) + local_rt_memory_usage(ts->tree.local);
+ return sizeof(TidStore) + sizeof(TidStore) + rt_memory_usage(ts->tree.local);
}
/*
@@ -615,7 +435,6 @@ tidstore_memory_usage(TidStore *ts)
tidstore_handle
tidstore_get_handle(TidStore *ts)
{
- Assert(TidStoreIsShared(ts) && ts->control->magic == TIDSTORE_MAGIC);
return ts->control->handle;
}
I didn't get any closer to radix-tree regression, but I did find some
inefficiencies in tidstore_add_tids() that are worth talking about first,
addressed in a rough fashion in the attached .txt addendums that I can
clean up and incorporate later.
To start, I can reproduce the regression with this test as well:
select * from bench_tidstore_load(0, 10 * 1000 * 1000);
v15 + v26 store + adjustments:
mem_allocated | load_ms
---------------+---------
98202152 | 1676
v26 0001-0008
mem_allocated | load_ms
---------------+---------
98202032 | 1826
...and reverting to the alternate way to update the parent didn't help:
v26 0001-6, 0008, insert_inner w/ null parent
mem_allocated | load_ms
---------------+---------
98202032 | 1825
...and I'm kind of glad that wasn't the problem, because going back to that
would be a pain for the shmem case.
Running perf doesn't show anything much different in the proportions (note
that rt_set must have been inlined when declared locally in v26):
v15 + v26 store + adjustments:
65.88% postgres postgres [.] tidstore_add_tids
10.74% postgres postgres [.] rt_set
9.20% postgres postgres [.] palloc0
6.49% postgres postgres [.] rt_node_insert_leaf
v26 0001-0008
78.50% postgres postgres [.] tidstore_add_tids
8.88% postgres postgres [.] palloc0
6.24% postgres postgres [.] local_rt_node_insert_leaf
v2699-0001: The first thing I noticed is that palloc0 is taking way more
time than it should, and it's because the compiler doesn't know the
values[] array is small. One reason we need to zero the array is to make
the algorithm agnostic about what order the offsets come in, as I requested
in a previous review. Thinking some more, I was way too paranoid about
that. As long as access methods scan the line pointer array in the usual
way, maybe we can just assert that the keys we create are in order, and
zero any unused array entries as we find them. (I admit I can't actually
think of a reason we would ever encounter offsets out of order.) Also, we
can keep track of the last key we need to consider for insertion into the
radix tree, and ignore the rest. That might shave a few cycles during the
exclusive lock when the max offset of an LP_DEAD item < 64 on a given page,
which I think would be common in the wild. I also got rid of the special
case for non-encoding, since shifting by zero should work the same way.
These together led to a nice speedup on the v26 branch:
mem_allocated | load_ms
---------------+---------
98202032 | 1386
v2699-0002: The next thing I noticed is forming a full ItemPointer to
pass to tid_to_key_off(). That's bad for tidstore_add_tids() because
ItemPointerSetBlockNumber() must do this in order to allow the struct to be
SHORTALIGN'd:
static inline void
BlockIdSet(BlockIdData *blockId, BlockNumber blockNumber)
{
blockId->bi_hi = blockNumber >> 16;
blockId->bi_lo = blockNumber & 0xffff;
}
Then, tid_to_key_off() calls ItemPointerGetBlockNumber(), which must
reverse the above process:
static inline BlockNumber
BlockIdGetBlockNumber(const BlockIdData *blockId)
{
return (((BlockNumber) blockId->bi_hi) << 16) | ((BlockNumber)
blockId->bi_lo);
}
There is no reason to do any of this if we're not reading/writing directly
to/from an on-disk tid etc. To avoid this, I created a new function
encode_key_off() [name could be better], which deals with the raw block
number that we already have. Then turn tid_to_key_off() into a wrapper
around that, since we still need the full conversion for
tidstore_lookup_tid().
v2699-0003: Get rid of all the remaining special cases for encoding or not.
I am unaware of the need to optimize that case or treat it in any way
differently. I haven't tested this on an installation with non-default
blocksize and didn't measure this separately, but 0002+0003 gives:
mem_allocated | load_ms
---------------+---------
98202032 | 1259
If these are acceptable, I can incorporate them into a later patchset. In
any case, speeding up tidstore_add_tids() will make any regressions in the
backing radix tree more obvious. I will take a look at that next week.
--
John Naylor
EDB: http://www.enterprisedb.com
Attachments:
v2699-0002-Do-less-work-when-encoding-key-value.patch.txt
From 6bdd33fa4f55757b54d16ce00dc60a21b929606e Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Sat, 11 Feb 2023 10:45:21 +0700
Subject: [PATCH v2699 2/3] Do less work when encoding key/value
---
src/backend/access/common/tidstore.c | 25 +++++++++++++++----------
1 file changed, 15 insertions(+), 10 deletions(-)
diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
index 5d24680737..3d384cf645 100644
--- a/src/backend/access/common/tidstore.c
+++ b/src/backend/access/common/tidstore.c
@@ -159,6 +159,7 @@ typedef struct TidStoreIter
static void tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, uint64 val);
static inline BlockNumber key_get_blkno(TidStore *ts, uint64 key);
+static inline uint64 encode_key_off(TidStore *ts, BlockNumber block, uint32 offset, uint32 *off);
static inline uint64 tid_to_key_off(TidStore *ts, ItemPointer tid, uint32 *off);
/*
@@ -367,7 +368,6 @@ void
tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
int num_offsets)
{
- ItemPointerData tid;
uint64 *values;
uint64 key;
uint64 prev_key;
@@ -381,16 +381,12 @@ tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
values = palloc(sizeof(uint64) * nkeys);
key = prev_key = key_base;
- ItemPointerSetBlockNumber(&tid, blkno);
-
for (int i = 0; i < num_offsets; i++)
{
uint32 off;
- ItemPointerSetOffsetNumber(&tid, offsets[i]);
-
/* encode the tid to key and val */
- key = tid_to_key_off(ts, &tid, &off);
+ key = encode_key_off(ts, blkno, offsets[i], &off);
/* make sure we scanned the line pointer array in order */
Assert(key >= prev_key);
@@ -681,20 +677,29 @@ key_get_blkno(TidStore *ts, uint64 key)
/* Encode a tid to key and offset */
static inline uint64
tid_to_key_off(TidStore *ts, ItemPointer tid, uint32 *off)
+{
+ uint32 offset = ItemPointerGetOffsetNumber(tid);
+ BlockNumber block = ItemPointerGetBlockNumber(tid);
+
+ return encode_key_off(ts, block, offset, off);
+}
+
+/* encode a block and offset to a key and partial offset */
+static inline uint64
+encode_key_off(TidStore *ts, BlockNumber block, uint32 offset, uint32 *off)
{
uint64 key;
uint64 tid_i;
if (!ts->control->encode_tids)
{
- *off = ItemPointerGetOffsetNumber(tid);
+ *off = offset;
/* Use the block number as the key */
- return (int64) ItemPointerGetBlockNumber(tid);
+ return (int64) block;
}
- tid_i = ItemPointerGetOffsetNumber(tid);
- tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << ts->control->offset_nbits;
+ tid_i = offset | ((uint64) block << ts->control->offset_nbits);
*off = tid_i & ((UINT64CONST(1) << TIDSTORE_VALUE_NBITS) - 1);
key = tid_i >> TIDSTORE_VALUE_NBITS;
--
2.39.1
v2699-0001-Miscellaneous-optimizations-for-tidstore_add_t.patch.txt
From c0bc497f50318c8e31ccdf0c2a9186ffc736abeb Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Fri, 10 Feb 2023 19:56:01 +0700
Subject: [PATCH v2699 1/3] Miscellaneous optimizations for tidstore_add_tids()
- remove palloc0; it's expensive for lengths not known at compile-time
- optimize for case with only one key per heap block
- make some intializations const and branch-free
- when writing to the radix tree, stop at the last non-zero bitmap
---
src/backend/access/common/tidstore.c | 56 ++++++++++++++++++----------
1 file changed, 36 insertions(+), 20 deletions(-)
diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
index 4c72673ce9..5d24680737 100644
--- a/src/backend/access/common/tidstore.c
+++ b/src/backend/access/common/tidstore.c
@@ -368,51 +368,67 @@ tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
int num_offsets)
{
ItemPointerData tid;
- uint64 key_base;
uint64 *values;
- int nkeys;
+ uint64 key;
+ uint64 prev_key;
+ uint64 off_bitmap = 0;
+ int idx;
+ const uint64 key_base = ((uint64) blkno) << ts->control->offset_key_nbits;
+ const int nkeys = UINT64CONST(1) << ts->control->offset_key_nbits;
Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
- if (ts->control->encode_tids)
- {
- key_base = ((uint64) blkno) << ts->control->offset_key_nbits;
- nkeys = UINT64CONST(1) << ts->control->offset_key_nbits;
- }
- else
- {
- key_base = (uint64) blkno;
- nkeys = 1;
- }
- values = palloc0(sizeof(uint64) * nkeys);
+ values = palloc(sizeof(uint64) * nkeys);
+ key = prev_key = key_base;
ItemPointerSetBlockNumber(&tid, blkno);
+
for (int i = 0; i < num_offsets; i++)
{
- uint64 key;
uint32 off;
- int idx;
ItemPointerSetOffsetNumber(&tid, offsets[i]);
/* encode the tid to key and val */
key = tid_to_key_off(ts, &tid, &off);
- idx = key - key_base;
- Assert(idx >= 0 && idx < nkeys);
+ /* make sure we scanned the line pointer array in order */
+ Assert(key >= prev_key);
- values[idx] |= UINT64CONST(1) << off;
+ if (key > prev_key)
+ {
+ idx = prev_key - key_base;
+ Assert(idx >= 0 && idx < nkeys);
+
+ /* write out offset bitmap for this key */
+ values[idx] = off_bitmap;
+
+ /* zero out any gaps up to the current key */
+ for (int empty_idx = idx + 1; empty_idx < key - key_base; empty_idx++)
+ values[empty_idx] = 0;
+
+ /* reset for current key -- the current offset will be handled below */
+ off_bitmap = 0;
+ prev_key = key;
+ }
+
+ off_bitmap |= UINT64CONST(1) << off;
}
+ /* save the final index for later */
+ idx = key - key_base;
+ /* write out last offset bitmap */
+ values[idx] = off_bitmap;
+
if (TidStoreIsShared(ts))
LWLockAcquire(&ts->control->lock, LW_EXCLUSIVE);
/* insert the calculated key-values to the tree */
- for (int i = 0; i < nkeys; i++)
+ for (int i = 0; i <= idx; i++)
{
if (values[i])
{
- uint64 key = key_base + i;
+ key = key_base + i;
if (TidStoreIsShared(ts))
shared_rt_set(ts->tree.shared, key, &values[i]);
--
2.39.1
v2699-0003-Force-all-callers-to-encode-no-matter-how-smal.patch.txt
From 82c1f639aaa64cc943af3b53294a63d5d8f7a9b9 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Sat, 11 Feb 2023 11:51:32 +0700
Subject: [PATCH v2699 3/3] Force all callers to encode, no matter how small
the expected offset
---
src/backend/access/common/tidstore.c | 36 +++++-----------------------
1 file changed, 6 insertions(+), 30 deletions(-)
diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
index 3d384cf645..ff8e66936e 100644
--- a/src/backend/access/common/tidstore.c
+++ b/src/backend/access/common/tidstore.c
@@ -99,7 +99,6 @@ typedef struct TidStoreControl
size_t max_bytes; /* the maximum bytes a TidStore can use */
int max_offset; /* the maximum offset number */
int offset_nbits; /* the number of bits required for max_offset */
- bool encode_tids; /* do we use tid encoding? */
int offset_key_nbits; /* the number of bits of a offset number
* used for the key */
@@ -224,21 +223,15 @@ tidstore_create(size_t max_bytes, int max_offset, dsa_area *area)
ts->control->max_offset = max_offset;
ts->control->offset_nbits = pg_ceil_log2_32(max_offset);
+ if (ts->control->offset_nbits < TIDSTORE_VALUE_NBITS)
+ ts->control->offset_nbits = TIDSTORE_VALUE_NBITS;
+
/*
* We use tid encoding if the number of bits for the offset number doesn't
* fix in a value, uint64.
*/
- if (ts->control->offset_nbits > TIDSTORE_VALUE_NBITS)
- {
- ts->control->encode_tids = true;
- ts->control->offset_key_nbits =
- ts->control->offset_nbits - TIDSTORE_VALUE_NBITS;
- }
- else
- {
- ts->control->encode_tids = false;
- ts->control->offset_key_nbits = 0;
- }
+ ts->control->offset_key_nbits =
+ ts->control->offset_nbits - TIDSTORE_VALUE_NBITS;
return ts;
}
@@ -643,12 +636,6 @@ tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, uint64 val)
uint64 tid_i;
OffsetNumber off;
- if (i > iter->ts->control->max_offset)
- {
- Assert(!iter->ts->control->encode_tids);
- break;
- }
-
if ((val & (UINT64CONST(1) << i)) == 0)
continue;
@@ -668,10 +655,7 @@ tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, uint64 val)
static inline BlockNumber
key_get_blkno(TidStore *ts, uint64 key)
{
- if (ts->control->encode_tids)
- return (BlockNumber) (key >> ts->control->offset_key_nbits);
-
- return (BlockNumber) key;
+ return (BlockNumber) (key >> ts->control->offset_key_nbits);
}
/* Encode a tid to key and offset */
@@ -691,14 +675,6 @@ encode_key_off(TidStore *ts, BlockNumber block, uint32 offset, uint32 *off)
uint64 key;
uint64 tid_i;
- if (!ts->control->encode_tids)
- {
- *off = offset;
-
- /* Use the block number as the key */
- return (int64) block;
- }
-
tid_i = offset | ((uint64) block << ts->control->offset_nbits);
*off = tid_i & ((UINT64CONST(1) << TIDSTORE_VALUE_NBITS) - 1);
--
2.39.1
On Sat, Feb 11, 2023 at 2:33 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
I didn't get any closer to radix-tree regression,
Me neither. It seems that in v26, inserting chunks into node-32 is
slow but needs more analysis. I'll share if I found something
interesting.
but I did find some inefficiencies in tidstore_add_tids() that are worth talking about first, addressed in a rough fashion in the attached .txt addendums that I can clean up and incorporate later.
To start, I can reproduce the regression with this test as well:
select * from bench_tidstore_load(0, 10 * 1000 * 1000);
v15 + v26 store + adjustments:
mem_allocated | load_ms
---------------+---------
98202152 | 1676

v26 0001-0008
mem_allocated | load_ms
---------------+---------
98202032 | 1826

...and reverting to the alternate way to update the parent didn't help:
v26 0001-6, 0008, insert_inner w/ null parent
mem_allocated | load_ms
---------------+---------
98202032 | 1825

...and I'm kind of glad that wasn't the problem, because going back to that would be a pain for the shmem case.
Running perf doesn't show anything much different in the proportions (note that rt_set must have been inlined when declared locally in v26):
v15 + v26 store + adjustments:
65.88% postgres postgres [.] tidstore_add_tids
10.74% postgres postgres [.] rt_set
9.20% postgres postgres [.] palloc0
6.49% postgres postgres [.] rt_node_insert_leaf

v26 0001-0008
78.50% postgres postgres [.] tidstore_add_tids
8.88% postgres postgres [.] palloc0
6.24% postgres postgres [.] local_rt_node_insert_leaf

v2699-0001: The first thing I noticed is that palloc0 is taking way more time than it should, and it's because the compiler doesn't know the values[] array is small. One reason we need to zero the array is to make the algorithm agnostic about what order the offsets come in, as I requested in a previous review. Thinking some more, I was way too paranoid about that. As long as access methods scan the line pointer array in the usual way, maybe we can just assert that the keys we create are in order, and zero any unused array entries as we find them. (I admit I can't actually think of a reason we would ever encounter offsets out of order.)
I can think of one case: traversing a HOT chain could visit offsets out of
order. But fortunately, in the heap case, we prune such collected TIDs
before the heap vacuum.
Also, we can keep track of the last key we need to consider for insertion into the radix tree, and ignore the rest. That might shave a few cycles during the exclusive lock when the max offset of an LP_DEAD item < 64 on a given page, which I think would be common in the wild. I also got rid of the special case for non-encoding, since shifting by zero should work the same way. These together led to a nice speedup on the v26 branch:
mem_allocated | load_ms
---------------+---------
98202032 | 1386

v2699-0002: The next thing I noticed is forming a full ItemPointer to pass to tid_to_key_off(). That's bad for tidstore_add_tids() because ItemPointerSetBlockNumber() must do this in order to allow the struct to be SHORTALIGN'd:
static inline void
BlockIdSet(BlockIdData *blockId, BlockNumber blockNumber)
{
blockId->bi_hi = blockNumber >> 16;
blockId->bi_lo = blockNumber & 0xffff;
}

Then, tid_to_key_off() calls ItemPointerGetBlockNumber(), which must reverse the above process:
static inline BlockNumber
BlockIdGetBlockNumber(const BlockIdData *blockId)
{
return (((BlockNumber) blockId->bi_hi) << 16) | ((BlockNumber) blockId->bi_lo);
}

There is no reason to do any of this if we're not reading/writing directly to/from an on-disk tid etc. To avoid this, I created a new function encode_key_off() [name could be better], which deals with the raw block number that we already have. Then turn tid_to_key_off() into a wrapper around that, since we still need the full conversion for tidstore_lookup_tid().
v2699-0003: Get rid of all the remaining special cases for encoding or not. I am unaware of the need to optimize that case or treat it in any way differently. I haven't tested this on an installation with non-default blocksize and didn't measure this separately, but 0002+0003 gives:
mem_allocated | load_ms
---------------+---------
98202032 | 1259

If these are acceptable, I can incorporate them into a later patchset.
These are nice improvements! I agree with all changes.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Mon, Feb 13, 2023 at 2:51 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
On Sat, Feb 11, 2023 at 2:33 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
I didn't get any closer to radix-tree regression,
Me neither. It seems that in v26, inserting chunks into node-32 is
slow but needs more analysis. I'll share if I found something
interesting.
If that were the case, then the other benchmarks I ran would likely have
slowed down as well, but they are the same or faster. There is one
microbenchmark I didn't run before: "select * from
bench_fixed_height_search(15)" (15 to reduce noise from growing size class,
and despite the name it measures load time as well). Trying this now shows
no difference: a few runs range 19 to 21ms in each version. That also
reinforces that update_inner is fine and that the move to value pointer API
didn't regress.
Changing TIDS_PER_BLOCK_FOR_LOAD to 1 to stress the tree more gives (min of
5, perf run separate from measurements):
v15 + v26 store:
mem_allocated | load_ms
---------------+---------
98202152 | 553
19.71% postgres postgres [.] tidstore_add_tids
+ 31.47% postgres postgres [.] rt_set
= 51.18%
20.62% postgres postgres [.] rt_node_insert_leaf
6.05% postgres postgres [.] AllocSetAlloc
4.74% postgres postgres [.] AllocSetFree
4.62% postgres postgres [.] palloc
2.23% postgres postgres [.] SlabAlloc
v26:
mem_allocated | load_ms
---------------+---------
98202032 | 617
57.45% postgres postgres [.] tidstore_add_tids
20.67% postgres postgres [.] local_rt_node_insert_leaf
5.99% postgres postgres [.] AllocSetAlloc
3.55% postgres postgres [.] palloc
3.05% postgres postgres [.] AllocSetFree
2.05% postgres postgres [.] SlabAlloc
So it seems the store itself got faster when we removed shared memory paths
from the v26 store to test it against v15.
I thought to favor the local memory case in the tidstore by controlling
inlining -- it's smaller and will be called much more often, so I tried the
following (done in 0007)
#define RT_PREFIX shared_rt
#define RT_SHMEM
-#define RT_SCOPE static
+#define RT_SCOPE static pg_noinline
That brings it down to
mem_allocated | load_ms
---------------+---------
98202032 | 590
That's better, but still not within noise level. Perhaps some slowdown
is unavoidable, but it would be nice to understand why.
I can think of one case: traversing a HOT chain could visit offsets out of
order. But fortunately, in the heap case, we prune such collected TIDs
before the heap vacuum.
Further, currently we *already* assume we populate the tid array in order
(for binary search), so we can just continue assuming that (with an assert
added since it's more public in this form). I'm not sure why such basic
common sense evaded me a few versions ago...
If these are acceptable, I can incorporate them into a later patchset.
These are nice improvements! I agree with all changes.
Great, I've squashed these into the tidstore patch (0004). Also added 0005,
which is just a simplification.
I squashed the earlier dead code removal into the radix tree patch.
v27-0008 measures tid store iteration performance and adds a stub function
to prevent spurious warnings, so the benchmarking module can always be
built.
Getting the list of offsets from the old array for a given block is always
trivial, but tidstore_iter_extract_tids() is doing a huge amount of
unnecessary work when TIDS_PER_BLOCK_FOR_LOAD is 1, enough to exceed the
load time:
mem_allocated | load_ms | iter_ms
---------------+---------+---------
98202032 | 589 | 915
Fortunately, it's an easy fix, done in 0009.
mem_allocated | load_ms | iter_ms
---------------+---------+---------
98202032 | 589 | 153
I'll soon resume more cosmetic review of the tid store, but this is enough
to post.
--
John Naylor
EDB: http://www.enterprisedb.com
Attachments:
v27-0001-Introduce-helper-SIMD-functions-for-small-byte-a.patch
From d577ef9d9755e7ca4d3722c1a044381a81d66244 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 24 Oct 2022 14:07:09 +0900
Subject: [PATCH v27 1/9] Introduce helper SIMD functions for small byte arrays
vector8_min - helper for emulating ">=" semantics
vector8_highbit_mask - used to turn the result of a vector
comparison into a bitmask
Masahiko Sawada
Reviewed by Nathan Bossart, additional adjustments by me
Discussion: https://www.postgresql.org/message-id/CAD21AoDap240WDDdUDE0JMpCmuMMnGajrKrkCRxM7zn9Xk3JRA%40mail.gmail.com
---
src/include/port/simd.h | 47 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 47 insertions(+)
diff --git a/src/include/port/simd.h b/src/include/port/simd.h
index c836360d4b..350e2caaea 100644
--- a/src/include/port/simd.h
+++ b/src/include/port/simd.h
@@ -79,6 +79,7 @@ static inline bool vector8_has_le(const Vector8 v, const uint8 c);
static inline bool vector8_is_highbit_set(const Vector8 v);
#ifndef USE_NO_SIMD
static inline bool vector32_is_highbit_set(const Vector32 v);
+static inline uint32 vector8_highbit_mask(const Vector8 v);
#endif
/* arithmetic operations */
@@ -96,6 +97,7 @@ static inline Vector8 vector8_ssub(const Vector8 v1, const Vector8 v2);
*/
#ifndef USE_NO_SIMD
static inline Vector8 vector8_eq(const Vector8 v1, const Vector8 v2);
+static inline Vector8 vector8_min(const Vector8 v1, const Vector8 v2);
static inline Vector32 vector32_eq(const Vector32 v1, const Vector32 v2);
#endif
@@ -299,6 +301,36 @@ vector32_is_highbit_set(const Vector32 v)
}
#endif /* ! USE_NO_SIMD */
+/*
+ * Return a bitmask formed from the high-bit of each element.
+ */
+#ifndef USE_NO_SIMD
+static inline uint32
+vector8_highbit_mask(const Vector8 v)
+{
+#ifdef USE_SSE2
+ return (uint32) _mm_movemask_epi8(v);
+#elif defined(USE_NEON)
+ /*
+ * Note: There is a faster way to do this, but it returns a uint64 and
+ * and if the caller wanted to extract the bit position using CTZ,
+ * it would have to divide that result by 4.
+ */
+ static const uint8 mask[16] = {
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ };
+
+ uint8x16_t masked = vandq_u8(vld1q_u8(mask), (uint8x16_t) vshrq_n_s8(v, 7));
+ uint8x16_t maskedhi = vextq_u8(masked, masked, 8);
+
+ return (uint32) vaddvq_u16((uint16x8_t) vzip1q_u8(masked, maskedhi));
+#endif
+}
+#endif /* ! USE_NO_SIMD */
+
/*
* Return the bitwise OR of the inputs
*/
@@ -372,4 +404,19 @@ vector32_eq(const Vector32 v1, const Vector32 v2)
}
#endif /* ! USE_NO_SIMD */
+/*
+ * Given two vectors, return a vector with the minimum element of each.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_min(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+ return _mm_min_epu8(v1, v2);
+#elif defined(USE_NEON)
+ return vminq_u8(v1, v2);
+#endif
+}
+#endif /* ! USE_NO_SIMD */
+
#endif /* SIMD_H */
--
2.39.1
v27-0002-Move-some-bitmap-logic-out-of-bitmapset.c.patch
From 149a49f51f7a16b7c1eb762e704f1ec476ecb65a Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Tue, 6 Dec 2022 13:39:41 +0700
Subject: [PATCH v27 2/9] Move some bitmap logic out of bitmapset.c
Add pg_rightmost_one32/64 functions. This functionality was previously
private to bitmapset.c as the RIGHTMOST_ONE macro. It has practical
use in other contexts, so move to pg_bitutils.h.
Also move the logic for selecting appropriate pg_bitutils functions
based on word size to bitmapset.h for wider visibility and add
appropriate selection for bmw_rightmost_one().
Since the previous macro relied on casting to signedbitmapword,
and the new functions do not, remove that typedef.
Design input and review by Tom Lane
Discussion: https://www.postgresql.org/message-id/CAFBsxsFW2JjTo58jtDB%2B3sZhxMx3t-3evew8%3DAcr%2BGGhC%2BkFaA%40mail.gmail.com
---
src/backend/nodes/bitmapset.c | 36 ++------------------------------
src/include/nodes/bitmapset.h | 16 ++++++++++++--
src/include/port/pg_bitutils.h | 31 +++++++++++++++++++++++++++
src/tools/pgindent/typedefs.list | 1 -
4 files changed, 47 insertions(+), 37 deletions(-)
diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index 98660524ad..fcd8e2ccbc 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -32,39 +32,7 @@
#define BITMAPSET_SIZE(nwords) \
(offsetof(Bitmapset, words) + (nwords) * sizeof(bitmapword))
-/*----------
- * This is a well-known cute trick for isolating the rightmost one-bit
- * in a word. It assumes two's complement arithmetic. Consider any
- * nonzero value, and focus attention on the rightmost one. The value is
- * then something like
- * xxxxxx10000
- * where x's are unspecified bits. The two's complement negative is formed
- * by inverting all the bits and adding one. Inversion gives
- * yyyyyy01111
- * where each y is the inverse of the corresponding x. Incrementing gives
- * yyyyyy10000
- * and then ANDing with the original value gives
- * 00000010000
- * This works for all cases except original value = zero, where of course
- * we get zero.
- *----------
- */
-#define RIGHTMOST_ONE(x) ((signedbitmapword) (x) & -((signedbitmapword) (x)))
-
-#define HAS_MULTIPLE_ONES(x) ((bitmapword) RIGHTMOST_ONE(x) != (x))
-
-/* Select appropriate bit-twiddling functions for bitmap word size */
-#if BITS_PER_BITMAPWORD == 32
-#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos32(w)
-#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos32(w)
-#define bmw_popcount(w) pg_popcount32(w)
-#elif BITS_PER_BITMAPWORD == 64
-#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos64(w)
-#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos64(w)
-#define bmw_popcount(w) pg_popcount64(w)
-#else
-#error "invalid BITS_PER_BITMAPWORD"
-#endif
+#define HAS_MULTIPLE_ONES(x) (bmw_rightmost_one(x) != (x))
/*
@@ -1013,7 +981,7 @@ bms_first_member(Bitmapset *a)
{
int result;
- w = RIGHTMOST_ONE(w);
+ w = bmw_rightmost_one(w);
a->words[wordnum] &= ~w;
result = wordnum * BITS_PER_BITMAPWORD;
diff --git a/src/include/nodes/bitmapset.h b/src/include/nodes/bitmapset.h
index 3d2225e1ae..5f9a511b4a 100644
--- a/src/include/nodes/bitmapset.h
+++ b/src/include/nodes/bitmapset.h
@@ -38,13 +38,11 @@ struct List;
#define BITS_PER_BITMAPWORD 64
typedef uint64 bitmapword; /* must be an unsigned type */
-typedef int64 signedbitmapword; /* must be the matching signed type */
#else
#define BITS_PER_BITMAPWORD 32
typedef uint32 bitmapword; /* must be an unsigned type */
-typedef int32 signedbitmapword; /* must be the matching signed type */
#endif
@@ -75,6 +73,20 @@ typedef enum
BMS_MULTIPLE /* >1 member */
} BMS_Membership;
+/* Select appropriate bit-twiddling functions for bitmap word size */
+#if BITS_PER_BITMAPWORD == 32
+#define bmw_rightmost_one(w) pg_rightmost_one32(w)
+#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos32(w)
+#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos32(w)
+#define bmw_popcount(w) pg_popcount32(w)
+#elif BITS_PER_BITMAPWORD == 64
+#define bmw_rightmost_one(w) pg_rightmost_one64(w)
+#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos64(w)
+#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos64(w)
+#define bmw_popcount(w) pg_popcount64(w)
+#else
+#error "invalid BITS_PER_BITMAPWORD"
+#endif
/*
* function prototypes in nodes/bitmapset.c
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index a49a9c03d9..7235ad25ee 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -17,6 +17,37 @@ extern PGDLLIMPORT const uint8 pg_leftmost_one_pos[256];
extern PGDLLIMPORT const uint8 pg_rightmost_one_pos[256];
extern PGDLLIMPORT const uint8 pg_number_of_ones[256];
+/*----------
+ * This is a well-known cute trick for isolating the rightmost one-bit
+ * in a word. It assumes two's complement arithmetic. Consider any
+ * nonzero value, and focus attention on the rightmost one. The value is
+ * then something like
+ * xxxxxx10000
+ * where x's are unspecified bits. The two's complement negative is formed
+ * by inverting all the bits and adding one. Inversion gives
+ * yyyyyy01111
+ * where each y is the inverse of the corresponding x. Incrementing gives
+ * yyyyyy10000
+ * and then ANDing with the original value gives
+ * 00000010000
+ * This works for all cases except original value = zero, where of course
+ * we get zero.
+ *----------
+ */
+static inline uint32
+pg_rightmost_one32(uint32 word)
+{
+ int32 result = (int32) word & -((int32) word);
+ return (uint32) result;
+}
+
+static inline uint64
+pg_rightmost_one64(uint64 word)
+{
+ int64 result = (int64) word & -((int64) word);
+ return (uint64) result;
+}
+
/*
* pg_leftmost_one_pos32
* Returns the position of the most significant set bit in "word",
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 36d1dc0117..a0c60feade 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3669,7 +3669,6 @@ shmem_request_hook_type
shmem_startup_hook_type
sig_atomic_t
sigjmp_buf
-signedbitmapword
sigset_t
size_t
slist_head
--
2.39.1
v27-0005-Do-bitmap-conversion-in-one-place-rather-than-fo.patch
From dba9497b5b587da873fbb2de89570ec8b36d604b Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Sun, 12 Feb 2023 15:17:40 +0700
Subject: [PATCH v27 5/9] Do bitmap conversion in one place rather than forcing
callers to do it
---
src/backend/access/common/tidstore.c | 31 +++++++++++++++-------------
1 file changed, 17 insertions(+), 14 deletions(-)
diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
index ff8e66936e..ad8c0866e2 100644
--- a/src/backend/access/common/tidstore.c
+++ b/src/backend/access/common/tidstore.c
@@ -70,6 +70,7 @@
* and value, respectively.
*/
#define TIDSTORE_VALUE_NBITS 6 /* log(64, 2) */
+#define TIDSTORE_OFFSET_MASK ((1 << TIDSTORE_VALUE_NBITS) - 1)
/* A magic value used to identify our TidStores. */
#define TIDSTORE_MAGIC 0x826f6a10
@@ -158,8 +159,8 @@ typedef struct TidStoreIter
static void tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, uint64 val);
static inline BlockNumber key_get_blkno(TidStore *ts, uint64 key);
-static inline uint64 encode_key_off(TidStore *ts, BlockNumber block, uint32 offset, uint32 *off);
-static inline uint64 tid_to_key_off(TidStore *ts, ItemPointer tid, uint32 *off);
+static inline uint64 encode_key_off(TidStore *ts, BlockNumber block, uint32 offset, uint64 *off_bit);
+static inline uint64 tid_to_key_off(TidStore *ts, ItemPointer tid, uint64 *off_bit);
/*
* Create a TidStore. The returned object is allocated in backend-local memory.
@@ -376,10 +377,10 @@ tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
for (int i = 0; i < num_offsets; i++)
{
- uint32 off;
+ uint64 off_bit;
/* encode the tid to key and val */
- key = encode_key_off(ts, blkno, offsets[i], &off);
+ key = encode_key_off(ts, blkno, offsets[i], &off_bit);
/* make sure we scanned the line pointer array in order */
Assert(key >= prev_key);
@@ -401,7 +402,7 @@ tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
prev_key = key;
}
- off_bitmap |= UINT64CONST(1) << off;
+ off_bitmap |= off_bit;
}
/* save the final index for later */
@@ -441,10 +442,10 @@ tidstore_lookup_tid(TidStore *ts, ItemPointer tid)
{
uint64 key;
uint64 val = 0;
- uint32 off;
+ uint64 off_bit;
bool found;
- key = tid_to_key_off(ts, tid, &off);
+ key = tid_to_key_off(ts, tid, &off_bit);
if (TidStoreIsShared(ts))
found = shared_rt_search(ts->tree.shared, key, &val);
@@ -454,7 +455,7 @@ tidstore_lookup_tid(TidStore *ts, ItemPointer tid)
if (!found)
return false;
- return (val & (UINT64CONST(1) << off)) != 0;
+ return (val & off_bit) != 0;
}
/*
@@ -660,26 +661,28 @@ key_get_blkno(TidStore *ts, uint64 key)
/* Encode a tid to key and offset */
static inline uint64
-tid_to_key_off(TidStore *ts, ItemPointer tid, uint32 *off)
+tid_to_key_off(TidStore *ts, ItemPointer tid, uint64 *off_bit)
{
uint32 offset = ItemPointerGetOffsetNumber(tid);
BlockNumber block = ItemPointerGetBlockNumber(tid);
- return encode_key_off(ts, block, offset, off);
+ return encode_key_off(ts, block, offset, off_bit);
}
/* encode a block and offset to a key and partial offset */
static inline uint64
-encode_key_off(TidStore *ts, BlockNumber block, uint32 offset, uint32 *off)
+encode_key_off(TidStore *ts, BlockNumber block, uint32 offset, uint64 *off_bit)
{
uint64 key;
uint64 tid_i;
+ uint32 off_lower;
- tid_i = offset | ((uint64) block << ts->control->offset_nbits);
+ off_lower = offset & TIDSTORE_OFFSET_MASK;
+ Assert(off_lower < (sizeof(uint64) * BITS_PER_BYTE));
- *off = tid_i & ((UINT64CONST(1) << TIDSTORE_VALUE_NBITS) - 1);
+ *off_bit = UINT64CONST(1) << off_lower;
+ tid_i = offset | ((uint64) block << ts->control->offset_nbits);
key = tid_i >> TIDSTORE_VALUE_NBITS;
- Assert(*off < (sizeof(uint64) * BITS_PER_BYTE));
return key;
}
--
2.39.1
Attachment: v27-0003-Add-radixtree-template.patch (text/x-patch)
From bf9d659187537b250683af321b0167d69c7fb18a Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 14 Sep 2022 12:38:51 +0000
Subject: [PATCH v27 3/9] Add radixtree template
WIP: commit message based on template comments
---
src/backend/utils/mmgr/dsa.c | 12 +
src/include/lib/radixtree.h | 2516 +++++++++++++++++
src/include/lib/radixtree_delete_impl.h | 122 +
src/include/lib/radixtree_insert_impl.h | 328 +++
src/include/lib/radixtree_iter_impl.h | 153 +
src/include/lib/radixtree_search_impl.h | 138 +
src/include/utils/dsa.h | 1 +
src/test/modules/Makefile | 1 +
src/test/modules/meson.build | 1 +
src/test/modules/test_radixtree/.gitignore | 4 +
src/test/modules/test_radixtree/Makefile | 23 +
src/test/modules/test_radixtree/README | 7 +
.../expected/test_radixtree.out | 36 +
src/test/modules/test_radixtree/meson.build | 35 +
.../test_radixtree/sql/test_radixtree.sql | 7 +
.../test_radixtree/test_radixtree--1.0.sql | 8 +
.../modules/test_radixtree/test_radixtree.c | 674 +++++
.../test_radixtree/test_radixtree.control | 4 +
src/tools/pginclude/cpluspluscheck | 6 +
src/tools/pginclude/headerscheck | 6 +
20 files changed, 4082 insertions(+)
create mode 100644 src/include/lib/radixtree.h
create mode 100644 src/include/lib/radixtree_delete_impl.h
create mode 100644 src/include/lib/radixtree_insert_impl.h
create mode 100644 src/include/lib/radixtree_iter_impl.h
create mode 100644 src/include/lib/radixtree_search_impl.h
create mode 100644 src/test/modules/test_radixtree/.gitignore
create mode 100644 src/test/modules/test_radixtree/Makefile
create mode 100644 src/test/modules/test_radixtree/README
create mode 100644 src/test/modules/test_radixtree/expected/test_radixtree.out
create mode 100644 src/test/modules/test_radixtree/meson.build
create mode 100644 src/test/modules/test_radixtree/sql/test_radixtree.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree--1.0.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree.c
create mode 100644 src/test/modules/test_radixtree/test_radixtree.control
diff --git a/src/backend/utils/mmgr/dsa.c b/src/backend/utils/mmgr/dsa.c
index f5a62061a3..80555aefff 100644
--- a/src/backend/utils/mmgr/dsa.c
+++ b/src/backend/utils/mmgr/dsa.c
@@ -1024,6 +1024,18 @@ dsa_set_size_limit(dsa_area *area, size_t limit)
LWLockRelease(DSA_AREA_LOCK(area));
}
+size_t
+dsa_get_total_size(dsa_area *area)
+{
+ size_t size;
+
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_SHARED);
+ size = area->control->total_segment_size;
+ LWLockRelease(DSA_AREA_LOCK(area));
+
+ return size;
+}
+
/*
* Aggressively free all spare memory in the hope of returning DSM segments to
* the operating system.
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
new file mode 100644
index 0000000000..1cdb995e54
--- /dev/null
+++ b/src/include/lib/radixtree.h
@@ -0,0 +1,2516 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.h
+ * Template for adaptive radix tree.
+ *
+ * This module employs the idea from the paper "The Adaptive Radix Tree: ARTful
+ * Indexing for Main-Memory Databases" by Viktor Leis, Alfons Kemper, and Thomas
+ * Neumann, 2013. The radix tree uses adaptive node sizes, a small number of node
+ * types, each with a different numbers of elements. Depending on the number of
+ * children, the appropriate node type is used.
+ *
+ * WIP: notes about traditional radix tree trading off span vs height...
+ *
+ * There are two kinds of nodes, inner nodes and leaves. Inner nodes
+ * map partial keys to child pointers.
+ *
+ * The ART paper mentions three ways to implement leaves:
+ *
+ * "- Single-value leaves: The values are stored using an addi-
+ * tional leaf node type which stores one value.
+ * - Multi-value leaves: The values are stored in one of four
+ * different leaf node types, which mirror the structure of
+ * inner nodes, but contain values instead of pointers.
+ * - Combined pointer/value slots: If values fit into point-
+ * ers, no separate node types are necessary. Instead, each
+ * pointer storage location in an inner node can either
+ * store a pointer or a value."
+ *
+ * We chose "multi-value leaves" to avoid the additional pointer traversal
+ * required by "single-value leaves".
+ *
+ * For simplicity, the key is assumed to be 64-bit unsigned integer. The
+ * tree doesn't need to contain paths where the highest bytes of all keys
+ * are zero. That way, the tree's height adapts to the distribution of keys.
+ *
+ * TODO: In the future it might be worthwhile to offer configurability of
+ * leaf implementation for different use cases. Single-value leaves would
+ * give more flexibility in key type, including variable-length keys.
+ *
+ * There are some optimizations not yet implemented, particularly path
+ * compression and lazy path expansion.
+ *
+ * To handle concurrency, we use a single reader-writer lock for the radix
+ * tree. The radix tree is exclusively locked during write operations such
+ * as RT_SET() and RT_DELETE(), and shared locked during read operations
+ * such as RT_SEARCH(). An iteration also holds the shared lock on the radix
+ * tree until it is completed.
+ *
+ * TODO: The current locking mechanism is not optimized for high concurrency
+ * with mixed read-write workloads. In the future it might be worthwhile
+ * to replace it with the Optimistic Lock Coupling or ROWEX mentioned in
+ * the paper "The ART of Practical Synchronization" by the same authors as
+ * the ART paper, 2016.
+ *
+ * WIP: the radix tree nodes don't shrink.
+ *
+ * To generate a radix tree and associated functions for a use case several
+ * macros have to be #define'ed before this file is included. Including
+ * the file #undef's all those, so a new radix tree can be generated
+ * afterwards.
+ * The relevant parameters are:
+ * - RT_PREFIX - prefix for all symbol names generated. A prefix of 'foo'
+ * will result in radix tree type 'foo_radix_tree' and functions like
+ * 'foo_create'/'foo_free' and so forth.
+ * - RT_DECLARE - if defined function prototypes and type declarations are
+ * generated
+ * - RT_DEFINE - if defined function definitions are generated
+ * - RT_SCOPE - in which scope (e.g. extern, static inline) do function
+ * declarations reside
+ * - RT_VALUE_TYPE - the type of the value.
+ *
+ * Optional parameters:
+ * - RT_SHMEM - if defined, the radix tree is created in the DSA area
+ * so that multiple processes can access it simultaneously.
+ * - RT_DEBUG - if defined add stats tracking and debugging functions
+ *
+ * Interface
+ * ---------
+ *
+ * RT_CREATE - Create a new, empty radix tree
+ * RT_FREE - Free the radix tree
+ * RT_SEARCH - Search a key-value pair
+ * RT_SET - Set a key-value pair
+ * RT_BEGIN_ITERATE - Begin iterating through all key-value pairs
+ * RT_ITERATE_NEXT - Return next key-value pair, if any
+ * RT_END_ITERATE - End iteration
+ * RT_MEMORY_USAGE - Get the memory usage
+ *
+ * Interface for Shared Memory
+ * ---------
+ *
+ * RT_ATTACH - Attach to the radix tree
+ * RT_DETACH - Detach from the radix tree
+ * RT_GET_HANDLE - Return the handle of the radix tree
+ *
+ * Optional Interface
+ * ---------
+ *
+ * RT_DELETE - Delete a key-value pair. Declared/defined only if RT_USE_DELETE is defined
+ *
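+ * Example (illustrative only; the "item_ts" prefix and the uint64 value type
+ * are hypothetical, chosen just for this sketch):
+ *
+ *     #define RT_PREFIX item_ts
+ *     #define RT_SCOPE static
+ *     #define RT_DECLARE
+ *     #define RT_DEFINE
+ *     #define RT_VALUE_TYPE uint64
+ *     #include "lib/radixtree.h"
+ *
+ * generates a local-memory radix tree with functions such as item_ts_create(),
+ * item_ts_set(), item_ts_search(), item_ts_begin_iterate(),
+ * item_ts_iterate_next(), item_ts_end_iterate() and item_ts_free(), e.g.:
+ *
+ *     item_ts_radix_tree *tree;
+ *     uint64 key = 123;
+ *     uint64 val = 42;
+ *
+ *     tree = item_ts_create(CurrentMemoryContext);
+ *     item_ts_set(tree, key, &val);
+ *     if (item_ts_search(tree, key, &val))
+ *         elog(DEBUG1, "found " UINT64_FORMAT, val);
+ *     item_ts_free(tree);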
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/lib/radixtree.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "lib/stringinfo.h"
+#include "miscadmin.h"
+#include "nodes/bitmapset.h"
+#include "port/pg_bitutils.h"
+#include "port/simd.h"
+#include "utils/dsa.h"
+#include "utils/memutils.h"
+
+/* helpers */
+#define RT_MAKE_PREFIX(a) CppConcat(a,_)
+#define RT_MAKE_NAME(name) RT_MAKE_NAME_(RT_MAKE_PREFIX(RT_PREFIX),name)
+#define RT_MAKE_NAME_(a,b) CppConcat(a,b)
+
+/* function declarations */
+#define RT_CREATE RT_MAKE_NAME(create)
+#define RT_FREE RT_MAKE_NAME(free)
+#define RT_SEARCH RT_MAKE_NAME(search)
+#ifdef RT_SHMEM
+#define RT_ATTACH RT_MAKE_NAME(attach)
+#define RT_DETACH RT_MAKE_NAME(detach)
+#define RT_GET_HANDLE RT_MAKE_NAME(get_handle)
+#endif
+#define RT_SET RT_MAKE_NAME(set)
+#define RT_BEGIN_ITERATE RT_MAKE_NAME(begin_iterate)
+#define RT_ITERATE_NEXT RT_MAKE_NAME(iterate_next)
+#define RT_END_ITERATE RT_MAKE_NAME(end_iterate)
+#ifdef RT_USE_DELETE
+#define RT_DELETE RT_MAKE_NAME(delete)
+#endif
+#define RT_MEMORY_USAGE RT_MAKE_NAME(memory_usage)
+#ifdef RT_DEBUG
+#define RT_DUMP RT_MAKE_NAME(dump)
+#define RT_DUMP_NODE RT_MAKE_NAME(dump_node)
+#define RT_DUMP_SEARCH RT_MAKE_NAME(dump_search)
+#define RT_STATS RT_MAKE_NAME(stats)
+#endif
+
+/* internal helper functions (no externally visible prototypes) */
+#define RT_NEW_ROOT RT_MAKE_NAME(new_root)
+#define RT_ALLOC_NODE RT_MAKE_NAME(alloc_node)
+#define RT_INIT_NODE RT_MAKE_NAME(init_node)
+#define RT_FREE_NODE RT_MAKE_NAME(free_node)
+#define RT_FREE_RECURSE RT_MAKE_NAME(free_recurse)
+#define RT_EXTEND RT_MAKE_NAME(extend)
+#define RT_SET_EXTEND RT_MAKE_NAME(set_extend)
+#define RT_SWITCH_NODE_KIND RT_MAKE_NAME(grow_node_kind)
+#define RT_COPY_NODE RT_MAKE_NAME(copy_node)
+#define RT_REPLACE_NODE RT_MAKE_NAME(replace_node)
+#define RT_PTR_GET_LOCAL RT_MAKE_NAME(ptr_get_local)
+#define RT_PTR_ALLOC_IS_VALID RT_MAKE_NAME(ptr_stored_is_valid)
+#define RT_NODE_3_SEARCH_EQ RT_MAKE_NAME(node_3_search_eq)
+#define RT_NODE_32_SEARCH_EQ RT_MAKE_NAME(node_32_search_eq)
+#define RT_NODE_3_GET_INSERTPOS RT_MAKE_NAME(node_3_get_insertpos)
+#define RT_NODE_32_GET_INSERTPOS RT_MAKE_NAME(node_32_get_insertpos)
+#define RT_CHUNK_CHILDREN_ARRAY_SHIFT RT_MAKE_NAME(chunk_children_array_shift)
+#define RT_CHUNK_VALUES_ARRAY_SHIFT RT_MAKE_NAME(chunk_values_array_shift)
+#define RT_CHUNK_CHILDREN_ARRAY_DELETE RT_MAKE_NAME(chunk_children_array_delete)
+#define RT_CHUNK_VALUES_ARRAY_DELETE RT_MAKE_NAME(chunk_values_array_delete)
+#define RT_CHUNK_CHILDREN_ARRAY_COPY RT_MAKE_NAME(chunk_children_array_copy)
+#define RT_CHUNK_VALUES_ARRAY_COPY RT_MAKE_NAME(chunk_values_array_copy)
+#define RT_NODE_125_IS_CHUNK_USED RT_MAKE_NAME(node_125_is_chunk_used)
+#define RT_NODE_INNER_125_GET_CHILD RT_MAKE_NAME(node_inner_125_get_child)
+#define RT_NODE_LEAF_125_GET_VALUE RT_MAKE_NAME(node_leaf_125_get_value)
+#define RT_NODE_INNER_256_IS_CHUNK_USED RT_MAKE_NAME(node_inner_256_is_chunk_used)
+#define RT_NODE_LEAF_256_IS_CHUNK_USED RT_MAKE_NAME(node_leaf_256_is_chunk_used)
+#define RT_NODE_INNER_256_GET_CHILD RT_MAKE_NAME(node_inner_256_get_child)
+#define RT_NODE_LEAF_256_GET_VALUE RT_MAKE_NAME(node_leaf_256_get_value)
+#define RT_NODE_INNER_256_SET RT_MAKE_NAME(node_inner_256_set)
+#define RT_NODE_LEAF_256_SET RT_MAKE_NAME(node_leaf_256_set)
+#define RT_NODE_INNER_256_DELETE RT_MAKE_NAME(node_inner_256_delete)
+#define RT_NODE_LEAF_256_DELETE RT_MAKE_NAME(node_leaf_256_delete)
+#define RT_KEY_GET_SHIFT RT_MAKE_NAME(key_get_shift)
+#define RT_SHIFT_GET_MAX_VAL RT_MAKE_NAME(shift_get_max_val)
+#define RT_NODE_SEARCH_INNER RT_MAKE_NAME(node_search_inner)
+#define RT_NODE_SEARCH_LEAF RT_MAKE_NAME(node_search_leaf)
+#define RT_NODE_UPDATE_INNER RT_MAKE_NAME(node_update_inner)
+#define RT_NODE_DELETE_INNER RT_MAKE_NAME(node_delete_inner)
+#define RT_NODE_DELETE_LEAF RT_MAKE_NAME(node_delete_leaf)
+#define RT_NODE_INSERT_INNER RT_MAKE_NAME(node_insert_inner)
+#define RT_NODE_INSERT_LEAF RT_MAKE_NAME(node_insert_leaf)
+#define RT_NODE_INNER_ITERATE_NEXT RT_MAKE_NAME(node_inner_iterate_next)
+#define RT_NODE_LEAF_ITERATE_NEXT RT_MAKE_NAME(node_leaf_iterate_next)
+#define RT_UPDATE_ITER_STACK RT_MAKE_NAME(update_iter_stack)
+#define RT_ITER_UPDATE_KEY RT_MAKE_NAME(iter_update_key)
+#define RT_VERIFY_NODE RT_MAKE_NAME(verify_node)
+
+/* type declarations */
+#define RT_RADIX_TREE RT_MAKE_NAME(radix_tree)
+#define RT_RADIX_TREE_CONTROL RT_MAKE_NAME(radix_tree_control)
+#define RT_ITER RT_MAKE_NAME(iter)
+#ifdef RT_SHMEM
+#define RT_HANDLE RT_MAKE_NAME(handle)
+#endif
+#define RT_NODE RT_MAKE_NAME(node)
+#define RT_NODE_ITER RT_MAKE_NAME(node_iter)
+#define RT_NODE_BASE_3 RT_MAKE_NAME(node_base_3)
+#define RT_NODE_BASE_32 RT_MAKE_NAME(node_base_32)
+#define RT_NODE_BASE_125 RT_MAKE_NAME(node_base_125)
+#define RT_NODE_BASE_256 RT_MAKE_NAME(node_base_256)
+#define RT_NODE_INNER_3 RT_MAKE_NAME(node_inner_3)
+#define RT_NODE_INNER_32 RT_MAKE_NAME(node_inner_32)
+#define RT_NODE_INNER_125 RT_MAKE_NAME(node_inner_125)
+#define RT_NODE_INNER_256 RT_MAKE_NAME(node_inner_256)
+#define RT_NODE_LEAF_3 RT_MAKE_NAME(node_leaf_3)
+#define RT_NODE_LEAF_32 RT_MAKE_NAME(node_leaf_32)
+#define RT_NODE_LEAF_125 RT_MAKE_NAME(node_leaf_125)
+#define RT_NODE_LEAF_256 RT_MAKE_NAME(node_leaf_256)
+#define RT_SIZE_CLASS RT_MAKE_NAME(size_class)
+#define RT_SIZE_CLASS_ELEM RT_MAKE_NAME(size_class_elem)
+#define RT_SIZE_CLASS_INFO RT_MAKE_NAME(size_class_info)
+#define RT_CLASS_3 RT_MAKE_NAME(class_3)
+#define RT_CLASS_32_MIN RT_MAKE_NAME(class_32_min)
+#define RT_CLASS_32_MAX RT_MAKE_NAME(class_32_max)
+#define RT_CLASS_125 RT_MAKE_NAME(class_125)
+#define RT_CLASS_256 RT_MAKE_NAME(class_256)
+
+/* generate forward declarations necessary to use the radix tree */
+#ifdef RT_DECLARE
+
+typedef struct RT_RADIX_TREE RT_RADIX_TREE;
+typedef struct RT_ITER RT_ITER;
+
+#ifdef RT_SHMEM
+typedef dsa_pointer RT_HANDLE;
+#endif
+
+#ifdef RT_SHMEM
+RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx, dsa_area *dsa, int tranche_id);
+RT_SCOPE RT_RADIX_TREE * RT_ATTACH(dsa_area *dsa, dsa_pointer dp);
+RT_SCOPE void RT_DETACH(RT_RADIX_TREE *tree);
+RT_SCOPE RT_HANDLE RT_GET_HANDLE(RT_RADIX_TREE *tree);
+#else
+RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx);
+#endif
+RT_SCOPE void RT_FREE(RT_RADIX_TREE *tree);
+
+RT_SCOPE bool RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p);
+RT_SCOPE bool RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p);
+#ifdef RT_USE_DELETE
+RT_SCOPE bool RT_DELETE(RT_RADIX_TREE *tree, uint64 key);
+#endif
+
+RT_SCOPE RT_ITER * RT_BEGIN_ITERATE(RT_RADIX_TREE *tree);
+RT_SCOPE bool RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, RT_VALUE_TYPE *value_p);
+RT_SCOPE void RT_END_ITERATE(RT_ITER *iter);
+
+RT_SCOPE uint64 RT_MEMORY_USAGE(RT_RADIX_TREE *tree);
+
+#ifdef RT_DEBUG
+RT_SCOPE void RT_DUMP(RT_RADIX_TREE *tree);
+RT_SCOPE void RT_DUMP_SEARCH(RT_RADIX_TREE *tree, uint64 key);
+RT_SCOPE void RT_STATS(RT_RADIX_TREE *tree);
+#endif
+
+#endif /* RT_DECLARE */
+
+
+/* generate implementation of the radix tree */
+#ifdef RT_DEFINE
+
+/* The number of bits encoded in one tree level */
+#define RT_NODE_SPAN BITS_PER_BYTE
+
+/* The maximum number of slots in a node */
+#define RT_NODE_MAX_SLOTS (1 << RT_NODE_SPAN)
+
+/* Mask for extracting a chunk from the key */
+#define RT_CHUNK_MASK ((1 << RT_NODE_SPAN) - 1)
+
+/* Maximum shift the radix tree uses */
+#define RT_MAX_SHIFT RT_KEY_GET_SHIFT(UINT64_MAX)
+
+/* The maximum number of levels in the radix tree */
+#define RT_MAX_LEVEL ((sizeof(uint64) * BITS_PER_BYTE) / RT_NODE_SPAN)
+
+/*
+ * Number of bits necessary for isset array in the slot-index node.
+ * Since bitmapword can be 64 bits, the only values that make sense
+ * here are 64 and 128.
+ */
+#define RT_SLOT_IDX_LIMIT (RT_NODE_MAX_SLOTS / 2)
+
+/* Invalid index used in node-125 */
+#define RT_INVALID_SLOT_IDX 0xFF
+
+/* Get a chunk from the key */
+#define RT_GET_KEY_CHUNK(key, shift) ((uint8) (((key) >> (shift)) & RT_CHUNK_MASK))
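+
+/*
+ * Illustrative example (not part of the patch): with RT_NODE_SPAN == 8, the
+ * key 0x070503 is split into one chunk per level:
+ *   RT_GET_KEY_CHUNK(0x070503, 16) == 0x07
+ *   RT_GET_KEY_CHUNK(0x070503, 8)  == 0x05
+ *   RT_GET_KEY_CHUNK(0x070503, 0)  == 0x03
+ */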
+
+/* For accessing bitmaps */
+#define RT_BM_IDX(x) ((x) / BITS_PER_BITMAPWORD)
+#define RT_BM_BIT(x) ((x) % BITS_PER_BITMAPWORD)
+
+/*
+ * Node kinds
+ *
+ * The different node kinds are what make the tree "adaptive".
+ *
+ * Each node kind is associated with a different datatype and different
+ * search/set/delete/iterate algorithms adapted for its size. The largest
+ * kind, node256, is basically the same as a traditional radix tree,
+ * and would be most wasteful of memory when sparsely populated. The
+ * smaller nodes expend some additional CPU time to enable a smaller
+ * memory footprint.
+ *
+ * XXX There are 4 node kinds, and this should never be increased,
+ * for several reasons:
+ * 1. With 5 or more kinds, gcc tends to use a jump table for switch
+ * statements.
+ * 2. The 4 kinds can be represented with 2 bits, so we have the option
+ * in the future to tag the node pointer with the kind, even on
+ * platforms with 32-bit pointers. This might speed up node traversal
+ * in trees with highly random node kinds.
+ * 3. We can have multiple size classes per node kind.
+ */
+#define RT_NODE_KIND_3 0x00
+#define RT_NODE_KIND_32 0x01
+#define RT_NODE_KIND_125 0x02
+#define RT_NODE_KIND_256 0x03
+#define RT_NODE_KIND_COUNT 4
+
+/*
+ * Calculate the slab blocksize so that we can allocate at least 32 chunks
+ * from the block.
+ */
+#define RT_SLAB_BLOCK_SIZE(size) \
+ Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * (size), (size) * 32)
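+
+/*
+ * Illustrative example (assuming the default 8kB slab block size): a 40-byte
+ * node gets a block of (8192 / 40) * 40 = 8160 bytes (204 nodes per block),
+ * while a hypothetical 2080-byte node would get 2080 * 32 = 66560 bytes, so
+ * at least 32 nodes always fit in a block.
+ */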
+
+/* Common type for all node types */
+typedef struct RT_NODE
+{
+ /*
+ * Number of children. We use uint16 to be able to indicate 256 children
+ * at the fanout of 8.
+ */
+ uint16 count;
+
+ /*
+ * Max capacity for the current size class. Storing this in the
+ * node enables multiple size classes per node kind.
+ * Technically, kinds with a single size class don't need this, so we could
+ * keep this in the individual base types, but the code is simpler this way.
+ * Note: node256 is unique in that it cannot possibly have more than a
+ * single size class, so for that kind we store zero, and uint8 is
+ * sufficient for other kinds.
+ */
+ uint8 fanout;
+
+ /*
+ * Shift indicates which part of the key space is represented by this
+ * node. That is, the key is shifted by 'shift' and the lowest
+ * RT_NODE_SPAN bits are then represented in chunk.
+ */
+ uint8 shift;
+
+ /* Node kind, one per search/set algorithm */
+ uint8 kind;
+} RT_NODE;
+
+
+#define RT_PTR_LOCAL RT_NODE *
+
+#ifdef RT_SHMEM
+#define RT_PTR_ALLOC dsa_pointer
+#else
+#define RT_PTR_ALLOC RT_PTR_LOCAL
+#endif
+
+
+#ifdef RT_SHMEM
+#define RT_INVALID_PTR_ALLOC InvalidDsaPointer
+#else
+#define RT_INVALID_PTR_ALLOC NULL
+#endif
+
+#ifdef RT_SHMEM
+#define RT_LOCK_EXCLUSIVE(tree) LWLockAcquire(&tree->ctl->lock, LW_EXCLUSIVE)
+#define RT_LOCK_SHARED(tree) LWLockAcquire(&tree->ctl->lock, LW_SHARED)
+#define RT_UNLOCK(tree) LWLockRelease(&tree->ctl->lock);
+#else
+#define RT_LOCK_EXCLUSIVE(tree) ((void) 0)
+#define RT_LOCK_SHARED(tree) ((void) 0)
+#define RT_UNLOCK(tree) ((void) 0)
+#endif
+
+/*
+ * Inner nodes and leaf nodes have analogous structure. To distinguish
+ * them at runtime, we take advantage of the fact that the key chunk
+ * is accessed by shifting: inner tree nodes (shift > 0) store the
+ * pointer to the child node in the slot. In leaf nodes (shift == 0),
+ * the slot contains the value corresponding to the key.
+ */
+#define RT_NODE_IS_LEAF(n) (((RT_PTR_LOCAL) (n))->shift == 0)
+
+#define RT_NODE_MUST_GROW(node) \
+ ((node)->base.n.count == (node)->base.n.fanout)
+
+/*
+ * Base type for each node kind, shared by leaf and inner nodes.
+ * The base types must be able to accommodate the largest size
+ * class for variable-sized node kinds.
+ */
+typedef struct RT_NODE_BASE_3
+{
+ RT_NODE n;
+
+ /* 3 children, for key chunks */
+ uint8 chunks[3];
+} RT_NODE_BASE_3;
+
+typedef struct RT_NODE_BASE_32
+{
+ RT_NODE n;
+
+ /* 32 children, for key chunks */
+ uint8 chunks[32];
+} RT_NODE_BASE_32;
+
+/*
+ * node-125 uses a slot_idxs array, of RT_NODE_MAX_SLOTS length,
+ * to store indexes into a second array that contains the values (or
+ * child pointers).
+ */
+typedef struct RT_NODE_BASE_125
+{
+ RT_NODE n;
+
+ /* For each chunk, the index into the children/values array */
+ uint8 slot_idxs[RT_NODE_MAX_SLOTS];
+
+ /* bitmap to track which slots are in use */
+ bitmapword isset[RT_BM_IDX(RT_SLOT_IDX_LIMIT)];
+} RT_NODE_BASE_125;
+
+typedef struct RT_NODE_BASE_256
+{
+ RT_NODE n;
+} RT_NODE_BASE_256;
+
+/*
+ * Inner and leaf nodes.
+ *
+ * These are separate because the value type might be different than
+ * something fitting into a pointer-width type.
+ */
+typedef struct RT_NODE_INNER_3
+{
+ RT_NODE_BASE_3 base;
+
+ /* number of children depends on size class */
+ RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_INNER_3;
+
+typedef struct RT_NODE_LEAF_3
+{
+ RT_NODE_BASE_3 base;
+
+ /* number of values depends on size class */
+ RT_VALUE_TYPE values[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_LEAF_3;
+
+typedef struct RT_NODE_INNER_32
+{
+ RT_NODE_BASE_32 base;
+
+ /* number of children depends on size class */
+ RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_INNER_32;
+
+typedef struct RT_NODE_LEAF_32
+{
+ RT_NODE_BASE_32 base;
+
+ /* number of values depends on size class */
+ RT_VALUE_TYPE values[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_LEAF_32;
+
+typedef struct RT_NODE_INNER_125
+{
+ RT_NODE_BASE_125 base;
+
+ /* number of children depends on size class */
+ RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_INNER_125;
+
+typedef struct RT_NODE_LEAF_125
+{
+ RT_NODE_BASE_125 base;
+
+ /* number of values depends on size class */
+ RT_VALUE_TYPE values[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_LEAF_125;
+
+/*
+ * node-256 is the largest node type. This node has an array
+ * for directly storing values (or child pointers in inner nodes).
+ * Unlike other node kinds, its array size is by definition
+ * fixed.
+ */
+typedef struct RT_NODE_INNER_256
+{
+ RT_NODE_BASE_256 base;
+
+ /* Slots for 256 children */
+ RT_PTR_ALLOC children[RT_NODE_MAX_SLOTS];
+} RT_NODE_INNER_256;
+
+typedef struct RT_NODE_LEAF_256
+{
+ RT_NODE_BASE_256 base;
+
+ /*
+ * Unlike with inner256, zero is a valid value here, so we use a
+ * bitmap to track which slots are in use.
+ */
+ bitmapword isset[RT_BM_IDX(RT_NODE_MAX_SLOTS)];
+
+ /* Slots for 256 values */
+ RT_VALUE_TYPE values[RT_NODE_MAX_SLOTS];
+} RT_NODE_LEAF_256;
+
+/*
+ * Node size classes
+ *
+ * Nodes of different kinds necessarily belong to different size classes.
+ * The main innovation in our implementation compared to the ART paper
+ * is decoupling the notion of size class from kind.
+ *
+ * The size classes within a given node kind have the same underlying
+ * type, but a variable number of children/values. This is possible
+ * because the base type contains small fixed data structures that
+ * work the same way regardless of how full the node is. We store the
+ * node's allocated capacity in the "fanout" member of RT_NODE, to allow
+ * runtime introspection.
+ *
+ * Growing from one node kind to another requires special code for each
+ * case, but growing from one size class to another within the same kind
+ * is basically just allocate + memcpy.
+ *
+ * The size classes have been chosen so that inner nodes on platforms
+ * with 64-bit pointers (and leaf nodes when using a 64-bit key) are
+ * equal to or slightly smaller than some DSA size class.
+ */
+typedef enum RT_SIZE_CLASS
+{
+ RT_CLASS_3 = 0,
+ RT_CLASS_32_MIN,
+ RT_CLASS_32_MAX,
+ RT_CLASS_125,
+ RT_CLASS_256
+} RT_SIZE_CLASS;
+
+/* Information for each size class */
+typedef struct RT_SIZE_CLASS_ELEM
+{
+ const char *name;
+ int fanout;
+
+ /* slab chunk size */
+ Size inner_size;
+ Size leaf_size;
+} RT_SIZE_CLASS_ELEM;
+
+static const RT_SIZE_CLASS_ELEM RT_SIZE_CLASS_INFO[] = {
+ [RT_CLASS_3] = {
+ .name = "radix tree node 3",
+ .fanout = 3,
+ .inner_size = sizeof(RT_NODE_INNER_3) + 3 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_3) + 3 * sizeof(RT_VALUE_TYPE),
+ },
+ [RT_CLASS_32_MIN] = {
+ .name = "radix tree node 15",
+ .fanout = 15,
+ .inner_size = sizeof(RT_NODE_INNER_32) + 15 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_32) + 15 * sizeof(RT_VALUE_TYPE),
+ },
+ [RT_CLASS_32_MAX] = {
+ .name = "radix tree node 32",
+ .fanout = 32,
+ .inner_size = sizeof(RT_NODE_INNER_32) + 32 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_32) + 32 * sizeof(RT_VALUE_TYPE),
+ },
+ [RT_CLASS_125] = {
+ .name = "radix tree node 125",
+ .fanout = 125,
+ .inner_size = sizeof(RT_NODE_INNER_125) + 125 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_125) + 125 * sizeof(RT_VALUE_TYPE),
+ },
+ [RT_CLASS_256] = {
+ .name = "radix tree node 256",
+ .fanout = 256,
+ .inner_size = sizeof(RT_NODE_INNER_256),
+ .leaf_size = sizeof(RT_NODE_LEAF_256),
+ },
+};
+
+#define RT_SIZE_CLASS_COUNT lengthof(RT_SIZE_CLASS_INFO)
+
+#ifdef RT_SHMEM
+/* A magic value used to identify our radix tree */
+#define RT_RADIX_TREE_MAGIC 0x54A48167
+#endif
+
+/* Contains the actual tree and ancillary info */
+// WIP: this name is a bit strange
+typedef struct RT_RADIX_TREE_CONTROL
+{
+#ifdef RT_SHMEM
+ RT_HANDLE handle;
+ uint32 magic;
+ LWLock lock;
+#endif
+
+ RT_PTR_ALLOC root;
+ uint64 max_val;
+ uint64 num_keys;
+
+ /* statistics */
+#ifdef RT_DEBUG
+ int32 cnt[RT_SIZE_CLASS_COUNT];
+#endif
+} RT_RADIX_TREE_CONTROL;
+
+/* Entry point for allocating and accessing the tree */
+typedef struct RT_RADIX_TREE
+{
+ MemoryContext context;
+
+ /* pointing to either local memory or DSA */
+ RT_RADIX_TREE_CONTROL *ctl;
+
+#ifdef RT_SHMEM
+ dsa_area *dsa;
+#else
+ MemoryContextData *inner_slabs[RT_SIZE_CLASS_COUNT];
+ MemoryContextData *leaf_slabs[RT_SIZE_CLASS_COUNT];
+#endif
+} RT_RADIX_TREE;
+
+/*
+ * Iteration support.
+ *
+ * Iterating the radix tree returns each pair of key and value in the ascending
+ * order of the key. To support this, we iterate over nodes at each level.
+ *
+ * RT_NODE_ITER struct is used to track the iteration within a node.
+ *
+ * RT_ITER is the struct for iteration of the radix tree, and uses RT_NODE_ITER
+ * in order to track the iteration of each level. During iteration, we also
+ * construct the key whenever updating the node iteration information, e.g., when
+ * advancing the current index within the node or when moving to the next node
+ * at the same level.
+ *
+ * XXX: Currently we allow only one process to do iteration. Therefore, rt_node_iter
+ * has the local pointers to nodes, rather than RT_PTR_ALLOC.
+ * We need either a safeguard to disallow other processes from beginning an iteration
+ * while one process is iterating, or to allow multiple processes to iterate concurrently.
+ */
+typedef struct RT_NODE_ITER
+{
+ RT_PTR_LOCAL node; /* current node being iterated */
+ int current_idx; /* current position. -1 for initial value */
+} RT_NODE_ITER;
+
+typedef struct RT_ITER
+{
+ RT_RADIX_TREE *tree;
+
+ /* Track the iteration on nodes of each level */
+ RT_NODE_ITER stack[RT_MAX_LEVEL];
+ int stack_len;
+
+ /* The key is constructed during iteration */
+ uint64 key;
+} RT_ITER;
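+
+/*
+ * Illustrative usage sketch (hypothetical "item_ts" prefix and uint64 value
+ * type, as in the example near the top of this file):
+ *
+ *     uint64 key;
+ *     uint64 val;
+ *     item_ts_iter *iter = item_ts_begin_iterate(tree);
+ *
+ *     while (item_ts_iterate_next(iter, &key, &val))
+ *         ;    // do something with (key, val)
+ *     item_ts_end_iterate(iter);
+ *
+ * Keys are returned in ascending order; the shared lock is held until
+ * item_ts_end_iterate() is called.
+ */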
+
+
+static void RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
+ uint64 key, RT_PTR_ALLOC child);
+static bool RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
+ uint64 key, RT_VALUE_TYPE *value_p);
+
+/* verification (available only with assertion) */
+static void RT_VERIFY_NODE(RT_PTR_LOCAL node);
+
+/* Get the local address of an allocated node */
+static inline RT_PTR_LOCAL
+RT_PTR_GET_LOCAL(RT_RADIX_TREE *tree, RT_PTR_ALLOC node)
+{
+#ifdef RT_SHMEM
+ return dsa_get_address(tree->dsa, (dsa_pointer) node);
+#else
+ return node;
+#endif
+}
+
+static inline bool
+RT_PTR_ALLOC_IS_VALID(RT_PTR_ALLOC ptr)
+{
+#ifdef RT_SHMEM
+ return DsaPointerIsValid(ptr);
+#else
+ return PointerIsValid(ptr);
+#endif
+}
+
+/*
+ * Return index of the first element in the node's chunk array that equals
+ * 'chunk'. Return -1 if there is no such element.
+ */
+static inline int
+RT_NODE_3_SEARCH_EQ(RT_NODE_BASE_3 *node, uint8 chunk)
+{
+ int idx = -1;
+
+ for (int i = 0; i < node->n.count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ idx = i;
+ break;
+ }
+ }
+
+ return idx;
+}
+
+/*
+ * Return index of the chunk and slot arrays for inserting into the node,
+ * such that the chunk array remains ordered.
+ */
+static inline int
+RT_NODE_3_GET_INSERTPOS(RT_NODE_BASE_3 *node, uint8 chunk)
+{
+ int idx;
+
+ for (idx = 0; idx < node->n.count; idx++)
+ {
+ if (node->chunks[idx] >= chunk)
+ break;
+ }
+
+ return idx;
+}
+
+/*
+ * Return index of the first element in the node's chunk array that equals
+ * 'chunk'. Return -1 if there is no such element.
+ */
+static inline int
+RT_NODE_32_SEARCH_EQ(RT_NODE_BASE_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ uint32 bitfield;
+ int index_simd = -1;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index = -1;
+
+ for (int i = 0; i < count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ index = i;
+ break;
+ }
+ }
+#endif
+
+#ifndef USE_NO_SIMD
+ /* replicate the search key */
+ spread_chunk = vector8_broadcast(chunk);
+
+ /* compare to all 32 keys stored in the node */
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ cmp1 = vector8_eq(spread_chunk, haystack1);
+ cmp2 = vector8_eq(spread_chunk, haystack2);
+
+ /* convert comparison to a bitfield */
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+
+ /* mask off invalid entries */
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ /* convert bitfield to index by counting trailing zeros */
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+/*
+ * Return index of the chunk and slot arrays for inserting into the node,
+ * such that the chunk array remains ordered.
+ */
+static inline int
+RT_NODE_32_GET_INSERTPOS(RT_NODE_BASE_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ Vector8 min1;
+ Vector8 min2;
+ uint32 bitfield;
+ int index_simd;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index;
+
+ for (index = 0; index < count; index++)
+ {
+ /*
+ * This is coded with '>=' to match what we can do with SIMD,
+ * with an assert to keep us honest.
+ */
+ if (node->chunks[index] >= chunk)
+ {
+ Assert(node->chunks[index] != chunk);
+ break;
+ }
+ }
+#endif
+
+#ifndef USE_NO_SIMD
+ /*
+ * This is a bit more complicated than RT_NODE_32_SEARCH_EQ(), because
+ * no unsigned uint8 comparison instruction exists, at least for SSE2. So
+ * we need to play some trickery using vector8_min() to effectively get
+ * >=. There'll never be any equal elements in current uses, but that's
+ * what we get here...
+ */
+ spread_chunk = vector8_broadcast(chunk);
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ min1 = vector8_min(spread_chunk, haystack1);
+ min2 = vector8_min(spread_chunk, haystack2);
+ cmp1 = vector8_eq(spread_chunk, min1);
+ cmp2 = vector8_eq(spread_chunk, min2);
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+ else
+ index_simd = count;
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+
+/*
+ * Functions to manipulate both chunks array and children/values array.
+ * These are used for node-3 and node-32.
+ */
+
+/* Shift the elements right at 'idx' by one */
+static inline void
+RT_CHUNK_CHILDREN_ARRAY_SHIFT(uint8 *chunks, RT_PTR_ALLOC *children, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+ memmove(&(children[idx + 1]), &(children[idx]), sizeof(RT_PTR_ALLOC) * (count - idx));
+}
+
+static inline void
+RT_CHUNK_VALUES_ARRAY_SHIFT(uint8 *chunks, RT_VALUE_TYPE *values, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+ memmove(&(values[idx + 1]), &(values[idx]), sizeof(RT_VALUE_TYPE) * (count - idx));
+}
+
+/* Delete the element at 'idx' */
+static inline void
+RT_CHUNK_CHILDREN_ARRAY_DELETE(uint8 *chunks, RT_PTR_ALLOC *children, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(children[idx]), &(children[idx + 1]), sizeof(RT_PTR_ALLOC) * (count - idx - 1));
+}
+
+static inline void
+RT_CHUNK_VALUES_ARRAY_DELETE(uint8 *chunks, RT_VALUE_TYPE *values, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(values[idx]), &(values[idx + 1]), sizeof(RT_VALUE_TYPE) * (count - idx - 1));
+}
+
+/* Copy both chunks and children/values arrays */
+static inline void
+RT_CHUNK_CHILDREN_ARRAY_COPY(uint8 *src_chunks, RT_PTR_ALLOC *src_children,
+ uint8 *dst_chunks, RT_PTR_ALLOC *dst_children)
+{
+ const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_3].fanout;
+ const Size chunk_size = sizeof(uint8) * fanout;
+ const Size children_size = sizeof(RT_PTR_ALLOC) * fanout;
+
+ memcpy(dst_chunks, src_chunks, chunk_size);
+ memcpy(dst_children, src_children, children_size);
+}
+
+static inline void
+RT_CHUNK_VALUES_ARRAY_COPY(uint8 *src_chunks, RT_VALUE_TYPE *src_values,
+ uint8 *dst_chunks, RT_VALUE_TYPE *dst_values)
+{
+ const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_3].fanout;
+ const Size chunk_size = sizeof(uint8) * fanout;
+ const Size values_size = sizeof(RT_VALUE_TYPE) * fanout;
+
+ memcpy(dst_chunks, src_chunks, chunk_size);
+ memcpy(dst_values, src_values, values_size);
+}
+
+/* Functions to manipulate inner and leaf node-125 */
+
+/* Does the given chunk in the node have a value? */
+static inline bool
+RT_NODE_125_IS_CHUNK_USED(RT_NODE_BASE_125 *node, uint8 chunk)
+{
+ return node->slot_idxs[chunk] != RT_INVALID_SLOT_IDX;
+}
+
+static inline RT_PTR_ALLOC
+RT_NODE_INNER_125_GET_CHILD(RT_NODE_INNER_125 *node, uint8 chunk)
+{
+ Assert(!RT_NODE_IS_LEAF(node));
+ return node->children[node->base.slot_idxs[chunk]];
+}
+
+static inline RT_VALUE_TYPE
+RT_NODE_LEAF_125_GET_VALUE(RT_NODE_LEAF_125 *node, uint8 chunk)
+{
+ Assert(RT_NODE_IS_LEAF(node));
+ Assert(((RT_NODE_BASE_125 *) node)->slot_idxs[chunk] != RT_INVALID_SLOT_IDX);
+ return node->values[node->base.slot_idxs[chunk]];
+}
+
+/* Functions to manipulate inner and leaf node-256 */
+
+/* Return true if the slot corresponding to the given chunk is in use */
+static inline bool
+RT_NODE_INNER_256_IS_CHUNK_USED(RT_NODE_INNER_256 *node, uint8 chunk)
+{
+ Assert(!RT_NODE_IS_LEAF(node));
+ return node->children[chunk] != RT_INVALID_PTR_ALLOC;
+}
+
+static inline bool
+RT_NODE_LEAF_256_IS_CHUNK_USED(RT_NODE_LEAF_256 *node, uint8 chunk)
+{
+ int idx = RT_BM_IDX(chunk);
+ int bitnum = RT_BM_BIT(chunk);
+
+ Assert(RT_NODE_IS_LEAF(node));
+ return (node->isset[idx] & ((bitmapword) 1 << bitnum)) != 0;
+}
+
+static inline RT_PTR_ALLOC
+RT_NODE_INNER_256_GET_CHILD(RT_NODE_INNER_256 *node, uint8 chunk)
+{
+ Assert(!RT_NODE_IS_LEAF(node));
+ Assert(RT_NODE_INNER_256_IS_CHUNK_USED(node, chunk));
+ return node->children[chunk];
+}
+
+static inline RT_VALUE_TYPE
+RT_NODE_LEAF_256_GET_VALUE(RT_NODE_LEAF_256 *node, uint8 chunk)
+{
+ Assert(RT_NODE_IS_LEAF(node));
+ Assert(RT_NODE_LEAF_256_IS_CHUNK_USED(node, chunk));
+ return node->values[chunk];
+}
+
+/* Set the child in the node-256 */
+static inline void
+RT_NODE_INNER_256_SET(RT_NODE_INNER_256 *node, uint8 chunk, RT_PTR_ALLOC child)
+{
+ Assert(!RT_NODE_IS_LEAF(node));
+ node->children[chunk] = child;
+}
+
+/* Set the value in the node-256 */
+static inline void
+RT_NODE_LEAF_256_SET(RT_NODE_LEAF_256 *node, uint8 chunk, RT_VALUE_TYPE value)
+{
+ int idx = RT_BM_IDX(chunk);
+ int bitnum = RT_BM_BIT(chunk);
+
+ Assert(RT_NODE_IS_LEAF(node));
+ node->isset[idx] |= ((bitmapword) 1 << bitnum);
+ node->values[chunk] = value;
+}
+
+/* Clear the slot at the given chunk position */
+static inline void
+RT_NODE_INNER_256_DELETE(RT_NODE_INNER_256 *node, uint8 chunk)
+{
+ Assert(!RT_NODE_IS_LEAF(node));
+ node->children[chunk] = RT_INVALID_PTR_ALLOC;
+}
+
+static inline void
+RT_NODE_LEAF_256_DELETE(RT_NODE_LEAF_256 *node, uint8 chunk)
+{
+ int idx = RT_BM_IDX(chunk);
+ int bitnum = RT_BM_BIT(chunk);
+
+ Assert(RT_NODE_IS_LEAF(node));
+ node->isset[idx] &= ~((bitmapword) 1 << bitnum);
+}
+
+/*
+ * Return the largest shift that will allow storing the given key.
+ */
+static inline int
+RT_KEY_GET_SHIFT(uint64 key)
+{
+ if (key == 0)
+ return 0;
+ else
+ return (pg_leftmost_one_pos64(key) / RT_NODE_SPAN) * RT_NODE_SPAN;
+}
+
+/*
+ * Return the max value that can be stored in the tree with the given shift.
+ */
+static uint64
+RT_SHIFT_GET_MAX_VAL(int shift)
+{
+ if (shift == RT_MAX_SHIFT)
+ return UINT64_MAX;
+
+ return (UINT64CONST(1) << (shift + RT_NODE_SPAN)) - 1;
+}
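+
+/*
+ * Illustrative example (not part of the patch): for key 0x1FFFF the highest
+ * set bit is bit 16, so RT_KEY_GET_SHIFT() returns (16 / 8) * 8 = 16, and a
+ * root with shift 16 covers chunks at shifts 16, 8 and 0;
+ * RT_SHIFT_GET_MAX_VAL(16) is then 0xFFFFFF.
+ */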
+
+/*
+ * Allocate a new node with the given size class.
+ */
+static RT_PTR_ALLOC
+RT_ALLOC_NODE(RT_RADIX_TREE *tree, RT_SIZE_CLASS size_class, bool is_leaf)
+{
+ RT_PTR_ALLOC allocnode;
+ size_t allocsize;
+
+ if (is_leaf)
+ allocsize = RT_SIZE_CLASS_INFO[size_class].leaf_size;
+ else
+ allocsize = RT_SIZE_CLASS_INFO[size_class].inner_size;
+
+#ifdef RT_SHMEM
+ allocnode = dsa_allocate(tree->dsa, allocsize);
+#else
+ if (is_leaf)
+ allocnode = (RT_PTR_ALLOC) MemoryContextAlloc(tree->leaf_slabs[size_class],
+ allocsize);
+ else
+ allocnode = (RT_PTR_ALLOC) MemoryContextAlloc(tree->inner_slabs[size_class],
+ allocsize);
+#endif
+
+#ifdef RT_DEBUG
+ /* update the statistics */
+ tree->ctl->cnt[size_class]++;
+#endif
+
+ return allocnode;
+}
+
+/* Initialize the node contents */
+static inline void
+RT_INIT_NODE(RT_PTR_LOCAL node, uint8 kind, RT_SIZE_CLASS size_class, bool is_leaf)
+{
+ if (is_leaf)
+ MemSet(node, 0, RT_SIZE_CLASS_INFO[size_class].leaf_size);
+ else
+ MemSet(node, 0, RT_SIZE_CLASS_INFO[size_class].inner_size);
+
+ node->kind = kind;
+
+ if (kind == RT_NODE_KIND_256)
+ /* See comment for the RT_NODE type */
+ Assert(node->fanout == 0);
+ else
+ node->fanout = RT_SIZE_CLASS_INFO[size_class].fanout;
+
+ /* Initialize slot_idxs to invalid values */
+ if (kind == RT_NODE_KIND_125)
+ {
+ RT_NODE_BASE_125 *n125 = (RT_NODE_BASE_125 *) node;
+
+ memset(n125->slot_idxs, RT_INVALID_SLOT_IDX, sizeof(n125->slot_idxs));
+ }
+}
+
+/*
+ * Create a new node as the root. Subordinate nodes will be created during
+ * the insertion.
+ */
+static pg_noinline void
+RT_NEW_ROOT(RT_RADIX_TREE *tree, uint64 key)
+{
+ int shift = RT_KEY_GET_SHIFT(key);
+ bool is_leaf = shift == 0;
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+
+ allocnode = RT_ALLOC_NODE(tree, RT_CLASS_3, is_leaf);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(newnode, RT_NODE_KIND_3, RT_CLASS_3, is_leaf);
+ newnode->shift = shift;
+ tree->ctl->max_val = RT_SHIFT_GET_MAX_VAL(shift);
+ tree->ctl->root = allocnode;
+}
+
+static inline void
+RT_COPY_NODE(RT_PTR_LOCAL newnode, RT_PTR_LOCAL oldnode)
+{
+ newnode->shift = oldnode->shift;
+ newnode->count = oldnode->count;
+}
+
+/*
+ * Given a newly allocated node and an old node, initialize the new
+ * node with the necessary fields and return its local pointer.
+ */
+static inline RT_PTR_LOCAL
+RT_SWITCH_NODE_KIND(RT_RADIX_TREE *tree, RT_PTR_ALLOC allocnode, RT_PTR_LOCAL node,
+ uint8 new_kind, uint8 new_class, bool is_leaf)
+{
+ RT_PTR_LOCAL newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(newnode, new_kind, new_class, is_leaf);
+ RT_COPY_NODE(newnode, node);
+
+ return newnode;
+}
+
+/* Free the given node */
+static void
+RT_FREE_NODE(RT_RADIX_TREE *tree, RT_PTR_ALLOC allocnode)
+{
+ /* If we're deleting the root node, make the tree empty */
+ if (tree->ctl->root == allocnode)
+ {
+ tree->ctl->root = RT_INVALID_PTR_ALLOC;
+ tree->ctl->max_val = 0;
+ }
+
+#ifdef RT_DEBUG
+ {
+ int i;
+ RT_PTR_LOCAL node = RT_PTR_GET_LOCAL(tree, allocnode);
+
+ /* update the statistics */
+ for (i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ if (node->fanout == RT_SIZE_CLASS_INFO[i].fanout)
+ break;
+ }
+
+ /* fanout of node256 is intentionally 0 */
+ if (i == RT_SIZE_CLASS_COUNT)
+ i = RT_CLASS_256;
+
+ tree->ctl->cnt[i]--;
+ Assert(tree->ctl->cnt[i] >= 0);
+ }
+#endif
+
+#ifdef RT_SHMEM
+ dsa_free(tree->dsa, allocnode);
+#else
+ pfree(allocnode);
+#endif
+}
+
+/* Update the parent's pointer when growing a node */
+static inline void
+RT_NODE_UPDATE_INNER(RT_PTR_LOCAL node, uint64 key, RT_PTR_ALLOC new_child)
+{
+#define RT_ACTION_UPDATE
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_INNER
+#undef RT_ACTION_UPDATE
+}
+
+/*
+ * Replace old_child with new_child, and free the old one.
+ */
+static inline void
+RT_REPLACE_NODE(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent,
+ RT_PTR_ALLOC stored_old_child, RT_PTR_LOCAL old_child,
+ RT_PTR_ALLOC new_child, uint64 key)
+{
+#ifdef USE_ASSERT_CHECKING
+ RT_PTR_LOCAL new = RT_PTR_GET_LOCAL(tree, new_child);
+
+ Assert(old_child->shift == new->shift);
+ Assert(old_child->count == new->count);
+#endif
+
+ if (parent == old_child)
+ {
+ /* Replace the root node with the new larger node */
+ tree->ctl->root = new_child;
+ }
+ else
+ RT_NODE_UPDATE_INNER(parent, key, new_child);
+
+ RT_FREE_NODE(tree, stored_old_child);
+}
+
+/*
+ * The radix tree doesn't have sufficient height. Extend the radix tree so
+ * it can store the key.
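+ *
+ * Illustrative example (not part of the patch): if the current root has
+ * shift 0 (max_val 0xFF) and we need to store key 0x1234, one new root
+ * node at shift 8 is added and max_val becomes 0xFFFF.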
+ */
+static pg_noinline void
+RT_EXTEND(RT_RADIX_TREE *tree, uint64 key)
+{
+ int target_shift;
+ RT_PTR_LOCAL root = RT_PTR_GET_LOCAL(tree, tree->ctl->root);
+ int shift = root->shift + RT_NODE_SPAN;
+
+ target_shift = RT_KEY_GET_SHIFT(key);
+
+ /* Grow tree from 'shift' to 'target_shift' */
+ while (shift <= target_shift)
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL node;
+ RT_NODE_INNER_3 *n3;
+
+ allocnode = RT_ALLOC_NODE(tree, RT_CLASS_3, false);
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(node, RT_NODE_KIND_3, RT_CLASS_3, false);
+ node->shift = shift;
+ node->count = 1;
+
+ n3 = (RT_NODE_INNER_3 *) node;
+ n3->base.chunks[0] = 0;
+ n3->children[0] = tree->ctl->root;
+
+ /* Update the root */
+ tree->ctl->root = allocnode;
+
+ shift += RT_NODE_SPAN;
+ }
+
+ tree->ctl->max_val = RT_SHIFT_GET_MAX_VAL(target_shift);
+}
+
+/*
+ * The radix tree doesn't have inner and leaf nodes for the given key-value pair.
+ * Insert inner and leaf nodes from 'node' to bottom.
+ */
+static pg_noinline void
+RT_SET_EXTEND(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p, RT_PTR_LOCAL parent,
+ RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node)
+{
+ int shift = node->shift;
+
+ Assert(RT_PTR_GET_LOCAL(tree, stored_node) == node);
+
+ while (shift >= RT_NODE_SPAN)
+ {
+ RT_PTR_ALLOC allocchild;
+ RT_PTR_LOCAL newchild;
+ int newshift = shift - RT_NODE_SPAN;
+ bool is_leaf = newshift == 0;
+
+ allocchild = RT_ALLOC_NODE(tree, RT_CLASS_3, is_leaf);
+ newchild = RT_PTR_GET_LOCAL(tree, allocchild);
+ RT_INIT_NODE(newchild, RT_NODE_KIND_3, RT_CLASS_3, is_leaf);
+ newchild->shift = newshift;
+ RT_NODE_INSERT_INNER(tree, parent, stored_node, node, key, allocchild);
+
+ parent = node;
+ node = newchild;
+ stored_node = allocchild;
+ shift -= RT_NODE_SPAN;
+ }
+
+ RT_NODE_INSERT_LEAF(tree, parent, stored_node, node, key, value_p);
+ tree->ctl->num_keys++;
+}
+
+/*
+ * Search for the child pointer corresponding to 'key' in the given node.
+ *
+ * Return true if the key is found, otherwise return false. On success, the child
+ * pointer is stored in *child_p.
+ */
+static inline bool
+RT_NODE_SEARCH_INNER(RT_PTR_LOCAL node, uint64 key, RT_PTR_ALLOC *child_p)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/*
+ * Search for the value corresponding to 'key' in the given node.
+ *
+ * Return true if the key is found, otherwise return false. On success, the value
+ * is stored in *value_p.
+ */
+static inline bool
+RT_NODE_SEARCH_LEAF(RT_PTR_LOCAL node, uint64 key, RT_VALUE_TYPE *value_p)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
+#ifdef RT_USE_DELETE
+/*
+ * Search for the child pointer corresponding to 'key' in the given node.
+ *
+ * Delete the entry and return true if the key is found, otherwise return false.
+ */
+static inline bool
+RT_NODE_DELETE_INNER(RT_PTR_LOCAL node, uint64 key)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_delete_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/*
+ * Search for the value corresponding to 'key' in the given node.
+ *
+ * Delete the entry and return true if the key is found, otherwise return false.
+ */
+static inline bool
+RT_NODE_DELETE_LEAF(RT_PTR_LOCAL node, uint64 key)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_delete_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+#endif
+
+/*
+ * Insert "child" into "node".
+ *
+ * "parent" is the parent of "node", so the grandparent of the child.
+ * If the node we're inserting into needs to grow, we update the parent's
+ * child pointer with the pointer to the new larger node.
+ */
+static void
+RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
+ uint64 key, RT_PTR_ALLOC child)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_insert_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/* Like RT_NODE_INSERT_INNER, but for leaf nodes */
+static bool
+RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
+ uint64 key, RT_VALUE_TYPE *value_p)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_insert_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
+/*
+ * Create the radix tree in the given memory context and return it.
+ */
+RT_SCOPE RT_RADIX_TREE *
+#ifdef RT_SHMEM
+RT_CREATE(MemoryContext ctx, dsa_area *dsa, int tranche_id)
+#else
+RT_CREATE(MemoryContext ctx)
+#endif
+{
+ RT_RADIX_TREE *tree;
+ MemoryContext old_ctx;
+#ifdef RT_SHMEM
+ dsa_pointer dp;
+#endif
+
+ old_ctx = MemoryContextSwitchTo(ctx);
+
+ tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE));
+ tree->context = ctx;
+
+#ifdef RT_SHMEM
+ tree->dsa = dsa;
+ dp = dsa_allocate0(dsa, sizeof(RT_RADIX_TREE_CONTROL));
+ tree->ctl = (RT_RADIX_TREE_CONTROL *) dsa_get_address(dsa, dp);
+ tree->ctl->handle = dp;
+ tree->ctl->magic = RT_RADIX_TREE_MAGIC;
+ LWLockInitialize(&tree->ctl->lock, tranche_id);
+#else
+ tree->ctl = (RT_RADIX_TREE_CONTROL *) palloc0(sizeof(RT_RADIX_TREE_CONTROL));
+
+ /* Create a slab context for each size class */
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ RT_SIZE_CLASS_ELEM size_class = RT_SIZE_CLASS_INFO[i];
+ size_t inner_blocksize = RT_SLAB_BLOCK_SIZE(size_class.inner_size);
+ size_t leaf_blocksize = RT_SLAB_BLOCK_SIZE(size_class.leaf_size);
+
+ tree->inner_slabs[i] = SlabContextCreate(ctx,
+ size_class.name,
+ inner_blocksize,
+ size_class.inner_size);
+ tree->leaf_slabs[i] = SlabContextCreate(ctx,
+ size_class.name,
+ leaf_blocksize,
+ size_class.leaf_size);
+ }
+#endif
+
+ tree->ctl->root = RT_INVALID_PTR_ALLOC;
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return tree;
+}
+
+#ifdef RT_SHMEM
+RT_SCOPE RT_RADIX_TREE *
+RT_ATTACH(dsa_area *dsa, RT_HANDLE handle)
+{
+ RT_RADIX_TREE *tree;
+ dsa_pointer control;
+
+ tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE));
+
+ /* Find the control object in shared memory */
+ control = handle;
+
+ tree->dsa = dsa;
+ tree->ctl = (RT_RADIX_TREE_CONTROL *) dsa_get_address(dsa, control);
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+
+ return tree;
+}
+
+RT_SCOPE void
+RT_DETACH(RT_RADIX_TREE *tree)
+{
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+ pfree(tree);
+}
+
+RT_SCOPE RT_HANDLE
+RT_GET_HANDLE(RT_RADIX_TREE *tree)
+{
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+ return tree->ctl->handle;
+}
+
+/*
+ * Recursively free all nodes allocated to the DSA area.
+ */
+static void
+RT_FREE_RECURSE(RT_RADIX_TREE *tree, RT_PTR_ALLOC ptr)
+{
+ RT_PTR_LOCAL node = RT_PTR_GET_LOCAL(tree, ptr);
+
+ check_stack_depth();
+ CHECK_FOR_INTERRUPTS();
+
+ /* The leaf node doesn't have child pointers */
+ if (RT_NODE_IS_LEAF(node))
+ {
+ dsa_free(tree->dsa, ptr);
+ return;
+ }
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE_INNER_3 *n3 = (RT_NODE_INNER_3 *) node;
+
+ for (int i = 0; i < n3->base.n.count; i++)
+ RT_FREE_RECURSE(tree, n3->children[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE_INNER_32 *n32 = (RT_NODE_INNER_32 *) node;
+
+ for (int i = 0; i < n32->base.n.count; i++)
+ RT_FREE_RECURSE(tree, n32->children[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE_INNER_125 *n125 = (RT_NODE_INNER_125 *) node;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(&n125->base, i))
+ continue;
+
+ RT_FREE_RECURSE(tree, RT_NODE_INNER_125_GET_CHILD(n125, i));
+ }
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE_INNER_256 *n256 = (RT_NODE_INNER_256 *) node;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
+ continue;
+
+ RT_FREE_RECURSE(tree, RT_NODE_INNER_256_GET_CHILD(n256, i));
+ }
+
+ break;
+ }
+ }
+
+ /* Free the inner node */
+ dsa_free(tree->dsa, ptr);
+}
+#endif
+
+/*
+ * Free the given radix tree.
+ */
+RT_SCOPE void
+RT_FREE(RT_RADIX_TREE *tree)
+{
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+
+ /* Free all memory used for radix tree nodes */
+ if (RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ RT_FREE_RECURSE(tree, tree->ctl->root);
+
+ /*
+ * Vandalize the control block to help catch programming errors where
+ * other backends access the memory formerly occupied by this radix tree.
+ */
+ tree->ctl->magic = 0;
+ dsa_free(tree->dsa, tree->ctl->handle);
+#else
+ pfree(tree->ctl);
+
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ MemoryContextDelete(tree->inner_slabs[i]);
+ MemoryContextDelete(tree->leaf_slabs[i]);
+ }
+#endif
+
+ pfree(tree);
+}
+
+/*
+ * Set the value for the given key. If the entry already exists, we update its
+ * value and return true. Returns false if the entry doesn't yet exist.
+ */
+RT_SCOPE bool
+RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p)
+{
+ int shift;
+ bool updated;
+ RT_PTR_LOCAL parent;
+ RT_PTR_ALLOC stored_child;
+ RT_PTR_LOCAL child;
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+
+ RT_LOCK_EXCLUSIVE(tree);
+
+ /* Empty tree, create the root */
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ RT_NEW_ROOT(tree, key);
+
+ /* Extend the tree if necessary */
+ if (key > tree->ctl->max_val)
+ RT_EXTEND(tree, key);
+
+ stored_child = tree->ctl->root;
+ parent = RT_PTR_GET_LOCAL(tree, stored_child);
+ shift = parent->shift;
+
+ /* Descend the tree until we reach a leaf node */
+ while (shift >= 0)
+ {
+ RT_PTR_ALLOC new_child = RT_INVALID_PTR_ALLOC;
+
+ child = RT_PTR_GET_LOCAL(tree, stored_child);
+
+ if (RT_NODE_IS_LEAF(child))
+ break;
+
+ if (!RT_NODE_SEARCH_INNER(child, key, &new_child))
+ {
+ RT_SET_EXTEND(tree, key, value_p, parent, stored_child, child);
+ RT_UNLOCK(tree);
+ return false;
+ }
+
+ parent = child;
+ stored_child = new_child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ updated = RT_NODE_INSERT_LEAF(tree, parent, stored_child, child, key, value_p);
+
+ /* Update the statistics */
+ if (!updated)
+ tree->ctl->num_keys++;
+
+ RT_UNLOCK(tree);
+ return updated;
+}
+
+/*
+ * Search for the given key in the radix tree. Return true if the key is found,
+ * otherwise return false. On success, the value is copied into *value_p, so it
+ * must not be NULL.
+ */
+RT_SCOPE bool
+RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p)
+{
+ RT_PTR_LOCAL node;
+ int shift;
+ bool found;
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+ Assert(value_p != NULL);
+
+ RT_LOCK_SHARED(tree);
+
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root) || key > tree->ctl->max_val)
+ {
+ RT_UNLOCK(tree);
+ return false;
+ }
+
+ node = RT_PTR_GET_LOCAL(tree, tree->ctl->root);
+ shift = node->shift;
+
+ /* Descend the tree until a leaf node */
+ while (shift >= 0)
+ {
+ RT_PTR_ALLOC child = RT_INVALID_PTR_ALLOC;
+
+ if (RT_NODE_IS_LEAF(node))
+ break;
+
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ {
+ RT_UNLOCK(tree);
+ return false;
+ }
+
+ node = RT_PTR_GET_LOCAL(tree, child);
+ shift -= RT_NODE_SPAN;
+ }
+
+ found = RT_NODE_SEARCH_LEAF(node, key, value_p);
+
+ RT_UNLOCK(tree);
+ return found;
+}
+
+#ifdef RT_USE_DELETE
+/*
+ * Delete the given key from the radix tree. Return true if the key is found (and
+ * deleted), otherwise do nothing and return false.
+ */
+RT_SCOPE bool
+RT_DELETE(RT_RADIX_TREE *tree, uint64 key)
+{
+ RT_PTR_LOCAL node;
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_ALLOC stack[RT_MAX_LEVEL] = {0};
+ int shift;
+ int level;
+ bool deleted;
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+
+ RT_LOCK_EXCLUSIVE(tree);
+
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root) || key > tree->ctl->max_val)
+ {
+ RT_UNLOCK(tree);
+ return false;
+ }
+
+ /*
+ * Descend the tree to search the key while building a stack of nodes we
+ * visited.
+ */
+ allocnode = tree->ctl->root;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ shift = node->shift;
+ level = -1;
+ while (shift > 0)
+ {
+ RT_PTR_ALLOC child = RT_INVALID_PTR_ALLOC;
+
+ /* Push the current node to the stack */
+ stack[++level] = allocnode;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ {
+ RT_UNLOCK(tree);
+ return false;
+ }
+
+ allocnode = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ /* Delete the key from the leaf node if it exists */
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ deleted = RT_NODE_DELETE_LEAF(node, key);
+
+ if (!deleted)
+ {
+ /* no key is found in the leaf node */
+ RT_UNLOCK(tree);
+ return false;
+ }
+
+ /* Found the key to delete. Update the statistics */
+ tree->ctl->num_keys--;
+
+ /*
+ * Return if the leaf node still has keys and we don't need to delete the
+ * node.
+ */
+ if (node->count > 0)
+ {
+ RT_UNLOCK(tree);
+ return true;
+ }
+
+ /* Free the empty leaf node */
+ RT_FREE_NODE(tree, allocnode);
+
+ /* Delete the key in inner nodes recursively */
+ while (level >= 0)
+ {
+ allocnode = stack[level--];
+
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ deleted = RT_NODE_DELETE_INNER(node, key);
+ Assert(deleted);
+
+ /* If the node didn't become empty, we stop deleting the key */
+ if (node->count > 0)
+ break;
+
+ /* The node became empty */
+ RT_FREE_NODE(tree, allocnode);
+ }
+
+ RT_UNLOCK(tree);
+ return true;
+}
+#endif
+
+static inline void
+RT_ITER_UPDATE_KEY(RT_ITER *iter, uint8 chunk, uint8 shift)
+{
+ iter->key &= ~(((uint64) RT_CHUNK_MASK) << shift);
+ iter->key |= (((uint64) chunk) << shift);
+}
+
+/*
+ * Advance the slot in the inner node. Return the child if one exists, otherwise
+ * NULL.
+ */
+static inline RT_PTR_LOCAL
+RT_NODE_INNER_ITERATE_NEXT(RT_ITER *iter, RT_NODE_ITER *node_iter)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_iter_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/*
+ * Advance the slot in the leaf node. On success, return true and set the value
+ * in *value_p; otherwise return false.
+ */
+static inline bool
+RT_NODE_LEAF_ITERATE_NEXT(RT_ITER *iter, RT_NODE_ITER *node_iter,
+ RT_VALUE_TYPE *value_p)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_iter_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
+/*
+ * Update each node_iter for inner nodes in the iterator node stack.
+ */
+static void
+RT_UPDATE_ITER_STACK(RT_ITER *iter, RT_PTR_LOCAL from_node, int from)
+{
+ int level = from;
+ RT_PTR_LOCAL node = from_node;
+
+ for (;;)
+ {
+ RT_NODE_ITER *node_iter = &(iter->stack[level--]);
+
+ node_iter->node = node;
+ node_iter->current_idx = -1;
+
+ /* We don't advance the leaf node iterator here */
+ if (RT_NODE_IS_LEAF(node))
+ return;
+
+ /* Advance to the next slot in the inner node */
+ node = RT_NODE_INNER_ITERATE_NEXT(iter, node_iter);
+
+ /* We must find the first child in the node */
+ Assert(node);
+ }
+}
+
+/*
+ * Create and return the iterator for the given radix tree.
+ *
+ * The radix tree is locked in shared mode during the iteration, so
+ * RT_END_ITERATE needs to be called when finished to release the lock.
+ */
+RT_SCOPE RT_ITER *
+RT_BEGIN_ITERATE(RT_RADIX_TREE *tree)
+{
+ MemoryContext old_ctx;
+ RT_ITER *iter;
+ RT_PTR_LOCAL root;
+ int top_level;
+
+ old_ctx = MemoryContextSwitchTo(tree->context);
+
+ iter = (RT_ITER *) palloc0(sizeof(RT_ITER));
+ iter->tree = tree;
+
+ RT_LOCK_SHARED(tree);
+
+ /* empty tree */
+ if (!iter->tree->ctl->root)
+ {
+ MemoryContextSwitchTo(old_ctx);
+ return iter;
+ }
+
+ root = RT_PTR_GET_LOCAL(tree, iter->tree->ctl->root);
+ top_level = root->shift / RT_NODE_SPAN;
+ iter->stack_len = top_level;
+
+ /*
+ * Descend to the leftmost leaf node from the root. The key is constructed
+ * while descending to the leaf.
+ */
+ RT_UPDATE_ITER_STACK(iter, root, top_level);
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return iter;
+}
+
+/*
+ * Return true and set *key_p and *value_p if there is a next key; otherwise
+ * return false.
+ */
+RT_SCOPE bool
+RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, RT_VALUE_TYPE *value_p)
+{
+ /* Empty tree */
+ if (!iter->tree->ctl->root)
+ return false;
+
+ for (;;)
+ {
+ RT_PTR_LOCAL child = NULL;
+ RT_VALUE_TYPE value;
+ int level;
+ bool found;
+
+ /* Advance the leaf node iterator to get next key-value pair */
+ found = RT_NODE_LEAF_ITERATE_NEXT(iter, &(iter->stack[0]), &value);
+
+ if (found)
+ {
+ *key_p = iter->key;
+ *value_p = value;
+ return true;
+ }
+
+ /*
+ * We've visited all values in the leaf node, so advance the inner node
+ * iterators from level 1 until we find the next child node.
+ */
+ for (level = 1; level <= iter->stack_len; level++)
+ {
+ child = RT_NODE_INNER_ITERATE_NEXT(iter, &(iter->stack[level]));
+
+ if (child)
+ break;
+ }
+
+ /* the iteration finished */
+ if (!child)
+ return false;
+
+ /*
+ * Found the next child node. Update the iterator stack from this node
+ * down to the leaf.
+ */
+ RT_UPDATE_ITER_STACK(iter, child, level - 1);
+
+ /* Node iterators are updated, so try again from the leaf */
+ }
+
+ return false;
+}
+
+/*
+ * Terminate the iteration and release the lock.
+ *
+ * This function must be called when the iteration is finished, or when
+ * exiting it early, in order to release the lock.
+ */
+RT_SCOPE void
+RT_END_ITERATE(RT_ITER *iter)
+{
+#ifdef RT_SHMEM
+ Assert(LWLockHeldByMe(&iter->tree->ctl->lock));
+#endif
+
+ RT_UNLOCK(iter->tree);
+ pfree(iter);
+}
+
+/*
+ * Return the amount of memory used by the radix tree.
+ */
+RT_SCOPE uint64
+RT_MEMORY_USAGE(RT_RADIX_TREE *tree)
+{
+ Size total = 0;
+
+ RT_LOCK_SHARED(tree);
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+ total = dsa_get_total_size(tree->dsa);
+#else
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
+ total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
+ }
+#endif
+
+ RT_UNLOCK(tree);
+ return total;
+}
+
+/*
+ * Verify the radix tree node.
+ */
+static void
+RT_VERIFY_NODE(RT_PTR_LOCAL node)
+{
+#ifdef USE_ASSERT_CHECKING
+ Assert(node->count >= 0);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE_BASE_3 *n3 = (RT_NODE_BASE_3 *) node;
+
+ for (int i = 1; i < n3->n.count; i++)
+ Assert(n3->chunks[i - 1] < n3->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE_BASE_32 *n32 = (RT_NODE_BASE_32 *) node;
+
+ for (int i = 1; i < n32->n.count; i++)
+ Assert(n32->chunks[i - 1] < n32->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE_BASE_125 *n125 = (RT_NODE_BASE_125 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ uint8 slot = n125->slot_idxs[i];
+ int idx = RT_BM_IDX(slot);
+ int bitnum = RT_BM_BIT(slot);
+
+ if (!RT_NODE_125_IS_CHUNK_USED(n125, i))
+ continue;
+
+ /* Check if the corresponding slot is used */
+ Assert(slot < node->fanout);
+ Assert((n125->isset[idx] & ((bitmapword) 1 << bitnum)) != 0);
+
+ cnt++;
+ }
+
+ Assert(n125->n.count == cnt);
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_256 *n256 = (RT_NODE_LEAF_256 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RT_BM_IDX(RT_NODE_MAX_SLOTS); i++)
+ cnt += bmw_popcount(n256->isset[i]);
+
+ /* Check that the number of used chunks matches */
+ Assert(n256->base.n.count == cnt);
+
+ break;
+ }
+ }
+ }
+#endif
+}
+
+/***************** DEBUG FUNCTIONS *****************/
+#ifdef RT_DEBUG
+
+#define RT_UINT64_FORMAT_HEX "%" INT64_MODIFIER "X"
+
+RT_SCOPE void
+RT_STATS(RT_RADIX_TREE *tree)
+{
+ RT_LOCK_SHARED(tree);
+
+ fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->ctl->max_val);
+ fprintf(stderr, "num_keys = " UINT64_FORMAT "\n", tree->ctl->num_keys);
+
+#ifdef RT_SHMEM
+ fprintf(stderr, "handle = " UINT64_FORMAT "\n", tree->ctl->handle);
+#endif
+
+ if (RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ {
+ RT_PTR_LOCAL root = RT_PTR_GET_LOCAL(tree, tree->ctl->root);
+
+ fprintf(stderr, "height = %d, n3 = %u, n15 = %u, n32 = %u, n125 = %u, n256 = %u\n",
+ root->shift / RT_NODE_SPAN,
+ tree->ctl->cnt[RT_CLASS_3],
+ tree->ctl->cnt[RT_CLASS_32_MIN],
+ tree->ctl->cnt[RT_CLASS_32_MAX],
+ tree->ctl->cnt[RT_CLASS_125],
+ tree->ctl->cnt[RT_CLASS_256]);
+ }
+
+ RT_UNLOCK(tree);
+}
+
+static void
+RT_DUMP_NODE(RT_RADIX_TREE *tree, RT_PTR_ALLOC allocnode, int level,
+ bool recurse, StringInfo buf)
+{
+ RT_PTR_LOCAL node = RT_PTR_GET_LOCAL(tree, allocnode);
+ StringInfoData spaces;
+
+ initStringInfo(&spaces);
+ appendStringInfoSpaces(&spaces, (level * 4) + 1);
+
+ appendStringInfo(buf, "%s%s[%s] kind %d, fanout %d, count %u, shift %u:\n",
+ spaces.data,
+ level == 0 ? "" : "-> ",
+ RT_NODE_IS_LEAF(node) ? "LEAF" : "INNR",
+ (node->kind == RT_NODE_KIND_3) ? 3 :
+ (node->kind == RT_NODE_KIND_32) ? 32 :
+ (node->kind == RT_NODE_KIND_125) ? 125 : 256,
+ node->fanout == 0 ? 256 : node->fanout,
+ node->count, node->shift);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_3 *n3 = (RT_NODE_LEAF_3 *) node;
+
+ appendStringInfo(buf, "%schunk[%d] 0x%X\n",
+ spaces.data, i, n3->base.chunks[i]);
+ }
+ else
+ {
+ RT_NODE_INNER_3 *n3 = (RT_NODE_INNER_3 *) node;
+
+ appendStringInfo(buf, "%schunk[%d] 0x%X",
+ spaces.data, i, n3->base.chunks[i]);
+
+ if (recurse)
+ {
+ appendStringInfo(buf, "\n");
+ RT_DUMP_NODE(tree, n3->children[i], level + 1,
+ recurse, buf);
+ }
+ else
+ appendStringInfo(buf, " (skipped)\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_32 *n32 = (RT_NODE_LEAF_32 *) node;
+
+ appendStringInfo(buf, "%schunk[%d] 0x%X\n",
+ spaces.data, i, n32->base.chunks[i]);
+ }
+ else
+ {
+ RT_NODE_INNER_32 *n32 = (RT_NODE_INNER_32 *) node;
+
+ appendStringInfo(buf, "%schunk[%d] 0x%X",
+ spaces.data, i, n32->base.chunks[i]);
+
+ if (recurse)
+ {
+ appendStringInfo(buf, "\n");
+ RT_DUMP_NODE(tree, n32->children[i], level + 1,
+ recurse, buf);
+ }
+ else
+ appendStringInfo(buf, " (skipped)\n");
+
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE_BASE_125 *b125 = (RT_NODE_BASE_125 *) node;
+ char *sep = "";
+
+ appendStringInfo(buf, "%sslot_idxs: ", spaces.data);
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(b125, i))
+ continue;
+
+ appendStringInfo(buf, "%s[%d]=%d ",
+ sep, i, b125->slot_idxs[i]);
+ sep = ",";
+ }
+
+ appendStringInfo(buf, "\n%sisset-bitmap: ", spaces.data);
+ for (int i = 0; i < (RT_SLOT_IDX_LIMIT / BITS_PER_BYTE); i++)
+ appendStringInfo(buf, "%X ", ((uint8 *) b125->isset)[i]);
+ appendStringInfo(buf, "\n");
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(b125, i))
+ continue;
+
+ if (RT_NODE_IS_LEAF(node))
+ appendStringInfo(buf, "%schunk 0x%X\n",
+ spaces.data, i);
+ else
+ {
+ RT_NODE_INNER_125 *n125 = (RT_NODE_INNER_125 *) b125;
+
+ appendStringInfo(buf, "%schunk 0x%X",
+ spaces.data, i);
+
+ if (recurse)
+ {
+ appendStringInfo(buf, "\n");
+ RT_DUMP_NODE(tree, RT_NODE_INNER_125_GET_CHILD(n125, i),
+ level + 1, recurse, buf);
+ }
+ else
+ appendStringInfo(buf, " (skipped)\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_256 *n256 = (RT_NODE_LEAF_256 *) node;
+
+ appendStringInfo(buf, "%sisset-bitmap: ", spaces.data);
+ for (int i = 0; i < (RT_SLOT_IDX_LIMIT / BITS_PER_BYTE); i++)
+ appendStringInfo(buf, "%X ", ((uint8 *) n256->isset)[i]);
+ appendStringInfo(buf, "\n");
+ }
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_256 *n256 = (RT_NODE_LEAF_256 *) node;
+
+ if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, i))
+ continue;
+
+ appendStringInfo(buf, "%schunk 0x%X\n",
+ spaces.data, i);
+ }
+ else
+ {
+ RT_NODE_INNER_256 *n256 = (RT_NODE_INNER_256 *) node;
+
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
+ continue;
+
+ appendStringInfo(buf, "%schunk 0x%X",
+ spaces.data, i);
+
+ if (recurse)
+ {
+ appendStringInfo(buf, "\n");
+ RT_DUMP_NODE(tree, RT_NODE_INNER_256_GET_CHILD(n256, i),
+ level + 1, recurse, buf);
+ }
+ else
+ appendStringInfo(buf, " (skipped)\n");
+ }
+ }
+ break;
+ }
+ }
+}
+
+RT_SCOPE void
+RT_DUMP_SEARCH(RT_RADIX_TREE *tree, uint64 key)
+{
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL node;
+ StringInfoData buf;
+ int shift;
+ int level = 0;
+
+ RT_STATS(tree);
+
+ RT_LOCK_SHARED(tree);
+
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ {
+ RT_UNLOCK(tree);
+ fprintf(stderr, "empty tree\n");
+ return;
+ }
+
+ if (key > tree->ctl->max_val)
+ {
+ RT_UNLOCK(tree);
+ fprintf(stderr, "key " UINT64_FORMAT "(0x" RT_UINT64_FORMAT_HEX ") is larger than max val\n",
+ key, key);
+ return;
+ }
+
+ initStringInfo(&buf);
+ allocnode = tree->ctl->root;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ shift = node->shift;
+ while (shift >= 0)
+ {
+ RT_PTR_ALLOC child;
+
+ RT_DUMP_NODE(tree, allocnode, level, false, &buf);
+
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_VALUE_TYPE dummy;
+
+ /* We reached a leaf node; find the corresponding slot */
+ RT_NODE_SEARCH_LEAF(node, key, &dummy);
+
+ break;
+ }
+
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ break;
+
+ allocnode = child;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ shift -= RT_NODE_SPAN;
+ level++;
+ }
+ RT_UNLOCK(tree);
+
+ fprintf(stderr, "%s", buf.data);
+}
+
+RT_SCOPE void
+RT_DUMP(RT_RADIX_TREE *tree)
+{
+ StringInfoData buf;
+
+ RT_STATS(tree);
+
+ RT_LOCK_SHARED(tree);
+
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ {
+ RT_UNLOCK(tree);
+ fprintf(stderr, "empty tree\n");
+ return;
+ }
+
+ initStringInfo(&buf);
+
+ RT_DUMP_NODE(tree, tree->ctl->root, 0, true, &buf);
+ RT_UNLOCK(tree);
+
+ fprintf(stderr, "%s",buf.data);
+}
+#endif
+
+#endif /* RT_DEFINE */
+
+
+/* undefine external parameters, so next radix tree can be defined */
+#undef RT_PREFIX
+#undef RT_SCOPE
+#undef RT_DECLARE
+#undef RT_DEFINE
+#undef RT_VALUE_TYPE
+
+/* locally declared macros */
+#undef RT_MAKE_PREFIX
+#undef RT_MAKE_NAME
+#undef RT_MAKE_NAME_
+#undef RT_NODE_SPAN
+#undef RT_NODE_MAX_SLOTS
+#undef RT_CHUNK_MASK
+#undef RT_MAX_SHIFT
+#undef RT_MAX_LEVEL
+#undef RT_GET_KEY_CHUNK
+#undef RT_BM_IDX
+#undef RT_BM_BIT
+#undef RT_LOCK_EXCLUSIVE
+#undef RT_LOCK_SHARED
+#undef RT_UNLOCK
+#undef RT_NODE_IS_LEAF
+#undef RT_NODE_MUST_GROW
+#undef RT_NODE_KIND_COUNT
+#undef RT_SIZE_CLASS_COUNT
+#undef RT_SLOT_IDX_LIMIT
+#undef RT_INVALID_SLOT_IDX
+#undef RT_SLAB_BLOCK_SIZE
+#undef RT_RADIX_TREE_MAGIC
+#undef RT_UINT64_FORMAT_HEX
+
+/* type declarations */
+#undef RT_RADIX_TREE
+#undef RT_RADIX_TREE_CONTROL
+#undef RT_PTR_LOCAL
+#undef RT_PTR_ALLOC
+#undef RT_INVALID_PTR_ALLOC
+#undef RT_HANDLE
+#undef RT_ITER
+#undef RT_NODE
+#undef RT_NODE_ITER
+#undef RT_NODE_KIND_3
+#undef RT_NODE_KIND_32
+#undef RT_NODE_KIND_125
+#undef RT_NODE_KIND_256
+#undef RT_NODE_BASE_3
+#undef RT_NODE_BASE_32
+#undef RT_NODE_BASE_125
+#undef RT_NODE_BASE_256
+#undef RT_NODE_INNER_3
+#undef RT_NODE_INNER_32
+#undef RT_NODE_INNER_125
+#undef RT_NODE_INNER_256
+#undef RT_NODE_LEAF_3
+#undef RT_NODE_LEAF_32
+#undef RT_NODE_LEAF_125
+#undef RT_NODE_LEAF_256
+#undef RT_SIZE_CLASS
+#undef RT_SIZE_CLASS_ELEM
+#undef RT_SIZE_CLASS_INFO
+#undef RT_CLASS_3
+#undef RT_CLASS_32_MIN
+#undef RT_CLASS_32_MAX
+#undef RT_CLASS_125
+#undef RT_CLASS_256
+
+/* function declarations */
+#undef RT_CREATE
+#undef RT_FREE
+#undef RT_ATTACH
+#undef RT_DETACH
+#undef RT_GET_HANDLE
+#undef RT_SEARCH
+#undef RT_SET
+#undef RT_BEGIN_ITERATE
+#undef RT_ITERATE_NEXT
+#undef RT_END_ITERATE
+#undef RT_USE_DELETE
+#undef RT_DELETE
+#undef RT_MEMORY_USAGE
+#undef RT_DUMP
+#undef RT_DUMP_NODE
+#undef RT_DUMP_SEARCH
+#undef RT_STATS
+
+/* internal helper functions */
+#undef RT_NEW_ROOT
+#undef RT_ALLOC_NODE
+#undef RT_INIT_NODE
+#undef RT_FREE_NODE
+#undef RT_FREE_RECURSE
+#undef RT_EXTEND
+#undef RT_SET_EXTEND
+#undef RT_SWITCH_NODE_KIND
+#undef RT_COPY_NODE
+#undef RT_REPLACE_NODE
+#undef RT_PTR_GET_LOCAL
+#undef RT_PTR_ALLOC_IS_VALID
+#undef RT_NODE_3_SEARCH_EQ
+#undef RT_NODE_32_SEARCH_EQ
+#undef RT_NODE_3_GET_INSERTPOS
+#undef RT_NODE_32_GET_INSERTPOS
+#undef RT_CHUNK_CHILDREN_ARRAY_SHIFT
+#undef RT_CHUNK_VALUES_ARRAY_SHIFT
+#undef RT_CHUNK_CHILDREN_ARRAY_DELETE
+#undef RT_CHUNK_VALUES_ARRAY_DELETE
+#undef RT_CHUNK_CHILDREN_ARRAY_COPY
+#undef RT_CHUNK_VALUES_ARRAY_COPY
+#undef RT_NODE_125_IS_CHUNK_USED
+#undef RT_NODE_INNER_125_GET_CHILD
+#undef RT_NODE_LEAF_125_GET_VALUE
+#undef RT_NODE_INNER_256_IS_CHUNK_USED
+#undef RT_NODE_LEAF_256_IS_CHUNK_USED
+#undef RT_NODE_INNER_256_GET_CHILD
+#undef RT_NODE_LEAF_256_GET_VALUE
+#undef RT_NODE_INNER_256_SET
+#undef RT_NODE_LEAF_256_SET
+#undef RT_NODE_INNER_256_DELETE
+#undef RT_NODE_LEAF_256_DELETE
+#undef RT_KEY_GET_SHIFT
+#undef RT_SHIFT_GET_MAX_VAL
+#undef RT_NODE_SEARCH_INNER
+#undef RT_NODE_SEARCH_LEAF
+#undef RT_NODE_UPDATE_INNER
+#undef RT_NODE_DELETE_INNER
+#undef RT_NODE_DELETE_LEAF
+#undef RT_NODE_INSERT_INNER
+#undef RT_NODE_INSERT_LEAF
+#undef RT_NODE_INNER_ITERATE_NEXT
+#undef RT_NODE_LEAF_ITERATE_NEXT
+#undef RT_UPDATE_ITER_STACK
+#undef RT_ITER_UPDATE_KEY
+#undef RT_VERIFY_NODE
+
+#undef RT_DEBUG
diff --git a/src/include/lib/radixtree_delete_impl.h b/src/include/lib/radixtree_delete_impl.h
new file mode 100644
index 0000000000..5f6dda1f12
--- /dev/null
+++ b/src/include/lib/radixtree_delete_impl.h
@@ -0,0 +1,122 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree_delete_impl.h
+ * Common implementation for deletion in leaf and inner nodes.
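+ *
+ * The included code returns true if the key's chunk was found and deleted
+ * from the node, otherwise false.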
+ *
+ * Note: There is deliberately no #include guard here
+ *
+ * TODO: Shrink nodes when deletion would allow them to fit in a smaller
+ * size class.
+ *
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * src/include/lib/radixtree_delete_impl.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE3_TYPE RT_NODE_INNER_3
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE3_TYPE RT_NODE_LEAF_3
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+
+#ifdef RT_NODE_LEVEL_LEAF
+ Assert(RT_NODE_IS_LEAF(node));
+#else
+ Assert(!RT_NODE_IS_LEAF(node));
+#endif
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node;
+ int idx = RT_NODE_3_SEARCH_EQ((RT_NODE_BASE_3 *) n3, chunk);
+
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_DELETE(n3->base.chunks, n3->values,
+ n3->base.n.count, idx);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_DELETE(n3->base.chunks, n3->children,
+ n3->base.n.count, idx);
+#endif
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
+ int idx = RT_NODE_32_SEARCH_EQ((RT_NODE_BASE_32 *) n32, chunk);
+
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_DELETE(n32->base.chunks, n32->values,
+ n32->base.n.count, idx);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_DELETE(n32->base.chunks, n32->children,
+ n32->base.n.count, idx);
+#endif
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+ int slotpos = n125->base.slot_idxs[chunk];
+ int idx;
+ int bitnum;
+
+ if (slotpos == RT_INVALID_SLOT_IDX)
+ return false;
+
+ idx = RT_BM_IDX(slotpos);
+ bitnum = RT_BM_BIT(slotpos);
+ n125->base.isset[idx] &= ~((bitmapword) 1 << bitnum);
+ n125->base.slot_idxs[chunk] = RT_INVALID_SLOT_IDX;
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk))
+#else
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, chunk))
+#endif
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_NODE_LEAF_256_DELETE(n256, chunk);
+#else
+ RT_NODE_INNER_256_DELETE(n256, chunk);
+#endif
+ break;
+ }
+ }
+
+ /* update statistics */
+ node->count--;
+
+ return true;
+
+#undef RT_NODE3_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_insert_impl.h b/src/include/lib/radixtree_insert_impl.h
new file mode 100644
index 0000000000..d56e58dcac
--- /dev/null
+++ b/src/include/lib/radixtree_insert_impl.h
@@ -0,0 +1,328 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree_insert_impl.h
+ * Common implementation for insertion in leaf and inner nodes.
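+ *
+ * For leaf nodes, the included code returns whether the key already existed
+ * (in which case its value is simply replaced); for inner nodes it returns
+ * nothing.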
+ *
+ * Note: There is deliberately no #include guard here
+ *
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * src/include/lib/radixtree_insert_impl.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE3_TYPE RT_NODE_INNER_3
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE3_TYPE RT_NODE_LEAF_3
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+
+#ifdef RT_NODE_LEVEL_LEAF
+ const bool is_leaf = true;
+ bool chunk_exists = false;
+ Assert(RT_NODE_IS_LEAF(node));
+#else
+ const bool is_leaf = false;
+ Assert(!RT_NODE_IS_LEAF(node));
+#endif
+
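+ /*
+ * Each case below may grow the node into the next larger kind when it is
+ * full; in that case the new node replaces the old one in the parent and
+ * we fall through to the next case to perform the insertion into the
+ * grown node.
+ */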
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ int idx = RT_NODE_3_SEARCH_EQ(&n3->base, chunk);
+
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n3->values[idx] = *value_p;
+ break;
+ }
+#endif
+ if (unlikely(RT_NODE_MUST_GROW(n3)))
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+ RT_NODE32_TYPE *new32;
+ const uint8 new_kind = RT_NODE_KIND_32;
+ const RT_SIZE_CLASS new_class = RT_CLASS_32_MIN;
+
+ /* grow node from 3 to 32 */
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
+ newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, is_leaf);
+ new32 = (RT_NODE32_TYPE *) newnode;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_COPY(n3->base.chunks, n3->values,
+ new32->base.chunks, new32->values);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_COPY(n3->base.chunks, n3->children,
+ new32->base.chunks, new32->children);
+#endif
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
+ node = newnode;
+ }
+ else
+ {
+ int insertpos = RT_NODE_3_GET_INSERTPOS(&n3->base, chunk);
+ int count = n3->base.n.count;
+
+ /* shift chunks and children */
+ if (insertpos < count)
+ {
+ Assert(count > 0);
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_SHIFT(n3->base.chunks, n3->values,
+ count, insertpos);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_SHIFT(n3->base.chunks, n3->children,
+ count, insertpos);
+#endif
+ }
+
+ n3->base.chunks[insertpos] = chunk;
+#ifdef RT_NODE_LEVEL_LEAF
+ n3->values[insertpos] = *value_p;
+#else
+ n3->children[insertpos] = child;
+#endif
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_32:
+ {
+ const RT_SIZE_CLASS_ELEM class32_max = RT_SIZE_CLASS_INFO[RT_CLASS_32_MAX];
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ int idx = RT_NODE_32_SEARCH_EQ(&n32->base, chunk);
+
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n32->values[idx] = *value_p;
+ break;
+ }
+#endif
+ if (unlikely(RT_NODE_MUST_GROW(n32)) &&
+ n32->base.n.fanout < class32_max.fanout)
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+ const RT_SIZE_CLASS_ELEM class32_min = RT_SIZE_CLASS_INFO[RT_CLASS_32_MIN];
+ const RT_SIZE_CLASS new_class = RT_CLASS_32_MAX;
+
+ Assert(n32->base.n.fanout == class32_min.fanout);
+
+ /* grow to the next size class of this kind */
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ n32 = (RT_NODE32_TYPE *) newnode;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ memcpy(newnode, node, class32_min.leaf_size);
+#else
+ memcpy(newnode, node, class32_min.inner_size);
+#endif
+ newnode->fanout = class32_max.fanout;
+
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
+ node = newnode;
+ }
+
+ if (unlikely(RT_NODE_MUST_GROW(n32)))
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+ RT_NODE125_TYPE *new125;
+ const uint8 new_kind = RT_NODE_KIND_125;
+ const RT_SIZE_CLASS new_class = RT_CLASS_125;
+
+ Assert(n32->base.n.fanout == class32_max.fanout);
+
+ /* grow node from 32 to 125 */
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
+ newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, is_leaf);
+ new125 = (RT_NODE125_TYPE *) newnode;
+
+ for (int i = 0; i < class32_max.fanout; i++)
+ {
+ new125->base.slot_idxs[n32->base.chunks[i]] = i;
+#ifdef RT_NODE_LEVEL_LEAF
+ new125->values[i] = n32->values[i];
+#else
+ new125->children[i] = n32->children[i];
+#endif
+ }
+
+ /*
+ * Since we just copied a dense array, we can set the bits
+ * using a single store, provided the length of that array
+ * is at most the number of bits in a bitmapword.
+ */
+ Assert(class32_max.fanout <= sizeof(bitmapword) * BITS_PER_BYTE);
+ new125->base.isset[0] = (bitmapword) (((uint64) 1 << class32_max.fanout) - 1);
+
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
+ node = newnode;
+ }
+ else
+ {
+ int insertpos = RT_NODE_32_GET_INSERTPOS(&n32->base, chunk);
+ int count = n32->base.n.count;
+
+ if (insertpos < count)
+ {
+ Assert(count > 0);
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_SHIFT(n32->base.chunks, n32->values,
+ count, insertpos);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_SHIFT(n32->base.chunks, n32->children,
+ count, insertpos);
+#endif
+ }
+
+ n32->base.chunks[insertpos] = chunk;
+#ifdef RT_NODE_LEVEL_LEAF
+ n32->values[insertpos] = *value_p;
+#else
+ n32->children[insertpos] = child;
+#endif
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+ int slotpos;
+ int cnt = 0;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ slotpos = n125->base.slot_idxs[chunk];
+ if (slotpos != RT_INVALID_SLOT_IDX)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n125->values[slotpos] = *value_p;
+ break;
+ }
+#endif
+ if (unlikely(RT_NODE_MUST_GROW(n125)))
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+ RT_NODE256_TYPE *new256;
+ const uint8 new_kind = RT_NODE_KIND_256;
+ const RT_SIZE_CLASS new_class = RT_CLASS_256;
+
+ /* grow node from 125 to 256 */
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
+ newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, is_leaf);
+ new256 = (RT_NODE256_TYPE *) newnode;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n125->base.n.count; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(&n125->base, i))
+ continue;
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_NODE_LEAF_256_SET(new256, i, RT_NODE_LEAF_125_GET_VALUE(n125, i));
+#else
+ RT_NODE_INNER_256_SET(new256, i, RT_NODE_INNER_125_GET_CHILD(n125, i));
+#endif
+ cnt++;
+ }
+
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
+ node = newnode;
+ }
+ else
+ {
+ int idx;
+ bitmapword inverse;
+
+ /* get the first word with at least one bit not set */
+ for (idx = 0; idx < RT_BM_IDX(RT_SLOT_IDX_LIMIT); idx++)
+ {
+ if (n125->base.isset[idx] < ~((bitmapword) 0))
+ break;
+ }
+
+ /* To get the first unset bit in X, get the first set bit in ~X */
+ inverse = ~(n125->base.isset[idx]);
+ slotpos = idx * BITS_PER_BITMAPWORD;
+ slotpos += bmw_rightmost_one_pos(inverse);
+ Assert(slotpos < node->fanout);
+
+ /* mark the slot used */
+ n125->base.isset[idx] |= bmw_rightmost_one(inverse);
+ n125->base.slot_idxs[chunk] = slotpos;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ n125->values[slotpos] = *value_p;
+#else
+ n125->children[slotpos] = child;
+#endif
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ chunk_exists = RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk);
+ Assert(chunk_exists || node->count < RT_NODE_MAX_SLOTS);
+ RT_NODE_LEAF_256_SET(n256, chunk, *value_p);
+#else
+ Assert(node->count < RT_NODE_MAX_SLOTS);
+ RT_NODE_INNER_256_SET(n256, chunk, child);
+#endif
+ break;
+ }
+ }
+
+ /* Update statistics */
+#ifdef RT_NODE_LEVEL_LEAF
+ if (!chunk_exists)
+ node->count++;
+#else
+ node->count++;
+#endif
+
+ /*
+ * Done. Finally, verify that the chunk and value have been inserted or
+ * replaced properly in the node.
+ */
+ RT_VERIFY_NODE(node);
+
+#ifdef RT_NODE_LEVEL_LEAF
+ return chunk_exists;
+#else
+ return;
+#endif
+
+#undef RT_NODE3_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_iter_impl.h b/src/include/lib/radixtree_iter_impl.h
new file mode 100644
index 0000000000..98c78eb237
--- /dev/null
+++ b/src/include/lib/radixtree_iter_impl.h
@@ -0,0 +1,153 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree_iter_impl.h
+ * Common implementation for iteration in leaf and inner nodes.
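+ *
+ * For inner nodes, the included code returns the next child node (or NULL
+ * if there is none); for leaf nodes it returns whether a next value was
+ * found, setting *value_p if so.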
+ *
+ * Note: There is deliberately no #include guard here
+ *
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * src/include/lib/radixtree_iter_impl.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE3_TYPE RT_NODE_INNER_3
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE3_TYPE RT_NODE_LEAF_3
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ bool found = false;
+ uint8 key_chunk;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_VALUE_TYPE value;
+
+ Assert(RT_NODE_IS_LEAF(node_iter->node));
+#else
+ RT_PTR_LOCAL child = NULL;
+
+ Assert(!RT_NODE_IS_LEAF(node_iter->node));
+#endif
+
+#ifdef RT_SHMEM
+ Assert(iter->tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+
+ switch (node_iter->node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n3->base.n.count)
+ break;
+#ifdef RT_NODE_LEVEL_LEAF
+ value = n3->values[node_iter->current_idx];
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, n3->children[node_iter->current_idx]);
+#endif
+ key_chunk = n3->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n32->base.n.count)
+ break;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ value = n32->values[node_iter->current_idx];
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, n32->children[node_iter->current_idx]);
+#endif
+ key_chunk = n32->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (RT_NODE_125_IS_CHUNK_USED((RT_NODE_BASE_125 *) n125, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+#ifdef RT_NODE_LEVEL_LEAF
+ value = RT_NODE_LEAF_125_GET_VALUE(n125, i);
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, RT_NODE_INNER_125_GET_CHILD(n125, i));
+#endif
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+#ifdef RT_NODE_LEVEL_LEAF
+ if (RT_NODE_LEAF_256_IS_CHUNK_USED(n256, i))
+#else
+ if (RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
+#endif
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+#ifdef RT_NODE_LEVEL_LEAF
+ value = RT_NODE_LEAF_256_GET_VALUE(n256, i);
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, RT_NODE_INNER_256_GET_CHILD(n256, i));
+#endif
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ }
+
+ if (found)
+ {
+ RT_ITER_UPDATE_KEY(iter, key_chunk, node_iter->node->shift);
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = value;
+#endif
+ }
+
+#ifdef RT_NODE_LEVEL_LEAF
+ return found;
+#else
+ return child;
+#endif
+
+#undef RT_NODE3_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_search_impl.h b/src/include/lib/radixtree_search_impl.h
new file mode 100644
index 0000000000..a8925c75d0
--- /dev/null
+++ b/src/include/lib/radixtree_search_impl.h
@@ -0,0 +1,138 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree_search_impl.h
+ * Common implementation for search in leaf and inner nodes, plus
+ * update for inner nodes only.
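+ *
+ * Without RT_ACTION_UPDATE, the included code returns true and sets *value_p
+ * (leaf) or *child_p (inner) if the key's chunk is present, otherwise false.
+ * With RT_ACTION_UPDATE, it replaces the existing child with new_child and
+ * returns nothing.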
+ *
+ * Note: There is deliberately no #include guard here
+ *
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * src/include/lib/radixtree_search_impl.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE3_TYPE RT_NODE_INNER_3
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE3_TYPE RT_NODE_LEAF_3
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+
+#ifdef RT_NODE_LEVEL_LEAF
+ Assert(value_p != NULL);
+ Assert(RT_NODE_IS_LEAF(node));
+#else
+#ifndef RT_ACTION_UPDATE
+ Assert(child_p != NULL);
+#endif
+ Assert(!RT_NODE_IS_LEAF(node));
+#endif
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node;
+ int idx = RT_NODE_3_SEARCH_EQ((RT_NODE_BASE_3 *) n3, chunk);
+
+#ifdef RT_ACTION_UPDATE
+ Assert(idx >= 0);
+ n3->children[idx] = new_child;
+#else
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = n3->values[idx];
+#else
+ *child_p = n3->children[idx];
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
+ int idx = RT_NODE_32_SEARCH_EQ((RT_NODE_BASE_32 *) n32, chunk);
+
+#ifdef RT_ACTION_UPDATE
+ Assert(idx >= 0);
+ n32->children[idx] = new_child;
+#else
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = n32->values[idx];
+#else
+ *child_p = n32->children[idx];
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+ int slotpos = n125->base.slot_idxs[chunk];
+
+#ifdef RT_ACTION_UPDATE
+ Assert(slotpos != RT_INVALID_SLOT_IDX);
+ n125->children[slotpos] = new_child;
+#else
+ if (slotpos == RT_INVALID_SLOT_IDX)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = RT_NODE_LEAF_125_GET_VALUE(n125, chunk);
+#else
+ *child_p = RT_NODE_INNER_125_GET_CHILD(n125, chunk);
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
+
+#ifdef RT_ACTION_UPDATE
+ RT_NODE_INNER_256_SET(n256, chunk, new_child);
+#else
+#ifdef RT_NODE_LEVEL_LEAF
+ if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk))
+#else
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, chunk))
+#endif
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = RT_NODE_LEAF_256_GET_VALUE(n256, chunk);
+#else
+ *child_p = RT_NODE_INNER_256_GET_CHILD(n256, chunk);
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ }
+
+#ifdef RT_ACTION_UPDATE
+ return;
+#else
+ return true;
+#endif /* RT_ACTION_UPDATE */
+
+#undef RT_NODE3_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/utils/dsa.h b/src/include/utils/dsa.h
index 3ce4ee300a..2af215484f 100644
--- a/src/include/utils/dsa.h
+++ b/src/include/utils/dsa.h
@@ -121,6 +121,7 @@ extern dsa_handle dsa_get_handle(dsa_area *area);
extern dsa_pointer dsa_allocate_extended(dsa_area *area, size_t size, int flags);
extern void dsa_free(dsa_area *area, dsa_pointer dp);
extern void *dsa_get_address(dsa_area *area, dsa_pointer dp);
+extern size_t dsa_get_total_size(dsa_area *area);
extern void dsa_trim(dsa_area *area);
extern void dsa_dump(dsa_area *area);
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index c629cbe383..9659eb85d7 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -28,6 +28,7 @@ SUBDIRS = \
test_pg_db_role_setting \
test_pg_dump \
test_predtest \
+ test_radixtree \
test_rbtree \
test_regex \
test_rls_hooks \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 1baa6b558d..232cbdac80 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -24,6 +24,7 @@ subdir('test_parser')
subdir('test_pg_db_role_setting')
subdir('test_pg_dump')
subdir('test_predtest')
+subdir('test_radixtree')
subdir('test_rbtree')
subdir('test_regex')
subdir('test_rls_hooks')
diff --git a/src/test/modules/test_radixtree/.gitignore b/src/test/modules/test_radixtree/.gitignore
new file mode 100644
index 0000000000..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/test_radixtree/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_radixtree/Makefile b/src/test/modules/test_radixtree/Makefile
new file mode 100644
index 0000000000..da06b93da3
--- /dev/null
+++ b/src/test/modules/test_radixtree/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_radixtree/Makefile
+
+MODULE_big = test_radixtree
+OBJS = \
+ $(WIN32RES) \
+ test_radixtree.o
+PGFILEDESC = "test_radixtree - test code for src/backend/lib/radixtree.c"
+
+EXTENSION = test_radixtree
+DATA = test_radixtree--1.0.sql
+
+REGRESS = test_radixtree
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_radixtree
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_radixtree/README b/src/test/modules/test_radixtree/README
new file mode 100644
index 0000000000..a8b271869a
--- /dev/null
+++ b/src/test/modules/test_radixtree/README
@@ -0,0 +1,7 @@
+test_radixtree contains unit tests for the radix tree implementation in
+src/include/lib/radixtree.h.
+
+The tests verify the correctness of the implementation, but they can also be
+used as a micro-benchmark. If you set the 'rt_test_stats' flag in
+test_radixtree.c, the tests will print extra information about execution time
+and memory usage.
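+
+For example (assuming a standard in-tree make build), the regression test can
+be run with:
+
+    make -C src/test/modules/test_radixtree check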
diff --git a/src/test/modules/test_radixtree/expected/test_radixtree.out b/src/test/modules/test_radixtree/expected/test_radixtree.out
new file mode 100644
index 0000000000..ce645cb8b5
--- /dev/null
+++ b/src/test/modules/test_radixtree/expected/test_radixtree.out
@@ -0,0 +1,36 @@
+CREATE EXTENSION test_radixtree;
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
+NOTICE: testing basic operations with leaf node 4
+NOTICE: testing basic operations with inner node 4
+NOTICE: testing basic operations with leaf node 32
+NOTICE: testing basic operations with inner node 32
+NOTICE: testing basic operations with leaf node 125
+NOTICE: testing basic operations with inner node 125
+NOTICE: testing basic operations with leaf node 256
+NOTICE: testing basic operations with inner node 256
+NOTICE: testing radix tree node types with shift "0"
+NOTICE: testing radix tree node types with shift "8"
+NOTICE: testing radix tree node types with shift "16"
+NOTICE: testing radix tree node types with shift "24"
+NOTICE: testing radix tree node types with shift "32"
+NOTICE: testing radix tree node types with shift "40"
+NOTICE: testing radix tree node types with shift "48"
+NOTICE: testing radix tree node types with shift "56"
+NOTICE: testing radix tree with pattern "all ones"
+NOTICE: testing radix tree with pattern "alternating bits"
+NOTICE: testing radix tree with pattern "clusters of ten"
+NOTICE: testing radix tree with pattern "clusters of hundred"
+NOTICE: testing radix tree with pattern "one-every-64k"
+NOTICE: testing radix tree with pattern "sparse"
+NOTICE: testing radix tree with pattern "single values, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^60"
+ test_radixtree
+----------------
+
+(1 row)
+
diff --git a/src/test/modules/test_radixtree/meson.build b/src/test/modules/test_radixtree/meson.build
new file mode 100644
index 0000000000..6add06bbdb
--- /dev/null
+++ b/src/test/modules/test_radixtree/meson.build
@@ -0,0 +1,35 @@
+# FIXME: prevent install during main install, but not during test :/
+
+test_radixtree_sources = files(
+ 'test_radixtree.c',
+)
+
+if host_system == 'windows'
+ test_radixtree_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'test_radixtree',
'--FILEDESC', 'test_radixtree - test code for src/include/lib/radixtree.h',])
+endif
+
+test_radixtree = shared_module('test_radixtree',
+ test_radixtree_sources,
+ link_with: pgport_srv,
+ kwargs: pg_mod_args,
+)
+testprep_targets += test_radixtree
+
+install_data(
+ 'test_radixtree.control',
+ 'test_radixtree--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'test_radixtree',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'test_radixtree',
+ ],
+ },
+}
diff --git a/src/test/modules/test_radixtree/sql/test_radixtree.sql b/src/test/modules/test_radixtree/sql/test_radixtree.sql
new file mode 100644
index 0000000000..41ece5e9f5
--- /dev/null
+++ b/src/test/modules/test_radixtree/sql/test_radixtree.sql
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_radixtree;
+
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
diff --git a/src/test/modules/test_radixtree/test_radixtree--1.0.sql b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
new file mode 100644
index 0000000000..074a5a7ea7
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_radixtree/test_radixtree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_radixtree" to load this file. \quit
+
+CREATE FUNCTION test_radixtree()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
new file mode 100644
index 0000000000..f944945db9
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -0,0 +1,674 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_radixtree.c
+ * Test radixtree set data structure.
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/test/modules/test_radixtree/test_radixtree.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "miscadmin.h"
+#include "nodes/bitmapset.h"
+#include "storage/block.h"
+#include "storage/itemptr.h"
+#include "storage/lwlock.h"
+#include "utils/memutils.h"
+#include "utils/timestamp.h"
+
+#define UINT64_HEX_FORMAT "%" INT64_MODIFIER "X"
+
+/*
+ * The tests pass with uint32, but build with warnings because the string
+ * format expects uint64.
+ */
+typedef uint64 TestValueType;
+
+/*
+ * If you enable this, the "pattern" tests will print information about
+ * how long populating, probing, and iterating the test set takes, and
+ * how much memory the test set consumed. That can be used as
+ * micro-benchmark of various operations and input patterns (you might
+ * want to increase the number of values used in each of the test, if
+ * you do that, to reduce noise).
+ *
+ * The information is printed to the server's stderr, mostly because
+ * that's where MemoryContextStats() output goes.
+ */
+static const bool rt_test_stats = false;
+
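+/*
+ * Fanouts of each node kind, with a leading zero so that the previous entry
+ * can serve as the lower bound of the key range checked after a node grows.
+ */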
+static int rt_node_kind_fanouts[] = {
+ 0,
+ 4, /* RT_NODE_KIND_4 */
+ 32, /* RT_NODE_KIND_32 */
+ 125, /* RT_NODE_KIND_125 */
+ 256 /* RT_NODE_KIND_256 */
+};
+/*
+ * A struct to define a pattern of integers, for use with the test_pattern()
+ * function.
+ */
+typedef struct
+{
+ char *test_name; /* short name of the test, for humans */
+ char *pattern_str; /* a bit pattern */
+ uint64 spacing; /* pattern repeats at this interval */
+ uint64 num_values; /* number of integers to set in total */
+} test_spec;
+
+/* Test patterns borrowed from test_integerset.c */
+static const test_spec test_specs[] = {
+ {
+ "all ones", "1111111111",
+ 10, 1000000
+ },
+ {
+ "alternating bits", "0101010101",
+ 10, 1000000
+ },
+ {
+ "clusters of ten", "1111111111",
+ 10000, 1000000
+ },
+ {
+ "clusters of hundred",
+ "1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111",
+ 10000, 1000000
+ },
+ {
+ "one-every-64k", "1",
+ 65536, 1000000
+ },
+ {
+ "sparse", "100000000000000000000000000000001",
+ 10000000, 1000000
+ },
+ {
+ "single values, distance > 2^32", "1",
+ UINT64CONST(10000000000), 100000
+ },
+ {
+ "clusters, distance > 2^32", "10101010",
+ UINT64CONST(10000000000), 1000000
+ },
+ {
+ "clusters, distance > 2^60", "10101010",
+ UINT64CONST(2000000000000000000),
+ 23 /* can't be much higher than this, or we
+ * overflow uint64 */
+ }
+};
+
+/* define the radix tree implementation to test */
+#define RT_PREFIX rt
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#define RT_USE_DELETE
+#define RT_VALUE_TYPE TestValueType
+// WIP: compiles with warnings because rt_attach is defined but not used
+// #define RT_SHMEM
+#include "lib/radixtree.h"
+
+
+/*
+ * Return the number of keys in the radix tree.
+ */
+static uint64
+rt_num_entries(rt_radix_tree *tree)
+{
+ return tree->ctl->num_keys;
+}
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(test_radixtree);
+
+static void
+test_empty(void)
+{
+ rt_radix_tree *radixtree;
+ rt_iter *iter;
+ TestValueType dummy;
+ uint64 key;
+ TestValueType val;
+
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+ dsa = dsa_create(tranche_id);
+
+ radixtree = rt_create(CurrentMemoryContext, dsa, tranche_id);
+#else
+ radixtree = rt_create(CurrentMemoryContext);
+#endif
+
+ if (rt_search(radixtree, 0, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, 1, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, PG_UINT64_MAX, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_delete(radixtree, 0))
+ elog(ERROR, "rt_delete on empty tree returned true");
+
+ if (rt_num_entries(radixtree) != 0)
+ elog(ERROR, "rt_num_entries on empty tree return non-zero");
+
+ iter = rt_begin_iterate(radixtree);
+
+ if (rt_iterate_next(iter, &key, &val))
+ elog(ERROR, "rt_itereate_next on empty tree returned true");
+
+ rt_end_iterate(iter);
+
+ rt_free(radixtree);
+
+#ifdef RT_SHMEM
+ dsa_detach(dsa);
+#endif
+}
+
+static void
+test_basic(int children, bool test_inner)
+{
+ rt_radix_tree *radixtree;
+ uint64 *keys;
+ int shift = test_inner ? 8 : 0;
+
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+ dsa = dsa_create(tranche_id);
+#endif
+
+ elog(NOTICE, "testing basic operations with %s node %d",
+ test_inner ? "inner" : "leaf", children);
+
+#ifdef RT_SHMEM
+ radixtree = rt_create(CurrentMemoryContext, dsa, tranche_id);
+#else
+ radixtree = rt_create(CurrentMemoryContext);
+#endif
+
+ /* prepare keys in an order like 1, 32, 2, 31, 3, ... */
+ keys = palloc(sizeof(uint64) * children);
+ for (int i = 0; i < children; i++)
+ {
+ if (i % 2 == 0)
+ keys[i] = (uint64) ((i / 2) + 1) << shift;
+ else
+ keys[i] = (uint64) (children - (i / 2)) << shift;
+ }
+
+ /* insert keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (rt_set(radixtree, keys[i], (TestValueType*) &keys[i]))
+ elog(ERROR, "new inserted key 0x" UINT64_HEX_FORMAT " is found ", keys[i]);
+ }
+
+ /* look up keys */
+ for (int i = 0; i < children; i++)
+ {
+ TestValueType value;
+
+ if (!rt_search(radixtree, keys[i], &value))
+ elog(ERROR, "could not find key 0x" UINT64_HEX_FORMAT, keys[i]);
+ if (value != (TestValueType) keys[i])
+ elog(ERROR, "rt_search returned 0x" UINT64_HEX_FORMAT ", expected " UINT64_HEX_FORMAT,
+ value, (TestValueType) keys[i]);
+ }
+
+ /* update keys */
+ for (int i = 0; i < children; i++)
+ {
+ TestValueType update = keys[i] + 1;
+ if (!rt_set(radixtree, keys[i], (TestValueType*) &update))
+ elog(ERROR, "could not update key 0x" UINT64_HEX_FORMAT, keys[i]);
+ }
+
+ /* repeat deleting and inserting keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (!rt_delete(radixtree, keys[i]))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, keys[i]);
+ if (rt_set(radixtree, keys[i], (TestValueType*) &keys[i]))
+ elog(ERROR, "new inserted key 0x" UINT64_HEX_FORMAT " is found ", keys[i]);
+ }
+
+ pfree(keys);
+ rt_free(radixtree);
+#ifdef RT_SHMEM
+ dsa_detach(dsa);
+#endif
+}
+
+/*
+ * Check if keys from start to end with the shift exist in the tree.
+ */
+static void
+check_search_on_node(rt_radix_tree *radixtree, uint8 shift, int start, int end,
+ int incr)
+{
+ for (int i = start; i < end; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ TestValueType val;
+
+ if (!rt_search(radixtree, key, &val))
+ elog(ERROR, "key 0x" UINT64_HEX_FORMAT " is not found on node-%d",
+ key, end);
+ if (val != (TestValueType) key)
+ elog(ERROR, "rt_search with key 0x" UINT64_HEX_FORMAT " returns 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ key, val, key);
+ }
+}
+
+static void
+test_node_types_insert(rt_radix_tree *radixtree, uint8 shift, bool insert_asc)
+{
+ uint64 num_entries;
+ int ninserted = 0;
+ int start = insert_asc ? 0 : 256;
+ int incr = insert_asc ? 1 : -1;
+ int end = insert_asc ? 256 : 0;
+ int node_kind_idx = 1;
+
+ for (int i = start; i != end; i += incr)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_set(radixtree, key, (TestValueType*) &key);
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " is found", key);
+
+ /*
+ * After filling all slots in each node type, check if the values
+ * are stored properly.
+ */
+ if (ninserted == rt_node_kind_fanouts[node_kind_idx] - 1)
+ {
+ int check_start = insert_asc
+ ? rt_node_kind_fanouts[node_kind_idx - 1]
+ : rt_node_kind_fanouts[node_kind_idx];
+ int check_end = insert_asc
+ ? rt_node_kind_fanouts[node_kind_idx]
+ : rt_node_kind_fanouts[node_kind_idx - 1];
+
+ check_search_on_node(radixtree, shift, check_start, check_end, incr);
+ node_kind_idx++;
+ }
+
+ ninserted++;
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ if (num_entries != 256)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(256));
+}
+
+static void
+test_node_types_delete(rt_radix_tree *radixtree, uint8 shift)
+{
+ uint64 num_entries;
+
+ for (int i = 0; i < 256; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_delete(radixtree, key);
+
+ if (!found)
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, key);
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ /* The tree must be empty */
+ if (num_entries != 0)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(256));
+}
+
+/*
+ * Test for inserting and deleting key-value pairs to each node type at the given shift
+ * level.
+ */
+static void
+test_node_types(uint8 shift)
+{
+ rt_radix_tree *radixtree;
+
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+ dsa = dsa_create(tranche_id);
+#endif
+
+ elog(NOTICE, "testing radix tree node types with shift \"%d\"", shift);
+
+#ifdef RT_SHMEM
+ radixtree = rt_create(CurrentMemoryContext, dsa, tranche_id);
+#else
+ radixtree = rt_create(CurrentMemoryContext);
+#endif
+
+ /*
+ * Insert and search entries for every node type at the 'shift' level,
+ * then delete all entries to make it empty, and insert and search entries
+ * again.
+ */
+ test_node_types_insert(radixtree, shift, true);
+ test_node_types_delete(radixtree, shift);
+ test_node_types_insert(radixtree, shift, false);
+
+ rt_free(radixtree);
+#ifdef RT_SHMEM
+ dsa_detach(dsa);
+#endif
+}
+
+/*
+ * Test with a repeating pattern, defined by the 'spec'.
+ */
+static void
+test_pattern(const test_spec * spec)
+{
+ rt_radix_tree *radixtree;
+ rt_iter *iter;
+ MemoryContext radixtree_ctx;
+ TimestampTz starttime;
+ TimestampTz endtime;
+ uint64 n;
+ uint64 last_int;
+ uint64 ndeleted;
+ uint64 nbefore;
+ uint64 nafter;
+ int patternlen;
+ uint64 *pattern_values;
+ uint64 pattern_num_values;
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+ dsa = dsa_create(tranche_id);
+#endif
+
+ elog(NOTICE, "testing radix tree with pattern \"%s\"", spec->test_name);
+ if (rt_test_stats)
+ fprintf(stderr, "-----\ntesting radix tree with pattern \"%s\"\n", spec->test_name);
+
+ /* Pre-process the pattern, creating an array of integers from it. */
+ patternlen = strlen(spec->pattern_str);
+ pattern_values = palloc(patternlen * sizeof(uint64));
+ pattern_num_values = 0;
+ for (int i = 0; i < patternlen; i++)
+ {
+ if (spec->pattern_str[i] == '1')
+ pattern_values[pattern_num_values++] = i;
+ }
+
+ /*
+ * Allocate the radix tree.
+ *
+ * Allocate it in a separate memory context, so that we can print its
+ * memory usage easily.
+ */
+ radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
+ "radixtree test",
+ ALLOCSET_SMALL_SIZES);
+ MemoryContextSetIdentifier(radixtree_ctx, spec->test_name);
+
+#ifdef RT_SHMEM
+ radixtree = rt_create(radixtree_ctx, dsa, tranche_id);
+#else
+ radixtree = rt_create(radixtree_ctx);
+#endif
+
+
+ /*
+ * Add values to the set.
+ */
+ starttime = GetCurrentTimestamp();
+
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ uint64 x = 0;
+
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ bool found;
+
+ x = last_int + pattern_values[i];
+
+ found = rt_set(radixtree, x, (TestValueType*) &x);
+
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " found", x);
+
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+
+ endtime = GetCurrentTimestamp();
+
+ if (rt_test_stats)
+ fprintf(stderr, "added " UINT64_FORMAT " values in %d ms\n",
+ spec->num_values, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Print stats on the amount of memory used.
+ *
+ * We print the usage reported by rt_memory_usage(), as well as the stats
+ * from the memory context. They should be in the same ballpark, but it's
+ * hard to automate testing that, so if you're making changes to the
+ * implementation, just observe that manually.
+ */
+ if (rt_test_stats)
+ {
+ uint64 mem_usage;
+
+ /*
+ * Also print memory usage as reported by rt_memory_usage(). It
+ * should be in the same ballpark as the usage reported by
+ * MemoryContextStats().
+ */
+ mem_usage = rt_memory_usage(radixtree);
+ fprintf(stderr, "rt_memory_usage() reported " UINT64_FORMAT " (%0.2f bytes / integer)\n",
+ mem_usage, (double) mem_usage / spec->num_values);
+
+ MemoryContextStats(radixtree_ctx);
+ }
+
+ /* Check that rt_num_entries works */
+ n = rt_num_entries(radixtree);
+ if (n != spec->num_values)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT, n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_search()
+ */
+ starttime = GetCurrentTimestamp();
+
+ for (n = 0; n < 100000; n++)
+ {
+ bool found;
+ bool expected;
+ uint64 x;
+ TestValueType v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Do we expect this value to be present in the set? */
+ if (x >= last_int)
+ expected = false;
+ else
+ {
+ uint64 idx = x % spec->spacing;
+
+ if (idx >= patternlen)
+ expected = false;
+ else if (spec->pattern_str[idx] == '1')
+ expected = true;
+ else
+ expected = false;
+ }
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (found != expected)
+ elog(ERROR, "mismatch at 0x" UINT64_HEX_FORMAT ": %d vs %d", x, found, expected);
+ if (found && (v != (TestValueType) x))
+ elog(ERROR, "found 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ v, x);
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "probed " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Test iterator
+ */
+ starttime = GetCurrentTimestamp();
+
+ iter = rt_begin_iterate(radixtree);
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ uint64 expected = last_int + pattern_values[i];
+ uint64 x;
+ TestValueType val;
+
+ if (!rt_iterate_next(iter, &x, &val))
+ break;
+
+ if (x != expected)
+ elog(ERROR,
+ "iterate returned wrong key; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d",
+ x, expected, i);
+ if (val != (TestValueType) expected)
+ elog(ERROR,
+ "iterate returned wrong value; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d", x, expected, i);
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "iterated " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ rt_end_iterate(iter);
+
+ if (n < spec->num_values)
+ elog(ERROR, "iterator stopped short after " UINT64_FORMAT " entries, expected " UINT64_FORMAT, n, spec->num_values);
+ if (n > spec->num_values)
+ elog(ERROR, "iterator returned " UINT64_FORMAT " entries, " UINT64_FORMAT " was expected", n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_delete()
+ */
+ starttime = GetCurrentTimestamp();
+
+ nbefore = rt_num_entries(radixtree);
+ ndeleted = 0;
+ for (n = 0; n < 1; n++)
+ {
+ bool found;
+ uint64 x;
+ TestValueType v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (!found)
+ continue;
+
+ /* If the key is found, delete it and check again */
+ if (!rt_delete(radixtree, x))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_search(radixtree, x, &v))
+ elog(ERROR, "found deleted key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_delete(radixtree, x))
+ elog(ERROR, "deleted already-deleted key 0x" UINT64_HEX_FORMAT, x);
+
+ ndeleted++;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "deleted " UINT64_FORMAT " values in %d ms\n",
+ ndeleted, (int) (endtime - starttime) / 1000);
+
+ nafter = rt_num_entries(radixtree);
+
+ /* Check that rt_num_entries works */
+ if ((nbefore - ndeleted) != nafter)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT "after " UINT64_FORMAT " deletion",
+ nafter, (nbefore - ndeleted), ndeleted);
+
+ rt_free(radixtree);
+ MemoryContextDelete(radixtree_ctx);
+#ifdef RT_SHMEM
+ dsa_detach(dsa);
+#endif
+}
+
+Datum
+test_radixtree(PG_FUNCTION_ARGS)
+{
+ test_empty();
+
+ for (int i = 1; i < lengthof(rt_node_kind_fanouts); i++)
+ {
+ test_basic(rt_node_kind_fanouts[i], false);
+ test_basic(rt_node_kind_fanouts[i], true);
+ }
+
+ for (int shift = 0; shift <= (64 - 8); shift += 8)
+ test_node_types(shift);
+
+ /* Test different test patterns, with lots of entries */
+ for (int i = 0; i < lengthof(test_specs); i++)
+ test_pattern(&test_specs[i]);
+
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_radixtree/test_radixtree.control b/src/test/modules/test_radixtree/test_radixtree.control
new file mode 100644
index 0000000000..e53f2a3e0c
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.control
@@ -0,0 +1,4 @@
+comment = 'Test code for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/test_radixtree'
+relocatable = true
diff --git a/src/tools/pginclude/cpluspluscheck b/src/tools/pginclude/cpluspluscheck
index e52fe9f509..09fa6e7432 100755
--- a/src/tools/pginclude/cpluspluscheck
+++ b/src/tools/pginclude/cpluspluscheck
@@ -101,6 +101,12 @@ do
test "$f" = src/include/nodes/nodetags.h && continue
test "$f" = src/backend/nodes/nodetags.h && continue
+ # radixtree_*_impl.h cannot be included standalone: they are just code fragments.
+ test "$f" = src/include/lib/radixtree_delete_impl.h && continue
+ test "$f" = src/include/lib/radixtree_insert_impl.h && continue
+ test "$f" = src/include/lib/radixtree_iter_impl.h && continue
+ test "$f" = src/include/lib/radixtree_search_impl.h && continue
+
# These files are not meant to be included standalone, because
# they contain lists that might have multiple use-cases.
test "$f" = src/include/access/rmgrlist.h && continue
diff --git a/src/tools/pginclude/headerscheck b/src/tools/pginclude/headerscheck
index abbba7aa63..d4d2f1da03 100755
--- a/src/tools/pginclude/headerscheck
+++ b/src/tools/pginclude/headerscheck
@@ -96,6 +96,12 @@ do
test "$f" = src/include/nodes/nodetags.h && continue
test "$f" = src/backend/nodes/nodetags.h && continue
+ # radixtree_*_impl.h cannot be included standalone: they are just code fragments.
+ test "$f" = src/include/lib/radixtree_delete_impl.h && continue
+ test "$f" = src/include/lib/radixtree_insert_impl.h && continue
+ test "$f" = src/include/lib/radixtree_iter_impl.h && continue
+ test "$f" = src/include/lib/radixtree_search_impl.h && continue
+
# These files are not meant to be included standalone, because
# they contain lists that might have multiple use-cases.
test "$f" = src/include/access/rmgrlist.h && continue
--
2.39.1
Attachment: v27-0004-Add-TIDStore-to-store-sets-of-TIDs-ItemPointerDa.patch
From 1be520c83274bc3a2f068689e665c254c8e3c04e Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 23 Dec 2022 23:50:39 +0900
Subject: [PATCH v27 4/9] Add TIDStore, to store sets of TIDs (ItemPointerData)
efficiently.
The TIDStore is designed to store large sets of TIDs efficiently, and
is backed by the radix tree. A TID is encoded into a 64-bit key and
value and inserted into the radix tree.
The TIDStore is not used for anything yet, aside from the test code,
but the follow-up patch integrates the TIDStore with lazy vacuum,
reducing lazy vacuum memory usage and lifting the 1GB limit on its
size, by storing the list of dead TIDs more efficiently.
This includes a unit test module, in src/test/modules/test_tidstore.
---
doc/src/sgml/monitoring.sgml | 4 +
src/backend/access/common/Makefile | 1 +
src/backend/access/common/meson.build | 1 +
src/backend/access/common/tidstore.c | 685 ++++++++++++++++++
src/backend/storage/lmgr/lwlock.c | 2 +
src/include/access/tidstore.h | 49 ++
src/include/storage/lwlock.h | 1 +
src/test/modules/Makefile | 1 +
src/test/modules/meson.build | 1 +
src/test/modules/test_tidstore/Makefile | 23 +
.../test_tidstore/expected/test_tidstore.out | 13 +
src/test/modules/test_tidstore/meson.build | 35 +
.../test_tidstore/sql/test_tidstore.sql | 7 +
.../test_tidstore/test_tidstore--1.0.sql | 8 +
.../modules/test_tidstore/test_tidstore.c | 195 +++++
.../test_tidstore/test_tidstore.control | 4 +
16 files changed, 1030 insertions(+)
create mode 100644 src/backend/access/common/tidstore.c
create mode 100644 src/include/access/tidstore.h
create mode 100644 src/test/modules/test_tidstore/Makefile
create mode 100644 src/test/modules/test_tidstore/expected/test_tidstore.out
create mode 100644 src/test/modules/test_tidstore/meson.build
create mode 100644 src/test/modules/test_tidstore/sql/test_tidstore.sql
create mode 100644 src/test/modules/test_tidstore/test_tidstore--1.0.sql
create mode 100644 src/test/modules/test_tidstore/test_tidstore.c
create mode 100644 src/test/modules/test_tidstore/test_tidstore.control
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index b246ddc634..e44387d2c1 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -2192,6 +2192,10 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
<entry>Waiting to access a shared TID bitmap during a parallel bitmap
index scan.</entry>
</row>
+ <row>
+ <entry><literal>SharedTidStore</literal></entry>
+ <entry>Waiting to access a shared TID store.</entry>
+ </row>
<row>
<entry><literal>SharedTupleStore</literal></entry>
<entry>Waiting to access a shared tuple store during parallel
diff --git a/src/backend/access/common/Makefile b/src/backend/access/common/Makefile
index b9aff0ccfd..67b8cc6108 100644
--- a/src/backend/access/common/Makefile
+++ b/src/backend/access/common/Makefile
@@ -27,6 +27,7 @@ OBJS = \
syncscan.o \
toast_compression.o \
toast_internals.o \
+ tidstore.o \
tupconvert.o \
tupdesc.o
diff --git a/src/backend/access/common/meson.build b/src/backend/access/common/meson.build
index f5ac17b498..fce19c09ce 100644
--- a/src/backend/access/common/meson.build
+++ b/src/backend/access/common/meson.build
@@ -15,6 +15,7 @@ backend_sources += files(
'syncscan.c',
'toast_compression.c',
'toast_internals.c',
+ 'tidstore.c',
'tupconvert.c',
'tupdesc.c',
)
diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
new file mode 100644
index 0000000000..ff8e66936e
--- /dev/null
+++ b/src/backend/access/common/tidstore.c
@@ -0,0 +1,685 @@
+/*-------------------------------------------------------------------------
+ *
+ * tidstore.c
+ * Tid (ItemPointerData) storage implementation.
+ *
+ * This module provides an in-memory data structure to store tids (ItemPointer).
+ * Internally, a tid is encoded as a pair of a 64-bit key and a 64-bit value,
+ * and stored in the radix tree.
+ *
+ * A TidStore can be shared among parallel worker processes by passing a DSA
+ * area to tidstore_create(). Other backends can attach to the shared TidStore
+ * by tidstore_attach().
+ *
+ * Regarding concurrency, we basically rely on the concurrency support in the
+ * radix tree, but we acquire the lock on a TidStore in some cases, for
+ * example, when resetting the store and when accessing the number of tids in
+ * the store (num_tids).
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/common/tidstore.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/tidstore.h"
+#include "miscadmin.h"
+#include "port/pg_bitutils.h"
+#include "storage/lwlock.h"
+#include "utils/dsa.h"
+#include "utils/memutils.h"
+
+/*
+ * For encoding purposes, tids are represented as a pair of a 64-bit key and
+ * a 64-bit value. First, we construct a 64-bit unsigned integer by combining
+ * the block number and the offset number. The number of bits used for the
+ * offset number is determined by max_offset in tidstore_create(). We are
+ * frugal with the bits, because smaller keys could help keep the radix
+ * tree shallow.
+ *
+ * For example, a heap tid with 8kB blocks uses the lowest 9 bits for
+ * the offset number and uses the next 32 bits for the block number. That
+ * is, only 41 bits are used:
+ *
+ * uuuuuuuY YYYYYYYY YYYYYYYY YYYYYYYY YYYYYYYX XXXXXXXX
+ *
+ * X = bits used for offset number
+ * Y = bits used for block number
+ * u = unused bit
+ * (high on the left, low on the right)
+ *
+ * 9 bits are enough for the offset number, because MaxHeapTuplesPerPage < 2^9
+ * on 8kB blocks.
+ *
+ * The 64-bit value is the bitmap representation of the lowest 6 bits
+ * (TIDSTORE_VALUE_NBITS) of the integer, and the remaining 35 bits are used
+ * as the key:
+ *
+ * uuuuuuuY YYYYYYYY YYYYYYYY YYYYYYYY YYYYYYYX XXXXXXXX
+ * |----| value
+ * |---------------------------------------------| key
+ *
+ * The maximum height of the radix tree is 5 in this case.
+ *
+ * If the bitmap of all possible offset numbers for a block fits in a single
+ * 64-bit value (i.e. offset_nbits <= TIDSTORE_VALUE_NBITS), we don't encode
+ * tids; the block number is used directly as the key and the offset bitmap
+ * as the value.
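+ *
+ * As a concrete illustration (assuming 8kB heap blocks, so offset_nbits = 9
+ * and offset_key_nbits = 3), the tid (block 1000000, offset 5) is encoded as
+ *
+ *   tid_i = (1000000 << 9) | 5 = 512000005
+ *   key   = tid_i >> 6        = 8000000
+ *   value = UINT64CONST(1) << (tid_i & 63) = UINT64CONST(1) << 5
+ *
+ * and decoding reverses it: block = key >> 3 = 1000000, and each set bit i
+ * in the value yields offset (((key << 6) | i) & ((1 << 9) - 1)) = 5.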
+ */
+#define TIDSTORE_VALUE_NBITS 6 /* log(64, 2) */
+
+/* A magic value used to identify our TidStores. */
+#define TIDSTORE_MAGIC 0x826f6a10
+
+#define RT_PREFIX local_rt
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#define RT_VALUE_TYPE uint64
+#include "lib/radixtree.h"
+
+#define RT_PREFIX shared_rt
+#define RT_SHMEM
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#define RT_VALUE_TYPE uint64
+#include "lib/radixtree.h"
+
+/* The control object for a TidStore */
+typedef struct TidStoreControl
+{
+ /* the number of tids in the store */
+ int64 num_tids;
+
+ /* These values are never changed after creation */
+ size_t max_bytes; /* the maximum bytes a TidStore can use */
+ int max_offset; /* the maximum offset number */
+ int offset_nbits; /* the number of bits required for max_offset */
+ int offset_key_nbits; /* the number of bits of a offset number
+ * used for the key */
+
+ /* The below fields are used only in shared case */
+
+ uint32 magic;
+ LWLock lock;
+
+ /* handles for TidStore and radix tree */
+ tidstore_handle handle;
+ shared_rt_handle tree_handle;
+} TidStoreControl;
+
+/* Per-backend state for a TidStore */
+struct TidStore
+{
+ /*
+ * Control object. This is allocated in DSA area 'area' in the shared
+ * case, otherwise in backend-local memory.
+ */
+ TidStoreControl *control;
+
+ /* Storage for Tids. Use either one depending on TidStoreIsShared() */
+ union
+ {
+ local_rt_radix_tree *local;
+ shared_rt_radix_tree *shared;
+ } tree;
+
+ /* DSA area for TidStore if used */
+ dsa_area *area;
+};
+#define TidStoreIsShared(ts) ((ts)->area != NULL)
+
+/* Iterator for TidStore */
+typedef struct TidStoreIter
+{
+ TidStore *ts;
+
+ /* iterator of radix tree. Use either one depending on TidStoreIsShared() */
+ union
+ {
+ shared_rt_iter *shared;
+ local_rt_iter *local;
+ } tree_iter;
+
+ /* we returned all tids? */
+ bool finished;
+
+ /* save for the next iteration */
+ uint64 next_key;
+ uint64 next_val;
+
+ /* output for the caller */
+ TidStoreIterResult result;
+} TidStoreIter;
+
+static void tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, uint64 val);
+static inline BlockNumber key_get_blkno(TidStore *ts, uint64 key);
+static inline uint64 encode_key_off(TidStore *ts, BlockNumber block, uint32 offset, uint32 *off);
+static inline uint64 tid_to_key_off(TidStore *ts, ItemPointer tid, uint32 *off);
+
+/*
+ * Create a TidStore. The returned object is allocated in backend-local memory.
+ * The radix tree for storage is allocated in the DSA area if 'area' is non-NULL.
+ */
+TidStore *
+tidstore_create(size_t max_bytes, int max_offset, dsa_area *area)
+{
+ TidStore *ts;
+
+ ts = palloc0(sizeof(TidStore));
+
+ /*
+ * Create the radix tree for the main storage.
+ *
+ * Memory consumption depends on the number of stored tids, but also on their
+ * distribution, how the radix tree stores them, and the memory management
+ * that backs the radix tree. The maximum number of bytes that a TidStore can
+ * use is specified by max_bytes in tidstore_create(). We want the total
+ * memory consumption of a TidStore not to exceed max_bytes.
+ *
+ * In the local TidStore case, the radix tree uses a slab allocator for each
+ * kind of node class. The most memory-consuming case while adding tids
+ * associated with one page (i.e. during tidstore_add_tids()) is allocating a
+ * new slab block for a new radix tree node, which is approximately 70kB.
+ * Therefore, we deduct 70kB from max_bytes.
+ *
+ * In the shared case, DSA allocates memory segments big enough to follow a
+ * geometric series that approximately doubles the total DSA size (see
+ * make_new_segment() in dsa.c). We simulated how DSA increases segment size,
+ * and the simulation revealed that a 75% threshold for the maximum bytes
+ * works perfectly when max_bytes is a power of two, and a 60% threshold
+ * works for other cases.
+ */
+ if (area != NULL)
+ {
+ dsa_pointer dp;
+ float ratio = ((max_bytes & (max_bytes - 1)) == 0) ? 0.75 : 0.6;
+
+ ts->tree.shared = shared_rt_create(CurrentMemoryContext, area,
+ LWTRANCHE_SHARED_TIDSTORE);
+
+ dp = dsa_allocate0(area, sizeof(TidStoreControl));
+ ts->control = (TidStoreControl *) dsa_get_address(area, dp);
+ ts->control->max_bytes = (uint64) (max_bytes * ratio);
+ ts->area = area;
+
+ ts->control->magic = TIDSTORE_MAGIC;
+ LWLockInitialize(&ts->control->lock, LWTRANCHE_SHARED_TIDSTORE);
+ ts->control->handle = dp;
+ ts->control->tree_handle = shared_rt_get_handle(ts->tree.shared);
+ }
+ else
+ {
+ ts->tree.local = local_rt_create(CurrentMemoryContext);
+
+ ts->control = (TidStoreControl *) palloc0(sizeof(TidStoreControl));
+ ts->control->max_bytes = max_bytes - (70 * 1024);
+ }
+
+ ts->control->max_offset = max_offset;
+ ts->control->offset_nbits = pg_ceil_log2_32(max_offset);
+
+ if (ts->control->offset_nbits < TIDSTORE_VALUE_NBITS)
+ ts->control->offset_nbits = TIDSTORE_VALUE_NBITS;
+
+ /*
+ * We use tid encoding if the bitmap of all possible offset numbers for a
+ * block doesn't fit in a single 64-bit value.
+ */
+ ts->control->offset_key_nbits =
+ ts->control->offset_nbits - TIDSTORE_VALUE_NBITS;
+
+ return ts;
+}
+
+/*
+ * Attach to the shared TidStore using a handle. The returned object is
+ * allocated in backend-local memory using the CurrentMemoryContext.
+ */
+TidStore *
+tidstore_attach(dsa_area *area, tidstore_handle handle)
+{
+ TidStore *ts;
+ dsa_pointer control;
+
+ Assert(area != NULL);
+ Assert(DsaPointerIsValid(handle));
+
+ /* create per-backend state */
+ ts = palloc0(sizeof(TidStore));
+
+ /* Find the control object in shared memory */
+ control = handle;
+
+ /* Set up the TidStore */
+ ts->control = (TidStoreControl *) dsa_get_address(area, control);
+ Assert(ts->control->magic == TIDSTORE_MAGIC);
+
+ ts->tree.shared = shared_rt_attach(area, ts->control->tree_handle);
+ ts->area = area;
+
+ return ts;
+}
+
+/*
+ * Detach from a TidStore. This detaches from radix tree and frees the
+ * backend-local resources. The radix tree will continue to exist until
+ * it is either explicitly destroyed, or the area that backs it is returned
+ * to the operating system.
+ */
+void
+tidstore_detach(TidStore *ts)
+{
+ Assert(TidStoreIsShared(ts) && ts->control->magic == TIDSTORE_MAGIC);
+
+ shared_rt_detach(ts->tree.shared);
+ pfree(ts);
+}
+
+/*
+ * Destroy a TidStore, returning all memory.
+ *
+ * TODO: The caller must be certain that no other backend will attempt to
+ * access the TidStore before calling this function. Other backends must
+ * explicitly call tidstore_detach to free up backend-local memory associated
+ * with the TidStore. The backend that calls tidstore_destroy must not call
+ * tidstore_detach.
+ */
+void
+tidstore_destroy(TidStore *ts)
+{
+ if (TidStoreIsShared(ts))
+ {
+ Assert(ts->control->magic == TIDSTORE_MAGIC);
+
+ /*
+ * Vandalize the control block to help catch programming errors where
+ * other backends access the memory formerly occupied by this radix
+ * tree.
+ */
+ ts->control->magic = 0;
+ dsa_free(ts->area, ts->control->handle);
+ shared_rt_free(ts->tree.shared);
+ }
+ else
+ {
+ pfree(ts->control);
+ local_rt_free(ts->tree.local);
+ }
+
+ pfree(ts);
+}
+
+/*
+ * Forget all collected tids. This is similar to tidstore_destroy, but instead
+ * of freeing the entire TidStore, it recreates only the radix tree storage.
+ */
+void
+tidstore_reset(TidStore *ts)
+{
+ if (TidStoreIsShared(ts))
+ {
+ Assert(ts->control->magic == TIDSTORE_MAGIC);
+
+ LWLockAcquire(&ts->control->lock, LW_EXCLUSIVE);
+
+ /*
+ * Free the radix tree and return allocated DSA segments to
+ * the operating system.
+ */
+ shared_rt_free(ts->tree.shared);
+ dsa_trim(ts->area);
+
+ /* Recreate the radix tree */
+ ts->tree.shared = shared_rt_create(CurrentMemoryContext, ts->area,
+ LWTRANCHE_SHARED_TIDSTORE);
+
+ /* update the radix tree handle as we recreated it */
+ ts->control->tree_handle = shared_rt_get_handle(ts->tree.shared);
+
+ /* Reset the statistics */
+ ts->control->num_tids = 0;
+
+ LWLockRelease(&ts->control->lock);
+ }
+ else
+ {
+ local_rt_free(ts->tree.local);
+ ts->tree.local = local_rt_create(CurrentMemoryContext);
+
+ /* Reset the statistics */
+ ts->control->num_tids = 0;
+ }
+}
+
+/* Add Tids on a block to TidStore */
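+/*
+ * As an illustrative example (assuming 8kB heap blocks, so offset_key_nbits
+ * is 3 and key_base = blkno << 3): for block 2 with offsets {1, 5, 100},
+ * key_base is 16; offsets 1 and 5 set bits 1 and 5 in values[0] (key 16),
+ * and offset 100 sets bit 100 - 64 = 36 in values[1] (key 17). Only non-zero
+ * bitmaps are inserted into the radix tree below.
+ */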
+void
+tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets)
+{
+ uint64 *values;
+ uint64 key;
+ uint64 prev_key;
+ uint64 off_bitmap = 0;
+ int idx;
+ const uint64 key_base = ((uint64) blkno) << ts->control->offset_key_nbits;
+ const int nkeys = UINT64CONST(1) << ts->control->offset_key_nbits;
+
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ values = palloc(sizeof(uint64) * nkeys);
+ key = prev_key = key_base;
+
+ for (int i = 0; i < num_offsets; i++)
+ {
+ uint32 off;
+
+ /* encode the tid to key and val */
+ key = encode_key_off(ts, blkno, offsets[i], &off);
+
+ /* make sure we scanned the line pointer array in order */
+ Assert(key >= prev_key);
+
+ if (key > prev_key)
+ {
+ idx = prev_key - key_base;
+ Assert(idx >= 0 && idx < nkeys);
+
+ /* write out offset bitmap for this key */
+ values[idx] = off_bitmap;
+
+ /* zero out any gaps up to the current key */
+ for (int empty_idx = idx + 1; empty_idx < key - key_base; empty_idx++)
+ values[empty_idx] = 0;
+
+ /* reset for current key -- the current offset will be handled below */
+ off_bitmap = 0;
+ prev_key = key;
+ }
+
+ off_bitmap |= UINT64CONST(1) << off;
+ }
+
+ /* save the final index for later */
+ idx = key - key_base;
+ /* write out last offset bitmap */
+ values[idx] = off_bitmap;
+
+ if (TidStoreIsShared(ts))
+ LWLockAcquire(&ts->control->lock, LW_EXCLUSIVE);
+
+ /* insert the calculated key-values to the tree */
+ for (int i = 0; i <= idx; i++)
+ {
+ if (values[i])
+ {
+ key = key_base + i;
+
+ if (TidStoreIsShared(ts))
+ shared_rt_set(ts->tree.shared, key, &values[i]);
+ else
+ local_rt_set(ts->tree.local, key, &values[i]);
+ }
+ }
+
+ /* update statistics */
+ ts->control->num_tids += num_offsets;
+
+ if (TidStoreIsShared(ts))
+ LWLockRelease(&ts->control->lock);
+
+ pfree(values);
+}
+
+/* Return true if the given tid is present in the TidStore */
+bool
+tidstore_lookup_tid(TidStore *ts, ItemPointer tid)
+{
+ uint64 key;
+ uint64 val = 0;
+ uint32 off;
+ bool found;
+
+ key = tid_to_key_off(ts, tid, &off);
+
+ if (TidStoreIsShared(ts))
+ found = shared_rt_search(ts->tree.shared, key, &val);
+ else
+ found = local_rt_search(ts->tree.local, key, &val);
+
+ if (!found)
+ return false;
+
+ return (val & (UINT64CONST(1) << off)) != 0;
+}
+
+/*
+ * Prepare to iterate through a TidStore. Since the radix tree is locked during
+ * the iteration, tidstore_end_iterate() needs to be called when finished.
+ *
+ * Concurrent updates during the iteration are blocked when inserting a
+ * key-value pair into the radix tree.
+ */
+TidStoreIter *
+tidstore_begin_iterate(TidStore *ts)
+{
+ TidStoreIter *iter;
+
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ iter = palloc0(sizeof(TidStoreIter));
+ iter->ts = ts;
+
+ iter->result.blkno = InvalidBlockNumber;
+ iter->result.offsets = palloc(sizeof(OffsetNumber) * ts->control->max_offset);
+
+ if (TidStoreIsShared(ts))
+ iter->tree_iter.shared = shared_rt_begin_iterate(ts->tree.shared);
+ else
+ iter->tree_iter.local = local_rt_begin_iterate(ts->tree.local);
+
+ /* If the TidStore is empty, there is nothing to iterate */
+ if (tidstore_num_tids(ts) == 0)
+ iter->finished = true;
+
+ return iter;
+}
+
+static inline bool
+tidstore_iter_kv(TidStoreIter *iter, uint64 *key, uint64 *val)
+{
+ if (TidStoreIsShared(iter->ts))
+ return shared_rt_iterate_next(iter->tree_iter.shared, key, val);
+
+ return local_rt_iterate_next(iter->tree_iter.local, key, val);
+}
+
+/*
+ * Scan the TidStore and return a pointer to a TidStoreIterResult that has the
+ * tids in one block. We return block numbers in ascending order, and the
+ * offset numbers in each result are also sorted in ascending order.
+ */
+TidStoreIterResult *
+tidstore_iterate_next(TidStoreIter *iter)
+{
+ uint64 key;
+ uint64 val;
+ TidStoreIterResult *result = &(iter->result);
+
+ if (iter->finished)
+ return NULL;
+
+ if (BlockNumberIsValid(result->blkno))
+ {
+ /* Process the previously collected key-value */
+ result->num_offsets = 0;
+ tidstore_iter_extract_tids(iter, iter->next_key, iter->next_val);
+ }
+
+ while (tidstore_iter_kv(iter, &key, &val))
+ {
+ BlockNumber blkno;
+
+ blkno = key_get_blkno(iter->ts, key);
+
+ if (BlockNumberIsValid(result->blkno) && result->blkno != blkno)
+ {
+ /*
+ * We got a key-value pair for a different block. So return the
+ * collected tids, and remember the key-value for the next iteration.
+ */
+ iter->next_key = key;
+ iter->next_val = val;
+ return result;
+ }
+
+ /* Collect tids extracted from the key-value pair */
+ tidstore_iter_extract_tids(iter, key, val);
+ }
+
+ iter->finished = true;
+ return result;
+}
+
+/*
+ * Finish an iteration over TidStore. This needs to be called after finishing
+ * the iteration, or when exiting an iteration early.
+ */
+void
+tidstore_end_iterate(TidStoreIter *iter)
+{
+ if (TidStoreIsShared(iter->ts))
+ shared_rt_end_iterate(iter->tree_iter.shared);
+ else
+ local_rt_end_iterate(iter->tree_iter.local);
+
+ pfree(iter->result.offsets);
+ pfree(iter);
+}
+
+/* Return the number of tids we collected so far */
+int64
+tidstore_num_tids(TidStore *ts)
+{
+ uint64 num_tids;
+
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ if (!TidStoreIsShared(ts))
+ return ts->control->num_tids;
+
+ LWLockAcquire(&ts->control->lock, LW_SHARED);
+ num_tids = ts->control->num_tids;
+ LWLockRelease(&ts->control->lock);
+
+ return num_tids;
+}
+
+/* Return true if the current memory usage of TidStore exceeds the limit */
+bool
+tidstore_is_full(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ return (tidstore_memory_usage(ts) > ts->control->max_bytes);
+}
+
+/* Return the maximum memory TidStore can use */
+size_t
+tidstore_max_memory(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ return ts->control->max_bytes;
+}
+
+/* Return the memory usage of TidStore */
+size_t
+tidstore_memory_usage(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ /*
+ * In the shared case, TidStoreControl and radix_tree are backed by the
+ * same DSA area and rt_memory_usage() returns the value including both.
+ * So we don't need to add the size of TidStoreControl separately.
+ */
+ if (TidStoreIsShared(ts))
+ return sizeof(TidStore) + shared_rt_memory_usage(ts->tree.shared);
+
+ return sizeof(TidStore) + sizeof(TidStoreControl) + local_rt_memory_usage(ts->tree.local);
+}
+
+/*
+ * Get a handle that can be used by other processes to attach to this TidStore
+ */
+tidstore_handle
+tidstore_get_handle(TidStore *ts)
+{
+ Assert(TidStoreIsShared(ts) && ts->control->magic == TIDSTORE_MAGIC);
+
+ return ts->control->handle;
+}
+
+/* Extract tids from the given key-value pair */
+static void
+tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, uint64 val)
+{
+ TidStoreIterResult *result = (&iter->result);
+
+ for (int i = 0; i < sizeof(uint64) * BITS_PER_BYTE; i++)
+ {
+ uint64 tid_i;
+ OffsetNumber off;
+
+ if ((val & (UINT64CONST(1) << i)) == 0)
+ continue;
+
+ tid_i = key << TIDSTORE_VALUE_NBITS;
+ tid_i |= i;
+
+ off = tid_i & ((UINT64CONST(1) << iter->ts->control->offset_nbits) - 1);
+
+ Assert(result->num_offsets < iter->ts->control->max_offset);
+ result->offsets[result->num_offsets++] = off;
+ }
+
+ result->blkno = key_get_blkno(iter->ts, key);
+}
+
+/* Get block number from the given key */
+static inline BlockNumber
+key_get_blkno(TidStore *ts, uint64 key)
+{
+ return (BlockNumber) (key >> ts->control->offset_key_nbits);
+}
+
+/* Encode a tid to key and offset */
+static inline uint64
+tid_to_key_off(TidStore *ts, ItemPointer tid, uint32 *off)
+{
+ uint32 offset = ItemPointerGetOffsetNumber(tid);
+ BlockNumber block = ItemPointerGetBlockNumber(tid);
+
+ return encode_key_off(ts, block, offset, off);
+}
+
+/* encode a block and offset to a key and partial offset */
+static inline uint64
+encode_key_off(TidStore *ts, BlockNumber block, uint32 offset, uint32 *off)
+{
+ uint64 key;
+ uint64 tid_i;
+
+ tid_i = offset | ((uint64) block << ts->control->offset_nbits);
+
+ *off = tid_i & ((UINT64CONST(1) << TIDSTORE_VALUE_NBITS) - 1);
+ key = tid_i >> TIDSTORE_VALUE_NBITS;
+ Assert(*off < (sizeof(uint64) * BITS_PER_BYTE));
+
+ return key;
+}
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index d2ec396045..55b3a04097 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -176,6 +176,8 @@ static const char *const BuiltinTrancheNames[] = {
"SharedTupleStore",
/* LWTRANCHE_SHARED_TIDBITMAP: */
"SharedTidBitmap",
+ /* LWTRANCHE_SHARED_TIDSTORE: */
+ "SharedTidStore",
/* LWTRANCHE_PARALLEL_APPEND: */
"ParallelAppend",
/* LWTRANCHE_PER_XACT_PREDICATE_LIST: */
diff --git a/src/include/access/tidstore.h b/src/include/access/tidstore.h
new file mode 100644
index 0000000000..a35a52124a
--- /dev/null
+++ b/src/include/access/tidstore.h
@@ -0,0 +1,49 @@
+/*-------------------------------------------------------------------------
+ *
+ * tidstore.h
+ * Tid storage.
+ *
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/tidstore.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef TIDSTORE_H
+#define TIDSTORE_H
+
+#include "storage/itemptr.h"
+#include "utils/dsa.h"
+
+typedef dsa_pointer tidstore_handle;
+
+typedef struct TidStore TidStore;
+typedef struct TidStoreIter TidStoreIter;
+
+typedef struct TidStoreIterResult
+{
+ BlockNumber blkno;
+ OffsetNumber *offsets;
+ int num_offsets;
+} TidStoreIterResult;
+
+extern TidStore *tidstore_create(size_t max_bytes, int max_offset, dsa_area *dsa);
+extern TidStore *tidstore_attach(dsa_area *dsa, dsa_pointer handle);
+extern void tidstore_detach(TidStore *ts);
+extern void tidstore_destroy(TidStore *ts);
+extern void tidstore_reset(TidStore *ts);
+extern void tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets);
+extern bool tidstore_lookup_tid(TidStore *ts, ItemPointer tid);
+extern TidStoreIter * tidstore_begin_iterate(TidStore *ts);
+extern TidStoreIterResult *tidstore_iterate_next(TidStoreIter *iter);
+extern void tidstore_end_iterate(TidStoreIter *iter);
+extern int64 tidstore_num_tids(TidStore *ts);
+extern bool tidstore_is_full(TidStore *ts);
+extern size_t tidstore_max_memory(TidStore *ts);
+extern size_t tidstore_memory_usage(TidStore *ts);
+extern tidstore_handle tidstore_get_handle(TidStore *ts);
+
+#endif /* TIDSTORE_H */
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index d2c7afb8f4..07002fdfbe 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -199,6 +199,7 @@ typedef enum BuiltinTrancheIds
LWTRANCHE_PER_SESSION_RECORD_TYPMOD,
LWTRANCHE_SHARED_TUPLESTORE,
LWTRANCHE_SHARED_TIDBITMAP,
+ LWTRANCHE_SHARED_TIDSTORE,
LWTRANCHE_PARALLEL_APPEND,
LWTRANCHE_PER_XACT_PREDICATE_LIST,
LWTRANCHE_PGSTATS_DSA,
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 9659eb85d7..bddc16ada7 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -34,6 +34,7 @@ SUBDIRS = \
test_rls_hooks \
test_shm_mq \
test_slru \
+ test_tidstore \
unsafe_tests \
worker_spi
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 232cbdac80..c0d5645ad8 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -30,5 +30,6 @@ subdir('test_regex')
subdir('test_rls_hooks')
subdir('test_shm_mq')
subdir('test_slru')
+subdir('test_tidstore')
subdir('unsafe_tests')
subdir('worker_spi')
diff --git a/src/test/modules/test_tidstore/Makefile b/src/test/modules/test_tidstore/Makefile
new file mode 100644
index 0000000000..dab107d70c
--- /dev/null
+++ b/src/test/modules/test_tidstore/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_tidstore/Makefile
+
+MODULE_big = test_tidstore
+OBJS = \
+ $(WIN32RES) \
+ test_tidstore.o
+PGFILEDESC = "test_tidstore - test code for src/backend/access/common/tidstore.c"
+
+EXTENSION = test_tidstore
+DATA = test_tidstore--1.0.sql
+
+REGRESS = test_tidstore
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_tidstore
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_tidstore/expected/test_tidstore.out b/src/test/modules/test_tidstore/expected/test_tidstore.out
new file mode 100644
index 0000000000..7ff2f9af87
--- /dev/null
+++ b/src/test/modules/test_tidstore/expected/test_tidstore.out
@@ -0,0 +1,13 @@
+CREATE EXTENSION test_tidstore;
+--
+-- All the logic is in the test_tidstore() function. It will throw
+-- an error if something fails.
+--
+SELECT test_tidstore();
+NOTICE: testing empty tidstore
+NOTICE: testing basic operations
+ test_tidstore
+---------------
+
+(1 row)
+
diff --git a/src/test/modules/test_tidstore/meson.build b/src/test/modules/test_tidstore/meson.build
new file mode 100644
index 0000000000..31f2da7b61
--- /dev/null
+++ b/src/test/modules/test_tidstore/meson.build
@@ -0,0 +1,35 @@
+# FIXME: prevent install during main install, but not during test :/
+
+test_tidstore_sources = files(
+ 'test_tidstore.c',
+)
+
+if host_system == 'windows'
+ test_tidstore_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'test_tidstore',
+ '--FILEDESC', 'test_tidstore - test code for src/backend/access/common/tidstore.c',])
+endif
+
+test_tidstore = shared_module('test_tidstore',
+ test_tidstore_sources,
+ link_with: pgport_srv,
+ kwargs: pg_mod_args,
+)
+testprep_targets += test_tidstore
+
+install_data(
+ 'test_tidstore.control',
+ 'test_tidstore--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'test_tidstore',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'test_tidstore',
+ ],
+ },
+}
diff --git a/src/test/modules/test_tidstore/sql/test_tidstore.sql b/src/test/modules/test_tidstore/sql/test_tidstore.sql
new file mode 100644
index 0000000000..03aea31815
--- /dev/null
+++ b/src/test/modules/test_tidstore/sql/test_tidstore.sql
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_tidstore;
+
+--
+-- All the logic is in the test_tidstore() function. It will throw
+-- an error if something fails.
+--
+SELECT test_tidstore();
diff --git a/src/test/modules/test_tidstore/test_tidstore--1.0.sql b/src/test/modules/test_tidstore/test_tidstore--1.0.sql
new file mode 100644
index 0000000000..47e9149900
--- /dev/null
+++ b/src/test/modules/test_tidstore/test_tidstore--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_tidstore/test_tidstore--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_tidstore" to load this file. \quit
+
+CREATE FUNCTION test_tidstore()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_tidstore/test_tidstore.c b/src/test/modules/test_tidstore/test_tidstore.c
new file mode 100644
index 0000000000..9b849ae8e8
--- /dev/null
+++ b/src/test/modules/test_tidstore/test_tidstore.c
@@ -0,0 +1,195 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_tidstore.c
+ * Test TidStore data structure.
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/test/modules/test_tidstore/test_tidstore.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "access/tidstore.h"
+#include "fmgr.h"
+#include "miscadmin.h"
+#include "storage/block.h"
+#include "storage/itemptr.h"
+#include "utils/memutils.h"
+
+PG_MODULE_MAGIC;
+
+#define TEST_TIDSTORE_MAX_BYTES (2 * 1024 * 1024L) /* 2MB */
+
+PG_FUNCTION_INFO_V1(test_tidstore);
+
+static void
+check_tid(TidStore *ts, BlockNumber blkno, OffsetNumber off, bool expect)
+{
+ ItemPointerData tid;
+ bool found;
+
+ ItemPointerSet(&tid, blkno, off);
+
+ found = tidstore_lookup_tid(ts, &tid);
+
+ if (found != expect)
+ elog(ERROR, "lookup TID (%u, %u) returned %d, expected %d",
+ blkno, off, found, expect);
+}
+
+static void
+test_basic(int max_offset)
+{
+#define TEST_TIDSTORE_NUM_BLOCKS 5
+#define TEST_TIDSTORE_NUM_OFFSETS 5
+
+ TidStore *ts;
+ TidStoreIter *iter;
+ TidStoreIterResult *iter_result;
+ BlockNumber blks[TEST_TIDSTORE_NUM_BLOCKS] = {
+ 0, MaxBlockNumber, MaxBlockNumber - 1, 1, MaxBlockNumber / 2,
+ };
+ BlockNumber blks_sorted[TEST_TIDSTORE_NUM_BLOCKS] = {
+ 0, 1, MaxBlockNumber / 2, MaxBlockNumber - 1, MaxBlockNumber
+ };
+ OffsetNumber offs[TEST_TIDSTORE_NUM_OFFSETS];
+ int blk_idx;
+
+ /* prepare the offset array */
+ offs[0] = FirstOffsetNumber;
+ offs[1] = FirstOffsetNumber + 1;
+ offs[2] = max_offset / 2;
+ offs[3] = max_offset - 1;
+ offs[4] = max_offset;
+
+ ts = tidstore_create(TEST_TIDSTORE_MAX_BYTES, max_offset, NULL);
+
+ /* add tids */
+ for (int i = 0; i < TEST_TIDSTORE_NUM_BLOCKS; i++)
+ tidstore_add_tids(ts, blks[i], offs, TEST_TIDSTORE_NUM_OFFSETS);
+
+ /* lookup test */
+ for (OffsetNumber off = FirstOffsetNumber ; off < max_offset; off++)
+ {
+ bool expect = false;
+ for (int i = 0; i < TEST_TIDSTORE_NUM_OFFSETS; i++)
+ {
+ if (offs[i] == off)
+ {
+ expect = true;
+ break;
+ }
+ }
+
+ check_tid(ts, 0, off, expect);
+ check_tid(ts, 2, off, false);
+ check_tid(ts, MaxBlockNumber - 2, off, false);
+ check_tid(ts, MaxBlockNumber, off, expect);
+ }
+
+ /* test the number of tids */
+ if (tidstore_num_tids(ts) != (TEST_TIDSTORE_NUM_BLOCKS * TEST_TIDSTORE_NUM_OFFSETS))
+ elog(ERROR, "tidstore_num_tids returned " UINT64_FORMAT ", expected %d",
+ tidstore_num_tids(ts),
+ TEST_TIDSTORE_NUM_BLOCKS * TEST_TIDSTORE_NUM_OFFSETS);
+
+ /* iteration test */
+ iter = tidstore_begin_iterate(ts);
+ blk_idx = 0;
+ while ((iter_result = tidstore_iterate_next(iter)) != NULL)
+ {
+ /* check the returned block number */
+ if (blks_sorted[blk_idx] != iter_result->blkno)
+ elog(ERROR, "tidstore_iterate_next returned block number %u, expected %u",
+ iter_result->blkno, blks_sorted[blk_idx]);
+
+ /* check the returned offset numbers */
+ if (TEST_TIDSTORE_NUM_OFFSETS != iter_result->num_offsets)
+ elog(ERROR, "tidstore_iterate_next returned %u offsets, expected %u",
+ iter_result->num_offsets, TEST_TIDSTORE_NUM_OFFSETS);
+
+ for (int i = 0; i < iter_result->num_offsets; i++)
+ {
+ if (offs[i] != iter_result->offsets[i])
+ elog(ERROR, "tidstore_iterate_next returned offset number %u on block %u, expected %u",
+ iter_result->offsets[i], iter_result->blkno, offs[i]);
+ }
+
+ blk_idx++;
+ }
+
+ if (blk_idx != TEST_TIDSTORE_NUM_BLOCKS)
+ elog(ERROR, "tidstore_iterate_next returned %d blocks, expected %d",
+ blk_idx, TEST_TIDSTORE_NUM_BLOCKS);
+
+ /* remove all tids */
+ tidstore_reset(ts);
+
+ /* test the number of tids */
+ if (tidstore_num_tids(ts) != 0)
+ elog(ERROR, "tidstore_num_tids on empty store returned non-zero");
+
+ /* lookup test for empty store */
+ for (OffsetNumber off = FirstOffsetNumber ; off < MaxHeapTuplesPerPage;
+ off++)
+ {
+ check_tid(ts, 0, off, false);
+ check_tid(ts, 2, off, false);
+ check_tid(ts, MaxBlockNumber - 2, off, false);
+ check_tid(ts, MaxBlockNumber, off, false);
+ }
+
+ tidstore_destroy(ts);
+}
+
+static void
+test_empty(void)
+{
+ TidStore *ts;
+ TidStoreIter *iter;
+ ItemPointerData tid;
+
+ elog(NOTICE, "testing empty tidstore");
+
+ ts = tidstore_create(TEST_TIDSTORE_MAX_BYTES, MaxHeapTuplesPerPage, NULL);
+
+ ItemPointerSet(&tid, 0, FirstOffsetNumber);
+ if (tidstore_lookup_tid(ts, &tid))
+ elog(ERROR, "tidstore_lookup_tid for (0,1) on empty store returned true");
+
+ ItemPointerSet(&tid, MaxBlockNumber, MaxOffsetNumber);
+ if (tidstore_lookup_tid(ts, &tid))
+ elog(ERROR, "tidstore_lookup_tid for (%u,%u) on empty store returned true",
+ MaxBlockNumber, MaxOffsetNumber);
+
+ if (tidstore_num_tids(ts) != 0)
+ elog(ERROR, "tidstore_num_entries on empty store returned non-zero");
+
+ if (tidstore_is_full(ts))
+ elog(ERROR, "tidstore_is_full on empty store returned true");
+
+ iter = tidstore_begin_iterate(ts);
+
+ if (tidstore_iterate_next(iter) != NULL)
+ elog(ERROR, "tidstore_iterate_next on empty store returned TIDs");
+
+ tidstore_end_iterate(iter);
+
+ tidstore_destroy(ts);
+}
+
+Datum
+test_tidstore(PG_FUNCTION_ARGS)
+{
+ test_empty();
+
+ elog(NOTICE, "testing basic operations");
+ test_basic(MaxHeapTuplesPerPage);
+ test_basic(10);
+
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_tidstore/test_tidstore.control b/src/test/modules/test_tidstore/test_tidstore.control
new file mode 100644
index 0000000000..9b6bd4638f
--- /dev/null
+++ b/src/test/modules/test_tidstore/test_tidstore.control
@@ -0,0 +1,4 @@
+comment = 'Test code for tidstore'
+default_version = '1.0'
+module_pathname = '$libdir/test_tidstore'
+relocatable = true
--
2.39.1
Attachment: v27-0006-Tool-for-measuring-radix-tree-and-tidstore-perfo.patch
From b0515a40b3aa4709047c7b70b9c0cadded979d15 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 16 Sep 2022 11:57:03 +0900
Subject: [PATCH v27 6/9] Tool for measuring radix tree and tidstore
performance
Includes Meson support, but commented out to avoid warnings
XXX: Not for commit
---
contrib/bench_radix_tree/Makefile | 21 +
.../bench_radix_tree--1.0.sql | 87 +++
contrib/bench_radix_tree/bench_radix_tree.c | 717 ++++++++++++++++++
.../bench_radix_tree/bench_radix_tree.control | 6 +
contrib/bench_radix_tree/expected/bench.out | 13 +
contrib/bench_radix_tree/meson.build | 33 +
contrib/bench_radix_tree/sql/bench.sql | 16 +
contrib/meson.build | 1 +
8 files changed, 894 insertions(+)
create mode 100644 contrib/bench_radix_tree/Makefile
create mode 100644 contrib/bench_radix_tree/bench_radix_tree--1.0.sql
create mode 100644 contrib/bench_radix_tree/bench_radix_tree.c
create mode 100644 contrib/bench_radix_tree/bench_radix_tree.control
create mode 100644 contrib/bench_radix_tree/expected/bench.out
create mode 100644 contrib/bench_radix_tree/meson.build
create mode 100644 contrib/bench_radix_tree/sql/bench.sql
diff --git a/contrib/bench_radix_tree/Makefile b/contrib/bench_radix_tree/Makefile
new file mode 100644
index 0000000000..952bb0ceae
--- /dev/null
+++ b/contrib/bench_radix_tree/Makefile
@@ -0,0 +1,21 @@
+# contrib/bench_radix_tree/Makefile
+
+MODULE_big = bench_radix_tree
+OBJS = \
+ bench_radix_tree.o
+
+EXTENSION = bench_radix_tree
+DATA = bench_radix_tree--1.0.sql
+
+REGRESS = bench_fixed_height
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/bench_radix_tree
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
new file mode 100644
index 0000000000..fbf51c1086
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
@@ -0,0 +1,87 @@
+/* contrib/bench_radix_tree/bench_radix_tree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION bench_radix_tree" to load this file. \quit
+
+create function bench_shuffle_search(
+minblk int4,
+maxblk int4,
+random_block bool DEFAULT false,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT array_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT array_load_ms int8,
+OUT rt_search_ms int8,
+OUT array_search_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_seq_search(
+minblk int4,
+maxblk int4,
+random_block bool DEFAULT false,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT array_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT array_load_ms int8,
+OUT rt_search_ms int8,
+OUT array_search_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_load_random_int(
+cnt int8,
+OUT mem_allocated int8,
+OUT load_ms int8)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_search_random_nodes(
+cnt int8,
+filter_str text DEFAULT NULL,
+OUT mem_allocated int8,
+OUT load_ms int8,
+OUT search_ms int8)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C VOLATILE PARALLEL UNSAFE;
+
+create function bench_fixed_height_search(
+fanout int4,
+OUT fanout int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT rt_search_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_node128_load(
+fanout int4,
+OUT fanout int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT rt_sparseload_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_tidstore_load(
+minblk int4,
+maxblk int4,
+OUT mem_allocated int8,
+OUT load_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
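+
+-- Illustrative usage (a sketch, not part of the regression test): load tids
+-- for blocks 0..999999 into a TidStore and report memory usage and load time,
+-- then do the same for the radix tree and measure sequential lookups.
+--
+--   SELECT * FROM bench_tidstore_load(0, 1000000);
+--   SELECT * FROM bench_seq_search(0, 1000000);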
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c
new file mode 100644
index 0000000000..b5ad75364c
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -0,0 +1,717 @@
+/*-------------------------------------------------------------------------
+ *
+ * bench_radix_tree.c
+ *
+ * Copyright (c) 2016-2022, PostgreSQL Global Development Group
+ *
+ * contrib/bench_radix_tree/bench_radix_tree.c
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/tidstore.h"
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "funcapi.h"
+#include "lib/radixtree.h"
+#include <math.h>
+#include "miscadmin.h"
+#include "storage/lwlock.h"
+#include "utils/builtins.h"
+#include "utils/timestamp.h"
+
+PG_MODULE_MAGIC;
+
+/* run benchmark also for binary-search case? */
+/* #define MEASURE_BINARY_SEARCH 1 */
+
+#define TIDS_PER_BLOCK_FOR_LOAD 30
+#define TIDS_PER_BLOCK_FOR_LOOKUP 50
+
+//#define RT_DEBUG
+#define RT_PREFIX rt
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#define RT_USE_DELETE
+#define RT_VALUE_TYPE uint64
+// WIP: compiles with warnings because rt_attach is defined but not used
+// #define RT_SHMEM
+#include "lib/radixtree.h"
+
+/*
+ * Return the number of keys in the radix tree.
+ */
+static uint64
+rt_num_entries(rt_radix_tree *tree)
+{
+ return tree->ctl->num_keys;
+}
+
+
+PG_FUNCTION_INFO_V1(bench_seq_search);
+PG_FUNCTION_INFO_V1(bench_shuffle_search);
+PG_FUNCTION_INFO_V1(bench_load_random_int);
+PG_FUNCTION_INFO_V1(bench_fixed_height_search);
+PG_FUNCTION_INFO_V1(bench_search_random_nodes);
+PG_FUNCTION_INFO_V1(bench_node128_load);
+PG_FUNCTION_INFO_V1(bench_tidstore_load);
+
+static uint64
+tid_to_key_off(ItemPointer tid, uint32 *off)
+{
+ uint64 upper;
+ uint32 shift = pg_ceil_log2_32(MaxHeapTuplesPerPage);
+ int64 tid_i;
+
+ Assert(ItemPointerGetOffsetNumber(tid) < MaxHeapTuplesPerPage);
+
+ tid_i = ItemPointerGetOffsetNumber(tid);
+ tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << shift;
+
+ /* log(sizeof(uint64) * BITS_PER_BYTE, 2) = log(64, 2) = 6 */
+ *off = tid_i & ((1 << 6) - 1);
+ upper = tid_i >> 6;
+ Assert(*off < (sizeof(uint64) * BITS_PER_BYTE));
+
+ Assert(*off < 64);
+
+ return upper;
+}
+
+static int
+shuffle_randrange(pg_prng_state *state, int lower, int upper)
+{
+ return (int) floor(pg_prng_double(state) * ((upper - lower) + 0.999999)) + lower;
+}
+
+/* Naive Fisher-Yates shuffle implementation */
+static void
+shuffle_itemptrs(ItemPointer itemptr, uint64 nitems)
+{
+ /* for reproducibility */
+ pg_prng_state state;
+
+ pg_prng_seed(&state, 0);
+
+ for (int i = 0; i < nitems - 1; i++)
+ {
+ int j = shuffle_randrange(&state, i, nitems - 1);
+ ItemPointerData t = itemptr[j];
+
+ itemptr[j] = itemptr[i];
+ itemptr[i] = t;
+ }
+}
+
+static ItemPointer
+generate_tids(BlockNumber minblk, BlockNumber maxblk, int ntids_per_blk, uint64 *ntids_p,
+ bool random_block)
+{
+ ItemPointer tids;
+ uint64 maxitems;
+ uint64 ntids = 0;
+ pg_prng_state state;
+
+ maxitems = (maxblk - minblk + 1) * ntids_per_blk;
+ tids = MemoryContextAllocHuge(TopTransactionContext,
+ sizeof(ItemPointerData) * maxitems);
+
+ if (random_block)
+ pg_prng_seed(&state, 0x9E3779B185EBCA87);
+
+ for (BlockNumber blk = minblk; blk < maxblk; blk++)
+ {
+ if (random_block && !pg_prng_bool(&state))
+ continue;
+
+ for (OffsetNumber off = FirstOffsetNumber;
+ off <= ntids_per_blk; off++)
+ {
+ CHECK_FOR_INTERRUPTS();
+
+ ItemPointerSetBlockNumber(&(tids[ntids]), blk);
+ ItemPointerSetOffsetNumber(&(tids[ntids]), off);
+
+ ntids++;
+ }
+ }
+
+ *ntids_p = ntids;
+ return tids;
+}
+
+#ifdef MEASURE_BINARY_SEARCH
+static int
+vac_cmp_itemptr(const void *left, const void *right)
+{
+ BlockNumber lblk,
+ rblk;
+ OffsetNumber loff,
+ roff;
+
+ lblk = ItemPointerGetBlockNumber((ItemPointer) left);
+ rblk = ItemPointerGetBlockNumber((ItemPointer) right);
+
+ if (lblk < rblk)
+ return -1;
+ if (lblk > rblk)
+ return 1;
+
+ loff = ItemPointerGetOffsetNumber((ItemPointer) left);
+ roff = ItemPointerGetOffsetNumber((ItemPointer) right);
+
+ if (loff < roff)
+ return -1;
+ if (loff > roff)
+ return 1;
+
+ return 0;
+}
+#endif
+
+Datum
+bench_tidstore_load(PG_FUNCTION_ARGS)
+{
+ BlockNumber minblk = PG_GETARG_INT32(0);
+ BlockNumber maxblk = PG_GETARG_INT32(1);
+ TidStore *ts;
+ OffsetNumber *offs;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 load_ms;
+ TupleDesc tupdesc;
+ Datum values[2];
+ bool nulls[2] = {false};
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ offs = palloc(sizeof(OffsetNumber) * TIDS_PER_BLOCK_FOR_LOAD);
+ for (int i = 0; i < TIDS_PER_BLOCK_FOR_LOAD; i++)
+ offs[i] = i + 1; /* FirstOffsetNumber is 1 */
+
+ ts = tidstore_create(1 * 1024L * 1024L * 1024L, MaxHeapTuplesPerPage, NULL);
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+ /* load tids */
+ start_time = GetCurrentTimestamp();
+ for (BlockNumber blkno = minblk; blkno < maxblk; blkno++)
+ tidstore_add_tids(ts, blkno, offs, TIDS_PER_BLOCK_FOR_LOAD);
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ load_ms = secs * 1000 + usecs / 1000;
+
+ values[0] = Int64GetDatum(tidstore_memory_usage(ts));
+ values[1] = Int64GetDatum(load_ms);
+
+ tidstore_destroy(ts);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+static Datum
+bench_search(FunctionCallInfo fcinfo, bool shuffle)
+{
+ BlockNumber minblk = PG_GETARG_INT32(0);
+ BlockNumber maxblk = PG_GETARG_INT32(1);
+ bool random_block = PG_GETARG_BOOL(2);
+ rt_radix_tree *rt = NULL;
+ uint64 ntids;
+ uint64 key;
+ uint64 last_key = PG_UINT64_MAX;
+ uint64 val = 0;
+ ItemPointer tids;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_load_ms,
+ rt_search_ms;
+ Datum values[7];
+ bool nulls[7];
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ tids = generate_tids(minblk, maxblk, TIDS_PER_BLOCK_FOR_LOAD, &ntids, random_block);
+
+ /* measure the load time of the radix tree */
+ rt = rt_create(CurrentMemoryContext);
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ uint32 off;
+
+ CHECK_FOR_INTERRUPTS();
+
+ key = tid_to_key_off(tid, &off);
+
+ if (last_key != PG_UINT64_MAX && last_key != key)
+ {
+ rt_set(rt, last_key, &val);
+ val = 0;
+ }
+
+ last_key = key;
+ val |= (uint64) 1 << off;
+ }
+ if (last_key != PG_UINT64_MAX)
+ rt_set(rt, last_key, &val);
+
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_load_ms = secs * 1000 + usecs / 1000;
+
+#ifdef RT_DEBUG
+ rt_stats(rt);
+#endif
+ if (shuffle)
+ shuffle_itemptrs(tids, ntids);
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+ /* measure the search time of the radix tree */
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ uint32 off;
+ volatile bool ret; /* prevent calling rt_search from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ key = tid_to_key_off(tid, &off);
+
+ ret = rt_search(rt, key, &val);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_search_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int64GetDatum(rt_num_entries(rt));
+ values[1] = Int64GetDatum(rt_memory_usage(rt));
+ values[2] = Int64GetDatum(sizeof(ItemPointerData) * ntids);
+ values[3] = Int64GetDatum(rt_load_ms);
+ nulls[4] = true; /* ar_load_ms */
+ values[5] = Int64GetDatum(rt_search_ms);
+ nulls[6] = true; /* ar_search_ms */
+
+#ifdef MEASURE_BINARY_SEARCH
+ {
+ ItemPointer itemptrs = NULL;
+
+ int64 ar_load_ms,
+ ar_search_ms;
+
+ /* measure the load time of the array */
+ itemptrs = MemoryContextAllocHuge(CurrentMemoryContext,
+ sizeof(ItemPointerData) * ntids);
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointerSetBlockNumber(&(itemptrs[i]),
+ ItemPointerGetBlockNumber(&(tids[i])));
+ ItemPointerSetOffsetNumber(&(itemptrs[i]),
+ ItemPointerGetOffsetNumber(&(tids[i])));
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ ar_load_ms = secs * 1000 + usecs / 1000;
+
+ /* next, measure the search time of the array */
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ volatile bool ret; /* prevent calling bsearch from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ ret = bsearch((void *) tid,
+ (void *) itemptrs,
+ ntids,
+ sizeof(ItemPointerData),
+ vac_cmp_itemptr);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ ar_search_ms = secs * 1000 + usecs / 1000;
+
+ /* set the result */
+ nulls[4] = false;
+ values[4] = Int64GetDatum(ar_load_ms);
+ nulls[6] = false;
+ values[6] = Int64GetDatum(ar_search_ms);
+ }
+#endif
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_seq_search(PG_FUNCTION_ARGS)
+{
+ return bench_search(fcinfo, false);
+}
+
+Datum
+bench_shuffle_search(PG_FUNCTION_ARGS)
+{
+ return bench_search(fcinfo, true);
+}
+
+Datum
+bench_load_random_int(PG_FUNCTION_ARGS)
+{
+ uint64 cnt = (uint64) PG_GETARG_INT64(0);
+ rt_radix_tree *rt;
+ pg_prng_state state;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 load_time_ms;
+ Datum values[2];
+ bool nulls[2];
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ pg_prng_seed(&state, 0);
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ uint64 key = pg_prng_uint64(&state);
+
+ rt_set(rt, key, &key);
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ load_time_ms = secs * 1000 + usecs / 1000;
+
+#ifdef RT_DEBUG
+ rt_stats(rt);
+#endif
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int64GetDatum(rt_memory_usage(rt));
+ values[1] = Int64GetDatum(load_time_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+/* copy of splitmix64() */
+static uint64
+hash64(uint64 x)
+{
+ x ^= x >> 30;
+ x *= UINT64CONST(0xbf58476d1ce4e5b9);
+ x ^= x >> 27;
+ x *= UINT64CONST(0x94d049bb133111eb);
+ x ^= x >> 31;
+ return x;
+}
+
+/* attempts to have a relatively even population of node kinds */
+Datum
+bench_search_random_nodes(PG_FUNCTION_ARGS)
+{
+ uint64 cnt = (uint64) PG_GETARG_INT64(0);
+ rt_radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 load_time_ms;
+ int64 search_time_ms;
+ Datum values[3] = {0};
+ bool nulls[3] = {0};
+ /* from trial and error */
+ uint64 filter = (((uint64) 0x7F<<32) | (0x07<<24) | (0xFF<<16) | 0xFF);
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ if (!PG_ARGISNULL(1))
+ {
+ char *filter_str = text_to_cstring(PG_GETARG_TEXT_P(1));
+
+ if (sscanf(filter_str, "0x%lX", &filter) == 0)
+ elog(ERROR, "invalid filter string %s", filter_str);
+ }
+ elog(NOTICE, "bench with filter 0x%lX", filter);
+
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ uint64 hash = hash64(i);
+ uint64 key = hash & filter;
+
+ rt_set(rt, key, &key);
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ load_time_ms = secs * 1000 + usecs / 1000;
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ uint64 hash = hash64(i);
+ uint64 key = hash & filter;
+ uint64 val;
+ volatile bool ret; /* prevent calling rt_search from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ ret = rt_search(rt, key, &val);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ search_time_ms = secs * 1000 + usecs / 1000;
+
+#ifdef RT_DEBUG
+ rt_stats(rt);
+#endif
+
+ values[0] = Int64GetDatum(rt_memory_usage(rt));
+ values[1] = Int64GetDatum(load_time_ms);
+ values[2] = Int64GetDatum(search_time_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_fixed_height_search(PG_FUNCTION_ARGS)
+{
+ int fanout = PG_GETARG_INT32(0);
+ rt_radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_load_ms,
+ rt_search_ms;
+ Datum values[5];
+ bool nulls[5];
+
+ /* test boundary between vector and iteration */
+ const int n_keys = 5 * 16 * 16 * 16 * 16;
+ uint64 r,
+ h,
+ i,
+ j,
+ k;
+ uint64 key_id;
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+
+ key_id = 0;
+
+ /*
+ * lower nodes have limited fanout, the top is only limited by
+ * bits-per-byte
+ */
+ for (r = 1;; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ for (i = 1; i <= fanout; i++)
+ {
+ for (j = 1; j <= fanout; j++)
+ {
+ for (k = 1; k <= fanout; k++)
+ {
+ uint64 key;
+
+ key = (r << 32) | (h << 24) | (i << 16) | (j << 8) | (k);
+
+ CHECK_FOR_INTERRUPTS();
+
+ key_id++;
+ if (key_id > n_keys)
+ goto finish_set;
+
+ rt_set(rt, key, &key_id);
+ }
+ }
+ }
+ }
+ }
+finish_set:
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_load_ms = secs * 1000 + usecs / 1000;
+
+#ifdef RT_DEBUG
+ rt_stats(rt);
+#endif
+
+ /* measure the search time of the radix tree */
+ start_time = GetCurrentTimestamp();
+
+ key_id = 0;
+ for (r = 1;; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ for (i = 1; i <= fanout; i++)
+ {
+ for (j = 1; j <= fanout; j++)
+ {
+ for (k = 1; k <= fanout; k++)
+ {
+ uint64 key,
+ val;
+
+ key = (r << 32) | (h << 24) | (i << 16) | (j << 8) | (k);
+
+ CHECK_FOR_INTERRUPTS();
+
+ key_id++;
+ if (key_id > n_keys)
+ goto finish_search;
+
+ rt_search(rt, key, &val);
+ }
+ }
+ }
+ }
+ }
+finish_search:
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_search_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int32GetDatum(fanout);
+ values[1] = Int64GetDatum(rt_num_entries(rt));
+ values[2] = Int64GetDatum(rt_memory_usage(rt));
+ values[3] = Int64GetDatum(rt_load_ms);
+ values[4] = Int64GetDatum(rt_search_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_node128_load(PG_FUNCTION_ARGS)
+{
+ int fanout = PG_GETARG_INT32(0);
+ rt_radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_sparseload_ms;
+ Datum values[5];
+ bool nulls[5];
+
+ uint64 r,
+ h;
+ uint64 key_id;
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ rt = rt_create(CurrentMemoryContext);
+
+ key_id = 0;
+
+ for (r = 1; r <= fanout; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (h);
+
+ key_id++;
+ rt_set(rt, key, &key_id);
+ }
+ }
+
+#ifdef RT_DEBUG
+ rt_stats(rt);
+#endif
+
+ /* measure sparse deletion and re-loading */
+ start_time = GetCurrentTimestamp();
+
+ for (int t = 0; t<10000; t++)
+ {
+ /* delete one key in each leaf */
+ for (r = 1; r <= fanout; r++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (fanout);
+
+ rt_delete(rt, key);
+ }
+
+ /* add them all back */
+ key_id = 0;
+ for (r = 1; r <= fanout; r++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (fanout);
+
+ key_id++;
+ rt_set(rt, key, &key_id);
+ }
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_sparseload_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int32GetDatum(fanout);
+ values[1] = Int64GetDatum(rt_num_entries(rt));
+ values[2] = Int64GetDatum(rt_memory_usage(rt));
+ values[3] = Int64GetDatum(rt_sparseload_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
diff --git a/contrib/bench_radix_tree/bench_radix_tree.control b/contrib/bench_radix_tree/bench_radix_tree.control
new file mode 100644
index 0000000000..1d988e6c9a
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.control
@@ -0,0 +1,6 @@
+# bench_radix_tree extension
+comment = 'benchmark suite for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/bench_radix_tree'
+relocatable = true
+trusted = true
diff --git a/contrib/bench_radix_tree/expected/bench.out b/contrib/bench_radix_tree/expected/bench.out
new file mode 100644
index 0000000000..60c303892e
--- /dev/null
+++ b/contrib/bench_radix_tree/expected/bench.out
@@ -0,0 +1,13 @@
+create extension bench_radix_tree;
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
diff --git a/contrib/bench_radix_tree/meson.build b/contrib/bench_radix_tree/meson.build
new file mode 100644
index 0000000000..332c1ae7df
--- /dev/null
+++ b/contrib/bench_radix_tree/meson.build
@@ -0,0 +1,33 @@
+bench_radix_tree_sources = files(
+ 'bench_radix_tree.c',
+)
+
+if host_system == 'windows'
+ bench_radix_tree_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'bench_radix_tree',
+ '--FILEDESC', 'bench_radix_tree - performance test code for radix tree',])
+endif
+
+bench_radix_tree = shared_module('bench_radix_tree',
+ bench_radix_tree_sources,
+ link_with: pgport_srv,
+ kwargs: pg_mod_args,
+)
+testprep_targets += bench_radix_tree
+
+install_data(
+ 'bench_radix_tree.control',
+ 'bench_radix_tree--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'bench_radix_tree',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'bench_radix_tree',
+ ],
+ },
+}
diff --git a/contrib/bench_radix_tree/sql/bench.sql b/contrib/bench_radix_tree/sql/bench.sql
new file mode 100644
index 0000000000..a46018c9d4
--- /dev/null
+++ b/contrib/bench_radix_tree/sql/bench.sql
@@ -0,0 +1,16 @@
+create extension bench_radix_tree;
+
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
diff --git a/contrib/meson.build b/contrib/meson.build
index bd4a57c43c..52253de793 100644
--- a/contrib/meson.build
+++ b/contrib/meson.build
@@ -12,6 +12,7 @@ subdir('amcheck')
subdir('auth_delay')
subdir('auto_explain')
subdir('basic_archive')
+#subdir('bench_radix_tree')
subdir('bloom')
subdir('basebackup_to_shell')
subdir('bool_plperl')
--
2.39.1
Attachment: v27-0008-Measure-iteration-of-tidstore.patch (text/x-patch; charset=US-ASCII)
From 72bb462b1dab005cbc2aff265baedbaaee62cb2b Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Tue, 14 Feb 2023 17:02:53 +0700
Subject: [PATCH v27 8/9] Measure iteration of tidstore
---
.../bench_radix_tree--1.0.sql | 3 +-
contrib/bench_radix_tree/bench_radix_tree.c | 40 ++++++++++++++++---
contrib/meson.build | 2 +-
3 files changed, 38 insertions(+), 7 deletions(-)
diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
index fbf51c1086..ad66265e23 100644
--- a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
+++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
@@ -80,7 +80,8 @@ create function bench_tidstore_load(
minblk int4,
maxblk int4,
OUT mem_allocated int8,
-OUT load_ms int8
+OUT load_ms int8,
+OUT iter_ms int8
)
returns record
as 'MODULE_PATHNAME'
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c
index b5ad75364c..6e5149e2c4 100644
--- a/contrib/bench_radix_tree/bench_radix_tree.c
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -176,15 +176,18 @@ bench_tidstore_load(PG_FUNCTION_ARGS)
BlockNumber minblk = PG_GETARG_INT32(0);
BlockNumber maxblk = PG_GETARG_INT32(1);
TidStore *ts;
+ TidStoreIter *iter;
+ TidStoreIterResult *result;
OffsetNumber *offs;
TimestampTz start_time,
end_time;
long secs;
int usecs;
int64 load_ms;
+ int64 iter_ms;
TupleDesc tupdesc;
- Datum values[2];
- bool nulls[2] = {false};
+ Datum values[3];
+ bool nulls[3] = {false};
/* Build a tuple descriptor for our result type */
if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
@@ -196,9 +199,6 @@ bench_tidstore_load(PG_FUNCTION_ARGS)
ts = tidstore_create(1 * 1024L * 1024L * 1024L, MaxHeapTuplesPerPage, NULL);
- elog(NOTICE, "sleeping for 2 seconds...");
- pg_usleep(2 * 1000000L);
-
/* load tids */
start_time = GetCurrentTimestamp();
for (BlockNumber blkno = minblk; blkno < maxblk; blkno++)
@@ -207,8 +207,22 @@ bench_tidstore_load(PG_FUNCTION_ARGS)
TimestampDifference(start_time, end_time, &secs, &usecs);
load_ms = secs * 1000 + usecs / 1000;
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+ /* iterate through tids */
+ iter = tidstore_begin_iterate(ts);
+ start_time = GetCurrentTimestamp();
+ while ((result = tidstore_iterate_next(iter)) != NULL)
+ ;
+ tidstore_end_iterate(iter);
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ iter_ms = secs * 1000 + usecs / 1000;
+
values[0] = Int64GetDatum(tidstore_memory_usage(ts));
values[1] = Int64GetDatum(load_ms);
+ values[2] = Int64GetDatum(iter_ms);
tidstore_destroy(ts);
PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
@@ -715,3 +729,19 @@ bench_node128_load(PG_FUNCTION_ARGS)
rt_free(rt);
PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
}
+
+/* to silence warnings about unused iter functions */
+static void pg_attribute_unused()
+stub_iter()
+{
+ rt_radix_tree *rt;
+ rt_iter *iter;
+ uint64 key = 1;
+ uint64 value = 1;
+
+ rt = rt_create(CurrentMemoryContext);
+
+ iter = rt_begin_iterate(rt);
+ rt_iterate_next(iter, &key, &value);
+ rt_end_iterate(iter);
+}
\ No newline at end of file
diff --git a/contrib/meson.build b/contrib/meson.build
index 52253de793..421d469f8c 100644
--- a/contrib/meson.build
+++ b/contrib/meson.build
@@ -12,7 +12,7 @@ subdir('amcheck')
subdir('auth_delay')
subdir('auto_explain')
subdir('basic_archive')
-#subdir('bench_radix_tree')
+subdir('bench_radix_tree')
subdir('bloom')
subdir('basebackup_to_shell')
subdir('bool_plperl')
--
2.39.1
Attachment: v27-0007-Prevent-inlining-of-interface-functions-for-shme.patch (text/x-patch; charset=US-ASCII)
From 54ab02eb2188382185436059ff6e7ad95d970c5d Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Tue, 14 Feb 2023 17:00:31 +0700
Subject: [PATCH v27 7/9] Prevent inlining of interface functions for shmem
---
src/backend/access/common/tidstore.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
index ad8c0866e2..d1b4675ea4 100644
--- a/src/backend/access/common/tidstore.c
+++ b/src/backend/access/common/tidstore.c
@@ -84,7 +84,7 @@
#define RT_PREFIX shared_rt
#define RT_SHMEM
-#define RT_SCOPE static
+#define RT_SCOPE static pg_noinline
#define RT_DECLARE
#define RT_DEFINE
#define RT_VALUE_TYPE uint64
--
2.39.1
Attachment: v27-0009-Speed-up-tidstore_iter_extract_tids.patch (text/x-patch; charset=US-ASCII)
From 8ccc66211973bcc44a6bad45c05302ca743c1489 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Tue, 14 Feb 2023 17:53:37 +0700
Subject: [PATCH v27 9/9] Speed up tidstore_iter_extract_tids()
---
src/backend/access/common/tidstore.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
index d1b4675ea4..5a897c01f7 100644
--- a/src/backend/access/common/tidstore.c
+++ b/src/backend/access/common/tidstore.c
@@ -632,21 +632,21 @@ tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, uint64 val)
{
TidStoreIterResult *result = (&iter->result);
- for (int i = 0; i < sizeof(uint64) * BITS_PER_BYTE; i++)
+ while (val)
{
uint64 tid_i;
OffsetNumber off;
- if ((val & (UINT64CONST(1) << i)) == 0)
- continue;
-
tid_i = key << TIDSTORE_VALUE_NBITS;
- tid_i |= i;
+ tid_i |= pg_rightmost_one_pos64(val);
off = tid_i & ((UINT64CONST(1) << iter->ts->control->offset_nbits) - 1);
Assert(result->num_offsets < iter->ts->control->max_offset);
result->offsets[result->num_offsets++] = off;
+
+ /* unset the rightmost bit */
+ val &= ~pg_rightmost_one64(val);
}
result->blkno = key_get_blkno(iter->ts, key);
--
2.39.1
The benchmark module shouldn't have been un-commented-out, so I've attached a
revert of that.
--
John Naylor
EDB: http://www.enterprisedb.com
Attachments:
Attachment: v28-0008-Measure-iteration-of-tidstore.patch (text/x-patch; charset=US-ASCII)
From 72bb462b1dab005cbc2aff265baedbaaee62cb2b Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Tue, 14 Feb 2023 17:02:53 +0700
Subject: [PATCH v28 08/10] Measure iteration of tidstore
---
.../bench_radix_tree--1.0.sql | 3 +-
contrib/bench_radix_tree/bench_radix_tree.c | 40 ++++++++++++++++---
contrib/meson.build | 2 +-
3 files changed, 38 insertions(+), 7 deletions(-)
diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
index fbf51c1086..ad66265e23 100644
--- a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
+++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
@@ -80,7 +80,8 @@ create function bench_tidstore_load(
minblk int4,
maxblk int4,
OUT mem_allocated int8,
-OUT load_ms int8
+OUT load_ms int8,
+OUT iter_ms int8
)
returns record
as 'MODULE_PATHNAME'
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c
index b5ad75364c..6e5149e2c4 100644
--- a/contrib/bench_radix_tree/bench_radix_tree.c
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -176,15 +176,18 @@ bench_tidstore_load(PG_FUNCTION_ARGS)
BlockNumber minblk = PG_GETARG_INT32(0);
BlockNumber maxblk = PG_GETARG_INT32(1);
TidStore *ts;
+ TidStoreIter *iter;
+ TidStoreIterResult *result;
OffsetNumber *offs;
TimestampTz start_time,
end_time;
long secs;
int usecs;
int64 load_ms;
+ int64 iter_ms;
TupleDesc tupdesc;
- Datum values[2];
- bool nulls[2] = {false};
+ Datum values[3];
+ bool nulls[3] = {false};
/* Build a tuple descriptor for our result type */
if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
@@ -196,9 +199,6 @@ bench_tidstore_load(PG_FUNCTION_ARGS)
ts = tidstore_create(1 * 1024L * 1024L * 1024L, MaxHeapTuplesPerPage, NULL);
- elog(NOTICE, "sleeping for 2 seconds...");
- pg_usleep(2 * 1000000L);
-
/* load tids */
start_time = GetCurrentTimestamp();
for (BlockNumber blkno = minblk; blkno < maxblk; blkno++)
@@ -207,8 +207,22 @@ bench_tidstore_load(PG_FUNCTION_ARGS)
TimestampDifference(start_time, end_time, &secs, &usecs);
load_ms = secs * 1000 + usecs / 1000;
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+ /* iterate through tids */
+ iter = tidstore_begin_iterate(ts);
+ start_time = GetCurrentTimestamp();
+ while ((result = tidstore_iterate_next(iter)) != NULL)
+ ;
+ tidstore_end_iterate(iter);
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ iter_ms = secs * 1000 + usecs / 1000;
+
values[0] = Int64GetDatum(tidstore_memory_usage(ts));
values[1] = Int64GetDatum(load_ms);
+ values[2] = Int64GetDatum(iter_ms);
tidstore_destroy(ts);
PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
@@ -715,3 +729,19 @@ bench_node128_load(PG_FUNCTION_ARGS)
rt_free(rt);
PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
}
+
+/* to silence warnings about unused iter functions */
+static void pg_attribute_unused()
+stub_iter()
+{
+ rt_radix_tree *rt;
+ rt_iter *iter;
+ uint64 key = 1;
+ uint64 value = 1;
+
+ rt = rt_create(CurrentMemoryContext);
+
+ iter = rt_begin_iterate(rt);
+ rt_iterate_next(iter, &key, &value);
+ rt_end_iterate(iter);
+}
\ No newline at end of file
diff --git a/contrib/meson.build b/contrib/meson.build
index 52253de793..421d469f8c 100644
--- a/contrib/meson.build
+++ b/contrib/meson.build
@@ -12,7 +12,7 @@ subdir('amcheck')
subdir('auth_delay')
subdir('auto_explain')
subdir('basic_archive')
-#subdir('bench_radix_tree')
+subdir('bench_radix_tree')
subdir('bloom')
subdir('basebackup_to_shell')
subdir('bool_plperl')
--
2.39.1
Attachment: v28-0002-Move-some-bitmap-logic-out-of-bitmapset.c.patch (text/x-patch; charset=US-ASCII)
From 149a49f51f7a16b7c1eb762e704f1ec476ecb65a Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Tue, 6 Dec 2022 13:39:41 +0700
Subject: [PATCH v28 02/10] Move some bitmap logic out of bitmapset.c
Add pg_rightmost_one32/64 functions. This functionality was previously
private to bitmapset.c as the RIGHTMOST_ONE macro. It has practical
use in other contexts, so move to pg_bitutils.h.
Also move the logic for selecting appropriate pg_bitutils functions
based on word size to bitmapset.h for wider visibility and add
appropriate selection for bmw_rightmost_one().
Since the previous macro relied on casting to signedbitmapword,
and the new functions do not, remove that typedef.
Design input and review by Tom Lane
Discussion: https://www.postgresql.org/message-id/CAFBsxsFW2JjTo58jtDB%2B3sZhxMx3t-3evew8%3DAcr%2BGGhC%2BkFaA%40mail.gmail.com
---
src/backend/nodes/bitmapset.c | 36 ++------------------------------
src/include/nodes/bitmapset.h | 16 ++++++++++++--
src/include/port/pg_bitutils.h | 31 +++++++++++++++++++++++++++
src/tools/pgindent/typedefs.list | 1 -
4 files changed, 47 insertions(+), 37 deletions(-)
diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index 98660524ad..fcd8e2ccbc 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -32,39 +32,7 @@
#define BITMAPSET_SIZE(nwords) \
(offsetof(Bitmapset, words) + (nwords) * sizeof(bitmapword))
-/*----------
- * This is a well-known cute trick for isolating the rightmost one-bit
- * in a word. It assumes two's complement arithmetic. Consider any
- * nonzero value, and focus attention on the rightmost one. The value is
- * then something like
- * xxxxxx10000
- * where x's are unspecified bits. The two's complement negative is formed
- * by inverting all the bits and adding one. Inversion gives
- * yyyyyy01111
- * where each y is the inverse of the corresponding x. Incrementing gives
- * yyyyyy10000
- * and then ANDing with the original value gives
- * 00000010000
- * This works for all cases except original value = zero, where of course
- * we get zero.
- *----------
- */
-#define RIGHTMOST_ONE(x) ((signedbitmapword) (x) & -((signedbitmapword) (x)))
-
-#define HAS_MULTIPLE_ONES(x) ((bitmapword) RIGHTMOST_ONE(x) != (x))
-
-/* Select appropriate bit-twiddling functions for bitmap word size */
-#if BITS_PER_BITMAPWORD == 32
-#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos32(w)
-#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos32(w)
-#define bmw_popcount(w) pg_popcount32(w)
-#elif BITS_PER_BITMAPWORD == 64
-#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos64(w)
-#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos64(w)
-#define bmw_popcount(w) pg_popcount64(w)
-#else
-#error "invalid BITS_PER_BITMAPWORD"
-#endif
+#define HAS_MULTIPLE_ONES(x) (bmw_rightmost_one(x) != (x))
/*
@@ -1013,7 +981,7 @@ bms_first_member(Bitmapset *a)
{
int result;
- w = RIGHTMOST_ONE(w);
+ w = bmw_rightmost_one(w);
a->words[wordnum] &= ~w;
result = wordnum * BITS_PER_BITMAPWORD;
diff --git a/src/include/nodes/bitmapset.h b/src/include/nodes/bitmapset.h
index 3d2225e1ae..5f9a511b4a 100644
--- a/src/include/nodes/bitmapset.h
+++ b/src/include/nodes/bitmapset.h
@@ -38,13 +38,11 @@ struct List;
#define BITS_PER_BITMAPWORD 64
typedef uint64 bitmapword; /* must be an unsigned type */
-typedef int64 signedbitmapword; /* must be the matching signed type */
#else
#define BITS_PER_BITMAPWORD 32
typedef uint32 bitmapword; /* must be an unsigned type */
-typedef int32 signedbitmapword; /* must be the matching signed type */
#endif
@@ -75,6 +73,20 @@ typedef enum
BMS_MULTIPLE /* >1 member */
} BMS_Membership;
+/* Select appropriate bit-twiddling functions for bitmap word size */
+#if BITS_PER_BITMAPWORD == 32
+#define bmw_rightmost_one(w) pg_rightmost_one32(w)
+#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos32(w)
+#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos32(w)
+#define bmw_popcount(w) pg_popcount32(w)
+#elif BITS_PER_BITMAPWORD == 64
+#define bmw_rightmost_one(w) pg_rightmost_one64(w)
+#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos64(w)
+#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos64(w)
+#define bmw_popcount(w) pg_popcount64(w)
+#else
+#error "invalid BITS_PER_BITMAPWORD"
+#endif
/*
* function prototypes in nodes/bitmapset.c
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index a49a9c03d9..7235ad25ee 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -17,6 +17,37 @@ extern PGDLLIMPORT const uint8 pg_leftmost_one_pos[256];
extern PGDLLIMPORT const uint8 pg_rightmost_one_pos[256];
extern PGDLLIMPORT const uint8 pg_number_of_ones[256];
+/*----------
+ * This is a well-known cute trick for isolating the rightmost one-bit
+ * in a word. It assumes two's complement arithmetic. Consider any
+ * nonzero value, and focus attention on the rightmost one. The value is
+ * then something like
+ * xxxxxx10000
+ * where x's are unspecified bits. The two's complement negative is formed
+ * by inverting all the bits and adding one. Inversion gives
+ * yyyyyy01111
+ * where each y is the inverse of the corresponding x. Incrementing gives
+ * yyyyyy10000
+ * and then ANDing with the original value gives
+ * 00000010000
+ * This works for all cases except original value = zero, where of course
+ * we get zero.
+ *----------
+ */
+static inline uint32
+pg_rightmost_one32(uint32 word)
+{
+ int32 result = (int32) word & -((int32) word);
+ return (uint32) result;
+}
+
+static inline uint64
+pg_rightmost_one64(uint64 word)
+{
+ int64 result = (int64) word & -((int64) word);
+ return (uint64) result;
+}
+
/*
* pg_leftmost_one_pos32
* Returns the position of the most significant set bit in "word",
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 36d1dc0117..a0c60feade 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3669,7 +3669,6 @@ shmem_request_hook_type
shmem_startup_hook_type
sig_atomic_t
sigjmp_buf
-signedbitmapword
sigset_t
size_t
slist_head
--
2.39.1
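
An aside on how these helpers combine: pg_rightmost_one64() isolates the lowest
set bit and pg_rightmost_one_pos64() gives its position, so a caller can walk the
set bits of a word without scanning all 64 positions, which is exactly what the
tidstore iteration speedup later in this series does. A minimal sketch of the
pattern; the function name and the elog() call are illustrative only, not part of
the patch:

static void
walk_set_bits(uint64 val)
{
	while (val)
	{
		/* position of the lowest set bit still present */
		int			pos = pg_rightmost_one_pos64(val);

		elog(DEBUG1, "bit %d is set", pos);

		/* clear the bit we just handled */
		val &= ~pg_rightmost_one64(val);
	}
}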
Attachment: v28-0001-Introduce-helper-SIMD-functions-for-small-byte-a.patch (text/x-patch; charset=US-ASCII)
From d577ef9d9755e7ca4d3722c1a044381a81d66244 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 24 Oct 2022 14:07:09 +0900
Subject: [PATCH v28 01/10] Introduce helper SIMD functions for small byte
arrays
vector8_min - helper for emulating ">=" semantics
vector8_highbit_mask - used to turn the result of a vector
comparison into a bitmask
Masahiko Sawada
Reviewed by Nathan Bossart, additional adjustments by me
Discussion: https://www.postgresql.org/message-id/CAD21AoDap240WDDdUDE0JMpCmuMMnGajrKrkCRxM7zn9Xk3JRA%40mail.gmail.com
---
src/include/port/simd.h | 47 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 47 insertions(+)
diff --git a/src/include/port/simd.h b/src/include/port/simd.h
index c836360d4b..350e2caaea 100644
--- a/src/include/port/simd.h
+++ b/src/include/port/simd.h
@@ -79,6 +79,7 @@ static inline bool vector8_has_le(const Vector8 v, const uint8 c);
static inline bool vector8_is_highbit_set(const Vector8 v);
#ifndef USE_NO_SIMD
static inline bool vector32_is_highbit_set(const Vector32 v);
+static inline uint32 vector8_highbit_mask(const Vector8 v);
#endif
/* arithmetic operations */
@@ -96,6 +97,7 @@ static inline Vector8 vector8_ssub(const Vector8 v1, const Vector8 v2);
*/
#ifndef USE_NO_SIMD
static inline Vector8 vector8_eq(const Vector8 v1, const Vector8 v2);
+static inline Vector8 vector8_min(const Vector8 v1, const Vector8 v2);
static inline Vector32 vector32_eq(const Vector32 v1, const Vector32 v2);
#endif
@@ -299,6 +301,36 @@ vector32_is_highbit_set(const Vector32 v)
}
#endif /* ! USE_NO_SIMD */
+/*
+ * Return a bitmask formed from the high-bit of each element.
+ */
+#ifndef USE_NO_SIMD
+static inline uint32
+vector8_highbit_mask(const Vector8 v)
+{
+#ifdef USE_SSE2
+ return (uint32) _mm_movemask_epi8(v);
+#elif defined(USE_NEON)
+ /*
+ * Note: There is a faster way to do this, but it returns a uint64, and
+ * if the caller wanted to extract the bit position using CTZ,
+ * it would have to divide that result by 4.
+ */
+ static const uint8 mask[16] = {
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ };
+
+ uint8x16_t masked = vandq_u8(vld1q_u8(mask), (uint8x16_t) vshrq_n_s8(v, 7));
+ uint8x16_t maskedhi = vextq_u8(masked, masked, 8);
+
+ return (uint32) vaddvq_u16((uint16x8_t) vzip1q_u8(masked, maskedhi));
+#endif
+}
+#endif /* ! USE_NO_SIMD */
+
/*
* Return the bitwise OR of the inputs
*/
@@ -372,4 +404,19 @@ vector32_eq(const Vector32 v1, const Vector32 v2)
}
#endif /* ! USE_NO_SIMD */
+/*
+ * Given two vectors, return a vector with the minimum element of each.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_min(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+ return _mm_min_epu8(v1, v2);
+#elif defined(USE_NEON)
+ return vminq_u8(v1, v2);
+#endif
+}
+#endif /* ! USE_NO_SIMD */
+
#endif /* SIMD_H */
--
2.39.1
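
For context on how these two helpers are meant to be used together in the radix
tree node search: vector8_min() emulates a ">=" comparison (SSE2 has no unsigned
byte compare), and vector8_highbit_mask() turns the comparison result into a
bitmask whose lowest set bit gives the matching index. A rough sketch, assuming a
sorted 16-byte chunk array; the function name and array size are illustrative
only:

/* index of the first element >= key in a sorted 16-byte array, or 16 if none */
static int
chunk_array_lower_bound(const uint8 *chunks, uint8 key)
{
#ifndef USE_NO_SIMD
	Vector8		haystack;
	Vector8		spread = vector8_broadcast(key);
	uint32		bitfield;

	vector8_load(&haystack, chunks);

	/* min(key, chunk) == key exactly where chunk >= key */
	bitfield = vector8_highbit_mask(vector8_eq(vector8_min(spread, haystack), spread));

	if (bitfield)
		return pg_rightmost_one_pos32(bitfield);
	return 16;
#else
	for (int i = 0; i < 16; i++)
	{
		if (chunks[i] >= key)
			return i;
	}
	return 16;
#endif
}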
Attachment: v28-0005-Do-bitmap-conversion-in-one-place-rather-than-fo.patch (text/x-patch; charset=US-ASCII)
From dba9497b5b587da873fbb2de89570ec8b36d604b Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Sun, 12 Feb 2023 15:17:40 +0700
Subject: [PATCH v28 05/10] Do bitmap conversion in one place rather than
forcing callers to do it
---
src/backend/access/common/tidstore.c | 31 +++++++++++++++-------------
1 file changed, 17 insertions(+), 14 deletions(-)
diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
index ff8e66936e..ad8c0866e2 100644
--- a/src/backend/access/common/tidstore.c
+++ b/src/backend/access/common/tidstore.c
@@ -70,6 +70,7 @@
* and value, respectively.
*/
#define TIDSTORE_VALUE_NBITS 6 /* log(64, 2) */
+#define TIDSTORE_OFFSET_MASK ((1 << TIDSTORE_VALUE_NBITS) - 1)
/* A magic value used to identify our TidStores. */
#define TIDSTORE_MAGIC 0x826f6a10
@@ -158,8 +159,8 @@ typedef struct TidStoreIter
static void tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, uint64 val);
static inline BlockNumber key_get_blkno(TidStore *ts, uint64 key);
-static inline uint64 encode_key_off(TidStore *ts, BlockNumber block, uint32 offset, uint32 *off);
-static inline uint64 tid_to_key_off(TidStore *ts, ItemPointer tid, uint32 *off);
+static inline uint64 encode_key_off(TidStore *ts, BlockNumber block, uint32 offset, uint64 *off_bit);
+static inline uint64 tid_to_key_off(TidStore *ts, ItemPointer tid, uint64 *off_bit);
/*
* Create a TidStore. The returned object is allocated in backend-local memory.
@@ -376,10 +377,10 @@ tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
for (int i = 0; i < num_offsets; i++)
{
- uint32 off;
+ uint64 off_bit;
/* encode the tid to key and val */
- key = encode_key_off(ts, blkno, offsets[i], &off);
+ key = encode_key_off(ts, blkno, offsets[i], &off_bit);
/* make sure we scanned the line pointer array in order */
Assert(key >= prev_key);
@@ -401,7 +402,7 @@ tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
prev_key = key;
}
- off_bitmap |= UINT64CONST(1) << off;
+ off_bitmap |= off_bit;
}
/* save the final index for later */
@@ -441,10 +442,10 @@ tidstore_lookup_tid(TidStore *ts, ItemPointer tid)
{
uint64 key;
uint64 val = 0;
- uint32 off;
+ uint64 off_bit;
bool found;
- key = tid_to_key_off(ts, tid, &off);
+ key = tid_to_key_off(ts, tid, &off_bit);
if (TidStoreIsShared(ts))
found = shared_rt_search(ts->tree.shared, key, &val);
@@ -454,7 +455,7 @@ tidstore_lookup_tid(TidStore *ts, ItemPointer tid)
if (!found)
return false;
- return (val & (UINT64CONST(1) << off)) != 0;
+ return (val & off_bit) != 0;
}
/*
@@ -660,26 +661,28 @@ key_get_blkno(TidStore *ts, uint64 key)
/* Encode a tid to key and offset */
static inline uint64
-tid_to_key_off(TidStore *ts, ItemPointer tid, uint32 *off)
+tid_to_key_off(TidStore *ts, ItemPointer tid, uint64 *off_bit)
{
uint32 offset = ItemPointerGetOffsetNumber(tid);
BlockNumber block = ItemPointerGetBlockNumber(tid);
- return encode_key_off(ts, block, offset, off);
+ return encode_key_off(ts, block, offset, off_bit);
}
/* encode a block and offset to a key and partial offset */
static inline uint64
-encode_key_off(TidStore *ts, BlockNumber block, uint32 offset, uint32 *off)
+encode_key_off(TidStore *ts, BlockNumber block, uint32 offset, uint64 *off_bit)
{
uint64 key;
uint64 tid_i;
+ uint32 off_lower;
- tid_i = offset | ((uint64) block << ts->control->offset_nbits);
+ off_lower = offset & TIDSTORE_OFFSET_MASK;
+ Assert(off_lower < (sizeof(uint64) * BITS_PER_BYTE));
- *off = tid_i & ((UINT64CONST(1) << TIDSTORE_VALUE_NBITS) - 1);
+ *off_bit = UINT64CONST(1) << off_lower;
+ tid_i = offset | ((uint64) block << ts->control->offset_nbits);
key = tid_i >> TIDSTORE_VALUE_NBITS;
- Assert(*off < (sizeof(uint64) * BITS_PER_BYTE));
return key;
}
--
2.39.1
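
To make the new encoding concrete, here is a worked example assuming 8kB heap
pages (offset_nbits = 9); the block and offset values are arbitrary and only for
illustration:

	BlockNumber	block = 1000;
	OffsetNumber offset = 20;

	uint64		tid_i = offset | ((uint64) block << 9);		/* 512020 */
	uint64		key = tid_i >> TIDSTORE_VALUE_NBITS;		/* 8000 */
	uint64		off_bit = UINT64CONST(1) << (offset & TIDSTORE_OFFSET_MASK);	/* bit 20 */

Offsets 1 through 63 of block 1000 all produce the same key (8000), so they are
OR'ed into a single 64-bit bitmap value; offsets 64 through 127 go to key 8001,
and so on.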
Attachment: v28-0004-Add-TIDStore-to-store-sets-of-TIDs-ItemPointerDa.patch (text/x-patch; charset=US-ASCII)
From 1be520c83274bc3a2f068689e665c254c8e3c04e Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 23 Dec 2022 23:50:39 +0900
Subject: [PATCH v28 04/10] Add TIDStore, to store sets of TIDs
(ItemPointerData) efficiently.
The TIDStore is designed to store large sets of TIDs efficiently, and
is backed by the radix tree. A TID is encoded into 64-bit key and
value and inserted to the radix tree.
The TIDStore is not used for anything yet, aside from the test code,
but the follow up patch integrates the TIDStore with lazy vacuum,
reducing lazy vacuum memory usage and lifting the 1GB limit on its
size, by storing the list of dead TIDs more efficiently.
This includes a unit test module, in src/test/modules/test_tidstore.
---
doc/src/sgml/monitoring.sgml | 4 +
src/backend/access/common/Makefile | 1 +
src/backend/access/common/meson.build | 1 +
src/backend/access/common/tidstore.c | 685 ++++++++++++++++++
src/backend/storage/lmgr/lwlock.c | 2 +
src/include/access/tidstore.h | 49 ++
src/include/storage/lwlock.h | 1 +
src/test/modules/Makefile | 1 +
src/test/modules/meson.build | 1 +
src/test/modules/test_tidstore/Makefile | 23 +
.../test_tidstore/expected/test_tidstore.out | 13 +
src/test/modules/test_tidstore/meson.build | 35 +
.../test_tidstore/sql/test_tidstore.sql | 7 +
.../test_tidstore/test_tidstore--1.0.sql | 8 +
.../modules/test_tidstore/test_tidstore.c | 195 +++++
.../test_tidstore/test_tidstore.control | 4 +
16 files changed, 1030 insertions(+)
create mode 100644 src/backend/access/common/tidstore.c
create mode 100644 src/include/access/tidstore.h
create mode 100644 src/test/modules/test_tidstore/Makefile
create mode 100644 src/test/modules/test_tidstore/expected/test_tidstore.out
create mode 100644 src/test/modules/test_tidstore/meson.build
create mode 100644 src/test/modules/test_tidstore/sql/test_tidstore.sql
create mode 100644 src/test/modules/test_tidstore/test_tidstore--1.0.sql
create mode 100644 src/test/modules/test_tidstore/test_tidstore.c
create mode 100644 src/test/modules/test_tidstore/test_tidstore.control
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index b246ddc634..e44387d2c1 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -2192,6 +2192,10 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
<entry>Waiting to access a shared TID bitmap during a parallel bitmap
index scan.</entry>
</row>
+ <row>
+ <entry><literal>SharedTidStore</literal></entry>
+ <entry>Waiting to access a shared TID store.</entry>
+ </row>
<row>
<entry><literal>SharedTupleStore</literal></entry>
<entry>Waiting to access a shared tuple store during parallel
diff --git a/src/backend/access/common/Makefile b/src/backend/access/common/Makefile
index b9aff0ccfd..67b8cc6108 100644
--- a/src/backend/access/common/Makefile
+++ b/src/backend/access/common/Makefile
@@ -27,6 +27,7 @@ OBJS = \
syncscan.o \
toast_compression.o \
toast_internals.o \
+ tidstore.o \
tupconvert.o \
tupdesc.o
diff --git a/src/backend/access/common/meson.build b/src/backend/access/common/meson.build
index f5ac17b498..fce19c09ce 100644
--- a/src/backend/access/common/meson.build
+++ b/src/backend/access/common/meson.build
@@ -15,6 +15,7 @@ backend_sources += files(
'syncscan.c',
'toast_compression.c',
'toast_internals.c',
+ 'tidstore.c',
'tupconvert.c',
'tupdesc.c',
)
diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
new file mode 100644
index 0000000000..ff8e66936e
--- /dev/null
+++ b/src/backend/access/common/tidstore.c
@@ -0,0 +1,685 @@
+/*-------------------------------------------------------------------------
+ *
+ * tidstore.c
+ * Tid (ItemPointerData) storage implementation.
+ *
+ * This module provides an in-memory data structure to store Tids (ItemPointer).
+ * Internally, a tid is encoded as a pair of 64-bit key and 64-bit value, and
+ * stored in the radix tree.
+ *
+ * A TidStore can be shared among parallel worker processes by passing DSA area
+ * to tidstore_create(). Other backends can attach to the shared TidStore by
+ * tidstore_attach().
+ *
+ * Regarding the concurrency, it basically relies on the concurrency support in
+ * the radix tree, but we acquire the lock on a TidStore in some cases, for
+ * example, when resetting the store and when accessing the number of tids in the
+ * store (num_tids).
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/common/tidstore.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/tidstore.h"
+#include "miscadmin.h"
+#include "port/pg_bitutils.h"
+#include "storage/lwlock.h"
+#include "utils/dsa.h"
+#include "utils/memutils.h"
+
+/*
+ * For encoding purposes, tids are represented as a pair of 64-bit key and
+ * 64-bit value. First, we construct 64-bit unsigned integer by combining
+ * the block number and the offset number. The number of bits used for the
+ * offset number is specified by max_offset in tidstore_create(). We are
+ * frugal with the bits, because smaller keys could help keeping the radix
+ * tree shallow.
+ *
+ * For example, a tid of heap with 8kB blocks uses the lowest 9 bits for
+ * the offset number and uses the next 32 bits for the block number. That
+ * is, only 41 bits are used:
+ *
+ * uuuuuuuY YYYYYYYY YYYYYYYY YYYYYYYY YYYYYYYX XXXXXXXX
+ *
+ * X = bits used for offset number
+ * Y = bits used for block number
+ * u = unused bit
+ * (high on the left, low on the right)
+ *
+ * 9 bits are enough for the offset number, because MaxHeapTuplesPerPage < 2^9
+ * on 8kB blocks.
+ *
+ * The 64-bit value is the bitmap representation of the lowest 6 bits
+ * (TIDSTORE_VALUE_NBITS) of the integer, and the remaining 35 bits are used
+ * as the key:
+ *
+ * uuuuuuuY YYYYYYYY YYYYYYYY YYYYYYYY YYYYYYYX XXXXXXXX
+ * |----| value
+ * |---------------------------------------------| key
+ *
+ * The maximum height of the radix tree is 5 in this case.
+ *
+ * If the number of bits for offset number fits in a 64-bit value, we don't
+ * encode tids but directly use the block number and the offset number as key
+ * and value, respectively.
+ */
+#define TIDSTORE_VALUE_NBITS 6 /* log(64, 2) */
+
+/* A magic value used to identify our TidStores. */
+#define TIDSTORE_MAGIC 0x826f6a10
+
+#define RT_PREFIX local_rt
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#define RT_VALUE_TYPE uint64
+#include "lib/radixtree.h"
+
+#define RT_PREFIX shared_rt
+#define RT_SHMEM
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#define RT_VALUE_TYPE uint64
+#include "lib/radixtree.h"
+
+/* The control object for a TidStore */
+typedef struct TidStoreControl
+{
+ /* the number of tids in the store */
+ int64 num_tids;
+
+ /* These values are never changed after creation */
+ size_t max_bytes; /* the maximum bytes a TidStore can use */
+ int max_offset; /* the maximum offset number */
+ int offset_nbits; /* the number of bits required for max_offset */
+ int offset_key_nbits; /* the number of bits of an offset number
+ * used for the key */
+
+ /* The below fields are used only in shared case */
+
+ uint32 magic;
+ LWLock lock;
+
+ /* handles for TidStore and radix tree */
+ tidstore_handle handle;
+ shared_rt_handle tree_handle;
+} TidStoreControl;
+
+/* Per-backend state for a TidStore */
+struct TidStore
+{
+ /*
+ * Control object. This is allocated in DSA area 'area' in the shared
+ * case, otherwise in backend-local memory.
+ */
+ TidStoreControl *control;
+
+ /* Storage for Tids. Use either one depending on TidStoreIsShared() */
+ union
+ {
+ local_rt_radix_tree *local;
+ shared_rt_radix_tree *shared;
+ } tree;
+
+ /* DSA area for TidStore if used */
+ dsa_area *area;
+};
+#define TidStoreIsShared(ts) ((ts)->area != NULL)
+
+/* Iterator for TidStore */
+typedef struct TidStoreIter
+{
+ TidStore *ts;
+
+ /* iterator of radix tree. Use either one depending on TidStoreIsShared() */
+ union
+ {
+ shared_rt_iter *shared;
+ local_rt_iter *local;
+ } tree_iter;
+
+ /* have we returned all tids? */
+ bool finished;
+
+ /* save for the next iteration */
+ uint64 next_key;
+ uint64 next_val;
+
+ /* output for the caller */
+ TidStoreIterResult result;
+} TidStoreIter;
+
+static void tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, uint64 val);
+static inline BlockNumber key_get_blkno(TidStore *ts, uint64 key);
+static inline uint64 encode_key_off(TidStore *ts, BlockNumber block, uint32 offset, uint32 *off);
+static inline uint64 tid_to_key_off(TidStore *ts, ItemPointer tid, uint32 *off);
+
+/*
+ * Create a TidStore. The returned object is allocated in backend-local memory.
+ * The radix tree for storage is allocated in the DSA area if 'area' is non-NULL.
+ */
+TidStore *
+tidstore_create(size_t max_bytes, int max_offset, dsa_area *area)
+{
+ TidStore *ts;
+
+ ts = palloc0(sizeof(TidStore));
+
+ /*
+ * Create the radix tree for the main storage.
+ *
+ * Memory consumption depends on the number of stored tids, but also on their
+ * distribution, how the radix tree stores them, and the memory management
+ * that backs the radix tree. The maximum number of bytes that a TidStore can
+ * use is specified by max_bytes in tidstore_create(). We want the total
+ * amount of memory consumption by a TidStore not to exceed the max_bytes.
+ *
+ * In local TidStore cases, the radix tree uses slab allocators for each kind
+ * of node class. The most memory consuming case while adding Tids associated
+ * with one page (i.e. during tidstore_add_tids()) is that we allocate a new
+ * slab block for a new radix tree node, which is approximately 70kB. Therefore,
+ * we deduct 70kB from the max_bytes.
+ *
+ * In shared cases, DSA allocates the memory segments big enough to follow
+ * a geometric series that approximately doubles the total DSA size (see
+ * make_new_segment() in dsa.c). We simulated how DSA increases the segment
+ * size, and the simulation revealed that the 75% threshold for the maximum
+ * bytes works perfectly in the case where max_bytes is a power of 2, and the 60%
+ * threshold works for other cases.
+ */
+ if (area != NULL)
+ {
+ dsa_pointer dp;
+ float ratio = ((max_bytes & (max_bytes - 1)) == 0) ? 0.75 : 0.6;
+
+ ts->tree.shared = shared_rt_create(CurrentMemoryContext, area,
+ LWTRANCHE_SHARED_TIDSTORE);
+
+ dp = dsa_allocate0(area, sizeof(TidStoreControl));
+ ts->control = (TidStoreControl *) dsa_get_address(area, dp);
+ ts->control->max_bytes = (uint64) (max_bytes * ratio);
+ ts->area = area;
+
+ ts->control->magic = TIDSTORE_MAGIC;
+ LWLockInitialize(&ts->control->lock, LWTRANCHE_SHARED_TIDSTORE);
+ ts->control->handle = dp;
+ ts->control->tree_handle = shared_rt_get_handle(ts->tree.shared);
+ }
+ else
+ {
+ ts->tree.local = local_rt_create(CurrentMemoryContext);
+
+ ts->control = (TidStoreControl *) palloc0(sizeof(TidStoreControl));
+ ts->control->max_bytes = max_bytes - (70 * 1024);
+ }
+
+ ts->control->max_offset = max_offset;
+ ts->control->offset_nbits = pg_ceil_log2_32(max_offset);
+
+ if (ts->control->offset_nbits < TIDSTORE_VALUE_NBITS)
+ ts->control->offset_nbits = TIDSTORE_VALUE_NBITS;
+
+ /*
+ * We use tid encoding if the number of bits for the offset number doesn't
+ * fit in a uint64 value.
+ */
+ ts->control->offset_key_nbits =
+ ts->control->offset_nbits - TIDSTORE_VALUE_NBITS;
+
+ return ts;
+}
+
+/*
+ * Attach to the shared TidStore using a handle. The returned object is
+ * allocated in backend-local memory using the CurrentMemoryContext.
+ */
+TidStore *
+tidstore_attach(dsa_area *area, tidstore_handle handle)
+{
+ TidStore *ts;
+ dsa_pointer control;
+
+ Assert(area != NULL);
+ Assert(DsaPointerIsValid(handle));
+
+ /* create per-backend state */
+ ts = palloc0(sizeof(TidStore));
+
+ /* Find the control object in shared memory */
+ control = handle;
+
+ /* Set up the TidStore */
+ ts->control = (TidStoreControl *) dsa_get_address(area, control);
+ Assert(ts->control->magic == TIDSTORE_MAGIC);
+
+ ts->tree.shared = shared_rt_attach(area, ts->control->tree_handle);
+ ts->area = area;
+
+ return ts;
+}
+
+/*
+ * Detach from a TidStore. This detaches from radix tree and frees the
+ * backend-local resources. The radix tree will continue to exist until
+ * it is either explicitly destroyed, or the area that backs it is returned
+ * to the operating system.
+ */
+void
+tidstore_detach(TidStore *ts)
+{
+ Assert(TidStoreIsShared(ts) && ts->control->magic == TIDSTORE_MAGIC);
+
+ shared_rt_detach(ts->tree.shared);
+ pfree(ts);
+}
+
+/*
+ * Destroy a TidStore, returning all memory.
+ *
+ * TODO: The caller must be certain that no other backend will attempt to
+ * access the TidStore before calling this function. Other backends must
+ * explicitly call tidstore_detach to free up backend-local memory associated
+ * with the TidStore. The backend that calls tidstore_destroy must not call
+ * tidstore_detach.
+ */
+void
+tidstore_destroy(TidStore *ts)
+{
+ if (TidStoreIsShared(ts))
+ {
+ Assert(ts->control->magic == TIDSTORE_MAGIC);
+
+ /*
+ * Vandalize the control block to help catch programming error where
+ * other backends access the memory formerly occupied by this radix
+ * tree.
+ */
+ ts->control->magic = 0;
+ dsa_free(ts->area, ts->control->handle);
+ shared_rt_free(ts->tree.shared);
+ }
+ else
+ {
+ pfree(ts->control);
+ local_rt_free(ts->tree.local);
+ }
+
+ pfree(ts);
+}
+
+/*
+ * Forget all collected Tids. It's similar to tidstore_destroy but we don't free
+ * entire TidStore but recreate only the radix tree storage.
+ */
+void
+tidstore_reset(TidStore *ts)
+{
+ if (TidStoreIsShared(ts))
+ {
+ Assert(ts->control->magic == TIDSTORE_MAGIC);
+
+ LWLockAcquire(&ts->control->lock, LW_EXCLUSIVE);
+
+ /*
+ * Free the radix tree and return allocated DSA segments to
+ * the operating system.
+ */
+ shared_rt_free(ts->tree.shared);
+ dsa_trim(ts->area);
+
+ /* Recreate the radix tree */
+ ts->tree.shared = shared_rt_create(CurrentMemoryContext, ts->area,
+ LWTRANCHE_SHARED_TIDSTORE);
+
+ /* update the radix tree handle as we recreated it */
+ ts->control->tree_handle = shared_rt_get_handle(ts->tree.shared);
+
+ /* Reset the statistics */
+ ts->control->num_tids = 0;
+
+ LWLockRelease(&ts->control->lock);
+ }
+ else
+ {
+ local_rt_free(ts->tree.local);
+ ts->tree.local = local_rt_create(CurrentMemoryContext);
+
+ /* Reset the statistics */
+ ts->control->num_tids = 0;
+ }
+}
+
+/* Add Tids on a block to TidStore */
+void
+tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets)
+{
+ uint64 *values;
+ uint64 key;
+ uint64 prev_key;
+ uint64 off_bitmap = 0;
+ int idx;
+ const uint64 key_base = ((uint64) blkno) << ts->control->offset_key_nbits;
+ const int nkeys = UINT64CONST(1) << ts->control->offset_key_nbits;
+
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ values = palloc(sizeof(uint64) * nkeys);
+ key = prev_key = key_base;
+
+ for (int i = 0; i < num_offsets; i++)
+ {
+ uint32 off;
+
+ /* encode the tid to key and val */
+ key = encode_key_off(ts, blkno, offsets[i], &off);
+
+ /* make sure we scanned the line pointer array in order */
+ Assert(key >= prev_key);
+
+ if (key > prev_key)
+ {
+ idx = prev_key - key_base;
+ Assert(idx >= 0 && idx < nkeys);
+
+ /* write out offset bitmap for this key */
+ values[idx] = off_bitmap;
+
+ /* zero out any gaps up to the current key */
+ for (int empty_idx = idx + 1; empty_idx < key - key_base; empty_idx++)
+ values[empty_idx] = 0;
+
+ /* reset for current key -- the current offset will be handled below */
+ off_bitmap = 0;
+ prev_key = key;
+ }
+
+ off_bitmap |= UINT64CONST(1) << off;
+ }
+
+ /* save the final index for later */
+ idx = key - key_base;
+ /* write out last offset bitmap */
+ values[idx] = off_bitmap;
+
+ if (TidStoreIsShared(ts))
+ LWLockAcquire(&ts->control->lock, LW_EXCLUSIVE);
+
+ /* insert the calculated key-values to the tree */
+ for (int i = 0; i <= idx; i++)
+ {
+ if (values[i])
+ {
+ key = key_base + i;
+
+ if (TidStoreIsShared(ts))
+ shared_rt_set(ts->tree.shared, key, &values[i]);
+ else
+ local_rt_set(ts->tree.local, key, &values[i]);
+ }
+ }
+
+ /* update statistics */
+ ts->control->num_tids += num_offsets;
+
+ if (TidStoreIsShared(ts))
+ LWLockRelease(&ts->control->lock);
+
+ pfree(values);
+}
+
+/* Return true if the given tid is present in the TidStore */
+bool
+tidstore_lookup_tid(TidStore *ts, ItemPointer tid)
+{
+ uint64 key;
+ uint64 val = 0;
+ uint32 off;
+ bool found;
+
+ key = tid_to_key_off(ts, tid, &off);
+
+ if (TidStoreIsShared(ts))
+ found = shared_rt_search(ts->tree.shared, key, &val);
+ else
+ found = local_rt_search(ts->tree.local, key, &val);
+
+ if (!found)
+ return false;
+
+ return (val & (UINT64CONST(1) << off)) != 0;
+}
+
+/*
+ * Prepare to iterate through a TidStore. Since the radix tree is locked during
+ * the iteration, tidstore_end_iterate() needs to be called when finished.
+ *
+ * Concurrent updates during the iteration will be blocked when inserting a
+ * key-value to the radix tree.
+ */
+TidStoreIter *
+tidstore_begin_iterate(TidStore *ts)
+{
+ TidStoreIter *iter;
+
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ iter = palloc0(sizeof(TidStoreIter));
+ iter->ts = ts;
+
+ iter->result.blkno = InvalidBlockNumber;
+ iter->result.offsets = palloc(sizeof(OffsetNumber) * ts->control->max_offset);
+
+ if (TidStoreIsShared(ts))
+ iter->tree_iter.shared = shared_rt_begin_iterate(ts->tree.shared);
+ else
+ iter->tree_iter.local = local_rt_begin_iterate(ts->tree.local);
+
+ /* If the TidStore is empty, there is nothing to do */
+ if (tidstore_num_tids(ts) == 0)
+ iter->finished = true;
+
+ return iter;
+}
+
+static inline bool
+tidstore_iter_kv(TidStoreIter *iter, uint64 *key, uint64 *val)
+{
+ if (TidStoreIsShared(iter->ts))
+ return shared_rt_iterate_next(iter->tree_iter.shared, key, val);
+
+ return local_rt_iterate_next(iter->tree_iter.local, key, val);
+}
+
+/*
+ * Scan the TidStore and return a pointer to TidStoreIterResult that has tids
+ * in one block. We return the block numbers in ascending order, and the offset
+ * numbers in each result are also sorted in ascending order.
+ */
+TidStoreIterResult *
+tidstore_iterate_next(TidStoreIter *iter)
+{
+ uint64 key;
+ uint64 val;
+ TidStoreIterResult *result = &(iter->result);
+
+ if (iter->finished)
+ return NULL;
+
+ if (BlockNumberIsValid(result->blkno))
+ {
+ /* Process the previously collected key-value */
+ result->num_offsets = 0;
+ tidstore_iter_extract_tids(iter, iter->next_key, iter->next_val);
+ }
+
+ while (tidstore_iter_kv(iter, &key, &val))
+ {
+ BlockNumber blkno;
+
+ blkno = key_get_blkno(iter->ts, key);
+
+ if (BlockNumberIsValid(result->blkno) && result->blkno != blkno)
+ {
+ /*
+ * We got a key-value pair for a different block. So return the
+ * collected tids, and remember the key-value for the next iteration.
+ */
+ iter->next_key = key;
+ iter->next_val = val;
+ return result;
+ }
+
+ /* Collect tids extracted from the key-value pair */
+ tidstore_iter_extract_tids(iter, key, val);
+ }
+
+ iter->finished = true;
+ return result;
+}
+
+/*
+ * Finish an iteration over TidStore. This needs to be called after finishing
+ * or when exiting an iteration.
+ */
+void
+tidstore_end_iterate(TidStoreIter *iter)
+{
+ if (TidStoreIsShared(iter->ts))
+ shared_rt_end_iterate(iter->tree_iter.shared);
+ else
+ local_rt_end_iterate(iter->tree_iter.local);
+
+ pfree(iter->result.offsets);
+ pfree(iter);
+}
+
+/* Return the number of tids we collected so far */
+int64
+tidstore_num_tids(TidStore *ts)
+{
+ uint64 num_tids;
+
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ if (!TidStoreIsShared(ts))
+ return ts->control->num_tids;
+
+ LWLockAcquire(&ts->control->lock, LW_SHARED);
+ num_tids = ts->control->num_tids;
+ LWLockRelease(&ts->control->lock);
+
+ return num_tids;
+}
+
+/* Return true if the current memory usage of TidStore exceeds the limit */
+bool
+tidstore_is_full(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ return (tidstore_memory_usage(ts) > ts->control->max_bytes);
+}
+
+/* Return the maximum memory TidStore can use */
+size_t
+tidstore_max_memory(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ return ts->control->max_bytes;
+}
+
+/* Return the memory usage of TidStore */
+size_t
+tidstore_memory_usage(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ /*
+ * In the shared case, TidStoreControl and radix_tree are backed by the
+ * same DSA area and rt_memory_usage() returns the value including both.
+ * So we don't need to add the size of TidStoreControl separately.
+ */
+ if (TidStoreIsShared(ts))
+ return sizeof(TidStore) + shared_rt_memory_usage(ts->tree.shared);
+
+ return sizeof(TidStore) + sizeof(TidStoreControl) + local_rt_memory_usage(ts->tree.local);
+}
+
+/*
+ * Get a handle that can be used by other processes to attach to this TidStore
+ */
+tidstore_handle
+tidstore_get_handle(TidStore *ts)
+{
+ Assert(TidStoreIsShared(ts) && ts->control->magic == TIDSTORE_MAGIC);
+
+ return ts->control->handle;
+}
+
+/* Extract tids from the given key-value pair */
+static void
+tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, uint64 val)
+{
+ TidStoreIterResult *result = (&iter->result);
+
+ for (int i = 0; i < sizeof(uint64) * BITS_PER_BYTE; i++)
+ {
+ uint64 tid_i;
+ OffsetNumber off;
+
+ if ((val & (UINT64CONST(1) << i)) == 0)
+ continue;
+
+ tid_i = key << TIDSTORE_VALUE_NBITS;
+ tid_i |= i;
+
+ off = tid_i & ((UINT64CONST(1) << iter->ts->control->offset_nbits) - 1);
+
+ Assert(result->num_offsets < iter->ts->control->max_offset);
+ result->offsets[result->num_offsets++] = off;
+ }
+
+ result->blkno = key_get_blkno(iter->ts, key);
+}
+
+/* Get block number from the given key */
+static inline BlockNumber
+key_get_blkno(TidStore *ts, uint64 key)
+{
+ return (BlockNumber) (key >> ts->control->offset_key_nbits);
+}
+
+/* Encode a tid to key and offset */
+static inline uint64
+tid_to_key_off(TidStore *ts, ItemPointer tid, uint32 *off)
+{
+ uint32 offset = ItemPointerGetOffsetNumber(tid);
+ BlockNumber block = ItemPointerGetBlockNumber(tid);
+
+ return encode_key_off(ts, block, offset, off);
+}
+
+/* encode a block and offset to a key and partial offset */
+static inline uint64
+encode_key_off(TidStore *ts, BlockNumber block, uint32 offset, uint32 *off)
+{
+ uint64 key;
+ uint64 tid_i;
+
+ tid_i = offset | ((uint64) block << ts->control->offset_nbits);
+
+ *off = tid_i & ((UINT64CONST(1) << TIDSTORE_VALUE_NBITS) - 1);
+ key = tid_i >> TIDSTORE_VALUE_NBITS;
+ Assert(*off < (sizeof(uint64) * BITS_PER_BYTE));
+
+ return key;
+}
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index d2ec396045..55b3a04097 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -176,6 +176,8 @@ static const char *const BuiltinTrancheNames[] = {
"SharedTupleStore",
/* LWTRANCHE_SHARED_TIDBITMAP: */
"SharedTidBitmap",
+ /* LWTRANCHE_SHARED_TIDSTORE: */
+ "SharedTidStore",
/* LWTRANCHE_PARALLEL_APPEND: */
"ParallelAppend",
/* LWTRANCHE_PER_XACT_PREDICATE_LIST: */
diff --git a/src/include/access/tidstore.h b/src/include/access/tidstore.h
new file mode 100644
index 0000000000..a35a52124a
--- /dev/null
+++ b/src/include/access/tidstore.h
@@ -0,0 +1,49 @@
+/*-------------------------------------------------------------------------
+ *
+ * tidstore.h
+ * Tid storage.
+ *
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/tidstore.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef TIDSTORE_H
+#define TIDSTORE_H
+
+#include "storage/itemptr.h"
+#include "utils/dsa.h"
+
+typedef dsa_pointer tidstore_handle;
+
+typedef struct TidStore TidStore;
+typedef struct TidStoreIter TidStoreIter;
+
+typedef struct TidStoreIterResult
+{
+ BlockNumber blkno;
+ OffsetNumber *offsets;
+ int num_offsets;
+} TidStoreIterResult;
+
+extern TidStore *tidstore_create(size_t max_bytes, int max_offset, dsa_area *dsa);
+extern TidStore *tidstore_attach(dsa_area *dsa, dsa_pointer handle);
+extern void tidstore_detach(TidStore *ts);
+extern void tidstore_destroy(TidStore *ts);
+extern void tidstore_reset(TidStore *ts);
+extern void tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets);
+extern bool tidstore_lookup_tid(TidStore *ts, ItemPointer tid);
+extern TidStoreIter * tidstore_begin_iterate(TidStore *ts);
+extern TidStoreIterResult *tidstore_iterate_next(TidStoreIter *iter);
+extern void tidstore_end_iterate(TidStoreIter *iter);
+extern int64 tidstore_num_tids(TidStore *ts);
+extern bool tidstore_is_full(TidStore *ts);
+extern size_t tidstore_max_memory(TidStore *ts);
+extern size_t tidstore_memory_usage(TidStore *ts);
+extern tidstore_handle tidstore_get_handle(TidStore *ts);
+
+#endif /* TIDSTORE_H */
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index d2c7afb8f4..07002fdfbe 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -199,6 +199,7 @@ typedef enum BuiltinTrancheIds
LWTRANCHE_PER_SESSION_RECORD_TYPMOD,
LWTRANCHE_SHARED_TUPLESTORE,
LWTRANCHE_SHARED_TIDBITMAP,
+ LWTRANCHE_SHARED_TIDSTORE,
LWTRANCHE_PARALLEL_APPEND,
LWTRANCHE_PER_XACT_PREDICATE_LIST,
LWTRANCHE_PGSTATS_DSA,
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 9659eb85d7..bddc16ada7 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -34,6 +34,7 @@ SUBDIRS = \
test_rls_hooks \
test_shm_mq \
test_slru \
+ test_tidstore \
unsafe_tests \
worker_spi
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 232cbdac80..c0d5645ad8 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -30,5 +30,6 @@ subdir('test_regex')
subdir('test_rls_hooks')
subdir('test_shm_mq')
subdir('test_slru')
+subdir('test_tidstore')
subdir('unsafe_tests')
subdir('worker_spi')
diff --git a/src/test/modules/test_tidstore/Makefile b/src/test/modules/test_tidstore/Makefile
new file mode 100644
index 0000000000..dab107d70c
--- /dev/null
+++ b/src/test/modules/test_tidstore/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_tidstore/Makefile
+
+MODULE_big = test_tidstore
+OBJS = \
+ $(WIN32RES) \
+ test_tidstore.o
+PGFILEDESC = "test_tidstore - test code for src/backend/access/common/tidstore.c"
+
+EXTENSION = test_tidstore
+DATA = test_tidstore--1.0.sql
+
+REGRESS = test_tidstore
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_tidstore
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_tidstore/expected/test_tidstore.out b/src/test/modules/test_tidstore/expected/test_tidstore.out
new file mode 100644
index 0000000000..7ff2f9af87
--- /dev/null
+++ b/src/test/modules/test_tidstore/expected/test_tidstore.out
@@ -0,0 +1,13 @@
+CREATE EXTENSION test_tidstore;
+--
+-- All the logic is in the test_tidstore() function. It will throw
+-- an error if something fails.
+--
+SELECT test_tidstore();
+NOTICE: testing empty tidstore
+NOTICE: testing basic operations
+ test_tidstore
+---------------
+
+(1 row)
+
diff --git a/src/test/modules/test_tidstore/meson.build b/src/test/modules/test_tidstore/meson.build
new file mode 100644
index 0000000000..31f2da7b61
--- /dev/null
+++ b/src/test/modules/test_tidstore/meson.build
@@ -0,0 +1,35 @@
+# FIXME: prevent install during main install, but not during test :/
+
+test_tidstore_sources = files(
+ 'test_tidstore.c',
+)
+
+if host_system == 'windows'
+ test_tidstore_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'test_tidstore',
+ '--FILEDESC', 'test_tidstore - test code for src/backend/access/common/tidstore.c',])
+endif
+
+test_tidstore = shared_module('test_tidstore',
+ test_tidstore_sources,
+ link_with: pgport_srv,
+ kwargs: pg_mod_args,
+)
+testprep_targets += test_tidstore
+
+install_data(
+ 'test_tidstore.control',
+ 'test_tidstore--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'test_tidstore',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'test_tidstore',
+ ],
+ },
+}
diff --git a/src/test/modules/test_tidstore/sql/test_tidstore.sql b/src/test/modules/test_tidstore/sql/test_tidstore.sql
new file mode 100644
index 0000000000..03aea31815
--- /dev/null
+++ b/src/test/modules/test_tidstore/sql/test_tidstore.sql
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_tidstore;
+
+--
+-- All the logic is in the test_tidstore() function. It will throw
+-- an error if something fails.
+--
+SELECT test_tidstore();
diff --git a/src/test/modules/test_tidstore/test_tidstore--1.0.sql b/src/test/modules/test_tidstore/test_tidstore--1.0.sql
new file mode 100644
index 0000000000..47e9149900
--- /dev/null
+++ b/src/test/modules/test_tidstore/test_tidstore--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_tidstore/test_tidstore--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_tidstore" to load this file. \quit
+
+CREATE FUNCTION test_tidstore()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_tidstore/test_tidstore.c b/src/test/modules/test_tidstore/test_tidstore.c
new file mode 100644
index 0000000000..9b849ae8e8
--- /dev/null
+++ b/src/test/modules/test_tidstore/test_tidstore.c
@@ -0,0 +1,195 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_tidstore.c
+ * Test TidStore data structure.
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/test/modules/test_tidstore/test_tidstore.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "access/tidstore.h"
+#include "fmgr.h"
+#include "miscadmin.h"
+#include "storage/block.h"
+#include "storage/itemptr.h"
+#include "utils/memutils.h"
+
+PG_MODULE_MAGIC;
+
+#define TEST_TIDSTORE_MAX_BYTES (2 * 1024 * 1024L) /* 2MB */
+
+PG_FUNCTION_INFO_V1(test_tidstore);
+
+static void
+check_tid(TidStore *ts, BlockNumber blkno, OffsetNumber off, bool expect)
+{
+ ItemPointerData tid;
+ bool found;
+
+ ItemPointerSet(&tid, blkno, off);
+
+ found = tidstore_lookup_tid(ts, &tid);
+
+ if (found != expect)
+ elog(ERROR, "lookup TID (%u, %u) returned %d, expected %d",
+ blkno, off, found, expect);
+}
+
+static void
+test_basic(int max_offset)
+{
+#define TEST_TIDSTORE_NUM_BLOCKS 5
+#define TEST_TIDSTORE_NUM_OFFSETS 5
+
+ TidStore *ts;
+ TidStoreIter *iter;
+ TidStoreIterResult *iter_result;
+ BlockNumber blks[TEST_TIDSTORE_NUM_BLOCKS] = {
+ 0, MaxBlockNumber, MaxBlockNumber - 1, 1, MaxBlockNumber / 2,
+ };
+ BlockNumber blks_sorted[TEST_TIDSTORE_NUM_BLOCKS] = {
+ 0, 1, MaxBlockNumber / 2, MaxBlockNumber - 1, MaxBlockNumber
+ };
+ OffsetNumber offs[TEST_TIDSTORE_NUM_OFFSETS];
+ int blk_idx;
+
+ /* prepare the offset array */
+ offs[0] = FirstOffsetNumber;
+ offs[1] = FirstOffsetNumber + 1;
+ offs[2] = max_offset / 2;
+ offs[3] = max_offset - 1;
+ offs[4] = max_offset;
+
+ ts = tidstore_create(TEST_TIDSTORE_MAX_BYTES, max_offset, NULL);
+
+ /* add tids */
+ for (int i = 0; i < TEST_TIDSTORE_NUM_BLOCKS; i++)
+ tidstore_add_tids(ts, blks[i], offs, TEST_TIDSTORE_NUM_OFFSETS);
+
+ /* lookup test */
+ for (OffsetNumber off = FirstOffsetNumber; off < max_offset; off++)
+ {
+ bool expect = false;
+ for (int i = 0; i < TEST_TIDSTORE_NUM_OFFSETS; i++)
+ {
+ if (offs[i] == off)
+ {
+ expect = true;
+ break;
+ }
+ }
+
+ check_tid(ts, 0, off, expect);
+ check_tid(ts, 2, off, false);
+ check_tid(ts, MaxBlockNumber - 2, off, false);
+ check_tid(ts, MaxBlockNumber, off, expect);
+ }
+
+ /* test the number of tids */
+ if (tidstore_num_tids(ts) != (TEST_TIDSTORE_NUM_BLOCKS * TEST_TIDSTORE_NUM_OFFSETS))
+ elog(ERROR, "tidstore_num_tids returned " UINT64_FORMAT ", expected %d",
+ tidstore_num_tids(ts),
+ TEST_TIDSTORE_NUM_BLOCKS * TEST_TIDSTORE_NUM_OFFSETS);
+
+ /* iteration test */
+ iter = tidstore_begin_iterate(ts);
+ blk_idx = 0;
+ while ((iter_result = tidstore_iterate_next(iter)) != NULL)
+ {
+ /* check the returned block number */
+ if (blks_sorted[blk_idx] != iter_result->blkno)
+ elog(ERROR, "tidstore_iterate_next returned block number %u, expected %u",
+ iter_result->blkno, blks_sorted[blk_idx]);
+
+ /* check the returned offset numbers */
+ if (TEST_TIDSTORE_NUM_OFFSETS != iter_result->num_offsets)
+ elog(ERROR, "tidstore_iterate_next returned %d offsets, expected %d",
+ iter_result->num_offsets, TEST_TIDSTORE_NUM_OFFSETS);
+
+ for (int i = 0; i < iter_result->num_offsets; i++)
+ {
+ if (offs[i] != iter_result->offsets[i])
+ elog(ERROR, "tidstore_iterate_next returned offset number %u on block %u, expected %u",
+ iter_result->offsets[i], iter_result->blkno, offs[i]);
+ }
+
+ blk_idx++;
+ }
+
+ if (blk_idx != TEST_TIDSTORE_NUM_BLOCKS)
+ elog(ERROR, "tidstore_iterate_next returned %d blocks, expected %d",
+ blk_idx, TEST_TIDSTORE_NUM_BLOCKS);
+
+ /* remove all tids */
+ tidstore_reset(ts);
+
+ /* test the number of tids */
+ if (tidstore_num_tids(ts) != 0)
+ elog(ERROR, "tidstore_num_tids on empty store returned non-zero");
+
+ /* lookup test for empty store */
+ for (OffsetNumber off = FirstOffsetNumber; off < MaxHeapTuplesPerPage;
+ off++)
+ {
+ check_tid(ts, 0, off, false);
+ check_tid(ts, 2, off, false);
+ check_tid(ts, MaxBlockNumber - 2, off, false);
+ check_tid(ts, MaxBlockNumber, off, false);
+ }
+
+ tidstore_destroy(ts);
+}
+
+static void
+test_empty(void)
+{
+ TidStore *ts;
+ TidStoreIter *iter;
+ ItemPointerData tid;
+
+ elog(NOTICE, "testing empty tidstore");
+
+ ts = tidstore_create(TEST_TIDSTORE_MAX_BYTES, MaxHeapTuplesPerPage, NULL);
+
+ ItemPointerSet(&tid, 0, FirstOffsetNumber);
+ if (tidstore_lookup_tid(ts, &tid))
+ elog(ERROR, "tidstore_lookup_tid for (0,1) on empty store returned true");
+
+ ItemPointerSet(&tid, MaxBlockNumber, MaxOffsetNumber);
+ if (tidstore_lookup_tid(ts, &tid))
+ elog(ERROR, "tidstore_lookup_tid for (%u,%u) on empty store returned true",
+ MaxBlockNumber, MaxOffsetNumber);
+
+ if (tidstore_num_tids(ts) != 0)
+ elog(ERROR, "tidstore_num_entries on empty store returned non-zero");
+
+ if (tidstore_is_full(ts))
+ elog(ERROR, "tidstore_is_full on empty store returned true");
+
+ iter = tidstore_begin_iterate(ts);
+
+ if (tidstore_iterate_next(iter) != NULL)
+ elog(ERROR, "tidstore_iterate_next on empty store returned TIDs");
+
+ tidstore_end_iterate(iter);
+
+ tidstore_destroy(ts);
+}
+
+Datum
+test_tidstore(PG_FUNCTION_ARGS)
+{
+ test_empty();
+
+ elog(NOTICE, "testing basic operations");
+ test_basic(MaxHeapTuplesPerPage);
+ test_basic(10);
+
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_tidstore/test_tidstore.control b/src/test/modules/test_tidstore/test_tidstore.control
new file mode 100644
index 0000000000..9b6bd4638f
--- /dev/null
+++ b/src/test/modules/test_tidstore/test_tidstore.control
@@ -0,0 +1,4 @@
+comment = 'Test code for tidstore'
+default_version = '1.0'
+module_pathname = '$libdir/test_tidstore'
+relocatable = true
--
2.39.1
Attachment: v28-0003-Add-radixtree-template.patch (text/x-patch; charset=US-ASCII)
From bf9d659187537b250683af321b0167d69c7fb18a Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 14 Sep 2022 12:38:51 +0000
Subject: [PATCH v28 03/10] Add radixtree template
WIP: commit message based on template comments
---
src/backend/utils/mmgr/dsa.c | 12 +
src/include/lib/radixtree.h | 2516 +++++++++++++++++
src/include/lib/radixtree_delete_impl.h | 122 +
src/include/lib/radixtree_insert_impl.h | 328 +++
src/include/lib/radixtree_iter_impl.h | 153 +
src/include/lib/radixtree_search_impl.h | 138 +
src/include/utils/dsa.h | 1 +
src/test/modules/Makefile | 1 +
src/test/modules/meson.build | 1 +
src/test/modules/test_radixtree/.gitignore | 4 +
src/test/modules/test_radixtree/Makefile | 23 +
src/test/modules/test_radixtree/README | 7 +
.../expected/test_radixtree.out | 36 +
src/test/modules/test_radixtree/meson.build | 35 +
.../test_radixtree/sql/test_radixtree.sql | 7 +
.../test_radixtree/test_radixtree--1.0.sql | 8 +
.../modules/test_radixtree/test_radixtree.c | 674 +++++
.../test_radixtree/test_radixtree.control | 4 +
src/tools/pginclude/cpluspluscheck | 6 +
src/tools/pginclude/headerscheck | 6 +
20 files changed, 4082 insertions(+)
create mode 100644 src/include/lib/radixtree.h
create mode 100644 src/include/lib/radixtree_delete_impl.h
create mode 100644 src/include/lib/radixtree_insert_impl.h
create mode 100644 src/include/lib/radixtree_iter_impl.h
create mode 100644 src/include/lib/radixtree_search_impl.h
create mode 100644 src/test/modules/test_radixtree/.gitignore
create mode 100644 src/test/modules/test_radixtree/Makefile
create mode 100644 src/test/modules/test_radixtree/README
create mode 100644 src/test/modules/test_radixtree/expected/test_radixtree.out
create mode 100644 src/test/modules/test_radixtree/meson.build
create mode 100644 src/test/modules/test_radixtree/sql/test_radixtree.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree--1.0.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree.c
create mode 100644 src/test/modules/test_radixtree/test_radixtree.control
diff --git a/src/backend/utils/mmgr/dsa.c b/src/backend/utils/mmgr/dsa.c
index f5a62061a3..80555aefff 100644
--- a/src/backend/utils/mmgr/dsa.c
+++ b/src/backend/utils/mmgr/dsa.c
@@ -1024,6 +1024,18 @@ dsa_set_size_limit(dsa_area *area, size_t limit)
LWLockRelease(DSA_AREA_LOCK(area));
}
+size_t
+dsa_get_total_size(dsa_area *area)
+{
+ size_t size;
+
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_SHARED);
+ size = area->control->total_segment_size;
+ LWLockRelease(DSA_AREA_LOCK(area));
+
+ return size;
+}
+
/*
* Aggressively free all spare memory in the hope of returning DSM segments to
* the operating system.
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
new file mode 100644
index 0000000000..1cdb995e54
--- /dev/null
+++ b/src/include/lib/radixtree.h
@@ -0,0 +1,2516 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.h
+ * Template for adaptive radix tree.
+ *
+ * This module employs the idea from the paper "The Adaptive Radix Tree: ARTful
+ * Indexing for Main-Memory Databases" by Viktor Leis, Alfons Kemper, and Thomas
+ * Neumann, 2013. The radix tree uses adaptive node sizes, a small number of node
+ * types, each with a different number of elements. Depending on the number of
+ * children, the appropriate node type is used.
+ *
+ * WIP: notes about traditional radix tree trading off span vs height...
+ *
+ * There are two kinds of nodes, inner nodes and leaves. Inner nodes
+ * map partial keys to child pointers.
+ *
+ * The ART paper mentions three ways to implement leaves:
+ *
+ * "- Single-value leaves: The values are stored using an addi-
+ * tional leaf node type which stores one value.
+ * - Multi-value leaves: The values are stored in one of four
+ * different leaf node types, which mirror the structure of
+ * inner nodes, but contain values instead of pointers.
+ * - Combined pointer/value slots: If values fit into point-
+ * ers, no separate node types are necessary. Instead, each
+ * pointer storage location in an inner node can either
+ * store a pointer or a value."
+ *
+ * We chose "multi-value leaves" to avoid the additional pointer traversal
+ * required by "single-value leaves".
+ *
+ * For simplicity, the key is assumed to be 64-bit unsigned integer. The
+ * tree doesn't need to contain paths where the highest bytes of all keys
+ * are zero. That way, the tree's height adapts to the distribution of keys.
+ *
+ * TODO: In the future it might be worthwhile to offer configurability of
+ * leaf implementation for different use cases. Single-value leaves would
+ * give more flexibility in key type, including variable-length keys.
+ *
+ * There are some optimizations not yet implemented, particularly path
+ * compression and lazy path expansion.
+ *
+ * To handle concurrency, we use a single reader-writer lock for the radix
+ * tree. The radix tree is exclusively locked during write operations such
+ * as RT_SET() and RT_DELETE(), and shared locked during read operations
+ * such as RT_SEARCH(). An iteration also holds the shared lock on the radix
+ * tree until it is completed.
+ *
+ * TODO: The current locking mechanism is not optimized for high concurrency
+ * with mixed read-write workloads. In the future it might be worthwhile
+ * to replace it with the Optimistic Lock Coupling or ROWEX mentioned in
+ * the paper "The ART of Practical Synchronization" by the same authors as
+ * the ART paper, 2016.
+ *
+ * WIP: the radix tree nodes don't shrink.
+ *
+ * To generate a radix tree and associated functions for a use case, several
+ * macros have to be #define'ed before this file is included. Including
+ * the file #undef's all those, so a new radix tree can be generated
+ * afterwards. See the example following the parameter list below.
+ * The relevant parameters are:
+ * - RT_PREFIX - prefix for all symbol names generated. A prefix of 'foo'
+ * will result in radix tree type 'foo_radix_tree' and functions like
+ * 'foo_create'/'foo_free' and so forth.
+ * - RT_DECLARE - if defined, function prototypes and type declarations are
+ * generated
+ * - RT_DEFINE - if defined, function definitions are generated
+ * - RT_SCOPE - the scope (e.g. extern, static inline) in which function
+ * declarations reside
+ * - RT_VALUE_TYPE - the type of the value.
+ *
+ * Optional parameters:
+ * - RT_SHMEM - if defined, the radix tree is created in the DSA area
+ * so that multiple processes can access it simultaneously.
+ * - RT_DEBUG - if defined, add stats tracking and debugging functions
+ *
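+ * For example, a local (non-shared) tree mapping uint64 keys to uint64
+ * values could be generated like this (a sketch; the prefix 'foo' and the
+ * value type are arbitrary):
+ *
+ *   #define RT_PREFIX foo
+ *   #define RT_SCOPE static
+ *   #define RT_DECLARE
+ *   #define RT_DEFINE
+ *   #define RT_VALUE_TYPE uint64
+ *   #include "lib/radixtree.h"
+ *
+ * This produces foo_radix_tree along with foo_create(), foo_set(),
+ * foo_search() and so on, per the interface described below.
+ *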
+ * Interface
+ * ---------
+ *
+ * RT_CREATE - Create a new, empty radix tree
+ * RT_FREE - Free the radix tree
+ * RT_SEARCH - Search a key-value pair
+ * RT_SET - Set a key-value pair
+ * RT_BEGIN_ITERATE - Begin iterating through all key-value pairs
+ * RT_ITERATE_NEXT - Return next key-value pair, if any
+ * RT_END_ITERATE - End iteration
+ * RT_MEMORY_USAGE - Get the memory usage
+ *
+ * Interface for Shared Memory
+ * ---------
+ *
+ * RT_ATTACH - Attach to the radix tree
+ * RT_DETACH - Detach from the radix tree
+ * RT_GET_HANDLE - Return the handle of the radix tree
+ *
+ * Optional Interface
+ * ---------
+ *
+ * RT_DELETE - Delete a key-value pair. Declared/defined if RT_USE_DELETE is defined
+ *
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/lib/radixtree.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "lib/stringinfo.h"
+#include "miscadmin.h"
+#include "nodes/bitmapset.h"
+#include "port/pg_bitutils.h"
+#include "port/simd.h"
+#include "utils/dsa.h"
+#include "utils/memutils.h"
+
+/* helpers */
+#define RT_MAKE_PREFIX(a) CppConcat(a,_)
+#define RT_MAKE_NAME(name) RT_MAKE_NAME_(RT_MAKE_PREFIX(RT_PREFIX),name)
+#define RT_MAKE_NAME_(a,b) CppConcat(a,b)
+
+/* function declarations */
+#define RT_CREATE RT_MAKE_NAME(create)
+#define RT_FREE RT_MAKE_NAME(free)
+#define RT_SEARCH RT_MAKE_NAME(search)
+#ifdef RT_SHMEM
+#define RT_ATTACH RT_MAKE_NAME(attach)
+#define RT_DETACH RT_MAKE_NAME(detach)
+#define RT_GET_HANDLE RT_MAKE_NAME(get_handle)
+#endif
+#define RT_SET RT_MAKE_NAME(set)
+#define RT_BEGIN_ITERATE RT_MAKE_NAME(begin_iterate)
+#define RT_ITERATE_NEXT RT_MAKE_NAME(iterate_next)
+#define RT_END_ITERATE RT_MAKE_NAME(end_iterate)
+#ifdef RT_USE_DELETE
+#define RT_DELETE RT_MAKE_NAME(delete)
+#endif
+#define RT_MEMORY_USAGE RT_MAKE_NAME(memory_usage)
+#ifdef RT_DEBUG
+#define RT_DUMP RT_MAKE_NAME(dump)
+#define RT_DUMP_NODE RT_MAKE_NAME(dump_node)
+#define RT_DUMP_SEARCH RT_MAKE_NAME(dump_search)
+#define RT_STATS RT_MAKE_NAME(stats)
+#endif
+
+/* internal helper functions (no externally visible prototypes) */
+#define RT_NEW_ROOT RT_MAKE_NAME(new_root)
+#define RT_ALLOC_NODE RT_MAKE_NAME(alloc_node)
+#define RT_INIT_NODE RT_MAKE_NAME(init_node)
+#define RT_FREE_NODE RT_MAKE_NAME(free_node)
+#define RT_FREE_RECURSE RT_MAKE_NAME(free_recurse)
+#define RT_EXTEND RT_MAKE_NAME(extend)
+#define RT_SET_EXTEND RT_MAKE_NAME(set_extend)
+#define RT_SWITCH_NODE_KIND RT_MAKE_NAME(grow_node_kind)
+#define RT_COPY_NODE RT_MAKE_NAME(copy_node)
+#define RT_REPLACE_NODE RT_MAKE_NAME(replace_node)
+#define RT_PTR_GET_LOCAL RT_MAKE_NAME(ptr_get_local)
+#define RT_PTR_ALLOC_IS_VALID RT_MAKE_NAME(ptr_stored_is_valid)
+#define RT_NODE_3_SEARCH_EQ RT_MAKE_NAME(node_3_search_eq)
+#define RT_NODE_32_SEARCH_EQ RT_MAKE_NAME(node_32_search_eq)
+#define RT_NODE_3_GET_INSERTPOS RT_MAKE_NAME(node_3_get_insertpos)
+#define RT_NODE_32_GET_INSERTPOS RT_MAKE_NAME(node_32_get_insertpos)
+#define RT_CHUNK_CHILDREN_ARRAY_SHIFT RT_MAKE_NAME(chunk_children_array_shift)
+#define RT_CHUNK_VALUES_ARRAY_SHIFT RT_MAKE_NAME(chunk_values_array_shift)
+#define RT_CHUNK_CHILDREN_ARRAY_DELETE RT_MAKE_NAME(chunk_children_array_delete)
+#define RT_CHUNK_VALUES_ARRAY_DELETE RT_MAKE_NAME(chunk_values_array_delete)
+#define RT_CHUNK_CHILDREN_ARRAY_COPY RT_MAKE_NAME(chunk_children_array_copy)
+#define RT_CHUNK_VALUES_ARRAY_COPY RT_MAKE_NAME(chunk_values_array_copy)
+#define RT_NODE_125_IS_CHUNK_USED RT_MAKE_NAME(node_125_is_chunk_used)
+#define RT_NODE_INNER_125_GET_CHILD RT_MAKE_NAME(node_inner_125_get_child)
+#define RT_NODE_LEAF_125_GET_VALUE RT_MAKE_NAME(node_leaf_125_get_value)
+#define RT_NODE_INNER_256_IS_CHUNK_USED RT_MAKE_NAME(node_inner_256_is_chunk_used)
+#define RT_NODE_LEAF_256_IS_CHUNK_USED RT_MAKE_NAME(node_leaf_256_is_chunk_used)
+#define RT_NODE_INNER_256_GET_CHILD RT_MAKE_NAME(node_inner_256_get_child)
+#define RT_NODE_LEAF_256_GET_VALUE RT_MAKE_NAME(node_leaf_256_get_value)
+#define RT_NODE_INNER_256_SET RT_MAKE_NAME(node_inner_256_set)
+#define RT_NODE_LEAF_256_SET RT_MAKE_NAME(node_leaf_256_set)
+#define RT_NODE_INNER_256_DELETE RT_MAKE_NAME(node_inner_256_delete)
+#define RT_NODE_LEAF_256_DELETE RT_MAKE_NAME(node_leaf_256_delete)
+#define RT_KEY_GET_SHIFT RT_MAKE_NAME(key_get_shift)
+#define RT_SHIFT_GET_MAX_VAL RT_MAKE_NAME(shift_get_max_val)
+#define RT_NODE_SEARCH_INNER RT_MAKE_NAME(node_search_inner)
+#define RT_NODE_SEARCH_LEAF RT_MAKE_NAME(node_search_leaf)
+#define RT_NODE_UPDATE_INNER RT_MAKE_NAME(node_update_inner)
+#define RT_NODE_DELETE_INNER RT_MAKE_NAME(node_delete_inner)
+#define RT_NODE_DELETE_LEAF RT_MAKE_NAME(node_delete_leaf)
+#define RT_NODE_INSERT_INNER RT_MAKE_NAME(node_insert_inner)
+#define RT_NODE_INSERT_LEAF RT_MAKE_NAME(node_insert_leaf)
+#define RT_NODE_INNER_ITERATE_NEXT RT_MAKE_NAME(node_inner_iterate_next)
+#define RT_NODE_LEAF_ITERATE_NEXT RT_MAKE_NAME(node_leaf_iterate_next)
+#define RT_UPDATE_ITER_STACK RT_MAKE_NAME(update_iter_stack)
+#define RT_ITER_UPDATE_KEY RT_MAKE_NAME(iter_update_key)
+#define RT_VERIFY_NODE RT_MAKE_NAME(verify_node)
+
+/* type declarations */
+#define RT_RADIX_TREE RT_MAKE_NAME(radix_tree)
+#define RT_RADIX_TREE_CONTROL RT_MAKE_NAME(radix_tree_control)
+#define RT_ITER RT_MAKE_NAME(iter)
+#ifdef RT_SHMEM
+#define RT_HANDLE RT_MAKE_NAME(handle)
+#endif
+#define RT_NODE RT_MAKE_NAME(node)
+#define RT_NODE_ITER RT_MAKE_NAME(node_iter)
+#define RT_NODE_BASE_3 RT_MAKE_NAME(node_base_3)
+#define RT_NODE_BASE_32 RT_MAKE_NAME(node_base_32)
+#define RT_NODE_BASE_125 RT_MAKE_NAME(node_base_125)
+#define RT_NODE_BASE_256 RT_MAKE_NAME(node_base_256)
+#define RT_NODE_INNER_3 RT_MAKE_NAME(node_inner_3)
+#define RT_NODE_INNER_32 RT_MAKE_NAME(node_inner_32)
+#define RT_NODE_INNER_125 RT_MAKE_NAME(node_inner_125)
+#define RT_NODE_INNER_256 RT_MAKE_NAME(node_inner_256)
+#define RT_NODE_LEAF_3 RT_MAKE_NAME(node_leaf_3)
+#define RT_NODE_LEAF_32 RT_MAKE_NAME(node_leaf_32)
+#define RT_NODE_LEAF_125 RT_MAKE_NAME(node_leaf_125)
+#define RT_NODE_LEAF_256 RT_MAKE_NAME(node_leaf_256)
+#define RT_SIZE_CLASS RT_MAKE_NAME(size_class)
+#define RT_SIZE_CLASS_ELEM RT_MAKE_NAME(size_class_elem)
+#define RT_SIZE_CLASS_INFO RT_MAKE_NAME(size_class_info)
+#define RT_CLASS_3 RT_MAKE_NAME(class_3)
+#define RT_CLASS_32_MIN RT_MAKE_NAME(class_32_min)
+#define RT_CLASS_32_MAX RT_MAKE_NAME(class_32_max)
+#define RT_CLASS_125 RT_MAKE_NAME(class_125)
+#define RT_CLASS_256 RT_MAKE_NAME(class_256)
+
+/* generate forward declarations necessary to use the radix tree */
+#ifdef RT_DECLARE
+
+typedef struct RT_RADIX_TREE RT_RADIX_TREE;
+typedef struct RT_ITER RT_ITER;
+
+#ifdef RT_SHMEM
+typedef dsa_pointer RT_HANDLE;
+#endif
+
+#ifdef RT_SHMEM
+RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx, dsa_area *dsa, int tranche_id);
+RT_SCOPE RT_RADIX_TREE * RT_ATTACH(dsa_area *dsa, dsa_pointer dp);
+RT_SCOPE void RT_DETACH(RT_RADIX_TREE *tree);
+RT_SCOPE RT_HANDLE RT_GET_HANDLE(RT_RADIX_TREE *tree);
+#else
+RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx);
+#endif
+RT_SCOPE void RT_FREE(RT_RADIX_TREE *tree);
+
+RT_SCOPE bool RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p);
+RT_SCOPE bool RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p);
+#ifdef RT_USE_DELETE
+RT_SCOPE bool RT_DELETE(RT_RADIX_TREE *tree, uint64 key);
+#endif
+
+RT_SCOPE RT_ITER * RT_BEGIN_ITERATE(RT_RADIX_TREE *tree);
+RT_SCOPE bool RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, RT_VALUE_TYPE *value_p);
+RT_SCOPE void RT_END_ITERATE(RT_ITER *iter);
+
+RT_SCOPE uint64 RT_MEMORY_USAGE(RT_RADIX_TREE *tree);
+
+#ifdef RT_DEBUG
+RT_SCOPE void RT_DUMP(RT_RADIX_TREE *tree);
+RT_SCOPE void RT_DUMP_SEARCH(RT_RADIX_TREE *tree, uint64 key);
+RT_SCOPE void RT_STATS(RT_RADIX_TREE *tree);
+#endif
+
+#endif /* RT_DECLARE */
+
+
+/* generate implementation of the radix tree */
+#ifdef RT_DEFINE
+
+/* The number of bits encoded in one tree level */
+#define RT_NODE_SPAN BITS_PER_BYTE
+
+/* The maximum number of slots in a node */
+#define RT_NODE_MAX_SLOTS (1 << RT_NODE_SPAN)
+
+/* Mask for extracting a chunk from the key */
+#define RT_CHUNK_MASK ((1 << RT_NODE_SPAN) - 1)
+
+/* Maximum shift the radix tree uses */
+#define RT_MAX_SHIFT RT_KEY_GET_SHIFT(UINT64_MAX)
+
+/* Maximum number of levels the radix tree can have */
+#define RT_MAX_LEVEL ((sizeof(uint64) * BITS_PER_BYTE) / RT_NODE_SPAN)
+
+/*
+ * Number of bits necessary for isset array in the slot-index node.
+ * Since bitmapword can be 64 bits, the only values that make sense
+ * here are 64 and 128.
+ */
+#define RT_SLOT_IDX_LIMIT (RT_NODE_MAX_SLOTS / 2)
+
+/* Invalid index used in node-125 */
+#define RT_INVALID_SLOT_IDX 0xFF
+
+/* Get a chunk from the key */
+#define RT_GET_KEY_CHUNK(key, shift) ((uint8) (((key) >> (shift)) & RT_CHUNK_MASK))
+
+/* For accessing bitmaps */
+#define RT_BM_IDX(x) ((x) / BITS_PER_BITMAPWORD)
+#define RT_BM_BIT(x) ((x) % BITS_PER_BITMAPWORD)
+
+/*
+ * Node kinds
+ *
+ * The different node kinds are what make the tree "adaptive".
+ *
+ * Each node kind is associated with a different datatype and different
+ * search/set/delete/iterate algorithms adapted for its size. The largest
+ * kind, node256, is basically the same as a traditional radix tree,
+ * and would be most wasteful of memory when sparsely populated. The
+ * smaller nodes expend some additional CPU time to enable a smaller
+ * memory footprint.
+ *
+ * XXX There are 4 node kinds, and this should never be increased,
+ * for several reasons:
+ * 1. With 5 or more kinds, gcc tends to use a jump table for switch
+ * statements.
+ * 2. The 4 kinds can be represented with 2 bits, so we have the option
+ * in the future to tag the node pointer with the kind, even on
+ * platforms with 32-bit pointers. This might speed up node traversal
+ * in trees with highly random node kinds.
+ * 3. We can have multiple size classes per node kind.
+ */
+#define RT_NODE_KIND_3 0x00
+#define RT_NODE_KIND_32 0x01
+#define RT_NODE_KIND_125 0x02
+#define RT_NODE_KIND_256 0x03
+#define RT_NODE_KIND_COUNT 4
+
+/*
+ * Calculate the slab blocksize so that we can allocate at least 32 chunks
+ * from the block.
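+ *
+ * For example, assuming the default 8kB slab block size (8192 bytes), a
+ * 40-byte chunk gives Max((8192 / 40) * 40, 40 * 32) = Max(8160, 1280) = 8160.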
+ */
+#define RT_SLAB_BLOCK_SIZE(size) \
+ Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * (size), (size) * 32)
+
+/* Common type for all node types */
+typedef struct RT_NODE
+{
+ /*
+ * Number of children. We use uint16 to be able to indicate 256 children
+ * at the fanout of 8.
+ */
+ uint16 count;
+
+ /*
+ * Max capacity for the current size class. Storing this in the
+ * node enables multiple size classes per node kind.
+ * Technically, kinds with a single size class don't need this, so we could
+ * keep this in the individual base types, but the code is simpler this way.
+ * Note: node256 is unique in that it cannot possibly have more than a
+ * single size class, so for that kind we store zero, and uint8 is
+ * sufficient for other kinds.
+ */
+ uint8 fanout;
+
+ /*
+ * Shift indicates which part of the key space is represented by this
+ * node. That is, the key is shifted by 'shift' and the lowest
+ * RT_NODE_SPAN bits are then represented in chunk.
+ */
+ uint8 shift;
+
+ /* Node kind, one per search/set algorithm */
+ uint8 kind;
+} RT_NODE;
+
+
+#define RT_PTR_LOCAL RT_NODE *
+
+#ifdef RT_SHMEM
+#define RT_PTR_ALLOC dsa_pointer
+#else
+#define RT_PTR_ALLOC RT_PTR_LOCAL
+#endif
+
+
+#ifdef RT_SHMEM
+#define RT_INVALID_PTR_ALLOC InvalidDsaPointer
+#else
+#define RT_INVALID_PTR_ALLOC NULL
+#endif
+
+#ifdef RT_SHMEM
+#define RT_LOCK_EXCLUSIVE(tree) LWLockAcquire(&tree->ctl->lock, LW_EXCLUSIVE)
+#define RT_LOCK_SHARED(tree) LWLockAcquire(&tree->ctl->lock, LW_SHARED)
+#define RT_UNLOCK(tree) LWLockRelease(&tree->ctl->lock);
+#else
+#define RT_LOCK_EXCLUSIVE(tree) ((void) 0)
+#define RT_LOCK_SHARED(tree) ((void) 0)
+#define RT_UNLOCK(tree) ((void) 0)
+#endif
+
+/*
+ * Inner nodes and leaf nodes have analogous structure. To distinguish
+ * them at runtime, we take advantage of the fact that the key chunk
+ * is accessed by shifting: Inner tree nodes (shift > 0) store pointers
+ * to their child nodes in the slots. In leaf nodes (shift == 0),
+ * the slot contains the value corresponding to the key.
+ */
+#define RT_NODE_IS_LEAF(n) (((RT_PTR_LOCAL) (n))->shift == 0)
+
+#define RT_NODE_MUST_GROW(node) \
+ ((node)->base.n.count == (node)->base.n.fanout)
+
+/*
+ * Base types of each node kind for leaf and inner nodes.
+ * The base types must be able to accommodate the largest size
+ * class for variable-sized node kinds.
+ */
+typedef struct RT_NODE_BASE_3
+{
+ RT_NODE n;
+
+ /* 3 children, for key chunks */
+ uint8 chunks[3];
+} RT_NODE_BASE_3;
+
+typedef struct RT_NODE_BASE_32
+{
+ RT_NODE n;
+
+ /* 32 children, for key chunks */
+ uint8 chunks[32];
+} RT_NODE_BASE_32;
+
+/*
+ * node-125 uses a slot_idxs array, an array of RT_NODE_MAX_SLOTS length,
+ * to store indexes into a second array that contains the values (or
+ * child pointers).
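+ *
+ * For example (hypothetical contents): if slot_idxs[0x2A] == 5, the entry
+ * for key chunk 0x2A lives in children[5] (inner nodes) or values[5]
+ * (leaf nodes).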
+ */
+typedef struct RT_NODE_BASE_125
+{
+ RT_NODE n;
+
+ /* The slot index for each key chunk; RT_INVALID_SLOT_IDX if unused */
+ uint8 slot_idxs[RT_NODE_MAX_SLOTS];
+
+ /* bitmap to track which slots are in use */
+ bitmapword isset[RT_BM_IDX(RT_SLOT_IDX_LIMIT)];
+} RT_NODE_BASE_125;
+
+typedef struct RT_NODE_BASE_256
+{
+ RT_NODE n;
+} RT_NODE_BASE_256;
+
+/*
+ * Inner and leaf nodes.
+ *
+ * These are separate because the value type might not fit into a
+ * pointer-width type.
+ */
+typedef struct RT_NODE_INNER_3
+{
+ RT_NODE_BASE_3 base;
+
+ /* number of children depends on size class */
+ RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_INNER_3;
+
+typedef struct RT_NODE_LEAF_3
+{
+ RT_NODE_BASE_3 base;
+
+ /* number of values depends on size class */
+ RT_VALUE_TYPE values[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_LEAF_3;
+
+typedef struct RT_NODE_INNER_32
+{
+ RT_NODE_BASE_32 base;
+
+ /* number of children depends on size class */
+ RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_INNER_32;
+
+typedef struct RT_NODE_LEAF_32
+{
+ RT_NODE_BASE_32 base;
+
+ /* number of values depends on size class */
+ RT_VALUE_TYPE values[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_LEAF_32;
+
+typedef struct RT_NODE_INNER_125
+{
+ RT_NODE_BASE_125 base;
+
+ /* number of children depends on size class */
+ RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_INNER_125;
+
+typedef struct RT_NODE_LEAF_125
+{
+ RT_NODE_BASE_125 base;
+
+ /* number of values depends on size class */
+ RT_VALUE_TYPE values[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_LEAF_125;
+
+/*
+ * node-256 is the largest node type. This node has an array
+ * for directly storing values (or child pointers in inner nodes).
+ * Unlike other node kinds, its array size is by definition
+ * fixed.
+ */
+typedef struct RT_NODE_INNER_256
+{
+ RT_NODE_BASE_256 base;
+
+ /* Slots for 256 children */
+ RT_PTR_ALLOC children[RT_NODE_MAX_SLOTS];
+} RT_NODE_INNER_256;
+
+typedef struct RT_NODE_LEAF_256
+{
+ RT_NODE_BASE_256 base;
+
+ /*
+ * Unlike with inner256, zero is a valid value here, so we use a
+ * bitmap to track which slots are in use.
+ */
+ bitmapword isset[RT_BM_IDX(RT_NODE_MAX_SLOTS)];
+
+ /* Slots for 256 values */
+ RT_VALUE_TYPE values[RT_NODE_MAX_SLOTS];
+} RT_NODE_LEAF_256;
+
+/*
+ * Node size classes
+ *
+ * Nodes of different kinds necessarily belong to different size classes.
+ * The main innovation in our implementation compared to the ART paper
+ * is decoupling the notion of size class from kind.
+ *
+ * The size classes within a given node kind have the same underlying
+ * type, but a variable number of children/values. This is possible
+ * because the base type contains small fixed data structures that
+ * work the same way regardless of how full the node is. We store the
+ * node's allocated capacity in the "fanout" member of RT_NODE, to allow
+ * runtime introspection.
+ *
+ * Growing from one node kind to another requires special code for each
+ * case, but growing from one size class to another within the same kind
+ * is basically just allocate + memcpy.
+ *
+ * The size classes have been chosen so that inner nodes on platforms
+ * with 64-bit pointers (and leaf nodes when using a 64-bit key) are
+ * equal to or slightly smaller than some DSA size class.
+ */
+typedef enum RT_SIZE_CLASS
+{
+ RT_CLASS_3 = 0,
+ RT_CLASS_32_MIN,
+ RT_CLASS_32_MAX,
+ RT_CLASS_125,
+ RT_CLASS_256
+} RT_SIZE_CLASS;
+
+/* Information for each size class */
+typedef struct RT_SIZE_CLASS_ELEM
+{
+ const char *name;
+ int fanout;
+
+ /* slab chunk size */
+ Size inner_size;
+ Size leaf_size;
+} RT_SIZE_CLASS_ELEM;
+
+static const RT_SIZE_CLASS_ELEM RT_SIZE_CLASS_INFO[] = {
+ [RT_CLASS_3] = {
+ .name = "radix tree node 3",
+ .fanout = 3,
+ .inner_size = sizeof(RT_NODE_INNER_3) + 3 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_3) + 3 * sizeof(RT_VALUE_TYPE),
+ },
+ [RT_CLASS_32_MIN] = {
+ .name = "radix tree node 15",
+ .fanout = 15,
+ .inner_size = sizeof(RT_NODE_INNER_32) + 15 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_32) + 15 * sizeof(RT_VALUE_TYPE),
+ },
+ [RT_CLASS_32_MAX] = {
+ .name = "radix tree node 32",
+ .fanout = 32,
+ .inner_size = sizeof(RT_NODE_INNER_32) + 32 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_32) + 32 * sizeof(RT_VALUE_TYPE),
+ },
+ [RT_CLASS_125] = {
+ .name = "radix tree node 125",
+ .fanout = 125,
+ .inner_size = sizeof(RT_NODE_INNER_125) + 125 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_125) + 125 * sizeof(RT_VALUE_TYPE),
+ },
+ [RT_CLASS_256] = {
+ .name = "radix tree node 256",
+ .fanout = 256,
+ .inner_size = sizeof(RT_NODE_INNER_256),
+ .leaf_size = sizeof(RT_NODE_LEAF_256),
+ },
+};
+
+#define RT_SIZE_CLASS_COUNT lengthof(RT_SIZE_CLASS_INFO)
+
+#ifdef RT_SHMEM
+/* A magic value used to identify our radix tree */
+#define RT_RADIX_TREE_MAGIC 0x54A48167
+#endif
+
+/* Contains the actual tree and ancillary info */
+// WIP: this name is a bit strange
+typedef struct RT_RADIX_TREE_CONTROL
+{
+#ifdef RT_SHMEM
+ RT_HANDLE handle;
+ uint32 magic;
+ LWLock lock;
+#endif
+
+ RT_PTR_ALLOC root;
+ uint64 max_val;
+ uint64 num_keys;
+
+ /* statistics */
+#ifdef RT_DEBUG
+ int32 cnt[RT_SIZE_CLASS_COUNT];
+#endif
+} RT_RADIX_TREE_CONTROL;
+
+/* Entry point for allocating and accessing the tree */
+typedef struct RT_RADIX_TREE
+{
+ MemoryContext context;
+
+ /* pointing to either local memory or DSA */
+ RT_RADIX_TREE_CONTROL *ctl;
+
+#ifdef RT_SHMEM
+ dsa_area *dsa;
+#else
+ MemoryContextData *inner_slabs[RT_SIZE_CLASS_COUNT];
+ MemoryContextData *leaf_slabs[RT_SIZE_CLASS_COUNT];
+#endif
+} RT_RADIX_TREE;
+
+/*
+ * Iteration support.
+ *
+ * Iterating the radix tree returns each pair of key and value in the ascending
+ * order of the key. To support this, we iterate over nodes at each level.
+ *
+ * RT_NODE_ITER struct is used to track the iteration within a node.
+ *
+ * RT_ITER is the struct for iteration of the radix tree, and uses RT_NODE_ITER
+ * in order to track the iteration of each level. During iteration, we also
+ * construct the key whenever updating the node iteration information, e.g., when
+ * advancing the current index within the node or when moving to the next node
+ * at the same level.
+ *
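+ * A typical iteration, in outline (a sketch only):
+ *
+ *   RT_ITER *iter = RT_BEGIN_ITERATE(tree);
+ *   uint64 key;
+ *   RT_VALUE_TYPE value;
+ *
+ *   while (RT_ITERATE_NEXT(iter, &key, &value))
+ *       ... use key and value ...
+ *   RT_END_ITERATE(iter);
+ *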
+ * XXX: Currently we allow only one process to do iteration. Therefore, RT_NODE_ITER
+ * has local pointers to nodes, rather than RT_PTR_ALLOC.
+ * We need either a safeguard that disallows other processes from beginning an
+ * iteration while one is in progress, or support for multiple concurrent iterations.
+ */
+typedef struct RT_NODE_ITER
+{
+ RT_PTR_LOCAL node; /* current node being iterated */
+ int current_idx; /* current position. -1 for initial value */
+} RT_NODE_ITER;
+
+typedef struct RT_ITER
+{
+ RT_RADIX_TREE *tree;
+
+ /* Track the iteration on nodes of each level */
+ RT_NODE_ITER stack[RT_MAX_LEVEL];
+ int stack_len;
+
+ /* The key is constructed during iteration */
+ uint64 key;
+} RT_ITER;
+
+
+static void RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
+ uint64 key, RT_PTR_ALLOC child);
+static bool RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
+ uint64 key, RT_VALUE_TYPE *value_p);
+
+/* verification (available only with assertion) */
+static void RT_VERIFY_NODE(RT_PTR_LOCAL node);
+
+/* Get the local address of an allocated node */
+static inline RT_PTR_LOCAL
+RT_PTR_GET_LOCAL(RT_RADIX_TREE *tree, RT_PTR_ALLOC node)
+{
+#ifdef RT_SHMEM
+ return dsa_get_address(tree->dsa, (dsa_pointer) node);
+#else
+ return node;
+#endif
+}
+
+static inline bool
+RT_PTR_ALLOC_IS_VALID(RT_PTR_ALLOC ptr)
+{
+#ifdef RT_SHMEM
+ return DsaPointerIsValid(ptr);
+#else
+ return PointerIsValid(ptr);
+#endif
+}
+
+/*
+ * Return index of the first element in the node's chunk array that equals
+ * 'chunk'. Return -1 if there is no such element.
+ */
+static inline int
+RT_NODE_3_SEARCH_EQ(RT_NODE_BASE_3 *node, uint8 chunk)
+{
+ int idx = -1;
+
+ for (int i = 0; i < node->n.count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ idx = i;
+ break;
+ }
+ }
+
+ return idx;
+}
+
+/*
+ * Return index of the chunk and slot arrays for inserting into the node,
+ * such that the chunk array remains ordered.
+ */
+static inline int
+RT_NODE_3_GET_INSERTPOS(RT_NODE_BASE_3 *node, uint8 chunk)
+{
+ int idx;
+
+ for (idx = 0; idx < node->n.count; idx++)
+ {
+ if (node->chunks[idx] >= chunk)
+ break;
+ }
+
+ return idx;
+}
+
+/*
+ * Return index of the first element in the node's chunk array that equals
+ * 'chunk'. Return -1 if there is no such element.
+ */
+static inline int
+RT_NODE_32_SEARCH_EQ(RT_NODE_BASE_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ uint32 bitfield;
+ int index_simd = -1;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index = -1;
+
+ for (int i = 0; i < count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ index = i;
+ break;
+ }
+ }
+#endif
+
+#ifndef USE_NO_SIMD
+ /* replicate the search key */
+ spread_chunk = vector8_broadcast(chunk);
+
+ /* compare to all 32 keys stored in the node */
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ cmp1 = vector8_eq(spread_chunk, haystack1);
+ cmp2 = vector8_eq(spread_chunk, haystack2);
+
+ /* convert comparison to a bitfield */
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+
+ /* mask off invalid entries */
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ /* convert bitfield to index by counting trailing zeros */
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+/*
+ * Return index of the chunk and slot arrays for inserting into the node,
+ * such that the chunk array remains ordered.
+ */
+static inline int
+RT_NODE_32_GET_INSERTPOS(RT_NODE_BASE_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ Vector8 min1;
+ Vector8 min2;
+ uint32 bitfield;
+ int index_simd;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index;
+
+ for (index = 0; index < count; index++)
+ {
+ /*
+ * This is coded with '>=' to match what we can do with SIMD,
+ * with an assert to keep us honest.
+ */
+ if (node->chunks[index] >= chunk)
+ {
+ Assert(node->chunks[index] != chunk);
+ break;
+ }
+ }
+#endif
+
+#ifndef USE_NO_SIMD
+ /*
+ * This is a bit more complicated than RT_NODE_32_SEARCH_EQ(), because
+ * no unsigned uint8 comparison instruction exists, at least for SSE2. So
+ * we need to play some trickery using vector8_min() to effectively get
+ * >=. There'll never be any equal elements in current uses, but that's
+ * what we get here...
+ */
+ spread_chunk = vector8_broadcast(chunk);
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ min1 = vector8_min(spread_chunk, haystack1);
+ min2 = vector8_min(spread_chunk, haystack2);
+ cmp1 = vector8_eq(spread_chunk, min1);
+ cmp2 = vector8_eq(spread_chunk, min2);
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+ else
+ index_simd = count;
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+
+/*
+ * Functions to manipulate both chunks array and children/values array.
+ * These are used for node-3 and node-32.
+ */
+
+/* Shift the elements right at 'idx' by one */
+static inline void
+RT_CHUNK_CHILDREN_ARRAY_SHIFT(uint8 *chunks, RT_PTR_ALLOC *children, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+ memmove(&(children[idx + 1]), &(children[idx]), sizeof(RT_PTR_ALLOC) * (count - idx));
+}
+
+static inline void
+RT_CHUNK_VALUES_ARRAY_SHIFT(uint8 *chunks, RT_VALUE_TYPE *values, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+ memmove(&(values[idx + 1]), &(values[idx]), sizeof(RT_VALUE_TYPE) * (count - idx));
+}
+
+/* Delete the element at 'idx' */
+static inline void
+RT_CHUNK_CHILDREN_ARRAY_DELETE(uint8 *chunks, RT_PTR_ALLOC *children, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(children[idx]), &(children[idx + 1]), sizeof(RT_PTR_ALLOC) * (count - idx - 1));
+}
+
+static inline void
+RT_CHUNK_VALUES_ARRAY_DELETE(uint8 *chunks, RT_VALUE_TYPE *values, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(values[idx]), &(values[idx + 1]), sizeof(RT_VALUE_TYPE) * (count - idx - 1));
+}
+
+/* Copy both chunks and children/values arrays */
+static inline void
+RT_CHUNK_CHILDREN_ARRAY_COPY(uint8 *src_chunks, RT_PTR_ALLOC *src_children,
+ uint8 *dst_chunks, RT_PTR_ALLOC *dst_children)
+{
+ const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_3].fanout;
+ const Size chunk_size = sizeof(uint8) * fanout;
+ const Size children_size = sizeof(RT_PTR_ALLOC) * fanout;
+
+ memcpy(dst_chunks, src_chunks, chunk_size);
+ memcpy(dst_children, src_children, children_size);
+}
+
+static inline void
+RT_CHUNK_VALUES_ARRAY_COPY(uint8 *src_chunks, RT_VALUE_TYPE *src_values,
+ uint8 *dst_chunks, RT_VALUE_TYPE *dst_values)
+{
+ const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_3].fanout;
+ const Size chunk_size = sizeof(uint8) * fanout;
+ const Size values_size = sizeof(RT_VALUE_TYPE) * fanout;
+
+ memcpy(dst_chunks, src_chunks, chunk_size);
+ memcpy(dst_values, src_values, values_size);
+}
+
+/* Functions to manipulate inner and leaf node-125 */
+
+/* Does the given chunk in the node have a value? */
+static inline bool
+RT_NODE_125_IS_CHUNK_USED(RT_NODE_BASE_125 *node, uint8 chunk)
+{
+ return node->slot_idxs[chunk] != RT_INVALID_SLOT_IDX;
+}
+
+static inline RT_PTR_ALLOC
+RT_NODE_INNER_125_GET_CHILD(RT_NODE_INNER_125 *node, uint8 chunk)
+{
+ Assert(!RT_NODE_IS_LEAF(node));
+ return node->children[node->base.slot_idxs[chunk]];
+}
+
+static inline RT_VALUE_TYPE
+RT_NODE_LEAF_125_GET_VALUE(RT_NODE_LEAF_125 *node, uint8 chunk)
+{
+ Assert(RT_NODE_IS_LEAF(node));
+ Assert(((RT_NODE_BASE_125 *) node)->slot_idxs[chunk] != RT_INVALID_SLOT_IDX);
+ return node->values[node->base.slot_idxs[chunk]];
+}
+
+/* Functions to manipulate inner and leaf node-256 */
+
+/* Return true if the slot corresponding to the given chunk is in use */
+static inline bool
+RT_NODE_INNER_256_IS_CHUNK_USED(RT_NODE_INNER_256 *node, uint8 chunk)
+{
+ Assert(!RT_NODE_IS_LEAF(node));
+ return node->children[chunk] != RT_INVALID_PTR_ALLOC;
+}
+
+static inline bool
+RT_NODE_LEAF_256_IS_CHUNK_USED(RT_NODE_LEAF_256 *node, uint8 chunk)
+{
+ int idx = RT_BM_IDX(chunk);
+ int bitnum = RT_BM_BIT(chunk);
+
+ Assert(RT_NODE_IS_LEAF(node));
+ return (node->isset[idx] & ((bitmapword) 1 << bitnum)) != 0;
+}
+
+static inline RT_PTR_ALLOC
+RT_NODE_INNER_256_GET_CHILD(RT_NODE_INNER_256 *node, uint8 chunk)
+{
+ Assert(!RT_NODE_IS_LEAF(node));
+ Assert(RT_NODE_INNER_256_IS_CHUNK_USED(node, chunk));
+ return node->children[chunk];
+}
+
+static inline RT_VALUE_TYPE
+RT_NODE_LEAF_256_GET_VALUE(RT_NODE_LEAF_256 *node, uint8 chunk)
+{
+ Assert(RT_NODE_IS_LEAF(node));
+ Assert(RT_NODE_LEAF_256_IS_CHUNK_USED(node, chunk));
+ return node->values[chunk];
+}
+
+/* Set the child in the node-256 */
+static inline void
+RT_NODE_INNER_256_SET(RT_NODE_INNER_256 *node, uint8 chunk, RT_PTR_ALLOC child)
+{
+ Assert(!RT_NODE_IS_LEAF(node));
+ node->children[chunk] = child;
+}
+
+/* Set the value in the node-256 */
+static inline void
+RT_NODE_LEAF_256_SET(RT_NODE_LEAF_256 *node, uint8 chunk, RT_VALUE_TYPE value)
+{
+ int idx = RT_BM_IDX(chunk);
+ int bitnum = RT_BM_BIT(chunk);
+
+ Assert(RT_NODE_IS_LEAF(node));
+ node->isset[idx] |= ((bitmapword) 1 << bitnum);
+ node->values[chunk] = value;
+}
+
+/* Delete the child or value at the given chunk position */
+static inline void
+RT_NODE_INNER_256_DELETE(RT_NODE_INNER_256 *node, uint8 chunk)
+{
+ Assert(!RT_NODE_IS_LEAF(node));
+ node->children[chunk] = RT_INVALID_PTR_ALLOC;
+}
+
+static inline void
+RT_NODE_LEAF_256_DELETE(RT_NODE_LEAF_256 *node, uint8 chunk)
+{
+ int idx = RT_BM_IDX(chunk);
+ int bitnum = RT_BM_BIT(chunk);
+
+ Assert(RT_NODE_IS_LEAF(node));
+ node->isset[idx] &= ~((bitmapword) 1 << bitnum);
+}
+
+/*
+ * Return the largest shift that will allow storing the given key.
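+ * For example, key 0xFF needs shift 0, while key 0x10000 needs shift 16.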
+ */
+static inline int
+RT_KEY_GET_SHIFT(uint64 key)
+{
+ if (key == 0)
+ return 0;
+ else
+ return (pg_leftmost_one_pos64(key) / RT_NODE_SPAN) * RT_NODE_SPAN;
+}
+
+/*
+ * Return the max value that can be stored in the tree with the given shift.
+ */
+static uint64
+RT_SHIFT_GET_MAX_VAL(int shift)
+{
+ if (shift == RT_MAX_SHIFT)
+ return UINT64_MAX;
+
+ return (UINT64CONST(1) << (shift + RT_NODE_SPAN)) - 1;
+}
+
+/*
+ * Allocate a new node with the given node kind.
+ */
+static RT_PTR_ALLOC
+RT_ALLOC_NODE(RT_RADIX_TREE *tree, RT_SIZE_CLASS size_class, bool is_leaf)
+{
+ RT_PTR_ALLOC allocnode;
+ size_t allocsize;
+
+ if (is_leaf)
+ allocsize = RT_SIZE_CLASS_INFO[size_class].leaf_size;
+ else
+ allocsize = RT_SIZE_CLASS_INFO[size_class].inner_size;
+
+#ifdef RT_SHMEM
+ allocnode = dsa_allocate(tree->dsa, allocsize);
+#else
+ if (is_leaf)
+ allocnode = (RT_PTR_ALLOC) MemoryContextAlloc(tree->leaf_slabs[size_class],
+ allocsize);
+ else
+ allocnode = (RT_PTR_ALLOC) MemoryContextAlloc(tree->inner_slabs[size_class],
+ allocsize);
+#endif
+
+#ifdef RT_DEBUG
+ /* update the statistics */
+ tree->ctl->cnt[size_class]++;
+#endif
+
+ return allocnode;
+}
+
+/* Initialize the node contents */
+static inline void
+RT_INIT_NODE(RT_PTR_LOCAL node, uint8 kind, RT_SIZE_CLASS size_class, bool is_leaf)
+{
+ if (is_leaf)
+ MemSet(node, 0, RT_SIZE_CLASS_INFO[size_class].leaf_size);
+ else
+ MemSet(node, 0, RT_SIZE_CLASS_INFO[size_class].inner_size);
+
+ node->kind = kind;
+
+ if (kind == RT_NODE_KIND_256)
+ /* See comment for the RT_NODE type */
+ Assert(node->fanout == 0);
+ else
+ node->fanout = RT_SIZE_CLASS_INFO[size_class].fanout;
+
+ /* Initialize slot_idxs to invalid values */
+ if (kind == RT_NODE_KIND_125)
+ {
+ RT_NODE_BASE_125 *n125 = (RT_NODE_BASE_125 *) node;
+
+ memset(n125->slot_idxs, RT_INVALID_SLOT_IDX, sizeof(n125->slot_idxs));
+ }
+}
+
+/*
+ * Create a new node as the root. Subordinate nodes will be created during
+ * the insertion.
+ */
+static pg_noinline void
+RT_NEW_ROOT(RT_RADIX_TREE *tree, uint64 key)
+{
+ int shift = RT_KEY_GET_SHIFT(key);
+ bool is_leaf = shift == 0;
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+
+ allocnode = RT_ALLOC_NODE(tree, RT_CLASS_3, is_leaf);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(newnode, RT_NODE_KIND_3, RT_CLASS_3, is_leaf);
+ newnode->shift = shift;
+ tree->ctl->max_val = RT_SHIFT_GET_MAX_VAL(shift);
+ tree->ctl->root = allocnode;
+}
+
+static inline void
+RT_COPY_NODE(RT_PTR_LOCAL newnode, RT_PTR_LOCAL oldnode)
+{
+ newnode->shift = oldnode->shift;
+ newnode->count = oldnode->count;
+}
+
+/*
+ * Given a newly allocated node and an old node, initialize the new
+ * node with the necessary fields and return its local pointer.
+ */
+static inline RT_PTR_LOCAL
+RT_SWITCH_NODE_KIND(RT_RADIX_TREE *tree, RT_PTR_ALLOC allocnode, RT_PTR_LOCAL node,
+ uint8 new_kind, uint8 new_class, bool is_leaf)
+{
+ RT_PTR_LOCAL newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(newnode, new_kind, new_class, is_leaf);
+ RT_COPY_NODE(newnode, node);
+
+ return newnode;
+}
+
+/* Free the given node */
+static void
+RT_FREE_NODE(RT_RADIX_TREE *tree, RT_PTR_ALLOC allocnode)
+{
+ /* If we're deleting the root node, make the tree empty */
+ if (tree->ctl->root == allocnode)
+ {
+ tree->ctl->root = RT_INVALID_PTR_ALLOC;
+ tree->ctl->max_val = 0;
+ }
+
+#ifdef RT_DEBUG
+ {
+ int i;
+ RT_PTR_LOCAL node = RT_PTR_GET_LOCAL(tree, allocnode);
+
+ /* update the statistics */
+ for (i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ if (node->fanout == RT_SIZE_CLASS_INFO[i].fanout)
+ break;
+ }
+
+ /* fanout of node256 is intentionally 0 */
+ if (i == RT_SIZE_CLASS_COUNT)
+ i = RT_CLASS_256;
+
+ tree->ctl->cnt[i]--;
+ Assert(tree->ctl->cnt[i] >= 0);
+ }
+#endif
+
+#ifdef RT_SHMEM
+ dsa_free(tree->dsa, allocnode);
+#else
+ pfree(allocnode);
+#endif
+}
+
+/* Update the parent's pointer when growing a node */
+static inline void
+RT_NODE_UPDATE_INNER(RT_PTR_LOCAL node, uint64 key, RT_PTR_ALLOC new_child)
+{
+#define RT_ACTION_UPDATE
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_INNER
+#undef RT_ACTION_UPDATE
+}
+
+/*
+ * Replace old_child with new_child, and free the old one.
+ */
+static inline void
+RT_REPLACE_NODE(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent,
+ RT_PTR_ALLOC stored_old_child, RT_PTR_LOCAL old_child,
+ RT_PTR_ALLOC new_child, uint64 key)
+{
+#ifdef USE_ASSERT_CHECKING
+ RT_PTR_LOCAL new = RT_PTR_GET_LOCAL(tree, new_child);
+
+ Assert(old_child->shift == new->shift);
+ Assert(old_child->count == new->count);
+#endif
+
+ if (parent == old_child)
+ {
+ /* Replace the root node with the new larger node */
+ tree->ctl->root = new_child;
+ }
+ else
+ RT_NODE_UPDATE_INNER(parent, key, new_child);
+
+ RT_FREE_NODE(tree, stored_old_child);
+}
+
+/*
+ * The radix tree doesn't have sufficient height. Extend the radix tree so
+ * it can store the key.
+ */
+static pg_noinline void
+RT_EXTEND(RT_RADIX_TREE *tree, uint64 key)
+{
+ int target_shift;
+ RT_PTR_LOCAL root = RT_PTR_GET_LOCAL(tree, tree->ctl->root);
+ int shift = root->shift + RT_NODE_SPAN;
+
+ target_shift = RT_KEY_GET_SHIFT(key);
+
+ /* Grow tree from 'shift' to 'target_shift' */
+ while (shift <= target_shift)
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL node;
+ RT_NODE_INNER_3 *n3;
+
+ allocnode = RT_ALLOC_NODE(tree, RT_CLASS_3, false);
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(node, RT_NODE_KIND_3, RT_CLASS_3, false);
+ node->shift = shift;
+ node->count = 1;
+
+ n3 = (RT_NODE_INNER_3 *) node;
+ n3->base.chunks[0] = 0;
+ n3->children[0] = tree->ctl->root;
+
+ /* Update the root */
+ tree->ctl->root = allocnode;
+
+ shift += RT_NODE_SPAN;
+ }
+
+ tree->ctl->max_val = RT_SHIFT_GET_MAX_VAL(target_shift);
+}
+
+/*
+ * The radix tree doesn't have inner and leaf nodes for the given key-value pair.
+ * Insert inner and leaf nodes from 'node' to bottom.
+ */
+static pg_noinline void
+RT_SET_EXTEND(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p, RT_PTR_LOCAL parent,
+ RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node)
+{
+ int shift = node->shift;
+
+ Assert(RT_PTR_GET_LOCAL(tree, stored_node) == node);
+
+ while (shift >= RT_NODE_SPAN)
+ {
+ RT_PTR_ALLOC allocchild;
+ RT_PTR_LOCAL newchild;
+ int newshift = shift - RT_NODE_SPAN;
+ bool is_leaf = newshift == 0;
+
+ allocchild = RT_ALLOC_NODE(tree, RT_CLASS_3, is_leaf);
+ newchild = RT_PTR_GET_LOCAL(tree, allocchild);
+ RT_INIT_NODE(newchild, RT_NODE_KIND_3, RT_CLASS_3, is_leaf);
+ newchild->shift = newshift;
+ RT_NODE_INSERT_INNER(tree, parent, stored_node, node, key, allocchild);
+
+ parent = node;
+ node = newchild;
+ stored_node = allocchild;
+ shift -= RT_NODE_SPAN;
+ }
+
+ RT_NODE_INSERT_LEAF(tree, parent, stored_node, node, key, value_p);
+ tree->ctl->num_keys++;
+}
+
+/*
+ * Search for the child pointer corresponding to 'key' in the given node.
+ *
+ * Return true if the key is found, otherwise return false. On success, the child
+ * pointer is returned in *child_p.
+ */
+static inline bool
+RT_NODE_SEARCH_INNER(RT_PTR_LOCAL node, uint64 key, RT_PTR_ALLOC *child_p)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/*
+ * Search for the value corresponding to 'key' in the given node.
+ *
+ * Return true if the key is found, otherwise return false. On success, the pointer
+ * the value is copied into *value_p.
+ */
+static inline bool
+RT_NODE_SEARCH_LEAF(RT_PTR_LOCAL node, uint64 key, RT_VALUE_TYPE *value_p)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
+#ifdef RT_USE_DELETE
+/*
+ * Search for the child pointer corresponding to 'key' in the given node.
+ *
+ * Delete the entry and return true if the key is found, otherwise return false.
+ */
+static inline bool
+RT_NODE_DELETE_INNER(RT_PTR_LOCAL node, uint64 key)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_delete_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/*
+ * Search for the value corresponding to 'key' in the given node.
+ *
+ * Delete the entry and return true if the key is found, otherwise return false.
+ */
+static inline bool
+RT_NODE_DELETE_LEAF(RT_PTR_LOCAL node, uint64 key)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_delete_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+#endif
+
+/*
+ * Insert "child" into "node".
+ *
+ * "parent" is the parent of "node", so the grandparent of the child.
+ * If the node we're inserting into needs to grow, we update the parent's
+ * child pointer with the pointer to the new larger node.
+ */
+static void
+RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
+ uint64 key, RT_PTR_ALLOC child)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_insert_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/* Like RT_NODE_INSERT_INNER, but for leaf nodes */
+static bool
+RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
+ uint64 key, RT_VALUE_TYPE *value_p)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_insert_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
+/*
+ * Create the radix tree in the given memory context and return it.
+ */
+RT_SCOPE RT_RADIX_TREE *
+#ifdef RT_SHMEM
+RT_CREATE(MemoryContext ctx, dsa_area *dsa, int tranche_id)
+#else
+RT_CREATE(MemoryContext ctx)
+#endif
+{
+ RT_RADIX_TREE *tree;
+ MemoryContext old_ctx;
+#ifdef RT_SHMEM
+ dsa_pointer dp;
+#endif
+
+ old_ctx = MemoryContextSwitchTo(ctx);
+
+ tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE));
+ tree->context = ctx;
+
+#ifdef RT_SHMEM
+ tree->dsa = dsa;
+ dp = dsa_allocate0(dsa, sizeof(RT_RADIX_TREE_CONTROL));
+ tree->ctl = (RT_RADIX_TREE_CONTROL *) dsa_get_address(dsa, dp);
+ tree->ctl->handle = dp;
+ tree->ctl->magic = RT_RADIX_TREE_MAGIC;
+ LWLockInitialize(&tree->ctl->lock, tranche_id);
+#else
+ tree->ctl = (RT_RADIX_TREE_CONTROL *) palloc0(sizeof(RT_RADIX_TREE_CONTROL));
+
+ /* Create a slab context for each size class */
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ RT_SIZE_CLASS_ELEM size_class = RT_SIZE_CLASS_INFO[i];
+ size_t inner_blocksize = RT_SLAB_BLOCK_SIZE(size_class.inner_size);
+ size_t leaf_blocksize = RT_SLAB_BLOCK_SIZE(size_class.leaf_size);
+
+ tree->inner_slabs[i] = SlabContextCreate(ctx,
+ size_class.name,
+ inner_blocksize,
+ size_class.inner_size);
+ tree->leaf_slabs[i] = SlabContextCreate(ctx,
+ size_class.name,
+ leaf_blocksize,
+ size_class.leaf_size);
+ }
+#endif
+
+ tree->ctl->root = RT_INVALID_PTR_ALLOC;
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return tree;
+}
+
+#ifdef RT_SHMEM
+RT_SCOPE RT_RADIX_TREE *
+RT_ATTACH(dsa_area *dsa, RT_HANDLE handle)
+{
+ RT_RADIX_TREE *tree;
+ dsa_pointer control;
+
+ tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE));
+
+ /* Find the control object in shared memory */
+ control = handle;
+
+ tree->dsa = dsa;
+ tree->ctl = (RT_RADIX_TREE_CONTROL *) dsa_get_address(dsa, control);
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+
+ return tree;
+}
+
+RT_SCOPE void
+RT_DETACH(RT_RADIX_TREE *tree)
+{
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+ pfree(tree);
+}
+
+RT_SCOPE RT_HANDLE
+RT_GET_HANDLE(RT_RADIX_TREE *tree)
+{
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+ return tree->ctl->handle;
+}
+
+/*
+ * Recursively free all nodes allocated to the DSA area.
+ */
+static void
+RT_FREE_RECURSE(RT_RADIX_TREE *tree, RT_PTR_ALLOC ptr)
+{
+ RT_PTR_LOCAL node = RT_PTR_GET_LOCAL(tree, ptr);
+
+ check_stack_depth();
+ CHECK_FOR_INTERRUPTS();
+
+ /* The leaf node doesn't have child pointers */
+ if (RT_NODE_IS_LEAF(node))
+ {
+ dsa_free(tree->dsa, ptr);
+ return;
+ }
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE_INNER_3 *n3 = (RT_NODE_INNER_3 *) node;
+
+ for (int i = 0; i < n3->base.n.count; i++)
+ RT_FREE_RECURSE(tree, n3->children[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE_INNER_32 *n32 = (RT_NODE_INNER_32 *) node;
+
+ for (int i = 0; i < n32->base.n.count; i++)
+ RT_FREE_RECURSE(tree, n32->children[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE_INNER_125 *n125 = (RT_NODE_INNER_125 *) node;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(&n125->base, i))
+ continue;
+
+ RT_FREE_RECURSE(tree, RT_NODE_INNER_125_GET_CHILD(n125, i));
+ }
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE_INNER_256 *n256 = (RT_NODE_INNER_256 *) node;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
+ continue;
+
+ RT_FREE_RECURSE(tree, RT_NODE_INNER_256_GET_CHILD(n256, i));
+ }
+
+ break;
+ }
+ }
+
+ /* Free the inner node */
+ dsa_free(tree->dsa, ptr);
+}
+#endif
+
+/*
+ * Free the given radix tree.
+ */
+RT_SCOPE void
+RT_FREE(RT_RADIX_TREE *tree)
+{
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+
+ /* Free all memory used for radix tree nodes */
+ if (RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ RT_FREE_RECURSE(tree, tree->ctl->root);
+
+ /*
+ * Vandalize the control block to help catch programming errors where
+ * other backends access the memory formerly occupied by this radix tree.
+ */
+ tree->ctl->magic = 0;
+ dsa_free(tree->dsa, tree->ctl->handle);
+#else
+ pfree(tree->ctl);
+
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ MemoryContextDelete(tree->inner_slabs[i]);
+ MemoryContextDelete(tree->leaf_slabs[i]);
+ }
+#endif
+
+ pfree(tree);
+}
+
+/*
+ * Set 'key' to the value pointed to by 'value_p'. If the entry already exists,
+ * update its value and return true; otherwise insert a new entry and return false.
+ */
+RT_SCOPE bool
+RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p)
+{
+ int shift;
+ bool updated;
+ RT_PTR_LOCAL parent;
+ RT_PTR_ALLOC stored_child;
+ RT_PTR_LOCAL child;
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+
+ RT_LOCK_EXCLUSIVE(tree);
+
+ /* Empty tree, create the root */
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ RT_NEW_ROOT(tree, key);
+
+ /* Extend the tree if necessary */
+ if (key > tree->ctl->max_val)
+ RT_EXTEND(tree, key);
+
+ stored_child = tree->ctl->root;
+ parent = RT_PTR_GET_LOCAL(tree, stored_child);
+ shift = parent->shift;
+
+ /* Descend the tree until we reach a leaf node */
+ while (shift >= 0)
+ {
+ RT_PTR_ALLOC new_child = RT_INVALID_PTR_ALLOC;
+
+ child = RT_PTR_GET_LOCAL(tree, stored_child);
+
+ if (RT_NODE_IS_LEAF(child))
+ break;
+
+ if (!RT_NODE_SEARCH_INNER(child, key, &new_child))
+ {
+ RT_SET_EXTEND(tree, key, value_p, parent, stored_child, child);
+ RT_UNLOCK(tree);
+ return false;
+ }
+
+ parent = child;
+ stored_child = new_child;
+ shift -= RT_NODE_SPAN;
+ }
+
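+ /* Arrived at a leaf node; insert or update the value there */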
+ updated = RT_NODE_INSERT_LEAF(tree, parent, stored_child, child, key, value_p);
+
+ /* Update the statistics */
+ if (!updated)
+ tree->ctl->num_keys++;
+
+ RT_UNLOCK(tree);
+ return updated;
+}
+
+/*
+ * Search for the given key in the radix tree. Return true if the key is found,
+ * otherwise return false. On success, the value is set to *value_p, so it must
+ * not be NULL.
+ */
+RT_SCOPE bool
+RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p)
+{
+ RT_PTR_LOCAL node;
+ int shift;
+ bool found;
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+ Assert(value_p != NULL);
+
+ RT_LOCK_SHARED(tree);
+
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root) || key > tree->ctl->max_val)
+ {
+ RT_UNLOCK(tree);
+ return false;
+ }
+
+ node = RT_PTR_GET_LOCAL(tree, tree->ctl->root);
+ shift = node->shift;
+
+ /* Descend the tree until a leaf node */
+ while (shift >= 0)
+ {
+ RT_PTR_ALLOC child = RT_INVALID_PTR_ALLOC;
+
+ if (RT_NODE_IS_LEAF(node))
+ break;
+
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ {
+ RT_UNLOCK(tree);
+ return false;
+ }
+
+ node = RT_PTR_GET_LOCAL(tree, child);
+ shift -= RT_NODE_SPAN;
+ }
+
+ found = RT_NODE_SEARCH_LEAF(node, key, value_p);
+
+ RT_UNLOCK(tree);
+ return found;
+}
+
+#ifdef RT_USE_DELETE
+/*
+ * Delete the given key from the radix tree. Return true if the key is found (and
+ * deleted), otherwise do nothing and return false.
+ */
+RT_SCOPE bool
+RT_DELETE(RT_RADIX_TREE *tree, uint64 key)
+{
+ RT_PTR_LOCAL node;
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_ALLOC stack[RT_MAX_LEVEL] = {0};
+ int shift;
+ int level;
+ bool deleted;
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+
+ RT_LOCK_EXCLUSIVE(tree);
+
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root) || key > tree->ctl->max_val)
+ {
+ RT_UNLOCK(tree);
+ return false;
+ }
+
+ /*
+ * Descend the tree to search the key while building a stack of nodes we
+ * visited.
+ */
+ allocnode = tree->ctl->root;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ shift = node->shift;
+ level = -1;
+ while (shift > 0)
+ {
+ RT_PTR_ALLOC child = RT_INVALID_PTR_ALLOC;
+
+ /* Push the current node to the stack */
+ stack[++level] = allocnode;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ {
+ RT_UNLOCK(tree);
+ return false;
+ }
+
+ allocnode = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ /* Delete the key from the leaf node if it exists */
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ deleted = RT_NODE_DELETE_LEAF(node, key);
+
+ if (!deleted)
+ {
+ /* the key was not found in the leaf node */
+ RT_UNLOCK(tree);
+ return false;
+ }
+
+ /* Found the key to delete. Update the statistics */
+ tree->ctl->num_keys--;
+
+ /*
+ * If the leaf node still has keys, we don't need to delete the node itself,
+ * so we're done.
+ */
+ if (node->count > 0)
+ {
+ RT_UNLOCK(tree);
+ return true;
+ }
+
+ /* Free the empty leaf node */
+ RT_FREE_NODE(tree, allocnode);
+
+ /* Delete the key in inner nodes recursively */
+ while (level >= 0)
+ {
+ allocnode = stack[level--];
+
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ deleted = RT_NODE_DELETE_INNER(node, key);
+ Assert(deleted);
+
+ /* If the node didn't become empty, stop propagating the deletion */
+ if (node->count > 0)
+ break;
+
+ /* The node became empty */
+ RT_FREE_NODE(tree, allocnode);
+ }
+
+ RT_UNLOCK(tree);
+ return true;
+}
+#endif
+
+static inline void
+RT_ITER_UPDATE_KEY(RT_ITER *iter, uint8 chunk, uint8 shift)
+{
+ iter->key &= ~(((uint64) RT_CHUNK_MASK) << shift);
+ iter->key |= (((uint64) chunk) << shift);
+}
+
+/*
+ * Advance the slot in the inner node. Return the child if one exists,
+ * otherwise NULL.
+ */
+static inline RT_PTR_LOCAL
+RT_NODE_INNER_ITERATE_NEXT(RT_ITER *iter, RT_NODE_ITER *node_iter)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_iter_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/*
+ * Advance the slot in the leaf node. On success, return true and set the
+ * value to *value_p; otherwise return false.
+ */
+static inline bool
+RT_NODE_LEAF_ITERATE_NEXT(RT_ITER *iter, RT_NODE_ITER *node_iter,
+ RT_VALUE_TYPE *value_p)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_iter_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
+/*
+ * Update each node_iter for inner nodes in the iterator node stack.
+ */
+static void
+RT_UPDATE_ITER_STACK(RT_ITER *iter, RT_PTR_LOCAL from_node, int from)
+{
+ int level = from;
+ RT_PTR_LOCAL node = from_node;
+
+ for (;;)
+ {
+ RT_NODE_ITER *node_iter = &(iter->stack[level--]);
+
+ node_iter->node = node;
+ node_iter->current_idx = -1;
+
+ /* We don't advance the leaf node iterator here */
+ if (RT_NODE_IS_LEAF(node))
+ return;
+
+ /* Advance to the next slot in the inner node */
+ node = RT_NODE_INNER_ITERATE_NEXT(iter, node_iter);
+
+ /* We must find the first child in the node */
+ Assert(node);
+ }
+}
+
+/*
+ * Create and return the iterator for the given radix tree.
+ *
+ * The radix tree is locked in shared mode during the iteration, so
+ * RT_END_ITERATE needs to be called when finished to release the lock.
+ */
+RT_SCOPE RT_ITER *
+RT_BEGIN_ITERATE(RT_RADIX_TREE *tree)
+{
+ MemoryContext old_ctx;
+ RT_ITER *iter;
+ RT_PTR_LOCAL root;
+ int top_level;
+
+ old_ctx = MemoryContextSwitchTo(tree->context);
+
+ iter = (RT_ITER *) palloc0(sizeof(RT_ITER));
+ iter->tree = tree;
+
+ RT_LOCK_SHARED(tree);
+
+ /* empty tree */
+ if (!iter->tree->ctl->root)
+ {
+ MemoryContextSwitchTo(old_ctx);
+ return iter;
+ }
+
+ root = RT_PTR_GET_LOCAL(tree, iter->tree->ctl->root);
+ top_level = root->shift / RT_NODE_SPAN;
+ iter->stack_len = top_level;
+
+ /*
+ * Descend from the root to the leftmost leaf node. The key is constructed
+ * while descending to the leaf.
+ */
+ RT_UPDATE_ITER_STACK(iter, root, top_level);
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return iter;
+}
+
+/*
+ * Return true, setting *key_p and *value_p, if there is a next key; otherwise
+ * return false.
+ */
+RT_SCOPE bool
+RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, RT_VALUE_TYPE *value_p)
+{
+ /* Empty tree */
+ if (!iter->tree->ctl->root)
+ return false;
+
+ for (;;)
+ {
+ RT_PTR_LOCAL child = NULL;
+ RT_VALUE_TYPE value;
+ int level;
+ bool found;
+
+ /* Advance the leaf node iterator to get next key-value pair */
+ found = RT_NODE_LEAF_ITERATE_NEXT(iter, &(iter->stack[0]), &value);
+
+ if (found)
+ {
+ *key_p = iter->key;
+ *value_p = value;
+ return true;
+ }
+
+ /*
+ * We've visited all values in the leaf node, so advance inner node
+ * iterators from level 1 upward until we find the next child node.
+ */
+ for (level = 1; level <= iter->stack_len; level++)
+ {
+ child = RT_NODE_INNER_ITERATE_NEXT(iter, &(iter->stack[level]));
+
+ if (child)
+ break;
+ }
+
+ /* the iteration finished */
+ if (!child)
+ return false;
+
+ /*
+ * Set the node in the node iterator and update the iterator stack from
+ * this node downward.
+ */
+ RT_UPDATE_ITER_STACK(iter, child, level - 1);
+
+ /* Node iterators are updated, so try again from the leaf */
+ }
+
+ return false;
+}
+
+/*
+ * Terminate the iteration and release the lock.
+ *
+ * This function must be called when the iteration is finished, or when
+ * bailing out of one early.
+ */
+RT_SCOPE void
+RT_END_ITERATE(RT_ITER *iter)
+{
+#ifdef RT_SHMEM
+ Assert(LWLockHeldByMe(&iter->tree->ctl->lock));
+#endif
+
+ RT_UNLOCK(iter->tree);
+ pfree(iter);
+}
+
+/*
+ * Return the amount of memory used by the radix tree.
+ */
+RT_SCOPE uint64
+RT_MEMORY_USAGE(RT_RADIX_TREE *tree)
+{
+ Size total = 0;
+
+ RT_LOCK_SHARED(tree);
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+ total = dsa_get_total_size(tree->dsa);
+#else
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
+ total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
+ }
+#endif
+
+ RT_UNLOCK(tree);
+ return total;
+}
+
+/*
+ * Verify the radix tree node.
+ */
+static void
+RT_VERIFY_NODE(RT_PTR_LOCAL node)
+{
+#ifdef USE_ASSERT_CHECKING
+ Assert(node->count >= 0);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE_BASE_3 *n3 = (RT_NODE_BASE_3 *) node;
+
+ for (int i = 1; i < n3->n.count; i++)
+ Assert(n3->chunks[i - 1] < n3->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE_BASE_32 *n32 = (RT_NODE_BASE_32 *) node;
+
+ for (int i = 1; i < n32->n.count; i++)
+ Assert(n32->chunks[i - 1] < n32->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE_BASE_125 *n125 = (RT_NODE_BASE_125 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ uint8 slot = n125->slot_idxs[i];
+ int idx = RT_BM_IDX(slot);
+ int bitnum = RT_BM_BIT(slot);
+
+ if (!RT_NODE_125_IS_CHUNK_USED(n125, i))
+ continue;
+
+ /* Check if the corresponding slot is used */
+ Assert(slot < node->fanout);
+ Assert((n125->isset[idx] & ((bitmapword) 1 << bitnum)) != 0);
+
+ cnt++;
+ }
+
+ Assert(n125->n.count == cnt);
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_256 *n256 = (RT_NODE_LEAF_256 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RT_BM_IDX(RT_NODE_MAX_SLOTS); i++)
+ cnt += bmw_popcount(n256->isset[i]);
+
+ /* Check that the number of used chunks matches */
+ Assert(n256->base.n.count == cnt);
+
+ break;
+ }
+ }
+ }
+#endif
+}
+
+/***************** DEBUG FUNCTIONS *****************/
+#ifdef RT_DEBUG
+
+#define RT_UINT64_FORMAT_HEX "%" INT64_MODIFIER "X"
+
+RT_SCOPE void
+RT_STATS(RT_RADIX_TREE *tree)
+{
+ RT_LOCK_SHARED(tree);
+
+ fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->ctl->max_val);
+ fprintf(stderr, "num_keys = " UINT64_FORMAT "\n", tree->ctl->num_keys);
+
+#ifdef RT_SHMEM
+ fprintf(stderr, "handle = " UINT64_FORMAT "\n", tree->ctl->handle);
+#endif
+
+ if (RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ {
+ RT_PTR_LOCAL root = RT_PTR_GET_LOCAL(tree, tree->ctl->root);
+
+ fprintf(stderr, "height = %d, n3 = %u, n32_min = %u, n32_max = %u, n125 = %u, n256 = %u\n",
+ root->shift / RT_NODE_SPAN,
+ tree->ctl->cnt[RT_CLASS_3],
+ tree->ctl->cnt[RT_CLASS_32_MIN],
+ tree->ctl->cnt[RT_CLASS_32_MAX],
+ tree->ctl->cnt[RT_CLASS_125],
+ tree->ctl->cnt[RT_CLASS_256]);
+ }
+
+ RT_UNLOCK(tree);
+}
+
+static void
+RT_DUMP_NODE(RT_RADIX_TREE *tree, RT_PTR_ALLOC allocnode, int level,
+ bool recurse, StringInfo buf)
+{
+ RT_PTR_LOCAL node = RT_PTR_GET_LOCAL(tree, allocnode);
+ StringInfoData spaces;
+
+ initStringInfo(&spaces);
+ appendStringInfoSpaces(&spaces, (level * 4) + 1);
+
+ appendStringInfo(buf, "%s%s[%s] kind %d, fanout %d, count %u, shift %u:\n",
+ spaces.data,
+ level == 0 ? "" : "-> ",
+ RT_NODE_IS_LEAF(node) ? "LEAF" : "INNR",
+ (node->kind == RT_NODE_KIND_3) ? 3 :
+ (node->kind == RT_NODE_KIND_32) ? 32 :
+ (node->kind == RT_NODE_KIND_125) ? 125 : 256,
+ node->fanout == 0 ? 256 : node->fanout,
+ node->count, node->shift);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_3 *n3 = (RT_NODE_LEAF_3 *) node;
+
+ appendStringInfo(buf, "%schunk[%d] 0x%X\n",
+ spaces.data, i, n3->base.chunks[i]);
+ }
+ else
+ {
+ RT_NODE_INNER_3 *n3 = (RT_NODE_INNER_3 *) node;
+
+ appendStringInfo(buf, "%schunk[%d] 0x%X",
+ spaces.data, i, n3->base.chunks[i]);
+
+ if (recurse)
+ {
+ appendStringInfo(buf, "\n");
+ RT_DUMP_NODE(tree, n3->children[i], level + 1,
+ recurse, buf);
+ }
+ else
+ appendStringInfo(buf, " (skipped)\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_32 *n32 = (RT_NODE_LEAF_32 *) node;
+
+ appendStringInfo(buf, "%schunk[%d] 0x%X\n",
+ spaces.data, i, n32->base.chunks[i]);
+ }
+ else
+ {
+ RT_NODE_INNER_32 *n32 = (RT_NODE_INNER_32 *) node;
+
+ appendStringInfo(buf, "%schunk[%d] 0x%X",
+ spaces.data, i, n32->base.chunks[i]);
+
+ if (recurse)
+ {
+ appendStringInfo(buf, "\n");
+ RT_DUMP_NODE(tree, n32->children[i], level + 1,
+ recurse, buf);
+ }
+ else
+ appendStringInfo(buf, " (skipped)\n");
+
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE_BASE_125 *b125 = (RT_NODE_BASE_125 *) node;
+ char *sep = "";
+
+ appendStringInfo(buf, "%sslot_idxs: ", spaces.data);
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(b125, i))
+ continue;
+
+ appendStringInfo(buf, "%s[%d]=%d ",
+ sep, i, b125->slot_idxs[i]);
+ sep = ",";
+ }
+
+ appendStringInfo(buf, "\n%sisset-bitmap: ", spaces.data);
+ for (int i = 0; i < (RT_SLOT_IDX_LIMIT / BITS_PER_BYTE); i++)
+ appendStringInfo(buf, "%X ", ((uint8 *) b125->isset)[i]);
+ appendStringInfo(buf, "\n");
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(b125, i))
+ continue;
+
+ if (RT_NODE_IS_LEAF(node))
+ appendStringInfo(buf, "%schunk 0x%X\n",
+ spaces.data, i);
+ else
+ {
+ RT_NODE_INNER_125 *n125 = (RT_NODE_INNER_125 *) b125;
+
+ appendStringInfo(buf, "%schunk 0x%X",
+ spaces.data, i);
+
+ if (recurse)
+ {
+ appendStringInfo(buf, "\n");
+ RT_DUMP_NODE(tree, RT_NODE_INNER_125_GET_CHILD(n125, i),
+ level + 1, recurse, buf);
+ }
+ else
+ appendStringInfo(buf, " (skipped)\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_256 *n256 = (RT_NODE_LEAF_256 *) node;
+
+ appendStringInfo(buf, "%sisset-bitmap: ", spaces.data);
+ for (int i = 0; i < (RT_SLOT_IDX_LIMIT / BITS_PER_BYTE); i++)
+ appendStringInfo(buf, "%X ", ((uint8 *) n256->isset)[i]);
+ appendStringInfo(buf, "\n");
+ }
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_256 *n256 = (RT_NODE_LEAF_256 *) node;
+
+ if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, i))
+ continue;
+
+ appendStringInfo(buf, "%schunk 0x%X\n",
+ spaces.data, i);
+ }
+ else
+ {
+ RT_NODE_INNER_256 *n256 = (RT_NODE_INNER_256 *) node;
+
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
+ continue;
+
+ appendStringInfo(buf, "%schunk 0x%X",
+ spaces.data, i);
+
+ if (recurse)
+ {
+ appendStringInfo(buf, "\n");
+ RT_DUMP_NODE(tree, RT_NODE_INNER_256_GET_CHILD(n256, i),
+ level + 1, recurse, buf);
+ }
+ else
+ appendStringInfo(buf, " (skipped)\n");
+ }
+ }
+ break;
+ }
+ }
+}
+
+RT_SCOPE void
+RT_DUMP_SEARCH(RT_RADIX_TREE *tree, uint64 key)
+{
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL node;
+ StringInfoData buf;
+ int shift;
+ int level = 0;
+
+ RT_STATS(tree);
+
+ RT_LOCK_SHARED(tree);
+
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ {
+ RT_UNLOCK(tree);
+ fprintf(stderr, "empty tree\n");
+ return;
+ }
+
+ if (key > tree->ctl->max_val)
+ {
+ RT_UNLOCK(tree);
+ fprintf(stderr, "key " UINT64_FORMAT "(0x" RT_UINT64_FORMAT_HEX ") is larger than max val\n",
+ key, key);
+ return;
+ }
+
+ initStringInfo(&buf);
+ allocnode = tree->ctl->root;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ shift = node->shift;
+ while (shift >= 0)
+ {
+ RT_PTR_ALLOC child;
+
+ RT_DUMP_NODE(tree, allocnode, level, false, &buf);
+
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_VALUE_TYPE dummy;
+
+ /* We reached a leaf node; find the corresponding slot */
+ RT_NODE_SEARCH_LEAF(node, key, &dummy);
+
+ break;
+ }
+
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ break;
+
+ allocnode = child;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ shift -= RT_NODE_SPAN;
+ level++;
+ }
+ RT_UNLOCK(tree);
+
+ fprintf(stderr, "%s", buf.data);
+}
+
+RT_SCOPE void
+RT_DUMP(RT_RADIX_TREE *tree)
+{
+ StringInfoData buf;
+
+ RT_STATS(tree);
+
+ RT_LOCK_SHARED(tree);
+
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ {
+ RT_UNLOCK(tree);
+ fprintf(stderr, "empty tree\n");
+ return;
+ }
+
+ initStringInfo(&buf);
+
+ RT_DUMP_NODE(tree, tree->ctl->root, 0, true, &buf);
+ RT_UNLOCK(tree);
+
+ fprintf(stderr, "%s", buf.data);
+}
+#endif
+
+#endif /* RT_DEFINE */
+
+
+/* undefine external parameters, so next radix tree can be defined */
+#undef RT_PREFIX
+#undef RT_SCOPE
+#undef RT_DECLARE
+#undef RT_DEFINE
+#undef RT_VALUE_TYPE
+
+/* locally declared macros */
+#undef RT_MAKE_PREFIX
+#undef RT_MAKE_NAME
+#undef RT_MAKE_NAME_
+#undef RT_NODE_SPAN
+#undef RT_NODE_MAX_SLOTS
+#undef RT_CHUNK_MASK
+#undef RT_MAX_SHIFT
+#undef RT_MAX_LEVEL
+#undef RT_GET_KEY_CHUNK
+#undef RT_BM_IDX
+#undef RT_BM_BIT
+#undef RT_LOCK_EXCLUSIVE
+#undef RT_LOCK_SHARED
+#undef RT_UNLOCK
+#undef RT_NODE_IS_LEAF
+#undef RT_NODE_MUST_GROW
+#undef RT_NODE_KIND_COUNT
+#undef RT_SIZE_CLASS_COUNT
+#undef RT_SLOT_IDX_LIMIT
+#undef RT_INVALID_SLOT_IDX
+#undef RT_SLAB_BLOCK_SIZE
+#undef RT_RADIX_TREE_MAGIC
+#undef RT_UINT64_FORMAT_HEX
+
+/* type declarations */
+#undef RT_RADIX_TREE
+#undef RT_RADIX_TREE_CONTROL
+#undef RT_PTR_LOCAL
+#undef RT_PTR_ALLOC
+#undef RT_INVALID_PTR_ALLOC
+#undef RT_HANDLE
+#undef RT_ITER
+#undef RT_NODE
+#undef RT_NODE_ITER
+#undef RT_NODE_KIND_3
+#undef RT_NODE_KIND_32
+#undef RT_NODE_KIND_125
+#undef RT_NODE_KIND_256
+#undef RT_NODE_BASE_3
+#undef RT_NODE_BASE_32
+#undef RT_NODE_BASE_125
+#undef RT_NODE_BASE_256
+#undef RT_NODE_INNER_3
+#undef RT_NODE_INNER_32
+#undef RT_NODE_INNER_125
+#undef RT_NODE_INNER_256
+#undef RT_NODE_LEAF_3
+#undef RT_NODE_LEAF_32
+#undef RT_NODE_LEAF_125
+#undef RT_NODE_LEAF_256
+#undef RT_SIZE_CLASS
+#undef RT_SIZE_CLASS_ELEM
+#undef RT_SIZE_CLASS_INFO
+#undef RT_CLASS_3
+#undef RT_CLASS_32_MIN
+#undef RT_CLASS_32_MAX
+#undef RT_CLASS_125
+#undef RT_CLASS_256
+
+/* function declarations */
+#undef RT_CREATE
+#undef RT_FREE
+#undef RT_ATTACH
+#undef RT_DETACH
+#undef RT_GET_HANDLE
+#undef RT_SEARCH
+#undef RT_SET
+#undef RT_BEGIN_ITERATE
+#undef RT_ITERATE_NEXT
+#undef RT_END_ITERATE
+#undef RT_USE_DELETE
+#undef RT_DELETE
+#undef RT_MEMORY_USAGE
+#undef RT_DUMP
+#undef RT_DUMP_NODE
+#undef RT_DUMP_SEARCH
+#undef RT_STATS
+
+/* internal helper functions */
+#undef RT_NEW_ROOT
+#undef RT_ALLOC_NODE
+#undef RT_INIT_NODE
+#undef RT_FREE_NODE
+#undef RT_FREE_RECURSE
+#undef RT_EXTEND
+#undef RT_SET_EXTEND
+#undef RT_SWITCH_NODE_KIND
+#undef RT_COPY_NODE
+#undef RT_REPLACE_NODE
+#undef RT_PTR_GET_LOCAL
+#undef RT_PTR_ALLOC_IS_VALID
+#undef RT_NODE_3_SEARCH_EQ
+#undef RT_NODE_32_SEARCH_EQ
+#undef RT_NODE_3_GET_INSERTPOS
+#undef RT_NODE_32_GET_INSERTPOS
+#undef RT_CHUNK_CHILDREN_ARRAY_SHIFT
+#undef RT_CHUNK_VALUES_ARRAY_SHIFT
+#undef RT_CHUNK_CHILDREN_ARRAY_DELETE
+#undef RT_CHUNK_VALUES_ARRAY_DELETE
+#undef RT_CHUNK_CHILDREN_ARRAY_COPY
+#undef RT_CHUNK_VALUES_ARRAY_COPY
+#undef RT_NODE_125_IS_CHUNK_USED
+#undef RT_NODE_INNER_125_GET_CHILD
+#undef RT_NODE_LEAF_125_GET_VALUE
+#undef RT_NODE_INNER_256_IS_CHUNK_USED
+#undef RT_NODE_LEAF_256_IS_CHUNK_USED
+#undef RT_NODE_INNER_256_GET_CHILD
+#undef RT_NODE_LEAF_256_GET_VALUE
+#undef RT_NODE_INNER_256_SET
+#undef RT_NODE_LEAF_256_SET
+#undef RT_NODE_INNER_256_DELETE
+#undef RT_NODE_LEAF_256_DELETE
+#undef RT_KEY_GET_SHIFT
+#undef RT_SHIFT_GET_MAX_VAL
+#undef RT_NODE_SEARCH_INNER
+#undef RT_NODE_SEARCH_LEAF
+#undef RT_NODE_UPDATE_INNER
+#undef RT_NODE_DELETE_INNER
+#undef RT_NODE_DELETE_LEAF
+#undef RT_NODE_INSERT_INNER
+#undef RT_NODE_INSERT_LEAF
+#undef RT_NODE_INNER_ITERATE_NEXT
+#undef RT_NODE_LEAF_ITERATE_NEXT
+#undef RT_UPDATE_ITER_STACK
+#undef RT_ITER_UPDATE_KEY
+#undef RT_VERIFY_NODE
+
+#undef RT_DEBUG
diff --git a/src/include/lib/radixtree_delete_impl.h b/src/include/lib/radixtree_delete_impl.h
new file mode 100644
index 0000000000..5f6dda1f12
--- /dev/null
+++ b/src/include/lib/radixtree_delete_impl.h
@@ -0,0 +1,122 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree_delete_impl.h
+ * Common implementation for deletion in leaf and inner nodes.
+ *
+ * Note: There is deliberately no #include guard here
+ *
+ * TODO: Shrink nodes when deletion would allow them to fit in a smaller
+ * size class.
+ *
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * src/include/lib/radixtree_delete_impl.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE3_TYPE RT_NODE_INNER_3
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE3_TYPE RT_NODE_LEAF_3
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
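+ /*
+ * Look up the key's chunk in this node and remove it if present; return
+ * false if the chunk is not found. The caller is responsible for freeing
+ * the node if it becomes empty.
+ */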
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+
+#ifdef RT_NODE_LEVEL_LEAF
+ Assert(RT_NODE_IS_LEAF(node));
+#else
+ Assert(!RT_NODE_IS_LEAF(node));
+#endif
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node;
+ int idx = RT_NODE_3_SEARCH_EQ((RT_NODE_BASE_3 *) n3, chunk);
+
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_DELETE(n3->base.chunks, n3->values,
+ n3->base.n.count, idx);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_DELETE(n3->base.chunks, n3->children,
+ n3->base.n.count, idx);
+#endif
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
+ int idx = RT_NODE_32_SEARCH_EQ((RT_NODE_BASE_32 *) n32, chunk);
+
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_DELETE(n32->base.chunks, n32->values,
+ n32->base.n.count, idx);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_DELETE(n32->base.chunks, n32->children,
+ n32->base.n.count, idx);
+#endif
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+ int slotpos = n125->base.slot_idxs[chunk];
+ int idx;
+ int bitnum;
+
+ if (slotpos == RT_INVALID_SLOT_IDX)
+ return false;
+
+ idx = RT_BM_IDX(slotpos);
+ bitnum = RT_BM_BIT(slotpos);
+ n125->base.isset[idx] &= ~((bitmapword) 1 << bitnum);
+ n125->base.slot_idxs[chunk] = RT_INVALID_SLOT_IDX;
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk))
+#else
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, chunk))
+#endif
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_NODE_LEAF_256_DELETE(n256, chunk);
+#else
+ RT_NODE_INNER_256_DELETE(n256, chunk);
+#endif
+ break;
+ }
+ }
+
+ /* update statistics */
+ node->count--;
+
+ return true;
+
+#undef RT_NODE3_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_insert_impl.h b/src/include/lib/radixtree_insert_impl.h
new file mode 100644
index 0000000000..d56e58dcac
--- /dev/null
+++ b/src/include/lib/radixtree_insert_impl.h
@@ -0,0 +1,328 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree_insert_impl.h
+ * Common implementation for insertion in leaf and inner nodes.
+ *
+ * Note: There is deliberately no #include guard here
+ *
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * src/include/lib/radixtree_insert_impl.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE3_TYPE RT_NODE_INNER_3
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE3_TYPE RT_NODE_LEAF_3
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+
+#ifdef RT_NODE_LEVEL_LEAF
+ const bool is_leaf = true;
+ bool chunk_exists = false;
+ Assert(RT_NODE_IS_LEAF(node));
+#else
+ const bool is_leaf = false;
+ Assert(!RT_NODE_IS_LEAF(node));
+#endif
+
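+ /*
+ * Each case below either inserts into the node if it has room, or grows
+ * the node to the next kind or size class and falls through to insert
+ * into the new node.
+ */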
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ int idx = RT_NODE_3_SEARCH_EQ(&n3->base, chunk);
+
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n3->values[idx] = *value_p;
+ break;
+ }
+#endif
+ if (unlikely(RT_NODE_MUST_GROW(n3)))
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+ RT_NODE32_TYPE *new32;
+ const uint8 new_kind = RT_NODE_KIND_32;
+ const RT_SIZE_CLASS new_class = RT_CLASS_32_MIN;
+
+ /* grow node from 3 to 32 */
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
+ newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, is_leaf);
+ new32 = (RT_NODE32_TYPE *) newnode;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_COPY(n3->base.chunks, n3->values,
+ new32->base.chunks, new32->values);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_COPY(n3->base.chunks, n3->children,
+ new32->base.chunks, new32->children);
+#endif
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
+ node = newnode;
+ }
+ else
+ {
+ int insertpos = RT_NODE_3_GET_INSERTPOS(&n3->base, chunk);
+ int count = n3->base.n.count;
+
+ /* shift chunks and children */
+ if (insertpos < count)
+ {
+ Assert(count > 0);
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_SHIFT(n3->base.chunks, n3->values,
+ count, insertpos);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_SHIFT(n3->base.chunks, n3->children,
+ count, insertpos);
+#endif
+ }
+
+ n3->base.chunks[insertpos] = chunk;
+#ifdef RT_NODE_LEVEL_LEAF
+ n3->values[insertpos] = *value_p;
+#else
+ n3->children[insertpos] = child;
+#endif
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_32:
+ {
+ const RT_SIZE_CLASS_ELEM class32_max = RT_SIZE_CLASS_INFO[RT_CLASS_32_MAX];
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ int idx = RT_NODE_32_SEARCH_EQ(&n32->base, chunk);
+
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n32->values[idx] = *value_p;
+ break;
+ }
+#endif
+ if (unlikely(RT_NODE_MUST_GROW(n32)) &&
+ n32->base.n.fanout < class32_max.fanout)
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+ const RT_SIZE_CLASS_ELEM class32_min = RT_SIZE_CLASS_INFO[RT_CLASS_32_MIN];
+ const RT_SIZE_CLASS new_class = RT_CLASS_32_MAX;
+
+ Assert(n32->base.n.fanout == class32_min.fanout);
+
+ /* grow to the next size class of this kind */
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ n32 = (RT_NODE32_TYPE *) newnode;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ memcpy(newnode, node, class32_min.leaf_size);
+#else
+ memcpy(newnode, node, class32_min.inner_size);
+#endif
+ newnode->fanout = class32_max.fanout;
+
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
+ node = newnode;
+ }
+
+ if (unlikely(RT_NODE_MUST_GROW(n32)))
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+ RT_NODE125_TYPE *new125;
+ const uint8 new_kind = RT_NODE_KIND_125;
+ const RT_SIZE_CLASS new_class = RT_CLASS_125;
+
+ Assert(n32->base.n.fanout == class32_max.fanout);
+
+ /* grow node from 32 to 125 */
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
+ newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, is_leaf);
+ new125 = (RT_NODE125_TYPE *) newnode;
+
+ for (int i = 0; i < class32_max.fanout; i++)
+ {
+ new125->base.slot_idxs[n32->base.chunks[i]] = i;
+#ifdef RT_NODE_LEVEL_LEAF
+ new125->values[i] = n32->values[i];
+#else
+ new125->children[i] = n32->children[i];
+#endif
+ }
+
+ /*
+ * Since we just copied a dense array, we can set the bits
+ * using a single store, provided the length of that array
+ * is at most the number of bits in a bitmapword.
+ */
+ Assert(class32_max.fanout <= sizeof(bitmapword) * BITS_PER_BYTE);
+ new125->base.isset[0] = (bitmapword) (((uint64) 1 << class32_max.fanout) - 1);
+
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
+ node = newnode;
+ }
+ else
+ {
+ int insertpos = RT_NODE_32_GET_INSERTPOS(&n32->base, chunk);
+ int count = n32->base.n.count;
+
+ if (insertpos < count)
+ {
+ Assert(count > 0);
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_SHIFT(n32->base.chunks, n32->values,
+ count, insertpos);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_SHIFT(n32->base.chunks, n32->children,
+ count, insertpos);
+#endif
+ }
+
+ n32->base.chunks[insertpos] = chunk;
+#ifdef RT_NODE_LEVEL_LEAF
+ n32->values[insertpos] = *value_p;
+#else
+ n32->children[insertpos] = child;
+#endif
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+ int slotpos;
+ int cnt = 0;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ slotpos = n125->base.slot_idxs[chunk];
+ if (slotpos != RT_INVALID_SLOT_IDX)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n125->values[slotpos] = *value_p;
+ break;
+ }
+#endif
+ if (unlikely(RT_NODE_MUST_GROW(n125)))
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+ RT_NODE256_TYPE *new256;
+ const uint8 new_kind = RT_NODE_KIND_256;
+ const RT_SIZE_CLASS new_class = RT_CLASS_256;
+
+ /* grow node from 125 to 256 */
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
+ newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, is_leaf);
+ new256 = (RT_NODE256_TYPE *) newnode;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n125->base.n.count; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(&n125->base, i))
+ continue;
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_NODE_LEAF_256_SET(new256, i, RT_NODE_LEAF_125_GET_VALUE(n125, i));
+#else
+ RT_NODE_INNER_256_SET(new256, i, RT_NODE_INNER_125_GET_CHILD(n125, i));
+#endif
+ cnt++;
+ }
+
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
+ node = newnode;
+ }
+ else
+ {
+ int idx;
+ bitmapword inverse;
+
+ /* get the first word with at least one bit not set */
+ for (idx = 0; idx < RT_BM_IDX(RT_SLOT_IDX_LIMIT); idx++)
+ {
+ if (n125->base.isset[idx] < ~((bitmapword) 0))
+ break;
+ }
+
+ /* To get the first unset bit in X, get the first set bit in ~X */
+ inverse = ~(n125->base.isset[idx]);
+ slotpos = idx * BITS_PER_BITMAPWORD;
+ slotpos += bmw_rightmost_one_pos(inverse);
+ Assert(slotpos < node->fanout);
+
+ /* mark the slot used */
+ n125->base.isset[idx] |= bmw_rightmost_one(inverse);
+ n125->base.slot_idxs[chunk] = slotpos;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ n125->values[slotpos] = *value_p;
+#else
+ n125->children[slotpos] = child;
+#endif
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ chunk_exists = RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk);
+ Assert(chunk_exists || node->count < RT_NODE_MAX_SLOTS);
+ RT_NODE_LEAF_256_SET(n256, chunk, *value_p);
+#else
+ Assert(node->count < RT_NODE_MAX_SLOTS);
+ RT_NODE_INNER_256_SET(n256, chunk, child);
+#endif
+ break;
+ }
+ }
+
+ /* Update statistics */
+#ifdef RT_NODE_LEVEL_LEAF
+ if (!chunk_exists)
+ node->count++;
+#else
+ node->count++;
+#endif
+
+ /*
+ * Done. Finally, verify that the chunk and value have been inserted or
+ * replaced properly in the node.
+ */
+ RT_VERIFY_NODE(node);
+
+#ifdef RT_NODE_LEVEL_LEAF
+ return chunk_exists;
+#else
+ return;
+#endif
+
+#undef RT_NODE3_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_iter_impl.h b/src/include/lib/radixtree_iter_impl.h
new file mode 100644
index 0000000000..98c78eb237
--- /dev/null
+++ b/src/include/lib/radixtree_iter_impl.h
@@ -0,0 +1,153 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree_iter_impl.h
+ * Common implementation for iteration in leaf and inner nodes.
+ *
+ * Note: There is deliberately no #include guard here
+ *
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * src/include/lib/radixtree_iter_impl.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE3_TYPE RT_NODE_INNER_3
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE3_TYPE RT_NODE_LEAF_3
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
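+ /*
+ * Advance 'current_idx' to the next used slot in this node. For the
+ * array-based kinds (3 and 32) it is an index into the chunks array;
+ * for kinds 125 and 256 it is the chunk value itself.
+ */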
+ bool found = false;
+ uint8 key_chunk;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_VALUE_TYPE value;
+
+ Assert(RT_NODE_IS_LEAF(node_iter->node));
+#else
+ RT_PTR_LOCAL child = NULL;
+
+ Assert(!RT_NODE_IS_LEAF(node_iter->node));
+#endif
+
+#ifdef RT_SHMEM
+ Assert(iter->tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+
+ switch (node_iter->node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n3->base.n.count)
+ break;
+#ifdef RT_NODE_LEVEL_LEAF
+ value = n3->values[node_iter->current_idx];
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, n3->children[node_iter->current_idx]);
+#endif
+ key_chunk = n3->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n32->base.n.count)
+ break;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ value = n32->values[node_iter->current_idx];
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, n32->children[node_iter->current_idx]);
+#endif
+ key_chunk = n32->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (RT_NODE_125_IS_CHUNK_USED((RT_NODE_BASE_125 *) n125, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+#ifdef RT_NODE_LEVEL_LEAF
+ value = RT_NODE_LEAF_125_GET_VALUE(n125, i);
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, RT_NODE_INNER_125_GET_CHILD(n125, i));
+#endif
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+#ifdef RT_NODE_LEVEL_LEAF
+ if (RT_NODE_LEAF_256_IS_CHUNK_USED(n256, i))
+#else
+ if (RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
+#endif
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+#ifdef RT_NODE_LEVEL_LEAF
+ value = RT_NODE_LEAF_256_GET_VALUE(n256, i);
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, RT_NODE_INNER_256_GET_CHILD(n256, i));
+#endif
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ }
+
+ if (found)
+ {
+ RT_ITER_UPDATE_KEY(iter, key_chunk, node_iter->node->shift);
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = value;
+#endif
+ }
+
+#ifdef RT_NODE_LEVEL_LEAF
+ return found;
+#else
+ return child;
+#endif
+
+#undef RT_NODE3_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_search_impl.h b/src/include/lib/radixtree_search_impl.h
new file mode 100644
index 0000000000..a8925c75d0
--- /dev/null
+++ b/src/include/lib/radixtree_search_impl.h
@@ -0,0 +1,138 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree_search_impl.h
+ * Common implementation for search in leaf and inner nodes, plus
+ * update for inner nodes only.
+ *
+ * Note: There is deliberately no #include guard here
+ *
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * src/include/lib/radixtree_search_impl.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE3_TYPE RT_NODE_INNER_3
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE3_TYPE RT_NODE_LEAF_3
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
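+ /*
+ * When RT_ACTION_UPDATE is defined, replace the existing child pointer
+ * with 'new_child' (inner nodes only) instead of returning the found
+ * child or value.
+ */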
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+
+#ifdef RT_NODE_LEVEL_LEAF
+ Assert(value_p != NULL);
+ Assert(RT_NODE_IS_LEAF(node));
+#else
+#ifndef RT_ACTION_UPDATE
+ Assert(child_p != NULL);
+#endif
+ Assert(!RT_NODE_IS_LEAF(node));
+#endif
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node;
+ int idx = RT_NODE_3_SEARCH_EQ((RT_NODE_BASE_3 *) n3, chunk);
+
+#ifdef RT_ACTION_UPDATE
+ Assert(idx >= 0);
+ n3->children[idx] = new_child;
+#else
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = n3->values[idx];
+#else
+ *child_p = n3->children[idx];
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
+ int idx = RT_NODE_32_SEARCH_EQ((RT_NODE_BASE_32 *) n32, chunk);
+
+#ifdef RT_ACTION_UPDATE
+ Assert(idx >= 0);
+ n32->children[idx] = new_child;
+#else
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = n32->values[idx];
+#else
+ *child_p = n32->children[idx];
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+ int slotpos = n125->base.slot_idxs[chunk];
+
+#ifdef RT_ACTION_UPDATE
+ Assert(slotpos != RT_INVALID_SLOT_IDX);
+ n125->children[slotpos] = new_child;
+#else
+ if (slotpos == RT_INVALID_SLOT_IDX)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = RT_NODE_LEAF_125_GET_VALUE(n125, chunk);
+#else
+ *child_p = RT_NODE_INNER_125_GET_CHILD(n125, chunk);
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
+
+#ifdef RT_ACTION_UPDATE
+ RT_NODE_INNER_256_SET(n256, chunk, new_child);
+#else
+#ifdef RT_NODE_LEVEL_LEAF
+ if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk))
+#else
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, chunk))
+#endif
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = RT_NODE_LEAF_256_GET_VALUE(n256, chunk);
+#else
+ *child_p = RT_NODE_INNER_256_GET_CHILD(n256, chunk);
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ }
+
+#ifdef RT_ACTION_UPDATE
+ return;
+#else
+ return true;
+#endif /* RT_ACTION_UPDATE */
+
+#undef RT_NODE3_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/utils/dsa.h b/src/include/utils/dsa.h
index 3ce4ee300a..2af215484f 100644
--- a/src/include/utils/dsa.h
+++ b/src/include/utils/dsa.h
@@ -121,6 +121,7 @@ extern dsa_handle dsa_get_handle(dsa_area *area);
extern dsa_pointer dsa_allocate_extended(dsa_area *area, size_t size, int flags);
extern void dsa_free(dsa_area *area, dsa_pointer dp);
extern void *dsa_get_address(dsa_area *area, dsa_pointer dp);
+extern size_t dsa_get_total_size(dsa_area *area);
extern void dsa_trim(dsa_area *area);
extern void dsa_dump(dsa_area *area);
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index c629cbe383..9659eb85d7 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -28,6 +28,7 @@ SUBDIRS = \
test_pg_db_role_setting \
test_pg_dump \
test_predtest \
+ test_radixtree \
test_rbtree \
test_regex \
test_rls_hooks \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 1baa6b558d..232cbdac80 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -24,6 +24,7 @@ subdir('test_parser')
subdir('test_pg_db_role_setting')
subdir('test_pg_dump')
subdir('test_predtest')
+subdir('test_radixtree')
subdir('test_rbtree')
subdir('test_regex')
subdir('test_rls_hooks')
diff --git a/src/test/modules/test_radixtree/.gitignore b/src/test/modules/test_radixtree/.gitignore
new file mode 100644
index 0000000000..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/test_radixtree/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_radixtree/Makefile b/src/test/modules/test_radixtree/Makefile
new file mode 100644
index 0000000000..da06b93da3
--- /dev/null
+++ b/src/test/modules/test_radixtree/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_radixtree/Makefile
+
+MODULE_big = test_radixtree
+OBJS = \
+ $(WIN32RES) \
+ test_radixtree.o
+PGFILEDESC = "test_radixtree - test code for src/include/lib/radixtree.h"
+
+EXTENSION = test_radixtree
+DATA = test_radixtree--1.0.sql
+
+REGRESS = test_radixtree
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_radixtree
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_radixtree/README b/src/test/modules/test_radixtree/README
new file mode 100644
index 0000000000..a8b271869a
--- /dev/null
+++ b/src/test/modules/test_radixtree/README
@@ -0,0 +1,7 @@
+test_radixtree contains unit tests for the radix tree implementation
+in src/include/lib/radixtree.h.
+
+The tests verify the correctness of the implementation, but they can also be
+used as a micro-benchmark. If you set the 'rt_test_stats' flag in
+test_radixtree.c, the tests will print extra information about execution time
+and memory usage.
diff --git a/src/test/modules/test_radixtree/expected/test_radixtree.out b/src/test/modules/test_radixtree/expected/test_radixtree.out
new file mode 100644
index 0000000000..ce645cb8b5
--- /dev/null
+++ b/src/test/modules/test_radixtree/expected/test_radixtree.out
@@ -0,0 +1,36 @@
+CREATE EXTENSION test_radixtree;
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
+NOTICE: testing basic operations with leaf node 4
+NOTICE: testing basic operations with inner node 4
+NOTICE: testing basic operations with leaf node 32
+NOTICE: testing basic operations with inner node 32
+NOTICE: testing basic operations with leaf node 125
+NOTICE: testing basic operations with inner node 125
+NOTICE: testing basic operations with leaf node 256
+NOTICE: testing basic operations with inner node 256
+NOTICE: testing radix tree node types with shift "0"
+NOTICE: testing radix tree node types with shift "8"
+NOTICE: testing radix tree node types with shift "16"
+NOTICE: testing radix tree node types with shift "24"
+NOTICE: testing radix tree node types with shift "32"
+NOTICE: testing radix tree node types with shift "40"
+NOTICE: testing radix tree node types with shift "48"
+NOTICE: testing radix tree node types with shift "56"
+NOTICE: testing radix tree with pattern "all ones"
+NOTICE: testing radix tree with pattern "alternating bits"
+NOTICE: testing radix tree with pattern "clusters of ten"
+NOTICE: testing radix tree with pattern "clusters of hundred"
+NOTICE: testing radix tree with pattern "one-every-64k"
+NOTICE: testing radix tree with pattern "sparse"
+NOTICE: testing radix tree with pattern "single values, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^60"
+ test_radixtree
+----------------
+
+(1 row)
+
diff --git a/src/test/modules/test_radixtree/meson.build b/src/test/modules/test_radixtree/meson.build
new file mode 100644
index 0000000000..6add06bbdb
--- /dev/null
+++ b/src/test/modules/test_radixtree/meson.build
@@ -0,0 +1,35 @@
+# FIXME: prevent install during main install, but not during test :/
+
+test_radixtree_sources = files(
+ 'test_radixtree.c',
+)
+
+if host_system == 'windows'
+ test_radixtree_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'test_radixtree',
+ '--FILEDESC', 'test_radixtree - test code for src/include/lib/radixtree.h',])
+endif
+
+test_radixtree = shared_module('test_radixtree',
+ test_radixtree_sources,
+ link_with: pgport_srv,
+ kwargs: pg_mod_args,
+)
+testprep_targets += test_radixtree
+
+install_data(
+ 'test_radixtree.control',
+ 'test_radixtree--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'test_radixtree',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'test_radixtree',
+ ],
+ },
+}
diff --git a/src/test/modules/test_radixtree/sql/test_radixtree.sql b/src/test/modules/test_radixtree/sql/test_radixtree.sql
new file mode 100644
index 0000000000..41ece5e9f5
--- /dev/null
+++ b/src/test/modules/test_radixtree/sql/test_radixtree.sql
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_radixtree;
+
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
diff --git a/src/test/modules/test_radixtree/test_radixtree--1.0.sql b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
new file mode 100644
index 0000000000..074a5a7ea7
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_radixtree/test_radixtree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_radixtree" to load this file. \quit
+
+CREATE FUNCTION test_radixtree()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
new file mode 100644
index 0000000000..f944945db9
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -0,0 +1,674 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_radixtree.c
+ * Test radixtree set data structure.
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/test/modules/test_radixtree/test_radixtree.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "miscadmin.h"
+#include "nodes/bitmapset.h"
+#include "storage/block.h"
+#include "storage/itemptr.h"
+#include "storage/lwlock.h"
+#include "utils/memutils.h"
+#include "utils/timestamp.h"
+
+#define UINT64_HEX_FORMAT "%" INT64_MODIFIER "X"
+
+/*
+ * The tests pass with uint32, but build with warnings because the string
+ * format expects uint64.
+ */
+typedef uint64 TestValueType;
+
+/*
+ * If you enable this, the "pattern" tests will print information about
+ * how long populating, probing, and iterating the test set takes, and
+ * how much memory the test set consumed. That can be used as a
+ * micro-benchmark of various operations and input patterns (if you do
+ * that, you might want to increase the number of values used in each
+ * test, to reduce noise).
+ *
+ * The information is printed to the server's stderr, mostly because
+ * that's where MemoryContextStats() output goes.
+ */
+static const bool rt_test_stats = false;
+
+static int rt_node_kind_fanouts[] = {
+ 0,
+ 4, /* RT_NODE_KIND_4 */
+ 32, /* RT_NODE_KIND_32 */
+ 125, /* RT_NODE_KIND_125 */
+ 256 /* RT_NODE_KIND_256 */
+};
+/*
+ * A struct to define a pattern of integers, for use with the test_pattern()
+ * function.
+ */
+typedef struct
+{
+ char *test_name; /* short name of the test, for humans */
+ char *pattern_str; /* a bit pattern */
+ uint64 spacing; /* pattern repeats at this interval */
+ uint64 num_values; /* number of integers to set in total */
+} test_spec;
+
+/* Test patterns borrowed from test_integerset.c */
+static const test_spec test_specs[] = {
+ {
+ "all ones", "1111111111",
+ 10, 1000000
+ },
+ {
+ "alternating bits", "0101010101",
+ 10, 1000000
+ },
+ {
+ "clusters of ten", "1111111111",
+ 10000, 1000000
+ },
+ {
+ "clusters of hundred",
+ "1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111",
+ 10000, 1000000
+ },
+ {
+ "one-every-64k", "1",
+ 65536, 1000000
+ },
+ {
+ "sparse", "100000000000000000000000000000001",
+ 10000000, 1000000
+ },
+ {
+ "single values, distance > 2^32", "1",
+ UINT64CONST(10000000000), 100000
+ },
+ {
+ "clusters, distance > 2^32", "10101010",
+ UINT64CONST(10000000000), 1000000
+ },
+ {
+ "clusters, distance > 2^60", "10101010",
+ UINT64CONST(2000000000000000000),
+ 23 /* can't be much higher than this, or we
+ * overflow uint64 */
+ }
+};
+
+/* define the radix tree implementation to test */
+#define RT_PREFIX rt
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#define RT_USE_DELETE
+#define RT_VALUE_TYPE TestValueType
+// WIP: compiles with warnings because rt_attach is defined but not used
+// #define RT_SHMEM
+#include "lib/radixtree.h"
+
+
+/*
+ * Return the number of keys in the radix tree.
+ */
+static uint64
+rt_num_entries(rt_radix_tree *tree)
+{
+ return tree->ctl->num_keys;
+}
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(test_radixtree);
+
+static void
+test_empty(void)
+{
+ rt_radix_tree *radixtree;
+ rt_iter *iter;
+ TestValueType dummy;
+ uint64 key;
+ TestValueType val;
+
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+ dsa = dsa_create(tranche_id);
+
+ radixtree = rt_create(CurrentMemoryContext, dsa, tranche_id);
+#else
+ radixtree = rt_create(CurrentMemoryContext);
+#endif
+
+ if (rt_search(radixtree, 0, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, 1, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, PG_UINT64_MAX, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_delete(radixtree, 0))
+ elog(ERROR, "rt_delete on empty tree returned true");
+
+ if (rt_num_entries(radixtree) != 0)
+ elog(ERROR, "rt_num_entries on empty tree returned non-zero");
+
+ iter = rt_begin_iterate(radixtree);
+
+ if (rt_iterate_next(iter, &key, &val))
+ elog(ERROR, "rt_iterate_next on empty tree returned true");
+
+ rt_end_iterate(iter);
+
+ rt_free(radixtree);
+
+#ifdef RT_SHMEM
+ dsa_detach(dsa);
+#endif
+}
+
+static void
+test_basic(int children, bool test_inner)
+{
+ rt_radix_tree *radixtree;
+ uint64 *keys;
+ int shift = test_inner ? 8 : 0;
+
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+ dsa = dsa_create(tranche_id);
+#endif
+
+ elog(NOTICE, "testing basic operations with %s node %d",
+ test_inner ? "inner" : "leaf", children);
+
+#ifdef RT_SHMEM
+ radixtree = rt_create(CurrentMemoryContext, dsa, tranche_id);
+#else
+ radixtree = rt_create(CurrentMemoryContext);
+#endif
+
+ /* prepare keys in an interleaved order like 1, children, 2, children - 1, 3, ... */
+ keys = palloc(sizeof(uint64) * children);
+ for (int i = 0; i < children; i++)
+ {
+ if (i % 2 == 0)
+ keys[i] = (uint64) ((i / 2) + 1) << shift;
+ else
+ keys[i] = (uint64) (children - (i / 2)) << shift;
+ }
+
+ /* insert keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (rt_set(radixtree, keys[i], (TestValueType*) &keys[i]))
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " is found", keys[i]);
+ }
+
+ /* look up keys */
+ for (int i = 0; i < children; i++)
+ {
+ TestValueType value;
+
+ if (!rt_search(radixtree, keys[i], &value))
+ elog(ERROR, "could not find key 0x" UINT64_HEX_FORMAT, keys[i]);
+ if (value != (TestValueType) keys[i])
+ elog(ERROR, "rt_search returned 0x" UINT64_HEX_FORMAT ", expected " UINT64_HEX_FORMAT,
+ value, (TestValueType) keys[i]);
+ }
+
+ /* update keys */
+ for (int i = 0; i < children; i++)
+ {
+ TestValueType update = keys[i] + 1;
+ if (!rt_set(radixtree, keys[i], (TestValueType*) &update))
+ elog(ERROR, "could not update key 0x" UINT64_HEX_FORMAT, keys[i]);
+ }
+
+ /* repeat deleting and inserting keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (!rt_delete(radixtree, keys[i]))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, keys[i]);
+ if (rt_set(radixtree, keys[i], (TestValueType*) &keys[i]))
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " is found", keys[i]);
+ }
+
+ pfree(keys);
+ rt_free(radixtree);
+#ifdef RT_SHMEM
+ dsa_detach(dsa);
+#endif
+}
+
+/*
+ * Check if keys from start to end with the shift exist in the tree.
+ */
+static void
+check_search_on_node(rt_radix_tree *radixtree, uint8 shift, int start, int end,
+ int incr)
+{
+ for (int i = start; i < end; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ TestValueType val;
+
+ if (!rt_search(radixtree, key, &val))
+ elog(ERROR, "key 0x" UINT64_HEX_FORMAT " is not found on node-%d",
+ key, end);
+ if (val != (TestValueType) key)
+ elog(ERROR, "rt_search with key 0x" UINT64_HEX_FORMAT " returns 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ key, val, key);
+ }
+}
+
+static void
+test_node_types_insert(rt_radix_tree *radixtree, uint8 shift, bool insert_asc)
+{
+ uint64 num_entries;
+ int ninserted = 0;
+ int start = insert_asc ? 0 : 256;
+ int incr = insert_asc ? 1 : -1;
+ int end = insert_asc ? 256 : 0;
+ int node_kind_idx = 1;
+
+ for (int i = start; i != end; i += incr)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_set(radixtree, key, (TestValueType*) &key);
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " is found", key);
+
+ /*
+ * After filling all slots in each node type, check if the values
+ * are stored properly.
+ */
+ if (ninserted == rt_node_kind_fanouts[node_kind_idx] - 1)
+ {
+ int check_start = insert_asc
+ ? rt_node_kind_fanouts[node_kind_idx - 1]
+ : rt_node_kind_fanouts[node_kind_idx];
+ int check_end = insert_asc
+ ? rt_node_kind_fanouts[node_kind_idx]
+ : rt_node_kind_fanouts[node_kind_idx - 1];
+
+ check_search_on_node(radixtree, shift, check_start, check_end, incr);
+ node_kind_idx++;
+ }
+
+ ninserted++;
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ if (num_entries != 256)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(256));
+}
+
+static void
+test_node_types_delete(rt_radix_tree *radixtree, uint8 shift)
+{
+ uint64 num_entries;
+
+ for (int i = 0; i < 256; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_delete(radixtree, key);
+
+ if (!found)
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, key);
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ /* The tree must be empty */
+ if (num_entries != 0)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(0));
+}
+
+/*
+ * Test inserting and deleting key-value pairs for each node type at the
+ * given shift level.
+ */
+static void
+test_node_types(uint8 shift)
+{
+ rt_radix_tree *radixtree;
+
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+ dsa = dsa_create(tranche_id);
+#endif
+
+ elog(NOTICE, "testing radix tree node types with shift \"%d\"", shift);
+
+#ifdef RT_SHMEM
+ radixtree = rt_create(CurrentMemoryContext, dsa, tranche_id);
+#else
+ radixtree = rt_create(CurrentMemoryContext);
+#endif
+
+ /*
+ * Insert and search entries for every node type at the 'shift' level,
+ * then delete all entries to make it empty, and insert and search entries
+ * again.
+ */
+ test_node_types_insert(radixtree, shift, true);
+ test_node_types_delete(radixtree, shift);
+ test_node_types_insert(radixtree, shift, false);
+
+ rt_free(radixtree);
+#ifdef RT_SHMEM
+ dsa_detach(dsa);
+#endif
+}
+
+/*
+ * Test with a repeating pattern, defined by the 'spec'.
+ */
+static void
+test_pattern(const test_spec * spec)
+{
+ rt_radix_tree *radixtree;
+ rt_iter *iter;
+ MemoryContext radixtree_ctx;
+ TimestampTz starttime;
+ TimestampTz endtime;
+ uint64 n;
+ uint64 last_int;
+ uint64 ndeleted;
+ uint64 nbefore;
+ uint64 nafter;
+ int patternlen;
+ uint64 *pattern_values;
+ uint64 pattern_num_values;
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+ dsa = dsa_create(tranche_id);
+#endif
+
+ elog(NOTICE, "testing radix tree with pattern \"%s\"", spec->test_name);
+ if (rt_test_stats)
+ fprintf(stderr, "-----\ntesting radix tree with pattern \"%s\"\n", spec->test_name);
+
+ /* Pre-process the pattern, creating an array of integers from it. */
+ patternlen = strlen(spec->pattern_str);
+ pattern_values = palloc(patternlen * sizeof(uint64));
+ pattern_num_values = 0;
+ for (int i = 0; i < patternlen; i++)
+ {
+ if (spec->pattern_str[i] == '1')
+ pattern_values[pattern_num_values++] = i;
+ }
+
+ /*
+ * Allocate the radix tree.
+ *
+ * Allocate it in a separate memory context, so that we can print its
+ * memory usage easily.
+ */
+ radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
+ "radixtree test",
+ ALLOCSET_SMALL_SIZES);
+ MemoryContextSetIdentifier(radixtree_ctx, spec->test_name);
+
+#ifdef RT_SHMEM
+ radixtree = rt_create(radixtree_ctx, dsa, tranche_id);
+#else
+ radixtree = rt_create(radixtree_ctx);
+#endif
+
+
+ /*
+ * Add values to the set.
+ */
+ starttime = GetCurrentTimestamp();
+
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ uint64 x = 0;
+
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ bool found;
+
+ x = last_int + pattern_values[i];
+
+ found = rt_set(radixtree, x, (TestValueType*) &x);
+
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " found", x);
+
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+
+ endtime = GetCurrentTimestamp();
+
+ if (rt_test_stats)
+ fprintf(stderr, "added " UINT64_FORMAT " values in %d ms\n",
+ spec->num_values, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Print stats on the amount of memory used.
+ *
+ * We print the usage reported by rt_memory_usage(), as well as the stats
+ * from the memory context. They should be in the same ballpark, but it's
+ * hard to automate testing that, so if you're making changes to the
+ * implementation, just observe that manually.
+ */
+ if (rt_test_stats)
+ {
+ uint64 mem_usage;
+
+ /*
+ * Also print memory usage as reported by rt_memory_usage(). It
+ * should be in the same ballpark as the usage reported by
+ * MemoryContextStats().
+ */
+ mem_usage = rt_memory_usage(radixtree);
+ fprintf(stderr, "rt_memory_usage() reported " UINT64_FORMAT " (%0.2f bytes / integer)\n",
+ mem_usage, (double) mem_usage / spec->num_values);
+
+ MemoryContextStats(radixtree_ctx);
+ }
+
+ /* Check that rt_num_entries works */
+ n = rt_num_entries(radixtree);
+ if (n != spec->num_values)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT, n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_search()
+ */
+ starttime = GetCurrentTimestamp();
+
+ for (n = 0; n < 100000; n++)
+ {
+ bool found;
+ bool expected;
+ uint64 x;
+ TestValueType v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Do we expect this value to be present in the set? */
+ if (x >= last_int)
+ expected = false;
+ else
+ {
+ uint64 idx = x % spec->spacing;
+
+ if (idx >= patternlen)
+ expected = false;
+ else if (spec->pattern_str[idx] == '1')
+ expected = true;
+ else
+ expected = false;
+ }
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (found != expected)
+ elog(ERROR, "mismatch at 0x" UINT64_HEX_FORMAT ": %d vs %d", x, found, expected);
+ if (found && (v != (TestValueType) x))
+ elog(ERROR, "found 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ v, x);
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "probed " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Test iterator
+ */
+ starttime = GetCurrentTimestamp();
+
+ iter = rt_begin_iterate(radixtree);
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ uint64 expected = last_int + pattern_values[i];
+ uint64 x;
+ TestValueType val;
+
+ if (!rt_iterate_next(iter, &x, &val))
+ break;
+
+ if (x != expected)
+ elog(ERROR,
+ "iterate returned wrong key; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d",
+ x, expected, i);
+ if (val != (TestValueType) expected)
+ elog(ERROR,
+ "iterate returned wrong value; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d", x, expected, i);
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "iterated " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ rt_end_iterate(iter);
+
+ if (n < spec->num_values)
+ elog(ERROR, "iterator stopped short after " UINT64_FORMAT " entries, expected " UINT64_FORMAT, n, spec->num_values);
+ if (n > spec->num_values)
+ elog(ERROR, "iterator returned " UINT64_FORMAT " entries, " UINT64_FORMAT " was expected", n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_delete()
+ */
+ starttime = GetCurrentTimestamp();
+
+ nbefore = rt_num_entries(radixtree);
+ ndeleted = 0;
+ for (n = 0; n < 1; n++)
+ {
+ bool found;
+ uint64 x;
+ TestValueType v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (!found)
+ continue;
+
+ /* If the key is found, delete it and check again */
+ if (!rt_delete(radixtree, x))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_search(radixtree, x, &v))
+ elog(ERROR, "found deleted key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_delete(radixtree, x))
+ elog(ERROR, "deleted already-deleted key 0x" UINT64_HEX_FORMAT, x);
+
+ ndeleted++;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "deleted " UINT64_FORMAT " values in %d ms\n",
+ ndeleted, (int) (endtime - starttime) / 1000);
+
+ nafter = rt_num_entries(radixtree);
+
+ /* Check that rt_num_entries works */
+ if ((nbefore - ndeleted) != nafter)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT "after " UINT64_FORMAT " deletion",
+ nafter, (nbefore - ndeleted), ndeleted);
+
+ rt_free(radixtree);
+ MemoryContextDelete(radixtree_ctx);
+#ifdef RT_SHMEM
+ dsa_detach(dsa);
+#endif
+}
+
+Datum
+test_radixtree(PG_FUNCTION_ARGS)
+{
+ test_empty();
+
+ for (int i = 1; i < lengthof(rt_node_kind_fanouts); i++)
+ {
+ test_basic(rt_node_kind_fanouts[i], false);
+ test_basic(rt_node_kind_fanouts[i], true);
+ }
+
+ for (int shift = 0; shift <= (64 - 8); shift += 8)
+ test_node_types(shift);
+
+ /* Test different test patterns, with lots of entries */
+ for (int i = 0; i < lengthof(test_specs); i++)
+ test_pattern(&test_specs[i]);
+
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_radixtree/test_radixtree.control b/src/test/modules/test_radixtree/test_radixtree.control
new file mode 100644
index 0000000000..e53f2a3e0c
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.control
@@ -0,0 +1,4 @@
+comment = 'Test code for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/test_radixtree'
+relocatable = true
diff --git a/src/tools/pginclude/cpluspluscheck b/src/tools/pginclude/cpluspluscheck
index e52fe9f509..09fa6e7432 100755
--- a/src/tools/pginclude/cpluspluscheck
+++ b/src/tools/pginclude/cpluspluscheck
@@ -101,6 +101,12 @@ do
test "$f" = src/include/nodes/nodetags.h && continue
test "$f" = src/backend/nodes/nodetags.h && continue
+ # radixtree_*_impl.h cannot be included standalone: they are just code fragments.
+ test "$f" = src/include/lib/radixtree_delete_impl.h && continue
+ test "$f" = src/include/lib/radixtree_insert_impl.h && continue
+ test "$f" = src/include/lib/radixtree_iter_impl.h && continue
+ test "$f" = src/include/lib/radixtree_search_impl.h && continue
+
# These files are not meant to be included standalone, because
# they contain lists that might have multiple use-cases.
test "$f" = src/include/access/rmgrlist.h && continue
diff --git a/src/tools/pginclude/headerscheck b/src/tools/pginclude/headerscheck
index abbba7aa63..d4d2f1da03 100755
--- a/src/tools/pginclude/headerscheck
+++ b/src/tools/pginclude/headerscheck
@@ -96,6 +96,12 @@ do
test "$f" = src/include/nodes/nodetags.h && continue
test "$f" = src/backend/nodes/nodetags.h && continue
+ # radixtree_*_impl.h cannot be included standalone: they are just code fragments.
+ test "$f" = src/include/lib/radixtree_delete_impl.h && continue
+ test "$f" = src/include/lib/radixtree_insert_impl.h && continue
+ test "$f" = src/include/lib/radixtree_iter_impl.h && continue
+ test "$f" = src/include/lib/radixtree_search_impl.h && continue
+
# These files are not meant to be included standalone, because
# they contain lists that might have multiple use-cases.
test "$f" = src/include/access/rmgrlist.h && continue
--
2.39.1
Attachment: v28-0006-Tool-for-measuring-radix-tree-and-tidstore-perfo.patch (text/x-patch)
From b0515a40b3aa4709047c7b70b9c0cadded979d15 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 16 Sep 2022 11:57:03 +0900
Subject: [PATCH v28 06/10] Tool for measuring radix tree and tidstore
performance
Includes Meson support, but commented out to avoid warnings
XXX: Not for commit
---
contrib/bench_radix_tree/Makefile | 21 +
.../bench_radix_tree--1.0.sql | 87 +++
contrib/bench_radix_tree/bench_radix_tree.c | 717 ++++++++++++++++++
.../bench_radix_tree/bench_radix_tree.control | 6 +
contrib/bench_radix_tree/expected/bench.out | 13 +
contrib/bench_radix_tree/meson.build | 33 +
contrib/bench_radix_tree/sql/bench.sql | 16 +
contrib/meson.build | 1 +
8 files changed, 894 insertions(+)
create mode 100644 contrib/bench_radix_tree/Makefile
create mode 100644 contrib/bench_radix_tree/bench_radix_tree--1.0.sql
create mode 100644 contrib/bench_radix_tree/bench_radix_tree.c
create mode 100644 contrib/bench_radix_tree/bench_radix_tree.control
create mode 100644 contrib/bench_radix_tree/expected/bench.out
create mode 100644 contrib/bench_radix_tree/meson.build
create mode 100644 contrib/bench_radix_tree/sql/bench.sql
diff --git a/contrib/bench_radix_tree/Makefile b/contrib/bench_radix_tree/Makefile
new file mode 100644
index 0000000000..952bb0ceae
--- /dev/null
+++ b/contrib/bench_radix_tree/Makefile
@@ -0,0 +1,21 @@
+# contrib/bench_radix_tree/Makefile
+
+MODULE_big = bench_radix_tree
+OBJS = \
+ bench_radix_tree.o
+
+EXTENSION = bench_radix_tree
+DATA = bench_radix_tree--1.0.sql
+
+REGRESS = bench_fixed_height
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/bench_radix_tree
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
new file mode 100644
index 0000000000..fbf51c1086
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
@@ -0,0 +1,87 @@
+/* contrib/bench_radix_tree/bench_radix_tree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION bench_radix_tree" to load this file. \quit
+
+create function bench_shuffle_search(
+minblk int4,
+maxblk int4,
+random_block bool DEFAULT false,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT array_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT array_load_ms int8,
+OUT rt_search_ms int8,
+OUT array_search_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_seq_search(
+minblk int4,
+maxblk int4,
+random_block bool DEFAULT false,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT array_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT array_load_ms int8,
+OUT rt_search_ms int8,
+OUT array_search_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_load_random_int(
+cnt int8,
+OUT mem_allocated int8,
+OUT load_ms int8)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_search_random_nodes(
+cnt int8,
+filter_str text DEFAULT NULL,
+OUT mem_allocated int8,
+OUT load_ms int8,
+OUT search_ms int8)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C VOLATILE PARALLEL UNSAFE;
+
+create function bench_fixed_height_search(
+fanout int4,
+OUT fanout int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT rt_search_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_node128_load(
+fanout int4,
+OUT fanout int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT rt_sparseload_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_tidstore_load(
+minblk int4,
+maxblk int4,
+OUT mem_allocated int8,
+OUT load_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c
new file mode 100644
index 0000000000..b5ad75364c
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -0,0 +1,717 @@
+/*-------------------------------------------------------------------------
+ *
+ * bench_radix_tree.c
+ *
+ * Copyright (c) 2016-2022, PostgreSQL Global Development Group
+ *
+ * contrib/bench_radix_tree/bench_radix_tree.c
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/tidstore.h"
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "funcapi.h"
+#include "lib/radixtree.h"
+#include <math.h>
+#include "miscadmin.h"
+#include "storage/lwlock.h"
+#include "utils/builtins.h"
+#include "utils/timestamp.h"
+
+PG_MODULE_MAGIC;
+
+/* run benchmark also for binary-search case? */
+/* #define MEASURE_BINARY_SEARCH 1 */
+
+#define TIDS_PER_BLOCK_FOR_LOAD 30
+#define TIDS_PER_BLOCK_FOR_LOOKUP 50
+
+//#define RT_DEBUG
+#define RT_PREFIX rt
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#define RT_USE_DELETE
+#define RT_VALUE_TYPE uint64
+// WIP: compiles with warnings because rt_attach is defined but not used
+// #define RT_SHMEM
+#include "lib/radixtree.h"
+
+/*
+ * Return the number of keys in the radix tree.
+ */
+static uint64
+rt_num_entries(rt_radix_tree *tree)
+{
+ return tree->ctl->num_keys;
+}
+
+
+PG_FUNCTION_INFO_V1(bench_seq_search);
+PG_FUNCTION_INFO_V1(bench_shuffle_search);
+PG_FUNCTION_INFO_V1(bench_load_random_int);
+PG_FUNCTION_INFO_V1(bench_fixed_height_search);
+PG_FUNCTION_INFO_V1(bench_search_random_nodes);
+PG_FUNCTION_INFO_V1(bench_node128_load);
+PG_FUNCTION_INFO_V1(bench_tidstore_load);
+
+static uint64
+tid_to_key_off(ItemPointer tid, uint32 *off)
+{
+ uint64 upper;
+ uint32 shift = pg_ceil_log2_32(MaxHeapTuplesPerPage);
+ int64 tid_i;
+
+ Assert(ItemPointerGetOffsetNumber(tid) < MaxHeapTuplesPerPage);
+
+ tid_i = ItemPointerGetOffsetNumber(tid);
+ tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << shift;
+
+ /* log(sizeof(uint64) * BITS_PER_BYTE, 2) = log(64, 2) = 6 */
+ *off = tid_i & ((1 << 6) - 1);
+ upper = tid_i >> 6;
+ Assert(*off < (sizeof(uint64) * BITS_PER_BYTE));
+
+ Assert(*off < 64);
+
+ return upper;
+}
+
+static int
+shuffle_randrange(pg_prng_state *state, int lower, int upper)
+{
+ return (int) floor(pg_prng_double(state) * ((upper - lower) + 0.999999)) + lower;
+}
+
+/* Naive Fisher-Yates implementation */
+static void
+shuffle_itemptrs(ItemPointer itemptr, uint64 nitems)
+{
+ /* reproducibility */
+ pg_prng_state state;
+
+ pg_prng_seed(&state, 0);
+
+ for (int i = 0; i < nitems - 1; i++)
+ {
+ int j = shuffle_randrange(&state, i, nitems - 1);
+ ItemPointerData t = itemptr[j];
+
+ itemptr[j] = itemptr[i];
+ itemptr[i] = t;
+ }
+}
+
+static ItemPointer
+generate_tids(BlockNumber minblk, BlockNumber maxblk, int ntids_per_blk, uint64 *ntids_p,
+ bool random_block)
+{
+ ItemPointer tids;
+ uint64 maxitems;
+ uint64 ntids = 0;
+ pg_prng_state state;
+
+ maxitems = (maxblk - minblk + 1) * ntids_per_blk;
+ tids = MemoryContextAllocHuge(TopTransactionContext,
+ sizeof(ItemPointerData) * maxitems);
+
+ if (random_block)
+ pg_prng_seed(&state, 0x9E3779B185EBCA87);
+
+ for (BlockNumber blk = minblk; blk < maxblk; blk++)
+ {
+ if (random_block && !pg_prng_bool(&state))
+ continue;
+
+ for (OffsetNumber off = FirstOffsetNumber;
+ off <= ntids_per_blk; off++)
+ {
+ CHECK_FOR_INTERRUPTS();
+
+ ItemPointerSetBlockNumber(&(tids[ntids]), blk);
+ ItemPointerSetOffsetNumber(&(tids[ntids]), off);
+
+ ntids++;
+ }
+ }
+
+ *ntids_p = ntids;
+ return tids;
+}
+
+#ifdef MEASURE_BINARY_SEARCH
+static int
+vac_cmp_itemptr(const void *left, const void *right)
+{
+ BlockNumber lblk,
+ rblk;
+ OffsetNumber loff,
+ roff;
+
+ lblk = ItemPointerGetBlockNumber((ItemPointer) left);
+ rblk = ItemPointerGetBlockNumber((ItemPointer) right);
+
+ if (lblk < rblk)
+ return -1;
+ if (lblk > rblk)
+ return 1;
+
+ loff = ItemPointerGetOffsetNumber((ItemPointer) left);
+ roff = ItemPointerGetOffsetNumber((ItemPointer) right);
+
+ if (loff < roff)
+ return -1;
+ if (loff > roff)
+ return 1;
+
+ return 0;
+}
+#endif
+
+Datum
+bench_tidstore_load(PG_FUNCTION_ARGS)
+{
+ BlockNumber minblk = PG_GETARG_INT32(0);
+ BlockNumber maxblk = PG_GETARG_INT32(1);
+ TidStore *ts;
+ OffsetNumber *offs;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 load_ms;
+ TupleDesc tupdesc;
+ Datum values[2];
+ bool nulls[2] = {false};
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ offs = palloc(sizeof(OffsetNumber) * TIDS_PER_BLOCK_FOR_LOAD);
+ for (int i = 0; i < TIDS_PER_BLOCK_FOR_LOAD; i++)
+ offs[i] = i + 1; /* FirstOffsetNumber is 1 */
+
+ ts = tidstore_create(1 * 1024L * 1024L * 1024L, MaxHeapTuplesPerPage, NULL);
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+ /* load tids */
+ start_time = GetCurrentTimestamp();
+ for (BlockNumber blkno = minblk; blkno < maxblk; blkno++)
+ tidstore_add_tids(ts, blkno, offs, TIDS_PER_BLOCK_FOR_LOAD);
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ load_ms = secs * 1000 + usecs / 1000;
+
+ values[0] = Int64GetDatum(tidstore_memory_usage(ts));
+ values[1] = Int64GetDatum(load_ms);
+
+ tidstore_destroy(ts);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+static Datum
+bench_search(FunctionCallInfo fcinfo, bool shuffle)
+{
+ BlockNumber minblk = PG_GETARG_INT32(0);
+ BlockNumber maxblk = PG_GETARG_INT32(1);
+ bool random_block = PG_GETARG_BOOL(2);
+ rt_radix_tree *rt = NULL;
+ uint64 ntids;
+ uint64 key;
+ uint64 last_key = PG_UINT64_MAX;
+ uint64 val = 0;
+ ItemPointer tids;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_load_ms,
+ rt_search_ms;
+ Datum values[7];
+ bool nulls[7];
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ tids = generate_tids(minblk, maxblk, TIDS_PER_BLOCK_FOR_LOAD, &ntids, random_block);
+
+ /* measure the load time of the radix tree */
+ rt = rt_create(CurrentMemoryContext);
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ uint32 off;
+
+ CHECK_FOR_INTERRUPTS();
+
+ key = tid_to_key_off(tid, &off);
+
+ if (last_key != PG_UINT64_MAX && last_key != key)
+ {
+ rt_set(rt, last_key, &val);
+ val = 0;
+ }
+
+ last_key = key;
+ val |= (uint64) 1 << off;
+ }
+ if (last_key != PG_UINT64_MAX)
+ rt_set(rt, last_key, &val);
+
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_load_ms = secs * 1000 + usecs / 1000;
+
+#ifdef RT_DEBUG
+ rt_stats(rt);
+#endif
+ if (shuffle)
+ shuffle_itemptrs(tids, ntids);
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+ /* measure the search time of the radix tree */
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ uint32 off;
+ volatile bool ret; /* prevent calling rt_search from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ key = tid_to_key_off(tid, &off);
+
+ ret = rt_search(rt, key, &val);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_search_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int64GetDatum(rt_num_entries(rt));
+ values[1] = Int64GetDatum(rt_memory_usage(rt));
+ values[2] = Int64GetDatum(sizeof(ItemPointerData) * ntids);
+ values[3] = Int64GetDatum(rt_load_ms);
+ nulls[4] = true; /* ar_load_ms */
+ values[5] = Int64GetDatum(rt_search_ms);
+ nulls[6] = true; /* ar_search_ms */
+
+#ifdef MEASURE_BINARY_SEARCH
+ {
+ ItemPointer itemptrs = NULL;
+
+ int64 ar_load_ms,
+ ar_search_ms;
+
+ /* measure the load time of the array */
+ itemptrs = MemoryContextAllocHuge(CurrentMemoryContext,
+ sizeof(ItemPointerData) * ntids);
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointerSetBlockNumber(&(itemptrs[i]),
+ ItemPointerGetBlockNumber(&(tids[i])));
+ ItemPointerSetOffsetNumber(&(itemptrs[i]),
+ ItemPointerGetOffsetNumber(&(tids[i])));
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ ar_load_ms = secs * 1000 + usecs / 1000;
+
+ /* next, measure the search time of the array */
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ volatile bool ret; /* prevent calling bsearch from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ ret = bsearch((void *) tid,
+ (void *) itemptrs,
+ ntids,
+ sizeof(ItemPointerData),
+ vac_cmp_itemptr);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ ar_search_ms = secs * 1000 + usecs / 1000;
+
+ /* set the result */
+ nulls[4] = false;
+ values[4] = Int64GetDatum(ar_load_ms);
+ nulls[6] = false;
+ values[6] = Int64GetDatum(ar_search_ms);
+ }
+#endif
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_seq_search(PG_FUNCTION_ARGS)
+{
+ return bench_search(fcinfo, false);
+}
+
+Datum
+bench_shuffle_search(PG_FUNCTION_ARGS)
+{
+ return bench_search(fcinfo, true);
+}
+
+Datum
+bench_load_random_int(PG_FUNCTION_ARGS)
+{
+ uint64 cnt = (uint64) PG_GETARG_INT64(0);
+ rt_radix_tree *rt;
+ pg_prng_state state;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 load_time_ms;
+ Datum values[2];
+ bool nulls[2];
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ pg_prng_seed(&state, 0);
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ uint64 key = pg_prng_uint64(&state);
+
+ rt_set(rt, key, &key);
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ load_time_ms = secs * 1000 + usecs / 1000;
+
+#ifdef RT_DEBUG
+ rt_stats(rt);
+#endif
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int64GetDatum(rt_memory_usage(rt));
+ values[1] = Int64GetDatum(load_time_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+/* copy of splitmix64() */
+static uint64
+hash64(uint64 x)
+{
+ x ^= x >> 30;
+ x *= UINT64CONST(0xbf58476d1ce4e5b9);
+ x ^= x >> 27;
+ x *= UINT64CONST(0x94d049bb133111eb);
+ x ^= x >> 31;
+ return x;
+}
+
+/* attempts to have a relatively even population of node kinds */
+Datum
+bench_search_random_nodes(PG_FUNCTION_ARGS)
+{
+ uint64 cnt = (uint64) PG_GETARG_INT64(0);
+ rt_radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 load_time_ms;
+ int64 search_time_ms;
+ Datum values[3] = {0};
+ bool nulls[3] = {0};
+ /* from trial and error */
+ uint64 filter = (((uint64) 0x7F<<32) | (0x07<<24) | (0xFF<<16) | 0xFF);
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ if (!PG_ARGISNULL(1))
+ {
+ char *filter_str = text_to_cstring(PG_GETARG_TEXT_P(1));
+
+ if (sscanf(filter_str, "0x%lX", &filter) == 0)
+ elog(ERROR, "invalid filter string %s", filter_str);
+ }
+ elog(NOTICE, "bench with filter 0x%lX", filter);
+
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ uint64 hash = hash64(i);
+ uint64 key = hash & filter;
+
+ rt_set(rt, key, &key);
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ load_time_ms = secs * 1000 + usecs / 1000;
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ uint64 hash = hash64(i);
+ uint64 key = hash & filter;
+ uint64 val;
+ volatile bool ret; /* prevent calling rt_search from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ ret = rt_search(rt, key, &val);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ search_time_ms = secs * 1000 + usecs / 1000;
+
+#ifdef RT_DEBUG
+ rt_stats(rt);
+#endif
+
+ values[0] = Int64GetDatum(rt_memory_usage(rt));
+ values[1] = Int64GetDatum(load_time_ms);
+ values[2] = Int64GetDatum(search_time_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_fixed_height_search(PG_FUNCTION_ARGS)
+{
+ int fanout = PG_GETARG_INT32(0);
+ rt_radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_load_ms,
+ rt_search_ms;
+ Datum values[5];
+ bool nulls[5];
+
+ /* test boundary between vector and iteration */
+ const int n_keys = 5 * 16 * 16 * 16 * 16;
+ uint64 r,
+ h,
+ i,
+ j,
+ k;
+ uint64 key_id;
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+
+ key_id = 0;
+
+ /*
+ * lower nodes have limited fanout, the top is only limited by
+ * bits-per-byte
+ */
+ for (r = 1;; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ for (i = 1; i <= fanout; i++)
+ {
+ for (j = 1; j <= fanout; j++)
+ {
+ for (k = 1; k <= fanout; k++)
+ {
+ uint64 key;
+
+ key = (r << 32) | (h << 24) | (i << 16) | (j << 8) | (k);
+
+ CHECK_FOR_INTERRUPTS();
+
+ key_id++;
+ if (key_id > n_keys)
+ goto finish_set;
+
+ rt_set(rt, key, &key_id);
+ }
+ }
+ }
+ }
+ }
+finish_set:
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_load_ms = secs * 1000 + usecs / 1000;
+
+#ifdef RT_DEBUG
+ rt_stats(rt);
+#endif
+
+ /* measure the search time of the radix tree */
+ start_time = GetCurrentTimestamp();
+
+ key_id = 0;
+ for (r = 1;; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ for (i = 1; i <= fanout; i++)
+ {
+ for (j = 1; j <= fanout; j++)
+ {
+ for (k = 1; k <= fanout; k++)
+ {
+ uint64 key,
+ val;
+
+ key = (r << 32) | (h << 24) | (i << 16) | (j << 8) | (k);
+
+ CHECK_FOR_INTERRUPTS();
+
+ key_id++;
+ if (key_id > n_keys)
+ goto finish_search;
+
+ rt_search(rt, key, &val);
+ }
+ }
+ }
+ }
+ }
+finish_search:
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_search_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int32GetDatum(fanout);
+ values[1] = Int64GetDatum(rt_num_entries(rt));
+ values[2] = Int64GetDatum(rt_memory_usage(rt));
+ values[3] = Int64GetDatum(rt_load_ms);
+ values[4] = Int64GetDatum(rt_search_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_node128_load(PG_FUNCTION_ARGS)
+{
+ int fanout = PG_GETARG_INT32(0);
+ rt_radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_sparseload_ms;
+ Datum values[5];
+ bool nulls[5];
+
+ uint64 r,
+ h;
+ uint64 key_id;
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ rt = rt_create(CurrentMemoryContext);
+
+ key_id = 0;
+
+ for (r = 1; r <= fanout; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (h);
+
+ key_id++;
+ rt_set(rt, key, &key_id);
+ }
+ }
+
+#ifdef RT_DEBUG
+ rt_stats(rt);
+#endif
+
+ /* measure sparse deletion and re-loading */
+ start_time = GetCurrentTimestamp();
+
+ for (int t = 0; t<10000; t++)
+ {
+ /* delete one key in each leaf */
+ for (r = 1; r <= fanout; r++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (fanout);
+
+ rt_delete(rt, key);
+ }
+
+ /* add them all back */
+ key_id = 0;
+ for (r = 1; r <= fanout; r++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (fanout);
+
+ key_id++;
+ rt_set(rt, key, &key_id);
+ }
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_sparseload_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int32GetDatum(fanout);
+ values[1] = Int64GetDatum(rt_num_entries(rt));
+ values[2] = Int64GetDatum(rt_memory_usage(rt));
+ values[3] = Int64GetDatum(rt_sparseload_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
diff --git a/contrib/bench_radix_tree/bench_radix_tree.control b/contrib/bench_radix_tree/bench_radix_tree.control
new file mode 100644
index 0000000000..1d988e6c9a
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.control
@@ -0,0 +1,6 @@
+# bench_radix_tree extension
+comment = 'benchmark suite for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/bench_radix_tree'
+relocatable = true
+trusted = true
diff --git a/contrib/bench_radix_tree/expected/bench.out b/contrib/bench_radix_tree/expected/bench.out
new file mode 100644
index 0000000000..60c303892e
--- /dev/null
+++ b/contrib/bench_radix_tree/expected/bench.out
@@ -0,0 +1,13 @@
+create extension bench_radix_tree;
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
diff --git a/contrib/bench_radix_tree/meson.build b/contrib/bench_radix_tree/meson.build
new file mode 100644
index 0000000000..332c1ae7df
--- /dev/null
+++ b/contrib/bench_radix_tree/meson.build
@@ -0,0 +1,33 @@
+bench_radix_tree_sources = files(
+ 'bench_radix_tree.c',
+)
+
+if host_system == 'windows'
+ bench_radix_tree_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'bench_radix_tree',
+ '--FILEDESC', 'bench_radix_tree - performance test code for radix tree',])
+endif
+
+bench_radix_tree = shared_module('bench_radix_tree',
+ bench_radix_tree_sources,
+ link_with: pgport_srv,
+ kwargs: pg_mod_args,
+)
+testprep_targets += bench_radix_tree
+
+install_data(
+ 'bench_radix_tree.control',
+ 'bench_radix_tree--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'bench_radix_tree',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'bench_radix_tree',
+ ],
+ },
+}
diff --git a/contrib/bench_radix_tree/sql/bench.sql b/contrib/bench_radix_tree/sql/bench.sql
new file mode 100644
index 0000000000..a46018c9d4
--- /dev/null
+++ b/contrib/bench_radix_tree/sql/bench.sql
@@ -0,0 +1,16 @@
+create extension bench_radix_tree;
+
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
diff --git a/contrib/meson.build b/contrib/meson.build
index bd4a57c43c..52253de793 100644
--- a/contrib/meson.build
+++ b/contrib/meson.build
@@ -12,6 +12,7 @@ subdir('amcheck')
subdir('auth_delay')
subdir('auto_explain')
subdir('basic_archive')
+#subdir('bench_radix_tree')
subdir('bloom')
subdir('basebackup_to_shell')
subdir('bool_plperl')
--
2.39.1
Attachment: v28-0007-Prevent-inlining-of-interface-functions-for-shme.patch (text/x-patch)
From 54ab02eb2188382185436059ff6e7ad95d970c5d Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Tue, 14 Feb 2023 17:00:31 +0700
Subject: [PATCH v28 07/10] Prevent inlining of interface functions for shmem
---
src/backend/access/common/tidstore.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
index ad8c0866e2..d1b4675ea4 100644
--- a/src/backend/access/common/tidstore.c
+++ b/src/backend/access/common/tidstore.c
@@ -84,7 +84,7 @@
#define RT_PREFIX shared_rt
#define RT_SHMEM
-#define RT_SCOPE static
+#define RT_SCOPE static pg_noinline
#define RT_DECLARE
#define RT_DEFINE
#define RT_VALUE_TYPE uint64
--
2.39.1
Attachment: v28-0009-Speed-up-tidstore_iter_extract_tids.patch (text/x-patch)
From 8ccc66211973bcc44a6bad45c05302ca743c1489 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Tue, 14 Feb 2023 17:53:37 +0700
Subject: [PATCH v28 09/10] Speed up tidstore_iter_extract_tids()
---
src/backend/access/common/tidstore.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
index d1b4675ea4..5a897c01f7 100644
--- a/src/backend/access/common/tidstore.c
+++ b/src/backend/access/common/tidstore.c
@@ -632,21 +632,21 @@ tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, uint64 val)
{
TidStoreIterResult *result = (&iter->result);
- for (int i = 0; i < sizeof(uint64) * BITS_PER_BYTE; i++)
+ while (val)
{
uint64 tid_i;
OffsetNumber off;
- if ((val & (UINT64CONST(1) << i)) == 0)
- continue;
-
tid_i = key << TIDSTORE_VALUE_NBITS;
- tid_i |= i;
+ tid_i |= pg_rightmost_one_pos64(val);
off = tid_i & ((UINT64CONST(1) << iter->ts->control->offset_nbits) - 1);
Assert(result->num_offsets < iter->ts->control->max_offset);
result->offsets[result->num_offsets++] = off;
+
+ /* unset the rightmost bit */
+ val &= ~pg_rightmost_one64(val);
}
result->blkno = key_get_blkno(iter->ts, key);
--
2.39.1
Attachment: v28-0010-Revert-building-benchmark-module-for-CI.patch (text/x-patch)
From 42ba46f8073ee33bc5df6766f74f4c57587b070a Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Tue, 14 Feb 2023 19:31:34 +0700
Subject: [PATCH v28 10/10] Revert building benchmark module for CI
---
contrib/meson.build | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/contrib/meson.build b/contrib/meson.build
index 421d469f8c..52253de793 100644
--- a/contrib/meson.build
+++ b/contrib/meson.build
@@ -12,7 +12,7 @@ subdir('amcheck')
subdir('auth_delay')
subdir('auto_explain')
subdir('basic_archive')
-subdir('bench_radix_tree')
+#subdir('bench_radix_tree')
subdir('bloom')
subdir('basebackup_to_shell')
subdir('bool_plperl')
--
2.39.1
On Tue, Feb 14, 2023 at 8:24 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Mon, Feb 13, 2023 at 2:51 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Sat, Feb 11, 2023 at 2:33 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
I didn't get any closer to radix-tree regression,
Me neither. It seems that in v26, inserting chunks into node-32 is
slow but needs more analysis. I'll share if I found something
interesting.
If that were the case, then the other benchmarks I ran would likely have slowed down as well, but they are the same or faster. There is one microbenchmark I didn't run before: "select * from bench_fixed_height_search(15)" (15 to reduce noise from growing size class, and despite the name it measures load time as well). Trying this now shows no difference: a few runs range 19 to 21ms in each version. That also reinforces that update_inner is fine and that the move to value pointer API didn't regress.
Changing TIDS_PER_BLOCK_FOR_LOAD to 1 to stress the tree more gives (min of 5, perf run separate from measurements):
v15 + v26 store:
mem_allocated | load_ms
---------------+---------
98202152 | 553

 19.71% postgres postgres [.] tidstore_add_tids
+ 31.47% postgres postgres [.] rt_set
= 51.18%

 20.62% postgres postgres [.] rt_node_insert_leaf
  6.05% postgres postgres [.] AllocSetAlloc
  4.74% postgres postgres [.] AllocSetFree
  4.62% postgres postgres [.] palloc
  2.23% postgres postgres [.] SlabAlloc

v26:

mem_allocated | load_ms
---------------+---------
98202032 | 617

 57.45% postgres postgres [.] tidstore_add_tids
 20.67% postgres postgres [.] local_rt_node_insert_leaf
  5.99% postgres postgres [.] AllocSetAlloc
  3.55% postgres postgres [.] palloc
  3.05% postgres postgres [.] AllocSetFree
  2.05% postgres postgres [.] SlabAlloc

So it seems the store itself got faster when we removed shared memory paths from the v26 store to test it against v15.
I thought to favor the local memory case in the tidstore by controlling inlining -- it's smaller and will be called much more often, so I tried the following (done in 0007)
 #define RT_PREFIX shared_rt
 #define RT_SHMEM
-#define RT_SCOPE static
+#define RT_SCOPE static pg_noinline

That brings it down to
mem_allocated | load_ms
---------------+---------
98202032 | 590
The improvement makes sense to me. I've also done the same test (with
changing TIDS_PER_BLOCK_FOR_LOAD to 1):
w/o 0007 patch:
mem_allocated | load_ms | iter_ms
---------------+---------+---------
98202032 | 334 | 445
(1 row)
w/ 0007 patch:
mem_allocated | load_ms | iter_ms
---------------+---------+---------
98202032 | 316 | 434
(1 row)
On the other hand, with TIDS_PER_BLOCK_FOR_LOAD being 30, the load
performance didn't improve:
w/o 0007 patch:
mem_allocated | load_ms | iter_ms
---------------+---------+---------
98202032 | 601 | 608
(1 row)
w/ 0007 patch:
mem_allocated | load_ms | iter_ms
---------------+---------+---------
98202032 | 610 | 606
(1 row)
That being said, it might be within noise level, so I agree with 0007 patch.
Perhaps some slowdown is unavoidable, but it would be nice to understand why.
True.
I can think that something like traversing a HOT chain could visit
offsets out of order. But fortunately we prune such collected TIDs
before heap vacuum in heap case.
Further, currently we *already* assume we populate the tid array in order (for binary search), so we can just continue assuming that (with an assert added since it's more public in this form). I'm not sure why such basic common sense evaded me a few versions ago...
Right. TidStore is implemented not only for heap, so loading
out-of-order TIDs might be important in the future.
If these are acceptable, I can incorporate them into a later patchset.
These are nice improvements! I agree with all changes.
Great, I've squashed these into the tidstore patch (0004). Also added 0005, which is just a simplification.
I've attached some small patches to improve the radix tree and tidstrore:
We have the following WIP comment in test_radixtree:
// WIP: compiles with warnings because rt_attach is defined but not used
// #define RT_SHMEM
How about unsetting RT_SCOPE to suppress warnings for unused rt_attach
and friends?
FYI I've briefly tested the TidStore with blocksize = 32kb, and it
seems to work fine.
I squashed the earlier dead code removal into the radix tree patch.
Thanks!
v27-0008 measures tid store iteration performance and adds a stub function to prevent spurious warnings, so the benchmarking module can always be built.
Getting the list of offsets from the old array for a given block is always trivial, but tidstore_iter_extract_tids() is doing a huge amount of unnecessary work when TIDS_PER_BLOCK_FOR_LOAD is 1, enough to exceed the load time:
mem_allocated | load_ms | iter_ms
---------------+---------+---------
98202032 | 589 | 915

Fortunately, it's an easy fix, done in 0009:
mem_allocated | load_ms | iter_ms
---------------+---------+---------
98202032 | 589 | 153
Cool!
I'll soon resume more cosmetic review of the tid store, but this is enough to post.
Thanks!
You removed the vacuum integration patch from v27, is there any reason for that?
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachments:
v2899-0002-Small-improvements-for-radixtree-and-tests.patch.txt (text/plain)
From f06557689f33d9b11be1083362fcce19665b4014 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Thu, 16 Feb 2023 12:18:22 +0900
Subject: [PATCH v2899 2/2] Small improvements for radixtree and tests.
---
src/include/lib/radixtree.h | 2 +-
src/test/modules/test_radixtree/test_radixtree.c | 13 ++++++++++---
2 files changed, 11 insertions(+), 4 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 1cdb995e54..e546bd705c 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -1622,7 +1622,7 @@ RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p)
/* Descend the tree until we reach a leaf node */
while (shift >= 0)
{
- RT_PTR_ALLOC new_child = RT_INVALID_PTR_ALLOC;;
+ RT_PTR_ALLOC new_child = RT_INVALID_PTR_ALLOC;
child = RT_PTR_GET_LOCAL(tree, stored_child);
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
index f944945db9..afe53382f3 100644
--- a/src/test/modules/test_radixtree/test_radixtree.c
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -107,13 +107,12 @@ static const test_spec test_specs[] = {
/* define the radix tree implementation to test */
#define RT_PREFIX rt
-#define RT_SCOPE static
+#define RT_SCOPE
#define RT_DECLARE
#define RT_DEFINE
#define RT_USE_DELETE
#define RT_VALUE_TYPE TestValueType
-// WIP: compiles with warnings because rt_attach is defined but not used
-// #define RT_SHMEM
+/* #define RT_SHMEM */
#include "lib/radixtree.h"
@@ -142,6 +141,8 @@ test_empty(void)
#ifdef RT_SHMEM
int tranche_id = LWLockNewTrancheId();
dsa_area *dsa;
+
+ LWLockRegisterTranche(tranche_id, "test_radix_tree");
dsa = dsa_create(tranche_id);
radixtree = rt_create(CurrentMemoryContext, dsa, tranche_id);
@@ -188,6 +189,8 @@ test_basic(int children, bool test_inner)
#ifdef RT_SHMEM
int tranche_id = LWLockNewTrancheId();
dsa_area *dsa;
+
+ LWLockRegisterTranche(tranche_id, "test_radix_tree");
dsa = dsa_create(tranche_id);
#endif
@@ -358,6 +361,8 @@ test_node_types(uint8 shift)
#ifdef RT_SHMEM
int tranche_id = LWLockNewTrancheId();
dsa_area *dsa;
+
+ LWLockRegisterTranche(tranche_id, "test_radix_tree");
dsa = dsa_create(tranche_id);
#endif
@@ -406,6 +411,8 @@ test_pattern(const test_spec * spec)
#ifdef RT_SHMEM
int tranche_id = LWLockNewTrancheId();
dsa_area *dsa;
+
+ LWLockRegisterTranche(tranche_id, "test_radix_tree");
dsa = dsa_create(tranche_id);
#endif
--
2.31.1
v2899-0001-comment-update-and-test-the-shared-tidstore.patch.txt (text/plain)
From f6ed6e18b2281cee96af98a39bdfc453117e6a21 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Thu, 16 Feb 2023 12:17:59 +0900
Subject: [PATCH v2899 1/2] comment update and test the shared tidstore.
---
src/backend/access/common/tidstore.c | 19 +++-------
.../modules/test_tidstore/test_tidstore.c | 37 +++++++++++++++++--
2 files changed, 40 insertions(+), 16 deletions(-)
diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
index 015e3dea81..8c05e60d92 100644
--- a/src/backend/access/common/tidstore.c
+++ b/src/backend/access/common/tidstore.c
@@ -64,13 +64,9 @@
* |---------------------------------------------| key
*
* The maximum height of the radix tree is 5 in this case.
- *
- * If the number of bits for offset number fits in a 64-bit value, we don't
- * encode tids but directly use the block number and the offset number as key
- * and value, respectively.
*/
#define TIDSTORE_VALUE_NBITS 6 /* log(64, 2) */
-#define TIDSTORE_OFFSET_MASK ((1 << TIDSTORE_VALUE_NBITS) - 1)
+#define TIDSTORE_OFFSET_MASK ((1 << TIDSTORE_VALUE_NBITS) - 1)
/* A magic value used to identify our TidStores. */
#define TIDSTORE_MAGIC 0x826f6a10
@@ -99,9 +95,10 @@ typedef struct TidStoreControl
/* These values are never changed after creation */
size_t max_bytes; /* the maximum bytes a TidStore can use */
int max_offset; /* the maximum offset number */
- int offset_nbits; /* the number of bits required for max_offset */
- int offset_key_nbits; /* the number of bits of a offset number
- * used for the key */
+ int offset_nbits; /* the number of bits required for an offset
+ * number */
+ int offset_key_nbits; /* the number of bits of an offset number
+ * used in a key */
/* The below fields are used only in shared case */
@@ -227,10 +224,6 @@ tidstore_create(size_t max_bytes, int max_offset, dsa_area *area)
if (ts->control->offset_nbits < TIDSTORE_VALUE_NBITS)
ts->control->offset_nbits = TIDSTORE_VALUE_NBITS;
- /*
- * We use tid encoding if the number of bits for the offset number doesn't
- * fix in a value, uint64.
- */
ts->control->offset_key_nbits =
ts->control->offset_nbits - TIDSTORE_VALUE_NBITS;
@@ -379,7 +372,7 @@ tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
{
uint64 off_bit;
- /* encode the tid to key and val */
+ /* encode the tid to a key and partial offset */
key = encode_key_off(ts, blkno, offsets[i], &off_bit);
/* make sure we scanned the line pointer array in order */
diff --git a/src/test/modules/test_tidstore/test_tidstore.c b/src/test/modules/test_tidstore/test_tidstore.c
index 9b849ae8e8..9a1217f833 100644
--- a/src/test/modules/test_tidstore/test_tidstore.c
+++ b/src/test/modules/test_tidstore/test_tidstore.c
@@ -18,10 +18,13 @@
#include "miscadmin.h"
#include "storage/block.h"
#include "storage/itemptr.h"
+#include "storage/lwlock.h"
#include "utils/memutils.h"
PG_MODULE_MAGIC;
+/* #define TEST_SHARED_TIDSTORE 1 */
+
#define TEST_TIDSTORE_MAX_BYTES (2 * 1024 * 1024L) /* 2MB */
PG_FUNCTION_INFO_V1(test_tidstore);
@@ -59,6 +62,18 @@ test_basic(int max_offset)
OffsetNumber offs[TEST_TIDSTORE_NUM_OFFSETS];
int blk_idx;
+#ifdef TEST_SHARED_TIDSTORE
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+
+ LWLockRegisterTranche(tranche_id, "test_tidstore");
+ dsa = dsa_create(tranche_id);
+
+ ts = tidstore_create(TEST_TIDSTORE_MAX_BYTES, max_offset, dsa);
+#else
+ ts = tidstore_create(TEST_TIDSTORE_MAX_BYTES, max_offset, NULL);
+#endif
+
/* prepare the offset array */
offs[0] = FirstOffsetNumber;
offs[1] = FirstOffsetNumber + 1;
@@ -66,8 +81,6 @@ test_basic(int max_offset)
offs[3] = max_offset - 1;
offs[4] = max_offset;
- ts = tidstore_create(TEST_TIDSTORE_MAX_BYTES, max_offset, NULL);
-
/* add tids */
for (int i = 0; i < TEST_TIDSTORE_NUM_BLOCKS; i++)
tidstore_add_tids(ts, blks[i], offs, TEST_TIDSTORE_NUM_OFFSETS);
@@ -144,6 +157,10 @@ test_basic(int max_offset)
}
tidstore_destroy(ts);
+
+#ifdef TEST_SHARED_TIDSTORE
+ dsa_detach(dsa);
+#endif
}
static void
@@ -153,9 +170,19 @@ test_empty(void)
TidStoreIter *iter;
ItemPointerData tid;
- elog(NOTICE, "testing empty tidstore");
+#ifdef TEST_SHARED_TIDSTORE
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+
+ LWLockRegisterTranche(tranche_id, "test_tidstore");
+ dsa = dsa_create(tranche_id);
+ ts = tidstore_create(TEST_TIDSTORE_MAX_BYTES, MaxHeapTuplesPerPage, dsa);
+#else
ts = tidstore_create(TEST_TIDSTORE_MAX_BYTES, MaxHeapTuplesPerPage, NULL);
+#endif
+
+ elog(NOTICE, "testing empty tidstore");
ItemPointerSet(&tid, 0, FirstOffsetNumber);
if (tidstore_lookup_tid(ts, &tid))
@@ -180,6 +207,10 @@ test_empty(void)
tidstore_end_iterate(iter);
tidstore_destroy(ts);
+
+#ifdef TEST_SHARED_TIDSTORE
+ dsa_detach(dsa);
+#endif
}
Datum
--
2.31.1
On Thu, Feb 16, 2023 at 10:24 AM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
On Tue, Feb 14, 2023 at 8:24 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
I can think that something like traversing a HOT chain could visit
offsets out of order. But fortunately we prune such collected TIDs
before heap vacuum in heap case.
Further, currently we *already* assume we populate the tid array in
order (for binary search), so we can just continue assuming that (with an
assert added since it's more public in this form). I'm not sure why such
basic common sense evaded me a few versions ago...
Right. TidStore is implemented not only for heap, so loading
out-of-order TIDs might be important in the future.
That's what I was probably thinking about some weeks ago, but I'm having a
hard time imagining how it would come up, even for something like the
conveyor-belt concept.
We have the following WIP comment in test_radixtree:
// WIP: compiles with warnings because rt_attach is defined but not used
// #define RT_SHMEM
How about unsetting RT_SCOPE to suppress warnings for unused rt_attach
and friends?
Sounds good to me, and the other fixes make sense as well.
FYI I've briefly tested the TidStore with blocksize = 32kb, and it
seems to work fine.
That was on my list, so great! How about the other end -- nominally we
allow 512b. (In practice it won't matter, but this would make sure I didn't
mess anything up when forcing all MaxTuplesPerPage to encode.)
You removed the vacuum integration patch from v27, is there any reason
for that?
Just an oversight.
Now for some general comments on the tid store...
+ * TODO: The caller must be certain that no other backend will attempt to
+ * access the TidStore before calling this function. Other backend must
+ * explicitly call tidstore_detach to free up backend-local memory associated
+ * with the TidStore. The backend that calls tidstore_destroy must not call
+ * tidstore_detach.
+ */
+void
+tidstore_destroy(TidStore *ts)
Do we need to do anything for this todo?
It might help readability to have a concept of "off_upper/off_lower", just
so we can describe things more clearly. The key is block + off_upper, and
the value is a bitmap of all the off_lower bits. I hinted at that in my
addition of encode_key_off(). Along those lines, maybe
s/TIDSTORE_OFFSET_MASK/TIDSTORE_OFFSET_LOWER_MASK/. Actually, I'm not even
sure the TIDSTORE_ prefix is valuable for these local macros.
The word "value" as a variable name is pretty generic in this context, and
it might be better to call it the off_lower_bitmap, at least in some
places. The "key" doesn't have a good short term for naming, but in
comments we should make sure we're clear it's "block# + off_upper".
I'm not a fan of the name "tid_i", even as a temp variable -- maybe
"compressed_tid"?
maybe s/tid_to_key_off/encode_tid/ and s/encode_key_off/encode_block_offset/
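To make that concrete, here is a minimal sketch of the encoding being discussed, using the suggested names -- the macros, the offset_nbits parameter, and the function body are illustrative placeholders, not what the patch currently defines:

/* illustrative sketch only -- names and layout are placeholders */
#define OFF_LOWER_NBITS 6   /* log2 of the 64 bits in one bitmap value */
#define OFF_LOWER_MASK  ((1 << OFF_LOWER_NBITS) - 1)

static inline uint64
encode_block_offset(BlockNumber block, OffsetNumber offset,
                    int offset_nbits,   /* bits needed to hold max_offset */
                    uint64 *off_lower_bitmap)
{
    /* pack block and offset into one integer, offset in the low bits */
    uint64  compressed_tid = ((uint64) block << offset_nbits) | offset;

    /* off_lower selects a bit within the 64-bit value... */
    *off_lower_bitmap = UINT64CONST(1) << (compressed_tid & OFF_LOWER_MASK);

    /* ...and block# + off_upper becomes the radix tree key */
    return compressed_tid >> OFF_LOWER_NBITS;
}

With that shape, "key" is unambiguous shorthand for block# + off_upper, and the value stored in the tree is purely the off_lower bitmap.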
It might be worth using typedefs for key and value type. Actually, since
key type is fixed for the foreseeable future, maybe the radix tree template
should define a key typedef?
The term "result" is probably fine within the tidstore, but as a public
name used by vacuum, it's not very descriptive. I don't have a good idea,
though.
Some files in backend/access use CamelCase for public functions, although
it's not consistent. I think doing that for tidstore would help
readability, since they would stand out from rt_* functions and vacuum
functions. It's a matter of taste, though.
I don't understand the control flow in tidstore_iterate_next(), or when
BlockNumberIsValid() is true. If this is the best way to code this, it
needs more commentary.
Some comments on vacuum:
I think we'd better get some real-world testing of this, fairly soon.
I had an idea: If it's not too much effort, it might be worth splitting it
into two parts: one that just adds the store (not caring about its memory
limits or progress reporting etc). During index scan, check both the new
store and the array and log a warning (we don't want to exit or crash,
better to try to investigate while live if possible) if the result doesn't
match. Then perhaps set up an instance and let something like TPC-C run for
a few days. The second patch would just restore the rest of the current
patch. That would help reassure us it's working as designed. Soon I plan to
do some measurements with vacuuming large tables to get some concrete
numbers that the community can get excited about.
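A rough sketch of what that cross-check could look like in the index-vacuum callback -- tidstore_lookup_tid() is from the posted patches and LVRelState is the existing vacuum state, but dead_items_array_lookup() and the field names are hypothetical placeholders:

static bool
lazy_tid_reaped_crosscheck(ItemPointer itemptr, void *state)
{
    LVRelState *vacrel = (LVRelState *) state;

    /* hypothetical: old-style bsearch over the existing dead-TID array */
    bool    in_array = dead_items_array_lookup(vacrel, itemptr);
    /* new lookup against the TidStore */
    bool    in_store = tidstore_lookup_tid(vacrel->dead_items, itemptr);

    if (in_array != in_store)
        elog(WARNING, "dead TID (%u,%u): array says %d, tidstore says %d",
             ItemPointerGetBlockNumber(itemptr),
             ItemPointerGetOffsetNumber(itemptr),
             in_array, in_store);

    /* keep the array's answer authoritative while investigating */
    return in_array;
}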
We also want to verify that progress reporting works as designed and has no
weird corner cases.
* autovacuum_work_mem) memory space to keep track of dead TIDs. We
initially
...
+ * create a TidStore with the maximum bytes that can be used by the TidStore.
This kind of implies that we allocate the maximum bytes upfront. I think
this sentence can be removed. We already mentioned in the previous
paragraph that we set an upper bound.
- (errmsg("table \"%s\": removed %lld dead item identifiers in %u pages",
- vacrel->relname, (long long) index, vacuumed_pages)));
+ (errmsg("table \"%s\": removed " UINT64_FORMAT "dead item identifiers in
%u pages",
+ vacrel->relname, tidstore_num_tids(vacrel->dead_items),
+ vacuumed_pages)));
I don't think the format string has to change, since num_tids was changed
back to int64 in an earlier patch version?
- * the memory space for storing dead items allocated in the DSM segment.
We
[a lot of whitespace adjustment]
+ * the shared TidStore. We launch parallel worker processes at the start of
The old comment still seems mostly ok? Maybe just s/DSM segment/DSA area/
or something else minor.
- /* Estimate size for dead_items -- PARALLEL_VACUUM_KEY_DEAD_ITEMS */
- est_dead_items_len = vac_max_items_to_alloc_size(max_items);
- shm_toc_estimate_chunk(&pcxt->estimator, est_dead_items_len);
+ /* Estimate size for dead tuple DSA -- PARALLEL_VACUUM_KEY_DSA */
+ shm_toc_estimate_chunk(&pcxt->estimator, dsa_minsize);
If we're starting from the minimum, "estimate" doesn't really describe it
anymore? Maybe "Initial size"?
What does dsa_minimum_size() work out to in practice? 1MB?
Also, I think PARALLEL_VACUUM_KEY_DSA is left over from an earlier patch.
Lastly, on the radix tree:
I find extend, set, and set_extend hard to keep straight when studying the
code. Maybe EXTEND -> EXTEND_UP , SET_EXTEND -> EXTEND_DOWN ?
RT_ITER_UPDATE_KEY is unused, but I somehow didn't notice when turning it
into a template.
+ /*
+ * Set the node to the node iterator and update the iterator stack
+ * from this node.
+ */
+ RT_UPDATE_ITER_STACK(iter, child, level - 1);
+/*
+ * Update each node_iter for inner nodes in the iterator node stack.
+ */
+static void
+RT_UPDATE_ITER_STACK(RT_ITER *iter, RT_PTR_LOCAL from_node, int from)
These comments don't really help readers unfamiliar with the code. The
iteration coding in general needs clearer description.
In the test:
+ 4, /* RT_NODE_KIND_4 */
The small size was changed to 3 -- if this test needs to know the max size
for each kind (class?), I wonder why it didn't fail. Should it? Maybe we
need symbols for the various fanouts.
I also want to mention now that we better decide soon if we want to support
shrinking of nodes for v16, even if the tidstore never shrinks. We'll need
to do it at some point, but I'm not sure if doing it now would make more
work for future changes targeting highly concurrent workloads. If so, doing
it now would just be wasted work. On the other hand, someone might have a
use that needs deletion before someone else needs concurrency. Just in
case, I have a start of node-shrinking logic, but needs some work because
we need the (local pointer) parent to update to the new smaller node, just
like the growing case.
--
John Naylor
EDB: http://www.enterprisedb.com
Hi,
On 2023-02-16 16:22:56 +0700, John Naylor wrote:
On Thu, Feb 16, 2023 at 10:24 AM Masahiko Sawada <sawada.mshk@gmail.com>
Right. TidStore is implemented not only for heap, so loading
out-of-order TIDs might be important in the future.

That's what I was probably thinking about some weeks ago, but I'm having a
hard time imagining how it would come up, even for something like the
conveyor-belt concept.
We really ought to replace the tid bitmap used for bitmap heap scans. The
hashtable we use is a pretty awful data structure for it. And that's not
filled in-order, for example.
Greetings,
Andres Freund
On Thu, Feb 16, 2023 at 11:44 PM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2023-02-16 16:22:56 +0700, John Naylor wrote:
On Thu, Feb 16, 2023 at 10:24 AM Masahiko Sawada <sawada.mshk@gmail.com>
Right. TidStore is implemented not only for heap, so loading
out-of-order TIDs might be important in the future.

That's what I was probably thinking about some weeks ago, but I'm having a
hard time imagining how it would come up, even for something like the
conveyor-belt concept.

We really ought to replace the tid bitmap used for bitmap heap scans. The
hashtable we use is a pretty awful data structure for it. And that's not
filled in-order, for example.
I took a brief look at that and agree we should sometime make it work there
as well.
v26 tidstore_add_tids() appears to assume that it's only called once per
blocknumber. While the order of offsets doesn't matter there for a single
block, calling it again with the same block would wipe out the earlier
offsets, IIUC. To do an actual "add tid" where the order doesn't matter, it
seems we would need to (acquire lock if needed), read the current bitmap
and OR in the new bit if it exists, then write it back out.
That sounds slow, so it might still be good for vacuum to call a function
that passes a block and an array of offsets that are assumed ordered (as in
v28), but with a more accurate name, like tidstore_set_block_offsets().
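To illustrate what that would look like at the call site, here is a rough sketch (the wrapper function and the Assert checking are just for illustration, not actual patch code):

    static void
    dead_items_record_page(TidStore *dead_items, BlockNumber blkno,
                           OffsetNumber *offsets, int num_offsets)
    {
        Assert(num_offsets > 0 && num_offsets <= MaxHeapTuplesPerPage);

    #ifdef USE_ASSERT_CHECKING
        /* the caller passes each block exactly once, with offsets in order */
        for (int i = 1; i < num_offsets; i++)
            Assert(offsets[i] > offsets[i - 1]);
    #endif

        /* one call per heap block; replaces, rather than ORs into, the entry */
        tidstore_set_block_offsets(dead_items, blkno, offsets, num_offsets);
    }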
--
John Naylor
EDB: http://www.enterprisedb.com
On Thu, Feb 16, 2023 at 6:23 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Thu, Feb 16, 2023 at 10:24 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Tue, Feb 14, 2023 at 8:24 PM John Naylor
<john.naylor@enterprisedb.com> wrote:

I can think that something like traversing a HOT chain could visit
offsets out of order. But fortunately we prune such collected TIDs
before heap vacuum in heap case.

Further, currently we *already* assume we populate the tid array in order (for binary search), so we can just continue assuming that (with an assert added since it's more public in this form). I'm not sure why such basic common sense evaded me a few versions ago...

Right. TidStore is implemented not only for heap, so loading
out-of-order TIDs might be important in the future.

That's what I was probably thinking about some weeks ago, but I'm having a hard time imagining how it would come up, even for something like the conveyor-belt concept.
We have the following WIP comment in test_radixtree:
// WIP: compiles with warnings because rt_attach is defined but not used
// #define RT_SHMEM

How about unsetting RT_SCOPE to suppress warnings for unused rt_attach
and friends?

Sounds good to me, and the other fixes make sense as well.
Thanks, I merged them.
FYI I've briefly tested the TidStore with blocksize = 32kb, and it
seems to work fine.

That was on my list, so great! How about the other end -- nominally we allow 512b. (In practice it won't matter, but this would make sure I didn't mess anything up when forcing all MaxTuplesPerPage to encode.)
According to the doc, the minimum block size is 1kB. It seems to work
fine with 1kB blocks.
You removed the vacuum integration patch from v27, is there any reason for that?
Just an oversight.
Now for some general comments on the tid store...
+ * TODO: The caller must be certain that no other backend will attempt to
+ * access the TidStore before calling this function. Other backend must
+ * explicitly call tidstore_detach to free up backend-local memory associated
+ * with the TidStore. The backend that calls tidstore_destroy must not call
+ * tidstore_detach.
+ */
+void
+tidstore_destroy(TidStore *ts)

Do we need to do anything for this todo?
Since it's practically no problem, I think we can live with it for
now. dshash also has the same todo.
It might help readability to have a concept of "off_upper/off_lower", just so we can describe things more clearly. The key is block + off_upper, and the value is a bitmap of all the off_lower bits. I hinted at that in my addition of encode_key_off(). Along those lines, maybe s/TIDSTORE_OFFSET_MASK/TIDSTORE_OFFSET_LOWER_MASK/. Actually, I'm not even sure the TIDSTORE_ prefix is valuable for these local macros.
The word "value" as a variable name is pretty generic in this context, and it might be better to call it the off_lower_bitmap, at least in some places. The "key" doesn't have a good short term for naming, but in comments we should make sure we're clear it's "block# + off_upper".
I'm not a fan of the name "tid_i", even as a temp variable -- maybe "compressed_tid"?
maybe s/tid_to_key_off/encode_tid/ and s/encode_key_off/encode_block_offset/
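To make those names concrete, here is a rough sketch of the encoding as described above. The bit widths and macro values below are made up for illustration, not taken from the patch:

    typedef uint64 tidkey;
    typedef uint64 offsetbm;        /* bitmap of off_lower bits */

    #define LOWER_OFFSET_NBITS  6   /* log2(bits in offsetbm); made-up value */
    #define LOWER_OFFSET_MASK   ((1 << LOWER_OFFSET_NBITS) - 1)
    #define OFFSET_NBITS        11  /* wide enough for MaxHeapTuplesPerPage; made-up */

    static inline tidkey
    encode_tid(BlockNumber block, OffsetNumber offset, int *off_lower)
    {
        uint64      compressed_tid;

        /* pack block number and offset into a single integer... */
        compressed_tid = ((uint64) block << OFFSET_NBITS) | offset;

        /* ...the low bits select a bit position in the value word... */
        *off_lower = (int) (compressed_tid & LOWER_OFFSET_MASK);

        /* ...and the rest (block# + off_upper) becomes the radix tree key */
        return (tidkey) (compressed_tid >> LOWER_OFFSET_NBITS);
    }

The value stored under that key would then be an offsetbm with bit (UINT64CONST(1) << off_lower) set.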
It might be worth using typedefs for key and value type. Actually, since key type is fixed for the foreseeable future, maybe the radix tree template should define a key typedef?
The term "result" is probably fine within the tidstore, but as a public name used by vacuum, it's not very descriptive. I don't have a good idea, though.
Some files in backend/access use CamelCase for public functions, although it's not consistent. I think doing that for tidstore would help readability, since they would stand out from rt_* functions and vacuum functions. It's a matter of taste, though.
I don't understand the control flow in tidstore_iterate_next(), or when BlockNumberIsValid() is true. If this is the best way to code this, it needs more commentary.
The attached 0008 patch addressed all above comments on tidstore.
Some comments on vacuum:
I think we'd better get some real-world testing of this, fairly soon.
I had an idea: If it's not too much effort, it might be worth splitting it into two parts: one that just adds the store (not caring about its memory limits or progress reporting etc). During index scan, check both the new store and the array and log a warning (we don't want to exit or crash, better to try to investigate while live if possible) if the result doesn't match. Then perhaps set up an instance and let something like TPC-C run for a few days. The second patch would just restore the rest of the current patch. That would help reassure us it's working as designed.
Yeah, I did a similar thing in an earlier version of tidstore patch.
Since we're trying to introduce two new components: radix tree and
tidstore, I sometimes find it hard to investigate failures happening
during lazy (parallel) vacuum due to a bug either in tidstore or radix
tree. If there is a bug in lazy vacuum, we cannot even do initdb. So
it might be a good idea to do such checks in USE_ASSERT_CHECKING (or
with another macro say DEBUG_TIDSTORE) builds. For example, TidStore
stores tids to both the radix tree and array, and checks if the
results match when lookup or iteration. It will use more memory but it
would not be a big problem in USE_ASSERT_CHECKING builds. It would
also be great if we can enable such checks on some bf animals.
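For example, the extra checking could look something like the following sketch. DEBUG_TIDSTORE and the debug_tids/debug_num_tids shadow fields are hypothetical names, just to illustrate the idea:

    #ifdef DEBUG_TIDSTORE
    /* comparator with the signature bsearch() expects */
    static int
    debug_cmp_itemptr(const void *a, const void *b)
    {
        return ItemPointerCompare((ItemPointer) a, (ItemPointer) b);
    }

    /* cross-check a radix tree lookup against the shadow array */
    static void
    tidstore_debug_verify_lookup(TidStore *ts, ItemPointer tid, bool tree_found)
    {
        bool        array_found;

        array_found = bsearch(tid, ts->debug_tids, ts->debug_num_tids,
                              sizeof(ItemPointerData), debug_cmp_itemptr) != NULL;

        /* the radix tree and the flat array must always agree */
        Assert(tree_found == array_found);
    }
    #endif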
Soon I plan to do some measurements with vacuuming large tables to get some concrete numbers that the community can get excited about.
Thanks!
We also want to verify that progress reporting works as designed and has no weird corner cases.
* autovacuum_work_mem) memory space to keep track of dead TIDs. We initially
...
+ * create a TidStore with the maximum bytes that can be used by the TidStore.

This kind of implies that we allocate the maximum bytes upfront. I think this sentence can be removed. We already mentioned in the previous paragraph that we set an upper bound.
Agreed.
- (errmsg("table \"%s\": removed %lld dead item identifiers in %u pages",
- vacrel->relname, (long long) index, vacuumed_pages)));
+ (errmsg("table \"%s\": removed " UINT64_FORMAT "dead item identifiers in %u pages",
+ vacrel->relname, tidstore_num_tids(vacrel->dead_items),
+ vacuumed_pages)));

I don't think the format string has to change, since num_tids was changed back to int64 in an earlier patch version?
I think we need to change the format to INT64_FORMAT.
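That is, something like the following (assuming tidstore_num_tids() returns int64), which would also restore the missing space before "dead":

    ereport(DEBUG2,
            (errmsg("table \"%s\": removed " INT64_FORMAT " dead item identifiers in %u pages",
                    vacrel->relname, tidstore_num_tids(vacrel->dead_items),
                    vacuumed_pages)));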
- * the memory space for storing dead items allocated in the DSM segment. We
[a lot of whitespace adjustment]
+ * the shared TidStore. We launch parallel worker processes at the start of

The old comment still seems mostly ok? Maybe just s/DSM segment/DSA area/ or something else minor.
- /* Estimate size for dead_items -- PARALLEL_VACUUM_KEY_DEAD_ITEMS */
- est_dead_items_len = vac_max_items_to_alloc_size(max_items);
- shm_toc_estimate_chunk(&pcxt->estimator, est_dead_items_len);
+ /* Estimate size for dead tuple DSA -- PARALLEL_VACUUM_KEY_DSA */
+ shm_toc_estimate_chunk(&pcxt->estimator, dsa_minsize);

If we're starting from the minimum, "estimate" doesn't really describe it anymore? Maybe "Initial size"?
What does dsa_minimum_size() work out to in practice? 1MB?
Also, I think PARALLEL_VACUUM_KEY_DSA is left over from an earlier patch.
Right. The attached 0009 patch addressed comments on vacuum
integration except for the correctness checking.
Lastly, on the radix tree:
I find extend, set, and set_extend hard to keep straight when studying the code. Maybe EXTEND -> EXTEND_UP , SET_EXTEND -> EXTEND_DOWN ?
RT_ITER_UPDATE_KEY is unused, but I somehow didn't notice when turning it into a template.
It was used in radixtree_iter_impl.h. But I removed it as it was not necessary.
+ /*
+ * Set the node to the node iterator and update the iterator stack
+ * from this node.
+ */
+ RT_UPDATE_ITER_STACK(iter, child, level - 1);

+/*
+ * Update each node_iter for inner nodes in the iterator node stack.
+ */
+static void
+RT_UPDATE_ITER_STACK(RT_ITER *iter, RT_PTR_LOCAL from_node, int from)

These comments don't really help readers unfamiliar with the code. The iteration coding in general needs clearer description.
I agree with all of the above comments. The attached 0007 patch
addressed comments on the radix tree.
In the test:
+ 4, /* RT_NODE_KIND_4 */
The small size was changed to 3 -- if this test needs to know the max size for each kind (class?), I wonder why it didn't fail. Should it? Maybe we need symbols for the various fanouts.
Since this information is only used to determine the number of keys
inserted, it doesn't check the node kind. So we just didn't test node-3. It might
be better to expose and use both RT_SIZE_CLASS and RT_SIZE_CLASS_INFO.
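For instance, the test could then loop over the size classes rather than hard-coding counts. This is only a sketch; RT_SIZE_CLASS_COUNT, the fanout field, and test_node_growth() are assumed names:

    for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
    {
        int     fanout = RT_SIZE_CLASS_INFO[i].fanout;

        /* insert exactly 'fanout' keys so every size class gets exercised */
        test_node_growth(fanout);
    }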
I also want to mention now that we better decide soon if we want to support shrinking of nodes for v16, even if the tidstore never shrinks. We'll need to do it at some point, but I'm not sure if doing it now would make more work for future changes targeting highly concurrent workloads. If so, doing it now would just be wasted work. On the other hand, someone might have a use that needs deletion before someone else needs concurrency. Just in case, I have a start of node-shrinking logic, but needs some work because we need the (local pointer) parent to update to the new smaller node, just like the growing case.
Thanks, that's also on my todo list. TBH I'm not sure we should
improve the deletion at this stage as there is no use case of deletion
in the core. I'd prefer to focus on improving the quality of the
current radix tree and tidstore now, and I think we can support
node-shrinking once we are confident with the current implementation.
On Fri, Feb 17, 2023 at 5:00 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
That sounds slow, so it might still be good for vacuum to call a function that passes a block and an array of offsets that are assumed ordered (as in v28), but with a more accurate name, like tidstore_set_block_offsets().
tidstore_set_block_offsets() sounds better. I used
TidStoreSetBlockOffsets() in the latest patch set.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachments:
v29-0006-Use-TIDStore-for-storing-dead-tuple-TID-during-l.patch
From b9883174cb69d87e6c9fdccb33ae29d5f084cd8e Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Tue, 7 Feb 2023 17:19:29 +0700
Subject: [PATCH v29 06/10] Use TIDStore for storing dead tuple TID during lazy
vacuum
Previously, we used an array of ItemPointerData to store dead tuple
TIDs, which was not space efficient and slow to lookup. Also, we had
the 1GB limit on its size.
Now we use TIDStore to store dead tuple TIDs. Since the TIDStore,
backed by the radix tree, incrementally allocates memory, we get
rid of the 1GB limit.
Since we are no longer able to exactly estimate the maximum number of
TIDs that can be stored, pg_stat_progress_vacuum shows the progress
information based on the amount of memory in bytes. The column names
are also changed to max_dead_tuple_bytes and num_dead_tuple_bytes.
In addition, since the TIDStore uses the radix tree internally, the
minimum amount of memory required by TIDStore is 1MB, the initial DSA
segment size. Due to that, we increase the minimum value of
maintenance_work_mem (also autovacuum_work_mem) from 1MB to 2MB.
XXX: needs to bump catalog version
---
doc/src/sgml/monitoring.sgml | 8 +-
src/backend/access/heap/vacuumlazy.c | 278 ++++++++-------------
src/backend/catalog/system_views.sql | 2 +-
src/backend/commands/vacuum.c | 78 +-----
src/backend/commands/vacuumparallel.c | 73 +++---
src/backend/postmaster/autovacuum.c | 6 +-
src/backend/storage/lmgr/lwlock.c | 2 +
src/backend/utils/misc/guc_tables.c | 2 +-
src/include/commands/progress.h | 4 +-
src/include/commands/vacuum.h | 25 +-
src/include/storage/lwlock.h | 1 +
src/test/regress/expected/cluster.out | 2 +-
src/test/regress/expected/create_index.out | 2 +-
src/test/regress/expected/rules.out | 4 +-
src/test/regress/sql/cluster.sql | 2 +-
src/test/regress/sql/create_index.sql | 2 +-
16 files changed, 177 insertions(+), 314 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index e28206e056..1d84e17705 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -7165,10 +7165,10 @@ FROM pg_stat_get_backend_idset() AS backendid;
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>max_dead_tuples</structfield> <type>bigint</type>
+ <structfield>max_dead_tuple_bytes</structfield> <type>bigint</type>
</para>
<para>
- Number of dead tuples that we can store before needing to perform
+ Amount of dead tuple data that we can store before needing to perform
an index vacuum cycle, based on
<xref linkend="guc-maintenance-work-mem"/>.
</para></entry>
@@ -7176,10 +7176,10 @@ FROM pg_stat_get_backend_idset() AS backendid;
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>num_dead_tuples</structfield> <type>bigint</type>
+ <structfield>num_dead_tuple_bytes</structfield> <type>bigint</type>
</para>
<para>
- Number of dead tuples collected since the last index vacuum cycle.
+ Amount of dead tuple data collected since the last index vacuum cycle.
</para></entry>
</row>
</tbody>
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 8f14cf85f3..b4e40423a8 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -3,18 +3,18 @@
* vacuumlazy.c
* Concurrent ("lazy") vacuuming.
*
- * The major space usage for vacuuming is storage for the array of dead TIDs
+ * The major space usage for vacuuming is TidStore, a storage for dead TIDs
* that are to be removed from indexes. We want to ensure we can vacuum even
* the very largest relations with finite memory space usage. To do that, we
- * set upper bounds on the number of TIDs we can keep track of at once.
+ * set upper bounds on the maximum memory that can be used for keeping track
+ * of dead TIDs at once.
*
* We are willing to use at most maintenance_work_mem (or perhaps
* autovacuum_work_mem) memory space to keep track of dead TIDs. We initially
- * allocate an array of TIDs of that size, with an upper limit that depends on
- * table size (this limit ensures we don't allocate a huge area uselessly for
- * vacuuming small tables). If the array threatens to overflow, we must call
- * lazy_vacuum to vacuum indexes (and to vacuum the pages that we've pruned).
- * This frees up the memory space dedicated to storing dead TIDs.
+ * create a TidStore with the maximum bytes that can be used by the TidStore.
+ * If the TidStore is full, we must call lazy_vacuum to vacuum indexes (and to
+ * vacuum the pages that we've pruned). This frees up the memory space dedicated
+ * to storing dead TIDs.
*
* In practice VACUUM will often complete its initial pass over the target
* heap relation without ever running out of space to store TIDs. This means
@@ -40,6 +40,7 @@
#include "access/heapam_xlog.h"
#include "access/htup_details.h"
#include "access/multixact.h"
+#include "access/tidstore.h"
#include "access/transam.h"
#include "access/visibilitymap.h"
#include "access/xact.h"
@@ -188,7 +189,7 @@ typedef struct LVRelState
* lazy_vacuum_heap_rel, which marks the same LP_DEAD line pointers as
* LP_UNUSED during second heap pass.
*/
- VacDeadItems *dead_items; /* TIDs whose index tuples we'll delete */
+ TidStore *dead_items; /* TIDs whose index tuples we'll delete */
BlockNumber rel_pages; /* total number of pages */
BlockNumber scanned_pages; /* # pages examined (not skipped via VM) */
BlockNumber removed_pages; /* # pages removed by relation truncation */
@@ -220,11 +221,14 @@ typedef struct LVRelState
typedef struct LVPagePruneState
{
bool hastup; /* Page prevents rel truncation? */
- bool has_lpdead_items; /* includes existing LP_DEAD items */
+
+ /* collected offsets of LP_DEAD items including existing ones */
+ OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
+ int num_offsets;
/*
* State describes the proper VM bit states to set for the page following
- * pruning and freezing. all_visible implies !has_lpdead_items, but don't
+ * pruning and freezing. all_visible implies num_offsets == 0, but don't
* trust all_frozen result unless all_visible is also set to true.
*/
bool all_visible; /* Every item visible to all? */
@@ -259,8 +263,9 @@ static bool lazy_scan_noprune(LVRelState *vacrel, Buffer buf,
static void lazy_vacuum(LVRelState *vacrel);
static bool lazy_vacuum_all_indexes(LVRelState *vacrel);
static void lazy_vacuum_heap_rel(LVRelState *vacrel);
-static int lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
- Buffer buffer, int index, Buffer vmbuffer);
+static void lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
+ OffsetNumber *offsets, int num_offsets,
+ Buffer buffer, Buffer vmbuffer);
static bool lazy_check_wraparound_failsafe(LVRelState *vacrel);
static void lazy_cleanup_all_indexes(LVRelState *vacrel);
static IndexBulkDeleteResult *lazy_vacuum_one_index(Relation indrel,
@@ -487,11 +492,11 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
}
/*
- * Allocate dead_items array memory using dead_items_alloc. This handles
- * parallel VACUUM initialization as part of allocating shared memory
- * space used for dead_items. (But do a failsafe precheck first, to
- * ensure that parallel VACUUM won't be attempted at all when relfrozenxid
- * is already dangerously old.)
+ * Allocate dead_items memory using dead_items_alloc. This handles parallel
+ * VACUUM initialization as part of allocating shared memory space used for
+ * dead_items. (But do a failsafe precheck first, to ensure that parallel
+ * VACUUM won't be attempted at all when relfrozenxid is already dangerously
+ * old.)
*/
lazy_check_wraparound_failsafe(vacrel);
dead_items_alloc(vacrel, params->nworkers);
@@ -797,7 +802,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
* have collected the TIDs whose index tuples need to be removed.
*
* Finally, invokes lazy_vacuum_heap_rel to vacuum heap pages, which
- * largely consists of marking LP_DEAD items (from collected TID array)
+ * largely consists of marking LP_DEAD items (from vacrel->dead_items)
* as LP_UNUSED. This has to happen in a second, final pass over the
* heap, to preserve a basic invariant that all index AMs rely on: no
* extant index tuple can ever be allowed to contain a TID that points to
@@ -825,21 +830,21 @@ lazy_scan_heap(LVRelState *vacrel)
blkno,
next_unskippable_block,
next_fsm_block_to_vacuum = 0;
- VacDeadItems *dead_items = vacrel->dead_items;
+ TidStore *dead_items = vacrel->dead_items;
Buffer vmbuffer = InvalidBuffer;
bool next_unskippable_allvis,
skipping_current_range;
const int initprog_index[] = {
PROGRESS_VACUUM_PHASE,
PROGRESS_VACUUM_TOTAL_HEAP_BLKS,
- PROGRESS_VACUUM_MAX_DEAD_TUPLES
+ PROGRESS_VACUUM_MAX_DEAD_TUPLE_BYTES
};
int64 initprog_val[3];
/* Report that we're scanning the heap, advertising total # of blocks */
initprog_val[0] = PROGRESS_VACUUM_PHASE_SCAN_HEAP;
initprog_val[1] = rel_pages;
- initprog_val[2] = dead_items->max_items;
+ initprog_val[2] = tidstore_max_memory(vacrel->dead_items);
pgstat_progress_update_multi_param(3, initprog_index, initprog_val);
/* Set up an initial range of skippable blocks using the visibility map */
@@ -906,8 +911,7 @@ lazy_scan_heap(LVRelState *vacrel)
* dead_items TIDs, pause and do a cycle of vacuuming before we tackle
* this page.
*/
- Assert(dead_items->max_items >= MaxHeapTuplesPerPage);
- if (dead_items->max_items - dead_items->num_items < MaxHeapTuplesPerPage)
+ if (tidstore_is_full(vacrel->dead_items))
{
/*
* Before beginning index vacuuming, we release any pin we may
@@ -969,7 +973,7 @@ lazy_scan_heap(LVRelState *vacrel)
continue;
}
- /* Collect LP_DEAD items in dead_items array, count tuples */
+ /* Collect LP_DEAD items in dead_items, count tuples */
if (lazy_scan_noprune(vacrel, buf, blkno, page, &hastup,
&recordfreespace))
{
@@ -1011,14 +1015,14 @@ lazy_scan_heap(LVRelState *vacrel)
* Prune, freeze, and count tuples.
*
* Accumulates details of remaining LP_DEAD line pointers on page in
- * dead_items array. This includes LP_DEAD line pointers that we
- * pruned ourselves, as well as existing LP_DEAD line pointers that
- * were pruned some time earlier. Also considers freezing XIDs in the
- * tuple headers of remaining items with storage.
+ * dead_items. This includes LP_DEAD line pointers that we pruned
+ * ourselves, as well as existing LP_DEAD line pointers that were pruned
+ * some time earlier. Also considers freezing XIDs in the tuple headers
+ * of remaining items with storage.
*/
lazy_scan_prune(vacrel, buf, blkno, page, &prunestate);
- Assert(!prunestate.all_visible || !prunestate.has_lpdead_items);
+ Assert(!prunestate.all_visible || (prunestate.num_offsets == 0));
/* Remember the location of the last page with nonremovable tuples */
if (prunestate.hastup)
@@ -1034,14 +1038,12 @@ lazy_scan_heap(LVRelState *vacrel)
* performed here can be thought of as the one-pass equivalent of
* a call to lazy_vacuum().
*/
- if (prunestate.has_lpdead_items)
+ if (prunestate.num_offsets > 0)
{
Size freespace;
- lazy_vacuum_heap_page(vacrel, blkno, buf, 0, vmbuffer);
-
- /* Forget the LP_DEAD items that we just vacuumed */
- dead_items->num_items = 0;
+ lazy_vacuum_heap_page(vacrel, blkno, prunestate.deadoffsets,
+ prunestate.num_offsets, buf, vmbuffer);
/*
* Periodically perform FSM vacuuming to make newly-freed
@@ -1078,7 +1080,16 @@ lazy_scan_heap(LVRelState *vacrel)
* with prunestate-driven visibility map and FSM steps (just like
* the two-pass strategy).
*/
- Assert(dead_items->num_items == 0);
+ Assert(tidstore_num_tids(dead_items) == 0);
+ }
+ else if (prunestate.num_offsets > 0)
+ {
+ /* Save details of the LP_DEAD items from the page in dead_items */
+ tidstore_add_tids(dead_items, blkno, prunestate.deadoffsets,
+ prunestate.num_offsets);
+
+ pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
+ tidstore_memory_usage(dead_items));
}
/*
@@ -1145,7 +1156,7 @@ lazy_scan_heap(LVRelState *vacrel)
* There should never be LP_DEAD items on a page with PD_ALL_VISIBLE
* set, however.
*/
- else if (prunestate.has_lpdead_items && PageIsAllVisible(page))
+ else if ((prunestate.num_offsets > 0) && PageIsAllVisible(page))
{
elog(WARNING, "page containing LP_DEAD items is marked as all-visible in relation \"%s\" page %u",
vacrel->relname, blkno);
@@ -1193,7 +1204,7 @@ lazy_scan_heap(LVRelState *vacrel)
* Final steps for block: drop cleanup lock, record free space in the
* FSM
*/
- if (prunestate.has_lpdead_items && vacrel->do_index_vacuuming)
+ if ((prunestate.num_offsets > 0) && vacrel->do_index_vacuuming)
{
/*
* Wait until lazy_vacuum_heap_rel() to save free space. This
@@ -1249,7 +1260,7 @@ lazy_scan_heap(LVRelState *vacrel)
* Do index vacuuming (call each index's ambulkdelete routine), then do
* related heap vacuuming
*/
- if (dead_items->num_items > 0)
+ if (tidstore_num_tids(dead_items) > 0)
lazy_vacuum(vacrel);
/*
@@ -1524,9 +1535,9 @@ lazy_scan_new_or_empty(LVRelState *vacrel, Buffer buf, BlockNumber blkno,
* The approach we take now is to restart pruning when the race condition is
* detected. This allows heap_page_prune() to prune the tuples inserted by
* the now-aborted transaction. This is a little crude, but it guarantees
- * that any items that make it into the dead_items array are simple LP_DEAD
- * line pointers, and that every remaining item with tuple storage is
- * considered as a candidate for freezing.
+ * that any items that make it into the dead_items are simple LP_DEAD line
+ * pointers, and that every remaining item with tuple storage is considered
+ * as a candidate for freezing.
*/
static void
lazy_scan_prune(LVRelState *vacrel,
@@ -1543,13 +1554,11 @@ lazy_scan_prune(LVRelState *vacrel,
HTSV_Result res;
int tuples_deleted,
tuples_frozen,
- lpdead_items,
live_tuples,
recently_dead_tuples;
int nnewlpdead;
HeapPageFreeze pagefrz;
int64 fpi_before = pgWalUsage.wal_fpi;
- OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
HeapTupleFreeze frozen[MaxHeapTuplesPerPage];
Assert(BufferGetBlockNumber(buf) == blkno);
@@ -1571,7 +1580,6 @@ retry:
pagefrz.NoFreezePageRelminMxid = vacrel->NewRelminMxid;
tuples_deleted = 0;
tuples_frozen = 0;
- lpdead_items = 0;
live_tuples = 0;
recently_dead_tuples = 0;
@@ -1580,9 +1588,9 @@ retry:
*
* We count tuples removed by the pruning step as tuples_deleted. Its
* final value can be thought of as the number of tuples that have been
- * deleted from the table. It should not be confused with lpdead_items;
- * lpdead_items's final value can be thought of as the number of tuples
- * that were deleted from indexes.
+ * deleted from the table. It should not be confused with
+ * prunestate->deadoffsets; prunestate->deadoffsets's final value can
+ * be thought of as the number of tuples that were deleted from indexes.
*/
tuples_deleted = heap_page_prune(rel, buf, vacrel->vistest,
InvalidTransactionId, 0, &nnewlpdead,
@@ -1593,7 +1601,7 @@ retry:
* requiring freezing among remaining tuples with storage
*/
prunestate->hastup = false;
- prunestate->has_lpdead_items = false;
+ prunestate->num_offsets = 0;
prunestate->all_visible = true;
prunestate->all_frozen = true;
prunestate->visibility_cutoff_xid = InvalidTransactionId;
@@ -1638,7 +1646,7 @@ retry:
* (This is another case where it's useful to anticipate that any
* LP_DEAD items will become LP_UNUSED during the ongoing VACUUM.)
*/
- deadoffsets[lpdead_items++] = offnum;
+ prunestate->deadoffsets[prunestate->num_offsets++] = offnum;
continue;
}
@@ -1875,7 +1883,7 @@ retry:
*/
#ifdef USE_ASSERT_CHECKING
/* Note that all_frozen value does not matter when !all_visible */
- if (prunestate->all_visible && lpdead_items == 0)
+ if (prunestate->all_visible && prunestate->num_offsets == 0)
{
TransactionId cutoff;
bool all_frozen;
@@ -1888,28 +1896,9 @@ retry:
}
#endif
- /*
- * Now save details of the LP_DEAD items from the page in vacrel
- */
- if (lpdead_items > 0)
+ if (prunestate->num_offsets > 0)
{
- VacDeadItems *dead_items = vacrel->dead_items;
- ItemPointerData tmp;
-
vacrel->lpdead_item_pages++;
- prunestate->has_lpdead_items = true;
-
- ItemPointerSetBlockNumber(&tmp, blkno);
-
- for (int i = 0; i < lpdead_items; i++)
- {
- ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
- dead_items->items[dead_items->num_items++] = tmp;
- }
-
- Assert(dead_items->num_items <= dead_items->max_items);
- pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
- dead_items->num_items);
/*
* It was convenient to ignore LP_DEAD items in all_visible earlier on
@@ -1928,7 +1917,7 @@ retry:
/* Finally, add page-local counts to whole-VACUUM counts */
vacrel->tuples_deleted += tuples_deleted;
vacrel->tuples_frozen += tuples_frozen;
- vacrel->lpdead_items += lpdead_items;
+ vacrel->lpdead_items += prunestate->num_offsets;
vacrel->live_tuples += live_tuples;
vacrel->recently_dead_tuples += recently_dead_tuples;
}
@@ -1940,7 +1929,7 @@ retry:
* lazy_scan_prune, which requires a full cleanup lock. While pruning isn't
* performed here, it's quite possible that an earlier opportunistic pruning
* operation left LP_DEAD items behind. We'll at least collect any such items
- * in the dead_items array for removal from indexes.
+ * in the dead_items for removal from indexes.
*
* For aggressive VACUUM callers, we may return false to indicate that a full
* cleanup lock is required for processing by lazy_scan_prune. This is only
@@ -2099,7 +2088,7 @@ lazy_scan_noprune(LVRelState *vacrel,
vacrel->NewRelfrozenXid = NoFreezePageRelfrozenXid;
vacrel->NewRelminMxid = NoFreezePageRelminMxid;
- /* Save any LP_DEAD items found on the page in dead_items array */
+ /* Save any LP_DEAD items found on the page in dead_items */
if (vacrel->nindexes == 0)
{
/* Using one-pass strategy (since table has no indexes) */
@@ -2129,8 +2118,7 @@ lazy_scan_noprune(LVRelState *vacrel,
}
else
{
- VacDeadItems *dead_items = vacrel->dead_items;
- ItemPointerData tmp;
+ TidStore *dead_items = vacrel->dead_items;
/*
* Page has LP_DEAD items, and so any references/TIDs that remain in
@@ -2139,17 +2127,10 @@ lazy_scan_noprune(LVRelState *vacrel,
*/
vacrel->lpdead_item_pages++;
- ItemPointerSetBlockNumber(&tmp, blkno);
+ tidstore_add_tids(dead_items, blkno, deadoffsets, lpdead_items);
- for (int i = 0; i < lpdead_items; i++)
- {
- ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
- dead_items->items[dead_items->num_items++] = tmp;
- }
-
- Assert(dead_items->num_items <= dead_items->max_items);
- pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
- dead_items->num_items);
+ pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
+ tidstore_memory_usage(dead_items));
vacrel->lpdead_items += lpdead_items;
@@ -2198,7 +2179,7 @@ lazy_vacuum(LVRelState *vacrel)
if (!vacrel->do_index_vacuuming)
{
Assert(!vacrel->do_index_cleanup);
- vacrel->dead_items->num_items = 0;
+ tidstore_reset(vacrel->dead_items);
return;
}
@@ -2227,7 +2208,7 @@ lazy_vacuum(LVRelState *vacrel)
BlockNumber threshold;
Assert(vacrel->num_index_scans == 0);
- Assert(vacrel->lpdead_items == vacrel->dead_items->num_items);
+ Assert(vacrel->lpdead_items == tidstore_num_tids(vacrel->dead_items));
Assert(vacrel->do_index_vacuuming);
Assert(vacrel->do_index_cleanup);
@@ -2254,8 +2235,8 @@ lazy_vacuum(LVRelState *vacrel)
* cases then this may need to be reconsidered.
*/
threshold = (double) vacrel->rel_pages * BYPASS_THRESHOLD_PAGES;
- bypass = (vacrel->lpdead_item_pages < threshold &&
- vacrel->lpdead_items < MAXDEADITEMS(32L * 1024L * 1024L));
+ bypass = (vacrel->lpdead_item_pages < threshold) &&
+ tidstore_memory_usage(vacrel->dead_items) < (32L * 1024L * 1024L);
}
if (bypass)
@@ -2300,7 +2281,7 @@ lazy_vacuum(LVRelState *vacrel)
* Forget the LP_DEAD items that we just vacuumed (or just decided to not
* vacuum)
*/
- vacrel->dead_items->num_items = 0;
+ tidstore_reset(vacrel->dead_items);
}
/*
@@ -2373,7 +2354,7 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
* place).
*/
Assert(vacrel->num_index_scans > 0 ||
- vacrel->dead_items->num_items == vacrel->lpdead_items);
+ tidstore_num_tids(vacrel->dead_items) == vacrel->lpdead_items);
Assert(allindexes || vacrel->failsafe_active);
/*
@@ -2392,9 +2373,8 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
/*
* lazy_vacuum_heap_rel() -- second pass over the heap for two pass strategy
*
- * This routine marks LP_DEAD items in vacrel->dead_items array as LP_UNUSED.
- * Pages that never had lazy_scan_prune record LP_DEAD items are not visited
- * at all.
+ * This routine marks LP_DEAD items in vacrel->dead_items as LP_UNUSED. Pages
+ * that never had lazy_scan_prune record LP_DEAD items are not visited at all.
*
* We may also be able to truncate the line pointer array of the heap pages we
* visit. If there is a contiguous group of LP_UNUSED items at the end of the
@@ -2410,10 +2390,11 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
static void
lazy_vacuum_heap_rel(LVRelState *vacrel)
{
- int index = 0;
BlockNumber vacuumed_pages = 0;
Buffer vmbuffer = InvalidBuffer;
LVSavedErrInfo saved_err_info;
+ TidStoreIter *iter;
+ TidStoreIterResult *result;
Assert(vacrel->do_index_vacuuming);
Assert(vacrel->do_index_cleanup);
@@ -2428,7 +2409,8 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
VACUUM_ERRCB_PHASE_VACUUM_HEAP,
InvalidBlockNumber, InvalidOffsetNumber);
- while (index < vacrel->dead_items->num_items)
+ iter = tidstore_begin_iterate(vacrel->dead_items);
+ while ((result = tidstore_iterate_next(iter)) != NULL)
{
BlockNumber blkno;
Buffer buf;
@@ -2437,7 +2419,7 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
vacuum_delay_point();
- blkno = ItemPointerGetBlockNumber(&vacrel->dead_items->items[index]);
+ blkno = result->blkno;
vacrel->blkno = blkno;
/*
@@ -2451,7 +2433,8 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
buf = ReadBufferExtended(vacrel->rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
vacrel->bstrategy);
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
- index = lazy_vacuum_heap_page(vacrel, blkno, buf, index, vmbuffer);
+ lazy_vacuum_heap_page(vacrel, blkno, result->offsets, result->num_offsets,
+ buf, vmbuffer);
/* Now that we've vacuumed the page, record its available space */
page = BufferGetPage(buf);
@@ -2461,6 +2444,7 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
vacuumed_pages++;
}
+ tidstore_end_iterate(iter);
vacrel->blkno = InvalidBlockNumber;
if (BufferIsValid(vmbuffer))
@@ -2470,36 +2454,31 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
* We set all LP_DEAD items from the first heap pass to LP_UNUSED during
* the second heap pass. No more, no less.
*/
- Assert(index > 0);
Assert(vacrel->num_index_scans > 1 ||
- (index == vacrel->lpdead_items &&
+ (tidstore_num_tids(vacrel->dead_items) == vacrel->lpdead_items &&
vacuumed_pages == vacrel->lpdead_item_pages));
ereport(DEBUG2,
- (errmsg("table \"%s\": removed %lld dead item identifiers in %u pages",
- vacrel->relname, (long long) index, vacuumed_pages)));
+ (errmsg("table \"%s\": removed " UINT64_FORMAT "dead item identifiers in %u pages",
+ vacrel->relname, tidstore_num_tids(vacrel->dead_items),
+ vacuumed_pages)));
/* Revert to the previous phase information for error traceback */
restore_vacuum_error_info(vacrel, &saved_err_info);
}
/*
- * lazy_vacuum_heap_page() -- free page's LP_DEAD items listed in the
- * vacrel->dead_items array.
+ * lazy_vacuum_heap_page() -- free page's LP_DEAD items.
*
* Caller must have an exclusive buffer lock on the buffer (though a full
* cleanup lock is also acceptable). vmbuffer must be valid and already have
* a pin on blkno's visibility map page.
- *
- * index is an offset into the vacrel->dead_items array for the first listed
- * LP_DEAD item on the page. The return value is the first index immediately
- * after all LP_DEAD items for the same page in the array.
*/
-static int
-lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
- int index, Buffer vmbuffer)
+static void
+lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
+ OffsetNumber *deadoffsets, int num_offsets, Buffer buffer,
+ Buffer vmbuffer)
{
- VacDeadItems *dead_items = vacrel->dead_items;
Page page = BufferGetPage(buffer);
OffsetNumber unused[MaxHeapTuplesPerPage];
int nunused = 0;
@@ -2518,16 +2497,11 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
START_CRIT_SECTION();
- for (; index < dead_items->num_items; index++)
+ for (int i = 0; i < num_offsets; i++)
{
- BlockNumber tblk;
- OffsetNumber toff;
ItemId itemid;
+ OffsetNumber toff = deadoffsets[i];
- tblk = ItemPointerGetBlockNumber(&dead_items->items[index]);
- if (tblk != blkno)
- break; /* past end of tuples for this block */
- toff = ItemPointerGetOffsetNumber(&dead_items->items[index]);
itemid = PageGetItemId(page, toff);
Assert(ItemIdIsDead(itemid) && !ItemIdHasStorage(itemid));
@@ -2597,7 +2571,6 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
/* Revert to the previous phase information for error traceback */
restore_vacuum_error_info(vacrel, &saved_err_info);
- return index;
}
/*
@@ -2687,8 +2660,8 @@ lazy_cleanup_all_indexes(LVRelState *vacrel)
* lazy_vacuum_one_index() -- vacuum index relation.
*
* Delete all the index tuples containing a TID collected in
- * vacrel->dead_items array. Also update running statistics.
- * Exact details depend on index AM's ambulkdelete routine.
+ * vacrel->dead_items. Also update running statistics. Exact
+ * details depend on index AM's ambulkdelete routine.
*
* reltuples is the number of heap tuples to be passed to the
* bulkdelete callback. It's always assumed to be estimated.
@@ -3094,48 +3067,8 @@ count_nondeletable_pages(LVRelState *vacrel, bool *lock_waiter_detected)
}
/*
- * Returns the number of dead TIDs that VACUUM should allocate space to
- * store, given a heap rel of size vacrel->rel_pages, and given current
- * maintenance_work_mem setting (or current autovacuum_work_mem setting,
- * when applicable).
- *
- * See the comments at the head of this file for rationale.
- */
-static int
-dead_items_max_items(LVRelState *vacrel)
-{
- int64 max_items;
- int vac_work_mem = IsAutoVacuumWorkerProcess() &&
- autovacuum_work_mem != -1 ?
- autovacuum_work_mem : maintenance_work_mem;
-
- if (vacrel->nindexes > 0)
- {
- BlockNumber rel_pages = vacrel->rel_pages;
-
- max_items = MAXDEADITEMS(vac_work_mem * 1024L);
- max_items = Min(max_items, INT_MAX);
- max_items = Min(max_items, MAXDEADITEMS(MaxAllocSize));
-
- /* curious coding here to ensure the multiplication can't overflow */
- if ((BlockNumber) (max_items / MaxHeapTuplesPerPage) > rel_pages)
- max_items = rel_pages * MaxHeapTuplesPerPage;
-
- /* stay sane if small maintenance_work_mem */
- max_items = Max(max_items, MaxHeapTuplesPerPage);
- }
- else
- {
- /* One-pass case only stores a single heap page's TIDs at a time */
- max_items = MaxHeapTuplesPerPage;
- }
-
- return (int) max_items;
-}
-
-/*
- * Allocate dead_items (either using palloc, or in dynamic shared memory).
- * Sets dead_items in vacrel for caller.
+ * Allocate a (local or shared) TidStore for storing dead TIDs. Sets dead_items
+ * in vacrel for caller.
*
* Also handles parallel initialization as part of allocating dead_items in
* DSM when required.
@@ -3143,11 +3076,9 @@ dead_items_max_items(LVRelState *vacrel)
static void
dead_items_alloc(LVRelState *vacrel, int nworkers)
{
- VacDeadItems *dead_items;
- int max_items;
-
- max_items = dead_items_max_items(vacrel);
- Assert(max_items >= MaxHeapTuplesPerPage);
+ int vac_work_mem = IsAutoVacuumWorkerProcess() &&
+ autovacuum_work_mem != -1 ?
+ autovacuum_work_mem * 1024L : maintenance_work_mem * 1024L;
/*
* Initialize state for a parallel vacuum. As of now, only one worker can
@@ -3174,7 +3105,7 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
else
vacrel->pvs = parallel_vacuum_init(vacrel->rel, vacrel->indrels,
vacrel->nindexes, nworkers,
- max_items,
+ vac_work_mem, MaxHeapTuplesPerPage,
vacrel->verbose ? INFO : DEBUG2,
vacrel->bstrategy);
@@ -3187,11 +3118,8 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
}
/* Serial VACUUM case */
- dead_items = (VacDeadItems *) palloc(vac_max_items_to_alloc_size(max_items));
- dead_items->max_items = max_items;
- dead_items->num_items = 0;
-
- vacrel->dead_items = dead_items;
+ vacrel->dead_items = tidstore_create(vac_work_mem, MaxHeapTuplesPerPage,
+ NULL);
}
/*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 34ca0e739f..149d41b41c 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1180,7 +1180,7 @@ CREATE VIEW pg_stat_progress_vacuum AS
END AS phase,
S.param2 AS heap_blks_total, S.param3 AS heap_blks_scanned,
S.param4 AS heap_blks_vacuumed, S.param5 AS index_vacuum_count,
- S.param6 AS max_dead_tuples, S.param7 AS num_dead_tuples
+ S.param6 AS max_dead_tuple_bytes, S.param7 AS dead_tuple_bytes
FROM pg_stat_get_progress_info('VACUUM') AS S
LEFT JOIN pg_database D ON S.datid = D.oid;
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index aa79d9de4d..d8e680ca20 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -97,7 +97,6 @@ static bool vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
static double compute_parallel_delay(void);
static VacOptValue get_vacoptval_from_boolean(DefElem *def);
static bool vac_tid_reaped(ItemPointer itemptr, void *state);
-static int vac_cmp_itemptr(const void *left, const void *right);
/*
* Primary entry point for manual VACUUM and ANALYZE commands
@@ -2303,16 +2302,16 @@ get_vacoptval_from_boolean(DefElem *def)
*/
IndexBulkDeleteResult *
vac_bulkdel_one_index(IndexVacuumInfo *ivinfo, IndexBulkDeleteResult *istat,
- VacDeadItems *dead_items)
+ TidStore *dead_items)
{
/* Do bulk deletion */
istat = index_bulk_delete(ivinfo, istat, vac_tid_reaped,
(void *) dead_items);
ereport(ivinfo->message_level,
- (errmsg("scanned index \"%s\" to remove %d row versions",
+ (errmsg("scanned index \"%s\" to remove " UINT64_FORMAT " row versions",
RelationGetRelationName(ivinfo->index),
- dead_items->num_items)));
+ tidstore_num_tids(dead_items))));
return istat;
}
@@ -2343,82 +2342,15 @@ vac_cleanup_one_index(IndexVacuumInfo *ivinfo, IndexBulkDeleteResult *istat)
return istat;
}
-/*
- * Returns the total required space for VACUUM's dead_items array given a
- * max_items value.
- */
-Size
-vac_max_items_to_alloc_size(int max_items)
-{
- Assert(max_items <= MAXDEADITEMS(MaxAllocSize));
-
- return offsetof(VacDeadItems, items) + sizeof(ItemPointerData) * max_items;
-}
-
/*
* vac_tid_reaped() -- is a particular tid deletable?
*
* This has the right signature to be an IndexBulkDeleteCallback.
- *
- * Assumes dead_items array is sorted (in ascending TID order).
*/
static bool
vac_tid_reaped(ItemPointer itemptr, void *state)
{
- VacDeadItems *dead_items = (VacDeadItems *) state;
- int64 litem,
- ritem,
- item;
- ItemPointer res;
-
- litem = itemptr_encode(&dead_items->items[0]);
- ritem = itemptr_encode(&dead_items->items[dead_items->num_items - 1]);
- item = itemptr_encode(itemptr);
-
- /*
- * Doing a simple bound check before bsearch() is useful to avoid the
- * extra cost of bsearch(), especially if dead items on the heap are
- * concentrated in a certain range. Since this function is called for
- * every index tuple, it pays to be really fast.
- */
- if (item < litem || item > ritem)
- return false;
-
- res = (ItemPointer) bsearch(itemptr,
- dead_items->items,
- dead_items->num_items,
- sizeof(ItemPointerData),
- vac_cmp_itemptr);
-
- return (res != NULL);
-}
-
-/*
- * Comparator routines for use with qsort() and bsearch().
- */
-static int
-vac_cmp_itemptr(const void *left, const void *right)
-{
- BlockNumber lblk,
- rblk;
- OffsetNumber loff,
- roff;
-
- lblk = ItemPointerGetBlockNumber((ItemPointer) left);
- rblk = ItemPointerGetBlockNumber((ItemPointer) right);
-
- if (lblk < rblk)
- return -1;
- if (lblk > rblk)
- return 1;
-
- loff = ItemPointerGetOffsetNumber((ItemPointer) left);
- roff = ItemPointerGetOffsetNumber((ItemPointer) right);
-
- if (loff < roff)
- return -1;
- if (loff > roff)
- return 1;
+ TidStore *dead_items = (TidStore *) state;
- return 0;
+ return tidstore_lookup_tid(dead_items, itemptr);
}
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index bcd40c80a1..d653683693 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -9,12 +9,11 @@
* In a parallel vacuum, we perform both index bulk deletion and index cleanup
* with parallel worker processes. Individual indexes are processed by one
* vacuum process. ParalleVacuumState contains shared information as well as
- * the memory space for storing dead items allocated in the DSM segment. We
- * launch parallel worker processes at the start of parallel index
- * bulk-deletion and index cleanup and once all indexes are processed, the
- * parallel worker processes exit. Each time we process indexes in parallel,
- * the parallel context is re-initialized so that the same DSM can be used for
- * multiple passes of index bulk-deletion and index cleanup.
+ * the shared TidStore. We launch parallel worker processes at the start of
+ * parallel index bulk-deletion and index cleanup and once all indexes are
+ * processed, the parallel worker processes exit. Each time we process indexes
+ * in parallel, the parallel context is re-initialized so that the same DSM can
+ * be used for multiple passes of index bulk-deletion and index cleanup.
*
* Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
@@ -103,6 +102,9 @@ typedef struct PVShared
/* Counter for vacuuming and cleanup */
pg_atomic_uint32 idx;
+
+ /* Handle of the shared TidStore */
+ tidstore_handle dead_items_handle;
} PVShared;
/* Status used during parallel index vacuum or cleanup */
@@ -166,7 +168,8 @@ struct ParallelVacuumState
PVIndStats *indstats;
/* Shared dead items space among parallel vacuum workers */
- VacDeadItems *dead_items;
+ TidStore *dead_items;
+ dsa_area *dead_items_area;
/* Points to buffer usage area in DSM */
BufferUsage *buffer_usage;
@@ -222,20 +225,23 @@ static void parallel_vacuum_error_callback(void *arg);
*/
ParallelVacuumState *
parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
- int nrequested_workers, int max_items,
- int elevel, BufferAccessStrategy bstrategy)
+ int nrequested_workers, int vac_work_mem,
+ int max_offset, int elevel,
+ BufferAccessStrategy bstrategy)
{
ParallelVacuumState *pvs;
ParallelContext *pcxt;
PVShared *shared;
- VacDeadItems *dead_items;
+ TidStore *dead_items;
PVIndStats *indstats;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
+ void *area_space;
+ dsa_area *dead_items_dsa;
bool *will_parallel_vacuum;
Size est_indstats_len;
Size est_shared_len;
- Size est_dead_items_len;
+ Size dsa_minsize = dsa_minimum_size();
int nindexes_mwm = 0;
int parallel_workers = 0;
int querylen;
@@ -283,9 +289,8 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_estimate_chunk(&pcxt->estimator, est_shared_len);
shm_toc_estimate_keys(&pcxt->estimator, 1);
- /* Estimate size for dead_items -- PARALLEL_VACUUM_KEY_DEAD_ITEMS */
- est_dead_items_len = vac_max_items_to_alloc_size(max_items);
- shm_toc_estimate_chunk(&pcxt->estimator, est_dead_items_len);
+ /* Estimate size for dead tuple DSA -- PARALLEL_VACUUM_KEY_DSA */
+ shm_toc_estimate_chunk(&pcxt->estimator, dsa_minsize);
shm_toc_estimate_keys(&pcxt->estimator, 1);
/*
@@ -351,6 +356,16 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_INDEX_STATS, indstats);
pvs->indstats = indstats;
+ /* Prepare DSA space for dead items */
+ area_space = shm_toc_allocate(pcxt->toc, dsa_minsize);
+ shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_DEAD_ITEMS, area_space);
+ dead_items_dsa = dsa_create_in_place(area_space, dsa_minsize,
+ LWTRANCHE_PARALLEL_VACUUM_DSA,
+ pcxt->seg);
+ dead_items = tidstore_create(vac_work_mem, max_offset, dead_items_dsa);
+ pvs->dead_items = dead_items;
+ pvs->dead_items_area = dead_items_dsa;
+
/* Prepare shared information */
shared = (PVShared *) shm_toc_allocate(pcxt->toc, est_shared_len);
MemSet(shared, 0, est_shared_len);
@@ -360,6 +375,7 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
(nindexes_mwm > 0) ?
maintenance_work_mem / Min(parallel_workers, nindexes_mwm) :
maintenance_work_mem;
+ shared->dead_items_handle = tidstore_get_handle(dead_items);
pg_atomic_init_u32(&(shared->cost_balance), 0);
pg_atomic_init_u32(&(shared->active_nworkers), 0);
@@ -368,15 +384,6 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_SHARED, shared);
pvs->shared = shared;
- /* Prepare the dead_items space */
- dead_items = (VacDeadItems *) shm_toc_allocate(pcxt->toc,
- est_dead_items_len);
- dead_items->max_items = max_items;
- dead_items->num_items = 0;
- MemSet(dead_items->items, 0, sizeof(ItemPointerData) * max_items);
- shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_DEAD_ITEMS, dead_items);
- pvs->dead_items = dead_items;
-
/*
* Allocate space for each worker's BufferUsage and WalUsage; no need to
* initialize
@@ -434,6 +441,9 @@ parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats)
istats[i] = NULL;
}
+ tidstore_destroy(pvs->dead_items);
+ dsa_detach(pvs->dead_items_area);
+
DestroyParallelContext(pvs->pcxt);
ExitParallelMode();
@@ -442,7 +452,7 @@ parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats)
}
/* Returns the dead items space */
-VacDeadItems *
+TidStore *
parallel_vacuum_get_dead_items(ParallelVacuumState *pvs)
{
return pvs->dead_items;
@@ -940,7 +950,9 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
Relation *indrels;
PVIndStats *indstats;
PVShared *shared;
- VacDeadItems *dead_items;
+ TidStore *dead_items;
+ void *area_space;
+ dsa_area *dead_items_area;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
int nindexes;
@@ -984,10 +996,10 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
PARALLEL_VACUUM_KEY_INDEX_STATS,
false);
- /* Set dead_items space */
- dead_items = (VacDeadItems *) shm_toc_lookup(toc,
- PARALLEL_VACUUM_KEY_DEAD_ITEMS,
- false);
+ /* Set dead items */
+ area_space = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_DEAD_ITEMS, false);
+ dead_items_area = dsa_attach_in_place(area_space, seg);
+ dead_items = tidstore_attach(dead_items_area, shared->dead_items_handle);
/* Set cost-based vacuum delay */
VacuumCostActive = (VacuumCostDelay > 0);
@@ -1033,6 +1045,9 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
&wal_usage[ParallelWorkerNumber]);
+ tidstore_detach(pvs.dead_items);
+ dsa_detach(dead_items_area);
+
/* Pop the error context stack */
error_context_stack = errcallback.previous;
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index ff6149a179..a371f6fbba 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -3397,12 +3397,12 @@ check_autovacuum_work_mem(int *newval, void **extra, GucSource source)
return true;
/*
- * We clamp manually-set values to at least 1MB. Since
+ * We clamp manually-set values to at least 2MB. Since
* maintenance_work_mem is always set to at least this value, do the same
* here.
*/
- if (*newval < 1024)
- *newval = 1024;
+ if (*newval < 2048)
+ *newval = 2048;
return true;
}
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 55b3a04097..c223a7dc94 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -192,6 +192,8 @@ static const char *const BuiltinTrancheNames[] = {
"LogicalRepLauncherDSA",
/* LWTRANCHE_LAUNCHER_HASH: */
"LogicalRepLauncherHash",
+ /* LWTRANCHE_PARALLEL_VACUUM_DSA: */
+ "ParallelVacuumDSA",
};
StaticAssertDecl(lengthof(BuiltinTrancheNames) ==
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index b46e3b8c55..27a88b9369 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2312,7 +2312,7 @@ struct config_int ConfigureNamesInt[] =
GUC_UNIT_KB
},
&maintenance_work_mem,
- 65536, 1024, MAX_KILOBYTES,
+ 65536, 2048, MAX_KILOBYTES,
NULL, NULL, NULL
},
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index e5add41352..b209d3cf84 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -23,8 +23,8 @@
#define PROGRESS_VACUUM_HEAP_BLKS_SCANNED 2
#define PROGRESS_VACUUM_HEAP_BLKS_VACUUMED 3
#define PROGRESS_VACUUM_NUM_INDEX_VACUUMS 4
-#define PROGRESS_VACUUM_MAX_DEAD_TUPLES 5
-#define PROGRESS_VACUUM_NUM_DEAD_TUPLES 6
+#define PROGRESS_VACUUM_MAX_DEAD_TUPLE_BYTES 5
+#define PROGRESS_VACUUM_DEAD_TUPLE_BYTES 6
/* Phases of vacuum (as advertised via PROGRESS_VACUUM_PHASE) */
#define PROGRESS_VACUUM_PHASE_SCAN_HEAP 1
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 689dbb7702..a3ebb169ef 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -17,6 +17,7 @@
#include "access/htup.h"
#include "access/genam.h"
#include "access/parallel.h"
+#include "access/tidstore.h"
#include "catalog/pg_class.h"
#include "catalog/pg_statistic.h"
#include "catalog/pg_type.h"
@@ -276,21 +277,6 @@ struct VacuumCutoffs
MultiXactId MultiXactCutoff;
};
-/*
- * VacDeadItems stores TIDs whose index tuples are deleted by index vacuuming.
- */
-typedef struct VacDeadItems
-{
- int max_items; /* # slots allocated in array */
- int num_items; /* current # of entries */
-
- /* Sorted array of TIDs to delete from indexes */
- ItemPointerData items[FLEXIBLE_ARRAY_MEMBER];
-} VacDeadItems;
-
-#define MAXDEADITEMS(avail_mem) \
- (((avail_mem) - offsetof(VacDeadItems, items)) / sizeof(ItemPointerData))
-
/* GUC parameters */
extern PGDLLIMPORT int default_statistics_target; /* PGDLLIMPORT for PostGIS */
extern PGDLLIMPORT int vacuum_freeze_min_age;
@@ -339,18 +325,17 @@ extern Relation vacuum_open_relation(Oid relid, RangeVar *relation,
LOCKMODE lmode);
extern IndexBulkDeleteResult *vac_bulkdel_one_index(IndexVacuumInfo *ivinfo,
IndexBulkDeleteResult *istat,
- VacDeadItems *dead_items);
+ TidStore *dead_items);
extern IndexBulkDeleteResult *vac_cleanup_one_index(IndexVacuumInfo *ivinfo,
IndexBulkDeleteResult *istat);
-extern Size vac_max_items_to_alloc_size(int max_items);
/* in commands/vacuumparallel.c */
extern ParallelVacuumState *parallel_vacuum_init(Relation rel, Relation *indrels,
int nindexes, int nrequested_workers,
- int max_items, int elevel,
- BufferAccessStrategy bstrategy);
+ int vac_work_mem, int max_offset,
+ int elevel, BufferAccessStrategy bstrategy);
extern void parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats);
-extern VacDeadItems *parallel_vacuum_get_dead_items(ParallelVacuumState *pvs);
+extern TidStore *parallel_vacuum_get_dead_items(ParallelVacuumState *pvs);
extern void parallel_vacuum_bulkdel_all_indexes(ParallelVacuumState *pvs,
long num_table_tuples,
int num_index_scans);
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 07002fdfbe..537b34b30c 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -207,6 +207,7 @@ typedef enum BuiltinTrancheIds
LWTRANCHE_PGSTATS_DATA,
LWTRANCHE_LAUNCHER_DSA,
LWTRANCHE_LAUNCHER_HASH,
+ LWTRANCHE_PARALLEL_VACUUM_DSA,
LWTRANCHE_FIRST_USER_DEFINED
} BuiltinTrancheIds;
diff --git a/src/test/regress/expected/cluster.out b/src/test/regress/expected/cluster.out
index 2eec483eaa..e04f50726f 100644
--- a/src/test/regress/expected/cluster.out
+++ b/src/test/regress/expected/cluster.out
@@ -526,7 +526,7 @@ create index cluster_sort on clstr_4 (hundred, thousand, tenthous);
-- ensure we don't use the index in CLUSTER nor the checking SELECTs
set enable_indexscan = off;
-- Use external sort:
-set maintenance_work_mem = '1MB';
+set maintenance_work_mem = '2MB';
cluster clstr_4 using cluster_sort;
select * from
(select hundred, lag(hundred) over () as lhundred,
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 6cd57e3eaa..d1889b9d10 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -1214,7 +1214,7 @@ DROP TABLE unlogged_hash_table;
-- CREATE INDEX hash_ovfl_index ON hash_ovfl_heap USING hash (x int4_ops);
-- Test hash index build tuplesorting. Force hash tuplesort using low
-- maintenance_work_mem setting and fillfactor:
-SET maintenance_work_mem = '1MB';
+SET maintenance_work_mem = '2MB';
CREATE INDEX hash_tuplesort_idx ON tenk1 USING hash (stringu1 name_ops) WITH (fillfactor = 10);
EXPLAIN (COSTS OFF)
SELECT count(*) FROM tenk1 WHERE stringu1 = 'TVAAAA';
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 174b725fff..8fa4e86be8 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2032,8 +2032,8 @@ pg_stat_progress_vacuum| SELECT s.pid,
s.param3 AS heap_blks_scanned,
s.param4 AS heap_blks_vacuumed,
s.param5 AS index_vacuum_count,
- s.param6 AS max_dead_tuples,
- s.param7 AS num_dead_tuples
+ s.param6 AS max_dead_tuple_bytes,
+ s.param7 AS dead_tuple_bytes
FROM (pg_stat_get_progress_info('VACUUM'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
LEFT JOIN pg_database d ON ((s.datid = d.oid)));
pg_stat_recovery_prefetch| SELECT stats_reset,
diff --git a/src/test/regress/sql/cluster.sql b/src/test/regress/sql/cluster.sql
index a4cfaae807..a4cb5b98a5 100644
--- a/src/test/regress/sql/cluster.sql
+++ b/src/test/regress/sql/cluster.sql
@@ -258,7 +258,7 @@ create index cluster_sort on clstr_4 (hundred, thousand, tenthous);
set enable_indexscan = off;
-- Use external sort:
-set maintenance_work_mem = '1MB';
+set maintenance_work_mem = '2MB';
cluster clstr_4 using cluster_sort;
select * from
(select hundred, lag(hundred) over () as lhundred,
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index a3738833b2..edb5e4b4f3 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -367,7 +367,7 @@ DROP TABLE unlogged_hash_table;
-- Test hash index build tuplesorting. Force hash tuplesort using low
-- maintenance_work_mem setting and fillfactor:
-SET maintenance_work_mem = '1MB';
+SET maintenance_work_mem = '2MB';
CREATE INDEX hash_tuplesort_idx ON tenk1 USING hash (stringu1 name_ops) WITH (fillfactor = 10);
EXPLAIN (COSTS OFF)
SELECT count(*) FROM tenk1 WHERE stringu1 = 'TVAAAA';
--
2.31.1
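For context on the byte-based progress columns above: the vacuum.h hunk removes VacDeadItems together with its MAXDEADITEMS() macro, so the fixed "one ItemPointerData per dead tuple" capacity calculation goes away, and pg_stat_progress_vacuum switches to max_dead_tuple_bytes/dead_tuple_bytes, presumably because TidStore memory usage no longer maps to a fixed item count. Below is a small standalone sketch (plain C, not PostgreSQL code; the struct only mirrors the layout of the removed VacDeadItems, and ITEM_POINTER_SIZE stands in for sizeof(ItemPointerData)) of the arithmetic the old macro performed for the default 64MB maintenance_work_mem:

#include <stdio.h>
#include <stddef.h>

/* Mirror of the removed VacDeadItems layout, for illustration only */
typedef struct VacDeadItemsSketch
{
    int         max_items;      /* # slots allocated in array */
    int         num_items;      /* current # of entries */
    char        items[];        /* stand-in for ItemPointerData[] */
} VacDeadItemsSketch;

#define ITEM_POINTER_SIZE 6     /* sizeof(ItemPointerData) in PostgreSQL */
#define MAXDEADITEMS_SKETCH(avail_mem) \
    (((avail_mem) - offsetof(VacDeadItemsSketch, items)) / ITEM_POINTER_SIZE)

int
main(void)
{
    size_t      avail_mem = 64UL * 1024 * 1024; /* default maintenance_work_mem */

    /* Prints roughly 11.18 million: the old hard item limit for 64MB */
    printf("max dead TIDs: %zu\n", (size_t) MAXDEADITEMS_SKETCH(avail_mem));
    return 0;
}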
Attachment: v29-0007-Review-radix-tree.patch
From 52e0d50d6e882c0444ccdf15f8afcc1aef3a6987 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 20 Feb 2023 11:28:50 +0900
Subject: [PATCH v29 07/10] Review radix tree.
Mainly improve the iteration code and comments.
---
src/include/lib/radixtree.h | 169 +++++++++---------
src/include/lib/radixtree_iter_impl.h | 85 ++++-----
.../expected/test_radixtree.out | 6 +-
.../modules/test_radixtree/test_radixtree.c | 103 +++++++----
4 files changed, 197 insertions(+), 166 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index e546bd705c..8bea606c62 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -83,7 +83,7 @@
* RT_SET - Set a key-value pair
* RT_BEGIN_ITERATE - Begin iterating through all key-value pairs
* RT_ITERATE_NEXT - Return next key-value pair, if any
- * RT_END_ITER - End iteration
+ * RT_END_ITERATE - End iteration
* RT_MEMORY_USAGE - Get the memory usage
*
* Interface for Shared Memory
@@ -152,8 +152,8 @@
#define RT_INIT_NODE RT_MAKE_NAME(init_node)
#define RT_FREE_NODE RT_MAKE_NAME(free_node)
#define RT_FREE_RECURSE RT_MAKE_NAME(free_recurse)
-#define RT_EXTEND RT_MAKE_NAME(extend)
-#define RT_SET_EXTEND RT_MAKE_NAME(set_extend)
+#define RT_EXTEND_UP RT_MAKE_NAME(extend_up)
+#define RT_EXTEND_DOWN RT_MAKE_NAME(extend_down)
#define RT_SWITCH_NODE_KIND RT_MAKE_NAME(grow_node_kind)
#define RT_COPY_NODE RT_MAKE_NAME(copy_node)
#define RT_REPLACE_NODE RT_MAKE_NAME(replace_node)
@@ -191,7 +191,7 @@
#define RT_NODE_INSERT_LEAF RT_MAKE_NAME(node_insert_leaf)
#define RT_NODE_INNER_ITERATE_NEXT RT_MAKE_NAME(node_inner_iterate_next)
#define RT_NODE_LEAF_ITERATE_NEXT RT_MAKE_NAME(node_leaf_iterate_next)
-#define RT_UPDATE_ITER_STACK RT_MAKE_NAME(update_iter_stack)
+#define RT_ITER_SET_NODE_FROM RT_MAKE_NAME(iter_set_node_from)
#define RT_ITER_UPDATE_KEY RT_MAKE_NAME(iter_update_key)
#define RT_VERIFY_NODE RT_MAKE_NAME(verify_node)
@@ -612,7 +612,6 @@ static const RT_SIZE_CLASS_ELEM RT_SIZE_CLASS_INFO[] = {
#endif
/* Contains the actual tree and ancillary info */
-// WIP: this name is a bit strange
typedef struct RT_RADIX_TREE_CONTROL
{
#ifdef RT_SHMEM
@@ -651,36 +650,40 @@ typedef struct RT_RADIX_TREE
* Iteration support.
*
* Iterating the radix tree returns each pair of key and value in the ascending
- * order of the key. To support this, the we iterate nodes of each level.
+ * order of the key.
*
- * RT_NODE_ITER struct is used to track the iteration within a node.
+ * RT_NODE_ITER is the struct for iteration of one radix tree node.
*
* RT_ITER is the struct for iteration of the radix tree, and uses RT_NODE_ITER
- * in order to track the iteration of each level. During iteration, we also
- * construct the key whenever updating the node iteration information, e.g., when
- * advancing the current index within the node or when moving to the next node
- * at the same level.
- *
- * XXX: Currently we allow only one process to do iteration. Therefore, rt_node_iter
- * has the local pointers to nodes, rather than RT_PTR_ALLOC.
- * We need either a safeguard to disallow other processes to begin the iteration
- * while one process is doing or to allow multiple processes to do the iteration.
+ * for each level to track the iteration within the node.
*/
typedef struct RT_NODE_ITER
{
- RT_PTR_LOCAL node; /* current node being iterated */
- int current_idx; /* current position. -1 for initial value */
+ /*
+ * Local pointer to the node we are iterating over.
+ *
+ * Since the radix tree doesn't support the shared iteration among multiple
+ * processes, we use RT_PTR_LOCAL rather than RT_PTR_ALLOC.
+ */
+ RT_PTR_LOCAL node;
+
+ /*
+ * The next index of the chunk array in RT_NODE_KIND_3 and
+ * RT_NODE_KIND_32 nodes, or the next chunk in RT_NODE_KIND_125 and
+ * RT_NODE_KIND_256 nodes. 0 for the initial value.
+ */
+ int idx;
} RT_NODE_ITER;
typedef struct RT_ITER
{
RT_RADIX_TREE *tree;
- /* Track the iteration on nodes of each level */
- RT_NODE_ITER stack[RT_MAX_LEVEL];
- int stack_len;
+ /* Track the nodes for each level. level = 0 is for a leaf node */
+ RT_NODE_ITER node_iters[RT_MAX_LEVEL];
+ int top_level;
- /* The key is constructed during iteration */
+ /* The key constructed during the iteration */
uint64 key;
} RT_ITER;
@@ -1243,7 +1246,7 @@ RT_REPLACE_NODE(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent,
* it can store the key.
*/
static pg_noinline void
-RT_EXTEND(RT_RADIX_TREE *tree, uint64 key)
+RT_EXTEND_UP(RT_RADIX_TREE *tree, uint64 key)
{
int target_shift;
RT_PTR_LOCAL root = RT_PTR_GET_LOCAL(tree, tree->ctl->root);
@@ -1282,7 +1285,7 @@ RT_EXTEND(RT_RADIX_TREE *tree, uint64 key)
* Insert inner and leaf nodes from 'node' to bottom.
*/
static pg_noinline void
-RT_SET_EXTEND(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p, RT_PTR_LOCAL parent,
+RT_EXTEND_DOWN(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p, RT_PTR_LOCAL parent,
RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node)
{
int shift = node->shift;
@@ -1613,7 +1616,7 @@ RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p)
/* Extend the tree if necessary */
if (key > tree->ctl->max_val)
- RT_EXTEND(tree, key);
+ RT_EXTEND_UP(tree, key);
stored_child = tree->ctl->root;
parent = RT_PTR_GET_LOCAL(tree, stored_child);
@@ -1631,7 +1634,7 @@ RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p)
if (!RT_NODE_SEARCH_INNER(child, key, &new_child))
{
- RT_SET_EXTEND(tree, key, value_p, parent, stored_child, child);
+ RT_EXTEND_DOWN(tree, key, value_p, parent, stored_child, child);
RT_UNLOCK(tree);
return false;
}
@@ -1805,16 +1808,9 @@ RT_DELETE(RT_RADIX_TREE *tree, uint64 key)
}
#endif
-static inline void
-RT_ITER_UPDATE_KEY(RT_ITER *iter, uint8 chunk, uint8 shift)
-{
- iter->key &= ~(((uint64) RT_CHUNK_MASK) << shift);
- iter->key |= (((uint64) chunk) << shift);
-}
-
/*
- * Advance the slot in the inner node. Return the child if exists, otherwise
- * null.
+ * Scan the inner node and return the next child node if one exists, otherwise
+ * return NULL.
*/
static inline RT_PTR_LOCAL
RT_NODE_INNER_ITERATE_NEXT(RT_ITER *iter, RT_NODE_ITER *node_iter)
@@ -1825,8 +1821,8 @@ RT_NODE_INNER_ITERATE_NEXT(RT_ITER *iter, RT_NODE_ITER *node_iter)
}
/*
- * Advance the slot in the leaf node. On success, return true and the value
- * is set to value_p, otherwise return false.
+ * Scan the leaf node; if a next value exists, set it to value_p and return
+ * true. Otherwise return false.
*/
static inline bool
RT_NODE_LEAF_ITERATE_NEXT(RT_ITER *iter, RT_NODE_ITER *node_iter,
@@ -1838,29 +1834,50 @@ RT_NODE_LEAF_ITERATE_NEXT(RT_ITER *iter, RT_NODE_ITER *node_iter,
}
/*
- * Update each node_iter for inner nodes in the iterator node stack.
+ * While descending the radix tree from the 'from' node to the bottom, we
+ * set the next node to iterate for each level.
*/
static void
-RT_UPDATE_ITER_STACK(RT_ITER *iter, RT_PTR_LOCAL from_node, int from)
+RT_ITER_SET_NODE_FROM(RT_ITER *iter, RT_PTR_LOCAL from)
{
- int level = from;
- RT_PTR_LOCAL node = from_node;
+ int level = from->shift / RT_NODE_SPAN;
+ RT_PTR_LOCAL node = from;
for (;;)
{
- RT_NODE_ITER *node_iter = &(iter->stack[level--]);
+ RT_NODE_ITER *node_iter = &(iter->node_iters[level--]);
+
+#ifdef USE_ASSERT_CHECKING
+ if (node_iter->node)
+ {
+ /* We must have finished the iteration on the previous node */
+ if (RT_NODE_IS_LEAF(node_iter->node))
+ {
+ uint64 dummy;
+ Assert(!RT_NODE_LEAF_ITERATE_NEXT(iter, node_iter, &dummy));
+ }
+ else
+ Assert(!RT_NODE_INNER_ITERATE_NEXT(iter, node_iter));
+ }
+#endif
+ /* Set the node to the node iterator of this level */
node_iter->node = node;
- node_iter->current_idx = -1;
+ node_iter->idx = 0;
- /* We don't advance the leaf node iterator here */
if (RT_NODE_IS_LEAF(node))
- return;
+ {
+ /* We will visit the leaf node when RT_ITERATE_NEXT() is called */
+ break;
+ }
- /* Advance to the next slot in the inner node */
+ /*
+ * Get the first child node from the node, which corresponds to the
+ * lowest chunk within the node.
+ */
node = RT_NODE_INNER_ITERATE_NEXT(iter, node_iter);
- /* We must find the first children in the node */
+ /* The first child must be found */
Assert(node);
}
}
@@ -1874,14 +1891,11 @@ RT_UPDATE_ITER_STACK(RT_ITER *iter, RT_PTR_LOCAL from_node, int from)
RT_SCOPE RT_ITER *
RT_BEGIN_ITERATE(RT_RADIX_TREE *tree)
{
- MemoryContext old_ctx;
RT_ITER *iter;
RT_PTR_LOCAL root;
- int top_level;
- old_ctx = MemoryContextSwitchTo(tree->context);
-
- iter = (RT_ITER *) palloc0(sizeof(RT_ITER));
+ iter = (RT_ITER *) MemoryContextAllocZero(tree->context,
+ sizeof(RT_ITER));
iter->tree = tree;
RT_LOCK_SHARED(tree);
@@ -1891,16 +1905,13 @@ RT_BEGIN_ITERATE(RT_RADIX_TREE *tree)
return iter;
root = RT_PTR_GET_LOCAL(tree, iter->tree->ctl->root);
- top_level = root->shift / RT_NODE_SPAN;
- iter->stack_len = top_level;
+ iter->top_level = root->shift / RT_NODE_SPAN;
/*
- * Descend to the left most leaf node from the root. The key is being
- * constructed while descending to the leaf.
+ * Set the next node to iterate for each level from the level of the
+ * root node.
*/
- RT_UPDATE_ITER_STACK(iter, root, top_level);
-
- MemoryContextSwitchTo(old_ctx);
+ RT_ITER_SET_NODE_FROM(iter, root);
return iter;
}
@@ -1912,6 +1923,8 @@ RT_BEGIN_ITERATE(RT_RADIX_TREE *tree)
RT_SCOPE bool
RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, RT_VALUE_TYPE *value_p)
{
+ Assert(value_p != NULL);
+
/* Empty tree */
if (!iter->tree->ctl->root)
return false;
@@ -1919,43 +1932,38 @@ RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, RT_VALUE_TYPE *value_p)
for (;;)
{
RT_PTR_LOCAL child = NULL;
- RT_VALUE_TYPE value;
- int level;
- bool found;
-
- /* Advance the leaf node iterator to get next key-value pair */
- found = RT_NODE_LEAF_ITERATE_NEXT(iter, &(iter->stack[0]), &value);
- if (found)
+ /* Get the next chunk of the leaf node */
+ if (RT_NODE_LEAF_ITERATE_NEXT(iter, &(iter->node_iters[0]), value_p))
{
*key_p = iter->key;
- *value_p = value;
return true;
}
/*
- * We've visited all values in the leaf node, so advance inner node
- * iterators from the level=1 until we find the next child node.
+ * We've visited all values in the leaf node, so advance all inner node
+ * iterators by visiting inner nodes from the level = 1 until we find the
+ * next inner node that has a child node.
*/
- for (level = 1; level <= iter->stack_len; level++)
+ for (int level = 1; level <= iter->top_level; level++)
{
- child = RT_NODE_INNER_ITERATE_NEXT(iter, &(iter->stack[level]));
+ child = RT_NODE_INNER_ITERATE_NEXT(iter, &(iter->node_iters[level]));
if (child)
break;
}
- /* the iteration finished */
+ /* We've visited all nodes, so the iteration finished */
if (!child)
- return false;
+ break;
/*
- * Set the node to the node iterator and update the iterator stack
- * from this node.
+ * Found the new child node. We update the next node to iterate for each
+ * level from the level of this child node.
*/
- RT_UPDATE_ITER_STACK(iter, child, level - 1);
+ RT_ITER_SET_NODE_FROM(iter, child);
- /* Node iterators are updated, so try again from the leaf */
+ /* Find key-value from the leaf node again */
}
return false;
@@ -2470,8 +2478,8 @@ RT_DUMP(RT_RADIX_TREE *tree)
#undef RT_INIT_NODE
#undef RT_FREE_NODE
#undef RT_FREE_RECURSE
-#undef RT_EXTEND
-#undef RT_SET_EXTEND
+#undef RT_EXTEND_UP
+#undef RT_EXTEND_DOWN
#undef RT_SWITCH_NODE_KIND
#undef RT_COPY_NODE
#undef RT_REPLACE_NODE
@@ -2509,8 +2517,7 @@ RT_DUMP(RT_RADIX_TREE *tree)
#undef RT_NODE_INSERT_LEAF
#undef RT_NODE_INNER_ITERATE_NEXT
#undef RT_NODE_LEAF_ITERATE_NEXT
-#undef RT_UPDATE_ITER_STACK
-#undef RT_ITER_UPDATE_KEY
+#undef RT_ITER_SET_NODE_FROM
#undef RT_VERIFY_NODE
#undef RT_DEBUG
diff --git a/src/include/lib/radixtree_iter_impl.h b/src/include/lib/radixtree_iter_impl.h
index 98c78eb237..5c1034768e 100644
--- a/src/include/lib/radixtree_iter_impl.h
+++ b/src/include/lib/radixtree_iter_impl.h
@@ -27,12 +27,10 @@
#error node level must be either inner or leaf
#endif
- bool found = false;
- uint8 key_chunk;
+ uint8 key_chunk = 0;
#ifdef RT_NODE_LEVEL_LEAF
- RT_VALUE_TYPE value;
-
+ Assert(value_p != NULL);
Assert(RT_NODE_IS_LEAF(node_iter->node));
#else
RT_PTR_LOCAL child = NULL;
@@ -50,99 +48,92 @@
{
RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node_iter->node;
- node_iter->current_idx++;
- if (node_iter->current_idx >= n3->base.n.count)
- break;
+ if (node_iter->idx >= n3->base.n.count)
+ return false;
+
#ifdef RT_NODE_LEVEL_LEAF
- value = n3->values[node_iter->current_idx];
+ *value_p = n3->values[node_iter->idx];
#else
- child = RT_PTR_GET_LOCAL(iter->tree, n3->children[node_iter->current_idx]);
+ child = RT_PTR_GET_LOCAL(iter->tree, n3->children[node_iter->idx]);
#endif
- key_chunk = n3->base.chunks[node_iter->current_idx];
- found = true;
+ key_chunk = n3->base.chunks[node_iter->idx];
+ node_iter->idx++;
break;
}
case RT_NODE_KIND_32:
{
RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node_iter->node;
- node_iter->current_idx++;
- if (node_iter->current_idx >= n32->base.n.count)
- break;
+ if (node_iter->idx >= n32->base.n.count)
+ return false;
#ifdef RT_NODE_LEVEL_LEAF
- value = n32->values[node_iter->current_idx];
+ *value_p = n32->values[node_iter->idx];
#else
- child = RT_PTR_GET_LOCAL(iter->tree, n32->children[node_iter->current_idx]);
+ child = RT_PTR_GET_LOCAL(iter->tree, n32->children[node_iter->idx]);
#endif
- key_chunk = n32->base.chunks[node_iter->current_idx];
- found = true;
+ key_chunk = n32->base.chunks[node_iter->idx];
+ node_iter->idx++;
break;
}
case RT_NODE_KIND_125:
{
RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node_iter->node;
- int i;
+ int chunk;
- for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ for (chunk = node_iter->idx; chunk < RT_NODE_MAX_SLOTS; chunk++)
{
- if (RT_NODE_125_IS_CHUNK_USED((RT_NODE_BASE_125 *) n125, i))
+ if (RT_NODE_125_IS_CHUNK_USED((RT_NODE_BASE_125 *) n125, chunk))
break;
}
- if (i >= RT_NODE_MAX_SLOTS)
- break;
+ if (chunk >= RT_NODE_MAX_SLOTS)
+ return false;
- node_iter->current_idx = i;
#ifdef RT_NODE_LEVEL_LEAF
- value = RT_NODE_LEAF_125_GET_VALUE(n125, i);
+ *value_p = RT_NODE_LEAF_125_GET_VALUE(n125, chunk);
#else
- child = RT_PTR_GET_LOCAL(iter->tree, RT_NODE_INNER_125_GET_CHILD(n125, i));
+ child = RT_PTR_GET_LOCAL(iter->tree, RT_NODE_INNER_125_GET_CHILD(n125, chunk));
#endif
- key_chunk = i;
- found = true;
+ key_chunk = chunk;
+ node_iter->idx = chunk + 1;
break;
}
case RT_NODE_KIND_256:
{
RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node_iter->node;
- int i;
+ int chunk;
- for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ for (chunk = node_iter->idx; chunk < RT_NODE_MAX_SLOTS; chunk++)
{
#ifdef RT_NODE_LEVEL_LEAF
- if (RT_NODE_LEAF_256_IS_CHUNK_USED(n256, i))
+ if (RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk))
#else
- if (RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
+ if (RT_NODE_INNER_256_IS_CHUNK_USED(n256, chunk))
#endif
break;
}
- if (i >= RT_NODE_MAX_SLOTS)
- break;
+ if (chunk >= RT_NODE_MAX_SLOTS)
+ return false;
- node_iter->current_idx = i;
#ifdef RT_NODE_LEVEL_LEAF
- value = RT_NODE_LEAF_256_GET_VALUE(n256, i);
+ *value_p = RT_NODE_LEAF_256_GET_VALUE(n256, chunk);
#else
- child = RT_PTR_GET_LOCAL(iter->tree, RT_NODE_INNER_256_GET_CHILD(n256, i));
+ child = RT_PTR_GET_LOCAL(iter->tree, RT_NODE_INNER_256_GET_CHILD(n256, chunk));
#endif
- key_chunk = i;
- found = true;
+ key_chunk = chunk;
+ node_iter->idx = chunk + 1;
break;
}
}
- if (found)
- {
- RT_ITER_UPDATE_KEY(iter, key_chunk, node_iter->node->shift);
-#ifdef RT_NODE_LEVEL_LEAF
- *value_p = value;
-#endif
- }
+ /* Update the part of the key */
+ iter->key &= ~(((uint64) RT_CHUNK_MASK) << node_iter->node->shift);
+ iter->key |= (((uint64) key_chunk) << node_iter->node->shift);
#ifdef RT_NODE_LEVEL_LEAF
- return found;
+ return true;
#else
return child;
#endif
diff --git a/src/test/modules/test_radixtree/expected/test_radixtree.out b/src/test/modules/test_radixtree/expected/test_radixtree.out
index ce645cb8b5..7ad1ce3605 100644
--- a/src/test/modules/test_radixtree/expected/test_radixtree.out
+++ b/src/test/modules/test_radixtree/expected/test_radixtree.out
@@ -4,8 +4,10 @@ CREATE EXTENSION test_radixtree;
-- an error if something fails.
--
SELECT test_radixtree();
-NOTICE: testing basic operations with leaf node 4
-NOTICE: testing basic operations with inner node 4
+NOTICE: testing basic operations with leaf node 3
+NOTICE: testing basic operations with inner node 3
+NOTICE: testing basic operations with leaf node 15
+NOTICE: testing basic operations with inner node 15
NOTICE: testing basic operations with leaf node 32
NOTICE: testing basic operations with inner node 32
NOTICE: testing basic operations with leaf node 125
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
index afe53382f3..5a169854d9 100644
--- a/src/test/modules/test_radixtree/test_radixtree.c
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -43,12 +43,15 @@ typedef uint64 TestValueType;
*/
static const bool rt_test_stats = false;
-static int rt_node_kind_fanouts[] = {
- 0,
- 4, /* RT_NODE_KIND_4 */
- 32, /* RT_NODE_KIND_32 */
- 125, /* RT_NODE_KIND_125 */
- 256 /* RT_NODE_KIND_256 */
+/*
+ * XXX: should we expose and use RT_SIZE_CLASS and RT_SIZE_CLASS_INFO?
+ */
+static int rt_node_class_fanouts[] = {
+ 3, /* RT_CLASS_3 */
+ 15, /* RT_CLASS_32_MIN */
+ 32, /* RT_CLASS_32_MAX */
+ 125, /* RT_CLASS_125 */
+ 256 /* RT_CLASS_256 */
};
/*
* A struct to define a pattern of integers, for use with the test_pattern()
@@ -260,10 +263,9 @@ test_basic(int children, bool test_inner)
* Check if keys from start to end with the shift exist in the tree.
*/
static void
-check_search_on_node(rt_radix_tree *radixtree, uint8 shift, int start, int end,
- int incr)
+check_search_on_node(rt_radix_tree *radixtree, uint8 shift, int start, int end)
{
- for (int i = start; i < end; i++)
+ for (int i = start; i <= end; i++)
{
uint64 key = ((uint64) i << shift);
TestValueType val;
@@ -277,22 +279,26 @@ check_search_on_node(rt_radix_tree *radixtree, uint8 shift, int start, int end,
}
}
+/*
+ * Insert 256 key-value pairs, and check if keys are properly inserted on each
+ * node class.
+ */
+/* Test keys [0, 256) */
+#define NODE_TYPE_TEST_KEY_MIN 0
+#define NODE_TYPE_TEST_KEY_MAX 256
static void
-test_node_types_insert(rt_radix_tree *radixtree, uint8 shift, bool insert_asc)
+test_node_types_insert_asc(rt_radix_tree *radixtree, uint8 shift)
{
- uint64 num_entries;
- int ninserted = 0;
- int start = insert_asc ? 0 : 256;
- int incr = insert_asc ? 1 : -1;
- int end = insert_asc ? 256 : 0;
- int node_kind_idx = 1;
+ uint64 num_entries;
+ int node_class_idx = 0;
+ uint64 key_checked = 0;
- for (int i = start; i != end; i += incr)
+ for (int i = NODE_TYPE_TEST_KEY_MIN; i < NODE_TYPE_TEST_KEY_MAX; i++)
{
uint64 key = ((uint64) i << shift);
bool found;
- found = rt_set(radixtree, key, (TestValueType*) &key);
+ found = rt_set(radixtree, key, (TestValueType *) &key);
if (found)
elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " is found", key);
@@ -300,24 +306,49 @@ test_node_types_insert(rt_radix_tree *radixtree, uint8 shift, bool insert_asc)
* After filling all slots in each node type, check if the values
* are stored properly.
*/
- if (ninserted == rt_node_kind_fanouts[node_kind_idx] - 1)
+ if ((i + 1) == rt_node_class_fanouts[node_class_idx])
{
- int check_start = insert_asc
- ? rt_node_kind_fanouts[node_kind_idx - 1]
- : rt_node_kind_fanouts[node_kind_idx];
- int check_end = insert_asc
- ? rt_node_kind_fanouts[node_kind_idx]
- : rt_node_kind_fanouts[node_kind_idx - 1];
-
- check_search_on_node(radixtree, shift, check_start, check_end, incr);
- node_kind_idx++;
+ check_search_on_node(radixtree, shift, key_checked, i);
+ key_checked = i;
+ node_class_idx++;
}
-
- ninserted++;
}
num_entries = rt_num_entries(radixtree);
+ if (num_entries != 256)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(256));
+}
+
+/*
+ * Similar to test_node_types_insert_asc(), but inserts keys in descending order.
+ */
+static void
+test_node_types_insert_desc(rt_radix_tree *radixtree, uint8 shift)
+{
+ uint64 num_entries;
+ int node_class_idx = 0;
+ uint64 key_checked = NODE_TYPE_TEST_KEY_MAX - 1;
+
+ for (int i = NODE_TYPE_TEST_KEY_MAX - 1; i >= NODE_TYPE_TEST_KEY_MIN; i--)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_set(radixtree, key, (TestValueType *) &key);
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " is found", key);
+ if ((i + 1) == rt_node_class_fanouts[node_class_idx])
+ {
+ check_search_on_node(radixtree, shift, i, key_checked);
+ key_checked = i;
+ node_class_idx++;
+ }
+ }
+
+ num_entries = rt_num_entries(radixtree);
if (num_entries != 256)
elog(ERROR,
"rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
@@ -329,7 +360,7 @@ test_node_types_delete(rt_radix_tree *radixtree, uint8 shift)
{
uint64 num_entries;
- for (int i = 0; i < 256; i++)
+ for (int i = NODE_TYPE_TEST_KEY_MIN; i < NODE_TYPE_TEST_KEY_MAX; i++)
{
uint64 key = ((uint64) i << shift);
bool found;
@@ -379,9 +410,9 @@ test_node_types(uint8 shift)
* then delete all entries to make it empty, and insert and search entries
* again.
*/
- test_node_types_insert(radixtree, shift, true);
+ test_node_types_insert_asc(radixtree, shift);
test_node_types_delete(radixtree, shift);
- test_node_types_insert(radixtree, shift, false);
+ test_node_types_insert_desc(radixtree, shift);
rt_free(radixtree);
#ifdef RT_SHMEM
@@ -664,10 +695,10 @@ test_radixtree(PG_FUNCTION_ARGS)
{
test_empty();
- for (int i = 1; i < lengthof(rt_node_kind_fanouts); i++)
+ for (int i = 0; i < lengthof(rt_node_class_fanouts); i++)
{
- test_basic(rt_node_kind_fanouts[i], false);
- test_basic(rt_node_kind_fanouts[i], true);
+ test_basic(rt_node_class_fanouts[i], false);
+ test_basic(rt_node_class_fanouts[i], true);
}
for (int shift = 0; shift <= (64 - 8); shift += 8)
--
2.31.1
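To make the renamed iteration entry points above concrete, here is a minimal usage sketch of the templated radix tree (not from the patch set). The template parameters follow the pattern tidstore.c uses; the generated names (demo_rt_*) follow the RT_PREFIX convention, but the create function's exact name and arguments are an assumption and may differ in v29:

#include "postgres.h"

#define RT_PREFIX demo_rt
#define RT_SCOPE static
#define RT_DECLARE
#define RT_DEFINE
#define RT_VALUE_TYPE uint64
#include "lib/radixtree.h"

static void
demo_radix_tree_usage(void)
{
    demo_rt_radix_tree *tree;
    demo_rt_iter *iter;
    uint64      key;
    uint64      value;

    /* hypothetical create call; the real signature may take more arguments */
    tree = demo_rt_create(CurrentMemoryContext);

    for (key = 0; key < 1000; key++)
    {
        value = key * 10;
        /* returns true if the key was already present */
        (void) demo_rt_set(tree, key, &value);
    }

    /* RT_ITERATE_NEXT returns pairs in ascending key order */
    iter = demo_rt_begin_iterate(tree);
    while (demo_rt_iterate_next(iter, &key, &value))
        elog(DEBUG1, "key " UINT64_FORMAT " value " UINT64_FORMAT, key, value);
    demo_rt_end_iterate(iter);

    demo_rt_free(tree);
}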
Attachment: v29-0010-Revert-building-benchmark-module-for-CI.patch
From b6a692913ce8c6868996336f4be778eb5f83d02c Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Tue, 14 Feb 2023 19:31:34 +0700
Subject: [PATCH v29 10/10] Revert building benchmark module for CI
---
contrib/meson.build | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/contrib/meson.build b/contrib/meson.build
index 421d469f8c..52253de793 100644
--- a/contrib/meson.build
+++ b/contrib/meson.build
@@ -12,7 +12,7 @@ subdir('amcheck')
subdir('auth_delay')
subdir('auto_explain')
subdir('basic_archive')
-subdir('bench_radix_tree')
+#subdir('bench_radix_tree')
subdir('bloom')
subdir('basebackup_to_shell')
subdir('bool_plperl')
--
2.31.1
Attachment: v29-0009-Review-vacuum-integration.patch
From e804119fddce3bc0520bedc70c966470c7db35e9 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 17 Feb 2023 00:04:37 +0900
Subject: [PATCH v29 09/10] Review vacuum integration.
---
src/backend/access/heap/vacuumlazy.c | 61 +++++++++++++--------------
src/backend/commands/vacuum.c | 4 +-
src/backend/commands/vacuumparallel.c | 25 +++++------
3 files changed, 45 insertions(+), 45 deletions(-)
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index b4e40423a8..edb9079124 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -10,11 +10,10 @@
* of dead TIDs at once.
*
* We are willing to use at most maintenance_work_mem (or perhaps
- * autovacuum_work_mem) memory space to keep track of dead TIDs. We initially
- * create a TidStore with the maximum bytes that can be used by the TidStore.
- * If the TidStore is full, we must call lazy_vacuum to vacuum indexes (and to
- * vacuum the pages that we've pruned). This frees up the memory space dedicated
- * to storing dead TIDs.
+ * autovacuum_work_mem) memory space to keep track of dead TIDs. If the
+ * TidStore is full, we must call lazy_vacuum to vacuum indexes (and to vacuum
+ * the pages that we've pruned). This frees up the memory space dedicated to
+ * storing dead TIDs.
*
* In practice VACUUM will often complete its initial pass over the target
* heap relation without ever running out of space to store TIDs. This means
@@ -844,7 +843,7 @@ lazy_scan_heap(LVRelState *vacrel)
/* Report that we're scanning the heap, advertising total # of blocks */
initprog_val[0] = PROGRESS_VACUUM_PHASE_SCAN_HEAP;
initprog_val[1] = rel_pages;
- initprog_val[2] = tidstore_max_memory(vacrel->dead_items);
+ initprog_val[2] = TidStoreMaxMemory(vacrel->dead_items);
pgstat_progress_update_multi_param(3, initprog_index, initprog_val);
/* Set up an initial range of skippable blocks using the visibility map */
@@ -911,7 +910,7 @@ lazy_scan_heap(LVRelState *vacrel)
* dead_items TIDs, pause and do a cycle of vacuuming before we tackle
* this page.
*/
- if (tidstore_is_full(vacrel->dead_items))
+ if (TidStoreIsFull(vacrel->dead_items))
{
/*
* Before beginning index vacuuming, we release any pin we may
@@ -1080,16 +1079,16 @@ lazy_scan_heap(LVRelState *vacrel)
* with prunestate-driven visibility map and FSM steps (just like
* the two-pass strategy).
*/
- Assert(tidstore_num_tids(dead_items) == 0);
+ Assert(TidStoreNumTids(dead_items) == 0);
}
else if (prunestate.num_offsets > 0)
{
/* Save details of the LP_DEAD items from the page in dead_items */
- tidstore_add_tids(dead_items, blkno, prunestate.deadoffsets,
- prunestate.num_offsets);
+ TidStoreSetBlockOffsets(dead_items, blkno, prunestate.deadoffsets,
+ prunestate.num_offsets);
pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
- tidstore_memory_usage(dead_items));
+ TidStoreMemoryUsage(dead_items));
}
/*
@@ -1260,7 +1259,7 @@ lazy_scan_heap(LVRelState *vacrel)
* Do index vacuuming (call each index's ambulkdelete routine), then do
* related heap vacuuming
*/
- if (tidstore_num_tids(dead_items) > 0)
+ if (TidStoreNumTids(dead_items) > 0)
lazy_vacuum(vacrel);
/*
@@ -2127,10 +2126,10 @@ lazy_scan_noprune(LVRelState *vacrel,
*/
vacrel->lpdead_item_pages++;
- tidstore_add_tids(dead_items, blkno, deadoffsets, lpdead_items);
+ TidStoreSetBlockOffsets(dead_items, blkno, deadoffsets, lpdead_items);
pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
- tidstore_memory_usage(dead_items));
+ TidStoreMemoryUsage(dead_items));
vacrel->lpdead_items += lpdead_items;
@@ -2179,7 +2178,7 @@ lazy_vacuum(LVRelState *vacrel)
if (!vacrel->do_index_vacuuming)
{
Assert(!vacrel->do_index_cleanup);
- tidstore_reset(vacrel->dead_items);
+ TidStoreReset(vacrel->dead_items);
return;
}
@@ -2208,7 +2207,7 @@ lazy_vacuum(LVRelState *vacrel)
BlockNumber threshold;
Assert(vacrel->num_index_scans == 0);
- Assert(vacrel->lpdead_items == tidstore_num_tids(vacrel->dead_items));
+ Assert(vacrel->lpdead_items == TidStoreNumTids(vacrel->dead_items));
Assert(vacrel->do_index_vacuuming);
Assert(vacrel->do_index_cleanup);
@@ -2236,7 +2235,7 @@ lazy_vacuum(LVRelState *vacrel)
*/
threshold = (double) vacrel->rel_pages * BYPASS_THRESHOLD_PAGES;
bypass = (vacrel->lpdead_item_pages < threshold) &&
- tidstore_memory_usage(vacrel->dead_items) < (32L * 1024L * 1024L);
+ TidStoreMemoryUsage(vacrel->dead_items) < (32L * 1024L * 1024L);
}
if (bypass)
@@ -2281,7 +2280,7 @@ lazy_vacuum(LVRelState *vacrel)
* Forget the LP_DEAD items that we just vacuumed (or just decided to not
* vacuum)
*/
- tidstore_reset(vacrel->dead_items);
+ TidStoreReset(vacrel->dead_items);
}
/*
@@ -2354,7 +2353,7 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
* place).
*/
Assert(vacrel->num_index_scans > 0 ||
- tidstore_num_tids(vacrel->dead_items) == vacrel->lpdead_items);
+ TidStoreNumTids(vacrel->dead_items) == vacrel->lpdead_items);
Assert(allindexes || vacrel->failsafe_active);
/*
@@ -2394,7 +2393,7 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
Buffer vmbuffer = InvalidBuffer;
LVSavedErrInfo saved_err_info;
TidStoreIter *iter;
- TidStoreIterResult *result;
+ TidStoreIterResult *iter_result;
Assert(vacrel->do_index_vacuuming);
Assert(vacrel->do_index_cleanup);
@@ -2409,8 +2408,8 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
VACUUM_ERRCB_PHASE_VACUUM_HEAP,
InvalidBlockNumber, InvalidOffsetNumber);
- iter = tidstore_begin_iterate(vacrel->dead_items);
- while ((result = tidstore_iterate_next(iter)) != NULL)
+ iter = TidStoreBeginIterate(vacrel->dead_items);
+ while ((iter_result = TidStoreIterateNext(iter)) != NULL)
{
BlockNumber blkno;
Buffer buf;
@@ -2419,7 +2418,7 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
vacuum_delay_point();
- blkno = result->blkno;
+ blkno = iter_result->blkno;
vacrel->blkno = blkno;
/*
@@ -2433,8 +2432,8 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
buf = ReadBufferExtended(vacrel->rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
vacrel->bstrategy);
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
- lazy_vacuum_heap_page(vacrel, blkno, result->offsets, result->num_offsets,
- buf, vmbuffer);
+ lazy_vacuum_heap_page(vacrel, blkno, iter_result->offsets,
+ iter_result->num_offsets, buf, vmbuffer);
/* Now that we've vacuumed the page, record its available space */
page = BufferGetPage(buf);
@@ -2444,7 +2443,7 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
vacuumed_pages++;
}
- tidstore_end_iterate(iter);
+ TidStoreEndIterate(iter);
vacrel->blkno = InvalidBlockNumber;
if (BufferIsValid(vmbuffer))
@@ -2455,12 +2454,12 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
* the second heap pass. No more, no less.
*/
Assert(vacrel->num_index_scans > 1 ||
- (tidstore_num_tids(vacrel->dead_items) == vacrel->lpdead_items &&
+ (TidStoreNumTids(vacrel->dead_items) == vacrel->lpdead_items &&
vacuumed_pages == vacrel->lpdead_item_pages));
ereport(DEBUG2,
- (errmsg("table \"%s\": removed " UINT64_FORMAT "dead item identifiers in %u pages",
- vacrel->relname, tidstore_num_tids(vacrel->dead_items),
+ (errmsg("table \"%s\": removed " INT64_FORMAT " dead item identifiers in %u pages",
+ vacrel->relname, TidStoreNumTids(vacrel->dead_items),
vacuumed_pages)));
/* Revert to the previous phase information for error traceback */
@@ -3118,8 +3117,8 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
}
/* Serial VACUUM case */
- vacrel->dead_items = tidstore_create(vac_work_mem, MaxHeapTuplesPerPage,
- NULL);
+ vacrel->dead_items = TidStoreCreate(vac_work_mem, MaxHeapTuplesPerPage,
+ NULL);
}
/*
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index d8e680ca20..5fb30d7e62 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -2311,7 +2311,7 @@ vac_bulkdel_one_index(IndexVacuumInfo *ivinfo, IndexBulkDeleteResult *istat,
ereport(ivinfo->message_level,
(errmsg("scanned index \"%s\" to remove " UINT64_FORMAT " row versions",
RelationGetRelationName(ivinfo->index),
- tidstore_num_tids(dead_items))));
+ TidStoreNumTids(dead_items))));
return istat;
}
@@ -2352,5 +2352,5 @@ vac_tid_reaped(ItemPointer itemptr, void *state)
{
TidStore *dead_items = (TidStore *) state;
- return tidstore_lookup_tid(dead_items, itemptr);
+ return TidStoreIsMember(dead_items, itemptr);
}
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index d653683693..9225daf3ab 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -9,11 +9,12 @@
* In a parallel vacuum, we perform both index bulk deletion and index cleanup
* with parallel worker processes. Individual indexes are processed by one
* vacuum process. ParalleVacuumState contains shared information as well as
- * the shared TidStore. We launch parallel worker processes at the start of
- * parallel index bulk-deletion and index cleanup and once all indexes are
- * processed, the parallel worker processes exit. Each time we process indexes
- * in parallel, the parallel context is re-initialized so that the same DSM can
- * be used for multiple passes of index bulk-deletion and index cleanup.
+ * the memory space for storing dead items allocated in the DSA area. We
+ * launch parallel worker processes at the start of parallel index
+ * bulk-deletion and index cleanup and once all indexes are processed, the
+ * parallel worker processes exit. Each time we process indexes in parallel,
+ * the parallel context is re-initialized so that the same DSM can be used for
+ * multiple passes of index bulk-deletion and index cleanup.
*
* Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
@@ -104,7 +105,7 @@ typedef struct PVShared
pg_atomic_uint32 idx;
/* Handle of the shared TidStore */
- tidstore_handle dead_items_handle;
+ TidStoreHandle dead_items_handle;
} PVShared;
/* Status used during parallel index vacuum or cleanup */
@@ -289,7 +290,7 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_estimate_chunk(&pcxt->estimator, est_shared_len);
shm_toc_estimate_keys(&pcxt->estimator, 1);
- /* Estimate size for dead tuple DSA -- PARALLEL_VACUUM_KEY_DSA */
+ /* Initial size for dead tuple DSA -- PARALLEL_VACUUM_KEY_DEAD_ITEMS */
shm_toc_estimate_chunk(&pcxt->estimator, dsa_minsize);
shm_toc_estimate_keys(&pcxt->estimator, 1);
@@ -362,7 +363,7 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
dead_items_dsa = dsa_create_in_place(area_space, dsa_minsize,
LWTRANCHE_PARALLEL_VACUUM_DSA,
pcxt->seg);
- dead_items = tidstore_create(vac_work_mem, max_offset, dead_items_dsa);
+ dead_items = TidStoreCreate(vac_work_mem, max_offset, dead_items_dsa);
pvs->dead_items = dead_items;
pvs->dead_items_area = dead_items_dsa;
@@ -375,7 +376,7 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
(nindexes_mwm > 0) ?
maintenance_work_mem / Min(parallel_workers, nindexes_mwm) :
maintenance_work_mem;
- shared->dead_items_handle = tidstore_get_handle(dead_items);
+ shared->dead_items_handle = TidStoreGetHandle(dead_items);
pg_atomic_init_u32(&(shared->cost_balance), 0);
pg_atomic_init_u32(&(shared->active_nworkers), 0);
@@ -441,7 +442,7 @@ parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats)
istats[i] = NULL;
}
- tidstore_destroy(pvs->dead_items);
+ TidStoreDestroy(pvs->dead_items);
dsa_detach(pvs->dead_items_area);
DestroyParallelContext(pvs->pcxt);
@@ -999,7 +1000,7 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
/* Set dead items */
area_space = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_DEAD_ITEMS, false);
dead_items_area = dsa_attach_in_place(area_space, seg);
- dead_items = tidstore_attach(dead_items_area, shared->dead_items_handle);
+ dead_items = TidStoreAttach(dead_items_area, shared->dead_items_handle);
/* Set cost-based vacuum delay */
VacuumCostActive = (VacuumCostDelay > 0);
@@ -1045,7 +1046,7 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
&wal_usage[ParallelWorkerNumber]);
- tidstore_detach(pvs.dead_items);
+ TidStoreDetach(dead_items);
dsa_detach(dead_items_area);
/* Pop the error context stack */
--
2.31.1
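Putting the renamed TidStore calls in this patch together, the caller-side flow during vacuum looks roughly like the following condensed sketch (not the actual vacuumlazy.c code; the block number, the offsets, and the 64MB byte budget are made up for illustration, and error handling is omitted):

#include "postgres.h"
#include "access/htup_details.h"
#include "access/tidstore.h"
#include "storage/itemptr.h"

static void
dead_items_flow_sketch(void)
{
    TidStore   *dead_items;
    TidStoreIter *iter;
    TidStoreIterResult *iter_result;
    OffsetNumber offsets[] = {1, 3, 7}; /* must be sorted in ascending order */
    ItemPointerData tid;

    /* serial vacuum case: no DSA area, so the store is backend-local */
    dead_items = TidStoreCreate(64UL * 1024 * 1024, MaxHeapTuplesPerPage, NULL);

    /* first heap pass: remember the LP_DEAD offsets found on each block */
    TidStoreSetBlockOffsets(dead_items, (BlockNumber) 42, offsets, lengthof(offsets));

    /* index vacuum: the bulk-delete callback does a membership check per TID */
    ItemPointerSet(&tid, 42, 3);
    if (TidStoreIsMember(dead_items, &tid))
    {
        /* this index tuple points to a dead heap TID; delete it */
    }

    /* second heap pass: visit blocks in order and mark their items unused */
    iter = TidStoreBeginIterate(dead_items);
    while ((iter_result = TidStoreIterateNext(iter)) != NULL)
    {
        /* use iter_result->blkno and iter_result->offsets[0..num_offsets) */
    }
    TidStoreEndIterate(iter);

    /* forget everything before resuming the heap scan, then clean up */
    TidStoreReset(dead_items);
    TidStoreDestroy(dead_items);
}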
Attachment: v29-0008-Review-TidStore.patch
From fc373e0312e0b3c30bba8bd54286283542d627a2 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Thu, 16 Feb 2023 23:45:39 +0900
Subject: [PATCH v29 08/10] Review TidStore.
---
src/backend/access/common/tidstore.c | 340 +++++++++---------
src/include/access/tidstore.h | 37 +-
.../modules/test_tidstore/test_tidstore.c | 68 ++--
3 files changed, 234 insertions(+), 211 deletions(-)
diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
index 8c05e60d92..9360520482 100644
--- a/src/backend/access/common/tidstore.c
+++ b/src/backend/access/common/tidstore.c
@@ -3,18 +3,19 @@
* tidstore.c
* Tid (ItemPointerData) storage implementation.
*
- * This module provides a in-memory data structure to store Tids (ItemPointer).
- * Internally, a tid is encoded as a pair of 64-bit key and 64-bit value, and
- * stored in the radix tree.
+ * TidStore is an in-memory data structure to store tids (ItemPointerData).
+ * Internally, a tid is encoded as a pair of 64-bit key and 64-bit value,
+ * and stored in the radix tree.
*
- * A TidStore can be shared among parallel worker processes by passing DSA area
- * to tidstore_create(). Other backends can attach to the shared TidStore by
- * tidstore_attach().
+ * TidStore can be shared among parallel worker processes by passing DSA area
+ * to TidStoreCreate(). Other backends can attach to the shared TidStore by
+ * TidStoreAttach().
*
- * Regarding the concurrency, it basically relies on the concurrency support in
- * the radix tree, but we acquires the lock on a TidStore in some cases, for
- * example, when to reset the store and when to access the number tids in the
- * store (num_tids).
+ * Regarding the concurrency support, we use a single LWLock for the TidStore.
+ * The TidStore is exclusively locked when inserting encoded tids to the
+ * radix tree or when resetting itself. When searching on the TidStore or
+ * doing the iteration, it is not locked but the underlying radix tree is
+ * locked in shared mode.
*
* Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
@@ -34,16 +35,18 @@
#include "utils/memutils.h"
/*
- * For encoding purposes, tids are represented as a pair of 64-bit key and
- * 64-bit value. First, we construct 64-bit unsigned integer by combining
- * the block number and the offset number. The number of bits used for the
- * offset number is specified by max_offsets in tidstore_create(). We are
- * frugal with the bits, because smaller keys could help keeping the radix
- * tree shallow.
+ * For encoding purposes, a tid is represented as a pair of 64-bit key and
+ * 64-bit value.
*
- * For example, a tid of heap with 8kB blocks uses the lowest 9 bits for
- * the offset number and uses the next 32 bits for the block number. That
- * is, only 41 bits are used:
+ * First, we construct a 64-bit unsigned integer by combining the block
+ * number and the offset number. The number of bits used for the offset number
+ * is specified by max_off in TidStoreCreate(). We are frugal with the bits,
+ * because smaller keys could help keeping the radix tree shallow.
+ *
+ * For example, a heap tid on an 8kB block uses the lowest 9 bits for
+ * the offset number and uses the next 32 bits for the block number. 9 bits
+ * are enough for the offset number, because MaxHeapTuplesPerPage < 2^9
+ * on 8kB blocks. That is, only 41 bits are used:
*
* uuuuuuuY YYYYYYYY YYYYYYYY YYYYYYYY YYYYYYYX XXXXXXXX
*
@@ -52,30 +55,34 @@
* u = unused bit
* (high on the left, low on the right)
*
- * 9 bits are enough for the offset number, because MaxHeapTuplesPerPage < 2^9
- * on 8kB blocks.
- *
- * The 64-bit value is the bitmap representation of the lowest 6 bits
- * (TIDSTORE_VALUE_NBITS) of the integer, and the rest 35 bits are used
- * as the key:
+ * Then, 64-bit value is the bitmap representation of the lowest 6 bits
+ * (LOWER_OFFSET_NBITS) of the integer, and 64-bit key consists of the
+ * upper 3 bits of the offset number and the block number, 35 bits in
+ * total:
*
* uuuuuuuY YYYYYYYY YYYYYYYY YYYYYYYY YYYYYYYX XXXXXXXX
* |----| value
- * |---------------------------------------------| key
+ * |--------------------------------------| key
*
* The maximum height of the radix tree is 5 in this case.
+ *
+ * If the number of bits required for offset numbers fits in LOWER_OFFSET_NBITS,
+ * 64-bit value is the bitmap representation of the offset number, and the
+ * 64-bit key is the block number.
*/
-#define TIDSTORE_VALUE_NBITS 6 /* log(64, 2) */
-#define TIDSTORE_OFFSET_MASK ((1 << TIDSTORE_VALUE_NBITS) - 1)
+typedef uint64 tidkey;
+typedef uint64 offsetbm;
+#define LOWER_OFFSET_NBITS 6 /* log(sizeof(offsetbm), 2) */
+#define LOWER_OFFSET_MASK ((1 << LOWER_OFFSET_NBITS) - 1)
-/* A magic value used to identify our TidStores. */
+/* A magic value used to identify our TidStore. */
#define TIDSTORE_MAGIC 0x826f6a10
#define RT_PREFIX local_rt
#define RT_SCOPE static
#define RT_DECLARE
#define RT_DEFINE
-#define RT_VALUE_TYPE uint64
+#define RT_VALUE_TYPE tidkey
#include "lib/radixtree.h"
#define RT_PREFIX shared_rt
@@ -83,7 +90,7 @@
#define RT_SCOPE static
#define RT_DECLARE
#define RT_DEFINE
-#define RT_VALUE_TYPE uint64
+#define RT_VALUE_TYPE tidkey
#include "lib/radixtree.h"
/* The control object for a TidStore */
@@ -94,10 +101,10 @@ typedef struct TidStoreControl
/* These values are never changed after creation */
size_t max_bytes; /* the maximum bytes a TidStore can use */
- int max_offset; /* the maximum offset number */
- int offset_nbits; /* the number of bits required for an offset
- * number */
- int offset_key_nbits; /* the number of bits of an offset number
+ int max_off; /* the maximum offset number */
+ int max_off_nbits; /* the number of bits required for offset
+ * numbers */
+ int upper_off_nbits; /* the number of bits of offset numbers
* used in a key */
/* The below fields are used only in shared case */
@@ -106,7 +113,7 @@ typedef struct TidStoreControl
LWLock lock;
/* handles for TidStore and radix tree */
- tidstore_handle handle;
+ TidStoreHandle handle;
shared_rt_handle tree_handle;
} TidStoreControl;
@@ -147,24 +154,27 @@ typedef struct TidStoreIter
bool finished;
/* save for the next iteration */
- uint64 next_key;
- uint64 next_val;
+ tidkey next_tidkey;
+ offsetbm next_off_bitmap;
- /* output for the caller */
- TidStoreIterResult result;
+ /*
+ * output for the caller. Must be last because variable-size.
+ */
+ TidStoreIterResult output;
} TidStoreIter;
-static void tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, uint64 val);
-static inline BlockNumber key_get_blkno(TidStore *ts, uint64 key);
-static inline uint64 encode_key_off(TidStore *ts, BlockNumber block, uint32 offset, uint64 *off_bit);
-static inline uint64 tid_to_key_off(TidStore *ts, ItemPointer tid, uint64 *off_bit);
+static void iter_decode_key_off(TidStoreIter *iter, tidkey key, offsetbm off_bitmap);
+static inline BlockNumber key_get_blkno(TidStore *ts, tidkey key);
+static inline tidkey encode_blk_off(TidStore *ts, BlockNumber block,
+ OffsetNumber offset, offsetbm *off_bit);
+static inline tidkey encode_tid(TidStore *ts, ItemPointer tid, offsetbm *off_bit);
/*
* Create a TidStore. The returned object is allocated in backend-local memory.
* The radix tree for storage is allocated in DSA area is 'area' is non-NULL.
*/
TidStore *
-tidstore_create(size_t max_bytes, int max_offset, dsa_area *area)
+TidStoreCreate(size_t max_bytes, int max_off, dsa_area *area)
{
TidStore *ts;
@@ -176,12 +186,12 @@ tidstore_create(size_t max_bytes, int max_offset, dsa_area *area)
* Memory consumption depends on the number of stored tids, but also on the
* distribution of them, how the radix tree stores, and the memory management
* that backed the radix tree. The maximum bytes that a TidStore can
- * use is specified by the max_bytes in tidstore_create(). We want the total
+ * use is specified by the max_bytes in TidStoreCreate(). We want the total
* amount of memory consumption by a TidStore not to exceed the max_bytes.
*
* In local TidStore cases, the radix tree uses slab allocators for each kind
* of node class. The most memory consuming case while adding Tids associated
- * with one page (i.e. during tidstore_add_tids()) is that we allocate a new
+ * with one page (i.e. during TidStoreSetBlockOffsets()) is that we allocate a new
* slab block for a new radix tree node, which is approximately 70kB. Therefore,
* we deduct 70kB from the max_bytes.
*
@@ -202,7 +212,7 @@ tidstore_create(size_t max_bytes, int max_offset, dsa_area *area)
dp = dsa_allocate0(area, sizeof(TidStoreControl));
ts->control = (TidStoreControl *) dsa_get_address(area, dp);
- ts->control->max_bytes = (uint64) (max_bytes * ratio);
+ ts->control->max_bytes = (size_t) (max_bytes * ratio);
ts->area = area;
ts->control->magic = TIDSTORE_MAGIC;
@@ -218,14 +228,14 @@ tidstore_create(size_t max_bytes, int max_offset, dsa_area *area)
ts->control->max_bytes = max_bytes - (70 * 1024);
}
- ts->control->max_offset = max_offset;
- ts->control->offset_nbits = pg_ceil_log2_32(max_offset);
+ ts->control->max_off = max_off;
+ ts->control->max_off_nbits = pg_ceil_log2_32(max_off);
- if (ts->control->offset_nbits < TIDSTORE_VALUE_NBITS)
- ts->control->offset_nbits = TIDSTORE_VALUE_NBITS;
+ if (ts->control->max_off_nbits < LOWER_OFFSET_NBITS)
+ ts->control->max_off_nbits = LOWER_OFFSET_NBITS;
- ts->control->offset_key_nbits =
- ts->control->offset_nbits - TIDSTORE_VALUE_NBITS;
+ ts->control->upper_off_nbits =
+ ts->control->max_off_nbits - LOWER_OFFSET_NBITS;
return ts;
}
@@ -235,7 +245,7 @@ tidstore_create(size_t max_bytes, int max_offset, dsa_area *area)
* allocated in backend-local memory using the CurrentMemoryContext.
*/
TidStore *
-tidstore_attach(dsa_area *area, tidstore_handle handle)
+TidStoreAttach(dsa_area *area, TidStoreHandle handle)
{
TidStore *ts;
dsa_pointer control;
@@ -266,7 +276,7 @@ tidstore_attach(dsa_area *area, tidstore_handle handle)
* to the operating system.
*/
void
-tidstore_detach(TidStore *ts)
+TidStoreDetach(TidStore *ts)
{
Assert(TidStoreIsShared(ts) && ts->control->magic == TIDSTORE_MAGIC);
@@ -279,12 +289,12 @@ tidstore_detach(TidStore *ts)
*
* TODO: The caller must be certain that no other backend will attempt to
* access the TidStore before calling this function. Other backend must
- * explicitly call tidstore_detach to free up backend-local memory associated
- * with the TidStore. The backend that calls tidstore_destroy must not call
- * tidstore_detach.
+ * explicitly call TidStoreDetach() to free up backend-local memory associated
+ * with the TidStore. The backend that calls TidStoreDestroy() must not call
+ * TidStoreDetach().
*/
void
-tidstore_destroy(TidStore *ts)
+TidStoreDestroy(TidStore *ts)
{
if (TidStoreIsShared(ts))
{
@@ -309,11 +319,11 @@ tidstore_destroy(TidStore *ts)
}
/*
- * Forget all collected Tids. It's similar to tidstore_destroy but we don't free
+ * Forget all collected Tids. It's similar to TidStoreDestroy() but we don't free
* entire TidStore but recreate only the radix tree storage.
*/
void
-tidstore_reset(TidStore *ts)
+TidStoreReset(TidStore *ts)
{
if (TidStoreIsShared(ts))
{
@@ -350,30 +360,34 @@ tidstore_reset(TidStore *ts)
}
}
-/* Add Tids on a block to TidStore */
+/*
+ * Set the given tids on the given block in the TidStore.
+ *
+ * NB: the offset numbers in offsets must be sorted in ascending order.
+ */
void
-tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
- int num_offsets)
+TidStoreSetBlockOffsets(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets)
{
- uint64 *values;
- uint64 key;
- uint64 prev_key;
- uint64 off_bitmap = 0;
+ offsetbm *bitmaps;
+ tidkey key;
+ tidkey prev_key;
+ offsetbm off_bitmap = 0;
int idx;
- const uint64 key_base = ((uint64) blkno) << ts->control->offset_key_nbits;
- const int nkeys = UINT64CONST(1) << ts->control->offset_key_nbits;
+ const tidkey key_base = ((uint64) blkno) << ts->control->upper_off_nbits;
+ const int nkeys = UINT64CONST(1) << ts->control->upper_off_nbits;
Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
- values = palloc(sizeof(uint64) * nkeys);
+ bitmaps = palloc(sizeof(offsetbm) * nkeys);
key = prev_key = key_base;
for (int i = 0; i < num_offsets; i++)
{
- uint64 off_bit;
+ offsetbm off_bit;
/* encode the tid to a key and partial offset */
- key = encode_key_off(ts, blkno, offsets[i], &off_bit);
+ key = encode_blk_off(ts, blkno, offsets[i], &off_bit);
/* make sure we scanned the line pointer array in order */
Assert(key >= prev_key);
@@ -384,11 +398,11 @@ tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
Assert(idx >= 0 && idx < nkeys);
/* write out offset bitmap for this key */
- values[idx] = off_bitmap;
+ bitmaps[idx] = off_bitmap;
/* zero out any gaps up to the current key */
for (int empty_idx = idx + 1; empty_idx < key - key_base; empty_idx++)
- values[empty_idx] = 0;
+ bitmaps[empty_idx] = 0;
/* reset for current key -- the current offset will be handled below */
off_bitmap = 0;
@@ -401,7 +415,7 @@ tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
/* save the final index for later */
idx = key - key_base;
/* write out last offset bitmap */
- values[idx] = off_bitmap;
+ bitmaps[idx] = off_bitmap;
if (TidStoreIsShared(ts))
LWLockAcquire(&ts->control->lock, LW_EXCLUSIVE);
@@ -409,14 +423,14 @@ tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
/* insert the calculated key-values to the tree */
for (int i = 0; i <= idx; i++)
{
- if (values[i])
+ if (bitmaps[i])
{
key = key_base + i;
if (TidStoreIsShared(ts))
- shared_rt_set(ts->tree.shared, key, &values[i]);
+ shared_rt_set(ts->tree.shared, key, &bitmaps[i]);
else
- local_rt_set(ts->tree.local, key, &values[i]);
+ local_rt_set(ts->tree.local, key, &bitmaps[i]);
}
}
@@ -426,70 +440,70 @@ tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
if (TidStoreIsShared(ts))
LWLockRelease(&ts->control->lock);
- pfree(values);
+ pfree(bitmaps);
}
/* Return true if the given tid is present in the TidStore */
bool
-tidstore_lookup_tid(TidStore *ts, ItemPointer tid)
+TidStoreIsMember(TidStore *ts, ItemPointer tid)
{
- uint64 key;
- uint64 val = 0;
- uint64 off_bit;
+ tidkey key;
+ offsetbm off_bitmap = 0;
+ offsetbm off_bit;
bool found;
- key = tid_to_key_off(ts, tid, &off_bit);
+ key = encode_tid(ts, tid, &off_bit);
if (TidStoreIsShared(ts))
- found = shared_rt_search(ts->tree.shared, key, &val);
+ found = shared_rt_search(ts->tree.shared, key, &off_bitmap);
else
- found = local_rt_search(ts->tree.local, key, &val);
+ found = local_rt_search(ts->tree.local, key, &off_bitmap);
if (!found)
return false;
- return (val & off_bit) != 0;
+ return (off_bitmap & off_bit) != 0;
}
/*
- * Prepare to iterate through a TidStore. Since the radix tree is locked during the
- * iteration, so tidstore_end_iterate() needs to called when finished.
+ * Prepare to iterate through a TidStore. Since the radix tree is locked during
+ * the iteration, so TidStoreEndIterate() needs to be called when finished.
+ *
+ * The TidStoreIter struct is created in the caller's memory context.
*
* Concurrent updates during the iteration will be blocked when inserting a
* key-value to the radix tree.
*/
TidStoreIter *
-tidstore_begin_iterate(TidStore *ts)
+TidStoreBeginIterate(TidStore *ts)
{
TidStoreIter *iter;
Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
- iter = palloc0(sizeof(TidStoreIter));
+ iter = palloc0(sizeof(TidStoreIter) +
+ sizeof(OffsetNumber) * ts->control->max_off);
iter->ts = ts;
- iter->result.blkno = InvalidBlockNumber;
- iter->result.offsets = palloc(sizeof(OffsetNumber) * ts->control->max_offset);
-
if (TidStoreIsShared(ts))
iter->tree_iter.shared = shared_rt_begin_iterate(ts->tree.shared);
else
iter->tree_iter.local = local_rt_begin_iterate(ts->tree.local);
/* If the TidStore is empty, there is no business */
- if (tidstore_num_tids(ts) == 0)
+ if (TidStoreNumTids(ts) == 0)
iter->finished = true;
return iter;
}
static inline bool
-tidstore_iter_kv(TidStoreIter *iter, uint64 *key, uint64 *val)
+tidstore_iter(TidStoreIter *iter, tidkey *key, offsetbm *off_bitmap)
{
if (TidStoreIsShared(iter->ts))
- return shared_rt_iterate_next(iter->tree_iter.shared, key, val);
+ return shared_rt_iterate_next(iter->tree_iter.shared, key, off_bitmap);
- return local_rt_iterate_next(iter->tree_iter.local, key, val);
+ return local_rt_iterate_next(iter->tree_iter.local, key, off_bitmap);
}
/*
@@ -498,45 +512,48 @@ tidstore_iter_kv(TidStoreIter *iter, uint64 *key, uint64 *val)
* numbers in each result is also sorted in ascending order.
*/
TidStoreIterResult *
-tidstore_iterate_next(TidStoreIter *iter)
+TidStoreIterateNext(TidStoreIter *iter)
{
- uint64 key;
- uint64 val;
- TidStoreIterResult *result = &(iter->result);
+ tidkey key;
+ offsetbm off_bitmap = 0;
+ TidStoreIterResult *output = &(iter->output);
if (iter->finished)
return NULL;
- if (BlockNumberIsValid(result->blkno))
- {
- /* Process the previously collected key-value */
- result->num_offsets = 0;
- tidstore_iter_extract_tids(iter, iter->next_key, iter->next_val);
- }
+ /* Initialize the outputs */
+ output->blkno = InvalidBlockNumber;
+ output->num_offsets = 0;
- while (tidstore_iter_kv(iter, &key, &val))
- {
- BlockNumber blkno;
+ /*
+ * Decode the key and offset bitmap collected in the previous
+ * iteration, if any.
+ */
+ if (iter->next_off_bitmap > 0)
+ iter_decode_key_off(iter, iter->next_tidkey, iter->next_off_bitmap);
- blkno = key_get_blkno(iter->ts, key);
+ while (tidstore_iter(iter, &key, &off_bitmap))
+ {
+ BlockNumber blkno = key_get_blkno(iter->ts, key);
- if (BlockNumberIsValid(result->blkno) && result->blkno != blkno)
+ if (BlockNumberIsValid(output->blkno) && output->blkno != blkno)
{
/*
- * We got a key-value pair for a different block. So return the
- * collected tids, and remember the key-value for the next iteration.
+ * We got tids for a different block. We return the collected
+ * tids so far, and remember the key-value for the next
+ * iteration.
*/
- iter->next_key = key;
- iter->next_val = val;
- return result;
+ iter->next_tidkey = key;
+ iter->next_off_bitmap = off_bitmap;
+ return output;
}
- /* Collect tids extracted from the key-value pair */
- tidstore_iter_extract_tids(iter, key, val);
+ /* Collect tids decoded from the key and offset bitmap */
+ iter_decode_key_off(iter, key, off_bitmap);
}
iter->finished = true;
- return result;
+ return output;
}
/*
@@ -544,22 +561,21 @@ tidstore_iterate_next(TidStoreIter *iter)
* or when existing an iteration.
*/
void
-tidstore_end_iterate(TidStoreIter *iter)
+TidStoreEndIterate(TidStoreIter *iter)
{
if (TidStoreIsShared(iter->ts))
shared_rt_end_iterate(iter->tree_iter.shared);
else
local_rt_end_iterate(iter->tree_iter.local);
- pfree(iter->result.offsets);
pfree(iter);
}
/* Return the number of tids we collected so far */
int64
-tidstore_num_tids(TidStore *ts)
+TidStoreNumTids(TidStore *ts)
{
- uint64 num_tids;
+ int64 num_tids;
Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
@@ -575,16 +591,16 @@ tidstore_num_tids(TidStore *ts)
/* Return true if the current memory usage of TidStore exceeds the limit */
bool
-tidstore_is_full(TidStore *ts)
+TidStoreIsFull(TidStore *ts)
{
Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
- return (tidstore_memory_usage(ts) > ts->control->max_bytes);
+ return (TidStoreMemoryUsage(ts) > ts->control->max_bytes);
}
/* Return the maximum memory TidStore can use */
size_t
-tidstore_max_memory(TidStore *ts)
+TidStoreMaxMemory(TidStore *ts)
{
Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
@@ -593,7 +609,7 @@ tidstore_max_memory(TidStore *ts)
/* Return the memory usage of TidStore */
size_t
-tidstore_memory_usage(TidStore *ts)
+TidStoreMemoryUsage(TidStore *ts)
{
Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
@@ -611,71 +627,75 @@ tidstore_memory_usage(TidStore *ts)
/*
* Get a handle that can be used by other processes to attach to this TidStore
*/
-tidstore_handle
-tidstore_get_handle(TidStore *ts)
+TidStoreHandle
+TidStoreGetHandle(TidStore *ts)
{
Assert(TidStoreIsShared(ts) && ts->control->magic == TIDSTORE_MAGIC);
return ts->control->handle;
}
-/* Extract tids from the given key-value pair */
+/*
+ * Decode the key and offset bitmap into tids and store them in the iteration
+ * result.
+ */
static void
-tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, uint64 val)
+iter_decode_key_off(TidStoreIter *iter, tidkey key, offsetbm off_bitmap)
{
- TidStoreIterResult *result = (&iter->result);
+ TidStoreIterResult *output = (&iter->output);
- while (val)
+ while (off_bitmap)
{
- uint64 tid_i;
+ uint64 compressed_tid;
OffsetNumber off;
- tid_i = key << TIDSTORE_VALUE_NBITS;
- tid_i |= pg_rightmost_one_pos64(val);
+ compressed_tid = key << LOWER_OFFSET_NBITS;
+ compressed_tid |= pg_rightmost_one_pos64(off_bitmap);
- off = tid_i & ((UINT64CONST(1) << iter->ts->control->offset_nbits) - 1);
+ off = compressed_tid & ((UINT64CONST(1) << iter->ts->control->max_off_nbits) - 1);
- Assert(result->num_offsets < iter->ts->control->max_offset);
- result->offsets[result->num_offsets++] = off;
+ Assert(output->num_offsets < iter->ts->control->max_off);
+ output->offsets[output->num_offsets++] = off;
/* unset the rightmost bit */
- val &= ~pg_rightmost_one64(val);
+ off_bitmap &= ~pg_rightmost_one64(off_bitmap);
}
- result->blkno = key_get_blkno(iter->ts, key);
+ output->blkno = key_get_blkno(iter->ts, key);
}
/* Get block number from the given key */
static inline BlockNumber
-key_get_blkno(TidStore *ts, uint64 key)
+key_get_blkno(TidStore *ts, tidkey key)
{
- return (BlockNumber) (key >> ts->control->offset_key_nbits);
+ return (BlockNumber) (key >> ts->control->upper_off_nbits);
}
-/* Encode a tid to key and offset */
-static inline uint64
-tid_to_key_off(TidStore *ts, ItemPointer tid, uint64 *off_bit)
+/* Encode a tid to key and partial offset */
+static inline tidkey
+encode_tid(TidStore *ts, ItemPointer tid, offsetbm *off_bit)
{
- uint32 offset = ItemPointerGetOffsetNumber(tid);
+ OffsetNumber offset = ItemPointerGetOffsetNumber(tid);
BlockNumber block = ItemPointerGetBlockNumber(tid);
- return encode_key_off(ts, block, offset, off_bit);
+ return encode_blk_off(ts, block, offset, off_bit);
}
/* encode a block and offset to a key and partial offset */
-static inline uint64
-encode_key_off(TidStore *ts, BlockNumber block, uint32 offset, uint64 *off_bit)
+static inline tidkey
+encode_blk_off(TidStore *ts, BlockNumber block, OffsetNumber offset,
+ offsetbm *off_bit)
{
- uint64 key;
- uint64 tid_i;
+ tidkey key;
+ uint64 compressed_tid;
uint32 off_lower;
- off_lower = offset & TIDSTORE_OFFSET_MASK;
- Assert(off_lower < (sizeof(uint64) * BITS_PER_BYTE));
+ off_lower = offset & LOWER_OFFSET_MASK;
+ Assert(off_lower < (sizeof(offsetbm) * BITS_PER_BYTE));
*off_bit = UINT64CONST(1) << off_lower;
- tid_i = offset | ((uint64) block << ts->control->offset_nbits);
- key = tid_i >> TIDSTORE_VALUE_NBITS;
+ compressed_tid = offset | ((uint64) block << ts->control->max_off_nbits);
+ key = compressed_tid >> LOWER_OFFSET_NBITS;
return key;
}
diff --git a/src/include/access/tidstore.h b/src/include/access/tidstore.h
index a35a52124a..66f0fdd482 100644
--- a/src/include/access/tidstore.h
+++ b/src/include/access/tidstore.h
@@ -17,33 +17,34 @@
#include "storage/itemptr.h"
#include "utils/dsa.h"
-typedef dsa_pointer tidstore_handle;
+typedef dsa_pointer TidStoreHandle;
typedef struct TidStore TidStore;
typedef struct TidStoreIter TidStoreIter;
+/* Result struct for TidStoreIterateNext */
typedef struct TidStoreIterResult
{
BlockNumber blkno;
- OffsetNumber *offsets;
int num_offsets;
+ OffsetNumber offsets[FLEXIBLE_ARRAY_MEMBER];
} TidStoreIterResult;
-extern TidStore *tidstore_create(size_t max_bytes, int max_offset, dsa_area *dsa);
-extern TidStore *tidstore_attach(dsa_area *dsa, dsa_pointer handle);
-extern void tidstore_detach(TidStore *ts);
-extern void tidstore_destroy(TidStore *ts);
-extern void tidstore_reset(TidStore *ts);
-extern void tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
- int num_offsets);
-extern bool tidstore_lookup_tid(TidStore *ts, ItemPointer tid);
-extern TidStoreIter * tidstore_begin_iterate(TidStore *ts);
-extern TidStoreIterResult *tidstore_iterate_next(TidStoreIter *iter);
-extern void tidstore_end_iterate(TidStoreIter *iter);
-extern int64 tidstore_num_tids(TidStore *ts);
-extern bool tidstore_is_full(TidStore *ts);
-extern size_t tidstore_max_memory(TidStore *ts);
-extern size_t tidstore_memory_usage(TidStore *ts);
-extern tidstore_handle tidstore_get_handle(TidStore *ts);
+extern TidStore *TidStoreCreate(size_t max_bytes, int max_off, dsa_area *dsa);
+extern TidStore *TidStoreAttach(dsa_area *dsa, dsa_pointer handle);
+extern void TidStoreDetach(TidStore *ts);
+extern void TidStoreDestroy(TidStore *ts);
+extern void TidStoreReset(TidStore *ts);
+extern void TidStoreSetBlockOffsets(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets);
+extern bool TidStoreIsMember(TidStore *ts, ItemPointer tid);
+extern TidStoreIter * TidStoreBeginIterate(TidStore *ts);
+extern TidStoreIterResult *TidStoreIterateNext(TidStoreIter *iter);
+extern void TidStoreEndIterate(TidStoreIter *iter);
+extern int64 TidStoreNumTids(TidStore *ts);
+extern bool TidStoreIsFull(TidStore *ts);
+extern size_t TidStoreMaxMemory(TidStore *ts);
+extern size_t TidStoreMemoryUsage(TidStore *ts);
+extern TidStoreHandle TidStoreGetHandle(TidStore *ts);
#endif /* TIDSTORE_H */
diff --git a/src/test/modules/test_tidstore/test_tidstore.c b/src/test/modules/test_tidstore/test_tidstore.c
index 9a1217f833..8659e6780e 100644
--- a/src/test/modules/test_tidstore/test_tidstore.c
+++ b/src/test/modules/test_tidstore/test_tidstore.c
@@ -37,10 +37,10 @@ check_tid(TidStore *ts, BlockNumber blkno, OffsetNumber off, bool expect)
ItemPointerSet(&tid, blkno, off);
- found = tidstore_lookup_tid(ts, &tid);
+ found = TidStoreIsMember(ts, &tid);
if (found != expect)
- elog(ERROR, "lookup TID (%u, %u) returned %d, expected %d",
+ elog(ERROR, "TidStoreIsMember for TID (%u, %u) returned %d, expected %d",
blkno, off, found, expect);
}
@@ -69,9 +69,9 @@ test_basic(int max_offset)
LWLockRegisterTranche(tranche_id, "test_tidstore");
dsa = dsa_create(tranche_id);
- ts = tidstore_create(TEST_TIDSTORE_MAX_BYTES, max_offset, dsa);
+ ts = TidStoreCreate(TEST_TIDSTORE_MAX_BYTES, max_offset, dsa);
#else
- ts = tidstore_create(TEST_TIDSTORE_MAX_BYTES, max_offset, NULL);
+ ts = TidStoreCreate(TEST_TIDSTORE_MAX_BYTES, max_offset, NULL);
#endif
/* prepare the offset array */
@@ -83,7 +83,7 @@ test_basic(int max_offset)
/* add tids */
for (int i = 0; i < TEST_TIDSTORE_NUM_BLOCKS; i++)
- tidstore_add_tids(ts, blks[i], offs, TEST_TIDSTORE_NUM_OFFSETS);
+ TidStoreSetBlockOffsets(ts, blks[i], offs, TEST_TIDSTORE_NUM_OFFSETS);
/* lookup test */
for (OffsetNumber off = FirstOffsetNumber ; off < max_offset; off++)
@@ -105,30 +105,30 @@ test_basic(int max_offset)
}
/* test the number of tids */
- if (tidstore_num_tids(ts) != (TEST_TIDSTORE_NUM_BLOCKS * TEST_TIDSTORE_NUM_OFFSETS))
- elog(ERROR, "tidstore_num_tids returned " UINT64_FORMAT ", expected %d",
- tidstore_num_tids(ts),
+ if (TidStoreNumTids(ts) != (TEST_TIDSTORE_NUM_BLOCKS * TEST_TIDSTORE_NUM_OFFSETS))
+ elog(ERROR, "TidStoreNumTids returned " UINT64_FORMAT ", expected %d",
+ TidStoreNumTids(ts),
TEST_TIDSTORE_NUM_BLOCKS * TEST_TIDSTORE_NUM_OFFSETS);
/* iteration test */
- iter = tidstore_begin_iterate(ts);
+ iter = TidStoreBeginIterate(ts);
blk_idx = 0;
- while ((iter_result = tidstore_iterate_next(iter)) != NULL)
+ while ((iter_result = TidStoreIterateNext(iter)) != NULL)
{
/* check the returned block number */
if (blks_sorted[blk_idx] != iter_result->blkno)
- elog(ERROR, "tidstore_iterate_next returned block number %u, expected %u",
+ elog(ERROR, "TidStoreIterateNext returned block number %u, expected %u",
iter_result->blkno, blks_sorted[blk_idx]);
/* check the returned offset numbers */
if (TEST_TIDSTORE_NUM_OFFSETS != iter_result->num_offsets)
- elog(ERROR, "tidstore_iterate_next returned %u offsets, expected %u",
+ elog(ERROR, "TidStoreIterateNext %u offsets, expected %u",
iter_result->num_offsets, TEST_TIDSTORE_NUM_OFFSETS);
for (int i = 0; i < iter_result->num_offsets; i++)
{
if (offs[i] != iter_result->offsets[i])
- elog(ERROR, "tidstore_iterate_next returned offset number %u on block %u, expected %u",
+ elog(ERROR, "TidStoreIterateNext offset number %u on block %u, expected %u",
iter_result->offsets[i], iter_result->blkno, offs[i]);
}
@@ -136,15 +136,15 @@ test_basic(int max_offset)
}
if (blk_idx != TEST_TIDSTORE_NUM_BLOCKS)
- elog(ERROR, "tidstore_iterate_next returned %d blocks, expected %d",
+ elog(ERROR, "TidStoreIterateNext returned %d blocks, expected %d",
blk_idx, TEST_TIDSTORE_NUM_BLOCKS);
/* remove all tids */
- tidstore_reset(ts);
+ TidStoreReset(ts);
/* test the number of tids */
- if (tidstore_num_tids(ts) != 0)
- elog(ERROR, "tidstore_num_tids on empty store returned non-zero");
+ if (TidStoreNumTids(ts) != 0)
+ elog(ERROR, "TidStoreNumTids on empty store returned non-zero");
/* lookup test for empty store */
for (OffsetNumber off = FirstOffsetNumber ; off < MaxHeapTuplesPerPage;
@@ -156,7 +156,7 @@ test_basic(int max_offset)
check_tid(ts, MaxBlockNumber, off, false);
}
- tidstore_destroy(ts);
+ TidStoreDestroy(ts);
#ifdef TEST_SHARED_TIDSTORE
dsa_detach(dsa);
@@ -177,36 +177,37 @@ test_empty(void)
LWLockRegisterTranche(tranche_id, "test_tidstore");
dsa = dsa_create(tranche_id);
- ts = tidstore_create(TEST_TIDSTORE_MAX_BYTES, MaxHeapTuplesPerPage, dsa);
+ ts = TidStoreCreate(TEST_TIDSTORE_MAX_BYTES, MaxHeapTuplesPerPage, dsa);
#else
- ts = tidstore_create(TEST_TIDSTORE_MAX_BYTES, MaxHeapTuplesPerPage, NULL);
+ ts = TidStoreCreate(TEST_TIDSTORE_MAX_BYTES, MaxHeapTuplesPerPage, NULL);
#endif
elog(NOTICE, "testing empty tidstore");
ItemPointerSet(&tid, 0, FirstOffsetNumber);
- if (tidstore_lookup_tid(ts, &tid))
- elog(ERROR, "tidstore_lookup_tid for (0,1) on empty store returned true");
+ if (TidStoreIsMember(ts, &tid))
+ elog(ERROR, "TidStoreIsMember for TID (%u,%u) on empty store returned true",
+ 0, FirstOffsetNumber);
ItemPointerSet(&tid, MaxBlockNumber, MaxOffsetNumber);
- if (tidstore_lookup_tid(ts, &tid))
- elog(ERROR, "tidstore_lookup_tid for (%u,%u) on empty store returned true",
+ if (TidStoreIsMember(ts, &tid))
+ elog(ERROR, "TidStoreIsMember for TID (%u,%u) on empty store returned true",
MaxBlockNumber, MaxOffsetNumber);
- if (tidstore_num_tids(ts) != 0)
- elog(ERROR, "tidstore_num_entries on empty store returned non-zero");
+ if (TidStoreNumTids(ts) != 0)
+ elog(ERROR, "TidStoreNumTids on empty store returned non-zero");
- if (tidstore_is_full(ts))
- elog(ERROR, "tidstore_is_full on empty store returned true");
+ if (TidStoreIsFull(ts))
+ elog(ERROR, "TidStoreIsFull on empty store returned true");
- iter = tidstore_begin_iterate(ts);
+ iter = TidStoreBeginIterate(ts);
- if (tidstore_iterate_next(iter) != NULL)
- elog(ERROR, "tidstore_iterate_next on empty store returned TIDs");
+ if (TidStoreIterateNext(iter) != NULL)
+ elog(ERROR, "TidStoreIterateNext on empty store returned TIDs");
- tidstore_end_iterate(iter);
+ TidStoreEndIterate(iter);
- tidstore_destroy(ts);
+ TidStoreDestroy(ts);
#ifdef TEST_SHARED_TIDSTORE
dsa_detach(dsa);
@@ -221,6 +222,7 @@ test_tidstore(PG_FUNCTION_ARGS)
elog(NOTICE, "testing basic operations");
test_basic(MaxHeapTuplesPerPage);
test_basic(10);
+ test_basic(MaxHeapTuplesPerPage * 2);
PG_RETURN_VOID();
}
--
2.31.1
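For reviewers, here is a minimal (untested) sketch of how a caller would use the renamed TidStore API, based only on the declarations in tidstore.h above. The memory limit, block number, and offsets are made up for illustration, and the function name is hypothetical, not part of the patch:

#include "postgres.h"

#include "access/htup_details.h"
#include "access/tidstore.h"
#include "storage/itemptr.h"

/* Hypothetical caller, for illustration only */
static void
tidstore_usage_sketch(void)
{
	OffsetNumber offs[] = {1, 2, 5};
	TidStore   *ts;
	TidStoreIter *iter;
	TidStoreIterResult *result;
	ItemPointerData tid;

	/* backend-local store; pass a dsa_area instead of NULL for the shared case */
	ts = TidStoreCreate(64 * 1024 * 1024, MaxHeapTuplesPerPage, NULL);

	/* record dead item offsets for one heap block */
	TidStoreSetBlockOffsets(ts, (BlockNumber) 10, offs, lengthof(offs));

	/* membership check, as lazy_tid_reaped() would do per index tuple */
	ItemPointerSet(&tid, 10, 2);
	Assert(TidStoreIsMember(ts, &tid));

	/* iterate in block order, e.g. for the second heap pass */
	iter = TidStoreBeginIterate(ts);
	while ((result = TidStoreIterateNext(iter)) != NULL)
	{
		for (int i = 0; i < result->num_offsets; i++)
		{
			/* process (result->blkno, result->offsets[i]) */
		}
	}
	TidStoreEndIterate(iter);

	TidStoreDestroy(ts);
}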
Attachment: v29-0005-Tool-for-measuring-radix-tree-and-tidstore-perfo.patch (application/octet-stream)
From 848d68ee7c484a7041c6d0d703304cadfdfc36a2 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 16 Sep 2022 11:57:03 +0900
Subject: [PATCH v29 05/10] Tool for measuring radix tree and tidstore
performance
Includes Meson support, but commented out to avoid warnings
XXX: Not for commit
---
contrib/bench_radix_tree/Makefile | 21 +
.../bench_radix_tree--1.0.sql | 88 +++
contrib/bench_radix_tree/bench_radix_tree.c | 747 ++++++++++++++++++
.../bench_radix_tree/bench_radix_tree.control | 6 +
contrib/bench_radix_tree/expected/bench.out | 13 +
contrib/bench_radix_tree/meson.build | 33 +
contrib/bench_radix_tree/sql/bench.sql | 16 +
contrib/meson.build | 1 +
8 files changed, 925 insertions(+)
create mode 100644 contrib/bench_radix_tree/Makefile
create mode 100644 contrib/bench_radix_tree/bench_radix_tree--1.0.sql
create mode 100644 contrib/bench_radix_tree/bench_radix_tree.c
create mode 100644 contrib/bench_radix_tree/bench_radix_tree.control
create mode 100644 contrib/bench_radix_tree/expected/bench.out
create mode 100644 contrib/bench_radix_tree/meson.build
create mode 100644 contrib/bench_radix_tree/sql/bench.sql
diff --git a/contrib/bench_radix_tree/Makefile b/contrib/bench_radix_tree/Makefile
new file mode 100644
index 0000000000..952bb0ceae
--- /dev/null
+++ b/contrib/bench_radix_tree/Makefile
@@ -0,0 +1,21 @@
+# contrib/bench_radix_tree/Makefile
+
+MODULE_big = bench_radix_tree
+OBJS = \
+ bench_radix_tree.o
+
+EXTENSION = bench_radix_tree
+DATA = bench_radix_tree--1.0.sql
+
+REGRESS = bench_fixed_height
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/bench_radix_tree
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
new file mode 100644
index 0000000000..ad66265e23
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
@@ -0,0 +1,88 @@
+/* contrib/bench_radix_tree/bench_radix_tree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION bench_radix_tree" to load this file. \quit
+
+create function bench_shuffle_search(
+minblk int4,
+maxblk int4,
+random_block bool DEFAULT false,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT array_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT array_load_ms int8,
+OUT rt_search_ms int8,
+OUT array_serach_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_seq_search(
+minblk int4,
+maxblk int4,
+random_block bool DEFAULT false,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT array_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT array_load_ms int8,
+OUT rt_search_ms int8,
+OUT array_serach_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_load_random_int(
+cnt int8,
+OUT mem_allocated int8,
+OUT load_ms int8)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_search_random_nodes(
+cnt int8,
+filter_str text DEFAULT NULL,
+OUT mem_allocated int8,
+OUT load_ms int8,
+OUT search_ms int8)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C VOLATILE PARALLEL UNSAFE;
+
+create function bench_fixed_height_search(
+fanout int4,
+OUT fanout int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT rt_search_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_node128_load(
+fanout int4,
+OUT fanout int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT rt_sparseload_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_tidstore_load(
+minblk int4,
+maxblk int4,
+OUT mem_allocated int8,
+OUT load_ms int8,
+OUT iter_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c
new file mode 100644
index 0000000000..6e5149e2c4
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -0,0 +1,747 @@
+/*-------------------------------------------------------------------------
+ *
+ * bench_radix_tree.c
+ *
+ * Copyright (c) 2016-2022, PostgreSQL Global Development Group
+ *
+ * contrib/bench_radix_tree/bench_radix_tree.c
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/tidstore.h"
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "funcapi.h"
+#include "lib/radixtree.h"
+#include <math.h>
+#include "miscadmin.h"
+#include "storage/lwlock.h"
+#include "utils/builtins.h"
+#include "utils/timestamp.h"
+
+PG_MODULE_MAGIC;
+
+/* run benchmark also for binary-search case? */
+/* #define MEASURE_BINARY_SEARCH 1 */
+
+#define TIDS_PER_BLOCK_FOR_LOAD 30
+#define TIDS_PER_BLOCK_FOR_LOOKUP 50
+
+//#define RT_DEBUG
+#define RT_PREFIX rt
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#define RT_USE_DELETE
+#define RT_VALUE_TYPE uint64
+// WIP: compiles with warnings because rt_attach is defined but not used
+// #define RT_SHMEM
+#include "lib/radixtree.h"
+
+/*
+ * Return the number of keys in the radix tree.
+ */
+static uint64
+rt_num_entries(rt_radix_tree *tree)
+{
+ return tree->ctl->num_keys;
+}
+
+
+PG_FUNCTION_INFO_V1(bench_seq_search);
+PG_FUNCTION_INFO_V1(bench_shuffle_search);
+PG_FUNCTION_INFO_V1(bench_load_random_int);
+PG_FUNCTION_INFO_V1(bench_fixed_height_search);
+PG_FUNCTION_INFO_V1(bench_search_random_nodes);
+PG_FUNCTION_INFO_V1(bench_node128_load);
+PG_FUNCTION_INFO_V1(bench_tidstore_load);
+
+static uint64
+tid_to_key_off(ItemPointer tid, uint32 *off)
+{
+ uint64 upper;
+ uint32 shift = pg_ceil_log2_32(MaxHeapTuplesPerPage);
+ int64 tid_i;
+
+ Assert(ItemPointerGetOffsetNumber(tid) < MaxHeapTuplesPerPage);
+
+ tid_i = ItemPointerGetOffsetNumber(tid);
+ tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << shift;
+
+ /* log(sizeof(uint64) * BITS_PER_BYTE, 2) = log(64, 2) = 6 */
+ *off = tid_i & ((1 << 6) - 1);
+ upper = tid_i >> 6;
+ Assert(*off < (sizeof(uint64) * BITS_PER_BYTE));
+
+ Assert(*off < 64);
+
+ return upper;
+}
+
+static int
+shuffle_randrange(pg_prng_state *state, int lower, int upper)
+{
+ return (int) floor(pg_prng_double(state) * ((upper - lower) + 0.999999)) + lower;
+}
+
+/* Naive Fisher-Yates implementation */
+static void
+shuffle_itemptrs(ItemPointer itemptr, uint64 nitems)
+{
+ /* fixed seed for reproducibility */
+ pg_prng_state state;
+
+ pg_prng_seed(&state, 0);
+
+ for (int i = 0; i < nitems - 1; i++)
+ {
+ int j = shuffle_randrange(&state, i, nitems - 1);
+ ItemPointerData t = itemptr[j];
+
+ itemptr[j] = itemptr[i];
+ itemptr[i] = t;
+ }
+}
+
+static ItemPointer
+generate_tids(BlockNumber minblk, BlockNumber maxblk, int ntids_per_blk, uint64 *ntids_p,
+ bool random_block)
+{
+ ItemPointer tids;
+ uint64 maxitems;
+ uint64 ntids = 0;
+ pg_prng_state state;
+
+ maxitems = (maxblk - minblk + 1) * ntids_per_blk;
+ tids = MemoryContextAllocHuge(TopTransactionContext,
+ sizeof(ItemPointerData) * maxitems);
+
+ if (random_block)
+ pg_prng_seed(&state, 0x9E3779B185EBCA87);
+
+ for (BlockNumber blk = minblk; blk < maxblk; blk++)
+ {
+ if (random_block && !pg_prng_bool(&state))
+ continue;
+
+ for (OffsetNumber off = FirstOffsetNumber;
+ off <= ntids_per_blk; off++)
+ {
+ CHECK_FOR_INTERRUPTS();
+
+ ItemPointerSetBlockNumber(&(tids[ntids]), blk);
+ ItemPointerSetOffsetNumber(&(tids[ntids]), off);
+
+ ntids++;
+ }
+ }
+
+ *ntids_p = ntids;
+ return tids;
+}
+
+#ifdef MEASURE_BINARY_SEARCH
+static int
+vac_cmp_itemptr(const void *left, const void *right)
+{
+ BlockNumber lblk,
+ rblk;
+ OffsetNumber loff,
+ roff;
+
+ lblk = ItemPointerGetBlockNumber((ItemPointer) left);
+ rblk = ItemPointerGetBlockNumber((ItemPointer) right);
+
+ if (lblk < rblk)
+ return -1;
+ if (lblk > rblk)
+ return 1;
+
+ loff = ItemPointerGetOffsetNumber((ItemPointer) left);
+ roff = ItemPointerGetOffsetNumber((ItemPointer) right);
+
+ if (loff < roff)
+ return -1;
+ if (loff > roff)
+ return 1;
+
+ return 0;
+}
+#endif
+
+Datum
+bench_tidstore_load(PG_FUNCTION_ARGS)
+{
+ BlockNumber minblk = PG_GETARG_INT32(0);
+ BlockNumber maxblk = PG_GETARG_INT32(1);
+ TidStore *ts;
+ TidStoreIter *iter;
+ TidStoreIterResult *result;
+ OffsetNumber *offs;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 load_ms;
+ int64 iter_ms;
+ TupleDesc tupdesc;
+ Datum values[3];
+ bool nulls[3] = {false};
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ offs = palloc(sizeof(OffsetNumber) * TIDS_PER_BLOCK_FOR_LOAD);
+ for (int i = 0; i < TIDS_PER_BLOCK_FOR_LOAD; i++)
+ offs[i] = i + 1; /* FirstOffsetNumber is 1 */
+
+ ts = tidstore_create(1 * 1024L * 1024L * 1024L, MaxHeapTuplesPerPage, NULL);
+
+ /* load tids */
+ start_time = GetCurrentTimestamp();
+ for (BlockNumber blkno = minblk; blkno < maxblk; blkno++)
+ tidstore_add_tids(ts, blkno, offs, TIDS_PER_BLOCK_FOR_LOAD);
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ load_ms = secs * 1000 + usecs / 1000;
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+ /* iterate through tids */
+ iter = tidstore_begin_iterate(ts);
+ start_time = GetCurrentTimestamp();
+ while ((result = tidstore_iterate_next(iter)) != NULL)
+ ;
+ tidstore_end_iterate(iter);
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ iter_ms = secs * 1000 + usecs / 1000;
+
+ values[0] = Int64GetDatum(tidstore_memory_usage(ts));
+ values[1] = Int64GetDatum(load_ms);
+ values[2] = Int64GetDatum(iter_ms);
+
+ tidstore_destroy(ts);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+static Datum
+bench_search(FunctionCallInfo fcinfo, bool shuffle)
+{
+ BlockNumber minblk = PG_GETARG_INT32(0);
+ BlockNumber maxblk = PG_GETARG_INT32(1);
+ bool random_block = PG_GETARG_BOOL(2);
+ rt_radix_tree *rt = NULL;
+ uint64 ntids;
+ uint64 key;
+ uint64 last_key = PG_UINT64_MAX;
+ uint64 val = 0;
+ ItemPointer tids;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_load_ms,
+ rt_search_ms;
+ Datum values[7];
+ bool nulls[7];
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ tids = generate_tids(minblk, maxblk, TIDS_PER_BLOCK_FOR_LOAD, &ntids, random_block);
+
+ /* measure the load time of the radix tree */
+ rt = rt_create(CurrentMemoryContext);
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ uint32 off;
+
+ CHECK_FOR_INTERRUPTS();
+
+ key = tid_to_key_off(tid, &off);
+
+ if (last_key != PG_UINT64_MAX && last_key != key)
+ {
+ rt_set(rt, last_key, &val);
+ val = 0;
+ }
+
+ last_key = key;
+ val |= (uint64) 1 << off;
+ }
+ if (last_key != PG_UINT64_MAX)
+ rt_set(rt, last_key, &val);
+
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_load_ms = secs * 1000 + usecs / 1000;
+
+#ifdef RT_DEBUG
+ rt_stats(rt);
+#endif
+ if (shuffle)
+ shuffle_itemptrs(tids, ntids);
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+ /* measure the search time of the radix tree */
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ uint32 off;
+ volatile bool ret; /* prevent calling rt_search from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ key = tid_to_key_off(tid, &off);
+
+ ret = rt_search(rt, key, &val);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_search_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int64GetDatum(rt_num_entries(rt));
+ values[1] = Int64GetDatum(rt_memory_usage(rt));
+ values[2] = Int64GetDatum(sizeof(ItemPointerData) * ntids);
+ values[3] = Int64GetDatum(rt_load_ms);
+ nulls[4] = true; /* ar_load_ms */
+ values[5] = Int64GetDatum(rt_search_ms);
+ nulls[6] = true; /* ar_search_ms */
+
+#ifdef MEASURE_BINARY_SEARCH
+ {
+ ItemPointer itemptrs = NULL;
+
+ int64 ar_load_ms,
+ ar_search_ms;
+
+ /* measure the load time of the array */
+ itemptrs = MemoryContextAllocHuge(CurrentMemoryContext,
+ sizeof(ItemPointerData) * ntids);
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointerSetBlockNumber(&(itemptrs[i]),
+ ItemPointerGetBlockNumber(&(tids[i])));
+ ItemPointerSetOffsetNumber(&(itemptrs[i]),
+ ItemPointerGetOffsetNumber(&(tids[i])));
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ ar_load_ms = secs * 1000 + usecs / 1000;
+
+ /* next, measure the search time of the array */
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ volatile bool ret; /* prevent calling bsearch from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ ret = bsearch((void *) tid,
+ (void *) itemptrs,
+ ntids,
+ sizeof(ItemPointerData),
+ vac_cmp_itemptr);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ ar_search_ms = secs * 1000 + usecs / 1000;
+
+ /* set the result */
+ nulls[4] = false;
+ values[4] = Int64GetDatum(ar_load_ms);
+ nulls[6] = false;
+ values[6] = Int64GetDatum(ar_search_ms);
+ }
+#endif
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_seq_search(PG_FUNCTION_ARGS)
+{
+ return bench_search(fcinfo, false);
+}
+
+Datum
+bench_shuffle_search(PG_FUNCTION_ARGS)
+{
+ return bench_search(fcinfo, true);
+}
+
+Datum
+bench_load_random_int(PG_FUNCTION_ARGS)
+{
+ uint64 cnt = (uint64) PG_GETARG_INT64(0);
+ rt_radix_tree *rt;
+ pg_prng_state state;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 load_time_ms;
+ Datum values[2];
+ bool nulls[2];
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ pg_prng_seed(&state, 0);
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ uint64 key = pg_prng_uint64(&state);
+
+ rt_set(rt, key, &key);
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ load_time_ms = secs * 1000 + usecs / 1000;
+
+#ifdef RT_DEBUG
+ rt_stats(rt);
+#endif
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int64GetDatum(rt_memory_usage(rt));
+ values[1] = Int64GetDatum(load_time_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+/* copy of splitmix64() */
+static uint64
+hash64(uint64 x)
+{
+ x ^= x >> 30;
+ x *= UINT64CONST(0xbf58476d1ce4e5b9);
+ x ^= x >> 27;
+ x *= UINT64CONST(0x94d049bb133111eb);
+ x ^= x >> 31;
+ return x;
+}
+
+/* attempts to have a relatively even population of node kinds */
+Datum
+bench_search_random_nodes(PG_FUNCTION_ARGS)
+{
+ uint64 cnt = (uint64) PG_GETARG_INT64(0);
+ rt_radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 load_time_ms;
+ int64 search_time_ms;
+ Datum values[3] = {0};
+ bool nulls[3] = {0};
+ /* from trial and error */
+ uint64 filter = (((uint64) 0x7F<<32) | (0x07<<24) | (0xFF<<16) | 0xFF);
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ if (!PG_ARGISNULL(1))
+ {
+ char *filter_str = text_to_cstring(PG_GETARG_TEXT_P(1));
+
+ if (sscanf(filter_str, "0x%lX", &filter) == 0)
+ elog(ERROR, "invalid filter string %s", filter_str);
+ }
+ elog(NOTICE, "bench with filter 0x%lX", filter);
+
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ uint64 hash = hash64(i);
+ uint64 key = hash & filter;
+
+ rt_set(rt, key, &key);
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ load_time_ms = secs * 1000 + usecs / 1000;
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ uint64 hash = hash64(i);
+ uint64 key = hash & filter;
+ uint64 val;
+ volatile bool ret; /* prevent calling rt_search from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ ret = rt_search(rt, key, &val);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ search_time_ms = secs * 1000 + usecs / 1000;
+
+#ifdef RT_DEBUG
+ rt_stats(rt);
+#endif
+
+ values[0] = Int64GetDatum(rt_memory_usage(rt));
+ values[1] = Int64GetDatum(load_time_ms);
+ values[2] = Int64GetDatum(search_time_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_fixed_height_search(PG_FUNCTION_ARGS)
+{
+ int fanout = PG_GETARG_INT32(0);
+ rt_radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_load_ms,
+ rt_search_ms;
+ Datum values[5];
+ bool nulls[5];
+
+ /* test boundary between vector and iteration */
+ const int n_keys = 5 * 16 * 16 * 16 * 16;
+ uint64 r,
+ h,
+ i,
+ j,
+ k;
+ uint64 key_id;
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+
+ key_id = 0;
+
+ /*
+ * lower nodes have limited fanout, the top is only limited by
+ * bits-per-byte
+ */
+ for (r = 1;; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ for (i = 1; i <= fanout; i++)
+ {
+ for (j = 1; j <= fanout; j++)
+ {
+ for (k = 1; k <= fanout; k++)
+ {
+ uint64 key;
+
+ key = (r << 32) | (h << 24) | (i << 16) | (j << 8) | (k);
+
+ CHECK_FOR_INTERRUPTS();
+
+ key_id++;
+ if (key_id > n_keys)
+ goto finish_set;
+
+ rt_set(rt, key, &key_id);
+ }
+ }
+ }
+ }
+ }
+finish_set:
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_load_ms = secs * 1000 + usecs / 1000;
+
+#ifdef RT_DEBUG
+ rt_stats(rt);
+#endif
+
+ /* measure the search time of the radix tree */
+ start_time = GetCurrentTimestamp();
+
+ key_id = 0;
+ for (r = 1;; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ for (i = 1; i <= fanout; i++)
+ {
+ for (j = 1; j <= fanout; j++)
+ {
+ for (k = 1; k <= fanout; k++)
+ {
+ uint64 key,
+ val;
+
+ key = (r << 32) | (h << 24) | (i << 16) | (j << 8) | (k);
+
+ CHECK_FOR_INTERRUPTS();
+
+ key_id++;
+ if (key_id > n_keys)
+ goto finish_search;
+
+ rt_search(rt, key, &val);
+ }
+ }
+ }
+ }
+ }
+finish_search:
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_search_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int32GetDatum(fanout);
+ values[1] = Int64GetDatum(rt_num_entries(rt));
+ values[2] = Int64GetDatum(rt_memory_usage(rt));
+ values[3] = Int64GetDatum(rt_load_ms);
+ values[4] = Int64GetDatum(rt_search_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_node128_load(PG_FUNCTION_ARGS)
+{
+ int fanout = PG_GETARG_INT32(0);
+ rt_radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_sparseload_ms;
+ Datum values[5];
+ bool nulls[5];
+
+ uint64 r,
+ h;
+ uint64 key_id;
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ rt = rt_create(CurrentMemoryContext);
+
+ key_id = 0;
+
+ for (r = 1; r <= fanout; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (h);
+
+ key_id++;
+ rt_set(rt, key, &key_id);
+ }
+ }
+
+#ifdef RT_DEBUG
+ rt_stats(rt);
+#endif
+
+ /* measure sparse deletion and re-loading */
+ start_time = GetCurrentTimestamp();
+
+ for (int t = 0; t<10000; t++)
+ {
+ /* delete one key in each leaf */
+ for (r = 1; r <= fanout; r++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (fanout);
+
+ rt_delete(rt, key);
+ }
+
+ /* add them all back */
+ key_id = 0;
+ for (r = 1; r <= fanout; r++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (fanout);
+
+ key_id++;
+ rt_set(rt, key, &key_id);
+ }
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_sparseload_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int32GetDatum(fanout);
+ values[1] = Int64GetDatum(rt_num_entries(rt));
+ values[2] = Int64GetDatum(rt_memory_usage(rt));
+ values[3] = Int64GetDatum(rt_sparseload_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+/* to silence warnings about unused iter functions */
+static void pg_attribute_unused()
+stub_iter()
+{
+ rt_radix_tree *rt;
+ rt_iter *iter;
+ uint64 key = 1;
+ uint64 value = 1;
+
+ rt = rt_create(CurrentMemoryContext);
+
+ iter = rt_begin_iterate(rt);
+ rt_iterate_next(iter, &key, &value);
+ rt_end_iterate(iter);
+}
\ No newline at end of file
diff --git a/contrib/bench_radix_tree/bench_radix_tree.control b/contrib/bench_radix_tree/bench_radix_tree.control
new file mode 100644
index 0000000000..1d988e6c9a
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.control
@@ -0,0 +1,6 @@
+# bench_radix_tree extension
+comment = 'benchmark suite for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/bench_radix_tree'
+relocatable = true
+trusted = true
diff --git a/contrib/bench_radix_tree/expected/bench.out b/contrib/bench_radix_tree/expected/bench.out
new file mode 100644
index 0000000000..60c303892e
--- /dev/null
+++ b/contrib/bench_radix_tree/expected/bench.out
@@ -0,0 +1,13 @@
+create extension bench_radix_tree;
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
diff --git a/contrib/bench_radix_tree/meson.build b/contrib/bench_radix_tree/meson.build
new file mode 100644
index 0000000000..332c1ae7df
--- /dev/null
+++ b/contrib/bench_radix_tree/meson.build
@@ -0,0 +1,33 @@
+bench_radix_tree_sources = files(
+ 'bench_radix_tree.c',
+)
+
+if host_system == 'windows'
+ bench_radix_tree_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'bench_radix_tree',
+ '--FILEDESC', 'bench_radix_tree - performance test code for radix tree',])
+endif
+
+bench_radix_tree = shared_module('bench_radix_tree',
+ bench_radix_tree_sources,
+ link_with: pgport_srv,
+ kwargs: pg_mod_args,
+)
+testprep_targets += bench_radix_tree
+
+install_data(
+ 'bench_radix_tree.control',
+ 'bench_radix_tree--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'bench_radix_tree',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'bench_radix_tree',
+ ],
+ },
+}
diff --git a/contrib/bench_radix_tree/sql/bench.sql b/contrib/bench_radix_tree/sql/bench.sql
new file mode 100644
index 0000000000..a46018c9d4
--- /dev/null
+++ b/contrib/bench_radix_tree/sql/bench.sql
@@ -0,0 +1,16 @@
+create extension bench_radix_tree;
+
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
diff --git a/contrib/meson.build b/contrib/meson.build
index bd4a57c43c..421d469f8c 100644
--- a/contrib/meson.build
+++ b/contrib/meson.build
@@ -12,6 +12,7 @@ subdir('amcheck')
subdir('auth_delay')
subdir('auto_explain')
subdir('basic_archive')
+subdir('bench_radix_tree')
subdir('bloom')
subdir('basebackup_to_shell')
subdir('bool_plperl')
--
2.31.1
Attachment: v29-0002-Move-some-bitmap-logic-out-of-bitmapset.c.patch (application/octet-stream)
From 39f0e713854942fbad3678bce9138adea546f1be Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Tue, 6 Dec 2022 13:39:41 +0700
Subject: [PATCH v29 02/10] Move some bitmap logic out of bitmapset.c
Add pg_rightmost_one32/64 functions. This functionality was previously
private to bitmapset.c as the RIGHTMOST_ONE macro. It has practical
use in other contexts, so move to pg_bitutils.h.
Also move the logic for selecting appropriate pg_bitutils functions
based on word size to bitmapset.h for wider visibility and add
appropriate selection for bmw_rightmost_one().
Since the previous macro relied on casting to signedbitmapword,
and the new functions do not, remove that typedef.
Design input and review by Tom Lane
Discussion: https://www.postgresql.org/message-id/CAFBsxsFW2JjTo58jtDB%2B3sZhxMx3t-3evew8%3DAcr%2BGGhC%2BkFaA%40mail.gmail.com
---
src/backend/nodes/bitmapset.c | 36 ++------------------------------
src/include/nodes/bitmapset.h | 16 ++++++++++++--
src/include/port/pg_bitutils.h | 31 +++++++++++++++++++++++++++
src/tools/pgindent/typedefs.list | 1 -
4 files changed, 47 insertions(+), 37 deletions(-)
diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index 98660524ad..fcd8e2ccbc 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -32,39 +32,7 @@
#define BITMAPSET_SIZE(nwords) \
(offsetof(Bitmapset, words) + (nwords) * sizeof(bitmapword))
-/*----------
- * This is a well-known cute trick for isolating the rightmost one-bit
- * in a word. It assumes two's complement arithmetic. Consider any
- * nonzero value, and focus attention on the rightmost one. The value is
- * then something like
- * xxxxxx10000
- * where x's are unspecified bits. The two's complement negative is formed
- * by inverting all the bits and adding one. Inversion gives
- * yyyyyy01111
- * where each y is the inverse of the corresponding x. Incrementing gives
- * yyyyyy10000
- * and then ANDing with the original value gives
- * 00000010000
- * This works for all cases except original value = zero, where of course
- * we get zero.
- *----------
- */
-#define RIGHTMOST_ONE(x) ((signedbitmapword) (x) & -((signedbitmapword) (x)))
-
-#define HAS_MULTIPLE_ONES(x) ((bitmapword) RIGHTMOST_ONE(x) != (x))
-
-/* Select appropriate bit-twiddling functions for bitmap word size */
-#if BITS_PER_BITMAPWORD == 32
-#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos32(w)
-#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos32(w)
-#define bmw_popcount(w) pg_popcount32(w)
-#elif BITS_PER_BITMAPWORD == 64
-#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos64(w)
-#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos64(w)
-#define bmw_popcount(w) pg_popcount64(w)
-#else
-#error "invalid BITS_PER_BITMAPWORD"
-#endif
+#define HAS_MULTIPLE_ONES(x) (bmw_rightmost_one(x) != (x))
/*
@@ -1013,7 +981,7 @@ bms_first_member(Bitmapset *a)
{
int result;
- w = RIGHTMOST_ONE(w);
+ w = bmw_rightmost_one(w);
a->words[wordnum] &= ~w;
result = wordnum * BITS_PER_BITMAPWORD;
diff --git a/src/include/nodes/bitmapset.h b/src/include/nodes/bitmapset.h
index 3d2225e1ae..5f9a511b4a 100644
--- a/src/include/nodes/bitmapset.h
+++ b/src/include/nodes/bitmapset.h
@@ -38,13 +38,11 @@ struct List;
#define BITS_PER_BITMAPWORD 64
typedef uint64 bitmapword; /* must be an unsigned type */
-typedef int64 signedbitmapword; /* must be the matching signed type */
#else
#define BITS_PER_BITMAPWORD 32
typedef uint32 bitmapword; /* must be an unsigned type */
-typedef int32 signedbitmapword; /* must be the matching signed type */
#endif
@@ -75,6 +73,20 @@ typedef enum
BMS_MULTIPLE /* >1 member */
} BMS_Membership;
+/* Select appropriate bit-twiddling functions for bitmap word size */
+#if BITS_PER_BITMAPWORD == 32
+#define bmw_rightmost_one(w) pg_rightmost_one32(w)
+#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos32(w)
+#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos32(w)
+#define bmw_popcount(w) pg_popcount32(w)
+#elif BITS_PER_BITMAPWORD == 64
+#define bmw_rightmost_one(w) pg_rightmost_one64(w)
+#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos64(w)
+#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos64(w)
+#define bmw_popcount(w) pg_popcount64(w)
+#else
+#error "invalid BITS_PER_BITMAPWORD"
+#endif
/*
* function prototypes in nodes/bitmapset.c
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index a49a9c03d9..7235ad25ee 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -17,6 +17,37 @@ extern PGDLLIMPORT const uint8 pg_leftmost_one_pos[256];
extern PGDLLIMPORT const uint8 pg_rightmost_one_pos[256];
extern PGDLLIMPORT const uint8 pg_number_of_ones[256];
+/*----------
+ * This is a well-known cute trick for isolating the rightmost one-bit
+ * in a word. It assumes two's complement arithmetic. Consider any
+ * nonzero value, and focus attention on the rightmost one. The value is
+ * then something like
+ * xxxxxx10000
+ * where x's are unspecified bits. The two's complement negative is formed
+ * by inverting all the bits and adding one. Inversion gives
+ * yyyyyy01111
+ * where each y is the inverse of the corresponding x. Incrementing gives
+ * yyyyyy10000
+ * and then ANDing with the original value gives
+ * 00000010000
+ * This works for all cases except original value = zero, where of course
+ * we get zero.
+ *----------
+ */
+static inline uint32
+pg_rightmost_one32(uint32 word)
+{
+ int32 result = (int32) word & -((int32) word);
+ return (uint32) result;
+}
+
+static inline uint64
+pg_rightmost_one64(uint64 word)
+{
+ int64 result = (int64) word & -((int64) word);
+ return (uint64) result;
+}
+
/*
* pg_leftmost_one_pos32
* Returns the position of the most significant set bit in "word",
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 45fc5759ce..f95d3dfd69 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3670,7 +3670,6 @@ shmem_request_hook_type
shmem_startup_hook_type
sig_atomic_t
sigjmp_buf
-signedbitmapword
sigset_t
size_t
slist_head
--
2.31.1
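As a small illustration of the helpers involved here (the function name and values are made up, not part of the patch), this is the pattern iter_decode_key_off() in the tidstore patch uses to walk the set bits of an offset bitmap:

#include "postgres.h"

#include "port/pg_bitutils.h"

/* Hypothetical example; bits 2, 3 and 5 are set in the starting word. */
static void
rightmost_one_sketch(void)
{
	uint64		bitmap = UINT64CONST(0x2C);

	while (bitmap != 0)
	{
		/* pg_rightmost_one_pos64() yields 2, then 3, then 5 */
		int			pos = pg_rightmost_one_pos64(bitmap);

		(void) pos;

		/* clear the bit just consumed, using the newly exported helper */
		bitmap &= ~pg_rightmost_one64(bitmap);
	}
}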
Attachment: v29-0001-Introduce-helper-SIMD-functions-for-small-byte-a.patch (application/octet-stream)
From 58f8e2b82eb196d463114d8ec3dad343b2b027e0 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 24 Oct 2022 14:07:09 +0900
Subject: [PATCH v29 01/10] Introduce helper SIMD functions for small byte
arrays
vector8_min - helper for emulating ">=" semantics
vector8_highbit_mask - used to turn the result of a vector
comparison into a bitmask
Masahiko Sawada
Reviewed by Nathan Bossart, additional adjustments by me
Discussion: https://www.postgresql.org/message-id/CAD21AoDap240WDDdUDE0JMpCmuMMnGajrKrkCRxM7zn9Xk3JRA%40mail.gmail.com
---
src/include/port/simd.h | 47 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 47 insertions(+)
diff --git a/src/include/port/simd.h b/src/include/port/simd.h
index 1fa6c3bc6c..dfae14e463 100644
--- a/src/include/port/simd.h
+++ b/src/include/port/simd.h
@@ -79,6 +79,7 @@ static inline bool vector8_has_le(const Vector8 v, const uint8 c);
static inline bool vector8_is_highbit_set(const Vector8 v);
#ifndef USE_NO_SIMD
static inline bool vector32_is_highbit_set(const Vector32 v);
+static inline uint32 vector8_highbit_mask(const Vector8 v);
#endif
/* arithmetic operations */
@@ -96,6 +97,7 @@ static inline Vector8 vector8_ssub(const Vector8 v1, const Vector8 v2);
*/
#ifndef USE_NO_SIMD
static inline Vector8 vector8_eq(const Vector8 v1, const Vector8 v2);
+static inline Vector8 vector8_min(const Vector8 v1, const Vector8 v2);
static inline Vector32 vector32_eq(const Vector32 v1, const Vector32 v2);
#endif
@@ -299,6 +301,36 @@ vector32_is_highbit_set(const Vector32 v)
}
#endif /* ! USE_NO_SIMD */
+/*
+ * Return a bitmask formed from the high-bit of each element.
+ */
+#ifndef USE_NO_SIMD
+static inline uint32
+vector8_highbit_mask(const Vector8 v)
+{
+#ifdef USE_SSE2
+ return (uint32) _mm_movemask_epi8(v);
+#elif defined(USE_NEON)
+ /*
+ * Note: There is a faster way to do this, but it returns a uint64,
+ * and if the caller wanted to extract the bit position using CTZ,
+ * it would have to divide that result by 4.
+ */
+ static const uint8 mask[16] = {
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ };
+
+ uint8x16_t masked = vandq_u8(vld1q_u8(mask), (uint8x16_t) vshrq_n_s8(v, 7));
+ uint8x16_t maskedhi = vextq_u8(masked, masked, 8);
+
+ return (uint32) vaddvq_u16((uint16x8_t) vzip1q_u8(masked, maskedhi));
+#endif
+}
+#endif /* ! USE_NO_SIMD */
+
/*
* Return the bitwise OR of the inputs
*/
@@ -372,4 +404,19 @@ vector32_eq(const Vector32 v1, const Vector32 v2)
}
#endif /* ! USE_NO_SIMD */
+/*
+ * Given two vectors, return a vector with the minimum element of each.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_min(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+ return _mm_min_epu8(v1, v2);
+#elif defined(USE_NEON)
+ return vminq_u8(v1, v2);
+#endif
+}
+#endif /* ! USE_NO_SIMD */
+
#endif /* SIMD_H */
--
2.31.1
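To show how these helpers are meant to be combined (this roughly mirrors the chunk-array searches in the radix tree template below, but the function here is only an illustration and assumes a SIMD-capable build), the new vector8_highbit_mask() turns a byte-wise vector8_eq() result into a bitmask whose rightmost set bit gives the index of the first match; vector8_min() plus vector8_eq() can be used in the same way to get per-element ordered comparisons, e.g. for finding an insert position:

#include "postgres.h"

#include "port/pg_bitutils.h"
#include "port/simd.h"

#ifndef USE_NO_SIMD
/*
 * Hypothetical example: return the index of the first byte equal to "key"
 * among the first sizeof(Vector8) bytes of "chunks", or -1 if none matches.
 */
static int
chunk_search_eq_sketch(const uint8 *chunks, uint8 key)
{
	Vector8		spread_key = vector8_broadcast(key);
	Vector8		haystack;
	uint32		bitfield;

	vector8_load(&haystack, chunks);
	bitfield = vector8_highbit_mask(vector8_eq(haystack, spread_key));

	if (bitfield == 0)
		return -1;

	return pg_rightmost_one_pos32(bitfield);
}
#endif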
Attachment: v29-0003-Add-radixtree-template.patch (application/octet-stream)
From ab33774676db3e419dd56b2001f0cbf2bc291d3d Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 14 Sep 2022 12:38:51 +0000
Subject: [PATCH v29 03/10] Add radixtree template
WIP: commit message based on template comments
---
src/backend/utils/mmgr/dsa.c | 12 +
src/include/lib/radixtree.h | 2516 +++++++++++++++++
src/include/lib/radixtree_delete_impl.h | 122 +
src/include/lib/radixtree_insert_impl.h | 328 +++
src/include/lib/radixtree_iter_impl.h | 153 +
src/include/lib/radixtree_search_impl.h | 138 +
src/include/utils/dsa.h | 1 +
src/test/modules/Makefile | 1 +
src/test/modules/meson.build | 1 +
src/test/modules/test_radixtree/.gitignore | 4 +
src/test/modules/test_radixtree/Makefile | 23 +
src/test/modules/test_radixtree/README | 7 +
.../expected/test_radixtree.out | 36 +
src/test/modules/test_radixtree/meson.build | 35 +
.../test_radixtree/sql/test_radixtree.sql | 7 +
.../test_radixtree/test_radixtree--1.0.sql | 8 +
.../modules/test_radixtree/test_radixtree.c | 681 +++++
.../test_radixtree/test_radixtree.control | 4 +
src/tools/pginclude/cpluspluscheck | 6 +
src/tools/pginclude/headerscheck | 6 +
20 files changed, 4089 insertions(+)
create mode 100644 src/include/lib/radixtree.h
create mode 100644 src/include/lib/radixtree_delete_impl.h
create mode 100644 src/include/lib/radixtree_insert_impl.h
create mode 100644 src/include/lib/radixtree_iter_impl.h
create mode 100644 src/include/lib/radixtree_search_impl.h
create mode 100644 src/test/modules/test_radixtree/.gitignore
create mode 100644 src/test/modules/test_radixtree/Makefile
create mode 100644 src/test/modules/test_radixtree/README
create mode 100644 src/test/modules/test_radixtree/expected/test_radixtree.out
create mode 100644 src/test/modules/test_radixtree/meson.build
create mode 100644 src/test/modules/test_radixtree/sql/test_radixtree.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree--1.0.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree.c
create mode 100644 src/test/modules/test_radixtree/test_radixtree.control
diff --git a/src/backend/utils/mmgr/dsa.c b/src/backend/utils/mmgr/dsa.c
index f5a62061a3..80555aefff 100644
--- a/src/backend/utils/mmgr/dsa.c
+++ b/src/backend/utils/mmgr/dsa.c
@@ -1024,6 +1024,18 @@ dsa_set_size_limit(dsa_area *area, size_t limit)
LWLockRelease(DSA_AREA_LOCK(area));
}
+size_t
+dsa_get_total_size(dsa_area *area)
+{
+ size_t size;
+
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_SHARED);
+ size = area->control->total_segment_size;
+ LWLockRelease(DSA_AREA_LOCK(area));
+
+ return size;
+}
+
/*
* Aggressively free all spare memory in the hope of returning DSM segments to
* the operating system.
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
new file mode 100644
index 0000000000..e546bd705c
--- /dev/null
+++ b/src/include/lib/radixtree.h
@@ -0,0 +1,2516 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.h
+ * Template for adaptive radix tree.
+ *
+ * This module employs the idea from the paper "The Adaptive Radix Tree: ARTful
+ * Indexing for Main-Memory Databases" by Viktor Leis, Alfons Kemper, and Thomas
+ * Neumann, 2013. The radix tree uses adaptive node sizes, a small number of node
+ * types, each with a different number of elements. Depending on the number of
+ * children, the appropriate node type is used.
+ *
+ * WIP: notes about traditional radix tree trading off span vs height...
+ *
+ * There are two kinds of nodes, inner nodes and leaves. Inner nodes
+ * map partial keys to child pointers.
+ *
+ * The ART paper mentions three ways to implement leaves:
+ *
+ * "- Single-value leaves: The values are stored using an addi-
+ * tional leaf node type which stores one value.
+ * - Multi-value leaves: The values are stored in one of four
+ * different leaf node types, which mirror the structure of
+ * inner nodes, but contain values instead of pointers.
+ * - Combined pointer/value slots: If values fit into point-
+ * ers, no separate node types are necessary. Instead, each
+ * pointer storage location in an inner node can either
+ * store a pointer or a value."
+ *
+ * We chose "multi-value leaves" to avoid the additional pointer traversal
+ * required by "single-value leaves"
+ *
+ * For simplicity, the key is assumed to be 64-bit unsigned integer. The
+ * tree doesn't need to contain paths where the highest bytes of all keys
+ * are zero. That way, the tree's height adapts to the distribution of keys.
+ *
+ * TODO: In the future it might be worthwhile to offer configurability of
+ * leaf implementation for different use cases. Single-value leaves would
+ * give more flexibility in key type, including variable-length keys.
+ *
+ * There are some optimizations not yet implemented, particularly path
+ * compression and lazy path expansion.
+ *
+ * To handle concurrency, we use a single reader-writer lock for the radix
+ * tree. The radix tree is exclusively locked during write operations such
+ * as RT_SET() and RT_DELETE(), and shared locked during read operations
+ * such as RT_SEARCH(). An iteration also holds the shared lock on the radix
+ * tree until it is completed.
+ *
+ * TODO: The current locking mechanism is not optimized for high concurrency
+ * with mixed read-write workloads. In the future it might be worthwhile
+ * to replace it with the Optimistic Lock Coupling or ROWEX mentioned in
+ * the paper "The ART of Practical Synchronization" by the same authors as
+ * the ART paper, 2016.
+ *
+ * WIP: the radix tree nodes don't shrink.
+ *
+ * To generate a radix tree and associated functions for a use case several
+ * macros have to be #define'ed before this file is included. Including
+ * the file #undef's all those, so a new radix tree can be generated
+ * afterwards.
+ * The relevant parameters are:
+ * - RT_PREFIX - prefix for all symbol names generated. A prefix of 'foo'
+ * will result in radix tree type 'foo_radix_tree' and functions like
+ * 'foo_create'/'foo_free' and so forth.
+ * - RT_DECLARE - if defined function prototypes and type declarations are
+ * generated
+ * - RT_DEFINE - if defined function definitions are generated
+ * - RT_SCOPE - in which scope (e.g. extern, static inline) do function
+ * declarations reside
+ * - RT_VALUE_TYPE - the type of the value.
+ *
+ * Optional parameters:
+ * - RT_SHMEM - if defined, the radix tree is created in the DSA area
+ * so that multiple processes can access it simultaneously.
+ * - RT_DEBUG - if defined add stats tracking and debugging functions
+ *
+ * Interface
+ * ---------
+ *
+ * RT_CREATE - Create a new, empty radix tree
+ * RT_FREE - Free the radix tree
+ * RT_SEARCH - Search a key-value pair
+ * RT_SET - Set a key-value pair
+ * RT_BEGIN_ITERATE - Begin iterating through all key-value pairs
+ * RT_ITERATE_NEXT - Return next key-value pair, if any
+ * RT_END_ITERATE - End iteration
+ * RT_MEMORY_USAGE - Get the memory usage
+ *
+ * Interface for Shared Memory
+ * ---------
+ *
+ * RT_ATTACH - Attach to the radix tree
+ * RT_DETACH - Detach from the radix tree
+ * RT_GET_HANDLE - Return the handle of the radix tree
+ *
+ * Optional Interface
+ * ---------
+ *
+ * RT_DELETE - Delete a key-value pair. Declared/defined if RT_USE_DELETE is defined
+ *
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/lib/radixtree.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "lib/stringinfo.h"
+#include "miscadmin.h"
+#include "nodes/bitmapset.h"
+#include "port/pg_bitutils.h"
+#include "port/simd.h"
+#include "utils/dsa.h"
+#include "utils/memutils.h"
+
+/* helpers */
+#define RT_MAKE_PREFIX(a) CppConcat(a,_)
+#define RT_MAKE_NAME(name) RT_MAKE_NAME_(RT_MAKE_PREFIX(RT_PREFIX),name)
+#define RT_MAKE_NAME_(a,b) CppConcat(a,b)
+
+/* function declarations */
+#define RT_CREATE RT_MAKE_NAME(create)
+#define RT_FREE RT_MAKE_NAME(free)
+#define RT_SEARCH RT_MAKE_NAME(search)
+#ifdef RT_SHMEM
+#define RT_ATTACH RT_MAKE_NAME(attach)
+#define RT_DETACH RT_MAKE_NAME(detach)
+#define RT_GET_HANDLE RT_MAKE_NAME(get_handle)
+#endif
+#define RT_SET RT_MAKE_NAME(set)
+#define RT_BEGIN_ITERATE RT_MAKE_NAME(begin_iterate)
+#define RT_ITERATE_NEXT RT_MAKE_NAME(iterate_next)
+#define RT_END_ITERATE RT_MAKE_NAME(end_iterate)
+#ifdef RT_USE_DELETE
+#define RT_DELETE RT_MAKE_NAME(delete)
+#endif
+#define RT_MEMORY_USAGE RT_MAKE_NAME(memory_usage)
+#ifdef RT_DEBUG
+#define RT_DUMP RT_MAKE_NAME(dump)
+#define RT_DUMP_NODE RT_MAKE_NAME(dump_node)
+#define RT_DUMP_SEARCH RT_MAKE_NAME(dump_search)
+#define RT_STATS RT_MAKE_NAME(stats)
+#endif
+
+/* internal helper functions (no externally visible prototypes) */
+#define RT_NEW_ROOT RT_MAKE_NAME(new_root)
+#define RT_ALLOC_NODE RT_MAKE_NAME(alloc_node)
+#define RT_INIT_NODE RT_MAKE_NAME(init_node)
+#define RT_FREE_NODE RT_MAKE_NAME(free_node)
+#define RT_FREE_RECURSE RT_MAKE_NAME(free_recurse)
+#define RT_EXTEND RT_MAKE_NAME(extend)
+#define RT_SET_EXTEND RT_MAKE_NAME(set_extend)
+#define RT_SWITCH_NODE_KIND RT_MAKE_NAME(grow_node_kind)
+#define RT_COPY_NODE RT_MAKE_NAME(copy_node)
+#define RT_REPLACE_NODE RT_MAKE_NAME(replace_node)
+#define RT_PTR_GET_LOCAL RT_MAKE_NAME(ptr_get_local)
+#define RT_PTR_ALLOC_IS_VALID RT_MAKE_NAME(ptr_stored_is_valid)
+#define RT_NODE_3_SEARCH_EQ RT_MAKE_NAME(node_3_search_eq)
+#define RT_NODE_32_SEARCH_EQ RT_MAKE_NAME(node_32_search_eq)
+#define RT_NODE_3_GET_INSERTPOS RT_MAKE_NAME(node_3_get_insertpos)
+#define RT_NODE_32_GET_INSERTPOS RT_MAKE_NAME(node_32_get_insertpos)
+#define RT_CHUNK_CHILDREN_ARRAY_SHIFT RT_MAKE_NAME(chunk_children_array_shift)
+#define RT_CHUNK_VALUES_ARRAY_SHIFT RT_MAKE_NAME(chunk_values_array_shift)
+#define RT_CHUNK_CHILDREN_ARRAY_DELETE RT_MAKE_NAME(chunk_children_array_delete)
+#define RT_CHUNK_VALUES_ARRAY_DELETE RT_MAKE_NAME(chunk_values_array_delete)
+#define RT_CHUNK_CHILDREN_ARRAY_COPY RT_MAKE_NAME(chunk_children_array_copy)
+#define RT_CHUNK_VALUES_ARRAY_COPY RT_MAKE_NAME(chunk_values_array_copy)
+#define RT_NODE_125_IS_CHUNK_USED RT_MAKE_NAME(node_125_is_chunk_used)
+#define RT_NODE_INNER_125_GET_CHILD RT_MAKE_NAME(node_inner_125_get_child)
+#define RT_NODE_LEAF_125_GET_VALUE RT_MAKE_NAME(node_leaf_125_get_value)
+#define RT_NODE_INNER_256_IS_CHUNK_USED RT_MAKE_NAME(node_inner_256_is_chunk_used)
+#define RT_NODE_LEAF_256_IS_CHUNK_USED RT_MAKE_NAME(node_leaf_256_is_chunk_used)
+#define RT_NODE_INNER_256_GET_CHILD RT_MAKE_NAME(node_inner_256_get_child)
+#define RT_NODE_LEAF_256_GET_VALUE RT_MAKE_NAME(node_leaf_256_get_value)
+#define RT_NODE_INNER_256_SET RT_MAKE_NAME(node_inner_256_set)
+#define RT_NODE_LEAF_256_SET RT_MAKE_NAME(node_leaf_256_set)
+#define RT_NODE_INNER_256_DELETE RT_MAKE_NAME(node_inner_256_delete)
+#define RT_NODE_LEAF_256_DELETE RT_MAKE_NAME(node_leaf_256_delete)
+#define RT_KEY_GET_SHIFT RT_MAKE_NAME(key_get_shift)
+#define RT_SHIFT_GET_MAX_VAL RT_MAKE_NAME(shift_get_max_val)
+#define RT_NODE_SEARCH_INNER RT_MAKE_NAME(node_search_inner)
+#define RT_NODE_SEARCH_LEAF RT_MAKE_NAME(node_search_leaf)
+#define RT_NODE_UPDATE_INNER RT_MAKE_NAME(node_update_inner)
+#define RT_NODE_DELETE_INNER RT_MAKE_NAME(node_delete_inner)
+#define RT_NODE_DELETE_LEAF RT_MAKE_NAME(node_delete_leaf)
+#define RT_NODE_INSERT_INNER RT_MAKE_NAME(node_insert_inner)
+#define RT_NODE_INSERT_LEAF RT_MAKE_NAME(node_insert_leaf)
+#define RT_NODE_INNER_ITERATE_NEXT RT_MAKE_NAME(node_inner_iterate_next)
+#define RT_NODE_LEAF_ITERATE_NEXT RT_MAKE_NAME(node_leaf_iterate_next)
+#define RT_UPDATE_ITER_STACK RT_MAKE_NAME(update_iter_stack)
+#define RT_ITER_UPDATE_KEY RT_MAKE_NAME(iter_update_key)
+#define RT_VERIFY_NODE RT_MAKE_NAME(verify_node)
+
+/* type declarations */
+#define RT_RADIX_TREE RT_MAKE_NAME(radix_tree)
+#define RT_RADIX_TREE_CONTROL RT_MAKE_NAME(radix_tree_control)
+#define RT_ITER RT_MAKE_NAME(iter)
+#ifdef RT_SHMEM
+#define RT_HANDLE RT_MAKE_NAME(handle)
+#endif
+#define RT_NODE RT_MAKE_NAME(node)
+#define RT_NODE_ITER RT_MAKE_NAME(node_iter)
+#define RT_NODE_BASE_3 RT_MAKE_NAME(node_base_3)
+#define RT_NODE_BASE_32 RT_MAKE_NAME(node_base_32)
+#define RT_NODE_BASE_125 RT_MAKE_NAME(node_base_125)
+#define RT_NODE_BASE_256 RT_MAKE_NAME(node_base_256)
+#define RT_NODE_INNER_3 RT_MAKE_NAME(node_inner_3)
+#define RT_NODE_INNER_32 RT_MAKE_NAME(node_inner_32)
+#define RT_NODE_INNER_125 RT_MAKE_NAME(node_inner_125)
+#define RT_NODE_INNER_256 RT_MAKE_NAME(node_inner_256)
+#define RT_NODE_LEAF_3 RT_MAKE_NAME(node_leaf_3)
+#define RT_NODE_LEAF_32 RT_MAKE_NAME(node_leaf_32)
+#define RT_NODE_LEAF_125 RT_MAKE_NAME(node_leaf_125)
+#define RT_NODE_LEAF_256 RT_MAKE_NAME(node_leaf_256)
+#define RT_SIZE_CLASS RT_MAKE_NAME(size_class)
+#define RT_SIZE_CLASS_ELEM RT_MAKE_NAME(size_class_elem)
+#define RT_SIZE_CLASS_INFO RT_MAKE_NAME(size_class_info)
+#define RT_CLASS_3 RT_MAKE_NAME(class_3)
+#define RT_CLASS_32_MIN RT_MAKE_NAME(class_32_min)
+#define RT_CLASS_32_MAX RT_MAKE_NAME(class_32_max)
+#define RT_CLASS_125 RT_MAKE_NAME(class_125)
+#define RT_CLASS_256 RT_MAKE_NAME(class_256)
+
+/* generate forward declarations necessary to use the radix tree */
+#ifdef RT_DECLARE
+
+typedef struct RT_RADIX_TREE RT_RADIX_TREE;
+typedef struct RT_ITER RT_ITER;
+
+#ifdef RT_SHMEM
+typedef dsa_pointer RT_HANDLE;
+#endif
+
+#ifdef RT_SHMEM
+RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx, dsa_area *dsa, int tranche_id);
+RT_SCOPE RT_RADIX_TREE * RT_ATTACH(dsa_area *dsa, dsa_pointer dp);
+RT_SCOPE void RT_DETACH(RT_RADIX_TREE *tree);
+RT_SCOPE RT_HANDLE RT_GET_HANDLE(RT_RADIX_TREE *tree);
+#else
+RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx);
+#endif
+RT_SCOPE void RT_FREE(RT_RADIX_TREE *tree);
+
+RT_SCOPE bool RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p);
+RT_SCOPE bool RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p);
+#ifdef RT_USE_DELETE
+RT_SCOPE bool RT_DELETE(RT_RADIX_TREE *tree, uint64 key);
+#endif
+
+RT_SCOPE RT_ITER * RT_BEGIN_ITERATE(RT_RADIX_TREE *tree);
+RT_SCOPE bool RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, RT_VALUE_TYPE *value_p);
+RT_SCOPE void RT_END_ITERATE(RT_ITER *iter);
+
+RT_SCOPE uint64 RT_MEMORY_USAGE(RT_RADIX_TREE *tree);
+
+#ifdef RT_DEBUG
+RT_SCOPE void RT_DUMP(RT_RADIX_TREE *tree);
+RT_SCOPE void RT_DUMP_SEARCH(RT_RADIX_TREE *tree, uint64 key);
+RT_SCOPE void RT_STATS(RT_RADIX_TREE *tree);
+#endif
+
+#endif /* RT_DECLARE */
+
+
+/* generate implementation of the radix tree */
+#ifdef RT_DEFINE
+
+/* The number of bits encoded in one tree level */
+#define RT_NODE_SPAN BITS_PER_BYTE
+
+/* The maximum number of slots in a node */
+#define RT_NODE_MAX_SLOTS (1 << RT_NODE_SPAN)
+
+/* Mask for extracting a chunk from the key */
+#define RT_CHUNK_MASK ((1 << RT_NODE_SPAN) - 1)
+
+/* Maximum shift the radix tree uses */
+#define RT_MAX_SHIFT RT_KEY_GET_SHIFT(UINT64_MAX)
+
+/* Tree level the radix tree uses */
+#define RT_MAX_LEVEL ((sizeof(uint64) * BITS_PER_BYTE) / RT_NODE_SPAN)
+
+/*
+ * Number of bits necessary for isset array in the slot-index node.
+ * Since bitmapword can be 64 bits, the only values that make sense
+ * here are 64 and 128.
+ */
+#define RT_SLOT_IDX_LIMIT (RT_NODE_MAX_SLOTS / 2)
+
+/* Invalid index used in node-125 */
+#define RT_INVALID_SLOT_IDX 0xFF
+
+/* Get a chunk from the key */
+#define RT_GET_KEY_CHUNK(key, shift) ((uint8) (((key) >> (shift)) & RT_CHUNK_MASK))
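+
+/*
+ * For illustration: with RT_NODE_SPAN of 8, the key 0x000000ABCD000000
+ * decomposes into the chunk 0xAB at shift 32 and the chunk 0xCD at shift 24,
+ * with all other chunks being zero.
+ */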
+
+/* For accessing bitmaps */
+#define RT_BM_IDX(x) ((x) / BITS_PER_BITMAPWORD)
+#define RT_BM_BIT(x) ((x) % BITS_PER_BITMAPWORD)
+
+/*
+ * Node kinds
+ *
+ * The different node kinds are what make the tree "adaptive".
+ *
+ * Each node kind is associated with a different datatype and different
+ * search/set/delete/iterate algorithms adapted for its size. The largest
+ * kind, node256, is basically the same as a traditional radix tree,
+ * and would be most wasteful of memory when sparsely populated. The
+ * smaller nodes expend some additional CPU time to enable a smaller
+ * memory footprint.
+ *
+ * XXX There are 4 node kinds, and this should never be increased,
+ * for several reasons:
+ * 1. With 5 or more kinds, gcc tends to use a jump table for switch
+ * statements.
+ * 2. The 4 kinds can be represented with 2 bits, so we have the option
+ * in the future to tag the node pointer with the kind, even on
+ * platforms with 32-bit pointers. This might speed up node traversal
+ * in trees with highly random node kinds.
+ * 3. We can have multiple size classes per node kind.
+ */
+#define RT_NODE_KIND_3 0x00
+#define RT_NODE_KIND_32 0x01
+#define RT_NODE_KIND_125 0x02
+#define RT_NODE_KIND_256 0x03
+#define RT_NODE_KIND_COUNT 4
+
+/*
+ * Calculate the slab blocksize so that we can allocate at least 32 chunks
+ * from the block.
+ */
+#define RT_SLAB_BLOCK_SIZE(size) \
+ Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * (size), (size) * 32)
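+
+/*
+ * For example, assuming SLAB_DEFAULT_BLOCK_SIZE is 8kB, a 40-byte node size
+ * yields Max((8192 / 40) * 40, 40 * 32) = Max(8160, 1280) = 8160 bytes,
+ * i.e. the largest multiple of the chunk size that fits in the default block
+ * size, while larger chunk sizes get a block big enough for 32 of them.
+ */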
+
+/* Common type for all node kinds */
+typedef struct RT_NODE
+{
+ /*
+ * Number of children. We use uint16 to be able to indicate 256 children
+ * at the fanout of 8.
+ */
+ uint16 count;
+
+ /*
+ * Max capacity for the current size class. Storing this in the
+ * node enables multiple size classes per node kind.
+ * Technically, kinds with a single size class don't need this, so we could
+ * keep this in the individual base types, but the code is simpler this way.
+ * Note: node256 is unique in that it cannot possibly have more than a
+ * single size class, so for that kind we store zero, and uint8 is
+ * sufficient for other kinds.
+ */
+ uint8 fanout;
+
+ /*
+ * Shift indicates which part of the key space is represented by this
+ * node. That is, the key is shifted by 'shift' and the lowest
+ * RT_NODE_SPAN bits are then represented in chunk.
+ */
+ uint8 shift;
+
+ /* Node kind, one per search/set algorithm */
+ uint8 kind;
+} RT_NODE;
+
+
+#define RT_PTR_LOCAL RT_NODE *
+
+#ifdef RT_SHMEM
+#define RT_PTR_ALLOC dsa_pointer
+#else
+#define RT_PTR_ALLOC RT_PTR_LOCAL
+#endif
+
+
+#ifdef RT_SHMEM
+#define RT_INVALID_PTR_ALLOC InvalidDsaPointer
+#else
+#define RT_INVALID_PTR_ALLOC NULL
+#endif
+
+#ifdef RT_SHMEM
+#define RT_LOCK_EXCLUSIVE(tree) LWLockAcquire(&tree->ctl->lock, LW_EXCLUSIVE)
+#define RT_LOCK_SHARED(tree) LWLockAcquire(&tree->ctl->lock, LW_SHARED)
+#define RT_UNLOCK(tree) LWLockRelease(&tree->ctl->lock);
+#else
+#define RT_LOCK_EXCLUSIVE(tree) ((void) 0)
+#define RT_LOCK_SHARED(tree) ((void) 0)
+#define RT_UNLOCK(tree) ((void) 0)
+#endif
+
+/*
+ * Inner nodes and leaf nodes have analogous structure. To distinguish
+ * them at runtime, we take advantage of the fact that the key chunk
+ * is accessed by shifting: inner nodes (shift > 0) store pointers to
+ * their child nodes in the slots, while in leaf nodes (shift == 0) the
+ * slots contain the values corresponding to the keys.
+ */
+#define RT_NODE_IS_LEAF(n) (((RT_PTR_LOCAL) (n))->shift == 0)
+
+#define RT_NODE_MUST_GROW(node) \
+ ((node)->base.n.count == (node)->base.n.fanout)
+
+/*
+ * Base types of each node kind, shared by the leaf and inner variants.
+ * The base types must be able to accommodate the largest size
+ * class for variable-sized node kinds.
+ */
+typedef struct RT_NODE_BASE_3
+{
+ RT_NODE n;
+
+ /* 3 children, for key chunks */
+ uint8 chunks[3];
+} RT_NODE_BASE_3;
+
+typedef struct RT_NODE_BASE_32
+{
+ RT_NODE n;
+
+ /* 32 children, for key chunks */
+ uint8 chunks[32];
+} RT_NODE_BASE_32;
+
+/*
+ * node-125 uses a "slot_idxs" array, indexed by key chunk, to store
+ * indexes into a second array that contains the values (or child
+ * pointers).
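+ *
+ * For example (illustration only): if the chunk 0x41 is stored in slot 5,
+ * then slot_idxs[0x41] == 5 and values[5] (or children[5]) holds the
+ * corresponding value (or child pointer).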
+ */
+typedef struct RT_NODE_BASE_125
+{
+ RT_NODE n;
+
+	/* For each chunk, the index into the slot array, or RT_INVALID_SLOT_IDX */
+ uint8 slot_idxs[RT_NODE_MAX_SLOTS];
+
+ /* bitmap to track which slots are in use */
+ bitmapword isset[RT_BM_IDX(RT_SLOT_IDX_LIMIT)];
+} RT_NODE_BASE_125;
+
+typedef struct RT_NODE_BASE_256
+{
+ RT_NODE n;
+} RT_NODE_BASE_256;
+
+/*
+ * Inner and leaf nodes.
+ *
+ * These are separate because the value type might differ in size from
+ * a pointer-width type.
+ */
+typedef struct RT_NODE_INNER_3
+{
+ RT_NODE_BASE_3 base;
+
+ /* number of children depends on size class */
+ RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_INNER_3;
+
+typedef struct RT_NODE_LEAF_3
+{
+ RT_NODE_BASE_3 base;
+
+ /* number of values depends on size class */
+ RT_VALUE_TYPE values[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_LEAF_3;
+
+typedef struct RT_NODE_INNER_32
+{
+ RT_NODE_BASE_32 base;
+
+ /* number of children depends on size class */
+ RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_INNER_32;
+
+typedef struct RT_NODE_LEAF_32
+{
+ RT_NODE_BASE_32 base;
+
+ /* number of values depends on size class */
+ RT_VALUE_TYPE values[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_LEAF_32;
+
+typedef struct RT_NODE_INNER_125
+{
+ RT_NODE_BASE_125 base;
+
+ /* number of children depends on size class */
+ RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_INNER_125;
+
+typedef struct RT_NODE_LEAF_125
+{
+ RT_NODE_BASE_125 base;
+
+ /* number of values depends on size class */
+ RT_VALUE_TYPE values[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_LEAF_125;
+
+/*
+ * node-256 is the largest node type. This node has an array
+ * for directly storing values (or child pointers in inner nodes).
+ * Unlike other node kinds, its array size is by definition
+ * fixed.
+ */
+typedef struct RT_NODE_INNER_256
+{
+ RT_NODE_BASE_256 base;
+
+ /* Slots for 256 children */
+ RT_PTR_ALLOC children[RT_NODE_MAX_SLOTS];
+} RT_NODE_INNER_256;
+
+typedef struct RT_NODE_LEAF_256
+{
+ RT_NODE_BASE_256 base;
+
+ /*
+ * Unlike with inner256, zero is a valid value here, so we use a
+ * bitmap to track which slots are in use.
+ */
+ bitmapword isset[RT_BM_IDX(RT_NODE_MAX_SLOTS)];
+
+ /* Slots for 256 values */
+ RT_VALUE_TYPE values[RT_NODE_MAX_SLOTS];
+} RT_NODE_LEAF_256;
+
+/*
+ * Node size classes
+ *
+ * Nodes of different kinds necessarily belong to different size classes.
+ * The main innovation in our implementation compared to the ART paper
+ * is decoupling the notion of size class from kind.
+ *
+ * The size classes within a given node kind have the same underlying
+ * type, but a variable number of children/values. This is possible
+ * because the base type contains small fixed data structures that
+ * work the same way regardless of how full the node is. We store the
+ * node's allocated capacity in the "fanout" member of RT_NODE, to allow
+ * runtime introspection.
+ *
+ * Growing from one node kind to another requires special code for each
+ * case, but growing from one size class to another within the same kind
+ * is basically just allocate + memcpy.
+ *
+ * The size classes have been chosen so that inner nodes on platforms
+ * with 64-bit pointers (and leaf nodes when using a 64-bit key) are
+ * equal to or slightly smaller than some DSA size class.
+ */
+typedef enum RT_SIZE_CLASS
+{
+ RT_CLASS_3 = 0,
+ RT_CLASS_32_MIN,
+ RT_CLASS_32_MAX,
+ RT_CLASS_125,
+ RT_CLASS_256
+} RT_SIZE_CLASS;
+
+/* Information for each size class */
+typedef struct RT_SIZE_CLASS_ELEM
+{
+ const char *name;
+ int fanout;
+
+ /* slab chunk size */
+ Size inner_size;
+ Size leaf_size;
+} RT_SIZE_CLASS_ELEM;
+
+static const RT_SIZE_CLASS_ELEM RT_SIZE_CLASS_INFO[] = {
+ [RT_CLASS_3] = {
+ .name = "radix tree node 3",
+ .fanout = 3,
+ .inner_size = sizeof(RT_NODE_INNER_3) + 3 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_3) + 3 * sizeof(RT_VALUE_TYPE),
+ },
+ [RT_CLASS_32_MIN] = {
+ .name = "radix tree node 15",
+ .fanout = 15,
+ .inner_size = sizeof(RT_NODE_INNER_32) + 15 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_32) + 15 * sizeof(RT_VALUE_TYPE),
+ },
+ [RT_CLASS_32_MAX] = {
+ .name = "radix tree node 32",
+ .fanout = 32,
+ .inner_size = sizeof(RT_NODE_INNER_32) + 32 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_32) + 32 * sizeof(RT_VALUE_TYPE),
+ },
+ [RT_CLASS_125] = {
+ .name = "radix tree node 125",
+ .fanout = 125,
+ .inner_size = sizeof(RT_NODE_INNER_125) + 125 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_125) + 125 * sizeof(RT_VALUE_TYPE),
+ },
+ [RT_CLASS_256] = {
+ .name = "radix tree node 256",
+ .fanout = 256,
+ .inner_size = sizeof(RT_NODE_INNER_256),
+ .leaf_size = sizeof(RT_NODE_LEAF_256),
+ },
+};
+
+#define RT_SIZE_CLASS_COUNT lengthof(RT_SIZE_CLASS_INFO)
+
+#ifdef RT_SHMEM
+/* A magic value used to identify our radix tree */
+#define RT_RADIX_TREE_MAGIC 0x54A48167
+#endif
+
+/* Contains the actual tree and ancillary info */
+/* WIP: this name is a bit strange */
+typedef struct RT_RADIX_TREE_CONTROL
+{
+#ifdef RT_SHMEM
+ RT_HANDLE handle;
+ uint32 magic;
+ LWLock lock;
+#endif
+
+ RT_PTR_ALLOC root;
+ uint64 max_val;
+ uint64 num_keys;
+
+ /* statistics */
+#ifdef RT_DEBUG
+ int32 cnt[RT_SIZE_CLASS_COUNT];
+#endif
+} RT_RADIX_TREE_CONTROL;
+
+/* Entry point for allocating and accessing the tree */
+typedef struct RT_RADIX_TREE
+{
+ MemoryContext context;
+
+ /* pointing to either local memory or DSA */
+ RT_RADIX_TREE_CONTROL *ctl;
+
+#ifdef RT_SHMEM
+ dsa_area *dsa;
+#else
+ MemoryContextData *inner_slabs[RT_SIZE_CLASS_COUNT];
+ MemoryContextData *leaf_slabs[RT_SIZE_CLASS_COUNT];
+#endif
+} RT_RADIX_TREE;
+
+/*
+ * Iteration support.
+ *
+ * Iterating the radix tree returns each key-value pair in ascending key
+ * order. To support this, we iterate over the nodes at each level.
+ *
+ * RT_NODE_ITER struct is used to track the iteration within a node.
+ *
+ * RT_ITER is the struct for iteration of the radix tree, and uses RT_NODE_ITER
+ * in order to track the iteration of each level. During iteration, we also
+ * construct the key whenever updating the node iteration information, e.g., when
+ * advancing the current index within the node or when moving to the next node
+ * at the same level.
+ *
+ * XXX: Currently we allow only one process to iterate at a time. Therefore,
+ * RT_NODE_ITER has local pointers to nodes, rather than RT_PTR_ALLOC.
+ * We need either a safeguard to prevent other processes from beginning an
+ * iteration while one is in progress, or support for concurrent iteration.
+ */
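+
+/*
+ * A typical iteration loop, using the hypothetical 'rt' prefix and uint64
+ * value type from the usage sketch above (illustration only):
+ *
+ *     uint64      key;
+ *     uint64      value;
+ *     rt_iter    *iter = rt_begin_iterate(tree);
+ *
+ *     while (rt_iterate_next(iter, &key, &value))
+ *     {
+ *         ... do something with key and value ...
+ *     }
+ *     rt_end_iterate(iter);
+ */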
+typedef struct RT_NODE_ITER
+{
+ RT_PTR_LOCAL node; /* current node being iterated */
+ int current_idx; /* current position. -1 for initial value */
+} RT_NODE_ITER;
+
+typedef struct RT_ITER
+{
+ RT_RADIX_TREE *tree;
+
+ /* Track the iteration on nodes of each level */
+ RT_NODE_ITER stack[RT_MAX_LEVEL];
+ int stack_len;
+
+ /* The key is constructed during iteration */
+ uint64 key;
+} RT_ITER;
+
+
+static void RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
+ uint64 key, RT_PTR_ALLOC child);
+static bool RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
+ uint64 key, RT_VALUE_TYPE *value_p);
+
+/* verification (available only with assertions enabled) */
+static void RT_VERIFY_NODE(RT_PTR_LOCAL node);
+
+/* Get the local address of an allocated node */
+static inline RT_PTR_LOCAL
+RT_PTR_GET_LOCAL(RT_RADIX_TREE *tree, RT_PTR_ALLOC node)
+{
+#ifdef RT_SHMEM
+ return dsa_get_address(tree->dsa, (dsa_pointer) node);
+#else
+ return node;
+#endif
+}
+
+static inline bool
+RT_PTR_ALLOC_IS_VALID(RT_PTR_ALLOC ptr)
+{
+#ifdef RT_SHMEM
+ return DsaPointerIsValid(ptr);
+#else
+ return PointerIsValid(ptr);
+#endif
+}
+
+/*
+ * Return index of the first element in the node's chunk array that equals
+ * 'chunk'. Return -1 if there is no such element.
+ */
+static inline int
+RT_NODE_3_SEARCH_EQ(RT_NODE_BASE_3 *node, uint8 chunk)
+{
+ int idx = -1;
+
+ for (int i = 0; i < node->n.count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ idx = i;
+ break;
+ }
+ }
+
+ return idx;
+}
+
+/*
+ * Return index of the chunk and slot arrays for inserting into the node,
+ * such that the chunk array remains ordered.
+ */
+static inline int
+RT_NODE_3_GET_INSERTPOS(RT_NODE_BASE_3 *node, uint8 chunk)
+{
+ int idx;
+
+ for (idx = 0; idx < node->n.count; idx++)
+ {
+ if (node->chunks[idx] >= chunk)
+ break;
+ }
+
+ return idx;
+}
+
+/*
+ * Return index of the first element in the node's chunk array that equals
+ * 'chunk'. Return -1 if there is no such element.
+ */
+static inline int
+RT_NODE_32_SEARCH_EQ(RT_NODE_BASE_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ uint32 bitfield;
+ int index_simd = -1;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index = -1;
+
+ for (int i = 0; i < count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ index = i;
+ break;
+ }
+ }
+#endif
+
+#ifndef USE_NO_SIMD
+ /* replicate the search key */
+ spread_chunk = vector8_broadcast(chunk);
+
+ /* compare to all 32 keys stored in the node */
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ cmp1 = vector8_eq(spread_chunk, haystack1);
+ cmp2 = vector8_eq(spread_chunk, haystack2);
+
+ /* convert comparison to a bitfield */
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+
+ /* mask off invalid entries */
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ /* convert bitfield to index by counting trailing zeros */
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+/*
+ * Return index of the chunk and slot arrays for inserting into the node,
+ * such that the chunk array remains ordered.
+ */
+static inline int
+RT_NODE_32_GET_INSERTPOS(RT_NODE_BASE_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ Vector8 min1;
+ Vector8 min2;
+ uint32 bitfield;
+ int index_simd;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index;
+
+ for (index = 0; index < count; index++)
+ {
+ /*
+ * This is coded with '>=' to match what we can do with SIMD,
+ * with an assert to keep us honest.
+ */
+ if (node->chunks[index] >= chunk)
+ {
+ Assert(node->chunks[index] != chunk);
+ break;
+ }
+ }
+#endif
+
+#ifndef USE_NO_SIMD
+ /*
+ * This is a bit more complicated than RT_NODE_32_SEARCH_EQ(), because
+ * no unsigned uint8 comparison instruction exists, at least for SSE2. So
+ * we need to play some trickery using vector8_min() to effectively get
+ * >=. There'll never be any equal elements in current uses, but that's
+ * what we get here...
+ */
+ spread_chunk = vector8_broadcast(chunk);
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ min1 = vector8_min(spread_chunk, haystack1);
+ min2 = vector8_min(spread_chunk, haystack2);
+ cmp1 = vector8_eq(spread_chunk, min1);
+ cmp2 = vector8_eq(spread_chunk, min2);
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+ else
+ index_simd = count;
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+
+/*
+ * Functions to manipulate both chunks array and children/values array.
+ * These are used for node-3 and node-32.
+ */
+
+/* Shift the elements right at 'idx' by one */
+static inline void
+RT_CHUNK_CHILDREN_ARRAY_SHIFT(uint8 *chunks, RT_PTR_ALLOC *children, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+ memmove(&(children[idx + 1]), &(children[idx]), sizeof(RT_PTR_ALLOC) * (count - idx));
+}
+
+static inline void
+RT_CHUNK_VALUES_ARRAY_SHIFT(uint8 *chunks, RT_VALUE_TYPE *values, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+ memmove(&(values[idx + 1]), &(values[idx]), sizeof(RT_VALUE_TYPE) * (count - idx));
+}
+
+/* Delete the element at 'idx' */
+static inline void
+RT_CHUNK_CHILDREN_ARRAY_DELETE(uint8 *chunks, RT_PTR_ALLOC *children, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(children[idx]), &(children[idx + 1]), sizeof(RT_PTR_ALLOC) * (count - idx - 1));
+}
+
+static inline void
+RT_CHUNK_VALUES_ARRAY_DELETE(uint8 *chunks, RT_VALUE_TYPE *values, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(values[idx]), &(values[idx + 1]), sizeof(RT_VALUE_TYPE) * (count - idx - 1));
+}
+
+/* Copy both chunks and children/values arrays */
+static inline void
+RT_CHUNK_CHILDREN_ARRAY_COPY(uint8 *src_chunks, RT_PTR_ALLOC *src_children,
+ uint8 *dst_chunks, RT_PTR_ALLOC *dst_children)
+{
+ const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_3].fanout;
+ const Size chunk_size = sizeof(uint8) * fanout;
+ const Size children_size = sizeof(RT_PTR_ALLOC) * fanout;
+
+ memcpy(dst_chunks, src_chunks, chunk_size);
+ memcpy(dst_children, src_children, children_size);
+}
+
+static inline void
+RT_CHUNK_VALUES_ARRAY_COPY(uint8 *src_chunks, RT_VALUE_TYPE *src_values,
+ uint8 *dst_chunks, RT_VALUE_TYPE *dst_values)
+{
+ const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_3].fanout;
+ const Size chunk_size = sizeof(uint8) * fanout;
+ const Size values_size = sizeof(RT_VALUE_TYPE) * fanout;
+
+ memcpy(dst_chunks, src_chunks, chunk_size);
+ memcpy(dst_values, src_values, values_size);
+}
+
+/* Functions to manipulate inner and leaf node-125 */
+
+/* Does the given chunk in the node have a value? */
+static inline bool
+RT_NODE_125_IS_CHUNK_USED(RT_NODE_BASE_125 *node, uint8 chunk)
+{
+ return node->slot_idxs[chunk] != RT_INVALID_SLOT_IDX;
+}
+
+static inline RT_PTR_ALLOC
+RT_NODE_INNER_125_GET_CHILD(RT_NODE_INNER_125 *node, uint8 chunk)
+{
+ Assert(!RT_NODE_IS_LEAF(node));
+ return node->children[node->base.slot_idxs[chunk]];
+}
+
+static inline RT_VALUE_TYPE
+RT_NODE_LEAF_125_GET_VALUE(RT_NODE_LEAF_125 *node, uint8 chunk)
+{
+ Assert(RT_NODE_IS_LEAF(node));
+ Assert(((RT_NODE_BASE_125 *) node)->slot_idxs[chunk] != RT_INVALID_SLOT_IDX);
+ return node->values[node->base.slot_idxs[chunk]];
+}
+
+/* Functions to manipulate inner and leaf node-256 */
+
+/* Return true if the slot corresponding to the given chunk is in use */
+static inline bool
+RT_NODE_INNER_256_IS_CHUNK_USED(RT_NODE_INNER_256 *node, uint8 chunk)
+{
+ Assert(!RT_NODE_IS_LEAF(node));
+ return node->children[chunk] != RT_INVALID_PTR_ALLOC;
+}
+
+static inline bool
+RT_NODE_LEAF_256_IS_CHUNK_USED(RT_NODE_LEAF_256 *node, uint8 chunk)
+{
+ int idx = RT_BM_IDX(chunk);
+ int bitnum = RT_BM_BIT(chunk);
+
+ Assert(RT_NODE_IS_LEAF(node));
+ return (node->isset[idx] & ((bitmapword) 1 << bitnum)) != 0;
+}
+
+static inline RT_PTR_ALLOC
+RT_NODE_INNER_256_GET_CHILD(RT_NODE_INNER_256 *node, uint8 chunk)
+{
+ Assert(!RT_NODE_IS_LEAF(node));
+ Assert(RT_NODE_INNER_256_IS_CHUNK_USED(node, chunk));
+ return node->children[chunk];
+}
+
+static inline RT_VALUE_TYPE
+RT_NODE_LEAF_256_GET_VALUE(RT_NODE_LEAF_256 *node, uint8 chunk)
+{
+ Assert(RT_NODE_IS_LEAF(node));
+ Assert(RT_NODE_LEAF_256_IS_CHUNK_USED(node, chunk));
+ return node->values[chunk];
+}
+
+/* Set the child in the node-256 */
+static inline void
+RT_NODE_INNER_256_SET(RT_NODE_INNER_256 *node, uint8 chunk, RT_PTR_ALLOC child)
+{
+ Assert(!RT_NODE_IS_LEAF(node));
+ node->children[chunk] = child;
+}
+
+/* Set the value in the node-256 */
+static inline void
+RT_NODE_LEAF_256_SET(RT_NODE_LEAF_256 *node, uint8 chunk, RT_VALUE_TYPE value)
+{
+ int idx = RT_BM_IDX(chunk);
+ int bitnum = RT_BM_BIT(chunk);
+
+ Assert(RT_NODE_IS_LEAF(node));
+ node->isset[idx] |= ((bitmapword) 1 << bitnum);
+ node->values[chunk] = value;
+}
+
+/* Delete the child at the given chunk position */
+static inline void
+RT_NODE_INNER_256_DELETE(RT_NODE_INNER_256 *node, uint8 chunk)
+{
+ Assert(!RT_NODE_IS_LEAF(node));
+ node->children[chunk] = RT_INVALID_PTR_ALLOC;
+}
+
+static inline void
+RT_NODE_LEAF_256_DELETE(RT_NODE_LEAF_256 *node, uint8 chunk)
+{
+ int idx = RT_BM_IDX(chunk);
+ int bitnum = RT_BM_BIT(chunk);
+
+ Assert(RT_NODE_IS_LEAF(node));
+ node->isset[idx] &= ~((bitmapword) 1 << bitnum);
+}
+
+/*
+ * Return the largest shift that allows storing the given key.
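+ *
+ * For example, a key of 0x100 has its leftmost set bit at position 8, so
+ * this returns (8 / RT_NODE_SPAN) * RT_NODE_SPAN = 8.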
+ */
+static inline int
+RT_KEY_GET_SHIFT(uint64 key)
+{
+ if (key == 0)
+ return 0;
+ else
+ return (pg_leftmost_one_pos64(key) / RT_NODE_SPAN) * RT_NODE_SPAN;
+}
+
+/*
+ * Return the max value that can be stored in the tree with the given shift.
+ */
+static uint64
+RT_SHIFT_GET_MAX_VAL(int shift)
+{
+ if (shift == RT_MAX_SHIFT)
+ return UINT64_MAX;
+
+ return (UINT64CONST(1) << (shift + RT_NODE_SPAN)) - 1;
+}
+
+/*
+ * Allocate a new node with the given node kind.
+ */
+static RT_PTR_ALLOC
+RT_ALLOC_NODE(RT_RADIX_TREE *tree, RT_SIZE_CLASS size_class, bool is_leaf)
+{
+ RT_PTR_ALLOC allocnode;
+ size_t allocsize;
+
+ if (is_leaf)
+ allocsize = RT_SIZE_CLASS_INFO[size_class].leaf_size;
+ else
+ allocsize = RT_SIZE_CLASS_INFO[size_class].inner_size;
+
+#ifdef RT_SHMEM
+ allocnode = dsa_allocate(tree->dsa, allocsize);
+#else
+ if (is_leaf)
+ allocnode = (RT_PTR_ALLOC) MemoryContextAlloc(tree->leaf_slabs[size_class],
+ allocsize);
+ else
+ allocnode = (RT_PTR_ALLOC) MemoryContextAlloc(tree->inner_slabs[size_class],
+ allocsize);
+#endif
+
+#ifdef RT_DEBUG
+ /* update the statistics */
+ tree->ctl->cnt[size_class]++;
+#endif
+
+ return allocnode;
+}
+
+/* Initialize the node contents */
+static inline void
+RT_INIT_NODE(RT_PTR_LOCAL node, uint8 kind, RT_SIZE_CLASS size_class, bool is_leaf)
+{
+ if (is_leaf)
+ MemSet(node, 0, RT_SIZE_CLASS_INFO[size_class].leaf_size);
+ else
+ MemSet(node, 0, RT_SIZE_CLASS_INFO[size_class].inner_size);
+
+ node->kind = kind;
+
+ if (kind == RT_NODE_KIND_256)
+ /* See comment for the RT_NODE type */
+ Assert(node->fanout == 0);
+ else
+ node->fanout = RT_SIZE_CLASS_INFO[size_class].fanout;
+
+ /* Initialize slot_idxs to invalid values */
+ if (kind == RT_NODE_KIND_125)
+ {
+ RT_NODE_BASE_125 *n125 = (RT_NODE_BASE_125 *) node;
+
+ memset(n125->slot_idxs, RT_INVALID_SLOT_IDX, sizeof(n125->slot_idxs));
+ }
+}
+
+/*
+ * Create a new node as the root. Subordinate nodes will be created during
+ * the insertion.
+ */
+static pg_noinline void
+RT_NEW_ROOT(RT_RADIX_TREE *tree, uint64 key)
+{
+ int shift = RT_KEY_GET_SHIFT(key);
+ bool is_leaf = shift == 0;
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+
+ allocnode = RT_ALLOC_NODE(tree, RT_CLASS_3, is_leaf);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(newnode, RT_NODE_KIND_3, RT_CLASS_3, is_leaf);
+ newnode->shift = shift;
+ tree->ctl->max_val = RT_SHIFT_GET_MAX_VAL(shift);
+ tree->ctl->root = allocnode;
+}
+
+static inline void
+RT_COPY_NODE(RT_PTR_LOCAL newnode, RT_PTR_LOCAL oldnode)
+{
+ newnode->shift = oldnode->shift;
+ newnode->count = oldnode->count;
+}
+
+/*
+ * Given a newly allocated node and an old node, initialize the new
+ * node with the necessary fields and return its local pointer.
+ */
+static inline RT_PTR_LOCAL
+RT_SWITCH_NODE_KIND(RT_RADIX_TREE *tree, RT_PTR_ALLOC allocnode, RT_PTR_LOCAL node,
+ uint8 new_kind, uint8 new_class, bool is_leaf)
+{
+ RT_PTR_LOCAL newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(newnode, new_kind, new_class, is_leaf);
+ RT_COPY_NODE(newnode, node);
+
+ return newnode;
+}
+
+/* Free the given node */
+static void
+RT_FREE_NODE(RT_RADIX_TREE *tree, RT_PTR_ALLOC allocnode)
+{
+ /* If we're deleting the root node, make the tree empty */
+ if (tree->ctl->root == allocnode)
+ {
+ tree->ctl->root = RT_INVALID_PTR_ALLOC;
+ tree->ctl->max_val = 0;
+ }
+
+#ifdef RT_DEBUG
+ {
+ int i;
+ RT_PTR_LOCAL node = RT_PTR_GET_LOCAL(tree, allocnode);
+
+ /* update the statistics */
+ for (i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ if (node->fanout == RT_SIZE_CLASS_INFO[i].fanout)
+ break;
+ }
+
+ /* fanout of node256 is intentionally 0 */
+ if (i == RT_SIZE_CLASS_COUNT)
+ i = RT_CLASS_256;
+
+ tree->ctl->cnt[i]--;
+ Assert(tree->ctl->cnt[i] >= 0);
+ }
+#endif
+
+#ifdef RT_SHMEM
+ dsa_free(tree->dsa, allocnode);
+#else
+ pfree(allocnode);
+#endif
+}
+
+/* Update the parent's pointer when growing a node */
+static inline void
+RT_NODE_UPDATE_INNER(RT_PTR_LOCAL node, uint64 key, RT_PTR_ALLOC new_child)
+{
+#define RT_ACTION_UPDATE
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_INNER
+#undef RT_ACTION_UPDATE
+}
+
+/*
+ * Replace old_child with new_child, and free the old one.
+ */
+static inline void
+RT_REPLACE_NODE(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent,
+ RT_PTR_ALLOC stored_old_child, RT_PTR_LOCAL old_child,
+ RT_PTR_ALLOC new_child, uint64 key)
+{
+#ifdef USE_ASSERT_CHECKING
+ RT_PTR_LOCAL new = RT_PTR_GET_LOCAL(tree, new_child);
+
+ Assert(old_child->shift == new->shift);
+ Assert(old_child->count == new->count);
+#endif
+
+ if (parent == old_child)
+ {
+ /* Replace the root node with the new larger node */
+ tree->ctl->root = new_child;
+ }
+ else
+ RT_NODE_UPDATE_INNER(parent, key, new_child);
+
+ RT_FREE_NODE(tree, stored_old_child);
+}
+
+/*
+ * The radix tree doesn't have sufficient height. Extend the radix tree so
+ * it can store the key.
+ */
+static pg_noinline void
+RT_EXTEND(RT_RADIX_TREE *tree, uint64 key)
+{
+ int target_shift;
+ RT_PTR_LOCAL root = RT_PTR_GET_LOCAL(tree, tree->ctl->root);
+ int shift = root->shift + RT_NODE_SPAN;
+
+ target_shift = RT_KEY_GET_SHIFT(key);
+
+ /* Grow tree from 'shift' to 'target_shift' */
+ while (shift <= target_shift)
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL node;
+ RT_NODE_INNER_3 *n3;
+
+		allocnode = RT_ALLOC_NODE(tree, RT_CLASS_3, false);
+		node = RT_PTR_GET_LOCAL(tree, allocnode);
+		RT_INIT_NODE(node, RT_NODE_KIND_3, RT_CLASS_3, false);
+ node->shift = shift;
+ node->count = 1;
+
+ n3 = (RT_NODE_INNER_3 *) node;
+ n3->base.chunks[0] = 0;
+ n3->children[0] = tree->ctl->root;
+
+ /* Update the root */
+ tree->ctl->root = allocnode;
+
+ shift += RT_NODE_SPAN;
+ }
+
+ tree->ctl->max_val = RT_SHIFT_GET_MAX_VAL(target_shift);
+}
+
+/*
+ * The radix tree doesn't yet have the inner and leaf nodes for the given
+ * key-value pair. Insert inner and leaf nodes from 'node' down to the bottom.
+ */
+static pg_noinline void
+RT_SET_EXTEND(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p, RT_PTR_LOCAL parent,
+ RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node)
+{
+ int shift = node->shift;
+
+ Assert(RT_PTR_GET_LOCAL(tree, stored_node) == node);
+
+ while (shift >= RT_NODE_SPAN)
+ {
+ RT_PTR_ALLOC allocchild;
+ RT_PTR_LOCAL newchild;
+ int newshift = shift - RT_NODE_SPAN;
+ bool is_leaf = newshift == 0;
+
+ allocchild = RT_ALLOC_NODE(tree, RT_CLASS_3, is_leaf);
+ newchild = RT_PTR_GET_LOCAL(tree, allocchild);
+ RT_INIT_NODE(newchild, RT_NODE_KIND_3, RT_CLASS_3, is_leaf);
+ newchild->shift = newshift;
+ RT_NODE_INSERT_INNER(tree, parent, stored_node, node, key, allocchild);
+
+ parent = node;
+ node = newchild;
+ stored_node = allocchild;
+ shift -= RT_NODE_SPAN;
+ }
+
+ RT_NODE_INSERT_LEAF(tree, parent, stored_node, node, key, value_p);
+ tree->ctl->num_keys++;
+}
+
+/*
+ * Search for the child pointer corresponding to 'key' in the given node.
+ *
+ * Return true if the key is found, otherwise return false. On success, the
+ * child pointer is stored in *child_p.
+ */
+static inline bool
+RT_NODE_SEARCH_INNER(RT_PTR_LOCAL node, uint64 key, RT_PTR_ALLOC *child_p)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/*
+ * Search for the value corresponding to 'key' in the given node.
+ *
+ * Return true if the key is found, otherwise return false. On success, the
+ * value is copied into *value_p.
+ */
+static inline bool
+RT_NODE_SEARCH_LEAF(RT_PTR_LOCAL node, uint64 key, RT_VALUE_TYPE *value_p)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
+#ifdef RT_USE_DELETE
+/*
+ * Search for the child pointer corresponding to 'key' in the given node.
+ *
+ * Delete the entry and return true if the key is found, otherwise return false.
+ */
+static inline bool
+RT_NODE_DELETE_INNER(RT_PTR_LOCAL node, uint64 key)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_delete_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/*
+ * Search for the value corresponding to 'key' in the given node.
+ *
+ * Delete the entry and return true if the key is found, otherwise return false.
+ */
+static inline bool
+RT_NODE_DELETE_LEAF(RT_PTR_LOCAL node, uint64 key)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_delete_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+#endif
+
+/*
+ * Insert "child" into "node".
+ *
+ * "parent" is the parent of "node", so the grandparent of the child.
+ * If the node we're inserting into needs to grow, we update the parent's
+ * child pointer with the pointer to the new larger node.
+ */
+static void
+RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
+ uint64 key, RT_PTR_ALLOC child)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_insert_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/* Like RT_NODE_INSERT_INNER, but for leaf nodes */
+static bool
+RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
+ uint64 key, RT_VALUE_TYPE *value_p)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_insert_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
+/*
+ * Create the radix tree in the given memory context and return it.
+ */
+RT_SCOPE RT_RADIX_TREE *
+#ifdef RT_SHMEM
+RT_CREATE(MemoryContext ctx, dsa_area *dsa, int tranche_id)
+#else
+RT_CREATE(MemoryContext ctx)
+#endif
+{
+ RT_RADIX_TREE *tree;
+ MemoryContext old_ctx;
+#ifdef RT_SHMEM
+ dsa_pointer dp;
+#endif
+
+ old_ctx = MemoryContextSwitchTo(ctx);
+
+ tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE));
+ tree->context = ctx;
+
+#ifdef RT_SHMEM
+ tree->dsa = dsa;
+ dp = dsa_allocate0(dsa, sizeof(RT_RADIX_TREE_CONTROL));
+ tree->ctl = (RT_RADIX_TREE_CONTROL *) dsa_get_address(dsa, dp);
+ tree->ctl->handle = dp;
+ tree->ctl->magic = RT_RADIX_TREE_MAGIC;
+ LWLockInitialize(&tree->ctl->lock, tranche_id);
+#else
+ tree->ctl = (RT_RADIX_TREE_CONTROL *) palloc0(sizeof(RT_RADIX_TREE_CONTROL));
+
+ /* Create a slab context for each size class */
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ RT_SIZE_CLASS_ELEM size_class = RT_SIZE_CLASS_INFO[i];
+ size_t inner_blocksize = RT_SLAB_BLOCK_SIZE(size_class.inner_size);
+ size_t leaf_blocksize = RT_SLAB_BLOCK_SIZE(size_class.leaf_size);
+
+ tree->inner_slabs[i] = SlabContextCreate(ctx,
+ size_class.name,
+ inner_blocksize,
+ size_class.inner_size);
+ tree->leaf_slabs[i] = SlabContextCreate(ctx,
+ size_class.name,
+ leaf_blocksize,
+ size_class.leaf_size);
+ }
+#endif
+
+ tree->ctl->root = RT_INVALID_PTR_ALLOC;
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return tree;
+}
+
+#ifdef RT_SHMEM
+RT_SCOPE RT_RADIX_TREE *
+RT_ATTACH(dsa_area *dsa, RT_HANDLE handle)
+{
+ RT_RADIX_TREE *tree;
+ dsa_pointer control;
+
+ tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE));
+
+	/* Find the control object in shared memory */
+ control = handle;
+
+ tree->dsa = dsa;
+ tree->ctl = (RT_RADIX_TREE_CONTROL *) dsa_get_address(dsa, control);
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+
+ return tree;
+}
+
+RT_SCOPE void
+RT_DETACH(RT_RADIX_TREE *tree)
+{
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+ pfree(tree);
+}
+
+RT_SCOPE RT_HANDLE
+RT_GET_HANDLE(RT_RADIX_TREE *tree)
+{
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+ return tree->ctl->handle;
+}
+
+/*
+ * Recursively free all nodes allocated to the DSA area.
+ */
+static void
+RT_FREE_RECURSE(RT_RADIX_TREE *tree, RT_PTR_ALLOC ptr)
+{
+ RT_PTR_LOCAL node = RT_PTR_GET_LOCAL(tree, ptr);
+
+ check_stack_depth();
+ CHECK_FOR_INTERRUPTS();
+
+ /* The leaf node doesn't have child pointers */
+ if (RT_NODE_IS_LEAF(node))
+ {
+ dsa_free(tree->dsa, ptr);
+ return;
+ }
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE_INNER_3 *n3 = (RT_NODE_INNER_3 *) node;
+
+ for (int i = 0; i < n3->base.n.count; i++)
+ RT_FREE_RECURSE(tree, n3->children[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE_INNER_32 *n32 = (RT_NODE_INNER_32 *) node;
+
+ for (int i = 0; i < n32->base.n.count; i++)
+ RT_FREE_RECURSE(tree, n32->children[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE_INNER_125 *n125 = (RT_NODE_INNER_125 *) node;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(&n125->base, i))
+ continue;
+
+ RT_FREE_RECURSE(tree, RT_NODE_INNER_125_GET_CHILD(n125, i));
+ }
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE_INNER_256 *n256 = (RT_NODE_INNER_256 *) node;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
+ continue;
+
+ RT_FREE_RECURSE(tree, RT_NODE_INNER_256_GET_CHILD(n256, i));
+ }
+
+ break;
+ }
+ }
+
+ /* Free the inner node */
+ dsa_free(tree->dsa, ptr);
+}
+#endif
+
+/*
+ * Free the given radix tree.
+ */
+RT_SCOPE void
+RT_FREE(RT_RADIX_TREE *tree)
+{
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+
+ /* Free all memory used for radix tree nodes */
+ if (RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ RT_FREE_RECURSE(tree, tree->ctl->root);
+
+ /*
+	 * Vandalize the control block to help catch programming errors where
+ * other backends access the memory formerly occupied by this radix tree.
+ */
+ tree->ctl->magic = 0;
+ dsa_free(tree->dsa, tree->ctl->handle);
+#else
+ pfree(tree->ctl);
+
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ MemoryContextDelete(tree->inner_slabs[i]);
+ MemoryContextDelete(tree->leaf_slabs[i]);
+ }
+#endif
+
+ pfree(tree);
+}
+
+/*
+ * Set key to the value pointed to by value_p. If the entry already exists,
+ * its value is updated and true is returned. Returns false if the entry
+ * didn't yet exist and was newly created.
+ */
+RT_SCOPE bool
+RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p)
+{
+ int shift;
+ bool updated;
+ RT_PTR_LOCAL parent;
+ RT_PTR_ALLOC stored_child;
+ RT_PTR_LOCAL child;
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+
+ RT_LOCK_EXCLUSIVE(tree);
+
+ /* Empty tree, create the root */
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ RT_NEW_ROOT(tree, key);
+
+ /* Extend the tree if necessary */
+ if (key > tree->ctl->max_val)
+ RT_EXTEND(tree, key);
+
+ stored_child = tree->ctl->root;
+ parent = RT_PTR_GET_LOCAL(tree, stored_child);
+ shift = parent->shift;
+
+ /* Descend the tree until we reach a leaf node */
+ while (shift >= 0)
+ {
+ RT_PTR_ALLOC new_child = RT_INVALID_PTR_ALLOC;
+
+ child = RT_PTR_GET_LOCAL(tree, stored_child);
+
+ if (RT_NODE_IS_LEAF(child))
+ break;
+
+ if (!RT_NODE_SEARCH_INNER(child, key, &new_child))
+ {
+ RT_SET_EXTEND(tree, key, value_p, parent, stored_child, child);
+ RT_UNLOCK(tree);
+ return false;
+ }
+
+ parent = child;
+ stored_child = new_child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ updated = RT_NODE_INSERT_LEAF(tree, parent, stored_child, child, key, value_p);
+
+ /* Update the statistics */
+ if (!updated)
+ tree->ctl->num_keys++;
+
+ RT_UNLOCK(tree);
+ return updated;
+}
+
+/*
+ * Search for the given key in the radix tree. Return true if the key is
+ * found, otherwise return false. On success, the value is copied into
+ * *value_p, so it must not be NULL.
+ */
+RT_SCOPE bool
+RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p)
+{
+ RT_PTR_LOCAL node;
+ int shift;
+ bool found;
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+ Assert(value_p != NULL);
+
+ RT_LOCK_SHARED(tree);
+
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root) || key > tree->ctl->max_val)
+ {
+ RT_UNLOCK(tree);
+ return false;
+ }
+
+ node = RT_PTR_GET_LOCAL(tree, tree->ctl->root);
+ shift = node->shift;
+
+	/* Descend the tree until we reach a leaf node */
+ while (shift >= 0)
+ {
+ RT_PTR_ALLOC child = RT_INVALID_PTR_ALLOC;
+
+ if (RT_NODE_IS_LEAF(node))
+ break;
+
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ {
+ RT_UNLOCK(tree);
+ return false;
+ }
+
+ node = RT_PTR_GET_LOCAL(tree, child);
+ shift -= RT_NODE_SPAN;
+ }
+
+ found = RT_NODE_SEARCH_LEAF(node, key, value_p);
+
+ RT_UNLOCK(tree);
+ return found;
+}
+
+#ifdef RT_USE_DELETE
+/*
+ * Delete the given key from the radix tree. Return true if the key is found (and
+ * deleted), otherwise do nothing and return false.
+ */
+RT_SCOPE bool
+RT_DELETE(RT_RADIX_TREE *tree, uint64 key)
+{
+ RT_PTR_LOCAL node;
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_ALLOC stack[RT_MAX_LEVEL] = {0};
+ int shift;
+ int level;
+ bool deleted;
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+
+ RT_LOCK_EXCLUSIVE(tree);
+
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root) || key > tree->ctl->max_val)
+ {
+ RT_UNLOCK(tree);
+ return false;
+ }
+
+ /*
+ * Descend the tree to search the key while building a stack of nodes we
+ * visited.
+ */
+ allocnode = tree->ctl->root;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ shift = node->shift;
+ level = -1;
+ while (shift > 0)
+ {
+ RT_PTR_ALLOC child = RT_INVALID_PTR_ALLOC;
+
+ /* Push the current node to the stack */
+ stack[++level] = allocnode;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ {
+ RT_UNLOCK(tree);
+ return false;
+ }
+
+ allocnode = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+	/* Delete the key from the leaf node if it exists */
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ deleted = RT_NODE_DELETE_LEAF(node, key);
+
+ if (!deleted)
+ {
+ /* no key is found in the leaf node */
+ RT_UNLOCK(tree);
+ return false;
+ }
+
+ /* Found the key to delete. Update the statistics */
+ tree->ctl->num_keys--;
+
+ /*
+ * Return if the leaf node still has keys and we don't need to delete the
+ * node.
+ */
+ if (node->count > 0)
+ {
+ RT_UNLOCK(tree);
+ return true;
+ }
+
+ /* Free the empty leaf node */
+ RT_FREE_NODE(tree, allocnode);
+
+	/* Delete the key from the inner nodes, walking back up the stack */
+ while (level >= 0)
+ {
+ allocnode = stack[level--];
+
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ deleted = RT_NODE_DELETE_INNER(node, key);
+ Assert(deleted);
+
+ /* If the node didn't become empty, we stop deleting the key */
+ if (node->count > 0)
+ break;
+
+ /* The node became empty */
+ RT_FREE_NODE(tree, allocnode);
+ }
+
+ RT_UNLOCK(tree);
+ return true;
+}
+#endif
+
+static inline void
+RT_ITER_UPDATE_KEY(RT_ITER *iter, uint8 chunk, uint8 shift)
+{
+ iter->key &= ~(((uint64) RT_CHUNK_MASK) << shift);
+ iter->key |= (((uint64) chunk) << shift);
+}
+
+/*
+ * Advance the slot in the inner node. Return the child if one exists,
+ * otherwise NULL.
+ */
+static inline RT_PTR_LOCAL
+RT_NODE_INNER_ITERATE_NEXT(RT_ITER *iter, RT_NODE_ITER *node_iter)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_iter_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/*
+ * Advance the slot in the leaf node. On success, return true and copy the
+ * value into *value_p, otherwise return false.
+ */
+static inline bool
+RT_NODE_LEAF_ITERATE_NEXT(RT_ITER *iter, RT_NODE_ITER *node_iter,
+ RT_VALUE_TYPE *value_p)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_iter_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
+/*
+ * Update each node_iter for inner nodes in the iterator node stack.
+ */
+static void
+RT_UPDATE_ITER_STACK(RT_ITER *iter, RT_PTR_LOCAL from_node, int from)
+{
+ int level = from;
+ RT_PTR_LOCAL node = from_node;
+
+ for (;;)
+ {
+ RT_NODE_ITER *node_iter = &(iter->stack[level--]);
+
+ node_iter->node = node;
+ node_iter->current_idx = -1;
+
+ /* We don't advance the leaf node iterator here */
+ if (RT_NODE_IS_LEAF(node))
+ return;
+
+ /* Advance to the next slot in the inner node */
+ node = RT_NODE_INNER_ITERATE_NEXT(iter, node_iter);
+
+		/* We must find the first child in the node */
+ Assert(node);
+ }
+}
+
+/*
+ * Create and return the iterator for the given radix tree.
+ *
+ * The radix tree is locked in shared mode during the iteration, so
+ * RT_END_ITERATE needs to be called when finished to release the lock.
+ */
+RT_SCOPE RT_ITER *
+RT_BEGIN_ITERATE(RT_RADIX_TREE *tree)
+{
+ MemoryContext old_ctx;
+ RT_ITER *iter;
+ RT_PTR_LOCAL root;
+ int top_level;
+
+ old_ctx = MemoryContextSwitchTo(tree->context);
+
+ iter = (RT_ITER *) palloc0(sizeof(RT_ITER));
+ iter->tree = tree;
+
+ RT_LOCK_SHARED(tree);
+
+	/* empty tree */
+	if (!iter->tree->ctl->root)
+	{
+		MemoryContextSwitchTo(old_ctx);
+		return iter;
+	}
+
+ root = RT_PTR_GET_LOCAL(tree, iter->tree->ctl->root);
+ top_level = root->shift / RT_NODE_SPAN;
+ iter->stack_len = top_level;
+
+ /*
+	 * Descend to the leftmost leaf node from the root. The key is constructed
+	 * while descending to the leaf.
+ */
+ RT_UPDATE_ITER_STACK(iter, root, top_level);
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return iter;
+}
+
+/*
+ * Return true, setting *key_p and *value_p, if there is a next key. Otherwise
+ * return false.
+ */
+RT_SCOPE bool
+RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, RT_VALUE_TYPE *value_p)
+{
+ /* Empty tree */
+ if (!iter->tree->ctl->root)
+ return false;
+
+ for (;;)
+ {
+ RT_PTR_LOCAL child = NULL;
+ RT_VALUE_TYPE value;
+ int level;
+ bool found;
+
+ /* Advance the leaf node iterator to get next key-value pair */
+ found = RT_NODE_LEAF_ITERATE_NEXT(iter, &(iter->stack[0]), &value);
+
+ if (found)
+ {
+ *key_p = iter->key;
+ *value_p = value;
+ return true;
+ }
+
+ /*
+		 * We've visited all values in the leaf node, so advance the inner node
+		 * iterators, starting from level 1, until we find the next child node.
+ */
+ for (level = 1; level <= iter->stack_len; level++)
+ {
+ child = RT_NODE_INNER_ITERATE_NEXT(iter, &(iter->stack[level]));
+
+ if (child)
+ break;
+ }
+
+ /* the iteration finished */
+ if (!child)
+ return false;
+
+ /*
+ * Set the node to the node iterator and update the iterator stack
+ * from this node.
+ */
+ RT_UPDATE_ITER_STACK(iter, child, level - 1);
+
+ /* Node iterators are updated, so try again from the leaf */
+ }
+
+ return false;
+}
+
+/*
+ * Terminate the iteration and release the lock.
+ *
+ * This function needs to be called after finishing or when exiting an
+ * iteration.
+ */
+RT_SCOPE void
+RT_END_ITERATE(RT_ITER *iter)
+{
+#ifdef RT_SHMEM
+ Assert(LWLockHeldByMe(&iter->tree->ctl->lock));
+#endif
+
+ RT_UNLOCK(iter->tree);
+ pfree(iter);
+}
+
+/*
+ * Return the amount of memory used by the radix tree.
+ */
+RT_SCOPE uint64
+RT_MEMORY_USAGE(RT_RADIX_TREE *tree)
+{
+ Size total = 0;
+
+ RT_LOCK_SHARED(tree);
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+ total = dsa_get_total_size(tree->dsa);
+#else
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
+ total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
+ }
+#endif
+
+ RT_UNLOCK(tree);
+ return total;
+}
+
+/*
+ * Verify the radix tree node.
+ */
+static void
+RT_VERIFY_NODE(RT_PTR_LOCAL node)
+{
+#ifdef USE_ASSERT_CHECKING
+ Assert(node->count >= 0);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE_BASE_3 *n3 = (RT_NODE_BASE_3 *) node;
+
+ for (int i = 1; i < n3->n.count; i++)
+ Assert(n3->chunks[i - 1] < n3->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE_BASE_32 *n32 = (RT_NODE_BASE_32 *) node;
+
+ for (int i = 1; i < n32->n.count; i++)
+ Assert(n32->chunks[i - 1] < n32->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE_BASE_125 *n125 = (RT_NODE_BASE_125 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ uint8 slot = n125->slot_idxs[i];
+ int idx = RT_BM_IDX(slot);
+ int bitnum = RT_BM_BIT(slot);
+
+ if (!RT_NODE_125_IS_CHUNK_USED(n125, i))
+ continue;
+
+ /* Check if the corresponding slot is used */
+ Assert(slot < node->fanout);
+ Assert((n125->isset[idx] & ((bitmapword) 1 << bitnum)) != 0);
+
+ cnt++;
+ }
+
+ Assert(n125->n.count == cnt);
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_256 *n256 = (RT_NODE_LEAF_256 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RT_BM_IDX(RT_NODE_MAX_SLOTS); i++)
+ cnt += bmw_popcount(n256->isset[i]);
+
+				/* Check that the number of used chunks matches the count */
+ Assert(n256->base.n.count == cnt);
+
+ break;
+ }
+ }
+ }
+#endif
+}
+
+/***************** DEBUG FUNCTIONS *****************/
+#ifdef RT_DEBUG
+
+#define RT_UINT64_FORMAT_HEX "%" INT64_MODIFIER "X"
+
+RT_SCOPE void
+RT_STATS(RT_RADIX_TREE *tree)
+{
+ RT_LOCK_SHARED(tree);
+
+ fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->ctl->max_val);
+ fprintf(stderr, "num_keys = " UINT64_FORMAT "\n", tree->ctl->num_keys);
+
+#ifdef RT_SHMEM
+ fprintf(stderr, "handle = " UINT64_FORMAT "\n", tree->ctl->handle);
+#endif
+
+ if (RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ {
+ RT_PTR_LOCAL root = RT_PTR_GET_LOCAL(tree, tree->ctl->root);
+
+ fprintf(stderr, "height = %d, n3 = %u, n15 = %u, n32 = %u, n125 = %u, n256 = %u\n",
+ root->shift / RT_NODE_SPAN,
+ tree->ctl->cnt[RT_CLASS_3],
+ tree->ctl->cnt[RT_CLASS_32_MIN],
+ tree->ctl->cnt[RT_CLASS_32_MAX],
+ tree->ctl->cnt[RT_CLASS_125],
+ tree->ctl->cnt[RT_CLASS_256]);
+ }
+
+ RT_UNLOCK(tree);
+}
+
+static void
+RT_DUMP_NODE(RT_RADIX_TREE *tree, RT_PTR_ALLOC allocnode, int level,
+ bool recurse, StringInfo buf)
+{
+ RT_PTR_LOCAL node = RT_PTR_GET_LOCAL(tree, allocnode);
+ StringInfoData spaces;
+
+ initStringInfo(&spaces);
+ appendStringInfoSpaces(&spaces, (level * 4) + 1);
+
+ appendStringInfo(buf, "%s%s[%s] kind %d, fanout %d, count %u, shift %u:\n",
+ spaces.data,
+ level == 0 ? "" : "-> ",
+ RT_NODE_IS_LEAF(node) ? "LEAF" : "INNR",
+ (node->kind == RT_NODE_KIND_3) ? 3 :
+ (node->kind == RT_NODE_KIND_32) ? 32 :
+ (node->kind == RT_NODE_KIND_125) ? 125 : 256,
+ node->fanout == 0 ? 256 : node->fanout,
+ node->count, node->shift);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_3 *n3 = (RT_NODE_LEAF_3 *) node;
+
+ appendStringInfo(buf, "%schunk[%d] 0x%X\n",
+ spaces.data, i, n3->base.chunks[i]);
+ }
+ else
+ {
+ RT_NODE_INNER_3 *n3 = (RT_NODE_INNER_3 *) node;
+
+ appendStringInfo(buf, "%schunk[%d] 0x%X",
+ spaces.data, i, n3->base.chunks[i]);
+
+ if (recurse)
+ {
+ appendStringInfo(buf, "\n");
+ RT_DUMP_NODE(tree, n3->children[i], level + 1,
+ recurse, buf);
+ }
+ else
+ appendStringInfo(buf, " (skipped)\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_32 *n32 = (RT_NODE_LEAF_32 *) node;
+
+ appendStringInfo(buf, "%schunk[%d] 0x%X\n",
+ spaces.data, i, n32->base.chunks[i]);
+ }
+ else
+ {
+ RT_NODE_INNER_32 *n32 = (RT_NODE_INNER_32 *) node;
+
+ appendStringInfo(buf, "%schunk[%d] 0x%X",
+ spaces.data, i, n32->base.chunks[i]);
+
+ if (recurse)
+ {
+ appendStringInfo(buf, "\n");
+ RT_DUMP_NODE(tree, n32->children[i], level + 1,
+ recurse, buf);
+ }
+ else
+ appendStringInfo(buf, " (skipped)\n");
+
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE_BASE_125 *b125 = (RT_NODE_BASE_125 *) node;
+ char *sep = "";
+
+ appendStringInfo(buf, "%sslot_idxs: ", spaces.data);
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(b125, i))
+ continue;
+
+ appendStringInfo(buf, "%s[%d]=%d ",
+ sep, i, b125->slot_idxs[i]);
+ sep = ",";
+ }
+
+ appendStringInfo(buf, "\n%sisset-bitmap: ", spaces.data);
+ for (int i = 0; i < (RT_SLOT_IDX_LIMIT / BITS_PER_BYTE); i++)
+ appendStringInfo(buf, "%X ", ((uint8 *) b125->isset)[i]);
+ appendStringInfo(buf, "\n");
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(b125, i))
+ continue;
+
+ if (RT_NODE_IS_LEAF(node))
+ appendStringInfo(buf, "%schunk 0x%X\n",
+ spaces.data, i);
+ else
+ {
+ RT_NODE_INNER_125 *n125 = (RT_NODE_INNER_125 *) b125;
+
+ appendStringInfo(buf, "%schunk 0x%X",
+ spaces.data, i);
+
+ if (recurse)
+ {
+ appendStringInfo(buf, "\n");
+ RT_DUMP_NODE(tree, RT_NODE_INNER_125_GET_CHILD(n125, i),
+ level + 1, recurse, buf);
+ }
+ else
+ appendStringInfo(buf, " (skipped)\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_256 *n256 = (RT_NODE_LEAF_256 *) node;
+
+ appendStringInfo(buf, "%sisset-bitmap: ", spaces.data);
+					for (int i = 0; i < (RT_NODE_MAX_SLOTS / BITS_PER_BYTE); i++)
+ appendStringInfo(buf, "%X ", ((uint8 *) n256->isset)[i]);
+ appendStringInfo(buf, "\n");
+ }
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_256 *n256 = (RT_NODE_LEAF_256 *) node;
+
+ if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, i))
+ continue;
+
+ appendStringInfo(buf, "%schunk 0x%X\n",
+ spaces.data, i);
+ }
+ else
+ {
+ RT_NODE_INNER_256 *n256 = (RT_NODE_INNER_256 *) node;
+
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
+ continue;
+
+ appendStringInfo(buf, "%schunk 0x%X",
+ spaces.data, i);
+
+ if (recurse)
+ {
+ appendStringInfo(buf, "\n");
+ RT_DUMP_NODE(tree, RT_NODE_INNER_256_GET_CHILD(n256, i),
+ level + 1, recurse, buf);
+ }
+ else
+ appendStringInfo(buf, " (skipped)\n");
+ }
+ }
+ break;
+ }
+ }
+}
+
+RT_SCOPE void
+RT_DUMP_SEARCH(RT_RADIX_TREE *tree, uint64 key)
+{
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL node;
+ StringInfoData buf;
+ int shift;
+ int level = 0;
+
+ RT_STATS(tree);
+
+ RT_LOCK_SHARED(tree);
+
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ {
+ RT_UNLOCK(tree);
+ fprintf(stderr, "empty tree\n");
+ return;
+ }
+
+ if (key > tree->ctl->max_val)
+ {
+ RT_UNLOCK(tree);
+ fprintf(stderr, "key " UINT64_FORMAT "(0x" RT_UINT64_FORMAT_HEX ") is larger than max val\n",
+ key, key);
+ return;
+ }
+
+ initStringInfo(&buf);
+ allocnode = tree->ctl->root;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ shift = node->shift;
+ while (shift >= 0)
+ {
+ RT_PTR_ALLOC child;
+
+ RT_DUMP_NODE(tree, allocnode, level, false, &buf);
+
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_VALUE_TYPE dummy;
+
+			/* We reached a leaf node; find the corresponding slot */
+ RT_NODE_SEARCH_LEAF(node, key, &dummy);
+
+ break;
+ }
+
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ break;
+
+ allocnode = child;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ shift -= RT_NODE_SPAN;
+ level++;
+ }
+ RT_UNLOCK(tree);
+
+ fprintf(stderr, "%s", buf.data);
+}
+
+RT_SCOPE void
+RT_DUMP(RT_RADIX_TREE *tree)
+{
+ StringInfoData buf;
+
+ RT_STATS(tree);
+
+ RT_LOCK_SHARED(tree);
+
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ {
+ RT_UNLOCK(tree);
+ fprintf(stderr, "empty tree\n");
+ return;
+ }
+
+ initStringInfo(&buf);
+
+ RT_DUMP_NODE(tree, tree->ctl->root, 0, true, &buf);
+ RT_UNLOCK(tree);
+
+ fprintf(stderr, "%s",buf.data);
+}
+#endif
+
+#endif /* RT_DEFINE */
+
+
+/* undefine external parameters, so next radix tree can be defined */
+#undef RT_PREFIX
+#undef RT_SCOPE
+#undef RT_DECLARE
+#undef RT_DEFINE
+#undef RT_VALUE_TYPE
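+
+/*
+ * For illustration: a caller instantiates this template the way the test
+ * module test_radixtree.c does, e.g.
+ *
+ *     #define RT_PREFIX rt
+ *     #define RT_SCOPE static
+ *     #define RT_DECLARE
+ *     #define RT_DEFINE
+ *     #define RT_VALUE_TYPE uint64
+ *     #include "lib/radixtree.h"
+ *
+ * which is why these parameters must be undefined here before another radix
+ * tree can be defined in the same translation unit.
+ */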
+
+/* locally declared macros */
+#undef RT_MAKE_PREFIX
+#undef RT_MAKE_NAME
+#undef RT_MAKE_NAME_
+#undef RT_NODE_SPAN
+#undef RT_NODE_MAX_SLOTS
+#undef RT_CHUNK_MASK
+#undef RT_MAX_SHIFT
+#undef RT_MAX_LEVEL
+#undef RT_GET_KEY_CHUNK
+#undef RT_BM_IDX
+#undef RT_BM_BIT
+#undef RT_LOCK_EXCLUSIVE
+#undef RT_LOCK_SHARED
+#undef RT_UNLOCK
+#undef RT_NODE_IS_LEAF
+#undef RT_NODE_MUST_GROW
+#undef RT_NODE_KIND_COUNT
+#undef RT_SIZE_CLASS_COUNT
+#undef RT_SLOT_IDX_LIMIT
+#undef RT_INVALID_SLOT_IDX
+#undef RT_SLAB_BLOCK_SIZE
+#undef RT_RADIX_TREE_MAGIC
+#undef RT_UINT64_FORMAT_HEX
+
+/* type declarations */
+#undef RT_RADIX_TREE
+#undef RT_RADIX_TREE_CONTROL
+#undef RT_PTR_LOCAL
+#undef RT_PTR_ALLOC
+#undef RT_INVALID_PTR_ALLOC
+#undef RT_HANDLE
+#undef RT_ITER
+#undef RT_NODE
+#undef RT_NODE_ITER
+#undef RT_NODE_KIND_3
+#undef RT_NODE_KIND_32
+#undef RT_NODE_KIND_125
+#undef RT_NODE_KIND_256
+#undef RT_NODE_BASE_3
+#undef RT_NODE_BASE_32
+#undef RT_NODE_BASE_125
+#undef RT_NODE_BASE_256
+#undef RT_NODE_INNER_3
+#undef RT_NODE_INNER_32
+#undef RT_NODE_INNER_125
+#undef RT_NODE_INNER_256
+#undef RT_NODE_LEAF_3
+#undef RT_NODE_LEAF_32
+#undef RT_NODE_LEAF_125
+#undef RT_NODE_LEAF_256
+#undef RT_SIZE_CLASS
+#undef RT_SIZE_CLASS_ELEM
+#undef RT_SIZE_CLASS_INFO
+#undef RT_CLASS_3
+#undef RT_CLASS_32_MIN
+#undef RT_CLASS_32_MAX
+#undef RT_CLASS_125
+#undef RT_CLASS_256
+
+/* function declarations */
+#undef RT_CREATE
+#undef RT_FREE
+#undef RT_ATTACH
+#undef RT_DETACH
+#undef RT_GET_HANDLE
+#undef RT_SEARCH
+#undef RT_SET
+#undef RT_BEGIN_ITERATE
+#undef RT_ITERATE_NEXT
+#undef RT_END_ITERATE
+#undef RT_USE_DELETE
+#undef RT_DELETE
+#undef RT_MEMORY_USAGE
+#undef RT_DUMP
+#undef RT_DUMP_NODE
+#undef RT_DUMP_SEARCH
+#undef RT_STATS
+
+/* internal helper functions */
+#undef RT_NEW_ROOT
+#undef RT_ALLOC_NODE
+#undef RT_INIT_NODE
+#undef RT_FREE_NODE
+#undef RT_FREE_RECURSE
+#undef RT_EXTEND
+#undef RT_SET_EXTEND
+#undef RT_SWITCH_NODE_KIND
+#undef RT_COPY_NODE
+#undef RT_REPLACE_NODE
+#undef RT_PTR_GET_LOCAL
+#undef RT_PTR_ALLOC_IS_VALID
+#undef RT_NODE_3_SEARCH_EQ
+#undef RT_NODE_32_SEARCH_EQ
+#undef RT_NODE_3_GET_INSERTPOS
+#undef RT_NODE_32_GET_INSERTPOS
+#undef RT_CHUNK_CHILDREN_ARRAY_SHIFT
+#undef RT_CHUNK_VALUES_ARRAY_SHIFT
+#undef RT_CHUNK_CHILDREN_ARRAY_DELETE
+#undef RT_CHUNK_VALUES_ARRAY_DELETE
+#undef RT_CHUNK_CHILDREN_ARRAY_COPY
+#undef RT_CHUNK_VALUES_ARRAY_COPY
+#undef RT_NODE_125_IS_CHUNK_USED
+#undef RT_NODE_INNER_125_GET_CHILD
+#undef RT_NODE_LEAF_125_GET_VALUE
+#undef RT_NODE_INNER_256_IS_CHUNK_USED
+#undef RT_NODE_LEAF_256_IS_CHUNK_USED
+#undef RT_NODE_INNER_256_GET_CHILD
+#undef RT_NODE_LEAF_256_GET_VALUE
+#undef RT_NODE_INNER_256_SET
+#undef RT_NODE_LEAF_256_SET
+#undef RT_NODE_INNER_256_DELETE
+#undef RT_NODE_LEAF_256_DELETE
+#undef RT_KEY_GET_SHIFT
+#undef RT_SHIFT_GET_MAX_VAL
+#undef RT_NODE_SEARCH_INNER
+#undef RT_NODE_SEARCH_LEAF
+#undef RT_NODE_UPDATE_INNER
+#undef RT_NODE_DELETE_INNER
+#undef RT_NODE_DELETE_LEAF
+#undef RT_NODE_INSERT_INNER
+#undef RT_NODE_INSERT_LEAF
+#undef RT_NODE_INNER_ITERATE_NEXT
+#undef RT_NODE_LEAF_ITERATE_NEXT
+#undef RT_UPDATE_ITER_STACK
+#undef RT_ITER_UPDATE_KEY
+#undef RT_VERIFY_NODE
+
+#undef RT_DEBUG
diff --git a/src/include/lib/radixtree_delete_impl.h b/src/include/lib/radixtree_delete_impl.h
new file mode 100644
index 0000000000..5f6dda1f12
--- /dev/null
+++ b/src/include/lib/radixtree_delete_impl.h
@@ -0,0 +1,122 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree_delete_impl.h
+ * Common implementation for deletion in leaf and inner nodes.
+ *
+ * Note: There is deliberately no #include guard here
+ *
+ * TODO: Shrink nodes when deletion would allow them to fit in a smaller
+ * size class.
+ *
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * src/include/lib/radixtree_delete_impl.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE3_TYPE RT_NODE_INNER_3
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE3_TYPE RT_NODE_LEAF_3
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
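+
+/*
+ * For illustration only (a sketch, not verbatim from radixtree.h): this
+ * fragment is intended to be included inside a function body in radixtree.h
+ * that has 'node' and 'key' in scope and returns bool, roughly like
+ *
+ *     #define RT_NODE_LEVEL_LEAF
+ *     #include "lib/radixtree_delete_impl.h"
+ *     #undef RT_NODE_LEVEL_LEAF
+ *
+ * with a corresponding RT_NODE_LEVEL_INNER inclusion for inner nodes.
+ */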
+
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+
+#ifdef RT_NODE_LEVEL_LEAF
+ Assert(RT_NODE_IS_LEAF(node));
+#else
+ Assert(!RT_NODE_IS_LEAF(node));
+#endif
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node;
+ int idx = RT_NODE_3_SEARCH_EQ((RT_NODE_BASE_3 *) n3, chunk);
+
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_DELETE(n3->base.chunks, n3->values,
+ n3->base.n.count, idx);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_DELETE(n3->base.chunks, n3->children,
+ n3->base.n.count, idx);
+#endif
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
+ int idx = RT_NODE_32_SEARCH_EQ((RT_NODE_BASE_32 *) n32, chunk);
+
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_DELETE(n32->base.chunks, n32->values,
+ n32->base.n.count, idx);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_DELETE(n32->base.chunks, n32->children,
+ n32->base.n.count, idx);
+#endif
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+ int slotpos = n125->base.slot_idxs[chunk];
+ int idx;
+ int bitnum;
+
+ if (slotpos == RT_INVALID_SLOT_IDX)
+ return false;
+
+ idx = RT_BM_IDX(slotpos);
+ bitnum = RT_BM_BIT(slotpos);
+ n125->base.isset[idx] &= ~((bitmapword) 1 << bitnum);
+ n125->base.slot_idxs[chunk] = RT_INVALID_SLOT_IDX;
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk))
+#else
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, chunk))
+#endif
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_NODE_LEAF_256_DELETE(n256, chunk);
+#else
+ RT_NODE_INNER_256_DELETE(n256, chunk);
+#endif
+ break;
+ }
+ }
+
+ /* update statistics */
+ node->count--;
+
+ return true;
+
+#undef RT_NODE3_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_insert_impl.h b/src/include/lib/radixtree_insert_impl.h
new file mode 100644
index 0000000000..d56e58dcac
--- /dev/null
+++ b/src/include/lib/radixtree_insert_impl.h
@@ -0,0 +1,328 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree_insert_impl.h
+ * Common implementation for insertion in leaf and inner nodes.
+ *
+ * Note: There is deliberately no #include guard here
+ *
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * src/include/lib/radixtree_insert_impl.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE3_TYPE RT_NODE_INNER_3
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE3_TYPE RT_NODE_LEAF_3
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+
+#ifdef RT_NODE_LEVEL_LEAF
+ const bool is_leaf = true;
+ bool chunk_exists = false;
+ Assert(RT_NODE_IS_LEAF(node));
+#else
+ const bool is_leaf = false;
+ Assert(!RT_NODE_IS_LEAF(node));
+#endif
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ int idx = RT_NODE_3_SEARCH_EQ(&n3->base, chunk);
+
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n3->values[idx] = *value_p;
+ break;
+ }
+#endif
+ if (unlikely(RT_NODE_MUST_GROW(n3)))
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+ RT_NODE32_TYPE *new32;
+ const uint8 new_kind = RT_NODE_KIND_32;
+ const RT_SIZE_CLASS new_class = RT_CLASS_32_MIN;
+
+ /* grow node from 3 to 32 */
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
+ newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, is_leaf);
+ new32 = (RT_NODE32_TYPE *) newnode;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_COPY(n3->base.chunks, n3->values,
+ new32->base.chunks, new32->values);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_COPY(n3->base.chunks, n3->children,
+ new32->base.chunks, new32->children);
+#endif
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
+ node = newnode;
+ }
+ else
+ {
+ int insertpos = RT_NODE_3_GET_INSERTPOS(&n3->base, chunk);
+ int count = n3->base.n.count;
+
+ /* shift chunks and children */
+ if (insertpos < count)
+ {
+ Assert(count > 0);
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_SHIFT(n3->base.chunks, n3->values,
+ count, insertpos);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_SHIFT(n3->base.chunks, n3->children,
+ count, insertpos);
+#endif
+ }
+
+ n3->base.chunks[insertpos] = chunk;
+#ifdef RT_NODE_LEVEL_LEAF
+ n3->values[insertpos] = *value_p;
+#else
+ n3->children[insertpos] = child;
+#endif
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_32:
+ {
+ const RT_SIZE_CLASS_ELEM class32_max = RT_SIZE_CLASS_INFO[RT_CLASS_32_MAX];
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ int idx = RT_NODE_32_SEARCH_EQ(&n32->base, chunk);
+
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n32->values[idx] = *value_p;
+ break;
+ }
+#endif
+ if (unlikely(RT_NODE_MUST_GROW(n32)) &&
+ n32->base.n.fanout < class32_max.fanout)
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+ const RT_SIZE_CLASS_ELEM class32_min = RT_SIZE_CLASS_INFO[RT_CLASS_32_MIN];
+ const RT_SIZE_CLASS new_class = RT_CLASS_32_MAX;
+
+ Assert(n32->base.n.fanout == class32_min.fanout);
+
+ /* grow to the next size class of this kind */
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ n32 = (RT_NODE32_TYPE *) newnode;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ memcpy(newnode, node, class32_min.leaf_size);
+#else
+ memcpy(newnode, node, class32_min.inner_size);
+#endif
+ newnode->fanout = class32_max.fanout;
+
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
+ node = newnode;
+ }
+
+ if (unlikely(RT_NODE_MUST_GROW(n32)))
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+ RT_NODE125_TYPE *new125;
+ const uint8 new_kind = RT_NODE_KIND_125;
+ const RT_SIZE_CLASS new_class = RT_CLASS_125;
+
+ Assert(n32->base.n.fanout == class32_max.fanout);
+
+ /* grow node from 32 to 125 */
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
+ newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, is_leaf);
+ new125 = (RT_NODE125_TYPE *) newnode;
+
+ for (int i = 0; i < class32_max.fanout; i++)
+ {
+ new125->base.slot_idxs[n32->base.chunks[i]] = i;
+#ifdef RT_NODE_LEVEL_LEAF
+ new125->values[i] = n32->values[i];
+#else
+ new125->children[i] = n32->children[i];
+#endif
+ }
+
+ /*
+ * Since we just copied a dense array, we can set the bits
+ * using a single store, provided the length of that array
+ * is at most the number of bits in a bitmapword.
+ */
+ Assert(class32_max.fanout <= sizeof(bitmapword) * BITS_PER_BYTE);
+ new125->base.isset[0] = (bitmapword) (((uint64) 1 << class32_max.fanout) - 1);
+
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
+ node = newnode;
+ }
+ else
+ {
+ int insertpos = RT_NODE_32_GET_INSERTPOS(&n32->base, chunk);
+ int count = n32->base.n.count;
+
+ if (insertpos < count)
+ {
+ Assert(count > 0);
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_SHIFT(n32->base.chunks, n32->values,
+ count, insertpos);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_SHIFT(n32->base.chunks, n32->children,
+ count, insertpos);
+#endif
+ }
+
+ n32->base.chunks[insertpos] = chunk;
+#ifdef RT_NODE_LEVEL_LEAF
+ n32->values[insertpos] = *value_p;
+#else
+ n32->children[insertpos] = child;
+#endif
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+ int slotpos;
+ int cnt = 0;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ slotpos = n125->base.slot_idxs[chunk];
+ if (slotpos != RT_INVALID_SLOT_IDX)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n125->values[slotpos] = *value_p;
+ break;
+ }
+#endif
+ if (unlikely(RT_NODE_MUST_GROW(n125)))
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+ RT_NODE256_TYPE *new256;
+ const uint8 new_kind = RT_NODE_KIND_256;
+ const RT_SIZE_CLASS new_class = RT_CLASS_256;
+
+ /* grow node from 125 to 256 */
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
+ newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, is_leaf);
+ new256 = (RT_NODE256_TYPE *) newnode;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n125->base.n.count; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(&n125->base, i))
+ continue;
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_NODE_LEAF_256_SET(new256, i, RT_NODE_LEAF_125_GET_VALUE(n125, i));
+#else
+ RT_NODE_INNER_256_SET(new256, i, RT_NODE_INNER_125_GET_CHILD(n125, i));
+#endif
+ cnt++;
+ }
+
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
+ node = newnode;
+ }
+ else
+ {
+ int idx;
+ bitmapword inverse;
+
+ /* get the first word with at least one bit not set */
+ for (idx = 0; idx < RT_BM_IDX(RT_SLOT_IDX_LIMIT); idx++)
+ {
+ if (n125->base.isset[idx] < ~((bitmapword) 0))
+ break;
+ }
+
+ /* To get the first unset bit in X, get the first set bit in ~X */
+ inverse = ~(n125->base.isset[idx]);
+ slotpos = idx * BITS_PER_BITMAPWORD;
+ slotpos += bmw_rightmost_one_pos(inverse);
+ Assert(slotpos < node->fanout);
+
+ /* mark the slot used */
+ n125->base.isset[idx] |= bmw_rightmost_one(inverse);
+ n125->base.slot_idxs[chunk] = slotpos;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ n125->values[slotpos] = *value_p;
+#else
+ n125->children[slotpos] = child;
+#endif
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ chunk_exists = RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk);
+ Assert(chunk_exists || node->count < RT_NODE_MAX_SLOTS);
+ RT_NODE_LEAF_256_SET(n256, chunk, *value_p);
+#else
+ Assert(node->count < RT_NODE_MAX_SLOTS);
+ RT_NODE_INNER_256_SET(n256, chunk, child);
+#endif
+ break;
+ }
+ }
+
+ /* Update statistics */
+#ifdef RT_NODE_LEVEL_LEAF
+ if (!chunk_exists)
+ node->count++;
+#else
+ node->count++;
+#endif
+
+ /*
+ * Done. Finally, verify that the chunk and value are inserted or replaced
+ * properly in the node.
+ */
+ RT_VERIFY_NODE(node);
+
+#ifdef RT_NODE_LEVEL_LEAF
+ return chunk_exists;
+#else
+ return;
+#endif
+
+#undef RT_NODE3_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_iter_impl.h b/src/include/lib/radixtree_iter_impl.h
new file mode 100644
index 0000000000..98c78eb237
--- /dev/null
+++ b/src/include/lib/radixtree_iter_impl.h
@@ -0,0 +1,153 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree_iter_impl.h
+ * Common implementation for iteration in leaf and inner nodes.
+ *
+ * Note: There is deliberately no #include guard here
+ *
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * src/include/lib/radixtree_iter_impl.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE3_TYPE RT_NODE_INNER_3
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE3_TYPE RT_NODE_LEAF_3
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ bool found = false;
+ uint8 key_chunk;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_VALUE_TYPE value;
+
+ Assert(RT_NODE_IS_LEAF(node_iter->node));
+#else
+ RT_PTR_LOCAL child = NULL;
+
+ Assert(!RT_NODE_IS_LEAF(node_iter->node));
+#endif
+
+#ifdef RT_SHMEM
+ Assert(iter->tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+
+ switch (node_iter->node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n3->base.n.count)
+ break;
+#ifdef RT_NODE_LEVEL_LEAF
+ value = n3->values[node_iter->current_idx];
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, n3->children[node_iter->current_idx]);
+#endif
+ key_chunk = n3->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n32->base.n.count)
+ break;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ value = n32->values[node_iter->current_idx];
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, n32->children[node_iter->current_idx]);
+#endif
+ key_chunk = n32->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (RT_NODE_125_IS_CHUNK_USED((RT_NODE_BASE_125 *) n125, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+#ifdef RT_NODE_LEVEL_LEAF
+ value = RT_NODE_LEAF_125_GET_VALUE(n125, i);
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, RT_NODE_INNER_125_GET_CHILD(n125, i));
+#endif
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+#ifdef RT_NODE_LEVEL_LEAF
+ if (RT_NODE_LEAF_256_IS_CHUNK_USED(n256, i))
+#else
+ if (RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
+#endif
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+#ifdef RT_NODE_LEVEL_LEAF
+ value = RT_NODE_LEAF_256_GET_VALUE(n256, i);
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, RT_NODE_INNER_256_GET_CHILD(n256, i));
+#endif
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ }
+
+ if (found)
+ {
+ RT_ITER_UPDATE_KEY(iter, key_chunk, node_iter->node->shift);
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = value;
+#endif
+ }
+
+#ifdef RT_NODE_LEVEL_LEAF
+ return found;
+#else
+ return child;
+#endif
+
+#undef RT_NODE3_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_search_impl.h b/src/include/lib/radixtree_search_impl.h
new file mode 100644
index 0000000000..a8925c75d0
--- /dev/null
+++ b/src/include/lib/radixtree_search_impl.h
@@ -0,0 +1,138 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree_search_impl.h
+ * Common implementation for search in leaf and inner nodes, plus
+ * update for inner nodes only.
+ *
+ * Note: There is deliberately no #include guard here
+ *
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * src/include/lib/radixtree_search_impl.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE3_TYPE RT_NODE_INNER_3
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE3_TYPE RT_NODE_LEAF_3
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+
+#ifdef RT_NODE_LEVEL_LEAF
+ Assert(value_p != NULL);
+ Assert(RT_NODE_IS_LEAF(node));
+#else
+#ifndef RT_ACTION_UPDATE
+ Assert(child_p != NULL);
+#endif
+ Assert(!RT_NODE_IS_LEAF(node));
+#endif
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node;
+ int idx = RT_NODE_3_SEARCH_EQ((RT_NODE_BASE_3 *) n3, chunk);
+
+#ifdef RT_ACTION_UPDATE
+ Assert(idx >= 0);
+ n3->children[idx] = new_child;
+#else
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = n3->values[idx];
+#else
+ *child_p = n3->children[idx];
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
+ int idx = RT_NODE_32_SEARCH_EQ((RT_NODE_BASE_32 *) n32, chunk);
+
+#ifdef RT_ACTION_UPDATE
+ Assert(idx >= 0);
+ n32->children[idx] = new_child;
+#else
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = n32->values[idx];
+#else
+ *child_p = n32->children[idx];
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+ int slotpos = n125->base.slot_idxs[chunk];
+
+#ifdef RT_ACTION_UPDATE
+ Assert(slotpos != RT_INVALID_SLOT_IDX);
+ n125->children[slotpos] = new_child;
+#else
+ if (slotpos == RT_INVALID_SLOT_IDX)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = RT_NODE_LEAF_125_GET_VALUE(n125, chunk);
+#else
+ *child_p = RT_NODE_INNER_125_GET_CHILD(n125, chunk);
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
+
+#ifdef RT_ACTION_UPDATE
+ RT_NODE_INNER_256_SET(n256, chunk, new_child);
+#else
+#ifdef RT_NODE_LEVEL_LEAF
+ if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk))
+#else
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, chunk))
+#endif
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = RT_NODE_LEAF_256_GET_VALUE(n256, chunk);
+#else
+ *child_p = RT_NODE_INNER_256_GET_CHILD(n256, chunk);
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ }
+
+#ifdef RT_ACTION_UPDATE
+ return;
+#else
+ return true;
+#endif /* RT_ACTION_UPDATE */
+
+#undef RT_NODE3_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/utils/dsa.h b/src/include/utils/dsa.h
index 3ce4ee300a..2af215484f 100644
--- a/src/include/utils/dsa.h
+++ b/src/include/utils/dsa.h
@@ -121,6 +121,7 @@ extern dsa_handle dsa_get_handle(dsa_area *area);
extern dsa_pointer dsa_allocate_extended(dsa_area *area, size_t size, int flags);
extern void dsa_free(dsa_area *area, dsa_pointer dp);
extern void *dsa_get_address(dsa_area *area, dsa_pointer dp);
+extern size_t dsa_get_total_size(dsa_area *area);
extern void dsa_trim(dsa_area *area);
extern void dsa_dump(dsa_area *area);
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index c629cbe383..9659eb85d7 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -28,6 +28,7 @@ SUBDIRS = \
test_pg_db_role_setting \
test_pg_dump \
test_predtest \
+ test_radixtree \
test_rbtree \
test_regex \
test_rls_hooks \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 1baa6b558d..232cbdac80 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -24,6 +24,7 @@ subdir('test_parser')
subdir('test_pg_db_role_setting')
subdir('test_pg_dump')
subdir('test_predtest')
+subdir('test_radixtree')
subdir('test_rbtree')
subdir('test_regex')
subdir('test_rls_hooks')
diff --git a/src/test/modules/test_radixtree/.gitignore b/src/test/modules/test_radixtree/.gitignore
new file mode 100644
index 0000000000..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/test_radixtree/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_radixtree/Makefile b/src/test/modules/test_radixtree/Makefile
new file mode 100644
index 0000000000..da06b93da3
--- /dev/null
+++ b/src/test/modules/test_radixtree/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_radixtree/Makefile
+
+MODULE_big = test_radixtree
+OBJS = \
+ $(WIN32RES) \
+ test_radixtree.o
+PGFILEDESC = "test_radixtree - test code for src/backend/lib/radixtree.c"
+
+EXTENSION = test_radixtree
+DATA = test_radixtree--1.0.sql
+
+REGRESS = test_radixtree
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_radixtree
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_radixtree/README b/src/test/modules/test_radixtree/README
new file mode 100644
index 0000000000..a8b271869a
--- /dev/null
+++ b/src/test/modules/test_radixtree/README
@@ -0,0 +1,7 @@
+test_radixtree contains unit tests for the radix tree implementation in
+src/include/lib/radixtree.h.
+
+The tests verify the correctness of the implementation, but they can also be
+used as a micro-benchmark. If you set the 'rt_test_stats' flag in
+test_radixtree.c, the tests will print extra information about execution time
+and memory usage.
diff --git a/src/test/modules/test_radixtree/expected/test_radixtree.out b/src/test/modules/test_radixtree/expected/test_radixtree.out
new file mode 100644
index 0000000000..ce645cb8b5
--- /dev/null
+++ b/src/test/modules/test_radixtree/expected/test_radixtree.out
@@ -0,0 +1,36 @@
+CREATE EXTENSION test_radixtree;
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
+NOTICE: testing basic operations with leaf node 4
+NOTICE: testing basic operations with inner node 4
+NOTICE: testing basic operations with leaf node 32
+NOTICE: testing basic operations with inner node 32
+NOTICE: testing basic operations with leaf node 125
+NOTICE: testing basic operations with inner node 125
+NOTICE: testing basic operations with leaf node 256
+NOTICE: testing basic operations with inner node 256
+NOTICE: testing radix tree node types with shift "0"
+NOTICE: testing radix tree node types with shift "8"
+NOTICE: testing radix tree node types with shift "16"
+NOTICE: testing radix tree node types with shift "24"
+NOTICE: testing radix tree node types with shift "32"
+NOTICE: testing radix tree node types with shift "40"
+NOTICE: testing radix tree node types with shift "48"
+NOTICE: testing radix tree node types with shift "56"
+NOTICE: testing radix tree with pattern "all ones"
+NOTICE: testing radix tree with pattern "alternating bits"
+NOTICE: testing radix tree with pattern "clusters of ten"
+NOTICE: testing radix tree with pattern "clusters of hundred"
+NOTICE: testing radix tree with pattern "one-every-64k"
+NOTICE: testing radix tree with pattern "sparse"
+NOTICE: testing radix tree with pattern "single values, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^60"
+ test_radixtree
+----------------
+
+(1 row)
+
diff --git a/src/test/modules/test_radixtree/meson.build b/src/test/modules/test_radixtree/meson.build
new file mode 100644
index 0000000000..6add06bbdb
--- /dev/null
+++ b/src/test/modules/test_radixtree/meson.build
@@ -0,0 +1,35 @@
+# FIXME: prevent install during main install, but not during test :/
+
+test_radixtree_sources = files(
+ 'test_radixtree.c',
+)
+
+if host_system == 'windows'
+ test_radixtree_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'test_radixtree',
+ '--FILEDESC', 'test_radixtree - test code for src/include/lib/radixtree.h',])
+endif
+
+test_radixtree = shared_module('test_radixtree',
+ test_radixtree_sources,
+ link_with: pgport_srv,
+ kwargs: pg_mod_args,
+)
+testprep_targets += test_radixtree
+
+install_data(
+ 'test_radixtree.control',
+ 'test_radixtree--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'test_radixtree',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'test_radixtree',
+ ],
+ },
+}
diff --git a/src/test/modules/test_radixtree/sql/test_radixtree.sql b/src/test/modules/test_radixtree/sql/test_radixtree.sql
new file mode 100644
index 0000000000..41ece5e9f5
--- /dev/null
+++ b/src/test/modules/test_radixtree/sql/test_radixtree.sql
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_radixtree;
+
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
diff --git a/src/test/modules/test_radixtree/test_radixtree--1.0.sql b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
new file mode 100644
index 0000000000..074a5a7ea7
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_radixtree/test_radixtree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_radixtree" to load this file. \quit
+
+CREATE FUNCTION test_radixtree()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
new file mode 100644
index 0000000000..afe53382f3
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -0,0 +1,681 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_radixtree.c
+ * Test radixtree set data structure.
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/test/modules/test_radixtree/test_radixtree.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "miscadmin.h"
+#include "nodes/bitmapset.h"
+#include "storage/block.h"
+#include "storage/itemptr.h"
+#include "storage/lwlock.h"
+#include "utils/memutils.h"
+#include "utils/timestamp.h"
+
+#define UINT64_HEX_FORMAT "%" INT64_MODIFIER "X"
+
+/*
+ * The tests pass with uint32, but build with warnings because the string
+ * format expects uint64.
+ */
+typedef uint64 TestValueType;
+
+/*
+ * If you enable this, the "pattern" tests will print information about
+ * how long populating, probing, and iterating the test set takes, and
+ * how much memory the test set consumed. That can be used as
+ * micro-benchmark of various operations and input patterns (you might
+ * want to increase the number of values used in each of the test, if
+ * you do that, to reduce noise).
+ *
+ * The information is printed to the server's stderr, mostly because
+ * that's where MemoryContextStats() output goes.
+ */
+static const bool rt_test_stats = false;
+
+static int rt_node_kind_fanouts[] = {
+ 0,
+ 4, /* RT_NODE_KIND_4 */
+ 32, /* RT_NODE_KIND_32 */
+ 125, /* RT_NODE_KIND_125 */
+ 256 /* RT_NODE_KIND_256 */
+};
+/*
+ * A struct to define a pattern of integers, for use with the test_pattern()
+ * function.
+ */
+typedef struct
+{
+ char *test_name; /* short name of the test, for humans */
+ char *pattern_str; /* a bit pattern */
+ uint64 spacing; /* pattern repeats at this interval */
+ uint64 num_values; /* number of integers to set in total */
+} test_spec;
+
+/* Test patterns borrowed from test_integerset.c */
+static const test_spec test_specs[] = {
+ {
+ "all ones", "1111111111",
+ 10, 1000000
+ },
+ {
+ "alternating bits", "0101010101",
+ 10, 1000000
+ },
+ {
+ "clusters of ten", "1111111111",
+ 10000, 1000000
+ },
+ {
+ "clusters of hundred",
+ "1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111",
+ 10000, 1000000
+ },
+ {
+ "one-every-64k", "1",
+ 65536, 1000000
+ },
+ {
+ "sparse", "100000000000000000000000000000001",
+ 10000000, 1000000
+ },
+ {
+ "single values, distance > 2^32", "1",
+ UINT64CONST(10000000000), 100000
+ },
+ {
+ "clusters, distance > 2^32", "10101010",
+ UINT64CONST(10000000000), 1000000
+ },
+ {
+ "clusters, distance > 2^60", "10101010",
+ UINT64CONST(2000000000000000000),
+ 23 /* can't be much higher than this, or we
+ * overflow uint64 */
+ }
+};
+
+/* define the radix tree implementation to test */
+#define RT_PREFIX rt
+#define RT_SCOPE
+#define RT_DECLARE
+#define RT_DEFINE
+#define RT_USE_DELETE
+#define RT_VALUE_TYPE TestValueType
+/* #define RT_SHMEM */
+#include "lib/radixtree.h"
+
+
+/*
+ * Return the number of keys in the radix tree.
+ */
+static uint64
+rt_num_entries(rt_radix_tree *tree)
+{
+ return tree->ctl->num_keys;
+}
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(test_radixtree);
+
+static void
+test_empty(void)
+{
+ rt_radix_tree *radixtree;
+ rt_iter *iter;
+ TestValueType dummy;
+ uint64 key;
+ TestValueType val;
+
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+
+ LWLockRegisterTranche(tranche_id, "test_radix_tree");
+ dsa = dsa_create(tranche_id);
+
+ radixtree = rt_create(CurrentMemoryContext, dsa, tranche_id);
+#else
+ radixtree = rt_create(CurrentMemoryContext);
+#endif
+
+ if (rt_search(radixtree, 0, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, 1, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, PG_UINT64_MAX, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_delete(radixtree, 0))
+ elog(ERROR, "rt_delete on empty tree returned true");
+
+ if (rt_num_entries(radixtree) != 0)
+ elog(ERROR, "rt_num_entries on empty tree return non-zero");
+
+ iter = rt_begin_iterate(radixtree);
+
+ if (rt_iterate_next(iter, &key, &val))
+ elog(ERROR, "rt_itereate_next on empty tree returned true");
+
+ rt_end_iterate(iter);
+
+ rt_free(radixtree);
+
+#ifdef RT_SHMEM
+ dsa_detach(dsa);
+#endif
+}
+
+static void
+test_basic(int children, bool test_inner)
+{
+ rt_radix_tree *radixtree;
+ uint64 *keys;
+ int shift = test_inner ? 8 : 0;
+
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+
+ LWLockRegisterTranche(tranche_id, "test_radix_tree");
+ dsa = dsa_create(tranche_id);
+#endif
+
+ elog(NOTICE, "testing basic operations with %s node %d",
+ test_inner ? "inner" : "leaf", children);
+
+#ifdef RT_SHMEM
+ radixtree = rt_create(CurrentMemoryContext, dsa, tranche_id);
+#else
+ radixtree = rt_create(CurrentMemoryContext);
+#endif
+
+ /* prepare keys in order like 1, 32, 2, 31, 3, ... */
+ keys = palloc(sizeof(uint64) * children);
+ for (int i = 0; i < children; i++)
+ {
+ if (i % 2 == 0)
+ keys[i] = (uint64) ((i / 2) + 1) << shift;
+ else
+ keys[i] = (uint64) (children - (i / 2)) << shift;
+ }
+
+ /* insert keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (rt_set(radixtree, keys[i], (TestValueType*) &keys[i]))
+ elog(ERROR, "new inserted key 0x" UINT64_HEX_FORMAT " is found ", keys[i]);
+ }
+
+ /* look up keys */
+ for (int i = 0; i < children; i++)
+ {
+ TestValueType value;
+
+ if (!rt_search(radixtree, keys[i], &value))
+ elog(ERROR, "could not find key 0x" UINT64_HEX_FORMAT, keys[i]);
+ if (value != (TestValueType) keys[i])
+ elog(ERROR, "rt_search returned 0x" UINT64_HEX_FORMAT ", expected " UINT64_HEX_FORMAT,
+ value, (TestValueType) keys[i]);
+ }
+
+ /* update keys */
+ for (int i = 0; i < children; i++)
+ {
+ TestValueType update = keys[i] + 1;
+ if (!rt_set(radixtree, keys[i], (TestValueType*) &update))
+ elog(ERROR, "could not update key 0x" UINT64_HEX_FORMAT, keys[i]);
+ }
+
+ /* repeat deleting and inserting keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (!rt_delete(radixtree, keys[i]))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, keys[i]);
+ if (rt_set(radixtree, keys[i], (TestValueType*) &keys[i]))
+ elog(ERROR, "new inserted key 0x" UINT64_HEX_FORMAT " is found ", keys[i]);
+ }
+
+ pfree(keys);
+ rt_free(radixtree);
+#ifdef RT_SHMEM
+ dsa_detach(dsa);
+#endif
+}
+
+/*
+ * Check if keys from start to end with the shift exist in the tree.
+ */
+static void
+check_search_on_node(rt_radix_tree *radixtree, uint8 shift, int start, int end,
+ int incr)
+{
+ for (int i = start; i < end; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ TestValueType val;
+
+ if (!rt_search(radixtree, key, &val))
+ elog(ERROR, "key 0x" UINT64_HEX_FORMAT " is not found on node-%d",
+ key, end);
+ if (val != (TestValueType) key)
+ elog(ERROR, "rt_search with key 0x" UINT64_HEX_FORMAT " returns 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ key, val, key);
+ }
+}
+
+static void
+test_node_types_insert(rt_radix_tree *radixtree, uint8 shift, bool insert_asc)
+{
+ uint64 num_entries;
+ int ninserted = 0;
+ int start = insert_asc ? 0 : 256;
+ int incr = insert_asc ? 1 : -1;
+ int end = insert_asc ? 256 : 0;
+ int node_kind_idx = 1;
+
+ for (int i = start; i != end; i += incr)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_set(radixtree, key, (TestValueType*) &key);
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " is found", key);
+
+ /*
+ * After filling all slots in each node type, check if the values
+ * are stored properly.
+ */
+ if (ninserted == rt_node_kind_fanouts[node_kind_idx] - 1)
+ {
+ int check_start = insert_asc
+ ? rt_node_kind_fanouts[node_kind_idx - 1]
+ : rt_node_kind_fanouts[node_kind_idx];
+ int check_end = insert_asc
+ ? rt_node_kind_fanouts[node_kind_idx]
+ : rt_node_kind_fanouts[node_kind_idx - 1];
+
+ check_search_on_node(radixtree, shift, check_start, check_end, incr);
+ node_kind_idx++;
+ }
+
+ ninserted++;
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ if (num_entries != 256)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(256));
+}
+
+static void
+test_node_types_delete(rt_radix_tree *radixtree, uint8 shift)
+{
+ uint64 num_entries;
+
+ for (int i = 0; i < 256; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_delete(radixtree, key);
+
+ if (!found)
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, key);
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ /* The tree must be empty */
+ if (num_entries != 0)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(0));
+}
+
+/*
+ * Test for inserting and deleting key-value pairs to each node type at the given shift
+ * level.
+ */
+static void
+test_node_types(uint8 shift)
+{
+ rt_radix_tree *radixtree;
+
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+
+ LWLockRegisterTranche(tranche_id, "test_radix_tree");
+ dsa = dsa_create(tranche_id);
+#endif
+
+ elog(NOTICE, "testing radix tree node types with shift \"%d\"", shift);
+
+#ifdef RT_SHMEM
+ radixtree = rt_create(CurrentMemoryContext, dsa, tranche_id);
+#else
+ radixtree = rt_create(CurrentMemoryContext);
+#endif
+
+ /*
+ * Insert and search entries for every node type at the 'shift' level,
+ * then delete all entries to make it empty, and insert and search entries
+ * again.
+ */
+ test_node_types_insert(radixtree, shift, true);
+ test_node_types_delete(radixtree, shift);
+ test_node_types_insert(radixtree, shift, false);
+
+ rt_free(radixtree);
+#ifdef RT_SHMEM
+ dsa_detach(dsa);
+#endif
+}
+
+/*
+ * Test with a repeating pattern, defined by the 'spec'.
+ */
+static void
+test_pattern(const test_spec * spec)
+{
+ rt_radix_tree *radixtree;
+ rt_iter *iter;
+ MemoryContext radixtree_ctx;
+ TimestampTz starttime;
+ TimestampTz endtime;
+ uint64 n;
+ uint64 last_int;
+ uint64 ndeleted;
+ uint64 nbefore;
+ uint64 nafter;
+ int patternlen;
+ uint64 *pattern_values;
+ uint64 pattern_num_values;
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+
+ LWLockRegisterTranche(tranche_id, "test_radix_tree");
+ dsa = dsa_create(tranche_id);
+#endif
+
+ elog(NOTICE, "testing radix tree with pattern \"%s\"", spec->test_name);
+ if (rt_test_stats)
+ fprintf(stderr, "-----\ntesting radix tree with pattern \"%s\"\n", spec->test_name);
+
+ /* Pre-process the pattern, creating an array of integers from it. */
+ patternlen = strlen(spec->pattern_str);
+ pattern_values = palloc(patternlen * sizeof(uint64));
+ pattern_num_values = 0;
+ for (int i = 0; i < patternlen; i++)
+ {
+ if (spec->pattern_str[i] == '1')
+ pattern_values[pattern_num_values++] = i;
+ }
+
+ /*
+ * Allocate the radix tree.
+ *
+ * Allocate it in a separate memory context, so that we can print its
+ * memory usage easily.
+ */
+ radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
+ "radixtree test",
+ ALLOCSET_SMALL_SIZES);
+ MemoryContextSetIdentifier(radixtree_ctx, spec->test_name);
+
+#ifdef RT_SHMEM
+ radixtree = rt_create(radixtree_ctx, dsa, tranche_id);
+#else
+ radixtree = rt_create(radixtree_ctx);
+#endif
+
+
+ /*
+ * Add values to the set.
+ */
+ starttime = GetCurrentTimestamp();
+
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ uint64 x = 0;
+
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ bool found;
+
+ x = last_int + pattern_values[i];
+
+ found = rt_set(radixtree, x, (TestValueType*) &x);
+
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " found", x);
+
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+
+ endtime = GetCurrentTimestamp();
+
+ if (rt_test_stats)
+ fprintf(stderr, "added " UINT64_FORMAT " values in %d ms\n",
+ spec->num_values, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Print stats on the amount of memory used.
+ *
+ * We print the usage reported by rt_memory_usage(), as well as the stats
+ * from the memory context. They should be in the same ballpark, but it's
+ * hard to automate testing that, so if you're making changes to the
+ * implementation, just observe that manually.
+ */
+ if (rt_test_stats)
+ {
+ uint64 mem_usage;
+
+ /*
+ * Also print memory usage as reported by rt_memory_usage(). It
+ * should be in the same ballpark as the usage reported by
+ * MemoryContextStats().
+ */
+ mem_usage = rt_memory_usage(radixtree);
+ fprintf(stderr, "rt_memory_usage() reported " UINT64_FORMAT " (%0.2f bytes / integer)\n",
+ mem_usage, (double) mem_usage / spec->num_values);
+
+ MemoryContextStats(radixtree_ctx);
+ }
+
+ /* Check that rt_num_entries works */
+ n = rt_num_entries(radixtree);
+ if (n != spec->num_values)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT, n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_search()
+ */
+ starttime = GetCurrentTimestamp();
+
+ for (n = 0; n < 100000; n++)
+ {
+ bool found;
+ bool expected;
+ uint64 x;
+ TestValueType v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Do we expect this value to be present in the set? */
+ if (x >= last_int)
+ expected = false;
+ else
+ {
+ uint64 idx = x % spec->spacing;
+
+ if (idx >= patternlen)
+ expected = false;
+ else if (spec->pattern_str[idx] == '1')
+ expected = true;
+ else
+ expected = false;
+ }
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (found != expected)
+ elog(ERROR, "mismatch at 0x" UINT64_HEX_FORMAT ": %d vs %d", x, found, expected);
+ if (found && (v != (TestValueType) x))
+ elog(ERROR, "found 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ v, x);
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "probed " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Test iterator
+ */
+ starttime = GetCurrentTimestamp();
+
+ iter = rt_begin_iterate(radixtree);
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ uint64 expected = last_int + pattern_values[i];
+ uint64 x;
+ TestValueType val;
+
+ if (!rt_iterate_next(iter, &x, &val))
+ break;
+
+ if (x != expected)
+ elog(ERROR,
+ "iterate returned wrong key; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d",
+ x, expected, i);
+ if (val != (TestValueType) expected)
+ elog(ERROR,
+ "iterate returned wrong value; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d", x, expected, i);
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "iterated " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ rt_end_iterate(iter);
+
+ if (n < spec->num_values)
+ elog(ERROR, "iterator stopped short after " UINT64_FORMAT " entries, expected " UINT64_FORMAT, n, spec->num_values);
+ if (n > spec->num_values)
+ elog(ERROR, "iterator returned " UINT64_FORMAT " entries, " UINT64_FORMAT " was expected", n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_delete()
+ */
+ starttime = GetCurrentTimestamp();
+
+ nbefore = rt_num_entries(radixtree);
+ ndeleted = 0;
+ for (n = 0; n < 1; n++)
+ {
+ bool found;
+ uint64 x;
+ TestValueType v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (!found)
+ continue;
+
+ /* If the key is found, delete it and check again */
+ if (!rt_delete(radixtree, x))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_search(radixtree, x, &v))
+ elog(ERROR, "found deleted key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_delete(radixtree, x))
+ elog(ERROR, "deleted already-deleted key 0x" UINT64_HEX_FORMAT, x);
+
+ ndeleted++;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "deleted " UINT64_FORMAT " values in %d ms\n",
+ ndeleted, (int) (endtime - starttime) / 1000);
+
+ nafter = rt_num_entries(radixtree);
+
+ /* Check that rt_num_entries works */
+ if ((nbefore - ndeleted) != nafter)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT "after " UINT64_FORMAT " deletion",
+ nafter, (nbefore - ndeleted), ndeleted);
+
+ rt_free(radixtree);
+ MemoryContextDelete(radixtree_ctx);
+#ifdef RT_SHMEM
+ dsa_detach(dsa);
+#endif
+}
+
+Datum
+test_radixtree(PG_FUNCTION_ARGS)
+{
+ test_empty();
+
+ for (int i = 1; i < lengthof(rt_node_kind_fanouts); i++)
+ {
+ test_basic(rt_node_kind_fanouts[i], false);
+ test_basic(rt_node_kind_fanouts[i], true);
+ }
+
+ for (int shift = 0; shift <= (64 - 8); shift += 8)
+ test_node_types(shift);
+
+ /* Test different test patterns, with lots of entries */
+ for (int i = 0; i < lengthof(test_specs); i++)
+ test_pattern(&test_specs[i]);
+
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_radixtree/test_radixtree.control b/src/test/modules/test_radixtree/test_radixtree.control
new file mode 100644
index 0000000000..e53f2a3e0c
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.control
@@ -0,0 +1,4 @@
+comment = 'Test code for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/test_radixtree'
+relocatable = true
diff --git a/src/tools/pginclude/cpluspluscheck b/src/tools/pginclude/cpluspluscheck
index e52fe9f509..09fa6e7432 100755
--- a/src/tools/pginclude/cpluspluscheck
+++ b/src/tools/pginclude/cpluspluscheck
@@ -101,6 +101,12 @@ do
test "$f" = src/include/nodes/nodetags.h && continue
test "$f" = src/backend/nodes/nodetags.h && continue
+ # radixtree_*_impl.h cannot be included standalone: they are just code fragments.
+ test "$f" = src/include/lib/radixtree_delete_impl.h && continue
+ test "$f" = src/include/lib/radixtree_insert_impl.h && continue
+ test "$f" = src/include/lib/radixtree_iter_impl.h && continue
+ test "$f" = src/include/lib/radixtree_search_impl.h && continue
+
# These files are not meant to be included standalone, because
# they contain lists that might have multiple use-cases.
test "$f" = src/include/access/rmgrlist.h && continue
diff --git a/src/tools/pginclude/headerscheck b/src/tools/pginclude/headerscheck
index abbba7aa63..d4d2f1da03 100755
--- a/src/tools/pginclude/headerscheck
+++ b/src/tools/pginclude/headerscheck
@@ -96,6 +96,12 @@ do
test "$f" = src/include/nodes/nodetags.h && continue
test "$f" = src/backend/nodes/nodetags.h && continue
+ # radixtree_*_impl.h cannot be included standalone: they are just code fragments.
+ test "$f" = src/include/lib/radixtree_delete_impl.h && continue
+ test "$f" = src/include/lib/radixtree_insert_impl.h && continue
+ test "$f" = src/include/lib/radixtree_iter_impl.h && continue
+ test "$f" = src/include/lib/radixtree_search_impl.h && continue
+
# These files are not meant to be included standalone, because
# they contain lists that might have multiple use-cases.
test "$f" = src/include/access/rmgrlist.h && continue
--
2.31.1
Attachment: v29-0004-Add-TIDStore-to-store-sets-of-TIDs-ItemPointerDa.patch (application/octet-stream)
From 7a4bf52d585e41926b6a85cb7ae64be177cc0d04 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 23 Dec 2022 23:50:39 +0900
Subject: [PATCH v29 04/10] Add TIDStore, to store sets of TIDs
(ItemPointerData) efficiently.
The TIDStore is designed to store large sets of TIDs efficiently, and
is backed by the radix tree. A TID is encoded into a 64-bit key and a
64-bit value and inserted into the radix tree.
The TIDStore is not used for anything yet, aside from the test code,
but the follow-up patch integrates the TIDStore with lazy vacuum,
reducing lazy vacuum memory usage and lifting the 1GB limit on its
size, by storing the list of dead TIDs more efficiently.
This includes a unit test module, in src/test/modules/test_tidstore.
---
doc/src/sgml/monitoring.sgml | 4 +
src/backend/access/common/Makefile | 1 +
src/backend/access/common/meson.build | 1 +
src/backend/access/common/tidstore.c | 681 ++++++++++++++++++
src/backend/storage/lmgr/lwlock.c | 2 +
src/include/access/tidstore.h | 49 ++
src/include/storage/lwlock.h | 1 +
src/test/modules/Makefile | 1 +
src/test/modules/meson.build | 1 +
src/test/modules/test_tidstore/Makefile | 23 +
.../test_tidstore/expected/test_tidstore.out | 13 +
src/test/modules/test_tidstore/meson.build | 35 +
.../test_tidstore/sql/test_tidstore.sql | 7 +
.../test_tidstore/test_tidstore--1.0.sql | 8 +
.../modules/test_tidstore/test_tidstore.c | 226 ++++++
.../test_tidstore/test_tidstore.control | 4 +
16 files changed, 1057 insertions(+)
create mode 100644 src/backend/access/common/tidstore.c
create mode 100644 src/include/access/tidstore.h
create mode 100644 src/test/modules/test_tidstore/Makefile
create mode 100644 src/test/modules/test_tidstore/expected/test_tidstore.out
create mode 100644 src/test/modules/test_tidstore/meson.build
create mode 100644 src/test/modules/test_tidstore/sql/test_tidstore.sql
create mode 100644 src/test/modules/test_tidstore/test_tidstore--1.0.sql
create mode 100644 src/test/modules/test_tidstore/test_tidstore.c
create mode 100644 src/test/modules/test_tidstore/test_tidstore.control
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index dca50707ad..e28206e056 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -2198,6 +2198,10 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
<entry>Waiting to access a shared TID bitmap during a parallel bitmap
index scan.</entry>
</row>
+ <row>
+ <entry><literal>SharedTidStore</literal></entry>
+ <entry>Waiting to access a shared TID store.</entry>
+ </row>
<row>
<entry><literal>SharedTupleStore</literal></entry>
<entry>Waiting to access a shared tuple store during parallel
diff --git a/src/backend/access/common/Makefile b/src/backend/access/common/Makefile
index b9aff0ccfd..67b8cc6108 100644
--- a/src/backend/access/common/Makefile
+++ b/src/backend/access/common/Makefile
@@ -27,6 +27,7 @@ OBJS = \
syncscan.o \
toast_compression.o \
toast_internals.o \
+ tidstore.o \
tupconvert.o \
tupdesc.o
diff --git a/src/backend/access/common/meson.build b/src/backend/access/common/meson.build
index f5ac17b498..fce19c09ce 100644
--- a/src/backend/access/common/meson.build
+++ b/src/backend/access/common/meson.build
@@ -15,6 +15,7 @@ backend_sources += files(
'syncscan.c',
'toast_compression.c',
'toast_internals.c',
+ 'tidstore.c',
'tupconvert.c',
'tupdesc.c',
)
diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
new file mode 100644
index 0000000000..8c05e60d92
--- /dev/null
+++ b/src/backend/access/common/tidstore.c
@@ -0,0 +1,681 @@
+/*-------------------------------------------------------------------------
+ *
+ * tidstore.c
+ * Tid (ItemPointerData) storage implementation.
+ *
+ * This module provides an in-memory data structure to store tids
+ * (ItemPointer). Internally, a tid is encoded as a pair of a 64-bit key and
+ * a 64-bit value, and stored in the radix tree.
+ *
+ * A TidStore can be shared among parallel worker processes by passing a DSA
+ * area to tidstore_create(). Other backends can attach to the shared TidStore
+ * with tidstore_attach().
+ *
+ * As for concurrency, we mostly rely on the concurrency support in the radix
+ * tree, but we acquire the lock on a TidStore in some cases, for example,
+ * when resetting the store and when accessing the number of tids in the
+ * store (num_tids).
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/common/tidstore.c
+ *
+ *-------------------------------------------------------------------------
+ */
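+
+/*
+ * Example usage, a minimal sketch for the local (non-shared) case; per-block
+ * additions go through tidstore_add_tids(), and iteration uses the
+ * TidStoreIter interface defined below:
+ *
+ *     TidStore *ts = tidstore_create(max_bytes, MaxHeapTuplesPerPage, NULL);
+ *     ... collect the dead tids of each block and add them with
+ *     tidstore_add_tids() ...
+ *     ... look up or iterate over the stored tids, then free the store ...
+ */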
+#include "postgres.h"
+
+#include "access/tidstore.h"
+#include "miscadmin.h"
+#include "port/pg_bitutils.h"
+#include "storage/lwlock.h"
+#include "utils/dsa.h"
+#include "utils/memutils.h"
+
+/*
+ * For encoding purposes, tids are represented as a pair of a 64-bit key and
+ * a 64-bit value. First, we construct a 64-bit unsigned integer by combining
+ * the block number and the offset number. The number of bits used for the
+ * offset number is determined by max_offset in tidstore_create(). We are
+ * frugal with the bits, because smaller keys help keep the radix tree
+ * shallow.
+ *
+ * For example, a tid in a heap with 8kB blocks uses the lowest 9 bits for
+ * the offset number and the next 32 bits for the block number. That
+ * is, only 41 bits are used:
+ *
+ * uuuuuuuY YYYYYYYY YYYYYYYY YYYYYYYY YYYYYYYX XXXXXXXX
+ *
+ * X = bits used for offset number
+ * Y = bits used for block number
+ * u = unused bit
+ * (high on the left, low on the right)
+ *
+ * 9 bits are enough for the offset number, because MaxHeapTuplesPerPage < 2^9
+ * on 8kB blocks.
+ *
+ * The 64-bit value is the bitmap representation of the lowest 6 bits
+ * (TIDSTORE_VALUE_NBITS) of the integer, and the remaining 35 bits are used
+ * as the key:
+ *
+ * uuuuuuuY YYYYYYYY YYYYYYYY YYYYYYYY YYYYYYYX XXXXXXXX
+ * |----| value
+ * |---------------------------------------------| key
+ *
+ * The maximum height of the radix tree is 5 in this case.
+ */
+#define TIDSTORE_VALUE_NBITS 6 /* log(64, 2) */
+#define TIDSTORE_OFFSET_MASK ((1 << TIDSTORE_VALUE_NBITS) - 1)
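+
+/*
+ * Worked example (illustration only, assuming 8kB heap blocks and hence
+ * 9 offset bits): the tid (block 10, offset 20) becomes the integer
+ * (10 << 9) | 20 = 5140. The radix tree key is 5140 >> TIDSTORE_VALUE_NBITS
+ * = 80, and the value is a bitmap word with bit (5140 & TIDSTORE_OFFSET_MASK)
+ * = 20 set, i.e. UINT64CONST(1) << 20. Tids on the same block whose encoded
+ * integers fall into the same group of 64 share one key, so their bits are
+ * OR'ed into the same value word.
+ */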
+
+/* A magic value used to identify our TidStores. */
+#define TIDSTORE_MAGIC 0x826f6a10
+
+#define RT_PREFIX local_rt
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#define RT_VALUE_TYPE uint64
+#include "lib/radixtree.h"
+
+#define RT_PREFIX shared_rt
+#define RT_SHMEM
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#define RT_VALUE_TYPE uint64
+#include "lib/radixtree.h"
+
+/* The control object for a TidStore */
+typedef struct TidStoreControl
+{
+ /* the number of tids in the store */
+ int64 num_tids;
+
+ /* These values are never changed after creation */
+ size_t max_bytes; /* the maximum bytes a TidStore can use */
+ int max_offset; /* the maximum offset number */
+ int offset_nbits; /* the number of bits required for an offset
+ * number */
+ int offset_key_nbits; /* the number of bits of an offset number
+ * used in a key */
+
+ /* The below fields are used only in the shared case */
+
+ uint32 magic;
+ LWLock lock;
+
+ /* handles for TidStore and radix tree */
+ tidstore_handle handle;
+ shared_rt_handle tree_handle;
+} TidStoreControl;
+
+/* Per-backend state for a TidStore */
+struct TidStore
+{
+ /*
+ * Control object. This is allocated in DSA area 'area' in the shared
+ * case, otherwise in backend-local memory.
+ */
+ TidStoreControl *control;
+
+ /* Storage for Tids. Use either one depending on TidStoreIsShared() */
+ union
+ {
+ local_rt_radix_tree *local;
+ shared_rt_radix_tree *shared;
+ } tree;
+
+ /* DSA area for TidStore if used */
+ dsa_area *area;
+};
+#define TidStoreIsShared(ts) ((ts)->area != NULL)
+
+/* Iterator for TidStore */
+typedef struct TidStoreIter
+{
+ TidStore *ts;
+
+ /* iterator of radix tree. Use either one depending on TidStoreIsShared() */
+ union
+ {
+ shared_rt_iter *shared;
+ local_rt_iter *local;
+ } tree_iter;
+
+ /* have we returned all tids? */
+ bool finished;
+
+ /* save for the next iteration */
+ uint64 next_key;
+ uint64 next_val;
+
+ /* output for the caller */
+ TidStoreIterResult result;
+} TidStoreIter;
+
+static void tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, uint64 val);
+static inline BlockNumber key_get_blkno(TidStore *ts, uint64 key);
+static inline uint64 encode_key_off(TidStore *ts, BlockNumber block, uint32 offset, uint64 *off_bit);
+static inline uint64 tid_to_key_off(TidStore *ts, ItemPointer tid, uint64 *off_bit);
+
+/*
+ * Create a TidStore. The returned object is allocated in backend-local memory.
+ * The radix tree for storage is allocated in the DSA area if 'area' is non-NULL.
+ */
+TidStore *
+tidstore_create(size_t max_bytes, int max_offset, dsa_area *area)
+{
+ TidStore *ts;
+
+ ts = palloc0(sizeof(TidStore));
+
+ /*
+ * Create the radix tree for the main storage.
+ *
+ * Memory consumption depends not only on the number of stored tids, but also
+ * on their distribution, how the radix tree stores them, and the memory
+ * management that backs the radix tree. The maximum number of bytes that a
+ * TidStore can use is specified by max_bytes in tidstore_create(). We want the
+ * total memory consumption of a TidStore not to exceed max_bytes.
+ *
+ * In local TidStore cases, the radix tree uses slab allocators for each kind
+ * of node class. The most memory-consuming case while adding Tids associated
+ * with one page (i.e. during tidstore_add_tids()) is allocating a new
+ * slab block for a new radix tree node, which is approximately 70kB. Therefore,
+ * we deduct 70kB from max_bytes.
+ *
+ * In shared cases, DSA allocates memory segments big enough to follow a
+ * geometric series that approximately doubles the total DSA size (see
+ * make_new_segment() in dsa.c). We simulated how DSA increases segment size,
+ * and the simulation revealed that a 75% threshold for the maximum bytes
+ * works well when max_bytes is a power of 2, and a 60% threshold works for
+ * other cases.
+ */
+ if (area != NULL)
+ {
+ dsa_pointer dp;
+ float ratio = ((max_bytes & (max_bytes - 1)) == 0) ? 0.75 : 0.6;
+
+ ts->tree.shared = shared_rt_create(CurrentMemoryContext, area,
+ LWTRANCHE_SHARED_TIDSTORE);
+
+ dp = dsa_allocate0(area, sizeof(TidStoreControl));
+ ts->control = (TidStoreControl *) dsa_get_address(area, dp);
+ ts->control->max_bytes = (uint64) (max_bytes * ratio);
+ ts->area = area;
+
+ ts->control->magic = TIDSTORE_MAGIC;
+ LWLockInitialize(&ts->control->lock, LWTRANCHE_SHARED_TIDSTORE);
+ ts->control->handle = dp;
+ ts->control->tree_handle = shared_rt_get_handle(ts->tree.shared);
+ }
+ else
+ {
+ ts->tree.local = local_rt_create(CurrentMemoryContext);
+
+ ts->control = (TidStoreControl *) palloc0(sizeof(TidStoreControl));
+ ts->control->max_bytes = max_bytes - (70 * 1024);
+ }
+
+ ts->control->max_offset = max_offset;
+ ts->control->offset_nbits = pg_ceil_log2_32(max_offset);
+
+ if (ts->control->offset_nbits < TIDSTORE_VALUE_NBITS)
+ ts->control->offset_nbits = TIDSTORE_VALUE_NBITS;
+
+ ts->control->offset_key_nbits =
+ ts->control->offset_nbits - TIDSTORE_VALUE_NBITS;
+
+ return ts;
+}
+
+/*
+ * Attach to the shared TidStore using a handle. The returned object is
+ * allocated in backend-local memory using the CurrentMemoryContext.
+ */
+TidStore *
+tidstore_attach(dsa_area *area, tidstore_handle handle)
+{
+ TidStore *ts;
+ dsa_pointer control;
+
+ Assert(area != NULL);
+ Assert(DsaPointerIsValid(handle));
+
+ /* create per-backend state */
+ ts = palloc0(sizeof(TidStore));
+
+ /* Find the control object in shared memory */
+ control = handle;
+
+ /* Set up the TidStore */
+ ts->control = (TidStoreControl *) dsa_get_address(area, control);
+ Assert(ts->control->magic == TIDSTORE_MAGIC);
+
+ ts->tree.shared = shared_rt_attach(area, ts->control->tree_handle);
+ ts->area = area;
+
+ return ts;
+}
+
+/*
+ * Detach from a TidStore. This detaches from the radix tree and frees the
+ * backend-local resources. The radix tree will continue to exist until
+ * it is either explicitly destroyed, or the area that backs it is returned
+ * to the operating system.
+ */
+void
+tidstore_detach(TidStore *ts)
+{
+ Assert(TidStoreIsShared(ts) && ts->control->magic == TIDSTORE_MAGIC);
+
+ shared_rt_detach(ts->tree.shared);
+ pfree(ts);
+}
+
+/*
+ * Destroy a TidStore, returning all memory.
+ *
+ * TODO: The caller must be certain that no other backend will attempt to
+ * access the TidStore before calling this function. Other backends must
+ * explicitly call tidstore_detach to free up backend-local memory associated
+ * with the TidStore. The backend that calls tidstore_destroy must not call
+ * tidstore_detach.
+ */
+void
+tidstore_destroy(TidStore *ts)
+{
+ if (TidStoreIsShared(ts))
+ {
+ Assert(ts->control->magic == TIDSTORE_MAGIC);
+
+ /*
+ * Vandalize the control block to help catch programming error where
+ * other backends access the memory formerly occupied by this radix
+ * tree.
+ */
+ ts->control->magic = 0;
+ dsa_free(ts->area, ts->control->handle);
+ shared_rt_free(ts->tree.shared);
+ }
+ else
+ {
+ pfree(ts->control);
+ local_rt_free(ts->tree.local);
+ }
+
+ pfree(ts);
+}
+
+/*
+ * Forget all collected Tids. It is similar to tidstore_destroy, but instead of
+ * freeing the entire TidStore we recreate only the radix tree storage.
+ */
+void
+tidstore_reset(TidStore *ts)
+{
+ if (TidStoreIsShared(ts))
+ {
+ Assert(ts->control->magic == TIDSTORE_MAGIC);
+
+ LWLockAcquire(&ts->control->lock, LW_EXCLUSIVE);
+
+ /*
+ * Free the radix tree and return allocated DSA segments to
+ * the operating system.
+ */
+ shared_rt_free(ts->tree.shared);
+ dsa_trim(ts->area);
+
+ /* Recreate the radix tree */
+ ts->tree.shared = shared_rt_create(CurrentMemoryContext, ts->area,
+ LWTRANCHE_SHARED_TIDSTORE);
+
+ /* update the radix tree handle as we recreated it */
+ ts->control->tree_handle = shared_rt_get_handle(ts->tree.shared);
+
+ /* Reset the statistics */
+ ts->control->num_tids = 0;
+
+ LWLockRelease(&ts->control->lock);
+ }
+ else
+ {
+ local_rt_free(ts->tree.local);
+ ts->tree.local = local_rt_create(CurrentMemoryContext);
+
+ /* Reset the statistics */
+ ts->control->num_tids = 0;
+ }
+}
+
+/* Add Tids on a block to TidStore */
+void
+tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets)
+{
+ uint64 *values;
+ uint64 key;
+ uint64 prev_key;
+ uint64 off_bitmap = 0;
+ int idx;
+ const uint64 key_base = ((uint64) blkno) << ts->control->offset_key_nbits;
+ const int nkeys = UINT64CONST(1) << ts->control->offset_key_nbits;
+
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ values = palloc(sizeof(uint64) * nkeys);
+ key = prev_key = key_base;
+
+ for (int i = 0; i < num_offsets; i++)
+ {
+ uint64 off_bit;
+
+ /* encode the tid to a key and partial offset */
+ key = encode_key_off(ts, blkno, offsets[i], &off_bit);
+
+ /* make sure we scanned the line pointer array in order */
+ Assert(key >= prev_key);
+
+ if (key > prev_key)
+ {
+ idx = prev_key - key_base;
+ Assert(idx >= 0 && idx < nkeys);
+
+ /* write out offset bitmap for this key */
+ values[idx] = off_bitmap;
+
+ /* zero out any gaps up to the current key */
+ for (int empty_idx = idx + 1; empty_idx < key - key_base; empty_idx++)
+ values[empty_idx] = 0;
+
+ /* reset for current key -- the current offset will be handled below */
+ off_bitmap = 0;
+ prev_key = key;
+ }
+
+ off_bitmap |= off_bit;
+ }
+
+ /* save the final index for later */
+ idx = key - key_base;
+ /* write out last offset bitmap */
+ values[idx] = off_bitmap;
+
+ if (TidStoreIsShared(ts))
+ LWLockAcquire(&ts->control->lock, LW_EXCLUSIVE);
+
+ /* insert the calculated key-values to the tree */
+ for (int i = 0; i <= idx; i++)
+ {
+ if (values[i])
+ {
+ key = key_base + i;
+
+ if (TidStoreIsShared(ts))
+ shared_rt_set(ts->tree.shared, key, &values[i]);
+ else
+ local_rt_set(ts->tree.local, key, &values[i]);
+ }
+ }
+
+ /* update statistics */
+ ts->control->num_tids += num_offsets;
+
+ if (TidStoreIsShared(ts))
+ LWLockRelease(&ts->control->lock);
+
+ pfree(values);
+}
+
+/* Return true if the given tid is present in the TidStore */
+bool
+tidstore_lookup_tid(TidStore *ts, ItemPointer tid)
+{
+ uint64 key;
+ uint64 val = 0;
+ uint64 off_bit;
+ bool found;
+
+ key = tid_to_key_off(ts, tid, &off_bit);
+
+ if (TidStoreIsShared(ts))
+ found = shared_rt_search(ts->tree.shared, key, &val);
+ else
+ found = local_rt_search(ts->tree.local, key, &val);
+
+ if (!found)
+ return false;
+
+ return (val & off_bit) != 0;
+}
+
+/*
+ * Prepare to iterate through a TidStore. Since the radix tree is locked during
+ * the iteration, tidstore_end_iterate() needs to be called when finished.
+ *
+ * Concurrent updates during the iteration will be blocked when inserting a
+ * key-value to the radix tree.
+ */
+TidStoreIter *
+tidstore_begin_iterate(TidStore *ts)
+{
+ TidStoreIter *iter;
+
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ iter = palloc0(sizeof(TidStoreIter));
+ iter->ts = ts;
+
+ iter->result.blkno = InvalidBlockNumber;
+ iter->result.offsets = palloc(sizeof(OffsetNumber) * ts->control->max_offset);
+
+ if (TidStoreIsShared(ts))
+ iter->tree_iter.shared = shared_rt_begin_iterate(ts->tree.shared);
+ else
+ iter->tree_iter.local = local_rt_begin_iterate(ts->tree.local);
+
+ /* If the TidStore is empty, there is nothing to do */
+ if (tidstore_num_tids(ts) == 0)
+ iter->finished = true;
+
+ return iter;
+}
+
+static inline bool
+tidstore_iter_kv(TidStoreIter *iter, uint64 *key, uint64 *val)
+{
+ if (TidStoreIsShared(iter->ts))
+ return shared_rt_iterate_next(iter->tree_iter.shared, key, val);
+
+ return local_rt_iterate_next(iter->tree_iter.local, key, val);
+}
+
+/*
+ * Scan the TidStore and return a pointer to a TidStoreIterResult that has the
+ * tids in one block. We return the block numbers in ascending order, and the
+ * offset numbers in each result are also sorted in ascending order.
+ */
+TidStoreIterResult *
+tidstore_iterate_next(TidStoreIter *iter)
+{
+ uint64 key;
+ uint64 val;
+ TidStoreIterResult *result = &(iter->result);
+
+ if (iter->finished)
+ return NULL;
+
+ if (BlockNumberIsValid(result->blkno))
+ {
+ /* Process the previously collected key-value */
+ result->num_offsets = 0;
+ tidstore_iter_extract_tids(iter, iter->next_key, iter->next_val);
+ }
+
+ while (tidstore_iter_kv(iter, &key, &val))
+ {
+ BlockNumber blkno;
+
+ blkno = key_get_blkno(iter->ts, key);
+
+ if (BlockNumberIsValid(result->blkno) && result->blkno != blkno)
+ {
+ /*
+ * We got a key-value pair for a different block. So return the
+ * collected tids, and remember the key-value for the next iteration.
+ */
+ iter->next_key = key;
+ iter->next_val = val;
+ return result;
+ }
+
+ /* Collect tids extracted from the key-value pair */
+ tidstore_iter_extract_tids(iter, key, val);
+ }
+
+ iter->finished = true;
+ return result;
+}
+
+/*
+ * Finish an iteration over a TidStore. This needs to be called after finishing
+ * the iteration, or when exiting it early.
+ */
+void
+tidstore_end_iterate(TidStoreIter *iter)
+{
+ if (TidStoreIsShared(iter->ts))
+ shared_rt_end_iterate(iter->tree_iter.shared);
+ else
+ local_rt_end_iterate(iter->tree_iter.local);
+
+ pfree(iter->result.offsets);
+ pfree(iter);
+}
+
+/* Return the number of tids we collected so far */
+int64
+tidstore_num_tids(TidStore *ts)
+{
+ uint64 num_tids;
+
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ if (!TidStoreIsShared(ts))
+ return ts->control->num_tids;
+
+ LWLockAcquire(&ts->control->lock, LW_SHARED);
+ num_tids = ts->control->num_tids;
+ LWLockRelease(&ts->control->lock);
+
+ return num_tids;
+}
+
+/* Return true if the current memory usage of TidStore exceeds the limit */
+bool
+tidstore_is_full(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ return (tidstore_memory_usage(ts) > ts->control->max_bytes);
+}
+
+/* Return the maximum memory TidStore can use */
+size_t
+tidstore_max_memory(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ return ts->control->max_bytes;
+}
+
+/* Return the memory usage of TidStore */
+size_t
+tidstore_memory_usage(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ /*
+ * In the shared case, TidStoreControl and radix_tree are backed by the
+ * same DSA area and rt_memory_usage() returns the value including both.
+ * So we don't need to add the size of TidStoreControl separately.
+ */
+ if (TidStoreIsShared(ts))
+ return sizeof(TidStore) + shared_rt_memory_usage(ts->tree.shared);
+
+ return sizeof(TidStore) + sizeof(TidStoreControl) + local_rt_memory_usage(ts->tree.local);
+}
+
+/*
+ * Get a handle that can be used by other processes to attach to this TidStore
+ */
+tidstore_handle
+tidstore_get_handle(TidStore *ts)
+{
+ Assert(TidStoreIsShared(ts) && ts->control->magic == TIDSTORE_MAGIC);
+
+ return ts->control->handle;
+}
+
+/* Extract tids from the given key-value pair */
+static void
+tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, uint64 val)
+{
+ TidStoreIterResult *result = (&iter->result);
+
+ while (val)
+ {
+ uint64 tid_i;
+ OffsetNumber off;
+
+ tid_i = key << TIDSTORE_VALUE_NBITS;
+ tid_i |= pg_rightmost_one_pos64(val);
+
+ off = tid_i & ((UINT64CONST(1) << iter->ts->control->offset_nbits) - 1);
+
+ Assert(result->num_offsets < iter->ts->control->max_offset);
+ result->offsets[result->num_offsets++] = off;
+
+ /* unset the rightmost bit */
+ val &= ~pg_rightmost_one64(val);
+ }
+
+ result->blkno = key_get_blkno(iter->ts, key);
+}
+
+/* Get block number from the given key */
+static inline BlockNumber
+key_get_blkno(TidStore *ts, uint64 key)
+{
+ return (BlockNumber) (key >> ts->control->offset_key_nbits);
+}
+
+/* Encode a tid to key and offset */
+static inline uint64
+tid_to_key_off(TidStore *ts, ItemPointer tid, uint64 *off_bit)
+{
+ uint32 offset = ItemPointerGetOffsetNumber(tid);
+ BlockNumber block = ItemPointerGetBlockNumber(tid);
+
+ return encode_key_off(ts, block, offset, off_bit);
+}
+
+/* encode a block and offset to a key and partial offset */
+static inline uint64
+encode_key_off(TidStore *ts, BlockNumber block, uint32 offset, uint64 *off_bit)
+{
+ uint64 key;
+ uint64 tid_i;
+ uint32 off_lower;
+
+ off_lower = offset & TIDSTORE_OFFSET_MASK;
+ Assert(off_lower < (sizeof(uint64) * BITS_PER_BYTE));
+
+ *off_bit = UINT64CONST(1) << off_lower;
+ tid_i = offset | ((uint64) block << ts->control->offset_nbits);
+ key = tid_i >> TIDSTORE_VALUE_NBITS;
+
+ return key;
+}
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index d2ec396045..55b3a04097 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -176,6 +176,8 @@ static const char *const BuiltinTrancheNames[] = {
"SharedTupleStore",
/* LWTRANCHE_SHARED_TIDBITMAP: */
"SharedTidBitmap",
+ /* LWTRANCHE_SHARED_TIDSTORE: */
+ "SharedTidStore",
/* LWTRANCHE_PARALLEL_APPEND: */
"ParallelAppend",
/* LWTRANCHE_PER_XACT_PREDICATE_LIST: */
diff --git a/src/include/access/tidstore.h b/src/include/access/tidstore.h
new file mode 100644
index 0000000000..a35a52124a
--- /dev/null
+++ b/src/include/access/tidstore.h
@@ -0,0 +1,49 @@
+/*-------------------------------------------------------------------------
+ *
+ * tidstore.h
+ * Tid storage.
+ *
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/tidstore.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef TIDSTORE_H
+#define TIDSTORE_H
+
+#include "storage/itemptr.h"
+#include "utils/dsa.h"
+
+typedef dsa_pointer tidstore_handle;
+
+typedef struct TidStore TidStore;
+typedef struct TidStoreIter TidStoreIter;
+
+typedef struct TidStoreIterResult
+{
+ BlockNumber blkno;
+ OffsetNumber *offsets;
+ int num_offsets;
+} TidStoreIterResult;
+
+extern TidStore *tidstore_create(size_t max_bytes, int max_offset, dsa_area *dsa);
+extern TidStore *tidstore_attach(dsa_area *dsa, dsa_pointer handle);
+extern void tidstore_detach(TidStore *ts);
+extern void tidstore_destroy(TidStore *ts);
+extern void tidstore_reset(TidStore *ts);
+extern void tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets);
+extern bool tidstore_lookup_tid(TidStore *ts, ItemPointer tid);
+extern TidStoreIter * tidstore_begin_iterate(TidStore *ts);
+extern TidStoreIterResult *tidstore_iterate_next(TidStoreIter *iter);
+extern void tidstore_end_iterate(TidStoreIter *iter);
+extern int64 tidstore_num_tids(TidStore *ts);
+extern bool tidstore_is_full(TidStore *ts);
+extern size_t tidstore_max_memory(TidStore *ts);
+extern size_t tidstore_memory_usage(TidStore *ts);
+extern tidstore_handle tidstore_get_handle(TidStore *ts);
+
+#endif /* TIDSTORE_H */
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index d2c7afb8f4..07002fdfbe 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -199,6 +199,7 @@ typedef enum BuiltinTrancheIds
LWTRANCHE_PER_SESSION_RECORD_TYPMOD,
LWTRANCHE_SHARED_TUPLESTORE,
LWTRANCHE_SHARED_TIDBITMAP,
+ LWTRANCHE_SHARED_TIDSTORE,
LWTRANCHE_PARALLEL_APPEND,
LWTRANCHE_PER_XACT_PREDICATE_LIST,
LWTRANCHE_PGSTATS_DSA,
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 9659eb85d7..bddc16ada7 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -34,6 +34,7 @@ SUBDIRS = \
test_rls_hooks \
test_shm_mq \
test_slru \
+ test_tidstore \
unsafe_tests \
worker_spi
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 232cbdac80..c0d5645ad8 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -30,5 +30,6 @@ subdir('test_regex')
subdir('test_rls_hooks')
subdir('test_shm_mq')
subdir('test_slru')
+subdir('test_tidstore')
subdir('unsafe_tests')
subdir('worker_spi')
diff --git a/src/test/modules/test_tidstore/Makefile b/src/test/modules/test_tidstore/Makefile
new file mode 100644
index 0000000000..dab107d70c
--- /dev/null
+++ b/src/test/modules/test_tidstore/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_tidstore/Makefile
+
+MODULE_big = test_tidstore
+OBJS = \
+ $(WIN32RES) \
+ test_tidstore.o
+PGFILEDESC = "test_tidstore - test code for src/backend/access/common/tidstore.c"
+
+EXTENSION = test_tidstore
+DATA = test_tidstore--1.0.sql
+
+REGRESS = test_tidstore
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_tidstore
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_tidstore/expected/test_tidstore.out b/src/test/modules/test_tidstore/expected/test_tidstore.out
new file mode 100644
index 0000000000..7ff2f9af87
--- /dev/null
+++ b/src/test/modules/test_tidstore/expected/test_tidstore.out
@@ -0,0 +1,13 @@
+CREATE EXTENSION test_tidstore;
+--
+-- All the logic is in the test_tidstore() function. It will throw
+-- an error if something fails.
+--
+SELECT test_tidstore();
+NOTICE: testing empty tidstore
+NOTICE: testing basic operations
+ test_tidstore
+---------------
+
+(1 row)
+
diff --git a/src/test/modules/test_tidstore/meson.build b/src/test/modules/test_tidstore/meson.build
new file mode 100644
index 0000000000..31f2da7b61
--- /dev/null
+++ b/src/test/modules/test_tidstore/meson.build
@@ -0,0 +1,35 @@
+# FIXME: prevent install during main install, but not during test :/
+
+test_tidstore_sources = files(
+ 'test_tidstore.c',
+)
+
+if host_system == 'windows'
+ test_tidstore_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'test_tidstore',
+ '--FILEDESC', 'test_tidstore - test code for src/backend/access/common/tidstore.c',])
+endif
+
+test_tidstore = shared_module('test_tidstore',
+ test_tidstore_sources,
+ link_with: pgport_srv,
+ kwargs: pg_mod_args,
+)
+testprep_targets += test_tidstore
+
+install_data(
+ 'test_tidstore.control',
+ 'test_tidstore--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'test_tidstore',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'test_tidstore',
+ ],
+ },
+}
diff --git a/src/test/modules/test_tidstore/sql/test_tidstore.sql b/src/test/modules/test_tidstore/sql/test_tidstore.sql
new file mode 100644
index 0000000000..03aea31815
--- /dev/null
+++ b/src/test/modules/test_tidstore/sql/test_tidstore.sql
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_tidstore;
+
+--
+-- All the logic is in the test_tidstore() function. It will throw
+-- an error if something fails.
+--
+SELECT test_tidstore();
diff --git a/src/test/modules/test_tidstore/test_tidstore--1.0.sql b/src/test/modules/test_tidstore/test_tidstore--1.0.sql
new file mode 100644
index 0000000000..47e9149900
--- /dev/null
+++ b/src/test/modules/test_tidstore/test_tidstore--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_tidstore/test_tidstore--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_tidstore" to load this file. \quit
+
+CREATE FUNCTION test_tidstore()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_tidstore/test_tidstore.c b/src/test/modules/test_tidstore/test_tidstore.c
new file mode 100644
index 0000000000..9a1217f833
--- /dev/null
+++ b/src/test/modules/test_tidstore/test_tidstore.c
@@ -0,0 +1,226 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_tidstore.c
+ * Test TidStore data structure.
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/test/modules/test_tidstore/test_tidstore.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "access/tidstore.h"
+#include "fmgr.h"
+#include "miscadmin.h"
+#include "storage/block.h"
+#include "storage/itemptr.h"
+#include "storage/lwlock.h"
+#include "utils/memutils.h"
+
+PG_MODULE_MAGIC;
+
+/* #define TEST_SHARED_TIDSTORE 1 */
+
+#define TEST_TIDSTORE_MAX_BYTES (2 * 1024 * 1024L) /* 2MB */
+
+PG_FUNCTION_INFO_V1(test_tidstore);
+
+static void
+check_tid(TidStore *ts, BlockNumber blkno, OffsetNumber off, bool expect)
+{
+ ItemPointerData tid;
+ bool found;
+
+ ItemPointerSet(&tid, blkno, off);
+
+ found = tidstore_lookup_tid(ts, &tid);
+
+ if (found != expect)
+ elog(ERROR, "lookup TID (%u, %u) returned %d, expected %d",
+ blkno, off, found, expect);
+}
+
+static void
+test_basic(int max_offset)
+{
+#define TEST_TIDSTORE_NUM_BLOCKS 5
+#define TEST_TIDSTORE_NUM_OFFSETS 5
+
+ TidStore *ts;
+ TidStoreIter *iter;
+ TidStoreIterResult *iter_result;
+ BlockNumber blks[TEST_TIDSTORE_NUM_BLOCKS] = {
+ 0, MaxBlockNumber, MaxBlockNumber - 1, 1, MaxBlockNumber / 2,
+ };
+ BlockNumber blks_sorted[TEST_TIDSTORE_NUM_BLOCKS] = {
+ 0, 1, MaxBlockNumber / 2, MaxBlockNumber - 1, MaxBlockNumber
+ };
+ OffsetNumber offs[TEST_TIDSTORE_NUM_OFFSETS];
+ int blk_idx;
+
+#ifdef TEST_SHARED_TIDSTORE
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+
+ LWLockRegisterTranche(tranche_id, "test_tidstore");
+ dsa = dsa_create(tranche_id);
+
+ ts = tidstore_create(TEST_TIDSTORE_MAX_BYTES, max_offset, dsa);
+#else
+ ts = tidstore_create(TEST_TIDSTORE_MAX_BYTES, max_offset, NULL);
+#endif
+
+ /* prepare the offset array */
+ offs[0] = FirstOffsetNumber;
+ offs[1] = FirstOffsetNumber + 1;
+ offs[2] = max_offset / 2;
+ offs[3] = max_offset - 1;
+ offs[4] = max_offset;
+
+ /* add tids */
+ for (int i = 0; i < TEST_TIDSTORE_NUM_BLOCKS; i++)
+ tidstore_add_tids(ts, blks[i], offs, TEST_TIDSTORE_NUM_OFFSETS);
+
+ /* lookup test */
+ for (OffsetNumber off = FirstOffsetNumber ; off < max_offset; off++)
+ {
+ bool expect = false;
+ for (int i = 0; i < TEST_TIDSTORE_NUM_OFFSETS; i++)
+ {
+ if (offs[i] == off)
+ {
+ expect = true;
+ break;
+ }
+ }
+
+ check_tid(ts, 0, off, expect);
+ check_tid(ts, 2, off, false);
+ check_tid(ts, MaxBlockNumber - 2, off, false);
+ check_tid(ts, MaxBlockNumber, off, expect);
+ }
+
+ /* test the number of tids */
+ if (tidstore_num_tids(ts) != (TEST_TIDSTORE_NUM_BLOCKS * TEST_TIDSTORE_NUM_OFFSETS))
+ elog(ERROR, "tidstore_num_tids returned " UINT64_FORMAT ", expected %d",
+ tidstore_num_tids(ts),
+ TEST_TIDSTORE_NUM_BLOCKS * TEST_TIDSTORE_NUM_OFFSETS);
+
+ /* iteration test */
+ iter = tidstore_begin_iterate(ts);
+ blk_idx = 0;
+ while ((iter_result = tidstore_iterate_next(iter)) != NULL)
+ {
+ /* check the returned block number */
+ if (blks_sorted[blk_idx] != iter_result->blkno)
+ elog(ERROR, "tidstore_iterate_next returned block number %u, expected %u",
+ iter_result->blkno, blks_sorted[blk_idx]);
+
+ /* check the returned offset numbers */
+ if (TEST_TIDSTORE_NUM_OFFSETS != iter_result->num_offsets)
+ elog(ERROR, "tidstore_iterate_next returned %u offsets, expected %u",
+ iter_result->num_offsets, TEST_TIDSTORE_NUM_OFFSETS);
+
+ for (int i = 0; i < iter_result->num_offsets; i++)
+ {
+ if (offs[i] != iter_result->offsets[i])
+ elog(ERROR, "tidstore_iterate_next returned offset number %u on block %u, expected %u",
+ iter_result->offsets[i], iter_result->blkno, offs[i]);
+ }
+
+ blk_idx++;
+ }
+
+ if (blk_idx != TEST_TIDSTORE_NUM_BLOCKS)
+ elog(ERROR, "tidstore_iterate_next returned %d blocks, expected %d",
+ blk_idx, TEST_TIDSTORE_NUM_BLOCKS);
+
+ /* remove all tids */
+ tidstore_reset(ts);
+
+ /* test the number of tids */
+ if (tidstore_num_tids(ts) != 0)
+ elog(ERROR, "tidstore_num_tids on empty store returned non-zero");
+
+ /* lookup test for empty store */
+ for (OffsetNumber off = FirstOffsetNumber ; off < MaxHeapTuplesPerPage;
+ off++)
+ {
+ check_tid(ts, 0, off, false);
+ check_tid(ts, 2, off, false);
+ check_tid(ts, MaxBlockNumber - 2, off, false);
+ check_tid(ts, MaxBlockNumber, off, false);
+ }
+
+ tidstore_destroy(ts);
+
+#ifdef TEST_SHARED_TIDSTORE
+ dsa_detach(dsa);
+#endif
+}
+
+static void
+test_empty(void)
+{
+ TidStore *ts;
+ TidStoreIter *iter;
+ ItemPointerData tid;
+
+#ifdef TEST_SHARED_TIDSTORE
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+
+ LWLockRegisterTranche(tranche_id, "test_tidstore");
+ dsa = dsa_create(tranche_id);
+
+ ts = tidstore_create(TEST_TIDSTORE_MAX_BYTES, MaxHeapTuplesPerPage, dsa);
+#else
+ ts = tidstore_create(TEST_TIDSTORE_MAX_BYTES, MaxHeapTuplesPerPage, NULL);
+#endif
+
+ elog(NOTICE, "testing empty tidstore");
+
+ ItemPointerSet(&tid, 0, FirstOffsetNumber);
+ if (tidstore_lookup_tid(ts, &tid))
+ elog(ERROR, "tidstore_lookup_tid for (0,1) on empty store returned true");
+
+ ItemPointerSet(&tid, MaxBlockNumber, MaxOffsetNumber);
+ if (tidstore_lookup_tid(ts, &tid))
+ elog(ERROR, "tidstore_lookup_tid for (%u,%u) on empty store returned true",
+ MaxBlockNumber, MaxOffsetNumber);
+
+ if (tidstore_num_tids(ts) != 0)
+ elog(ERROR, "tidstore_num_entries on empty store returned non-zero");
+
+ if (tidstore_is_full(ts))
+ elog(ERROR, "tidstore_is_full on empty store returned true");
+
+ iter = tidstore_begin_iterate(ts);
+
+ if (tidstore_iterate_next(iter) != NULL)
+ elog(ERROR, "tidstore_iterate_next on empty store returned TIDs");
+
+ tidstore_end_iterate(iter);
+
+ tidstore_destroy(ts);
+
+#ifdef TEST_SHARED_TIDSTORE
+ dsa_detach(dsa);
+#endif
+}
+
+Datum
+test_tidstore(PG_FUNCTION_ARGS)
+{
+ test_empty();
+
+ elog(NOTICE, "testing basic operations");
+ test_basic(MaxHeapTuplesPerPage);
+ test_basic(10);
+
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_tidstore/test_tidstore.control b/src/test/modules/test_tidstore/test_tidstore.control
new file mode 100644
index 0000000000..9b6bd4638f
--- /dev/null
+++ b/src/test/modules/test_tidstore/test_tidstore.control
@@ -0,0 +1,4 @@
+comment = 'Test code for tidstore'
+default_version = '1.0'
+module_pathname = '$libdir/test_tidstore'
+relocatable = true
--
2.31.1
On Mon, Feb 20, 2023 at 2:56 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Thu, Feb 16, 2023 at 6:23 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Thu, Feb 16, 2023 at 10:24 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Tue, Feb 14, 2023 at 8:24 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
I can think that something like traversing a HOT chain could visit
offsets out of order. But fortunately we prune such collected TIDs
before heap vacuum in the heap case.
Further, currently we *already* assume we populate the tid array in order (for binary search), so we can just continue assuming that (with an assert added since it's more public in this form). I'm not sure why such basic common sense evaded me a few versions ago...
Right. TidStore is implemented not only for heap, so loading
out-of-order TIDs might be important in the future.
That's what I was probably thinking about some weeks ago, but I'm having a hard time imagining how it would come up, even for something like the conveyor-belt concept.
We have the following WIP comment in test_radixtree:
// WIP: compiles with warnings because rt_attach is defined but not used
// #define RT_SHMEM
How about unsetting RT_SCOPE to suppress warnings for unused rt_attach
and friends?
Sounds good to me, and the other fixes make sense as well.
Thanks, I merged them.
FYI I've briefly tested the TidStore with blocksize = 32kb, and it
seems to work fine.
That was on my list, so great! How about the other end -- nominally we allow 512b. (In practice it won't matter, but this would make sure I didn't mess anything up when forcing all MaxTuplesPerPage to encode.)
According to the doc, the minimum block size is 1kB. It seems to work
fine with 1kB blocks.
You removed the vacuum integration patch from v27, is there any reason for that?
Just an oversight.
Now for some general comments on the tid store...
+ * TODO: The caller must be certain that no other backend will attempt to
+ * access the TidStore before calling this function. Other backend must
+ * explicitly call tidstore_detach to free up backend-local memory associated
+ * with the TidStore. The backend that calls tidstore_destroy must not call
+ * tidstore_detach.
+ */
+void
+tidstore_destroy(TidStore *ts)

Do we need to do anything for this todo?
Since it's practically no problem, I think we can live with it for
now. dshash also has the same todo.
It might help readability to have a concept of "off_upper/off_lower", just so we can describe things more clearly. The key is block + off_upper, and the value is a bitmap of all the off_lower bits. I hinted at that in my addition of encode_key_off(). Along those lines, maybe s/TIDSTORE_OFFSET_MASK/TIDSTORE_OFFSET_LOWER_MASK/. Actually, I'm not even sure the TIDSTORE_ prefix is valuable for these local macros.
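
As a concrete illustration of that split, here is a small standalone sketch (not part of the patch; it assumes the 8kB-block heap case, where 9 bits cover the offset number and the low 6 bits select a bit in the 64-bit bitmap):

#include <stdint.h>
#include <stdio.h>
#include <inttypes.h>

#define OFFSET_NBITS 9			/* pg_ceil_log2_32(MaxHeapTuplesPerPage) */
#define VALUE_NBITS  6			/* 2^6 = 64 bits per bitmap word */

int
main(void)
{
	uint32_t	block = 10;
	uint32_t	offset = 70;	/* tid (10,70) */

	uint64_t	tid_i = (uint64_t) offset | ((uint64_t) block << OFFSET_NBITS);
	uint64_t	key = tid_i >> VALUE_NBITS;		/* block + off_upper */
	uint64_t	off_bit = UINT64_C(1) << (offset & ((1U << VALUE_NBITS) - 1));
	uint64_t	blkno = key >> (OFFSET_NBITS - VALUE_NBITS);	/* decode the block back */

	/* prints key=81 off_bit=0x40 block=10; tid (10,71) maps to the same key with bit 7 */
	printf("key=%" PRIu64 " off_bit=0x%" PRIx64 " block=%" PRIu64 "\n", key, off_bit, blkno);
	return 0;
}
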
The word "value" as a variable name is pretty generic in this context, and it might be better to call it the off_lower_bitmap, at least in some places. The "key" doesn't have a good short term for naming, but in comments we should make sure we're clear it's "block# + off_upper".
I'm not a fan of the name "tid_i", even as a temp variable -- maybe "compressed_tid"?
maybe s/tid_to_key_off/encode_tid/ and s/encode_key_off/encode_block_offset/
It might be worth using typedefs for key and value type. Actually, since key type is fixed for the foreseeable future, maybe the radix tree template should define a key typedef?
The term "result" is probably fine within the tidstore, but as a public name used by vacuum, it's not very descriptive. I don't have a good idea, though.
Some files in backend/access use CamelCase for public functions, although it's not consistent. I think doing that for tidstore would help readability, since they would stand out from rt_* functions and vacuum functions. It's a matter of taste, though.
I don't understand the control flow in tidstore_iterate_next(), or when BlockNumberIsValid() is true. If this is the best way to code this, it needs more commentary.
The attached 0008 patch addressed all above comments on tidstore.
Some comments on vacuum:
I think we'd better get some real-world testing of this, fairly soon.
I had an idea: If it's not too much effort, it might be worth splitting it into two parts: one that just adds the store (not caring about its memory limits or progress reporting etc). During index scan, check both the new store and the array and log a warning (we don't want to exit or crash, better to try to investigate while live if possible) if the result doesn't match. Then perhaps set up an instance and let something like TPC-C run for a few days. The second patch would just restore the rest of the current patch. That would help reassure us it's working as designed.
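
A cross-check along those lines could look roughly like the sketch below (hypothetical names: LVDualState and vac_tid_reaped_array() stand in for the existing array-based dead-item state and its bsearch lookup, while tidstore_lookup_tid() is from the patch):

#include "postgres.h"

#include "access/tidstore.h"
#include "storage/itemptr.h"

/* hypothetical state holding both representations during the heap scan */
typedef struct LVDualState
{
	void	   *dead_items_array;	/* old array-based storage */
	TidStore   *dead_items_store;	/* new TidStore */
} LVDualState;

/* hypothetical wrapper around the existing bsearch-based lookup */
extern bool vac_tid_reaped_array(ItemPointer itemptr, void *dead_items_array);

static bool
vac_tid_reaped_checked(ItemPointer itemptr, void *state)
{
	LVDualState *dual = (LVDualState *) state;
	bool		in_array = vac_tid_reaped_array(itemptr, dual->dead_items_array);
	bool		in_store = tidstore_lookup_tid(dual->dead_items_store, itemptr);

	/* warn instead of asserting, so a live instance keeps running */
	if (in_array != in_store)
		elog(WARNING, "dead TID (%u,%u): array lookup %d, tidstore lookup %d",
			 ItemPointerGetBlockNumber(itemptr),
			 ItemPointerGetOffsetNumber(itemptr),
			 (int) in_array, (int) in_store);

	return in_store;
}
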
Yeah, I did a similar thing in an earlier version of tidstore patch.
Since we're trying to introduce two new components: radix tree and
tidstore, I sometimes find it hard to investigate failures happening
during lazy (parallel) vacuum due to a bug either in tidstore or radix
tree. If there is a bug in lazy vacuum, we cannot even do initdb. So
it might be a good idea to do such checks in USE_ASSERT_CHECKING (or
with another macro say DEBUG_TIDSTORE) builds. For example, TidStore
stores tids to both the radix tree and array, and checks if the
results match when lookup or iteration. It will use more memory but it
would not be a big problem in USE_ASSERT_CHECKING builds. It would
also be great if we can enable such checks on some bf animals.
I've tried this idea. Enabling this check on all debug builds (i.e.,
with USE_ASSERT_CHECKING macro) seems not a good idea so I use a
special macro for that, TIDSTORE_DEBUG. I think we can define this
macro on some bf animals (or possibly a new bf animal).
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachments:
v29-0011-Debug-TIDStore.patch.txt (text/plain)
From 107aa2af2966c10ce750e6b410ae570462423aab Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 22 Feb 2023 14:43:15 +0900
Subject: [PATCH v29 11/11] Debug TIDStore.
---
src/backend/access/common/tidstore.c | 242 ++++++++++++++++++++++++++-
1 file changed, 238 insertions(+), 4 deletions(-)
diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
index 9360520482..438bf0c800 100644
--- a/src/backend/access/common/tidstore.c
+++ b/src/backend/access/common/tidstore.c
@@ -28,12 +28,20 @@
#include "postgres.h"
#include "access/tidstore.h"
+#include "catalog/index.h"
#include "miscadmin.h"
#include "port/pg_bitutils.h"
#include "storage/lwlock.h"
#include "utils/dsa.h"
#include "utils/memutils.h"
+#define TIDSTORE_DEBUG
+
+/* Enable TidStore debugging only when USE_ASSERT_CHECKING */
+#if defined(TIDSTORE_DEBUG) && !defined(USE_ASSERT_CHECKING)
+#undef TIDSTORE_DEBUG
+#endif
+
/*
* For encoding purposes, a tid is represented as a pair of 64-bit key and
* 64-bit value.
@@ -115,6 +123,12 @@ typedef struct TidStoreControl
/* handles for TidStore and radix tree */
TidStoreHandle handle;
shared_rt_handle tree_handle;
+
+#ifdef TIDSTORE_DEBUG
+ dsm_handle tids_handle;
+ int64 max_tids;
+ bool tids_unordered;
+#endif
} TidStoreControl;
/* Per-backend state for a TidStore */
@@ -135,6 +149,11 @@ struct TidStore
/* DSA area for TidStore if used */
dsa_area *area;
+
+#ifdef TIDSTORE_DEBUG
+ dsm_segment *tids_seg;
+ ItemPointerData *tids;
+#endif
};
#define TidStoreIsShared(ts) ((ts)->area != NULL)
@@ -157,6 +176,11 @@ typedef struct TidStoreIter
tidkey next_tidkey;
offsetbm next_off_bitmap;
+#ifdef TIDSTORE_DEBUG
+ /* iterator index for the ts->tids array */
+ int64 tids_idx;
+#endif
+
/*
* output for the caller. Must be last because variable-size.
*/
@@ -169,6 +193,15 @@ static inline tidkey encode_blk_off(TidStore *ts, BlockNumber block,
OffsetNumber offset, offsetbm *off_bit);
static inline tidkey encode_tid(TidStore *ts, ItemPointer tid, offsetbm *off_bit);
+/* debug functions available only when TIDSTORE_DEBUG */
+#ifdef TIDSTORE_DEBUG
+static void ts_debug_set_block_offsets(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets);
+static void ts_debug_iter_check_tids(TidStoreIter *iter);
+static bool ts_debug_is_member(TidStore *ts, ItemPointer tid);
+static int itemptr_cmp(const void *left, const void *right);
+#endif
+
/*
* Create a TidStore. The returned object is allocated in backend-local memory.
* The radix tree for storage is allocated in DSA area is 'area' is non-NULL.
@@ -237,6 +270,26 @@ TidStoreCreate(size_t max_bytes, int max_off, dsa_area *area)
ts->control->upper_off_nbits =
ts->control->max_off_nbits - LOWER_OFFSET_NBITS;
+#ifdef TIDSTORE_DEBUG
+ {
+ int64 max_tids = max_bytes / sizeof(ItemPointerData);
+
+ /* Allocate the array of tids too */
+ if (TidStoreIsShared(ts))
+ {
+ ts->tids_seg = dsm_create(sizeof(ItemPointerData) * max_tids, 0);
+ ts->tids = dsm_segment_address(ts->tids_seg);
+ ts->control->tids_handle = dsm_segment_handle(ts->tids_seg);
+ ts->control->max_tids = max_tids;
+ }
+ else
+ {
+ ts->tids = palloc(sizeof(ItemPointerData) * max_tids);
+ ts->control->max_tids = max_tids;
+ }
+ }
+#endif
+
return ts;
}
@@ -266,6 +319,11 @@ TidStoreAttach(dsa_area *area, TidStoreHandle handle)
ts->tree.shared = shared_rt_attach(area, ts->control->tree_handle);
ts->area = area;
+#ifdef TIDSTORE_DEBUG
+ ts->tids_seg = dsm_attach(ts->control->tids_handle);
+ ts->tids = (ItemPointer) dsm_segment_address(ts->tids_seg);
+#endif
+
return ts;
}
@@ -280,6 +338,11 @@ TidStoreDetach(TidStore *ts)
{
Assert(TidStoreIsShared(ts) && ts->control->magic == TIDSTORE_MAGIC);
+#ifdef TIDSTORE_DEBUG
+ if (TidStoreIsShared(ts))
+ dsm_detach(ts->tids_seg);
+#endif
+
shared_rt_detach(ts->tree.shared);
pfree(ts);
}
@@ -315,6 +378,13 @@ TidStoreDestroy(TidStore *ts)
local_rt_free(ts->tree.local);
}
+#ifdef TIDSTORE_DEBUG
+ if (TidStoreIsShared(ts))
+ dsm_detach(ts->tids_seg);
+ else
+ pfree(ts->tids);
+#endif
+
pfree(ts);
}
@@ -434,6 +504,11 @@ TidStoreSetBlockOffsets(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
}
}
+#ifdef TIDSTORE_DEBUG
+ /* Insert tids into the tid array too */
+ ts_debug_set_block_offsets(ts, blkno, offsets, num_offsets);
+#endif
+
/* update statistics */
ts->control->num_tids += num_offsets;
@@ -451,6 +526,11 @@ TidStoreIsMember(TidStore *ts, ItemPointer tid)
offsetbm off_bitmap = 0;
offsetbm off_bit;
bool found;
+ bool ret;
+
+#ifdef TIDSTORE_DEBUG
+ bool ret_debug = ts_debug_is_member(ts, tid);
+#endif
key = encode_tid(ts, tid, &off_bit);
@@ -460,9 +540,20 @@ TidStoreIsMember(TidStore *ts, ItemPointer tid)
found = local_rt_search(ts->tree.local, key, &off_bitmap);
if (!found)
+ {
+#ifdef TIDSTORE_DEBUG
+ Assert(!ret_debug);
+#endif
return false;
+ }
+
+ ret = (off_bitmap & off_bit) != 0;
- return (off_bitmap & off_bit) != 0;
+#ifdef TIDSTORE_DEBUG
+ Assert(ret == ret_debug);
+#endif
+
+ return ret;
}
/*
@@ -494,6 +585,10 @@ TidStoreBeginIterate(TidStore *ts)
if (TidStoreNumTids(ts) == 0)
iter->finished = true;
+#ifdef TIDSTORE_DEBUG
+ iter->tids_idx = 0;
+#endif
+
return iter;
}
@@ -515,6 +610,7 @@ TidStoreIterResult *
TidStoreIterateNext(TidStoreIter *iter)
{
tidkey key;
+ bool iter_found;
offsetbm off_bitmap = 0;
TidStoreIterResult *output = &(iter->output);
@@ -532,7 +628,7 @@ TidStoreIterateNext(TidStoreIter *iter)
if (iter->next_off_bitmap > 0)
iter_decode_key_off(iter, iter->next_tidkey, iter->next_off_bitmap);
- while (tidstore_iter(iter, &key, &off_bitmap))
+ while ((iter_found = tidstore_iter(iter, &key, &off_bitmap)))
{
BlockNumber blkno = key_get_blkno(iter->ts, key);
@@ -545,14 +641,20 @@ TidStoreIterateNext(TidStoreIter *iter)
*/
iter->next_tidkey = key;
iter->next_off_bitmap = off_bitmap;
- return output;
+ break;
}
/* Collect tids decoded from the key and offset bitmap */
iter_decode_key_off(iter, key, off_bitmap);
}
- iter->finished = true;
+ if (!iter_found)
+ iter->finished = true;
+
+#ifdef TIDSTORE_DEBUG
+ ts_debug_iter_check_tids(iter);
+#endif
+
return output;
}
@@ -699,3 +801,135 @@ encode_blk_off(TidStore *ts, BlockNumber block, OffsetNumber offset,
return key;
}
+
+#ifdef TIDSTORE_DEBUG
+/* Comparator routines for ItemPointer */
+static int
+itemptr_cmp(const void *left, const void *right)
+{
+ BlockNumber lblk,
+ rblk;
+ OffsetNumber loff,
+ roff;
+
+ lblk = ItemPointerGetBlockNumber((ItemPointer) left);
+ rblk = ItemPointerGetBlockNumber((ItemPointer) right);
+
+ if (lblk < rblk)
+ return -1;
+ if (lblk > rblk)
+ return 1;
+
+ loff = ItemPointerGetOffsetNumber((ItemPointer) left);
+ roff = ItemPointerGetOffsetNumber((ItemPointer) right);
+
+ if (loff < roff)
+ return -1;
+ if (loff > roff)
+ return 1;
+
+ return 0;
+}
+
+/* Insert tids to the tid array for debugging */
+static void
+ts_debug_set_block_offsets(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets)
+{
+ if (ts->control->num_tids > 0 &&
+ blkno < ItemPointerGetBlockNumber(&(ts->tids[ts->control->num_tids - 1])))
+ {
+ /* The array will be sorted at ts_debug_is_member() */
+ ts->control->tids_unordered = true;
+ }
+
+ for (int i = 0; i < num_offsets; i++)
+ {
+ ItemPointer tid;
+ int idx = ts->control->num_tids + i;
+
+ /* Enlarge the tid array if necessary */
+ if (idx >= ts->control->max_tids)
+ {
+ ts->control->max_tids *= 2;
+
+ if (TidStoreIsShared(ts))
+ {
+ dsm_segment *new_seg =
+ dsm_create(sizeof(ItemPointerData) * ts->control->max_tids, 0);
+ ItemPointer new_tids = dsm_segment_address(new_seg);
+
+ /* copy tids from old to new array */
+ memcpy(new_tids, ts->tids,
+ sizeof(ItemPointerData) * (ts->control->max_tids / 2));
+
+ dsm_detach(ts->tids_seg);
+ ts->tids = new_tids;
+ }
+ else
+ ts->tids = repalloc(ts->tids,
+ sizeof(ItemPointerData) * ts->control->max_tids);
+ }
+
+ tid = &(ts->tids[idx]);
+ ItemPointerSetBlockNumber(tid, blkno);
+ ItemPointerSetOffsetNumber(tid, offsets[i]);
+ }
+}
+
+/* Return true if the given tid is present in the tid array */
+static bool
+ts_debug_is_member(TidStore *ts, ItemPointer tid)
+{
+ int64 litem,
+ ritem,
+ item;
+ ItemPointer res;
+
+ if (ts->control->num_tids == 0)
+ return false;
+
+ /* Make sure the tid array is sorted */
+ if (ts->control->tids_unordered)
+ {
+ qsort(ts->tids, ts->control->num_tids, sizeof(ItemPointerData), itemptr_cmp);
+ ts->control->tids_unordered = false;
+ }
+
+ litem = itemptr_encode(&ts->tids[0]);
+ ritem = itemptr_encode(&ts->tids[ts->control->num_tids - 1]);
+ item = itemptr_encode(tid);
+
+ /*
+ * Doing a simple bound check before bsearch() is useful to avoid the
+ * extra cost of bsearch(), especially if dead items on the heap are
+ * concentrated in a certain range. Since this function is called for
+ * every index tuple, it pays to be really fast.
+ */
+ if (item < litem || item > ritem)
+ return false;
+
+ res = bsearch(tid, ts->tids, ts->control->num_tids, sizeof(ItemPointerData),
+ itemptr_cmp);
+
+ return (res != NULL);
+}
+
+/* Verify if the iterator output matches the tids in the array for debugging */
+static void
+ts_debug_iter_check_tids(TidStoreIter *iter)
+{
+ BlockNumber blkno = iter->output.blkno;
+
+ for (int i = 0; i < iter->output.num_offsets; i++)
+ {
+ ItemPointer tid = &(iter->ts->tids[iter->tids_idx + i]);
+
+ Assert((iter->tids_idx + i) < iter->ts->control->max_tids);
+ Assert(ItemPointerGetBlockNumber(tid) == blkno);
+ Assert(ItemPointerGetOffsetNumber(tid) == iter->output.offsets[i]);
+ }
+
+ iter->tids_idx += iter->output.num_offsets;
+}
+#endif
--
2.31.1
On Wed, Feb 22, 2023 at 1:16 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
On Mon, Feb 20, 2023 at 2:56 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
Yeah, I did a similar thing in an earlier version of tidstore patch.
Okay, if you had checks against the old array lookup in development, that
gives us better confidence.
Since we're trying to introduce two new components: radix tree and
tidstore, I sometimes find it hard to investigate failures happening
during lazy (parallel) vacuum due to a bug either in tidstore or radix
tree. If there is a bug in lazy vacuum, we cannot even do initdb. So
it might be a good idea to do such checks in USE_ASSERT_CHECKING (or
with another macro say DEBUG_TIDSTORE) builds. For example, TidStore
stores tids to both the radix tree and array, and checks if the
results match when lookup or iteration. It will use more memory but it
would not be a big problem in USE_ASSERT_CHECKING builds. It would
also be great if we can enable such checks on some bf animals.
I've tried this idea. Enabling this check on all debug builds (i.e.,
with USE_ASSERT_CHECKING macro) seems not a good idea so I use a
special macro for that, TIDSTORE_DEBUG. I think we can define this
macro on some bf animals (or possibly a new bf animal).
I don't think any vacuum calls in regression tests would stress any of
this code very much, so it's not worth carrying the old way forward. I was
thinking of only doing this as a short-time sanity check for testing a
real-world workload.
--
John Naylor
EDB: http://www.enterprisedb.com
On Wed, Feb 22, 2023 at 4:35 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Wed, Feb 22, 2023 at 1:16 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Mon, Feb 20, 2023 at 2:56 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Yeah, I did a similar thing in an earlier version of tidstore patch.
Okay, if you had checks against the old array lookup in development, that gives us better confidence.
Since we're trying to introduce two new components: radix tree and
tidstore, I sometimes find it hard to investigate failures happening
during lazy (parallel) vacuum due to a bug either in tidstore or radix
tree. If there is a bug in lazy vacuum, we cannot even do initdb. So
it might be a good idea to do such checks in USE_ASSERT_CHECKING (or
with another macro say DEBUG_TIDSTORE) builds. For example, TidStore
stores tids to both the radix tree and array, and checks if the
results match when lookup or iteration. It will use more memory but it
would not be a big problem in USE_ASSERT_CHECKING builds. It would
also be great if we can enable such checks on some bf animals.I've tried this idea. Enabling this check on all debug builds (i.e.,
with USE_ASSERT_CHECKING macro) seems not a good idea so I use a
special macro for that, TIDSTORE_DEBUG. I think we can define this
macro on some bf animals (or possibly a new bf animal).
I don't think any vacuum calls in regression tests would stress any of this code very much, so it's not worth carrying the old way forward. I was thinking of only doing this as a short-time sanity check for testing a real-world workload.
I guess that it would also be helpful at least until the GA release.
People will be able to test them easily on their workloads or their
custom test scenarios.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Wed, Feb 22, 2023 at 3:29 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
On Wed, Feb 22, 2023 at 4:35 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
I don't think any vacuum calls in regression tests would stress any of
this code very much, so it's not worth carrying the old way forward. I was
thinking of only doing this as a short-time sanity check for testing a
real-world workload.
I guess that it would also be helpful at least until the GA release.
People will be able to test them easily on their workloads or their
custom test scenarios.
That doesn't seem useful to me. If we've done enough testing to reassure us
the new way always gives the same answer, the old way is not needed at
commit time. If there is any doubt it will always give the same answer,
then the whole patchset won't be committed.
TPC-C was just an example. It should have testing comparing the old and new
methods. If you have already done that to some degree, that might be
enough. After performance tests, I'll also try some vacuums that use the
comparison patch.
--
John Naylor
EDB: http://www.enterprisedb.com
I ran a couple "in situ" tests on server hardware using UUID columns, since
they are common in the real world and have bad correlation to heap
order, so are a challenge for index vacuum.
=== test 1, delete everything from a small table, with very small
maintenance_work_mem:
alter system set shared_buffers ='4GB';
alter system set max_wal_size ='10GB';
alter system set checkpoint_timeout ='30 min';
alter system set autovacuum =off;
-- unrealistically low
alter system set maintenance_work_mem = '32MB';
create table if not exists test (x uuid);
truncate table test;
insert into test (x) select gen_random_uuid() from
generate_series(1,50*1000*1000);
create index on test (x);
delete from test;
vacuum (verbose, truncate off) test;
--
master:
INFO: finished vacuuming "john.naylor.public.test": index scans: 9
system usage: CPU: user: 70.04 s, system: 19.85 s, elapsed: 802.06 s
v29 patch:
INFO: finished vacuuming "john.naylor.public.test": index scans: 1
system usage: CPU: user: 9.80 s, system: 2.62 s, elapsed: 36.68 s
This is a bit artificial, but it's easy to construct cases where the array
leads to multiple index scans but the new tid store can fit everythin
without breaking a sweat. I didn't save the progress reporting, but v29 was
using about 11MB for tid storage.
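
For reference, the 9 index scans on master follow directly from the array sizing: a 32MB dead-tuple array holds only about 5.6 million 6-byte TIDs per round, so 50 million deleted tuples force 9 rounds of index vacuuming (the real limit calculation also subtracts a small struct header, which does not change the result). A quick sketch of that arithmetic:

#include <stdio.h>
#include <stdint.h>

int
main(void)
{
	const uint64_t m_w_m = UINT64_C(32) * 1024 * 1024;	/* maintenance_work_mem = 32MB */
	const uint64_t tid_size = 6;						/* sizeof(ItemPointerData) */
	const uint64_t dead_tuples = UINT64_C(50) * 1000 * 1000;

	uint64_t	capacity = m_w_m / tid_size;			/* ~5.6 million TIDs per round */
	uint64_t	index_scans = (dead_tuples + capacity - 1) / capacity;

	/* prints "index scans: 9" */
	printf("index scans: %llu\n", (unsigned long long) index_scans);
	return 0;
}
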
=== test 2: try to stress tid lookup with production maintenance_work_mem:
1. use unlogged table to reduce noise
2. vacuum freeze first to reduce heap scan time
3. delete some records at the beginning and end of heap to defeat binary
search's pre-check
alter system set shared_buffers ='4GB';
alter system set max_wal_size ='10GB';
alter system set checkpoint_timeout ='30 min';
alter system set autovacuum =off;
alter system set maintenance_work_mem = '1GB';
create unlogged table if not exists test (x uuid);
truncate table test;
insert into test (x) select gen_random_uuid() from
generate_series(1,1000*1000*1000);
vacuum freeze test;
select pg_size_pretty(pg_table_size('test'));
pg_size_pretty
----------------
41 GB
create index on test (x);
select pg_size_pretty(pg_total_relation_size('test'));
pg_size_pretty
----------------
71 GB
select max(ctid) from test;
max
--------------
(5405405,75)
delete from test where ctid < '(100000,0)'::tid;
delete from test where ctid > '(5300000,0)'::tid;
vacuum (verbose, truncate off) test;
both:
INFO: vacuuming "john.naylor.public.test"
INFO: finished vacuuming "john.naylor.public.test": index scans: 1
index scan needed: 205406 pages from table (3.80% of total) had 38000000
dead item identifiers removed
--
master:
system usage: CPU: user: 134.32 s, system: 19.24 s, elapsed: 286.14 s
v29 patch:
system usage: CPU: user: 97.71 s, system: 45.78 s, elapsed: 573.94 s
The entire vacuum took 25% less wall clock time. Reminder that this is
without wal logging, and also unscientific because only one run.
--
I took 10 seconds of perf data while index vacuuming was going on (showing
calls > 2%):
master:
40.59% postgres postgres [.] vac_cmp_itemptr
24.97% postgres libc-2.17.so [.] bsearch
6.67% postgres postgres [.] btvacuumpage
4.61% postgres [kernel.kallsyms] [k] copy_user_enhanced_fast_string
3.48% postgres postgres [.] PageIndexMultiDelete
2.67% postgres postgres [.] vac_tid_reaped
2.03% postgres postgres [.] compactify_tuples
2.01% postgres libc-2.17.so [.] __memcpy_ssse3_back
v29 patch:
29.22% postgres postgres [.] TidStoreIsMember
9.30% postgres postgres [.] btvacuumpage
7.76% postgres postgres [.] PageIndexMultiDelete
6.31% postgres [kernel.kallsyms] [k] copy_user_enhanced_fast_string
5.60% postgres postgres [.] compactify_tuples
4.26% postgres libc-2.17.so [.] __memcpy_ssse3_back
4.12% postgres postgres [.] hash_search_with_hash_value
--
master:
psql -c "select phase, heap_blks_total, heap_blks_scanned, max_dead_tuples,
num_dead_tuples from pg_stat_progress_vacuum"
phase | heap_blks_total | heap_blks_scanned | max_dead_tuples | num_dead_tuples
-------------------+-----------------+-------------------+-----------------+-----------------
vacuuming indexes | 5405406 | 5405406 | 178956969 | 38000000
v29 patch:
psql -c "select phase, heap_blks_total, heap_blks_scanned,
max_dead_tuple_bytes, dead_tuple_bytes from pg_stat_progress_vacuum"
phase | heap_blks_total | heap_blks_scanned | max_dead_tuple_bytes | dead_tuple_bytes
-------------------+-----------------+-------------------+----------------------+------------------
vacuuming indexes | 5405406 | 5405406 | 1073670144 | 8678064
Here, the old array pessimistically needs 1GB allocated (as for any table >
~5GB), but only fills 228MB for tid lookup. The patch reports 8.7MB. Tables
that only fit, say, 30-50 tuples per page will have less extreme
differences in memory use. Same for the case where only a couple dead items
occur per page, with many uninteresting pages in between. Even so, the
allocation will be much more accurately sized in the patch, especially in
non-parallel vacuum.
There are other cases that could be tested (I mentioned some above), but
this is enough to show the improvements possible.
I still need to do some cosmetic follow-up to v29 as well as a status
report, and I will try to get back to that soon.
--
John Naylor
EDB: http://www.enterprisedb.com
On Wed, Feb 22, 2023 at 6:55 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Wed, Feb 22, 2023 at 3:29 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Wed, Feb 22, 2023 at 4:35 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
I don't think any vacuum calls in regression tests would stress any of this code very much, so it's not worth carrying the old way forward. I was thinking of only doing this as a short-time sanity check for testing a real-world workload.
I guess that it would also be helpful at least until the GA release.
People will be able to test them easily on their workloads or their
custom test scenarios.
That doesn't seem useful to me. If we've done enough testing to reassure us the new way always gives the same answer, the old way is not needed at commit time. If there is any doubt it will always give the same answer, then the whole patchset won't be committed.
True. Even if we've done enough testing, we cannot claim there is no
bug. My idea was to make bug investigation easier, but on
reflection it seems not the best fit for this purpose. Instead, it
seems better to add more of the necessary assertions. What do you think
about the attached patch? Please note that it also includes the
changes for the minimum memory requirement.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachments:
add_assertions.patch.txt (text/plain)
diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
index 9360520482..fc20e58a95 100644
--- a/src/backend/access/common/tidstore.c
+++ b/src/backend/access/common/tidstore.c
@@ -75,6 +75,14 @@ typedef uint64 offsetbm;
#define LOWER_OFFSET_NBITS 6 /* log(sizeof(offsetbm), 2) */
#define LOWER_OFFSET_MASK ((1 << LOWER_OFFSET_NBITS) - 1)
+/*
+ * The minimum amount of memory required by TidStore is 2MB, the current minimum
+ * valid value for the maintenance_work_mem GUC. This is required to allocate the
+ * DSA initial segment (1MB) and some metadata. For simplicity, this number is
+ * also applied to the local TidStore cases.
+ */
+#define TIDSTORE_MIN_MEMORY (2 * 1024 * 1024L) /* 2MB */
+
/* A magic value used to identify our TidStore. */
#define TIDSTORE_MAGIC 0x826f6a10
@@ -101,7 +109,7 @@ typedef struct TidStoreControl
/* These values are never changed after creation */
size_t max_bytes; /* the maximum bytes a TidStore can use */
- int max_off; /* the maximum offset number */
+ OffsetNumber max_off; /* the maximum offset number */
int max_off_nbits; /* the number of bits required for offset
* numbers */
int upper_off_nbits; /* the number of bits of offset numbers
@@ -174,10 +182,17 @@ static inline tidkey encode_tid(TidStore *ts, ItemPointer tid, offsetbm *off_bit
* The radix tree for storage is allocated in DSA area is 'area' is non-NULL.
*/
TidStore *
-TidStoreCreate(size_t max_bytes, int max_off, dsa_area *area)
+TidStoreCreate(size_t max_bytes, OffsetNumber max_off, dsa_area *area)
{
TidStore *ts;
+ Assert(max_off <= MaxOffsetNumber);
+
+ /* Sanity check for the max_bytes */
+ if (max_bytes < TIDSTORE_MIN_MEMORY)
+ elog(ERROR, "memory for TidStore must be at least %ld, but %zu provided",
+ TIDSTORE_MIN_MEMORY, max_bytes);
+
ts = palloc0(sizeof(TidStore));
/*
@@ -192,8 +207,8 @@ TidStoreCreate(size_t max_bytes, int max_off, dsa_area *area)
* In local TidStore cases, the radix tree uses slab allocators for each kind
* of node class. The most memory consuming case while adding Tids associated
* with one page (i.e. during TidStoreSetBlockOffsets()) is that we allocate a new
- * slab block for a new radix tree node, which is approximately 70kB. Therefore,
- * we deduct 70kB from the max_bytes.
+ * slab block for a new radix tree node, which is approximately 70kB at most.
+ * Therefore, we deduct 70kB from the max_bytes.
*
* In shared cases, DSA allocates the memory segments big enough to follow
* a geometric series that approximately doubles the total DSA size (see
@@ -378,6 +393,7 @@ TidStoreSetBlockOffsets(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
const int nkeys = UINT64CONST(1) << ts->control->upper_off_nbits;
Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+ Assert(BlockNumberIsValid(blkno));
bitmaps = palloc(sizeof(offsetbm) * nkeys);
key = prev_key = key_base;
@@ -386,6 +402,8 @@ TidStoreSetBlockOffsets(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
{
offsetbm off_bit;
+ Assert(offsets[i] <= ts->control->max_off);
+
/* encode the tid to a key and partial offset */
key = encode_blk_off(ts, blkno, offsets[i], &off_bit);
@@ -452,6 +470,8 @@ TidStoreIsMember(TidStore *ts, ItemPointer tid)
offsetbm off_bit;
bool found;
+ Assert(ItemPointerIsValid(tid));
+
key = encode_tid(ts, tid, &off_bit);
if (TidStoreIsShared(ts))
@@ -535,6 +555,7 @@ TidStoreIterateNext(TidStoreIter *iter)
while (tidstore_iter(iter, &key, &off_bitmap))
{
BlockNumber blkno = key_get_blkno(iter->ts, key);
+ Assert(BlockNumberIsValid(blkno));
if (BlockNumberIsValid(output->blkno) && output->blkno != blkno)
{
@@ -586,6 +607,7 @@ TidStoreNumTids(TidStore *ts)
num_tids = ts->control->num_tids;
LWLockRelease(&ts->control->lock);
+ Assert(num_tids >= 0);
return num_tids;
}
diff --git a/src/include/access/tidstore.h b/src/include/access/tidstore.h
index 66f0fdd482..d1cc93cbb6 100644
--- a/src/include/access/tidstore.h
+++ b/src/include/access/tidstore.h
@@ -30,7 +30,7 @@ typedef struct TidStoreIterResult
OffsetNumber offsets[FLEXIBLE_ARRAY_MEMBER];
} TidStoreIterResult;
-extern TidStore *TidStoreCreate(size_t max_bytes, int max_off, dsa_area *dsa);
+extern TidStore *TidStoreCreate(size_t max_bytes, OffsetNumber max_off, dsa_area *dsa);
extern TidStore *TidStoreAttach(dsa_area *dsa, dsa_pointer handle);
extern void TidStoreDetach(TidStore *ts);
extern void TidStoreDestroy(TidStore *ts);
On Thu, Feb 23, 2023 at 6:41 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
I ran a couple "in situ" tests on server hardware using UUID columns, since they are common in the real world and have bad correlation to heap order, so are a challenge for index vacuum.
Thank you for the test!
=== test 1, delete everything from a small table, with very small maintenance_work_mem:
alter system set shared_buffers ='4GB';
alter system set max_wal_size ='10GB';
alter system set checkpoint_timeout ='30 min';
alter system set autovacuum =off;
-- unrealistically low
alter system set maintenance_work_mem = '32MB';
create table if not exists test (x uuid);
truncate table test;
insert into test (x) select gen_random_uuid() from generate_series(1,50*1000*1000);
create index on test (x);
delete from test;
vacuum (verbose, truncate off) test;
--master:
INFO: finished vacuuming "john.naylor.public.test": index scans: 9
system usage: CPU: user: 70.04 s, system: 19.85 s, elapsed: 802.06 s

v29 patch:
INFO: finished vacuuming "john.naylor.public.test": index scans: 1
system usage: CPU: user: 9.80 s, system: 2.62 s, elapsed: 36.68 s

This is a bit artificial, but it's easy to construct cases where the array leads to multiple index scans but the new tid store can fit everything without breaking a sweat. I didn't save the progress reporting, but v29 was using about 11MB for tid storage.
Cool.
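As a rough cross-check on the "index scans: 9" figure above: with the old array at 6 bytes per TID, a 32MB maintenance_work_mem holds about 5.6 million TIDs per index-vacuum cycle, and 50 million deleted tuples therefore need about nine cycles. A standalone sketch of that arithmetic (not PostgreSQL source; the real limit calculation has some additional adjustments):

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Rough sketch of why test 1 above needs about nine index scans on
 * master: every dead TID costs 6 bytes in the old array, and
 * maintenance_work_mem = 32MB bounds how many fit per cycle.
 */
int
main(void)
{
	const uint64_t dead_tuples = UINT64_C(50) * 1000 * 1000;	/* all rows deleted */
	const uint64_t tid_size = 6;								/* sizeof(ItemPointerData) */
	const uint64_t m_w_m = UINT64_C(32) * 1024 * 1024;			/* maintenance_work_mem */

	uint64_t	tids_per_cycle = m_w_m / tid_size;				/* ~5.59 million */
	uint64_t	cycles = (dead_tuples + tids_per_cycle - 1) / tids_per_cycle;

	printf("%" PRIu64 " TIDs per cycle -> about %" PRIu64 " index scans\n",
		   tids_per_cycle, cycles);		/* prints 9 */
	return 0;
}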
=== test 2: try to stress tid lookup with production maintenance_work_mem:
1. use unlogged table to reduce noise
2. vacuum freeze first to reduce heap scan time
3. delete some records at the beginning and end of heap to defeat binary search's pre-check

alter system set shared_buffers ='4GB';
alter system set max_wal_size ='10GB';
alter system set checkpoint_timeout ='30 min';
alter system set autovacuum =off;
alter system set maintenance_work_mem = '1GB';
create unlogged table if not exists test (x uuid);
truncate table test;
insert into test (x) select gen_random_uuid() from generate_series(1,1000*1000*1000);
vacuum freeze test;
select pg_size_pretty(pg_table_size('test'));
pg_size_pretty
----------------
41 GB

create index on test (x);
select pg_size_pretty(pg_total_relation_size('test'));
pg_size_pretty
----------------
71 GB

select max(ctid) from test;
max
--------------
(5405405,75)

delete from test where ctid < '(100000,0)'::tid;
delete from test where ctid > '(5300000,0)'::tid;
vacuum (verbose, truncate off) test;
both:
INFO: vacuuming "john.naylor.public.test"
INFO: finished vacuuming "john.naylor.public.test": index scans: 1
index scan needed: 205406 pages from table (3.80% of total) had 38000000 dead item identifiers removed
--
master:
system usage: CPU: user: 134.32 s, system: 19.24 s, elapsed: 286.14 s
v29 patch:
system usage: CPU: user: 97.71 s, system: 45.78 s, elapsed: 573.94 s
In v29 vacuum took twice as long (286 s vs. 573 s)?
The entire vacuum took 25% less wall clock time. Reminder that this is without wal logging, and also unscientific because only one run.
--
I took 10 seconds of perf data while index vacuuming was going on (showing calls > 2%):

master:
40.59% postgres postgres [.] vac_cmp_itemptr
24.97% postgres libc-2.17.so [.] bsearch
6.67% postgres postgres [.] btvacuumpage
4.61% postgres [kernel.kallsyms] [k] copy_user_enhanced_fast_string
3.48% postgres postgres [.] PageIndexMultiDelete
2.67% postgres postgres [.] vac_tid_reaped
2.03% postgres postgres [.] compactify_tuples
2.01% postgres libc-2.17.so [.] __memcpy_ssse3_back

v29 patch:
29.22% postgres postgres [.] TidStoreIsMember
9.30% postgres postgres [.] btvacuumpage
7.76% postgres postgres [.] PageIndexMultiDelete
6.31% postgres [kernel.kallsyms] [k] copy_user_enhanced_fast_string
5.60% postgres postgres [.] compactify_tuples
4.26% postgres libc-2.17.so [.] __memcpy_ssse3_back
4.12% postgres postgres [.] hash_search_with_hash_value
--
master:
psql -c "select phase, heap_blks_total, heap_blks_scanned, max_dead_tuples, num_dead_tuples from pg_stat_progress_vacuum"
       phase       | heap_blks_total | heap_blks_scanned | max_dead_tuples | num_dead_tuples
-------------------+-----------------+-------------------+-----------------+-----------------
 vacuuming indexes |         5405406 |           5405406 |       178956969 |        38000000

v29 patch:
psql -c "select phase, heap_blks_total, heap_blks_scanned, max_dead_tuple_bytes, dead_tuple_bytes from pg_stat_progress_vacuum"
       phase       | heap_blks_total | heap_blks_scanned | max_dead_tuple_bytes | dead_tuple_bytes
-------------------+-----------------+-------------------+----------------------+------------------
 vacuuming indexes |         5405406 |           5405406 |           1073670144 |          8678064

Here, the old array pessimistically needs 1GB allocated (as for any table > ~5GB), but only fills 228MB for tid lookup. The patch reports 8.7MB. Tables that only fit, say, 30-50 tuples per page will have less extreme differences in memory use. Same for the case where only a couple dead items occur per page, with many uninteresting pages in between. Even so, the allocation will be much more accurately sized in the patch, especially in non-parallel vacuum.
Agreed.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Fri, Feb 24, 2023 at 3:41 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
In v29 vacuum took twice as long (286 s vs. 573 s)?
Not sure what happened there, and clearly I was looking at the wrong number
:/
I scripted the test for reproducibility and ran it three times. Also
included some variations (attached):
UUID times look comparable here, so no speedup or regression:
master:
system usage: CPU: user: 216.05 s, system: 35.81 s, elapsed: 634.22 s
system usage: CPU: user: 173.71 s, system: 31.24 s, elapsed: 599.04 s
system usage: CPU: user: 171.16 s, system: 30.21 s, elapsed: 583.21 s
v29:
system usage: CPU: user: 93.47 s, system: 40.92 s, elapsed: 594.10 s
system usage: CPU: user: 99.58 s, system: 44.73 s, elapsed: 606.80 s
system usage: CPU: user: 96.29 s, system: 42.74 s, elapsed: 600.10 s
Then, I tried sequential integers, which is a much more favorable access
pattern in general, and the new tid storage shows substantial improvement:
master:
system usage: CPU: user: 100.39 s, system: 7.79 s, elapsed: 121.57 s
system usage: CPU: user: 104.90 s, system: 8.81 s, elapsed: 124.24 s
system usage: CPU: user: 95.04 s, system: 7.55 s, elapsed: 116.44 s
v29:
system usage: CPU: user: 24.57 s, system: 8.53 s, elapsed: 61.07 s
system usage: CPU: user: 23.18 s, system: 8.25 s, elapsed: 58.99 s
system usage: CPU: user: 23.20 s, system: 8.98 s, elapsed: 66.86 s
That's fast enough that I thought an improvement would show up even with
standard WAL logging (no separate attachment, since it's a trivial change).
Seems a bit faster:
master:
system usage: CPU: user: 152.27 s, system: 11.76 s, elapsed: 216.86 s
system usage: CPU: user: 137.25 s, system: 11.07 s, elapsed: 213.62 s
system usage: CPU: user: 149.48 s, system: 12.15 s, elapsed: 220.96 s
v29:
system usage: CPU: user: 40.88 s, system: 15.99 s, elapsed: 170.98 s
system usage: CPU: user: 41.33 s, system: 15.45 s, elapsed: 166.75 s
system usage: CPU: user: 41.51 s, system: 18.20 s, elapsed: 203.94 s
There is more we could test here, but I feel better about these numbers.
In the next few days, I'll resume style review and list the remaining
issues we need to address.
--
John Naylor
EDB: http://www.enterprisedb.com
On Fri, Feb 24, 2023 at 12:50 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
On Wed, Feb 22, 2023 at 6:55 PM John Naylor
<john.naylor@enterprisedb.com> wrote:That doesn't seem useful to me. If we've done enough testing to
reassure us the new way always gives the same answer, the old way is not
needed at commit time. If there is any doubt it will always give the same
answer, then the whole patchset won't be committed.
My idea is to make the bug investigation easier but on
reflection, it seems not the best idea given this purpose.
My concern with TIDSTORE_DEBUG is that it adds new code that mimics the old
tid array. As I've said, that doesn't seem like a good thing to carry
forward forevermore, in any form. Plus, comparing new code with new code is
not the same thing as comparing existing code with new code. That was my
idea upthread.
Maybe the effort my idea requires is too much vs. the likelihood of finding
a problem. In any case, it's clear that if I want that level of paranoia,
I'm going to have to do it myself.
What do you think
about the attached patch? Please note that it also includes the
changes for minimum memory requirement.
Most of the asserts look logical, or at least harmless.
- int max_off; /* the maximum offset number */
+ OffsetNumber max_off; /* the maximum offset number */
I agree with using the specific type for offsets here, but I'm not sure why
this change belongs in this patch. If we decided against the new asserts,
this would be easy to lose.
This change, however, defies common sense:
+/*
+ * The minimum amount of memory required by TidStore is 2MB, the current
minimum
+ * valid value for the maintenance_work_mem GUC. This is required to
allocate the
+ * DSA initial segment, 1MB, and some meta data. This number is applied
also to
+ * the local TidStore cases for simplicity.
+ */
+#define TIDSTORE_MIN_MEMORY (2 * 1024 * 1024L) /* 2MB */
+ /* Sanity check for the max_bytes */
+ if (max_bytes < TIDSTORE_MIN_MEMORY)
+ elog(ERROR, "memory for TidStore must be at least %ld, but %zu provided",
+ TIDSTORE_MIN_MEMORY, max_bytes);
Aside from the fact that this elog's something that would never get past
development, the #define just adds a hard-coded copy of something that is
already hard-coded somewhere else, whose size depends on an implementation
detail in a third place.
This also assumes that all users of tid store are limited by
maintenance_work_mem. Andres thought of an example of some day unifying
with tidbitmap.c, and maybe other applications will be limited by work_mem.
But now that I'm looking at the guc tables, I am reminded that work_mem's
minimum is 64kB, so this highlights a design problem: There is obviously no
requirement that the minimum work_mem has to be >= a single DSA segment,
even though operations like parallel hash and parallel bitmap heap scan are
limited by work_mem. It would be nice to find out what happens with these
parallel features when work_mem is tiny (maybe parallelism is not even
considered?).
--
John Naylor
EDB: http://www.enterprisedb.com
On Tue, Feb 28, 2023 at 3:42 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Fri, Feb 24, 2023 at 12:50 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Wed, Feb 22, 2023 at 6:55 PM John Naylor
<john.naylor@enterprisedb.com> wrote:That doesn't seem useful to me. If we've done enough testing to reassure us the new way always gives the same answer, the old way is not needed at commit time. If there is any doubt it will always give the same answer, then the whole patchset won't be committed.
My idea is to make the bug investigation easier but on
reflection, it seems not the best idea given this purpose.My concern with TIDSTORE_DEBUG is that it adds new code that mimics the old tid array. As I've said, that doesn't seem like a good thing to carry forward forevermore, in any form. Plus, comparing new code with new code is not the same thing as comparing existing code with new code. That was my idea upthread.
Maybe the effort my idea requires is too much vs. the likelihood of finding a problem. In any case, it's clear that if I want that level of paranoia, I'm going to have to do it myself.
What do you think
about the attached patch? Please note that it also includes the
changes for minimum memory requirement.Most of the asserts look logical, or at least harmless.
- int max_off; /* the maximum offset number */ + OffsetNumber max_off; /* the maximum offset number */I agree with using the specific type for offsets here, but I'm not sure why this change belongs in this patch. If we decided against the new asserts, this would be easy to lose.
Right. I'll separate this change as a separate patch.
This change, however, defies common sense:
+/* + * The minimum amount of memory required by TidStore is 2MB, the current minimum + * valid value for the maintenance_work_mem GUC. This is required to allocate the + * DSA initial segment, 1MB, and some meta data. This number is applied also to + * the local TidStore cases for simplicity. + */ +#define TIDSTORE_MIN_MEMORY (2 * 1024 * 1024L) /* 2MB */+ /* Sanity check for the max_bytes */ + if (max_bytes < TIDSTORE_MIN_MEMORY) + elog(ERROR, "memory for TidStore must be at least %ld, but %zu provided", + TIDSTORE_MIN_MEMORY, max_bytes);Aside from the fact that this elog's something that would never get past development, the #define just adds a hard-coded copy of something that is already hard-coded somewhere else, whose size depends on an implementation detail in a third place.
This also assumes that all users of tid store are limited by maintenance_work_mem. Andres thought of an example of some day unifying with tidbitmap.c, and maybe other applications will be limited by work_mem.
But now that I'm looking at the guc tables, I am reminded that work_mem's minimum is 64kB, so this highlights a design problem: There is obviously no requirement that the minimum work_mem has to be >= a single DSA segment, even though operations like parallel hash and parallel bitmap heap scan are limited by work_mem.
Right.
It would be nice to find out what happens with these parallel features when work_mem is tiny (maybe parallelism is not even considered?).
IIUC both don't care about the allocated DSA segment size. Parallel
hash accounts actual tuple (+ header) size as used memory but doesn't
consider how much DSA segment is allocated behind. Both parallel hash
and parallel bitmap scan can work even with work_mem = 64kB, but when
checking the total DSA segment size allocated during these operations,
it was 1MB.
I realized that there is a similar memory limit design issue also on
the non-shared tidstore cases. We deduct 70kB from max_bytes but it
won't work fine with work_mem = 64kB. Probably we need to reconsider
it. FYI 70kB comes from the maximum slab block size for node256.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Tue, Feb 28, 2023 at 10:20 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Tue, Feb 28, 2023 at 3:42 PM John Naylor
<john.naylor@enterprisedb.com> wrote:On Fri, Feb 24, 2023 at 12:50 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Wed, Feb 22, 2023 at 6:55 PM John Naylor
<john.naylor@enterprisedb.com> wrote:That doesn't seem useful to me. If we've done enough testing to reassure us the new way always gives the same answer, the old way is not needed at commit time. If there is any doubt it will always give the same answer, then the whole patchset won't be committed.
My idea is to make the bug investigation easier but on
reflection, it seems not the best idea given this purpose.My concern with TIDSTORE_DEBUG is that it adds new code that mimics the old tid array. As I've said, that doesn't seem like a good thing to carry forward forevermore, in any form. Plus, comparing new code with new code is not the same thing as comparing existing code with new code. That was my idea upthread.
Maybe the effort my idea requires is too much vs. the likelihood of finding a problem. In any case, it's clear that if I want that level of paranoia, I'm going to have to do it myself.
What do you think
about the attached patch? Please note that it also includes the
changes for minimum memory requirement.Most of the asserts look logical, or at least harmless.
- int max_off; /* the maximum offset number */ + OffsetNumber max_off; /* the maximum offset number */I agree with using the specific type for offsets here, but I'm not sure why this change belongs in this patch. If we decided against the new asserts, this would be easy to lose.
Right. I'll separate this change as a separate patch.
This change, however, defies common sense:
+/* + * The minimum amount of memory required by TidStore is 2MB, the current minimum + * valid value for the maintenance_work_mem GUC. This is required to allocate the + * DSA initial segment, 1MB, and some meta data. This number is applied also to + * the local TidStore cases for simplicity. + */ +#define TIDSTORE_MIN_MEMORY (2 * 1024 * 1024L) /* 2MB */+ /* Sanity check for the max_bytes */ + if (max_bytes < TIDSTORE_MIN_MEMORY) + elog(ERROR, "memory for TidStore must be at least %ld, but %zu provided", + TIDSTORE_MIN_MEMORY, max_bytes);Aside from the fact that this elog's something that would never get past development, the #define just adds a hard-coded copy of something that is already hard-coded somewhere else, whose size depends on an implementation detail in a third place.
This also assumes that all users of tid store are limited by maintenance_work_mem. Andres thought of an example of some day unifying with tidbitmap.c, and maybe other applications will be limited by work_mem.
But now that I'm looking at the guc tables, I am reminded that work_mem's minimum is 64kB, so this highlights a design problem: There is obviously no requirement that the minimum work_mem has to be >= a single DSA segment, even though operations like parallel hash and parallel bitmap heap scan are limited by work_mem.
Right.
It would be nice to find out what happens with these parallel features when work_mem is tiny (maybe parallelism is not even considered?).
IIUC both don't care about the allocated DSA segment size. Parallel
hash accounts actual tuple (+ header) size as used memory but doesn't
consider how much DSA segment is allocated behind. Both parallel hash
and parallel bitmap scan can work even with work_mem = 64kB, but when
checking the total DSA segment size allocated during these operations,
it was 1MB.I realized that there is a similar memory limit design issue also on
the non-shared tidstore cases. We deduct 70kB from max_bytes but it
won't work fine with work_mem = 64kB. Probably we need to reconsider
it. FYI 70kB comes from the maximum slab block size for node256.
Currently, we calculate the slab block size enough to allocate 32
chunks from there. For node256, the leaf node is 2,088 bytes and the
slab block size is 66,816 bytes. One idea to fix this issue is to
decrease it. For example, with 16 chunks the slab block size is 33,408
bytes and with 8 chunks it's 16,704 bytes. I ran a brief benchmark
test with 70kB block size and 16kB block size:
* 70kB slab blocks:
select * from bench_search_random_nodes(20 * 1000 * 1000, '0xFFFFFF');
height = 2, n3 = 0, n15 = 0, n32 = 0, n125 = 0, n256 = 65793
mem_allocated | load_ms | search_ms
---------------+---------+-----------
143085184 | 1216 | 750
(1 row)
* 16kB slab blocks:
select * from bench_search_random_nodes(20 * 1000 * 1000, '0xFFFFFF');
height = 2, n3 = 0, n15 = 0, n32 = 0, n125 = 0, n256 = 65793
mem_allocated | load_ms | search_ms
---------------+---------+-----------
157601248 | 1220 | 786
(1 row)
There is a bit of a performance difference, but a smaller slab block
size seems acceptable if there is no better way.
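To make the block-size numbers above concrete, here is a standalone sketch of the arithmetic (the 2,088-byte node256 leaf size and the chunks-per-block counts are the figures quoted above; this is not the radix tree code itself):

#include <stddef.h>
#include <stdio.h>

/*
 * Standalone sketch of the slab block-size arithmetic above: the slab
 * block is sized to hold a fixed number of chunks, so shrinking the
 * chunk count shrinks the largest block the allocator will request.
 * The 2,088-byte figure is the node256 leaf size quoted in the thread.
 */
int
main(void)
{
	const size_t node256_leaf_size = 2088;
	const int	chunks_per_block[] = {32, 16, 8};

	for (int i = 0; i < 3; i++)
		printf("%2d chunks per block -> %zu-byte slab blocks\n",
			   chunks_per_block[i],
			   node256_leaf_size * (size_t) chunks_per_block[i]);

	/* prints 66816, 33408 and 16704 bytes, matching the figures above */
	return 0;
}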
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Tue, Feb 28, 2023 at 10:09 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
On Tue, Feb 28, 2023 at 10:20 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
On Tue, Feb 28, 2023 at 3:42 PM John Naylor
<john.naylor@enterprisedb.com> wrote:On Fri, Feb 24, 2023 at 12:50 PM Masahiko Sawada <
sawada.mshk@gmail.com> wrote:
On Wed, Feb 22, 2023 at 6:55 PM John Naylor
<john.naylor@enterprisedb.com> wrote:That doesn't seem useful to me. If we've done enough testing to
reassure us the new way always gives the same answer, the old way is not
needed at commit time. If there is any doubt it will always give the same
answer, then the whole patchset won't be committed.
My idea is to make the bug investigation easier but on
reflection, it seems not the best idea given this purpose.My concern with TIDSTORE_DEBUG is that it adds new code that mimics
the old tid array. As I've said, that doesn't seem like a good thing to
carry forward forevermore, in any form. Plus, comparing new code with new
code is not the same thing as comparing existing code with new code. That
was my idea upthread.
Maybe the effort my idea requires is too much vs. the likelihood of
finding a problem. In any case, it's clear that if I want that level of
paranoia, I'm going to have to do it myself.
What do you think
about the attached patch? Please note that it also includes the
changes for minimum memory requirement.Most of the asserts look logical, or at least harmless.
- int max_off; /* the maximum offset number */ + OffsetNumber max_off; /* the maximum offset number */I agree with using the specific type for offsets here, but I'm not
sure why this change belongs in this patch. If we decided against the new
asserts, this would be easy to lose.
Right. I'll separate this change as a separate patch.
This change, however, defies common sense:
+/* + * The minimum amount of memory required by TidStore is 2MB, the
current minimum
+ * valid value for the maintenance_work_mem GUC. This is required to
allocate the
+ * DSA initial segment, 1MB, and some meta data. This number is
applied also to
+ * the local TidStore cases for simplicity. + */ +#define TIDSTORE_MIN_MEMORY (2 * 1024 * 1024L) /* 2MB */+ /* Sanity check for the max_bytes */ + if (max_bytes < TIDSTORE_MIN_MEMORY) + elog(ERROR, "memory for TidStore must be at least %ld, but %zu
provided",
+ TIDSTORE_MIN_MEMORY, max_bytes);
Aside from the fact that this elog's something that would never get
past development, the #define just adds a hard-coded copy of something that
is already hard-coded somewhere else, whose size depends on an
implementation detail in a third place.
This also assumes that all users of tid store are limited by
maintenance_work_mem. Andres thought of an example of some day unifying
with tidbitmap.c, and maybe other applications will be limited by work_mem.
But now that I'm looking at the guc tables, I am reminded that
work_mem's minimum is 64kB, so this highlights a design problem: There is
obviously no requirement that the minimum work_mem has to be >= a single
DSA segment, even though operations like parallel hash and parallel bitmap
heap scan are limited by work_mem.
Right.
It would be nice to find out what happens with these parallel
features when work_mem is tiny (maybe parallelism is not even considered?).
IIUC both don't care about the allocated DSA segment size. Parallel
hash accounts actual tuple (+ header) size as used memory but doesn't
consider how much DSA segment is allocated behind. Both parallel hash
and parallel bitmap scan can work even with work_mem = 64kB, but when
checking the total DSA segment size allocated during these operations,
it was 1MB.I realized that there is a similar memory limit design issue also on
the non-shared tidstore cases. We deduct 70kB from max_bytes but it
won't work fine with work_mem = 64kB. Probably we need to reconsider
it. FYI 70kB comes from the maximum slab block size for node256.Currently, we calculate the slab block size enough to allocate 32
chunks from there. For node256, the leaf node is 2,088 bytes and the
slab block size is 66,816 bytes. One idea to fix this issue to
decrease it.
I think we're trying to solve the wrong problem here. I need to study this
more, but it seems that code that needs to stay within a memory limit only
needs to track what's been allocated in chunks within a block, since
writing there is what invokes a page fault. If we're not keeping track of
each and every chunk space, for speed, it doesn't follow that we need to
keep every block allocation within the configured limit. I'm guessing we
can just ask the context if the block space has gone *over* the limit, and
we can assume that the last allocation we perform will only fault one
additional page. We need to have a clear answer on this before doing
anything else.
If that's correct, and I'm not positive yet, we can get rid of all the
fragile assumptions about things the tid store has no business knowing
about, as well as the guc change. I'm not sure how this affects progress
reporting, because it would be nice if it didn't report dead_tuple_bytes
bigger than max_dead_tuple_bytes.
--
John Naylor
EDB: http://www.enterprisedb.com
On Wed, Mar 1, 2023 at 3:37 PM John Naylor <john.naylor@enterprisedb.com> wrote:
On Tue, Feb 28, 2023 at 10:09 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Tue, Feb 28, 2023 at 10:20 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Tue, Feb 28, 2023 at 3:42 PM John Naylor
<john.naylor@enterprisedb.com> wrote:On Fri, Feb 24, 2023 at 12:50 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Wed, Feb 22, 2023 at 6:55 PM John Naylor
<john.naylor@enterprisedb.com> wrote:That doesn't seem useful to me. If we've done enough testing to reassure us the new way always gives the same answer, the old way is not needed at commit time. If there is any doubt it will always give the same answer, then the whole patchset won't be committed.
My idea is to make the bug investigation easier but on
reflection, it seems not the best idea given this purpose.My concern with TIDSTORE_DEBUG is that it adds new code that mimics the old tid array. As I've said, that doesn't seem like a good thing to carry forward forevermore, in any form. Plus, comparing new code with new code is not the same thing as comparing existing code with new code. That was my idea upthread.
Maybe the effort my idea requires is too much vs. the likelihood of finding a problem. In any case, it's clear that if I want that level of paranoia, I'm going to have to do it myself.
What do you think
about the attached patch? Please note that it also includes the
changes for minimum memory requirement.Most of the asserts look logical, or at least harmless.
- int max_off; /* the maximum offset number */ + OffsetNumber max_off; /* the maximum offset number */I agree with using the specific type for offsets here, but I'm not sure why this change belongs in this patch. If we decided against the new asserts, this would be easy to lose.
Right. I'll separate this change as a separate patch.
This change, however, defies common sense:
+/* + * The minimum amount of memory required by TidStore is 2MB, the current minimum + * valid value for the maintenance_work_mem GUC. This is required to allocate the + * DSA initial segment, 1MB, and some meta data. This number is applied also to + * the local TidStore cases for simplicity. + */ +#define TIDSTORE_MIN_MEMORY (2 * 1024 * 1024L) /* 2MB */+ /* Sanity check for the max_bytes */ + if (max_bytes < TIDSTORE_MIN_MEMORY) + elog(ERROR, "memory for TidStore must be at least %ld, but %zu provided", + TIDSTORE_MIN_MEMORY, max_bytes);Aside from the fact that this elog's something that would never get past development, the #define just adds a hard-coded copy of something that is already hard-coded somewhere else, whose size depends on an implementation detail in a third place.
This also assumes that all users of tid store are limited by maintenance_work_mem. Andres thought of an example of some day unifying with tidbitmap.c, and maybe other applications will be limited by work_mem.
But now that I'm looking at the guc tables, I am reminded that work_mem's minimum is 64kB, so this highlights a design problem: There is obviously no requirement that the minimum work_mem has to be >= a single DSA segment, even though operations like parallel hash and parallel bitmap heap scan are limited by work_mem.
Right.
It would be nice to find out what happens with these parallel features when work_mem is tiny (maybe parallelism is not even considered?).
IIUC both don't care about the allocated DSA segment size. Parallel
hash accounts actual tuple (+ header) size as used memory but doesn't
consider how much DSA segment is allocated behind. Both parallel hash
and parallel bitmap scan can work even with work_mem = 64kB, but when
checking the total DSA segment size allocated during these operations,
it was 1MB.I realized that there is a similar memory limit design issue also on
the non-shared tidstore cases. We deduct 70kB from max_bytes but it
won't work fine with work_mem = 64kB. Probably we need to reconsider
it. FYI 70kB comes from the maximum slab block size for node256.Currently, we calculate the slab block size enough to allocate 32
chunks from there. For node256, the leaf node is 2,088 bytes and the
slab block size is 66,816 bytes. One idea to fix this issue to
decrease it.I think we're trying to solve the wrong problem here. I need to study this more, but it seems that code that needs to stay within a memory limit only needs to track what's been allocated in chunks within a block, since writing there is what invokes a page fault.
Right. I guess we've discussed what we use for calculating the *used*
memory amount but I don't remember.
I think I was confused by the fact that we use some different
approaches to calculate the amount of used memory. Parallel hash and
tidbitmap use the allocated chunk size whereas hash_agg_check_limits()
in nodeAgg.c uses MemoryContextMemAllocated(), which uses the
allocated block size.
If we're not keeping track of each and every chunk space, for speed, it doesn't follow that we need to keep every block allocation within the configured limit. I'm guessing we can just ask the context if the block space has gone *over* the limit, and we can assume that the last allocation we perform will only fault one additional page. We need to have a clear answer on this before doing anything else.
If that's correct, and I'm not positive yet, we can get rid of all the fragile assumptions about things the tid store has no business knowing about, as well as the guc change.
True.
I'm not sure how this affects progress reporting, because it would be nice if it didn't report dead_tuple_bytes bigger than max_dead_tuple_bytes.
Yes, the progress reporting could be confusing. Particularly, in
shared tidstore cases, the dead_tuple_bytes could be much bigger than
max_dead_tuple_bytes. Probably what we need are functions for
MemoryContext and dsa_area to get the amount of memory that has been
allocated, without tracking every chunk space. For example, the
functions would be like what SlabStats() does: iterate over every
block and calculate the total/free memory usage.
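To illustrate how far the two accounting styles can diverge, here is a toy comparison of chunk-level accounting versus block-level accounting for a slab-like allocator, using the node256 figures quoted earlier in the thread (2,088-byte chunks, 32 chunks per block). It is an illustration only, not the slab allocator itself:

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Toy comparison of the two accounting styles discussed in this thread:
 * counting the chunks actually handed out versus counting whole
 * allocator blocks (what MemoryContextMemAllocated() reflects).
 */
int
main(void)
{
	const uint64_t chunk_size = 2088;
	const uint64_t chunks_per_block = 32;
	const uint64_t block_size = chunk_size * chunks_per_block;	/* 66,816 bytes */

	for (uint64_t nchunks = 1; nchunks <= 64; nchunks *= 2)
	{
		uint64_t	nblocks = (nchunks + chunks_per_block - 1) / chunks_per_block;

		printf("%3" PRIu64 " chunks in use: chunk-level %7" PRIu64 " bytes, block-level %7" PRIu64 " bytes\n",
			   nchunks, nchunks * chunk_size, nblocks * block_size);
	}

	/* A single chunk in use already reports 66,816 bytes at the block level. */
	return 0;
}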
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Wed, Mar 1, 2023 at 6:59 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
On Wed, Mar 1, 2023 at 3:37 PM John Naylor <john.naylor@enterprisedb.com>
wrote:
I think we're trying to solve the wrong problem here. I need to study
this more, but it seems that code that needs to stay within a memory limit
only needs to track what's been allocated in chunks within a block, since
writing there is what invokes a page fault.
Right. I guess we've discussed what we use for calculating the *used*
memory amount but I don't remember.I think I was confused by the fact that we use some different
approaches to calculate the amount of used memory. Parallel hash and
tidbitmap use the allocated chunk size whereas hash_agg_check_limits()
in nodeAgg.c uses MemoryContextMemAllocated(), which uses the
allocated block size.
That's good to know. The latter says:
* After adding a new group to the hash table, check whether we need to
enter
* spill mode. Allocations may happen without adding new groups (for
instance,
* if the transition state size grows), so this check is imperfect.
I'm willing to claim that vacuum can be imperfect also, given the tid
store's properties: 1) on average much more efficient in used space, and 2)
no longer bound by the 1GB limit.
I'm not sure how this affects progress reporting, because it would be
nice if it didn't report dead_tuple_bytes bigger than max_dead_tuple_bytes.
Yes, the progress reporting could be confusable. Particularly, in
shared tidstore cases, the dead_tuple_bytes could be much bigger than
max_dead_tuple_bytes. Probably what we need might be functions for
MemoryContext and dsa_area to get the amount of memory that has been
allocated, by not tracking every chunk space. For example, the
functions would be like what SlabStats() does; iterate over every
block and calculates the total/free memory usage.
I'm not sure we need to invent new infrastructure for this. Looking at v29
in vacuumlazy.c, the order of operations for memory accounting is:
First, get the block-level space -- stop and vacuum indexes if we exceed
the limit:
/*
* Consider if we definitely have enough space to process TIDs on page
* already. If we are close to overrunning the available space for
* dead_items TIDs, pause and do a cycle of vacuuming before we tackle
* this page.
*/
if (TidStoreIsFull(vacrel->dead_items)) --> which is basically "if
(TidStoreMemoryUsage(ts) > ts->control->max_bytes)"
Then, after pruning the current page, store the tids and then get the
block-level space again:
else if (prunestate.num_offsets > 0)
{
/* Save details of the LP_DEAD items from the page in dead_items */
TidStoreSetBlockOffsets(...);
pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
TidStoreMemoryUsage(dead_items));
}
Since the block-level measurement is likely overestimating quite a bit, I
propose to simply reverse the order of the actions here, effectively
reporting progress for the *last page* and not the current one: First
update progress with the current memory usage, then add tids for this page.
If this allocated a new block, only a small bit of that will be written to.
If this block pushes it over the limit, we will detect that up at the top
of the loop. It's kind of like our earlier attempts at a "fudge factor",
but simpler and less brittle. And, as far as OS pages we have actually
written to, I think it'll effectively respect the memory limit, at least in
the local mem case. And the numbers will make sense.
Thoughts?
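Spelled out as a self-contained mock-up, the proposed ordering would behave roughly like this. The TidStore functions below are simplified stand-ins (argument lists trimmed, numbers made up) for the interfaces quoted above, not the actual patch:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Mock-up of the reordered per-page bookkeeping proposed above. */
static uint64_t store_bytes;			/* block-level usage of the store */
static uint64_t peak_reported;			/* highest figure ever reported */
static const uint64_t max_bytes = 4096;	/* stand-in memory limit */

static bool TidStoreIsFull(void) { return store_bytes > max_bytes; }
static uint64_t TidStoreMemoryUsage(void) { return store_bytes; }
static void TidStoreSetBlockOffsets(void) { store_bytes += 512; }	/* pretend growth */
static void vacuum_indexes(void) { store_bytes = 0; }

static void
report_progress(uint64_t n)
{
	if (n > peak_reported)
		peak_reported = n;
}

int
main(void)
{
	for (int blkno = 0; blkno < 100; blkno++)
	{
		/* First check the usage left over from the pages already stored. */
		if (TidStoreIsFull())
			vacuum_indexes();

		/* ... prune the page, collect its dead offsets ... */

		/* Report usage as it stood before this page, then store its TIDs. */
		report_progress(TidStoreMemoryUsage());
		TidStoreSetBlockOffsets();
	}

	printf("peak reported usage %llu bytes, limit %llu bytes\n",
		   (unsigned long long) peak_reported,
		   (unsigned long long) max_bytes);
	return 0;
}

With this ordering, the reported figure tops out at the configured limit, because usage is sampled before the current page's TIDs are added and any overshoot is caught at the top of the next iteration.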
But now that I'm looking more closely at the details of memory accounting,
I don't like that TidStoreMemoryUsage() is called twice per page pruned
(see above). Maybe it wouldn't noticeably slow things down, but it's a bit
sloppy. It seems like we should call it once per loop and save the result
somewhere. If that's the right way to go, that possibly indicates that
TidStoreIsFull() is not a useful interface, at least in this form.
--
John Naylor
EDB: http://www.enterprisedb.com
On Fri, Mar 3, 2023 at 8:04 PM John Naylor <john.naylor@enterprisedb.com> wrote:
On Wed, Mar 1, 2023 at 6:59 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Wed, Mar 1, 2023 at 3:37 PM John Naylor <john.naylor@enterprisedb.com> wrote:
I think we're trying to solve the wrong problem here. I need to study this more, but it seems that code that needs to stay within a memory limit only needs to track what's been allocated in chunks within a block, since writing there is what invokes a page fault.
Right. I guess we've discussed what we use for calculating the *used*
memory amount but I don't remember.I think I was confused by the fact that we use some different
approaches to calculate the amount of used memory. Parallel hash and
tidbitmap use the allocated chunk size whereas hash_agg_check_limits()
in nodeAgg.c uses MemoryContextMemAllocated(), which uses the
allocated block size.That's good to know. The latter says:
* After adding a new group to the hash table, check whether we need to enter
* spill mode. Allocations may happen without adding new groups (for instance,
* if the transition state size grows), so this check is imperfect.I'm willing to claim that vacuum can be imperfect also, given the tid store's properties: 1) on average much more efficient in used space, and 2) no longer bound by the 1GB limit.
I'm not sure how this affects progress reporting, because it would be nice if it didn't report dead_tuple_bytes bigger than max_dead_tuple_bytes.
Yes, the progress reporting could be confusable. Particularly, in
shared tidstore cases, the dead_tuple_bytes could be much bigger than
max_dead_tuple_bytes. Probably what we need might be functions for
MemoryContext and dsa_area to get the amount of memory that has been
allocated, by not tracking every chunk space. For example, the
functions would be like what SlabStats() does; iterate over every
block and calculates the total/free memory usage.I'm not sure we need to invent new infrastructure for this. Looking at v29 in vacuumlazy.c, the order of operations for memory accounting is:
First, get the block-level space -- stop and vacuum indexes if we exceed the limit:
/*
* Consider if we definitely have enough space to process TIDs on page
* already. If we are close to overrunning the available space for
* dead_items TIDs, pause and do a cycle of vacuuming before we tackle
* this page.
*/
if (TidStoreIsFull(vacrel->dead_items)) --> which is basically "if (TidStoreMemoryUsage(ts) > ts->control->max_bytes)"Then, after pruning the current page, store the tids and then get the block-level space again:
else if (prunestate.num_offsets > 0)
{
/* Save details of the LP_DEAD items from the page in dead_items */
TidStoreSetBlockOffsets(...);pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
TidStoreMemoryUsage(dead_items));
}Since the block-level measurement is likely overestimating quite a bit, I propose to simply reverse the order of the actions here, effectively reporting progress for the *last page* and not the current one: First update progress with the current memory usage, then add tids for this page. If this allocated a new block, only a small bit of that will be written to. If this block pushes it over the limit, we will detect that up at the top of the loop. It's kind of like our earlier attempts at a "fudge factor", but simpler and less brittle. And, as far as OS pages we have actually written to, I think it'll effectively respect the memory limit, at least in the local mem case. And the numbers will make sense.
Thoughts?
It looks to work but it still doesn't work in a case where a shared
tidstore is created with a 64kB memory limit, right?
TidStoreMemoryUsage() returns 1MB and TidStoreIsFull() returns true
from the beginning.
BTW I realized that since the caller can pass dsa_area to tidstore
(and radix tree), if other data are allocated in the same DSA area,
TidStoreMemoryUsage() (and RT_MEMORY_USAGE()) returns the memory usage
that includes not only itself but also other data. Probably it's
better to comment that the passed dsa_area should be dedicated to a
tidstore (or a radix tree).
But now that I'm looking more closely at the details of memory accounting, I don't like that TidStoreMemoryUsage() is called twice per page pruned (see above). Maybe it wouldn't noticeably slow things down, but it's a bit sloppy. It seems like we should call it once per loop and save the result somewhere. If that's the right way to go, that possibly indicates that TidStoreIsFull() is not a useful interface, at least in this form.
Agreed.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Mon, Mar 6, 2023 at 1:28 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
Since the block-level measurement is likely overestimating quite a bit,
I propose to simply reverse the order of the actions here, effectively
reporting progress for the *last page* and not the current one: First
update progress with the current memory usage, then add tids for this page.
If this allocated a new block, only a small bit of that will be written to.
If this block pushes it over the limit, we will detect that up at the top
of the loop. It's kind of like our earlier attempts at a "fudge factor",
but simpler and less brittle. And, as far as OS pages we have actually
written to, I think it'll effectively respect the memory limit, at least in
the local mem case. And the numbers will make sense.
Thoughts?
It looks to work but it still doesn't work in a case where a shared
tidstore is created with a 64kB memory limit, right?
TidStoreMemoryUsage() returns 1MB and TidStoreIsFull() returns true
from the beginning.
I have two ideas:
1. Make it optional to track chunk memory space by a template parameter. It
might be tiny compared to everything else that vacuum does. That would
allow other users to avoid that overhead.
2. When context block usage exceeds the limit (rare), make the additional
effort to get the precise usage -- I'm not sure such a top-down facility
exists, and I'm not feeling well enough today to study this further.
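For idea (1), a minimal self-contained sketch of the opt-in pattern under discussion; the macro and function names here are illustrative only, not the eventual radix tree template parameter:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Minimal sketch of idea (1): gate the memory accounting behind an
 * opt-in compile-time flag, in the spirit of PostgreSQL's template
 * headers.  Build with -DMEASURE_MEMORY_USAGE to get the bookkeeping;
 * without it, the counter and its accessor do not exist at all.
 */
#ifdef MEASURE_MEMORY_USAGE
static uint64_t mem_used;

static uint64_t
memory_usage(void)
{
	return mem_used;
}
#endif

static void *
tracked_alloc(size_t size)
{
#ifdef MEASURE_MEMORY_USAGE
	mem_used += size;			/* the only cost of opting in */
#endif
	return malloc(size);
}

int
main(void)
{
	for (int i = 0; i < 1000; i++)
		free(tracked_alloc(64));

#ifdef MEASURE_MEMORY_USAGE
	printf("allocated %llu bytes in total\n", (unsigned long long) memory_usage());
#else
	printf("memory tracking compiled out\n");
#endif
	return 0;
}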
--
John Naylor
EDB: http://www.enterprisedb.com
On Tue, Mar 7, 2023 at 1:01 AM John Naylor <john.naylor@enterprisedb.com> wrote:
On Mon, Mar 6, 2023 at 1:28 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Since the block-level measurement is likely overestimating quite a bit, I propose to simply reverse the order of the actions here, effectively reporting progress for the *last page* and not the current one: First update progress with the current memory usage, then add tids for this page. If this allocated a new block, only a small bit of that will be written to. If this block pushes it over the limit, we will detect that up at the top of the loop. It's kind of like our earlier attempts at a "fudge factor", but simpler and less brittle. And, as far as OS pages we have actually written to, I think it'll effectively respect the memory limit, at least in the local mem case. And the numbers will make sense.
Thoughts?
It looks to work but it still doesn't work in a case where a shared
tidstore is created with a 64kB memory limit, right?
TidStoreMemoryUsage() returns 1MB and TidStoreIsFull() returns true
from the beginning.I have two ideas:
1. Make it optional to track chunk memory space by a template parameter. It might be tiny compared to everything else that vacuum does. That would allow other users to avoid that overhead.
2. When context block usage exceeds the limit (rare), make the additional effort to get the precise usage -- I'm not sure such a top-down facility exists, and I'm not feeling well enough today to study this further.
I prefer option (1) as it's straightforward. I mentioned a similar
idea before[1]/messages/by-id/CAD21AoDK3gbX-jVxT6Pfso1Na0Krzr8Q15498Aj6tmXgzMFksA@mail.gmail.com. RT_MEMORY_USAGE() is defined only when the macro is
defined. It might be worth checking if there is visible overhead of
tracking chunk memory space. IIRC we've not evaluated it yet.
[1]: /messages/by-id/CAD21AoDK3gbX-jVxT6Pfso1Na0Krzr8Q15498Aj6tmXgzMFksA@mail.gmail.com
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Tue, Mar 7, 2023 at 8:25 AM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
1. Make it optional to track chunk memory space by a template
parameter. It might be tiny compared to everything else that vacuum does.
That would allow other users to avoid that overhead.
2. When context block usage exceeds the limit (rare), make the
additional effort to get the precise usage -- I'm not sure such a top-down
facility exists, and I'm not feeling well enough today to study this
further.
I prefer option (1) as it's straight forward. I mentioned a similar
idea before[1]. RT_MEMORY_USAGE() is defined only when the macro is
defined. It might be worth checking if there is visible overhead of
tracking chunk memory space. IIRC we've not evaluated it yet.
Ok, let's try this -- I can test and profile later this week.
--
John Naylor
EDB: http://www.enterprisedb.com
On Wed, Mar 8, 2023 at 1:40 PM John Naylor <john.naylor@enterprisedb.com> wrote:
On Tue, Mar 7, 2023 at 8:25 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
1. Make it optional to track chunk memory space by a template parameter. It might be tiny compared to everything else that vacuum does. That would allow other users to avoid that overhead.
2. When context block usage exceeds the limit (rare), make the additional effort to get the precise usage -- I'm not sure such a top-down facility exists, and I'm not feeling well enough today to study this further.I prefer option (1) as it's straight forward. I mentioned a similar
idea before[1]. RT_MEMORY_USAGE() is defined only when the macro is
defined. It might be worth checking if there is visible overhead of
tracking chunk memory space. IIRC we've not evaluated it yet.Ok, let's try this -- I can test and profile later this week.
Thanks!
I've attached the new version patches. I merged improvements and fixes
I did in the v29 patch. 0007 through 0010 are updates from v29. The
main change made in v30 is to make the memory measurement and
RT_MEMORY_USAGE() optional, which is done in 0007 patch. The 0008 and
0009 patches are the updates for tidstore and the vacuum integration
patches. Here are results of quick tests (an average of 3 executions):
query: select * from bench_load_random_int(10 * 1000 * 1000)
* w/ RT_MEASURE_MEMORY_USAGE:
mem_allocated | load_ms
---------------+---------
1996512000 | 3305
(1 row)
* w/o RT_MEASURE_MEMORY_USAGE:
mem_allocated | load_ms
---------------+---------
0 | 3258
(1 row)
It seems to be within the noise level, but I agree with making it optional.
Apart from the memory measurement stuff, I've done another todo item
on my list; adding min max classes for node3 and node125. I've done
that in 0010 patch, and here is a quick test result:
query: select * from bench_load_random_int(10 * 1000 * 1000)
* w/ 0010 patch
mem_allocated | load_ms
---------------+---------
1268630080 | 3275
(1 row)
* w/o 0010 patch
mem_allocated | load_ms
---------------+---------
1996512000 | 3214
(1 row)
That's a good improvement on the memory usage, without a noticeable
performance overhead. FYI CLASS_3_MIN has 1 fanout and is 24 bytes in
size, and CLASS_125_MIN has 61 fanouts and is 768 bytes in size.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachments:
v30-0008-Remove-the-max-memory-deduction-from-TidStore.patch (application/octet-stream)
From 5e3e7098eb12ec1d7ee546cc8f6e635638f131be Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 8 Mar 2023 15:08:58 +0900
Subject: [PATCH v30 08/11] Remove the max memory deduction from TidStore.
---
src/backend/access/common/tidstore.c | 43 +++++++---------------------
1 file changed, 10 insertions(+), 33 deletions(-)
diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
index 2d6f2b3ab9..54e2ef29db 100644
--- a/src/backend/access/common/tidstore.c
+++ b/src/backend/access/common/tidstore.c
@@ -82,6 +82,7 @@ typedef uint64 offsetbm;
#define RT_SCOPE static
#define RT_DECLARE
#define RT_DEFINE
+#define RT_MEASURE_MEMORY_USAGE
#define RT_VALUE_TYPE tidkey
#include "lib/radixtree.h"
@@ -90,6 +91,7 @@ typedef uint64 offsetbm;
#define RT_SCOPE static
#define RT_DECLARE
#define RT_DEFINE
+#define RT_MEASURE_MEMORY_USAGE
#define RT_VALUE_TYPE tidkey
#include "lib/radixtree.h"
@@ -182,39 +184,15 @@ TidStoreCreate(size_t max_bytes, OffsetNumber max_off, dsa_area *area)
ts = palloc0(sizeof(TidStore));
- /*
- * Create the radix tree for the main storage.
- *
- * Memory consumption depends on the number of stored tids, but also on the
- * distribution of them, how the radix tree stores, and the memory management
- * that backed the radix tree. The maximum bytes that a TidStore can
- * use is specified by the max_bytes in TidStoreCreate(). We want the total
- * amount of memory consumption by a TidStore not to exceed the max_bytes.
- *
- * In local TidStore cases, the radix tree uses slab allocators for each kind
- * of node class. The most memory consuming case while adding Tids associated
- * with one page (i.e. during TidStoreSetBlockOffsets()) is that we allocate a new
- * slab block for a new radix tree node, which is approximately 70kB. Therefore,
- * we deduct 70kB from the max_bytes.
- *
- * In shared cases, DSA allocates the memory segments big enough to follow
- * a geometric series that approximately doubles the total DSA size (see
- * make_new_segment() in dsa.c). We simulated the how DSA increases segment
- * size and the simulation revealed, the 75% threshold for the maximum bytes
- * perfectly works in case where the max_bytes is a power-of-2, and the 60%
- * threshold works for other cases.
- */
if (area != NULL)
{
dsa_pointer dp;
- float ratio = ((max_bytes & (max_bytes - 1)) == 0) ? 0.75 : 0.6;
ts->tree.shared = shared_rt_create(CurrentMemoryContext, area,
LWTRANCHE_SHARED_TIDSTORE);
dp = dsa_allocate0(area, sizeof(TidStoreControl));
ts->control = (TidStoreControl *) dsa_get_address(area, dp);
- ts->control->max_bytes = (size_t) (max_bytes * ratio);
ts->area = area;
ts->control->magic = TIDSTORE_MAGIC;
@@ -225,11 +203,15 @@ TidStoreCreate(size_t max_bytes, OffsetNumber max_off, dsa_area *area)
else
{
ts->tree.local = local_rt_create(CurrentMemoryContext);
-
ts->control = (TidStoreControl *) palloc0(sizeof(TidStoreControl));
- ts->control->max_bytes = max_bytes - (70 * 1024);
}
+ /*
+ * max_bytes is forced to be at least 64KB, the current minimum valid value
+ * for the work_mem GUC.
+ */
+ ts->control->max_bytes = Max(64 * 1024L, max_bytes);
+
ts->control->max_off = max_off;
ts->control->max_off_nbits = pg_ceil_log2_32(max_off);
@@ -333,14 +315,8 @@ TidStoreReset(TidStore *ts)
LWLockAcquire(&ts->control->lock, LW_EXCLUSIVE);
- /*
- * Free the radix tree and return allocated DSA segments to
- * the operating system.
- */
- shared_rt_free(ts->tree.shared);
- dsa_trim(ts->area);
-
/* Recreate the radix tree */
+ shared_rt_free(ts->tree.shared);
ts->tree.shared = shared_rt_create(CurrentMemoryContext, ts->area,
LWTRANCHE_SHARED_TIDSTORE);
@@ -354,6 +330,7 @@ TidStoreReset(TidStore *ts)
}
else
{
+ /* Recreate the radix tree */
local_rt_free(ts->tree.local);
ts->tree.local = local_rt_create(CurrentMemoryContext);
--
2.31.1
v30-0011-Revert-building-benchmark-module-for-CI.patch (application/octet-stream)
From 7c16882823a3d5b65f32c0147ff9f59e77500390 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Tue, 14 Feb 2023 19:31:34 +0700
Subject: [PATCH v30 11/11] Revert building benchmark module for CI
---
contrib/meson.build | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/contrib/meson.build b/contrib/meson.build
index 421d469f8c..52253de793 100644
--- a/contrib/meson.build
+++ b/contrib/meson.build
@@ -12,7 +12,7 @@ subdir('amcheck')
subdir('auth_delay')
subdir('auto_explain')
subdir('basic_archive')
-subdir('bench_radix_tree')
+#subdir('bench_radix_tree')
subdir('bloom')
subdir('basebackup_to_shell')
subdir('bool_plperl')
--
2.31.1
v30-0007-Radix-tree-optionally-tracks-memory-usage-when-R.patch (application/octet-stream)
From d271f527e12d91ea238f1bfef4e88220793fee76 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 8 Mar 2023 15:08:19 +0900
Subject: [PATCH v30 07/11] Radix tree optionally tracks memory usage, when
RT_MEASURE_MEMORY_USAGE.
---
contrib/bench_radix_tree/bench_radix_tree.c | 1 +
src/backend/utils/mmgr/dsa.c | 12 ---
src/include/lib/radixtree.h | 93 +++++++++++++++++--
src/include/utils/dsa.h | 1 -
.../modules/test_radixtree/test_radixtree.c | 1 +
5 files changed, 85 insertions(+), 23 deletions(-)
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c
index 63e842395d..fc6e4cb699 100644
--- a/contrib/bench_radix_tree/bench_radix_tree.c
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -34,6 +34,7 @@ PG_MODULE_MAGIC;
#define RT_DECLARE
#define RT_DEFINE
#define RT_USE_DELETE
+#define RT_MEASURE_MEMORY_USAGE
#define RT_VALUE_TYPE uint64
// WIP: compiles with warnings because rt_attach is defined but not used
// #define RT_SHMEM
diff --git a/src/backend/utils/mmgr/dsa.c b/src/backend/utils/mmgr/dsa.c
index 80555aefff..f5a62061a3 100644
--- a/src/backend/utils/mmgr/dsa.c
+++ b/src/backend/utils/mmgr/dsa.c
@@ -1024,18 +1024,6 @@ dsa_set_size_limit(dsa_area *area, size_t limit)
LWLockRelease(DSA_AREA_LOCK(area));
}
-size_t
-dsa_get_total_size(dsa_area *area)
-{
- size_t size;
-
- LWLockAcquire(DSA_AREA_LOCK(area), LW_SHARED);
- size = area->control->total_segment_size;
- LWLockRelease(DSA_AREA_LOCK(area));
-
- return size;
-}
-
/*
* Aggressively free all spare memory in the hope of returning DSM segments to
* the operating system.
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 2e3963c3d5..6d65544dd0 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -84,7 +84,6 @@
* RT_BEGIN_ITERATE - Begin iterating through all key-value pairs
* RT_ITERATE_NEXT - Return next key-value pair, if any
* RT_END_ITERATE - End iteration
- * RT_MEMORY_USAGE - Get the memory usage
*
* Interface for Shared Memory
* ---------
@@ -97,6 +96,8 @@
* ---------
*
* RT_DELETE - Delete a key-value pair. Declared/define if RT_USE_DELETE is defined
+ * RT_MEMORY_USAGE - Get the memory usage. Declared/define if
+ * RT_MEASURE_MEMORY_USAGE is defined.
*
*
* Copyright (c) 2023, PostgreSQL Global Development Group
@@ -138,7 +139,9 @@
#ifdef RT_USE_DELETE
#define RT_DELETE RT_MAKE_NAME(delete)
#endif
+#ifdef RT_MEASURE_MEMORY_USAGE
#define RT_MEMORY_USAGE RT_MAKE_NAME(memory_usage)
+#endif
#ifdef RT_DEBUG
#define RT_DUMP RT_MAKE_NAME(dump)
#define RT_DUMP_NODE RT_MAKE_NAME(dump_node)
@@ -150,6 +153,9 @@
#define RT_NEW_ROOT RT_MAKE_NAME(new_root)
#define RT_ALLOC_NODE RT_MAKE_NAME(alloc_node)
#define RT_INIT_NODE RT_MAKE_NAME(init_node)
+#ifdef RT_MEASURE_MEMORY_USAGE
+#define RT_FANOUT_GET_NODE_SIZE RT_MAKE_NAME(fanout_get_node_size)
+#endif
#define RT_FREE_NODE RT_MAKE_NAME(free_node)
#define RT_FREE_RECURSE RT_MAKE_NAME(free_recurse)
#define RT_EXTEND_UP RT_MAKE_NAME(extend_up)
@@ -255,7 +261,9 @@ RT_SCOPE RT_ITER * RT_BEGIN_ITERATE(RT_RADIX_TREE *tree);
RT_SCOPE bool RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, RT_VALUE_TYPE *value_p);
RT_SCOPE void RT_END_ITERATE(RT_ITER *iter);
+#ifdef RT_MEASURE_MEMORY_USAGE
RT_SCOPE uint64 RT_MEMORY_USAGE(RT_RADIX_TREE *tree);
+#endif
#ifdef RT_DEBUG
RT_SCOPE void RT_DUMP(RT_RADIX_TREE *tree);
@@ -624,6 +632,10 @@ typedef struct RT_RADIX_TREE_CONTROL
uint64 max_val;
uint64 num_keys;
+#ifdef RT_MEASURE_MEMORY_USAGE
+ int64 mem_used;
+#endif
+
/* statistics */
#ifdef RT_DEBUG
int32 cnt[RT_SIZE_CLASS_COUNT];
@@ -1089,6 +1101,11 @@ RT_ALLOC_NODE(RT_RADIX_TREE *tree, RT_SIZE_CLASS size_class, bool is_leaf)
allocsize);
#endif
+#ifdef RT_MEASURE_MEMORY_USAGE
+ /* update memory usage */
+ tree->ctl->mem_used += allocsize;
+#endif
+
#ifdef RT_DEBUG
/* update the statistics */
tree->ctl->cnt[size_class]++;
@@ -1165,6 +1182,54 @@ RT_SWITCH_NODE_KIND(RT_RADIX_TREE *tree, RT_PTR_ALLOC allocnode, RT_PTR_LOCAL no
return newnode;
}
+#ifdef RT_MEASURE_MEMORY_USAGE
+/* Return the node size of the given fanout of the size class */
+static inline Size
+RT_FANOUT_GET_NODE_SIZE(int fanout, bool is_leaf)
+{
+ const Size fanout_inner_node_size[] = {
+ [3] = RT_SIZE_CLASS_INFO[RT_CLASS_3].inner_size,
+ [15] = RT_SIZE_CLASS_INFO[RT_CLASS_32_MIN].inner_size,
+ [32] = RT_SIZE_CLASS_INFO[RT_CLASS_32_MAX].inner_size,
+ [125] = RT_SIZE_CLASS_INFO[RT_CLASS_125].inner_size,
+ [256] = RT_SIZE_CLASS_INFO[RT_CLASS_256].inner_size,
+ };
+ const Size fanout_leaf_node_size[] = {
+ [3] = RT_SIZE_CLASS_INFO[RT_CLASS_3].leaf_size,
+ [15] = RT_SIZE_CLASS_INFO[RT_CLASS_32_MIN].leaf_size,
+ [32] = RT_SIZE_CLASS_INFO[RT_CLASS_32_MAX].leaf_size,
+ [125] = RT_SIZE_CLASS_INFO[RT_CLASS_125].leaf_size,
+ [256] = RT_SIZE_CLASS_INFO[RT_CLASS_256].leaf_size,
+ };
+ Size node_size;
+
+ node_size = is_leaf ?
+ fanout_leaf_node_size[fanout] : fanout_inner_node_size[fanout];
+
+#ifdef USE_ASSERT_CHECKING
+ {
+ Size assert_node_size = 0;
+
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ RT_SIZE_CLASS_ELEM size_class = RT_SIZE_CLASS_INFO[i];
+
+ if (size_class.fanout == fanout)
+ {
+ assert_node_size = is_leaf ?
+ size_class.leaf_size : size_class.inner_size;
+ break;
+ }
+ }
+
+ Assert(node_size == assert_node_size);
+ }
+#endif
+
+ return node_size;
+}
+#endif
+
/* Free the given node */
static void
RT_FREE_NODE(RT_RADIX_TREE *tree, RT_PTR_ALLOC allocnode)
@@ -1197,11 +1262,22 @@ RT_FREE_NODE(RT_RADIX_TREE *tree, RT_PTR_ALLOC allocnode)
}
#endif
+#ifdef RT_MEASURE_MEMORY_USAGE
+ /* update memory usage */
+ {
+ RT_PTR_LOCAL node = RT_PTR_GET_LOCAL(tree, allocnode);
+ tree->ctl->mem_used -= RT_FANOUT_GET_NODE_SIZE(node->fanout,
+ RT_NODE_IS_LEAF(node));
+ Assert(tree->ctl->mem_used >= 0);
+ }
+#endif
+
#ifdef RT_SHMEM
dsa_free(tree->dsa, allocnode);
#else
pfree(allocnode);
#endif
+
}
/* Update the parent's pointer when growing a node */
@@ -1989,27 +2065,23 @@ RT_END_ITERATE(RT_ITER *iter)
/*
* Return the statistics of the amount of memory used by the radix tree.
*/
+#ifdef RT_MEASURE_MEMORY_USAGE
RT_SCOPE uint64
RT_MEMORY_USAGE(RT_RADIX_TREE *tree)
{
Size total = 0;
- RT_LOCK_SHARED(tree);
-
#ifdef RT_SHMEM
Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
- total = dsa_get_total_size(tree->dsa);
-#else
- for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
- {
- total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
- total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
- }
#endif
+ RT_LOCK_SHARED(tree);
+ total = tree->ctl->mem_used;
RT_UNLOCK(tree);
+
return total;
}
+#endif
/*
* Verify the radix tree node.
@@ -2476,6 +2548,7 @@ RT_DUMP(RT_RADIX_TREE *tree)
#undef RT_NEW_ROOT
#undef RT_ALLOC_NODE
#undef RT_INIT_NODE
+#undef RT_FANOUT_GET_NODE_SIZE
#undef RT_FREE_NODE
#undef RT_FREE_RECURSE
#undef RT_EXTEND_UP
diff --git a/src/include/utils/dsa.h b/src/include/utils/dsa.h
index 2af215484f..3ce4ee300a 100644
--- a/src/include/utils/dsa.h
+++ b/src/include/utils/dsa.h
@@ -121,7 +121,6 @@ extern dsa_handle dsa_get_handle(dsa_area *area);
extern dsa_pointer dsa_allocate_extended(dsa_area *area, size_t size, int flags);
extern void dsa_free(dsa_area *area, dsa_pointer dp);
extern void *dsa_get_address(dsa_area *area, dsa_pointer dp);
-extern size_t dsa_get_total_size(dsa_area *area);
extern void dsa_trim(dsa_area *area);
extern void dsa_dump(dsa_area *area);
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
index 5a169854d9..19d286d84b 100644
--- a/src/test/modules/test_radixtree/test_radixtree.c
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -114,6 +114,7 @@ static const test_spec test_specs[] = {
#define RT_DECLARE
#define RT_DEFINE
#define RT_USE_DELETE
+#define RT_MEASURE_MEMORY_USAGE
#define RT_VALUE_TYPE TestValueType
/* #define RT_SHMEM */
#include "lib/radixtree.h"
--
2.31.1
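To see the new template switch from the caller's side, here is a minimal sketch of instantiating a backend-local radix tree with memory-usage tracking enabled, modeled on the bench/test modules touched above. Only RT_MEASURE_MEMORY_USAGE, RT_VALUE_TYPE, and the uint64 return type of the memory-usage function are taken from the patch; the RT_PREFIX/RT_SCOPE settings and the rt_create/rt_set/rt_free names are assumptions based on the template's usual naming convention:

#include "postgres.h"

/* sketch only: the rt_* names below are the assumed generated names
 * for RT_PREFIX "rt"; adjust to the actual template options in use */
#define RT_PREFIX rt
#define RT_SCOPE static
#define RT_DECLARE
#define RT_DEFINE
#define RT_MEASURE_MEMORY_USAGE		/* generates rt_memory_usage() */
#define RT_VALUE_TYPE uint64
#include "lib/radixtree.h"

static void
radix_tree_mem_usage_example(void)
{
	rt_radix_tree *tree = rt_create(CurrentMemoryContext);
	uint64		val = 42;

	rt_set(tree, UINT64CONST(1), &val);

	/* with this patch the figure comes from tree->ctl->mem_used,
	 * maintained in RT_ALLOC_NODE/RT_FREE_NODE, rather than from the
	 * slab contexts or the total DSA segment size */
	elog(NOTICE, "radix tree uses " UINT64_FORMAT " bytes",
		 rt_memory_usage(tree));

	rt_free(tree);
}

The point of gating RT_MEMORY_USAGE behind RT_MEASURE_MEMORY_USAGE is that callers which never ask for the figure don't pay for the mem_used bookkeeping on every node allocation and free.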
Attachment: v30-0009-Revert-the-update-for-the-minimum-value-of-maint.patch (application/octet-stream)
From f7013c9023ff3f9a6707276303443f0b4e00ccbf Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 8 Mar 2023 15:09:22 +0900
Subject: [PATCH v30 09/11] Revert the update for the minimum value of
maintenance_work_mem.
---
src/backend/postmaster/autovacuum.c | 6 +++---
src/backend/utils/misc/guc_tables.c | 2 +-
2 files changed, 4 insertions(+), 4 deletions(-)
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index a371f6fbba..ff6149a179 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -3397,12 +3397,12 @@ check_autovacuum_work_mem(int *newval, void **extra, GucSource source)
return true;
/*
- * We clamp manually-set values to at least 2MB. Since
+ * We clamp manually-set values to at least 1MB. Since
* maintenance_work_mem is always set to at least this value, do the same
* here.
*/
- if (*newval < 2048)
- *newval = 2048;
+ if (*newval < 1024)
+ *newval = 1024;
return true;
}
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 8a64614cd1..1c0583fe26 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2313,7 +2313,7 @@ struct config_int ConfigureNamesInt[] =
GUC_UNIT_KB
},
&maintenance_work_mem,
- 65536, 2048, MAX_KILOBYTES,
+ 65536, 1024, MAX_KILOBYTES,
NULL, NULL, NULL
},
--
2.31.1
Attachment: v30-0010-Add-min-and-max-classes-for-node3-and-node125.patch (application/octet-stream)
From ba41d3bfcf0d3016c61948ce6acc0d9582d8aad8 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Thu, 9 Mar 2023 11:42:17 +0900
Subject: [PATCH v30 10/11] Add min and max classes for node3 and node125.
---
src/include/lib/radixtree.h | 70 +++++++++++++------
src/include/lib/radixtree_insert_impl.h | 56 ++++++++++++++-
.../expected/test_radixtree.out | 4 ++
.../modules/test_radixtree/test_radixtree.c | 6 +-
4 files changed, 110 insertions(+), 26 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 6d65544dd0..b655f4a2a2 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -225,10 +225,12 @@
#define RT_SIZE_CLASS RT_MAKE_NAME(size_class)
#define RT_SIZE_CLASS_ELEM RT_MAKE_NAME(size_class_elem)
#define RT_SIZE_CLASS_INFO RT_MAKE_NAME(size_class_info)
-#define RT_CLASS_3 RT_MAKE_NAME(class_3)
+#define RT_CLASS_3_MIN RT_MAKE_NAME(class_3_min)
+#define RT_CLASS_3_MAX RT_MAKE_NAME(class_3_max)
#define RT_CLASS_32_MIN RT_MAKE_NAME(class_32_min)
#define RT_CLASS_32_MAX RT_MAKE_NAME(class_32_max)
-#define RT_CLASS_125 RT_MAKE_NAME(class_125)
+#define RT_CLASS_125_MIN RT_MAKE_NAME(class_125_min)
+#define RT_CLASS_125_MAX RT_MAKE_NAME(class_125_max)
#define RT_CLASS_256 RT_MAKE_NAME(class_256)
/* generate forward declarations necessary to use the radix tree */
@@ -561,10 +563,12 @@ typedef struct RT_NODE_LEAF_256
*/
typedef enum RT_SIZE_CLASS
{
- RT_CLASS_3 = 0,
+ RT_CLASS_3_MIN = 0,
+ RT_CLASS_3_MAX,
RT_CLASS_32_MIN,
RT_CLASS_32_MAX,
- RT_CLASS_125,
+ RT_CLASS_125_MIN,
+ RT_CLASS_125_MAX,
RT_CLASS_256
} RT_SIZE_CLASS;
@@ -580,7 +584,13 @@ typedef struct RT_SIZE_CLASS_ELEM
} RT_SIZE_CLASS_ELEM;
static const RT_SIZE_CLASS_ELEM RT_SIZE_CLASS_INFO[] = {
- [RT_CLASS_3] = {
+ [RT_CLASS_3_MIN] = {
+ .name = "radix tree node 1",
+ .fanout = 1,
+ .inner_size = sizeof(RT_NODE_INNER_3) + 1 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_3) + 1 * sizeof(RT_VALUE_TYPE),
+ },
+ [RT_CLASS_3_MAX] = {
.name = "radix tree node 3",
.fanout = 3,
.inner_size = sizeof(RT_NODE_INNER_3) + 3 * sizeof(RT_PTR_ALLOC),
@@ -598,7 +608,13 @@ static const RT_SIZE_CLASS_ELEM RT_SIZE_CLASS_INFO[] = {
.inner_size = sizeof(RT_NODE_INNER_32) + 32 * sizeof(RT_PTR_ALLOC),
.leaf_size = sizeof(RT_NODE_LEAF_32) + 32 * sizeof(RT_VALUE_TYPE),
},
- [RT_CLASS_125] = {
+ [RT_CLASS_125_MIN] = {
+ .name = "radix tree node 125",
+ .fanout = 61,
+ .inner_size = sizeof(RT_NODE_INNER_125) + 61 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_125) + 61 * sizeof(RT_VALUE_TYPE),
+ },
+ [RT_CLASS_125_MAX] = {
.name = "radix tree node 125",
.fanout = 125,
.inner_size = sizeof(RT_NODE_INNER_125) + 125 * sizeof(RT_PTR_ALLOC),
@@ -934,7 +950,7 @@ static inline void
RT_CHUNK_CHILDREN_ARRAY_COPY(uint8 *src_chunks, RT_PTR_ALLOC *src_children,
uint8 *dst_chunks, RT_PTR_ALLOC *dst_children)
{
- const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_3].fanout;
+ const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_3_MAX].fanout;
const Size chunk_size = sizeof(uint8) * fanout;
const Size children_size = sizeof(RT_PTR_ALLOC) * fanout;
@@ -946,7 +962,7 @@ static inline void
RT_CHUNK_VALUES_ARRAY_COPY(uint8 *src_chunks, RT_VALUE_TYPE *src_values,
uint8 *dst_chunks, RT_VALUE_TYPE *dst_values)
{
- const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_3].fanout;
+ const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_3_MAX].fanout;
const Size chunk_size = sizeof(uint8) * fanout;
const Size values_size = sizeof(RT_VALUE_TYPE) * fanout;
@@ -1152,9 +1168,9 @@ RT_NEW_ROOT(RT_RADIX_TREE *tree, uint64 key)
RT_PTR_ALLOC allocnode;
RT_PTR_LOCAL newnode;
- allocnode = RT_ALLOC_NODE(tree, RT_CLASS_3, is_leaf);
+ allocnode = RT_ALLOC_NODE(tree, RT_CLASS_3_MIN, is_leaf);
newnode = RT_PTR_GET_LOCAL(tree, allocnode);
- RT_INIT_NODE(newnode, RT_NODE_KIND_3, RT_CLASS_3, is_leaf);
+ RT_INIT_NODE(newnode, RT_NODE_KIND_3, RT_CLASS_3_MIN, is_leaf);
newnode->shift = shift;
tree->ctl->max_val = RT_SHIFT_GET_MAX_VAL(shift);
tree->ctl->root = allocnode;
@@ -1188,17 +1204,21 @@ static inline Size
RT_FANOUT_GET_NODE_SIZE(int fanout, bool is_leaf)
{
const Size fanout_inner_node_size[] = {
- [3] = RT_SIZE_CLASS_INFO[RT_CLASS_3].inner_size,
+ [1] = RT_SIZE_CLASS_INFO[RT_CLASS_3_MIN].inner_size,
+ [3] = RT_SIZE_CLASS_INFO[RT_CLASS_3_MAX].inner_size,
[15] = RT_SIZE_CLASS_INFO[RT_CLASS_32_MIN].inner_size,
[32] = RT_SIZE_CLASS_INFO[RT_CLASS_32_MAX].inner_size,
- [125] = RT_SIZE_CLASS_INFO[RT_CLASS_125].inner_size,
+ [61] = RT_SIZE_CLASS_INFO[RT_CLASS_125_MIN].inner_size,
+ [125] = RT_SIZE_CLASS_INFO[RT_CLASS_125_MAX].inner_size,
[256] = RT_SIZE_CLASS_INFO[RT_CLASS_256].inner_size,
};
const Size fanout_leaf_node_size[] = {
- [3] = RT_SIZE_CLASS_INFO[RT_CLASS_3].leaf_size,
+ [1] = RT_SIZE_CLASS_INFO[RT_CLASS_3_MIN].leaf_size,
+ [3] = RT_SIZE_CLASS_INFO[RT_CLASS_3_MAX].leaf_size,
[15] = RT_SIZE_CLASS_INFO[RT_CLASS_32_MIN].leaf_size,
[32] = RT_SIZE_CLASS_INFO[RT_CLASS_32_MAX].leaf_size,
- [125] = RT_SIZE_CLASS_INFO[RT_CLASS_125].leaf_size,
+ [61] = RT_SIZE_CLASS_INFO[RT_CLASS_125_MIN].leaf_size,
+ [125] = RT_SIZE_CLASS_INFO[RT_CLASS_125_MAX].leaf_size,
[256] = RT_SIZE_CLASS_INFO[RT_CLASS_256].leaf_size,
};
Size node_size;
@@ -1337,9 +1357,9 @@ RT_EXTEND_UP(RT_RADIX_TREE *tree, uint64 key)
RT_PTR_LOCAL node;
RT_NODE_INNER_3 *n3;
- allocnode = RT_ALLOC_NODE(tree, RT_CLASS_3, true);
+ allocnode = RT_ALLOC_NODE(tree, RT_CLASS_3_MIN, true);
node = RT_PTR_GET_LOCAL(tree, allocnode);
- RT_INIT_NODE(node, RT_NODE_KIND_3, RT_CLASS_3, true);
+ RT_INIT_NODE(node, RT_NODE_KIND_3, RT_CLASS_3_MIN, true);
node->shift = shift;
node->count = 1;
@@ -1375,9 +1395,9 @@ RT_EXTEND_DOWN(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p, RT_PTR_L
int newshift = shift - RT_NODE_SPAN;
bool is_leaf = newshift == 0;
- allocchild = RT_ALLOC_NODE(tree, RT_CLASS_3, is_leaf);
+ allocchild = RT_ALLOC_NODE(tree, RT_CLASS_3_MIN, is_leaf);
newchild = RT_PTR_GET_LOCAL(tree, allocchild);
- RT_INIT_NODE(newchild, RT_NODE_KIND_3, RT_CLASS_3, is_leaf);
+ RT_INIT_NODE(newchild, RT_NODE_KIND_3, RT_CLASS_3_MIN, is_leaf);
newchild->shift = newshift;
RT_NODE_INSERT_INNER(tree, parent, stored_node, node, key, allocchild);
@@ -2177,12 +2197,14 @@ RT_STATS(RT_RADIX_TREE *tree)
{
RT_PTR_LOCAL root = RT_PTR_GET_LOCAL(tree, tree->ctl->root);
- fprintf(stderr, "height = %d, n3 = %u, n15 = %u, n32 = %u, n125 = %u, n256 = %u\n",
+ fprintf(stderr, "height = %d, n1 = %u, n3 = %u, n15 = %u, n32 = %u, n61 = %u, n125 = %u, n256 = %u\n",
root->shift / RT_NODE_SPAN,
- tree->ctl->cnt[RT_CLASS_3],
+ tree->ctl->cnt[RT_CLASS_3_MIN],
+ tree->ctl->cnt[RT_CLASS_3_MAX],
tree->ctl->cnt[RT_CLASS_32_MIN],
tree->ctl->cnt[RT_CLASS_32_MAX],
- tree->ctl->cnt[RT_CLASS_125],
+ tree->ctl->cnt[RT_CLASS_125_MIN],
+ tree->ctl->cnt[RT_CLASS_125_MAX],
tree->ctl->cnt[RT_CLASS_256]);
}
@@ -2519,10 +2541,12 @@ RT_DUMP(RT_RADIX_TREE *tree)
#undef RT_SIZE_CLASS
#undef RT_SIZE_CLASS_ELEM
#undef RT_SIZE_CLASS_INFO
-#undef RT_CLASS_3
+#undef RT_CLASS_3_MIN
+#undef RT_CLASS_3_MAX
#undef RT_CLASS_32_MIN
#undef RT_CLASS_32_MAX
-#undef RT_CLASS_125
+#undef RT_CLASS_125_MIN
+#undef RT_CLASS_125_MAX
#undef RT_CLASS_256
/* function declarations */
diff --git a/src/include/lib/radixtree_insert_impl.h b/src/include/lib/radixtree_insert_impl.h
index d56e58dcac..d10093dfba 100644
--- a/src/include/lib/radixtree_insert_impl.h
+++ b/src/include/lib/radixtree_insert_impl.h
@@ -42,6 +42,7 @@
{
case RT_NODE_KIND_3:
{
+ const RT_SIZE_CLASS_ELEM class3_max = RT_SIZE_CLASS_INFO[RT_CLASS_3_MAX];
RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node;
#ifdef RT_NODE_LEVEL_LEAF
@@ -55,6 +56,32 @@
break;
}
#endif
+ if (unlikely(RT_NODE_MUST_GROW(n3)) &&
+ n3->base.n.fanout < class3_max.fanout)
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+ const RT_SIZE_CLASS_ELEM class3_min = RT_SIZE_CLASS_INFO[RT_CLASS_3_MIN];
+ const RT_SIZE_CLASS new_class = RT_CLASS_3_MAX;
+
+ Assert(n3->base.n.fanout == class3_min.fanout);
+
+ /* grow to the next size class of this kind */
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ n3 = (RT_NODE3_TYPE *) newnode;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ memcpy(newnode, node, class3_min.leaf_size);
+#else
+ memcpy(newnode, node, class3_min.inner_size);
+#endif
+ newnode->fanout = class3_max.fanout;
+
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
+ node = newnode;
+ }
+
if (unlikely(RT_NODE_MUST_GROW(n3)))
{
RT_PTR_ALLOC allocnode;
@@ -154,7 +181,7 @@
RT_PTR_LOCAL newnode;
RT_NODE125_TYPE *new125;
const uint8 new_kind = RT_NODE_KIND_125;
- const RT_SIZE_CLASS new_class = RT_CLASS_125;
+ const RT_SIZE_CLASS new_class = RT_CLASS_125_MIN;
Assert(n32->base.n.fanout == class32_max.fanout);
@@ -213,6 +240,7 @@
/* FALLTHROUGH */
case RT_NODE_KIND_125:
{
+ const RT_SIZE_CLASS_ELEM class125_max = RT_SIZE_CLASS_INFO[RT_CLASS_125_MAX];
RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
int slotpos;
int cnt = 0;
@@ -227,6 +255,32 @@
break;
}
#endif
+ if (unlikely(RT_NODE_MUST_GROW(n125)) &&
+ n125->base.n.fanout < class125_max.fanout)
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+ const RT_SIZE_CLASS_ELEM class125_min = RT_SIZE_CLASS_INFO[RT_CLASS_125_MIN];
+ const RT_SIZE_CLASS new_class = RT_CLASS_125_MAX;
+
+ Assert(n125->base.n.fanout == class125_min.fanout);
+
+ /* grow to the next size class of this kind */
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ n125 = (RT_NODE125_TYPE *) newnode;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ memcpy(newnode, node, class125_min.leaf_size);
+#else
+ memcpy(newnode, node, class125_min.inner_size);
+#endif
+ newnode->fanout = class125_max.fanout;
+
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
+ node = newnode;
+ }
+
if (unlikely(RT_NODE_MUST_GROW(n125)))
{
RT_PTR_ALLOC allocnode;
diff --git a/src/test/modules/test_radixtree/expected/test_radixtree.out b/src/test/modules/test_radixtree/expected/test_radixtree.out
index 7ad1ce3605..f2b1d7e4f8 100644
--- a/src/test/modules/test_radixtree/expected/test_radixtree.out
+++ b/src/test/modules/test_radixtree/expected/test_radixtree.out
@@ -4,12 +4,16 @@ CREATE EXTENSION test_radixtree;
-- an error if something fails.
--
SELECT test_radixtree();
+NOTICE: testing basic operations with leaf node 1
+NOTICE: testing basic operations with inner node 1
NOTICE: testing basic operations with leaf node 3
NOTICE: testing basic operations with inner node 3
NOTICE: testing basic operations with leaf node 15
NOTICE: testing basic operations with inner node 15
NOTICE: testing basic operations with leaf node 32
NOTICE: testing basic operations with inner node 32
+NOTICE: testing basic operations with leaf node 61
+NOTICE: testing basic operations with inner node 61
NOTICE: testing basic operations with leaf node 125
NOTICE: testing basic operations with inner node 125
NOTICE: testing basic operations with leaf node 256
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
index 19d286d84b..4f38b6e3de 100644
--- a/src/test/modules/test_radixtree/test_radixtree.c
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -47,10 +47,12 @@ static const bool rt_test_stats = false;
* XXX: should we expose and use RT_SIZE_CLASS and RT_SIZE_CLASS_INFO?
*/
static int rt_node_class_fanouts[] = {
- 3, /* RT_CLASS_3 */
+ 1, /* RT_CLASS_3_MIN */
+ 3, /* RT_CLASS_3_MAX */
15, /* RT_CLASS_32_MIN */
32, /* RT_CLASS_32_MAX */
- 125, /* RT_CLASS_125 */
+ 61, /* RT_CLASS_125_MIN */
+ 125, /* RT_CLASS_125_MAX */
256 /* RT_CLASS_256 */
};
/*
--
2.31.1
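For readers not following the full diff, the gist is that a node can now grow within its own kind before switching to the next kind: new root and extension nodes start at the _MIN classes, and when a _MIN node must grow it is first copied into the _MAX class of the same kind. A condensed sketch of that step and of the resulting size-class progression, with declarations elided for brevity (this is a summary of the insert path above, not a verbatim excerpt):

/* size-class progression as entries are added, per the fanouts above:
 *
 *   kind 3:    fanout 1  -> fanout 3     (copy within the same kind)
 *   kind 32:   fanout 15 -> fanout 32
 *   kind 125:  fanout 61 -> fanout 125
 *   kind 256:  fanout 256                (no growth needed)
 *
 * growing within a kind is "allocate the larger class, memcpy the
 * smaller class's size, bump the fanout, swap into the parent": */
if (RT_NODE_MUST_GROW(n3) && n3->base.n.fanout < class3_max.fanout)
{
	allocnode = RT_ALLOC_NODE(tree, RT_CLASS_3_MAX, is_leaf);
	newnode = RT_PTR_GET_LOCAL(tree, allocnode);
	memcpy(newnode, node, is_leaf ? class3_min.leaf_size
								  : class3_min.inner_size);
	newnode->fanout = class3_max.fanout;
	RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
	node = newnode;
}

Only once a node is already at its kind's _MAX class does the existing kind-switching path run, so sparse trees keep the smaller allocations for as long as possible.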
Attachment: v30-0006-Use-TIDStore-for-storing-dead-tuple-TID-during-l.patch (application/octet-stream)
From b4e4ea5f22ee8898fa7ef58a21d0da1d4d661a0a Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Tue, 7 Feb 2023 17:19:29 +0700
Subject: [PATCH v30 06/11] Use TIDStore for storing dead tuple TID during lazy
vacuum
Previously, we used an array of ItemPointerData to store dead tuple
TIDs, which was not space-efficient and was slow to look up. It also
had a hard 1GB limit on its size.
Now we use TIDStore to store dead tuple TIDs. Since the TIDStore,
backed by the radix tree, allocates memory incrementally, we get rid
of the 1GB limit.
Since we can no longer estimate exactly the maximum number of TIDs
that can be stored, pg_stat_progress_vacuum now reports progress based
on the amount of memory used, in bytes. The columns are accordingly
renamed to max_dead_tuple_bytes and dead_tuple_bytes.
In addition, since the TIDStore uses the radix tree internally, the
minimum amount of memory required by the TIDStore is 1MB, the initial
DSA segment size. Due to that, we increase the minimum value of
maintenance_work_mem (and autovacuum_work_mem) from 1MB to 2MB.
XXX: needs to bump catalog version
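A minimal sketch of how lazy vacuum drives the new API after this patch; the TidStore call names come from the call sites in the diff below, while the wrapper function, its arguments, and the exact signatures are inferred and only illustrative:

#include "postgres.h"
#include "access/htup_details.h"
#include "access/tidstore.h"
#include "miscadmin.h"

/* sketch only: first pass collects offsets, the index pass does
 * membership checks, the second pass iterates block by block, and the
 * store is reset before the heap scan resumes */
static void
tidstore_usage_sketch(BlockNumber blkno, OffsetNumber *deadoffsets,
					  int num_offsets, ItemPointer itemptr)
{
	/* passing NULL for the dsa_area gives a backend-local store */
	TidStore   *dead_items = TidStoreCreate(maintenance_work_mem * 1024L,
											MaxHeapTuplesPerPage, NULL);
	TidStoreIter *iter;
	TidStoreIterResult *res;

	/* first heap pass: remember this page's LP_DEAD offsets */
	TidStoreSetBlockOffsets(dead_items, blkno, deadoffsets, num_offsets);

	/* index vacuuming: vac_tid_reaped() becomes a membership test */
	if (TidStoreIsMember(dead_items, itemptr))
	{
		/* the index tuple points at a dead heap tuple: delete it */
	}

	/* second heap pass: visit the collected offsets block by block;
	 * in vacuumlazy.c this drives lazy_vacuum_heap_page() */
	iter = TidStoreBeginIterate(dead_items);
	while ((res = TidStoreIterateNext(iter)) != NULL)
	{
		/* res->blkno, res->offsets, res->num_offsets */
	}
	TidStoreEndIterate(iter);

	/* forget all collected TIDs before resuming the heap scan */
	TidStoreReset(dead_items);
	TidStoreDestroy(dead_items);
}

For a parallel vacuum the only difference, as in the vacuumparallel.c changes below, is that the store is created on a DSA area and workers attach to it through the handle stored in PVShared.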
---
doc/src/sgml/monitoring.sgml | 8 +-
src/backend/access/heap/vacuumlazy.c | 279 ++++++++-------------
src/backend/catalog/system_views.sql | 2 +-
src/backend/commands/vacuum.c | 78 +-----
src/backend/commands/vacuumparallel.c | 66 +++--
src/backend/postmaster/autovacuum.c | 6 +-
src/backend/storage/lmgr/lwlock.c | 2 +
src/backend/utils/misc/guc_tables.c | 2 +-
src/include/commands/progress.h | 4 +-
src/include/commands/vacuum.h | 25 +-
src/include/storage/lwlock.h | 1 +
src/test/regress/expected/cluster.out | 2 +-
src/test/regress/expected/create_index.out | 2 +-
src/test/regress/expected/rules.out | 4 +-
src/test/regress/sql/cluster.sql | 2 +-
src/test/regress/sql/create_index.sql | 2 +-
16 files changed, 174 insertions(+), 311 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 97d588b1d8..61e163636a 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -7170,10 +7170,10 @@ FROM pg_stat_get_backend_idset() AS backendid;
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>max_dead_tuples</structfield> <type>bigint</type>
+ <structfield>max_dead_tuple_bytes</structfield> <type>bigint</type>
</para>
<para>
- Number of dead tuples that we can store before needing to perform
+ Amount of dead tuple data that we can store before needing to perform
an index vacuum cycle, based on
<xref linkend="guc-maintenance-work-mem"/>.
</para></entry>
@@ -7181,10 +7181,10 @@ FROM pg_stat_get_backend_idset() AS backendid;
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>num_dead_tuples</structfield> <type>bigint</type>
+ <structfield>dead_tuple_bytes</structfield> <type>bigint</type>
</para>
<para>
- Number of dead tuples collected since the last index vacuum cycle.
+ Amount of dead tuple data collected since the last index vacuum cycle.
</para></entry>
</row>
</tbody>
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 8f14cf85f3..edb9079124 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -3,18 +3,17 @@
* vacuumlazy.c
* Concurrent ("lazy") vacuuming.
*
- * The major space usage for vacuuming is storage for the array of dead TIDs
+ * The major space usage for vacuuming is TidStore, a storage for dead TIDs
* that are to be removed from indexes. We want to ensure we can vacuum even
* the very largest relations with finite memory space usage. To do that, we
- * set upper bounds on the number of TIDs we can keep track of at once.
+ * set upper bounds on the maximum memory that can be used for keeping track
+ * of dead TIDs at once.
*
* We are willing to use at most maintenance_work_mem (or perhaps
- * autovacuum_work_mem) memory space to keep track of dead TIDs. We initially
- * allocate an array of TIDs of that size, with an upper limit that depends on
- * table size (this limit ensures we don't allocate a huge area uselessly for
- * vacuuming small tables). If the array threatens to overflow, we must call
- * lazy_vacuum to vacuum indexes (and to vacuum the pages that we've pruned).
- * This frees up the memory space dedicated to storing dead TIDs.
+ * autovacuum_work_mem) memory space to keep track of dead TIDs. If the
+ * TidStore is full, we must call lazy_vacuum to vacuum indexes (and to vacuum
+ * the pages that we've pruned). This frees up the memory space dedicated
+ * to storing dead TIDs.
*
* In practice VACUUM will often complete its initial pass over the target
* heap relation without ever running out of space to store TIDs. This means
@@ -40,6 +39,7 @@
#include "access/heapam_xlog.h"
#include "access/htup_details.h"
#include "access/multixact.h"
+#include "access/tidstore.h"
#include "access/transam.h"
#include "access/visibilitymap.h"
#include "access/xact.h"
@@ -188,7 +188,7 @@ typedef struct LVRelState
* lazy_vacuum_heap_rel, which marks the same LP_DEAD line pointers as
* LP_UNUSED during second heap pass.
*/
- VacDeadItems *dead_items; /* TIDs whose index tuples we'll delete */
+ TidStore *dead_items; /* TIDs whose index tuples we'll delete */
BlockNumber rel_pages; /* total number of pages */
BlockNumber scanned_pages; /* # pages examined (not skipped via VM) */
BlockNumber removed_pages; /* # pages removed by relation truncation */
@@ -220,11 +220,14 @@ typedef struct LVRelState
typedef struct LVPagePruneState
{
bool hastup; /* Page prevents rel truncation? */
- bool has_lpdead_items; /* includes existing LP_DEAD items */
+
+ /* collected offsets of LP_DEAD items including existing ones */
+ OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
+ int num_offsets;
/*
* State describes the proper VM bit states to set for the page following
- * pruning and freezing. all_visible implies !has_lpdead_items, but don't
+ * pruning and freezing. all_visible implies num_offsets == 0, but don't
* trust all_frozen result unless all_visible is also set to true.
*/
bool all_visible; /* Every item visible to all? */
@@ -259,8 +262,9 @@ static bool lazy_scan_noprune(LVRelState *vacrel, Buffer buf,
static void lazy_vacuum(LVRelState *vacrel);
static bool lazy_vacuum_all_indexes(LVRelState *vacrel);
static void lazy_vacuum_heap_rel(LVRelState *vacrel);
-static int lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
- Buffer buffer, int index, Buffer vmbuffer);
+static void lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
+ OffsetNumber *offsets, int num_offsets,
+ Buffer buffer, Buffer vmbuffer);
static bool lazy_check_wraparound_failsafe(LVRelState *vacrel);
static void lazy_cleanup_all_indexes(LVRelState *vacrel);
static IndexBulkDeleteResult *lazy_vacuum_one_index(Relation indrel,
@@ -487,11 +491,11 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
}
/*
- * Allocate dead_items array memory using dead_items_alloc. This handles
- * parallel VACUUM initialization as part of allocating shared memory
- * space used for dead_items. (But do a failsafe precheck first, to
- * ensure that parallel VACUUM won't be attempted at all when relfrozenxid
- * is already dangerously old.)
+ * Allocate dead_items memory using dead_items_alloc. This handles parallel
+ * VACUUM initialization as part of allocating shared memory space used for
+ * dead_items. (But do a failsafe precheck first, to ensure that parallel
+ * VACUUM won't be attempted at all when relfrozenxid is already dangerously
+ * old.)
*/
lazy_check_wraparound_failsafe(vacrel);
dead_items_alloc(vacrel, params->nworkers);
@@ -797,7 +801,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
* have collected the TIDs whose index tuples need to be removed.
*
* Finally, invokes lazy_vacuum_heap_rel to vacuum heap pages, which
- * largely consists of marking LP_DEAD items (from collected TID array)
+ * largely consists of marking LP_DEAD items (from vacrel->dead_items)
* as LP_UNUSED. This has to happen in a second, final pass over the
* heap, to preserve a basic invariant that all index AMs rely on: no
* extant index tuple can ever be allowed to contain a TID that points to
@@ -825,21 +829,21 @@ lazy_scan_heap(LVRelState *vacrel)
blkno,
next_unskippable_block,
next_fsm_block_to_vacuum = 0;
- VacDeadItems *dead_items = vacrel->dead_items;
+ TidStore *dead_items = vacrel->dead_items;
Buffer vmbuffer = InvalidBuffer;
bool next_unskippable_allvis,
skipping_current_range;
const int initprog_index[] = {
PROGRESS_VACUUM_PHASE,
PROGRESS_VACUUM_TOTAL_HEAP_BLKS,
- PROGRESS_VACUUM_MAX_DEAD_TUPLES
+ PROGRESS_VACUUM_MAX_DEAD_TUPLE_BYTES
};
int64 initprog_val[3];
/* Report that we're scanning the heap, advertising total # of blocks */
initprog_val[0] = PROGRESS_VACUUM_PHASE_SCAN_HEAP;
initprog_val[1] = rel_pages;
- initprog_val[2] = dead_items->max_items;
+ initprog_val[2] = TidStoreMaxMemory(vacrel->dead_items);
pgstat_progress_update_multi_param(3, initprog_index, initprog_val);
/* Set up an initial range of skippable blocks using the visibility map */
@@ -906,8 +910,7 @@ lazy_scan_heap(LVRelState *vacrel)
* dead_items TIDs, pause and do a cycle of vacuuming before we tackle
* this page.
*/
- Assert(dead_items->max_items >= MaxHeapTuplesPerPage);
- if (dead_items->max_items - dead_items->num_items < MaxHeapTuplesPerPage)
+ if (TidStoreIsFull(vacrel->dead_items))
{
/*
* Before beginning index vacuuming, we release any pin we may
@@ -969,7 +972,7 @@ lazy_scan_heap(LVRelState *vacrel)
continue;
}
- /* Collect LP_DEAD items in dead_items array, count tuples */
+ /* Collect LP_DEAD items in dead_items, count tuples */
if (lazy_scan_noprune(vacrel, buf, blkno, page, &hastup,
&recordfreespace))
{
@@ -1011,14 +1014,14 @@ lazy_scan_heap(LVRelState *vacrel)
* Prune, freeze, and count tuples.
*
* Accumulates details of remaining LP_DEAD line pointers on page in
- * dead_items array. This includes LP_DEAD line pointers that we
- * pruned ourselves, as well as existing LP_DEAD line pointers that
- * were pruned some time earlier. Also considers freezing XIDs in the
- * tuple headers of remaining items with storage.
+ * dead_items. This includes LP_DEAD line pointers that we pruned
+ * ourselves, as well as existing LP_DEAD line pointers that were pruned
+ * some time earlier. Also considers freezing XIDs in the tuple headers
+ * of remaining items with storage.
*/
lazy_scan_prune(vacrel, buf, blkno, page, &prunestate);
- Assert(!prunestate.all_visible || !prunestate.has_lpdead_items);
+ Assert(!prunestate.all_visible || (prunestate.num_offsets == 0));
/* Remember the location of the last page with nonremovable tuples */
if (prunestate.hastup)
@@ -1034,14 +1037,12 @@ lazy_scan_heap(LVRelState *vacrel)
* performed here can be thought of as the one-pass equivalent of
* a call to lazy_vacuum().
*/
- if (prunestate.has_lpdead_items)
+ if (prunestate.num_offsets > 0)
{
Size freespace;
- lazy_vacuum_heap_page(vacrel, blkno, buf, 0, vmbuffer);
-
- /* Forget the LP_DEAD items that we just vacuumed */
- dead_items->num_items = 0;
+ lazy_vacuum_heap_page(vacrel, blkno, prunestate.deadoffsets,
+ prunestate.num_offsets, buf, vmbuffer);
/*
* Periodically perform FSM vacuuming to make newly-freed
@@ -1078,7 +1079,16 @@ lazy_scan_heap(LVRelState *vacrel)
* with prunestate-driven visibility map and FSM steps (just like
* the two-pass strategy).
*/
- Assert(dead_items->num_items == 0);
+ Assert(TidStoreNumTids(dead_items) == 0);
+ }
+ else if (prunestate.num_offsets > 0)
+ {
+ /* Save details of the LP_DEAD items from the page in dead_items */
+ TidStoreSetBlockOffsets(dead_items, blkno, prunestate.deadoffsets,
+ prunestate.num_offsets);
+
+ pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
+ TidStoreMemoryUsage(dead_items));
}
/*
@@ -1145,7 +1155,7 @@ lazy_scan_heap(LVRelState *vacrel)
* There should never be LP_DEAD items on a page with PD_ALL_VISIBLE
* set, however.
*/
- else if (prunestate.has_lpdead_items && PageIsAllVisible(page))
+ else if ((prunestate.num_offsets > 0) && PageIsAllVisible(page))
{
elog(WARNING, "page containing LP_DEAD items is marked as all-visible in relation \"%s\" page %u",
vacrel->relname, blkno);
@@ -1193,7 +1203,7 @@ lazy_scan_heap(LVRelState *vacrel)
* Final steps for block: drop cleanup lock, record free space in the
* FSM
*/
- if (prunestate.has_lpdead_items && vacrel->do_index_vacuuming)
+ if ((prunestate.num_offsets > 0) && vacrel->do_index_vacuuming)
{
/*
* Wait until lazy_vacuum_heap_rel() to save free space. This
@@ -1249,7 +1259,7 @@ lazy_scan_heap(LVRelState *vacrel)
* Do index vacuuming (call each index's ambulkdelete routine), then do
* related heap vacuuming
*/
- if (dead_items->num_items > 0)
+ if (TidStoreNumTids(dead_items) > 0)
lazy_vacuum(vacrel);
/*
@@ -1524,9 +1534,9 @@ lazy_scan_new_or_empty(LVRelState *vacrel, Buffer buf, BlockNumber blkno,
* The approach we take now is to restart pruning when the race condition is
* detected. This allows heap_page_prune() to prune the tuples inserted by
* the now-aborted transaction. This is a little crude, but it guarantees
- * that any items that make it into the dead_items array are simple LP_DEAD
- * line pointers, and that every remaining item with tuple storage is
- * considered as a candidate for freezing.
+ * that any items that make it into the dead_items are simple LP_DEAD line
+ * pointers, and that every remaining item with tuple storage is considered
+ * as a candidate for freezing.
*/
static void
lazy_scan_prune(LVRelState *vacrel,
@@ -1543,13 +1553,11 @@ lazy_scan_prune(LVRelState *vacrel,
HTSV_Result res;
int tuples_deleted,
tuples_frozen,
- lpdead_items,
live_tuples,
recently_dead_tuples;
int nnewlpdead;
HeapPageFreeze pagefrz;
int64 fpi_before = pgWalUsage.wal_fpi;
- OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
HeapTupleFreeze frozen[MaxHeapTuplesPerPage];
Assert(BufferGetBlockNumber(buf) == blkno);
@@ -1571,7 +1579,6 @@ retry:
pagefrz.NoFreezePageRelminMxid = vacrel->NewRelminMxid;
tuples_deleted = 0;
tuples_frozen = 0;
- lpdead_items = 0;
live_tuples = 0;
recently_dead_tuples = 0;
@@ -1580,9 +1587,9 @@ retry:
*
* We count tuples removed by the pruning step as tuples_deleted. Its
* final value can be thought of as the number of tuples that have been
- * deleted from the table. It should not be confused with lpdead_items;
- * lpdead_items's final value can be thought of as the number of tuples
- * that were deleted from indexes.
+ * deleted from the table. It should not be confused with
+ * prunestate->deadoffsets; prunestate->deadoffsets's final value can
+ * be thought of as the number of tuples that were deleted from indexes.
*/
tuples_deleted = heap_page_prune(rel, buf, vacrel->vistest,
InvalidTransactionId, 0, &nnewlpdead,
@@ -1593,7 +1600,7 @@ retry:
* requiring freezing among remaining tuples with storage
*/
prunestate->hastup = false;
- prunestate->has_lpdead_items = false;
+ prunestate->num_offsets = 0;
prunestate->all_visible = true;
prunestate->all_frozen = true;
prunestate->visibility_cutoff_xid = InvalidTransactionId;
@@ -1638,7 +1645,7 @@ retry:
* (This is another case where it's useful to anticipate that any
* LP_DEAD items will become LP_UNUSED during the ongoing VACUUM.)
*/
- deadoffsets[lpdead_items++] = offnum;
+ prunestate->deadoffsets[prunestate->num_offsets++] = offnum;
continue;
}
@@ -1875,7 +1882,7 @@ retry:
*/
#ifdef USE_ASSERT_CHECKING
/* Note that all_frozen value does not matter when !all_visible */
- if (prunestate->all_visible && lpdead_items == 0)
+ if (prunestate->all_visible && prunestate->num_offsets == 0)
{
TransactionId cutoff;
bool all_frozen;
@@ -1888,28 +1895,9 @@ retry:
}
#endif
- /*
- * Now save details of the LP_DEAD items from the page in vacrel
- */
- if (lpdead_items > 0)
+ if (prunestate->num_offsets > 0)
{
- VacDeadItems *dead_items = vacrel->dead_items;
- ItemPointerData tmp;
-
vacrel->lpdead_item_pages++;
- prunestate->has_lpdead_items = true;
-
- ItemPointerSetBlockNumber(&tmp, blkno);
-
- for (int i = 0; i < lpdead_items; i++)
- {
- ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
- dead_items->items[dead_items->num_items++] = tmp;
- }
-
- Assert(dead_items->num_items <= dead_items->max_items);
- pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
- dead_items->num_items);
/*
* It was convenient to ignore LP_DEAD items in all_visible earlier on
@@ -1928,7 +1916,7 @@ retry:
/* Finally, add page-local counts to whole-VACUUM counts */
vacrel->tuples_deleted += tuples_deleted;
vacrel->tuples_frozen += tuples_frozen;
- vacrel->lpdead_items += lpdead_items;
+ vacrel->lpdead_items += prunestate->num_offsets;
vacrel->live_tuples += live_tuples;
vacrel->recently_dead_tuples += recently_dead_tuples;
}
@@ -1940,7 +1928,7 @@ retry:
* lazy_scan_prune, which requires a full cleanup lock. While pruning isn't
* performed here, it's quite possible that an earlier opportunistic pruning
* operation left LP_DEAD items behind. We'll at least collect any such items
- * in the dead_items array for removal from indexes.
+ * in the dead_items for removal from indexes.
*
* For aggressive VACUUM callers, we may return false to indicate that a full
* cleanup lock is required for processing by lazy_scan_prune. This is only
@@ -2099,7 +2087,7 @@ lazy_scan_noprune(LVRelState *vacrel,
vacrel->NewRelfrozenXid = NoFreezePageRelfrozenXid;
vacrel->NewRelminMxid = NoFreezePageRelminMxid;
- /* Save any LP_DEAD items found on the page in dead_items array */
+ /* Save any LP_DEAD items found on the page in dead_items */
if (vacrel->nindexes == 0)
{
/* Using one-pass strategy (since table has no indexes) */
@@ -2129,8 +2117,7 @@ lazy_scan_noprune(LVRelState *vacrel,
}
else
{
- VacDeadItems *dead_items = vacrel->dead_items;
- ItemPointerData tmp;
+ TidStore *dead_items = vacrel->dead_items;
/*
* Page has LP_DEAD items, and so any references/TIDs that remain in
@@ -2139,17 +2126,10 @@ lazy_scan_noprune(LVRelState *vacrel,
*/
vacrel->lpdead_item_pages++;
- ItemPointerSetBlockNumber(&tmp, blkno);
+ TidStoreSetBlockOffsets(dead_items, blkno, deadoffsets, lpdead_items);
- for (int i = 0; i < lpdead_items; i++)
- {
- ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
- dead_items->items[dead_items->num_items++] = tmp;
- }
-
- Assert(dead_items->num_items <= dead_items->max_items);
- pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
- dead_items->num_items);
+ pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
+ TidStoreMemoryUsage(dead_items));
vacrel->lpdead_items += lpdead_items;
@@ -2198,7 +2178,7 @@ lazy_vacuum(LVRelState *vacrel)
if (!vacrel->do_index_vacuuming)
{
Assert(!vacrel->do_index_cleanup);
- vacrel->dead_items->num_items = 0;
+ TidStoreReset(vacrel->dead_items);
return;
}
@@ -2227,7 +2207,7 @@ lazy_vacuum(LVRelState *vacrel)
BlockNumber threshold;
Assert(vacrel->num_index_scans == 0);
- Assert(vacrel->lpdead_items == vacrel->dead_items->num_items);
+ Assert(vacrel->lpdead_items == TidStoreNumTids(vacrel->dead_items));
Assert(vacrel->do_index_vacuuming);
Assert(vacrel->do_index_cleanup);
@@ -2254,8 +2234,8 @@ lazy_vacuum(LVRelState *vacrel)
* cases then this may need to be reconsidered.
*/
threshold = (double) vacrel->rel_pages * BYPASS_THRESHOLD_PAGES;
- bypass = (vacrel->lpdead_item_pages < threshold &&
- vacrel->lpdead_items < MAXDEADITEMS(32L * 1024L * 1024L));
+ bypass = (vacrel->lpdead_item_pages < threshold) &&
+ TidStoreMemoryUsage(vacrel->dead_items) < (32L * 1024L * 1024L);
}
if (bypass)
@@ -2300,7 +2280,7 @@ lazy_vacuum(LVRelState *vacrel)
* Forget the LP_DEAD items that we just vacuumed (or just decided to not
* vacuum)
*/
- vacrel->dead_items->num_items = 0;
+ TidStoreReset(vacrel->dead_items);
}
/*
@@ -2373,7 +2353,7 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
* place).
*/
Assert(vacrel->num_index_scans > 0 ||
- vacrel->dead_items->num_items == vacrel->lpdead_items);
+ TidStoreNumTids(vacrel->dead_items) == vacrel->lpdead_items);
Assert(allindexes || vacrel->failsafe_active);
/*
@@ -2392,9 +2372,8 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
/*
* lazy_vacuum_heap_rel() -- second pass over the heap for two pass strategy
*
- * This routine marks LP_DEAD items in vacrel->dead_items array as LP_UNUSED.
- * Pages that never had lazy_scan_prune record LP_DEAD items are not visited
- * at all.
+ * This routine marks LP_DEAD items in vacrel->dead_items as LP_UNUSED. Pages
+ * that never had lazy_scan_prune record LP_DEAD items are not visited at all.
*
* We may also be able to truncate the line pointer array of the heap pages we
* visit. If there is a contiguous group of LP_UNUSED items at the end of the
@@ -2410,10 +2389,11 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
static void
lazy_vacuum_heap_rel(LVRelState *vacrel)
{
- int index = 0;
BlockNumber vacuumed_pages = 0;
Buffer vmbuffer = InvalidBuffer;
LVSavedErrInfo saved_err_info;
+ TidStoreIter *iter;
+ TidStoreIterResult *iter_result;
Assert(vacrel->do_index_vacuuming);
Assert(vacrel->do_index_cleanup);
@@ -2428,7 +2408,8 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
VACUUM_ERRCB_PHASE_VACUUM_HEAP,
InvalidBlockNumber, InvalidOffsetNumber);
- while (index < vacrel->dead_items->num_items)
+ iter = TidStoreBeginIterate(vacrel->dead_items);
+ while ((iter_result = TidStoreIterateNext(iter)) != NULL)
{
BlockNumber blkno;
Buffer buf;
@@ -2437,7 +2418,7 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
vacuum_delay_point();
- blkno = ItemPointerGetBlockNumber(&vacrel->dead_items->items[index]);
+ blkno = iter_result->blkno;
vacrel->blkno = blkno;
/*
@@ -2451,7 +2432,8 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
buf = ReadBufferExtended(vacrel->rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
vacrel->bstrategy);
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
- index = lazy_vacuum_heap_page(vacrel, blkno, buf, index, vmbuffer);
+ lazy_vacuum_heap_page(vacrel, blkno, iter_result->offsets,
+ iter_result->num_offsets, buf, vmbuffer);
/* Now that we've vacuumed the page, record its available space */
page = BufferGetPage(buf);
@@ -2461,6 +2443,7 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
vacuumed_pages++;
}
+ TidStoreEndIterate(iter);
vacrel->blkno = InvalidBlockNumber;
if (BufferIsValid(vmbuffer))
@@ -2470,36 +2453,31 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
* We set all LP_DEAD items from the first heap pass to LP_UNUSED during
* the second heap pass. No more, no less.
*/
- Assert(index > 0);
Assert(vacrel->num_index_scans > 1 ||
- (index == vacrel->lpdead_items &&
+ (TidStoreNumTids(vacrel->dead_items) == vacrel->lpdead_items &&
vacuumed_pages == vacrel->lpdead_item_pages));
ereport(DEBUG2,
- (errmsg("table \"%s\": removed %lld dead item identifiers in %u pages",
- vacrel->relname, (long long) index, vacuumed_pages)));
+ (errmsg("table \"%s\": removed " INT64_FORMAT " dead item identifiers in %u pages",
+ vacrel->relname, TidStoreNumTids(vacrel->dead_items),
+ vacuumed_pages)));
/* Revert to the previous phase information for error traceback */
restore_vacuum_error_info(vacrel, &saved_err_info);
}
/*
- * lazy_vacuum_heap_page() -- free page's LP_DEAD items listed in the
- * vacrel->dead_items array.
+ * lazy_vacuum_heap_page() -- free page's LP_DEAD items.
*
* Caller must have an exclusive buffer lock on the buffer (though a full
* cleanup lock is also acceptable). vmbuffer must be valid and already have
* a pin on blkno's visibility map page.
- *
- * index is an offset into the vacrel->dead_items array for the first listed
- * LP_DEAD item on the page. The return value is the first index immediately
- * after all LP_DEAD items for the same page in the array.
*/
-static int
-lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
- int index, Buffer vmbuffer)
+static void
+lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
+ OffsetNumber *deadoffsets, int num_offsets, Buffer buffer,
+ Buffer vmbuffer)
{
- VacDeadItems *dead_items = vacrel->dead_items;
Page page = BufferGetPage(buffer);
OffsetNumber unused[MaxHeapTuplesPerPage];
int nunused = 0;
@@ -2518,16 +2496,11 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
START_CRIT_SECTION();
- for (; index < dead_items->num_items; index++)
+ for (int i = 0; i < num_offsets; i++)
{
- BlockNumber tblk;
- OffsetNumber toff;
ItemId itemid;
+ OffsetNumber toff = deadoffsets[i];
- tblk = ItemPointerGetBlockNumber(&dead_items->items[index]);
- if (tblk != blkno)
- break; /* past end of tuples for this block */
- toff = ItemPointerGetOffsetNumber(&dead_items->items[index]);
itemid = PageGetItemId(page, toff);
Assert(ItemIdIsDead(itemid) && !ItemIdHasStorage(itemid));
@@ -2597,7 +2570,6 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
/* Revert to the previous phase information for error traceback */
restore_vacuum_error_info(vacrel, &saved_err_info);
- return index;
}
/*
@@ -2687,8 +2659,8 @@ lazy_cleanup_all_indexes(LVRelState *vacrel)
* lazy_vacuum_one_index() -- vacuum index relation.
*
* Delete all the index tuples containing a TID collected in
- * vacrel->dead_items array. Also update running statistics.
- * Exact details depend on index AM's ambulkdelete routine.
+ * vacrel->dead_items. Also update running statistics. Exact
+ * details depend on index AM's ambulkdelete routine.
*
* reltuples is the number of heap tuples to be passed to the
* bulkdelete callback. It's always assumed to be estimated.
@@ -3094,48 +3066,8 @@ count_nondeletable_pages(LVRelState *vacrel, bool *lock_waiter_detected)
}
/*
- * Returns the number of dead TIDs that VACUUM should allocate space to
- * store, given a heap rel of size vacrel->rel_pages, and given current
- * maintenance_work_mem setting (or current autovacuum_work_mem setting,
- * when applicable).
- *
- * See the comments at the head of this file for rationale.
- */
-static int
-dead_items_max_items(LVRelState *vacrel)
-{
- int64 max_items;
- int vac_work_mem = IsAutoVacuumWorkerProcess() &&
- autovacuum_work_mem != -1 ?
- autovacuum_work_mem : maintenance_work_mem;
-
- if (vacrel->nindexes > 0)
- {
- BlockNumber rel_pages = vacrel->rel_pages;
-
- max_items = MAXDEADITEMS(vac_work_mem * 1024L);
- max_items = Min(max_items, INT_MAX);
- max_items = Min(max_items, MAXDEADITEMS(MaxAllocSize));
-
- /* curious coding here to ensure the multiplication can't overflow */
- if ((BlockNumber) (max_items / MaxHeapTuplesPerPage) > rel_pages)
- max_items = rel_pages * MaxHeapTuplesPerPage;
-
- /* stay sane if small maintenance_work_mem */
- max_items = Max(max_items, MaxHeapTuplesPerPage);
- }
- else
- {
- /* One-pass case only stores a single heap page's TIDs at a time */
- max_items = MaxHeapTuplesPerPage;
- }
-
- return (int) max_items;
-}
-
-/*
- * Allocate dead_items (either using palloc, or in dynamic shared memory).
- * Sets dead_items in vacrel for caller.
+ * Allocate a (local or shared) TidStore for storing dead TIDs. Sets dead_items
+ * in vacrel for caller.
*
* Also handles parallel initialization as part of allocating dead_items in
* DSM when required.
@@ -3143,11 +3075,9 @@ dead_items_max_items(LVRelState *vacrel)
static void
dead_items_alloc(LVRelState *vacrel, int nworkers)
{
- VacDeadItems *dead_items;
- int max_items;
-
- max_items = dead_items_max_items(vacrel);
- Assert(max_items >= MaxHeapTuplesPerPage);
+ int vac_work_mem = IsAutoVacuumWorkerProcess() &&
+ autovacuum_work_mem != -1 ?
+ autovacuum_work_mem * 1024L : maintenance_work_mem * 1024L;
/*
* Initialize state for a parallel vacuum. As of now, only one worker can
@@ -3174,7 +3104,7 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
else
vacrel->pvs = parallel_vacuum_init(vacrel->rel, vacrel->indrels,
vacrel->nindexes, nworkers,
- max_items,
+ vac_work_mem, MaxHeapTuplesPerPage,
vacrel->verbose ? INFO : DEBUG2,
vacrel->bstrategy);
@@ -3187,11 +3117,8 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
}
/* Serial VACUUM case */
- dead_items = (VacDeadItems *) palloc(vac_max_items_to_alloc_size(max_items));
- dead_items->max_items = max_items;
- dead_items->num_items = 0;
-
- vacrel->dead_items = dead_items;
+ vacrel->dead_items = TidStoreCreate(vac_work_mem, MaxHeapTuplesPerPage,
+ NULL);
}
/*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 34ca0e739f..149d41b41c 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1180,7 +1180,7 @@ CREATE VIEW pg_stat_progress_vacuum AS
END AS phase,
S.param2 AS heap_blks_total, S.param3 AS heap_blks_scanned,
S.param4 AS heap_blks_vacuumed, S.param5 AS index_vacuum_count,
- S.param6 AS max_dead_tuples, S.param7 AS num_dead_tuples
+ S.param6 AS max_dead_tuple_bytes, S.param7 AS dead_tuple_bytes
FROM pg_stat_get_progress_info('VACUUM') AS S
LEFT JOIN pg_database D ON S.datid = D.oid;
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index aa79d9de4d..5fb30d7e62 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -97,7 +97,6 @@ static bool vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
static double compute_parallel_delay(void);
static VacOptValue get_vacoptval_from_boolean(DefElem *def);
static bool vac_tid_reaped(ItemPointer itemptr, void *state);
-static int vac_cmp_itemptr(const void *left, const void *right);
/*
* Primary entry point for manual VACUUM and ANALYZE commands
@@ -2303,16 +2302,16 @@ get_vacoptval_from_boolean(DefElem *def)
*/
IndexBulkDeleteResult *
vac_bulkdel_one_index(IndexVacuumInfo *ivinfo, IndexBulkDeleteResult *istat,
- VacDeadItems *dead_items)
+ TidStore *dead_items)
{
/* Do bulk deletion */
istat = index_bulk_delete(ivinfo, istat, vac_tid_reaped,
(void *) dead_items);
ereport(ivinfo->message_level,
- (errmsg("scanned index \"%s\" to remove %d row versions",
+ (errmsg("scanned index \"%s\" to remove " UINT64_FORMAT " row versions",
RelationGetRelationName(ivinfo->index),
- dead_items->num_items)));
+ TidStoreNumTids(dead_items))));
return istat;
}
@@ -2343,82 +2342,15 @@ vac_cleanup_one_index(IndexVacuumInfo *ivinfo, IndexBulkDeleteResult *istat)
return istat;
}
-/*
- * Returns the total required space for VACUUM's dead_items array given a
- * max_items value.
- */
-Size
-vac_max_items_to_alloc_size(int max_items)
-{
- Assert(max_items <= MAXDEADITEMS(MaxAllocSize));
-
- return offsetof(VacDeadItems, items) + sizeof(ItemPointerData) * max_items;
-}
-
/*
* vac_tid_reaped() -- is a particular tid deletable?
*
* This has the right signature to be an IndexBulkDeleteCallback.
- *
- * Assumes dead_items array is sorted (in ascending TID order).
*/
static bool
vac_tid_reaped(ItemPointer itemptr, void *state)
{
- VacDeadItems *dead_items = (VacDeadItems *) state;
- int64 litem,
- ritem,
- item;
- ItemPointer res;
-
- litem = itemptr_encode(&dead_items->items[0]);
- ritem = itemptr_encode(&dead_items->items[dead_items->num_items - 1]);
- item = itemptr_encode(itemptr);
-
- /*
- * Doing a simple bound check before bsearch() is useful to avoid the
- * extra cost of bsearch(), especially if dead items on the heap are
- * concentrated in a certain range. Since this function is called for
- * every index tuple, it pays to be really fast.
- */
- if (item < litem || item > ritem)
- return false;
-
- res = (ItemPointer) bsearch(itemptr,
- dead_items->items,
- dead_items->num_items,
- sizeof(ItemPointerData),
- vac_cmp_itemptr);
-
- return (res != NULL);
-}
-
-/*
- * Comparator routines for use with qsort() and bsearch().
- */
-static int
-vac_cmp_itemptr(const void *left, const void *right)
-{
- BlockNumber lblk,
- rblk;
- OffsetNumber loff,
- roff;
-
- lblk = ItemPointerGetBlockNumber((ItemPointer) left);
- rblk = ItemPointerGetBlockNumber((ItemPointer) right);
-
- if (lblk < rblk)
- return -1;
- if (lblk > rblk)
- return 1;
-
- loff = ItemPointerGetOffsetNumber((ItemPointer) left);
- roff = ItemPointerGetOffsetNumber((ItemPointer) right);
-
- if (loff < roff)
- return -1;
- if (loff > roff)
- return 1;
+ TidStore *dead_items = (TidStore *) state;
- return 0;
+ return TidStoreIsMember(dead_items, itemptr);
}
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index bcd40c80a1..9225daf3ab 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -9,10 +9,10 @@
* In a parallel vacuum, we perform both index bulk deletion and index cleanup
* with parallel worker processes. Individual indexes are processed by one
* vacuum process. ParalleVacuumState contains shared information as well as
- * the memory space for storing dead items allocated in the DSM segment. We
+ * the memory space for storing dead items allocated in the DSA area. We
* launch parallel worker processes at the start of parallel index
* bulk-deletion and index cleanup and once all indexes are processed, the
- * parallel worker processes exit. Each time we process indexes in parallel,
+ * parallel worker processes exit. Each time we process indexes in parallel,
* the parallel context is re-initialized so that the same DSM can be used for
* multiple passes of index bulk-deletion and index cleanup.
*
@@ -103,6 +103,9 @@ typedef struct PVShared
/* Counter for vacuuming and cleanup */
pg_atomic_uint32 idx;
+
+ /* Handle of the shared TidStore */
+ TidStoreHandle dead_items_handle;
} PVShared;
/* Status used during parallel index vacuum or cleanup */
@@ -166,7 +169,8 @@ struct ParallelVacuumState
PVIndStats *indstats;
/* Shared dead items space among parallel vacuum workers */
- VacDeadItems *dead_items;
+ TidStore *dead_items;
+ dsa_area *dead_items_area;
/* Points to buffer usage area in DSM */
BufferUsage *buffer_usage;
@@ -222,20 +226,23 @@ static void parallel_vacuum_error_callback(void *arg);
*/
ParallelVacuumState *
parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
- int nrequested_workers, int max_items,
- int elevel, BufferAccessStrategy bstrategy)
+ int nrequested_workers, int vac_work_mem,
+ int max_offset, int elevel,
+ BufferAccessStrategy bstrategy)
{
ParallelVacuumState *pvs;
ParallelContext *pcxt;
PVShared *shared;
- VacDeadItems *dead_items;
+ TidStore *dead_items;
PVIndStats *indstats;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
+ void *area_space;
+ dsa_area *dead_items_dsa;
bool *will_parallel_vacuum;
Size est_indstats_len;
Size est_shared_len;
- Size est_dead_items_len;
+ Size dsa_minsize = dsa_minimum_size();
int nindexes_mwm = 0;
int parallel_workers = 0;
int querylen;
@@ -283,9 +290,8 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_estimate_chunk(&pcxt->estimator, est_shared_len);
shm_toc_estimate_keys(&pcxt->estimator, 1);
- /* Estimate size for dead_items -- PARALLEL_VACUUM_KEY_DEAD_ITEMS */
- est_dead_items_len = vac_max_items_to_alloc_size(max_items);
- shm_toc_estimate_chunk(&pcxt->estimator, est_dead_items_len);
+ /* Initial size for dead tuple DSA -- PARALLEL_VACUUM_KEY_DEAD_ITEMS */
+ shm_toc_estimate_chunk(&pcxt->estimator, dsa_minsize);
shm_toc_estimate_keys(&pcxt->estimator, 1);
/*
@@ -351,6 +357,16 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_INDEX_STATS, indstats);
pvs->indstats = indstats;
+ /* Prepare DSA space for dead items */
+ area_space = shm_toc_allocate(pcxt->toc, dsa_minsize);
+ shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_DEAD_ITEMS, area_space);
+ dead_items_dsa = dsa_create_in_place(area_space, dsa_minsize,
+ LWTRANCHE_PARALLEL_VACUUM_DSA,
+ pcxt->seg);
+ dead_items = TidStoreCreate(vac_work_mem, max_offset, dead_items_dsa);
+ pvs->dead_items = dead_items;
+ pvs->dead_items_area = dead_items_dsa;
+
/* Prepare shared information */
shared = (PVShared *) shm_toc_allocate(pcxt->toc, est_shared_len);
MemSet(shared, 0, est_shared_len);
@@ -360,6 +376,7 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
(nindexes_mwm > 0) ?
maintenance_work_mem / Min(parallel_workers, nindexes_mwm) :
maintenance_work_mem;
+ shared->dead_items_handle = TidStoreGetHandle(dead_items);
pg_atomic_init_u32(&(shared->cost_balance), 0);
pg_atomic_init_u32(&(shared->active_nworkers), 0);
@@ -368,15 +385,6 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_SHARED, shared);
pvs->shared = shared;
- /* Prepare the dead_items space */
- dead_items = (VacDeadItems *) shm_toc_allocate(pcxt->toc,
- est_dead_items_len);
- dead_items->max_items = max_items;
- dead_items->num_items = 0;
- MemSet(dead_items->items, 0, sizeof(ItemPointerData) * max_items);
- shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_DEAD_ITEMS, dead_items);
- pvs->dead_items = dead_items;
-
/*
* Allocate space for each worker's BufferUsage and WalUsage; no need to
* initialize
@@ -434,6 +442,9 @@ parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats)
istats[i] = NULL;
}
+ TidStoreDestroy(pvs->dead_items);
+ dsa_detach(pvs->dead_items_area);
+
DestroyParallelContext(pvs->pcxt);
ExitParallelMode();
@@ -442,7 +453,7 @@ parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats)
}
/* Returns the dead items space */
-VacDeadItems *
+TidStore *
parallel_vacuum_get_dead_items(ParallelVacuumState *pvs)
{
return pvs->dead_items;
@@ -940,7 +951,9 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
Relation *indrels;
PVIndStats *indstats;
PVShared *shared;
- VacDeadItems *dead_items;
+ TidStore *dead_items;
+ void *area_space;
+ dsa_area *dead_items_area;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
int nindexes;
@@ -984,10 +997,10 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
PARALLEL_VACUUM_KEY_INDEX_STATS,
false);
- /* Set dead_items space */
- dead_items = (VacDeadItems *) shm_toc_lookup(toc,
- PARALLEL_VACUUM_KEY_DEAD_ITEMS,
- false);
+ /* Set dead items */
+ area_space = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_DEAD_ITEMS, false);
+ dead_items_area = dsa_attach_in_place(area_space, seg);
+ dead_items = TidStoreAttach(dead_items_area, shared->dead_items_handle);
/* Set cost-based vacuum delay */
VacuumCostActive = (VacuumCostDelay > 0);
@@ -1033,6 +1046,9 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
&wal_usage[ParallelWorkerNumber]);
+ TidStoreDetach(dead_items);
+ dsa_detach(dead_items_area);
+
/* Pop the error context stack */
error_context_stack = errcallback.previous;
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index ff6149a179..a371f6fbba 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -3397,12 +3397,12 @@ check_autovacuum_work_mem(int *newval, void **extra, GucSource source)
return true;
/*
- * We clamp manually-set values to at least 1MB. Since
+ * We clamp manually-set values to at least 2MB. Since
* maintenance_work_mem is always set to at least this value, do the same
* here.
*/
- if (*newval < 1024)
- *newval = 1024;
+ if (*newval < 2048)
+ *newval = 2048;
return true;
}
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 55b3a04097..c223a7dc94 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -192,6 +192,8 @@ static const char *const BuiltinTrancheNames[] = {
"LogicalRepLauncherDSA",
/* LWTRANCHE_LAUNCHER_HASH: */
"LogicalRepLauncherHash",
+ /* LWTRANCHE_PARALLEL_VACUUM_DSA: */
+ "ParallelVacuumDSA",
};
StaticAssertDecl(lengthof(BuiltinTrancheNames) ==
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 1c0583fe26..8a64614cd1 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2313,7 +2313,7 @@ struct config_int ConfigureNamesInt[] =
GUC_UNIT_KB
},
&maintenance_work_mem,
- 65536, 1024, MAX_KILOBYTES,
+ 65536, 2048, MAX_KILOBYTES,
NULL, NULL, NULL
},
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index e5add41352..b209d3cf84 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -23,8 +23,8 @@
#define PROGRESS_VACUUM_HEAP_BLKS_SCANNED 2
#define PROGRESS_VACUUM_HEAP_BLKS_VACUUMED 3
#define PROGRESS_VACUUM_NUM_INDEX_VACUUMS 4
-#define PROGRESS_VACUUM_MAX_DEAD_TUPLES 5
-#define PROGRESS_VACUUM_NUM_DEAD_TUPLES 6
+#define PROGRESS_VACUUM_MAX_DEAD_TUPLE_BYTES 5
+#define PROGRESS_VACUUM_DEAD_TUPLE_BYTES 6
/* Phases of vacuum (as advertised via PROGRESS_VACUUM_PHASE) */
#define PROGRESS_VACUUM_PHASE_SCAN_HEAP 1
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 689dbb7702..a3ebb169ef 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -17,6 +17,7 @@
#include "access/htup.h"
#include "access/genam.h"
#include "access/parallel.h"
+#include "access/tidstore.h"
#include "catalog/pg_class.h"
#include "catalog/pg_statistic.h"
#include "catalog/pg_type.h"
@@ -276,21 +277,6 @@ struct VacuumCutoffs
MultiXactId MultiXactCutoff;
};
-/*
- * VacDeadItems stores TIDs whose index tuples are deleted by index vacuuming.
- */
-typedef struct VacDeadItems
-{
- int max_items; /* # slots allocated in array */
- int num_items; /* current # of entries */
-
- /* Sorted array of TIDs to delete from indexes */
- ItemPointerData items[FLEXIBLE_ARRAY_MEMBER];
-} VacDeadItems;
-
-#define MAXDEADITEMS(avail_mem) \
- (((avail_mem) - offsetof(VacDeadItems, items)) / sizeof(ItemPointerData))
-
/* GUC parameters */
extern PGDLLIMPORT int default_statistics_target; /* PGDLLIMPORT for PostGIS */
extern PGDLLIMPORT int vacuum_freeze_min_age;
@@ -339,18 +325,17 @@ extern Relation vacuum_open_relation(Oid relid, RangeVar *relation,
LOCKMODE lmode);
extern IndexBulkDeleteResult *vac_bulkdel_one_index(IndexVacuumInfo *ivinfo,
IndexBulkDeleteResult *istat,
- VacDeadItems *dead_items);
+ TidStore *dead_items);
extern IndexBulkDeleteResult *vac_cleanup_one_index(IndexVacuumInfo *ivinfo,
IndexBulkDeleteResult *istat);
-extern Size vac_max_items_to_alloc_size(int max_items);
/* in commands/vacuumparallel.c */
extern ParallelVacuumState *parallel_vacuum_init(Relation rel, Relation *indrels,
int nindexes, int nrequested_workers,
- int max_items, int elevel,
- BufferAccessStrategy bstrategy);
+ int vac_work_mem, int max_offset,
+ int elevel, BufferAccessStrategy bstrategy);
extern void parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats);
-extern VacDeadItems *parallel_vacuum_get_dead_items(ParallelVacuumState *pvs);
+extern TidStore *parallel_vacuum_get_dead_items(ParallelVacuumState *pvs);
extern void parallel_vacuum_bulkdel_all_indexes(ParallelVacuumState *pvs,
long num_table_tuples,
int num_index_scans);
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 07002fdfbe..537b34b30c 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -207,6 +207,7 @@ typedef enum BuiltinTrancheIds
LWTRANCHE_PGSTATS_DATA,
LWTRANCHE_LAUNCHER_DSA,
LWTRANCHE_LAUNCHER_HASH,
+ LWTRANCHE_PARALLEL_VACUUM_DSA,
LWTRANCHE_FIRST_USER_DEFINED
} BuiltinTrancheIds;
diff --git a/src/test/regress/expected/cluster.out b/src/test/regress/expected/cluster.out
index 2eec483eaa..e04f50726f 100644
--- a/src/test/regress/expected/cluster.out
+++ b/src/test/regress/expected/cluster.out
@@ -526,7 +526,7 @@ create index cluster_sort on clstr_4 (hundred, thousand, tenthous);
-- ensure we don't use the index in CLUSTER nor the checking SELECTs
set enable_indexscan = off;
-- Use external sort:
-set maintenance_work_mem = '1MB';
+set maintenance_work_mem = '2MB';
cluster clstr_4 using cluster_sort;
select * from
(select hundred, lag(hundred) over () as lhundred,
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index acfd9d1f4f..d320ad87dd 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -1214,7 +1214,7 @@ DROP TABLE unlogged_hash_table;
-- CREATE INDEX hash_ovfl_index ON hash_ovfl_heap USING hash (x int4_ops);
-- Test hash index build tuplesorting. Force hash tuplesort using low
-- maintenance_work_mem setting and fillfactor:
-SET maintenance_work_mem = '1MB';
+SET maintenance_work_mem = '2MB';
CREATE INDEX hash_tuplesort_idx ON tenk1 USING hash (stringu1 name_ops) WITH (fillfactor = 10);
EXPLAIN (COSTS OFF)
SELECT count(*) FROM tenk1 WHERE stringu1 = 'TVAAAA';
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index e953d1f515..ef46c2994f 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2032,8 +2032,8 @@ pg_stat_progress_vacuum| SELECT s.pid,
s.param3 AS heap_blks_scanned,
s.param4 AS heap_blks_vacuumed,
s.param5 AS index_vacuum_count,
- s.param6 AS max_dead_tuples,
- s.param7 AS num_dead_tuples
+ s.param6 AS max_dead_tuple_bytes,
+ s.param7 AS dead_tuple_bytes
FROM (pg_stat_get_progress_info('VACUUM'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
LEFT JOIN pg_database d ON ((s.datid = d.oid)));
pg_stat_recovery_prefetch| SELECT stats_reset,
diff --git a/src/test/regress/sql/cluster.sql b/src/test/regress/sql/cluster.sql
index a4cfaae807..a4cb5b98a5 100644
--- a/src/test/regress/sql/cluster.sql
+++ b/src/test/regress/sql/cluster.sql
@@ -258,7 +258,7 @@ create index cluster_sort on clstr_4 (hundred, thousand, tenthous);
set enable_indexscan = off;
-- Use external sort:
-set maintenance_work_mem = '1MB';
+set maintenance_work_mem = '2MB';
cluster clstr_4 using cluster_sort;
select * from
(select hundred, lag(hundred) over () as lhundred,
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index d49ce9f300..d6e2471b00 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -367,7 +367,7 @@ DROP TABLE unlogged_hash_table;
-- Test hash index build tuplesorting. Force hash tuplesort using low
-- maintenance_work_mem setting and fillfactor:
-SET maintenance_work_mem = '1MB';
+SET maintenance_work_mem = '2MB';
CREATE INDEX hash_tuplesort_idx ON tenk1 USING hash (stringu1 name_ops) WITH (fillfactor = 10);
EXPLAIN (COSTS OFF)
SELECT count(*) FROM tenk1 WHERE stringu1 = 'TVAAAA';
--
2.31.1
Attachment: v30-0004-Add-TIDStore-to-store-sets-of-TIDs-ItemPointerDa.patch
From 95bb8dc701efa4a5923a355880b60885dc18cfa3 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 23 Dec 2022 23:50:39 +0900
Subject: [PATCH v30 04/11] Add TIDStore, to store sets of TIDs
(ItemPointerData) efficiently.
The TIDStore is designed to store large sets of TIDs efficiently, and
is backed by the radix tree. A TID is encoded into a 64-bit key and
value and inserted into the radix tree.
The TIDStore is not used for anything yet, aside from the test code,
but the follow-up patch integrates the TIDStore with lazy vacuum,
reducing lazy vacuum memory usage and lifting the 1GB limit on its
size, by storing the list of dead TIDs more efficiently.
This includes a unit test module, in src/test/modules/test_tidstore.
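
For readers following along, here is a minimal standalone sketch (not part
of the patch) of the key/value encoding described above, assuming 8kB heap
pages, i.e. 9 bits for offset numbers; the constant and function names are
illustrative only:

    #include <stdint.h>

    /* Illustrative constants; the patch derives these from max_off at runtime. */
    #define MAX_OFF_NBITS      9   /* ceil(log2(MaxHeapTuplesPerPage)) on 8kB pages */
    #define LOWER_OFFSET_NBITS 6   /* 2^6 = 64 bit positions in a 64-bit value */
    #define LOWER_OFFSET_MASK  ((1 << LOWER_OFFSET_NBITS) - 1)

    /* Encode (block, offset) into a radix tree key and a single bit in the value. */
    static void
    encode_tid_sketch(uint32_t block, uint16_t offset,
                      uint64_t *key, uint64_t *value_bit)
    {
        /* Combine block and offset into one integer, block in the upper bits. */
        uint64_t compressed = ((uint64_t) block << MAX_OFF_NBITS) | offset;

        /* The lowest 6 bits of the offset select a bit within the 64-bit value... */
        *value_bit = UINT64_C(1) << (offset & LOWER_OFFSET_MASK);

        /* ...and the remaining upper bits form the key. */
        *key = compressed >> LOWER_OFFSET_NBITS;
    }

With this scheme, up to 64 nearby offsets share one key, so a heap page's
dead tids typically occupy only a handful of key/value pairs in the radix
tree.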
---
doc/src/sgml/monitoring.sgml | 4 +
src/backend/access/common/Makefile | 1 +
src/backend/access/common/meson.build | 1 +
src/backend/access/common/tidstore.c | 710 ++++++++++++++++++
src/backend/storage/lmgr/lwlock.c | 2 +
src/include/access/tidstore.h | 50 ++
src/include/storage/lwlock.h | 1 +
src/test/modules/Makefile | 1 +
src/test/modules/meson.build | 1 +
src/test/modules/test_tidstore/Makefile | 23 +
.../test_tidstore/expected/test_tidstore.out | 13 +
src/test/modules/test_tidstore/meson.build | 35 +
.../test_tidstore/sql/test_tidstore.sql | 7 +
.../test_tidstore/test_tidstore--1.0.sql | 8 +
.../modules/test_tidstore/test_tidstore.c | 228 ++++++
.../test_tidstore/test_tidstore.control | 4 +
16 files changed, 1089 insertions(+)
create mode 100644 src/backend/access/common/tidstore.c
create mode 100644 src/include/access/tidstore.h
create mode 100644 src/test/modules/test_tidstore/Makefile
create mode 100644 src/test/modules/test_tidstore/expected/test_tidstore.out
create mode 100644 src/test/modules/test_tidstore/meson.build
create mode 100644 src/test/modules/test_tidstore/sql/test_tidstore.sql
create mode 100644 src/test/modules/test_tidstore/test_tidstore--1.0.sql
create mode 100644 src/test/modules/test_tidstore/test_tidstore.c
create mode 100644 src/test/modules/test_tidstore/test_tidstore.control
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 6249bb50d0..97d588b1d8 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -2203,6 +2203,10 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
<entry>Waiting to access a shared TID bitmap during a parallel bitmap
index scan.</entry>
</row>
+ <row>
+ <entry><literal>SharedTidStore</literal></entry>
+ <entry>Waiting to access a shared TID store.</entry>
+ </row>
<row>
<entry><literal>SharedTupleStore</literal></entry>
<entry>Waiting to access a shared tuple store during parallel
diff --git a/src/backend/access/common/Makefile b/src/backend/access/common/Makefile
index b9aff0ccfd..67b8cc6108 100644
--- a/src/backend/access/common/Makefile
+++ b/src/backend/access/common/Makefile
@@ -27,6 +27,7 @@ OBJS = \
syncscan.o \
toast_compression.o \
toast_internals.o \
+ tidstore.o \
tupconvert.o \
tupdesc.o
diff --git a/src/backend/access/common/meson.build b/src/backend/access/common/meson.build
index f5ac17b498..fce19c09ce 100644
--- a/src/backend/access/common/meson.build
+++ b/src/backend/access/common/meson.build
@@ -15,6 +15,7 @@ backend_sources += files(
'syncscan.c',
'toast_compression.c',
'toast_internals.c',
+ 'tidstore.c',
'tupconvert.c',
'tupdesc.c',
)
diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
new file mode 100644
index 0000000000..2d6f2b3ab9
--- /dev/null
+++ b/src/backend/access/common/tidstore.c
@@ -0,0 +1,710 @@
+/*-------------------------------------------------------------------------
+ *
+ * tidstore.c
+ * Tid (ItemPointerData) storage implementation.
+ *
+ * TidStore is an in-memory data structure for storing tids (ItemPointerData).
+ * Internally, a tid is encoded as a pair of 64-bit key and 64-bit value,
+ * and stored in the radix tree.
+ *
+ * TidStore can be shared among parallel worker processes by passing a DSA
+ * area to TidStoreCreate(). Other backends can attach to the shared TidStore
+ * with TidStoreAttach().
+ *
+ * For concurrency support, we use a single LWLock for the TidStore. The
+ * TidStore is locked exclusively when inserting encoded tids into the
+ * radix tree or when resetting itself. When searching the TidStore or
+ * iterating over it, the TidStore itself is not locked, but the underlying
+ * radix tree is locked in shared mode.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/common/tidstore.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/tidstore.h"
+#include "miscadmin.h"
+#include "port/pg_bitutils.h"
+#include "storage/lwlock.h"
+#include "utils/dsa.h"
+#include "utils/memutils.h"
+
+/*
+ * For encoding purposes, a tid is represented as a pair of 64-bit key and
+ * 64-bit value.
+ *
+ * First, we construct a 64-bit unsigned integer by combining the block
+ * number and the offset number. The number of bits used for the offset number
+ * is specified by max_off in TidStoreCreate(). We are frugal with the bits,
+ * because smaller keys could help keep the radix tree shallow.
+ *
+ * For example, a tid of heap on a 8kB block uses the lowest 9 bits for
+ * the offset number and uses the next 32 bits for the block number. 9 bits
+ * are enough for the offset number, because MaxHeapTuplesPerPage < 2^9
+ * on 8kB blocks. That is, only 41 bits are used:
+ *
+ * uuuuuuuY YYYYYYYY YYYYYYYY YYYYYYYY YYYYYYYX XXXXXXXX
+ *
+ * X = bits used for offset number
+ * Y = bits used for block number
+ * u = unused bit
+ * (high on the left, low on the right)
+ *
+ * Then, the 64-bit value is the bitmap representation of the lowest 6 bits
+ * (LOWER_OFFSET_NBITS) of the integer, and the 64-bit key consists of the
+ * upper 3 bits of the offset number and the block number, 35 bits in
+ * total:
+ *
+ * uuuuuuuY YYYYYYYY YYYYYYYY YYYYYYYY YYYYYYYX XXXXXXXX
+ * |----| value
+ * |--------------------------------------| key
+ *
+ * The maximum height of the radix tree is 5 in this case.
+ *
+ * If the number of bits required for offset numbers fits in LOWER_OFFSET_NBITS,
+ * the 64-bit value is the bitmap representation of the offset number, and the
+ * 64-bit key is the block number.
+ */
+typedef uint64 tidkey;
+typedef uint64 offsetbm;
+#define LOWER_OFFSET_NBITS 6 /* log2(sizeof(offsetbm) * BITS_PER_BYTE) */
+#define LOWER_OFFSET_MASK ((1 << LOWER_OFFSET_NBITS) - 1)
+
+/* A magic value used to identify our TidStore. */
+#define TIDSTORE_MAGIC 0x826f6a10
+
+#define RT_PREFIX local_rt
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#define RT_VALUE_TYPE tidkey
+#include "lib/radixtree.h"
+
+#define RT_PREFIX shared_rt
+#define RT_SHMEM
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#define RT_VALUE_TYPE tidkey
+#include "lib/radixtree.h"
+
+/* The control object for a TidStore */
+typedef struct TidStoreControl
+{
+ /* the number of tids in the store */
+ int64 num_tids;
+
+ /* These values are never changed after creation */
+ size_t max_bytes; /* the maximum bytes a TidStore can use */
+ OffsetNumber max_off; /* the maximum offset number */
+ int max_off_nbits; /* the number of bits required for offset
+ * numbers */
+ int upper_off_nbits; /* the number of bits of offset numbers
+ * used in a key */
+
+ /* The below fields are used only in shared case */
+
+ uint32 magic;
+ LWLock lock;
+
+ /* handles for TidStore and radix tree */
+ TidStoreHandle handle;
+ shared_rt_handle tree_handle;
+} TidStoreControl;
+
+/* Per-backend state for a TidStore */
+struct TidStore
+{
+ /*
+ * Control object. This is allocated in DSA area 'area' in the shared
+ * case, otherwise in backend-local memory.
+ */
+ TidStoreControl *control;
+
+ /* Storage for Tids. Use either one depending on TidStoreIsShared() */
+ union
+ {
+ local_rt_radix_tree *local;
+ shared_rt_radix_tree *shared;
+ } tree;
+
+ /* DSA area for TidStore if used */
+ dsa_area *area;
+};
+#define TidStoreIsShared(ts) ((ts)->area != NULL)
+
+/* Iterator for TidStore */
+typedef struct TidStoreIter
+{
+ TidStore *ts;
+
+ /* iterator of radix tree. Use either one depending on TidStoreIsShared() */
+ union
+ {
+ shared_rt_iter *shared;
+ local_rt_iter *local;
+ } tree_iter;
+
+ /* we returned all tids? */
+ bool finished;
+
+ /* save for the next iteration */
+ tidkey next_tidkey;
+ offsetbm next_off_bitmap;
+
+ /*
+ * output for the caller. Must be last because variable-size.
+ */
+ TidStoreIterResult output;
+} TidStoreIter;
+
+static void iter_decode_key_off(TidStoreIter *iter, tidkey key, offsetbm off_bitmap);
+static inline BlockNumber key_get_blkno(TidStore *ts, tidkey key);
+static inline tidkey encode_blk_off(TidStore *ts, BlockNumber block,
+ OffsetNumber offset, offsetbm *off_bit);
+static inline tidkey encode_tid(TidStore *ts, ItemPointer tid, offsetbm *off_bit);
+
+/*
+ * Create a TidStore. The returned object is allocated in backend-local memory.
+ * The radix tree for storage is allocated in the DSA area if 'area' is non-NULL.
+ */
+TidStore *
+TidStoreCreate(size_t max_bytes, OffsetNumber max_off, dsa_area *area)
+{
+ TidStore *ts;
+
+ Assert(max_off <= MaxOffsetNumber);
+
+ ts = palloc0(sizeof(TidStore));
+
+ /*
+ * Create the radix tree for the main storage.
+ *
+ * Memory consumption depends on the number of stored tids, but also on their
+ * distribution, on how the radix tree stores them, and on the memory
+ * management that backs the radix tree. The maximum number of bytes that a
+ * TidStore may use is specified by max_bytes in TidStoreCreate(). We want
+ * the total memory consumption of a TidStore not to exceed max_bytes.
+ *
+ * In the local TidStore case, the radix tree uses a slab allocator for each
+ * node class. The most memory-consuming case while adding tids associated
+ * with one page (i.e. during TidStoreSetBlockOffsets()) is allocating a new
+ * slab block for a new radix tree node, which is approximately 70kB.
+ * Therefore, we deduct 70kB from max_bytes.
+ *
+ * In the shared case, DSA allocates memory segments following a geometric
+ * series that approximately doubles the total DSA size (see
+ * make_new_segment() in dsa.c). We simulated how DSA increases the segment
+ * size, and the simulation showed that a 75% threshold for the maximum bytes
+ * works well when max_bytes is a power of two, and a 60% threshold works for
+ * other cases.
+ */
+ if (area != NULL)
+ {
+ dsa_pointer dp;
+ float ratio = ((max_bytes & (max_bytes - 1)) == 0) ? 0.75 : 0.6;
+
+ ts->tree.shared = shared_rt_create(CurrentMemoryContext, area,
+ LWTRANCHE_SHARED_TIDSTORE);
+
+ dp = dsa_allocate0(area, sizeof(TidStoreControl));
+ ts->control = (TidStoreControl *) dsa_get_address(area, dp);
+ ts->control->max_bytes = (size_t) (max_bytes * ratio);
+ ts->area = area;
+
+ ts->control->magic = TIDSTORE_MAGIC;
+ LWLockInitialize(&ts->control->lock, LWTRANCHE_SHARED_TIDSTORE);
+ ts->control->handle = dp;
+ ts->control->tree_handle = shared_rt_get_handle(ts->tree.shared);
+ }
+ else
+ {
+ ts->tree.local = local_rt_create(CurrentMemoryContext);
+
+ ts->control = (TidStoreControl *) palloc0(sizeof(TidStoreControl));
+ ts->control->max_bytes = max_bytes - (70 * 1024);
+ }
+
+ ts->control->max_off = max_off;
+ ts->control->max_off_nbits = pg_ceil_log2_32(max_off);
+
+ if (ts->control->max_off_nbits < LOWER_OFFSET_NBITS)
+ ts->control->max_off_nbits = LOWER_OFFSET_NBITS;
+
+ ts->control->upper_off_nbits =
+ ts->control->max_off_nbits - LOWER_OFFSET_NBITS;
+
+ return ts;
+}
+
+/*
+ * Attach to the shared TidStore using a handle. The returned object is
+ * allocated in backend-local memory using the CurrentMemoryContext.
+ */
+TidStore *
+TidStoreAttach(dsa_area *area, TidStoreHandle handle)
+{
+ TidStore *ts;
+ dsa_pointer control;
+
+ Assert(area != NULL);
+ Assert(DsaPointerIsValid(handle));
+
+ /* create per-backend state */
+ ts = palloc0(sizeof(TidStore));
+
+ /* Find the control object in shared memory */
+ control = handle;
+
+ /* Set up the TidStore */
+ ts->control = (TidStoreControl *) dsa_get_address(area, control);
+ Assert(ts->control->magic == TIDSTORE_MAGIC);
+
+ ts->tree.shared = shared_rt_attach(area, ts->control->tree_handle);
+ ts->area = area;
+
+ return ts;
+}
+
+/*
+ * Detach from a TidStore. This detaches from the radix tree and frees the
+ * backend-local resources. The radix tree will continue to exist until
+ * it is either explicitly destroyed, or the area that backs it is returned
+ * to the operating system.
+ */
+void
+TidStoreDetach(TidStore *ts)
+{
+ Assert(TidStoreIsShared(ts) && ts->control->magic == TIDSTORE_MAGIC);
+
+ shared_rt_detach(ts->tree.shared);
+ pfree(ts);
+}
+
+/*
+ * Destroy a TidStore, returning all memory.
+ *
+ * TODO: The caller must be certain that no other backend will attempt to
+ * access the TidStore before calling this function. Other backends must
+ * explicitly call TidStoreDetach() to free up backend-local memory associated
+ * with the TidStore. The backend that calls TidStoreDestroy() must not call
+ * TidStoreDetach().
+ */
+void
+TidStoreDestroy(TidStore *ts)
+{
+ if (TidStoreIsShared(ts))
+ {
+ Assert(ts->control->magic == TIDSTORE_MAGIC);
+
+ /*
+ * Vandalize the control block to help catch programming errors where
+ * other backends access the memory formerly occupied by this radix
+ * tree.
+ */
+ ts->control->magic = 0;
+ dsa_free(ts->area, ts->control->handle);
+ shared_rt_free(ts->tree.shared);
+ }
+ else
+ {
+ pfree(ts->control);
+ local_rt_free(ts->tree.local);
+ }
+
+ pfree(ts);
+}
+
+/*
+ * Forget all collected tids. This is similar to TidStoreDestroy(), but instead
+ * of freeing the entire TidStore we recreate only the radix tree storage.
+ */
+void
+TidStoreReset(TidStore *ts)
+{
+ if (TidStoreIsShared(ts))
+ {
+ Assert(ts->control->magic == TIDSTORE_MAGIC);
+
+ LWLockAcquire(&ts->control->lock, LW_EXCLUSIVE);
+
+ /*
+ * Free the radix tree and return allocated DSA segments to
+ * the operating system.
+ */
+ shared_rt_free(ts->tree.shared);
+ dsa_trim(ts->area);
+
+ /* Recreate the radix tree */
+ ts->tree.shared = shared_rt_create(CurrentMemoryContext, ts->area,
+ LWTRANCHE_SHARED_TIDSTORE);
+
+ /* update the radix tree handle as we recreated it */
+ ts->control->tree_handle = shared_rt_get_handle(ts->tree.shared);
+
+ /* Reset the statistics */
+ ts->control->num_tids = 0;
+
+ LWLockRelease(&ts->control->lock);
+ }
+ else
+ {
+ local_rt_free(ts->tree.local);
+ ts->tree.local = local_rt_create(CurrentMemoryContext);
+
+ /* Reset the statistics */
+ ts->control->num_tids = 0;
+ }
+}
+
+/*
+ * Set the given tids on block 'blkno' in the TidStore.
+ *
+ * NB: the offset numbers in offsets must be sorted in ascending order.
+ */
+void
+TidStoreSetBlockOffsets(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets)
+{
+ offsetbm *bitmaps;
+ tidkey key;
+ tidkey prev_key;
+ offsetbm off_bitmap = 0;
+ int idx;
+ const tidkey key_base = ((uint64) blkno) << ts->control->upper_off_nbits;
+ const int nkeys = UINT64CONST(1) << ts->control->upper_off_nbits;
+
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+ Assert(BlockNumberIsValid(blkno));
+
+ bitmaps = palloc(sizeof(offsetbm) * nkeys);
+ key = prev_key = key_base;
+
+ for (int i = 0; i < num_offsets; i++)
+ {
+ offsetbm off_bit;
+
+ Assert(offsets[i] <= ts->control->max_off);
+
+ /* encode the tid to a key and partial offset */
+ key = encode_blk_off(ts, blkno, offsets[i], &off_bit);
+
+ /* make sure we scanned the line pointer array in order */
+ Assert(key >= prev_key);
+
+ if (key > prev_key)
+ {
+ idx = prev_key - key_base;
+ Assert(idx >= 0 && idx < nkeys);
+
+ /* write out offset bitmap for this key */
+ bitmaps[idx] = off_bitmap;
+
+ /* zero out any gaps up to the current key */
+ for (int empty_idx = idx + 1; empty_idx < key - key_base; empty_idx++)
+ bitmaps[empty_idx] = 0;
+
+ /* reset for current key -- the current offset will be handled below */
+ off_bitmap = 0;
+ prev_key = key;
+ }
+
+ off_bitmap |= off_bit;
+ }
+
+ /* save the final index for later */
+ idx = key - key_base;
+ /* write out last offset bitmap */
+ bitmaps[idx] = off_bitmap;
+
+ if (TidStoreIsShared(ts))
+ LWLockAcquire(&ts->control->lock, LW_EXCLUSIVE);
+
+ /* insert the calculated key-values to the tree */
+ for (int i = 0; i <= idx; i++)
+ {
+ if (bitmaps[i])
+ {
+ key = key_base + i;
+
+ if (TidStoreIsShared(ts))
+ shared_rt_set(ts->tree.shared, key, &bitmaps[i]);
+ else
+ local_rt_set(ts->tree.local, key, &bitmaps[i]);
+ }
+ }
+
+ /* update statistics */
+ ts->control->num_tids += num_offsets;
+
+ if (TidStoreIsShared(ts))
+ LWLockRelease(&ts->control->lock);
+
+ pfree(bitmaps);
+}
+
+/* Return true if the given tid is present in the TidStore */
+bool
+TidStoreIsMember(TidStore *ts, ItemPointer tid)
+{
+ tidkey key;
+ offsetbm off_bitmap = 0;
+ offsetbm off_bit;
+ bool found;
+
+ Assert(ItemPointerIsValid(tid));
+
+ key = encode_tid(ts, tid, &off_bit);
+
+ if (TidStoreIsShared(ts))
+ found = shared_rt_search(ts->tree.shared, key, &off_bitmap);
+ else
+ found = local_rt_search(ts->tree.local, key, &off_bitmap);
+
+ if (!found)
+ return false;
+
+ return (off_bitmap & off_bit) != 0;
+}
+
+/*
+ * Prepare to iterate through a TidStore. Since the radix tree is locked during
+ * the iteration, TidStoreEndIterate() needs to be called when finished.
+ *
+ * The TidStoreIter struct is created in the caller's memory context.
+ *
+ * Concurrent inserts of key-value pairs into the radix tree are blocked
+ * while the iteration is in progress.
+ */
+TidStoreIter *
+TidStoreBeginIterate(TidStore *ts)
+{
+ TidStoreIter *iter;
+
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ iter = palloc0(sizeof(TidStoreIter) +
+ sizeof(OffsetNumber) * ts->control->max_off);
+ iter->ts = ts;
+
+ if (TidStoreIsShared(ts))
+ iter->tree_iter.shared = shared_rt_begin_iterate(ts->tree.shared);
+ else
+ iter->tree_iter.local = local_rt_begin_iterate(ts->tree.local);
+
+ /* If the TidStore is empty, there is nothing to do */
+ if (TidStoreNumTids(ts) == 0)
+ iter->finished = true;
+
+ return iter;
+}
+
+static inline bool
+tidstore_iter(TidStoreIter *iter, tidkey *key, offsetbm *off_bitmap)
+{
+ if (TidStoreIsShared(iter->ts))
+ return shared_rt_iterate_next(iter->tree_iter.shared, key, off_bitmap);
+
+ return local_rt_iterate_next(iter->tree_iter.local, key, off_bitmap);
+}
+
+/*
+ * Scan the TidStore and return a pointer to a TidStoreIterResult that has the
+ * tids of one block. Block numbers are returned in ascending order, and the
+ * offset numbers in each result are also sorted in ascending order.
+ */
+TidStoreIterResult *
+TidStoreIterateNext(TidStoreIter *iter)
+{
+ tidkey key;
+ offsetbm off_bitmap = 0;
+ TidStoreIterResult *output = &(iter->output);
+
+ if (iter->finished)
+ return NULL;
+
+ /* Initialize the outputs */
+ output->blkno = InvalidBlockNumber;
+ output->num_offsets = 0;
+
+ /*
+ * Decode the key and offset bitmap collected in the previous iteration,
+ * if any.
+ */
+ if (iter->next_off_bitmap > 0)
+ iter_decode_key_off(iter, iter->next_tidkey, iter->next_off_bitmap);
+
+ while (tidstore_iter(iter, &key, &off_bitmap))
+ {
+ BlockNumber blkno = key_get_blkno(iter->ts, key);
+ Assert(BlockNumberIsValid(blkno));
+
+ if (BlockNumberIsValid(output->blkno) && output->blkno != blkno)
+ {
+ /*
+ * We got tids for a different block. We return the collected
+ * tids so far, and remember the key-value for the next
+ * iteration.
+ */
+ iter->next_tidkey = key;
+ iter->next_off_bitmap = off_bitmap;
+ return output;
+ }
+
+ /* Collect tids decoded from the key and offset bitmap */
+ iter_decode_key_off(iter, key, off_bitmap);
+ }
+
+ iter->finished = true;
+ return output;
+}
+
+/*
+ * Finish an iteration over a TidStore. This needs to be called after finishing
+ * an iteration or when exiting one early.
+ */
+void
+TidStoreEndIterate(TidStoreIter *iter)
+{
+ if (TidStoreIsShared(iter->ts))
+ shared_rt_end_iterate(iter->tree_iter.shared);
+ else
+ local_rt_end_iterate(iter->tree_iter.local);
+
+ pfree(iter);
+}
+
+/* Return the number of tids we collected so far */
+int64
+TidStoreNumTids(TidStore *ts)
+{
+ int64 num_tids;
+
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ if (!TidStoreIsShared(ts))
+ return ts->control->num_tids;
+
+ LWLockAcquire(&ts->control->lock, LW_SHARED);
+ num_tids = ts->control->num_tids;
+ LWLockRelease(&ts->control->lock);
+
+ Assert(num_tids >= 0);
+ return num_tids;
+}
+
+/* Return true if the current memory usage of TidStore exceeds the limit */
+bool
+TidStoreIsFull(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ return (TidStoreMemoryUsage(ts) > ts->control->max_bytes);
+}
+
+/* Return the maximum memory TidStore can use */
+size_t
+TidStoreMaxMemory(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ return ts->control->max_bytes;
+}
+
+/* Return the memory usage of TidStore */
+size_t
+TidStoreMemoryUsage(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ /*
+ * In the shared case, TidStoreControl and radix_tree are backed by the
+ * same DSA area and rt_memory_usage() returns the value including both.
+ * So we don't need to add the size of TidStoreControl separately.
+ */
+ if (TidStoreIsShared(ts))
+ return sizeof(TidStore) + shared_rt_memory_usage(ts->tree.shared);
+
+ return sizeof(TidStore) + sizeof(TidStoreControl) + local_rt_memory_usage(ts->tree.local);
+}
+
+/*
+ * Get a handle that can be used by other processes to attach to this TidStore
+ */
+TidStoreHandle
+TidStoreGetHandle(TidStore *ts)
+{
+ Assert(TidStoreIsShared(ts) && ts->control->magic == TIDSTORE_MAGIC);
+
+ return ts->control->handle;
+}
+
+/*
+ * Decode the key and offset bitmap into tids and store them in the iteration
+ * result.
+ */
+static void
+iter_decode_key_off(TidStoreIter *iter, tidkey key, offsetbm off_bitmap)
+{
+ TidStoreIterResult *output = (&iter->output);
+
+ while (off_bitmap)
+ {
+ uint64 compressed_tid;
+ OffsetNumber off;
+
+ compressed_tid = key << LOWER_OFFSET_NBITS;
+ compressed_tid |= pg_rightmost_one_pos64(off_bitmap);
+
+ off = compressed_tid & ((UINT64CONST(1) << iter->ts->control->max_off_nbits) - 1);
+
+ Assert(output->num_offsets < iter->ts->control->max_off);
+ output->offsets[output->num_offsets++] = off;
+
+ /* unset the rightmost bit */
+ off_bitmap &= ~pg_rightmost_one64(off_bitmap);
+ }
+
+ output->blkno = key_get_blkno(iter->ts, key);
+}
+
+/* Get block number from the given key */
+static inline BlockNumber
+key_get_blkno(TidStore *ts, tidkey key)
+{
+ return (BlockNumber) (key >> ts->control->upper_off_nbits);
+}
+
+/* Encode a tid to key and partial offset */
+static inline tidkey
+encode_tid(TidStore *ts, ItemPointer tid, offsetbm *off_bit)
+{
+ OffsetNumber offset = ItemPointerGetOffsetNumber(tid);
+ BlockNumber block = ItemPointerGetBlockNumber(tid);
+
+ return encode_blk_off(ts, block, offset, off_bit);
+}
+
+/* Encode a block and offset into a key and partial offset */
+static inline tidkey
+encode_blk_off(TidStore *ts, BlockNumber block, OffsetNumber offset,
+ offsetbm *off_bit)
+{
+ tidkey key;
+ uint64 compressed_tid;
+ uint32 off_lower;
+
+ off_lower = offset & LOWER_OFFSET_MASK;
+ Assert(off_lower < (sizeof(offsetbm) * BITS_PER_BYTE));
+
+ *off_bit = UINT64CONST(1) << off_lower;
+ compressed_tid = offset | ((uint64) block << ts->control->max_off_nbits);
+ key = compressed_tid >> LOWER_OFFSET_NBITS;
+
+ return key;
+}
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index d2ec396045..55b3a04097 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -176,6 +176,8 @@ static const char *const BuiltinTrancheNames[] = {
"SharedTupleStore",
/* LWTRANCHE_SHARED_TIDBITMAP: */
"SharedTidBitmap",
+ /* LWTRANCHE_SHARED_TIDSTORE: */
+ "SharedTidStore",
/* LWTRANCHE_PARALLEL_APPEND: */
"ParallelAppend",
/* LWTRANCHE_PER_XACT_PREDICATE_LIST: */
diff --git a/src/include/access/tidstore.h b/src/include/access/tidstore.h
new file mode 100644
index 0000000000..d1cc93cbb6
--- /dev/null
+++ b/src/include/access/tidstore.h
@@ -0,0 +1,50 @@
+/*-------------------------------------------------------------------------
+ *
+ * tidstore.h
+ * Tid storage.
+ *
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/tidstore.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef TIDSTORE_H
+#define TIDSTORE_H
+
+#include "storage/itemptr.h"
+#include "utils/dsa.h"
+
+typedef dsa_pointer TidStoreHandle;
+
+typedef struct TidStore TidStore;
+typedef struct TidStoreIter TidStoreIter;
+
+/* Result struct for TidStoreIterateNext */
+typedef struct TidStoreIterResult
+{
+ BlockNumber blkno;
+ int num_offsets;
+ OffsetNumber offsets[FLEXIBLE_ARRAY_MEMBER];
+} TidStoreIterResult;
+
+extern TidStore *TidStoreCreate(size_t max_bytes, OffsetNumber max_off, dsa_area *dsa);
+extern TidStore *TidStoreAttach(dsa_area *dsa, dsa_pointer handle);
+extern void TidStoreDetach(TidStore *ts);
+extern void TidStoreDestroy(TidStore *ts);
+extern void TidStoreReset(TidStore *ts);
+extern void TidStoreSetBlockOffsets(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets);
+extern bool TidStoreIsMember(TidStore *ts, ItemPointer tid);
+extern TidStoreIter * TidStoreBeginIterate(TidStore *ts);
+extern TidStoreIterResult *TidStoreIterateNext(TidStoreIter *iter);
+extern void TidStoreEndIterate(TidStoreIter *iter);
+extern int64 TidStoreNumTids(TidStore *ts);
+extern bool TidStoreIsFull(TidStore *ts);
+extern size_t TidStoreMaxMemory(TidStore *ts);
+extern size_t TidStoreMemoryUsage(TidStore *ts);
+extern TidStoreHandle TidStoreGetHandle(TidStore *ts);
+
+#endif /* TIDSTORE_H */
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index d2c7afb8f4..07002fdfbe 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -199,6 +199,7 @@ typedef enum BuiltinTrancheIds
LWTRANCHE_PER_SESSION_RECORD_TYPMOD,
LWTRANCHE_SHARED_TUPLESTORE,
LWTRANCHE_SHARED_TIDBITMAP,
+ LWTRANCHE_SHARED_TIDSTORE,
LWTRANCHE_PARALLEL_APPEND,
LWTRANCHE_PER_XACT_PREDICATE_LIST,
LWTRANCHE_PGSTATS_DSA,
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 9659eb85d7..bddc16ada7 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -34,6 +34,7 @@ SUBDIRS = \
test_rls_hooks \
test_shm_mq \
test_slru \
+ test_tidstore \
unsafe_tests \
worker_spi
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 232cbdac80..c0d5645ad8 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -30,5 +30,6 @@ subdir('test_regex')
subdir('test_rls_hooks')
subdir('test_shm_mq')
subdir('test_slru')
+subdir('test_tidstore')
subdir('unsafe_tests')
subdir('worker_spi')
diff --git a/src/test/modules/test_tidstore/Makefile b/src/test/modules/test_tidstore/Makefile
new file mode 100644
index 0000000000..dab107d70c
--- /dev/null
+++ b/src/test/modules/test_tidstore/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_tidstore/Makefile
+
+MODULE_big = test_tidstore
+OBJS = \
+ $(WIN32RES) \
+ test_tidstore.o
+PGFILEDESC = "test_tidstore - test code for src/backend/access/common/tidstore.c"
+
+EXTENSION = test_tidstore
+DATA = test_tidstore--1.0.sql
+
+REGRESS = test_tidstore
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_tidstore
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_tidstore/expected/test_tidstore.out b/src/test/modules/test_tidstore/expected/test_tidstore.out
new file mode 100644
index 0000000000..7ff2f9af87
--- /dev/null
+++ b/src/test/modules/test_tidstore/expected/test_tidstore.out
@@ -0,0 +1,13 @@
+CREATE EXTENSION test_tidstore;
+--
+-- All the logic is in the test_tidstore() function. It will throw
+-- an error if something fails.
+--
+SELECT test_tidstore();
+NOTICE: testing empty tidstore
+NOTICE: testing basic operations
+ test_tidstore
+---------------
+
+(1 row)
+
diff --git a/src/test/modules/test_tidstore/meson.build b/src/test/modules/test_tidstore/meson.build
new file mode 100644
index 0000000000..31f2da7b61
--- /dev/null
+++ b/src/test/modules/test_tidstore/meson.build
@@ -0,0 +1,35 @@
+# FIXME: prevent install during main install, but not during test :/
+
+test_tidstore_sources = files(
+ 'test_tidstore.c',
+)
+
+if host_system == 'windows'
+ test_tidstore_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'test_tidstore',
+ '--FILEDESC', 'test_tidstore - test code for src/backend/access/common/tidstore.c',])
+endif
+
+test_tidstore = shared_module('test_tidstore',
+ test_tidstore_sources,
+ link_with: pgport_srv,
+ kwargs: pg_mod_args,
+)
+testprep_targets += test_tidstore
+
+install_data(
+ 'test_tidstore.control',
+ 'test_tidstore--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'test_tidstore',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'test_tidstore',
+ ],
+ },
+}
diff --git a/src/test/modules/test_tidstore/sql/test_tidstore.sql b/src/test/modules/test_tidstore/sql/test_tidstore.sql
new file mode 100644
index 0000000000..03aea31815
--- /dev/null
+++ b/src/test/modules/test_tidstore/sql/test_tidstore.sql
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_tidstore;
+
+--
+-- All the logic is in the test_tidstore() function. It will throw
+-- an error if something fails.
+--
+SELECT test_tidstore();
diff --git a/src/test/modules/test_tidstore/test_tidstore--1.0.sql b/src/test/modules/test_tidstore/test_tidstore--1.0.sql
new file mode 100644
index 0000000000..47e9149900
--- /dev/null
+++ b/src/test/modules/test_tidstore/test_tidstore--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_tidstore/test_tidstore--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_tidstore" to load this file. \quit
+
+CREATE FUNCTION test_tidstore()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_tidstore/test_tidstore.c b/src/test/modules/test_tidstore/test_tidstore.c
new file mode 100644
index 0000000000..8659e6780e
--- /dev/null
+++ b/src/test/modules/test_tidstore/test_tidstore.c
@@ -0,0 +1,228 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_tidstore.c
+ * Test TidStore data structure.
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/test/modules/test_tidstore/test_tidstore.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "access/tidstore.h"
+#include "fmgr.h"
+#include "miscadmin.h"
+#include "storage/block.h"
+#include "storage/itemptr.h"
+#include "storage/lwlock.h"
+#include "utils/memutils.h"
+
+PG_MODULE_MAGIC;
+
+/* #define TEST_SHARED_TIDSTORE 1 */
+
+#define TEST_TIDSTORE_MAX_BYTES (2 * 1024 * 1024L) /* 2MB */
+
+PG_FUNCTION_INFO_V1(test_tidstore);
+
+static void
+check_tid(TidStore *ts, BlockNumber blkno, OffsetNumber off, bool expect)
+{
+ ItemPointerData tid;
+ bool found;
+
+ ItemPointerSet(&tid, blkno, off);
+
+ found = TidStoreIsMember(ts, &tid);
+
+ if (found != expect)
+ elog(ERROR, "TidStoreIsMember for TID (%u, %u) returned %d, expected %d",
+ blkno, off, found, expect);
+}
+
+static void
+test_basic(int max_offset)
+{
+#define TEST_TIDSTORE_NUM_BLOCKS 5
+#define TEST_TIDSTORE_NUM_OFFSETS 5
+
+ TidStore *ts;
+ TidStoreIter *iter;
+ TidStoreIterResult *iter_result;
+ BlockNumber blks[TEST_TIDSTORE_NUM_BLOCKS] = {
+ 0, MaxBlockNumber, MaxBlockNumber - 1, 1, MaxBlockNumber / 2,
+ };
+ BlockNumber blks_sorted[TEST_TIDSTORE_NUM_BLOCKS] = {
+ 0, 1, MaxBlockNumber / 2, MaxBlockNumber - 1, MaxBlockNumber
+ };
+ OffsetNumber offs[TEST_TIDSTORE_NUM_OFFSETS];
+ int blk_idx;
+
+#ifdef TEST_SHARED_TIDSTORE
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+
+ LWLockRegisterTranche(tranche_id, "test_tidstore");
+ dsa = dsa_create(tranche_id);
+
+ ts = TidStoreCreate(TEST_TIDSTORE_MAX_BYTES, max_offset, dsa);
+#else
+ ts = TidStoreCreate(TEST_TIDSTORE_MAX_BYTES, max_offset, NULL);
+#endif
+
+ /* prepare the offset array */
+ offs[0] = FirstOffsetNumber;
+ offs[1] = FirstOffsetNumber + 1;
+ offs[2] = max_offset / 2;
+ offs[3] = max_offset - 1;
+ offs[4] = max_offset;
+
+ /* add tids */
+ for (int i = 0; i < TEST_TIDSTORE_NUM_BLOCKS; i++)
+ TidStoreSetBlockOffsets(ts, blks[i], offs, TEST_TIDSTORE_NUM_OFFSETS);
+
+ /* lookup test */
+ for (OffsetNumber off = FirstOffsetNumber ; off < max_offset; off++)
+ {
+ bool expect = false;
+ for (int i = 0; i < TEST_TIDSTORE_NUM_OFFSETS; i++)
+ {
+ if (offs[i] == off)
+ {
+ expect = true;
+ break;
+ }
+ }
+
+ check_tid(ts, 0, off, expect);
+ check_tid(ts, 2, off, false);
+ check_tid(ts, MaxBlockNumber - 2, off, false);
+ check_tid(ts, MaxBlockNumber, off, expect);
+ }
+
+ /* test the number of tids */
+ if (TidStoreNumTids(ts) != (TEST_TIDSTORE_NUM_BLOCKS * TEST_TIDSTORE_NUM_OFFSETS))
+ elog(ERROR, "TidStoreNumTids returned " UINT64_FORMAT ", expected %d",
+ TidStoreNumTids(ts),
+ TEST_TIDSTORE_NUM_BLOCKS * TEST_TIDSTORE_NUM_OFFSETS);
+
+ /* iteration test */
+ iter = TidStoreBeginIterate(ts);
+ blk_idx = 0;
+ while ((iter_result = TidStoreIterateNext(iter)) != NULL)
+ {
+ /* check the returned block number */
+ if (blks_sorted[blk_idx] != iter_result->blkno)
+ elog(ERROR, "TidStoreIterateNext returned block number %u, expected %u",
+ iter_result->blkno, blks_sorted[blk_idx]);
+
+ /* check the returned offset numbers */
+ if (TEST_TIDSTORE_NUM_OFFSETS != iter_result->num_offsets)
+ elog(ERROR, "TidStoreIterateNext returned %u offsets, expected %u",
+ iter_result->num_offsets, TEST_TIDSTORE_NUM_OFFSETS);
+
+ for (int i = 0; i < iter_result->num_offsets; i++)
+ {
+ if (offs[i] != iter_result->offsets[i])
+ elog(ERROR, "TidStoreIterateNext returned offset number %u on block %u, expected %u",
+ iter_result->offsets[i], iter_result->blkno, offs[i]);
+ }
+
+ blk_idx++;
+ }
+
+ if (blk_idx != TEST_TIDSTORE_NUM_BLOCKS)
+ elog(ERROR, "TidStoreIterateNext returned %d blocks, expected %d",
+ blk_idx, TEST_TIDSTORE_NUM_BLOCKS);
+
+ /* remove all tids */
+ TidStoreReset(ts);
+
+ /* test the number of tids */
+ if (TidStoreNumTids(ts) != 0)
+ elog(ERROR, "TidStoreNumTids on empty store returned non-zero");
+
+ /* lookup test for empty store */
+ for (OffsetNumber off = FirstOffsetNumber ; off < MaxHeapTuplesPerPage;
+ off++)
+ {
+ check_tid(ts, 0, off, false);
+ check_tid(ts, 2, off, false);
+ check_tid(ts, MaxBlockNumber - 2, off, false);
+ check_tid(ts, MaxBlockNumber, off, false);
+ }
+
+ TidStoreDestroy(ts);
+
+#ifdef TEST_SHARED_TIDSTORE
+ dsa_detach(dsa);
+#endif
+}
+
+static void
+test_empty(void)
+{
+ TidStore *ts;
+ TidStoreIter *iter;
+ ItemPointerData tid;
+
+#ifdef TEST_SHARED_TIDSTORE
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+
+ LWLockRegisterTranche(tranche_id, "test_tidstore");
+ dsa = dsa_create(tranche_id);
+
+ ts = TidStoreCreate(TEST_TIDSTORE_MAX_BYTES, MaxHeapTuplesPerPage, dsa);
+#else
+ ts = TidStoreCreate(TEST_TIDSTORE_MAX_BYTES, MaxHeapTuplesPerPage, NULL);
+#endif
+
+ elog(NOTICE, "testing empty tidstore");
+
+ ItemPointerSet(&tid, 0, FirstOffsetNumber);
+ if (TidStoreIsMember(ts, &tid))
+ elog(ERROR, "TidStoreIsMember for TID (%u,%u) on empty store returned true",
+ 0, FirstOffsetNumber);
+
+ ItemPointerSet(&tid, MaxBlockNumber, MaxOffsetNumber);
+ if (TidStoreIsMember(ts, &tid))
+ elog(ERROR, "TidStoreIsMember for TID (%u,%u) on empty store returned true",
+ MaxBlockNumber, MaxOffsetNumber);
+
+ if (TidStoreNumTids(ts) != 0)
+ elog(ERROR, "TidStoreNumTids on empty store returned non-zero");
+
+ if (TidStoreIsFull(ts))
+ elog(ERROR, "TidStoreIsFull on empty store returned true");
+
+ iter = TidStoreBeginIterate(ts);
+
+ if (TidStoreIterateNext(iter) != NULL)
+ elog(ERROR, "TidStoreIterateNext on empty store returned TIDs");
+
+ TidStoreEndIterate(iter);
+
+ TidStoreDestroy(ts);
+
+#ifdef TEST_SHARED_TIDSTORE
+ dsa_detach(dsa);
+#endif
+}
+
+Datum
+test_tidstore(PG_FUNCTION_ARGS)
+{
+ test_empty();
+
+ elog(NOTICE, "testing basic operations");
+ test_basic(MaxHeapTuplesPerPage);
+ test_basic(10);
+ test_basic(MaxHeapTuplesPerPage * 2);
+
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_tidstore/test_tidstore.control b/src/test/modules/test_tidstore/test_tidstore.control
new file mode 100644
index 0000000000..9b6bd4638f
--- /dev/null
+++ b/src/test/modules/test_tidstore/test_tidstore.control
@@ -0,0 +1,4 @@
+comment = 'Test code for tidstore'
+default_version = '1.0'
+module_pathname = '$libdir/test_tidstore'
+relocatable = true
--
2.31.1
Attachment: v30-0005-Tool-for-measuring-radix-tree-and-tidstore-perfo.patch
From 1feaf4249814a4bb7c5683649130b16cf3e5c754 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 16 Sep 2022 11:57:03 +0900
Subject: [PATCH v30 05/11] Tool for measuring radix tree and tidstore
performance
Includes Meson support, but commented out to avoid warnings
XXX: Not for commit
---
contrib/bench_radix_tree/Makefile | 21 +
.../bench_radix_tree--1.0.sql | 88 +++
contrib/bench_radix_tree/bench_radix_tree.c | 747 ++++++++++++++++++
.../bench_radix_tree/bench_radix_tree.control | 6 +
contrib/bench_radix_tree/expected/bench.out | 13 +
contrib/bench_radix_tree/meson.build | 33 +
contrib/bench_radix_tree/sql/bench.sql | 16 +
contrib/meson.build | 1 +
8 files changed, 925 insertions(+)
create mode 100644 contrib/bench_radix_tree/Makefile
create mode 100644 contrib/bench_radix_tree/bench_radix_tree--1.0.sql
create mode 100644 contrib/bench_radix_tree/bench_radix_tree.c
create mode 100644 contrib/bench_radix_tree/bench_radix_tree.control
create mode 100644 contrib/bench_radix_tree/expected/bench.out
create mode 100644 contrib/bench_radix_tree/meson.build
create mode 100644 contrib/bench_radix_tree/sql/bench.sql
diff --git a/contrib/bench_radix_tree/Makefile b/contrib/bench_radix_tree/Makefile
new file mode 100644
index 0000000000..952bb0ceae
--- /dev/null
+++ b/contrib/bench_radix_tree/Makefile
@@ -0,0 +1,21 @@
+# contrib/bench_radix_tree/Makefile
+
+MODULE_big = bench_radix_tree
+OBJS = \
+ bench_radix_tree.o
+
+EXTENSION = bench_radix_tree
+DATA = bench_radix_tree--1.0.sql
+
+REGRESS = bench_fixed_height
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/bench_radix_tree
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
new file mode 100644
index 0000000000..ad66265e23
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
@@ -0,0 +1,88 @@
+/* contrib/bench_radix_tree/bench_radix_tree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION bench_radix_tree" to load this file. \quit
+
+create function bench_shuffle_search(
+minblk int4,
+maxblk int4,
+random_block bool DEFAULT false,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT array_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT array_load_ms int8,
+OUT rt_search_ms int8,
+OUT array_serach_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_seq_search(
+minblk int4,
+maxblk int4,
+random_block bool DEFAULT false,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT array_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT array_load_ms int8,
+OUT rt_search_ms int8,
+OUT array_serach_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_load_random_int(
+cnt int8,
+OUT mem_allocated int8,
+OUT load_ms int8)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_search_random_nodes(
+cnt int8,
+filter_str text DEFAULT NULL,
+OUT mem_allocated int8,
+OUT load_ms int8,
+OUT search_ms int8)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C VOLATILE PARALLEL UNSAFE;
+
+create function bench_fixed_height_search(
+fanout int4,
+OUT fanout int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT rt_search_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_node128_load(
+fanout int4,
+OUT fanout int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT rt_sparseload_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_tidstore_load(
+minblk int4,
+maxblk int4,
+OUT mem_allocated int8,
+OUT load_ms int8,
+OUT iter_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c
new file mode 100644
index 0000000000..63e842395d
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -0,0 +1,747 @@
+/*-------------------------------------------------------------------------
+ *
+ * bench_radix_tree.c
+ *
+ * Copyright (c) 2016-2022, PostgreSQL Global Development Group
+ *
+ * contrib/bench_radix_tree/bench_radix_tree.c
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/tidstore.h"
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "funcapi.h"
+#include "lib/radixtree.h"
+#include <math.h>
+#include "miscadmin.h"
+#include "storage/lwlock.h"
+#include "utils/builtins.h"
+#include "utils/timestamp.h"
+
+PG_MODULE_MAGIC;
+
+/* run benchmark also for binary-search case? */
+/* #define MEASURE_BINARY_SEARCH 1 */
+
+#define TIDS_PER_BLOCK_FOR_LOAD 30
+#define TIDS_PER_BLOCK_FOR_LOOKUP 50
+
+//#define RT_DEBUG
+#define RT_PREFIX rt
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#define RT_USE_DELETE
+#define RT_VALUE_TYPE uint64
+// WIP: compiles with warnings because rt_attach is defined but not used
+// #define RT_SHMEM
+#include "lib/radixtree.h"
+
+/*
+ * Return the number of keys in the radix tree.
+ */
+static uint64
+rt_num_entries(rt_radix_tree *tree)
+{
+ return tree->ctl->num_keys;
+}
+
+
+PG_FUNCTION_INFO_V1(bench_seq_search);
+PG_FUNCTION_INFO_V1(bench_shuffle_search);
+PG_FUNCTION_INFO_V1(bench_load_random_int);
+PG_FUNCTION_INFO_V1(bench_fixed_height_search);
+PG_FUNCTION_INFO_V1(bench_search_random_nodes);
+PG_FUNCTION_INFO_V1(bench_node128_load);
+PG_FUNCTION_INFO_V1(bench_tidstore_load);
+
+static uint64
+tid_to_key_off(ItemPointer tid, uint32 *off)
+{
+ uint64 upper;
+ uint32 shift = pg_ceil_log2_32(MaxHeapTuplesPerPage);
+ int64 tid_i;
+
+ Assert(ItemPointerGetOffsetNumber(tid) < MaxHeapTuplesPerPage);
+
+ tid_i = ItemPointerGetOffsetNumber(tid);
+ tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << shift;
+
+ /* log(sizeof(uint64) * BITS_PER_BYTE, 2) = log(64, 2) = 6 */
+ *off = tid_i & ((1 << 6) - 1);
+ upper = tid_i >> 6;
+ Assert(*off < (sizeof(uint64) * BITS_PER_BYTE));
+
+ Assert(*off < 64);
+
+ return upper;
+}
+
+static int
+shuffle_randrange(pg_prng_state *state, int lower, int upper)
+{
+ return (int) floor(pg_prng_double(state) * ((upper - lower) + 0.999999)) + lower;
+}
+
+/* Naive Fisher-Yates shuffle implementation */
+static void
+shuffle_itemptrs(ItemPointer itemptr, uint64 nitems)
+{
+ /* reproducibility */
+ pg_prng_state state;
+
+ pg_prng_seed(&state, 0);
+
+ for (int i = 0; i < nitems - 1; i++)
+ {
+ int j = shuffle_randrange(&state, i, nitems - 1);
+ ItemPointerData t = itemptr[j];
+
+ itemptr[j] = itemptr[i];
+ itemptr[i] = t;
+ }
+}
+
+static ItemPointer
+generate_tids(BlockNumber minblk, BlockNumber maxblk, int ntids_per_blk, uint64 *ntids_p,
+ bool random_block)
+{
+ ItemPointer tids;
+ uint64 maxitems;
+ uint64 ntids = 0;
+ pg_prng_state state;
+
+ maxitems = (maxblk - minblk + 1) * ntids_per_blk;
+ tids = MemoryContextAllocHuge(TopTransactionContext,
+ sizeof(ItemPointerData) * maxitems);
+
+ if (random_block)
+ pg_prng_seed(&state, 0x9E3779B185EBCA87);
+
+ for (BlockNumber blk = minblk; blk < maxblk; blk++)
+ {
+ if (random_block && !pg_prng_bool(&state))
+ continue;
+
+ for (OffsetNumber off = FirstOffsetNumber;
+ off <= ntids_per_blk; off++)
+ {
+ CHECK_FOR_INTERRUPTS();
+
+ ItemPointerSetBlockNumber(&(tids[ntids]), blk);
+ ItemPointerSetOffsetNumber(&(tids[ntids]), off);
+
+ ntids++;
+ }
+ }
+
+ *ntids_p = ntids;
+ return tids;
+}
+
+#ifdef MEASURE_BINARY_SEARCH
+static int
+vac_cmp_itemptr(const void *left, const void *right)
+{
+ BlockNumber lblk,
+ rblk;
+ OffsetNumber loff,
+ roff;
+
+ lblk = ItemPointerGetBlockNumber((ItemPointer) left);
+ rblk = ItemPointerGetBlockNumber((ItemPointer) right);
+
+ if (lblk < rblk)
+ return -1;
+ if (lblk > rblk)
+ return 1;
+
+ loff = ItemPointerGetOffsetNumber((ItemPointer) left);
+ roff = ItemPointerGetOffsetNumber((ItemPointer) right);
+
+ if (loff < roff)
+ return -1;
+ if (loff > roff)
+ return 1;
+
+ return 0;
+}
+#endif
+
+Datum
+bench_tidstore_load(PG_FUNCTION_ARGS)
+{
+ BlockNumber minblk = PG_GETARG_INT32(0);
+ BlockNumber maxblk = PG_GETARG_INT32(1);
+ TidStore *ts;
+ TidStoreIter *iter;
+ TidStoreIterResult *result;
+ OffsetNumber *offs;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 load_ms;
+ int64 iter_ms;
+ TupleDesc tupdesc;
+ Datum values[3];
+ bool nulls[3] = {false};
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ offs = palloc(sizeof(OffsetNumber) * TIDS_PER_BLOCK_FOR_LOAD);
+ for (int i = 0; i < TIDS_PER_BLOCK_FOR_LOAD; i++)
+ offs[i] = i + 1; /* FirstOffsetNumber is 1 */
+
+ ts = TidStoreCreate(1 * 1024L * 1024L * 1024L, MaxHeapTuplesPerPage, NULL);
+
+ /* load tids */
+ start_time = GetCurrentTimestamp();
+ for (BlockNumber blkno = minblk; blkno < maxblk; blkno++)
+ TidStoreSetBlockOffsets(ts, blkno, offs, TIDS_PER_BLOCK_FOR_LOAD);
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ load_ms = secs * 1000 + usecs / 1000;
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+ /* iterate through tids */
+ iter = TidStoreBeginIterate(ts);
+ start_time = GetCurrentTimestamp();
+ while ((result = TidStoreIterateNext(iter)) != NULL)
+ ;
+ TidStoreEndIterate(iter);
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ iter_ms = secs * 1000 + usecs / 1000;
+
+ values[0] = Int64GetDatum(TidStoreMemoryUsage(ts));
+ values[1] = Int64GetDatum(load_ms);
+ values[2] = Int64GetDatum(iter_ms);
+
+ TidStoreDestroy(ts);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+static Datum
+bench_search(FunctionCallInfo fcinfo, bool shuffle)
+{
+ BlockNumber minblk = PG_GETARG_INT32(0);
+ BlockNumber maxblk = PG_GETARG_INT32(1);
+ bool random_block = PG_GETARG_BOOL(2);
+ rt_radix_tree *rt = NULL;
+ uint64 ntids;
+ uint64 key;
+ uint64 last_key = PG_UINT64_MAX;
+ uint64 val = 0;
+ ItemPointer tids;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_load_ms,
+ rt_search_ms;
+ Datum values[7];
+ bool nulls[7];
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ tids = generate_tids(minblk, maxblk, TIDS_PER_BLOCK_FOR_LOAD, &ntids, random_block);
+
+ /* measure the load time of the radix tree */
+ rt = rt_create(CurrentMemoryContext);
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ uint32 off;
+
+ CHECK_FOR_INTERRUPTS();
+
+ key = tid_to_key_off(tid, &off);
+
+ if (last_key != PG_UINT64_MAX && last_key != key)
+ {
+ rt_set(rt, last_key, &val);
+ val = 0;
+ }
+
+ last_key = key;
+ val |= (uint64) 1 << off;
+ }
+ if (last_key != PG_UINT64_MAX)
+ rt_set(rt, last_key, &val);
+
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_load_ms = secs * 1000 + usecs / 1000;
+
+#ifdef RT_DEBUG
+ rt_stats(rt);
+#endif
+ if (shuffle)
+ shuffle_itemptrs(tids, ntids);
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+ /* measure the search time of the radix tree */
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ uint32 off;
+ volatile bool ret; /* prevent calling rt_search from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ key = tid_to_key_off(tid, &off);
+
+ ret = rt_search(rt, key, &val);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_search_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int64GetDatum(rt_num_entries(rt));
+ values[1] = Int64GetDatum(rt_memory_usage(rt));
+ values[2] = Int64GetDatum(sizeof(ItemPointerData) * ntids);
+ values[3] = Int64GetDatum(rt_load_ms);
+ nulls[4] = true; /* ar_load_ms */
+ values[5] = Int64GetDatum(rt_search_ms);
+ nulls[6] = true; /* ar_search_ms */
+
+#ifdef MEASURE_BINARY_SEARCH
+ {
+ ItemPointer itemptrs = NULL;
+
+ int64 ar_load_ms,
+ ar_search_ms;
+
+ /* measure the load time of the array */
+ itemptrs = MemoryContextAllocHuge(CurrentMemoryContext,
+ sizeof(ItemPointerData) * ntids);
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointerSetBlockNumber(&(itemptrs[i]),
+ ItemPointerGetBlockNumber(&(tids[i])));
+ ItemPointerSetOffsetNumber(&(itemptrs[i]),
+ ItemPointerGetOffsetNumber(&(tids[i])));
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ ar_load_ms = secs * 1000 + usecs / 1000;
+
+ /* next, measure the search time of the array */
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ volatile bool ret; /* prevent calling bsearch from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ ret = bsearch((void *) tid,
+ (void *) itemptrs,
+ ntids,
+ sizeof(ItemPointerData),
+ vac_cmp_itemptr);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ ar_search_ms = secs * 1000 + usecs / 1000;
+
+ /* set the result */
+ nulls[4] = false;
+ values[4] = Int64GetDatum(ar_load_ms);
+ nulls[6] = false;
+ values[6] = Int64GetDatum(ar_search_ms);
+ }
+#endif
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_seq_search(PG_FUNCTION_ARGS)
+{
+ return bench_search(fcinfo, false);
+}
+
+Datum
+bench_shuffle_search(PG_FUNCTION_ARGS)
+{
+ return bench_search(fcinfo, true);
+}
+
+Datum
+bench_load_random_int(PG_FUNCTION_ARGS)
+{
+ uint64 cnt = (uint64) PG_GETARG_INT64(0);
+ rt_radix_tree *rt;
+ pg_prng_state state;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 load_time_ms;
+ Datum values[2];
+ bool nulls[2];
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ pg_prng_seed(&state, 0);
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ uint64 key = pg_prng_uint64(&state);
+
+ rt_set(rt, key, &key);
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ load_time_ms = secs * 1000 + usecs / 1000;
+
+#ifdef RT_DEBUG
+ rt_stats(rt);
+#endif
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int64GetDatum(rt_memory_usage(rt));
+ values[1] = Int64GetDatum(load_time_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+/* copy of splitmix64() */
+static uint64
+hash64(uint64 x)
+{
+ x ^= x >> 30;
+ x *= UINT64CONST(0xbf58476d1ce4e5b9);
+ x ^= x >> 27;
+ x *= UINT64CONST(0x94d049bb133111eb);
+ x ^= x >> 31;
+ return x;
+}
+
+/* attempts to have a relatively even population of node kinds */
+Datum
+bench_search_random_nodes(PG_FUNCTION_ARGS)
+{
+ uint64 cnt = (uint64) PG_GETARG_INT64(0);
+ rt_radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 load_time_ms;
+ int64 search_time_ms;
+ Datum values[3] = {0};
+ bool nulls[3] = {0};
+ /* from trial and error */
+ uint64 filter = (((uint64) 0x7F<<32) | (0x07<<24) | (0xFF<<16) | 0xFF);
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ if (!PG_ARGISNULL(1))
+ {
+ char *filter_str = text_to_cstring(PG_GETARG_TEXT_P(1));
+
+ if (sscanf(filter_str, "0x%lX", &filter) == 0)
+ elog(ERROR, "invalid filter string %s", filter_str);
+ }
+ elog(NOTICE, "bench with filter 0x%lX", filter);
+
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ uint64 hash = hash64(i);
+ uint64 key = hash & filter;
+
+ rt_set(rt, key, &key);
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ load_time_ms = secs * 1000 + usecs / 1000;
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ uint64 hash = hash64(i);
+ uint64 key = hash & filter;
+ uint64 val;
+ volatile bool ret; /* prevent calling rt_search from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ ret = rt_search(rt, key, &val);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ search_time_ms = secs * 1000 + usecs / 1000;
+
+#ifdef RT_DEBUG
+ rt_stats(rt);
+#endif
+
+ values[0] = Int64GetDatum(rt_memory_usage(rt));
+ values[1] = Int64GetDatum(load_time_ms);
+ values[2] = Int64GetDatum(search_time_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_fixed_height_search(PG_FUNCTION_ARGS)
+{
+ int fanout = PG_GETARG_INT32(0);
+ rt_radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_load_ms,
+ rt_search_ms;
+ Datum values[5];
+ bool nulls[5];
+
+ /* test boundary between vector and iteration */
+ const int n_keys = 5 * 16 * 16 * 16 * 16;
+ uint64 r,
+ h,
+ i,
+ j,
+ k;
+ uint64 key_id;
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+
+ key_id = 0;
+
+ /*
+ * lower nodes have limited fanout, the top is only limited by
+ * bits-per-byte
+ */
+ for (r = 1;; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ for (i = 1; i <= fanout; i++)
+ {
+ for (j = 1; j <= fanout; j++)
+ {
+ for (k = 1; k <= fanout; k++)
+ {
+ uint64 key;
+
+ key = (r << 32) | (h << 24) | (i << 16) | (j << 8) | (k);
+
+ CHECK_FOR_INTERRUPTS();
+
+ key_id++;
+ if (key_id > n_keys)
+ goto finish_set;
+
+ rt_set(rt, key, &key_id);
+ }
+ }
+ }
+ }
+ }
+finish_set:
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_load_ms = secs * 1000 + usecs / 1000;
+
+#ifdef RT_DEBUG
+ rt_stats(rt);
+#endif
+
+ /* measure the search time of the radix tree */
+ start_time = GetCurrentTimestamp();
+
+ key_id = 0;
+ for (r = 1;; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ for (i = 1; i <= fanout; i++)
+ {
+ for (j = 1; j <= fanout; j++)
+ {
+ for (k = 1; k <= fanout; k++)
+ {
+ uint64 key,
+ val;
+
+ key = (r << 32) | (h << 24) | (i << 16) | (j << 8) | (k);
+
+ CHECK_FOR_INTERRUPTS();
+
+ key_id++;
+ if (key_id > n_keys)
+ goto finish_search;
+
+ rt_search(rt, key, &val);
+ }
+ }
+ }
+ }
+ }
+finish_search:
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_search_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int32GetDatum(fanout);
+ values[1] = Int64GetDatum(rt_num_entries(rt));
+ values[2] = Int64GetDatum(rt_memory_usage(rt));
+ values[3] = Int64GetDatum(rt_load_ms);
+ values[4] = Int64GetDatum(rt_search_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_node128_load(PG_FUNCTION_ARGS)
+{
+ int fanout = PG_GETARG_INT32(0);
+ rt_radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_sparseload_ms;
+ Datum values[5];
+ bool nulls[5];
+
+ uint64 r,
+ h;
+ uint64 key_id;
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ rt = rt_create(CurrentMemoryContext);
+
+ key_id = 0;
+
+ for (r = 1; r <= fanout; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (h);
+
+ key_id++;
+ rt_set(rt, key, &key_id);
+ }
+ }
+
+#ifdef RT_DEBUG
+ rt_stats(rt);
+#endif
+
+ /* measure sparse deletion and re-loading */
+ start_time = GetCurrentTimestamp();
+
+ for (int t = 0; t<10000; t++)
+ {
+ /* delete one key in each leaf */
+ for (r = 1; r <= fanout; r++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (fanout);
+
+ rt_delete(rt, key);
+ }
+
+ /* add them all back */
+ key_id = 0;
+ for (r = 1; r <= fanout; r++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (fanout);
+
+ key_id++;
+ rt_set(rt, key, &key_id);
+ }
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_sparseload_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int32GetDatum(fanout);
+ values[1] = Int64GetDatum(rt_num_entries(rt));
+ values[2] = Int64GetDatum(rt_memory_usage(rt));
+ values[3] = Int64GetDatum(rt_sparseload_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+/* to silence warnings about unused iter functions */
+static void pg_attribute_unused()
+stub_iter()
+{
+ rt_radix_tree *rt;
+ rt_iter *iter;
+ uint64 key = 1;
+ uint64 value = 1;
+
+ rt = rt_create(CurrentMemoryContext);
+
+ iter = rt_begin_iterate(rt);
+ rt_iterate_next(iter, &key, &value);
+ rt_end_iterate(iter);
+}
diff --git a/contrib/bench_radix_tree/bench_radix_tree.control b/contrib/bench_radix_tree/bench_radix_tree.control
new file mode 100644
index 0000000000..1d988e6c9a
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.control
@@ -0,0 +1,6 @@
+# bench_radix_tree extension
+comment = 'benchmark suite for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/bench_radix_tree'
+relocatable = true
+trusted = true
diff --git a/contrib/bench_radix_tree/expected/bench.out b/contrib/bench_radix_tree/expected/bench.out
new file mode 100644
index 0000000000..60c303892e
--- /dev/null
+++ b/contrib/bench_radix_tree/expected/bench.out
@@ -0,0 +1,13 @@
+create extension bench_radix_tree;
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
diff --git a/contrib/bench_radix_tree/meson.build b/contrib/bench_radix_tree/meson.build
new file mode 100644
index 0000000000..332c1ae7df
--- /dev/null
+++ b/contrib/bench_radix_tree/meson.build
@@ -0,0 +1,33 @@
+bench_radix_tree_sources = files(
+ 'bench_radix_tree.c',
+)
+
+if host_system == 'windows'
+ bench_radix_tree_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'bench_radix_tree',
+ '--FILEDESC', 'bench_radix_tree - performance test code for radix tree',])
+endif
+
+bench_radix_tree = shared_module('bench_radix_tree',
+ bench_radix_tree_sources,
+ link_with: pgport_srv,
+ kwargs: pg_mod_args,
+)
+testprep_targets += bench_radix_tree
+
+install_data(
+ 'bench_radix_tree.control',
+ 'bench_radix_tree--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'bench_radix_tree',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'bench_radix_tree',
+ ],
+ },
+}
diff --git a/contrib/bench_radix_tree/sql/bench.sql b/contrib/bench_radix_tree/sql/bench.sql
new file mode 100644
index 0000000000..a46018c9d4
--- /dev/null
+++ b/contrib/bench_radix_tree/sql/bench.sql
@@ -0,0 +1,16 @@
+create extension bench_radix_tree;
+
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
diff --git a/contrib/meson.build b/contrib/meson.build
index bd4a57c43c..421d469f8c 100644
--- a/contrib/meson.build
+++ b/contrib/meson.build
@@ -12,6 +12,7 @@ subdir('amcheck')
subdir('auth_delay')
subdir('auto_explain')
subdir('basic_archive')
+subdir('bench_radix_tree')
subdir('bloom')
subdir('basebackup_to_shell')
subdir('bool_plperl')
--
2.31.1
Attachment: v30-0002-Move-some-bitmap-logic-out-of-bitmapset.c.patch (application/octet-stream)
From 756f0a7a1f3e9030ddc68ae635baa25c4a310b4d Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Tue, 6 Dec 2022 13:39:41 +0700
Subject: [PATCH v30 02/11] Move some bitmap logic out of bitmapset.c
Add pg_rightmost_one32/64 functions. This functionality was previously
private to bitmapset.c as the RIGHTMOST_ONE macro. It has practical
use in other contexts, so move to pg_bitutils.h.
Also move the logic for selecting appropriate pg_bitutils functions
based on word size to bitmapset.h for wider visibility and add
appropriate selection for bmw_rightmost_one().
Since the previous macro relied on casting to signedbitmapword,
and the new functions do not, remove that typedef.
Design input and review by Tom Lane
Discussion: https://www.postgresql.org/message-id/CAFBsxsFW2JjTo58jtDB%2B3sZhxMx3t-3evew8%3DAcr%2BGGhC%2BkFaA%40mail.gmail.com
---
src/backend/nodes/bitmapset.c | 36 ++------------------------------
src/include/nodes/bitmapset.h | 16 ++++++++++++--
src/include/port/pg_bitutils.h | 31 +++++++++++++++++++++++++++
src/tools/pgindent/typedefs.list | 1 -
4 files changed, 47 insertions(+), 37 deletions(-)
diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index 98660524ad..fcd8e2ccbc 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -32,39 +32,7 @@
#define BITMAPSET_SIZE(nwords) \
(offsetof(Bitmapset, words) + (nwords) * sizeof(bitmapword))
-/*----------
- * This is a well-known cute trick for isolating the rightmost one-bit
- * in a word. It assumes two's complement arithmetic. Consider any
- * nonzero value, and focus attention on the rightmost one. The value is
- * then something like
- * xxxxxx10000
- * where x's are unspecified bits. The two's complement negative is formed
- * by inverting all the bits and adding one. Inversion gives
- * yyyyyy01111
- * where each y is the inverse of the corresponding x. Incrementing gives
- * yyyyyy10000
- * and then ANDing with the original value gives
- * 00000010000
- * This works for all cases except original value = zero, where of course
- * we get zero.
- *----------
- */
-#define RIGHTMOST_ONE(x) ((signedbitmapword) (x) & -((signedbitmapword) (x)))
-
-#define HAS_MULTIPLE_ONES(x) ((bitmapword) RIGHTMOST_ONE(x) != (x))
-
-/* Select appropriate bit-twiddling functions for bitmap word size */
-#if BITS_PER_BITMAPWORD == 32
-#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos32(w)
-#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos32(w)
-#define bmw_popcount(w) pg_popcount32(w)
-#elif BITS_PER_BITMAPWORD == 64
-#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos64(w)
-#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos64(w)
-#define bmw_popcount(w) pg_popcount64(w)
-#else
-#error "invalid BITS_PER_BITMAPWORD"
-#endif
+#define HAS_MULTIPLE_ONES(x) (bmw_rightmost_one(x) != (x))
/*
@@ -1013,7 +981,7 @@ bms_first_member(Bitmapset *a)
{
int result;
- w = RIGHTMOST_ONE(w);
+ w = bmw_rightmost_one(w);
a->words[wordnum] &= ~w;
result = wordnum * BITS_PER_BITMAPWORD;
diff --git a/src/include/nodes/bitmapset.h b/src/include/nodes/bitmapset.h
index 3d2225e1ae..5f9a511b4a 100644
--- a/src/include/nodes/bitmapset.h
+++ b/src/include/nodes/bitmapset.h
@@ -38,13 +38,11 @@ struct List;
#define BITS_PER_BITMAPWORD 64
typedef uint64 bitmapword; /* must be an unsigned type */
-typedef int64 signedbitmapword; /* must be the matching signed type */
#else
#define BITS_PER_BITMAPWORD 32
typedef uint32 bitmapword; /* must be an unsigned type */
-typedef int32 signedbitmapword; /* must be the matching signed type */
#endif
@@ -75,6 +73,20 @@ typedef enum
BMS_MULTIPLE /* >1 member */
} BMS_Membership;
+/* Select appropriate bit-twiddling functions for bitmap word size */
+#if BITS_PER_BITMAPWORD == 32
+#define bmw_rightmost_one(w) pg_rightmost_one32(w)
+#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos32(w)
+#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos32(w)
+#define bmw_popcount(w) pg_popcount32(w)
+#elif BITS_PER_BITMAPWORD == 64
+#define bmw_rightmost_one(w) pg_rightmost_one64(w)
+#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos64(w)
+#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos64(w)
+#define bmw_popcount(w) pg_popcount64(w)
+#else
+#error "invalid BITS_PER_BITMAPWORD"
+#endif
/*
* function prototypes in nodes/bitmapset.c
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 158ef73a2b..bf7588e075 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -32,6 +32,37 @@ extern PGDLLIMPORT const uint8 pg_leftmost_one_pos[256];
extern PGDLLIMPORT const uint8 pg_rightmost_one_pos[256];
extern PGDLLIMPORT const uint8 pg_number_of_ones[256];
+/*----------
+ * This is a well-known cute trick for isolating the rightmost one-bit
+ * in a word. It assumes two's complement arithmetic. Consider any
+ * nonzero value, and focus attention on the rightmost one. The value is
+ * then something like
+ * xxxxxx10000
+ * where x's are unspecified bits. The two's complement negative is formed
+ * by inverting all the bits and adding one. Inversion gives
+ * yyyyyy01111
+ * where each y is the inverse of the corresponding x. Incrementing gives
+ * yyyyyy10000
+ * and then ANDing with the original value gives
+ * 00000010000
+ * This works for all cases except original value = zero, where of course
+ * we get zero.
+ *----------
+ */
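+/*
+ * A concrete illustration (example values added for clarity, not in the
+ * original bitmapset.c comment): for word = 12 (binary 1100), -word is
+ * ...11110100 in two's complement, so word & -word gives binary 0100,
+ * i.e. 4, the isolated rightmost one-bit.
+ */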
+static inline uint32
+pg_rightmost_one32(uint32 word)
+{
+ int32 result = (int32) word & -((int32) word);
+ return (uint32) result;
+}
+
+static inline uint64
+pg_rightmost_one64(uint64 word)
+{
+ int64 result = (int64) word & -((int64) word);
+ return (uint64) result;
+}
+
/*
* pg_leftmost_one_pos32
* Returns the position of the most significant set bit in "word",
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 86a9303bf5..4a5e776703 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3675,7 +3675,6 @@ shmem_request_hook_type
shmem_startup_hook_type
sig_atomic_t
sigjmp_buf
-signedbitmapword
sigset_t
size_t
slist_head
--
2.31.1
Attachment: v30-0003-Add-a-macro-templatized-radix-tree.patch (application/octet-stream)
From 87b21d222bc9e2b8bdbd6cb7c880d1f5a5192242 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 14 Sep 2022 12:38:51 +0000
Subject: [PATCH v30 03/11] Add a macro templatized radix tree.
The radix tree data structure is implemented based on the idea from
the paper "The Adaptive Radix Tree: ARTful Indexing for Main-Memory
Databases" by Viktor Leis, Alfons Kemper, and Thomas Neumann,
2013. Some optimizations proposed in the ART paper are not yet
implemented, particularly path compression and lazy path expansion.
For better performance, the radix tree has to be adjusted to the
individual use case at compile time. So the radix
tree is implemented using a macro-templatized header file, which
generates functions and types based on a prefix and other parameters.
The key of the radix tree is a 64-bit unsigned integer, but the caller
can specify the type of the value. Our main innovation compared to the
ART paper is decoupling the notion of size class from kind. The size
classes within
a given node kind have the same underlying type, but a variable number
of children/values. Nodes of different kinds necessarily belong to
different size classes. Growing from one node kind to another requires
special code for each case, but growing from one size class to another
within the same kind is basically just allocate + memcpy.
The radix tree can also be created in a DSA area. To handle
concurrency, we use a single reader-writer lock for the radix
tree. The current locking mechanism is not optimized for high
concurrency with mixed read-write workloads. In the future it might be
worthwhile to replace it with the Optimistic Lock Coupling or ROWEX
mentioned in the paper "The ART of Practical Synchronization" by the
same authors as the ART paper, 2016.
Later patches use this infrastructure to store dead tuple TIDs during
lazy vacuum. There are other possible cases where this could be useful
(e.g., as a replacement for the hash table used for shared buffers).
This includes a unit test module, in src/test/modules/test_radixtree.
Discussion: https://postgr.es/m/CAD21AoAfOZvmfR0j8VmZorZjL7RhTiQdVttNuC4W-Shdc2a-AA@mail.gmail.com
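To make the template usage concrete, here is a minimal sketch (illustration
only, not part of the patch) of how a single-process caller could instantiate
and use the tree. The prefix "rt", the static scope, and the uint64 value type
are arbitrary choices for the example, mirroring what the benchmark module
above does:

    /* instantiate a local-memory radix tree with uint64 values */
    #define RT_PREFIX rt
    #define RT_SCOPE static
    #define RT_DECLARE
    #define RT_DEFINE
    #define RT_VALUE_TYPE uint64
    #include "lib/radixtree.h"

    /* ... later, in a function: */
    rt_radix_tree *tree = rt_create(CurrentMemoryContext);
    uint64 key = 42;
    uint64 val = 1;

    rt_set(tree, key, &val);            /* insert or update the pair */
    if (rt_search(tree, key, &val))     /* existence check; fetches the value */
        elog(NOTICE, "found value " UINT64_FORMAT, val);
    rt_free(tree);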
---
src/backend/utils/mmgr/dsa.c | 12 +
src/include/lib/radixtree.h | 2523 +++++++++++++++++
src/include/lib/radixtree_delete_impl.h | 122 +
src/include/lib/radixtree_insert_impl.h | 328 +++
src/include/lib/radixtree_iter_impl.h | 144 +
src/include/lib/radixtree_search_impl.h | 138 +
src/include/utils/dsa.h | 1 +
src/test/modules/Makefile | 1 +
src/test/modules/meson.build | 1 +
src/test/modules/test_radixtree/.gitignore | 4 +
src/test/modules/test_radixtree/Makefile | 23 +
src/test/modules/test_radixtree/README | 7 +
.../expected/test_radixtree.out | 38 +
src/test/modules/test_radixtree/meson.build | 35 +
.../test_radixtree/sql/test_radixtree.sql | 7 +
.../test_radixtree/test_radixtree--1.0.sql | 8 +
.../modules/test_radixtree/test_radixtree.c | 712 +++++
.../test_radixtree/test_radixtree.control | 4 +
src/tools/pginclude/cpluspluscheck | 6 +
src/tools/pginclude/headerscheck | 6 +
20 files changed, 4120 insertions(+)
create mode 100644 src/include/lib/radixtree.h
create mode 100644 src/include/lib/radixtree_delete_impl.h
create mode 100644 src/include/lib/radixtree_insert_impl.h
create mode 100644 src/include/lib/radixtree_iter_impl.h
create mode 100644 src/include/lib/radixtree_search_impl.h
create mode 100644 src/test/modules/test_radixtree/.gitignore
create mode 100644 src/test/modules/test_radixtree/Makefile
create mode 100644 src/test/modules/test_radixtree/README
create mode 100644 src/test/modules/test_radixtree/expected/test_radixtree.out
create mode 100644 src/test/modules/test_radixtree/meson.build
create mode 100644 src/test/modules/test_radixtree/sql/test_radixtree.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree--1.0.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree.c
create mode 100644 src/test/modules/test_radixtree/test_radixtree.control
diff --git a/src/backend/utils/mmgr/dsa.c b/src/backend/utils/mmgr/dsa.c
index f5a62061a3..80555aefff 100644
--- a/src/backend/utils/mmgr/dsa.c
+++ b/src/backend/utils/mmgr/dsa.c
@@ -1024,6 +1024,18 @@ dsa_set_size_limit(dsa_area *area, size_t limit)
LWLockRelease(DSA_AREA_LOCK(area));
}
+size_t
+dsa_get_total_size(dsa_area *area)
+{
+ size_t size;
+
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_SHARED);
+ size = area->control->total_segment_size;
+ LWLockRelease(DSA_AREA_LOCK(area));
+
+ return size;
+}
+
/*
* Aggressively free all spare memory in the hope of returning DSM segments to
* the operating system.
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
new file mode 100644
index 0000000000..2e3963c3d5
--- /dev/null
+++ b/src/include/lib/radixtree.h
@@ -0,0 +1,2523 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.h
+ * Template for adaptive radix tree.
+ *
+ * This module employs the idea from the paper "The Adaptive Radix Tree: ARTful
+ * Indexing for Main-Memory Databases" by Viktor Leis, Alfons Kemper, and Thomas
+ * Neumann, 2013. The radix tree uses adaptive node sizes, a small number of node
+ * types, each with a different number of elements. Depending on the number of
+ * children, the appropriate node type is used.
+ *
+ * WIP: notes about traditional radix tree trading off span vs height...
+ *
+ * There are two kinds of nodes, inner nodes and leaves. Inner nodes
+ * map partial keys to child pointers.
+ *
+ * The ART paper mentions three ways to implement leaves:
+ *
+ * "- Single-value leaves: The values are stored using an addi-
+ * tional leaf node type which stores one value.
+ * - Multi-value leaves: The values are stored in one of four
+ * different leaf node types, which mirror the structure of
+ * inner nodes, but contain values instead of pointers.
+ * - Combined pointer/value slots: If values fit into point-
+ * ers, no separate node types are necessary. Instead, each
+ * pointer storage location in an inner node can either
+ * store a pointer or a value."
+ *
+ * We chose "multi-value leaves" to avoid the additional pointer traversal
+ * required by "single-value leaves"
+ *
+ * For simplicity, the key is assumed to be 64-bit unsigned integer. The
+ * tree doesn't need to contain paths where the highest bytes of all keys
+ * are zero. That way, the tree's height adapts to the distribution of keys.
+ *
+ * TODO: In the future it might be worthwhile to offer configurability of
+ * leaf implementation for different use cases. Single-values leaves would
+ * give more flexibility in key type, including variable-length keys.
+ *
+ * There are some optimizations not yet implemented, particularly path
+ * compression and lazy path expansion.
+ *
+ * To handle concurrency, we use a single reader-writer lock for the radix
+ * tree. The radix tree is exclusively locked during write operations such
+ * as RT_SET() and RT_DELETE(), and shared locked during read operations
+ * such as RT_SEARCH(). An iteration also holds the shared lock on the radix
+ * tree until it is completed.
+ *
+ * TODO: The current locking mechanism is not optimized for high concurrency
+ * with mixed read-write workloads. In the future it might be worthwhile
+ * to replace it with the Optimistic Lock Coupling or ROWEX mentioned in
+ * the paper "The ART of Practical Synchronization" by the same authors as
+ * the ART paper, 2016.
+ *
+ * WIP: the radix tree nodes don't shrink.
+ *
+ * To generate a radix tree and associated functions for a use case, several
+ * macros have to be #define'ed before this file is included. Including
+ * the file #undef's all those, so a new radix tree can be generated
+ * afterwards.
+ * The relevant parameters are:
+ * - RT_PREFIX - prefix for all symbol names generated. A prefix of 'foo'
+ * will result in radix tree type 'foo_radix_tree' and functions like
+ * 'foo_create'/'foo_free' and so forth.
+ * - RT_DECLARE - if defined function prototypes and type declarations are
+ * generated
+ * - RT_DEFINE - if defined function definitions are generated
+ * - RT_SCOPE - in which scope (e.g. extern, static inline) do function
+ * declarations reside
+ * - RT_VALUE_TYPE - the type of the value.
+ *
+ * Optional parameters:
+ * - RT_SHMEM - if defined, the radix tree is created in the DSA area
+ * so that multiple processes can access it simultaneously.
+ * - RT_DEBUG - if defined add stats tracking and debugging functions
+ *
+ * Interface
+ * ---------
+ *
+ * RT_CREATE - Create a new, empty radix tree
+ * RT_FREE - Free the radix tree
+ * RT_SEARCH - Search a key-value pair
+ * RT_SET - Set a key-value pair
+ * RT_BEGIN_ITERATE - Begin iterating through all key-value pairs
+ * RT_ITERATE_NEXT - Return next key-value pair, if any
+ * RT_END_ITERATE - End iteration
+ * RT_MEMORY_USAGE - Get the memory usage
+ *
+ * Interface for Shared Memory
+ * ---------
+ *
+ * RT_ATTACH - Attach to the radix tree
+ * RT_DETACH - Detach from the radix tree
+ * RT_GET_HANDLE - Return the handle of the radix tree
+ *
+ * Optional Interface
+ * ---------
+ *
+ * RT_DELETE - Delete a key-value pair. Declared/defined if RT_USE_DELETE is defined
+ *
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/lib/radixtree.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "lib/stringinfo.h"
+#include "miscadmin.h"
+#include "nodes/bitmapset.h"
+#include "port/pg_bitutils.h"
+#include "port/simd.h"
+#include "utils/dsa.h"
+#include "utils/memutils.h"
+
+/* helpers */
+#define RT_MAKE_PREFIX(a) CppConcat(a,_)
+#define RT_MAKE_NAME(name) RT_MAKE_NAME_(RT_MAKE_PREFIX(RT_PREFIX),name)
+#define RT_MAKE_NAME_(a,b) CppConcat(a,b)
+
+/* function declarations */
+#define RT_CREATE RT_MAKE_NAME(create)
+#define RT_FREE RT_MAKE_NAME(free)
+#define RT_SEARCH RT_MAKE_NAME(search)
+#ifdef RT_SHMEM
+#define RT_ATTACH RT_MAKE_NAME(attach)
+#define RT_DETACH RT_MAKE_NAME(detach)
+#define RT_GET_HANDLE RT_MAKE_NAME(get_handle)
+#endif
+#define RT_SET RT_MAKE_NAME(set)
+#define RT_BEGIN_ITERATE RT_MAKE_NAME(begin_iterate)
+#define RT_ITERATE_NEXT RT_MAKE_NAME(iterate_next)
+#define RT_END_ITERATE RT_MAKE_NAME(end_iterate)
+#ifdef RT_USE_DELETE
+#define RT_DELETE RT_MAKE_NAME(delete)
+#endif
+#define RT_MEMORY_USAGE RT_MAKE_NAME(memory_usage)
+#ifdef RT_DEBUG
+#define RT_DUMP RT_MAKE_NAME(dump)
+#define RT_DUMP_NODE RT_MAKE_NAME(dump_node)
+#define RT_DUMP_SEARCH RT_MAKE_NAME(dump_search)
+#define RT_STATS RT_MAKE_NAME(stats)
+#endif
+
+/* internal helper functions (no externally visible prototypes) */
+#define RT_NEW_ROOT RT_MAKE_NAME(new_root)
+#define RT_ALLOC_NODE RT_MAKE_NAME(alloc_node)
+#define RT_INIT_NODE RT_MAKE_NAME(init_node)
+#define RT_FREE_NODE RT_MAKE_NAME(free_node)
+#define RT_FREE_RECURSE RT_MAKE_NAME(free_recurse)
+#define RT_EXTEND_UP RT_MAKE_NAME(extend_up)
+#define RT_EXTEND_DOWN RT_MAKE_NAME(extend_down)
+#define RT_SWITCH_NODE_KIND RT_MAKE_NAME(grow_node_kind)
+#define RT_COPY_NODE RT_MAKE_NAME(copy_node)
+#define RT_REPLACE_NODE RT_MAKE_NAME(replace_node)
+#define RT_PTR_GET_LOCAL RT_MAKE_NAME(ptr_get_local)
+#define RT_PTR_ALLOC_IS_VALID RT_MAKE_NAME(ptr_stored_is_valid)
+#define RT_NODE_3_SEARCH_EQ RT_MAKE_NAME(node_3_search_eq)
+#define RT_NODE_32_SEARCH_EQ RT_MAKE_NAME(node_32_search_eq)
+#define RT_NODE_3_GET_INSERTPOS RT_MAKE_NAME(node_3_get_insertpos)
+#define RT_NODE_32_GET_INSERTPOS RT_MAKE_NAME(node_32_get_insertpos)
+#define RT_CHUNK_CHILDREN_ARRAY_SHIFT RT_MAKE_NAME(chunk_children_array_shift)
+#define RT_CHUNK_VALUES_ARRAY_SHIFT RT_MAKE_NAME(chunk_values_array_shift)
+#define RT_CHUNK_CHILDREN_ARRAY_DELETE RT_MAKE_NAME(chunk_children_array_delete)
+#define RT_CHUNK_VALUES_ARRAY_DELETE RT_MAKE_NAME(chunk_values_array_delete)
+#define RT_CHUNK_CHILDREN_ARRAY_COPY RT_MAKE_NAME(chunk_children_array_copy)
+#define RT_CHUNK_VALUES_ARRAY_COPY RT_MAKE_NAME(chunk_values_array_copy)
+#define RT_NODE_125_IS_CHUNK_USED RT_MAKE_NAME(node_125_is_chunk_used)
+#define RT_NODE_INNER_125_GET_CHILD RT_MAKE_NAME(node_inner_125_get_child)
+#define RT_NODE_LEAF_125_GET_VALUE RT_MAKE_NAME(node_leaf_125_get_value)
+#define RT_NODE_INNER_256_IS_CHUNK_USED RT_MAKE_NAME(node_inner_256_is_chunk_used)
+#define RT_NODE_LEAF_256_IS_CHUNK_USED RT_MAKE_NAME(node_leaf_256_is_chunk_used)
+#define RT_NODE_INNER_256_GET_CHILD RT_MAKE_NAME(node_inner_256_get_child)
+#define RT_NODE_LEAF_256_GET_VALUE RT_MAKE_NAME(node_leaf_256_get_value)
+#define RT_NODE_INNER_256_SET RT_MAKE_NAME(node_inner_256_set)
+#define RT_NODE_LEAF_256_SET RT_MAKE_NAME(node_leaf_256_set)
+#define RT_NODE_INNER_256_DELETE RT_MAKE_NAME(node_inner_256_delete)
+#define RT_NODE_LEAF_256_DELETE RT_MAKE_NAME(node_leaf_256_delete)
+#define RT_KEY_GET_SHIFT RT_MAKE_NAME(key_get_shift)
+#define RT_SHIFT_GET_MAX_VAL RT_MAKE_NAME(shift_get_max_val)
+#define RT_NODE_SEARCH_INNER RT_MAKE_NAME(node_search_inner)
+#define RT_NODE_SEARCH_LEAF RT_MAKE_NAME(node_search_leaf)
+#define RT_NODE_UPDATE_INNER RT_MAKE_NAME(node_update_inner)
+#define RT_NODE_DELETE_INNER RT_MAKE_NAME(node_delete_inner)
+#define RT_NODE_DELETE_LEAF RT_MAKE_NAME(node_delete_leaf)
+#define RT_NODE_INSERT_INNER RT_MAKE_NAME(node_insert_inner)
+#define RT_NODE_INSERT_LEAF RT_MAKE_NAME(node_insert_leaf)
+#define RT_NODE_INNER_ITERATE_NEXT RT_MAKE_NAME(node_inner_iterate_next)
+#define RT_NODE_LEAF_ITERATE_NEXT RT_MAKE_NAME(node_leaf_iterate_next)
+#define RT_ITER_SET_NODE_FROM RT_MAKE_NAME(iter_set_node_from)
+#define RT_ITER_UPDATE_KEY RT_MAKE_NAME(iter_update_key)
+#define RT_VERIFY_NODE RT_MAKE_NAME(verify_node)
+
+/* type declarations */
+#define RT_RADIX_TREE RT_MAKE_NAME(radix_tree)
+#define RT_RADIX_TREE_CONTROL RT_MAKE_NAME(radix_tree_control)
+#define RT_ITER RT_MAKE_NAME(iter)
+#ifdef RT_SHMEM
+#define RT_HANDLE RT_MAKE_NAME(handle)
+#endif
+#define RT_NODE RT_MAKE_NAME(node)
+#define RT_NODE_ITER RT_MAKE_NAME(node_iter)
+#define RT_NODE_BASE_3 RT_MAKE_NAME(node_base_3)
+#define RT_NODE_BASE_32 RT_MAKE_NAME(node_base_32)
+#define RT_NODE_BASE_125 RT_MAKE_NAME(node_base_125)
+#define RT_NODE_BASE_256 RT_MAKE_NAME(node_base_256)
+#define RT_NODE_INNER_3 RT_MAKE_NAME(node_inner_3)
+#define RT_NODE_INNER_32 RT_MAKE_NAME(node_inner_32)
+#define RT_NODE_INNER_125 RT_MAKE_NAME(node_inner_125)
+#define RT_NODE_INNER_256 RT_MAKE_NAME(node_inner_256)
+#define RT_NODE_LEAF_3 RT_MAKE_NAME(node_leaf_3)
+#define RT_NODE_LEAF_32 RT_MAKE_NAME(node_leaf_32)
+#define RT_NODE_LEAF_125 RT_MAKE_NAME(node_leaf_125)
+#define RT_NODE_LEAF_256 RT_MAKE_NAME(node_leaf_256)
+#define RT_SIZE_CLASS RT_MAKE_NAME(size_class)
+#define RT_SIZE_CLASS_ELEM RT_MAKE_NAME(size_class_elem)
+#define RT_SIZE_CLASS_INFO RT_MAKE_NAME(size_class_info)
+#define RT_CLASS_3 RT_MAKE_NAME(class_3)
+#define RT_CLASS_32_MIN RT_MAKE_NAME(class_32_min)
+#define RT_CLASS_32_MAX RT_MAKE_NAME(class_32_max)
+#define RT_CLASS_125 RT_MAKE_NAME(class_125)
+#define RT_CLASS_256 RT_MAKE_NAME(class_256)
+
+/* generate forward declarations necessary to use the radix tree */
+#ifdef RT_DECLARE
+
+typedef struct RT_RADIX_TREE RT_RADIX_TREE;
+typedef struct RT_ITER RT_ITER;
+
+#ifdef RT_SHMEM
+typedef dsa_pointer RT_HANDLE;
+#endif
+
+#ifdef RT_SHMEM
+RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx, dsa_area *dsa, int tranche_id);
+RT_SCOPE RT_RADIX_TREE * RT_ATTACH(dsa_area *dsa, dsa_pointer dp);
+RT_SCOPE void RT_DETACH(RT_RADIX_TREE *tree);
+RT_SCOPE RT_HANDLE RT_GET_HANDLE(RT_RADIX_TREE *tree);
+#else
+RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx);
+#endif
+RT_SCOPE void RT_FREE(RT_RADIX_TREE *tree);
+
+RT_SCOPE bool RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p);
+RT_SCOPE bool RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p);
+#ifdef RT_USE_DELETE
+RT_SCOPE bool RT_DELETE(RT_RADIX_TREE *tree, uint64 key);
+#endif
+
+RT_SCOPE RT_ITER * RT_BEGIN_ITERATE(RT_RADIX_TREE *tree);
+RT_SCOPE bool RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, RT_VALUE_TYPE *value_p);
+RT_SCOPE void RT_END_ITERATE(RT_ITER *iter);
+
+RT_SCOPE uint64 RT_MEMORY_USAGE(RT_RADIX_TREE *tree);
+
+#ifdef RT_DEBUG
+RT_SCOPE void RT_DUMP(RT_RADIX_TREE *tree);
+RT_SCOPE void RT_DUMP_SEARCH(RT_RADIX_TREE *tree, uint64 key);
+RT_SCOPE void RT_STATS(RT_RADIX_TREE *tree);
+#endif
+
+#endif /* RT_DECLARE */
+
+
+/* generate implementation of the radix tree */
+#ifdef RT_DEFINE
+
+/* The number of bits encoded in one tree level */
+#define RT_NODE_SPAN BITS_PER_BYTE
+
+/* The maximum number of slots in the node */
+#define RT_NODE_MAX_SLOTS (1 << RT_NODE_SPAN)
+
+/* Mask for extracting a chunk from the key */
+#define RT_CHUNK_MASK ((1 << RT_NODE_SPAN) - 1)
+
+/* Maximum shift the radix tree uses */
+#define RT_MAX_SHIFT RT_KEY_GET_SHIFT(UINT64_MAX)
+
+/* Maximum number of levels the radix tree can have */
+#define RT_MAX_LEVEL ((sizeof(uint64) * BITS_PER_BYTE) / RT_NODE_SPAN)
+
+/*
+ * Number of bits necessary for isset array in the slot-index node.
+ * Since bitmapword can be 64 bits, the only values that make sense
+ * here are 64 and 128.
+ */
+#define RT_SLOT_IDX_LIMIT (RT_NODE_MAX_SLOTS / 2)
+
+/* Invalid index used in node-125 */
+#define RT_INVALID_SLOT_IDX 0xFF
+
+/* Get a chunk from the key */
+#define RT_GET_KEY_CHUNK(key, shift) ((uint8) (((key) >> (shift)) & RT_CHUNK_MASK))
+
+/* For accessing bitmaps */
+#define RT_BM_IDX(x) ((x) / BITS_PER_BITMAPWORD)
+#define RT_BM_BIT(x) ((x) % BITS_PER_BITMAPWORD)
+
+/*
+ * Node kinds
+ *
+ * The different node kinds are what make the tree "adaptive".
+ *
+ * Each node kind is associated with a different datatype and different
+ * search/set/delete/iterate algorithms adapted for its size. The largest
+ * kind, node256 is basically the same as a traditional radix tree,
+ * and would be most wasteful of memory when sparsely populated. The
+ * smaller nodes expend some additional CPU time to enable a smaller
+ * memory footprint.
+ *
+ * XXX There are 4 node kinds, and this should never be increased,
+ * for several reasons:
+ * 1. With 5 or more kinds, gcc tends to use a jump table for switch
+ * statements.
+ * 2. The 4 kinds can be represented with 2 bits, so we have the option
+ * in the future to tag the node pointer with the kind, even on
+ * platforms with 32-bit pointers. This might speed up node traversal
+ * in trees with highly random node kinds.
+ * 3. We can have multiple size classes per node kind.
+ */
+#define RT_NODE_KIND_3 0x00
+#define RT_NODE_KIND_32 0x01
+#define RT_NODE_KIND_125 0x02
+#define RT_NODE_KIND_256 0x03
+#define RT_NODE_KIND_COUNT 4
+
+/*
+ * Calculate the slab blocksize so that we can allocate at least 32 chunks
+ * from the block.
+ */
+#define RT_SLAB_BLOCK_SIZE(size) \
+ Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * (size), (size) * 32)
+
+/* Common type for all nodes types */
+typedef struct RT_NODE
+{
+ /*
+ * Number of children. We use uint16 to be able to indicate 256 children
+ * at the node span of 8 bits.
+ */
+ uint16 count;
+
+ /*
+ * Max capacity for the current size class. Storing this in the
+ * node enables multiple size classes per node kind.
+ * Technically, kinds with a single size class don't need this, so we could
+ * keep this in the individual base types, but the code is simpler this way.
+ * Note: node256 is unique in that it cannot possibly have more than a
+ * single size class, so for that kind we store zero, and uint8 is
+ * sufficient for other kinds.
+ */
+ uint8 fanout;
+
+ /*
+ * Shift indicates which part of the key space is represented by this
+ * node. That is, the key is shifted by 'shift' and the lowest
+ * RT_NODE_SPAN bits are then represented in chunk.
+ */
+ uint8 shift;
+
+ /* Node kind, one per search/set algorithm */
+ uint8 kind;
+} RT_NODE;
+
+
+#define RT_PTR_LOCAL RT_NODE *
+
+#ifdef RT_SHMEM
+#define RT_PTR_ALLOC dsa_pointer
+#else
+#define RT_PTR_ALLOC RT_PTR_LOCAL
+#endif
+
+
+#ifdef RT_SHMEM
+#define RT_INVALID_PTR_ALLOC InvalidDsaPointer
+#else
+#define RT_INVALID_PTR_ALLOC NULL
+#endif
+
+#ifdef RT_SHMEM
+#define RT_LOCK_EXCLUSIVE(tree) LWLockAcquire(&tree->ctl->lock, LW_EXCLUSIVE)
+#define RT_LOCK_SHARED(tree) LWLockAcquire(&tree->ctl->lock, LW_SHARED)
+#define RT_UNLOCK(tree) LWLockRelease(&tree->ctl->lock);
+#else
+#define RT_LOCK_EXCLUSIVE(tree) ((void) 0)
+#define RT_LOCK_SHARED(tree) ((void) 0)
+#define RT_UNLOCK(tree) ((void) 0)
+#endif
+
+/*
+ * Inner nodes and leaf nodes have analogous structure. To distinguish
+ * them at runtime, we take advantage of the fact that the key chunk
+ * is accessed by shifting: inner tree nodes (shift > 0) store pointers
+ * to their child nodes in the slots. In leaf nodes (shift == 0),
+ * the slot contains the value corresponding to the key.
+ */
+#define RT_NODE_IS_LEAF(n) (((RT_PTR_LOCAL) (n))->shift == 0)
+
+#define RT_NODE_MUST_GROW(node) \
+ ((node)->base.n.count == (node)->base.n.fanout)
+
+/*
+ * Base types of each node kind for leaf and inner nodes.
+ * The base types must be able to accommodate the largest size
+ * class for variable-sized node kinds.
+ */
+typedef struct RT_NODE_BASE_3
+{
+ RT_NODE n;
+
+ /* 3 children, for key chunks */
+ uint8 chunks[3];
+} RT_NODE_BASE_3;
+
+typedef struct RT_NODE_BASE_32
+{
+ RT_NODE n;
+
+ /* 32 children, for key chunks */
+ uint8 chunks[32];
+} RT_NODE_BASE_32;
+
+/*
+ * node-125 uses slot_idx array, an array of RT_NODE_MAX_SLOTS length
+ * to store indexes into a second array that contains the values (or
+ * child pointers).
+ */
+typedef struct RT_NODE_BASE_125
+{
+ RT_NODE n;
+
+ /* The index of slots for each fanout */
+ uint8 slot_idxs[RT_NODE_MAX_SLOTS];
+
+ /* bitmap to track which slots are in use */
+ bitmapword isset[RT_BM_IDX(RT_SLOT_IDX_LIMIT)];
+} RT_NODE_BASE_125;
+
+typedef struct RT_NODE_BASE_256
+{
+ RT_NODE n;
+} RT_NODE_BASE_256;
+
+/*
+ * Inner and leaf nodes.
+ *
+ * These are separate because the value type might be different from
+ * something fitting into a pointer-width type.
+ */
+typedef struct RT_NODE_INNER_3
+{
+ RT_NODE_BASE_3 base;
+
+ /* number of children depends on size class */
+ RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_INNER_3;
+
+typedef struct RT_NODE_LEAF_3
+{
+ RT_NODE_BASE_3 base;
+
+ /* number of values depends on size class */
+ RT_VALUE_TYPE values[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_LEAF_3;
+
+typedef struct RT_NODE_INNER_32
+{
+ RT_NODE_BASE_32 base;
+
+ /* number of children depends on size class */
+ RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_INNER_32;
+
+typedef struct RT_NODE_LEAF_32
+{
+ RT_NODE_BASE_32 base;
+
+ /* number of values depends on size class */
+ RT_VALUE_TYPE values[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_LEAF_32;
+
+typedef struct RT_NODE_INNER_125
+{
+ RT_NODE_BASE_125 base;
+
+ /* number of children depends on size class */
+ RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_INNER_125;
+
+typedef struct RT_NODE_LEAF_125
+{
+ RT_NODE_BASE_125 base;
+
+ /* number of values depends on size class */
+ RT_VALUE_TYPE values[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_LEAF_125;
+
+/*
+ * node-256 is the largest node type. This node has an array
+ * for directly storing values (or child pointers in inner nodes).
+ * Unlike other node kinds, its array size is by definition
+ * fixed.
+ */
+typedef struct RT_NODE_INNER_256
+{
+ RT_NODE_BASE_256 base;
+
+ /* Slots for 256 children */
+ RT_PTR_ALLOC children[RT_NODE_MAX_SLOTS];
+} RT_NODE_INNER_256;
+
+typedef struct RT_NODE_LEAF_256
+{
+ RT_NODE_BASE_256 base;
+
+ /*
+ * Unlike with inner256, zero is a valid value here, so we use a
+ * bitmap to track which slots are in use.
+ */
+ bitmapword isset[RT_BM_IDX(RT_NODE_MAX_SLOTS)];
+
+ /* Slots for 256 values */
+ RT_VALUE_TYPE values[RT_NODE_MAX_SLOTS];
+} RT_NODE_LEAF_256;
+
+/*
+ * Node size classes
+ *
+ * Nodes of different kinds necessarily belong to different size classes.
+ * The main innovation in our implementation compared to the ART paper
+ * is decoupling the notion of size class from kind.
+ *
+ * The size classes within a given node kind have the same underlying
+ * type, but a variable number of children/values. This is possible
+ * because the base type contains small fixed data structures that
+ * work the same way regardless of how full the node is. We store the
+ * node's allocated capacity in the "fanout" member of RT_NODE, to allow
+ * runtime introspection.
+ *
+ * Growing from one node kind to another requires special code for each
+ * case, but growing from one size class to another within the same kind
+ * is basically just allocate + memcpy.
+ *
+ * The size classes have been chosen so that inner nodes on platforms
+ * with 64-bit pointers (and leaf nodes when using a 64-bit key) are
+ * equal to or slightly smaller than some DSA size class.
+ */
+typedef enum RT_SIZE_CLASS
+{
+ RT_CLASS_3 = 0,
+ RT_CLASS_32_MIN,
+ RT_CLASS_32_MAX,
+ RT_CLASS_125,
+ RT_CLASS_256
+} RT_SIZE_CLASS;
+
+/* Information for each size class */
+typedef struct RT_SIZE_CLASS_ELEM
+{
+ const char *name;
+ int fanout;
+
+ /* slab chunk size */
+ Size inner_size;
+ Size leaf_size;
+} RT_SIZE_CLASS_ELEM;
+
+static const RT_SIZE_CLASS_ELEM RT_SIZE_CLASS_INFO[] = {
+ [RT_CLASS_3] = {
+ .name = "radix tree node 3",
+ .fanout = 3,
+ .inner_size = sizeof(RT_NODE_INNER_3) + 3 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_3) + 3 * sizeof(RT_VALUE_TYPE),
+ },
+ [RT_CLASS_32_MIN] = {
+ .name = "radix tree node 15",
+ .fanout = 15,
+ .inner_size = sizeof(RT_NODE_INNER_32) + 15 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_32) + 15 * sizeof(RT_VALUE_TYPE),
+ },
+ [RT_CLASS_32_MAX] = {
+ .name = "radix tree node 32",
+ .fanout = 32,
+ .inner_size = sizeof(RT_NODE_INNER_32) + 32 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_32) + 32 * sizeof(RT_VALUE_TYPE),
+ },
+ [RT_CLASS_125] = {
+ .name = "radix tree node 125",
+ .fanout = 125,
+ .inner_size = sizeof(RT_NODE_INNER_125) + 125 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_125) + 125 * sizeof(RT_VALUE_TYPE),
+ },
+ [RT_CLASS_256] = {
+ .name = "radix tree node 256",
+ .fanout = 256,
+ .inner_size = sizeof(RT_NODE_INNER_256),
+ .leaf_size = sizeof(RT_NODE_LEAF_256),
+ },
+};
+
+#define RT_SIZE_CLASS_COUNT lengthof(RT_SIZE_CLASS_INFO)
+
+#ifdef RT_SHMEM
+/* A magic value used to identify our radix tree */
+#define RT_RADIX_TREE_MAGIC 0x54A48167
+#endif
+
+/* Contains the actual tree and ancillary info */
+typedef struct RT_RADIX_TREE_CONTROL
+{
+#ifdef RT_SHMEM
+ RT_HANDLE handle;
+ uint32 magic;
+ LWLock lock;
+#endif
+
+ RT_PTR_ALLOC root;
+ uint64 max_val;
+ uint64 num_keys;
+
+ /* statistics */
+#ifdef RT_DEBUG
+ int32 cnt[RT_SIZE_CLASS_COUNT];
+#endif
+} RT_RADIX_TREE_CONTROL;
+
+/* Entry point for allocating and accessing the tree */
+typedef struct RT_RADIX_TREE
+{
+ MemoryContext context;
+
+ /* pointing to either local memory or DSA */
+ RT_RADIX_TREE_CONTROL *ctl;
+
+#ifdef RT_SHMEM
+ dsa_area *dsa;
+#else
+ MemoryContextData *inner_slabs[RT_SIZE_CLASS_COUNT];
+ MemoryContextData *leaf_slabs[RT_SIZE_CLASS_COUNT];
+#endif
+} RT_RADIX_TREE;
+
+/*
+ * Iteration support.
+ *
+ * Iterating the radix tree returns each pair of key and value in the ascending
+ * order of the key.
+ *
+ * RT_NODE_ITER is the struct for iteration of one radix tree node.
+ *
+ * RT_ITER is the struct for iteration of the radix tree, and uses RT_NODE_ITER
+ * for each level to track the iteration within the node.
+ */
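+/*
+ * Illustrative sketch, not part of this patch: a typical caller-side
+ * iteration loop over a tree instantiated with RT_VALUE_TYPE = uint64,
+ * where do_something() is a placeholder. Pairs are returned in ascending
+ * key order:
+ *
+ *     iter = RT_BEGIN_ITERATE(tree);
+ *     while (RT_ITERATE_NEXT(iter, &key, &value))
+ *         do_something(key, value);
+ *     RT_END_ITERATE(iter);
+ */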
+typedef struct RT_NODE_ITER
+{
+ /*
+ * Local pointer to the node we are iterating over.
+ *
+ * Since the radix tree doesn't support shared iteration among multiple
+ * processes, we use RT_PTR_LOCAL rather than RT_PTR_ALLOC.
+ */
+ RT_PTR_LOCAL node;
+
+ /*
+ * The next index of the chunk array in RT_NODE_KIND_3 and
+ * RT_NODE_KIND_32 nodes, or the next chunk in RT_NODE_KIND_125 and
+ * RT_NODE_KIND_256 nodes. 0 for the initial value.
+ */
+ int idx;
+} RT_NODE_ITER;
+
+typedef struct RT_ITER
+{
+ RT_RADIX_TREE *tree;
+
+ /* Track the nodes for each level. level = 0 is for a leaf node */
+ RT_NODE_ITER node_iters[RT_MAX_LEVEL];
+ int top_level;
+
+ /* The key constructed during the iteration */
+ uint64 key;
+} RT_ITER;
+
+
+static void RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
+ uint64 key, RT_PTR_ALLOC child);
+static bool RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
+ uint64 key, RT_VALUE_TYPE *value_p);
+
+/* verification (available only with assertion) */
+static void RT_VERIFY_NODE(RT_PTR_LOCAL node);
+
+/* Get the local address of an allocated node */
+static inline RT_PTR_LOCAL
+RT_PTR_GET_LOCAL(RT_RADIX_TREE *tree, RT_PTR_ALLOC node)
+{
+#ifdef RT_SHMEM
+ return dsa_get_address(tree->dsa, (dsa_pointer) node);
+#else
+ return node;
+#endif
+}
+
+static inline bool
+RT_PTR_ALLOC_IS_VALID(RT_PTR_ALLOC ptr)
+{
+#ifdef RT_SHMEM
+ return DsaPointerIsValid(ptr);
+#else
+ return PointerIsValid(ptr);
+#endif
+}
+
+/*
+ * Return index of the first element in the node's chunk array that equals
+ * 'chunk'. Return -1 if there is no such element.
+ */
+static inline int
+RT_NODE_3_SEARCH_EQ(RT_NODE_BASE_3 *node, uint8 chunk)
+{
+ int idx = -1;
+
+ for (int i = 0; i < node->n.count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ idx = i;
+ break;
+ }
+ }
+
+ return idx;
+}
+
+/*
+ * Return index of the chunk and slot arrays for inserting into the node,
+ * such that the chunk array remains ordered.
+ */
+static inline int
+RT_NODE_3_GET_INSERTPOS(RT_NODE_BASE_3 *node, uint8 chunk)
+{
+ int idx;
+
+ for (idx = 0; idx < node->n.count; idx++)
+ {
+ if (node->chunks[idx] >= chunk)
+ break;
+ }
+
+ return idx;
+}
+
+/*
+ * Return index of the first element in the node's chunk array that equals
+ * 'chunk'. Return -1 if there is no such element.
+ */
+static inline int
+RT_NODE_32_SEARCH_EQ(RT_NODE_BASE_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ uint32 bitfield;
+ int index_simd = -1;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index = -1;
+
+ for (int i = 0; i < count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ index = i;
+ break;
+ }
+ }
+#endif
+
+#ifndef USE_NO_SIMD
+ /* replicate the search key */
+ spread_chunk = vector8_broadcast(chunk);
+
+ /* compare to all 32 keys stored in the node */
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ cmp1 = vector8_eq(spread_chunk, haystack1);
+ cmp2 = vector8_eq(spread_chunk, haystack2);
+
+ /* convert comparison to a bitfield */
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+
+ /* mask off invalid entries */
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ /* convert bitfield to index by counting trailing zeros */
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+/*
+ * Return index of the chunk and slot arrays for inserting into the node,
+ * such that the chunk array remains ordered.
+ */
+static inline int
+RT_NODE_32_GET_INSERTPOS(RT_NODE_BASE_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ Vector8 min1;
+ Vector8 min2;
+ uint32 bitfield;
+ int index_simd;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index;
+
+ for (index = 0; index < count; index++)
+ {
+ /*
+ * This is coded with '>=' to match what we can do with SIMD,
+ * with an assert to keep us honest.
+ */
+ if (node->chunks[index] >= chunk)
+ {
+ Assert(node->chunks[index] != chunk);
+ break;
+ }
+ }
+#endif
+
+#ifndef USE_NO_SIMD
+ /*
+ * This is a bit more complicated than RT_NODE_32_SEARCH_EQ(), because
+ * no unsigned uint8 comparison instruction exists, at least for SSE2. So
+ * we need to play some trickery using vector8_min() to effectively get
+ * >=. There'll never be any equal elements in current uses, but that's
+ * what we get here...
+ */
+ spread_chunk = vector8_broadcast(chunk);
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ min1 = vector8_min(spread_chunk, haystack1);
+ min2 = vector8_min(spread_chunk, haystack2);
+ cmp1 = vector8_eq(spread_chunk, min1);
+ cmp2 = vector8_eq(spread_chunk, min2);
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+ else
+ index_simd = count;
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+
+/*
+ * Functions to manipulate both chunks array and children/values array.
+ * These are used for node-3 and node-32.
+ */
+
+/* Shift the elements right at 'idx' by one */
+static inline void
+RT_CHUNK_CHILDREN_ARRAY_SHIFT(uint8 *chunks, RT_PTR_ALLOC *children, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+ memmove(&(children[idx + 1]), &(children[idx]), sizeof(RT_PTR_ALLOC) * (count - idx));
+}
+
+static inline void
+RT_CHUNK_VALUES_ARRAY_SHIFT(uint8 *chunks, RT_VALUE_TYPE *values, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+ memmove(&(values[idx + 1]), &(values[idx]), sizeof(RT_VALUE_TYPE) * (count - idx));
+}
+
+/* Delete the element at 'idx' */
+static inline void
+RT_CHUNK_CHILDREN_ARRAY_DELETE(uint8 *chunks, RT_PTR_ALLOC *children, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(children[idx]), &(children[idx + 1]), sizeof(RT_PTR_ALLOC) * (count - idx - 1));
+}
+
+static inline void
+RT_CHUNK_VALUES_ARRAY_DELETE(uint8 *chunks, RT_VALUE_TYPE *values, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(values[idx]), &(values[idx + 1]), sizeof(RT_VALUE_TYPE) * (count - idx - 1));
+}
+
+/* Copy both chunks and children/values arrays */
+static inline void
+RT_CHUNK_CHILDREN_ARRAY_COPY(uint8 *src_chunks, RT_PTR_ALLOC *src_children,
+ uint8 *dst_chunks, RT_PTR_ALLOC *dst_children)
+{
+ const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_3].fanout;
+ const Size chunk_size = sizeof(uint8) * fanout;
+ const Size children_size = sizeof(RT_PTR_ALLOC) * fanout;
+
+ memcpy(dst_chunks, src_chunks, chunk_size);
+ memcpy(dst_children, src_children, children_size);
+}
+
+static inline void
+RT_CHUNK_VALUES_ARRAY_COPY(uint8 *src_chunks, RT_VALUE_TYPE *src_values,
+ uint8 *dst_chunks, RT_VALUE_TYPE *dst_values)
+{
+ const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_3].fanout;
+ const Size chunk_size = sizeof(uint8) * fanout;
+ const Size values_size = sizeof(RT_VALUE_TYPE) * fanout;
+
+ memcpy(dst_chunks, src_chunks, chunk_size);
+ memcpy(dst_values, src_values, values_size);
+}
+
+/* Functions to manipulate inner and leaf node-125 */
+
+/* Does the given chunk in the node have a value? */
+static inline bool
+RT_NODE_125_IS_CHUNK_USED(RT_NODE_BASE_125 *node, uint8 chunk)
+{
+ return node->slot_idxs[chunk] != RT_INVALID_SLOT_IDX;
+}
+
+static inline RT_PTR_ALLOC
+RT_NODE_INNER_125_GET_CHILD(RT_NODE_INNER_125 *node, uint8 chunk)
+{
+ Assert(!RT_NODE_IS_LEAF(node));
+ return node->children[node->base.slot_idxs[chunk]];
+}
+
+static inline RT_VALUE_TYPE
+RT_NODE_LEAF_125_GET_VALUE(RT_NODE_LEAF_125 *node, uint8 chunk)
+{
+ Assert(RT_NODE_IS_LEAF(node));
+ Assert(((RT_NODE_BASE_125 *) node)->slot_idxs[chunk] != RT_INVALID_SLOT_IDX);
+ return node->values[node->base.slot_idxs[chunk]];
+}
+
+/* Functions to manipulate inner and leaf node-256 */
+
+/* Return true if the slot corresponding to the given chunk is in use */
+static inline bool
+RT_NODE_INNER_256_IS_CHUNK_USED(RT_NODE_INNER_256 *node, uint8 chunk)
+{
+ Assert(!RT_NODE_IS_LEAF(node));
+ return node->children[chunk] != RT_INVALID_PTR_ALLOC;
+}
+
+static inline bool
+RT_NODE_LEAF_256_IS_CHUNK_USED(RT_NODE_LEAF_256 *node, uint8 chunk)
+{
+ int idx = RT_BM_IDX(chunk);
+ int bitnum = RT_BM_BIT(chunk);
+
+ Assert(RT_NODE_IS_LEAF(node));
+ return (node->isset[idx] & ((bitmapword) 1 << bitnum)) != 0;
+}
+
+static inline RT_PTR_ALLOC
+RT_NODE_INNER_256_GET_CHILD(RT_NODE_INNER_256 *node, uint8 chunk)
+{
+ Assert(!RT_NODE_IS_LEAF(node));
+ Assert(RT_NODE_INNER_256_IS_CHUNK_USED(node, chunk));
+ return node->children[chunk];
+}
+
+static inline RT_VALUE_TYPE
+RT_NODE_LEAF_256_GET_VALUE(RT_NODE_LEAF_256 *node, uint8 chunk)
+{
+ Assert(RT_NODE_IS_LEAF(node));
+ Assert(RT_NODE_LEAF_256_IS_CHUNK_USED(node, chunk));
+ return node->values[chunk];
+}
+
+/* Set the child in the node-256 */
+static inline void
+RT_NODE_INNER_256_SET(RT_NODE_INNER_256 *node, uint8 chunk, RT_PTR_ALLOC child)
+{
+ Assert(!RT_NODE_IS_LEAF(node));
+ node->children[chunk] = child;
+}
+
+/* Set the value in the node-256 */
+static inline void
+RT_NODE_LEAF_256_SET(RT_NODE_LEAF_256 *node, uint8 chunk, RT_VALUE_TYPE value)
+{
+ int idx = RT_BM_IDX(chunk);
+ int bitnum = RT_BM_BIT(chunk);
+
+ Assert(RT_NODE_IS_LEAF(node));
+ node->isset[idx] |= ((bitmapword) 1 << bitnum);
+ node->values[chunk] = value;
+}
+
+/* Delete the child pointer for the given chunk in the inner node-256 */
+static inline void
+RT_NODE_INNER_256_DELETE(RT_NODE_INNER_256 *node, uint8 chunk)
+{
+ Assert(!RT_NODE_IS_LEAF(node));
+ node->children[chunk] = RT_INVALID_PTR_ALLOC;
+}
+
+static inline void
+RT_NODE_LEAF_256_DELETE(RT_NODE_LEAF_256 *node, uint8 chunk)
+{
+ int idx = RT_BM_IDX(chunk);
+ int bitnum = RT_BM_BIT(chunk);
+
+ Assert(RT_NODE_IS_LEAF(node));
+ node->isset[idx] &= ~((bitmapword) 1 << bitnum);
+}
+
+/*
+ * Return the largest shift that allows storing the given key.
+ */
+static inline int
+RT_KEY_GET_SHIFT(uint64 key)
+{
+ if (key == 0)
+ return 0;
+ else
+ return (pg_leftmost_one_pos64(key) / RT_NODE_SPAN) * RT_NODE_SPAN;
+}
+
+/*
+ * Return the max value that can be stored in the tree with the given shift.
+ */
+static uint64
+RT_SHIFT_GET_MAX_VAL(int shift)
+{
+ if (shift == RT_MAX_SHIFT)
+ return UINT64_MAX;
+
+ return (UINT64CONST(1) << (shift + RT_NODE_SPAN)) - 1;
+}
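+
+/*
+ * As an illustration (assuming RT_NODE_SPAN is 8, i.e. one key byte per
+ * level): for key = 0x123456 the leftmost set bit is at position 20, so
+ * RT_KEY_GET_SHIFT returns (20 / 8) * 8 = 16, and RT_SHIFT_GET_MAX_VAL(16)
+ * reports that a tree whose root has shift 16 can store keys up to
+ * (1 << 24) - 1 = 0xFFFFFF.
+ */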
+
+/*
+ * Allocate a new node with the given size class.
+ */
+static RT_PTR_ALLOC
+RT_ALLOC_NODE(RT_RADIX_TREE *tree, RT_SIZE_CLASS size_class, bool is_leaf)
+{
+ RT_PTR_ALLOC allocnode;
+ size_t allocsize;
+
+ if (is_leaf)
+ allocsize = RT_SIZE_CLASS_INFO[size_class].leaf_size;
+ else
+ allocsize = RT_SIZE_CLASS_INFO[size_class].inner_size;
+
+#ifdef RT_SHMEM
+ allocnode = dsa_allocate(tree->dsa, allocsize);
+#else
+ if (is_leaf)
+ allocnode = (RT_PTR_ALLOC) MemoryContextAlloc(tree->leaf_slabs[size_class],
+ allocsize);
+ else
+ allocnode = (RT_PTR_ALLOC) MemoryContextAlloc(tree->inner_slabs[size_class],
+ allocsize);
+#endif
+
+#ifdef RT_DEBUG
+ /* update the statistics */
+ tree->ctl->cnt[size_class]++;
+#endif
+
+ return allocnode;
+}
+
+/* Initialize the node contents */
+static inline void
+RT_INIT_NODE(RT_PTR_LOCAL node, uint8 kind, RT_SIZE_CLASS size_class, bool is_leaf)
+{
+ if (is_leaf)
+ MemSet(node, 0, RT_SIZE_CLASS_INFO[size_class].leaf_size);
+ else
+ MemSet(node, 0, RT_SIZE_CLASS_INFO[size_class].inner_size);
+
+ node->kind = kind;
+
+ if (kind == RT_NODE_KIND_256)
+ /* See comment for the RT_NODE type */
+ Assert(node->fanout == 0);
+ else
+ node->fanout = RT_SIZE_CLASS_INFO[size_class].fanout;
+
+ /* Initialize slot_idxs to invalid values */
+ if (kind == RT_NODE_KIND_125)
+ {
+ RT_NODE_BASE_125 *n125 = (RT_NODE_BASE_125 *) node;
+
+ memset(n125->slot_idxs, RT_INVALID_SLOT_IDX, sizeof(n125->slot_idxs));
+ }
+}
+
+/*
+ * Create a new node as the root. Subordinate nodes will be created during
+ * the insertion.
+ */
+static pg_noinline void
+RT_NEW_ROOT(RT_RADIX_TREE *tree, uint64 key)
+{
+ int shift = RT_KEY_GET_SHIFT(key);
+ bool is_leaf = shift == 0;
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+
+ allocnode = RT_ALLOC_NODE(tree, RT_CLASS_3, is_leaf);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(newnode, RT_NODE_KIND_3, RT_CLASS_3, is_leaf);
+ newnode->shift = shift;
+ tree->ctl->max_val = RT_SHIFT_GET_MAX_VAL(shift);
+ tree->ctl->root = allocnode;
+}
+
+static inline void
+RT_COPY_NODE(RT_PTR_LOCAL newnode, RT_PTR_LOCAL oldnode)
+{
+ newnode->shift = oldnode->shift;
+ newnode->count = oldnode->count;
+}
+
+/*
+ * Given a newly allocated node and the old node it replaces, initialize the
+ * new node's common fields from the old one and return its local pointer.
+ */
+static inline RT_PTR_LOCAL
+RT_SWITCH_NODE_KIND(RT_RADIX_TREE *tree, RT_PTR_ALLOC allocnode, RT_PTR_LOCAL node,
+ uint8 new_kind, uint8 new_class, bool is_leaf)
+{
+ RT_PTR_LOCAL newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(newnode, new_kind, new_class, is_leaf);
+ RT_COPY_NODE(newnode, node);
+
+ return newnode;
+}
+
+/* Free the given node */
+static void
+RT_FREE_NODE(RT_RADIX_TREE *tree, RT_PTR_ALLOC allocnode)
+{
+ /* If we're deleting the root node, make the tree empty */
+ if (tree->ctl->root == allocnode)
+ {
+ tree->ctl->root = RT_INVALID_PTR_ALLOC;
+ tree->ctl->max_val = 0;
+ }
+
+#ifdef RT_DEBUG
+ {
+ int i;
+ RT_PTR_LOCAL node = RT_PTR_GET_LOCAL(tree, allocnode);
+
+ /* update the statistics */
+ for (i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ if (node->fanout == RT_SIZE_CLASS_INFO[i].fanout)
+ break;
+ }
+
+ /* fanout of node256 is intentionally 0 */
+ if (i == RT_SIZE_CLASS_COUNT)
+ i = RT_CLASS_256;
+
+ tree->ctl->cnt[i]--;
+ Assert(tree->ctl->cnt[i] >= 0);
+ }
+#endif
+
+#ifdef RT_SHMEM
+ dsa_free(tree->dsa, allocnode);
+#else
+ pfree(allocnode);
+#endif
+}
+
+/* Update the parent's pointer when growing a node */
+static inline void
+RT_NODE_UPDATE_INNER(RT_PTR_LOCAL node, uint64 key, RT_PTR_ALLOC new_child)
+{
+#define RT_ACTION_UPDATE
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_INNER
+#undef RT_ACTION_UPDATE
+}
+
+/*
+ * Replace old_child with new_child, and free the old one.
+ */
+static inline void
+RT_REPLACE_NODE(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent,
+ RT_PTR_ALLOC stored_old_child, RT_PTR_LOCAL old_child,
+ RT_PTR_ALLOC new_child, uint64 key)
+{
+#ifdef USE_ASSERT_CHECKING
+ RT_PTR_LOCAL new = RT_PTR_GET_LOCAL(tree, new_child);
+
+ Assert(old_child->shift == new->shift);
+ Assert(old_child->count == new->count);
+#endif
+
+ if (parent == old_child)
+ {
+ /* Replace the root node with the new larger node */
+ tree->ctl->root = new_child;
+ }
+ else
+ RT_NODE_UPDATE_INNER(parent, key, new_child);
+
+ RT_FREE_NODE(tree, stored_old_child);
+}
+
+/*
+ * The radix tree doesn't have sufficient height. Extend the radix tree so
+ * it can store the key.
+ */
+static pg_noinline void
+RT_EXTEND_UP(RT_RADIX_TREE *tree, uint64 key)
+{
+ int target_shift;
+ RT_PTR_LOCAL root = RT_PTR_GET_LOCAL(tree, tree->ctl->root);
+ int shift = root->shift + RT_NODE_SPAN;
+
+ target_shift = RT_KEY_GET_SHIFT(key);
+
+ /* Grow tree from 'shift' to 'target_shift' */
+ while (shift <= target_shift)
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL node;
+ RT_NODE_INNER_3 *n3;
+
+ allocnode = RT_ALLOC_NODE(tree, RT_CLASS_3, false);
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(node, RT_NODE_KIND_3, RT_CLASS_3, false);
+ node->shift = shift;
+ node->count = 1;
+
+ n3 = (RT_NODE_INNER_3 *) node;
+ n3->base.chunks[0] = 0;
+ n3->children[0] = tree->ctl->root;
+
+ /* Update the root */
+ tree->ctl->root = allocnode;
+
+ shift += RT_NODE_SPAN;
+ }
+
+ tree->ctl->max_val = RT_SHIFT_GET_MAX_VAL(target_shift);
+}
+
+/*
+ * The radix tree doesn't yet have the inner and leaf nodes needed for the
+ * given key. Create them from 'node' down to the bottom and store the value
+ * in the new leaf.
+ */
+static pg_noinline void
+RT_EXTEND_DOWN(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p, RT_PTR_LOCAL parent,
+ RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node)
+{
+ int shift = node->shift;
+
+ Assert(RT_PTR_GET_LOCAL(tree, stored_node) == node);
+
+ while (shift >= RT_NODE_SPAN)
+ {
+ RT_PTR_ALLOC allocchild;
+ RT_PTR_LOCAL newchild;
+ int newshift = shift - RT_NODE_SPAN;
+ bool is_leaf = newshift == 0;
+
+ allocchild = RT_ALLOC_NODE(tree, RT_CLASS_3, is_leaf);
+ newchild = RT_PTR_GET_LOCAL(tree, allocchild);
+ RT_INIT_NODE(newchild, RT_NODE_KIND_3, RT_CLASS_3, is_leaf);
+ newchild->shift = newshift;
+ RT_NODE_INSERT_INNER(tree, parent, stored_node, node, key, allocchild);
+
+ parent = node;
+ node = newchild;
+ stored_node = allocchild;
+ shift -= RT_NODE_SPAN;
+ }
+
+ RT_NODE_INSERT_LEAF(tree, parent, stored_node, node, key, value_p);
+ tree->ctl->num_keys++;
+}
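+
+/*
+ * Sketch of how the two extension paths cooperate (assuming RT_NODE_SPAN is
+ * 8): inserting key 0x10000 into a tree whose root has shift 8 first calls
+ * RT_EXTEND_UP, which stacks a new node-3 with shift 16 on top of the old
+ * root (stored as chunk 0). The following descent then fails to find a child
+ * for chunk 0x01 at shift 16, so RT_EXTEND_DOWN creates the missing inner
+ * node at shift 8 and the leaf at shift 0, and stores the value in that leaf.
+ */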
+
+/*
+ * Search for the child pointer corresponding to 'key' in the given node.
+ *
+ * Return true if the key is found, otherwise return false. On success, the
+ * child pointer is set to *child_p.
+ */
+static inline bool
+RT_NODE_SEARCH_INNER(RT_PTR_LOCAL node, uint64 key, RT_PTR_ALLOC *child_p)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/*
+ * Search for the value corresponding to 'key' in the given node.
+ *
+ * Return true if the key is found, otherwise return false. On success, the
+ * value is copied into *value_p.
+ */
+static inline bool
+RT_NODE_SEARCH_LEAF(RT_PTR_LOCAL node, uint64 key, RT_VALUE_TYPE *value_p)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
+#ifdef RT_USE_DELETE
+/*
+ * Search for the child pointer corresponding to 'key' in the given node, and
+ * delete it from the node.
+ *
+ * Return true if the key is found, otherwise return false.
+ */
+static inline bool
+RT_NODE_DELETE_INNER(RT_PTR_LOCAL node, uint64 key)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_delete_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/*
+ * Search for the value corresponding to 'key' in the given node, and delete
+ * it from the node.
+ *
+ * Return true if the key is found, otherwise return false.
+ */
+static inline bool
+RT_NODE_DELETE_LEAF(RT_PTR_LOCAL node, uint64 key)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_delete_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+#endif
+
+/*
+ * Insert "child" into "node".
+ *
+ * "parent" is the parent of "node", so the grandparent of the child.
+ * If the node we're inserting into needs to grow, we update the parent's
+ * child pointer with the pointer to the new larger node.
+ */
+static void
+RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
+ uint64 key, RT_PTR_ALLOC child)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_insert_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/* Like RT_NODE_INSERT_INNER, but for leaf nodes */
+static bool
+RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
+ uint64 key, RT_VALUE_TYPE *value_p)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_insert_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
+/*
+ * Create the radix tree in the given memory context and return it.
+ */
+RT_SCOPE RT_RADIX_TREE *
+#ifdef RT_SHMEM
+RT_CREATE(MemoryContext ctx, dsa_area *dsa, int tranche_id)
+#else
+RT_CREATE(MemoryContext ctx)
+#endif
+{
+ RT_RADIX_TREE *tree;
+ MemoryContext old_ctx;
+#ifdef RT_SHMEM
+ dsa_pointer dp;
+#endif
+
+ old_ctx = MemoryContextSwitchTo(ctx);
+
+ tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE));
+ tree->context = ctx;
+
+#ifdef RT_SHMEM
+ tree->dsa = dsa;
+ dp = dsa_allocate0(dsa, sizeof(RT_RADIX_TREE_CONTROL));
+ tree->ctl = (RT_RADIX_TREE_CONTROL *) dsa_get_address(dsa, dp);
+ tree->ctl->handle = dp;
+ tree->ctl->magic = RT_RADIX_TREE_MAGIC;
+ LWLockInitialize(&tree->ctl->lock, tranche_id);
+#else
+ tree->ctl = (RT_RADIX_TREE_CONTROL *) palloc0(sizeof(RT_RADIX_TREE_CONTROL));
+
+ /* Create a slab context for each size class */
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ RT_SIZE_CLASS_ELEM size_class = RT_SIZE_CLASS_INFO[i];
+ size_t inner_blocksize = RT_SLAB_BLOCK_SIZE(size_class.inner_size);
+ size_t leaf_blocksize = RT_SLAB_BLOCK_SIZE(size_class.leaf_size);
+
+ tree->inner_slabs[i] = SlabContextCreate(ctx,
+ size_class.name,
+ inner_blocksize,
+ size_class.inner_size);
+ tree->leaf_slabs[i] = SlabContextCreate(ctx,
+ size_class.name,
+ leaf_blocksize,
+ size_class.leaf_size);
+ }
+#endif
+
+ tree->ctl->root = RT_INVALID_PTR_ALLOC;
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return tree;
+}
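+
+/*
+ * Usage sketch (illustrative only; RT_CREATE stands for the prefixed name
+ * generated from RT_PREFIX): a local tree lives in a caller-supplied memory
+ * context, while a shared tree additionally needs a DSA area and an LWLock
+ * tranche id:
+ *
+ * tree = RT_CREATE(ctx); (without RT_SHMEM)
+ * tree = RT_CREATE(ctx, dsa, tranche_id); (with RT_SHMEM)
+ */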
+
+#ifdef RT_SHMEM
+RT_SCOPE RT_RADIX_TREE *
+RT_ATTACH(dsa_area *dsa, RT_HANDLE handle)
+{
+ RT_RADIX_TREE *tree;
+ dsa_pointer control;
+
+ tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE));
+
+ /* Find the control object in shared memory */
+ control = handle;
+
+ tree->dsa = dsa;
+ tree->ctl = (RT_RADIX_TREE_CONTROL *) dsa_get_address(dsa, control);
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+
+ return tree;
+}
+
+RT_SCOPE void
+RT_DETACH(RT_RADIX_TREE *tree)
+{
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+ pfree(tree);
+}
+
+RT_SCOPE RT_HANDLE
+RT_GET_HANDLE(RT_RADIX_TREE *tree)
+{
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+ return tree->ctl->handle;
+}
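+
+/*
+ * Sketch of the shared-memory workflow (illustrative only): the backend that
+ * created the tree passes RT_GET_HANDLE(tree) to other processes (e.g. via a
+ * shm_toc), each of which calls RT_ATTACH(dsa, handle) on the same DSA area
+ * to get its own backend-local RT_RADIX_TREE wrapper, and RT_DETACH() when
+ * it is done with it. Only RT_FREE() actually releases the DSA memory.
+ */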
+
+/*
+ * Recursively free all nodes allocated to the DSA area.
+ */
+static void
+RT_FREE_RECURSE(RT_RADIX_TREE *tree, RT_PTR_ALLOC ptr)
+{
+ RT_PTR_LOCAL node = RT_PTR_GET_LOCAL(tree, ptr);
+
+ check_stack_depth();
+ CHECK_FOR_INTERRUPTS();
+
+ /* The leaf node doesn't have child pointers */
+ if (RT_NODE_IS_LEAF(node))
+ {
+ dsa_free(tree->dsa, ptr);
+ return;
+ }
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE_INNER_3 *n3 = (RT_NODE_INNER_3 *) node;
+
+ for (int i = 0; i < n3->base.n.count; i++)
+ RT_FREE_RECURSE(tree, n3->children[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE_INNER_32 *n32 = (RT_NODE_INNER_32 *) node;
+
+ for (int i = 0; i < n32->base.n.count; i++)
+ RT_FREE_RECURSE(tree, n32->children[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE_INNER_125 *n125 = (RT_NODE_INNER_125 *) node;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(&n125->base, i))
+ continue;
+
+ RT_FREE_RECURSE(tree, RT_NODE_INNER_125_GET_CHILD(n125, i));
+ }
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE_INNER_256 *n256 = (RT_NODE_INNER_256 *) node;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
+ continue;
+
+ RT_FREE_RECURSE(tree, RT_NODE_INNER_256_GET_CHILD(n256, i));
+ }
+
+ break;
+ }
+ }
+
+ /* Free the inner node */
+ dsa_free(tree->dsa, ptr);
+}
+#endif
+
+/*
+ * Free the given radix tree.
+ */
+RT_SCOPE void
+RT_FREE(RT_RADIX_TREE *tree)
+{
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+
+ /* Free all memory used for radix tree nodes */
+ if (RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ RT_FREE_RECURSE(tree, tree->ctl->root);
+
+ /*
+ * Vandalize the control block to help catch programming errors where
+ * other backends access the memory formerly occupied by this radix tree.
+ */
+ tree->ctl->magic = 0;
+ dsa_free(tree->dsa, tree->ctl->handle);
+#else
+ pfree(tree->ctl);
+
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ MemoryContextDelete(tree->inner_slabs[i]);
+ MemoryContextDelete(tree->leaf_slabs[i]);
+ }
+#endif
+
+ pfree(tree);
+}
+
+/*
+ * Set the value for the given key. If the entry already exists, update its
+ * value and return true; otherwise insert a new entry and return false.
+ */
+RT_SCOPE bool
+RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p)
+{
+ int shift;
+ bool updated;
+ RT_PTR_LOCAL parent;
+ RT_PTR_ALLOC stored_child;
+ RT_PTR_LOCAL child;
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+
+ RT_LOCK_EXCLUSIVE(tree);
+
+ /* Empty tree, create the root */
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ RT_NEW_ROOT(tree, key);
+
+ /* Extend the tree if necessary */
+ if (key > tree->ctl->max_val)
+ RT_EXTEND_UP(tree, key);
+
+ stored_child = tree->ctl->root;
+ parent = RT_PTR_GET_LOCAL(tree, stored_child);
+ shift = parent->shift;
+
+ /* Descend the tree until we reach a leaf node */
+ while (shift >= 0)
+ {
+ RT_PTR_ALLOC new_child = RT_INVALID_PTR_ALLOC;
+
+ child = RT_PTR_GET_LOCAL(tree, stored_child);
+
+ if (RT_NODE_IS_LEAF(child))
+ break;
+
+ if (!RT_NODE_SEARCH_INNER(child, key, &new_child))
+ {
+ RT_EXTEND_DOWN(tree, key, value_p, parent, stored_child, child);
+ RT_UNLOCK(tree);
+ return false;
+ }
+
+ parent = child;
+ stored_child = new_child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ updated = RT_NODE_INSERT_LEAF(tree, parent, stored_child, child, key, value_p);
+
+ /* Update the statistics */
+ if (!updated)
+ tree->ctl->num_keys++;
+
+ RT_UNLOCK(tree);
+ return updated;
+}
+
+/*
+ * Search for the given key in the radix tree. Return true if the key is
+ * found, otherwise return false. On success, the value is copied into
+ * *value_p, so value_p must not be NULL.
+ */
+RT_SCOPE bool
+RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p)
+{
+ RT_PTR_LOCAL node;
+ int shift;
+ bool found;
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+ Assert(value_p != NULL);
+
+ RT_LOCK_SHARED(tree);
+
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root) || key > tree->ctl->max_val)
+ {
+ RT_UNLOCK(tree);
+ return false;
+ }
+
+ node = RT_PTR_GET_LOCAL(tree, tree->ctl->root);
+ shift = node->shift;
+
+ /* Descend the tree until a leaf node */
+ while (shift >= 0)
+ {
+ RT_PTR_ALLOC child = RT_INVALID_PTR_ALLOC;
+
+ if (RT_NODE_IS_LEAF(node))
+ break;
+
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ {
+ RT_UNLOCK(tree);
+ return false;
+ }
+
+ node = RT_PTR_GET_LOCAL(tree, child);
+ shift -= RT_NODE_SPAN;
+ }
+
+ found = RT_NODE_SEARCH_LEAF(node, key, value_p);
+
+ RT_UNLOCK(tree);
+ return found;
+}
+
+#ifdef RT_USE_DELETE
+/*
+ * Delete the given key from the radix tree. Return true if the key is found (and
+ * deleted), otherwise do nothing and return false.
+ */
+RT_SCOPE bool
+RT_DELETE(RT_RADIX_TREE *tree, uint64 key)
+{
+ RT_PTR_LOCAL node;
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_ALLOC stack[RT_MAX_LEVEL] = {0};
+ int shift;
+ int level;
+ bool deleted;
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+
+ RT_LOCK_EXCLUSIVE(tree);
+
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root) || key > tree->ctl->max_val)
+ {
+ RT_UNLOCK(tree);
+ return false;
+ }
+
+ /*
+ * Descend the tree to search the key while building a stack of nodes we
+ * visited.
+ */
+ allocnode = tree->ctl->root;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ shift = node->shift;
+ level = -1;
+ while (shift > 0)
+ {
+ RT_PTR_ALLOC child = RT_INVALID_PTR_ALLOC;
+
+ /* Push the current node to the stack */
+ stack[++level] = allocnode;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ {
+ RT_UNLOCK(tree);
+ return false;
+ }
+
+ allocnode = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ /* Delete the key from the leaf node if it exists */
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ deleted = RT_NODE_DELETE_LEAF(node, key);
+
+ if (!deleted)
+ {
+ /* the key was not found in the leaf node */
+ RT_UNLOCK(tree);
+ return false;
+ }
+
+ /* Found the key to delete. Update the statistics */
+ tree->ctl->num_keys--;
+
+ /*
+ * Return if the leaf node still has keys and we don't need to delete the
+ * node.
+ */
+ if (node->count > 0)
+ {
+ RT_UNLOCK(tree);
+ return true;
+ }
+
+ /* Free the empty leaf node */
+ RT_FREE_NODE(tree, allocnode);
+
+ /* Unwind the stack, deleting the child pointer from each inner node */
+ while (level >= 0)
+ {
+ allocnode = stack[level--];
+
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ deleted = RT_NODE_DELETE_INNER(node, key);
+ Assert(deleted);
+
+ /* If the node didn't become empty, we stop freeing nodes */
+ if (node->count > 0)
+ break;
+
+ /* The node became empty */
+ RT_FREE_NODE(tree, allocnode);
+ }
+
+ RT_UNLOCK(tree);
+ return true;
+}
+#endif
+
+/*
+ * Scan the inner node and return the next child node if one exists,
+ * otherwise return NULL.
+ */
+static inline RT_PTR_LOCAL
+RT_NODE_INNER_ITERATE_NEXT(RT_ITER *iter, RT_NODE_ITER *node_iter)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_iter_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/*
+ * Scan the leaf node; if there is a next value, set it to *value_p and
+ * return true. Otherwise return false.
+ */
+static inline bool
+RT_NODE_LEAF_ITERATE_NEXT(RT_ITER *iter, RT_NODE_ITER *node_iter,
+ RT_VALUE_TYPE *value_p)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_iter_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
+/*
+ * Descend the radix tree from the 'from' node to the bottom, setting the
+ * next node to iterate for each level along the way.
+ */
+static void
+RT_ITER_SET_NODE_FROM(RT_ITER *iter, RT_PTR_LOCAL from)
+{
+ int level = from->shift / RT_NODE_SPAN;
+ RT_PTR_LOCAL node = from;
+
+ for (;;)
+ {
+ RT_NODE_ITER *node_iter = &(iter->node_iters[level--]);
+
+#ifdef USE_ASSERT_CHECKING
+ if (node_iter->node)
+ {
+ /* We must have finished the iteration on the previous node */
+ if (RT_NODE_IS_LEAF(node_iter->node))
+ {
+ RT_VALUE_TYPE dummy;
+ Assert(!RT_NODE_LEAF_ITERATE_NEXT(iter, node_iter, &dummy));
+ }
+ else
+ Assert(!RT_NODE_INNER_ITERATE_NEXT(iter, node_iter));
+ }
+#endif
+
+ /* Set the node to the node iterator of this level */
+ node_iter->node = node;
+ node_iter->idx = 0;
+
+ if (RT_NODE_IS_LEAF(node))
+ {
+ /* We will visit the leaf node when RT_ITERATE_NEXT() is called */
+ break;
+ }
+
+ /*
+ * Get the first child node from the node, which corresponds to the
+ * lowest chunk within the node.
+ */
+ node = RT_NODE_INNER_ITERATE_NEXT(iter, node_iter);
+
+ /* The first child must be found */
+ Assert(node);
+ }
+}
+
+/*
+ * Create and return the iterator for the given radix tree.
+ *
+ * The radix tree is locked in shared mode during the iteration, so
+ * RT_END_ITERATE needs to be called when finished to release the lock.
+ */
+RT_SCOPE RT_ITER *
+RT_BEGIN_ITERATE(RT_RADIX_TREE *tree)
+{
+ RT_ITER *iter;
+ RT_PTR_LOCAL root;
+
+ iter = (RT_ITER *) MemoryContextAllocZero(tree->context,
+ sizeof(RT_ITER));
+ iter->tree = tree;
+
+ RT_LOCK_SHARED(tree);
+
+ /* empty tree */
+ if (!RT_PTR_ALLOC_IS_VALID(iter->tree->ctl->root))
+ return iter;
+
+ root = RT_PTR_GET_LOCAL(tree, iter->tree->ctl->root);
+ iter->top_level = root->shift / RT_NODE_SPAN;
+
+ /*
+ * Set the next node to iterate for each level from the level of the
+ * root node.
+ */
+ RT_ITER_SET_NODE_FROM(iter, root);
+
+ return iter;
+}
+
+/*
+ * If there is a next key, return true and set *key_p and *value_p.
+ * Otherwise return false.
+ */
+RT_SCOPE bool
+RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, RT_VALUE_TYPE *value_p)
+{
+ Assert(value_p != NULL);
+
+ /* Empty tree */
+ if (!RT_PTR_ALLOC_IS_VALID(iter->tree->ctl->root))
+ return false;
+
+ for (;;)
+ {
+ RT_PTR_LOCAL child = NULL;
+
+ /* Get the next chunk of the leaf node */
+ if (RT_NODE_LEAF_ITERATE_NEXT(iter, &(iter->node_iters[0]), value_p))
+ {
+ *key_p = iter->key;
+ return true;
+ }
+
+ /*
+ * We've visited all values in the leaf node, so advance the inner-node
+ * iterators, starting from level 1, until we find an inner node that
+ * still has a child to visit.
+ */
+ for (int level = 1; level <= iter->top_level; level++)
+ {
+ child = RT_NODE_INNER_ITERATE_NEXT(iter, &(iter->node_iters[level]));
+
+ if (child)
+ break;
+ }
+
+ /* We've visited all nodes, so the iteration is finished */
+ if (!child)
+ break;
+
+ /*
+ * Found a new child node. Set the next node to iterate for each level
+ * from the level of this child node downwards.
+ */
+ RT_ITER_SET_NODE_FROM(iter, child);
+
+ /* Loop around to fetch key-value pairs from the new leaf node */
+ }
+
+ return false;
+}
+
+/*
+ * Terminate the iteration and release the lock.
+ *
+ * This function must be called when the iteration is finished, or when
+ * bailing out of the iteration early.
+ */
+RT_SCOPE void
+RT_END_ITERATE(RT_ITER *iter)
+{
+#ifdef RT_SHMEM
+ Assert(LWLockHeldByMe(&iter->tree->ctl->lock));
+#endif
+
+ RT_UNLOCK(iter->tree);
+ pfree(iter);
+}
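+
+/*
+ * Iteration usage sketch (illustrative only):
+ *
+ * RT_ITER *iter = RT_BEGIN_ITERATE(tree);
+ * uint64 key;
+ * RT_VALUE_TYPE value;
+ *
+ * while (RT_ITERATE_NEXT(iter, &key, &value))
+ * ... keys are visited in ascending order ...
+ * RT_END_ITERATE(iter);
+ *
+ * In the RT_SHMEM case the shared lock taken by RT_BEGIN_ITERATE is held
+ * until RT_END_ITERATE, so concurrent RT_SET/RT_DELETE calls block until
+ * the iteration finishes.
+ */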
+
+/*
+ * Return the amount of memory used by the radix tree.
+ */
+RT_SCOPE uint64
+RT_MEMORY_USAGE(RT_RADIX_TREE *tree)
+{
+ Size total = 0;
+
+ RT_LOCK_SHARED(tree);
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+ total = dsa_get_total_size(tree->dsa);
+#else
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
+ total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
+ }
+#endif
+
+ RT_UNLOCK(tree);
+ return total;
+}
+
+/*
+ * Verify the radix tree node.
+ */
+static void
+RT_VERIFY_NODE(RT_PTR_LOCAL node)
+{
+#ifdef USE_ASSERT_CHECKING
+ Assert(node->count >= 0);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE_BASE_3 *n3 = (RT_NODE_BASE_3 *) node;
+
+ for (int i = 1; i < n3->n.count; i++)
+ Assert(n3->chunks[i - 1] < n3->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE_BASE_32 *n32 = (RT_NODE_BASE_32 *) node;
+
+ for (int i = 1; i < n32->n.count; i++)
+ Assert(n32->chunks[i - 1] < n32->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE_BASE_125 *n125 = (RT_NODE_BASE_125 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ uint8 slot = n125->slot_idxs[i];
+ int idx = RT_BM_IDX(slot);
+ int bitnum = RT_BM_BIT(slot);
+
+ if (!RT_NODE_125_IS_CHUNK_USED(n125, i))
+ continue;
+
+ /* Check if the corresponding slot is used */
+ Assert(slot < node->fanout);
+ Assert((n125->isset[idx] & ((bitmapword) 1 << bitnum)) != 0);
+
+ cnt++;
+ }
+
+ Assert(n125->n.count == cnt);
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_256 *n256 = (RT_NODE_LEAF_256 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RT_BM_IDX(RT_NODE_MAX_SLOTS); i++)
+ cnt += bmw_popcount(n256->isset[i]);
+
+ /* Check that the number of used chunks matches the count */
+ Assert(n256->base.n.count == cnt);
+
+ break;
+ }
+ }
+ }
+#endif
+}
+
+/***************** DEBUG FUNCTIONS *****************/
+#ifdef RT_DEBUG
+
+#define RT_UINT64_FORMAT_HEX "%" INT64_MODIFIER "X"
+
+RT_SCOPE void
+RT_STATS(RT_RADIX_TREE *tree)
+{
+ RT_LOCK_SHARED(tree);
+
+ fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->ctl->max_val);
+ fprintf(stderr, "num_keys = " UINT64_FORMAT "\n", tree->ctl->num_keys);
+
+#ifdef RT_SHMEM
+ fprintf(stderr, "handle = " UINT64_FORMAT "\n", tree->ctl->handle);
+#endif
+
+ if (RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ {
+ RT_PTR_LOCAL root = RT_PTR_GET_LOCAL(tree, tree->ctl->root);
+
+ fprintf(stderr, "height = %d, n3 = %u, n15 = %u, n32 = %u, n125 = %u, n256 = %u\n",
+ root->shift / RT_NODE_SPAN,
+ tree->ctl->cnt[RT_CLASS_3],
+ tree->ctl->cnt[RT_CLASS_32_MIN],
+ tree->ctl->cnt[RT_CLASS_32_MAX],
+ tree->ctl->cnt[RT_CLASS_125],
+ tree->ctl->cnt[RT_CLASS_256]);
+ }
+
+ RT_UNLOCK(tree);
+}
+
+static void
+RT_DUMP_NODE(RT_RADIX_TREE *tree, RT_PTR_ALLOC allocnode, int level,
+ bool recurse, StringInfo buf)
+{
+ RT_PTR_LOCAL node = RT_PTR_GET_LOCAL(tree, allocnode);
+ StringInfoData spaces;
+
+ initStringInfo(&spaces);
+ appendStringInfoSpaces(&spaces, (level * 4) + 1);
+
+ appendStringInfo(buf, "%s%s[%s] kind %d, fanout %d, count %u, shift %u:\n",
+ spaces.data,
+ level == 0 ? "" : "-> ",
+ RT_NODE_IS_LEAF(node) ? "LEAF" : "INNR",
+ (node->kind == RT_NODE_KIND_3) ? 3 :
+ (node->kind == RT_NODE_KIND_32) ? 32 :
+ (node->kind == RT_NODE_KIND_125) ? 125 : 256,
+ node->fanout == 0 ? 256 : node->fanout,
+ node->count, node->shift);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_3 *n3 = (RT_NODE_LEAF_3 *) node;
+
+ appendStringInfo(buf, "%schunk[%d] 0x%X\n",
+ spaces.data, i, n3->base.chunks[i]);
+ }
+ else
+ {
+ RT_NODE_INNER_3 *n3 = (RT_NODE_INNER_3 *) node;
+
+ appendStringInfo(buf, "%schunk[%d] 0x%X",
+ spaces.data, i, n3->base.chunks[i]);
+
+ if (recurse)
+ {
+ appendStringInfo(buf, "\n");
+ RT_DUMP_NODE(tree, n3->children[i], level + 1,
+ recurse, buf);
+ }
+ else
+ appendStringInfo(buf, " (skipped)\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_32 *n32 = (RT_NODE_LEAF_32 *) node;
+
+ appendStringInfo(buf, "%schunk[%d] 0x%X\n",
+ spaces.data, i, n32->base.chunks[i]);
+ }
+ else
+ {
+ RT_NODE_INNER_32 *n32 = (RT_NODE_INNER_32 *) node;
+
+ appendStringInfo(buf, "%schunk[%d] 0x%X",
+ spaces.data, i, n32->base.chunks[i]);
+
+ if (recurse)
+ {
+ appendStringInfo(buf, "\n");
+ RT_DUMP_NODE(tree, n32->children[i], level + 1,
+ recurse, buf);
+ }
+ else
+ appendStringInfo(buf, " (skipped)\n");
+
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE_BASE_125 *b125 = (RT_NODE_BASE_125 *) node;
+ char *sep = "";
+
+ appendStringInfo(buf, "%sslot_idxs: ", spaces.data);
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(b125, i))
+ continue;
+
+ appendStringInfo(buf, "%s[%d]=%d ",
+ sep, i, b125->slot_idxs[i]);
+ sep = ",";
+ }
+
+ appendStringInfo(buf, "\n%sisset-bitmap: ", spaces.data);
+ for (int i = 0; i < (RT_SLOT_IDX_LIMIT / BITS_PER_BYTE); i++)
+ appendStringInfo(buf, "%X ", ((uint8 *) b125->isset)[i]);
+ appendStringInfo(buf, "\n");
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(b125, i))
+ continue;
+
+ if (RT_NODE_IS_LEAF(node))
+ appendStringInfo(buf, "%schunk 0x%X\n",
+ spaces.data, i);
+ else
+ {
+ RT_NODE_INNER_125 *n125 = (RT_NODE_INNER_125 *) b125;
+
+ appendStringInfo(buf, "%schunk 0x%X",
+ spaces.data, i);
+
+ if (recurse)
+ {
+ appendStringInfo(buf, "\n");
+ RT_DUMP_NODE(tree, RT_NODE_INNER_125_GET_CHILD(n125, i),
+ level + 1, recurse, buf);
+ }
+ else
+ appendStringInfo(buf, " (skipped)\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_256 *n256 = (RT_NODE_LEAF_256 *) node;
+
+ appendStringInfo(buf, "%sisset-bitmap: ", spaces.data);
+ for (int i = 0; i < (RT_NODE_MAX_SLOTS / BITS_PER_BYTE); i++)
+ appendStringInfo(buf, "%X ", ((uint8 *) n256->isset)[i]);
+ appendStringInfo(buf, "\n");
+ }
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_256 *n256 = (RT_NODE_LEAF_256 *) node;
+
+ if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, i))
+ continue;
+
+ appendStringInfo(buf, "%schunk 0x%X\n",
+ spaces.data, i);
+ }
+ else
+ {
+ RT_NODE_INNER_256 *n256 = (RT_NODE_INNER_256 *) node;
+
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
+ continue;
+
+ appendStringInfo(buf, "%schunk 0x%X",
+ spaces.data, i);
+
+ if (recurse)
+ {
+ appendStringInfo(buf, "\n");
+ RT_DUMP_NODE(tree, RT_NODE_INNER_256_GET_CHILD(n256, i),
+ level + 1, recurse, buf);
+ }
+ else
+ appendStringInfo(buf, " (skipped)\n");
+ }
+ }
+ break;
+ }
+ }
+}
+
+RT_SCOPE void
+RT_DUMP_SEARCH(RT_RADIX_TREE *tree, uint64 key)
+{
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL node;
+ StringInfoData buf;
+ int shift;
+ int level = 0;
+
+ RT_STATS(tree);
+
+ RT_LOCK_SHARED(tree);
+
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ {
+ RT_UNLOCK(tree);
+ fprintf(stderr, "empty tree\n");
+ return;
+ }
+
+ if (key > tree->ctl->max_val)
+ {
+ RT_UNLOCK(tree);
+ fprintf(stderr, "key " UINT64_FORMAT "(0x" RT_UINT64_FORMAT_HEX ") is larger than max val\n",
+ key, key);
+ return;
+ }
+
+ initStringInfo(&buf);
+ allocnode = tree->ctl->root;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ shift = node->shift;
+ while (shift >= 0)
+ {
+ RT_PTR_ALLOC child;
+
+ RT_DUMP_NODE(tree, allocnode, level, false, &buf);
+
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_VALUE_TYPE dummy;
+
+ /* We reached a leaf node, find the corresponding slot */
+ RT_NODE_SEARCH_LEAF(node, key, &dummy);
+
+ break;
+ }
+
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ break;
+
+ allocnode = child;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ shift -= RT_NODE_SPAN;
+ level++;
+ }
+ RT_UNLOCK(tree);
+
+ fprintf(stderr, "%s", buf.data);
+}
+
+RT_SCOPE void
+RT_DUMP(RT_RADIX_TREE *tree)
+{
+ StringInfoData buf;
+
+ RT_STATS(tree);
+
+ RT_LOCK_SHARED(tree);
+
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ {
+ RT_UNLOCK(tree);
+ fprintf(stderr, "empty tree\n");
+ return;
+ }
+
+ initStringInfo(&buf);
+
+ RT_DUMP_NODE(tree, tree->ctl->root, 0, true, &buf);
+ RT_UNLOCK(tree);
+
+ fprintf(stderr, "%s",buf.data);
+}
+#endif
+
+#endif /* RT_DEFINE */
+
+
+/* undefine external parameters, so next radix tree can be defined */
+#undef RT_PREFIX
+#undef RT_SCOPE
+#undef RT_DECLARE
+#undef RT_DEFINE
+#undef RT_VALUE_TYPE
+
+/* locally declared macros */
+#undef RT_MAKE_PREFIX
+#undef RT_MAKE_NAME
+#undef RT_MAKE_NAME_
+#undef RT_NODE_SPAN
+#undef RT_NODE_MAX_SLOTS
+#undef RT_CHUNK_MASK
+#undef RT_MAX_SHIFT
+#undef RT_MAX_LEVEL
+#undef RT_GET_KEY_CHUNK
+#undef RT_BM_IDX
+#undef RT_BM_BIT
+#undef RT_LOCK_EXCLUSIVE
+#undef RT_LOCK_SHARED
+#undef RT_UNLOCK
+#undef RT_NODE_IS_LEAF
+#undef RT_NODE_MUST_GROW
+#undef RT_NODE_KIND_COUNT
+#undef RT_SIZE_CLASS_COUNT
+#undef RT_SLOT_IDX_LIMIT
+#undef RT_INVALID_SLOT_IDX
+#undef RT_SLAB_BLOCK_SIZE
+#undef RT_RADIX_TREE_MAGIC
+#undef RT_UINT64_FORMAT_HEX
+
+/* type declarations */
+#undef RT_RADIX_TREE
+#undef RT_RADIX_TREE_CONTROL
+#undef RT_PTR_LOCAL
+#undef RT_PTR_ALLOC
+#undef RT_INVALID_PTR_ALLOC
+#undef RT_HANDLE
+#undef RT_ITER
+#undef RT_NODE
+#undef RT_NODE_ITER
+#undef RT_NODE_KIND_3
+#undef RT_NODE_KIND_32
+#undef RT_NODE_KIND_125
+#undef RT_NODE_KIND_256
+#undef RT_NODE_BASE_3
+#undef RT_NODE_BASE_32
+#undef RT_NODE_BASE_125
+#undef RT_NODE_BASE_256
+#undef RT_NODE_INNER_3
+#undef RT_NODE_INNER_32
+#undef RT_NODE_INNER_125
+#undef RT_NODE_INNER_256
+#undef RT_NODE_LEAF_3
+#undef RT_NODE_LEAF_32
+#undef RT_NODE_LEAF_125
+#undef RT_NODE_LEAF_256
+#undef RT_SIZE_CLASS
+#undef RT_SIZE_CLASS_ELEM
+#undef RT_SIZE_CLASS_INFO
+#undef RT_CLASS_3
+#undef RT_CLASS_32_MIN
+#undef RT_CLASS_32_MAX
+#undef RT_CLASS_125
+#undef RT_CLASS_256
+
+/* function declarations */
+#undef RT_CREATE
+#undef RT_FREE
+#undef RT_ATTACH
+#undef RT_DETACH
+#undef RT_GET_HANDLE
+#undef RT_SEARCH
+#undef RT_SET
+#undef RT_BEGIN_ITERATE
+#undef RT_ITERATE_NEXT
+#undef RT_END_ITERATE
+#undef RT_USE_DELETE
+#undef RT_DELETE
+#undef RT_MEMORY_USAGE
+#undef RT_DUMP
+#undef RT_DUMP_NODE
+#undef RT_DUMP_SEARCH
+#undef RT_STATS
+
+/* internal helper functions */
+#undef RT_NEW_ROOT
+#undef RT_ALLOC_NODE
+#undef RT_INIT_NODE
+#undef RT_FREE_NODE
+#undef RT_FREE_RECURSE
+#undef RT_EXTEND_UP
+#undef RT_EXTEND_DOWN
+#undef RT_SWITCH_NODE_KIND
+#undef RT_COPY_NODE
+#undef RT_REPLACE_NODE
+#undef RT_PTR_GET_LOCAL
+#undef RT_PTR_ALLOC_IS_VALID
+#undef RT_NODE_3_SEARCH_EQ
+#undef RT_NODE_32_SEARCH_EQ
+#undef RT_NODE_3_GET_INSERTPOS
+#undef RT_NODE_32_GET_INSERTPOS
+#undef RT_CHUNK_CHILDREN_ARRAY_SHIFT
+#undef RT_CHUNK_VALUES_ARRAY_SHIFT
+#undef RT_CHUNK_CHILDREN_ARRAY_DELETE
+#undef RT_CHUNK_VALUES_ARRAY_DELETE
+#undef RT_CHUNK_CHILDREN_ARRAY_COPY
+#undef RT_CHUNK_VALUES_ARRAY_COPY
+#undef RT_NODE_125_IS_CHUNK_USED
+#undef RT_NODE_INNER_125_GET_CHILD
+#undef RT_NODE_LEAF_125_GET_VALUE
+#undef RT_NODE_INNER_256_IS_CHUNK_USED
+#undef RT_NODE_LEAF_256_IS_CHUNK_USED
+#undef RT_NODE_INNER_256_GET_CHILD
+#undef RT_NODE_LEAF_256_GET_VALUE
+#undef RT_NODE_INNER_256_SET
+#undef RT_NODE_LEAF_256_SET
+#undef RT_NODE_INNER_256_DELETE
+#undef RT_NODE_LEAF_256_DELETE
+#undef RT_KEY_GET_SHIFT
+#undef RT_SHIFT_GET_MAX_VAL
+#undef RT_NODE_SEARCH_INNER
+#undef RT_NODE_SEARCH_LEAF
+#undef RT_NODE_UPDATE_INNER
+#undef RT_NODE_DELETE_INNER
+#undef RT_NODE_DELETE_LEAF
+#undef RT_NODE_INSERT_INNER
+#undef RT_NODE_INSERT_LEAF
+#undef RT_NODE_INNER_ITERATE_NEXT
+#undef RT_NODE_LEAF_ITERATE_NEXT
+#undef RT_ITER_SET_NODE_FROM
+#undef RT_VERIFY_NODE
+
+#undef RT_DEBUG
diff --git a/src/include/lib/radixtree_delete_impl.h b/src/include/lib/radixtree_delete_impl.h
new file mode 100644
index 0000000000..5f6dda1f12
--- /dev/null
+++ b/src/include/lib/radixtree_delete_impl.h
@@ -0,0 +1,122 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree_delete_impl.h
+ * Common implementation for deletion in leaf and inner nodes.
+ *
+ * Note: There is deliberately no #include guard here
+ *
+ * TODO: Shrink nodes when deletion would allow them to fit in a smaller
+ * size class.
+ *
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * src/include/lib/radixtree_delete_impl.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE3_TYPE RT_NODE_INNER_3
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE3_TYPE RT_NODE_LEAF_3
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+
+#ifdef RT_NODE_LEVEL_LEAF
+ Assert(RT_NODE_IS_LEAF(node));
+#else
+ Assert(!RT_NODE_IS_LEAF(node));
+#endif
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node;
+ int idx = RT_NODE_3_SEARCH_EQ((RT_NODE_BASE_3 *) n3, chunk);
+
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_DELETE(n3->base.chunks, n3->values,
+ n3->base.n.count, idx);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_DELETE(n3->base.chunks, n3->children,
+ n3->base.n.count, idx);
+#endif
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
+ int idx = RT_NODE_32_SEARCH_EQ((RT_NODE_BASE_32 *) n32, chunk);
+
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_DELETE(n32->base.chunks, n32->values,
+ n32->base.n.count, idx);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_DELETE(n32->base.chunks, n32->children,
+ n32->base.n.count, idx);
+#endif
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+ int slotpos = n125->base.slot_idxs[chunk];
+ int idx;
+ int bitnum;
+
+ if (slotpos == RT_INVALID_SLOT_IDX)
+ return false;
+
+ idx = RT_BM_IDX(slotpos);
+ bitnum = RT_BM_BIT(slotpos);
+ n125->base.isset[idx] &= ~((bitmapword) 1 << bitnum);
+ n125->base.slot_idxs[chunk] = RT_INVALID_SLOT_IDX;
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk))
+#else
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, chunk))
+#endif
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_NODE_LEAF_256_DELETE(n256, chunk);
+#else
+ RT_NODE_INNER_256_DELETE(n256, chunk);
+#endif
+ break;
+ }
+ }
+
+ /* update statistics */
+ node->count--;
+
+ return true;
+
+#undef RT_NODE3_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_insert_impl.h b/src/include/lib/radixtree_insert_impl.h
new file mode 100644
index 0000000000..d56e58dcac
--- /dev/null
+++ b/src/include/lib/radixtree_insert_impl.h
@@ -0,0 +1,328 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree_insert_impl.h
+ * Common implementation for insertion in leaf and inner nodes.
+ *
+ * Note: There is deliberately no #include guard here
+ *
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * src/include/lib/radixtree_insert_impl.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE3_TYPE RT_NODE_INNER_3
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE3_TYPE RT_NODE_LEAF_3
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+
+#ifdef RT_NODE_LEVEL_LEAF
+ const bool is_leaf = true;
+ bool chunk_exists = false;
+ Assert(RT_NODE_IS_LEAF(node));
+#else
+ const bool is_leaf = false;
+ Assert(!RT_NODE_IS_LEAF(node));
+#endif
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ int idx = RT_NODE_3_SEARCH_EQ(&n3->base, chunk);
+
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n3->values[idx] = *value_p;
+ break;
+ }
+#endif
+ if (unlikely(RT_NODE_MUST_GROW(n3)))
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+ RT_NODE32_TYPE *new32;
+ const uint8 new_kind = RT_NODE_KIND_32;
+ const RT_SIZE_CLASS new_class = RT_CLASS_32_MIN;
+
+ /* grow node from 3 to 32 */
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
+ newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, is_leaf);
+ new32 = (RT_NODE32_TYPE *) newnode;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_COPY(n3->base.chunks, n3->values,
+ new32->base.chunks, new32->values);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_COPY(n3->base.chunks, n3->children,
+ new32->base.chunks, new32->children);
+#endif
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
+ node = newnode;
+ }
+ else
+ {
+ int insertpos = RT_NODE_3_GET_INSERTPOS(&n3->base, chunk);
+ int count = n3->base.n.count;
+
+ /* shift chunks and children */
+ if (insertpos < count)
+ {
+ Assert(count > 0);
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_SHIFT(n3->base.chunks, n3->values,
+ count, insertpos);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_SHIFT(n3->base.chunks, n3->children,
+ count, insertpos);
+#endif
+ }
+
+ n3->base.chunks[insertpos] = chunk;
+#ifdef RT_NODE_LEVEL_LEAF
+ n3->values[insertpos] = *value_p;
+#else
+ n3->children[insertpos] = child;
+#endif
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_32:
+ {
+ const RT_SIZE_CLASS_ELEM class32_max = RT_SIZE_CLASS_INFO[RT_CLASS_32_MAX];
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ int idx = RT_NODE_32_SEARCH_EQ(&n32->base, chunk);
+
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n32->values[idx] = *value_p;
+ break;
+ }
+#endif
+ if (unlikely(RT_NODE_MUST_GROW(n32)) &&
+ n32->base.n.fanout < class32_max.fanout)
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+ const RT_SIZE_CLASS_ELEM class32_min = RT_SIZE_CLASS_INFO[RT_CLASS_32_MIN];
+ const RT_SIZE_CLASS new_class = RT_CLASS_32_MAX;
+
+ Assert(n32->base.n.fanout == class32_min.fanout);
+
+ /* grow to the next size class of this kind */
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ n32 = (RT_NODE32_TYPE *) newnode;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ memcpy(newnode, node, class32_min.leaf_size);
+#else
+ memcpy(newnode, node, class32_min.inner_size);
+#endif
+ newnode->fanout = class32_max.fanout;
+
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
+ node = newnode;
+ }
+
+ if (unlikely(RT_NODE_MUST_GROW(n32)))
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+ RT_NODE125_TYPE *new125;
+ const uint8 new_kind = RT_NODE_KIND_125;
+ const RT_SIZE_CLASS new_class = RT_CLASS_125;
+
+ Assert(n32->base.n.fanout == class32_max.fanout);
+
+ /* grow node from 32 to 125 */
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
+ newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, is_leaf);
+ new125 = (RT_NODE125_TYPE *) newnode;
+
+ for (int i = 0; i < class32_max.fanout; i++)
+ {
+ new125->base.slot_idxs[n32->base.chunks[i]] = i;
+#ifdef RT_NODE_LEVEL_LEAF
+ new125->values[i] = n32->values[i];
+#else
+ new125->children[i] = n32->children[i];
+#endif
+ }
+
+ /*
+ * Since we just copied a dense array, we can set the bits
+ * using a single store, provided the length of that array
+ * is at most the number of bits in a bitmapword.
+ */
+ Assert(class32_max.fanout <= sizeof(bitmapword) * BITS_PER_BYTE);
+ new125->base.isset[0] = (bitmapword) (((uint64) 1 << class32_max.fanout) - 1);
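+ /*
+ * For example, assuming the largest node-32 size class has a fanout of
+ * 32 and bitmapwords are 64 bits wide, the store above sets isset[0] to
+ * 0xFFFFFFFF, i.e. slots 0..31 are marked used.
+ */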
+
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
+ node = newnode;
+ }
+ else
+ {
+ int insertpos = RT_NODE_32_GET_INSERTPOS(&n32->base, chunk);
+ int count = n32->base.n.count;
+
+ if (insertpos < count)
+ {
+ Assert(count > 0);
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_SHIFT(n32->base.chunks, n32->values,
+ count, insertpos);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_SHIFT(n32->base.chunks, n32->children,
+ count, insertpos);
+#endif
+ }
+
+ n32->base.chunks[insertpos] = chunk;
+#ifdef RT_NODE_LEVEL_LEAF
+ n32->values[insertpos] = *value_p;
+#else
+ n32->children[insertpos] = child;
+#endif
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+ int slotpos;
+ int cnt = 0;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ slotpos = n125->base.slot_idxs[chunk];
+ if (slotpos != RT_INVALID_SLOT_IDX)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n125->values[slotpos] = *value_p;
+ break;
+ }
+#endif
+ if (unlikely(RT_NODE_MUST_GROW(n125)))
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+ RT_NODE256_TYPE *new256;
+ const uint8 new_kind = RT_NODE_KIND_256;
+ const RT_SIZE_CLASS new_class = RT_CLASS_256;
+
+ /* grow node from 125 to 256 */
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
+ newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, is_leaf);
+ new256 = (RT_NODE256_TYPE *) newnode;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n125->base.n.count; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(&n125->base, i))
+ continue;
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_NODE_LEAF_256_SET(new256, i, RT_NODE_LEAF_125_GET_VALUE(n125, i));
+#else
+ RT_NODE_INNER_256_SET(new256, i, RT_NODE_INNER_125_GET_CHILD(n125, i));
+#endif
+ cnt++;
+ }
+
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
+ node = newnode;
+ }
+ else
+ {
+ int idx;
+ bitmapword inverse;
+
+ /* get the first word with at least one bit not set */
+ for (idx = 0; idx < RT_BM_IDX(RT_SLOT_IDX_LIMIT); idx++)
+ {
+ if (n125->base.isset[idx] < ~((bitmapword) 0))
+ break;
+ }
+
+ /* To get the first unset bit in X, get the first set bit in ~X */
+ inverse = ~(n125->base.isset[idx]);
+ slotpos = idx * BITS_PER_BITMAPWORD;
+ slotpos += bmw_rightmost_one_pos(inverse);
+ Assert(slotpos < node->fanout);
+
+ /* mark the slot used */
+ n125->base.isset[idx] |= bmw_rightmost_one(inverse);
+ n125->base.slot_idxs[chunk] = slotpos;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ n125->values[slotpos] = *value_p;
+#else
+ n125->children[slotpos] = child;
+#endif
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ chunk_exists = RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk);
+ Assert(chunk_exists || node->count < RT_NODE_MAX_SLOTS);
+ RT_NODE_LEAF_256_SET(n256, chunk, *value_p);
+#else
+ Assert(node->count < RT_NODE_MAX_SLOTS);
+ RT_NODE_INNER_256_SET(n256, chunk, child);
+#endif
+ break;
+ }
+ }
+
+ /* Update statistics */
+#ifdef RT_NODE_LEVEL_LEAF
+ if (!chunk_exists)
+ node->count++;
+#else
+ node->count++;
+#endif
+
+ /*
+ * Done. Finally, verify the chunk and value is inserted or replaced
+ * properly in the node.
+ */
+ RT_VERIFY_NODE(node);
+
+#ifdef RT_NODE_LEVEL_LEAF
+ return chunk_exists;
+#else
+ return;
+#endif
+
+#undef RT_NODE3_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_iter_impl.h b/src/include/lib/radixtree_iter_impl.h
new file mode 100644
index 0000000000..5c1034768e
--- /dev/null
+++ b/src/include/lib/radixtree_iter_impl.h
@@ -0,0 +1,144 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree_iter_impl.h
+ * Common implementation for iteration in leaf and inner nodes.
+ *
+ * Note: There is deliberately no #include guard here
+ *
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * src/include/lib/radixtree_iter_impl.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE3_TYPE RT_NODE_INNER_3
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE3_TYPE RT_NODE_LEAF_3
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ uint8 key_chunk = 0;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ Assert(value_p != NULL);
+ Assert(RT_NODE_IS_LEAF(node_iter->node));
+#else
+ RT_PTR_LOCAL child = NULL;
+
+ Assert(!RT_NODE_IS_LEAF(node_iter->node));
+#endif
+
+#ifdef RT_SHMEM
+ Assert(iter->tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+
+ switch (node_iter->node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node_iter->node;
+
+ if (node_iter->idx >= n3->base.n.count)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = n3->values[node_iter->idx];
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, n3->children[node_iter->idx]);
+#endif
+ key_chunk = n3->base.chunks[node_iter->idx];
+ node_iter->idx++;
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node_iter->node;
+
+ if (node_iter->idx >= n32->base.n.count)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = n32->values[node_iter->idx];
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, n32->children[node_iter->idx]);
+#endif
+ key_chunk = n32->base.chunks[node_iter->idx];
+ node_iter->idx++;
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node_iter->node;
+ int chunk;
+
+ for (chunk = node_iter->idx; chunk < RT_NODE_MAX_SLOTS; chunk++)
+ {
+ if (RT_NODE_125_IS_CHUNK_USED((RT_NODE_BASE_125 *) n125, chunk))
+ break;
+ }
+
+ if (chunk >= RT_NODE_MAX_SLOTS)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = RT_NODE_LEAF_125_GET_VALUE(n125, chunk);
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, RT_NODE_INNER_125_GET_CHILD(n125, chunk));
+#endif
+ key_chunk = chunk;
+ node_iter->idx = chunk + 1;
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node_iter->node;
+ int chunk;
+
+ for (chunk = node_iter->idx; chunk < RT_NODE_MAX_SLOTS; chunk++)
+ {
+#ifdef RT_NODE_LEVEL_LEAF
+ if (RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk))
+#else
+ if (RT_NODE_INNER_256_IS_CHUNK_USED(n256, chunk))
+#endif
+ break;
+ }
+
+ if (chunk >= RT_NODE_MAX_SLOTS)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = RT_NODE_LEAF_256_GET_VALUE(n256, chunk);
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, RT_NODE_INNER_256_GET_CHILD(n256, chunk));
+#endif
+ key_chunk = chunk;
+ node_iter->idx = chunk + 1;
+ break;
+ }
+ }
+
+ /* Update the part of the key that corresponds to this node's level */
+ iter->key &= ~(((uint64) RT_CHUNK_MASK) << node_iter->node->shift);
+ iter->key |= (((uint64) key_chunk) << node_iter->node->shift);
+
+#ifdef RT_NODE_LEVEL_LEAF
+ return true;
+#else
+ return child;
+#endif
+
+#undef RT_NODE3_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_search_impl.h b/src/include/lib/radixtree_search_impl.h
new file mode 100644
index 0000000000..a8925c75d0
--- /dev/null
+++ b/src/include/lib/radixtree_search_impl.h
@@ -0,0 +1,138 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree_search_impl.h
+ * Common implementation for search in leaf and inner nodes, plus
+ * update for inner nodes only.
+ *
+ * Note: There is deliberately no #include guard here
+ *
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * src/include/lib/radixtree_search_impl.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE3_TYPE RT_NODE_INNER_3
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE3_TYPE RT_NODE_LEAF_3
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+
+#ifdef RT_NODE_LEVEL_LEAF
+ Assert(value_p != NULL);
+ Assert(RT_NODE_IS_LEAF(node));
+#else
+#ifndef RT_ACTION_UPDATE
+ Assert(child_p != NULL);
+#endif
+ Assert(!RT_NODE_IS_LEAF(node));
+#endif
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node;
+ int idx = RT_NODE_3_SEARCH_EQ((RT_NODE_BASE_3 *) n3, chunk);
+
+#ifdef RT_ACTION_UPDATE
+ Assert(idx >= 0);
+ n3->children[idx] = new_child;
+#else
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = n3->values[idx];
+#else
+ *child_p = n3->children[idx];
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
+ int idx = RT_NODE_32_SEARCH_EQ((RT_NODE_BASE_32 *) n32, chunk);
+
+#ifdef RT_ACTION_UPDATE
+ Assert(idx >= 0);
+ n32->children[idx] = new_child;
+#else
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = n32->values[idx];
+#else
+ *child_p = n32->children[idx];
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+ int slotpos = n125->base.slot_idxs[chunk];
+
+#ifdef RT_ACTION_UPDATE
+ Assert(slotpos != RT_INVALID_SLOT_IDX);
+ n125->children[slotpos] = new_child;
+#else
+ if (slotpos == RT_INVALID_SLOT_IDX)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = RT_NODE_LEAF_125_GET_VALUE(n125, chunk);
+#else
+ *child_p = RT_NODE_INNER_125_GET_CHILD(n125, chunk);
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
+
+#ifdef RT_ACTION_UPDATE
+ RT_NODE_INNER_256_SET(n256, chunk, new_child);
+#else
+#ifdef RT_NODE_LEVEL_LEAF
+ if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk))
+#else
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, chunk))
+#endif
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = RT_NODE_LEAF_256_GET_VALUE(n256, chunk);
+#else
+ *child_p = RT_NODE_INNER_256_GET_CHILD(n256, chunk);
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ }
+
+#ifdef RT_ACTION_UPDATE
+ return;
+#else
+ return true;
+#endif /* RT_ACTION_UPDATE */
+
+#undef RT_NODE3_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/utils/dsa.h b/src/include/utils/dsa.h
index 3ce4ee300a..2af215484f 100644
--- a/src/include/utils/dsa.h
+++ b/src/include/utils/dsa.h
@@ -121,6 +121,7 @@ extern dsa_handle dsa_get_handle(dsa_area *area);
extern dsa_pointer dsa_allocate_extended(dsa_area *area, size_t size, int flags);
extern void dsa_free(dsa_area *area, dsa_pointer dp);
extern void *dsa_get_address(dsa_area *area, dsa_pointer dp);
+extern size_t dsa_get_total_size(dsa_area *area);
extern void dsa_trim(dsa_area *area);
extern void dsa_dump(dsa_area *area);
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index c629cbe383..9659eb85d7 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -28,6 +28,7 @@ SUBDIRS = \
test_pg_db_role_setting \
test_pg_dump \
test_predtest \
+ test_radixtree \
test_rbtree \
test_regex \
test_rls_hooks \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 1baa6b558d..232cbdac80 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -24,6 +24,7 @@ subdir('test_parser')
subdir('test_pg_db_role_setting')
subdir('test_pg_dump')
subdir('test_predtest')
+subdir('test_radixtree')
subdir('test_rbtree')
subdir('test_regex')
subdir('test_rls_hooks')
diff --git a/src/test/modules/test_radixtree/.gitignore b/src/test/modules/test_radixtree/.gitignore
new file mode 100644
index 0000000000..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/test_radixtree/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_radixtree/Makefile b/src/test/modules/test_radixtree/Makefile
new file mode 100644
index 0000000000..da06b93da3
--- /dev/null
+++ b/src/test/modules/test_radixtree/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_radixtree/Makefile
+
+MODULE_big = test_radixtree
+OBJS = \
+ $(WIN32RES) \
+ test_radixtree.o
+PGFILEDESC = "test_radixtree - test code for src/include/lib/radixtree.h"
+
+EXTENSION = test_radixtree
+DATA = test_radixtree--1.0.sql
+
+REGRESS = test_radixtree
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_radixtree
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_radixtree/README b/src/test/modules/test_radixtree/README
new file mode 100644
index 0000000000..a8b271869a
--- /dev/null
+++ b/src/test/modules/test_radixtree/README
@@ -0,0 +1,7 @@
+test_radixtree contains unit tests for the radix tree implementation in
+src/include/lib/radixtree.h.
+
+The tests verify the correctness of the implementation, but they can also be
+used as a micro-benchmark. If you set the stats flag in test_radixtree.c, the
+tests will print extra information about execution time and memory usage.
diff --git a/src/test/modules/test_radixtree/expected/test_radixtree.out b/src/test/modules/test_radixtree/expected/test_radixtree.out
new file mode 100644
index 0000000000..7ad1ce3605
--- /dev/null
+++ b/src/test/modules/test_radixtree/expected/test_radixtree.out
@@ -0,0 +1,38 @@
+CREATE EXTENSION test_radixtree;
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
+NOTICE: testing basic operations with leaf node 3
+NOTICE: testing basic operations with inner node 3
+NOTICE: testing basic operations with leaf node 15
+NOTICE: testing basic operations with inner node 15
+NOTICE: testing basic operations with leaf node 32
+NOTICE: testing basic operations with inner node 32
+NOTICE: testing basic operations with leaf node 125
+NOTICE: testing basic operations with inner node 125
+NOTICE: testing basic operations with leaf node 256
+NOTICE: testing basic operations with inner node 256
+NOTICE: testing radix tree node types with shift "0"
+NOTICE: testing radix tree node types with shift "8"
+NOTICE: testing radix tree node types with shift "16"
+NOTICE: testing radix tree node types with shift "24"
+NOTICE: testing radix tree node types with shift "32"
+NOTICE: testing radix tree node types with shift "40"
+NOTICE: testing radix tree node types with shift "48"
+NOTICE: testing radix tree node types with shift "56"
+NOTICE: testing radix tree with pattern "all ones"
+NOTICE: testing radix tree with pattern "alternating bits"
+NOTICE: testing radix tree with pattern "clusters of ten"
+NOTICE: testing radix tree with pattern "clusters of hundred"
+NOTICE: testing radix tree with pattern "one-every-64k"
+NOTICE: testing radix tree with pattern "sparse"
+NOTICE: testing radix tree with pattern "single values, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^60"
+ test_radixtree
+----------------
+
+(1 row)
+
diff --git a/src/test/modules/test_radixtree/meson.build b/src/test/modules/test_radixtree/meson.build
new file mode 100644
index 0000000000..6add06bbdb
--- /dev/null
+++ b/src/test/modules/test_radixtree/meson.build
@@ -0,0 +1,35 @@
+# FIXME: prevent install during main install, but not during test :/
+
+test_radixtree_sources = files(
+ 'test_radixtree.c',
+)
+
+if host_system == 'windows'
+ test_radixtree_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'test_radixtree',
'--FILEDESC', 'test_radixtree - test code for src/include/lib/radixtree.h',])
+endif
+
+test_radixtree = shared_module('test_radixtree',
+ test_radixtree_sources,
+ link_with: pgport_srv,
+ kwargs: pg_mod_args,
+)
+testprep_targets += test_radixtree
+
+install_data(
+ 'test_radixtree.control',
+ 'test_radixtree--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'test_radixtree',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'test_radixtree',
+ ],
+ },
+}
diff --git a/src/test/modules/test_radixtree/sql/test_radixtree.sql b/src/test/modules/test_radixtree/sql/test_radixtree.sql
new file mode 100644
index 0000000000..41ece5e9f5
--- /dev/null
+++ b/src/test/modules/test_radixtree/sql/test_radixtree.sql
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_radixtree;
+
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
diff --git a/src/test/modules/test_radixtree/test_radixtree--1.0.sql b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
new file mode 100644
index 0000000000..074a5a7ea7
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_radixtree/test_radixtree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_radixtree" to load this file. \quit
+
+CREATE FUNCTION test_radixtree()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
new file mode 100644
index 0000000000..5a169854d9
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -0,0 +1,712 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_radixtree.c
+ * Test radixtree set data structure.
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/test/modules/test_radixtree/test_radixtree.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "miscadmin.h"
+#include "nodes/bitmapset.h"
+#include "storage/block.h"
+#include "storage/itemptr.h"
+#include "storage/lwlock.h"
+#include "utils/memutils.h"
+#include "utils/timestamp.h"
+
+#define UINT64_HEX_FORMAT "%" INT64_MODIFIER "X"
+
+/*
+ * The tests pass with uint32, but build with warnings because the string
+ * format expects uint64.
+ */
+typedef uint64 TestValueType;
+
+/*
+ * If you enable this, the "pattern" tests will print information about
+ * how long populating, probing, and iterating the test set takes, and
+ * how much memory the test set consumed. That can be used as
+ * micro-benchmark of various operations and input patterns (you might
+ * want to increase the number of values used in each of the test, if
+ * you do that, to reduce noise).
+ *
+ * The information is printed to the server's stderr, mostly because
+ * that's where MemoryContextStats() output goes.
+ */
+static const bool rt_test_stats = false;
+
+/*
+ * XXX: should we expose and use RT_SIZE_CLASS and RT_SIZE_CLASS_INFO?
+ */
+static int rt_node_class_fanouts[] = {
+ 3, /* RT_CLASS_3 */
+ 15, /* RT_CLASS_32_MIN */
+ 32, /* RT_CLASS_32_MAX */
+ 125, /* RT_CLASS_125 */
+ 256 /* RT_CLASS_256 */
+};
+/*
+ * A struct to define a pattern of integers, for use with the test_pattern()
+ * function.
+ */
+typedef struct
+{
+ char *test_name; /* short name of the test, for humans */
+ char *pattern_str; /* a bit pattern */
+ uint64 spacing; /* pattern repeats at this interval */
+ uint64 num_values; /* number of integers to set in total */
+} test_spec;
+
+/* Test patterns borrowed from test_integerset.c */
+static const test_spec test_specs[] = {
+ {
+ "all ones", "1111111111",
+ 10, 1000000
+ },
+ {
+ "alternating bits", "0101010101",
+ 10, 1000000
+ },
+ {
+ "clusters of ten", "1111111111",
+ 10000, 1000000
+ },
+ {
+ "clusters of hundred",
+ "1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111",
+ 10000, 1000000
+ },
+ {
+ "one-every-64k", "1",
+ 65536, 1000000
+ },
+ {
+ "sparse", "100000000000000000000000000000001",
+ 10000000, 1000000
+ },
+ {
+ "single values, distance > 2^32", "1",
+ UINT64CONST(10000000000), 100000
+ },
+ {
+ "clusters, distance > 2^32", "10101010",
+ UINT64CONST(10000000000), 1000000
+ },
+ {
+ "clusters, distance > 2^60", "10101010",
+ UINT64CONST(2000000000000000000),
+ 23 /* can't be much higher than this, or we
+ * overflow uint64 */
+ }
+};
+
+/* define the radix tree implementation to test */
+#define RT_PREFIX rt
+#define RT_SCOPE
+#define RT_DECLARE
+#define RT_DEFINE
+#define RT_USE_DELETE
+#define RT_VALUE_TYPE TestValueType
+/* #define RT_SHMEM */
+#include "lib/radixtree.h"
+
+
+/*
+ * Return the number of keys in the radix tree.
+ */
+static uint64
+rt_num_entries(rt_radix_tree *tree)
+{
+ return tree->ctl->num_keys;
+}
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(test_radixtree);
+
+static void
+test_empty(void)
+{
+ rt_radix_tree *radixtree;
+ rt_iter *iter;
+ TestValueType dummy;
+ uint64 key;
+ TestValueType val;
+
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+
+ LWLockRegisterTranche(tranche_id, "test_radix_tree");
+ dsa = dsa_create(tranche_id);
+
+ radixtree = rt_create(CurrentMemoryContext, dsa, tranche_id);
+#else
+ radixtree = rt_create(CurrentMemoryContext);
+#endif
+
+ if (rt_search(radixtree, 0, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, 1, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, PG_UINT64_MAX, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_delete(radixtree, 0))
+ elog(ERROR, "rt_delete on empty tree returned true");
+
+ if (rt_num_entries(radixtree) != 0)
+ elog(ERROR, "rt_num_entries on empty tree return non-zero");
+
+ iter = rt_begin_iterate(radixtree);
+
+ if (rt_iterate_next(iter, &key, &val))
+ elog(ERROR, "rt_itereate_next on empty tree returned true");
+
+ rt_end_iterate(iter);
+
+ rt_free(radixtree);
+
+#ifdef RT_SHMEM
+ dsa_detach(dsa);
+#endif
+}
+
+static void
+test_basic(int children, bool test_inner)
+{
+ rt_radix_tree *radixtree;
+ uint64 *keys;
+ int shift = test_inner ? 8 : 0;
+
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+
+ LWLockRegisterTranche(tranche_id, "test_radix_tree");
+ dsa = dsa_create(tranche_id);
+#endif
+
+ elog(NOTICE, "testing basic operations with %s node %d",
+ test_inner ? "inner" : "leaf", children);
+
+#ifdef RT_SHMEM
+ radixtree = rt_create(CurrentMemoryContext, dsa, tranche_id);
+#else
+ radixtree = rt_create(CurrentMemoryContext);
+#endif
+
+ /* prepare keys in an interleaved order like 1, children, 2, children - 1, 3, ... */
+ keys = palloc(sizeof(uint64) * children);
+ for (int i = 0; i < children; i++)
+ {
+ if (i % 2 == 0)
+ keys[i] = (uint64) ((i / 2) + 1) << shift;
+ else
+ keys[i] = (uint64) (children - (i / 2)) << shift;
+ }
+
+ /* insert keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (rt_set(radixtree, keys[i], (TestValueType*) &keys[i]))
+ elog(ERROR, "new inserted key 0x" UINT64_HEX_FORMAT " is found ", keys[i]);
+ }
+
+ /* look up keys */
+ for (int i = 0; i < children; i++)
+ {
+ TestValueType value;
+
+ if (!rt_search(radixtree, keys[i], &value))
+ elog(ERROR, "could not find key 0x" UINT64_HEX_FORMAT, keys[i]);
+ if (value != (TestValueType) keys[i])
+ elog(ERROR, "rt_search returned 0x" UINT64_HEX_FORMAT ", expected " UINT64_HEX_FORMAT,
+ value, (TestValueType) keys[i]);
+ }
+
+ /* update keys */
+ for (int i = 0; i < children; i++)
+ {
+ TestValueType update = keys[i] + 1;
+ if (!rt_set(radixtree, keys[i], (TestValueType*) &update))
+ elog(ERROR, "could not update key 0x" UINT64_HEX_FORMAT, keys[i]);
+ }
+
+ /* repeat deleting and inserting keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (!rt_delete(radixtree, keys[i]))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, keys[i]);
+ if (rt_set(radixtree, keys[i], (TestValueType*) &keys[i]))
+ elog(ERROR, "new inserted key 0x" UINT64_HEX_FORMAT " is found ", keys[i]);
+ }
+
+ pfree(keys);
+ rt_free(radixtree);
+#ifdef RT_SHMEM
+ dsa_detach(dsa);
+#endif
+}
+
+/*
+ * Check if keys from start to end with the shift exist in the tree.
+ */
+static void
+check_search_on_node(rt_radix_tree *radixtree, uint8 shift, int start, int end)
+{
+ for (int i = start; i <= end; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ TestValueType val;
+
+ if (!rt_search(radixtree, key, &val))
+ elog(ERROR, "key 0x" UINT64_HEX_FORMAT " is not found on node-%d",
+ key, end);
+ if (val != (TestValueType) key)
+ elog(ERROR, "rt_search with key 0x" UINT64_HEX_FORMAT " returns 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ key, val, key);
+ }
+}
+
+/*
+ * Insert 256 key-value pairs, and check if keys are properly inserted on each
+ * node class.
+ */
+/* Test keys [0, 256) */
+#define NODE_TYPE_TEST_KEY_MIN 0
+#define NODE_TYPE_TEST_KEY_MAX 256
+static void
+test_node_types_insert_asc(rt_radix_tree *radixtree, uint8 shift)
+{
+ uint64 num_entries;
+ int node_class_idx = 0;
+ uint64 key_checked = 0;
+
+ for (int i = NODE_TYPE_TEST_KEY_MIN; i < NODE_TYPE_TEST_KEY_MAX; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_set(radixtree, key, (TestValueType *) &key);
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " is found", key);
+
+ /*
+ * After filling all slots in each node type, check if the values
+ * are stored properly.
+ */
+ if ((i + 1) == rt_node_class_fanouts[node_class_idx])
+ {
+ check_search_on_node(radixtree, shift, key_checked, i);
+ key_checked = i;
+ node_class_idx++;
+ }
+ }
+
+ num_entries = rt_num_entries(radixtree);
+ if (num_entries != 256)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(256));
+}
+
+/*
+ * Similar to test_node_types_insert_asc(), but inserts keys in descending order.
+ */
+static void
+test_node_types_insert_desc(rt_radix_tree *radixtree, uint8 shift)
+{
+ uint64 num_entries;
+ int node_class_idx = 0;
+ uint64 key_checked = NODE_TYPE_TEST_KEY_MAX - 1;
+
+ for (int i = NODE_TYPE_TEST_KEY_MAX - 1; i >= NODE_TYPE_TEST_KEY_MIN; i--)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_set(radixtree, key, (TestValueType *) &key);
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " is found", key);
+
+ if ((i + 1) == rt_node_class_fanouts[node_class_idx])
+ {
+ check_search_on_node(radixtree, shift, i, key_checked);
+ key_checked = i;
+ node_class_idx++;
+ }
+ }
+
+ num_entries = rt_num_entries(radixtree);
+ if (num_entries != 256)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(256));
+}
+
+static void
+test_node_types_delete(rt_radix_tree *radixtree, uint8 shift)
+{
+ uint64 num_entries;
+
+ for (int i = NODE_TYPE_TEST_KEY_MIN; i < NODE_TYPE_TEST_KEY_MAX; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_delete(radixtree, key);
+
+ if (!found)
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, key);
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ /* The tree must be empty */
+ if (num_entries != 0)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(0));
+}
+
+/*
+ * Test for inserting and deleting key-value pairs to each node type at the given shift
+ * level.
+ */
+static void
+test_node_types(uint8 shift)
+{
+ rt_radix_tree *radixtree;
+
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+
+ LWLockRegisterTranche(tranche_id, "test_radix_tree");
+ dsa = dsa_create(tranche_id);
+#endif
+
+ elog(NOTICE, "testing radix tree node types with shift \"%d\"", shift);
+
+#ifdef RT_SHMEM
+ radixtree = rt_create(CurrentMemoryContext, dsa, tranche_id);
+#else
+ radixtree = rt_create(CurrentMemoryContext);
+#endif
+
+ /*
+ * Insert and search entries for every node type at the 'shift' level,
+ * then delete all entries to make it empty, and insert and search entries
+ * again.
+ */
+ test_node_types_insert_asc(radixtree, shift);
+ test_node_types_delete(radixtree, shift);
+ test_node_types_insert_desc(radixtree, shift);
+
+ rt_free(radixtree);
+#ifdef RT_SHMEM
+ dsa_detach(dsa);
+#endif
+}
+
+/*
+ * Test with a repeating pattern, defined by the 'spec'.
+ */
+static void
+test_pattern(const test_spec * spec)
+{
+ rt_radix_tree *radixtree;
+ rt_iter *iter;
+ MemoryContext radixtree_ctx;
+ TimestampTz starttime;
+ TimestampTz endtime;
+ uint64 n;
+ uint64 last_int;
+ uint64 ndeleted;
+ uint64 nbefore;
+ uint64 nafter;
+ int patternlen;
+ uint64 *pattern_values;
+ uint64 pattern_num_values;
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+
+ LWLockRegisterTranche(tranche_id, "test_radix_tree");
+ dsa = dsa_create(tranche_id);
+#endif
+
+ elog(NOTICE, "testing radix tree with pattern \"%s\"", spec->test_name);
+ if (rt_test_stats)
+ fprintf(stderr, "-----\ntesting radix tree with pattern \"%s\"\n", spec->test_name);
+
+ /* Pre-process the pattern, creating an array of integers from it. */
+ patternlen = strlen(spec->pattern_str);
+ pattern_values = palloc(patternlen * sizeof(uint64));
+ pattern_num_values = 0;
+ for (int i = 0; i < patternlen; i++)
+ {
+ if (spec->pattern_str[i] == '1')
+ pattern_values[pattern_num_values++] = i;
+ }
+
+ /*
+ * Allocate the radix tree.
+ *
+ * Allocate it in a separate memory context, so that we can print its
+ * memory usage easily.
+ */
+ radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
+ "radixtree test",
+ ALLOCSET_SMALL_SIZES);
+ MemoryContextSetIdentifier(radixtree_ctx, spec->test_name);
+
+#ifdef RT_SHMEM
+ radixtree = rt_create(radixtree_ctx, dsa, tranche_id);
+#else
+ radixtree = rt_create(radixtree_ctx);
+#endif
+
+
+ /*
+ * Add values to the set.
+ */
+ starttime = GetCurrentTimestamp();
+
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ uint64 x = 0;
+
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ bool found;
+
+ x = last_int + pattern_values[i];
+
+ found = rt_set(radixtree, x, (TestValueType*) &x);
+
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " found", x);
+
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+
+ endtime = GetCurrentTimestamp();
+
+ if (rt_test_stats)
+ fprintf(stderr, "added " UINT64_FORMAT " values in %d ms\n",
+ spec->num_values, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Print stats on the amount of memory used.
+ *
+ * We print the usage reported by rt_memory_usage(), as well as the stats
+ * from the memory context. They should be in the same ballpark, but it's
+ * hard to automate testing that, so if you're making changes to the
+ * implementation, just observe that manually.
+ */
+ if (rt_test_stats)
+ {
+ uint64 mem_usage;
+
+ /*
+ * Also print memory usage as reported by rt_memory_usage(). It
+ * should be in the same ballpark as the usage reported by
+ * MemoryContextStats().
+ */
+ mem_usage = rt_memory_usage(radixtree);
+ fprintf(stderr, "rt_memory_usage() reported " UINT64_FORMAT " (%0.2f bytes / integer)\n",
+ mem_usage, (double) mem_usage / spec->num_values);
+
+ MemoryContextStats(radixtree_ctx);
+ }
+
+ /* Check that rt_num_entries works */
+ n = rt_num_entries(radixtree);
+ if (n != spec->num_values)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT, n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_search()
+ */
+ starttime = GetCurrentTimestamp();
+
+ for (n = 0; n < 100000; n++)
+ {
+ bool found;
+ bool expected;
+ uint64 x;
+ TestValueType v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Do we expect this value to be present in the set? */
+ if (x >= last_int)
+ expected = false;
+ else
+ {
+ uint64 idx = x % spec->spacing;
+
+ if (idx >= patternlen)
+ expected = false;
+ else if (spec->pattern_str[idx] == '1')
+ expected = true;
+ else
+ expected = false;
+ }
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (found != expected)
+ elog(ERROR, "mismatch at 0x" UINT64_HEX_FORMAT ": %d vs %d", x, found, expected);
+ if (found && (v != (TestValueType) x))
+ elog(ERROR, "found 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ v, x);
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "probed " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Test iterator
+ */
+ starttime = GetCurrentTimestamp();
+
+ iter = rt_begin_iterate(radixtree);
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ uint64 expected = last_int + pattern_values[i];
+ uint64 x;
+ TestValueType val;
+
+ if (!rt_iterate_next(iter, &x, &val))
+ break;
+
+ if (x != expected)
+ elog(ERROR,
+ "iterate returned wrong key; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d",
+ x, expected, i);
+ if (val != (TestValueType) expected)
+ elog(ERROR,
+ "iterate returned wrong value; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d", x, expected, i);
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "iterated " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ rt_end_iterate(iter);
+
+ if (n < spec->num_values)
+ elog(ERROR, "iterator stopped short after " UINT64_FORMAT " entries, expected " UINT64_FORMAT, n, spec->num_values);
+ if (n > spec->num_values)
+ elog(ERROR, "iterator returned " UINT64_FORMAT " entries, " UINT64_FORMAT " was expected", n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_delete()
+ */
+ starttime = GetCurrentTimestamp();
+
+ nbefore = rt_num_entries(radixtree);
+ ndeleted = 0;
+ for (n = 0; n < 1; n++)
+ {
+ bool found;
+ uint64 x;
+ TestValueType v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (!found)
+ continue;
+
+ /* If the key is found, delete it and check again */
+ if (!rt_delete(radixtree, x))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_search(radixtree, x, &v))
+ elog(ERROR, "found deleted key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_delete(radixtree, x))
+ elog(ERROR, "deleted already-deleted key 0x" UINT64_HEX_FORMAT, x);
+
+ ndeleted++;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "deleted " UINT64_FORMAT " values in %d ms\n",
+ ndeleted, (int) (endtime - starttime) / 1000);
+
+ nafter = rt_num_entries(radixtree);
+
+ /* Check that rt_num_entries works */
+ if ((nbefore - ndeleted) != nafter)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT "after " UINT64_FORMAT " deletion",
+ nafter, (nbefore - ndeleted), ndeleted);
+
+ rt_free(radixtree);
+ MemoryContextDelete(radixtree_ctx);
+#ifdef RT_SHMEM
+ dsa_detach(dsa);
+#endif
+}
+
+Datum
+test_radixtree(PG_FUNCTION_ARGS)
+{
+ test_empty();
+
+ for (int i = 0; i < lengthof(rt_node_class_fanouts); i++)
+ {
+ test_basic(rt_node_class_fanouts[i], false);
+ test_basic(rt_node_class_fanouts[i], true);
+ }
+
+ for (int shift = 0; shift <= (64 - 8); shift += 8)
+ test_node_types(shift);
+
+ /* Test different test patterns, with lots of entries */
+ for (int i = 0; i < lengthof(test_specs); i++)
+ test_pattern(&test_specs[i]);
+
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_radixtree/test_radixtree.control b/src/test/modules/test_radixtree/test_radixtree.control
new file mode 100644
index 0000000000..e53f2a3e0c
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.control
@@ -0,0 +1,4 @@
+comment = 'Test code for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/test_radixtree'
+relocatable = true
diff --git a/src/tools/pginclude/cpluspluscheck b/src/tools/pginclude/cpluspluscheck
index 2c5042eb41..14b37e8eef 100755
--- a/src/tools/pginclude/cpluspluscheck
+++ b/src/tools/pginclude/cpluspluscheck
@@ -101,6 +101,12 @@ do
test "$f" = src/include/nodes/nodetags.h && continue
test "$f" = src/backend/nodes/nodetags.h && continue
+ # radixtree_*_impl.h cannot be included standalone: they are just code fragments.
+ test "$f" = src/include/lib/radixtree_delete_impl.h && continue
+ test "$f" = src/include/lib/radixtree_insert_impl.h && continue
+ test "$f" = src/include/lib/radixtree_iter_impl.h && continue
+ test "$f" = src/include/lib/radixtree_search_impl.h && continue
+
# These files are not meant to be included standalone, because
# they contain lists that might have multiple use-cases.
test "$f" = src/include/access/rmgrlist.h && continue
diff --git a/src/tools/pginclude/headerscheck b/src/tools/pginclude/headerscheck
index abbba7aa63..d4d2f1da03 100755
--- a/src/tools/pginclude/headerscheck
+++ b/src/tools/pginclude/headerscheck
@@ -96,6 +96,12 @@ do
test "$f" = src/include/nodes/nodetags.h && continue
test "$f" = src/backend/nodes/nodetags.h && continue
+ # radixtree_*_impl.h cannot be included standalone: they are just code fragments.
+ test "$f" = src/include/lib/radixtree_delete_impl.h && continue
+ test "$f" = src/include/lib/radixtree_insert_impl.h && continue
+ test "$f" = src/include/lib/radixtree_iter_impl.h && continue
+ test "$f" = src/include/lib/radixtree_search_impl.h && continue
+
# These files are not meant to be included standalone, because
# they contain lists that might have multiple use-cases.
test "$f" = src/include/access/rmgrlist.h && continue
--
2.31.1
Attachment: v30-0001-Introduce-helper-SIMD-functions-for-small-byte-a.patch
From 22b578551e15e829e6649784eac8ec66d4a455c3 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 24 Oct 2022 14:07:09 +0900
Subject: [PATCH v30 01/11] Introduce helper SIMD functions for small byte
arrays
vector8_min - helper for emulating ">=" semantics
vector8_highbit_mask - used to turn the result of a vector
comparison into a bitmask
Masahiko Sawada
Reviewed by Nathan Bossart, additional adjustments by me
Discussion: https://www.postgresql.org/message-id/CAD21AoDap240WDDdUDE0JMpCmuMMnGajrKrkCRxM7zn9Xk3JRA%40mail.gmail.com
---
src/include/port/simd.h | 47 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 47 insertions(+)
diff --git a/src/include/port/simd.h b/src/include/port/simd.h
index 1fa6c3bc6c..dfae14e463 100644
--- a/src/include/port/simd.h
+++ b/src/include/port/simd.h
@@ -79,6 +79,7 @@ static inline bool vector8_has_le(const Vector8 v, const uint8 c);
static inline bool vector8_is_highbit_set(const Vector8 v);
#ifndef USE_NO_SIMD
static inline bool vector32_is_highbit_set(const Vector32 v);
+static inline uint32 vector8_highbit_mask(const Vector8 v);
#endif
/* arithmetic operations */
@@ -96,6 +97,7 @@ static inline Vector8 vector8_ssub(const Vector8 v1, const Vector8 v2);
*/
#ifndef USE_NO_SIMD
static inline Vector8 vector8_eq(const Vector8 v1, const Vector8 v2);
+static inline Vector8 vector8_min(const Vector8 v1, const Vector8 v2);
static inline Vector32 vector32_eq(const Vector32 v1, const Vector32 v2);
#endif
@@ -299,6 +301,36 @@ vector32_is_highbit_set(const Vector32 v)
}
#endif /* ! USE_NO_SIMD */
+/*
+ * Return a bitmask formed from the high-bit of each element.
+ */
+#ifndef USE_NO_SIMD
+static inline uint32
+vector8_highbit_mask(const Vector8 v)
+{
+#ifdef USE_SSE2
+ return (uint32) _mm_movemask_epi8(v);
+#elif defined(USE_NEON)
+ /*
+ * Note: There is a faster way to do this, but it returns a uint64,
+ * and if the caller wanted to extract the bit position using CTZ,
+ * it would have to divide that result by 4.
+ */
+ static const uint8 mask[16] = {
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ };
+
+ uint8x16_t masked = vandq_u8(vld1q_u8(mask), (uint8x16_t) vshrq_n_s8(v, 7));
+ uint8x16_t maskedhi = vextq_u8(masked, masked, 8);
+
+ return (uint32) vaddvq_u16((uint16x8_t) vzip1q_u8(masked, maskedhi));
+#endif
+}
+#endif /* ! USE_NO_SIMD */
+
/*
* Return the bitwise OR of the inputs
*/
@@ -372,4 +404,19 @@ vector32_eq(const Vector32 v1, const Vector32 v2)
}
#endif /* ! USE_NO_SIMD */
+/*
+ * Given two vectors, return a vector with the minimum element of each.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_min(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+ return _mm_min_epu8(v1, v2);
+#elif defined(USE_NEON)
+ return vminq_u8(v1, v2);
+#endif
+}
+#endif /* ! USE_NO_SIMD */
+
#endif /* SIMD_H */
--
2.31.1
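
[Editor's note] To make the intended use of these two helpers concrete, here is a
minimal sketch (not part of the patch) of the node-search idiom they enable. It
assumes the pre-existing vector8_broadcast, vector8_load, and vector8_eq
primitives in port/simd.h and pg_rightmost_one_pos32 from port/pg_bitutils.h;
the function and variable names are made up for illustration, and the code only
applies when SIMD support is compiled in (i.e. not USE_NO_SIMD):

#include "port/pg_bitutils.h"
#include "port/simd.h"

/*
 * Hypothetical example: return the index of the first element in the sorted
 * byte array 'chunks' (at least sizeof(Vector8) bytes long) that is >= 'key',
 * or -1 if there is none.
 */
static inline int
chunk_search_ge(const uint8 *chunks, uint8 key)
{
	Vector8		spread_chunk = vector8_broadcast(key);
	Vector8		haystack;
	Vector8		min;
	Vector8		cmp;
	uint32		bitfield;

	vector8_load(&haystack, chunks);

	/* min(key, chunks[i]) == key exactly where chunks[i] >= key */
	min = vector8_min(spread_chunk, haystack);
	cmp = vector8_eq(spread_chunk, min);

	/* collapse the byte-wise comparison result into one bit per element */
	bitfield = vector8_highbit_mask(cmp);

	return (bitfield != 0) ? pg_rightmost_one_pos32(bitfield) : -1;
}

Because the movemask-style operation maps byte i to bit i, the lowest set bit of
the returned mask identifies the first qualifying array position.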
On Thu, Mar 9, 2023 at 1:51 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

> I've attached the new version patches. I merged improvements and fixes
> I did in the v29 patch.

I haven't yet had a chance to look at those closely, since I've had to
devote time to other commitments. I remember I wasn't particularly
impressed that v29-0008 mixed my requested name-casing changes with a
bunch of other random things. Separating those out would be an obvious
way to make it easier for me to look at, whenever I can get back to this.
I need to look at the iteration changes as well, in addition to testing
memory measurement (thanks for the new results, they look encouraging).

> Apart from the memory measurement stuff, I've done another todo item
> on my list; adding min max classes for node3 and node125. I've done

This didn't help move us closer to something committable the first time
you coded this without making sure it was a good idea. It's still not
helping and arguably makes it worse. To be fair, I did speak positively
about _considering_ additional size classes some months ago, but that has
a very obvious maintenance cost, something we can least afford right now.

I'm frankly baffled you thought this was important enough to work on
again, yet thought it was a waste of time to try to prove to ourselves
that autovacuum in a realistic, non-deterministic workload gave the same
answer as the current tid lookup. Even if we had gone that far, it
doesn't seem like a good idea to add non-essential code to critical paths
right now.

We're rapidly running out of time, and we're at the point in the cycle
where it's impossible to get meaningful review from anyone not already
intimately familiar with the patch series. I only want to see progress on
addressing possible (especially architectural) objections from the
community, because if they don't notice them now, they surely will after
commit. I have my own list of possible objections as well as bikeshedding
points, which I'll clean up and share next week. I plan to invite Andres
to look at that list and give his impressions, because it's a lot quicker
than reading the patches. Based on that, I'll hopefully be able to decide
whether we have enough time to address any feedback and do remaining
polishing in time for feature freeze.

I'd suggest sharing your todo list in the meanwhile; it'd be good to
discuss what's worth doing and what is not.
--
John Naylor
EDB: http://www.enterprisedb.com
On Fri, Mar 10, 2023 at 3:42 PM John Naylor
<john.naylor@enterprisedb.com> wrote:

> On Thu, Mar 9, 2023 at 1:51 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > I've attached the new version patches. I merged improvements and fixes
> > I did in the v29 patch.
>
> I haven't yet had a chance to look at those closely, since I've had to
> devote time to other commitments. I remember I wasn't particularly
> impressed that v29-0008 mixed my requested name-casing changes with a
> bunch of other random things. Separating those out would be an obvious
> way to make it easier for me to look at, whenever I can get back to
> this. I need to look at the iteration changes as well, in addition to
> testing memory measurement (thanks for the new results, they look
> encouraging).

Okay, I'll separate them again.

> > Apart from the memory measurement stuff, I've done another todo item
> > on my list; adding min max classes for node3 and node125. I've done
>
> This didn't help move us closer to something committable the first time
> you coded this without making sure it was a good idea. It's still not
> helping and arguably makes it worse. To be fair, I did speak positively
> about _considering_ additional size classes some months ago, but that
> has a very obvious maintenance cost, something we can least afford
> right now.
>
> I'm frankly baffled you thought this was important enough to work on
> again, yet thought it was a waste of time to try to prove to ourselves
> that autovacuum in a realistic, non-deterministic workload gave the
> same answer as the current tid lookup. Even if we had gone that far, it
> doesn't seem like a good idea to add non-essential code to critical
> paths right now.

I didn't think that proving tidstore and the current tid lookup return
the same result was a waste of time. I've shared a patch to do that in
tidstore before. I agreed not to add it to the tree, but we can test
that using this patch. Actually, I've done a test that ran a pgbench
workload for a few days.

IIUC it's still important to consider whether to have node1, since it
could be a good alternative to path compression. The prototype also
implemented it. Of course we can leave it for future improvement, but
considering this item together with the performance tests helps us prove
that our decoupling approach is promising.

> We're rapidly running out of time, and we're at the point in the cycle
> where it's impossible to get meaningful review from anyone not already
> intimately familiar with the patch series. I only want to see progress
> on addressing possible (especially architectural) objections from the
> community, because if they don't notice them now, they surely will
> after commit.

Right, we've been making many design decisions. Some of them are agreed
just between you and me, and some are agreed with other hackers. Some
design decisions are effectively irreversible given the remaining time.

> I have my own list of possible objections as well as bikeshedding
> points, which I'll clean up and share next week.

Thanks.

> I plan to invite Andres to look at that list and give his impressions,
> because it's a lot quicker than reading the patches. Based on that,
> I'll hopefully be able to decide whether we have enough time to address
> any feedback and do remaining polishing in time for feature freeze.
>
> I'd suggest sharing your todo list in the meanwhile; it'd be good to
> discuss what's worth doing and what is not.

Apart from more rounds of reviews and tests, my todo items that need
discussion and possibly implementation are:

* The memory measurement in radix trees and the memory limit in
tidstores. I've implemented it in v30-0007 through 0009 but we need to
review it. This is the highest priority for me.

* Additional size classes. They are important as an alternative to path
compression, as well as for supporting our decoupling approach. Medium
priority.

* Node shrinking support. Low priority.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Fri, Mar 10, 2023 at 11:30 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

> On Fri, Mar 10, 2023 at 3:42 PM John Naylor
> <john.naylor@enterprisedb.com> wrote:
> > On Thu, Mar 9, 2023 at 1:51 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > I've attached the new version patches. I merged improvements and fixes
> > > I did in the v29 patch.
> >
> > I haven't yet had a chance to look at those closely, since I've had to
> > devote time to other commitments. I remember I wasn't particularly
> > impressed that v29-0008 mixed my requested name-casing changes with a
> > bunch of other random things. Separating those out would be an obvious
> > way to make it easier for me to look at, whenever I can get back to
> > this. I need to look at the iteration changes as well, in addition to
> > testing memory measurement (thanks for the new results, they look
> > encouraging).
>
> Okay, I'll separate them again.

Attached a new patch series. In addition to separating them again, I've
fixed a conflict with HEAD.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachments:
v31-0013-Add-min-and-max-classes-for-node3-and-node125.patch
From 1b43002d25137699d0e13158d821a8550e757348 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Thu, 9 Mar 2023 11:42:17 +0900
Subject: [PATCH v31 13/14] Add min and max classes for node3 and node125.
---
src/include/lib/radixtree.h | 70 +++++++++++++------
src/include/lib/radixtree_insert_impl.h | 56 ++++++++++++++-
.../expected/test_radixtree.out | 4 ++
.../modules/test_radixtree/test_radixtree.c | 6 +-
4 files changed, 110 insertions(+), 26 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index f7812eb12a..1759c909b6 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -225,10 +225,12 @@
#define RT_SIZE_CLASS RT_MAKE_NAME(size_class)
#define RT_SIZE_CLASS_ELEM RT_MAKE_NAME(size_class_elem)
#define RT_SIZE_CLASS_INFO RT_MAKE_NAME(size_class_info)
-#define RT_CLASS_3 RT_MAKE_NAME(class_3)
+#define RT_CLASS_3_MIN RT_MAKE_NAME(class_3_min)
+#define RT_CLASS_3_MAX RT_MAKE_NAME(class_3_max)
#define RT_CLASS_32_MIN RT_MAKE_NAME(class_32_min)
#define RT_CLASS_32_MAX RT_MAKE_NAME(class_32_max)
-#define RT_CLASS_125 RT_MAKE_NAME(class_125)
+#define RT_CLASS_125_MIN RT_MAKE_NAME(class_125_min)
+#define RT_CLASS_125_MAX RT_MAKE_NAME(class_125_max)
#define RT_CLASS_256 RT_MAKE_NAME(class_256)
/* generate forward declarations necessary to use the radix tree */
@@ -561,10 +563,12 @@ typedef struct RT_NODE_LEAF_256
*/
typedef enum RT_SIZE_CLASS
{
- RT_CLASS_3 = 0,
+ RT_CLASS_3_MIN = 0,
+ RT_CLASS_3_MAX,
RT_CLASS_32_MIN,
RT_CLASS_32_MAX,
- RT_CLASS_125,
+ RT_CLASS_125_MIN,
+ RT_CLASS_125_MAX,
RT_CLASS_256
} RT_SIZE_CLASS;
@@ -580,7 +584,13 @@ typedef struct RT_SIZE_CLASS_ELEM
} RT_SIZE_CLASS_ELEM;
static const RT_SIZE_CLASS_ELEM RT_SIZE_CLASS_INFO[] = {
- [RT_CLASS_3] = {
+ [RT_CLASS_3_MIN] = {
+ .name = "radix tree node 1",
+ .fanout = 1,
+ .inner_size = sizeof(RT_NODE_INNER_3) + 1 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_3) + 1 * sizeof(RT_VALUE_TYPE),
+ },
+ [RT_CLASS_3_MAX] = {
.name = "radix tree node 3",
.fanout = 3,
.inner_size = sizeof(RT_NODE_INNER_3) + 3 * sizeof(RT_PTR_ALLOC),
@@ -598,7 +608,13 @@ static const RT_SIZE_CLASS_ELEM RT_SIZE_CLASS_INFO[] = {
.inner_size = sizeof(RT_NODE_INNER_32) + 32 * sizeof(RT_PTR_ALLOC),
.leaf_size = sizeof(RT_NODE_LEAF_32) + 32 * sizeof(RT_VALUE_TYPE),
},
- [RT_CLASS_125] = {
+ [RT_CLASS_125_MIN] = {
+ .name = "radix tree node 125",
+ .fanout = 61,
+ .inner_size = sizeof(RT_NODE_INNER_125) + 61 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_125) + 61 * sizeof(RT_VALUE_TYPE),
+ },
+ [RT_CLASS_125_MAX] = {
.name = "radix tree node 125",
.fanout = 125,
.inner_size = sizeof(RT_NODE_INNER_125) + 125 * sizeof(RT_PTR_ALLOC),
@@ -934,7 +950,7 @@ static inline void
RT_CHUNK_CHILDREN_ARRAY_COPY(uint8 *src_chunks, RT_PTR_ALLOC *src_children,
uint8 *dst_chunks, RT_PTR_ALLOC *dst_children)
{
- const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_3].fanout;
+ const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_3_MAX].fanout;
const Size chunk_size = sizeof(uint8) * fanout;
const Size children_size = sizeof(RT_PTR_ALLOC) * fanout;
@@ -946,7 +962,7 @@ static inline void
RT_CHUNK_VALUES_ARRAY_COPY(uint8 *src_chunks, RT_VALUE_TYPE *src_values,
uint8 *dst_chunks, RT_VALUE_TYPE *dst_values)
{
- const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_3].fanout;
+ const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_3_MAX].fanout;
const Size chunk_size = sizeof(uint8) * fanout;
const Size values_size = sizeof(RT_VALUE_TYPE) * fanout;
@@ -1152,9 +1168,9 @@ RT_NEW_ROOT(RT_RADIX_TREE *tree, uint64 key)
RT_PTR_ALLOC allocnode;
RT_PTR_LOCAL newnode;
- allocnode = RT_ALLOC_NODE(tree, RT_CLASS_3, is_leaf);
+ allocnode = RT_ALLOC_NODE(tree, RT_CLASS_3_MIN, is_leaf);
newnode = RT_PTR_GET_LOCAL(tree, allocnode);
- RT_INIT_NODE(newnode, RT_NODE_KIND_3, RT_CLASS_3, is_leaf);
+ RT_INIT_NODE(newnode, RT_NODE_KIND_3, RT_CLASS_3_MIN, is_leaf);
newnode->shift = shift;
tree->ctl->max_val = RT_SHIFT_GET_MAX_VAL(shift);
tree->ctl->root = allocnode;
@@ -1188,17 +1204,21 @@ static inline Size
RT_FANOUT_GET_NODE_SIZE(int fanout, bool is_leaf)
{
const Size fanout_inner_node_size[] = {
- [3] = RT_SIZE_CLASS_INFO[RT_CLASS_3].inner_size,
+ [1] = RT_SIZE_CLASS_INFO[RT_CLASS_3_MIN].inner_size,
+ [3] = RT_SIZE_CLASS_INFO[RT_CLASS_3_MAX].inner_size,
[15] = RT_SIZE_CLASS_INFO[RT_CLASS_32_MIN].inner_size,
[32] = RT_SIZE_CLASS_INFO[RT_CLASS_32_MAX].inner_size,
- [125] = RT_SIZE_CLASS_INFO[RT_CLASS_125].inner_size,
+ [61] = RT_SIZE_CLASS_INFO[RT_CLASS_125_MIN].inner_size,
+ [125] = RT_SIZE_CLASS_INFO[RT_CLASS_125_MAX].inner_size,
[256] = RT_SIZE_CLASS_INFO[RT_CLASS_256].inner_size,
};
const Size fanout_leaf_node_size[] = {
- [3] = RT_SIZE_CLASS_INFO[RT_CLASS_3].leaf_size,
+ [1] = RT_SIZE_CLASS_INFO[RT_CLASS_3_MIN].leaf_size,
+ [3] = RT_SIZE_CLASS_INFO[RT_CLASS_3_MAX].leaf_size,
[15] = RT_SIZE_CLASS_INFO[RT_CLASS_32_MIN].leaf_size,
[32] = RT_SIZE_CLASS_INFO[RT_CLASS_32_MAX].leaf_size,
- [125] = RT_SIZE_CLASS_INFO[RT_CLASS_125].leaf_size,
+ [61] = RT_SIZE_CLASS_INFO[RT_CLASS_125_MIN].leaf_size,
+ [125] = RT_SIZE_CLASS_INFO[RT_CLASS_125_MAX].leaf_size,
[256] = RT_SIZE_CLASS_INFO[RT_CLASS_256].leaf_size,
};
Size node_size;
@@ -1337,9 +1357,9 @@ RT_EXTEND_UP(RT_RADIX_TREE *tree, uint64 key)
RT_PTR_LOCAL node;
RT_NODE_INNER_3 *n3;
- allocnode = RT_ALLOC_NODE(tree, RT_CLASS_3, true);
+ allocnode = RT_ALLOC_NODE(tree, RT_CLASS_3_MIN, true);
node = RT_PTR_GET_LOCAL(tree, allocnode);
- RT_INIT_NODE(node, RT_NODE_KIND_3, RT_CLASS_3, true);
+ RT_INIT_NODE(node, RT_NODE_KIND_3, RT_CLASS_3_MIN, true);
node->shift = shift;
node->count = 1;
@@ -1375,9 +1395,9 @@ RT_EXTEND_DOWN(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p, RT_PTR_L
int newshift = shift - RT_NODE_SPAN;
bool is_leaf = newshift == 0;
- allocchild = RT_ALLOC_NODE(tree, RT_CLASS_3, is_leaf);
+ allocchild = RT_ALLOC_NODE(tree, RT_CLASS_3_MIN, is_leaf);
newchild = RT_PTR_GET_LOCAL(tree, allocchild);
- RT_INIT_NODE(newchild, RT_NODE_KIND_3, RT_CLASS_3, is_leaf);
+ RT_INIT_NODE(newchild, RT_NODE_KIND_3, RT_CLASS_3_MIN, is_leaf);
newchild->shift = newshift;
RT_NODE_INSERT_INNER(tree, parent, stored_node, node, key, allocchild);
@@ -2177,12 +2197,14 @@ RT_STATS(RT_RADIX_TREE *tree)
{
RT_PTR_LOCAL root = RT_PTR_GET_LOCAL(tree, tree->ctl->root);
- fprintf(stderr, "height = %d, n3 = %u, n15 = %u, n32 = %u, n125 = %u, n256 = %u\n",
+ fprintf(stderr, "height = %d, n1 = %u, n3 = %u, n15 = %u, n32 = %u, n61 = %u, n125 = %u, n256 = %u\n",
root->shift / RT_NODE_SPAN,
- tree->ctl->cnt[RT_CLASS_3],
+ tree->ctl->cnt[RT_CLASS_3_MIN],
+ tree->ctl->cnt[RT_CLASS_3_MAX],
tree->ctl->cnt[RT_CLASS_32_MIN],
tree->ctl->cnt[RT_CLASS_32_MAX],
- tree->ctl->cnt[RT_CLASS_125],
+ tree->ctl->cnt[RT_CLASS_125_MIN],
+ tree->ctl->cnt[RT_CLASS_125_MAX],
tree->ctl->cnt[RT_CLASS_256]);
}
@@ -2519,10 +2541,12 @@ RT_DUMP(RT_RADIX_TREE *tree)
#undef RT_SIZE_CLASS
#undef RT_SIZE_CLASS_ELEM
#undef RT_SIZE_CLASS_INFO
-#undef RT_CLASS_3
+#undef RT_CLASS_3_MIN
+#undef RT_CLASS_3_MAX
#undef RT_CLASS_32_MIN
#undef RT_CLASS_32_MAX
-#undef RT_CLASS_125
+#undef RT_CLASS_125_MIN
+#undef RT_CLASS_125_MAX
#undef RT_CLASS_256
/* function declarations */
diff --git a/src/include/lib/radixtree_insert_impl.h b/src/include/lib/radixtree_insert_impl.h
index d56e58dcac..d10093dfba 100644
--- a/src/include/lib/radixtree_insert_impl.h
+++ b/src/include/lib/radixtree_insert_impl.h
@@ -42,6 +42,7 @@
{
case RT_NODE_KIND_3:
{
+ const RT_SIZE_CLASS_ELEM class3_max = RT_SIZE_CLASS_INFO[RT_CLASS_3_MAX];
RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node;
#ifdef RT_NODE_LEVEL_LEAF
@@ -55,6 +56,32 @@
break;
}
#endif
+ if (unlikely(RT_NODE_MUST_GROW(n3)) &&
+ n3->base.n.fanout < class3_max.fanout)
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+ const RT_SIZE_CLASS_ELEM class3_min = RT_SIZE_CLASS_INFO[RT_CLASS_3_MIN];
+ const RT_SIZE_CLASS new_class = RT_CLASS_3_MAX;
+
+ Assert(n3->base.n.fanout == class3_min.fanout);
+
+ /* grow to the next size class of this kind */
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ n3 = (RT_NODE3_TYPE *) newnode;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ memcpy(newnode, node, class3_min.leaf_size);
+#else
+ memcpy(newnode, node, class3_min.inner_size);
+#endif
+ newnode->fanout = class3_max.fanout;
+
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
+ node = newnode;
+ }
+
if (unlikely(RT_NODE_MUST_GROW(n3)))
{
RT_PTR_ALLOC allocnode;
@@ -154,7 +181,7 @@
RT_PTR_LOCAL newnode;
RT_NODE125_TYPE *new125;
const uint8 new_kind = RT_NODE_KIND_125;
- const RT_SIZE_CLASS new_class = RT_CLASS_125;
+ const RT_SIZE_CLASS new_class = RT_CLASS_125_MIN;
Assert(n32->base.n.fanout == class32_max.fanout);
@@ -213,6 +240,7 @@
/* FALLTHROUGH */
case RT_NODE_KIND_125:
{
+ const RT_SIZE_CLASS_ELEM class125_max = RT_SIZE_CLASS_INFO[RT_CLASS_125_MAX];
RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
int slotpos;
int cnt = 0;
@@ -227,6 +255,32 @@
break;
}
#endif
+ if (unlikely(RT_NODE_MUST_GROW(n125)) &&
+ n125->base.n.fanout < class125_max.fanout)
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+ const RT_SIZE_CLASS_ELEM class125_min = RT_SIZE_CLASS_INFO[RT_CLASS_125_MIN];
+ const RT_SIZE_CLASS new_class = RT_CLASS_125_MAX;
+
+ Assert(n125->base.n.fanout == class125_min.fanout);
+
+ /* grow to the next size class of this kind */
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ n125 = (RT_NODE125_TYPE *) newnode;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ memcpy(newnode, node, class125_min.leaf_size);
+#else
+ memcpy(newnode, node, class125_min.inner_size);
+#endif
+ newnode->fanout = class125_max.fanout;
+
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
+ node = newnode;
+ }
+
if (unlikely(RT_NODE_MUST_GROW(n125)))
{
RT_PTR_ALLOC allocnode;
diff --git a/src/test/modules/test_radixtree/expected/test_radixtree.out b/src/test/modules/test_radixtree/expected/test_radixtree.out
index 7ad1ce3605..f2b1d7e4f8 100644
--- a/src/test/modules/test_radixtree/expected/test_radixtree.out
+++ b/src/test/modules/test_radixtree/expected/test_radixtree.out
@@ -4,12 +4,16 @@ CREATE EXTENSION test_radixtree;
-- an error if something fails.
--
SELECT test_radixtree();
+NOTICE: testing basic operations with leaf node 1
+NOTICE: testing basic operations with inner node 1
NOTICE: testing basic operations with leaf node 3
NOTICE: testing basic operations with inner node 3
NOTICE: testing basic operations with leaf node 15
NOTICE: testing basic operations with inner node 15
NOTICE: testing basic operations with leaf node 32
NOTICE: testing basic operations with inner node 32
+NOTICE: testing basic operations with leaf node 61
+NOTICE: testing basic operations with inner node 61
NOTICE: testing basic operations with leaf node 125
NOTICE: testing basic operations with inner node 125
NOTICE: testing basic operations with leaf node 256
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
index 19d286d84b..4f38b6e3de 100644
--- a/src/test/modules/test_radixtree/test_radixtree.c
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -47,10 +47,12 @@ static const bool rt_test_stats = false;
* XXX: should we expose and use RT_SIZE_CLASS and RT_SIZE_CLASS_INFO?
*/
static int rt_node_class_fanouts[] = {
- 3, /* RT_CLASS_3 */
+ 1, /* RT_CLASS_3_MIN */
+ 3, /* RT_CLASS_3_MAX */
15, /* RT_CLASS_32_MIN */
32, /* RT_CLASS_32_MAX */
- 125, /* RT_CLASS_125 */
+ 61, /* RT_CLASS_125_MIN */
+ 125, /* RT_CLASS_125_MAX */
256 /* RT_CLASS_256 */
};
/*
--
2.31.1
v31-0011-Remove-the-max-memory-deduction-from-TidStore.patch
From e86e43b93fb901aacd8d2b69aa53ad896c5b5e1c Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 8 Mar 2023 15:08:58 +0900
Subject: [PATCH v31 11/14] Remove the max memory deduction from TidStore.
---
src/backend/access/common/tidstore.c | 43 +++++++---------------------
1 file changed, 10 insertions(+), 33 deletions(-)
diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
index 9360520482..ee73759648 100644
--- a/src/backend/access/common/tidstore.c
+++ b/src/backend/access/common/tidstore.c
@@ -82,6 +82,7 @@ typedef uint64 offsetbm;
#define RT_SCOPE static
#define RT_DECLARE
#define RT_DEFINE
+#define RT_MEASURE_MEMORY_USAGE
#define RT_VALUE_TYPE tidkey
#include "lib/radixtree.h"
@@ -90,6 +91,7 @@ typedef uint64 offsetbm;
#define RT_SCOPE static
#define RT_DECLARE
#define RT_DEFINE
+#define RT_MEASURE_MEMORY_USAGE
#define RT_VALUE_TYPE tidkey
#include "lib/radixtree.h"
@@ -180,39 +182,15 @@ TidStoreCreate(size_t max_bytes, int max_off, dsa_area *area)
ts = palloc0(sizeof(TidStore));
- /*
- * Create the radix tree for the main storage.
- *
- * Memory consumption depends on the number of stored tids, but also on the
- * distribution of them, how the radix tree stores, and the memory management
- * that backed the radix tree. The maximum bytes that a TidStore can
- * use is specified by the max_bytes in TidStoreCreate(). We want the total
- * amount of memory consumption by a TidStore not to exceed the max_bytes.
- *
- * In local TidStore cases, the radix tree uses slab allocators for each kind
- * of node class. The most memory consuming case while adding Tids associated
- * with one page (i.e. during TidStoreSetBlockOffsets()) is that we allocate a new
- * slab block for a new radix tree node, which is approximately 70kB. Therefore,
- * we deduct 70kB from the max_bytes.
- *
- * In shared cases, DSA allocates the memory segments big enough to follow
- * a geometric series that approximately doubles the total DSA size (see
- * make_new_segment() in dsa.c). We simulated the how DSA increases segment
- * size and the simulation revealed, the 75% threshold for the maximum bytes
- * perfectly works in case where the max_bytes is a power-of-2, and the 60%
- * threshold works for other cases.
- */
if (area != NULL)
{
dsa_pointer dp;
- float ratio = ((max_bytes & (max_bytes - 1)) == 0) ? 0.75 : 0.6;
ts->tree.shared = shared_rt_create(CurrentMemoryContext, area,
LWTRANCHE_SHARED_TIDSTORE);
dp = dsa_allocate0(area, sizeof(TidStoreControl));
ts->control = (TidStoreControl *) dsa_get_address(area, dp);
- ts->control->max_bytes = (size_t) (max_bytes * ratio);
ts->area = area;
ts->control->magic = TIDSTORE_MAGIC;
@@ -223,11 +201,15 @@ TidStoreCreate(size_t max_bytes, int max_off, dsa_area *area)
else
{
ts->tree.local = local_rt_create(CurrentMemoryContext);
-
ts->control = (TidStoreControl *) palloc0(sizeof(TidStoreControl));
- ts->control->max_bytes = max_bytes - (70 * 1024);
}
+ /*
+ * max_bytes is forced to be at least 64KB, the current minimum valid value
+ * for the work_mem GUC.
+ */
+ ts->control->max_bytes = Max(64 * 1024L, max_bytes);
+
ts->control->max_off = max_off;
ts->control->max_off_nbits = pg_ceil_log2_32(max_off);
@@ -331,14 +313,8 @@ TidStoreReset(TidStore *ts)
LWLockAcquire(&ts->control->lock, LW_EXCLUSIVE);
- /*
- * Free the radix tree and return allocated DSA segments to
- * the operating system.
- */
- shared_rt_free(ts->tree.shared);
- dsa_trim(ts->area);
-
/* Recreate the radix tree */
+ shared_rt_free(ts->tree.shared);
ts->tree.shared = shared_rt_create(CurrentMemoryContext, ts->area,
LWTRANCHE_SHARED_TIDSTORE);
@@ -352,6 +328,7 @@ TidStoreReset(TidStore *ts)
}
else
{
+ /* Recreate the radix tree */
local_rt_free(ts->tree.local);
ts->tree.local = local_rt_create(CurrentMemoryContext);
--
2.31.1
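
[Editor's note] For readers skimming the diffs, here is a hedged sketch (not
taken from the patches) of how a caller is expected to enforce the memory limit
after this change: the TidStore measures its own usage through the radix tree's
RT_MEASURE_MEMORY_USAGE support, and the caller simply checks fullness and
resets. The function names follow the TidStore API visible in v31-0009
(TidStoreSetBlockOffsets, TidStoreIsFull, TidStoreReset); the header path and
the helper function name are assumptions made for illustration:

#include "postgres.h"
#include "access/tidstore.h"	/* header path assumed */

/*
 * Hypothetical helper: record one heap page's dead offsets.  Once the store
 * reports that its max_bytes budget has been reached, the caller is expected
 * to process the collected TIDs (index vacuuming in the vacuum case) and
 * reset the store before continuing the heap scan.
 */
static void
record_dead_offsets(TidStore *dead_items, BlockNumber blkno,
					OffsetNumber *deadoffsets, int num_offsets)
{
	if (num_offsets > 0)
		TidStoreSetBlockOffsets(dead_items, blkno, deadoffsets, num_offsets);

	if (TidStoreIsFull(dead_items))
	{
		/* ... run an index vacuum cycle here (see lazy_vacuum()) ... */
		TidStoreReset(dead_items);
	}
}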
v31-0009-Review-vacuum-integration.patch
From c1c126e0f4e9f5eeb642bd892bd40948a41b8aae Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 17 Feb 2023 00:04:37 +0900
Subject: [PATCH v31 09/14] Review vacuum integration.
---
doc/src/sgml/monitoring.sgml | 2 +-
src/backend/access/heap/vacuumlazy.c | 61 +++++++++++++--------------
src/backend/commands/vacuum.c | 4 +-
src/backend/commands/vacuumparallel.c | 25 +++++------
4 files changed, 46 insertions(+), 46 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 47b346d36c..61e163636a 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -7181,7 +7181,7 @@ FROM pg_stat_get_backend_idset() AS backendid;
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>num_dead_tuple_bytes</structfield> <type>bigint</type>
+ <structfield>dead_tuple_bytes</structfield> <type>bigint</type>
</para>
<para>
Amount of dead tuple data collected since the last index vacuum cycle.
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index b4e40423a8..edb9079124 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -10,11 +10,10 @@
* of dead TIDs at once.
*
* We are willing to use at most maintenance_work_mem (or perhaps
- * autovacuum_work_mem) memory space to keep track of dead TIDs. We initially
- * create a TidStore with the maximum bytes that can be used by the TidStore.
- * If the TidStore is full, we must call lazy_vacuum to vacuum indexes (and to
- * vacuum the pages that we've pruned). This frees up the memory space dedicated
- * to storing dead TIDs.
+ * autovacuum_work_mem) memory space to keep track of dead TIDs. If the
+ * TidStore is full, we must call lazy_vacuum to vacuum indexes (and to vacuum
+ * the pages that we've pruned). This frees up the memory space dedicated to
+ * storing dead TIDs.
*
* In practice VACUUM will often complete its initial pass over the target
* heap relation without ever running out of space to store TIDs. This means
@@ -844,7 +843,7 @@ lazy_scan_heap(LVRelState *vacrel)
/* Report that we're scanning the heap, advertising total # of blocks */
initprog_val[0] = PROGRESS_VACUUM_PHASE_SCAN_HEAP;
initprog_val[1] = rel_pages;
- initprog_val[2] = tidstore_max_memory(vacrel->dead_items);
+ initprog_val[2] = TidStoreMaxMemory(vacrel->dead_items);
pgstat_progress_update_multi_param(3, initprog_index, initprog_val);
/* Set up an initial range of skippable blocks using the visibility map */
@@ -911,7 +910,7 @@ lazy_scan_heap(LVRelState *vacrel)
* dead_items TIDs, pause and do a cycle of vacuuming before we tackle
* this page.
*/
- if (tidstore_is_full(vacrel->dead_items))
+ if (TidStoreIsFull(vacrel->dead_items))
{
/*
* Before beginning index vacuuming, we release any pin we may
@@ -1080,16 +1079,16 @@ lazy_scan_heap(LVRelState *vacrel)
* with prunestate-driven visibility map and FSM steps (just like
* the two-pass strategy).
*/
- Assert(tidstore_num_tids(dead_items) == 0);
+ Assert(TidStoreNumTids(dead_items) == 0);
}
else if (prunestate.num_offsets > 0)
{
/* Save details of the LP_DEAD items from the page in dead_items */
- tidstore_add_tids(dead_items, blkno, prunestate.deadoffsets,
- prunestate.num_offsets);
+ TidStoreSetBlockOffsets(dead_items, blkno, prunestate.deadoffsets,
+ prunestate.num_offsets);
pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
- tidstore_memory_usage(dead_items));
+ TidStoreMemoryUsage(dead_items));
}
/*
@@ -1260,7 +1259,7 @@ lazy_scan_heap(LVRelState *vacrel)
* Do index vacuuming (call each index's ambulkdelete routine), then do
* related heap vacuuming
*/
- if (tidstore_num_tids(dead_items) > 0)
+ if (TidStoreNumTids(dead_items) > 0)
lazy_vacuum(vacrel);
/*
@@ -2127,10 +2126,10 @@ lazy_scan_noprune(LVRelState *vacrel,
*/
vacrel->lpdead_item_pages++;
- tidstore_add_tids(dead_items, blkno, deadoffsets, lpdead_items);
+ TidStoreSetBlockOffsets(dead_items, blkno, deadoffsets, lpdead_items);
pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
- tidstore_memory_usage(dead_items));
+ TidStoreMemoryUsage(dead_items));
vacrel->lpdead_items += lpdead_items;
@@ -2179,7 +2178,7 @@ lazy_vacuum(LVRelState *vacrel)
if (!vacrel->do_index_vacuuming)
{
Assert(!vacrel->do_index_cleanup);
- tidstore_reset(vacrel->dead_items);
+ TidStoreReset(vacrel->dead_items);
return;
}
@@ -2208,7 +2207,7 @@ lazy_vacuum(LVRelState *vacrel)
BlockNumber threshold;
Assert(vacrel->num_index_scans == 0);
- Assert(vacrel->lpdead_items == tidstore_num_tids(vacrel->dead_items));
+ Assert(vacrel->lpdead_items == TidStoreNumTids(vacrel->dead_items));
Assert(vacrel->do_index_vacuuming);
Assert(vacrel->do_index_cleanup);
@@ -2236,7 +2235,7 @@ lazy_vacuum(LVRelState *vacrel)
*/
threshold = (double) vacrel->rel_pages * BYPASS_THRESHOLD_PAGES;
bypass = (vacrel->lpdead_item_pages < threshold) &&
- tidstore_memory_usage(vacrel->dead_items) < (32L * 1024L * 1024L);
+ TidStoreMemoryUsage(vacrel->dead_items) < (32L * 1024L * 1024L);
}
if (bypass)
@@ -2281,7 +2280,7 @@ lazy_vacuum(LVRelState *vacrel)
* Forget the LP_DEAD items that we just vacuumed (or just decided to not
* vacuum)
*/
- tidstore_reset(vacrel->dead_items);
+ TidStoreReset(vacrel->dead_items);
}
/*
@@ -2354,7 +2353,7 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
* place).
*/
Assert(vacrel->num_index_scans > 0 ||
- tidstore_num_tids(vacrel->dead_items) == vacrel->lpdead_items);
+ TidStoreNumTids(vacrel->dead_items) == vacrel->lpdead_items);
Assert(allindexes || vacrel->failsafe_active);
/*
@@ -2394,7 +2393,7 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
Buffer vmbuffer = InvalidBuffer;
LVSavedErrInfo saved_err_info;
TidStoreIter *iter;
- TidStoreIterResult *result;
+ TidStoreIterResult *iter_result;
Assert(vacrel->do_index_vacuuming);
Assert(vacrel->do_index_cleanup);
@@ -2409,8 +2408,8 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
VACUUM_ERRCB_PHASE_VACUUM_HEAP,
InvalidBlockNumber, InvalidOffsetNumber);
- iter = tidstore_begin_iterate(vacrel->dead_items);
- while ((result = tidstore_iterate_next(iter)) != NULL)
+ iter = TidStoreBeginIterate(vacrel->dead_items);
+ while ((iter_result = TidStoreIterateNext(iter)) != NULL)
{
BlockNumber blkno;
Buffer buf;
@@ -2419,7 +2418,7 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
vacuum_delay_point();
- blkno = result->blkno;
+ blkno = iter_result->blkno;
vacrel->blkno = blkno;
/*
@@ -2433,8 +2432,8 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
buf = ReadBufferExtended(vacrel->rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
vacrel->bstrategy);
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
- lazy_vacuum_heap_page(vacrel, blkno, result->offsets, result->num_offsets,
- buf, vmbuffer);
+ lazy_vacuum_heap_page(vacrel, blkno, iter_result->offsets,
+ iter_result->num_offsets, buf, vmbuffer);
/* Now that we've vacuumed the page, record its available space */
page = BufferGetPage(buf);
@@ -2444,7 +2443,7 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
vacuumed_pages++;
}
- tidstore_end_iterate(iter);
+ TidStoreEndIterate(iter);
vacrel->blkno = InvalidBlockNumber;
if (BufferIsValid(vmbuffer))
@@ -2455,12 +2454,12 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
* the second heap pass. No more, no less.
*/
Assert(vacrel->num_index_scans > 1 ||
- (tidstore_num_tids(vacrel->dead_items) == vacrel->lpdead_items &&
+ (TidStoreNumTids(vacrel->dead_items) == vacrel->lpdead_items &&
vacuumed_pages == vacrel->lpdead_item_pages));
ereport(DEBUG2,
- (errmsg("table \"%s\": removed " UINT64_FORMAT "dead item identifiers in %u pages",
- vacrel->relname, tidstore_num_tids(vacrel->dead_items),
+ (errmsg("table \"%s\": removed " INT64_FORMAT "dead item identifiers in %u pages",
+ vacrel->relname, TidStoreNumTids(vacrel->dead_items),
vacuumed_pages)));
/* Revert to the previous phase information for error traceback */
@@ -3118,8 +3117,8 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
}
/* Serial VACUUM case */
- vacrel->dead_items = tidstore_create(vac_work_mem, MaxHeapTuplesPerPage,
- NULL);
+ vacrel->dead_items = TidStoreCreate(vac_work_mem, MaxHeapTuplesPerPage,
+ NULL);
}
/*
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 785b825bbc..afedb87941 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -2335,7 +2335,7 @@ vac_bulkdel_one_index(IndexVacuumInfo *ivinfo, IndexBulkDeleteResult *istat,
ereport(ivinfo->message_level,
(errmsg("scanned index \"%s\" to remove " UINT64_FORMAT " row versions",
RelationGetRelationName(ivinfo->index),
- tidstore_num_tids(dead_items))));
+ TidStoreNumTids(dead_items))));
return istat;
}
@@ -2376,5 +2376,5 @@ vac_tid_reaped(ItemPointer itemptr, void *state)
{
TidStore *dead_items = (TidStore *) state;
- return tidstore_lookup_tid(dead_items, itemptr);
+ return TidStoreIsMember(dead_items, itemptr);
}
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index d653683693..9225daf3ab 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -9,11 +9,12 @@
* In a parallel vacuum, we perform both index bulk deletion and index cleanup
* with parallel worker processes. Individual indexes are processed by one
* vacuum process. ParallelVacuumState contains shared information as well as
- * the shared TidStore. We launch parallel worker processes at the start of
- * parallel index bulk-deletion and index cleanup and once all indexes are
- * processed, the parallel worker processes exit. Each time we process indexes
- * in parallel, the parallel context is re-initialized so that the same DSM can
- * be used for multiple passes of index bulk-deletion and index cleanup.
+ * the memory space for storing dead items allocated in the DSA area. We
+ * launch parallel worker processes at the start of parallel index
+ * bulk-deletion and index cleanup and once all indexes are processed, the
+ * parallel worker processes exit. Each time we process indexes in parallel,
+ * the parallel context is re-initialized so that the same DSM can be used for
+ * multiple passes of index bulk-deletion and index cleanup.
*
* Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
@@ -104,7 +105,7 @@ typedef struct PVShared
pg_atomic_uint32 idx;
/* Handle of the shared TidStore */
- tidstore_handle dead_items_handle;
+ TidStoreHandle dead_items_handle;
} PVShared;
/* Status used during parallel index vacuum or cleanup */
@@ -289,7 +290,7 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_estimate_chunk(&pcxt->estimator, est_shared_len);
shm_toc_estimate_keys(&pcxt->estimator, 1);
- /* Estimate size for dead tuple DSA -- PARALLEL_VACUUM_KEY_DSA */
+ /* Initial size for dead tuple DSA -- PARALLEL_VACUUM_KEY_DEAD_ITEMS */
shm_toc_estimate_chunk(&pcxt->estimator, dsa_minsize);
shm_toc_estimate_keys(&pcxt->estimator, 1);
@@ -362,7 +363,7 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
dead_items_dsa = dsa_create_in_place(area_space, dsa_minsize,
LWTRANCHE_PARALLEL_VACUUM_DSA,
pcxt->seg);
- dead_items = tidstore_create(vac_work_mem, max_offset, dead_items_dsa);
+ dead_items = TidStoreCreate(vac_work_mem, max_offset, dead_items_dsa);
pvs->dead_items = dead_items;
pvs->dead_items_area = dead_items_dsa;
@@ -375,7 +376,7 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
(nindexes_mwm > 0) ?
maintenance_work_mem / Min(parallel_workers, nindexes_mwm) :
maintenance_work_mem;
- shared->dead_items_handle = tidstore_get_handle(dead_items);
+ shared->dead_items_handle = TidStoreGetHandle(dead_items);
pg_atomic_init_u32(&(shared->cost_balance), 0);
pg_atomic_init_u32(&(shared->active_nworkers), 0);
@@ -441,7 +442,7 @@ parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats)
istats[i] = NULL;
}
- tidstore_destroy(pvs->dead_items);
+ TidStoreDestroy(pvs->dead_items);
dsa_detach(pvs->dead_items_area);
DestroyParallelContext(pvs->pcxt);
@@ -999,7 +1000,7 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
/* Set dead items */
area_space = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_DEAD_ITEMS, false);
dead_items_area = dsa_attach_in_place(area_space, seg);
- dead_items = tidstore_attach(dead_items_area, shared->dead_items_handle);
+ dead_items = TidStoreAttach(dead_items_area, shared->dead_items_handle);
/* Set cost-based vacuum delay */
VacuumCostActive = (VacuumCostDelay > 0);
@@ -1045,7 +1046,7 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
&wal_usage[ParallelWorkerNumber]);
- tidstore_detach(pvs.dead_items);
+ TidStoreDetach(dead_items);
dsa_detach(dead_items_area);
/* Pop the error context stack */
--
2.31.1
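
For reviewers who want to see the renamed API in one place: below is a minimal
sketch (not part of the patch set) of the TidStore lifecycle that the call
sites above follow. It assumes the signatures match the usage visible in the
hunks; the helper name, the memory budget, and the block/offset values are
made up for illustration only.

#include "postgres.h"

#include "access/htup_details.h"
#include "access/tidstore.h"
#include "storage/itemptr.h"

/* Hypothetical helper, not part of the patches: exercises the renamed API. */
static void
tidstore_usage_sketch(void)
{
	TidStore   *dead_items;
	TidStoreIter *iter;
	TidStoreIterResult *iter_result;
	OffsetNumber offsets[2] = {1, 2};
	ItemPointerData tid;

	/* Serial case: no DSA area (cf. dead_items_alloc()); budget is arbitrary here */
	dead_items = TidStoreCreate(1024L * 1024L * 1024L, MaxHeapTuplesPerPage, NULL);

	/* Heap scan: remember the LP_DEAD offsets collected from block 10 */
	TidStoreSetBlockOffsets(dead_items, (BlockNumber) 10, offsets, 2);

	/* Index vacuuming: membership check per index tuple (cf. vac_tid_reaped()) */
	ItemPointerSet(&tid, 10, 1);
	if (TidStoreIsMember(dead_items, &tid))
		elog(DEBUG1, "(10,1) is recorded as dead");

	/* Second heap pass: iterate in block order (cf. lazy_vacuum_heap_rel()) */
	iter = TidStoreBeginIterate(dead_items);
	while ((iter_result = TidStoreIterateNext(iter)) != NULL)
		elog(DEBUG1, "block %u has %d dead offsets",
			 iter_result->blkno, iter_result->num_offsets);
	TidStoreEndIterate(iter);

	/* Forget everything before the next heap-scan round, then clean up */
	TidStoreReset(dead_items);
	TidStoreDestroy(dead_items);
}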
Attachment: v31-0007-Review-radix-tree.patch (application/octet-stream)
From 2c280fb3697501c70e4ce43808e3a5175bbc5eb2 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 20 Feb 2023 11:28:50 +0900
Subject: [PATCH v31 07/14] Review radix tree.
Mainly improves the iteration code and comments.
---
src/include/lib/radixtree.h | 169 +++++++++---------
src/include/lib/radixtree_iter_impl.h | 85 ++++-----
.../expected/test_radixtree.out | 6 +-
.../modules/test_radixtree/test_radixtree.c | 103 +++++++----
4 files changed, 197 insertions(+), 166 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index e546bd705c..8bea606c62 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -83,7 +83,7 @@
* RT_SET - Set a key-value pair
* RT_BEGIN_ITERATE - Begin iterating through all key-value pairs
* RT_ITERATE_NEXT - Return next key-value pair, if any
- * RT_END_ITER - End iteration
+ * RT_END_ITERATE - End iteration
* RT_MEMORY_USAGE - Get the memory usage
*
* Interface for Shared Memory
@@ -152,8 +152,8 @@
#define RT_INIT_NODE RT_MAKE_NAME(init_node)
#define RT_FREE_NODE RT_MAKE_NAME(free_node)
#define RT_FREE_RECURSE RT_MAKE_NAME(free_recurse)
-#define RT_EXTEND RT_MAKE_NAME(extend)
-#define RT_SET_EXTEND RT_MAKE_NAME(set_extend)
+#define RT_EXTEND_UP RT_MAKE_NAME(extend_up)
+#define RT_EXTEND_DOWN RT_MAKE_NAME(extend_down)
#define RT_SWITCH_NODE_KIND RT_MAKE_NAME(grow_node_kind)
#define RT_COPY_NODE RT_MAKE_NAME(copy_node)
#define RT_REPLACE_NODE RT_MAKE_NAME(replace_node)
@@ -191,7 +191,7 @@
#define RT_NODE_INSERT_LEAF RT_MAKE_NAME(node_insert_leaf)
#define RT_NODE_INNER_ITERATE_NEXT RT_MAKE_NAME(node_inner_iterate_next)
#define RT_NODE_LEAF_ITERATE_NEXT RT_MAKE_NAME(node_leaf_iterate_next)
-#define RT_UPDATE_ITER_STACK RT_MAKE_NAME(update_iter_stack)
+#define RT_ITER_SET_NODE_FROM RT_MAKE_NAME(iter_set_node_from)
#define RT_ITER_UPDATE_KEY RT_MAKE_NAME(iter_update_key)
#define RT_VERIFY_NODE RT_MAKE_NAME(verify_node)
@@ -612,7 +612,6 @@ static const RT_SIZE_CLASS_ELEM RT_SIZE_CLASS_INFO[] = {
#endif
/* Contains the actual tree and ancillary info */
-// WIP: this name is a bit strange
typedef struct RT_RADIX_TREE_CONTROL
{
#ifdef RT_SHMEM
@@ -651,36 +650,40 @@ typedef struct RT_RADIX_TREE
* Iteration support.
*
* Iterating the radix tree returns each pair of key and value in the ascending
- * order of the key. To support this, the we iterate nodes of each level.
+ * order of the key.
*
- * RT_NODE_ITER struct is used to track the iteration within a node.
+ * RT_NODE_ITER is the struct for iteration of one radix tree node.
*
* RT_ITER is the struct for iteration of the radix tree, and uses RT_NODE_ITER
- * in order to track the iteration of each level. During iteration, we also
- * construct the key whenever updating the node iteration information, e.g., when
- * advancing the current index within the node or when moving to the next node
- * at the same level.
- *
- * XXX: Currently we allow only one process to do iteration. Therefore, rt_node_iter
- * has the local pointers to nodes, rather than RT_PTR_ALLOC.
- * We need either a safeguard to disallow other processes to begin the iteration
- * while one process is doing or to allow multiple processes to do the iteration.
+ * for each level to track the iteration within the node.
*/
typedef struct RT_NODE_ITER
{
- RT_PTR_LOCAL node; /* current node being iterated */
- int current_idx; /* current position. -1 for initial value */
+ /*
+ * Local pointer to the node we are iterating over.
+ *
+ * Since the radix tree doesn't support the shared iteration among multiple
+ * processes, we use RT_PTR_LOCAL rather than RT_PTR_ALLOC.
+ */
+ RT_PTR_LOCAL node;
+
+ /*
+ * The next index of the chunk array in RT_NODE_KIND_3 and
+ * RT_NODE_KIND_32 nodes, or the next chunk in RT_NODE_KIND_125 and
+ * RT_NODE_KIND_256 nodes. 0 for the initial value.
+ */
+ int idx;
} RT_NODE_ITER;
typedef struct RT_ITER
{
RT_RADIX_TREE *tree;
- /* Track the iteration on nodes of each level */
- RT_NODE_ITER stack[RT_MAX_LEVEL];
- int stack_len;
+ /* Track the nodes for each level. level = 0 is for a leaf node */
+ RT_NODE_ITER node_iters[RT_MAX_LEVEL];
+ int top_level;
- /* The key is constructed during iteration */
+ /* The key constructed during the iteration */
uint64 key;
} RT_ITER;
@@ -1243,7 +1246,7 @@ RT_REPLACE_NODE(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent,
* it can store the key.
*/
static pg_noinline void
-RT_EXTEND(RT_RADIX_TREE *tree, uint64 key)
+RT_EXTEND_UP(RT_RADIX_TREE *tree, uint64 key)
{
int target_shift;
RT_PTR_LOCAL root = RT_PTR_GET_LOCAL(tree, tree->ctl->root);
@@ -1282,7 +1285,7 @@ RT_EXTEND(RT_RADIX_TREE *tree, uint64 key)
* Insert inner and leaf nodes from 'node' to bottom.
*/
static pg_noinline void
-RT_SET_EXTEND(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p, RT_PTR_LOCAL parent,
+RT_EXTEND_DOWN(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p, RT_PTR_LOCAL parent,
RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node)
{
int shift = node->shift;
@@ -1613,7 +1616,7 @@ RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p)
/* Extend the tree if necessary */
if (key > tree->ctl->max_val)
- RT_EXTEND(tree, key);
+ RT_EXTEND_UP(tree, key);
stored_child = tree->ctl->root;
parent = RT_PTR_GET_LOCAL(tree, stored_child);
@@ -1631,7 +1634,7 @@ RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p)
if (!RT_NODE_SEARCH_INNER(child, key, &new_child))
{
- RT_SET_EXTEND(tree, key, value_p, parent, stored_child, child);
+ RT_EXTEND_DOWN(tree, key, value_p, parent, stored_child, child);
RT_UNLOCK(tree);
return false;
}
@@ -1805,16 +1808,9 @@ RT_DELETE(RT_RADIX_TREE *tree, uint64 key)
}
#endif
-static inline void
-RT_ITER_UPDATE_KEY(RT_ITER *iter, uint8 chunk, uint8 shift)
-{
- iter->key &= ~(((uint64) RT_CHUNK_MASK) << shift);
- iter->key |= (((uint64) chunk) << shift);
-}
-
/*
- * Advance the slot in the inner node. Return the child if exists, otherwise
- * null.
+ * Scan the inner node and return the next child node if one exists, otherwise
+ * return NULL.
*/
static inline RT_PTR_LOCAL
RT_NODE_INNER_ITERATE_NEXT(RT_ITER *iter, RT_NODE_ITER *node_iter)
@@ -1825,8 +1821,8 @@ RT_NODE_INNER_ITERATE_NEXT(RT_ITER *iter, RT_NODE_ITER *node_iter)
}
/*
- * Advance the slot in the leaf node. On success, return true and the value
- * is set to value_p, otherwise return false.
+ * Scan the leaf node; if the next value exists, set it to value_p and
+ * return true. Otherwise return false.
*/
static inline bool
RT_NODE_LEAF_ITERATE_NEXT(RT_ITER *iter, RT_NODE_ITER *node_iter,
@@ -1838,29 +1834,50 @@ RT_NODE_LEAF_ITERATE_NEXT(RT_ITER *iter, RT_NODE_ITER *node_iter,
}
/*
- * Update each node_iter for inner nodes in the iterator node stack.
+ * While descending the radix tree from the 'from' node to the bottom, we
+ * set the next node to iterate for each level.
*/
static void
-RT_UPDATE_ITER_STACK(RT_ITER *iter, RT_PTR_LOCAL from_node, int from)
+RT_ITER_SET_NODE_FROM(RT_ITER *iter, RT_PTR_LOCAL from)
{
- int level = from;
- RT_PTR_LOCAL node = from_node;
+ int level = from->shift / RT_NODE_SPAN;
+ RT_PTR_LOCAL node = from;
for (;;)
{
- RT_NODE_ITER *node_iter = &(iter->stack[level--]);
+ RT_NODE_ITER *node_iter = &(iter->node_iters[level--]);
+
+#ifdef USE_ASSERT_CHECKING
+ if (node_iter->node)
+ {
+ /* We must have finished the iteration on the previous node */
+ if (RT_NODE_IS_LEAF(node_iter->node))
+ {
+ uint64 dummy;
+ Assert(!RT_NODE_LEAF_ITERATE_NEXT(iter, node_iter, &dummy));
+ }
+ else
+ Assert(!RT_NODE_INNER_ITERATE_NEXT(iter, node_iter));
+ }
+#endif
+ /* Set the node to the node iterator of this level */
node_iter->node = node;
- node_iter->current_idx = -1;
+ node_iter->idx = 0;
- /* We don't advance the leaf node iterator here */
if (RT_NODE_IS_LEAF(node))
- return;
+ {
+ /* We will visit the leaf node when RT_ITERATE_NEXT() is called */
+ break;
+ }
- /* Advance to the next slot in the inner node */
+ /*
+ * Get the first child node from the node, which corresponds to the
+ * lowest chunk within the node.
+ */
node = RT_NODE_INNER_ITERATE_NEXT(iter, node_iter);
- /* We must find the first children in the node */
+ /* The first child must be found */
Assert(node);
}
}
@@ -1874,14 +1891,11 @@ RT_UPDATE_ITER_STACK(RT_ITER *iter, RT_PTR_LOCAL from_node, int from)
RT_SCOPE RT_ITER *
RT_BEGIN_ITERATE(RT_RADIX_TREE *tree)
{
- MemoryContext old_ctx;
RT_ITER *iter;
RT_PTR_LOCAL root;
- int top_level;
- old_ctx = MemoryContextSwitchTo(tree->context);
-
- iter = (RT_ITER *) palloc0(sizeof(RT_ITER));
+ iter = (RT_ITER *) MemoryContextAllocZero(tree->context,
+ sizeof(RT_ITER));
iter->tree = tree;
RT_LOCK_SHARED(tree);
@@ -1891,16 +1905,13 @@ RT_BEGIN_ITERATE(RT_RADIX_TREE *tree)
return iter;
root = RT_PTR_GET_LOCAL(tree, iter->tree->ctl->root);
- top_level = root->shift / RT_NODE_SPAN;
- iter->stack_len = top_level;
+ iter->top_level = root->shift / RT_NODE_SPAN;
/*
- * Descend to the left most leaf node from the root. The key is being
- * constructed while descending to the leaf.
+ * Set the next node to iterate for each level from the level of the
+ * root node.
*/
- RT_UPDATE_ITER_STACK(iter, root, top_level);
-
- MemoryContextSwitchTo(old_ctx);
+ RT_ITER_SET_NODE_FROM(iter, root);
return iter;
}
@@ -1912,6 +1923,8 @@ RT_BEGIN_ITERATE(RT_RADIX_TREE *tree)
RT_SCOPE bool
RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, RT_VALUE_TYPE *value_p)
{
+ Assert(value_p != NULL);
+
/* Empty tree */
if (!iter->tree->ctl->root)
return false;
@@ -1919,43 +1932,38 @@ RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, RT_VALUE_TYPE *value_p)
for (;;)
{
RT_PTR_LOCAL child = NULL;
- RT_VALUE_TYPE value;
- int level;
- bool found;
-
- /* Advance the leaf node iterator to get next key-value pair */
- found = RT_NODE_LEAF_ITERATE_NEXT(iter, &(iter->stack[0]), &value);
- if (found)
+ /* Get the next chunk of the leaf node */
+ if (RT_NODE_LEAF_ITERATE_NEXT(iter, &(iter->node_iters[0]), value_p))
{
*key_p = iter->key;
- *value_p = value;
return true;
}
/*
- * We've visited all values in the leaf node, so advance inner node
- * iterators from the level=1 until we find the next child node.
+ * We've visited all values in the leaf node, so advance all inner node
+ * iterators by visiting inner nodes from the level = 1 until we find the
+ * next inner node that has a child node.
*/
- for (level = 1; level <= iter->stack_len; level++)
+ for (int level = 1; level <= iter->top_level; level++)
{
- child = RT_NODE_INNER_ITERATE_NEXT(iter, &(iter->stack[level]));
+ child = RT_NODE_INNER_ITERATE_NEXT(iter, &(iter->node_iters[level]));
if (child)
break;
}
- /* the iteration finished */
+ /* We've visited all nodes, so the iteration finished */
if (!child)
- return false;
+ break;
/*
- * Set the node to the node iterator and update the iterator stack
- * from this node.
+ * Found the new child node. We update the next node to iterate for each
+ * level from the level of this child node.
*/
- RT_UPDATE_ITER_STACK(iter, child, level - 1);
+ RT_ITER_SET_NODE_FROM(iter, child);
- /* Node iterators are updated, so try again from the leaf */
+ /* Find key-value from the leaf node again */
}
return false;
@@ -2470,8 +2478,8 @@ RT_DUMP(RT_RADIX_TREE *tree)
#undef RT_INIT_NODE
#undef RT_FREE_NODE
#undef RT_FREE_RECURSE
-#undef RT_EXTEND
-#undef RT_SET_EXTEND
+#undef RT_EXTEND_UP
+#undef RT_EXTEND_DOWN
#undef RT_SWITCH_NODE_KIND
#undef RT_COPY_NODE
#undef RT_REPLACE_NODE
@@ -2509,8 +2517,7 @@ RT_DUMP(RT_RADIX_TREE *tree)
#undef RT_NODE_INSERT_LEAF
#undef RT_NODE_INNER_ITERATE_NEXT
#undef RT_NODE_LEAF_ITERATE_NEXT
-#undef RT_UPDATE_ITER_STACK
-#undef RT_ITER_UPDATE_KEY
+#undef RT_ITER_SET_NODE_FROM
#undef RT_VERIFY_NODE
#undef RT_DEBUG
diff --git a/src/include/lib/radixtree_iter_impl.h b/src/include/lib/radixtree_iter_impl.h
index 98c78eb237..5c1034768e 100644
--- a/src/include/lib/radixtree_iter_impl.h
+++ b/src/include/lib/radixtree_iter_impl.h
@@ -27,12 +27,10 @@
#error node level must be either inner or leaf
#endif
- bool found = false;
- uint8 key_chunk;
+ uint8 key_chunk = 0;
#ifdef RT_NODE_LEVEL_LEAF
- RT_VALUE_TYPE value;
-
+ Assert(value_p != NULL);
Assert(RT_NODE_IS_LEAF(node_iter->node));
#else
RT_PTR_LOCAL child = NULL;
@@ -50,99 +48,92 @@
{
RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node_iter->node;
- node_iter->current_idx++;
- if (node_iter->current_idx >= n3->base.n.count)
- break;
+ if (node_iter->idx >= n3->base.n.count)
+ return false;
+
#ifdef RT_NODE_LEVEL_LEAF
- value = n3->values[node_iter->current_idx];
+ *value_p = n3->values[node_iter->idx];
#else
- child = RT_PTR_GET_LOCAL(iter->tree, n3->children[node_iter->current_idx]);
+ child = RT_PTR_GET_LOCAL(iter->tree, n3->children[node_iter->idx]);
#endif
- key_chunk = n3->base.chunks[node_iter->current_idx];
- found = true;
+ key_chunk = n3->base.chunks[node_iter->idx];
+ node_iter->idx++;
break;
}
case RT_NODE_KIND_32:
{
RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node_iter->node;
- node_iter->current_idx++;
- if (node_iter->current_idx >= n32->base.n.count)
- break;
+ if (node_iter->idx >= n32->base.n.count)
+ return false;
#ifdef RT_NODE_LEVEL_LEAF
- value = n32->values[node_iter->current_idx];
+ *value_p = n32->values[node_iter->idx];
#else
- child = RT_PTR_GET_LOCAL(iter->tree, n32->children[node_iter->current_idx]);
+ child = RT_PTR_GET_LOCAL(iter->tree, n32->children[node_iter->idx]);
#endif
- key_chunk = n32->base.chunks[node_iter->current_idx];
- found = true;
+ key_chunk = n32->base.chunks[node_iter->idx];
+ node_iter->idx++;
break;
}
case RT_NODE_KIND_125:
{
RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node_iter->node;
- int i;
+ int chunk;
- for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ for (chunk = node_iter->idx; chunk < RT_NODE_MAX_SLOTS; chunk++)
{
- if (RT_NODE_125_IS_CHUNK_USED((RT_NODE_BASE_125 *) n125, i))
+ if (RT_NODE_125_IS_CHUNK_USED((RT_NODE_BASE_125 *) n125, chunk))
break;
}
- if (i >= RT_NODE_MAX_SLOTS)
- break;
+ if (chunk >= RT_NODE_MAX_SLOTS)
+ return false;
- node_iter->current_idx = i;
#ifdef RT_NODE_LEVEL_LEAF
- value = RT_NODE_LEAF_125_GET_VALUE(n125, i);
+ *value_p = RT_NODE_LEAF_125_GET_VALUE(n125, chunk);
#else
- child = RT_PTR_GET_LOCAL(iter->tree, RT_NODE_INNER_125_GET_CHILD(n125, i));
+ child = RT_PTR_GET_LOCAL(iter->tree, RT_NODE_INNER_125_GET_CHILD(n125, chunk));
#endif
- key_chunk = i;
- found = true;
+ key_chunk = chunk;
+ node_iter->idx = chunk + 1;
break;
}
case RT_NODE_KIND_256:
{
RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node_iter->node;
- int i;
+ int chunk;
- for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ for (chunk = node_iter->idx; chunk < RT_NODE_MAX_SLOTS; chunk++)
{
#ifdef RT_NODE_LEVEL_LEAF
- if (RT_NODE_LEAF_256_IS_CHUNK_USED(n256, i))
+ if (RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk))
#else
- if (RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
+ if (RT_NODE_INNER_256_IS_CHUNK_USED(n256, chunk))
#endif
break;
}
- if (i >= RT_NODE_MAX_SLOTS)
- break;
+ if (chunk >= RT_NODE_MAX_SLOTS)
+ return false;
- node_iter->current_idx = i;
#ifdef RT_NODE_LEVEL_LEAF
- value = RT_NODE_LEAF_256_GET_VALUE(n256, i);
+ *value_p = RT_NODE_LEAF_256_GET_VALUE(n256, chunk);
#else
- child = RT_PTR_GET_LOCAL(iter->tree, RT_NODE_INNER_256_GET_CHILD(n256, i));
+ child = RT_PTR_GET_LOCAL(iter->tree, RT_NODE_INNER_256_GET_CHILD(n256, chunk));
#endif
- key_chunk = i;
- found = true;
+ key_chunk = chunk;
+ node_iter->idx = chunk + 1;
break;
}
}
- if (found)
- {
- RT_ITER_UPDATE_KEY(iter, key_chunk, node_iter->node->shift);
-#ifdef RT_NODE_LEVEL_LEAF
- *value_p = value;
-#endif
- }
+ /* Update the part of the key */
+ iter->key &= ~(((uint64) RT_CHUNK_MASK) << node_iter->node->shift);
+ iter->key |= (((uint64) key_chunk) << node_iter->node->shift);
#ifdef RT_NODE_LEVEL_LEAF
- return found;
+ return true;
#else
return child;
#endif
diff --git a/src/test/modules/test_radixtree/expected/test_radixtree.out b/src/test/modules/test_radixtree/expected/test_radixtree.out
index ce645cb8b5..7ad1ce3605 100644
--- a/src/test/modules/test_radixtree/expected/test_radixtree.out
+++ b/src/test/modules/test_radixtree/expected/test_radixtree.out
@@ -4,8 +4,10 @@ CREATE EXTENSION test_radixtree;
-- an error if something fails.
--
SELECT test_radixtree();
-NOTICE: testing basic operations with leaf node 4
-NOTICE: testing basic operations with inner node 4
+NOTICE: testing basic operations with leaf node 3
+NOTICE: testing basic operations with inner node 3
+NOTICE: testing basic operations with leaf node 15
+NOTICE: testing basic operations with inner node 15
NOTICE: testing basic operations with leaf node 32
NOTICE: testing basic operations with inner node 32
NOTICE: testing basic operations with leaf node 125
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
index afe53382f3..5a169854d9 100644
--- a/src/test/modules/test_radixtree/test_radixtree.c
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -43,12 +43,15 @@ typedef uint64 TestValueType;
*/
static const bool rt_test_stats = false;
-static int rt_node_kind_fanouts[] = {
- 0,
- 4, /* RT_NODE_KIND_4 */
- 32, /* RT_NODE_KIND_32 */
- 125, /* RT_NODE_KIND_125 */
- 256 /* RT_NODE_KIND_256 */
+/*
+ * XXX: should we expose and use RT_SIZE_CLASS and RT_SIZE_CLASS_INFO?
+ */
+static int rt_node_class_fanouts[] = {
+ 3, /* RT_CLASS_3 */
+ 15, /* RT_CLASS_32_MIN */
+ 32, /* RT_CLASS_32_MAX */
+ 125, /* RT_CLASS_125 */
+ 256 /* RT_CLASS_256 */
};
/*
* A struct to define a pattern of integers, for use with the test_pattern()
@@ -260,10 +263,9 @@ test_basic(int children, bool test_inner)
* Check if keys from start to end with the shift exist in the tree.
*/
static void
-check_search_on_node(rt_radix_tree *radixtree, uint8 shift, int start, int end,
- int incr)
+check_search_on_node(rt_radix_tree *radixtree, uint8 shift, int start, int end)
{
- for (int i = start; i < end; i++)
+ for (int i = start; i <= end; i++)
{
uint64 key = ((uint64) i << shift);
TestValueType val;
@@ -277,22 +279,26 @@ check_search_on_node(rt_radix_tree *radixtree, uint8 shift, int start, int end,
}
}
+/*
+ * Insert 256 key-value pairs, and check if keys are properly inserted on each
+ * node class.
+ */
+/* Test keys [0, 256) */
+#define NODE_TYPE_TEST_KEY_MIN 0
+#define NODE_TYPE_TEST_KEY_MAX 256
static void
-test_node_types_insert(rt_radix_tree *radixtree, uint8 shift, bool insert_asc)
+test_node_types_insert_asc(rt_radix_tree *radixtree, uint8 shift)
{
- uint64 num_entries;
- int ninserted = 0;
- int start = insert_asc ? 0 : 256;
- int incr = insert_asc ? 1 : -1;
- int end = insert_asc ? 256 : 0;
- int node_kind_idx = 1;
+ uint64 num_entries;
+ int node_class_idx = 0;
+ uint64 key_checked = 0;
- for (int i = start; i != end; i += incr)
+ for (int i = NODE_TYPE_TEST_KEY_MIN; i < NODE_TYPE_TEST_KEY_MAX; i++)
{
uint64 key = ((uint64) i << shift);
bool found;
- found = rt_set(radixtree, key, (TestValueType*) &key);
+ found = rt_set(radixtree, key, (TestValueType *) &key);
if (found)
elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " is found", key);
@@ -300,24 +306,49 @@ test_node_types_insert(rt_radix_tree *radixtree, uint8 shift, bool insert_asc)
* After filling all slots in each node type, check if the values
* are stored properly.
*/
- if (ninserted == rt_node_kind_fanouts[node_kind_idx] - 1)
+ if ((i + 1) == rt_node_class_fanouts[node_class_idx])
{
- int check_start = insert_asc
- ? rt_node_kind_fanouts[node_kind_idx - 1]
- : rt_node_kind_fanouts[node_kind_idx];
- int check_end = insert_asc
- ? rt_node_kind_fanouts[node_kind_idx]
- : rt_node_kind_fanouts[node_kind_idx - 1];
-
- check_search_on_node(radixtree, shift, check_start, check_end, incr);
- node_kind_idx++;
+ check_search_on_node(radixtree, shift, key_checked, i);
+ key_checked = i;
+ node_class_idx++;
}
-
- ninserted++;
}
num_entries = rt_num_entries(radixtree);
+ if (num_entries != 256)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(256));
+}
+
+/*
+ * Similar to test_node_types_insert_asc(), but inserts keys in descending order.
+ */
+static void
+test_node_types_insert_desc(rt_radix_tree *radixtree, uint8 shift)
+{
+ uint64 num_entries;
+ int node_class_idx = 0;
+ uint64 key_checked = NODE_TYPE_TEST_KEY_MAX - 1;
+
+ for (int i = NODE_TYPE_TEST_KEY_MAX - 1; i >= NODE_TYPE_TEST_KEY_MIN; i--)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_set(radixtree, key, (TestValueType *) &key);
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " is found", key);
+ if ((i + 1) == rt_node_class_fanouts[node_class_idx])
+ {
+ check_search_on_node(radixtree, shift, i, key_checked);
+ key_checked = i;
+ node_class_idx++;
+ }
+ }
+
+ num_entries = rt_num_entries(radixtree);
if (num_entries != 256)
elog(ERROR,
"rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
@@ -329,7 +360,7 @@ test_node_types_delete(rt_radix_tree *radixtree, uint8 shift)
{
uint64 num_entries;
- for (int i = 0; i < 256; i++)
+ for (int i = NODE_TYPE_TEST_KEY_MIN; i < NODE_TYPE_TEST_KEY_MAX; i++)
{
uint64 key = ((uint64) i << shift);
bool found;
@@ -379,9 +410,9 @@ test_node_types(uint8 shift)
* then delete all entries to make it empty, and insert and search entries
* again.
*/
- test_node_types_insert(radixtree, shift, true);
+ test_node_types_insert_asc(radixtree, shift);
test_node_types_delete(radixtree, shift);
- test_node_types_insert(radixtree, shift, false);
+ test_node_types_insert_desc(radixtree, shift);
rt_free(radixtree);
#ifdef RT_SHMEM
@@ -664,10 +695,10 @@ test_radixtree(PG_FUNCTION_ARGS)
{
test_empty();
- for (int i = 1; i < lengthof(rt_node_kind_fanouts); i++)
+ for (int i = 0; i < lengthof(rt_node_class_fanouts); i++)
{
- test_basic(rt_node_kind_fanouts[i], false);
- test_basic(rt_node_kind_fanouts[i], true);
+ test_basic(rt_node_class_fanouts[i], false);
+ test_basic(rt_node_class_fanouts[i], true);
}
for (int shift = 0; shift <= (64 - 8); shift += 8)
--
2.31.1
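
As context for the iteration changes above, here is a minimal sketch (not for
commit, and not part of the patches) of how a caller instantiates the template
and walks all key-value pairs with the RT_BEGIN_ITERATE / RT_ITERATE_NEXT /
RT_END_ITERATE interface. The template parameters mirror the ones used by the
benchmark module attached below; the helper name and key values are arbitrary.

#include "postgres.h"

/*
 * Instantiate a local (non-shared) radix tree with uint64 values, using the
 * same parameters as the benchmark module below.
 */
#define RT_PREFIX rt
#define RT_SCOPE static
#define RT_DECLARE
#define RT_DEFINE
#define RT_VALUE_TYPE uint64
#include "lib/radixtree.h"

/* Hypothetical helper, not part of the patches: iterate in ascending key order. */
static void
radixtree_iteration_sketch(void)
{
	rt_radix_tree *tree = rt_create(CurrentMemoryContext);
	rt_iter    *iter;
	uint64		key;
	uint64		value;

	for (uint64 i = 0; i < 1000; i++)
	{
		uint64		v = i;

		rt_set(tree, i * 257, &v);	/* spread keys over multiple chunks */
	}

	/* Pairs come back in ascending key order */
	iter = rt_begin_iterate(tree);
	while (rt_iterate_next(iter, &key, &value))
		elog(DEBUG1, "key " UINT64_FORMAT " => " UINT64_FORMAT, key, value);
	rt_end_iterate(iter);

	rt_free(tree);
}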
Attachment: v31-0014-Revert-building-benchmark-module-for-CI.patch (application/octet-stream)
From 7bae7b13e777c826c542ac33766ad8358672d9cc Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Tue, 14 Feb 2023 19:31:34 +0700
Subject: [PATCH v31 14/14] Revert building benchmark module for CI
---
contrib/meson.build | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/contrib/meson.build b/contrib/meson.build
index 421d469f8c..52253de793 100644
--- a/contrib/meson.build
+++ b/contrib/meson.build
@@ -12,7 +12,7 @@ subdir('amcheck')
subdir('auth_delay')
subdir('auto_explain')
subdir('basic_archive')
-subdir('bench_radix_tree')
+#subdir('bench_radix_tree')
subdir('bloom')
subdir('basebackup_to_shell')
subdir('bool_plperl')
--
2.31.1
Attachment: v31-0005-Tool-for-measuring-radix-tree-and-tidstore-perfo.patch (application/octet-stream)
From bc5b4650377c4dcb4f108013a5638d6f17cd13ef Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 16 Sep 2022 11:57:03 +0900
Subject: [PATCH v31 05/14] Tool for measuring radix tree and tidstore
performance
Includes Meson support, but commented out to avoid warnings
XXX: Not for commit
---
contrib/bench_radix_tree/Makefile | 21 +
.../bench_radix_tree--1.0.sql | 88 +++
contrib/bench_radix_tree/bench_radix_tree.c | 747 ++++++++++++++++++
.../bench_radix_tree/bench_radix_tree.control | 6 +
contrib/bench_radix_tree/expected/bench.out | 13 +
contrib/bench_radix_tree/meson.build | 33 +
contrib/bench_radix_tree/sql/bench.sql | 16 +
contrib/meson.build | 1 +
8 files changed, 925 insertions(+)
create mode 100644 contrib/bench_radix_tree/Makefile
create mode 100644 contrib/bench_radix_tree/bench_radix_tree--1.0.sql
create mode 100644 contrib/bench_radix_tree/bench_radix_tree.c
create mode 100644 contrib/bench_radix_tree/bench_radix_tree.control
create mode 100644 contrib/bench_radix_tree/expected/bench.out
create mode 100644 contrib/bench_radix_tree/meson.build
create mode 100644 contrib/bench_radix_tree/sql/bench.sql
diff --git a/contrib/bench_radix_tree/Makefile b/contrib/bench_radix_tree/Makefile
new file mode 100644
index 0000000000..952bb0ceae
--- /dev/null
+++ b/contrib/bench_radix_tree/Makefile
@@ -0,0 +1,21 @@
+# contrib/bench_radix_tree/Makefile
+
+MODULE_big = bench_radix_tree
+OBJS = \
+ bench_radix_tree.o
+
+EXTENSION = bench_radix_tree
+DATA = bench_radix_tree--1.0.sql
+
+REGRESS = bench_fixed_height
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/bench_radix_tree
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
new file mode 100644
index 0000000000..ad66265e23
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
@@ -0,0 +1,88 @@
+/* contrib/bench_radix_tree/bench_radix_tree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION bench_radix_tree" to load this file. \quit
+
+create function bench_shuffle_search(
+minblk int4,
+maxblk int4,
+random_block bool DEFAULT false,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT array_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT array_load_ms int8,
+OUT rt_search_ms int8,
+OUT array_serach_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_seq_search(
+minblk int4,
+maxblk int4,
+random_block bool DEFAULT false,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT array_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT array_load_ms int8,
+OUT rt_search_ms int8,
+OUT array_serach_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_load_random_int(
+cnt int8,
+OUT mem_allocated int8,
+OUT load_ms int8)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_search_random_nodes(
+cnt int8,
+filter_str text DEFAULT NULL,
+OUT mem_allocated int8,
+OUT load_ms int8,
+OUT search_ms int8)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C VOLATILE PARALLEL UNSAFE;
+
+create function bench_fixed_height_search(
+fanout int4,
+OUT fanout int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT rt_search_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_node128_load(
+fanout int4,
+OUT fanout int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT rt_sparseload_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_tidstore_load(
+minblk int4,
+maxblk int4,
+OUT mem_allocated int8,
+OUT load_ms int8,
+OUT iter_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c
new file mode 100644
index 0000000000..6e5149e2c4
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -0,0 +1,747 @@
+/*-------------------------------------------------------------------------
+ *
+ * bench_radix_tree.c
+ *
+ * Copyright (c) 2016-2022, PostgreSQL Global Development Group
+ *
+ * contrib/bench_radix_tree/bench_radix_tree.c
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/tidstore.h"
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "funcapi.h"
+#include "lib/radixtree.h"
+#include <math.h>
+#include "miscadmin.h"
+#include "storage/lwlock.h"
+#include "utils/builtins.h"
+#include "utils/timestamp.h"
+
+PG_MODULE_MAGIC;
+
+/* run benchmark also for binary-search case? */
+/* #define MEASURE_BINARY_SEARCH 1 */
+
+#define TIDS_PER_BLOCK_FOR_LOAD 30
+#define TIDS_PER_BLOCK_FOR_LOOKUP 50
+
+//#define RT_DEBUG
+#define RT_PREFIX rt
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#define RT_USE_DELETE
+#define RT_VALUE_TYPE uint64
+// WIP: compiles with warnings because rt_attach is defined but not used
+// #define RT_SHMEM
+#include "lib/radixtree.h"
+
+/*
+ * Return the number of keys in the radix tree.
+ */
+static uint64
+rt_num_entries(rt_radix_tree *tree)
+{
+ return tree->ctl->num_keys;
+}
+
+
+PG_FUNCTION_INFO_V1(bench_seq_search);
+PG_FUNCTION_INFO_V1(bench_shuffle_search);
+PG_FUNCTION_INFO_V1(bench_load_random_int);
+PG_FUNCTION_INFO_V1(bench_fixed_height_search);
+PG_FUNCTION_INFO_V1(bench_search_random_nodes);
+PG_FUNCTION_INFO_V1(bench_node128_load);
+PG_FUNCTION_INFO_V1(bench_tidstore_load);
+
+static uint64
+tid_to_key_off(ItemPointer tid, uint32 *off)
+{
+ uint64 upper;
+ uint32 shift = pg_ceil_log2_32(MaxHeapTuplesPerPage);
+ int64 tid_i;
+
+ Assert(ItemPointerGetOffsetNumber(tid) < MaxHeapTuplesPerPage);
+
+ tid_i = ItemPointerGetOffsetNumber(tid);
+ tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << shift;
+
+ /* log(sizeof(uint64) * BITS_PER_BYTE, 2) = log(64, 2) = 6 */
+ *off = tid_i & ((1 << 6) - 1);
+ upper = tid_i >> 6;
+ Assert(*off < (sizeof(uint64) * BITS_PER_BYTE));
+
+ Assert(*off < 64);
+
+ return upper;
+}
+
+static int
+shuffle_randrange(pg_prng_state *state, int lower, int upper)
+{
+ return (int) floor(pg_prng_double(state) * ((upper - lower) + 0.999999)) + lower;
+}
+
+/* Naive Fisher-Yates implementation */
+static void
+shuffle_itemptrs(ItemPointer itemptr, uint64 nitems)
+{
+ /* reproducibility */
+ pg_prng_state state;
+
+ pg_prng_seed(&state, 0);
+
+ for (int i = 0; i < nitems - 1; i++)
+ {
+ int j = shuffle_randrange(&state, i, nitems - 1);
+ ItemPointerData t = itemptr[j];
+
+ itemptr[j] = itemptr[i];
+ itemptr[i] = t;
+ }
+}
+
+static ItemPointer
+generate_tids(BlockNumber minblk, BlockNumber maxblk, int ntids_per_blk, uint64 *ntids_p,
+ bool random_block)
+{
+ ItemPointer tids;
+ uint64 maxitems;
+ uint64 ntids = 0;
+ pg_prng_state state;
+
+ maxitems = (maxblk - minblk + 1) * ntids_per_blk;
+ tids = MemoryContextAllocHuge(TopTransactionContext,
+ sizeof(ItemPointerData) * maxitems);
+
+ if (random_block)
+ pg_prng_seed(&state, 0x9E3779B185EBCA87);
+
+ for (BlockNumber blk = minblk; blk < maxblk; blk++)
+ {
+ if (random_block && !pg_prng_bool(&state))
+ continue;
+
+ for (OffsetNumber off = FirstOffsetNumber;
+ off <= ntids_per_blk; off++)
+ {
+ CHECK_FOR_INTERRUPTS();
+
+ ItemPointerSetBlockNumber(&(tids[ntids]), blk);
+ ItemPointerSetOffsetNumber(&(tids[ntids]), off);
+
+ ntids++;
+ }
+ }
+
+ *ntids_p = ntids;
+ return tids;
+}
+
+#ifdef MEASURE_BINARY_SEARCH
+static int
+vac_cmp_itemptr(const void *left, const void *right)
+{
+ BlockNumber lblk,
+ rblk;
+ OffsetNumber loff,
+ roff;
+
+ lblk = ItemPointerGetBlockNumber((ItemPointer) left);
+ rblk = ItemPointerGetBlockNumber((ItemPointer) right);
+
+ if (lblk < rblk)
+ return -1;
+ if (lblk > rblk)
+ return 1;
+
+ loff = ItemPointerGetOffsetNumber((ItemPointer) left);
+ roff = ItemPointerGetOffsetNumber((ItemPointer) right);
+
+ if (loff < roff)
+ return -1;
+ if (loff > roff)
+ return 1;
+
+ return 0;
+}
+#endif
+
+Datum
+bench_tidstore_load(PG_FUNCTION_ARGS)
+{
+ BlockNumber minblk = PG_GETARG_INT32(0);
+ BlockNumber maxblk = PG_GETARG_INT32(1);
+ TidStore *ts;
+ TidStoreIter *iter;
+ TidStoreIterResult *result;
+ OffsetNumber *offs;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 load_ms;
+ int64 iter_ms;
+ TupleDesc tupdesc;
+ Datum values[3];
+ bool nulls[3] = {false};
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ offs = palloc(sizeof(OffsetNumber) * TIDS_PER_BLOCK_FOR_LOAD);
+ for (int i = 0; i < TIDS_PER_BLOCK_FOR_LOAD; i++)
+ offs[i] = i + 1; /* FirstOffsetNumber is 1 */
+
+ ts = tidstore_create(1 * 1024L * 1024L * 1024L, MaxHeapTuplesPerPage, NULL);
+
+ /* load tids */
+ start_time = GetCurrentTimestamp();
+ for (BlockNumber blkno = minblk; blkno < maxblk; blkno++)
+ tidstore_add_tids(ts, blkno, offs, TIDS_PER_BLOCK_FOR_LOAD);
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ load_ms = secs * 1000 + usecs / 1000;
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+ /* iterate through tids */
+ iter = tidstore_begin_iterate(ts);
+ start_time = GetCurrentTimestamp();
+ while ((result = tidstore_iterate_next(iter)) != NULL)
+ ;
+ tidstore_end_iterate(iter);
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ iter_ms = secs * 1000 + usecs / 1000;
+
+ values[0] = Int64GetDatum(tidstore_memory_usage(ts));
+ values[1] = Int64GetDatum(load_ms);
+ values[2] = Int64GetDatum(iter_ms);
+
+ tidstore_destroy(ts);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+static Datum
+bench_search(FunctionCallInfo fcinfo, bool shuffle)
+{
+ BlockNumber minblk = PG_GETARG_INT32(0);
+ BlockNumber maxblk = PG_GETARG_INT32(1);
+ bool random_block = PG_GETARG_BOOL(2);
+ rt_radix_tree *rt = NULL;
+ uint64 ntids;
+ uint64 key;
+ uint64 last_key = PG_UINT64_MAX;
+ uint64 val = 0;
+ ItemPointer tids;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_load_ms,
+ rt_search_ms;
+ Datum values[7];
+ bool nulls[7];
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ tids = generate_tids(minblk, maxblk, TIDS_PER_BLOCK_FOR_LOAD, &ntids, random_block);
+
+ /* measure the load time of the radix tree */
+ rt = rt_create(CurrentMemoryContext);
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ uint32 off;
+
+ CHECK_FOR_INTERRUPTS();
+
+ key = tid_to_key_off(tid, &off);
+
+ if (last_key != PG_UINT64_MAX && last_key != key)
+ {
+ rt_set(rt, last_key, &val);
+ val = 0;
+ }
+
+ last_key = key;
+ val |= (uint64) 1 << off;
+ }
+ if (last_key != PG_UINT64_MAX)
+ rt_set(rt, last_key, &val);
+
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_load_ms = secs * 1000 + usecs / 1000;
+
+#ifdef RT_DEBUG
+ rt_stats(rt);
+#endif
+ if (shuffle)
+ shuffle_itemptrs(tids, ntids);
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+ /* measure the search time of the radix tree */
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ uint32 off;
+ volatile bool ret; /* prevent calling rt_search from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ key = tid_to_key_off(tid, &off);
+
+ ret = rt_search(rt, key, &val);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_search_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int64GetDatum(rt_num_entries(rt));
+ values[1] = Int64GetDatum(rt_memory_usage(rt));
+ values[2] = Int64GetDatum(sizeof(ItemPointerData) * ntids);
+ values[3] = Int64GetDatum(rt_load_ms);
+ nulls[4] = true; /* ar_load_ms */
+ values[5] = Int64GetDatum(rt_search_ms);
+ nulls[6] = true; /* ar_search_ms */
+
+#ifdef MEASURE_BINARY_SEARCH
+ {
+ ItemPointer itemptrs = NULL;
+
+ int64 ar_load_ms,
+ ar_search_ms;
+
+ /* measure the load time of the array */
+ itemptrs = MemoryContextAllocHuge(CurrentMemoryContext,
+ sizeof(ItemPointerData) * ntids);
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointerSetBlockNumber(&(itemptrs[i]),
+ ItemPointerGetBlockNumber(&(tids[i])));
+ ItemPointerSetOffsetNumber(&(itemptrs[i]),
+ ItemPointerGetOffsetNumber(&(tids[i])));
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ ar_load_ms = secs * 1000 + usecs / 1000;
+
+ /* next, measure the search time of the array */
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ volatile bool ret; /* prevent calling bsearch from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ ret = bsearch((void *) tid,
+ (void *) itemptrs,
+ ntids,
+ sizeof(ItemPointerData),
+ vac_cmp_itemptr);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ ar_search_ms = secs * 1000 + usecs / 1000;
+
+ /* set the result */
+ nulls[4] = false;
+ values[4] = Int64GetDatum(ar_load_ms);
+ nulls[6] = false;
+ values[6] = Int64GetDatum(ar_search_ms);
+ }
+#endif
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_seq_search(PG_FUNCTION_ARGS)
+{
+ return bench_search(fcinfo, false);
+}
+
+Datum
+bench_shuffle_search(PG_FUNCTION_ARGS)
+{
+ return bench_search(fcinfo, true);
+}
+
+Datum
+bench_load_random_int(PG_FUNCTION_ARGS)
+{
+ uint64 cnt = (uint64) PG_GETARG_INT64(0);
+ rt_radix_tree *rt;
+ pg_prng_state state;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 load_time_ms;
+ Datum values[2];
+ bool nulls[2];
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ pg_prng_seed(&state, 0);
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ uint64 key = pg_prng_uint64(&state);
+
+ rt_set(rt, key, &key);
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ load_time_ms = secs * 1000 + usecs / 1000;
+
+#ifdef RT_DEBUG
+ rt_stats(rt);
+#endif
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int64GetDatum(rt_memory_usage(rt));
+ values[1] = Int64GetDatum(load_time_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+/* copy of splitmix64() */
+static uint64
+hash64(uint64 x)
+{
+ x ^= x >> 30;
+ x *= UINT64CONST(0xbf58476d1ce4e5b9);
+ x ^= x >> 27;
+ x *= UINT64CONST(0x94d049bb133111eb);
+ x ^= x >> 31;
+ return x;
+}
+
+/* attempts to have a relatively even population of node kinds */
+Datum
+bench_search_random_nodes(PG_FUNCTION_ARGS)
+{
+ uint64 cnt = (uint64) PG_GETARG_INT64(0);
+ rt_radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 load_time_ms;
+ int64 search_time_ms;
+ Datum values[3] = {0};
+ bool nulls[3] = {0};
+ /* from trial and error */
+ uint64 filter = (((uint64) 0x7F<<32) | (0x07<<24) | (0xFF<<16) | 0xFF);
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ if (!PG_ARGISNULL(1))
+ {
+ char *filter_str = text_to_cstring(PG_GETARG_TEXT_P(1));
+
+ if (sscanf(filter_str, "0x%lX", &filter) == 0)
+ elog(ERROR, "invalid filter string %s", filter_str);
+ }
+ elog(NOTICE, "bench with filter 0x%lX", filter);
+
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ uint64 hash = hash64(i);
+ uint64 key = hash & filter;
+
+ rt_set(rt, key, &key);
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ load_time_ms = secs * 1000 + usecs / 1000;
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ uint64 hash = hash64(i);
+ uint64 key = hash & filter;
+ uint64 val;
+ volatile bool ret; /* prevent calling rt_search from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ ret = rt_search(rt, key, &val);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ search_time_ms = secs * 1000 + usecs / 1000;
+
+#ifdef RT_DEBUG
+ rt_stats(rt);
+#endif
+
+ values[0] = Int64GetDatum(rt_memory_usage(rt));
+ values[1] = Int64GetDatum(load_time_ms);
+ values[2] = Int64GetDatum(search_time_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_fixed_height_search(PG_FUNCTION_ARGS)
+{
+ int fanout = PG_GETARG_INT32(0);
+ rt_radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_load_ms,
+ rt_search_ms;
+ Datum values[5];
+ bool nulls[5];
+
+ /* test boundary between vector and iteration */
+ const int n_keys = 5 * 16 * 16 * 16 * 16;
+ uint64 r,
+ h,
+ i,
+ j,
+ k;
+ uint64 key_id;
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+
+ key_id = 0;
+
+ /*
+ * lower nodes have limited fanout, the top is only limited by
+ * bits-per-byte
+ */
+ for (r = 1;; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ for (i = 1; i <= fanout; i++)
+ {
+ for (j = 1; j <= fanout; j++)
+ {
+ for (k = 1; k <= fanout; k++)
+ {
+ uint64 key;
+
+ key = (r << 32) | (h << 24) | (i << 16) | (j << 8) | (k);
+
+ CHECK_FOR_INTERRUPTS();
+
+ key_id++;
+ if (key_id > n_keys)
+ goto finish_set;
+
+ rt_set(rt, key, &key_id);
+ }
+ }
+ }
+ }
+ }
+finish_set:
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_load_ms = secs * 1000 + usecs / 1000;
+
+#ifdef RT_DEBUG
+ rt_stats(rt);
+#endif
+
+ /* measure the search time of the radix tree */
+ start_time = GetCurrentTimestamp();
+
+ key_id = 0;
+ for (r = 1;; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ for (i = 1; i <= fanout; i++)
+ {
+ for (j = 1; j <= fanout; j++)
+ {
+ for (k = 1; k <= fanout; k++)
+ {
+ uint64 key,
+ val;
+
+ key = (r << 32) | (h << 24) | (i << 16) | (j << 8) | (k);
+
+ CHECK_FOR_INTERRUPTS();
+
+ key_id++;
+ if (key_id > n_keys)
+ goto finish_search;
+
+ rt_search(rt, key, &val);
+ }
+ }
+ }
+ }
+ }
+finish_search:
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_search_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int32GetDatum(fanout);
+ values[1] = Int64GetDatum(rt_num_entries(rt));
+ values[2] = Int64GetDatum(rt_memory_usage(rt));
+ values[3] = Int64GetDatum(rt_load_ms);
+ values[4] = Int64GetDatum(rt_search_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_node128_load(PG_FUNCTION_ARGS)
+{
+ int fanout = PG_GETARG_INT32(0);
+ rt_radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_sparseload_ms;
+ Datum values[5];
+ bool nulls[5];
+
+ uint64 r,
+ h;
+ uint64 key_id;
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ rt = rt_create(CurrentMemoryContext);
+
+ key_id = 0;
+
+ for (r = 1; r <= fanout; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (h);
+
+ key_id++;
+ rt_set(rt, key, &key_id);
+ }
+ }
+
+#ifdef RT_DEBUG
+ rt_stats(rt);
+#endif
+
+ /* measure sparse deletion and re-loading */
+ start_time = GetCurrentTimestamp();
+
+ for (int t = 0; t<10000; t++)
+ {
+ /* delete one key in each leaf */
+ for (r = 1; r <= fanout; r++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (fanout);
+
+ rt_delete(rt, key);
+ }
+
+ /* add them all back */
+ key_id = 0;
+ for (r = 1; r <= fanout; r++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (fanout);
+
+ key_id++;
+ rt_set(rt, key, &key_id);
+ }
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_sparseload_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int32GetDatum(fanout);
+ values[1] = Int64GetDatum(rt_num_entries(rt));
+ values[2] = Int64GetDatum(rt_memory_usage(rt));
+ values[3] = Int64GetDatum(rt_sparseload_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+/* to silence warnings about unused iter functions */
+static void pg_attribute_unused()
+stub_iter()
+{
+ rt_radix_tree *rt;
+ rt_iter *iter;
+ uint64 key = 1;
+ uint64 value = 1;
+
+ rt = rt_create(CurrentMemoryContext);
+
+ iter = rt_begin_iterate(rt);
+ rt_iterate_next(iter, &key, &value);
+ rt_end_iterate(iter);
+}
\ No newline at end of file
diff --git a/contrib/bench_radix_tree/bench_radix_tree.control b/contrib/bench_radix_tree/bench_radix_tree.control
new file mode 100644
index 0000000000..1d988e6c9a
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.control
@@ -0,0 +1,6 @@
+# bench_radix_tree extension
+comment = 'benchmark suite for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/bench_radix_tree'
+relocatable = true
+trusted = true
diff --git a/contrib/bench_radix_tree/expected/bench.out b/contrib/bench_radix_tree/expected/bench.out
new file mode 100644
index 0000000000..60c303892e
--- /dev/null
+++ b/contrib/bench_radix_tree/expected/bench.out
@@ -0,0 +1,13 @@
+create extension bench_radix_tree;
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
diff --git a/contrib/bench_radix_tree/meson.build b/contrib/bench_radix_tree/meson.build
new file mode 100644
index 0000000000..332c1ae7df
--- /dev/null
+++ b/contrib/bench_radix_tree/meson.build
@@ -0,0 +1,33 @@
+bench_radix_tree_sources = files(
+ 'bench_radix_tree.c',
+)
+
+if host_system == 'windows'
+ bench_radix_tree_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'bench_radix_tree',
+ '--FILEDESC', 'bench_radix_tree - performance test code for radix tree',])
+endif
+
+bench_radix_tree = shared_module('bench_radix_tree',
+ bench_radix_tree_sources,
+ link_with: pgport_srv,
+ kwargs: pg_mod_args,
+)
+testprep_targets += bench_radix_tree
+
+install_data(
+ 'bench_radix_tree.control',
+ 'bench_radix_tree--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'bench_radix_tree',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'bench_radix_tree',
+ ],
+ },
+}
diff --git a/contrib/bench_radix_tree/sql/bench.sql b/contrib/bench_radix_tree/sql/bench.sql
new file mode 100644
index 0000000000..a46018c9d4
--- /dev/null
+++ b/contrib/bench_radix_tree/sql/bench.sql
@@ -0,0 +1,16 @@
+create extension bench_radix_tree;
+
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
diff --git a/contrib/meson.build b/contrib/meson.build
index bd4a57c43c..421d469f8c 100644
--- a/contrib/meson.build
+++ b/contrib/meson.build
@@ -12,6 +12,7 @@ subdir('amcheck')
subdir('auth_delay')
subdir('auto_explain')
subdir('basic_archive')
+subdir('bench_radix_tree')
subdir('bloom')
subdir('basebackup_to_shell')
subdir('bool_plperl')
--
2.31.1
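
To make tid_to_key_off() in the benchmark above (and the key/value layout
that TidStore relies on, described in the next patch) easier to follow, here
is a worked example of the encoding. It assumes 8kB heap pages, where
MaxHeapTuplesPerPage is 291 and pg_ceil_log2_32() therefore yields a 9-bit
shift; the block and offset numbers are arbitrary.

/*
 * Worked example of the encoding in tid_to_key_off(); not part of the
 * patches, it only restates the arithmetic above. Assumes 8kB heap pages,
 * so the offset number occupies 9 bits.
 */
BlockNumber blkno = 1000;
OffsetNumber offnum = 5;

uint64	tid_i = (uint64) offnum | ((uint64) blkno << 9);	/* 512005 */
uint32	bit = tid_i & ((1 << 6) - 1);		/* 5: bit position in the 64-bit value */
uint64	key = tid_i >> 6;					/* 8000: key stored in the radix tree */
uint64	value = UINT64CONST(1) << bit;		/* bitmap with only TID (1000,5) set */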
Attachment: v31-0008-Review-TidStore.patch (application/octet-stream)
From 6842622ec10cf702fd062caccb091ce5ecbe56b5 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Thu, 16 Feb 2023 23:45:39 +0900
Subject: [PATCH v31 08/14] Review TidStore.
---
src/backend/access/common/tidstore.c | 340 +++++++++---------
src/include/access/tidstore.h | 37 +-
.../modules/test_tidstore/test_tidstore.c | 68 ++--
3 files changed, 234 insertions(+), 211 deletions(-)
diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
index 8c05e60d92..9360520482 100644
--- a/src/backend/access/common/tidstore.c
+++ b/src/backend/access/common/tidstore.c
@@ -3,18 +3,19 @@
* tidstore.c
* Tid (ItemPointerData) storage implementation.
*
- * This module provides a in-memory data structure to store Tids (ItemPointer).
- * Internally, a tid is encoded as a pair of 64-bit key and 64-bit value, and
- * stored in the radix tree.
+ * TidStore is an in-memory data structure to store tids (ItemPointerData).
+ * Internally, a tid is encoded as a pair of 64-bit key and 64-bit value,
+ * and stored in the radix tree.
*
- * A TidStore can be shared among parallel worker processes by passing DSA area
- * to tidstore_create(). Other backends can attach to the shared TidStore by
- * tidstore_attach().
+ * TidStore can be shared among parallel worker processes by passing DSA area
+ * to TidStoreCreate(). Other backends can attach to the shared TidStore by
+ * TidStoreAttach().
*
- * Regarding the concurrency, it basically relies on the concurrency support in
- * the radix tree, but we acquires the lock on a TidStore in some cases, for
- * example, when to reset the store and when to access the number tids in the
- * store (num_tids).
+ * Regarding the concurrency support, we use a single LWLock for the TidStore.
+ * The TidStore is exclusively locked when inserting encoded tids to the
+ * radix tree or when resetting itself. When searching on the TidStore or
+ * doing the iteration, it is not locked but the underlying radix tree is
+ * locked in shared mode.
*
* Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
@@ -34,16 +35,18 @@
#include "utils/memutils.h"
/*
- * For encoding purposes, tids are represented as a pair of 64-bit key and
- * 64-bit value. First, we construct 64-bit unsigned integer by combining
- * the block number and the offset number. The number of bits used for the
- * offset number is specified by max_offsets in tidstore_create(). We are
- * frugal with the bits, because smaller keys could help keeping the radix
- * tree shallow.
+ * For encoding purposes, a tid is represented as a pair of 64-bit key and
+ * 64-bit value.
*
- * For example, a tid of heap with 8kB blocks uses the lowest 9 bits for
- * the offset number and uses the next 32 bits for the block number. That
- * is, only 41 bits are used:
+ * First, we construct a 64-bit unsigned integer by combining the block
+ * number and the offset number. The number of bits used for the offset number
+ * is specified by max_off in TidStoreCreate(). We are frugal with the bits,
+ * because smaller keys could help keeping the radix tree shallow.
+ *
+ * For example, a heap tid on an 8kB block uses the lowest 9 bits for
+ * the offset number and uses the next 32 bits for the block number. 9 bits
+ * are enough for the offset number, because MaxHeapTuplesPerPage < 2^9
+ * on 8kB blocks. That is, only 41 bits are used:
*
* uuuuuuuY YYYYYYYY YYYYYYYY YYYYYYYY YYYYYYYX XXXXXXXX
*
@@ -52,30 +55,34 @@
* u = unused bit
* (high on the left, low on the right)
*
- * 9 bits are enough for the offset number, because MaxHeapTuplesPerPage < 2^9
- * on 8kB blocks.
- *
- * The 64-bit value is the bitmap representation of the lowest 6 bits
- * (TIDSTORE_VALUE_NBITS) of the integer, and the rest 35 bits are used
- * as the key:
+ * Then, 64-bit value is the bitmap representation of the lowest 6 bits
+ * (LOWER_OFFSET_NBITS) of the integer, and 64-bit key consists of the
+ * upper 3 bits of the offset number and the block number, 35 bits in
+ * total:
*
* uuuuuuuY YYYYYYYY YYYYYYYY YYYYYYYY YYYYYYYX XXXXXXXX
* |----| value
- * |---------------------------------------------| key
+ * |--------------------------------------| key
*
* The maximum height of the radix tree is 5 in this case.
+ *
+ * If the number of bits required for offset numbers fits in LOWER_OFFSET_NBITS,
+ * 64-bit value is the bitmap representation of the offset number, and the
+ * 64-bit key is the block number.
*/
-#define TIDSTORE_VALUE_NBITS 6 /* log(64, 2) */
-#define TIDSTORE_OFFSET_MASK ((1 << TIDSTORE_VALUE_NBITS) - 1)
+typedef uint64 tidkey;
+typedef uint64 offsetbm;
+#define LOWER_OFFSET_NBITS 6 /* log(sizeof(offsetbm), 2) */
+#define LOWER_OFFSET_MASK ((1 << LOWER_OFFSET_NBITS) - 1)
-/* A magic value used to identify our TidStores. */
+/* A magic value used to identify our TidStore. */
#define TIDSTORE_MAGIC 0x826f6a10
#define RT_PREFIX local_rt
#define RT_SCOPE static
#define RT_DECLARE
#define RT_DEFINE
-#define RT_VALUE_TYPE uint64
+#define RT_VALUE_TYPE tidkey
#include "lib/radixtree.h"
#define RT_PREFIX shared_rt
@@ -83,7 +90,7 @@
#define RT_SCOPE static
#define RT_DECLARE
#define RT_DEFINE
-#define RT_VALUE_TYPE uint64
+#define RT_VALUE_TYPE tidkey
#include "lib/radixtree.h"
/* The control object for a TidStore */
@@ -94,10 +101,10 @@ typedef struct TidStoreControl
/* These values are never changed after creation */
size_t max_bytes; /* the maximum bytes a TidStore can use */
- int max_offset; /* the maximum offset number */
- int offset_nbits; /* the number of bits required for an offset
- * number */
- int offset_key_nbits; /* the number of bits of an offset number
+ int max_off; /* the maximum offset number */
+ int max_off_nbits; /* the number of bits required for offset
+ * numbers */
+ int upper_off_nbits; /* the number of bits of offset numbers
* used in a key */
/* The below fields are used only in shared case */
@@ -106,7 +113,7 @@ typedef struct TidStoreControl
LWLock lock;
/* handles for TidStore and radix tree */
- tidstore_handle handle;
+ TidStoreHandle handle;
shared_rt_handle tree_handle;
} TidStoreControl;
@@ -147,24 +154,27 @@ typedef struct TidStoreIter
bool finished;
/* save for the next iteration */
- uint64 next_key;
- uint64 next_val;
+ tidkey next_tidkey;
+ offsetbm next_off_bitmap;
- /* output for the caller */
- TidStoreIterResult result;
+ /*
+ * output for the caller. Must be last because variable-size.
+ */
+ TidStoreIterResult output;
} TidStoreIter;
-static void tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, uint64 val);
-static inline BlockNumber key_get_blkno(TidStore *ts, uint64 key);
-static inline uint64 encode_key_off(TidStore *ts, BlockNumber block, uint32 offset, uint64 *off_bit);
-static inline uint64 tid_to_key_off(TidStore *ts, ItemPointer tid, uint64 *off_bit);
+static void iter_decode_key_off(TidStoreIter *iter, tidkey key, offsetbm off_bitmap);
+static inline BlockNumber key_get_blkno(TidStore *ts, tidkey key);
+static inline tidkey encode_blk_off(TidStore *ts, BlockNumber block,
+ OffsetNumber offset, offsetbm *off_bit);
+static inline tidkey encode_tid(TidStore *ts, ItemPointer tid, offsetbm *off_bit);
/*
* Create a TidStore. The returned object is allocated in backend-local memory.
* The radix tree for storage is allocated in DSA area is 'area' is non-NULL.
*/
TidStore *
-tidstore_create(size_t max_bytes, int max_offset, dsa_area *area)
+TidStoreCreate(size_t max_bytes, int max_off, dsa_area *area)
{
TidStore *ts;
@@ -176,12 +186,12 @@ tidstore_create(size_t max_bytes, int max_offset, dsa_area *area)
* Memory consumption depends on the number of stored tids, but also on the
* distribution of them, how the radix tree stores, and the memory management
* that backed the radix tree. The maximum bytes that a TidStore can
- * use is specified by the max_bytes in tidstore_create(). We want the total
+ * use is specified by the max_bytes in TidStoreCreate(). We want the total
* amount of memory consumption by a TidStore not to exceed the max_bytes.
*
* In local TidStore cases, the radix tree uses slab allocators for each kind
* of node class. The most memory consuming case while adding Tids associated
- * with one page (i.e. during tidstore_add_tids()) is that we allocate a new
+ * with one page (i.e. during TidStoreSetBlockOffsets()) is that we allocate a new
* slab block for a new radix tree node, which is approximately 70kB. Therefore,
* we deduct 70kB from the max_bytes.
*
@@ -202,7 +212,7 @@ tidstore_create(size_t max_bytes, int max_offset, dsa_area *area)
dp = dsa_allocate0(area, sizeof(TidStoreControl));
ts->control = (TidStoreControl *) dsa_get_address(area, dp);
- ts->control->max_bytes = (uint64) (max_bytes * ratio);
+ ts->control->max_bytes = (size_t) (max_bytes * ratio);
ts->area = area;
ts->control->magic = TIDSTORE_MAGIC;
@@ -218,14 +228,14 @@ tidstore_create(size_t max_bytes, int max_offset, dsa_area *area)
ts->control->max_bytes = max_bytes - (70 * 1024);
}
- ts->control->max_offset = max_offset;
- ts->control->offset_nbits = pg_ceil_log2_32(max_offset);
+ ts->control->max_off = max_off;
+ ts->control->max_off_nbits = pg_ceil_log2_32(max_off);
- if (ts->control->offset_nbits < TIDSTORE_VALUE_NBITS)
- ts->control->offset_nbits = TIDSTORE_VALUE_NBITS;
+ if (ts->control->max_off_nbits < LOWER_OFFSET_NBITS)
+ ts->control->max_off_nbits = LOWER_OFFSET_NBITS;
- ts->control->offset_key_nbits =
- ts->control->offset_nbits - TIDSTORE_VALUE_NBITS;
+ ts->control->upper_off_nbits =
+ ts->control->max_off_nbits - LOWER_OFFSET_NBITS;
return ts;
}
@@ -235,7 +245,7 @@ tidstore_create(size_t max_bytes, int max_offset, dsa_area *area)
* allocated in backend-local memory using the CurrentMemoryContext.
*/
TidStore *
-tidstore_attach(dsa_area *area, tidstore_handle handle)
+TidStoreAttach(dsa_area *area, TidStoreHandle handle)
{
TidStore *ts;
dsa_pointer control;
@@ -266,7 +276,7 @@ tidstore_attach(dsa_area *area, tidstore_handle handle)
* to the operating system.
*/
void
-tidstore_detach(TidStore *ts)
+TidStoreDetach(TidStore *ts)
{
Assert(TidStoreIsShared(ts) && ts->control->magic == TIDSTORE_MAGIC);
@@ -279,12 +289,12 @@ tidstore_detach(TidStore *ts)
*
* TODO: The caller must be certain that no other backend will attempt to
* access the TidStore before calling this function. Other backend must
- * explicitly call tidstore_detach to free up backend-local memory associated
- * with the TidStore. The backend that calls tidstore_destroy must not call
- * tidstore_detach.
+ * explicitly call TidStoreDetach() to free up backend-local memory associated
+ * with the TidStore. The backend that calls TidStoreDestroy() must not call
+ * TidStoreDetach().
*/
void
-tidstore_destroy(TidStore *ts)
+TidStoreDestroy(TidStore *ts)
{
if (TidStoreIsShared(ts))
{
@@ -309,11 +319,11 @@ tidstore_destroy(TidStore *ts)
}
/*
- * Forget all collected Tids. It's similar to tidstore_destroy but we don't free
+ * Forget all collected Tids. It's similar to TidStoreDestroy() but we don't free
* entire TidStore but recreate only the radix tree storage.
*/
void
-tidstore_reset(TidStore *ts)
+TidStoreReset(TidStore *ts)
{
if (TidStoreIsShared(ts))
{
@@ -350,30 +360,34 @@ tidstore_reset(TidStore *ts)
}
}
-/* Add Tids on a block to TidStore */
+/*
+ * Set the given tids on the blkno to TidStore.
+ *
+ * NB: the offset numbers in offsets must be sorted in ascending order.
+ */
void
-tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
- int num_offsets)
+TidStoreSetBlockOffsets(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets)
{
- uint64 *values;
- uint64 key;
- uint64 prev_key;
- uint64 off_bitmap = 0;
+ offsetbm *bitmaps;
+ tidkey key;
+ tidkey prev_key;
+ offsetbm off_bitmap = 0;
int idx;
- const uint64 key_base = ((uint64) blkno) << ts->control->offset_key_nbits;
- const int nkeys = UINT64CONST(1) << ts->control->offset_key_nbits;
+ const tidkey key_base = ((uint64) blkno) << ts->control->upper_off_nbits;
+ const int nkeys = UINT64CONST(1) << ts->control->upper_off_nbits;
Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
- values = palloc(sizeof(uint64) * nkeys);
+ bitmaps = palloc(sizeof(offsetbm) * nkeys);
key = prev_key = key_base;
for (int i = 0; i < num_offsets; i++)
{
- uint64 off_bit;
+ offsetbm off_bit;
/* encode the tid to a key and partial offset */
- key = encode_key_off(ts, blkno, offsets[i], &off_bit);
+ key = encode_blk_off(ts, blkno, offsets[i], &off_bit);
/* make sure we scanned the line pointer array in order */
Assert(key >= prev_key);
@@ -384,11 +398,11 @@ tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
Assert(idx >= 0 && idx < nkeys);
/* write out offset bitmap for this key */
- values[idx] = off_bitmap;
+ bitmaps[idx] = off_bitmap;
/* zero out any gaps up to the current key */
for (int empty_idx = idx + 1; empty_idx < key - key_base; empty_idx++)
- values[empty_idx] = 0;
+ bitmaps[empty_idx] = 0;
/* reset for current key -- the current offset will be handled below */
off_bitmap = 0;
@@ -401,7 +415,7 @@ tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
/* save the final index for later */
idx = key - key_base;
/* write out last offset bitmap */
- values[idx] = off_bitmap;
+ bitmaps[idx] = off_bitmap;
if (TidStoreIsShared(ts))
LWLockAcquire(&ts->control->lock, LW_EXCLUSIVE);
@@ -409,14 +423,14 @@ tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
/* insert the calculated key-values to the tree */
for (int i = 0; i <= idx; i++)
{
- if (values[i])
+ if (bitmaps[i])
{
key = key_base + i;
if (TidStoreIsShared(ts))
- shared_rt_set(ts->tree.shared, key, &values[i]);
+ shared_rt_set(ts->tree.shared, key, &bitmaps[i]);
else
- local_rt_set(ts->tree.local, key, &values[i]);
+ local_rt_set(ts->tree.local, key, &bitmaps[i]);
}
}
@@ -426,70 +440,70 @@ tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
if (TidStoreIsShared(ts))
LWLockRelease(&ts->control->lock);
- pfree(values);
+ pfree(bitmaps);
}
/* Return true if the given tid is present in the TidStore */
bool
-tidstore_lookup_tid(TidStore *ts, ItemPointer tid)
+TidStoreIsMember(TidStore *ts, ItemPointer tid)
{
- uint64 key;
- uint64 val = 0;
- uint64 off_bit;
+ tidkey key;
+ offsetbm off_bitmap = 0;
+ offsetbm off_bit;
bool found;
- key = tid_to_key_off(ts, tid, &off_bit);
+ key = encode_tid(ts, tid, &off_bit);
if (TidStoreIsShared(ts))
- found = shared_rt_search(ts->tree.shared, key, &val);
+ found = shared_rt_search(ts->tree.shared, key, &off_bitmap);
else
- found = local_rt_search(ts->tree.local, key, &val);
+ found = local_rt_search(ts->tree.local, key, &off_bitmap);
if (!found)
return false;
- return (val & off_bit) != 0;
+ return (off_bitmap & off_bit) != 0;
}
/*
- * Prepare to iterate through a TidStore. Since the radix tree is locked during the
- * iteration, so tidstore_end_iterate() needs to called when finished.
+ * Prepare to iterate through a TidStore. Since the radix tree is locked during
+ * the iteration, TidStoreEndIterate() needs to be called when finished.
+ *
+ * The TidStoreIter struct is created in the caller's memory context.
*
* Concurrent updates during the iteration will be blocked when inserting a
* key-value to the radix tree.
*/
TidStoreIter *
-tidstore_begin_iterate(TidStore *ts)
+TidStoreBeginIterate(TidStore *ts)
{
TidStoreIter *iter;
Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
- iter = palloc0(sizeof(TidStoreIter));
+ iter = palloc0(sizeof(TidStoreIter) +
+ sizeof(OffsetNumber) * ts->control->max_off);
iter->ts = ts;
- iter->result.blkno = InvalidBlockNumber;
- iter->result.offsets = palloc(sizeof(OffsetNumber) * ts->control->max_offset);
-
if (TidStoreIsShared(ts))
iter->tree_iter.shared = shared_rt_begin_iterate(ts->tree.shared);
else
iter->tree_iter.local = local_rt_begin_iterate(ts->tree.local);
/* If the TidStore is empty, there is no business */
- if (tidstore_num_tids(ts) == 0)
+ if (TidStoreNumTids(ts) == 0)
iter->finished = true;
return iter;
}
static inline bool
-tidstore_iter_kv(TidStoreIter *iter, uint64 *key, uint64 *val)
+tidstore_iter(TidStoreIter *iter, tidkey *key, offsetbm *off_bitmap)
{
if (TidStoreIsShared(iter->ts))
- return shared_rt_iterate_next(iter->tree_iter.shared, key, val);
+ return shared_rt_iterate_next(iter->tree_iter.shared, key, off_bitmap);
- return local_rt_iterate_next(iter->tree_iter.local, key, val);
+ return local_rt_iterate_next(iter->tree_iter.local, key, off_bitmap);
}
/*
@@ -498,45 +512,48 @@ tidstore_iter_kv(TidStoreIter *iter, uint64 *key, uint64 *val)
* numbers in each result is also sorted in ascending order.
*/
TidStoreIterResult *
-tidstore_iterate_next(TidStoreIter *iter)
+TidStoreIterateNext(TidStoreIter *iter)
{
- uint64 key;
- uint64 val;
- TidStoreIterResult *result = &(iter->result);
+ tidkey key;
+ offsetbm off_bitmap = 0;
+ TidStoreIterResult *output = &(iter->output);
if (iter->finished)
return NULL;
- if (BlockNumberIsValid(result->blkno))
- {
- /* Process the previously collected key-value */
- result->num_offsets = 0;
- tidstore_iter_extract_tids(iter, iter->next_key, iter->next_val);
- }
+ /* Initialize the outputs */
+ output->blkno = InvalidBlockNumber;
+ output->num_offsets = 0;
- while (tidstore_iter_kv(iter, &key, &val))
- {
- BlockNumber blkno;
+ /*
+ * Decode the key and offset bitmap collected in the previous
+ * iteration, if any.
+ */
+ if (iter->next_off_bitmap > 0)
+ iter_decode_key_off(iter, iter->next_tidkey, iter->next_off_bitmap);
- blkno = key_get_blkno(iter->ts, key);
+ while (tidstore_iter(iter, &key, &off_bitmap))
+ {
+ BlockNumber blkno = key_get_blkno(iter->ts, key);
- if (BlockNumberIsValid(result->blkno) && result->blkno != blkno)
+ if (BlockNumberIsValid(output->blkno) && output->blkno != blkno)
{
/*
- * We got a key-value pair for a different block. So return the
- * collected tids, and remember the key-value for the next iteration.
+ * We got tids for a different block. We return the collected
+ * tids so far, and remember the key-value for the next
+ * iteration.
*/
- iter->next_key = key;
- iter->next_val = val;
- return result;
+ iter->next_tidkey = key;
+ iter->next_off_bitmap = off_bitmap;
+ return output;
}
- /* Collect tids extracted from the key-value pair */
- tidstore_iter_extract_tids(iter, key, val);
+ /* Collect tids decoded from the key and offset bitmap */
+ iter_decode_key_off(iter, key, off_bitmap);
}
iter->finished = true;
- return result;
+ return output;
}
/*
@@ -544,22 +561,21 @@ tidstore_iterate_next(TidStoreIter *iter)
* or when existing an iteration.
*/
void
-tidstore_end_iterate(TidStoreIter *iter)
+TidStoreEndIterate(TidStoreIter *iter)
{
if (TidStoreIsShared(iter->ts))
shared_rt_end_iterate(iter->tree_iter.shared);
else
local_rt_end_iterate(iter->tree_iter.local);
- pfree(iter->result.offsets);
pfree(iter);
}
/* Return the number of tids we collected so far */
int64
-tidstore_num_tids(TidStore *ts)
+TidStoreNumTids(TidStore *ts)
{
- uint64 num_tids;
+ int64 num_tids;
Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
@@ -575,16 +591,16 @@ tidstore_num_tids(TidStore *ts)
/* Return true if the current memory usage of TidStore exceeds the limit */
bool
-tidstore_is_full(TidStore *ts)
+TidStoreIsFull(TidStore *ts)
{
Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
- return (tidstore_memory_usage(ts) > ts->control->max_bytes);
+ return (TidStoreMemoryUsage(ts) > ts->control->max_bytes);
}
/* Return the maximum memory TidStore can use */
size_t
-tidstore_max_memory(TidStore *ts)
+TidStoreMaxMemory(TidStore *ts)
{
Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
@@ -593,7 +609,7 @@ tidstore_max_memory(TidStore *ts)
/* Return the memory usage of TidStore */
size_t
-tidstore_memory_usage(TidStore *ts)
+TidStoreMemoryUsage(TidStore *ts)
{
Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
@@ -611,71 +627,75 @@ tidstore_memory_usage(TidStore *ts)
/*
* Get a handle that can be used by other processes to attach to this TidStore
*/
-tidstore_handle
-tidstore_get_handle(TidStore *ts)
+TidStoreHandle
+TidStoreGetHandle(TidStore *ts)
{
Assert(TidStoreIsShared(ts) && ts->control->magic == TIDSTORE_MAGIC);
return ts->control->handle;
}
-/* Extract tids from the given key-value pair */
+/*
+ * Decode the key and offset bitmap to tids and store them to the iteration
+ * result.
+ */
static void
-tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, uint64 val)
+iter_decode_key_off(TidStoreIter *iter, tidkey key, offsetbm off_bitmap)
{
- TidStoreIterResult *result = (&iter->result);
+ TidStoreIterResult *output = (&iter->output);
- while (val)
+ while (off_bitmap)
{
- uint64 tid_i;
+ uint64 compressed_tid;
OffsetNumber off;
- tid_i = key << TIDSTORE_VALUE_NBITS;
- tid_i |= pg_rightmost_one_pos64(val);
+ compressed_tid = key << LOWER_OFFSET_NBITS;
+ compressed_tid |= pg_rightmost_one_pos64(off_bitmap);
- off = tid_i & ((UINT64CONST(1) << iter->ts->control->offset_nbits) - 1);
+ off = compressed_tid & ((UINT64CONST(1) << iter->ts->control->max_off_nbits) - 1);
- Assert(result->num_offsets < iter->ts->control->max_offset);
- result->offsets[result->num_offsets++] = off;
+ Assert(output->num_offsets < iter->ts->control->max_off);
+ output->offsets[output->num_offsets++] = off;
/* unset the rightmost bit */
- val &= ~pg_rightmost_one64(val);
+ off_bitmap &= ~pg_rightmost_one64(off_bitmap);
}
- result->blkno = key_get_blkno(iter->ts, key);
+ output->blkno = key_get_blkno(iter->ts, key);
}
/* Get block number from the given key */
static inline BlockNumber
-key_get_blkno(TidStore *ts, uint64 key)
+key_get_blkno(TidStore *ts, tidkey key)
{
- return (BlockNumber) (key >> ts->control->offset_key_nbits);
+ return (BlockNumber) (key >> ts->control->upper_off_nbits);
}
-/* Encode a tid to key and offset */
-static inline uint64
-tid_to_key_off(TidStore *ts, ItemPointer tid, uint64 *off_bit)
+/* Encode a tid to key and partial offset */
+static inline tidkey
+encode_tid(TidStore *ts, ItemPointer tid, offsetbm *off_bit)
{
- uint32 offset = ItemPointerGetOffsetNumber(tid);
+ OffsetNumber offset = ItemPointerGetOffsetNumber(tid);
BlockNumber block = ItemPointerGetBlockNumber(tid);
- return encode_key_off(ts, block, offset, off_bit);
+ return encode_blk_off(ts, block, offset, off_bit);
}
/* encode a block and offset to a key and partial offset */
-static inline uint64
-encode_key_off(TidStore *ts, BlockNumber block, uint32 offset, uint64 *off_bit)
+static inline tidkey
+encode_blk_off(TidStore *ts, BlockNumber block, OffsetNumber offset,
+ offsetbm *off_bit)
{
- uint64 key;
- uint64 tid_i;
+ tidkey key;
+ uint64 compressed_tid;
uint32 off_lower;
- off_lower = offset & TIDSTORE_OFFSET_MASK;
- Assert(off_lower < (sizeof(uint64) * BITS_PER_BYTE));
+ off_lower = offset & LOWER_OFFSET_MASK;
+ Assert(off_lower < (sizeof(offsetbm) * BITS_PER_BYTE));
*off_bit = UINT64CONST(1) << off_lower;
- tid_i = offset | ((uint64) block << ts->control->offset_nbits);
- key = tid_i >> TIDSTORE_VALUE_NBITS;
+ compressed_tid = offset | ((uint64) block << ts->control->max_off_nbits);
+ key = compressed_tid >> LOWER_OFFSET_NBITS;
return key;
}
diff --git a/src/include/access/tidstore.h b/src/include/access/tidstore.h
index a35a52124a..66f0fdd482 100644
--- a/src/include/access/tidstore.h
+++ b/src/include/access/tidstore.h
@@ -17,33 +17,34 @@
#include "storage/itemptr.h"
#include "utils/dsa.h"
-typedef dsa_pointer tidstore_handle;
+typedef dsa_pointer TidStoreHandle;
typedef struct TidStore TidStore;
typedef struct TidStoreIter TidStoreIter;
+/* Result struct for TidStoreIterateNext */
typedef struct TidStoreIterResult
{
BlockNumber blkno;
- OffsetNumber *offsets;
int num_offsets;
+ OffsetNumber offsets[FLEXIBLE_ARRAY_MEMBER];
} TidStoreIterResult;
-extern TidStore *tidstore_create(size_t max_bytes, int max_offset, dsa_area *dsa);
-extern TidStore *tidstore_attach(dsa_area *dsa, dsa_pointer handle);
-extern void tidstore_detach(TidStore *ts);
-extern void tidstore_destroy(TidStore *ts);
-extern void tidstore_reset(TidStore *ts);
-extern void tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
- int num_offsets);
-extern bool tidstore_lookup_tid(TidStore *ts, ItemPointer tid);
-extern TidStoreIter * tidstore_begin_iterate(TidStore *ts);
-extern TidStoreIterResult *tidstore_iterate_next(TidStoreIter *iter);
-extern void tidstore_end_iterate(TidStoreIter *iter);
-extern int64 tidstore_num_tids(TidStore *ts);
-extern bool tidstore_is_full(TidStore *ts);
-extern size_t tidstore_max_memory(TidStore *ts);
-extern size_t tidstore_memory_usage(TidStore *ts);
-extern tidstore_handle tidstore_get_handle(TidStore *ts);
+extern TidStore *TidStoreCreate(size_t max_bytes, int max_off, dsa_area *dsa);
+extern TidStore *TidStoreAttach(dsa_area *dsa, dsa_pointer handle);
+extern void TidStoreDetach(TidStore *ts);
+extern void TidStoreDestroy(TidStore *ts);
+extern void TidStoreReset(TidStore *ts);
+extern void TidStoreSetBlockOffsets(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets);
+extern bool TidStoreIsMember(TidStore *ts, ItemPointer tid);
+extern TidStoreIter * TidStoreBeginIterate(TidStore *ts);
+extern TidStoreIterResult *TidStoreIterateNext(TidStoreIter *iter);
+extern void TidStoreEndIterate(TidStoreIter *iter);
+extern int64 TidStoreNumTids(TidStore *ts);
+extern bool TidStoreIsFull(TidStore *ts);
+extern size_t TidStoreMaxMemory(TidStore *ts);
+extern size_t TidStoreMemoryUsage(TidStore *ts);
+extern TidStoreHandle TidStoreGetHandle(TidStore *ts);
#endif /* TIDSTORE_H */
diff --git a/src/test/modules/test_tidstore/test_tidstore.c b/src/test/modules/test_tidstore/test_tidstore.c
index 9a1217f833..8659e6780e 100644
--- a/src/test/modules/test_tidstore/test_tidstore.c
+++ b/src/test/modules/test_tidstore/test_tidstore.c
@@ -37,10 +37,10 @@ check_tid(TidStore *ts, BlockNumber blkno, OffsetNumber off, bool expect)
ItemPointerSet(&tid, blkno, off);
- found = tidstore_lookup_tid(ts, &tid);
+ found = TidStoreIsMember(ts, &tid);
if (found != expect)
- elog(ERROR, "lookup TID (%u, %u) returned %d, expected %d",
+ elog(ERROR, "TidStoreIsMember for TID (%u, %u) returned %d, expected %d",
blkno, off, found, expect);
}
@@ -69,9 +69,9 @@ test_basic(int max_offset)
LWLockRegisterTranche(tranche_id, "test_tidstore");
dsa = dsa_create(tranche_id);
- ts = tidstore_create(TEST_TIDSTORE_MAX_BYTES, max_offset, dsa);
+ ts = TidStoreCreate(TEST_TIDSTORE_MAX_BYTES, max_offset, dsa);
#else
- ts = tidstore_create(TEST_TIDSTORE_MAX_BYTES, max_offset, NULL);
+ ts = TidStoreCreate(TEST_TIDSTORE_MAX_BYTES, max_offset, NULL);
#endif
/* prepare the offset array */
@@ -83,7 +83,7 @@ test_basic(int max_offset)
/* add tids */
for (int i = 0; i < TEST_TIDSTORE_NUM_BLOCKS; i++)
- tidstore_add_tids(ts, blks[i], offs, TEST_TIDSTORE_NUM_OFFSETS);
+ TidStoreSetBlockOffsets(ts, blks[i], offs, TEST_TIDSTORE_NUM_OFFSETS);
/* lookup test */
for (OffsetNumber off = FirstOffsetNumber ; off < max_offset; off++)
@@ -105,30 +105,30 @@ test_basic(int max_offset)
}
/* test the number of tids */
- if (tidstore_num_tids(ts) != (TEST_TIDSTORE_NUM_BLOCKS * TEST_TIDSTORE_NUM_OFFSETS))
- elog(ERROR, "tidstore_num_tids returned " UINT64_FORMAT ", expected %d",
- tidstore_num_tids(ts),
+ if (TidStoreNumTids(ts) != (TEST_TIDSTORE_NUM_BLOCKS * TEST_TIDSTORE_NUM_OFFSETS))
+ elog(ERROR, "TidStoreNumTids returned " UINT64_FORMAT ", expected %d",
+ TidStoreNumTids(ts),
TEST_TIDSTORE_NUM_BLOCKS * TEST_TIDSTORE_NUM_OFFSETS);
/* iteration test */
- iter = tidstore_begin_iterate(ts);
+ iter = TidStoreBeginIterate(ts);
blk_idx = 0;
- while ((iter_result = tidstore_iterate_next(iter)) != NULL)
+ while ((iter_result = TidStoreIterateNext(iter)) != NULL)
{
/* check the returned block number */
if (blks_sorted[blk_idx] != iter_result->blkno)
- elog(ERROR, "tidstore_iterate_next returned block number %u, expected %u",
+ elog(ERROR, "TidStoreIterateNext returned block number %u, expected %u",
iter_result->blkno, blks_sorted[blk_idx]);
/* check the returned offset numbers */
if (TEST_TIDSTORE_NUM_OFFSETS != iter_result->num_offsets)
- elog(ERROR, "tidstore_iterate_next returned %u offsets, expected %u",
+ elog(ERROR, "TidStoreIterateNext %u offsets, expected %u",
iter_result->num_offsets, TEST_TIDSTORE_NUM_OFFSETS);
for (int i = 0; i < iter_result->num_offsets; i++)
{
if (offs[i] != iter_result->offsets[i])
- elog(ERROR, "tidstore_iterate_next returned offset number %u on block %u, expected %u",
+ elog(ERROR, "TidStoreIterateNext offset number %u on block %u, expected %u",
iter_result->offsets[i], iter_result->blkno, offs[i]);
}
@@ -136,15 +136,15 @@ test_basic(int max_offset)
}
if (blk_idx != TEST_TIDSTORE_NUM_BLOCKS)
- elog(ERROR, "tidstore_iterate_next returned %d blocks, expected %d",
+ elog(ERROR, "TidStoreIterateNext returned %d blocks, expected %d",
blk_idx, TEST_TIDSTORE_NUM_BLOCKS);
/* remove all tids */
- tidstore_reset(ts);
+ TidStoreReset(ts);
/* test the number of tids */
- if (tidstore_num_tids(ts) != 0)
- elog(ERROR, "tidstore_num_tids on empty store returned non-zero");
+ if (TidStoreNumTids(ts) != 0)
+ elog(ERROR, "TidStoreNumTids on empty store returned non-zero");
/* lookup test for empty store */
for (OffsetNumber off = FirstOffsetNumber ; off < MaxHeapTuplesPerPage;
@@ -156,7 +156,7 @@ test_basic(int max_offset)
check_tid(ts, MaxBlockNumber, off, false);
}
- tidstore_destroy(ts);
+ TidStoreDestroy(ts);
#ifdef TEST_SHARED_TIDSTORE
dsa_detach(dsa);
@@ -177,36 +177,37 @@ test_empty(void)
LWLockRegisterTranche(tranche_id, "test_tidstore");
dsa = dsa_create(tranche_id);
- ts = tidstore_create(TEST_TIDSTORE_MAX_BYTES, MaxHeapTuplesPerPage, dsa);
+ ts = TidStoreCreate(TEST_TIDSTORE_MAX_BYTES, MaxHeapTuplesPerPage, dsa);
#else
- ts = tidstore_create(TEST_TIDSTORE_MAX_BYTES, MaxHeapTuplesPerPage, NULL);
+ ts = TidStoreCreate(TEST_TIDSTORE_MAX_BYTES, MaxHeapTuplesPerPage, NULL);
#endif
elog(NOTICE, "testing empty tidstore");
ItemPointerSet(&tid, 0, FirstOffsetNumber);
- if (tidstore_lookup_tid(ts, &tid))
- elog(ERROR, "tidstore_lookup_tid for (0,1) on empty store returned true");
+ if (TidStoreIsMember(ts, &tid))
+ elog(ERROR, "TidStoreIsMember for TID (%u,%u) on empty store returned true",
+ 0, FirstOffsetNumber);
ItemPointerSet(&tid, MaxBlockNumber, MaxOffsetNumber);
- if (tidstore_lookup_tid(ts, &tid))
- elog(ERROR, "tidstore_lookup_tid for (%u,%u) on empty store returned true",
+ if (TidStoreIsMember(ts, &tid))
+ elog(ERROR, "TidStoreIsMember for TID (%u,%u) on empty store returned true",
MaxBlockNumber, MaxOffsetNumber);
- if (tidstore_num_tids(ts) != 0)
- elog(ERROR, "tidstore_num_entries on empty store returned non-zero");
+ if (TidStoreNumTids(ts) != 0)
+ elog(ERROR, "TidStoreNumTids on empty store returned non-zero");
- if (tidstore_is_full(ts))
- elog(ERROR, "tidstore_is_full on empty store returned true");
+ if (TidStoreIsFull(ts))
+ elog(ERROR, "TidStoreIsFull on empty store returned true");
- iter = tidstore_begin_iterate(ts);
+ iter = TidStoreBeginIterate(ts);
- if (tidstore_iterate_next(iter) != NULL)
- elog(ERROR, "tidstore_iterate_next on empty store returned TIDs");
+ if (TidStoreIterateNext(iter) != NULL)
+ elog(ERROR, "TidStoreIterateNext on empty store returned TIDs");
- tidstore_end_iterate(iter);
+ TidStoreEndIterate(iter);
- tidstore_destroy(ts);
+ TidStoreDestroy(ts);
#ifdef TEST_SHARED_TIDSTORE
dsa_detach(dsa);
@@ -221,6 +222,7 @@ test_tidstore(PG_FUNCTION_ARGS)
elog(NOTICE, "testing basic operations");
test_basic(MaxHeapTuplesPerPage);
test_basic(10);
+ test_basic(MaxHeapTuplesPerPage * 2);
PG_RETURN_VOID();
}
--
2.31.1
Attachment: v31-0003-Add-radixtree-template.patch (application/octet-stream)
From 014d2f9a13af4e9f57ff2f8e44fba61c71ecec66 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 14 Sep 2022 12:38:51 +0000
Subject: [PATCH v31 03/14] Add radixtree template
WIP: commit message based on template comments
---
src/backend/utils/mmgr/dsa.c | 12 +
src/include/lib/radixtree.h | 2516 +++++++++++++++++
src/include/lib/radixtree_delete_impl.h | 122 +
src/include/lib/radixtree_insert_impl.h | 328 +++
src/include/lib/radixtree_iter_impl.h | 153 +
src/include/lib/radixtree_search_impl.h | 138 +
src/include/utils/dsa.h | 1 +
src/test/modules/Makefile | 1 +
src/test/modules/meson.build | 1 +
src/test/modules/test_radixtree/.gitignore | 4 +
src/test/modules/test_radixtree/Makefile | 23 +
src/test/modules/test_radixtree/README | 7 +
.../expected/test_radixtree.out | 36 +
src/test/modules/test_radixtree/meson.build | 35 +
.../test_radixtree/sql/test_radixtree.sql | 7 +
.../test_radixtree/test_radixtree--1.0.sql | 8 +
.../modules/test_radixtree/test_radixtree.c | 681 +++++
.../test_radixtree/test_radixtree.control | 4 +
src/tools/pginclude/cpluspluscheck | 6 +
src/tools/pginclude/headerscheck | 6 +
20 files changed, 4089 insertions(+)
create mode 100644 src/include/lib/radixtree.h
create mode 100644 src/include/lib/radixtree_delete_impl.h
create mode 100644 src/include/lib/radixtree_insert_impl.h
create mode 100644 src/include/lib/radixtree_iter_impl.h
create mode 100644 src/include/lib/radixtree_search_impl.h
create mode 100644 src/test/modules/test_radixtree/.gitignore
create mode 100644 src/test/modules/test_radixtree/Makefile
create mode 100644 src/test/modules/test_radixtree/README
create mode 100644 src/test/modules/test_radixtree/expected/test_radixtree.out
create mode 100644 src/test/modules/test_radixtree/meson.build
create mode 100644 src/test/modules/test_radixtree/sql/test_radixtree.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree--1.0.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree.c
create mode 100644 src/test/modules/test_radixtree/test_radixtree.control
diff --git a/src/backend/utils/mmgr/dsa.c b/src/backend/utils/mmgr/dsa.c
index f5a62061a3..80555aefff 100644
--- a/src/backend/utils/mmgr/dsa.c
+++ b/src/backend/utils/mmgr/dsa.c
@@ -1024,6 +1024,18 @@ dsa_set_size_limit(dsa_area *area, size_t limit)
LWLockRelease(DSA_AREA_LOCK(area));
}
+size_t
+dsa_get_total_size(dsa_area *area)
+{
+ size_t size;
+
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_SHARED);
+ size = area->control->total_segment_size;
+ LWLockRelease(DSA_AREA_LOCK(area));
+
+ return size;
+}
+
/*
* Aggressively free all spare memory in the hope of returning DSM segments to
* the operating system.
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
new file mode 100644
index 0000000000..e546bd705c
--- /dev/null
+++ b/src/include/lib/radixtree.h
@@ -0,0 +1,2516 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.h
+ * Template for adaptive radix tree.
+ *
+ * This module employs the idea from the paper "The Adaptive Radix Tree: ARTful
+ * Indexing for Main-Memory Databases" by Viktor Leis, Alfons Kemper, and Thomas
+ * Neumann, 2013. The radix tree uses adaptive node sizes, a small number of node
+ * types, each with a different number of elements. Depending on the number of
+ * children, the appropriate node type is used.
+ *
+ * WIP: notes about traditional radix tree trading off span vs height...
+ *
+ * There are two kinds of nodes, inner nodes and leaves. Inner nodes
+ * map partial keys to child pointers.
+ *
+ * The ART paper mentions three ways to implement leaves:
+ *
+ * "- Single-value leaves: The values are stored using an addi-
+ * tional leaf node type which stores one value.
+ * - Multi-value leaves: The values are stored in one of four
+ * different leaf node types, which mirror the structure of
+ * inner nodes, but contain values instead of pointers.
+ * - Combined pointer/value slots: If values fit into point-
+ * ers, no separate node types are necessary. Instead, each
+ * pointer storage location in an inner node can either
+ * store a pointer or a value."
+ *
+ * We chose "multi-value leaves" to avoid the additional pointer traversal
+ * required by "single-value leaves".
+ *
+ * For simplicity, the key is assumed to be 64-bit unsigned integer. The
+ * tree doesn't need to contain paths where the highest bytes of all keys
+ * are zero. That way, the tree's height adapts to the distribution of keys.
+ *
+ * TODO: In the future it might be worthwhile to offer configurability of
+ * leaf implementation for different use cases. Single-value leaves would
+ * give more flexibility in key type, including variable-length keys.
+ *
+ * There are some optimizations not yet implemented, particularly path
+ * compression and lazy path expansion.
+ *
+ * To handle concurrency, we use a single reader-writer lock for the radix
+ * tree. The radix tree is exclusively locked during write operations such
+ * as RT_SET() and RT_DELETE(), and shared locked during read operations
+ * such as RT_SEARCH(). An iteration also holds the shared lock on the radix
+ * tree until it is completed.
+ *
+ * TODO: The current locking mechanism is not optimized for high concurrency
+ * with mixed read-write workloads. In the future it might be worthwhile
+ * to replace it with the Optimistic Lock Coupling or ROWEX mentioned in
+ * the paper "The ART of Practical Synchronization" by the same authors as
+ * the ART paper, 2016.
+ *
+ * WIP: the radix tree nodes don't shrink.
+ *
+ * To generate a radix tree and associated functions for a use case several
+ * macros have to be #define'ed before this file is included. Including
+ * the file #undef's all those, so a new radix tree can be generated
+ * afterwards.
+ * The relevant parameters are:
+ * - RT_PREFIX - prefix for all symbol names generated. A prefix of 'foo'
+ * will result in radix tree type 'foo_radix_tree' and functions like
+ * 'foo_create'/'foo_free' and so forth.
+ * - RT_DECLARE - if defined function prototypes and type declarations are
+ * generated
+ * - RT_DEFINE - if defined function definitions are generated
+ * - RT_SCOPE - in which scope (e.g. extern, static inline) do function
+ * declarations reside
+ * - RT_VALUE_TYPE - the type of the value.
+ *
+ * Optional parameters:
+ * - RT_SHMEM - if defined, the radix tree is created in the DSA area
+ * so that multiple processes can access it simultaneously.
+ * - RT_DEBUG - if defined add stats tracking and debugging functions
+ *
+ * Interface
+ * ---------
+ *
+ * RT_CREATE - Create a new, empty radix tree
+ * RT_FREE - Free the radix tree
+ * RT_SEARCH - Search a key-value pair
+ * RT_SET - Set a key-value pair
+ * RT_BEGIN_ITERATE - Begin iterating through all key-value pairs
+ * RT_ITERATE_NEXT - Return next key-value pair, if any
+ * RT_END_ITERATE - End iteration
+ * RT_MEMORY_USAGE - Get the memory usage
+ *
+ * Interface for Shared Memory
+ * ---------
+ *
+ * RT_ATTACH - Attach to the radix tree
+ * RT_DETACH - Detach from the radix tree
+ * RT_GET_HANDLE - Return the handle of the radix tree
+ *
+ * Optional Interface
+ * ---------
+ *
+ * RT_DELETE - Delete a key-value pair. Declared/defined if RT_USE_DELETE is defined
+ *
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/lib/radixtree.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "lib/stringinfo.h"
+#include "miscadmin.h"
+#include "nodes/bitmapset.h"
+#include "port/pg_bitutils.h"
+#include "port/simd.h"
+#include "utils/dsa.h"
+#include "utils/memutils.h"
+
+/* helpers */
+#define RT_MAKE_PREFIX(a) CppConcat(a,_)
+#define RT_MAKE_NAME(name) RT_MAKE_NAME_(RT_MAKE_PREFIX(RT_PREFIX),name)
+#define RT_MAKE_NAME_(a,b) CppConcat(a,b)
+
+/* function declarations */
+#define RT_CREATE RT_MAKE_NAME(create)
+#define RT_FREE RT_MAKE_NAME(free)
+#define RT_SEARCH RT_MAKE_NAME(search)
+#ifdef RT_SHMEM
+#define RT_ATTACH RT_MAKE_NAME(attach)
+#define RT_DETACH RT_MAKE_NAME(detach)
+#define RT_GET_HANDLE RT_MAKE_NAME(get_handle)
+#endif
+#define RT_SET RT_MAKE_NAME(set)
+#define RT_BEGIN_ITERATE RT_MAKE_NAME(begin_iterate)
+#define RT_ITERATE_NEXT RT_MAKE_NAME(iterate_next)
+#define RT_END_ITERATE RT_MAKE_NAME(end_iterate)
+#ifdef RT_USE_DELETE
+#define RT_DELETE RT_MAKE_NAME(delete)
+#endif
+#define RT_MEMORY_USAGE RT_MAKE_NAME(memory_usage)
+#ifdef RT_DEBUG
+#define RT_DUMP RT_MAKE_NAME(dump)
+#define RT_DUMP_NODE RT_MAKE_NAME(dump_node)
+#define RT_DUMP_SEARCH RT_MAKE_NAME(dump_search)
+#define RT_STATS RT_MAKE_NAME(stats)
+#endif
+
+/* internal helper functions (no externally visible prototypes) */
+#define RT_NEW_ROOT RT_MAKE_NAME(new_root)
+#define RT_ALLOC_NODE RT_MAKE_NAME(alloc_node)
+#define RT_INIT_NODE RT_MAKE_NAME(init_node)
+#define RT_FREE_NODE RT_MAKE_NAME(free_node)
+#define RT_FREE_RECURSE RT_MAKE_NAME(free_recurse)
+#define RT_EXTEND RT_MAKE_NAME(extend)
+#define RT_SET_EXTEND RT_MAKE_NAME(set_extend)
+#define RT_SWITCH_NODE_KIND RT_MAKE_NAME(grow_node_kind)
+#define RT_COPY_NODE RT_MAKE_NAME(copy_node)
+#define RT_REPLACE_NODE RT_MAKE_NAME(replace_node)
+#define RT_PTR_GET_LOCAL RT_MAKE_NAME(ptr_get_local)
+#define RT_PTR_ALLOC_IS_VALID RT_MAKE_NAME(ptr_stored_is_valid)
+#define RT_NODE_3_SEARCH_EQ RT_MAKE_NAME(node_3_search_eq)
+#define RT_NODE_32_SEARCH_EQ RT_MAKE_NAME(node_32_search_eq)
+#define RT_NODE_3_GET_INSERTPOS RT_MAKE_NAME(node_3_get_insertpos)
+#define RT_NODE_32_GET_INSERTPOS RT_MAKE_NAME(node_32_get_insertpos)
+#define RT_CHUNK_CHILDREN_ARRAY_SHIFT RT_MAKE_NAME(chunk_children_array_shift)
+#define RT_CHUNK_VALUES_ARRAY_SHIFT RT_MAKE_NAME(chunk_values_array_shift)
+#define RT_CHUNK_CHILDREN_ARRAY_DELETE RT_MAKE_NAME(chunk_children_array_delete)
+#define RT_CHUNK_VALUES_ARRAY_DELETE RT_MAKE_NAME(chunk_values_array_delete)
+#define RT_CHUNK_CHILDREN_ARRAY_COPY RT_MAKE_NAME(chunk_children_array_copy)
+#define RT_CHUNK_VALUES_ARRAY_COPY RT_MAKE_NAME(chunk_values_array_copy)
+#define RT_NODE_125_IS_CHUNK_USED RT_MAKE_NAME(node_125_is_chunk_used)
+#define RT_NODE_INNER_125_GET_CHILD RT_MAKE_NAME(node_inner_125_get_child)
+#define RT_NODE_LEAF_125_GET_VALUE RT_MAKE_NAME(node_leaf_125_get_value)
+#define RT_NODE_INNER_256_IS_CHUNK_USED RT_MAKE_NAME(node_inner_256_is_chunk_used)
+#define RT_NODE_LEAF_256_IS_CHUNK_USED RT_MAKE_NAME(node_leaf_256_is_chunk_used)
+#define RT_NODE_INNER_256_GET_CHILD RT_MAKE_NAME(node_inner_256_get_child)
+#define RT_NODE_LEAF_256_GET_VALUE RT_MAKE_NAME(node_leaf_256_get_value)
+#define RT_NODE_INNER_256_SET RT_MAKE_NAME(node_inner_256_set)
+#define RT_NODE_LEAF_256_SET RT_MAKE_NAME(node_leaf_256_set)
+#define RT_NODE_INNER_256_DELETE RT_MAKE_NAME(node_inner_256_delete)
+#define RT_NODE_LEAF_256_DELETE RT_MAKE_NAME(node_leaf_256_delete)
+#define RT_KEY_GET_SHIFT RT_MAKE_NAME(key_get_shift)
+#define RT_SHIFT_GET_MAX_VAL RT_MAKE_NAME(shift_get_max_val)
+#define RT_NODE_SEARCH_INNER RT_MAKE_NAME(node_search_inner)
+#define RT_NODE_SEARCH_LEAF RT_MAKE_NAME(node_search_leaf)
+#define RT_NODE_UPDATE_INNER RT_MAKE_NAME(node_update_inner)
+#define RT_NODE_DELETE_INNER RT_MAKE_NAME(node_delete_inner)
+#define RT_NODE_DELETE_LEAF RT_MAKE_NAME(node_delete_leaf)
+#define RT_NODE_INSERT_INNER RT_MAKE_NAME(node_insert_inner)
+#define RT_NODE_INSERT_LEAF RT_MAKE_NAME(node_insert_leaf)
+#define RT_NODE_INNER_ITERATE_NEXT RT_MAKE_NAME(node_inner_iterate_next)
+#define RT_NODE_LEAF_ITERATE_NEXT RT_MAKE_NAME(node_leaf_iterate_next)
+#define RT_UPDATE_ITER_STACK RT_MAKE_NAME(update_iter_stack)
+#define RT_ITER_UPDATE_KEY RT_MAKE_NAME(iter_update_key)
+#define RT_VERIFY_NODE RT_MAKE_NAME(verify_node)
+
+/* type declarations */
+#define RT_RADIX_TREE RT_MAKE_NAME(radix_tree)
+#define RT_RADIX_TREE_CONTROL RT_MAKE_NAME(radix_tree_control)
+#define RT_ITER RT_MAKE_NAME(iter)
+#ifdef RT_SHMEM
+#define RT_HANDLE RT_MAKE_NAME(handle)
+#endif
+#define RT_NODE RT_MAKE_NAME(node)
+#define RT_NODE_ITER RT_MAKE_NAME(node_iter)
+#define RT_NODE_BASE_3 RT_MAKE_NAME(node_base_3)
+#define RT_NODE_BASE_32 RT_MAKE_NAME(node_base_32)
+#define RT_NODE_BASE_125 RT_MAKE_NAME(node_base_125)
+#define RT_NODE_BASE_256 RT_MAKE_NAME(node_base_256)
+#define RT_NODE_INNER_3 RT_MAKE_NAME(node_inner_3)
+#define RT_NODE_INNER_32 RT_MAKE_NAME(node_inner_32)
+#define RT_NODE_INNER_125 RT_MAKE_NAME(node_inner_125)
+#define RT_NODE_INNER_256 RT_MAKE_NAME(node_inner_256)
+#define RT_NODE_LEAF_3 RT_MAKE_NAME(node_leaf_3)
+#define RT_NODE_LEAF_32 RT_MAKE_NAME(node_leaf_32)
+#define RT_NODE_LEAF_125 RT_MAKE_NAME(node_leaf_125)
+#define RT_NODE_LEAF_256 RT_MAKE_NAME(node_leaf_256)
+#define RT_SIZE_CLASS RT_MAKE_NAME(size_class)
+#define RT_SIZE_CLASS_ELEM RT_MAKE_NAME(size_class_elem)
+#define RT_SIZE_CLASS_INFO RT_MAKE_NAME(size_class_info)
+#define RT_CLASS_3 RT_MAKE_NAME(class_3)
+#define RT_CLASS_32_MIN RT_MAKE_NAME(class_32_min)
+#define RT_CLASS_32_MAX RT_MAKE_NAME(class_32_max)
+#define RT_CLASS_125 RT_MAKE_NAME(class_125)
+#define RT_CLASS_256 RT_MAKE_NAME(class_256)
+
+/* generate forward declarations necessary to use the radix tree */
+#ifdef RT_DECLARE
+
+typedef struct RT_RADIX_TREE RT_RADIX_TREE;
+typedef struct RT_ITER RT_ITER;
+
+#ifdef RT_SHMEM
+typedef dsa_pointer RT_HANDLE;
+#endif
+
+#ifdef RT_SHMEM
+RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx, dsa_area *dsa, int tranche_id);
+RT_SCOPE RT_RADIX_TREE * RT_ATTACH(dsa_area *dsa, dsa_pointer dp);
+RT_SCOPE void RT_DETACH(RT_RADIX_TREE *tree);
+RT_SCOPE RT_HANDLE RT_GET_HANDLE(RT_RADIX_TREE *tree);
+#else
+RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx);
+#endif
+RT_SCOPE void RT_FREE(RT_RADIX_TREE *tree);
+
+RT_SCOPE bool RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p);
+RT_SCOPE bool RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p);
+#ifdef RT_USE_DELETE
+RT_SCOPE bool RT_DELETE(RT_RADIX_TREE *tree, uint64 key);
+#endif
+
+RT_SCOPE RT_ITER * RT_BEGIN_ITERATE(RT_RADIX_TREE *tree);
+RT_SCOPE bool RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, RT_VALUE_TYPE *value_p);
+RT_SCOPE void RT_END_ITERATE(RT_ITER *iter);
+
+RT_SCOPE uint64 RT_MEMORY_USAGE(RT_RADIX_TREE *tree);
+
+#ifdef RT_DEBUG
+RT_SCOPE void RT_DUMP(RT_RADIX_TREE *tree);
+RT_SCOPE void RT_DUMP_SEARCH(RT_RADIX_TREE *tree, uint64 key);
+RT_SCOPE void RT_STATS(RT_RADIX_TREE *tree);
+#endif
+
+#endif /* RT_DECLARE */
+
+
+/* generate implementation of the radix tree */
+#ifdef RT_DEFINE
+
+/* The number of bits encoded in one tree level */
+#define RT_NODE_SPAN BITS_PER_BYTE
+
+/* The maximum number of slots in a node */
+#define RT_NODE_MAX_SLOTS (1 << RT_NODE_SPAN)
+
+/* Mask for extracting a chunk from the key */
+#define RT_CHUNK_MASK ((1 << RT_NODE_SPAN) - 1)
+
+/* Maximum shift the radix tree uses */
+#define RT_MAX_SHIFT RT_KEY_GET_SHIFT(UINT64_MAX)
+
+/* Tree level the radix tree uses */
+#define RT_MAX_LEVEL ((sizeof(uint64) * BITS_PER_BYTE) / RT_NODE_SPAN)
+
+/*
+ * Number of bits necessary for isset array in the slot-index node.
+ * Since bitmapword can be 64 bits, the only values that make sense
+ * here are 64 and 128.
+ */
+#define RT_SLOT_IDX_LIMIT (RT_NODE_MAX_SLOTS / 2)
+
+/* Invalid index used in node-125 */
+#define RT_INVALID_SLOT_IDX 0xFF
+
+/* Get a chunk from the key */
+#define RT_GET_KEY_CHUNK(key, shift) ((uint8) (((key) >> (shift)) & RT_CHUNK_MASK))
+
+/* For accessing bitmaps */
+#define RT_BM_IDX(x) ((x) / BITS_PER_BITMAPWORD)
+#define RT_BM_BIT(x) ((x) % BITS_PER_BITMAPWORD)
+
+/*
+ * Node kinds
+ *
+ * The different node kinds are what make the tree "adaptive".
+ *
+ * Each node kind is associated with a different datatype and different
+ * search/set/delete/iterate algorithms adapted for its size. The largest
+ * kind, node256 is basically the same as a traditional radix tree,
+ * and would be most wasteful of memory when sparsely populated. The
+ * smaller nodes expend some additional CPU time to enable a smaller
+ * memory footprint.
+ *
+ * XXX There are 4 node kinds, and this should never be increased,
+ * for several reasons:
+ * 1. With 5 or more kinds, gcc tends to use a jump table for switch
+ * statements.
+ * 2. The 4 kinds can be represented with 2 bits, so we have the option
+ * in the future to tag the node pointer with the kind, even on
+ * platforms with 32-bit pointers. This might speed up node traversal
+ * in trees with highly random node kinds.
+ * 3. We can have multiple size classes per node kind.
+ */
+#define RT_NODE_KIND_3 0x00
+#define RT_NODE_KIND_32 0x01
+#define RT_NODE_KIND_125 0x02
+#define RT_NODE_KIND_256 0x03
+#define RT_NODE_KIND_COUNT 4
+
+/*
+ * Calculate the slab blocksize so that we can allocate at least 32 chunks
+ * from the block.
+ */
+#define RT_SLAB_BLOCK_SIZE(size) \
+ Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * (size), (size) * 32)
+
+/* Common type for all nodes types */
+typedef struct RT_NODE
+{
+ /*
+ * Number of children. We use uint16 to be able to indicate 256 children
+ * at the fanout of 8.
+ */
+ uint16 count;
+
+ /*
+ * Max capacity for the current size class. Storing this in the
+ * node enables multiple size classes per node kind.
+ * Technically, kinds with a single size class don't need this, so we could
+ * keep this in the individual base types, but the code is simpler this way.
+ * Note: node256 is unique in that it cannot possibly have more than a
+ * single size class, so for that kind we store zero, and uint8 is
+ * sufficient for other kinds.
+ */
+ uint8 fanout;
+
+ /*
+ * Shift indicates which part of the key space is represented by this
+ * node. That is, the key is shifted by 'shift' and the lowest
+ * RT_NODE_SPAN bits are then represented in chunk.
+ */
+ uint8 shift;
+
+ /* Node kind, one per search/set algorithm */
+ uint8 kind;
+} RT_NODE;
+
+
+#define RT_PTR_LOCAL RT_NODE *
+
+#ifdef RT_SHMEM
+#define RT_PTR_ALLOC dsa_pointer
+#else
+#define RT_PTR_ALLOC RT_PTR_LOCAL
+#endif
+
+
+#ifdef RT_SHMEM
+#define RT_INVALID_PTR_ALLOC InvalidDsaPointer
+#else
+#define RT_INVALID_PTR_ALLOC NULL
+#endif
+
+#ifdef RT_SHMEM
+#define RT_LOCK_EXCLUSIVE(tree) LWLockAcquire(&tree->ctl->lock, LW_EXCLUSIVE)
+#define RT_LOCK_SHARED(tree) LWLockAcquire(&tree->ctl->lock, LW_SHARED)
+#define RT_UNLOCK(tree) LWLockRelease(&tree->ctl->lock);
+#else
+#define RT_LOCK_EXCLUSIVE(tree) ((void) 0)
+#define RT_LOCK_SHARED(tree) ((void) 0)
+#define RT_UNLOCK(tree) ((void) 0)
+#endif
+
+/*
+ * Inner nodes and leaf nodes have analogous structure. To distinguish
+ * them at runtime, we take advantage of the fact that the key chunk
+ * is accessed by shifting: Inner tree nodes (shift > 0) store the
+ * pointer to a child node in the slot. In leaf nodes (shift == 0),
+ * the slot contains the value corresponding to the key.
+ */
+#define RT_NODE_IS_LEAF(n) (((RT_PTR_LOCAL) (n))->shift == 0)
+
+#define RT_NODE_MUST_GROW(node) \
+ ((node)->base.n.count == (node)->base.n.fanout)
+
+/*
+ * Base type of each node kind for leaf and inner nodes.
+ * The base types must be able to accommodate the largest size
+ * class for variable-sized node kinds.
+ */
+typedef struct RT_NODE_BASE_3
+{
+ RT_NODE n;
+
+ /* 3 children, for key chunks */
+ uint8 chunks[3];
+} RT_NODE_BASE_3;
+
+typedef struct RT_NODE_BASE_32
+{
+ RT_NODE n;
+
+ /* 32 children, for key chunks */
+ uint8 chunks[32];
+} RT_NODE_BASE_32;
+
+/*
+ * node-125 uses slot_idx array, an array of RT_NODE_MAX_SLOTS length
+ * to store indexes into a second array that contains the values (or
+ * child pointers).
+ */
+typedef struct RT_NODE_BASE_125
+{
+ RT_NODE n;
+
+ /* The index of slots for each fanout */
+ uint8 slot_idxs[RT_NODE_MAX_SLOTS];
+
+ /* bitmap to track which slots are in use */
+ bitmapword isset[RT_BM_IDX(RT_SLOT_IDX_LIMIT)];
+} RT_NODE_BASE_125;
+
+typedef struct RT_NODE_BASE_256
+{
+ RT_NODE n;
+} RT_NODE_BASE_256;
+
+/*
+ * Inner and leaf nodes.
+ *
+ * These are separate because the value type might be different than
+ * something fitting into a pointer-width type.
+ */
+typedef struct RT_NODE_INNER_3
+{
+ RT_NODE_BASE_3 base;
+
+ /* number of children depends on size class */
+ RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_INNER_3;
+
+typedef struct RT_NODE_LEAF_3
+{
+ RT_NODE_BASE_3 base;
+
+ /* number of values depends on size class */
+ RT_VALUE_TYPE values[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_LEAF_3;
+
+typedef struct RT_NODE_INNER_32
+{
+ RT_NODE_BASE_32 base;
+
+ /* number of children depends on size class */
+ RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_INNER_32;
+
+typedef struct RT_NODE_LEAF_32
+{
+ RT_NODE_BASE_32 base;
+
+ /* number of values depends on size class */
+ RT_VALUE_TYPE values[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_LEAF_32;
+
+typedef struct RT_NODE_INNER_125
+{
+ RT_NODE_BASE_125 base;
+
+ /* number of children depends on size class */
+ RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_INNER_125;
+
+typedef struct RT_NODE_LEAF_125
+{
+ RT_NODE_BASE_125 base;
+
+ /* number of values depends on size class */
+ RT_VALUE_TYPE values[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_LEAF_125;
+
+/*
+ * node-256 is the largest node type. This node has an array
+ * for directly storing values (or child pointers in inner nodes).
+ * Unlike other node kinds, its array size is by definition
+ * fixed.
+ */
+typedef struct RT_NODE_INNER_256
+{
+ RT_NODE_BASE_256 base;
+
+ /* Slots for 256 children */
+ RT_PTR_ALLOC children[RT_NODE_MAX_SLOTS];
+} RT_NODE_INNER_256;
+
+typedef struct RT_NODE_LEAF_256
+{
+ RT_NODE_BASE_256 base;
+
+ /*
+ * Unlike with inner256, zero is a valid value here, so we use a
+ * bitmap to track which slots are in use.
+ */
+ bitmapword isset[RT_BM_IDX(RT_NODE_MAX_SLOTS)];
+
+ /* Slots for 256 values */
+ RT_VALUE_TYPE values[RT_NODE_MAX_SLOTS];
+} RT_NODE_LEAF_256;
+
+/*
+ * Node size classes
+ *
+ * Nodes of different kinds necessarily belong to different size classes.
+ * The main innovation in our implementation compared to the ART paper
+ * is decoupling the notion of size class from kind.
+ *
+ * The size classes within a given node kind have the same underlying
+ * type, but a variable number of children/values. This is possible
+ * because the base type contains small fixed data structures that
+ * work the same way regardless of how full the node is. We store the
+ * node's allocated capacity in the "fanout" member of RT_NODE, to allow
+ * runtime introspection.
+ *
+ * Growing from one node kind to another requires special code for each
+ * case, but growing from one size class to another within the same kind
+ * is basically just allocate + memcpy.
+ *
+ * The size classes have been chosen so that inner nodes on platforms
+ * with 64-bit pointers (and leaf nodes when using a 64-bit key) are
+ * equal to or slightly smaller than some DSA size class.
+ */
+typedef enum RT_SIZE_CLASS
+{
+ RT_CLASS_3 = 0,
+ RT_CLASS_32_MIN,
+ RT_CLASS_32_MAX,
+ RT_CLASS_125,
+ RT_CLASS_256
+} RT_SIZE_CLASS;
+
+/* Information for each size class */
+typedef struct RT_SIZE_CLASS_ELEM
+{
+ const char *name;
+ int fanout;
+
+ /* slab chunk size */
+ Size inner_size;
+ Size leaf_size;
+} RT_SIZE_CLASS_ELEM;
+
+static const RT_SIZE_CLASS_ELEM RT_SIZE_CLASS_INFO[] = {
+ [RT_CLASS_3] = {
+ .name = "radix tree node 3",
+ .fanout = 3,
+ .inner_size = sizeof(RT_NODE_INNER_3) + 3 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_3) + 3 * sizeof(RT_VALUE_TYPE),
+ },
+ [RT_CLASS_32_MIN] = {
+ .name = "radix tree node 15",
+ .fanout = 15,
+ .inner_size = sizeof(RT_NODE_INNER_32) + 15 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_32) + 15 * sizeof(RT_VALUE_TYPE),
+ },
+ [RT_CLASS_32_MAX] = {
+ .name = "radix tree node 32",
+ .fanout = 32,
+ .inner_size = sizeof(RT_NODE_INNER_32) + 32 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_32) + 32 * sizeof(RT_VALUE_TYPE),
+ },
+ [RT_CLASS_125] = {
+ .name = "radix tree node 125",
+ .fanout = 125,
+ .inner_size = sizeof(RT_NODE_INNER_125) + 125 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_125) + 125 * sizeof(RT_VALUE_TYPE),
+ },
+ [RT_CLASS_256] = {
+ .name = "radix tree node 256",
+ .fanout = 256,
+ .inner_size = sizeof(RT_NODE_INNER_256),
+ .leaf_size = sizeof(RT_NODE_LEAF_256),
+ },
+};
+
+#define RT_SIZE_CLASS_COUNT lengthof(RT_SIZE_CLASS_INFO)
+
+#ifdef RT_SHMEM
+/* A magic value used to identify our radix tree */
+#define RT_RADIX_TREE_MAGIC 0x54A48167
+#endif
+
+/* Contains the actual tree and ancillary info */
+// WIP: this name is a bit strange
+typedef struct RT_RADIX_TREE_CONTROL
+{
+#ifdef RT_SHMEM
+ RT_HANDLE handle;
+ uint32 magic;
+ LWLock lock;
+#endif
+
+ RT_PTR_ALLOC root;
+ uint64 max_val;
+ uint64 num_keys;
+
+ /* statistics */
+#ifdef RT_DEBUG
+ int32 cnt[RT_SIZE_CLASS_COUNT];
+#endif
+} RT_RADIX_TREE_CONTROL;
+
+/* Entry point for allocating and accessing the tree */
+typedef struct RT_RADIX_TREE
+{
+ MemoryContext context;
+
+ /* pointing to either local memory or DSA */
+ RT_RADIX_TREE_CONTROL *ctl;
+
+#ifdef RT_SHMEM
+ dsa_area *dsa;
+#else
+ MemoryContextData *inner_slabs[RT_SIZE_CLASS_COUNT];
+ MemoryContextData *leaf_slabs[RT_SIZE_CLASS_COUNT];
+#endif
+} RT_RADIX_TREE;
+
+/*
+ * Iteration support.
+ *
+ * Iterating the radix tree returns each pair of key and value in the ascending
+ * order of the key. To support this, we iterate nodes at each level.
+ *
+ * RT_NODE_ITER struct is used to track the iteration within a node.
+ *
+ * RT_ITER is the struct for iteration of the radix tree, and uses RT_NODE_ITER
+ * in order to track the iteration of each level. During iteration, we also
+ * construct the key whenever updating the node iteration information, e.g., when
+ * advancing the current index within the node or when moving to the next node
+ * at the same level.
+ *
+ * XXX: Currently we allow only one process to do iteration. Therefore, rt_node_iter
+ * has the local pointers to nodes, rather than RT_PTR_ALLOC.
+ * We need either a safeguard to prevent other processes from beginning an
+ * iteration while one is in progress, or to allow multiple processes to iterate.
+ */
+typedef struct RT_NODE_ITER
+{
+ RT_PTR_LOCAL node; /* current node being iterated */
+ int current_idx; /* current position. -1 for initial value */
+} RT_NODE_ITER;
+
+typedef struct RT_ITER
+{
+ RT_RADIX_TREE *tree;
+
+ /* Track the iteration on nodes of each level */
+ RT_NODE_ITER stack[RT_MAX_LEVEL];
+ int stack_len;
+
+ /* The key is constructed during iteration */
+ uint64 key;
+} RT_ITER;
+
+
+static void RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
+ uint64 key, RT_PTR_ALLOC child);
+static bool RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
+ uint64 key, RT_VALUE_TYPE *value_p);
+
+/* verification (available only with assertion) */
+static void RT_VERIFY_NODE(RT_PTR_LOCAL node);
+
+/* Get the local address of an allocated node */
+static inline RT_PTR_LOCAL
+RT_PTR_GET_LOCAL(RT_RADIX_TREE *tree, RT_PTR_ALLOC node)
+{
+#ifdef RT_SHMEM
+ return dsa_get_address(tree->dsa, (dsa_pointer) node);
+#else
+ return node;
+#endif
+}
+
+static inline bool
+RT_PTR_ALLOC_IS_VALID(RT_PTR_ALLOC ptr)
+{
+#ifdef RT_SHMEM
+ return DsaPointerIsValid(ptr);
+#else
+ return PointerIsValid(ptr);
+#endif
+}
+
+/*
+ * Return index of the first element in the node's chunk array that equals
+ * 'chunk'. Return -1 if there is no such element.
+ */
+static inline int
+RT_NODE_3_SEARCH_EQ(RT_NODE_BASE_3 *node, uint8 chunk)
+{
+ int idx = -1;
+
+ for (int i = 0; i < node->n.count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ idx = i;
+ break;
+ }
+ }
+
+ return idx;
+}
+
+/*
+ * Return index of the chunk and slot arrays for inserting into the node,
+ * such that the chunk array remains ordered.
+ */
+static inline int
+RT_NODE_3_GET_INSERTPOS(RT_NODE_BASE_3 *node, uint8 chunk)
+{
+ int idx;
+
+ for (idx = 0; idx < node->n.count; idx++)
+ {
+ if (node->chunks[idx] >= chunk)
+ break;
+ }
+
+ return idx;
+}
+
+/*
+ * Return index of the first element in the node's chunk array that equals
+ * 'chunk'. Return -1 if there is no such element.
+ */
+static inline int
+RT_NODE_32_SEARCH_EQ(RT_NODE_BASE_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ uint32 bitfield;
+ int index_simd = -1;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index = -1;
+
+ for (int i = 0; i < count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ index = i;
+ break;
+ }
+ }
+#endif
+
+#ifndef USE_NO_SIMD
+ /* replicate the search key */
+ spread_chunk = vector8_broadcast(chunk);
+
+ /* compare to all 32 keys stored in the node */
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ cmp1 = vector8_eq(spread_chunk, haystack1);
+ cmp2 = vector8_eq(spread_chunk, haystack2);
+
+ /* convert comparison to a bitfield */
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+
+ /* mask off invalid entries */
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ /* convert bitfield to index by counting trailing zeros */
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+/*
+ * Return index of the chunk and slot arrays for inserting into the node,
+ * such that the chunk array remains ordered.
+ */
+static inline int
+RT_NODE_32_GET_INSERTPOS(RT_NODE_BASE_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ Vector8 min1;
+ Vector8 min2;
+ uint32 bitfield;
+ int index_simd;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index;
+
+ for (index = 0; index < count; index++)
+ {
+ /*
+ * This is coded with '>=' to match what we can do with SIMD,
+ * with an assert to keep us honest.
+ */
+ if (node->chunks[index] >= chunk)
+ {
+ Assert(node->chunks[index] != chunk);
+ break;
+ }
+ }
+#endif
+
+#ifndef USE_NO_SIMD
+ /*
+ * This is a bit more complicated than RT_NODE_32_SEARCH_EQ(), because
+ * no unsigned uint8 comparison instruction exists, at least for SSE2. So
+ * we need to play some trickery using vector8_min() to effectively get
+ * >=: since min(chunk, x) equals chunk exactly when x >= chunk, comparing
+ * the broadcast chunk against the element-wise minimum finds the elements
+ * that are >= chunk. There'll never be any equal elements in current uses,
+ * but that's what we get here...
+ */
+ spread_chunk = vector8_broadcast(chunk);
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ min1 = vector8_min(spread_chunk, haystack1);
+ min2 = vector8_min(spread_chunk, haystack2);
+ cmp1 = vector8_eq(spread_chunk, min1);
+ cmp2 = vector8_eq(spread_chunk, min2);
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+ else
+ index_simd = count;
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+
+/*
+ * Functions to manipulate both chunks array and children/values array.
+ * These are used for node-3 and node-32.
+ */
+
+/* Shift the elements at and after 'idx' one position to the right */
+static inline void
+RT_CHUNK_CHILDREN_ARRAY_SHIFT(uint8 *chunks, RT_PTR_ALLOC *children, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+ memmove(&(children[idx + 1]), &(children[idx]), sizeof(RT_PTR_ALLOC) * (count - idx));
+}
+
+static inline void
+RT_CHUNK_VALUES_ARRAY_SHIFT(uint8 *chunks, RT_VALUE_TYPE *values, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+ memmove(&(values[idx + 1]), &(values[idx]), sizeof(RT_VALUE_TYPE) * (count - idx));
+}
+
+/* Delete the element at 'idx' */
+static inline void
+RT_CHUNK_CHILDREN_ARRAY_DELETE(uint8 *chunks, RT_PTR_ALLOC *children, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(children[idx]), &(children[idx + 1]), sizeof(RT_PTR_ALLOC) * (count - idx - 1));
+}
+
+static inline void
+RT_CHUNK_VALUES_ARRAY_DELETE(uint8 *chunks, RT_VALUE_TYPE *values, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(values[idx]), &(values[idx + 1]), sizeof(RT_VALUE_TYPE) * (count - idx - 1));
+}
+
+/* Copy both chunks and children/values arrays */
+static inline void
+RT_CHUNK_CHILDREN_ARRAY_COPY(uint8 *src_chunks, RT_PTR_ALLOC *src_children,
+ uint8 *dst_chunks, RT_PTR_ALLOC *dst_children)
+{
+ const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_3].fanout;
+ const Size chunk_size = sizeof(uint8) * fanout;
+ const Size children_size = sizeof(RT_PTR_ALLOC) * fanout;
+
+ memcpy(dst_chunks, src_chunks, chunk_size);
+ memcpy(dst_children, src_children, children_size);
+}
+
+static inline void
+RT_CHUNK_VALUES_ARRAY_COPY(uint8 *src_chunks, RT_VALUE_TYPE *src_values,
+ uint8 *dst_chunks, RT_VALUE_TYPE *dst_values)
+{
+ const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_3].fanout;
+ const Size chunk_size = sizeof(uint8) * fanout;
+ const Size values_size = sizeof(RT_VALUE_TYPE) * fanout;
+
+ memcpy(dst_chunks, src_chunks, chunk_size);
+ memcpy(dst_values, src_values, values_size);
+}
+
+/* Functions to manipulate inner and leaf node-125 */
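+
+/*
+ * In node-125, the chunk byte indexes slot_idxs[], which in turn gives the
+ * position in the children/values array; the isset bitmap tracks which of
+ * those slots are currently in use.
+ */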
+
+/* Is the slot corresponding to the given chunk in use? */
+static inline bool
+RT_NODE_125_IS_CHUNK_USED(RT_NODE_BASE_125 *node, uint8 chunk)
+{
+ return node->slot_idxs[chunk] != RT_INVALID_SLOT_IDX;
+}
+
+static inline RT_PTR_ALLOC
+RT_NODE_INNER_125_GET_CHILD(RT_NODE_INNER_125 *node, uint8 chunk)
+{
+ Assert(!RT_NODE_IS_LEAF(node));
+ return node->children[node->base.slot_idxs[chunk]];
+}
+
+static inline RT_VALUE_TYPE
+RT_NODE_LEAF_125_GET_VALUE(RT_NODE_LEAF_125 *node, uint8 chunk)
+{
+ Assert(RT_NODE_IS_LEAF(node));
+ Assert(((RT_NODE_BASE_125 *) node)->slot_idxs[chunk] != RT_INVALID_SLOT_IDX);
+ return node->values[node->base.slot_idxs[chunk]];
+}
+
+/* Functions to manipulate inner and leaf node-256 */
+
+/* Return true if the slot corresponding to the given chunk is in use */
+static inline bool
+RT_NODE_INNER_256_IS_CHUNK_USED(RT_NODE_INNER_256 *node, uint8 chunk)
+{
+ Assert(!RT_NODE_IS_LEAF(node));
+ return node->children[chunk] != RT_INVALID_PTR_ALLOC;
+}
+
+static inline bool
+RT_NODE_LEAF_256_IS_CHUNK_USED(RT_NODE_LEAF_256 *node, uint8 chunk)
+{
+ int idx = RT_BM_IDX(chunk);
+ int bitnum = RT_BM_BIT(chunk);
+
+ Assert(RT_NODE_IS_LEAF(node));
+ return (node->isset[idx] & ((bitmapword) 1 << bitnum)) != 0;
+}
+
+static inline RT_PTR_ALLOC
+RT_NODE_INNER_256_GET_CHILD(RT_NODE_INNER_256 *node, uint8 chunk)
+{
+ Assert(!RT_NODE_IS_LEAF(node));
+ Assert(RT_NODE_INNER_256_IS_CHUNK_USED(node, chunk));
+ return node->children[chunk];
+}
+
+static inline RT_VALUE_TYPE
+RT_NODE_LEAF_256_GET_VALUE(RT_NODE_LEAF_256 *node, uint8 chunk)
+{
+ Assert(RT_NODE_IS_LEAF(node));
+ Assert(RT_NODE_LEAF_256_IS_CHUNK_USED(node, chunk));
+ return node->values[chunk];
+}
+
+/* Set the child in the node-256 */
+static inline void
+RT_NODE_INNER_256_SET(RT_NODE_INNER_256 *node, uint8 chunk, RT_PTR_ALLOC child)
+{
+ Assert(!RT_NODE_IS_LEAF(node));
+ node->children[chunk] = child;
+}
+
+/* Set the value in the node-256 */
+static inline void
+RT_NODE_LEAF_256_SET(RT_NODE_LEAF_256 *node, uint8 chunk, RT_VALUE_TYPE value)
+{
+ int idx = RT_BM_IDX(chunk);
+ int bitnum = RT_BM_BIT(chunk);
+
+ Assert(RT_NODE_IS_LEAF(node));
+ node->isset[idx] |= ((bitmapword) 1 << bitnum);
+ node->values[chunk] = value;
+}
+
+/* Delete the child or value at the given chunk position in the node-256 */
+static inline void
+RT_NODE_INNER_256_DELETE(RT_NODE_INNER_256 *node, uint8 chunk)
+{
+ Assert(!RT_NODE_IS_LEAF(node));
+ node->children[chunk] = RT_INVALID_PTR_ALLOC;
+}
+
+static inline void
+RT_NODE_LEAF_256_DELETE(RT_NODE_LEAF_256 *node, uint8 chunk)
+{
+ int idx = RT_BM_IDX(chunk);
+ int bitnum = RT_BM_BIT(chunk);
+
+ Assert(RT_NODE_IS_LEAF(node));
+ node->isset[idx] &= ~((bitmapword) 1 << bitnum);
+}
+
+/*
+ * Return the largest shift that allows storing the given key.
+ */
+static inline int
+RT_KEY_GET_SHIFT(uint64 key)
+{
+ if (key == 0)
+ return 0;
+ else
+ return (pg_leftmost_one_pos64(key) / RT_NODE_SPAN) * RT_NODE_SPAN;
+}
+
+/*
+ * Return the max value that can be stored in the tree with the given shift.
+ */
+static uint64
+RT_SHIFT_GET_MAX_VAL(int shift)
+{
+ if (shift == RT_MAX_SHIFT)
+ return UINT64_MAX;
+
+ return (UINT64CONST(1) << (shift + RT_NODE_SPAN)) - 1;
+}
+
+/*
+ * Allocate a new node of the given size class.
+ */
+static RT_PTR_ALLOC
+RT_ALLOC_NODE(RT_RADIX_TREE *tree, RT_SIZE_CLASS size_class, bool is_leaf)
+{
+ RT_PTR_ALLOC allocnode;
+ size_t allocsize;
+
+ if (is_leaf)
+ allocsize = RT_SIZE_CLASS_INFO[size_class].leaf_size;
+ else
+ allocsize = RT_SIZE_CLASS_INFO[size_class].inner_size;
+
+#ifdef RT_SHMEM
+ allocnode = dsa_allocate(tree->dsa, allocsize);
+#else
+ if (is_leaf)
+ allocnode = (RT_PTR_ALLOC) MemoryContextAlloc(tree->leaf_slabs[size_class],
+ allocsize);
+ else
+ allocnode = (RT_PTR_ALLOC) MemoryContextAlloc(tree->inner_slabs[size_class],
+ allocsize);
+#endif
+
+#ifdef RT_DEBUG
+ /* update the statistics */
+ tree->ctl->cnt[size_class]++;
+#endif
+
+ return allocnode;
+}
+
+/* Initialize the node contents */
+static inline void
+RT_INIT_NODE(RT_PTR_LOCAL node, uint8 kind, RT_SIZE_CLASS size_class, bool is_leaf)
+{
+ if (is_leaf)
+ MemSet(node, 0, RT_SIZE_CLASS_INFO[size_class].leaf_size);
+ else
+ MemSet(node, 0, RT_SIZE_CLASS_INFO[size_class].inner_size);
+
+ node->kind = kind;
+
+ if (kind == RT_NODE_KIND_256)
+ /* See comment for the RT_NODE type */
+ Assert(node->fanout == 0);
+ else
+ node->fanout = RT_SIZE_CLASS_INFO[size_class].fanout;
+
+ /* Initialize slot_idxs to invalid values */
+ if (kind == RT_NODE_KIND_125)
+ {
+ RT_NODE_BASE_125 *n125 = (RT_NODE_BASE_125 *) node;
+
+ memset(n125->slot_idxs, RT_INVALID_SLOT_IDX, sizeof(n125->slot_idxs));
+ }
+}
+
+/*
+ * Create a new node as the root. Subordinate nodes will be created during
+ * the insertion.
+ */
+static pg_noinline void
+RT_NEW_ROOT(RT_RADIX_TREE *tree, uint64 key)
+{
+ int shift = RT_KEY_GET_SHIFT(key);
+ bool is_leaf = shift == 0;
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+
+ allocnode = RT_ALLOC_NODE(tree, RT_CLASS_3, is_leaf);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(newnode, RT_NODE_KIND_3, RT_CLASS_3, is_leaf);
+ newnode->shift = shift;
+ tree->ctl->max_val = RT_SHIFT_GET_MAX_VAL(shift);
+ tree->ctl->root = allocnode;
+}
+
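+/* Copy the common fields (shift and count) from the old node to the new one */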
+static inline void
+RT_COPY_NODE(RT_PTR_LOCAL newnode, RT_PTR_LOCAL oldnode)
+{
+ newnode->shift = oldnode->shift;
+ newnode->count = oldnode->count;
+}
+
+/*
+ * Given a newly allocated node and an old node, initialize the new
+ * node with the necessary fields and return its local pointer.
+ */
+static inline RT_PTR_LOCAL
+RT_SWITCH_NODE_KIND(RT_RADIX_TREE *tree, RT_PTR_ALLOC allocnode, RT_PTR_LOCAL node,
+ uint8 new_kind, uint8 new_class, bool is_leaf)
+{
+ RT_PTR_LOCAL newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(newnode, new_kind, new_class, is_leaf);
+ RT_COPY_NODE(newnode, node);
+
+ return newnode;
+}
+
+/* Free the given node */
+static void
+RT_FREE_NODE(RT_RADIX_TREE *tree, RT_PTR_ALLOC allocnode)
+{
+ /* If we're deleting the root node, make the tree empty */
+ if (tree->ctl->root == allocnode)
+ {
+ tree->ctl->root = RT_INVALID_PTR_ALLOC;
+ tree->ctl->max_val = 0;
+ }
+
+#ifdef RT_DEBUG
+ {
+ int i;
+ RT_PTR_LOCAL node = RT_PTR_GET_LOCAL(tree, allocnode);
+
+ /* update the statistics */
+ for (i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ if (node->fanout == RT_SIZE_CLASS_INFO[i].fanout)
+ break;
+ }
+
+ /* fanout of node256 is intentionally 0 */
+ if (i == RT_SIZE_CLASS_COUNT)
+ i = RT_CLASS_256;
+
+ tree->ctl->cnt[i]--;
+ Assert(tree->ctl->cnt[i] >= 0);
+ }
+#endif
+
+#ifdef RT_SHMEM
+ dsa_free(tree->dsa, allocnode);
+#else
+ pfree(allocnode);
+#endif
+}
+
+/* Update the parent's pointer when growing a node */
+static inline void
+RT_NODE_UPDATE_INNER(RT_PTR_LOCAL node, uint64 key, RT_PTR_ALLOC new_child)
+{
+#define RT_ACTION_UPDATE
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_INNER
+#undef RT_ACTION_UPDATE
+}
+
+/*
+ * Replace old_child with new_child, and free the old one.
+ */
+static inline void
+RT_REPLACE_NODE(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent,
+ RT_PTR_ALLOC stored_old_child, RT_PTR_LOCAL old_child,
+ RT_PTR_ALLOC new_child, uint64 key)
+{
+#ifdef USE_ASSERT_CHECKING
+ RT_PTR_LOCAL new = RT_PTR_GET_LOCAL(tree, new_child);
+
+ Assert(old_child->shift == new->shift);
+ Assert(old_child->count == new->count);
+#endif
+
+ if (parent == old_child)
+ {
+ /* Replace the root node with the new larger node */
+ tree->ctl->root = new_child;
+ }
+ else
+ RT_NODE_UPDATE_INNER(parent, key, new_child);
+
+ RT_FREE_NODE(tree, stored_old_child);
+}
+
+/*
+ * The radix tree doesn't have sufficient height. Extend the radix tree so
+ * it can store the key.
+ */
+static pg_noinline void
+RT_EXTEND(RT_RADIX_TREE *tree, uint64 key)
+{
+ int target_shift;
+ RT_PTR_LOCAL root = RT_PTR_GET_LOCAL(tree, tree->ctl->root);
+ int shift = root->shift + RT_NODE_SPAN;
+
+ target_shift = RT_KEY_GET_SHIFT(key);
+
+ /* Grow tree from 'shift' to 'target_shift' */
+ while (shift <= target_shift)
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL node;
+ RT_NODE_INNER_3 *n3;
+
+ allocnode = RT_ALLOC_NODE(tree, RT_CLASS_3, true);
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(node, RT_NODE_KIND_3, RT_CLASS_3, true);
+ node->shift = shift;
+ node->count = 1;
+
+ n3 = (RT_NODE_INNER_3 *) node;
+ n3->base.chunks[0] = 0;
+ n3->children[0] = tree->ctl->root;
+
+ /* Update the root */
+ tree->ctl->root = allocnode;
+
+ shift += RT_NODE_SPAN;
+ }
+
+ tree->ctl->max_val = RT_SHIFT_GET_MAX_VAL(target_shift);
+}
+
+/*
+ * The radix tree doesn't yet have the inner and leaf nodes for the given
+ * key-value pair. Insert inner and leaf nodes from 'node' down to the bottom.
+ */
+static pg_noinline void
+RT_SET_EXTEND(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p, RT_PTR_LOCAL parent,
+ RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node)
+{
+ int shift = node->shift;
+
+ Assert(RT_PTR_GET_LOCAL(tree, stored_node) == node);
+
+ while (shift >= RT_NODE_SPAN)
+ {
+ RT_PTR_ALLOC allocchild;
+ RT_PTR_LOCAL newchild;
+ int newshift = shift - RT_NODE_SPAN;
+ bool is_leaf = newshift == 0;
+
+ allocchild = RT_ALLOC_NODE(tree, RT_CLASS_3, is_leaf);
+ newchild = RT_PTR_GET_LOCAL(tree, allocchild);
+ RT_INIT_NODE(newchild, RT_NODE_KIND_3, RT_CLASS_3, is_leaf);
+ newchild->shift = newshift;
+ RT_NODE_INSERT_INNER(tree, parent, stored_node, node, key, allocchild);
+
+ parent = node;
+ node = newchild;
+ stored_node = allocchild;
+ shift -= RT_NODE_SPAN;
+ }
+
+ RT_NODE_INSERT_LEAF(tree, parent, stored_node, node, key, value_p);
+ tree->ctl->num_keys++;
+}
+
+/*
+ * Search for the child pointer corresponding to 'key' in the given node.
+ *
+ * Return true if the key is found, otherwise return false. On success, the
+ * child pointer is stored in *child_p.
+ */
+static inline bool
+RT_NODE_SEARCH_INNER(RT_PTR_LOCAL node, uint64 key, RT_PTR_ALLOC *child_p)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/*
+ * Search for the value corresponding to 'key' in the given node.
+ *
+ * Return true if the key is found, otherwise return false. On success, the
+ * value is copied to *value_p.
+ */
+static inline bool
+RT_NODE_SEARCH_LEAF(RT_PTR_LOCAL node, uint64 key, RT_VALUE_TYPE *value_p)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
+#ifdef RT_USE_DELETE
+/*
+ * Search for the child pointer corresponding to 'key' in the given node.
+ *
+ * Delete the entry and return true if the key is found, otherwise return false.
+ */
+static inline bool
+RT_NODE_DELETE_INNER(RT_PTR_LOCAL node, uint64 key)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_delete_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/*
+ * Search for the value corresponding to 'key' in the given node.
+ *
+ * Delete the entry and return true if the key is found, otherwise return false.
+ */
+static inline bool
+RT_NODE_DELETE_LEAF(RT_PTR_LOCAL node, uint64 key)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_delete_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+#endif
+
+/*
+ * Insert "child" into "node".
+ *
+ * "parent" is the parent of "node", so the grandparent of the child.
+ * If the node we're inserting into needs to grow, we update the parent's
+ * child pointer with the pointer to the new larger node.
+ */
+static void
+RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
+ uint64 key, RT_PTR_ALLOC child)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_insert_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/* Like RT_NODE_INSERT_INNER, but for leaf nodes */
+static bool
+RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
+ uint64 key, RT_VALUE_TYPE *value_p)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_insert_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
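+/*
+ * Example of basic local-memory usage (an illustrative sketch only):
+ *
+ *		RT_RADIX_TREE *tree = RT_CREATE(CurrentMemoryContext);
+ *		RT_VALUE_TYPE value = ...;
+ *
+ *		RT_SET(tree, key, &value);
+ *		if (RT_SEARCH(tree, key, &value))
+ *			... the key exists and 'value' now holds its value ...
+ *		RT_FREE(tree);
+ *
+ * The shared-memory variant additionally takes a dsa_area and a tranche id
+ * in RT_CREATE, and other backends attach with RT_ATTACH.
+ */
+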
+/*
+ * Create the radix tree in the given memory context and return it.
+ */
+RT_SCOPE RT_RADIX_TREE *
+#ifdef RT_SHMEM
+RT_CREATE(MemoryContext ctx, dsa_area *dsa, int tranche_id)
+#else
+RT_CREATE(MemoryContext ctx)
+#endif
+{
+ RT_RADIX_TREE *tree;
+ MemoryContext old_ctx;
+#ifdef RT_SHMEM
+ dsa_pointer dp;
+#endif
+
+ old_ctx = MemoryContextSwitchTo(ctx);
+
+ tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE));
+ tree->context = ctx;
+
+#ifdef RT_SHMEM
+ tree->dsa = dsa;
+ dp = dsa_allocate0(dsa, sizeof(RT_RADIX_TREE_CONTROL));
+ tree->ctl = (RT_RADIX_TREE_CONTROL *) dsa_get_address(dsa, dp);
+ tree->ctl->handle = dp;
+ tree->ctl->magic = RT_RADIX_TREE_MAGIC;
+ LWLockInitialize(&tree->ctl->lock, tranche_id);
+#else
+ tree->ctl = (RT_RADIX_TREE_CONTROL *) palloc0(sizeof(RT_RADIX_TREE_CONTROL));
+
+ /* Create a slab context for each size class */
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ RT_SIZE_CLASS_ELEM size_class = RT_SIZE_CLASS_INFO[i];
+ size_t inner_blocksize = RT_SLAB_BLOCK_SIZE(size_class.inner_size);
+ size_t leaf_blocksize = RT_SLAB_BLOCK_SIZE(size_class.leaf_size);
+
+ tree->inner_slabs[i] = SlabContextCreate(ctx,
+ size_class.name,
+ inner_blocksize,
+ size_class.inner_size);
+ tree->leaf_slabs[i] = SlabContextCreate(ctx,
+ size_class.name,
+ leaf_blocksize,
+ size_class.leaf_size);
+ }
+#endif
+
+ tree->ctl->root = RT_INVALID_PTR_ALLOC;
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return tree;
+}
+
+#ifdef RT_SHMEM
+RT_SCOPE RT_RADIX_TREE *
+RT_ATTACH(dsa_area *dsa, RT_HANDLE handle)
+{
+ RT_RADIX_TREE *tree;
+ dsa_pointer control;
+
+ tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE));
+
+ /* Find the control object in shared memory */
+ control = handle;
+
+ tree->dsa = dsa;
+ tree->ctl = (RT_RADIX_TREE_CONTROL *) dsa_get_address(dsa, control);
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+
+ return tree;
+}
+
+RT_SCOPE void
+RT_DETACH(RT_RADIX_TREE *tree)
+{
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+ pfree(tree);
+}
+
+RT_SCOPE RT_HANDLE
+RT_GET_HANDLE(RT_RADIX_TREE *tree)
+{
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+ return tree->ctl->handle;
+}
+
+/*
+ * Recursively free all nodes allocated in the DSA area.
+ */
+static void
+RT_FREE_RECURSE(RT_RADIX_TREE *tree, RT_PTR_ALLOC ptr)
+{
+ RT_PTR_LOCAL node = RT_PTR_GET_LOCAL(tree, ptr);
+
+ check_stack_depth();
+ CHECK_FOR_INTERRUPTS();
+
+ /* The leaf node doesn't have child pointers */
+ if (RT_NODE_IS_LEAF(node))
+ {
+ dsa_free(tree->dsa, ptr);
+ return;
+ }
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE_INNER_3 *n3 = (RT_NODE_INNER_3 *) node;
+
+ for (int i = 0; i < n3->base.n.count; i++)
+ RT_FREE_RECURSE(tree, n3->children[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE_INNER_32 *n32 = (RT_NODE_INNER_32 *) node;
+
+ for (int i = 0; i < n32->base.n.count; i++)
+ RT_FREE_RECURSE(tree, n32->children[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE_INNER_125 *n125 = (RT_NODE_INNER_125 *) node;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(&n125->base, i))
+ continue;
+
+ RT_FREE_RECURSE(tree, RT_NODE_INNER_125_GET_CHILD(n125, i));
+ }
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE_INNER_256 *n256 = (RT_NODE_INNER_256 *) node;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
+ continue;
+
+ RT_FREE_RECURSE(tree, RT_NODE_INNER_256_GET_CHILD(n256, i));
+ }
+
+ break;
+ }
+ }
+
+ /* Free the inner node */
+ dsa_free(tree->dsa, ptr);
+}
+#endif
+
+/*
+ * Free the given radix tree.
+ */
+RT_SCOPE void
+RT_FREE(RT_RADIX_TREE *tree)
+{
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+
+ /* Free all memory used for radix tree nodes */
+ if (RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ RT_FREE_RECURSE(tree, tree->ctl->root);
+
+ /*
+ * Vandalize the control block to help catch programming errors where
+ * other backends access the memory formerly occupied by this radix tree.
+ */
+ tree->ctl->magic = 0;
+ dsa_free(tree->dsa, tree->ctl->handle);
+#else
+ pfree(tree->ctl);
+
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ MemoryContextDelete(tree->inner_slabs[i]);
+ MemoryContextDelete(tree->leaf_slabs[i]);
+ }
+#endif
+
+ pfree(tree);
+}
+
+/*
+ * Set the key to the value pointed to by 'value_p'. If the entry already
+ * exists, update its value and return true; return false if the entry
+ * didn't yet exist.
+ */
+RT_SCOPE bool
+RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p)
+{
+ int shift;
+ bool updated;
+ RT_PTR_LOCAL parent;
+ RT_PTR_ALLOC stored_child;
+ RT_PTR_LOCAL child;
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+
+ RT_LOCK_EXCLUSIVE(tree);
+
+ /* Empty tree, create the root */
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ RT_NEW_ROOT(tree, key);
+
+ /* Extend the tree if necessary */
+ if (key > tree->ctl->max_val)
+ RT_EXTEND(tree, key);
+
+ stored_child = tree->ctl->root;
+ parent = RT_PTR_GET_LOCAL(tree, stored_child);
+ shift = parent->shift;
+
+ /* Descend the tree until we reach a leaf node */
+ while (shift >= 0)
+ {
+ RT_PTR_ALLOC new_child = RT_INVALID_PTR_ALLOC;
+
+ child = RT_PTR_GET_LOCAL(tree, stored_child);
+
+ if (RT_NODE_IS_LEAF(child))
+ break;
+
+ if (!RT_NODE_SEARCH_INNER(child, key, &new_child))
+ {
+ RT_SET_EXTEND(tree, key, value_p, parent, stored_child, child);
+ RT_UNLOCK(tree);
+ return false;
+ }
+
+ parent = child;
+ stored_child = new_child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ updated = RT_NODE_INSERT_LEAF(tree, parent, stored_child, child, key, value_p);
+
+ /* Update the statistics */
+ if (!updated)
+ tree->ctl->num_keys++;
+
+ RT_UNLOCK(tree);
+ return updated;
+}
+
+/*
+ * Search for the given key in the radix tree. Return true if the key is
+ * found, otherwise return false. On success, the value is copied to
+ * *value_p, so value_p must not be NULL.
+ */
+RT_SCOPE bool
+RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p)
+{
+ RT_PTR_LOCAL node;
+ int shift;
+ bool found;
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+ Assert(value_p != NULL);
+
+ RT_LOCK_SHARED(tree);
+
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root) || key > tree->ctl->max_val)
+ {
+ RT_UNLOCK(tree);
+ return false;
+ }
+
+ node = RT_PTR_GET_LOCAL(tree, tree->ctl->root);
+ shift = node->shift;
+
+ /* Descend the tree until we reach a leaf node */
+ while (shift >= 0)
+ {
+ RT_PTR_ALLOC child = RT_INVALID_PTR_ALLOC;
+
+ if (RT_NODE_IS_LEAF(node))
+ break;
+
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ {
+ RT_UNLOCK(tree);
+ return false;
+ }
+
+ node = RT_PTR_GET_LOCAL(tree, child);
+ shift -= RT_NODE_SPAN;
+ }
+
+ found = RT_NODE_SEARCH_LEAF(node, key, value_p);
+
+ RT_UNLOCK(tree);
+ return found;
+}
+
+#ifdef RT_USE_DELETE
+/*
+ * Delete the given key from the radix tree. Return true if the key is found (and
+ * deleted), otherwise do nothing and return false.
+ */
+RT_SCOPE bool
+RT_DELETE(RT_RADIX_TREE *tree, uint64 key)
+{
+ RT_PTR_LOCAL node;
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_ALLOC stack[RT_MAX_LEVEL] = {0};
+ int shift;
+ int level;
+ bool deleted;
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+
+ RT_LOCK_EXCLUSIVE(tree);
+
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root) || key > tree->ctl->max_val)
+ {
+ RT_UNLOCK(tree);
+ return false;
+ }
+
+ /*
+ * Descend the tree to search for the key while building a stack of nodes we
+ * visited.
+ */
+ allocnode = tree->ctl->root;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ shift = node->shift;
+ level = -1;
+ while (shift > 0)
+ {
+ RT_PTR_ALLOC child = RT_INVALID_PTR_ALLOC;
+
+ /* Push the current node to the stack */
+ stack[++level] = allocnode;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ {
+ RT_UNLOCK(tree);
+ return false;
+ }
+
+ allocnode = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ /* Delete the key from the leaf node if it exists */
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ deleted = RT_NODE_DELETE_LEAF(node, key);
+
+ if (!deleted)
+ {
+ /* the key was not found in the leaf node */
+ RT_UNLOCK(tree);
+ return false;
+ }
+
+ /* Found the key to delete. Update the statistics */
+ tree->ctl->num_keys--;
+
+ /*
+ * Return if the leaf node still has keys, in which case we don't need to
+ * delete the node.
+ */
+ if (node->count > 0)
+ {
+ RT_UNLOCK(tree);
+ return true;
+ }
+
+ /* Free the empty leaf node */
+ RT_FREE_NODE(tree, allocnode);
+
+ /* Delete the key from the inner nodes, walking back up the stack */
+ while (level >= 0)
+ {
+ allocnode = stack[level--];
+
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ deleted = RT_NODE_DELETE_INNER(node, key);
+ Assert(deleted);
+
+ /* If the node didn't become empty, we stop deleting the key */
+ if (node->count > 0)
+ break;
+
+ /* The node became empty */
+ RT_FREE_NODE(tree, allocnode);
+ }
+
+ RT_UNLOCK(tree);
+ return true;
+}
+#endif
+
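+/*
+ * Replace the chunk of the key under construction at the given shift with
+ * 'chunk', leaving the other bits untouched.
+ */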
+static inline void
+RT_ITER_UPDATE_KEY(RT_ITER *iter, uint8 chunk, uint8 shift)
+{
+ iter->key &= ~(((uint64) RT_CHUNK_MASK) << shift);
+ iter->key |= (((uint64) chunk) << shift);
+}
+
+/*
+ * Advance the slot in the inner node. Return the child if one exists,
+ * otherwise NULL.
+ */
+static inline RT_PTR_LOCAL
+RT_NODE_INNER_ITERATE_NEXT(RT_ITER *iter, RT_NODE_ITER *node_iter)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_iter_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/*
+ * Advance the slot in the leaf node. On success, return true and store the
+ * value in *value_p; otherwise return false.
+ */
+static inline bool
+RT_NODE_LEAF_ITERATE_NEXT(RT_ITER *iter, RT_NODE_ITER *node_iter,
+ RT_VALUE_TYPE *value_p)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_iter_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
+/*
+ * Update each node_iter for inner nodes in the iterator node stack.
+ */
+static void
+RT_UPDATE_ITER_STACK(RT_ITER *iter, RT_PTR_LOCAL from_node, int from)
+{
+ int level = from;
+ RT_PTR_LOCAL node = from_node;
+
+ for (;;)
+ {
+ RT_NODE_ITER *node_iter = &(iter->stack[level--]);
+
+ node_iter->node = node;
+ node_iter->current_idx = -1;
+
+ /* We don't advance the leaf node iterator here */
+ if (RT_NODE_IS_LEAF(node))
+ return;
+
+ /* Advance to the next slot in the inner node */
+ node = RT_NODE_INNER_ITERATE_NEXT(iter, node_iter);
+
+ /* We must find the first child in the node */
+ Assert(node);
+ }
+}
+
+/*
+ * Create and return the iterator for the given radix tree.
+ *
+ * The radix tree is locked in shared mode during the iteration, so
+ * RT_END_ITERATE needs to be called when finished to release the lock.
+ */
+RT_SCOPE RT_ITER *
+RT_BEGIN_ITERATE(RT_RADIX_TREE *tree)
+{
+ MemoryContext old_ctx;
+ RT_ITER *iter;
+ RT_PTR_LOCAL root;
+ int top_level;
+
+ old_ctx = MemoryContextSwitchTo(tree->context);
+
+ iter = (RT_ITER *) palloc0(sizeof(RT_ITER));
+ iter->tree = tree;
+
+ RT_LOCK_SHARED(tree);
+
+ /* empty tree */
+ if (!RT_PTR_ALLOC_IS_VALID(iter->tree->ctl->root))
+ {
+ MemoryContextSwitchTo(old_ctx);
+ return iter;
+ }
+
+ root = RT_PTR_GET_LOCAL(tree, iter->tree->ctl->root);
+ top_level = root->shift / RT_NODE_SPAN;
+ iter->stack_len = top_level;
+
+ /*
+ * Descend to the leftmost leaf node from the root. The key is constructed
+ * while descending to the leaf.
+ */
+ RT_UPDATE_ITER_STACK(iter, root, top_level);
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return iter;
+}
+
+/*
+ * Return true and set *key_p and *value_p if there is a next key; otherwise
+ * return false.
+ */
+RT_SCOPE bool
+RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, RT_VALUE_TYPE *value_p)
+{
+ /* Empty tree */
+ if (!iter->tree->ctl->root)
+ return false;
+
+ for (;;)
+ {
+ RT_PTR_LOCAL child = NULL;
+ RT_VALUE_TYPE value;
+ int level;
+ bool found;
+
+ /* Advance the leaf node iterator to get next key-value pair */
+ found = RT_NODE_LEAF_ITERATE_NEXT(iter, &(iter->stack[0]), &value);
+
+ if (found)
+ {
+ *key_p = iter->key;
+ *value_p = value;
+ return true;
+ }
+
+ /*
+ * We've visited all values in the leaf node, so advance the inner node
+ * iterators from level 1 upward until we find the next child node.
+ */
+ for (level = 1; level <= iter->stack_len; level++)
+ {
+ child = RT_NODE_INNER_ITERATE_NEXT(iter, &(iter->stack[level]));
+
+ if (child)
+ break;
+ }
+
+ /* the iteration finished */
+ if (!child)
+ return false;
+
+ /*
+ * Found the next child node. Update the iterator stack from this node
+ * down to the leaf.
+ */
+ RT_UPDATE_ITER_STACK(iter, child, level - 1);
+
+ /* Node iterators are updated, so try again from the leaf */
+ }
+
+ return false;
+}
+
+/*
+ * Terminate the iteration and release the lock.
+ *
+ * This function needs to be called after finishing the iteration, or when
+ * exiting it early.
+ */
+RT_SCOPE void
+RT_END_ITERATE(RT_ITER *iter)
+{
+#ifdef RT_SHMEM
+ Assert(LWLockHeldByMe(&iter->tree->ctl->lock));
+#endif
+
+ RT_UNLOCK(iter->tree);
+ pfree(iter);
+}
+
+/*
+ * Return the amount of memory used by the radix tree.
+ */
+RT_SCOPE uint64
+RT_MEMORY_USAGE(RT_RADIX_TREE *tree)
+{
+ Size total = 0;
+
+ RT_LOCK_SHARED(tree);
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+ total = dsa_get_total_size(tree->dsa);
+#else
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
+ total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
+ }
+#endif
+
+ RT_UNLOCK(tree);
+ return total;
+}
+
+/*
+ * Verify the radix tree node.
+ */
+static void
+RT_VERIFY_NODE(RT_PTR_LOCAL node)
+{
+#ifdef USE_ASSERT_CHECKING
+ Assert(node->count >= 0);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE_BASE_3 *n3 = (RT_NODE_BASE_3 *) node;
+
+ for (int i = 1; i < n3->n.count; i++)
+ Assert(n3->chunks[i - 1] < n3->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE_BASE_32 *n32 = (RT_NODE_BASE_32 *) node;
+
+ for (int i = 1; i < n32->n.count; i++)
+ Assert(n32->chunks[i - 1] < n32->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE_BASE_125 *n125 = (RT_NODE_BASE_125 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ uint8 slot = n125->slot_idxs[i];
+ int idx = RT_BM_IDX(slot);
+ int bitnum = RT_BM_BIT(slot);
+
+ if (!RT_NODE_125_IS_CHUNK_USED(n125, i))
+ continue;
+
+ /* Check if the corresponding slot is used */
+ Assert(slot < node->fanout);
+ Assert((n125->isset[idx] & ((bitmapword) 1 << bitnum)) != 0);
+
+ cnt++;
+ }
+
+ Assert(n125->n.count == cnt);
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_256 *n256 = (RT_NODE_LEAF_256 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RT_BM_IDX(RT_NODE_MAX_SLOTS); i++)
+ cnt += bmw_popcount(n256->isset[i]);
+
+ /* Check that the number of used chunks matches the count */
+ Assert(n256->base.n.count == cnt);
+
+ break;
+ }
+ }
+ }
+#endif
+}
+
+/***************** DEBUG FUNCTIONS *****************/
+#ifdef RT_DEBUG
+
+#define RT_UINT64_FORMAT_HEX "%" INT64_MODIFIER "X"
+
+RT_SCOPE void
+RT_STATS(RT_RADIX_TREE *tree)
+{
+ RT_LOCK_SHARED(tree);
+
+ fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->ctl->max_val);
+ fprintf(stderr, "num_keys = " UINT64_FORMAT "\n", tree->ctl->num_keys);
+
+#ifdef RT_SHMEM
+ fprintf(stderr, "handle = " UINT64_FORMAT "\n", tree->ctl->handle);
+#endif
+
+ if (RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ {
+ RT_PTR_LOCAL root = RT_PTR_GET_LOCAL(tree, tree->ctl->root);
+
+ fprintf(stderr, "height = %d, n3 = %u, n15 = %u, n32 = %u, n125 = %u, n256 = %u\n",
+ root->shift / RT_NODE_SPAN,
+ tree->ctl->cnt[RT_CLASS_3],
+ tree->ctl->cnt[RT_CLASS_32_MIN],
+ tree->ctl->cnt[RT_CLASS_32_MAX],
+ tree->ctl->cnt[RT_CLASS_125],
+ tree->ctl->cnt[RT_CLASS_256]);
+ }
+
+ RT_UNLOCK(tree);
+}
+
+static void
+RT_DUMP_NODE(RT_RADIX_TREE *tree, RT_PTR_ALLOC allocnode, int level,
+ bool recurse, StringInfo buf)
+{
+ RT_PTR_LOCAL node = RT_PTR_GET_LOCAL(tree, allocnode);
+ StringInfoData spaces;
+
+ initStringInfo(&spaces);
+ appendStringInfoSpaces(&spaces, (level * 4) + 1);
+
+ appendStringInfo(buf, "%s%s[%s] kind %d, fanout %d, count %u, shift %u:\n",
+ spaces.data,
+ level == 0 ? "" : "-> ",
+ RT_NODE_IS_LEAF(node) ? "LEAF" : "INNR",
+ (node->kind == RT_NODE_KIND_3) ? 3 :
+ (node->kind == RT_NODE_KIND_32) ? 32 :
+ (node->kind == RT_NODE_KIND_125) ? 125 : 256,
+ node->fanout == 0 ? 256 : node->fanout,
+ node->count, node->shift);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_3 *n3 = (RT_NODE_LEAF_3 *) node;
+
+ appendStringInfo(buf, "%schunk[%d] 0x%X\n",
+ spaces.data, i, n3->base.chunks[i]);
+ }
+ else
+ {
+ RT_NODE_INNER_3 *n3 = (RT_NODE_INNER_3 *) node;
+
+ appendStringInfo(buf, "%schunk[%d] 0x%X",
+ spaces.data, i, n3->base.chunks[i]);
+
+ if (recurse)
+ {
+ appendStringInfo(buf, "\n");
+ RT_DUMP_NODE(tree, n3->children[i], level + 1,
+ recurse, buf);
+ }
+ else
+ appendStringInfo(buf, " (skipped)\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_32 *n32 = (RT_NODE_LEAF_32 *) node;
+
+ appendStringInfo(buf, "%schunk[%d] 0x%X\n",
+ spaces.data, i, n32->base.chunks[i]);
+ }
+ else
+ {
+ RT_NODE_INNER_32 *n32 = (RT_NODE_INNER_32 *) node;
+
+ appendStringInfo(buf, "%schunk[%d] 0x%X",
+ spaces.data, i, n32->base.chunks[i]);
+
+ if (recurse)
+ {
+ appendStringInfo(buf, "\n");
+ RT_DUMP_NODE(tree, n32->children[i], level + 1,
+ recurse, buf);
+ }
+ else
+ appendStringInfo(buf, " (skipped)\n");
+
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE_BASE_125 *b125 = (RT_NODE_BASE_125 *) node;
+ char *sep = "";
+
+ appendStringInfo(buf, "%sslot_idxs: ", spaces.data);
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(b125, i))
+ continue;
+
+ appendStringInfo(buf, "%s[%d]=%d ",
+ sep, i, b125->slot_idxs[i]);
+ sep = ",";
+ }
+
+ appendStringInfo(buf, "\n%sisset-bitmap: ", spaces.data);
+ for (int i = 0; i < (RT_SLOT_IDX_LIMIT / BITS_PER_BYTE); i++)
+ appendStringInfo(buf, "%X ", ((uint8 *) b125->isset)[i]);
+ appendStringInfo(buf, "\n");
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(b125, i))
+ continue;
+
+ if (RT_NODE_IS_LEAF(node))
+ appendStringInfo(buf, "%schunk 0x%X\n",
+ spaces.data, i);
+ else
+ {
+ RT_NODE_INNER_125 *n125 = (RT_NODE_INNER_125 *) b125;
+
+ appendStringInfo(buf, "%schunk 0x%X",
+ spaces.data, i);
+
+ if (recurse)
+ {
+ appendStringInfo(buf, "\n");
+ RT_DUMP_NODE(tree, RT_NODE_INNER_125_GET_CHILD(n125, i),
+ level + 1, recurse, buf);
+ }
+ else
+ appendStringInfo(buf, " (skipped)\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_256 *n256 = (RT_NODE_LEAF_256 *) node;
+
+ appendStringInfo(buf, "%sisset-bitmap: ", spaces.data);
+ for (int i = 0; i < (RT_SLOT_IDX_LIMIT / BITS_PER_BYTE); i++)
+ appendStringInfo(buf, "%X ", ((uint8 *) n256->isset)[i]);
+ appendStringInfo(buf, "\n");
+ }
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_256 *n256 = (RT_NODE_LEAF_256 *) node;
+
+ if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, i))
+ continue;
+
+ appendStringInfo(buf, "%schunk 0x%X\n",
+ spaces.data, i);
+ }
+ else
+ {
+ RT_NODE_INNER_256 *n256 = (RT_NODE_INNER_256 *) node;
+
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
+ continue;
+
+ appendStringInfo(buf, "%schunk 0x%X",
+ spaces.data, i);
+
+ if (recurse)
+ {
+ appendStringInfo(buf, "\n");
+ RT_DUMP_NODE(tree, RT_NODE_INNER_256_GET_CHILD(n256, i),
+ level + 1, recurse, buf);
+ }
+ else
+ appendStringInfo(buf, " (skipped)\n");
+ }
+ }
+ break;
+ }
+ }
+}
+
+RT_SCOPE void
+RT_DUMP_SEARCH(RT_RADIX_TREE *tree, uint64 key)
+{
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL node;
+ StringInfoData buf;
+ int shift;
+ int level = 0;
+
+ RT_STATS(tree);
+
+ RT_LOCK_SHARED(tree);
+
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ {
+ RT_UNLOCK(tree);
+ fprintf(stderr, "empty tree\n");
+ return;
+ }
+
+ if (key > tree->ctl->max_val)
+ {
+ RT_UNLOCK(tree);
+ fprintf(stderr, "key " UINT64_FORMAT "(0x" RT_UINT64_FORMAT_HEX ") is larger than max val\n",
+ key, key);
+ return;
+ }
+
+ initStringInfo(&buf);
+ allocnode = tree->ctl->root;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ shift = node->shift;
+ while (shift >= 0)
+ {
+ RT_PTR_ALLOC child;
+
+ RT_DUMP_NODE(tree, allocnode, level, false, &buf);
+
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_VALUE_TYPE dummy;
+
+ /* We reached a leaf node; find the corresponding slot */
+ RT_NODE_SEARCH_LEAF(node, key, &dummy);
+
+ break;
+ }
+
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ break;
+
+ allocnode = child;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ shift -= RT_NODE_SPAN;
+ level++;
+ }
+ RT_UNLOCK(tree);
+
+ fprintf(stderr, "%s", buf.data);
+}
+
+RT_SCOPE void
+RT_DUMP(RT_RADIX_TREE *tree)
+{
+ StringInfoData buf;
+
+ RT_STATS(tree);
+
+ RT_LOCK_SHARED(tree);
+
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ {
+ RT_UNLOCK(tree);
+ fprintf(stderr, "empty tree\n");
+ return;
+ }
+
+ initStringInfo(&buf);
+
+ RT_DUMP_NODE(tree, tree->ctl->root, 0, true, &buf);
+ RT_UNLOCK(tree);
+
+ fprintf(stderr, "%s", buf.data);
+}
+#endif
+
+#endif /* RT_DEFINE */
+
+
+/* undefine external parameters, so next radix tree can be defined */
+#undef RT_PREFIX
+#undef RT_SCOPE
+#undef RT_DECLARE
+#undef RT_DEFINE
+#undef RT_VALUE_TYPE
+
+/* locally declared macros */
+#undef RT_MAKE_PREFIX
+#undef RT_MAKE_NAME
+#undef RT_MAKE_NAME_
+#undef RT_NODE_SPAN
+#undef RT_NODE_MAX_SLOTS
+#undef RT_CHUNK_MASK
+#undef RT_MAX_SHIFT
+#undef RT_MAX_LEVEL
+#undef RT_GET_KEY_CHUNK
+#undef RT_BM_IDX
+#undef RT_BM_BIT
+#undef RT_LOCK_EXCLUSIVE
+#undef RT_LOCK_SHARED
+#undef RT_UNLOCK
+#undef RT_NODE_IS_LEAF
+#undef RT_NODE_MUST_GROW
+#undef RT_NODE_KIND_COUNT
+#undef RT_SIZE_CLASS_COUNT
+#undef RT_SLOT_IDX_LIMIT
+#undef RT_INVALID_SLOT_IDX
+#undef RT_SLAB_BLOCK_SIZE
+#undef RT_RADIX_TREE_MAGIC
+#undef RT_UINT64_FORMAT_HEX
+
+/* type declarations */
+#undef RT_RADIX_TREE
+#undef RT_RADIX_TREE_CONTROL
+#undef RT_PTR_LOCAL
+#undef RT_PTR_ALLOC
+#undef RT_INVALID_PTR_ALLOC
+#undef RT_HANDLE
+#undef RT_ITER
+#undef RT_NODE
+#undef RT_NODE_ITER
+#undef RT_NODE_KIND_3
+#undef RT_NODE_KIND_32
+#undef RT_NODE_KIND_125
+#undef RT_NODE_KIND_256
+#undef RT_NODE_BASE_3
+#undef RT_NODE_BASE_32
+#undef RT_NODE_BASE_125
+#undef RT_NODE_BASE_256
+#undef RT_NODE_INNER_3
+#undef RT_NODE_INNER_32
+#undef RT_NODE_INNER_125
+#undef RT_NODE_INNER_256
+#undef RT_NODE_LEAF_3
+#undef RT_NODE_LEAF_32
+#undef RT_NODE_LEAF_125
+#undef RT_NODE_LEAF_256
+#undef RT_SIZE_CLASS
+#undef RT_SIZE_CLASS_ELEM
+#undef RT_SIZE_CLASS_INFO
+#undef RT_CLASS_3
+#undef RT_CLASS_32_MIN
+#undef RT_CLASS_32_MAX
+#undef RT_CLASS_125
+#undef RT_CLASS_256
+
+/* function declarations */
+#undef RT_CREATE
+#undef RT_FREE
+#undef RT_ATTACH
+#undef RT_DETACH
+#undef RT_GET_HANDLE
+#undef RT_SEARCH
+#undef RT_SET
+#undef RT_BEGIN_ITERATE
+#undef RT_ITERATE_NEXT
+#undef RT_END_ITERATE
+#undef RT_USE_DELETE
+#undef RT_DELETE
+#undef RT_MEMORY_USAGE
+#undef RT_DUMP
+#undef RT_DUMP_NODE
+#undef RT_DUMP_SEARCH
+#undef RT_STATS
+
+/* internal helper functions */
+#undef RT_NEW_ROOT
+#undef RT_ALLOC_NODE
+#undef RT_INIT_NODE
+#undef RT_FREE_NODE
+#undef RT_FREE_RECURSE
+#undef RT_EXTEND
+#undef RT_SET_EXTEND
+#undef RT_SWITCH_NODE_KIND
+#undef RT_COPY_NODE
+#undef RT_REPLACE_NODE
+#undef RT_PTR_GET_LOCAL
+#undef RT_PTR_ALLOC_IS_VALID
+#undef RT_NODE_3_SEARCH_EQ
+#undef RT_NODE_32_SEARCH_EQ
+#undef RT_NODE_3_GET_INSERTPOS
+#undef RT_NODE_32_GET_INSERTPOS
+#undef RT_CHUNK_CHILDREN_ARRAY_SHIFT
+#undef RT_CHUNK_VALUES_ARRAY_SHIFT
+#undef RT_CHUNK_CHILDREN_ARRAY_DELETE
+#undef RT_CHUNK_VALUES_ARRAY_DELETE
+#undef RT_CHUNK_CHILDREN_ARRAY_COPY
+#undef RT_CHUNK_VALUES_ARRAY_COPY
+#undef RT_NODE_125_IS_CHUNK_USED
+#undef RT_NODE_INNER_125_GET_CHILD
+#undef RT_NODE_LEAF_125_GET_VALUE
+#undef RT_NODE_INNER_256_IS_CHUNK_USED
+#undef RT_NODE_LEAF_256_IS_CHUNK_USED
+#undef RT_NODE_INNER_256_GET_CHILD
+#undef RT_NODE_LEAF_256_GET_VALUE
+#undef RT_NODE_INNER_256_SET
+#undef RT_NODE_LEAF_256_SET
+#undef RT_NODE_INNER_256_DELETE
+#undef RT_NODE_LEAF_256_DELETE
+#undef RT_KEY_GET_SHIFT
+#undef RT_SHIFT_GET_MAX_VAL
+#undef RT_NODE_SEARCH_INNER
+#undef RT_NODE_SEARCH_LEAF
+#undef RT_NODE_UPDATE_INNER
+#undef RT_NODE_DELETE_INNER
+#undef RT_NODE_DELETE_LEAF
+#undef RT_NODE_INSERT_INNER
+#undef RT_NODE_INSERT_LEAF
+#undef RT_NODE_INNER_ITERATE_NEXT
+#undef RT_NODE_LEAF_ITERATE_NEXT
+#undef RT_UPDATE_ITER_STACK
+#undef RT_ITER_UPDATE_KEY
+#undef RT_VERIFY_NODE
+
+#undef RT_DEBUG
diff --git a/src/include/lib/radixtree_delete_impl.h b/src/include/lib/radixtree_delete_impl.h
new file mode 100644
index 0000000000..5f6dda1f12
--- /dev/null
+++ b/src/include/lib/radixtree_delete_impl.h
@@ -0,0 +1,122 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree_delete_impl.h
+ * Common implementation for deletion in leaf and inner nodes.
+ *
+ * Note: There is deliberately no #include guard here
+ *
+ * TODO: Shrink nodes when deletion would allow them to fit in a smaller
+ * size class.
+ *
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * src/include/lib/radixtree_delete_impl.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE3_TYPE RT_NODE_INNER_3
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE3_TYPE RT_NODE_LEAF_3
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+
+#ifdef RT_NODE_LEVEL_LEAF
+ Assert(RT_NODE_IS_LEAF(node));
+#else
+ Assert(!RT_NODE_IS_LEAF(node));
+#endif
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node;
+ int idx = RT_NODE_3_SEARCH_EQ((RT_NODE_BASE_3 *) n3, chunk);
+
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_DELETE(n3->base.chunks, n3->values,
+ n3->base.n.count, idx);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_DELETE(n3->base.chunks, n3->children,
+ n3->base.n.count, idx);
+#endif
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
+ int idx = RT_NODE_32_SEARCH_EQ((RT_NODE_BASE_32 *) n32, chunk);
+
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_DELETE(n32->base.chunks, n32->values,
+ n32->base.n.count, idx);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_DELETE(n32->base.chunks, n32->children,
+ n32->base.n.count, idx);
+#endif
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+ int slotpos = n125->base.slot_idxs[chunk];
+ int idx;
+ int bitnum;
+
+ if (slotpos == RT_INVALID_SLOT_IDX)
+ return false;
+
+ idx = RT_BM_IDX(slotpos);
+ bitnum = RT_BM_BIT(slotpos);
+ n125->base.isset[idx] &= ~((bitmapword) 1 << bitnum);
+ n125->base.slot_idxs[chunk] = RT_INVALID_SLOT_IDX;
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk))
+#else
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, chunk))
+#endif
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_NODE_LEAF_256_DELETE(n256, chunk);
+#else
+ RT_NODE_INNER_256_DELETE(n256, chunk);
+#endif
+ break;
+ }
+ }
+
+ /* update statistics */
+ node->count--;
+
+ return true;
+
+#undef RT_NODE3_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_insert_impl.h b/src/include/lib/radixtree_insert_impl.h
new file mode 100644
index 0000000000..d56e58dcac
--- /dev/null
+++ b/src/include/lib/radixtree_insert_impl.h
@@ -0,0 +1,328 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree_insert_impl.h
+ * Common implementation for insertion in leaf and inner nodes.
+ *
+ * Note: There is deliberately no #include guard here
+ *
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * src/include/lib/radixtree_insert_impl.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE3_TYPE RT_NODE_INNER_3
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE3_TYPE RT_NODE_LEAF_3
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+
+#ifdef RT_NODE_LEVEL_LEAF
+ const bool is_leaf = true;
+ bool chunk_exists = false;
+ Assert(RT_NODE_IS_LEAF(node));
+#else
+ const bool is_leaf = false;
+ Assert(!RT_NODE_IS_LEAF(node));
+#endif
+
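+ /*
+ * Control flow note: when the target node is full, the case below grows
+ * it into a larger node (a bigger size class of the same kind, or the
+ * next kind) and, when the kind changes, falls through to the next case
+ * to insert into the new node. Otherwise the insertion happens in place
+ * and we break out of the switch.
+ */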
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ int idx = RT_NODE_3_SEARCH_EQ(&n3->base, chunk);
+
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n3->values[idx] = *value_p;
+ break;
+ }
+#endif
+ if (unlikely(RT_NODE_MUST_GROW(n3)))
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+ RT_NODE32_TYPE *new32;
+ const uint8 new_kind = RT_NODE_KIND_32;
+ const RT_SIZE_CLASS new_class = RT_CLASS_32_MIN;
+
+ /* grow node from 3 to 32 */
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
+ newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, is_leaf);
+ new32 = (RT_NODE32_TYPE *) newnode;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_COPY(n3->base.chunks, n3->values,
+ new32->base.chunks, new32->values);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_COPY(n3->base.chunks, n3->children,
+ new32->base.chunks, new32->children);
+#endif
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
+ node = newnode;
+ }
+ else
+ {
+ int insertpos = RT_NODE_3_GET_INSERTPOS(&n3->base, chunk);
+ int count = n3->base.n.count;
+
+ /* shift chunks and children */
+ if (insertpos < count)
+ {
+ Assert(count > 0);
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_SHIFT(n3->base.chunks, n3->values,
+ count, insertpos);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_SHIFT(n3->base.chunks, n3->children,
+ count, insertpos);
+#endif
+ }
+
+ n3->base.chunks[insertpos] = chunk;
+#ifdef RT_NODE_LEVEL_LEAF
+ n3->values[insertpos] = *value_p;
+#else
+ n3->children[insertpos] = child;
+#endif
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_32:
+ {
+ const RT_SIZE_CLASS_ELEM class32_max = RT_SIZE_CLASS_INFO[RT_CLASS_32_MAX];
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ int idx = RT_NODE_32_SEARCH_EQ(&n32->base, chunk);
+
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n32->values[idx] = *value_p;
+ break;
+ }
+#endif
+ if (unlikely(RT_NODE_MUST_GROW(n32)) &&
+ n32->base.n.fanout < class32_max.fanout)
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+ const RT_SIZE_CLASS_ELEM class32_min = RT_SIZE_CLASS_INFO[RT_CLASS_32_MIN];
+ const RT_SIZE_CLASS new_class = RT_CLASS_32_MAX;
+
+ Assert(n32->base.n.fanout == class32_min.fanout);
+
+ /* grow to the next size class of this kind */
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ n32 = (RT_NODE32_TYPE *) newnode;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ memcpy(newnode, node, class32_min.leaf_size);
+#else
+ memcpy(newnode, node, class32_min.inner_size);
+#endif
+ newnode->fanout = class32_max.fanout;
+
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
+ node = newnode;
+ }
+
+ if (unlikely(RT_NODE_MUST_GROW(n32)))
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+ RT_NODE125_TYPE *new125;
+ const uint8 new_kind = RT_NODE_KIND_125;
+ const RT_SIZE_CLASS new_class = RT_CLASS_125;
+
+ Assert(n32->base.n.fanout == class32_max.fanout);
+
+ /* grow node from 32 to 125 */
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
+ newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, is_leaf);
+ new125 = (RT_NODE125_TYPE *) newnode;
+
+ for (int i = 0; i < class32_max.fanout; i++)
+ {
+ new125->base.slot_idxs[n32->base.chunks[i]] = i;
+#ifdef RT_NODE_LEVEL_LEAF
+ new125->values[i] = n32->values[i];
+#else
+ new125->children[i] = n32->children[i];
+#endif
+ }
+
+ /*
+ * Since we just copied a dense array, we can set the bits
+ * using a single store, provided the length of that array
+ * is at most the number of bits in a bitmapword.
+ */
+ Assert(class32_max.fanout <= sizeof(bitmapword) * BITS_PER_BYTE);
+ new125->base.isset[0] = (bitmapword) (((uint64) 1 << class32_max.fanout) - 1);
+
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
+ node = newnode;
+ }
+ else
+ {
+ int insertpos = RT_NODE_32_GET_INSERTPOS(&n32->base, chunk);
+ int count = n32->base.n.count;
+
+ if (insertpos < count)
+ {
+ Assert(count > 0);
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_SHIFT(n32->base.chunks, n32->values,
+ count, insertpos);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_SHIFT(n32->base.chunks, n32->children,
+ count, insertpos);
+#endif
+ }
+
+ n32->base.chunks[insertpos] = chunk;
+#ifdef RT_NODE_LEVEL_LEAF
+ n32->values[insertpos] = *value_p;
+#else
+ n32->children[insertpos] = child;
+#endif
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+ int slotpos;
+ int cnt = 0;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ slotpos = n125->base.slot_idxs[chunk];
+ if (slotpos != RT_INVALID_SLOT_IDX)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n125->values[slotpos] = *value_p;
+ break;
+ }
+#endif
+ if (unlikely(RT_NODE_MUST_GROW(n125)))
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+ RT_NODE256_TYPE *new256;
+ const uint8 new_kind = RT_NODE_KIND_256;
+ const RT_SIZE_CLASS new_class = RT_CLASS_256;
+
+ /* grow node from 125 to 256 */
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
+ newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, is_leaf);
+ new256 = (RT_NODE256_TYPE *) newnode;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n125->base.n.count; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(&n125->base, i))
+ continue;
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_NODE_LEAF_256_SET(new256, i, RT_NODE_LEAF_125_GET_VALUE(n125, i));
+#else
+ RT_NODE_INNER_256_SET(new256, i, RT_NODE_INNER_125_GET_CHILD(n125, i));
+#endif
+ cnt++;
+ }
+
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
+ node = newnode;
+ }
+ else
+ {
+ int idx;
+ bitmapword inverse;
+
+ /* get the first word with at least one bit not set */
+ for (idx = 0; idx < RT_BM_IDX(RT_SLOT_IDX_LIMIT); idx++)
+ {
+ if (n125->base.isset[idx] < ~((bitmapword) 0))
+ break;
+ }
+
+ /* To get the first unset bit in X, get the first set bit in ~X */
+ inverse = ~(n125->base.isset[idx]);
+ slotpos = idx * BITS_PER_BITMAPWORD;
+ slotpos += bmw_rightmost_one_pos(inverse);
+ Assert(slotpos < node->fanout);
+
+ /* mark the slot used */
+ n125->base.isset[idx] |= bmw_rightmost_one(inverse);
+ n125->base.slot_idxs[chunk] = slotpos;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ n125->values[slotpos] = *value_p;
+#else
+ n125->children[slotpos] = child;
+#endif
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ chunk_exists = RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk);
+ Assert(chunk_exists || node->count < RT_NODE_MAX_SLOTS);
+ RT_NODE_LEAF_256_SET(n256, chunk, *value_p);
+#else
+ Assert(node->count < RT_NODE_MAX_SLOTS);
+ RT_NODE_INNER_256_SET(n256, chunk, child);
+#endif
+ break;
+ }
+ }
+
+ /* Update statistics */
+#ifdef RT_NODE_LEVEL_LEAF
+ if (!chunk_exists)
+ node->count++;
+#else
+ node->count++;
+#endif
+
+ /*
+ * Done. Finally, verify that the chunk and its value or child were
+ * inserted or replaced properly in the node.
+ */
+ RT_VERIFY_NODE(node);
+
+#ifdef RT_NODE_LEVEL_LEAF
+ return chunk_exists;
+#else
+ return;
+#endif
+
+#undef RT_NODE3_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_iter_impl.h b/src/include/lib/radixtree_iter_impl.h
new file mode 100644
index 0000000000..98c78eb237
--- /dev/null
+++ b/src/include/lib/radixtree_iter_impl.h
@@ -0,0 +1,153 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree_iter_impl.h
+ * Common implementation for iteration in leaf and inner nodes.
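+ *
+ *	  For inner nodes this fragment returns the next child pointer, or NULL
+ *	  when the node is exhausted; for leaf nodes it returns whether a next
+ *	  value was found, storing that value in *value_p.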
+ *
+ * Note: There is deliberately no #include guard here
+ *
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * src/include/lib/radixtree_iter_impl.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE3_TYPE RT_NODE_INNER_3
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE3_TYPE RT_NODE_LEAF_3
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ bool found = false;
+ uint8 key_chunk;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_VALUE_TYPE value;
+
+ Assert(RT_NODE_IS_LEAF(node_iter->node));
+#else
+ RT_PTR_LOCAL child = NULL;
+
+ Assert(!RT_NODE_IS_LEAF(node_iter->node));
+#endif
+
+#ifdef RT_SHMEM
+ Assert(iter->tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+
+ switch (node_iter->node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n3->base.n.count)
+ break;
+#ifdef RT_NODE_LEVEL_LEAF
+ value = n3->values[node_iter->current_idx];
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, n3->children[node_iter->current_idx]);
+#endif
+ key_chunk = n3->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n32->base.n.count)
+ break;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ value = n32->values[node_iter->current_idx];
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, n32->children[node_iter->current_idx]);
+#endif
+ key_chunk = n32->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (RT_NODE_125_IS_CHUNK_USED((RT_NODE_BASE_125 *) n125, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+#ifdef RT_NODE_LEVEL_LEAF
+ value = RT_NODE_LEAF_125_GET_VALUE(n125, i);
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, RT_NODE_INNER_125_GET_CHILD(n125, i));
+#endif
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+#ifdef RT_NODE_LEVEL_LEAF
+ if (RT_NODE_LEAF_256_IS_CHUNK_USED(n256, i))
+#else
+ if (RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
+#endif
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+#ifdef RT_NODE_LEVEL_LEAF
+ value = RT_NODE_LEAF_256_GET_VALUE(n256, i);
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, RT_NODE_INNER_256_GET_CHILD(n256, i));
+#endif
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ }
+
+ if (found)
+ {
+ RT_ITER_UPDATE_KEY(iter, key_chunk, node_iter->node->shift);
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = value;
+#endif
+ }
+
+#ifdef RT_NODE_LEVEL_LEAF
+ return found;
+#else
+ return child;
+#endif
+
+#undef RT_NODE3_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_search_impl.h b/src/include/lib/radixtree_search_impl.h
new file mode 100644
index 0000000000..a8925c75d0
--- /dev/null
+++ b/src/include/lib/radixtree_search_impl.h
@@ -0,0 +1,138 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree_search_impl.h
+ * Common implementation for search in leaf and inner nodes, plus
+ * update for inner nodes only.
+ *
+ * Note: There is deliberately no #include guard here
+ *
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * src/include/lib/radixtree_search_impl.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE3_TYPE RT_NODE_INNER_3
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE3_TYPE RT_NODE_LEAF_3
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+
+#ifdef RT_NODE_LEVEL_LEAF
+ Assert(value_p != NULL);
+ Assert(RT_NODE_IS_LEAF(node));
+#else
+#ifndef RT_ACTION_UPDATE
+ Assert(child_p != NULL);
+#endif
+ Assert(!RT_NODE_IS_LEAF(node));
+#endif
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node;
+ int idx = RT_NODE_3_SEARCH_EQ((RT_NODE_BASE_3 *) n3, chunk);
+
+#ifdef RT_ACTION_UPDATE
+ Assert(idx >= 0);
+ n3->children[idx] = new_child;
+#else
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = n3->values[idx];
+#else
+ *child_p = n3->children[idx];
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
+ int idx = RT_NODE_32_SEARCH_EQ((RT_NODE_BASE_32 *) n32, chunk);
+
+#ifdef RT_ACTION_UPDATE
+ Assert(idx >= 0);
+ n32->children[idx] = new_child;
+#else
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = n32->values[idx];
+#else
+ *child_p = n32->children[idx];
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+ int slotpos = n125->base.slot_idxs[chunk];
+
+#ifdef RT_ACTION_UPDATE
+ Assert(slotpos != RT_INVALID_SLOT_IDX);
+ n125->children[slotpos] = new_child;
+#else
+ if (slotpos == RT_INVALID_SLOT_IDX)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = RT_NODE_LEAF_125_GET_VALUE(n125, chunk);
+#else
+ *child_p = RT_NODE_INNER_125_GET_CHILD(n125, chunk);
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
+
+#ifdef RT_ACTION_UPDATE
+ RT_NODE_INNER_256_SET(n256, chunk, new_child);
+#else
+#ifdef RT_NODE_LEVEL_LEAF
+ if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk))
+#else
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, chunk))
+#endif
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = RT_NODE_LEAF_256_GET_VALUE(n256, chunk);
+#else
+ *child_p = RT_NODE_INNER_256_GET_CHILD(n256, chunk);
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ }
+
+#ifdef RT_ACTION_UPDATE
+ return;
+#else
+ return true;
+#endif /* RT_ACTION_UPDATE */
+
+#undef RT_NODE3_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
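(A note on how these fragments are meant to be consumed: radixtree_iter_impl.h
and radixtree_search_impl.h deliberately have no include guards because
radixtree.h is expected to textually include each of them twice, once with
RT_NODE_LEVEL_LEAF defined and once with RT_NODE_LEVEL_INNER, so that a single
switch over the node kinds serves both the leaf and inner variants. A minimal
sketch of that pattern, with hypothetical wrapper names and an assumed
child-pointer type, not taken verbatim from the patch:

    /* illustrative only; wrapper names and exact signatures are assumptions */
    static inline bool
    RT_NODE_SEARCH_LEAF(RT_PTR_LOCAL node, uint64 key, RT_VALUE_TYPE *value_p)
    {
    #define RT_NODE_LEVEL_LEAF
    #include "lib/radixtree_search_impl.h"
    #undef RT_NODE_LEVEL_LEAF
    }

    static inline bool
    RT_NODE_SEARCH_INNER(RT_PTR_LOCAL node, uint64 key, RT_PTR_ALLOC *child_p)
    {
    #define RT_NODE_LEVEL_INNER
    #include "lib/radixtree_search_impl.h"
    #undef RT_NODE_LEVEL_INNER
    }

Both flavors return bool; only the out parameter differs, value_p for leaf
nodes and child_p for inner nodes.)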
diff --git a/src/include/utils/dsa.h b/src/include/utils/dsa.h
index 3ce4ee300a..2af215484f 100644
--- a/src/include/utils/dsa.h
+++ b/src/include/utils/dsa.h
@@ -121,6 +121,7 @@ extern dsa_handle dsa_get_handle(dsa_area *area);
extern dsa_pointer dsa_allocate_extended(dsa_area *area, size_t size, int flags);
extern void dsa_free(dsa_area *area, dsa_pointer dp);
extern void *dsa_get_address(dsa_area *area, dsa_pointer dp);
+extern size_t dsa_get_total_size(dsa_area *area);
extern void dsa_trim(dsa_area *area);
extern void dsa_dump(dsa_area *area);
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index c629cbe383..9659eb85d7 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -28,6 +28,7 @@ SUBDIRS = \
test_pg_db_role_setting \
test_pg_dump \
test_predtest \
+ test_radixtree \
test_rbtree \
test_regex \
test_rls_hooks \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 1baa6b558d..232cbdac80 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -24,6 +24,7 @@ subdir('test_parser')
subdir('test_pg_db_role_setting')
subdir('test_pg_dump')
subdir('test_predtest')
+subdir('test_radixtree')
subdir('test_rbtree')
subdir('test_regex')
subdir('test_rls_hooks')
diff --git a/src/test/modules/test_radixtree/.gitignore b/src/test/modules/test_radixtree/.gitignore
new file mode 100644
index 0000000000..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/test_radixtree/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_radixtree/Makefile b/src/test/modules/test_radixtree/Makefile
new file mode 100644
index 0000000000..da06b93da3
--- /dev/null
+++ b/src/test/modules/test_radixtree/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_radixtree/Makefile
+
+MODULE_big = test_radixtree
+OBJS = \
+ $(WIN32RES) \
+ test_radixtree.o
+PGFILEDESC = "test_radixtree - test code for src/include/lib/radixtree.h"
+
+EXTENSION = test_radixtree
+DATA = test_radixtree--1.0.sql
+
+REGRESS = test_radixtree
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_radixtree
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_radixtree/README b/src/test/modules/test_radixtree/README
new file mode 100644
index 0000000000..a8b271869a
--- /dev/null
+++ b/src/test/modules/test_radixtree/README
@@ -0,0 +1,7 @@
+test_radixtree contains unit tests for the radix tree implementation in
+src/include/lib/radixtree.h.
+
+The tests verify the correctness of the implementation, but they can also be
+used as a micro-benchmark. If you set the 'rt_test_stats' flag in
+test_radixtree.c, the tests will print extra information about execution time
+and memory usage.
diff --git a/src/test/modules/test_radixtree/expected/test_radixtree.out b/src/test/modules/test_radixtree/expected/test_radixtree.out
new file mode 100644
index 0000000000..ce645cb8b5
--- /dev/null
+++ b/src/test/modules/test_radixtree/expected/test_radixtree.out
@@ -0,0 +1,36 @@
+CREATE EXTENSION test_radixtree;
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
+NOTICE: testing basic operations with leaf node 4
+NOTICE: testing basic operations with inner node 4
+NOTICE: testing basic operations with leaf node 32
+NOTICE: testing basic operations with inner node 32
+NOTICE: testing basic operations with leaf node 125
+NOTICE: testing basic operations with inner node 125
+NOTICE: testing basic operations with leaf node 256
+NOTICE: testing basic operations with inner node 256
+NOTICE: testing radix tree node types with shift "0"
+NOTICE: testing radix tree node types with shift "8"
+NOTICE: testing radix tree node types with shift "16"
+NOTICE: testing radix tree node types with shift "24"
+NOTICE: testing radix tree node types with shift "32"
+NOTICE: testing radix tree node types with shift "40"
+NOTICE: testing radix tree node types with shift "48"
+NOTICE: testing radix tree node types with shift "56"
+NOTICE: testing radix tree with pattern "all ones"
+NOTICE: testing radix tree with pattern "alternating bits"
+NOTICE: testing radix tree with pattern "clusters of ten"
+NOTICE: testing radix tree with pattern "clusters of hundred"
+NOTICE: testing radix tree with pattern "one-every-64k"
+NOTICE: testing radix tree with pattern "sparse"
+NOTICE: testing radix tree with pattern "single values, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^60"
+ test_radixtree
+----------------
+
+(1 row)
+
diff --git a/src/test/modules/test_radixtree/meson.build b/src/test/modules/test_radixtree/meson.build
new file mode 100644
index 0000000000..6add06bbdb
--- /dev/null
+++ b/src/test/modules/test_radixtree/meson.build
@@ -0,0 +1,35 @@
+# FIXME: prevent install during main install, but not during test :/
+
+test_radixtree_sources = files(
+ 'test_radixtree.c',
+)
+
+if host_system == 'windows'
+ test_radixtree_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'test_radixtree',
+ '--FILEDESC', 'test_radixtree - test code for src/include/lib/radixtree.h',])
+endif
+
+test_radixtree = shared_module('test_radixtree',
+ test_radixtree_sources,
+ link_with: pgport_srv,
+ kwargs: pg_mod_args,
+)
+testprep_targets += test_radixtree
+
+install_data(
+ 'test_radixtree.control',
+ 'test_radixtree--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'test_radixtree',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'test_radixtree',
+ ],
+ },
+}
diff --git a/src/test/modules/test_radixtree/sql/test_radixtree.sql b/src/test/modules/test_radixtree/sql/test_radixtree.sql
new file mode 100644
index 0000000000..41ece5e9f5
--- /dev/null
+++ b/src/test/modules/test_radixtree/sql/test_radixtree.sql
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_radixtree;
+
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
diff --git a/src/test/modules/test_radixtree/test_radixtree--1.0.sql b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
new file mode 100644
index 0000000000..074a5a7ea7
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_radixtree/test_radixtree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_radixtree" to load this file. \quit
+
+CREATE FUNCTION test_radixtree()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
new file mode 100644
index 0000000000..afe53382f3
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -0,0 +1,681 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_radixtree.c
+ * Test radixtree set data structure.
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/test/modules/test_radixtree/test_radixtree.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "miscadmin.h"
+#include "nodes/bitmapset.h"
+#include "storage/block.h"
+#include "storage/itemptr.h"
+#include "storage/lwlock.h"
+#include "utils/memutils.h"
+#include "utils/timestamp.h"
+
+#define UINT64_HEX_FORMAT "%" INT64_MODIFIER "X"
+
+/*
+ * The tests pass with uint32, but build with warnings because the string
+ * format expects uint64.
+ */
+typedef uint64 TestValueType;
+
+/*
+ * If you enable this, the "pattern" tests will print information about
+ * how long populating, probing, and iterating the test set takes, and
+ * how much memory the test set consumed. That can be used as
+ * micro-benchmark of various operations and input patterns (you might
+ * want to increase the number of values used in each of the test, if
+ * you do that, to reduce noise).
+ *
+ * The information is printed to the server's stderr, mostly because
+ * that's where MemoryContextStats() output goes.
+ */
+static const bool rt_test_stats = false;
+
+static int rt_node_kind_fanouts[] = {
+ 0,
+ 4, /* RT_NODE_KIND_4 */
+ 32, /* RT_NODE_KIND_32 */
+ 125, /* RT_NODE_KIND_125 */
+ 256 /* RT_NODE_KIND_256 */
+};
+/*
+ * A struct to define a pattern of integers, for use with the test_pattern()
+ * function.
+ */
+typedef struct
+{
+ char *test_name; /* short name of the test, for humans */
+ char *pattern_str; /* a bit pattern */
+ uint64 spacing; /* pattern repeats at this interval */
+ uint64 num_values; /* number of integers to set in total */
+} test_spec;
+
+/* Test patterns borrowed from test_integerset.c */
+static const test_spec test_specs[] = {
+ {
+ "all ones", "1111111111",
+ 10, 1000000
+ },
+ {
+ "alternating bits", "0101010101",
+ 10, 1000000
+ },
+ {
+ "clusters of ten", "1111111111",
+ 10000, 1000000
+ },
+ {
+ "clusters of hundred",
+ "1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111",
+ 10000, 1000000
+ },
+ {
+ "one-every-64k", "1",
+ 65536, 1000000
+ },
+ {
+ "sparse", "100000000000000000000000000000001",
+ 10000000, 1000000
+ },
+ {
+ "single values, distance > 2^32", "1",
+ UINT64CONST(10000000000), 100000
+ },
+ {
+ "clusters, distance > 2^32", "10101010",
+ UINT64CONST(10000000000), 1000000
+ },
+ {
+ "clusters, distance > 2^60", "10101010",
+ UINT64CONST(2000000000000000000),
+ 23 /* can't be much higher than this, or we
+ * overflow uint64 */
+ }
+};
+
+/* define the radix tree implementation to test */
+#define RT_PREFIX rt
+#define RT_SCOPE
+#define RT_DECLARE
+#define RT_DEFINE
+#define RT_USE_DELETE
+#define RT_VALUE_TYPE TestValueType
+/* #define RT_SHMEM */
+#include "lib/radixtree.h"
+
+
+/*
+ * Return the number of keys in the radix tree.
+ */
+static uint64
+rt_num_entries(rt_radix_tree *tree)
+{
+ return tree->ctl->num_keys;
+}
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(test_radixtree);
+
+static void
+test_empty(void)
+{
+ rt_radix_tree *radixtree;
+ rt_iter *iter;
+ TestValueType dummy;
+ uint64 key;
+ TestValueType val;
+
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+
+ LWLockRegisterTranche(tranche_id, "test_radix_tree");
+ dsa = dsa_create(tranche_id);
+
+ radixtree = rt_create(CurrentMemoryContext, dsa, tranche_id);
+#else
+ radixtree = rt_create(CurrentMemoryContext);
+#endif
+
+ if (rt_search(radixtree, 0, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, 1, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, PG_UINT64_MAX, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_delete(radixtree, 0))
+ elog(ERROR, "rt_delete on empty tree returned true");
+
+ if (rt_num_entries(radixtree) != 0)
+ elog(ERROR, "rt_num_entries on empty tree returned non-zero");
+
+ iter = rt_begin_iterate(radixtree);
+
+ if (rt_iterate_next(iter, &key, &val))
+ elog(ERROR, "rt_iterate_next on empty tree returned true");
+
+ rt_end_iterate(iter);
+
+ rt_free(radixtree);
+
+#ifdef RT_SHMEM
+ dsa_detach(dsa);
+#endif
+}
+
+static void
+test_basic(int children, bool test_inner)
+{
+ rt_radix_tree *radixtree;
+ uint64 *keys;
+ int shift = test_inner ? 8 : 0;
+
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+
+ LWLockRegisterTranche(tranche_id, "test_radix_tree");
+ dsa = dsa_create(tranche_id);
+#endif
+
+ elog(NOTICE, "testing basic operations with %s node %d",
+ test_inner ? "inner" : "leaf", children);
+
+#ifdef RT_SHMEM
+ radixtree = rt_create(CurrentMemoryContext, dsa, tranche_id);
+#else
+ radixtree = rt_create(CurrentMemoryContext);
+#endif
+
+ /* prepare keys in order like 1, 32, 2, 31, 3, ... */
+ keys = palloc(sizeof(uint64) * children);
+ for (int i = 0; i < children; i++)
+ {
+ if (i % 2 == 0)
+ keys[i] = (uint64) ((i / 2) + 1) << shift;
+ else
+ keys[i] = (uint64) (children - (i / 2)) << shift;
+ }
+
+ /* insert keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (rt_set(radixtree, keys[i], (TestValueType*) &keys[i]))
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " is found", keys[i]);
+ }
+
+ /* look up keys */
+ for (int i = 0; i < children; i++)
+ {
+ TestValueType value;
+
+ if (!rt_search(radixtree, keys[i], &value))
+ elog(ERROR, "could not find key 0x" UINT64_HEX_FORMAT, keys[i]);
+ if (value != (TestValueType) keys[i])
+ elog(ERROR, "rt_search returned 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ value, (TestValueType) keys[i]);
+ }
+
+ /* update keys */
+ for (int i = 0; i < children; i++)
+ {
+ TestValueType update = keys[i] + 1;
+ if (!rt_set(radixtree, keys[i], (TestValueType*) &update))
+ elog(ERROR, "could not update key 0x" UINT64_HEX_FORMAT, keys[i]);
+ }
+
+ /* repeat deleting and inserting keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (!rt_delete(radixtree, keys[i]))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, keys[i]);
+ if (rt_set(radixtree, keys[i], (TestValueType*) &keys[i]))
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " is found", keys[i]);
+ }
+
+ pfree(keys);
+ rt_free(radixtree);
+#ifdef RT_SHMEM
+ dsa_detach(dsa);
+#endif
+}
+
+/*
+ * Check if keys from start to end with the shift exist in the tree.
+ */
+static void
+check_search_on_node(rt_radix_tree *radixtree, uint8 shift, int start, int end,
+ int incr)
+{
+ for (int i = start; i < end; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ TestValueType val;
+
+ if (!rt_search(radixtree, key, &val))
+ elog(ERROR, "key 0x" UINT64_HEX_FORMAT " is not found on node-%d",
+ key, end);
+ if (val != (TestValueType) key)
+ elog(ERROR, "rt_search with key 0x" UINT64_HEX_FORMAT " returns 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ key, val, key);
+ }
+}
+
+static void
+test_node_types_insert(rt_radix_tree *radixtree, uint8 shift, bool insert_asc)
+{
+ uint64 num_entries;
+ int ninserted = 0;
+ int start = insert_asc ? 0 : 256;
+ int incr = insert_asc ? 1 : -1;
+ int end = insert_asc ? 256 : 0;
+ int node_kind_idx = 1;
+
+ for (int i = start; i != end; i += incr)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_set(radixtree, key, (TestValueType*) &key);
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " is found", key);
+
+ /*
+ * After filling all slots in each node type, check if the values
+ * are stored properly.
+ */
+ if (ninserted == rt_node_kind_fanouts[node_kind_idx] - 1)
+ {
+ int check_start = insert_asc
+ ? rt_node_kind_fanouts[node_kind_idx - 1]
+ : rt_node_kind_fanouts[node_kind_idx];
+ int check_end = insert_asc
+ ? rt_node_kind_fanouts[node_kind_idx]
+ : rt_node_kind_fanouts[node_kind_idx - 1];
+
+ check_search_on_node(radixtree, shift, check_start, check_end, incr);
+ node_kind_idx++;
+ }
+
+ ninserted++;
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ if (num_entries != 256)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(256));
+}
+
+static void
+test_node_types_delete(rt_radix_tree *radixtree, uint8 shift)
+{
+ uint64 num_entries;
+
+ for (int i = 0; i < 256; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_delete(radixtree, key);
+
+ if (!found)
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, key);
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ /* The tree must be empty */
+ if (num_entries != 0)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(0));
+}
+
+/*
+ * Test for inserting and deleting key-value pairs to each node type at the given shift
+ * level.
+ */
+static void
+test_node_types(uint8 shift)
+{
+ rt_radix_tree *radixtree;
+
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+
+ LWLockRegisterTranche(tranche_id, "test_radix_tree");
+ dsa = dsa_create(tranche_id);
+#endif
+
+ elog(NOTICE, "testing radix tree node types with shift \"%d\"", shift);
+
+#ifdef RT_SHMEM
+ radixtree = rt_create(CurrentMemoryContext, dsa, tranche_id);
+#else
+ radixtree = rt_create(CurrentMemoryContext);
+#endif
+
+ /*
+ * Insert and search entries for every node type at the 'shift' level,
+ * then delete all entries to make it empty, and insert and search entries
+ * again.
+ */
+ test_node_types_insert(radixtree, shift, true);
+ test_node_types_delete(radixtree, shift);
+ test_node_types_insert(radixtree, shift, false);
+
+ rt_free(radixtree);
+#ifdef RT_SHMEM
+ dsa_detach(dsa);
+#endif
+}
+
+/*
+ * Test with a repeating pattern, defined by the 'spec'.
+ */
+static void
+test_pattern(const test_spec * spec)
+{
+ rt_radix_tree *radixtree;
+ rt_iter *iter;
+ MemoryContext radixtree_ctx;
+ TimestampTz starttime;
+ TimestampTz endtime;
+ uint64 n;
+ uint64 last_int;
+ uint64 ndeleted;
+ uint64 nbefore;
+ uint64 nafter;
+ int patternlen;
+ uint64 *pattern_values;
+ uint64 pattern_num_values;
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+
+ LWLockRegisterTranche(tranche_id, "test_radix_tree");
+ dsa = dsa_create(tranche_id);
+#endif
+
+ elog(NOTICE, "testing radix tree with pattern \"%s\"", spec->test_name);
+ if (rt_test_stats)
+ fprintf(stderr, "-----\ntesting radix tree with pattern \"%s\"\n", spec->test_name);
+
+ /* Pre-process the pattern, creating an array of integers from it. */
+ patternlen = strlen(spec->pattern_str);
+ pattern_values = palloc(patternlen * sizeof(uint64));
+ pattern_num_values = 0;
+ for (int i = 0; i < patternlen; i++)
+ {
+ if (spec->pattern_str[i] == '1')
+ pattern_values[pattern_num_values++] = i;
+ }
+
+ /*
+ * Allocate the radix tree.
+ *
+ * Allocate it in a separate memory context, so that we can print its
+ * memory usage easily.
+ */
+ radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
+ "radixtree test",
+ ALLOCSET_SMALL_SIZES);
+ MemoryContextSetIdentifier(radixtree_ctx, spec->test_name);
+
+#ifdef RT_SHMEM
+ radixtree = rt_create(radixtree_ctx, dsa, tranche_id);
+#else
+ radixtree = rt_create(radixtree_ctx);
+#endif
+
+
+ /*
+ * Add values to the set.
+ */
+ starttime = GetCurrentTimestamp();
+
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ uint64 x = 0;
+
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ bool found;
+
+ x = last_int + pattern_values[i];
+
+ found = rt_set(radixtree, x, (TestValueType*) &x);
+
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " found", x);
+
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+
+ endtime = GetCurrentTimestamp();
+
+ if (rt_test_stats)
+ fprintf(stderr, "added " UINT64_FORMAT " values in %d ms\n",
+ spec->num_values, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Print stats on the amount of memory used.
+ *
+ * We print the usage reported by rt_memory_usage(), as well as the stats
+ * from the memory context. They should be in the same ballpark, but it's
+ * hard to automate testing that, so if you're making changes to the
+ * implementation, just observe that manually.
+ */
+ if (rt_test_stats)
+ {
+ uint64 mem_usage;
+
+ /*
+ * Also print memory usage as reported by rt_memory_usage(). It
+ * should be in the same ballpark as the usage reported by
+ * MemoryContextStats().
+ */
+ mem_usage = rt_memory_usage(radixtree);
+ fprintf(stderr, "rt_memory_usage() reported " UINT64_FORMAT " (%0.2f bytes / integer)\n",
+ mem_usage, (double) mem_usage / spec->num_values);
+
+ MemoryContextStats(radixtree_ctx);
+ }
+
+ /* Check that rt_num_entries works */
+ n = rt_num_entries(radixtree);
+ if (n != spec->num_values)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT, n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_search()
+ */
+ starttime = GetCurrentTimestamp();
+
+ for (n = 0; n < 100000; n++)
+ {
+ bool found;
+ bool expected;
+ uint64 x;
+ TestValueType v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Do we expect this value to be present in the set? */
+ if (x >= last_int)
+ expected = false;
+ else
+ {
+ uint64 idx = x % spec->spacing;
+
+ if (idx >= patternlen)
+ expected = false;
+ else if (spec->pattern_str[idx] == '1')
+ expected = true;
+ else
+ expected = false;
+ }
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (found != expected)
+ elog(ERROR, "mismatch at 0x" UINT64_HEX_FORMAT ": %d vs %d", x, found, expected);
+ if (found && (v != (TestValueType) x))
+ elog(ERROR, "found 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ v, x);
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "probed " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Test iterator
+ */
+ starttime = GetCurrentTimestamp();
+
+ iter = rt_begin_iterate(radixtree);
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ uint64 expected = last_int + pattern_values[i];
+ uint64 x;
+ TestValueType val;
+
+ if (!rt_iterate_next(iter, &x, &val))
+ break;
+
+ if (x != expected)
+ elog(ERROR,
+ "iterate returned wrong key; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d",
+ x, expected, i);
+ if (val != (TestValueType) expected)
+ elog(ERROR,
+ "iterate returned wrong value; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d", val, expected, i);
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "iterated " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ rt_end_iterate(iter);
+
+ if (n < spec->num_values)
+ elog(ERROR, "iterator stopped short after " UINT64_FORMAT " entries, expected " UINT64_FORMAT, n, spec->num_values);
+ if (n > spec->num_values)
+ elog(ERROR, "iterator returned " UINT64_FORMAT " entries, " UINT64_FORMAT " was expected", n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_delete()
+ */
+ starttime = GetCurrentTimestamp();
+
+ nbefore = rt_num_entries(radixtree);
+ ndeleted = 0;
+ for (n = 0; n < 1; n++)
+ {
+ bool found;
+ uint64 x;
+ TestValueType v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (!found)
+ continue;
+
+ /* If the key is found, delete it and check again */
+ if (!rt_delete(radixtree, x))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_search(radixtree, x, &v))
+ elog(ERROR, "found deleted key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_delete(radixtree, x))
+ elog(ERROR, "deleted already-deleted key 0x" UINT64_HEX_FORMAT, x);
+
+ ndeleted++;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "deleted " UINT64_FORMAT " values in %d ms\n",
+ ndeleted, (int) (endtime - starttime) / 1000);
+
+ nafter = rt_num_entries(radixtree);
+
+ /* Check that rt_num_entries works */
+ if ((nbefore - ndeleted) != nafter)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT " after " UINT64_FORMAT " deletions",
+ nafter, (nbefore - ndeleted), ndeleted);
+
+ rt_free(radixtree);
+ MemoryContextDelete(radixtree_ctx);
+#ifdef RT_SHMEM
+ dsa_detach(dsa);
+#endif
+}
+
+Datum
+test_radixtree(PG_FUNCTION_ARGS)
+{
+ test_empty();
+
+ for (int i = 1; i < lengthof(rt_node_kind_fanouts); i++)
+ {
+ test_basic(rt_node_kind_fanouts[i], false);
+ test_basic(rt_node_kind_fanouts[i], true);
+ }
+
+ for (int shift = 0; shift <= (64 - 8); shift += 8)
+ test_node_types(shift);
+
+ /* Test different test patterns, with lots of entries */
+ for (int i = 0; i < lengthof(test_specs); i++)
+ test_pattern(&test_specs[i]);
+
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_radixtree/test_radixtree.control b/src/test/modules/test_radixtree/test_radixtree.control
new file mode 100644
index 0000000000..e53f2a3e0c
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.control
@@ -0,0 +1,4 @@
+comment = 'Test code for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/test_radixtree'
+relocatable = true
diff --git a/src/tools/pginclude/cpluspluscheck b/src/tools/pginclude/cpluspluscheck
index b0e9aa99a2..2f72d5ed4b 100755
--- a/src/tools/pginclude/cpluspluscheck
+++ b/src/tools/pginclude/cpluspluscheck
@@ -101,6 +101,12 @@ do
test "$f" = src/include/nodes/nodetags.h && continue
test "$f" = src/backend/nodes/nodetags.h && continue
+ # radixtree_*_impl.h cannot be included standalone: they are just code fragments.
+ test "$f" = src/include/lib/radixtree_delete_impl.h && continue
+ test "$f" = src/include/lib/radixtree_insert_impl.h && continue
+ test "$f" = src/include/lib/radixtree_iter_impl.h && continue
+ test "$f" = src/include/lib/radixtree_search_impl.h && continue
+
# These files are not meant to be included standalone, because
# they contain lists that might have multiple use-cases.
test "$f" = src/include/access/rmgrlist.h && continue
diff --git a/src/tools/pginclude/headerscheck b/src/tools/pginclude/headerscheck
index 8dee1b5670..133313255c 100755
--- a/src/tools/pginclude/headerscheck
+++ b/src/tools/pginclude/headerscheck
@@ -96,6 +96,12 @@ do
test "$f" = src/include/nodes/nodetags.h && continue
test "$f" = src/backend/nodes/nodetags.h && continue
+ # radixtree_*_impl.h cannot be included standalone: they are just code fragments.
+ test "$f" = src/include/lib/radixtree_delete_impl.h && continue
+ test "$f" = src/include/lib/radixtree_insert_impl.h && continue
+ test "$f" = src/include/lib/radixtree_iter_impl.h && continue
+ test "$f" = src/include/lib/radixtree_search_impl.h && continue
+
# These files are not meant to be included standalone, because
# they contain lists that might have multiple use-cases.
test "$f" = src/include/access/rmgrlist.h && continue
--
2.31.1
v31-0006-Use-TIDStore-for-storing-dead-tuple-TID-during-l.patchapplication/octet-stream; name=v31-0006-Use-TIDStore-for-storing-dead-tuple-TID-during-l.patchDownload
From b3ac3b456aa1448f3e959674f16bed18630266be Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Tue, 7 Feb 2023 17:19:29 +0700
Subject: [PATCH v31 06/14] Use TIDStore for storing dead tuple TID during lazy
vacuum
Previously, we used an array of ItemPointerData to store dead tuple
TIDs, which was not space efficient and was slow to look up. Its size
was also limited to 1GB.
Now we use TIDStore to store dead tuple TIDs. Since the TIDStore,
backed by the radix tree, allocates memory incrementally, we get rid
of the 1GB limit.
Since we can no longer estimate the exact maximum number of TIDs that
can be stored, pg_stat_progress_vacuum now reports the progress
information based on the amount of memory in bytes. The column names
are also changed to max_dead_tuple_bytes and num_dead_tuple_bytes.
In addition, since the TIDStore uses the radix tree internally, the
minimum amount of memory required by the TIDStore is 1MB, the initial
DSA segment size. Due to that, we increase the minimum value of
maintenance_work_mem (and autovacuum_work_mem) from 1MB to 2MB.
XXX: needs to bump catalog version
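As a rough illustration of the new flow, the first heap pass now drives the
store along these lines (a sketch only, pieced together from the tidstore_*
calls this patch uses; the loop scaffolding and declarations are simplified,
not copied from vacuumlazy.c):

    /* sketch: not the patch itself */
    TidStore   *dead_items = tidstore_create(vac_work_mem,
                                             MaxHeapTuplesPerPage, NULL);

    for (BlockNumber blkno = 0; blkno < rel_pages; blkno++)
    {
        /* if the memory budget is used up, vacuum indexes and heap first;
         * lazy_vacuum() also resets the store when it is done */
        if (tidstore_is_full(dead_items))
            lazy_vacuum(vacrel);

        /* prune the page, collecting LP_DEAD offsets into deadoffsets[] */
        tidstore_add_tids(dead_items, blkno, deadoffsets, num_offsets);
        pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
                                     tidstore_memory_usage(dead_items));
    }

    /* final index/heap vacuum cycle for whatever remains */
    if (tidstore_num_tids(dead_items) > 0)
        lazy_vacuum(vacrel);

The only size question VACUUM has to ask is whether the memory budget is
exhausted; the store itself grows incrementally up to that budget.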
---
doc/src/sgml/monitoring.sgml | 8 +-
src/backend/access/heap/vacuumlazy.c | 278 ++++++++-------------
src/backend/catalog/system_views.sql | 2 +-
src/backend/commands/vacuum.c | 78 +-----
src/backend/commands/vacuumparallel.c | 73 +++---
src/backend/postmaster/autovacuum.c | 6 +-
src/backend/storage/lmgr/lwlock.c | 2 +
src/backend/utils/misc/guc_tables.c | 2 +-
src/include/commands/progress.h | 4 +-
src/include/commands/vacuum.h | 25 +-
src/include/storage/lwlock.h | 1 +
src/test/regress/expected/cluster.out | 2 +-
src/test/regress/expected/create_index.out | 2 +-
src/test/regress/expected/rules.out | 4 +-
src/test/regress/sql/cluster.sql | 2 +-
src/test/regress/sql/create_index.sql | 2 +-
16 files changed, 177 insertions(+), 314 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 97d588b1d8..47b346d36c 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -7170,10 +7170,10 @@ FROM pg_stat_get_backend_idset() AS backendid;
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>max_dead_tuples</structfield> <type>bigint</type>
+ <structfield>max_dead_tuple_bytes</structfield> <type>bigint</type>
</para>
<para>
- Number of dead tuples that we can store before needing to perform
+ Amount of dead tuple data that we can store before needing to perform
an index vacuum cycle, based on
<xref linkend="guc-maintenance-work-mem"/>.
</para></entry>
@@ -7181,10 +7181,10 @@ FROM pg_stat_get_backend_idset() AS backendid;
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>num_dead_tuples</structfield> <type>bigint</type>
+ <structfield>num_dead_tuple_bytes</structfield> <type>bigint</type>
</para>
<para>
- Number of dead tuples collected since the last index vacuum cycle.
+ Amount of dead tuple data collected since the last index vacuum cycle.
</para></entry>
</row>
</tbody>
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 8f14cf85f3..b4e40423a8 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -3,18 +3,18 @@
* vacuumlazy.c
* Concurrent ("lazy") vacuuming.
*
- * The major space usage for vacuuming is storage for the array of dead TIDs
+ * The major space usage for vacuuming is the TidStore, which stores the dead TIDs
* that are to be removed from indexes. We want to ensure we can vacuum even
* the very largest relations with finite memory space usage. To do that, we
- * set upper bounds on the number of TIDs we can keep track of at once.
+ * set upper bounds on the maximum memory that can be used for keeping track
+ * of dead TIDs at once.
*
* We are willing to use at most maintenance_work_mem (or perhaps
* autovacuum_work_mem) memory space to keep track of dead TIDs. We initially
- * allocate an array of TIDs of that size, with an upper limit that depends on
- * table size (this limit ensures we don't allocate a huge area uselessly for
- * vacuuming small tables). If the array threatens to overflow, we must call
- * lazy_vacuum to vacuum indexes (and to vacuum the pages that we've pruned).
- * This frees up the memory space dedicated to storing dead TIDs.
+ * create a TidStore, specifying the maximum number of bytes it may use.
+ * If the TidStore is full, we must call lazy_vacuum to vacuum indexes (and to
+ * vacuum the pages that we've pruned). This frees up the memory space dedicated
+ * to storing dead TIDs.
*
* In practice VACUUM will often complete its initial pass over the target
* heap relation without ever running out of space to store TIDs. This means
@@ -40,6 +40,7 @@
#include "access/heapam_xlog.h"
#include "access/htup_details.h"
#include "access/multixact.h"
+#include "access/tidstore.h"
#include "access/transam.h"
#include "access/visibilitymap.h"
#include "access/xact.h"
@@ -188,7 +189,7 @@ typedef struct LVRelState
* lazy_vacuum_heap_rel, which marks the same LP_DEAD line pointers as
* LP_UNUSED during second heap pass.
*/
- VacDeadItems *dead_items; /* TIDs whose index tuples we'll delete */
+ TidStore *dead_items; /* TIDs whose index tuples we'll delete */
BlockNumber rel_pages; /* total number of pages */
BlockNumber scanned_pages; /* # pages examined (not skipped via VM) */
BlockNumber removed_pages; /* # pages removed by relation truncation */
@@ -220,11 +221,14 @@ typedef struct LVRelState
typedef struct LVPagePruneState
{
bool hastup; /* Page prevents rel truncation? */
- bool has_lpdead_items; /* includes existing LP_DEAD items */
+
+ /* collected offsets of LP_DEAD items including existing ones */
+ OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
+ int num_offsets;
/*
* State describes the proper VM bit states to set for the page following
- * pruning and freezing. all_visible implies !has_lpdead_items, but don't
+ * pruning and freezing. all_visible implies num_offsets == 0, but don't
* trust all_frozen result unless all_visible is also set to true.
*/
bool all_visible; /* Every item visible to all? */
@@ -259,8 +263,9 @@ static bool lazy_scan_noprune(LVRelState *vacrel, Buffer buf,
static void lazy_vacuum(LVRelState *vacrel);
static bool lazy_vacuum_all_indexes(LVRelState *vacrel);
static void lazy_vacuum_heap_rel(LVRelState *vacrel);
-static int lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
- Buffer buffer, int index, Buffer vmbuffer);
+static void lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
+ OffsetNumber *offsets, int num_offsets,
+ Buffer buffer, Buffer vmbuffer);
static bool lazy_check_wraparound_failsafe(LVRelState *vacrel);
static void lazy_cleanup_all_indexes(LVRelState *vacrel);
static IndexBulkDeleteResult *lazy_vacuum_one_index(Relation indrel,
@@ -487,11 +492,11 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
}
/*
- * Allocate dead_items array memory using dead_items_alloc. This handles
- * parallel VACUUM initialization as part of allocating shared memory
- * space used for dead_items. (But do a failsafe precheck first, to
- * ensure that parallel VACUUM won't be attempted at all when relfrozenxid
- * is already dangerously old.)
+ * Allocate dead_items memory using dead_items_alloc. This handles parallel
+ * VACUUM initialization as part of allocating shared memory space used for
+ * dead_items. (But do a failsafe precheck first, to ensure that parallel
+ * VACUUM won't be attempted at all when relfrozenxid is already dangerously
+ * old.)
*/
lazy_check_wraparound_failsafe(vacrel);
dead_items_alloc(vacrel, params->nworkers);
@@ -797,7 +802,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
* have collected the TIDs whose index tuples need to be removed.
*
* Finally, invokes lazy_vacuum_heap_rel to vacuum heap pages, which
- * largely consists of marking LP_DEAD items (from collected TID array)
+ * largely consists of marking LP_DEAD items (from vacrel->dead_items)
* as LP_UNUSED. This has to happen in a second, final pass over the
* heap, to preserve a basic invariant that all index AMs rely on: no
* extant index tuple can ever be allowed to contain a TID that points to
@@ -825,21 +830,21 @@ lazy_scan_heap(LVRelState *vacrel)
blkno,
next_unskippable_block,
next_fsm_block_to_vacuum = 0;
- VacDeadItems *dead_items = vacrel->dead_items;
+ TidStore *dead_items = vacrel->dead_items;
Buffer vmbuffer = InvalidBuffer;
bool next_unskippable_allvis,
skipping_current_range;
const int initprog_index[] = {
PROGRESS_VACUUM_PHASE,
PROGRESS_VACUUM_TOTAL_HEAP_BLKS,
- PROGRESS_VACUUM_MAX_DEAD_TUPLES
+ PROGRESS_VACUUM_MAX_DEAD_TUPLE_BYTES
};
int64 initprog_val[3];
/* Report that we're scanning the heap, advertising total # of blocks */
initprog_val[0] = PROGRESS_VACUUM_PHASE_SCAN_HEAP;
initprog_val[1] = rel_pages;
- initprog_val[2] = dead_items->max_items;
+ initprog_val[2] = tidstore_max_memory(vacrel->dead_items);
pgstat_progress_update_multi_param(3, initprog_index, initprog_val);
/* Set up an initial range of skippable blocks using the visibility map */
@@ -906,8 +911,7 @@ lazy_scan_heap(LVRelState *vacrel)
* dead_items TIDs, pause and do a cycle of vacuuming before we tackle
* this page.
*/
- Assert(dead_items->max_items >= MaxHeapTuplesPerPage);
- if (dead_items->max_items - dead_items->num_items < MaxHeapTuplesPerPage)
+ if (tidstore_is_full(vacrel->dead_items))
{
/*
* Before beginning index vacuuming, we release any pin we may
@@ -969,7 +973,7 @@ lazy_scan_heap(LVRelState *vacrel)
continue;
}
- /* Collect LP_DEAD items in dead_items array, count tuples */
+ /* Collect LP_DEAD items in dead_items, count tuples */
if (lazy_scan_noprune(vacrel, buf, blkno, page, &hastup,
&recordfreespace))
{
@@ -1011,14 +1015,14 @@ lazy_scan_heap(LVRelState *vacrel)
* Prune, freeze, and count tuples.
*
* Accumulates details of remaining LP_DEAD line pointers on page in
- * dead_items array. This includes LP_DEAD line pointers that we
- * pruned ourselves, as well as existing LP_DEAD line pointers that
- * were pruned some time earlier. Also considers freezing XIDs in the
- * tuple headers of remaining items with storage.
+ * dead_items. This includes LP_DEAD line pointers that we pruned
+ * ourselves, as well as existing LP_DEAD line pointers that were pruned
+ * some time earlier. Also considers freezing XIDs in the tuple headers
+ * of remaining items with storage.
*/
lazy_scan_prune(vacrel, buf, blkno, page, &prunestate);
- Assert(!prunestate.all_visible || !prunestate.has_lpdead_items);
+ Assert(!prunestate.all_visible || (prunestate.num_offsets == 0));
/* Remember the location of the last page with nonremovable tuples */
if (prunestate.hastup)
@@ -1034,14 +1038,12 @@ lazy_scan_heap(LVRelState *vacrel)
* performed here can be thought of as the one-pass equivalent of
* a call to lazy_vacuum().
*/
- if (prunestate.has_lpdead_items)
+ if (prunestate.num_offsets > 0)
{
Size freespace;
- lazy_vacuum_heap_page(vacrel, blkno, buf, 0, vmbuffer);
-
- /* Forget the LP_DEAD items that we just vacuumed */
- dead_items->num_items = 0;
+ lazy_vacuum_heap_page(vacrel, blkno, prunestate.deadoffsets,
+ prunestate.num_offsets, buf, vmbuffer);
/*
* Periodically perform FSM vacuuming to make newly-freed
@@ -1078,7 +1080,16 @@ lazy_scan_heap(LVRelState *vacrel)
* with prunestate-driven visibility map and FSM steps (just like
* the two-pass strategy).
*/
- Assert(dead_items->num_items == 0);
+ Assert(tidstore_num_tids(dead_items) == 0);
+ }
+ else if (prunestate.num_offsets > 0)
+ {
+ /* Save details of the LP_DEAD items from the page in dead_items */
+ tidstore_add_tids(dead_items, blkno, prunestate.deadoffsets,
+ prunestate.num_offsets);
+
+ pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
+ tidstore_memory_usage(dead_items));
}
/*
@@ -1145,7 +1156,7 @@ lazy_scan_heap(LVRelState *vacrel)
* There should never be LP_DEAD items on a page with PD_ALL_VISIBLE
* set, however.
*/
- else if (prunestate.has_lpdead_items && PageIsAllVisible(page))
+ else if ((prunestate.num_offsets > 0) && PageIsAllVisible(page))
{
elog(WARNING, "page containing LP_DEAD items is marked as all-visible in relation \"%s\" page %u",
vacrel->relname, blkno);
@@ -1193,7 +1204,7 @@ lazy_scan_heap(LVRelState *vacrel)
* Final steps for block: drop cleanup lock, record free space in the
* FSM
*/
- if (prunestate.has_lpdead_items && vacrel->do_index_vacuuming)
+ if ((prunestate.num_offsets > 0) && vacrel->do_index_vacuuming)
{
/*
* Wait until lazy_vacuum_heap_rel() to save free space. This
@@ -1249,7 +1260,7 @@ lazy_scan_heap(LVRelState *vacrel)
* Do index vacuuming (call each index's ambulkdelete routine), then do
* related heap vacuuming
*/
- if (dead_items->num_items > 0)
+ if (tidstore_num_tids(dead_items) > 0)
lazy_vacuum(vacrel);
/*
@@ -1524,9 +1535,9 @@ lazy_scan_new_or_empty(LVRelState *vacrel, Buffer buf, BlockNumber blkno,
* The approach we take now is to restart pruning when the race condition is
* detected. This allows heap_page_prune() to prune the tuples inserted by
* the now-aborted transaction. This is a little crude, but it guarantees
- * that any items that make it into the dead_items array are simple LP_DEAD
- * line pointers, and that every remaining item with tuple storage is
- * considered as a candidate for freezing.
+ * that any items that make it into the dead_items are simple LP_DEAD line
+ * pointers, and that every remaining item with tuple storage is considered
+ * as a candidate for freezing.
*/
static void
lazy_scan_prune(LVRelState *vacrel,
@@ -1543,13 +1554,11 @@ lazy_scan_prune(LVRelState *vacrel,
HTSV_Result res;
int tuples_deleted,
tuples_frozen,
- lpdead_items,
live_tuples,
recently_dead_tuples;
int nnewlpdead;
HeapPageFreeze pagefrz;
int64 fpi_before = pgWalUsage.wal_fpi;
- OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
HeapTupleFreeze frozen[MaxHeapTuplesPerPage];
Assert(BufferGetBlockNumber(buf) == blkno);
@@ -1571,7 +1580,6 @@ retry:
pagefrz.NoFreezePageRelminMxid = vacrel->NewRelminMxid;
tuples_deleted = 0;
tuples_frozen = 0;
- lpdead_items = 0;
live_tuples = 0;
recently_dead_tuples = 0;
@@ -1580,9 +1588,9 @@ retry:
*
* We count tuples removed by the pruning step as tuples_deleted. Its
* final value can be thought of as the number of tuples that have been
- * deleted from the table. It should not be confused with lpdead_items;
- * lpdead_items's final value can be thought of as the number of tuples
- * that were deleted from indexes.
+ * deleted from the table. It should not be confused with
+ * prunestate->num_offsets; num_offsets's final value can be thought
+ * of as the number of tuples that were deleted from indexes.
*/
tuples_deleted = heap_page_prune(rel, buf, vacrel->vistest,
InvalidTransactionId, 0, &nnewlpdead,
@@ -1593,7 +1601,7 @@ retry:
* requiring freezing among remaining tuples with storage
*/
prunestate->hastup = false;
- prunestate->has_lpdead_items = false;
+ prunestate->num_offsets = 0;
prunestate->all_visible = true;
prunestate->all_frozen = true;
prunestate->visibility_cutoff_xid = InvalidTransactionId;
@@ -1638,7 +1646,7 @@ retry:
* (This is another case where it's useful to anticipate that any
* LP_DEAD items will become LP_UNUSED during the ongoing VACUUM.)
*/
- deadoffsets[lpdead_items++] = offnum;
+ prunestate->deadoffsets[prunestate->num_offsets++] = offnum;
continue;
}
@@ -1875,7 +1883,7 @@ retry:
*/
#ifdef USE_ASSERT_CHECKING
/* Note that all_frozen value does not matter when !all_visible */
- if (prunestate->all_visible && lpdead_items == 0)
+ if (prunestate->all_visible && prunestate->num_offsets == 0)
{
TransactionId cutoff;
bool all_frozen;
@@ -1888,28 +1896,9 @@ retry:
}
#endif
- /*
- * Now save details of the LP_DEAD items from the page in vacrel
- */
- if (lpdead_items > 0)
+ if (prunestate->num_offsets > 0)
{
- VacDeadItems *dead_items = vacrel->dead_items;
- ItemPointerData tmp;
-
vacrel->lpdead_item_pages++;
- prunestate->has_lpdead_items = true;
-
- ItemPointerSetBlockNumber(&tmp, blkno);
-
- for (int i = 0; i < lpdead_items; i++)
- {
- ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
- dead_items->items[dead_items->num_items++] = tmp;
- }
-
- Assert(dead_items->num_items <= dead_items->max_items);
- pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
- dead_items->num_items);
/*
* It was convenient to ignore LP_DEAD items in all_visible earlier on
@@ -1928,7 +1917,7 @@ retry:
/* Finally, add page-local counts to whole-VACUUM counts */
vacrel->tuples_deleted += tuples_deleted;
vacrel->tuples_frozen += tuples_frozen;
- vacrel->lpdead_items += lpdead_items;
+ vacrel->lpdead_items += prunestate->num_offsets;
vacrel->live_tuples += live_tuples;
vacrel->recently_dead_tuples += recently_dead_tuples;
}
@@ -1940,7 +1929,7 @@ retry:
* lazy_scan_prune, which requires a full cleanup lock. While pruning isn't
* performed here, it's quite possible that an earlier opportunistic pruning
* operation left LP_DEAD items behind. We'll at least collect any such items
- * in the dead_items array for removal from indexes.
+ * in the dead_items for removal from indexes.
*
* For aggressive VACUUM callers, we may return false to indicate that a full
* cleanup lock is required for processing by lazy_scan_prune. This is only
@@ -2099,7 +2088,7 @@ lazy_scan_noprune(LVRelState *vacrel,
vacrel->NewRelfrozenXid = NoFreezePageRelfrozenXid;
vacrel->NewRelminMxid = NoFreezePageRelminMxid;
- /* Save any LP_DEAD items found on the page in dead_items array */
+ /* Save any LP_DEAD items found on the page in dead_items */
if (vacrel->nindexes == 0)
{
/* Using one-pass strategy (since table has no indexes) */
@@ -2129,8 +2118,7 @@ lazy_scan_noprune(LVRelState *vacrel,
}
else
{
- VacDeadItems *dead_items = vacrel->dead_items;
- ItemPointerData tmp;
+ TidStore *dead_items = vacrel->dead_items;
/*
* Page has LP_DEAD items, and so any references/TIDs that remain in
@@ -2139,17 +2127,10 @@ lazy_scan_noprune(LVRelState *vacrel,
*/
vacrel->lpdead_item_pages++;
- ItemPointerSetBlockNumber(&tmp, blkno);
+ tidstore_add_tids(dead_items, blkno, deadoffsets, lpdead_items);
- for (int i = 0; i < lpdead_items; i++)
- {
- ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
- dead_items->items[dead_items->num_items++] = tmp;
- }
-
- Assert(dead_items->num_items <= dead_items->max_items);
- pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
- dead_items->num_items);
+ pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
+ tidstore_memory_usage(dead_items));
vacrel->lpdead_items += lpdead_items;
@@ -2198,7 +2179,7 @@ lazy_vacuum(LVRelState *vacrel)
if (!vacrel->do_index_vacuuming)
{
Assert(!vacrel->do_index_cleanup);
- vacrel->dead_items->num_items = 0;
+ tidstore_reset(vacrel->dead_items);
return;
}
@@ -2227,7 +2208,7 @@ lazy_vacuum(LVRelState *vacrel)
BlockNumber threshold;
Assert(vacrel->num_index_scans == 0);
- Assert(vacrel->lpdead_items == vacrel->dead_items->num_items);
+ Assert(vacrel->lpdead_items == tidstore_num_tids(vacrel->dead_items));
Assert(vacrel->do_index_vacuuming);
Assert(vacrel->do_index_cleanup);
@@ -2254,8 +2235,8 @@ lazy_vacuum(LVRelState *vacrel)
* cases then this may need to be reconsidered.
*/
threshold = (double) vacrel->rel_pages * BYPASS_THRESHOLD_PAGES;
- bypass = (vacrel->lpdead_item_pages < threshold &&
- vacrel->lpdead_items < MAXDEADITEMS(32L * 1024L * 1024L));
+ bypass = (vacrel->lpdead_item_pages < threshold) &&
+ tidstore_memory_usage(vacrel->dead_items) < (32L * 1024L * 1024L);
}
if (bypass)
@@ -2300,7 +2281,7 @@ lazy_vacuum(LVRelState *vacrel)
* Forget the LP_DEAD items that we just vacuumed (or just decided to not
* vacuum)
*/
- vacrel->dead_items->num_items = 0;
+ tidstore_reset(vacrel->dead_items);
}
/*
@@ -2373,7 +2354,7 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
* place).
*/
Assert(vacrel->num_index_scans > 0 ||
- vacrel->dead_items->num_items == vacrel->lpdead_items);
+ tidstore_num_tids(vacrel->dead_items) == vacrel->lpdead_items);
Assert(allindexes || vacrel->failsafe_active);
/*
@@ -2392,9 +2373,8 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
/*
* lazy_vacuum_heap_rel() -- second pass over the heap for two pass strategy
*
- * This routine marks LP_DEAD items in vacrel->dead_items array as LP_UNUSED.
- * Pages that never had lazy_scan_prune record LP_DEAD items are not visited
- * at all.
+ * This routine marks LP_DEAD items in vacrel->dead_items as LP_UNUSED. Pages
+ * that never had lazy_scan_prune record LP_DEAD items are not visited at all.
*
* We may also be able to truncate the line pointer array of the heap pages we
* visit. If there is a contiguous group of LP_UNUSED items at the end of the
@@ -2410,10 +2390,11 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
static void
lazy_vacuum_heap_rel(LVRelState *vacrel)
{
- int index = 0;
BlockNumber vacuumed_pages = 0;
Buffer vmbuffer = InvalidBuffer;
LVSavedErrInfo saved_err_info;
+ TidStoreIter *iter;
+ TidStoreIterResult *result;
Assert(vacrel->do_index_vacuuming);
Assert(vacrel->do_index_cleanup);
@@ -2428,7 +2409,8 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
VACUUM_ERRCB_PHASE_VACUUM_HEAP,
InvalidBlockNumber, InvalidOffsetNumber);
- while (index < vacrel->dead_items->num_items)
+ iter = tidstore_begin_iterate(vacrel->dead_items);
+ while ((result = tidstore_iterate_next(iter)) != NULL)
{
BlockNumber blkno;
Buffer buf;
@@ -2437,7 +2419,7 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
vacuum_delay_point();
- blkno = ItemPointerGetBlockNumber(&vacrel->dead_items->items[index]);
+ blkno = result->blkno;
vacrel->blkno = blkno;
/*
@@ -2451,7 +2433,8 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
buf = ReadBufferExtended(vacrel->rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
vacrel->bstrategy);
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
- index = lazy_vacuum_heap_page(vacrel, blkno, buf, index, vmbuffer);
+ lazy_vacuum_heap_page(vacrel, blkno, result->offsets, result->num_offsets,
+ buf, vmbuffer);
/* Now that we've vacuumed the page, record its available space */
page = BufferGetPage(buf);
@@ -2461,6 +2444,7 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
vacuumed_pages++;
}
+ tidstore_end_iterate(iter);
vacrel->blkno = InvalidBlockNumber;
if (BufferIsValid(vmbuffer))
@@ -2470,36 +2454,31 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
* We set all LP_DEAD items from the first heap pass to LP_UNUSED during
* the second heap pass. No more, no less.
*/
- Assert(index > 0);
Assert(vacrel->num_index_scans > 1 ||
- (index == vacrel->lpdead_items &&
+ (tidstore_num_tids(vacrel->dead_items) == vacrel->lpdead_items &&
vacuumed_pages == vacrel->lpdead_item_pages));
ereport(DEBUG2,
- (errmsg("table \"%s\": removed %lld dead item identifiers in %u pages",
- vacrel->relname, (long long) index, vacuumed_pages)));
+ (errmsg("table \"%s\": removed " UINT64_FORMAT " dead item identifiers in %u pages",
+ vacrel->relname, tidstore_num_tids(vacrel->dead_items),
+ vacuumed_pages)));
/* Revert to the previous phase information for error traceback */
restore_vacuum_error_info(vacrel, &saved_err_info);
}
/*
- * lazy_vacuum_heap_page() -- free page's LP_DEAD items listed in the
- * vacrel->dead_items array.
+ * lazy_vacuum_heap_page() -- free page's LP_DEAD items.
*
* Caller must have an exclusive buffer lock on the buffer (though a full
* cleanup lock is also acceptable). vmbuffer must be valid and already have
* a pin on blkno's visibility map page.
- *
- * index is an offset into the vacrel->dead_items array for the first listed
- * LP_DEAD item on the page. The return value is the first index immediately
- * after all LP_DEAD items for the same page in the array.
*/
-static int
-lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
- int index, Buffer vmbuffer)
+static void
+lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
+ OffsetNumber *deadoffsets, int num_offsets, Buffer buffer,
+ Buffer vmbuffer)
{
- VacDeadItems *dead_items = vacrel->dead_items;
Page page = BufferGetPage(buffer);
OffsetNumber unused[MaxHeapTuplesPerPage];
int nunused = 0;
@@ -2518,16 +2497,11 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
START_CRIT_SECTION();
- for (; index < dead_items->num_items; index++)
+ for (int i = 0; i < num_offsets; i++)
{
- BlockNumber tblk;
- OffsetNumber toff;
ItemId itemid;
+ OffsetNumber toff = deadoffsets[i];
- tblk = ItemPointerGetBlockNumber(&dead_items->items[index]);
- if (tblk != blkno)
- break; /* past end of tuples for this block */
- toff = ItemPointerGetOffsetNumber(&dead_items->items[index]);
itemid = PageGetItemId(page, toff);
Assert(ItemIdIsDead(itemid) && !ItemIdHasStorage(itemid));
@@ -2597,7 +2571,6 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
/* Revert to the previous phase information for error traceback */
restore_vacuum_error_info(vacrel, &saved_err_info);
- return index;
}
/*
@@ -2687,8 +2660,8 @@ lazy_cleanup_all_indexes(LVRelState *vacrel)
* lazy_vacuum_one_index() -- vacuum index relation.
*
* Delete all the index tuples containing a TID collected in
- * vacrel->dead_items array. Also update running statistics.
- * Exact details depend on index AM's ambulkdelete routine.
+ * vacrel->dead_items. Also update running statistics. Exact
+ * details depend on index AM's ambulkdelete routine.
*
* reltuples is the number of heap tuples to be passed to the
* bulkdelete callback. It's always assumed to be estimated.
@@ -3094,48 +3067,8 @@ count_nondeletable_pages(LVRelState *vacrel, bool *lock_waiter_detected)
}
/*
- * Returns the number of dead TIDs that VACUUM should allocate space to
- * store, given a heap rel of size vacrel->rel_pages, and given current
- * maintenance_work_mem setting (or current autovacuum_work_mem setting,
- * when applicable).
- *
- * See the comments at the head of this file for rationale.
- */
-static int
-dead_items_max_items(LVRelState *vacrel)
-{
- int64 max_items;
- int vac_work_mem = IsAutoVacuumWorkerProcess() &&
- autovacuum_work_mem != -1 ?
- autovacuum_work_mem : maintenance_work_mem;
-
- if (vacrel->nindexes > 0)
- {
- BlockNumber rel_pages = vacrel->rel_pages;
-
- max_items = MAXDEADITEMS(vac_work_mem * 1024L);
- max_items = Min(max_items, INT_MAX);
- max_items = Min(max_items, MAXDEADITEMS(MaxAllocSize));
-
- /* curious coding here to ensure the multiplication can't overflow */
- if ((BlockNumber) (max_items / MaxHeapTuplesPerPage) > rel_pages)
- max_items = rel_pages * MaxHeapTuplesPerPage;
-
- /* stay sane if small maintenance_work_mem */
- max_items = Max(max_items, MaxHeapTuplesPerPage);
- }
- else
- {
- /* One-pass case only stores a single heap page's TIDs at a time */
- max_items = MaxHeapTuplesPerPage;
- }
-
- return (int) max_items;
-}
-
-/*
- * Allocate dead_items (either using palloc, or in dynamic shared memory).
- * Sets dead_items in vacrel for caller.
+ * Allocate a (local or shared) TidStore for storing dead TIDs. Sets dead_items
+ * in vacrel for caller.
*
* Also handles parallel initialization as part of allocating dead_items in
* DSM when required.
@@ -3143,11 +3076,9 @@ dead_items_max_items(LVRelState *vacrel)
static void
dead_items_alloc(LVRelState *vacrel, int nworkers)
{
- VacDeadItems *dead_items;
- int max_items;
-
- max_items = dead_items_max_items(vacrel);
- Assert(max_items >= MaxHeapTuplesPerPage);
+ int vac_work_mem = IsAutoVacuumWorkerProcess() &&
+ autovacuum_work_mem != -1 ?
+ autovacuum_work_mem * 1024L : maintenance_work_mem * 1024L;
/*
* Initialize state for a parallel vacuum. As of now, only one worker can
@@ -3174,7 +3105,7 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
else
vacrel->pvs = parallel_vacuum_init(vacrel->rel, vacrel->indrels,
vacrel->nindexes, nworkers,
- max_items,
+ vac_work_mem, MaxHeapTuplesPerPage,
vacrel->verbose ? INFO : DEBUG2,
vacrel->bstrategy);
@@ -3187,11 +3118,8 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
}
/* Serial VACUUM case */
- dead_items = (VacDeadItems *) palloc(vac_max_items_to_alloc_size(max_items));
- dead_items->max_items = max_items;
- dead_items->num_items = 0;
-
- vacrel->dead_items = dead_items;
+ vacrel->dead_items = tidstore_create(vac_work_mem, MaxHeapTuplesPerPage,
+ NULL);
}
/*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 34ca0e739f..149d41b41c 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1180,7 +1180,7 @@ CREATE VIEW pg_stat_progress_vacuum AS
END AS phase,
S.param2 AS heap_blks_total, S.param3 AS heap_blks_scanned,
S.param4 AS heap_blks_vacuumed, S.param5 AS index_vacuum_count,
- S.param6 AS max_dead_tuples, S.param7 AS num_dead_tuples
+ S.param6 AS max_dead_tuple_bytes, S.param7 AS dead_tuple_bytes
FROM pg_stat_get_progress_info('VACUUM') AS S
LEFT JOIN pg_database D ON S.datid = D.oid;
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 2e12baf8eb..785b825bbc 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -97,7 +97,6 @@ static bool vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
static double compute_parallel_delay(void);
static VacOptValue get_vacoptval_from_boolean(DefElem *def);
static bool vac_tid_reaped(ItemPointer itemptr, void *state);
-static int vac_cmp_itemptr(const void *left, const void *right);
/*
* Primary entry point for manual VACUUM and ANALYZE commands
@@ -2327,16 +2326,16 @@ get_vacoptval_from_boolean(DefElem *def)
*/
IndexBulkDeleteResult *
vac_bulkdel_one_index(IndexVacuumInfo *ivinfo, IndexBulkDeleteResult *istat,
- VacDeadItems *dead_items)
+ TidStore *dead_items)
{
/* Do bulk deletion */
istat = index_bulk_delete(ivinfo, istat, vac_tid_reaped,
(void *) dead_items);
ereport(ivinfo->message_level,
- (errmsg("scanned index \"%s\" to remove %d row versions",
+ (errmsg("scanned index \"%s\" to remove " UINT64_FORMAT " row versions",
RelationGetRelationName(ivinfo->index),
- dead_items->num_items)));
+ tidstore_num_tids(dead_items))));
return istat;
}
@@ -2367,82 +2366,15 @@ vac_cleanup_one_index(IndexVacuumInfo *ivinfo, IndexBulkDeleteResult *istat)
return istat;
}
-/*
- * Returns the total required space for VACUUM's dead_items array given a
- * max_items value.
- */
-Size
-vac_max_items_to_alloc_size(int max_items)
-{
- Assert(max_items <= MAXDEADITEMS(MaxAllocSize));
-
- return offsetof(VacDeadItems, items) + sizeof(ItemPointerData) * max_items;
-}
-
/*
* vac_tid_reaped() -- is a particular tid deletable?
*
* This has the right signature to be an IndexBulkDeleteCallback.
- *
- * Assumes dead_items array is sorted (in ascending TID order).
*/
static bool
vac_tid_reaped(ItemPointer itemptr, void *state)
{
- VacDeadItems *dead_items = (VacDeadItems *) state;
- int64 litem,
- ritem,
- item;
- ItemPointer res;
-
- litem = itemptr_encode(&dead_items->items[0]);
- ritem = itemptr_encode(&dead_items->items[dead_items->num_items - 1]);
- item = itemptr_encode(itemptr);
-
- /*
- * Doing a simple bound check before bsearch() is useful to avoid the
- * extra cost of bsearch(), especially if dead items on the heap are
- * concentrated in a certain range. Since this function is called for
- * every index tuple, it pays to be really fast.
- */
- if (item < litem || item > ritem)
- return false;
-
- res = (ItemPointer) bsearch(itemptr,
- dead_items->items,
- dead_items->num_items,
- sizeof(ItemPointerData),
- vac_cmp_itemptr);
-
- return (res != NULL);
-}
-
-/*
- * Comparator routines for use with qsort() and bsearch().
- */
-static int
-vac_cmp_itemptr(const void *left, const void *right)
-{
- BlockNumber lblk,
- rblk;
- OffsetNumber loff,
- roff;
-
- lblk = ItemPointerGetBlockNumber((ItemPointer) left);
- rblk = ItemPointerGetBlockNumber((ItemPointer) right);
-
- if (lblk < rblk)
- return -1;
- if (lblk > rblk)
- return 1;
-
- loff = ItemPointerGetOffsetNumber((ItemPointer) left);
- roff = ItemPointerGetOffsetNumber((ItemPointer) right);
-
- if (loff < roff)
- return -1;
- if (loff > roff)
- return 1;
+ TidStore *dead_items = (TidStore *) state;
- return 0;
+ return tidstore_lookup_tid(dead_items, itemptr);
}
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index bcd40c80a1..d653683693 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -9,12 +9,11 @@
* In a parallel vacuum, we perform both index bulk deletion and index cleanup
* with parallel worker processes. Individual indexes are processed by one
* vacuum process. ParalleVacuumState contains shared information as well as
- * the memory space for storing dead items allocated in the DSM segment. We
- * launch parallel worker processes at the start of parallel index
- * bulk-deletion and index cleanup and once all indexes are processed, the
- * parallel worker processes exit. Each time we process indexes in parallel,
- * the parallel context is re-initialized so that the same DSM can be used for
- * multiple passes of index bulk-deletion and index cleanup.
+ * the shared TidStore. We launch parallel worker processes at the start of
+ * parallel index bulk-deletion and index cleanup and once all indexes are
+ * processed, the parallel worker processes exit. Each time we process indexes
+ * in parallel, the parallel context is re-initialized so that the same DSM can
+ * be used for multiple passes of index bulk-deletion and index cleanup.
*
* Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
@@ -103,6 +102,9 @@ typedef struct PVShared
/* Counter for vacuuming and cleanup */
pg_atomic_uint32 idx;
+
+ /* Handle of the shared TidStore */
+ tidstore_handle dead_items_handle;
} PVShared;
/* Status used during parallel index vacuum or cleanup */
@@ -166,7 +168,8 @@ struct ParallelVacuumState
PVIndStats *indstats;
/* Shared dead items space among parallel vacuum workers */
- VacDeadItems *dead_items;
+ TidStore *dead_items;
+ dsa_area *dead_items_area;
/* Points to buffer usage area in DSM */
BufferUsage *buffer_usage;
@@ -222,20 +225,23 @@ static void parallel_vacuum_error_callback(void *arg);
*/
ParallelVacuumState *
parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
- int nrequested_workers, int max_items,
- int elevel, BufferAccessStrategy bstrategy)
+ int nrequested_workers, int vac_work_mem,
+ int max_offset, int elevel,
+ BufferAccessStrategy bstrategy)
{
ParallelVacuumState *pvs;
ParallelContext *pcxt;
PVShared *shared;
- VacDeadItems *dead_items;
+ TidStore *dead_items;
PVIndStats *indstats;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
+ void *area_space;
+ dsa_area *dead_items_dsa;
bool *will_parallel_vacuum;
Size est_indstats_len;
Size est_shared_len;
- Size est_dead_items_len;
+ Size dsa_minsize = dsa_minimum_size();
int nindexes_mwm = 0;
int parallel_workers = 0;
int querylen;
@@ -283,9 +289,8 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_estimate_chunk(&pcxt->estimator, est_shared_len);
shm_toc_estimate_keys(&pcxt->estimator, 1);
- /* Estimate size for dead_items -- PARALLEL_VACUUM_KEY_DEAD_ITEMS */
- est_dead_items_len = vac_max_items_to_alloc_size(max_items);
- shm_toc_estimate_chunk(&pcxt->estimator, est_dead_items_len);
+ /* Estimate size for dead tuple DSA -- PARALLEL_VACUUM_KEY_DSA */
+ shm_toc_estimate_chunk(&pcxt->estimator, dsa_minsize);
shm_toc_estimate_keys(&pcxt->estimator, 1);
/*
@@ -351,6 +356,16 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_INDEX_STATS, indstats);
pvs->indstats = indstats;
+ /* Prepare DSA space for dead items */
+ area_space = shm_toc_allocate(pcxt->toc, dsa_minsize);
+ shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_DEAD_ITEMS, area_space);
+ dead_items_dsa = dsa_create_in_place(area_space, dsa_minsize,
+ LWTRANCHE_PARALLEL_VACUUM_DSA,
+ pcxt->seg);
+ dead_items = tidstore_create(vac_work_mem, max_offset, dead_items_dsa);
+ pvs->dead_items = dead_items;
+ pvs->dead_items_area = dead_items_dsa;
+
/* Prepare shared information */
shared = (PVShared *) shm_toc_allocate(pcxt->toc, est_shared_len);
MemSet(shared, 0, est_shared_len);
@@ -360,6 +375,7 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
(nindexes_mwm > 0) ?
maintenance_work_mem / Min(parallel_workers, nindexes_mwm) :
maintenance_work_mem;
+ shared->dead_items_handle = tidstore_get_handle(dead_items);
pg_atomic_init_u32(&(shared->cost_balance), 0);
pg_atomic_init_u32(&(shared->active_nworkers), 0);
@@ -368,15 +384,6 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_SHARED, shared);
pvs->shared = shared;
- /* Prepare the dead_items space */
- dead_items = (VacDeadItems *) shm_toc_allocate(pcxt->toc,
- est_dead_items_len);
- dead_items->max_items = max_items;
- dead_items->num_items = 0;
- MemSet(dead_items->items, 0, sizeof(ItemPointerData) * max_items);
- shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_DEAD_ITEMS, dead_items);
- pvs->dead_items = dead_items;
-
/*
* Allocate space for each worker's BufferUsage and WalUsage; no need to
* initialize
@@ -434,6 +441,9 @@ parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats)
istats[i] = NULL;
}
+ tidstore_destroy(pvs->dead_items);
+ dsa_detach(pvs->dead_items_area);
+
DestroyParallelContext(pvs->pcxt);
ExitParallelMode();
@@ -442,7 +452,7 @@ parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats)
}
/* Returns the dead items space */
-VacDeadItems *
+TidStore *
parallel_vacuum_get_dead_items(ParallelVacuumState *pvs)
{
return pvs->dead_items;
@@ -940,7 +950,9 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
Relation *indrels;
PVIndStats *indstats;
PVShared *shared;
- VacDeadItems *dead_items;
+ TidStore *dead_items;
+ void *area_space;
+ dsa_area *dead_items_area;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
int nindexes;
@@ -984,10 +996,10 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
PARALLEL_VACUUM_KEY_INDEX_STATS,
false);
- /* Set dead_items space */
- dead_items = (VacDeadItems *) shm_toc_lookup(toc,
- PARALLEL_VACUUM_KEY_DEAD_ITEMS,
- false);
+ /* Set dead items */
+ area_space = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_DEAD_ITEMS, false);
+ dead_items_area = dsa_attach_in_place(area_space, seg);
+ dead_items = tidstore_attach(dead_items_area, shared->dead_items_handle);
/* Set cost-based vacuum delay */
VacuumCostActive = (VacuumCostDelay > 0);
@@ -1033,6 +1045,9 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
&wal_usage[ParallelWorkerNumber]);
+ tidstore_detach(pvs.dead_items);
+ dsa_detach(dead_items_area);
+
/* Pop the error context stack */
error_context_stack = errcallback.previous;
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index c0e2e00a7e..60caeae739 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -3399,12 +3399,12 @@ check_autovacuum_work_mem(int *newval, void **extra, GucSource source)
return true;
/*
- * We clamp manually-set values to at least 1MB. Since
+ * We clamp manually-set values to at least 2MB. Since
* maintenance_work_mem is always set to at least this value, do the same
* here.
*/
- if (*newval < 1024)
- *newval = 1024;
+ if (*newval < 2048)
+ *newval = 2048;
return true;
}
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 55b3a04097..c223a7dc94 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -192,6 +192,8 @@ static const char *const BuiltinTrancheNames[] = {
"LogicalRepLauncherDSA",
/* LWTRANCHE_LAUNCHER_HASH: */
"LogicalRepLauncherHash",
+ /* LWTRANCHE_PARALLEL_VACUUM_DSA: */
+ "ParallelVacuumDSA",
};
StaticAssertDecl(lengthof(BuiltinTrancheNames) ==
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 1c0583fe26..8a64614cd1 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2313,7 +2313,7 @@ struct config_int ConfigureNamesInt[] =
GUC_UNIT_KB
},
&maintenance_work_mem,
- 65536, 1024, MAX_KILOBYTES,
+ 65536, 2048, MAX_KILOBYTES,
NULL, NULL, NULL
},
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index e5add41352..b209d3cf84 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -23,8 +23,8 @@
#define PROGRESS_VACUUM_HEAP_BLKS_SCANNED 2
#define PROGRESS_VACUUM_HEAP_BLKS_VACUUMED 3
#define PROGRESS_VACUUM_NUM_INDEX_VACUUMS 4
-#define PROGRESS_VACUUM_MAX_DEAD_TUPLES 5
-#define PROGRESS_VACUUM_NUM_DEAD_TUPLES 6
+#define PROGRESS_VACUUM_MAX_DEAD_TUPLE_BYTES 5
+#define PROGRESS_VACUUM_DEAD_TUPLE_BYTES 6
/* Phases of vacuum (as advertised via PROGRESS_VACUUM_PHASE) */
#define PROGRESS_VACUUM_PHASE_SCAN_HEAP 1
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index bdfd96cfec..cec2d1d356 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -17,6 +17,7 @@
#include "access/htup.h"
#include "access/genam.h"
#include "access/parallel.h"
+#include "access/tidstore.h"
#include "catalog/pg_class.h"
#include "catalog/pg_statistic.h"
#include "catalog/pg_type.h"
@@ -277,21 +278,6 @@ struct VacuumCutoffs
MultiXactId MultiXactCutoff;
};
-/*
- * VacDeadItems stores TIDs whose index tuples are deleted by index vacuuming.
- */
-typedef struct VacDeadItems
-{
- int max_items; /* # slots allocated in array */
- int num_items; /* current # of entries */
-
- /* Sorted array of TIDs to delete from indexes */
- ItemPointerData items[FLEXIBLE_ARRAY_MEMBER];
-} VacDeadItems;
-
-#define MAXDEADITEMS(avail_mem) \
- (((avail_mem) - offsetof(VacDeadItems, items)) / sizeof(ItemPointerData))
-
/* GUC parameters */
extern PGDLLIMPORT int default_statistics_target; /* PGDLLIMPORT for PostGIS */
extern PGDLLIMPORT int vacuum_freeze_min_age;
@@ -340,18 +326,17 @@ extern Relation vacuum_open_relation(Oid relid, RangeVar *relation,
LOCKMODE lmode);
extern IndexBulkDeleteResult *vac_bulkdel_one_index(IndexVacuumInfo *ivinfo,
IndexBulkDeleteResult *istat,
- VacDeadItems *dead_items);
+ TidStore *dead_items);
extern IndexBulkDeleteResult *vac_cleanup_one_index(IndexVacuumInfo *ivinfo,
IndexBulkDeleteResult *istat);
-extern Size vac_max_items_to_alloc_size(int max_items);
/* in commands/vacuumparallel.c */
extern ParallelVacuumState *parallel_vacuum_init(Relation rel, Relation *indrels,
int nindexes, int nrequested_workers,
- int max_items, int elevel,
- BufferAccessStrategy bstrategy);
+ int vac_work_mem, int max_offset,
+ int elevel, BufferAccessStrategy bstrategy);
extern void parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats);
-extern VacDeadItems *parallel_vacuum_get_dead_items(ParallelVacuumState *pvs);
+extern TidStore *parallel_vacuum_get_dead_items(ParallelVacuumState *pvs);
extern void parallel_vacuum_bulkdel_all_indexes(ParallelVacuumState *pvs,
long num_table_tuples,
int num_index_scans);
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 07002fdfbe..537b34b30c 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -207,6 +207,7 @@ typedef enum BuiltinTrancheIds
LWTRANCHE_PGSTATS_DATA,
LWTRANCHE_LAUNCHER_DSA,
LWTRANCHE_LAUNCHER_HASH,
+ LWTRANCHE_PARALLEL_VACUUM_DSA,
LWTRANCHE_FIRST_USER_DEFINED
} BuiltinTrancheIds;
diff --git a/src/test/regress/expected/cluster.out b/src/test/regress/expected/cluster.out
index 2eec483eaa..e04f50726f 100644
--- a/src/test/regress/expected/cluster.out
+++ b/src/test/regress/expected/cluster.out
@@ -526,7 +526,7 @@ create index cluster_sort on clstr_4 (hundred, thousand, tenthous);
-- ensure we don't use the index in CLUSTER nor the checking SELECTs
set enable_indexscan = off;
-- Use external sort:
-set maintenance_work_mem = '1MB';
+set maintenance_work_mem = '2MB';
cluster clstr_4 using cluster_sort;
select * from
(select hundred, lag(hundred) over () as lhundred,
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index acfd9d1f4f..d320ad87dd 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -1214,7 +1214,7 @@ DROP TABLE unlogged_hash_table;
-- CREATE INDEX hash_ovfl_index ON hash_ovfl_heap USING hash (x int4_ops);
-- Test hash index build tuplesorting. Force hash tuplesort using low
-- maintenance_work_mem setting and fillfactor:
-SET maintenance_work_mem = '1MB';
+SET maintenance_work_mem = '2MB';
CREATE INDEX hash_tuplesort_idx ON tenk1 USING hash (stringu1 name_ops) WITH (fillfactor = 10);
EXPLAIN (COSTS OFF)
SELECT count(*) FROM tenk1 WHERE stringu1 = 'TVAAAA';
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index e953d1f515..ef46c2994f 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2032,8 +2032,8 @@ pg_stat_progress_vacuum| SELECT s.pid,
s.param3 AS heap_blks_scanned,
s.param4 AS heap_blks_vacuumed,
s.param5 AS index_vacuum_count,
- s.param6 AS max_dead_tuples,
- s.param7 AS num_dead_tuples
+ s.param6 AS max_dead_tuple_bytes,
+ s.param7 AS dead_tuple_bytes
FROM (pg_stat_get_progress_info('VACUUM'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
LEFT JOIN pg_database d ON ((s.datid = d.oid)));
pg_stat_recovery_prefetch| SELECT stats_reset,
diff --git a/src/test/regress/sql/cluster.sql b/src/test/regress/sql/cluster.sql
index a4cfaae807..a4cb5b98a5 100644
--- a/src/test/regress/sql/cluster.sql
+++ b/src/test/regress/sql/cluster.sql
@@ -258,7 +258,7 @@ create index cluster_sort on clstr_4 (hundred, thousand, tenthous);
set enable_indexscan = off;
-- Use external sort:
-set maintenance_work_mem = '1MB';
+set maintenance_work_mem = '2MB';
cluster clstr_4 using cluster_sort;
select * from
(select hundred, lag(hundred) over () as lhundred,
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index d49ce9f300..d6e2471b00 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -367,7 +367,7 @@ DROP TABLE unlogged_hash_table;
-- Test hash index build tuplesorting. Force hash tuplesort using low
-- maintenance_work_mem setting and fillfactor:
-SET maintenance_work_mem = '1MB';
+SET maintenance_work_mem = '2MB';
CREATE INDEX hash_tuplesort_idx ON tenk1 USING hash (stringu1 name_ops) WITH (fillfactor = 10);
EXPLAIN (COSTS OFF)
SELECT count(*) FROM tenk1 WHERE stringu1 = 'TVAAAA';
--
2.31.1
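
For readers skimming the patch above, here is a minimal sketch (not part of the patch series; vacrel, blkno, deadoffsets, itemptr, buf, vmbuffer and vac_work_mem stand in for the surrounding vacuum state) of how lazy vacuum drives the new TidStore API:

    /* first heap pass: remember the LP_DEAD offsets collected from one block */
    TidStore   *dead_items = tidstore_create(vac_work_mem, MaxHeapTuplesPerPage, NULL);

    tidstore_add_tids(dead_items, blkno, deadoffsets, num_deadoffsets);

    /* index vacuuming: the ambulkdelete callback asks whether a TID is dead */
    deletable = tidstore_lookup_tid(dead_items, itemptr);

    /* second heap pass: walk the store in block number order */
    TidStoreIter *iter = tidstore_begin_iterate(dead_items);
    TidStoreIterResult *result;

    while ((result = tidstore_iterate_next(iter)) != NULL)
        lazy_vacuum_heap_page(vacrel, result->blkno, result->offsets,
                              result->num_offsets, buf, vmbuffer);
    tidstore_end_iterate(iter);

    tidstore_destroy(dead_items);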
v31-0002-Move-some-bitmap-logic-out-of-bitmapset.c.patch
From 46ccfc2d0b588e090d1f46bc16f463789227aff4 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Tue, 6 Dec 2022 13:39:41 +0700
Subject: [PATCH v31 02/14] Move some bitmap logic out of bitmapset.c
Add pg_rightmost_one32/64 functions. This functionality was previously
private to bitmapset.c as the RIGHTMOST_ONE macro. It has practical
use in other contexts, so move to pg_bitutils.h.
Also move the logic for selecting appropriate pg_bitutils functions
based on word size to bitmapset.h for wider visibility and add
appropriate selection for bmw_rightmost_one().
Since the previous macro relied on casting to signedbitmapword,
and the new functions do not, remove that typedef.
Design input and review by Tom Lane
Discussion: https://www.postgresql.org/message-id/CAFBsxsFW2JjTo58jtDB%2B3sZhxMx3t-3evew8%3DAcr%2BGGhC%2BkFaA%40mail.gmail.com
---
src/backend/nodes/bitmapset.c | 34 +-------------------------------
src/include/nodes/bitmapset.h | 16 +++++++++++++--
src/include/port/pg_bitutils.h | 31 +++++++++++++++++++++++++++++
src/tools/pgindent/typedefs.list | 1 -
4 files changed, 46 insertions(+), 36 deletions(-)
diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index 7ba3cf635b..0b2962ed73 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -30,39 +30,7 @@
#define BITMAPSET_SIZE(nwords) \
(offsetof(Bitmapset, words) + (nwords) * sizeof(bitmapword))
-/*----------
- * This is a well-known cute trick for isolating the rightmost one-bit
- * in a word. It assumes two's complement arithmetic. Consider any
- * nonzero value, and focus attention on the rightmost one. The value is
- * then something like
- * xxxxxx10000
- * where x's are unspecified bits. The two's complement negative is formed
- * by inverting all the bits and adding one. Inversion gives
- * yyyyyy01111
- * where each y is the inverse of the corresponding x. Incrementing gives
- * yyyyyy10000
- * and then ANDing with the original value gives
- * 00000010000
- * This works for all cases except original value = zero, where of course
- * we get zero.
- *----------
- */
-#define RIGHTMOST_ONE(x) ((signedbitmapword) (x) & -((signedbitmapword) (x)))
-
-#define HAS_MULTIPLE_ONES(x) ((bitmapword) RIGHTMOST_ONE(x) != (x))
-
-/* Select appropriate bit-twiddling functions for bitmap word size */
-#if BITS_PER_BITMAPWORD == 32
-#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos32(w)
-#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos32(w)
-#define bmw_popcount(w) pg_popcount32(w)
-#elif BITS_PER_BITMAPWORD == 64
-#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos64(w)
-#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos64(w)
-#define bmw_popcount(w) pg_popcount64(w)
-#else
-#error "invalid BITS_PER_BITMAPWORD"
-#endif
+#define HAS_MULTIPLE_ONES(x) (bmw_rightmost_one(x) != (x))
static bool bms_is_empty_internal(const Bitmapset *a);
diff --git a/src/include/nodes/bitmapset.h b/src/include/nodes/bitmapset.h
index 14de6a9ff1..c7e1711147 100644
--- a/src/include/nodes/bitmapset.h
+++ b/src/include/nodes/bitmapset.h
@@ -36,13 +36,11 @@ struct List;
#define BITS_PER_BITMAPWORD 64
typedef uint64 bitmapword; /* must be an unsigned type */
-typedef int64 signedbitmapword; /* must be the matching signed type */
#else
#define BITS_PER_BITMAPWORD 32
typedef uint32 bitmapword; /* must be an unsigned type */
-typedef int32 signedbitmapword; /* must be the matching signed type */
#endif
@@ -73,6 +71,20 @@ typedef enum
BMS_MULTIPLE /* >1 member */
} BMS_Membership;
+/* Select appropriate bit-twiddling functions for bitmap word size */
+#if BITS_PER_BITMAPWORD == 32
+#define bmw_rightmost_one(w) pg_rightmost_one32(w)
+#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos32(w)
+#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos32(w)
+#define bmw_popcount(w) pg_popcount32(w)
+#elif BITS_PER_BITMAPWORD == 64
+#define bmw_rightmost_one(w) pg_rightmost_one64(w)
+#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos64(w)
+#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos64(w)
+#define bmw_popcount(w) pg_popcount64(w)
+#else
+#error "invalid BITS_PER_BITMAPWORD"
+#endif
/*
* function prototypes in nodes/bitmapset.c
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 158ef73a2b..bf7588e075 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -32,6 +32,37 @@ extern PGDLLIMPORT const uint8 pg_leftmost_one_pos[256];
extern PGDLLIMPORT const uint8 pg_rightmost_one_pos[256];
extern PGDLLIMPORT const uint8 pg_number_of_ones[256];
+/*----------
+ * This is a well-known cute trick for isolating the rightmost one-bit
+ * in a word. It assumes two's complement arithmetic. Consider any
+ * nonzero value, and focus attention on the rightmost one. The value is
+ * then something like
+ * xxxxxx10000
+ * where x's are unspecified bits. The two's complement negative is formed
+ * by inverting all the bits and adding one. Inversion gives
+ * yyyyyy01111
+ * where each y is the inverse of the corresponding x. Incrementing gives
+ * yyyyyy10000
+ * and then ANDing with the original value gives
+ * 00000010000
+ * This works for all cases except original value = zero, where of course
+ * we get zero.
+ *----------
+ */
+static inline uint32
+pg_rightmost_one32(uint32 word)
+{
+ int32 result = (int32) word & -((int32) word);
+ return (uint32) result;
+}
+
+static inline uint64
+pg_rightmost_one64(uint64 word)
+{
+ int64 result = (int64) word & -((int64) word);
+ return (uint64) result;
+}
+
/*
* pg_leftmost_one_pos32
* Returns the position of the most significant set bit in "word",
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 86a9303bf5..4a5e776703 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3675,7 +3675,6 @@ shmem_request_hook_type
shmem_startup_hook_type
sig_atomic_t
sigjmp_buf
-signedbitmapword
sigset_t
size_t
slist_head
--
2.31.1
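
To make the relocated bit trick concrete, here is a small standalone illustration (plain stdint types rather than the pg_bitutils versions) that isolates the rightmost one bit the same way pg_rightmost_one32 does, and applies the HAS_MULTIPLE_ONES test:

    #include <stdint.h>
    #include <stdio.h>

    /* same idea as pg_rightmost_one32: invert, add one, AND with the original */
    static uint32_t
    rightmost_one32(uint32_t word)
    {
        return word & (~word + 1);
    }

    int
    main(void)
    {
        uint32_t    x = 0xb0;	/* binary 10110000 */

        printf("0x%x\n", (unsigned) rightmost_one32(x));	/* prints 0x10 */
        printf("%d\n", rightmost_one32(x) != x);			/* HAS_MULTIPLE_ONES: 1 */
        return 0;
    }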
v31-0012-Revert-the-update-for-the-minimum-value-of-maint.patch
From 8080e74de8597b6e8567fbfce5dbd2771937287c Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 8 Mar 2023 15:09:22 +0900
Subject: [PATCH v31 12/14] Revert the update for the minimum value of
maintenance_work_mem.
---
src/backend/postmaster/autovacuum.c | 6 +++---
src/backend/utils/misc/guc_tables.c | 2 +-
2 files changed, 4 insertions(+), 4 deletions(-)
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 60caeae739..c0e2e00a7e 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -3399,12 +3399,12 @@ check_autovacuum_work_mem(int *newval, void **extra, GucSource source)
return true;
/*
- * We clamp manually-set values to at least 2MB. Since
+ * We clamp manually-set values to at least 1MB. Since
* maintenance_work_mem is always set to at least this value, do the same
* here.
*/
- if (*newval < 2048)
- *newval = 2048;
+ if (*newval < 1024)
+ *newval = 1024;
return true;
}
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 8a64614cd1..1c0583fe26 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2313,7 +2313,7 @@ struct config_int ConfigureNamesInt[] =
GUC_UNIT_KB
},
&maintenance_work_mem,
- 65536, 2048, MAX_KILOBYTES,
+ 65536, 1024, MAX_KILOBYTES,
NULL, NULL, NULL
},
--
2.31.1
v31-0001-Introduce-helper-SIMD-functions-for-small-byte-a.patch
From 2176fc0e5b4bee9e389f8a29637ef9ed29aec0da Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 24 Oct 2022 14:07:09 +0900
Subject: [PATCH v31 01/14] Introduce helper SIMD functions for small byte
arrays
vector8_min - helper for emulating ">=" semantics
vector8_highbit_mask - used to turn the result of a vector
comparison into a bitmask
Masahiko Sawada
Reviewed by Nathan Bossart, additional adjustments by me
Discussion: https://www.postgresql.org/message-id/CAD21AoDap240WDDdUDE0JMpCmuMMnGajrKrkCRxM7zn9Xk3JRA%40mail.gmail.com
---
src/include/port/simd.h | 47 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 47 insertions(+)
diff --git a/src/include/port/simd.h b/src/include/port/simd.h
index 1fa6c3bc6c..dfae14e463 100644
--- a/src/include/port/simd.h
+++ b/src/include/port/simd.h
@@ -79,6 +79,7 @@ static inline bool vector8_has_le(const Vector8 v, const uint8 c);
static inline bool vector8_is_highbit_set(const Vector8 v);
#ifndef USE_NO_SIMD
static inline bool vector32_is_highbit_set(const Vector32 v);
+static inline uint32 vector8_highbit_mask(const Vector8 v);
#endif
/* arithmetic operations */
@@ -96,6 +97,7 @@ static inline Vector8 vector8_ssub(const Vector8 v1, const Vector8 v2);
*/
#ifndef USE_NO_SIMD
static inline Vector8 vector8_eq(const Vector8 v1, const Vector8 v2);
+static inline Vector8 vector8_min(const Vector8 v1, const Vector8 v2);
static inline Vector32 vector32_eq(const Vector32 v1, const Vector32 v2);
#endif
@@ -299,6 +301,36 @@ vector32_is_highbit_set(const Vector32 v)
}
#endif /* ! USE_NO_SIMD */
+/*
+ * Return a bitmask formed from the high-bit of each element.
+ */
+#ifndef USE_NO_SIMD
+static inline uint32
+vector8_highbit_mask(const Vector8 v)
+{
+#ifdef USE_SSE2
+ return (uint32) _mm_movemask_epi8(v);
+#elif defined(USE_NEON)
+ /*
+ * Note: There is a faster way to do this, but it returns a uint64, and
+ * if the caller wanted to extract the bit position using CTZ,
+ * it would have to divide that result by 4.
+ */
+ static const uint8 mask[16] = {
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ };
+
+ uint8x16_t masked = vandq_u8(vld1q_u8(mask), (uint8x16_t) vshrq_n_s8(v, 7));
+ uint8x16_t maskedhi = vextq_u8(masked, masked, 8);
+
+ return (uint32) vaddvq_u16((uint16x8_t) vzip1q_u8(masked, maskedhi));
+#endif
+}
+#endif /* ! USE_NO_SIMD */
+
/*
* Return the bitwise OR of the inputs
*/
@@ -372,4 +404,19 @@ vector32_eq(const Vector32 v1, const Vector32 v2)
}
#endif /* ! USE_NO_SIMD */
+/*
+ * Given two vectors, return a vector with the minimum element of each.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_min(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+ return _mm_min_epu8(v1, v2);
+#elif defined(USE_NEON)
+ return vminq_u8(v1, v2);
+#endif
+}
+#endif /* ! USE_NO_SIMD */
+
#endif /* SIMD_H */
--
2.31.1
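
As a rough illustration of why these two helpers are useful together, here is a standalone SSE2 sketch using the raw intrinsics (not the simd.h wrappers themselves; the NEON path would be analogous). SSE2 has no unsigned byte "<=" comparison, but min-then-compare-equal emulates it, and the movemask that backs vector8_highbit_mask turns the per-lane result into an ordinary bitmask:

    #include <emmintrin.h>	/* SSE2 */
    #include <stdint.h>
    #include <stdio.h>

    int
    main(void)
    {
        uint8_t     chunk[16] = {3, 200, 7, 9, 0, 50, 50, 255,
                                 1, 2, 3, 4, 5, 6, 7, 8};
        __m128i     v = _mm_loadu_si128((const __m128i *) chunk);
        __m128i     c = _mm_set1_epi8((char) 50);

        /* chunk[i] <= 50  iff  min(chunk[i], 50) == chunk[i] */
        __m128i     le = _mm_cmpeq_epi8(_mm_min_epu8(v, c), v);

        /* one bit per byte lane, like vector8_highbit_mask() */
        uint32_t    mask = (uint32_t) _mm_movemask_epi8(le);

        printf("lanes <= 50: 0x%04x\n", (unsigned) mask);	/* 0xff7d */
        return 0;
    }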
v31-0010-Radix-tree-optionally-tracks-memory-usage-when-R.patch
From 7da5e7808ba51aed7ad22b9758b3200cbfcd7d19 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 8 Mar 2023 15:08:19 +0900
Subject: [PATCH v31 10/14] Radix tree optionally tracks memory usage, when
RT_MEASURE_MEMORY_USAGE.
---
contrib/bench_radix_tree/bench_radix_tree.c | 1 +
src/backend/utils/mmgr/dsa.c | 12 ---
src/include/lib/radixtree.h | 93 +++++++++++++++++--
src/include/utils/dsa.h | 1 -
.../modules/test_radixtree/test_radixtree.c | 1 +
5 files changed, 85 insertions(+), 23 deletions(-)
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c
index 6e5149e2c4..8a0c754a2c 100644
--- a/contrib/bench_radix_tree/bench_radix_tree.c
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -34,6 +34,7 @@ PG_MODULE_MAGIC;
#define RT_DECLARE
#define RT_DEFINE
#define RT_USE_DELETE
+#define RT_MEASURE_MEMORY_USAGE
#define RT_VALUE_TYPE uint64
// WIP: compiles with warnings because rt_attach is defined but not used
// #define RT_SHMEM
diff --git a/src/backend/utils/mmgr/dsa.c b/src/backend/utils/mmgr/dsa.c
index 80555aefff..f5a62061a3 100644
--- a/src/backend/utils/mmgr/dsa.c
+++ b/src/backend/utils/mmgr/dsa.c
@@ -1024,18 +1024,6 @@ dsa_set_size_limit(dsa_area *area, size_t limit)
LWLockRelease(DSA_AREA_LOCK(area));
}
-size_t
-dsa_get_total_size(dsa_area *area)
-{
- size_t size;
-
- LWLockAcquire(DSA_AREA_LOCK(area), LW_SHARED);
- size = area->control->total_segment_size;
- LWLockRelease(DSA_AREA_LOCK(area));
-
- return size;
-}
-
/*
* Aggressively free all spare memory in the hope of returning DSM segments to
* the operating system.
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 8bea606c62..f7812eb12a 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -84,7 +84,6 @@
* RT_BEGIN_ITERATE - Begin iterating through all key-value pairs
* RT_ITERATE_NEXT - Return next key-value pair, if any
* RT_END_ITERATE - End iteration
- * RT_MEMORY_USAGE - Get the memory usage
*
* Interface for Shared Memory
* ---------
@@ -97,6 +96,8 @@
* ---------
*
* RT_DELETE - Delete a key-value pair. Declared/define if RT_USE_DELETE is defined
+ * RT_MEMORY_USAGE - Get the memory usage. Declared/defined if
+ * RT_MEASURE_MEMORY_USAGE is defined.
*
*
* Copyright (c) 2023, PostgreSQL Global Development Group
@@ -138,7 +139,9 @@
#ifdef RT_USE_DELETE
#define RT_DELETE RT_MAKE_NAME(delete)
#endif
+#ifdef RT_MEASURE_MEMORY_USAGE
#define RT_MEMORY_USAGE RT_MAKE_NAME(memory_usage)
+#endif
#ifdef RT_DEBUG
#define RT_DUMP RT_MAKE_NAME(dump)
#define RT_DUMP_NODE RT_MAKE_NAME(dump_node)
@@ -150,6 +153,9 @@
#define RT_NEW_ROOT RT_MAKE_NAME(new_root)
#define RT_ALLOC_NODE RT_MAKE_NAME(alloc_node)
#define RT_INIT_NODE RT_MAKE_NAME(init_node)
+#ifdef RT_MEASURE_MEMORY_USAGE
+#define RT_FANOUT_GET_NODE_SIZE RT_MAKE_NAME(fanout_get_node_size)
+#endif
#define RT_FREE_NODE RT_MAKE_NAME(free_node)
#define RT_FREE_RECURSE RT_MAKE_NAME(free_recurse)
#define RT_EXTEND_UP RT_MAKE_NAME(extend_up)
@@ -255,7 +261,9 @@ RT_SCOPE RT_ITER * RT_BEGIN_ITERATE(RT_RADIX_TREE *tree);
RT_SCOPE bool RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, RT_VALUE_TYPE *value_p);
RT_SCOPE void RT_END_ITERATE(RT_ITER *iter);
+#ifdef RT_MEASURE_MEMORY_USAGE
RT_SCOPE uint64 RT_MEMORY_USAGE(RT_RADIX_TREE *tree);
+#endif
#ifdef RT_DEBUG
RT_SCOPE void RT_DUMP(RT_RADIX_TREE *tree);
@@ -624,6 +632,10 @@ typedef struct RT_RADIX_TREE_CONTROL
uint64 max_val;
uint64 num_keys;
+#ifdef RT_MEASURE_MEMORY_USAGE
+ int64 mem_used;
+#endif
+
/* statistics */
#ifdef RT_DEBUG
int32 cnt[RT_SIZE_CLASS_COUNT];
@@ -1089,6 +1101,11 @@ RT_ALLOC_NODE(RT_RADIX_TREE *tree, RT_SIZE_CLASS size_class, bool is_leaf)
allocsize);
#endif
+#ifdef RT_MEASURE_MEMORY_USAGE
+ /* update memory usage */
+ tree->ctl->mem_used += allocsize;
+#endif
+
#ifdef RT_DEBUG
/* update the statistics */
tree->ctl->cnt[size_class]++;
@@ -1165,6 +1182,54 @@ RT_SWITCH_NODE_KIND(RT_RADIX_TREE *tree, RT_PTR_ALLOC allocnode, RT_PTR_LOCAL no
return newnode;
}
+#ifdef RT_MEASURE_MEMORY_USAGE
+/* Return the node size of the given fanout of the size class */
+static inline Size
+RT_FANOUT_GET_NODE_SIZE(int fanout, bool is_leaf)
+{
+ const Size fanout_inner_node_size[] = {
+ [3] = RT_SIZE_CLASS_INFO[RT_CLASS_3].inner_size,
+ [15] = RT_SIZE_CLASS_INFO[RT_CLASS_32_MIN].inner_size,
+ [32] = RT_SIZE_CLASS_INFO[RT_CLASS_32_MAX].inner_size,
+ [125] = RT_SIZE_CLASS_INFO[RT_CLASS_125].inner_size,
+ [256] = RT_SIZE_CLASS_INFO[RT_CLASS_256].inner_size,
+ };
+ const Size fanout_leaf_node_size[] = {
+ [3] = RT_SIZE_CLASS_INFO[RT_CLASS_3].leaf_size,
+ [15] = RT_SIZE_CLASS_INFO[RT_CLASS_32_MIN].leaf_size,
+ [32] = RT_SIZE_CLASS_INFO[RT_CLASS_32_MAX].leaf_size,
+ [125] = RT_SIZE_CLASS_INFO[RT_CLASS_125].leaf_size,
+ [256] = RT_SIZE_CLASS_INFO[RT_CLASS_256].leaf_size,
+ };
+ Size node_size;
+
+ node_size = is_leaf ?
+ fanout_leaf_node_size[fanout] : fanout_inner_node_size[fanout];
+
+#ifdef USE_ASSERT_CHECKING
+ {
+ Size assert_node_size = 0;
+
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ RT_SIZE_CLASS_ELEM size_class = RT_SIZE_CLASS_INFO[i];
+
+ if (size_class.fanout == fanout)
+ {
+ assert_node_size = is_leaf ?
+ size_class.leaf_size : size_class.inner_size;
+ break;
+ }
+ }
+
+ Assert(node_size == assert_node_size);
+ }
+#endif
+
+ return node_size;
+}
+#endif
+
/* Free the given node */
static void
RT_FREE_NODE(RT_RADIX_TREE *tree, RT_PTR_ALLOC allocnode)
@@ -1197,11 +1262,22 @@ RT_FREE_NODE(RT_RADIX_TREE *tree, RT_PTR_ALLOC allocnode)
}
#endif
+#ifdef RT_MEASURE_MEMORY_USAGE
+ /* update memory usage */
+ {
+ RT_PTR_LOCAL node = RT_PTR_GET_LOCAL(tree, allocnode);
+ tree->ctl->mem_used -= RT_FANOUT_GET_NODE_SIZE(node->fanout,
+ RT_NODE_IS_LEAF(node));
+ Assert(tree->ctl->mem_used >= 0);
+ }
+#endif
+
#ifdef RT_SHMEM
dsa_free(tree->dsa, allocnode);
#else
pfree(allocnode);
#endif
+
}
/* Update the parent's pointer when growing a node */
@@ -1989,27 +2065,23 @@ RT_END_ITERATE(RT_ITER *iter)
/*
* Return the statistics of the amount of memory used by the radix tree.
*/
+#ifdef RT_MEASURE_MEMORY_USAGE
RT_SCOPE uint64
RT_MEMORY_USAGE(RT_RADIX_TREE *tree)
{
Size total = 0;
- RT_LOCK_SHARED(tree);
-
#ifdef RT_SHMEM
Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
- total = dsa_get_total_size(tree->dsa);
-#else
- for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
- {
- total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
- total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
- }
#endif
+ RT_LOCK_SHARED(tree);
+ total = tree->ctl->mem_used;
RT_UNLOCK(tree);
+
return total;
}
+#endif
/*
* Verify the radix tree node.
@@ -2476,6 +2548,7 @@ RT_DUMP(RT_RADIX_TREE *tree)
#undef RT_NEW_ROOT
#undef RT_ALLOC_NODE
#undef RT_INIT_NODE
+#undef RT_FANOUT_GET_NODE_SIZE
#undef RT_FREE_NODE
#undef RT_FREE_RECURSE
#undef RT_EXTEND_UP
diff --git a/src/include/utils/dsa.h b/src/include/utils/dsa.h
index 2af215484f..3ce4ee300a 100644
--- a/src/include/utils/dsa.h
+++ b/src/include/utils/dsa.h
@@ -121,7 +121,6 @@ extern dsa_handle dsa_get_handle(dsa_area *area);
extern dsa_pointer dsa_allocate_extended(dsa_area *area, size_t size, int flags);
extern void dsa_free(dsa_area *area, dsa_pointer dp);
extern void *dsa_get_address(dsa_area *area, dsa_pointer dp);
-extern size_t dsa_get_total_size(dsa_area *area);
extern void dsa_trim(dsa_area *area);
extern void dsa_dump(dsa_area *area);
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
index 5a169854d9..19d286d84b 100644
--- a/src/test/modules/test_radixtree/test_radixtree.c
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -114,6 +114,7 @@ static const test_spec test_specs[] = {
#define RT_DECLARE
#define RT_DEFINE
#define RT_USE_DELETE
+#define RT_MEASURE_MEMORY_USAGE
#define RT_VALUE_TYPE TestValueType
/* #define RT_SHMEM */
#include "lib/radixtree.h"
--
2.31.1
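
The memory accounting added here is opt-in at template-instantiation time. A minimal sketch of a caller (mirroring the defines used by bench_radix_tree.c and test_radixtree.c; 'mytree' and max_bytes are placeholders) looks like this:

    #define RT_PREFIX mytree
    #define RT_SCOPE static
    #define RT_DECLARE
    #define RT_DEFINE
    #define RT_MEASURE_MEMORY_USAGE		/* enables mytree_memory_usage() */
    #define RT_VALUE_TYPE uint64
    #include "lib/radixtree.h"

    mytree_radix_tree *tree = mytree_create(CurrentMemoryContext);

    /* stop collecting once the tracked allocations exceed the budget */
    if (mytree_memory_usage(tree) > max_bytes)
        /* e.g. trigger index vacuuming and reset the store */ ;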
v31-0004-Add-TIDStore-to-store-sets-of-TIDs-ItemPointerDa.patch
From db646cb1da4a21182028096e036b0f86d61e8ce8 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 23 Dec 2022 23:50:39 +0900
Subject: [PATCH v31 04/14] Add TIDStore, to store sets of TIDs
(ItemPointerData) efficiently.
The TIDStore is designed to store large sets of TIDs efficiently, and
is backed by the radix tree. A TID is encoded into a 64-bit key and a
64-bit value, and inserted into the radix tree.
The TIDStore is not used for anything yet, aside from the test code,
but the follow up patch integrates the TIDStore with lazy vacuum,
reducing lazy vacuum memory usage and lifting the 1GB limit on its
size, by storing the list of dead TIDs more efficiently.
This includes a unit test module, in src/test/modules/test_tidstore.
---
doc/src/sgml/monitoring.sgml | 4 +
src/backend/access/common/Makefile | 1 +
src/backend/access/common/meson.build | 1 +
src/backend/access/common/tidstore.c | 681 ++++++++++++++++++
src/backend/storage/lmgr/lwlock.c | 2 +
src/include/access/tidstore.h | 49 ++
src/include/storage/lwlock.h | 1 +
src/test/modules/Makefile | 1 +
src/test/modules/meson.build | 1 +
src/test/modules/test_tidstore/Makefile | 23 +
.../test_tidstore/expected/test_tidstore.out | 13 +
src/test/modules/test_tidstore/meson.build | 35 +
.../test_tidstore/sql/test_tidstore.sql | 7 +
.../test_tidstore/test_tidstore--1.0.sql | 8 +
.../modules/test_tidstore/test_tidstore.c | 226 ++++++
.../test_tidstore/test_tidstore.control | 4 +
16 files changed, 1057 insertions(+)
create mode 100644 src/backend/access/common/tidstore.c
create mode 100644 src/include/access/tidstore.h
create mode 100644 src/test/modules/test_tidstore/Makefile
create mode 100644 src/test/modules/test_tidstore/expected/test_tidstore.out
create mode 100644 src/test/modules/test_tidstore/meson.build
create mode 100644 src/test/modules/test_tidstore/sql/test_tidstore.sql
create mode 100644 src/test/modules/test_tidstore/test_tidstore--1.0.sql
create mode 100644 src/test/modules/test_tidstore/test_tidstore.c
create mode 100644 src/test/modules/test_tidstore/test_tidstore.control
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 6249bb50d0..97d588b1d8 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -2203,6 +2203,10 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
<entry>Waiting to access a shared TID bitmap during a parallel bitmap
index scan.</entry>
</row>
+ <row>
+ <entry><literal>SharedTidStore</literal></entry>
+ <entry>Waiting to access a shared TID store.</entry>
+ </row>
<row>
<entry><literal>SharedTupleStore</literal></entry>
<entry>Waiting to access a shared tuple store during parallel
diff --git a/src/backend/access/common/Makefile b/src/backend/access/common/Makefile
index b9aff0ccfd..67b8cc6108 100644
--- a/src/backend/access/common/Makefile
+++ b/src/backend/access/common/Makefile
@@ -27,6 +27,7 @@ OBJS = \
syncscan.o \
toast_compression.o \
toast_internals.o \
+ tidstore.o \
tupconvert.o \
tupdesc.o
diff --git a/src/backend/access/common/meson.build b/src/backend/access/common/meson.build
index f5ac17b498..fce19c09ce 100644
--- a/src/backend/access/common/meson.build
+++ b/src/backend/access/common/meson.build
@@ -15,6 +15,7 @@ backend_sources += files(
'syncscan.c',
'toast_compression.c',
'toast_internals.c',
+ 'tidstore.c',
'tupconvert.c',
'tupdesc.c',
)
diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
new file mode 100644
index 0000000000..8c05e60d92
--- /dev/null
+++ b/src/backend/access/common/tidstore.c
@@ -0,0 +1,681 @@
+/*-------------------------------------------------------------------------
+ *
+ * tidstore.c
+ * Tid (ItemPointerData) storage implementation.
+ *
+ * This module provides an in-memory data structure to store Tids (ItemPointer).
+ * Internally, a tid is encoded as a pair of 64-bit key and 64-bit value, and
+ * stored in the radix tree.
+ *
+ * A TidStore can be shared among parallel worker processes by passing a DSA
+ * area to tidstore_create(). Other backends can attach to the shared TidStore
+ * with tidstore_attach().
+ *
+ * As for concurrency, we basically rely on the concurrency support in the
+ * radix tree, but we acquire the lock on a TidStore in some cases, for
+ * example, when resetting the store and when accessing the number of tids
+ * in the store (num_tids).
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/common/tidstore.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/tidstore.h"
+#include "miscadmin.h"
+#include "port/pg_bitutils.h"
+#include "storage/lwlock.h"
+#include "utils/dsa.h"
+#include "utils/memutils.h"
+
+/*
+ * For encoding purposes, tids are represented as a pair of 64-bit key and
+ * 64-bit value. First, we construct a 64-bit unsigned integer by combining
+ * the block number and the offset number. The number of bits used for the
+ * offset number is determined by max_offset in tidstore_create(). We are
+ * frugal with the bits, because smaller keys could help keep the radix
+ * tree shallow.
+ *
+ * For example, a tid of heap with 8kB blocks uses the lowest 9 bits for
+ * the offset number and uses the next 32 bits for the block number. That
+ * is, only 41 bits are used:
+ *
+ * uuuuuuuY YYYYYYYY YYYYYYYY YYYYYYYY YYYYYYYX XXXXXXXX
+ *
+ * X = bits used for offset number
+ * Y = bits used for block number
+ * u = unused bit
+ * (high on the left, low on the right)
+ *
+ * 9 bits are enough for the offset number, because MaxHeapTuplesPerPage < 2^9
+ * on 8kB blocks.
+ *
+ * The 64-bit value is the bitmap representation of the lowest 6 bits
+ * (TIDSTORE_VALUE_NBITS) of the integer, and the remaining 35 bits are used
+ * as the key:
+ *
+ * uuuuuuuY YYYYYYYY YYYYYYYY YYYYYYYY YYYYYYYX XXXXXXXX
+ * |----| value
+ * |---------------------------------------------| key
+ *
+ * The maximum height of the radix tree is 5 in this case.
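+ *
+ * As a worked example (illustrative numbers, not part of the scheme itself):
+ * the tid (block 1000, offset 5) becomes the integer (1000 << 9) | 5 = 512005,
+ * so the key is 512005 >> 6 = 8000 and the value has bit 512005 % 64 = 5 set.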
+ */
+#define TIDSTORE_VALUE_NBITS 6 /* log(64, 2) */
+#define TIDSTORE_OFFSET_MASK ((1 << TIDSTORE_VALUE_NBITS) - 1)
+
+/* A magic value used to identify our TidStores. */
+#define TIDSTORE_MAGIC 0x826f6a10
+
+#define RT_PREFIX local_rt
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#define RT_VALUE_TYPE uint64
+#include "lib/radixtree.h"
+
+#define RT_PREFIX shared_rt
+#define RT_SHMEM
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#define RT_VALUE_TYPE uint64
+#include "lib/radixtree.h"
+
+/* The control object for a TidStore */
+typedef struct TidStoreControl
+{
+ /* the number of tids in the store */
+ int64 num_tids;
+
+ /* These values are never changed after creation */
+ size_t max_bytes; /* the maximum bytes a TidStore can use */
+ int max_offset; /* the maximum offset number */
+ int offset_nbits; /* the number of bits required for an offset
+ * number */
+ int offset_key_nbits; /* the number of bits of an offset number
+ * used in a key */
+
+ /* The below fields are used only in shared case */
+
+ uint32 magic;
+ LWLock lock;
+
+ /* handles for TidStore and radix tree */
+ tidstore_handle handle;
+ shared_rt_handle tree_handle;
+} TidStoreControl;
+
+/* Per-backend state for a TidStore */
+struct TidStore
+{
+ /*
+ * Control object. This is allocated in DSA area 'area' in the shared
+ * case, otherwise in backend-local memory.
+ */
+ TidStoreControl *control;
+
+ /* Storage for Tids. Use either one depending on TidStoreIsShared() */
+ union
+ {
+ local_rt_radix_tree *local;
+ shared_rt_radix_tree *shared;
+ } tree;
+
+ /* DSA area for TidStore if used */
+ dsa_area *area;
+};
+#define TidStoreIsShared(ts) ((ts)->area != NULL)
+
+/* Iterator for TidStore */
+typedef struct TidStoreIter
+{
+ TidStore *ts;
+
+ /* iterator of radix tree. Use either one depending on TidStoreIsShared() */
+ union
+ {
+ shared_rt_iter *shared;
+ local_rt_iter *local;
+ } tree_iter;
+
+ /* have we returned all tids? */
+ bool finished;
+
+ /* save for the next iteration */
+ uint64 next_key;
+ uint64 next_val;
+
+ /* output for the caller */
+ TidStoreIterResult result;
+} TidStoreIter;
+
+static void tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, uint64 val);
+static inline BlockNumber key_get_blkno(TidStore *ts, uint64 key);
+static inline uint64 encode_key_off(TidStore *ts, BlockNumber block, uint32 offset, uint64 *off_bit);
+static inline uint64 tid_to_key_off(TidStore *ts, ItemPointer tid, uint64 *off_bit);
+
+/*
+ * Create a TidStore. The returned object is allocated in backend-local memory.
+ * The radix tree for storage is allocated in the DSA area if 'area' is non-NULL.
+ */
+TidStore *
+tidstore_create(size_t max_bytes, int max_offset, dsa_area *area)
+{
+ TidStore *ts;
+
+ ts = palloc0(sizeof(TidStore));
+
+ /*
+ * Create the radix tree for the main storage.
+ *
+ * Memory consumption depends on the number of stored tids, but also on their
+ * distribution, on how the radix tree stores them, and on the memory
+ * management that backs the radix tree. The maximum number of bytes that a
+ * TidStore can use is specified by max_bytes in tidstore_create(). We want
+ * the total memory consumption of a TidStore not to exceed max_bytes.
+ *
+ * In the local TidStore case, the radix tree uses a slab allocator for each
+ * node class. The most memory-consuming case while adding tids associated
+ * with one page (i.e., during tidstore_add_tids()) is allocating a new slab
+ * block for a new radix tree node, which is approximately 70kB. Therefore,
+ * we deduct 70kB from max_bytes.
+ *
+ * In the shared case, DSA allocates memory segments that follow a geometric
+ * series, approximately doubling the total DSA size (see make_new_segment()
+ * in dsa.c). We simulated how DSA grows the segment size, and the simulation
+ * showed that a 75% threshold for the maximum bytes works well when max_bytes
+ * is a power of two, and a 60% threshold works for other cases.
+ */
+ if (area != NULL)
+ {
+ dsa_pointer dp;
+ float ratio = ((max_bytes & (max_bytes - 1)) == 0) ? 0.75 : 0.6;
+
+ ts->tree.shared = shared_rt_create(CurrentMemoryContext, area,
+ LWTRANCHE_SHARED_TIDSTORE);
+
+ dp = dsa_allocate0(area, sizeof(TidStoreControl));
+ ts->control = (TidStoreControl *) dsa_get_address(area, dp);
+ ts->control->max_bytes = (uint64) (max_bytes * ratio);
+ ts->area = area;
+
+ ts->control->magic = TIDSTORE_MAGIC;
+ LWLockInitialize(&ts->control->lock, LWTRANCHE_SHARED_TIDSTORE);
+ ts->control->handle = dp;
+ ts->control->tree_handle = shared_rt_get_handle(ts->tree.shared);
+ }
+ else
+ {
+ ts->tree.local = local_rt_create(CurrentMemoryContext);
+
+ ts->control = (TidStoreControl *) palloc0(sizeof(TidStoreControl));
+ ts->control->max_bytes = max_bytes - (70 * 1024);
+ }
+
+ ts->control->max_offset = max_offset;
+ ts->control->offset_nbits = pg_ceil_log2_32(max_offset);
+
+ if (ts->control->offset_nbits < TIDSTORE_VALUE_NBITS)
+ ts->control->offset_nbits = TIDSTORE_VALUE_NBITS;
+
+ ts->control->offset_key_nbits =
+ ts->control->offset_nbits - TIDSTORE_VALUE_NBITS;
+
+ return ts;
+}
+
+/*
+ * Attach to the shared TidStore using a handle. The returned object is
+ * allocated in backend-local memory using the CurrentMemoryContext.
+ */
+TidStore *
+tidstore_attach(dsa_area *area, tidstore_handle handle)
+{
+ TidStore *ts;
+ dsa_pointer control;
+
+ Assert(area != NULL);
+ Assert(DsaPointerIsValid(handle));
+
+ /* create per-backend state */
+ ts = palloc0(sizeof(TidStore));
+
+ /* Find the control object in shared memory */
+ control = handle;
+
+ /* Set up the TidStore */
+ ts->control = (TidStoreControl *) dsa_get_address(area, control);
+ Assert(ts->control->magic == TIDSTORE_MAGIC);
+
+ ts->tree.shared = shared_rt_attach(area, ts->control->tree_handle);
+ ts->area = area;
+
+ return ts;
+}
+
+/*
+ * Detach from a TidStore. This detaches from the radix tree and frees the
+ * backend-local resources. The radix tree will continue to exist until
+ * it is either explicitly destroyed, or the area that backs it is returned
+ * to the operating system.
+ */
+void
+tidstore_detach(TidStore *ts)
+{
+ Assert(TidStoreIsShared(ts) && ts->control->magic == TIDSTORE_MAGIC);
+
+ shared_rt_detach(ts->tree.shared);
+ pfree(ts);
+}
+
+/*
+ * Destroy a TidStore, returning all memory.
+ *
+ * TODO: The caller must be certain that no other backend will attempt to
+ * access the TidStore before calling this function. Other backends must
+ * explicitly call tidstore_detach to free up backend-local memory associated
+ * with the TidStore. The backend that calls tidstore_destroy must not call
+ * tidstore_detach.
+ */
+void
+tidstore_destroy(TidStore *ts)
+{
+ if (TidStoreIsShared(ts))
+ {
+ Assert(ts->control->magic == TIDSTORE_MAGIC);
+
+ /*
+ * Vandalize the control block to help catch programming error where
+ * other backends access the memory formerly occupied by this radix
+ * tree.
+ */
+ ts->control->magic = 0;
+ dsa_free(ts->area, ts->control->handle);
+ shared_rt_free(ts->tree.shared);
+ }
+ else
+ {
+ pfree(ts->control);
+ local_rt_free(ts->tree.local);
+ }
+
+ pfree(ts);
+}
+
+/*
+ * Forget all collected Tids. It's similar to tidstore_destroy but we don't free
+ * the entire TidStore; we recreate only the radix tree storage.
+ */
+void
+tidstore_reset(TidStore *ts)
+{
+ if (TidStoreIsShared(ts))
+ {
+ Assert(ts->control->magic == TIDSTORE_MAGIC);
+
+ LWLockAcquire(&ts->control->lock, LW_EXCLUSIVE);
+
+ /*
+ * Free the radix tree and return allocated DSA segments to
+ * the operating system.
+ */
+ shared_rt_free(ts->tree.shared);
+ dsa_trim(ts->area);
+
+ /* Recreate the radix tree */
+ ts->tree.shared = shared_rt_create(CurrentMemoryContext, ts->area,
+ LWTRANCHE_SHARED_TIDSTORE);
+
+ /* update the radix tree handle as we recreated it */
+ ts->control->tree_handle = shared_rt_get_handle(ts->tree.shared);
+
+ /* Reset the statistics */
+ ts->control->num_tids = 0;
+
+ LWLockRelease(&ts->control->lock);
+ }
+ else
+ {
+ local_rt_free(ts->tree.local);
+ ts->tree.local = local_rt_create(CurrentMemoryContext);
+
+ /* Reset the statistics */
+ ts->control->num_tids = 0;
+ }
+}
+
+/* Add Tids on a block to TidStore */
+void
+tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets)
+{
+ uint64 *values;
+ uint64 key;
+ uint64 prev_key;
+ uint64 off_bitmap = 0;
+ int idx;
+ const uint64 key_base = ((uint64) blkno) << ts->control->offset_key_nbits;
+ const int nkeys = UINT64CONST(1) << ts->control->offset_key_nbits;
+
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ values = palloc(sizeof(uint64) * nkeys);
+ key = prev_key = key_base;
+
+ for (int i = 0; i < num_offsets; i++)
+ {
+ uint64 off_bit;
+
+ /* encode the tid to a key and partial offset */
+ key = encode_key_off(ts, blkno, offsets[i], &off_bit);
+
+ /* make sure we scanned the line pointer array in order */
+ Assert(key >= prev_key);
+
+ if (key > prev_key)
+ {
+ idx = prev_key - key_base;
+ Assert(idx >= 0 && idx < nkeys);
+
+ /* write out offset bitmap for this key */
+ values[idx] = off_bitmap;
+
+ /* zero out any gaps up to the current key */
+ for (int empty_idx = idx + 1; empty_idx < key - key_base; empty_idx++)
+ values[empty_idx] = 0;
+
+ /* reset for current key -- the current offset will be handled below */
+ off_bitmap = 0;
+ prev_key = key;
+ }
+
+ off_bitmap |= off_bit;
+ }
+
+ /* save the final index for later */
+ idx = key - key_base;
+ /* write out last offset bitmap */
+ values[idx] = off_bitmap;
+
+ if (TidStoreIsShared(ts))
+ LWLockAcquire(&ts->control->lock, LW_EXCLUSIVE);
+
+ /* insert the calculated key-values to the tree */
+ for (int i = 0; i <= idx; i++)
+ {
+ if (values[i])
+ {
+ key = key_base + i;
+
+ if (TidStoreIsShared(ts))
+ shared_rt_set(ts->tree.shared, key, &values[i]);
+ else
+ local_rt_set(ts->tree.local, key, &values[i]);
+ }
+ }
+
+ /* update statistics */
+ ts->control->num_tids += num_offsets;
+
+ if (TidStoreIsShared(ts))
+ LWLockRelease(&ts->control->lock);
+
+ pfree(values);
+}
+
+/* Return true if the given tid is present in the TidStore */
+bool
+tidstore_lookup_tid(TidStore *ts, ItemPointer tid)
+{
+ uint64 key;
+ uint64 val = 0;
+ uint64 off_bit;
+ bool found;
+
+ key = tid_to_key_off(ts, tid, &off_bit);
+
+ if (TidStoreIsShared(ts))
+ found = shared_rt_search(ts->tree.shared, key, &val);
+ else
+ found = local_rt_search(ts->tree.local, key, &val);
+
+ if (!found)
+ return false;
+
+ return (val & off_bit) != 0;
+}
+
+/*
+ * Prepare to iterate through a TidStore. Since the radix tree is locked during
+ * the iteration, tidstore_end_iterate() needs to be called when finished.
+ *
+ * Concurrent updates during the iteration will be blocked when inserting a
+ * key-value to the radix tree.
+ */
+TidStoreIter *
+tidstore_begin_iterate(TidStore *ts)
+{
+ TidStoreIter *iter;
+
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ iter = palloc0(sizeof(TidStoreIter));
+ iter->ts = ts;
+
+ iter->result.blkno = InvalidBlockNumber;
+ iter->result.offsets = palloc(sizeof(OffsetNumber) * ts->control->max_offset);
+
+ if (TidStoreIsShared(ts))
+ iter->tree_iter.shared = shared_rt_begin_iterate(ts->tree.shared);
+ else
+ iter->tree_iter.local = local_rt_begin_iterate(ts->tree.local);
+
+ /* If the TidStore is empty, there is nothing to iterate */
+ if (tidstore_num_tids(ts) == 0)
+ iter->finished = true;
+
+ return iter;
+}
+
+static inline bool
+tidstore_iter_kv(TidStoreIter *iter, uint64 *key, uint64 *val)
+{
+ if (TidStoreIsShared(iter->ts))
+ return shared_rt_iterate_next(iter->tree_iter.shared, key, val);
+
+ return local_rt_iterate_next(iter->tree_iter.local, key, val);
+}
+
+/*
+ * Scan the TidStore and return a pointer to TidStoreIterResult that has tids
+ * in one block. We return the block numbers in ascending order and the offset
+ * numbers in each result are also sorted in ascending order.
+ */
+TidStoreIterResult *
+tidstore_iterate_next(TidStoreIter *iter)
+{
+ uint64 key;
+ uint64 val;
+ TidStoreIterResult *result = &(iter->result);
+
+ if (iter->finished)
+ return NULL;
+
+ if (BlockNumberIsValid(result->blkno))
+ {
+ /* Process the previously collected key-value */
+ result->num_offsets = 0;
+ tidstore_iter_extract_tids(iter, iter->next_key, iter->next_val);
+ }
+
+ while (tidstore_iter_kv(iter, &key, &val))
+ {
+ BlockNumber blkno;
+
+ blkno = key_get_blkno(iter->ts, key);
+
+ if (BlockNumberIsValid(result->blkno) && result->blkno != blkno)
+ {
+ /*
+ * We got a key-value pair for a different block. So return the
+ * collected tids, and remember the key-value for the next iteration.
+ */
+ iter->next_key = key;
+ iter->next_val = val;
+ return result;
+ }
+
+ /* Collect tids extracted from the key-value pair */
+ tidstore_iter_extract_tids(iter, key, val);
+ }
+
+ iter->finished = true;
+ return result;
+}
+
+/*
+ * Finish an iteration over TidStore. This needs to be called after finishing
+ * or when exiting an iteration.
+ */
+void
+tidstore_end_iterate(TidStoreIter *iter)
+{
+ if (TidStoreIsShared(iter->ts))
+ shared_rt_end_iterate(iter->tree_iter.shared);
+ else
+ local_rt_end_iterate(iter->tree_iter.local);
+
+ pfree(iter->result.offsets);
+ pfree(iter);
+}
+
+/* Return the number of tids we collected so far */
+int64
+tidstore_num_tids(TidStore *ts)
+{
+ uint64 num_tids;
+
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ if (!TidStoreIsShared(ts))
+ return ts->control->num_tids;
+
+ LWLockAcquire(&ts->control->lock, LW_SHARED);
+ num_tids = ts->control->num_tids;
+ LWLockRelease(&ts->control->lock);
+
+ return num_tids;
+}
+
+/* Return true if the current memory usage of TidStore exceeds the limit */
+bool
+tidstore_is_full(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ return (tidstore_memory_usage(ts) > ts->control->max_bytes);
+}
+
+/* Return the maximum memory TidStore can use */
+size_t
+tidstore_max_memory(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ return ts->control->max_bytes;
+}
+
+/* Return the memory usage of TidStore */
+size_t
+tidstore_memory_usage(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ /*
+ * In the shared case, TidStoreControl and radix_tree are backed by the
+ * same DSA area and rt_memory_usage() returns the value including both.
+ * So we don't need to add the size of TidStoreControl separately.
+ */
+ if (TidStoreIsShared(ts))
+ return sizeof(TidStore) + shared_rt_memory_usage(ts->tree.shared);
+
+ return sizeof(TidStore) + sizeof(TidStoreControl) + local_rt_memory_usage(ts->tree.local);
+}
+
+/*
+ * Get a handle that can be used by other processes to attach to this TidStore
+ */
+tidstore_handle
+tidstore_get_handle(TidStore *ts)
+{
+ Assert(TidStoreIsShared(ts) && ts->control->magic == TIDSTORE_MAGIC);
+
+ return ts->control->handle;
+}
+
+/* Extract tids from the given key-value pair */
+static void
+tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, uint64 val)
+{
+ TidStoreIterResult *result = (&iter->result);
+
+ while (val)
+ {
+ uint64 tid_i;
+ OffsetNumber off;
+
+ tid_i = key << TIDSTORE_VALUE_NBITS;
+ tid_i |= pg_rightmost_one_pos64(val);
+
+ off = tid_i & ((UINT64CONST(1) << iter->ts->control->offset_nbits) - 1);
+
+ Assert(result->num_offsets < iter->ts->control->max_offset);
+ result->offsets[result->num_offsets++] = off;
+
+ /* unset the rightmost bit */
+ val &= ~pg_rightmost_one64(val);
+ }
+
+ result->blkno = key_get_blkno(iter->ts, key);
+}
+
+/* Get block number from the given key */
+static inline BlockNumber
+key_get_blkno(TidStore *ts, uint64 key)
+{
+ return (BlockNumber) (key >> ts->control->offset_key_nbits);
+}
+
+/* Encode a tid to key and offset */
+static inline uint64
+tid_to_key_off(TidStore *ts, ItemPointer tid, uint64 *off_bit)
+{
+ uint32 offset = ItemPointerGetOffsetNumber(tid);
+ BlockNumber block = ItemPointerGetBlockNumber(tid);
+
+ return encode_key_off(ts, block, offset, off_bit);
+}
+
+/* encode a block and offset to a key and partial offset */
+static inline uint64
+encode_key_off(TidStore *ts, BlockNumber block, uint32 offset, uint64 *off_bit)
+{
+ uint64 key;
+ uint64 tid_i;
+ uint32 off_lower;
+
+ off_lower = offset & TIDSTORE_OFFSET_MASK;
+ Assert(off_lower < (sizeof(uint64) * BITS_PER_BYTE));
+
+ *off_bit = UINT64CONST(1) << off_lower;
+ tid_i = offset | ((uint64) block << ts->control->offset_nbits);
+ key = tid_i >> TIDSTORE_VALUE_NBITS;
+
+ return key;
+}
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index d2ec396045..55b3a04097 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -176,6 +176,8 @@ static const char *const BuiltinTrancheNames[] = {
"SharedTupleStore",
/* LWTRANCHE_SHARED_TIDBITMAP: */
"SharedTidBitmap",
+ /* LWTRANCHE_SHARED_TIDSTORE: */
+ "SharedTidStore",
/* LWTRANCHE_PARALLEL_APPEND: */
"ParallelAppend",
/* LWTRANCHE_PER_XACT_PREDICATE_LIST: */
diff --git a/src/include/access/tidstore.h b/src/include/access/tidstore.h
new file mode 100644
index 0000000000..a35a52124a
--- /dev/null
+++ b/src/include/access/tidstore.h
@@ -0,0 +1,49 @@
+/*-------------------------------------------------------------------------
+ *
+ * tidstore.h
+ * Tid storage.
+ *
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/tidstore.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef TIDSTORE_H
+#define TIDSTORE_H
+
+#include "storage/itemptr.h"
+#include "utils/dsa.h"
+
+typedef dsa_pointer tidstore_handle;
+
+typedef struct TidStore TidStore;
+typedef struct TidStoreIter TidStoreIter;
+
+typedef struct TidStoreIterResult
+{
+ BlockNumber blkno;
+ OffsetNumber *offsets;
+ int num_offsets;
+} TidStoreIterResult;
+
+extern TidStore *tidstore_create(size_t max_bytes, int max_offset, dsa_area *dsa);
+extern TidStore *tidstore_attach(dsa_area *dsa, dsa_pointer handle);
+extern void tidstore_detach(TidStore *ts);
+extern void tidstore_destroy(TidStore *ts);
+extern void tidstore_reset(TidStore *ts);
+extern void tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets);
+extern bool tidstore_lookup_tid(TidStore *ts, ItemPointer tid);
+extern TidStoreIter * tidstore_begin_iterate(TidStore *ts);
+extern TidStoreIterResult *tidstore_iterate_next(TidStoreIter *iter);
+extern void tidstore_end_iterate(TidStoreIter *iter);
+extern int64 tidstore_num_tids(TidStore *ts);
+extern bool tidstore_is_full(TidStore *ts);
+extern size_t tidstore_max_memory(TidStore *ts);
+extern size_t tidstore_memory_usage(TidStore *ts);
+extern tidstore_handle tidstore_get_handle(TidStore *ts);
+
+#endif /* TIDSTORE_H */
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index d2c7afb8f4..07002fdfbe 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -199,6 +199,7 @@ typedef enum BuiltinTrancheIds
LWTRANCHE_PER_SESSION_RECORD_TYPMOD,
LWTRANCHE_SHARED_TUPLESTORE,
LWTRANCHE_SHARED_TIDBITMAP,
+ LWTRANCHE_SHARED_TIDSTORE,
LWTRANCHE_PARALLEL_APPEND,
LWTRANCHE_PER_XACT_PREDICATE_LIST,
LWTRANCHE_PGSTATS_DSA,
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 9659eb85d7..bddc16ada7 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -34,6 +34,7 @@ SUBDIRS = \
test_rls_hooks \
test_shm_mq \
test_slru \
+ test_tidstore \
unsafe_tests \
worker_spi
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 232cbdac80..c0d5645ad8 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -30,5 +30,6 @@ subdir('test_regex')
subdir('test_rls_hooks')
subdir('test_shm_mq')
subdir('test_slru')
+subdir('test_tidstore')
subdir('unsafe_tests')
subdir('worker_spi')
diff --git a/src/test/modules/test_tidstore/Makefile b/src/test/modules/test_tidstore/Makefile
new file mode 100644
index 0000000000..dab107d70c
--- /dev/null
+++ b/src/test/modules/test_tidstore/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_tidstore/Makefile
+
+MODULE_big = test_tidstore
+OBJS = \
+ $(WIN32RES) \
+ test_tidstore.o
+PGFILEDESC = "test_tidstore - test code for src/backend/access/common/tidstore.c"
+
+EXTENSION = test_tidstore
+DATA = test_tidstore--1.0.sql
+
+REGRESS = test_tidstore
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_tidstore
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_tidstore/expected/test_tidstore.out b/src/test/modules/test_tidstore/expected/test_tidstore.out
new file mode 100644
index 0000000000..7ff2f9af87
--- /dev/null
+++ b/src/test/modules/test_tidstore/expected/test_tidstore.out
@@ -0,0 +1,13 @@
+CREATE EXTENSION test_tidstore;
+--
+-- All the logic is in the test_tidstore() function. It will throw
+-- an error if something fails.
+--
+SELECT test_tidstore();
+NOTICE: testing empty tidstore
+NOTICE: testing basic operations
+ test_tidstore
+---------------
+
+(1 row)
+
diff --git a/src/test/modules/test_tidstore/meson.build b/src/test/modules/test_tidstore/meson.build
new file mode 100644
index 0000000000..31f2da7b61
--- /dev/null
+++ b/src/test/modules/test_tidstore/meson.build
@@ -0,0 +1,35 @@
+# FIXME: prevent install during main install, but not during test :/
+
+test_tidstore_sources = files(
+ 'test_tidstore.c',
+)
+
+if host_system == 'windows'
+ test_tidstore_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'test_tidstore',
+ '--FILEDESC', 'test_tidstore - test code for src/backend/access/common/tidstore.c',])
+endif
+
+test_tidstore = shared_module('test_tidstore',
+ test_tidstore_sources,
+ link_with: pgport_srv,
+ kwargs: pg_mod_args,
+)
+testprep_targets += test_tidstore
+
+install_data(
+ 'test_tidstore.control',
+ 'test_tidstore--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'test_tidstore',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'test_tidstore',
+ ],
+ },
+}
diff --git a/src/test/modules/test_tidstore/sql/test_tidstore.sql b/src/test/modules/test_tidstore/sql/test_tidstore.sql
new file mode 100644
index 0000000000..03aea31815
--- /dev/null
+++ b/src/test/modules/test_tidstore/sql/test_tidstore.sql
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_tidstore;
+
+--
+-- All the logic is in the test_tidstore() function. It will throw
+-- an error if something fails.
+--
+SELECT test_tidstore();
diff --git a/src/test/modules/test_tidstore/test_tidstore--1.0.sql b/src/test/modules/test_tidstore/test_tidstore--1.0.sql
new file mode 100644
index 0000000000..47e9149900
--- /dev/null
+++ b/src/test/modules/test_tidstore/test_tidstore--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_tidstore/test_tidstore--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_tidstore" to load this file. \quit
+
+CREATE FUNCTION test_tidstore()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_tidstore/test_tidstore.c b/src/test/modules/test_tidstore/test_tidstore.c
new file mode 100644
index 0000000000..9a1217f833
--- /dev/null
+++ b/src/test/modules/test_tidstore/test_tidstore.c
@@ -0,0 +1,226 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_tidstore.c
+ * Test TidStore data structure.
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/test/modules/test_tidstore/test_tidstore.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "access/tidstore.h"
+#include "fmgr.h"
+#include "miscadmin.h"
+#include "storage/block.h"
+#include "storage/itemptr.h"
+#include "storage/lwlock.h"
+#include "utils/memutils.h"
+
+PG_MODULE_MAGIC;
+
+/* #define TEST_SHARED_TIDSTORE 1 */
+
+#define TEST_TIDSTORE_MAX_BYTES (2 * 1024 * 1024L) /* 2MB */
+
+PG_FUNCTION_INFO_V1(test_tidstore);
+
+static void
+check_tid(TidStore *ts, BlockNumber blkno, OffsetNumber off, bool expect)
+{
+ ItemPointerData tid;
+ bool found;
+
+ ItemPointerSet(&tid, blkno, off);
+
+ found = tidstore_lookup_tid(ts, &tid);
+
+ if (found != expect)
+ elog(ERROR, "lookup TID (%u, %u) returned %d, expected %d",
+ blkno, off, found, expect);
+}
+
+static void
+test_basic(int max_offset)
+{
+#define TEST_TIDSTORE_NUM_BLOCKS 5
+#define TEST_TIDSTORE_NUM_OFFSETS 5
+
+ TidStore *ts;
+ TidStoreIter *iter;
+ TidStoreIterResult *iter_result;
+ BlockNumber blks[TEST_TIDSTORE_NUM_BLOCKS] = {
+ 0, MaxBlockNumber, MaxBlockNumber - 1, 1, MaxBlockNumber / 2,
+ };
+ BlockNumber blks_sorted[TEST_TIDSTORE_NUM_BLOCKS] = {
+ 0, 1, MaxBlockNumber / 2, MaxBlockNumber - 1, MaxBlockNumber
+ };
+ OffsetNumber offs[TEST_TIDSTORE_NUM_OFFSETS];
+ int blk_idx;
+
+#ifdef TEST_SHARED_TIDSTORE
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+
+ LWLockRegisterTranche(tranche_id, "test_tidstore");
+ dsa = dsa_create(tranche_id);
+
+ ts = tidstore_create(TEST_TIDSTORE_MAX_BYTES, max_offset, dsa);
+#else
+ ts = tidstore_create(TEST_TIDSTORE_MAX_BYTES, max_offset, NULL);
+#endif
+
+ /* prepare the offset array */
+ offs[0] = FirstOffsetNumber;
+ offs[1] = FirstOffsetNumber + 1;
+ offs[2] = max_offset / 2;
+ offs[3] = max_offset - 1;
+ offs[4] = max_offset;
+
+ /* add tids */
+ for (int i = 0; i < TEST_TIDSTORE_NUM_BLOCKS; i++)
+ tidstore_add_tids(ts, blks[i], offs, TEST_TIDSTORE_NUM_OFFSETS);
+
+ /* lookup test */
+ for (OffsetNumber off = FirstOffsetNumber ; off < max_offset; off++)
+ {
+ bool expect = false;
+ for (int i = 0; i < TEST_TIDSTORE_NUM_OFFSETS; i++)
+ {
+ if (offs[i] == off)
+ {
+ expect = true;
+ break;
+ }
+ }
+
+ check_tid(ts, 0, off, expect);
+ check_tid(ts, 2, off, false);
+ check_tid(ts, MaxBlockNumber - 2, off, false);
+ check_tid(ts, MaxBlockNumber, off, expect);
+ }
+
+ /* test the number of tids */
+ if (tidstore_num_tids(ts) != (TEST_TIDSTORE_NUM_BLOCKS * TEST_TIDSTORE_NUM_OFFSETS))
+ elog(ERROR, "tidstore_num_tids returned " UINT64_FORMAT ", expected %d",
+ tidstore_num_tids(ts),
+ TEST_TIDSTORE_NUM_BLOCKS * TEST_TIDSTORE_NUM_OFFSETS);
+
+ /* iteration test */
+ iter = tidstore_begin_iterate(ts);
+ blk_idx = 0;
+ while ((iter_result = tidstore_iterate_next(iter)) != NULL)
+ {
+ /* check the returned block number */
+ if (blks_sorted[blk_idx] != iter_result->blkno)
+ elog(ERROR, "tidstore_iterate_next returned block number %u, expected %u",
+ iter_result->blkno, blks_sorted[blk_idx]);
+
+ /* check the returned offset numbers */
+ if (TEST_TIDSTORE_NUM_OFFSETS != iter_result->num_offsets)
+ elog(ERROR, "tidstore_iterate_next returned %u offsets, expected %u",
+ iter_result->num_offsets, TEST_TIDSTORE_NUM_OFFSETS);
+
+ for (int i = 0; i < iter_result->num_offsets; i++)
+ {
+ if (offs[i] != iter_result->offsets[i])
+ elog(ERROR, "tidstore_iterate_next returned offset number %u on block %u, expected %u",
+ iter_result->offsets[i], iter_result->blkno, offs[i]);
+ }
+
+ blk_idx++;
+ }
+
+ if (blk_idx != TEST_TIDSTORE_NUM_BLOCKS)
+ elog(ERROR, "tidstore_iterate_next returned %d blocks, expected %d",
+ blk_idx, TEST_TIDSTORE_NUM_BLOCKS);
+
+ /* remove all tids */
+ tidstore_reset(ts);
+
+ /* test the number of tids */
+ if (tidstore_num_tids(ts) != 0)
+ elog(ERROR, "tidstore_num_tids on empty store returned non-zero");
+
+ /* lookup test for empty store */
+ for (OffsetNumber off = FirstOffsetNumber ; off < MaxHeapTuplesPerPage;
+ off++)
+ {
+ check_tid(ts, 0, off, false);
+ check_tid(ts, 2, off, false);
+ check_tid(ts, MaxBlockNumber - 2, off, false);
+ check_tid(ts, MaxBlockNumber, off, false);
+ }
+
+ tidstore_destroy(ts);
+
+#ifdef TEST_SHARED_TIDSTORE
+ dsa_detach(dsa);
+#endif
+}
+
+static void
+test_empty(void)
+{
+ TidStore *ts;
+ TidStoreIter *iter;
+ ItemPointerData tid;
+
+#ifdef TEST_SHARED_TIDSTORE
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+
+ LWLockRegisterTranche(tranche_id, "test_tidstore");
+ dsa = dsa_create(tranche_id);
+
+ ts = tidstore_create(TEST_TIDSTORE_MAX_BYTES, MaxHeapTuplesPerPage, dsa);
+#else
+ ts = tidstore_create(TEST_TIDSTORE_MAX_BYTES, MaxHeapTuplesPerPage, NULL);
+#endif
+
+ elog(NOTICE, "testing empty tidstore");
+
+ ItemPointerSet(&tid, 0, FirstOffsetNumber);
+ if (tidstore_lookup_tid(ts, &tid))
+ elog(ERROR, "tidstore_lookup_tid for (0,1) on empty store returned true");
+
+ ItemPointerSet(&tid, MaxBlockNumber, MaxOffsetNumber);
+ if (tidstore_lookup_tid(ts, &tid))
+ elog(ERROR, "tidstore_lookup_tid for (%u,%u) on empty store returned true",
+ MaxBlockNumber, MaxOffsetNumber);
+
+ if (tidstore_num_tids(ts) != 0)
+ elog(ERROR, "tidstore_num_entries on empty store returned non-zero");
+
+ if (tidstore_is_full(ts))
+ elog(ERROR, "tidstore_is_full on empty store returned true");
+
+ iter = tidstore_begin_iterate(ts);
+
+ if (tidstore_iterate_next(iter) != NULL)
+ elog(ERROR, "tidstore_iterate_next on empty store returned TIDs");
+
+ tidstore_end_iterate(iter);
+
+ tidstore_destroy(ts);
+
+#ifdef TEST_SHARED_TIDSTORE
+ dsa_detach(dsa);
+#endif
+}
+
+Datum
+test_tidstore(PG_FUNCTION_ARGS)
+{
+ test_empty();
+
+ elog(NOTICE, "testing basic operations");
+ test_basic(MaxHeapTuplesPerPage);
+ test_basic(10);
+
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_tidstore/test_tidstore.control b/src/test/modules/test_tidstore/test_tidstore.control
new file mode 100644
index 0000000000..9b6bd4638f
--- /dev/null
+++ b/src/test/modules/test_tidstore/test_tidstore.control
@@ -0,0 +1,4 @@
+comment = 'Test code for tidstore'
+default_version = '1.0'
+module_pathname = '$libdir/test_tidstore'
+relocatable = true
--
2.31.1
On Fri, Mar 10, 2023 at 9:30 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
On Fri, Mar 10, 2023 at 3:42 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
I'd suggest sharing your todo list in the meanwhile, it'd be good to
discuss what's worth doing and what is not.
Apart from more rounds of reviews and tests, my todo items that need
discussion and possibly implementation are:
Quick thoughts on these:
* The memory measurement in radix trees and the memory limit in
tidstores. I've implemented it in v30-0007 through 0009 but we need to
review it. This is the highest priority for me.
Agreed.
* Additional size classes. It's important for an alternative of path
compression as well as supporting our decoupling approach. Middle
priority.
I'm going to push back a bit and claim this doesn't bring much gain, while
it does have a complexity cost. The node1 from Andres's prototype is 32
bytes in size, same as our node3, so it's roughly equivalent as a way to
ameliorate the lack of path compression. I say "roughly" because the loop
in node3 is probably noticeably slower. A new size class will by definition
still use that loop.
About a smaller node125-type class: I'm actually not even sure we need to
have any sub-max node bigger than about 64 (node size 768 bytes). I'd just let
65+ go to the max node -- there won't be many of them, at least in
synthetic workloads we've seen so far.
* Node shrinking support. Low priority.
This is an architectural wart that's been neglected since the tid store
doesn't perform deletion. We'll need it sometime. If we're not going to
make this work, why ship a deletion API at all?
I took a look at this a couple weeks ago, and fixing it wouldn't be that
hard. I even had an idea of how to detect when to shrink size class within
a node kind, while keeping the header at 5 bytes. I'd be willing to put
effort into that, but to have a chance of succeeding, I'm unwilling to make
it more difficult by adding more size classes at this point.
--
John Naylor
EDB: http://www.enterprisedb.com
On Sun, Mar 12, 2023 at 12:54 AM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Fri, Mar 10, 2023 at 9:30 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Fri, Mar 10, 2023 at 3:42 PM John Naylor
<john.naylor@enterprisedb.com> wrote:I'd suggest sharing your todo list in the meanwhile, it'd be good to discuss what's worth doing and what is not.
Apart from more rounds of reviews and tests, my todo items that need
discussion and possibly implementation are:

Quick thoughts on these:
* The memory measurement in radix trees and the memory limit in
tidstores. I've implemented it in v30-0007 through 0009 but we need to
review it. This is the highest priority for me.

Agreed.
* Additional size classes. It's important for an alternative of path
compression as well as supporting our decoupling approach. Middle
priority.

I'm going to push back a bit and claim this doesn't bring much gain, while it does have a complexity cost. The node1 from Andres's prototype is 32 bytes in size, same as our node3, so it's roughly equivalent as a way to ameliorate the lack of path compression.
But does it mean that our node1 would help reduce the memory further
since our base node type (i.e. RT_NODE) is smaller than the base
node type of Andres's prototype? The result I shared before showed
1.2GB vs. 1.9GB.
I say "roughly" because the loop in node3 is probably noticeably slower. A new size class will by definition still use that loop.
I've evaluated the performance of node1 but the result seems to show
the opposite. I used the test query:
select * from bench_search_random_nodes(100 * 1000 * 1000,
'0xFF000000000000FF');
Which makes the radix tree that has node1 look like:
max_val = 18446744073709551615
num_keys = 65536
height = 7, n1 = 1536, n3 = 0, n15 = 0, n32 = 0, n61 = 0, n256 = 257
All internal nodes except for the root node are node1. The radix tree
that doesn't have node1 is:
max_val = 18446744073709551615
num_keys = 65536
height = 7, n3 = 1536, n15 = 0, n32 = 0, n125 = 0, n256 = 257
Here is the result:
* w/ node1
mem_allocated | load_ms | search_ms
---------------+---------+-----------
573448 | 1848 | 1707
(1 row)
* w/o node1
mem_allocated | load_ms | search_ms
---------------+---------+-----------
598024 | 2014 | 1825
(1 row)
Am I missing something?
About a smaller node125-type class: I'm actually not even sure we need to have any sub-max node bigger than about 64 (node size 768 bytes). I'd just let 65+ go to the max node -- there won't be many of them, at least in synthetic workloads we've seen so far.
Makes sense to me.
* Node shrinking support. Low priority.
This is an architectural wart that's been neglected since the tid store doesn't perform deletion. We'll need it sometime. If we're not going to make this work, why ship a deletion API at all?
I took a look at this a couple weeks ago, and fixing it wouldn't be that hard. I even had an idea of how to detect when to shrink size class within a node kind, while keeping the header at 5 bytes. I'd be willing to put effort into that, but to have a chance of succeeding, I'm unwilling to make it more difficult by adding more size classes at this point.
I think that the deletion (and locking support) doesn't have use cases
in the core (i.e. tidstore) but is implemented so that external
extensions can use it. There might not be such extensions. Given the
lack of use cases in the core (and the remaining time), I think it's
okay even if the implementation of such API is minimal and not
optimized enough. For instance, the implementation of dshash.c is
minimalist, and doesn't have resizing. We can improve them in the
future if extensions or other core features want.
Personally I think we should focus on addressing feedback that we
would get and improving the existing use cases for the rest of the time.
That's why considering min-max size class has a higher priority than
the node shrinking support in my todo list.
FYI, I've run TPC-C workload over the weekend, and didn't get any
failures of the assertion proving tidstore and the current tid lookup
return the same result.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Mon, Mar 13, 2023 at 8:41 AM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
On Sun, Mar 12, 2023 at 12:54 AM John Naylor
<john.naylor@enterprisedb.com> wrote:On Fri, Mar 10, 2023 at 9:30 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
* Additional size classes. It's important for an alternative of path
compression as well as supporting our decoupling approach. Middle
priority.

I'm going to push back a bit and claim this doesn't bring much gain,
while it does have a complexity cost. The node1 from Andres's prototype is
32 bytes in size, same as our node3, so it's roughly equivalent as a way to
ameliorate the lack of path compression.
But does it mean that our node1 would help reduce the memory further
since our base node type (i.e. RT_NODE) is smaller than the base
node type of Andres's prototype? The result I shared before showed
1.2GB vs. 1.9GB.
The benefit is found in a synthetic benchmark with random integers. I
highly doubt that anyone would be willing to force us to keep
binary-searching the 1GB array for one more cycle on account of not adding
a size class here. I'll repeat myself and say that there are also
maintenance costs.
In contrast, I'm fairly certain that our attempts thus far at memory
accounting/limiting are not quite up to par, and lacking enough to
jeopardize the feature. We're already discussing that, so I'll say no more.
I say "roughly" because the loop in node3 is probably noticeably
slower. A new size class will by definition still use that loop.
I've evaluated the performance of node1 but the result seems to show
the opposite.
As an aside, I meant the loop in our node3 might make your node1 slower
than the prototype's node1, which was coded for 1 member only.
* Node shrinking support. Low priority.
This is an architectural wart that's been neglected since the tid store
doesn't perform deletion. We'll need it sometime. If we're not going to
make this work, why ship a deletion API at all?
I took a look at this a couple weeks ago, and fixing it wouldn't be
that hard. I even had an idea of how to detect when to shrink size class
within a node kind, while keeping the header at 5 bytes. I'd be willing to
put effort into that, but to have a chance of succeeding, I'm unwilling to
make it more difficult by adding more size classes at this point.
I think that the deletion (and locking support) doesn't have use cases
in the core (i.e. tidstore) but is implemented so that external
extensions can use it.
I think these cases are a bit different: Doing anything with a data
structure stored in shared memory without a synchronization scheme is
completely unthinkable and insane. I'm not yet sure if
deleting-without-shrinking is a showstopper, or if it's preferable in v16
to no deletion at all.
Anything we don't implement now is a limit on future use cases, and thus a
cause for objection. On the other hand, anything we implement also
represents more stuff that will have to be rewritten for high-concurrency.
FYI, I've run TPC-C workload over the weekend, and didn't get any
failures of the assertion proving tidstore and the current tid lookup
return the same result.
Great!
--
John Naylor
EDB: http://www.enterprisedb.com
On Mon, Mar 13, 2023 at 10:28 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Mon, Mar 13, 2023 at 8:41 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Sun, Mar 12, 2023 at 12:54 AM John Naylor
<john.naylor@enterprisedb.com> wrote:On Fri, Mar 10, 2023 at 9:30 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
* Additional size classes. It's important for an alternative of path
compression as well as supporting our decoupling approach. Middle
priority.

I'm going to push back a bit and claim this doesn't bring much gain, while it does have a complexity cost. The node1 from Andres's prototype is 32 bytes in size, same as our node3, so it's roughly equivalent as a way to ameliorate the lack of path compression.
But does it mean that our node1 would help reduce the memory further
since since our base node type (i.e. RT_NODE) is smaller than the base
node type of Andres's prototype? The result I shared before showed
1.2GB vs. 1.9GB.

The benefit is found in a synthetic benchmark with random integers. I highly doubt that anyone would be willing to force us to keep binary-searching the 1GB array for one more cycle on account of not adding a size class here. I'll repeat myself and say that there are also maintenance costs.
In contrast, I'm fairly certain that our attempts thus far at memory accounting/limiting are not quite up to par, and lacking enough to jeopardize the feature. We're already discussing that, so I'll say no more.
I agree that memory accounting/limiting stuff is the highest priority.
So what kinds of size classes do you think we need? node3, 15, 32, 61
and 256?
I say "roughly" because the loop in node3 is probably noticeably slower. A new size class will by definition still use that loop.
I've evaluated the performance of node1 but the result seems to show
the opposite.

As an aside, I meant the loop in our node3 might make your node1 slower than the prototype's node1, which was coded for 1 member only.
Agreed.
* Node shrinking support. Low priority.
This is an architectural wart that's been neglected since the tid store doesn't perform deletion. We'll need it sometime. If we're not going to make this work, why ship a deletion API at all?
I took a look at this a couple weeks ago, and fixing it wouldn't be that hard. I even had an idea of how to detect when to shrink size class within a node kind, while keeping the header at 5 bytes. I'd be willing to put effort into that, but to have a chance of succeeding, I'm unwilling to make it more difficult by adding more size classes at this point.
I think that the deletion (and locking support) doesn't have use cases
in the core (i.e. tidstore) but is implemented so that external
extensions can use it.

I think these cases are a bit different: Doing anything with a data structure stored in shared memory without a synchronization scheme is completely unthinkable and insane.
Right.
I'm not yet sure if deleting-without-shrinking is a showstopper, or if it's preferable in v16 to no deletion at all.
Anything we don't implement now is a limit on future use cases, and thus a cause for objection. On the other hand, anything we implement also represents more stuff that will have to be rewritten for high-concurrency.
Okay. Given that adding shrinking support also requires maintenance
costs (and probably new test cases?) and there are no use cases in the
core, I'm not sure it's worth supporting it at this stage. So I'd prefer
either shipping the deletion API as it is or removing it altogether. I
think that's a discussion point on which we'd like to hear feedback from
other hackers.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
I wrote:
Since the block-level measurement is likely overestimating quite a
bit, I propose to simply reverse the order of the actions here, effectively
reporting progress for the *last page* and not the current one: First
update progress with the current memory usage, then add tids for this page.
If this allocated a new block, only a small bit of that will be written to.
If this block pushes it over the limit, we will detect that up at the top
of the loop. It's kind of like our earlier attempts at a "fudge factor",
but simpler and less brittle. And, as far as OS pages we have actually
written to, I think it'll effectively respect the memory limit, at least in
the local mem case. And the numbers will make sense.
Thoughts?
It looks to work but it still doesn't work in a case where a shared
tidstore is created with a 64kB memory limit, right?
TidStoreMemoryUsage() returns 1MB and TidStoreIsFull() returns true
from the beginning.

I have two ideas:
1. Make it optional to track chunk memory space by a template parameter.
It might be tiny compared to everything else that vacuum does. That would
allow other users to avoid that overhead.
2. When context block usage exceeds the limit (rare), make the additional
effort to get the precise usage -- I'm not sure such a top-down facility
exists, and I'm not feeling well enough today to study this further.
Since then, Masahiko incorporated #1 into v31, and that's what I'm looking
at now. Unfortunately, if I had spent five minutes reminding myself what
the original objections were to this approach, I could have saved us some
effort. Back in July (!), Andres raised two points: GetMemoryChunkSpace()
is slow [1]/messages/by-id/20220704211822.kfxtzpcdmslzm2dy@awork3.anarazel.de, and fragmentation [2]/messages/by-id/20220704220038.at2ane5xkymzzssb@awork3.anarazel.de (leading to underestimation).
In v31, in the local case at least, the underestimation is actually worse
than tracking chunk space, since it ignores chunk header and alignment.
I'm not sure about the DSA case. This doesn't seem great.
It shouldn't be a surprise why a simple increment of raw allocation size is
comparable in speed -- GetMemoryChunkSpace() calls the right function
through a pointer, which is slower. If we were willing to underestimate for
the sake of speed, that takes away the reason for making memory tracking
optional.
Further, if the option is not specified, in v31 there is no way to get the
memory use at all, which seems odd. Surely the caller should be able to ask
the context/area, if it wants to.
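As a rough sketch of what asking the context/area could look like, assuming the local TidStore remembers the MemoryContext it allocates from (the ts->context field and the function name below are illustrative, not part of the posted patch):

static size_t
tidstore_memory_usage_from_storage(TidStore *ts)
{
	if (TidStoreIsShared(ts))
		return dsa_get_total_size(ts->area);	/* whole DSA area, block level */

	/* block-level total, including child contexts */
	return MemoryContextMemAllocated(ts->context, true);
}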
I still like my idea at the top of the page -- at least for vacuum and
m_w_m. It's still not completely clear if it's right but I've got nothing
better. It also ignores the work_mem issue, but I've given up anticipating
all future cases at the moment.
I'll put this item and a couple other things together in a separate email
tomorrow.
[1]: /messages/by-id/20220704211822.kfxtzpcdmslzm2dy@awork3.anarazel.de
/messages/by-id/20220704211822.kfxtzpcdmslzm2dy@awork3.anarazel.de
[2]: /messages/by-id/20220704220038.at2ane5xkymzzssb@awork3.anarazel.de
/messages/by-id/20220704220038.at2ane5xkymzzssb@awork3.anarazel.de
--
John Naylor
EDB: http://www.enterprisedb.com
On Tue, Mar 14, 2023 at 8:27 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
I wrote:
Since the block-level measurement is likely overestimating quite a bit, I propose to simply reverse the order of the actions here, effectively reporting progress for the *last page* and not the current one: First update progress with the current memory usage, then add tids for this page. If this allocated a new block, only a small bit of that will be written to. If this block pushes it over the limit, we will detect that up at the top of the loop. It's kind of like our earlier attempts at a "fudge factor", but simpler and less brittle. And, as far as OS pages we have actually written to, I think it'll effectively respect the memory limit, at least in the local mem case. And the numbers will make sense.
Thoughts?
It looks to work but it still doesn't work in a case where a shared
tidstore is created with a 64kB memory limit, right?
TidStoreMemoryUsage() returns 1MB and TidStoreIsFull() returns true
from the beginning.

I have two ideas:
1. Make it optional to track chunk memory space by a template parameter. It might be tiny compared to everything else that vacuum does. That would allow other users to avoid that overhead.
2. When context block usage exceeds the limit (rare), make the additional effort to get the precise usage -- I'm not sure such a top-down facility exists, and I'm not feeling well enough today to study this further.

Since then, Masahiko incorporated #1 into v31, and that's what I'm looking at now. Unfortunately, if I had spent five minutes reminding myself what the original objections were to this approach, I could have saved us some effort. Back in July (!), Andres raised two points: GetMemoryChunkSpace() is slow [1], and fragmentation [2] (leading to underestimation).
In v31, in the local case at least, the underestimation is actually worse than tracking chunk space, since it ignores chunk header and alignment. I'm not sure about the DSA case. This doesn't seem great.
Right.
It shouldn't be a surprise why a simple increment of raw allocation size is comparable in speed -- GetMemoryChunkSpace() calls the right function through a pointer, which is slower. If we were willing to underestimate for the sake of speed, that takes away the reason for making memory tracking optional.
Further, if the option is not specified, in v31 there is no way to get the memory use at all, which seems odd. Surely the caller should be able to ask the context/area, if it wants to.
There are precedents that don't provide a way to return memory usage,
such as simplehash.h and dshash.c.
I still like my idea at the top of the page -- at least for vacuum and m_w_m. It's still not completely clear if it's right but I've got nothing better. It also ignores the work_mem issue, but I've given up anticipating all future cases at the moment.
What do you mean by "the precise usage" in your idea? Quoting from
the email you referred to, Andres said:
---
One thing I was wondering about is trying to choose node types in
roughly-power-of-two struct sizes. It's pretty easy to end up with significant
fragmentation in the slabs right now when inserting as you go, because some of
the smaller node types will be freed but not enough to actually free blocks of
memory. If we instead have ~power-of-two sizes we could just use a single slab
of the max size, and carve out the smaller node types out of that largest
allocation.
Btw, that fragmentation is another reason why I think it's better to track
memory usage via memory contexts, rather than doing so based on
GetMemoryChunkSpace().
---
IIUC he suggested measuring memory usage at the block level in order to
count blocks that are not actually freed even though some of their chunks
are freed. That's why we used MemoryContextMemAllocated(). On the other
hand, recently you pointed out[1]/messages/by-id/CAFBsxsEnzivaJ13iCGdDoUMsXJVGOaahuBe_y=q6ow=LTzyDvA@mail.gmail.com:
---
I think we're trying to solve the wrong problem here. I need to study
this more, but it seems that code that needs to stay within a memory
limit only needs to track what's been allocated in chunks within a
block, since writing there is what invokes a page fault.
---
IIUC you suggested measuring memory usage by tracking how much memory is
allocated in chunks within a block. If your idea at the top of the
page follows this method, it still doesn't deal with the point Andres
mentioned.
I'll put this item and a couple other things together in a separate email tomorrow.
Thanks!
Regards,
[1]: /messages/by-id/CAFBsxsEnzivaJ13iCGdDoUMsXJVGOaahuBe_y=q6ow=LTzyDvA@mail.gmail.com
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Wed, Mar 15, 2023 at 9:32 AM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
On Tue, Mar 14, 2023 at 8:27 PM John Naylor
<john.naylor@enterprisedb.com> wrote:I wrote:
Since the block-level measurement is likely overestimating quite
a bit, I propose to simply reverse the order of the actions here,
effectively reporting progress for the *last page* and not the current one:
First update progress with the current memory usage, then add tids for this
page. If this allocated a new block, only a small bit of that will be
written to. If this block pushes it over the limit, we will detect that up
at the top of the loop. It's kind of like our earlier attempts at a "fudge
factor", but simpler and less brittle. And, as far as OS pages we have
actually written to, I think it'll effectively respect the memory limit, at
least in the local mem case. And the numbers will make sense.
I still like my idea at the top of the page -- at least for vacuum and
m_w_m. It's still not completely clear if it's right but I've got nothing
better. It also ignores the work_mem issue, but I've given up anticipating
all future cases at the moment.
IIUC you suggested measuring memory usage by tracking how much memory
chunks are allocated within a block. If your idea at the top of the
page follows this method, it still doesn't deal with the point Andres
mentioned.
Right, but that idea was orthogonal to how we measure memory use, and in
fact mentions blocks specifically. The re-ordering was just to make sure
that progress reporting didn't show current-use > max-use.
However, the big question remains DSA, since a new segment can be as large
as the entire previous set of allocations. It seems it just wasn't designed
for things where memory growth is unpredictable.
I'm starting to wonder if we need to give DSA a bit more info at the start.
Imagine a "soft" limit given to the DSA area when it is initialized. If the
total segment usage exceeds this, it stops doubling and instead new
segments get smaller. Modifying an example we used for the fudge-factor
idea some time ago:
m_w_m = 1GB, so calculate the soft limit to be 512MB and pass it to the DSA
area.
2*(1+2+4+8+16+32+64+128) + 256 = 766MB (74.8% of 1GB) -> hit soft limit, so
"stairstep down" the new segment sizes:
766 + 2*(128) + 64 = 1086MB -> stop
That's just an undeveloped idea, however, so likely v17 development, even
assuming it's not a bad idea (could be).
And sadly, unless we find some other, simpler answer soon for tracking and
limiting shared memory, the tid store is looking like v17 material.
--
John Naylor
EDB: http://www.enterprisedb.com
On Fri, Mar 17, 2023 at 4:03 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Wed, Mar 15, 2023 at 9:32 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Tue, Mar 14, 2023 at 8:27 PM John Naylor
<john.naylor@enterprisedb.com> wrote:I wrote:
Since the block-level measurement is likely overestimating quite a bit, I propose to simply reverse the order of the actions here, effectively reporting progress for the *last page* and not the current one: First update progress with the current memory usage, then add tids for this page. If this allocated a new block, only a small bit of that will be written to. If this block pushes it over the limit, we will detect that up at the top of the loop. It's kind of like our earlier attempts at a "fudge factor", but simpler and less brittle. And, as far as OS pages we have actually written to, I think it'll effectively respect the memory limit, at least in the local mem case. And the numbers will make sense.
I still like my idea at the top of the page -- at least for vacuum and m_w_m. It's still not completely clear if it's right but I've got nothing better. It also ignores the work_mem issue, but I've given up anticipating all future cases at the moment.
IIUC you suggested measuring memory usage by tracking how much memory
chunks are allocated within a block. If your idea at the top of the
page follows this method, it still doesn't deal with the point Andres
mentioned.

Right, but that idea was orthogonal to how we measure memory use, and in fact mentions blocks specifically. The re-ordering was just to make sure that progress reporting didn't show current-use > max-use.
Right. I still like your re-ordering idea. It's true that most of the
area of the last allocated block before heap scanning stops is not
actually used yet. I'm guessing we can just check if the context
memory has gone over the limit. But I'm concerned it might not work
well in systems where overcommit memory is disabled.
However, the big question remains DSA, since a new segment can be as large as the entire previous set of allocations. It seems it just wasn't designed for things where memory growth is unpredictable.
I'm starting to wonder if we need to give DSA a bit more info at the start. Imagine a "soft" limit given to the DSA area when it is initialized. If the total segment usage exceeds this, it stops doubling and instead new segments get smaller. Modifying an example we used for the fudge-factor idea some time ago:
m_w_m = 1GB, so calculate the soft limit to be 512MB and pass it to the DSA area.
2*(1+2+4+8+16+32+64+128) + 256 = 766MB (74.8% of 1GB) -> hit soft limit, so "stairstep down" the new segment sizes:
766 + 2*(128) + 64 = 1086MB -> stop
That's just an undeveloped idea, however, so likely v17 development, even assuming it's not a bad idea (could be).
This is an interesting idea. But I'm concerned we don't have enough
time to get confident with adding this new concept to DSA.
And sadly, unless we find some other, simpler answer soon for tracking and limiting shared memory, the tid store is looking like v17 material.
Another problem we need to deal with is the supported minimum memory
in shared tidstore cases. Since the initial DSA segment size is 1MB,
memory usage of a shared tidstore will start from 1MB+. This is higher
than the minimum values of both work_mem and maintenance_work_mem,
64kB and 1MB respectively. Increasing the minimum m_w_m to 2MB seems
to be acceptable in the community but not for work_mem. One idea is to
reject a memory limit of less than 2MB, so it won't work with small m_w_m
settings. While that might be an acceptable restriction at this stage
(where there is no use case of using tidstore with work_mem in the
core), it would be a blocker for future adoptions such as unifying with
tidbitmap.c. Another idea is that the process can specify the initial
segment size at dsa_create() so that DSA can start with a smaller
segment, say 32kB. That way, a tidstore with a 32kB limit gets full once
it allocates the next DSA segment, 32kB. But a downside of this idea is
that it increases the number of segments behind DSA. Assuming it's a
relatively rare case where we use such a low
work_mem, it might be acceptable. FYI, the total number of DSM
segments available on the system is calculated by:
#define PG_DYNSHMEM_FIXED_SLOTS 64
#define PG_DYNSHMEM_SLOTS_PER_BACKEND 5
maxitems = PG_DYNSHMEM_FIXED_SLOTS
+ PG_DYNSHMEM_SLOTS_PER_BACKEND * MaxBackends;
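For illustration, with MaxBackends = 100 that works out to 64 + 5 * 100 = 564 DSM slots system-wide.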
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Fri, Mar 17, 2023 at 4:49 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Fri, Mar 17, 2023 at 4:03 PM John Naylor
<john.naylor@enterprisedb.com> wrote:On Wed, Mar 15, 2023 at 9:32 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Tue, Mar 14, 2023 at 8:27 PM John Naylor
<john.naylor@enterprisedb.com> wrote:I wrote:
Since the block-level measurement is likely overestimating quite a bit, I propose to simply reverse the order of the actions here, effectively reporting progress for the *last page* and not the current one: First update progress with the current memory usage, then add tids for this page. If this allocated a new block, only a small bit of that will be written to. If this block pushes it over the limit, we will detect that up at the top of the loop. It's kind of like our earlier attempts at a "fudge factor", but simpler and less brittle. And, as far as OS pages we have actually written to, I think it'll effectively respect the memory limit, at least in the local mem case. And the numbers will make sense.
I still like my idea at the top of the page -- at least for vacuum and m_w_m. It's still not completely clear if it's right but I've got nothing better. It also ignores the work_mem issue, but I've given up anticipating all future cases at the moment.
IIUC you suggested measuring memory usage by tracking how much memory
chunks are allocated within a block. If your idea at the top of the
page follows this method, it still doesn't deal with the point Andres
mentioned.

Right, but that idea was orthogonal to how we measure memory use, and in fact mentions blocks specifically. The re-ordering was just to make sure that progress reporting didn't show current-use > max-use.
Right. I still like your re-ordering idea. It's true that most of the
area of the last allocated block before heap scanning stops is not
actually used yet. I'm guessing we can just check if the context
memory has gone over the limit. But I'm concerned it might not work
well in systems where overcommit memory is disabled.

However, the big question remains DSA, since a new segment can be as large as the entire previous set of allocations. It seems it just wasn't designed for things where memory growth is unpredictable.
aset.c also has a similar characteristic; it allocates an 8K block upon
the first allocation in a context, and doubles that size for each
successive block request. But we can specify the initial block size
and max blocksize. This made me think of another idea: specify both
to DSA, with both values calculated based on m_w_m. For example, we
can create a DSA in parallel_vacuum_init() as follows:
initial block size = min(m_w_m / 4, 1MB)
max block size = max(m_w_m / 8, 8MB)
In most cases, we can start with a 1MB initial segment, the same as
before. For small memory cases, say 1MB, we start with a 256KB initial
segment and heap scanning stops after DSA allocated 1.5MB (= 256kB +
256kB + 512kB + 512kB). For larger memory, we can have heap scan stop
after DSA allocates 1.25 times more memory than m_w_m. For example, if
m_w_m = 1GB, both the initial and maximum segment sizes are 1MB and
128MB respectively, and then DSA allocates the segments as follows
until heap scanning stops:
2 * (1 + 2 + 4 + 8 + 16 + 32 + 64 + 128) + (128 * 5) = 1150MB
dsa_create() will be extended to take the initial and maximum block
sizes, like AllocSetContextCreate().
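To illustrate, the calculation in parallel_vacuum_init() could look something like this (dsa_create_ext() stands in for the extended creation API described above, which doesn't exist yet; the variable names and the tranche choice are just for the sketch):

	size_t		init_segsize;
	size_t		max_segsize;
	dsa_area   *dead_items_area;

	/* maintenance_work_mem is in kilobytes */
	init_segsize = Min((size_t) maintenance_work_mem * 1024 / 4,
					   1024 * 1024);			/* at most 1MB */
	max_segsize = Max((size_t) maintenance_work_mem * 1024 / 8,
					  8 * 1024 * 1024);			/* at least 8MB */

	dead_items_area = dsa_create_ext(LWTRANCHE_SHARED_TIDSTORE,
									 init_segsize, max_segsize);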
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Mon, Mar 20, 2023 at 12:25 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
On Fri, Mar 17, 2023 at 4:49 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
On Fri, Mar 17, 2023 at 4:03 PM John Naylor
<john.naylor@enterprisedb.com> wrote:On Wed, Mar 15, 2023 at 9:32 AM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
On Tue, Mar 14, 2023 at 8:27 PM John Naylor
<john.naylor@enterprisedb.com> wrote:I wrote:
Since the block-level measurement is likely overestimating
quite a bit, I propose to simply reverse the order of the actions here,
effectively reporting progress for the *last page* and not the current one:
First update progress with the current memory usage, then add tids for this
page. If this allocated a new block, only a small bit of that will be
written to. If this block pushes it over the limit, we will detect that up
at the top of the loop. It's kind of like our earlier attempts at a "fudge
factor", but simpler and less brittle. And, as far as OS pages we have
actually written to, I think it'll effectively respect the memory limit, at
least in the local mem case. And the numbers will make sense.
I still like my idea at the top of the page -- at least for
vacuum and m_w_m. It's still not completely clear if it's right but I've
got nothing better. It also ignores the work_mem issue, but I've given up
anticipating all future cases at the moment.
IIUC you suggested measuring memory usage by tracking how much
memory
chunks are allocated within a block. If your idea at the top of the
page follows this method, it still doesn't deal with the point
Andres
mentioned.
Right, but that idea was orthogonal to how we measure memory use, and
in fact mentions blocks specifically. The re-ordering was just to make sure
that progress reporting didn't show current-use > max-use.
Right. I still like your re-ordering idea. It's true that the most
area of the last allocated block before heap scanning stops is not
actually used yet. I'm guessing we can just check if the context
memory has gone over the limit. But I'm concerned it might not work
well in systems where overcommit memory is disabled.However, the big question remains DSA, since a new segment can be as
large as the entire previous set of allocations. It seems it just wasn't
designed for things where memory growth is unpredictable.
aset.c also has a similar characteristic; allocates an 8K block upon
the first allocation in a context, and doubles that size for each
successive block request. But we can specify the initial block size
and max blocksize. This made me think of another idea to specify both
to DSA and both values are calculated based on m_w_m. For example, we
That's an interesting idea, and the analogous behavior to aset could be a
good thing for readability and maintainability. Worth seeing if it's
workable.
--
John Naylor
EDB: http://www.enterprisedb.com
On Mon, Mar 20, 2023 at 9:34 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Mon, Mar 20, 2023 at 12:25 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Fri, Mar 17, 2023 at 4:49 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Fri, Mar 17, 2023 at 4:03 PM John Naylor
<john.naylor@enterprisedb.com> wrote:On Wed, Mar 15, 2023 at 9:32 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Tue, Mar 14, 2023 at 8:27 PM John Naylor
<john.naylor@enterprisedb.com> wrote:I wrote:
Since the block-level measurement is likely overestimating quite a bit, I propose to simply reverse the order of the actions here, effectively reporting progress for the *last page* and not the current one: First update progress with the current memory usage, then add tids for this page. If this allocated a new block, only a small bit of that will be written to. If this block pushes it over the limit, we will detect that up at the top of the loop. It's kind of like our earlier attempts at a "fudge factor", but simpler and less brittle. And, as far as OS pages we have actually written to, I think it'll effectively respect the memory limit, at least in the local mem case. And the numbers will make sense.
I still like my idea at the top of the page -- at least for vacuum and m_w_m. It's still not completely clear if it's right but I've got nothing better. It also ignores the work_mem issue, but I've given up anticipating all future cases at the moment.
IIUC you suggested measuring memory usage by tracking how much memory
chunks are allocated within a block. If your idea at the top of the
page follows this method, it still doesn't deal with the point Andres
mentioned.

Right, but that idea was orthogonal to how we measure memory use, and in fact mentions blocks specifically. The re-ordering was just to make sure that progress reporting didn't show current-use > max-use.
Right. I still like your re-ordering idea. It's true that the most
area of the last allocated block before heap scanning stops is not
actually used yet. I'm guessing we can just check if the context
memory has gone over the limit. But I'm concerned it might not work
well in systems where overcommit memory is disabled.

However, the big question remains DSA, since a new segment can be as large as the entire previous set of allocations. It seems it just wasn't designed for things where memory growth is unpredictable.
aset.c also has a similar characteristic; it allocates an 8K block upon
the first allocation in a context, and doubles that size for each
successive block request. But we can specify the initial block size
and max block size. This made me think of another idea: specify both
to DSA, with both values calculated based on m_w_m. For example, we
That's an interesting idea, and the analogous behavior to aset could be a good thing for readability and maintainability. Worth seeing if it's workable.
I've attached a quick hack patch. It can be applied on top of v32
patches. The changes to dsa.c are straightforward since it makes the
initial and max block sizes configurable. The patch includes a test
function, test_memory_usage(), to simulate how DSA segments grow behind
the shared radix tree. If we set the first argument to true, it
calculates both the initial and maximum block sizes based on work_mem (I
used work_mem here just because its value range is larger than m_w_m):
postgres(1:833654)=# select test_memory_usage(true);
NOTICE: memory limit 134217728
NOTICE: init 1048576 max 16777216
NOTICE: initial: 1048576
NOTICE: rt_create: 1048576
NOTICE: allocate new DSM [1] 1048576
NOTICE: allocate new DSM [2] 2097152
NOTICE: allocate new DSM [3] 2097152
NOTICE: allocate new DSM [4] 4194304
NOTICE: allocate new DSM [5] 4194304
NOTICE: allocate new DSM [6] 8388608
NOTICE: allocate new DSM [7] 8388608
NOTICE: allocate new DSM [8] 16777216
NOTICE: allocate new DSM [9] 16777216
NOTICE: allocate new DSM [10] 16777216
NOTICE: allocate new DSM [11] 16777216
NOTICE: allocate new DSM [12] 16777216
NOTICE: allocate new DSM [13] 16777216
NOTICE: allocate new DSM [14] 16777216
NOTICE: reached: 148897792 (+14680064)
NOTICE: 12718205 keys inserted: 148897792
test_memory_usage
-------------------
(1 row)
Time: 7195.664 ms (00:07.196)
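For reference, with work_mem = 128MB the init/max values in the NOTICE
output above follow directly from the formulas in the attached hack patch:

limit = 131072kB * 1024              = 134217728 (128MB)
init  = Min(134217728 / 4, 1MB)      = 1048576   (1MB)
max   = Max(134217728 / 8, 8MB)      = 16777216  (16MB)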
By setting the first argument to false, we can specify both manually in
the second and third arguments:
postgres(1:833654)=# select test_memory_usage(false, 1024 * 1024, 1024
* 1024 * 1024 * 10::bigint);
NOTICE: memory limit 134217728
NOTICE: init 1048576 max 10737418240
NOTICE: initial: 1048576
NOTICE: rt_create: 1048576
NOTICE: allocate new DSM [1] 1048576
NOTICE: allocate new DSM [2] 2097152
NOTICE: allocate new DSM [3] 2097152
NOTICE: allocate new DSM [4] 4194304
NOTICE: allocate new DSM [5] 4194304
NOTICE: allocate new DSM [6] 8388608
NOTICE: allocate new DSM [7] 8388608
NOTICE: allocate new DSM [8] 16777216
NOTICE: allocate new DSM [9] 16777216
NOTICE: allocate new DSM [10] 33554432
NOTICE: allocate new DSM [11] 33554432
NOTICE: allocate new DSM [12] 67108864
NOTICE: reached: 199229440 (+65011712)
NOTICE: 12718205 keys inserted: 199229440
test_memory_usage
-------------------
(1 row)
Time: 7187.571 ms (00:07.188)
It seems to work fine. The difference between the above two cases is
the maximum block size (16MB vs. 10GB). We allocated two more DSA
segments in the first case, but there was no big difference in
performance in my test environment.
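As a cross-check, here is a small standalone simulation (a sketch under the
assumption that segment sizes follow the rule in make_new_segment(): size =
init << (index / DSA_NUM_SEGMENTS_AT_EACH_SIZE), capped at the maximum
segment size). It reproduces both "reached" values above:

/* simulate_dsa_growth.c -- rough simulation of DSA segment size growth */
#include <stdio.h>
#include <stdint.h>

#define NUM_SEGMENTS_AT_EACH_SIZE 2     /* mirrors DSA_NUM_SEGMENTS_AT_EACH_SIZE */

static void
simulate(uint64_t limit, uint64_t init, uint64_t max)
{
    uint64_t    total = init;   /* segment 0 backs the control object */
    int         index;

    /* keep adding segments until the total exceeds the memory limit */
    for (index = 1; total <= limit; index++)
    {
        uint64_t    size = init << (index / NUM_SEGMENTS_AT_EACH_SIZE);

        if (size > max)
            size = max;
        total += size;
    }

    printf("init %llu max %llu: reached %llu (+%llu)\n",
           (unsigned long long) init, (unsigned long long) max,
           (unsigned long long) total, (unsigned long long) (total - limit));
}

int
main(void)
{
    uint64_t    mb = 1024 * 1024;

    simulate(128 * mb, mb, 16 * mb);    /* reached 148897792 (+14680064) */
    simulate(128 * mb, mb, 10240 * mb); /* reached 199229440 (+65011712) */
    return 0;
}

The bounded case can overshoot the limit by at most one maximum-sized
segment (16MB here), whereas with the effectively unbounded cap the final
doubling alone adds 64MB.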
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachments:
dsa_init_max_block_size.patch.txt
diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
index ad66265e23..12121dd1d4 100644
--- a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
+++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
@@ -86,3 +86,12 @@ OUT iter_ms int8
returns record
as 'MODULE_PATHNAME'
LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function test_memory_usage(
+use_m_w_m bool,
+init_blksize int8 default (1024 * 1024),
+max_blksize int8 default (1024 * 1024 * 1024 * 10::bigint)
+)
+returns void
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c
index 41d83aee11..0580faed6c 100644
--- a/contrib/bench_radix_tree/bench_radix_tree.c
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -40,6 +40,18 @@ PG_MODULE_MAGIC;
// #define RT_SHMEM
#include "lib/radixtree.h"
+//#define RT_DEBUG
+#define RT_PREFIX shared_rt
+#define RT_SCOPE
+#define RT_DECLARE
+#define RT_DEFINE
+//#define RT_USE_DELETE
+//#define RT_MEASURE_MEMORY_USAGE
+#define RT_VALUE_TYPE uint64
+// WIP: compiles with warnings because rt_attach is defined but not used
+#define RT_SHMEM
+#include "lib/radixtree.h"
+
/*
* Return the number of keys in the radix tree.
*/
@@ -57,6 +69,7 @@ PG_FUNCTION_INFO_V1(bench_fixed_height_search);
PG_FUNCTION_INFO_V1(bench_search_random_nodes);
PG_FUNCTION_INFO_V1(bench_node128_load);
PG_FUNCTION_INFO_V1(bench_tidstore_load);
+PG_FUNCTION_INFO_V1(test_memory_usage);
static uint64
tid_to_key_off(ItemPointer tid, uint32 *off)
@@ -745,4 +758,56 @@ stub_iter()
iter = rt_begin_iterate(rt);
rt_iterate_next(iter, &key, &value);
rt_end_iterate(iter);
-}
\ No newline at end of file
+}
+
+Datum
+test_memory_usage(PG_FUNCTION_ARGS)
+{
+ bool use_work_mem = PG_GETARG_BOOL(0);
+ int64 init = PG_GETARG_INT64(1);
+ int64 max = PG_GETARG_INT64(2);
+ int tranche_id = LWLockNewTrancheId();
+ const int limit = work_mem * 1024;
+ dsa_area *dsa;
+ shared_rt_radix_tree *rt;
+ uint64 i;
+
+ LWLockRegisterTranche(tranche_id, "test");
+
+ if (use_work_mem)
+ {
+ init = Min(((int64)work_mem * 1024) / 4, 1024 * 1024);
+ max = Max(((int64)work_mem * 1024) / 8, (int64) 8 * 1024 * 1024);
+ }
+
+ elog(NOTICE, "memory limit %ld", (int64) work_mem * 1024);
+ elog(NOTICE, "init %ld max %ld", init, max);
+ dsa = dsa_create_ext(tranche_id, init, max);
+
+ elog(NOTICE, "initial: %zu", dsa_get_total_segment_size(dsa));
+
+ rt = shared_rt_create(CurrentMemoryContext, dsa, tranche_id);
+ elog(NOTICE, "rt_create: %zu", dsa_get_total_segment_size(dsa));
+
+ for (i = 0; i < (1000 * 1000 * 1000); i++)
+ {
+ volatile bool ret;
+ size_t size;
+
+ ret = shared_rt_set(rt, i, &i);
+
+ size = dsa_get_total_segment_size(dsa);
+
+ if (limit < size)
+ {
+ elog(NOTICE, "reached: %zu (+%zu)", size, size - limit);
+ break;
+ }
+ }
+
+ elog(NOTICE, "%ld keys inserted: %zu", i, dsa_get_total_segment_size(dsa));
+
+ shared_rt_free(rt);
+ dsa_detach(dsa);
+ PG_RETURN_VOID();
+}
diff --git a/src/backend/utils/mmgr/dsa.c b/src/backend/utils/mmgr/dsa.c
index f5a62061a3..a81008d84e 100644
--- a/src/backend/utils/mmgr/dsa.c
+++ b/src/backend/utils/mmgr/dsa.c
@@ -60,14 +60,6 @@
#include "utils/freepage.h"
#include "utils/memutils.h"
-/*
- * The size of the initial DSM segment that backs a dsa_area created by
- * dsa_create. After creating some number of segments of this size we'll
- * double this size, and so on. Larger segments may be created if necessary
- * to satisfy large requests.
- */
-#define DSA_INITIAL_SEGMENT_SIZE ((size_t) (1 * 1024 * 1024))
-
/*
* How many segments to create before we double the segment size. If this is
* low, then there is likely to be a lot of wasted space in the largest
@@ -77,17 +69,6 @@
*/
#define DSA_NUM_SEGMENTS_AT_EACH_SIZE 2
-/*
- * The number of bits used to represent the offset part of a dsa_pointer.
- * This controls the maximum size of a segment, the maximum possible
- * allocation size and also the maximum number of segments per area.
- */
-#if SIZEOF_DSA_POINTER == 4
-#define DSA_OFFSET_WIDTH 27 /* 32 segments of size up to 128MB */
-#else
-#define DSA_OFFSET_WIDTH 40 /* 1024 segments of size up to 1TB */
-#endif
-
/*
* The maximum number of DSM segments that an area can own, determined by
* the number of bits remaining (but capped at 1024).
@@ -98,9 +79,6 @@
/* The bitmask for extracting the offset from a dsa_pointer. */
#define DSA_OFFSET_BITMASK (((dsa_pointer) 1 << DSA_OFFSET_WIDTH) - 1)
-/* The maximum size of a DSM segment. */
-#define DSA_MAX_SEGMENT_SIZE ((size_t) 1 << DSA_OFFSET_WIDTH)
-
/* Number of pages (see FPM_PAGE_SIZE) per regular superblock. */
#define DSA_PAGES_PER_SUPERBLOCK 16
@@ -319,6 +297,10 @@ typedef struct
dsa_segment_index segment_bins[DSA_NUM_SEGMENT_BINS];
/* The object pools for each size class. */
dsa_area_pool pools[DSA_NUM_SIZE_CLASSES];
+ /* initial allocation segment size */
+ size_t init_segment_size;
+ /* maximum allocation segment size */
+ size_t max_segment_size;
/* The total size of all active segments. */
size_t total_segment_size;
/* The maximum total size of backing storage we are allowed. */
@@ -413,7 +395,9 @@ static dsa_segment_map *make_new_segment(dsa_area *area, size_t requested_pages)
static dsa_area *create_internal(void *place, size_t size,
int tranche_id,
dsm_handle control_handle,
- dsm_segment *control_segment);
+ dsm_segment *control_segment,
+ size_t init_segment_size,
+ size_t max_segment_size);
static dsa_area *attach_internal(void *place, dsm_segment *segment,
dsa_handle handle);
static void check_for_freed_segments(dsa_area *area);
@@ -429,7 +413,7 @@ static void check_for_freed_segments_locked(dsa_area *area);
* we require the caller to provide one.
*/
dsa_area *
-dsa_create(int tranche_id)
+dsa_create_ext(int tranche_id, size_t init_segment_size, size_t max_segment_size)
{
dsm_segment *segment;
dsa_area *area;
@@ -438,7 +422,7 @@ dsa_create(int tranche_id)
* Create the DSM segment that will hold the shared control object and the
* first segment of usable space.
*/
- segment = dsm_create(DSA_INITIAL_SEGMENT_SIZE, 0);
+ segment = dsm_create(init_segment_size, 0);
/*
* All segments backing this area are pinned, so that DSA can explicitly
@@ -450,9 +434,10 @@ dsa_create(int tranche_id)
/* Create a new DSA area with the control object in this segment. */
area = create_internal(dsm_segment_address(segment),
- DSA_INITIAL_SEGMENT_SIZE,
+ init_segment_size,
tranche_id,
- dsm_segment_handle(segment), segment);
+ dsm_segment_handle(segment), segment,
+ init_segment_size, max_segment_size);
/* Clean up when the control segment detaches. */
on_dsm_detach(segment, &dsa_on_dsm_detach_release_in_place,
@@ -478,13 +463,15 @@ dsa_create(int tranche_id)
* See dsa_create() for a note about the tranche arguments.
*/
dsa_area *
-dsa_create_in_place(void *place, size_t size,
- int tranche_id, dsm_segment *segment)
+dsa_create_in_place_ext(void *place, size_t size,
+ int tranche_id, dsm_segment *segment,
+ size_t init_segment_size, size_t max_segment_size)
{
dsa_area *area;
area = create_internal(place, size, tranche_id,
- DSM_HANDLE_INVALID, NULL);
+ DSM_HANDLE_INVALID, NULL,
+ init_segment_size, max_segment_size);
/*
* Clean up when the control segment detaches, if a containing DSM segment
@@ -1024,6 +1011,18 @@ dsa_set_size_limit(dsa_area *area, size_t limit)
LWLockRelease(DSA_AREA_LOCK(area));
}
+size_t
+dsa_get_total_segment_size(dsa_area *area)
+{
+ size_t size;
+
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_SHARED);
+ size = area->control->total_segment_size;
+ LWLockRelease(DSA_AREA_LOCK(area));
+
+ return size;
+}
+
/*
* Aggressively free all spare memory in the hope of returning DSM segments to
* the operating system.
@@ -1203,7 +1202,8 @@ static dsa_area *
create_internal(void *place, size_t size,
int tranche_id,
dsm_handle control_handle,
- dsm_segment *control_segment)
+ dsm_segment *control_segment,
+ size_t init_segment_size, size_t max_segment_size)
{
dsa_area_control *control;
dsa_area *area;
@@ -1213,6 +1213,9 @@ create_internal(void *place, size_t size,
size_t metadata_bytes;
int i;
+ Assert(max_segment_size >= init_segment_size);
+ Assert(max_segment_size <= DSA_MAX_SEGMENT_SIZE);
+
/* Sanity check on the space we have to work in. */
if (size < dsa_minimum_size())
elog(ERROR, "dsa_area space must be at least %zu, but %zu provided",
@@ -1242,8 +1245,10 @@ create_internal(void *place, size_t size,
control->segment_header.prev = DSA_SEGMENT_INDEX_NONE;
control->segment_header.usable_pages = usable_pages;
control->segment_header.freed = false;
- control->segment_header.size = DSA_INITIAL_SEGMENT_SIZE;
+ control->segment_header.size = size;
control->handle = control_handle;
+ control->init_segment_size = init_segment_size;
+ control->max_segment_size = max_segment_size;
control->max_total_segment_size = (size_t) -1;
control->total_segment_size = size;
control->segment_handles[0] = control_handle;
@@ -2112,12 +2117,13 @@ make_new_segment(dsa_area *area, size_t requested_pages)
* move to huge pages in the future. Then we work back to the number of
* pages we can fit.
*/
- total_size = DSA_INITIAL_SEGMENT_SIZE *
+ total_size = area->control->init_segment_size *
((size_t) 1 << (new_index / DSA_NUM_SEGMENTS_AT_EACH_SIZE));
- total_size = Min(total_size, DSA_MAX_SEGMENT_SIZE);
+ total_size = Min(total_size, area->control->max_segment_size);
total_size = Min(total_size,
area->control->max_total_segment_size -
area->control->total_segment_size);
+ elog(NOTICE, "allocate new DSM [%zu] %zu", new_index, total_size);
total_pages = total_size / FPM_PAGE_SIZE;
metadata_bytes =
diff --git a/src/include/utils/dsa.h b/src/include/utils/dsa.h
index 3ce4ee300a..0baa32b9de 100644
--- a/src/include/utils/dsa.h
+++ b/src/include/utils/dsa.h
@@ -77,6 +77,28 @@ typedef pg_atomic_uint64 dsa_pointer_atomic;
/* A sentinel value for dsa_pointer used to indicate failure to allocate. */
#define InvalidDsaPointer ((dsa_pointer) 0)
+/*
+ * The size of the initial DSM segment that backs a dsa_area created by
+ * dsa_create. After creating some number of segments of this size we'll
+ * double this size, and so on. Larger segments may be created if necessary
+ * to satisfy large requests.
+ */
+#define DSA_INITIAL_SEGMENT_SIZE ((size_t) (1 * 1024 * 1024))
+
+/*
+ * The number of bits used to represent the offset part of a dsa_pointer.
+ * This controls the maximum size of a segment, the maximum possible
+ * allocation size and also the maximum number of segments per area.
+ */
+#if SIZEOF_DSA_POINTER == 4
+#define DSA_OFFSET_WIDTH 27 /* 32 segments of size up to 128MB */
+#else
+#define DSA_OFFSET_WIDTH 40 /* 1024 segments of size up to 1TB */
+#endif
+
+/* The maximum size of a DSM segment. */
+#define DSA_MAX_SEGMENT_SIZE ((size_t) 1 << DSA_OFFSET_WIDTH)
+
/* Check if a dsa_pointer value is valid. */
#define DsaPointerIsValid(x) ((x) != InvalidDsaPointer)
@@ -88,6 +110,14 @@ typedef pg_atomic_uint64 dsa_pointer_atomic;
#define dsa_allocate0(area, size) \
dsa_allocate_extended(area, size, DSA_ALLOC_ZERO)
+/* Create dsa_area with default segment sizes */
+#define dsa_create(tranch_id) \
+ dsa_create_ext(tranch_id, DSA_INITIAL_SEGMENT_SIZE, DSA_MAX_SEGMENT_SIZE)
+
+/* Create dsa_area with default segment sizes in an existing share memory space */
+#define dsa_create_in_place(place, size, tranch_id, segment) \
+ dsa_create_in_place_ext(place, size, tranch_id, segment, DSA_INITIAL_SEGMENT_SIZE, DSA_MAX_SEGMENT_SIZE)
+
/*
* The type used for dsa_area handles. dsa_handle values can be shared with
* other processes, so that they can attach to them. This provides a way to
@@ -102,10 +132,12 @@ typedef dsm_handle dsa_handle;
/* Sentinel value to use for invalid dsa_handles. */
#define DSA_HANDLE_INVALID ((dsa_handle) DSM_HANDLE_INVALID)
-
-extern dsa_area *dsa_create(int tranche_id);
-extern dsa_area *dsa_create_in_place(void *place, size_t size,
- int tranche_id, dsm_segment *segment);
+extern dsa_area *dsa_create_ext(int tranche_id, size_t init_segment_size,
+ size_t max_segment_size);
+extern dsa_area *dsa_create_in_place_ext(void *place, size_t size,
+ int tranche_id, dsm_segment *segment,
+ size_t init_segment_size,
+ size_t max_segment_size);
extern dsa_area *dsa_attach(dsa_handle handle);
extern dsa_area *dsa_attach_in_place(void *place, dsm_segment *segment);
extern void dsa_release_in_place(void *place);
@@ -117,6 +149,7 @@ extern void dsa_pin(dsa_area *area);
extern void dsa_unpin(dsa_area *area);
extern void dsa_set_size_limit(dsa_area *area, size_t limit);
extern size_t dsa_minimum_size(void);
+extern size_t dsa_get_total_segment_size(dsa_area *area);
extern dsa_handle dsa_get_handle(dsa_area *area);
extern dsa_pointer dsa_allocate_extended(dsa_area *area, size_t size, int flags);
extern void dsa_free(dsa_area *area, dsa_pointer dp);
On Mon, Mar 20, 2023 at 9:34 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
On Mon, Mar 20, 2023 at 9:34 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
That's an interesting idea, and the analogous behavior to aset could be
a good thing for readability and maintainability. Worth seeing if it's
workable.
I've attached a quick hack patch. It can be applied on top of v32
patches. The changes to dsa.c are straightforward since it makes the
initial and max block sizes configurable.
Good to hear -- this should probably be proposed in a separate thread for
wider visibility.
--
John Naylor
EDB: http://www.enterprisedb.com
On Tue, Mar 21, 2023 at 2:41 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Mon, Mar 20, 2023 at 9:34 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Mon, Mar 20, 2023 at 9:34 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
That's an interesting idea, and the analogous behavior to aset could be a good thing for readability and maintainability. Worth seeing if it's workable.
I've attached a quick hack patch. It can be applied on top of v32
patches. The changes to dsa.c are straightforward since it makes the
initial and max block sizes configurable.
Good to hear -- this should probably be proposed in a separate thread for wider visibility.
Agreed. I'll start a new thread for that.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Thu, Feb 16, 2023 at 11:44 PM Andres Freund <andres@anarazel.de> wrote:
We really ought to replace the tid bitmap used for bitmap heap scans. The
hashtable we use is a pretty awful data structure for it. And that's not
filled in-order, for example.
I spent some time studying tidbitmap.c, and not only does it make sense to
use a radix tree there, but since it has more complex behavior and stricter
runtime requirements, it should really be the thing driving the design and
tradeoffs, not vacuum:
- With lazy expansion and single-value leaves, the root of a radix tree can
point to a single leaf. That might get rid of the need to track TBMStatus,
since setting a single-leaf tree should be cheap.
- Fixed-size PagetableEntry's are pretty large, but the tid compression
scheme used in this thread (in addition to being complex) is not a great
fit for tidbitmap because it makes it more difficult to track per-block
metadata (see also next point). With the "combined pointer-value slots"
technique, if a page's max tid offset is 63 or less, the offsets can be
stored directly in the pointer for the exact case. The lowest bit can serve
as a tag to indicate a pointer to a single-value leaf (see the sketch after
this list). That would complicate operations like union/intersection and
tracking "needs recheck", but it would reduce memory use and node-traversal
in common cases.
- Managing lossy storage. With pure blocknumber keys, replacing exact
storage for a range of 256 pages amounts to replacing a last-level node
with a single leaf containing one lossy PagetableEntry. The leader could
iterate over the nodes, and rank the last-level nodes by how much storage
they (possibly with leaf children) are using, and come up with an optimal
lossy-conversion plan.
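To illustrate the combined pointer-value slot idea above, here is a rough
sketch; the type and function names, the choice of bit 0 as the tag, and the
offset-to-bit mapping are all assumptions for illustration, not anything
from the patches in this thread:

#include <stdint.h>
#include <stdbool.h>

/*
 * One 64-bit slot is either an embedded bitmap of offsets 1..63 (exact case,
 * bit 0 clear) or a tagged pointer to a separately allocated leaf such as a
 * PagetableEntry (bit 0 set). Offset numbers start at 1, so bit 0 is free
 * to act as the tag.
 */
typedef uint64_t rt_slot;

#define SLOT_IS_LEAF_POINTER(s) (((s) & UINT64_C(1)) != 0)

/* exact case: set/test an offset (1..63) directly in the slot */
static inline rt_slot
slot_add_offset(rt_slot s, int off)
{
    return s | (UINT64_C(1) << off);
}

static inline bool
slot_test_offset(rt_slot s, int off)
{
    return (s & (UINT64_C(1) << off)) != 0;
}

/* fallback: store a pointer to a full leaf, tagged in the lowest bit */
static inline rt_slot
slot_from_leaf(void *leaf)
{
    return (rt_slot) (uintptr_t) leaf | UINT64_C(1);    /* needs 2-byte alignment */
}

static inline void *
slot_get_leaf(rt_slot s)
{
    return (void *) (uintptr_t) (s & ~UINT64_C(1));
}

Small pages then need no allocation at all for the exact case, while pages
with higher offsets, or ranges converted to lossy storage, fall back to a
tagged pointer to a separately allocated entry.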
The above would address the points (not including better iteration and
parallel bitmap index scans) raised in
/messages/by-id/CAPsAnrn5yWsoWs8GhqwbwAJx1SeLxLntV54Biq0Z-J_E86Fnng@mail.gmail.com
Ironically, by targeting a more difficult use case, it's easier since there
is less freedom. There are many ways to beat a binary search, but fewer
good ways to improve bitmap heap scan. I'd like to put aside vacuum for
some time and try killing two birds with one stone, building upon our work
thus far.
Note: I've moved the CF entry to the next CF, and set to waiting on
author for now. Since no action is currently required from Masahiko, I've
added myself as author as well. If tackling bitmap heap scan shows promise,
we could RWF and resurrect at a later time.
--
John Naylor
EDB: http://www.enterprisedb.com
On Fri, Apr 7, 2023 at 6:55 PM John Naylor <john.naylor@enterprisedb.com> wrote:
On Thu, Feb 16, 2023 at 11:44 PM Andres Freund <andres@anarazel.de> wrote:
We really ought to replace the tid bitmap used for bitmap heap scans. The
hashtable we use is a pretty awful data structure for it. And that's not
filled in-order, for example.
I spent some time studying tidbitmap.c, and not only does it make sense to use a radix tree there, but since it has more complex behavior and stricter runtime requirements, it should really be the thing driving the design and tradeoffs, not vacuum:
- With lazy expansion and single-value leaves, the root of a radix tree can point to a single leaf. That might get rid of the need to track TBMStatus, since setting a single-leaf tree should be cheap.
Instead of introducing single-value leaves to the radix tree as
another structure, can we store pointers to PagetableEntry as values?
- Fixed-size PagetableEntry's are pretty large, but the tid compression scheme used in this thread (in addition to being complex) is not a great fit for tidbitmap because it makes it more difficult to track per-block metadata (see also next point). With the "combined pointer-value slots" technique, if a page's max tid offset is 63 or less, the offsets can be stored directly in the pointer for the exact case. The lowest bit can tag to indicate a pointer to a single-value leaf. That would complicate operations like union/intersection and tracking "needs recheck", but it would reduce memory use and node-traversal in common cases.
- Managing lossy storage. With pure blocknumber keys, replacing exact storage for a range of 256 pages amounts to replacing a last-level node with a single leaf containing one lossy PagetableEntry. The leader could iterate over the nodes, and rank the last-level nodes by how much storage they (possibly with leaf children) are using, and come up with an optimal lossy-conversion plan.
The above would address the points (not including better iteration and parallel bitmap index scans) raised in
/messages/by-id/CAPsAnrn5yWsoWs8GhqwbwAJx1SeLxLntV54Biq0Z-J_E86Fnng@mail.gmail.com
Ironically, by targeting a more difficult use case, it's easier since there is less freedom. There are many ways to beat a binary search, but fewer good ways to improve bitmap heap scan. I'd like to put aside vacuum for some time and try killing two birds with one stone, building upon our work thus far.
Note: I've moved the CF entry to the next CF, and set to waiting on author for now. Since no action is currently required from Masahiko, I've added myself as author as well. If tackling bitmap heap scan shows promise, we could RWF and resurrect at a later time.
Thanks. I'm going to continue researching the memory limitation and
try lazy path expansion until PG17 development begins.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Sat, Mar 11, 2023 at 12:26 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Fri, Mar 10, 2023 at 11:30 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Fri, Mar 10, 2023 at 3:42 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Thu, Mar 9, 2023 at 1:51 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
I've attached the new version patches. I merged improvements and fixes
I did in the v29 patch.
I haven't yet had a chance to look at those closely, since I've had to devote time to other commitments. I remember I wasn't particularly impressed that v29-0008 mixed my requested name-casing changes with a bunch of other random things. Separating those out would be an obvious way to make it easier for me to look at, whenever I can get back to this. I need to look at the iteration changes as well, in addition to testing memory measurement (thanks for the new results, they look encouraging).
Okay, I'll separate them again.
Attached new patch series. In addition to separating them again, I've
fixed a conflict with HEAD.
I've attached updated version patches to make cfbot happy. Also, I've
split the fixup patches further (from 0007, except for 0016 and 0018) to
make review easier. These patches have the prefix radix tree, tidstore,
or vacuum, indicating the part they change. The 0016 patch changes DSA
so that we can specify both the initial and max segment sizes, and 0017
makes use of that in vacuumparallel.c. I'm still researching a better
solution for the memory limitation, but this is the best solution I have
for now.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachments:
v32-0015-vacuum-Miscellaneous-updates.patch
From 16e55ffde1cb152dc94cf38a9f6c8442b78be284 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 17 Apr 2023 18:07:04 +0900
Subject: [PATCH v32 15/18] vacuum: Miscellaneous updates
fix typos, comment updates, etc.
---
doc/src/sgml/monitoring.sgml | 2 +-
src/backend/access/heap/vacuumlazy.c | 17 ++++++++---------
src/backend/commands/vacuumparallel.c | 13 +++++++------
3 files changed, 16 insertions(+), 16 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 9b64614beb..67ab9fa2bc 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -7331,7 +7331,7 @@ FROM pg_stat_get_backend_idset() AS backendid;
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>num_dead_tuple_bytes</structfield> <type>bigint</type>
+ <structfield>dead_tuple_bytes</structfield> <type>bigint</type>
</para>
<para>
Amount of dead tuple data collected since the last index vacuum cycle.
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index be487aced6..228daad750 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -10,11 +10,10 @@
* of dead TIDs at once.
*
* We are willing to use at most maintenance_work_mem (or perhaps
- * autovacuum_work_mem) memory space to keep track of dead TIDs. We initially
- * create a TidStore with the maximum bytes that can be used by the TidStore.
- * If the TidStore is full, we must call lazy_vacuum to vacuum indexes (and to
- * vacuum the pages that we've pruned). This frees up the memory space dedicated
- * to storing dead TIDs.
+ * autovacuum_work_mem) memory space to keep track of dead TIDs. If the
+ * TidStore is full, we must call lazy_vacuum to vacuum indexes (and to vacuum
+ * the pages that we've pruned). This frees up the memory space dedicated to
+ * storing dead TIDs.
*
* In practice VACUUM will often complete its initial pass over the target
* heap relation without ever running out of space to store TIDs. This means
@@ -2392,7 +2391,7 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
Buffer vmbuffer = InvalidBuffer;
LVSavedErrInfo saved_err_info;
TidStoreIter *iter;
- TidStoreIterResult *result;
+ TidStoreIterResult *iter_result;
Assert(vacrel->do_index_vacuuming);
Assert(vacrel->do_index_cleanup);
@@ -2417,7 +2416,7 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
vacuum_delay_point();
- blkno = result->blkno;
+ blkno = iter_result->blkno;
vacrel->blkno = blkno;
/*
@@ -2431,8 +2430,8 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
buf = ReadBufferExtended(vacrel->rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
vacrel->bstrategy);
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
- lazy_vacuum_heap_page(vacrel, blkno, result->offsets, result->num_offsets,
- buf, vmbuffer);
+ lazy_vacuum_heap_page(vacrel, blkno, iter_result->offsets,
+ iter_result->num_offsets, buf, vmbuffer);
/* Now that we've vacuumed the page, record its available space */
page = BufferGetPage(buf);
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index be83ceb871..8385d375db 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -9,11 +9,12 @@
* In a parallel vacuum, we perform both index bulk deletion and index cleanup
* with parallel worker processes. Individual indexes are processed by one
* vacuum process. ParalleVacuumState contains shared information as well as
- * the shared TidStore. We launch parallel worker processes at the start of
- * parallel index bulk-deletion and index cleanup and once all indexes are
- * processed, the parallel worker processes exit. Each time we process indexes
- * in parallel, the parallel context is re-initialized so that the same DSM can
- * be used for multiple passes of index bulk-deletion and index cleanup.
+ * the memory space for storing dead items allocated in the DSA area. We
+ * launch parallel worker processes at the start of parallel index
+ * bulk-deletion and index cleanup and once all indexes are processed, the
+ * parallel worker processes exit. Each time we process indexes in parallel,
+ * the parallel context is re-initialized so that the same DSM can be used for
+ * multiple passes of index bulk-deletion and index cleanup.
*
* Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
@@ -299,7 +300,7 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_estimate_chunk(&pcxt->estimator, est_shared_len);
shm_toc_estimate_keys(&pcxt->estimator, 1);
- /* Estimate size for dead tuple DSA -- PARALLEL_VACUUM_KEY_DSA */
+ /* Initial size for dead tuple DSA -- PARALLEL_VACUUM_KEY_DEAD_ITEMS */
shm_toc_estimate_chunk(&pcxt->estimator, dsa_minsize);
shm_toc_estimate_keys(&pcxt->estimator, 1);
--
2.31.1
v32-0016-Make-initial-and-maximum-DSA-segment-size-config.patch
From bc7b41a404cc1c8050c400919836951f78456aef Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 17 Apr 2023 21:59:12 +0900
Subject: [PATCH v32 16/18] Make initial and maximum DSA segment size
configurable
---
src/backend/utils/mmgr/dsa.c | 64 +++++++++++++++++-------------------
src/include/utils/dsa.h | 45 ++++++++++++++++++++++---
2 files changed, 71 insertions(+), 38 deletions(-)
diff --git a/src/backend/utils/mmgr/dsa.c b/src/backend/utils/mmgr/dsa.c
index 80555aefff..b6238bf4a3 100644
--- a/src/backend/utils/mmgr/dsa.c
+++ b/src/backend/utils/mmgr/dsa.c
@@ -60,14 +60,6 @@
#include "utils/freepage.h"
#include "utils/memutils.h"
-/*
- * The size of the initial DSM segment that backs a dsa_area created by
- * dsa_create. After creating some number of segments of this size we'll
- * double this size, and so on. Larger segments may be created if necessary
- * to satisfy large requests.
- */
-#define DSA_INITIAL_SEGMENT_SIZE ((size_t) (1 * 1024 * 1024))
-
/*
* How many segments to create before we double the segment size. If this is
* low, then there is likely to be a lot of wasted space in the largest
@@ -77,17 +69,6 @@
*/
#define DSA_NUM_SEGMENTS_AT_EACH_SIZE 2
-/*
- * The number of bits used to represent the offset part of a dsa_pointer.
- * This controls the maximum size of a segment, the maximum possible
- * allocation size and also the maximum number of segments per area.
- */
-#if SIZEOF_DSA_POINTER == 4
-#define DSA_OFFSET_WIDTH 27 /* 32 segments of size up to 128MB */
-#else
-#define DSA_OFFSET_WIDTH 40 /* 1024 segments of size up to 1TB */
-#endif
-
/*
* The maximum number of DSM segments that an area can own, determined by
* the number of bits remaining (but capped at 1024).
@@ -98,9 +79,6 @@
/* The bitmask for extracting the offset from a dsa_pointer. */
#define DSA_OFFSET_BITMASK (((dsa_pointer) 1 << DSA_OFFSET_WIDTH) - 1)
-/* The maximum size of a DSM segment. */
-#define DSA_MAX_SEGMENT_SIZE ((size_t) 1 << DSA_OFFSET_WIDTH)
-
/* Number of pages (see FPM_PAGE_SIZE) per regular superblock. */
#define DSA_PAGES_PER_SUPERBLOCK 16
@@ -319,6 +297,10 @@ typedef struct
dsa_segment_index segment_bins[DSA_NUM_SEGMENT_BINS];
/* The object pools for each size class. */
dsa_area_pool pools[DSA_NUM_SIZE_CLASSES];
+ /* initial allocation segment size */
+ size_t init_segment_size;
+ /* maximum allocation segment size */
+ size_t max_segment_size;
/* The total size of all active segments. */
size_t total_segment_size;
/* The maximum total size of backing storage we are allowed. */
@@ -413,7 +395,9 @@ static dsa_segment_map *make_new_segment(dsa_area *area, size_t requested_pages)
static dsa_area *create_internal(void *place, size_t size,
int tranche_id,
dsm_handle control_handle,
- dsm_segment *control_segment);
+ dsm_segment *control_segment,
+ size_t init_segment_size,
+ size_t max_segment_size);
static dsa_area *attach_internal(void *place, dsm_segment *segment,
dsa_handle handle);
static void check_for_freed_segments(dsa_area *area);
@@ -429,7 +413,8 @@ static void check_for_freed_segments_locked(dsa_area *area);
* we require the caller to provide one.
*/
dsa_area *
-dsa_create(int tranche_id)
+dsa_create_extended(int tranche_id, size_t init_segment_size,
+ size_t max_segment_size)
{
dsm_segment *segment;
dsa_area *area;
@@ -438,7 +423,7 @@ dsa_create(int tranche_id)
* Create the DSM segment that will hold the shared control object and the
* first segment of usable space.
*/
- segment = dsm_create(DSA_INITIAL_SEGMENT_SIZE, 0);
+ segment = dsm_create(init_segment_size, 0);
/*
* All segments backing this area are pinned, so that DSA can explicitly
@@ -450,9 +435,10 @@ dsa_create(int tranche_id)
/* Create a new DSA area with the control object in this segment. */
area = create_internal(dsm_segment_address(segment),
- DSA_INITIAL_SEGMENT_SIZE,
+ init_segment_size,
tranche_id,
- dsm_segment_handle(segment), segment);
+ dsm_segment_handle(segment), segment,
+ init_segment_size, max_segment_size);
/* Clean up when the control segment detaches. */
on_dsm_detach(segment, &dsa_on_dsm_detach_release_in_place,
@@ -478,13 +464,15 @@ dsa_create(int tranche_id)
* See dsa_create() for a note about the tranche arguments.
*/
dsa_area *
-dsa_create_in_place(void *place, size_t size,
- int tranche_id, dsm_segment *segment)
+dsa_create_in_place_extended(void *place, size_t size,
+ int tranche_id, dsm_segment *segment,
+ size_t init_segment_size, size_t max_segment_size)
{
dsa_area *area;
area = create_internal(place, size, tranche_id,
- DSM_HANDLE_INVALID, NULL);
+ DSM_HANDLE_INVALID, NULL,
+ init_segment_size, max_segment_size);
/*
* Clean up when the control segment detaches, if a containing DSM segment
@@ -1215,7 +1203,8 @@ static dsa_area *
create_internal(void *place, size_t size,
int tranche_id,
dsm_handle control_handle,
- dsm_segment *control_segment)
+ dsm_segment *control_segment,
+ size_t init_segment_size, size_t max_segment_size)
{
dsa_area_control *control;
dsa_area *area;
@@ -1225,6 +1214,11 @@ create_internal(void *place, size_t size,
size_t metadata_bytes;
int i;
+ /* Validate the initial and maximum block sizes */
+ Assert(init_segment_size >= 1024);
+ Assert(max_segment_size >= init_segment_size);
+ Assert(max_segment_size <= DSA_MAX_SEGMENT_SIZE);
+
/* Sanity check on the space we have to work in. */
if (size < dsa_minimum_size())
elog(ERROR, "dsa_area space must be at least %zu, but %zu provided",
@@ -1254,8 +1248,10 @@ create_internal(void *place, size_t size,
control->segment_header.prev = DSA_SEGMENT_INDEX_NONE;
control->segment_header.usable_pages = usable_pages;
control->segment_header.freed = false;
- control->segment_header.size = DSA_INITIAL_SEGMENT_SIZE;
+ control->segment_header.size = size;
control->handle = control_handle;
+ control->init_segment_size = init_segment_size;
+ control->max_segment_size = max_segment_size;
control->max_total_segment_size = (size_t) -1;
control->total_segment_size = size;
control->segment_handles[0] = control_handle;
@@ -2124,9 +2120,9 @@ make_new_segment(dsa_area *area, size_t requested_pages)
* move to huge pages in the future. Then we work back to the number of
* pages we can fit.
*/
- total_size = DSA_INITIAL_SEGMENT_SIZE *
+ total_size = area->control->init_segment_size *
((size_t) 1 << (new_index / DSA_NUM_SEGMENTS_AT_EACH_SIZE));
- total_size = Min(total_size, DSA_MAX_SEGMENT_SIZE);
+ total_size = Min(total_size, area->control->max_segment_size);
total_size = Min(total_size,
area->control->max_total_segment_size -
area->control->total_segment_size);
diff --git a/src/include/utils/dsa.h b/src/include/utils/dsa.h
index 2af215484f..90b7b0d93f 100644
--- a/src/include/utils/dsa.h
+++ b/src/include/utils/dsa.h
@@ -77,6 +77,28 @@ typedef pg_atomic_uint64 dsa_pointer_atomic;
/* A sentinel value for dsa_pointer used to indicate failure to allocate. */
#define InvalidDsaPointer ((dsa_pointer) 0)
+/*
+ * The default size of the initial DSM segment that backs a dsa_area created
+ * by dsa_create. After creating some number of segments of this size we'll
+ * double this size, and so on. Larger segments may be created if necessary
+ * to satisfy large requests.
+ */
+#define DSA_INITIAL_SEGMENT_SIZE ((size_t) (1 * 1024 * 1024))
+
+/*
+ * The number of bits used to represent the offset part of a dsa_pointer.
+ * This controls the maximum size of a segment, the maximum possible
+ * allocation size and also the maximum number of segments per area.
+ */
+#if SIZEOF_DSA_POINTER == 4
+#define DSA_OFFSET_WIDTH 27 /* 32 segments of size up to 128MB */
+#else
+#define DSA_OFFSET_WIDTH 40 /* 1024 segments of size up to 1TB */
+#endif
+
+/* The maximum size of a DSM segment. */
+#define DSA_MAX_SEGMENT_SIZE ((size_t) 1 << DSA_OFFSET_WIDTH)
+
/* Check if a dsa_pointer value is valid. */
#define DsaPointerIsValid(x) ((x) != InvalidDsaPointer)
@@ -88,6 +110,19 @@ typedef pg_atomic_uint64 dsa_pointer_atomic;
#define dsa_allocate0(area, size) \
dsa_allocate_extended(area, size, DSA_ALLOC_ZERO)
+/* Create dsa_area with default segment sizes */
+#define dsa_create(tranch_id) \
+ dsa_create_extended(tranch_id, DSA_INITIAL_SEGMENT_SIZE, \
+ DSA_MAX_SEGMENT_SIZE)
+
+/*
+ * Create dsa_area with default segment sizes in an existing share memory
+ * space.
+ */
+#define dsa_create_in_place(place, size, tranch_id, segment) \
+ dsa_create_in_place_extended(place, size, tranch_id, segment, \
+ DSA_INITIAL_SEGMENT_SIZE, DSA_MAX_SEGMENT_SIZE)
+
/*
* The type used for dsa_area handles. dsa_handle values can be shared with
* other processes, so that they can attach to them. This provides a way to
@@ -102,10 +137,12 @@ typedef dsm_handle dsa_handle;
/* Sentinel value to use for invalid dsa_handles. */
#define DSA_HANDLE_INVALID ((dsa_handle) DSM_HANDLE_INVALID)
-
-extern dsa_area *dsa_create(int tranche_id);
-extern dsa_area *dsa_create_in_place(void *place, size_t size,
- int tranche_id, dsm_segment *segment);
+extern dsa_area *dsa_create_extended(int tranche_id, size_t init_segment_size,
+ size_t max_segment_size);
+extern dsa_area *dsa_create_in_place_extended(void *place, size_t size,
+ int tranche_id, dsm_segment *segment,
+ size_t init_segment_size,
+ size_t max_segment_size);
extern dsa_area *dsa_attach(dsa_handle handle);
extern dsa_area *dsa_attach_in_place(void *place, dsm_segment *segment);
extern void dsa_release_in_place(void *place);
--
2.31.1
v32-0014-tidstore-Miscellaneous-updates.patch
From f9be0044ee6e35dd44bceca59d733ba8cdf5373e Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 17 Apr 2023 18:01:46 +0900
Subject: [PATCH v32 14/18] tidstore: Miscellaneous updates.
comment updates, fix typos, etc.
---
src/backend/access/common/tidstore.c | 78 +++++++++++--------
.../modules/test_tidstore/test_tidstore.c | 1 +
2 files changed, 47 insertions(+), 32 deletions(-)
diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
index 15b77b5bcb..9360520482 100644
--- a/src/backend/access/common/tidstore.c
+++ b/src/backend/access/common/tidstore.c
@@ -3,18 +3,19 @@
* tidstore.c
* Tid (ItemPointerData) storage implementation.
*
- * This module provides a in-memory data structure to store Tids (ItemPointer).
- * Internally, a tid is encoded as a pair of 64-bit key and 64-bit value, and
- * stored in the radix tree.
+ * TidStore is an in-memory data structure to store tids (ItemPointerData).
+ * Internally, a tid is encoded as a pair of 64-bit key and 64-bit value,
+ * and stored in the radix tree.
*
* TidStore can be shared among parallel worker processes by passing DSA area
* to TidStoreCreate(). Other backends can attach to the shared TidStore by
* TidStoreAttach().
*
- * Regarding the concurrency, it basically relies on the concurrency support in
- * the radix tree, but we acquires the lock on a TidStore in some cases, for
- * example, when to reset the store and when to access the number tids in the
- * store (num_tids).
+ * Regarding the concurrency support, we use a single LWLock for the TidStore.
+ * The TidStore is exclusively locked when inserting encoded tids to the
+ * radix tree or when resetting itself. When searching on the TidStore or
+ * doing the iteration, it is not locked but the underlying radix tree is
+ * locked in shared mode.
*
* Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
@@ -34,16 +35,18 @@
#include "utils/memutils.h"
/*
- * For encoding purposes, tids are represented as a pair of 64-bit key and
- * 64-bit value. First, we construct 64-bit unsigned integer by combining
- * the block number and the offset number. The number of bits used for the
- * offset number is specified by max_offsets in tidstore_create(). We are
- * frugal with the bits, because smaller keys could help keeping the radix
- * tree shallow.
+ * For encoding purposes, a tid is represented as a pair of 64-bit key and
+ * 64-bit value.
*
- * For example, a tid of heap with 8kB blocks uses the lowest 9 bits for
- * the offset number and uses the next 32 bits for the block number. That
- * is, only 41 bits are used:
+ * First, we construct a 64-bit unsigned integer by combining the block
+ * number and the offset number. The number of bits used for the offset number
+ * is specified by max_off in TidStoreCreate(). We are frugal with the bits,
+ * because smaller keys could help keeping the radix tree shallow.
+ *
+ * For example, a tid of heap on an 8kB block uses the lowest 9 bits for
+ * the offset number and uses the next 32 bits for the block number. 9 bits
+ * are enough for the offset number, because MaxHeapTuplesPerPage < 2^9
+ * on 8kB blocks. That is, only 41 bits are used:
*
* uuuuuuuY YYYYYYYY YYYYYYYY YYYYYYYY YYYYYYYX XXXXXXXX
*
@@ -52,25 +55,27 @@
* u = unused bit
* (high on the left, low on the right)
*
- * 9 bits are enough for the offset number, because MaxHeapTuplesPerPage < 2^9
- * on 8kB blocks.
- *
- * The 64-bit value is the bitmap representation of the lowest 6 bits
- * (TIDSTORE_VALUE_NBITS) of the integer, and the rest 35 bits are used
- * as the key:
+ * Then, 64-bit value is the bitmap representation of the lowest 6 bits
+ * (LOWER_OFFSET_NBITS) of the integer, and 64-bit key consists of the
+ * upper 3 bits of the offset number and the block number, 35 bits in
+ * total:
*
* uuuuuuuY YYYYYYYY YYYYYYYY YYYYYYYY YYYYYYYX XXXXXXXX
* |----| value
- * |---------------------------------------------| key
+ * |--------------------------------------| key
*
* The maximum height of the radix tree is 5 in this case.
+ *
+ * If the number of bits required for offset numbers fits in LOWER_OFFSET_NBITS,
+ * 64-bit value is the bitmap representation of the offset number, and the
+ * 64-bit key is the block number.
*/
typedef uint64 tidkey;
typedef uint64 offsetbm;
#define LOWER_OFFSET_NBITS 6 /* log(sizeof(offsetbm), 2) */
#define LOWER_OFFSET_MASK ((1 << LOWER_OFFSET_NBITS) - 1)
-/* A magic value used to identify our TidStores. */
+/* A magic value used to identify our TidStore. */
#define TIDSTORE_MAGIC 0x826f6a10
#define RT_PREFIX local_rt
@@ -152,8 +157,10 @@ typedef struct TidStoreIter
tidkey next_tidkey;
offsetbm next_off_bitmap;
- /* output for the caller */
- TidStoreIterResult result;
+ /*
+ * output for the caller. Must be last because variable-size.
+ */
+ TidStoreIterResult output;
} TidStoreIter;
static void iter_decode_key_off(TidStoreIter *iter, tidkey key, offsetbm off_bitmap);
@@ -205,7 +212,7 @@ TidStoreCreate(size_t max_bytes, int max_off, dsa_area *area)
dp = dsa_allocate0(area, sizeof(TidStoreControl));
ts->control = (TidStoreControl *) dsa_get_address(area, dp);
- ts->control->max_bytes = (uint64) (max_bytes * ratio);
+ ts->control->max_bytes = (size_t) (max_bytes * ratio);
ts->area = area;
ts->control->magic = TIDSTORE_MAGIC;
@@ -353,7 +360,11 @@ TidStoreReset(TidStore *ts)
}
}
-/* Add Tids on a block to TidStore */
+/*
+ * Set the given tids on the blkno to TidStore.
+ *
+ * NB: the offset numbers in offsets must be sorted in ascending order.
+ */
void
TidStoreSetBlockOffsets(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
int num_offsets)
@@ -564,7 +575,7 @@ TidStoreEndIterate(TidStoreIter *iter)
int64
TidStoreNumTids(TidStore *ts)
{
- uint64 num_tids;
+ int64 num_tids;
Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
@@ -624,11 +635,14 @@ TidStoreGetHandle(TidStore *ts)
return ts->control->handle;
}
-/* Extract tids from the given key-value pair */
+/*
+ * Decode the key and offset bitmap to tids and store them to the iteration
+ * result.
+ */
static void
iter_decode_key_off(TidStoreIter *iter, tidkey key, offsetbm off_bitmap)
{
- TidStoreIterResult *result = (&iter->result);
+ TidStoreIterResult *output = (&iter->output);
while (off_bitmap)
{
@@ -661,7 +675,7 @@ key_get_blkno(TidStore *ts, tidkey key)
static inline tidkey
encode_tid(TidStore *ts, ItemPointer tid, offsetbm *off_bit)
{
- uint32 offset = ItemPointerGetOffsetNumber(tid);
+ OffsetNumber offset = ItemPointerGetOffsetNumber(tid);
BlockNumber block = ItemPointerGetBlockNumber(tid);
return encode_blk_off(ts, block, offset, off_bit);
diff --git a/src/test/modules/test_tidstore/test_tidstore.c b/src/test/modules/test_tidstore/test_tidstore.c
index 12d3027624..8659e6780e 100644
--- a/src/test/modules/test_tidstore/test_tidstore.c
+++ b/src/test/modules/test_tidstore/test_tidstore.c
@@ -222,6 +222,7 @@ test_tidstore(PG_FUNCTION_ARGS)
elog(NOTICE, "testing basic operations");
test_basic(MaxHeapTuplesPerPage);
test_basic(10);
+ test_basic(MaxHeapTuplesPerPage * 2);
PG_RETURN_VOID();
}
--
2.31.1
v32-0018-Revert-building-benchmark-module-for-CI.patch
From 9e42c43a7d081c06c02f0029e610c29d911732e3 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Tue, 14 Feb 2023 19:31:34 +0700
Subject: [PATCH v32 18/18] Revert building benchmark module for CI
---
contrib/meson.build | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/contrib/meson.build b/contrib/meson.build
index 421d469f8c..52253de793 100644
--- a/contrib/meson.build
+++ b/contrib/meson.build
@@ -12,7 +12,7 @@ subdir('amcheck')
subdir('auth_delay')
subdir('auto_explain')
subdir('basic_archive')
-subdir('bench_radix_tree')
+#subdir('bench_radix_tree')
subdir('bloom')
subdir('basebackup_to_shell')
subdir('bool_plperl')
--
2.31.1
v32-0017-tidstore-vacuum-Specify-the-init-and-max-DSA-seg.patch
From 11fda58f829c03d2a7c6476affc61862a078f741 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 17 Apr 2023 22:27:04 +0900
Subject: [PATCH v32 17/18] tidstore, vacuum: Specify the init and max DSA
segment size based on m_w_m
---
src/backend/access/common/tidstore.c | 32 +++++----------------------
src/backend/commands/vacuumparallel.c | 21 ++++++++++++++----
2 files changed, 23 insertions(+), 30 deletions(-)
diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
index 9360520482..571d15c5c3 100644
--- a/src/backend/access/common/tidstore.c
+++ b/src/backend/access/common/tidstore.c
@@ -180,39 +180,15 @@ TidStoreCreate(size_t max_bytes, int max_off, dsa_area *area)
ts = palloc0(sizeof(TidStore));
- /*
- * Create the radix tree for the main storage.
- *
- * Memory consumption depends on the number of stored tids, but also on the
- * distribution of them, how the radix tree stores, and the memory management
- * that backed the radix tree. The maximum bytes that a TidStore can
- * use is specified by the max_bytes in TidStoreCreate(). We want the total
- * amount of memory consumption by a TidStore not to exceed the max_bytes.
- *
- * In local TidStore cases, the radix tree uses slab allocators for each kind
- * of node class. The most memory consuming case while adding Tids associated
- * with one page (i.e. during TidStoreSetBlockOffsets()) is that we allocate a new
- * slab block for a new radix tree node, which is approximately 70kB. Therefore,
- * we deduct 70kB from the max_bytes.
- *
- * In shared cases, DSA allocates the memory segments big enough to follow
- * a geometric series that approximately doubles the total DSA size (see
- * make_new_segment() in dsa.c). We simulated the how DSA increases segment
- * size and the simulation revealed, the 75% threshold for the maximum bytes
- * perfectly works in case where the max_bytes is a power-of-2, and the 60%
- * threshold works for other cases.
- */
if (area != NULL)
{
dsa_pointer dp;
- float ratio = ((max_bytes & (max_bytes - 1)) == 0) ? 0.75 : 0.6;
ts->tree.shared = shared_rt_create(CurrentMemoryContext, area,
LWTRANCHE_SHARED_TIDSTORE);
dp = dsa_allocate0(area, sizeof(TidStoreControl));
ts->control = (TidStoreControl *) dsa_get_address(area, dp);
- ts->control->max_bytes = (size_t) (max_bytes * ratio);
ts->area = area;
ts->control->magic = TIDSTORE_MAGIC;
@@ -223,11 +199,15 @@ TidStoreCreate(size_t max_bytes, int max_off, dsa_area *area)
else
{
ts->tree.local = local_rt_create(CurrentMemoryContext);
-
ts->control = (TidStoreControl *) palloc0(sizeof(TidStoreControl));
- ts->control->max_bytes = max_bytes - (70 * 1024);
}
+ /*
+ * max_bytes is forced to be at least 64kB, the current minimum valid value
+ * for the work_mem GUC.
+ */
+ ts->control->max_bytes = Max(64 * 1024L, max_bytes);
+
ts->control->max_off = max_off;
ts->control->max_off_nbits = pg_ceil_log2_32(max_off);
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index 8385d375db..17699aa007 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -252,6 +252,8 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
Size est_indstats_len;
Size est_shared_len;
Size dsa_minsize = dsa_minimum_size();
+ Size init_segsize;
+ Size max_segsize;
int nindexes_mwm = 0;
int parallel_workers = 0;
int querylen;
@@ -367,12 +369,23 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_INDEX_STATS, indstats);
pvs->indstats = indstats;
- /* Prepare DSA space for dead items */
+ /*
+ * Prepare DSA space for dead items.
+ *
+ * Since total DSA size grows while following a geometric series by default,
+ * we specify both the initial DSA segment and maximum DSA segment sizes
+ * based on the memory available for parallel vacuum. Typically, the initial
+ * segment size is 1MB and the maximum segment size is vac_work_mem / 8, and
+ * heap scan stops after allocating 1.125 times more memory than vac_work_mem.
+ */
+ init_segsize = Min(vac_work_mem / 4, (1024 * 1024));
+ max_segsize = Max(vac_work_mem / 8, (8 * 1024 * 1024));
area_space = shm_toc_allocate(pcxt->toc, dsa_minsize);
shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_DEAD_ITEMS, area_space);
- dead_items_dsa = dsa_create_in_place(area_space, dsa_minsize,
- LWTRANCHE_PARALLEL_VACUUM_DSA,
- pcxt->seg);
+ dead_items_dsa = dsa_create_in_place_extended(area_space, dsa_minsize,
+ LWTRANCHE_PARALLEL_VACUUM_DSA,
+ pcxt->seg,
+ init_segsize, max_segsize);
dead_items = TidStoreCreate(vac_work_mem, max_offset, dead_items_dsa);
pvs->dead_items = dead_items;
pvs->dead_items_area = dead_items_dsa;
--
2.31.1
v32-0010-radix-tree-fix-radix-tree-test-code.patch
From 591bede6738ca9e5c7264db7ff1d3dd9ba29247f Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 17 Apr 2023 17:35:14 +0900
Subject: [PATCH v32 10/18] radix tree: fix radix tree test code
fix tests for key insertion in ascending or descending order.
Also, we missed tests for MIN and MAX size classes.
---
.../expected/test_radixtree.out | 6 +-
.../modules/test_radixtree/test_radixtree.c | 103 ++++++++++++------
2 files changed, 71 insertions(+), 38 deletions(-)
diff --git a/src/test/modules/test_radixtree/expected/test_radixtree.out b/src/test/modules/test_radixtree/expected/test_radixtree.out
index ce645cb8b5..7ad1ce3605 100644
--- a/src/test/modules/test_radixtree/expected/test_radixtree.out
+++ b/src/test/modules/test_radixtree/expected/test_radixtree.out
@@ -4,8 +4,10 @@ CREATE EXTENSION test_radixtree;
-- an error if something fails.
--
SELECT test_radixtree();
-NOTICE: testing basic operations with leaf node 4
-NOTICE: testing basic operations with inner node 4
+NOTICE: testing basic operations with leaf node 3
+NOTICE: testing basic operations with inner node 3
+NOTICE: testing basic operations with leaf node 15
+NOTICE: testing basic operations with inner node 15
NOTICE: testing basic operations with leaf node 32
NOTICE: testing basic operations with inner node 32
NOTICE: testing basic operations with leaf node 125
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
index afe53382f3..5a169854d9 100644
--- a/src/test/modules/test_radixtree/test_radixtree.c
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -43,12 +43,15 @@ typedef uint64 TestValueType;
*/
static const bool rt_test_stats = false;
-static int rt_node_kind_fanouts[] = {
- 0,
- 4, /* RT_NODE_KIND_4 */
- 32, /* RT_NODE_KIND_32 */
- 125, /* RT_NODE_KIND_125 */
- 256 /* RT_NODE_KIND_256 */
+/*
+ * XXX: should we expose and use RT_SIZE_CLASS and RT_SIZE_CLASS_INFO?
+ */
+static int rt_node_class_fanouts[] = {
+ 3, /* RT_CLASS_3 */
+ 15, /* RT_CLASS_32_MIN */
+ 32, /* RT_CLASS_32_MAX */
+ 125, /* RT_CLASS_125 */
+ 256 /* RT_CLASS_256 */
};
/*
* A struct to define a pattern of integers, for use with the test_pattern()
@@ -260,10 +263,9 @@ test_basic(int children, bool test_inner)
* Check if keys from start to end with the shift exist in the tree.
*/
static void
-check_search_on_node(rt_radix_tree *radixtree, uint8 shift, int start, int end,
- int incr)
+check_search_on_node(rt_radix_tree *radixtree, uint8 shift, int start, int end)
{
- for (int i = start; i < end; i++)
+ for (int i = start; i <= end; i++)
{
uint64 key = ((uint64) i << shift);
TestValueType val;
@@ -277,22 +279,26 @@ check_search_on_node(rt_radix_tree *radixtree, uint8 shift, int start, int end,
}
}
+/*
+ * Insert 256 key-value pairs, and check if keys are properly inserted on each
+ * node class.
+ */
+/* Test keys [0, 256) */
+#define NODE_TYPE_TEST_KEY_MIN 0
+#define NODE_TYPE_TEST_KEY_MAX 256
static void
-test_node_types_insert(rt_radix_tree *radixtree, uint8 shift, bool insert_asc)
+test_node_types_insert_asc(rt_radix_tree *radixtree, uint8 shift)
{
- uint64 num_entries;
- int ninserted = 0;
- int start = insert_asc ? 0 : 256;
- int incr = insert_asc ? 1 : -1;
- int end = insert_asc ? 256 : 0;
- int node_kind_idx = 1;
+ uint64 num_entries;
+ int node_class_idx = 0;
+ uint64 key_checked = 0;
- for (int i = start; i != end; i += incr)
+ for (int i = NODE_TYPE_TEST_KEY_MIN; i < NODE_TYPE_TEST_KEY_MAX; i++)
{
uint64 key = ((uint64) i << shift);
bool found;
- found = rt_set(radixtree, key, (TestValueType*) &key);
+ found = rt_set(radixtree, key, (TestValueType *) &key);
if (found)
elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " is found", key);
@@ -300,24 +306,49 @@ test_node_types_insert(rt_radix_tree *radixtree, uint8 shift, bool insert_asc)
* After filling all slots in each node type, check if the values
* are stored properly.
*/
- if (ninserted == rt_node_kind_fanouts[node_kind_idx] - 1)
+ if ((i + 1) == rt_node_class_fanouts[node_class_idx])
{
- int check_start = insert_asc
- ? rt_node_kind_fanouts[node_kind_idx - 1]
- : rt_node_kind_fanouts[node_kind_idx];
- int check_end = insert_asc
- ? rt_node_kind_fanouts[node_kind_idx]
- : rt_node_kind_fanouts[node_kind_idx - 1];
-
- check_search_on_node(radixtree, shift, check_start, check_end, incr);
- node_kind_idx++;
+ check_search_on_node(radixtree, shift, key_checked, i);
+ key_checked = i;
+ node_class_idx++;
}
-
- ninserted++;
}
num_entries = rt_num_entries(radixtree);
+ if (num_entries != 256)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(256));
+}
+
+/*
+ * Similar to test_node_types_insert_asc(), but inserts keys in descending order.
+ */
+static void
+test_node_types_insert_desc(rt_radix_tree *radixtree, uint8 shift)
+{
+ uint64 num_entries;
+ int node_class_idx = 0;
+ uint64 key_checked = NODE_TYPE_TEST_KEY_MAX - 1;
+
+ for (int i = NODE_TYPE_TEST_KEY_MAX - 1; i >= NODE_TYPE_TEST_KEY_MIN; i--)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_set(radixtree, key, (TestValueType *) &key);
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " is found", key);
+ if ((i + 1) == rt_node_class_fanouts[node_class_idx])
+ {
+ check_search_on_node(radixtree, shift, i, key_checked);
+ key_checked = i;
+ node_class_idx++;
+ }
+ }
+
+ num_entries = rt_num_entries(radixtree);
if (num_entries != 256)
elog(ERROR,
"rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
@@ -329,7 +360,7 @@ test_node_types_delete(rt_radix_tree *radixtree, uint8 shift)
{
uint64 num_entries;
- for (int i = 0; i < 256; i++)
+ for (int i = NODE_TYPE_TEST_KEY_MIN; i < NODE_TYPE_TEST_KEY_MAX; i++)
{
uint64 key = ((uint64) i << shift);
bool found;
@@ -379,9 +410,9 @@ test_node_types(uint8 shift)
* then delete all entries to make it empty, and insert and search entries
* again.
*/
- test_node_types_insert(radixtree, shift, true);
+ test_node_types_insert_asc(radixtree, shift);
test_node_types_delete(radixtree, shift);
- test_node_types_insert(radixtree, shift, false);
+ test_node_types_insert_desc(radixtree, shift);
rt_free(radixtree);
#ifdef RT_SHMEM
@@ -664,10 +695,10 @@ test_radixtree(PG_FUNCTION_ARGS)
{
test_empty();
- for (int i = 1; i < lengthof(rt_node_kind_fanouts); i++)
+ for (int i = 0; i < lengthof(rt_node_class_fanouts); i++)
{
- test_basic(rt_node_kind_fanouts[i], false);
- test_basic(rt_node_kind_fanouts[i], true);
+ test_basic(rt_node_class_fanouts[i], false);
+ test_basic(rt_node_class_fanouts[i], true);
}
for (int shift = 0; shift <= (64 - 8); shift += 8)
--
2.31.1
Attachment: v32-0011-tidstore-vacuum-Use-camel-case-for-TidStore-APIs.patch (application/octet-stream)
From 6f4ff3584cbbf4db3ed7268ebc360df0ad328696 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 17 Apr 2023 17:47:10 +0900
Subject: [PATCH v32 11/18] tidstore, vacuum: Use camel case for TidStore APIs
---
src/backend/access/common/tidstore.c | 64 +++++++++---------
src/backend/access/heap/vacuumlazy.c | 44 ++++++------
src/backend/commands/vacuum.c | 4 +-
src/backend/commands/vacuumparallel.c | 12 ++--
src/include/access/tidstore.h | 34 +++++-----
.../modules/test_tidstore/test_tidstore.c | 67 ++++++++++---------
6 files changed, 114 insertions(+), 111 deletions(-)
diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
index 8c05e60d92..283a326d13 100644
--- a/src/backend/access/common/tidstore.c
+++ b/src/backend/access/common/tidstore.c
@@ -7,9 +7,9 @@
* Internally, a tid is encoded as a pair of 64-bit key and 64-bit value, and
* stored in the radix tree.
*
- * A TidStore can be shared among parallel worker processes by passing DSA area
- * to tidstore_create(). Other backends can attach to the shared TidStore by
- * tidstore_attach().
+ * TidStore can be shared among parallel worker processes by passing DSA area
+ * to TidStoreCreate(). Other backends can attach to the shared TidStore by
+ * TidStoreAttach().
*
* Regarding the concurrency, it basically relies on the concurrency support in
* the radix tree, but we acquires the lock on a TidStore in some cases, for
@@ -106,7 +106,7 @@ typedef struct TidStoreControl
LWLock lock;
/* handles for TidStore and radix tree */
- tidstore_handle handle;
+ TidStoreHandle handle;
shared_rt_handle tree_handle;
} TidStoreControl;
@@ -164,7 +164,7 @@ static inline uint64 tid_to_key_off(TidStore *ts, ItemPointer tid, uint64 *off_b
* The radix tree for storage is allocated in DSA area is 'area' is non-NULL.
*/
TidStore *
-tidstore_create(size_t max_bytes, int max_offset, dsa_area *area)
+TidStoreCreate(size_t max_bytes, int max_off, dsa_area *area)
{
TidStore *ts;
@@ -176,12 +176,12 @@ tidstore_create(size_t max_bytes, int max_offset, dsa_area *area)
* Memory consumption depends on the number of stored tids, but also on the
* distribution of them, how the radix tree stores, and the memory management
* that backed the radix tree. The maximum bytes that a TidStore can
- * use is specified by the max_bytes in tidstore_create(). We want the total
+ * use is specified by the max_bytes in TidStoreCreate(). We want the total
* amount of memory consumption by a TidStore not to exceed the max_bytes.
*
* In local TidStore cases, the radix tree uses slab allocators for each kind
* of node class. The most memory consuming case while adding Tids associated
- * with one page (i.e. during tidstore_add_tids()) is that we allocate a new
+ * with one page (i.e. during TidStoreSetBlockOffsets()) is that we allocate a new
* slab block for a new radix tree node, which is approximately 70kB. Therefore,
* we deduct 70kB from the max_bytes.
*
@@ -235,7 +235,7 @@ tidstore_create(size_t max_bytes, int max_offset, dsa_area *area)
* allocated in backend-local memory using the CurrentMemoryContext.
*/
TidStore *
-tidstore_attach(dsa_area *area, tidstore_handle handle)
+TidStoreAttach(dsa_area *area, TidStoreHandle handle)
{
TidStore *ts;
dsa_pointer control;
@@ -266,7 +266,7 @@ tidstore_attach(dsa_area *area, tidstore_handle handle)
* to the operating system.
*/
void
-tidstore_detach(TidStore *ts)
+TidStoreDetach(TidStore *ts)
{
Assert(TidStoreIsShared(ts) && ts->control->magic == TIDSTORE_MAGIC);
@@ -279,12 +279,12 @@ tidstore_detach(TidStore *ts)
*
* TODO: The caller must be certain that no other backend will attempt to
* access the TidStore before calling this function. Other backend must
- * explicitly call tidstore_detach to free up backend-local memory associated
- * with the TidStore. The backend that calls tidstore_destroy must not call
- * tidstore_detach.
+ * explicitly call TidStoreDetach() to free up backend-local memory associated
+ * with the TidStore. The backend that calls TidStoreDestroy() must not call
+ * TidStoreDetach().
*/
void
-tidstore_destroy(TidStore *ts)
+TidStoreDestroy(TidStore *ts)
{
if (TidStoreIsShared(ts))
{
@@ -309,11 +309,11 @@ tidstore_destroy(TidStore *ts)
}
/*
- * Forget all collected Tids. It's similar to tidstore_destroy but we don't free
+ * Forget all collected Tids. It's similar to TidStoreDestroy() but we don't free
* entire TidStore but recreate only the radix tree storage.
*/
void
-tidstore_reset(TidStore *ts)
+TidStoreReset(TidStore *ts)
{
if (TidStoreIsShared(ts))
{
@@ -352,8 +352,8 @@ tidstore_reset(TidStore *ts)
/* Add Tids on a block to TidStore */
void
-tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
- int num_offsets)
+TidStoreSetBlockOffsets(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets)
{
uint64 *values;
uint64 key;
@@ -431,7 +431,7 @@ tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
/* Return true if the given tid is present in the TidStore */
bool
-tidstore_lookup_tid(TidStore *ts, ItemPointer tid)
+TidStoreIsMember(TidStore *ts, ItemPointer tid)
{
uint64 key;
uint64 val = 0;
@@ -452,14 +452,16 @@ tidstore_lookup_tid(TidStore *ts, ItemPointer tid)
}
/*
- * Prepare to iterate through a TidStore. Since the radix tree is locked during the
- * iteration, so tidstore_end_iterate() needs to called when finished.
+ * Prepare to iterate through a TidStore. Since the radix tree is locked during
+ * the iteration, so TidStoreEndIterate() needs to be called when finished.
+ *
+ * The TidStoreIter struct is created in the caller's memory context.
*
* Concurrent updates during the iteration will be blocked when inserting a
* key-value to the radix tree.
*/
TidStoreIter *
-tidstore_begin_iterate(TidStore *ts)
+TidStoreBeginIterate(TidStore *ts)
{
TidStoreIter *iter;
@@ -477,7 +479,7 @@ tidstore_begin_iterate(TidStore *ts)
iter->tree_iter.local = local_rt_begin_iterate(ts->tree.local);
/* If the TidStore is empty, there is no business */
- if (tidstore_num_tids(ts) == 0)
+ if (TidStoreNumTids(ts) == 0)
iter->finished = true;
return iter;
@@ -498,7 +500,7 @@ tidstore_iter_kv(TidStoreIter *iter, uint64 *key, uint64 *val)
* numbers in each result is also sorted in ascending order.
*/
TidStoreIterResult *
-tidstore_iterate_next(TidStoreIter *iter)
+TidStoreIterateNext(TidStoreIter *iter)
{
uint64 key;
uint64 val;
@@ -544,7 +546,7 @@ tidstore_iterate_next(TidStoreIter *iter)
* or when existing an iteration.
*/
void
-tidstore_end_iterate(TidStoreIter *iter)
+TidStoreEndIterate(TidStoreIter *iter)
{
if (TidStoreIsShared(iter->ts))
shared_rt_end_iterate(iter->tree_iter.shared);
@@ -557,7 +559,7 @@ tidstore_end_iterate(TidStoreIter *iter)
/* Return the number of tids we collected so far */
int64
-tidstore_num_tids(TidStore *ts)
+TidStoreNumTids(TidStore *ts)
{
uint64 num_tids;
@@ -575,16 +577,16 @@ tidstore_num_tids(TidStore *ts)
/* Return true if the current memory usage of TidStore exceeds the limit */
bool
-tidstore_is_full(TidStore *ts)
+TidStoreIsFull(TidStore *ts)
{
Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
- return (tidstore_memory_usage(ts) > ts->control->max_bytes);
+ return (TidStoreMemoryUsage(ts) > ts->control->max_bytes);
}
/* Return the maximum memory TidStore can use */
size_t
-tidstore_max_memory(TidStore *ts)
+TidStoreMaxMemory(TidStore *ts)
{
Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
@@ -593,7 +595,7 @@ tidstore_max_memory(TidStore *ts)
/* Return the memory usage of TidStore */
size_t
-tidstore_memory_usage(TidStore *ts)
+TidStoreMemoryUsage(TidStore *ts)
{
Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
@@ -611,8 +613,8 @@ tidstore_memory_usage(TidStore *ts)
/*
* Get a handle that can be used by other processes to attach to this TidStore
*/
-tidstore_handle
-tidstore_get_handle(TidStore *ts)
+TidStoreHandle
+TidStoreGetHandle(TidStore *ts)
{
Assert(TidStoreIsShared(ts) && ts->control->magic == TIDSTORE_MAGIC);
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 2c72088e69..be487aced6 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -842,7 +842,7 @@ lazy_scan_heap(LVRelState *vacrel)
/* Report that we're scanning the heap, advertising total # of blocks */
initprog_val[0] = PROGRESS_VACUUM_PHASE_SCAN_HEAP;
initprog_val[1] = rel_pages;
- initprog_val[2] = tidstore_max_memory(vacrel->dead_items);
+ initprog_val[2] = TidStoreMaxMemory(vacrel->dead_items);
pgstat_progress_update_multi_param(3, initprog_index, initprog_val);
/* Set up an initial range of skippable blocks using the visibility map */
@@ -909,7 +909,7 @@ lazy_scan_heap(LVRelState *vacrel)
* dead_items TIDs, pause and do a cycle of vacuuming before we tackle
* this page.
*/
- if (tidstore_is_full(vacrel->dead_items))
+ if (TidStoreIsFull(vacrel->dead_items))
{
/*
* Before beginning index vacuuming, we release any pin we may
@@ -1078,16 +1078,16 @@ lazy_scan_heap(LVRelState *vacrel)
* with prunestate-driven visibility map and FSM steps (just like
* the two-pass strategy).
*/
- Assert(tidstore_num_tids(dead_items) == 0);
+ Assert(TidStoreNumTids(dead_items) == 0);
}
else if (prunestate.num_offsets > 0)
{
/* Save details of the LP_DEAD items from the page in dead_items */
- tidstore_add_tids(dead_items, blkno, prunestate.deadoffsets,
- prunestate.num_offsets);
+ TidStoreSetBlockOffsets(dead_items, blkno, prunestate.deadoffsets,
+ prunestate.num_offsets);
pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
- tidstore_memory_usage(dead_items));
+ TidStoreMemoryUsage(dead_items));
}
/*
@@ -1258,7 +1258,7 @@ lazy_scan_heap(LVRelState *vacrel)
* Do index vacuuming (call each index's ambulkdelete routine), then do
* related heap vacuuming
*/
- if (tidstore_num_tids(dead_items) > 0)
+ if (TidStoreNumTids(dead_items) > 0)
lazy_vacuum(vacrel);
/*
@@ -2125,10 +2125,10 @@ lazy_scan_noprune(LVRelState *vacrel,
*/
vacrel->lpdead_item_pages++;
- tidstore_add_tids(dead_items, blkno, deadoffsets, lpdead_items);
+ TidStoreSetBlockOffsets(dead_items, blkno, deadoffsets, lpdead_items);
pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
- tidstore_memory_usage(dead_items));
+ TidStoreMemoryUsage(dead_items));
vacrel->lpdead_items += lpdead_items;
@@ -2177,7 +2177,7 @@ lazy_vacuum(LVRelState *vacrel)
if (!vacrel->do_index_vacuuming)
{
Assert(!vacrel->do_index_cleanup);
- tidstore_reset(vacrel->dead_items);
+ TidStoreReset(vacrel->dead_items);
return;
}
@@ -2206,7 +2206,7 @@ lazy_vacuum(LVRelState *vacrel)
BlockNumber threshold;
Assert(vacrel->num_index_scans == 0);
- Assert(vacrel->lpdead_items == tidstore_num_tids(vacrel->dead_items));
+ Assert(vacrel->lpdead_items == TidStoreNumTids(vacrel->dead_items));
Assert(vacrel->do_index_vacuuming);
Assert(vacrel->do_index_cleanup);
@@ -2234,7 +2234,7 @@ lazy_vacuum(LVRelState *vacrel)
*/
threshold = (double) vacrel->rel_pages * BYPASS_THRESHOLD_PAGES;
bypass = (vacrel->lpdead_item_pages < threshold) &&
- tidstore_memory_usage(vacrel->dead_items) < (32L * 1024L * 1024L);
+ TidStoreMemoryUsage(vacrel->dead_items) < (32L * 1024L * 1024L);
}
if (bypass)
@@ -2279,7 +2279,7 @@ lazy_vacuum(LVRelState *vacrel)
* Forget the LP_DEAD items that we just vacuumed (or just decided to not
* vacuum)
*/
- tidstore_reset(vacrel->dead_items);
+ TidStoreReset(vacrel->dead_items);
}
/*
@@ -2352,7 +2352,7 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
* place).
*/
Assert(vacrel->num_index_scans > 0 ||
- tidstore_num_tids(vacrel->dead_items) == vacrel->lpdead_items);
+ TidStoreNumTids(vacrel->dead_items) == vacrel->lpdead_items);
Assert(allindexes || VacuumFailsafeActive);
/*
@@ -2407,8 +2407,8 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
VACUUM_ERRCB_PHASE_VACUUM_HEAP,
InvalidBlockNumber, InvalidOffsetNumber);
- iter = tidstore_begin_iterate(vacrel->dead_items);
- while ((result = tidstore_iterate_next(iter)) != NULL)
+ iter = TidStoreBeginIterate(vacrel->dead_items);
+ while ((iter_result = TidStoreIterateNext(iter)) != NULL)
{
BlockNumber blkno;
Buffer buf;
@@ -2442,7 +2442,7 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
vacuumed_pages++;
}
- tidstore_end_iterate(iter);
+ TidStoreEndIterate(iter);
vacrel->blkno = InvalidBlockNumber;
if (BufferIsValid(vmbuffer))
@@ -2453,12 +2453,12 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
* the second heap pass. No more, no less.
*/
Assert(vacrel->num_index_scans > 1 ||
- (tidstore_num_tids(vacrel->dead_items) == vacrel->lpdead_items &&
+ (TidStoreNumTids(vacrel->dead_items) == vacrel->lpdead_items &&
vacuumed_pages == vacrel->lpdead_item_pages));
ereport(DEBUG2,
- (errmsg("table \"%s\": removed " UINT64_FORMAT "dead item identifiers in %u pages",
- vacrel->relname, tidstore_num_tids(vacrel->dead_items),
+ (errmsg("table \"%s\": removed " INT64_FORMAT "dead item identifiers in %u pages",
+ vacrel->relname, TidStoreNumTids(vacrel->dead_items),
vacuumed_pages)));
/* Revert to the previous phase information for error traceback */
@@ -3125,8 +3125,8 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
}
/* Serial VACUUM case */
- vacrel->dead_items = tidstore_create(vac_work_mem, MaxHeapTuplesPerPage,
- NULL);
+ vacrel->dead_items = TidStoreCreate(vac_work_mem, MaxHeapTuplesPerPage,
+ NULL);
}
/*
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index f3922b72dc..84f71fb14a 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -2486,7 +2486,7 @@ vac_bulkdel_one_index(IndexVacuumInfo *ivinfo, IndexBulkDeleteResult *istat,
ereport(ivinfo->message_level,
(errmsg("scanned index \"%s\" to remove " UINT64_FORMAT " row versions",
RelationGetRelationName(ivinfo->index),
- tidstore_num_tids(dead_items))));
+ TidStoreNumTids(dead_items))));
return istat;
}
@@ -2527,5 +2527,5 @@ vac_tid_reaped(ItemPointer itemptr, void *state)
{
TidStore *dead_items = (TidStore *) state;
- return tidstore_lookup_tid(dead_items, itemptr);
+ return TidStoreIsMember(dead_items, itemptr);
}
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index c363f45e32..be83ceb871 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -110,7 +110,7 @@ typedef struct PVShared
pg_atomic_uint32 idx;
/* Handle of the shared TidStore */
- tidstore_handle dead_items_handle;
+ TidStoreHandle dead_items_handle;
} PVShared;
/* Status used during parallel index vacuum or cleanup */
@@ -372,7 +372,7 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
dead_items_dsa = dsa_create_in_place(area_space, dsa_minsize,
LWTRANCHE_PARALLEL_VACUUM_DSA,
pcxt->seg);
- dead_items = tidstore_create(vac_work_mem, max_offset, dead_items_dsa);
+ dead_items = TidStoreCreate(vac_work_mem, max_offset, dead_items_dsa);
pvs->dead_items = dead_items;
pvs->dead_items_area = dead_items_dsa;
@@ -385,7 +385,7 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
(nindexes_mwm > 0) ?
maintenance_work_mem / Min(parallel_workers, nindexes_mwm) :
maintenance_work_mem;
- shared->dead_items_handle = tidstore_get_handle(dead_items);
+ shared->dead_items_handle = TidStoreGetHandle(dead_items);
/* Use the same buffer size for all workers */
shared->ring_nbuffers = GetAccessStrategyBufferCount(bstrategy);
@@ -454,7 +454,7 @@ parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats)
istats[i] = NULL;
}
- tidstore_destroy(pvs->dead_items);
+ TidStoreDestroy(pvs->dead_items);
dsa_detach(pvs->dead_items_area);
DestroyParallelContext(pvs->pcxt);
@@ -1013,7 +1013,7 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
/* Set dead items */
area_space = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_DEAD_ITEMS, false);
dead_items_area = dsa_attach_in_place(area_space, seg);
- dead_items = tidstore_attach(dead_items_area, shared->dead_items_handle);
+ dead_items = TidStoreAttach(dead_items_area, shared->dead_items_handle);
/* Set cost-based vacuum delay */
VacuumUpdateCosts();
@@ -1061,7 +1061,7 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
&wal_usage[ParallelWorkerNumber]);
- tidstore_detach(pvs.dead_items);
+ TidStoreDetach(dead_items);
dsa_detach(dead_items_area);
/* Pop the error context stack */
diff --git a/src/include/access/tidstore.h b/src/include/access/tidstore.h
index a35a52124a..f0a432d0da 100644
--- a/src/include/access/tidstore.h
+++ b/src/include/access/tidstore.h
@@ -17,7 +17,7 @@
#include "storage/itemptr.h"
#include "utils/dsa.h"
-typedef dsa_pointer tidstore_handle;
+typedef dsa_pointer TidStoreHandle;
typedef struct TidStore TidStore;
typedef struct TidStoreIter TidStoreIter;
@@ -29,21 +29,21 @@ typedef struct TidStoreIterResult
int num_offsets;
} TidStoreIterResult;
-extern TidStore *tidstore_create(size_t max_bytes, int max_offset, dsa_area *dsa);
-extern TidStore *tidstore_attach(dsa_area *dsa, dsa_pointer handle);
-extern void tidstore_detach(TidStore *ts);
-extern void tidstore_destroy(TidStore *ts);
-extern void tidstore_reset(TidStore *ts);
-extern void tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
- int num_offsets);
-extern bool tidstore_lookup_tid(TidStore *ts, ItemPointer tid);
-extern TidStoreIter * tidstore_begin_iterate(TidStore *ts);
-extern TidStoreIterResult *tidstore_iterate_next(TidStoreIter *iter);
-extern void tidstore_end_iterate(TidStoreIter *iter);
-extern int64 tidstore_num_tids(TidStore *ts);
-extern bool tidstore_is_full(TidStore *ts);
-extern size_t tidstore_max_memory(TidStore *ts);
-extern size_t tidstore_memory_usage(TidStore *ts);
-extern tidstore_handle tidstore_get_handle(TidStore *ts);
+extern TidStore *TidStoreCreate(size_t max_bytes, int max_off, dsa_area *dsa);
+extern TidStore *TidStoreAttach(dsa_area *dsa, dsa_pointer handle);
+extern void TidStoreDetach(TidStore *ts);
+extern void TidStoreDestroy(TidStore *ts);
+extern void TidStoreReset(TidStore *ts);
+extern void TidStoreSetBlockOffsets(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets);
+extern bool TidStoreIsMember(TidStore *ts, ItemPointer tid);
+extern TidStoreIter * TidStoreBeginIterate(TidStore *ts);
+extern TidStoreIterResult *TidStoreIterateNext(TidStoreIter *iter);
+extern void TidStoreEndIterate(TidStoreIter *iter);
+extern int64 TidStoreNumTids(TidStore *ts);
+extern bool TidStoreIsFull(TidStore *ts);
+extern size_t TidStoreMaxMemory(TidStore *ts);
+extern size_t TidStoreMemoryUsage(TidStore *ts);
+extern TidStoreHandle TidStoreGetHandle(TidStore *ts);
#endif /* TIDSTORE_H */
diff --git a/src/test/modules/test_tidstore/test_tidstore.c b/src/test/modules/test_tidstore/test_tidstore.c
index 9a1217f833..12d3027624 100644
--- a/src/test/modules/test_tidstore/test_tidstore.c
+++ b/src/test/modules/test_tidstore/test_tidstore.c
@@ -37,10 +37,10 @@ check_tid(TidStore *ts, BlockNumber blkno, OffsetNumber off, bool expect)
ItemPointerSet(&tid, blkno, off);
- found = tidstore_lookup_tid(ts, &tid);
+ found = TidStoreIsMember(ts, &tid);
if (found != expect)
- elog(ERROR, "lookup TID (%u, %u) returned %d, expected %d",
+ elog(ERROR, "TidStoreIsMember for TID (%u, %u) returned %d, expected %d",
blkno, off, found, expect);
}
@@ -69,9 +69,9 @@ test_basic(int max_offset)
LWLockRegisterTranche(tranche_id, "test_tidstore");
dsa = dsa_create(tranche_id);
- ts = tidstore_create(TEST_TIDSTORE_MAX_BYTES, max_offset, dsa);
+ ts = TidStoreCreate(TEST_TIDSTORE_MAX_BYTES, max_offset, dsa);
#else
- ts = tidstore_create(TEST_TIDSTORE_MAX_BYTES, max_offset, NULL);
+ ts = TidStoreCreate(TEST_TIDSTORE_MAX_BYTES, max_offset, NULL);
#endif
/* prepare the offset array */
@@ -83,7 +83,7 @@ test_basic(int max_offset)
/* add tids */
for (int i = 0; i < TEST_TIDSTORE_NUM_BLOCKS; i++)
- tidstore_add_tids(ts, blks[i], offs, TEST_TIDSTORE_NUM_OFFSETS);
+ TidStoreSetBlockOffsets(ts, blks[i], offs, TEST_TIDSTORE_NUM_OFFSETS);
/* lookup test */
for (OffsetNumber off = FirstOffsetNumber ; off < max_offset; off++)
@@ -105,30 +105,30 @@ test_basic(int max_offset)
}
/* test the number of tids */
- if (tidstore_num_tids(ts) != (TEST_TIDSTORE_NUM_BLOCKS * TEST_TIDSTORE_NUM_OFFSETS))
- elog(ERROR, "tidstore_num_tids returned " UINT64_FORMAT ", expected %d",
- tidstore_num_tids(ts),
+ if (TidStoreNumTids(ts) != (TEST_TIDSTORE_NUM_BLOCKS * TEST_TIDSTORE_NUM_OFFSETS))
+ elog(ERROR, "TidStoreNumTids returned " UINT64_FORMAT ", expected %d",
+ TidStoreNumTids(ts),
TEST_TIDSTORE_NUM_BLOCKS * TEST_TIDSTORE_NUM_OFFSETS);
/* iteration test */
- iter = tidstore_begin_iterate(ts);
+ iter = TidStoreBeginIterate(ts);
blk_idx = 0;
- while ((iter_result = tidstore_iterate_next(iter)) != NULL)
+ while ((iter_result = TidStoreIterateNext(iter)) != NULL)
{
/* check the returned block number */
if (blks_sorted[blk_idx] != iter_result->blkno)
- elog(ERROR, "tidstore_iterate_next returned block number %u, expected %u",
+ elog(ERROR, "TidStoreIterateNext returned block number %u, expected %u",
iter_result->blkno, blks_sorted[blk_idx]);
/* check the returned offset numbers */
if (TEST_TIDSTORE_NUM_OFFSETS != iter_result->num_offsets)
- elog(ERROR, "tidstore_iterate_next returned %u offsets, expected %u",
+ elog(ERROR, "TidStoreIterateNext %u offsets, expected %u",
iter_result->num_offsets, TEST_TIDSTORE_NUM_OFFSETS);
for (int i = 0; i < iter_result->num_offsets; i++)
{
if (offs[i] != iter_result->offsets[i])
- elog(ERROR, "tidstore_iterate_next returned offset number %u on block %u, expected %u",
+ elog(ERROR, "TidStoreIterateNext offset number %u on block %u, expected %u",
iter_result->offsets[i], iter_result->blkno, offs[i]);
}
@@ -136,15 +136,15 @@ test_basic(int max_offset)
}
if (blk_idx != TEST_TIDSTORE_NUM_BLOCKS)
- elog(ERROR, "tidstore_iterate_next returned %d blocks, expected %d",
+ elog(ERROR, "TidStoreIterateNext returned %d blocks, expected %d",
blk_idx, TEST_TIDSTORE_NUM_BLOCKS);
/* remove all tids */
- tidstore_reset(ts);
+ TidStoreReset(ts);
/* test the number of tids */
- if (tidstore_num_tids(ts) != 0)
- elog(ERROR, "tidstore_num_tids on empty store returned non-zero");
+ if (TidStoreNumTids(ts) != 0)
+ elog(ERROR, "TidStoreNumTids on empty store returned non-zero");
/* lookup test for empty store */
for (OffsetNumber off = FirstOffsetNumber ; off < MaxHeapTuplesPerPage;
@@ -156,7 +156,7 @@ test_basic(int max_offset)
check_tid(ts, MaxBlockNumber, off, false);
}
- tidstore_destroy(ts);
+ TidStoreDestroy(ts);
#ifdef TEST_SHARED_TIDSTORE
dsa_detach(dsa);
@@ -177,36 +177,37 @@ test_empty(void)
LWLockRegisterTranche(tranche_id, "test_tidstore");
dsa = dsa_create(tranche_id);
- ts = tidstore_create(TEST_TIDSTORE_MAX_BYTES, MaxHeapTuplesPerPage, dsa);
+ ts = TidStoreCreate(TEST_TIDSTORE_MAX_BYTES, MaxHeapTuplesPerPage, dsa);
#else
- ts = tidstore_create(TEST_TIDSTORE_MAX_BYTES, MaxHeapTuplesPerPage, NULL);
+ ts = TidStoreCreate(TEST_TIDSTORE_MAX_BYTES, MaxHeapTuplesPerPage, NULL);
#endif
elog(NOTICE, "testing empty tidstore");
ItemPointerSet(&tid, 0, FirstOffsetNumber);
- if (tidstore_lookup_tid(ts, &tid))
- elog(ERROR, "tidstore_lookup_tid for (0,1) on empty store returned true");
+ if (TidStoreIsMember(ts, &tid))
+ elog(ERROR, "TidStoreIsMember for TID (%u,%u) on empty store returned true",
+ 0, FirstOffsetNumber);
ItemPointerSet(&tid, MaxBlockNumber, MaxOffsetNumber);
- if (tidstore_lookup_tid(ts, &tid))
- elog(ERROR, "tidstore_lookup_tid for (%u,%u) on empty store returned true",
+ if (TidStoreIsMember(ts, &tid))
+ elog(ERROR, "TidStoreIsMember for TID (%u,%u) on empty store returned true",
MaxBlockNumber, MaxOffsetNumber);
- if (tidstore_num_tids(ts) != 0)
- elog(ERROR, "tidstore_num_entries on empty store returned non-zero");
+ if (TidStoreNumTids(ts) != 0)
+ elog(ERROR, "TidStoreNumTids on empty store returned non-zero");
- if (tidstore_is_full(ts))
- elog(ERROR, "tidstore_is_full on empty store returned true");
+ if (TidStoreIsFull(ts))
+ elog(ERROR, "TidStoreIsFull on empty store returned true");
- iter = tidstore_begin_iterate(ts);
+ iter = TidStoreBeginIterate(ts);
- if (tidstore_iterate_next(iter) != NULL)
- elog(ERROR, "tidstore_iterate_next on empty store returned TIDs");
+ if (TidStoreIterateNext(iter) != NULL)
+ elog(ERROR, "TidStoreIterateNext on empty store returned TIDs");
- tidstore_end_iterate(iter);
+ TidStoreEndIterate(iter);
- tidstore_destroy(ts);
+ TidStoreDestroy(ts);
#ifdef TEST_SHARED_TIDSTORE
dsa_detach(dsa);
--
2.31.1
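For readers skimming the renamed API above, here is a minimal, self-contained sketch (not part of the patch) of how the CamelCase TidStore functions fit together, loosely mirroring the vacuum usage in this patch. The memory limit, block number, and offset values are made up purely for illustration.

#include "postgres.h"
#include "access/htup_details.h"	/* MaxHeapTuplesPerPage */
#include "access/tidstore.h"

static void
tidstore_usage_sketch(void)
{
	TidStore   *ts;
	TidStoreIter *iter;
	TidStoreIterResult *result;
	OffsetNumber offs[] = {1, 2, 5};	/* made-up dead offsets */
	ItemPointerData tid;

	/* Backend-local store; pass a dsa_area instead of NULL to share it */
	ts = TidStoreCreate(64 * 1024 * 1024, MaxHeapTuplesPerPage, NULL);

	/* Record the dead offsets collected for block 10 */
	TidStoreSetBlockOffsets(ts, 10, offs, lengthof(offs));

	/* Membership check, as vac_tid_reaped() does for each index tuple */
	ItemPointerSet(&tid, 10, 2);
	if (!TidStoreIsMember(ts, &tid))
		elog(ERROR, "expected TID (10,2) to be found");

	/* Iterate block by block, as the second heap pass does */
	iter = TidStoreBeginIterate(ts);
	while ((result = TidStoreIterateNext(iter)) != NULL)
		elog(DEBUG1, "block %u has %d dead offsets",
			 result->blkno, result->num_offsets);
	TidStoreEndIterate(iter);

	TidStoreDestroy(ts);
}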
Attachment: v32-0012-tidstore-Use-concept-of-off_upper-and-off_lower.patch (application/octet-stream)
From 3f38c7722deb260e5cc4ac003ab37cfe959b1954 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 17 Apr 2023 17:54:49 +0900
Subject: [PATCH v32 12/18] tidstore: Use concept of off_upper and off_lower.
The key is the block number combined with the upper bits of the offset
number, whereas the value is a bitmap over the lower bits of the offset
number. Function and variable names are updated accordingly.
---
src/backend/access/common/tidstore.c | 191 ++++++++++++++-------------
1 file changed, 99 insertions(+), 92 deletions(-)
diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
index 283a326d13..d9fe3d5f15 100644
--- a/src/backend/access/common/tidstore.c
+++ b/src/backend/access/common/tidstore.c
@@ -65,8 +65,10 @@
*
* The maximum height of the radix tree is 5 in this case.
*/
-#define TIDSTORE_VALUE_NBITS 6 /* log(64, 2) */
-#define TIDSTORE_OFFSET_MASK ((1 << TIDSTORE_VALUE_NBITS) - 1)
+typedef uint64 tidkey;
+typedef uint64 offsetbm;
+#define LOWER_OFFSET_NBITS 6 /* log(sizeof(offsetbm), 2) */
+#define LOWER_OFFSET_MASK ((1 << LOWER_OFFSET_NBITS) - 1)
/* A magic value used to identify our TidStores. */
#define TIDSTORE_MAGIC 0x826f6a10
@@ -75,7 +77,7 @@
#define RT_SCOPE static
#define RT_DECLARE
#define RT_DEFINE
-#define RT_VALUE_TYPE uint64
+#define RT_VALUE_TYPE tidkey
#include "lib/radixtree.h"
#define RT_PREFIX shared_rt
@@ -83,7 +85,7 @@
#define RT_SCOPE static
#define RT_DECLARE
#define RT_DEFINE
-#define RT_VALUE_TYPE uint64
+#define RT_VALUE_TYPE tidkey
#include "lib/radixtree.h"
/* The control object for a TidStore */
@@ -94,10 +96,10 @@ typedef struct TidStoreControl
/* These values are never changed after creation */
size_t max_bytes; /* the maximum bytes a TidStore can use */
- int max_offset; /* the maximum offset number */
- int offset_nbits; /* the number of bits required for an offset
- * number */
- int offset_key_nbits; /* the number of bits of an offset number
+ int max_off; /* the maximum offset number */
+ int max_off_nbits; /* the number of bits required for offset
+ * numbers */
+ int upper_off_nbits; /* the number of bits of offset numbers
* used in a key */
/* The below fields are used only in shared case */
@@ -147,17 +149,18 @@ typedef struct TidStoreIter
bool finished;
/* save for the next iteration */
- uint64 next_key;
- uint64 next_val;
+ tidkey next_tidkey;
+ offsetbm next_off_bitmap;
/* output for the caller */
TidStoreIterResult result;
} TidStoreIter;
-static void tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, uint64 val);
-static inline BlockNumber key_get_blkno(TidStore *ts, uint64 key);
-static inline uint64 encode_key_off(TidStore *ts, BlockNumber block, uint32 offset, uint64 *off_bit);
-static inline uint64 tid_to_key_off(TidStore *ts, ItemPointer tid, uint64 *off_bit);
+static void iter_decode_key_off(TidStoreIter *iter, tidkey key, offsetbm off_bitmap);
+static inline BlockNumber key_get_blkno(TidStore *ts, tidkey key);
+static inline tidkey encode_blk_off(TidStore *ts, BlockNumber block,
+ OffsetNumber offset, offsetbm *off_bit);
+static inline tidkey encode_tid(TidStore *ts, ItemPointer tid, offsetbm *off_bit);
/*
* Create a TidStore. The returned object is allocated in backend-local memory.
@@ -218,14 +221,14 @@ TidStoreCreate(size_t max_bytes, int max_off, dsa_area *area)
ts->control->max_bytes = max_bytes - (70 * 1024);
}
- ts->control->max_offset = max_offset;
- ts->control->offset_nbits = pg_ceil_log2_32(max_offset);
+ ts->control->max_off = max_off;
+ ts->control->max_off_nbits = pg_ceil_log2_32(max_off);
- if (ts->control->offset_nbits < TIDSTORE_VALUE_NBITS)
- ts->control->offset_nbits = TIDSTORE_VALUE_NBITS;
+ if (ts->control->max_off_nbits < LOWER_OFFSET_NBITS)
+ ts->control->max_off_nbits = LOWER_OFFSET_NBITS;
- ts->control->offset_key_nbits =
- ts->control->offset_nbits - TIDSTORE_VALUE_NBITS;
+ ts->control->upper_off_nbits =
+ ts->control->max_off_nbits - LOWER_OFFSET_NBITS;
return ts;
}
@@ -355,25 +358,25 @@ void
TidStoreSetBlockOffsets(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
int num_offsets)
{
- uint64 *values;
- uint64 key;
- uint64 prev_key;
- uint64 off_bitmap = 0;
+ offsetbm *bitmaps;
+ tidkey key;
+ tidkey prev_key;
+ offsetbm off_bitmap = 0;
int idx;
- const uint64 key_base = ((uint64) blkno) << ts->control->offset_key_nbits;
- const int nkeys = UINT64CONST(1) << ts->control->offset_key_nbits;
+ const tidkey key_base = ((uint64) blkno) << ts->control->upper_off_nbits;
+ const int nkeys = UINT64CONST(1) << ts->control->upper_off_nbits;
Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
- values = palloc(sizeof(uint64) * nkeys);
+ bitmaps = palloc(sizeof(offsetbm) * nkeys);
key = prev_key = key_base;
for (int i = 0; i < num_offsets; i++)
{
- uint64 off_bit;
+ offsetbm off_bit;
/* encode the tid to a key and partial offset */
- key = encode_key_off(ts, blkno, offsets[i], &off_bit);
+ key = encode_blk_off(ts, blkno, offsets[i], &off_bit);
/* make sure we scanned the line pointer array in order */
Assert(key >= prev_key);
@@ -384,11 +387,11 @@ TidStoreSetBlockOffsets(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
Assert(idx >= 0 && idx < nkeys);
/* write out offset bitmap for this key */
- values[idx] = off_bitmap;
+ bitmaps[idx] = off_bitmap;
/* zero out any gaps up to the current key */
for (int empty_idx = idx + 1; empty_idx < key - key_base; empty_idx++)
- values[empty_idx] = 0;
+ bitmaps[empty_idx] = 0;
/* reset for current key -- the current offset will be handled below */
off_bitmap = 0;
@@ -401,7 +404,7 @@ TidStoreSetBlockOffsets(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
/* save the final index for later */
idx = key - key_base;
/* write out last offset bitmap */
- values[idx] = off_bitmap;
+ bitmaps[idx] = off_bitmap;
if (TidStoreIsShared(ts))
LWLockAcquire(&ts->control->lock, LW_EXCLUSIVE);
@@ -409,14 +412,14 @@ TidStoreSetBlockOffsets(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
/* insert the calculated key-values to the tree */
for (int i = 0; i <= idx; i++)
{
- if (values[i])
+ if (bitmaps[i])
{
key = key_base + i;
if (TidStoreIsShared(ts))
- shared_rt_set(ts->tree.shared, key, &values[i]);
+ shared_rt_set(ts->tree.shared, key, &bitmaps[i]);
else
- local_rt_set(ts->tree.local, key, &values[i]);
+ local_rt_set(ts->tree.local, key, &bitmaps[i]);
}
}
@@ -426,29 +429,29 @@ TidStoreSetBlockOffsets(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
if (TidStoreIsShared(ts))
LWLockRelease(&ts->control->lock);
- pfree(values);
+ pfree(bitmaps);
}
/* Return true if the given tid is present in the TidStore */
bool
TidStoreIsMember(TidStore *ts, ItemPointer tid)
{
- uint64 key;
- uint64 val = 0;
- uint64 off_bit;
+ tidkey key;
+ offsetbm off_bitmap = 0;
+ offsetbm off_bit;
bool found;
- key = tid_to_key_off(ts, tid, &off_bit);
+ key = encode_tid(ts, tid, &off_bit);
if (TidStoreIsShared(ts))
- found = shared_rt_search(ts->tree.shared, key, &val);
+ found = shared_rt_search(ts->tree.shared, key, &off_bitmap);
else
- found = local_rt_search(ts->tree.local, key, &val);
+ found = local_rt_search(ts->tree.local, key, &off_bitmap);
if (!found)
return false;
- return (val & off_bit) != 0;
+ return (off_bitmap & off_bit) != 0;
}
/*
@@ -486,12 +489,12 @@ TidStoreBeginIterate(TidStore *ts)
}
static inline bool
-tidstore_iter_kv(TidStoreIter *iter, uint64 *key, uint64 *val)
+tidstore_iter(TidStoreIter *iter, tidkey *key, offsetbm *off_bitmap)
{
if (TidStoreIsShared(iter->ts))
- return shared_rt_iterate_next(iter->tree_iter.shared, key, val);
+ return shared_rt_iterate_next(iter->tree_iter.shared, key, off_bitmap);
- return local_rt_iterate_next(iter->tree_iter.local, key, val);
+ return local_rt_iterate_next(iter->tree_iter.local, key, off_bitmap);
}
/*
@@ -502,43 +505,46 @@ tidstore_iter_kv(TidStoreIter *iter, uint64 *key, uint64 *val)
TidStoreIterResult *
TidStoreIterateNext(TidStoreIter *iter)
{
- uint64 key;
- uint64 val;
- TidStoreIterResult *result = &(iter->result);
+ tidkey key;
+ offsetbm off_bitmap = 0;
+ TidStoreIterResult *output = &(iter->output);
if (iter->finished)
return NULL;
- if (BlockNumberIsValid(result->blkno))
- {
- /* Process the previously collected key-value */
- result->num_offsets = 0;
- tidstore_iter_extract_tids(iter, iter->next_key, iter->next_val);
- }
+ /* Initialize the outputs */
+ output->blkno = InvalidBlockNumber;
+ output->num_offsets = 0;
- while (tidstore_iter_kv(iter, &key, &val))
- {
- BlockNumber blkno;
+ /*
+ * Decode the key and offset bitmap that are collected in the previous
+ * time, if exists.
+ */
+ if (iter->next_off_bitmap > 0)
+ iter_decode_key_off(iter, iter->next_tidkey, iter->next_off_bitmap);
- blkno = key_get_blkno(iter->ts, key);
+ while (tidstore_iter(iter, &key, &off_bitmap))
+ {
+ BlockNumber blkno = key_get_blkno(iter->ts, key);
- if (BlockNumberIsValid(result->blkno) && result->blkno != blkno)
+ if (BlockNumberIsValid(output->blkno) && output->blkno != blkno)
{
/*
- * We got a key-value pair for a different block. So return the
- * collected tids, and remember the key-value for the next iteration.
+ * We got tids for a different block. We return the collected
+ * tids so far, and remember the key-value for the next
+ * iteration.
*/
- iter->next_key = key;
- iter->next_val = val;
- return result;
+ iter->next_tidkey = key;
+ iter->next_off_bitmap = off_bitmap;
+ return output;
}
- /* Collect tids extracted from the key-value pair */
- tidstore_iter_extract_tids(iter, key, val);
+ /* Collect tids decoded from the key and offset bitmap */
+ iter_decode_key_off(iter, key, off_bitmap);
}
iter->finished = true;
- return result;
+ return output;
}
/*
@@ -623,61 +629,62 @@ TidStoreGetHandle(TidStore *ts)
/* Extract tids from the given key-value pair */
static void
-tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, uint64 val)
+iter_decode_key_off(TidStoreIter *iter, tidkey key, offsetbm off_bitmap)
{
TidStoreIterResult *result = (&iter->result);
- while (val)
+ while (off_bitmap)
{
- uint64 tid_i;
+ uint64 compressed_tid;
OffsetNumber off;
- tid_i = key << TIDSTORE_VALUE_NBITS;
- tid_i |= pg_rightmost_one_pos64(val);
+ compressed_tid = key << LOWER_OFFSET_NBITS;
+ compressed_tid |= pg_rightmost_one_pos64(off_bitmap);
- off = tid_i & ((UINT64CONST(1) << iter->ts->control->offset_nbits) - 1);
+ off = compressed_tid & ((UINT64CONST(1) << iter->ts->control->max_off_nbits) - 1);
- Assert(result->num_offsets < iter->ts->control->max_offset);
- result->offsets[result->num_offsets++] = off;
+ Assert(output->num_offsets < iter->ts->control->max_off);
+ output->offsets[output->num_offsets++] = off;
/* unset the rightmost bit */
- val &= ~pg_rightmost_one64(val);
+ off_bitmap &= ~pg_rightmost_one64(off_bitmap);
}
- result->blkno = key_get_blkno(iter->ts, key);
+ output->blkno = key_get_blkno(iter->ts, key);
}
/* Get block number from the given key */
static inline BlockNumber
-key_get_blkno(TidStore *ts, uint64 key)
+key_get_blkno(TidStore *ts, tidkey key)
{
- return (BlockNumber) (key >> ts->control->offset_key_nbits);
+ return (BlockNumber) (key >> ts->control->upper_off_nbits);
}
-/* Encode a tid to key and offset */
-static inline uint64
-tid_to_key_off(TidStore *ts, ItemPointer tid, uint64 *off_bit)
+/* Encode a tid to key and partial offset */
+static inline tidkey
+encode_tid(TidStore *ts, ItemPointer tid, offsetbm *off_bit)
{
uint32 offset = ItemPointerGetOffsetNumber(tid);
BlockNumber block = ItemPointerGetBlockNumber(tid);
- return encode_key_off(ts, block, offset, off_bit);
+ return encode_blk_off(ts, block, offset, off_bit);
}
/* encode a block and offset to a key and partial offset */
-static inline uint64
-encode_key_off(TidStore *ts, BlockNumber block, uint32 offset, uint64 *off_bit)
+static inline tidkey
+encode_blk_off(TidStore *ts, BlockNumber block, OffsetNumber offset,
+ offsetbm *off_bit)
{
- uint64 key;
- uint64 tid_i;
+ tidkey key;
+ uint64 compressed_tid;
uint32 off_lower;
- off_lower = offset & TIDSTORE_OFFSET_MASK;
- Assert(off_lower < (sizeof(uint64) * BITS_PER_BYTE));
+ off_lower = offset & LOWER_OFFSET_MASK;
+ Assert(off_lower < (sizeof(offsetbm) * BITS_PER_BYTE));
*off_bit = UINT64CONST(1) << off_lower;
- tid_i = offset | ((uint64) block << ts->control->offset_nbits);
- key = tid_i >> TIDSTORE_VALUE_NBITS;
+ compressed_tid = offset | ((uint64) block << ts->control->max_off_nbits);
+ key = compressed_tid >> LOWER_OFFSET_NBITS;
return key;
}
--
2.31.1
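To make the off_upper/off_lower split in the patch above concrete, here is a standalone sketch of the encoding with the widths hard-coded for 8kB heap pages (MaxHeapTuplesPerPage = 291, so 9 bits of offset, of which the lower 6 select a bit in the 64-bit bitmap). In the patch itself these widths are derived at TidStoreCreate() time from max_off; the constants and numbers below are for illustration only.

#include <stdint.h>
#include <stdio.h>

#define LOWER_OFFSET_NBITS	6	/* one uint64 bitmap covers 64 offsets */
#define MAX_OFF_NBITS		9	/* pg_ceil_log2_32(291) for 8kB pages */
#define UPPER_OFF_NBITS		(MAX_OFF_NBITS - LOWER_OFFSET_NBITS)

/* Key carries the block number plus the upper offset bits; *off_bit selects the lower bits */
static uint64_t
encode_blk_off_sketch(uint32_t block, uint16_t offset, uint64_t *off_bit)
{
	uint64_t	compressed_tid = offset | ((uint64_t) block << MAX_OFF_NBITS);

	*off_bit = UINT64_C(1) << (offset & ((1 << LOWER_OFFSET_NBITS) - 1));
	return compressed_tid >> LOWER_OFFSET_NBITS;
}

int
main(void)
{
	uint64_t	bit;
	uint64_t	key = encode_blk_off_sketch(10, 130, &bit);

	/* (block 10, offset 130) -> key 82 = (10 << 3) | (130 >> 6), bit = 1 << (130 % 64) */
	printf("key=%llu block=%llu upper=%llu lower-bit=%d\n",
		   (unsigned long long) key,
		   (unsigned long long) (key >> UPPER_OFF_NBITS),
		   (unsigned long long) (key & ((1 << UPPER_OFF_NBITS) - 1)),
		   130 % 64);
	return 0;
}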
Attachment: v32-0009-radix-tree-Review-tree-iteration-code.patch (application/octet-stream)
From 989dd2cb442c1c2a6182bb5f7785c52f4d5cdb5e Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 17 Apr 2023 17:33:21 +0900
Subject: [PATCH v32 09/18] radix tree: Review tree iteration code
Clean up the routines and improve comments and variable names.
---
src/include/lib/radixtree.h | 152 ++++++++++++++------------
src/include/lib/radixtree_iter_impl.h | 85 +++++++-------
2 files changed, 118 insertions(+), 119 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 088d1dfd9d..8bea606c62 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -83,7 +83,7 @@
* RT_SET - Set a key-value pair
* RT_BEGIN_ITERATE - Begin iterating through all key-value pairs
* RT_ITERATE_NEXT - Return next key-value pair, if any
- * RT_END_ITER - End iteration
+ * RT_END_ITERATE - End iteration
* RT_MEMORY_USAGE - Get the memory usage
*
* Interface for Shared Memory
@@ -191,7 +191,7 @@
#define RT_NODE_INSERT_LEAF RT_MAKE_NAME(node_insert_leaf)
#define RT_NODE_INNER_ITERATE_NEXT RT_MAKE_NAME(node_inner_iterate_next)
#define RT_NODE_LEAF_ITERATE_NEXT RT_MAKE_NAME(node_leaf_iterate_next)
-#define RT_UPDATE_ITER_STACK RT_MAKE_NAME(update_iter_stack)
+#define RT_ITER_SET_NODE_FROM RT_MAKE_NAME(iter_set_node_from)
#define RT_ITER_UPDATE_KEY RT_MAKE_NAME(iter_update_key)
#define RT_VERIFY_NODE RT_MAKE_NAME(verify_node)
@@ -650,36 +650,40 @@ typedef struct RT_RADIX_TREE
* Iteration support.
*
* Iterating the radix tree returns each pair of key and value in the ascending
- * order of the key. To support this, the we iterate nodes of each level.
+ * order of the key.
*
- * RT_NODE_ITER struct is used to track the iteration within a node.
+ * RT_NODE_ITER is the struct for iteration of one radix tree node.
*
* RT_ITER is the struct for iteration of the radix tree, and uses RT_NODE_ITER
- * in order to track the iteration of each level. During iteration, we also
- * construct the key whenever updating the node iteration information, e.g., when
- * advancing the current index within the node or when moving to the next node
- * at the same level.
- *
- * XXX: Currently we allow only one process to do iteration. Therefore, rt_node_iter
- * has the local pointers to nodes, rather than RT_PTR_ALLOC.
- * We need either a safeguard to disallow other processes to begin the iteration
- * while one process is doing or to allow multiple processes to do the iteration.
+ * for each level to track the iteration within the node.
*/
typedef struct RT_NODE_ITER
{
- RT_PTR_LOCAL node; /* current node being iterated */
- int current_idx; /* current position. -1 for initial value */
+ /*
+ * Local pointer to the node we are iterating over.
+ *
+ * Since the radix tree doesn't support the shared iteration among multiple
+ * processes, we use RT_PTR_LOCAL rather than RT_PTR_ALLOC.
+ */
+ RT_PTR_LOCAL node;
+
+ /*
+ * The next index of the chunk array in RT_NODE_KIND_3 and
+ * RT_NODE_KIND_32 nodes, or the next chunk in RT_NODE_KIND_125 and
+ * RT_NODE_KIND_256 nodes. 0 for the initial value.
+ */
+ int idx;
} RT_NODE_ITER;
typedef struct RT_ITER
{
RT_RADIX_TREE *tree;
- /* Track the iteration on nodes of each level */
- RT_NODE_ITER stack[RT_MAX_LEVEL];
- int stack_len;
+ /* Track the nodes for each level. level = 0 is for a leaf node */
+ RT_NODE_ITER node_iters[RT_MAX_LEVEL];
+ int top_level;
- /* The key is constructed during iteration */
+ /* The key constructed during the iteration */
uint64 key;
} RT_ITER;
@@ -1804,16 +1808,9 @@ RT_DELETE(RT_RADIX_TREE *tree, uint64 key)
}
#endif
-static inline void
-RT_ITER_UPDATE_KEY(RT_ITER *iter, uint8 chunk, uint8 shift)
-{
- iter->key &= ~(((uint64) RT_CHUNK_MASK) << shift);
- iter->key |= (((uint64) chunk) << shift);
-}
-
/*
- * Advance the slot in the inner node. Return the child if exists, otherwise
- * null.
+ * Scan the inner node and return the next child node if exist, otherwise
+ * return NULL.
*/
static inline RT_PTR_LOCAL
RT_NODE_INNER_ITERATE_NEXT(RT_ITER *iter, RT_NODE_ITER *node_iter)
@@ -1824,8 +1821,8 @@ RT_NODE_INNER_ITERATE_NEXT(RT_ITER *iter, RT_NODE_ITER *node_iter)
}
/*
- * Advance the slot in the leaf node. On success, return true and the value
- * is set to value_p, otherwise return false.
+ * Scan the leaf node, and return true and the next value is set to value_p
+ * if exists. Otherwise return false.
*/
static inline bool
RT_NODE_LEAF_ITERATE_NEXT(RT_ITER *iter, RT_NODE_ITER *node_iter,
@@ -1837,29 +1834,50 @@ RT_NODE_LEAF_ITERATE_NEXT(RT_ITER *iter, RT_NODE_ITER *node_iter,
}
/*
- * Update each node_iter for inner nodes in the iterator node stack.
+ * While descending the radix tree from the 'from' node to the bottom, we
+ * set the next node to iterate for each level.
*/
static void
-RT_UPDATE_ITER_STACK(RT_ITER *iter, RT_PTR_LOCAL from_node, int from)
+RT_ITER_SET_NODE_FROM(RT_ITER *iter, RT_PTR_LOCAL from)
{
- int level = from;
- RT_PTR_LOCAL node = from_node;
+ int level = from->shift / RT_NODE_SPAN;
+ RT_PTR_LOCAL node = from;
for (;;)
{
- RT_NODE_ITER *node_iter = &(iter->stack[level--]);
+ RT_NODE_ITER *node_iter = &(iter->node_iters[level--]);
+
+#ifdef USE_ASSERT_CHECKING
+ if (node_iter->node)
+ {
+ /* We must have finished the iteration on the previous node */
+ if (RT_NODE_IS_LEAF(node_iter->node))
+ {
+ uint64 dummy;
+ Assert(!RT_NODE_LEAF_ITERATE_NEXT(iter, node_iter, &dummy));
+ }
+ else
+ Assert(!RT_NODE_INNER_ITERATE_NEXT(iter, node_iter));
+ }
+#endif
+ /* Set the node to the node iterator of this level */
node_iter->node = node;
- node_iter->current_idx = -1;
+ node_iter->idx = 0;
- /* We don't advance the leaf node iterator here */
if (RT_NODE_IS_LEAF(node))
- return;
+ {
+ /* We will visit the leaf node when RT_ITERATE_NEXT() */
+ break;
+ }
- /* Advance to the next slot in the inner node */
+ /*
+ * Get the first child node from the node, which corresponds to the
+ * lowest chunk within the node.
+ */
node = RT_NODE_INNER_ITERATE_NEXT(iter, node_iter);
- /* We must find the first children in the node */
+ /* The first child must be found */
Assert(node);
}
}
@@ -1873,14 +1891,11 @@ RT_UPDATE_ITER_STACK(RT_ITER *iter, RT_PTR_LOCAL from_node, int from)
RT_SCOPE RT_ITER *
RT_BEGIN_ITERATE(RT_RADIX_TREE *tree)
{
- MemoryContext old_ctx;
RT_ITER *iter;
RT_PTR_LOCAL root;
- int top_level;
- old_ctx = MemoryContextSwitchTo(tree->context);
-
- iter = (RT_ITER *) palloc0(sizeof(RT_ITER));
+ iter = (RT_ITER *) MemoryContextAllocZero(tree->context,
+ sizeof(RT_ITER));
iter->tree = tree;
RT_LOCK_SHARED(tree);
@@ -1890,16 +1905,13 @@ RT_BEGIN_ITERATE(RT_RADIX_TREE *tree)
return iter;
root = RT_PTR_GET_LOCAL(tree, iter->tree->ctl->root);
- top_level = root->shift / RT_NODE_SPAN;
- iter->stack_len = top_level;
+ iter->top_level = root->shift / RT_NODE_SPAN;
/*
- * Descend to the left most leaf node from the root. The key is being
- * constructed while descending to the leaf.
+ * Set the next node to iterate for each level from the level of the
+ * root node.
*/
- RT_UPDATE_ITER_STACK(iter, root, top_level);
-
- MemoryContextSwitchTo(old_ctx);
+ RT_ITER_SET_NODE_FROM(iter, root);
return iter;
}
@@ -1911,6 +1923,8 @@ RT_BEGIN_ITERATE(RT_RADIX_TREE *tree)
RT_SCOPE bool
RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, RT_VALUE_TYPE *value_p)
{
+ Assert(value_p != NULL);
+
/* Empty tree */
if (!iter->tree->ctl->root)
return false;
@@ -1918,43 +1932,38 @@ RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, RT_VALUE_TYPE *value_p)
for (;;)
{
RT_PTR_LOCAL child = NULL;
- RT_VALUE_TYPE value;
- int level;
- bool found;
-
- /* Advance the leaf node iterator to get next key-value pair */
- found = RT_NODE_LEAF_ITERATE_NEXT(iter, &(iter->stack[0]), &value);
- if (found)
+ /* Get the next chunk of the leaf node */
+ if (RT_NODE_LEAF_ITERATE_NEXT(iter, &(iter->node_iters[0]), value_p))
{
*key_p = iter->key;
- *value_p = value;
return true;
}
/*
- * We've visited all values in the leaf node, so advance inner node
- * iterators from the level=1 until we find the next child node.
+ * We've visited all values in the leaf node, so advance all inner node
+ * iterators by visiting inner nodes from the level = 1 until we find the
+ * next inner node that has a child node.
*/
- for (level = 1; level <= iter->stack_len; level++)
+ for (int level = 1; level <= iter->top_level; level++)
{
- child = RT_NODE_INNER_ITERATE_NEXT(iter, &(iter->stack[level]));
+ child = RT_NODE_INNER_ITERATE_NEXT(iter, &(iter->node_iters[level]));
if (child)
break;
}
- /* the iteration finished */
+ /* We've visited all nodes, so the iteration finished */
if (!child)
- return false;
+ break;
/*
- * Set the node to the node iterator and update the iterator stack
- * from this node.
+ * Found the new child node. We update the next node to iterate for each
+ * level from the level of this child node.
*/
- RT_UPDATE_ITER_STACK(iter, child, level - 1);
+ RT_ITER_SET_NODE_FROM(iter, child);
- /* Node iterators are updated, so try again from the leaf */
+ /* Find key-value from the leaf node again */
}
return false;
@@ -2508,8 +2517,7 @@ RT_DUMP(RT_RADIX_TREE *tree)
#undef RT_NODE_INSERT_LEAF
#undef RT_NODE_INNER_ITERATE_NEXT
#undef RT_NODE_LEAF_ITERATE_NEXT
-#undef RT_UPDATE_ITER_STACK
-#undef RT_ITER_UPDATE_KEY
+#undef RT_RT_ITER_SET_NODE_FROM
#undef RT_VERIFY_NODE
#undef RT_DEBUG
diff --git a/src/include/lib/radixtree_iter_impl.h b/src/include/lib/radixtree_iter_impl.h
index 98c78eb237..5c1034768e 100644
--- a/src/include/lib/radixtree_iter_impl.h
+++ b/src/include/lib/radixtree_iter_impl.h
@@ -27,12 +27,10 @@
#error node level must be either inner or leaf
#endif
- bool found = false;
- uint8 key_chunk;
+ uint8 key_chunk = 0;
#ifdef RT_NODE_LEVEL_LEAF
- RT_VALUE_TYPE value;
-
+ Assert(value_p != NULL);
Assert(RT_NODE_IS_LEAF(node_iter->node));
#else
RT_PTR_LOCAL child = NULL;
@@ -50,99 +48,92 @@
{
RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node_iter->node;
- node_iter->current_idx++;
- if (node_iter->current_idx >= n3->base.n.count)
- break;
+ if (node_iter->idx >= n3->base.n.count)
+ return false;
+
#ifdef RT_NODE_LEVEL_LEAF
- value = n3->values[node_iter->current_idx];
+ *value_p = n3->values[node_iter->idx];
#else
- child = RT_PTR_GET_LOCAL(iter->tree, n3->children[node_iter->current_idx]);
+ child = RT_PTR_GET_LOCAL(iter->tree, n3->children[node_iter->idx]);
#endif
- key_chunk = n3->base.chunks[node_iter->current_idx];
- found = true;
+ key_chunk = n3->base.chunks[node_iter->idx];
+ node_iter->idx++;
break;
}
case RT_NODE_KIND_32:
{
RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node_iter->node;
- node_iter->current_idx++;
- if (node_iter->current_idx >= n32->base.n.count)
- break;
+ if (node_iter->idx >= n32->base.n.count)
+ return false;
#ifdef RT_NODE_LEVEL_LEAF
- value = n32->values[node_iter->current_idx];
+ *value_p = n32->values[node_iter->idx];
#else
- child = RT_PTR_GET_LOCAL(iter->tree, n32->children[node_iter->current_idx]);
+ child = RT_PTR_GET_LOCAL(iter->tree, n32->children[node_iter->idx]);
#endif
- key_chunk = n32->base.chunks[node_iter->current_idx];
- found = true;
+ key_chunk = n32->base.chunks[node_iter->idx];
+ node_iter->idx++;
break;
}
case RT_NODE_KIND_125:
{
RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node_iter->node;
- int i;
+ int chunk;
- for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ for (chunk = node_iter->idx; chunk < RT_NODE_MAX_SLOTS; chunk++)
{
- if (RT_NODE_125_IS_CHUNK_USED((RT_NODE_BASE_125 *) n125, i))
+ if (RT_NODE_125_IS_CHUNK_USED((RT_NODE_BASE_125 *) n125, chunk))
break;
}
- if (i >= RT_NODE_MAX_SLOTS)
- break;
+ if (chunk >= RT_NODE_MAX_SLOTS)
+ return false;
- node_iter->current_idx = i;
#ifdef RT_NODE_LEVEL_LEAF
- value = RT_NODE_LEAF_125_GET_VALUE(n125, i);
+ *value_p = RT_NODE_LEAF_125_GET_VALUE(n125, chunk);
#else
- child = RT_PTR_GET_LOCAL(iter->tree, RT_NODE_INNER_125_GET_CHILD(n125, i));
+ child = RT_PTR_GET_LOCAL(iter->tree, RT_NODE_INNER_125_GET_CHILD(n125, chunk));
#endif
- key_chunk = i;
- found = true;
+ key_chunk = chunk;
+ node_iter->idx = chunk + 1;
break;
}
case RT_NODE_KIND_256:
{
RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node_iter->node;
- int i;
+ int chunk;
- for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ for (chunk = node_iter->idx; chunk < RT_NODE_MAX_SLOTS; chunk++)
{
#ifdef RT_NODE_LEVEL_LEAF
- if (RT_NODE_LEAF_256_IS_CHUNK_USED(n256, i))
+ if (RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk))
#else
- if (RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
+ if (RT_NODE_INNER_256_IS_CHUNK_USED(n256, chunk))
#endif
break;
}
- if (i >= RT_NODE_MAX_SLOTS)
- break;
+ if (chunk >= RT_NODE_MAX_SLOTS)
+ return false;
- node_iter->current_idx = i;
#ifdef RT_NODE_LEVEL_LEAF
- value = RT_NODE_LEAF_256_GET_VALUE(n256, i);
+ *value_p = RT_NODE_LEAF_256_GET_VALUE(n256, chunk);
#else
- child = RT_PTR_GET_LOCAL(iter->tree, RT_NODE_INNER_256_GET_CHILD(n256, i));
+ child = RT_PTR_GET_LOCAL(iter->tree, RT_NODE_INNER_256_GET_CHILD(n256, chunk));
#endif
- key_chunk = i;
- found = true;
+ key_chunk = chunk;
+ node_iter->idx = chunk + 1;
break;
}
}
- if (found)
- {
- RT_ITER_UPDATE_KEY(iter, key_chunk, node_iter->node->shift);
-#ifdef RT_NODE_LEVEL_LEAF
- *value_p = value;
-#endif
- }
+ /* Update the part of the key */
+ iter->key &= ~(((uint64) RT_CHUNK_MASK) << node_iter->node->shift);
+ iter->key |= (((uint64) key_chunk) << node_iter->node->shift);
#ifdef RT_NODE_LEVEL_LEAF
- return found;
+ return true;
#else
return child;
#endif
--
2.31.1
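As a quick caller-side reference for the reworked iteration above, here is a minimal sketch for a tree instantiated with RT_PREFIX = local_rt and RT_VALUE_TYPE = uint64, as tidstore.c does. The type names local_rt_radix_tree and local_rt_iter are assumptions based on the RT_PREFIX naming pattern; the tree must not be modified concurrently while the iterator is open.

static void
dump_all_entries(local_rt_radix_tree *tree)
{
	local_rt_iter *iter;		/* name assumed from the RT_PREFIX pattern */
	uint64		key;
	uint64		value;

	/* Begin iteration; keys come back in ascending order */
	iter = local_rt_begin_iterate(tree);

	while (local_rt_iterate_next(iter, &key, &value))
		elog(DEBUG1, "key " UINT64_FORMAT " value " UINT64_FORMAT,
			 key, value);

	/* Release the iterator (and whatever begin_iterate acquired) */
	local_rt_end_iterate(iter);
}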
Attachment: v32-0013-tidstore-Embed-output-offsets-in-TidStoreIterRes.patch (application/octet-stream)
From 453dc7fd8078ba202569417d4ed65ce6e7f4a850 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 17 Apr 2023 18:00:44 +0900
Subject: [PATCH v32 13/18] tidstore: Embed output offsets in
TidStoreIterResult.
---
src/backend/access/common/tidstore.c | 7 ++-----
src/include/access/tidstore.h | 3 ++-
2 files changed, 4 insertions(+), 6 deletions(-)
diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
index d9fe3d5f15..15b77b5bcb 100644
--- a/src/backend/access/common/tidstore.c
+++ b/src/backend/access/common/tidstore.c
@@ -470,12 +470,10 @@ TidStoreBeginIterate(TidStore *ts)
Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
- iter = palloc0(sizeof(TidStoreIter));
+ iter = palloc0(sizeof(TidStoreIter) +
+ sizeof(OffsetNumber) * ts->control->max_off);
iter->ts = ts;
- iter->result.blkno = InvalidBlockNumber;
- iter->result.offsets = palloc(sizeof(OffsetNumber) * ts->control->max_offset);
-
if (TidStoreIsShared(ts))
iter->tree_iter.shared = shared_rt_begin_iterate(ts->tree.shared);
else
@@ -559,7 +557,6 @@ TidStoreEndIterate(TidStoreIter *iter)
else
local_rt_end_iterate(iter->tree_iter.local);
- pfree(iter->result.offsets);
pfree(iter);
}
diff --git a/src/include/access/tidstore.h b/src/include/access/tidstore.h
index f0a432d0da..66f0fdd482 100644
--- a/src/include/access/tidstore.h
+++ b/src/include/access/tidstore.h
@@ -22,11 +22,12 @@ typedef dsa_pointer TidStoreHandle;
typedef struct TidStore TidStore;
typedef struct TidStoreIter TidStoreIter;
+/* Result struct for TidStoreIterateNext */
typedef struct TidStoreIterResult
{
BlockNumber blkno;
- OffsetNumber *offsets;
int num_offsets;
+ OffsetNumber offsets[FLEXIBLE_ARRAY_MEMBER];
} TidStoreIterResult;
extern TidStore *TidStoreCreate(size_t max_bytes, int max_off, dsa_area *dsa);
--
2.31.1
Attachment: v32-0008-radix-tree-remove-resolved-TODO.patch (application/octet-stream)
From 84bad553eecc97bbc3d7ccacc90723ae22b7888f Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 17 Apr 2023 17:29:32 +0900
Subject: [PATCH v32 08/18] radix tree: remove resolved TODO
---
src/include/lib/radixtree.h | 1 -
1 file changed, 1 deletion(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index c277d5a484..088d1dfd9d 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -612,7 +612,6 @@ static const RT_SIZE_CLASS_ELEM RT_SIZE_CLASS_INFO[] = {
#endif
/* Contains the actual tree and ancillary info */
-// WIP: this name is a bit strange
typedef struct RT_RADIX_TREE_CONTROL
{
#ifdef RT_SHMEM
--
2.31.1
Attachment: v32-0007-radix-tree-rename-RT_EXTEND-and-RT_SET_EXTEND-to.patch (application/octet-stream)
From e25dc39fd502ae5c6c1c44a798a24dc5c6a1c7b0 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 17 Apr 2023 17:26:52 +0900
Subject: [PATCH v32 07/18] radix tree: rename RT_EXTEND and RT_SET_EXTEND to
RT_EXTEND_UP/DOWN
---
src/include/lib/radixtree.h | 16 ++++++++--------
1 file changed, 8 insertions(+), 8 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index e546bd705c..c277d5a484 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -152,8 +152,8 @@
#define RT_INIT_NODE RT_MAKE_NAME(init_node)
#define RT_FREE_NODE RT_MAKE_NAME(free_node)
#define RT_FREE_RECURSE RT_MAKE_NAME(free_recurse)
-#define RT_EXTEND RT_MAKE_NAME(extend)
-#define RT_SET_EXTEND RT_MAKE_NAME(set_extend)
+#define RT_EXTEND_UP RT_MAKE_NAME(extend_up)
+#define RT_EXTEND_DOWN RT_MAKE_NAME(extend_down)
#define RT_SWITCH_NODE_KIND RT_MAKE_NAME(grow_node_kind)
#define RT_COPY_NODE RT_MAKE_NAME(copy_node)
#define RT_REPLACE_NODE RT_MAKE_NAME(replace_node)
@@ -1243,7 +1243,7 @@ RT_REPLACE_NODE(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent,
* it can store the key.
*/
static pg_noinline void
-RT_EXTEND(RT_RADIX_TREE *tree, uint64 key)
+RT_EXTEND_UP(RT_RADIX_TREE *tree, uint64 key)
{
int target_shift;
RT_PTR_LOCAL root = RT_PTR_GET_LOCAL(tree, tree->ctl->root);
@@ -1282,7 +1282,7 @@ RT_EXTEND(RT_RADIX_TREE *tree, uint64 key)
* Insert inner and leaf nodes from 'node' to bottom.
*/
static pg_noinline void
-RT_SET_EXTEND(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p, RT_PTR_LOCAL parent,
+RT_EXTEND_DOWN(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p, RT_PTR_LOCAL parent,
RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node)
{
int shift = node->shift;
@@ -1613,7 +1613,7 @@ RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p)
/* Extend the tree if necessary */
if (key > tree->ctl->max_val)
- RT_EXTEND(tree, key);
+ RT_EXTEND_UP(tree, key);
stored_child = tree->ctl->root;
parent = RT_PTR_GET_LOCAL(tree, stored_child);
@@ -1631,7 +1631,7 @@ RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p)
if (!RT_NODE_SEARCH_INNER(child, key, &new_child))
{
- RT_SET_EXTEND(tree, key, value_p, parent, stored_child, child);
+ RT_EXTEND_DOWN(tree, key, value_p, parent, stored_child, child);
RT_UNLOCK(tree);
return false;
}
@@ -2470,8 +2470,8 @@ RT_DUMP(RT_RADIX_TREE *tree)
#undef RT_INIT_NODE
#undef RT_FREE_NODE
#undef RT_FREE_RECURSE
-#undef RT_EXTEND
-#undef RT_SET_EXTEND
+#undef RT_EXTEND_UP
+#undef RT_EXTEND_DOWN
#undef RT_SWITCH_NODE_KIND
#undef RT_COPY_NODE
#undef RT_REPLACE_NODE
--
2.31.1
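For readers following the rename: RT_EXTEND_UP grows the tree upward by stacking new root nodes until the key fits under the root, while RT_EXTEND_DOWN fills in the chain of inner nodes below an existing node down to the leaf that will hold the value. Here is a rough sketch of the upward-growth condition only (a toy model that consumes 8 bits of key per level; max_val_for_shift and extend_up are illustrative names, not the template's actual code):

#include <stdint.h>

#define RT_SPAN 8               /* key bits consumed per tree level */

/* Highest key a tree whose root covers 'root_shift' bits can address. */
static uint64_t
max_val_for_shift(int root_shift)
{
    if (root_shift + RT_SPAN >= 64)
        return UINT64_MAX;
    return (UINT64_C(1) << (root_shift + RT_SPAN)) - 1;
}

/* Grow the toy tree upward until 'key' fits under its root. */
static int
extend_up(int root_shift, uint64_t key)
{
    while (key > max_val_for_shift(root_shift))
        root_shift += RT_SPAN;  /* conceptually: push a new root on top */
    return root_shift;
}

int
main(void)
{
    /* a tree covering one level (keys 0..255) must grow once for key 300 */
    return extend_up(0, 300) == RT_SPAN ? 0 : 1;
}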
Attachment: v32-0006-Use-TIDStore-for-storing-dead-tuple-TID-during-l.patch (application/octet-stream)
From 1f6c4aa27d734b8c81369541481b0d3abd0d5dec Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 17 Apr 2023 17:22:03 +0900
Subject: [PATCH v32 06/18] Use TIDStore for storing dead tuple TID during lazy
vacuum
Previously, we used an array of ItemPointerData to store dead tuple
TIDs, which was not space-efficient and was slow to look up. Also, it
had a 1GB limit on its size.
Now we use TIDStore to store dead tuple TIDs. Since the TIDStore,
backed by the radix tree, incrementally allocates memory, we get rid
of the 1GB limit.
Since we are no longer able to exactly estimate the maximum number of
TIDs that can be stored, pg_stat_progress_vacuum now shows the
progress information based on the amount of memory in bytes. The
column names are also changed to max_dead_tuple_bytes and
num_dead_tuple_bytes.
In addition, since the TIDStore uses the radix tree internally, the
minimum amount of memory required by TIDStore is 1MB, the initial DSA
segment size. Due to that, we increase the minimum value of
maintenance_work_mem (and autovacuum_work_mem) from 1MB to 2MB.
XXX: needs to bump catalog version
---
doc/src/sgml/monitoring.sgml | 8 +-
src/backend/access/heap/vacuumlazy.c | 278 ++++++++-------------
src/backend/catalog/system_views.sql | 2 +-
src/backend/commands/vacuum.c | 78 +-----
src/backend/commands/vacuumparallel.c | 73 +++---
src/backend/postmaster/autovacuum.c | 6 +-
src/backend/storage/lmgr/lwlock.c | 2 +
src/backend/utils/misc/guc_tables.c | 2 +-
src/include/commands/progress.h | 4 +-
src/include/commands/vacuum.h | 25 +-
src/include/storage/lwlock.h | 1 +
src/test/regress/expected/cluster.out | 2 +-
src/test/regress/expected/create_index.out | 2 +-
src/test/regress/expected/rules.out | 4 +-
src/test/regress/sql/cluster.sql | 2 +-
src/test/regress/sql/create_index.sql | 2 +-
16 files changed, 177 insertions(+), 314 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index be4448fe6e..9b64614beb 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -7320,10 +7320,10 @@ FROM pg_stat_get_backend_idset() AS backendid;
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>max_dead_tuples</structfield> <type>bigint</type>
+ <structfield>max_dead_tuple_bytes</structfield> <type>bigint</type>
</para>
<para>
- Number of dead tuples that we can store before needing to perform
+ Amount of dead tuple data that we can store before needing to perform
an index vacuum cycle, based on
<xref linkend="guc-maintenance-work-mem"/>.
</para></entry>
@@ -7331,10 +7331,10 @@ FROM pg_stat_get_backend_idset() AS backendid;
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>num_dead_tuples</structfield> <type>bigint</type>
+ <structfield>num_dead_tuple_bytes</structfield> <type>bigint</type>
</para>
<para>
- Number of dead tuples collected since the last index vacuum cycle.
+ Amount of dead tuple data collected since the last index vacuum cycle.
</para></entry>
</row>
</tbody>
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 0a9ebd22bd..2c72088e69 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -3,18 +3,18 @@
* vacuumlazy.c
* Concurrent ("lazy") vacuuming.
*
- * The major space usage for vacuuming is storage for the array of dead TIDs
+ * The major space usage for vacuuming is TidStore, a storage for dead TIDs
* that are to be removed from indexes. We want to ensure we can vacuum even
* the very largest relations with finite memory space usage. To do that, we
- * set upper bounds on the number of TIDs we can keep track of at once.
+ * set upper bounds on the maximum memory that can be used for keeping track
+ * of dead TIDs at once.
*
* We are willing to use at most maintenance_work_mem (or perhaps
* autovacuum_work_mem) memory space to keep track of dead TIDs. We initially
- * allocate an array of TIDs of that size, with an upper limit that depends on
- * table size (this limit ensures we don't allocate a huge area uselessly for
- * vacuuming small tables). If the array threatens to overflow, we must call
- * lazy_vacuum to vacuum indexes (and to vacuum the pages that we've pruned).
- * This frees up the memory space dedicated to storing dead TIDs.
+ * create a TidStore with the maximum bytes that can be used by the TidStore.
+ * If the TidStore is full, we must call lazy_vacuum to vacuum indexes (and to
+ * vacuum the pages that we've pruned). This frees up the memory space dedicated
+ * to storing dead TIDs.
*
* In practice VACUUM will often complete its initial pass over the target
* heap relation without ever running out of space to store TIDs. This means
@@ -40,6 +40,7 @@
#include "access/heapam_xlog.h"
#include "access/htup_details.h"
#include "access/multixact.h"
+#include "access/tidstore.h"
#include "access/transam.h"
#include "access/visibilitymap.h"
#include "access/xact.h"
@@ -186,7 +187,7 @@ typedef struct LVRelState
* lazy_vacuum_heap_rel, which marks the same LP_DEAD line pointers as
* LP_UNUSED during second heap pass.
*/
- VacDeadItems *dead_items; /* TIDs whose index tuples we'll delete */
+ TidStore *dead_items; /* TIDs whose index tuples we'll delete */
BlockNumber rel_pages; /* total number of pages */
BlockNumber scanned_pages; /* # pages examined (not skipped via VM) */
BlockNumber removed_pages; /* # pages removed by relation truncation */
@@ -218,11 +219,14 @@ typedef struct LVRelState
typedef struct LVPagePruneState
{
bool hastup; /* Page prevents rel truncation? */
- bool has_lpdead_items; /* includes existing LP_DEAD items */
+
+ /* collected offsets of LP_DEAD items including existing ones */
+ OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
+ int num_offsets;
/*
* State describes the proper VM bit states to set for the page following
- * pruning and freezing. all_visible implies !has_lpdead_items, but don't
+ * pruning and freezing. all_visible implies num_offsets == 0, but don't
* trust all_frozen result unless all_visible is also set to true.
*/
bool all_visible; /* Every item visible to all? */
@@ -257,8 +261,9 @@ static bool lazy_scan_noprune(LVRelState *vacrel, Buffer buf,
static void lazy_vacuum(LVRelState *vacrel);
static bool lazy_vacuum_all_indexes(LVRelState *vacrel);
static void lazy_vacuum_heap_rel(LVRelState *vacrel);
-static int lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
- Buffer buffer, int index, Buffer vmbuffer);
+static void lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
+ OffsetNumber *offsets, int num_offsets,
+ Buffer buffer, Buffer vmbuffer);
static bool lazy_check_wraparound_failsafe(LVRelState *vacrel);
static void lazy_cleanup_all_indexes(LVRelState *vacrel);
static IndexBulkDeleteResult *lazy_vacuum_one_index(Relation indrel,
@@ -485,11 +490,11 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
}
/*
- * Allocate dead_items array memory using dead_items_alloc. This handles
- * parallel VACUUM initialization as part of allocating shared memory
- * space used for dead_items. (But do a failsafe precheck first, to
- * ensure that parallel VACUUM won't be attempted at all when relfrozenxid
- * is already dangerously old.)
+ * Allocate dead_items memory using dead_items_alloc. This handles parallel
+ * VACUUM initialization as part of allocating shared memory space used for
+ * dead_items. (But do a failsafe precheck first, to ensure that parallel
+ * VACUUM won't be attempted at all when relfrozenxid is already dangerously
+ * old.)
*/
lazy_check_wraparound_failsafe(vacrel);
dead_items_alloc(vacrel, params->nworkers);
@@ -795,7 +800,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
* have collected the TIDs whose index tuples need to be removed.
*
* Finally, invokes lazy_vacuum_heap_rel to vacuum heap pages, which
- * largely consists of marking LP_DEAD items (from collected TID array)
+ * largely consists of marking LP_DEAD items (from vacrel->dead_items)
* as LP_UNUSED. This has to happen in a second, final pass over the
* heap, to preserve a basic invariant that all index AMs rely on: no
* extant index tuple can ever be allowed to contain a TID that points to
@@ -823,21 +828,21 @@ lazy_scan_heap(LVRelState *vacrel)
blkno,
next_unskippable_block,
next_fsm_block_to_vacuum = 0;
- VacDeadItems *dead_items = vacrel->dead_items;
+ TidStore *dead_items = vacrel->dead_items;
Buffer vmbuffer = InvalidBuffer;
bool next_unskippable_allvis,
skipping_current_range;
const int initprog_index[] = {
PROGRESS_VACUUM_PHASE,
PROGRESS_VACUUM_TOTAL_HEAP_BLKS,
- PROGRESS_VACUUM_MAX_DEAD_TUPLES
+ PROGRESS_VACUUM_MAX_DEAD_TUPLE_BYTES
};
int64 initprog_val[3];
/* Report that we're scanning the heap, advertising total # of blocks */
initprog_val[0] = PROGRESS_VACUUM_PHASE_SCAN_HEAP;
initprog_val[1] = rel_pages;
- initprog_val[2] = dead_items->max_items;
+ initprog_val[2] = tidstore_max_memory(vacrel->dead_items);
pgstat_progress_update_multi_param(3, initprog_index, initprog_val);
/* Set up an initial range of skippable blocks using the visibility map */
@@ -904,8 +909,7 @@ lazy_scan_heap(LVRelState *vacrel)
* dead_items TIDs, pause and do a cycle of vacuuming before we tackle
* this page.
*/
- Assert(dead_items->max_items >= MaxHeapTuplesPerPage);
- if (dead_items->max_items - dead_items->num_items < MaxHeapTuplesPerPage)
+ if (tidstore_is_full(vacrel->dead_items))
{
/*
* Before beginning index vacuuming, we release any pin we may
@@ -967,7 +971,7 @@ lazy_scan_heap(LVRelState *vacrel)
continue;
}
- /* Collect LP_DEAD items in dead_items array, count tuples */
+ /* Collect LP_DEAD items in dead_items, count tuples */
if (lazy_scan_noprune(vacrel, buf, blkno, page, &hastup,
&recordfreespace))
{
@@ -1009,14 +1013,14 @@ lazy_scan_heap(LVRelState *vacrel)
* Prune, freeze, and count tuples.
*
* Accumulates details of remaining LP_DEAD line pointers on page in
- * dead_items array. This includes LP_DEAD line pointers that we
- * pruned ourselves, as well as existing LP_DEAD line pointers that
- * were pruned some time earlier. Also considers freezing XIDs in the
- * tuple headers of remaining items with storage.
+ * dead_items. This includes LP_DEAD line pointers that we pruned
+ * ourselves, as well as existing LP_DEAD line pointers that were pruned
+ * some time earlier. Also considers freezing XIDs in the tuple headers
+ * of remaining items with storage.
*/
lazy_scan_prune(vacrel, buf, blkno, page, &prunestate);
- Assert(!prunestate.all_visible || !prunestate.has_lpdead_items);
+ Assert(!prunestate.all_visible || (prunestate.num_offsets == 0));
/* Remember the location of the last page with nonremovable tuples */
if (prunestate.hastup)
@@ -1032,14 +1036,12 @@ lazy_scan_heap(LVRelState *vacrel)
* performed here can be thought of as the one-pass equivalent of
* a call to lazy_vacuum().
*/
- if (prunestate.has_lpdead_items)
+ if (prunestate.num_offsets > 0)
{
Size freespace;
- lazy_vacuum_heap_page(vacrel, blkno, buf, 0, vmbuffer);
-
- /* Forget the LP_DEAD items that we just vacuumed */
- dead_items->num_items = 0;
+ lazy_vacuum_heap_page(vacrel, blkno, prunestate.deadoffsets,
+ prunestate.num_offsets, buf, vmbuffer);
/*
* Periodically perform FSM vacuuming to make newly-freed
@@ -1076,7 +1078,16 @@ lazy_scan_heap(LVRelState *vacrel)
* with prunestate-driven visibility map and FSM steps (just like
* the two-pass strategy).
*/
- Assert(dead_items->num_items == 0);
+ Assert(tidstore_num_tids(dead_items) == 0);
+ }
+ else if (prunestate.num_offsets > 0)
+ {
+ /* Save details of the LP_DEAD items from the page in dead_items */
+ tidstore_add_tids(dead_items, blkno, prunestate.deadoffsets,
+ prunestate.num_offsets);
+
+ pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
+ tidstore_memory_usage(dead_items));
}
/*
@@ -1143,7 +1154,7 @@ lazy_scan_heap(LVRelState *vacrel)
* There should never be LP_DEAD items on a page with PD_ALL_VISIBLE
* set, however.
*/
- else if (prunestate.has_lpdead_items && PageIsAllVisible(page))
+ else if ((prunestate.num_offsets > 0) && PageIsAllVisible(page))
{
elog(WARNING, "page containing LP_DEAD items is marked as all-visible in relation \"%s\" page %u",
vacrel->relname, blkno);
@@ -1191,7 +1202,7 @@ lazy_scan_heap(LVRelState *vacrel)
* Final steps for block: drop cleanup lock, record free space in the
* FSM
*/
- if (prunestate.has_lpdead_items && vacrel->do_index_vacuuming)
+ if ((prunestate.num_offsets > 0) && vacrel->do_index_vacuuming)
{
/*
* Wait until lazy_vacuum_heap_rel() to save free space. This
@@ -1247,7 +1258,7 @@ lazy_scan_heap(LVRelState *vacrel)
* Do index vacuuming (call each index's ambulkdelete routine), then do
* related heap vacuuming
*/
- if (dead_items->num_items > 0)
+ if (tidstore_num_tids(dead_items) > 0)
lazy_vacuum(vacrel);
/*
@@ -1522,9 +1533,9 @@ lazy_scan_new_or_empty(LVRelState *vacrel, Buffer buf, BlockNumber blkno,
* The approach we take now is to restart pruning when the race condition is
* detected. This allows heap_page_prune() to prune the tuples inserted by
* the now-aborted transaction. This is a little crude, but it guarantees
- * that any items that make it into the dead_items array are simple LP_DEAD
- * line pointers, and that every remaining item with tuple storage is
- * considered as a candidate for freezing.
+ * that any items that make it into the dead_items are simple LP_DEAD line
+ * pointers, and that every remaining item with tuple storage is considered
+ * as a candidate for freezing.
*/
static void
lazy_scan_prune(LVRelState *vacrel,
@@ -1541,13 +1552,11 @@ lazy_scan_prune(LVRelState *vacrel,
HTSV_Result res;
int tuples_deleted,
tuples_frozen,
- lpdead_items,
live_tuples,
recently_dead_tuples;
int nnewlpdead;
HeapPageFreeze pagefrz;
int64 fpi_before = pgWalUsage.wal_fpi;
- OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
HeapTupleFreeze frozen[MaxHeapTuplesPerPage];
Assert(BufferGetBlockNumber(buf) == blkno);
@@ -1569,7 +1578,6 @@ retry:
pagefrz.NoFreezePageRelminMxid = vacrel->NewRelminMxid;
tuples_deleted = 0;
tuples_frozen = 0;
- lpdead_items = 0;
live_tuples = 0;
recently_dead_tuples = 0;
@@ -1578,9 +1586,9 @@ retry:
*
* We count tuples removed by the pruning step as tuples_deleted. Its
* final value can be thought of as the number of tuples that have been
- * deleted from the table. It should not be confused with lpdead_items;
- * lpdead_items's final value can be thought of as the number of tuples
- * that were deleted from indexes.
+ * deleted from the table. It should not be confused with
+ * prunestate->deadoffsets; prunestate->deadoffsets's final value can
+ * be thought of as the number of tuples that were deleted from indexes.
*/
tuples_deleted = heap_page_prune(rel, buf, vacrel->vistest,
InvalidTransactionId, 0, &nnewlpdead,
@@ -1591,7 +1599,7 @@ retry:
* requiring freezing among remaining tuples with storage
*/
prunestate->hastup = false;
- prunestate->has_lpdead_items = false;
+ prunestate->num_offsets = 0;
prunestate->all_visible = true;
prunestate->all_frozen = true;
prunestate->visibility_cutoff_xid = InvalidTransactionId;
@@ -1636,7 +1644,7 @@ retry:
* (This is another case where it's useful to anticipate that any
* LP_DEAD items will become LP_UNUSED during the ongoing VACUUM.)
*/
- deadoffsets[lpdead_items++] = offnum;
+ prunestate->deadoffsets[prunestate->num_offsets++] = offnum;
continue;
}
@@ -1873,7 +1881,7 @@ retry:
*/
#ifdef USE_ASSERT_CHECKING
/* Note that all_frozen value does not matter when !all_visible */
- if (prunestate->all_visible && lpdead_items == 0)
+ if (prunestate->all_visible && prunestate->num_offsets == 0)
{
TransactionId cutoff;
bool all_frozen;
@@ -1886,28 +1894,9 @@ retry:
}
#endif
- /*
- * Now save details of the LP_DEAD items from the page in vacrel
- */
- if (lpdead_items > 0)
+ if (prunestate->num_offsets > 0)
{
- VacDeadItems *dead_items = vacrel->dead_items;
- ItemPointerData tmp;
-
vacrel->lpdead_item_pages++;
- prunestate->has_lpdead_items = true;
-
- ItemPointerSetBlockNumber(&tmp, blkno);
-
- for (int i = 0; i < lpdead_items; i++)
- {
- ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
- dead_items->items[dead_items->num_items++] = tmp;
- }
-
- Assert(dead_items->num_items <= dead_items->max_items);
- pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
- dead_items->num_items);
/*
* It was convenient to ignore LP_DEAD items in all_visible earlier on
@@ -1926,7 +1915,7 @@ retry:
/* Finally, add page-local counts to whole-VACUUM counts */
vacrel->tuples_deleted += tuples_deleted;
vacrel->tuples_frozen += tuples_frozen;
- vacrel->lpdead_items += lpdead_items;
+ vacrel->lpdead_items += prunestate->num_offsets;
vacrel->live_tuples += live_tuples;
vacrel->recently_dead_tuples += recently_dead_tuples;
}
@@ -1938,7 +1927,7 @@ retry:
* lazy_scan_prune, which requires a full cleanup lock. While pruning isn't
* performed here, it's quite possible that an earlier opportunistic pruning
* operation left LP_DEAD items behind. We'll at least collect any such items
- * in the dead_items array for removal from indexes.
+ * in the dead_items for removal from indexes.
*
* For aggressive VACUUM callers, we may return false to indicate that a full
* cleanup lock is required for processing by lazy_scan_prune. This is only
@@ -2097,7 +2086,7 @@ lazy_scan_noprune(LVRelState *vacrel,
vacrel->NewRelfrozenXid = NoFreezePageRelfrozenXid;
vacrel->NewRelminMxid = NoFreezePageRelminMxid;
- /* Save any LP_DEAD items found on the page in dead_items array */
+ /* Save any LP_DEAD items found on the page in dead_items */
if (vacrel->nindexes == 0)
{
/* Using one-pass strategy (since table has no indexes) */
@@ -2127,8 +2116,7 @@ lazy_scan_noprune(LVRelState *vacrel,
}
else
{
- VacDeadItems *dead_items = vacrel->dead_items;
- ItemPointerData tmp;
+ TidStore *dead_items = vacrel->dead_items;
/*
* Page has LP_DEAD items, and so any references/TIDs that remain in
@@ -2137,17 +2125,10 @@ lazy_scan_noprune(LVRelState *vacrel,
*/
vacrel->lpdead_item_pages++;
- ItemPointerSetBlockNumber(&tmp, blkno);
+ tidstore_add_tids(dead_items, blkno, deadoffsets, lpdead_items);
- for (int i = 0; i < lpdead_items; i++)
- {
- ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
- dead_items->items[dead_items->num_items++] = tmp;
- }
-
- Assert(dead_items->num_items <= dead_items->max_items);
- pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
- dead_items->num_items);
+ pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
+ tidstore_memory_usage(dead_items));
vacrel->lpdead_items += lpdead_items;
@@ -2196,7 +2177,7 @@ lazy_vacuum(LVRelState *vacrel)
if (!vacrel->do_index_vacuuming)
{
Assert(!vacrel->do_index_cleanup);
- vacrel->dead_items->num_items = 0;
+ tidstore_reset(vacrel->dead_items);
return;
}
@@ -2225,7 +2206,7 @@ lazy_vacuum(LVRelState *vacrel)
BlockNumber threshold;
Assert(vacrel->num_index_scans == 0);
- Assert(vacrel->lpdead_items == vacrel->dead_items->num_items);
+ Assert(vacrel->lpdead_items == tidstore_num_tids(vacrel->dead_items));
Assert(vacrel->do_index_vacuuming);
Assert(vacrel->do_index_cleanup);
@@ -2252,8 +2233,8 @@ lazy_vacuum(LVRelState *vacrel)
* cases then this may need to be reconsidered.
*/
threshold = (double) vacrel->rel_pages * BYPASS_THRESHOLD_PAGES;
- bypass = (vacrel->lpdead_item_pages < threshold &&
- vacrel->lpdead_items < MAXDEADITEMS(32L * 1024L * 1024L));
+ bypass = (vacrel->lpdead_item_pages < threshold) &&
+ tidstore_memory_usage(vacrel->dead_items) < (32L * 1024L * 1024L);
}
if (bypass)
@@ -2298,7 +2279,7 @@ lazy_vacuum(LVRelState *vacrel)
* Forget the LP_DEAD items that we just vacuumed (or just decided to not
* vacuum)
*/
- vacrel->dead_items->num_items = 0;
+ tidstore_reset(vacrel->dead_items);
}
/*
@@ -2371,7 +2352,7 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
* place).
*/
Assert(vacrel->num_index_scans > 0 ||
- vacrel->dead_items->num_items == vacrel->lpdead_items);
+ tidstore_num_tids(vacrel->dead_items) == vacrel->lpdead_items);
Assert(allindexes || VacuumFailsafeActive);
/*
@@ -2390,9 +2371,8 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
/*
* lazy_vacuum_heap_rel() -- second pass over the heap for two pass strategy
*
- * This routine marks LP_DEAD items in vacrel->dead_items array as LP_UNUSED.
- * Pages that never had lazy_scan_prune record LP_DEAD items are not visited
- * at all.
+ * This routine marks LP_DEAD items in vacrel->dead_items as LP_UNUSED. Pages
+ * that never had lazy_scan_prune record LP_DEAD items are not visited at all.
*
* We may also be able to truncate the line pointer array of the heap pages we
* visit. If there is a contiguous group of LP_UNUSED items at the end of the
@@ -2408,10 +2388,11 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
static void
lazy_vacuum_heap_rel(LVRelState *vacrel)
{
- int index = 0;
BlockNumber vacuumed_pages = 0;
Buffer vmbuffer = InvalidBuffer;
LVSavedErrInfo saved_err_info;
+ TidStoreIter *iter;
+ TidStoreIterResult *result;
Assert(vacrel->do_index_vacuuming);
Assert(vacrel->do_index_cleanup);
@@ -2426,7 +2407,8 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
VACUUM_ERRCB_PHASE_VACUUM_HEAP,
InvalidBlockNumber, InvalidOffsetNumber);
- while (index < vacrel->dead_items->num_items)
+ iter = tidstore_begin_iterate(vacrel->dead_items);
+ while ((result = tidstore_iterate_next(iter)) != NULL)
{
BlockNumber blkno;
Buffer buf;
@@ -2435,7 +2417,7 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
vacuum_delay_point();
- blkno = ItemPointerGetBlockNumber(&vacrel->dead_items->items[index]);
+ blkno = result->blkno;
vacrel->blkno = blkno;
/*
@@ -2449,7 +2431,8 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
buf = ReadBufferExtended(vacrel->rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
vacrel->bstrategy);
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
- index = lazy_vacuum_heap_page(vacrel, blkno, buf, index, vmbuffer);
+ lazy_vacuum_heap_page(vacrel, blkno, result->offsets, result->num_offsets,
+ buf, vmbuffer);
/* Now that we've vacuumed the page, record its available space */
page = BufferGetPage(buf);
@@ -2459,6 +2442,7 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
vacuumed_pages++;
}
+ tidstore_end_iterate(iter);
vacrel->blkno = InvalidBlockNumber;
if (BufferIsValid(vmbuffer))
@@ -2468,36 +2452,31 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
* We set all LP_DEAD items from the first heap pass to LP_UNUSED during
* the second heap pass. No more, no less.
*/
- Assert(index > 0);
Assert(vacrel->num_index_scans > 1 ||
- (index == vacrel->lpdead_items &&
+ (tidstore_num_tids(vacrel->dead_items) == vacrel->lpdead_items &&
vacuumed_pages == vacrel->lpdead_item_pages));
ereport(DEBUG2,
- (errmsg("table \"%s\": removed %lld dead item identifiers in %u pages",
- vacrel->relname, (long long) index, vacuumed_pages)));
+ (errmsg("table \"%s\": removed " UINT64_FORMAT "dead item identifiers in %u pages",
+ vacrel->relname, tidstore_num_tids(vacrel->dead_items),
+ vacuumed_pages)));
/* Revert to the previous phase information for error traceback */
restore_vacuum_error_info(vacrel, &saved_err_info);
}
/*
- * lazy_vacuum_heap_page() -- free page's LP_DEAD items listed in the
- * vacrel->dead_items array.
+ * lazy_vacuum_heap_page() -- free page's LP_DEAD items.
*
* Caller must have an exclusive buffer lock on the buffer (though a full
* cleanup lock is also acceptable). vmbuffer must be valid and already have
* a pin on blkno's visibility map page.
- *
- * index is an offset into the vacrel->dead_items array for the first listed
- * LP_DEAD item on the page. The return value is the first index immediately
- * after all LP_DEAD items for the same page in the array.
*/
-static int
-lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
- int index, Buffer vmbuffer)
+static void
+lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
+ OffsetNumber *deadoffsets, int num_offsets, Buffer buffer,
+ Buffer vmbuffer)
{
- VacDeadItems *dead_items = vacrel->dead_items;
Page page = BufferGetPage(buffer);
OffsetNumber unused[MaxHeapTuplesPerPage];
int nunused = 0;
@@ -2516,16 +2495,11 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
START_CRIT_SECTION();
- for (; index < dead_items->num_items; index++)
+ for (int i = 0; i < num_offsets; i++)
{
- BlockNumber tblk;
- OffsetNumber toff;
ItemId itemid;
+ OffsetNumber toff = deadoffsets[i];
- tblk = ItemPointerGetBlockNumber(&dead_items->items[index]);
- if (tblk != blkno)
- break; /* past end of tuples for this block */
- toff = ItemPointerGetOffsetNumber(&dead_items->items[index]);
itemid = PageGetItemId(page, toff);
Assert(ItemIdIsDead(itemid) && !ItemIdHasStorage(itemid));
@@ -2595,7 +2569,6 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
/* Revert to the previous phase information for error traceback */
restore_vacuum_error_info(vacrel, &saved_err_info);
- return index;
}
/*
@@ -2692,8 +2665,8 @@ lazy_cleanup_all_indexes(LVRelState *vacrel)
* lazy_vacuum_one_index() -- vacuum index relation.
*
* Delete all the index tuples containing a TID collected in
- * vacrel->dead_items array. Also update running statistics.
- * Exact details depend on index AM's ambulkdelete routine.
+ * vacrel->dead_items. Also update running statistics. Exact
+ * details depend on index AM's ambulkdelete routine.
*
* reltuples is the number of heap tuples to be passed to the
* bulkdelete callback. It's always assumed to be estimated.
@@ -3101,48 +3074,8 @@ count_nondeletable_pages(LVRelState *vacrel, bool *lock_waiter_detected)
}
/*
- * Returns the number of dead TIDs that VACUUM should allocate space to
- * store, given a heap rel of size vacrel->rel_pages, and given current
- * maintenance_work_mem setting (or current autovacuum_work_mem setting,
- * when applicable).
- *
- * See the comments at the head of this file for rationale.
- */
-static int
-dead_items_max_items(LVRelState *vacrel)
-{
- int64 max_items;
- int vac_work_mem = IsAutoVacuumWorkerProcess() &&
- autovacuum_work_mem != -1 ?
- autovacuum_work_mem : maintenance_work_mem;
-
- if (vacrel->nindexes > 0)
- {
- BlockNumber rel_pages = vacrel->rel_pages;
-
- max_items = MAXDEADITEMS(vac_work_mem * 1024L);
- max_items = Min(max_items, INT_MAX);
- max_items = Min(max_items, MAXDEADITEMS(MaxAllocSize));
-
- /* curious coding here to ensure the multiplication can't overflow */
- if ((BlockNumber) (max_items / MaxHeapTuplesPerPage) > rel_pages)
- max_items = rel_pages * MaxHeapTuplesPerPage;
-
- /* stay sane if small maintenance_work_mem */
- max_items = Max(max_items, MaxHeapTuplesPerPage);
- }
- else
- {
- /* One-pass case only stores a single heap page's TIDs at a time */
- max_items = MaxHeapTuplesPerPage;
- }
-
- return (int) max_items;
-}
-
-/*
- * Allocate dead_items (either using palloc, or in dynamic shared memory).
- * Sets dead_items in vacrel for caller.
+ * Allocate a (local or shared) TidStore for storing dead TIDs. Sets dead_items
+ * in vacrel for caller.
*
* Also handles parallel initialization as part of allocating dead_items in
* DSM when required.
@@ -3150,11 +3083,9 @@ dead_items_max_items(LVRelState *vacrel)
static void
dead_items_alloc(LVRelState *vacrel, int nworkers)
{
- VacDeadItems *dead_items;
- int max_items;
-
- max_items = dead_items_max_items(vacrel);
- Assert(max_items >= MaxHeapTuplesPerPage);
+ int vac_work_mem = IsAutoVacuumWorkerProcess() &&
+ autovacuum_work_mem != -1 ?
+ autovacuum_work_mem * 1024L : maintenance_work_mem * 1024L;
/*
* Initialize state for a parallel vacuum. As of now, only one worker can
@@ -3181,7 +3112,7 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
else
vacrel->pvs = parallel_vacuum_init(vacrel->rel, vacrel->indrels,
vacrel->nindexes, nworkers,
- max_items,
+ vac_work_mem, MaxHeapTuplesPerPage,
vacrel->verbose ? INFO : DEBUG2,
vacrel->bstrategy);
@@ -3194,11 +3125,8 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
}
/* Serial VACUUM case */
- dead_items = (VacDeadItems *) palloc(vac_max_items_to_alloc_size(max_items));
- dead_items->max_items = max_items;
- dead_items->num_items = 0;
-
- vacrel->dead_items = dead_items;
+ vacrel->dead_items = tidstore_create(vac_work_mem, MaxHeapTuplesPerPage,
+ NULL);
}
/*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 2129c916aa..134df925ce 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1190,7 +1190,7 @@ CREATE VIEW pg_stat_progress_vacuum AS
END AS phase,
S.param2 AS heap_blks_total, S.param3 AS heap_blks_scanned,
S.param4 AS heap_blks_vacuumed, S.param5 AS index_vacuum_count,
- S.param6 AS max_dead_tuples, S.param7 AS num_dead_tuples
+ S.param6 AS max_dead_tuple_bytes, S.param7 AS dead_tuple_bytes
FROM pg_stat_get_progress_info('VACUUM') AS S
LEFT JOIN pg_database D ON S.datid = D.oid;
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index a843f9ad92..f3922b72dc 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -119,7 +119,6 @@ static bool vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
static double compute_parallel_delay(void);
static VacOptValue get_vacoptval_from_boolean(DefElem *def);
static bool vac_tid_reaped(ItemPointer itemptr, void *state);
-static int vac_cmp_itemptr(const void *left, const void *right);
/*
* GUC check function to ensure GUC value specified is within the allowable
@@ -2478,16 +2477,16 @@ get_vacoptval_from_boolean(DefElem *def)
*/
IndexBulkDeleteResult *
vac_bulkdel_one_index(IndexVacuumInfo *ivinfo, IndexBulkDeleteResult *istat,
- VacDeadItems *dead_items)
+ TidStore *dead_items)
{
/* Do bulk deletion */
istat = index_bulk_delete(ivinfo, istat, vac_tid_reaped,
(void *) dead_items);
ereport(ivinfo->message_level,
- (errmsg("scanned index \"%s\" to remove %d row versions",
+ (errmsg("scanned index \"%s\" to remove " UINT64_FORMAT " row versions",
RelationGetRelationName(ivinfo->index),
- dead_items->num_items)));
+ tidstore_num_tids(dead_items))));
return istat;
}
@@ -2518,82 +2517,15 @@ vac_cleanup_one_index(IndexVacuumInfo *ivinfo, IndexBulkDeleteResult *istat)
return istat;
}
-/*
- * Returns the total required space for VACUUM's dead_items array given a
- * max_items value.
- */
-Size
-vac_max_items_to_alloc_size(int max_items)
-{
- Assert(max_items <= MAXDEADITEMS(MaxAllocSize));
-
- return offsetof(VacDeadItems, items) + sizeof(ItemPointerData) * max_items;
-}
-
/*
* vac_tid_reaped() -- is a particular tid deletable?
*
* This has the right signature to be an IndexBulkDeleteCallback.
- *
- * Assumes dead_items array is sorted (in ascending TID order).
*/
static bool
vac_tid_reaped(ItemPointer itemptr, void *state)
{
- VacDeadItems *dead_items = (VacDeadItems *) state;
- int64 litem,
- ritem,
- item;
- ItemPointer res;
-
- litem = itemptr_encode(&dead_items->items[0]);
- ritem = itemptr_encode(&dead_items->items[dead_items->num_items - 1]);
- item = itemptr_encode(itemptr);
-
- /*
- * Doing a simple bound check before bsearch() is useful to avoid the
- * extra cost of bsearch(), especially if dead items on the heap are
- * concentrated in a certain range. Since this function is called for
- * every index tuple, it pays to be really fast.
- */
- if (item < litem || item > ritem)
- return false;
-
- res = (ItemPointer) bsearch(itemptr,
- dead_items->items,
- dead_items->num_items,
- sizeof(ItemPointerData),
- vac_cmp_itemptr);
-
- return (res != NULL);
-}
-
-/*
- * Comparator routines for use with qsort() and bsearch().
- */
-static int
-vac_cmp_itemptr(const void *left, const void *right)
-{
- BlockNumber lblk,
- rblk;
- OffsetNumber loff,
- roff;
-
- lblk = ItemPointerGetBlockNumber((ItemPointer) left);
- rblk = ItemPointerGetBlockNumber((ItemPointer) right);
-
- if (lblk < rblk)
- return -1;
- if (lblk > rblk)
- return 1;
-
- loff = ItemPointerGetOffsetNumber((ItemPointer) left);
- roff = ItemPointerGetOffsetNumber((ItemPointer) right);
-
- if (loff < roff)
- return -1;
- if (loff > roff)
- return 1;
+ TidStore *dead_items = (TidStore *) state;
- return 0;
+ return tidstore_lookup_tid(dead_items, itemptr);
}
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index 87ea5c5242..c363f45e32 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -9,12 +9,11 @@
* In a parallel vacuum, we perform both index bulk deletion and index cleanup
* with parallel worker processes. Individual indexes are processed by one
* vacuum process. ParalleVacuumState contains shared information as well as
- * the memory space for storing dead items allocated in the DSM segment. We
- * launch parallel worker processes at the start of parallel index
- * bulk-deletion and index cleanup and once all indexes are processed, the
- * parallel worker processes exit. Each time we process indexes in parallel,
- * the parallel context is re-initialized so that the same DSM can be used for
- * multiple passes of index bulk-deletion and index cleanup.
+ * the shared TidStore. We launch parallel worker processes at the start of
+ * parallel index bulk-deletion and index cleanup and once all indexes are
+ * processed, the parallel worker processes exit. Each time we process indexes
+ * in parallel, the parallel context is re-initialized so that the same DSM can
+ * be used for multiple passes of index bulk-deletion and index cleanup.
*
* Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
@@ -109,6 +108,9 @@ typedef struct PVShared
/* Counter for vacuuming and cleanup */
pg_atomic_uint32 idx;
+
+ /* Handle of the shared TidStore */
+ tidstore_handle dead_items_handle;
} PVShared;
/* Status used during parallel index vacuum or cleanup */
@@ -175,7 +177,8 @@ struct ParallelVacuumState
PVIndStats *indstats;
/* Shared dead items space among parallel vacuum workers */
- VacDeadItems *dead_items;
+ TidStore *dead_items;
+ dsa_area *dead_items_area;
/* Points to buffer usage area in DSM */
BufferUsage *buffer_usage;
@@ -231,20 +234,23 @@ static void parallel_vacuum_error_callback(void *arg);
*/
ParallelVacuumState *
parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
- int nrequested_workers, int max_items,
- int elevel, BufferAccessStrategy bstrategy)
+ int nrequested_workers, int vac_work_mem,
+ int max_offset, int elevel,
+ BufferAccessStrategy bstrategy)
{
ParallelVacuumState *pvs;
ParallelContext *pcxt;
PVShared *shared;
- VacDeadItems *dead_items;
+ TidStore *dead_items;
PVIndStats *indstats;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
+ void *area_space;
+ dsa_area *dead_items_dsa;
bool *will_parallel_vacuum;
Size est_indstats_len;
Size est_shared_len;
- Size est_dead_items_len;
+ Size dsa_minsize = dsa_minimum_size();
int nindexes_mwm = 0;
int parallel_workers = 0;
int querylen;
@@ -293,9 +299,8 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_estimate_chunk(&pcxt->estimator, est_shared_len);
shm_toc_estimate_keys(&pcxt->estimator, 1);
- /* Estimate size for dead_items -- PARALLEL_VACUUM_KEY_DEAD_ITEMS */
- est_dead_items_len = vac_max_items_to_alloc_size(max_items);
- shm_toc_estimate_chunk(&pcxt->estimator, est_dead_items_len);
+ /* Estimate size for dead tuple DSA -- PARALLEL_VACUUM_KEY_DSA */
+ shm_toc_estimate_chunk(&pcxt->estimator, dsa_minsize);
shm_toc_estimate_keys(&pcxt->estimator, 1);
/*
@@ -361,6 +366,16 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_INDEX_STATS, indstats);
pvs->indstats = indstats;
+ /* Prepare DSA space for dead items */
+ area_space = shm_toc_allocate(pcxt->toc, dsa_minsize);
+ shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_DEAD_ITEMS, area_space);
+ dead_items_dsa = dsa_create_in_place(area_space, dsa_minsize,
+ LWTRANCHE_PARALLEL_VACUUM_DSA,
+ pcxt->seg);
+ dead_items = tidstore_create(vac_work_mem, max_offset, dead_items_dsa);
+ pvs->dead_items = dead_items;
+ pvs->dead_items_area = dead_items_dsa;
+
/* Prepare shared information */
shared = (PVShared *) shm_toc_allocate(pcxt->toc, est_shared_len);
MemSet(shared, 0, est_shared_len);
@@ -370,6 +385,7 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
(nindexes_mwm > 0) ?
maintenance_work_mem / Min(parallel_workers, nindexes_mwm) :
maintenance_work_mem;
+ shared->dead_items_handle = tidstore_get_handle(dead_items);
/* Use the same buffer size for all workers */
shared->ring_nbuffers = GetAccessStrategyBufferCount(bstrategy);
@@ -381,15 +397,6 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_SHARED, shared);
pvs->shared = shared;
- /* Prepare the dead_items space */
- dead_items = (VacDeadItems *) shm_toc_allocate(pcxt->toc,
- est_dead_items_len);
- dead_items->max_items = max_items;
- dead_items->num_items = 0;
- MemSet(dead_items->items, 0, sizeof(ItemPointerData) * max_items);
- shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_DEAD_ITEMS, dead_items);
- pvs->dead_items = dead_items;
-
/*
* Allocate space for each worker's BufferUsage and WalUsage; no need to
* initialize
@@ -447,6 +454,9 @@ parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats)
istats[i] = NULL;
}
+ tidstore_destroy(pvs->dead_items);
+ dsa_detach(pvs->dead_items_area);
+
DestroyParallelContext(pvs->pcxt);
ExitParallelMode();
@@ -455,7 +465,7 @@ parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats)
}
/* Returns the dead items space */
-VacDeadItems *
+TidStore *
parallel_vacuum_get_dead_items(ParallelVacuumState *pvs)
{
return pvs->dead_items;
@@ -954,7 +964,9 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
Relation *indrels;
PVIndStats *indstats;
PVShared *shared;
- VacDeadItems *dead_items;
+ TidStore *dead_items;
+ void *area_space;
+ dsa_area *dead_items_area;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
int nindexes;
@@ -998,10 +1010,10 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
PARALLEL_VACUUM_KEY_INDEX_STATS,
false);
- /* Set dead_items space */
- dead_items = (VacDeadItems *) shm_toc_lookup(toc,
- PARALLEL_VACUUM_KEY_DEAD_ITEMS,
- false);
+ /* Set dead items */
+ area_space = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_DEAD_ITEMS, false);
+ dead_items_area = dsa_attach_in_place(area_space, seg);
+ dead_items = tidstore_attach(dead_items_area, shared->dead_items_handle);
/* Set cost-based vacuum delay */
VacuumUpdateCosts();
@@ -1049,6 +1061,9 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
&wal_usage[ParallelWorkerNumber]);
+ tidstore_detach(pvs.dead_items);
+ dsa_detach(dead_items_area);
+
/* Pop the error context stack */
error_context_stack = errcallback.previous;
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 53c8f8d79c..74915bee9b 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -3474,12 +3474,12 @@ check_autovacuum_work_mem(int *newval, void **extra, GucSource source)
return true;
/*
- * We clamp manually-set values to at least 1MB. Since
+ * We clamp manually-set values to at least 2MB. Since
* maintenance_work_mem is always set to at least this value, do the same
* here.
*/
- if (*newval < 1024)
- *newval = 1024;
+ if (*newval < 2048)
+ *newval = 2048;
return true;
}
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 55b3a04097..c223a7dc94 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -192,6 +192,8 @@ static const char *const BuiltinTrancheNames[] = {
"LogicalRepLauncherDSA",
/* LWTRANCHE_LAUNCHER_HASH: */
"LogicalRepLauncherHash",
+ /* LWTRANCHE_PARALLEL_VACUUM_DSA: */
+ "ParallelVacuumDSA",
};
StaticAssertDecl(lengthof(BuiltinTrancheNames) ==
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index cab3ddbe11..0bbdf04980 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2353,7 +2353,7 @@ struct config_int ConfigureNamesInt[] =
GUC_UNIT_KB
},
&maintenance_work_mem,
- 65536, 1024, MAX_KILOBYTES,
+ 65536, 2048, MAX_KILOBYTES,
NULL, NULL, NULL
},
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index e5add41352..b209d3cf84 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -23,8 +23,8 @@
#define PROGRESS_VACUUM_HEAP_BLKS_SCANNED 2
#define PROGRESS_VACUUM_HEAP_BLKS_VACUUMED 3
#define PROGRESS_VACUUM_NUM_INDEX_VACUUMS 4
-#define PROGRESS_VACUUM_MAX_DEAD_TUPLES 5
-#define PROGRESS_VACUUM_NUM_DEAD_TUPLES 6
+#define PROGRESS_VACUUM_MAX_DEAD_TUPLE_BYTES 5
+#define PROGRESS_VACUUM_DEAD_TUPLE_BYTES 6
/* Phases of vacuum (as advertised via PROGRESS_VACUUM_PHASE) */
#define PROGRESS_VACUUM_PHASE_SCAN_HEAP 1
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 17e9b4f68e..b48c6ebf2d 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -17,6 +17,7 @@
#include "access/htup.h"
#include "access/genam.h"
#include "access/parallel.h"
+#include "access/tidstore.h"
#include "catalog/pg_class.h"
#include "catalog/pg_statistic.h"
#include "catalog/pg_type.h"
@@ -280,21 +281,6 @@ struct VacuumCutoffs
MultiXactId MultiXactCutoff;
};
-/*
- * VacDeadItems stores TIDs whose index tuples are deleted by index vacuuming.
- */
-typedef struct VacDeadItems
-{
- int max_items; /* # slots allocated in array */
- int num_items; /* current # of entries */
-
- /* Sorted array of TIDs to delete from indexes */
- ItemPointerData items[FLEXIBLE_ARRAY_MEMBER];
-} VacDeadItems;
-
-#define MAXDEADITEMS(avail_mem) \
- (((avail_mem) - offsetof(VacDeadItems, items)) / sizeof(ItemPointerData))
-
/* GUC parameters */
extern PGDLLIMPORT int default_statistics_target; /* PGDLLIMPORT for PostGIS */
extern PGDLLIMPORT int vacuum_freeze_min_age;
@@ -347,10 +333,9 @@ extern Relation vacuum_open_relation(Oid relid, RangeVar *relation,
LOCKMODE lmode);
extern IndexBulkDeleteResult *vac_bulkdel_one_index(IndexVacuumInfo *ivinfo,
IndexBulkDeleteResult *istat,
- VacDeadItems *dead_items);
+ TidStore *dead_items);
extern IndexBulkDeleteResult *vac_cleanup_one_index(IndexVacuumInfo *ivinfo,
IndexBulkDeleteResult *istat);
-extern Size vac_max_items_to_alloc_size(int max_items);
/* In postmaster/autovacuum.c */
extern void AutoVacuumUpdateCostLimit(void);
@@ -359,10 +344,10 @@ extern void VacuumUpdateCosts(void);
/* in commands/vacuumparallel.c */
extern ParallelVacuumState *parallel_vacuum_init(Relation rel, Relation *indrels,
int nindexes, int nrequested_workers,
- int max_items, int elevel,
- BufferAccessStrategy bstrategy);
+ int vac_work_mem, int max_offset,
+ int elevel, BufferAccessStrategy bstrategy);
extern void parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats);
-extern VacDeadItems *parallel_vacuum_get_dead_items(ParallelVacuumState *pvs);
+extern TidStore *parallel_vacuum_get_dead_items(ParallelVacuumState *pvs);
extern void parallel_vacuum_bulkdel_all_indexes(ParallelVacuumState *pvs,
long num_table_tuples,
int num_index_scans);
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 07002fdfbe..537b34b30c 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -207,6 +207,7 @@ typedef enum BuiltinTrancheIds
LWTRANCHE_PGSTATS_DATA,
LWTRANCHE_LAUNCHER_DSA,
LWTRANCHE_LAUNCHER_HASH,
+ LWTRANCHE_PARALLEL_VACUUM_DSA,
LWTRANCHE_FIRST_USER_DEFINED
} BuiltinTrancheIds;
diff --git a/src/test/regress/expected/cluster.out b/src/test/regress/expected/cluster.out
index 2eec483eaa..e04f50726f 100644
--- a/src/test/regress/expected/cluster.out
+++ b/src/test/regress/expected/cluster.out
@@ -526,7 +526,7 @@ create index cluster_sort on clstr_4 (hundred, thousand, tenthous);
-- ensure we don't use the index in CLUSTER nor the checking SELECTs
set enable_indexscan = off;
-- Use external sort:
-set maintenance_work_mem = '1MB';
+set maintenance_work_mem = '2MB';
cluster clstr_4 using cluster_sort;
select * from
(select hundred, lag(hundred) over () as lhundred,
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index acfd9d1f4f..d320ad87dd 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -1214,7 +1214,7 @@ DROP TABLE unlogged_hash_table;
-- CREATE INDEX hash_ovfl_index ON hash_ovfl_heap USING hash (x int4_ops);
-- Test hash index build tuplesorting. Force hash tuplesort using low
-- maintenance_work_mem setting and fillfactor:
-SET maintenance_work_mem = '1MB';
+SET maintenance_work_mem = '2MB';
CREATE INDEX hash_tuplesort_idx ON tenk1 USING hash (stringu1 name_ops) WITH (fillfactor = 10);
EXPLAIN (COSTS OFF)
SELECT count(*) FROM tenk1 WHERE stringu1 = 'TVAAAA';
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 919d947ec0..66d671a641 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2041,8 +2041,8 @@ pg_stat_progress_vacuum| SELECT s.pid,
s.param3 AS heap_blks_scanned,
s.param4 AS heap_blks_vacuumed,
s.param5 AS index_vacuum_count,
- s.param6 AS max_dead_tuples,
- s.param7 AS num_dead_tuples
+ s.param6 AS max_dead_tuple_bytes,
+ s.param7 AS dead_tuple_bytes
FROM (pg_stat_get_progress_info('VACUUM'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
LEFT JOIN pg_database d ON ((s.datid = d.oid)));
pg_stat_recovery_prefetch| SELECT stats_reset,
diff --git a/src/test/regress/sql/cluster.sql b/src/test/regress/sql/cluster.sql
index a4cfaae807..a4cb5b98a5 100644
--- a/src/test/regress/sql/cluster.sql
+++ b/src/test/regress/sql/cluster.sql
@@ -258,7 +258,7 @@ create index cluster_sort on clstr_4 (hundred, thousand, tenthous);
set enable_indexscan = off;
-- Use external sort:
-set maintenance_work_mem = '1MB';
+set maintenance_work_mem = '2MB';
cluster clstr_4 using cluster_sort;
select * from
(select hundred, lag(hundred) over () as lhundred,
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index d49ce9f300..d6e2471b00 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -367,7 +367,7 @@ DROP TABLE unlogged_hash_table;
-- Test hash index build tuplesorting. Force hash tuplesort using low
-- maintenance_work_mem setting and fillfactor:
-SET maintenance_work_mem = '1MB';
+SET maintenance_work_mem = '2MB';
CREATE INDEX hash_tuplesort_idx ON tenk1 USING hash (stringu1 name_ops) WITH (fillfactor = 10);
EXPLAIN (COSTS OFF)
SELECT count(*) FROM tenk1 WHERE stringu1 = 'TVAAAA';
--
2.31.1
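Stepping back from the diff for a moment: the essence of the vacuumlazy.c change is that dead line pointers are now recorded per block during the heap scan, probed during index vacuuming, and then iterated per block again in the second heap pass. The following is a self-contained toy sketch of that call pattern only (DeadItems, dead_items_add and dead_items_lookup are stand-ins built on a fixed-size array; the tidstore_* functions in the patch are the real counterparts of these helpers):

#include <stdio.h>
#include <stdbool.h>
#include <string.h>
#include <stdint.h>

typedef uint32_t BlockNumber;
typedef uint16_t OffsetNumber;

typedef struct DeadItemsBlock
{
    BlockNumber  blkno;
    int          num_offsets;
    OffsetNumber offsets[32];
} DeadItemsBlock;

typedef struct DeadItems
{
    int            nblocks;
    DeadItemsBlock blocks[128];
} DeadItems;

/* first heap pass: remember this block's dead line pointers
 * (analogue of tidstore_add_tids) */
static void
dead_items_add(DeadItems *di, BlockNumber blkno,
               const OffsetNumber *offs, int n)
{
    DeadItemsBlock *b = &di->blocks[di->nblocks++];

    b->blkno = blkno;
    b->num_offsets = n;
    memcpy(b->offsets, offs, sizeof(OffsetNumber) * n);
}

/* index vacuum callback: is this TID dead?
 * (analogue of tidstore_lookup_tid in vac_tid_reaped) */
static bool
dead_items_lookup(const DeadItems *di, BlockNumber blkno, OffsetNumber off)
{
    for (int i = 0; i < di->nblocks; i++)
    {
        if (di->blocks[i].blkno != blkno)
            continue;
        for (int j = 0; j < di->blocks[i].num_offsets; j++)
        {
            if (di->blocks[i].offsets[j] == off)
                return true;
        }
    }
    return false;
}

int
main(void)
{
    DeadItems    di = {0};
    OffsetNumber offs[] = {1, 4, 7};

    /* lazy_scan_heap: collect LP_DEAD offsets for block 10 */
    dead_items_add(&di, 10, offs, 3);

    /* lazy_vacuum_all_indexes: probe each index tuple's TID */
    printf("(10,4) dead? %d\n", dead_items_lookup(&di, 10, 4));

    /* lazy_vacuum_heap_rel: walk the store block by block
     * (analogue of tidstore_begin_iterate/tidstore_iterate_next) */
    for (int i = 0; i < di.nblocks; i++)
        printf("second pass: block %u, %d dead offsets\n",
               di.blocks[i].blkno, di.blocks[i].num_offsets);

    /* lazy_vacuum: forget everything for the next round
     * (analogue of tidstore_reset) */
    di.nblocks = 0;
    return 0;
}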
Attachment: v32-0005-Tool-for-measuring-radix-tree-and-tidstore-perfo.patch (application/octet-stream)
From cff1ffa9af592765cf9073291fb1665b09b61d8a Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 16 Sep 2022 11:57:03 +0900
Subject: [PATCH v32 05/18] Tool for measuring radix tree and tidstore
performance
Includes Meson support, but commented out to avoid warnings
XXX: Not for commit
---
contrib/bench_radix_tree/Makefile | 21 +
.../bench_radix_tree--1.0.sql | 88 +++
contrib/bench_radix_tree/bench_radix_tree.c | 747 ++++++++++++++++++
.../bench_radix_tree/bench_radix_tree.control | 6 +
contrib/bench_radix_tree/expected/bench.out | 13 +
contrib/bench_radix_tree/meson.build | 33 +
contrib/bench_radix_tree/sql/bench.sql | 16 +
contrib/meson.build | 1 +
8 files changed, 925 insertions(+)
create mode 100644 contrib/bench_radix_tree/Makefile
create mode 100644 contrib/bench_radix_tree/bench_radix_tree--1.0.sql
create mode 100644 contrib/bench_radix_tree/bench_radix_tree.c
create mode 100644 contrib/bench_radix_tree/bench_radix_tree.control
create mode 100644 contrib/bench_radix_tree/expected/bench.out
create mode 100644 contrib/bench_radix_tree/meson.build
create mode 100644 contrib/bench_radix_tree/sql/bench.sql
diff --git a/contrib/bench_radix_tree/Makefile b/contrib/bench_radix_tree/Makefile
new file mode 100644
index 0000000000..952bb0ceae
--- /dev/null
+++ b/contrib/bench_radix_tree/Makefile
@@ -0,0 +1,21 @@
+# contrib/bench_radix_tree/Makefile
+
+MODULE_big = bench_radix_tree
+OBJS = \
+ bench_radix_tree.o
+
+EXTENSION = bench_radix_tree
+DATA = bench_radix_tree--1.0.sql
+
+REGRESS = bench_fixed_height
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/bench_radix_tree
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
new file mode 100644
index 0000000000..ad66265e23
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
@@ -0,0 +1,88 @@
+/* contrib/bench_radix_tree/bench_radix_tree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION bench_radix_tree" to load this file. \quit
+
+create function bench_shuffle_search(
+minblk int4,
+maxblk int4,
+random_block bool DEFAULT false,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT array_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT array_load_ms int8,
+OUT rt_search_ms int8,
+OUT array_serach_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_seq_search(
+minblk int4,
+maxblk int4,
+random_block bool DEFAULT false,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT array_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT array_load_ms int8,
+OUT rt_search_ms int8,
+OUT array_serach_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_load_random_int(
+cnt int8,
+OUT mem_allocated int8,
+OUT load_ms int8)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_search_random_nodes(
+cnt int8,
+filter_str text DEFAULT NULL,
+OUT mem_allocated int8,
+OUT load_ms int8,
+OUT search_ms int8)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C VOLATILE PARALLEL UNSAFE;
+
+create function bench_fixed_height_search(
+fanout int4,
+OUT fanout int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT rt_search_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_node128_load(
+fanout int4,
+OUT fanout int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT rt_sparseload_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_tidstore_load(
+minblk int4,
+maxblk int4,
+OUT mem_allocated int8,
+OUT load_ms int8,
+OUT iter_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c
new file mode 100644
index 0000000000..6e5149e2c4
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -0,0 +1,747 @@
+/*-------------------------------------------------------------------------
+ *
+ * bench_radix_tree.c
+ *
+ * Copyright (c) 2016-2022, PostgreSQL Global Development Group
+ *
+ * contrib/bench_radix_tree/bench_radix_tree.c
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/tidstore.h"
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "funcapi.h"
+#include "lib/radixtree.h"
+#include <math.h>
+#include "miscadmin.h"
+#include "storage/lwlock.h"
+#include "utils/builtins.h"
+#include "utils/timestamp.h"
+
+PG_MODULE_MAGIC;
+
+/* run benchmark also for binary-search case? */
+/* #define MEASURE_BINARY_SEARCH 1 */
+
+#define TIDS_PER_BLOCK_FOR_LOAD 30
+#define TIDS_PER_BLOCK_FOR_LOOKUP 50
+
+//#define RT_DEBUG
+#define RT_PREFIX rt
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#define RT_USE_DELETE
+#define RT_VALUE_TYPE uint64
+// WIP: compiles with warnings because rt_attach is defined but not used
+// #define RT_SHMEM
+#include "lib/radixtree.h"
+
+/*
+ * Return the number of keys in the radix tree.
+ */
+static uint64
+rt_num_entries(rt_radix_tree *tree)
+{
+ return tree->ctl->num_keys;
+}
+
+
+PG_FUNCTION_INFO_V1(bench_seq_search);
+PG_FUNCTION_INFO_V1(bench_shuffle_search);
+PG_FUNCTION_INFO_V1(bench_load_random_int);
+PG_FUNCTION_INFO_V1(bench_fixed_height_search);
+PG_FUNCTION_INFO_V1(bench_search_random_nodes);
+PG_FUNCTION_INFO_V1(bench_node128_load);
+PG_FUNCTION_INFO_V1(bench_tidstore_load);
+
+static uint64
+tid_to_key_off(ItemPointer tid, uint32 *off)
+{
+ uint64 upper;
+ uint32 shift = pg_ceil_log2_32(MaxHeapTuplesPerPage);
+ int64 tid_i;
+
+ Assert(ItemPointerGetOffsetNumber(tid) < MaxHeapTuplesPerPage);
+
+ tid_i = ItemPointerGetOffsetNumber(tid);
+ tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << shift;
+
+ /* log(sizeof(uint64) * BITS_PER_BYTE, 2) = log(64, 2) = 6 */
+ *off = tid_i & ((1 << 6) - 1);
+ upper = tid_i >> 6;
+ Assert(*off < (sizeof(uint64) * BITS_PER_BYTE));
+
+ Assert(*off < 64);
+
+ return upper;
+}
+
+static int
+shuffle_randrange(pg_prng_state *state, int lower, int upper)
+{
+ return (int) floor(pg_prng_double(state) * ((upper - lower) + 0.999999)) + lower;
+}
+
+/* Naive Fisher-Yates implementation*/
+static void
+shuffle_itemptrs(ItemPointer itemptr, uint64 nitems)
+{
+ /* reproducibility */
+ pg_prng_state state;
+
+ pg_prng_seed(&state, 0);
+
+ for (int i = 0; i < nitems - 1; i++)
+ {
+ int j = shuffle_randrange(&state, i, nitems - 1);
+ ItemPointerData t = itemptr[j];
+
+ itemptr[j] = itemptr[i];
+ itemptr[i] = t;
+ }
+}
+
+static ItemPointer
+generate_tids(BlockNumber minblk, BlockNumber maxblk, int ntids_per_blk, uint64 *ntids_p,
+ bool random_block)
+{
+ ItemPointer tids;
+ uint64 maxitems;
+ uint64 ntids = 0;
+ pg_prng_state state;
+
+ maxitems = (maxblk - minblk + 1) * ntids_per_blk;
+ tids = MemoryContextAllocHuge(TopTransactionContext,
+ sizeof(ItemPointerData) * maxitems);
+
+ if (random_block)
+ pg_prng_seed(&state, 0x9E3779B185EBCA87);
+
+ for (BlockNumber blk = minblk; blk < maxblk; blk++)
+ {
+ if (random_block && !pg_prng_bool(&state))
+ continue;
+
+ for (OffsetNumber off = FirstOffsetNumber;
+ off <= ntids_per_blk; off++)
+ {
+ CHECK_FOR_INTERRUPTS();
+
+ ItemPointerSetBlockNumber(&(tids[ntids]), blk);
+ ItemPointerSetOffsetNumber(&(tids[ntids]), off);
+
+ ntids++;
+ }
+ }
+
+ *ntids_p = ntids;
+ return tids;
+}
+
+#ifdef MEASURE_BINARY_SEARCH
+static int
+vac_cmp_itemptr(const void *left, const void *right)
+{
+ BlockNumber lblk,
+ rblk;
+ OffsetNumber loff,
+ roff;
+
+ lblk = ItemPointerGetBlockNumber((ItemPointer) left);
+ rblk = ItemPointerGetBlockNumber((ItemPointer) right);
+
+ if (lblk < rblk)
+ return -1;
+ if (lblk > rblk)
+ return 1;
+
+ loff = ItemPointerGetOffsetNumber((ItemPointer) left);
+ roff = ItemPointerGetOffsetNumber((ItemPointer) right);
+
+ if (loff < roff)
+ return -1;
+ if (loff > roff)
+ return 1;
+
+ return 0;
+}
+#endif
+
+Datum
+bench_tidstore_load(PG_FUNCTION_ARGS)
+{
+ BlockNumber minblk = PG_GETARG_INT32(0);
+ BlockNumber maxblk = PG_GETARG_INT32(1);
+ TidStore *ts;
+ TidStoreIter *iter;
+ TidStoreIterResult *result;
+ OffsetNumber *offs;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 load_ms;
+ int64 iter_ms;
+ TupleDesc tupdesc;
+ Datum values[3];
+ bool nulls[3] = {false};
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ offs = palloc(sizeof(OffsetNumber) * TIDS_PER_BLOCK_FOR_LOAD);
+ for (int i = 0; i < TIDS_PER_BLOCK_FOR_LOAD; i++)
+ offs[i] = i + 1; /* FirstOffsetNumber is 1 */
+
+ ts = tidstore_create(1 * 1024L * 1024L * 1024L, MaxHeapTuplesPerPage, NULL);
+
+ /* load tids */
+ start_time = GetCurrentTimestamp();
+ for (BlockNumber blkno = minblk; blkno < maxblk; blkno++)
+ tidstore_add_tids(ts, blkno, offs, TIDS_PER_BLOCK_FOR_LOAD);
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ load_ms = secs * 1000 + usecs / 1000;
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+ /* iterate through tids */
+ iter = tidstore_begin_iterate(ts);
+ start_time = GetCurrentTimestamp();
+ while ((result = tidstore_iterate_next(iter)) != NULL)
+ ;
+ tidstore_end_iterate(iter);
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ iter_ms = secs * 1000 + usecs / 1000;
+
+ values[0] = Int64GetDatum(tidstore_memory_usage(ts));
+ values[1] = Int64GetDatum(load_ms);
+ values[2] = Int64GetDatum(iter_ms);
+
+ tidstore_destroy(ts);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+static Datum
+bench_search(FunctionCallInfo fcinfo, bool shuffle)
+{
+ BlockNumber minblk = PG_GETARG_INT32(0);
+ BlockNumber maxblk = PG_GETARG_INT32(1);
+ bool random_block = PG_GETARG_BOOL(2);
+ rt_radix_tree *rt = NULL;
+ uint64 ntids;
+ uint64 key;
+ uint64 last_key = PG_UINT64_MAX;
+ uint64 val = 0;
+ ItemPointer tids;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_load_ms,
+ rt_search_ms;
+ Datum values[7];
+ bool nulls[7];
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ tids = generate_tids(minblk, maxblk, TIDS_PER_BLOCK_FOR_LOAD, &ntids, random_block);
+
+ /* measure the load time of the radix tree */
+ rt = rt_create(CurrentMemoryContext);
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ uint32 off;
+
+ CHECK_FOR_INTERRUPTS();
+
+ key = tid_to_key_off(tid, &off);
+
+ if (last_key != PG_UINT64_MAX && last_key != key)
+ {
+ rt_set(rt, last_key, &val);
+ val = 0;
+ }
+
+ last_key = key;
+ val |= (uint64) 1 << off;
+ }
+ if (last_key != PG_UINT64_MAX)
+ rt_set(rt, last_key, &val);
+
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_load_ms = secs * 1000 + usecs / 1000;
+
+#ifdef RT_DEBUG
+ rt_stats(rt);
+#endif
+ if (shuffle)
+ shuffle_itemptrs(tids, ntids);
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+	/* measure the search time of the radix tree */
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ uint32 off;
+ volatile bool ret; /* prevent calling rt_search from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ key = tid_to_key_off(tid, &off);
+
+ ret = rt_search(rt, key, &val);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_search_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int64GetDatum(rt_num_entries(rt));
+ values[1] = Int64GetDatum(rt_memory_usage(rt));
+ values[2] = Int64GetDatum(sizeof(ItemPointerData) * ntids);
+ values[3] = Int64GetDatum(rt_load_ms);
+ nulls[4] = true; /* ar_load_ms */
+ values[5] = Int64GetDatum(rt_search_ms);
+ nulls[6] = true; /* ar_search_ms */
+
+#ifdef MEASURE_BINARY_SEARCH
+ {
+ ItemPointer itemptrs = NULL;
+
+ int64 ar_load_ms,
+ ar_search_ms;
+
+ /* measure the load time of the array */
+ itemptrs = MemoryContextAllocHuge(CurrentMemoryContext,
+ sizeof(ItemPointerData) * ntids);
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointerSetBlockNumber(&(itemptrs[i]),
+ ItemPointerGetBlockNumber(&(tids[i])));
+ ItemPointerSetOffsetNumber(&(itemptrs[i]),
+ ItemPointerGetOffsetNumber(&(tids[i])));
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ ar_load_ms = secs * 1000 + usecs / 1000;
+
+		/* next, measure the search time of the array */
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ volatile bool ret; /* prevent calling bsearch from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ ret = bsearch((void *) tid,
+ (void *) itemptrs,
+ ntids,
+ sizeof(ItemPointerData),
+ vac_cmp_itemptr);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ ar_search_ms = secs * 1000 + usecs / 1000;
+
+ /* set the result */
+ nulls[4] = false;
+ values[4] = Int64GetDatum(ar_load_ms);
+ nulls[6] = false;
+ values[6] = Int64GetDatum(ar_search_ms);
+ }
+#endif
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_seq_search(PG_FUNCTION_ARGS)
+{
+ return bench_search(fcinfo, false);
+}
+
+Datum
+bench_shuffle_search(PG_FUNCTION_ARGS)
+{
+ return bench_search(fcinfo, true);
+}
+
+Datum
+bench_load_random_int(PG_FUNCTION_ARGS)
+{
+ uint64 cnt = (uint64) PG_GETARG_INT64(0);
+ rt_radix_tree *rt;
+ pg_prng_state state;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 load_time_ms;
+ Datum values[2];
+ bool nulls[2];
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ pg_prng_seed(&state, 0);
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ uint64 key = pg_prng_uint64(&state);
+
+ rt_set(rt, key, &key);
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ load_time_ms = secs * 1000 + usecs / 1000;
+
+#ifdef RT_DEBUG
+ rt_stats(rt);
+#endif
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int64GetDatum(rt_memory_usage(rt));
+ values[1] = Int64GetDatum(load_time_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+/* copy of splitmix64() */
+static uint64
+hash64(uint64 x)
+{
+ x ^= x >> 30;
+ x *= UINT64CONST(0xbf58476d1ce4e5b9);
+ x ^= x >> 27;
+ x *= UINT64CONST(0x94d049bb133111eb);
+ x ^= x >> 31;
+ return x;
+}
+
+/* attempts to have a relatively even population of node kinds */
+Datum
+bench_search_random_nodes(PG_FUNCTION_ARGS)
+{
+ uint64 cnt = (uint64) PG_GETARG_INT64(0);
+ rt_radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 load_time_ms;
+ int64 search_time_ms;
+ Datum values[3] = {0};
+ bool nulls[3] = {0};
+ /* from trial and error */
+ uint64 filter = (((uint64) 0x7F<<32) | (0x07<<24) | (0xFF<<16) | 0xFF);
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ if (!PG_ARGISNULL(1))
+ {
+ char *filter_str = text_to_cstring(PG_GETARG_TEXT_P(1));
+
+ if (sscanf(filter_str, "0x%lX", &filter) == 0)
+ elog(ERROR, "invalid filter string %s", filter_str);
+ }
+ elog(NOTICE, "bench with filter 0x%lX", filter);
+
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ uint64 hash = hash64(i);
+ uint64 key = hash & filter;
+
+ rt_set(rt, key, &key);
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ load_time_ms = secs * 1000 + usecs / 1000;
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ uint64 hash = hash64(i);
+ uint64 key = hash & filter;
+ uint64 val;
+ volatile bool ret; /* prevent calling rt_search from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ ret = rt_search(rt, key, &val);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ search_time_ms = secs * 1000 + usecs / 1000;
+
+#ifdef RT_DEBUG
+ rt_stats(rt);
+#endif
+
+ values[0] = Int64GetDatum(rt_memory_usage(rt));
+ values[1] = Int64GetDatum(load_time_ms);
+ values[2] = Int64GetDatum(search_time_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_fixed_height_search(PG_FUNCTION_ARGS)
+{
+ int fanout = PG_GETARG_INT32(0);
+ rt_radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_load_ms,
+ rt_search_ms;
+ Datum values[5];
+ bool nulls[5];
+
+ /* test boundary between vector and iteration */
+ const int n_keys = 5 * 16 * 16 * 16 * 16;
+ uint64 r,
+ h,
+ i,
+ j,
+ k;
+ uint64 key_id;
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+
+ key_id = 0;
+
+ /*
+ * lower nodes have limited fanout, the top is only limited by
+ * bits-per-byte
+ */
+ for (r = 1;; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ for (i = 1; i <= fanout; i++)
+ {
+ for (j = 1; j <= fanout; j++)
+ {
+ for (k = 1; k <= fanout; k++)
+ {
+ uint64 key;
+
+ key = (r << 32) | (h << 24) | (i << 16) | (j << 8) | (k);
+
+ CHECK_FOR_INTERRUPTS();
+
+ key_id++;
+ if (key_id > n_keys)
+ goto finish_set;
+
+ rt_set(rt, key, &key_id);
+ }
+ }
+ }
+ }
+ }
+finish_set:
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_load_ms = secs * 1000 + usecs / 1000;
+
+#ifdef RT_DEBUG
+ rt_stats(rt);
+#endif
+
+	/* measure the search time of the radix tree */
+ start_time = GetCurrentTimestamp();
+
+ key_id = 0;
+ for (r = 1;; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ for (i = 1; i <= fanout; i++)
+ {
+ for (j = 1; j <= fanout; j++)
+ {
+ for (k = 1; k <= fanout; k++)
+ {
+ uint64 key,
+ val;
+
+ key = (r << 32) | (h << 24) | (i << 16) | (j << 8) | (k);
+
+ CHECK_FOR_INTERRUPTS();
+
+ key_id++;
+ if (key_id > n_keys)
+ goto finish_search;
+
+ rt_search(rt, key, &val);
+ }
+ }
+ }
+ }
+ }
+finish_search:
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_search_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int32GetDatum(fanout);
+ values[1] = Int64GetDatum(rt_num_entries(rt));
+ values[2] = Int64GetDatum(rt_memory_usage(rt));
+ values[3] = Int64GetDatum(rt_load_ms);
+ values[4] = Int64GetDatum(rt_search_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_node128_load(PG_FUNCTION_ARGS)
+{
+ int fanout = PG_GETARG_INT32(0);
+ rt_radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_sparseload_ms;
+ Datum values[5];
+ bool nulls[5];
+
+ uint64 r,
+ h;
+ uint64 key_id;
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ rt = rt_create(CurrentMemoryContext);
+
+ key_id = 0;
+
+ for (r = 1; r <= fanout; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (h);
+
+ key_id++;
+ rt_set(rt, key, &key_id);
+ }
+ }
+
+#ifdef RT_DEBUG
+ rt_stats(rt);
+#endif
+
+ /* measure sparse deletion and re-loading */
+ start_time = GetCurrentTimestamp();
+
+ for (int t = 0; t<10000; t++)
+ {
+ /* delete one key in each leaf */
+ for (r = 1; r <= fanout; r++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (fanout);
+
+ rt_delete(rt, key);
+ }
+
+ /* add them all back */
+ key_id = 0;
+ for (r = 1; r <= fanout; r++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (fanout);
+
+ key_id++;
+ rt_set(rt, key, &key_id);
+ }
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_sparseload_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int32GetDatum(fanout);
+ values[1] = Int64GetDatum(rt_num_entries(rt));
+ values[2] = Int64GetDatum(rt_memory_usage(rt));
+ values[3] = Int64GetDatum(rt_sparseload_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+/* to silence warnings about unused iter functions */
+static void pg_attribute_unused()
+stub_iter()
+{
+ rt_radix_tree *rt;
+ rt_iter *iter;
+ uint64 key = 1;
+ uint64 value = 1;
+
+ rt = rt_create(CurrentMemoryContext);
+
+ iter = rt_begin_iterate(rt);
+ rt_iterate_next(iter, &key, &value);
+ rt_end_iterate(iter);
+}
\ No newline at end of file
diff --git a/contrib/bench_radix_tree/bench_radix_tree.control b/contrib/bench_radix_tree/bench_radix_tree.control
new file mode 100644
index 0000000000..1d988e6c9a
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.control
@@ -0,0 +1,6 @@
+# bench_radix_tree extension
+comment = 'benchmark suite for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/bench_radix_tree'
+relocatable = true
+trusted = true
diff --git a/contrib/bench_radix_tree/expected/bench.out b/contrib/bench_radix_tree/expected/bench.out
new file mode 100644
index 0000000000..60c303892e
--- /dev/null
+++ b/contrib/bench_radix_tree/expected/bench.out
@@ -0,0 +1,13 @@
+create extension bench_radix_tree;
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
diff --git a/contrib/bench_radix_tree/meson.build b/contrib/bench_radix_tree/meson.build
new file mode 100644
index 0000000000..332c1ae7df
--- /dev/null
+++ b/contrib/bench_radix_tree/meson.build
@@ -0,0 +1,33 @@
+bench_radix_tree_sources = files(
+ 'bench_radix_tree.c',
+)
+
+if host_system == 'windows'
+ bench_radix_tree_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'bench_radix_tree',
+ '--FILEDESC', 'bench_radix_tree - performance test code for radix tree',])
+endif
+
+bench_radix_tree = shared_module('bench_radix_tree',
+ bench_radix_tree_sources,
+ link_with: pgport_srv,
+ kwargs: pg_mod_args,
+)
+testprep_targets += bench_radix_tree
+
+install_data(
+ 'bench_radix_tree.control',
+ 'bench_radix_tree--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'bench_radix_tree',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'bench_radix_tree',
+ ],
+ },
+}
diff --git a/contrib/bench_radix_tree/sql/bench.sql b/contrib/bench_radix_tree/sql/bench.sql
new file mode 100644
index 0000000000..a46018c9d4
--- /dev/null
+++ b/contrib/bench_radix_tree/sql/bench.sql
@@ -0,0 +1,16 @@
+create extension bench_radix_tree;
+
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
diff --git a/contrib/meson.build b/contrib/meson.build
index bd4a57c43c..421d469f8c 100644
--- a/contrib/meson.build
+++ b/contrib/meson.build
@@ -12,6 +12,7 @@ subdir('amcheck')
subdir('auth_delay')
subdir('auto_explain')
subdir('basic_archive')
+subdir('bench_radix_tree')
subdir('bloom')
subdir('basebackup_to_shell')
subdir('bool_plperl')
--
2.31.1
v32-0004-Add-TIDStore-to-store-sets-of-TIDs-ItemPointerDa.patch
From a804e3ebba8733d65497d5e9c3a47b32f175ea1e Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 23 Dec 2022 23:50:39 +0900
Subject: [PATCH v32 04/18] Add TIDStore, to store sets of TIDs
(ItemPointerData) efficiently.
The TIDStore is designed to store large sets of TIDs efficiently, and
is backed by the radix tree. A TID is encoded into a 64-bit key and
value and inserted into the radix tree.
The TIDStore is not used for anything yet, aside from the test code,
but the follow-up patch integrates the TIDStore with lazy vacuum,
reducing lazy vacuum memory usage and lifting the 1GB limit on its
size, by storing the list of dead TIDs more efficiently.
This includes a unit test module, in src/test/modules/test_tidstore.
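For illustration (this sketch is not part of the patch), a backend-local
caller of the API declared in src/include/access/tidstore.h below might look
like the following; the memory limit and block/offset values are arbitrary,
and the offsets passed to tidstore_add_tids() must be sorted in ascending
order:

TidStore    *ts = tidstore_create(64 * 1024 * 1024, MaxHeapTuplesPerPage, NULL);
OffsetNumber offs[] = {1, 5, 10};   /* must be in ascending order */
ItemPointerData tid;

/* record three dead item pointers on block 42 */
tidstore_add_tids(ts, (BlockNumber) 42, offs, lengthof(offs));

/* membership check, as an index-vacuum callback would do */
ItemPointerSet(&tid, 42, 5);
Assert(tidstore_lookup_tid(ts, &tid));

tidstore_destroy(ts);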
---
doc/src/sgml/monitoring.sgml | 4 +
src/backend/access/common/Makefile | 1 +
src/backend/access/common/meson.build | 1 +
src/backend/access/common/tidstore.c | 681 ++++++++++++++++++
src/backend/storage/lmgr/lwlock.c | 2 +
src/include/access/tidstore.h | 49 ++
src/include/storage/lwlock.h | 1 +
src/test/modules/Makefile | 1 +
src/test/modules/meson.build | 1 +
src/test/modules/test_tidstore/Makefile | 23 +
.../test_tidstore/expected/test_tidstore.out | 13 +
src/test/modules/test_tidstore/meson.build | 35 +
.../test_tidstore/sql/test_tidstore.sql | 7 +
.../test_tidstore/test_tidstore--1.0.sql | 8 +
.../modules/test_tidstore/test_tidstore.c | 226 ++++++
.../test_tidstore/test_tidstore.control | 4 +
16 files changed, 1057 insertions(+)
create mode 100644 src/backend/access/common/tidstore.c
create mode 100644 src/include/access/tidstore.h
create mode 100644 src/test/modules/test_tidstore/Makefile
create mode 100644 src/test/modules/test_tidstore/expected/test_tidstore.out
create mode 100644 src/test/modules/test_tidstore/meson.build
create mode 100644 src/test/modules/test_tidstore/sql/test_tidstore.sql
create mode 100644 src/test/modules/test_tidstore/test_tidstore--1.0.sql
create mode 100644 src/test/modules/test_tidstore/test_tidstore.c
create mode 100644 src/test/modules/test_tidstore/test_tidstore.control
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 2903b67170..be4448fe6e 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -2211,6 +2211,10 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
<entry>Waiting to access a shared TID bitmap during a parallel bitmap
index scan.</entry>
</row>
+ <row>
+ <entry><literal>SharedTidStore</literal></entry>
+ <entry>Waiting to access a shared TID store.</entry>
+ </row>
<row>
<entry><literal>SharedTupleStore</literal></entry>
<entry>Waiting to access a shared tuple store during parallel
diff --git a/src/backend/access/common/Makefile b/src/backend/access/common/Makefile
index b9aff0ccfd..67b8cc6108 100644
--- a/src/backend/access/common/Makefile
+++ b/src/backend/access/common/Makefile
@@ -27,6 +27,7 @@ OBJS = \
syncscan.o \
toast_compression.o \
toast_internals.o \
+ tidstore.o \
tupconvert.o \
tupdesc.o
diff --git a/src/backend/access/common/meson.build b/src/backend/access/common/meson.build
index f5ac17b498..fce19c09ce 100644
--- a/src/backend/access/common/meson.build
+++ b/src/backend/access/common/meson.build
@@ -15,6 +15,7 @@ backend_sources += files(
'syncscan.c',
'toast_compression.c',
'toast_internals.c',
+ 'tidstore.c',
'tupconvert.c',
'tupdesc.c',
)
diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
new file mode 100644
index 0000000000..8c05e60d92
--- /dev/null
+++ b/src/backend/access/common/tidstore.c
@@ -0,0 +1,681 @@
+/*-------------------------------------------------------------------------
+ *
+ * tidstore.c
+ * Tid (ItemPointerData) storage implementation.
+ *
+ * This module provides an in-memory data structure to store Tids (ItemPointer).
+ * Internally, a tid is encoded as a pair of 64-bit key and 64-bit value, and
+ * stored in the radix tree.
+ *
+ * A TidStore can be shared among parallel worker processes by passing a DSA
+ * area to tidstore_create(). Other backends can attach to the shared TidStore
+ * by tidstore_attach().
+ *
+ * As for concurrency, we basically rely on the concurrency support in the
+ * radix tree, but we acquire the lock on a TidStore in some cases, for
+ * example, when resetting the store and when accessing the number of tids
+ * in the store (num_tids).
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/common/tidstore.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/tidstore.h"
+#include "miscadmin.h"
+#include "port/pg_bitutils.h"
+#include "storage/lwlock.h"
+#include "utils/dsa.h"
+#include "utils/memutils.h"
+
+/*
+ * For encoding purposes, tids are represented as a pair of 64-bit key and
+ * 64-bit value. First, we construct a 64-bit unsigned integer by combining
+ * the block number and the offset number. The number of bits used for the
+ * offset number is specified by max_offset in tidstore_create(). We are
+ * frugal with the bits, because smaller keys could help keep the radix
+ * tree shallow.
+ *
+ * For example, a tid of heap with 8kB blocks uses the lowest 9 bits for
+ * the offset number and uses the next 32 bits for the block number. That
+ * is, only 41 bits are used:
+ *
+ * uuuuuuuY YYYYYYYY YYYYYYYY YYYYYYYY YYYYYYYX XXXXXXXX
+ *
+ * X = bits used for offset number
+ * Y = bits used for block number
+ * u = unused bit
+ * (high on the left, low on the right)
+ *
+ * 9 bits are enough for the offset number, because MaxHeapTuplesPerPage < 2^9
+ * on 8kB blocks.
+ *
+ * The 64-bit value is the bitmap representation of the lowest 6 bits
+ * (TIDSTORE_VALUE_NBITS) of the integer, and the remaining 35 bits are used
+ * as the key:
+ *
+ * uuuuuuuY YYYYYYYY YYYYYYYY YYYYYYYY YYYYYYYX XXXXXXXX
+ * |----| value
+ * |---------------------------------------------| key
+ *
+ * The maximum height of the radix tree is 5 in this case.
+ */
+#define TIDSTORE_VALUE_NBITS 6 /* log(64, 2) */
+#define TIDSTORE_OFFSET_MASK ((1 << TIDSTORE_VALUE_NBITS) - 1)
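/*
 * Worked example (illustrative, not part of the original patch): with 8kB
 * heap blocks offset_nbits is 9, so for the tid (block 1000, offset 7):
 *
 *   tid_i = 7 | (1000 << 9) = 512007
 *   key   = tid_i >> 6      = 8000
 *   value = UINT64CONST(1) << (7 & TIDSTORE_OFFSET_MASK), i.e. bit 7 is set
 *
 * Offsets 1..63 of block 1000 all map to key 8000 and share one 64-bit
 * bitmap value; offset 64 starts key 8001.
 */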
+
+/* A magic value used to identify our TidStores. */
+#define TIDSTORE_MAGIC 0x826f6a10
+
+#define RT_PREFIX local_rt
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#define RT_VALUE_TYPE uint64
+#include "lib/radixtree.h"
+
+#define RT_PREFIX shared_rt
+#define RT_SHMEM
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#define RT_VALUE_TYPE uint64
+#include "lib/radixtree.h"
+
+/* The control object for a TidStore */
+typedef struct TidStoreControl
+{
+ /* the number of tids in the store */
+ int64 num_tids;
+
+ /* These values are never changed after creation */
+ size_t max_bytes; /* the maximum bytes a TidStore can use */
+ int max_offset; /* the maximum offset number */
+ int offset_nbits; /* the number of bits required for an offset
+ * number */
+ int offset_key_nbits; /* the number of bits of an offset number
+ * used in a key */
+
+ /* The below fields are used only in shared case */
+
+ uint32 magic;
+ LWLock lock;
+
+ /* handles for TidStore and radix tree */
+ tidstore_handle handle;
+ shared_rt_handle tree_handle;
+} TidStoreControl;
+
+/* Per-backend state for a TidStore */
+struct TidStore
+{
+ /*
+ * Control object. This is allocated in DSA area 'area' in the shared
+ * case, otherwise in backend-local memory.
+ */
+ TidStoreControl *control;
+
+ /* Storage for Tids. Use either one depending on TidStoreIsShared() */
+ union
+ {
+ local_rt_radix_tree *local;
+ shared_rt_radix_tree *shared;
+ } tree;
+
+ /* DSA area for TidStore if used */
+ dsa_area *area;
+};
+#define TidStoreIsShared(ts) ((ts)->area != NULL)
+
+/* Iterator for TidStore */
+typedef struct TidStoreIter
+{
+ TidStore *ts;
+
+ /* iterator of radix tree. Use either one depending on TidStoreIsShared() */
+ union
+ {
+ shared_rt_iter *shared;
+ local_rt_iter *local;
+ } tree_iter;
+
+	/* have we returned all tids? */
+ bool finished;
+
+ /* save for the next iteration */
+ uint64 next_key;
+ uint64 next_val;
+
+ /* output for the caller */
+ TidStoreIterResult result;
+} TidStoreIter;
+
+static void tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, uint64 val);
+static inline BlockNumber key_get_blkno(TidStore *ts, uint64 key);
+static inline uint64 encode_key_off(TidStore *ts, BlockNumber block, uint32 offset, uint64 *off_bit);
+static inline uint64 tid_to_key_off(TidStore *ts, ItemPointer tid, uint64 *off_bit);
+
+/*
+ * Create a TidStore. The returned object is allocated in backend-local memory.
+ * The radix tree for storage is allocated in the DSA area if 'area' is non-NULL.
+ */
+TidStore *
+tidstore_create(size_t max_bytes, int max_offset, dsa_area *area)
+{
+ TidStore *ts;
+
+ ts = palloc0(sizeof(TidStore));
+
+ /*
+ * Create the radix tree for the main storage.
+ *
+	 * Memory consumption depends not only on the number of stored tids, but
+	 * also on their distribution, on how the radix tree stores them, and on
+	 * the memory management that backs the radix tree. The maximum number of
+	 * bytes a TidStore can use is specified by max_bytes in
+	 * tidstore_create(). We want the total memory consumption of a TidStore
+	 * not to exceed max_bytes.
+	 *
+	 * In the local TidStore case, the radix tree uses a slab allocator for
+	 * each kind of node class. The most memory-consuming case while adding
+	 * tids associated with one page (i.e. during tidstore_add_tids()) is
+	 * allocating a new slab block for a new radix tree node, which is
+	 * approximately 70kB. Therefore, we deduct 70kB from max_bytes.
+	 *
+	 * In the shared case, DSA allocates memory segments big enough to follow
+	 * a geometric series that approximately doubles the total DSA size (see
+	 * make_new_segment() in dsa.c). We simulated how DSA increases the
+	 * segment size, and the simulation revealed that a 75% threshold for the
+	 * maximum bytes works perfectly when max_bytes is a power of two, and a
+	 * 60% threshold works for other cases.
+ */
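	/*
	 * For example (illustrative numbers): with max_bytes = 1GB (a power of
	 * two) the shared-case limit becomes 0.75 * 1GB = 768MB, while
	 * max_bytes = 1.5GB gives 0.6 * 1.5GB = ~920MB; in the local case,
	 * 1GB gives 1GB minus 70kB.
	 */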
+ if (area != NULL)
+ {
+ dsa_pointer dp;
+ float ratio = ((max_bytes & (max_bytes - 1)) == 0) ? 0.75 : 0.6;
+
+ ts->tree.shared = shared_rt_create(CurrentMemoryContext, area,
+ LWTRANCHE_SHARED_TIDSTORE);
+
+ dp = dsa_allocate0(area, sizeof(TidStoreControl));
+ ts->control = (TidStoreControl *) dsa_get_address(area, dp);
+ ts->control->max_bytes = (uint64) (max_bytes * ratio);
+ ts->area = area;
+
+ ts->control->magic = TIDSTORE_MAGIC;
+ LWLockInitialize(&ts->control->lock, LWTRANCHE_SHARED_TIDSTORE);
+ ts->control->handle = dp;
+ ts->control->tree_handle = shared_rt_get_handle(ts->tree.shared);
+ }
+ else
+ {
+ ts->tree.local = local_rt_create(CurrentMemoryContext);
+
+ ts->control = (TidStoreControl *) palloc0(sizeof(TidStoreControl));
+ ts->control->max_bytes = max_bytes - (70 * 1024);
+ }
+
+ ts->control->max_offset = max_offset;
+ ts->control->offset_nbits = pg_ceil_log2_32(max_offset);
+
+ if (ts->control->offset_nbits < TIDSTORE_VALUE_NBITS)
+ ts->control->offset_nbits = TIDSTORE_VALUE_NBITS;
+
+ ts->control->offset_key_nbits =
+ ts->control->offset_nbits - TIDSTORE_VALUE_NBITS;
+
+ return ts;
+}
+
+/*
+ * Attach to the shared TidStore using a handle. The returned object is
+ * allocated in backend-local memory using the CurrentMemoryContext.
+ */
+TidStore *
+tidstore_attach(dsa_area *area, tidstore_handle handle)
+{
+ TidStore *ts;
+ dsa_pointer control;
+
+ Assert(area != NULL);
+ Assert(DsaPointerIsValid(handle));
+
+ /* create per-backend state */
+ ts = palloc0(sizeof(TidStore));
+
+ /* Find the control object in shared memory */
+ control = handle;
+
+ /* Set up the TidStore */
+ ts->control = (TidStoreControl *) dsa_get_address(area, control);
+ Assert(ts->control->magic == TIDSTORE_MAGIC);
+
+ ts->tree.shared = shared_rt_attach(area, ts->control->tree_handle);
+ ts->area = area;
+
+ return ts;
+}
+
+/*
+ * Detach from a TidStore. This detaches from radix tree and frees the
+ * backend-local resources. The radix tree will continue to exist until
+ * it is either explicitly destroyed, or the area that backs it is returned
+ * to the operating system.
+ */
+void
+tidstore_detach(TidStore *ts)
+{
+ Assert(TidStoreIsShared(ts) && ts->control->magic == TIDSTORE_MAGIC);
+
+ shared_rt_detach(ts->tree.shared);
+ pfree(ts);
+}
+
+/*
+ * Destroy a TidStore, returning all memory.
+ *
+ * TODO: The caller must be certain that no other backend will attempt to
+ * access the TidStore before calling this function. Other backends must
+ * explicitly call tidstore_detach to free up backend-local memory associated
+ * with the TidStore. The backend that calls tidstore_destroy must not call
+ * tidstore_detach.
+ */
+void
+tidstore_destroy(TidStore *ts)
+{
+ if (TidStoreIsShared(ts))
+ {
+ Assert(ts->control->magic == TIDSTORE_MAGIC);
+
+ /*
+ * Vandalize the control block to help catch programming error where
+ * other backends access the memory formerly occupied by this radix
+ * tree.
+ */
+ ts->control->magic = 0;
+ dsa_free(ts->area, ts->control->handle);
+ shared_rt_free(ts->tree.shared);
+ }
+ else
+ {
+ pfree(ts->control);
+ local_rt_free(ts->tree.local);
+ }
+
+ pfree(ts);
+}
+
+/*
+ * Forget all collected tids. This is similar to tidstore_destroy, but instead
+ * of freeing the entire TidStore we recreate only the radix tree storage.
+ */
+void
+tidstore_reset(TidStore *ts)
+{
+ if (TidStoreIsShared(ts))
+ {
+ Assert(ts->control->magic == TIDSTORE_MAGIC);
+
+ LWLockAcquire(&ts->control->lock, LW_EXCLUSIVE);
+
+ /*
+ * Free the radix tree and return allocated DSA segments to
+ * the operating system.
+ */
+ shared_rt_free(ts->tree.shared);
+ dsa_trim(ts->area);
+
+ /* Recreate the radix tree */
+ ts->tree.shared = shared_rt_create(CurrentMemoryContext, ts->area,
+ LWTRANCHE_SHARED_TIDSTORE);
+
+ /* update the radix tree handle as we recreated it */
+ ts->control->tree_handle = shared_rt_get_handle(ts->tree.shared);
+
+ /* Reset the statistics */
+ ts->control->num_tids = 0;
+
+ LWLockRelease(&ts->control->lock);
+ }
+ else
+ {
+ local_rt_free(ts->tree.local);
+ ts->tree.local = local_rt_create(CurrentMemoryContext);
+
+ /* Reset the statistics */
+ ts->control->num_tids = 0;
+ }
+}
+
+/* Add Tids on a block to TidStore */
+void
+tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets)
+{
+ uint64 *values;
+ uint64 key;
+ uint64 prev_key;
+ uint64 off_bitmap = 0;
+ int idx;
+ const uint64 key_base = ((uint64) blkno) << ts->control->offset_key_nbits;
+ const int nkeys = UINT64CONST(1) << ts->control->offset_key_nbits;
+
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ values = palloc(sizeof(uint64) * nkeys);
+ key = prev_key = key_base;
+
+ for (int i = 0; i < num_offsets; i++)
+ {
+ uint64 off_bit;
+
+ /* encode the tid to a key and partial offset */
+ key = encode_key_off(ts, blkno, offsets[i], &off_bit);
+
+ /* make sure we scanned the line pointer array in order */
+ Assert(key >= prev_key);
+
+ if (key > prev_key)
+ {
+ idx = prev_key - key_base;
+ Assert(idx >= 0 && idx < nkeys);
+
+ /* write out offset bitmap for this key */
+ values[idx] = off_bitmap;
+
+ /* zero out any gaps up to the current key */
+ for (int empty_idx = idx + 1; empty_idx < key - key_base; empty_idx++)
+ values[empty_idx] = 0;
+
+ /* reset for current key -- the current offset will be handled below */
+ off_bitmap = 0;
+ prev_key = key;
+ }
+
+ off_bitmap |= off_bit;
+ }
+
+ /* save the final index for later */
+ idx = key - key_base;
+ /* write out last offset bitmap */
+ values[idx] = off_bitmap;
+
+ if (TidStoreIsShared(ts))
+ LWLockAcquire(&ts->control->lock, LW_EXCLUSIVE);
+
+ /* insert the calculated key-values to the tree */
+ for (int i = 0; i <= idx; i++)
+ {
+ if (values[i])
+ {
+ key = key_base + i;
+
+ if (TidStoreIsShared(ts))
+ shared_rt_set(ts->tree.shared, key, &values[i]);
+ else
+ local_rt_set(ts->tree.local, key, &values[i]);
+ }
+ }
+
+ /* update statistics */
+ ts->control->num_tids += num_offsets;
+
+ if (TidStoreIsShared(ts))
+ LWLockRelease(&ts->control->lock);
+
+ pfree(values);
+}
+
+/* Return true if the given tid is present in the TidStore */
+bool
+tidstore_lookup_tid(TidStore *ts, ItemPointer tid)
+{
+ uint64 key;
+ uint64 val = 0;
+ uint64 off_bit;
+ bool found;
+
+ key = tid_to_key_off(ts, tid, &off_bit);
+
+ if (TidStoreIsShared(ts))
+ found = shared_rt_search(ts->tree.shared, key, &val);
+ else
+ found = local_rt_search(ts->tree.local, key, &val);
+
+ if (!found)
+ return false;
+
+ return (val & off_bit) != 0;
+}
+
+/*
+ * Prepare to iterate through a TidStore. Since the radix tree is locked during
+ * the iteration, tidstore_end_iterate() needs to be called when finished.
+ *
+ * Concurrent updates during the iteration will be blocked when inserting a
+ * key-value pair into the radix tree.
+ */
+TidStoreIter *
+tidstore_begin_iterate(TidStore *ts)
+{
+ TidStoreIter *iter;
+
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ iter = palloc0(sizeof(TidStoreIter));
+ iter->ts = ts;
+
+ iter->result.blkno = InvalidBlockNumber;
+ iter->result.offsets = palloc(sizeof(OffsetNumber) * ts->control->max_offset);
+
+ if (TidStoreIsShared(ts))
+ iter->tree_iter.shared = shared_rt_begin_iterate(ts->tree.shared);
+ else
+ iter->tree_iter.local = local_rt_begin_iterate(ts->tree.local);
+
+	/* If the TidStore is empty, there is nothing to iterate */
+ if (tidstore_num_tids(ts) == 0)
+ iter->finished = true;
+
+ return iter;
+}
+
+static inline bool
+tidstore_iter_kv(TidStoreIter *iter, uint64 *key, uint64 *val)
+{
+ if (TidStoreIsShared(iter->ts))
+ return shared_rt_iterate_next(iter->tree_iter.shared, key, val);
+
+ return local_rt_iterate_next(iter->tree_iter.local, key, val);
+}
+
+/*
+ * Scan the TidStore and return a pointer to TidStoreIterResult that has tids
+ * in one block. We return the block numbers in ascending order and the offset
+ * numbers in each result are also sorted in ascending order.
+ */
+TidStoreIterResult *
+tidstore_iterate_next(TidStoreIter *iter)
+{
+ uint64 key;
+ uint64 val;
+ TidStoreIterResult *result = &(iter->result);
+
+ if (iter->finished)
+ return NULL;
+
+ if (BlockNumberIsValid(result->blkno))
+ {
+ /* Process the previously collected key-value */
+ result->num_offsets = 0;
+ tidstore_iter_extract_tids(iter, iter->next_key, iter->next_val);
+ }
+
+ while (tidstore_iter_kv(iter, &key, &val))
+ {
+ BlockNumber blkno;
+
+ blkno = key_get_blkno(iter->ts, key);
+
+ if (BlockNumberIsValid(result->blkno) && result->blkno != blkno)
+ {
+ /*
+ * We got a key-value pair for a different block. So return the
+ * collected tids, and remember the key-value for the next iteration.
+ */
+ iter->next_key = key;
+ iter->next_val = val;
+ return result;
+ }
+
+ /* Collect tids extracted from the key-value pair */
+ tidstore_iter_extract_tids(iter, key, val);
+ }
+
+ iter->finished = true;
+ return result;
+}
+
+/*
+ * Finish an iteration over TidStore. This needs to be called after finishing
+ * an iteration, or when exiting one early.
+ */
+void
+tidstore_end_iterate(TidStoreIter *iter)
+{
+ if (TidStoreIsShared(iter->ts))
+ shared_rt_end_iterate(iter->tree_iter.shared);
+ else
+ local_rt_end_iterate(iter->tree_iter.local);
+
+ pfree(iter->result.offsets);
+ pfree(iter);
+}
+
+/* Return the number of tids we collected so far */
+int64
+tidstore_num_tids(TidStore *ts)
+{
+ uint64 num_tids;
+
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ if (!TidStoreIsShared(ts))
+ return ts->control->num_tids;
+
+ LWLockAcquire(&ts->control->lock, LW_SHARED);
+ num_tids = ts->control->num_tids;
+ LWLockRelease(&ts->control->lock);
+
+ return num_tids;
+}
+
+/* Return true if the current memory usage of TidStore exceeds the limit */
+bool
+tidstore_is_full(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ return (tidstore_memory_usage(ts) > ts->control->max_bytes);
+}
+
+/* Return the maximum memory TidStore can use */
+size_t
+tidstore_max_memory(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ return ts->control->max_bytes;
+}
+
+/* Return the memory usage of TidStore */
+size_t
+tidstore_memory_usage(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ /*
+ * In the shared case, TidStoreControl and radix_tree are backed by the
+ * same DSA area and rt_memory_usage() returns the value including both.
+ * So we don't need to add the size of TidStoreControl separately.
+ */
+ if (TidStoreIsShared(ts))
+ return sizeof(TidStore) + shared_rt_memory_usage(ts->tree.shared);
+
+	return sizeof(TidStore) + sizeof(TidStoreControl) + local_rt_memory_usage(ts->tree.local);
+}
+
+/*
+ * Get a handle that can be used by other processes to attach to this TidStore
+ */
+tidstore_handle
+tidstore_get_handle(TidStore *ts)
+{
+ Assert(TidStoreIsShared(ts) && ts->control->magic == TIDSTORE_MAGIC);
+
+ return ts->control->handle;
+}
+
+/* Extract tids from the given key-value pair */
+static void
+tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, uint64 val)
+{
+ TidStoreIterResult *result = (&iter->result);
+
+ while (val)
+ {
+ uint64 tid_i;
+ OffsetNumber off;
+
+ tid_i = key << TIDSTORE_VALUE_NBITS;
+ tid_i |= pg_rightmost_one_pos64(val);
+
+ off = tid_i & ((UINT64CONST(1) << iter->ts->control->offset_nbits) - 1);
+
+ Assert(result->num_offsets < iter->ts->control->max_offset);
+ result->offsets[result->num_offsets++] = off;
+
+ /* unset the rightmost bit */
+ val &= ~pg_rightmost_one64(val);
+ }
+
+ result->blkno = key_get_blkno(iter->ts, key);
+}
+
+/* Get block number from the given key */
+static inline BlockNumber
+key_get_blkno(TidStore *ts, uint64 key)
+{
+ return (BlockNumber) (key >> ts->control->offset_key_nbits);
+}
+
+/* Encode a tid to key and offset */
+static inline uint64
+tid_to_key_off(TidStore *ts, ItemPointer tid, uint64 *off_bit)
+{
+ uint32 offset = ItemPointerGetOffsetNumber(tid);
+ BlockNumber block = ItemPointerGetBlockNumber(tid);
+
+ return encode_key_off(ts, block, offset, off_bit);
+}
+
+/* encode a block and offset to a key and partial offset */
+static inline uint64
+encode_key_off(TidStore *ts, BlockNumber block, uint32 offset, uint64 *off_bit)
+{
+ uint64 key;
+ uint64 tid_i;
+ uint32 off_lower;
+
+ off_lower = offset & TIDSTORE_OFFSET_MASK;
+ Assert(off_lower < (sizeof(uint64) * BITS_PER_BYTE));
+
+ *off_bit = UINT64CONST(1) << off_lower;
+ tid_i = offset | ((uint64) block << ts->control->offset_nbits);
+ key = tid_i >> TIDSTORE_VALUE_NBITS;
+
+ return key;
+}
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index d2ec396045..55b3a04097 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -176,6 +176,8 @@ static const char *const BuiltinTrancheNames[] = {
"SharedTupleStore",
/* LWTRANCHE_SHARED_TIDBITMAP: */
"SharedTidBitmap",
+ /* LWTRANCHE_SHARED_TIDSTORE: */
+ "SharedTidStore",
/* LWTRANCHE_PARALLEL_APPEND: */
"ParallelAppend",
/* LWTRANCHE_PER_XACT_PREDICATE_LIST: */
diff --git a/src/include/access/tidstore.h b/src/include/access/tidstore.h
new file mode 100644
index 0000000000..a35a52124a
--- /dev/null
+++ b/src/include/access/tidstore.h
@@ -0,0 +1,49 @@
+/*-------------------------------------------------------------------------
+ *
+ * tidstore.h
+ * Tid storage.
+ *
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/tidstore.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef TIDSTORE_H
+#define TIDSTORE_H
+
+#include "storage/itemptr.h"
+#include "utils/dsa.h"
+
+typedef dsa_pointer tidstore_handle;
+
+typedef struct TidStore TidStore;
+typedef struct TidStoreIter TidStoreIter;
+
+typedef struct TidStoreIterResult
+{
+ BlockNumber blkno;
+ OffsetNumber *offsets;
+ int num_offsets;
+} TidStoreIterResult;
+
+extern TidStore *tidstore_create(size_t max_bytes, int max_offset, dsa_area *dsa);
+extern TidStore *tidstore_attach(dsa_area *dsa, dsa_pointer handle);
+extern void tidstore_detach(TidStore *ts);
+extern void tidstore_destroy(TidStore *ts);
+extern void tidstore_reset(TidStore *ts);
+extern void tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets);
+extern bool tidstore_lookup_tid(TidStore *ts, ItemPointer tid);
+extern TidStoreIter * tidstore_begin_iterate(TidStore *ts);
+extern TidStoreIterResult *tidstore_iterate_next(TidStoreIter *iter);
+extern void tidstore_end_iterate(TidStoreIter *iter);
+extern int64 tidstore_num_tids(TidStore *ts);
+extern bool tidstore_is_full(TidStore *ts);
+extern size_t tidstore_max_memory(TidStore *ts);
+extern size_t tidstore_memory_usage(TidStore *ts);
+extern tidstore_handle tidstore_get_handle(TidStore *ts);
+
+#endif /* TIDSTORE_H */
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index d2c7afb8f4..07002fdfbe 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -199,6 +199,7 @@ typedef enum BuiltinTrancheIds
LWTRANCHE_PER_SESSION_RECORD_TYPMOD,
LWTRANCHE_SHARED_TUPLESTORE,
LWTRANCHE_SHARED_TIDBITMAP,
+ LWTRANCHE_SHARED_TIDSTORE,
LWTRANCHE_PARALLEL_APPEND,
LWTRANCHE_PER_XACT_PREDICATE_LIST,
LWTRANCHE_PGSTATS_DSA,
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 89f42bf9e3..a6ec135430 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -34,6 +34,7 @@ SUBDIRS = \
test_rls_hooks \
test_shm_mq \
test_slru \
+ test_tidstore \
unsafe_tests \
worker_spi
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index beaf4080fb..f126ea9f2e 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -31,5 +31,6 @@ subdir('test_regex')
subdir('test_rls_hooks')
subdir('test_shm_mq')
subdir('test_slru')
+subdir('test_tidstore')
subdir('unsafe_tests')
subdir('worker_spi')
diff --git a/src/test/modules/test_tidstore/Makefile b/src/test/modules/test_tidstore/Makefile
new file mode 100644
index 0000000000..dab107d70c
--- /dev/null
+++ b/src/test/modules/test_tidstore/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_tidstore/Makefile
+
+MODULE_big = test_tidstore
+OBJS = \
+ $(WIN32RES) \
+ test_tidstore.o
+PGFILEDESC = "test_tidstore - test code for src/backend/access/common/tidstore.c"
+
+EXTENSION = test_tidstore
+DATA = test_tidstore--1.0.sql
+
+REGRESS = test_tidstore
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_tidstore
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_tidstore/expected/test_tidstore.out b/src/test/modules/test_tidstore/expected/test_tidstore.out
new file mode 100644
index 0000000000..7ff2f9af87
--- /dev/null
+++ b/src/test/modules/test_tidstore/expected/test_tidstore.out
@@ -0,0 +1,13 @@
+CREATE EXTENSION test_tidstore;
+--
+-- All the logic is in the test_tidstore() function. It will throw
+-- an error if something fails.
+--
+SELECT test_tidstore();
+NOTICE: testing empty tidstore
+NOTICE: testing basic operations
+ test_tidstore
+---------------
+
+(1 row)
+
diff --git a/src/test/modules/test_tidstore/meson.build b/src/test/modules/test_tidstore/meson.build
new file mode 100644
index 0000000000..31f2da7b61
--- /dev/null
+++ b/src/test/modules/test_tidstore/meson.build
@@ -0,0 +1,35 @@
+# FIXME: prevent install during main install, but not during test :/
+
+test_tidstore_sources = files(
+ 'test_tidstore.c',
+)
+
+if host_system == 'windows'
+ test_tidstore_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'test_tidstore',
+ '--FILEDESC', 'test_tidstore - test code for src/backend/access/common/tidstore.c',])
+endif
+
+test_tidstore = shared_module('test_tidstore',
+ test_tidstore_sources,
+ link_with: pgport_srv,
+ kwargs: pg_mod_args,
+)
+testprep_targets += test_tidstore
+
+install_data(
+ 'test_tidstore.control',
+ 'test_tidstore--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'test_tidstore',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'test_tidstore',
+ ],
+ },
+}
diff --git a/src/test/modules/test_tidstore/sql/test_tidstore.sql b/src/test/modules/test_tidstore/sql/test_tidstore.sql
new file mode 100644
index 0000000000..03aea31815
--- /dev/null
+++ b/src/test/modules/test_tidstore/sql/test_tidstore.sql
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_tidstore;
+
+--
+-- All the logic is in the test_tidstore() function. It will throw
+-- an error if something fails.
+--
+SELECT test_tidstore();
diff --git a/src/test/modules/test_tidstore/test_tidstore--1.0.sql b/src/test/modules/test_tidstore/test_tidstore--1.0.sql
new file mode 100644
index 0000000000..47e9149900
--- /dev/null
+++ b/src/test/modules/test_tidstore/test_tidstore--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_tidstore/test_tidstore--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_tidstore" to load this file. \quit
+
+CREATE FUNCTION test_tidstore()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_tidstore/test_tidstore.c b/src/test/modules/test_tidstore/test_tidstore.c
new file mode 100644
index 0000000000..9a1217f833
--- /dev/null
+++ b/src/test/modules/test_tidstore/test_tidstore.c
@@ -0,0 +1,226 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_tidstore.c
+ * Test TidStore data structure.
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/test/modules/test_tidstore/test_tidstore.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "access/tidstore.h"
+#include "fmgr.h"
+#include "miscadmin.h"
+#include "storage/block.h"
+#include "storage/itemptr.h"
+#include "storage/lwlock.h"
+#include "utils/memutils.h"
+
+PG_MODULE_MAGIC;
+
+/* #define TEST_SHARED_TIDSTORE 1 */
+
+#define TEST_TIDSTORE_MAX_BYTES (2 * 1024 * 1024L) /* 2MB */
+
+PG_FUNCTION_INFO_V1(test_tidstore);
+
+static void
+check_tid(TidStore *ts, BlockNumber blkno, OffsetNumber off, bool expect)
+{
+ ItemPointerData tid;
+ bool found;
+
+ ItemPointerSet(&tid, blkno, off);
+
+ found = tidstore_lookup_tid(ts, &tid);
+
+ if (found != expect)
+ elog(ERROR, "lookup TID (%u, %u) returned %d, expected %d",
+ blkno, off, found, expect);
+}
+
+static void
+test_basic(int max_offset)
+{
+#define TEST_TIDSTORE_NUM_BLOCKS 5
+#define TEST_TIDSTORE_NUM_OFFSETS 5
+
+ TidStore *ts;
+ TidStoreIter *iter;
+ TidStoreIterResult *iter_result;
+ BlockNumber blks[TEST_TIDSTORE_NUM_BLOCKS] = {
+ 0, MaxBlockNumber, MaxBlockNumber - 1, 1, MaxBlockNumber / 2,
+ };
+ BlockNumber blks_sorted[TEST_TIDSTORE_NUM_BLOCKS] = {
+ 0, 1, MaxBlockNumber / 2, MaxBlockNumber - 1, MaxBlockNumber
+ };
+ OffsetNumber offs[TEST_TIDSTORE_NUM_OFFSETS];
+ int blk_idx;
+
+#ifdef TEST_SHARED_TIDSTORE
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+
+ LWLockRegisterTranche(tranche_id, "test_tidstore");
+ dsa = dsa_create(tranche_id);
+
+ ts = tidstore_create(TEST_TIDSTORE_MAX_BYTES, max_offset, dsa);
+#else
+ ts = tidstore_create(TEST_TIDSTORE_MAX_BYTES, max_offset, NULL);
+#endif
+
+ /* prepare the offset array */
+ offs[0] = FirstOffsetNumber;
+ offs[1] = FirstOffsetNumber + 1;
+ offs[2] = max_offset / 2;
+ offs[3] = max_offset - 1;
+ offs[4] = max_offset;
+
+ /* add tids */
+ for (int i = 0; i < TEST_TIDSTORE_NUM_BLOCKS; i++)
+ tidstore_add_tids(ts, blks[i], offs, TEST_TIDSTORE_NUM_OFFSETS);
+
+ /* lookup test */
+ for (OffsetNumber off = FirstOffsetNumber ; off < max_offset; off++)
+ {
+ bool expect = false;
+ for (int i = 0; i < TEST_TIDSTORE_NUM_OFFSETS; i++)
+ {
+ if (offs[i] == off)
+ {
+ expect = true;
+ break;
+ }
+ }
+
+ check_tid(ts, 0, off, expect);
+ check_tid(ts, 2, off, false);
+ check_tid(ts, MaxBlockNumber - 2, off, false);
+ check_tid(ts, MaxBlockNumber, off, expect);
+ }
+
+ /* test the number of tids */
+ if (tidstore_num_tids(ts) != (TEST_TIDSTORE_NUM_BLOCKS * TEST_TIDSTORE_NUM_OFFSETS))
+ elog(ERROR, "tidstore_num_tids returned " UINT64_FORMAT ", expected %d",
+ tidstore_num_tids(ts),
+ TEST_TIDSTORE_NUM_BLOCKS * TEST_TIDSTORE_NUM_OFFSETS);
+
+ /* iteration test */
+ iter = tidstore_begin_iterate(ts);
+ blk_idx = 0;
+ while ((iter_result = tidstore_iterate_next(iter)) != NULL)
+ {
+ /* check the returned block number */
+ if (blks_sorted[blk_idx] != iter_result->blkno)
+ elog(ERROR, "tidstore_iterate_next returned block number %u, expected %u",
+ iter_result->blkno, blks_sorted[blk_idx]);
+
+ /* check the returned offset numbers */
+ if (TEST_TIDSTORE_NUM_OFFSETS != iter_result->num_offsets)
+ elog(ERROR, "tidstore_iterate_next returned %u offsets, expected %u",
+ iter_result->num_offsets, TEST_TIDSTORE_NUM_OFFSETS);
+
+ for (int i = 0; i < iter_result->num_offsets; i++)
+ {
+ if (offs[i] != iter_result->offsets[i])
+ elog(ERROR, "tidstore_iterate_next returned offset number %u on block %u, expected %u",
+ iter_result->offsets[i], iter_result->blkno, offs[i]);
+ }
+
+ blk_idx++;
+ }
+
+ if (blk_idx != TEST_TIDSTORE_NUM_BLOCKS)
+ elog(ERROR, "tidstore_iterate_next returned %d blocks, expected %d",
+ blk_idx, TEST_TIDSTORE_NUM_BLOCKS);
+
+ /* remove all tids */
+ tidstore_reset(ts);
+
+ /* test the number of tids */
+ if (tidstore_num_tids(ts) != 0)
+ elog(ERROR, "tidstore_num_tids on empty store returned non-zero");
+
+ /* lookup test for empty store */
+ for (OffsetNumber off = FirstOffsetNumber ; off < MaxHeapTuplesPerPage;
+ off++)
+ {
+ check_tid(ts, 0, off, false);
+ check_tid(ts, 2, off, false);
+ check_tid(ts, MaxBlockNumber - 2, off, false);
+ check_tid(ts, MaxBlockNumber, off, false);
+ }
+
+ tidstore_destroy(ts);
+
+#ifdef TEST_SHARED_TIDSTORE
+ dsa_detach(dsa);
+#endif
+}
+
+static void
+test_empty(void)
+{
+ TidStore *ts;
+ TidStoreIter *iter;
+ ItemPointerData tid;
+
+#ifdef TEST_SHARED_TIDSTORE
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+
+ LWLockRegisterTranche(tranche_id, "test_tidstore");
+ dsa = dsa_create(tranche_id);
+
+ ts = tidstore_create(TEST_TIDSTORE_MAX_BYTES, MaxHeapTuplesPerPage, dsa);
+#else
+ ts = tidstore_create(TEST_TIDSTORE_MAX_BYTES, MaxHeapTuplesPerPage, NULL);
+#endif
+
+ elog(NOTICE, "testing empty tidstore");
+
+ ItemPointerSet(&tid, 0, FirstOffsetNumber);
+ if (tidstore_lookup_tid(ts, &tid))
+ elog(ERROR, "tidstore_lookup_tid for (0,1) on empty store returned true");
+
+ ItemPointerSet(&tid, MaxBlockNumber, MaxOffsetNumber);
+ if (tidstore_lookup_tid(ts, &tid))
+ elog(ERROR, "tidstore_lookup_tid for (%u,%u) on empty store returned true",
+ MaxBlockNumber, MaxOffsetNumber);
+
+ if (tidstore_num_tids(ts) != 0)
+		elog(ERROR, "tidstore_num_tids on empty store returned non-zero");
+
+ if (tidstore_is_full(ts))
+ elog(ERROR, "tidstore_is_full on empty store returned true");
+
+ iter = tidstore_begin_iterate(ts);
+
+ if (tidstore_iterate_next(iter) != NULL)
+ elog(ERROR, "tidstore_iterate_next on empty store returned TIDs");
+
+ tidstore_end_iterate(iter);
+
+ tidstore_destroy(ts);
+
+#ifdef TEST_SHARED_TIDSTORE
+ dsa_detach(dsa);
+#endif
+}
+
+Datum
+test_tidstore(PG_FUNCTION_ARGS)
+{
+ test_empty();
+
+ elog(NOTICE, "testing basic operations");
+ test_basic(MaxHeapTuplesPerPage);
+ test_basic(10);
+
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_tidstore/test_tidstore.control b/src/test/modules/test_tidstore/test_tidstore.control
new file mode 100644
index 0000000000..9b6bd4638f
--- /dev/null
+++ b/src/test/modules/test_tidstore/test_tidstore.control
@@ -0,0 +1,4 @@
+comment = 'Test code for tidstore'
+default_version = '1.0'
+module_pathname = '$libdir/test_tidstore'
+relocatable = true
--
2.31.1
v32-0002-Move-some-bitmap-logic-out-of-bitmapset.c.patch
From 7fe0c744e052286a8c44716494fe4d644b0e8451 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Tue, 6 Dec 2022 13:39:41 +0700
Subject: [PATCH v32 02/18] Move some bitmap logic out of bitmapset.c
Add pg_rightmost_one32/64 functions. This functionality was previously
private to bitmapset.c as the RIGHTMOST_ONE macro. It has practical
use in other contexts, so move to pg_bitutils.h.
Also move the logic for selecting appropriate pg_bitutils functions
based on word size to bitmapset.h for wider visibility and add
appropriate selection for bmw_rightmost_one().
Since the previous macro relied on casting to signedbitmapword,
and the new functions do not, remove that typedef.
Design input and review by Tom Lane
Discussion: https://www.postgresql.org/message-id/CAFBsxsFW2JjTo58jtDB%2B3sZhxMx3t-3evew8%3DAcr%2BGGhC%2BkFaA%40mail.gmail.com
---
src/backend/nodes/bitmapset.c | 34 +-------------------------------
src/include/nodes/bitmapset.h | 16 +++++++++++++--
src/include/port/pg_bitutils.h | 31 +++++++++++++++++++++++++++++
src/tools/pgindent/typedefs.list | 1 -
4 files changed, 46 insertions(+), 36 deletions(-)
diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index 7ba3cf635b..0b2962ed73 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -30,39 +30,7 @@
#define BITMAPSET_SIZE(nwords) \
(offsetof(Bitmapset, words) + (nwords) * sizeof(bitmapword))
-/*----------
- * This is a well-known cute trick for isolating the rightmost one-bit
- * in a word. It assumes two's complement arithmetic. Consider any
- * nonzero value, and focus attention on the rightmost one. The value is
- * then something like
- * xxxxxx10000
- * where x's are unspecified bits. The two's complement negative is formed
- * by inverting all the bits and adding one. Inversion gives
- * yyyyyy01111
- * where each y is the inverse of the corresponding x. Incrementing gives
- * yyyyyy10000
- * and then ANDing with the original value gives
- * 00000010000
- * This works for all cases except original value = zero, where of course
- * we get zero.
- *----------
- */
-#define RIGHTMOST_ONE(x) ((signedbitmapword) (x) & -((signedbitmapword) (x)))
-
-#define HAS_MULTIPLE_ONES(x) ((bitmapword) RIGHTMOST_ONE(x) != (x))
-
-/* Select appropriate bit-twiddling functions for bitmap word size */
-#if BITS_PER_BITMAPWORD == 32
-#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos32(w)
-#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos32(w)
-#define bmw_popcount(w) pg_popcount32(w)
-#elif BITS_PER_BITMAPWORD == 64
-#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos64(w)
-#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos64(w)
-#define bmw_popcount(w) pg_popcount64(w)
-#else
-#error "invalid BITS_PER_BITMAPWORD"
-#endif
+#define HAS_MULTIPLE_ONES(x) (bmw_rightmost_one(x) != (x))
static bool bms_is_empty_internal(const Bitmapset *a);
diff --git a/src/include/nodes/bitmapset.h b/src/include/nodes/bitmapset.h
index 14de6a9ff1..c7e1711147 100644
--- a/src/include/nodes/bitmapset.h
+++ b/src/include/nodes/bitmapset.h
@@ -36,13 +36,11 @@ struct List;
#define BITS_PER_BITMAPWORD 64
typedef uint64 bitmapword; /* must be an unsigned type */
-typedef int64 signedbitmapword; /* must be the matching signed type */
#else
#define BITS_PER_BITMAPWORD 32
typedef uint32 bitmapword; /* must be an unsigned type */
-typedef int32 signedbitmapword; /* must be the matching signed type */
#endif
@@ -73,6 +71,20 @@ typedef enum
BMS_MULTIPLE /* >1 member */
} BMS_Membership;
+/* Select appropriate bit-twiddling functions for bitmap word size */
+#if BITS_PER_BITMAPWORD == 32
+#define bmw_rightmost_one(w) pg_rightmost_one32(w)
+#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos32(w)
+#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos32(w)
+#define bmw_popcount(w) pg_popcount32(w)
+#elif BITS_PER_BITMAPWORD == 64
+#define bmw_rightmost_one(w) pg_rightmost_one64(w)
+#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos64(w)
+#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos64(w)
+#define bmw_popcount(w) pg_popcount64(w)
+#else
+#error "invalid BITS_PER_BITMAPWORD"
+#endif
/*
* function prototypes in nodes/bitmapset.c
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 158ef73a2b..bf7588e075 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -32,6 +32,37 @@ extern PGDLLIMPORT const uint8 pg_leftmost_one_pos[256];
extern PGDLLIMPORT const uint8 pg_rightmost_one_pos[256];
extern PGDLLIMPORT const uint8 pg_number_of_ones[256];
+/*----------
+ * This is a well-known cute trick for isolating the rightmost one-bit
+ * in a word. It assumes two's complement arithmetic. Consider any
+ * nonzero value, and focus attention on the rightmost one. The value is
+ * then something like
+ * xxxxxx10000
+ * where x's are unspecified bits. The two's complement negative is formed
+ * by inverting all the bits and adding one. Inversion gives
+ * yyyyyy01111
+ * where each y is the inverse of the corresponding x. Incrementing gives
+ * yyyyyy10000
+ * and then ANDing with the original value gives
+ * 00000010000
+ * This works for all cases except original value = zero, where of course
+ * we get zero.
+ *----------
+ */
+static inline uint32
+pg_rightmost_one32(uint32 word)
+{
+ int32 result = (int32) word & -((int32) word);
+ return (uint32) result;
+}
+
+static inline uint64
+pg_rightmost_one64(uint64 word)
+{
+ int64 result = (int64) word & -((int64) word);
+ return (uint64) result;
+}
+
/*
* pg_leftmost_one_pos32
* Returns the position of the most significant set bit in "word",
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index b4058b88c3..fd3d83c781 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3684,7 +3684,6 @@ shmem_request_hook_type
shmem_startup_hook_type
sig_atomic_t
sigjmp_buf
-signedbitmapword
sigset_t
size_t
slist_head
--
2.31.1
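For reference, a minimal standalone sketch of what the relocated helper
computes (this mirrors pg_rightmost_one32() from the patch above, assumes
only a C99 compiler, and is not part of the patch itself):

#include <assert.h>
#include <stdint.h>

/* Two's-complement trick: x & -x isolates the rightmost one-bit. */
static inline uint32_t
rightmost_one32(uint32_t w)
{
	return (uint32_t) ((int32_t) w & -((int32_t) w));
}

int
main(void)
{
	assert(rightmost_one32(0x28) == 0x08);	/* 101000 -> 001000 */
	assert(rightmost_one32(0x08) == 0x08);	/* already a single bit */

	/*
	 * HAS_MULTIPLE_ONES(x): more than one bit is set iff x differs from
	 * its own rightmost one-bit.
	 */
	assert(rightmost_one32(0x28) != 0x28);	/* multiple ones */
	assert(rightmost_one32(0x04) == 0x04);	/* single one */

	return 0;
}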
Attachment: v32-0001-Introduce-helper-SIMD-functions-for-small-byte-a.patch (application/octet-stream)
From 51fe658fcecefb2b8c0d826c7d7d6070eb9e878c Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 24 Oct 2022 14:07:09 +0900
Subject: [PATCH v32 01/18] Introduce helper SIMD functions for small byte
arrays
vector8_min - helper for emulating ">=" semantics
vector8_highbit_mask - used to turn the result of a vector
comparison into a bitmask
Masahiko Sawada
Reviewed by Nathan Bossart, additional adjustments by me
Discussion: https://www.postgresql.org/message-id/CAD21AoDap240WDDdUDE0JMpCmuMMnGajrKrkCRxM7zn9Xk3JRA%40mail.gmail.com
---
src/include/port/simd.h | 47 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 47 insertions(+)
diff --git a/src/include/port/simd.h b/src/include/port/simd.h
index 1fa6c3bc6c..dfae14e463 100644
--- a/src/include/port/simd.h
+++ b/src/include/port/simd.h
@@ -79,6 +79,7 @@ static inline bool vector8_has_le(const Vector8 v, const uint8 c);
static inline bool vector8_is_highbit_set(const Vector8 v);
#ifndef USE_NO_SIMD
static inline bool vector32_is_highbit_set(const Vector32 v);
+static inline uint32 vector8_highbit_mask(const Vector8 v);
#endif
/* arithmetic operations */
@@ -96,6 +97,7 @@ static inline Vector8 vector8_ssub(const Vector8 v1, const Vector8 v2);
*/
#ifndef USE_NO_SIMD
static inline Vector8 vector8_eq(const Vector8 v1, const Vector8 v2);
+static inline Vector8 vector8_min(const Vector8 v1, const Vector8 v2);
static inline Vector32 vector32_eq(const Vector32 v1, const Vector32 v2);
#endif
@@ -299,6 +301,36 @@ vector32_is_highbit_set(const Vector32 v)
}
#endif /* ! USE_NO_SIMD */
+/*
+ * Return a bitmask formed from the high-bit of each element.
+ */
+#ifndef USE_NO_SIMD
+static inline uint32
+vector8_highbit_mask(const Vector8 v)
+{
+#ifdef USE_SSE2
+ return (uint32) _mm_movemask_epi8(v);
+#elif defined(USE_NEON)
+ /*
+	 * Note: There is a faster way to do this, but it returns a uint64, and
+	 * if the caller wanted to extract the bit position using CTZ,
+ * it would have to divide that result by 4.
+ */
+ static const uint8 mask[16] = {
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ };
+
+ uint8x16_t masked = vandq_u8(vld1q_u8(mask), (uint8x16_t) vshrq_n_s8(v, 7));
+ uint8x16_t maskedhi = vextq_u8(masked, masked, 8);
+
+ return (uint32) vaddvq_u16((uint16x8_t) vzip1q_u8(masked, maskedhi));
+#endif
+}
+#endif /* ! USE_NO_SIMD */
+
/*
* Return the bitwise OR of the inputs
*/
@@ -372,4 +404,19 @@ vector32_eq(const Vector32 v1, const Vector32 v2)
}
#endif /* ! USE_NO_SIMD */
+/*
+ * Given two vectors, return a vector with the minimum element of each.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_min(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+ return _mm_min_epu8(v1, v2);
+#elif defined(USE_NEON)
+ return vminq_u8(v1, v2);
+#endif
+}
+#endif /* ! USE_NO_SIMD */
+
#endif /* SIMD_H */
--
2.31.1
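To see what these helpers are for, here is a rough standalone sketch of the
search pattern that the radix tree patch below uses for its node-32, written
directly against the SSE2 intrinsics that vector8_broadcast()/vector8_eq()/
vector8_highbit_mask() wrap on x86 (illustration only; it assumes GCC/Clang
builtins and SSE2, and is not part of the patch):

#include <emmintrin.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Return the index of the first byte in a 16-byte array equal to 'key',
 * or -1 if there is none.
 */
static int
find_byte16(const uint8_t *chunks, uint8_t key)
{
	__m128i		needle = _mm_set1_epi8((char) key);			/* broadcast the key */
	__m128i		haystack = _mm_loadu_si128((const __m128i *) chunks);
	__m128i		cmp = _mm_cmpeq_epi8(needle, haystack);		/* 0xFF where equal */
	uint32_t	mask = (uint32_t) _mm_movemask_epi8(cmp);	/* one bit per byte */

	return mask ? __builtin_ctz(mask) : -1;
}

int
main(void)
{
	uint8_t		chunks[16] = {1, 4, 9, 23, 42};

	printf("%d\n", find_byte16(chunks, 42));	/* prints 4 */
	printf("%d\n", find_byte16(chunks, 7));		/* prints -1 */
	return 0;
}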
Attachment: v32-0003-Add-radixtree-template.patch (application/octet-stream)
From b88b152cac7c31b49416c4e59e93b3b5f0813759 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 14 Sep 2022 12:38:51 +0000
Subject: [PATCH v32 03/18] Add radixtree template
WIP: commit message based on template comments
---
src/backend/utils/mmgr/dsa.c | 12 +
src/include/lib/radixtree.h | 2516 +++++++++++++++++
src/include/lib/radixtree_delete_impl.h | 122 +
src/include/lib/radixtree_insert_impl.h | 328 +++
src/include/lib/radixtree_iter_impl.h | 153 +
src/include/lib/radixtree_search_impl.h | 138 +
src/include/utils/dsa.h | 1 +
src/test/modules/Makefile | 1 +
src/test/modules/meson.build | 1 +
src/test/modules/test_radixtree/.gitignore | 4 +
src/test/modules/test_radixtree/Makefile | 23 +
src/test/modules/test_radixtree/README | 7 +
.../expected/test_radixtree.out | 36 +
src/test/modules/test_radixtree/meson.build | 35 +
.../test_radixtree/sql/test_radixtree.sql | 7 +
.../test_radixtree/test_radixtree--1.0.sql | 8 +
.../modules/test_radixtree/test_radixtree.c | 681 +++++
.../test_radixtree/test_radixtree.control | 4 +
src/tools/pginclude/cpluspluscheck | 6 +
src/tools/pginclude/headerscheck | 6 +
20 files changed, 4089 insertions(+)
create mode 100644 src/include/lib/radixtree.h
create mode 100644 src/include/lib/radixtree_delete_impl.h
create mode 100644 src/include/lib/radixtree_insert_impl.h
create mode 100644 src/include/lib/radixtree_iter_impl.h
create mode 100644 src/include/lib/radixtree_search_impl.h
create mode 100644 src/test/modules/test_radixtree/.gitignore
create mode 100644 src/test/modules/test_radixtree/Makefile
create mode 100644 src/test/modules/test_radixtree/README
create mode 100644 src/test/modules/test_radixtree/expected/test_radixtree.out
create mode 100644 src/test/modules/test_radixtree/meson.build
create mode 100644 src/test/modules/test_radixtree/sql/test_radixtree.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree--1.0.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree.c
create mode 100644 src/test/modules/test_radixtree/test_radixtree.control
diff --git a/src/backend/utils/mmgr/dsa.c b/src/backend/utils/mmgr/dsa.c
index f5a62061a3..80555aefff 100644
--- a/src/backend/utils/mmgr/dsa.c
+++ b/src/backend/utils/mmgr/dsa.c
@@ -1024,6 +1024,18 @@ dsa_set_size_limit(dsa_area *area, size_t limit)
LWLockRelease(DSA_AREA_LOCK(area));
}
+size_t
+dsa_get_total_size(dsa_area *area)
+{
+ size_t size;
+
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_SHARED);
+ size = area->control->total_segment_size;
+ LWLockRelease(DSA_AREA_LOCK(area));
+
+ return size;
+}
+
/*
* Aggressively free all spare memory in the hope of returning DSM segments to
* the operating system.
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
new file mode 100644
index 0000000000..e546bd705c
--- /dev/null
+++ b/src/include/lib/radixtree.h
@@ -0,0 +1,2516 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.h
+ * Template for adaptive radix tree.
+ *
+ * This module employs the idea from the paper "The Adaptive Radix Tree: ARTful
+ * Indexing for Main-Memory Databases" by Viktor Leis, Alfons Kemper, and Thomas
+ * Neumann, 2013. The radix tree uses adaptive node sizes, a small number of node
+ * types, each with a different numbers of elements. Depending on the number of
+ * children, the appropriate node type is used.
+ *
+ * WIP: notes about traditional radix tree trading off span vs height...
+ *
+ * There are two kinds of nodes, inner nodes and leaves. Inner nodes
+ * map partial keys to child pointers.
+ *
+ * The ART paper mentions three ways to implement leaves:
+ *
+ * "- Single-value leaves: The values are stored using an addi-
+ * tional leaf node type which stores one value.
+ * - Multi-value leaves: The values are stored in one of four
+ * different leaf node types, which mirror the structure of
+ * inner nodes, but contain values instead of pointers.
+ * - Combined pointer/value slots: If values fit into point-
+ * ers, no separate node types are necessary. Instead, each
+ * pointer storage location in an inner node can either
+ * store a pointer or a value."
+ *
+ * We chose "multi-value leaves" to avoid the additional pointer traversal
+ * required by "single-value leaves".
+ *
+ * For simplicity, the key is assumed to be a 64-bit unsigned integer. The
+ * tree doesn't need to contain paths where the highest bytes of all keys
+ * are zero. That way, the tree's height adapts to the distribution of keys.
+ *
+ * TODO: In the future it might be worthwhile to offer configurability of
+ * leaf implementation for different use cases. Single-value leaves would
+ * give more flexibility in key type, including variable-length keys.
+ *
+ * There are some optimizations not yet implemented, particularly path
+ * compression and lazy path expansion.
+ *
+ * To handle concurrency, we use a single reader-writer lock for the radix
+ * tree. The radix tree is exclusively locked during write operations such
+ * as RT_SET() and RT_DELETE(), and shared locked during read operations
+ * such as RT_SEARCH(). An iteration also holds the shared lock on the radix
+ * tree until it is completed.
+ *
+ * TODO: The current locking mechanism is not optimized for high concurrency
+ * with mixed read-write workloads. In the future it might be worthwhile
+ * to replace it with the Optimistic Lock Coupling or ROWEX mentioned in
+ * the paper "The ART of Practical Synchronization" by the same authors as
+ * the ART paper, 2016.
+ *
+ * WIP: the radix tree nodes don't shrink.
+ *
+ * To generate a radix tree and associated functions for a use case several
+ * macros have to be #define'ed before this file is included. Including
+ * the file #undef's all those, so a new radix tree can be generated
+ * afterwards.
+ * The relevant parameters are:
+ * - RT_PREFIX - prefix for all symbol names generated. A prefix of 'foo'
+ * will result in radix tree type 'foo_radix_tree' and functions like
+ * 'foo_create'/'foo_free' and so forth.
+ * - RT_DECLARE - if defined function prototypes and type declarations are
+ * generated
+ * - RT_DEFINE - if defined function definitions are generated
+ * - RT_SCOPE - in which scope (e.g. extern, static inline) do function
+ * declarations reside
+ * - RT_VALUE_TYPE - the type of the value.
+ *
+ * Optional parameters:
+ * - RT_SHMEM - if defined, the radix tree is created in the DSA area
+ * so that multiple processes can access it simultaneously.
+ * - RT_DEBUG - if defined add stats tracking and debugging functions
+ *
+ * Interface
+ * ---------
+ *
+ * RT_CREATE - Create a new, empty radix tree
+ * RT_FREE - Free the radix tree
+ * RT_SEARCH - Search a key-value pair
+ * RT_SET - Set a key-value pair
+ * RT_BEGIN_ITERATE - Begin iterating through all key-value pairs
+ * RT_ITERATE_NEXT - Return next key-value pair, if any
+ * RT_END_ITERATE - End iteration
+ * RT_MEMORY_USAGE - Get the memory usage
+ *
+ * Interface for Shared Memory
+ * ---------
+ *
+ * RT_ATTACH - Attach to the radix tree
+ * RT_DETACH - Detach from the radix tree
+ * RT_GET_HANDLE - Return the handle of the radix tree
+ *
+ * Optional Interface
+ * ---------
+ *
+ * RT_DELETE - Delete a key-value pair. Declared/defined only if RT_USE_DELETE is defined
+ *
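+ * As a rough usage sketch (the 'foo' prefix and the BlockNumber value type
+ * are only placeholder choices), a local, non-shared tree could be
+ * instantiated and used like this:
+ *
+ *		#define RT_PREFIX foo
+ *		#define RT_SCOPE static
+ *		#define RT_DECLARE
+ *		#define RT_DEFINE
+ *		#define RT_VALUE_TYPE BlockNumber
+ *		#include "lib/radixtree.h"
+ *
+ *		foo_radix_tree *tree = foo_create(CurrentMemoryContext);
+ *		uint64		key = 10;
+ *		BlockNumber value = 42;
+ *		bool		found;
+ *
+ *		foo_set(tree, key, &value);
+ *		found = foo_search(tree, key, &value);
+ *		foo_free(tree);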
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/lib/radixtree.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "lib/stringinfo.h"
+#include "miscadmin.h"
+#include "nodes/bitmapset.h"
+#include "port/pg_bitutils.h"
+#include "port/simd.h"
+#include "utils/dsa.h"
+#include "utils/memutils.h"
+
+/* helpers */
+#define RT_MAKE_PREFIX(a) CppConcat(a,_)
+#define RT_MAKE_NAME(name) RT_MAKE_NAME_(RT_MAKE_PREFIX(RT_PREFIX),name)
+#define RT_MAKE_NAME_(a,b) CppConcat(a,b)
+
+/* function declarations */
+#define RT_CREATE RT_MAKE_NAME(create)
+#define RT_FREE RT_MAKE_NAME(free)
+#define RT_SEARCH RT_MAKE_NAME(search)
+#ifdef RT_SHMEM
+#define RT_ATTACH RT_MAKE_NAME(attach)
+#define RT_DETACH RT_MAKE_NAME(detach)
+#define RT_GET_HANDLE RT_MAKE_NAME(get_handle)
+#endif
+#define RT_SET RT_MAKE_NAME(set)
+#define RT_BEGIN_ITERATE RT_MAKE_NAME(begin_iterate)
+#define RT_ITERATE_NEXT RT_MAKE_NAME(iterate_next)
+#define RT_END_ITERATE RT_MAKE_NAME(end_iterate)
+#ifdef RT_USE_DELETE
+#define RT_DELETE RT_MAKE_NAME(delete)
+#endif
+#define RT_MEMORY_USAGE RT_MAKE_NAME(memory_usage)
+#ifdef RT_DEBUG
+#define RT_DUMP RT_MAKE_NAME(dump)
+#define RT_DUMP_NODE RT_MAKE_NAME(dump_node)
+#define RT_DUMP_SEARCH RT_MAKE_NAME(dump_search)
+#define RT_STATS RT_MAKE_NAME(stats)
+#endif
+
+/* internal helper functions (no externally visible prototypes) */
+#define RT_NEW_ROOT RT_MAKE_NAME(new_root)
+#define RT_ALLOC_NODE RT_MAKE_NAME(alloc_node)
+#define RT_INIT_NODE RT_MAKE_NAME(init_node)
+#define RT_FREE_NODE RT_MAKE_NAME(free_node)
+#define RT_FREE_RECURSE RT_MAKE_NAME(free_recurse)
+#define RT_EXTEND RT_MAKE_NAME(extend)
+#define RT_SET_EXTEND RT_MAKE_NAME(set_extend)
+#define RT_SWITCH_NODE_KIND RT_MAKE_NAME(grow_node_kind)
+#define RT_COPY_NODE RT_MAKE_NAME(copy_node)
+#define RT_REPLACE_NODE RT_MAKE_NAME(replace_node)
+#define RT_PTR_GET_LOCAL RT_MAKE_NAME(ptr_get_local)
+#define RT_PTR_ALLOC_IS_VALID RT_MAKE_NAME(ptr_stored_is_valid)
+#define RT_NODE_3_SEARCH_EQ RT_MAKE_NAME(node_3_search_eq)
+#define RT_NODE_32_SEARCH_EQ RT_MAKE_NAME(node_32_search_eq)
+#define RT_NODE_3_GET_INSERTPOS RT_MAKE_NAME(node_3_get_insertpos)
+#define RT_NODE_32_GET_INSERTPOS RT_MAKE_NAME(node_32_get_insertpos)
+#define RT_CHUNK_CHILDREN_ARRAY_SHIFT RT_MAKE_NAME(chunk_children_array_shift)
+#define RT_CHUNK_VALUES_ARRAY_SHIFT RT_MAKE_NAME(chunk_values_array_shift)
+#define RT_CHUNK_CHILDREN_ARRAY_DELETE RT_MAKE_NAME(chunk_children_array_delete)
+#define RT_CHUNK_VALUES_ARRAY_DELETE RT_MAKE_NAME(chunk_values_array_delete)
+#define RT_CHUNK_CHILDREN_ARRAY_COPY RT_MAKE_NAME(chunk_children_array_copy)
+#define RT_CHUNK_VALUES_ARRAY_COPY RT_MAKE_NAME(chunk_values_array_copy)
+#define RT_NODE_125_IS_CHUNK_USED RT_MAKE_NAME(node_125_is_chunk_used)
+#define RT_NODE_INNER_125_GET_CHILD RT_MAKE_NAME(node_inner_125_get_child)
+#define RT_NODE_LEAF_125_GET_VALUE RT_MAKE_NAME(node_leaf_125_get_value)
+#define RT_NODE_INNER_256_IS_CHUNK_USED RT_MAKE_NAME(node_inner_256_is_chunk_used)
+#define RT_NODE_LEAF_256_IS_CHUNK_USED RT_MAKE_NAME(node_leaf_256_is_chunk_used)
+#define RT_NODE_INNER_256_GET_CHILD RT_MAKE_NAME(node_inner_256_get_child)
+#define RT_NODE_LEAF_256_GET_VALUE RT_MAKE_NAME(node_leaf_256_get_value)
+#define RT_NODE_INNER_256_SET RT_MAKE_NAME(node_inner_256_set)
+#define RT_NODE_LEAF_256_SET RT_MAKE_NAME(node_leaf_256_set)
+#define RT_NODE_INNER_256_DELETE RT_MAKE_NAME(node_inner_256_delete)
+#define RT_NODE_LEAF_256_DELETE RT_MAKE_NAME(node_leaf_256_delete)
+#define RT_KEY_GET_SHIFT RT_MAKE_NAME(key_get_shift)
+#define RT_SHIFT_GET_MAX_VAL RT_MAKE_NAME(shift_get_max_val)
+#define RT_NODE_SEARCH_INNER RT_MAKE_NAME(node_search_inner)
+#define RT_NODE_SEARCH_LEAF RT_MAKE_NAME(node_search_leaf)
+#define RT_NODE_UPDATE_INNER RT_MAKE_NAME(node_update_inner)
+#define RT_NODE_DELETE_INNER RT_MAKE_NAME(node_delete_inner)
+#define RT_NODE_DELETE_LEAF RT_MAKE_NAME(node_delete_leaf)
+#define RT_NODE_INSERT_INNER RT_MAKE_NAME(node_insert_inner)
+#define RT_NODE_INSERT_LEAF RT_MAKE_NAME(node_insert_leaf)
+#define RT_NODE_INNER_ITERATE_NEXT RT_MAKE_NAME(node_inner_iterate_next)
+#define RT_NODE_LEAF_ITERATE_NEXT RT_MAKE_NAME(node_leaf_iterate_next)
+#define RT_UPDATE_ITER_STACK RT_MAKE_NAME(update_iter_stack)
+#define RT_ITER_UPDATE_KEY RT_MAKE_NAME(iter_update_key)
+#define RT_VERIFY_NODE RT_MAKE_NAME(verify_node)
+
+/* type declarations */
+#define RT_RADIX_TREE RT_MAKE_NAME(radix_tree)
+#define RT_RADIX_TREE_CONTROL RT_MAKE_NAME(radix_tree_control)
+#define RT_ITER RT_MAKE_NAME(iter)
+#ifdef RT_SHMEM
+#define RT_HANDLE RT_MAKE_NAME(handle)
+#endif
+#define RT_NODE RT_MAKE_NAME(node)
+#define RT_NODE_ITER RT_MAKE_NAME(node_iter)
+#define RT_NODE_BASE_3 RT_MAKE_NAME(node_base_3)
+#define RT_NODE_BASE_32 RT_MAKE_NAME(node_base_32)
+#define RT_NODE_BASE_125 RT_MAKE_NAME(node_base_125)
+#define RT_NODE_BASE_256 RT_MAKE_NAME(node_base_256)
+#define RT_NODE_INNER_3 RT_MAKE_NAME(node_inner_3)
+#define RT_NODE_INNER_32 RT_MAKE_NAME(node_inner_32)
+#define RT_NODE_INNER_125 RT_MAKE_NAME(node_inner_125)
+#define RT_NODE_INNER_256 RT_MAKE_NAME(node_inner_256)
+#define RT_NODE_LEAF_3 RT_MAKE_NAME(node_leaf_3)
+#define RT_NODE_LEAF_32 RT_MAKE_NAME(node_leaf_32)
+#define RT_NODE_LEAF_125 RT_MAKE_NAME(node_leaf_125)
+#define RT_NODE_LEAF_256 RT_MAKE_NAME(node_leaf_256)
+#define RT_SIZE_CLASS RT_MAKE_NAME(size_class)
+#define RT_SIZE_CLASS_ELEM RT_MAKE_NAME(size_class_elem)
+#define RT_SIZE_CLASS_INFO RT_MAKE_NAME(size_class_info)
+#define RT_CLASS_3 RT_MAKE_NAME(class_3)
+#define RT_CLASS_32_MIN RT_MAKE_NAME(class_32_min)
+#define RT_CLASS_32_MAX RT_MAKE_NAME(class_32_max)
+#define RT_CLASS_125 RT_MAKE_NAME(class_125)
+#define RT_CLASS_256 RT_MAKE_NAME(class_256)
+
+/* generate forward declarations necessary to use the radix tree */
+#ifdef RT_DECLARE
+
+typedef struct RT_RADIX_TREE RT_RADIX_TREE;
+typedef struct RT_ITER RT_ITER;
+
+#ifdef RT_SHMEM
+typedef dsa_pointer RT_HANDLE;
+#endif
+
+#ifdef RT_SHMEM
+RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx, dsa_area *dsa, int tranche_id);
+RT_SCOPE RT_RADIX_TREE * RT_ATTACH(dsa_area *dsa, dsa_pointer dp);
+RT_SCOPE void RT_DETACH(RT_RADIX_TREE *tree);
+RT_SCOPE RT_HANDLE RT_GET_HANDLE(RT_RADIX_TREE *tree);
+#else
+RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx);
+#endif
+RT_SCOPE void RT_FREE(RT_RADIX_TREE *tree);
+
+RT_SCOPE bool RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p);
+RT_SCOPE bool RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p);
+#ifdef RT_USE_DELETE
+RT_SCOPE bool RT_DELETE(RT_RADIX_TREE *tree, uint64 key);
+#endif
+
+RT_SCOPE RT_ITER * RT_BEGIN_ITERATE(RT_RADIX_TREE *tree);
+RT_SCOPE bool RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, RT_VALUE_TYPE *value_p);
+RT_SCOPE void RT_END_ITERATE(RT_ITER *iter);
+
+RT_SCOPE uint64 RT_MEMORY_USAGE(RT_RADIX_TREE *tree);
+
+#ifdef RT_DEBUG
+RT_SCOPE void RT_DUMP(RT_RADIX_TREE *tree);
+RT_SCOPE void RT_DUMP_SEARCH(RT_RADIX_TREE *tree, uint64 key);
+RT_SCOPE void RT_STATS(RT_RADIX_TREE *tree);
+#endif
+
+#endif /* RT_DECLARE */
+
+
+/* generate implementation of the radix tree */
+#ifdef RT_DEFINE
+
+/* The number of bits encoded in one tree level */
+#define RT_NODE_SPAN BITS_PER_BYTE
+
+/* The maximum number of slots in a node */
+#define RT_NODE_MAX_SLOTS (1 << RT_NODE_SPAN)
+
+/* Mask for extracting a chunk from the key */
+#define RT_CHUNK_MASK ((1 << RT_NODE_SPAN) - 1)
+
+/* Maximum shift the radix tree uses */
+#define RT_MAX_SHIFT RT_KEY_GET_SHIFT(UINT64_MAX)
+
+/* Maximum number of levels the radix tree uses */
+#define RT_MAX_LEVEL ((sizeof(uint64) * BITS_PER_BYTE) / RT_NODE_SPAN)
+
+/*
+ * Number of bits necessary for isset array in the slot-index node.
+ * Since bitmapword can be 64 bits, the only values that make sense
+ * here are 64 and 128.
+ */
+#define RT_SLOT_IDX_LIMIT (RT_NODE_MAX_SLOTS / 2)
+
+/* Invalid index used in node-125 */
+#define RT_INVALID_SLOT_IDX 0xFF
+
+/* Get a chunk from the key */
+#define RT_GET_KEY_CHUNK(key, shift) ((uint8) (((key) >> (shift)) & RT_CHUNK_MASK))
+
+/* For accessing bitmaps */
+#define RT_BM_IDX(x) ((x) / BITS_PER_BITMAPWORD)
+#define RT_BM_BIT(x) ((x) % BITS_PER_BITMAPWORD)
+
+/*
+ * Node kinds
+ *
+ * The different node kinds are what make the tree "adaptive".
+ *
+ * Each node kind is associated with a different datatype and different
+ * search/set/delete/iterate algorithms adapted for its size. The largest
+ * kind, node256, is basically the same as a traditional radix tree,
+ * and would be most wasteful of memory when sparsely populated. The
+ * smaller nodes expend some additional CPU time to enable a smaller
+ * memory footprint.
+ *
+ * XXX There are 4 node kinds, and this should never be increased,
+ * for several reasons:
+ * 1. With 5 or more kinds, gcc tends to use a jump table for switch
+ * statements.
+ * 2. The 4 kinds can be represented with 2 bits, so we have the option
+ * in the future to tag the node pointer with the kind, even on
+ * platforms with 32-bit pointers. This might speed up node traversal
+ * in trees with highly random node kinds.
+ * 3. We can have multiple size classes per node kind.
+ */
+#define RT_NODE_KIND_3 0x00
+#define RT_NODE_KIND_32 0x01
+#define RT_NODE_KIND_125 0x02
+#define RT_NODE_KIND_256 0x03
+#define RT_NODE_KIND_COUNT 4
+
+/*
+ * Calculate the slab blocksize so that we can allocate at least 32 chunks
+ * from the block.
+ */
+#define RT_SLAB_BLOCK_SIZE(size) \
+ Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * (size), (size) * 32)
+
+/* Common type for all nodes types */
+typedef struct RT_NODE
+{
+ /*
+	 * Number of children. We use uint16 to be able to indicate up to 256
+	 * children, the full fanout with an 8-bit span, which does not fit in uint8.
+ */
+ uint16 count;
+
+ /*
+ * Max capacity for the current size class. Storing this in the
+ * node enables multiple size classes per node kind.
+ * Technically, kinds with a single size class don't need this, so we could
+ * keep this in the individual base types, but the code is simpler this way.
+ * Note: node256 is unique in that it cannot possibly have more than a
+ * single size class, so for that kind we store zero, and uint8 is
+ * sufficient for other kinds.
+ */
+ uint8 fanout;
+
+ /*
+ * Shift indicates which part of the key space is represented by this
+ * node. That is, the key is shifted by 'shift' and the lowest
+ * RT_NODE_SPAN bits are then represented in chunk.
+ */
+ uint8 shift;
+
+ /* Node kind, one per search/set algorithm */
+ uint8 kind;
+} RT_NODE;
+
+
+#define RT_PTR_LOCAL RT_NODE *
+
+#ifdef RT_SHMEM
+#define RT_PTR_ALLOC dsa_pointer
+#else
+#define RT_PTR_ALLOC RT_PTR_LOCAL
+#endif
+
+
+#ifdef RT_SHMEM
+#define RT_INVALID_PTR_ALLOC InvalidDsaPointer
+#else
+#define RT_INVALID_PTR_ALLOC NULL
+#endif
+
+#ifdef RT_SHMEM
+#define RT_LOCK_EXCLUSIVE(tree) LWLockAcquire(&tree->ctl->lock, LW_EXCLUSIVE)
+#define RT_LOCK_SHARED(tree) LWLockAcquire(&tree->ctl->lock, LW_SHARED)
+#define RT_UNLOCK(tree) LWLockRelease(&tree->ctl->lock);
+#else
+#define RT_LOCK_EXCLUSIVE(tree) ((void) 0)
+#define RT_LOCK_SHARED(tree) ((void) 0)
+#define RT_UNLOCK(tree) ((void) 0)
+#endif
+
+/*
+ * Inner nodes and leaf nodes have analogous structure. To distinguish
+ * them at runtime, we take advantage of the fact that the key chunk
+ * is accessed by shifting: inner tree nodes (shift > 0) store pointers
+ * to their child nodes in the slots. In leaf nodes (shift == 0),
+ * the slot contains the value corresponding to the key.
+ */
+#define RT_NODE_IS_LEAF(n) (((RT_PTR_LOCAL) (n))->shift == 0)
+
+#define RT_NODE_MUST_GROW(node) \
+ ((node)->base.n.count == (node)->base.n.fanout)
+
+/*
+ * Base type of each node kinds for leaf and inner nodes.
+ * The base types must be a be able to accommodate the largest size
+ * class for variable-sized node kinds.
+ */
+typedef struct RT_NODE_BASE_3
+{
+ RT_NODE n;
+
+ /* 3 children, for key chunks */
+ uint8 chunks[3];
+} RT_NODE_BASE_3;
+
+typedef struct RT_NODE_BASE_32
+{
+ RT_NODE n;
+
+ /* 32 children, for key chunks */
+ uint8 chunks[32];
+} RT_NODE_BASE_32;
+
+/*
+ * node-125 uses the slot_idxs array, an array of length RT_NODE_MAX_SLOTS,
+ * to store indexes into a second array that contains the values (or
+ * child pointers).
+ */
+typedef struct RT_NODE_BASE_125
+{
+ RT_NODE n;
+
+	/* The index of the slot for each key chunk */
+ uint8 slot_idxs[RT_NODE_MAX_SLOTS];
+
+ /* bitmap to track which slots are in use */
+ bitmapword isset[RT_BM_IDX(RT_SLOT_IDX_LIMIT)];
+} RT_NODE_BASE_125;
+
+typedef struct RT_NODE_BASE_256
+{
+ RT_NODE n;
+} RT_NODE_BASE_256;
+
+/*
+ * Inner and leaf nodes.
+ *
+ * These are separate because the value type might be different than
+ * something fitting into a pointer-width type.
+ */
+typedef struct RT_NODE_INNER_3
+{
+ RT_NODE_BASE_3 base;
+
+ /* number of children depends on size class */
+ RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_INNER_3;
+
+typedef struct RT_NODE_LEAF_3
+{
+ RT_NODE_BASE_3 base;
+
+ /* number of values depends on size class */
+ RT_VALUE_TYPE values[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_LEAF_3;
+
+typedef struct RT_NODE_INNER_32
+{
+ RT_NODE_BASE_32 base;
+
+ /* number of children depends on size class */
+ RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_INNER_32;
+
+typedef struct RT_NODE_LEAF_32
+{
+ RT_NODE_BASE_32 base;
+
+ /* number of values depends on size class */
+ RT_VALUE_TYPE values[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_LEAF_32;
+
+typedef struct RT_NODE_INNER_125
+{
+ RT_NODE_BASE_125 base;
+
+ /* number of children depends on size class */
+ RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_INNER_125;
+
+typedef struct RT_NODE_LEAF_125
+{
+ RT_NODE_BASE_125 base;
+
+ /* number of values depends on size class */
+ RT_VALUE_TYPE values[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_LEAF_125;
+
+/*
+ * node-256 is the largest node type. This node has an array
+ * for directly storing values (or child pointers in inner nodes).
+ * Unlike other node kinds, its array size is by definition
+ * fixed.
+ */
+typedef struct RT_NODE_INNER_256
+{
+ RT_NODE_BASE_256 base;
+
+ /* Slots for 256 children */
+ RT_PTR_ALLOC children[RT_NODE_MAX_SLOTS];
+} RT_NODE_INNER_256;
+
+typedef struct RT_NODE_LEAF_256
+{
+ RT_NODE_BASE_256 base;
+
+ /*
+ * Unlike with inner256, zero is a valid value here, so we use a
+ * bitmap to track which slots are in use.
+ */
+ bitmapword isset[RT_BM_IDX(RT_NODE_MAX_SLOTS)];
+
+ /* Slots for 256 values */
+ RT_VALUE_TYPE values[RT_NODE_MAX_SLOTS];
+} RT_NODE_LEAF_256;
+
+/*
+ * Node size classes
+ *
+ * Nodes of different kinds necessarily belong to different size classes.
+ * The main innovation in our implementation compared to the ART paper
+ * is decoupling the notion of size class from kind.
+ *
+ * The size classes within a given node kind have the same underlying
+ * type, but a variable number of children/values. This is possible
+ * because the base type contains small fixed data structures that
+ * work the same way regardless of how full the node is. We store the
+ * node's allocated capacity in the "fanout" member of RT_NODE, to allow
+ * runtime introspection.
+ *
+ * Growing from one node kind to another requires special code for each
+ * case, but growing from one size class to another within the same kind
+ * is basically just allocate + memcpy.
+ *
+ * The size classes have been chosen so that inner nodes on platforms
+ * with 64-bit pointers (and leaf nodes when using a 64-bit key) are
+ * equal to or slightly smaller than some DSA size class.
+ */
+typedef enum RT_SIZE_CLASS
+{
+ RT_CLASS_3 = 0,
+ RT_CLASS_32_MIN,
+ RT_CLASS_32_MAX,
+ RT_CLASS_125,
+ RT_CLASS_256
+} RT_SIZE_CLASS;
+
+/* Information for each size class */
+typedef struct RT_SIZE_CLASS_ELEM
+{
+ const char *name;
+ int fanout;
+
+ /* slab chunk size */
+ Size inner_size;
+ Size leaf_size;
+} RT_SIZE_CLASS_ELEM;
+
+static const RT_SIZE_CLASS_ELEM RT_SIZE_CLASS_INFO[] = {
+ [RT_CLASS_3] = {
+ .name = "radix tree node 3",
+ .fanout = 3,
+ .inner_size = sizeof(RT_NODE_INNER_3) + 3 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_3) + 3 * sizeof(RT_VALUE_TYPE),
+ },
+ [RT_CLASS_32_MIN] = {
+ .name = "radix tree node 15",
+ .fanout = 15,
+ .inner_size = sizeof(RT_NODE_INNER_32) + 15 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_32) + 15 * sizeof(RT_VALUE_TYPE),
+ },
+ [RT_CLASS_32_MAX] = {
+ .name = "radix tree node 32",
+ .fanout = 32,
+ .inner_size = sizeof(RT_NODE_INNER_32) + 32 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_32) + 32 * sizeof(RT_VALUE_TYPE),
+ },
+ [RT_CLASS_125] = {
+ .name = "radix tree node 125",
+ .fanout = 125,
+ .inner_size = sizeof(RT_NODE_INNER_125) + 125 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_125) + 125 * sizeof(RT_VALUE_TYPE),
+ },
+ [RT_CLASS_256] = {
+ .name = "radix tree node 256",
+ .fanout = 256,
+ .inner_size = sizeof(RT_NODE_INNER_256),
+ .leaf_size = sizeof(RT_NODE_LEAF_256),
+ },
+};
+
+#define RT_SIZE_CLASS_COUNT lengthof(RT_SIZE_CLASS_INFO)
+
+#ifdef RT_SHMEM
+/* A magic value used to identify our radix tree */
+#define RT_RADIX_TREE_MAGIC 0x54A48167
+#endif
+
+/* Contains the actual tree and ancillary info */
+// WIP: this name is a bit strange
+typedef struct RT_RADIX_TREE_CONTROL
+{
+#ifdef RT_SHMEM
+ RT_HANDLE handle;
+ uint32 magic;
+ LWLock lock;
+#endif
+
+ RT_PTR_ALLOC root;
+ uint64 max_val;
+ uint64 num_keys;
+
+ /* statistics */
+#ifdef RT_DEBUG
+ int32 cnt[RT_SIZE_CLASS_COUNT];
+#endif
+} RT_RADIX_TREE_CONTROL;
+
+/* Entry point for allocating and accessing the tree */
+typedef struct RT_RADIX_TREE
+{
+ MemoryContext context;
+
+ /* pointing to either local memory or DSA */
+ RT_RADIX_TREE_CONTROL *ctl;
+
+#ifdef RT_SHMEM
+ dsa_area *dsa;
+#else
+ MemoryContextData *inner_slabs[RT_SIZE_CLASS_COUNT];
+ MemoryContextData *leaf_slabs[RT_SIZE_CLASS_COUNT];
+#endif
+} RT_RADIX_TREE;
+
+/*
+ * Iteration support.
+ *
+ * Iterating the radix tree returns each pair of key and value in the ascending
+ * order of the key. To support this, we iterate over the nodes at each level.
+ *
+ * RT_NODE_ITER struct is used to track the iteration within a node.
+ *
+ * RT_ITER is the struct for iteration of the radix tree, and uses RT_NODE_ITER
+ * in order to track the iteration of each level. During iteration, we also
+ * construct the key whenever updating the node iteration information, e.g., when
+ * advancing the current index within the node or when moving to the next node
+ * at the same level.
+ *
+ * XXX: Currently we allow only one process to do iteration. Therefore, rt_node_iter
+ * has the local pointers to nodes, rather than RT_PTR_ALLOC.
+ * We need either a safeguard to disallow other processes from beginning an iteration
+ * while one is in progress, or to allow multiple processes to iterate concurrently.
+ */
+typedef struct RT_NODE_ITER
+{
+ RT_PTR_LOCAL node; /* current node being iterated */
+ int current_idx; /* current position. -1 for initial value */
+} RT_NODE_ITER;
+
+typedef struct RT_ITER
+{
+ RT_RADIX_TREE *tree;
+
+ /* Track the iteration on nodes of each level */
+ RT_NODE_ITER stack[RT_MAX_LEVEL];
+ int stack_len;
+
+ /* The key is constructed during iteration */
+ uint64 key;
+} RT_ITER;
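+
+/*
+ * A rough caller-side sketch of the iteration interface ('tree' is an
+ * already-populated radix tree):
+ *
+ *		RT_VALUE_TYPE value;
+ *		uint64		key;
+ *		RT_ITER    *iter = RT_BEGIN_ITERATE(tree);
+ *
+ *		while (RT_ITERATE_NEXT(iter, &key, &value))
+ *			... key-value pairs are returned in ascending key order ...
+ *
+ *		RT_END_ITERATE(iter);
+ */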
+
+
+static void RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
+ uint64 key, RT_PTR_ALLOC child);
+static bool RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
+ uint64 key, RT_VALUE_TYPE *value_p);
+
+/* verification (available only with assertion) */
+static void RT_VERIFY_NODE(RT_PTR_LOCAL node);
+
+/* Get the local address of an allocated node */
+static inline RT_PTR_LOCAL
+RT_PTR_GET_LOCAL(RT_RADIX_TREE *tree, RT_PTR_ALLOC node)
+{
+#ifdef RT_SHMEM
+ return dsa_get_address(tree->dsa, (dsa_pointer) node);
+#else
+ return node;
+#endif
+}
+
+static inline bool
+RT_PTR_ALLOC_IS_VALID(RT_PTR_ALLOC ptr)
+{
+#ifdef RT_SHMEM
+ return DsaPointerIsValid(ptr);
+#else
+ return PointerIsValid(ptr);
+#endif
+}
+
+/*
+ * Return index of the first element in the node's chunk array that equals
+ * 'chunk'. Return -1 if there is no such element.
+ */
+static inline int
+RT_NODE_3_SEARCH_EQ(RT_NODE_BASE_3 *node, uint8 chunk)
+{
+ int idx = -1;
+
+ for (int i = 0; i < node->n.count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ idx = i;
+ break;
+ }
+ }
+
+ return idx;
+}
+
+/*
+ * Return index of the chunk and slot arrays for inserting into the node,
+ * such that the chunk array remains ordered.
+ */
+static inline int
+RT_NODE_3_GET_INSERTPOS(RT_NODE_BASE_3 *node, uint8 chunk)
+{
+ int idx;
+
+ for (idx = 0; idx < node->n.count; idx++)
+ {
+ if (node->chunks[idx] >= chunk)
+ break;
+ }
+
+ return idx;
+}
+
+/*
+ * Return index of the first element in the node's chunk array that equals
+ * 'chunk'. Return -1 if there is no such element.
+ */
+static inline int
+RT_NODE_32_SEARCH_EQ(RT_NODE_BASE_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ uint32 bitfield;
+ int index_simd = -1;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index = -1;
+
+ for (int i = 0; i < count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ index = i;
+ break;
+ }
+ }
+#endif
+
+#ifndef USE_NO_SIMD
+ /* replicate the search key */
+ spread_chunk = vector8_broadcast(chunk);
+
+ /* compare to all 32 keys stored in the node */
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ cmp1 = vector8_eq(spread_chunk, haystack1);
+ cmp2 = vector8_eq(spread_chunk, haystack2);
+
+ /* convert comparison to a bitfield */
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+
+ /* mask off invalid entries */
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ /* convert bitfield to index by counting trailing zeros */
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+/*
+ * Return index of the chunk and slot arrays for inserting into the node,
+ * such that the chunk array remains ordered.
+ */
+static inline int
+RT_NODE_32_GET_INSERTPOS(RT_NODE_BASE_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ Vector8 min1;
+ Vector8 min2;
+ uint32 bitfield;
+ int index_simd;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index;
+
+ for (index = 0; index < count; index++)
+ {
+ /*
+ * This is coded with '>=' to match what we can do with SIMD,
+ * with an assert to keep us honest.
+ */
+ if (node->chunks[index] >= chunk)
+ {
+ Assert(node->chunks[index] != chunk);
+ break;
+ }
+ }
+#endif
+
+#ifndef USE_NO_SIMD
+ /*
+ * This is a bit more complicated than RT_NODE_32_SEARCH_EQ(), because
+ * no unsigned uint8 comparison instruction exists, at least for SSE2. So
+ * we need to play some trickery using vector8_min() to effectively get
+ * >=. There'll never be any equal elements in current uses, but that's
+ * what we get here...
+ */
+ spread_chunk = vector8_broadcast(chunk);
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ min1 = vector8_min(spread_chunk, haystack1);
+ min2 = vector8_min(spread_chunk, haystack2);
+ cmp1 = vector8_eq(spread_chunk, min1);
+ cmp2 = vector8_eq(spread_chunk, min2);
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+ else
+ index_simd = count;
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+
+/*
+ * Functions to manipulate both chunks array and children/values array.
+ * These are used for node-3 and node-32.
+ */
+
+/* Shift the elements right at 'idx' by one */
+static inline void
+RT_CHUNK_CHILDREN_ARRAY_SHIFT(uint8 *chunks, RT_PTR_ALLOC *children, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+ memmove(&(children[idx + 1]), &(children[idx]), sizeof(RT_PTR_ALLOC) * (count - idx));
+}
+
+static inline void
+RT_CHUNK_VALUES_ARRAY_SHIFT(uint8 *chunks, RT_VALUE_TYPE *values, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+ memmove(&(values[idx + 1]), &(values[idx]), sizeof(RT_VALUE_TYPE) * (count - idx));
+}
+
+/* Delete the element at 'idx' */
+static inline void
+RT_CHUNK_CHILDREN_ARRAY_DELETE(uint8 *chunks, RT_PTR_ALLOC *children, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(children[idx]), &(children[idx + 1]), sizeof(RT_PTR_ALLOC) * (count - idx - 1));
+}
+
+static inline void
+RT_CHUNK_VALUES_ARRAY_DELETE(uint8 *chunks, RT_VALUE_TYPE *values, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(values[idx]), &(values[idx + 1]), sizeof(RT_VALUE_TYPE) * (count - idx - 1));
+}
+
+/* Copy both chunks and children/values arrays */
+static inline void
+RT_CHUNK_CHILDREN_ARRAY_COPY(uint8 *src_chunks, RT_PTR_ALLOC *src_children,
+ uint8 *dst_chunks, RT_PTR_ALLOC *dst_children)
+{
+ const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_3].fanout;
+ const Size chunk_size = sizeof(uint8) * fanout;
+ const Size children_size = sizeof(RT_PTR_ALLOC) * fanout;
+
+ memcpy(dst_chunks, src_chunks, chunk_size);
+ memcpy(dst_children, src_children, children_size);
+}
+
+static inline void
+RT_CHUNK_VALUES_ARRAY_COPY(uint8 *src_chunks, RT_VALUE_TYPE *src_values,
+ uint8 *dst_chunks, RT_VALUE_TYPE *dst_values)
+{
+ const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_3].fanout;
+ const Size chunk_size = sizeof(uint8) * fanout;
+ const Size values_size = sizeof(RT_VALUE_TYPE) * fanout;
+
+ memcpy(dst_chunks, src_chunks, chunk_size);
+ memcpy(dst_values, src_values, values_size);
+}
+
+/* Functions to manipulate inner and leaf node-125 */
+
+/* Does the given chunk in the node have a value? */
+static inline bool
+RT_NODE_125_IS_CHUNK_USED(RT_NODE_BASE_125 *node, uint8 chunk)
+{
+ return node->slot_idxs[chunk] != RT_INVALID_SLOT_IDX;
+}
+
+static inline RT_PTR_ALLOC
+RT_NODE_INNER_125_GET_CHILD(RT_NODE_INNER_125 *node, uint8 chunk)
+{
+ Assert(!RT_NODE_IS_LEAF(node));
+ return node->children[node->base.slot_idxs[chunk]];
+}
+
+static inline RT_VALUE_TYPE
+RT_NODE_LEAF_125_GET_VALUE(RT_NODE_LEAF_125 *node, uint8 chunk)
+{
+ Assert(RT_NODE_IS_LEAF(node));
+ Assert(((RT_NODE_BASE_125 *) node)->slot_idxs[chunk] != RT_INVALID_SLOT_IDX);
+ return node->values[node->base.slot_idxs[chunk]];
+}
+
+/* Functions to manipulate inner and leaf node-256 */
+
+/* Return true if the slot corresponding to the given chunk is in use */
+static inline bool
+RT_NODE_INNER_256_IS_CHUNK_USED(RT_NODE_INNER_256 *node, uint8 chunk)
+{
+ Assert(!RT_NODE_IS_LEAF(node));
+ return node->children[chunk] != RT_INVALID_PTR_ALLOC;
+}
+
+static inline bool
+RT_NODE_LEAF_256_IS_CHUNK_USED(RT_NODE_LEAF_256 *node, uint8 chunk)
+{
+ int idx = RT_BM_IDX(chunk);
+ int bitnum = RT_BM_BIT(chunk);
+
+ Assert(RT_NODE_IS_LEAF(node));
+ return (node->isset[idx] & ((bitmapword) 1 << bitnum)) != 0;
+}
+
+static inline RT_PTR_ALLOC
+RT_NODE_INNER_256_GET_CHILD(RT_NODE_INNER_256 *node, uint8 chunk)
+{
+ Assert(!RT_NODE_IS_LEAF(node));
+ Assert(RT_NODE_INNER_256_IS_CHUNK_USED(node, chunk));
+ return node->children[chunk];
+}
+
+static inline RT_VALUE_TYPE
+RT_NODE_LEAF_256_GET_VALUE(RT_NODE_LEAF_256 *node, uint8 chunk)
+{
+ Assert(RT_NODE_IS_LEAF(node));
+ Assert(RT_NODE_LEAF_256_IS_CHUNK_USED(node, chunk));
+ return node->values[chunk];
+}
+
+/* Set the child in the node-256 */
+static inline void
+RT_NODE_INNER_256_SET(RT_NODE_INNER_256 *node, uint8 chunk, RT_PTR_ALLOC child)
+{
+ Assert(!RT_NODE_IS_LEAF(node));
+ node->children[chunk] = child;
+}
+
+/* Set the value in the node-256 */
+static inline void
+RT_NODE_LEAF_256_SET(RT_NODE_LEAF_256 *node, uint8 chunk, RT_VALUE_TYPE value)
+{
+ int idx = RT_BM_IDX(chunk);
+ int bitnum = RT_BM_BIT(chunk);
+
+ Assert(RT_NODE_IS_LEAF(node));
+ node->isset[idx] |= ((bitmapword) 1 << bitnum);
+ node->values[chunk] = value;
+}
+
+/* Set the slot at the given chunk position */
+static inline void
+RT_NODE_INNER_256_DELETE(RT_NODE_INNER_256 *node, uint8 chunk)
+{
+ Assert(!RT_NODE_IS_LEAF(node));
+ node->children[chunk] = RT_INVALID_PTR_ALLOC;
+}
+
+static inline void
+RT_NODE_LEAF_256_DELETE(RT_NODE_LEAF_256 *node, uint8 chunk)
+{
+ int idx = RT_BM_IDX(chunk);
+ int bitnum = RT_BM_BIT(chunk);
+
+ Assert(RT_NODE_IS_LEAF(node));
+ node->isset[idx] &= ~((bitmapword) 1 << bitnum);
+}
+
+/*
+ * Return the largest shift that will allow storing the given key.
+ */
+static inline int
+RT_KEY_GET_SHIFT(uint64 key)
+{
+ if (key == 0)
+ return 0;
+ else
+ return (pg_leftmost_one_pos64(key) / RT_NODE_SPAN) * RT_NODE_SPAN;
+}
+
+/*
+ * Return the max value that can be stored in the tree with the given shift.
+ */
+static uint64
+RT_SHIFT_GET_MAX_VAL(int shift)
+{
+ if (shift == RT_MAX_SHIFT)
+ return UINT64_MAX;
+
+ return (UINT64CONST(1) << (shift + RT_NODE_SPAN)) - 1;
+}
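+
+/*
+ * Worked example for the two helpers above (with RT_NODE_SPAN == 8): for
+ * key == 0x123456, pg_leftmost_one_pos64() returns 20, so RT_KEY_GET_SHIFT()
+ * yields (20 / 8) * 8 == 16, i.e. three levels (shifts 16, 8, 0) suffice.
+ * RT_SHIFT_GET_MAX_VAL(16) is then (1 << 24) - 1 == 0xFFFFFF, the largest
+ * key such a tree can hold before RT_EXTEND() has to add another level.
+ */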
+
+/*
+ * Allocate a new node with the given size class.
+ */
+static RT_PTR_ALLOC
+RT_ALLOC_NODE(RT_RADIX_TREE *tree, RT_SIZE_CLASS size_class, bool is_leaf)
+{
+ RT_PTR_ALLOC allocnode;
+ size_t allocsize;
+
+ if (is_leaf)
+ allocsize = RT_SIZE_CLASS_INFO[size_class].leaf_size;
+ else
+ allocsize = RT_SIZE_CLASS_INFO[size_class].inner_size;
+
+#ifdef RT_SHMEM
+ allocnode = dsa_allocate(tree->dsa, allocsize);
+#else
+ if (is_leaf)
+ allocnode = (RT_PTR_ALLOC) MemoryContextAlloc(tree->leaf_slabs[size_class],
+ allocsize);
+ else
+ allocnode = (RT_PTR_ALLOC) MemoryContextAlloc(tree->inner_slabs[size_class],
+ allocsize);
+#endif
+
+#ifdef RT_DEBUG
+ /* update the statistics */
+ tree->ctl->cnt[size_class]++;
+#endif
+
+ return allocnode;
+}
+
+/* Initialize the node contents */
+static inline void
+RT_INIT_NODE(RT_PTR_LOCAL node, uint8 kind, RT_SIZE_CLASS size_class, bool is_leaf)
+{
+ if (is_leaf)
+ MemSet(node, 0, RT_SIZE_CLASS_INFO[size_class].leaf_size);
+ else
+ MemSet(node, 0, RT_SIZE_CLASS_INFO[size_class].inner_size);
+
+ node->kind = kind;
+
+ if (kind == RT_NODE_KIND_256)
+ /* See comment for the RT_NODE type */
+ Assert(node->fanout == 0);
+ else
+ node->fanout = RT_SIZE_CLASS_INFO[size_class].fanout;
+
+ /* Initialize slot_idxs to invalid values */
+ if (kind == RT_NODE_KIND_125)
+ {
+ RT_NODE_BASE_125 *n125 = (RT_NODE_BASE_125 *) node;
+
+ memset(n125->slot_idxs, RT_INVALID_SLOT_IDX, sizeof(n125->slot_idxs));
+ }
+}
+
+/*
+ * Create a new node as the root. Subordinate nodes will be created during
+ * the insertion.
+ */
+static pg_noinline void
+RT_NEW_ROOT(RT_RADIX_TREE *tree, uint64 key)
+{
+ int shift = RT_KEY_GET_SHIFT(key);
+ bool is_leaf = shift == 0;
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+
+ allocnode = RT_ALLOC_NODE(tree, RT_CLASS_3, is_leaf);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(newnode, RT_NODE_KIND_3, RT_CLASS_3, is_leaf);
+ newnode->shift = shift;
+ tree->ctl->max_val = RT_SHIFT_GET_MAX_VAL(shift);
+ tree->ctl->root = allocnode;
+}
+
+static inline void
+RT_COPY_NODE(RT_PTR_LOCAL newnode, RT_PTR_LOCAL oldnode)
+{
+ newnode->shift = oldnode->shift;
+ newnode->count = oldnode->count;
+}
+
+/*
+ * Given a newly allocated node and an old node, initialize the new
+ * node with the necessary fields and return its local pointer.
+ */
+static inline RT_PTR_LOCAL
+RT_SWITCH_NODE_KIND(RT_RADIX_TREE *tree, RT_PTR_ALLOC allocnode, RT_PTR_LOCAL node,
+ uint8 new_kind, uint8 new_class, bool is_leaf)
+{
+ RT_PTR_LOCAL newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(newnode, new_kind, new_class, is_leaf);
+ RT_COPY_NODE(newnode, node);
+
+ return newnode;
+}
+
+/* Free the given node */
+static void
+RT_FREE_NODE(RT_RADIX_TREE *tree, RT_PTR_ALLOC allocnode)
+{
+ /* If we're deleting the root node, make the tree empty */
+ if (tree->ctl->root == allocnode)
+ {
+ tree->ctl->root = RT_INVALID_PTR_ALLOC;
+ tree->ctl->max_val = 0;
+ }
+
+#ifdef RT_DEBUG
+ {
+ int i;
+ RT_PTR_LOCAL node = RT_PTR_GET_LOCAL(tree, allocnode);
+
+ /* update the statistics */
+ for (i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ if (node->fanout == RT_SIZE_CLASS_INFO[i].fanout)
+ break;
+ }
+
+ /* fanout of node256 is intentionally 0 */
+ if (i == RT_SIZE_CLASS_COUNT)
+ i = RT_CLASS_256;
+
+ tree->ctl->cnt[i]--;
+ Assert(tree->ctl->cnt[i] >= 0);
+ }
+#endif
+
+#ifdef RT_SHMEM
+ dsa_free(tree->dsa, allocnode);
+#else
+ pfree(allocnode);
+#endif
+}
+
+/* Update the parent's pointer when growing a node */
+static inline void
+RT_NODE_UPDATE_INNER(RT_PTR_LOCAL node, uint64 key, RT_PTR_ALLOC new_child)
+{
+#define RT_ACTION_UPDATE
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_INNER
+#undef RT_ACTION_UPDATE
+}
+
+/*
+ * Replace old_child with new_child, and free the old one.
+ */
+static inline void
+RT_REPLACE_NODE(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent,
+ RT_PTR_ALLOC stored_old_child, RT_PTR_LOCAL old_child,
+ RT_PTR_ALLOC new_child, uint64 key)
+{
+#ifdef USE_ASSERT_CHECKING
+ RT_PTR_LOCAL new = RT_PTR_GET_LOCAL(tree, new_child);
+
+ Assert(old_child->shift == new->shift);
+ Assert(old_child->count == new->count);
+#endif
+
+ if (parent == old_child)
+ {
+ /* Replace the root node with the new larger node */
+ tree->ctl->root = new_child;
+ }
+ else
+ RT_NODE_UPDATE_INNER(parent, key, new_child);
+
+ RT_FREE_NODE(tree, stored_old_child);
+}
+
+/*
+ * The radix tree doesn't have sufficient height. Extend the radix tree so
+ * it can store the key.
+ */
+static pg_noinline void
+RT_EXTEND(RT_RADIX_TREE *tree, uint64 key)
+{
+ int target_shift;
+ RT_PTR_LOCAL root = RT_PTR_GET_LOCAL(tree, tree->ctl->root);
+ int shift = root->shift + RT_NODE_SPAN;
+
+ target_shift = RT_KEY_GET_SHIFT(key);
+
+ /* Grow tree from 'shift' to 'target_shift' */
+ while (shift <= target_shift)
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL node;
+ RT_NODE_INNER_3 *n3;
+
+		allocnode = RT_ALLOC_NODE(tree, RT_CLASS_3, false);
+		node = RT_PTR_GET_LOCAL(tree, allocnode);
+		RT_INIT_NODE(node, RT_NODE_KIND_3, RT_CLASS_3, false);
+ node->shift = shift;
+ node->count = 1;
+
+ n3 = (RT_NODE_INNER_3 *) node;
+ n3->base.chunks[0] = 0;
+ n3->children[0] = tree->ctl->root;
+
+ /* Update the root */
+ tree->ctl->root = allocnode;
+
+ shift += RT_NODE_SPAN;
+ }
+
+ tree->ctl->max_val = RT_SHIFT_GET_MAX_VAL(target_shift);
+}
+
+/*
+ * The radix tree doesn't have inner and leaf nodes for the given key-value pair.
+ * Insert inner and leaf nodes from 'node' down to the bottom.
+ */
+static pg_noinline void
+RT_SET_EXTEND(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p, RT_PTR_LOCAL parent,
+ RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node)
+{
+ int shift = node->shift;
+
+ Assert(RT_PTR_GET_LOCAL(tree, stored_node) == node);
+
+ while (shift >= RT_NODE_SPAN)
+ {
+ RT_PTR_ALLOC allocchild;
+ RT_PTR_LOCAL newchild;
+ int newshift = shift - RT_NODE_SPAN;
+ bool is_leaf = newshift == 0;
+
+ allocchild = RT_ALLOC_NODE(tree, RT_CLASS_3, is_leaf);
+ newchild = RT_PTR_GET_LOCAL(tree, allocchild);
+ RT_INIT_NODE(newchild, RT_NODE_KIND_3, RT_CLASS_3, is_leaf);
+ newchild->shift = newshift;
+ RT_NODE_INSERT_INNER(tree, parent, stored_node, node, key, allocchild);
+
+ parent = node;
+ node = newchild;
+ stored_node = allocchild;
+ shift -= RT_NODE_SPAN;
+ }
+
+ RT_NODE_INSERT_LEAF(tree, parent, stored_node, node, key, value_p);
+ tree->ctl->num_keys++;
+}
+
+/*
+ * Search for the child pointer corresponding to 'key' in the given node.
+ *
+ * Return true if the key is found, otherwise return false. On success, the child
+ * pointer is returned in child_p.
+ */
+static inline bool
+RT_NODE_SEARCH_INNER(RT_PTR_LOCAL node, uint64 key, RT_PTR_ALLOC *child_p)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/*
+ * Search for the value corresponding to 'key' in the given node.
+ *
+ * Return true if the key is found, otherwise return false. On success, the
+ * value is copied into value_p.
+ */
+static inline bool
+RT_NODE_SEARCH_LEAF(RT_PTR_LOCAL node, uint64 key, RT_VALUE_TYPE *value_p)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
+#ifdef RT_USE_DELETE
+/*
+ * Search for the child pointer corresponding to 'key' in the given node.
+ *
+ * Delete the child entry and return true if the key is found, otherwise return false.
+ */
+static inline bool
+RT_NODE_DELETE_INNER(RT_PTR_LOCAL node, uint64 key)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_delete_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/*
+ * Search for the value corresponding to 'key' in the given node.
+ *
+ * Delete the value and return true if the key is found, otherwise return false.
+ */
+static inline bool
+RT_NODE_DELETE_LEAF(RT_PTR_LOCAL node, uint64 key)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_delete_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+#endif
+
+/*
+ * Insert "child" into "node".
+ *
+ * "parent" is the parent of "node", so the grandparent of the child.
+ * If the node we're inserting into needs to grow, we update the parent's
+ * child pointer with the pointer to the new larger node.
+ */
+static void
+RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
+ uint64 key, RT_PTR_ALLOC child)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_insert_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/* Like RT_NODE_INSERT_INNER, but for leaf nodes */
+static bool
+RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
+ uint64 key, RT_VALUE_TYPE *value_p)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_insert_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
+/*
+ * Create the radix tree in the given memory context and return it.
+ */
+RT_SCOPE RT_RADIX_TREE *
+#ifdef RT_SHMEM
+RT_CREATE(MemoryContext ctx, dsa_area *dsa, int tranche_id)
+#else
+RT_CREATE(MemoryContext ctx)
+#endif
+{
+ RT_RADIX_TREE *tree;
+ MemoryContext old_ctx;
+#ifdef RT_SHMEM
+ dsa_pointer dp;
+#endif
+
+ old_ctx = MemoryContextSwitchTo(ctx);
+
+ tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE));
+ tree->context = ctx;
+
+#ifdef RT_SHMEM
+ tree->dsa = dsa;
+ dp = dsa_allocate0(dsa, sizeof(RT_RADIX_TREE_CONTROL));
+ tree->ctl = (RT_RADIX_TREE_CONTROL *) dsa_get_address(dsa, dp);
+ tree->ctl->handle = dp;
+ tree->ctl->magic = RT_RADIX_TREE_MAGIC;
+ LWLockInitialize(&tree->ctl->lock, tranche_id);
+#else
+ tree->ctl = (RT_RADIX_TREE_CONTROL *) palloc0(sizeof(RT_RADIX_TREE_CONTROL));
+
+ /* Create a slab context for each size class */
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ RT_SIZE_CLASS_ELEM size_class = RT_SIZE_CLASS_INFO[i];
+ size_t inner_blocksize = RT_SLAB_BLOCK_SIZE(size_class.inner_size);
+ size_t leaf_blocksize = RT_SLAB_BLOCK_SIZE(size_class.leaf_size);
+
+ tree->inner_slabs[i] = SlabContextCreate(ctx,
+ size_class.name,
+ inner_blocksize,
+ size_class.inner_size);
+ tree->leaf_slabs[i] = SlabContextCreate(ctx,
+ size_class.name,
+ leaf_blocksize,
+ size_class.leaf_size);
+ }
+#endif
+
+ tree->ctl->root = RT_INVALID_PTR_ALLOC;
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return tree;
+}
+
+#ifdef RT_SHMEM
+RT_SCOPE RT_RADIX_TREE *
+RT_ATTACH(dsa_area *dsa, RT_HANDLE handle)
+{
+ RT_RADIX_TREE *tree;
+ dsa_pointer control;
+
+ tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE));
+
+	/* Find the control object in shared memory */
+ control = handle;
+
+ tree->dsa = dsa;
+ tree->ctl = (RT_RADIX_TREE_CONTROL *) dsa_get_address(dsa, control);
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+
+ return tree;
+}
+
+RT_SCOPE void
+RT_DETACH(RT_RADIX_TREE *tree)
+{
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+ pfree(tree);
+}
+
+RT_SCOPE RT_HANDLE
+RT_GET_HANDLE(RT_RADIX_TREE *tree)
+{
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+ return tree->ctl->handle;
+}
+
+/*
+ * Recursively free all nodes allocated in the DSA area.
+ */
+static void
+RT_FREE_RECURSE(RT_RADIX_TREE *tree, RT_PTR_ALLOC ptr)
+{
+ RT_PTR_LOCAL node = RT_PTR_GET_LOCAL(tree, ptr);
+
+ check_stack_depth();
+ CHECK_FOR_INTERRUPTS();
+
+ /* The leaf node doesn't have child pointers */
+ if (RT_NODE_IS_LEAF(node))
+ {
+ dsa_free(tree->dsa, ptr);
+ return;
+ }
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE_INNER_3 *n3 = (RT_NODE_INNER_3 *) node;
+
+ for (int i = 0; i < n3->base.n.count; i++)
+ RT_FREE_RECURSE(tree, n3->children[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE_INNER_32 *n32 = (RT_NODE_INNER_32 *) node;
+
+ for (int i = 0; i < n32->base.n.count; i++)
+ RT_FREE_RECURSE(tree, n32->children[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE_INNER_125 *n125 = (RT_NODE_INNER_125 *) node;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(&n125->base, i))
+ continue;
+
+ RT_FREE_RECURSE(tree, RT_NODE_INNER_125_GET_CHILD(n125, i));
+ }
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE_INNER_256 *n256 = (RT_NODE_INNER_256 *) node;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
+ continue;
+
+ RT_FREE_RECURSE(tree, RT_NODE_INNER_256_GET_CHILD(n256, i));
+ }
+
+ break;
+ }
+ }
+
+ /* Free the inner node */
+ dsa_free(tree->dsa, ptr);
+}
+#endif
+
+/*
+ * Free the given radix tree.
+ */
+RT_SCOPE void
+RT_FREE(RT_RADIX_TREE *tree)
+{
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+
+ /* Free all memory used for radix tree nodes */
+ if (RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ RT_FREE_RECURSE(tree, tree->ctl->root);
+
+ /*
+	 * Vandalize the control block to help catch programming errors where
+ * other backends access the memory formerly occupied by this radix tree.
+ */
+ tree->ctl->magic = 0;
+ dsa_free(tree->dsa, tree->ctl->handle);
+#else
+ pfree(tree->ctl);
+
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ MemoryContextDelete(tree->inner_slabs[i]);
+ MemoryContextDelete(tree->leaf_slabs[i]);
+ }
+#endif
+
+ pfree(tree);
+}
+
+/*
+ * Set the value for the given key. If the entry already exists, we update its
+ * value and return true. Returns false if the entry doesn't yet exist.
+ */
+RT_SCOPE bool
+RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p)
+{
+ int shift;
+ bool updated;
+ RT_PTR_LOCAL parent;
+ RT_PTR_ALLOC stored_child;
+ RT_PTR_LOCAL child;
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+
+ RT_LOCK_EXCLUSIVE(tree);
+
+ /* Empty tree, create the root */
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ RT_NEW_ROOT(tree, key);
+
+ /* Extend the tree if necessary */
+ if (key > tree->ctl->max_val)
+ RT_EXTEND(tree, key);
+
+ stored_child = tree->ctl->root;
+ parent = RT_PTR_GET_LOCAL(tree, stored_child);
+ shift = parent->shift;
+
+ /* Descend the tree until we reach a leaf node */
+ while (shift >= 0)
+ {
+ RT_PTR_ALLOC new_child = RT_INVALID_PTR_ALLOC;
+
+ child = RT_PTR_GET_LOCAL(tree, stored_child);
+
+ if (RT_NODE_IS_LEAF(child))
+ break;
+
+ if (!RT_NODE_SEARCH_INNER(child, key, &new_child))
+ {
+ RT_SET_EXTEND(tree, key, value_p, parent, stored_child, child);
+ RT_UNLOCK(tree);
+ return false;
+ }
+
+ parent = child;
+ stored_child = new_child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ updated = RT_NODE_INSERT_LEAF(tree, parent, stored_child, child, key, value_p);
+
+ /* Update the statistics */
+ if (!updated)
+ tree->ctl->num_keys++;
+
+ RT_UNLOCK(tree);
+ return updated;
+}
+
+/*
+ * Search for the given key in the radix tree. Return true if the key is found,
+ * otherwise return false. On success, the value is copied into *value_p, so it must
+ * not be NULL.
+ */
+RT_SCOPE bool
+RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p)
+{
+ RT_PTR_LOCAL node;
+ int shift;
+ bool found;
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+ Assert(value_p != NULL);
+
+ RT_LOCK_SHARED(tree);
+
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root) || key > tree->ctl->max_val)
+ {
+ RT_UNLOCK(tree);
+ return false;
+ }
+
+ node = RT_PTR_GET_LOCAL(tree, tree->ctl->root);
+ shift = node->shift;
+
+ /* Descend the tree until we reach a leaf node */
+ while (shift >= 0)
+ {
+ RT_PTR_ALLOC child = RT_INVALID_PTR_ALLOC;
+
+ if (RT_NODE_IS_LEAF(node))
+ break;
+
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ {
+ RT_UNLOCK(tree);
+ return false;
+ }
+
+ node = RT_PTR_GET_LOCAL(tree, child);
+ shift -= RT_NODE_SPAN;
+ }
+
+ found = RT_NODE_SEARCH_LEAF(node, key, value_p);
+
+ RT_UNLOCK(tree);
+ return found;
+}
+
+#ifdef RT_USE_DELETE
+/*
+ * Delete the given key from the radix tree. Return true if the key is found (and
+ * deleted), otherwise do nothing and return false.
+ */
+RT_SCOPE bool
+RT_DELETE(RT_RADIX_TREE *tree, uint64 key)
+{
+ RT_PTR_LOCAL node;
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_ALLOC stack[RT_MAX_LEVEL] = {0};
+ int shift;
+ int level;
+ bool deleted;
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+
+ RT_LOCK_EXCLUSIVE(tree);
+
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root) || key > tree->ctl->max_val)
+ {
+ RT_UNLOCK(tree);
+ return false;
+ }
+
+ /*
+ * Descend the tree to search the key while building a stack of nodes we
+ * visited.
+ */
+ allocnode = tree->ctl->root;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ shift = node->shift;
+ level = -1;
+ while (shift > 0)
+ {
+ RT_PTR_ALLOC child = RT_INVALID_PTR_ALLOC;
+
+ /* Push the current node to the stack */
+ stack[++level] = allocnode;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ {
+ RT_UNLOCK(tree);
+ return false;
+ }
+
+ allocnode = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ /* Delete the key from the leaf node if it exists */
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ deleted = RT_NODE_DELETE_LEAF(node, key);
+
+ if (!deleted)
+ {
+ /* the key was not found in the leaf node */
+ RT_UNLOCK(tree);
+ return false;
+ }
+
+ /* Found the key to delete. Update the statistics */
+ tree->ctl->num_keys--;
+
+ /*
+ * If the leaf node still has keys, we don't need to delete the node, so
+ * we're done.
+ */
+ if (node->count > 0)
+ {
+ RT_UNLOCK(tree);
+ return true;
+ }
+
+ /* Free the empty leaf node */
+ RT_FREE_NODE(tree, allocnode);
+
+ /* Delete the key from the inner nodes, walking back up the stack */
+ while (level >= 0)
+ {
+ allocnode = stack[level--];
+
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ deleted = RT_NODE_DELETE_INNER(node, key);
+ Assert(deleted);
+
+ /* If the node didn't become empty, we stop deleting the key */
+ if (node->count > 0)
+ break;
+
+ /* The node became empty */
+ RT_FREE_NODE(tree, allocnode);
+ }
+
+ RT_UNLOCK(tree);
+ return true;
+}
+#endif
+
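+/*
+ * Splice the given chunk into the iterator's key at the given shift position,
+ * replacing whatever chunk was stored there before.
+ */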
+static inline void
+RT_ITER_UPDATE_KEY(RT_ITER *iter, uint8 chunk, uint8 shift)
+{
+ iter->key &= ~(((uint64) RT_CHUNK_MASK) << shift);
+ iter->key |= (((uint64) chunk) << shift);
+}
+
+/*
+ * Advance to the next slot in the inner node. Return the child if one exists,
+ * otherwise NULL.
+ */
+static inline RT_PTR_LOCAL
+RT_NODE_INNER_ITERATE_NEXT(RT_ITER *iter, RT_NODE_ITER *node_iter)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_iter_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/*
+ * Advance to the next slot in the leaf node. On success, return true and
+ * store the value in *value_p; otherwise return false.
+ */
+static inline bool
+RT_NODE_LEAF_ITERATE_NEXT(RT_ITER *iter, RT_NODE_ITER *node_iter,
+ RT_VALUE_TYPE *value_p)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_iter_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
+/*
+ * Update each node_iter for inner nodes in the iterator node stack.
+ */
+static void
+RT_UPDATE_ITER_STACK(RT_ITER *iter, RT_PTR_LOCAL from_node, int from)
+{
+ int level = from;
+ RT_PTR_LOCAL node = from_node;
+
+ for (;;)
+ {
+ RT_NODE_ITER *node_iter = &(iter->stack[level--]);
+
+ node_iter->node = node;
+ node_iter->current_idx = -1;
+
+ /* We don't advance the leaf node iterator here */
+ if (RT_NODE_IS_LEAF(node))
+ return;
+
+ /* Advance to the next slot in the inner node */
+ node = RT_NODE_INNER_ITERATE_NEXT(iter, node_iter);
+
+ /* We must find the first child in the node */
+ Assert(node);
+ }
+}
+
+/*
+ * Create and return the iterator for the given radix tree.
+ *
+ * The radix tree is locked in shared mode during the iteration, so
+ * RT_END_ITERATE needs to be called when finished to release the lock.
+ */
+RT_SCOPE RT_ITER *
+RT_BEGIN_ITERATE(RT_RADIX_TREE *tree)
+{
+ MemoryContext old_ctx;
+ RT_ITER *iter;
+ RT_PTR_LOCAL root;
+ int top_level;
+
+ old_ctx = MemoryContextSwitchTo(tree->context);
+
+ iter = (RT_ITER *) palloc0(sizeof(RT_ITER));
+ iter->tree = tree;
+
+ RT_LOCK_SHARED(tree);
+
+ /* empty tree */
+ if (!iter->tree->ctl->root)
+ return iter;
+
+ root = RT_PTR_GET_LOCAL(tree, iter->tree->ctl->root);
+ top_level = root->shift / RT_NODE_SPAN;
+ iter->stack_len = top_level;
+
+ /*
+ * Descend from the root to the left-most leaf node. The key is constructed
+ * while descending to the leaf.
+ */
+ RT_UPDATE_ITER_STACK(iter, root, top_level);
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return iter;
+}
+
+/*
+ * Return true and set *key_p and *value_p if there is a next key; otherwise
+ * return false.
+ */
+RT_SCOPE bool
+RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, RT_VALUE_TYPE *value_p)
+{
+ /* Empty tree */
+ if (!iter->tree->ctl->root)
+ return false;
+
+ for (;;)
+ {
+ RT_PTR_LOCAL child = NULL;
+ RT_VALUE_TYPE value;
+ int level;
+ bool found;
+
+ /* Advance the leaf node iterator to get next key-value pair */
+ found = RT_NODE_LEAF_ITERATE_NEXT(iter, &(iter->stack[0]), &value);
+
+ if (found)
+ {
+ *key_p = iter->key;
+ *value_p = value;
+ return true;
+ }
+
+ /*
+ * We've visited all values in the leaf node, so advance the inner node
+ * iterators from level 1 upward until we find the next child node.
+ */
+ for (level = 1; level <= iter->stack_len; level++)
+ {
+ child = RT_NODE_INNER_ITERATE_NEXT(iter, &(iter->stack[level]));
+
+ if (child)
+ break;
+ }
+
+ /* the iteration finished */
+ if (!child)
+ return false;
+
+ /*
+ * Set the node to the node iterator and update the iterator stack
+ * from this node.
+ */
+ RT_UPDATE_ITER_STACK(iter, child, level - 1);
+
+ /* Node iterators are updated, so try again from the leaf */
+ }
+
+ return false;
+}
+
+/*
+ * Terminate the iteration and release the lock.
+ *
+ * This function must be called when the iteration is finished, or when
+ * bailing out of it early.
+ */
+RT_SCOPE void
+RT_END_ITERATE(RT_ITER *iter)
+{
+#ifdef RT_SHMEM
+ Assert(LWLockHeldByMe(&iter->tree->ctl->lock));
+#endif
+
+ RT_UNLOCK(iter->tree);
+ pfree(iter);
+}
+
+/*
+ * Return the amount of memory used by the radix tree.
+ */
+RT_SCOPE uint64
+RT_MEMORY_USAGE(RT_RADIX_TREE *tree)
+{
+ Size total = 0;
+
+ RT_LOCK_SHARED(tree);
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+ total = dsa_get_total_size(tree->dsa);
+#else
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
+ total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
+ }
+#endif
+
+ RT_UNLOCK(tree);
+ return total;
+}
+
+/*
+ * Verify the radix tree node.
+ */
+static void
+RT_VERIFY_NODE(RT_PTR_LOCAL node)
+{
+#ifdef USE_ASSERT_CHECKING
+ Assert(node->count >= 0);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE_BASE_3 *n3 = (RT_NODE_BASE_3 *) node;
+
+ for (int i = 1; i < n3->n.count; i++)
+ Assert(n3->chunks[i - 1] < n3->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE_BASE_32 *n32 = (RT_NODE_BASE_32 *) node;
+
+ for (int i = 1; i < n32->n.count; i++)
+ Assert(n32->chunks[i - 1] < n32->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE_BASE_125 *n125 = (RT_NODE_BASE_125 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ uint8 slot = n125->slot_idxs[i];
+ int idx = RT_BM_IDX(slot);
+ int bitnum = RT_BM_BIT(slot);
+
+ if (!RT_NODE_125_IS_CHUNK_USED(n125, i))
+ continue;
+
+ /* Check if the corresponding slot is used */
+ Assert(slot < node->fanout);
+ Assert((n125->isset[idx] & ((bitmapword) 1 << bitnum)) != 0);
+
+ cnt++;
+ }
+
+ Assert(n125->n.count == cnt);
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_256 *n256 = (RT_NODE_LEAF_256 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RT_BM_IDX(RT_NODE_MAX_SLOTS); i++)
+ cnt += bmw_popcount(n256->isset[i]);
+
+ /* Check that the number of used chunks matches the count */
+ Assert(n256->base.n.count == cnt);
+
+ break;
+ }
+ }
+ }
+#endif
+}
+
+/***************** DEBUG FUNCTIONS *****************/
+#ifdef RT_DEBUG
+
+#define RT_UINT64_FORMAT_HEX "%" INT64_MODIFIER "X"
+
+RT_SCOPE void
+RT_STATS(RT_RADIX_TREE *tree)
+{
+ RT_LOCK_SHARED(tree);
+
+ fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->ctl->max_val);
+ fprintf(stderr, "num_keys = " UINT64_FORMAT "\n", tree->ctl->num_keys);
+
+#ifdef RT_SHMEM
+ fprintf(stderr, "handle = " UINT64_FORMAT "\n", tree->ctl->handle);
+#endif
+
+ if (RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ {
+ RT_PTR_LOCAL root = RT_PTR_GET_LOCAL(tree, tree->ctl->root);
+
+ fprintf(stderr, "height = %d, n3 = %u, n15 = %u, n32 = %u, n125 = %u, n256 = %u\n",
+ root->shift / RT_NODE_SPAN,
+ tree->ctl->cnt[RT_CLASS_3],
+ tree->ctl->cnt[RT_CLASS_32_MIN],
+ tree->ctl->cnt[RT_CLASS_32_MAX],
+ tree->ctl->cnt[RT_CLASS_125],
+ tree->ctl->cnt[RT_CLASS_256]);
+ }
+
+ RT_UNLOCK(tree);
+}
+
+static void
+RT_DUMP_NODE(RT_RADIX_TREE *tree, RT_PTR_ALLOC allocnode, int level,
+ bool recurse, StringInfo buf)
+{
+ RT_PTR_LOCAL node = RT_PTR_GET_LOCAL(tree, allocnode);
+ StringInfoData spaces;
+
+ initStringInfo(&spaces);
+ appendStringInfoSpaces(&spaces, (level * 4) + 1);
+
+ appendStringInfo(buf, "%s%s[%s] kind %d, fanout %d, count %u, shift %u:\n",
+ spaces.data,
+ level == 0 ? "" : "-> ",
+ RT_NODE_IS_LEAF(node) ? "LEAF" : "INNR",
+ (node->kind == RT_NODE_KIND_3) ? 3 :
+ (node->kind == RT_NODE_KIND_32) ? 32 :
+ (node->kind == RT_NODE_KIND_125) ? 125 : 256,
+ node->fanout == 0 ? 256 : node->fanout,
+ node->count, node->shift);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_3 *n3 = (RT_NODE_LEAF_3 *) node;
+
+ appendStringInfo(buf, "%schunk[%d] 0x%X\n",
+ spaces.data, i, n3->base.chunks[i]);
+ }
+ else
+ {
+ RT_NODE_INNER_3 *n3 = (RT_NODE_INNER_3 *) node;
+
+ appendStringInfo(buf, "%schunk[%d] 0x%X",
+ spaces.data, i, n3->base.chunks[i]);
+
+ if (recurse)
+ {
+ appendStringInfo(buf, "\n");
+ RT_DUMP_NODE(tree, n3->children[i], level + 1,
+ recurse, buf);
+ }
+ else
+ appendStringInfo(buf, " (skipped)\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_32 *n32 = (RT_NODE_LEAF_32 *) node;
+
+ appendStringInfo(buf, "%schunk[%d] 0x%X\n",
+ spaces.data, i, n32->base.chunks[i]);
+ }
+ else
+ {
+ RT_NODE_INNER_32 *n32 = (RT_NODE_INNER_32 *) node;
+
+ appendStringInfo(buf, "%schunk[%d] 0x%X",
+ spaces.data, i, n32->base.chunks[i]);
+
+ if (recurse)
+ {
+ appendStringInfo(buf, "\n");
+ RT_DUMP_NODE(tree, n32->children[i], level + 1,
+ recurse, buf);
+ }
+ else
+ appendStringInfo(buf, " (skipped)\n");
+
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE_BASE_125 *b125 = (RT_NODE_BASE_125 *) node;
+ char *sep = "";
+
+ appendStringInfo(buf, "%sslot_idxs: ", spaces.data);
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(b125, i))
+ continue;
+
+ appendStringInfo(buf, "%s[%d]=%d ",
+ sep, i, b125->slot_idxs[i]);
+ sep = ",";
+ }
+
+ appendStringInfo(buf, "\n%sisset-bitmap: ", spaces.data);
+ for (int i = 0; i < (RT_SLOT_IDX_LIMIT / BITS_PER_BYTE); i++)
+ appendStringInfo(buf, "%X ", ((uint8 *) b125->isset)[i]);
+ appendStringInfo(buf, "\n");
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(b125, i))
+ continue;
+
+ if (RT_NODE_IS_LEAF(node))
+ appendStringInfo(buf, "%schunk 0x%X\n",
+ spaces.data, i);
+ else
+ {
+ RT_NODE_INNER_125 *n125 = (RT_NODE_INNER_125 *) b125;
+
+ appendStringInfo(buf, "%schunk 0x%X",
+ spaces.data, i);
+
+ if (recurse)
+ {
+ appendStringInfo(buf, "\n");
+ RT_DUMP_NODE(tree, RT_NODE_INNER_125_GET_CHILD(n125, i),
+ level + 1, recurse, buf);
+ }
+ else
+ appendStringInfo(buf, " (skipped)\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_256 *n256 = (RT_NODE_LEAF_256 *) node;
+
+ appendStringInfo(buf, "%sisset-bitmap: ", spaces.data);
+ for (int i = 0; i < (RT_SLOT_IDX_LIMIT / BITS_PER_BYTE); i++)
+ appendStringInfo(buf, "%X ", ((uint8 *) n256->isset)[i]);
+ appendStringInfo(buf, "\n");
+ }
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_256 *n256 = (RT_NODE_LEAF_256 *) node;
+
+ if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, i))
+ continue;
+
+ appendStringInfo(buf, "%schunk 0x%X\n",
+ spaces.data, i);
+ }
+ else
+ {
+ RT_NODE_INNER_256 *n256 = (RT_NODE_INNER_256 *) node;
+
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
+ continue;
+
+ appendStringInfo(buf, "%schunk 0x%X",
+ spaces.data, i);
+
+ if (recurse)
+ {
+ appendStringInfo(buf, "\n");
+ RT_DUMP_NODE(tree, RT_NODE_INNER_256_GET_CHILD(n256, i),
+ level + 1, recurse, buf);
+ }
+ else
+ appendStringInfo(buf, " (skipped)\n");
+ }
+ }
+ break;
+ }
+ }
+}
+
+RT_SCOPE void
+RT_DUMP_SEARCH(RT_RADIX_TREE *tree, uint64 key)
+{
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL node;
+ StringInfoData buf;
+ int shift;
+ int level = 0;
+
+ RT_STATS(tree);
+
+ RT_LOCK_SHARED(tree);
+
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ {
+ RT_UNLOCK(tree);
+ fprintf(stderr, "empty tree\n");
+ return;
+ }
+
+ if (key > tree->ctl->max_val)
+ {
+ RT_UNLOCK(tree);
+ fprintf(stderr, "key " UINT64_FORMAT "(0x" RT_UINT64_FORMAT_HEX ") is larger than max val\n",
+ key, key);
+ return;
+ }
+
+ initStringInfo(&buf);
+ allocnode = tree->ctl->root;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ shift = node->shift;
+ while (shift >= 0)
+ {
+ RT_PTR_ALLOC child;
+
+ RT_DUMP_NODE(tree, allocnode, level, false, &buf);
+
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_VALUE_TYPE dummy;
+
+ /* We reached a leaf node; find the corresponding slot */
+ RT_NODE_SEARCH_LEAF(node, key, &dummy);
+
+ break;
+ }
+
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ break;
+
+ allocnode = child;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ shift -= RT_NODE_SPAN;
+ level++;
+ }
+ RT_UNLOCK(tree);
+
+ fprintf(stderr, "%s", buf.data);
+}
+
+RT_SCOPE void
+RT_DUMP(RT_RADIX_TREE *tree)
+{
+ StringInfoData buf;
+
+ RT_STATS(tree);
+
+ RT_LOCK_SHARED(tree);
+
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ {
+ RT_UNLOCK(tree);
+ fprintf(stderr, "empty tree\n");
+ return;
+ }
+
+ initStringInfo(&buf);
+
+ RT_DUMP_NODE(tree, tree->ctl->root, 0, true, &buf);
+ RT_UNLOCK(tree);
+
+ fprintf(stderr, "%s",buf.data);
+}
+#endif
+
+#endif /* RT_DEFINE */
+
+
+/* undefine external parameters, so next radix tree can be defined */
+#undef RT_PREFIX
+#undef RT_SCOPE
+#undef RT_DECLARE
+#undef RT_DEFINE
+#undef RT_VALUE_TYPE
+
+/* locally declared macros */
+#undef RT_MAKE_PREFIX
+#undef RT_MAKE_NAME
+#undef RT_MAKE_NAME_
+#undef RT_NODE_SPAN
+#undef RT_NODE_MAX_SLOTS
+#undef RT_CHUNK_MASK
+#undef RT_MAX_SHIFT
+#undef RT_MAX_LEVEL
+#undef RT_GET_KEY_CHUNK
+#undef RT_BM_IDX
+#undef RT_BM_BIT
+#undef RT_LOCK_EXCLUSIVE
+#undef RT_LOCK_SHARED
+#undef RT_UNLOCK
+#undef RT_NODE_IS_LEAF
+#undef RT_NODE_MUST_GROW
+#undef RT_NODE_KIND_COUNT
+#undef RT_SIZE_CLASS_COUNT
+#undef RT_SLOT_IDX_LIMIT
+#undef RT_INVALID_SLOT_IDX
+#undef RT_SLAB_BLOCK_SIZE
+#undef RT_RADIX_TREE_MAGIC
+#undef RT_UINT64_FORMAT_HEX
+
+/* type declarations */
+#undef RT_RADIX_TREE
+#undef RT_RADIX_TREE_CONTROL
+#undef RT_PTR_LOCAL
+#undef RT_PTR_ALLOC
+#undef RT_INVALID_PTR_ALLOC
+#undef RT_HANDLE
+#undef RT_ITER
+#undef RT_NODE
+#undef RT_NODE_ITER
+#undef RT_NODE_KIND_3
+#undef RT_NODE_KIND_32
+#undef RT_NODE_KIND_125
+#undef RT_NODE_KIND_256
+#undef RT_NODE_BASE_3
+#undef RT_NODE_BASE_32
+#undef RT_NODE_BASE_125
+#undef RT_NODE_BASE_256
+#undef RT_NODE_INNER_3
+#undef RT_NODE_INNER_32
+#undef RT_NODE_INNER_125
+#undef RT_NODE_INNER_256
+#undef RT_NODE_LEAF_3
+#undef RT_NODE_LEAF_32
+#undef RT_NODE_LEAF_125
+#undef RT_NODE_LEAF_256
+#undef RT_SIZE_CLASS
+#undef RT_SIZE_CLASS_ELEM
+#undef RT_SIZE_CLASS_INFO
+#undef RT_CLASS_3
+#undef RT_CLASS_32_MIN
+#undef RT_CLASS_32_MAX
+#undef RT_CLASS_125
+#undef RT_CLASS_256
+
+/* function declarations */
+#undef RT_CREATE
+#undef RT_FREE
+#undef RT_ATTACH
+#undef RT_DETACH
+#undef RT_GET_HANDLE
+#undef RT_SEARCH
+#undef RT_SET
+#undef RT_BEGIN_ITERATE
+#undef RT_ITERATE_NEXT
+#undef RT_END_ITERATE
+#undef RT_USE_DELETE
+#undef RT_DELETE
+#undef RT_MEMORY_USAGE
+#undef RT_DUMP
+#undef RT_DUMP_NODE
+#undef RT_DUMP_SEARCH
+#undef RT_STATS
+
+/* internal helper functions */
+#undef RT_NEW_ROOT
+#undef RT_ALLOC_NODE
+#undef RT_INIT_NODE
+#undef RT_FREE_NODE
+#undef RT_FREE_RECURSE
+#undef RT_EXTEND
+#undef RT_SET_EXTEND
+#undef RT_SWITCH_NODE_KIND
+#undef RT_COPY_NODE
+#undef RT_REPLACE_NODE
+#undef RT_PTR_GET_LOCAL
+#undef RT_PTR_ALLOC_IS_VALID
+#undef RT_NODE_3_SEARCH_EQ
+#undef RT_NODE_32_SEARCH_EQ
+#undef RT_NODE_3_GET_INSERTPOS
+#undef RT_NODE_32_GET_INSERTPOS
+#undef RT_CHUNK_CHILDREN_ARRAY_SHIFT
+#undef RT_CHUNK_VALUES_ARRAY_SHIFT
+#undef RT_CHUNK_CHILDREN_ARRAY_DELETE
+#undef RT_CHUNK_VALUES_ARRAY_DELETE
+#undef RT_CHUNK_CHILDREN_ARRAY_COPY
+#undef RT_CHUNK_VALUES_ARRAY_COPY
+#undef RT_NODE_125_IS_CHUNK_USED
+#undef RT_NODE_INNER_125_GET_CHILD
+#undef RT_NODE_LEAF_125_GET_VALUE
+#undef RT_NODE_INNER_256_IS_CHUNK_USED
+#undef RT_NODE_LEAF_256_IS_CHUNK_USED
+#undef RT_NODE_INNER_256_GET_CHILD
+#undef RT_NODE_LEAF_256_GET_VALUE
+#undef RT_NODE_INNER_256_SET
+#undef RT_NODE_LEAF_256_SET
+#undef RT_NODE_INNER_256_DELETE
+#undef RT_NODE_LEAF_256_DELETE
+#undef RT_KEY_GET_SHIFT
+#undef RT_SHIFT_GET_MAX_VAL
+#undef RT_NODE_SEARCH_INNER
+#undef RT_NODE_SEARCH_LEAF
+#undef RT_NODE_UPDATE_INNER
+#undef RT_NODE_DELETE_INNER
+#undef RT_NODE_DELETE_LEAF
+#undef RT_NODE_INSERT_INNER
+#undef RT_NODE_INSERT_LEAF
+#undef RT_NODE_INNER_ITERATE_NEXT
+#undef RT_NODE_LEAF_ITERATE_NEXT
+#undef RT_UPDATE_ITER_STACK
+#undef RT_ITER_UPDATE_KEY
+#undef RT_VERIFY_NODE
+
+#undef RT_DEBUG
diff --git a/src/include/lib/radixtree_delete_impl.h b/src/include/lib/radixtree_delete_impl.h
new file mode 100644
index 0000000000..5f6dda1f12
--- /dev/null
+++ b/src/include/lib/radixtree_delete_impl.h
@@ -0,0 +1,122 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree_delete_impl.h
+ * Common implementation for deletion in leaf and inner nodes.
+ *
+ * Note: There is deliberately no #include guard here
+ *
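+ * This file is included to form the body of a node-level delete function;
+ * the includer must define either RT_NODE_LEVEL_INNER or RT_NODE_LEVEL_LEAF
+ * to select which node variants are operated on.
+ *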
+ * TODO: Shrink nodes when deletion would allow them to fit in a smaller
+ * size class.
+ *
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * src/include/lib/radixtree_delete_impl.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE3_TYPE RT_NODE_INNER_3
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE3_TYPE RT_NODE_LEAF_3
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+
+#ifdef RT_NODE_LEVEL_LEAF
+ Assert(RT_NODE_IS_LEAF(node));
+#else
+ Assert(!RT_NODE_IS_LEAF(node));
+#endif
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node;
+ int idx = RT_NODE_3_SEARCH_EQ((RT_NODE_BASE_3 *) n3, chunk);
+
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_DELETE(n3->base.chunks, n3->values,
+ n3->base.n.count, idx);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_DELETE(n3->base.chunks, n3->children,
+ n3->base.n.count, idx);
+#endif
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
+ int idx = RT_NODE_32_SEARCH_EQ((RT_NODE_BASE_32 *) n32, chunk);
+
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_DELETE(n32->base.chunks, n32->values,
+ n32->base.n.count, idx);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_DELETE(n32->base.chunks, n32->children,
+ n32->base.n.count, idx);
+#endif
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+ int slotpos = n125->base.slot_idxs[chunk];
+ int idx;
+ int bitnum;
+
+ if (slotpos == RT_INVALID_SLOT_IDX)
+ return false;
+
+ idx = RT_BM_IDX(slotpos);
+ bitnum = RT_BM_BIT(slotpos);
+ n125->base.isset[idx] &= ~((bitmapword) 1 << bitnum);
+ n125->base.slot_idxs[chunk] = RT_INVALID_SLOT_IDX;
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk))
+#else
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, chunk))
+#endif
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_NODE_LEAF_256_DELETE(n256, chunk);
+#else
+ RT_NODE_INNER_256_DELETE(n256, chunk);
+#endif
+ break;
+ }
+ }
+
+ /* update statistics */
+ node->count--;
+
+ return true;
+
+#undef RT_NODE3_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_insert_impl.h b/src/include/lib/radixtree_insert_impl.h
new file mode 100644
index 0000000000..d56e58dcac
--- /dev/null
+++ b/src/include/lib/radixtree_insert_impl.h
@@ -0,0 +1,328 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree_insert_impl.h
+ * Common implementation for insertion in leaf and inner nodes.
+ *
+ * Note: There is deliberately no #include guard here
+ *
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * src/include/lib/radixtree_insert_impl.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE3_TYPE RT_NODE_INNER_3
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE3_TYPE RT_NODE_LEAF_3
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+
+#ifdef RT_NODE_LEVEL_LEAF
+ const bool is_leaf = true;
+ bool chunk_exists = false;
+ Assert(RT_NODE_IS_LEAF(node));
+#else
+ const bool is_leaf = false;
+ Assert(!RT_NODE_IS_LEAF(node));
+#endif
+
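+ /*
+ * Insert into the node according to its current kind. If the node has
+ * room, insert here and break out of the switch; otherwise grow the node
+ * into the next larger kind (or size class) and fall through to the
+ * following case to insert into the new node.
+ */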
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ int idx = RT_NODE_3_SEARCH_EQ(&n3->base, chunk);
+
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n3->values[idx] = *value_p;
+ break;
+ }
+#endif
+ if (unlikely(RT_NODE_MUST_GROW(n3)))
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+ RT_NODE32_TYPE *new32;
+ const uint8 new_kind = RT_NODE_KIND_32;
+ const RT_SIZE_CLASS new_class = RT_CLASS_32_MIN;
+
+ /* grow node from 3 to 32 */
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
+ newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, is_leaf);
+ new32 = (RT_NODE32_TYPE *) newnode;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_COPY(n3->base.chunks, n3->values,
+ new32->base.chunks, new32->values);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_COPY(n3->base.chunks, n3->children,
+ new32->base.chunks, new32->children);
+#endif
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
+ node = newnode;
+ }
+ else
+ {
+ int insertpos = RT_NODE_3_GET_INSERTPOS(&n3->base, chunk);
+ int count = n3->base.n.count;
+
+ /* shift chunks and children */
+ if (insertpos < count)
+ {
+ Assert(count > 0);
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_SHIFT(n3->base.chunks, n3->values,
+ count, insertpos);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_SHIFT(n3->base.chunks, n3->children,
+ count, insertpos);
+#endif
+ }
+
+ n3->base.chunks[insertpos] = chunk;
+#ifdef RT_NODE_LEVEL_LEAF
+ n3->values[insertpos] = *value_p;
+#else
+ n3->children[insertpos] = child;
+#endif
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_32:
+ {
+ const RT_SIZE_CLASS_ELEM class32_max = RT_SIZE_CLASS_INFO[RT_CLASS_32_MAX];
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ int idx = RT_NODE_32_SEARCH_EQ(&n32->base, chunk);
+
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n32->values[idx] = *value_p;
+ break;
+ }
+#endif
+ if (unlikely(RT_NODE_MUST_GROW(n32)) &&
+ n32->base.n.fanout < class32_max.fanout)
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+ const RT_SIZE_CLASS_ELEM class32_min = RT_SIZE_CLASS_INFO[RT_CLASS_32_MIN];
+ const RT_SIZE_CLASS new_class = RT_CLASS_32_MAX;
+
+ Assert(n32->base.n.fanout == class32_min.fanout);
+
+ /* grow to the next size class of this kind */
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ n32 = (RT_NODE32_TYPE *) newnode;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ memcpy(newnode, node, class32_min.leaf_size);
+#else
+ memcpy(newnode, node, class32_min.inner_size);
+#endif
+ newnode->fanout = class32_max.fanout;
+
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
+ node = newnode;
+ }
+
+ if (unlikely(RT_NODE_MUST_GROW(n32)))
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+ RT_NODE125_TYPE *new125;
+ const uint8 new_kind = RT_NODE_KIND_125;
+ const RT_SIZE_CLASS new_class = RT_CLASS_125;
+
+ Assert(n32->base.n.fanout == class32_max.fanout);
+
+ /* grow node from 32 to 125 */
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
+ newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, is_leaf);
+ new125 = (RT_NODE125_TYPE *) newnode;
+
+ for (int i = 0; i < class32_max.fanout; i++)
+ {
+ new125->base.slot_idxs[n32->base.chunks[i]] = i;
+#ifdef RT_NODE_LEVEL_LEAF
+ new125->values[i] = n32->values[i];
+#else
+ new125->children[i] = n32->children[i];
+#endif
+ }
+
+ /*
+ * Since we just copied a dense array, we can set the bits
+ * using a single store, provided the length of that array
+ * is at most the number of bits in a bitmapword.
+ */
+ Assert(class32_max.fanout <= sizeof(bitmapword) * BITS_PER_BYTE);
+ new125->base.isset[0] = (bitmapword) (((uint64) 1 << class32_max.fanout) - 1);
+
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
+ node = newnode;
+ }
+ else
+ {
+ int insertpos = RT_NODE_32_GET_INSERTPOS(&n32->base, chunk);
+ int count = n32->base.n.count;
+
+ if (insertpos < count)
+ {
+ Assert(count > 0);
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_SHIFT(n32->base.chunks, n32->values,
+ count, insertpos);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_SHIFT(n32->base.chunks, n32->children,
+ count, insertpos);
+#endif
+ }
+
+ n32->base.chunks[insertpos] = chunk;
+#ifdef RT_NODE_LEVEL_LEAF
+ n32->values[insertpos] = *value_p;
+#else
+ n32->children[insertpos] = child;
+#endif
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+ int slotpos;
+ int cnt = 0;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ slotpos = n125->base.slot_idxs[chunk];
+ if (slotpos != RT_INVALID_SLOT_IDX)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n125->values[slotpos] = *value_p;
+ break;
+ }
+#endif
+ if (unlikely(RT_NODE_MUST_GROW(n125)))
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+ RT_NODE256_TYPE *new256;
+ const uint8 new_kind = RT_NODE_KIND_256;
+ const RT_SIZE_CLASS new_class = RT_CLASS_256;
+
+ /* grow node from 125 to 256 */
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
+ newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, is_leaf);
+ new256 = (RT_NODE256_TYPE *) newnode;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n125->base.n.count; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(&n125->base, i))
+ continue;
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_NODE_LEAF_256_SET(new256, i, RT_NODE_LEAF_125_GET_VALUE(n125, i));
+#else
+ RT_NODE_INNER_256_SET(new256, i, RT_NODE_INNER_125_GET_CHILD(n125, i));
+#endif
+ cnt++;
+ }
+
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
+ node = newnode;
+ }
+ else
+ {
+ int idx;
+ bitmapword inverse;
+
+ /* get the first word with at least one bit not set */
+ for (idx = 0; idx < RT_BM_IDX(RT_SLOT_IDX_LIMIT); idx++)
+ {
+ if (n125->base.isset[idx] < ~((bitmapword) 0))
+ break;
+ }
+
+ /* To get the first unset bit in X, get the first set bit in ~X */
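+ /*
+ * For example, if isset[idx] is ...0111, inverse is ...1000 and
+ * bmw_rightmost_one_pos(inverse) returns 3, the index of the first
+ * free slot in this word.
+ */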
+ inverse = ~(n125->base.isset[idx]);
+ slotpos = idx * BITS_PER_BITMAPWORD;
+ slotpos += bmw_rightmost_one_pos(inverse);
+ Assert(slotpos < node->fanout);
+
+ /* mark the slot used */
+ n125->base.isset[idx] |= bmw_rightmost_one(inverse);
+ n125->base.slot_idxs[chunk] = slotpos;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ n125->values[slotpos] = *value_p;
+#else
+ n125->children[slotpos] = child;
+#endif
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ chunk_exists = RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk);
+ Assert(chunk_exists || node->count < RT_NODE_MAX_SLOTS);
+ RT_NODE_LEAF_256_SET(n256, chunk, *value_p);
+#else
+ Assert(node->count < RT_NODE_MAX_SLOTS);
+ RT_NODE_INNER_256_SET(n256, chunk, child);
+#endif
+ break;
+ }
+ }
+
+ /* Update statistics */
+#ifdef RT_NODE_LEVEL_LEAF
+ if (!chunk_exists)
+ node->count++;
+#else
+ node->count++;
+#endif
+
+ /*
+ * Done. Finally, verify that the chunk and value were inserted or replaced
+ * properly in the node.
+ */
+ RT_VERIFY_NODE(node);
+
+#ifdef RT_NODE_LEVEL_LEAF
+ return chunk_exists;
+#else
+ return;
+#endif
+
+#undef RT_NODE3_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_iter_impl.h b/src/include/lib/radixtree_iter_impl.h
new file mode 100644
index 0000000000..98c78eb237
--- /dev/null
+++ b/src/include/lib/radixtree_iter_impl.h
@@ -0,0 +1,153 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree_iter_impl.h
+ * Common implementation for iteration in leaf and inner nodes.
+ *
+ * Note: There is deliberately no #include guard here
+ *
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * src/include/lib/radixtree_iter_impl.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE3_TYPE RT_NODE_INNER_3
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE3_TYPE RT_NODE_LEAF_3
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ bool found = false;
+ uint8 key_chunk;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_VALUE_TYPE value;
+
+ Assert(RT_NODE_IS_LEAF(node_iter->node));
+#else
+ RT_PTR_LOCAL child = NULL;
+
+ Assert(!RT_NODE_IS_LEAF(node_iter->node));
+#endif
+
+#ifdef RT_SHMEM
+ Assert(iter->tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+
+ switch (node_iter->node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n3->base.n.count)
+ break;
+#ifdef RT_NODE_LEVEL_LEAF
+ value = n3->values[node_iter->current_idx];
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, n3->children[node_iter->current_idx]);
+#endif
+ key_chunk = n3->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n32->base.n.count)
+ break;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ value = n32->values[node_iter->current_idx];
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, n32->children[node_iter->current_idx]);
+#endif
+ key_chunk = n32->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (RT_NODE_125_IS_CHUNK_USED((RT_NODE_BASE_125 *) n125, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+#ifdef RT_NODE_LEVEL_LEAF
+ value = RT_NODE_LEAF_125_GET_VALUE(n125, i);
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, RT_NODE_INNER_125_GET_CHILD(n125, i));
+#endif
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+#ifdef RT_NODE_LEVEL_LEAF
+ if (RT_NODE_LEAF_256_IS_CHUNK_USED(n256, i))
+#else
+ if (RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
+#endif
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+#ifdef RT_NODE_LEVEL_LEAF
+ value = RT_NODE_LEAF_256_GET_VALUE(n256, i);
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, RT_NODE_INNER_256_GET_CHILD(n256, i));
+#endif
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ }
+
+ if (found)
+ {
+ RT_ITER_UPDATE_KEY(iter, key_chunk, node_iter->node->shift);
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = value;
+#endif
+ }
+
+#ifdef RT_NODE_LEVEL_LEAF
+ return found;
+#else
+ return child;
+#endif
+
+#undef RT_NODE3_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_search_impl.h b/src/include/lib/radixtree_search_impl.h
new file mode 100644
index 0000000000..a8925c75d0
--- /dev/null
+++ b/src/include/lib/radixtree_search_impl.h
@@ -0,0 +1,138 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree_search_impl.h
+ * Common implementation for search in leaf and inner nodes, plus
+ * update for inner nodes only.
+ *
+ * Note: There is deliberately no #include guard here
+ *
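+ * When the includer also defines RT_ACTION_UPDATE, the generated code
+ * replaces an existing child pointer with a new one instead of returning
+ * it; this is used for inner nodes only.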
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * src/include/lib/radixtree_search_impl.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE3_TYPE RT_NODE_INNER_3
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE3_TYPE RT_NODE_LEAF_3
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+
+#ifdef RT_NODE_LEVEL_LEAF
+ Assert(value_p != NULL);
+ Assert(RT_NODE_IS_LEAF(node));
+#else
+#ifndef RT_ACTION_UPDATE
+ Assert(child_p != NULL);
+#endif
+ Assert(!RT_NODE_IS_LEAF(node));
+#endif
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node;
+ int idx = RT_NODE_3_SEARCH_EQ((RT_NODE_BASE_3 *) n3, chunk);
+
+#ifdef RT_ACTION_UPDATE
+ Assert(idx >= 0);
+ n3->children[idx] = new_child;
+#else
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = n3->values[idx];
+#else
+ *child_p = n3->children[idx];
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
+ int idx = RT_NODE_32_SEARCH_EQ((RT_NODE_BASE_32 *) n32, chunk);
+
+#ifdef RT_ACTION_UPDATE
+ Assert(idx >= 0);
+ n32->children[idx] = new_child;
+#else
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = n32->values[idx];
+#else
+ *child_p = n32->children[idx];
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+ int slotpos = n125->base.slot_idxs[chunk];
+
+#ifdef RT_ACTION_UPDATE
+ Assert(slotpos != RT_INVALID_SLOT_IDX);
+ n125->children[slotpos] = new_child;
+#else
+ if (slotpos == RT_INVALID_SLOT_IDX)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = RT_NODE_LEAF_125_GET_VALUE(n125, chunk);
+#else
+ *child_p = RT_NODE_INNER_125_GET_CHILD(n125, chunk);
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
+
+#ifdef RT_ACTION_UPDATE
+ RT_NODE_INNER_256_SET(n256, chunk, new_child);
+#else
+#ifdef RT_NODE_LEVEL_LEAF
+ if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk))
+#else
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, chunk))
+#endif
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = RT_NODE_LEAF_256_GET_VALUE(n256, chunk);
+#else
+ *child_p = RT_NODE_INNER_256_GET_CHILD(n256, chunk);
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ }
+
+#ifdef RT_ACTION_UPDATE
+ return;
+#else
+ return true;
+#endif /* RT_ACTION_UPDATE */
+
+#undef RT_NODE3_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/utils/dsa.h b/src/include/utils/dsa.h
index 3ce4ee300a..2af215484f 100644
--- a/src/include/utils/dsa.h
+++ b/src/include/utils/dsa.h
@@ -121,6 +121,7 @@ extern dsa_handle dsa_get_handle(dsa_area *area);
extern dsa_pointer dsa_allocate_extended(dsa_area *area, size_t size, int flags);
extern void dsa_free(dsa_area *area, dsa_pointer dp);
extern void *dsa_get_address(dsa_area *area, dsa_pointer dp);
+extern size_t dsa_get_total_size(dsa_area *area);
extern void dsa_trim(dsa_area *area);
extern void dsa_dump(dsa_area *area);
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 79e3033ec2..89f42bf9e3 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -28,6 +28,7 @@ SUBDIRS = \
test_pg_db_role_setting \
test_pg_dump \
test_predtest \
+ test_radixtree \
test_rbtree \
test_regex \
test_rls_hooks \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index dcb82ed68f..beaf4080fb 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -25,6 +25,7 @@ subdir('test_parser')
subdir('test_pg_db_role_setting')
subdir('test_pg_dump')
subdir('test_predtest')
+subdir('test_radixtree')
subdir('test_rbtree')
subdir('test_regex')
subdir('test_rls_hooks')
diff --git a/src/test/modules/test_radixtree/.gitignore b/src/test/modules/test_radixtree/.gitignore
new file mode 100644
index 0000000000..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/test_radixtree/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_radixtree/Makefile b/src/test/modules/test_radixtree/Makefile
new file mode 100644
index 0000000000..da06b93da3
--- /dev/null
+++ b/src/test/modules/test_radixtree/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_radixtree/Makefile
+
+MODULE_big = test_radixtree
+OBJS = \
+ $(WIN32RES) \
+ test_radixtree.o
+PGFILEDESC = "test_radixtree - test code for src/include/lib/radixtree.h"
+
+EXTENSION = test_radixtree
+DATA = test_radixtree--1.0.sql
+
+REGRESS = test_radixtree
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_radixtree
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_radixtree/README b/src/test/modules/test_radixtree/README
new file mode 100644
index 0000000000..a8b271869a
--- /dev/null
+++ b/src/test/modules/test_radixtree/README
@@ -0,0 +1,7 @@
+test_radixtree contains unit tests for the radix tree implementation in
+src/include/lib/radixtree.h.
+
+The tests verify the correctness of the implementation, but they can also be
+used as a micro-benchmark. If you set the 'rt_test_stats' flag in
+test_radixtree.c, the tests will print extra information about execution time
+and memory usage.
diff --git a/src/test/modules/test_radixtree/expected/test_radixtree.out b/src/test/modules/test_radixtree/expected/test_radixtree.out
new file mode 100644
index 0000000000..ce645cb8b5
--- /dev/null
+++ b/src/test/modules/test_radixtree/expected/test_radixtree.out
@@ -0,0 +1,36 @@
+CREATE EXTENSION test_radixtree;
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
+NOTICE: testing basic operations with leaf node 4
+NOTICE: testing basic operations with inner node 4
+NOTICE: testing basic operations with leaf node 32
+NOTICE: testing basic operations with inner node 32
+NOTICE: testing basic operations with leaf node 125
+NOTICE: testing basic operations with inner node 125
+NOTICE: testing basic operations with leaf node 256
+NOTICE: testing basic operations with inner node 256
+NOTICE: testing radix tree node types with shift "0"
+NOTICE: testing radix tree node types with shift "8"
+NOTICE: testing radix tree node types with shift "16"
+NOTICE: testing radix tree node types with shift "24"
+NOTICE: testing radix tree node types with shift "32"
+NOTICE: testing radix tree node types with shift "40"
+NOTICE: testing radix tree node types with shift "48"
+NOTICE: testing radix tree node types with shift "56"
+NOTICE: testing radix tree with pattern "all ones"
+NOTICE: testing radix tree with pattern "alternating bits"
+NOTICE: testing radix tree with pattern "clusters of ten"
+NOTICE: testing radix tree with pattern "clusters of hundred"
+NOTICE: testing radix tree with pattern "one-every-64k"
+NOTICE: testing radix tree with pattern "sparse"
+NOTICE: testing radix tree with pattern "single values, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^60"
+ test_radixtree
+----------------
+
+(1 row)
+
diff --git a/src/test/modules/test_radixtree/meson.build b/src/test/modules/test_radixtree/meson.build
new file mode 100644
index 0000000000..6add06bbdb
--- /dev/null
+++ b/src/test/modules/test_radixtree/meson.build
@@ -0,0 +1,35 @@
+# FIXME: prevent install during main install, but not during test :/
+
+test_radixtree_sources = files(
+ 'test_radixtree.c',
+)
+
+if host_system == 'windows'
+ test_radixtree_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'test_radixtree',
+ '--FILEDESC', 'test_radixtree - test code for src/include/lib/radixtree.h',])
+endif
+
+test_radixtree = shared_module('test_radixtree',
+ test_radixtree_sources,
+ link_with: pgport_srv,
+ kwargs: pg_mod_args,
+)
+testprep_targets += test_radixtree
+
+install_data(
+ 'test_radixtree.control',
+ 'test_radixtree--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'test_radixtree',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'test_radixtree',
+ ],
+ },
+}
diff --git a/src/test/modules/test_radixtree/sql/test_radixtree.sql b/src/test/modules/test_radixtree/sql/test_radixtree.sql
new file mode 100644
index 0000000000..41ece5e9f5
--- /dev/null
+++ b/src/test/modules/test_radixtree/sql/test_radixtree.sql
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_radixtree;
+
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
diff --git a/src/test/modules/test_radixtree/test_radixtree--1.0.sql b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
new file mode 100644
index 0000000000..074a5a7ea7
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_radixtree/test_radixtree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_radixtree" to load this file. \quit
+
+CREATE FUNCTION test_radixtree()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
new file mode 100644
index 0000000000..afe53382f3
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -0,0 +1,681 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_radixtree.c
+ * Test radixtree set data structure.
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/test/modules/test_radixtree/test_radixtree.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "miscadmin.h"
+#include "nodes/bitmapset.h"
+#include "storage/block.h"
+#include "storage/itemptr.h"
+#include "storage/lwlock.h"
+#include "utils/memutils.h"
+#include "utils/timestamp.h"
+
+#define UINT64_HEX_FORMAT "%" INT64_MODIFIER "X"
+
+/*
+ * The tests pass with uint32, but build with warnings because the string
+ * format expects uint64.
+ */
+typedef uint64 TestValueType;
+
+/*
+ * If you enable this, the "pattern" tests will print information about
+ * how long populating, probing, and iterating the test set takes, and
+ * how much memory the test set consumed. That can be used as
+ * micro-benchmark of various operations and input patterns (you might
+ * want to increase the number of values used in each of the test, if
+ * you do that, to reduce noise).
+ *
+ * The information is printed to the server's stderr, mostly because
+ * that's where MemoryContextStats() output goes.
+ */
+static const bool rt_test_stats = false;
+
+static int rt_node_kind_fanouts[] = {
+ 0,
+ 4, /* RT_NODE_KIND_4 */
+ 32, /* RT_NODE_KIND_32 */
+ 125, /* RT_NODE_KIND_125 */
+ 256 /* RT_NODE_KIND_256 */
+};
+/*
+ * A struct to define a pattern of integers, for use with the test_pattern()
+ * function.
+ */
+typedef struct
+{
+ char *test_name; /* short name of the test, for humans */
+ char *pattern_str; /* a bit pattern */
+ uint64 spacing; /* pattern repeats at this interval */
+ uint64 num_values; /* number of integers to set in total */
+} test_spec;
+
+/* Test patterns borrowed from test_integerset.c */
+static const test_spec test_specs[] = {
+ {
+ "all ones", "1111111111",
+ 10, 1000000
+ },
+ {
+ "alternating bits", "0101010101",
+ 10, 1000000
+ },
+ {
+ "clusters of ten", "1111111111",
+ 10000, 1000000
+ },
+ {
+ "clusters of hundred",
+ "1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111",
+ 10000, 1000000
+ },
+ {
+ "one-every-64k", "1",
+ 65536, 1000000
+ },
+ {
+ "sparse", "100000000000000000000000000000001",
+ 10000000, 1000000
+ },
+ {
+ "single values, distance > 2^32", "1",
+ UINT64CONST(10000000000), 100000
+ },
+ {
+ "clusters, distance > 2^32", "10101010",
+ UINT64CONST(10000000000), 1000000
+ },
+ {
+ "clusters, distance > 2^60", "10101010",
+ UINT64CONST(2000000000000000000),
+ 23 /* can't be much higher than this, or we
+ * overflow uint64 */
+ }
+};
+
+/* define the radix tree implementation to test */
+#define RT_PREFIX rt
+#define RT_SCOPE
+#define RT_DECLARE
+#define RT_DEFINE
+#define RT_USE_DELETE
+#define RT_VALUE_TYPE TestValueType
+/* #define RT_SHMEM */
+#include "lib/radixtree.h"
+
+
+/*
+ * Return the number of keys in the radix tree.
+ */
+static uint64
+rt_num_entries(rt_radix_tree *tree)
+{
+ return tree->ctl->num_keys;
+}
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(test_radixtree);
+
+static void
+test_empty(void)
+{
+ rt_radix_tree *radixtree;
+ rt_iter *iter;
+ TestValueType dummy;
+ uint64 key;
+ TestValueType val;
+
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+
+ LWLockRegisterTranche(tranche_id, "test_radix_tree");
+ dsa = dsa_create(tranche_id);
+
+ radixtree = rt_create(CurrentMemoryContext, dsa, tranche_id);
+#else
+ radixtree = rt_create(CurrentMemoryContext);
+#endif
+
+ if (rt_search(radixtree, 0, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, 1, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, PG_UINT64_MAX, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_delete(radixtree, 0))
+ elog(ERROR, "rt_delete on empty tree returned true");
+
+ if (rt_num_entries(radixtree) != 0)
+ elog(ERROR, "rt_num_entries on empty tree return non-zero");
+
+ iter = rt_begin_iterate(radixtree);
+
+ if (rt_iterate_next(iter, &key, &val))
+ elog(ERROR, "rt_itereate_next on empty tree returned true");
+
+ rt_end_iterate(iter);
+
+ rt_free(radixtree);
+
+#ifdef RT_SHMEM
+ dsa_detach(dsa);
+#endif
+}
+
+static void
+test_basic(int children, bool test_inner)
+{
+ rt_radix_tree *radixtree;
+ uint64 *keys;
+ int shift = test_inner ? 8 : 0;
+
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+
+ LWLockRegisterTranche(tranche_id, "test_radix_tree");
+ dsa = dsa_create(tranche_id);
+#endif
+
+ elog(NOTICE, "testing basic operations with %s node %d",
+ test_inner ? "inner" : "leaf", children);
+
+#ifdef RT_SHMEM
+ radixtree = rt_create(CurrentMemoryContext, dsa, tranche_id);
+#else
+ radixtree = rt_create(CurrentMemoryContext);
+#endif
+
+ /* prepare keys in an interleaved order like 1, 32, 2, 31, 3, 30, ... */
+ keys = palloc(sizeof(uint64) * children);
+ for (int i = 0; i < children; i++)
+ {
+ if (i % 2 == 0)
+ keys[i] = (uint64) ((i / 2) + 1) << shift;
+ else
+ keys[i] = (uint64) (children - (i / 2)) << shift;
+ }
+
+ /* insert keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (rt_set(radixtree, keys[i], (TestValueType*) &keys[i]))
+ elog(ERROR, "new inserted key 0x" UINT64_HEX_FORMAT " is found ", keys[i]);
+ }
+
+ /* look up keys */
+ for (int i = 0; i < children; i++)
+ {
+ TestValueType value;
+
+ if (!rt_search(radixtree, keys[i], &value))
+ elog(ERROR, "could not find key 0x" UINT64_HEX_FORMAT, keys[i]);
+ if (value != (TestValueType) keys[i])
+ elog(ERROR, "rt_search returned 0x" UINT64_HEX_FORMAT ", expected " UINT64_HEX_FORMAT,
+ value, (TestValueType) keys[i]);
+ }
+
+ /* update keys */
+ for (int i = 0; i < children; i++)
+ {
+ TestValueType update = keys[i] + 1;
+ if (!rt_set(radixtree, keys[i], (TestValueType*) &update))
+ elog(ERROR, "could not update key 0x" UINT64_HEX_FORMAT, keys[i]);
+ }
+
+ /* repeat deleting and inserting keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (!rt_delete(radixtree, keys[i]))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, keys[i]);
+ if (rt_set(radixtree, keys[i], (TestValueType*) &keys[i]))
+ elog(ERROR, "new inserted key 0x" UINT64_HEX_FORMAT " is found ", keys[i]);
+ }
+
+ pfree(keys);
+ rt_free(radixtree);
+#ifdef RT_SHMEM
+ dsa_detach(dsa);
+#endif
+}
+
+/*
+ * Check that the keys from start to end, shifted by 'shift', exist in the tree.
+ */
+static void
+check_search_on_node(rt_radix_tree *radixtree, uint8 shift, int start, int end,
+ int incr)
+{
+ for (int i = start; i < end; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ TestValueType val;
+
+ if (!rt_search(radixtree, key, &val))
+ elog(ERROR, "key 0x" UINT64_HEX_FORMAT " is not found on node-%d",
+ key, end);
+ if (val != (TestValueType) key)
+ elog(ERROR, "rt_search with key 0x" UINT64_HEX_FORMAT " returns 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ key, val, key);
+ }
+}
+
+static void
+test_node_types_insert(rt_radix_tree *radixtree, uint8 shift, bool insert_asc)
+{
+ uint64 num_entries;
+ int ninserted = 0;
+ int start = insert_asc ? 0 : 256;
+ int incr = insert_asc ? 1 : -1;
+ int end = insert_asc ? 256 : 0;
+ int node_kind_idx = 1;
+
+ for (int i = start; i != end; i += incr)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_set(radixtree, key, (TestValueType*) &key);
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " is found", key);
+
+ /*
+ * After filling all slots in each node type, check if the values
+ * are stored properly.
+ */
+ if (ninserted == rt_node_kind_fanouts[node_kind_idx] - 1)
+ {
+ int check_start = insert_asc
+ ? rt_node_kind_fanouts[node_kind_idx - 1]
+ : rt_node_kind_fanouts[node_kind_idx];
+ int check_end = insert_asc
+ ? rt_node_kind_fanouts[node_kind_idx]
+ : rt_node_kind_fanouts[node_kind_idx - 1];
+
+ check_search_on_node(radixtree, shift, check_start, check_end, incr);
+ node_kind_idx++;
+ }
+
+ ninserted++;
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ if (num_entries != 256)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(256));
+}
+
+static void
+test_node_types_delete(rt_radix_tree *radixtree, uint8 shift)
+{
+ uint64 num_entries;
+
+ for (int i = 0; i < 256; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_delete(radixtree, key);
+
+ if (!found)
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, key);
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ /* The tree must be empty */
+ if (num_entries != 0)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(256));
+}
+
+/*
+ * Test inserting and deleting key-value pairs for each node type at the given
+ * shift level.
+ */
+static void
+test_node_types(uint8 shift)
+{
+ rt_radix_tree *radixtree;
+
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+
+ LWLockRegisterTranche(tranche_id, "test_radix_tree");
+ dsa = dsa_create(tranche_id);
+#endif
+
+ elog(NOTICE, "testing radix tree node types with shift \"%d\"", shift);
+
+#ifdef RT_SHMEM
+ radixtree = rt_create(CurrentMemoryContext, dsa, tranche_id);
+#else
+ radixtree = rt_create(CurrentMemoryContext);
+#endif
+
+ /*
+ * Insert and search entries for every node type at the 'shift' level,
+ * then delete all entries to make it empty, and insert and search entries
+ * again.
+ */
+ test_node_types_insert(radixtree, shift, true);
+ test_node_types_delete(radixtree, shift);
+ test_node_types_insert(radixtree, shift, false);
+
+ rt_free(radixtree);
+#ifdef RT_SHMEM
+ dsa_detach(dsa);
+#endif
+}
+
+/*
+ * Test with a repeating pattern, defined by the 'spec'.
+ */
+static void
+test_pattern(const test_spec * spec)
+{
+ rt_radix_tree *radixtree;
+ rt_iter *iter;
+ MemoryContext radixtree_ctx;
+ TimestampTz starttime;
+ TimestampTz endtime;
+ uint64 n;
+ uint64 last_int;
+ uint64 ndeleted;
+ uint64 nbefore;
+ uint64 nafter;
+ int patternlen;
+ uint64 *pattern_values;
+ uint64 pattern_num_values;
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+
+ LWLockRegisterTranche(tranche_id, "test_radix_tree");
+ dsa = dsa_create(tranche_id);
+#endif
+
+ elog(NOTICE, "testing radix tree with pattern \"%s\"", spec->test_name);
+ if (rt_test_stats)
+ fprintf(stderr, "-----\ntesting radix tree with pattern \"%s\"\n", spec->test_name);
+
+ /* Pre-process the pattern, creating an array of integers from it. */
+ patternlen = strlen(spec->pattern_str);
+ pattern_values = palloc(patternlen * sizeof(uint64));
+ pattern_num_values = 0;
+ for (int i = 0; i < patternlen; i++)
+ {
+ if (spec->pattern_str[i] == '1')
+ pattern_values[pattern_num_values++] = i;
+ }
+
+ /*
+ * Allocate the radix tree.
+ *
+ * Allocate it in a separate memory context, so that we can print its
+ * memory usage easily.
+ */
+ radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
+ "radixtree test",
+ ALLOCSET_SMALL_SIZES);
+ MemoryContextSetIdentifier(radixtree_ctx, spec->test_name);
+
+#ifdef RT_SHMEM
+ radixtree = rt_create(radixtree_ctx, dsa, tranche_id);
+#else
+ radixtree = rt_create(radixtree_ctx);
+#endif
+
+
+ /*
+ * Add values to the set.
+ */
+ starttime = GetCurrentTimestamp();
+
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ uint64 x = 0;
+
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ bool found;
+
+ x = last_int + pattern_values[i];
+
+ found = rt_set(radixtree, x, (TestValueType*) &x);
+
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " found", x);
+
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+
+ endtime = GetCurrentTimestamp();
+
+ if (rt_test_stats)
+ fprintf(stderr, "added " UINT64_FORMAT " values in %d ms\n",
+ spec->num_values, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Print stats on the amount of memory used.
+ *
+ * We print the usage reported by rt_memory_usage(), as well as the stats
+ * from the memory context. They should be in the same ballpark, but it's
+ * hard to automate testing that, so if you're making changes to the
+ * implementation, just observe that manually.
+ */
+ if (rt_test_stats)
+ {
+ uint64 mem_usage;
+
+ /*
+ * Also print memory usage as reported by rt_memory_usage(). It
+ * should be in the same ballpark as the usage reported by
+ * MemoryContextStats().
+ */
+ mem_usage = rt_memory_usage(radixtree);
+ fprintf(stderr, "rt_memory_usage() reported " UINT64_FORMAT " (%0.2f bytes / integer)\n",
+ mem_usage, (double) mem_usage / spec->num_values);
+
+ MemoryContextStats(radixtree_ctx);
+ }
+
+ /* Check that rt_num_entries works */
+ n = rt_num_entries(radixtree);
+ if (n != spec->num_values)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT, n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_search()
+ */
+ starttime = GetCurrentTimestamp();
+
+ for (n = 0; n < 100000; n++)
+ {
+ bool found;
+ bool expected;
+ uint64 x;
+ TestValueType v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Do we expect this value to be present in the set? */
+ if (x >= last_int)
+ expected = false;
+ else
+ {
+ uint64 idx = x % spec->spacing;
+
+ if (idx >= patternlen)
+ expected = false;
+ else if (spec->pattern_str[idx] == '1')
+ expected = true;
+ else
+ expected = false;
+ }
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (found != expected)
+ elog(ERROR, "mismatch at 0x" UINT64_HEX_FORMAT ": %d vs %d", x, found, expected);
+ if (found && (v != (TestValueType) x))
+ elog(ERROR, "found 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ v, x);
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "probed " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Test iterator
+ */
+ starttime = GetCurrentTimestamp();
+
+ iter = rt_begin_iterate(radixtree);
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ uint64 expected = last_int + pattern_values[i];
+ uint64 x;
+ TestValueType val;
+
+ if (!rt_iterate_next(iter, &x, &val))
+ break;
+
+ if (x != expected)
+ elog(ERROR,
+ "iterate returned wrong key; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d",
+ x, expected, i);
+ if (val != (TestValueType) expected)
+ elog(ERROR,
+ "iterate returned wrong value; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d", x, expected, i);
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "iterated " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ rt_end_iterate(iter);
+
+ if (n < spec->num_values)
+ elog(ERROR, "iterator stopped short after " UINT64_FORMAT " entries, expected " UINT64_FORMAT, n, spec->num_values);
+ if (n > spec->num_values)
+ elog(ERROR, "iterator returned " UINT64_FORMAT " entries, " UINT64_FORMAT " was expected", n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_delete()
+ */
+ starttime = GetCurrentTimestamp();
+
+ nbefore = rt_num_entries(radixtree);
+ ndeleted = 0;
+ for (n = 0; n < 100000; n++)
+ {
+ bool found;
+ uint64 x;
+ TestValueType v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (!found)
+ continue;
+
+ /* If the key is found, delete it and check again */
+ if (!rt_delete(radixtree, x))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_search(radixtree, x, &v))
+ elog(ERROR, "found deleted key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_delete(radixtree, x))
+ elog(ERROR, "deleted already-deleted key 0x" UINT64_HEX_FORMAT, x);
+
+ ndeleted++;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "deleted " UINT64_FORMAT " values in %d ms\n",
+ ndeleted, (int) (endtime - starttime) / 1000);
+
+ nafter = rt_num_entries(radixtree);
+
+ /* Check that rt_num_entries works */
+ if ((nbefore - ndeleted) != nafter)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT "after " UINT64_FORMAT " deletion",
+ nafter, (nbefore - ndeleted), ndeleted);
+
+ rt_free(radixtree);
+ MemoryContextDelete(radixtree_ctx);
+#ifdef RT_SHMEM
+ dsa_detach(dsa);
+#endif
+}
+
+Datum
+test_radixtree(PG_FUNCTION_ARGS)
+{
+ test_empty();
+
+ for (int i = 1; i < lengthof(rt_node_kind_fanouts); i++)
+ {
+ test_basic(rt_node_kind_fanouts[i], false);
+ test_basic(rt_node_kind_fanouts[i], true);
+ }
+
+ for (int shift = 0; shift <= (64 - 8); shift += 8)
+ test_node_types(shift);
+
+ /* Test different test patterns, with lots of entries */
+ for (int i = 0; i < lengthof(test_specs); i++)
+ test_pattern(&test_specs[i]);
+
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_radixtree/test_radixtree.control b/src/test/modules/test_radixtree/test_radixtree.control
new file mode 100644
index 0000000000..e53f2a3e0c
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.control
@@ -0,0 +1,4 @@
+comment = 'Test code for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/test_radixtree'
+relocatable = true
diff --git a/src/tools/pginclude/cpluspluscheck b/src/tools/pginclude/cpluspluscheck
index 4e09c4686b..202bf1c04e 100755
--- a/src/tools/pginclude/cpluspluscheck
+++ b/src/tools/pginclude/cpluspluscheck
@@ -101,6 +101,12 @@ do
test "$f" = src/include/nodes/nodetags.h && continue
test "$f" = src/backend/nodes/nodetags.h && continue
+ # radixtree_*_impl.h cannot be included standalone: they are just code fragments.
+ test "$f" = src/include/lib/radixtree_delete_impl.h && continue
+ test "$f" = src/include/lib/radixtree_insert_impl.h && continue
+ test "$f" = src/include/lib/radixtree_iter_impl.h && continue
+ test "$f" = src/include/lib/radixtree_search_impl.h && continue
+
# These files are not meant to be included standalone, because
# they contain lists that might have multiple use-cases.
test "$f" = src/include/access/rmgrlist.h && continue
diff --git a/src/tools/pginclude/headerscheck b/src/tools/pginclude/headerscheck
index 8dee1b5670..133313255c 100755
--- a/src/tools/pginclude/headerscheck
+++ b/src/tools/pginclude/headerscheck
@@ -96,6 +96,12 @@ do
test "$f" = src/include/nodes/nodetags.h && continue
test "$f" = src/backend/nodes/nodetags.h && continue
+ # radixtree_*_impl.h cannot be included standalone: they are just code fragments.
+ test "$f" = src/include/lib/radixtree_delete_impl.h && continue
+ test "$f" = src/include/lib/radixtree_insert_impl.h && continue
+ test "$f" = src/include/lib/radixtree_iter_impl.h && continue
+ test "$f" = src/include/lib/radixtree_search_impl.h && continue
+
# These files are not meant to be included standalone, because
# they contain lists that might have multiple use-cases.
test "$f" = src/include/access/rmgrlist.h && continue
--
2.31.1
On Mon, Apr 17, 2023 at 8:49 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
> - With lazy expansion and single-value leaves, the root of a radix tree
> can point to a single leaf. That might get rid of the need to track
> TBMStatus, since setting a single-leaf tree should be cheap.
>
> Instead of introducing single-value leaves to the radix tree as
> another structure, can we store pointers to PagetableEntry as values?
Well, that's pretty much what a single-value leaf is. Now that I've had
time to pause and regroup, I've looked into some aspects we previously put
off for future work, and this is one of them.
The concept is really quite trivial, and it's the simplest and most
flexible way to implement ART. Our, or at least my, documented reason not
to go that route was due to "an extra pointer traversal", but that's
partially mitigated by "lazy expansion", which is actually fairly easy to
do with single-value leaves. The two techniques complement each other in a
natural way. (Path compression, on the other hand, is much more complex.)
Note: I've moved the CF entry to the next CF, and set to waiting on
author for now. Since no action is currently required from Masahiko, I've
added myself as author as well. If tackling bitmap heap scan shows promise,
we could RWF and resurrect at a later time.
> Thanks. I'm going to continue researching the memory limitation and
Sounds like the best thing to nail down at this point.
> try lazy path expansion until PG17 development begins.
This doesn't seem like a useful thing to try and attach into the current
patch (if that's what you mean), as the current insert/delete paths are
quite complex. Using bitmap heap scan as a motivating use case, I hope to
refocus complexity to where it's most needed, and aggressively simplify
where possible.
--
John Naylor
EDB: http://www.enterprisedb.com
On Wed, Apr 19, 2023 at 4:02 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
> On Mon, Apr 17, 2023 at 8:49 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> - With lazy expansion and single-value leaves, the root of a radix tree can point to a single leaf. That might get rid of the need to track TBMStatus, since setting a single-leaf tree should be cheap.
> Instead of introducing single-value leaves to the radix tree as
> another structure, can we store pointers to PagetableEntry as values?
>
> Well, that's pretty much what a single-value leaf is. Now that I've had time to pause and regroup, I've looked into some aspects we previously put off for future work, and this is one of them.
> The concept is really quite trivial, and it's the simplest and most flexible way to implement ART. Our, or at least my, documented reason not to go that route was due to "an extra pointer traversal", but that's partially mitigated by "lazy expansion", which is actually fairly easy to do with single-value leaves. The two techniques complement each other in a natural way. (Path compression, on the other hand, is much more complex.)
> Note: I've moved the CF entry to the next CF, and set to waiting on author for now. Since no action is currently required from Masahiko, I've added myself as author as well. If tackling bitmap heap scan shows promise, we could RWF and resurrect at a later time.
> Thanks. I'm going to continue researching the memory limitation and
> Sounds like the best thing to nail down at this point.
> try lazy path expansion until PG17 development begins.
> This doesn't seem like a useful thing to try and attach into the current patch (if that's what you mean), as the current insert/delete paths are quite complex. Using bitmap heap scan as a motivating use case, I hope to refocus complexity to where it's most needed, and aggressively simplify where possible.
I agree that we don't want to make the current patch any more complex.
Thinking about the memory limitation more, I think a combination of two
ideas works well: specifying the initial and maximum DSA segment sizes,
and dsa_set_size_limit(). There are two goals when the memory usage
reaches the limit: (1) minimize the size of the last allocated memory
block that has been allocated but not yet used, and (2) minimize the
amount of memory that exceeds the limit. Since we can specify the
maximum DSA segment size, the last block allocated before reaching the
memory limit is small. Also, thanks to dsa_set_size_limit(), the total
DSA size stops at the limit, so (memory_usage >= memory_limit) becomes
true without actually exceeding the limit.
Given that we need to configure the initial and maximum DSA segment
sizes and set the DSA size limit for TidStore memory accounting and
limiting, it would be better to create the DSA for the TidStore inside
the TidStoreCreate() API, rather than creating the DSA in the caller
and passing it to TidStoreCreate().
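
To sketch the shape I have in mind (the struct layout below is just a
placeholder, and deriving the initial/maximum segment sizes from the
budget is omitted since it needs the dsa.c changes discussed above;
dsa_create() and dsa_set_size_limit() are the only existing APIs used):

typedef struct TidStore
{
	dsa_area   *area;			/* owned by the TidStore */
	size_t		max_bytes;		/* memory budget */
	/* ... radix tree handle etc. ... */
} TidStore;

TidStore *
TidStoreCreate(size_t max_bytes, int tranche_id)
{
	TidStore   *ts = palloc0(sizeof(TidStore));

	ts->max_bytes = max_bytes;

	/* the DSA belongs to the TidStore, not the caller */
	ts->area = dsa_create(tranche_id);

	/*
	 * Cap the total DSA size at the budget so that the accounting check
	 * (memory usage >= limit) becomes true without overshooting.  Choosing
	 * small initial/maximum segment sizes relative to max_bytes (not shown)
	 * keeps the last, partially-used segment small.
	 */
	dsa_set_size_limit(ts->area, max_bytes);

	return ts;
}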
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Fri, Apr 7, 2023 at 4:55 PM John Naylor <john.naylor@enterprisedb.com>
wrote:
> - Fixed-size PagetableEntry's are pretty large, but the tid compression
> scheme used in this thread (in addition to being complex) is not a great
> fit for tidbitmap because it makes it more difficult to track per-block
> metadata (see also next point). With the "combined pointer-value slots"
> technique, if a page's max tid offset is 63 or less, the offsets can be
> stored directly in the pointer for the exact case. The lowest bit can tag
> to indicate a pointer to a single-value leaf. That would complicate
> operations like union/intersection and tracking "needs recheck", but it
> would reduce memory use and node-traversal in common cases.
[just getting some thoughts out there before I have something concrete]
Thinking some more, this needn't be complicated at all. We'd just need to
reserve some bits of a bitmapword for the tag, as well as flags for
"ischunk" and "recheck". The other bits can be used for offsets.
Getting/storing the offsets basically amounts to adjusting the shift by a
constant. That way, this "embeddable PTE" could serve as both "PTE embedded
in a node pointer" and also the first member of a full PTE. A full PTE is
now just an array of embedded PTEs, except only the first one has the flags
we need. That reduces the number of places that have to be different.
Storing any set of offsets all less than ~60 would save
allocation/traversal in a large number of real cases. Furthermore, that
would reduce a full PTE to 40 bytes because there would be no padding.
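
For what it's worth, here is a tiny standalone sketch of what such an
embedded PTE could look like; the bit layout and the names are invented
for illustration and are not meant to match tidbitmap.c:

#include <stdbool.h>
#include <stdint.h>

/*
 * On a 64-bit build a bitmapword-sized slot could carry:
 *   bit  0       tag: slot holds an embedded PTE, not a child pointer
 *   bit  1       "ischunk"
 *   bit  2       "recheck"
 *   bits 3..63   one bit per offset, for offsets 1..61
 */
typedef uint64_t slotword;

#define EMBED_TAG		((slotword) 1 << 0)
#define EMBED_ISCHUNK	((slotword) 1 << 1)
#define EMBED_RECHECK	((slotword) 1 << 2)
#define EMBED_SHIFT		3
#define EMBED_MAX_OFF	(64 - EMBED_SHIFT)	/* 61 */

/* can this offset live in the embedded form at all? */
static inline bool
embed_offset_fits(int off)
{
	return off >= 1 && off <= EMBED_MAX_OFF;
}

/* storing an offset is just a shift by a small constant */
static inline slotword
embed_add_offset(slotword w, int off)
{
	return w | EMBED_TAG | ((slotword) 1 << (off - 1 + EMBED_SHIFT));
}

/* likewise for membership tests */
static inline bool
embed_test_offset(slotword w, int off)
{
	return (w & ((slotword) 1 << (off - 1 + EMBED_SHIFT))) != 0;
}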
This all assumes the key (block number) is no longer stored in the PTE,
whether embedded or not. That would mean this technique:
> - With lazy expansion and single-value leaves, the root of a radix tree
> can point to a single leaf. That might get rid of the need to track
> TBMStatus, since setting a single-leaf tree should be cheap.
...is not a good trade off because it requires each leaf to have the key,
and would thus reduce the utility of embedded leaves. We just need to make
sure storing a single value is not costly, and I suspect it's not.
(Currently the overhead avoided is allocating and zeroing a few kilobytes
for a hash table). If it is not, then we don't need a special case in
tidbitmap, which would be a great simplification. If it is, there are other
ways to mitigate.
--
John Naylor
EDB: http://www.enterprisedb.com
I wrote:
> the current insert/delete paths are quite complex. Using bitmap heap scan
> as a motivating use case, I hope to refocus complexity to where it's most
> needed, and aggressively simplify where possible.
Sometime in the not-too-distant future, I will start a new thread focusing
on bitmap heap scan, but for now, I just want to share some progress on
making the radix tree usable not only for that, but hopefully a wider range
of applications, while making the code simpler and the binary smaller. The
attached patches are incomplete (e.g. no iteration) and quite a bit messy,
so tar'd and gzip'd for the curious (should apply on top of v32 0001-03 +
0007-09).
0001
This combines a few concepts that I didn't bother separating out after the
fact:
- Split insert_impl.h into multiple functions for improved readability and
maintainability.
- Use single-value leaves as the basis for storing values, with the goal to
get to "combined pointer-value slots" for efficiency and flexibility.
- With the latter in mind, searching the child within a node now returns
the address of the slot. This allows the same interface whether the slot
contains a child pointer or a value.
- Starting with RT_SET, start turning some iterative algorithms into
recursive ones. This is a more natural way to traverse a tree structure,
and we already see an advantage: Previously when growing a node, we
searched within the parent to update its reference to the new node, because
we didn't know the slot we descended from. Now we can simply update a
single variable.
- Since we recursively pass the "shift" down the stack, it doesn't have to
be stored in any node -- only the "top-level" start shift is stored in the
tree control struct. This was easy to code since the node's shift value was
hardly ever accessed anyway! The node header shrinks from 5 bytes to 4.
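
As a toy illustration of the recursion shape (this is not the patch's
RT_SET: nodes here always have fanout 256 and values are stored directly
at the last level), note how each level receives the address of the
parent's slot, so replacing a grown node would be a single assignment,
and how the shift lives only on the stack:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct toy_node
{
	struct toy_node *children[256];	/* used when shift > 0 */
	uint64_t	values[256];		/* used when shift == 0 */
	bool		isset[256];
} toy_node;

/* returns true if the key was already present */
static bool
toy_set(toy_node **slot, int shift, uint64_t key, uint64_t value)
{
	toy_node   *node = *slot;
	uint8_t		chunk = (key >> shift) & 0xFF;

	if (node == NULL)
		*slot = node = calloc(1, sizeof(toy_node));

	if (shift == 0)
	{
		bool		found = node->isset[chunk];

		node->isset[chunk] = true;
		node->values[chunk] = value;
		return found;
	}

	/* hand the child slot's address down to the next level */
	return toy_set(&node->children[chunk], shift - 8, key, value);
}

int
main(void)
{
	toy_node   *root = NULL;
	int			start_shift = 16;	/* kept in the tree control struct, not in each node */

	toy_set(&root, start_shift, 0x1234AB, 42);
	printf("re-insert found existing key: %d\n",
		   toy_set(&root, start_shift, 0x1234AB, 42));
	return 0;
}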
0002
Back in v15, we tried keeping DSA/local pointers as members of a struct. I
did not like the result, but still thought it was a good idea. RT_DELETE is
a complex function and I didn't want to try rewriting it without a pointer
abstraction, so I've resurrected this idea, but in a simpler, less
intrusive way. A key difference from v15 is using a union type for the
non-shmem case.
0004
Rewrite RT_DELETE using recursion. I find this simpler than the previous
open-coded stack.
0005-06
Deletion has an inefficiency: One function searches for the child to see if
it's there, then another function searches for it again to delete it. Since
0001, a successful child search returns the address of the slot, so we can
save it. For the two smaller "linear search" node kinds we can then use a
single subtraction to compute the chunk/slot index for deletion. Also,
split RT_NODE_DELETE_INNER into separate functions, for a similar reason as
the insert case in 0001.
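
To spell that out with made-up names: once the child search has handed
back the slot address, the index needed for deletion in a linear-search
node is a single pointer subtraction, and the rest is two memmoves:

#include <stdint.h>
#include <string.h>

/*
 * Delete the child at 'found', where 'found' was returned by the earlier
 * search into this node's 'slots' array and 'chunks' is the parallel array
 * of key chunks.
 */
static inline void
linear_node_delete(uint8_t *chunks, void **slots, int count, void **found)
{
	int			idx = (int) (found - slots);	/* single subtraction */

	memmove(&chunks[idx], &chunks[idx + 1], (count - idx - 1) * sizeof(uint8_t));
	memmove(&slots[idx], &slots[idx + 1], (count - idx - 1) * sizeof(void *));
}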
0007
Anticipate node shrinking: If only one node-kind needs to be freed, we can
move a branch to that one code path, rather than every place where RT_FREE
is inlined.
0009
Teach node256 how to shrink *. Since we know the number of children in a
node256 can't possibly be zero, we can use uint8 to store the count and
interpret an overflow to zero as 256 for this node. The node header shrinks
from 4 bytes to 3.
* Other nodes will follow in due time, but only after I figure out how to
do it nicely (ideas welcome!) -- currently node32's two size classes work
fine for growing, but the code should be simplified before extending to
other cases.
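
A two-line sketch of the count trick above (names invented): since an
existing node256 can never have zero children, a stored uint8 count of
zero can only mean the counter wrapped around at 256:

#include <stdint.h>

/* read back the logical child count of a non-empty node256 */
static inline int
node256_get_count(uint8_t raw_count)
{
	return (raw_count == 0) ? 256 : raw_count;
}

/* incrementing 255 wraps to 0, which node256_get_count() reads as 256 */
static inline uint8_t
node256_inc_count(uint8_t raw_count)
{
	return (uint8_t) (raw_count + 1);
}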
0010
Limited support for "combined pointer-value slots". At compile-time, choose
either that or "single-value leaves" based on the size of the value type
template parameter. Values that are pointer-sized or less can fit in the
last-level child slots of nominal "inner nodes" without duplicated
leaf-node code. Node256 now must act like the previous 'node256 leaf',
since zero is a valid value. Aside from that, this was a small change.
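
A minimal standalone sketch of that compile-time switch (the macro and
typedef names here are invented, not the template's): sizeof() is a
constant expression, so the compiler simply folds away the branch that
does not apply, and no duplicated leaf-node code is needed when the
value fits in a slot:

#include <stdint.h>
#include <stdio.h>

typedef uint64_t RT_VALUE_TYPE;		/* example template parameter */

/* true when the value can live directly in a pointer-sized child slot */
#define RT_VALUE_EMBEDDABLE (sizeof(RT_VALUE_TYPE) <= sizeof(uintptr_t))

int
main(void)
{
	if (RT_VALUE_EMBEDDABLE)
		puts("values stored directly in last-level child slots");
	else
		puts("values stored in separately allocated single-value leaves");
	return 0;
}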
What I've shared here could work (in principle, since it uses uint64
values) for tidstore, possibly faster (untested) because of better code
density, but as mentioned I want to shoot for higher. For tidbitmap.c, I
want to extend this idea and branch at run-time on a per-value basis, so
that a page-table entry that fits in a pointer can go there, and if not,
it'll be a full leaf. (This technique enables more flexibility in
lossifying pages as well.) Run-time info will require e.g. an additional
bit per slot. Since the node header is now 3 bytes, we can spare one more
byte in the node3 case. In addition, we can and should also bump it back up
to node4, still keeping the metadata within 8 bytes (no struct padding).
I've started in this patchset to refer to the node kinds as "4/16/48/256",
regardless of their actual fanout. This is for readability (by matching the
language in the paper) and maintainability (should *not* ever change
again). The size classes (including multiple classes per kind) could be
determined by macros and #ifdef's. For example, in non-SIMD architectures,
it's likely slow to search an array of 32 key chunks, so in that case the
compiler should choose size classes similar to these four nominal kinds.
--
John Naylor
EDB: http://www.enterprisedb.com
Attachments:
v33-ART.tar.gz (application/gzip)